Space and energy efficient molecular programming and space efficient text indexing methods for sequence alignment

by

Christopher Joseph Thachuk

M.Sc., Simon Fraser University, 2007
B.C.S., The University of Windsor, 2005

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY in The Faculty of Graduate Studies (Computer Science)

THE UNIVERSITY OF BRITISH COLUMBIA (Vancouver)

April 2013

© Christopher Joseph Thachuk 2013

Abstract

Nucleic acids play vital roles in the cell by virtue of the information encoded into their nucleotide sequence and the folded structures they form. Because nucleic acids alter their shape over time under changing environmental conditions, an RNA molecule folds through a series of structures called a folding pathway. As folding is a thermodynamically driven probabilistic process, folding pathways tend to avoid high energy structures; pathways that do are said to have a low energy barrier. In the first part of this thesis, we study the problem of predicting low energy barrier folding pathways of a nucleic acid strand. We show that various restrictions of the problem are computationally intractable unless P = NP. We propose an exact algorithm that has exponential worst-case runtime, but uses only polynomial space and performs well in practice. Motivated by recent applications in molecular programming, we also consider a number of related problems that leverage folding pathways to perform computation. We show that verifying the correctness of these systems is PSPACE-hard and, in doing so, show that predicting low energy barrier folding pathways of multiple interacting strands is PSPACE-complete. We explore the computational limits of this class of molecular programs, which are capable, in principle, of logically reversible and thus energy efficient computation.
We demonstrate that a space and energy efficient molecular program of this class can be constructed to solve any problem in SPACE, the class of all space-bounded problems. We prove a number of limits on deterministic and on space efficient computation by molecular programs that leverage folding pathways, and show limits for more general classes. In the second part of this thesis, we continue the study of algorithms and data structures for predicting properties of nucleic acids, but with quite different motivations pertaining to sequence rather than structure. We design a number of compressed text indexes that improve pattern matching queries in light of common biological events such as single nucleotide polymorphisms in genomes and alternative splicing in transcriptomes. Our text indexes and associated algorithms have the potential for use in alignment of sequencing data to reference sequences.

Preface

The candidate contributed to all major ideas and writing of the published manuscripts and wrote all chapters of this thesis. We now detail the contributions of the candidate in published articles resulting from this work. Non-thesis related work published by the candidate during the course of their degree is not listed. Research from Part I of the thesis was conducted in collaboration with a number of co-authors, primarily the candidate's supervisor Dr. Anne Condon and also Dr. Jan Maňuch. In no instance was a co-author a graduate student. Other work presented in this part of the thesis that is not yet published was conducted under the supervision of Dr. Anne Condon and written by the candidate.

• The introductory chapter of Part I was written by the candidate, but uses selected content from publications that he co-authored [25, 26, 80, 81, 129, 131]. Details of the contribution for each of these publications are given below.
• A version of Chapter 2 has been published in the proceedings of the 15th Annual International Conference on DNA Computing and Molecular Programming (2009) [80] and also in the Journal of Natural Computing (2011) [81]. The candidate collaborated with co-authors in developing the reduction proof. The candidate contributed to writing the manuscripts, particularly the journal version [81].

• A version of Chapter 3 has been published in the proceedings of the Pacific Symposium on Biocomputing (2010) [129]. The candidate was the primary researcher, designed and implemented the main algorithm of the paper, and performed all experiments. He also contributed to writing the manuscript. The presentation of the algorithm and correctness proofs have been rewritten by the candidate in the thesis to facilitate additional results.

• A version of Chapter 4 appeared in the proceedings of the 17th Annual International Conference on DNA Computing and Molecular Programming (2011) [25] and also in the Journal of the Royal Society: Interface Focus (2012) [26]. The candidate contributed to all aspects of the research and was one of the main writing authors, contributing various sections of the manuscript. Some alternate proofs and additional results have been given by the candidate in the thesis version to provide deeper insight into the topics considered.

• A version of Chapter 5 appears in the proceedings of the 18th Annual International Conference on DNA Computing and Molecular Programming (2012) [131] and was awarded best student paper. The candidate contributed to all aspects of the research and in particular developed the SAT verification procedure, integrated it with the tree traversal procedure proposed by the co-author, proved the correctness of the result, extended the result into a DSD implementation, and formally proved the computational hardness of a number of related problems. The manuscript was written by the candidate.
The results of this chapter have been significantly enhanced and supplemented compared with the published version.

All research from Part II of the thesis was conducted independently by the candidate and all chapters and published manuscripts were written by the candidate.

• Versions of Chapter 8 and Chapter 9 have been published in the proceedings of the 22nd Annual International Symposium on Combinatorial Pattern Matching (2011) [127] and in the journal Theoretical Computer Science [128]. The candidate was the sole author and the paper won best student paper.

• A version of Chapter 10 was published in the proceedings of the 18th Annual International Symposium on String Processing and Information Retrieval (2011) [130]. The candidate was the sole author.

• Chapter 7, the introductory chapter of the part, uses content from the previously mentioned papers published by the candidate [127, 130].

Table of contents

Abstract
Preface
Table of contents
List of tables
List of figures
Glossary
Notes on reading the text
Acknowledgements
Dedication

I Space and energy efficient molecular programming: energy barriers, chemical reaction networks, and DNA strand displacement systems

1 Introduction
  1.1 Nucleic acid folding pathways
    1.1.1 The simple energy model
  1.2 Molecular programming
    1.2.1 DNA strand displacement systems (DSD)
      Toehold mediated strand displacement
      Folding pathway of a strand displacement
      Illegal strand displacement
    1.2.2 Chemical reaction networks (CRN)
      Chemical reactions and signal molecules
      Chemical reaction rates
      CRNs as a means for computation
    1.2.3 Tagged chemical reaction networks (tagged CRN)
      Tags and tagged chemical reaction equations
      Space complexity of a tagged CRN
    1.2.4 Proper chemical reaction networks (proper CRN)
    1.2.5 Realizing CRNs with DSDs
    1.2.6 Energy efficient computation
  1.3 Objectives
  1.4 Contributions
  1.5 Outline

2 Complexity of predicting low energy barrier folding pathways
  2.1 Preliminaries
  2.2 Result
  2.3 Chapter summary

3 Predicting minimum energy barrier folding pathways
  3.1 Preliminaries
  3.2 An algorithm for the set barrier problem
    3.2.1 Splitting strategy
    3.2.2 Cutting strategy
    3.2.3 The overall algorithm
    3.2.4 Algorithm correctness and complexity
      Comments on practical and theoretical runtime efficiency
    3.2.5 Finding minimum barriers for non-pairwise optimal instances
      Construction of PWO(G)
  3.3 Empirical results
    3.3.1 Implementation and experimental environment
    3.3.2 Generation of problem instances
    3.3.3 Algorithm runtime performance
  3.4 Solving the direct with repeats barrier problem
  3.5 Chapter summary

4 On recycling and its limits in molecular programs
  4.1 Introduction
    4.1.1 On the need for strand recycling
    4.1.2 On the potential for strand recycling
    4.1.3 On the limits of strand recycling
    4.1.4 Related work
  4.2 GRAY: a binary reflecting Gray code counter
    4.2.1 Chemical reaction network for the GRAY counter
    4.2.2 DSD implementation of the GRAY counter
    4.2.3 Space and expected time of the GRAY counter
    4.2.4 A fixed order implementation of the GRAY counter
    4.2.5 Comparison with another molecular counter
  4.3 Limits on molecule recycling in chemical reaction networks
  4.4 Chapter summary

5 Space and energy efficient molecular programming
  5.1 Related work
  5.2 Preliminaries
  5.3 Space efficient CRN simulation of PSPACE
    5.3.1 Verifying a 3sat instance variable assignment
      Verifying an arbitrary clause
      Verifying the overall formula
    5.3.2 A space efficient post-order tree traversal
    5.3.3 Solving a q3sat instance
      Integrating formula verification and tree traversal
      Integrating quantifiers into the tree traversal
      Ending the computation
  5.4 Space efficient CRN simulation of SPACE
  5.5 Space and energy efficient DSD simulation of SPACE
  5.6 Complexity of verifying CRNs and DSDs
  5.7 A reduction from q3sat to eb-ipfp-multi
    5.7.1 The reduction
  5.8 Chapter summary

6 Conclusion
  6.1 Predicting folding pathways
  6.2 Designing folding pathways

II Space efficient text indexes motivated by biological sequence alignment problems

7 Introduction
  7.1 Text indexing
  7.2 Biological sequence alignment
  7.3 Objectives
  7.4 Contributions
  7.5 Outline

8 A compressed full-text dictionary
  8.1 Introduction
    8.1.1 Related work
  8.2 Preliminaries
  8.3 Overview of the full-text dictionary
    8.3.1 The lex id of text segments
  8.4 Components of the full-text dictionary
    8.4.1 CSA: compressed enhanced suffix array
    8.4.2 The sa id identifier and RSA: conceptual tools
    8.4.3 L: text segment lengths
    8.4.4 LEX, MB, ME, E: text segment SA range representation
    8.4.5 BP: containment of text segment SA ranges
    8.4.6 CNT: count of text segment prefixes
    8.4.7 Summary of full-text dictionary components
  8.5 Using the full-text dictionary
    8.5.1 Pre-processing the pattern
    8.5.2 Finding parent ranges and longest matches
    8.5.3 dict prefix: report text segments that prefix P
    8.5.4 dict match: report text segments contained in P
    8.5.5 dict count: counting text segments contained in P
    8.5.6 prefix: report range of lex ids that prefix P
    8.5.7 locate: report positions in T containing P
    8.5.8 match stats: finding the matching statistics of P
  8.6 Constructing the full-text dictionary

9 Indexing text with wildcards
  9.1 Introduction
  9.2 Preliminaries
  9.3 Overview of indexing text containing wildcards
  9.4 Components of the text with wildcards index
    9.4.1 F, R: indexing the text
    9.4.2 lex ids, rlex ids, and Π: text segment identifiers
    9.4.3 RSA, RSA: storing SA ranges
    9.4.4 LEN, POS, WCS: auxiliary arrays
    9.4.5 RQ: supporting range queries
    9.4.6 Summary of the components
  9.5 Matching in text with wildcards
    9.5.1 Pre-processing the pattern
    9.5.2 type1 match: finding all type 1 matches of P
    9.5.3 type2 match: finding all type 2 matches of P
    9.5.4 type3 match: finding all type 3 matches of P
  9.6 Less haste, less waste: reducing the space further

10 Indexing hypertext
  10.1 Introduction
  10.2 Preliminaries
    10.2.1 Succinct graph representation
    10.2.2 Hypertext
  10.3 Construction of the hypertext index
    10.3.1 Indexing node text
    10.3.2 Storing graph topology
    10.3.3 Auxiliary data structures
  10.4 Pattern matching in the hypertext index
    10.4.1 Preprocessing the pattern
    10.4.2 Matching within a node
    10.4.3 Matching across a single edge
    10.4.4 Matching across multiple edges
      Overview of the algorithm
      Verifying the suffix condition
      Verifying the prefix condition
      Reporting all matching paths
  10.5 Reducing the index space
  10.6 Considering restricted hypertext
    10.6.1 Path constraints
    10.6.2 Topology constraints
    10.6.3 Text constraints

11 Conclusion

Bibliography

List of tables

1.1 The complexity of folding pathway energy barrier problems for the simple energy model.

4.1 Comparison of n-bit counter implementations. The GRAY and GRAY-FO counters described in this section are compared with the QSW counter, which is based on the simulation of stack machines by strand displacement reactions of Qian et al. [98].

8.1 Inventory of space usage for data structures comprising a full-text dictionary for a string T of length n containing d text segments.

9.1 A comparison of text indexes supporting wildcard characters in a text T over an alphabet of size σ containing d distinct groups of wildcards. |CSA| is the size of a subsidiary compressed suffix array implementation supporting rank queries in O(t_LF) time. d̂ is the number of distinct wildcard group lengths; occ_1, occ_2 and occ are the numbers of occurrences containing no wildcard group, one wildcard group, and overall, respectively; γ = Σ_{i,j} prefix(P[i..|P|], T_j); † = our result; ‡ = our result combined with Hon et al. [49].
9.2 Inventory of space usage for data structures comprising an index for a text T of length n containing d groups of wildcards, where d̂ denotes the number of unique lengths of wildcard groups separating text segments.

10.1 Inventory of space usage for a succinct index of a general hypertext. Sections 10.5 and 10.6 explore the removal of various components of the overall index.

List of figures

1.1 (a) An initial secondary structure (left) and a final secondary structure (right) for a given RNA strand. (b) A corresponding arc diagram. The arcs on top of the nucleotide sequence denote the base pairs in the initial secondary structure while the arcs below the nucleotide sequence denote the base pairs of the final secondary structure.

1.2 (a) A possible folding pathway is shown for an initial structure A transitioning through intermediate structures (B, C, . . .) until the final structure I is reached. For a particular position in the pathway, the top of the arc diagram denotes the current base pairs of the structure and the bottom of the arc diagram denotes the base pairs still to be added to reach the final structure. Each structure along the pathway differs from its neighbours by one arc. (b) The corresponding energy plot. The barrier in this example is two.

1.3 A DNA strand displacement system consisting of two signal strands, A and B, and one double stranded complex consisting of two bound strands C and D and a template strand E. Long domains of the template strand are shown in red while long domains of signal strands and bound strands are shown in gray. Universal toehold domains are shown in black.

1.4 Strand displacement. (a) Toehold (black subsequence) of signal strand A binds with its unpaired complement on the template strand B.
(b) The single long domain (gray subsequence) of A competes via a random walk process with the single long domain of strand C to bind with the complementary long domain of B until all bases of A are bound to B. (c) Toehold of C detaches from B, at which point it has been displaced and becomes a signal. The process is reversible and signal strand C could next displace the bound strand A.

1.5 A corresponding folding pathway is shown for the displacement example of Figure 1.4 using a simple sequence design where toehold domains have one base and long domains have two bases. The displacement of strand C by strand A is shown in seven steps, from (a) to (g). Base pairs are shown as edges between strands. The energy changes between each structure, assuming K_assoc = 2, are shown in the bottom right. Since toehold domains have one base, the energy barrier of the underlying folding pathway, relative to (a), is K_assoc. If the toehold domains had length L_T > 1 then the energy barrier, relative to (a), would be K_assoc − 1.

1.6 (a) Chemical reaction equations for a 3-bit standard binary counter. (b) The configuration graph of the computation performed by the 3-bit standard binary counter forms a chain and is logically reversible. The nodes represent the state of the computation and the edges are directed between states reachable by a single reaction.

1.7 Representing the state of the CRN for a 3-bit standard binary counter can be achieved by the presence and absence of certain signal strands for each bit position. Long domains for bits representing a 1 value are coloured in red while those representing a 0 value are coloured in grey. Universal toehold domains are coloured black.

1.8 A strand displacement implementation of the reaction 0_1 ⇌ 1_1 as proposed by Qian et al. [98].
From top to bottom, the input signal strand 0_1 (shown in a shaded box on the left) is consumed by the transformer (middle) which produces the signal strand 1_1 (shown in a shaded box on the right). Additional unbound strands are used in the process and are considered part of the transformer. The transformer can be applied next in the opposite direction (from bottom to top) to consume signal 1_1 and produce signal 0_1. In this and later figures, the Watson-Crick complement of a domain x is denoted by x*.

1.9 Tagged chemical reaction equations for a 3-bit standard binary counter.

1.10 A strand displacement implementation of the bi-molecular chemical reaction equation A + B ⇌ C + D using the construction proposed by Qian et al. [98].

1.11 Example configuration graphs, induced on four different inputs, for (a) deterministic computation, and (b) logically reversible computation. Nodes represent possible states in a computation and directed edges denote valid state transitions.

2.1 The three arcs on the bottom all conflict with the same two arcs on the top, and vice versa. Thus, each forms a band of arcs. Each band is collapsed into a single arc with weight equal to the size of the band.

2.2 Organization of weighted arcs in the initial (top) and the final (bottom) configurations.

2.3 Illustration of the construction in the proof of Theorem 3: (a) The instance created for the set of integers {{10, 9, 8, 7, 7, 7}}. (b) The energy function stays within barrier k if and only if the partition sets are selected correctly (T1 = {{10, 7, 7}} and T2 = {{9, 8, 7}}).
(c) The energy function exceeds the barrier for an incorrect selection of partition sets (T1 = {{10, 9, 8, 7, 7}} and T2 = {{7}}). The dashed lines depict hypothetical progress of the pathway for some energy barrier larger than k.

2.4 Illustration of the sequence of energy difference changes on the folding pathway described in lines (2.1), (2.2) and (2.3). Details are discussed in the text of the chapter.

3.1 (left column) An example of an arc diagram representation of an initial and final structure of an RNA folding pathway, and (right column) the corresponding conflict graph. In the conflict graph, there is a node for every arc, and an edge between any pair of arcs that cross.

3.2 An example of a 2-barrier direct RNA folding pathway from an initial to final structure (left column), a corresponding set pathway (right column), and a graph showing the current folding pathway energy and the current barrier set size (center column). The set pathway instance (right) is specified by the conflict graph of the RNA folding pathway instance (left). The current set in the set pathway is denoted by black vertices while the current secondary structure in the folding pathway is indicated by the set of arcs on top.

3.3 An example of the BasicSplit algorithm. For a pairwise-optimal bipartite graph G, a perfect matching is identified (top left), a directed precedence graph D is constructed (top right), strongly connected components in D are identified (bottom left), and one that is a sink in the condensation of D is returned (bottom right).

3.4 Creating a pairwise-optimal instance (bottom) from a non-pairwise-optimal instance (top).

3.5 Distribution of conflicting base pairs for generated problem instances.
3.6 The required time to find an optimal barrier pathway is shown for two time scales.

3.7 Frequency of maximum (left) and average (right) subproblem sizes, measured as the number of base pairs in the subproblem produced by the first call to the BasicSplit algorithm for a given instance. The maximum and average are taken over all subproblems generated for a given instance.

3.8 (left) An arc diagram representation for the RNA strand UCUGAGCUAGUG. Arcs (base pairs) in the initial structure are shown in red, those in the final structure are shown in blue and potential temporary arcs are shown in green. Also shown are the corresponding conflict graphs for the indirect folding pathway problem (center) and the direct folding pathway problem (right).

4.1 To reach the end state, the standard binary counter must perform a sequence of reactions that always occur in the forward direction, thus requiring a new transformer for every reaction as they are not recycled.

4.2 The 3-bit binary reflecting Gray code. The code for n digits can be formed by reflecting the code for n − 1 digits across a line, then prefixing each value above the line with 0 and those below the line with 1.

4.3 (a) Tagged chemical reaction equations for a 3-bit binary reflecting Gray code counter. (b) The configuration graph of the computation performed by the 3-bit binary reflecting Gray code counter forms a chain and is logically reversible. The nodes represent the state of the computation and the edges are directed between states reachable by a single reaction.
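The caption of Figure 4.2 describes the reflect-and-prefix construction of the binary reflecting Gray code. As an illustrative aside (not part of the thesis), that rule can be sketched in a few lines of Python:

```python
def gray_code(n):
    """Binary reflected Gray code on n bits, built by the rule in the
    figure caption: take the (n-1)-bit code, append its reflection,
    then prefix the first half with 0 and the reflected half with 1."""
    if n == 0:
        return [""]
    prev = gray_code(n - 1)
    return ["0" + c for c in prev] + ["1" + c for c in reversed(prev)]

# Consecutive codewords differ in exactly one bit position, which is
# the property the GRAY counter exploits to flip one signal per step.
print(gray_code(3))
# ['000', '001', '011', '010', '110', '111', '101', '100']
```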
4.4 To reach the end state, the binary reflecting Gray code counter must perform a sequence of reactions that always alternate in the forward and reverse direction, thus requiring only one transformer for every reaction since they are actively recycled.

4.5 An example of signal molecules (top two left strands) and the transformer, consisting of auxiliary strands (top two right strands) and a saturated template strand (bottom complex), associated with the forward direction of reaction equation 0_1 ⇌ 1_1, which requires a mutex. In this and later figures, the Watson-Crick complement of a domain x is denoted by x*.

4.6 The sequence of strand displacement events for the reaction equation 0_1 ⇌ 1_1 when a mutex signal µ is required. The mutex is the first signal to be consumed and the last to be produced, in either reaction direction. Otherwise, the reaction cascade proceeds exactly as before as dictated by the QSW construction.

4.7 An example of the signal molecules and the transformer molecules for the ith reaction. The counter is in state b_n . . . b_{i+1} 0_i 1_{i−1} 0_{i−2} . . . 0_1.

5.1 Solving a q3sat instance. Edge labeled paths from root to leaf denote variable assignments. Nodes are satisfied based on quantifier and satisfiability of left and right children.

5.2 (left) Eight chemical reaction equations to verify an arbitrary 3sat clause C_i for each combination of variable assignments. The product of the reaction is C_i^T for assignments that satisfy the ith clause, and C_i^F otherwise. (right) Reaction equations to verify the overall 3sat formula φ, consisting of m clauses.

5.3 Flow control when verifying a formula φ having m clauses.
5.4 A logically reversible post-order traversal of all descendants of the root of a height h perfect binary tree can be achieved using three reactions: (6) mark left, (7) move right, and (8) mark right. Below each reaction is an illustration of the action it performs on the tree.

5.5 Integrating the 3sat verification procedure into the leaf level reactions of the tree traversal procedure. Two reaction variants are created for marking leaf nodes as either satisfied or unsatisfied based on the result of the verification procedure. One reaction variant can proceed if the signal φ^F is available and the other variant requires φ^T. As these are the only two reaction variants, the formula for the current variable assignment must be verified before the leaf node can be marked. The move right reaction requires φ^? as a catalyst, thus ensuring the verification procedure is reversed prior to the next verification step. Existing catalysts listed in Figure 5.4 remain and are omitted above for space.

5.6 Integrating quantifiers into non-leaf levels of the tree traversal. For both universal and existential levels, four variants of the left node reactions are created to process the four combinations of left and right children satisfiability. The integration is identical for right node reactions. Existing catalysts remain the same as listed before and are omitted for space.

5.7 After both children of the root have been solved, a solution can be determined based on the quantifier of the root level. Equations are shown assuming the root variable x_n is universally quantified.

5.8 The logically reversible computation chain of the q3sat CRN. In more than half of the states, the output signal is present (shown shaded).

5.9 Extending the logically reversible computation chain of the q3sat CRN.
Extending the chain is achieved by adding an additional reaction that produces a new signal and requires the final signal multiset of the original computation chain as catalysts. States where the output signal is present are shown shaded. . . . . . . . 5.10 A strand displacement implementation of the bi-molecular chemical reaction equation A + B C + D using a modified construction from that proposed by Qian et al. [98]. In this construction, four-way branch migration is used to displace strands, in contrast to three-way branch migration from the original construction. . . 88 89 91 94 95 95 98 99 102 xv List of figures 5.11 A folding pathway is shown for a strand displacement using fourway branch migration. A simple sequence design is assumed where toehold domains have one base and long domains have two bases. The displacement of strand B by strand A is shown in seven steps, from (a) to (g). Initially, the long domain of A is bound to strand C. During the displacement, C will form base pairs with B while A forms base pairs with T . In the figure, base pairs are shown as edges between strands. The energy changes between each structure, assuming Kassoc = 2, are shown in the bottom right. The energy barrier of the underlying folding pathway, relative to (a), is Kassoc + 1. Note that for toehold length LT > 2, where Kassoc > LT , the energy barrier would be Kassoc − 1.103 7.1 An example of short reads aligned to a reference genome G. Alignments may contain matches, mismatches, insertions and deletions. For instance, the alignment of the single read to the reference (red outline) contains a match in the first position of the alignment, a mismatch in the second, an insertion in the third position and a deletion in the twelfth position. Sequencing the genomes of individuals helps determine genetic mutations, such as single nucleotide polymorphisms/variations (SNPs/SNVs) of individuals compared to a reference genome. . . . . . . . . . . . . 
120 8.1 The Burrows-Wheeler transform of a string T = mississippi$ is T BWT = ipssm$pissii. . . . . . . . . . . . . . . . . . . . . . . . 126 Performing backward search to find the SA range of the string ‘is’ from the SA range of the string ‘s’, using T BWT , the BurrowsWheeler transform of text T . (a) The current match and SA range for ‘s’. (b) All occurrences of character i in T BWT within the current SA range are identified. (c) The LF -mapping is used to update the SA range to the new match ‘is’. . . . . . . . . . . 127 8.2 xvi List of figures 8.3 A compressed full-text dictionary for the ordered list of text segments (aa, aca, a, aa, cacc, ac). The first three columns give a conceptual representation of the full-text dictionary. The second column shows the sorted suffixes of the serialized string T = φaaφacaφaφaaφcaccφac$ representing the text segments. The third column contains the array i indicating the sorted lexicographic rank of each suffix of T . The first column shows the SA ranges of the text segments and their containment relationship. Each text segment SA range is labeled by (lex id, segment) pairs. Shown in the last three columns are actual data structures used in the full-text dictionary representation: the ME array which marks the end of one or more text segment SA ranges, the MB array which marks the beginning of each text segment SA range, and the BP array that represents the containment of text segment SA ranges (their tree topology). Three different queries (shaded intervals) are shown with their corresponding smallest enclosing text segment SA range (if any) marked in the BP array. 129 9.1 The three cases to consider when matching a pattern to a text with wildcards. Here and throughout this chapter, we will illustrate the wildcard character as ‘*’. . . . . . . . . . . . . . . . . . 143 Shown is a compressed suffix array for a text T = φaa φaca φa φaa φcacc φac and a compressed suffix array for the reverse of T . 
The shaded intervals denote the SA range of a query aφ in the forward index and corresponding SA range of φa in the reverse index. Using backward search the SA range in the forward index can be updated for the pattern aaφ, and by leveraging information in T BWT the corresponding SA range for φaa can be updated in the reverse index. Both new SA ranges are shown demarcated with arrows. See the text for details. . . . . . . . . . 152 9.2 10.1 A example of a hypertext. A query matches within a hypertext if and only if it can be aligned as a path through the graph. A path shown in bold matches the query pattern pizzafrompisaisbountiful. 158 10.2 A simple genome, G, is shown having five exons contained in two genes. Exons are strings over the four letter alphabet of DNA. Below is the corresponding transcriptome, T , which consists of five transcripts. Transcripts are formed by the concatenation of certain exons from G. Above is the splicing graph, S, where each of the five nodes correspond to one of the five exons from G, and each directed edge denotes splicing events (concatenation of exons) that are found in T . A hypertext model H for the transcriptome is also shown. . . . . . . . . . . . . . . . . . . . . . 159 xvii List of figures 10.3 (left) An example of the underlying suffix array and BWT string for the forward index F of the text T = φacaφgφgaφcgφct$, representing the serialization of possible text in exons e1 , . . . , e5 , supposing those five exons consist of the five sequences {aca, g, ga, cg, ct} respectively, from Figure 10.2. (right) The underlying suffix array and BWT string for the reverse index R of the text T = φaca φgφagφgcφtc$. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160 10.4 The three cases to consider when matching a pattern to a hypertext.164 10.5 An example of the pattern aaca matching across a single edge in a hypertext. 
The pattern suffix ca prefixes nodes with lexicographic rank (lex id) in the range [6, 6] while the pattern prefix aa suffixes nodes with reverse lexicographic rank (rlex id) in the range [2, 3]. Points in the query rectangle [6, 6] × [2, 3] are type 2 matches. A point (a, b) appears in the grid if and only if a node with lexicographic rank a has an incoming edge from a node with reverse lexicographic rank b. . . . . . . . . . . . . . . . . . . . . . 166 10.6 The two cases to consider when verifying the suffix condition in type 3 matches. (a) The suffix match can form a sub-path initiation event with node 4. (b) The suffix match can form a sub-path extension event with node 2. . . . . . . . . . . . . . . . 168 xviii Glossary bandwidth Given a CRN C = S, R, S0 , send , Bs , the bandwidth of signal species s ∈ S is the maximum number of copies of s that appears in a multiset I of any reaction (I, P ) ∈ R. The bandwidth of C is the sum of bandwidths for all signal species in S. xviii, 79 blunt-end displacement Any attempted toehold mediated strand displacement where the invading strand does not first bind its toehold domain to the template strand. xviii, 9 bound strand A strand where one or more of its bases are paired to other bases on a template strand. xviii, xx, xxi, xxiii, 8, 18, 20, see also unbound strand BWT Burrows-Wheeler transform. xviii catalyst A signal molecule that is required to be present for the application of a corresponding chemical reaction equation. It is not consumed nor produced when acting as a catalyst. In an equivalent interpretation, it is both consumed and produced by the application of the corresponding chemical reaction equation. xviii, xix, xxi, 11, 13, 67, 70 chemical reaction equation Either a reversible chemical reaction equation or an irreversible chemical reaction equation. xviii–xxv, 11, 13, 14, 17, 18, 20, 65, 70 Chemical Reaction Network (CRN) Consists of an initial signal multiset and a set of chemical reaction equations. 
Formally, we define a CRN to be a tuple C = (S, R, S0, send), where:
• S is the set of all signal types (or species) of signal molecules used in any reaction.
• R is a set of chemical reaction equations, where each R ∈ R is an ordered pair of multisets of signal molecules. Intuitively, a reaction equation R = (I, P) consumes the signal molecules in I as the input reactants and produces the signal molecules in P as products. Note that it is only the signal molecules in I − P that are actually consumed; the others act as catalysts for the reaction. Our formalism is directional to allow modeling non-reversible reactions; a reversible chemical reaction is modeled as two separate elements of R, i.e., (I, P) and (P, I).
• S0 is the initial signal multiset, where s ∈ S0 implies s ∈ S.
• send ∈ S is a signal molecule denoting the end of computation.
xviii–xx, xxii, xxiv, 11, 13, 14, 16, 18, 20, 64, 65, 69, 70, 73, 79, 85, 90
closed system  A system, such as a CRN or DSD, is closed if no outside interference can occur, such as the removal or addition of signals. xviii, xxv, 11, 13, 17, 22, 64, 66, 68, 69
configuration graph  A configuration graph for a computation has a node for every possible state on every possible input for the underlying Turing machine being modeled. There is a directed edge from node i to node j if and only if state j is reachable from state i in a single state-transition of the Turing machine. xviii, xxi, 20
consumed  A signal molecule is said to be consumed when it is removed from the current signal multiset due to the application of a chemical reaction equation. xviii, xix, xxi–xxiv, 11, 14, 18, 20, 69, 71, 72
crosstalk  In molecular programs, crosstalk occurs when signals from different copies of the system present in the same reaction volume interfere with the intended sequence of reactions that would occur if only a single copy were present. xviii, 79
current signal multiset  The multiset of all signal molecules currently present in the reaction volume. xviii, xx, xxii, xxiii, 11, 13
DNA Strand Displacement system (DSD)  Consists of one or more signal strands and double stranded complexes. Signal strands can be consumed and produced by means of toehold mediated strand displacement. xviii, xx, xxii–xxiv, 8, 11, 14, 20, 64–66, 70–73, 85
double stranded complex  One template strand base paired with one or more bound strands. xviii, xx, xxiv, 8, 14, see also template strand & bound strand
energy efficient computation  In a logically reversible computation there is no inherent lower bound on the required amount of energy lost to complete the computation. We call such a computation energy efficient. xviii, 20, see also logically reversible computation
evading strand  The bound strand of a toehold mediated strand displacement. xviii, xxi, xxiv, 9
fuel  See transformer. xviii, xx, 66, 68, 69
fuel-depletion  Denotes a scenario where an insufficient amount of fuel or copies of transformers are available to complete a computation. xviii, 66, 68
illegal displacement  Any toehold mediated strand displacement that is not a legal displacement. xviii, 9, see also mismatch displacement, blunt-end displacement & spontaneous displacement
initial signal multiset  The multiset of all signal molecules present in the reaction volume prior to the application of any chemical reaction equations. xviii–xx, xxiii, 11, 13, 14, 16–18, 65, 68–70, 73, 90, 92
initial tag multiset  The multiset of all tags present in the reaction volume prior to the application of any chemical reaction equations. xviii, xxiii, xxiv, 16–18, 66, 68, 82, 90
invading strand  The signal strand that binds to a template strand in order to displace a currently bound strand during a toehold mediated strand displacement. xviii, xix, xxi, xxiv, 9
irreversible chemical reaction equation  Specifies an event that can consume a multiset of reactants and produce a multiset of products. Some of the specified reactants may be catalysts and are therefore also produced as products. xviii, xix, xxii, xxiii, 11
legal displacement  Any toehold mediated strand displacement where the invading strand first binds its toehold domain to the template strand and the adjacent long domain of the invading strand involved in three-way branch migration is identical to the respective long domain of the evading strand that is currently bound to the template. xviii, xxi, 9, 20, 72
logically reversible computation  A logically reversible computation is a form of deterministic computation where the configuration graph induced on any particular input forms a chain and each node i along the chain has a directed edge to node j if and only if there is a directed edge from node j to node i. Therefore any state along the chain is reachable (and recoverable) from any other, and previous state information is never lost. xviii, xx, 20, 65, 67–69, 72, 90, see also energy efficient computation & configuration graph
long domain  A longer strand domain that binds irreversibly to complementary regions on template strands and can only be unbound from a template strand by toehold mediated strand displacement. xviii, xxi, xxiii, xxiv, 8, 9, 18, 20
mismatch displacement  Any attempted toehold mediated strand displacement where the long domains of the one or more invading strands are not identical to the long domain of the evading strand. xviii, 9
mutex strand  A special single copy signal strand that is required to perform any toehold mediated strand displacements. xviii, xxiv, 72, see also transaction
produced  A signal molecule is said to be produced when it is added to the current signal multiset due to the application of a chemical reaction equation. xviii, xix, xxi–xxiv, 11, 14, 18, 20, 72
product  A signal molecule that is produced due to the application of a chemical reaction equation. xviii, xix, xxi, xxii, 11, 13, 18, 20, 71
proper chemical reaction equation  Either a reversible chemical reaction equation or an irreversible chemical reaction equation where the number of proper-products equals the number of proper-reactants. xviii, 18, see also proper-reactant & proper-product
proper Chemical Reaction Network (proper CRN)  A CRN or tagged CRN is proper if each of its chemical reaction equations consumes the same number of reactants as the number of products it produces. xviii, 18, 68, 90, see also CRN & tagged CRN
proper-product  A signal molecule that is produced due to the application of a chemical reaction equation and is not consumed by the same reaction (i.e., it is not a catalyst). xviii, xxii, 18, see also proper chemical reaction equation & proper-reactant
proper-reactant  A signal molecule that is consumed due to the application of a chemical reaction equation and is not produced by the same reaction (i.e., it is not a catalyst). xviii, xxii, 18, see also proper chemical reaction equation & proper-product
QSW construction  A construction proposed by Qian et al. [98] to realize any tagged CRN by a DSD. xviii, 18, 20, 73
reactant  A signal molecule that is consumed due to the application of a chemical reaction equation. xviii, xix, xxi, xxii, 11, 13, 18, 20, 71
reaction rate  The relative speed that a chemical reaction equation can be applied within a given reaction volume. xviii, 12
reaction volume  The container and medium where chemical reaction equations can be applied using the signal molecules currently present in the container. xviii, xx–xxiii, xxv, 11, 13, 17, 22, 64, 66, 68, 69, 79
required space of a tagged CRN  The minimum size of the reaction volume for a tagged CRN to complete its intended sequence of reactions (computation). xviii, 17, 66, 90, see also space complexity of a tagged CRN computation
reversible chemical reaction equation  A convenience of notation that denotes a reaction that can be applied in either direction (i.e., the products and reactants can switch roles). Formally, this is equivalent to having two irreversible chemical reaction equations, one for each direction. xviii, xix, xxii, xxiii, 11
saturated template strand  A template strand where all long domains and all but one toehold domain are bound to other strands. xviii, 18
signal molecule  An elementary (chemical) molecule that can be consumed and produced by the application of chemical reaction equations. xviii–xxiii, 11, 13, 18, 69–71
signal species  A generic term to refer to a type of signal molecule or signal strand rather than a specific instance of that type. xviii, see also signal molecule & signal strand
signal strand  A strand that is not bound to (has no paired bases with) any other strand. xviii, xx, xxi, xxiii, xxiv, 8, 9, 14, 18, 20
space complexity of a tagged CRN  See space complexity of a tagged CRN computation. xviii, 18, 68
space complexity of a tagged CRN computation  Given a trace ρ for a tagged CRN C = (S, R, S0, send, T), let S∗ be the largest signal multiset of the sequence of multisets induced by ρ. The space complexity is defined to be |S∗| + |T|. xviii, xxiii, 17
spontaneous displacement  Any event where a bound strand spontaneously breaks its base pairs with the corresponding template strand to become a signal strand. xviii, 9
state of a CRN  Defined by the current signal multiset of the CRN. xviii, 13, 70
strand domain  A subsequence of a strand used in displacement reactions. xviii, xxi, xxiv, 8
tag  A special signal assigned to chemical reaction equations that are implemented in a DSD to denote the required state of a transformer. In a reversible chemical reaction equation, the required tag for the reaction in the reverse direction is produced by the forward reaction, and vice versa. xviii, xxi, xxiii, xxiv, 16, 17, 82, 85
tagged chemical reaction equation  Either a reversible chemical reaction equation or an irreversible chemical reaction equation which additionally requires that a tag specific to that reaction is present in the reaction volume before it can be applied. In a reversible chemical reaction equation, the required tag for the reaction in the reverse direction is produced by the forward reaction, and vice versa. xviii, xxiii, 16, 17, 65, 67
tagged Chemical Reaction Network (tagged CRN)  Consists of an initial signal multiset, an initial tag multiset, and a set of tagged chemical reaction equations. Formally, we define a tagged CRN to be a tuple C = (S, R, S0, send, T, T0), where all members are defined the same as a CRN and additionally T is the set of all tag species, and T0 is the initial tag multiset, containing one or more tags for each reaction R ∈ R. xviii, xxii, xxiii, 16, 17, 20, 24, 65–67, 69–73, 85, 90, see also CRN, tag & initial tag multiset
template strand  A long strand that can bind one or more signal strands. xviii–xxi, xxiii, xxiv, 8, 9, 18
three-way branch migration  See toehold mediated strand displacement. xviii, xxi, xxiv, 9
toehold domain  A short strand domain that binds reversibly to complementary regions on template strands. xviii, xix, xxi, xxiii, xxiv, 8, 18
toehold mediated strand displacement  The toehold domain of an invading strand A binds (forms base pairs) to the complementary toehold of the template strand B. Then, in a random walk process (often referred to as three-way branch migration), the bases of the long domain of A compete with those belonging to the identical long domain of the evading strand C to form base pairs with the complementary long domain of the template strand B that must be adjacent to the toehold domain. Once the long domain of A has bound to its complement on B, C remains bound to B by just its short toehold domain. The toehold bonds can break, thereby releasing signal C. (Of course, A may detach from the template before C is released, in which case the displacement does not happen.) xviii–xxi, xxiv, 9, 18, 20, 71, 72
trace  Given a CRN C = (S, R, S0, send), a trace for C is a sequence of reactions ρ = R1, R2, . . . , Rm from R, where each Ri = (Ii, Pi), such that ρ induces a corresponding sequence of multisets S0, S1, . . . , Sm, with S0 being the multiset of initial signal molecules in C, and for all 1 ≤ i ≤ m, we have both Ii ⊆ Si−1 and Si = Si−1 − Ii + Pi. xviii, xxiii, 13, 17, 82
transaction  In the context of DSD, a transaction is a sequence of toehold mediated strand displacements, the first of which consumes a mutex strand and the last of which produces a mutex strand. xviii, 71, 72
transformer  A collection of one or more strands, some forming a double stranded complex, which are used to implement a chemical reaction equation by consuming a set of signal strands and producing an alternate set of signal strands. xviii, xx, xxiii, xxv, 14, 16–18, 20, 65, 69, 70
unbound strand  A strand where none of its bases are paired to other bases on a template strand. xviii, 8, 18, see also bound strand
universal toehold  When all toehold domains in a DSD share a common sequence they are said to be universal. xviii, 8, 18, 20
waste  One or more bound and unbound strands of a transformer used only once to effect a chemical reaction equation. The strands remain in the reaction volume in a closed system. xviii, 23, 66, 68

Notes on reading the text

By definition every PhD thesis should be unique, and I hope this one is no exception. In what follows, I provide two distinct, self-contained and seemingly unrelated thesis parts. Both are motivated by problems related to nucleic acids, and my interest lies in the underlying combinatorial challenges they present.
Briefly, in Part I, Space and energy efficient molecular programming: energy barriers, chemical reaction networks, and DNA strand displacement systems, I explore the combinatorial challenges associated with predicting and designing low-energy barrier folding pathways of one or more nucleic acid strands. In Part II, Space efficient text indexes motivated by biological sequence alignment problems, I design a number of compressed data structures to support efficient pattern matching queries for applications involving nucleic acid sequence alignment. While distinct and self-contained, both thesis parts share common themes that motivated me to pursue both directions simultaneously. In particular, both parts address combinatorial problems related to nucleic acids and also explore the theme of space efficiency. Otherwise each thesis part stands on its own, with its own introduction and conclusion, and can be read independently. Each part has been written with a different research community in mind; however, a significant effort has been made to provide sufficient background so that either part can be read by an interested computer science researcher.

On the use of I, the candidate, and we

In all technical chapters, I will avoid the use of the personal pronoun I. When specifically identifying myself, I will often use the term the candidate. Even when presenting material where I was the sole author, I will say we instead of I, as is tradition in scientific writing. However, when an opinion is being expressed, it can be assumed that the opinion is mine, but is not necessarily the opinion of my co-authors. This is often the case in chapter summaries and conclusions, where I have had the benefit of considering the totality of the research considered in this thesis. When rewriting significant portions of published work, I have taken the opportunity to think of the implications beyond what was originally presented in isolation.
I believe deeper insight is offered in this current form into both the motivations and the contributions of the research undertaken in this thesis.

Acknowledgements

I owe the greatest debt of gratitude to my research supervisor, Professor Anne Condon. Anne not only supervised my PhD research, but also co-supervised my master's research. Most everything I know about being a scientist I have learned from her. I could not have found a better mentor. Anne's example continues to serve as an inspiration to become a better scientist, a better community citizen and a better person. Thank you Anne for your time, your mentorship and friendship, the academic freedom to pursue a number of research directions, and for teaching me how to be a scientist. It has been a wonderful experience to know you and work alongside you. Thank you to my committee members, Professor Will Evans and Professor Arvind Gupta. My research and the writing in this thesis have benefited from your feedback, suggestions and support. Thank you Will for many helpful discussions; in particular, I would like to thank you for your suggestions and feedback that significantly improved the writing in the second part of my thesis. Thank you Arvind for all of your great advice over the years, both in terms of research and also more generally. A special thanks to Dr. Jan Maňuch, who has been not only my most active collaborator, but also a good friend. I have enjoyed the countless hours we have spent working together on problems and have learned a lot from you, for which I am grateful. I have had many other wonderful collaborators over the years, and for that I am lucky. In particular, thank you to Professor Holger Hoos and Professor Alan Hu, who have been a great source of advice at different stages of my research career. I also had the good fortune to work with a number of talented undergraduate students including Jay Zhang, Daniel Lai, John Cheu and Leigh-Anne Mathieson.
Thanks to current and former members of the βeta lab at UBC, including Frank Hutter, Dave Tompkins, Hosna Jabbari, Mirela Andronescu, Bonnie Kirkpatrick, Baharak Rastegari, Murray Patterson, and Monir Hajiaghayi for their friendship, help and feedback. I would also like to thank the members of the DNA computing community who have offered insightful and constructive feedback on the research found in this thesis. In particular, I would like to thank Professor Erik Winfree, Dr. Lulu Qian, Dr. David Soloveichik and Dr. Dave Doty for helpful discussions that clarified my understanding of a number of complex topics. I would also like to acknowledge the generous funding of my research from scholarships given by the Natural Sciences and Engineering Research Council of Canada (NSERC) and the Michael Smith Foundation for Health Research (MSFHR), and also funding from UBC and my supervisor. A special thanks to my family, who have always encouraged me in everything I have done. Finally, thank you Meagan for your unwavering support and the happiness you give me. Completing this thesis meant many evenings and weekends spent in a lab away from you. The time put into this thesis was not entirely mine to spend, and so this work is as much yours as it is mine.

Dedication

For Meagan.

Part I

Space and energy efficient molecular programming: energy barriers, chemical reaction networks, and DNA strand displacement systems

Chapter 1

Introduction

In this chapter we motivate work in Part I of our thesis, on nucleic acid folding pathways. Our research in this area has two primary motivations. First, the work was begun with the aim to better understand and computationally predict folding pathways exhibited in biological systems. During the course of that initial research, we realized the computational hardness of solving the prediction problem. This suggested to us that folding pathways may be a mechanism for performing non-trivial computation.
Indeed, designed folding pathways of multiple interacting nucleic acid strands were already being used to perform simple computation. This potential was the impetus for our second motivation: understanding the computational limits of molecular programs that leverage folding pathways. Such pathways also have the potential to perform logically reversible, and thus energy efficient, computation. A better understanding of these designed pathways can shed light on the complexity of predicting folding pathways involving multiple strands. In what follows, we discuss each motivation and its related work, summarize the contributions of this thesis part, and give an overview of the ensuing chapters that detail our technical contributions. This chapter introduces concepts as needed to understand our motivations and gives a comprehensive overview of the models we study in this thesis. Additional concepts and definitions are introduced as needed in later chapters.

1.1 Nucleic acid folding pathways

RNA molecules play vital roles in the cell by virtue of the information encoded into their nucleotide sequence and the structures the molecules form. The primary structure, or nucleotide sequence, can be thought of as a string over the alphabet ΣRNA = {A, C, G, U}. The tertiary, or 3-dimensional, structure of an RNA is determined in large part by the bonds formed between pairs of complementary nucleotides within its sequence, such as the Watson-Crick base pairs: A pairs with U and C pairs with G1.
1 Formal definitions are given in the following chapter.
These bonds constitute the secondary structure of the molecule. An example of two alternate secondary structures for the same nucleotide sequence is given in Figure 1.1(a). Shown in Figure 1.1(b) is a common representation for secondary structures called the arc diagram representation, where each arc denotes a base pair. Prediction of RNA secondary structure is
crucial to understand their myriad2 biological functions. Throughout, we will use the term structure to mean secondary structure. Given their propensity to alter their shape over time under changing environmental conditions, an RNA molecule will fold through a series of structures called a folding pathway [4, 42, 55, 104, 115, 145]. Thus knowledge of folding pathways between pairs of alternative RNA structures is very valuable for inferring RNA function in such environments, and is also valuable for predicting RNA structure, e.g., in light of co-transcriptional folding [20, 42, 108, 117, 133]. As illustrated in Figure 1.2, each structure differs from its predecessor by a single base pair (or equivalently by a single arc in the arc diagram representation).

Figure 1.1: (a) An initial secondary structure (left) and a final secondary structure (right) for a given RNA strand. (b) A corresponding arc diagram. The arcs on top of the nucleotide sequence denote the base pairs in the initial secondary structure while the arcs below the nucleotide sequence denote the base pairs of the final secondary structure.

Much focus to date has been on pathways of pseudoknot-free secondary structures—structures in which no base pairs cross in the arc diagram representation3. Since folding is a thermodynamically-driven probabilistic process,
2 While the number of biological functions of RNA in a cell is certainly finite, new functions of this versatile molecule continue to be elucidated.
3 Complex pseudoknots are rare in nature and therefore algorithmic approaches usually assume their absence in any reasonable solution space.
Figure 1.2: (a) A possible folding pathway is shown for an initial structure A transitioning through intermediate structures (B, C, . . .) until the final structure I is reached. For a particular position in the pathway, the top of the arc diagram denotes the current base pairs of the structure and the bottom of the arc diagram denotes the base pairs still to be added to reach the final structure. Each structure along the pathway differs from its neighbours by one arc. (b) The corresponding energy plot. The barrier in this example is two.

folding pathways tend to avoid high-energy structures4. As a result, many methods for predicting folding pathways or entire energy landscapes5—particularly coarse-grained methods designed to work for large structures which do not attempt to model the complete energy landscape—are guided by calculations of the energy barrier [42, 125]. Intuitively, this is the highest energy difference between the initial structure and the other structures along a pathway. For example, the folding pathway in Figure 1.2 has an energy barrier of 2 under the simple energy model where each base pair contributes −1 to the overall energy of a structure. Contrast this with a naive folding pathway that first removes all base pairs in the initial structure before adding all base pairs in the final structure, resulting in an energy barrier of 4. There is a rich literature on the problem of predicting folding pathways and energy landscapes, both in theory and in practice; see the recent work of Geis et al. [42], Tang et al. [125] and the references therein. We focus here on algorithms for energy barrier calculation, which are an important component of many approaches to estimating entire energy landscapes. Such methods have been proposed, for example, by Morgan and Higgs [83], Wolfinger [143], Flamm et al. [36–38], Geis et al. [42] and Dotu et al. [29].
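The barrier calculation contrasted above (a barrier of 2 for a well-chosen pathway versus 4 for the naive one) can be checked mechanically. Below is a minimal sketch, assuming structures are represented as sets of arcs (i, j) and using the simple energy model where each base pair contributes −1; the function names and the toy structures are illustrative assumptions, not code from this thesis.

```python
# Energy barrier of a folding pathway under the simple energy model:
# each base pair (arc) contributes -1 to the free energy of a structure.
# A structure is a frozenset of arcs (i, j); a pathway is a list of
# structures, each differing from its predecessor by a single arc.

def energy(structure):
    """Free energy under the simple model: -1 per base pair."""
    return -len(structure)

def energy_barrier(pathway):
    """Highest energy reached along the pathway, relative to its start."""
    e0 = energy(pathway[0])
    return max(energy(s) - e0 for s in pathway)

def naive_pathway(initial, final):
    """The naive pathway: remove every arc of the initial structure,
    then add every arc of the final structure."""
    cur = set(initial)
    path = [frozenset(cur)]
    for arc in sorted(initial):
        cur.discard(arc)
        path.append(frozenset(cur))
    for arc in sorted(final):
        cur.add(arc)
        path.append(frozenset(cur))
    return path

# Two 4-arc toy structures: the naive pathway climbs from energy -4
# all the way up to 0 before descending, so its barrier is 4.
A = frozenset({(0, 7), (1, 6), (2, 5), (3, 4)})
B = frozenset({(0, 3), (1, 2), (4, 7), (5, 6)})
print(energy_barrier(naive_pathway(A, B)))  # 4
```

A direct pathway that interleaves removals and additions can do much better, which is exactly why the ordering of arc moves is the object of the heuristics discussed here.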
4 These structures are energetically unfavourable. In simple energy models, high-energy structures have fewer base-pairs than alternative structures of the same molecule.
5 A discrete interpretation of an energy landscape for an RNA molecule is the set of all pathways between all structures the molecule can form.

Several versions of the energy barrier problem have been studied, which are distinguished by properties of the intermediate structures. Morgan and Higgs focus on direct folding pathways from structure A to structure B in which intermediate structures contain only arcs in A ∪ B and such that the total pathway length is |A △ B| (the size of the symmetric difference). In such pathways, each arc from the initial structure not also in the final structure is removed exactly once, and each arc from the final structure not also in the initial structure is added exactly once along the pathway. The example in Figure 1.2 is a direct pathway. A larger class of pathways is obtained by allowing the length of the pathway to exceed |A △ B|. We call such pathways direct-with-repeats pathways, since an arc from A or B may be added or removed multiple times along the pathway. An even more general class of pathways allows intermediate structures to contain “temporary” arcs which are in neither A nor B. These temporary arcs denote base pairs that are found in neither the initial nor the final structure. Morgan and Higgs call such pathways indirect. Thus, direct pathways are a subclass of direct-with-repeats pathways, which in turn are a subclass of indirect pathways. Morgan and Higgs assume the simple energy model in which each base pair contributes −1 to the total free energy. Using a randomized greedy approach, they construct several low-barrier direct pathways and take the minimum energy barrier of these as their estimate. (They also construct indirect pathways using a “single link clustering” method.) Wolfinger et al.
use a barrier tree to represent the folding landscape; identifying nodes in the tree (which are called saddle points) is analogous to calculating energy barriers. Flamm et al.’s method [37] for approximating energy barriers explores direct pathways by performing a breadth-first search, maintaining the best m candidate solutions at each step. As m becomes large, the search becomes exhaustive, yielding an exact solution; however, exponential runtime and memory are then required. The program barriers [38] is capable of computing exact direct and indirect pathways, provided a complete sample of the low energy states separating the two structures is supplied. However, this approach is also exponential in runtime and space and thus impractical for medium (100-500 nucleotides) or large (> 500 nucleotides) problem instances. For their Kinwalker folding pathway predictor, Geis et al. [42] describe a heuristic which explores the space of possible direct pathways in a more sophisticated manner than does the Morgan-Higgs heuristic, incorporating a parameter look-ahead technique to avoid excessive runtimes. While their method uses the Turner energy model [74, 79, 134]6 to evaluate energy barriers, it relies on simple addition and removal of base pairs (and thus the simple energy model) while generating putative low-barrier pathways.

It is important to note that, in general, determining energy barriers is not restricted to alternative structures of a single RNA (or DNA) strand. The idea of computing an energy barrier can be generalized to consider alternative structures of sets of multiple interacting strands. Inter-molecular base pairs can form between strands in addition to intra-molecular base pairs.

6 The Turner energy model is a more realistic energy model parameterized by experimental data and is the standard model used for algorithmic prediction of nucleic acid secondary structure.
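For very small instances, the minimum barrier over direct pathways discussed above can be found by exhausting all orderings of the |A △ B| moves. The brute force below is our own illustration (not any of the published algorithms); it treats arcs as integer index pairs and forbids adding an arc that shares an endpoint with, or crosses, an arc already present.

```python
from itertools import permutations

def conflicts(a, b):
    """Two arcs conflict if they share an endpoint or cross in the arc diagram."""
    (i, j), (k, l) = sorted((a, b))
    return len({i, j, k, l}) < 4 or (i < k < j < l)

def min_direct_barrier(A, B):
    """Exact minimum energy barrier over all direct pathways from A to B,
    under the simple energy model. Exponential time: tiny inputs only."""
    moves = [('rm', a) for a in A - B] + [('add', b) for b in B - A]
    best = float('inf')
    for order in permutations(moves):
        cur, peak, ok = set(A), -len(A), True
        for op, arc in order:
            if op == 'rm':
                cur.remove(arc)
            elif any(conflicts(arc, c) for c in cur):
                ok = False          # illegal addition: abandon this ordering
                break
            else:
                cur.add(arc)
            peak = max(peak, -len(cur))
        if ok:
            best = min(best, peak + len(A))  # barrier = peak energy - E(A)
    return best

A = {(0, 5), (1, 4)}   # every arc of A crosses every arc of B, so both
B = {(2, 7), (3, 6)}   # removals must happen before any addition
print(min_direct_barrier(A, B))   # 2
```

When A and B do not conflict at all, additions can precede removals and the barrier can even be 0; the crossing example above forces the worst case of emptying A first.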
As we discuss in the next section, the ability of strands to predictably interact can be exploited to design and perform molecular computation. Given the importance of understanding interactions of multiple strands in problem domains such as DNA computing, there are already efforts to provide practical and effective probabilistic simulation tools; see the work of Schaeffer [112] and references therein.

In summary, all current methods are either heuristic in nature, and thus not guaranteed to find the exact energy barrier between two structures, even for a single molecule, or are exponential in both runtime and space, precluding their use on even medium-sized problem instances. Thus there is strong motivation for finding a fast method which can exactly compute the energy barrier between two structures. Indeed, the Geis et al. method can estimate energy barriers for structures of long sequences (1,500nt or more) but the authors note that “as the performance of Kinwalker crucially depends on approximating saddle heights, further improvements to the Morgan-Higgs heuristic as well as alternative approaches will be investigated”.

1.1.1 The simple energy model

We now formally define the simple energy model. To do so, we must define the concept of a strand complex. A single strand that has no base pairs to another strand is a complex. A group of two or more strands forms a complex if (i) no strand in the group has a base pair with a strand outside of the group and (ii) given any partition of the group into two subgroups, there is at least one base pair from a strand in one subgroup to a strand in the other. Formally, the strands of a complex form a connected component in the circle arc diagram representation. For instance, Figure 1.5(a) shows a circle arc diagram for three strands where edges denote base pairs; strands C and B form a complex and strand A forms its own complex. In Figure 1.5(b) there is only one complex, formed by strands A, B and C.
Each Watson-Crick base pair (i.e., A-U and C-G for RNA and A-T and C-G for DNA) contributes −1 to the overall energy. Therefore, the energy of a structure for a single strand can be formally defined as:

−#basepairs    (1.1)

For instance, both structures depicted in Figure 1.1(a) are composed of a single complex, contain eight base pairs, and therefore have energy −8. Base pairs are also counted in the multiple strand case. Additionally, there is an entropic penalty constant Kassoc, Kassoc > 1, for each strand association that results in fewer strand complexes. A strand association event occurs when the first base pair is formed between two complexes that were not previously associated; thus, the energy will instantaneously change by Kassoc − 1. The simple energy model for multiple interacting strands can be formally defined as:

(#strands − #complexes) · Kassoc − #basepairs    (1.2)

The example of Figure 1.5(a) would have energy (3 − 2)Kassoc − 3 = Kassoc − 3, whereas the example of Figure 1.5(b) would have energy (3 − 1)Kassoc − 4 = 2Kassoc − 4, as it contains one fewer complex and one additional base pair.

As with early studies of RNA structure prediction algorithms, in this thesis we will study folding pathways by adopting the simple energy model. The reasoning for this choice is two-fold. First, this model is significantly simpler and remains sufficient to understand the complexity of the underlying combinatorial problem. If the problem is hard in the simple energy model, it provides evidence that it is hard for more complex models. Second, if effective algorithms are developed in the simple energy model, then it is possible that they could be adapted for more complex energy models. This was the case for the RNA structure prediction problem, which was first studied with the simple energy model [89] and later improved to use the Turner energy model [79].
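A small sketch of Equation (1.2): complexes can be counted as connected components over strands, with each base pair abstracted to just the pair of strands it joins. The strand names and pair lists below are hypothetical, loosely mirroring the Figure 1.5 examples; this encoding is ours, not the thesis's representation.

```python
def num_complexes(strands, pairs):
    """Number of complexes: connected components of strands under base pairs.
    Uses union-find; `pairs` lists, for each base pair, the two strands it
    joins (an intra-molecular pair joins a strand to itself)."""
    parent = {s: s for s in strands}
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x
    for s, t in pairs:
        parent[find(s)] = find(t)
    return len({find(s) for s in strands})

def simple_energy(strands, pairs, k_assoc):
    """(#strands - #complexes) * Kassoc - #basepairs, per Equation (1.2)."""
    return (len(strands) - num_complexes(strands, pairs)) * k_assoc - len(pairs)

# Figure 1.5(a)-like state: three base pairs, strands B and C associated,
# strand A alone. With Kassoc = 2 this gives Kassoc - 3 = -1.
print(simple_energy(['A', 'B', 'C'], [('B', 'C'), ('B', 'C'), ('C', 'C')], 2))

# Figure 1.5(b)-like state: one complex, four base pairs: 2*Kassoc - 4 = 0.
print(simple_energy(['A', 'B', 'C'],
                    [('A', 'B'), ('B', 'C'), ('B', 'C'), ('C', 'C')], 2))
```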
1.2 Molecular programming

The area of molecular programming enjoys active research from both theoreticians and experimentalists, due in part to its promise of embedded logical computation that can naturally interface with biological systems. For instance, if a condition is detected in a cell, then a certain therapeutic agent can be released. A widely studied and experimentally practical model of computation in molecular programming entails so-called DNA7 strand displacement systems (DSD). DSDs leverage the fact that substrings of DNA strands will hybridize to their perfect complements and can also displace other bound strands sharing the same substring. By a careful, non-trivial design of strands, one can realize a complex, yet deterministic computation. DSDs have been experimentally implemented and verified to simulate logic circuits [21, 116], neural networks [99], and DNA walkers [119], among numerous other applications [51, 75, 92, 97, 122, 136, 151]. They have also been shown capable, in principle, of energy-efficient Turing-universal computation [64, 98]. An underlying property of DSDs is that intended sequences of strand displacements form low energy barrier pathways, while unintended sequences must overcome a high energy barrier. This connection to folding pathways was our initial motivation for studying DSDs in this thesis. However, as discussed below, DSDs consider strands at an abstract domain level, and not at the sequence level. Thus, any formal conclusions that we draw about underlying folding pathways of DSDs will necessarily consider a corresponding sequence design, and will be in terms of the multiple interacting strand folding pathway model.

7 DNA is the molecule of choice for molecular computing because of its stability in comparison to RNA. Concepts like secondary structure still apply, as DNA forms base pairs just as RNA does, although using a different set of nucleotides ({A, C, G, T}).
1.2.1 DNA strand displacement systems (DSD)

A DNA Strand Displacement system (DSD) consists of signal strands and double stranded complexes, each consisting of one or more strands bound to a long template strand. Strands that are not bound are said to be unbound strands. DNA strands are oriented and have a 5’ end and a 3’ end. A strand can only bind to another strand in opposite orientation. Consider the example DSD in Figure 1.3. There are two signal strands A and B, and one double stranded complex. The 3’ ends of the strands are depicted in the example with arrows. The bound strands C and D have opposite orientation to the template strand E.

Figure 1.3: A DNA strand displacement system consisting of two signal strands, A and B, and one double stranded complex consisting of two bound strands C and D and a template strand E. Long domains of the template strand are shown in red while long domains of signal strands and bound strands are shown in gray. Universal toehold domains are shown in black.

Strands in the system are composed of two types of strand domains: short toehold domains and long domains. All DSDs that we study make use of universal toeholds, meaning that all toehold domains share a common sequence. Distinct long domains are assumed to have a distinct sequence design. In the example DSD of Figure 1.3, universal toehold domains are shown in black and long domains of bound and signal strands are shown in gray. The long domains of the template strand E are shown in red. The template strands have complementary toehold domains to those on signal and bound strands.

Toehold mediated strand displacement

Figure 1.4: Strand displacement. (a) Toehold (black subsequence) of signal strand A binds with its unpaired complement on the template strand B.
(b) The single long domain (gray subsequence) of A competes via a random walk process with the single long domain of strand C to bind with the complementary long domain of B until all bases of A are bound to B. (c) Toehold of C detaches from B, at which point it has been displaced and becomes a signal. The process is reversible and signal strand C could next displace the bound strand A.

Toehold domains bind reversibly, and long domains irreversibly, to complementary regions on template strands. The fundamental operation in a DSD is toehold mediated strand displacement, whereby a toehold domain of a signal strand, called the invading strand, binds to an unbound complementary toehold domain of a template strand and, if the adjacent long domain is complementary, it can displace a currently bound signal strand, called the evading strand, of the same length. We illustrate a simple, reversible version of toehold mediated strand displacement in Figure 1.4. First, the toehold of invading signal strand A binds (forms base pairs) to the complementary toehold of the template strand B. Then, in a random walk process (often referred to as branch migration), the bases of the long domain of A compete with those belonging to the identical long domain of the evading strand C to form base pairs with the complementary long domain of the template strand B. Once the long domain of A has bound to its complement on B, C remains bound to B by just its short toehold domain. The toehold bonds can break, thereby releasing signal C. (Of course, A may detach from the template before C is released, in which case the displacement does not happen.) The displacement is reversible because signal C can bind to the template strand B to displace strand A via the same principles.
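Branch migration is commonly idealized as an unbiased random walk over the positions of the long domain. The toy Monte Carlo below is our own illustration (not a model used in this thesis): it estimates the chance that an invader that has gained its first base pair of the long domain completes the displacement rather than backing off entirely, which a gambler's-ruin analysis puts at 1/n for a length-n long domain.

```python
import random

def displacement_prob(n, trials=20000, seed=1):
    """Fraction of unbiased random walks, started with 1 of n invader bases
    bound in the long domain, that reach n (displacement completes) before
    falling back to 0 (the invader retreats)."""
    random.seed(seed)
    wins = 0
    for _ in range(trials):
        k = 1
        while 0 < k < n:
            k += random.choice((-1, 1))   # one base exchanged per step
        wins += (k == n)
    return wins / trials

print(displacement_prob(5))   # close to 1/5 = 0.2
```

This is one intuition for why the toehold matters: it holds the invader in place so the walk can be retried many times before the invader dissociates.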
Folding pathway of a strand displacement

Figure 1.5 illustrates the same displacement in terms of the corresponding folding pathway for a hypothetical sequence design where toehold domains have a length of one base and long domains have a length of two bases. After the first (and only) toehold base pair is formed between strand A and B, relative to that energy, branch migration of the long domains of A and C occurs within energy barrier 1. This is because a base pair between A and B can be added immediately after a base pair between C and B is removed.

Illegal strand displacement

The example in Figures 1.4 and 1.5 illustrates a legal displacement. Now let us consider how an evading strand can be displaced (i.e., produced as a signal) from a template other than by a legal displacement. First, it is possible that one or more invading strands with a different long domain are used for displacement. We call this a mismatch displacement. Second, it is possible the invading strand does have an identical long domain, but toehold base pairs between the invading strand and template strand are not formed prior to branch migration. We call this a blunt-end displacement. Finally, it is possible that the evading strand simply breaks all base pairs with the template strand and disassociates. We call this a spontaneous displacement. The occurrence of any one of the three types of illegal displacements would result in a folding pathway with a higher energy barrier. Consistent with Soloveichik et al. [122], we assume throughout that only legal displacements can occur and that sequences of domains can be designed to be sufficiently different that a strand with domain δ is very unlikely
to displace a strand with domain δ′ ≠ δ.8

Figure 1.5: A corresponding folding pathway is shown for the displacement example of Figure 1.4 using a simple sequence design where toehold domains have one base and long domains have two bases. The displacement of strand C by strand A is shown in seven steps, from (a) to (g). Base pairs are shown as edges between strands. The energy changes between each structure, assuming Kassoc = 2, are shown in the bottom right. Since toehold domains have one base, the energy barrier of the underlying folding pathway, relative to (a), is Kassoc. If the toehold domains had length LT > 1, then the energy barrier, relative to (a), would be Kassoc − 1.

1.2.2 Chemical reaction networks (CRN)

Just as a DSD abstracts sequence level details of a folding pathway using the concept of domains, Chemical Reaction Networks (CRNs) abstract details about displacements. CRNs provide a concise language for writing molecular programs and afford us the opportunity to express complex ideas more succinctly in this thesis.

Chemical reactions and signal molecules

A chemical reaction equation details a process whereby certain molecule types can be consumed (the reactants) and others produced (the products) within some reaction volume. A reaction may also require the presence of catalyst molecules of certain types. We refer to all three categories, generically, as signal molecules. For example, the reaction A + B →[C] D consumes a signal of type A and a signal of type B and produces a signal of type D in the presence of the catalyst9 signal C.
This is an example of an irreversible chemical reaction equation; however, A + B ⇌[C] D is an example of a reversible chemical reaction equation, meaning that both a signal of type A and of type B can also be produced by consuming a signal of type D in the presence of the catalyst signal C. A CRN is a set of chemical reactions, together with a multiset of signals, called the initial signal multiset, that are present within the reaction volume prior to any reaction occurring. The current signal multiset is the current composition of signals of a given CRN within a reaction volume. In this work, we consider a reaction volume to be a closed system, meaning that signals cannot be added to the current signal multiset unless they are produced by a reaction, and signals cannot be removed from the current signal multiset unless they are consumed by a reaction.

Example 1.2.1. Let us consider a concrete example of a 3-bit standard binary counter that should begin at count 000, advance to 001, and so on, until reaching the count 111. In our molecular program, we let signal 0i and signal 1i denote that bit i has value 0 and 1, respectively, for 1 ≤ i ≤ 3. Thus, our 3-bit counter will have the following initial signal multiset: {03, 02, 01}. Figure 1.6(a) gives three chemical reaction equations for exchanging signals and thus changing the state, or current signal multiset, of the counter. Figure 1.6(b) represents all

8 This is a reasonable assumption one can make, and it can be shown formally using results from coding theory. Schulman and Zuckerman [114] show how to construct a set of 2^Θ(n) domains (i.e., binary strings in their code) of equal length Θ(n) such that the energy barrier (Levenshtein distance) between any pair of domains is at least cn, for any given constant c.
9 Some reactions require the presence of one or more signals, called catalysts, which they do not consume. Note how we represent catalysts in our reaction equations.
These are not to be confused with rate constants, which do not significantly factor into our research. Catalysts do play a significant role in our thesis research and this representation was chosen for its succinctness.

(1) 01 ⇌ 11
(2) 02 + 11 ⇌ 12 + 01
(3) 03 + 12 + 11 ⇌ 13 + 02 + 01

Figure 1.6: (a) Chemical reaction equations for a 3-bit standard binary counter. (b) The configuration graph of the computation performed by the 3-bit standard binary counter forms a chain and is logically reversible. The nodes represent the state of the computation and the edges are directed between states reachable by a single reaction.

reachable states of the counter as nodes and has edges between states that are reachable within one reaction step.

Chemical reaction rates

When reasoning about time complexity of various chemical reaction networks, we will use the well known stochastic chemical kinetics model [43]. This model, based on a continuous time Markov process, permits us to reason about the probability, and the expected time to completion, of individual chemical reactions and of sequences of reactions within a well mixed reaction volume. In general, each reaction has an associated reaction rate denoting the relative speed of the reaction within some defined reaction volume. The rate of a reaction is dependent on (i) the order of the reaction, (ii) the size of the reaction volume, and (iii) the reaction rate constant. In all examples of this thesis, we will assume a reaction volume of size v and a uniform reaction rate constant, k = 1. As all reactions share the same reaction rate constant, we omit it when reasoning about expected reaction times. The order of a reaction is the number of required reactants. For instance, R1 → . . . is a unimolecular reaction, R1 + R2 → . . .
is a bimolecular reaction, R1 + R2 + R3 → . . . is a trimolecular reaction, and so on. In a volume of size v and assuming a uniform rate constant k = 1, the propensity of a unimolecular reaction R1 → . . . is |R1|/v^0 = |R1|, where |R1| denotes the number of copies of signal R1 in the current state. The propensity of a bimolecular reaction R1 + R2 → . . . is |R1||R2|/v, assuming R1 ≠ R2, and is |R1|(|R1| − 1)/(2v) otherwise. In general, the propensity of an ith order reaction, assuming i distinct reactants, is |R1||R2| · · · |Ri|/v^(i−1). The expected time until the next reaction is an exponential random variable with a rate, r, equal to the sum of the propensities of the reactions that can occur. The probability of a particular reaction occurring next is equal to its propensity divided by r. Intuitively, for all reactions that can occur in a particular state (i.e., all reactants are present in sufficient quantity), higher order reactions are less likely than lower order reactions. In this thesis, many of the CRNs we propose will initially contain higher order reactions. However, when implemented as DSDs, all reactions will be bimolecular reactions between two distinct species, each of a single copy. A further property of the CRNs we propose in this thesis is that in any given state, at most two reactions are possible. Thus, in the DSD realizations of our CRNs (see Section 1.2.5), the propensity of a reaction that can occur is 1/v, and the expected time for the reaction to occur is O(v).

CRNs as a means for computation

We will always define the CRNs we study in this work using the chemical reaction notation already introduced. However, it is helpful to have a formal definition of a CRN when proving certain results. This eliminates ambiguity that may arise, for example, when reasoning about reversible reactions and catalysts.
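The propensity formulas above can be checked with a short sketch. The species names and counts below are hypothetical, the rate constant is k = 1 throughout, and repeated reactants are handled with the falling-factorial convention used in the bimolecular same-species case.

```python
from math import factorial

def propensity(reactants, state, v):
    """Propensity of a reaction in volume v with rate constant k = 1.
    `reactants` maps each species to its multiplicity in the reaction;
    `state` maps species to current copy counts."""
    order = sum(reactants.values())
    a = 1.0
    for species, m in reactants.items():
        n = state.get(species, 0)
        ways = 1
        for j in range(m):
            ways *= n - j          # n * (n-1) * ... (falling factorial)
        a *= ways / factorial(m)   # divide out indistinguishable orderings
    return a / v ** (order - 1)

state = {'R1': 3, 'R2': 2}
v = 10.0
print(propensity({'R1': 1}, state, v))           # 3.0 = |R1|
print(propensity({'R1': 1, 'R2': 1}, state, v))  # 0.6 = |R1||R2|/v
print(propensity({'R1': 2}, state, v))           # 0.3 = |R1|(|R1|-1)/(2v)
```

The total rate r is then the sum of such propensities over the enabled reactions, and the expected time to the next reaction is 1/r.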
In addition, a formal definition will permit us to define what it means to perform computation with a CRN in a reaction volume that is a closed system. We define a Chemical Reaction Network (CRN) to be a tuple C = ⟨S, R, S0, send⟩, where

• S is the set of all signal types (or species) of signal molecules used in any reaction.
• R is a set of chemical reaction equations, where each R ∈ R is an ordered pair of multisets of signal molecules. Intuitively, a reaction equation R = (I, P) consumes the signal molecules in I as the input reactants and produces the signal molecules in P as products. Note that it is only the signal molecules in I − P that are actually consumed. The others act as catalysts for the reaction. Our formalism is directional to allow modeling non-reversible reactions; a reversible chemical reaction is modeled as two separate elements of R, i.e., (I, P) and (P, I).
• S0 is the initial signal multiset, where s ∈ S0 implies s ∈ S.
• send ∈ S is a signal molecule denoting the end of computation.10

The state of a CRN is defined by its current signal multiset. We formalize computations in C in the natural manner: Let ρ be a sequence of reactions R1, R2, . . . , Rm from R, where each Ri = (Ii, Pi). We define ρ to be a trace of C if ρ induces a corresponding sequence of multisets S0, S1, . . . , Sm, with S0 being the multiset of initial signal molecules in C, and for all 1 ≤ i ≤ m, we have both Ii ⊆ Si−1 and Si = Si−1 − Ii + Pi. (We use “−” and “+” to denote multiset subtraction and union.) If ρ is a trace for a completed computation, then send ∈ Sm and send ∉ Sn for n ≠ m. Note that throughout this thesis, we consider the computation to halt when the send signal is first produced.

10 A computation may have multiple final states. To model this situation, we can let send be produced in all final reactions, in addition to any other signal molecules that may indicate the result of the computation.
The traces we study for completed computations will reflect this fact.

Example 1.2.1 (continued). Let us describe the 3-bit standard binary counter formally. The set of signal types is S = {01, 02, 03, 11, 12, 13, send}, where send denotes the end of computation. The initial signal multiset is S0 = {03, 02, 01}. Finally, we have the following set of chemical reaction equations

{R1-for = ({01}, {11}), R1-rev = ({11}, {01}), R2-for = ({02, 11}, {12, 01}), R2-rev = ({12, 01}, {02, 11}), R3-for = ({03, 12, 11}, {13, 02, 01}), R3-rev = ({13, 02, 01}, {03, 12, 11}), Rend = ({11, 12, 13}, {send})}.

These reactions, with the exception of the last one, formally define the reactions shown in Figure 1.6(a). The shortest trace producing send is the sequence of reactions R1-for, R2-for, R1-for, R3-for, R1-for, R2-for, R1-for, Rend, which induces the following sequence of multisets:

{03, 02, 01}, {03, 02, 11}, {03, 12, 01}, {03, 12, 11}, {13, 02, 01}, {13, 02, 11}, {13, 12, 01}, {13, 12, 11}, {send}.

1.2.3 Tagged chemical reaction networks (tagged CRN)

As CRNs are an abstract description for molecular programs, we must consider how they can be realized by physical systems such as DSDs. Continuing with the 3-bit standard binary counter example, we can represent each 0i and 1i with a unique strand, for 1 ≤ i ≤ 3. Figure 1.7 shows two states of the counter and the composition of signal strands representing those states. Thus, state representation is easily achieved, but how does one transition between states? For instance, how can reaction (1) of Figure 1.6(a) be implemented? Unfortunately, we do not know how to change the signal strand 01 directly into the signal strand 11. However, we do know how to achieve the same result, indirectly.
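The formal counter of Example 1.2.1 can be exercised directly. This sketch replays the shortest trace, checking the trace condition Ii ⊆ Si−1 at each step; multisets are Counter objects, reaction directions are as read from Figure 1.6(a), and only the forward reactions and Rend (the only ones the trace uses) are listed.

```python
from collections import Counter

# Forward reactions and Rend of the 3-bit counter, as (I, P) multiset pairs.
R = {
    'R1-for': (Counter({'01': 1}), Counter({'11': 1})),
    'R2-for': (Counter({'02': 1, '11': 1}), Counter({'12': 1, '01': 1})),
    'R3-for': (Counter({'03': 1, '12': 1, '11': 1}),
               Counter({'13': 1, '02': 1, '01': 1})),
    'Rend':   (Counter({'11': 1, '12': 1, '13': 1}), Counter({'send': 1})),
}

def run_trace(S0, trace):
    """Apply reactions in order; return the induced sequence of multisets."""
    states = [Counter(S0)]
    for name in trace:
        I, P = R[name]
        assert not (I - states[-1]), f'{name} is not enabled'  # Ii must fit in Si-1
        states.append(states[-1] - I + P)
    return states

trace = ['R1-for', 'R2-for', 'R1-for', 'R3-for',
         'R1-for', 'R2-for', 'R1-for', 'Rend']
states = run_trace({'03': 1, '02': 1, '01': 1}, trace)
print(states[-1])   # Counter({'send': 1})
```

The nine induced multisets match the sequence listed in the example, ending with the lone send signal.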
Figure 1.8 shows a strand displacement process, based on toehold mediated strand displacement as discussed in Section 1.2.1, implementing the reaction 01 ⇌ 11 based on a construction proposed by Qian et al. [98]. From top to bottom, the 01 signal strand interacts with a transformer to first become consumed (sequestered on a double stranded complex) and ultimately the 11 signal strand is produced (released from a double stranded complex). The strands contained within a shaded box are the signal strands, while everything else forms the transformer for this reaction.

Figure 1.7: Representing the state of the CRN for a 3-bit standard binary counter can be achieved by the presence and absence of certain signal strands for each bit position. (a) The current signal strands denote that the counter has value 010. (b) The current signal strands denote that the counter has value 101. Long domains for bits representing a 1 value are coloured in red while those representing a 0 value are coloured in grey. Universal toehold domains are coloured black.

Figure 1.8: A strand displacement implementation of the reaction 01 ⇌ 11 as proposed by Qian et al. [98]. From top to bottom, the input signal strand 01 (shown in a shaded box on the left) is consumed by the transformer (middle) which produces the signal strand 11 (shown in a shaded box on the right). Additional unbound strands are used in the process and are considered part of the transformer. The transformer can be applied next in the opposite direction (from bottom to top) to consume signal 11 and produce signal 01. In this and later figures, the Watson-Crick complement of a domain x is denoted by x∗.

The important point is that the
transformer is not in the same state after producing signal 11 as it was prior to consuming signal 01. The transformer is no longer in a state that can consume a 01 signal and produce a 11 signal. It is, however, in a state that can perform reaction (1) in reverse (from bottom to top in Figure 1.8). Thus, while the same transformer can be used to perform both the forward and reverse of a reaction, it must strictly alternate between these directions.

Tags and tagged chemical reaction equations

To capture this notion of transformer orientation at the level of a chemical reaction network, we can tag each side of a reaction to represent the transformer, and its required orientation, that is necessary to perform a reaction in the respective direction. In the case of reversible reactions, when considered as two separate reactions, the forward tag of one will be the reverse tag of the other. We call these tagged chemical reaction equations. A tagged Chemical Reaction Network (tagged CRN) consists of an initial signal multiset, an initial tag multiset, and a set of tagged chemical reaction equations. Formally, we define a tagged CRN to be a tuple C = ⟨S, R, S0, send, T, T0⟩, where all members are defined the same as for a CRN and, additionally, T is the set of all tag species and T0 is the initial tag multiset, containing one or more tags for each tagged chemical reaction equation R ∈ R.

Space complexity of a tagged CRN

This simple concept of tags allows us to account for the required number of transformers and the minimum size of the reaction volume required to complete a computation. Given a trace ρ for a tagged CRN C = ⟨S, R, S0, send, T, T0⟩, let S∗ be the largest signal multiset of the sequence of multisets induced by ρ. We define the space complexity of a tagged CRN computation with trace ρ of tagged CRN C to be |S∗| + |T0|.
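As a sketch of this definition, the function below replays a trace of a tagged CRN and reports |S∗| + |T0|. The tagged reactions and tag names anticipate the tagged counter of Figure 1.9 (reaction directions as in Figure 1.6(a)); only the forward reactions and Rend are listed, since those are all the trace uses.

```python
from collections import Counter

TAGS = {'T1f', 'T1r', 'T2f', 'T2r', 'T3f', 'T3r', 'Tend'}

# Tagged forward reactions and Rend of the 3-bit counter, as (I, P) pairs.
R = {
    'R1-for': (Counter({'T1f': 1, '01': 1}), Counter({'T1r': 1, '11': 1})),
    'R2-for': (Counter({'T2f': 1, '02': 1, '11': 1}),
               Counter({'T2r': 1, '12': 1, '01': 1})),
    'R3-for': (Counter({'T3f': 1, '03': 1, '12': 1, '11': 1}),
               Counter({'T3r': 1, '13': 1, '02': 1, '01': 1})),
    'Rend':   (Counter({'Tend': 1, '11': 1, '12': 1, '13': 1}),
               Counter({'Tend': 1, 'send': 1})),
}

def space_complexity(S0, T0, trace):
    """|S*| + |T0|: largest signal multiset along the trace plus initial tags."""
    cur = Counter(S0) + Counter(T0)
    largest = sum(Counter(S0).values())
    for name in trace:
        I, P = R[name]
        assert not (I - cur), f'{name} is not enabled'
        cur = cur - I + P
        largest = max(largest, sum(n for s, n in cur.items() if s not in TAGS))
    return largest + sum(Counter(T0).values())

trace = ['R1-for', 'R2-for', 'R1-for', 'R3-for',
         'R1-for', 'R2-for', 'R1-for', 'Rend']
T0 = {'T1f': 4, 'T2f': 2, 'T3f': 1, 'Tend': 1}
print(space_complexity({'03': 1, '02': 1, '01': 1}, T0, trace))   # 11
```

Note the tag bookkeeping: each forward use of reaction (1) consumes a fresh T1f, which is why the trace needs four copies of it.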
Note that for every reaction, exactly one tag is consumed and one is produced; thus, the number of tags present in any reachable state is equal to |T0|. Intuitively, this corresponds to the minimum size of the reaction volume of a closed system to fit all molecules necessary to complete the computation specified by the trace ρ. We will often refer to this quantity as the required space of a tagged CRN.

(1) T1f + 01 ⇌ T1r + 11
(2) T2f + 02 + 11 ⇌ T2r + 12 + 01
(3) T3f + 03 + 12 + 11 ⇌ T3r + 13 + 02 + 01

Figure 1.9: Tagged chemical reaction equations for a 3-bit standard binary counter.

Example 1.2.1 (continued). We can augment our 3-bit standard binary counter chemical reaction equations from Figure 1.6(a) with tags, resulting in the tagged chemical reaction equations shown in Figure 1.9. Formally, we have the following set of tagged chemical reaction equations

{R1-for = ({T1f, 01}, {T1r, 11}), R1-rev = ({T1r, 11}, {T1f, 01}), R2-for = ({T2f, 02, 11}, {T2r, 12, 01}), R2-rev = ({T2r, 12, 01}, {T2f, 02, 11}), R3-for = ({T3f, 03, 12, 11}, {T3r, 13, 02, 01}), R3-rev = ({T3r, 13, 02, 01}, {T3f, 03, 12, 11}), Rend = ({Tend, 11, 12, 13}, {Tend, send})}.

If we consider the sequence of reactions illustrated in Figure 1.6(b) to advance from count 000 to 111, then the initial signal multiset is still {03, 02, 01} and the initial tag multiset required for the computation to reach the count 111 and finally produce send is {T1f, T1f, T1f, T1f, T2f, T2f, T3f, Tend}. Therefore, the required space or space complexity for this tagged CRN computation is eleven molecules, as each signal multiset during the computation has the same size as the initial signal multiset. In general, we will reason about the space complexity of a tagged CRN asymptotically.

1.2.4 Proper chemical reaction networks (proper CRN)

We define one additional restricted class of CRNs to help simplify our space complexity analysis throughout this thesis.
Given a (possibly tagged) CRN C having a reaction set R, consider any R = (I, P) ∈ R. We call the signal molecules consumed in I − P proper reactants and those produced in P − I proper products. R is a k-proper chemical reaction equation (or simply a proper chemical reaction equation) if and only if |I − P| = |P − I| = k. We say that C is a k-proper chemical reaction network (or simply a proper CRN) if all of its reactions are proper and k is the maximum number of proper reactants over all reactions in R. We observe the following obvious, but useful, results.

Lemma 1. A proper CRN with initial signal multiset S0 will always have |S0| free signal molecules during a computation.

Lemma 2. The space complexity of a tagged CRN C = (S, R, S0, send, T, T0) with initial signal multiset S0 and initial tag multiset T0 that is also a proper CRN is |S0| + |T0|.

1.2.5 Realizing CRNs with DSDs

Soloveichik et al. [122] showed that arbitrary CRNs could be realized by using DNA signal strands to represent the signal molecules and by using a cascade of toehold mediated strand displacements to implement chemical reaction equations. Qian et al. [98] proposed an alternate construction—hereafter called the QSW construction—that is capable of simulating bi-molecular, and higher-order, chemical reactions. Specifically, the construction can exchange a multiset of signal strands (the reactants) for another multiset of signal strands (the products) through a sequence of toehold mediated strand displacements. All signal strands have the same form: a negative recognition long domain (−d), followed by a universal toehold (t), followed by a positive recognition long domain (+d). Signal strands and additional auxiliary strands that will be produced by a reaction are initially bound to a template strand. Additional unbound strands, each consisting of a single long domain and a single universal toehold, are used to effect the cascade of toehold mediated strand displacements.
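Stepping back briefly to the space accounting above: the tagged 3-bit counter of Figure 1.9 can be simulated directly on multisets. The sketch below is illustrative only—the Python encoding and firing order are mine—but the reactions, the eleven-molecule space bound, and the properness claims follow the example and Lemmas 1 and 2 (Rend, which exchanges three signals for one and is therefore not proper, is omitted).

```python
from collections import Counter

# Forward reactions of the tagged 3-bit counter of Figure 1.9, written as
# (reactants, products) multisets. Names (0_1, T1f, ...) follow the figure;
# the Python encoding itself is illustrative, not from the thesis.
SIGNALS = {"0_1", "1_1", "0_2", "1_2", "0_3", "1_3"}
R = [
    (Counter(["T1f", "0_1"]), Counter(["T1b", "1_1"])),
    (Counter(["T2f", "0_2", "1_1"]), Counter(["T2b", "1_2", "0_1"])),
    (Counter(["T3f", "0_3", "1_2", "1_1"]), Counter(["T3b", "1_3", "0_2", "0_1"])),
]

def properness(i, p):
    """Return k if the reaction is k-proper (it consumes and produces
    exactly k signal molecules), else None."""
    consumed = sum(n for m, n in (i - p).items() if m in SIGNALS)
    produced = sum(n for m, n in (p - i).items() if m in SIGNALS)
    return consumed if consumed == produced else None

assert [properness(i, p) for i, p in R] == [1, 2, 3]  # a 3-proper CRN

def fire(state, reactants, products):
    """Fire one reaction in a multiset state (a closed reaction volume)."""
    assert all(state[m] >= n for m, n in reactants.items()), "missing reactant"
    nxt = state.copy()
    nxt.subtract(reactants)
    nxt.update(products)
    return +nxt  # drop zero-count entries

# S0 = {0_3, 0_2, 0_1}; T0 = {T1f x4, T2f x2, T3f, Tend} as in the text.
state = Counter(["0_1", "0_2", "0_3",
                 "T1f", "T1f", "T1f", "T1f", "T2f", "T2f", "T3f", "Tend"])
for step in [0, 1, 0, 2, 0, 1, 0]:     # 000 -> 001 -> 010 -> ... -> 111
    state = fire(state, *R[step])
    # Lemma 1: the number of free signal molecules stays |S0| = 3.
    assert sum(n for m, n in state.items() if m in SIGNALS) == 3

assert state["1_1"] == state["1_2"] == state["1_3"] == 1   # count is 111
assert sum(state.values()) == 11   # Lemma 2: |S0| + |T0| = 3 + 8 molecules
```

Every reaction exchanges one tag for one tag, so the total molecule count is invariant; the final assertion reproduces the required space of eleven molecules computed in the text.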
All toehold domains are universal and are therefore not labeled in the following figures. The one template strand for each reaction has the property that all of its long domains and all but one of its toehold domains are bound. We call these saturated template strands. We refer to the saturated template complex and its associated auxiliary strands, collectively, as a transformer. For example, Figure 1.10 shows an implementation of the chemical reaction equation A + B ⇌ C + D using the QSW construction. The forward reaction is depicted from top to bottom, and it can be seen that the signal strands C and D are initially bound to the template strand of the transformer. The reverse reaction is depicted from bottom to top. Realizing a reaction equation with more or fewer reactants and products is straightforward: it involves modifying the template strand appropriately, adding the necessary auxiliary unbound strands, and adding the necessary bound strands that will eventually be produced as signal strands.

Figure 1.10: A strand displacement implementation of the bi-molecular chemical reaction equation A + B ⇌ C + D using the construction proposed by Qian et al. [98].

The QSW construction guarantees us the following result for the types of systems we study.

Theorem 1 (Qian et al. [98]). Any tagged CRN requiring O(s) space can be realized and simulated by a DSD in O(poly(s)) space, assuming all strand displacements are legal.

In related work, Cardelli [15, 72] has shown how primitives that support concurrent models of computation, such as fork and join gates, can be implemented using strand displacement systems.
Many of the techniques used in the QSW construction are similar to those of Cardelli’s constructions: for example, in both, the signal strands share a common universal toehold while the long domains are distinct, and neither uses branched structures. To effect an abstract chemical reaction equation with i reactants and i products, the QSW construction uses a cascade of toehold mediated strand displacements whereby the reactants are first consumed (by a transformer) and the products are then produced by further strand displacements. This order of events is similar to an i-way join followed by an i-way fork in Cardelli’s scheme. In this work, we will make use of a modified version of the QSW construction that we describe in Chapter 4.

1.2.6 Energy efficient computation

Aside from their potential biological and chemical applications, DSDs and CRNs are also of independent interest due to their promise for realizing energy efficient computation. Rolf Landauer proved that logically irreversible computation—computation as modeled by a standard Turing machine—dissipates an amount of energy proportional to the number of bits of information lost, such as previous state information, and therefore cannot be energy efficient [69]. Surprisingly, Charles Bennett showed that, in principle, energy efficient computation is possible, by proposing a universal Turing machine that performs logically reversible computation; he identified nucleic acids (RNA/DNA) as a potential medium for realizing logically reversible computation in a physical system [8]. A logically reversible computation is a form of deterministic computation. For our purposes, it suffices to understand the important difference distinguishing these two classes of computation. A configuration graph of a computation has a node for every possible state on every possible input for the underlying Turing machine being modeled.
There is a directed edge from node i to node j if and only if state j is reachable from state i in a single state-transition of the Turing machine. An example of a deterministic configuration graph, for four different inputs (source nodes A-D) having a common final state, is given in Figure 1.11a.

Figure 1.11: Example configuration graphs, induced on four different inputs, for (a) deterministic computation, and (b) logically reversible computation. Nodes represent possible states in a computation and directed edges denote valid state transitions.

Consider the path from A to the sink node labeled final state. Every node along the path has an out-degree of at most 1, making the computation deterministic. This is also true for the other source nodes. The computation in this example is not reversible. However, if every directed edge is replaced with two edges, one for each orientation (or equivalently, by an undirected edge), we get a configuration graph for symmetric computation. With this change, we lose determinism. Consider a computation that has begun from node A, has reached the node labeled final state, and is now reversing towards its initial state. Once the computation reaches the node labeled current state, a non-deterministic choice must be made. In essence, the computation does not have enough information to (deterministically) return to its initial state. Thus, information is lost. Contrast this with the logically reversible configuration graph of Figure 1.11b, again shown for four different inputs (source nodes A-D). Importantly, a logically reversible computation for a particular input forms a chain which is unconnected to any state for any other possible input.
This means any state along the chain can be deterministically reached from any other state along the chain. Thus, information is not lost. Even though non-terminal nodes along the chain have two possible choices of where to proceed next, the computation is still deterministic, as one choice is always the previous state of the computation. (Retreating to the previous state is equivalent to the transition never having occurred.) The important point is that at any given node, the computation cannot proceed to more than one other node that is not the previous state. All of the molecular programs we propose in this thesis have this property.

1.3 Objectives

There is a need to find minimum-energy barrier folding pathways of nucleic acids not only in the context of biological processes, but also in the molecular programs which leverage them. When designing molecular programs, knowledge of folding pathways can help to debug the intended behaviour and also to formally verify the correctness of a program; for instance, by ensuring that certain states are unreachable by a low-energy barrier folding pathway. Still, the complexity of finding pathways with this property remained unknown at the outset of this research. Understanding the complexity of this problem could have a number of implications. If the problem is easy (in P), then an efficient and effective algorithm may be developed to better understand both naturally occurring and designed folding pathways. If the problem is hard, it may be possible to use folding pathways for non-trivial computation within chemical and biological systems. It is our first objective to elucidate the problem complexity of finding minimum-energy barrier folding pathways. Regardless of problem complexity, there is a need for an exact algorithm that is efficient in practice. To date, all exact algorithms have time and space complexity that is exponential in the size of the input (the length of the nucleic acid strand(s)). Can this be improved?
It is also our aim to understand the computational power of deterministic molecular programs that leverage folding pathways. In this context, there was a particular need to understand the space complexity of molecular programs. For molecular programs operating in a reaction volume of a closed system, space can be thought of as the necessary size of that volume to fit all molecules necessary to complete a computation. Can a biological soup of nucleic acids having total size Θ(n) perform a computation, by means of a folding pathway, of Θ(2^n) steps? If so, such a program would need to be space-efficient and reuse strands. At the outset of this thesis, DSD implementations did not efficiently reuse strands and therefore accumulated waste—inert strands that remain present in the reaction volume. We will discuss the details and consequences of strand re-use in Chapter 4. This question can be expanded to ask: what are the limitations to deterministic space-efficient molecular programming via folding pathways, if any? As base-pair formation is an inherently reversible process, we also explore the limits of logically-reversible, and thus energy-efficient, computation of DSDs and folding pathways. Along the way, we are interested in understanding the complexity of a number of related problems. Can deterministic DSDs and their underlying folding pathways be verified to be correct (i.e., certain states are reachable within a certain energy barrier, while others are not)?

1.4 Contributions

We now describe the contributions of this part of the thesis, in the order they are discussed. By folding pathway, we mean a pseudoknot-free nucleic acid folding pathway using the simple energy model.

1. We show that finding a direct folding pathway with minimum energy barrier is NP-complete for both the single strand and multiple strand cases.

2.
We give a graph-theoretic algorithm for finding direct folding pathways with minimum energy barrier for the single strand case and discuss how it can be extended to the multiple strand case. For an instance having n arcs, the algorithm has worst-case time complexity exponential in n, but space complexity only polynomial in n, and it is shown to be efficient in practice for most of the experimental instances evaluated. A feature of the algorithm is the ability to identify, in polynomial time, a succinct representation of all minimum free-energy structures between the initial and final structure of an instance.

3. We show that finding a direct-with-repeats folding pathway with minimum energy barrier is NP-complete for the single strand case and NP-hard for the multiple strand case.

4. We give the first example of a minimum energy barrier (indirect) folding pathway for multiple interacting strands whose length is exponential in the combined length of the participating sequences. Our example is a DSD implementation of a binary-reflecting Gray code counter. An n-bit counter deterministically advances through 2^n states using only poly(n) space. This demonstrates that deterministic DNA strand displacement (DSD) systems are capable, in principle, of space-efficient computation. An assumption of this construction is that certain strand species exist as a single copy, rather than in an unbounded concentration.

5. We give the first proof that certain classes of chemical reaction networks (CRN), such as the underlying CRN implemented by our DSD Gray code counter, cannot be space efficient if all species of molecules are assumed to exist in concentration (multiple copies), rather than as a single copy. This implies the counter lacks determinism when all strands are present in concentration.
We generalize this result to show that it is not possible to design any deterministic chemical reaction network that performs more than a linear number of deterministic computation steps in a reaction volume of a closed system (i.e., as modeled by a tagged CRN), unless certain molecules exist as a single copy (i.e., an exact count of the molecules is necessary to ensure determinism).

6. We demonstrate that any space-bounded computation can be solved by a space and energy efficient DNA strand displacement system, and thus by low energy barrier (indirect) folding pathways of multiple interacting strands. We achieve this result by first giving a space efficient molecular program that can solve any arbitrary (unquantified) Boolean formula. We then evolve the program to handle quantified Boolean formulas and apply further transformations to achieve our overall result. In the process, we demonstrate a number of techniques useful for logically-reversible computation, such as traversing a complete binary tree. Given our results of bullet 5, we must assume that certain molecules exist as a single copy to achieve these results.

7. We characterize the complexity of verification and model checking of deterministic chemical reaction networks and DNA strand displacement systems by showing that the reachability problems associated with these models are PSPACE-hard. We fully characterize restrictions of the models that are PSPACE-complete.

8. We relate our molecular programming results (bullets 6 & 7) to our earlier study of nucleic acid folding pathways by incorporating our quantified Boolean formula solver implementation (developed in Chapter 5) into a proof that predicting indirect folding pathways with minimum energy barrier for multiple interacting strands is PSPACE-complete.

9. We motivate and propose a refinement of the ReversibleSPACE complexity class to better model the inherent properties of current molecular programming domains.
A summary of the known complexity for the various folding pathway problems studied in this thesis is given in Table 1.1.

Table 1.1: The complexity of folding pathway energy barrier problems for the simple energy model.

Single strand:
- direct: NP-complete. Shown in Chapter 2. Solvable in practice by a graph theoretic algorithm given in Chapter 3.
- direct with repeats: NP-complete. Shown equivalent to the direct folding pathway problem in Chapter 3.
- indirect: open. If hard, could lead to novel molecular programming methods. Otherwise, an efficient algorithm could shed light on biological pathways and energy landscapes.

Multiple interacting strands:
- direct: NP-complete. Hard by restriction to the single strand case. Gives insight into direct folding pathways typical of DSD systems.
- direct with repeats: NP-hard. Hard by restriction to the single strand case. Not clear if the problem is in NP.
- indirect: PSPACE-complete. Shown in Chapter 5. Gives many insights into DSDs, CRNs, and logically reversible computation.

1.5 Outline

Chapter 2 formally introduces the minimum-energy barrier folding pathway problem and resolves the complexity of direct folding pathways. The hardness proof is quite technical. We note that reading the proof is unnecessary to understand the following chapters; knowing that finding minimum energy-barrier direct folding pathways is NP-complete is sufficient. The remaining chapters are much more accessible. In Chapter 3 we introduce an algorithm for finding direct folding pathways having minimum energy barriers. The algorithm makes interesting use of elegant graph decomposition techniques and may be of interest to theorists searching for open problems motivated by biological questions. In Chapter 4 we turn our attention towards molecular programming motivated by folding pathways and investigate the potential and peril of space efficient computation in these models.
In Chapter 5 we explore how any deterministic computation that halts can be implemented by a space and energy efficient molecular program that leverages folding pathways. We also resolve the complexity of related problems, including the prediction of folding pathways involving multiple interacting strands. In Chapter 6 we summarize our results and motivate the need for refined complexity models to more accurately characterize existing molecular programs.

Chapter 2

Complexity of predicting minimum energy barrier folding pathways

In this chapter, we study the computational complexity of the energy barrier problem for nucleic acids: what energy barrier must be overcome for one or more DNA or RNA molecule(s) to adopt a given final secondary structure, starting from a given initial secondary structure? The results presented here are a first step towards solving the energy barrier problem. Our results pertain to restricted types of folding pathways, namely direct folding pathways. Such pathways were introduced by Morgan and Higgs [83]. A folding pathway from secondary structure I to F is direct if the only arcs which are added are those from F − I and the only arcs which are removed are those from I − F. Beyond the importance of this simple model for the study of biological folding pathways (see Section 1.1), we note that most designed nucleic acid folding pathway systems that we are familiar with are direct [116, 120, 146, 147]. However, there are examples of designed indirect folding pathways, including the catalytic system of Zhang et al. [150] and the binary-reflecting Gray code counter we present in Chapter 4. The whole of this chapter is dedicated to the proof of our first main result: finding direct folding pathways with minimum energy barrier is NP-complete. In our proof, we consider the folding pathways of single strands. At the end of the chapter, we extend the result to the case of multiple interacting strands.
We begin with formal definitions and existing results necessary to support our claim. (Content from this chapter appears in the proceedings of the 15th Annual International Conference on DNA Computing and Molecular Programming (DNA 2009) [82] and the Journal of Natural Computing [81].)

2.1 Preliminaries

A secondary structure T for an RNA (DNA) molecule of length n is a set of base pairs i.j, with 1 ≤ i < j ≤ n, such that (i) each base index i or j appears in at most one base pair and (ii) the bases at indices i and j form a Watson-Crick (i.e., C-G, A-U, or A-T) base pair. Since we represent secondary structures using arc diagrams, we use the word arc interchangeably with base pair (see Figure 1.1). Our main results pertain to pseudoknot-free secondary structures, that is, structures with no crossing arcs. We assume a very simple energy model for secondary structures in which each arc contributes an energy of −1. Thus, as is roughly consistent with more realistic energy models, the more base pairs in a structure, the lower its energy. We denote the energy of secondary structure T by E(T). Fix initial and final pseudoknot-free secondary structures I and F. A direct pseudoknot-free folding pathway from I to F is a sequence of pseudoknot-free secondary structures I = T0, T1, . . . , Tr = F, where each Ti is obtained from Ti−1 by either the addition of one arc from F − I or the removal of one arc from I − F. Thus, there are exactly |I △ F| (the size of the symmetric difference of the two structures) steps along a direct folding pathway. We call each such addition or removal an arc operation and we let +x and −x denote the addition and removal of the arc x, respectively. The Ti's which are neither the initial nor the final structure are called intermediate structures. A folding pathway can thus be specified by its corresponding sequence of arc operations; we call this a transformation sequence.
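These definitions can be made concrete with a short sketch. The arc sets below are toy examples of my own (not from the text); the code checks that a transformation sequence is direct and keeps every intermediate structure pseudoknot-free, and it reports the largest energy difference reached along the way—the energy barrier under the simple energy model:

```python
def crossing(a, b):
    """Arcs (i, j) and (k, l) cross iff i < k < j < l (in some order)."""
    (i, j), (k, l) = sorted((a, b))
    return i < k < j < l

def pseudoknot_free(struct):
    arcs = list(struct)
    return not any(crossing(arcs[x], arcs[y])
                   for x in range(len(arcs)) for y in range(x + 1, len(arcs)))

def barrier(I, F, ops):
    """Apply a transformation sequence of arc operations ('+', arc) or
    ('-', arc); return max E(T_i) - E(I), where E(T) = -|T|."""
    T, peak = set(I), 0
    for sign, arc in ops:
        # Directness: only add arcs of F - I, only remove arcs of I - F.
        assert (sign == '+' and arc in F - I) or (sign == '-' and arc in I - F)
        T.add(arc) if sign == '+' else T.remove(arc)
        assert pseudoknot_free(T)
        peak = max(peak, len(I) - len(T))   # E(T) - E(I) = |I| - |T|
    assert T == F                           # all |I symdiff F| operations done
    return peak

# Toy instance: arc (2, 8) of I crosses arc (5, 11) of F, so it must be
# removed first; the pathway passes through a structure one arc smaller.
I = {(1, 12), (2, 8)}
F = {(1, 12), (5, 11)}
assert barrier(I, F, [('-', (2, 8)), ('+', (5, 11))]) == 1
```

Reversing the two operations would trip the pseudoknot-freeness assertion, since (2, 8) and (5, 11) cross.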
A direct pseudoknot-free transformation sequence specifies a folding pathway which is both direct and pseudoknot-free. The energy barrier of a folding pathway I = T0, T1, . . . , Tr = F is the maximum of E(Ti) − E(I), where the maximum is taken over all integers i in the range 1 ≤ i ≤ r. The energy difference of each intermediate configuration Ti is defined as E(Ti) − E(I). For instance, the entire folding pathway illustrated in Figure 1.2 has an energy barrier of 2, whereas the structure labeled B has an energy difference of 1. If Π is the transformation sequence for this pathway, then the energy barrier of transformation sequence Π, denoted ∆E(I, F, Π), is defined to be the energy barrier of the corresponding folding pathway. In our result, it is convenient to work with weighted arcs. To motivate why, note that the union I ∪ F of two pseudoknot-free secondary structures may be pseudoknotted, i.e., may have crossing arcs, even when both I and F are pseudoknot-free. In a pseudoknotted structure, we use the term band to refer to a set of nested arcs, each of which crosses the same set of arcs. In a folding pathway from I to F which minimizes the energy barrier, we can assume without loss of generality that when one arc in a band of I ∪ F is added, then all arcs in the band are added consecutively. Similarly, we can assume without loss of generality that when one arc in a band is removed, then all arcs in the band are removed consecutively. Thus, it is natural to represent the set of arcs in a band as one arc with a weight equal to the number of arcs in the band. An example showing two bands represented by weighted arcs is given in Figure 2.1. Hence we generalize the notion of secondary structure as follows. A weighted arc I = (I^b, I^e) with weight I^w is specified by start and end indices I^b < I^e and a weight I^w. We say that two weighted arcs I and J are crossing if either I^b ≤ J^b ≤ I^e ≤ J^e, or J^b ≤ I^b ≤ J^e ≤ I^e. A configuration is a set of weighted arcs.
Configuration {I_i}_{i=1}^{n} is pseudoknot-free if for all 1 ≤ i < j ≤ n, I_i and I_j are not crossing. The energy of configuration I = {I_i}_{i=1}^{n} is E(I) = −∑_{i=1}^{n} I_i^w. The previous definitions can easily be generalized to weighted arcs.

Figure 2.1: The three arcs on the bottom all conflict with the same two arcs on the top, and vice versa. Thus, each forms a band of arcs. Each band is collapsed into a single arc with weight equal to the size of the band.

We can now formally define the main problem studied in this chapter.

Problem 1. eb-dpfp (Energy Barrier for Direct Pseudoknot-free Folding Pathway of a single strand)
Instance: Given two pseudoknot-free configurations I = {I_i}_{i=1}^{n} (initial) and F = {F_i}_{i=1}^{m} (final) of a single strand, and an integer k.
Question: Is there a direct pseudoknot-free transformation sequence S such that the energy barrier of S, in the simple energy model, is at most k?

The reduction in our result begins with an arbitrary instance of the 3-partition problem.

Problem 2. 3-partition
Instance: Given 3n integers a_1, . . . , a_{3n} such that ∑_{i=1}^{3n} a_i = nA and A/4 < a_i < A/2 for each i.
Question: Is there a partition of the integers {1, . . . , 3n} into n disjoint triples G_1, G_2, . . . , G_n such that the sum of all a_j, where j belongs to G_i, is equal to A, i.e., c(G_i) = ∑_{j∈G_i} a_j = A, for each i = 1, . . . , n?

Note the use of the notation c(G_i). We will use this throughout to simplify our language. Importantly, we note the following result.

Theorem 2 (Garey, Johnson (1979) [41]). The 3-partition problem is NP-complete even if A is polynomial in n.

The choice of the 3-partition problem was not arbitrary. It is known as a strongly NP-complete problem, and therefore remains hard even when A is polynomial in n. This is important for our reduction, as the number of arcs we create in the corresponding folding pathway instance will be proportional to A (in unary).
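For intuition, 3-partition is easy to state and to solve by exhaustive search. The sketch below is illustrative—the encoding is mine—and it is run on the instance {{10, 9, 8, 7, 7, 7}} with n = 2 and A = 24, which reappears as the running example later in this chapter:

```python
from itertools import combinations

def three_partition(a, A):
    """Brute-force 3-partition: split the indices of a into triples, each
    summing to A. Returns the triples, or None. Exponential time."""
    def solve(remaining):
        if not remaining:
            return []
        first = remaining[0]
        for pair in combinations(remaining[1:], 2):
            triple = (first,) + pair
            if sum(a[i] for i in triple) == A:
                rest = [i for i in remaining if i not in triple]
                sub = solve(rest)
                if sub is not None:
                    return [triple] + sub
        return None
    return solve(list(range(len(a))))

groups = three_partition([10, 9, 8, 7, 7, 7], 24)
assert groups == [(0, 3, 4), (1, 2, 5)]   # {10, 7, 7} and {9, 8, 7}
```

A brute-force search like this takes exponential time; the point of strong NP-completeness is that the problem stays hard even when A is polynomial in n, which is exactly what lets the reduction build only polynomially many arcs.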
If A were exponential in n, then the number of arcs would be exponential in n, and the reduction from the 3-partition problem instance would take exponential time rather than the requisite polynomial time. For this very reason, we did not make use of the weakly NP-complete partition problem—partition a set of integers into two sets with the same sum—as that problem is solvable in pseudo-polynomial time when the sum of all integers, and thus A, is polynomial in n. Note that the 3-partition problem is in P if A is a constant; thus, we will ensure that A is polynomial in n in our reduction. For more information regarding this important distinction between weakly and strongly NP-complete problems, the reader is directed to Garey & Johnson [41].

2.2 Result

Theorem 3. The eb-dpfp problem, namely the energy barrier for direct pseudoknot-free folding pathway problem, is NP-complete.

We note that the theorem does not require the energies of the initial and final structures to be minimum, and indeed they can be different, as illustrated in Figures 2.1 and 2.3.

Proof. It is straightforward to show that the eb-dpfp problem is in NP. Given an instance (I, F, k), it is sufficient to non-deterministically guess a direct folding pathway from I to F, and to verify that the energy barrier of this path is at most k. Note that the length of any such pathway is at most |I| + |F|. To show that the eb-dpfp problem is NP-hard, we provide a reduction from the 3-partition problem. We first provide a formal description of the reduction, then provide some intuition as to why the reduction is correct, and then prove correctness in detail. Consider an instance of the 3-partition problem A/2 > a_1 ≥ · · · ≥ a_{3n} > A/4 such that ∑_{j=1}^{3n} a_j = nA and the value of A is bounded above by some polynomial in n. We define an instance (I, F, k) of the eb-dpfp problem as follows. The initial configuration I contains the weighted arcs {Ā_{j,i} ; j = 1, . . . , 3n, i = 1, . . .
, n} ∪ {Ã_{j,i} ; j = 1, . . . , 3n, i = 1, . . . , n} ∪ {T̃_i ; i = 1, . . . , n}. The final configuration F is {A_{j,i} ; j = 1, . . . , 3n, i = 1, . . . , n} ∪ {T_i ; i = 1, . . . , n}. The arcs are organized as in Figure 2.2. Intuitively, the various T sets of weighted arcs are associated with the n triples, while the various A sets are associated with the 3n integers of the input. For each set of the weighted arcs corresponding to triples, there are weighted arcs corresponding to all 3n integers of the input. The reason for this is that a triple “chooses” its corresponding entries by adding arcs denoting the value of the entry, prior to a “validation” stage that occurs later in the folding pathway.

Figure 2.2: Organization of weighted arcs in the initial (top) and the final (bottom) configurations.

Formally, the arcs are organized as follows:

T_1^b < T̃_1^b < Ã_{3n,1}^b < · · · < Ã_{1,1}^b < T_1^e < Ā_{1,1}^b,
T_i^b < T̃_{i−1}^e < T̃_i^b < Ã_{3n,i}^b < · · · < Ã_{1,i}^b < T_i^e < Ā_{1,i}^b,  ∀i = 2, . . . , n,
Ā_{j,i}^b < A_{j,i}^b < Ā_{j,i}^e < Ã_{j,i}^e < A_{j,i}^e,  ∀i = 1, . . . , n, ∀j = 1, . . . , 3n,
A_{j,i}^e < Ā_{j+1,i}^b,  ∀i = 1, . . . , n, ∀j = 1, . . . , 3n − 1,
A_{3n,i}^e < T_{i+1}^b,  ∀i = 1, . . . , n − 1,
A_{3n,n}^e < T̃_n^e.

The weights of the arcs are set up as follows. For all i = 1, . . . , n and j = 1, . . . , 3n:

Ã_{j,i}^w = 4i·a_j,
Ā_{j,i}^w = k − (j − 1)A − 4i·a_j,
A_{j,i}^w = k − jA.

Also,

T̃_1^w = k − (7n − 4)A,
T̃_i^w = k − (6n + 8)nA − 4(n − 1)iA,  ∀i = 2, . . . , n,
T_i^w = k − (6n + 8)nA,  ∀i = 1, . . . , n − 1,
T_n^w = k,

where k > 4(5n^2 + n + 1)A is the energy barrier. Before getting into the details of the proof, we next describe intuitively the key properties of the construction. The weights are chosen to ensure that the folding pathway with minimum energy barrier has the following properties. Here
we list only the arcs that are added, and assume without loss of generality that all arc removals happen only when needed.

1. Initially a (possibly empty) sequence of A_{j,i}'s is added to the folding pathway. Intuitively, this corresponds to “triple choosing” for the initial set of integers. The added A_{j,i}'s define a potential solution G_1, G_2, . . . , G_n to the 3-partition problem in a natural way: G_i contains j if A_{j,i} is in this initial sequence. As we will prove later, the weights ensure that the addition of each A_{j,i} raises the energy difference. After 3n such additions, the energy difference is so high that no other A_{j,i}'s can be added. As a result, the weights impose certain desirable constraints on the G_i's which will help ensure that they (or a slight perturbation of them) form a valid solution.

2. Following the initially-added sequence of A_{j,i}'s, the T_i's must be added in increasing order of i (with no interspersed A_{j,i}'s). This is in part because of the placement of the T̃_i's: adding T_1 requires only the removal of T̃_1, whereas adding T_i, for i > 1, requires the costlier removal of both T̃_{i−1} and T̃_i. Thus, it becomes feasible to add T_i without exceeding the energy barrier only after T_{i−1} is added because, at that point, T̃_{i−1} has already been removed. In addition, after adding T_1, the energy difference increases to a level at which none of the A_{j,i}'s can be added (and stays there until the addition of T_n).

3. Moreover, the T_i's can be added without exceeding the energy barrier only if the G_i's defined by the initial sequence of A_{j,i}'s actually form a valid solution. Intuitively, this is a “triple validation” to ensure all chosen triples form a valid solution. That is, if the G_i's are valid then for each i, at least three of the A_{j,i}'s are in the initial sequence, and so at least three of the Ã_{j,i}'s (whose weights sum to at least 4iA) were removed in the initial part of the pathway described in property 1 above.
This means that at most 3n − 3 of the Ã_{j,i}'s remain to be removed before T_i can be added. The total weight of the remaining Ã_{j,i}'s is just low enough to ensure that they can be removed without exceeding the energy barrier. In contrast, if the G_i's are not valid then for some i the weight of the Ã_{j,i}'s which must be removed in order to add T_i causes the energy barrier k to be exceeded.

Let us illustrate the construction of the proof with the following example. Assume that we want to partition the multiset of integers {{10, 9, 8, 7, 7, 7}} into two sets (n = 2). Figure 2.3a shows the corresponding instance of the energy barrier problem.

Figure 2.3: Illustration of the construction in the proof of Theorem 3: (a) The instance created for the set of integers {{10, 9, 8, 7, 7, 7}}. (b) The energy function stays within barrier k if and only if the partition sets are selected correctly (T1 = {{10, 7, 7}} and T2 = {{9, 8, 7}}). (c) The energy function exceeds the barrier for an incorrect selection of partition sets (T1 = {{10, 9, 8, 7, 7}} and T2 = {{7}}). The dashed lines depict hypothetical progress of the pathway for some energy barrier larger than k.

For each triple, there are weighted arcs denoting all integers in the input set. We have labeled the associated weighted arcs by their integer value from the input set, and have coloured those associated with triple 1 black and those associated with triple 2 white. Figure 2.3b shows a correct pathway, which selected two triples, T1 = {{10, 7, 7}} (corresponding to the black labels along the pathway during triple-choosing) and T2 = {{9, 8, 7}} (corresponding to the white labels along the pathway during triple-choosing), both of which
By the construction of the proof, the portion of the folding pathway corresponding to the triple-validation stage is able to proceed within barrier $k$. However, in the incorrect pathway shown in Figure 2.3c, where the selection does not result in two equal-sum triples (i.e., $T_1 = \{\{10, 9, 8, 7, 7\}\}$ and $T_2 = \{\{7\}\}$), the triple-validation stage fails, forcing the barrier above $k$.

The remainder of this chapter formally proves that the eb-dpfp instance has a solution with energy barrier at most $k$ if and only if the 3-partition instance $a_1, \ldots, a_{3n}$ has a solution. In showing this result, we must demonstrate a number of properties that the construction enforces, such as ensuring that elements of the input set are selected exactly once. First, assume that the 3-partition instance has a solution $G_1, \ldots, G_n$, where $G_i = \{j_{i,1}, j_{i,2}, j_{i,3}\}$. Let $f(j) = i$ if $j \in G_i$, for every $j = 1, \ldots, 3n$. We will show that the transformation sequence

$-\bar A_{1,f(1)}, -\tilde A_{1,f(1)}, +A_{1,f(1)}, \;\ldots,\; -\bar A_{3n,f(3n)}, -\tilde A_{3n,f(3n)}, +A_{3n,f(3n)},$ (2.1)

$-\tilde A_{1,1}, \ldots, -\tilde A_{3n,1}$ (without $-\tilde A_{j_{1,1},1}, -\tilde A_{j_{1,2},1}, -\tilde A_{j_{1,3},1}$), $-\tilde T_1, +T_1, \;\ldots,\; -\tilde A_{1,n}, \ldots, -\tilde A_{3n,n}$ (without $-\tilde A_{j_{n,1},n}, -\tilde A_{j_{n,2},n}, -\tilde A_{j_{n,3},n}$), $-\tilde T_n, +T_n,$ (2.2)

$-\bar A_{1,1}, +A_{1,1}, -\bar A_{1,2}, +A_{1,2}, \ldots, -\bar A_{3n,n}, +A_{3n,n}$ (without indexes $1,f(1);\; 2,f(2);\; \ldots;\; 3n,f(3n)$) (2.3)

is pseudoknot-free with energy barrier exactly $k$. For clarity, the $-$ sign marks the arcs from the initial configuration which are being removed and the $+$ sign marks the arcs from the final configuration which are being added. It is easy to see that the sequence is pseudoknot-free, since

• each $A_{j,i}$ only crosses $\tilde A_{j,i}$ and $\bar A_{j,i}$ in the initial configuration, and it is added only when these two arcs are already removed; and

• each $T_i$ only crosses the following arcs in the initial configuration: $\tilde T_{i-1}$ (if $i > 1$), $\tilde T_i$ and $\tilde A_{1,i}, \ldots, \tilde A_{3n,i}$, and they are all removed before $T_i$ is added.
Second, let us verify that the energy difference of each intermediate configuration is at most $k$. Figure 2.4 summarizes the sequence of energy differences along the pathway given in lines (2.1), (2.2) and (2.3) above; we next provide the details. First, in line (2.1), by induction, for each $j = 1, \ldots, 3n$: before removing $-\bar A_{j,f(j)}, -\tilde A_{j,f(j)}$ the energy difference is $(j-1)A$, and after their removal it is $k$. Then, after adding $+A_{j,f(j)}$, it decreases to $jA$. At the end of line (2.1), the energy difference is $3nA$.

Figure 2.4: Illustration of the sequence of energy difference changes on the folding pathway described in lines (2.1), (2.2) and (2.3). Details are discussed in the text of the chapter.

Next, we need to check that the sum of weights of the arcs $-\tilde A_{1,1}, \ldots, -\tilde A_{3n,1}, -\tilde T_1$ (without $-\tilde A_{j_{1,1},1}, -\tilde A_{j_{1,2},1}, -\tilde A_{j_{1,3},1}$) is at most $k - 3nA$. The sum of weights of these arcs is exactly

$\sum_{j=1}^{3n} \tilde A^w_{j,1} - \sum_{\ell=1}^{3} \tilde A^w_{j_{1,\ell},1} + \tilde T^w_1 = \sum_{j=1}^{3n} 4a_j - \sum_{\ell=1}^{3} 4a_{j_{1,\ell}} + k - 7nA + 4A = 4nA - 4A + k - 7nA + 4A = k - 3nA.$

Thus, just before adding $+T_1$, the energy difference is again exactly $k$, and after adding $+T_1$ it is $6n^2A + 8nA$. Similar calculations show that the energy difference will alternate between $k$, after each removal subsequence, and $6n^2A + 8nA$, after each addition of $+T_i$, in line (2.2), with the exception of the last addition, after which the energy difference is 0. In line (2.3), all remaining arcs from the initial configuration ($-\bar A_{j,i}$) are removed and all remaining arcs from the final configuration ($+A_{j,i}$) are added; this is the clean-up phase. Note that each removal is possible since $\bar A^w_{j,i} < k$, and after processing each pair $-\bar A_{j,i}, +A_{j,i}$ the energy difference only decreases, since $\bar A^w_{j,i} - A^w_{j,i} = A - 4ia_j < 0$.

Now, assume that there is a pseudoknot-free transformation sequence $S$ with energy barrier at most $k$.
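The energy-difference bookkeeping used throughout this argument is simple: each removal raises the difference by the removed arc's weight, and each addition lowers it by the added arc's weight. The sketch below is our own illustration, not code from the thesis:

```python
def pathway_barrier(moves):
    """Barrier of a weighted transformation sequence in the simple
    energy model.  Each move is (op, weight): op '-' removes an arc of
    the initial configuration (raising the energy difference relative
    to that configuration), op '+' adds an arc of the final
    configuration (lowering it)."""
    diff = 0      # current energy difference
    barrier = 0   # largest difference seen so far
    for op, w in moves:
        diff += w if op == '-' else -w
        barrier = max(barrier, diff)
    return barrier

# A toy sequence: remove two initial arcs, add two final arcs.
toy = [('-', 2), ('+', 1), ('-', 1), ('+', 2)]   # barrier 2
```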
From $S$, we will construct a solution for the original 3-partition instance and show that it is a valid solution. We organize our proof into three parts, in line with the three properties described in the intuition at the start of the proof. Consider the subsequence of $S$ containing only additions, i.e., arcs from the final configuration. Let $S^+$ denote this subsequence. We assume without loss of generality that all removals in $S$ happen only when needed, i.e., the next addition would not be possible without those removals. Hence, the subsequence $S^+$ determines the whole sequence $S$. By processing an arc $+I$ in $S^+$ we mean (i) the removal of all arcs $-J$ in $S$ immediately preceding $+I$ (that is, $-J$ does not precede any other arc of $S^+$), and (ii) the addition of $+I$.

The first part of our proof considers the prefix of $S^+$ just before the first arc $+T_i$ is added. Let this prefix be

$+A_{j_1,i_1}, +A_{j_2,i_2}, \ldots, +A_{j_M,i_M}$ (2.4)

where $M$ is the number of $+A_{j,i}$'s added before the first $+T_i$. We use this prefix to define a potential solution to the 3-partition problem: $G_i = \{j_\ell \;;\; i_\ell = i\}$, for every $i = 1, \ldots, n$. Ultimately we will show that the $G_i$'s (or a slight perturbation of the $G_i$'s) form a solution to the 3-partition problem. Towards this goal, our first two lemmas below prove some useful properties of the $G_i$'s that can be inferred from the weights of the arcs in the folding pathway prefix (2.4) and the corresponding removed arcs. Let $|G_i|_j$ denote the number of elements in $G_i$ with value at most $j$. In order for the $G_i$'s to be a valid solution, $\sum_{i=1}^{n} |G_i|_j$ should be exactly $j$ for all $j$, $1 \le j \le 3n$. Intuitively, this condition ensures that elements are selected exactly once. Moreover, it should be the case that $\sum_{i=1}^{n} c(G_i) = nA$, where $c(G_i)$ denotes the sum of $a_j$ for $j \in G_i$ (see the definition of 3-partition). The statements of the two lemmas below assert somewhat weaker properties of the $G_i$'s.

Lemma 3. For every $j = 1, \ldots, 3n$, $\sum_{i=1}^{n} |G_i|_j \le j$. Consequently, $M \le 3n$.
Proof. Let $+T_\ell$ be the first $+T_i$ in $S^+$. Consider an $+A_{j,i}$ appearing before $+T_\ell$. Recall that before adding $+A_{j,i}$, we need to remove both $-\tilde A_{j,i}$ and $-\bar A_{j,i}$. Since $\tilde A^w_{j,i} + \bar A^w_{j,i} = k - (j-1)A$, the energy difference has to be at most $(j-1)A$ for $+A_{j,i}$ to be added. Note that processing each $+A_{j,i}$ appearing in $S^+$ before $+T_\ell$ will increase the energy difference by $A$, as it requires both $-\tilde A_{j,i}$ and $-\bar A_{j,i}$ to be removed first, and $\bar A^w_{j,i} + \tilde A^w_{j,i} - A^w_{j,i} = k - (j-1)A - 4ia_j + 4ia_j - (k - jA) = A$. For instance, an $+A_{1,i}$ can only appear at the first position of the part of the subsequence $S^+$ before $+T_\ell$, since it requires the energy difference to be at most 0, and after any $+A_{j,i}$ is added the energy difference increases to at least $A$. Thus, starting from the second position, no $+A_{1,i}$ can be added before $+T_\ell$. Similarly, $+A_{j,i}$ can appear only in the first $j$ positions of the subsequence of $S^+$ before $+T_\ell$. Due to this condition imposed by the construction, the lemma easily follows.

In the next lemma we use double brackets to denote multisets: for example, $\{\{1, 2, 2\}\}$ is the multiset with elements 1, 2, and 2, and $\{\{1, 1, 2\}\} \ne \{\{1, 2, 2\}\}$.

Lemma 4. $\sum_{i=1}^{n} c(G_i) \le nA - (3n - M)A/4$, where equality happens only if $M = 3n$ and $\{\{a_{j_1}, \ldots, a_{j_M}\}\} = \{\{a_1, \ldots, a_{3n}\}\}$.

Proof. Let $b_1 \ge b_2 \ge \cdots \ge b_M$ be the sorted elements of the multiset $\{\{a_{j_1}, \ldots, a_{j_M}\}\}$. Note that $\sum_{i=1}^{n} c(G_i) = \sum_{j=1}^{M} b_j$. We will show that $b_j \le a_j$ for every $j = 1, \ldots, M$. Suppose to the contrary that $b_j > a_j$ for some $j$. Then the elements $b_1, \ldots, b_j$ belong to $\{\{a_1, \ldots, a_{j-1}\}\}$, i.e., $|\{\{a_{j_1}, \ldots, a_{j_M}\}\}|_{j-1} \ge j$, a contradiction with Lemma 3. Hence, we have

$\sum_{i=1}^{n} c(G_i) = \sum_{j=1}^{M} b_j \le \sum_{j=1}^{M} a_j = nA - \sum_{j=M+1}^{3n} a_j \le nA - (3n - M)A/4.$

Equality happens only if $M = 3n$ (since $a_j > A/4$) and $b_j = a_j$ for every $j = 1, \ldots, 3n$.
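The invariant at the heart of Lemma 3, that processing any $+A_{j,i}$ raises the energy difference by exactly $A$, can be checked numerically from the weight formulas quoted in the proofs ($\tilde A^w_{j,i} = 4ia_j$, $\bar A^w_{j,i} = k - (j-1)A - 4ia_j$, $A^w_{j,i} = k - jA$). This sketch is ours; the concrete value of $k$ is an arbitrary placeholder above the threshold used later in Lemma 5:

```python
# Check on the running example that the net effect of removing
# A-bar_{j,i} and A-tilde_{j,i} and then adding A_{j,i} is exactly +A,
# independent of j, i and k.
a = [10, 9, 8, 7, 7, 7]               # the 3-partition multiset (1-indexed below)
n = len(a) // 3
A = sum(a) // n                       # A = 24
k = 4 * (5 * n * n + n + 1) * A + 1   # any k above the Lemma 5 threshold

def w_tilde(j, i): return 4 * i * a[j - 1]                 # weight of A-tilde_{j,i}
def w_bar(j, i):   return k - (j - 1) * A - w_tilde(j, i)  # weight of A-bar_{j,i}
def w_plain(j):    return k - j * A                        # weight of A_{j,i}

for j in range(1, 3 * n + 1):
    for i in range(1, n + 1):
        # removals raise the difference, the addition lowers it
        assert w_bar(j, i) + w_tilde(j, i) - w_plain(j) == A
```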
We now turn to the second part of our proof: we show that, following the initially-added sequence of $+A_{j,i}$'s, the $T_i$'s must be added in increasing order of $i$. That is, the arcs $+T_1, \ldots, +T_n$ appear in the subsequence $S^+$ consecutively (with no $+A_{j,i}$ in between) and in this order. The next lemma shows that the first $+T_i$ in the sequence $S^+$ must be $+T_1$, and the following lemma reasons about the rest of the sequence of $+T_i$'s.

Lemma 5. The first $+T_i$ in $S^+$ is $+T_1$.

Proof. Let $+T_\ell$ be the first $+T_i$ in $S^+$. As argued in the proof of Lemma 3, after each $+A_{j,i}$ the energy difference increases by $A$. Hence, before adding $+T_\ell$, the energy difference is non-negative. Second, if $\ell > 1$ then, to add $+T_\ell$, both $-\tilde T_{\ell-1}$ and $-\tilde T_\ell$ have to be removed. After their removal the energy difference would be at least $2k - 2(6n+8)nA - 4(n-1)(2\ell-1)A > k$, a contradiction. The last inequality follows from $k > 4(5n^2 + n + 1)A = 2(6n+8)nA + 4(n-1)(2n-1)A$.

Hence, by the above lemma, the subsequence $S^+$ has the following form:

$+A_{j_1,i_1}, +A_{j_2,i_2}, \ldots, +A_{j_M,i_M}, +T_1$

followed by the addition of all remaining $+A_{j,i}$'s and $+T_i$'s. The following lemma gives more detailed insight into the order of arcs in $S^+$. In the remaining lemmas we adopt the Iverson bracket notation introduced by Graham, Knuth and Patashnik [44]: $[i > j]$ equals 1 if $i > j$ and 0 otherwise; similarly, $[i = j]$ equals 1 if $i = j$ and 0 otherwise.

Lemma 6. All $T_i$'s appear in $S^+$ consecutively and in increasing order.

Proof. Assume to the contrary that the subsequence $+T_1, +T_2, \ldots, +T_p$ is followed by an arc $+I$ different from $+T_{p+1}$ in $S^+$, where $p < n$. This arc could be either an $+A_{j,i}$ or a $+T_\ell$, where $\ell > p+1$. We will show that both cases lead to a contradiction by lower-bounding the energy difference of the intermediate configuration after adding $+T_p$. As argued in the proof of Lemma 3, processing each $+A_{j_m,i_m}$ will contribute $A$ to the energy difference. Hence, before adding $+T_1$, the energy difference is non-negative.
We will lower-bound the contributions of processing $+T_1, \ldots, +T_p$ to the energy difference. For every $i = 1, \ldots, p$, to process $+T_i$ we need to remove $-\tilde T_i$ and all $-\tilde A_{j,i}$ which were not yet removed. This will add to the energy difference

$\tilde T^w_i + \sum_{j \notin G_i} \tilde A^w_{j,i} \;\ge\; k - 3nA - [i>1](6n+5)nA - 4(n-1)iA + 4i\sum_{j=1}^{3n} a_j - 4|G_i|\,iA/2 \;>\; k - 3nA - [i>1](6n+5)nA - 2|G_i|nA,$

since only $|G_i|$ arcs $-\tilde A_{j,i}$ have been removed before processing $+T_i$ and each $\tilde A^w_{j,i} = 4ia_j < 2nA$. Hence, the contribution of processing $+T_1$ is at least $k - 3nA - 2|G_1|nA - T^w_1 = (6n+5)nA - 2|G_1|nA$, and the contribution of processing $+T_i$, for $i = 2, \ldots, p$, is at least $-2|G_i|nA$. Since $\sum_{i=1}^{p} |G_i| \le M$ and, by Lemma 3, $M \le 3n$, the total contribution of adding $T_1, \ldots, T_p$ is at least $6n^2A + 5nA - 6n \cdot nA = 5nA$. Hence, the energy difference of the intermediate configuration before processing $+I$ is at least $5nA$.

Now, let us consider two cases depending on the type of arc $+I$. First, assume that $+I$ is a $+T_\ell$, for some $\ell > p+1$. Since $+T_{\ell-1}$ appears in $S^+$ after $+T_\ell$, to add $+T_\ell$ we need to remove both $-\tilde T_{\ell-1}$ and $-\tilde T_\ell$. Since the energy difference before removing $-\tilde T_{\ell-1}$ and $-\tilde T_\ell$ is positive (at least $5nA$), the lemma follows by the argument used in the proof of Lemma 5. Second, assume that $+I$ is an $+A_{j,i}$. Before adding $+A_{j,i}$, the arc $-\bar A_{j,i}$ needs to be removed. Since $\bar A^w_{j,i} = k - (j-1)A - 4ia_j > k - (3n-1)A - 2nA > k - 5nA$, the energy difference after removing $-\bar A_{j,i}$ would be greater than $5nA + k - 5nA = k$, a contradiction.

Hence, by the above lemmas, the subsequence $S^+$ has the following form:

$+A_{j_1,i_1}, +A_{j_2,i_2}, \ldots, +A_{j_M,i_M}, +T_1, +T_2, \ldots, +T_n$

followed by all the remaining $+A_{j,i}$'s.
Moving on to the last part of the proof: we show that the $G_i$'s defined by the initial sequence of $+A_{j,i}$'s form a valid solution (or can be perturbed slightly to form a valid solution) by arguing that only in this case can all of the $T_i$'s be added without exceeding the energy barrier. Specifically, we will show that $M = 3n$ and $\{\{a_{j_1}, \ldots, a_{j_{3n}}\}\} = \{\{a_1, \ldots, a_{3n}\}\}$. For this purpose, the next two lemmas prove lower bounds on sums of the $c(G_i)$'s.

Lemma 7. For every $\ell = 1, \ldots, n$, $\sum_{i=1}^{\ell} i(c(G_i) - A) \ge (M - 3n)A/4$.

Proof. To process $+T_\ell$, the arc $-\tilde T_\ell$ and all remaining $-\tilde A_{1,\ell}, \ldots, -\tilde A_{3n,\ell}$ need to be removed. Specifically, this corresponds to all $-\tilde A_{j,\ell}$'s for which $j \notin G_\ell$. Hence, the total weight of the arcs which need to be removed is

$\tilde T^w_\ell + \sum_{j \notin G_\ell} \tilde A^w_{j,\ell} = k - 3nA - [\ell>1](6n+5)nA - 4(n-1)\ell A + 4\ell(nA - c(G_\ell)) = k - 3nA - [\ell>1](6n+5)nA + 4\ell(A - c(G_\ell)).$

After removing these arcs, the energy difference will increase by this amount and then decrease by $T^w_\ell = k - (6n+8)nA$. Hence, the total change of the energy difference from adding $+T_\ell$ is $[\ell=1](6n+5)nA + 4\ell(A - c(G_\ell))$. It is easy to see, by induction on $\ell$, that the energy difference before removing the arcs for $+T_\ell$ is $MA + [\ell>1](6n+5)nA + \sum_{i=1}^{\ell-1} 4i(A - c(G_i))$, since after processing the subsequence $+A_{j_1,i_1}, \ldots, +A_{j_M,i_M}$ the energy difference is $MA$. Since the energy difference after removing the necessary arcs before adding $+T_\ell$ must be at most $k$, we have

$MA + [\ell>1](6n+5)nA + \sum_{i=1}^{\ell-1} 4i(A - c(G_i)) + k - 3nA - [\ell>1](6n+5)nA + 4\ell(A - c(G_\ell)) \le k,$

which simplifies to $\sum_{i=1}^{\ell} i(c(G_i) - A) \ge (M - 3n)A/4$.

Using the inequalities from Lemma 7, we will lower-bound the sum of the $c(G_i)$'s.

Lemma 8. We have $\sum_{i=1}^{n} c(G_i) \ge nA - (3n - M)A/4$, where equality happens only if $c(G_1) = A - (3n - M)A/4$ and $c(G_i) = A$, for every $i = 2, \ldots, n$.

Proof.
We will multiply each inequality of Lemma 7 by the positive constant $1/\ell - [n>\ell]/(\ell+1)$ and sum the inequalities:

$\sum_{\ell=1}^{n} \left(\tfrac{1}{\ell} - \tfrac{[n>\ell]}{\ell+1}\right) \sum_{i=1}^{\ell} i(c(G_i) - A) \;\ge\; \sum_{\ell=1}^{n} \left(\tfrac{1}{\ell} - \tfrac{[n>\ell]}{\ell+1}\right) (M - 3n)A/4.$

Changing the order of the sums on the left-hand side and using the fact that $\sum_{\ell=i}^{n} \left(\tfrac{1}{\ell} - \tfrac{[n>\ell]}{\ell+1}\right) = \tfrac{1}{i}$, we obtain

$\sum_{i=1}^{n} (c(G_i) - A) \;=\; \sum_{i=1}^{n} i(c(G_i) - A) \sum_{\ell=i}^{n} \left(\tfrac{1}{\ell} - \tfrac{[n>\ell]}{\ell+1}\right) \;\ge\; (M - 3n)A/4,$

and the lemma easily follows. Equality in the resulting inequality happens only if we have equality in all inequalities used in the summation. This would imply that

$\sum_{i=1}^{\ell} i(c(G_i) - A) = (M - 3n)A/4,$ (2.5)

for all $\ell = 1, \ldots, n$. For $\ell = 1$, we have $c(G_1) - A = (M - 3n)A/4$, i.e., $c(G_1) = A - (3n - M)A/4$. Subtracting Equation (2.5) for $\ell - 1$ from Equation (2.5) for $\ell$, we obtain $\ell(c(G_\ell) - A) = 0$, i.e., $c(G_\ell) = A$.

By Lemmas 4 and 8, we have $\sum_{i=1}^{n} c(G_i) = nA - (3n - M)A/4$, i.e., we have equality in both Lemma 4 and Lemma 8. Thus, by Lemma 4, we have that $M = 3n$ and $\{\{a_{j_1}, \ldots, a_{j_{3n}}\}\} = \{\{a_1, \ldots, a_{3n}\}\}$. However, this does not immediately imply that $G_1, \ldots, G_n$ forms a decomposition of the set $\{1, 2, \ldots, 3n\}$. For instance, if $a_1 = a_2$, the multiset $\{\{j_1, \ldots, j_{3n}\}\}$ could contain zero 1's and two 2's. This is easily resolved with a slight perturbation of the solution: the sets $G_1, \ldots, G_n$ can be mapped to a decomposition of $\{1, 2, \ldots, 3n\}$ by a sequence of replacements of $i$'s with $j$'s, where $a_j = a_{j+1} = \cdots = a_i$. This transformation preserves the correspondence between the solutions of the two problems. Most importantly, by Lemma 8, we have $c(G_1) = A - (3n - M)A/4 = A$ and also $c(G_i) = A$ for all $i = 2, \ldots, n$. Hence, the sets $G_1, \ldots, G_n$ (possibly modified as described above) are a solution to the 3-partition problem.
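The multipliers in this proof work because of the telescoping identity $\sum_{\ell=i}^{n}(1/\ell - [n>\ell]/(\ell+1)) = 1/i$, which can be sanity-checked with exact rational arithmetic (our own quick sketch):

```python
from fractions import Fraction

def coeff(l, n):
    # the multiplier 1/l - [n > l]/(l + 1) applied to the l-th inequality
    c = Fraction(1, l)
    if n > l:
        c -= Fraction(1, l + 1)
    return c

# Verify sum_{l=i}^{n} coeff(l, n) == 1/i for a range of n and i:
# the 1/(l+1) terms cancel the following 1/l terms, leaving 1/i.
for n in range(1, 12):
    for i in range(1, n + 1):
        assert sum(coeff(l, n) for l in range(i, n + 1)) == Fraction(1, i)
```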
The reduction is polynomial, as the sum of the weights of all arcs (which is the total number of arcs in the unweighted instance) is

$\sum_{i=1}^{n} \left(\tilde T^w_i + T^w_i\right) + \sum_{i=1}^{n} \sum_{j=1}^{3n} \left(\tilde A^w_{j,i} + \bar A^w_{j,i} + A^w_{j,i}\right) \;<\; n \cdot 2k + 3n^2 \cdot 2k \;=\; O(n^2 k) \;=\; O(n^4 A),$

and $A$ is assumed to be polynomial in $n$.

The problem of predicting folding pathways for a single strand generalizes to multiple interacting strands in the natural way. In this problem, a configuration generalizes to consider base pairs between strands in addition to those within strands. If there are $n$ bases in all strands of an instance, then the bases can be identified uniquely as numbers in $[1, n]$. Thus, the current definition of a configuration also works for the multiple-strand variation of the problem. We formally define the problem as follows.

Problem 3. eb-dpfp-multi (Energy Barrier for Direct Pseudoknot-free Folding Pathway of Multiple interacting strands)
Instance: Given two pseudoknot-free configurations $I = \{I_i\}_{i=1}^{n}$ (initial) and $F = \{F_i\}_{i=1}^{m}$ (final) of multiple interacting strands, and an integer $k$.
Question: Is there a direct pseudoknot-free transformation sequence $S$ such that the energy barrier of $S$, in the simple energy model, is at most $k$?

Since a direct pathway for multiple interacting strands from an initial configuration $I$ to a final configuration $F$ has length at most $|I| + |F|$, the problem is in NP. By restriction to the single-strand version of the problem, eb-dpfp, we can conclude the following result.

Theorem 4. The eb-dpfp-multi problem, namely the energy barrier for direct pseudoknot-free folding pathway of multiple interacting strands problem, is NP-complete.

2.3 Chapter summary

We have shown that the energy barrier problem for direct pseudoknot-free folding pathways is NP-complete, via a reduction from the 3-partition problem. Thus, unless P = NP, there is no polynomial-time algorithm for calculating the energy barrier of direct folding pathways.
This justifies the use of heuristics for estimating energy barriers [38, 83, 125, 143] and leads to the interesting question of whether or not there is an algorithm that is guaranteed to return the exact energy barrier and which works well on practical instances of the problem (though not in the worst case). This is the focus of the next chapter.

Our proof also sheds light on energy landscapes. Consider an instance $(I, F, k)$ of the eb-dpfp problem which is derived from a "yes" instance of 3-partition according to our construction. There are exponentially many possible prefixes (of the type shown in (2.4)) which could precede the addition of $T_1$, none of which exceeds the energy barrier $k$. Of these, it may be that only one defines a valid solution of $G_i$'s, and only that one can be extended to a pathway with barrier $k$. Thus, if pathways are followed according to a random process, it could take exponential time for the random process to find the pathway with energy barrier $k$.

In this chapter, we do not fully resolve the computational complexity of the general energy barrier problem, in which the pathway need not be direct. Two challenges in understanding the complexity of this problem are repeat arcs (arcs added and removed multiple times in a pathway) and temporary arcs (arcs not specified in the initial or final structure). The following chapter sheds further light on pathways containing repeat arcs.

Chapter 3
Predicting minimum energy barrier folding pathways

In the previous chapter we established that the direct energy barrier folding pathway problem is NP-complete. Still, there is a need for an exact algorithm that performs well in practice. This is exactly the focus of this chapter.
We first generalize the folding pathway problem of a single strand to one defined in terms of bipartite graphs, allowing us to exploit the rich body of knowledge and algorithms in graph theory. Later in the chapter, we discuss results and extensions for the case of multiple interacting strands. While the algorithm we develop has exponential running time in the worst case, it is the first exact algorithm that uses only polynomial space. As we show in an empirical evaluation in Section 3.3, the algorithm performs well in practice. Furthermore, the algorithm is inherently parallel, a property that could be exploited when solving hard instances. In the process of proving the correctness of the algorithm, we resolve the complexity of the direct-with-repeats energy barrier folding pathway problem (see Section 3.4).

3.1 Preliminaries

We find it convenient to model the problem in terms of bipartite graphs and first develop some useful notation. For a pair of pseudoknot-free structures for the same RNA sequence, we define the conflict graph to be a bipartite graph $G[A, B]$ where $A$ is the set of arcs from the first structure, $B$ is the set of arcs from the second structure, and there is an edge in $E(G)$ between $a \in A$ and $b \in B$ if and only if $a$ and $b$ cross. An example is given in Figure 3.1. Throughout, we denote the neighbours of a vertex $v$, or of a set of vertices $X$, in $G$ by $N_G(v)$ and $N_G(X)$, respectively. We denote the subgraph of $G$ induced by subsets $A' \subseteq A$ and $B' \subseteq B$ by $G/[A', B']$. Also, we denote the stability number of $G$, i.e., the size of a maximum independent set in $G$, by $\alpha(G)$. We need a notion analogous to that of a pair of minimum free energy (MFE) structures in the context of bipartite graphs.11 We say that $G$ is pairwise-optimal if $\alpha(G) = |A| = |B|$. If $A$ and $B$ are MFE structures then $G$ must be pairwise-optimal; otherwise the largest independent set in the conflict graph $G$ would be a set of arcs with lower free energy than either $A$ or $B$.

Content from this chapter appears in the proceedings of the 15th Annual Pacific Symposium on Biocomputing (PSB 2010) [129].

11 A minimum free energy (MFE) structure has the lowest energy of any possible structure for a given molecule. In the simple energy model proposed by Morgan and Higgs [83], where each arc contributes −1 to the energy score, a pseudoknot-free MFE structure is one with a maximum number of non-crossing arcs.

Figure 3.1: (left column) An example of an arc diagram representation of an initial and final structure of an RNA folding pathway, and (right column) the corresponding conflict graph. In the conflict graph, there is a node for every arc, and an edge between any pair of arcs that cross.

Let $G[A, B]$ be a pairwise-optimal bipartite graph. A set pathway for $G$ is a sequence of independent sets $S_0, \ldots, S_m$, each of which is a subset of $A \cup B$, such that (i) $S_0 = A$, (ii) $S_m = B$, and (iii) for every $i = 1, \ldots, m$, $|S_{i-1} \,\triangle\, S_i| = 1$ (the size of the symmetric difference is one, i.e., at each step one element is either added or removed). The transformation sequence corresponding to this set pathway is the sequence of singletons $S_0 \,\triangle\, S_1, \ldots, S_{m-1} \,\triangle\, S_m$. If an element appearing in the transformation sequence is not in the current set, the element is to be added; if it is in the current set, it is to be removed. The set pathway is direct if its corresponding transformation sequence has no repeated elements. The barrier of the pathway (or of its corresponding transformation sequence) is $k = \max_i (|A| - |S_i|)$. (Since $A$ is a maximum independent set of $G$, it must be that $|A| - |S_i| \ge 0$ for all $i$, $1 \le i \le m$.) We say that a set pathway is a $(\le k)$-barrier set pathway or a $k$-barrier set pathway if its barrier is $\le k$ or $= k$, respectively.
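These definitions translate directly into code: the crossing test for arcs, the conflict graph as an edge set, and a replay of a transformation sequence that checks independence at every step and reports the barrier $\max_i(|A| - |S_i|)$. This is our own illustrative sketch; the two structures are hypothetical, chosen so that the conflict graph is pairwise-optimal:

```python
def crossing(a, b):
    # Arcs are pairs (i, j) of base indices with i < j.  Two arcs cross
    # iff exactly one endpoint of one lies strictly inside the other.
    (i, j), (k, l) = a, b
    return i < k < j < l or k < i < l < j

def conflict_graph(A, B):
    # Bipartite conflict graph as its edge set: one vertex per arc,
    # one edge per crossing pair (a in A, b in B).
    return {(a, b) for a in A for b in B if crossing(a, b)}

def replay(A, B, edges, seq):
    """Replay a direct transformation sequence: each element of seq is
    toggled once; additions must not conflict with anything present.
    Returns the barrier of the resulting set pathway."""
    S, barrier = set(A), 0
    for x in seq:
        if x in S:
            S.remove(x)                       # removal
        else:
            assert all((x, y) not in edges and (y, x) not in edges
                       for y in S), "added arc conflicts with current set"
            S.add(x)                          # addition
        barrier = max(barrier, len(A) - len(S))
    assert S == set(B), "pathway must end at B"
    return barrier

A = [(1, 6), (2, 5)]      # nested arcs of a hypothetical initial structure
B = [(3, 10), (4, 9)]     # nested arcs of a hypothetical final structure
E = conflict_graph(A, B)  # complete bipartite here: every pair crosses
k = replay(A, B, E, [(1, 6), (2, 5), (3, 10), (4, 9)])
```

Since every arc of $A$ crosses every arc of $B$ in this instance, both initial arcs must be removed before any final arc can be added, so the replay reports a barrier of 2.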
A min-barrier set pathway is a set pathway whose barrier is less than or equal to the barrier of any other set pathway for $G$. Consider the following problem:

Problem 4. eb-dsp (Energy Barrier for Direct Set Pathway)
Instance: Given a pairwise-optimal bipartite graph $G[A, B]$ and an integer $k$.
Question: Is there a direct set pathway with barrier at most $k$ for $G$?

An instance of the eb-dpfp (Energy Barrier for Direct Pseudoknot-free Folding Pathway) problem can be mapped to an instance of the eb-dsp (Energy Barrier for Direct Set Pathway) problem by constructing its conflict graph. We note that the mapping from an instance of eb-dpfp to an instance of eb-dsp is only immediate if both the initial and final secondary structures of the eb-dpfp instance are MFE structures. However, we demonstrate in Section 3.2.5 how this condition can be removed, so that any instance of eb-dpfp can be solved. See Figure 3.2 for an example which relates an instance of each problem. The Direct Set Pathway problem is actually a more general problem, since not every bipartite graph is realizable by a pair of pseudoknot-free structures. We characterize conflict graphs for the RNA direct folding pathway problem more accurately in Section 3.5.

3.2 An algorithm for the set barrier problem

Our algorithm for the Direct Set Barrier Problem uses two key ideas. The first is a splitting strategy: if for some proper non-empty subset $B_1$ of $B$ the induced subgraph $G/[A_1, B_1]$ is pairwise-optimal, where $A_1 = N_G(B_1)$, then we can determine the solution for $G$ by recursively solving the problem on the induced subgraphs $G/[A_1, B_1]$ and $G/[A \setminus A_1, B \setminus B_1]$ and combining their solutions. At some point, a subproblem $G'[A', B']$ cannot be split further, as it contains exactly two maximum independent sets: $A'$ and $B'$. In this case, we say that $G'$ is minimal pairwise-optimal.
Solving a minimal pairwise-optimal subproblem requires our second idea, a cutting strategy for reducing the size of minimal pairwise-optimal problem instances. After reducing the size, we can once again attempt to split the problem, and so on, until either a solution is found or it is determined that one does not exist. In the following sections, we detail these strategies, the overall algorithm, its correctness and complexity, and the empirical performance of its implementation.

3.2.1 Splitting strategy

The hypothesis, in terms of the RNA folding pathway problem, that motivated the splitting strategy is simple: if one could identify an MFE structure $C$ consisting of arcs from both the initial and final structures, $A$ and $B$ respectively, then there always exists an optimal pathway from $A$ to $B$ via $C$. The combination of Lemma 9 and Lemma 10 shows that this hypothesis is correct in terms of the direct set barrier problem; specifically, the resulting solution is (i) a valid set pathway, and (ii) optimal.

Lemma 9. Let $G[A, B]$ be a pairwise-optimal bipartite graph and let $G_1 = G/[A_1, B_1]$ be pairwise-optimal, where $B_1$ is a proper non-empty subset of $B$ and $A_1 = N_G(B_1)$. Let $G_2 = G/[A_2, B_2]$, where $A_2 = A \setminus A_1$ and $B_2 = B \setminus B_1$. Then

1. $G_2$ is pairwise-optimal, and
2. if $T_1$ and $T_2$ are $(\le k)$-barrier transformation sequences for $G_1$ and $G_2$ respectively, then $T_1, T_2$ is a $(\le k)$-barrier transformation sequence for $G$.

Proof. Consider the first claim. Suppose to the contrary that $G_2$ is not pairwise-optimal. Then, in $G_2$, there must exist some maximum independent set $C \subseteq A_2 \cup B_2$ with $|C| > |B_2|$. Since $N_G(B_1) = A_1$, the set $C \cup B_1$ is also an independent set in $G$ of size $|C| + |B_1|$, since $C \cap B_1 = \emptyset$. But $|C| + |B_1| > |B_2| + |B_1| = |B|$, contradicting that $G$ is pairwise-optimal.
Figure 3.2: An example of a 2-barrier direct RNA folding pathway from an initial to a final structure (left column), a corresponding set pathway (right column), and a graph showing the current folding pathway energy and the current barrier set size (center column). The set pathway instance (right) is specified by the conflict graph of the RNA folding pathway instance (left). The current set in the set pathway is denoted by black vertices, while the current secondary structure in the folding pathway is indicated by the set of arcs on top.

Consider the second claim. Compare the set pathways that result from applying $T_1$ to $G_1$ and from applying $T_1$ to $G$. In the former, no set in the pathway contains an element of $A_2$, while in the latter all sets additionally contain $A_2$; otherwise, the set pathways are identical. Since the barrier is relative to the initial set and all sets in the latter pathway additionally contain $A_2$, the barriers of both pathways are identical ($\le k$). However, we must show that no element of $A_2$ conflicts with an element of $V(G_1) = A_1 \cup B_1$. This is true since $N_G(B_1) = A_1$, $A_1 \cap A_2 = \emptyset$, and $A_1$ and $A_2$ lie in the same partition.
Next, consider that the final set after applying $T_1$ to $G$ is $A_2 \cup B_1$. Importantly, note that $|A_2 \cup B_1| = |A|$, since $G_1$ is pairwise-optimal. To prove the claim, it remains to show that the rest of the pathway has barrier at most $k$. If $T_2$ is next applied, the remaining pathway is identical to the result of applying $T_2$ to $G_2$, except that each set additionally contains $B_1$; as above, the barriers are identical ($\le k$). Finally, no element of $B_1$ can conflict with any element of $V(G_2) = A_2 \cup B_2$, since $N_G(B_1) = A_1$, $A_1 \cap A_2 = \emptyset$, and $B_1$ and $B_2$ lie in the same partition. Therefore, $T_1, T_2$ must be a valid $(\le k)$-barrier transformation sequence for $G$.

Lemma 10. Let $G[A, B]$ be a pairwise-optimal bipartite graph and let $G' = G/[A', B']$ be pairwise-optimal, where $B'$ is a non-empty (not necessarily strict) subset of $B$ and $A' = N_G(B')$. If the minimum barrier of any direct transformation sequence for $G'$ is $k$, then the minimum barrier of any direct (possibly with repeats) transformation sequence for $G$ is at least $k$.

Proof. Assume to the contrary that there exists a direct (possibly with repeats) transformation sequence $T$ for $G$ with barrier $k' < k$. Let $X$ be the first set in the pathway specified by $T$ which is missing $k$ more elements from $A'$ than it has gained from $B'$; specifically, $X$ is the first set such that $|A' \setminus X| - |X \cap B'| = k$. Such a set must exist, otherwise $G'$ would have a $(< k)$-barrier direct transformation sequence. We can determine the size of $X$ relative to the initial set $A$, that is $|A \,\triangle\, X|$, as follows. Let $A_1 = (A \setminus X) \cap A'$ and $A_2 = (A \setminus X) \setminus A'$. Let $B_1 = (X \setminus A) \cap B'$ and $B_2 = (X \setminus A) \setminus B'$. Informally, these are all elements removed ($A_1 \cup A_2$) and all elements added ($B_1 \cup B_2$), relative to the initial set $A$, partitioned by their inclusion ($A_1 \cup B_1$) or exclusion ($A_2 \cup B_2$) in $A' \cup B'$. Since $T$ is a $k'$-barrier transformation sequence, we have the following:

$|A_1| + |A_2| - |B_1| - |B_2| \le k'$
$k + |A_2| - |B_2| \le k'$ (by definition of $X$, $|A_1| - |B_1| = k$)
$k + |A_2| - |B_2| < k$ (since $k' < k$)
$|A_2| < |B_2|$ (*)

Case 1. $|B_2| = 0$.
Contradiction with (*). Note that this case shows the claim holds when $G' = G$.

Case 2. $|B_2| > 0$. Consider that $N_G(B_2) \subseteq (A_1 \cup A_2) \subseteq (A' \cup A_2)$, otherwise $X$ is not an independent set. Therefore, $N_G(B' \cup B_2) \subseteq A' \cup A_2$, since $N_G(B') \subseteq A'$, and we have that $B' \cup B_2 \cup (A \setminus (A' \cup A_2))$ is an independent set. Consider the size of this set:

$|B' \cup B_2 \cup (A \setminus (A' \cup A_2))| = |B'| + |B_2| + |A \setminus (A' \cup A_2)|$ (mutually disjoint by definition)
$= |B'| + |B_2| + |A| - |A'| - |A_2|$ (since $A' \cap A_2 = \emptyset$ and $A' \cup A_2 \subseteq A$)
$= |A| + |B_2| - |A_2|$ (since $|A'| = |B'|$, by the assumption that $G$ and $G/[A', B']$ are pairwise-optimal)
$> |A|$ (by (*))

This contradicts that $G$ is pairwise-optimal.

We are now faced with the task of identifying a pairwise-optimal subproblem. Intuitively, in terms of the folding pathway problem, this amounts to identifying a set of arcs from the final structure that can replace an equal number of arcs from the initial structure, resulting in an intermediate structure that is (i) pseudoknot-free, and (ii) MFE. In terms of the set pathway problem, the resulting intermediate set is a maximum independent set; thus, our task is to identify a maximum independent set spanning both partitions. We can leverage Kőnig's theorem to solve this problem in terms of matching.

Theorem 5 (Kőnig [61]). In a bipartite graph, the number of vertices in a maximum independent set equals the number of edges in a minimum edge covering.

Observation 1. If $G[A, B]$ is a pairwise-optimal bipartite graph, then $G$ contains a perfect matching of size $|A|$ ($= |B|$).

Proof. This follows immediately from the definition of pairwise-optimal and Theorem 5.

For the sake of efficiency, it is desirable for a splitting strategy to always identify a minimal pairwise-optimal subproblem: one that cannot be split further. This is the behaviour of the BasicSplit algorithm that we now present.
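The matching-plus-SCC pipeline that BasicSplit follows can be sketched with standard-library tools only, substituting Kuhn's simple augmenting-path matching for Hopcroft-Karp. This is our own illustrative sketch, not the thesis implementation, and the toy instance is made up:

```python
def basic_split(A, B, adj):
    """Sketch of BasicSplit for a pairwise-optimal bipartite graph
    G[A, B].  adj maps each a in A to the set of b in B it crosses.
    Returns (A1, B1), a sink SCC of the precedence digraph."""

    # Step 1: a perfect matching via Kuhn's augmenting paths.
    match = {}                                  # b -> a
    def augment(a, seen):
        for b in adj.get(a, ()):
            if b not in seen:
                seen.add(b)
                if b not in match or augment(match[b], seen):
                    match[b] = a
                    return True
        return False
    for a in A:
        augment(a, set())

    # Step 2: precedence digraph D -- every conflict edge oriented
    # b -> a, every matching edge oriented a -> b.
    succ = {v: [] for v in list(A) + list(B)}
    for a in A:
        for b in adj.get(a, ()):
            succ[b].append(a)
    for b, a in match.items():
        succ[a].append(b)

    # Step 3: Tarjan's SCC algorithm.  The first component completed
    # has no edges into components found later, so it is a sink of
    # the condensation of D.
    index, low, stack, on_stack, sccs = {}, {}, [], set(), []
    def strongconnect(v):
        index[v] = low[v] = len(index)
        stack.append(v)
        on_stack.add(v)
        for w in succ[v]:
            if w not in index:
                strongconnect(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:                  # v is the root of an SCC
            comp = set()
            while True:
                w = stack.pop()
                on_stack.discard(w)
                comp.add(w)
                if w == v:
                    break
            sccs.append(comp)
    for v in succ:
        if v not in index:
            strongconnect(v)

    # Step 4: return the sink component, split back into its two sides.
    sink = sccs[0]
    return sink & set(A), sink & set(B)

# Toy pairwise-optimal instance: a1 crosses b1; a2 crosses b1 and b2.
A1, B1 = basic_split(['a1', 'a2'], ['b1', 'b2'],
                     {'a1': {'b1'}, 'a2': {'b1', 'b2'}})
```

On this instance the matching is $a_1 b_1, a_2 b_2$; the precedence digraph has two strongly connected components, $\{a_1, b_1\}$ and $\{a_2, b_2\}$, and the latter is the sink that is returned.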
Algorithm 1: BasicSplit
input : A non-null pairwise-optimal bipartite graph G[A, B]
output: (A1, B1) where B1 ⊆ B, A1 = NG(B1), and G/[A1, B1] is minimal pairwise-optimal
1 begin
2     M ← MaximumMatching(G);
3     E′ ← {(b, a) | b ∈ B ∧ (b, a) ∈ E} ∪ {(a, b) | a ∈ A ∧ (a, b) ∈ M};
4     (A1, B1), (A2, B2), . . . , (Ap, Bp) ← Tarjan(D = [A, B; E′]);
5     return (A1, B1);

Figure 3.3: An example of the BasicSplit algorithm. For a pairwise-optimal bipartite graph G, a perfect matching is identified (top left), a directed precedence graph D is constructed (top right), strongly connected components in D are identified (bottom left), and one that is a sink in the condensation of D is returned (bottom right).

The algorithm itself is simple, and is neatly summarized in Algorithm 1 and illustrated in Figure 3.3. First, find a maximum matching M in G (using, for instance, the Hopcroft-Karp algorithm [52]). Second, create the precedence graph for G and M. By precedence graph, we mean the directed bipartite graph D[A, B] where E(D) = {(b, a) | b ∈ B ∧ (b, a) ∈ E(G)} ∪ {(a, b) | a ∈ A ∧ (a, b) ∈ M}. Third, find the strongly connected components (SCCs) of D (using, for instance, Tarjan's algorithm [126]). Finally, return an SCC that is a sink in the condensation of D, i.e., the directed acyclic graph in which each SCC is condensed into a single node. Note that Tarjan's algorithm, which we make use of in the BasicSplit algorithm, returns SCCs in depth-first order. Thus, the first SCC that is returned is a sink in the condensation of D. The next lemma summarizes important properties of the BasicSplit algorithm.

Lemma 11. Given a pairwise-optimal bipartite graph G[A, B], the BasicSplit algorithm returns a tuple (A1, B1) such that
1. B1 ⊆ B,
2. A1 = NG(B1), and
3. G/[A1, B1] is minimal pairwise-optimal.

Proof.
Since the precedence graph D contains the same nodes as G, the first claim is trivially true. Consider the third claim. Since G is pairwise-optimal, then by Observation 1, M must be a perfect matching. Consequently, each b ∈ B1 is matched to a unique a ∈ A. By construction of the precedence graph D, a and b are strongly connected and therefore it must be the case that a ∈ A1. Therefore, |A1| ≥ |B1|. Since this is true for all p strongly connected components in D, specifically that |Ai| ≥ |Bi| for 1 ≤ i ≤ p, and since |A| = |B| (G is pairwise-optimal), we can conclude that |A1| = |B1| by the pigeonhole principle. Therefore, there exists a perfect matching between A1 and B1 in G/[A1, B1] and, by Theorem 5, the size of a maximum independent set for G/[A1, B1] is |A1| (= |B1|). Thus, G/[A1, B1] is pairwise-optimal. Suppose it is not minimal pairwise-optimal. Then, there must exist some non-empty sets A′1 ⊆ A1 and B′1 ⊆ B1 such that A′1 ∪ B′1 is also a maximum independent set in G/[A1, B1]. With respect to G and M, A′1 must be matched with B1 \ B′1 and B′1 must be matched with A1 \ A′1. With respect to D, this implies that there are no arcs oriented from A1 \ A′1 to B1 \ B′1. Also, there cannot be any arcs from B′1 to A′1. Thus, for any x ∈ B′1 ∪ (A1 \ A′1) there does not exist a path, in D, to any y ∈ A′1 ∪ (B1 \ B′1), contradicting that (A1, B1) is strongly connected in D and proving the third claim. Consider the second claim. Let G′ = G/[A1, B1]. Since G′ is pairwise-optimal, A1 = NG′(B1) and therefore A1 ⊆ NG(B1). If A1 = NG(B1) we are done. Otherwise, let a be any node in NG(B1) \ A1 incident to some b ∈ B1. Since a ∉ A1, a and b belong to different strongly connected components.
Next, consider that Tarjan's algorithm finds SCCs of a graph by depth-first search, returning an SCC only after all other reachable SCCs have been returned. This implies (A1, B1) must be a sink in the condensation of D. However, by its construction, D must contain the arc (b, a). Contradiction.

3.2.2 Cutting strategy

In the original presentation of the algorithm [129], we presented two cutting strategies. Here, we detail the more intuitive (and efficient) of the two: the two-sided cutting strategy. For a problem instance consisting of a minimal pairwise-optimal bipartite graph G[A, B] and barrier k, this strategy generates the subgraphs G/[A \ {a}, B \ {b}] for each choice of a ∈ A and b ∈ B and recursively solves each of the resulting subproblems with the barrier set to k − 1. The following lemma states that if we do this for all possible choices of a and b, we are guaranteed to find a (≤ k)-barrier set pathway for G if one exists.

Lemma 12. Let G[A, B] be minimal pairwise-optimal. Then
1. G/[A \ {a}, B \ {b}] is pairwise-optimal for all a ∈ A and b ∈ B,
2. if G/[A \ {a}, B \ {b}] has a transformation sequence T′ with barrier k − 1, then T = {a}, T′, {b} is a transformation sequence for G with barrier k, and
3. G has a transformation sequence with barrier k only if G/[A \ {a}, B \ {b}] has a transformation sequence with barrier at most k − 1 for some a ∈ A and b ∈ B.

Proof. Let A′ = A \ {a} and B′ = B \ {b} and let G′ = G/[A′, B′]. Consider the first claim. Suppose not. Then, in G′, there must exist some maximum independent set C ⊆ A′ ∪ B′, where |C| > |B′|, implying that C contains elements of both A and B. However, since |B − B′| = 1 we have that |C| ≥ |B|, contradicting that G is minimal pairwise-optimal. Consider the second claim. Let P′ be the set pathway specified by T′ on G′. Then the set pathway specified by T on G is exactly P = A, P′, B.
First, observe that P is a valid set pathway since (i) adding b last cannot introduce a conflict, as the final set of P is B, and (ii) removing a first ensures it cannot conflict with any set in P′, B. Finally, since the barrier of P′ is k − 1 relative to A′, it is k − 1 + |A − A′| = k relative to A. Since adding b cannot raise the barrier, T is a k-barrier transformation sequence for G. Consider the third claim. Suppose to the contrary that G has a k-barrier transformation sequence and, for all a ∈ A and b ∈ B, any transformation sequence T′ on G′ has barrier at least k. But the barrier of T′ is relative to A′. Relative to the initial set of G, the barrier is at least k + |A − A′| = k + 1. Since G′ is pairwise-optimal (claim 1), then by Lemma 10 the barrier for G is at least as large as the barrier for G′. Contradiction.

3.2.3 The overall algorithm

Algorithm 2: DirectTransformation
input : A pairwise-optimal bipartite graph G[A, B] and a barrier k
output: A direct transformation sequence from A to B with barrier at most k, or ∅ if one does not exist
1 begin
      // trivial base case
2     if k ≤ 0 then return ∅;
      // split
3     (A1, B1) ← BasicSplit(G);
4     G1 ← G/[A1, B1];
5     G′ ← G/[A \ A1, B \ B1];
      // solve G′ recursively
6     if G′ is non-null then
7         T′ ← DirectTransformation(G′, k);
8         if T′ = ∅ then return ∅;
9     else
10        T′ ← ∅;
      // base case for G1
11    if |A1| ≤ k then return A1, B1, T′;
      // otherwise, recursively solve G1 with the cutting strategy
12    foreach a ∈ A1 and b ∈ B1 do
13        T1 ← DirectTransformation(G1/[A1 \ {a}, B1 \ {b}], k − 1);
14        if T1 ≠ ∅ then return {a}, T1, {b}, T′;
15    return ∅;

The DirectTransformation algorithm incorporates the splitting and two-sided cutting strategies. First, a minimal pairwise-optimal subgraph G1 is identified by the BasicSplit algorithm (line 3). Note that if G is minimal pairwise-optimal, then G1 = G.
If G′, the remainder of the problem, is not null, i.e., G was not already minimal pairwise-optimal, then it is solved recursively, if possible (lines 6-10). An overall solution is returned if G1 is trivially solvable (line 11); otherwise, the cutting strategy reduces G1 into smaller subproblems to be solved recursively (lines 12-15). Overall, if a solution for G1 and G′ is found, their concatenated pathway is returned as a solution to G. If no solution is found, then an empty transformation sequence is returned.

3.2.4 Algorithm correctness and complexity

Theorem 6. The DirectTransformation algorithm is correct.

Proof. To prove this claim we must show that, for any arbitrary pairwise-optimal bipartite graph G[A, B], the algorithm returns a valid (≤ k)-barrier direct transformation sequence for G if and only if one exists. We prove this by induction on the order of G, i.e., the number of vertices in G. Correctness is straightforward to show when |A| = |B| = 1. Suppose that DirectTransformation is correct on input graphs where |A| = |B| ≤ n − 1. By Lemma 11, G1 is minimal pairwise-optimal and either G′ is null or, by Lemma 9, G′ is pairwise-optimal. We will show correctness in the latter case, which immediately implies correctness of the former. By Lemma 10 and Lemma 9 there exist (≤ k)-barrier transformation sequences T1 and T′, for G1 and G′ respectively, if and only if T1, T′ is a (≤ k)-barrier transformation sequence for G. Since the algorithm always returns the concatenation of their solutions, or ∅ if a solution for one does not exist, it is sufficient to show the algorithm is correct for both G1 and G′. First, consider that G1 cannot be null and consequently G′ has a smaller order than G. Therefore, by assumption, DirectTransformation must be correct for G′. Next, consider how the algorithm solves G1. The base case for G1 (line 11) is clearly correct, so consider the recursive case (lines 12-15).
Since G1 is minimal pairwise-optimal then, by Lemma 12, G1 has a solution if and only if there exists some a ∈ A1 and b ∈ B1 such that G′1 = G1/[A1 \ {a}, B1 \ {b}] has a (< k)-barrier transformation sequence. Since the algorithm tries all pairs of a and b, and the algorithm is guaranteed to be correct for G′1 by assumption (it has smaller order than G), the recursive case must also be correct for G1.

Theorem 7. Given a pairwise-optimal bipartite graph G[A, B] and maximum barrier k, DirectTransformation runs in O(n^(2k+ω)) time and O(n^2) space, where n = |A| = |B| and ω < 2.38.

Proof. First note that the time and space complexity of the subsidiary algorithm BasicSplit, which is called once for each call to DirectTransformation, is dominated by finding a maximum matching; a problem that can be solved in O(n^ω) time and O(n^2) space [47]. The worst case occurs when the input for each recursive call to DirectTransformation is already minimal pairwise-optimal; thus, BasicSplit simply returns the original input. In this case, the cutting strategy makes O(n^2) recursive calls to DirectTransformation with allowable barrier one less than the current, and the recursion bottoms out when k reaches 0. We therefore have O(n^(2k)) calls to DirectTransformation, each taking O(n^ω) time, since BasicSplit dominates the runtime at each step, for a total runtime of O(n^(2k+ω)). In all cases, the maximum matching algorithm dominates space usage, resulting in O(n^2) space overall.

Comments on practical and theoretical runtime efficiency

The observant reader will have noticed a potential redundancy in the DirectTransformation algorithm. Specifically, the sequence of SCCs returned by Tarjan's algorithm specifies a safe splitting sequence: a sequence of minimal pairwise-optimal subgraphs G1, G2, . . . , Gp such that a concatenation of optimal transformation sequences for these problems T = T1, T2, . . .
, Tp, where Ti is an optimal transformation sequence for Gi, is an optimal solution for the original graph G. (Thus, the redundancy arises as BasicSplit is called an extra p − 2 times to determine the same splitting sequence.) The correctness of this claim follows from a straightforward generalization of Lemma 9 and Lemma 11. In general, the condensation of the precedence graph specifies a partial order on minimal pairwise-optimal subgraphs. Any total order respecting this partial order (for instance, a depth-first traversal) is a safe splitting sequence. Considering the example in Figure 3.3, C1, C2, C3 and C2, C1, C3 are both safe splitting sequences. Interestingly, the condensation of the precedence graph provides a succinct representation of all maximum independent sets. In terms of the folding pathway problem, this results in a succinct representation of all MFE structures composed of arcs from the initial and final structures. Next, consider the cutting strategy, which must guess (through brute-force enumeration) a first element to remove and a last element to add. If this reduced problem cannot be split, the procedure must be repeated recursively until it can (or a base case is reached). If G is a bi-clique, i.e., all elements from A must be removed before any elements from B can be added¹², then we witness the theoretical worst-case behaviour of the algorithm. However, if we instead ask which element b ∈ B we will first add, and which element a ∈ A we will last remove, then we can fully determine the sequence of elements which must be removed first (NG(b)), and the sequence of elements which will be added last (NG(a)). This seemingly minor observation leads to a major practical speedup. Similarly, we need only consider pairs a and b which are not adjacent in G. For dense graphs (i.e., |E(G)| = Θ(|A|^2)), this can significantly cut the search space in practice.
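Such optimizations can be sanity-checked against a brute-force reference on tiny instances. The sketch below is illustrative only, not the thesis algorithm: the name `min_direct_barrier` and the convention of passing conflicts as `nbr[b]` (the A-neighbourhood of each b ∈ B) are assumptions for this example.

```python
# Brute-force reference for the direct set barrier problem (exponential
# time; tiny instances only). A direct pathway removes each element of A
# and adds each element of B exactly once; b may be added only once all
# of its conflicting A-elements have been removed. The barrier of a
# state is (#removed - #added); we minimize the maximum over the path.

def min_direct_barrier(A, B, nbr):
    A, B = frozenset(A), frozenset(B)

    def reachable(k, removed, added):
        if len(removed) - len(added) > k:
            return False                      # barrier k exceeded
        if removed == A and added == B:
            return True                       # reached the final set B
        # Additions never raise the barrier, so taking one first is safe.
        for b in B - added:
            if nbr[b] <= removed:
                return reachable(k, removed, added | {b})
        return any(reachable(k, removed | {a}, added) for a in A - removed)

    return next(k for k in range(len(A) + 1)
                if reachable(k, frozenset(), frozenset()))
```

On a bi-clique with |A| = |B| = 2 this returns 2 (all of A must be removed before any addition), while on a "ladder" where each b conflicts with exactly one a it returns 1.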
In theory, we may argue that the algorithm runtime is then O(n^(k+ω)), since there are O(n) non-adjacent pairs in a dense bipartite graph.

¹² Recall the definition of a band of arcs in the folding pathway problem from Chapter 2. A bi-clique in the set pathway problem corresponds to a folding pathway instance containing exactly two bands, one for each structure, that cross.

The worst-case analysis assumed the use of the most efficient algorithm known (in terms of worst-case runtime) for finding a maximum matching [47]. However, this can be improved, in terms of both practical and theoretical efficiency, if the conflict graphs are known to be sparse. In this case, the matching algorithm due to Hopcroft and Karp [52] guarantees O(m√n) runtime, where m and n are the size (number of edges) and order (number of vertices) of G, respectively, and is known to be one of the most efficient in practice. For this reason, we have implemented the Hopcroft-Karp algorithm for finding a maximum matching within our BasicSplit algorithm. Therefore, the implementation we evaluate in Section 3.3 has runtime O(n^(2k+2.5)). If we know our conflict graphs are sparse (i.e., |E(G)| = O(|A|)), we may argue that the algorithm runtime is O(n^(2k+1.5)). Finally, consider that this algorithm has a trivial parallel implementation: each subproblem in a safe splitting sequence can be solved independently, in parallel; furthermore, each subproblem generated by the cutting strategy can be solved independently, in parallel.

3.2.5 Finding minimum barriers for non-pairwise-optimal instances

The above algorithm maintains the invariant of operating on pairwise-optimal bipartite graphs. While this property has greatly simplified the algorithm description and proof of correctness, we now outline how these results can be extended to the more general case of solving the barrier problem for any bipartite graph G[A, B].
We accomplish this by giving a polynomial-time reduction from G to a pairwise-optimal bipartite supergraph of G, denoted as PWO(G). As we will show, a solution for PWO(G) can be mapped to a solution for G.

Figure 3.4: Creating a pairwise-optimal instance (bottom) from a non-pairwise-optimal instance (top).

Construction of PWO(G)

If G is not pairwise-optimal, then α(G) > |A| and/or α(G) > |B|. We will construct a bipartite supergraph G′[A′, B′] of G such that α(G′) = |A′| = |B′| = α(G), which is therefore pairwise-optimal. Let A′ = A ∪ X and B′ = B ∪ Y, where |X| = α(G) − |A| and |Y| = α(G) − |B|. Finally, let E(G′) = E(G) ∪ E′ where E′ = {(x, b) | x ∈ X ∧ b ∈ B′} ∪ {(a, y) | y ∈ Y ∧ a ∈ A′}. Note that since G is bipartite, α(G) can be determined in polynomial time [47, 52, 61] and therefore PWO(G) can be constructed in polynomial time (O(n^ω)). See Figure 3.4 for an example.

Theorem 8. There exists a (≤ k)-barrier transformation sequence for any bipartite graph G[A, B] if and only if there exists a (≤ k′)-barrier transformation sequence for PWO(G), where k′ = k + α(G) − |A|.

Proof. Let G′[A′, B′] = PWO(G), where A′ = A ∪ X and B′ = B ∪ Y. Let T′ be a transformation sequence for G′ having barrier k′. Since G′ is a supergraph of G, T′ must contain as a subsequence a transformation sequence T for G. Let k be the barrier of T applied to G. Consider that no addition operation can appear in T′ until all elements of X have been removed, since every element of X dominates B′. Likewise, no element of Y can be added in T′ until all of A′ is removed. Therefore, since reordering a consecutive sequence of removal (or addition) operations cannot affect the barrier, T′ can be reordered as X, T, Y.
Since the barrier for the prefix X, T is at most k′ relative to the initial set A′, and since the prefix X simply removes all elements of X, thus resulting in the initial set for G, we can express the barrier for X, T as |X| + k = k + α(G) − |A|. Since the addition of the elements of Y at the end of the transformation sequence cannot increase the barrier, we can conclude that k′ = k + α(G) − |A|.

3.3 Empirical results

We implemented two versions of our algorithm for the direct set barrier problem, differing by their cutting strategy, in order to study their efficiency in practice on biologically motivated data. The first, referred to as the O(n^(2k+2.5)) algorithm, uses the two-sided cutting strategy as described above. The second, referred to as the n^O(n) algorithm, uses a one-sided cutting strategy described in previous work [129]. Until this point, our algorithm has been described in terms of a decision problem, i.e., can a problem instance be solved within barrier k? However, the implementation of our algorithm is in terms of the more general optimization problem, i.e., find the minimum barrier k that can solve a problem instance. The general strategy is to perform a binary search on values of k, using the decision algorithm as a subsidiary algorithm. As such, our empirical results report on the runtime required to identify the minimum barrier.

3.3.1 Implementation and experimental environment

Both algorithms were coded in C++ and compiled using g++ (GCC version 4.2.1). All experiments were run on our reference PCs with 2.4 GHz Intel Pentium IV processors with 256 KB L2 cache and 1 GB RAM, running SUSE Linux version 10.3.
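The binary search over barrier values mentioned at the start of this section can be sketched as follows. This is a minimal sketch: `decide` stands in for running the decision algorithm on a fixed instance with barrier k, and is assumed monotone (if barrier k suffices, so does k + 1).

```python
def min_barrier(decide, lo, hi):
    """Smallest k in [lo, hi] with decide(k) == True, assuming decide is
    monotone; makes O(log(hi - lo)) calls to the decision procedure."""
    while lo < hi:
        mid = (lo + hi) // 2
        if decide(mid):
            hi = mid          # barrier mid suffices; search the lower half
        else:
            lo = mid + 1      # barrier mid fails; the minimum is above mid
    return lo
```

For an instance whose true minimum barrier is 7, `min_barrier(lambda k: k >= 7, 0, 100)` returns 7; each probe runs the (expensive) decision algorithm once, so logarithmically many probes matter in practice.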
3.3.2 Generation of problem instances

With the motivation of studying algorithm performance across a variety of problem instances, we randomly sampled five sequences for each of four different classes of non-coding RNA—Transfer RNA, Transfer Messenger RNA, Ribonuclease P RNA, and 5S Ribosomal RNA—found in the RNA STRAND database [3]. For each sequence, five MFE structures—with respect to number of base pairs—were determined using a modified version of the Nussinov-Jacobson algorithm [89]. The modified algorithm stored all optimal paths within the traceback matrix. In this way, we were able to randomly sample five different MFE structures for the same sequence. Identical structures were discarded. Every possible pairing of structures for the same sequence formed a new problem instance. Thus, ten problem instances were created for each sequence, resulting in 200 problem instances overall. The distribution of sequence length and the resulting number of conflicting base pairs between paired structures can be seen in Figure 3.5. In general, and as expected, the number of conflicting base pairs increases with sequence length.

3.3.3 Algorithm runtime performance

Both algorithms were run for a maximum of 1 CPU hour on each of the 200 problem instances. The n^O(n) algorithm found solutions to 183 instances, while the O(n^(2k+2.5)) algorithm found solutions to 184 instances. Interestingly, the n^O(n) algorithm found solutions to three instances not solved by the O(n^(2k+2.5)) algorithm; likewise, four instances were solved by the O(n^(2k+2.5)) algorithm but not by the n^O(n) algorithm. Of the instances that were solved, optimal barriers were found within 1 CPU second by both algorithms in 90% of the cases, with barrier height ranging from 1 to 8. The barrier of harder instances ranged from 6 to 11, with a mean of 9. In general, the O(n^(2k+2.5)) algorithm was the best performing for harder instances.
However, as can be seen in Figure 3.6, each algorithm excelled on certain instances relative to the other. The instances which failed to be solved within our cut-off time tended to have the highest number of conflicting base pairs; moreover, they tended to have the largest minimal pairwise-optimal subproblems generated by the BasicSplit algorithm. For each problem instance, we recorded the size of the maximum subproblem, as well as the average size of all subproblems, produced by the BasicSplit algorithm at the top level of recursion. We measured size as the number of base pairs (elements, in terms of the set barrier problem).

Figure 3.5: Distribution of conflicting base pairs for generated problem instances.

Figure 3.6: The required time to find an optimal barrier pathway is shown for two time scales.

Figure 3.7 shows the frequency of problem instances which have a given maximum (left) and average (right) subproblem size. The problem instances which have maximum subproblems of size 200 or more were those that failed to be solved within the allotted runtime. Alternative methods for splitting or recursing on such subproblems would clearly be valuable. Or, simply exploiting the inherent parallelism of the proposed algorithms could lead to a solution in reasonable wall-clock time, when many CPUs are employed.

3.4 Solving the direct-with-repeats barrier problem

Since any graph is a subgraph of itself, and since Lemma 10 correctly considers this case, we can immediately conclude that repeat operations cannot lower the barrier in direct transformation sequences for pairwise-optimal instances. Moreover, as an immediate consequence of Lemma 10 and Theorem 8, we can conclude the following, more general result.

Theorem 9.
If there exists a direct-with-repeats k-barrier transformation sequence for a bipartite graph G[A, B], then there must exist a direct transformation sequence for G with barrier at most k.

Theorem 9 has implications for the direct-with-repeats folding pathway problem that we formally define below.

Figure 3.7: Frequency of maximum (left) and average (right) subproblem sizes, measured as the number of base pairs in the subproblem produced by the first call to the BasicSplit algorithm for a given instance. The maximum and average are taken over all subproblems generated for a given instance.

Problem 5. eb-drpfp (Energy Barrier for Direct-with-Repeats Pseudoknot-free Folding Pathway of a single strand)
Instance: Given two pseudoknot-free configurations I = {I_1, . . . , I_n} (initial) and F = {F_1, . . . , F_m} (final), of a single strand, and an integer k.
Question: Is there a direct-with-repeats pseudoknot-free transformation sequence S such that the energy barrier of S, in the simple energy model, is at most k?

As repeats do not lower the energy barrier, then, by combining Theorem 3 and Theorem 9, we can conclude the following result.

Theorem 10. The eb-drpfp problem, namely the energy barrier for direct-with-repeats pseudoknot-free folding pathway problem, of a single strand, is NP-complete.

We can also consider the implications for the multi-strand version of the problem that allows repeats in the folding pathway.

Problem 6. eb-drpfp-multi (Energy Barrier for Direct-with-Repeats Pseudoknot-free Folding Pathway of Multiple interacting strands)
Instance: Given two pseudoknot-free configurations I = {I_1, . . . , I_n} (initial) and F = {F_1, . . . , F_m} (final), of multiple interacting strands, and an integer k.
Question: Is there a direct-with-repeats pseudoknot-free transformation sequence S such that the energy barrier of S, in the simple energy model, is at most k?

It is unclear whether Theorem 9 can be applied to the multi-strand case.
It leaves open the possibility that repeats may help to lower the energy barrier. Furthermore, we currently cannot bound the length of a minimum energy barrier pathway for this problem. Thus, we do not know if eb-drpfp-multi is in NP. However, by restriction to the single-strand case, we can conclude the following result.

Theorem 11. The eb-drpfp-multi problem, namely the energy barrier for direct-with-repeats pseudoknot-free folding pathway of multiple interacting strands problem, is NP-hard.

We note that a constructive proof of Theorem 9 was previously given in terms of the more restricted folding pathway problem [129]. It is likely the same proof would hold for the more general set pathway problem as well. The alternative proof was given here instead because we noticed that a slight generalization of Lemma 10 proves this property in addition to the properties required to show algorithm correctness.

3.5 Chapter summary

In this chapter, we proposed an algorithm to exactly solve a generalized version of the direct energy barrier folding pathway problem that is defined in terms of bipartite graphs. The algorithm has exponential worst-case time complexity, but uses only polynomial space. Because the algorithm is inherently parallel, this property could be exploited to help solve hard instances. As shown by an empirical study, the algorithm is practical for most instances, although it fails to solve some instances in a reasonable runtime. Due to the splitting algorithm, large sequences are not necessarily hard to solve. However, sequences that do not decompose into small subproblems can result in poor empirical performance by the algorithm, including the failure to solve the instance within a reasonable runtime. This is, of course, due to the exhaustive enumeration performed in order to guarantee a solution is optimal. However, the splitting algorithm could be used in conjunction with heuristic methods.
For instance, it could be used to first partition the solution space into subproblems, with the aim of improving both the efficiency and accuracy of the overall heuristic method used to solve each subproblem. We note that our algorithm is based on the simple energy model; however, there is potential to extend it to more complicated energy models. For instance, it may be straightforward to extend the algorithm to consider nearest-neighbour base pairs as in the Turner energy model [79]. The resulting changes would still only consider local interactions when calculating the energy of particular structures. However, it seems unlikely that the current algorithm can be easily modified to consider global interactions found in the full Turner energy model, such as multi-branch loops. Fortunately, for the case of folding pathways resulting from strand displacement systems, such global interactions are not present by design. Furthermore, in the design of strand displacement systems, it is desirable that intended folding pathways have low energy barriers. Therefore, if our algorithm were extended to the multi-strand case (as discussed in Section 6.1), it would have a polynomial runtime in these important cases. As such, it may be a valuable tool in the design of strand displacement systems, where unintended folding pathways should have a necessarily larger barrier than intended pathways. Interestingly, by proving the algorithm correct, we were also able to prove that repeat arcs do not help in a direct folding pathway. This establishes that the direct-with-repeats energy barrier folding pathway problem is NP-complete for the single-strand case and NP-hard for the multiple-interacting-strand case. Unfortunately, the algorithm does not seem applicable to indirect folding pathways. The design of the algorithm explicitly assumes that the graph modeling the conflicts between the arcs of the initial and final structures of a problem instance is bipartite.
In an indirect folding pathway, where any non-crossing arc forming a Watson-Crick base pair can be added at any point along a pathway, the conflict graph is not necessarily bipartite (and is unlikely to be in general). Still, it is possible that a better understanding of the structure of conflict graphs for indirect pathways could lead to a similar result. The conflict graphs formed for indirect folding pathways can be characterized as circle graphs. The conflict graphs for direct pathways are 2-colourable circle graphs (see Figure 3.8 for an example).

Figure 3.8: (left) An arc diagram representation for the RNA strand UCUGAGCUAGUG. Arcs (base pairs) in the initial structure are shown in red, those in the final structure are shown in blue, and potential temporary arcs are shown in green. Also shown are the corresponding conflict graphs for the indirect folding pathway problem (center) and the direct folding pathway problem (right).

For direct pathways, we were able to exploit the following property: if one could identify an MFE structure C consisting of arcs from both the initial and final structures, A and B respectively, then there always exists an optimal pathway from A to B via C. We note that finding maximum independent sets in circle graphs, which correspond to MFE structures, is in P¹³. Could a generalized version of the algorithm proposed here be adapted for indirect pathways? Most properties exploited in the proofs are argued in terms of independent sets. Removing assumptions regarding the colourability of the graph would be a necessary first step. Finally, it remains possible that the indirect folding pathway problem (for a single strand¹⁴) is in P.
¹³ While it is the case that the pseudoknot-free RNA/DNA structure prediction problem is in P, it is important to separately note the complexity for the corresponding circle graph problem, since the former is a restriction of the latter.

¹⁴ We show in Chapter 5 that the indirect pathway problem is PSPACE-complete for multiple interacting strands.

Chapter 4

On recycling and its limits in molecular programs

While the previous two chapters focused exclusively on finding minimum energy barrier folding pathways, we now turn our attention towards designing folding pathways to perform space-efficient deterministic computation. Specifically, our aim is to design a low energy barrier folding pathway that deterministically transitions through a number of unique structures exponential in the length of the nucleic acid strand(s). We do not know how to design such pathways with a single strand, but in this chapter we show how this goal can be achieved using a set of multiple interacting strands. In such a pathway, subsets of strands will bind and unbind to other subsets, forming and breaking new strand complexes, multiple times over the length of the pathway. Thus, various strands will be actively reused, or recycled, during the course of the folding pathway. This chapter explores the limits of strand recycling in folding pathways and the molecular programs that leverage them. More generally, we also consider the concept of recycling molecules within Chemical Reaction Networks (CRNs). To our knowledge, we present the first example of a DNA Strand Displacement (DSD) system which significantly recycles strands. This also serves as the first example of a designed minimum energy barrier (indirect) folding pathway whose length is exponential in the combined length of participating strands.
We also demonstrate a serious limit to recycling: recycling is not possible in deterministic CRNs, and their DSD realizations, when multiple copies of the initial state of the system are present in the same environment. In fact, we show that with just one extra copy of the initial signal molecules of a given CRN, it can perform at most a linear number of deterministic computation steps within a reaction volume of a closed system.

4.1 Introduction

We begin the chapter by illustrating the concept of recycling within molecular programs, discussing its benefits and possible limits, and giving an overview of related work and our results in this context. The molecular programming models and terminology used throughout this and subsequent chapters are reviewed and introduced in Section 1.2 of the introductory chapter. Moreover, definitions of common molecular programming terms can be found in the Glossary.

Content from this chapter appears in the proceedings of the 17th Annual International Conference on DNA Computing and Molecular Programming (DNA 2011) [25] and the Journal of the Royal Society: Interface Focus [26].

4.1.1 On the need for strand recycling

Our goal is to determine whether or not molecular programs that leverage folding pathways can perform deterministic computations that are also space efficient. If a molecular program consists of Θ(n) molecules, could it deterministically advance through Θ(2^n) unique states? To answer this question, we initially set out to design a molecular program that simulates an n-bit standard binary counter. Recall the 3-bit standard binary counter given in Example 1.2.1 of Chapter 1. The chemical reaction equations for the counter are given in Figure 1.6(a). The counter is designed to begin at count 000, advance to 001, and so on, until reaching the count 111.
Indeed, the corresponding CRN, with initial signal multiset {0_3, 0_2, 0_1}, simulates a logically reversible computation advancing correctly through the 2^3 unique states as illustrated in Figure 1.6(b). This counter can be generalized to n bits in the obvious way. It would seem that this simple example is sufficient to show that chemical reaction networks can perform deterministic computation exponential in their size (their largest signal multiset size, or number of reactions, for instance). Can we implement this CRN with a DSD?

Figure 4.1: To reach the end state, the standard binary counter must perform a sequence of reactions that always occur in the forward direction, thus requiring a new transformer for every reaction as they are not recycled.

As discussed in Section 1.2.3, we do not know how to implement chemical reaction equations in a DSD without the use of transformers. (An example of a transformer implementing the chemical reaction equation 0_1 ⇌ 1_1 is given in Figure 1.8.) While a transformer implementing a particular chemical reaction equation can be used to effect both the forward and reverse of the reaction, it must strictly alternate between these directions. To capture this notion, we introduced the concept of tagged chemical reaction equations and formally defined tagged Chemical Reaction Networks (tagged CRNs) in Section 1.2.3. The tagged equations for the 3-bit counter are given in Figure 1.9. Consider that the standard binary counter always performs these reactions in one direction. This means that a new transformer is required at every reaction step, as illustrated in Figure 4.1. While the generalized n-bit counter does advance deterministically through 2^n states, it would also require that 2^n − 1 transformers be present in the initial tag multiset when formally defined as a tagged CRN. The required space of a tagged CRN for an n-bit standard binary counter is therefore Θ(2^n).
Therefore, the standard binary counter is not an example of space efficient deterministic computation that can be realized by a DSD. The lack of transformer reuse in the standard binary counter is representative of other DSD programs in the literature. While some do use reversible transformers, such as the example transformer of Figure 1.8, the intended computation does not actively exploit this property. Appropriately, transformers are often referred to as fuel. The term captures the problem well: should the same reaction need to occur multiple times in the future, additional copies of fuel are required. In a reaction volume of a closed system, all fuel necessary to complete a computation must be present initially to avoid fuel depletion. Consequently, the reaction volume becomes polluted with inactive fuel strands referred to as waste. Active recycling of transformers could avoid these problems.

4.1.2 On the potential for strand recycling

With the aim of avoiding fuel depletion and waste, and of giving an example of a space-efficient DSD, we now propose an alternate counter based on the binary reflecting Gray code sequence [111]. The sequence is a Gray code as each successive value differs from the previous in exactly one bit position. It is called a binary reflecting Gray code due to its elegant recursive definition: the n-bit Gray code sequence is formed by reflecting the (n − 1)-bit sequence across a line, then prefixing values above the line with 0 and those below the line with 1. This is illustrated for n = 1, 2, 3 in Figure 4.2.

Figure 4.2: The 3-bit binary reflecting Gray code. The code for n digits can be formed by reflecting the code for n − 1 digits across a line, then prefixing each value above the line with 0 and those below the line with 1.

Figure 4.3(a) gives the tagged chemical reactions for the 3-bit version of this counter, which we call GRAY.
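The reflect-and-prefix construction can be sketched directly (a minimal Python illustration; the function name gray_code is ours):

```python
def gray_code(n):
    """n-bit binary reflecting Gray code, built by the
    reflect-and-prefix recursion described above."""
    if n == 0:
        return [""]
    prev = gray_code(n - 1)
    # Reflect the (n-1)-bit sequence, then prefix the values above
    # the line with 0 and those below the line with 1.
    return ["0" + v for v in prev] + ["1" + v for v in reversed(prev)]

# The 3-bit sequence of Figure 4.2:
print(gray_code(3))
# ['000', '001', '011', '010', '110', '111', '101', '100']

# Successive values differ in exactly one bit position:
assert all(sum(a != b for a, b in zip(u, v)) == 1
           for u, v in zip(gray_code(3), gray_code(3)[1:]))
```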
The counter advances through application of the three reversible tagged chemical reaction equations (1)-(3), given below, to produce the logically reversible computation chain shown in Figure 4.3(b).

Figure 4.3: (a) Tagged chemical reaction equations for a 3-bit binary reflecting Gray code counter (catalyst signals appear on both sides of a reaction):

(1) Tf_1 + 0_1 ⇌ Tr_1 + 1_1
(2) Tf_2 + 0_2 + 1_1 ⇌ Tr_2 + 1_2 + 1_1
(3) Tf_3 + 0_3 + 1_2 + 0_1 ⇌ Tr_3 + 1_3 + 1_2 + 0_1

(b) The configuration graph of the computation performed by the 3-bit binary reflecting Gray code counter forms a chain and is logically reversible. The nodes represent the state of the computation and the edges are directed between states reachable by a single reaction:

{0_3,0_2,0_1} --1-for--> {0_3,0_2,1_1} --2-for--> {0_3,1_2,1_1} --1-rev--> {0_3,1_2,0_1} --3-for--> {1_3,1_2,0_1} --1-for--> {1_3,1_2,1_1} --2-rev--> {1_3,0_2,1_1} --1-rev--> {1_3,0_2,0_1}

It is worth pointing out the correspondence between the tagged CRN and the recursive definition of the binary reflecting Gray code sequence. Consider the chain of Figure 4.3(b). The middle reaction (3-for) flips the third bit from a 0 to a 1 for the first and only time. To its left is the complete chain for the 2-bit sequence and to its right is the complete chain for the reverse of the 2-bit sequence. This can be seen in the left four chain nodes if one ignores the 0_3 bit in the signal sets. To create the 4-bit code, one additional reaction can be added (Tf_4 + 0_4 + 1_3 + 0_2 + 0_1 ⇌ Tr_4 + 1_4 + 1_3 + 0_2 + 0_1), which effectively results in the entire chain of Figure 4.3 forming the left half of the computation chain of the 4-bit sequence, and its reverse forming the right half, separated by a single reaction to flip the fourth bit. The key idea to achieving this reaction sequence is for the reaction that alters the n-th bit to require as catalysts the last signal values in the (n − 1)-bit sequence. This ensures the new reaction does not proceed until the (n − 1)-bit sequence is complete.
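As a sanity check on the chain of Figure 4.3(b), the three reactions can be simulated as multiset rewrites (a sketch; the set encoding of states and reactions is ours, with tags omitted as in the text):

```python
# Reactions (1)-(3) of the 3-bit GRAY counter as (consumed, produced)
# signal sets; catalysts appear in both sets, tags are omitted.
RXNS = {
    1: ({"0_1"}, {"1_1"}),
    2: ({"0_2", "1_1"}, {"1_2", "1_1"}),
    3: ({"0_3", "1_2", "0_1"}, {"1_3", "1_2", "0_1"}),
}

def moves(state):
    """All (label, successor) moves applicable in a state."""
    out = []
    for i, (lhs, rhs) in RXNS.items():
        if lhs <= state:
            out.append((f"{i}-for", frozenset((state - lhs) | rhs)))
        if rhs <= state:
            out.append((f"{i}-rev", frozenset((state - rhs) | lhs)))
    return out

def run(start, end):
    """Follow the chain: at each state take the unique move that does
    not step back to the previous state (logical reversibility)."""
    labels, state, prev = [], frozenset(start), None
    path = [state]
    while state != end:
        nxt = [(l, s) for l, s in moves(state) if s != prev]
        assert len(nxt) == 1          # the chain is deterministic
        prev = state
        (label, state), = nxt
        labels.append(label)
        path.append(state)
    return labels, path

labels, path = run({"0_3", "0_2", "0_1"}, frozenset({"1_3", "0_2", "0_1"}))
print(labels)  # ['1-for', '2-for', '1-rev', '3-for', '1-for', '2-rev', '1-rev']
```

The recovered edge labels match Figure 4.3(b), and each of the three reactions is indeed applied alternately forward and in reverse.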
Moreover, by not consuming any of these signals, it forces the entire computation chain, up to that point, to reverse. The reason is as follows. The new reaction has created a new signal, not previously seen (1_n). This creates a new state along the computation chain. Reversing the new reaction simply steps back in the computation chain; however, since the new state contains, as a proper subset, all signals at the end of the chain for the (n − 1)-bit sequence, the last reaction of that chain can proceed in reverse. Once this occurs, the reaction for the n-th bit cannot proceed in either direction as the necessary catalysts are no longer present. Moreover, since the (n − 1)-bit sequence formed a logically reversible chain, all of its reactions will be reversed. This is recursively true for its left and its right sub-chain. This is a powerful technique that is also exploited in our designs of Chapter 5. The key difference between the GRAY counter and the standard binary counter is that each particular reaction in GRAY occurs alternately in the forward and reverse direction, due to the recursive nature of the computation chain. This is illustrated in Figure 4.4. GRAY only requires a single copy of each transformer. For a 3-bit GRAY counter, the initial signal multiset is {0_3, 0_2, 0_1} and the initial tag multiset is {Tf_3, Tf_2, Tf_1}. An n-bit GRAY counter has an initial signal multiset of size n and an initial tag multiset of size n. As GRAY is also a proper CRN, by Lemma 2 it has space complexity 2n. Thus, GRAY is an example of a space-efficient deterministic computation that can be realized as a DSD.

Figure 4.4: To reach the end state, the binary reflecting Gray code counter must perform a sequence of reactions that always alternate in the forward and reverse direction, thus requiring only one transformer for every reaction since transformers are actively recycled.
In summary, recycling in DNA strand displacement systems offers the potential of supporting space-efficient DNA computations in which the number of strands required to complete a computation in a reaction volume of a closed system is logarithmic in the length of the computation. Systems that recycle strands, and more generally molecules, do not use fuel, i.e., large concentrations of certain transformer species that can bias reactions in one direction, and so are not prone to problems of fuel depletion or waste. However, such advantages come at a price: as we will show, our counter proceeds somewhat more slowly than comparable fuel-driven strand displacement counters. The slowdown is due in part to the fact that reactions are used in both directions. Thus, our GRAY counter is not biased to advance towards the final state but rather performs an unbiased random walk on the logically reversible computation chain, both advancing and retreating, until ultimately reaching the final state.

4.1.3 On the limits of strand recycling

Our n-bit GRAY counter advances correctly through 2^n states because only single copies of the initial signals are present. That is to say, the computation relies on the fact that certain signal molecules are absent during certain steps in the computation. This property cannot be guaranteed, for instance, if the initial signal multiset were duplicated multiple times within the same reaction volume. Consider a reaction that consumes a signal present in the initial signal multiset. It is not necessarily true that the same reaction will next be repeated, multiple times, in order to consume all of the additional copies present due to the duplication of the initial signal multiset. As we will show, ensuring that a signal is consumed in all copies present in the same reaction volume is not generally possible.
In Section 4.3 we show that if Θ(n) copies of the initial signal multiset are present, then the counter does not advance properly in a very strong sense: the final state of the counter can be reached in just O(n^2) chemical reactions, rather than via the intended sequence of 2^n reactions. This result applies more generally and shows limits on molecule recycling when multiple copies of the initial signal multiset are present, under some restrictions on the allowable CRNs. In particular, if the size of the initial signal multiset of such a CRN is logarithmic in the length of a valid computation, then the CRN can produce any signal in a polynomial number of steps when a linear number of copies of the initial signal multiset are present. We give a stronger result for tagged CRNs by showing that they cannot perform a computation super-linear in the size of their initial tag multiset when two copies of the initial signal and tag multisets are present in the same reaction volume.

4.1.4 Related work

Qian et al. [98] showed how to simulate a stack machine using strand displacement systems. A binary counter can be implemented via a stack machine; we call such a counter a QSW (Qian-Soloveichik-Winfree) counter and we compare its properties and resources with those of our counters in Section 4.2.5. Their construction performs logically reversible computation and can also use fuel to bias the computation toward the final state. We compare our results to a fuel-biased QSW counter as the unbiased version is slower: it performs an unbiased random walk along the computation chain, similar to our construction. We also assume that all fuel, or transformers, must be initially present in the reaction volume, which we assume is a closed system. Building on models of Winfree and Rothemund [106, 141], Reif et al. [102] studied a tile-based graph assembly model in which tiles may both adhere to and be removed from a tile assembly.
In their self-destructible graph assembly model, the removal of tiles allows for the possibility of tile reuse. The authors demonstrate that tile reuse is possible in an abstract tile model, via a PSPACE-hardness result. Doty et al. [30] showed a negative result on tile reuse for an irreversible variant of the model of Reif et al. Kharam et al. [60] describe a DNA binary counter in which bit values are represented using relative concentrations of two molecule species. This is very different from our work, where the values of bits (0 and 1) are represented by the absence or presence of certain signal molecules. As their counter relies on concentrations of molecules, it cannot be space efficient, as shown by our results in Section 4.3.

4.2 GRAY: a binary reflecting Gray code counter

Here we describe the Chemical Reaction Network (CRN) and DNA Strand Displacement system (DSD) implementation of our GRAY counter, provide a proof of its correctness, and analyze its expected time and space usage. We also show how it can be modified to use only bi-molecular reactions, resulting in our fixed-order GRAY counter: GRAY-FO.

4.2.1 Chemical reaction network for the GRAY counter

We generalize the 3-bit GRAY counter of Section 4.1.2 to n bits. The counter state is represented by n signal molecules, one per bit. Presence of the signal molecule b_i denotes that the i-th bit has value b, for b = 0 or b = 1. Initially, the state is 0_n ... 0_2 0_1. Each possible state of the counter represents a value in the Gray code sequence. The counter is described abstractly by the following chemical reaction equations (catalyst signals appear on both sides):

(gc-1) 0_1 ⇌ 1_1
(gc-i) 0_i + 1_{i−1} + 0_{i−2} + ··· + 0_1 ⇌ 1_i + 1_{i−1} + 0_{i−2} + ··· + 0_1, for 2 ≤ i ≤ n

All CRNs we propose are tagged Chemical Reaction Networks (tagged CRNs) and therefore account for the space of required transformers.
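The family of equations (gc-1) through (gc-n) can be generated and checked mechanically for small n (a sketch; the set encoding is ours, with catalysts included on both sides and tags omitted):

```python
def gc_reactions(n):
    """Equations (gc-1)..(gc-n) as (consumed, produced) signal sets;
    catalyst signals appear in both sets, tag symbols are omitted."""
    rxns = [({"0_1"}, {"1_1"})]
    for i in range(2, n + 1):
        # Catalysts: last state of the (i-1)-bit sequence, 1_{i-1} 0_{i-2} ... 0_1.
        cats = {f"1_{i - 1}"} | {f"0_{j}" for j in range(1, i - 1)}
        rxns.append(({f"0_{i}"} | cats, {f"1_{i}"} | cats))
    return rxns

# Counting catalysts, (gc-i) has i reactants and i products:
for i, (lhs, rhs) in enumerate(gc_reactions(6), start=1):
    assert len(lhs) == i and len(rhs) == i
```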
However, to simplify the reaction equations, and since each reaction has a unique transformer in our various implementations, we omit the actual tag symbols.

Lemma 13. The above CRN ensures that the n-bit GRAY counter correctly advances through the 2^n states of the binary reflecting Gray code sequence, if each reaction is atomic (see footnote 15) and all initial signal molecules exist as single copies. Furthermore, to advance through all counter states, each reaction is applied alternately in the forward and reverse direction.

Footnote 15: "Atomic" is standard computer science terminology for something that occurs as if all at once, hearkening back to the original Greek etymology of an atom as an indivisible unit. Reasoning about chemical reactions as computational processes can unfortunately result in clashes in terminology.

Proof. By induction on the number of digits. The claims are vacuously true for n = 0. Assume they are true for a counter with i − 1 bits and consider the construction of the i-bit counter. The initial signal multiset is the same as that of the (i − 1)-bit counter, except that it contains the signal 0_i. It also contains one additional reaction, to flip the i-th bit. However, this reaction cannot occur for the first time until the signals 1_{i−1}, 0_{i−2}, ..., 0_1 are present as catalysts. These are exactly the signals for the last state of the (i − 1)-bit counter. Thus, since the (i − 1)-bit counter is correct by the induction hypothesis, the first 2^{i−1} reactions are exactly those of the entire (i − 1)-bit counter sequence. The i-th reaction can then be applied; otherwise the computation reverses to a previous state. Since the i-th reaction does not consume any signals for bits less than i, the entire reaction chain of the (i − 1)-bit counter is reversed (as it does not interact with the i-th bit); otherwise the computation would reverse to a previous state.
Importantly, the 0_i signal was consumed, and since it was present as a single copy, the i-th reaction cannot be applied again once the (i − 1)-bit reaction chain has begun reversing. Since the reactions alternated in the forward and reverse direction in the (i − 1)-bit counter, they continue the alternation when the chain is reversed, as the first reaction after the i-th bit is flipped is the reverse of the last reaction prior to flipping the i-th bit. Overall, the i-bit counter correctly advances through 2(2^{i−1}) = 2^i states and alternates the direction of reactions when signal molecules are present as single copies.

4.2.2 DNA strand displacement implementation of the GRAY counter

Recall from Section 1.2.5 and Theorem 1 that the QSW construction is capable of simulating any tagged CRN by a space efficient DSD. Unfortunately, the construction does not simulate higher-order reactions atomically, since some product signal molecules can initiate other reactions before all product signal molecules are produced. However, the toehold mediated strand displacements do occur in a fixed order, and all reactant signal molecules are consumed before any product signal molecule is produced. We exploit this fact to simulate atomicity. In particular, we borrow the concept of transactions from database and concurrency theory: a group of operations that either completes or does not complete in its entirety, and does not interfere with any other transaction. We implement transactions using a simple synchronization primitive: a mutex. A transaction must acquire the mutex in order to start, and releases it only when it completes. This is analogous to processes blocking when another process is in a critical section (which by definition must appear atomic). We consider the state of our counter to be defined only when the mutex is available. More precisely, let µ denote a single copy of a signal molecule species representing the mutex.
In any sequence of strand displacements representing a chemical reaction, µ is the first reactant to be consumed and the last product to be produced. Therefore only one chemical reaction (transaction) can be in progress at any given time. When µ is next available, either all strand displacements in the sequence took place and the counter is in a new state (the transaction succeeded), or the counter is in the same state and the configuration of all molecules is exactly as it was prior to the reaction beginning (the transaction failed). Since each reaction is implemented as a transaction, it appears atomic and cannot interfere with other reactions. An example of the signal molecules and the transformer associated with the forward direction of the reaction 0_1 ⇌ 1_1, which requires the availability of the mutex signal µ, is given in Figure 4.5. Contrast this with the implementation of the same reaction that does not use a mutex signal in Figure 1.8.

Figure 4.5: An example of signal molecules (top two left strands) and the transformer, consisting of auxiliary strands (top two right strands) and a saturated template strand (bottom complex), associated with the forward direction of reaction equation 0_1 ⇌ 1_1 when a mutex is required. In this and later figures, the Watson-Crick complement of a domain x is denoted by x*.

As previously discussed, the reaction can only initiate if the signal molecule µ is present, and can only complete if all other reactants (in this case 0_1, assuming a forward reaction) are available. An example of the sequence of strand displacements for the reaction 0_1 ⇌ 1_1 is given in Figure 4.6. The reaction proceeds from top to bottom in the forward direction and from bottom to top in the backward direction. The transformers that implement the i-th reaction (gc-i) are a straightforward generalization of the first reaction.
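Abstractly, the mutex discipline amounts to adding µ as the first reactant and the last product of every reaction, so that at most one transaction can be in flight at a time (a minimal sketch of the idea; the encoding of reactions as ordered reactant/product lists is ours):

```python
def with_mutex(reactions, mu="mu"):
    """Augment each (reactants, products) pair with the mutex signal.
    mu is listed first among reactants (consumed first) and last among
    products (produced last), so at most one reaction (transaction)
    can be in progress at any time."""
    return [([mu] + list(r), list(p) + [mu]) for r, p in reactions]

# The reaction 0_1 <-> 1_1 of the GRAY counter, once a mutex is required:
plain = [(["0_1"], ["1_1"])]
print(with_mutex(plain))  # [(['mu', '0_1'], ['1_1', 'mu'])]
```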
As before, the signal molecule µ must initiate the first strand displacement, and µ is not produced until the last strand displacement. The number of required intermediate strand displacement reactions depends on the number of reactants and products; specifically, the i-th reaction requires 2i + 2 strand displacements to complete. An example of the transformer for the i-th reaction is given in Figure 4.7. We now formally prove that the general DSD construction of Theorem 1 can be augmented with a mutex to ensure that all reactions occur as transactions: a sequence of toehold mediated strand displacements, the first of which consumes a mutex strand and the last of which produces a mutex strand, while no other displacement of the sequence produces a mutex strand. Our primary motivation in this CRN to DSD conversion is to ensure (i) that all reactions occur as transactions, and (ii) that the space complexity of the resulting DSD remains polynomially bounded by the space complexity of the original CRN. However, we note that this construction inherently enforces serial computation; that is, only a single reaction can occur at any one time. By design, our CRNs are meant to operate in this manner. However, this conversion could slow down other CRNs, designed to perform many parallel reactions, by a factor of O(v) in the worst case, where v is the size of the reaction volume.

Theorem 12. Any logically reversible tagged CRN requiring O(s) space can be simulated by a DSD in O(poly(s)) space, while ensuring that all chemical reactions occur as transactions (and therefore appear atomic), assuming all strand displacements are legal.

Proof.
Figure 4.6: The sequence of strand displacement events for the reaction equation 0_1 ⇌ 1_1 when a mutex signal µ is required. The mutex is the first signal to be consumed and the last to be produced, in either reaction direction. Otherwise, the reaction cascade proceeds exactly as before, as dictated by the QSW construction.

Let C = (S, R, S_0, s_end, T, T_0) be any logically reversible tagged CRN requiring O(s) space. We create an augmented logically reversible tagged CRN of C, called C', as follows. We add the mutex signal species µ to S and one signal molecule of µ to S_0. For each reaction equation (I, P) in R, we add µ to both I and P. Note that we have only increased the number of reactants and products of each reaction by a constant, and have only added a constant number of new signal molecules to the initial signal multiset. We construct a DSD D of C' using the QSW construction of Theorem 1, establishing most of the claim. All that remains is to show that the addition of the mutex signal forces each strand displacement cascade to occur as a transaction. We argue by induction on the sequence of chemical reactions of the original CRN C. Since C is logically reversible, there is only one valid sequence of chemical reactions. Prior to any displacement simulating a chemical reaction, we will ensure the following invariant holds: (i) all template strands of all
transformers are saturated and require the mutex signal molecule µ to initiate the first strand displacement, and (ii) there is exactly one available copy of µ. The invariant is trivially satisfied for the base case, when no reaction has yet occurred.

Figure 4.7: An example of the signal molecules and the transformer molecules for the i-th reaction. The counter is in state b_n ... b_{i+1} 0_i 1_{i−1} 0_{i−2} ... 0_1.

Suppose the first i − 1 reactions appear atomic, and the invariant is satisfied. Without loss of generality, suppose the next attempted reaction involves the k-th transformer. Because we assume that all strand displacements are legal, no auxiliary strand or signal strand other than µ can displace any strand in any transformer. Since there is exactly one available copy of the mutex signal species µ, that strand alone can initiate a reaction. Suppose the reaction is in the forward direction, as the reverse direction is symmetric. The signal molecule µ must initiate the first strand displacement by binding to the left end of the k-th transformer's template strand. This begins the transaction. Note that there is another copy of µ sequestered at the right end of the template. When the signal strand µ is once again produced, there are two cases to consider.

Case 1. If the copy on the right end of the transformer is released, then the transaction succeeded. It now appears that all input strands (the reactants) have been consumed, and all output strands (the products) have been produced. Furthermore, the invariant is preserved as (i) the k-th transformer is saturated, and only a signal strand µ can initiate a new reaction on the right end of the template, and (ii) exactly one signal strand µ was produced as the final strand displacement.

Case 2.
Otherwise, the original copy of µ was released, the transaction failed, and the system is in the same state as before the reaction had begun, satisfying the invariant, as any intermediate strand displacements must have been reversed prior to the original µ signal strand being released. Importantly, whether or not a transaction succeeds, while one is in progress no other reaction can be initiated, since no other copy of the signal strand µ is available. Thus, all reactions are implemented as transactions and appear atomic.

4.2.3 Space and expected time analysis of the GRAY counter

Here we analyze the space (the total number of nucleotide bases of all required strands in the reaction volume) and the expected time of the GRAY counter as it advances from its initial to its final state. We assume single copies of the initial signal, transformer, and mutex strands. Importantly, since reactions occur alternately in the forward and reverse direction according to Lemma 13, only a single copy of each reaction transformer is necessary. We note that our space analysis carries an assumption: Θ(n) domains of the signal species can be designed to have length Θ(n) such that only legal displacements occur for the duration of the counter. This seems a reasonable assumption in light of existing results from coding theory. Schulman and Zuckerman [114] show how to construct a set of 2^{Θ(n)} domains (i.e., binary strings in their code) of equal length Θ(n) such that the energy barrier (Levenshtein distance) between any pair of domains is at least cn, for any given constant c. However, while a long domain of length Θ(n) may be sufficient to avoid illegal displacements, it may not be necessary. It may be the case that this bound is loose, and domains of length Θ(log n) are sufficient.

Lemma 14.
Assuming long domains have length Θ(n), the total number of nucleotide bases needed for a single copy of each initial signal, transformer, and mutex species of the n-bit GRAY counter is Θ(n^3).

Proof. Each signal strand 0_i and the initial mutex strand µ is composed of a toehold and two long domains. The same is true of the strands for states 1_i and the sequestered signal strands µ that are part of the initial transformer species. There are auxiliary transformer strands consisting of one toehold and one long domain for each type of signal species. We choose the toehold length to be Θ(1). Since the domain length dominates the toehold length, the total number of bases in all signal species and auxiliary strands is Θ(n^2). The template strands for the i-th transformer have Θ(i) domains, which dominate their length, and therefore have length Θ(in). Thus, the total number of bases in all transformer template strands in the system is Σ_{i=1}^{n} Θ(in) = Θ(n^3).

Next consider the expected time for the counter to progress from its initial to its final state. Other than introducing the concept in Chapter 1, we have thus far ignored the rate of reactions in a chemical reaction network (see footnote 16). Briefly, in the DSD implementations of all networks proposed in this thesis, reactions always occur between two species present as single copies in the reaction volume. If the reaction volume has size V, the bimolecular reaction rate involving these two single copy species (i.e., the rate at which they find each other and interact) is 1/V.

Footnote 16: For a detailed overview of chemical reaction rates, particularly for strand displacement systems, the reader is referred to the PhD thesis of David Yu Zhang [149].

Lemma 15. Assuming a single copy of each initial signal, transformer, and mutex species, and that all strand displacements are legal and all reactions
occur as transactions (appear atomic), the GRAY counter advances through the 2^n states of the binary reflecting Gray code sequence in Θ(n^3 2^{2n}) expected time.

Proof. We assume that reactions occur in a volume of size Θ(n^3), since this is the total number of nucleotides required to represent all strands of the system. Each strand displacement step involves interaction between two strand species, and thus the rate of each strand displacement step is 1/Θ(n^3). First, consider the shortest path from the initial state to the final state. On this path, each order-i reaction is applied 2^{n−i} times and involves Θ(i) strand displacements. Thus the total number of strand displacement steps along the shortest path is Σ_{i=1}^{n} Θ(i) 2^{n−i} = Θ(2^n). Because each reaction is reversible, the system does not strictly follow the shortest path but rather proceeds as an unbiased random walk along the logically reversible computation chain. The expected number of steps for a random walk to reach one end of a length-Θ(2^n) path from the other is Θ((2^n)^2) = Θ(2^{2n}) [32]. Therefore, the expected number of strand displacement steps is Θ(2^{2n}). Since each strand displacement step occurs at a rate of 1/Θ(n^3), the overall expected time is Θ(n^3 2^{2n}). Note that the expected time is polynomial in the Ω(2^n) steps required to proceed through all 2^n unique states of an n-bit binary reflecting Gray code counter. Combining Lemmas 13 through 15 and Theorem 12, we have the following result.

Theorem 13. An n-bit binary reflecting Gray code counter can be implemented as a DNA strand displacement system that proceeds through the 2^n unique states of the binary reflecting Gray code sequence in Θ(n^3 2^{2n}) expected time and uses only Θ(n^3) nucleotides (space).

4.2.4 A fixed order implementation of the GRAY counter

An n-digit GRAY counter can perform a computation having length exponential in n, while using only space polynomial in n.
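The quadratic hitting time used in the proof of Lemma 15 can be checked exactly for small chains: for an unbiased walk on a path with L edges, reflecting at the start, the expected number of steps to first reach the far end is L^2. A small sketch solving the hitting-time equations with exact arithmetic (the solver and its naming are ours):

```python
from fractions import Fraction

def expected_steps(L):
    """Expected steps for an unbiased +-1 walk on states 0..L to first
    reach L from 0, with reflection at 0.  Solves h_0 = 1 + h_1 and
    h_k = 1 + (h_{k-1} + h_{k+1})/2 for 1 <= k < L, with h_L = 0,
    exactly, by writing h_k = a_k + b_k * h_{k+1} and back-substituting."""
    half = Fraction(1, 2)
    coeffs = [(Fraction(1), Fraction(1))]       # h_0 = 1 + 1*h_1
    for _ in range(1, L):
        a, b = coeffs[-1]
        denom = 1 - b * half
        coeffs.append(((1 + a * half) / denom, half / denom))
    h = Fraction(0)                             # h_L = 0
    for a, b in reversed(coeffs):               # back-substitute down to h_0
        h = a + b * h
    return h

# Hitting time is quadratic in the chain length, as used in Lemma 15:
for L in (1, 2, 8, 32):
    assert expected_steps(L) == L * L
```

For a chain of Θ(2^n) states this quadratic dependence is exactly the Θ(2^{2n}) expected step count quoted above.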
However, the construction relied on template strands containing O(n) domains, each of length O(n), resulting in an overall strand length of O(n^2) nucleotides. Synthesis of long nucleic acid strands is challenging, and the fidelity of synthesized strands generally decreases as sequence length increases. For this reason, it is desirable to bound the length of all strands in the system to O(n) bases. We now briefly describe how a template strand from the GRAY counter consisting of 2i + 2 domains can be split into i + 1 template strands requiring 4 domains each, for any i > 1. The overall space will only be increased by a constant factor, resulting in the same volume, and thus the same expected time.

To simplify the description we introduce some notation. Consider the (gc-i) reaction of the GRAY counter which has, including catalysts, i reactants and i products:

    0_i  <-->  1_i,   catalyzed by 1_{i-1} + 0_{i-2} + ··· + 0_1

Theorem 12 demonstrated that by using the QSW construction and introducing a mutex species µ (thus creating an order i + 1 reaction), chemical reactions occur as transactions and therefore appear atomic. Specifically to our counter, µ is first consumed, then 0_i, then 1_{i-1}, and so on. Likewise, after all reactants are consumed, 1_i is first produced, then 1_{i-1}, and so on, until finally µ is produced. We denote a strand displacement implementation supporting a transaction of this type, which is initiated by consuming a mutex α, and terminated when producing a mutex β, by:

    [ α + 0_i  --(1_{i-1} + 0_{i-2} + ··· + 0_1)-->  1_i + β ]

In the case of the GRAY counter, α = β = µ. Our goal is to convert this order i + 1 reaction into a cascade of i + 1 bi-molecular reactions, while preserving the appearance of atomicity. Using the above notation, we implement the following reaction cascade:

    [ µ          --(1_{i-1})-->  α_i^1 ]
    [ α_i^1      --(0_{i-2})-->  α_i^2 ]
        ...
    [ α_i^{i-2}  --(0_1)-->      α_i^{i-1} ]    (catalysts checked in sequence)
    [ α_i^{i-1} + 0_i  -->  1_i + α_i^i ]
    [ α_i^i  -->  µ ]                           (i-th bit flipped and µ released)

The overall transaction has been split into a cascade of sub-transactions. Each sub-transaction is implemented as a bi-molecular reaction using Theorem 12 (based on the QSW construction). The first i − 1 sub-transactions check, in sequence, that all i − 1 catalysts are present. The mutex signal molecule µ is consumed during the first check. The last two sub-transactions first perform the bit flip and then release the mutex signal molecule. Every sub-transaction, except the last two, produces a unique mutex signal species that is required to initiate the next sub-transaction in the cascade. Upon successful completion of the first i sub-transactions in the cascade, the final sub-transaction occurs, producing the original mutex signal species µ, and thus finalizing the overall transaction. Note that in all cases, once the transaction has begun and before it completes, the original mutex signal µ is absent, and therefore no other reaction cascade can commence.

The implementation works in the reverse direction in a similar way, with the exception that once the original mutex signal µ is consumed, the bit is flipped first and the mutex signal strand µ is released only after the presence of all catalysts has been verified. As in the forward case, the mutex strand µ is absent until the entire transaction either completes or reverses, so no other reaction in the system can occur. Thus, flipping the bit prior to verifying that all catalysts are present does not affect the correctness of the system. In the case that not all catalysts are present, the transaction cascade cannot complete, and will necessarily reverse. Applying the above transformation to all higher-order reactions in the original GRAY counter implementation results in a new, fixed order counter, GRAY-FO.
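The atomicity of one forward cascade can be illustrated with a toy multiset simulation. This is a minimal Python sketch only: the species names "mu", "0_i", "1_i" and the all-or-nothing reversal are modelling assumptions, not the strand-level mechanism:

```python
from collections import Counter

def run_cascade(present: Counter, catalysts: list, i: int) -> bool:
    """Attempt the (gc-i) transaction as a chain of bimolecular sub-steps:
    sequester the mutex, check each catalyst in sequence, then flip bit i
    and re-release the mutex. Returns True if the cascade committed, or
    False if a missing catalyst forced it to reverse (restoring the state)."""
    if present["mu"] == 0:
        return False                 # another transaction is in progress
    present["mu"] -= 1               # mutex sequestered: no other cascade
    for c in catalysts:              # sub-transactions: catalysts checked
        if present[c] == 0:          # missing catalyst: cascade must reverse
            present["mu"] += 1       # undo, releasing the mutex again
            return False
    present[f"0_{i}"] -= 1           # final sub-transactions: flip bit i ...
    present[f"1_{i}"] += 1
    present["mu"] += 1               # ... and release the original mutex
    return True
```

For example, with state {mu, 0_3, 1_2, 0_1}, the (gc-3) cascade with catalysts [1_2, 0_1] commits and yields {mu, 1_3, 1_2, 0_1}; if 1_2 were absent, the state would be left unchanged.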
4.2.5 Comparison with another molecular counter

Table 4.1 summarizes properties of our counters and compares them with another counter, which we call QSW, based on the work of Qian et al. [98] (see Section 4.1.4). The properties considered are (i) order, the maximum number of reactants or products of the chemical reactions that describe the counter; (ii) space, the total number of nucleotides needed to implement the counter; and (iii) expected time for the counter to reach a designated final state from its initial state when the volume equals the space. We describe how order, space and expected time grow as a function of n, the number of counter bits. In all cases, we make the assumption that the length, in bases, of long domains is Θ(n).

First, consider the QSW counter. Qian et al. [98] showed how to simulate a stack machine using strand displacement systems, and a binary counter can be implemented via a stack machine. An n-bit implementation of the QSW counter advances deterministically through 2^n states and uses reactions of order 2 (some of which involve polymer extension reactions that realize the stack). The transformer molecules used in the strand displacement realizations of these reactions can serve as fuel, biasing the reactions so that the counter advances. We analyze the biased version of the counter; the unbiased version is slower. The expected number of reactions for the biased counter to advance to its final state is Θ(2^n). Each reaction consumes a constant number of molecules, so the overall expected consumption, or waste, is Θ(2^n). The expected time depends on the volume in which the reactions take place. If all strands consumed are initially present in the reaction volume, then the volume is Θ(2^n) and thus each step takes expected time Θ(2^n), leading to an overall expected time of Θ(2^{2n}).
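The unbiased random-walk bound used in the analysis of the GRAY counters (expected Θ(L^2) steps to traverse a length-L reversible chain) can be checked empirically with a small simulation. This is illustrative Python only, not part of any construction; here L stands in for the Θ(2^n) strand displacement steps of the chain:

```python
import random

def hitting_time(L: int, rng: random.Random) -> int:
    """Steps for an unbiased +/-1 walk on {0,...,L} (reflecting at 0),
    started at 0, to first reach L: the expectation is Theta(L^2)."""
    pos, steps = 0, 0
    while pos < L:
        pos += 1 if pos == 0 or rng.random() < 0.5 else -1
        steps += 1
    return steps

# Average over many trials and compare against L^2. The analysis of the
# GRAY counter treats its reversible computation chain the same way.
rng = random.Random(0)
L, trials = 16, 200
mean = sum(hitting_time(L, rng) for _ in range(trials)) / trials
print(f"mean hitting time {mean:.0f} vs L^2 = {L * L}")
```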
Our n-bit binary reflecting Gray code counter, GRAY, uses reactions of maximum order Θ(n), generates only Θ(n^3) waste and uses expected time Θ(n^3 2^{2n}) to reach the final state. Our GRAY-FO counter improves on the GRAY counter in that the reaction order is Θ(1). The QSW counter also has reaction order Θ(1) and has expected time Θ(2^{2n}), which is somewhat better than the expected time needed by our counters. However, the QSW counter generates Θ(2^n) waste, exponentially worse than our counters. All three counters are deterministic in that they advance and retreat through a predetermined linear ordering of states (i.e., they are logically reversible).

              Reaction order   Space (in nucleotides)   Expected time
    GRAY      Θ(n)             Θ(n^3)                   Θ(n^3 2^{2n})
    GRAY-FO   Θ(1)             Θ(n^3)                   Θ(n^3 2^{2n})
    QSW [98]  Θ(1)             Θ(2^n)                   Θ(2^{2n})

Table 4.1: Comparison of n-bit counter implementations. The GRAY and GRAY-FO counters described in this section are compared with the QSW counter, which is based on the simulation of stack machines by strand displacement reactions of Qian et al. [98].

4.3 Limits on molecule recycling in chemical reaction networks

In this section, we show that all chemical reaction networks that efficiently recycle strands, or that can perform useful computations for a number of steps that significantly exceeds the number of signal molecules, are not deterministic when multiple copies of the initial signal molecules are present. The underlying problem is the representation of the state of the network as specific combinations of signal molecules. If there are multiple copies of the network in the same reaction volume, as would typically occur in a laboratory setting, then the states of the different copies may interfere with one another, a process we call crosstalk. To illustrate this point, we again consider the 3-bit GRAY counter.
Initially, in a single copy of the construction, the signal molecules {0_3, 0_2, 0_1} denote the state 0_3 0_2 0_1. Consider a two-copy network where the initial multiset of present signal molecules is duplicated, yielding the multiset {0_3, 0_3, 0_2, 0_2, 0_1, 0_1}. (We also assume a duplicate multiset of transformers is available.) As in the single copy case, assume reaction (1) occurs in the forward direction, followed by reaction (2) in the forward direction. The resulting multiset of signal molecules is {0_3, 0_3, 0_2, 1_2, 0_1, 1_1}. In the single copy case, we intend that reaction (1) in the reverse direction will occur next; however, given the current multiset of present signal molecules in the two-copy case, reaction (3) in the forward direction could instead occur, resulting in the multiset {0_3, 1_3, 0_2, 1_2, 0_1, 1_1}. At this point, a copy of every signal molecule is present, and any reaction can occur, in either direction. Furthermore, the single copy case required at least seven reactions to produce the final state 1_3 0_2 0_1, whereas the two-copy case can reach it in three. Crosstalk between the copies has broken the counter. (We note that while significantly more challenging, single-molecule experiments are possible; this is discussed in more detail in Chapter 6.)

Recall the formal definition of a CRN C = (S, R, S_0, s_end) and related concepts from Section 1.2.2. In addition, we use B_s to denote the bandwidth of signal species s, i.e., the maximum number of copies of s that appears in the multiset I of any reaction (I, P) ∈ R, and we use B_C to denote the bandwidth of C, i.e., the sum of the bandwidths of all signal species in S. An x-copy version of C, denoted C^(x), is obtained by duplicating the initial multiset S_0 x times, i.e., C^(x) = (S, R, S_0^(x), s_end), where S_0^(x) is a multiset consisting of x copies of S_0.

Theorem 14. Let C = (S, R, S_0, s_end) be a 1-proper chemical reaction network.
If there exists a trace that produces s_end in C, then for the x-copy chemical reaction network C^(x) with x ≥ B_C + 1, there exists a computation that produces s_end in at most (B_C + 1)B_C/2 + 1 steps.

Proof. Let ρ = R_1, . . . , R_m be a trace for a computation that produces s_end in the last step in the (single-copy) network C and let S_0, . . . , S_m be the corresponding sequence of multisets of signal molecules. Let S' be the multiset of signal molecules obtained by including w_s copies of each s ∈ S, where w_s ≥ 0 is the maximum number of copies of signal molecule s that appears in the multiset I of any reaction (I, P) of the sequence ρ. Note that |S'| ≤ B_C. Let k = |S' − S_0| and note that k ≤ B_C. The goal is to produce all signal molecules in the multiset S' − S_0, so that we can apply the last reaction of ρ and produce s_end. We construct a trace of the appropriate length for the multi-copy network from the trace ρ for the single-copy network.

The high-level structure of the proof is as follows. First, we project out from ρ the k reactions, in order, that first produce each of the molecules in the multiset S' − S_0. From that sequence, we build a trace of the multi-copy network that is the concatenation of k phases. Each phase adds one more signal molecule to the multiset of signal molecules present, preserves the presence of all signal molecules previously produced, and "consumes" one copy of the initial signal molecules in S_0. We will show that the j-th phase is at most j reactions long, so the total length of the trace producing S' − S_0 is bounded by Σ_{j=1}^{k} j = (k + 1)k/2 ≤ (B_C + 1)B_C/2.

We now formalize the construction of the k phases. Define the first appearance of the c-th copy of signal molecule s to be in S_i if there are at least c copies of s in multiset S_i and fewer than c copies of s in each of S_0, S_1, . . . , S_{i−1}. Let s_1, . . . , s_k be the sequence of signal molecules (with multiplicities) from S' − S_0 in order of their first appearances in S_1, . . .
, S_m, and let R_index(s_j) be the reaction in ρ which first produced this copy of s_j. In other words, R_index(s_j) is the reaction that produced the first appearance of s_j (where s_j is the c-th copy of some signal molecule s, for some c). The k phases will produce the signal molecules in S' − S_0 exactly in this order: signal molecule s_j will be produced in phase j. Each phase j will consist of several reactions, numbering 0 to j, which will produce s_j without removing any other signal molecule from the multiset S' − S_0, though they may remove one signal molecule from S_0. This is replenished by adding one new copy of S_0 at the beginning of the phase. To find the sequence of these reactions, we work backwards. Assuming s_j is not already present in the current multiset of signal molecules, we use reaction R_index(s_j) to produce s_j. As a result, we might have removed one of the signal molecules s_1, . . . , s_{j−1}, say s_i. If that is the case, we repeat the process of producing signal molecule s_i, i.e., repeating the reactions of phase i.

The k phases are constructed to maintain three invariants:

1. After the j-th phase, the multiset of signal molecules contains the multiset {s_1, . . . , s_j}.
2. The trace constructed so far has not relied on the existence of more than j copies of the initial signal molecules S_0.
3. For each i ≤ j, the i-th phase has used at most i reactions.

The invariants are vacuously true initially (before any phases). Assuming they are true after j − 1 phases, we construct the j-th phase as follows. If s_j is already present in the current multiset of molecules, we do nothing; it was fortuitously produced in an earlier phase. Otherwise, the first reaction in the phase is R_index(s_j), the reaction that produced s_j for the first time. We know this reaction can be applied because all of {s_1, . . . , s_{j−1}} are available, as well as the j-th copy of S_0.
This guarantees that the multiset now contains s_j, and that we have relied on only j copies of S_0. However, since the network is 1-proper, the reaction consumed at most one input signal molecule. If the reaction consumed no molecules, or if the one molecule it consumed is in S_0, the invariants are maintained and the phase ends. Otherwise, the reaction consumed some s_i, where i < j. To restore s_i to the multiset, we repeat the sequence of reactions of the i-th phase. Note that this is valid since the new copy of S_0 has not yet been used and all signal molecules required for phase i are still present. The number of reactions of phase j can be bounded by the number of reactions of phase i plus one. By the induction invariant the i-th phase requires at most i reactions, and since i < j, the j-th phase requires at most j reactions.

Concatenating the k phases produces a trace for the k-copy chemical reaction network C^(k) which produces all of {s_1, . . . , s_k} within (k + 1)k/2 reactions. If s_end is not in S' − S_0 = {s_1, . . . , s_k}, then S' contains all inputs needed for the last reaction in ρ that produces s_end. Thus to produce s_end we might need one additional copy of the initial multiset and one additional step. Since k ≤ B_C, the result follows.

Note that Theorem 14 is much stronger than our intuitive notion of crosstalk short-circuiting a computation. It states that with only a linear number of copies, any signal molecule can be produced by a computation of at most quadratic length. Although it applies only to 1-proper networks, it is sufficient to show that the GRAY counter does not work correctly when enough copies are present. Furthermore, since there is a direct transformation between GRAY-FO and GRAY, it also demonstrates that our GRAY-FO counter is not robust in the multi-copy setting.

We can formalize the intuitive notion of short-circuiting. A network C is x-copy-tolerant if, for all s ∈ S, the length of the shortest trace to produce s in C and in C^(x) is the same.
A network is copy-tolerant if it is x-copy-tolerant for all x. The result of Theorem 14 pertains only to a restricted class of chemical reaction networks. This raises the question of whether it is possible to design a copy-tolerant network which is not 1-proper and that performs a computation exponential in its size. As we are only interested in deterministic chemical reaction networks that properly account for space when representing DSDs (i.e., tagged chemical reaction networks), we can achieve a tighter bound by focusing on the specific class of networks we study in this thesis.

Theorem 15. For any tagged chemical reaction network C = (S, R, S_0, s_end, T, T_0), if there is a deterministic computation that produces s_end in t > |T_0| steps, then the 2-copy network C^(2) is not deterministic after 2|T_0| − 2 steps.

Proof. Let ρ = R_1, R_2, . . . , R_t be the deterministic trace of reactions from the original network C, which induces a corresponding sequence of multisets S_0, S_1, . . . , S_t, with s_end ∈ S_t. For convenience, we let R_k denote the reaction applied at step k, for 1 ≤ k ≤ t. Since reaction tags are counted in the initial tag multiset T_0, there must exist some reaction, previously applied in the forward direction at step i, that is the first to be applied in the reverse direction at some later step j ≤ |T_0| + 1; otherwise, all the tags would be consumed and the computation would halt within |T_0| steps. Let R_i = (I_i, P_i) denote the forward version of this reaction, which consumes the multiset X = I_i − P_i and produces the multiset Y = P_i − I_i. Thus, R_j consumes Y and produces X. Note that P_i ⊆ S_i.

Next consider whether i = j − 1 is possible. Suppose it were. The multiset of signals prior to applying R_{j−1} is S_{j−2}, and it is S_{j−2} − X + Y afterward. Next applying reaction R_j, which consumes Y and produces X by definition, results in the multiset S_{j−2}.
The computation is now stuck in a length-2 cycle and has advanced through at most j − 1 ≤ |T_0| new states. Therefore, R_{j−1} = (I_{j−1}, P_{j−1}) is not the reverse of R_j. Note that I_{j−1} ⊆ S_{j−2}.

Now we construct a trace ρ' for the network C^(2). Let the first j − 2 reactions be the same as in ρ. Thus, the resulting multiset after j − 2 steps is S'_{j−2} = S_{j−2} + S_0; otherwise, the original trace ρ would not be valid. Next, append the first i reactions of ρ. By the same reasoning, this results in the multiset S'_{j−2+i} = S_{j−2} + S_i. Since I_{j−1} ⊆ S_{j−2} ⊆ S'_{j−2+i}, reaction R_{j−1} can be applied. Since P_i ⊆ S_i ⊆ S'_{j−2+i}, reaction R_j, the reverse of R_i, can also be applied. Since one is not the reverse of the other, the computation is not deterministic after step j − 2 + i ≤ 2j − 4 ≤ 2|T_0| − 2. (In the above, without loss of generality, we assumed that all reactions in a chemical reaction network are always applied first in the forward direction.)

The result of Theorem 15 states that no tagged chemical reaction network can remain deterministic for more than a linear number of steps when a second copy of the initial signal and tag multisets is present. This is formally stated in the following corollary.

Corollary 1. It is not possible to design a tagged chemical reaction network that performs deterministic computation within a number of steps that is superlinear in the size of the initial tag multiset and that is also 2-copy-tolerant.

4.4 Chapter summary

In this chapter we have introduced the concept of recycling, or molecule reuse, in strand displacement systems and chemical reaction networks. Our n-bit GRAY counters effectively use recycling to deterministically step through 2^n states while requiring space, or total number of nucleotides, of just O(n^3). Our GRAY counter strand displacement constructions also introduce the use of a mutex strand to ensure that higher-level chemical reactions are executed atomically.
Finally, we showed limits to recycling: for example, any signal molecule of our n-bit counter can be generated using just O(n^2) reactions when Θ(n) copies of the initial signal molecules share the same volume. We also showed that when even a second copy of the initial tag and signal multisets is present, no computation that uses transformers can remain deterministic after a linear number of steps.

One weakness of our counter construction is that the number of distinct domains needed is polynomial in n, the number of bits of the counter. In contrast, a QSW binary counter implemented via the stack machine of Qian et al. [98] uses just a constant number of distinct domains, independent of n. Is it possible to construct an n-bit counter that combines the best of the GRAY and QSW counters, i.e., that uses space polynomial in n and O(1) distinct domains? More generally, can all computation be realized by strand displacement systems whose space and expected time are within a (small) polynomial factor of the space and time of the computation? Our negative results suggest that any such systems must rely on exact molecular counts if they are to be deterministic.

Chapter 5

Space and energy efficient molecular programming

In the previous chapter, we demonstrated that space-efficient molecular programming with chemical reaction networks (CRNs) and DNA strand displacement systems (DSDs) is possible, in principle, by giving an implementation of a Gray code counter that performs a computation with a number of steps exponential in the required space. In this chapter, we ask: can any space efficient computation be realized by a space efficient CRN and DSD? We answer in the affirmative by showing how any problem in PSPACE can be solved by a logically reversible tagged CRN using polynomial space. We also demonstrate how this result can be extended to solve any space-bounded computation (i.e., all of SPACE).
Our CRN can be realized by a space and energy efficient DSD implementation. Not only do our results further characterize the computational power of CRNs and DSDs, they shed light on the complexity of a number of important related problems such as CRN and DSD model checking and verification [64, 65]. We show that even determining if an arbitrary state is reachable from an initial state of a CRN or DSD—a question that must be solved when verifying the correctness of a CRN or DSD—is PSPACE-hard. We show that the problem is PSPACE-complete for restricted classes of CRNs and DSDs. In this chapter we also return to reasoning at the sequence level by showing how our new results can be used to establish that the minimum energy barrier indirect folding pathway for multiple interacting strands problem (eb-ipfp-multi) is PSPACE-complete. (Content from this chapter appears in the proceedings of the 18th Annual International Conference on DNA Computing and Molecular Programming (DNA 2012) [131].)

5.1 Related work

As with the previously mentioned results related to CRNs and DSDs, we now highlight results related to the limits of logically reversible computation. An introduction to logically reversible and energy-efficient computation is given in Section 1.2.6; it is a topic explored in this chapter in the context of molecular programming. Charles Bennett's seminal work showed how any T(n) time-bounded Turing machine can be simulated by a logically reversible Turing machine, but it was space inefficient, as his reversible Turing machine simulation required Θ(T(n)) space [8]. He later improved the result, showing that any Turing machine computation using T(n) time and S(n) space can be simulated by a logically reversible Turing machine using O(S(n) log T(n)) space [9]. This proved that PSPACE equals ReversiblePSPACE [9]—the class of problems solvable by a logically reversible Turing machine that uses polynomial space.
The result has since been generalized to prove that SPACE equals ReversibleSPACE [70], demonstrating that, in principle, any space-bounded computation can be carried out by a space and energy efficient computation. Until recently it remained unclear whether a physical system could realize logically reversible computation. In perhaps one of the most important theoretical results in the field of molecular programming, Qian et al. [98] gave a DSD implementation of a stack machine capable, in principle, of energy efficient Turing universal computation. However, as with Bennett's seminal work, their implementation requires space proportional to the number of steps in the computation, as it consumes fuel (transformer) molecules to drive the overall process forward.

5.2 Preliminaries

Definitions of DNA strand displacement systems (DSDs) and chemical reaction networks (CRNs) are given in Section 1.2. In this chapter, as in the last, we reason exclusively about tagged CRNs; however, to simplify the presentation, we omit tags from reaction equations. CRNs can be implemented by DSDs in a number of ways [72, 98]. We leverage the implementation from Theorem 12, which relies on the assumption that certain signals occur only as a single copy within the reaction volume. A single-copy mutex species is used to ensure that a strand displacement cascade implementing any particular reaction occurs as a transaction and therefore appears atomic: either the entire cascade implementing a reaction succeeds, or it returns to the state prior to beginning the cascade. Importantly, the mutex molecule is sequestered during the cascade and therefore another reaction cannot begin.

Finally, we formally define the problems we study in this chapter. The first two problems ask whether certain states are reachable within a CRN or DSD and are the basis for formal verification of these systems.
Recall that the state of a CRN (DSD) is the current composition of free signal molecules (strands).

Problem 7. crnR (CRN reachability)
Instance: A chemical reaction network with initial state S_init and an arbitrary state S'.
Question: Is S' reachable from S_init?

Problem 8. dsdR (DSD reachability)
Instance: A DNA strand displacement system with initial state S_init and an arbitrary state S'.
Question: Is S' reachable from S_init?

In this chapter we will construct a CRN that can solve any instance of the totally quantified 3-satisfiability problem defined below.

Problem 9. q3sat (Totally quantified 3-satisfiability)
Instance: A totally quantified Boolean formula ψ of n variables in prenex normal form with strictly alternating quantifiers, ∀x_n ∃x_{n−1} ∀x_{n−2} . . . Q_1 x_1 φ, where Q_1 is the quantifier ∀ if n is odd and the quantifier ∃ otherwise, and where φ is an unquantified Boolean formula of m clauses in conjunctive normal form, each containing a literal for 3 distinct variables.
Question: Is the formula ψ satisfiable?

Finally, we will resolve the complexity of predicting indirect folding pathways for multiple interacting strands. We will reason about this problem using the simple energy model formally defined in Section 1.1.1.

Problem 10. eb-ipfp-multi (Energy Barrier for Indirect Pseudoknot-free Folding Pathway of Multiple interacting strands)
Instance: Two pseudoknot-free structures I (initial) and F (final) of multiple interacting strands, and an integer k.
Question: Is there an indirect pseudoknot-free folding pathway from I to F such that the energy barrier in the simple energy model is at most k?

5.3 Space efficient CRN simulation of PSPACE

Figure 5.1: Solving the q3sat instance ∀x_3 ∃x_2 ∀x_1 (x_1 ∨ x_2 ∨ x_3) ∧ (¬x_1 ∨ ¬x_2 ∨ ¬x_3). Edge-labeled paths from root to leaf denote variable assignments.
Nodes are satisfied based on the quantifier and the satisfiability of the left and right children.

Our goal is to demonstrate that any problem in PSPACE can be solved by a space efficient, logically reversible, tagged CRN. By solved, we mean that the CRN will produce a special accept signal for an instance of a problem if and only if a Turing machine simulated with that same problem instance ends in an accepting state. Otherwise, the CRN will produce a special reject signal. To that end, we will show how a CRN with these properties can be constructed to solve any arbitrary instance of the q3sat problem, which is a complete problem for the class PSPACE.

We present our solution in three logical parts. In Section 5.3.1, we demonstrate how to construct a CRN for verifying whether a 3sat formula is satisfied. In Section 5.3.2, we present an elegant solution for traversing a perfect binary tree in post-order that is both space efficient and logically reversible. In Section 5.3.3, we demonstrate how the two CRNs can be integrated and then modified to capture the semantics of strictly alternating variable quantifiers in the q3sat instance. To simplify the presentation of our result, we add new reactions, tags, and signal molecules as needed while we refine our construction towards its final form.

To understand the intuition behind our construction, consider that a perfect binary tree of height n, with each level of the tree representing a variable, has 2^n leaves, each with a unique path from the root specifying a unique variable assignment. A tree defined in this manner can be used to express the semantics of strictly alternating quantifiers in the q3sat instance (see Figure 5.1). Leaf nodes are considered satisfied, or true, if and only if the current variable assignment satisfies the unquantified 3sat formula of the q3sat instance.
For example, the first leaf node from the left of the tree in Figure 5.1 is not satisfied, and is therefore considered false, as the variable assignment x_1 = F, x_2 = F, x_3 = F does not satisfy the formula (x_1 ∨ x_2 ∨ x_3) ∧ (¬x_1 ∨ ¬x_2 ∨ ¬x_3); however, the second leaf node from the left, with assignment x_1 = T, x_2 = F, x_3 = F, does satisfy the formula and is therefore considered true. Internal nodes can be used to propagate the satisfiability of a partially solved instance up the tree. If an internal node represents a universally quantified variable, then it is marked true if and only if both of its children are true. Therefore, the parent node of the first two leaf nodes from the left in Figure 5.1 is false, as it is a universally quantified node and only one of its children is true. Similarly, a node representing an existentially quantified variable is marked false if and only if both children are false. In this straightforward manner, the overall quantified formula can be determined to be true or false once the root is marked.

Since the satisfiability of a node can be determined immediately once that of its two children is known, we perform a post-order traversal of the tree (in a post-order tree traversal, a node is processed only after its children have been processed). Furthermore, we exploit the fact that once the satisfiability of a child is marked, the satisfiability of its descendants is irrelevant and can be forgotten. This allows us to reuse space cleverly in our tree traversal procedure.

5.3.1 Verifying a 3sat instance variable assignment

We first demonstrate how the formula φ can be verified as satisfied or unsatisfied for a particular variable assignment. A variable assignment ensures exactly one signal for each variable x_i is present: x_i^T for a true assignment, and x_i^F otherwise.
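The tree semantics just described can be sketched as a short recursive (post-order) evaluator. This is illustrative Python only; the function name and clause encoding are ours and bear no relation to the CRN encoding that follows:

```python
def eval_q3sat(n, clauses, assign=None):
    """Post-order evaluation of psi = forall x_n exists x_{n-1} ... phi over
    the assignment tree. clauses: 3-tuples of signed variable indices
    (+i means x_i, -i means not x_i); assign: partial truth assignment."""
    if assign is None:
        assign = {}
    var = n - len(assign)                  # next variable to branch on
    if var == 0:                           # leaf: evaluate the 3sat formula
        return all(any(assign[abs(l)] == (l > 0) for l in c) for c in clauses)
    left = eval_q3sat(n, clauses, {**assign, var: False})
    right = eval_q3sat(n, clauses, {**assign, var: True})
    # x_n is universally quantified and quantifiers strictly alternate,
    # so variable `var` is a forall-node exactly when n - var is even.
    return (left and right) if (n - var) % 2 == 0 else (left or right)

phi = [(1, 2, 3), (-1, -2, -3)]            # the formula of Figure 5.1
print(eval_q3sat(3, phi))                  # → True
```

On the instance of Figure 5.1 the evaluator returns True: for either value of x_3, choosing x_2 as the negation of x_3 satisfies both clauses for all x_1.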
We first introduce the necessary reactions to verify an individual clause and demonstrate how the overall formula can be determined to be true or false. Verifying an arbitrary clause Recall that in a 3sat instance, each clause consists of exactly three literals, each for a distinct variable20 . As such, there are exactly eight possible truth assignments and we create a reversible reaction for each. The reactions for verifying the ith clause, containing literals for variables xj , xk and xl are given in Figure 5.2 (left). When the clause signal molecule Ci? is present, exactly one of the eight reactions can be applied, specified by the current variable assignment. The variable signals act as catalysts and the Ci? signal is consumed producing either a CiT signal if the clause is satisfied, or CiF otherwise. (1) Ci? Ci? F F xF j +xk +xl F T xF j +xk +xl [F/T ] Ci [F/T ] Ci , 1≤i≤m , 1≤i≤m Ci? φ? C1? (3) CiT ? Ci+1 (4) .. . T T xT j +xk +xl (2) (5) [F/T ] Ci , Hi? + CiF T Cm Hi + φ F , 1≤i<m , 1≤i≤m φT 1≤i≤m Figure 5.2: (left) Eight chemical reaction equations to verify an arbitrary 3sat clause Ci for each combination of variable assignments. The product of the reaction is CiT for assignments that satisfy the ith clause, and CiF otherwise. (right) Reaction equations to verify the overall 3sat formula φ, consisting of m clauses. For example, suppose Ci represents the following clause: (xj ∨ ¬xk ∨ xl ). T F F The reaction having catalysts xF j , xk , and xl will produce Ci . The other seven T reactions will produce Ci . Note that for a particular variable assignment, only one reaction will apply in both the forward and reverse direction, ensuring the process is logically reversible. Verifying the overall formula The overall process of verifying the formula φ can be thought of as a subroutine that is initiated by consuming the signal φ? and completes by producing either 20 We assume this form to simplify our description. 
Note that when two or more of the literals in a clause are for the same variable, it is always possible to simplify the clause, even when they are negations of each other. It is also always possible to add dummy variables to ensure every clause has exactly three literals.

the signal φ^T, if φ is satisfied, or φ^F otherwise. The variable assignment signals are catalysts, and their values are maintained after the process completes. For the formula to be true, all clauses must be satisfied, whereas any combination of unsatisfied clauses will result in φ being false. For this reason, care must be taken that clauses are checked systematically to ensure reversibility. The overall process is depicted in Figure 5.3, and the reactions are given in Figure 5.2 (right).

φ? ⇌ C_1?;  C_1^T ⇌ C_2?;  C_2^T ⇌ ···;  C_{m−1}^T ⇌ C_m?;  C_m^T ⇌ φ^T
H_1? + C_1^F ⇌ H_1 + φ^F;  H_2? + C_2^F ⇌ H_2 + φ^F;  ···;  H_m? + C_m^F ⇌ H_m + φ^F

Figure 5.3: Flow control when verifying a formula φ having m clauses.

The process checks each clause in sequence; if the current clause is unsatisfied, then reaction (4) occurs, immediately producing the φ^F signal denoting that the formula is unsatisfied. This reaction consumes a history signal H_i? and produces another history signal H_i. The sole purpose of the history signal is to ensure the reversibility of the computation should the φ^F signal be produced, as it uniquely identifies which clause was the first to be unsatisfied. Otherwise, all clauses are satisfied, and thus the signal φ^T can be produced; this alone is sufficient to ensure the computation is reversible.

Lemma 16. A 3sat Boolean formula of m clauses over n variables can be verified by a logically reversible tagged CRN in O(m) reaction steps using Θ(m + n) space.

Proof. Importantly, we must now establish that the process is logically reversible.
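Abstractly, the subroutine being verified behaves as follows (an illustrative Python sketch of the Figure 5.3 control flow, not the CRN itself; the signal and history names are our own labels):

```python
# Walk the chain phi? -> C1? -> C2? -> ...: the first unsatisfied clause i
# fires reaction (4), consuming Hi? and leaving history Hi, which uniquely
# records where the computation branched to phi^F (needed for reversibility).
def verify_formula(clauses, assignment):
    for i, clause in enumerate(clauses, start=1):
        if not any(assignment[abs(l) - 1] == (l > 0) for l in clause):
            return ("phi^F", f"H{i}")    # formula false; history names clause i
    return ("phi^T", None)               # all m clauses satisfied

phi = [(1, 2, 3), (-1, -2, -3)]
print(verify_formula(phi, (True, False, False)))   # ('phi^T', None)
print(verify_formula(phi, (True, True, True)))     # ('phi^F', 'H2')
```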
We first argue by induction on m, the number of clauses of the 3sat formula φ, that C_m^T is eventually produced by a logically reversible sequence of reactions if and only if the first m clauses are satisfied; otherwise, φ^F is produced by a logically reversible sequence of reactions, along with a history signal denoting the first unsatisfied clause. In addition to the clause history signals H_i?, we assume initially that the signal φ? is present, along with exactly one signal for each variable x_i denoting its truth assignment: x_i^T or x_i^F. Suppose the inductive hypothesis holds for m − 1 clauses and consider the case when φ has m clauses. We have two cases:

Case 1. The first m − 1 clauses of φ are satisfied. By the inductive hypothesis, signal C_{m−1}^T will eventually be produced by a logically reversible sequence of reactions. Other than the reverse of the previous reaction, only the reaction to produce C_m? can be applied. Next, other than reversing, only one clause reaction will be applicable, and it will produce either C_m^T or C_m^F. If C_m^T is produced, we are done. If C_m^F was produced, either the reverse of the previous reaction can be applied, or φ^F + H_m is next produced, ending the process. Thus, C_m^T or φ^F (in addition to a history signal denoting the first unsatisfied clause) is eventually produced by a sequence of logically reversible reactions.

Case 2. At least one of the first m − 1 clauses of φ is unsatisfied. By the inductive hypothesis, this case will correctly produce φ^F and a history signal denoting the first unsatisfied clause. The new reactions pertaining to clause m are not applicable and thus inconsequential.

To complete the process, if C_m^T was produced, then other than reversing the previous reaction, the signal φ^T can next be produced. It is easy to see that in the worst case, O(m) reaction steps are required. Finally, we establish the space claim.
The initial signal multiset has size Θ(m + n), as it consists of the n variable signals, the m clause history signals, and the signal φ?. The CRN has Θ(m) reactions, since there are a constant number for each of the m clauses and for the overall formula verification. Since each reaction is applied at most once when verifying a formula, one tag per reaction is sufficient, establishing the size of the initial tag multiset to be Θ(m). As the CRN is proper, by Lemma 2 the required space to complete the computation is Θ(m + n).

5.3.2 A space efficient post-order tree traversal

Next we demonstrate how to perform a post-order traversal of a perfect binary tree in a space-efficient manner. Importantly, the procedure must be logically reversible. The intuition and chemical reaction equations are captured in Figure 5.4. For any node with a left and right child, once the descendants of the left child have been recursively traversed (Figure 5.4 (a)), the left child can be marked (Figure 5.4 (b)) using reaction (6) mark left. Any information stored in those descendant nodes is no longer required, so the whole traversal of that subtree can be reversed (Figure 5.4 (c)), the traversal can move to the right child (Figure 5.4 (d)) using reaction (7) move right, the right subtree can be recursively traversed (Figure 5.4 (e)), and finally the right child marked (Figure 5.4 (f)) using reaction (8) mark right.

Lemma 17. Given a perfect binary tree of height h, all descendants of the root can be traversed in post-order by a logically reversible tagged CRN in Θ(3^h) reaction steps, using Θ(h) space.

Proof. We construct the logically reversible tagged CRN of Figure 5.4, adding reactions (6), (7), and (8) for each 1 ≤ i ≤ h. As each reaction of the CRN is reversible, after every reaction step the reverse of the previous reaction can always be applied.
To demonstrate that the CRN is logically reversible, we need to demonstrate that at any point there is at most one other reaction that can be applied. We will further establish the invariant that each reaction strictly alternates between being applied in the forward and reverse direction, ensuring at most one tag is required for each type of reaction. We argue by structural induction. Let s_h denote the number of reaction steps required for a tree of height h.

(6) mark left:  l_i? ⇌ l_i,  with catalysts x_i^F, r_i? and {r_j : 1 ≤ j < i},  1 ≤ i ≤ h
(7) move right: x_i^F ⇌ x_i^T,  with catalysts l_i, r_i? and {l_j? : 1 ≤ j < i},  1 ≤ i ≤ h
(8) mark right: r_i? ⇌ r_i,  with catalysts r_{i+1}?, l_i, x_i^T and {r_j : 1 ≤ j < i},  1 ≤ i ≤ h

Figure 5.4: A logically reversible post-order traversal of all descendants of the root of a height h perfect binary tree can be achieved using three reactions: (6) mark left, (7) move right, and (8) mark right. The actions performed on the tree are: (a) recursively solve the tree rooted at l_i; (b) mark l_i; (c) reverse all steps from (a); (d) move to the right subtree; (e) recursively solve the tree rooted at r_i; (f) mark r_i.

Consider the base case when h = 1, with initial signal multiset {r_2?, x_1^F, l_1?, r_1?}. Reaction (8) cannot be applied until the signal x_1^T is present, which is produced by reaction (7). Similarly, reaction (7) cannot be applied until signal l_1 is present. Thus, it is easy to see that reaction (6) must first be applied (marking the left child), followed by reaction (7) (moving to the right child), and finally reaction (8) (marking the right child), completing the traversal in s_1 = 3 reaction steps. Each reaction was applied only once, in the forward direction, so the strictly alternating invariant is trivially maintained. Suppose the traversal completes in s_{h−1} reaction steps, is logically reversible, and the strictly alternating invariant is maintained for a tree of height h − 1.
Consider a tree of height h, having initial multiset of signals S = {r_{h+1}?} ∪ ⋃_{1≤i≤h} {x_i^F, l_i?, r_i?}. Before reaction (6) (and thus reactions (7) and (8)) can be applied at level h, the signals {r_j : 1 ≤ j < h} must be present. As the left subtree is selected, the signal r_h? is present, and by the induction hypothesis the only available action is to produce these signals in s_{h−1} logically reversible reaction steps that maintain the strictly alternating invariant, by traversing the subtree rooted at l_h (see Figure 5.4 (a)). Importantly, the signals {r_j? : 1 ≤ j < h − 1} are now absent, and therefore no reaction affecting levels 1, ..., h − 2 can occur. Other than reversing the previous reaction, which produced signal r_{h−1}, only reaction (6) can be applied for level h, thus producing l_h (see Figure 5.4 (b)). Next, observe that reaction (7) cannot be applied until the signals {l_j? : 1 ≤ j < h} are present. Other than reversing the previous reaction, only a reversal of all s_{h−1} reaction steps that traversed the left subtree can be applied next, yielding the required signals to apply reaction (7), producing signal x_h^T and denoting a move to the right subtree (see Figure 5.4 (c) and (d)). Note that the reversal of the left subtree maintains the strictly alternating invariant, as it ensures all lower level reactions have been reset to their initial state, in order to be used again in the right subtree. As with reaction (6), reaction (8) cannot be applied at level h until the right subtree is traversed in s_{h−1} logically reversible reaction steps (see Figure 5.4 (e)). Other than reversing the previous reaction, only reaction (8) can next be applied at level h, producing the signal r_h and ensuring no further reactions on lower levels can occur. The traversal is complete, and no reaction, other than the reverse of the previous, can occur. Thus, the overall traversal is logically reversible, and is clearly in post-order.
As the strictly alternating invariant was maintained for all lower level reactions, and all reactions at level h have been applied for the first time, and only once, the invariant is maintained for a tree of height h. Exactly 3 reactions occurred at level h, and 3s_{h−1} reactions were required for the two traversals and one reversal of the height h − 1 subtrees, giving the recurrence s_h = 3s_{h−1} + 3. Solving with s_1 = 3 gives the closed form s_h = (3/2)(3^h − 1), establishing the claimed Θ(3^h) reaction steps. Finally, consider the space claim. As we have shown that reactions strictly alternate between being applied in the forward and reverse direction, at most one tag for each of the Θ(h) reactions is sufficient. The initial multiset of signals for a tree of height h is S = {r_{h+1}?} ∪ ⋃_{1≤i≤h} {x_i^F, l_i?, r_i?}, and therefore |S| = 3h + 1. Since the CRN is proper, we immediately establish the space claim by Lemma 1.

5.3.3 Solving a q3sat instance

We now have the means to verify whether a variable assignment satisfies a 3sat formula φ. We can also traverse a perfect binary tree in post-order, and in the process enumerate all possible variable assignments for φ. What remains is to combine these processes in order to determine whether a q3sat instance can be satisfied. We approach the integration in two parts. First, we demonstrate how the formula verification process can be triggered immediately prior to the tree-traversal marking of a leaf node, and how the verification reactions can be entirely reversed prior to the next time the verification procedure must run. This effectively demonstrates how any problem in NP can be solved by a logically reversible CRN in polynomial space, if we specify the end of computation as the presence of the signal φ^T, or the signal φ^F in conjunction with the signals for the final variable assignment to be enumerated.
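The step count of Lemma 17 can be double-checked numerically (a quick sketch; `steps` simply iterates the recurrence from the proof):

```python
# Iterate the recurrence s_h = 3*s_{h-1} + 3 with s_1 = 3, and compare it
# against the closed form s_h = (3/2)(3^h - 1) derived in the proof.
def steps(h):
    s = 3
    for _ in range(h - 1):
        s = 3 * s + 3
    return s

for h in range(1, 12):
    assert steps(h) == 3 * (3 ** h - 1) // 2
print(steps(4))   # 120 reaction steps for a height-4 tree
```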
Finally, we demonstrate how the tree traversal reactions of Figure 5.4 can be augmented in order to capture the semantics of alternating universal and existential quantifiers, thus demonstrating how any problem in PSPACE can be solved in polynomial space by a logically reversible CRN. Integrating formula verification and tree traversal Recall the sequence of logical steps in traversing level 1 of the tree, i.e., the leaves: mark left leaf, move right, mark right leaf. We augment the reactions for level 1 to force the following sequence: (i) verify φ, (ii) mark left leaf, (iii) reverse reactions of step (i), (iv) move right, (v) verify φ, (vi) mark right leaf. This new sequence ensures two invariants: first, the current variable assignment is verified prior to marking the current leaf, and second, the verification procedure is fully reversed prior to the next verification. The augmented reactions are given in Figure 5.5. Both reactions marking a leaf have been split into two variants, each ensuring the verification procedure has completed by requiring as a catalyst one of the two possible outcomes of the verification process. In addition, we add new signals to record whether or not the variable assignment for a particular leaf is a satisfying assignment for φ. These signals will be used later to propagate satisfiability up the tree, once quantifiers have been integrated. Note that the reaction to move to the right leaf now requires the signal φ? as a catalyst. This forces all steps performed in the previous verification to reverse. After moving to the right leaf, and thus swapping the value of variable x1 , the verification process can again run immediately prior to marking the right leaf. Importantly, we want to ensure that the verification procedure is completely integrated into the leaf level reactions and cannot perform any reactions while the traversal is marking higher level nodes. This is easily accomplished by augmenting reactions (2)-(5) to require r2? 
as a catalyst. Note that the augmented variants of the tree traversal reactions are also fully distinguishable by their catalysts (and products), thus ensuring the process is logically reversible.

mark left (new catalyst φ^F):  L_1? + l_1? + ··· ⇌ ··· + l_1 + L_1^F
mark left (new catalyst φ^T):  L_1? + l_1? + ··· ⇌ ··· + l_1 + L_1^T
move right (new catalyst φ?):  x_1^F + ··· ⇌ ··· + x_1^T
mark right (new catalyst φ^F): R_1? + r_1? + ··· ⇌ ··· + r_1 + R_1^F
mark right (new catalyst φ^T): R_1? + r_1? + ··· ⇌ ··· + r_1 + R_1^T

Figure 5.5: Integrating the 3sat verification procedure into the leaf level reactions of the tree traversal procedure. Two reaction variants are created for marking leaf nodes as either satisfied or unsatisfied, based on the result of the verification procedure: one variant can proceed if the signal φ^F is available, and the other requires φ^T. As these are the only two reaction variants, the formula must be verified for the current variable assignment before the leaf node can be marked. The move right reaction requires φ? as a catalyst, thus ensuring the verification procedure is reversed prior to the next verification step. Existing catalysts listed in Figure 5.4 remain and are omitted above for space.

Integrating quantifiers into the tree traversal

Integrating quantifiers in non-leaf levels of the tree is relatively straightforward. Recall that the levels of the tree strictly alternate between universal and existential quantification. For each level, we create four variants of the left and right node marking reactions, for the appropriate quantifier, that additionally produce a signal indicating whether the current subtree is satisfied. The reaction variants for marking a left node are given in Figure 5.6. These reactions require as catalysts the signals indicating whether the left and right children of the current node are satisfied, and therefore four variants are sufficient to consider all cases for each type of quantifier.
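The four variants per quantifier amount to a small truth table (an illustrative sketch of the mapping, not the chemistry; the function names are ours):

```python
# The mark a node's reaction variant produces is a function of the marks of
# its two children: conjunction at universal levels, disjunction at
# existential levels -- the four cases enumerated per quantifier.
from itertools import product

def mark_variants(quantifier):
    combine = (lambda l, r: l and r) if quantifier == 'forall' else \
              (lambda l, r: l or r)
    return {(l, r): combine(l, r) for l, r in product([False, True], repeat=2)}

assert list(mark_variants('forall').values()).count(True) == 1   # only (T, T)
assert list(mark_variants('exists').values()).count(False) == 1  # only (F, F)
```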
As with the leaf level reactions, the augmented reaction variants can be fully distinguished by their catalysts, ensuring the computation remains logically reversible and that the correct reactions are reversed.

∀ levels (new catalysts L_{i−1}^F + R_{i−1}^F): L_i? + l_i? + ··· ⇌ ··· + l_i + L_i^F
∀ levels (L_{i−1}^F + R_{i−1}^T): L_i? + l_i? + ··· ⇌ ··· + l_i + L_i^F
∀ levels (L_{i−1}^T + R_{i−1}^F): L_i? + l_i? + ··· ⇌ ··· + l_i + L_i^F
∀ levels (L_{i−1}^T + R_{i−1}^T): L_i? + l_i? + ··· ⇌ ··· + l_i + L_i^T
∃ levels (L_{i−1}^F + R_{i−1}^F): L_i? + l_i? + ··· ⇌ ··· + l_i + L_i^F
∃ levels (L_{i−1}^F + R_{i−1}^T): L_i? + l_i? + ··· ⇌ ··· + l_i + L_i^T
∃ levels (L_{i−1}^T + R_{i−1}^F): L_i? + l_i? + ··· ⇌ ··· + l_i + L_i^T
∃ levels (L_{i−1}^T + R_{i−1}^T): L_i? + l_i? + ··· ⇌ ··· + l_i + L_i^T

Figure 5.6: Integrating quantifiers into non-leaf levels of the tree traversal. For both universal and existential levels, four variants of the left node reactions are created to process the four combinations of left and right children satisfiability. The integration is identical for right node reactions. Existing catalysts remain the same as listed before and are omitted for space.

Ending the computation

Once both children of the root have been solved, the output signal can be produced based on the satisfiability of the children and on the quantifier imposed on the root level variable x_n. The reaction equations for the universal quantifier are shown in Figure 5.7; modifying the reactions for an existential quantifier is straightforward. Recall that reactions at level n − 1 cannot proceed unless the signal r_n? is present. We could have the reaction producing the solution signal also consume r_n?. This would end the computation chain, as only reversing the previous reaction would be possible next. However, for reasons we will make clear in Section 5.5, the signal r_n? is never altered, and therefore after the solution signal is produced, the entirety of the tree traversal steps will be reversed before reaching the end of the computation chain. The entire configuration of the CRN system will appear identical to the initial configuration, with the exception that the output has been written (i.e., the ψ?
signal has been consumed and replaced by ψ^F or ψ^T). See Figure 5.8 for a schematic of the logically reversible computation chain.

(9) ψ? + L_{n−1}^F + R_{n−1}^F + {r_j : 1 ≤ j < n} ⇌ ψ^F + L_{n−1}^F + R_{n−1}^F + {r_j : 1 ≤ j < n}
    ψ? + L_{n−1}^F + R_{n−1}^T + {r_j : 1 ≤ j < n} ⇌ ψ^F + L_{n−1}^F + R_{n−1}^T + {r_j : 1 ≤ j < n}
    ψ? + L_{n−1}^T + R_{n−1}^F + {r_j : 1 ≤ j < n} ⇌ ψ^F + L_{n−1}^T + R_{n−1}^F + {r_j : 1 ≤ j < n}
    ψ? + L_{n−1}^T + R_{n−1}^T + {r_j : 1 ≤ j < n} ⇌ ψ^T + L_{n−1}^T + R_{n−1}^T + {r_j : 1 ≤ j < n}

Figure 5.7: After both children of the root have been solved, a solution can be determined based on the quantifier of the root level: (a) visit descendants, (b) produce output. Equations are shown assuming the root variable x_n is universally quantified.

Theorem 16. Any arbitrary instance of q3sat with n variables and m clauses can be solved by a logically reversible tagged CRN in O(m·3^n) reaction steps using Θ(m + n) space.

Proof. Let ψ be the totally quantified Boolean formula of the instance and φ be the unquantified 3sat formula. By Lemma 16, a set of Θ(m) reactions can be created to verify whether φ is satisfied for a particular variable assignment. By Lemma 17, a set of Θ(n) reactions can be created to traverse the height n tree representing all possible assignments of the n variables. Furthermore, the above modifications demonstrate how these two processes can be integrated into one logically reversible computation chain, and how quantifiers can be added to the non-leaf levels to determine if there is a satisfying solution for ψ by propagating satisfiability of subtrees up to higher levels. Importantly, the modifications only increase the number of reactions by a constant factor and are designed to maintain the property that the computation is logically reversible. Consider that the number of reaction steps acting on a tree node, prior to reaching the root, has not increased. However, prior to marking every leaf in the traversal, the verification procedure is run for the current variable assignment (and reversed in between).
Therefore, by Lemmas 16 and 17, the root of the height n tree can be reached, and a solution signal produced, within O(m·3^n) reaction steps. As forcing the entire tree traversal to reverse prior to the end of computation only doubles the number of reaction steps, the claim on computation length is established. Next, consider the space required of the combined CRN. The modified verification procedure requires the following initial multiset, where T_3sat is the multiset of required tags:

S_3sat = ⋃_{1≤i≤m} {C_i?, H_i?} ∪ {φ?, r_2?} ∪ T_3sat

The augmented tree traversal procedure requires the following initial multiset, where T_tree is the multiset of required tags:

S_tree = ⋃_{1≤i<n} {l_i?, x_i?, r_i?, L_i?, R_i?} ∪ {r_n?, φ?, ψ?} ∪ T_tree

The space required for the initial multiset of the combined CRN is therefore |S_q3sat| = |S_tree ∪ S_3sat|. As the combined CRN maintains the property that reactions strictly alternate between being applied in the forward and reverse direction, one tag for each of the Θ(m + n) reactions is sufficient, and |S_q3sat| ∈ Θ(m + n). As q3sat is a complete problem for PSPACE [94], we immediately have the following.

Corollary 2. Any problem in PSPACE can be solved by a logically reversible tagged CRN using polynomial space.

5.4 Space efficient CRN simulation of SPACE

We have so far shown how to simulate any problem in PSPACE with a space-efficient CRN. In this section, we extend our result to show how any S(n) space-bounded computation can be simulated by a logically reversible tagged CRN using at most poly(S(n)) space. The first result that we leverage, summarized in Theorem 17, states that any S(n) space-bounded computation, with an input of size n ≤ S(n), can be simulated by an alternating Turing machine in O(S(n)^2) steps.

Theorem 17 (Chandra, Kozen and Stockmeyer [18]). If S(n) ≥ n, then NSPACE(S(n)) ⊆ ⋃_{c>0} ATIME(c · S(n)^2).
We will also make use of the following transformation, from a nondeterministic Turing machine to a propositional formula, due to Cook [28].

Theorem 18 (Cook [28]). Let M be a T(n) time-bounded nondeterministic Turing machine. For each input x there is a conjunctive normal form propositional formula F(x) of length Θ(T(n) log T(n)), [containing at most three literals per clause and that can be produced in poly(n) time], such that F(x) is satisfiable if and only if M accepts x within T(n) steps.

As communicated by Williams [140], this same transformation can be used to derive a quantified Boolean formula ψ of length Θ(T(n) log T(n)), for an alternating Turing machine M, such that ψ is satisfiable if and only if M accepts its input within T(n) steps. Intuitively, the universal states of M are represented by universal variables, while existential states are represented by existential variables. Importantly, Cook's reduction ensures that the unquantified propositional formula is in conjunctive normal form and that clauses have at most three literals. Without loss of generality, we can assume that the quantified formula ψ is in prenex normal form and that existential and universal quantifiers strictly alternate21. We summarize the result due to Theorems 17 and 18 in Corollary 3.

Corollary 3 (Chandra, Kozen and Stockmeyer [18] & Cook [28]). Given any S(n) space-bounded Turing machine M and an input x of length n, with S(n) ≥ n, it is possible in poly(S(n)) time to construct a totally quantified Boolean formula ψ, in prenex normal form, having Θ(S(n)^2 log S(n)) clauses in conjunctive normal form, where each clause contains at most three literals (i.e., ψ is an instance of q3sat), such that ψ is satisfiable if and only if M accepts x within S(n) space.

In Corollary 3 there is a condition that the space complexity is at least as large as the size of the input.
When this is not the case, such problem instances are necessarily in PSPACE, and therefore our result of Corollary 2 applies. Otherwise, by Theorem 16 and Corollary 3, we can conclude the following result.

Corollary 4. Any problem solvable in S(n) ≥ n space can be solved by a logically reversible tagged CRN using O(S(n)^2 log S(n)) space.

21 Our CRN construction works without the assumption that quantifiers strictly alternate, but it simplifies the presentation of the result to assume that they do.

5.5 Space and energy efficient DSD simulation of SPACE

The remarkable consequence that Bennett's work demonstrates is that energy consumption is not necessarily an intrinsic cost of computation. In particular, if the computation is logically reversible, there is no inherent lower bound on energy expenditure due to the computation. However, there must be a reasonable probability that the actual solution can be observed. This can be problematic in a logically reversible computation, which is free to immediately reverse once reaching a solution state. Qian et al. [98] solved this problem by using fuel (transformers) to provide a slight bias for remaining in a solution state once the computation completes. However, in our result, since reactions must be reused efficiently in both directions to maintain a polynomial space bound, they cannot be biased in general.

Figure 5.8: The logically reversible computation chain of the q3sat CRN: t states while traversing the descendants of the root, a state upon reaching the root and producing the answer, and t + 1 states while reversing the traversal. In more than half of the states, the output signal is present (shown shaded).

To overcome this, we have designed the reactions that produce an output signal to ensure the next logical step in the computation is to reverse the tree traversal.
This effectively doubles the length of the logically reversible computation chain and establishes the important property that the output signal can be observed in strictly more than half of the states (see Figure 5.8). Notice that this was also the case for Bennett's original reversible Turing machine implementation22. As the computation performs an unbiased random walk along the logically reversible computation state space, the steady state probability of observing the output signal is p > 0.5. This probability can be further increased in a number of ways. For instance, by adding one additional reaction that produces a new signal and requires the final signal multiset of the original computation chain as catalysts, we can once again double the number of reactions in the new chain. In this case, the probability of observing the output signal is p > 0.75. In this manner, for every new reaction added to the CRN, the probability of not observing an output signal is cut in half. Formally, the probability of observing an answer becomes p > 1 − 2^{−(1+c)} when c ≥ 0 new reactions are added to extend the computation chain. Thus, we can make the steady state probability of observing a solution signal arbitrarily high.

22 The forward traversal of the tree, production of the output signal, and reversal of the traversal are analogous to the compute, copy output, and retrace phases of Bennett's original reversible Turing machine simulation [8].

Figure 5.9: Extending the logically reversible computation chain of the q3sat CRN, from 2t + 1 states to 2t + 2 states. Extending the chain is achieved by adding an additional reaction that produces a new signal and requires the final signal multiset of the original computation chain as catalysts. States where the output signal is present are shown shaded.
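Assuming a uniform stationary distribution over the chain's states, the bound p > 1 − 2^(−(1+c)) can be checked numerically (a sketch under our own modelling assumptions: the original chain has 2t + 1 states of which only the first t lack the output signal, and each extension doubles the chain length):

```python
# Model: the forward traversal contributes t output-free states; the chain
# has 2t+1 states initially and doubles with each of the c extensions, so
# the fraction of output-free states shrinks geometrically.
def observe_probability(t, c):
    total = (2 * t + 1) * 2 ** c    # chain length after c doublings
    without_output = t              # only the initial forward traversal
    return 1 - without_output / total

for t in (1, 10, 1000):
    for c in range(6):
        assert observe_probability(t, c) > 1 - 2 ** -(1 + c)
```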
Furthermore, at the DSD level, we could design the gates which implement the reaction producing the output signal to have a slight bias in the forward direction, by manipulating relative toehold lengths, effectively biasing our overall computation towards the end of the chain containing an answer shown in Figure 5.8 [142]. As this reaction is only performed once, the gate implementing the reaction is not reused and therefore, the bias is not problematic for the overall computation to complete. We note that the CRN and DSD description given here is a non-uniform model of computation. Specifically, the CRN description is dependent on, and encodes, a particular problem instance. Therefore, different problem instances will result in different CRN descriptions and thus a different DSD implementation. Particularly at the DSD level, where synthesizing strands and gates is challenging, it would be desirable if only the input strands differed between unique instances. This can be achieved by constructing a more general quantified Boolean formula that is within a polynomial size of the original encoding described here. In such a construction, part of the input would describe which clauses are active for the particular problem instance. The generalized formula would be for a fixed number n of variables and could be used to solve any instance having at most n variables23 . While the computation chain can be extended to increase the probability of observing the output to some fraction of the total length of the chain, as currently described, there is only one position on the chain (the initial position) where the input can be changed. Changing the input signals in the middle of the chain would mean the computation is no longer logically reversible—the chain would be missing necessary signals to reverse. This can be overcome by extending the number of states where the input can change, without issue, to be a constant fraction of the entire chain length [123]. 
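One way to picture such a prefix of input-mutable states is a Gray-code counter, where successive states differ in a single bit, analogous to one reaction step (this is the standard reflected-binary code, offered only as intuition, not the thesis' Chapter 4 CRN implementation):

```python
# Standard n-bit reflected binary Gray code: g(i) = i XOR (i >> 1).
# Consecutive codewords differ in exactly one bit position.
def gray_sequence(n):
    return [i ^ (i >> 1) for i in range(2 ** n)]

seq = gray_sequence(3)
print([format(g, '03b') for g in seq])
# ['000', '001', '011', '010', '110', '111', '101', '100']
assert all(bin(a ^ b).count('1') == 1 for a, b in zip(seq, seq[1:]))
```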
In particular, we can use the n-bit Gray counter implementation from Chapter 4 to form the first fraction of states of the overall chain. The reactions of the counter are orthogonal to the reactions of the q3sat solver, and thus the input can be changed at any position while the counter is active. We would make the first reaction of the q3sat solver consume the high order bit of the Gray counter, which is only produced by its last reaction. Thus, the overall chain is logically reversible, the input can be written during a constant fraction of the chain length, and the output can be read for a constant fraction of the chain length [123]. Combining Theorem 12, Corollary 2 and Corollary 4, we have the following.

Theorem 19. Any problem in SPACE can be solved by a space and energy efficient DSD.

23 This was suggested in an anonymous review of an earlier version of this work [131].

5.6 Complexity of verifying CRNs and DSDs

Next we show there exists a polynomial time and space reduction from an arbitrary q3sat instance I into an instance I′ of the CRN reachability problem (q3sat ≤p crnR), such that I can be solved if and only if I′ can be solved.

Theorem 20. The reachability problem for CRNs (crnR) is PSPACE-hard.

Proof. Given an arbitrary q3sat instance, construct the CRN of Theorem 16, which is of polynomial size and can be constructed in polynomial time and space by following the steps described in our construction. Ask whether the state (S_init \ {ψ?}) ∪ {ψ^T} can be reached from S_init, where S_init is the initial state of the CRN.

By Lemma 1, it is easy to see that the reachability problem for proper CRNs is in PSPACE. Whether other forms of CRNs are in PSPACE depends on their definition and on how the required space to complete a computation is accounted for. Any tagged CRN accounts for the necessary transformers as part of the size of the reaction volume and therefore, by this interpretation, is in PSPACE.
Corollary 5. The reachability problem for proper/tagged CRNs is PSPACE-complete.

We note that other results are known for unrestricted CRNs, which are not studied here. (Un-tagged) reversible CRNs correspond to reversible Petri nets, where the reachability problem is EXPSPACE-complete [16]. CRN reachability has also been studied for the probabilistic case [121, 148] and the nondeterministic case [148], and the connection with Petri nets was also explored [148]. By Theorem 12 and Theorem 20 we immediately have the following analogous results for DSDs.

Corollary 6. The reachability problem for DSDs (dsdR) is PSPACE-hard.

Clearly the reachability problem is PSPACE-complete for the set of DSDs implementing a proper CRN. When transformer (fuel) molecules are considered part of the space usage, as is the case for the closed volumes studied here (i.e., tagged CRNs), the reachability problem is PSPACE-complete.

5.7 A reduction from q3sat to eb-ipfp-multi

We next show that a DSD instance created by the above chain of reductions (i.e., q3sat ≤p crnR ≤p dsdR) can be adapted to show that the minimum energy barrier indirect folding pathway problem for multiple interacting strands is PSPACE-complete when using the simple energy model defined in Section 1.1.1.

5.7.1 The reduction

We begin with an arbitrary instance ψ of the q3sat problem consisting of n variables and m clauses. By Theorem 16, we can construct a CRN of size Θ(m + n) that will output a special acceptance signal if and only if ψ is satisfiable. By Theorem 12, this CRN can be implemented by a DSD that uses poly(m + n) space. Thus, the DSD will produce a special signal strand we call s_yes if and only if ψ is satisfiable. Now, there are two issues we must address. First, for the folding pathway problem, we must reason at the sequence level and not at the abstract domain level specified by a DSD.
Second, there is an assumption in the DSD construction that only legal toehold-mediated strand displacements occur. We must design our sequences to ensure this assumption is maintained in order to conclude any meaningful result. To simplify our argument, we will use a modified version of the QSW construction (i.e., the DSD construction of Theorem 12). The modification is straightforward: for every long domain in the original construction that is initially free (unbound), bind it to a new complementary strand (see Figure 5.10). Note that the resulting DSD still uses space poly(m + n). This modified construction has two effects. First, a legal displacement of a strand now requires four-way branch migration (described below), in contrast to the three-way branch migration required in the original construction. Second, prior to any displacement and immediately after any sequence of legal displacements, all long domains are fully bound to a complementary domain. This latter property greatly simplifies our correctness argument. A legal toehold-mediated strand displacement (legal displacement) that uses four-way branch migration involves four strands: an invading strand, a complementary strand bound to the displacing long domain of the invading strand, a template strand with a free toehold, and an evading strand currently bound to the template strand.
The process can be summarized in five steps: (i) the invading strand / complementary strand complex associates to the complex containing the template strand by forming a first base pair, (ii) additional base pairs are formed between the free toehold of the template strand and the toehold complement domain on the invading strand, (iii) a long domain of the invading strand displaces an identical long domain of the evading strand using four-way branch migration, where the evading strand forms base pairs with the complementary strand as the invading strand forms base pairs with the template strand, (iv) all but one of the toehold base pairs of the evading strand are broken, and (v) the evading strand, now fully bound to the complementary strand, breaks the last base pair and disassociates from the complex containing the template strand.

Figure 5.10: A strand displacement implementation of the bi-molecular chemical reaction A + B ⇌ C + D using a modified construction from that proposed by Qian et al. [98]. In this construction, four-way branch migration is used to displace strands, in contrast to three-way branch migration from the original construction.

Figure 5.11: A folding pathway is shown for a strand displacement using four-way branch migration. A simple sequence design is assumed where toehold domains have one base and long domains have two bases. The displacement of strand B by strand A is shown in seven steps, from (a) to (g). Initially, the long domain of A is bound to strand C. During the displacement, C will form base pairs with B while A forms base pairs with T. In the figure, base pairs are shown as edges between strands. The energy changes between each structure, assuming Kassoc = 2, are shown in the bottom right. The energy barrier of the underlying folding pathway, relative to (a), is Kassoc + 1. Note that for toehold length LT > 2, where Kassoc > LT, the energy barrier would be Kassoc − 1.

The main difference from three-way branch migration is that the complementary strand forms base pairs with the evading strand whenever possible. An example folding pathway for a legal displacement using four-way branch migration is given in Figure 5.11. Now let us consider how an evading strand can be displaced (i.e., produced) from a template other than by a legal displacement. In the simplest case, it is possible that the evading strand simply breaks all base pairs with the template strand and disassociates. We call this a spontaneous displacement. Now suppose one or more other strands, that possibly have base pairs with one or more other complexes, are used to perform the illegal displacement. We partition this possibility into two cases. In the first case, suppose all the invading strands have a different long domain than the evading strand. We call this a mismatch displacement. In the second case, suppose at least one of the invading strands does have a long domain that is the same as the evading strand, but either there are no free toeholds adjacent to the evading strand, or, if there are, all invading strands with the correct long domain have the toehold on the wrong side. We call this a blunt-end displacement.
Note that if there were at least one free adjacent toehold, and at least one of the invading strands with the correct long domain had a toehold on the correct side, then it could simply perform a legal displacement. Thus, these three cases cover all possible events for an illegal strand displacement. We begin by designing the sequences. Let all domains on template strands, inclusive of toehold domains, and all domains on complementary strands be formed of sequences using the bases T and G. Let all other strands be formed of sequences using the bases A and C. Thus, it is not possible for intra-strand base pairs to form.24 Suppose all toeholds on template strands use the same sequence and are of identical length LT, with Kassoc > LT > 2; recall that Kassoc is the entropic penalty for each strand association that results in fewer strand complexes. Therefore all toehold complement domains on other strands also have identical sequences (the complement of the common toehold sequence) and have length LT. Further suppose all long domains and complement strands have a common length LL > 2Kassoc > LT > 2. Create sequences for long domains, in polynomial time, such that distinct domains have an edit distance of at least 2Kassoc bases.25 Let B be the baseline energy of the initial DSD, prior to any displacements, using any sequence design satisfying the above constraints. We now argue formally that this reduction will result in a folding pathway that can displace the acceptance strand syes within barrier Kassoc − 1, relative to the baseline energy B, if and only if the q3sat instance ψ is satisfiable.

24 The simple energy model assumes only Watson-Crick base pairs can form (i.e., A-T and C-G).
25 Many trivial code word designs are sufficient for this purpose. For example, assign a unique multiple of 2Kassoc T (A) bases to long domains on the template and complementary (other) strands. Such a code word design has polynomial size and is trivially created in polynomial time.

Lemma 18. From a particular configuration having energy B, with all template strands saturated, a legal displacement can complete within barrier Kassoc − 1, resulting in a new configuration with all template strands saturated, having energy B. Proof. Let I, C, E, and T be the invading strand, complementary strand, evading strand, and template strand, respectively. The association of I to T decreases the number of complexes by one, and increases the number of base pairs by one, therefore increasing the energy to B + Kassoc − 1. As toeholds have length LT > 1, then LT − 1 new base pairs can immediately form, resulting in energy B + Kassoc − LT. Identical long domains of I and E will perform four-way branch migration (T and C being the third and fourth strands). Thus, as one base pair between E and T is broken and another between C and I is broken, raising the energy to B + Kassoc − LT + 2, a new base pair between I and T and another between E and C can form next, lowering the energy to B + Kassoc − LT. This oscillation of two base pair differences happens for each of the LL bases in the common long domain of I and E. Once the common long domain of E is displaced (and C is fully bound to E), the LT − 1 toehold base pairs bonding E to T can break, raising the energy to B + Kassoc − 1. As strand E disassociates from the template, the number of complexes increases by one, and the number of base pairs decreases by one, lowering the energy back to B. The original free toehold of T is now paired to I. However, the toehold previously paired with E is now free. All other bases on T are paired and therefore T is saturated. As LT > 2, the highest energy of any configuration during the displacement is B + Kassoc − 1, achieving the claimed energy barrier Kassoc − 1.
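The step-by-step energy accounting in this proof can be sketched in code (a hypothetical trace under the simple energy model; the function and parameter names are illustrative, not part of the formal argument):

```python
def legal_displacement_trace(k_assoc, l_t, l_l, base=0):
    """Energy trace of one legal four-way strand displacement in the
    simple energy model, following the accounting of Lemma 18.
    Assumes k_assoc > l_t > 2 and l_l > 2 * k_assoc."""
    trace = [base]
    # (i) invader associates: one complex fewer (+k_assoc), one pair (-1)
    e = base + k_assoc - 1
    trace.append(e)
    # (ii) the remaining l_t - 1 toehold base pairs zip up
    for _ in range(l_t - 1):
        e -= 1
        trace.append(e)
    # (iii) branch migration: each of the l_l positions breaks two
    # pairs (+2) before reforming two (-2), oscillating around e
    for _ in range(l_l):
        trace.append(e + 2)
        trace.append(e)
    # (iv) l_t - 1 toehold base pairs of the evader break
    for _ in range(l_t - 1):
        e += 1
        trace.append(e)
    # (v) evader disassociates: one more complex (-k_assoc), one pair (+1)
    e += 1 - k_assoc
    trace.append(e)
    return trace

trace = legal_displacement_trace(k_assoc=5, l_t=3, l_l=12)
assert max(trace) == 5 - 1   # barrier is k_assoc - 1
assert trace[-1] == 0        # pathway returns to the baseline energy
```

Since l_t > 2, the branch-migration oscillation never exceeds k_assoc − 1, so the association and disassociation steps set the barrier, matching the lemma.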
To simplify the remaining argument, the combination of the next two lemmas shows that any displacement of a strand within the prescribed energy barrier must be by exactly one additional displacing complex. That is to say, no combination of multiple complexes can cooperate to displace a strand from another complex, nor can the strands of the same complex be used, possibly with additional complexes, to displace a strand without exceeding the energy barrier. Lemma 19. From a particular configuration having energy B, with all template strands saturated, a displacement involving more than two complexes cannot complete within barrier Kassoc − 1. Proof. Since the template strand that binds the evading strand is saturated, exactly one toehold of length LT is unbound. Therefore, after the first complex associates with the complex containing the template strand, leaving open at least one base for a second association (on either of the original complexes), the minimum possible energy is B + Kassoc − LT + 1. When a second complex associates, the minimum possible energy cannot be lower than B + 2Kassoc − LT > B + Kassoc − 1 as Kassoc > LT. Lemma 20. From a particular configuration having energy B, with all template strands saturated, a displacement from a template strand using other strands from the same template complex cannot complete within barrier Kassoc − 1. Proof. First, we note that any domain from the same complex as the template used in such a displacement could only be a long domain that is bound to a complementary strand (i.e., not a domain bound to the template strand, since the template domain it is bound to would need to form base pairs with the evading strand and therefore could not displace).
Next, we note that such a long domain must be one bound adjacent to the strand to be displaced; otherwise, a pseudoknot would occur, as it must cross (cover) at least one domain of another strand bound to the template. For example, in the top of Figure 5.10 only the long domain for +C bound to its complement could displace the long domain −D without creating a pseudoknot. Furthermore, a pseudoknot must also form if the strand being displaced interacts with adjacent strands on both sides (one must cross its own long domain bound to the template strand). Therefore, consider the case of an adjacent long domain bound to a complement strand that could displace the evading strand without creating a pseudoknot. By the sequence design, the edit distance to the domain to be displaced is at least 2Kassoc, so any displacement using only the adjacent long domain and its complementary strand would result in an energy barrier of at least 2Kassoc. As argued above, no other strands from the same complex could cooperate without creating a pseudoknot. Suppose one other complex is used to cooperate in the displacement. In this case, it has at most LT free bases that could be used to lower the energy (using already paired bases cannot lower the energy barrier). However, as the association cost is Kassoc > LT, a second complex cannot be used to overcome the energy difference. Similarly, and consistent with Lemma 19, more than one additional complex cannot be used to lower the energy barrier. Lemma 21. From a particular configuration having energy B, with all template strands saturated, a mismatch displacement cannot complete within barrier Kassoc − 1. Proof. Suppose to the contrary that a mismatch displacement can complete within barrier Kassoc − 1. By definition of a mismatch displacement, one or more invading strands are used, but the long domains used for displacement all differ from the long domain of the evading strand.
By Lemmas 19 and 20, we need only consider the case of a single invading complex without the cooperation of strands in the same complex as the evading strand. For the LL base pairs broken for the evading strand, the invading strand can form at most L ≤ LL − 2Kassoc new base pairs due to the sequence design constraints. Therefore, just prior to the evading strand removing its toehold base pairs, and assuming the invading strand formed base pairs to all of its toehold complement domain, the energy will be B + Kassoc − LT + LL − L ≥ B + Kassoc − LT + LL − (LL − 2Kassoc) = B + 3Kassoc − LT. However, B + 3Kassoc − LT > B + Kassoc − 1 as Kassoc > LT. Contradiction. Lemma 22. From a particular configuration having energy B, with all template strands saturated, a blunt-end displacement cannot complete within barrier Kassoc − 1. Proof. Suppose to the contrary that a blunt-end displacement can complete within barrier Kassoc − 1. Note that by Lemmas 19 and 20, increasing the number of invading strands or cooperatively using strands of the same complex as the evading strand cannot help. In a blunt-end displacement, by definition, branch migration of a long domain is not preceded by the formation of base pairs in an adjacent unbound toehold. Both the case of the invading strand binding to a toehold which is not adjacent to the evading strand, and the case of it binding to a toehold that is adjacent but on the wrong side (as expected in a legal displacement), can be ruled out, as a pseudoknot would form. Since the template strand is saturated, it must be the case that an existing base pair, involving a base on the template strand, must first break prior to the invading strand forming its first base pair, raising the energy to B + 1. When the invading strand associates, the number of complexes decreases by one, and the number of base pairs increases by one, thus raising the energy to B + Kassoc.
Thus, the energy barrier is at least Kassoc. Contradiction. Lemma 23. From a particular configuration having energy B, with all template strands saturated, a spontaneous displacement cannot complete within barrier Kassoc − 1. Proof. Suppose to the contrary that a spontaneous displacement can complete within barrier Kassoc − 1. Since the length of the long domain on the evading strand is LL, the energy after all but the last base pair is broken is B + LL − 1 > B + Kassoc − 1, as LL > Kassoc. Thus, the energy barrier is at least Kassoc. Contradiction. Theorem 21. The eb-ipfp-multi problem, namely the energy barrier for indirect pseudoknot-free folding pathway of multiple interacting strands problem, is PSPACE-complete. Proof. Using the reduction described above, and by Theorem 16 and Theorem 12, given an arbitrary instance ψ of q3sat having n variables and m clauses, a DSD can be constructed with the discussed modifications, in time poly(m + n), with a fully specified nucleotide sequence of poly(m + n) bases in total, such that: (i) all template strands of the DSD are initially saturated, and (ii) the DSD will produce a special signal strand, syes, through a sequence of legal strand displacements if and only if ψ is satisfiable. Let B be the initial energy of the resulting DSD. We now show that any sequence of legal displacements follows an indirect folding pathway within energy barrier Kassoc − 1. This follows immediately from the construction (all templates are saturated) and by Lemma 18, which guarantees each legal displacement is within the energy barrier and returns to the initial energy B (with all templates again saturated). In the other direction, Lemmas 19–23 ensure that any sequence with at least one non-legal displacement must exceed the energy barrier Kassoc − 1. Therefore, the folding pathway is within energy barrier Kassoc − 1 if and only if the DSD follows a sequence of legal displacements.
Thus, ψ is satisfiable if and only if the strand syes can be displaced within energy barrier Kassoc − 1. 5.8 Chapter summary In this chapter, we asked the question: can space and energy efficient computation be realized by chemical reaction networks (CRNs) and DNA strand displacement systems (DSDs)? We have shown this can be achieved in general by giving a logically reversible, space efficient CRN implementation capable of solving any problem in PSPACE—the class of all problems solvable in polynomial space. Furthermore, our CRN can be realized by a space and energy efficient DSD. We have also shown how these results can be extended to solve any problem in SPACE. Thus, any computation that halts can be solved by a space and energy efficient DSD. The only other DSD implementation capable of solving any problem in SPACE is the stack machine implementation of Qian et al. [98]. The result of this chapter improves upon the stack machine implementation in terms of space efficiency, as the stack machine uses space proportional to computation length. However, our result falls short in a number of other respects when compared with the stack machine. Our construction provides a non-uniform model of computation, and thus, as currently described, a new CRN, and thus DSD, must be created for each different problem instance to be solved. It is conceivable that the result can be generalized to solve any problem instance, up to a particular size. The stack machine implementation is Turing universal. Since our result is based on a non-uniform model of computation, it cannot simulate computations that do not halt, and is therefore not capable of Turing universal computation. In addition to further characterizing the computational power of standard molecular programming systems, our result has a number of important consequences.
For instance, we show that even determining if a certain state is reachable in a CRN, such as a desirable or undesirable configuration, is PSPACE-hard, effectively demonstrating the intrinsic complexity of model checking and formal verification of chemical reaction networks. We further show the problem is PSPACE-complete for restricted classes of CRNs, such as when the CRNs are proper or when the reaction volume is a closed system (i.e., tagged CRNs). The results also hold at the DSD level. In this chapter we once again reason concretely at the sequence level to consider folding pathways. We show that, beginning with our q3sat solver construction, we can prove that finding minimum energy barrier indirect folding pathways for multiple interacting strands is PSPACE-complete. Chapter 6 Conclusion Our research began with a desire to better understand the combinatorial nature of nucleic acid folding pathways between two secondary structures of the same nucleic acid strand. As folding pathways tend to avoid high-energy structures, a primary motivation was to understand and computationally predict low energy barrier folding pathways exhibited in biological systems. As with early studies of RNA structure prediction, we decided to focus on the simple energy model that corresponds to the number of base pairs of the involved structures. The reasoning for this choice was two-fold. First, this model is significantly simpler and remains sufficient to understand the complexity of the underlying combinatorial problem. If the problem is hard in the simple energy model, this provides evidence that it is hard for more complex models. Second, if effective algorithms are developed in the simple energy model, then it is possible they could be adapted for more complex energy models. This was the case for the RNA structure prediction problem, which was first studied with the simple energy model [89] and later improved to use the Turner energy model [79].
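To make the simple energy model concrete, the following is a minimal sketch of how the barrier of a single-strand pathway is computed under it (the arcs are invented for illustration; a structure is a set of arcs and E(S) = −|S|):

```python
def barrier(pathway):
    """Energy barrier of a single-strand folding pathway in the simple
    energy model, where E(S) = -|S| (minus one unit per base pair) and
    the barrier is measured relative to the initial structure."""
    e0 = -len(pathway[0])
    return max(-len(s) for s in pathway) - e0

# A toy direct pathway: both arcs of the initial structure must be
# removed before the crossing arcs of the final structure can be added.
path = [
    {(1, 8), (2, 7)},   # initial structure A
    {(1, 8)},
    set(),              # highest-energy intermediate
    {(3, 10)},
    {(3, 10), (4, 9)},  # final structure B
]
assert barrier(path) == 2
```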
6.1 Predicting folding pathways At the outset of this research, the computational complexity of this problem remained unknown. We began by studying direct folding pathways [83], where intermediate structures could only remove base pairs from the initial structure and only add base pairs from the final structure. In Chapter 2 we have shown that the energy barrier problem for direct pseudoknot-free folding pathways is NP-complete, via a reduction from the 3-partition problem. Thus, unless P = NP, there is no polynomial-time algorithm for calculating the energy barrier of direct folding pathways. The proof in Chapter 2 can also shed light on energy landscapes. A property of the proof is that there are exponentially many partial folding pathways that are within the minimum energy barrier; however, by design, only one will lead to a full pathway with minimum energy barrier. Thus, if pathways are followed according to a random process, it could take exponential time for the random process to find the pathway with minimum energy barrier. In one view, this suggests that for certain instances it would be much more informative to ask which is the most likely folding pathway. This would be appropriate when the relative barrier difference between many possibilities is small. In another view, this suggests folding pathways may be leveraged to perform non-trivial computation, especially if a guarantee can be made that the correct pathway has a significantly lower overall barrier than incorrect pathways. Unfortunately, our proof is deficient in forcing this separation between the correct pathway and other incorrect pathways. Specifically, the difference between a minimum energy barrier in the contrived construction of our proof, and an alternate incorrect folding pathway, may be some small constant. Therefore, while it remains hard to find a minimum energy barrier pathway in the worst case, it may not be hard to find a close approximation.
Our current result does not preclude this possibility. Shortly before this initial research began, interesting new directions were being explored in the field of DNA computing and molecular programming. In particular, DNA strand displacement systems (DSDs) were designed and implemented to perform simple computations, among other tasks. These systems use multiple interacting strands and are designed such that a correct sequence of strand displacements follows a low energy barrier folding pathway, while incorrect displacement sequences must overcome a larger energy barrier. Furthermore, the underlying designed folding pathways for many of these initial systems were direct. Specifically, these early systems shared the common characteristic that any particular strand may be displaced once and may displace one other strand. Thus, there was a growing need to understand and predict folding pathways of multiple interacting strands, even for direct folding pathways. Such knowledge could be used in the design and debugging of new molecular programs that leverage folding pathways. From our result on the single strand case in Chapter 2, by restriction we were able to conclude that predicting direct folding pathways for multiple interacting strands is also an NP-complete problem. However, these initial complexity results did not resolve the complexity of the general energy barrier problem, in which the pathway need not be direct. Two challenges in understanding the complexity of this problem were repeat base pairs—base pairs added and removed multiple times in a pathway—and temporary base pairs—base pairs specified in neither the initial nor the final structure but which form temporarily in order to improve the energy barrier. Regardless of problem complexity, there was a need for an exact prediction algorithm that is efficient in practice.
Prior to this research, all exact algorithms had time and space complexity exponential in the size of the input (the length of the nucleic acid strand(s)). In Chapter 3, we proposed an algorithm to exactly solve a generalized version of the direct energy barrier folding pathway problem, defined in terms of bipartite graphs. The algorithm has exponential worst-case time complexity but, importantly, uses only polynomial space. The algorithm is practical for most instances tested in our empirical study, although it fails to solve some instances in a reasonable runtime. Moreover, the algorithm is inherently parallel, and this parallelism could be exploited to help solve hard instances. One important contribution of this work is a polynomial time algorithm that can split a problem instance into many smaller sub-instances. While we cannot avoid exponential worst-case runtime with our method, our splitting algorithm may be of independent interest and could be used in conjunction with heuristic methods. For instance, it could be used to first partition the solution space into sub-problems, with the aim of improving both the efficiency and accuracy of the overall heuristic method used to solve each sub-problem. Our pathway prediction algorithm only considered single strands. For direct pathways, it seems the algorithm could be generalized in a straightforward manner to consider multiple interacting strands. Such a generalization should also consider the entropic penalty for strand association. It seems plausible that additional nodes added to the corresponding bipartite graph instance could achieve this aim. Such an algorithm would be useful for verification of DNA strand displacement systems which follow direct folding pathways. Unfortunately, the algorithm does not seem immediately applicable to indirect folding pathways.
The design of the algorithm explicitly assumes that the graph modeling the conflicts between the arcs (representing base pairs) of the initial and final structures of a problem instance is bipartite. In an indirect folding pathway, where any non-crossing arc forming a Watson-Crick base pair can be added at any point along a pathway, the conflict graph is not necessarily bipartite (and is unlikely to be in general). Still, it is possible that a better understanding of the structure of conflict graphs for indirect pathways could lead to a similar result. The conflict graphs formed for indirect folding pathways can be characterized as circle graphs. The conflict graphs for direct pathways are 2-colourable circle graphs (see Figure 3.8 for an example). For direct pathways, we were able to exploit the following property: if one could identify an MFE structure C consisting of arcs from both the initial and final structures, A and B respectively, then there always exists an optimal pathway from A to B via C. Could a generalized version of the algorithm proposed here be adapted for indirect pathways? Most properties exploited in the proofs are argued in terms of independent sets. Removing assumptions regarding the colourability of the graph would be a necessary first step. Interestingly, by proving that the algorithm of Chapter 3 is correct, we were also able to prove that repeat base pairs do not help in a direct folding pathway. This established that the direct-with-repeats energy barrier folding pathway problem is NP-complete for the single strand case and NP-hard for the multiple interacting strand case. However, these early results, even those that consider repeat base pairs, did nothing to resolve the complexity of predicting indirect folding pathways that permit temporary base pairs.
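The conflict-graph distinction just described can be illustrated with a small sketch (hypothetical arcs; `conflict` and `is_bipartite` are illustrative helpers, not the algorithm of Chapter 3):

```python
from collections import deque

def conflict(a, b):
    """Two arcs conflict if they cross (would form a pseudoknot) or
    share an endpoint."""
    (i, j), (k, l) = sorted(a), sorted(b)
    crossing = i < k < j < l or k < i < l < j
    return crossing or len({i, j, k, l}) < 4

def is_bipartite(vertices, edges):
    """Standard BFS 2-colouring check of an undirected graph."""
    colour, adj = {}, {v: [] for v in vertices}
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    for s in vertices:
        if s in colour:
            continue
        colour[s] = 0
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in colour:
                    colour[w] = 1 - colour[u]
                    queue.append(w)
                elif colour[w] == colour[u]:
                    return False
    return True

A = [(1, 8), (2, 7)]     # arcs of the initial structure
B = [(3, 10), (4, 9)]    # arcs of the final structure
arcs = A + B
edges = [(a, b) for i, a in enumerate(arcs)
         for b in arcs[i + 1:] if conflict(a, b)]
# For a direct-pathway instance, conflicts only occur between an arc
# of A and an arc of B, so the conflict graph is 2-colourable.
assert is_bipartite(arcs, edges)
```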
As these prediction problems were computationally hard, what they did do was motivate us to study the computational limits of DNA strand displacement systems (DSDs) that leverage low energy barrier folding pathways. How appropriate, then, that it was a DSD construction we devised in Chapter 5 that served as the basis to show that predicting minimum energy barrier indirect folding pathways for multiple interacting strands is PSPACE-complete. However, our new construction shares a common deficiency with our original construction for the direct folding pathway prediction problem. Specifically, the minimum energy barrier pathway is only guaranteed to be a small constant improvement over incorrect pathways. While the proof is sufficient to show the hardness of the problem in the worst case, it leaves open the possibility that there may exist a polynomial time constant factor approximation algorithm. It also underscores a significant issue, already well known to the community, that must be overcome in the design of DSDs: blunt-end displacements. In our folding pathway construction, which is based on a sequence design for a particular DSD construction, we identified three types of illegal displacements. By using a more sophisticated sequence design, for two of these types of illegal displacements we could ensure that the difference in energy between the minimum barrier pathway and any other pathway grows polynomially in the combined length of the strands. However, as a blunt-end displacement occurs with the use of an identical domain, a clever sequence design cannot improve the desired energy barrier separation. In these cases, the energy barrier separation is dictated fully by the length of toehold domains. By design, toeholds are always of constant length to ensure displaced strands can easily disassociate from template strands.
While this is a significant issue for the design of DSDs, it does not preclude the possibility that another construction, not based on DSDs, could be found that gives a polynomial, or even logarithmic, separation between the minimum barrier pathway and all other pathways. Not only would this be informative for the prediction problem, it would also be an interesting future direction in the design of folding pathways for computation and other molecular programming tasks. Unfortunately, the complexity of predicting a minimum barrier indirect folding pathway of a single strand remains open. This was the original problem that motivated this entire line of research and is the most relevant problem for understanding folding pathways within a biological context. While other variants of the problem proved to be computationally hard, it remains possible that this problem is in P. It could be the case that the direct folding pathway problem is too constrained to be easy, while the increased complexity in the indirect case only arises when there are a polynomial number of strands. If a polynomial time algorithm for this problem emerges, a logical next step would be to extend the result to use a more sophisticated energy model [79]. 6.2 Designing folding pathways The complexity of predicting minimum energy barriers suggested to us that folding pathways may be a mechanism for performing non-trivial computation. From that perspective, we next aimed to understand the computational power of deterministic molecular programs that leverage folding pathways. In particular, we were interested in understanding the computational limits of DNA strand displacement systems (DSDs) and, more generally, the chemical reaction networks (CRNs) that they implement. There was a particular need to understand space complexity in these models. Towards that end, in Chapter 4 we introduced the concept of a tagged chemical reaction network in order to account for changes to auxiliary strands and complexes, often called fuel or transformers in our work, when reactions are implemented in a DSD. Specifically, each reaction i is assigned a unique tag, Ti. When the reaction occurs in the forward direction, Ti is consumed and, if the reaction is reversible, a new tag TiR is produced. Should the reaction need to occur again in the forward direction, then either another copy of the tag must be available or the reaction must first be reversed in order to consume the tag TiR and re-produce the tag Ti. This simple mechanism allows us to reason concretely about the quantity of molecules required to complete a computation. In the context of molecular programs operating in a closed reaction volume, space can be thought of as the necessary size of that volume to fit all molecules necessary to complete a computation, inclusive of tags, which represent fuel/transformers. Can a biological soup of nucleic acids having total size poly(n) perform a computation, by means of a folding pathway, of Θ(2^n) steps? In Chapter 4 we demonstrated that yes, this is possible, by first giving a tagged CRN for an n-bit Gray code counter that deterministically advances through 2^n states while only requiring Θ(n) total molecules, and then showing how it can be implemented as a DSD using poly(n) space. This implementation introduced the concept of recycling, or molecule reuse, in strand displacement systems and chemical reaction networks. To our knowledge, in addition to being the first molecular program to significantly recycle molecules/strands, this is the first example of a designed indirect folding pathway that has length exponential in the number of nucleotides of the interacting strands. In developing our result, we also introduced the use of a mutex strand to an existing construction of Qian et al. [98] to ensure that any chemical reaction can be realized by a DNA strand displacement cascade that appears to execute atomically.
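The tag mechanism described above can be sketched as a small state-transition function (a minimal illustration assuming unit stoichiometry and unit tag counts; the species and tag names are invented):

```python
def fire(state, reaction, reverse=False):
    """One step of a tagged CRN: reaction = (reactants, products, tag).
    Firing forward consumes one copy of tag Ti and produces TiR;
    reversing consumes TiR and re-produces Ti. `state` is a multiset
    given as a dict mapping species to counts."""
    reactants, products, tag = reaction
    if reverse:
        reactants, products = products, reactants
        consumed, produced = tag + "_R", tag
    else:
        consumed, produced = tag, tag + "_R"
    needed = list(reactants) + [consumed]
    assert all(state.get(s, 0) >= 1 for s in needed), "cannot fire"
    new = dict(state)
    for s in needed:
        new[s] = new.get(s, 0) - 1
    for s in list(products) + [produced]:
        new[s] = new.get(s, 0) + 1
    return new

# A reversible reaction A + B <-> C + D with tag "T1":
rxn = (("A", "B"), ("C", "D"), "T1")
s0 = {"A": 1, "B": 1, "T1": 1}
s1 = fire(s0, rxn)                # forward: consumes T1, produces T1_R
assert s1["T1_R"] == 1 and s1["C"] == 1
s2 = fire(s1, rxn, reverse=True)  # reversing restores the tag
assert s2["T1"] == 1 and s2["T1_R"] == 0
```

Because the tag must be present for the reaction to fire again, counting tags bounds the number of forward steps a closed volume can take without reversals, which is exactly the space accounting used here.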
At the least, this contribution provides a direct correspondence from a tagged CRN to a DSD implementation and, as a result, greatly simplified our correctness proofs. Furthermore, it motivated us to continue reasoning at the more abstract level of CRNs.

While the Gray code counter demonstrated that space-efficient molecular programming is possible, we next asked whether all space-bounded Turing machine computations could be realized by strand displacement systems whose space and expected time are within a (small) polynomial factor of the space and time of the Turing machine computation. In Chapter 5, we gave a logically reversible, space efficient CRN implementation of a quantified Boolean formula solver capable of solving any problem in PSPACE—the class of all problems solvable in polynomial space. Furthermore, our CRN can be realized by a space efficient DSD. We have also shown how these results can be extended to solve any problem in SPACE. Thus, any computation that halts can be solved by a space efficient DSD. The only other DSD implementation capable of solving all problems in SPACE is the stack machine implementation of Qian et al. [98]. The model of their result is a variant of the DSD model in which reactions not only produce and consume signal strands, but can also extend or reduce the number of base units at one end of a polymer. The result of Chapter 5 improves upon the stack machine implementation in terms of space efficiency, as the stack machine uses space proportional to computation length. However, our result falls short in a number of other respects when compared with the stack machine. Our construction provides a non-uniform model of computation and thus, as currently described, a new CRN, and hence DSD, must be created for each different problem instance to be solved. It is conceivable that the result can be generalized to solve any problem instance up to a particular size.
The stack machine implementation, in contrast, is Turing universal. Since our result is based on a non-uniform model of computation and uses a reduction from a Turing machine to a quantified Boolean formula, it cannot simulate computations that do not halt, and is therefore not capable of Turing universal computation.

In addition to further characterizing the computational power of standard molecular programming systems, we considered a number of related problems in Chapter 5. We showed that determining whether a certain state is reachable in a CRN, such as a desirable or undesirable configuration, is PSPACE-hard. This demonstrates the intrinsic complexity of model checking and formal verification of chemical reaction networks. We further showed that the problem is PSPACE-complete for restricted classes of CRNs, such as when the CRNs are proper—each reaction produces the same number of molecules it consumes—or when the reaction volume is a closed system (i.e., the CRN is tagged). The results also hold at the DSD level.

Aside from the potential biological and chemical applications, DSDs and CRNs are also of independent interest due to their promise for realizing energy efficient computation. Rolf Landauer proved that logically irreversible computation — computation as modeled by a standard Turing machine — dissipates an amount of energy proportional to the number of bits of information lost, such as previous state information, and therefore cannot be energy efficient [69]. Surprisingly, Charles Bennett showed that, in principle, energy efficient computation is possible, by proposing a logically reversible universal Turing machine; he also identified nucleic acids (RNA/DNA) as a potential medium for reversible computation [8]. However, this remained a theoretical result with no known physical implementation. It was Qian et al. [98] who first demonstrated that energy efficient computation could be realized, in principle, with a logically reversible DSD system that simulates a stack machine.
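The reachability question considered above can be pinned down with a toy sketch (species names and reactions here are hypothetical, ours for illustration only): a configuration is a multiset of species counts, a reaction may fire when its reactants are present, and we ask whether a target configuration is reachable. Since the problem is PSPACE-hard in general, this brute-force search cannot scale; it only illustrates the semantics, including how a single tag limits what a tagged CRN can do.

```python
from collections import deque

# Each reaction: (consumed, produced), as dicts of species -> count.
# Reaction 1 is tagged: firing forward consumes tag T1 and produces T1R.
reactions = [
    ({"A": 1, "T1": 1}, {"B": 1, "T1R": 1}),   # A + T1 -> B + T1R
    ({"B": 1, "T1R": 1}, {"A": 1, "T1": 1}),   # reverse reaction
    ({"B": 2}, {"C": 1}),                      # 2B -> C
]

def reachable(initial, target, reactions):
    """Breadth-first search over CRN configurations."""
    goal = frozenset(target.items())
    seen = {frozenset(initial.items())}
    queue = deque([initial])
    while queue:
        conf = queue.popleft()
        if frozenset(conf.items()) == goal:
            return True
        for consumed, produced in reactions:
            if all(conf.get(s, 0) >= c for s, c in consumed.items()):
                nxt = dict(conf)
                for s, c in consumed.items():
                    nxt[s] -= c
                    if nxt[s] == 0:
                        del nxt[s]
                for s, c in produced.items():
                    nxt[s] = nxt.get(s, 0) + c
                key = frozenset(nxt.items())
                if key not in seen:
                    seen.add(key)
                    queue.append(nxt)
    return False

# With a single tag T1, at most one B can ever exist, so C is unreachable:
assert reachable({"A": 2, "T1": 1}, {"A": 1, "B": 1, "T1R": 1}, reactions)
assert not reachable({"A": 2, "T1": 1}, {"C": 1, "T1R": 1}, reactions)
```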
Our quantified Boolean formula solver of Chapter 5 is also logically reversible. Thus, we have demonstrated that any space-bounded Turing machine computation can be realized, in principle, by a space and energy efficient CRN and DSD. However, our CRN implementations throughout this thesis share a common assumption with the stack machine implementation: certain initial signal molecules must occur as a single copy. Initially, this assumption served to simplify the discussion. However, in Chapter 4 we have shown that this assumption is actually crucial to achieve space-efficient computation. We have shown that for any proper CRN, any signal molecule can be produced using just O(n^2) reaction steps when Θ(n) copies of the initial signal molecules share the same volume. This result has since been improved by others to consider more general classes of CRNs [27]. We have also shown a much stronger result for deterministic computations in a closed volume. Specifically, even having a second copy of the initial input signals ensures that no tagged CRN can be deterministic after a linear number of steps. The intuition as to why the single copy assumption is important is that it gives us a means to erase information. In a single copy setting, once a molecule of a particular type is consumed, it is no longer present. In a multi-copy setting, once a molecule of a particular type is consumed, there is no guarantee that the other copies are simultaneously consumed. While the single copy restriction permitted us to study the very limits of computation for a biological soup, it imposes a significant engineering challenge. All DSD implementations to date use concentrations of strands of each type. Producing and successfully executing a DSD with a single copy restriction is currently challenging, but feasible. For instance, the first published result on the measurement of a single enzyme molecule was by Boris Rotman, in 1961 [107].
The experimental techniques developed in that first paper remain influential and in use, and new advancements in single molecule studies continue to be made [62]. Our results also hold at the more general level of chemical reaction networks. Thus, any physical realization of a CRN could, in principle, make use of our constructions. Furthermore, the problems in a multi-copy system only arise when signals from one copy interfere, or cross-talk, with signals from another copy. If one copy could be compartmentalized from another, the challenge could be overcome. This may involve a move away from a biological soup, back to a strictly surface based model [14], or some hybrid, possibly involving recent advances in DNA origami [105]. If all signals from each copy were tethered to a surface, and separated from other copies, then reactions could proceed as expected. This idea is not ours; it has been suggested as a means to improve the speed of DSD reactions by co-locating related strands [19, 100].

Should a practical means be developed to address the single-copy issue, then a more rigorous study of logically reversible CRNs is appropriate. In the course of our research it became clear that certain techniques could be reused when developing a logically reversible CRN. For instance, the 3SAT verification procedure could be reliably executed, much like a subroutine, by producing the signal molecules necessary for either its initial or its final reaction, and by ensuring that signals produced and consumed by intermediate reactions involved only signals local to that procedure. Furthermore, a common technique we used was to add an additional reaction that effectively doubles the length of a computation by forcing the original chain of reactions to reverse. This permitted us to actively recycle molecules/strands and was evident both in our Gray code counter of Chapter 4 and in our quantified Boolean formula solver of Chapter 5.
These techniques could conceivably be extended into a formal grammar for programming logically reversible CRNs.

Finally, we find the current complexity classes for logically reversible computation too general to capture the realities of logically reversible molecular programming. The class ReversibleSPACE represents all problems that can be solved by a space-bounded, logically reversible Turing machine. As with any Turing machine, the space bound is with respect to the length of tape necessary to complete the computation. In CRNs and DSDs, bits of information are represented by the presence and absence of signal molecules. Thus, the length of tape required in the Turing machine computation corresponds well with the maximum quantity of signals required during the CRN computation. However, this does not account for fuel (transformers) that a CRN may require to complete its computation. The reaction is the fundamental operation in a CRN, just as a state transition is the fundamental operation of a Turing machine. However, with current technology, a reaction in a CRN requires fuel, which in turn requires physical space, whereas a Turing machine state transition does not. In essence, a logically reversible Turing machine could perform all state transitions in only one direction, while still using significantly less space than the number of computation steps. This is not currently possible in molecular programming. We have demonstrated that any space-bounded computation can be realized with a logically reversible tagged CRN that requires only one tag per reaction equation. In essence, our logically reversible CRN strictly alternates its reactions between the forward and reverse directions. It is conceivable that we could simulate our CRN with a logically reversible Turing machine.
It is also conceivable that such a simulation could be constructed to ensure that each state transition of the Turing machine either strictly alternates between being applied in the forward and reverse direction, or adheres to a polynomial bound on the difference between forward and reverse transitions, at every step of the computation. Should such a construction be possible, we will have given a logically reversible Turing machine, capable of simulating any space-bounded Turing computation, that is semantically restricted to capture the notion of fuel. We let this fuel-restricted variant of ReversibleSPACE denote the class of problems solvable by such a Turing machine. It has already been shown by Lange et al. [70] that ReversibleSPACE = SPACE. In future work, it is our goal to show that the fuel-restricted variant also equals SPACE.

Part II

Space efficient text indexes motivated by biological sequence alignment problems

Chapter 7

Introduction

7.1 Text indexing

The study of strings, their properties, and associated algorithms has played a key role in advancing our understanding of problems in areas such as compression, text mining, information retrieval, and pattern matching, amongst numerous others. A most basic and widely studied question in stringology asks: given a string T (the text), how many occurrences of a string P (the pattern) does it contain as a substring, and in what positions? It is well known that this problem can be solved in time proportional to the lengths of both strings [63]. However, it is often the case that we wish to repeat this question for many different pattern strings and a fixed text T of length n over an alphabet of size σ. The idea is to create a full-text index for T so that repeated queries can be answered in time proportional to the length of P alone. It was first shown by Weiner [139] in 1973 that the suffix tree data structure could be built in linear time for exactly this purpose.
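As a concrete, if naive, illustration of the full-text indexing idea (code is ours, for illustration; real indexes never materialize the suffixes): once the suffixes of T are available in sorted order, each pattern query reduces to binary search over that order, touching O(|P| log n) characters regardless of how many queries follow.

```python
from bisect import bisect_left, bisect_right

def suffix_array(text: str):
    """Positions of the suffixes of text, in sorted order."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def occurrences(text: str, sa, pattern: str):
    """All starting positions of pattern in text, via binary search."""
    suffixes = [text[i:] for i in sa]            # materialized for clarity only
    lo = bisect_left(suffixes, pattern)
    hi = bisect_right(suffixes, pattern + "\uffff")
    return sorted(sa[lo:hi])

assert suffix_array("banana$") == [6, 5, 3, 1, 0, 4, 2]
text = "mississippi$"
sa = suffix_array(text)
assert occurrences(text, sa, "issi") == [1, 4]   # 0-based positions
assert occurrences(text, sa, "xyz") == []
```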
The ensuing years have demonstrated the versatility of the suffix tree, which has been shown to solve numerous other related problems. While suffix trees use O(n) words of space in theory, this does not translate to a space efficient data structure in practice. For this reason, Manber and Myers [78] proposed the suffix array data structure. Though a great practical improvement over suffix trees, the Ω(n log n)-bit space requirement (we use log to denote log2 throughout) is often prohibitive for larger texts. Building in part on the pioneering work of Jacobson [56] on succinct data structures, two seminal papers helped usher in the study of so-called succinct full-text indexes. Grossi and Vitter [45] proposed a compressed suffix array that occupies O(n log σ) bits; the same space required to represent the original string T. Soon after, Ferragina and Manzini [33] proposed the FM-index, a type of compressed suffix array that can be inferred from the Burrows-Wheeler transform (BWT) of the text and some auxiliary structures, leading to a space occupancy proportional to nH_k(T) bits, where H_k(T) denotes the k-th order empirical entropy of T: for a text T of length n, the term H_k(T)·n is a lower bound on the number of bits required to encode T with any algorithm that uses contexts of length at most k. The BWT is a reversible transformation that produces a permutation of the original text (T^bwt) which is more easily compressible with local compressors such as run-length encoding (often preceded by the move-to-front transform [10]). The BWT works by (1) creating a conceptual matrix where each row is a different cyclic rotation of the original text (with an appended special character, lexicographically smaller than any character of the alphabet); (2) sorting the rows into lexicographical order; and (3) outputting the last column of the sorted matrix. The BWT is useful for compression as it often produces successive runs of the same character. For instance, consider a text over the English language containing many instances of the words this, that, there, the, those, etc. Cyclic rotations beginning with the letter h would have a high probability of ending with the letter t; thus, after sorting, the transformed text, taken from the last column, would likely contain runs of the letter t. The Burrows-Wheeler transformed text T^bwt is related to the original text by the so-called LF mapping. Thus, as Ferragina and Manzini showed, any queries on the compressed representation of T^bwt can be interpreted as queries on T itself (and conversely). This is the basic idea behind most succinct self-indexes proposed thus far [88]. We note that the success of succinct data structures in general relies on efficient operations on bit vectors, and is due to the seminal work of Jacobson [56] and Munro et al. [86]. For an in-depth discussion, the reader is referred to the work of Mäkinen and Navarro [76]. For details on the LF mapping, the BWT, or compressed full-text indexes in general, the reader is referred to the excellent review by Navarro and Mäkinen [88]. These and subsequent results have made it possible to answer the substring question efficiently on texts as large as, or larger than, the human genome.

7.2 Biological sequence alignment

The Human Genome Project has enabled a revolutionary step forward in understanding our genes and their function. A significant next challenge is to understand genome variation across individuals and its correlation with disease, as well as genomic mutations and rearrangements in cancerous cells. Since at least two reference human genome sequences are now available, de novo assembly of a genome of interest from short fragments — inferring a linear genome sequence from a collection of shorter DNA fragments called reads — is no longer required in most human genomic studies. Instead, current studies focus on resequencing, that is, inference of the genome of interest by alignment of the reads, produced by sequencing the genome, to the available reference genomes (see Figure 7.1). The actual information sought is not the canonical sequence of the genome of interest, but rather how it differs from a known reference. For example, single nucleotide variations (SNVs) in an individual's genome (compared with the wild type or reference genome) have been identified as significant in many types of human cancer [13, 40, 48, 93, 103, 144] (see Figure 7.1). These discoveries are enabling the development of novel methods for disease diagnosis and therapy [22]. Fueling the discovery of genetic variation amongst populations and individuals has been the application of next generation sequencing (NGS) technology. While the technologies underlying competing NGS platforms vary, all share significant differences from traditional, Sanger style sequencing [57]. The new technologies focus on massively parallel sequencing and are capable of producing millions of reads in a typical run [53, 57]. While the sheer quantity of reads and overall bases which can be sequenced in a given time frame are vastly greater than for traditional Sanger style sequencing [118], there are at least two caveats. First, NGS reads are short, typically between 28 and 300 bases, dependent on the specific platform (compared to the ∼1,000 base reads of Sanger style sequencing) [118]. Second, NGS reads are more prone to sequencing errors, whereby the reported sequence of a read differs from the true sequence of the DNA molecule; these differences can be characterized by the common string edit operations of substitution, insertion and deletion. The rate and type of sequencing errors is also dependent on the platform employed. Both of these features of NGS reads, coupled with the scale of the produced data, can confound the task of efficiently aligning reads to a reference genome, and have forged it as one of the most actively researched problems in contemporary bioinformatics.

Figure 7.1: An example of short reads aligned to a reference genome G. Alignments may contain matches, mismatches, insertions and deletions. For instance, the alignment of the single read to the reference (red outline) contains a match in the first position of the alignment, a mismatch in the second, an insertion in the third position and a deletion in the twelfth position. Sequencing the genomes of individuals helps determine genetic mutations, such as single nucleotide polymorphisms/variations (SNPs/SNVs) of individuals compared to a reference genome.

NGS is also being utilized to capture data from the transcriptome, a process referred to as RNA-Seq [84]. Instead of sequencing genomic DNA, RNA-Seq aims to sequence the complementary DNA (cDNA) of RNA molecules in a cell. Transcriptome read alignment is providing valuable information to researchers, beyond genomic sequencing. In particular, this technology can be used to quantify the level of expression of various transcripts by sequencing messenger RNA, thus implicating the relative expression level of proteins. For instance, a highly expressed transcript should yield higher read coverage than a poorly expressed one. This technology has also been used to elucidate RNA editing events, exon boundaries and novel alternative splice junctions [96, 132, 138].

7.3 Objectives

The use of full-text indexing has played a crucial part in advancing our understanding of biological sequence data.
The abundance and production rate of high-throughput sequencing data and the size of genome sequences have all but necessitated the use of succinct text indexes—those using space proportional to the information theoretic lower bound. Indeed, the more recent aligners, including Bowtie [71] and BWA [73] amongst numerous others, make use of succinct self-indexes (a text index is a self-index if it does not require the original text to be stored, yet can provide efficient access to any substring of the original text). As memory resources on commodity hardware have steadily increased, succinct indexes are no longer strictly necessary for many of these applications when large memory machines are available. However, even when sufficient memory is available to use non-succinct text indexes, succinct indexes often yield more efficient solutions due to better caching performance, as one would expect on non-uniform memory architectures (NUMA). Rather than add to the growing number of new tools that improve sequence alignment efficiency through various heuristics and implementation techniques, my interest lies in adapting and improving the underlying full-text index data structures to better model biological reference sequences such as genomes. In particular, I have identified two specific types of biological sequence events, alluded to in the previous section, that are not naturally captured by a static, linear text. The first is the presence of known variations, amongst a population, in specific positions of a reference genome. The second is known splicing events between positions in a reference genome that form part of a common transcript sequence. More details of these events, and how they are incorporated into a full-text index, are given in subsequent chapters. The solutions proposed are theoretically rigorous in the sense that all time and space complexity claims are formally proved.
My focus in proposing these indexes was to improve various time and space trade-offs compared with existing solutions found in the literature. While these contributions are made and studied from a theoretical perspective, the hope is that incorporating this information directly into the primary indexes will result in more biologically meaningful alignments and/or improve the overall efficiency of the alignment problem. As discussed in the conclusion of this part, additional steps are necessary for these results to have a practical impact, such as adapting query algorithms for approximate matching. While my motivation arises from these particular applications in biological sequence alignment, the results I give in this part are more general and may have applications in other problem domains.

7.4 Contributions

We now describe the contributions of this part of the thesis, in the order they are discussed:

1. We propose a compressed full-text dictionary that can index a set of text segments in space proportional to the compressed size of their concatenation while still supporting a number of efficient query operations. These include determining which text segments are contained within a query pattern and which contain or prefix a query pattern.

2. We propose new succinct indexes for text containing wildcards that improve the space complexity compared with the state-of-the-art existing at the time they were proposed, without increasing query time complexity. Independently and in parallel with another group, we give the first compressed index for this problem. We also show how our results can be combined with the independently proposed results to improve the state-of-the-art for this problem.

3. We propose a new query algorithm, based on dynamic programming, for indexes of text with wildcards that significantly improves the query working space complexity over existing solutions. The algorithm is fairly general and easily adapted for use with other indexes that use the same general strategy of pattern matching in text with wildcards.

4. We show a correspondence between the wildcard and hypertext indexing problems by demonstrating that standard strategies and techniques for solving the former can be generalized to solve the latter.

5. We propose the first index for hypertext, a graphical generalization of text. We first propose a succinct index and later show that the index only requires space proportional to the length of the compressed text and the topology of its graphical structure. We also study a number of interesting restrictions of hypertext.

7.5 Outline

Each chapter of this part builds on results from previous chapters. Chapter 8 introduces much of the notation and existing results from the literature that are leveraged throughout the part. Subsequent chapters develop notation and introduce other existing results as needed, in a cumulative manner. The first technical result presented in this part is the design of the full-text dictionary index, presented in Chapter 8. This result is leveraged in the design of our other proposed text indexes. While it is unnecessary to understand the complete implementation details of the full-text index in order to understand the wildcard or hypertext index, it is critical to understand its supported query operations as stated in Theorem 22. In Chapter 9 we develop indexes for text containing wildcards and the associated query algorithms for exact matching. Chapter 10 details indexes for hypertext, a graphical generalization of linear text. In Chapter 11 we summarize the contributions of this part and highlight a few of the more important open problems arising from this work.
Chapter 8

A compressed full-text dictionary

8.1 Introduction

A full-text index is a data structure that can efficiently determine the positions, in a fixed string T (the text), where an arbitrary query string P (the pattern) appears as a substring. Such an index is useful, for instance, to repeatedly search an electronic book for different words, or a genome sequence for biological sequence signatures. A dictionary is a data structure for a fixed and ordered collection D = (T_1, T_2, ..., T_d) of strings called text segments, that can efficiently determine all occurrences of all text segments that appear as a substring in an arbitrary query string P; these are called dictionary matches. Such an index is useful, for instance, to search many email messages for all occurrences of a fixed collection of keywords associated with spam. The entirety of this chapter is devoted to describing a new data structure, called a compressed full-text dictionary, that combines the features of a full-text index with those of a dictionary, and uses space roughly equal to that of just a compressed full-text index. Specifically, we are interested in the benefits of a full-text index for a string T together with the ability to perform dictionary matches when T is composed of a number of text segments delimited by a special character. Such a data structure can be used to improve the space complexity of current approaches for indexing text containing wildcards (see Chapter 9). Note that it is always possible to create a compressed full-text dictionary of any ordered collection of text segments, by first concatenating them using a special character as a delimiter. In Chapter 10, we show this approach is useful for creating a compressed index for hypertext, a generalization of linear text. Content from this chapter appears in the proceedings of the 22nd Annual Symposium on Combinatorial Pattern Matching (CPM 2011) [127] and in the journal Theoretical Computer Science [128].
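The query families involved can be pinned down with a naive sketch (illustrative Python of ours, not the compressed structure itself): dictionary matches of a pattern P, the segments prefixed by P, and the segments containing P.

```python
def dictionary_matches(D, P):
    """Indices (1-based) of text segments occurring as substrings of P."""
    return [i for i, t in enumerate(D, 1) if t in P]

def prefixed_by(D, P):
    """Indices of text segments having P as a prefix."""
    return [i for i, t in enumerate(D, 1) if t.startswith(P)]

def containing(D, P):
    """Indices of text segments containing P as a substring."""
    return [i for i, t in enumerate(D, 1) if P in t]

D = ("issi", "sip", "mississippi", "ippi")
assert dictionary_matches(D, "ssissi") == [1]
assert prefixed_by(D, "si") == [2]
assert containing(D, "ss") == [1, 3]
```

The point of the chapter is that all three families can be answered from a single structure whose space is roughly that of a compressed full-text index of the concatenated segments, rather than by the linear scans above.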
(The term full-text index is used in the string community to denote a data structure that supports efficient substring query operations over a text. Constructing such a data structure is referred to as indexing the text, and the data structure itself is referred to simply as an index for the text. The term spam describes unsolicited and indiscriminate messages sent in bulk as electronic mail.)

8.1.1 Related work

A succinct dictionary was used as a subsidiary data structure in the approach of Tam et al. [124], who proposed the first succinct index for text containing wildcards. Very recently, Belazzougui [6] proposed a compressed dictionary, based on the Aho-Corasick automaton, having optimal query time. The compressed space occupancy was further improved by a modification given by Hon et al. [50]. While these results are impressive and interesting in their own right, the wildcard matching problem and the hypertext matching problem, discussed in subsequent chapters, benefit from a full-text dictionary that can report the text segments contained in P (dictionary matches), as well as the text segments which are prefixed by P and those which fully contain P (substring matches).

8.2 Preliminaries

We first develop notation that will be used throughout this chapter and the subsequent chapters of this part. We also state useful lemmas for fundamental and well-known succinct data structures that our results employ as subsidiary data structures. In addition to the space complexity of these data structures, we give the time complexity for the relevant operations we perform on them. It is not necessary to understand the details of how these subsidiary data structures support the listed operations in order to understand our results. However, we point the reader to the relevant literature should the details of such data structures be of interest.
Unless otherwise stated, equivalent or improved versions of these subsidiary data structures can be substituted in the development of our data structures. Let T[1, n] be a string over a finite alphabet Σ of size σ. We denote its j-th character by T[j] and a substring from the i-th to the j-th position by T[i..j]. We assume that an end-of-text sentinel character $ ∉ Σ has been appended to T (T[n] = $) and that $ is lexicographically smaller than any character in Σ. For any substring X we use |X| to denote its length and X̄ to denote its reverse sequence. The suffix array SA of T is a permutation of the integers [1, n] giving the increasing lexicographical order of the suffixes of T, where SA[i] = j means that the i-th lexicographically smallest suffix of T begins at position j. Conceptually, SA can be thought of as a list of all suffixes of T in lexicographic order. For example, Figure 8.1(c) gives the sorted list of the twelve suffixes of the string mississippi$. A string X has a suffix array (SA) range [a, b] with respect to SA if a − 1 suffixes of T are lexicographically smaller than X and b − a + 1 suffixes of T contain X as a prefix. If a > b, the range is said to be an empty SA range and X does not occur as a substring of T. Consider the sorted suffixes for the string mississippi$ shown in Figure 8.1(c). The query string iss has the SA range [4, 5], as there are three lexicographically smaller suffixes and exactly two suffixes are prefixed by iss. In our full-text dictionary, we will not construct the suffix array for T. Rather, we will use a compressed suffix array CSA of T. A compressed suffix array is a space efficient representation of both the string T and the suffix array for T. Many compressed suffix array implementations store a representation of T^BWT, the Burrows-Wheeler transform of T.
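The SA range in the example above can be verified mechanically with a short sketch (ours, for illustration; 1-based ranks as in the text):

```python
def sa_range(text: str, pattern: str):
    """1-based [a, b] such that suffixes ranked a..b are prefixed by pattern,
    or None if the range is empty."""
    suffixes = sorted(text[i:] for i in range(len(text)))
    ranks = [r for r, s in enumerate(suffixes, 1) if s.startswith(pattern)]
    return (ranks[0], ranks[-1]) if ranks else None

# Three suffixes precede the range, and two suffixes are prefixed by iss:
assert sa_range("mississippi$", "iss") == (4, 5)
assert sa_range("mississippi$", "xyz") is None
```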
Determining T^BWT for a string T can be thought of in this way: (i) create a conceptual matrix of all cyclic rotations of T, (ii) sort all rows of the matrix into lexicographic order, and (iii) output the last column of the conceptual matrix. This process is illustrated in Figure 8.1.

Figure 8.1: The Burrows-Wheeler transform of a string T = mississippi$ is T^BWT = ipssm$pissii. (a) List cyclic rotations. (b) Sort cyclic rotations. (c) Output last column.

Compressed suffix array implementations that rely on the Burrows-Wheeler transform make use of the so-called LF-mapping, which relates characters in the last column of the conceptual transform matrix to characters in the first column. Specifically, the i-th occurrence of a character c ∈ Σ in the last column corresponds to the i-th smallest suffix that begins with character c. For example, Figure 8.2(b) shows how the third and fourth occurrences of the character 'i' in T^BWT correspond to the third and fourth smallest suffixes that begin with character 'i'. These implementations search for a match of a pattern P[1..m] by first finding the suffix array range [sp, ep] for the string P[m − 1..m]. If [sp, ep] is not an empty range, then a new range is determined for the string P[m − 2..m], and so on. In this way, patterns are searched backwards and the search algorithm is appropriately called backward search [34]. An example of extending a match from the pattern 's' to the pattern 'is' is shown in Figure 8.2.
The idea is to find, within the current SA range, the first and last occurrence in T^BWT of the character that will extend the pattern (in this case 'i'). The LF-mapping can then be used to update the current SA range to point to all suffixes prefixed by the extended pattern ('is'). Knowledge of how this algorithm works is not necessary to understand our result; however, for details of the algorithm and related topics we refer the reader to the excellent review by Navarro and Mäkinen [88].

Figure 8.2: Performing backward search to find the SA range of the string 'is' from the SA range of the string 's', using T^BWT, the Burrows-Wheeler transform of text T. (a) The current match and SA range for 's'. (b) All occurrences of character 'i' in T^BWT within the current SA range are identified. (c) The LF-mapping is used to update the SA range to the new match 'is'.

Our full-text dictionary can be made to utilize any compressed suffix array that supports the LF-mapping, and thus backward search. However, we restrict our attention to an implementation based on the wavelet tree representation [46] of T^BWT, whose properties are exploited in Section 9.6 to further reduce the required space of the overall index we propose for text containing wildcards.

Lemma 24 (Mäkinen & Navarro [76]).
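The backward-search step can be sketched over a plain BWT string, with Python counting standing in for the O(log σ) wavelet-tree rank operation of an actual compressed index; SA ranges are 1-based as above.

```python
# Hedged sketch of backward search over an uncompressed BWT string.

def backward_search(bwt_str, P):
    """Return the SA range [sp, ep] of P, or None if P does not occur."""
    n = len(bwt_str)
    # C[c]: number of characters in the text strictly smaller than c.
    C = {c: sum(1 for x in bwt_str if x < c) for c in set(bwt_str)}
    sp, ep = 1, n                      # range of the empty pattern: all suffixes
    for c in reversed(P):              # extend the match one character leftward
        if c not in C:
            return None
        # rank_c(bwt, i): occurrences of c in bwt[1..i]
        rank = lambda i: bwt_str[:i].count(c)
        sp = C[c] + rank(sp - 1) + 1
        ep = C[c] + rank(ep)
        if sp > ep:
            return None
    return sp, ep

print(backward_search("ipssm$pissii", "is"))   # (4, 5), Figure 8.2's example
print(backward_search("ipssm$pissii", "iss"))  # (4, 5), the earlier example
```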
A compressed suffix array CSA, based on the wavelet tree of T^BWT, can be represented in nH_k(T) + o(n log σ) bits of space, for any k ≤ α log_σ n − 1 and 0 < α ≤ 1, such that the operation rank_c(T^BWT, i), which counts the occurrences of character c up to position i in T^BWT, and also the LF operation, are supported in O(log σ) time; the suffix array range of every suffix of a string X can be computed in O(|X| log σ) time; and each match of X in T can be reported in an additional O(log^{1+ε} n) time, for any ε > 0, where T is a text of length n over an alphabet of size σ.

In our full-text dictionary construction, we also make use of the following well known data structures.

Lemma 25 (Raman et al. [101]). A bit vector B of length n containing d 1-bits can be represented in d log(n/d) + O(d + n log log n / log n) bits to support the operations rank_1(B, i), giving the number of 1-bits appearing in B[1..i], and select_1(B, i), giving the position of the i-th 1-bit in B, in O(1) time.

Lemma 26 (Grossi & Vitter [45]). An array L of d integers, where ∑_{i=1}^{d} L[i] = n, can be represented in d(⌈lg(n/d)⌉ + 2 + o(1)) bits to support O(1) time access to any element.

Lemma 27 (Munro & Raman [85]). A sequence BP of d balanced parentheses can be represented in (2 + o(1))d bits of space to support the following operations in O(1) time: rank_((BP, i) and select_((BP, i), and similarly for right parentheses, as well as:

• findclose(BP, l) (findopen(BP, r)): index of the matching right (left) parenthesis for the left (right) parenthesis at position l (r)
• enclose(BP, i): indices (l, r) of the closest matching pair to enclose (i, findclose(BP, i)) if such a pair exists; undefined otherwise

The matching statistics of a string X with respect to T is an array ms of tuples such that ms[i] = (qi, [ai, bi]) states that the longest prefix of X[i..|X|] that matches anywhere in T has length qi and suffix array range [ai, bi]. Very recently Ohlebusch et al.
[91] showed that matching statistics can be efficiently computed with backward search if CSA is enhanced with auxiliary data structures using O(n) bits to represent so-called longest common prefix (lcp) intervals (cf. [91]). We leverage this result in the design of our compressed full-text dictionary and its search algorithm.

Lemma 28 (Ohlebusch et al. [91]). The matching statistics of a pattern X with respect to a text T over an alphabet of size σ can be computed in O(|X| log σ) time given a compressed enhanced suffix array of T.

8.3 Overview of the full-text dictionary

The data structure we propose in this chapter can be built from an already existing string that contains text segments as substrings, or from an ordered list of d text segments.

Suppose we are given a string T that contains d text segments. Specifically, let T = φ^{k1} T1 φ^{k2} T2 φ^{k3} T3 ... φ^{kd} Td φ^{kd+1} $ be a string over an alphabet Σ ∪ {φ}, followed by the traditional end-of-text sentinel $, having total length n. We define φ to be lexicographically smaller than any c ∈ Σ and $ to be lexicographically smaller than φ. We call the character φ a delimiter, and let φ^{ki} denote the i-th group, or run, of delimiter characters, having length ki ≥ 0, for 1 ≤ i ≤ d + 1. The string T is defined to contain exactly d text segments: maximal substrings that do not contain a delimiter character. By definition, text segments must be separated by a run of one or more delimiter characters. Therefore φ^{ki}, the delimiter group separating text segments Ti−1 and Ti, must have length ki > 0, for 1 < i ≤ d. In this case, the underlying ordered list of text segments is D = (T1, T2, ..., Td).

Suppose instead we are given an ordered list D = (T1, T2, ..., Td) of d text segments. Then we can construct a string T = T1 φ T2 φ ... Td $, a serialization of all text segments in D delimited by the character φ.
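The serialization just described is straightforward to sketch; here the character '%' plays the role of the delimiter φ, since '$' < '%' < 'a'..'z' in ASCII matches the required ordering $ < φ < Σ.

```python
# Round-trip sketch of the serialization, with '%' standing in for phi.

def serialize(D, delim="%"):
    """T = T1 . phi . T2 . phi ... Td . '$' for segments D = (T1, ..., Td)."""
    return delim.join(D) + "$"

def segments(T, delim="%"):
    """The text segments of T: maximal substrings (sentinel removed) that
    contain no delimiter; empty runs between delimiters are skipped."""
    return [s for s in T[:-1].split(delim) if s]

D = ["aa", "aca", "a", "aa", "cacc", "ac"]
print(serialize(D))                  # aa%aca%a%aa%cacc%ac$
print(segments(serialize(D)) == D)   # True
```

Note that segments() tolerates delimiter runs of any positive length, matching the general form of T above.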
In either case, we will begin our construction with a string T of length n containing the d text segments of D = (T1, T2, ..., Td), delimited by at least one φ character.

Figure 8.3: A compressed full-text dictionary for the ordered list of text segments (aa, aca, a, aa, cacc, ac). The first three columns give a conceptual representation of the full-text dictionary. The second column shows the sorted suffixes of the serialized string T = φaaφacaφaφaaφcaccφac$ representing the text segments. The third column contains the index i giving the sorted lexicographic rank of each suffix of T. The first column shows the SA ranges of the text segments and their containment relationship; each text segment SA range is labeled by a (lex id, segment) pair. The last three columns show the actual data structures used in the full-text dictionary representation: the ME array, which marks the end of one or more text segment SA ranges; the MB array, which marks the beginning of each text segment SA range; and the BP array, which represents the containment of text segment SA ranges (their tree topology). Three different queries (shaded intervals) are shown with their corresponding smallest enclosing text segment SA range (if any) marked in the BP array.

8.3.1 The lex id of text segments

Let pos(Ti) denote the starting position in T of the i-th text segment.
In our supported operations, we find it convenient to refer uniquely to each of the d text segments by a number in the range [1, d], called a lex id, which is based on the relative lexicographic order of the d suffixes of T that are prefixed by a text segment. Informally, if the suffix T[pos(Ti)..n] is the j-th lexicographically smallest of all d suffixes of T that are prefixed by a text segment, then Ti has lex id j. We can formalize this notion as follows. Let [ai, bi] be the SA range of T[pos(Ti)..n], the suffix of T beginning with the i-th text segment. Since all suffixes of T are unique, ai = bi, and ai ≠ aj whenever i ≠ j. The lex id of text segment Ti is j = |{ak | ak ≤ ai, 1 ≤ k ≤ d}|.

The lex id is defined in such a way that a range of matching lex ids can be returned by the prefix operation that our full-text dictionary will support. We will see the benefits of this when the full-text dictionary is used within the other data structures we develop in Chapters 9 and 10. An example full-text dictionary is given in Figure 8.3, showing the lex ids and SA ranges of six text segments. Note that in this example two text segments share a common string ('aa'), and therefore there are only five unique SA ranges.

Now that the form of the input data is clear, and the concept of lex ids of text segments has been formally defined, we can state the main result of this chapter. Throughout the chapter, we will assume^32 that a query pattern P is over the alphabet Σ.

Theorem 22.
A string T over alphabet Σ ∪ {φ} of length n, containing d, not necessarily distinct, text segments over alphabet Σ, can be represented by a compressed full-text dictionary F in |CSA| + O(n) + O(d log n) bits, to support the following operations given any query pattern P:

• dict_prefix(F, P): returns the (possibly empty) list of the occ1 lex ids of text segments that prefix P in O(|P| log σ + occ1) time,
• dict_match(F, P): returns the (possibly empty) list of the occ2 lex ids of text segments that are contained as substrings in P in O(|P| log σ + occ2) time,
• dict_count(F, P): returns the count of text segments that are contained as substrings in P in O(|P| log σ) time,
• prefix(F, P): returns the (possibly empty) range [lexid1, lexid2] of lex ids of text segments that are prefixed by P in O(|P| log σ) time,
• locate(F, P): returns the (possibly empty) list of the occ3 positions in T that are prefixed by P in O(|P| log σ + occ3 log^{1+ε} n) time, for any ε > 0,
• match_stats(F, P): returns the matching statistics {(qi, [ai, bi]) | 1 ≤ i ≤ |P|} of P with respect to T in O(|P| log σ) time,

where σ = |Σ ∪ {φ}| and |CSA| denotes the size of any compressed suffix array of T supporting the LF operation in O(log σ) time.

^32 This simplifies the discussion, though it is worth noting that any invalid character in a pattern P can be identified in O(|P|) time.

8.4 Components of the full-text dictionary

Before discussing how to perform any operations on a full-text dictionary F, we first describe its subsidiary data structures. We focus here on describing these data structures, leaving details of their construction to Section 8.6.

8.4.1 CSA: compressed enhanced suffix array

We first build CSA, the compressed suffix array for T, using |CSA| bits, and then enhance it to represent longest common prefix (lcp) intervals (cf. [91]) using an additional O(n) bits.
Analogous to the notation SA[i], we let CSA[i] denote the starting position of the i-th lexicographically smallest suffix of the text T.

8.4.2 The sa id identifier and RSA: conceptual tools

In Section 8.3.1 we formally defined the lex id as a unique identifier for every text segment in T. By definition, text segments are not necessarily distinct strings. Text segments that are the same string have the same SA range in T and also the same length. To simplify the discussion of the components of our data structure, and of the operations it supports, we introduce an internal identifier for text segments called the sa id. Intuitively, the sa id specifies a total ordering of all the unique SA ranges that are associated with text segments. We will give a formal definition, but first we prove the properties we need to group text segments. The following lemma formalizes two notions: (i) text segments representing different strings must have SA ranges that begin at different positions; and (ii) those representing the same string must have the same SA range and the same length.

Lemma 29. Let [a, b] and [c, d] be the non-empty suffix array ranges in CSA of a text segment Ti and a text segment Tj. Then a = c if and only if Ti = Tj and b = d.

Proof. Suppose a = c. Then Ti and Tj share a common prefix of length min(|Ti|, |Tj|). Since each text segment in T is followed by a character in {φ, $} and no text segment contains a character in {φ, $}, it must be the case that |Ti| = |Tj|. Therefore Ti = Tj, and b = d since identical strings must have the same SA range in T. Conversely, suppose Ti = Tj (and b = d). Identical strings must have the same SA range in T, and therefore a = c.

Let RSA = ([a1, b1], [a2, b2], ..., [ad′, bd′]) be the list of the d′ unique SA ranges of the d text segments, where 1 ≤ d′ ≤ d, ordered by the start position of the range (i.e., a1 < a2 < · · · < ad′).
(Note that by Lemma 29 each start position ai is distinct, for 1 ≤ i ≤ d′.) If a text segment has SA range [ai, bi] we say it has an sa id of i; i.e., the sa id specifies the relative order of text segment SA ranges when they are sorted by their start position. We will use the sa id of text segments as an index into some of the subsidiary data structures comprising F. Note that the value of any particular sa id will be in the range [1, d′], as there are exactly d′ distinct strings among the set of all d text segments. The list RSA is only conceptual. It simplifies our discussion of the components of F and of the operations performed on F; however, it does not need to be stored to perform queries. We will make use of it when constructing the data structure and give details in Section 8.6.

8.4.3 L: text segment lengths

Let L[i] be the length of text segments with sa id i. Since L stores at most d′ ≤ d lengths that sum to some value n′ ≤ n, it can be stored as a compressed integer array using O(d log n) bits by Lemma 26.

Lemma 30 summarizes how we can use the length and SA range of any text segment to determine whether it is a prefix of a given string X (and vice versa).

Lemma 30. Let [c, d] be the non-empty SA range of a string X. Then (i) text segments with SA range [aj, bj] (the j-th range in RSA) are a prefix of X if and only if aj ≤ c ≤ d ≤ bj and |X| ≥ L[j]. Similarly, (ii) X is a prefix of all text segments with SA range [aj, bj] if and only if c ≤ aj ≤ bj ≤ d.

Proof. Consider proposition (i). Let Tj be any text segment with SA range [aj, bj]; Tj therefore has length |Tj| = L[j]. Suppose that Tj is a prefix of X. Then it must be the case that |Tj| ≤ |X|. By definition, T[CSA[aj]..|T|] (T[CSA[bj]..|T|]) is lexicographically smaller (greater) than any other suffix of T prefixed by the string Tj; thus [aj, bj] must enclose [c, d], and we have aj ≤ c ≤ d ≤ bj. Next suppose aj ≤ c ≤ d ≤ bj and |Tj| ≤ |X|.
Since [aj, bj] encloses [c, d], the strings X and Tj must share a common prefix of length min(|X|, |Tj|). If [aj, bj] = [c, d] it could be the case that X is a proper prefix of Tj; however, since |X| ≥ |Tj| by supposition, X and Tj must share a common prefix of length at least |Tj|. Thus Tj is a prefix of X.

Consider proposition (ii). Suppose that X is a prefix of Tj. By definition, T[CSA[c]..|T|] (T[CSA[d]..|T|]) is lexicographically smaller (greater) than any other suffix of T prefixed by the string X; thus [c, d] must enclose [aj, bj], and we have c ≤ aj ≤ bj ≤ d. Next, suppose that c ≤ aj ≤ bj ≤ d. Since [c, d] encloses [aj, bj], X and Tj must share a common prefix of length min(|X|, |Tj|). Since the character following Tj is a special character in the set {φ, $}, Tj cannot be a proper prefix of X, and therefore |Tj| ≥ |X|. Thus Tj and X share a prefix of length |X|, making X a prefix of Tj.

8.4.4 LEX, MB, ME, E: text segment SA range representation

The following lemma formalizes the notion that text segments with the same sa id can be identified by a contiguous range of lex ids.

Lemma 31. If an SA range is common to k > 0 distinct text segments, then those text segments form a contiguous range of k lex ids.

Proof. If k = 1 the condition is trivially met. Suppose k > 1, and suppose, for contradiction, that the k text segments sharing the common SA range do not form a contiguous range of k lex ids. Recall that lex ids are assigned according to the lexicographic rank of all suffixes of T that are prefixed by a text segment. Let a and b be, respectively, the minimum and maximum lex id of the k text segments sharing the common SA range. By the assumption that the k lex ids do not form a contiguous range, b − a + 1 > k. Therefore there must exist a text segment with lex id c, a ≤ c ≤ b, that is not one of the k text segments sharing the common SA range.
By the definition of lex id, the text segment with lex id c is lexicographically equal to or larger than the text segment with lex id a; similarly, it is equal to or smaller than the text segment with lex id b. Since the text segments with lex ids a and b share a common SA range, and since SA ranges cannot cross, the text segment with lex id c must also share the same SA range, a contradiction.

We construct a simple array LEX to store, as a 2-tuple, the range of lex ids associated with each sa id. Formally, set LEX[i] = (j, k) if the text segments with lex ids j, j + 1, ..., k have the common sa id i. We can represent the d′ ≤ d entries in O(d log d) bits.

We next construct two bit vectors, MB and ME, each of length n, to mark the beginning and end, respectively, of the SA ranges in RSA. Formally, set MB[ai] = 1 and ME[bi] = 1, for 1 ≤ i ≤ d′; all other entries in the bit vectors have value 0. By Lemma 25, both can be represented in O(d log n) + o(n) bits. Since each SA range in RSA must begin at a distinct position, MB contains d′ bits set to 1 and can therefore be used to count the number of ranges in RSA that begin prior to some position p (i.e., cnt = rank_1(MB, p − 1)). However, ranges in RSA can end at the same position. Therefore the number of 1-bits in ME is some value d′′, 1 ≤ d′′ ≤ d′, and we cannot use ME directly to count the number of ranges that end prior to some position p. For this reason, we make use of an additional array E. Letting pi denote the position of the i-th 1-bit of ME, set E[i] = |{bj | bj ≤ pi, 1 ≤ j ≤ d′}|, for 1 ≤ i ≤ d′′. The array E keeps a cumulative count of the closed ranges up to and including the i-th bit marked 1 in ME. It can be stored in O(d log d) bits. This is sufficient information to complete our count query (i.e., cnt = E[rank_1(ME, p − 1)]).

8.4.5 BP: containment of text segment SA ranges

If a pattern P is prefixed by one or more text segments with SA range ri = [ai, bi], then it is also prefixed by text segments with SA ranges that enclose
ri. We seek a means to efficiently identify these SA ranges. We now show how to create a forest representing the containment relationship between text segment SA ranges. This forest is defined formally below, together with its balanced parenthesis representation; a straightforward algorithm for constructing the balanced parenthesis representation directly is presented in Section 8.6.

Consider a range ri = [ai, bi] from RSA with sa id i. We define the parent range of ri to be the smallest range rj from RSA, rj ≠ ri, that encloses it; we call sa id j the parent of sa id i. We say that ri is a child range of rj, and call sa id i a child of sa id j. If rj = [aj, bj] encloses ri = [ai, bi], then it must be the case that aj < ai and therefore j < i. When no range from RSA other than ri itself encloses ri, it has no parent range. Since SA ranges cannot cross, each range in RSA has at most one parent range. We can formally describe this relationship as a forest. We create a node with label i representing sa id i, for 1 ≤ i ≤ d′. We add an edge between nodes j and i, j < i, if and only if j is the parent of i. Since the label of a parent node is strictly less than the label of any of its children, the direction of the relationship is always clear.

Any forest can be represented by a sequence of balanced parentheses. In particular, begin with the lowest node label not yet visited and perform a pre-order traversal of the tree rooted at that node, outputting a left parenthesis when visiting a node for the first time, and a right parenthesis when returning to a node after processing its children. (Children are visited in ascending order of their label value.) This process is repeated for all unprocessed trees until the entire forest is represented as a sequence bp of balanced parentheses. Importantly, the i-th left parenthesis represents sa id i, for 1 ≤ i ≤ d′.
The sequence bp can be thought of intuitively in the following way. It has d′ left parentheses that denote the start positions of the SA ranges in RSA: the i-th left parenthesis denotes the start position of the i-th SA range, [ai, bi], in RSA, and its corresponding right parenthesis denotes the end position. Any other left parentheses between this pair denote SA ranges from RSA that are enclosed by [ai, bi]. Thus the sequence bp uses at most 2d′ ≤ 2d parentheses to fully capture the containment relationship among all SA ranges in RSA. We create BP, an indexed representation of bp, in order to support a number of useful operations. By Lemma 27, BP can be represented in O(d) bits.

8.4.6 CNT: count of text segment prefixes

As stated in the previous section, if a pattern P is prefixed by one or more text segments with SA range ri = [ai, bi], then it is also prefixed by text segments with SA ranges that enclose ri. To permit efficient counting queries, we create an array CNT of length d′, such that CNT[i] is the count of all text segments whose SA ranges enclose ri = [ai, bi] (inclusive of those with sa id i). The count for each entry can be determined by an in-order traversal of the forest represented by BP, using the LEX array to determine the ranges of lex ids (which yield the counts for each sa id). Specifically, the count of a child is the count of text segments with the child's sa id plus the count of its parent. As each entry sums to at most d, the array CNT of d′ ≤ d entries can be stored in O(d log d) bits.

8.4.7 Summary of full-text dictionary components

To aid in the discussion of supported operations, we list all subsidiary data structures comprising the full-text dictionary F in Table 8.1.
Symbol  Description                                                      Space (bits)
CSA     compressed enhanced suffix array of T                            |CSA| + O(n)
L       array storing the length of each text segment                    O(d log n)
LEX     array of lex id ranges for each text segment SA range            O(d log d)
MB      bit vector marking the beginning of text segment SA ranges       O(d log n) + o(n)
ME      bit vector marking the end of text segment SA ranges             O(d log n) + o(n)
E       array of cumulative counts of closed text segment SA ranges      O(d log d)
BP      balanced parentheses representation of SA range containment      O(d)
CNT     count of text segments that prefix each text segment SA range    O(d log d)

Table 8.1: Inventory of space usage for the data structures comprising a full-text dictionary for a string T of length n containing d text segments.

Combining the space for the subsidiary data structures, we have the following.

Lemma 32. A string T over alphabet Σ ∪ {φ} of length n, containing d, not necessarily distinct, text segments over alphabet Σ, can be represented by a compressed full-text dictionary F in |CSA| + O(n) + O(d log n) bits.

8.5 Using the full-text dictionary

We now describe how to support various operations on a full-text dictionary F using its subsidiary data structures as described in Section 8.4. Throughout this section, we assume that F is built for a string T of length n containing d text segments, and that queries are with respect to a query pattern P. To simplify the description of the operations, we again define the ordered list of the unique SA ranges of text segments as we did when describing the components of F. Let RSA = ([a1, b1], [a2, b2], ..., [ad′, bd′]) be the list of the d′ unique SA ranges of the d text segments, where 1 ≤ d′ ≤ d, ordered by the start position of the range. Note that the list RSA is only conceptual and need not be stored to perform queries.

8.5.1 Pre-processing the pattern

When performing any of the query operations, we first calculate the matching statistics of P.
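For concreteness, here is a naive quadratic sketch of what this pre-processing computes; the index instead obtains the same array in O(|P| log σ) time via Lemma 28. Positions and SA ranges are 1-based, and an unmatched position gets length 0 with the full range [1, n].

```python
# Naive matching statistics of X against a plain text T (illustration only).

def matching_statistics(X, T):
    """ms[i-1] = (q, (a, b)): the longest prefix of X[i..] occurring in T
    has length q and SA range [a, b]."""
    n = len(T)
    ms = []
    for i in range(1, len(X) + 1):
        q = 0
        while i + q <= len(X) and X[i - 1:i + q] in T:
            q += 1                        # greedily extend the match
        prefix = X[i - 1:i - 1 + q]
        a = 1 + sum(1 for j in range(n) if T[j:] < prefix)
        b = a - 1 + sum(1 for j in range(n) if T[j:].startswith(prefix))
        ms.append((q, (a, b)))
    return ms

print(matching_statistics("issi", "mississippi$"))
# [(4, (4, 5)), (3, (11, 12)), (2, (9, 10)), (1, (2, 5))]
```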
Recall that the matching statistics of P with respect to T is an array ms of tuples such that ms[i] = (qi, [ci, di]) states that the longest prefix of P[i..|P|] that matches anywhere in T has length qi and SA range [ci, di]. By Lemma 28 we can compute the matching statistics of P in time O(|P| log σ) using CSA.

8.5.2 Finding parent ranges and longest matches

We first develop some useful lemmas to simplify the description of our operations.

Lemma 33. Given any sa id i, the sa id of its parent can be determined in O(1) time. If sa id i has no parent, then 0 is returned in O(1) time.

Proof. Let li = select_((BP, i), which gives the position, in BP, of the left parenthesis for sa id i. Let (lj, rj) = enclose(BP, li). If the enclose operation returns undefined, then i has no parent and we return 0. Otherwise, lj is the position of the left parenthesis representing the parent of sa id i. We can determine the actual sa id value j = rank_((BP, lj), and return j. We perform a constant number of operations, all supported in O(1) time.

Lemma 34. Given a string X and its SA range [c, d], the value i, such that text segments with sa id i form the longest prefix match of X (of any text segment strings), can be returned in O(1) time. If there is no text segment that prefixes X, then i = 0 is returned in O(1) time.

Proof. We want to find the longest text segment string that prefixes X, if one exists. As a first candidate, we determine the maximum i such that [ai, bi] encloses [c, d]. Let b = rank_1(MB, c) and e = E[rank_1(ME, d − 1)]. These are, respectively, the number of SA ranges in RSA that begin up to and including position c, and the number that close prior to position d. Let i = b − e. Intuitively, i is the sa id (the position in RSA) of the last SA range that begins up to position c and does not close prior to position d. If i = 0 then X is not prefixed by a text segment and we return 0.
Otherwise, when i > 0, [ai, bi] is the smallest range in RSA to enclose [c, d]. By Lemma 30 we must ensure that |X| ≥ L[i]. If the condition is satisfied, we return i. Otherwise, when |X| < L[i], X is a proper prefix of text segments with sa id i, so they cannot be prefixes of X. However, if sa id i has a parent with sa id j, we know that its SA range [aj, bj] must enclose [ai, bi] and aj < ai (by Lemma 29). The range [aj, bj] must therefore enclose [c, d]. We also know that c > aj, since c ≥ ai. Therefore, if [aj, bj] exists, it represents text segments that are a proper prefix of X, and therefore L[j] < |X|. By Lemma 30, text segments with sa id j would be a prefix of X. We can find the parent of sa id i, if it exists, in O(1) time by Lemma 33. If it exists, we return its sa id; otherwise, we return 0. Overall, we performed a constant number of operations, all supported in O(1) time.

8.5.3 dict_prefix: report text segments that prefix P

With these results, we can now show how to implement the dict_prefix operation, which reports all text segments that prefix a query pattern.

Lemma 35. Given a pattern P and its matching statistics, ms, the dict_prefix operation can return the lex ids of all occ1 text segments that prefix P in O(1 + occ1) time.

Proof. Using ms[1] = (q1, [c1, d1]), by Lemma 34 we can find i, the sa id of the text segments with the longest prefix match to P[1..q1], in O(1) time. If i = 0 then there are no matches and we are done. Otherwise, we report the lex ids in the range LEX[i]. By Lemma 33, we can find j, the sa id of the parent of i, in O(1) time, if it exists. If i does not have a parent (j = 0), then we are done. Otherwise, we report the lex ids in the range LEX[j]. We repeat this procedure for subsequent parents, until we no longer find a parent. Note that in this case, we will have done at most occ1 + 1 operations to find a parent if there are occ1 overall prefix matches to report.
Therefore, the overall time^33 is O(1 + occ1).

8.5.4 dict_match: report text segments contained in P

Reporting all text segments that match in P can be achieved by reporting all matches for each suffix of P using the dict_prefix operation.

Lemma 36. Given a pattern P and its matching statistics, ms, the dict_match operation can return the lex ids of all occ2 text segments that are substrings of P in O(|P| + occ2) time.

Proof. We use the dict_prefix operation |P| times, once for each suffix of P, resulting in occ2 overall matches. The overall time required is O(|P| + occ2).

8.5.5 dict_count: counting text segments contained in P

Counting all text segments that match in P can be achieved by counting all matches for each suffix of P. This can be achieved by first identifying the sa id of the text segments that form the longest prefix match to a particular suffix of P and then looking up the count in the array CNT.

Lemma 37. Given a pattern P and its matching statistics, ms, the dict_count operation can count all occurrences of text segments that are substrings of P in O(|P|) time.

Proof. Using ms[i] = (qi, [ci, di]), by Lemma 34 we can find j, the sa id of the text segments with the longest prefix match to P[i..i+qi−1], in O(1) time. If j = 0 then there are no matches for this suffix. Otherwise, the count of matches is CNT[j]. We can sum the counts over all i, 1 ≤ i ≤ |P|, in O(|P|) time.

^33 If instead a succinct representation of the output giving ranges of lex ids is acceptable, then the overall reporting time can be bounded by O(min(|P|, occ1)), as we perform at most O(min(|P|, occ1)) operations to find a parent, since each parent is a proper prefix of its children.
8.5.6 prefix: report range of lex ids that prefix P

In contrast to the dict_prefix operation, which reports matches of text segments that prefix a pattern P, we now show how the range of lex ids of text segments that contain P as a prefix can be determined. Note that the technique used to implement this operation can be performed using any CSA based on the Burrows-Wheeler transform to determine lex ids of the text segments. No additional data structures other than the CSA are required. We will use this technique in subsequent chapters.

Lemma 38. Given a pattern P and its matching statistics, ms, the prefix operation can return the range of lex ids of text segments that contain P as a prefix in O(1) time.

Proof. We will use ms[1] = (q1, [c1, d1]). For every suffix of T that begins with a text segment, the corresponding character of T^BWT is either a φ or a $ character. We can use this property to determine the relative rank of every text segment contained in the SA range [c1, d1] by inspecting T^BWT[c1..d1]. If there are k total φ characters and exactly one $ character in T, which are lexicographically smaller than any other character in Σ, then the first k + 1 rows of the suffix array of T are for suffixes that begin with a special character in {φ, $}. Let t = rank_φ(T^BWT, k + 1) + rank_$(T^BWT, k + 1). Then the pair (lexid1 + 1, lexid2), where lexid1 = rank_φ(T^BWT, c1 − 1) + rank_$(T^BWT, c1 − 1) − t and lexid2 = rank_φ(T^BWT, d1) + rank_$(T^BWT, d1) − t, denotes the range of lex ids of text segments contained in the SA range [c1, d1].

8.5.7 locate: report positions in T containing P

Finding matches of P in the string T is already supported by the CSA data structure, giving us the following result.

Lemma 39. Given a pattern P and its matching statistics, ms, the locate operation can return the occ3 positions in T that contain P as a prefix in O(occ3 log^{1+ε} |T|) time.
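To make the counting argument behind this operation concrete, the following naive sketch recovers the lex id range directly from a plain BWT string; '%' stands in for the delimiter φ, plain counting stands in for rank queries, and the serialized string below is Figure 8.3's example.

```python
# Naive sketch of the prefix operation's counting technique: lex ids of
# text segments prefixed by P are recovered by counting delimiter
# characters in T^BWT up to the ends of P's SA range.

def bwt(T):
    return "".join(r[-1] for r in sorted(T[i:] + T[:i] for i in range(len(T))))

def prefix_lexid_range(T, P, delim="%"):
    """Return (lexid1, lexid2) for segments prefixed by P; an inverted pair
    (lexid1 > lexid2) means no segment is prefixed by P."""
    n = len(T)
    # SA range [c1, d1] of P, as the matching statistics would provide it.
    c1 = 1 + sum(1 for j in range(n) if T[j:] < P)
    d1 = c1 - 1 + sum(1 for j in range(n) if T[j:].startswith(P))
    B = bwt(T)
    special = lambda i: sum(1 for ch in B[:i] if ch in (delim, "$"))
    t = special(T.count(delim) + 1)   # delimiters among the first k+1 rows
    return special(c1 - 1) - t + 1, special(d1) - t

T = "%aa%aca%a%aa%cacc%ac$"           # Figure 8.3's serialized segments
print(prefix_lexid_range(T, "aa"))    # (2, 3): lex ids of the two 'aa' segments
```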
8.5.8 match stats: finding the matching statistics of P

Finding the matching statistics of P with respect to F is already supported by the enhanced CSA data structure, giving us the following result.

Lemma 40. Given a pattern P, the match stats operation can determine the matching statistics of P with respect to F in O(|P| log σ) time.

The proof of Theorem 22 follows from Lemmas 28, 32, 35, 36, 37, 38, 39 and 40.

8.6 Constructing the full-text dictionary

Construction of the overall full-text dictionary is straightforward and consists of the construction of subsidiary data structures, such as the compressed suffix array, and marking of SA ranges of text segments. However, construction of the BP index cannot occur until bp, the sequence of balanced parentheses representing the containment relationship of SA ranges, is known. Below, we elaborate on how one may construct this sequence, prior to creating its index.

Algorithm 3: Constructing the bp sequence
Input: RSA specifies the SA ranges of text segments in order of their beginning position
Output: The balanced parentheses array bp representing the containment relationship of text segment SA ranges
 1: initialize an empty stack S
 2: for i = 1 . . . d do
 3:   (sp, ep) ← RSA[i]
 4:   while S is not empty and sp > Top(S) do
 5:     print ')'
 6:     Pop(S)
 7:   print '('
 8:   Push(S, ep)
 9: while S is not empty do
10:   print ')'
11:   Pop(S)

We first construct the array RSA, described earlier, which is the sorted list of unique SA ranges of all text segments. For each text segment Ti ∈ D, 1 ≤ i ≤ d, we find its SA range and then append it into a temporary array t of length d. Finding the SA range for all text segments takes O(n log σ) time, as their combined length is O(n). The array t can then be sorted in O(d log n) time using the beginning position of each SA range as the sort key. The array RSA can then be determined from t.
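The stack-based procedure of Algorithm 3 can be transliterated directly; a minimal sketch, assuming RSA is supplied as a list of (sp, ep) pairs sorted by starting position:

```python
def build_bp(rsa):
    """Construct the balanced-parentheses sequence for a list of SA
    ranges sorted by starting position, as in Algorithm 3: an interval's
    closing parenthesis is emitted only after all contained intervals
    have been closed."""
    out, stack = [], []
    for sp, ep in rsa:
        # close every interval that ends before this one begins
        while stack and sp > stack[-1]:
            out.append(")")
            stack.pop()
        out.append("(")
        stack.append(ep)
    while stack:          # close any intervals still open
        out.append(")")
        stack.pop()
    return "".join(out)
```

For example, build_bp([(1, 10), (2, 4), (3, 3), (6, 8)]) returns "((())())": the range (1, 10) encloses a nested pair (2, 4) ⊇ (3, 3) and the sibling (6, 8).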
Moreover, by keeping the duplicate SA ranges in t, the lex ids can be determined and can be used to create other auxiliary data structures. Constructing the bp sequence, which represents the containment relationship of text segment SA ranges, is relatively straightforward, and a procedure is given in Algorithm 3. The text segment SA ranges are processed in increasing lexicographical order. The algorithm ensures that right parentheses of intervals are appended to the bp sequence only after any contained intervals have been closed with right parentheses. This is accomplished with the use of a stack and by comparison of previously computed SA ranges stored in RSA. The stack stores at most d integers from [1, n]; thus the algorithm requires O(d log n) bits of working space. It is straightforward to see that the algorithm and the construction of RSA can be accomplished in O(n log σ + d log n) overall time, using O(d log n) overall bits of temporary working space.

Chapter 9

Indexing text with wildcards

9.1 Introduction

We are interested in designing a compressed full-text index to answer a generalized version of the problem of aligning a pattern P of length m to a text T of length n, where T contains k wildcard positions that can match any character of P. Our motivation arises in the context of aligning short-read data produced by high-throughput sequencing technologies. Typically, short reads are aligned against a so-called reference genome; however, the number of positions known to differ between individuals due to single nucleotide polymorphisms (SNPs) is in the millions [39]. Therefore, one canonical reference sequence is not representative of an entire population of individuals. Modeling SNPs as wildcards would yield more informed, and by extension, more accurate alignment of short reads.
While our motivation is grounded in biological sequence alignment, the solutions we propose in this chapter are more generally applicable to any problem that benefits from indexing text containing wildcard characters. Cole, Gottlieb & Lewenstein [24] were among the first to study the problem of indexing text sequences containing wildcards and proposed an index using O(n log^k n) words of space, capable of answering queries in O(m + log^k n log log n + occ) time, where occ denotes the number of matching positions. This result was later improved by Lam et al. [66], resulting in space usage of only O(n) words and a query time no longer exponential in k. A key idea in their work was to build a type of dictionary of the text segments of T = T1 φ^{k1} T2 φ^{k2} . . . φ^{kd} T_{d+1}, where each text segment Ti contains no wildcards and φ^{ki} denotes the ith wildcard group of size ki ≥ 1, for 1 ≤ i ≤ d ≤ k. In their result, the query time includes the term γ = Σ_{i,j} prefix(P[i..|P|], Tj), where prefix(P[i..|P|], Tj) = 1 if Tj is a prefix of P[i..|P|] and 0 otherwise. Intuitively, γ is the number of occurrences of text segments within P. Despite this improvement in query time complexity, O(n) words of space can be prohibitive for texts as large as the Human genome. The use of dictionary matching of text segments within a pattern was also crucial in the approach of Tam et al. [124], who proposed the first succinct index, using (3 + o(1))n log σ bits. Using a compressed suffix array CSA, their space complexity can be reduced to 3|CSA| + O(d log n) bits.

Content from this chapter appears in the proceedings of the 22nd Annual Symposium on Combinatorial Pattern Matching (CPM 2011) [127] and the journal Theoretical Computer Science [128].
In our first contribution of this chapter, we show how to build on the full-text dictionary proposed in Chapter 8 to attain a succinct index using only 2|CSA| + O(n) + O(d log n) bits, while maintaining the same query time complexity as the index of Tam et al. [124]. However, in our view, the main challenge that must be overcome for successful wildcard matching is a reduction of the query working space. The fastest solution of Tam et al. [124] matches our query time, if modified to use the same subsidiary data structures we use, but requires a query working space of O(n log d + m log n) bits. Acknowledging that the first term is impractical for large texts, they give a slower solution that reduces the working space to O(n log σ + m log n) bits. This makes the solution feasible, but constraining, considering that p parallel queries necessarily increase the working space by a factor of p. A main contribution of this chapter is an algorithm that reduces the query working space significantly, to O(min(dm, γ log d) + m log n) bits. For our motivating problem, alignment of short reads to the Human genome (3 billion bases with 1-2 million SNPs), this reduces the working space by two orders of magnitude, from gigabytes to tens of megabytes. Finally, we show that by permitting an increase in worst-case query time, the space of the index can be reduced to only nH_k(T) + o(n log σ) + 2n + O(d log n) bits. Existing solutions store a compressed suffix array for T and another for its reverse. The key to the space reduction is the elimination of the reverse index by exploiting a method used for bidirectional search [113]. Independently and in parallel with this work, Hon et al. [49] showed an alternate approach to eliminating the reverse index. This decreases the overall text index space term to nH_k(T) + o(n log σ) + O(d log n) bits, with an increase in query time compared to the fastest solution presented here (which uses more space).
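The working-space comparison can be made concrete with a back-of-envelope calculation. All constants are dropped, and the parameter values below are illustrative, taken from the short-read scenario described in the text:

```python
import math

# Illustrative parameters: human genome with SNP positions as wildcards.
n = 3_000_000_000   # text length (bases)
d = 2_000_000       # wildcard (SNP) groups
m = 100             # short-read pattern length
sigma = 4           # DNA alphabet size

# Previous feasible working space, O(n log sigma + m log n) bits.
old_bits = n * math.log2(sigma) + m * math.log2(n)
# This chapter's bit-vector variant, O(dm + m log n) bits.
new_bits = d * m + m * math.log2(n)

print(f"{old_bits / 8 / 1e9:.2f} GB vs {new_bits / 8 / 1e6:.0f} MB")
# → 0.75 GB vs 25 MB
```

With constants omitted, this is only a rough indication, but it matches the gigabytes-to-tens-of-megabytes reduction claimed above; each additional parallel query pays the smaller figure rather than the larger one.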
While the construction and use of their index to achieve these bounds is quite technical, their query time is faster than that of our smallest index. However, the ideas presented here and in the work of Hon et al. are complementary and can be combined to improve the overall state of the art for this problem. Our results for indexing text with wildcards are summarized and compared with existing results in Table 9.1. Results are also summarized when ideas from this work are combined with those of Hon et al. [49]. Details for combining the approaches are given in Section 9.6. For a fair comparison, the results of Tam et al. [124] have been adjusted to use the same subsidiary data structures used by our index.

Index space | Query time | Query working space
O(n log^k n) words | O(m + log^k n log log n + occ) | O(1) [24]
O(n) words | O(m log n + γ + occ) | O(n) words [66]
3|CSA| + O(d log n) bits | O(m(t_LF + min(m, d̂) log d) + occ1 log^{1+ε} n + occ2 log d + γ) | O(n log d + m log n) bits [124]
3|CSA| + O(d log n) bits | O(m(t_LF + min(m, d̂) log d) + occ1 log^{1+ε} n + occ2 log d + γ log_σ d) | O(n log σ + m log n) bits [124]
2|CSA| + O(n) + O(d log n) bits | O(m(t_LF + min(m, d̂) log d / log log d) + occ1 log^{1+ε} n + occ2 log d / log log d + γ) | O(dm + m log n) bits †
2|CSA| + O(n) + O(d log n) bits | O(m(t_LF + min(m, d̂) log d / log log d) + occ1 log^{1+ε} n + occ2 log d / log log d + γ log γ) | O(γ log d + m log n) bits †
nH_k(T) + o(n log σ) + 2n + O(d log n) bits | O(m² log σ + m log n + min(m, d̂) log d / log log d + occ1 log^{1+ε} n + occ2 log d / log log d + γ) | O(dm + m log n) bits †
nH_k(T) + o(n log σ) + 2n + O(d log n) bits | O(m² log σ + m log n + min(m, d̂) log d / log log d + occ1 log^{1+ε} n + occ2 log d / log log d + γ log γ) | O(γ log d + m log n) bits †
nH_k(T) + o(n log σ) + O(d log n) bits | O(m log^{1+ε} n + m min(m, d̂) log d + occ1 log^{1+ε} n + occ2 log d + γ log γ) | O((γ + m) log n) bits [49]
nH_k(T) + o(n log σ) + O(d log n) bits | O(m log^{1+ε} n + m min(m, d̂) log d / log log d + occ1 log^{1+ε} n + occ2 log d / log log d + γ) | O(dm + m log n) bits ‡
nH_k(T) + o(n log σ) + O(d log n) bits | O(m log^{1+ε} n + m min(m, d̂) log d / log log d + occ1 log^{1+ε} n + occ2 log d / log log d + γ log γ) | O(γ log d + m log n) bits ‡

Table 9.1: A comparison of text indexes supporting wildcard characters in a text T over an alphabet of size σ containing d distinct groups of wildcards. |CSA| is the size of a subsidiary compressed suffix array implementation supporting rank queries in O(t_LF) time; d̂ is the number of distinct wildcard group lengths; occ1, occ2, occ are the number of occurrences containing no wildcard group, one wildcard group, and overall, respectively; γ = Σ_{i,j} prefix(P[i..|P|], Tj); † = our result; ‡ = our result combined with Hon et al. [49].

9.2 Preliminaries

Our wildcard matching algorithm makes use of an orthogonal range query data structure; specifically, it is an index for a set of two-dimensional points that can count and report the set of points contained inside a query rectangle.

Lemma 41 (Bose et al. [11]). A set of d points from the universe M = [1..d] × [1..d] can be represented in (1 + o(1)) d log d bits to support orthogonal range reporting in O((1 + occ) log d / log log d) time, where occ is the size of the output.

9.3 Overview of indexing text containing wildcards

Let T be a string over an alphabet Σ ∪ {φ} of size σ, where φ ∉ Σ and T[i] = φ if and only if position i is a wildcard position in T. In particular, we denote the structure of the input string as T = T1 φ^{k1} T2 φ^{k2} . . . φ^{kd} T_{d+1}, where each text segment Ti contains no wildcards and φ^{ki} denotes the ith wildcard group of size ki ≥ 1, for 1 ≤ i ≤ d. Our goal is to create an index for the purpose of identifying all the locations in T that exactly match any query pattern P, modulo wildcard positions.
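The decomposition T = T1 φ^{k1} · · · φ^{kd} T_{d+1} can be computed with a one-pass split; a minimal sketch (the function name is ours, and '*' stands in for φ, as in the figures of this chapter):

```python
import re

def parse_wildcard_text(T, wc="*"):
    """Split T into its text segments T1..T_{d+1} and the sizes
    k1..kd of the wildcard groups that separate them."""
    # segments are the maximal wildcard-free runs; a leading/trailing
    # wildcard group yields an empty segment, matching the formal structure
    segments = re.split(re.escape(wc) + "+", T)
    group_sizes = [len(g) for g in re.findall(re.escape(wc) + "+", T)]
    return segments, group_sizes
```

For example, parse_wildcard_text("aa**cg*t") returns (['aa', 'cg', 't'], [2, 1]): two text segments separated by a wildcard group of size 2, and a third separated by a group of size 1.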
Similar to previous approaches [66, 124], we classify a match into one of three cases: the match of P contains no wildcard group (Type 1), the match of P contains exactly (some portion of) one wildcard group (Type 2), or the match of P contains more than one wildcard group (Type 3). See Figure 9.1 for examples of each of the three types of matches. Our solution for Type 2 matching is largely inspired by previous approaches [66, 124], so we give an overview of the approach but omit the details. Our algorithm for Type 3 matching is novel and can result in significantly reduced working space.

Figure 9.1: The three cases to consider when matching a pattern to a text with wildcards: (a) Type 1, (b) Type 2, (c) Type 3. Here and throughout this chapter, we will illustrate the wildcard character as '*'.

9.4 Components of the text with wildcards index

Before detailing how we implement the three types of matching algorithms, we first give an overview of the subsidiary data structures that comprise our overall full-text index for text containing wildcards. The space complexity given for each component assumes an input text T of length n having d groups of k overall wildcards. We will use |CSA| to denote the size (in bits) of a compressed suffix array for T.

9.4.1 F, R: indexing the text

We first build F, the compressed full-text dictionary of Chapter 8, for T. Among other operations, we will make use of dict prefix, which reports all text segments from T that are a prefix of a query pattern P. From Theorem 22, F requires |CSA| + O(n) + O(d log n) bits of space. We also construct a compressed suffix array R for the reverse of T. The space required for R is |CSA| bits. Note that R does not need to support location reporting, as our full-text dictionary F already does.
9.4.2 lex ids, rlex ids, and Π: text segment identifiers

By design, many of the operations supported by the full-text dictionary F report the match of a text segment Tj using its lex id, which is Tj's lexicographic rank among all text segments in T. We will also need to know the reverse lexicographic rank of each text segment (i.e., its rank when the reverses of the text segments are sorted lexicographically). The reverse lexicographic rank, or rlex id, of a text segment Tj is its lexicographic rank in R, the compressed suffix array for the reverse of T. The rlex ids for each text segment with respect to R can be determined in the same manner they are determined for F, by the prefix operation described in Section 8.5.6. In addition, we also need to know the relative position of a text segment in T among all text segments in T. Therefore, we store a permutation Π_{F→P} mapping the lex ids of text segments with respect to the forward index to their relative position order in T. For instance, if Tj has lex id k, then Π_{F→P}[k] = j. We do the same for mapping the rlex ids of text segments with respect to the reverse index to their relative position in T, by creating the mapping Π_{R→P}. The space required to store Π_{F→P} and Π_{R→P} is O(d log n) bits. We will use the relative position order as the main index into the other subsidiary data structures of our index.

9.4.3 RSA, RSA': storing SA ranges

We construct an array RSA of length d + 1 to store the SA ranges of each text segment with respect to F. For instance, RSA[j] specifies the SA range for text segment Tj. Similarly, we construct an array RSA' of length d + 1 to store the SA ranges of the reverse of each text segment with respect to R. For instance, RSA'[j] specifies the SA range for the reverse of text segment Tj. Both arrays can be stored in O(d log n) bits.
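The identifier maps above can be computed naively by sorting; a small illustrative sketch with 1-based ids, as in the text (the function name and dictionary representation are ours, while the compressed index derives the same ranks without materializing the segments):

```python
def segment_identifiers(segments):
    """Compute the permutations Pi_{F->P} and Pi_{R->P}: for each
    lex id (rank among segments) and rlex id (rank among reversed
    segments), the relative position order of that segment in T."""
    idx = range(len(segments))
    # lex order: segments sorted forward; rlex order: reversals sorted
    lex_order = sorted(idx, key=lambda j: segments[j])
    rlex_order = sorted(idx, key=lambda j: segments[j][::-1])
    pi_f2p = {k + 1: j + 1 for k, j in enumerate(lex_order)}
    pi_r2p = {k + 1: j + 1 for k, j in enumerate(rlex_order)}
    return pi_f2p, pi_r2p
```

For segments ["ba", "ab", "b"], the forward sorted order is "ab", "b", "ba", so Pi_{F->P} = {1: 2, 2: 3, 3: 1}; sorting the reversals "ab", "ba", "b" gives Pi_{R->P} = {1: 1, 2: 3, 3: 2}.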
9.4.4 LEN, POS, WCS: auxiliary arrays

We find it convenient to store additional information for each text segment in auxiliary arrays, indexed by the relative position order. We store the length of each text segment in an array LEN. Note that the array LEN of the dictionary construction can be adapted to store lengths in this relative position order with the use of Π. We store the beginning position of a text segment in T (i.e., its offset from the beginning of the string T) using the array POS. We store the size of the preceding wildcard group in the array WCS. Note that all arrays have length d + 1 and overall require O(d log n) bits of space to support constant-time access.

9.4.5 RQ: supporting range queries

We approach Type 2 matching no differently than previous approaches [66, 124], by employing a number of 2D orthogonal range query data structures. Specifically, we will create a data structure RQ_i for each 1 ≤ i ≤ d̂, where d̂ is the number of unique lengths of wildcard groups that separate adjacent text segments. We will add a 2D point (i, j) into the data structure RQ_k if and only if the text segment with rlex id i is followed in T by the text segment with lex id j and they are separated by exactly k wildcard characters. For example, in Figure 9.1b, the two text segments that form the match with the pattern P would be represented by a point in the data structure RQ_1, as they are separated by a wildcard group having length 1. We use the data structure of Lemma 41 for each RQ_i. As there are exactly d total points added among all RQ_i, the total space required is O(d log d) bits.

9.4.6 Summary of the components

To aid in the discussion of supported operations, we list all subsidiary data structures that comprise our index for text containing wildcards in Table 9.2. Combining the space of the subsidiary data structures, we have the following.

Lemma 42.
Given a text T of length n containing d groups of wildcards, the combined space required for the above indexes is 2|CSA| + O(n) + O(d log n) bits.

9.5 Matching in text with wildcards

We now outline the various operations supported by our index. In particular, we give details on how each of the three types of matches can be determined.

Symbol | Description | Space (bits)
F | compressed full-text dictionary of T | |CSA| + O(n) + O(d log n)
R | compressed suffix array of the reverse of T | |CSA|
Π_{F→P} | mapping from lex id to relative position order of text segments | O(d log n)
Π_{R→P} | mapping from rlex id to relative position order of text segments | O(d log n)
RSA | SA ranges for each Tj w.r.t. F | O(d log n)
RSA' | SA ranges for each reversed Tj w.r.t. R | O(d log n)
LEN | length of each text segment | O(d log n)
POS | beginning position in T of each text segment | O(d log n)
WCS | size of preceding wildcard group of each text segment | O(d log n)
RQ_i | 2D point data structure relating the rlex id of a text segment to the lex id of the text segment that follows it in T, when the two are separated by a wildcard group of length i, for 1 ≤ i ≤ d̂ | O(d log d)

Table 9.2: Inventory of space usage for data structures comprising an index for a text T of length n containing d groups of wildcards, where d̂ denotes the number of unique lengths of wildcard groups separating text segments.

9.5.1 Pre-processing the pattern

All three matching types make use of the matching statistics of P with respect to F. Type 2 and Type 3 matching also make use of the SA ranges for each suffix of P with respect to R. Both can be computed in O(m log σ) time (by Lemmas 24 and 28) and require O(m log n) bits to store. We incorporate these time and working space complexities into the results for each type.

9.5.2 type1 match: finding all type 1 matches of P

Type 1 matching corresponds exactly to the traditional pattern matching problem, where we must locate positions in T that contain P as a substring.
Therefore, as our full-text dictionary F supports the locate operation, type 1 matches can be identified in the time bounds specified in Theorem 22.

Lemma 43. All occ1 Type 1 matches can be reported using O(m log n) bits of working space in O(m log σ + occ1 log^{1+ε} n) time, for ε > 0.

9.5.3 type2 match: finding all type 2 matches of P

A Type 2 match occurs when the alignment of P to T contains exactly (a portion of) one wildcard group. Our solution for Type 2 matching is the same as previous approaches [66, 124]. Therefore, we outline the high-level idea of the approach for completeness, but omit details. First suppose that a match of P aligns with two text segments (and thus properly contains one wildcard group). Then we seek a pair of neighbouring text segments Tj and Tj+1, separated by a wildcard group of size kj, where P[i..|P|] aligns to the first |P| − i + 1 characters of Tj+1—referred to as the suffix match (of P)—and P[1..i − 1 − kj] aligns to the last i − 1 − kj characters of Tj—referred to as the prefix match:

· · · Tj φ · · · φ Tj+1 · · ·

By construction, the data structure RQ_{kj} will contain a point (p, q) if and only if the text segment with rlex id p is followed in T by the text segment with lex id q. For a fixed suffix P[i..|P|] and wildcard group length kj, our strategy will be to (i) find all potential suffix matches and record their lex ids, (ii) find all potential prefix matches and record their rlex ids, (iii) determine which candidate prefix matches are adjacent to a candidate suffix match in T, and (iv) report the matching text segments forming a match with P. Using F, prefix(F, P[i..|P|]) will return the range of lex ids, [s1, s2], of the candidate suffix matches, completing step (i). By using the same technique described in Section 8.5.6, we can determine the range of rlex ids, [r1, r2], of the candidate prefix matches of P[1..i − 1 − kj] with respect to R, completing step (ii).
Next, we can determine all pairs of prefix and suffix candidates that are adjacent in T and separated by a wildcard group of length kj, by determining all occ2 points (xl, yl), for 1 ≤ l ≤ occ2, that are in the query rectangle [r1, r2] × [s1, s2], completing step (iii). Since xl, for 1 ≤ l ≤ occ2, gives the rlex id of the text segment that forms a match with a prefix of P, using Π_{R→P} we can determine the relative position of the matching text segments in T. Using POS, the actual position within T of each match can be reported, completing step (iv). In general, we can repeat the above procedure for every combination of suffix length and wildcard group length bounded by m. However, as pointed out by Tam et al. [124], the number of distinct wildcard group sizes d̂ is often a small constant, particularly in genomic sequences. We therefore only consider at most d̂ lengths, provided they are not larger than m. Handling Type 2 matches that end or begin a match within a wildcard group is similar to the procedure outlined above. However, when we determine matches where the first k characters of P match within a wildcard group, we can query all RQ_l, for each k ≤ l ≤ m (or for each of the d̂ unique lengths if they are between length k and m), using [1, d] as the range of rlex ids, in conjunction with the actual range of lex ids for the remaining suffix of P. This essentially performs a two-sided orthogonal range query. The analogous action can be taken when determining matches that end within a wildcard group.

Lemma 44. All Type 2 matches can be reported using O(m log n) bits of working space in O(m(log σ + min(m, d̂) log d / log log d) + occ2 log d / log log d) time.

9.5.4 type3 match: finding all type 3 matches of P

Type 3 matches contain at least (portions of) two wildcard groups and therefore must fully contain at least one text segment.
The general idea in previous approaches, and in this chapter, is to consider this case as an extension of the dictionary matching problem: text segments contained within P are candidate positions, but we must verify whether they can be extended to a full match of P.

Algorithm 4: Report Type 3 matches
Input: a string P of length m, SA ranges for all suffixes of P w.r.t. F, SA ranges for all suffixes of P w.r.t. R
Output: positions in T forming a Type 3 match with P, modulo wildcard positions
 1: for i = m to 1 do
 2:   for each lex id returned by dict prefix(F, P[i..m]) do
 3:     j ← Π_{F→P}[lex id]
 4:     [ap, bp] ← SA range of P[1..i − 1] w.r.t. R
 5:     [as, bs] ← SA range of P[i + LEN[j] + WCS[j + 1]..m] w.r.t. F
 6:     [cp, dp], [cs, ds] ← RSA'[j − 1], RSA[j + 1]
 7:     if j = d + 1 or LEN[j] + WCS[j + 1] ≥ m − i + 1 then  // Case 1: P does not contain Tj+1
 8:       if LEN[j] + WCS[j + 1] ≥ m − i + 1 or [as, bs] encloses [cs, ds] then  // Case 1: suffix condition satisfied
 9:         if j = 1 or LEN[j − 1] + WCS[j] > i − 1 then  // Case 1a: P does not contain Tj−1
10:           if WCS[j] ≥ i − 1 or [ap, bp] encloses [cp, dp] then  // Case 1a: prefix condition satisfied
11:             print match at position POS[j] − i + 1
12:         else  // Case 1b: P must contain Tj−1
13:           set bit j of W[i] to 1
14:     else  // Case 2: P must contain Tj+1
15:       if bit j + 1 of W[i + LEN[j] + WCS[j + 1]] is set to 1 then  // Case 2: suffix condition is satisfied
16:         if j = 1 or LEN[j − 1] + WCS[j] > i − 1 then  // Case 2a: P does not contain Tj−1
17:           if WCS[j] ≥ i − 1 or [ap, bp] encloses [cp, dp] then  // Case 2a: prefix condition satisfied
18:             print match at position POS[j] − i + 1
19:         else  // Case 2b: P must contain Tj−1
20:           set bit j of W[i] to 1

However, we execute this idea in an altogether novel manner by proposing a dynamic programming algorithm that can greatly reduce the working space over existing approaches.
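For intuition, the semantics that Algorithm 4 computes can be stated as a brute-force reference. The sketch below is only the specification, not the space-efficient dynamic program: it scans every alignment of P against T, whereas Algorithm 4 touches only the γ dictionary matches. Positions here are 0-based, and '*' stands in for φ.

```python
import re

def type3_matches_bruteforce(T, P, wc="*"):
    """Report all 0-based positions of T at which P matches, modulo
    wildcards, with the alignment overlapping (portions of) at least
    two wildcard groups, i.e. at least two maximal runs of wc fall
    inside the aligned window."""
    out = []
    for s in range(len(T) - len(P) + 1):
        window = T[s:s + len(P)]
        # every position must be a wildcard or agree with P
        if all(t == wc or t == p for t, p in zip(window, P)):
            # a Type 3 match overlaps two or more wildcard groups
            if len(re.findall(re.escape(wc) + "+", window)) >= 2:
                out.append(s)
    return out
```

For example, type3_matches_bruteforce("ab*c*de", "abxcyd") returns [0]: the alignment at position 0 crosses both wildcard groups, with 'x' and 'y' absorbed by the wildcards.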
Before giving details, we note that there is a trade-off in working space and query time of our dynamic programming algorithm, depending on whether the dynamic programming table W (employed by the algorithm) uses bit vectors or sorted lists. W requires O(dm) bits in the former case and O(γ log d + m log γ) bits in the latter, where γ is the number of dictionary matches contained in the query pattern. If sorted lists are used, there is a log γ factor slowdown in the query time compared to using bit vectors. While in practice we expect γ to be small, in theory it can be as large as O(dm), and therefore using bit vectors in those cases would result in a faster query algorithm and smaller working space. Fortunately, our approach allows us to perform a counting query to first determine the size of γ. Since we have available F, the full-text dictionary of T as described in Chapter 8, we can determine γ = dict count(F, P) in O(|P| log σ) time. Thus, we can always ensure the working space is O(min(γ log d + m log γ, dm)). The complete details of our approach are given in Algorithm 4. We now highlight the main idea and give the intuition behind its correctness. First, suppose that text segment Tj matches in P starting at position i. Consider the conditions that must be satisfied to confirm that this match can be extended to a complete match of P in T. We must verify that (i) P[i + LEN[j]..|P|] can be matched to the text following Tj in T—referred to as the suffix condition—and (ii) P[1..i − 1] can be matched to the text preceding Tj in T—referred to as the prefix condition. If both conditions are verified, we can report that P matches T beginning at position POS[j] − i + 1:

· · · φ Tj−1 φ · · · φ Tj φ · · · φ Tj+1 φ · · ·

For working space, we make use of an array W containing m entries, one for each suffix of P. We describe the algorithm assuming the use of bit vectors, and comment in the proof on the effect of using sorted lists.
Each of the m entries of W contains a bit vector of d + 1 bits (one for each text segment), with all entries set to zero using the constant-time initialization technique [12]. During the course of the algorithm, the jth bit of W[i] is set to 1 if and only if the suffix condition is true for P[i..m] with respect to Tj. Essentially, this entry would mean that the string P[i..m] matches T[POS[j]..n] as a prefix. There are exactly m stages of the algorithm, corresponding to the m suffixes of P. Each stage i considers a longer suffix of P (i = m, m − 1, . . . , 1). In a given stage i, we consider each text segment Tj found to be a prefix of the ith suffix of P. This can be accomplished using the dict prefix operation of F. To verify the prefix and suffix conditions for Tj, we first consider (line 7 of Algorithm 4): will P[i + LEN[j]..m] need to fully contain the next text segment Tj+1 in order to match in T? This breaks our algorithm into the two main cases. If the match will not fully contain Tj+1 (Case 1), we verify the suffix condition by checking whether P[i + LEN[j]..m] is compatible with the wildcard group to its right and the prefix of Tj+1 to which it must align (line 8). If the suffix condition is satisfied, we consider (line 9): will P[1..i − 1] need to fully contain the previous text segment Tj−1 in order to match in T? If it does not need to fully contain Tj−1 (Case 1a), we verify the prefix condition by checking that P[1..i − 1] is compatible with the wildcard group to its left and the suffix of Tj−1 to which it must align (line 10). If indeed the prefix condition is satisfied, we output a match (line 11). If it does need to fully contain Tj−1 (Case 1b), we set the jth bit of entry W[i] to 1, to indicate that a suffix condition holds for P[i..m] with respect to Tj (line 13).
The key idea here is that we only attempt to verify the prefix condition when Tj would be the last text segment to occur in P (i.e., Case 1a); if not (Case 1b), we record information in W stating that we currently have a partial match, but for it to remain viable, Tj−1 should be a suffix of P[1..i − WCS[j] − 1], which can be verified in a future stage of the algorithm. Case 2 occurs when P must contain the next text segment Tj+1 to satisfy the suffix condition (lines 14-20). Since stages of the algorithm proceed for longer suffixes of P, and thus decreasing values of i, the suffix condition would have been previously checked and, if satisfied, bit j + 1 of W[i + LEN[j] + WCS[j + 1]] would be set to 1. The remaining questions are answered as before: the prefix condition is verified if possible, and otherwise successful partial matches are again recorded in W.

Lemma 45. All Type 3 matches can be reported in O(m log σ + γ) time using O(dm + m log n) bits of working space, or in O(m log σ + γ log γ) time using O(γ log d + m log n) bits of working space.

Proof. Recall that the algorithm proceeds in m stages, for decreasing i = m, . . . , 1, one for each suffix of P. It is clear in the algorithm description that verification of a match of Tj proceeds by first ensuring the suffix condition can be satisfied (Case 1: P does not contain Tj+1) or ensuring it was previously satisfied (Case 2: P must contain Tj+1), and then verifying the prefix condition in the cases where P does not contain Tj−1 (Cases 1a, 2a), reporting a match when verified, or by instead marking W to signify a partial match, expecting the match to be continued by a match of Tj−1 in a future stage (Cases 1b, 2b). The correctness relies on showing that W is set correctly to confirm the satisfaction of the suffix condition for the next text segment (Tj−1) at a future time step. We show correctness by induction on i. Consider the base case, when we are in stage i = m.
All candidate text segments Tj fall into Case 1, which (importantly) does not rely on the correctness of previous stages of the algorithm. The suffix condition is trivially true. The prefix condition is split into two cases. The first case (Case 1a) is when a successful match of P will not contain Tj−1. This can be verified by checking whether the appropriate prefix of P is a suffix of Tj−1, unless the prefix of P is fully matched by the preceding wildcard group. If the prefix of P matches, both conditions have been satisfied and we have an overall match that can be reported. If P must fully contain Tj−1 for a successful match, then bit j of W[i] is set to denote that the suffix condition of P[i..m] is satisfied with respect to Tj. Now assume we are in some stage i and the algorithm is correct for all shorter suffixes (i.e., stages i + 1, . . . , m). Case 1 is handled as before and does not rely on the correctness of previous stages, so assume we are in Case 2 (P must contain Tj+1). Then, if the suffix condition is satisfied, bit j + 1 of W[i + LEN[j] + WCS[j + 1]] should be set to 1. This bit would have been set in an earlier stage t > i, and we have assumed the algorithm is correct for earlier stages (i.e., i + 1, . . . , m). Therefore, it must be the case that the suffix condition for Tj is satisfied if and only if W[i + LEN[j] + WCS[j + 1]] has bit j + 1 set to 1. As before, if the suffix condition is satisfied, we can attempt to verify the prefix condition when P does not contain Tj−1, or record the partial match in W when P must contain Tj−1. This completes the correctness proof. We now consider the additional runtime and working space incurred for Type 3 matching. There are γ candidate positions overall, which can be reported in O(m log σ + γ) time by Theorem 22. Each candidate is processed once, in O(1) time when using bit vectors. The array W occupies O(dm) bits of working space.
Thus, the overall time complexity is O(m log σ + γ) and the working space is O(dm + m log n) bits when using bit vectors. Alternatively, one could maintain a sorted list of text segment position ids for the m entries of W instead of a bit vector, with m pointers marking the head of each list. Since there are γ total entries of matching text segment position ids in all lists, and a text segment position id can be uniquely identified with log d + 1 bits, the total space to store the sorted ids is O(γ log d) bits. Inserting an entry into a list or querying an entry takes at most O(log γ) time, compared with O(1) time when using bit vectors. We note that the m pointers to the heads of the sorted lists use no more than O(m log γ) bits and are therefore absorbed into the O(m log n) term denoting the space to store the suffix array ranges. Combining the results for the three types of matching, we arrive at our main result.

Theorem 23. A text T of length n containing d groups of wildcards can be represented in 2|CSA| + O(n) + O(d log n) bits to support the following operations given any query pattern P:

• type1_match(P): returns all occ1 positions in T that match P, using no wildcard groups, in O(m log σ + occ1 log^{1+ε} n) time and O(m log n) bits of working space,

• type2_match(P): returns all occ2 positions in T that match P, using (some portion of) one wildcard group, in O(m(log σ + min(m, d̂) · log d / log log d) + occ2 · log d / log log d) time and O(m log n) bits of working space,

• type3_match(P): returns all occ3 positions in T that match P, using (some portion of) two or more wildcard groups, in O(m log σ + γ) time and O(m log n + dm) bits of working space, or O(m log σ + γ log γ) time and O(m log n + γ log d) bits of working space,

where σ = |Σ ∪ {φ}|, |CSA| denotes the size of any compressed suffix array of T supporting the LF operation in O(log σ) time, d̂ is the number of unique wildcard group lengths, γ is the number of text segments that match in P, and ε > 0.

9.6 Less haste, less waste: reducing the space further

Figure 9.2: Shown is a compressed suffix array for a text T = φaaφacaφaφaaφcaccφac and a compressed suffix array for the reverse of T. The shaded intervals denote the SA range of a query aφ in the forward index and the corresponding SA range of φa in the reverse index. Using backward search, the SA range in the forward index can be updated for the pattern aaφ, and by leveraging information in T^BWT the corresponding SA range for φaa can be updated in the reverse index. Both new SA ranges are shown demarcated with arrows. See the text for details.

Letting |CSA| denote the size of a subsidiary compressed suffix array, our index requires 2|CSA| + O(n) + O(d log n) bits in comparison to that of Tam et al. [124], the first succinct index for this problem, which requires 3|CSA| + O(d log n) bits. For alphabets such as proteins (σ = 20) or larger, this can result in a substantially smaller index. However, for small alphabets such as DNA (σ = 4), the O(n) term becomes quite significant. This term arises from the need to store auxiliary data structures for determining lcp parent intervals when computing matching statistics of a query string. Ohlebusch and Gog [90] proposed a solution that computes parent intervals in constant time (for σ = O(1)) and has been demonstrated to use between 3n and 5n bits in practice [91]. This approach would ensure no slowdown in query time at the expense of a larger index compared to that of Tam et al. for the DNA alphabet. Using a solution by Fischer et al. [35], we can store the necessary lcp information using at most 2n + o(n) bits.
This would yield an index of roughly the same size as that of Tam et al. when σ = 4; however, it incurs a logarithmic slowdown (in n) when computing parent intervals. Specifically, the time to pre-process the pattern becomes O(m log n), as at most m parent intervals must be computed for the m suffixes of P. In either case, both our index and that of Tam et al. store a compressed suffix array for both the text and its reverse. An interesting question is whether we can eliminate the suffix array of the reverse text. Doing so would lead to a substantial space reduction, regardless of alphabet size. We now show that this question can be answered in the affirmative. First, consider how the reverse index is used. In order to determine if some prefix P[1..i] of a pattern P is a suffix of a text segment, a compressed suffix array R of the reverse text is searched using the reverse of P[1..i] as the query (cf. Section 9.4.1). The resulting matches form a contiguous interval in the reverse index. This property allows for easy verification of a partial match given a suffix array range of a query (in the reverse index) and is the basis for the orthogonal range query data structure relating the forward and reverse lexicographic order of text segments. Note that if the reverse of P[1..i] has a non-empty SA range [a, b] in R, then there is a non-empty SA range [c, d] in F for the query P[1..i], and d − c = b − a. Recently, Schnattinger et al. [113] demonstrated that with the use of a compressed suffix array based on a wavelet tree, one can perform bidirectional search. Specific to our example, by performing an incremental backward search of the query P[1..i] in F, the SA range for the reverse of P[1..i] in R can also be updated incrementally, without performing any queries on R. Since this idea is central to our space reduction, we now give the intuition of the method; the reader is referred to Schnattinger et al. [113] for the details.
Shown in Figure 9.2 are two compressed suffix arrays: one for the text T = φaaφacaφaφaaφcaccφac (the forward index F) and one for the reverse of T (the reverse index R). Suppose we wish to locate the SA range of a query string X = aaφ in F and the corresponding SA range of the reverse of X in R. The shaded regions represent the SA range matching the suffix X[2..|X|] in F and the corresponding match of the reverse of X[2..|X|] in R. Given the SA range of X[2..|X|], the SA range of X in F, shown demarcated by arrows, can be determined by backward search in the usual manner, as we are prefixing the currently matched pattern by one character. However, finding the SA range of the reverse of X in R requires suffixing the currently matched pattern by one character. Therefore, the SA range of the reverse of X[2..|X|] must contain the SA range of the reverse of X in R. Since all suffixes of the reverse text occur in sorted order in R, it follows that if we knew (i) how many suffixes of the reverse text, prefixed by the reverse of X[2..|X|], were lexicographically less than the reverse of X, and (ii) how many were lexicographically greater, then we could exactly determine the correct SA range of the reverse of X in R. Schnattinger et al. [113] demonstrated that we can answer both questions by exploiting the relationship of the T^BWT string to F. Continuing with our example, for any character α in the shaded region of T^BWT, there must exist a suffix of T prefixed by αX[2..m]; furthermore, the reverse of αX[2..m] must prefix some suffix of the reverse text. Therefore, to determine the number of suffixes of the reverse text, prefixed by φa, that are lexicographically less than φaa, we can simply count the number of characters less than a in the shaded region of T^BWT. In this case, one character (φ) is less than a. Similarly, we discover one character (c) is lexicographically larger than a. Given these values, we can update the SA range in R to the interval demarcated by the arrows. Importantly, it was not necessary to perform any query on R to determine the correct SA range in R.
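The counting argument above can be sketched in code. The following toy implementation is ours, not the thesis's data structures: it uses a naive suffix array and a plain string in place of a wavelet tree, and all names are illustrative. It extends a pattern on the left with a standard backward-search step on the forward index, while updating the reverse index's interval purely by counting characters inside the forward BWT slice, exactly as in the example.

```python
# Toy demonstration of the bidirectional update: extend pattern X to cX on the
# forward index, and shift the reverse index's interval by counting characters
# smaller than c inside the forward BWT slice. No query on the reverse index.

def suffix_array(s):
    # Naive O(n^2 log n) construction; illustration only.
    return sorted(range(len(s)), key=lambda i: s[i:])

def bwt_of(s, sa):
    return ''.join(s[i - 1] for i in sa)  # s[-1] is the sentinel, as usual

def interval(s, sa, pat):
    """Half-open SA interval of suffixes prefixed by pat (naive scan)."""
    hits = [r for r, p in enumerate(sa) if s[p:p + len(pat)] == pat]
    return (hits[0], hits[-1] + 1) if hits else (0, 0)

def extend_left(T, sa, bwt, C, c, fwd, rev):
    """Backward-search step X -> cX on the forward index; update both intervals."""
    l, r = fwd
    occ = lambda i: sum(1 for k in range(i) if bwt[k] == c)  # Occ(c, i), naive
    nl, nr = C[c] + occ(l), C[c] + occ(r)                    # new forward interval
    smaller = sum(1 for k in range(l, r) if bwt[k] < c)      # chars < c in BWT[l..r)
    rl = rev[0] + smaller                                    # shift reverse interval
    return (nl, nr), (rl, rl + (nr - nl))

T = 'abcabcab$'
R = T[:-1][::-1] + '$'                       # reverse text with its own sentinel
sa_f, sa_r = suffix_array(T), suffix_array(R)
bwt_f = bwt_of(T, sa_f)
C = {ch: sum(x < ch for x in T) for ch in set(T)}  # chars of T smaller than ch

fwd, rev = interval(T, sa_f, 'b'), interval(R, sa_r, 'b')
fwd2, rev2 = extend_left(T, sa_f, bwt_f, C, 'a', fwd, rev)  # X = 'b' -> 'ab'
```

Here the updated reverse interval `rev2` coincides with the interval one would obtain by searching `R` directly for the reversed pattern, even though `R` was never queried; iterating `extend_left` recovers the reverse-side intervals for all left-extensions of the pattern.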
Thus, R does not need to be constructed in the first place. By using the technique of Schnattinger et al. [113], and without any modification to the data structures, we can determine the SA range of the reverse of every text segment with respect to the reverse index R, without having constructed R, by performing queries only on F, since it is backed by a compressed suffix array. This can be computed for all text segments in O(n log σ) time. Furthermore, for a pattern P of length m, we can compute the SA range of the reverse of P in R in O(m log σ) time. It follows that we can determine the corresponding SA ranges for all m prefixes of P in O(m² log σ) time. By modifying the result of Theorem 23 to use the lcp representation of Fischer et al. [35], using the compressed suffix array of Lemma 24, and by employing bidirectional search as described above, we have the following result.

Theorem 24. A text T of length n containing d groups of wildcards can be represented in nHk(T) + o(n log σ) + 2n + O(d log n) bits to support the following operations given any query pattern P:

• type1_match(P): returns all occ1 positions in T that match P, using no wildcard groups, in O(m log σ + occ1 log^{1+ε} n) time and O(m log n) bits of working space,

• type2_match(P): returns all occ2 positions in T that match P, using (some portion of) one wildcard group, in O(m(log σ + min(m, d̂) · log d / log log d) + occ2 · log d / log log d) time and O(m log n) bits of working space,

• type3_match(P): returns all occ3 positions in T that match P, using (some portion of) two or more wildcard groups, in O(m log n + m² log σ + γ) time and O(m log n + dm) bits of working space, or O(m log n + m² log σ + γ log γ) time and O(m log n + γ log d) bits of working space,

where σ = |Σ ∪ {φ}|, Hk(T) denotes the k-th order empirical entropy of T (for any k ≥ 0), d̂ is the number of unique wildcard group lengths, γ is the number of text segments that match in P, and ε > 0.

Independently and in parallel with this work, Hon et al.
showed that two sparse suffix trees can be used in conjunction with an FM-index for the forward index in order to eliminate the reverse index [49]. This decreases the overall index space to nHk(T) + o(n log σ) + O(d log n) bits, with an increase in query time compared to the fastest solution presented here (which uses more space). In particular, their approach has the following time and space bounds.

Theorem 25 (Hon et al. [49]). Given a text T of length n containing d groups of wildcards, an index of nHk(T) + o(n log σ) + O(d log n) bits of space can be built to report all matches of a pattern P of length m using O((m + γ) log n) bits of working space in O(m(log^{1+ε₁} n + min(m, d̂) log d) + occ1 log^{1+ε₂} n + occ2 log d + γ log γ) time, where ε₁ > 0, ε₂ > 0, d̂ is the number of unique wildcard group lengths, and γ is the number of matching text segments in P.

While the construction and use of the sparse suffix trees for the claimed space and query time is quite technical (we refer the reader to their paper [49] for the details), their query algorithm does not suffer from the m² term that is necessary in the worst case for the smallest index described above in this work, and it is therefore faster except for degenerate cases. However, ideas from both the Hon et al. approach and this work are complementary and can be combined to improve indexing text with wildcards. For instance, the dynamic programming algorithm for Type 3 matching in this work results in reduced working space and a query time that can be faster, and is the same in the worst case, when compared with the Hon et al. approach. Combining ideas from both indexes, we can achieve the following result. Theorem 26.
A text T of length n containing d groups of wildcards can be represented in nHk(T) + o(n log σ) + O(d log n) bits to support the following operations given any query pattern P:

• type1_match(P): returns all occ1 positions in T that match P, using no wildcard groups, in O(m log σ + occ1 log^{1+ε₁} n) time and O(m log n) bits of working space,

• type2_match(P): returns all occ2 positions in T that match P, using (some portion of) one wildcard group, in O(m(log^{1+ε₂} n + min(m, d̂) · log d / log log d) + occ2 · log d / log log d) time and O(m log n) bits of working space,

• type3_match(P): returns all occ3 positions in T that match P, using (some portion of) two or more wildcard groups, in O(m log^{1+ε₂} n + γ) time and O(m log n + dm) bits of working space, or O(m log^{1+ε₂} n + γ log γ) time and O(m log n + γ log d) bits of working space,

where σ = |Σ ∪ {φ}|, Hk(T) denotes the k-th order empirical entropy of T (for any k ≥ 0), d̂ is the number of unique wildcard group lengths, γ is the number of text segments that match in P, ε₁ > 0, and ε₂ > 0.

Finally, we note that different time and space trade-offs can be achieved simply by using different subsidiary text indexes and range reporting data structures. Recently, Belazzougui and Navarro showed, for the first time, that a compressed text index can be constructed that supports query times independent of alphabet size [7]. In particular, their result can be used to improve the m log σ term in the query time to m, by introducing an additional O(n) term of index space. Any other self-index that supports the LF operation, with different space and time trade-offs, could also be used in the approaches discussed here. Furthermore, the O(d log n) term may be significantly smaller than the size of the compressed text if, for instance, the text does not contain many wildcard groups.
In these cases, the overall query time can be improved by using non-succinct range reporting structures [17].

Chapter 10

Indexing hypertext

10.1 Introduction

Much more progress has been made in mapping reads from genome data to reference genomes than in aligning reads derived from transcriptomes. The latter problem is harder by the very nature of the events it is capable of capturing compared to genomic sequencing. Since introns are spliced from genes in the process of transcription (see Figure 10.2), spliced reads may map to two regions of the genome that are separated by many hundreds or thousands of bases. The difficulty of aligning NGS reads that span intron boundaries is exacerbated by their short length, and alignment of such reads often is not attempted, resulting in a significant loss of information. When compared with aligning reads to a reference text, the transcriptome read alignment problem is modeled more accurately by the problem of aligning patterns to a hypertext. Informally, a hypertext is a generalization of text from a linear structure to a directed graph, G = (V, E), with each node being a fragment of text and edges indicating which fragments of text can be appended; thus, any path in the graph is a substring of the hypertext. An example of a hypertext is given in Figure 10.1. The example transcriptome in Figure 10.2 consists of five overall exons between two genes. The splicing events and valid transcripts are also shown. The resulting hypertext model of this transcriptome has a node for each exon, and an edge between exons joined by a splicing event, resulting in two components (one for each gene). The seminal work on pattern matching in hypertext is due to Manber and Wu [77], who proposed an O(|V| + m|E| + occ log log m) time algorithm, where m is the length of the pattern and occ is the number of matches. Akutsu [1] proposed an O(n) algorithm for matching in hypertext forming a tree structure, where n is the total length of text in all nodes.
Park and Kim [95] considered the case where the hypertext forms a directed acyclic graph, proposing an O(n + m|E|) time algorithm under the assumption that no node in G matches more than one position in the pattern. Amir et al. [2] proposed an algorithm with the same runtime complexity; however, theirs was the first algorithm for the case of hypertext forming a general graph. Amir et al. [2] and Navarro [87] also considered the problem of approximate matching in hypertext. In all cases, the runtimes of the previously proposed pattern matching algorithms in hypertext are impractical for alignment of millions of transcriptome reads, as they are at least linear in the size of the hypertext. Surprisingly, no index for hypertext, succinct or otherwise, has been previously proposed. (Content from this chapter appears in the proceedings of the 18th Annual International Conference on String Processing and Information REtrieval (SPIRE 2011) [130].)

Figure 10.1: An example of a hypertext. A query matches within a hypertext if and only if it can be aligned as a path through the graph. A path shown in bold matches the query pattern pizzafrompisaisbountiful.

In this work, we propose a succinct index to model hypertext. Our index can model any hypertext forming a general graph and makes no restriction on the topology. We also propose a new exact pattern matching algorithm, capable of aligning a pattern to any path in the hypertext, that is especially efficient for hypertexts where few nodes share common prefixes or where all nodes are of constant degree. In particular, our new algorithm can report all patterns crossing at most one edge (a valid assumption for current transcriptome read datasets) in O(m log σ + m · log|V|/log log|V| + occ1 log^{1+ε} n + occ2 · log|V|/log log|V|) time, where occ1 (occ2) is the number of matches that cross no (one) edge.
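To make the matching problem concrete, a match of a pattern must start at some offset inside a first node, consume intermediate nodes whole, and end at a prefix of a last node. A naive brute-force matcher (ours, for illustration only; it bears no relation to the succinct index or its query algorithm) can be sketched as follows, using the exon texts {aca, g, ga, cg, ct} listed in the Figure 10.3 caption and a hypothetical splicing-graph edge set:

```python
# Naive matcher for pattern matching in hypertext: P matches a path if it can be
# read along the path, starting at some offset in the first node and ending at a
# prefix of the last node. Runtime is exponential in the worst case; this is a
# specification-level sketch, not the index developed in this chapter.

def find_matches(edges, texts, P):
    """Return all (path, offset) pairs such that P matches along path."""
    results = []

    def extend(node, pi, path, off):
        t = texts[node]
        k = min(len(t), len(P) - pi)
        if P[pi:pi + k] != t[:k]:
            return
        if pi + k == len(P):                 # P ends at a prefix of this node
            results.append((path + [node], off))
            return
        for nxt in edges.get(node, []):      # node fully consumed; follow edges
            extend(nxt, pi + k, path + [node], off)

    for v, t in texts.items():
        for off in range(len(t)):            # try every start offset in every node
            k = min(len(t) - off, len(P))
            if P[:k] != t[off:off + k]:
                continue
            if k == len(P):
                results.append(([v], off))   # match contained in a single node
            else:
                for nxt in edges.get(v, []):
                    extend(nxt, k, [v], off)
    return results

# Exon texts as in the Figure 10.3 caption; the edge set is hypothetical.
texts = {'e1': 'aca', 'e2': 'g', 'e3': 'ga', 'e4': 'cg', 'e5': 'ct'}
edges = {'e1': ['e2', 'e3'], 'e2': ['e3'], 'e3': ['e4'], 'e4': ['e5']}
hits = find_matches(edges, texts, 'cagga')   # matches along the path e1 -> e2 -> e3
```

With these toy inputs the pattern cagga is found starting at offset 1 of e1, crossing two edges; recursion depth is bounded by the pattern length, so cycles in the graph are harmless.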
We also consider a restricted version of the problem, where only certain paths in the hypertext are considered valid, and we prove that the worst case query time complexity improves under other restrictions, including restrictions on graph topology. A main contribution of this chapter is to show the correspondence between the hypertext matching problem and the problem of matching text containing wildcards. As we will show, the former can be viewed as a generalization of the latter. In particular, recent strategies for indexing text with wildcards are applicable to indexing hypertext. Improvement to one problem may immediately lead to improvements of the other. While our results in this chapter are general and relevant to applications that are appropriately modeled by a hypertext, our original motivation was to better model the transcriptome read alignment problem. We view the results in this chapter as a theoretical contribution towards that end. However, the reads produced by current sequencing technologies contain sequencing errors (errors introduced during the sequencing process) in addition to the genetic variation expected between the experimental sequence and a reference sequence. A significant challenge that must be overcome, before these approaches could yield practical tools for transcriptome read alignment, is to efficiently support approximate pattern matching queries.

Figure 10.2: A simple genome, G, is shown having five exons contained in two genes. Exons are strings over the four letter alphabet of DNA. Below is the corresponding transcriptome, T, which consists of five transcripts. Transcripts are formed by the concatenation of certain exons from G.
Above is the splicing graph, S, where each of the five nodes corresponds to one of the five exons from G, and each directed edge denotes a splicing event (concatenation of exons) found in T. A hypertext model H for the transcriptome is also shown.

10.2 Preliminaries

In developing our hypertext index, we will use the same notation and leverage many existing results from the literature listed in Section 8.2 and Section 9.2. The remainder of this section details an existing result and definitions specific to developing our hypertext index.

10.2.1 Succinct graph representation

The succinct graph representation of Farzan & Munro supports a number of graph topology query operations in O(1) time using the best space achievable [31]. In this application, we only require the use of efficient adjacency queries. Their result is stated in terms of Boolean matrices supporting access queries. In terms of graphs, this is equivalent to determining adjacency of nodes using an adjacency matrix.

Lemma 46 (Farzan & Munro [31]). A Boolean matrix of size n × n with m ones can be represented in (1 + ε) lg (n² choose m) bits for any constant ε > 0, while supporting access (and successor) queries in O(1) time.

Figure 10.3: (left) An example of the underlying suffix array and BWT string for the forward index F of the text T = φacaφgφgaφcgφct$, representing the serialization of possible text in exons e1, . . . , e5, supposing those five exons consist of the five sequences {aca, g, ga, cg, ct}, respectively, from Figure 10.2.
(right) The underlying suffix array and BWT string for the reverse index R of the reverse text φacaφgφagφgcφtc$.

10.2.2 Hypertext

A hypertext generalizes the notion of text to be a directed graph G = (V, E) such that each node v ∈ V contains text over an alphabet Σ, and the outgoing edges of v are incident to nodes containing text that can follow the text of v. A match of a pattern P to the hypertext G is a path p = v1, . . . , vk through G, together with an offset l into the first node v1, such that P matches the concatenation of the text in nodes v1, . . . , vk, beginning at position l in v1 and ending at some prefix of vk.

Problem 11. (Pattern matching in hypertext)
Instance: A hypertext G and a pattern P
Question: Which paths in G match P?

Previous algorithms for matching in hypertext focused on reporting only the initial node, and the offset within that node, of paths in G matching P [1, 2, 95]. For our motivating problem of aligning patterns to a transcriptome, the actual path must be known, and that is our focus in the remainder of the paper. However, our matching algorithm can be simplified if only the initial node of a match (and the offset within the node) is desired.

10.3 Construction of the hypertext index

The succinct hypertext index is a collection of three sets of data structures: those indexing the node text, those indexing the graph topology, and useful auxiliary structures. In our pattern matching algorithms we find it useful to identify nodes of the graph by two different identifiers: the lex id and the rlex id. This is reflected in our descriptions of the data structures below. The lex id gives the prefix lexicographic rank of the text contained within the node as compared with all other nodes in V. Similarly, the rlex id gives the rank with respect to the suffix lexicographic order. We show how these ids can be determined in Section 10.3.3; their meaning is exactly as in the previous two chapters.
However, for many hypertext applications there is a canonical id associated with each node, giving an absolute ordering of nodes, that s