Study of Two Biochemical Models:Chemical Reaction Networks, and Nucleic Acid SystemsbyMonir HajiaghayiB.Sc., Sharif University of Technology, 2008M.Sc., The University of British Columbia, 2010A THESIS SUBMITTED IN PARTIAL FULFILLMENT OFTHE REQUIREMENTS FOR THE DEGREE OFDOCTOR OF PHILOSOPHYinThe Faculty of Graduate and Postdoctoral Studies(Computer Science)THE UNIVERSITY OF BRITISH COLUMBIA(Vancouver)October 2017c© Monir Hajiaghayi 2017AbstractThe contributions of this thesis are motivated by an exciting challenge atthe intersection of computer science and biochemistry: Can we programmolecules to do interesting or useful computations? There has been signifi-cant progress in programming nucleic acids - particularly DNA molecules -thanks in part to availability of models and algorithms for predicting nucleicacid structure and folding kinetics. At a higher level of abstraction, ChemicalReaction Networks (CRNs) have proven to be valuable as a molecular pro-gramming model that enables researchers to understand the potential andlimitations of computing with molecules, unencumbered by low-level details.These two levels of abstraction are linked; it is possible to “compile” CRNprograms into nucleic acid systems that make the programs implementablein a test tube.We design and analyze CRN algorithms for two purposes. First, weshow how any semilinear function can be computed by CRNs, even whenno “leader” species (i.e., initial species with constant but non-zero counts)is present. Our results improve earlier results of Chen et al. (2012) whoshowed that only semilinear functions are computable by error-free CRNsusing leaders. Our new CRN construction can be done in expected timeO(n), which is faster than O(n log n) bound achieved by Chen et al. Second,we provide the most intuitive proofs of correctness and efficiency for threedifferent CRNs computing Approximate Majority: Given a mixture of twotypes of species with an initial gap between their counts, a CRN computationmust reach totality on the majority species with high probability. The CRNsof our interest have the ability to start with an initial gap of Ω(√n log n).In the second part of this thesis, we study the problem of predicting theMinimum Free Energy secondary structure (the set of base pairs) of a givenset of nucleic acid strands with no pseudoknots (crossing base pairs).Weshow that this problem is APX-hard which implies that there does not exista polynomial time approximation scheme for this problem, unless P = NP.We also propose a new Monte-Carlo based method to efficiently estimatenucleic acid folding kinetics.iiLay SummaryOur contributions are motivated by an exciting challenge at the intersectionof computer science and biochemistry: Can we program molecules to dointeresting or useful computations? First, we study Chemical Reaction Net-works (CRNs) – a valuable molecular model for programming the dynamicsof interacting molecules in a well-mixed solution. We show how CRNs cancompute functions that are unions of linear pieces, and provide a simple anal-ysis of an elegant CRN that determines which of two molecular species ismost populous in the mixture. Since CRN programs can be“compiled” intonucleic acid systems whose folding dynamics simulate the reactions, the sec-ond part of our thesis studies such systems. We present an efficient methodto estimate the nucleic acid folding kinetics, and show that the structureprediction of multiple interacting strands is computationally intractable.iiiPrefaceThe author contributed to all major ideas and writing of all published andunpublished manuscripts that are the basis of this dissertation. In no in-stance, there was a student as a co-author. The author collaborated with theauthor’s supervisor and mentors in all aspects of research including definingthe problems, design and implementation of algorithms, proving the mainclaims and conducting experimental studies.Part I of this thesis was resulted from collaboration with author’s super-visor Dr. Anne Condon, as well as Dr. Dave Doty, Dr. David Kirkpatrickand Dr. Jan Manuch.• Chapters 1, 2 and 5 were written by the author, but used selectedcontent from publications that she co-authored [28, 33, 34].• A version of Chapter 3 has been published in the proceedings of the19th Annual International Conference on DNA Computing and Molec-ular Programming (2013) [33] and also the Journal of Natural Com-puting (2015) [34]. The author collaborated with the co-author, Dr.Dave Doty, in developing the algorithms, proving the lemmas, andwriting the manuscripts.• A version of Chapter 4 appears in the proceedings of the 23th AnnualInternational Conference on DNA Computing and Molecular Program-ming (2017) [28]. The author was the primary investigator to lead allthe discussions and provide the initial detailed proofs of each phasein the new proof strategy, and performed most of the experiments.Dr. David Kirkpatrick proposed the new tri-molecular CRN as anabstraction and intuition of the phases required to compute Approx-imate Majority. All the authors then equally contributed in furthersimplifying the proofs and writing the manuscript.Part II of this thesis was resulted from collaboration with a number of co-authors: the author’s supervisor Dr. Anne Condon, Dr. Bonnie Kirkpatrick,Dr. Alexandre Bouchard-Coˆte´, Dr. Chris Thachuk, and Dr. LiangliangWang.ivPreface• Chapters 6, and 9 were written by the author, but used selected con-tent from publications that she co-authored [29, 47, 60].• A version of Chapter 7 has been published in 31th International Con-ference on Machine Learning (ICML2013) [47]. The author collabo-rated with co-authors in the development and implementation of thenew sampling method and writing the manuscript. The author alsoperformed all the experiments on Nucleic Acid Folding Pathways. Dr.Liangliang Wang took the lead on experimental studies of anotherbiological example in the original paper that was excluded from thethesis.• The work presented in Chapter 8 [29] is conducted from the collabora-tion of the author with the author’s supervisor Dr. Anne Condon, andDr. Chris Thachuk. The author contributed to all aspect of researchand specifically revised the earlier reduction model proposed by co-authors to conclude more strong results of APX-hardness, proved thereduction correctness and was one of the main authors. The resultsfrom this chapter is not published yet, but the manuscript is almostready (pending on experimental study) and will be submitted to SIAMReview Journal.vTable of ContentsAbstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iiLay Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iiiPreface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ivTable of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . viList of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ixList of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xAcknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . xiiiDedication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xvI Leaderless Deterministic Computation, and Approxi-mate Majority with Chemical Reaction Networks1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11.1 Deterministic Computation with CRNs . . . . . . . . . . . . 31.2 The Approximate Majority Problem . . . . . . . . . . . . . . 51.3 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . 71.3.1 Deterministic Computations with CRNs . . . . . . . . 71.3.2 Approximate Majority with CRNs . . . . . . . . . . . 81.4 Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92 Background on Chemical Reaction Networks . . . . . . . . 102.1 Chemical Reaction Networks . . . . . . . . . . . . . . . . . . 102.2 Kinetic Model . . . . . . . . . . . . . . . . . . . . . . . . . . . 11viTable of Contents3 Leaderless Deterministic Chemical Reaction Networks . . 143.1 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . 143.1.1 Stable Decidability of Predicates . . . . . . . . . . . . 153.1.2 Stable Computation of Functions . . . . . . . . . . . . 163.1.3 Kinetic Model . . . . . . . . . . . . . . . . . . . . . . 173.2 Leaderless CRCs can Compute Semilinear Functions . . . . . 203.3 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 274 Simplifying Analyses of Chemical Reaction Networks forApproximate Majority . . . . . . . . . . . . . . . . . . . . . . 294.1 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . 294.1.1 Analysis Tools . . . . . . . . . . . . . . . . . . . . . . 304.2 Approximate Majority Using Tri-molecular Reactions . . . . 314.3 Approximate Majority Using Bi-molecular Reactions . . . . . 354.3.1 The Double-B CRN . . . . . . . . . . . . . . . . . . . 354.3.2 The Single-B CRN . . . . . . . . . . . . . . . . . . . . 404.4 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . 464.5 Application . . . . . . . . . . . . . . . . . . . . . . . . . . . . 494.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 505 Summary and Future Work . . . . . . . . . . . . . . . . . . . 515.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 515.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . 525.2.1 Deterministic Computation With CRNs . . . . . . . . 525.2.2 Approximate Majority . . . . . . . . . . . . . . . . . . 54II On Prediction of Nucleic Acid Folding Pathways andStructures 566 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 576.1 Background on Nucleic Acids: RNA/DNA Molecules . . . . . 586.2 Nucleic Acid Folding Pathways . . . . . . . . . . . . . . . . . 596.3 Nucleic Acid Secondary Structure Prediction . . . . . . . . . 626.4 Objectives and Contributions . . . . . . . . . . . . . . . . . . 646.4.1 Nucleic Acid Folding Pathways . . . . . . . . . . . . . 646.4.2 Multi-stranded Nucleic Acid Structure Prediction . . . 656.5 Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67viiTable of Contents7 Approximating Nucleic Acid Population Kinetics . . . . . 687.1 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . 687.1.1 RNA Folding Models . . . . . . . . . . . . . . . . . . . 697.1.2 Population Kinetic Calculations . . . . . . . . . . . . . 717.2 Efficient Continuous-Time Markov Chain Estimation . . . . . 727.2.1 Methodology . . . . . . . . . . . . . . . . . . . . . . . 747.3 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . 827.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 848 Hardness of Multi-stranded Nucleic Acid MFE SecondaryStructure Prediction . . . . . . . . . . . . . . . . . . . . . . . 878.1 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . 878.1.1 The Simple Energy Model . . . . . . . . . . . . . . . . 888.1.2 Problem definitions . . . . . . . . . . . . . . . . . . . . 898.2 String Designs and their Properties . . . . . . . . . . . . . . . 918.3 The Reduction . . . . . . . . . . . . . . . . . . . . . . . . . . 978.4 Reduction Correctness . . . . . . . . . . . . . . . . . . . . . . 1008.5 Approximability . . . . . . . . . . . . . . . . . . . . . . . . . 1078.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1089 Summary and Future Work . . . . . . . . . . . . . . . . . . . 1109.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1109.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . 1119.2.1 Nucleic Acid Folding Pathways . . . . . . . . . . . . . 1119.2.2 Multi-stranded Nucleic Acid Secondary Structure Pre-diction . . . . . . . . . . . . . . . . . . . . . . . . . . . 113Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115viiiList of TablesTable 6.1 Computational complexity of predicting nucleic acid MFEpseudoknot-free secondary structures and partition func-tions, when the input is a single strand, multiple strandswith a constant bound c on the number of strands, andmultiple strands where the number of strands can growwith the input length n. In each case, n is the totalnumber of bases in the input strand(s). We note that,for a single strand, a new work by Bringmann et al. [17]presents an exact sub-cubic algorithm using a simple basepair model. The bold term shows our contribution andthe question marks show that the complexity of the cor-responding problems is as yet unresolved. . . . . . . . . 63Table 7.1 Biological RNA sequences obtained from the RNA STRANDdatabase [7]. Symbols U and S correspond to the set ofsecondary structures obtained from the full model andsubset model, respectively. . . . . . . . . . . . . . . . . . 82ixList of FiguresFigure 1.1 CRNs to compute function f(x) = x+ 1. CRN (a) startswith x copies of species X and one copy of leader speciesL. CRN (b) starts with only x copies of species X. Allthe reactions have rate constant 1. . . . . . . . . . . . . . 4Figure 1.2 A tri-molecular and two bi-molecular chemical reactionnetworks (CRNs) for Approximate Majority. Reactions(0’x) and (1’y) of Single-B have rate constant 1/2 whileall other reactions have rate constant 1. . . . . . . . . . . 6Figure 3.1 An example illustrating of the essential steps of our lead-erless construction for the deterministic computation ofpartial affine functions – Lemma 3.2.1. . . . . . . . . . . 24Figure 4.1 The gap x − y (red line) and minority (count y for tri-molecular CRN and yˆ for bi-molecular CRNs) (blue line),as a function of time, of sample runs of the (a) tri-molecular,(b) Double-B, and (c) Single-B CRNs. The initial countis n = 106, the initial gap x − y is 2√n lg n and param-eters cγ , dγ and eγ are set to 2, 8, and 2 respectively.The vertical dotted lines demark the transition betweenphases 1, 2 and 3. . . . . . . . . . . . . . . . . . . . . . . 47Figure 4.2 Comparison of the time (a) and success rate, i.e., proba-bility of correctness (b) of Single-B, Double-B and the tri-molecular CRN for Approximate Majority. Each point inthe plot is an average over 5,000 trials. The initial config-uration has no B’s and the imbalance between X’s andY ’s is√n lnn. Plots show 99% confidence intervals. . . . 48Figure 6.1 Two RNA secondary structures drawn using NUPACK [110].These are two out of the many possible secondary struc-tures. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60xList of FiguresFigure 6.2 An example of a folding pathway for the RNA sequencein Figure 6.1. Each structure has been drawn using NU-PACK [110]. . . . . . . . . . . . . . . . . . . . . . . . . . 60Figure 7.1 Performance of our method (TIPS) and forward sampling(FS) on RNA21 and HIV23 molecules with their subsetstate space. The relative errors of the estimates vs. fold-ing times, {0.25,1,4}, are shown (Figures (a) and (d))along with the CPU times corresponding to the mini-mum number of particles required to satisfy the accuracylevel I in milliseconds (Figures (b) and (e)) and the vari-ance of TIPS and FS estimations (Figures (c) and (f)) onfolding times, {0.125, 0.25, · · · , 8}. . . . . . . . . . . . . . 85Figure 7.2 Performance of our method (TIPS) and forward sampling(FS) on 1AFX and 1XV6 molecules with their full statespace. The relative errors of the estimates vs. foldingtimes, {0.5,2,8} are shown (Figures (a) and (d)) alongwith the CPU times corresponding to the minimum num-ber of particles required to satisfy the accuracy level I inmilliseconds (Figures (b) and (e)) and the variance ofTIPS and FS estimations (Figures (c) and (f)) on foldingtimes, {0.5, 1, · · · , 8}. . . . . . . . . . . . . . . . . . . . . 86Figure 8.1 a) Polymer graph representation of the pseudoknot-freesecondary structure for the strand set {1, 2, 3} with or-dering {123}, b) second polymer graph for the same setof strands with ordering {132}. . . . . . . . . . . . . . . . 88Figure 8.2 An instance of the restricted 3-dimensional matching prob-lem (3dm(3)) where X = {x1, x2, x3}, Y = {y1, y2, y3},Z = {z1, z2, z3}. (a) The set of permitted triples T . (b)A valid matching M⊆ T . . . . . . . . . . . . . . . . . . 90Figure 8.3 The resulting set of strands, specified at the domain level (parta), and the MFE structure (part b), when reducing fromthe 3dm instance of Figure 8.2. In the MFE structure,triple-strand t1 labeled as perfect triple represents thatthe triple (x1, y2, z2) is in the solution of the 3dm(3) in-stance. However, triple-strand t4 denoted as trim-deprivedtriple shows that the triple (x2, y3, z2) is not selected inthe solution. . . . . . . . . . . . . . . . . . . . . . . . . . 98xiList of FiguresFigure 9.1 Tuning parameter α. Performance of our method (TIPS)using different values of α compared to forward sampling(FS) for estimating the folding pathway of the 1XV6molecule on its full state space. The minimum numberof particles required to perform each sampling method isshown on the y-axis. . . . . . . . . . . . . . . . . . . . . . 112xiiAcknowledgementsI would like to express my special appreciation and thanks to my researchsupervisor Professor Anne Condon, you have been a tremendous mentor andadvisor for me during both my masters and doctoral studies and I could nothave imagined having a better mentor during these years. Thank you formaking time for me despite your busy schedules, supporting me throughmy though times, allowing and encouraging me to pursue different researchdirections, and teaching me how to be a research scientist. I was extremelyfortunate to know you and work along side you.I would also like to thank my committee members, Professor David Kirk-patrick, and Professor Holger Hoos. Thank you David for all those long andhelpful discussions. You thought me how to think more critically, pay moreattentions to the details and become a better researcher. Moreover, I wouldlike to thank you for your insightful comments and suggestions that re-markably improved the writing of the first part of my thesis. Thank youHolger for all your help, support and valuable comments and feedback thatenhanced the quality of my thesis.Many thanks to my examining committee, Professor Will Evans andProfessor Ruben Zamar and my external examiner, Professor Ho-Lin Chen,for their time and effort that they put into reading my thesis and for theirinsightful comments and feedback. Special thanks to my collaborators, Dr.Dave Doty, Dr. Jan Manuch, Dr. Bonnie Kirkpatrick, Dr. Chris Thachukand Professor Alex Bouchard Cote. My research has greatly benefited fromthe helpful discussions and your productive suggestions. I would like tothank the current and former members of the BETA lab including ShuYang, Nasim Zolakbaf, Hosna Jabbari, Mirela Andronescu, Baharak Raste-gari, and Chris Thachuk for their friendship, support and help during mytime in UBC. Thanks to the support staff at the UBC Computer ScienceDepartment, in particular Holly Kwan, Joyce Poon, Lara Hall and MicheleNg for being such a great and supportive people.A special thanks to my family. Words cannot express how grateful Iam to my parents, my only amazing sister Mehri, and my brothers for alltheir encouragements, and sacrifices that they have made on my behalf. IxiiiAcknowledgementswould also like to thank my friends, in particular Elahe, Noushin, Anahita,Fatemeh, Neda, and Samaneh. Thank you for true friendship and beinga part of my life. At the end I would like to express my appreciation tomy beloved husband Ali who has always been my support in the gloomiesttimes and always offered me his love, and advice. Thank you for being sucha wonderful partner.xivDedicationTo my dear mom and my late beloved dad.xvPart ILeaderless DeterministicComputation, andApproximate Majority withChemical Reaction NetworksChapter 1IntroductionMolecular programming encompasses the design of programmable molec-ular systems such as molecular robotics, nanoscale computing, and pro-grammable assembly of nanoscale patterns and devices. It also encompasseslanguages and algorithms for programming these systems [30, 62, 82]. In thelast two decades, theoretical and experimental studies in this field have shedlight on integration of logical computation with biological systems [25, 46].For instance, such an integration can let a certain curative agent be releasedif a specific condition is detected in a cell.A key goal is to re-purpose the descriptive language of chemistry andphysics, which describes how the natural world works, as a prescriptive lan-guage of programming, which prescribes how an artificially engineered sys-tem should work. When the programming goal is the manipulation of indi-vidual molecules in a well-mixed solution, the language of chemical reactionnetworks (CRNs) is an attractive choice.A CRN (i.e., formally defined in Section 2.1) is a finite set of reactionssuch as X + Y → Z + W each describing a rule for transforming reac-tant molecules (e.g., {X,Y }) into product molecules (e.g., {Z,W}) [42, 57].Figures 1.1 and 1.2 show two examples of CRNs. The underlying modeldescribes how the amounts of molecular species evolve when molecules in-teract in a well-mixed solution. If an interaction (i.e., a collision betweensome molecules) is reactive, some molecules are consumed and others areproduced. Thus, by reaction, we mean a reactive interaction. A configura-tion of a well-mixed solution is also defined as the current composition ofthe amounts of all the species.CRNs may model the “amount” of a species as a real number, namely itsconcentration (average count per unit volume), or as a nonnegative integer(i.e., total count in solution, requiring the total volume of the solution to bespecified as part of the system). Here, we focus on the latter integer countsmodel which is called “stochastic”. The stochastic nature of the CRN modelmakes it better suited than the former model for analysis of systems in whichsome species counts may be low and may fluctuate significantly. In thismodel, the reactions discretely change the configuration of the system, and1Chapter 1. Introductionare assumed to happen probabilistically. In fact, the model is probabilistic attwo levels. First, which interaction occurs next is stochastically determined,reflecting the dynamics of collisions in a well-mixed solution [20, 42]. Forinstance, interactions whose reactants have high molecular counts are morelikely to happen first than interactions whose molecular counts are smaller.Second, a reaction can have more than one outcome, and rate constantsassociated with reactions determine the relative likelihood of each outcome.For example, reactions (0’x) and (0’y) of Figure 1.2(c) are equally likely. Acomputation of a CRN C is a trace of the configurations from a given initialconfiguration to some final target configuration. We note that interactionsoccur in parallel, and time of a computation is typically measured as thetotal number of interactions divided by n, assuming that the total numberof interacting molecules and volume remain fixed and are Θ(n).The computational power of CRNs has been investigated with regardto simulating boolean circuits [65], neural networks [50], digital signal pro-cessing [55], and simulating bounded-space Turing machines with an arbi-trary small, non-zero probability of error with only a polylogarithmic slow-down [11]. We note that Angluin et al. investigated the computationalpower of bi-molecular CRNs (involving only two reactants) under a dif-ferent name known as the population protocols (PPs) model [9], in whichagents interact in a pairwise fashion and may change state upon interacting.Agents and states of a PP naturally correspond to molecules and speciesof a CRN. CRNs are even efficiently Turing-universal, again with a small,nonzero probability of error over all time [94]. Using a theoretical model ofDNA strand displacement systems (DSDs) 1, it was shown that any CRNcan be transformed into a set of DNA complexes that approximately emu-late the CRN [83, 92, 95]. The experimental successes to date, are small inscale, but in the future we can expect to be able to reliably implement muchlarger CRNs by real chemicals, as causes of error are addressed [84, 99].In the context of CRNs and PPs, a substanstial amount of research has alsobeen conducted on a central problem called Approximate Majority (AM)[9, 12, 74, 81]: in a mixture of two types of species where the gap betweenthe counts of the majority and minority species is above some threshold,which species is in the majority? Approximate Majority is a significantproblem because, for instance, it can be used as a subroutine when simulat-ing other computational models [9]. For example, a comparison operation(comparing the counts of two species or agents) is a necessary component1DSDs consist of a set of strand displacement reactions where intuitively, in each reac-tion, a strand displaces another strand from a complex [18].21.1. Deterministic Computation with CRNsto simulate the behaviour of a register machine and the AM protocol is animportant subroutine to compute this operation [12]. Moreover, the systemof some biological switches is also related to both the structure and kineticsof the AM computation [19, 24].While the works mentioned above focus on the stochastic behaviour ofchemical kinetics and may encounter some probability of error in the compu-tations, the deterministic behaviour of CRNs, in the sense that the designedprotocols progress to a correct configuration with absolutely no error, is alsoof great interest [10, 21].In this part of the thesis, we make new contributions to design of CRNsfor deterministic computation, and analysis of CRNs for Approximate Ma-jority. We discuss these two subjects and their related work in Section 1.1and 1.2 respectively. Next, we sum up our objectives and contributions ineach context in Section 1.3. Finally, we end this introductory chapter bygiving an outline of the following chapters that detail our contributions inthis thesis part.1.1 Deterministic Computation with CRNsIn this section, our focus is on CRNs with deterministic guarantees on theirbehavior. These CRNs have the property that they are guaranteed to reacha correct configuration, regardless of the order in which reactions occur. Forexample, the CRN with the reaction X → 2Y is guaranteed eventually toreach a configuration in which the count of Y is twice the initial count ofX, i.e., computes the function f(x) = 2x, representing the input by theinitial count x of species X and the output by the count y of species Y , oncestable. Similarly, the reactions X1 → 2Y and X2 + Y → ∅, under arbitrarychoice of sequence of the two reactions, compute the function f(x1, x2) =max{0, 2x1 − x2}.Angluin et al. [10, 11] explored the computational behaviour of CRNsthat are deterministic under the population protocols model. They showedthat the input sets S ⊆ Nk decidable by deterministic CRNs (i.e., provid-ing “yes” or “no” answers by the presence or absence of certain indicatorspecies) are precisely the semilinear subsets of Nk. Semilinear sets are de-fined formally in Section 3.1. Informally, they are finite unions of “periodic”sets, where the definition of “periodic” is extended in a natural way tomulti-dimensional spaces, such as Nk. We note that the semilinear predi-cate computation presented by Angluin et al. [10, 11] only depends on theinput species and does not require any auxiliary “leader” species, i.e., species31.1. Deterministic Computation with CRNsX → Y X → B + 2Y (1.1.1)L→ Y B +B → B +K (1.1.2)Y +K → ∅ (1.1.3)(a) (b)Figure 1.1: CRNs to compute function f(x) = x+ 1. CRN (a) starts with xcopies of species X and one copy of leader species L. CRN (b) starts withonly x copies of species X. All the reactions have rate constant 1.present initially with constant but non-zero counts (described in more detaillater).Chen et al. [21] extended these results to function computation andshowed that precisely the semilinear functions (functions f whose graph{(x,y) ∈ Nk+l | f(x) = y } is a semilinear set) are deterministically com-putable by CRNs. We say a function f : Nk → Nl is stably (a.k.a., determin-istically) computable by a CRN C if there are “input” species X1, . . . , Xkand “output” species Y1, . . . , Yl such that, if C starts with x1, . . . , xk copiesof X1, . . . , Xk, respectively, then with probability one, it reaches a “count-stable” configuration in which the counts of Y1, . . . , Yl are expressed by thevector f(x1, ..., xk), and these counts never again change [21].The method proposed by Chen et al. [21] uses some auxiliary “leader”species present initially, in addition to the input species. To illustrate theirutility, suppose that we want to compute function f(x) = x+ 1 with CRNs.Using the previous approach, we have an input species X (with initial countx), an output species Y (with initial count 0) and an auxiliary “leader”species L (with initial count 1). Fig 1.1a shows the reactions that computef(x).However, it is experimentally difficult to prepare a solution with a singlecopy (or a small constant number) of a certain species. Chen et al. [21] askedwhether it is possible to do away with the initial “leader” molecules, i.e., torequire that the initial configuration contains initial counts x1, x2, . . . , xk ofinput species X1, X2, . . . , Xk, and initial count 0 of every other species. Itis easy to “elect” a single leader molecule from an arbitrary initial numberof copies using a reaction such as L+ L→ L, which eventually reduces thecount of L to 1. However, the problem with this approach is that, since L isa reactant in other reactions, there is no way in general to prevent L fromparticipating in these reactions until the reaction L+L→ L has reduced it41.2. The Approximate Majority Problemto a single copy.Despite these difficulties, we answer the question affirmatively, showingthat each semilinear function can be computed by a “leaderless” CRN, i.e.,a CRN which starts with an initial configuration containing only the inputspecies. To illustrate the question, consider the function f(x) = x + 1once more. In order to compute the function without a leader (i.e., theinitial configuration has x copies of X and 0 copies of every other species),the reactions in Figure 1.1b suffice: Reaction 1.1.1 produces x copies of Band 2x copies of Y . Reaction 1.1.2 consumes all copies of B except one,so reaction 1.1.2 executes precisely x − 1 times, producing x − 1 copies ofK. Therefore reaction 1.1.3 consumes x − 1 copies of output species Y ,eventually resulting in 2x − (x − 1) = x + 1 copies of Y . Note that thisapproach uses a sort of leader election on the B molecules. We generalizethis example and describe our leaderless CRN construction in Chapter 3.We also refer the readers to Sections 1.3.1 for a more detailed discussionof our contributions.1.2 The Approximate Majority ProblemConsider a mixture with n molecules, some of species X and the rest ofspecies Y . Let x and y denote the number of copies of X and Y duringa CRN computation. The Approximate Majority problem [9] is to reachconsensus — a configuration in which all molecules are X (x = n) or allare Y (y = n), from an initial configuration in which x + y = n and thegap |x− y| is above some threshold. If initially x > y, the consensus shouldbe X-majority (x = n), and if initially y > x the consensus should be Y -majority. We focus on the case when initially x > y since the CRNs thatwe analyze are symmetric with respect to X and Y .Angluin et al. [12] proposed and analyzed the Single-B CRN of Figure1.2(c) in the PP framework. Informally, reactions (0’x) and (0’y) are equallylikely to produce B’s (blanks) from X’s or Y ’s respectively, while reactions(1’) and (2’) recruit B’s to become X’s and Y ’s respectively. When X isinitially in the majority (x > y initially), a reaction event is more likelyto be (1’) than (2’), with the bias towards (1’) increasing as x gets larger.Angluin et al. showed correctness: if initially x − y = ω(√n log n), thenwith high probability Single-B reaches X-majority consensus. They alsoshowed efficiency: with “high” probability 1 − n−Ω(1), for any initial gapvalue x − y, Single-B reaches consensus within O(n log n) interactions. Inaddition, they proved correctness and efficiency in more general settings,51.2. The Approximate Majority ProblemX +X + Y → X +X +X (1)X + Y + Y → Y + Y + Y (2)(a) Tri-molecular CRNX + Y → B +B (0’)X +B → X +X (1’)Y +B → Y + Y (2’)(b) Double-B CRNX + Y1/2→ X +B (0’x)X + Y1/2→ Y +B (0’y)X +B → X +X (1’)Y +B → Y + Y (2’)(c) Single-B CRNFigure 1.2: A tri-molecular and two bi-molecular chemical reaction networks(CRNs) for Approximate Majority. Reactions (0’x) and (1’y) of Single-Bhave rate constant 1/2 while all other reactions have rate constant 1.such as when some molecules in the initial configuration are B’s, or in thepresence of o(√n) Byzantine agents.Surprisingly, while the Single-B CRN is quite simple, they provided avery complicated and lengthy proof and noted that “Unfortunately, whilethe protocol itself is simple, proving that it converges quickly appears to bevery difficult”. Motivated by their intriguing but complex approach, we aimto provide simpler and more intuitive proofs of correctness and efficiency,with the hope that simple techniques can be adapted to reason about CRNsfor other problems. For this purpose, we analyze three different CRNs forApproximate Majority: a simple tri-molecular CRN, whose reactions involvejust the two species X and Y that are present initially, and two bi-molecularCRNs, which we call Double-B and Single-B, that use an additional “blank”species B – see Figure 1.2. As noted earlier, the Single-B CRN is the sameas that of Angluin et al. The Double-B CRN is symmetric even in the PPsetting.Several others have subsequently and independently studied the problem;Mertzios et al. [74] analyze Single-B when x + y = n and y ≤ n. Theyuse a biased random walk argument to show that for = 1/7, Single-B61.3. Contributionsreaches consensus on X-majority with exponentially small error probability1 − e−Θ(n). They conjecture that exponentially small probability holds if is any positive constant less than 1/2. We show in Section 4.6 how to provethis conjecture using our analysis tools.Perron et al. [81] also analyze Single-B when the initial gap is a con-stant fraction times n, showing correctness with all but exponentially smallprobability. Their results do not apply to smaller initial gaps.Cruise and Ganesh [31] devise a family of protocols in network mod-els where agents (nodes) can poll other agents in order to update theirstate. Their family of protocols provides a natural generalization of ourtri-molecular CRN. Their analysis uses connections between random walksand electrical networks. We consider our proof of the tri-molecular CRN tobe simpler than theirs, and we obtain high probability bounds on efficiency,while they reason about expected time.Yet other work on Approximate Majority pertains to settings with dif-ferent assumptions about the number of states per agent or with differentinteraction scheduling rules [5, 36, 74], or analyze more general multi-valuedconsensus problems [12, 14, 15].Section 1.3.2 describes our contributions on the Approximate Majorityproblem.1.3 ContributionsHere, we summarize our contributions for the subjects discussed in Sec-tions 1.1 and 1.2.1.3.1 Deterministic Computations with CRNsIn this regard, we mainly answer the question asked by Chen et al. [21],whether deterministic CRNs without a leader retain the same power, andmake the following contributions.1. We show that every semilinear function is deterministically computableby a CRN whose initial configuration contains only the input speciesX1, . . . , Xk, and zero counts of every other species, so long as f(0) = 0.2. We describe a leaderless CRN construction to compute any semilinearfunction. We use a similar framework to the construction of Chenet al. [21], decomposing the semilinear function into a finite union ofaffine partial functions (linear functions with an offset; defined formally71.3. Contributionsin Section 3.1). We show how to compute each affine function withleaderless CRNs, using a fundamentally different construction thanthe affine-function computing CRNs of Chen et al. [21].3. We show that our direct stable leaderless computation of a semilinearfunction completes in expected time O(n), where n is the total numberof input molecules. This time bound is slower than the O(log5 n) timebound achieved by Chen et al. [21], but faster than the O(n log n)bound achieved by the direct construction of Chen et al. [21] whereboth fast and direct constructions crucially use leaders.1.3.2 Approximate Majority with CRNsOur main contribution is that we provide the simplest and most intuitiveproofs of correctness and efficiency for the CRNs listed in Figure 1.2 thatcompute Approximate Majority. Our detailed contributions are listed below.1. We propose a new and simple tri-molecular CRN (see Figure 1.2(a))for Approximate Majority.2. We analyze the correctness and efficiency of the tri-molecular CRNusing simple analysis tools such as biased random walks and Chernoffbounds.3. We analyze the efficiency and correctness of bi-molecular CRNs ofFigure 1.2 – Double-B and Single-B CRNs – by relating them to thetri-molecular CRN (see Section 4.3).4. We show that the count of B (blank) species in Single-B CRN is tightlybounded by the count of Y species, assuming that initially y < x andb < n/2, where n is the total count of molecules.5. A bonus is that all of our results apply with high probability whenthe initial gap is Ω(√n log n), and thus are a factor of√log n strongerthan Angluin et al.’s results [12]. However, we note that their resultsaddress a more general settings (such as starting from any arbitrarygap or tolerating o(√n) Byzantine agents) than ours and we suspectthat the complexity of their approach may well be attributable to thedifficulty of handling these cases that we do not address.81.4. Outline1.4 OutlineIn Chapter 2 we provide the required background and preliminary defini-tions to understand chemical reaction networks kinetics. Next, Chapter 3reports on our leaderless computation with CRNs. Section 3.1 first describesthe CRN semantics for the deterministic computation. Second, Section 3.2explains our leaderless deterministic construction to compute any semilin-ear function, along with its time complexity analysis. Then, in Chapter 4,we discuss the correctness and efficiency of our concerned Approximate Ma-jority CRNs. We study the behaviour of the tri-molecular CRN in threedifferent phases in Section 4.2. Later, we analyze the behaviour of Double-B and Single-B CRNs in Section 4.3.9Chapter 2Background on ChemicalReaction NetworksHere, we present some preliminary concepts and definitions as needed tounderstand chemical reaction networks and their underlying kinetic modelused throughout Chapters 3 and 4. We introduce additional concepts anddefinitions as needed in each corresponding chapter.2.1 Chemical Reaction NetworksWe model interacting molecules in a well-mixed solution, under fixed envi-ronmental conditions, such as temperature. Let Λ = {Λ1,Λ2, . . . ,Λm} be afinite set of chemical species here and throughout. We write NΛ to denotethe set of vectors of |Λ| nonnegative integers, with each coordinate “labeled”by a distinct element of Λ. Given X ∈ Λ and c ∈ NΛ, we refer to c(X) asthe count of X in c. We write c ≤ c′ to denote that c(X) ≤ c′(X) for allX ∈ Λ. Given c, c′ ∈ NΛ, we define the vector component-wise operationsof addition c+c′, subtraction c−c′, and scalar multiplication nc for n ∈ N.If ∆ ⊂ Λ, we view a vector c ∈ N∆ equivalently as a vector c ∈ NΛ byassuming c(X) = 0 for all X ∈ Λ \∆.Molecules can collide, or equivalently, interact. Given a finite set ofchemical species Λ, we denote an interaction that simultaneously involvesr1 copies of Λ1, r2 copies of Λ2, and so on by a vector r = (r1, r2, . . . , rm),and define the order of the interaction to be r1+r2+. . .+rm. Reactions withorders 1, 2, and 3 are called unimolecular, bimolecular, and tri-molecular re-spectively. In our research on CRNs, we assume that all interactions havethe same order (either two or three). In fact, each interaction r has an as-sociated rate constant kr > 0. In our discussions in Chapters 3 and 4 weassume that all interaction rate constants kr are 1. Some interactions arenon-reactive (collisions resulting in no consumed or produced molecules),while others can trigger one or more reactions. A reaction over Λ is a tripleα = 〈r,p, kα〉 ∈ NΛ × NΛ × R+, where r 6= p, specifying the stoichiome-102.2. Kinetic Modeltry of the reactants and products, respectively, and the rate constant kα.The reaction rate constant determines the probability that the reaction isaccomplished. We can write reaction α in the formα : r1Λ1 + r2Λ2 + . . .+ rmΛmkα→ p1Λ1 + p2Λ2 + . . .+ pmΛm.For instance, reaction A+ 2B1/2→ A+ 3C is a reaction with tuple 〈(1, 2, 0),(1, 0, 3), 12〉.A (finite) chemical reaction network (CRN) is a pair C = (Λ, R), whereΛ is a finite set of chemical species, and R is a finite set of reactions over Λ,such that if R(r) is the subset of R with reactant vector r, then∑α∈R(r) kα ≤1(= kr). A configuration of a CRN C = (Λ, R) is a vector c ∈ NΛ.Given a configuration c and reaction α = 〈r,p, kα〉, we say that α isapplicable to c if r ≤ c (i.e., c contains enough of each of the reactants forthe reaction to occur). If α is applicable to c, then we write α(c) to denotethe configuration c+p−r (i.e., the configuration that results from applyingreaction α to c). If c′ = α(c) for some reaction α ∈ R, we write c→C c′, ormerely c→ c′ when C is clear from context.2.2 Kinetic ModelThe following model of stochastic chemical kinetics is widely used in quan-titative biology and other fields dealing with chemical reactions betweenspecies present in small counts [42]. It ascribes probabilities to computationand execution sequences, and also defines the time of reactions, allowingus to study the computational complexity of the CRN computation in thefollowing two chapters.The kinetics of a CRN is described by a stochastic model of chemicalkinetics [42, IIB] as follows. Given a fixed volume v ∈ R+ and currentconfiguration c = (c1, · · · , cm), where c1 = c(Λ1), c2 = c(Λ2), . . . , cm =c(Λm), the propensity of a particular interaction r = (r1, r2, . . . , rm) of ordero isρ(c, r) =[m∏i=1(ciri)]/vo−1.the propensity of reaction α = 〈r,p, kα〉 is ρ(c, α) = kαρ(c, r). For instance,the propensity of a unimolecular reaction α : Xkα→ . . . in configuration cis ρ(c, α) = kαc(X). Likewise, the propensity of a bimolecular reactionα : X + Ykα→ . . ., where X 6= Y , is ρ(c, α) = kα c(X)c(Y )v and the propensity112.2. Kinetic Modelof a tri-molecular reaction α : 2X +Ykα→ . . . is ρ(c, α) = kα c(X)(c(X)−1)c(Y )2v2 .Let ρ(c, R) =∑α∈R ρ(c, α) be the sum of the propensities of all reactionsapplicable to configuration c. When an interaction occurs, the probabilitythat it is a reaction event isρ(c, R)/ρ,where ρ =∑r ρ(c, r) is the sum of the propensities of all possible interac-tions r of orders {o1, o2, . . . , ol} that are assumed to happen in a well-mixedsolution. For example, if only interactions of orders {1, 2} can happen in asolution with total molecular count n, then ρ =(n1)/v0 +(n2)/v1.When a reaction event occurs, the probability that it is reaction α = 〈r,p, kα〉iskαρ(c, r)/ρ(c, R).With respect to a given initial configuration c, a computation is a finite orinfinite sequence I1, I2, . . . where each Ii is either a non-reactive interactionor a reaction, chosen according to their probabilities. Formally, let c0 = c.For i ≥ 1, the probability that Ii is non-reactive interaction r is ρ(ci−1, r)/ρ,in which case we let ci = ci−1, and the probability that Ii is reaction α =(r,p, kα) is kαλ(ci−1, r)/ρ, in which case ci is such that ci−1 → ci.An execution (a.k.a., execution sequence) E is a finite or infinite sub-sequence of configurations c0, c1, . . . of the underlying computation wherenon-reactive interactions that result in no change of configuration are ex-cluded. Therefore, for all i ∈ {1, . . . , |E|−1}, ci−1 → ci. If a finite executionsequence starts with c and ends with c′, we write c→∗C c′, or merely c→∗ c′when the CRN C is clear from context. In this case, we say that c′ is reach-able from c.The propensity function then determines the evolution of the systemas follows. At any moment, the time until the next interaction occurs isexponentially distributed with parameter ρ [42, IIIC] defined earlier. Then,the expected time until an interaction occurs is 1ρ . Accordingly, the expectedtime until the next reaction occurs is 1/ρ(c,R) (note that ρ(c, R) = 0 if noreactions are applicable to c).The kinetic model is based on the physical assumption of well-mixednessin a diluted solution. Thus, we assume the finite density constraint, whichstipulates that a volume required to execute a CRN must be proportionalto the maximum molecular count obtained during execution [94]. In otherwords, the total concentration (molecular count per volume) is bounded.This realistically constrains the speed of the computation achievable byCRNs. We apply the kinetic model only to CRNs with configuration spaces122.2. Kinetic Modelthat are bounded for each start configuration, choosing the volume to beequal to the reachable configuration with the highest molecular count. Inthis thesis, the volume will always be within a constant multiplicative factorof the number of input molecules.13Chapter 3Leaderless DeterministicChemical Reaction NetworksIn this chapter, we begin with preliminary mathematical definitions anddifferent semantic interpretations of CRNs that are used throughout thischapter. We then provide our construction and expected time analysis toprove that a semilinear function is deterministically computed with leader-less CRNs. Finally, we summarize our results and discuss our contributions.3.1 PreliminariesGiven a vector x ∈ Nk, let ‖x‖ = ‖x‖1 =∑ki=1 |x(i)|, where x(i) de-notes the ith coordinate of x. A set A ⊆ Nk is linear if there exist vectorsb,u1, . . . ,up ∈ Nk such thatA = { b + n1u1 + . . .+ npup | n1, . . . , np ∈ N } .A is semilinear if it is a finite union of linear sets. If f : Nk → Nl is a func-tion, define the graph of f to be the set{(x,y) ∈ Nk × Nl ∣∣ f(x) = y } . Afunction is semilinear if its graph is semilinear. A predicate φ : Nk → {0, 1}is semilinear if the set{x ∈ Nk ∣∣ φ(x) = 1 } is a semilinear set.We say a partial function f : Nk 99K Nl is affine if there exist kl rationalnumbers a1,1, . . . , al,k ∈ Q and l+k nonnegative integers b1, . . . , bl, c1, . . . , ck ∈N such that, if y = f(x), then for each j ∈ {1, . . . , l}, y(j) = bj +∑ki=1 aj,i(x(i) − ci), and for each i ∈ {1, . . . , k}, x(i) − ci ≥ 0. In ma-trix notation, there exist a l × k rational matrix A and vectors b ∈ Nl andc ∈ Nk such that f(x) = A(x− c) + b.This definition of affine function may appear contrived; see [21] for anexplanation of its various intricacies. For reading this chapter, the mainutility of the definition is that it satisfies Lemma 3.2.2.Note that by appropriate integer arithmetic, a partial function f : Nk 99KNl is affine if and only if there exist kl integers n1,1, . . . , nk,l ∈ Z and 2l+ knonnegative integers b1, . . . , bl, c1, . . . , ck, d1, . . . , dl ∈ N such that, if y =143.1. Preliminariesf(x), then for each j ∈ {1, . . . , l}, y(j) = bj + 1dj∑ki=1 ni,j(x(i) − ci), andfor each i ∈ {1, . . . , k}, x(i) − ci ≥ 0. Each dj may be taken to be theleast common multiple of the denominators of the rational coefficients inthe original definition. We employ this latter definition, since it is moreconvenient for working with integer-valued molecular counts.In this chapter, we use CRNs to decide subsets of Nk (for which wereserve the term “chemical reaction decider” or CRD) and to compute func-tions f : Nk → Nl (for which we reserve the term “chemical reaction com-puter” or CRC). In the next two subsections we define two semantic inter-pretations of CRNs that correspond to these two tasks. We use the termCRN to refer to either a CRD or CRC when the statement is applicable toeither type.These definitions differ slightly from those of Chen et al. [21], becauseours are specialized to “leaderless” CRNs: those that can compute a predi-cate or function in which no species are present in the initial configurationother than the input species. In the terminology of Chen et al. [21], a CRNwith species set Λ and input species set Σ is leaderless if it has an initialcontext σ : Λ → N such that σ(S) = 0 for all S ∈ Λ \ Σ. The definitionsbelow are simplified by assuming this to be true of all CRNs.We also use the convention of Angluin, Aspnes, and Eisenstat [10] thatfor a CRD, all species “vote” yes or no, rather than only a subset of speciesas in [21], since this convention is convenient for proving time bounds.3.1.1 Stable Decidability of PredicatesWe now review the definition of stable decidability of predicates introducedby Angluin, Aspnes, and Eisenstat [10].2 Intuitively, the set of species ispartitioned into two sets: those that “vote” yes and those that vote no, andthe system stabilizes to an output when a consensus vote is reached (allpositive-count species have the same vote) that can no longer be changed(no species voting the other way can ever again be produced). It would betoo strong to characterize deterministic correctness by requiring all possibleexecutions to achieve the correct answer; for example, a reversible reactionsuch as A−⇀↽−B could simply be chosen to run back and forth forever, starv-ing any other reactions. In the more refined definition that follows, the2Those authors use the term “stably compute”, but we reserve the term “compute” forthe computation of non-Boolean functions. Also, we omit discussion of the definition ofstable computation used in the population protocols literature, which employs a notion of“fair” executions; the definitions are proven equivalent in [21].153.1. Preliminariesdeterminism of the system is captured in that it is impossible to stabilize toan incorrect answer, and the correct stable output is always reachable.A (leaderless) chemical reaction decider (CRD) is a tupleD = (Λ, R,Σ,Υ),where (Λ, R) is a CRN, Σ ⊆ Λ is the set of input species, and Υ ⊆ Λ is theset of yes voters, with species in Λ \ Υ referred to as no voters. An inputto D will be an initial configuration i ∈ NΣ (equivalently, i ∈ Nk if we writeΣ = {X1, . . . , Xk} and assign Xi to represent the i’th coordinate); that is,only input species are allowed to be non-zero. If we are discussing a CRNunderstood from context to have a certain initial configuration i, we write#0X to denote i(X).We define a global output partial function Φ : NΛ 99K {0, 1} as follows.Φ(c) is undefined if either c = 0, or if there exist S0 ∈ Λ \ Υ and S1 ∈ Υsuch that c(S0) > 0 and c(S1) > 0. Otherwise, either (∀S ∈ Λ)(c(S) >0 =⇒ S ∈ Υ) or (∀S ∈ Λ)(c(S) > 0 =⇒ S ∈ Λ \ Υ); in the former case,the output Φ(c) of configuration c is 1, and in the latter case, Φ(c) = 0.A configuration o is output stable if Φ(o) is defined and, for all c suchthat o →∗ c, Φ(c) = Φ(o). We say a CRD D stably decides the predicateψ : NΣ → {0, 1} if, for any initial configuration i ∈ Nk, for all configurationsc ∈ NΛ, i→∗ c implies c→∗ o such that o is output stable and Φ(o) = ψ(i).Note that this condition implies that no incorrect output stable configurationis reachable from i. We say that D stably decides a set A ∈ Nk if it stablydecides its indicator function.The following theorem is due to Angluin, Aspnes, and Eisenstat [10]:Theorem 3.1.1 ([10]). A set A ⊆ Nk is stably decidable by a CRD if andonly if it is semilinear.The model they use is defined in a slightly different way; the differences(and those differences’ lack of significance to the questions we explore) areexplained in Chen et al. [21].3.1.2 Stable Computation of FunctionsWe now define a notion of stable computation of functions similar to thoseabove for predicates. Intuitively, the inputs to the function are the initialcounts of input species X1, . . . , Xk, and the outputs are the counts of outputspecies Y1, . . . , Yl. The system stabilizes to an output when the counts ofthe output species can no longer change. Again determinism is capturedin that it is impossible to stabilize to an incorrect answer and the correctstable output is always reachable.163.1. PreliminariesA (leaderless) chemical reaction computer (CRC) is a tuple C = (Λ, R,Σ,Γ),where (Λ, R) is a CRN, Σ ⊂ Λ is the set of input species, Γ ⊂ Λ isthe set of output species, such that Σ ∩ Γ = ∅. By convention, we letΣ = {X1, X2, . . . , Xk} and Γ = {Y1, Y2, . . . , Yl}. We say that a configura-tion o is output stable if, for every c such that o →∗ c and every Yi ∈ Γ,o(Yi) = c(Yi) (i.e., the counts of species in Γ will never change if o isreached). As with CRD’s, we require initial configurations i ∈ NΣ in whichonly input species are allowed to be positive. We say that C stably computesa function f : Nk → Nl if for any initial configuration i ∈ NΣ, i→∗ c impliesthere exists o such that c→∗ o and o is an output stable configuration withf(i) = (o(Y1),o(Y2), . . . ,o(Yl)). Note that this condition implies that noincorrect output stable configuration is reachable from i.If a CRN stably decides a predicate or stably computes a function, wesay the CRN is stable (a.k.a., deterministic).If f : Nk 99K Nl is a partial function undefined on some inputs, wesay that a CRC C stably computes f if C stably computes f on all inputsx ∈ dom f , with no constraint on the behavior of C if it is given an inputx 6∈ dom f .3.1.3 Kinetic ModelHere, we employ the same kinetic model described in Section 2.2. However,we assume that the rate constants of all reactions are 1, and we adjustthe model with this assumption. The reaction rate constants do not affectthe definition of stable computation; they only affect the expected timeanalysis. Our expected time analyses remain asymptotically unaffected ifthe rate constants are changed (although the constants hidden in the big-Onotation would change). Intuitively, we can assume that in Lemmas 3.1.2,3.1.3, and 3.1.4, i.e., the main lemmas to analyze the expected time in thischapter, all the rate constants are replaced by the smallest one which onlyscales down the total propensity of reactions by a constant factor.Given CRN C = (Λ, R) and configuration c, recall from Section 2.2 thatthe expected time until the next reaction occurs is 1ρ(c,R) .Moreover, for the reactions in R′, where R′ ⊂ R, assuming that thereactions in R \ R′ either do not affect the reactants of reactions in R′,or that if they do, those reactions are not applicable, then the expectedtime until the next reaction in R′ occurs is 1ρ(c,R′) . This observation allowsus to analyze different independent components of a CRN as if they weretheir own CRN, so long as they either run in parallel without affecting eachothers’ reactants, or they run in series, but under the assumption that one173.1. Preliminariescomponent has finished reacting (and therefore cannot affect reactant countsfor the next component).It is not difficult to show that if a CRN is stable and has a finite reachableconfiguration space from any initial configuration i, then under the kineticmodel (in fact, for any choice of rate constants), with probability 1 the CRNwill eventually reach an output stable configuration [21].We require the following lemmas later in our main theorems. Some ofthese are implicit or explicit in many earlier papers on stochastic CRNs, butwe include their proofs for the sake of self-containment.The lemmas are stated with respect to a certain “initial configuration” cthat may not be the initial configuration of an actual CRN we define. This isbecause the lemmas are employed to argue about CRNs that are guaranteedto evolve to some configuration c that satisfies the hypothesis of the lemma,and we use the lemma to bound the expected time it takes for the CRN tocomplete a sequence of reactions, starting from c. Therefore terms such as“applicable reaction” refer to being applicable from c and any configurationreachable from it, although some additional inapplicable reactions may havebeen applicable prior to reaching the configuration c. We note that, fromnow on, if current configuration c is understood from the context, we maywrite #X to denote c(X).Lemma 3.1.2. Let c be a configuration. Let A = {A1, . . . , Am} be a setof species with the property that, for all configurations reachable from c,every applicable reaction in which any species in A appears is of the formAi → B1 + . . .+Bl, where each Bi′ 6∈ A for 1 ≤ i′ ≤ l. Then starting from aconfiguration c, in which S =∑mi=1 c(Ai) ≤ L, the expected time to reachfrom c to a configuration in which all Ai’s disappear is O(logL).Proof. Assume the hypothesis. After each relevant reaction occurs, the sumS is reduced by 1. Therefore no reactions can occur after (at most) Lreactions have executed. If∑mi=1 #Ai = k (i.e., the sum of the propensitiesof each possible reaction) in an arbitrary configuration reachable from c, theexpected time for any reaction to occur is 1k . By linearity of expectation, theexpected time for L reactions to execute is at most∑Lk=11k = O(logL).Lemma 3.1.3. Let c be a configuration. Let A = {A1, . . . , Am} be a setof species with the property that, for all configurations reachable from c,every applicable reaction in which any species in A appears is of the formAi +Aj → Ap +B1 + . . .+Bl, where each Bi′ 6∈ A for 1 ≤ i′ ≤ l, and for alli, j ∈ {1, . . . ,m}, there is at least one reaction Ai + Aj → . . . of this form.Then starting from a configuration c in which S =∑mi=1 c(Ai) ≤ L, with183.1. Preliminariesvolume O(L), the expected time to reach a configuration in which none ofthe described reactions can occur is O(L).Proof. Assume the hypothesis. Let c′ ∈ N be a constant such that thevolume is at most c′L. After each relevant reaction occurs, the sum Sis reduced by 1. Therefore no reactions can occur after (at most) L − 1reactions have executed. Now let ρ(c, αij) be the propensity of the reactionAi + Aj → Ap + B1 + . . . + Bl, which is equal to ρ(c, αji) as well, and ifthere is more than one reaction of that form, let ρ(c, αij) represent the rateof one of those reactions selected arbitrarily. Since Ai can react with Aj forany i, j ∈ {1, . . . ,m}, given that∑mi=1 #Ai = k in an arbitrary configurationreachable from c, the expected time for the next reaction to occur is inverselyproportional to the sum of the propensities of each possible reaction, i.e.,m∑i=1m∑j=1j≥iρ(c, αij) =12m∑i=1m∑j=1j 6=iρ(c, αij) +m∑i=1ρ(c, αii)=12m∑i=1m∑j=1j 6=i#Ai#Ajc′L+m∑i=1#Ai(#Ai − 1)2c′L=12c′L m∑i=1m∑j=1#Ai#Aj −m∑i=1#A2i+ 12c′Lm∑i=1#Ai(#Ai − 1)=12c′L m∑i=1#Ai m∑j=1#Aj− m∑i=1#Ai=12c′L(k2 − k)so the expected time for the next reaction to occur is 2c′Lk2−k . By linearity ofexpectation, the expected time for (at most) L − 1 reactions to execute isat most∑L−1k=12c′Lk2−k = 2c′L∑L−1k=1 (1k−1 − 1k ) = 2c′L(1− 1L−1) = O(L).Lemma 3.1.4. Let c be a configuration. Let C = {C1, . . . , Cp} and A ={A1, . . . , Am} be two sets of species with the property that, for all configu-rations reachable from c, every applicable reaction in which any species inA or C appears is of the form Ci + Aj → Ci + B1 + . . . + Bl, where eachBi′ 6∈ A for 1 ≤ i′ ≤ l. Then starting from a configuration c in which forall i ∈ {1, . . . , p}, c(Ci) = Ω(L), and S =∑mi=1 c(Ai) ≤ L, with volumeO(L), the expected time to reach a configuration in which all Ai’s disappear193.2. Leaderless CRCs can Compute Semilinear Functionsis O(logL).Proof. Assume the hypothesis. Then the counts of each Ci do not decrease.(They may increase if some Bl ∈ C, but this only strengths the conclu-sion.) Therefore this is similar to the proof of Lemma 3.1.2, since for eachk, the expected time until the next reaction occurs when∑mj=1 #Aj = k,in an arbitrary configuration reachable from c, is within a constant of 1k .Thus by linearity of expectation, the expected time for (at most) L (i.e.,∑mi=1 c(Ai) ≤ L) reactions to occur is at most∑Lk=11k = O(logL).3.2 Leaderless CRCs can Compute SemilinearFunctionsTo supply an input vector x ∈ Nk to a CRN, we use an initial configurationwith x(i) molecules of input species Xi. Throughout this section, we letn = ||x||1 =∑ki=1 x(i) denote the initial number of molecules in solution.Since all CRNs we employ have the property that they produce at most aconstant multiplicative factor more molecules than are initially present, thisimplies that the volume required to satisfy the finite density constraint isO(n).Suppose the CRC C stably computes a function f : Nk 99K Nl. Wesay that C stably computes f monotonically if its output species are notconsumed in any reaction.3We show in Lemma 3.2.1 that affine partial functions can be computedin expected time O(n) by a leaderless CRC. For its use in proving Theo-rem 3.2.4, we require that the output molecules be produced monotonically.If we used a direct encoding of the output of the function, this would beimpossible for general affine functions. For example, consider the functionf(x1, x2) = x1 − x2 where dom f = { (x1, x2) | x1 ≥ x2 }. By withhold-ing a single copy of X2 and letting the CRC stabilize to the output value#Y = x1 − x2 + 1, then allowing the extra copy of X2 to interact, the onlyway to stabilize to the correct output value x1 − x2 is to consume a copyof the output species Y . Therefore Lemma 3.2.1 computes f indirectly viaan encoding of f ’s output that allows monotonic production of outputs, en-coding the output value y(j) as the difference between the counts of two3Its output species could potentially be reactants so long as they are catalytic, meaningthat the stoichiometry of the species as a product is at least as great as its stoichiometryas a reactant, e.g. if Y is the output species, X + Y → Z + Y or A+ Y → Y + Y .203.2. Leaderless CRCs can Compute Semilinear Functionsmonotonically produced species Y Pj and YCj , a concept formalized by thefollowing definition.Let f : Nk 99K Nl be a partial function. We say that a partial functionfˆ : Nk 99K Nl×Nl is a diff-representation of f if dom f = dom fˆ and, for allx ∈ dom f , if (yP ,yC) = fˆ(x), where yP ,yC ∈ Nl, then f(x) = yP − yC ,and yP = O(f(x)).4 In other words, fˆ represents f as the difference of itstwo outputs yP and yC , with the larger output yP possibly being larger thanthe original function’s output, but is at most by a multiplicative constantlarger.The following lemma is the main technical result required for proving ourmain theorem, Theorem 3.2.4. It shows that every affine function can becomputed (via a diff-representation) in time O(n) by a leaderless CRC. Theexample in Figure 3.1 also clarifies the essence of the leaderless computationof affine functions.Lemma 3.2.1. Let f : Nk 99K Nl be an affine partial function with f(0) = 0if 0 ∈ dom f . Then there is a diff-representation fˆ : Nk 99K Nl × Nl of fand a leaderless CRC that monotonically stably computes fˆ in expected timeO(n).Proof. If f is affine, then there exist kl integers n1,1, . . . , nk,l ∈ Z and 2l+ knonnegative integers b1, . . . , bl, c1, . . . , ck, d1, . . . , dl ∈ N such that, if y =f(x), then for each j ∈ {1, . . . , l}, y(j) = bj + 1dj∑ki=1 ni,j(x(i) − ci), andfor each i ∈ {1, . . . , k}, x(i)− ci ≥ 0. Note in particular that since the rangeof f is Nl, the value bj + 1dj∑ki=1 ni,j(x(i)− ci) must be an integer.Define the CRC as follows. It has input species Σ = {X1, . . . , Xk} andoutput species Γ = {Y P1 , . . . , Y Pl , Y C1 , . . . , Y Cl }.There are three main components of the CRN, separately handling theci offset, the ni,j/dj coefficient, and the bj offset.For a species S that stabilizes to a fixed count depending only on theinput configuration, write #∞S to denote the eventual stable count of S(in the case of Y Pj and YCj , this will be the same as the total amount everproduced, since they are never consumed). The latter two components bothmake use of Y Cj molecules to account for production of YPj molecules in ex-cess of y(j) to ensure that #∞Y Pj −#∞Y Cj = y(j), which establishes thatthe CRC stably computes a diff-representation of f . It is clear by inspectionof the reactions that #∞Y Pj = O(y(j)).4By yP = O(f(x)), we mean that there is a constant c such that yP ≤ cf(x) for allx ∈ Nk.213.2. Leaderless CRCs can Compute Semilinear FunctionsFor all input species Xi (1 ≤ i ≤ k), add the reactionXi → Ci,1 +B1 +B2 + . . .+Bl + b1Y P1 + b2Y P2 + . . . blY Pl (3.2.1)The first product Ci,1 will be used to handle the ci offset, and the remain-ing products will be used to handle the bj offsets. By Lemma 3.1.2, reac-tion (3.2.1) takes time O(log n) to complete.We now describe the three components of the CRC separately.ci offset: Reaction (3.2.1) produces x(i) copies of Ci,1. We must reduce thisnumber by ci, producing x(i)− ci copies of X ′i, the species that will beused by the next component to handle the ni,j/dj coefficient. A high-order reaction implementing this is (ci+1)Ci,1 → ciCi,1+X ′i, since thatreaction will eventually happen exactly x(i)− ci times (stopping when#Ci,1 reaches ci). This is implemented by the following bimolecularreactions.For each i ∈ {1, . . . , k} and m, p ∈ {1, . . . , ci}, if m + p ≤ ci, add thereactionCi,m + Ci,p → Ci,m+p.If m+ p > ci, add the reactionCi,m + Ci,p → Ci,ci + (m+ p− ci)X ′i.By Lemma 3.1.3, these reactions complete in expected time O(n).Note that although the final reaction above may produce a large num-ber if X ′i has a large number of products, it is straightforward to simu-late any such reaction with products P1, . . . , P` with reactions havingtwo products only, e.g., the first product is P ′1, followed by reactionsP ′i → P ′i+1 + Pi for each i ∈ {1, . . . , `− 2}, and P ′`−1 → P`−1 + P`.ni,j/dj coefficient: For each i ∈ {1, . . . , k}, add the reactionX ′i → Xi,1 +Xi,2 + . . .+Xi,lThis allows each output to be associated with its own copy of theinput. By Lemma 3.1.2, these reactions complete in expected timeO(log n).For each i ∈ {1, . . . , k} and j ∈ {1, . . . , l}, add the reactionXi,j →{ni,jDPj,1, if ni,j > 0;(−ni,j)DCj,1, if ni,j < 0.223.2. Leaderless CRCs can Compute Semilinear FunctionsBy Lemma 3.1.2, these reactions complete in expected time O(log n).We must now divide #DPj,1 and #DCj,1 by dj . This is accomplished bythe high-order reactions djDPj,1 → Y Pj and djDCj,1 → Y Cj . Similarlyto the previous component, we implement these with the followingreactions for dj ≥ 1.We first handle the case dj > 1. For each j ∈ {1, . . . , l} and m, p ∈{1, . . . , dj − 1}, if m+ p ≤ dj − 1, add the reactionsDPj,m +DPj,p → DPj,m+pDCj,m +DCj,p → DCj,m+pIf m+ p > ci, add the reactionsDPj,m +DPj,p → DPj,m+p−dj + Y PjDCj,m +DCj,p → DCj,m+p−dj + Y CjBy Lemma 3.1.3, these reactions complete in expected time O(n).When dj = 1, we only have the following unimolecular reactions.DPj,1 → Y PjDCj,1 → Y CjBy Lemma 3.1.2, these reactions complete in expected time O(log n).These reactions will produce 1dj∑ni,j>0ni,j(x(i)−ci) copies of Y Pj and− 1dj∑ni,j<0ni,j(x(i) − ci) copies of Y Cj . Therefore, letting #coefY Pjand #coefYCj denote the number of copies of YPj and YCj eventuallyproduced just by this component, it holds that #coefYPj −#coefY Cj =1dj∑ki=1 ni,j(x(i)− ci).bj offset: For each j ∈ {1, . . . , l}, add the reactionBj +Bj → Bj + bjY Cj (3.2.2)By Lemma 3.1.3, these reactions complete in expected time O(n).Reaction (3.2.1) produces bj copies of YPj for each copy of Bj produced,which is∑ki=1 x(i). Reaction (3.2.2) occurs precisely (∑ki=1 x(i)) − 1times. Therefore reaction (3.2.2) produces precisely bj fewer copies ofY Cj than reaction (3.2.1) produces of YPj . This implies that when all233.2. Leaderless CRCs can Compute Semilinear FunctionsY = f(X) =32(X− 5) + 4• Initial configuration: #X = n, 0 any other species• Reactions:make copies for parallel computations: X → C +B + 4Y #C=#B=n,#Y=4n1) offset 5:(5 + 1)C → 5C +X ′ #C=5,#X′=n−52) coefficient 32 :X′ → 3D #D=3(n−5)2D → Y #Y=32(n−5)3) offset 4:B +B → B + 4Y C #B=1,#YC=4(n−1)4) combining steps 1, 2, and 3 :Y + Y C → ∅#Y=32(n−5)+4n−4(n−1)=32(n−5)+4Figure 3.1: An example illustrating of the essential steps of our leaderlessconstruction for the deterministic computation of partial affine functions –Lemma 3.2.1.copies of Y Cj are eventually produced by reaction (3.2.2), the number ofY Pj ’s produced by reaction (3.2.1) minus the number of YCj ’s producedby reaction (3.2.2) is bj .We require the following lemma, proven in Chen et al. [21].Lemma 3.2.2 ([21]). Let f : Nk → Nl be a semilinear function. Then thereis a finite set {f1 : Nk 99K Nl, . . . , fm : Nk 99K Nl} of affine partial functions,where each dom fi is a linear set, such that, for each x ∈ Nk, if fi(x) isdefined, then f(x) = fi(x), and⋃mi=1 dom fi = Nk.We require the following theorem, due to Angluin, Aspnes, and Eisen-stat [11, Theorem 5], which states that any semilinear predicate can be243.2. Leaderless CRCs can Compute Semilinear Functionsdecided by a CRD in expected time O(n).Theorem 3.2.3 ([11]). Let φ : Nk → {0, 1} be a semilinear predicate. Thenthere is a leaderless CRD D that stably decides φ (so long as some positivenumber of molecules are initially present), and the expected time to reach anoutput-stable configuration is O(n).The following is the main theorem of our work in this chapter. It showsthat semilinear functions can be computed by leaderless CRCs in linearexpected time.Theorem 3.2.4. Let f : Nk → Nl be a semilinear function with f(0) = 05.Then there is a leaderless CRC that stably computes f in expected time O(n).Proof. The CRC will have input species Σ = {X1, . . . , Xk} and outputspecies Γ = {Y1, . . . , Yl}. By Lemma 3.2.2, there is a finite set F = {f1 :Nk 99K Nl, . . . , fm : Nk 99K Nl} of affine partial functions, where eachdom fi is a linear set, such that, for each x ∈ Nk, if fi(x) is defined, thenf(x) = fi(x). We compute f on input x as follows. Since each dom fi is a lin-ear (and therefore semilinear) set, by Theorem 3.2.3 we compute each semi-linear predicate φi = “x ∈ dom fi and (∀i′ ∈ {1, . . . , i−1}) x 6∈ dom fi′?” byseparate parallel CRD’s each stabilizing in expected time O(n). (The lattercondition ensures that for each x, precisely one of the predicates is true, incase the domains of the partial functions have nonempty intersection.) Herewe are relying on the fact that Boolean combinations (union, intersection,complement) of semilinear sets are semilinear [43].By Lemma 3.2.1, for each i ∈ {1, . . . ,m}, there is a diff-representation fˆiof fi that can be stably computed by parallel CRC’s. Assume that for eachi ∈ {1, . . . ,m} and each j ∈ {1, . . . , l}, the jth pair of outputs yP (j) andyC(j) of the ith function is represented by species YˆPi,j and YˆCi,j . We interpreteach Yˆ Pi,j and YˆCi,j as an “inactive” version of “active” output species YPi,j andY Ci,j .For each i ∈ {1, . . . ,m}, for the CRD Di = (Λ, R,Σ,Υ) computing thepredicate φi, let L1i represent any species in Υ, and L0i represent any speciesin Λ \ Υ, and that once Di reaches an output stable configuration, whereb is the output of Di. By Theorem 5 of Angluin et al. [11], if the totalcount of L1i and L0i molecules is Ω(n) (which can be enforced by adding5It is easy to see that no leaderless CRN could reach an output stable state with positivecount of output species Y from an initial state with no molecules, since it would need tocontain reaction(s) of the form ∅→ A for some species A from which (unbounded countsof) Y could be produced.253.2. Leaderless CRCs can Compute Semilinear Functionsboth L1i and L0i as products of the initial reactions of the input molecules:Xi → L1i + L0i + . . .), then the correct vote can be spread through thepopulation of Lbi molecules (for b ∈ {0, 1}) in time O(n).Add the following reactions for each i ∈ {1, . . . ,m} and each j ∈ {1, . . . , l}:L1i + YˆPi,j → L1i + Y Pi,j + Yj (3.2.3)L0i + YPi,j → L0i +Mi,j (3.2.4)Mi,j + Yj → Yˆ Pi,j (3.2.5)The latter two reactions implement the reverse direction of the first reaction– using L0i as a catalyst instead of L1i – using only bimolecular reactions.Also add the reactionsL1i + YˆCi,j → L1i + Y Ci,j (3.2.6)L0i + YCi,j → L0i + Yˆ Ci,j (3.2.7)andY Pi,j + YCi,j → Kj (3.2.8)Kj + Yj → ∅ (3.2.9)That is, a “yes” answer for function i activates the ith output and a“no” answer deactivates the ith output. Eventually each CRD stabilizes sothat precisely one i has L1i present, and for all i′ 6= i, L0i′ is present, whichtakes time O(n) by Theorem 5 of Angluin et al. [11]. We now claim thatat this point, all outputs for the correct function fˆi will be activated andall other outputs will be deactivated. The reactions enforce that at anytime, #Yj = #Kj +∑mi=1(#YPi,j + #Mi,j). In particular, #Yj ≥ #Kj and#Yj ≥ #Mi,j at all times, so there will never be a Kj or Mi,j moleculethat cannot participate in the reaction of which it is a reactant. Eventually#Y Pi,j and #YCi,j stabilize to 0 for all but one value of i (by reactions (3.2.4),(3.2.5), (3.2.7)), and for this value of i, #Y Pi,j stabilizes to y(j) and #YCi,jstabilizes to 0 (by reaction (3.2.8)). Eventually #Kj stabilizes to 0 by thelast reaction. Eventually #Mi,j stabilizes to 0 since L0i is absent for thecorrect function fˆi. This ensures that #Yj stabilizes to y(j).It remains to analyze the expected time to stabilization. Let n = ‖x‖.By Lemma 3.2.1, the expected time for each affine function computation tocomplete is O(n). Since the Yˆ Pi,j are produced monotonically, the most YPi,jmolecules that are ever produced is #∞Yˆ Pi,j . Since we have m computations263.3. Conclusionin parallel, the expected time for all of them to complete is O(nm) = O(n)(since m depends on f but not n). We must also wait for each predicatecomputation to complete. By Theorem 3.2.3, each of these predicates takesexpected time O(n) to complete, so all of them complete in expected timeO(mn) = O(n).At this point, the Li1 leaders must convert inactive output species toactive, and Li′0 (for i′ 6= i) must convert active output species to inactive.By Lemma 3.1.4, reactions (3.2.3), (3.2.4), (3.2.6), and (3.2.7) complete inexpected time O(log n). Once this is completed, by Lemma 3.1.3, reac-tion (3.2.5) completes in expected time O(n). Reaction (3.2.8) completesin expected time O(n) by Lemma 3.1.3. Once this is done, reaction (3.2.9)completes in expected time O(n) by Lemma 3.1.3.3.3 ConclusionIn this chapter, we have answered an open question of Chen, Doty, andSoloveichik [21], who showed that a function f : Nk → Nl is deterministi-cally computable by a stochastic chemical reaction network (CRN) if andonly if the graph of f is a semilinear subset of Nk+l. Their proposed con-struction crucially used auxiliary leader species. The authors asked whetherdeterministic CRNs without a leader can still compute semilinear functions.We have affirmatively answered this question and showed that every semi-linear function is deterministically computable by a CRN which starts withan initial configuration containing only the input species and zero counts ofevery other species, so long as f(0) = 0.Chen et al. [21] provided, for every semilinear function f , a direct con-struction of a CRN that computes f (using leaders) in expected timeO(n log n), where n is the number of molecules present initially. Theythen combined this direct, error-free construction in parallel with a fast(O(log5 n)) but error-prone CRN that uses a leader to compute any com-putable function (including semilinear), using the error-free computation tochange the answer of the error-prone computation only if the latter is in-correct. This combination speeds up the computation from expected timeO(n log n) for the direct construction to expected time O(log5 n) for thecombined construction.Since we have assumed no leaders may be supplied in the initial configu-ration, and since the problem of computing arbitrary computable functionswithout a leader has remained a major open problem [11], this trick doesnot work for speeding up our construction. However, we have shown that273.3. Conclusionwith some care in the choice of reactions, the direct stable computation ofa semilinear function can be done in expected time O(n), improving uponthe O(n log n) bound of the direct construction of [21].28Chapter 4Simplifying Analyses ofChemical Reaction Networksfor Approximate MajorityIn this chapter, we describe the details of our techniques to investigate theefficiency and correctness of the CRNs shown in Figure 1.2. We start by ex-plaining the preliminary semantics of the Approximate Majority CRNs andour analysis tools utilized throughout in Section 4.1. We then analyze thebehaviour of the tri-molecular CRN in Section 4.2. Next, we analyze the bi-molecular CRNs of Figure 1.2 (Single-B and Double-B CRNs) in Section 4.3by showing that they are essentially simulations of the tri-molecular CRN.We end this chapter by describing our experimental results and reviewingour outcomes in this work.4.1 PreliminariesWe employ the kinetic model described in Section 2.2 for our analysis6.We note that for the CRNs that we analyze in this chapter, there issome order o such that for every reaction (r,p, kα) of R, r1 +r2 + . . .+rm =p1 + p2 + . . . pm = o. Thus the number of interacting molecules does notchange over time. Moreover, according to the finite density constraint, we’llassume that the volume of the solution is proportional to the initial numberof molecules.We consider a system in which the initial molecular count is n, and sothe molecular count in each subsequent configuration is also n. Therefore,the total molecular count in each configuration is n and, without loss ofgenerality, we assume that volume v, which remains fixed, is also equal to6Here, we note that the reaction rate constants defined for the CRNs shown in Fig-ure 1.2 are very important in the proof of their correctness and efficiency. It is an areaof future work (briefly discussed in Section 5.2.2) to understand how the reaction rateconstants can be accurately approximated by DSDs.294.1. Preliminariesn.Now, suppose that the well-mixed solution is in configuration c. Recallfrom Section 2.2, the time until the next interaction happens is exponentiallydistributed with parameter ρ =∑r ρ(c, r). Accordingly, the time Tk fork interactions is at most the sum of k exponentially distributed randomvariables, each with parameter ρ, with expected value and variance E[Tk] =k/ρ and Var[Tk] = k/ρ2 respectively. With the assumptions that all theinteractions are of order o and v = n, we concludeρ =(no)/vo−1 = Θ(n) ≤ n. (4.1.1)Thus, by Chebyshev’s inequality, we have thatP[|Tk − E[Tk]| ≥ h√Var[Tk]] = P[|Tk −Θ(k/n)| ≥ hΘ(√k/n)] ≤ 1/h2.If for example k = n, then by setting h =√n we see that the time for ninteractions is O(1) with probability at least 1− 1/n. Thus we may use thenumber of interactions, divided by n, as a proxy for time. More generally,the time for k = Ω(n) interactions is O(k/n) with probability at least 1−1/n.4.1.1 Analysis ToolsWe will use the following well-known property of random walks, Chernofftail bounds on functions of independent random variables, and Azuma’sinequality.Lemma 4.1.1 (Asymmetric one-dimensional random walk [38](XIV.2)). Ifwe run an arbitrarily long sequence of independent trials, each with successprobability at least p, then the probability that the number of failures everexceeds the number of successes by b is at most (1−pp )b.Lemma 4.1.2 (Chernoff tail bounds [26]). If we run N independent trials,with success probability p, then SN , the number of successes, has expectedvalue µ = Np and, for 0 < δ < 1,(a) P(SN ≤ (1− δ)µ) ≤ exp(− δ2µ2 ), and(b) P(SN ≥ (1 + δ)µ) ≤ exp(− δ2µ3 ).Lemma 4.1.3 (Azuma’s inequality [73]). Let Q1, . . . , Qk be independentrandom variables, with Qr taking values in a set Ar for each r. Supposethat the (measurable) function f : ΠAr → R satisfies |f(x) − f(x′)| ≤ cr304.2. Approximate Majority Using Tri-molecular Reactionswhenever the vectors x and x′ differ only in the rth coordinate. Let Y be therandom variable f(Q1, . . . , Qk). Then , for any t > 0,P[|Y − E[Y ]| ≥ t] ≤ 2 exp(− 2t2/k∑r=1c2r).4.2 Approximate Majority Using Tri-molecularReactionsIn this section we analyze the behaviour of the tri-molecular CRN of Figure1.2(a). Intuitively, its reactions sample triples of molecules and amplifythe majority species by exploiting the facts that (i) every triple must havea majority of either X or Y , and (ii) the ratio of the number of tripleswith two X-molecules and one Y -molecule to the number of triples with2 Y -molecules and one X-molecule, is exactly the ratio of X-molecules toY -molecules.Our main goal is to prove the following:Theorem 4.2.1. For any constant γ > 0, there exists a constant cγ suchthat, provided the initial molecular count of X exceeds that of Y by at leastcγ√n lg n 7, the tri-molecular CRN reaches a consensus of X-majority, withprobability at least 1− n−γ, in at most cγn lg n interactions.Recall that we denote by x and y the random variables correspondingto the molecular count of X and Y respectively. We divide the computa-tion into a sequence of three, slightly overlapping and possibly degenerate,phases, where cγ , dγ and eγ are constants depending on γ:phase 1 cγ/2√n lg n < x− y ≤ n(dγ − 2)/dγ . It ends as soon as y ≤ n/dγ .phase 2 eγ lg n < y < 2n/dγ . It ends as soon as y ≤ eγ lg n.phase 3 0 ≤ y < 2eγ lg n. It ends as soon as y = 0.Of course the assertion that a computation can be partitioned in sucha way that these phases occur in sequence holds only with sufficiently highprobability. To facilitate this argument, as well as the subsequent efficiencyanalysis, we divide both phase 1 and phase 2 into Θ(lg n) stages, defined byintegral values of t and s, as follows:7In this chapter, when we use notation lg to refer to log2.314.2. Approximate Majority Using Tri-molecular Reactions• A typical stage in phase 1 starts with x ≥ y+ 2t√n lg n and ends withx ≥ y + 2t+1√n lg n, where lg cγ ≤ t ≤ (lg n − lg lgn)/2 + lg((dγ −2)/(2dγ)).• A typical stage in phase 2 starts with y ≤ n/2s and ends with y ≤n/2s+1, where lg dγ ≤ s ≤ lg n− lg lgn− lg eγ − 1.Our proof of correctness (the computation proceeds through the specifiedphases as intended) and our timing analysis (how many interactions does ittake to realize the required number of reaction events) exploit the simple andfamiliar tools set out in the previous section, taking advantage of boundson the probability of reactions (1) and (2) that hold throughout a givenphase/stage:(a) [Low probability of unintended phase/stage completion] Therelative probability of reactions (1) and (2) is determined by the rela-tive counts of X and Y . This allows us to argue, using a biased randomwalk analysis (Lemma 4.1.1 above), that, with high probability, thereis no back-sliding; when the computation leaves a phase/stage it isalways to a higher indexed phase/stage (cf. Corollaries 1, 2 and 3,below).(b) [High probability of intended phase/stage completion withina small number of reaction events] Within a fixed phase/stagethe computation can be viewed as a sequence of independent trials(choice of reaction (1) or (2)) with a fixed lower bound on the prob-ability of success (choice of reaction (1)). This allows us to establish,by a direct application of Chernoff’s upper tail bound Lemma 4.1.2,an upper bound, for each phase/stage, on the probability that thephase/stage completes within a specified number of reaction events(cf. Corollaries 4, 5 and 6, below).(c) [High probability that the reaction events occur within asmall number of molecular interactions] Within a fixed phase/stagethe choice of reaction events, among interactions, can be viewed as asequence of independent trials with a fixed lower bound on the prob-ability of success (the interaction corresponds to a reaction event).Thus our timing analysis (proof of efficiency) is another direct appli-cation of Chernoff’s upper tail bound (Lemma 4.1.2) (cf. Corollary 7,below).324.2. Approximate Majority Using Tri-molecular ReactionsLemma 4.2.2. At any point in the computation, if x − y = ∆ then theprobability that x− y ≤ ∆/2 at some subsequent point in the computation isless than (1/e)∆2/(2n+2∆).Proof. Since x− y > ∆/2 up to the point when we first have x− y ≤ ∆/2,it follows that x ≥ n/2 + ∆/4 and y ≤ n/2−∆/4. We can view the changein x− y as a random walk, starting at ∆, with success (an increase in x− y)probability p satisfying p ≥ 1/2 + ∆/(4n).It follows from Lemma 4.1.1 that reaching a configuration where x −y ≤ ∆/2 (which entails an excess of ∆/2 failures to successes) is less than( 11+∆/n)∆/2 which is at most (1/e)∆2/(2n+2∆).Corollary 4.2.3. In stage t of phase 1, x − y reduces to 2t−1√n lg n withprobability less than 1/n22t−2.Lemma 4.2.4. At any point in the computation, if y = n/k then the prob-ability that y > 2n/k at some subsequent point in the computation is lessthan (2/(k − 2))n/k.Proof. Since y ≤ 2n/k up to the point when we first have y > 2n/k, wecan view the change in y as a random walk, starting at n/k, with success (adecrease in y) probability p satisfying p ≥ 1− 2/k.It follows from Lemma 4.1.1 that reaching a configuration where y >2n/k (which entails an excess of n/k failures to successes) is less than (2/(k−2))n/k.Corollary 4.2.5. In stage s of phase 2, y increases to n/2s−1 with proba-bility less than (2/(2s − 2))n/2s.Corollary 4.2.6. In phase 3, y increases to 2eγ lg n with probability lessthan (2eγ lg n/(n− 2eγ lg n))eγ lgn.Lemma 4.2.7. At any point in the computation, if x− y = ∆ ≤ n/2 then,assuming that x−y never reduces to ∆/2, the probability that x−y increasesto 2∆ within at most λn reaction events is at least 1− exp(− (λ−2)∆2λ(2n+∆)).Proof. We view the choice of reaction as an independent trial with successcorresponding to reaction (1), and failure to reaction (2). We start with x−y = ∆ and run until either x− y = ∆/2 or we have completed λn reactions.By Lemma 4.1.2, the probability that we complete λn reactions with fewerthan λn/2 + ∆/2 successes, which is necessary under our assumptions if wefinish with x− y < 2∆, is at most exp(− (λ−2)∆2λ(2n+∆)).334.2. Approximate Majority Using Tri-molecular ReactionsCorollary 4.2.8. In stage t of phase 1, assuming that x− y never reducesto 2t−1√n lg n, the probability that x− y increases to 2t+1√n lg n within atmost λn reaction events is at least 1− exp(− (λ−2)22t lgn3λ ).Lemma 4.2.9. At any point in the computation, if y = n/k then, assumingthat y never increases to 2n/k, the probability that y decreases to n/k − rwithin f(n) > 2r reaction events is at least 1− exp(−Θ(f(n)).Proof. We view the choice of reaction as an independent trial with successcorresponding to reaction (1), and failure to reaction (2). We start withy = n/k and run until either y = n/k−r or we have completed f(n) reactionevents. (We assume, by Lemma 4.2.4, that y < 2n/k, and so p > 1 − 2/k,throughout.)By Lemma 4.1.2, the probability that we complete f(n) reactions withfewer than (f(n)+r)/2 successes, which is necessary under our assumptionsif we finish with y > nk − r, is at mostexp(− [f(n)(k − 2)/k − (f(n) + r)/2]22f(n)(k − 2)/k ),which is at most exp(−Θ(f(n)), when f(n) > 2r.Corollary 4.2.10. In stage s of phase 2, assuming that y never increasesto n/2s−1, y decreases to n/2s+1, ending stage s, in at most λn/2s reactionevents, with probability at least 1− exp(−Θ(λn/2s)).Corollary 4.2.11. In phase 3, assuming that y never increases to 2eγ lg n,y decreases to 0, ending phase 3 (and the entire computation), in at mostλ lg n reaction events, with probability at least 1− exp(−Θ(λ lg n)).Lemma 4.2.12. If during some sequence of m interactions the total propen-sity of all reactions is at least p then the probability that the sequence givesrise to fewer than mp/(2n) reaction events is no more than exp(−mp/(8n)).Proof. Recall from Section 2.2 that if the total propensity of all reactionsin a given configuration c is at least p then the probability that an inter-action results in a reaction is ρ(c,R)ρ ≥ p/n (see Equation 4.1.1). Hence, byLemma 4.1.2, the probability that a sequence of m interactions gives rise tofewer than mp/(2n) reaction events is no more than exp(−mp/(8n)).Corollary 4.2.13.(i) The λn reaction events of each stage of phase 1 occur within (8/3)dγλninteractions, with probability at least 1− exp(−λn/4)344.3. Approximate Majority Using Bi-molecular Reactions(ii) The λ(n/2s) reaction events of stage s of phase 2 occur within (16/3)λninteractions, with probability at least 1− exp(−λn/2s+2) ; and(iii) The λ lg n reaction events of phase 3 occur within (8/3)λn lg n interac-tions, with probability at least 1− exp(λ lg n/4).Proof. It suffices to observe the following lower bounds on the propensity ofreaction (1) alone in individual phases/stages:(i) in phase 1, x > y ≥ n/dγ , so the propensity of reaction (1) is greaterthan 3n/(4dγ);(ii) in stage s of phase 2, x > n(1− 2s−1) and y ≥ n/2s+1 ≥ (lg n)/2, so thepropensity of reaction (1) is at least 3n/2s+3;(iii) in phase 3, x ≥ n− lg n and y ≥ 1, so the propensity of reaction (1) isat least 3/4.Finally, we prove Theorem 4.2.1 using the pieces proved until now.of Theorem 4.2.1. (i) [Correctness] It follows directly from Corollaries 1 and4 (respectively, 2 and 5, 3 and 6) that phase 1 (respectively phase 2, phase3) completes in the intended fashion, within at most λn lg n ( respectively,λn, λ lg n) reaction events, with probability at least 1 − exp(−Θ(cγ lg n))(respectively, 1− exp(−Θ(λn/dγ)), 1− exp(−Θ(λ lg n))).(ii) [Efficiency] It is immediate from Corollary 4.2.13 that the required num-ber of reaction events in phase 1 (respectively, phase 2, phase 3) occur withinΘ(λn lg n) interactions, with probability at least 1− exp(−Θ(λ lg n)).4.3 Approximate Majority Using Bi-molecularReactionsHere we show correctness and efficiency of the Double-B and Single-B CRNs,essentially by showing that the abstraction of both CRNs corresponds to thetri-molecular CRN of the previous section.Here and throughout, we denote by b the random variable correspondingto the molecular count of B.4.3.1 The Double-B CRNIn this section we analyse the behaviour of the Double-B CRN of Fig-ure 1.2(b). We will show that:354.3. Approximate Majority Using Bi-molecular ReactionsTheorem 4.3.1. For any constant γ > 0, there exists a constant cγ suchthat, provided (i) the initial molecular count of X and Y together is at leastn/2, and (ii) the count of X exceeds that of Y by at least cγ√n lg n, thecomputation reaches a consensus of X-majority, with probability at least1− n−γ, in at most cγn lg n interactions.Comparing with Theorem 4.2.1, it becomes clear that the role of themolecule B is simply to facilitate the simulation. The sense in which Double-B can be seen as simulating the earlier tri-molecular CRN is that we cananalyse its behaviour in three phases that directly parallel those of our tri-molecular analysis. We measure progress throughout in terms of the changein the molecular counts xˆ, defined as x + b/2, and yˆ, defined as y + b/2,noting that reaction (0’) leaves these counts unchanged and reactions (1’)and (2’) change xˆ and yˆ at exactly half the rate that the corresponding tri-molecular reactions (1) and (2) change x and y. In each phase, we note thatthe relative propensity of reaction (1’) to that of (2’), equals or exceeds therelative propensity of reaction (1) to that of (2) in the tri-molecular CRN,and we argue that the total propensity of reactions (1’) and (2’) is at leastsome constant fraction of the total propensity of reactions (1) and (2). Thisallows us to conclude that Double-B (i) takes at most twice as many reactionevents as the tri-molecular CRN to complete each corresponding phase/stageand (ii) these reaction events occur within a number of interactions that is atmost some constant multiple of the number of interactions needed to realizethe required reaction events in the tri-molecular CRN.Bounds on b, the molecular count of BWe start by setting out bounds on b, the molecular count of molecule B,specifically y/2α ≤ b ≤ n/2, where α ≥ 20, that, with high probability,hold after the first Θ(n) interactions (see Lemmas 4.3.2 and 4.3.8), andcontinue to hold thereafter. These bounds allow us to establish property(ii) above. Our bounds are summarized in Lemma 4.3.3 below, and itsproof is a straightforward application of Chernoff bounds. We note thatthe connection between the number of reaction events and the number ofinteractions in each interval I used in Lemmas 4.3.2 and 4.3.3 is shown inLemma 4.3.8.Lemma 4.3.2. Let I be the interval of the first l = 12y0 reaction eventsin a computation of Double-B, where y0 is the initial molecular count ofY . Let b0, and be denote the initial and final values of b in this interval(similarly for y). If b0 < y0/α (even if b0 = 0), then there exists a constant364.3. Approximate Majority Using Bi-molecular Reactionsfγ such that if y0 ≥ fγ lg n, then ye/α ≤ be ≤ 10n/21 with probability atleast 1− 1/nγ.Proof. Note that x−y does not change by reaction (0’), and by Lemma 4.2.2,it never reaches (x0 − y0)/2 through reactions (1’) and (2’), which result inx− y ≥ 0 in interval I.Assuming that b ≤ y/α over the course of interval I, the probabilitythat a reaction event would be equal to reaction (0’) is at least αα+2 (seecomputations below).P[reaction (0’) occurs] =xyxy + xb+ yb≥ xyxy + 2x yα≥ αα+ 2.Therefore, by the Chernoff lower tail bound 4.1.2, there are at least 3α4(α+2) lreaction events of type (0’), among l = 1/2y0 reaction events, with proba-bility at least 1 − 1/nγ provided that y0 ≥ fγ lg n . Thus, at most α+84(α+2) lreactions events are of types (1’) and (2’).We note that the number of B molecules increases via reaction (0’) anddecreases via reactions (1’) and (2’). The number of Y molecules also in-creases via reaction (2’) and decreases via reaction (0’). So, we can computebe and ye as follows.be ≥ b0 + 6α4(α+ 2)l − α+ 84(α+ 2)l, and ye ≤ y0 − 3α4(α+ 2)l +α+ 84(α+ 2)l,With a simple computation, it is clear that be ≥ yeα even if b0 = 0.Moreover, since x > y in interval I, we have y0 < n/2 and subsequentlyl < n/4 which never let b reach 10n/21.Lemma 4.3.3. Let I be any interval of y0 reaction events of a computationof Double-B, where ≤ 1α+1 , α ≥ 20, and y0 is the value of y at the beginningof I. Let b0, be, bmin, bmax denote the initial, final, minimum, and maximumvalues of b in the interval I (similarly for y). Then for any γ > 0, there isa constant lγ such that if y0 ≥ lγ lg n, then the following bounds hold withprobability at least 1− 1/nγ.(a) (Lower bounds) If b0 ≥ y0α then bmin ≥ ymax2α and be ≥ yeα .(b) (Upper bounds) If b0 ≤ 10n/21 then be ≤ 10n/21 and bmax ≤ n/2.Proof. We prove the claims (a) and (b) as follows.Lower bounds: We can validate that both lower bounds hold whenb0 ≥ 2αy0. In fact, we can simply consider the minimum value b0 − y0374.3. Approximate Majority Using Bi-molecular Reactionsfor bmin and be and the maximum value ymax = y0 + y0 for ymax and ye.So, we just focus on the situation where b0 <2αy0. Using this bound onb together with the assumptions that x ≥ y (Lemma 4.2.2), and ≤ 1α+1 ,we obtain that a reaction event is equal to reaction (0’) with probability atleast α−1α+5 (see computation below).P[Reaction (0’) occurs] =xyxy + xb+ yb≥ yy + 2b≥ y0 − y0y0 +2αy0 + 3y0≥ α− 1α+ 5.Therefore, by the Chernoff lower tail bound (Lemma 4.1.2), there are atleast 3(α−1)4(α+5)yˆ0 reaction events of types (0’), among y0 reaction events, withprobability at least 1− 1/nγ on the condition that y0 ≥ lγ lg n. As a result,at most α+234(α+5) reaction events are equal to either reaction (1’) or (2’) inFigure 1.2. Therefore, we can prove the correctness of the lower bounds asfollows.case (i) ymax2α ≤ bmin. Throughout interval I, at most α+234(α+5)y0 (reactions(1’) and (2’)) B molecules are consumed and at most α+234(α+5)y0 (reac-tions (2’)) Y molecules are produced. So,bmin ≥ b0 − (α+ 23)4(α+ 5)y0, and ymax ≤ y0 + (α+ 23)4(α+ 5)y0that confirm the inequality for all y0α ≤ b0 ≤ 2y0α .case (ii) yeα ≤ be. At the end of the interval I, in addition to the maximumconsumption and production of B and Y molecules described in case(i), at least 6(α−1)4(α+5)y0 (reaction(0’)) B molecules are produced and atleast 3(α−1)4(α+5)y0 (reaction (0’)) Y molecules are consumed as well. So,be ≥ b0 + 5α− 292(α+ 5)y0, and ye ≤ y0 − α− 13(α+ 5)y0which result in yeα ≤ be for all y0α ≤ b0 ≤ 2y0α .Upper bounds: Note that y0 ≤ n/2, since x > y and x + y ≤ nthrough the interval. It is easy to verify that while b0 ≤ 9n21 , both upperbounds completely hold even if b increases to its maximum of b0 +2y0 (be ≤bmax ≤ 10n/21 ≤ n/2). Moreover, note that over the course of I, even ifb0 = 10n/21, we have bmax ≤ 10n/21 + 2y0 ≤ 11n/21 and subsequentlyx+ y ≥ 10n/21.384.3. Approximate Majority Using Bi-molecular ReactionsWe now consider the case where b0 >9n21 . The value of b increases viareaction (0’) and decreases via reactions (1’) and (2’). Then, the relativeprobability of reaction (0’) to that of reactions (1’) and (2’) can be computedas follows.P[Reaction (0’) occurs]P[Reaction (1’) or (2’) occurs]=xy(x+ y)b≤ (10/42)2(10n/21)(9n/21)≤ 25Therefore, the probability that a reaction event would be equal to reaction(0’) is less than 2/7.By the Chernoff upper tail bound (Lemma 4.1.2), we can then concludethat there at most 11/35y0 reaction events are equal to reaction (0’) (i.e.,increasing b) with probability at most exp(−Θ(n)) assuming that y = Θ(n).Interchangeably, it means that the number of reaction events equal to re-actions (1’) or (2’) (i.e., decreasing b) is also at least 24/35y0. Therefore,with high probability, the net change in b after y0 reaction events is nega-tive ((22/35 − 24/35)y0 < 0) and it gives us be ≤ 10n21 . Also, as the totalincrease in b is at most 2235y0 throughout interval I, b will be bounded byb0 +2235yˆ0 ≤ n/2 as well.Lemma 4.3.4. Let I be any interval of l = cy0 reaction events in thecomputation of the Double-B CRN, where c ≤ 1/2, y0 is the initial value ofy at the beginning of I, and x ≥ y throughout the interval. If y0 ≥ w lg n,then for any γ > 0, interval I will happen within 16γ(1−c)wn interactions withprobability at least 1− 1/nγ.Proof. Using Lemma 4.2.12, it is sufficient to show that the total propensityof Double-B reactions over the course of interval I is 1−c2 y0. Among l = y0reaction events, the number of Y molecules can decrease by at most l, soy ≥ y0− l. On the other hand, since x ≥ y, we have n = x+y+b ≤ x+x+bleading to x+b/2 ≥ n/2 throughout this interval. Putting these observationstogether, we obtain:ρ(c,R) =xy + xb+ ybn≥ y(x+ b/2)n≥ 1/2(y0 − l) = 12(1− c)y0.Double-B simulates the tri-molecular CRNTo understand computations of the Double-B CRN as simulations of thetri-molecular CRN, we again identify three phases, this time expressed in394.3. Approximate Majority Using Bi-molecular Reactionsterms of xˆ and yˆ. (Note that (i) xˆ − yˆ = x − y and (ii) yˆ = 0 implies bothy = 0 and b = 0):phase 1 cγ/2√n lg n < xˆ− yˆ ≤ n(dγ − 2)/dγ . It ends as soon as yˆ ≤ n/dγ .phase 2 eγ lg n < yˆ < 2n/dγ . It ends as soon as yˆ ≤ eγ lg n.phase 3 0 ≤ yˆ < 2eγ lg n. It ends as soon as yˆ = 0.As before, we divide both phase 1 and phase 2 into Θ(lg n) stages, definedby integral values of t and s, as follows:• A typical stage in phase 1 starts with xˆ ≥ yˆ+ 2t√n lg n and ends withxˆ ≥ yˆ + 2t+1√n lg n, where lg cγ ≤ t ≤ (lg n − lg lgn)/2 + lg((dγ −2)/(2dγ)).• A typical stage in phase 2 starts with yˆ ≤ n/2s and ends with yˆ ≤n/2s+1, where lg dγ ≤ s ≤ lg n− lg lgn− lg eγ − 1.The proof of Theorem 4.3.1 completely parallels that of Theorem 4.2.1,with xˆ and yˆ substituted for x and y. For correctness of Double-B, it sufficesto note that (i) reaction (0’) does not change either xˆ or yˆ, (ii) reaction (1’)increases xˆ by 1/2 and decreases yˆ by 1/2, and (iii) reaction (2’) increases yˆby 1/2 and decreases xˆ by 1/2. Thus twice as many reactions (1’) and (2’)suffice to achieve the same results as reactions (1) and (2) of the tri-molecularCRN.The efficiency of Double-B follows in a similar way from the earlier anal-ysis of the tri-molecular CRN presented in Corollary 4.2.13. There we ob-served that it sufficed to bound from below the propensity of reaction (1).For the corresponding analysis of Double-B, we observe that in all corre-sponding phases/stages the propensity of reaction (1’) is up to a constantfactor the same as that of reaction (1). This follows immediately from theupper bound (n/2) on b, which ensures that the molecular count of X is atleast n/4, and the lower bound (y/292) on b, which ensures that the molec-ular count of B is at least a constant fraction of that of Y . The constant eγthat is used in demarking the end of phase 2 and the start of phase 3 willnow depend on the constant fγ of Lemma ??, in order to ensure that thislower bound on b holds throughout phase 2 with high probability.4.3.2 The Single-B CRNHere, we study the behaviour of Single-B, originally proposed by Angluin etal. [12] and shown in Figure 1.2(b). We will show that:404.3. Approximate Majority Using Bi-molecular ReactionsTheorem 4.3.5. For any constant γ > 0, there exists a constant cγ > γsuch that, provided (i) the initial molecular count of X and Y together is atleast n/2, and (ii) the count of X exceeds that of Y by at least cγ√n lg n, theSingle-B CRN reaches a consensus of X-majority, with probability at least1− n−γ, in cγn lg n interactions.Comparing the Double-B and Single-B CRNs, we notice the only differ-ence is that reaction (0’) is replaced by probabilistic reactions (0’x) and (0’y)which are equally likely. Intuitively, this replacement keeps the behaviourof the Single-B and Double-B CRNs essentially the same, as reactions (0’x)and (0’y), on average, have no effect on xˆ and yˆ. However, its advantage isto never let Single-B CRN reach B-majority consensus8.In order to analyze the behaviour of Single-B, we also consider threephases similar to those for Double-B and tri-molecular CRNs. The progressof the protocol is also measured in terms of random variables xˆ and yˆ. Re-actions (0’x) and (1’) increase xˆ by 1/2 and decrease yˆ by 1/2, and reactions(0’y) and (2’) decrease xˆ by 1/2 and increase yˆ by 1/2. We recall that weassume x and y are substituted by xˆ and yˆ when we refer to lemmas ofSection 4.2.phase 1 (cγ − γ)/2√n lg n < xˆ − yˆ ≤ n(dγ − 2)/dγ . It ends as soon asyˆ ≤ n/dγ .phase 2 eγ lg n < yˆ < 2n/dγ . It ends as soon as yˆ ≤ eγ lg n.phase 3 0 ≤ yˆ < 2eγ lg n. It ends as soon as yˆ = 0.Similar to the Double-B and tri-molecular CRNs, we divide both phase 1and phase 2 into Θ(lg n) stages, defined by integral values of t and s, asfollows:• A typical stage in phase 1 starts with xˆ ≥ yˆ+ 2t√n lg n and ends withxˆ ≥ yˆ+2t+1√n lg n, where lg(cγ−γ) ≤ t ≤ (lg n− lg lgn)/2+lg((dγ−2)/(2dγ)).• A typical stage in phase 2 starts with yˆ ≤ n/2s and ends with yˆ ≤n/2s+1, where lg dγ ≤ s ≤ lg n− lg lgn− lg eγ − 1.The proof that Single-B is correct and efficient parallels the phase anal-ysis of the tri-molecular CRN, if we make the following adjustment steps.8We note that although the B-majority consensus is reachable in the Double-B CRN,the probability of such an event is very small (i.e., nΘ(− lg(n))) with our initial configurationsetting. The bound is computed with a simple biased random walk analysis in lgn stages.414.3. Approximate Majority Using Bi-molecular Reactions1. Establish that the initial gap between xˆ and yˆ, with high probability,is not decreased by more than γ√n lg n within n reaction events –see Lemma 4.3.6. Thus, it assures us that xˆ ≥ yˆ in the startinginterval required to get b bounded. Moreover, with the assumptionthat initially xˆ − yˆ ≥ cγ√n lg n, with high probability, phase 1 startswith xˆ− yˆ ≥ (cγ − γ)√n lg n.2. Employ Lemmas 4.3.2 and 4.3.3 to provide the lower-bound (y/2α),where α ≥ 20, and upper-bound (n/2) on b after the first Θ(n) inter-actions. In order to verify the proof of lemma for Single-B, it’s onlyneeded to adjust the constants according to propensities of reactions(0’x) and (0’y). We are able to provide a tighter upper bound onb with respect to variable y9. In fact, we can show that inequalityy2α ≤ b ≤ 2αy hold at any point of the Single-B computation althoughit doesn’t hold for the Double-B computation – see Lemmas 4.3.7 and4.3.8.3. Show that Lemmas 4.2.2 and 4.2.7 and their corresponding corollariesalso prove the correctness and efficiency of phase 1 in the Single-BCRN. Utilizing the lower bound on b, the ratio of total propensityof reactions (0’x) and (1’) to that of reactions (0’y) and (2’) is lowerthan the relative propensity of reaction (1) to that of reaction (2) inthe tri-molecular CRN by at most a small constant – See Lemma 4.3.9.Therefore, the analysis of phase 1 of Single-B parallels that of the tri-molecular CRN.4. Modify Lemmas 4.2.4 and 4.2.9 so that they also verify the correctnessand efficiency of phases 2 and 3 in the single-B CRN. Referring backto the lower bound on b, we note that the ratio of total propensity ofreactions (0’y) and (2’) to that of reactions (0’x) and (1’), is greaterthan the ratio of the propensity of reaction (2) to that of reaction (1)in the tri-molecular CRN, by at most a small constant. It is straight-forward to consider this small constant and revise Lemmas 4.2.4 and4.2.9 and their related corollaries, to make the analysis of phases 2and 3 of Single-B also parallel to those of the tri-molecular CRN – SeeLemmas 4.3.10 and 4.3.11.5. Employ Lemma 4.2.12 to complete the proof of efficiency. Using the9We note that, the derived b bounds in Lemma 4.3.3 are sufficient for the proof ofcorrectness and efficiency of our two bi-molecular CRNs. However, a tighter upper boundon b may be useful when the Single-B protocol is used as a part of other CRNs.424.3. Approximate Majority Using Bi-molecular Reactionsupper bound on b, which confirms that x ≥ n/4 and the lower boundon b, which confirms b ≥ y/292, we can conclude that the total propen-sities of reactions (0’x), (0’y), (1’), and (2’) is at least some constantfraction of the total propensities of reaction (1) and (2) in tri-molecularCRN. Therefore, the total number of interactions in Single-B is at mostsome constant multiple the required number of interactions in the tri-molecular CRN.Lemma 4.3.6. Starting from xˆ − yˆ ≥ cγ√n lg n, where cγ > γ, xˆ − yˆreduces to (cγ − γ)√n lg n within d = n reaction events with probability lessthan 1/n(γ2).Proof. Starting from xˆ−yˆ ≥ cγ√n lg n, the probability that xˆ−yˆ increases isat least as much as the probability that it decreases. As a worst case scenario,we can view the changes in xˆ− yˆ, as an unbiased random walk which startsat cγ√n lg n and its expected translation distance is√n within n reactionevents [38](XIV). We now define event Md as the one where xˆ− yˆ decreasesin total by at most γ√n lg n within d = n reaction events. Let Q1, . . . , Qrdenote independent random variables where 0 ≤ r ≤ d taking values in setAr = [1,−1]. The independent random variables Qr satisfy the conditionsof Azuma’s inequality (Lemma 4.1.3) with cr = 2, the expected change√n(assuming an unbiased random walk), and function Y = f(Q1, . . . , Qd) =max1≤r≤d |∑ri=1Qi| which gives us the maximum translation distance overd reaction events. Now, using Azuma’s inequality, we can infer that P[|Y −√n| ≥ γ√n lg n] ≤ 1/nγ2 . It means that in our unbiased random walkthe maximum distance from the origin is not more than γ√n lg n with highprobability which leads to P[Md] ≤ 1− 1/nγ2 .Lemma 4.3.7. Let I be any interval of y0 reaction events of a computationof Single-B, where ≤ 1α+1 , α ≥ 20, and y0 is the initial value of y at thebeginning of I. Let b0, be, bmin, bmax denote the initial, final, minimum,and maximum values of b in the interval I (similarly for y). Then for anyγ > 0, there is a constant lγ such that if y0 ≥ lγ lg n, then the followingbounds hold with probability at least 1− 1/nγ.(a) (Upper bounds) If b0 ≤ min(n/2, αy0) then bmax ≤ 2αymin and be ≤αye.Proof. We prove tighter upper bounds on b as follows.Upper bounds: It is easy to verify that while b0 ≤ (α − 1)y0, upperbounds bmax ≤ 2αymin and be ≤ αye completely hold even if y reaches to itsminimum of y0 − y0 and b increases to its maximum of b0 + y0 within y0434.3. Approximate Majority Using Bi-molecular Reactionsreaction events. Therefore, for simplicity we just consider the case whereb0 > (α− 1)y0. The number of B molecules can only increase via reactions(0’x) and (0’y). We can bound the probability that a reaction event wouldbe equal to one of these two reactions as follows.P[Reaction 0’x or 0’y ocurrs] =xyxy + xb+ yb≤ yy + b≤ y0 + y0y0 + (α− 1)y0 + 2y0 ≤1 + α− 2 .With an application of the Chernoff upper tail bound 4.1.2, we thenconclude that the probability of having more than 4(1+)3(α−2)y0 reaction eventsof types (0’x) and (0’y), is at most 1/nγ , provided that y0 ≥ lγ lg n. Also,as reactions (0’x) and (0’y) have the same rate, there are at most 2(1+)3(α−2)y0reaction events of type (0’x) with probability at least 1 − 1/nγ . Thus, thenumber of reaction events equal to either reaction (1’) or (2’) is at leasty0− 4(1+)3(α−2)y0. So, we can prove the upper bounds of our claim as follows.case (i) bmax ≤ 2αymin. Since at most 4(1+)3(α−2)y0 (reactions (0’x) and (0’y))B molecules are produced and at most 2(1+)3(α−2)y0 (reaction (0’x)) Ymolecules are consumed, we have:bmax ≤ b0 + 4(1 + )3(α− 2)y0, and ymin ≥ y0 −2(1 + )3(α− 2)y0which lead to bmax ≤ 2αymin for all (α− 1)y0 < b0 ≤ αy0.case(ii) be ≤ αye. We note that, at the end of the interval I, at least y0−4(1+)3(α−2)y0 (reactions (1’) and (2’)) B molecules are also consumed.Therefor, we getbe ≤ b0 + 8(1 + )3(α− 2)y0 − y0, and ye ≥ y0 −2(1 + )3(α− 2)y0which confirm be ≤ αye.Lemma 4.3.8. With high probability, y0 reaction events of interval I dis-cussed in Lemma 4.3.3 happens in Θ(n) interactions.444.3. Approximate Majority Using Bi-molecular ReactionsProof. Using Lemma 4.2.12, it is sufficient to show that the total propensityof Single-B reactions over the course of interval I is Ω(y0). Among l = y0reaction events, the number of B and Y molecules can decrease by at most l,so y ≥ y0−l. On the other hand, since x ≥ y, we have n = x+y+b ≤ x+x+bleading to x+b/2 ≥ n/2 throughout this interval. Putting these observationstogether, we obtain:ρ(c,R) =xy + xb+ ybn/2≥ y(x+ b/2)n/2≥ y0 − l = Ω(y0).Lemma 4.3.9. At any point in the computations, assuming that the xˆ− yˆ ≥∆/2, the probability that xˆ− yˆ increases is at least 1/2 + Θ(∆/n).Proof. Let p denote the probability of a success (xˆ − yˆ increases) and qdenote the probability of a failure (xˆ− yˆ increases). We note that reactions(0’y) and (2’) decrease the gap between xˆ and yˆ, and reactions (0’x) and(1’) increase the gap. So, given that x ≤ n, and y/292 < b, we can computethe probability that xˆ− yˆ increases on a reaction event as follows.1)q = P[xˆ− yˆ decreases]p = P[xˆ− yˆ increases] =1/2xy + yb1/2xy + xb≤ 1− (xˆ− yˆ)b1/2xy + xb≤ 1− (∆/2)bx(1/2y + b)≤ 1−Θ(∆/n),2) q + p = 1So, with a simple calculation, equations 1 and 2 result in p ≥ 1/2+Θ(∆/4n).Lemma 4.3.10. At any point in the computation, if yˆ = n/k then theprobability that yˆ > 2n/k at some subsequent point in the computation isless than (1−Θ(1))n/k.Proof. Let p denote the probability of a success (yˆ decreases) and q denotethe probability of a failure ( yˆ increases). We note that reactions (0’y) and(2’) increase yˆ, and reactions (0’x) and (1’) decrease it. So, assuming thatx ≤ n, xˆ− yˆ ≥ n− n/4k, and y < 292b, we can compute the ratio q/p on a454.4. Experimental Resultsreaction event as follows.1)q = P[yˆ increases]p = P[yˆ decreases]=1/2xy + yb1/2xy + xb≤ 1− (xˆ− yˆ)b1/2xy + xb≤ 1− (n− 4n/k)bn(1/2y + b)≤ 1−Θ(1)By Lemma 4.1.1, we conclude that reaching a configuration where y >2n/k (which entails an excess of n/k failures to successes) is less than (1−Θ(1))n/k.Lemma 4.3.11. At any point in the computation, if yˆ = n/k then, assumingthat yˆ never increases to 2n/k, the probability that yˆ decreases to n/k − rwithin f(n) > Θ(r) reaction events is at least 1− exp(−Θ(f(n)).Proof. The proof is completely parallel to the proof of Lemma 4.2.9. We onlyneed to compute the probability of a success (yˆ decrease). By Lemma 4.3.10,q/p = 1 − Θ(1). So, considering p + q = 1, it’s straightforward to obtainp ≥ 12 + Θ(1).4.4 Experimental ResultsFigure 4.1 illustrates the progress of computations of each of our CRNs ineach of the three phases, on a single run. In the first phase, the gap x−y (redline) increases steadily. Once the gap is sufficiently high, phase 2 starts andthe count of y for the tri-molecular CRN, and yˆ for the bi-molecular CRNs,decrease steadily. In the last phase, as the counts of y and yˆ are small, thereis more noise in the evolution of y and yˆ, but they do reach 0. The rate ofconvergence is faster for Double-B than Single-B, stemming from the factthat blanks are produced at double the rate. Figure 4.2 shows how expectedtime and probability of correctness (success rate) increase as a function of thevolume n, for each of the CRNs. A fit to the data of that figure shows thatthe expected times of the tri-molecular, Double-B and Single-B CRNs growas 3.4 lnn, 2.4 lnn, and 4.0 lnn respectively10. For n ≥ 100, the tri-molecularCRN has at least 99% probability of correctness and the bi-molecular CRNshave at least 97% percent probability of correctness. These probabilities allapproach 1 as n gets larger.10We note that the x-axis is in the logarithmic scale in Figure 4.2(a). So, to computethe growing rates, we only need to find the slope a of each set of data points according toequation y ≈ a lnx.464.4. Experimental Results(a)(b)(c)Figure 4.1: The gap x− y (red line) and minority (count y for tri-molecularCRN and yˆ for bi-molecular CRNs) (blue line), as a function of time, ofsample runs of the (a) tri-molecular, (b) Double-B, and (c) Single-B CRNs.The initial count is n = 106, the initial gap x−y is 2√n lg n and parameterscγ , dγ and eγ are set to 2, 8, and 2 respectively. The vertical dotted linesdemark the transition between phases 1, 2 and 3.474.4. Experimental Results 10 15 20 25 30 35 40 45 50 55 60 100 1000 10000 100000 1e+06timevolumeSingle BDouble BTrimolecular(a) 0.97 0.975 0.98 0.985 0.99 0.995 1 1.005 100 1000 10000 100000 1e+06success ratevolumeSingle BDouble BTrimolecular(b)Figure 4.2: Comparison of the time (a) and success rate, i.e., probabilityof correctness (b) of Single-B, Double-B and the tri-molecular CRN for Ap-proximate Majority. Each point in the plot is an average over 5,000 trials.The initial configuration has no B’s and the imbalance between X’s and Y ’sis√n lnn. Plots show 99% confidence intervals.484.5. Application4.5 ApplicationAs one application of our analysis methods, we show how to resolve a conjec-ture of Mertzios et al. [74] for the Single-B CRN (when initially x+ y = n):Conjecture: For any constant > 0, if initially x− y = n then the CRNreaches consensus on X-majority with probability at least 1− exp(−Θ(n)).Mertzios et al. [74] proved that the conjecture holds for Single-B when > 5n/7. We first argue that Mertzios et al.’s conjecture holds for thetri-molecular CRN, for any > 0. As in previous sections, we model theevolution of x − y as a sequence of stages, where the ith stage starts whenx− y ≥ 2in for the first time, and ends when x− y = 2i+1n. Lemma 4.2.2shows that, once the ith stage starts, the probability that x − y reducesto 2i−1n is at most exp(−Θ(n)). Lemma 4.2.7 shows that, assuming thatx−y does not reduce to 2i−1n, the stage does indeed end, i.e. x−y reaches2i+1n, with probability 1 − exp(−Θ(n)). In these applications of Lemmas4.2.2 and 4.2.7, the constant in the Θ depends on . Since consensus toX-majority is reached in O(lg n) stages, the overall success probability is atleast 1− exp(−Θ(n)).To prove Mertzios et al.’s conjecture for Single-B, we show that if x−y =∆ ≥ n (and x+ y + b = n) then for some sufficiently large constant k thatdepends on , the probability that x − y increases to min{2∆, n} withinkn reaction events is at least 1 − exp(−Θ(n)) (where the constant in theΘ depends linearly on 2/k). From this property, the conjecture follows bymodeling the evolution of x − y as a sequence of stages exactly as for thetri-molecular CRN.Let #(0’x), #(0’y), #(1’) and #(2’) be random variables denoting thenumber of reactions of type (0’x), (0’y), (1’) and (2’) respectively at anypoint in the sequence of kn reaction events. Azuma’s inequality 4.1.3 tells usthat throughout the computation, with probability at least 1−exp(−Θ(n)),|#(0’x)−#(0’y)| ≤ ∆/6. (4.5.1)So we’ll assume in what follows that (4.5.1) holds. Then at any point in thecomputation,x− y = ∆ + (#(1’)−#(2’)) + (#(0’x)−#(0’y)) ≥ 5∆/6 + (#(1’)−#(2’)).A slight variant of Lemma 4.2.2 shows that the random walk #(1’)−#(2’)reduces to −2∆/3 at some point in the computation with probability atmost exp(−Θ(n)). Similarly, a variant of Lemma 4.2.7 then tells us that,494.6. Conclusionassuming that #(1’)−#(2’) never reduces to −2∆/3, and thus x− y neverreduces to ∆/2, with probability at least 1 − exp(−Θ(n)), x − y increasesto min{2∆, n} within k′n reactions of type #(1’) or #(2’), for a sufficientlylarge constant k′. To complete the argument, note that any sequence ofkn = 2k′n+n reaction events of Single-B must contain at least k′n reactionsof types #(1’) or #(2’). Otherwise the reaction sequence would contain atleast k′n+ n+ 1 reactions of types #(0’x) and #(0’y), and these reactionswould consume k′n + n + 1 X’s and Y ’s in total. This is more than thesum of the number of X’s and Y ’s present at the start of the sequence ofreactions (at most n) plus the number of X’s and Y ’s produced during thesequence of reactions (at most k′n), a contradiction.4.6 ConclusionAngluin, Aspnes, and Eisenstat proposed a simple population protocol forApproximate Majority and proved correctness and O(log n) time efficiencywith high probability, given an initial gap of size ω(√n log n) when the to-tal molecular count in the mixture is n. Motivated by their intriguing butcomplicated proof, we have provided simpler, more intuitive proofs of cor-rectness and efficiency for three different CRNs for Approximate Majority.Key to our approach has been to first analyze a tri-molecular CRN with justtwo reactions and two species. We have then showed how two bi-molecularCRNs, including that of Angluin et al., are essentially simulations of thetri-molecular CRN. Our results improve on those of Angluin et al. in thatthey hold even with an initial gap of Ω(√n log2 n). Our analysis approach,which leverages the simplicity of a tri-molecular CRN to ultimately reasonabout bi-molecular CRNs, may be also useful in analyzing other CRNs.50Chapter 5Summary and Future Work5.1 SummaryIn Chapter 3 we have proposed a new design to deterministically computesemilinear functions with CRNs using no leader species. We were motivatedby the intriguing question of Chen et al. [21] who asked about the possibil-ity of such computations. Similar to Chen et al.’s framework [21], we havealso decomposed the semilinear function into a finite union of affine partialfunctions. We have then provided leaderless CRNs to compute each affinefunction. Our CRN construction differs from the affine-function computingCRNs of Chen et al. [21] in that we only use the input species (and noleader species) to compute the offset and coefficients of each affine partialfunction. Lemma 3.2.1, is in fact our primary technical contribution. Next,in order to decide which affine function should be applied to a given input,we have employed the leaderless semilinear predicate computation of An-gluin et al. [11]; this latter part of the construction is actually identical tothe construction of Chen et al. [21], but we have included it because ourtime analysis is different. Assuming n is the number of molecules presentinitially, we have proved that our construction ends in expected time O(n)which is faster than the O(n log n) expected time bound on the direct con-struction (with the use of leaders) of Chen et al. [21], but slower than theO(log5 n) expected time bound on the fast construction of Chen et al. [21]which relied heavily on the use of a fast, error-prone CRN that computesarbitrary computable functions, and which crucially uses a leader.In Chapter 4, first we have analyzed our tri-molecular CRN, shown inFigure 1.2(a) which computes Approximate Majority. We have studied theCRN in three phases. Recall that x and y denote the number of copies of Xand Y during a CRN computation. In the first phase we have modeled theevolution of the gap x−y as a sequence of random walks with increasing biasof success (i.e., increase in x− y). Similarly, in the second phase we modelthe evolution of the count of y as a sequence of random walks with increasingbias of success (decrease in y). We have used a simple biased random walkanalysis to show that these walks make forward progress with high probabil-515.2. Future Workity, thereby ensuring correctness. To show efficiency of each random walk,we have modeled it as a sequence of independent trials, observed a naturallower bound on the probability of progress, and applied Chernoff bounds.In the third and last phase we have modeled the “end game” as y decreasesfrom Θ(log n) to 0, and applied the random walk analysis and Chernoffbounds a final time to show correctness and efficiency, respectively. For theDouble-B CRN, we have showed that with high probability, after a shortinitial start-up period and continuing almost until consensus is reached, thenumber of B’s (blanks) is at least proportional to y and is at most n/2, inwhich case reaction events are reactions (1’) or (2’) with probability Ω(1).Moreover, blanks are in a natural sense a proxy for X + Y (an interactionbetween X and Y ), and so reactions (1’) and (2’) behave exactly like the tworeactions of our tri-molecular CRN. Essentially the same argument appliesto Single-B. However, we have been also able to provide a tighter upperbound (i.e., proportional to y) on the number of blanks. We didn’t concernourselves with smaller initial gaps. But note that even with no initial gapwe can still expect efficiency, since the expected number of interactions untila gap of√n log n is reached is O(n log n). This would be true even if therewere no bias in favour of reaction (1’) as x, the majority species, increases.We suspect that the complexity of Angluin et al.’s proof stems from the casewhen the initial gap is small (O(√n log n)), and the fact that they show ef-ficiency with high probability, rather than expected efficiency for the caseswith small enough gaps.5.2 Future WorkWe discuss our future work pertaining to the first part of this thesis.5.2.1 Deterministic Computation With CRNsWe identify two general directions for future work in the context of deter-ministic computations.Time Complexity The clearest shortcoming of our leaderless CRC, com-pared to the leader-employing CRC of Chen et al. [21], is its time complexity.Our CRC takes expected time O(n) to complete with n input molecules, ver-sus O(log5 n) for the CRC of Theorem 4.4 of Chen et al. [21]. However, wedo obtain a modest speedup (O(n) versus O(n log n)), compared to the directconstruction of Theorem 4.1 of Chen et al. [21]. The indirect construction of525.2. Future WorkTheorem 4.1 of Chen et al. [21] relied heavily on the use of a fast, error-proneCRN that computes arbitrary computable functions crucially using a leader.The major open question is, for each semilinear function f : Nk → Nl, isthere a leaderless CRC that stably computes f on input of size n in expectedtime t(n), where t is a sublinear function? Belleville et al. [16] very recentlyshowed that, a wide range of semilinear functions and predicates, satisfy-ing some conditions, require linear time to be deterministically computed.However, the optimal computing time of the other semilinear functions andpredicates not satisfying those conditions, still remains as an open question.If this is not possible for all semilinear functions, another interesting openquestion is to precisely characterize the class of functions that can be stablycomputed by a leaderless CRC in polylogarithmic time. For example, theclass of linear functions with positive integer coefficients (e.g., f(x1, x2) =3x1 + 2x2) has this property since they are computable by O(log n)-timeunimolecular reactions such as X1 → 3Y,X2 → 2Y . However, most ofthe CRN programming techniques used to generalize beyond such functionsseem to require some bimolecular reaction A+B → . . . in which it is possibleto have #A = #B = 1, making the expected time at least n just for thisreaction.Tolerance to Imprecise Inputs Despite the fact that removing the as-sumption of initial leader species makes the model more realistic, it remainsan unrealistic aspect of the model. We have assumed the ability to preparean initial state with precisely specified counts of input molecules. It is cer-tainly equally as difficult to prepare a solution with 999,999 molecules of X,ensuring that the solution does not contain 1,000,000 molecules, as to ensurethat the solution contains 1 molecule of leader L and not 2. However, it isnot clear how to properly formalize the question, “What computations canCRNs do when initial states can only be approximately specified?” If weimagined, for instance, being able to prepare counts only to within k bitsof precision for some constant k, then at most 2k different values of a giveninput could be specified.Rather than discussing errors of specification and approximate initialcounts, there is an alternative way to formalize the idea that with largeamounts of molecules, we lose control over individual counts. This is touse the continuous (mass-action) model, in which the amount of a speciesis given by a nonnegative real number indicating its average count per unitvolume in an infinite volume solution. Even with the ability to specify aprecise initial state (vector of real-valued concentrations), without control535.2. Future Workover the rates of reactions, only continuous piecewise linear functions can becomputed [22]. Because these functions are continuous (in fact, uniformlycontinuous, i.e., rates of change of the output with respect to input arebounded by a constant), a small error in specifying an initial state provablyleads only to a small error in reporting the output. Therefore, continuousCRNs are in a sense already robust to imprecise inputs, merely because theycan only compute functions that are naturally “error-tolerant”.Contrast this to the case of the discrete model and the semilinear func-tions they compute, such as the function f(n) = n if n is even and f(n) = 0otherwise. Here, a small change in the input causes an arbitrarily largechange in the correct output, and hence an arbitrarily large error in thereported output if the initial state is specified incorrectly. Thus, given thetheoretical ability of discrete CRNs to compute functions lacking the naturalrobustness of continuous functions to small errors in inputs, it remains anopen problem to formalize how such a CRN might compute such nonrobustfunctions in a robust way.5.2.2 Approximate MajorityThere are several ways in which we can extend our results. Angluin et al. [12]analyze settings in which (i) some agents (molecules) have Byzantine, i.e.,adversarial, behaviour upon interactions with others, (ii) some molecules are“activated” (become eligible for reaction) by epidemic spread of a signal,and (iii) there are three or more species present initially and the goal is toreach consensus on the most populous species (multi-valued consensus). Webelieve that our techniques can be generalized to these settings.There are other ways in which we might generalize our results, moti-vated by practicalities of molecular systems. When a CRN is “compiled”to a DNA strand displacement system, it may be that the DNA stranddisplacement reaction rate constants closely approximate, but are not ex-actly equal to, the CRN reaction rates. It could be helpful to describe howthe initial gap needed to guarantee correct and efficient computations forApproximate Majority with high probability depends on the uncertainty inthe rate constants. It could also be useful to analyze variants of the CRNsanalyzed here, or other CRNs, in which some or all of the reactions arereversible. For example, if the blank-producing reaction (0’) of Double-B isreversible, the CRN appears to still be both correct and efficient, while hav-ing the additional nice property that stable B-majority is no longer possible.Again, we believe that our analyses can generalize to these scenarios.Another interesting problem to investigate, is the application of Approx-545.2. Future Workimate Majority in solving and analyzing the fast probabilistic leader electionproblem using CRNs, i.e., in volume n with an initial configuration withn copies of species X and no other species initially present, produce a sin-gle copy of a species L in time O(log n) with high probability. We havesuggested such a leader election CRN using Approximate Majority and ourexperimental results are very promising. However, the proof of its correct-ness and efficiency is still very challenging. Our hope is that our new analysisof Approximate Majority would be a big step toward simplifying the analysisof our proposed leader election CRN.55Part IIOn Prediction of NucleicAcid Folding Pathways andStructures56Chapter 6IntroductionRNA and DNA are both nucleic acids; the role of DNA in the cell is toserve as an information storage channel available to cellular computing net-works [93]. RNA molecules are involved in many cellular functions, includ-ing DNA transcription into mRNA, translation of RNA into proteins bytRNA [61], control of the plasmid copy number in Escherichia coli [41] andgene regulation via mechanisms that degrade mRNA [13]. The possiblesplice isoforms of the mRNA transcript are also partly regulated by RNAmolecules [103]. In order to determine RNA/DNA functions, it is an im-portant step to recognize their structures [4, 80]. Moreover, nucleic acidsequences are the basis of DNA computing and molecular programming forconstruction of nano-devices such as DNA origami [40, 48] and DNA stranddisplacement (DSD) systems [87]. As dicussed in the first part of this thesis,DSDs are also promising components to pysically implement even hypothet-ical chemical reaction networks [95], which abstract details about displace-ments. Therefore, identifying RNA and DNA structures is also a funda-mental step in the design, programming, and verification of these systems.Determining the nucleic acid structure using experimental methods, such asNMR and X-ray crystallography, is expensive, time-consuming and in somecases impractical [37]. Therefore, computational methods are widely usedto help understand the structure and function of DNA and RNA molecules.Nucleic acid folding pathways describe how RNA and DNA moleculesfold in on themselves via intra-molecular interactions. RNA/DNA moleculesdynamically move through a sequence of intermediate structures, when fold-ing into their functional structures, i.e., three-dimensional shapes. In somecases these intermediate structures also contribute to the biological func-tion of the molecule. For example, the function of the flavinmononucleotideriboswitch depends on this molecule’s ability to change its structure [106].In other examples, a molecule may be bistable, i.e., have two stable func-tional structures [39, 106]. The folding pathways of DNA molecules canhelp determine gene transcription rates or control the DNA strand displace-ment kinetics [112]. In fact, DSDs are designed so that a sequence of stranddisplacements follows a DNA folding pathway.576.1. Background on Nucleic Acids: RNA/DNA MoleculesWhile it would be ideal to have the full description of nucleic acid foldingdynamics, nucleic acid secondary structure – a set of bonds formed betweennucleotides in an RNA/DNA molecule (see Section 6.1) – already providesa level of description that yields much insight into its thermodynamics andfolding kinetics [100, 112]. Therefore, we also focus on nucleic acid secondarystructure in our research.In recent years, computational methods for nucleic acid folding path-ways simulation have received increasing attention, since they can be veryhelpful in designing nano-scale machines that have potential health applica-tions [104, 112] and they can also provide significant insights into RNA/DNAfolding dynamics. For example, there is a designed RNA that can detecta cancer mutation and activate the cell death pathway [104]. In anotherexample, designed RNAs were used to map simultaneous RNA expressionpatterns in intact biological samples [27].Motivated by all of these roles of nucleic acid structures and foldingpathways, in this part of the thesis, we aim to contribute to computationalmethods helpful for improving the folding pathway and structure predictionof nucleic acids. In the remainder of this chapter, we first provide somebackground on nucleic acids. We then describe related work on nucleic acidfolding pathway and structure prediction and introduce the problems thatare our focus in sections 6.2 and 6.3 respectively. We continue to summarizeour objectives and contributions in Section 6.4 and provide an outline of theremaining chapters in Section 6.5.6.1 Background on Nucleic Acids: RNA/DNAMoleculesA single DNA or RNA strand is a sequence of nucleotide bases, which werepresent using the character set {A, C, G, T} or {A, C, G, U} respectively11,with the left end of the sequence corresponding to the 5′ end of the strandand the right end corresponding to the 3′ end. Here and throughout, let ndenote the length of RNA/DNA sequences. If sequences are indexed consec-utively starting from 1, we can represent a base pair as a tuple (i, j), suchthat i < j − 1, which specifies that the base at position i in the sequenceis paired with the base at position j (where j is not consecutive with i).A secondary structure is simply a matching of strand positions that agreeswith the Watson-Crick base pairs [32], namely C–G, A–T for DNA and C–G,11The letters A, C, G, U, and Tstand for Adenine, Cytosine, Guanine, Uracil, and Thyminenucleotides respectively.586.2. Nucleic Acid Folding PathwaysA–U for RNA where no position is in two pairs. That is, if (i, j) and (i′, j′)are in the structure then i, j, i′ and j′ are all distinct. We note that wobblepairs, i.e., non-Watson-Crick G–U pairs, are also frequent in RNA structuresand we consider them in our analysis of RNA kinetics. The secondary struc-ture is pseudoknot-free, if no base pairs cross, and more formally, if neitheri < i′ < j < j′ nor i′ < i < j′ < j, for all tuples (i, j) and (i′, j′). If asecondary structure is not pseudoknot-free, it is said to have a pseudoknot.Figure 6.1 shows planar representations of two possible pseudoknot-free sec-ondary structures for an RNA sequence of 23 nucleotides. From now on,when we use structure, we refer exclusively to secondary structure.Each secondary structure is composed of motifs, which have an associ-ated free energy value. The free energy of a secondary structure then iscalculated as the sum of its motifs’ energies.At first, the free energy of a secondary structure was computed for asimple “base pair” energy model [77, 78, 114] in which the free energy ofthe structure was only dependent on the number and types of its base pairs.However, since the mid-1990s, more sophisticated thermodynamic energymodels have been developed that account for entropic loop penalties, stackedpairs and other structural motifs [64, 68–70]. For example, in Figure 6.1,the energy of the right structure is −8.70 kcal/mol, using the Turner energymodel [68].RNA/DNA molecules tend to fold into their minimum free energy (MFE)structure at equilibrium [101]. A nucleic acid partition function is also de-fined as a sum of the free energy over all possible structures12.6.2 Nucleic Acid Folding PathwaysWe start this section with an informal description of nucleic acid foldingpathways and population kinetics. We then briefly discuss available meth-ods and challenges in computing the population kinetics. The readers canfind the formal and detailed description of population kinetics estimation inChapter 7.Throughout, we only discuss RNA sequences as our arguments about DNAfolding pathways follow the same principle.A folding pathway, from an initial structure to a final one, is defined as asequence of secondary structures where each successive secondary structurediffers from the previous one by a single base pair. Figure 6.2 shows a12The partition function algorithm proposed by Mathews [69] can also predict a MFEstructure with highly probable base pairs.596.2. Nucleic Acid Folding PathwaysFigure 6.1: Two RNA secondary structures drawn using NUPACK [110].These are two out of the many possible secondary structures.13-05-24 2:16 PMPage 1 of 1file:///Users/Monir/Desktop/Docs/Folding-kinetics/Folding_models/SMC_project/Results/HIV23_IS_path/0.svg13-05-24 2:17 PMPage 1 of 1file:///Users/Monir/Desktop/Docs/Folding-kinetics/Folding_models/SMC_project/Results/HIV23_IS_path/1.svg13-05-24 2:18 PMPage 1 of 1file:///Users/Monir/Desktop/Docs/Folding-kinetics/Folding_models/SMC_project/Results/HIV23_IS_path/2.svg13-05-24 2:18 PMPage 1 of 1file:///Users/Monir/Desktop/Docs/Folding-kinetics/Folding_models/SMC_project/Results/HIV23_IS_path/3.svg13-05-24 2:18 PMPage 1 of 1file:///Users/Monir/Desktop/Docs/Folding-kinetics/Folding_models/SMC_project/Results/HIV23_IS_path/4.svg13-05-24 2:19 PMPage 1 of 1file:///Users/Monir/Desktop/Docs/Folding-kinetics/Folding_models/SMC_project/Results/HIV23_IS_path/5.svg13-05-24 2:19 PMPage 1 of 1file:///Users/Monir/Desktop/Docs/Folding-kinetics/Folding_models/SMC_project/Results/HIV23_IS_path/6.svg13-05-24 2:19 PMPage 1 of 1file:///Users/Monir/Desktop/Docs/Folding-kinetics/Folding_models/SMC_project/Results/HIV23_IS_path/7.svg13-05-24 2:19 PMPage 1 of 1file:///Users/Monir/Desktop/Docs/Folding-kinetics/Folding_models/SMC_project/Results/HIV23_IS_path/8.svgFigure 6.2: An example of a folding pathway for the RNA sequence inFigure 6.1. Each structure has been drawn using NUPACK [110].folding pathway for the RNA sequence represented in Figure 6.1 startingfrom its unfolded structure and ending at its minimum free energy (MFE)structure. If the holding times, i.e., the elapsed times in transitioning fromone structure to the next, are also included in a folding pathway, the obtainedsequence is called a folding pathway trajectory. If we were to imagine a verylarge number of copies of the same RNA, say all initially in the unfoldedstate, and observe a folding pathway for each copy, then we could computethe fraction of copies occupying each secondary structure at each time point.The vector of these fractions over time is the approximate population kineticsof the RNA molecule (with respect to the initial unfolded state). For a giventime point, the exact population kinetics are essentially the diffusion limitof the folding pathway simulation at that time point, i.e., when the numberof copies of the RNA is taken to infinity [76].RNA population kinetics were initially researched as a way to enhanceprediction of the functional secondary structure of RNA molecules [2, 67, 96].Structure prediction could be improved by predicting structures that arevery common in the population kinetics but that are not necessarily the606.2. Nucleic Acid Folding Pathwaysthermodynamically most favourable, i.e., MFE structure [88]. Consideringthe dependency between the quality of MFE-based secondary structure pre-diction and the quality of the thermodynamic energy parameters used, itis also the case that the quality of population kinetics prediction dependson the kinetic rates used by the predictive model. Work that relies on orpredicts these rates are grouped into two broad categories: 1) work at ther-modynamic equilibrium without force acting on the system [23, 53, 112] and2) work at a thermodynamic non-equilibrium where a force is acting on thesystem [66, 102].Methods such as the Multistrand Simulator [90] and Kinfold [39] can sim-ulate folding pathways of RNA/DNA molecules in the examples describedas our motivations. For a given input RNA molecule and initial structure,these methods model the RNA folding process by continuous-time Markovchains (CTMCs). A model of CTMC can be thought of as a timed randomwalk on a directed graph. There are probabilities of transition associatedwith the edges of the graph, and the holding time is exponentially dis-tributed with a rate depending on the current node. To approach inferencein CTMCs representing RNA folding process, these two methods performMonte-Carlo simulations (or execute the Doob-Gillespie algorithm [42]) offolding pathway trajectories, where the underlying state space is the set ofall possible secondary structures for the input molecule and kinetic ratesdetermine holding times, and transition probabilities from one secondarystructure to another. Since it seems quite difficult to obtain the necessarykinetic rates experimentally, in these simulators the kinetic transition ratesare derived from nearest neighbor thermodynamic parameters such as theTurner energy model [68].The simulation model, Kinfold, of Flamm et al. [39], has all secondarystructures of an input sequence in its state space, and we refer to it as the fullmodel. The full model can also be used to infer exact or approximate popu-lation kinetics (see Section 7.1). However, inferring the population kineticsfrom the full model is slow, in part because of the size of the underlyingstate space. For an RNA sequence of length n, there can be as many as(n2k)ways to create a secondary structure of k base pairs, and thus O(3n) possiblesecondary structures. This means that the size of the state space for thefull model will be exponentially related to the length of RNA, which evenrenders the Monte Carlo simulations (or the Doob-Gillespie algorithm [42]),non-practical for longer RNAs (e.g., with length > 30) or multi-strandedRNA sequences (i.e., an RNA sequence composed of multiple interactingstrands). This difficulty prompted us to investigate an alternative for theclassic Monte Carlo methods in order to efficiently estimate the population616.3. Nucleic Acid Secondary Structure Predictionkinetics of longer RNAs. We discuss our objectives and contributions onthis matter in Section 6.4.1.6.3 Nucleic Acid Secondary Structure PredictionAs explained earlier, computational methods are widely used to obtain abetter understanding of the structure and function of nucleic acids. In thisarea, a central challenge has been reliable prediction of nucleic acid sec-ondary structure. In both biological and molecular computing contexts,thermodynamic analyses are widely used to predict secondary structures.Much work has focused on prediction of pseudoknot-free secondary struc-tures, since such structures are common in both biological and designedsystems and since pseudoknot-free structures are easier to handle algorith-mically [54, 63, 71]. Here and throughout, we consider a method to beefficient if its running time is bounded by a fixed-degree polynomial in thetotal length of the strands in the multi-set.In what follows, we briefly summarize significant contributions on the de-velopment of algorithms for predicting the pseudoknot-free secondary struc-ture of a single nucleic acid strand, or of multiple interacting strands. Ta-ble 6.1 also presents a summary of the time complexity of pseudoknot-freesecondary structure and partition function prediction (our contribution inthis thesis is shown in bold). When the input has multiple strands, we sep-arate the cases where the number of strands is bounded by a fixed constantc, and when the number of strands is unbounded, i.e., can grow with theinput size.For single strands with length n, O(n3) dynamic programming algo-rithms have long been used to efficiently predict minimum free energy (MFE)pseudoknot-free secondary structures, first for a simple base pair model andlater for more sophisticated energy models [77, 78, 114]. However, very re-cently, Bringmann et al. [17] proposed a truly sub-cubic algorithm to predictMFE secondary structures for a simple base pair energy model. Dynamicprogramming methods can also be used to efficiently calculate the partitionfunction for a given strand, making it possible to compute the probabilityof base pair formation in equilibrium [72].In addition to prediction of secondary structure of single strands, therehas also been much interest in prediction of complexes that result when basepairs form between two or more strands. Such predictions can be used tounderstand the affinity of binding between a nucleic acid oligonucleotide andits potential target in biological processes such as RNA interference [107].626.3. Nucleic Acid Secondary Structure PredictionInput Type MFE Partition FunctionSingle Strand O(n3) [77, 78, 114] O(n3) [72]Multiple Strands, Bounded (≤ c) ? O(n3(c− 1)!) [32]Multiple Strands, Unbounded APX-hard ?Table 6.1: Computational complexity of predicting nucleic acid MFEpseudoknot-free secondary structures and partition functions, when the in-put is a single strand, multiple strands with a constant bound c on thenumber of strands, and multiple strands where the number of strands cangrow with the input length n. In each case, n is the total number of basesin the input strand(s). We note that, for a single strand, a new work byBringmann et al. [17] presents an exact sub-cubic algorithm using a simplebase pair model. The bold term shows our contribution and the questionmarks show that the complexity of the corresponding problems is as yetunresolved.Prediction of multi-stranded secondary structures is also important, becausemethods for biomolecular programming and construction of nano-devices,such as algorithmic self-assembly, DNA strand displacement systems andDNA origami, are based on the formation of such complexes [40, 87]; pre-diction methods, such as that provided by NUPACK, [111] can guide thedesign of such programs and devices and be very useful in physical imple-mentation of chemical reaction networks studied in part I of the thesis.An energy model for single-stranded secondary structure formation canbe extended to obtain a model for multi-stranded complex formation by (i)charging an additional strand association penalty, typically a constant timesthe number of strands involved in the complex, and (ii) accounting for ro-tational symmetries [32] . We explain a simple extension of the base pairenergy model to operate on multi-stranded nucleic acids in Section 8.1.1.Predicting MFE pseudoknot-free secondary structures formed from two (orany constant number) of strands with respect to a model that only accountsfor strand association penalties is a straightforward extension of dynamicprogramming algorithms for single strands [8, 109, 113]. However, it isnot clear how such algorithms can efficiently account for rotational sym-metries that can arise when two or more indistinguishable strands interact[32]. Nevertheless, Dirks et al. [32] showed how to efficiently calculate thepartition function for a constant number of interacting molecules that formpseudoknot-free structures, by showing how rotational symmetry could be636.4. Objectives and Contributionsaccounted for, while simultaneously addressing algorithmic overcounting is-sues that arise in partition function calculation. However, the partitionfunction calculation method of Dirks et al. requires a separate dynamicprogramming computation on all possible orderings of strands that interactto form a single complex. As a result, the method does not run in polyno-mial time when the number of participating strands grows with the overallinput size (total length of strands). This situation can arise, for example, inDNA strand displacement systems. Also, surprisingly, while the partitionfunction for a constant number of interacting strands can be calculated effi-ciently, it is not known how to efficiently calculate the MFE pseudoknot-freesecondary structure of a constant number of interacting strands.In summary, exact and efficient dynamic programming algorithms are wellknown for predicting the MFE pseudoknot-free secondary structure of a sin-gle nucleic acid strand. However, all the available methods for computingMFE pseduoknot-free secondary structure formed from a set of nucleic acidstrands are either heuristic and therefore are not guaranteed to find the opti-mum structure or require an exponential runtime to find the exact solution.Thus, we are motivated to answer the following basic question that has re-mained open after over three decades of work on computational pseudoknot-free secondary structure prediction of nucleic acids: given a set of nucleicacid strands, is it possible to efficiently compute a MFE pseudoknot-freesecondary structure that they can form? We discuss our objectives andcontributions in more detail in Section 6.4.2.6.4 Objectives and ContributionsNext, we give a summary of our objectives and contributions on the twodifferent problems discussed in this part of the thesis.6.4.1 Nucleic Acid Folding PathwaysTo enhance the functional structure prediction, i.e., common structures thatare not necessarily the MFE ones, of RNA molecules, it would be beneficialto have an accurate estimation of their population kinetics. RNA populationkinetics indeed provide information about the probability of reaching differ-ent secondary structures at a given time. Considering an RNA moleculewith a combinatorial state space, the computation of population kinetics,and hence probabilistic inference, would be difficult or impossible with theexisting methods. So, we first aim to find an approach that accuratelyand efficiently computes the probability that an RNA molecule beginning in646.4. Objectives and Contributionsone secondary structure, x, will transition in the given time, t, to a targetstructure, y. For this purpose, we focus on the computation of transitionprobabilities in general CTMCs having combinatorial state spaces, becausenot only this could approximate RNA population kinetics, but also it couldbe beneficial to many other applications as diverse as queueing theory, pylo-genetics, and chemical reaction networks [51, 75].Our contributions on thismatter are as follows.1. For CTMC problems with countably infinite states, where classicalmethods such as matrix exponentiation are not applicable, the mainsolution approach has been Markov chain Monte Carlo methods forboth the holding times and sequences of visited states. We propose amodified Monte Carlo method, where the holding times are marginal-ized analytically. Our method approaches inference in CTMCs withweak assumptions on the state space and can approximate transitionprobabilities as well as estimate CTMC parameters for this generalclass of processes.2. We confirm our results by conducting experiments on an importantexample of CTMCs with combinatorial state space (i.e., a set of allsecondary structures): nucleic acid folding pathways. We verify on realRNA sequences that our method outperforms the classic Monte Carloapproach for estimating the transition probabilities that marginalizeover folding pathways and provide a model for the kinetics of an RNAmolecule interacting chemically with itself.6.4.2 Multi-stranded Nucleic Acid Structure PredictionWe note that MFE secondary structure prediction is still of particular inter-est to researchers for a better understanding of the functional nucleic acidstructure. While efficient thermodynamics-based approaches are well knownfor prediction of pseudoknot-free secondary structures of single strands, theproblem of predicting pseudoknot-free secondary structures of multiple in-teracting strands is not studied well enough.Given a set of nucleic acid strands and a positive integer k, letMulti-Pkf-SSP be the problem of determining whether the strands canform a pseudoknot-free secondary structure with at least k base pairs.In this thesis, we mainly prove that for the base pair energy model,the Multi-Pkf-SSP problem is computationally intractable. We note thathardness results can be valuable even for the simple base pair energy model;656.4. Objectives and Contributionsit would seem unlikely that the prediction problem becomes easier if theenergy model is more sophisticated.Our contributions, in more detail, are as follows.1. We show that Multi-Pkf-SSP is NP-hard, meaning that the exis-tence of an polynomial time method for MFE pseudoknot-free sec-ondary structure prediction of a multi-set of strands would imply allproblems in the complexity class NP, which includes problems thatare widely believed to be intractable, would have polynomial time al-gorithms.2. Our hardness proof of Multi-Pkf-SSP uses a reduction from a vari-ant of 3-dimensional matching (3DM), already known to be NP-hard,and employs code word designs with high pairwise edit distance [91].We noticed an error in the proof of high pairwise distance betweenwords that were padded (i.e., expanded by inserting a character inregular positions) given by Schulman and Zukerman [91]. So, as an-other contribution we fixed the issue and will ask them to publish anerratum.3. Given the NP-hardness result above, we also prove that there is noefficient method to find a pseudoknot-free secondary structure whoseenergy is a close estimate of the energy of the MFE structure un-less NP 6= P. Specifically, if there is a polynomial time approximationscheme (PTAS) that could find a pseudoknot-free secondary structurewhose free energy closely approximates that of the MFE for any givenmulti-set of strands, then again NP = P. A PTAS is a polynomialtime algorithm that receives as input an instance of an optimizationproblem and an arbitrary parameter > 0, and returns an outputwhose value (in our case, the number of base pairs in the MFE struc-ture) is within a factor 1− of the value of the optimal solution. Therunning time of a PTAS could be dependent on , but it must be poly-nomial in the input size for every fixed . Formally, we show that theoptimization problem of finding the MFE structure for a multi-set ofnucleic acid strands is hard for the complexity class APX (shown as“APX-hard” in Table 6.1), the class of NP optimization problems thathave constant factor approximation algorithms.666.5. Outline6.5 OutlineWe address the formal definition of RNA folding models, population kinetics,continuous time Markov chains (CTMCs) and our efficient estimation ofCTMC transition probabilities in Chapter 7, Sections 7.1 and 7.2. We notethat we only describe our methodology with respect to a simple CTMCsetup that expresses RNA folding pathways. However, our algorithm canbe extended to other situations, where some or all the parameters of theCTMC are unknown, but as it is not directly related to the scope of thisdissertation we do not discuss them here13. Finally, in Section 7.3, weprovide the experimental results on approximating population kinetics ofsome real RNAs employing our new method.Chapter 8, Section 8.1 first provides preliminary definitions, problemstatements, and an overview of some useful theorems to resolve the com-putational complexity of the MFE structure prediction for multi-strandednucleic acid systems. Later, in Section 8.2, we outline the string proper-ties and designs required for our NP-hardness proof. We further provide apolynomial time reduction from a variant of 3DM to Multi-Pkf-SSP inSection 8.3, and prove its correctness in Section 8.4. In Section 8.5, we alsoinfer that an optimization version of the problem is hard for the complexityclass APX.13The extended version of our method can be found in the published manuscript [47].67Chapter 7Approximating Nucleic AcidPopulation KineticsIn this chapter, we first provide some background and formal definitionson RNA folding models. Equipped with this background, in Section 7.2,we present an effective method for approximating transition probabilities ingeneral CTMCs defined over combinatorial state spaces. In Section 7.3, wethen support the efficiency of our method by applying it to RNA foldingmodels, as a powerful example of CTMCs, to estimate the population kinet-ics of some real RNA molecules. Later, in Section 7.4, we give a summaryof our results and contributions.7.1 PreliminariesHere and throughout, we will restrict our attention to RNA pseudoknot-freesecondary structures, a large and important class of structures14.A folding pathway of m steps consists of a sequence of secondary struc-tures σ = i1, i2, ..., im where each successive secondary structure differs fromthe previous one by a single base pair, i.e. the Hamming distance betweenij−1 and ij is exactly 1 for all 2 ≤ j ≤ m. A folding pathway trajectory ofm steps is a sequence of tuples (s0, t0), (s1, t1), ..., (sm, tm) where si is thesecondary structure at time i and ti is the holding time for structure si. Thecollection of structures in a folding pathway or trajectory is not necessarilydistinct. In other words, one structure can appear more than once along thefolding pathway or trajectory.The free energy of a secondary structure i, E(i), is computed usingthe Turner energy model [68] and an O(n3) dynamic programming algo-rithm [72].14We note that all models and methods in this chapter can trivially be extended toDNA sequences. Moreover, they all can also trivially apply to the case of pseudoknots bysimply including psudoknotted structures in the model and using free energies of pseudo-knots to derive kinetic rates. However, our emperical results depend non-trivially on theassumption of pseudoknot-free structures.687.1. PreliminariesThe equilibrium probability that an RNA molecule will occupy a partic-ular secondary structure iski =e−E(i)/(ρτ)Zkwhere ρ is the gas constant, τ is the temperature, Zk =∑j∈U e−E(j)/(ρτ)is the partition function [72], and U is the set of all possible secondarystructures for the molecule. The vector k is a distribution since the sum ofthe ki’s is one, and is known as the Boltzmann distribution.For example τ = 310.15 K is roughly body temperature. The gas con-stant is ρ = 1.985× 103 kcal/(K mol), making ρτ = 6.1565× 105 kcal/mol.The gas constant is related to the Boltzmann constant as ρ = NAkB wherekB is the Boltzmann constant and NA is Avogadro’s number. Since ρτ and Ehave the same units (i.e., kcal/mol), this makes the Boltzmann distributionunit-less.7.1.1 RNA Folding ModelsFlamm et al. [39] introduced the RNA folding model that was inspired bychemical reaction systems modelled using continuous-time Markov chains(CTMCs). A CTMC is defined via its infinitesimal generator matrix, i.e.,an array of numbers representing the rate a CTMC moves between states,and an initial distribution over its states. Like discrete Markov chains, manyCTMCs have a stationary distribution.Assume that an RNA molecule is given and let U be the space ofpseudoknot-free secondary structures for that molecule. Let N(i) ⊂ U bethe neighborhood of i where {i} ∩N(i) = ∅ and symmetry is preserved (ifj ∈ N(i) then i ∈ N(j) for all i 6= j, i, j ∈ U). For the full model defineddirectly below, the neighborhood of i includes every structure that differsfrom i by a single base pair.We say that secondary structures i and j are connected if there is a pathof structures from i to j where each subsequent structure is a neighbor ofthe previous one, i.e. i = s1, s2, ..., s`−1, s` = j such that s` ∈ N(s`−1) for all`.The Full Model [39]. Kawasaki introduced an ‘inverse-symmetric’ tran-sition rule, i.e., one for which Kij = K−1ji for i 6= j. The Kawasaki transition697.1. Preliminariesrule gives the infinitesimal generator matrix for i 6= j asKij ={e(E(i)−E(j))/(2ρτ) if j ∈ N(i)0 if j /∈ N(i) (7.1.1)andKii = −∑j 6=iKij .The neighborhood of i is every structure that differs from i by exactly onebase pair. Notice that although Flamm et al. also included as neighborsa pair of structures that differ by shifted base pairs, we do not and onlyconsider basic and well-known kinetic moves (i.e., base pair removal andaddition). This means that every pair of secondary structures is connected.We use the ’full model’ to refer to the CTMT with infinitesimal matrix Kand with the initial state, x, being that distribution over the structures thatthe RNA molecule begins in. Typically x is the point-mass on either theopen-chain or the minimum free energy state. The concepts introduced inwhat follows apply to any other initial state distribution.Informally, the population kinetics at a given time t is a vector whosejth entry is the probability that the CTMT is in state (secondary structure)j at time t. Formally, for a CTMT with initial distribution x and infinites-imal generator matrix Q, the population kinetics is given by x exp(tQ). Inthe case that Q = K, as t → ∞ the population kinetics converges to thestationary distribution given by the Boltzmann distribution k (this followsfrom the fact that k exp(tK) = k, for all t ≥ 0⇔ kK = 0).The Subset Model [98]. The main challenge with obtaining the matrixK is that the size of the state space U is exponential in the number of basesin an RNA molecule. For this reason, some approximations take a subsetof connected structures S ⊆ U [75], such that the initial (unfolded) state isin S. The choice of S is an important decision and we briefly discuss it inChapter 9. We also follow the approach proposed by Kirkpatrick et al.[60]to verify the quality of the chosen set. Some models design a CTMC on thesubset of nodes S that converges to the Boltzmann distribution normalizedon the subset S of structures [98]. Next, we describe such a model.We now employ the Kawasaki rule as described in Tang et al. [98] toinfer the infinitesimal rate matrix L on the subset S.Lij ={Kij if i ∈ S and j ∈ N(i) ∩ S0 otherwise707.1. PreliminariesandLii = −∑j 6=iLij ≥ Kii,where K is the rate matrix in Equation 7.1.1. Notice that since Lii ≥ Kii theprocess L will have slower holding times on average than does the processK. More discussion of the holding times will follow.The process with infinitesimal rate matrix L converges to the distributionon S which is proportional to the Boltzmann distribution`i =e−E(i)/(ρτ)Z`where Z` =∑j∈S e−E(j)/(ρτ).7.1.2 Population Kinetic CalculationsThere are two ways to calculate the population kinetics of an RNA molecule:either from the transition probabilities of the corresponding CTMT or froma Monte Carlo simulation.For an arbitrary finite CTMC with infinitesimal generator matrix Q, thetransition probability PQ(t)ij is defined to be the probability that the CTMTtransitions to state j at time t, given that it is in state i. The correspondingmatrix of transition probabilities PQ(t) satisfies the ODEdPQ(t)dt= QPQ(t), t ≥ 0.This ODE can be solved using an ODE solver or matrix exponentiation (seeNorris [76] 2.1.1 for a proof):PQ(t) = exp(tQ) :=∞∑j=1(tQ)jj!. (7.1.2)From the matrix exponential, we can immediately obtain the exact popula-tion kinetics which are the vector x exp(tQ), where x is the initial distribu-tion of the CTMC.Computing Equation 7.1.2 can be done through matrix exponentiationmethods such as spectral decomposition or the eigenvalues using the Lanczosalgorithm [89]. However, these matrix exponentiation algorithms have expo-nential running time, since |U |, and thus the size of the matrix is exponential.Munsky and Khammash provide an alternative method for approximating717.2. Efficient Continuous-Time Markov Chain Estimationthe matrix exponential [75].A second way to compute the population kinetics is the Monte Carlosimulation of folding pathway trajectories from process Q. A single sam-ple of such a simulation tracks the progress of a single RNA molecule asit moves across the secondary structure state space U . Monte Carlo simu-lation [42] is accomplished using the exponential holding-time distributionand the embedded Markov chain [45]:(1) The holding time ti is exponentially distributed with parameter −Qii.(2) The embedded (discrete) Markov chain has transition matrix Hij =−Qij/Qii when Qii < 0 whereas Hii = 0 ∀i. The self-loop transitionprobability Hii is zero to represent that self-transitions are invisible incontinuous time.Initially si is distributed according to the initial distribution x. Steps(1)-(2) yield a single tuple, (si, ti), and a trajectory is made of m tuples, θ =(s0, t0), (s1, t1), ..., (sm, tm). Let Θ be the set of trajectories sampled. Giventhese |Θ| samples, approximate population kinetics at a given time t can becomputed by calculating the fraction of samples in which the RNA moleculeis in a given secondary structure at time t. Recall that the populationkinetics is a vector and its i’th component is approximated as(p˜(t))i = (1/|Θ|) ·∣∣∣∣∣∣sk = i| ∑j≤k−1tk ≤ t <∑j≤ktk∣∣∣∣∣∣The approximation p˜(t) converges to the exact population kinetics, x exp(tQ),as the number of samples grows [76]. This Monte Carlo simulation is theone performed by the RNA folding software Kinfold [39] with process K andwith both Metropolis-Hastings and Kawasaki versions of process K.7.2 Efficient Continuous-Time Markov ChainEstimationIn leveraging the modelling capabilities of CTMCs, the bottleneck is typi-cally the computation of the transition probabilities: the conditional proba-bility that a trajectory ends in a given end state, given a start state and atime interval. This computation involves the marginalization over the un-countable set of end point conditioned paths. We propose an efficient MonteCarlo method to approach inference in CTMCs with weak assumptions on727.2. Efficient Continuous-Time Markov Chain Estimationthe state space. Our method can approximate transition probabilities aswell as estimate CTMC parameters for this general class of processes. Moreprecisely, we are interested in countably infinite state space CTMCs thatsatisfy the following two criteria. First, we require the construction of a cer-tain type of potential on the state space. We describe this potential in moredetail in Section 7.2.1, and show in Section 7.3 that such potentials can beeasily constructed for RNA folding models, as a real example of CTMCs.Second, the CTMC should be explosion-free to avoid pathologies (i.e., it isrequired to have a finite number of transitions in any finite time intervalwith probability one).In contrast, classical uniformization methods assume that there is a fixedbound on all the rates [44], a much stronger condition than our explosion-freeassumption. Other approaches, based on performing Markov chain MonteCarlo (MCMC), relax the bounded rate assumption [85, 86], but they havea running time that depends linearly on the size of the state space in thesparse case and quadratically in the dense case.Assume that every trajectory (or path) is considered as a particle, thenparticle-based methods offer an interesting complementary approach, be-cause they have a time complexity per particle that depends on the imputednumber of transitions between the two end points instead of on the size ofthe state space.In the simplest case, one can implement this idea using a proposal dis-tribution equal to the generative process over paths initialized at the startpoint. The weight of a particle is then equal to one if the end point of thegenerated path coincides with the observed end point, and zero otherwise.We call this proposal the forward sampling proposal and it exactly corre-sponds to the simple Monte Carlo process explained in the previous sectionfor the approximation of RNA population kinetics.Unfortunately, the forward sampling method has two serious limitations.First, the requirement of imputing holding times between each transitionmeans that the proposal distribution is defined over a potentially high-dimensional continuous space. This implies that large numbers of particlesare required in practice. Second, in problems where each state has a largenumber of successors, the probability of reaching the end state can becomeextremely small.Now, we present our efficient particle-based Monte Carlo approach methodwhere the holding times are marginalized analytically. For convenience, wecall our approach Time Integrated Path Sampling, or TIPS.737.2. Efficient Continuous-Time Markov Chain Estimation7.2.1 MethodologyIn this thesis, we only describe the simplest framework in which our methodcan be applied and our main contributions can be understood: computingthe probability that a CTMC with known rate parameters occupies statey ∈ X at time t given that it occupies state x ∈ X at time 0, where X is acountable set of states. This simple framework is still very influential andcompletely represents the RNA folding models that are of particular interestto us. However, we note that our method is not limited to this setup andcan be extended to more complicated CTMC frameworks that fall out ofthe scope of this thesis - see Hajiaghayi et al. [47] for more details.Notation. Let ν(x, y) denote the transition probability from state x ∈ Xto state y ∈ X given that a state jump occurs (i.e. ∑y:y 6=x ν(x, y) =1, ν(x, x) = 0). Let λ(x) denote the rate of the exponentially-distributedholding time at state x (λ : X → [0,∞)).15 We only require efficient point-wise evaluation of λ(·), ν(·, ·) and efficient simulation from ν(x, ·) for allx ∈ X . We start by assuming that ν and λ are fixed, and discuss their esti-mation later. We define some notation for paths sampled from this process.Let X1, X2, . . . denote the list of visited states with Xi 6= Xi+1, called thejump chain, and H1, H2, . . . , the list of corresponding holding times. Themodel is characterized by the following distributions: Xi+1|Xi ∼ ν(Xi, ·),Hi|Xi ∼ F (λ(Xi)), where F (λ(Xi)) = 1− exp(−λν(Xi, ·)), is the exponen-tial distribution CDF with rate λ. Given a start state X1 = x, we denote byPx the probability distribution induced by this model. Finally, we denoteby N the number of states visited, counting multiplicities, in the interval[0, t], i.e. (N = n) ≡ (∑n−1i=1 Hi ≤ t <∑ni=1Hi).Overview of the inference method Using the simple setup introducedabove, the problem we try to solve is to approximate Px(XN = y), whichwe approach using an importance sampling method16 [35]. Each proposedparticle consists of a sequence (a list of variable finite length) of states X ∗starting at x and ending at y. More formally, we haveX ∗ = (x1, . . . , xn) | n ∈ N, x1 = x, xn = y and xi ∈ X .15Note that this is a reparameterization of the infinitesimal generator matrix Qx,y, withQx,x = −λ(x), and Qx,y = λ(x)ν(x, y) for x 6= y.16Intuitively, when the target distribution is difficult to sample from, importance sam-pling can be used as an alternative approach. This sampling method specifies a newprobability density function as the proposal distribution and draw samples from this dis-tribution rather than drawing them directly from the target distribution.747.2. Efficient Continuous-Time Markov Chain EstimationIn fact, we marginalize the holding times, hence avoiding the difficultiesinvolved with sequentially proposing times constrained to sum to the timet between the end points.Concretely, our method is based on the following elementary property:Proposition 7.2.1. If we let pi(x∗) = γ(x∗)/Px(XN = y), where,γ(x∗) =(n−1∏i=1ν(xi, xi+1))× (7.2.1)P(n−1∑i=1Hi ≤ T <n∑i=1Hi∣∣∣∣X∗ = x∗),for x∗ ∈ X ∗ and zero otherwise, where n = |x∗|, X∗ = (X1, · · · , XN ), andHi’s are sampled independently according to F (λ(Xi)), then pi is a normal-ized probability mass function.Proof. We have, for any x∗ = (x1, x2, . . . , xn) ∈ X ∗,(∏n−1i=1 ν(xi, xi+1))P(∑n−1i=1 Hi ≤ T <∑ni=1Hi∣∣∣∣X∗ = x∗)Px(XN = y)=Px(X1 = x1, . . . , Xn = xn)Px(XN = y)× P(n−1∑i=1Hi ≤ T <n∑i=1Hi∣∣∣∣X1 = x1, . . . , Xn = xn)=1Px(XN = y)× Px(n−1∑i=1Hi ≤ T <n∑i=1Hi, X1 = x1, . . . , Xn = xn)= Px(X∗ = x∗|XN = y).Since the right hand side is a conditional distribution,pi(x∗) = Px(X∗ = x∗|XN = y),is indeed a normalized probability mass function.As our notation for γ and pi suggests, we use this result as follows. First,we define an importance sampling algorithm that targets the unnormalizeddensity γ(x∗) via a proposal P˜(X∗ = x∗). Let us denote the k-th particle757.2. Efficient Continuous-Time Markov Chain Estimationproduced by this algorithm by x∗(k) ∈ X ∗, k ∈ {1, . . . ,K}, where the num-ber of particles K is an approximation accuracy parameter. Each of theK particles is sampled independently according to the proposal P˜. Second,we exploit the fact that the sample average of the unnormalized importanceweights w(x∗(k)) = γ(x∗(k))/P˜(X∗ = x∗(k)) generated by this algorithmprovide a consistent estimator for the normalizer of γ. Finally, by Propo-sition 7.2.1, this normalizer coincides with the quantity of interest here,Px(XN = y) – see Algorithm 1 followed by Algorithms 2, and 3. The onlyformal requirement on the proposal is that Px(X∗ = x∗) > 0 should implyP˜(X∗ = x∗) > 0. However, to render this algorithm practical, we need toshow that it is possible to define efficient proposals, in particular proposalssuch that Px(X∗ = x) > 0 if and only if P˜(X∗ = x∗) > 0 (in order to avoidparticles of zero weight). We also need to show that γ can be evaluatedpoint-wise efficiently, which we establish in Proposition 7.2.3.Algorithm 1 : TIPS(x, y, t)Input: Two end point states x and y and a given time tOutput: The sample average of unnormalized importance weights, s/Ks← 0for k = 1, 2, . . . ,K do(L, p˜, p)← propose(x, {y})Qˇ← Qˇ(L) {See Section ‘Analytic jump integration’}n = |L| {The length of the list of states L}s← s+ p× (exp(tQˇ))1,n/p˜end forreturn s/K767.2. Efficient Continuous-Time Markov Chain EstimationAlgorithm 2 : propose(x,A)Input: A start state x, and a set of target states AOutput: A sampled path L, proposal transition probability p˜, and jumpchain transition probability p(L, p˜, p)←proposeHittingPath(x,A, false)n ∼ Geo(·, β) {Geometric with support 1, 2, . . . }p˜← p × Geo(n;β) {Multiply by geometric probability mass function}for i = 2, 3, . . . , n dox′ ← last(L) {Last state visited in the list L}(L′, p˜′, p′)←proposeHittingPath(x′, A, true)p˜← p˜× p˜′p← p× p′L← L ◦ L′ {Concatenation of the two lists}end forreturn (L, p˜, p)Algorithm 3 : proposeHittingPath(x,A, b)Input: A start state x, a set of target states A, and a boolean variable b towhether sample at least one state or notOutput: A sampled path L, proposal transition probability p˜, and jumpchain transition probability pp← 1p˜← 1L← list(x) {Creates a new list containing the point x}for i = 1, 2, . . . doif x ∈ A and (not(b) or i > 1) thenreturn (L, p˜, p)end ifx′|x ∼ P˜(·|Xi−1 = x)p˜← p˜× P˜(Xi = x′|Xi−1 = x)p← p× ν(x, x′)L← L ◦ x′x← x′end forreturn (L, p˜, p)777.2. Efficient Continuous-Time Markov Chain EstimationProposal distributions. Our proposal distribution is based on the ideaof simulating the jump chain, i.e., of sequentially sampling from ν until y isreached. However this idea needs to be modified for two reasons. First, (*),since the state is countably infinite in the general case, there is a potentiallypositive probability that the jump chain sampling procedure will never hity. Even when the state is finite, it may take an unreasonably large numberof steps to reach y. Second, (**), forward jump chain sampling assigns zeroprobability to paths visiting y more than once.We address (*) by using a user-specified potential ρy : X → N centeredat the target state y (see Lemma 7.2.2 for the conditions we impose on ρy).For example, we used the Hamming distance, i.e., the distance between thebase pairs of the current structure and the target structure, for RNA kineticsapplications. Informally, the fact that this distance favours states which arecloser to y is all that we need to bias the sampling of our new jump processtowards visiting y.How do we bias the proposal sampling of the next state? Let D(x) ⊂ Xbe the set of states that decrease the potential from x. The proposed jump-chain transitions are chosen with probabilityP˜(Xi+1 = xi+1|Xi = xi) = (7.2.2)(αyxi)(ν(xi, xi+1)1{xi+1 ∈ D(xi)}∑x′i+1∈D(xi) ν(xi, x′i+1))+ (1− αyxi)(ν(xi, xi+1)(1− 1{xi+1 ∈ D(xi)})∑x′i+1 /∈D(xi) ν(xi, x′i+1)),where αyx = max{α,∑x′i+1∈D(xi) ν(xi, x′i+1)}. We note that α > 1/2 is atuning parameter. We briefly discuss the sensitivity of our method to thisparameter in Chapter 9. Lemma 7.2.2 guarantees that our proposal hits thetarget end point y with probability one.Point (**) can be also easily addressed by simulating a geometrically-distributed number of excursions where the first excursion starts at x, andthe others at y, and each excursion ends at y. We let β denote the parameterof this geometric distribution, a tuning parameter, which we also discuss inChapter 9.Lemma 7.2.2. Assuming αyx = max{α,∑x′i+1∈D(xi) ν(xi, x′i+1)}, under weakconditions, our proposal mechanism in Equation 7.2.2 will hit target y in fi-nite time with probability one.Proof. We show that our proposal mechanism hits the target end point y787.2. Efficient Continuous-Time Markov Chain Estimationwith probability one under the following assumptions:1. The potential ρy(x) takes the value zero if and only if x = y.2. The potential changes by one in absolute value for all state transitions:|ρy(z)− ρy(x)| = 1 for all z such that ν(x, z) > 03. For all states x 6= y, there is always a way to propose a state thatresults in a decrease in potential:For all x ∈ X , x 6= y, there exists z such that ν(x, z) > 0 and ρy(z) < ρy(x).To simplify the notation, we will drop the y superscript for the remainderof this proof.To prove that the process always hits y, it is sufficient to show that thesequence ρ(Xn) is a supermartingale17, which in our case reduces to showingthat E[ρ(Xi+1)|Xi] ≤ ρ(Xi).Note that the last condition ensures that the normalizer∑x′2∈D(X1) ν(X1, x′2)is always positive, hence our expression of the proposal mechanism is alwayswell defined. Note that technically, we should also require 0 < Px(ρy(X2) <ρy(x)) ≤ 1 to ensure that the second normalizer, ∑x′2 /∈D(X1) ν(X1, x′2), isalso positive, but if this is not the case, the proposal mechanism can alwaysbe replaced by ν in these cases without changing the conclusion of the resultproven here.Using the second condition, we have:E[ρ(Xi+1)|Xi] = αXi(ρ(Xi)− 1) + (1− αXi)(ρ(Xi) + 1)= 1− 2αXi + ρ(Xi)≤ ρ(Xi).Finally, since the supermartingale ρ(Xn) is non-negative, P˜(N <∞) = 1,we conclude that the process always hits y.Analytic jump integration. Now, we describe how the unnormalizeddensity γ(x∗) defined in Equation (7.2.1) can be evaluated efficiently for anygiven path x∗ ∈ X ∗.17A supermartingale is a sequence of real-valued random variables X0, X1, X2, . . . withthe property that for each Xi, E[Xi] < ∞ and E[Xi|X0, . . . , Xi−1] ≤ Xi−1 [45](Chapter12).797.2. Efficient Continuous-Time Markov Chain EstimationIt is enough to show that we can compute the following integral forHi|X∗ ∼ F (λ(Xi)) independently conditionally on X∗:P(n−1∑i=1Hi ≤ t <n∑i=1Hi∣∣∣∣X∗ = x∗)= (7.2.3)∫· · ·∫hi>0:∑ni=1 hi=tg(h1, h2, . . . , hn) dh1 dh2 . . . dhn,where g(h1, h2, . . . , hn) ={n−1∏i=1f(hi;λ(xi))}(1− F (hn;λ(xn))),and where f is the exponential density function. Unfortunately, there isno efficient closed form for this high-dimensional integral, except for specialcases (for example, if all rates are equal) [3]. This integral is related to thoseneeded for computing convolutions of non-identical independent exponentialrandom variables. While there exists a rich literature on numerical approxi-mations to these convolutions, these methods either add assumptions on therate multiplicities (e.g. |{λ(x1), . . . , λ(xN )}| = |(λ(x1), . . . , λ(xN ))|), or arecomputationally intractable [6].We propose to do this integration using the construction of an auxiliary,finite state CTMC with a n+1 by n+1 rate matrix Qˇ (to be defined shortly).The states of Qˇ correspond to the states visited in the path (x1, x2, . . . , xn)with multiplicities plus an extra state sn+1. All off-diagonal entries of Qˇare set to zero with the exception of transitions going from xi to xi+1, fori ∈ {1, . . . , n}. More specifically, Qˇ is−λ(x1) λ(x1) 0 · · · 0 00 −λ(x2) λ(x2) · · · 0 0· · · · · · · · · · · · · · · · · ·0 0 0 · · · −λ(xn) λ(xn)0 0 0 · · · 0 0 . (7.2.4)This construction is motivated by the following property.Proposition 7.2.3. For any finite proposed path (x1, x2, . . . , xn), if Qˇ isdefined as in Equation (7.2.4), then(exp(tQˇ))1,n= P(n−1∑i=1Hi ≤ t <n∑i=1Hi∣∣∣∣X∗ = x∗)(7.2.5)807.2. Efficient Continuous-Time Markov Chain Estimationwhere exp(A) denotes the matrix exponential of A.18Proof. Let Xˇ1, Xˇ2, . . . and Hˇ1, Hˇ2, . . . denote the states and holding timesrespectively of a CTMC with rate matrix Qˇ. The states take values in{1, 2, . . . , n + 1}, and we let Pˇ1 denote the path probabilities under thisprocess conditioned on starting at X1 = 1. Let Nˇ be defined similarly to N(the random number of states visited):(Nˇ = n) ≡(n−1∑i=1Hˇi ≤ t <n∑i=1Hˇi)≡{ω ∈ Ωˇ :n−1∑i=1Hˇi(ω) ≤ t <n∑i=1Hˇi(ω)}.Here, Ωˇ is an auxiliary probability space used to define the above randomvariables:Xˇi : Ωˇ→ XHˇi : Ωˇ→ [0,∞).For all i ∈ {2, . . . , n+1}, only state i−1 has a positive rate of transitioning tostate i, therefore (Xˇi = j) ⊂ (Xˇi−1 = j−1) for all j (*). Using equation 7.1.218Multiplicities of the rates in Qˇ greater than one will break diagonalization-basedmethods of solving exp(tQˇ), but other efficient matrix exponentiation methods such asthe squaring and scaling method are still applicable in these cases.817.3. Experimental Resultsand applying (*) inductively yield:(exp(tQˇ))1,n= Pˇ1(XˇNˇ = n)= Pˇ1(XˇNˇ = n, XˇNˇ−1 = n− 1)...= Pˇ1(XˇNˇ = n, XˇNˇ−1 = n− 1, . . . , Xˇ1 = n− Nˇ + 1)= Pˇ1(Nˇ = n, Xˇ1 = 1, Xˇ2 = 2, . . . , Xˇn = n)= Pˇ1(Nˇ = n)Pˇ1(Xˇ1 = 1, Xˇ2 = 2, . . . , Xˇn = n|Nˇ = n)= Pˇ1(Nˇ = n) n∏i=2Pˇ(Xˇi = i|Xˇi−1 = i− 1)= Pˇ1(Nˇ = n)=∫ ∫· · ·∫hi>0:h1+h2+···+hn=tg(h1, h2, . . . , hn) dh1 dh2 . . . dhn.7.3 Experimental ResultsHere, we will use our method, TIPS, to approximate RNA population kinet-ics. More specifically, we compare the accuracy of the transition probabilityestimates given by our method (TIPS) to those obtained by forward sam-pling (FS). We used the RNA molecules shown in Table 7.1.Sequence Length |U | |S|1AFX 12 70 -1XV6 12 48 -RNA21 21 ∼ 1100 657HIV 23 ∼ 1500 266Table 7.1: Biological RNA sequences obtained from the RNA STRANDdatabase [7]. Symbols U and S correspond to the set of secondary structuresobtained from the full model and subset model, respectively.For each method (TIPS and FS) and molecule, we first approximatedthe probability Px(XN = y) that beginning in its unfolded structure x, themolecule would end, after folding time t, in its MFE structure y. We then827.3. Experimental Resultscomputed, as a reference, the probability of this transition using an expen-sive matrix exponential. Computing the matrix exponential on the full statespace was only possible for the RNAs of no more than 12 nucleotides. Forthe longer RNAs, we used an RNA subset model – see Section 7.1.1 – andrestricted the state space to a connected subset S of secondary structures.While our method scales to longer RNAs, we wanted to be able to com-pare against forward sampling and to the true value obtained by matrixexponentiation.We note that Figure 7.1 shows the results on the subset model of RNA21and HIV molecules, and Figure 7.1 shows the results on the full model of1AFX and 1XV6.We ran the experiments with a range of number of particles, {51, 52, · · · , 56},for 30 replicates on folding times from {0.125, 0.25, · · · , 8}19. Here, we com-pare the performance of the two methods by looking at the absolute logerror of the estimate pˆ (i.e., error(pˆ) = | log pˆ − logPx(XN = y)|) over allreplicates.Figures 7.1a, 7.1d, 7.2a, and 7.2d show the performance of the FS andTIPS methods on selective folding times, {0.25, 1, 4}. Figures 7.1b, 7.1e,7.2b, and 7.2e show the CPU times (in milliseconds) corresponding to theminimum number of particles required to satisfy the certain accuracy level,I = {pˆ : error(pˆ) < 1.0} on all the folding times.The variances of FS and TIPS weights, for 56 = 15625 particles, are alsocomputed and compared on different folding times (see Figures 7.1c, 7.1f,7.2c, and 7.2f). Note that the variance is shown in log scale in these figures.The graphs show that our novel method TIPS outperforms FS in esti-mating the probability of transition from x to y in shorter folding times,since it needs many fewer particles (and correspondingly faster CPU times)than FS to be able to precisely estimate the probability. For instance, forthe RNA21 molecule with folding time 0.25, FS cannot satisfy the accuracylevel I given above, even with 15625 particles, however TIPS only needs 5particles with 16 ms of CPU time to reach the same accuracy level. Similarly,the variance of our method is smaller by a larger margin.For longer folding times in Figure 7.1, the performance of the TIPSand FS methods are comparable (in terms of the obtained errors and CPUtimes), slightly in favour of forward sampling. For example, for the HIV23molecule with folding time 4.0, TIPS and FS require 5 and 25 particles, and19Folding time, in this context, is a dimensionless quantity, meaning that it can bescaled. See the discussion about dimensions for the Boltzmann distribution in Section7.1.837.4. ConclusionCPU times, 12 ms and 5 ms, respectively to satisfy I.We note that the reason why FS can still perform reasonably well forlonger folding times is that we picked the final end point to be the MFEstructure, which has high probability under the stationary distribution. Forlow probability targets, FS will often fail to produce even a single hittingtrajectory, whereas each trajectory sampled by our method will hit the targetby construction.7.4 ConclusionWe have presented an efficient method for approximating transition prob-abilities and posterior distributions over parameters in countably infiniteCTMCs. We have demonstrated on real RNA molecules that our methodis competitive with existing methods for estimating the transition probabil-ities which marginalize over folding pathways and provide a model for thekinetics of a single strand of RNA interacting chemically with itself.What makes our method particularly attractive in large or countablyinfinite state space CTMCs is that our method’s running time per particleis independent of the size of the state space. The running time does dependcubically on the number of imputed jumps, so we expect that our methodwill be most effective when the typical number of transitions between twoobservations or imputed latent states is moderate (no more than approxi-mately a thousand with current architectures). The distribution of the jumpchain should also be reasonably concentrated to ensure that the sampler canproceed with a moderate number of particles. We have shown the realisticexamples on RNA folding pathways where these conditions are empiricallymet.847.4. Conclusion0.25 1 4012100 10000 100 10000 100 10000# of particlesErrorFSTIPS(a) RNA21- errorlllll0.1 0.2 0.5 1.0 2.0 5.0 10.05102050100Folding timeCPU time (ms)lllllllFSTIPS(b) RNA21- CPU time (ms)lllll0.1 0.2 0.5 1.0 2.0 5.0 10.01e−131e−101e−071e−041e−01Folding timeVariance (#particles = 15625)lllllllFSTIPS(c) RNA21- variance0.25 1 40.00.51.01.5100 10000 100 10000 100 10000# of particlesErrorFSTIPS(d) HIV23 - errorlllll0.1 0.2 0.5 1.0 2.0 5.0 10.0510152030Folding timeCPU time (ms)ll l lll lFSTIPS(e) HIV23 - CPU time (ms)lllll0.1 0.2 0.5 1.0 2.0 5.0 10.01e−141e−111e−081e−051e−02Folding timeVariance (#particles = 15625)lllllllFSTIPS(f) HIV23 - varianceFigure 7.1: Performance of our method (TIPS) and forward sampling (FS) on RNA21and HIV23 molecules with their subset state space. The relative errors of the estimatesvs. folding times, {0.25,1,4}, are shown (Figures (a) and (d)) along with the CPU timescorresponding to the minimum number of particles required to satisfy the accuracy level Iin milliseconds (Figures (b) and (e)) and the variance of TIPS and FS estimations (Figures(c) and (f)) on folding times, {0.125, 0.25, · · · , 8}.857.4. Conclusion0.5 2 80246100 10000 100 10000 100 10000# of particlesErrorFSTIPS(a) 1AFX- errorlllll0.5 1.0 2.0 5.0 10.0105020010005000Folding timeCPU time (ms)ll lllFSTIPS(b) 1AFX- CPU time (ms)lll l l0.5 1.0 2.0 5.0 10.02e−072e−062e−052e−04Folding timeVariance (#particles = 15625)lllllFSTIPS(c) 1AFX- variance0.5 2 8024100 10000 100 10000 100 10000# of particlesErrorFSTIPS(d) 1XV6 - errorllll0.5 1.0 2.0 5.0 10.010505005000Folding timeCPU time (ms)l llllFSTIPS(e) 1XV6 - CPU time (ms)ll ll0.5 1.0 2.0 5.0 10.01e−071e−061e−051e−04Folding timeVariance (#particles = 15625)lllllFSTIPS(f) 1XV6 - varianceFigure 7.2: Performance of our method (TIPS) and forward sampling (FS) on 1AFX and1XV6 molecules with their full state space. The relative errors of the estimates vs. foldingtimes, {0.5,2,8} are shown (Figures (a) and (d)) along with the CPU times correspondingto the minimum number of particles required to satisfy the accuracy level I in milliseconds(Figures (b) and (e)) and the variance of TIPS and FS estimations (Figures (c) and (f))on folding times, {0.5, 1, · · · , 8}.86Chapter 8Hardness of Multi-strandedNucleic Acid MFESecondary StructurePredictionIn the previous chapter, we proposed an efficient method to estimate thepopulation kinetics of RNA/DNA molecules that can contribute to a moreaccurate prediction of their functional structure. Here, we discuss the com-putational hardness of MFE structure prediction for a set of nucleic acidsstrands, as another important method for a better understanding of nucleicacid functions.8.1 PreliminariesWe review some basic terminology and prior work in order to precisely for-mulate the problem description and proof techniques. We employ the prop-erties described in Section 6.1 as our basis for single-stranded nucleic acidsand assume that only Watson-Crick base pairs can form between nucleotidebases. Nevertheless, an RNA/DNA sequence can be composed of multiplestrands. In this case, we define a secondary structure formed between multi-ple interaction strands as follows. Base pairing between two strands occursin an antiparallel format. That is, the Watson-Crick complement (sometimesreferred to as the ‘reverse-complement’) of strand x = 5′ − x1 · · ·xn − 3′ isthe strand 3′ − y1 · · · yn − 5′ ≡ 5′ − yn · · · y1 − 3′ where (xi, yi) is a Watson-Crick base pair. For example, the reverse-complement of 5′ − ACTCG− 3′ is5′−CGAGT−3′. Throughout, for simplicity, we will use the term complementto mean reverse-complement or Watson-Crick complement and denote thecomplement of x by x¯.Similar to the single-stranded model, the secondary structure formed bym interacting strands is a set of Watson-Crick base pairs. To specify the878.1. Preliminaries123(a)123(b)Figure 8.1: a) Polymer graph representation of the pseudoknot-free sec-ondary structure for the strand set {1, 2, 3} with ordering {123}, b) secondpolymer graph for the same set of strands with ordering {132}.secondary structure, we assign identifiers from 1 to m to the strands, andeach base is named by a strand identifier and a position on the correspondingstrand. For instance if base i in strand s pairs with base j in strand t, wheres ≤ t and i < j − 1 if s = t, the base pair is denoted as (is, jt). Schemati-cally, a multi-stranded secondary structure can be represented as a polymergraph by ordering and depicting the directional (5′ to 3′) strands around thecircumference of a circle and connecting the base pairs with straight lines.The number of different ways to position m strands on a circle correspondsto the set of circular permutations is (m−1)! (e.g., {123} and {132} are theonly orderings for three strands 1, 2, and 3 — see Figure 8.1) [32]. If thereexists a polymer graph for a given secondary structure, corresponding to acircular permutation without crossing lines, then the secondary structure iscalled pseudoknot-free — see Figure 8.1a. A secondary structure consists ofone or more complexes that correspond to the connected components in thepolymer graph representation.8.1.1 The Simple Energy ModelHere, we employ a very simple extension of the “base pair” free energy modelfor secondary structures [78]. In that model, the energy of each base pair is−1, and the overall free energy of a secondary structure is the total energycontribution of its base pairs. This means that a higher number of basepairs in a secondary structure of a single strand corresponds to a lower free888.1. Preliminariesenergy.In a system consisting of multiple interacting strands, there is an entropicpenalty for strands to associate via base pairing (i.e., a penalty for reducingthe number of complexes) [32]. In this simplified energy model, we definethe strand association penalty to be Kassoc ≥ 0. Thus, for a pseudoknot-freesecondary structure S consisting of m strands, l ≤ m complexes, and p basepairs, the free-energy of S is defined as E(S) = p(−1) + (m− l)Kassoc. Forexample, the secondary structure in Figure 8.1(a) has free energy 21(−1) +(3−1)Kassoc = −21+2Kassoc. Therefore, given a set of strands {s1, . . . , sm},an optimal pseudoknot-free secondary structure Si has the property thatE(Si) ≤ E(Sj) for all Sj ∈ S(s1, . . . , sm) where S(s1, . . . , sm) is the set ofall pseudoknot-free secondary structures of s1, . . . , sm.Since there can be a tradeoff between the number of base pairs and thenumber of complexes, then it is possible under this model for an optimalpseudoknot-free secondary structure to have less than the maximum numberof possible base pairs. However, our proofs have been constructed so thatpseudoknot-free MFE secondary structures will have a maximum number ofbase pairs for any reasonable value of the constant Kassoc. We will proceedwith our problem definitions under the assumption that Kassoc = 0 andformally argue later that the results hold for all constants Kassoc ≥ 0.8.1.2 Problem definitionsWe now formally define the main problem of interest in this chapter.Problem 1. Multi-Pkf-SSPInstance: Given m nucleic acid strands and a positive integer k.Question: Is there a pseudoknot-free secondary structure of the m strandscontaining at least k base pairs?To show hardness of our problem, we will describe a polynomial-timereduction from a restriction of the 3-dimensional matching problem toMulti-Pkf-SSP. A 3-dimensional matching is defined as follows. Let X,Y , and Z be finite, disjoint sets, and let T be a subset of X × Y × Z.That is, T consists of triples (x, y, z) such that x ∈ X, y ∈ Y , and z ∈ Z.Now M ⊆ T is a 3-dimensional matching if the following holds: for anytwo distinct triples (xi, yj , zk) ∈ M and (xa, yb, zc) ∈ M, we have xi 6= xa,yj 6= yb, and zk 6= zc.For convenience in our construction, we use a restriction of the 3-dimensionalmatching problem, called 3dm(3), that requires each element to appear in898.1. Preliminariesx1 y1 z1x2 y2 z2x3 y3 z3X Y Z(a)x1 y1 z1x2 y2 z2x3 y3 z3X Y Z(b)Figure 8.2: An instance of the restricted 3-dimensional match-ing problem (3dm(3)) where X = {x1, x2, x3}, Y = {y1, y2, y3},Z = {z1, z2, z3}. (a) The set of permitted triples, T ={(x1, y2, z2), (x2, y1, z1), (x2, y3, z2), (x3, y3, z3)}. (b) A valid matching M⊆T .at most three triples of T .Problem 2. 3dm(3)Instance: Given T ⊆ X × Y × Z, where |X| = |Y | = |Z| = n and eachelement of X, Y and Z appears in at most 3 triples of T .Question: Does there exist a matching M ⊆ T , with |M | = n?Theorem 8.1.1 (Garey & Johnson (1979) [1]). 3dm(3) is NP-complete.We note that the Multi-Pkf-SSP problem is a decision problem thatdetermines whether there exists an output structure with at least k basepairs or not. This problem can be turned into an optimization problem witha slight modification. We name this variant of the problemMax-Multi-Pkf-SSP.Problem 3. Max-Multi-Pkf-SSPInstance: Given m nucleic acid strands.Question: Determine a pseudoknot-free secondary structure of the mstrands with maximum number of base pairs.An optimization problem is in APX if it has a constant factor approxima-tion algorithm, i.e., an efficient method that can determine a solution withinsome fixed multiplicative factor of an optimal solution. A problem is APX-hard if for some constant c, a c-approximation algorithm for the problemwould imply that NP = P. One way to prove a problem is APX-hard is to908.2. String Designs and their Propertiesshow an approximation-preserving reduction from a known APX-hard prob-lem. We derive our hardness result for the Max-Multi-Pkf-SSP problemby a reduction from the Max-3dm(3) problem, an optimization variant of3dm(3).Problem 4. Max-3dm(3)Instance: Given T ⊆ X × Y × Z, where |X| = |Y | = |Z| = n and eachelement of X, Y and Z appears in at most 3 triples of T .Question: Find a maximum size 3-dimensional matching M ⊆ T .Kann [58] had previously shown Max-3dm(3) is APX-hard by showingthat it is NP-hard to decide whether an arbitrary instance of the problemhas a matching of size n or a matching of size at most (1−0)n, for some 0 >0 [59]. Nutov & Beniaminy [79] gave a (1− 1/e) approximation algorithm,demonstrating that Max-3dm(3) is in APX.Theorem 8.1.2 (Kann (1994) [58] and Nutov & Beniaminy [79]). Max-3dm(3)is APX-complete.8.2 String Designs and their PropertiesIn this section we show how to design strings with properties that are usefulin our reduction. We follow standard string notation: for a string a =a1 . . . an we denote its ith character (or symbol) by ai and its length by|a| = n; for any symbol B, we let Bl denote a string of length l consisting ofonly B’s. The following related string properties are of particular interestto us.1. A pairwise sequence alignment, or simply alignment, of strings a and bis a pair of strings (a′, b′) with |a′| = |b′|, where a′ and b′ are obtainedfrom a and b respectively by the insertion of zero or more copies ofa special gap symbol. Moreover, for any i, not both a′i and b′i aregap symbols and if neither a′i nor b′i is the gap symbol then a′i = b′i.The alignment can alternatively be considered as a sequence of alignedpairs (a′i, b′i), 1 ≤ i ≤ |a′|. A pair is a gap pair if either a′i or b′i is a gapsymbol. We also define an optimal alignment of a and b as a pairwisealignment of a and b with a minimum number of gap pairs, amongstall possible alignments.2. A longest common subsequence between strings a and b is a longestsubsequence common to the two strings. We denote the length of918.2. String Designs and their Propertiessuch a subsequence by LCS(a, b). A longest common subsequencecorresponds to an optimal alignment of a and b and LCS(a, b) is equalto the total number of gap-free pairs of symbols in the alignment.3. The insertion-deletion distance dLCS(a, b) between strings a and b isthe minimum number of insertions and deletions of symbols needed toconvert a into b (or equivalently to convert b to a). Equivalently, theinsertion-deletion distance between a and b is equal to the number ofgap pairs in an optimal alignment of a and b.The insertion-deletion distance and length of the longest common sub-sequence of two strings are related by the following known result.Theorem 8.2.1 ([49]). Given two strings a and b, where |a| = n and |b| =n′, then dLCS(a, b) = k if and only if LCS(a, b) =(n+n′−k)2 .Note that if a and b are equi-length strings, then k is an even number.In the next theorem, we provide a set of strings with a technique thatemploys a greedy codeword design used also in Justesen [56] and Schulmanand Zuckerman [91].Theorem 8.2.2. Let w > 0 and δ > 0. For any n, a set of at least wnequi-length strands over the alphabet {A, T}, each of length k log2 n for someconstant k (that depends on w and δ), can be designed in 2O(log2 n) time,such that the insertion-deletion distance between any pair in the set is atleast δ log2 n. Moreover, all strands in the set have at least dδ log2 n/2e A’sand at least dδ log2 n/2e T’s.Proof. We construct the desired set using a greedy algorithm that is specifiedin terms of a quantity t = Θ(log2 n) that we determine later. From {A, T}t,first put the two strings At and Tt in the set (we will remove them at theend). Once i ≥ 2 strings are in the set, choose any string from {A, T}t whoseinsertion-deletion distance from all i strings already in the set is at leastδ log2 n, and add it to the set. Continue until no more strings can be chosenwith the desired insertion-deletion distance. Finally, remove the strings Atand Tt. This algorithm runs in time 2O(log2 n).The number of strings in {A, T}t that have insertion-deletion distance atmost 2d from a given string s is at most(td)22d (see proof of Lemma 2 ofSchulman and Zukerman [91]). If d = dδ log2 n/2e, then our set has thedesired property that the insertion-deletion distance between any pair inthe set is at least δ log2 n. Furthermore all strings in the set, once At andTt are removed, must have at least dδ log2 n/2e A’s and at least dδ log2 n/2e928.2. String Designs and their PropertiesT’s; otherwise, their insertion-deletion distance from At and Tt, would be lessthan δ log2 n.The number of strings in the set before removal of At and Tt is at leastwn+ 2 if we choose t so that2t/((td)22d) ≥ 2t/2 ≥ wn+ 2.These inequalities hold if t is a sufficiently large constant times log2 n. (Forthe second inequality, we simply need that t ≥ 1 + 2 log2w + 2 log2 n. Forthe first inequality, from Stirling’s formula we have that(td)< (t e/d)d,and so the inequality holds if d log2(t e/d) ≤ t/2. This in turn holds ift = ηd (= ηdδ log2 n/2e) where we choose constant η so that ηe ≤ 2η.)Finally, since the strings At and Tt are removed and all other strings haveinsertion-deletion distance at least δ log2 n from strings At and Tt, all strandsin the set have at least δ log2 n A’s and at least δ log2 n T’s.Our design also makes use of a padding function. Let ρ5 denote thepadding function that, applied to a string, inserts five A’s (called paddedA’s) at the start of, and between, every pair of symbols in the string.Definition 8.2.3 (padding function ρ5). Let a = a1a2 . . . an be a string.Then ρ5(a) = A5a1A5a2 . . . A5an.If dLCS(a, b) = k then dLCS(ρ5(a), ρ5(b)) may be less than k. Toillustrate why, first consider a modified padding function ρ1, defined asρ1(a1a2 . . . an) = A1a1A1a2 . . . A1an. If we choose a = a1a2a3a4a5 = AATATT,and ac = TTATAA (ac is the real complement of a), then dLCS(a, ac) = 6whereas dLCS(ρ1(a), ρ1(ac)) = 4. This appears to contradict an assertion inLemma 2 of Schulman and Zukerman [91]. Adapting this example, it is thecase that ifa′ = a51a52 . . . a55 = A5A5T5A5T5T5and a′c would be the real complement of a′, then dLCS(a′, a′c) = 30, whiledLCS(ρ5(a′), ρ5(a′c)) = 24.We next show a general lower bound on dLCS(ρ5(a), ρ5(b)) in terms ofdLCS(a, b).Lemma 8.2.4. Let a and b be equi-length strings over {A, T}. If dLCS(a, b) =k then dLCS(ρ5(a), ρ5(b)) ≥ k2 .Proof. Suppose that dLCS(ρ5(a), ρ5(b)) < k2 . Let n be the length of a andb. We will obtain a contradiction to the hypothesis of the lemma that938.2. String Designs and their PropertiesdLCS(a, b) = k. Throughout, when referring to characters in the paddedstrings ρ5(a) and ρ5(b), we’ll denote the characters of the original strings aand b by Ao and To and the padded A’s by Ap.LetA be an optimal alignment of ρ5(a) and ρ5(b). Each pair of charactersin alignment A has one of four types: original, with two original characters;padded, with two padded characters; mixed, with one Ao and one Ap, orgap, with one gap symbol. Let x, y and u denote, in order, the counts oforiginal, padded and mixed pairs, respectively. To prove the lemma, we firstestablish various bounds on these counts.First, it must be that x ≤ n− k2 : since dLCS(a, b) = k, if x were greaterthan n− k2 we would be able to use the alignment A to obtain an alignmentof a and b with less than k gap pairs.Second, using Theorem 8.2.1 and our assumption that dLCS(ρ5(a), ρ5(b)) <k2 , we have that LCS(ρ5(a), ρ5(b)) ≥ 6n− bk4c, and sox+ y + u = LCS(ρ5(a), ρ5(b)) ≥ 6n− bk4c. (8.2.1)Third, we’ll obtain a lower bound on x. Note that 2y + u is boundedby the total number of Ap characters, and so is at most 10n. Thereforey+ du2 e ≤ 5n. Substituting this inequality into Equation 8.2.1, we have thatx ≥ n− bk4c − bu2 c. (8.2.2)From the fact that x ≤ n− k2 and inequality 8.2.2 we also have that thenumber of mixed pairs u is at least k2 .Now partition the mixed pairs into two types: sloppy and tight. A mixedpair p is sloppy if, among the first five pairs to the right of p, there is atleast one gap pair containing a To or Ap character. If p is not sloppy, wecall it tight. If p is tight, let p′ be the first pair to the right of p that isnot a padded pair. Such a pair p′ must exist, since our padding function issuch that any Ap character is eventually followed by an original character.Pair p′ is either a gap pair containing Ao or is a mixed pair, in which case italso contains Ao. In either case, because exactly five Ap’s separate any twooriginal characters, if the Ao character of pair p is in string a then the Aocharacter of pair p′ is in string b and vice versa. In what follows, we refer top′ as p’s partner. Note that p′ may itself be a tight pair.Each sloppy pair contributes one to dLCS(ρ5(a), ρ5(b)). Since we areassuming that dLCS(ρ5(a), ρ5(b)) < k2 , less thank2 of the mixed pairs aresloppy.Thus, at least u− k2 + 1 of the mixed pairs are tight. Using these tight948.2. String Designs and their Propertiespairs, we now convert alignment A to another alignment A′ with at leastn− k2 +1 original pairs, obtained as follows. Starting from the leftmost pair ofalignment A and working towards the right, find the first tight mixed pair pofA and its partner p′. Remove p, p′ and all of the intervening (padded) pairsbetween them from the alignment, and instead pair each padded characterfrom the removed pairs with a gap, and pair the Ao character of p with theAo character of p′ (recall that one of these Ao characters is in string a andthe other is in string b). Repeat, starting from the pair just to the right ofp′, until the rightmost end of A is reached.Let d be the number of new original pairs obtained in this manner. Thend is at least bu2 c − dk4e + 1: this lower bound is achieved when all partnersare themselves tight mixed pairs. Therefore, the number of original pairs inalignment A′ isx+ d ≥ n− bk4c − bu2 c+ bu2 c − dk4e+ 1 = n− k2 + 1.As noted earlier, any alignment of ρ5(a) and ρ5(b) has at most n− k2 originalpairs since dLCS(a, b) = k, and so we have a contradiction. Thus, ourassumption that dLCS(ρ5(a), ρ5(b)) < k2 is false, and the lemma is true.We next define the unpairedness of a secondary structure for a strand orpair of strands, and show that sets of padded strings with high insertion-deletion distance are useful in obtaining strands whose optimal structureshave high unpairedness.Definition 8.2.5. Let a and b be strands and let S(a) and S(a, b) be sec-ondary structures for strand a and pair (a, b) respectively. The unpairednessof S(a) or S(a, b) is the number of bases that are not paired in S(a) orS(a, b), respectively. We note that the base pairs of (a, b) includes bothinter-molecular and intra-molecular pairs.Lemma 8.2.6. Let a′ and b′ be any strands over the alphabet {A, T}, leta = ρ5(a′), let b = ρ5(b′), and let s be any substrand of a or a. Let S(s),S(a, b), S(a, b) and S(a, b) be any pseudoknot-free secondary structures fors, (a, b), (a, b) and (a, b), respectively. Then1. The unpairedness of S(s) is at least 13 |s|.2. The unpairedness of S(a, b) is at least 23 (|a|+ |b|).3. The unpairedness of S(a, b) is at least 23(|a|+ |b|).4. The unpairedness of S(a, b) is at least 13dLCS(a, b).958.2. String Designs and their PropertiesProof. To show part 1, first suppose that s is a substrand of a = ρ5(a′). Allbases that are paired in the secondary structure S(s) are within substrands. If |s| ≤ 2, then no bases of s are paired in S(s), given our assumptionthat consecutive bases in a strand cannot form a base pair, and so part 1holds. If |s| ≥ 3, the number of (intra-molecular) base pairs of S(s) is atmost the number of T’s in s. If 3 ≤ |s| ≤ 6 then s can have at most oneT, and thus at most one base pair, so s has at least |s| − 2 unpaired basesand again part 1 holds. Suppose that |s| ≥ 7. Because s is a substrand ofa padded strand, the number of T’s in s is at most d2|s|/7e: this maximumis achieved if |s| = 7 and s both starts and ends with a T. Even if all ofthe T’s of s are paired to A’s, the number of unpaired A’s is still at leastb3|s|/7c ≥ |s|/3 since |s| ≥ 7. The argument when s is a substrand of a isobtained by replacing A’s with T’s in the argument for a substrand of a.Similarly, the total number of T’s in S(a, b) is at most (|a|+ |b|)/6 and sothe unpairedness is at least 4(|a|+|b|)/6. The argument for the unpairednessof S(a, b) is obtained by replacing A’s with T’s in the argument for {a, b}.Finally, the inter-molecular base pairs of S(a, b) correspond to a commonsubsequence of strands a and b, and thus the number of such base pairs isat most LCS(a, b) = n − dLCS(a,b)2 by Theorem 8.2.1. Therefore the totalnumber of bases in both a and b that do not form inter-molecular base pairsof S(a, b) is at least dLCS(a, b). Now consider any substructure of S(a, b)within some maximal substrand s of either a or b¯ that has no inter-molecularbase pairs. The unpairedness of this substructure is at least 13 |s|, by part 1 ofthis Lemma. Thus, over all substrands that do not contain inter-molecularbase pairs, at least a fraction 13 of bases are unpaired (not involved in intra-molecular base pairs). Since the total length of such substrands is at leastdLCS(a, b), the unpairedness of S(a, b¯) is at least13dLCS(a, b).Definition 8.2.7. A set S of strands is k-robust if the following propertieshold:1. All strands of S have the same length.2. All strands of S have at least k A’s and at least k T’s.3. For any a and b in the set, the unpairedness of optimal structures fora, a¯, (a, b), (a¯, b¯), and (a, b¯) is at least k.Theorem 8.2.8. Let w > 0. For any n, a log2 n-robust set of at leastwn strands, each of length p log2 n for some constant p, can be designed in2O(log2 n) time.968.3. The ReductionProof. Using Theorem 8.2.2, for any w > 0 and δ = 6 we can obtain, intime 2O(log2 n), a set S ′ of at least wn strands, each of length k log2 n forsome constant k, such that the insertion-deletion distance between any pairof strands in S ′ is at least 6 log2 n. Moreover, all strands in S ′ have at least3 log2 n A’s and at least 3 log2 n T’s. This latter property implies that thestrands in S ′ have length at least 6 log2 n.Apply the padding function ρ5 to strands in S ′ to obtain a new set S.Note that the strands in S have length 6k log2 n, which must be at least36 log2 n. Lemma 8.2.4 shows that the insertion-deletion distance betweenany pair of strands in S is at least δ log2 n/2 = 3 log2 n. Lemma 8.2.6 thenshows that if a and b are any two strands in the set S, the unpairedness ofthe optimal structure of a, or its complement, or of (a, b), (a, b) or (a, b), is atleast min{13 |a|, 23(|a|+ |b|), 13dLCS(a, b)}. Given that |a| and |b| are at least36 log2 n and that dLCS(a, b) = 3 log2 n, this lower bound is at least log2 n.Therefore, the unpairedness of the set S is at least log2 n, as desired.8.3 The ReductionWe show a polynomial time, many-one reduction from 3dm(3) toMulti-Pkf-SSP(3dm(3) ≤PT Multi-Pkf-SSP). Given an instance I = (X,Y, Z, T ) of3dm(3), where m = |T | and n = |X| = |Y | = |Z|, we construct an instanceI ′ of Multi-Pkf-SSP as follows.Domains used in strands of I ′:The strands of the resulting instance, I ′, consist of the following domains.• One domain for each x ∈ X, y ∈ Y , and z ∈ Z and one domain foreach complement. Where no confusion arises, we use x, x¯, y, y¯, z, andz¯ to refer to these domains.• A separator and a separator-complement domain, denoted by Sep andSep.• A trim domain and a trim-complement domain, denoted by Trm andTrm respectively.Strands of I ′:The resulting instance, I ′, consists of the following strands; the strands aredescribed as a sequence of domains.978.3. The Reductionx1 y2 z2 z2 y2 x1(Triple-strands: t1 , t2 , t3 , t4) Template strand:t1 t2 t3 t4xyz-support strands:x1 , x2 , x3 , y1 , ... z3,, , , , ... , z3y1x3x2x1Trim-complement strands:Separator-complement strands:,trmseptrmtrm trmtrmtrm trm trm trm trmtrm trm trmsepsepsepx2 y1 z1 z1 y1 x2 x3 y3 z3 z3 y3 x3x2 y3 z2 z2 x2y3Separator strands:sep sep ,sep, sep , sep sep ,, sep,, , ,,sepsepsepsep sep sep sep sepsep sep sep sepsep sep sep sep sep,trm ,trm ,trm ,trm trm , ,trm ,trm ,trm ,trm trm(a) StrandsPerfect triple5'3'trm x1 sep y 2 sep z2 trm z 2 sep y 2 sep x1 trm _ __ _ __ _ _ __ _ __ _ Trim-deprived triple__ __ _ __ _ __ trm x2 sep y3 sep z2 trm z2 sep y3 sep x2 trmtrm x1 sep y2 trm y2 sep x1 trm(b) MFE structureFigure 8.3: The resulting set of strands, specified at the domain level (parta), and the MFE structure (part b), when reducing from the 3dm instanceof Figure 8.2. In the MFE structure, triple-strand t1 labeled as perfecttriple represents that the triple (x1, y2, z2) is in the solution of the 3dm(3)instance. However, triple-strand t4 denoted as trim-deprived triple showsthat the triple (x2, y3, z2) is not selected in the solution.988.3. The Reduction• Template strand : one strand that is the concatenation of triple-strands.There is one triple-strand for each triple (x, y, z) ∈ T , which is theconcatenation of the domains Trm, x, Sep, y, Sep, z, Trm, z¯, Sep, y¯,Sep, x¯, Trm in that order. We call the substrands x Sep y Sep z andz¯ Sep y¯ Sep x¯ of a triple-strand the 5′ and 3′ flanks, respectively.We call the Trm domains at the ends of the triple-strand the end-trimsand the Trm domain at the center of the triple-strand the center-trim.• Separator (-complement) support strands: there are 2n strands con-sisting of just the domain Sep and another 2n strands consisting ofjust the domain Sep.• xyz-support strands: for each x, y and z domain there is one strandconsisting of just that domain and one for its complement, for a totalof 6n strands.• Trim-complement strands: there are 2m+n strands consisting of justthe domain Trm (these strands are the complement of the trim domainsTrm).We refer to the xyz-support strands and the separator and separator-complementsupport strands collectively as the support strands.This completes the description of the reduction at the domain level ofdetail. Figure 8.3a shows the resulting Multi-Pkf-SSP instance, specifiedat the domain level, after a reduction from the 3dm(3) instance depicted inFigure 8.2.The MFE structure of the resulting set of strands is partially depictedin Figure 8.3b. The binding of the xyz-supports and separator supportsx1, Sep, y2, Sep, z2, their complements, and trim-complement strands tothe substrand labeled as “perfect triple”, denotes that the triple (x1, y2, z2)is selected in the solution of the 3dm(3) instance. The other triple-strandthat is depicted is a “trim-deprived triple” — i.e., a triple where at leastone of its trim domains is unbound — as the triple (x2, y3, z2) does notappear in the solution from Figure 8.2 (right). Intuitively, there is a trim-complement strand available to bind with each of the 2m end-trim domainsat the ends of all triple-strands, and in addition the number of xyz-support,separator supports and additional trim-complement strands is necessary andsufficient to have n “perfect triples” in an optimal secondary structure whenthe 3dm(3) instance has a perfect matching of size n.998.4. Reduction CorrectnessSequence design for I ′:To complete the reduction, we specify a sequence design for each domain ofI ′. For the x, y, and z domains, we use the set of sequences of Theorem 8.2.8with w = 3, since we need 3n domains (plus their complements) in total.Let E (= Θ(log2 n)) be the length of these domains. The trim domainsconsist only of the base G, and are also of length E. Formally, let Trm = GE .The Sep domain is A6E , and the Sep domain is the complement of the Sepdomain, namely T6E .The sequence design has the property that there are an equal numberof A and T bases overall: for every x, y, z or separator domain in a triple-strand, or x, y, z, or separator strand there is another domain or strand thatis its complement. The total number of C’s in trim-complement strands is(2m+ n)E. The total number of G’s in end-trims and center-trims is 3mE.Since m ≥ n, the total number of G’s is at least as great as the total numberof C’s. Therefore, under the assumption that only Watson-Crick base pairscan form, the maximum number of base pairs is limited to the total numberof A (or T) bases plus the total number of C bases. Let P denote this quantity.The instance I ′ is comprised of the strands of I ′ and the positive integerP .Lemma 8.3.1. Instance I ′ can be constructed in time polynomial in n.Proof. Instance I ′ has one template strand, 2n separator supports and 2nseparator-complement supports 6n xyz-support strands, and 2m + n trim-complement strands, for a total of 2m + 11n + 1 strands. The templatestrand has 13m domains and the other strands have one domain each, for atotal of 15m+ 11n domains.Since every domain in the construction has length Θ(log2 n), instanceI ′ is of size polynomial in n overall. The sequences can also be designedin polynomial time: The sequence design of separator and trim domains istrivial, and the sequences for the x, y, z domains can be designed in timepolynomial in n by Theorem 8.2.8.8.4 Reduction CorrectnessWe show that if the given instance I of 3dm(3) has a perfect matchingthen the optimal secondary structure formed from strands in I ′ is a singlecomplex that has P base pairs–the maximum number of base pairs that canbe formed from the strands of I ′, whereas if the optimal matching of I hassize n− i then the optimal structure has only P − Ω(iE) base pairs.1008.4. Reduction CorrectnessLemma 8.4.1. If I has a perfect matching, then the strands of I ′ can forma pseudoknot-free secondary structure, consisting of a single complex and Pbase pairs, with n perfect triple-strands.Proof. Here, in the reduced instance I ′, bases in the n triple-strands cor-responding to the perfect matching can be bound to the support strandsx, Sep, y, Sep, z, their complements, and three trim-complement strandsand form n perfect triple-strands. The end-trims of the remaining triple-strands can also be bound to two trim-complement strands while their com-plementary 5′ and 3′ flanks are paired together to make trim-deprived triples.Therefore, as all A’s and C’s are paired in this single (connected) complex,the number of base pairs is P which is optimal.We next consider the case that the optimal matching of I has size atmost n− i. Let Opt(I ′) denote the optimal pseudoknot-free structure of thereduced instance I ′. We establish properties that must hold true of Opt(I ′)and conclude that when the optimal matching of I has size at most n − i,then Opt(I ′) has P − Ω(iE) base pairs.With respect to a given structure, we say that a domain is bound if atleast one of its bases forms a base pair. A domain d in a triple-strand (aspart of the template strand) is connected to a non-template strand s if thereis a sequence of non-template strands s1, s2, . . . , sj where sj = s, such thatd forms a base pair with s1, s1 forms a base pair with s2, and so on up tosj−1 forming a base pair with sj = s.We partition the triple-strands into four types, depending on the struc-ture they form in Opt(I ′).• Perfect triples: the triple-strand binds to the set of non-templatestrands that are complementary to the triple-strand domains. Thisset of non-template strands contains two Sep’s, two Sep’s, three Trm’sand six xyz-support strands in total. The set of perfect triples corre-sponds to a matching of instance I.• Trim-deprived triples: at least one trim of a triple-strand is unbound.• Hogger triples: these are triple-strands which are not trim-deprived,and moreover, the ten domains in the flanks of a hogger triple arebound to, or connected to, at least eleven support strands in total.• Flawed triples: none of the above. In particular, flawed triples are nottrim-deprived.1018.4. Reduction CorrectnessEach triple-strand belongs to exactly one of the above four types. Note alsothat because a hogger or a flawed triple is not trim-deprived, the supportdomains that are bound to or connected to the 5′ flank (or 3′ flank) of itstriple-strand cannot bind to other domains on the template strand, or apseudoknot would form.Lemma 8.4.2. The total number of trim-deprived and flawed triples inOpt(I ′) is at least (m− n) + i/11.Proof. There are m triple-strands overall. Let p be the number of perfecttriples; each of these triple-strands has 10 support strands bound to it. Leth be the number of hogger triples; there are at least 11 support strandsbound or connected to each. There are 6n xyz-supports and 4n separatorand separator-complement strands in total, so 10p + 11h ≤ 10n and h ≤10(n − p)/11. Note also that since the optimal matching of I has size atmost n − i, the number of perfect triples p must be at most n − i and son− p ≥ i.The total number of triple-strands is m, so the number of triple-strandsthat are neither perfect nor hogger ism− p− h ≥ m− p− 10(n− p)/11 = m− n+ (n− p)/11 ≥ (m− n) + i/11.Lemma 8.4.3. Either Opt(I ′) has at least m−n+i/22 trim-deprived triples,or at least i/22 flawed triples.Proof. Suppose that the number of trim-deprived triples is less than m −n + i/22. By Lemma 8.4.2, the total number of trim-deprived and flawedtriples is at least (m− n) + i/11. Subtracting, we have that the number offlawed triples is at least i/22.We now adapt our notion of unpairedness from Section 8.2 to ACT-unpairedness. Let a and b be strands and let S(a) and S(a, b) be secondarystructures for strand a and pair (a, b) respectively. The ACT-unpairedness ofS(a) or S(a, b) is the number of A, C and T bases that are not paired in S(a)or S(a, b), respectively.Lemma 8.4.4. If the number of trim-deprived triples in Opt(I ′) is at leastm−n+ i/22, then at least iE/22 C’s are unpaired in Opt(I ′), and so Opt(I ′)has ACT-unpairedness Ω(iE).1028.4. Reduction CorrectnessProof. Each trim-deprived triple forms at most 2E CG base pairs, with theGs being in the trims (center-trim and end-trims) of the triple-strand andthe Cs being in trim-complement strands. Triple-strands that are not trim-deprived form at most 3E CG base pairs. There are no other CG base pairs.So, the total number of CG base pairs is at most(m− n+ i/22)2E + (m− (m− n+ i/22))3E = (2m+ n− i/22)E.The total number of trim-complement strands is 2m + n, each containingE Cs. So, the number of unpaired C bases in trim-complements is at leastiE/22.In order to show that many flawed triples cause Opt(I ′) to have high ACT-unpairedness, we first derive some useful properties about flawed triples. Inwhat follows, we let Lf = x Sepxy y Sepyz z and Rf = z¯ Sepyz y¯ Sepxy x¯denote the sequences on the 5′ and 3′ flanks of a flawed triple. Let Bf (5′)and Bf (3′) be the sets of support strands that are bound to, or connected todomains of Lf and Rf respectively, in the structure Opt(I′). Since a flawedtriple has at most ten support strands bound to it in total, either Bf (5′) ≤ 5or Bf (3′) ≤ 5. In the following lemmas, for concreteness, we suppose thatBf (5′) ≤ 5; the argument when Bf (3′) ≤ 5 is obtained by replacing domainsand strands with their complements and bases A and T with each other. LetOpt(Lf ) be the substructure of Opt(I′) formed by the bases in Lf and thestrands in Bf (5′).Lemma 8.4.5. Let Lf = x Sepxy y Sepyz z be the left flank of a flawedtriple with respect to structure Opt(I ′). Suppose that there are l ≥ 2 bondsbetween x, y or z domain of Lf and either Sepxy or Sepyz. Then Opt(Lf )has ACT-unpairedness at least 5(l − 1).Proof. As Sepxy and Sepyz contain only A’s, they can only bind with T’sof Lf . Our sequence design ensures that there are at least five padded A’sbetween any two successive T’s of x, y or z. Therefore, in order to avoidpseduoknots, if there are l bonds between x, y, or z and a Sep domain, atleast 5(l − 1) padded A’s remain unpaired.Lemma 8.4.6. Suppose that in Opt(Lf ), Bf (5′) ≤ 5 and the ACT-unpairednessof Lf is less than (log2 n)/3. Then the following must hold.1. Each Sep domain of Lf is bound to a Sep-support domain.2. Each x, y and z domain of Lf is bound to an xyz-support domain.1038.4. Reduction CorrectnessAs a consequence, each x, y, and z domain of Lf is bound to a distinctxyz-support of Bf (5′), each Sep domain of Lf is bound to a distinct Sepsupport of Bf (5′), and Bf (5′) contains exactly three xyz-supports and twoSep supports.Proof. Suppose to the contrary that the first condition does not hold, i.e.,one of Lf ’s Sep domains is not bound to a Sep support. The total numberof T’s that can bind to the Sep domain is at most 5.5E, accounted for asfollows: at most 3E/6 T’s in the x, y, and z domains of Lf plus at most5E in the remaining support strands, if they are five xyz-support strands.Thus at least E/2 of the 6E A’s in the Sep domain are unpaired. SinceE ≥ log2 n, we get a contradiction to the hypothesis of the lemma. Thusthe first condition must hold.Next suppose that the first condition holds but that the second does not;specifically that the x domain of Lf is not bound to an xyz-support domain(the argument is similar for the y or z domains). Recall that domain xcontains at least log2 n T’s, since by design the domains comprise a log2 n-robust set. At least 2(log2 n)/3 of the T’s must be paired, or the hypothesisof the lemma that the ACT-unpairedness of Lf is less than (log2 n)/3 wouldnot be true. Since the first condition of the lemma holds, the Sep domainadjacent to x on the 5′ flank is bound to a Sep strand. Therefore domainx cannot have bonds to domain y or z, or to the Sep domain between yand z, or a pseudoknot would form. Also, the T’s in domain x cannot bindto Sep strands, since Sep’s are composed only of T’s. If there were at least(log2 n)/3 bonds between x and Sepxy, Lemma 8.4.5 would imply that x hasACT-unpairedness at least 5((log2 n)/3−1) ≥ log2 n, again contradicting thehypothesis of the lemma.Therefore, at least (log2 n)/3 T’s of x must form intramolecular bondswith A’s that are also in the x domain. The total length of substrands ofx that have either unpaired bases or intramolecular base pairs must be atleast 3(log2 n)/3: this lower bound is met if each T, say at position i of x isbound to an A that is either at position i− 2 or i+ 2 (since we assume thatno base pair can form between consecutive bases). Part 1 of Lemma 8.2.6therefore implies that x has ACT-unpairedness at least (log2 n)/3, once againcontradicting the hypothesis of the lemma. We conclude that the secondcondition of the lemma must hold.Since both conditions hold, it cannot be that two of the x, y, and zdomains of Lf are bound to the same xyz-support of Bf (5′), or a pseudoknotwould form with bonds between a Sep of Lf and a Sep support. Similarly,it cannot be that both Sep’s have bonds to the same Sep. Hence, each1048.4. Reduction CorrectnessSep domain of Lf is bound to a distinct Sep support of Bf (5′), and Bf (5′)contains exactly three xyz-supports and two Sep supports, completing theproof of the Lemma.Lemma 8.4.7. Suppose that in Opt(Lf ), Bf (5′) ≤ 5. ACT-unpairednessof Lf is less than (log2 n)/3. Then for any constant α < 1/7, the ACT-unpairedness of Opt(Lf ) is at least α log2 n.Proof. Let α < 1/7. Suppose to the contrary that the ACT-unpairedness ofOpt(Lf ) is less than α log2 n. By Lemma 8.4.6, Bf (5′) must contain threexyz-supports, say a, b, and c, with a bound to x, b bound to y, and c boundto x.We first show that in Opt(Lf ), there can be at most α log2 n/5 basesbetween a Sep domain of Lf and one of the domains x, y, or z adjacent tothe Sep domain. Otherwise, by Lemma 8.4.5, at least α log2 n bases of awould be unpaired, and we get a contradiction. Similarly, there can be atmost α log2 n/5 bases between a Sep domain of Lf and one of the domainsa, b, or c adjacent to the Sep domain.Since Lf is the flank of a flawed triple, either a 6= x¯, b 6= y¯, or c 6= z¯.First suppose that a 6= x¯. Since the set of domains is log2 n-robust, therecan be at most E − log2 n base pairs between a and x. By the argument inthe previous paragraph, x has at most α(log2 n)/5 bases to Sepxy. Similarly,if Sepab is the separator complement between a and b, then a has at mostα(log2 n)/5 bases to Sepab. If a has base pairs with Sepxy, then x cannothave base pairs with Sepab and vice versa, in order to avoid pseudoknots.Therefore, either a or x has at least log2 n − α(log2 n)/5 ≥ 34(log2 n)/35bases that are either unpaired or form intramolecular bonds. By Lemma8.2.6, either a or x has unpairedness at least 11(log2 n)/35 ≥ (log2 n)/4,proving the lemma. The argument when c 6= z¯ is similar to that whena 6= x¯.Finally, suppose that a = x¯ and c = z¯ but b 6= y¯. As noted earlier,b has at most α(log2 n)/5 bonds with each Sep adjacent to it. Also, atleast log2 n bases of b are not paired with y, since the set of domains islog2 n-robust. Of these, at most αlog2n can be unpaired, or again we get acontradiction. Therefore, b has at least log2 n − 2α(log2 n)/5 − α log2 n =log2 n− 7α(log2 n)/5 bonds to the Sep’s adjacent to y, and so b has at least12(log2 n− 7α(log2 n)/5) bonds to Sepxy.Moreover, Sepab must have at least 6E − α log2 n(12/5) base pairs withSepxy. This is because Sepab has at most α(log2 n)/5 bases with each of aand b, and Sepab has at most 3α log2 n bases paired with x. To see why1058.4. Reduction Correctnessthe latter assertion holds, note that otherwise at least 3α log2 bases of aare not paired with any strand other than a and thus by Lemma 8.2.6,at least α log2 n bases of a are unpaired, which again is a contradiction.Therefore, Sepab has at most α log2(2/5 + 3) pairs in total with a, x, and b,and since at most α log2 n bases of Sepab can be unpaired, Sepab has at least6E − α log2 n(2/5 + 3− 1) = 6E − α log2 n(12/5) base pairs with Sepxy.Therefore the total number of bases that are paired with bases of Sepxyis at least 12(log2 n − 7α(log2 n)/5) (with b) plus 6E − α log2 n(12/5) (withSepab). The total is6E + log2 n(1/2− 7α/10− α(12/5)) ≥ 6E + log2 n(1/2− α(31/10)).Since α ≤ 1/7, this quantity is greater than 6E, again a contradiction sincethe length of Sepxy is 6E.Lemma 8.4.8. If the optimal matching of I has size at most n − i, thenOpt(I ′) has P − Ω(iE) base pairs.Proof. By Lemma 8.4.3, Opt(I ′) either has at least m − n + i/22 trim-deprived triples, or at least i/22 flawed triples.First suppose that Opt(I ′) has at least m − n + i/22 trim-deprivedtriples. Then by Lemma 8.4.4, the strands of Opt(I ′) have ACT-unpairednessΩ(iE). Similarly, if Opt(I ′) has at least i/22 flawed triples, then by Lemma8.4.7, each flawed triple has ACT-unpairedness Ω(log2 n) = Ω(E), sinceE = Θ(log2 n). Again, the total ACT-unpairedness is Ω(iE).Recall that all A’s, C’s and T’s must be paired in order for the totalnumber of base pairs to be P . Since the total ACT-unpairedness is Ω(iE), itmust be that the number of base pairs in Opt(I ′) is at most P −Ω(iE).Theorem 8.4.9. Multi-Pkf-SSP is NP-complete.Proof. Let I be any instance of Multi-Pkf-SSP, i.e, m nucleic acid strandsand a positive integer k. Given a secondary structure S for I, we cancheck in time polynomial in the total length of the strands whether S is avalid, pseudoknot-free secondary structure and whether it has k base pairs.Therefore, Multi-Pkf-SSP is in NP.Moreover, in the last section we provided a a polynomial time reductionfrom any instance I of 3dm(3) to an instance I ′ of Multi-Pkf-SSP. Theoptimal structure Opt(I ′) has P base pairs if I has a perfect matching, byLemma 8.4.1, and Opt(I ′) has less than P base pairs if I does not have aperfect matching (by Lemma 8.4.8), where P is the total number of A, T andC bases of the strands of instance I ′.1068.5. ApproximabilityPutting these together, we can conclude that Multi-Pkf-SSP is NP-complete.Until now, we have only considered the number of base pairs in theMFE structure under the assumption that there is no penalty for strandassociation, i.e., Kassoc = 0. Our construction has the property that struc-ture Opt(I ′) is a single complex when I has a perfect matching. WhenKassoc > 0 the penalty to bring the 2m+ 11n+ 1 strands into a single com-plex is (2m + 11n)Kassoc. However, the number of base pairs formed is atleast E, the total length of the xyz-support strands, where E = Θ(log2 n).Thus, for any positive constant Kassoc the value of E can be scaled by aconstant to ensure that a single domain binding is always favourable, evenwhen decreasing the number of complexes by one.8.5 ApproximabilityWe proved that the Multi-Pkf-SSP problem is NP-complete in Theo-rem 8.4.9. Given this result, it is natural to investigate that if there is apolynomial-time algorithm to approximate the optimal secondary structureof multi-stranded systems. In this section we show that the Max-Multi-Pkf-SSPproblem is APX-hard as well — see Theorem 8.5.2. This result asserts thatthere exists no PTAS for this problem, unless P = NP.To show the hardness result, we first verify that our reduction fromMax-3dm(3), which itself is APX-hard by Theorem 8.1.2, to Max-Multi-Pkf-SSPis a PTAS–reduction, i.e., an approximation-preserving reduction whichtransforms one optimization problem into another one. For this purpose,we map instances of Max-3dm(3) to instances of Max-Multi-Pkf-SSPwith the same polynomial time construction used for reducing 3dm(3) toMulti-Pkf-SSP. We then prove that this construction also maps a solutionof Max-Multi-Pkf-SSP to a solution of Max-3dm(3) using Lemma 8.5.1.Lemma 8.5.1. Our reduction from an instance I of Max-3dm(3) to aninstance I ′ of Max-Multi-Pkf-SSP yields that• if I has a matching of size n then |Opt(I ′)| = P ;• if I has a matching of size at most (1−0)n then |Opt(I ′)| ≤ P−α0nEwhere α > 0 is a constant.Proof. This lemma directly follows from Lemmas 8.4.1 and 8.4.8.1078.6. ConclusionGiven an overview of our approach, we formally prove the main resultof this section in what follows.Theorem 8.5.2. Max-Multi-Pkf-SSP is APX-hard.Proof. Let’s be more specific about the values of P and E. Theorem 8.2.8 as-sures that parameter E, the length of each xyz-support domain, is Θ(log2 n).Using our sequence design and Lemma 8.3.1, we also get that instanceI ′ includes Θ(n) + Θ(m) domains of length Θ(E). Since we are workingwith instances of Max-3dm(3), we know that the number of triples in theinstance is m ≤ 3n. From all of our assumptions we can conclude thatP = Θ(n log2 n).We now apply Lemma 8.5.1 to show APX-hardness of Max-Multi-Pkf-SSP.Suppose by contradiction that for some > 0, there is a (1−)–approximationalgorithm for this problem. Then,• if I of Max-3dm(3) has a matching of size n, on instance I ′ ofMax-Multi-Pkf-SSP the algorithm returns a solution with value at least(1− )|Opt(I ′)| = (1− )P ;• if I has a matching of size at most (1−0)n, on instance I ′ the algorithmreturns a solution with value at most |Opt(I ′)| ≤ P − α0nE.Therefore, ifP − α0nE < (1− )P (8.5.1)the algorithm can distinguish between the cases where I has a matching ofsize n or of size at most (1− 0)n. By our current assumptions about P andE, equation 8.5.1 holds if <α0nEP. (8.5.2)This contradicts the APX-hardness of Max-3dm(3) (Theorem 8.1.2).8.6 ConclusionA basic question that has remained open from over three decades of workon computational pseudoknot-free secondary structure prediction of nu-cleic acids is: can we efficiently compute the minimum free energy (MFE)pseudoknot-free secondary structure for a multi-set of DNA or RNA strands?We have shown that this problem is NP-hard, and is therefore computation-ally intractable, unless P = NP. A natural question then is whether solutionsto the problem can be efficiently approximated, if P 6= NP. Unfortunately,1088.6. Conclusionthere is a limit to the accuracy of any such method. We have shown thatthe optimization problem of finding the MFE structure for a multi-set ofnucleic acid strands is hard for the complexity class APX, the class of NPoptimization problems that have constant factor approximation algorithms.The result implies that there does not exist a polynomial time approxima-tion scheme for this problem, unless P = NP. Given these results, it suggeststhat heuristic methods, such as stochastic local search, and randomized algo-rithms should be investigated for structure prediction of multiple interactingstrands.109Chapter 9Summary and Future Work9.1 SummaryMany problems of practical interest rely on continuous-time Markov chains(CTMCs) defined over combinatorial state spaces, rendering the computa-tion of transition probabilities, and hence probabilistic inference, difficultor impossible with existing methods. For these problems, where classicalmethods are not applicable, the main alternative has been particle Markovchain Monte Carlo methods. In Chapter 7, we have proposed an efficientparticle-based Monte Carlo method, called TIPS, to approach inference inCTMCs with weak assumptions on the state space using an importancesampling approach. Our method requires a user-specified potential functioncentered at the target end point and satisfying some certain conditions. Wehave defined our proposal sampling based on this potential function andproved that our proposal hits the target with probability one. We have alsoshowed that our method (TIPS) outperforms the forward sampling methodon nucleic acid folding pathways which is an important examples of CTMCsand demonstrated that in a range of realistic inferential setups, our schemedramatically reduces the variance of the Monte Carlo approximation.In Chapter 8, we have showed that, while efficient thermodynamics-basedapproaches are well known for prediction of pseudoknot-free secondary struc-tures of single strands, the problem of predicting pseudoknot-free secondarystructures of multiple interacting strands is computationally intractable un-less P = NP. Our proof uses a polynomial time reduction from a variant of3-dimensional matching to our problem Multi-Pkf-SSP. To provide thisreduction, we have designed our sequences employing code word designs withhigh pairwise edit distance of Schulman and Zukerman [91]. However, weencountered an issue in their proof and fixed it in Lemma 8.2.4. Moreover,we have also proved that there are no polynomial time algorithms to pro-vide an approximation for the MFE structure of a set of nucleic-acid strandsunless P = NP and therefore the problem is APX-hard as well.1109.2. Future Work9.2 Future WorkOur work in this part of thesis can be extended as follows.9.2.1 Nucleic Acid Folding PathwaysParameter Tuning One caveat of our results is that our method, TIPS,was sensitive to a range of values of the tuning parameters α and β. Forexample, we simply tried different values for parameter α and found thatthe accuracy of our sampling in RNA folding pathways was susceptible tothe setting of this parameter (see Figure 9.1).We believe that the behavior of our method is sensitive to α, β, becausethe sampled jump chains are typically longer in RNA folding pathways.Intuitively, for longer folding times, the transition probabilities are moreinfluenced by the low probability particles or paths, as these low proba-bility paths comprise a greater percent of all possible paths. This meansthat any setting of α that heavily biases the sampled paths to be from theregion just around x and y will need to sample a large number of pathsin order to approximate the contribution of paths with a low probability.This situation is analogous to the well-known problems in importance sam-pling of mismatches between the proposal and actual distributions. Similarsampling considerations apply to parameter β which controls the number ofexcursions from y. If β is too restrictive, again, paths will be sampled thatdo not well reflect the actual probability of excursions. Parameter tuningis therefore an important area of future work. It might be possible to usesome automated tuners [52, 105] or to approach the problem by essentiallycreating mixtures of proposals each with its own tuning parameters.Subset Selection In Section 7.1.1, we mentioned that the choice of subsetS, a subset of secondary structures for a given nucleic acid, is an importantdecision, but we didn’t argue how to choose the connected set S to optimizethe accuracy of a given subset model. In fact, there are many possiblechoices of S. For example in Tang et al. [97], subset S is randomly sampledaccording to the Boltzmann distribution of the structures. In Kirkpatrick etal. [60], instead, subset S is a mixture of suboptimal structures (i.e., thosewith the closest free energy to the MFE structure) and structures sampledfrom the Boltzmann distribution, which is also used as our selection modelin Section 7.3. As another example, Wolfinger et al. [108] use the set of1119.2. Future Workl l l ll0.5 1.0 2.0 5.0 10.0510505005000Folding timeMin. # of particleslll l l lTIPS− alpha = 0.6TIPS− alpha = 0.66TIPS− alpha = 0.75FSFigure 9.1: Tuning parameter α. Performance of our method (TIPS) usingdifferent values of α compared to forward sampling (FS) for estimating thefolding pathway of the 1XV6 molecule on its full state space. The minimumnumber of particles required to perform each sampling method is shown onthe y-axis.1129.2. Future Worklocal minima or metastable structures20 and their connecting structures,known as saddle points, as subset S. Finding the best choice of S of a givensize m that would minimize the inaccuracy of a particular subset modelis still an intractable and open problem. However, we believe the optimalsubsets have some tractable and explorable properties (such as connectivity,probability, diversity, inclusion of saddle points, etc.), and understandingthese properties can be very helpful for modelling RNA folding pathways onsubset S. As a future work, we are interested to study the impact of theseproperties in the quality of a subset.9.2.2 Multi-stranded Nucleic Acid Secondary StructurePredictionComputational Experiments We proved our results using a simple en-ergy model. Although it would seem unlikely to have an easier predictionproblem if a more complicated energy model is employed, it is still valuableto provide some computational experiments using a realistic energy model.For example, we are interested to run the following useful experiment: 1)consider an arbitrary 3DM(3) matching instance, 2) reduce it using our se-quence design algorithm in Section 8.2, and 3) use some available softwaresuch as NUPACK [110] for predicting the MFE structure of multiple inter-acting strands under the use of the Turner energy model. This way, e.g.,for the small 3DM(3) instance shown in Figure 8.2, we will end up with 42strands with a total length of 4427. Unfortunately, there is no feasible wayto conduct such an experiment at this stage, because there is no softwarecapable of handling so many large sequences. We also thought about simpli-fying our experiment by calculating the MFE structure using a fixed orderof strands. However, the number of sequences and their total length are toolarge to be handled by any existing software suites. Therefore, a possiblefuture work can be our contribution for extending the NUPACK source codeto support larger multi-stranded nucleic acid systems.Another variant of the multi-stranded prediction problem In ourwork, we studied the hardness of the general prediction problem where thereis no restriction (except pseudoknot-free constraint) on the multiple inter-acting strands. However, we can consider another variant of the problemwhere the goal is to find the MFE structure of a restricted mutli-set of20Formally, structure x is called a local minimum if E(x) ≤ E(y) for all y ∈ N(x) whereE(x) is the free energy and N(x) is the set of all neighbours of x.1139.2. Future Workstrands, e.g., a set of strands containing only specific types of nucleotidesor a set with arbitrary edit distances (with no lower bound) between itsstrands. Then, this alternative problem may be more manageable and thehardness result may be avoided. We can investigate the complexity of thisnew problem as another area of future work.114Bibliography[1] Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman & Co. New York, New York, NY,USA.[2] J. P. Abrashams, M. V. D. Berg, E. V. Batenburg, and C. Pleij.Prediction of RNA secondary structure, including pseudoknotting, bycomputer simulation. Nucleic Acids Research, 18(10):3035–3035, May1990.[3] M. Akkouchi. On the convolution of exponential distributions. Journalof the Chungcheong Mathematical Society, 21(4):502–510, 2008.[4] B. Alberts, A. Johnson, J. Lewis, M. Raff, K. Roberts, and P. Walter.Molecular Biology of the Cell. Garland Science, 4th edition, 2002.[5] D. Alistarh, J. Aspnes, D. Eisenstat, R. Gelashvili, and R. L. Rivest.Time-space trade-offs in population protocols. In Proceedings ofthe Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Al-gorithms, pages 2560–2579, 2017.[6] S. Amari and R. Misra. Closed-form expressions for distribution of sumof exponential random variables. IEEE Transactions on Reliability,46(4):519–522, 1997.[7] M. Andronescu, B. V., H. Hoos, and A. Condon. RNA STRAND:the RNA secondary structure and statistical analysis database. BMCBioinformatics, 9:340, 2008.[8] M. Andronescu, Z. C. Zhang, and A. Condon. Secondary structureprediction of interacting RNA molecules. Journal of Molecular Biol-ogy, 345(5):987–1001, February 2005.[9] D. Angluin, J. Aspnes, Z. Diamadi, M. Fischer, and R. Peralta. Com-putation in networks of passively mobile finite-state sensors. Dis-tributed Computing, 18:235–253, 2006. Preliminary version appearedin PODC 2004.115Bibliography[10] D. Angluin, J. Aspnes, and D. Eisenstat. Stably computable predicatesare semilinear. In PODC 2006: Proceedings of the twenty-fifth annualACM symposium on Principles of distributed computing, pages 292–299, New York, NY, USA, 2006. ACM Press.[11] D. Angluin, J. Aspnes, and D. Eisenstat. Fast computation by popu-lation protocols with a leader. Distributed Computing, 21(3):183–199,September 2008. Preliminary version appeared in DISC 2006.[12] D. Angluin, J. Aspnes, and D. Eisenstat. A simple population pro-tocol for fast robust approximate majority. Distributed Computing,21(2):87–102, July 2008.[13] D. P. Bartel. MicroRNAs: genomics, biogenesis, mechanism, and func-tion. Cell, 116:281–297, 2004.[14] L. Becchetti, A. Clementi, E. Natale, F. Pasquale, R. Silvestri, andL. Trevisan. Simple dynamics for plurality consensus. DistributedComputing, pages 1–14, 2016.[15] L. Becchetti, A. E. F. Clementi, E. Natale, F. Pasquale, and L. Tre-visan. Stabilizing consensus with many opinions. In Proceedings ofthe Twenty-Seventh Annual ACM-SIAM Symposium on Discrete Al-gorithms, pages 620–635, 2016.[16] A. Belleville, D. Doty, and D. Soloveichik. Hardness of computingand approximating predicates and functions with leaderless popula-tion. Proceedings of the 44th International Colloquium on Automata,Languages and Programming [to appear], 2017.[17] K. Bringmann, F. Grandoni, B. Saha, and V. V. Williams. Truly Sub-cubic Algorithms for Language Edit Distance and RNA-Folding viaFast Bounded-Difference Min-Plus Product. In 2016 IEEE 57th An-nual Symposium on Foundations of Computer Science (FOCS), pages375–384, October 2016.[18] L. Cardelli. Strand algebras for DNA computing. Natural Computing,10(1):407–428, 2011.[19] L. Cardelli and A. Csiksz-Nagy. The Cell Cycle Switch ComputesApproximate Majority. Scientific Reports, 2:srep00656, September2012.116Bibliography[20] L. Cardelli, M. Kwiatkowska, and L. Laurenti. Programming dis-crete distributions with chemical reaction networks. In Rondelez Y.,Woods D. (eds) DNA Computing and Molecular Programming, Lec-ture Notes in Computer Science, volume 9818, pages 35–51. Springer,Cham, 2016.[21] H.-L. Chen, D. Doty, and D. Soloveichik. Deterministic function com-putation with chemical reaction networks. In DNA 18: Proceedingsof The 18th International Meeting on DNA Computing and Molecu-lar Programming, volume 7433 of Lecture Notes in Computer Science,pages 25–42. Springer, 2012.[22] H.-L. Chen, D. Doty, and D. Soloveichik. Rate-independent computa-tion in mass-action chemical reaction networks. In ITCS 2014: Pro-ceedings of the 5th Conference on Innovations in Theoretical ComputerScience, pages 313–326, 2014.[23] S.-J. Chen. RNA folding: Conformational statistics, folding kinetics,and ion electrostatics. Annual Review of Biophysics, 37(1):197–214,2008. PMID: 18573079.[24] Y.-J. Chen, N. Dalchau, N. Srinivas, A. Phillips, L. Cardelli, D. Solove-ichik, and G. Seelig. Programmable chemical controllers made fromDNA. Nature Nanotechnology, 8(10):755–762, October 2013.[25] Y.-J. Chen, B. Groves, R. A. Muscat, and G. Seelig. DNA nan-otechnology from the test tube to the cell. Nature Nanotechnology,10(9):748–760, September 2015.[26] H. Chernoff. A measure of asymptotic efficiency for tests of a hypoth-esis based on the sum of observations. The Annals of MathematicalStatistics, pages 493–507, 1952.[27] H. M. T. Choi, J. Y. Chang, L. A. Trinh, J. E. Padilla, S. E. Fraser,and N. A. Pierce. Programmable in situ amplification for multiplexedimaging of mRNA expression. Nature Biotechnology, 28:1208– 1212,2010.[28] A. Condon, M. Hajiaghayi, D. Kirkpatrick, and J. Manuch. Simplify-ing analyses of chemical reaction networks for approximate majority.Proceedings of the 23th International Conference on DNA Computingand Molecular Programming (DNA23) [to appear], 2017.117Bibliography[29] A. Condon, M. Hajiaghayi, and C. Thachuk. Predicting minimum freeenergy structures of multi-stranded nucleic acid systems is APX-Hard.[30] M. Cook, D. Soloveichik, E. Winfree, and J. Bruck. Programmabilityof chemical reaction networks. In Algorithmic Bioprocesses, pages 543–584. Springer Berlin Heidelberg, 2009.[31] J. Cruise and A. Ganesh. Probabilistic consensus via polling andmajority rules. Queueing Systems, 78(2):99–120, 2014.[32] R. M. Dirks, J. S. Bois, J. M. Schaeffer, E. Winfree, and N. A. Pierce.Thermodynamic analysis of interacting nucleic acid strands. SIAMReview, 49(1):65–88, January 2007.[33] D. Doty and M. Hajiaghayi. Leaderless Deterministic Chemical Reac-tion Networks. In Proceedings of the 19th International Conference onDNA Computing and Molecular Programming, volume 8141 of DNA19, pages 46–60, 2013.[34] D. Doty and M. Hajiaghayi. Leaderless deterministic chemical reactionnetworks. Natural Computing, 14(2):213–223, June 2015.[35] A. Doucet, N. de Freitas, and N. Gordon. Sequential Monte Carlomethods in practice. Springer, 2001.[36] M. Draief and M. Vojnovic. Convergence speed of binary intervalconsensus. SIAM Journal on Control and Optimization, 50(3):1087–1109, 2012.[37] B. Felden. RNA structure: experimental analysis. Current Opinionin Microbiology, 10(3):286–291, June 2007.[38] W. Feller. An Introduction to Probability Theory and its Applications,volume 1. Wiley, New York, 3rd edition, 1968.[39] C. Flamm, W. Fontana, I. L. Hofacker, and P. Schuster. RNA foldingat elementary step resolution. RNA, 6:325–338, 2000.[40] K. Fujibayashi, R. Hariadi, S. H. Park, E. Winfree, and S. Murata.Toward reliable algorithmic self-assembly of DNA tiles: A fixed-widthcellular automaton pattern. Nano Letters, 8(7):1791–1797, July 2008.[41] K. Gerdes and E. G. H. Wagner. RNA antitoxins. Current Opinionin Microbiology, 10(2):117–124, April 2007.118Bibliography[42] D. T. Gillespie. Exact stochastic simulation of coupled chemical reac-tions. The Journal of Physical Chemistry, 81(25):2340–2361, 1977.[43] S. Ginsburg and E. H. Spanier. Semigroups, Presburger formulas, andlanguages. Pacific Journal of Mathematics, 16(2):285–296, 1966.[44] W. K. Grassmann. Transient solutions in markovian queueing systems.Computers and Operations Research, 4:47–100, 1977.[45] G. R. Grimmett and D. R. Stirzaker. Probability and Random Process.Oxford University Press, New York, 2nd edition, 1992.[46] B. Groves, Y.-J. Chen, C. Zurla, S. Pochekailov, J. L. Kirschman,P. J. Santangelo, and G. Seelig. Computing in mammalian cells withnucleic acid strand exchange. Nature Nanotechnology, 11(3):287–294,March 2016.[47] M. Hajiaghayi, B. Kirkpatrick, L. Wang, and A. Bouchard-Ct. Effi-cient continuous-time markov chain estimation. ICML, 2014.[48] D. Han, S. Pal, J. Nangreave, Z. Deng, Y. Liu, and H. Yan. DNAorigami with complex curvatures in three-dimensional space. Science,332(6027):342, 2011.[49] D. S. Hirschberg. Pattern matching algorithms, chapter Serial Com-putations of Levenshtein Distances, pages 123–142. Oxford UniversityPress, 1997.[50] A. Hjelmfelt, E. D. Weinberger, and J. Ross. Chemical implementationof neural networks and Turing machines. Proceedings of the NationalAcademy of Sciences of the United States of America, 88(24):10983–10987, 1991.[51] J. P. Huelsenbeck and F. Ronquist. MRBAYES: Bayesian inference ofphylogenetic trees. Bioinformatics, 17(8):754–755, 2001.[52] F. Hutter, H. H. Hoos, K. Leyton-Brown, and T. Stu¨tzle. ParamILS:an automatic algorithm configuration framework. Journal of ArtificialIntelligence Research, 36:267–306, October 2009.[53] H. Isambert. The jerky and knotty dynamics of RNA. Methods,49(2):189–96, 2009.119Bibliography[54] J. A. Jaeger, D. H. Turner, and M. Zuker. Predicting optimal andsuboptimal secondary structure for RNA. Methods in Enzymology,183:281–306, 1990.[55] H. Jiang, M. Riedel, and K. Parhi. Digital signal processing withmolecular reactions. IEEE Design and Test of Computers, 29(3):21–31, 2012.[56] J. Justesen. Class of constructive asymptotically good algebraic codes.IEEE Transactions on Information Theory, 18(5):652 – 656, 1972.[57] N. G. v. Kampen. Stochastic processes in physics and chemistry.North-Holland, Amsterdam; New York, 1992.[58] V. Kann. Maximum bounded 3-dimensional matching is MAX SNP-complete. Information Processing Letters, 37(1):27–35, January 1991.[59] V. Kann. Maximum bounded h-matching is max SNP-complete. In-formation Processing Letters, 49(6):309–318, March 1994.[60] B. Kirkpatrick, M. Hajiaghayi, and A. Condon. A new model for ap-proximating RNA folding trajectories and population kinetics. Com-putational Science & Discovery, 6(1):014003, January 2013.[61] A. Korostelev and H. F. Noller. The ribosome in focus: new structuresbring new insights. Trends in Biochemical Sciences, 32:434–441, 2007.[62] M. R. Lakin, D. Parker, L. Cardelli, M. Kwiatkowska, and A. Phillips.Design and analysis of DNA strand displacement devices using prob-abilistic model checking. Journal of the Royal Society, Interface,9(72):1470–1485, July 2012.[63] R. Lorenz, S. H. Bernhart, C. Hner zu Siederdissen, H. Tafer,C. Flamm, P. F. Stadler, and I. L. Hofacker. ViennaRNA package2.0. Algorithms for Molecular Biology : AMB, 6:26, November 2011.[64] Z. J. Lu, D. H. Turner, and D. H. Mathews. A set of nearest neigh-bor parameters for predicting the enthalpy change of RNA secondarystructure formation. Nucleic Acids Research, 34(17):4912–4924, Octo-ber 2006.[65] M. O. Magnasco. Chemical kinetics is Turing universal. PhysicalReview Letters, 78(6):1190–1193, 1997.120Bibliography[66] P. Maragakis, F. Ritort, C. Bustamante, M. Karplus, and G. E.Crooks. Bayesian estimates of free energies from nonequilibrium workdata in the presence of instrument noise. The Journal of ChemicalPhysics, 129(2), 2008.[67] H. M. Martinez. An RNA folding rule. Nucleic Acids Research, 12(1Pt 1):323–334, January 1984.[68] D. H. Mathews, J. Sabina, M. Zuker, and D. H. Turner. Expandedsequence dependence of thermodynamic parameters improves pre-diction of RNA secondary structure. Journal of Molecular Biology,288(5):911–940, May 1999.[69] D. H. MATHEWS. Using an RNA secondary structure partition func-tion to determine confidence in base pairs predicted by free energyminimization. RNA, 10(8):1178–1190, 2004.[70] D. H. Mathews, M. D. Disney, J. L. Childs, S. J. Schroeder, M. Zuker,and D. H. Turner. Incorporating chemical modification constraintsinto a dynamic programming algorithm for prediction of RNA sec-ondary structure. Proceedings of the National Academy of Sciences ofthe United States of America, 101(19):7287–7292, May 2004.[71] D. H. Mathews and D. H. Turner. Prediction of RNA secondary struc-ture by free energy minimization. Current Opinion in Structural Bi-ology, 16(3):270–278, June 2006.[72] J. S. McCaskill. The equilibrium partition function and base pairbinding probabilities for RNA secondary structure. Biopolymers, 29(6-7):1105–1119, June 1990.[73] C. McDiarmid. On the method of bounded differences. London SocietyLecture Note Series, 141:148–188, 1989.[74] G. B. MertziosSotiris, E. Nikoletseas, C. L. Raptopoulos, and P. G.Spirakis. Determining majority in networks with local interactions andvery small local memory. Distributed Computing, 30(1):1–16, 2017.[75] B. Munsky and M. Khammash. The finite state projection algorithmfor the solution of the chemical master equation. Journal of ChemistryPhysics, 124(4):1–13, 2006.[76] J. Norris. Markov Chains. Cambridge Series in Statistical and Proba-bilistic Mathematics. Cambridge University Press, 1997.121Bibliography[77] R. Nussinov and A. B. Jacobson. Fast algorithm for predicting the sec-ondary structure of single-stranded RNA. Proceedings of the NationalAcademy of Sciences of the United States of America, 77(11):6309 –6313, November 1980.[78] R. Nussinov, G. Pieczenik, J. Griggs, and D. Kleitman. Algorithms forloop matchings. SIAM Journal on Applied Mathematics, 35(1):68–82,July 1978.[79] Z. Nutov and I. Beniaminy. A (1-1/e)-approximation algorithm for thegeneralized assignment problem. Operations Research Letters - ORL,34(3), 2006.[80] T. Pan. How RNA function requires structure. Nature Structural andMolecular Biology, 5(7):540–541, July 1998.[81] E. Perron, D. Vasudevan, and M. Vojnovic. Using three states forbinary consensus on complete graphs. In Proceedings of the 28th IEEEConference on Computer Communications (INFOCOM), pages 2527–2535, 2009.[82] A. Phillips and L. Cardelli. A programming language for composableDNA circuits. Journal of the Royal Society Interface, 6 Suppl 4:S419–436, August 2009.[83] L. Qian, E. Winfree, and J. Bruck. Neural network computation withDNA strand displacement cascades. Nature, 475(7356):368–372, 2011.[84] L. Qian and E. Winfree. A simple DNA gate motif for synthesizinglarge-scale circuits. Journal of The Royal Society Interface, pages 71–89, February 2011.[85] V. Rao and Y. W. Teh. Fast MCMC sampling for Markov jump pro-cesses and continuous time Bayesian networks. In Proceedings of theTwenty-Seventh Conference Annual Conference on Uncertainty in Ar-tificial Intelligence (UAI-11), pages 619–626, Corvallis, Oregon, 2011.[86] V. Rao and Y. W. Teh. MCMC for continuous-time discrete-statesystems. NIPS’12 Proceedings of the 25th International Conferenceon Neural Information Processing Systems, pages 701–709, 2012.[87] P. W. K. Rothemund. Folding DNA to create nanoscale shapes andpatterns. Nature, 440(7082):297–302, March 2006.122Bibliography[88] G. R.R., L. J.C., and C. J.J. The accuracy of ribosomal RNA com-parative structure models. Current Opinion in Structural Biology,12(3):301–310, 2002.[89] Y. Saad. Numerical Methods for Large Eigenvalue Problems. HalstedPress, New York, NY, 1992.[90] J. Schaeffer. The Multistrand Simulator: Stochastic Simulation of theKinetics of Multiple Interacting DNA Strands. PhD thesis, CaliforniaInstitute of Technology, 2012.[91] L. Schulman and D. Zuckerman. Asymptotically good codes correct-ing insertions, deletions, and transpositions. IEEE Transactions onInformation Theory, 45(7):2552–2557, November 1999.[92] G. Seelig, D. Soloveichik, D. Y. Zhang, and E. Winfree. Enzyme-freenucleic acid logic circuits. Science, 314(5805):1585–1588, 2006.[93] J. A. Shapiro. Genome Informatics: The Role of DNA in CellularComputations. Biological Theory, 1(3):288–301, September 2006.[94] D. Soloveichik, M. Cook, E. Winfree, and J. Bruck. Computationwith finite stochastic chemical reaction networks. Natural Computing,7(4):615–633, 2008.[95] D. Soloveichik, G. Seelig, and E. Winfree. DNA as a universal sub-strate for chemical kinetics. Proceedings of the National Academy ofSciences of the United States of America, 107(12):5393–5398, 2010.[96] M. Tacker, W. Fontana, P. F. Stadler, and P. Schuster. Statisticsof RNA melting kinetics. European Biophysics Journal, 23(1):29–38,April 1994.[97] X. Tang, B. Kirkpatrick, S. Thomas, G. Song, and N. M. Amato.Using motion planning to study RNA folding kinetics. Journal ofComputational Biology, 12(6):862–881, 2005.[98] X. Tang, S. Thomas, L. Tapia, and N. M. Amato. Tools for simulat-ing and analyzing RNA folding kinetics. In Proceedings of the 11thannual international conference on Research in computational molec-ular biology, RECOMB’07, pages 268–282, Berlin, Heidelberg, 2007.Springer-Verlag.123Bibliography[99] C. Thachuk, E. Winfree, and D. Soloveichik. Leakless DNA StrandDisplacement Systems. In DNA Computing and Molecular Program-ming, volume 9211 of Lecture Notes in Computer Science, pages 133–153. Springer International Publishing, August 2015.[100] D. Thirumalai, N. Lee, S. A. Woodson, and D. Klimov. Early eventsin RNA folding. Annual Review of Physical Chemistry, 52(1):751–762,2001.[101] I. Tinoco and C. Bustamante. How RNA folds. Journal of MolecularBiology, 293(2):271–281, October 1999.[102] I. Tinoco, D. Collin, and P. T. X. Li. The effect of force on thermo-dynamics and kinetics: unfolding single RNA molecules. BiochemicalSociety Transactions, 32(Pt 5):757–760, 2004.[103] S. Valadkhan. The spliceosome: caught in a web of shifting interac-tions. Current Opinion in Structural Biology, 17:310–315, 2007.[104] S. Venkataraman, R. M. Dirks, C. T. Ueda, and N. A. Pierce. Selectivecell death mediated by small conditional RNAs. Proceedings of theNational Academy of Sciences, 107(39):16777–16782, 2010.[105] Z. Wang, S. Mohamed, and N. de Freitas. Adaptive hamiltonian andriemann monte carlo samplers. In International Conference on Ma-chine Learning (ICML), 2013.[106] J. K. Wickiser, W. C. Winkler, R. R. Breaker, and D. M. Crothers. Thespeed of RNA transcription and metabolite binding kinetics operatean FMN riboswitch. Molecular Cell, 18:49–60, 2005.[107] R. C. Wilson and J. A. Doudna. Molecular mechanisms of RNA in-terference. Annual Review of Biophysics, 42:217–239, 2013.[108] M. T. Wolfinger, W. A. Svrcek-Seiler, C. Flamm, I. L. Hofacker, andP. F. Stadler. Exact folding dynamics of RNA secondary structures.Journal of Physics A: Mathematical and General, 37:4731–4741, 2004.[109] S. Wuchty, W. Fontana, I. L. Hofacker, and P. Schuster. Completesuboptimal folding of RNA and the stability of secondary structures.Biopolymers, 49(2):145–165, February 1999.[110] J. N. Zadeh, C. D. Steenberg, J. S. Bois, B. R. Wolfe, M. B. Pierce,A. R. Khan, R. M. Dirks, and N. A. Pierce. NUPACK: analysis and124Bibliographydesign of nucleic acid systems. Journal of Computational Chemistry,32:170–173, 2011.[111] J. N. Zadeh, B. R. Wolfe, and N. A. Pierce. Nucleic acid sequencedesign via efficient ensemble defect optimization. Journal of Compu-tational Chemistry, 32(3):439–452, February 2011.[112] D. Y. Zhang and E. Winfree. Control of DNA strand displacementkinetics using toehold exchange. Journal of the American ChemicalSociety, 131(47):17303–17314, 2009.[113] M. Zuker and P. Stiegler. Optimal computer folding of large RNAsequences using thermodynamics and auxiliary information. NucleicAcids Research, 9(1):133–148, January 1981.[114] M. Zuker and D. Sankoff. RNA secondary structures and their pre-diction. Bulletin of Mathematical Biology, 46(4):591–621, July 1984.125
- Library Home /
- Search Collections /
- Open Collections /
- Browse Collections /
- UBC Theses and Dissertations /
- Study of two biochemical models : chemical reaction...
Open Collections
UBC Theses and Dissertations
Featured Collection
UBC Theses and Dissertations
Study of two biochemical models : chemical reaction networks, and nucleic acid systems Hajiaghayi, Monir 2017
pdf
Page Metadata
Item Metadata
Title | Study of two biochemical models : chemical reaction networks, and nucleic acid systems |
Creator |
Hajiaghayi, Monir |
Publisher | University of British Columbia |
Date Issued | 2017 |
Description | The contributions of this thesis are motivated by an exciting challenge at the intersection of computer science and biochemistry: Can we program molecules to do interesting or useful computations? There has been significant progress in programming nucleic acids - particularly DNA molecules - thanks in part to availability of models and algorithms for predicting nucleic acid structure and folding kinetics. At a higher level of abstraction, Chemical Reaction Networks (CRNs) have proven to be valuable as a molecular programming model that enables researchers to understand the potential and limitations of computing with molecules, unencumbered by low-level details. These two levels of abstraction are linked; it is possible to "compile" CRN programs into nucleic acid systems that make the programs implementable in a test tube. We design and analyze CRN algorithms for two purposes. First, we show how any semilinear function can be computed by CRNs, even when no "leader" species (i.e., initial species with constant but non-zero counts) is present. Our results improve earlier results of Chen et al. (2012) who showed that only semilinear functions are computable by error-free CRNs using leaders. Our new CRN construction can be done in expected time O(n), which is faster than O(n log n) bound achieved by Chen et al. Second, we provide the most intuitive proofs of correctness and efficiency for three different CRNs computing Approximate Majority: Given a mixture of two types of species with an initial gap between their counts, a CRN computation must reach totality on the majority species with high probability. The CRNs of our interest have the ability to start with an initial gap of Ω(√n log n). In the second part of this thesis, we study the problem of predicting the Minimum Free Energy secondary structure (the set of base pairs) of a given set of nucleic acid strands with no pseudoknots (crossing base pairs). We show that this problem is APX-hard which implies that there does not exist a polynomial time approximation scheme for this problem, unless P = NP. We also propose a new Monte-Carlo based method to efficiently estimate nucleic acid folding kinetics. |
Genre |
Thesis/Dissertation |
Type |
Text |
Language | eng |
Date Available | 2017-10-16 |
Provider | Vancouver : University of British Columbia Library |
Rights | Attribution-NonCommercial-NoDerivatives 4.0 International |
DOI | 10.14288/1.0357130 |
URI | http://hdl.handle.net/2429/63311 |
Degree |
Doctor of Philosophy - PhD |
Program |
Bioinformatics |
Affiliation |
Science, Faculty of Computer Science, Department of |
Degree Grantor | University of British Columbia |
GraduationDate | 2017-11 |
Campus |
UBCV |
Scholarly Level | Graduate |
Rights URI | http://creativecommons.org/licenses/by-nc-nd/4.0/ |
AggregatedSourceRepository | DSpace |
Download
- Media
- 24-ubc_2017_november_hajiaghayi_monir.pdf [ 3.04MB ]
- Metadata
- JSON: 24-1.0357130.json
- JSON-LD: 24-1.0357130-ld.json
- RDF/XML (Pretty): 24-1.0357130-rdf.xml
- RDF/JSON: 24-1.0357130-rdf.json
- Turtle: 24-1.0357130-turtle.txt
- N-Triples: 24-1.0357130-rdf-ntriples.txt
- Original Record: 24-1.0357130-source.json
- Full Text
- 24-1.0357130-fulltext.txt
- Citation
- 24-1.0357130.ris
Full Text
Cite
Citation Scheme:
Usage Statistics
Share
Embed
Customize your widget with the following options, then copy and paste the code below into the HTML
of your page to embed this item in your website.
<div id="ubcOpenCollectionsWidgetDisplay">
<script id="ubcOpenCollectionsWidget"
src="{[{embed.src}]}"
data-item="{[{embed.item}]}"
data-collection="{[{embed.collection}]}"
data-metadata="{[{embed.showMetadata}]}"
data-width="{[{embed.width}]}"
async >
</script>
</div>
Our image viewer uses the IIIF 2.0 standard.
To load this item in other compatible viewers, use this url:
http://iiif.library.ubc.ca/presentation/dsp.24.1-0357130/manifest