Effective Heuristic Methods for DNA Strand Design .. by . Dan C. Tulpan B.Sc, POLITEHNICA University of Bucharest, 2000 A THESIS SUBMITTED IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF Doctor of Philosophy in The Faculty of Graduate Studies (Computer Science) The University Of British Columbia October 2006 © Dan C. Tulpan 2006 11 Abstract Sets of DNA strands that satisfy combinatorial and thermodynamic properties play an important role in various approaches to biomolecular computations, nano struc-ture design, molecular tagging, and DNA microarrays. The problem of designing such sets of DNA strands appears to be computationally hard. This thesis introduces new algorithms for design of DNA strand sets that satisfy any of several combinatorial and thermodynamic constraints, which aim to max-imize desired hybridization between strands and their complements, while mini-mizing undesired cross-hybridizations. To heuristically search for good strand sets for bio-computing applications, our algorithms use a conflict-driven stochastic lo-cal search approach, which is known to be effective in solving comparable search problems. We describe new and improved thermodynamic measures of the quality of strand sets. With respect to these measures of quality, our algorithms consistently find, within reasonable time, sets that are significantly better than previously pub-lished sets in the literature. We also present a detailed analysis and selection of heuristics for improving the quality of DNA strand selection criteria with direct applications in microarray probe design. iii Contents Abstract ii Contents iii List of Tables vi List of Figures ix Abbreviations xii Acknowledgements xiii Dedication xiv Statement of Co-Authorship xv 1 Introduction 1 2 DNA Strand Design 6 2.1 The DNA Strand Design Problem 6 2.2 Combinatorial Constraints 8 . 2.3 Thermodynamic Constraints 9 3 Related Work 15 3.1 DNA Strand Design Problems 15 3.2 Design Constraints 17 3.3 DNA Strand Design Algorithms 19 3.4 Stochastic Local Search Algorithms 20 3.5 Microarray Probe Design Algorithms 20 4 Combinatorial Design 22 4.1 Background and Problem Description 22 4.2 A Simple SLS Algorithm 23 Contents iv 4.2.1 Evaluation Criteria 26 4.2.2 In-Silico Experimental Protocols 26 4.2.3 Noise Parameter 9 26 4.2.4 RTD-based analysis 27 4.2.5 Quality of Strand Sets Obtained by our Algorithm . . . . 28 4.3 An Improved Stochastic Local Search Algorithm 34 4.3.1 Evaluation criteria 37 4.3.2 RTD-based analysis 38 4.3.3 Noise Parameter 42 4.3.4 New DNA Strand Sets and Empirical Bounds 43 5 C o m b i n a t o r i a l and T h e r m o d y n a m i c Design 47 5.1 Extended Stochastic Local Search Algorithm 47 5.2 Experimental Setup 48 5.2.1 Evaluation Criteria 49 5.2.2 In-Silico Experimental Protocols 53 5.3 Results and Discussion 55 5.3.1 Correlation between 6 and Concentrations 57 5.3.2 Equilibrium Concentration Dependency of DNA Strands . 58 5.4 Junction Optimization 61 6 M i c r o a r r a y Probe Design 65 6.1 Background and Problem Description 65 6.1.1 The Microarray Probe Design Problem 69 6.1.2 Probe Design Goals 69 6.1.3 Probe-Probe Hybridization 70 6.1.4 Target-Target Hybridization 72 6.1.5 Probe-Target Hybridization 72 6.2 Probe Selection Criteria 73 6.3 Our Probe Design Approach 77 6.3.1 Probe Length 79 6.3.2 Sequence Masking using BLAST 79 6.3.3 Melting Temperature 80 6.3.4 Self-hybridization 82 6.3.5 Other Criteria 82 6.3.6 The Probe Selection Process 83 6.4 Results and Discussion 85 6.4.1 Benchmark Data Sets 85 6.4.2 Evaluation Criteria 86 6.4.3 Preliminary Experiments 88 Contents v 6.4.4 Comparison with Other Methods 94 6.4.5 Run Times . . . . . 98 6.4.6 Improving Self-hybridization . .' 102 7 Conclusions and Future Directions 104 Bibliography 109 A Time Complexity for Determining C(pfc) Scores 121 B DNA Strand Sets 122 C Parameter Settings for Microarray Probe Design Programs 124 VI 1 List of Tables 4.1 Empirical bounds on C I quaternary codes 31 4.2 Best previously known bounds on the size of codes that satisfy the C I constraint as a function of word length n and Hamming distance d [12] 32 4.3 Empirical bounds on (C1,C5) quaternary codes for different n-d combinations 33 4.4 Empirical bounds on (C1,C2) quaternary codes 34 4.5 Best known bounds on the size of codes that satisfy the (C1,C2) constraints as a function of word length n and Hamming distance d [76] 35 4.6 Empirical bounds on (C1,C2,C5) quaternary codes obtained with the simple SLS algorithm (Results are reported in [112]) 36 4.7 Empirical bounds on (C1,C2,C5) quaternary codes obtained with the improved SLS Algorithm. 1-mutation+random code words neighborhoods have been used. The number of random code words used here are {10,100,1000, 5 000}. For n = 8, d = 4 we found a better bound, namely 112 code words by initializing our algorithm with the best previously known word set (108 code words) plus an additional random word 44 List of Tables vii 5.1 Control sets used as a basis for comparison with the results ob-tained from our algorithm. Each column of the table, other than the leftmost, corresponds to one set. Each column header gives a set ID and, for all but set S6, a reference to the paper in which the set was published. The next two rows report the strand length and the number of strands in each set. Following these, there is one row per type of combinatorial or thermodynamic constraint CI to T5 (except for T4 - see Section 2.3 for details). A check mark in a col-umn indicates that a constraint of the column's type was enforced when designing the set, and further information about the compo-sition of the strands is given where available. The second part of the table describes thermodynamic properties of control sets. For each set, the rows of the table give the free energy ranges cor-responding to constraints T l (desired strand-complement interac-tions), T2 (undesired strand-complement interactions) and T3 (un-desired complement-complement interactions) , respectively and the melting temperature range corresponding to constraint T5. Note that these thermodynamic constraints were not used as design con-straints for the control sets, other than those of Shortreed et al [98]. 54 5.2 Comparison of the quality of the control sets and our improved and en-larged sets. Each pair of rows (between delimiting lines) corresponds to one set, and lists quality measures for the control set, the improved set (indicated by the suffi x "-1") and the enlarged set (indicated by the suffi x "-2"). Improvements over the control set are highlighted in bold-face. The columns, from left to right, specify (1) set ID, (2) the strand length for the set, (3) the number of strands in the set, (4) the free en-ergy gaps 5 and 5*, (5) the melting temperature interval width, (6) the pairwise sensitivity, (7) the pairwise specifi city, (8) the pairwise discrim-ination, (9) the minimum free energy value as computed with Comb-Fold vl.O [6], (10) the minimum free energy gaps T and T * for junc-tions, and (11) the run time, measured as total CPU time on our refer-ence machine for running the respective experimental protocol until the given set was obtained. (See Section 2.3 for details). In the melting temperature column, the measurements were obtained using the func-tion calc_Tin_complementary_with_entwpy_santalucia_2004(...) of the PairFold vl.4 package [7]. Si-opt MFE values for junctions have been obtained after optimizing the arrangement of strands in subsets such that r* is maximized (where T* is the free energy gap that accounts for junc-tions, as defi ned in Equation 5.20) 56 List of Tables viii 6.1 Available software packages for microarray probe design and their corresponding probe selection criteria. Each cell marks the pres-ence of a certain criterion with a y/ sign and provides additional information specific to the experiments performed in the paper (if specified) that describes the software 78 6.2 Overview of the data sets used for comparing the accuracy of probe design programs and their properties 86 6.3 Combinations of sequence and motif base probabilities used to study the efficiency of different probe selection criteria 91 6.4 Probe set sizes obtained with our method and four publicly avail-able packages for DNA probe design. For our method, we report the number of probes that have melting temperatures in the. range [70,75]° C. . 95 6.5 Run-times for our method and four publicly available packages for DNA probe design. The run-times for our probe selection method do not include the evaluation step described in Section 6.3.6 and the self-hybridization optimization procedure described in Section 6.4.6. The values in brackets represent low accuracy run-time mea-surements for web-based software 101 6.6 Self-hybridization minimum free energy (MFE) values for probe sets designed with the four control packages and our method. Here, we report only the MFEs for probe sets with high melting temper-ature values in the range [70,75]° C. Osprey produced sets with only one probe for all the artificial data sets. The *-opt sets have been designed with our method using the self-hybridization probe design criterion 103 B . l A set of 112 DNA code words of length 8, satisfying the CI , C2 and C5 constraints. This set represents an improvement of a pre-vious set containing 108 DNA words obtained by Frutos et al. in 1997 [40] 122 B. 2 A set of 128 DNA code words of length 8, satisfying the CI , C2 and C5 constraints. This set represents the best improvement over Frutos et al. set obtained in 1997 [40] -." 123 C. 1 Summary of software parameter settings for Bacteriophage_phi3626 and Buchnera_sp 125 IX List of Figures 1.1 Research roadmap 4 4.1 Median number of steps to find solutions as a function of noise parameter 9 (set size fc=70, strandlength n=8, hamming distance d=4, constraints CI, C2 and C5) 27 4.2 RLDs for the SLS algorithm and exponential distributions with identical medians ((a) and (b)) and 90% quantile (a), (a): CI and C5 constraints (different set sizes k G {100,120,140}); (b): CI , C2, C5 constraints (SLS with and without random replacement). . 29 4.3 RLDs for different ^-mutation neighborhoods, set size k = 70, strand length n — 8, Hamming distance d = 4, constraints CI , C2 and C5 39 4.4 RTDs for different i/-mutation neighborhoods, set size k = 70, strand length n = 8, Hamming distance d — 4, constraints CI , C2 and C5 40 4.5 Median run-time in CPU seconds for different neighborhood sizes, set size k = 70, strand length n = 8, Hamming distance d = 4, GC content = 50%, constraints CI, C2 and C5 41 4.6 RLDs for purely random and hybrid neighborhoods of size 200, set size k = 70, strand length n = 8, Hamming distance d = 4, constraints CI, C2 and C5 42 4.7 RTDs (CPU time) for purely random and hybrid neighborhoods of size 200, set size k — 70, strand length n = 8, Hamming distance d — 4, constraints CI, C2 and C5 43 4.8 Median run-time of the enhanced SLS algorithm (in search steps) as a function of the noise parameter value: constraints CI , C2 and C5, n = 8, d — 4, and k — 70 for 1-mutation + random neighbor-hood 45 List of Figures x 5.1 Positive free energy gaps of correct and incorrect strand-complement pairs for set S5 by Penchovsky and Ackermann [83]. The three curves represent (from left to right) the cumulative distribution of the free energy values of all desired strand-complement hybrids, of all undesired strand-complement hybrids and of all (undesired) complement-complement hybrids. The two dots represent the spe-cific values of i and j that determine the free energy gap as defined in Equation ?? 49 5.2 Correlation between pairwise discrimination values and duplex free energy gaps for sets S5 Penchovsky (control) and S5-1 (improved); each point in the plot corresponds to the discrimination and 5 val-ues of a given strand and an incorrect complement 58 5.3 Concentration of strands, complements, perfect and imperfect matches as a function of S for sets (a) S5 Penchovsky (control), and (b) S5-1 (improved); each point in the plot corresponds to the equilibrium concentration of one single strand or duplex 60 6.1 Two key microarray parameters: the area per probe £, and the distance between probe sites, A 68 6.2 Distribution of C(jpk) values for 15 randomly selected sets of probes (5 sets per criterion) of length 25 for all genes of Bacteriophage_phi3626 selected using the following criteria: probes selected uniformly at random (baseline), probes selected uniformly at random after BLAST (Ml) and probes selected uniformly at random after BLAST (Ml) and subsequent removal of 6-base repeats (M6) 84 6.3 Correlation plot for MFE values obtained from Simplified PairFold vs PairFold for 1 000 pairs of randomly generated sequences of length 25. Pearson's correlation coefficient is r = 0.8314794. . . 88 6.4 C(pk) scores for BLAST seeds in the range 7-20 for Bacterio-phage_phi3626 (50 CDS). The distribution of C(pk) values of the probe set is displayed in the form of CDF curves, (a) When us-ing BLAST seeds from 7 to 12, the number of sequences (genes) that can provide probes increases up to the maximum of 50. (b), (c) and (d) Increasing the BLAST seed from 12 to 25 will allow probes with higher cross-hybridization to be selected in the final set. 90 6.5 C(pk) score distributions for 4 combinations of base probabilities for sequences and planted motifs and various probe selection crite-ria, (a) BP1, (b) BP2, (c) BP3, (d) BP4 92 List of Figures x i 6.6 C(pk) score distributions for 5 combinations of base probabilities for sequences and planted motifs and various probe selection crite-ria, (a) BP5, (b) BP6, (c) BP7, (d) BP8, (e) BP9 93 6.7 C(pk) values for sets of probes of length 25 produced with differ-ent software packages. The probe sets have been extracted from a benchmark set of sequences: B.subtilis (30 randomly selected CDS, length interval [444, 882], average length 552.5 nucleotides). The melting temperature interval was set to [70,75]°C for all soft-ware packages that supported this option 96 6.8 C(pk) values for sets of probes of length 25 produced with dif-ferent software packages. The probe sets have been designed for a benchmark set of sequences: Bacteriophage_phi3626 (50 CDS, length interval [120, 2 952], average length 621.78 nucleotides). The melting temperature interval was set to [70,75]°C for all soft-ware packages that supported this option 97 6.9 C(pk) values for sets of probes of length 25 produced with differ-ent software packages. The probe sets have been designed for a benchmark set of sequences: Buchriera_sp (573 CDS, length inter-val [117, 4224], average length 988.13 nucleotides). The melting temperature interval was set to [70,75]°C for all software packages that supported this option 98 6.10 C(pfc) score distributions for 4 combinations of base probabilities for sequences and planted motifs and various probe selection crite-ria, (a) BP1, (b) BP2, (c) BP3, (d) BP4 99 6.11 C(pk) score distributions for 5 combinations of base probabilities for sequences and planted motifs and various probe selection crite-ria, (a) BP5, (b) BP6, (c) BP7, (d) BP8, (e) BP9 100 Abbreviations b p Base pairs. C D S Coding sequence. c D N A Complementary DNA. D N A Deoxyribonucleic acid. d s D N A Double stranded deoxyribonucleic acid E S T Expressed Sequence Tag. M F E Minimum Free Energy. N N Nearest Neighbour. N P Non-deterministic polynomial time. R N A Ribonucleic acid. S L S Stochastic Local Search. S N P Single Nucleotide Polymorphism. ssDNA Single stranded deoxyribonucleic acid. xiii Acknowledgements First and foremost I must acknowledge my thesis supervisors, Professor Anne Con-don and Professor Holger Hoos. Throughout my time as a graduate student Anne and Holger have provided me with unending guidance in research, many years of financial support, and a genuine friendship. Anne and Holger have dedicated hundreds of hours of their time collaborating with me on various open problems and unresolved questions related to our research. Without their extensive support, patience and endless encouragement, my thesis would not have been possible. My thesis committee members certainly deserve recognition. Professor Alan Mackworth and Professor Robert Corn kindly agreed to serve as members of my supervisory committee and give of their time and wisdom. I would also like to thank the committee members for their suggestions and careful proof-reading of my thesis. I would like to thank Mirela Andronescu for the help provided when soft-ware technicalities became a burden, wonderful discussions and collaboration on various topics of bioinformatics, and especially for her enthusiasm and invalu-able friendship. I would like to thank the wonderful members, staff and alumni of the BETA and LCI Lab who made every day enjoyable: Sanja Rogic, Viann Chan, Rosalia Aguirre-Hernandez, Evelyn Fong, Hosna Jabbari, Baharak Raste-gari, Alena Shmygelska, Shelly Zhao, Sohrab Shah, Steph Durocher, Dave Tomp-kins, Frank Hutter, Jeremy Barbay, Kevin Smyth, Igor Naverniouk, Min Ouyang, Valerie McRae, Kenji Okuma, Pantelis Elinas. I should acknowledge our research collaborators from University of Wisconsin: Seo Bong Chang, Michael Shortreed, Lloyd Smith. Our sporadic exchanges of ideas and discussions on DNA thermodynamics provided a solid starting point for the material presented in Chapter 5. To my family, my parents Mariana and Dan, and my brother Dragos, thank you for your kindness and unlimited support throughout the years. You will undoubt-edly be relieved to see this thesis put to rest. I would like to thank my wife Mady and my son Dominic, for their support and smiles. Finally, my lovely wife Mady deserves a special acknowledgment for her encouragement, her patience, her kindness, her joy and her love throughout the entire duration of the thesis and beyond. THIS THESIS IS DEDICATED TO MY PARENTS Mariana Tulpan & Dan Corneliu Tulpan XV Statement of Co-Authorship Parts of this thesis have been co-written and published in conferences and journals [111-113]. Here is a list of contributions to the following areas: Identification and design of research program The identification and the design of the research program for this thesis has been decided by the author in collaboration with and supervised by Professor Anne Condon and Professor Holger Hoos. Performing the research The research for this thesis has been performed by the author in collaboration with co-authors of previously published papers [111-113]. Data analyses The data analyses for the current research has been performed by the author under the direct supervision of the thesis advisors, Professor Anne Condon and Professor Holger Hoos. Manuscript preparation This thesis has been prepared and organized by the author under the direct supervision of Professor Anne Condon and Professor Holger Hoos. The final version has been adjusted according to the comments and suggestions received from the supervisory committee and the external reviewer. 1 Chapter 1 Introduction 'Science may set limits to knowledge, but should not set limits to imagination." Bertrand Russell The design of DNA sequences is a key step in many biotechnological applications, such as biomolecular computing [37, 40], nano structure design [87, 120], molec-ular tagging [15, 16, 36], and DNA microarrays [47, 51, 85, 104]. In 1994, Adleman proposed to use a mixture of biochemical techniques based on combinatorial chemistry and enzymatic manipulation reactions to solve in-stances of NP-complete problems (e.g. Hamiltonian Path Problem) [1]. In the DNA computing experiments performed by Adleman, short oligonucleotides, called words, are units of information storage. Words are concatenated into long strands which store several bits of information. Three years later, a group of researchers from University of Wisconsin, Madison, adapted Adleman's ideas to DNA molecules attached to surfaces and proposed to perform logical manipulations of large sets of data by the hybridization and enzymatic manipulation of the attached DNA strands [40, 69]. With the advent of DNA microarray technology, the parallel analysis of large number of genes became a routine task. Microarrays are plastic or glass slides, to which a large number of short DNA strands of known sequence have been attached at fixed positions. The use of DNA microarrays relies on accurate design for probes that are immobilized on a surface and bind specifically to complementary targets in a complex solution. In the case of universal DNA microarrays, probes are not designed to bind to a genomic sequence or product, but instead probe complements are ligated to the genomic sequences, and the ligated product is captured on the array via specific hybridization between the probe and its complement [35]. DNA strands are also designed for use in DNA-templated organic synthesis and DNA display, whereby the 'template" DNA strands direct the production of a library of small polymers [43, 52]. In a surface based DNA computation [71], words, Chapter 1. Introduction 2 or their complements, are typically immobilized on a planar surface - just like oligonucleotide probes on a DNA microarray chip. In this work, we consider the problem of designing moderate-sized, high-quality sets of strands for such applications. The quality of a set of DNA strands heavily relies on DNA hybridization - the process of joining two single-strands of DNA to form a double-stranded molecule. Specific hybridization between a strand and its complement is the means for processing bits of information in a DNA computation and for detecting DNAs or RNAs with a microarray. Therefore, it is important that the strand oligonu-cleotides are designed so as to (1) maximize desired hybridization, i.e., specific hybridization between a strand and its perfect complement, and (2) minimize un-desired hybridization, i.e. nonspecific cross-hybridization between a strand and the complement of a distinct strand, or between two complementary strands under fixed environmental conditions such as temperature. In this work, we assume that strands are immobilized on a surface and cannot hybridize to each other. To design good strand sets computationally, it is important to have a reliable predictor of the quality of a set - that is, how well the set meets goals (1) and (2) experimentally. With respect to a given measure of quality, a key algorithmic challenge is to design a high-quality strand set, when the length of the strands and the size of the set are specified. An alternative formulation of the challenge is to design a set of strands of a given length whose quality meets a certain threshold, and which is as large as possible. The quality of a set of DNA strands is usually ensured by using specific design criteria. Some previous studies on DNA strand design uses combinatorial design criteria [15, 16, 36], similar to those used in coding theory [73], while others use thermodynamic criteria [83] . Combinatorial design criteria mostly rely on count-ing the presence/absence and the similarities/dissimilarities of the bases for one, and respectively, two DNA strands, ignoring the bio-chemical differences between the bases. An example of such a measure is the Hamming distance, which rep-resents the number of mismatches between two DNA strands. Thermodynamic design criteria take into account these differences, thus increasing the accuracy of estimating the degree of hybridization for two DNA strands. One such thermody-namic measure is the free energy of a pair of DNA strands, whose value decreases when the stability of a hybridized duplex increases. The following is an exam-ple that emphasizes the difference between combinatorial and thermodynamic cri-teria. The degree of hybridization between two different pairs of DNA strands, namely (GGGGTTTTGGGG, CCCCTTTTCCCC) and ( G C C G A A A A G C C G , C G G C A A A A C G G C ) would have the same quality when Hamming distance is used (the number of mismatches between the strands from both pairs is 8), while the free energies for the two hybridized pairs is -6.08 kcal/mol for the first pair, Chapter 1. Introduction 3 and -7.74 kcal/mol for the second pair (which is more stable). The underlined bases represent matched base pairs. A detailed description of combinatorial and thermodynamic constraints is provided in Chapter 2. The problem of designing large sets with high quality D N A strands is compu-tationally difficult because the size of the solution space (i.e. all possible sets of strands of the specified length and size) can be enormous (for example, for strands of length 10, there are more than 10 3 7 0 strand sets of size 100). There is no known efficient procedure that is guaranteed to find optimal strand sets even for the sim-plest design criteria that model interaction between two distinct strands. The difficulty of this problem is also influenced by the necessity of using differ-ent constraints, their selection being guided by biotechnological applications. The design of D N A strands for D N A computing, molecular tagging and nano struc-ture design demands less stringent constraints. For these applications, only the interaction between strands and corresponding complements must be considered. Over the past few years there has been considerable success in developing stochas-tic local search algorithms as well as randomised systematic search methods for solving these problems, and to date, stochastic search algorithms are amongst the best known techniques for solving problems from various domains of Artificial In-telligence and Operations Research, such as satisfiability, constraint satisfaction, planning, scheduling, and other application areas [57]. The design of D N A strands for microarrays requires more stringent constraints, due the necessity of choosing unique strands that perfectly complement specific regions from large genomic data sets, while avoiding hybridization with every other regions of the same genome. As a result, different heuristic techniques are used for microarray strand design. These techniques allows someone to quickly scan large genomic data sets and select probes from areas of interest, based on fast assessment of simple combinatorial or thermodynamically-based constraints [81, 89, 94, 116]. This thesis introduces novel heuristic algorithms that can be used to design high quality sets of D N A strands for a variety of biotechnological applications, with focus on biomolecular computing and microarrays. As shown in Figure 1.1, this research formulates three D N A strand design problems and explores efficient heuristic methods that efficiently solve them. The major contributions of this thesis are the following: 1. We introduce a simple stochastic local search algorithm for design of D N A strand sets, namely sets of equal-length words over the nucleotides alpha-bet {A, C, G , T) that satisfy simple combinatorial constraints, and analyze its performance using an empirical methodology based on run-time distribu-tions [56]. We report several cases in which our algorithm finds word sets that match or exceed the best previously known constructions. Chapter 1. Introduction 4 [ Effective Heuristic Methods for DNA Strand Design] Design Sets with Combinatorial Constraints Simple SLS Algor i thm Improved SLS Algor i thm Empirical Evaluations Parameter Optimization Better Code Bounds Design Sets with Combinatorial and Thermodynamic Constraints Design Probes for Microarrays Probe Selection Criteria New Quality Evaluation Criteria Thermodynamic Constraints Extended SLS Algor i thm New Quality Evaluation Criteria Model for Competi t ion Between Strands Figure 1.1: Research roadmap 2. We present an improved version of the simple stochastic local search al-gorithm for DNA strand design presented in Chapter 4. In particular, we describe the use of randomised, hybrid neighborhoods, that leads to substan-tial performance improvements. To our best knowledge, this type of neigh-bourhood has not been previously described and may well be applicable to SLS algorithms for other problems. We report cases in which the improved stochastic local search algorithm finds more words sets that match or exceed the previously known results. 3. We describe a new complex algorithm for design of strand sets, to be used in DNA computations or universal microarrays. Our algorithm can design sets that satisfy any of several thermodynamic and combinatorial constraints, which aim to maximize desired hybridizations between strands and their complements, while minimizing undesired cross-hybridizations. To heuristi-cally search for good strand sets, our algorithm uses a conflict-driven stochas-tic local search approach, which is known to be effective in solving compa-rable search problems. The PairFold program of Andronescu et al. [7] is used to calculate the minimum free energy of hybridization between two mismatched strands. 4. We describe new thermodynamic measures of the quality of strand sets. With respect to these measures of quality, our complex algorithm consistently finds, within reasonable time, sets that are significantly better than previ-Chapter 1. Introduction 5 ously published sets in the literature. 5. In this thesis, we introduce a modification of a thermodynamic-based mea-sure proposed by Bhanot et al. [10]. This measure estimates the quality of probe sets for DNA microarrays, and can be used for an in-silico evalu-ation of probe sets before performing in-vitro experiments. We also show evidence, that some previously used criteria for microarray probe selection could be unnecessary according to the newly introduced thermodynamic measure. The remainder of the thesis is organized as follows. Chapter 2 introduces the DNA Strand Design Problem and presents the combinatorial and thermodynamic models. Chapter 3 provides an overview of state-of-the-art methods for DNA strand design. Chapter 4 describes novel stochastic local search algorithms that can handle combinatorial constraints and their empirical analysis and evaluation. Chapter 5 presents an extended stochastic local search algorithm that can cope with combinatorial and thermodynamic constraints. Chapter 6 describes probe se-lection criteria for microarray probe design and investigates their efficency through a thermodynamically-based evaluation measure. Chapter 7 concludes the thesis and presents several directions for future research. 6 Chapter 2 DNA Strand Design 'The conclusion of design fbws naturally from the data; we should not shrink from it; we should embrace it and build on it. " Michael Behe, Cambridge University, 1994 Good design of D N A strands is a necessity for successful D N A computing, but how to achieve this goal is a challenge. This chapter introduces the D N A Code Design Problem and presents the underlying models that provide a reliable framework for solving this problem. 2.1 The DNA Strand Design Problem A single stranded D N A molecule is a long, unbranched polymer composed of only four types of subunits. These subunits are the deoxyribonucleotides containing the bases adenine (A), cytosine (C), guanine (G) and thymine (T). The nucleotides are linked together by chemical bonds and they are attached to a sugar-phosphate chain like four kinds of beads strung on a necklace. The sugar-phosphate chain is composed by alternating the ribose and phosphate for each nucleotide. This alternating structure gives each base and implicitly the whole strand a direction from the ribose end (denoted by 5') to the phosphate end (denoted by 3')- When the 573' ends are not explicitly labeled , D N A sequences are always written in 5' to 3' direction. "Parts of this chapter have been published. (1) Dan C. Tulpan, Holger Hoos, and Anne Condon (2003) 'Stochastic Local Search Algorithms for DNA Word Design", Proc. of Eighth International Meeting on DNA Based Computers (DNA 8), Lecture Notes in Computer Science 2568, pp.229-241, Springer Verlag; (2) D. Tulpan, M. Andronescu, M. Shortreed, S. Chang, A. Condon, H. Hoos, and L. Smith (2005) 'Thermodynarnically Based DNA Strand Design", Nucleic Acids Research, Vol. 33, No. 15, pp. 4951-4964. Chapter 2. DNA Strand Design 7 The nucleotides composing DNA bind to each other in pairs via hydrogen bonds in a process known as hybridization. Each nucleotide pairs up with its unique complement (the Watson-Crick complement), so C pairs up with G and A with T. C and G pair up in a more stable manner than A and T do, which is due to one extra hydrogen bond in the C-G pair. The Watson-Crick complement of a DNA strand is the strand obtained by replacing each C nucleotide with a G and vice versa, and each T nucleotide with an A and vice versa, and also switching the 5' and 3' ends. For example the Watson-Crick complement of 5'-AACTAG-3' is 3'-TTGATC-5'. Non-Watson-Crick base pairings are also possible. For example, G-T pairs occur in biological sequences. Formally, these notions are captured in the following definitions: A DNA strand w is represented by a string over the quaternary alphabet £={ A, C, G, T). String w corresponds to a single-stranded DNA molecule with left end of the string corresponding to the 5'-end of the DNA strand. The complement, or Watson-Crick complement, of a DNA strand w is obtained by reversing w and then by replacing each A with a T and vice versa, and replacing each C in s by G and vice versa. We denote the Watson-Crick complement of a DNA strand w with c(w) and we will simply use the terms complementary strand, or complement in this work. For simplicity, we simply use c; instead of c(wi). Throughout Chapters 1 to 5 we use symbols w\,W2, ...,Wk t 0 denote unique strands and c i , C 2 , to denote the corresponding complements. Thus, strand Wi and complement Cj form a perfect match (duplex), whereas strand Wi and com-plement Cj, and complements Cj and Cj, with j ^ i, form mismatches. In Chap-ter 6 we will call 'probes" the strands attached to the surface of microarrays and 'targets" the corresponding complements floating in the solution. We adopt this terminology because it is widely used in the literature on DNA microarrays. Finally, we formalize the decision version of the DNA Strand Design Problem. The DNA Strand Design (DSD) Problem is defined as follows: with respect to a fixed set C of constraints (which are not given as part of the input), given a set size k and strand length n, find a set S with k DNA strands of length n, satisfying all the constraints from C. It may be noted that this problem can be also formalized as an optimization problem, the optimization being done for the size of the set, k. The constraints used in the design of DNA strands can be grouped into two classes: combinatorial and thermodynamic constraints. The grouping is based on the measures that estimate the quality of one or more DNA strands upon which the constraint is applied. Combinatorial constraints mostly rely on verifying the Chapter 2. DNA Strand Design 8 presence or absence of certain nucleotides in a strand, and counting of matches and/or mismatches for pairs of DNA strands, whereas thermodynamic constraints take into account the biochemical properties that govern DNA hybridization. The current formulation of the problem needs to be changed for microarray probe design and will be provided in Chapter 6. 2.2 Combinatorial Constraints Combinatorial constraints are widely used to design large sets of non-interacting DNA strands. The necessity of designing sets of strands that avoid cross hybridiza-tion can be superficially modeled by enforcing high number of mismatches be-tween all possible pairs of strands and between pairs of strands and the comple-ments of others. The stability and uniformity of DNA strands can be also modeled with combinatorial constraints by either controlling the presence or absence of un-desired subsequences or counting specific bases within the same strand. Various types of combinatorial constraints are formalized in the following paragraphs. CI: Direct Mismatches. The number of mismatches in a perfect alignment of two strands (also known as Hamming distance) must be above a given threshold. Here, a perfect alignment is a pairing of bases from the two strands, where the ith base of the first strand is paired with the ith base of the second strand, and a mismatch is a pairing of two distinct bases. For example, G paired with A, C, or T is a mismatch whereas G paired with G is a match. For example, the strands u>=5'-AACAA-3' and u>'=5'-AAGAA-3' have one mismatch at position i = 3, where i — 1 : 5. This constraint can be applied to pairs of strands only, to pairs of complements only, or to both pairs of strands and complements. C2: Complement Mismatches. The number of mismatches in a perfect align-ment of a strand and a complement must be above a given threshold. For ex-ample, the complement of strand iu=5'-AACAA-3', which is c=5'-TTGTT-3' and strand u/=5'-TTATT-3' have one mismatch at position i = 3, where i — 1 : 5. This constraint is also called the reverse-complement constraint. C3: Slide Mismatches. The number of mismatches in a slide of one strand over another must be above a given threshold. A slide of strand w over another strand vJ (both of equal length) is an alignment of w and w', in which, for some i, the first base of w is paired with the ith base of w', the second base of w is paired with the (i + l)st base of w', and so on, so that the first i — 1 bases of w' and the last i — 1 bases of w are unpaired. For example, the Chapter 2. DNA Strand Design 9 strands u;=5'-ACC-3' and u/=5'-TGC-3' can overlap in three, two, one, or zero positions. If i = 2 (we start counting from zero, i.e., no slides), base 'A" of strand w is superposed over base 'C'from strand w '. C4: Consecutive Matches. The maximum number of consecutive matches be-tween all slides of one strand over the other must be in a given range. For example, the strands tu=5'-ACT-3' and «/'=5'-CTG-3' have a maximum of two matches (CT), when the second base (C) of w overlaps with the first base (C) of w'. C5: G C Content. The number of G's and/or C s in a strand must be in a given range. For example, all strands presented in [15] have 4, 5 or 6 C s . One such strand is u>=5'-TTACACCAATCTCTT-3', which contains five Cs. C6: Forbidden Subsequences. Each strand must not contain given undesired sub-sequences either along the whole strand, at the 5' and 3' ends and/or in the middle of the strand, as required. For example, the strand u>=5'-CCTTAACQ-S ' does not have occurrences of more than two identical bases along the whole strand. C7: Alphabet Size. Al l strands must contain bases belonging to a given non-empty subset of the alphabet {A,C,G,T}. It is a common practice to de-sign DNA strands over the {A,C,T} alphabet in order to minimize undesired structures, such as G-quartets [59, 100], that can be caused by the presence of G's. For example, strand w=5'-CTCCTACAATTCCTA-3' contains only letters A, C and T. Constraint C5 can be used to attain desired hybridization. Roughly, the stability of a strand-complement pair increases as the content of G's and C s within strands in-creases. Constraint C5 is also used to ensure uniform melting temperatures across desired strand-complement pairings. Constraints C1-C4, C6, and C7 can be used in various combinations to avoid undesired hybridization. As is widely known, the strength of hybridization between two strands (or between bases of the same strand) depends roughly on the number of nucleotide bonds formed (longer per-fectly paired regions are more stable than shorter ones; this is modeled by C1-C4, and C6), and on the types of nucleotides involved in bonding. 2.3 Thermodynamic Constraints While combinatorial constraints provide a good starting point for designing sets of strands for various biotechnological applications, their lack of accuracy is well Chapter 2. DNA Strand Design 10 known. More accurate measures of hybridization are required to efficiently predict the interaction between DNA strands. A higher level of accuracy can be achieved by understanding the thermodynamic laws that govern the interaction between bases of DNA strands. Thermodynamic constraints have been defined and used to design high quality sets of DNA strands. Before presenting the thermodynamic constraints necessary to obtain such sets, it is necessary to provide a brief introduction to the underlining thermodynamic principles that govern DNA strand interactions. The Nearest-Neighbour Thermodynamic Model With the primary goal of predicting secondary structure and DNA / RNA sta-bility from the base sequence, over the past 20 years a number of melting tempera-ture and diffraction based studies have been conducted to estimate the relationship between thermodynamic stability and DNA base sequences, in terms of nearest-neighbour base pair interactions. The nearest-neighbour (NN) model for nucleic acids was pioneered by Crothers et al. [28] and Tinoco and coworkers [31] starting in the 1960s. Subsequently, sev-eral papers and theoretical reports on DNA and RNA nearest-neighbour thermo-dynamics have been published. In 1986, Breslauer et al. reported a set of nearest-neighbour sequence-dependent stability parameters for DNA obtained from evalu-ation of melting curves of short DNA oligomers [18], Differences between DNA polymers and oligonucleotide thermodynamic trends have led to the conclusion that there must be a relation between N N thermodynamic parameters and the length of the DNA strands [95]. The simple nearest-neighbour model for nucleic acids assumes that the stability of a given base pair depends on the neighbouring base pairs. In double-stranded (complementary duplex) DNA there are 10 unique nearest-neighbour base pairs, oriented from 5' to 3', namely: [AA \ TT], [AG \ TC], [AC | TG], [GA j CT], [GG | CC], [TG | AC], [CG \ GC], [GC \ CG], [AT \ TA], [TA \ AT]. Here, [GA | CT] represents the duplex 5 ' GA-3' CT-5' 3' and [TC | AG] represents the duplex 5 ' TC - 3' AG-5' 3' Chapter 2. DNA Strand Design 11 Hence, the group [GA | CT] is identical to [TC \ AG] (the later is rotated by 180°) and we end up having 10 rather than 16 possible configurations. A measure of stability of neighbouring base pairs is their free energies. The free energy is a thermodynamic quantity, which can be used to determine if a re-action is spontaneous or not. The free energy of a DNA duplex is given by the Gibbs-Helmholtz equation: AG = AH - T x AS, where AS* is the entropy vari-ation, AH is the variation in heat content (also known as enthalpy) and T is the temperature. The heat content (H) can be thought of as heat transfer in matter. The change in enthalpy that occurs in a chemical reaction is due to the energy required to break the chemical bonds in the reactants and the energy produced by forming the chemical bonds of the products. In molecular thermodynamics, entropy (5) measures the dispersion of energy among molecules in a final state of a system compared to an initial state of the same system, at a given temperature. An entropy increase in a system involves energy dispersal among more microstates in the sys-tem's final state than in its initial state. Here a microstate of a system denotes one arrangement of the energies of its molecules at a given instant. A change in the arrangement of the energies of component molecules in a system will determine a new microstate of the system. Assuming the additivity of the NN thermodynamic parameters, which implies the absence of longer range interactions between pairs of bases, the following for-mula provides an estimate of the free energy of a DNA duplex: i where the sum is taken over all ten possible nearest-neighbour pairs i, and: • AG°(i) represents the free energy for the nearest-neighbour base pair i; • AG°nit represents the initialization free energy, which depends on the forma-tion of the first base pair interactions from the 5' and 3' ends ( AG°nitw^termG,c and AG" .. ,,. A T ) of the strands. imtw/terniA-1 ' A C ° — \r<° 4- /\c° u l J m i t — ^^initw/termG-C "T" initw/termA-T ' • n(i) represents the number of occurrences of the nearest-neighbour base pair i in the duplex. Free energies are always lower or equal with zero for any pair of DNA strands. The lower the free energy is for a DNA duplex, the more stable the duplex is. The following example illustrates the calculation of the free energy for the m u , , 5' - AACGTT - 3' DNA duplex 3 , _ T T G C A A _ 5 , • Chapter 2. DNA Strand Design 12 predicted = *G°<[AA\TT\) + AG°([AC\TG]) + AG°([CG\GC]) + AG°([GT\CA\) + AG°([TT\AA}) + A G ° m t The simple N N model presented above works for perfect complementary DNA strands. However, in many applications, DNA duplexes are used so that they don't match each other in a perfect way. For example, the DNA strand C C A A A A A C C binds to the DNA strand G G A A A A A G G by forming 4 base pairs. One way to estimate the free energy for this duplex is to sum the 4 terms, which may not equal the experimental values for the free energy. Furthermore, if we consider the following two DNA single-strands, namely A A G G C C G G A A and TCCGGCCTTA we observe that there is a higher chance to bind if we shift the first one to the left one position. In this thesis, we use the PairFold program implemented by Mirela Andronescu to compute free energies for DNA duplexes. The PairFold program predicts the secondary structure of a pair of RNA or DNA molecules based on free energy min-imization. The secondary structure for a pair of strands (w, w') is a set of base pairs, with each base of w and w' occurring in at most one pair. PairFold returns four secondary structures, for the pairs (w, w'), (w1, w), (w, w), and (w', w'), and their corresponding free energies. The algorithm uses dynamic programming and runs in time bounded by the cube of the lengths of the input strands. A detailed view of the computational model used in this work, which takes into considera-tion complex interactions between DNA sequences, can be found in the work of Andronescu et al. [7]. Constraints Thermodynamic constraints rely on the thermodynamic nearest-neighbor model presented above. We use the DNA parameters of SantaLucia Jr. [95] (a set of unpublished parameters have been obtained via private communication with John SantaLucia) and the PairFold vl.4 implementation of Andronescu et al. [5] to cal-culate minimum free energies of duplexes. We use the parameters and Equation (3) of SantaLucia and Hicks [96] to calculate melting temperature, assuming a 1M salt concentration and 10~ 7 M concentration of both strands and complements (for a total concentration of 2 * 10~ 7M). The values of these parameters have been set based on communications with researchers from Lloyd Smith's Laboratory from the Department of Chemistry, University of Wisconsin, USA. The melting tem-perature is only calculated for perfectly matched duplexes. Throughout, we use the following notation. T denotes the temperature of the reaction, R — 1.98717 Chapter 2. DNA Strand Design 13 kcal/(mol • K) is the gas constant, AG°(x, y) is the minimum free energy of the duplex xy at standard conditions (room temperature of 24° C and absolute pressure of 1 a,tm ), and TMi denotes the melting temperature of strand wi. In this thesis we define the following thermodynamic constraints: Tl: Perfect Match Free Energy. The free energy of a strand and its complement must be in a given range: for any i, A G 0 ( m l l C l ) e [ A G ; m l , A G U ] . For example, the perfect match free energy range of the strands in the set S=[ ACACT, A A A A A , TCGTC} is [-4.45, -1.94] kcal/mol. T2: Complement Mismatch Free Energy. The free energy of a strand and the complement of a distinct strand must be in a given range: for any i,j ^ i, AG°(wi,Cj) e [AG°min2lAG°max2). For example, the complement mismatch free energy range of the strands in the set S= {TTTAAA, A A A AAA} is [-1.04, -0.33] kcal/mol. T3: Complement-Complement Mismatch Free Energy. The free energy of a complement-complement duplex must be in a given range: for any i, j, AG° ( C l , c , ) e [AG°linS, AG°nax3}. For example, the complement-complement mismatch free energy range of the strands in the set S ={ AATTCC, GGCCAA, TTCCGG} is [-1.86, -0.81] kcal/mol. T4: Strand-Strand Mismatch Free Energy. The free energy of a pair of strands must be in a given range: for any i, j, AG°(wi,Wj) € [AG°mm4,AG°maxA]. For example, the strand-strand mismatch free energy range of the strands in the set S ={AATTCC, GGCCAA, TTCCGG) is [-1.76, -0.69] kcal/mol. T5: Melting Temperatures. The melting temperature for each strand and its com-plement must be in a given range: for any i, TMie[TMmm5,TMmax5}. For example, the melting temperature range of the strands in the set S ={ AAT TCCGGAA, CCAACCAATT} is [79.55,90.73] ° C. Constraint T5 can be used to select strands with uniform melting temperatures, with a higher degree of accuracy than by using GC content (C5), by choosing TMmax5 - TMmin5 to be small. Constraint T l can further help with selecting sequences by measuring the strength of the bonds (free energy) that form between Chapter 2. DNA Strand Design 14 each sequence and its perfect match, regardless of the number of matches or the individual nucleotides that take part in bonding [83]. Constraints T2, T3 and T4 are used to restrict undesired hybridization, which can occur between a strand and the complement of a different strand (T2), or be-tween two distinct complements (T3) or strands (T4). Typically, the given ranges should not overlap with the range for T l . 15 Chapter 3 Related Work "While a desktop PC is designed to perform one calculation very fast, DNA strands produce billions of potential answers simul-taneously. This makes the DNA computer suitable for solving "fuzzy logic" problems that have many possible solutions rather than the either/or logic of binary computers." National Geographic, Feb. 2004 The design of sets of short DNA strands that satisfy combinatorial and thermody-namic constraints, is motivated by the tasks of storing information in DNA strands used for computation, molecular bar-codes in chemical libraries [15, 16, 36, 119], or DNA microarrays [2, 39, 48]. Good strand design is important in order to min-imise errors due to non-specific hybridization between distinct strands and their complements, to achieve a higher information density, and to obtain large sets of strands for large-scale applications. For the types of combinatorial and thermodynamic constraints typically de-sired, there are no known efficient algorithms for design of DNA strand sets. Tech-niques from coding theory have been applied to design of DNA strand sets [16, 40]. While valuable, this approach is hampered by the complexity of the combinatorial constraints on the strand sets, which are often hard to reason about theoretically. For these reasons, heuristic approaches such as stochastic local search offer much promise in the design of strand sets. 3.1 DNA Strand Design Problems Coding theory literature is a good starting point for DNA code design problems, however, there are still many questions that need to be answered. Code design theorems provide constructions and bounds on sets of strands sat-isfying constraint CI [73]. Brouwer et al. [19, 20] presents a collection of tables listing the best known bounds of code sets over binary alphabets. Nevertheless, Chapter 3. Related Work 16 there are still combinations of relatively small strand lengths and distance thresh-olds, for which the size of the maximum strand set is unknown [66]. Less is known about the optimality of code sets over 4-letter alphabets. Tables with best known bounds on quaternary codes can be found in different digital databases [21, 99]. In biomolecular computing applications code design plays a crucial role in er-ror control and in designing libraries of short DNA strands for information storage. Adleman [1] and Lipton [70] first suggested the use of random sets of DNA strands for code design. Braich et al. [14] found the solution of an NP-complete problem (the 'Hamil-tonian Path Problem" with 7 cities) on a DNA computer in 7 days of wet lab work. They reported a set of 40 15-mers using combinatorial constraints CI , C2, C4, C5 and C7, nevertheless no algorithmic insights have been provided. Brenner et al. [17] used a set of 8 4-mers as a DNA vocabulary that helps to build a larger library of longer strands (approx. 17 million 32-mers), which in turn were used to identify and extract genes that are differentially expressed. They also report only the constraints used to obtain the set of 4-mers, without providing informa-tion about the algorithm used to'design them. Faulhammer et al. [36] used an RNA library of 20 15-mers to solve an instance of a computationally hard problem ('Knights Problem'). They developed a computer program called PERMUTE that is presented in more detail in Section 3.3. Frutos et al. [40] developed a DNA set of 108 8-mers, which can be used to store and manipulate information in DNA molecules attached on surfaces. They use a 'template-map" design strategy, where each DNA strand is defined as a combination of two 8-bit binary strings. Pen-chovsky et al. [83] developed a set of 24 16-mers, which can be used to encode binary information in DNA molecules. A detailed discussion of their design al-gorithm and criteria is presented in Section 3.2. Shortreed et al. [98] developed two sets of 64 12-mers and 16-mers, which can be used to encode large amounts of information using DNA strands. DNA strands from both sets have been used in a two round DNA computation for a small size 3-SAT problem. Garzon et al. [45] and Rose et al. [91] introduced an alternative formula-tion for the DNA strand design problem based on statistical mechanics principles. They assign a weight Z for each possible "hybridization configuration" (a perfectly matched duplex or a mismatched one) for all pairs of code strands. For each con-_ AG figuration they assign a statistical weight (e RT ) based on Boltzmann's statistical model. Here R represents the molar gas constant and T is the temperature. In the statistical formulation of the DNA code design problem, the goal is to find a set of strands for which ^ is small,where Ze is the sum of all statistical weights over all error configurations, and Zc is the sum of the 2"s over all configurations. Bortolussi and Sgarro [13] explored the possibility of formulating the DNA Chapter 3. Related Work 17 strand design problem in a constraint satisfaction (CSP) framework, where sets of strings are represented using rooted directed acyclic graphs. In the worst case, the graph representation of a DNA strand design problem contains an exponential number of nodes. Thus, solving even small size problems in this framework, be-comes a daunting task, even when pruning techniques are used to minimize the search space. The feasibility of solving problems of interesting size within this framework becomes a burden due to the following reasons: (i) the space of pos-sible solutions for a set size k and strand size n is of the order of 4.n'k, and (ii) the representation of feasible domains of the variables during the computation of a solution becomes a burden because direct memorization is not an option even for small k and ?;,. While this work describes new horizons for DNA strand design, no specific results have been reported. A large number of publications present design methods of variable length DNA strands for DNA computing applications related to synthetic self-assembly of DNA strands [25, 26, 117]. The CODEGEN program [65] uses methods from algebraic coding theory based on the formal language framework proposed by Hussini et al. [60] to produce DNA strands of variable length that avoid inter- and intra-molecular cross hybridizations. While their approach can, in theory, generate infinite sets of good variable length DNA strands, their implementation cannot generate sets of strands larger than 4 096 elements, or DNA strands longer than 256 bases. Theo-retical properties of languages that consist of DNA strands have been presented in a suite of publications [61, 63, 64]. While formal language descriptions of DNA strands can easily address cross hybridizations between sub-strands of distinct or identical DNA strands within a set, their ability to encode free energy, melting temperature and Hamming distance conditions still remains a challenge. DNA strand design plays also an important role in microarray experiments. Different criteria, such as strand length, number of strands per input sequence, maximal distance from the 5' or the 3' end, melting temperature ranges, thresholds to reject secondary structures, prohibited subsequences and cross-hybridization have to be considered to obtain an efficient design. A description of this problem together with detailed information on DNA strand selection criteria are presented in Chapter 6. 3.2 Design Constraints Combinatorial criteria have been widely used in the design of strand sets [8, 9, 29, 30, 121, 122]; criteria include uniform GC content across strands (so that de-sired hybridization between strands and their complements have similar melting temperature) and a large number of mismatches between two strands. The use of Chapter 3. Related Work 18 Hamming distance for DNA design was proposed by Deaton et al. [29, 30] in 1996. In 1997, Garzon et al. [44] introduced the H-measure, which is denned as the minimum of the Hamming distances calculated by shifting one sequence against the other. Marathe et al. [76] interpreted the constraints for designing DNA sequences from the perspective of combinatorial code design and they defined a set of 3 constraints based on the Hamming distance metric. Ben-Dor et al. [9] used interesting mathematical tools (DeBruijn sequences) for generation of strands of variable length that have similar melting temperatures. They used the '2-4 rule' for melting temperature estimation and they assign a weight to each strand, which rep-resents the melting temperature of the duplex formed by the strand and its perfect complement. However, combinatorial criteria alone can be a poor measure of the quality of a set. Thermodynamic aspects for DNA code design have been considered in an increasing number of papers. Garzon et al. [45] and Rose et al. [91] proposed thermodynamically-based measures of quality that are based on statistical mechan-ics principles. But it is difficult to computationally measure the quality of a set using their criteria for many reasons. First, the evaluation of all possible structural configurations of the reactants and products present in a reaction is computationally intractable. Second, even if the sets of thermodynamic parameters for all possible nearest neighbor interactions (including all sizes of internal and multi loops) would be completely known, the direct calculation of the computational incoherence mea-sure introduced by Garzon et al. requires the computation of the concentrations of all possible reactants and products of a reaction. To obtain the values of all concen-trations in a generalized hybridization reaction, where all products, reactants and their alternate states coexist, it is necessary to solve massive systems of coupled quadratic equations, which is still a challenge nowadays. Fan et al. [35] used thermodynamic criteria for design of probes for univer-sal microarrays, but provided little detail of their method or design criteria. Pen-chovsky and Ackermann [83] combined combinatorial and thermodynamic criteria to design sets of sequences for molecular computations. They required that the melting temperature of strands in a set lie within a small range, in order to maxi-mize desired hybridization. In addition, they introduced a new 'free energy gap" measure of the quality of a set, which is denoted by 5* in this work: to avoid un-desired hybridization, the gap between the largest MFE (minimum free energy) of any duplex formed from a strand and its complement, and the smallest MFE of any mismatch duplex formed from a strand and the complement of a distinct strand, should be as large as possible (S* is defined precisely in Chapter 5). They also de-veloped a heuristic algorithm to design sets of strands that uses random search. A weakness of their algorithm is that their means for measuring the free energy of a duplex secondary structure with mismatches relies on a program for measuring the Chapter 3. Related Work 19 free energy of a single-stranded molecule, requiring the linking of the strands with an artificial hairpin structure. Another weakness of their random search algorithm lies in the search strategy, which works as a generate-and-test mechanism (a sim-plification of hill climbing) — this type of simple search mechanism is generally known to achieve poorer results than more advanced stochastic local search meth-ods, such as the one underlying the strand design algorithm presented in Chapter 4 [57]. Also, Penchovsky and Ackermann do not provide comparisons of the qual-ity of sets obtained using their algorithm with those constructed using previous methods. 3.3 DNA Strand Design Algorithms Stochastic search methods (e.g. simulated annealing, genetic search'algorithms, random walk, lexicographic search) have been used successfully for decades in the construction of good binary codes [42, 54]. Typically, the focus of this work is in finding codes of size greater than the best previously known bound, and a detailed empirical analysis of the stochastic local search algorithms is presented in Chapters 4 and 5. Deaton et al. [29, 30] and Zhang and Shin [122] describe genetic algorithms for finding sets of DNA strands that satisfy much stronger constraints than the CI and C2 constraints, in which the number of mismatches must exceed a threshold even when the strands are aligned with shift (C3). However, they do not provide a detailed analysis of the performance of their algorithms. Hartemink et al. [53] used a computer algorithm for designing strand sets that satisfy yet other constraints, in which a large pool (several billion) of strands were screened in order to determine whether they meet the constraints. Several other researchers have used computer algorithms to generate sets of strands (see for example [15]), but provide no details on the algorithms. Some DNA strand design programs are publicly available. The DNASequence-Generator program [37] designs DNA sequences that satisfy certain sub-strand dis-tance constraints and, in addition, have melting temperature or GC content within prescribed ranges. The program can generate DNA sequences de-novo, or integrate partially specified strands or existing strands into the set. The PERMUTE program was used to design the strands of Faulhammer et al. [36] for their RNA-based 10-variable computation. Their program generates random strands using constraints C7 (i.e., strands defined over a three letter alphabet {A,C,T}) and T5 (i.e., av-erage melting temperature of 45° C) and then permutes the sequences to satisfy constraints C1,C2 and C4. Chapter 3. Related Work 20 3.4 Stochastic Local Search Algorithms Stochastic local search (SLS) algorithms strongly use randomised decisions while searching for solutions to a given problem. They play an increasingly important role for solving hard combinatorial problems from various domains of Artificial Intelligence and Operations Research, such as satisfiability, constraint satisfac-tion, planning, scheduling, and other application areas [57]. Over the past few years there has been considerable success in developing stochastic local search algorithms as well as randomised systematic search methods for solving these problems, and to date, stochastic search algorithms are amongst the best known techniques for solving problems from many domains. Detailed empirical studies are crucial for the analysis and development of such high-performance stochastic search techniques. Stochastic search methods have already been applied to the design of DNA strand sets. Deaton et al. [29, 30] and Zhang and Shin [122] used genetic algo-rithms for design of DNA strands, and provide some small sets of strands that satisfy well-motivated combinatorial constraints. Faulhammer et al. [36] also use a stochastic search approach and provide the code for their algorithm. In all cases, while small sets of strands produced by the algorithms have been presented (and the papers make other contributions independent of the strand de-sign algorithms), little or no analysis of algorithm performance is provided. As a result it is not possible to extract general insights on design of stochastic algorithms for code design or to do detailed comparisons of their approaches with other algo-rithms. One goal is to understand what algorithmic principles are most effective in the application of stochastic local search methods to the design of DNA strand sets. 3.5 Microarray Probe Design Algorithms A large number of probe design algorithms have been published in the past decade. Most of them use similar design algorithms and probe selection criteria, with only little variation [68, 81, 86, 89, 90, 93]. While some algorithms trade combinato-rial criteria for speed gain (ProbeSelect [68], ProMide [86]), others use thermo-dynarnically based criteria to achieve higher quality sets of probes (Osprey [46], ProbeMaker [105]). From the software locality point of view, we distinguish two groups of probe design programs: the client/server programs (these include web-based applica-tions) and the autonomous programs (can be installed on a local computer). Exam-ples of client/server software available online are OligoWiz [81] and ROSO [89]. Chapter 3. Related Work 21 OligoWiz selects probes based on a set of five criteria: specificity, melting temper-ature, position within transcripts, sequence complexity and base composition. A detailed discussion of probe selection criteria is presented in Chapter 6. A score is assigned for each criteria per probe and a weighted score is computed. The probes with the best scores are further selected and visualized in a graphical interface. ROSO consists of a suite of 5 probe selection steps. First, it filters out all input sequences that are identical and masks also the repeated bases within them. The second step consists of a similarity search using BLAST [3]. Next step consists of eliminating probes that self-hybridize. The fourth step eliminates all probes that have undesired melting temperatures. The last step selects a final set of probes based on four additional criteria: GC content, first and last bases, base repetitions and self-hybridization free energies. From the second group of programs, we distinguish two programs, namely OligoArray [94] and ProbeSelect [68]. OligoArray takes into account the fol-lowing criteria: specificity, position within transcript, melting temperature, self-hybridization and presence of specific subsequences in the probe. The algorithm runs as follows: it selects a probe starting from the end of the first sequence in the input data set and checks it against all the aforementioned criteria. If the probe satisfies all criteria, the algorithm moves to the next input sequence; otherwise it selects a new probe after a 10 base jump towards the 5' end of the input sequence. The process continues until a probe that satisfies all criteria is found. If no such probe is found, the algorithm selects a probe that has a lower probability for cross-hybridization. The ProbeSelect algorithm consists of seven steps. First, the pro-gram builds a suffix array [74] for the input sequences, to quickly identify the high similarity regions in the input sequences. Then, the algorithm builds a landscape for each input sequence. A landscape represents the frequency in the suffix array of all subsequences in the query input sequence. The third step consists in selecting a list with 10 to 20 candidate probes for each input sequence that minimizes the sum of the frequencies of their subsequences in the suffix array. The fourth step consists in searching for potential cross-hybridizations between each probe and the set of input sequences. Step 5 marks the cross-hybridizations within the input se-quences. The next step consists of melting temperature and self-hybridization free energy computations for each remaining probe. The last step consists in selecting the probes, which have the most stable perfect-match free energy and hybridize the least with all the other potential targets. In the next sections we present efficient algorithms that design high quality sets of DNA strands for various biotechnological applications. 22 Chapter 4 Combinatorial Design The subject of optimization is a fascinating blend of heuris-tics and rigor, of theory and experiment. It can be studied as a branch of pure mathematics, yet has applications in almost ev-ery branch of science and technology. " R. Fletcher, Practical Methods of Optimization, John Wiley 1987. This chapter introduces two Stochastic Local Search (SLS) algorithms that provide high quality solutions for the DNA Strand Design Problem introduced in Chapter 2. 4.1 Background and Problem Description The DNA strand design problem that we consider here is: given set size k and strand length n, find a set of k DNA strands, each of length n, satisfying combina-torial constraints CI , C2 and C5 (see Section 2.2). The total number of strands of length n defined over any quaternary alphabet is 4 n . The number of possible strand sets of size k that can be formed with 4 n strands is: For the particular example of strands with n = 8 and k = 100, the number of all possible strand sets is approximately 1.75 x 1 0 2 6 7 . The huge number of possible °Parts of this chapter have been published. (1) Dan C. Tulpan, Holger Hoos, and Anne Condon (2003) 'Stochastic Local Search Algorithms for DNA Word Design", Proc. of Eighth International Meeting on DNA Based Computers (DNA 8), Lecture Notes in Computer Science 2568, pp. 229-241, Springer Verlag; (2) Dan C. Tulpan and Holger H. Hoos (2003) 'Hybrid Randomized Neighborhoods Improve Stochastic Local Search for DNA Code Design", Proc. of the 16th Canadian Conference on Ar-tificial Intelligence (AI'20O3), Lecture Notes in Computer Science 2671, pp. 418-433, Springer Verlag. Chapter 4. Combinatorial Design 23 sets that must be explored in order to find a big set of strands suggests the use of non-exhaustive search algorithms for solving this type of problems. One class of such methods are stochastic local search algorithms, which have been used with success for many years in code design as well as in solving other combinatorial problems [55]. Here ,the goal is to create an algorithm that can deal with vari-ous combinations of constraints, namely direct, complement and slide mismatches, consecutive matches, GC content, forbidden subsequences and alphabet size. 4.2 A Simple SLS Algorithm Our basic stochastic local search algorithm performs a randomized iterative im-provement search in the space of DNA strand sets. This overall approach is similar to WalkSAT, an efficient and influential algorithm for the propositional satisfia-bility problem [55, 78], Casanova, one of the best-performing algorithms (at the time when it was proposed) for the winner determination problem in combinatorial auctions [58], and the Min Conflicts Heuristic for CSP [79, 80]. All strand sets constructed during the search process have exactly the target number of strands. Additionally, when using constraint C5, we ensure that all strands in the given candidate set always have the prescribed GC content. The other two combinatorial constraints considered in this thesis, the CI and C2 constraints, are binary predicates over DNA strands. The function our algorithm is trying to minimize is the number of pairs that violate the given constraint(s). Algorithm 1 gives a general outline of our basic algorithm. In the following, we will describe this algorithm in more detail. The initial set of strands is determined by a simple randomized procedure that generates any DNA strand of length n with equal probability. If constraint C5 is used, we check each strand that is generated and only accept strands with the specified GC content. Our implementation also provides the possibility to initialize the search with a given set of DNA strands, in particular, obtained with constructive algorithms known from coding theory; if such a set has less than k strands, it is expanded with randomly generated strands such that a set of k strands is always obtained. The same method as for producing random starting sets are used for generating these additional strands. Note that the initial strand set may contain multiple copies of the same strand. In each step of the search process (one execution of the inner for loop from Algorithm 1), first, a pair of strands violating one of the Hamming distance con-straints is selected uniformly at random. Then, for each of these strands, all possi-ble single-base modifications are considered. As an example of single-base mod-Chapter 4. Combinatorial Design 24 procedure Stochastic_Local_Search_for_DNA_Strand_Design (k, n, C) input: number of strands (k), strand length (n), set of constraints (C) output: set S ofk strands that fully or partially satisfies C for i •- 1 to maxTries do S := initial set of strands S:=S for j := 1 to maxSteps do if S satisfi es all constraints then return 5 end if Randomly select strands wi,wz G S that violate one of the constraints Mi := all strands obtained from wi by substituting one base Mi := all strands obtained from W2 by substituting one base with probability 8 do select strand w' from Mi U Mi at random otherwise select strand w' from Mi U M% such that number of conflcl violations is maximally decreased end with probability if w' G Mi then replace wi by w' in S else replace W2 by w' in S end if if S has no more constraint violations than S then S:= S-end if end for end for return S end Stochastic_Local_Search_for_DNA_Strand_Design A l g o r i t h m 1: Outline of a general stochastic local search procedure for DNA strand design. Chapter 4. Combinatorial Design 25 ifications consider the strand ACTT of length 4. A new strand GCTT can be obtained by replacing letter A from the first strand with letter G. When constraint C5 is used, this is restricted to modifications that maintain the specified GC con-tent. For a pair of strands of length n without constraint C5, this yields 6n new strands. (Some of these might be identical.) With a fixed probability 9, one of these modifications is accepted uniformly at random, regardless of the number of constraint violations that will result from it. In the remaining case (with probability 1 — 9), each modification is assigned a score, defined as the net decrease in the number of constraint violations caused by it, and a modification with maximal score is accepted. (If there are multiple such modifications, one of them is chosen uniformly at random.) Note that using this scheme, in each step of the algorithm, exactly one base in one strand is modified. The parameter 9, also called the noise parameter, controls the greediness of the search process; for high values of 9, constraint violations are not resolved effi-ciently, while for low values of 9, the search has more difficulties to escape from local optima of the underlying search space. Throughout the run of the algorithm, the best candidate solution encountered so far, i.e., the DNA strand set with the fewest constraint violations, is memorized. Note that even if the algorithm terminates without finding a valid set of size k, a valid subset can always be obtained by iteratively selecting pairs of strands that violate a Hamming distance constraint and removing one of the two strands in-volved in that conflict from the set. Hence, a strand set of size k with t constraint violations can always be reduced to a valid set of at least size k — t. Hamming distances between strands and/or their reverse complements are not recomputed in each iteration of the algorithm; instead, these are computed once after generating the initial set, and updated after each search step. This can be done very efficiently, since any modification of a single strand can only affect the Hamming distances between this strand and the k — 1 remaining strands in the set. The outer loop of our algorithm can perform multiple independent runs of the underlying search process. In conjunction with randomized search initialisation, multiple independent runs can yield better performance; essentially, this is the case if there is a risk for the search process to stagnate, i.e., to get stuck in a region of the underlying search space that it is very unlikely to escape from. In this context, the parameter maxSteps, which controls the number of steps after which the search is aborted and possibly restarted from a new initial strand set, can have an impor-tant impact on the performance of the algorithm; this will become apparent in our experimental results presented in Section 4.2.4. The simple SLS algorithm presented above can easily be enhanced with an additional diversification mechanism that is based on occasional random replace-ments of small subsets of strands. In this extended algorithm, after performing a Chapter 4. Combinatorial Design 26 search step, i.e., after each iteration of the inner loop of Algorithm 1, we check whether more than nsteps steps have been performed since the last improvement over the best candidate solution found so far. Whenever this is the case, we take it as in indication of search stagnation and perform a diversification step by replac-ing a fixed fraction fr of the strands (chosen uniformly at random) with strands that are newly generated at random (this is done exactly as during search initialisa-tion). After this we continue as in the simple SLS algorithms; we generally make sure that between any two diversification steps at least nsteps simple search steps are performed. The values nsteps and fr are additional parameters of the extended algorithm. 4.2.1 Evaluation Criteria To evaluate the performance of our SLS algorithm (and its variants), we performed two types of computational experiments. Detailed analyses of the run-time distri-butions of our algorithm on individual problem instances were used to study the behavior of the algorithm and the impact of parameter settings. For these empirical analyses, the methodology of Hoos and Stiitzle [56] for measuring and analyzing run-length and run-time distributions (RLDs and RTDs) of Las Vegas algorithms was used. Run-length was measured in terms of search steps, and absolute CPU time per search step was measured to obtain a cost model of these search steps . The other type of experiment used the optimized parameter settings obtained from the detailed analyses for obtaining DNA strand sets of maximal size for various strand lengths and combinatorial constraints. 4.2.2 In-Silico Experimental Protocols The absolute CPU time for each search step was measured on a PC with two 1GHz Pentium III CPUs (we ran one process per CPU), 512 MB cache and 1GB R A M running Red Hat Linux 7.1 (kernel 2.4.9). The obtained values range between 0.0006 CPU seconds for constraints CI and C5, and 0.0079 CPU seconds for con-straints CI and C2, both using set size k — 100, strand length n — 8, and Hamming distance d = 4. 4.2.3 Noise Parameter 6 Introducing noise in our algorithmic approach, i.e., using probabilistic moves when taking decisions, provides robustness to our algorithm and allows it to escape from local minima encountered during search. We learned that the setting of the noise parameter, 9, has a substantial effect on the performance of our algorithm: the Chapter 4. Combinatorial Design 27 100000 E 10000 1000 - i r —i r _i i i i_ 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45 0.5 Noise parameter theta Figure 4.1: Median number of steps to find solutions as a function of noise param-eter 9 (set size fc=70, strand length n=8, hamming distance d=4, constraints CI , C2 and C5). time required for solving a given problem instance can easily vary by more than an order of magnitude depending on the noise parameter setting used. Empirically optimal settings of 9 are consistently close to 0.2 for different problem instances and sizes; consequently, this value was used for all experimental results reported here (see Figure 4.1). Each data point in Figure 4.1 is averaged over one thousand runs of the algorithm for problem size 70, strand length 8, Hamming distance 4 and constraints CI, C2 and C5. 4.2.4 RTD-based analysis To study and characterize the behavior of our algorithm, we measured RTDs from 1 000 runs (i.e., maxTries = 1 000) of the algorithm applied to individual problem instances, using extremely high settings of the cutoff parameter, maxSteps, to en-sure that a solution was found in each run without using random restarts. For each run, we recorded the number of search steps (i.e., executions of the inner for loop Chapter 4. Combinatorial Design 28 of Algorithm 1) required for finding a solution. From this data, the RTD gives the probability of success as a function of the number of search steps performed. As can be seen in Figure 4.2 (a), for given constraints, n, and d, the time required for obtaining a strand set of size k with a fixed probability p increases with k for all p. Furthermore, for high p, this increase is much more dramatic than for low p, as can be seen in the 'fat" right tails of the RTDs. This indicates that our simple SLS algorithm, even when using optimized noise settings, can suffer from stagnation behavior that compromises its performance for long run-times. This is evident from comparing the RTD curves with appropriately fitted exponential distributions, as we can see in Figure 4.2 (a) The easiest way to overcome this stagnation effect is to restart the algorithm after a fixed number maxSteps of search steps. Optimal values of maxSteps can be easily determined from the empirical RTD data [56]. When solving a problem instance for the first time, however, this method is not applicable (unless the RTD can be estimated a priori), and the actual performance depends crucially on the value used for maxSteps. It is therefore desirable to find different mechanisms for ameliorating or elim-inating the observed search stagnation. The random replacement variant of our simple SLS algorithm described at the end of the previous section was designed to achieve this goal, and in many cases substantially reduces the stagnation effect (Figure 4.2 shows a typical example), leading to a significant improvement in the robustness and performance of our algorithm. However, as can be shown by com-paring the empirical RTDs with appropriately fitted exponential distributions (see Figure 4.2: (b)), even this modification does not completely eliminate the stagna-tion behavior. 4.2.5 Quality of Strand Sets Obtained by our Algorithm Using the improvements and parameter settings obtained from the detailed analysis described above, we used our SLS algorithm to compute strand sets for a large number of DNA strand design problems. We report the sizes of strand sets obtained by the SLS algorithm presented in Section 4.2. This data illustrates in detail the performance of our algorithm and will be useful as a baseline for future work. Most of the strand sets are publicly available at the following URL: h t t p : / / w w w . c s . u b c . c a / ~ d c t u l p a n / p a p e r s / d n a 8 / t a b l e s . The data on our strand sets is grouped in four tables showing the results for different constraint combinations as follows: C I (Table 4.1), C1+C5 (Table 4.3), C1+C2 (Table 4.4) and C1+C2+C5 constraints (Table 4.7). In these tables, entries Chapter 4. Combinatorial Design 29 10 100 1000 10000 100000 1e+06 Number of steps (a) 100 1000 10000 100000 1e+06 Number of steps. (b) Figure 4.2: RLDs for the SLS algorithm and exponential distributions with identi-cal medians ((a) and (b)) and 90% quantile (a), (a): CI and C5 constraints (different set sizes Ac <E {100,120,140}); (b): CI , C2, C5 constraints (SLS with and without random replacement). Chapter 4. Combinatorial Design 30 in bold face match or exceed the best previously known set sizes (lower bounds). All other values are smaller than known set sizes for corresponding (n, d) values. Dashes ('-') indicate cases of limited or no interest for DNA strand design, for which we did not run our algorithm. Along with each maximal word set size found for our algorithm, we also report the corresponding median number of steps spent by our algorithm in order to find that set (in square brackets, [...], typically specified in thousands of search steps). Each table entry is based on between 5 and 20 runs of our algorithm; the precise number of runs was varied with problem size and solution quality; median run times are taken over successful runs only. Tables 4.2 and 4.5 contain the best previously known theoretical lower and upper bounds on sizes of strand sets. In cases, where lower and upper bounds are identical, only a single value is shown. The corresponding results can be summarized as follows: Constraint CI only. There is a significant body of work on the construction of strand sets over an alphabet of size 4 that satisfy the CI constraint only, with the best bounds summarized by Bogdanova et al. [12]. We used our algorithm to design strand sets in which the strand lengths ranged from 4 to 12 with the min-imum Hamming distance between pairs of strands ranging from 4 to 10. Out of 42 tests, our algorithm was able to find a strand set whose size matches the best known theoretical construction in 16 cases. Constraints C1+C2. In this case, we considered strand lengths ranging in length from 4 to 12, and the minimum Hamming and reverse complement Ham-ming distances between pairs of strands ranging from 4 to 10. Out of 42 tests, our algorithm was able to find a strand set whose size matched or exceeded the theo-retical results of Marathe et al. [76] in 41 cases. Constraints C1+C5. We obtained strand sets satisfying the constraints CI and C5 for strand length ranging from 4 up to 20, and the minimum Hamming distance between pair of strands ranging from 2 to 10. Compared to the results reported by Corn et al. [69], we obtained bigger sets in almost all cases except two particular combinations, namely strand length 8 and Hamming distance 4, and strand length 12 and Hamming distance 6. Constraints C1+C2+C5. We obtained strand sets for strand lengths ranging from 4 to 12, and the minimum Hamming and reverse complement Hamming dis-tances between pairs of strands ranging in length from 4 to 10. There was no body of previous results that we could compare with except in one case, namely for n\d 3 4 5 6 7 8 9 10 11 4 - 4 [.Olfc] - - - - - - -5 - 16 [1.5*;] 4 [.02fe] - - - - - -6 - 64 [80fc] 9 [15*;] 4 [.02k] - - - - -7 - 78 [70fc] 23 [75k] 8 [.5k] 4 [.04fc] - - - -8 - 219 [65k] 55 [40*;] 19 [50k] 5 [.03fe] 4 [.09fc] - - -9 - 616 [285fc] 138 [6Sfe] 41 [50fc] 15 [50fc] 5 j. l fc] 4 (.12k] - -10 - 1783 [471fc] ' 358 [312fe] 95 [130k] 32 [75fc] 12 [5k] 5 [.5k] 4 [.32k] -11 - 5598 [523fc] 967 [592*;] 227 [165fc + . l k ] 70 [36 k] 27 [545fc] 10 [15k] 4 [.05k] -12 - > 10000 [24*;] 2689 [263fe] 578 [300k] 156 [735fc] 53 [390k] 23 [532k + .5k] 9 (.5k + 5fc] -n o 3 o-3 Table 4.1: Empirical bounds on CI quaternary codes. n \ d 3 4 5 6 7 8 9 10 11 4 16 4 - - - - - - -5 64 16 4 - - - - - -6 164- 179 64 9 4 - - - - -7 512-614 128- 179 32 8 4 - - - -8 2048 - 2340 320-614 70- 128 32 5 4 - - -9 8192-9360 1024-2340 256-512 64- 128 18-20 5 4 - -10 17408-30427 4096 - 9360 1024 - 2048 256-512 40-80 16 5 4 -11 65536- 109226 16384-30427 4096-6241 1024 - 2048 92 - 320 32-64 12 4 4 12 262144 - 419430 65536-109226 8192-20852 4096 - 6241 256- 1280 64 - 242 24-48 9 4 Table 4.2: Best previously known bounds on the size of codes that satisfy the CI constraint as a function of word length n and Hamming distance d [12]. 9 o o 3 g CD 3 to 4-2 5-2 6-3 7-3 8-4 9-4 10-5 11-5 72-6 48 ].3k] 142 [93fc] 85 [860fe] 230 [650fc] 209 [J 400 [72(c) 256 [250fc] 620 [48fc] 410 [] 13-6 14-7 15-7 76-S 17-8 18-9 79-9 | 20-70 990 [36fc] 500 [5fc] 1571 [lOOOfc] 966 [242fc] 2355 [745fc] 1200 [15k] 3451 [396fc] 2193 [596k] -Table 4.3: Empirical bounds on (C1,C5) quaternary codes for different n-d combinations. Chapter 4. Combinatorial Design 34 n\d 4 5 6 7 8 9 10 4 2 [.0'2fc] - - - - - -5 4 [.02fc] 2 [.005*;] - - - - -6 28 [28k] 4 [.lfc] 2 [.02*:] - - - -7 40 [254fc] 11 [3*:] 2 (.005*:] 2 [.05*:] - - -8 112 [850k] 27 [7fc] 10 [31*;] 2 [.Oil:] 2 [.03*:] - -9 314 [677*;] 72 [142*;] 20 [37k] 8 (10*:] 2 [.02*:] 2 [.04*:] -10 938 [702fe] ISO [386*;] 49 [287*:] lli [495*:] 8 [11.01k] 2 [.01*:] 2 [.02fc] 11 2750 [117*:] 488 [257*:] 114 [145*;] 35 [Ok] 12 [lfc] 5 [2k] 2 [.05k] 12 >8<I(I0 [72*:] 1.340 [400*:] 290 [327*:] 79 [236*:] 27 [712.5*;] 11 [828fc] 4 (.2fc) Table 4.4: Empirical bounds on (C1,C2) quaternary codes. strands of length 8 with 50% GC content that satisfy the Hamming and reverse-complement constraints with at least 4 mismatches between pairs of strands. For this case, Frutos et al. [40] constructed a set of 108 strands of length 8. We were not able to obtain any 108 DNA strand sets satisfying these constraints, using our algorithm when initialized with a random initial set. The biggest set found had 92 strands. However, when we initialized our algorithm with the set of 108 strands obtained by Frutos et al., along with one additional randomly chosen strand, we were able to obtain a valid set of size 109. Continuing this incremental improve-ment procedure, we were able to construct sets containing 112 strands in less than one day of CPU time on our reference machine. Using the same approach we were able to extend the set of 92 strands up to only 107 strands. 4.3 An Improved Stochastic Local Search Algorithm The improved SLS algorithm uses the same improvement strategy as the previous one, with one modification. Here, we propose different types of neighborhood gen-eration mechanisms (mostly based on randomisation), which lead to performance improvements of the basic SLS algorithm. As in the case of the simple SLS algorithm, only strands with the prescribed GC content are considered during the search, and the neighborhoods are restricted accordingly. In this section, we consider the following neighborhoods: ^-mutation neighborhoods. The ^-mutation neighborhood of a given strand WQ consists of all strands that can be obtained from WQ by modifying up to v bases, except w0 itself. Our previous SLS algorithm was based on the 1-n\d 2 3 4 5 6 7 8 2 2 - - - - - -3 0 - 2 3 0 - 2 3 - - - - -4 ' 2 5 2 - 8 2 - - - -5 4 - 2 7 2 - 2 7 0 - 2 6 0 - 8 - - -6 2° 16 - 107 8 - 107 2 - 5 2 - -7 33 - 2 1 1 16 - 2 1 1 2 - 372 2 - 372 0 - 3 4 0 - 4 -8 160 - 1310 128 - 1310 8 - 118 8 - 118 2 - 3 2 9 354 - 2 1 5 160 - 2 1 5 8 - 4681 8 - 4681 2 - 372 2 - 372 2 - 1 4 10 2 1 7 1184 - 16912 320 - 16912 32 - 1202 32 - 1202 8 - 142 8 - 8 11 3903 - 2 1 9 576 - 2 1 9 60 - 61680 32 - 61680 8 - 3964 8 - 3964 8 - 420 12 2 1 13233 - 226719 2304 - 226719 173 - 13294 96 - 13294 8 - 1276 8 - 1276 9 sa c? -s 9 3 JS § fa 3 Table 4.5: Best known bounds on the size of codes that satisfy the (C1,C2) constraints as a function of word length n and Hamming distance d [76]. Oft n\d 2 3 4 5 6 7 8 9 10 4 20 [.03k] 5 [.02fc] 2 [.001k] - - - - - -6 282 [lfc] 37 [19fc] 11 [3fc] 2 [.003fc] 2 j.OOOfc] - - - -8 3981 [20fc] 350 [119fc] 92* [4000k] 19 [1.2fc] 7 [10fc] 2 [.Olfc] 1 [.02fc] - -10 X 3700 [287fc] 640 [406fc] 127 [170.5fc] 37 [38 fc] 11 [134k] 5 [1.3fc] 2 [.02] 1 [.005fc] 12 X X 5685 [455 fc] 933 [531fc] 210 [121.5k] 59 [77fc] 21 [217fcJ 9 [341.5fc] 3 [1.5k] 9 Co -a n o 3 to Table 4.6: Empirical bounds on (C1,C2,C5) quaternary codes obtained with the simple SLS algorithm (Results are reported in [112]). ON Chapter 4. Combinatorial Design 37 mutation neighborhood; for a given pair of strands of length n (as considered in each step of our algorithm), there are Q(n) 1-mutation neighbors that fulfill the GC constraint. The 3-mutation neighborhood of a pair of strands, in contrast, consists of Q(n3) strands satisfying constraint C5. Purely random neighborhoods. Another simple way of choosing neighborhoods of a given strand wo is by generating a fixed number of random strands with the same length and GC content as WQ. This type of neighborhood increases the mobility of the algorithm within the search space and supports search steps that are equivalent to several search steps in a ^-neighborhood with small v. Use of this rather simplistic neighborhood mechanism leads to substantial improvements in the performance of our algorithm, as we will document in Section 4.3.2. Hybrid randomized neighborhoods. These are obtained by adding elements of the purely random neighborhood to a i/-mutation neighborhood. This ef-fectively enables the algorithm to explore regions of the search space that could not be reached easily using purely ^-mutation neighborhoods while still keeping the search focused on local regions of the space of candidate solutions (sets of strands). Extra randomness added on the top of ^ -mutation neighbors enhances the ability of the SLS algorithm to exit from local min-ima regions and eventually find solutions faster. As our empirical results show, this novel type of neighborhood leads to substantial improvements in the performance of our SLS algorithm. In Section 4.3.2 we will discuss the efficiency of these neighborhoods and their impact on the performance of our SLS algorithm for DNA strand design. In partic-ular, we will see that the use of randomized neighborhoods helps to avoid search stagnation and allows the algorithm to find bigger DNA strands. 4.3.1 Evaluation criteria To evaluate the performance of our improved SLS algorithm, we performed two types of computational experiments. Detailed analyses of the run-time and run-length distributions of our algorithm on individual problem instances were used to study the behavior of the algorithm and the impact of parameter settings. For these empirical analyses, the methodology of Hoos and Stutzle [56] for measuring and analyzing run-time distributions (RTDs) and run-length distributions (RLDs) of Las Vegas algorithms was used. Run-length was measured in terms of search steps, and absolute CPU time per search step was measured to obtain a cost model Chapter 4. Combinatorial Design 38 of these search steps. The other type of experiment used the optimized parame-ter settings obtained from the detailed analyses for obtaining DNA strand sets of maximal size for various strand lengths and combinatorial constraints. For these types of computational experiments we used the same machine specifications as described in Section 4.2.2. 4.3.2 RTD-based analysis To study and characterize the behavior of the new proposed neighborhood mech-anisms for the simple SLS algorithm, we measured RTDs and RLDs from 1 000 successful runs of the algorithm applied to a representative problem instance with set size k — 70, strands length n — 8, Hamming distance d = 4 and all 3 con-straints (CI, C2, C5). Experiments with other problem instances gave analogous results (not reported here) to the ones reported for this instance. Using extremely high settings of the cutoff parameter ensured that a solution was found in each in-dividual run without using random restarts. Foreach run corresponding to different neighborhood settings we collected the number of search steps required for finding a solution to the given problem instance. From this data, the RLD gives the prob-ability of success as a function of the number of search steps performed. We also obtain RTDs by multiplying the number of search steps with the corresponding CPU time per step. //-mutation neighborhoods. The 1-mutation neighborhood has been successfully used in the simple SLS algo-rithm introduced in Section 4.2. While one would expect that using z/-mutation neighborhoods with v > 1 decreases the number of search steps required for find-ing certain strand sets, it is not clear whether the increased computational cost of scanning these larger neighborhoods can be amortized. Figure 4.3 shows RLDs for different types of //-mutation neighborhoods for the problem instance consid-ered above. The time required for obtaining a set of strands of size fc = 70 with a fixed probability p increases with v and p. For high p, this increase is much more dramatic than for low p values. The right 'fat" tails of the RLDs emphasize this phenomenon. As can be seen from Figure 4.4, the higher time complexity of the search steps using ^-mutation neighborhoods with v > 1 is not amortized by the reduction in the number of search steps. Similar results were obtained for other problem instances. Chapter 4. Combinatorial Design 39 "lOO 1000 10000 100000 1e+06 Number of iterations Figure 4.3: RLDs for different i/-mutation neighborhoods, set size k — 70, strand length n = 8, Hamming distance d — 4, constraints CI , C2 and C5. Purely random neighborhoods. The use of purely random neighborhoods leads to better performance of our algo-rithm than any of the z -^mutation neighborhoods. For our representative problem instance, this can be seen when comparing the medians of the RTDs for ^-mutation neighborhoods in Figure 4.4 with the median run-times for purely random neigh-borhoods for varying sizes shown in Figure 4.5. At the same time, using purely random neighborhoods leads to RTDs that do not have the same Tat" right tails that indicated the stagnation behaviour of our SLS algorithm for the z/-mutation neighborhoods. This leads to even more substantial performance advantages of purely random neighborhoods over the i^-mutation neighborhoods when compar-ing the mean CPU times for solving a given problem instance or high percentiles of the respective RTDs. When varying the size of the purely random neighborhoods, i.e., the number of strands considered for replacing a given strand in a candidate strand set, we found that typically there is an optimal range of neighborhood sizes. When using smaller neighborhoods, the performance of the algorithm decreases since the number of search steps required for finding a solution increases. For larger neighborhoods, the time complexity for each search step increases, and at some point the reduction in Chapter 4. Combinatorial Design 40 CPU Time [sec] Figure 4.4: RTDs for different z/-mutation neighborhoods, set size k = 70, strand length n = 8, Hamming distance d = 4, constraints CI , C2 and C5. the number of search steps required for finding a solution no longer amortizes this higher cost. This is illustrated for our representative problem instance in Figure 4.5. Hybrid neighborhoods. We noticed that adding random strands to i/-mutation neighborhoods leads to im-proved performance of our SLS algorithm. This raises the following question: is there any reason to keep the ^-mutation neighborhood as part of a bigger, random neighborhood? We investigated this question in an experiment in which we compared three hy-brid neighborhoods of 200 strands that include the 1-mutation, 2-mutation, and 3-mutation neighborhoods, respectively,as well as a purely random neighborhood of size 200. From Figures 4.6 and 4.7 we can see that for our representative problem instance, including the 1-mutation neighborhood in a bigger random neighborhood results in slight performance improvements in terms of iterations as well as CPU time, while including the 2- or 3-mutation neighborhoods is disadvantageous. In a second experiment we tested whether this result also .holds for different neighborhood sizes. As can be seen from Figure 4.5, hybrid neighborhoods ob-Chapter 4. Combinatorial Design 41 100 , 1-mutation + random ngb — t — Pure random ngb — - x -0 100 200 300 400 500 600 700 800 900 1000 Neighborhood size Figure 4.5: Median run-time in CPU seconds for different neighborhood sizes, set size fc = 70, strand length n — 8, Hamming distance d = 4, GC content = 50%, constraints CI , C2 and C5. tained by adding random neighbors to the 1-mutation neighborhood generally leads to improved performance compared to using purely random neighborhoods of the same size. One intuitive explanation for the efficiency of using hybrid neighbor-hoods is based on the fact that 1-mutation neighbors can be easily 'mutated" back into the original strand. This mechanism allows the algorithm to easily and cheaply reverse problematic search steps that, for example, may lead into a local minimum of the underlying search space. Further experiments have been performed for different (fc, n, d) combinations (e.g., (102,10,5), (10,10,7), (15,12,8), (25,6,3)) and different set sizes and GC con-tent fractions (e.g., k — 56 and GC content = 3, fc = 28 and GC content = 2, and k = 8 and GC content =1). Our algorithm found solutions faster when we used hybrid neighborhoods than when using purely random neighborhoods, considering the total size of the neighborhood as being fixed in both cases. The CPU costs per step are roughly the same for hybrid and pure random neighborhoods having the same size, but the SLS algorithm with hybrid neighborhood spends less steps for leaching the same solution quality than when using purely random neighborhoods. Overall, in all the cases we examined, using hybrid neighborhoods consisting of Chapter 4. Combinatorial Design 42 100 Number of iterations Figure 4.6: RLDs for purely random and hybrid neighborhoods of size 200, set size k = 70, strand length n = 8, Hamming distance d = 4, constraints CI , C2 and C5. all 1-mutation neighbors and additional random strands lead to better performance than using purely random or //-mutation neighborhoods. 4.3.3 Noise Parameter The use of noise in the simple SLS algorithm, i.e., allowing probabilistic moves when taking decisions, provides robustness to the algorithm and allows it to escape from local minima. When considering randomized neighborhoods, the optimal value for the noise parameter, 9 appears to be 0 as can be seen in Figure 4.8, compared with 0.2 for 1-mutation neighborhoods (see Section 4.2.3). One possible explanation for this phenomenon may reside in the increased need of greediness for the search algorithm when searching bigger neighborhoods. Fur-thermore, the additional diversification provided by random neighbors can appar-ently compensate and even substitute for the effect of the noise mechanism - both provide mechanisms for escaping from local optima in the underlying search space. Chapter 4. Combinatorial Design 43 C P U Time [sec] Figure 4.7: RTDs (CPU time) for purely random and hybrid neighborhoods of size 200, set size k = 70, strand length n = 8, Hamming distance d — 4, constraints C1.C2 and C5. 4.3.4 New DNA Strand Sets and Empirical Bounds Using the new improvement strategies based on bigger and more randomized neigh-borhoods, we used the enhanced SLS algorithm (using hybrid neighborhoods) to generate strand sets for the proposed problem instance. We obtained 107 DNA strands of length 8 satisfying constraints CI , C2 and C5 with at least 4 mismatches between pairs of strands, GC content 50%, and using a random initial set of strands. For the same case, Frutos et al. [40] constructed a set of 108 strands of length 8. We reported in Section 4.2.5 sets of 92 strands obtained by initializing the simple SLS algorithm with random sets. Initially, sets of 112 strands were obtained by initializing the same algorithm with the best known set containing 108 strands plus one extra strand at a time. Sets of 112 strands have also been obtained when using randomized neighborhoods under the same experimental conditions (initialization with the set of 108 strands by Frutos et al.). Later, we obtained a set with 128 strands that exceeds the best sets known so far by 16 strands [110]. For the problem instances with constraints CI , C2 and C5 satisfied we com-pared the sizes of the strand sets obtainable by our algorithm with previously n\d 2 3 4 5 6 7 8 9 10 4 24 [.03fc] 6 [.01k] 2 [,001k] - - - - - -6 31(1 [.7k] 41 [1.5fc] 15 [,3k] 4 [,005k] 2 [,002k] - - - -8 4CI22 [16.7fc] .WO [3.4fc] 107* [40k] 26 [64 fc] 12 [4.8k] 2 [.002fc] 2 [.002fc] - -10 X 4131 [45.2fc] 790 [9.4fc] 158 [2k] 41 [1.2fc] 15 [.6fc] 6 [2.6fc] 2 [.002] 2 [.002k] 12 X X 61 INI [256fc] 988 [23.3k] 240 [3. lfc] 70 [1.7k] 25 [1.2fc] 9 [3.2fc] 4 [.2k] Table 4.7: Empirical bounds on (C1,C2,C5) quaternary codes obtained with the improved SLS Algorithm. 1-mutation+random code words neighborhoods have been used. The number of random code words used here are {10,100,1000, 5 000}. For n = 8, d = 4 we found a better bound, namely 112 code words by initializing our algorithm with the best previously known word set (108 code words) plus an additional random word. Chapter 4. Combinatorial Design 45 10000 r 1 1 1 ' 1 r 0 0.1 0.2 0.3 0.4 0.5 0.6 Noise parameter Figure 4.8: Median run-time of the enhanced SLS algorithm (in search steps) as a function of the noise parameter value: constraints CI , C2 and C5, n = 8, d = 4, and fc = 70 for 1-mutation + random neighborhood. known strand sets. Out of a total of 31 comparisons with previous results (see Tables 4.6, 4.7), we found strand sets that are equal to or improve on previous constructions in all but one case, namely n = 8 and d = 4. In this particular case, while our algorithm was not able to meet the previous best construction when start-ing from a random initial set of strands, we were still able to improve on the best previous construction by initializing our algorithm with the best previously known strand set plus an additional random strand. Further improvements on our current results have been achieved after our enhanced SLS algorithm had been published. Gaborit and King [41] found better codes satisfying constraints CI and C5, that ex-ceeded ours in all cases. They also found better codes satisfying constraints CI , C2 and C5 using a combination of lexicographic constructions and stochastic search techniques. Nevertheless, our current results equal their results in 20 out of 42 cases and are still better than their results in 16 out of 42 cases. Our stochastic local search algorithms provide a few advantages over other methods, particularly over standard coding theory design techniques. While cod-ing theory algorithms heavily use the symmetries of the problem instances (for Chapter 4. Combinatorial Design 46 example the parity of the strand length), our stochastic local search algorithms can solve problems with a wide variety of input combinations, being limited only by the physical capacity of the hardware. Chapter 5 47 Combinatorial and Thermodynamic Design 'Nothing in life is certain except death, taxes and the second law of thermodynamics." Seth Lloyd, MIT In Chapter 4 we proposed two Stochastic Local Search (SLS) algorithms that pro-vide high quality solutions for the DNA Strand Design Problem, when only com-binatorial constraints are considered. This chapter extends the previous Stochastic Local Search algorithms to generate high quality sets of DNA strands that satisfy any combination of combinatorial and thermodynamic constraints, which aim to maximize desired hybridizations between strands and their complements, while minimizing undesired cross-hybridizations. Moreover, we describe new thermo-dynamic measures of the quality of strand sets. With respect to these measures of quality, our algorithm consistently finds, within reasonable time, sets that are significantly better than previously published sets in the literature. 5.1 Extended Stochastic Local Search Algor i thm The algorithms proposed in the following sections can design a set S of equally long DNA strands, being either strands or complements, where S is of specified set size k. The combinatorial and thermodynamic constraints that are supported by the algorithms are presented in Chapter 2. The user of the algorithms may choose which combination of constraints should be used in the design process. "Parts of this chapter have been published. D. Tulpan, M . Andronescu, M . Shortreed, S. Chang, A . Condon, H. Hoos, and L. Smith (2005) 'Thermodynamically Based D N A Strand Design", Nucleic Acids Research, Vol . 33, No. 15, pp. 4951-4964. Chapter 5. Combinatorial and Thermodynamic Design 48 Our DNA design algorithm is based on the stochastic local search approach developed in Chapter 4. It takes as input the desired strand length n and set size k, along with a specification of which constraints the set must satisfy and attempts to find a set that meets these requirements. The algorithm performs a search in a space of DNA strand sets of fixed size that may violate the given constraints, using a search strategy that combines randomized iterative improvement and random walk. The differences between this new, extended SLS algorithm and the simple al-gorithm presented in Chapter 4 (Algorithm 1) are as follows. First, the algorithm is initialized with a randomly selected set of k DNA strands, all of which are selected to fulfill any combination of single-strand constraints (C5, C6, C7, T l , T5). Sub-sequently, these specific constraints are always kept satisfied; i.e., when replacing a strand from the current candidate set in any search step, the replacement strands are always chosen such that they satisfy any given single strand constraint. We accomplish this later task by using a simple and quick generate and test mecha-nism, where one strand at a time is generated uniformly at random and tested for constraints C5 and T5. The SLS algorithm uses only purely random neighborhoods. While in Chap-ter 4 we showed that the SLS algorithms using moderate size hybrid neighbor-hoods provide marginally better results than when using purely random neighbor-hoods, the generation of 1-mutation neighbors becomes an inefficient task when constraint T5 is required. The inefficiency arises from the fact that a large number of 1-mutation neighbors may not satisfy constraint T5, thus being replaced with randomly generated ones. 5.2 Experimental Setup Our algorithm has been implemented in C/C++ and has been compiled and run under SuSe Linux 9.1, kernel version 2.6.5. Since no system-specific libraries are used, the algorithm should compile on most operating systems supporting stan-dard C/C++ language compilers. Academic users can obtain our software from http://www.cs.ubc.ca/labs/beta/Software/DnaCodeDesign. The source code and precompiled libraries for PairFold vl.4 [7] have been made publicly available at: http://www.rnasoft.ca. The online version of PairFold contains a complete set of publicly available DNA nearest-neighbour parameters kindly provided by David Mathews. For the results reported here, the more accurate (unpublished) parame-ters of John SantaLucia Jr. have been used. Chapter 5. Combinatorial and Thermodynamic Design 49 5.2.1 Evaluation Criteria We use several measures to gauge the quality of a strand set. Two of them account for the free energy gaps, denoted by S and 5*. 8 represents the free energy gap between perfect matches and imperfect matches of a strand (see Figure 5.1). i f -20 - 1 5 - 1 0 Free energy [kcal/mol] Figure 5.1: Positive free energy gaps of correct and incorrect strand-complement pairs for set S5 by Penchovsky and Ackermann [83]. The three curves represent (from left to right) the cumulative distribution of the free energy values of all de-sired strand-complement hybrids, of all undesired strand-complement hybrids and of all (undesired) complement-complement hybrids. The two dots represent the specific values of i and j that determine the free energy gap as defined in Equation ??. For every strand W{ in the set, we consider the difference between the perfect match free energy, AG°(wi,Ci), and the minimum free energy of a mismatch in-volving uii or Q . The minimum such difference, over all u i j in the set, is denoted by 6: Chapter 5. Combinatorial and Thermodynamic Design 50 mm l<i,j<N,ijtj mm "A -AG°(wuCl)) I A G 0 ( c „ C j ) \ \ AG°(cuCi) J (5.1) Note that the inner minimum takes into account three types of mismatches, namely mismatches between a and a strand Wj with j ^ i, between Cj and a strand Wi with j ^ i, and also mismatches between a complement C j and itself or another complement Cj. A related quantity of interest, previously proposed by Penchovsky and Acker-mann [83], is 6*, defined as follows. Let AG° : = max AG(wi,Ci Ki<N (5.2) AG?, :— min AG(wuCj), and (5.3) l<i ,j<N,ijij A G 3 := min A G ( c , , C j ) . (5.4) l<i,j<N Then <5* := min{AGS, A G 3 } - AG°. (5.5) The value <5* is of interest because the thermodynamic constraints T l - T3 make it possible to directly design a set for which S* is above a certain threshold. Roughly speaking, the larger 5 and S* are, the larger the gap between the free energy of desired and undesired hybridizations, and thus the better the set is. Moreover, since 5 > 8*, it follows that 6* provides a useful lower bound on the (true) energy gap 5 between competing desired and undesired hybridizations. Penchovsky and Ackermann [83] use 5* as a measure of the quality of their set. Furthermore, we measure the melting temperature interval width, p, for perfect matches (strand-complement duplexes), which is defined as the difference between the highest and the lowest melting temperature values of all strand-complement duplexes. p := max T M ; - min T M ; (5.6) l<i<N l<i<N The third type of measures that we use provides insight on the relative abun-dance of desired versus undesired products of hybridization reactions. To define these measures, which are pairwise specificity, sensitivity and discrimination, we consider the following interactions, which can occur between strands and comple-ments in a set: Chapter 5. Combinatorial and Thermodynamic Design 51 II: wi + ci -=i WiCi A strand Wi perfectly matches the corresponding complement Ci (desired hybridization). 12: Wi + c.j ^± w^j A strand Wi imperfectly matches a complement Cj, where i ^ j (undesired hybridization). 13: Cj + Cj ^ C{Cj A complement C j hybridizes with a complement Cj, where i ^ j (undesired hybridization). 14: u>i + Wj -=± WiWj A strand W{ hybridizes with a strand Wj, where i ^ j (unde-sired hybridization). 15: c -=-± cf0ided The MFE structure for a complement contains base pairs, that is, a subsequence of a complement c hybridizes with another subsequence be-longing to the same complement (undesired hybridization). 16: w ^ wfolded The MFE structure for a strand contains base pairs, that is, a subsequence of a strand w hybridizes with another subsequence belonging to the same strand (undesired hybridization). For simplicity, we ignore interactions among three or more strands. (A limited form of interaction between three strands will be discussed later, in Section 5.3.2). In the following analysis, we will also ignore interactions of type 14, as in a surface-based application, strands are fixed on surfaces and so cannot interact with each other. However, the method of analysis could be extended to model such interac-tions if needed. (We note that our algorithm can design strand sets so that undesired interactions between strands are unlikely; it is just in the modeling step that such interactions are ignored.) We will also ignore for now undesired hybridization of types 15 and 16, and address these separately in the discussion section. Therefore, to model competition between a strand Wi and two distinct comple-ments Ci and Cj, we consider a 6-phase model: Wi + Ci ^ WiCi (5.7) Wi + Cj ^ w^j (5.8) Ci + Cj ^ CiCj (5.9) Equations 5.7, 5.8, 5.9 describe desired (II) and undesired (12,13) hybridiza-tion between strands and complements. Let [wi]°, [ci}°, and [CJ]° represent the initial strand concentrations (which are known quantities, e.g., 1 0 _ 7 M each in our case), and let [wi], [c-], and [CJ] represent the equilibrium strand concentrations (unknown quantities). Let [ i f j C j ] , [u^c,], and Chapter 5. Combinatorial and Thermodynamic Design 52 [ciCj] be the equilibrium concentrations for the hybridization products IUJCJ , WiCj and CiCj. The following system of equilibrium equations describes the relationships between concentrations, free energies and reaction temperature for single strands and hybrids. [wi]° = [wi] + [wiCi] + [wiCj] (5.10) [ci}° = [Cl] + [WiCi] + {clC]} (5.11) c 1° = [cj] + [wiCj] + [ciCj] (5.12) \WiCi] M A = e « x y (5.13) lwicj\ Wi][Cj [dCj] = E H S T — (5.14) AG"(ci,c) ' = e mcf— (5.15) In our software, the system of equations (5.10 - 5.15) is implemented and solved using the commercial software package Maple v9 [75]. We are interested in maximizing the number of complements c- that are prop-erly hybridized to Wi, as opposed to being un-hybridized or hybridized to another Cj. In this context we consider duplexes u>jCj to be true positives, and single strands d and duplexes qc,- to be false negatives (since in these, c- is not hybridized as de-sired), and so we define the pairwise sensitivity as follows: . . . . . . [WiCi] .„ . pairwise sensitivity := mm — — — - — (5.10) l<i,j<N,i^j [WiCi] + [Ci] + [CiCj\ Similarly, we are interested in maximizing the number of complements Cj that are not hybridized. In this context we consider single strands Cj to be true nega-tives, while duplexes WiCj and CjCj are considered false positives (since in these cases, there is undesired hybridization involving Cj). Thus, we define the pairwise specificity as follows: [CH] pairwise specificity :— min -— (5.17) l<i,j<N,i^j [Cj] + [WiCj] + [CiCj] Chapter 5. Combinatorial and Thermodynamic Design 53 Clearly, pairwise sensitivity and specificity values are always between 0 and 1; the closer these values are to 1, the better the quality of the set, at least with respect to competition between Q and Cj for Wi. We are also interested in maximizing the ratio of correctly formed duplexes, i.e., true positives to incorrect duplexes, i.e., false positives. Thus we define the pairwise discrimination as follows: • • ,. . . . [u>iCi] pairwise discrimination : = mm r (5.1b) l<i,j<N,ijtj [WiCj] For each of our strand sets, we measure the pairwise sensitivity, specificity, and discrimination assuming that initial concentration of strands and complements is 10~ 7 M each. 5.2.2 In-Silico Experimental Protocols The results presented in this chapter were obtained by in-silico"DNA design. Our software was run on a PC with dual 2GHz Intel Xeon CPU's (only one of which was used in our experiments), 512 KB cache and 4GB R A M , running SuSe Linux 9.1 (kernel 2.6.5). To test the ability of our algorithm to design DNA sets with various sizes and properties, we first selected seven representative sets from the literature as controls (these are denoted by SI . . . S5, S7, and S8 in our tables). In addition, we generated 10 random sets using only the C7 constraint. We reported the best random set (S6) out of 10; this set is used as an additional control in our experiments. Table 5.1 provides references for these control sets and summarizes which constraints were originally used in their design. We evaluated each control set using the measures discussed in this chapter. All values are presented in Table 5.1. We used the following general protocol to design improved and enlarged sets. For each set we used constraints C6, C7, T1-T3, and T5. Constraints C6, C7, T1-T3 and T5 have been set to match the original constraints used in the design of the controls (see Table 5.1). Because in this work, we focus on applications where strands are affixed to a surface, we do not consider constraint T4 for the set designs used in our empirical study, but the current version of our set design software supports its use. To design sets of the same size, but higher quality than the respective control set, we used the following protocol. For each set, we performed 20 independent runs of our extended SLS algorithm with specific parameter settings, and with a maximum cutoff of 12 CPU hours per run. For the first run, the algorithm was initialised with the maximum and minimum values of T l , T2, T3, and T5 of the control sets. Then, in any consecutive run, we set A G ^ i n l to a very small value, Chapter 5. Combinatorial and Thermodynamic Design 54 Control Sets - Original Constraints SI Hraich et »1.|14] S2 Brenner et «1.[17] S3 Fnulhi.ii mier ct nl.[36| S4 Knit.* et »I.(4II1 S5 I'en.l.ovsky ct «I.[IU| S6 Random S7 Shortri-cd et .il. I|9K| SS Slmrtreed el ill. 2[W] St™iid length 15 4 15 8 16 16 12 16 No. i.f.strunds 40 8 20 10S 24 24 64 64 CI y >4 mismatches >3 in is mate lies <J y >4 mismatches C2 si >4 mis ma (dies y >4 mis notches CM 7 • <7 ma Idles C4 •J <7 matches CS y 35.3.T,f y 25.00% CY. y 5H y CCC (5'.3'«nds) y CCC CC {5'.3" ends) y CCC CC (5',3' ends) C7 y A.C.T J A.C.T A.C.T y A.C.T v / A . C T y A . C T y A . C T T l y implicit y y 12 y y y T3 implicit y y T5 ave»ge=4S"6' y range o f ± l . 5 " C y y 1"C range Co/ifro/ Sefs - Extrapolated Constraints T l nmj>e: (kcul/m.il) [^ •';'.„„1-^ -'"..,.Jii [-16.SO. -13.57] 1-1.96.-1.01] [-17.56.-11.63) [-8,95. -6.50| [-17.94,-16.84] i-19.31.-12.741 [-12.66.-11.78] [-16.42. -15.45] TI rimgc: (kt-al/nw.i) ( A 6 ' ; ; „ - „ , , A G ; ; 1-9.29. -0.22] (-1.86. 0.00] [-7.98.-1.29| I-6.B3. 0.001 [-8.72, -2.26] [-8,35,-0.86] 1-8.94.-0.93] 1-7.62.-1.10] T3 nui|!c: (kml/nxil) [AC1;;, ,-„.,, A O ; ; , , , , . , ) 1-4.48. O.OO] 1-0.64. 0.00] [-5.02,0.001 [-6.46, 0.001 [-2.62.0.00] [-4.50. -0.07| [-5.06, 0.00] [-9,06. 0,001 T5 nrnce: <"C) [TA/„„-„s .7 , A/ , l l , „ 5 ] 146.64.55.75] [-67.19. -48.07] [40.87.58.20] [17.06,27.34] [55.15.58.65] [43.56. 62.23] [42.38.43.50] [51.81.52.77] Table 5.1: Control sets used as a basis for comparison with the results obtained from our algorithm. Each column of the table, other than the leftmost, corresponds to one set. Each column header gives a set ID and, for all but set S6, a reference to the paper in which the set was published. The next two rows report the strand length and the number of strands in each set. Following these, there is one row per type of combinatorial or thermodynamic constraint CI to T5 (except for T4 - see Section 2.3 for details). A check mark in a column indicates that a constraint of the column's type was enforced when designing the set, and further information about the composition of the strands is given where available. The second part of the table describes thermodynamic properties of control sets. For each set, the rows of the table give the free energy ranges corresponding to constraints T l (desired strand-complement interactions), T2 (undesired strand-complement interactions) and T3 (undesired complement-complement interactions) , respectively and the melting temperature range corresponding to constraint T5. Note that these thermodynamic constraints were not used as design constraints for the control sets, other than those of Shortreed et al [98]. Chapter 5. Combinatorial and Thermodynamic Design 55 and we decremented the value of AG°naxl in steps of 0.25 kcal/mol, thus increas-ing the free energy gap 6*. We reported the best of the sets obtained from these 20 runs (these are the sets Sl-1, S2-1, . . . , S8-1 in our tables). We reported also the CPU time required to design the best sets as being the sum of all independent CPU times spent by the algorithm to obtain incrementally improved sets up to the one reported in Table 5.2. The reported run time does not include the time to evaluate the quality of the designed sets. To obtain bigger sets (which we denote by S1-2, S2-2, . . . , S8-2 in our tables) with approximately the same quality as the control sets, we used a slightly different protocol. For each set, we performed 20 independent runs; starting with the size of the control set, we incremented the size of the target set by 5 in each run. For sets Sl-2, S3-2, S4-2, S5-2, S7-2 and S8-2, we initialised the T2, T3, and T5 range values with the ones corresponding to control sets SI, S3, S5, S7 and S8, respec-tively. The T l constraint values have been initialised as follows. The minimum perfect match free energy was initialised with A G ^ i n l — X, and the maximum was initialised with A G ^ r a x l — X, where X = 5 — 5*. In this manner we enforced 6* of the newly designed set to be bigger than 5 of the control set, thus using a much harder constraint than for the design of better sets. For S2-2, the method described above did not provide bigger sets in less than 12 CPU hours on our ref-erence machine, so we followed the same strategy as the one used to design better sets, i.e., we initialised A G ^ i n l and AG°wxl with the corresponding values for the control set. One minor modification of this protocol was needed in the context of set S2-2. In this case, due to range overlaps in T l and T3 in the control set, the constraints T l , T2, T3, and T5 (initialised with the values from the control set) were insufficient to ensure the uniqueness of the strands in the set. Thus, we re-moved any strands that occurred more than once in S2-2, trimming the set from 43 strands (with duplicates) to only 13 unique strands. We note that the duplication of strands could also be prevented in the SLS strand design algorithm, e.g., by using Hamming distance constraints. 5.3 Results and Discussion The results of the two computational experiments described in the previous section are summarized in Table 5.2. Clearly, our algorithm is effective in designing sets of the same size as the control sets but with larger free energy gaps, 5*, between cor-rect and incorrect strand-complement pairs. Note that larger 5* values imposed as design constraints when generating the new sets using our algorithm induce larger 5 values, as well as improved pairwise specificity, sensitivity and discrimination values. The actual free energy gaps, 6, are positive for all of these improved sets, Chapter 5. Combinatorial and Thermodynamic Design 56 Sets Comparison Set Strand No. of Pairwise I'nirwisc Pairwise CombFold r\r-\ Run Time k-iiKtli stmnds (kciil/mol) r o Sensitivity Specificity Discrimination MFE (kciil/mol) (kcal/mol) (Cl'Us/m/h) SI Itraicli 15 40 6.25 |4.28] 9.11 0,95 1.00 354.36 -1.25 4.10 [1.84] SI-opt 15 40 4.97|3.78] S l - l 15 40 7.57|7.55] 1.79 1.00 1.00 939.63 -0.09 1.72 |1.36| 2.61] Sl-l-opt 15 40 3.93 [ 3.79| Sl-2 15 114 6.42 [6.33] 5.20 0.99 1.00 377.32 -11.64 1.53 [1.53] t.Sh Sl-2-opt 15 114 2.44 12.12] S2 Hreimcr 4 8 -0.19 (-0.851 19.12 0.00 1.00 0.73 -3.19 S2-1 4 8 .<).(IS [-0.1181 13.49 0.011 1.00 1.30 0.011 4s S2-2 4 13 -0.47 [-0.67] 17.23 0.00 1.00 0.82 -3.04 12s S3 FimllixiiinK-r IS 20 6.13 |3.651 17.33 0.78 1.00 968.14 -0.46 3.93 1-0.06] S3-opt 15 20 5.43(0.94] S3-1 15 20 8.59 ]8.46] 3.81 0.99 1.00 5324.62 -1.63 5.84(5.321 15.6m S3-1-opt 15 20 6.24(6.16] S3-2 15 11(1 6.25(6.13] 11.56 0.97 1.00 986.16 -4.28 2.28 [1.801 4.7h S3-2-opt 15 110 3.33 [2.471 S4 I'rutos 8 108 -0.21 1-2.22] 10.28 0.00 0,95 6.25 -11.19 S4-] 8 108 1.28 [0.89] 6.67 0.04 0.99 7.70 -11.58 4m S4-2 3 173 1.59(0.731 5.71 0.06 0.99 15.67 -13.07 Hi S5 t'cncliiivsky 16 24 8.79 |8.12] 3.50 1.00 1.00 3569,76 0.00 6.88 |5,78| S5-opt 16 24 7.09(6.14] S5-1 16 24 9.27(9.20] 1.95 1.0(1 1.00 57(14.45 (1.0(1 5.83 (5.791 1.5b S5-1-i.pl 16 24 6.14 [6.08] S5-2 16 44 8.90(8.82! 3.45 1.00 1.00 4163.51 (1.00 4.57 |4.16| 17h S5-2-opt 16 44 5.61 |5.20| SO U.mdom 16 24 6.10 |4.39| 18,67 0.90 1.00 522.73 -2.34 3.74 (1.90] 0.0 Is S7 Shorlrccril 12 64 2.87 |2.84] 1.11 0.79 0.97 23.20 -1.03 2.91 (2.761 S7-opl 12 64 2.91 |2.76] S7-1 12 64 3.72|3.65] 1.07 0.8H 0.98 49.59 0.(10 0.23 [-0.641 1.9li S7-l-opt 12 64 0.23 1-0.64| S7-2 12 144 3.01 (2.85] 1.11 0.79 0.97 27.74 -3.94 0.00 1-1.83] 2.7h S7-2-opt 12 144 0.00 1-1.83] S8 Shortrwd2 16 64 6.77 (6.39] 0.96 0.99 1.00 4739.83 -5.50 5.59 |5.25] S8-«pt 16 64 5.59 [5.25| S8-1 16 64 8.15 (8.09) 0.94 0.99 1.00 5057.76 -2.52 2.78 [2.74] 1.71. SS-1-opt 16 64 3.55 13.491 S8-2 16 80 7.91 (7.85| 0.95 0.99 1.00 4093.02 -4.32 3.69 [3.501 3.2b SS-2-npt 16 80 3.69 [3.50] Table 5.2: Comparison of the quality of the control sets and our improved and enlarged sets. Each pair of rows (between delimiting lines) corresponds to one set, and lists qual-ity measures for the control set, the improved set (indicated by the suffix and the enlarged set (indicated by the suffi x "-2"). Improvements over the control set are high-lighted in boldface. The columns, from left to right, specify (1) set ID, (2) the strand length for the set, (3) the number of strands in the set, (4) the free energy gaps S and 5*, (5) the melting temperature interval width, (6) the pairwise sensitivity, (7) the pairwise specifi city, (8) the pairwise discrimination, (9) the minimum free energy value as com-puted with CombFold vl.O [6], (10) the minimum free energy gaps r and r* for junctions, and (11) the run time, measured as total CPU time on our reference machine for running the respective experimental protocol until the given set was obtained. (See Section 2.3 for details). In the melting temperature column, the measurements were obtained using the function calc_Tm_complementary_with_entropy_santalucia_2004(...) of the PairFold vl.4 package [7], Si-opt MFE values for junctions have been obtained after optimizing the arrangement of strands in subsets such that r* is maximized (where T * is the free energy gap that accounts for junctions, as defi ned in Equation 5.20). Chapter 5. Combinatorial and Thermodynamic Design 57 except for the set S2-1, indicating that for every complement, the hybrid formed with its correct strand is more stable than the hybrid with any incorrect strand, while in some cases (notably, for control sets S2 and S4), the bounds on the gap, 8*, are negative. Also, pairwise specificity is always at least as high, and often higher, as sensitivity, which indicates that (under the conditions studied here, in which all initial concentrations are equal) instability of correct strand-complement hybrids tends to pose more of a problem than the stability of incorrect hybrids. We also note that for all sets shown in the table, the melting temperature range p for correct strand-complement pairs is at least as narrow for our new sets as for the control sets. In two cases, the melting temperature range for each of our improved sets is more than four times narrower than for the respective control sets, although such a reduction was not explicitly specified as a design constraint. The results from our second experiment, in which we used our algorithm to de-sign larger sets with thermodynamic properties comparable to the control sets, are also shown in Table 5.2. Note that in all cases, we were able to obtain substantially larger sets without any loss of quality (except for S8-2, whose pairwise discrimi-nation is slightly lower), as measured by 8, p, pairwise sensitivity, specificity and discrimination. It may be noted that the CPU time required for finding our improved sets varies substantially between the different sets, but is in most cases substantially lower than 10 CPU hours, and always below 20 CPU hours on our reference machine. Due to the highly randomised nature of our SLS algorithm, its run-time over multiple runs on exactly the same input data is quite variable (with standard deviations in the same order as the run-times shown here) — this is typical for SLS algorithms in general and does not impact their ability to consistently solve hard combinatorial problems such as the strand design problems studied here [57]. 5.3.1 Correlation between 5 and Concentrations The computational results reported in the previous section clearly demonstrate the ability of our stochastic local search (SLS) algorithm to design high-quality DNA strand sets, where quality is assessed using a variety of measures. As previously noted, in our algorithm we only control the location and mini-mal size of the free energy gaps between correct and incorrect strand-complement hybrids and the free energy range for correct hybrids, but not for the pairwise speci-ficity, sensitivity or discrimination. However, considering the nature of Equations 10-15, there is a tight correlation between 8 (and likewise, 8*) and the relative concentrations of correct vs incorrect hybrids, as directly measured by pairwise discrimination (see Equation 5.18). Figure 5.2 shows this correlation for sets S5 and S5-1; every point in this correlation plot corresponds to a positive solution of Chapter 5. Combinatorial and Thermodynamic Design 58 Figure 5.2: Correlation between pairwise discrimination values and duplex free energy gaps for sets S5 Penchovsky (control) and S5-1 (improved); each point in the plot corresponds to the discrimination and S values of a given strand and an incorrect complement. the equilibrium Equations 10-15 for one combination of a strand, its correct com-plement and a distinct complement. The discrimination values within both sets vary over several orders of magnitude, and the overall pairwise discrimination is determined by a small number of relatively stable undesired hybrids. Shortreed et al. [98] introduced A , a variant of our notion of pairwise discrim-ination. We did not compute A for the sets reported in this chapter since we did not model interactions involving more than one strand, but the results in the paper of Shortreed et al. [98] indicate that A and our pairwise discrimination measure are quite similar. 5.3.2 Equilibrium Concentration Dependency of DNA Strands To further investigate how our quality measures depend on the free energy gap S, we studied, for pairs of strands wl and complements c}, the dependence of the equilibrium concentrations of the single D N A strands and hybrids, as determined Chapter 5. Combinatorial and Thermodynamic Design 59 by Equations 10-15, on the respective free energy gap values. Figure 5.3 illustrates the results of this analysis for sets S5 (top) and S5-1 (bottom). Clearly, the concen-trations of the desired products, namely the correct hybrid, [U^Q ] , and the unbound incorrect complement, [CJ], are consistently several orders of magnitude higher than the concentrations of any undesired products ([u>i], [ci\, [wlCj] and [CJCJ]). Interestingly, undesired hybridization between complements is insignificant com-pared to the impact of incorrect strand-complement hybrids. This means that for our sets, the competitive reactions Wi + Ci ^ u»,Cj and Wi + Cj ^ w^Cj dominate the more complex equilibria considered in our model; in particular, as clearly illus-trated in Figure 5.3, the difference Sij in free energy between correct and incorrect hybrids, W{C-L and w^Cj, mostly determines the relative concentrations of correct vs incorrect hybrids, and hence the pairwise discrimination measure. This justifies the approach of Shoitreed et al. [98] to ignore such interactions. However, there are cases where the product CiCj can become significant, such as when the initial complement concentrations are significantly greater than the strand concentrations, or when strands are designed over a 4-base, rather than a 3-base, alphabet. As can be seen from Figures 5.2 and 5.3, the improved set generated by our algorithm has higher pairwise discrimination because the relative concentrations of the worst (that is, most stable) undesired hybrids have been reduced, as a re-sult of the respective increased free energy gaps. At the same time, as a side-effect, the concentrations of the most stable (undesired) complement-complement hybrids are also reduced. It may also be noted that compared to the reference sets, our new sets show more even distributions of the free energy gaps and pairwise discrimina-tion values, as well as reduced variations in the concentrations of the strands and complements obtained from our equilibria analyses. Although our model makes a number of simplifying assumptions, in particular, equal initial concentrations and independence of the individual equilibria, we expect to find similar differences in more realistic models, whose analysis is currently beyond our reach. It may be noted that given these observations, pairwise sensitivity mainly de-pends on the stability of the correct strand-complement hybrids. This is clearly reflected in the results from Table 5.2 which show that pairwise sensitivity corre-lates strongly with the melting temperature of the correct hybrids, TM. Note that in the case of S4 and the corresponding new sets, the relatively low TM values are a consequence of the small strand length and recall that in the original applica-tion, these strands are concatenated with strand labels, which effectively increases their length and the stability of the correct duplexes. Pairwise specificity, on the other hand, mainly depends on the concentration of incorrect strand-complement duplexes, and hence improves as a result of increasing the free energy gap 5. While these observations confirm the effectiveness of the free energy gap as a design constraint, it may be noted that highly constrained designs may not always Chapter 5. Combinatorial and Thermodynamic Design 60 le-07 le-08 le-09 18-10 in 18-11 c o (3 1 1e-12 H O c O 1e-13 l t -14 1e-15 1e-16 1e-17 1 i M - — j M N I * i ; [ciwij o i i i i 1 1 10 11 12 13 Free Energy Gap [kcal/mol'K] (a) 14 15 1e-07 1e-08 1e-09 18-10 w 1e-11 c o (5 E 1e-12 S c O le-13 18-14 1e-15 18-16 I 1 Iwi f-[w| « 1 CIWI c 1 i 1 J....»..B.^ . -1 1 11 12 13 Free Energy Gap [kcal/morK] (b) 14 16 Figure 5.3: Concentration of strands, complements, perfect and imperfect matches as a function of S for sets (a) S5 Penchovsky (control), and (b) S5-1 (improved); each point in the plot corresponds to the equilibrium concentration of one single strand or duplex. Chapter 5. Combinatorial and Thermodynamic Design 61 be required in order to obtain good strand sets according to our quality measures. For example, we generated a set S6 of 24 strands of length 16 under the sole con-straint that only the bases A,T and C may be used (constraint C7), and found it to be quite reasonable with respect to our quality measures and as compared to some of our other sets (see Table 5.2). Constraint C7 is widely used in the design of DNA and RNA strand sets; this constraint substantially reduces the potential for forming stable incorrect complement-complement (or strand-strand duplexes, if strand interaction is not precluded by immobilization on a surface), since such duplexes cannot contain any energetically favorable G-C pairs. Our results suggest that while simple combinatorial constraints such as C7, or similarly C4, may be sometimes sufficient for obtaining strand sets of reasonable quality (particularly, when only relatively small sets of reasonably long strands are required), better and larger strand sets can be found using high-performance algorithms that directly support thermodynamical design constraints. 5.4 Junction Optimization In DNA computing applications (as well as in the context of biomolecular tagging), strands are typically not used in isolation, but strands or their complements are rather concatenated to form longer strands. These concatemers represent strings of data, and the values that may occur at each position of such a string are encoded by a group of DNA strands. For example, Braich et al. [14] used bit strings of length 20 in their DNA computation. These were represented as concatemers of 20 DNA strands, where a different strand was used for each bit at each of the 20 positions. Hence, they partitioned their set of 40 strands of length 15 (SI in Table 5.1) into 20 groups of two strands each, and used the 2 2 0 strands of length 300 obtained by concatenation of one strand from each group to represent the 2 2 0 bit strings of length 20. Similarly, the sets of Faulhammer et al. [36] and Penchovsky et al. [83] (S3 and S4) are partitioned into ten and twelve groups respectively, each consisting of two strands, yielding a total of 2 1 0 and 2 1 2 long strands which encode bit strings of length 10 and 12, respectively. The concatemers thus obtained need to satisfy two additional constraints. Firstly, concatemers that can form stable secondary structures should be avoided as much as possible, because such secondary structure within a long strand may interfere with desired hybridizations between strands and their complements. Secondly, un-desired hybridizations between complements and any region between two adjacent strands (or vice versa) on a long strand should also be avoided. Neither of these constraints is explicitly enforced in our design algorithm, although the slide mis-matches constraint (C3) can help reduce hybridization of complements between Chapter 5. Combinatorial and Thermodynamic Design 62 strands. To evaluate our new sets with respect to the first additional constraint, we used the CombFold vl.O algorithm by Andronescu et al. [6] to find the minimum free energy secondary structure over all concatemers that can be formed based on a given partitioning into groups. For the control sets, we considered the original par-titionings reported in the literature, while for our new sets, we considered random partitionings with the same number and size of groups as used for the respective control sets. The results from the CombFold analysis of these sets and partitionings are shown in Table 5.2; they clearly demonstrate that for our improved and enlarged sets, the potential for concatemers with undesired stable secondary structures is not significantly higher than for the control sets. The worst values from the CombFold analysis are for the S4 sets (Frutos et al.), and these are the only sets which are de-signed over a 4-letter {A, C, G, T}, rather than a three-letter {A, C, T} alphabet; this suggests that a three-letter alphabet is preferable. The second additional constraint, on undesired hybridizations between comple-ments and junctions of two concatenated strands, can be formalized in two ways, as follows (analogous to 5 and 5* defined earlier): For a given set whose strands are partitioned into ordered groups, if strand u>; and Wj are adjacent on a long strand (obtained by concatenating one strand per group in order), we let W{Wj de-note the strand obtained by concatenating Wi and Wj. Let T be the set of triples (i, j, k) such that and k are distinct and the sequence WiWj appears on some long strand. We let AG°4:= min AG0(wiWj,ck). (5.19) ( i , j , f c ) e T We can now define the gap r* as follows: T* := AG°4 - AG?, (5.20) where AG° is as defined in Equation 5.2 on page 50. Also, we let r be the gap between the free energy of desired hybridizations and undesired hybridizations at junctions, defined as follows: r : = min (AG°(wiwj,ck) - AG°(wk,ck)). (5.21) Just as for 5 and 6*, r* is a lower bound for r. Table 5.2 shows measurements of r and T* for sets SI, S3, S5, S7 and S8 and our respective improved and enlarged sets, using the same partitioning into groups as in the original references (for the control sets) or random partitionings into the same number of groups as for the control sets (for our new sets). Chapter 5. Combinatorial and Thermodynamic Design 63 The partitioning of strands into groups can have a substantial impact on T and T* , and hence the potential for incorrect strand/complement junction hybridiza-tions. In order to find improved strand parti ti on ings, we developed a simple stochas-tic local search algorithm that takes as input a set of strands, a desired group size, and a value v, and partitions strands into ordered groups of the desired size so that the value of r* for the set with respect to these ordered groups is at least v (or if not, as close to v as possible). The algorithm starts with an initial arbitrary parti-tioning of strands into groups. Then, the algorithm repeatedly swaps two strands from different groups, so as to ultimately reduce the value of r*. Details are given in Figure 2; in our use of the algorithm, we chose the value 0.2 for the probability •0, which controls whether a 'Worse" partitioning P ' replaces the current partition-ing P at an iteration of the algorithm. The value 0.2 was determined from multiple runs of the algorithm on different strand sets. We simply selected the probability value that allows the algorithm to find a solution in the lowest number of iterations. Using this algorithm, we organized the strands in sets SI, S3, S5, S7, S8 and our corresponding improved and enlarged designed sets - into ordered groups of size 2. For each set, over several runs, we gradually increased the input value v in in-crements of 1 kcal/mol, six times or until the algorithm did not succeed in finding a grouping for which r* > v. procedure Stochastic Local Search for Junctions Optimization (S, v, m, r, ip) input: set of strands S, threshold v for junction free energy gaps, number of subsets m, number of strands per subset r, probability ip output: partitioning P of S P := random partitioning of S into rn subsets of size r each repeat select uniformly at random two strands si and S2 from two distinct subsets in P P' .= P with si and «2 swapped if P' has fewer bad junctions than P then P :=P' else with probability rjj do P := P' end with probability end if until no bad junctions remain or time expired return P end Stochastic Local Search for Junctions Optimization Algorithm 2: Outline of the stochastic local search procedure used to partition strands into groups for use in DNA computations. A 'bad junction" is a junction with MFE lower than v. Chapter 5. Combinatorial and Thermodynamic Design 64 In Table 5.2, we report the r and r * values obtained by optimizing the control sets as well as our respective new sets using this algorithm. We exclude the sets S2 (by Brenner et al.) and S4 (by Frutos et al.), since these sets were not designed with the goal of concatenating strands in the sets. For three of the five sets, we were able to find new partitionings of the control sets which have improved r and r* values. Additionally for these three sets, our new sets have T * values that are better than those of the control sets, although in just one of these cases was the r value of our new set better than that of the control set. We note that our algorithm optimized for T* rather than r; we expect that we would obtain further improvements on the r values if our algorithm optimized for r. However, our new sets S7 and S8 have poorer r and r * values than the control sets of Shortreed et al. [98]. In order to see how our algorithm can be further improved, it is useful to con-sider the algorithm introduced by Shortreed et al. [98]. This algorithm is initialized with a pool of 4 n candidate strands generated exhaustively. Next, the strands that do not satisfy constraints C5 (no more than two consecutive Cs), C7 (no Gs) and T5 (melting temperature in the interval [39, 40]° C) are removed from the pool. The third step consists of the elimination of all the strands that form stable mismatched duplexes. The forth step consists in further trimming the strands that have a high probability to form junctions. Thus, at the end they were left with a set of 64 12-mers. Overall, compared with this method for DNA strand design, our SLS design algorithm is conceptually more complex but, based on a comparison between the S7 and S8 sets, produces sets with better free energy gap (5), and pairwise discrim-ination, sensitivity, and specificity. Moreover, our SLS algorithm has better scaling behaviour (particularly with respect to the memory requirements for longer strands and the incorporation of additional constraints) and so it can be applied to a wider range of DNA strand design problems. Nevertheless, Shortreed et al.'s approach [98] to partitioning strands in order to form concatemers has some strong features, such as the use of a pool of strands that is larger than needed, which results in partitioned sets of the highest quality in many cases. An important feature of the Shortreed et al. approach to grouping strands in order to form concatemers is that the concatemers are formed from a pool of strands that is larger than needed. This technique could be incorporated into the approach of this chapter, and may yield better values of r. Alternatively, our SLS algorithm for strand design could be adapted to take partitioning into account, although this would add to the complex-ity of that algorithm. Chapter 6 Microarray Probe Design 65 Microarrays offer 'a fresh, comprehensive and open-minded look at every problem in biology. " Brown and Botstein, Nature 1999 Chapters 4 and 5 formulated two variants of the DNA Code Design Problem tai-lored for bio-computing applications. This chapter examines the problem of de-signing probes for use in microarrays, discusses its relationship with the code de-sign problems considered in previous chapters, and examines strategies for evalu-ating and designing probe sets in this context. 6.1 Background and Problem Description The invention of microarray technology has enabled scientists to swiftly move from studying individual genes to bulk studies of gene expression. DNA microarrays (also known as DNA chips1) consist of planar surfaces (glass slides, nylon mem-branes, fused silica or gold) bearing large collections of DNA fragments (called probes) at discrete plate locations, at which fragments (called targets) are available for hybridization. Each fragment is used to signal the presence or the absence of a specific genomic sequence and its concentration using fluorescent labels. The length of the fragments immobilized on a chip typically varies between 20 bases (short oligomers) and over 70 bases (long oligomers). The key aspects of microarray probe design can be identified more easily if we try to understand better how microarrays are used. The basic protocol [107] works as follows: 1. RNA Isolation. This process requires the extraction of the RNA from the cells of the sample under consideration. 'The term DNA Chip originated from the similarity between integrated circuits (computer chips) fabrication and photolithographic methods used for manufacturing DNA microarrays. Chapter 6. Microarray Probe Design 66 2. RNA labeling. This process usually implies a reverse transcription reaction and DNA nucleotide dye incorporation. Depending on the fabrication tech-nology used, this step may require PCR amplification and labeling of RNAs. At the end of this step, a set of labeled targets is ready to be used. 3. Target hybridization. Labeled targets are placed in a solution and this solu-tion is poured on the microarray. The microarray is left in the solution for a predefined period of time, allowing targets to find and bind to their corre-sponding probes. This process is carried out at specific temperatures, chosen in such a way so to minimize non-specific binding of probes to the wrong targets. The temperature settings vary among experiments, and they depend on the concentration of formamide present in the solution. Commonly used temperatures span a wide range from as low as 12° C to 4 2 ° C or higher for oligonucleotide microarrays [108] and from 55° C to 70° C and above for cDNA microarrays. In practice, the optimal hybridization temperature is empirically determined for a given experimental setup [67]. 4. Washing the microarray. The solution containing targets is removed and the array is washed with various substances that may contain different salt and detergent concentrations. The purpose of these chemical components is to minimize non-specific hybridization and their concentration is chosen with great care. It is known that solutions with higher sodium-based salt con-centrations tighten the DNA base pairing, and those with low concentrations could weaken them. 5. Microarray scanning and data analysis. After washing the microarray, a scanning process takes place. The purpose of the scanning process is to detect and quantify the amount of targets bound to the DNA probes at spe-cific locations on the microarray. The details of the scanning process may vary, depending on the labeling procedure used in a previous step. If fluo-rescent dyes are used, the following optical scanning process is used. First, a laser beam is used to excite the fluorescent dye. The emitted photons are captured and quantified using a photo multiplier tube (PMT). Second, the number representing the number of photons captured from the fluorescence are translated into a pixel of an image file (usually, a 16-bit .tiff file or other uncompressed image formats). The last step comprises image analysis by finding corresponding spots and comparing them using different statistical techniques that often involve data normalization. The idea of attaching multiple DNA sequences on a plate dates back to 1988 and even earlier [84], but the technology to implement this idea was not well de-veloped at that time. A breakthrough was achieved in 1991 by Fodor et al. [38] and Chapter 6. Microarray Probe Design 67 later perfected by Affymetrix, a California-based biotechnology company. Several microarray manufacturing techniques are currently in use, including: 1. light-activated polymer synthesis combined with a masking process similar to the one used for silicon chip manufacturing (Affymetrix technology), 2. electric-field based micropipetting [33], 3. direct spotting of cDNA fragments onto positively coated plates, 4. ink-jet-based technology (Agilent Technologies) 5. bead-based arrays (Illumina, Inc.), [48] Generally, microarrays can be grouped into two categories [115]: 1. cDNA (oligonucleotide) spotted arrays, and 2. high-density synthetic oligonucleotide (Affymetrix-like) arrays. This categorization is based partly on the way the arrays are manufactured and partly on the way they are used. cDNA arrays are mostly used in 2-dye (also known as 2-channel) experiments. This means that 2 samples of RNA are taken from 2 separate sources and cDNA is created from each by using a technique called RT-PCR (Reverse Transcriptase Polymerase Chain Reaction). Each cDNA sam-ple is labeled with its own specific dye. Both cDNA samples are then washed over a single array, after which dye intensities on each spot are measured using a fluorescence camera. cDNA arrays have typically less probes per area unit than Affymetrix-like arrays, containing approximately 20 000 sequences per chip, each of which is about 60 to 80 nucleotides long. In contrast to the 2-dye approach, Affymetrix-like arrays can only be treated with a single sample (e.g., RNA from a single source) and no RT-PCR is necessary. These arrays can potentially have a density of up to 100 000 sequences per chip, each of which being no longer than 40 nucleotides [115]. An important goal in microarray technology is to decrease the size of microar-rays. To attain this goal, it is necessary to decrease the size of the probe sites and their spacing, while still being able to reliably detect molecular hybridization. The smaller the microarray the lower the quantities of biological samples required for analysis. Nevertheless, the quality of a microarray chip resides also in its informa-tion storage capacity. Having more probes per array is equivalent to storing more information thus increasing the potential for analysis. The number of probe molecules in a given spot controls their intermolecular interactions. Close probe spacing enables complex interactions among multiple Chapter 6. Microarray Probe Design 68 Figure 6.1: Two key microarray parameters: the area per probe E, and the distance between probe sites, A. probes and targets [39]. To avoid the inability of the parts of the probes that are closer to the microarray surface to bind to targets (due to steric hindrance, immo-bility and closeness to the microarray surface), linkers have been widely used. The linker stacks on one end of a probe, thus creating a distance between the microar-ray surface and the section of oligonucleotide that is used in hybridization. Typical linkers used in microarray manufacturing are ethylene glycol oligomers or short sequences of thymine (T) bases. The optimal linker length was determined to be about 40 atoms [97]. The attainable values of the area per probe, E, for oligonucleotide microarrays vary with the support surface [47, 85, 104]. The reported probe densities within spots vary between 1.2 x 10 1 0 and 4 x 1013 probes per cm 2 corresponding to 2.5 x 1 0 2 i 2 < E < 8.3 x 1 0 5 i 2 [51]. Typical values of E on glass surfaces are of the order of 104 A2 corresponding to a distance A ss 100A between probe sites. On polypropylene supports, significantly higher probe site densities of single stranded DNA are possible, where E « 40A 2 and A « 7A [50]. Universal DNA microarrays are organism-independent tools, which have the ability to identify specific target genomic sequences within a large and complex mixture. They typically have high probe densities, with probe lengths in the range 20-40 bases. Unlike organism-specific microarrays, universal microarrays are more flexible because their probes are not used directly to identify targets of specific ge-nomic sequences. Instead, specific targets are labeled with complements of probes Chapter 6. Microarray Probe Design 69 (which have been affixed on the surface of the microarray) and then brought in contact with the universal microarray [101]. Each probe will capture only the la-beled targets. Universal microarrays are used for genotyping of SNPs [49], SNP detection [34] and gene expression profiling [92]. In every application, the univer-sal microarrays are used to detect the presence of specific targets or measure their concentration. 6.1.1 The Microarray Probe Design Problem Microarray technology offers the possibility of studying the expression of large numbers of genes in a single experiment and is considered to be a complex and powerful tool that helps the investigation of many biological problems. The choice of appropriate design strategies for probes used in microarray applications requires special attention, due to its direct impact on the accuracy and interpretation of the results. Poor design strategies may lead to low quality sets of probes that cross-hybridize with undesired targets. The M i c r o a r r a y Probe Design P r o b l e m is defined as follows: with respect to a fixed set C of constraints, given N DNA sequences gi,g2, - • • ,9N (represent-ing genes, or coding sequences, or expressed sequence tags) and probe length n, find a set of N probes P\,P2, • • • ,PN (one probe per sequence) satisfying all the constraints from C. A probe pi represents the complement of a contiguous length n subsequence of gi, with i <E {1 , . . . , N}. Al l probes pit with i G { 1 , . . . , N} are fixed on the surface of the chip and all sequences gi are present in the solution. 6.1.2 Probe Design Goals The single stranded probes attached on each spot of a microarray must have the following properties: Speci f ic i ty : Each probe pi must be specific to its corresponding sequence g^. This means that pi must be a contiguous subsequence of gi and only gi, and should not occur in any other sequence gj with j ^ i,j 6 { 1 , . . . , N}. Sensi t iv i ty : Each probe must be able to hybridize with its perfect match (on a target) when a large set of presumably long sequences (whole genome) is presented to the microanay. Thus, all probes must not only be unique, but they should not be able to hybridize well with incorrect targets in the solu-tion. Chapter 6. Microarray Probe Design 70 Uniformity: Al l probes must function well under uniform experimental condi-tions. For example, to ensure a reliable quantitative comparison of gene ex-pression, microarray hybridization conditions must be uniform for all genes. Among the three design goals mentioned above, specificity seems to be by far the most important and the hardest to address computationally [62, 90]. The larger the set of sequences that is investigated, the higher the probability of hybridization between probes and competitive incorrect targets. Thus, the following discussion is focused on modeling specificity. To efficiently address the specificity goal in microarray probe design, a good understanding of the types of hybridization that occur in a microarray experiment must be achieved. Various interactions between different or similar probes and targets can occur in a microarray experiment. The interactions can be grouped based on the particular strands into probe-probe, target-target and probe-target interactions. The likelihood of probe-probe interactions depends on the microarray density [51] and can be addressed during the probe design stage. One way to address this type of interaction is to select only probes that do not hybridize well with each other or with themselves, thus being able to remain unfolded and capture the corre-sponding targets from the solution. The SLS algorithm presented in Chapter 5 can be used to address this problem, with the only modification that probe candidates shall be provided as part of the input, but not generated uniformly at random. Nev-ertheless, probe-probe interactions have not been addressed in previous work (to our best knowledge). As we will discuss in more detail in the following, there are two reasons for not considering this type of interactions: (i) the distance between anchored probes is sufficiently long to ensure non-interaction; (ii) probes can be screened for self-hybridization and rejected if they tend to form stable structures. The second type of interaction is unavoidable, less understood and cannot be efficiently controlled experimentally. Finally, probe-target interactions have a ma-jor impact on the quality of the results obtained with a microarray and thus will become the main focus of the probe selection criteria we develop later in this chap-ter. In the following sections, all three types of interactions are discussed in more detail. 6.1.3 Probe-Probe Hybridization It was found that at chip probe densities of less than 1 0 1 2 probes per cm2, probe-probe interactions and molecular crowding effects are insignificant [11], However, these studies use a single-component case (equivalent with non-competitive hy-bridization) for building the models, which limits their applicability in interpreting two- or multi-component microarray data (where at least two targets compete to Chapter 6. Microarray Probe Design 71 hybridize with one probe). To decide whether probe-probe interaction is an issue that needs further consid-eration when designing large sets of probes, we need to look also at the molecular length of single stranded and double stranded DNA, which determines the overall length of a DNA strand. These measures are well established for double stranded DNA [23]. Assuming that measuring for double stranded DNA is done at hy-bridization temperatures in the range 30°C < Tm < 60°C and salt concentrations of 1M of NaCl, double stranded DNA up to a length of 50 base pairs can be consid-ered a rod-like molecule with each base-pair contributing 3.4 A to its length. The radius of dsDNA is 9.5 A and its cross-section area is 284 A 2 . The corresponding features for single stranded DNA are not as well established as for double stranded DNA. A typical value for the radius of a monomer is 6 A [102, 106]. As a result of long range interactions that give rise to complex structures, single stranded DNA is often considered to be a random coil [114]. Note that this makes the effective length of the structure hard to determine. Generally, we can distinguish three types of probe-probe interactions: 1. interactions between probes placed on the same spot of the array, 2. interactions between probes placed on the boundary areas of neighboring spots, and 3. interactions between nucleotides of the same probe. Al l three types of interactions can, hypothetical ly, influence the sensitivity of microarrays. Probe-probe interactions of type 1 can occur on high density chips, where probe density is greater than 10 1 2probes/cm 2. Type 2 probe-probe interactions can only occur when the distance between probe sites is smaller than twice the length of one probe. Type 3 probe-probe interactions can take place when parts of the probe sequence become highly attractive to other parts of the same probe, giving rise to secondary structure formation. As most of microarrays used nowadays have probe densities lower than or com-parable to 10 1 2 probes per cm 2 , and because we can select probes that do not self-hybridize, we will not consider this type of hybridization in this work. Never-theless, probe-probe interactions can be easily incorporated into the probe design scheme we develop later in this chapter by using the stochastic local search ap-proach described in Chapters 4 and 5. The SLS algorithm presented in Chapter 4 can be used to optimize the probe-probe interactions, when this is required. The only changes that must be performed on the original SLS algorithm, is to replace Chapter 6. Microarray Probe Design 72 the randomly generated neighborhoods for each probe with collections of candi-date probes selected from each sequence, after probe selection criteria have been applied on the input set. Thus, each probe will have a neighbourhood consisting of candidate probes selected from the sequence of provenience. In this chapter, a different design strategy will be employed, based on a carefully selected set of design criteria. 6.1.4 Target-Target Hybridization Target-target hybridization can take place in the solution, which contains genomic sequences that need to be fished out by microarray probes. Nevertheless, it is very difficult if not impossible to control these interactions, due to the lack of control over the target identities and concentrations. Thus, we will not consider this type of hybridization in this thesis. 6.1.5 Probe-Target Hybridization Probe-target hybridization dominates the sequence interaction reactions that occur on a microarray chip. The probe-target hybridization process is influenced mainly by the way we select the probes to be placed on the chip and by the concentrations of both the probes fixed on the chip and the targets floating in the solution where the chip is exposed to during the hybridization reaction. Additional reaction-specific parameter settings may influence this hybridization reaction (e.g. temperature and pressure, or salt concentration). A probe pi may not bind to its target sequence due to various reasons, including the following: • The target ij may fold into itself such that the complement, pi, of probe pi would be unavailable. • The strength of the bond is too weak, thus the target may not hybridize well to the probe. • The target ij may have a stronger preference to bind to another target tj than to pi or to self-hybridize. • The target ij may have a strong preference to bind to a different probe pj. The SLS algorithm introduced in Chapter 4 doesn't apply well to this problem, when only probe-target interactions are taken into account. For the DNA Strand Design Problem described in Chapter 2, a change in the set of strands will pro-duce a change in the set of complements, which will be reflected in the evaluation Chapter 6. Microarray Probe Design 73 function values (the number of unsatisfied constraints). Thus, there exists an in-terdependency between the strands in a set and the evaluation of their quality is dependent on the quality of the corresponding complements. For the microarray design problem, where only probe-target interactions are considered, the probes are independent of each other and the set of targets doesn't change when one probe is replaced with another in the result set. 6.2 Probe Selection Criteria Various constraints for microarray probe design have been reported in the litera-ture. These constraints address the design goals described above. MO. Probe length The length of the probes can be chosen based on the mi-croarray manufacturing technology and experimental needs. High-density arrays (Affymetrix-like) can use probes up to 40 nucleotides long, while cDNA spotted arrays can use nucleotides up to a few hundred bases in length. The use of longer probes increases the specificity of a microarray. Nevertheless, long probes have a higher tendency towards forming stable secondary structures due to the increased number of nucleotides. Long probes have also a higher probability of hybridizing with undesired targets. Short probes of 8-20 bases have been used in early research work [77] due to the high costs of synthesizing probes. These probes had low specificities and required temperature settings at or below room temperature. Microchip manufac-turing technologies have improved since then, and costs have dropped dramatically. Thus, nowadays longer probes are widely used. Selecting probes of equal length (or with lengths in a narrow range) helps controlling the uniformity of a probe set. M l . Cross-hybridization Probes must be unique to each target, thus the probe affinity difference between the intended target and all other targets must be maxi-mized. Here, affinity refers to the preference of a probe to hybridize with a specific target, when other targets are present. Probe affinity can be defined more formally in terms of thermodynamic concepts like Gibbs free energy [118]. Kane et al. [62] introduced a new constraint, which has been tested on 50-mers, namely no 15-long contiguous repeats are allowed anywhere in the entire set of input sequences. Probe specificity improvement has been addressed in previous work by using BLAST alignments [90]. Various BLAST seeds and expectation values are used to ensure that no similar sequence regions will be used for probe selection. Nev-ertheless, most settings are poorly motivated and no thorough studies have been Chapter 6. Microarray Probe Design 7 4 published. The OligoWiz software [81] uses a BLAST-like approach to compute a homology score for each oligonucleotide in a given sequence. This score provides an estimate of the degree of cross4iybridization. Li et al. [68] and Relogio et al. [88] introduced constraints on probes so that they do not attempt to target sequence regions that may lead to the formation of dimers and hairpins. The most direct way to choose good probes in a set is to compute every cross-hybrid free energy, and pick the probes that cross-hybridize the least. Bhanot et al. [10] follow a variation of this procedure, where they evaluate only the free energy of cross-hybrids whose stability exceeds some reasonable threshold. They gen-erate a list of binding free energies for all probe-target hybrids that have not been eliminated by using other constraints (e.g. melting temperature). Their method dis-criminates against probes that are more likely to cross-hybridize. They introduced a scoring function that measures the level of cross-hybridization for each possi-ble probe. Unfortunately, the formula provided in the publication is ambiguous and leaves space for different interpretations. No further clarification was obtained from the original authors. It also seems that the proposed formula was not used in their work as an evaluation measure due to the exhaustive computations required. We interpreted the scoring function proposed in their work as follows: Let ./V be the number of sequences where i — { 1 , . . . , N}. Let L(gi) be the length of sequence gi. Let pi be a probe, and ti be a target, where i = { 1 , . . . , T V } . Let lw be the length of a window Wjm taken from a sequence gj and starting at position rn, which hybridizes with probe pk (perfect or imperfect match). The measure C(pk) for a probe p for sequence gk is defined as follows: nt , E j L i j ¥ f c ZLjLi~lw+1 exp(-AG(Pk,wjm)/RT) C ( P k ) - ; exP(-AG(Pk,tk)/RT) { b A ) f AG(pk,tk) - AG(pk,wjm - ^ ^ exp i r r M ^ p z i ^ l ( 6 .2) 3 = 1, j ^ k m = l ^ / where AG(pk,Wjm) is the minimum free energy for hybridizing probe pk with window Wjm and AG(pk, tk) is the minimum free energy for hybridizing probe pk with target tk. A more intuitive description of the C(pk) measure is given by Equation 6.2. The score associated with probe pk is given by the difference in stability between its perfect match and all target sections Wjm of length lw taken over all sequences except the one where pks target resides. The higher this value, the higher the probability of having stable matches formed between probe pk for sequence gk and targets Wjm of length lw on gene j , with j ^ fc. Chapter 6. Microarray Probe Design 75 C(j>k) values greater than 1 means that at least one target window of length lw has similar affinity for probe p^ as for its corresponding target sequence from sequence g^. The value of C(pu) is always positive and the larger the value of C(pk), the greater the cross-hybridization. Thus, in order to use C(pk) for probe design, the probe with lowest C{pk) score is chosen among all possible probe candidates, for each individual sequence. Selecting or evaluating probes using this probe selec-tion criterion allows for a better control of specificity and sensitivity of individual probes. M2. Melting temperature settings Melting temperature values are used to select probes that discriminate well between targets at a common temperature. Different melting temperature criteria have been proposed, starting with simply imposing upper and lower bound values on the melting temperature of a probe and its complement [68, 88]. A more complicated scoring function is used in the OligoWiz software [81]: ATm score — \ T m - T m \ (6.3) where Trn represents the melting temperature of a probe, and Tm is the mean melt-ing temperature of all probes or a user specified melting temperature threshold. Probes are selected from each sequence so as to minimize the ATm score, thus narrowing the variation of Tm values for all probes in the designed set. A similar scoring function for melting temperatures was used by Bhanot et al. [10]. To find a suitable experimental temperature, they first obtain the distribu-tion of melting temperatures of the entire potential pool of correct hybrids. Then, they restrict probes to a relatively narrow band of melting temperatures, namely the range [Tm — 4, Tm + 4], where Tm is the mean of the melting temperature distri-bution for all probes that can be selected. The requirement of selecting probes with melting temperatures that span a narrow range of values ensures the uniformity of the set of probes. M3. Position within target For cDNA microarrays, the position at which a probe binds to its designated target sequence seems to play an important role, es-pecially for long targets [81]. In a cDNA microarray experimental setup, dyes are usually incorporated during reverse transcription, when cDNA is synthesized. Reverse transcription rarely produces complete transcripts of the message. It is typical to obtain a substantial bias towards the 3'-end of the sequence [10]. The incompleteness of the resulting transcripts has the following effect. The reverse transcriptase will fall off the transcript with a certain probability and the further Chapter 6. Microarray Probe Design 76 away from the starting point this happens, the less signal will be generated. There-fore probes with target positions closer to the starting point of the reverse transcrip-tase are preferable. If the labeling starts from the 3'-end (where the poly A tail is attached), the following scoring function can be used [81]: position score — (1 — dp)A3 (6.4) where dp is the probability that the reverse transcriptase will fall off its template at any given base, and A 3 ' is the probe distance to the 3'-end of the input sequence. Intuitively, the position score measures the probability of selecting a probe at a given position. This selection criterion indirectly addresses the probe sensitivity goal. M 4 . Complexity filtering Very common sequence fragments are undesirable in probe composition, thus it is recommended to control the complexity of probes. Here, the complexity of a probe refers to the frequency of specific sequence pat-terns within it. A low-complexity measure for a probe was introduced by Nielsen et al. [81]. First, they generate a list of sequence sub-fragments. Then, the informa-tion content of these can be generated specifically for each probe. Their software (OligoWiz) calculates the information content of a sequence pattern, w, using the following equation: nt n(w) nt (6.5) where n[w) represents the number of occurrences of a sequence pattern in the set of sequences, l(w) is the pattern length, nt is the total number of patterns of a given length. ' A low complexity score is associated with each probe. low_complexity score = 1 — norm > I(vJi) (6-6) where L is the length of the probe, Wi represents the pattern at position i, and norm is a function that normalizes the sum to the interval [0,1]. Probes with extremely low complexities have a score close to 0. The major-ity of the probes investigated by Nielsen et al. [81] have exponentially distributed scores in the range [0.8,1.0]. Filtering probes based on their complexity is meant to increase the specificity of each probe by decreasing the probability of hybridiza-tion with undesired targets. Chapter 6. Microarray Probe Design 11 M5. Alphabet-related scoring Different letters are sometimes used to encode sequence regions of special importance (e.g., the letter N is sometimes used to encode non-coding regions). The OligoWiz software [81] assigns score 0 to all probe regions that contain letters other than A, C, T, and G, and 1 otherwise. The alphabet-related scoring is directly used to avoid unnecessary computations and it does not address directly the design goals specified above. M6. Base Composition Exclusion of probes that are particularly AT- or GC-rich is commonly used in most microarray design schemes [89, 90]. Lockhart et al. [72] introduced a set of rules, which are still used by Affymetrix. For example, no single base (As, Cs, Gs, Ts) exceeds 50% of the probe size, the length of any contiguous stretch of any single type of base (A, C, G, or T) is less than 25% of the probe length, and the GC content of a probe must be between 40% and 60%. This design criterion is used to ensure a higher specificity for the probes. M7. Self-hybridization Lockhart et al. [72] proposed the following rule: no self-complementarity is allowed within the probe sequence, (i.e., no interaction between bases). Bhanot et al. [10] select probes so that the free energy for each probe is lower than some threshold. More refined constraints for probe stability have been proposed by Li et al. [68] and Relogio et al. [88]. Their criteria pose constraints on secondary structure formation: probes should not form dimers and hairpins (ensured by using thermodynamic constraints). Selecting probes that do not self-hybridize increases their sensitivity, by allowing them to capture the corre-sponding targets without introducing additional competitive barriers. For example, if a probe self-hybridizes well, forming a stable secondary structure, the hybridiza-tion with the corresponding target occurs in two steps. First, the probe must unfold up to the point where bases along its sequence get exposed to the complementary bases in the target. Then, the target will hybridize with the probe. 6.3 Our Probe Design Approach A large collection of probe selection criteria have been used in the literature and implemented in software packages. A list of available software packages together with the criteria used for probe design is given in Table 6.1. However, many criteria used for probe design have not been thoroughly investigated in-vitro. Some param-eter values have been determined experimentally, but they do not generalize well for all data sets. For example, different BLAST scores and minimal seed settings are widely used for masking of similar sequences in gene coding regions. Never-Chapter 6. Microarray Probe Design 78 Software Mil M l M2 M3 M4 M5 M(i M7 CoAri-lys [9(1] V 50niers s/ BLAST •J V OligoArray [93] •j 50niers V BLAST V 87 ± 5° C V OligoArray 2.0 [94] v/ 45-47 mere s/ BLAST V 82 - 90° C V •J V GC content: 35-5-% V Oligol'ickerllKi] 70mers •J BLAST lliish table •J 79 ± 5" C V V OligoWiz [81] •j 45-55mers •J mean Tin: 75.7° C •J V •J Oliz [24] • 50mers s/ BLAST 76 ± 5° C V GC content: 45-50% Osprey [4(i] •j 70niers V Gribskovproti les |32| V 78 ± 5° C V V llcky [27] •j 50,70niers V suffi x-arrays •J dynamic GC content: 30-70% ]>robeMakcr[105] V 20niers V V s/ •J V V I'mbeSelect [6H] V 20-25 mere 50,70mers •j suffi x-arrays V 1'rnMiilc [SC] V 25mers V suffi x-arrays ROSO [f»] •j 15-70mers V BLAST V V V YODA [82] V Table 6.1: Available software packages for microarray probe design and their cor-responding probe selection criteria. Each cell marks the presence of a certain crite-rion with a \J sign and provides additional information specific to the experiments performed in the paper (if specified) that describes the software. theless, no experimental insights toward discovery of optimal criteria for BLAST parameter selection for various data sets have been published. Chapter 6. Microarray Probe Design 79 Probe specific constraints are sometimes used in a redundant fashion. For ex-ample, most software packages for probe design allow the concurrent usage of GC content and specific melting temperature ranges. Both measures control the level of stability of a given strand with different accuracies. A tradeoff between fast and accurate probe selection criteria is considered in most probe design frameworks. Nevertheless, the usage of over-simplified criteria is sometimes not justified. For example, the replacement of thermodynamic free energy measurements with com-binatorial criteria based on distances that do not differentiate between nucleotides is still widely used and poorly motivated. The probe selection criteria used in this study are variations of: MO, M l , M2, M5, and M7. The position within a target (M3) criterion is applied mostly to cDNA arrays and can be easily addressed by removing the sequence regions within a few hundred nucleotides of the 3' end of each target sequence. Complexity filtering (M4) and base composition (M6) are unnecessary when thermodynamic similarity and uniformity criteria are used to ensure the specificity of each probe for a large pool of target sequences. 6.3.1 Probe Length As previously mentioned, probe length depends on microarray technology, and the prediction accuracy of duplexes gets poorer with increasing length. Short probes may limit a researcher's ability to find unique binding regions on the input se-quences. This problem is widely addressed by using multiple short probes for each sequence in the input data set. Longer probes increase the probability of find-ing unique binding regions on the input sequences, but at the cost of increasing the probability of forming stable duplexes with incorrect target sequences or self-hybridizing. To establish a solid base for studying the necessity and efficiency of probe design constraints, small probes need to be considered, such that the validity of the conclusions is not hampered by the accuracy of the results. Thus, this work uses probes of length 25 - like those used in Affymetrix chips. 6.3.2 Sequence Masking using BLAST Basic Local Alignment Search Tool (BLAST) [3] is an algorithm for comparing biological sequences, such as the amino-acid sequences of different proteins or DNA sequences. A BLAST search enables a researcher to compare a query se-quence with a library or database of sequences, and identify library sequences that resemble the query sequence. To run, BLAST requires two sequences as input: a query sequence and a sequence database. BLAST will identify regions similar to the query sequence regions in the database. In typical applications of BLAST, Chapter 6. Microarray Probe Design 80 the query sequence is much smaller than the database; for example, the query may be several hundred or thousand nucleotides long while the database may contain sequences with several billion nucleotides. Here we used NCBI BLAST version 2.2.12 with gaps [4]. The gapped version of the BLAST algorithm, used in this work, can be conceptually divided into three stages. 1. In the first stage, BLAST searches for exact sequence matches of a small fixed length W between the query and sequences in the database. Such an exact match is also called a seed. For example, given the sequences AGT-TAC and ACTTAG and a word length W = 3, BLAST would identify the matching substring TTA that is common to both sequences. By default, W = 11 for DNA sequences and the minimum value of W is 7. 2. In the second stage, BLAST tries to extend the match in both directions, starting at the seed. The ungapped alignment process extends the initial seed match of length W in each direction in an attempt to boost the alignment score. Insertions and deletions are not considered during this stage. For our example, the ungapped alignment between the sequences AGTTAC and ' ACTTAG centered around the common word TTA would be: . . A G T T A C . . . . A C T T A G . . If a high-scoring ungapped alignment is found, the database sequence is passed on to the third stage. 3. In the third stage, BLAST performs a gapped alignment between the query sequence and the database sequence using a variation of the Smith-Waterman algorithm [103]. Statistically significant alignments are then displayed to the user. The selection of optimal BLAST seeds is important for accurate similarity searches. We investigate the impact of BLAST seeds in the context of microar-ray probe design in Section 6.4.3. 6.3.3 Melting Temperature It is crucial that all probes attached on a microarray perform well under similar hybridization conditions. This requirement can be satisfied by using the melting Chapter 6. Microarray Probe Design 81 temperature of DNA duplexes. Thus, the similarity of hybridization conditions for a given set of probes can be achieved by enforcing the melting temperatures of all probes to fall within a narrow range. The process of finding optimal settings of melting temperatures for individual probes is deeply rooted in the hybridization and washing conditions required in a microarray experiment. The Nearest-Neighbour model implemented by PairFold provides support for computing melting temperatures for DNA sequences. Melting temperatures are computed using the following formula: T - = A S + ^ M C r / 4 ) + 1 6 ' 6 * l 0 g 1 . 0 ^ . 7 ^ + ] - 2 6 9 3 ^ where AH is the enthalpy, A S is the entropy change of the reaction, R is the universal gas constant (1.98717 * I O - 3 kcal K ~ L mol-1), CT is the total molar concentration of the strands, and [Na+] is the salt concentration. We propose the following procedure to obtain melting temperature range lim-its: 1. Use Algorithm 3 to obtain the melting temperature range for which at least one probe of length n per sequence can be selected. The algorithm com-putes melting temperature ranges for all probes of length n for each input sequence Qi and records the minimum and the maximum values, Tmin(gi) and Tmax(gi). After all maximum and minimum values have been obtained, the range RT = \Tmin>Tmax\is returned such that the intersection between RT and all ranges [Tmin(gl),Tmax(gi),] is not empty. The boundary values of RT are computed as follows: Tmin = max Tmin(gl) (6.8) \<i<N Tmax= mm Tmax(gi) (6.9) l<i<N 2. Adjust the interval obtained in step 1 such that the overall melting temper-ature range is narrow enough to fulfill the experimental requirements. It is sometimes the case that the user selected interval intersects the interval com-puted at the previous step but is not nested within it. In this situation, there will be sequences for which no probes can be selected. Once the melting temperature interval is set, probe selection can begin. Given a predefined melting temperature interval, all possible probes that passed previous criteria will be considered. One probe per sequence is finally selected in such a way that its melting temperature is the highest within the allowed interval. Chapter 6. Microarray Probe Design 82 procedure ComputeTMRange ( < / i , < ? J V , n) input: set of sequences gt, number of sequences N, probe length n output: minimum and maximum melting temperature range values: rjrninTM, rjmaxTM minTM := + 0 0 maxTM := — 0 0 for i := 0 to JVdo •minTM, = + 0 0 maxTMi = — 0 0 for j := 0 to L(gi) - n + 1 do compute TMij for probe at position j on sequence gi if (TMij < rninTMi) rninTMi := TMij if (TMij > maxTMi) maxTMi := TMij end for if (rninTMi > minTM) minTM := rninTMi if (maxTMi < maxTM) maxTM := maxTMi end for end ComputeTMRange return [minTM, maxTM] Algorithm 3: The algorithm for melting temperature range calculation. 6.3.4 Self-hybridization A DNA probe can self-hybridize to form a secondary structure, which in turn will diminish the ability of the probe to capture its corresponding target. For microarray probe design, self-hybridization becomes a greater concern when long probes are used. In this work we use the SimFold package [7] to compute the minimum free energy (MFE) for a predicted secondary structure. SimFold uses a dynamic pro-gramming approach to predict secondary structures for RNA and DNA molecules. In our approach, each selected probe must have a very high MFE. Ideally, the MFE must equal zero, thus the probe must be structure-free. 6.3.5 Other Criteria We have further explored different probe selection criteria based on criteria M l and M6. First, we investigated the following selection protocol that could have replaced the use of BLAST and other criteria presented here: • Select one sequence from the initial set and compute the secondary structure Chapter 6. Microarray Probe Design 83 of the duplex formed between this sequence and a different sequence from the set. • Remove the areas that form stable duplexes from both sequences. • Insert the remaining sub-sequences back in the original set and perform this process until no further insertion (or removals) can be done. The remaining set of sub-sequences is guaranteed to contain only pieces of sequences that do not hybridize with each other. Thus, we can select high quality probes from the remaining sub-sequences. Unfortunately, this method proved to be very detrimental with respect to the size of the sets of probes. We noticed a decay in the number of probes (more than 50%) for the input sequences. Nevertheless, the quality of the selected probes was as high as expected, when C(pk) score dis-tributions have been compared. Further investigation may result in more effective variants of this method. We investigated also the use of probe selection criteria M6, namely rejection of probes with a high number of repeated bases (e.g., six A's, Cs , G's and T's in a row). This criterion is widely used in practice, nevertheless we did not notice any major improvement in the quality of randomly selected sets of probes, compared with the baseline (see Figures 6.5a-d, 6.6a-e and 6.2). 6.3.6 The Probe Selection Process In this work, we propose two probe selection approaches for microarray probe design. The first approach consists of the following three steps. 1. Prepare the input data and find the optimal melting temperature interval and the optimal seed for BLAST. The optimal melting temperature interval can be obtained by using Algorithm 3. The optimal seed for BLAST is obtained by running BLAST multiple times with different seeds on the same set of target sequences. We select the minimum seed that leaves the opportunity to select at least one probe per target sequence. 2. Mask the input sequence using the gapped version of BLAST with the opti-mal seed value obtained in the previous step. 3. Filter out all probes with melting temperatures outside the desired range, namely [70, 75] 0 C. This will further narrow down the original melting temperature range obtained in step 1, thus possibly causing decays in the number of probes that can be selected from some target sequences. Chapter 6. Microarray Probe Design 84 remove 6 base repeats be fo re B L A S T 0 1e 15 1e-14 1e-13 1e-12 1 e-11 1e-10 1e-09 1e-08 1e-07 1e-06 1e-05 0.0001 C(k) Figure 6.2: Distribution of C(pk) values for 15 randomly selected sets of probes (5 sets per criterion) of length 25 for all genes of Bacteriophage_phi3626 selected us-ing the following criteria: probes selected uniformly at random (baseline), probes selected uniformly at random after BLAST (Ml) and probes selected uniformly at random after BLAST (Ml) and subsequent removal of 6-base repeats (M6). 4. Sets of probes are selected uniformly at random from the results obtained with these two approaches. This approach allows the design of probe sets with high specificity, sensitivity and uniformity. The uniformity is guaranteed by the last step, which specifies a narrow melting temperature interval of 5° C, for all the probes. The second approach differs from the first only in that in the third step the probes with the highest melting temperature per sequence are selected. As we will see in Section 6.4, this approach provides much better results than the previous approach on average; still, the selected probes tend to be spread over a wide range of melting temperatures. The fourth step can be modified in a straightforward way (by selecting not one, but more probes for each target sequence) to support selection of multiple probes per target. Chapter 6. Microarray Probe Design 85 We further evaluate the quality of the sets of probes using the C(p^) scores computed using the Simplified PairFold algorithm presented in Section 6.4.2. This is a very time-consuming process, thus we initially apply it to one probe per target sequence only. The evaluation step can be repeated for different sets of randomly selected probes, until we find one that satisfies our criteria. A better improvement strategy is discussed under future work in Chapter 7. 6.4 Results and Discussion We investigate the quality of our probe selection criteria using some of the bench-mark data sets presented in Table 6.2. The benchmarks with a large number of sequences have not been used in this work because of two reasons: (1) the probe design packages considered in this work require large amounts of time to produce results, and (2) the evaluation of probes cross-hybridization using the C(p f c) mea-sure requires not only large amounts of time to complete, but with our computing resources, it's simply not feasible. We evaluate the quality of the sets of probes us-ing the C(pk) measure for cross-hybridization introduced in Section 6.2. Then we compare the quality of the probe sets constructed using our probe selection criteria and sets obtained using four publicly available packages. 6.4.1 Benchmark Data Sets The data sets used in this work have been selected and generated to test the accu-racy of different microarray probe design techniques towards achieving the design goals presented in Section 6.1.2. We grouped the data sets used in this chapter into two categories. The first group contains artificial data sets with genes that have been generated to test and improve specific aspects of our design scheme. The insertion of known motifs ('planted" motifs) at known locations helps creating a control set of probes that are meant to test the specificity of the probe design meth-ods under consideration. The complementary variation of the content of bases in the 'planted" motifs with respect to that in the target sequences ensures a higher level of discrimination for the probes corresponding to inserted motifs ('planted probes'). The second group contains coding sequences (CDS) for small (at most 100 sequences), medium (between 101 and 1000 sequences) and large (more than 1 000 sequences) genomes. The data sets with a low number of sequences have been used to quickly test and identify the probe selection criteria that play a major role in selecting good probes and those that have less impact on the quality of the final probes, given our evaluation criteria. The medium and large sets of sequences are used to confirm the findings based on small size data sets. We have also selected Chapter 6. Microarray Probe Design 86 the benchmarks based on the variation of GC content, that span a range between 27.44% for Buchnera_sp and 51.58% for Rattus_Norvegicus. A summary of these data sets is provided in Table 6.2. Organism # Targets Target Len. Avg. GC-Cont. Avg. Range Target Len. Range (%) GC-Cont. (%) Artificial Data Sets BP1 50 425 425 [84.86, 85.88] 85.23 BP2 50 425 425 [66.51,67.57] 67.11 BP3 50 425 425 [32.66, 35.65] 32.88 BP4 50 425 425 [13.88, 15.53] 14.73 BP5 50 425 425 [48.00, 49.92] 49.72 BP6 50 425 425 [49.49, 51.06] 49.82 BP7 50 425 425 [50.16, 52.94] 50.39 BPS 50 425 425 [49.39, 52.24] 49.85 BP9 50 425 425 [49.62, 52.24] 49.82 Public Data Sets Bacillus suhlilis 30 [444, 882] 552.50 [43.87, 46.70] 44.33 Bacteriopliage_phi3626 50 [120,2952] 621.78 [26.97, 33.10] 28.62 Buclinera_sp 573 [117,4 224] 988.13 [27.22, 30.47] 27.44 Other Public Data Sets (not used in this work) Drosophila_Melanogastcr 19 177 [132, 69 571] 2 242.26 [39.93, 50.57] 50.06 Homo_Sapiens 16084 [126, 27 652] 2 651.58 [49.35, 66.03] 49.95 Mus_Musculus 23 658 [66, 23 218] 2 249.30 [39.63, 50.55] 50.03 Pan_Troglodytes 21 335 [4, 55 099] 1 824.68 [48.60, 69.12] 49.41 Rattus_Norvegicus 14 905 [66, 37014] 1 809.40 [39.42, 53.64] 51.58 Saccharomyces cerevisiae 6714 [51, 14733] 1351.98 [37.12,41.71] 39.61 Table 6.2: Overview of the data sets used for comparing the accuracy of probe design programs and their properties. 6.4.2 Evaluation Criteria To evaluate the quality of the probe sets, we estimate the probability of cross-hybridization between a probe and a set of longer sequences using the scoring function from Equation 6.1 in Section 6.2. The scoring function measures the difference in stability between a probe and its perfect match and all other imperfect matches along the whole set of sequences. The stability measure is given by free Chapter 6. Microarray Probe Design 87 energy estimations. The standard version of the PairFold algorithm that could be used for calculating duplex free energies has cubic time complexity in the length of the sequences involved, thus, a computationally faster version is needed. Simplified PairFold A simplified version of the PairFold software package has been kindly provided by Mirela Andronescu - one of the authors of PairFold. The simplified version com-putes the minimum free energy (MFE) secondary structure formed by two input DNA or RNA molecules. Only interactions between bases that belong to different sequences (inter-sequence interactions) are allowed, and hence the predictions can be different from PairFold. The base interactions that occur within the same se-quence (intra-sequence interactions) are not included in the minimum free energy value, nor in the final secondary structure. Simplified PairFold has a time com-plexity of 0{{ni + T I 2 ) 2 ) , whereas the more complex standard version of PairFold has time complexity 0({n\ + n . 2 ) 3 ) , where n\ and 712 are the lengths of the two sequences. To ensure that no major accuracy loss takes place, we generated 1 000 pairs of random sequences with uniform base distribution of length 25 and we computed the minimum free energies using PairFold and Simplified PairFold. The M F E val-ues for Simplified PairFold correlate well (Pearson's coefficient r = 0.8314794) with the ones obtained using PairFold (see Figure 6.3). Nevertheless, for longer sequences, the correlation coefficient decreases (r = 0.5794835 for 50-mers and r = 0.4786741 for 100-mers) due to the increased probability of having intra-sequence interactions. C{pk) measure for cross-hybridization. The simplest and most direct way to choose the best probe set is to compute all free energies for all pairs of probes and targets and pick the probes that cross hy-bridize the least. Nevertheless, this procedure is computationally expensive and can be used only for very small sets of probes. The asymptotic complexity of this computation is Q(N2 * L 2 *Tsim Piined PairFold)* where N is the number of sequences (genome size), L is the average length of all sequences and Tsimpiified PairFold is the run-time of Simplified Pairfold, which is a quadratic function of L. A more detailed computation of these results is provided in Appendix A. Computational considerations for C(j>k). For two sequences of length 25, the average run time for Simplified PairFold is 1 millisecond on a reference machine with Intel(R) Xeon (TM) CPU 2.40GHz dual Chapter 6. Microarray Probe Design 88 •2 h DL -8 -10 miff: -12 -10 -8 -6 MFE Simplified Pairfold Figure 6.3: Correlation plot for MFE values obtained from Simplified PairFold vs PairFold for 1 000 pairs of randomly generated sequences of length 25. Pearson's correlation coefficient is r = 0.8314794. processors (only one processor used for the non-parallel estimation) with 1 GB D D R A M and 512 KB L2 cache. For 1 000 genes each with an average length of 500 bases, using a probe length of 25, the run-time for computing a C(pk) score for one probe is 7.92 CPU minutes, on the same reference machine. Thus, the C(pk) scores for a set with 1 000 probes of length 25 can be evaluated in about 5.5 days of computation. For our experiments, we implemented a parallel version of the C(pk) calculation procedure that uses the LAM/MPI package [22] and runs on up to 40 CPUs, which reduces the computation time in the previous example from 5.5 days to 3.5 hours. 6.4.3 Preliminary Experiments Complete details on experimental parameter settings are often not specified in the literature on microarrays. In this work, a melting temperature interval of [70,75] °C has been chosen based on private communication with scientists from UBC, and is used consistently throughout the chapter. Chapter 6. Microarray Probe Design 89 We performed a set of empirical studies that ultimately led to the current probe selection criteria. In the first of these studies, we investigated the impact of the BLAST seed length parameter on sample data sets. BLAST gene masking is a rough cross-hybridization removal technique that can be further enhanced by using probe selection criteria based on thermodynamic measures. The selection of optimal seeds for BLAST is dependent on the number of target sequences and the similarity between them. The more target sequences, the higher the chance of encountering regions with high similarity. For each BLAST seed in the range [7,20], we masked out the similar regions for all given target sequences, by running BLAST on one sequence at a time against the remaining sequences. For each of the resulting masked sequences, we selected all valid sub-sequences of size 25 and generated the corresponding probes for each of them. Thus, each target sequence will have a list of associated probes that can be used as unique identifiers. Then we selected uniformly at random one probe per sequence. For each probe, we computed the C(pk) score. The results for the Bacteriophage_phi3626 genome are shown in Figure 6.4. For small seeds up to 11 (Figure 6.4a), there are sequences for which no probes could be selected. The probability of finding similar sub-sequences of smaller length is high in this case. Thus the probability of having sequences without any non-masked regions of length greater than the desired probe length is high. For larger seeds, the prob-ability of having highly similar regions decreases, but the probability of leaving unmasked regions that may hybridize increases. Thus, the probability of obtain-ing high C{pk) values is higher (Figures 6.4b and 6.4c). A tradeoff for selecting the best BLAST seed is necessary. The smallest seed for which we can select probes for the Bacteriophage_phi3626 genome is 12. Similarly, the smallest seed for the genome of Buchnera_sp, for which all sequences have at least one probe is 14. Thus, the selection of a good BLAST seed is dependent on the input data sequences. In the second set of empirical studies, we looked at how different base concen-trations for randomly generated artificial sequences affect the quality of the probes when different probe selection criteria are applied. Originally, we considered a set of 6 probe selection criteria: • Select 25 bases long probes (MO), • Run BLAST with seed 12 for cross-hybridization avoidance (Ml) , • Select probes with melting temperatures in the range [70,75]° C (M2), • Select the probes with the highest melting temperature per sequence (M2), Chapter 6. Microarray Probe Design 90 — . — , — •• • i • .... • *• t. / M .£> seea / : „ , . . . . . . . ^ . - : seed 9 seed 10 1e-14 1e-12 1e-10 1e-08 C(k) 50 45 40 35 30 25 20 15 10 5 1e-06 0.0001 : ' 1 \-r • £7 K / rj' j ~ seed 12 . / i seed 13 ,» 7 seed 14 r y Z. .... seed 16 . V"* ' seed 17 i 1e-14 1e-10 le C<k) 1e-06 0.0001 (a) (b) ' 1 ' 1 "77^ - •* it i .. .. / / seed 12 seed 18 ' seed 19 / I . I . 1e-12 1e-10 1e-08 C(k) (C) 1e-06 0.0001 • '..•',J t/ f T -t, ; /J/y M ••• seed 12 seed 21 seed 23 seed 24 IJ -M-—•-' • 1e-14 1e-12 1e-10 1e-08 C(k) 1e-06 0.0001 (d) Figure 6.4: C(Pk) scores for BLAST seeds in the range 7-20 for Bacterio-phage_phi3626 (50 CDS). The distribution of C{pk) values of the probe set is displayed in the form of CDF curves, (a) When using BLAST seeds from 7 to 12, the number of sequences (genes) that can provide probes increases up to the max-imum of 50. (b), (c) and (d) Increasing the BLAST seed from 12 to 25 will allow probes with higher cross-hybridization to be selected in the final set. • Select probes which do not have any consecutive stretch of 6 or more identi-cal bases (M6), • Select probes with the highest GC content per sequence (M6). We generated 9 data sets with 50 random sequences of length 400, with various base probabilities and we inserted motifs of length 25 with various base probabil-Chapter 6. Microarray Probe Design 91 Sequences Motifs ID P(A) P(T) P(C) P(G) P(A) P(T) P(C) P(G) BPl 0.05 0.05 0.45 0.45 0.45 0.45 0.05 0.05 BP2 0.15 0.15 0.35 0.35 0.35 0.35 0.15 0.15 BP3 0.35 0.35 0.15 0.15 0.15 0.15 0.35 0.35 BP4 0.45 0.45 0.05 0.05 0.05 0.05 0.45 0.45 BP5 0.25 0.25 0.25 0.25 0.25 0.25 0.25 0.25 BP6 0.05 0.45 0.05 0.45 0.45 0.05 0.45 0.05 BP7 0.15 0.35 0.15 0.35 0.35 0.15 0.35 0.15 BP8 0.35 0.15 0.35 0.15 0.15 0.35 0.15 0.35 BP9 0.45 0.05 0.45 0.05 0.05 0.45 0.05 0.45 Table 6.3: Combinations of sequence and motif base probabilities used to study the efficiency of different probe selection criteria. ities at random locations within each sequence. A summary of the combinations of base probabilities for each set is presented in Table 6.3. These randomly gener-ated sets are used as benchmarks for investigating the effectiveness of the 6 probe selection criteria presented above. The specific base compositions for sequences and planted motifs have been chosen to facilitate the investigation of two factors: (i) the impact of selecting high and low GC content probes over the quality of the probe set and (ii) the impact of other base compositions on probe selection criteria. Figures 6.5 and 6.6 present preliminary results obtained by applying the afore-mentioned probe selection criteria on sequences with planted motifs and various base probabilities. Distributions of C(pk) values for probes with GC content of 10%, 30%, 70% and 90% are presented in Figure 6.5. The cross-hybridization scores correspond to sets of randomly selected probes after specific selection crite-ria have been used, namely M0, M l , M2 and M6. We noticed that probes with the highest GC content per sequence and those whose corresponding motifs have been inserted (and have GC content higher than 50%) have lower (better) C(pfc) values when the GC content of the sequences is low. When the GC content of the motifs and the sequences equals 50%, the best set of probes is obtained by selecting those with the highest GC content per sequence (see Figure 6.6a). When we keep the GC content constant (50%), and we change the base prob-abilities for A and C versus G and T, we notice that the quality of the probes that correspond to inserted motifs is always high. Nevertheless, the quality of the probes with the highest GC content per sequence and 50% AC-content is higher for BP5 (Figure 6.6a), BP7 (Figure 6.6c) and BP8 (Figure 6.6d). Chapter 6. Microarray Probe Design 92 -i . J . ( • • fl I Lh • J . L III t" v m K leloro BLAST l lemperalure • J . /' . / . merlin B-14 1e-12 1e-10 1e-08 1e-06 0.0001 001 C(kJ W/: 1 WU. I I • m j before BLAST mailing temperature 1o-18 1o-16 «-12 la-10 ia-08 ie-06 0 0001 0 01 1 100 (a) (b) 50 45 40 35 s 30 25 t 20 15 10 J • J / I 7 • J melting lemperalure tialoro BLAST i i • i '— J i / t j ... melting lemperalure Colore BLAST e-OB 1e-06 0.0001 001 3-10 1e-08 le-06 0.0001 0.01 1 100 C{k) (c) (d) Figure 6.5: C(pk) score distributions for 4 combinations of base probabilities for sequences and planted motifs and various probe selection criteria, (a) BP1, (b) BP2, (c) BP3, (d) BP4 Chapter 6. Microarray Probe Design 93 ' / " / ' ft • / i • j J i • / I ' T - t jh i i . t "Mr- fa l * « r c o ; / / H^fiwJ f . io f^ j • / ! • } • ; ' rftw BLAST »»*d 1? J moiling lamperalure belore BLAST 1B 18 lo- lb U-14 le 12 I B 10 1BC8 1«-06 0 0001 0 01 1 100 (e) Figure 6.6: C(Pk) score distributions for 5 combinations of base probabilities for sequences and planted motifs and various probe selection criteria, (a) BP5, (b) BP6, (c) BP7, (d) BP8, (e) BP9 Chapter 6. Microarray Probe Design 94 Since GC content is well correlated with the melting temperature of probe se-quences [109], we looked further at selecting probes based on GC content and melting temperatures. In general, probes with the highest GC content per sequence have lower C(pk) values than the ones with low GC content, thus suggesting that we can select good probes by choosing those with the highest melting tempera-tures. When the GC content of the sequences is between 30% and 70%, the probes with the highest melting temperature per sequence have the lowest C{pk) values. Nevertheless, selecting only probes with the highest melting temperature per sequence does not assure a narrow melting temperature range for the set. Thus, the selection of probes with the highest melting temperature in a predefined range seems to be a good selection criterion and will be used in this chapter. For com-parison, we present also sets of probes with the highest melting temperature per sequence. 6.4.4 Comparison with Other Methods To test the ability of our probe selection strategy to build DNA probe sets that fa-vorably address the design goals defined in Section 6.1.2, we first selected several software packages from the literature as controls, namely OligoPicker [116], Os-prey [46], ROSO [89] and YODA [82]. We ran each software and evaluated the resulting probe sets using the measures discussed in Section 6.4.2. The evaluation was performed on benchmark sets of coding sequences presented in Table 6.2. The parameters for all software packages used in the comparison study have been carefully chosen uniformly to minimize the impact on the variations in results quality. Where possible, we;tried to use the same parameters for all software pack-ages. For parameters specific to only one package we used the default settings. A summary of specific parameter values and settings for the compared software packages is presented in Appendix C, Table C . l . As a baseline probe set, we selected probes uniformly at random from each sequence and we compared the C(pk) distributions for baseline and designed sets. All software packages produce better probe sets than the baseline sets (randomly selected sets of probes before BLAST has been run on the target sequences) for all benchmarks. A summary of the results obtained with Oligopicker, Osprey, ROSO, YODA and our software is presented in Table 6.4. In average, our probe selection method produces sets of strands of comparable size with the ones generated by other packages. For 6 out of 12 data sets, our method found sets of probes of optimal size, i.e., one probe per sequence. An evaluation of the C(pk) scores for a subset of 30 CDS sequences from Bacillus Subtilis (Figure 6.7) provided on the OligoWiz [81] web site and 50 CDS from Bacteriophage_phi3626 (Figure 6.8) suggests that our method for probe se-Chapter 6. Microarray Probe Design 95 Data Set Our Method OligoPicker Osprcy web based ROSO web based YODA BP1 (50 sequences) 29 50 1 15 50 BP2 (50 sequences) 49 50 1 43 50 BP3 (50 sequences) 50 50 1 50 50 BP4 (50 sequences) 24 50 1 19 50 BP5 (50 sequences) 50 50 1 50 50 BP6 (50 sequences) 43 50 1 49 50 BP7 (50 sequences) 50 50 1 50 50 BP8 (50 sequences) 50 50 1 50 50 BP9 (50 sequences) 33 50 1 48 50 B.Subtilis (30 sequences) 30 30 30 30 30 Bacteriophage_phi3626 (50 sequences) 46 50 44 39 50 Buchnera_sp (573 sequences) 552 572 560 539 573 Table 6.4: Probe set sizes obtained with our method and four publicly available packages for DNA probe design. For our method, we report the number of probes that have melting temperatures in the range [70, 75]° C. lection produces sets of probes of higher quality than the control software pack-ages. Further comparisons using public data sets with a higher number of sequences show that our software produces sets of probes of comparable quality with the ones produced by Osprey and ROSO (Figure 6.9). The other two software packages, namely OligoPicker and YODA, produce sets of probes with poorer C(pk) scores. The observed difference is due to two factors: (i) the automatic settings for melting Chapter 6. Microarray Probe Design 96 Figure 6.7: C(p*) values for sets of probes of length 25 produced with different software packages. The probe sets have been extracted from a benchmark set of se-quences: B.subtilis (30 randomly selected CDS, length interval [444, 882], average length 552.5 nucleotides). The melting temperature interval was set to [70,75]°C for all software packages that supported this option. temperature ranges, and (ii) the usage of different melting temperature formulae and DNA strand concentrations in different packages. For Oligopicker and Osprey, we noticed that the melting temperatures for 11 and respectively 3 probes out of 30 are lower than the in-silico experimental interval setting of [70,75]° C. The melting temperature values computed by YODA for 30 probes designed for 30 CDS sequences from B. Subtilis lie in the range [55.42,60.28]° C, while the range estimated with the melting temperature function provided by PairFold for the same set of 30 probes is [70.73, 76.04]° C. We also performed additional comparisons using the artificial data sets. The Osprey package was able to select only one probe out of 50 for all artificial data sets provided as input. At this time we are not aware of what could have caused this problem, thus we do not include this package in this comparison. Figures 6.10a-d, and 6.11a-e present results obtained using OligoPicker, ROSO, YODA and our method. We have chosen the same baseline set, as described above, namely, sets Chapter 6. Microarray Probe Design 97 (/i 35 1 1 1 1 1 • - • f r i J: ;; ..: J jf I-Highest TMs iqhest TMs in ranae After BLAST Before BLAST ROSO Osprey YODA » r ; 1 1 1e-15 le-14 1e-13 1e-12 1e-11 le-10 1e-09 1e-08 1e-07 1e-06 1e-05 C(pk) Figure 6.8: C(pk) values for sets of probes of length 25 produced with different software packages. The probe sets have been designed for a benchmark set of sequences: Bacteriophage_phi3626 (50 CDS, length interval [120, 2 952], average length 621.78 nucleotides). The melting temperature interval was set to [70,75]°C for all software packages that supported this option. with probes selected uniformly at random from each sequence. Selecting probes with the highest melting temperature per sequence constantly produces high quality sets, whereas limiting the melting temperature range for probes to [70,75]° C works best for sequences with low GC content (see Figures 6.10c, 6.10d and 6.11a). OligoPicker, ROSO and YODA perform similarly well on these data sets. They seem to perform better on data sets that contain sequences rich in GC content (50% and over). Interestingly, for sequences with 10% GC con-tent, good probe sets can be obtained by simply selecting probes at random after BLAST has been run over the input sequence using seeds of length 12. The same phenomenon can be observed in Figure 6.9 on the public data set Buchnera_sp (average GC content per sequence equal with 27.44 and BLAST seed 14). For the same public data set, namely Buchnera_sp, we notice that our probe selection methods produce also probes of lower quality (high C(pk) scores). The number of probes with high C(pk) scores is less than 5 for this data set. The presence of Chapter 6. Microarray Probe Design 9 8 600 500 400 300 100 \-1 ' 1 1 ' l -i 1 1 1 1 ' 1 ' j <: 1 1 j -• 1 • i - \ • . l . Highest TM probts per seq Highest TM probe in range ROSO Osprey After BLAST YODA Before BLAST i . i > i J I . 1e-18 1e-16 1e-12 1e-10 1e-08 C(pk) 1e-06 0.0001 0.01 100 Figure 6.9: C(pk) values for sets of probes of length 25 produced with different software packages. The probe sets have been designed for a benchmark set of sequences: Buchnera_sp (573 CDS, length interval [117, 4 224], average length 988.13 nucleotides). The melting temperature interval was set to [70,75]°C for all software packages that supported this option. low quality probes in our results is due to the use of long BLAST seeds for this set (seed 14). While using high BLAST seeds minimizes the amount of masked regions within the data sequences and allows us to select probes for virtually all sequences, it increases the probability for missing high similarity regions within the data sequences. 6.4.5 Run Times Run-time comparisons for the computational experiments performed in this Chap-ter are difficult to achieve due to the following factors: 1. Different programming languages have been used to implement the software packages compared in the previous section. 2. Two software packages (Osprey and ROSO) have been accessed via web Chapter 6. Microarray Probe Design 99 Figure 6.10: C(pk) score distributions for 4 combinations of base probabilities for sequences and planted motifs and various probe selection criteria, (a) BP1, (b) BP2, (c) BP3, (d) BP4 Figure 6.11: C(pk) score distributions for 5 combinations of base probabilities for sequences and planted motifs and various probe selection criteria, (a) BP5, (b) BP6, (c) BP7, (d) BP8, (e) BP9 Chapter 6. Microarray Probe Design 101 Data Set Our Method OligoPicker Osprey web based ROSO web based YODA BP1-BP9 4s 3 s [6 s] [63 s] 1 s B.Subtilis 2s 1 s [20 s] [46 s] 1 s Bacteriophage_phi3626 3 s 5 s [37 s] [56 s] 2s Buchnera_sp 108 s 46 s [5 580 s] [254 s] 66 s Table 6.5: Run-times for our method and four publicly available packages for DNA probe design. The run-times for our probe selection method do not include the evaluation step described in Section 6.3.6 and the self-hybridization optimization procedure described in Section 6.4.6. The values in brackets represent low accu-racy run-time measurements for web-based software. interfaces. 3. Some software packages have optimized code whilst others do not. 4. We do not know the specifics of the computing platforms where the web-based packages have been run. 5. Some standalone packages have graphical user interfaces, thus the run time measurement is also influenced by setting the parameters and the time re-quired to display the information. Still, we provide run-time results on the reference machine for our method and rough estimates for the other programs, including the web-based ones. Table 6.5 shows that our method as well as OligoPicker and YODA scale relatively well with data set size. The web-based software seems to require more time to design sets of probes. It is unclear whether the additional time required to obtain probe sets is just an artifact caused by the imprecise time measurements or whether it is due to the design methods the packages are built upon. We performed additional run-time measurements for individual computational components of'our probe selection method. The first two steps that involve BLAST masking require the largest amount of time. For example, one BLAST masking step for seed length 14 needs in average 55 seconds to process 573 sequences from Buchnera_sp, while step 3 requires only 3 seconds to complete a full data sequence scan for probes with melting temperatures in the specified range. In general, developing probe design methods that scale well with input data size is a challenging task. While some methods scale better than others with re-spect to input size, the probe design criteria for those methods are mostly based on Chapter 6. Microarray Probe Design 102 simplistic approximations. The usage of higher accuracy (thermodynamic) design criteria increases the overall computation time, thus making the probe design pro-cess infeasible for large data sets. A trade-off between accuracy and speed seems to be necessary. 6.4.6 Improving Self-hybridization The evaluation measure, C(pk) used in this work doesn't capture the ability of a probe to self-hybridize, thus the figures presented in Section 6.4 do not include information related to this selection criterion. To gauge the selection of probes, which do not self-hybridize, we use the minimum free energy (MFE) of each probe. Table 6.6 presents the MFE values for sets of probes designed with our method and the 4 other packages, for all the benchmark data sets used in this work. First, we design sets of probes that have not been explicitely required to ful-fill self-hybridization constraints. We obtain results of similar but slightly lower quality (on average) than the probe sets obtained with other software. Then, we improve the quality of the probe sets by selecting probes that have overall self-hybridization free energies lower than -5.00 kcal/mol. The number of probes per set remains unchanged and the cross-hybridization C(pk) scores have similar qual-ities with the original sets of probes. In all but three cases, namely sets BP1, BP4 and Buchnera_sp, our sets had lower self-hybridization MFEs than all the control sets. Chapter 6. Microarray Probe Design 103 Data Set Our Method OligoPicker Osprey ROSO YODA BP1 ' [-3.70,0.00] [-10.44,-0.07] [0.00,0.00] [-1.46,0.00] [-8.22,0.00] BP 1-opt [-2.95,0.00] BP2 [-5.03,0.00] [-6.73,0.00] [-0.81,-0.81] [-2.01,0.00] [-6.00,-0.04] BP2-opt [-1.83,0.00] BP3 [-3.66,0.00] [-2.50,0.00] [0.00,0.00] [-2.72,0.00] [-3.22,0.00] BP3-opt [-1.87,0.00] BP4 [-3.87,0.00] [-4.04,0.00] [-2.01,-2.01] [-2.58,0.00] [-2.81,0.00] BP5 [-4.22,0.00] [-4.01,0.00] [0.00,0.00] [-3.00,0.00] [-4.43,0.00] BP5-opt [-1.54,0.00] BP6 [-4.06,0.00] [-4.03,0.00] [0.00,0.00] [-3.64,0.00] [-3.89,0.00] BP6-opt [-2.63,0.00] BP7 [-4:28,0.00] [-4.17,0.00] [-0.27,-0.27] [-3.06,0.00] [-5.29,0.00] BP7-opt [-1.21,0.00] BP8 [-2.77,0.00] [-3.41,0.00] [0.00,0.00] [-2.06,0.00] [-3.85,0.00] BP8-opt [-1.30,0.00] BP9 [-3.53,0.00] [-8.09,0.00] [0.00,0.00] [-2.07,0.00] [-4.43,0.00] BP9-opt [-2.63,0.00] B.Subtilis [-6.45,0.00] [-3.18,0.00] [-1.53,0.00] [-3.88,0.00] [-4.21,0.00] B.Subtilis-opt [-0.71,0.00] Bacteriophage_phi3626 [-3.10,0.00] [-2.77,0.00] [-2.35,0.00] [-1.24,0.00] [-3.64,0.00] Bacteriophage_phi3626-opt [-1.40,0.00] Buchnera_sp [-7.78,0.00] [-4.93,0.00] [-3.81,0.00] [-4.79,0.00] [-4.54,0.00] Buchnera_sp-opt [-4.98,0.00] Table 6.6: Self-hybridization minimum free energy (MFE) values for probe sets designed with the four control packages and our method. Here, we report only the MFEs for probe sets with high melting temperature values in the range [70,75]° C. Osprey produced sets with only one probe for all the artificial data sets. The *-opt sets have been designed with our method using the self-hybridization probe design criterion. 104 Chapter 7 Conclusions and Future Directions "We shall not cease from exploration And the end of all our exploring Will be to arrive where we started And know the place for the fi rst time. " T.S. Eliot In this thesis we have presented a comprehensive framework for designing DNA strands for various biotechnological applications. Our new methods for DNA strand design can be used to generate sets of DNA strands that satisfy combina-torial and thermodynamic constraints. First, we presented a new stochastic local search algorithm for DNA strand design, along with empirical results that characterize its performance and indicate its ability to find high-quality sets of DNA strands satisfying various combinations of combinatorial constraints. In particular, we observed that a simple random re-placement strategy improves the expected run-time of the algorithm when finding sets of strands that satisfy constraints CI, C2 and C5. Second, we presented an improved version of the simple SLS algorithm for DNA Strand Design proposed in Chapter 4, based on new neighbourhood genera-tion mechanisms, along with empirical results that characterize its performance. New insights on neighbourhood mechanisms have been described and we pre-sented evidence that by using hybrid randomised neighborhoods the performance of our SLS algorithm can be significantly improved. Third, we introduced a new approach for DNA strand design that integrates thermodynamic as well as combinatorial constraints and uses a stochastic local search strategy that is effective at finding sets that satisfy these constraints. Fur-thermore, we showed that our design constraints are well correlated with measures of quality of a set, such as pairwise sensitivity, specificity, and discrimination. Chapter 7. Conclusions and Future Directions 105 The current algorithm can easily be adapted to design non-interacting strands for universal microarrays or molecular beacons with specific capture-probe capa-bilities. Constraints on interactions between strands that are immobilized on a sur-face, as well as between those in solution, are supported by the algorithm and can be integrated into our model in a straightforward way. Our model for evaluating quality of sets can support analysis when the initial concentrations of words and complements are not equal; particularly in the context of iterative DNA computa-tions during which target concentrations vary, it would be useful to consider the quality of sets under uneven and varying strand and complement concentrations. Last, we investigated methods for microarray probe design. The challenge here was to design high-quality sets of probes that behave well with respect to a given set of target sequences (e.g., a genome). While the behaviour of a set of probes was previously tested only experimentally, here we use a thermodynamic measure that offers in silico insights over the level of cross-hybridization foreach individual probe. Future Work This thesis has demonstrated the effectiveness of heuristic methods for DNA strand design, has introduced novel and efficient stochastic local search algorithms that design high quality strands for DNA computing and molecular tagging applica-tions, has showed evidence that high quality probes for microarray applications can be obtained by carefully selecting specific probe design criteria and has presented in-silico evaluation measures for sets of DNA strands. Next, we present some natu-ral extensions of this work as well as other possible applications of stochastic local search techniques. Designing Larger Sets of DNA Strands The current work introduces three SLS algorithms, which are capable of designing sets of strands of limited size. In future work, we will examine the possibility to consider more complex stochastic local search strategies, which we expect to achieve improved performance that will likely lead to larger sets of strands. While for small number of constraints, our algorithms perform well from the time and space complexity point of view, an increase in the number of constraints influences dramatically the run-time and memory requirements of the algorithms. In a preliminary study, we were able to design sets of 1000 20-mers with a melting temperature range of 5°C and a free energy gap 5* of 5 kcal/mol in 2.7 CPU hours on our reference machine (averaged over 10 independent runs). To design even Chapter 7. Conclusions and Future Directions 106 larger sets, the memory requirements of our present algorithm would be reduced; this can be achieved by using optimized or partial representations of the constraints between pairs of DNA strands during the search process. Alternatively, by using a slightly different approach, based on the local optimization of subsets of strands, it should be possible to obtain sets of tens of thousands of strands. We have also performed preliminary tests on the scaling of our algorithm's performance with strand length. We obtained sets of 50 50-mers and 60-mers with free energy gaps 5* > 3.00 kcal/mol in 6 and 10 CPU minutes, respectively, on our reference ma-chine (averaged over 10 independent runs). Learning Design Principles from Empirical Studies It would be interesting to see if better theoretical design principles can be extracted from the sets of strands that we have now obtained empirically. In the construction of classical codes, theory and experiment are closely linked, and we expect that the same can be true for the construction of DNA codes. Search space analysis may provide more insight on the hidden mechanisms that make it difficult to computa-tionally solve DNA strand design problems and shed more light on the reasons for the efficiency of the hybrid neighborhoods studied here. In another direction of future work, one could use hybrid randomised neigh-bourhood mechanisms for DNA code design problems with different constraint combinations as well as for the design of binary codes. Satisfying More Realistic Constraints A relatively straightforward extension of our approach to design thermodynamically-constrained sets of DNA strands would be to additionally model interactions of type 14 (presented in Chapter 5), in which strands interact with each other. In this context, it would be particularly interesting to use a model that applies to the situ-ation of densely packed surfaces, as encountered in DNA microarray and surface-based DNA computing applications; unfortunately, we are currently unaware of such a model. Another direction for future work is to use the ensemble free energy, as given by the partition function, instead of the MFE calculations (provided by PairFold) in our approach. In this context, a reasonably efficient algorithm for calculating the partition function on pairs of strands is needed. To our best knowledge, no such algorithm is currently available, but in principle, as soon as this situation changes, it will be easy to integrate such a procedure into our software. It may be noted that our current thermodynamic model presented in Chapter 5 is limited to interactions between one strand and two complements, and makes Chapter 7. Conclusions and Future Directions 107 the simplifying assumption that the equilibria between any such triplet of strands are independent of each other. In future work, it would be interesting to extend this model to more complex equilibria, which may, for example, capture competi-tive hybridization within a set comprising several strands and complements. Ulti-mately, it would be very useful to devise a model that is accurate enough to allow quantitative predictions of the desired and undesired hybridizations within a given set of strands and complements, and to support the computational design of sets of DNA strands based on that model. We believe that the work presented in this thesis represents a first important step towards achieving this larger and more ambitious goal. Improved Microarray Design Methods One straightforward extension of our approach for microarray probe design is as follows: • First, we obtained a set of probes by using the selection approach from Sec-tion 6.3.6. • After we evaluate the quality of the current set of probes by computing the cross-hybridization C(pk) scores, we choose probes with poor quality from the output of step 3 and replace them with better probes. • Pools of new replacement probes can be selected uniformly at random from the output of the step 3. The newly selected probes will be evaluated using the C(pk) scores and the best one will be chosen as replacement. This process is iterated until either a better probe is found or we run out of time or probes in the pool. It can be implemented in a straightforward fashion using a stochastic search approach that allows us to select candidate probes using different simple heuristics. Such a simplistic approach is expected to outperform the meth-ods studied in Chapter 6 when cross-hybridization scores are used as an evaluation measure. Another direction that can be pursued in continuation of the work presented here is to verify the quality of the probes selected by our algorithm in actual DNA microarray experiments. Several questions may be answered: • How good is the thermodynamically-based evaluation procedure for probes introduced in Chapter 6? Does the Simplified PairFold provide a sufficiently accurate measure of thermodynamic interactions between DNA strands? Chapter 7. Conclusions and Future Directions 108 • What is the difference in probe stability and specificity for sets designed with various software packages? • What is the role played by target-target and probe-probe interactions in DNA microarray experiments? What could be gained by considering these inter-actions when designing sets of probes? • To which extent is it possible to achieve improvements by using thermody-namically based probe selection criteria instead of using string comparison-based criteria? • What is the optimal setting for experimental duplex melting temperatures and how is this decided in practice? • What is the optimal probe length for a given set of input DNA sequences? The design of DNA strands plays a major role in many biotechnological ap-plications, such as DNA computing, molecular bar tagging and DNA microarrays. While previous approaches addressed this problem to a certain extent, this the-sis proposes novel and highly efficient heuristic methods for DNA strand design, proposes new quality measures and performs a thorough analysis of the quality of strand sets. We hope that our approach opens the doors for further exploration of efficient computational techniques that can help in the design of large and high-quality sets of DNA strands for a multitude of applications in biotechnology and biology. 109 Bibliography [1] A D L E M A N , L . Molecular computation of solutions to combinatorial prob-lem. Science 266 (1994), 1021-1024. [2] A L B E R T S , B. , A N D M I A K E - L Y E , R. Unscrambling the puzzle of biological machines: the importance of the details. Cell 68, 3 (1992), 415-420. [3] A L T S C H U L , S. F., G I S H , W., M I L L E R , W., M Y E R S , E. W., A N D L I P M A N , D. J. Basic local alignment search tool. J Mol Biol 215, 3 (Oct 1990), 403-410. [4] A L T S C H U L , S. F., M A D D E N , T. L . , S C H A F F E R , A . A . , Z H A N G , J., Z H A N G , Z . , M I L L E R , W., A N D L I P M A N , D. J. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25, 17 (Sep 1997), 3389-3402. [5] A N D R O N E S C U , M . Master thesis: Algorithms for predicting the secondary structure of pairs and combinatorial sets of nucleic acid strands. University of British Columbia, Vancouver, Canada (November 2003). [6] A N D R O N E S C U , M . , A G U I R R E - H E R N A N D E Z , R., C O N D O N , A . , A N D HOOS, H . H . Rnasoft: A suite of RNA secondary structure prediction and design software tools. Nucleic Acids Res 37, 13 (Jul 2003), 3416-3422. [7] A N D R O N E S C U , M . , Z H A N G , Z . C , A N D C O N D O N , A . Secondary struc-ture prediction of interacting RNA molecules. J Mol Biol 345, 5 (Feb 2005), 987-1001. [8] A R I T A , M . , N I S H I K A W A , A. , H A G I Y A , M . , K O M I Y A , K . , G O U Z U , H . , A N D S A K A M O T O , K. Improving sequence design for DNA computing. In Proc. Genetic and Evolutionary Computation Conference (Las Vegas, Nevada, USA, 2000), pp. 875-882. [9] B E N - D O R , A. , K A R P , R., S C H W I K O W S K I , B. , A N D Y A K H I N I , Z . Univer-sal DNA tag systems: a combinatorial design scheme. Journal of Computa-tional Biology 7 (2000), 503-519. Bibliography 110 [10] B H A N O T , G. , L O U Z O U N , Y . , Z H U , J., A N D D E L I S I , C . The importance of thermodynamic equilibrium for high throughput gene expression arrays. Biophys. J. 84 (2003), 124-135. [11] BISHOP, J., B L A I R , S., A N D C H A G O V E T Z , M . A competitive kinetic model of nucleic acid surface hybridization in the presence of point mutants. Biophys. J. 90 (2006), 831-840. [12] B O G D A N O V A , G . T., B R O U W E R , A . E., K A P R A L O V , S. N . , A N D O S T E R G A R D , P. R. J. Error-correcting codes over an alphabet of four ele-ments. Designs, Codes, and Cryptography 23, 3 (2001), 333-342. [13] B O R T O L U S S I , L . , A N D S G A R R O , A. Constraint satisfaction problems over DNA strings. In Proceedings of 1st Workshop on Constraint Based Methods for Bioinformatics, WCB 2005 (2005), pp. 11-18. [14] B R A I C H , R . , C H E L Y A P O V , N . , J O H N S O N , C , R O T H E M U N D , P., A N D A D L E M A N , L . Solution of a 20-variable 3-SAT problem on a DNA com-puter. Science 296 (2002), 478^179. [15] B R A I C H , R . , J O H N S O N , C , R O T H E M U N D , P., H W A N G , D., C H E L Y A P O V , N . , A N D A D L E M A N , L . Solution of a satisfiability problem on a gel-based DNA computer. Lecture Notes in Computer Science 2054 (2001), 27^4-2. [16] B R E N N E R , S., A N D L E R N E R , R . A . Encoded combinatorial chemistry. Proc Natl Acad Sci USA 89, 12 (Jun 1992), 5381-5383. [17] B R E N N E R , S., W I L L I A M S , S., V E R M A A S , E., S T O R C K , T., M O O N , K . , M C C O L L U M , C , M A O , J., L U O , S., K I R C H N E R , J., A N D E L E T R , S. In vitro cloning of complex mixtures of DNA on microbeads: Physical sep-aration of differentially expressed cDNAs. Proc. Natl. Acad. Sci. USA 97 (2000), 1665-1670. [18] B R E S L A U E R , K . J., F R A N K , R . , B L O C K E R , H . , A N D M A R K Y , L . A . Pre-dicting DNA duplex stability from the base sequence. Proc Natl Acad Sci U S A 83,11 (Jun 1986), 3746-3750. [19] B R O U W E R , A . , S H E A R E R , J., S L O A N E , N . , A N D S M I T H , W. A new table of constant weight codes. IEEE Transactions on Information Theory 36, 6 (Nov 1990), 1334-1380. [20] B R O W E R , A . Table of general binary codes, h t t p : / / w w w . w i n . t u e . n l / ~aeb/ codes / b i n a r y - 1 . h tml , Last visited in October 2006. Bibliography 111 [21] B R O W E R , A . Table of general quaternary codes, h t t p : / / w w w . w i n . t u e . n l / ~ a e b / c o d e s / q u a t e r n a r y - l . h t m l , Last visited in Octo-ber 2006. [22] B U R N S , G., D A O U D , R . , A N D V A I G L , J. L A M : An Open Cluster Envi-ronment for MPI. In Proceedings of Supercomputing Symposium (1994), pp. 379-386. [23] C A N T O R , C , A N D S C H I M M E L , R Biophisiccd Chemistry. W.H.Freeman, New York, N Y , 1980. [24] C H E N , H. , A N D SHARP, B . M . Oliz, a suite of peri scripts that assist in the design of microarrays using 50mer oligonucleotides from the 3' untranslated region. BMC Bioinformatics 3 (Oct 2002), 27-27. [25] C H E N , L , K A L L E N B A C H , N . , A N D S E E M A N , N . A specific quadrilateral synthesized from DNA branched junctions. Journal of the American Chem-ical Society 111 (1989), 6402-6407. [26] C H E N , J., A N D S E E M A N , N . C . Synthesis from DNA of a molecule with the connectivity of a cube. NJ50 (Apr. 1991), 631-633. [27] C H O U , H . H . , H S I A , A . P., M O O N E Y , D. L . , A N D S C H N A B L E , P. S. Picky: oligo microarray design for large genomes. Bioinformatics 20, 17 (Nov 2004), 2893-2902. [28] C R O T H E R S , D., A N D Z I M M , B . Theory of the melting transition of syn-thetic polynucleotides: Evaluation of the stacking free energy. Journal of Molecular Biology 116 (1964), 1-9. [29] D E A T O N , R . , G A R Z O N , M . , M U R P H Y , R . , ROSE, L , F R A N C E S C H E T T I , D., A N D S T E V E N S , S. Genetic search of reliable encodings for DNA-based computation. Proc. First Annual Conference on Genetic Program-ming (1996), 9-15. [30] D E A T O N , R . , M U R P H Y , R . , G A R Z O N , M . , F R A N C E S C H E T T I , D., A N D S T E V E N S , S. Good encodings for DNA-based solutions to combinatorial problems. DIMACS Series in Discrete Mathematics and Theoreticcd Com-puter Science 44 (1999), 247-258. [31] D E V O E , H . , A N D TINOCO, L . The stability of helical polynucleotides: base contributions. Journal of Molecular Biology 4 (1962), 500-517. Bibliography 112 [32] EDDY, S. R. Profile hidden markov models. Bioinformatics 14, 9 (1998), 755-763. [33] E D M A N , C , R A Y M O N D , D., W U , D., T U , E., SOSNOWSKI , R., B U T L E R , W., N E R E N B E R G , M . , A N D H E L L E R , M . Electric field directed nucleic acid hybridization on microchips. Nucleic Acids Research 25 (1997), 4907-4914. [34] E R D O G A N , F., K I R C H N E R , R., M A N N , W., ROPERS, H . H . , A N D N U B E R , U . A . Detection of mitochondrial single nucleotide polymorphisms using a primer elongation reaction on oligonucleotide microarrays. Nucleic Acids Res 29, 7 (Apr 2001). [35] F A N , J., C H E N , X . , H A L U S H K A , M . , B E R N O , A . , H U A N G , X . , R Y D E R , T., L I P S H U T Z , R., L O C K H A R T , D., A N D C H A K R A V A R T I , A . Parallel geno-typing of human SNPs using generic high-density oligonucleotide tag ar-rays. Genome Research 10 (2000), 853-860. [36] F A U L H A M M E R , D., C U K R A S , A. , L I P T O N , R., A N D L A N D W E B E R , L . Molecular computation: RNA solutions to chess problems. Proc. Natl. Acad. Sci. USA 97 (2000), 1385-1389. [37] F E L D K A M P , U . , B A N Z H A F , W., A N D R A U H E , H . A DNA sequence com-piler. In Proceedings 6th DIMACS Workshop on DNA Based Computers, held at the University of Leiden, Leiden, The Netherlands, 13-17 June 2000, p. 253. [38] F O D O R , S., R E A D , L , P I R R U N G , M . , S T R Y E R , L . , L U , A . , A N D S O L A S , D. Light-directed, spatially addressable parallel chemical synthesis. Science 75(1991), 767-773. [39] F O R M A N , J., W A L T O N , I., S T E R N , D., R A V A , R., A N D T R U L S O N , M . Thermodynamics of duplex formation and mismatch discrimination on pho-tolithographically synthesized oligonucleotide arrays. ACS SYMPOSIUM SERIES 682 (1998), 206-228. [40] F R U T O S , A. , L i u , Q., T H I E L , A . , S A N N E R , A . , C O N D O N , A . , S M I T H , L. , A N D C O R N , R. Demonstration of a word design strategy for DNA computing on surfaces. Nucleic Acids Research 25 (1997), 4748-4757. [41] G A B O R I T , P., A N D K I N G , O. D. Linear constructions for DNA codes. Theor. Comput. Sci. 334, 1-3 (2005), 99-113. Bibliography 113 [42] G A M A L , A . A . E., H E M A C H A N D R A , L . A . , S H P E R L I N G , I., A N D W E I , V. K. Using simulated annealing to design good codes. IEEE Trans. Inf. Theor. 33, 1 (1987), 116-123. [43] G A R T N E R , Z . , TSE , B. , G R U B I N A , R . , D O Y O N , J., S N Y D E R , T., A N D L I U , D . DNA-templated organic synthesis and selection of a library of macrocycles. Science 305 (2004), 1601-1605. [44] G A R Z O N , M . , D E A T O N , R . , N E A T H E R Y , P., F R A N C E S C H E T T I , D . , , A N D M U R P H Y , R. A new metric for DNA computing. In Proc. 2nd Genetic Programming Conference (Morgan Kaufman, 1997), pp. 472^178. [45] G A R Z O N , M . , D E A T O N , R . , ROSE, J., A N D F R A N C E S C H E T T I , D. Soft molecular computing. DIMACS Series in Discrete Mathematics and Theo-retical Computer Science 54 (2000), 91-100. [46] G O R D O N , P. M . , A N D S E N S E N , C . W. Osprey: a comprehensive tool employing novel methods for the design of oligonucleotides for DNA se-quencing and microarrays. Nucleic Acids Res 32, 17 (2004). [47] G R A V E S , D. Powerful tools for genetic analysis come of age. Trends Biotechnol. 17, 3 (1999), 127-134. [48] G U N D E R S O N , K. L . , K R U G L Y A K , S., G R A I G E , M . S., G A R C I A , F., K E R -M A N I , B. G . , Z H A O , C . , C H E , D., D I C K I N S O N , T., W I C K H A M , E., B I E R L E , J., D O U C E T , D., M I L E W S K I , M . , Y A N G , R . , S I E G M U N D , C , H A A S , J., Z H O U , L . , O L I P H A N T , A . , F A N , J. B. , B A R N A R D , S., A N D C H E E , M . S. Decoding randomly ordered DNA arrays. Genome Res 14, 5 (May 2004), 870-877. [49] G U N D E R S O N , K. L . , S T E E M E R S , F. J., L E E , G . , M E N D O Z A , L . G . , A N D C H E E , M . S. A genome-wide scalable SNP genotyping assay using mi-croarray technology. Nat Genet 37, 5 (May 2005), 549-554. [50] H A L P E R I N , A . , BUHOT, A . , A N D Z H U L I N A , E. Sensitivity, specificity, and the hybridization isotherms of DNA chips. Biophys. J. 86 (2004), 718-730. [51] H A L P E R I N , A. , BUHOT, A. , A N D Z H U L I N A , E. Brush effects on DNA chips: Thermodynamics, kinetics, and design guideliness. Biophys. J. 89 (2005), 796-811. [52] H A L P I N , D., A N D H A R B U R Y , P. DNA display I. sequence-encoded routing of DNA populations. Public Library of Science - Biology 2 (2004), el73. Bibliography 114 [53] H A R T E M I N K , A . , G l F F O R D , D., A N D K H O D O R , J . Automated constraint-based nucleotide sequence selection for DNA computation. Biosystems 52 (1999), 227-235. [54] H O N K A L A , I., A N D O S T E R G A R D , P. Applications in Code Design. In Lo-cal Search In Combinatorial Optimization (E. Aarts and J .K. Lenstra, eds.), Wiley-Interscience Series in Discrete Mathematics and Optimization, 1997. [55] HOOS, H. PhD Thesis: Stochastic Local Search - Methods, Models, Appli-cations. Infix-Verlag, Sankt Augustin, Germany, 1999. [56] Hoos, H., A N D S T U T Z L E , T. Evaluating Las Vegas algorithms - pitfalls and remedies. In Proceedings of the 14th Annual Conference on Uncertainty in Artificial Intelligence (UAI-98) (San Francisco, CA, 1998), Morgan Kauf-mann, pp. 238-24. [57] H O O S , H., A N D S T U T Z L E , T. Stochastic Local Search-Foundations and Applications. Morgan Kaufmann, San Francisco, CA., 2004. [58] Hoos, H. H., A N D BOUTILIER, C . Solving combinatorial auctions using stochastic local search. In AAAI/IAA1 (2000), pp. 22-29. [59] H O R V A T H , M . P., A N D S C H U L T Z , S. C . DNA G-quartets in a 1.86 A resolution structure of an Oxytricha nova telomeric protein-DNA complex. J Mol Biol 310, 2 (Jul 2001), 367-377. [60] HUSSINI , S., K A R I , L . , A N D KONSTANTINIDIS , S. Coding properties of DNA languages. Theor. Comput. Sci. 290, 3 (2003), 1557-1579. [61] J O N O S K A , N . , M A H A L I N G A M , K . , A N D C H E N , J . Involution codes: with application to DNA coded languages. Natural Computing: an International Journal 4, 2 (2005), 141-162. [62] K A N E , M . , JATKOE, T., STUMPF, C , L U , J., T H O M A S , J., A N D M A D O R E , S. Assessment of the sensitivity and specificity of oligonu-cleotide (50mer) microarrays. Nucleic Acids Research 28 (2000), 4552-4557. [63] K A R I , L . , K I T T O , R . , A N D T H I E R R I N , G. Codes, involutions, and DNA encodings. In Formal and Natural Computing - Essays Dedicated to Grze-gorz Rozenberg [on occasion of his 60th birthday, March 14, 2002] (Lon-don, UK, 2002), Springer-Verlag, pp. 376-393. Bibliography 115 [64] K A R I , L . , KONSTANTINIDIS , S., A N D SOSIK, P. On properties of bond-free DNA languages. Theor. Comput. Sci. 334, 1-3 (2005), 131-159. [65] K E P H A R T , D., A N D L E F E V R E , J. CODEGEN: The generation and test-ing of DNA code words. In Proceedings of the 2004 IEEE Congress on Evolutionary Computation (2004), pp. 1135-1142. [66] K L E I N , Y . , L Y T S I N , S., A N D V A R D Y , A . Two new bounds on the size of binary codes with a minimum distance of three. Codes and Cryptography 6 (1995), 219-227. [67] L E M I E U X , B . , A H A R O N I , A . , A N D S C H E N A , M . Overview of DNA chip technology. Molecular Breeding 4 (1998), 277-289. [68] L i , F., A N D S T O R M O , G . Selection of optimal DNA oligos for gene ex-pression arrays. Bioinformatics 17 (2001), 1067-1076. [69] L i , M . , L E E , H.-J., C O N D O N , A. , A N D C O R N , R. DNA word design strategy for creating sets of non-interacting oligonucleotides for DNA mi-croarrays. Langmuir 18 (2002), 805-812. [70] L l P T O N , R. J. Speeding up computations via molecular biology, internet version only, 1994. [71] L i u , Q., FRUTOS, A . , W A N G , L . , C O N D O N , A . , C O R N , R . , A N D S M I T H , L . DNA computations on surfaces. Nature 403 (2000), 175-179. [72] L O C K H A R T , D., D O N G , H . , B Y R N E , M . , F O L L E T T I E , M . , G A L L O , M . , C H E E , M . , M I T T M A N N , M . , W A N G , C , K O B A Y A S H I , M . , H O R T O N , PL, A N D B R O W N , E. Expression monitoring by hybridization to high-density oligonucleotide arrays. Nat. biotechnol. 14 (1996), 1675-1680. [73] M A C W l L L l A M S , F., A N D S L O A N E , N . The Theory of Error-Correcting Codes. Elsevier/North-Holland, Amsterdam, Holland, 1977. [74] M A N B E R , U . , A N D M Y E R S , G . Suffix arrays: a new method for on-line string searches. In SODA '90: Proceedings of the first annual ACM-SIAM symposium on Discrete algorithms (Philadelphia, PA, USA, 1990), Society for Industrial and Applied Mathematics, pp. 319-327. [75] M A P L E S O F T . Maple 10. h t t p : / /www.maplesof t . com/,Last visited in October 2006. [76] M A R A T H E , A. , C O N D O N , A . , A N D C O R N , R. On combinatorial DNA word design. Journal of Computational Biology 8 (2001), 201-220. Bibliography 116 [77] M A S K O S , U . , A N D S O U T H E R N , E. M . Oligonucleotide hybridizations on glass supports: a novel linker for oligonucleotide synthesis and hybridiza-tion properties of oligonucleotides synthesised in situ. Nucleic Acids Res 20, 7 (Apr 1992), 1679-1684. [78] M C A L L E S T E R , D., S E L M A N , B . , A N D K A U T Z , H . Evidence for invariants in local search. In Proceedings of the Fourteenth National Conference on Artificial Intelligence (AAAI'97) (Providence, Rhode Island, 1997), pp. 321— 326. [79] M I N T O N , S., JOHNSTON, M . D., PHILIPS, A . B . , A N D L A I R D , P. Solving large-scale constraint satisfaction and scheduling problems using a heuristic repair method. In Proceedings of the 8th National Conference on Artifi-cial Intelligence, AAAI Press/The MIT Press (Menlo Park, CA, USA, 1990), pp. 17-24. [80] M I N T O N , S., JOHNSTON, M . D., PHILIPS, A. B . , A N D L A I R D , P. Min-imizing conflicts: A heuristic repair method for constraint satisfaction and scheduling problems. Artificial Intelligence 58, 1-3 (1992), 161-205. [81] N I E L S E N , H . , W E R N E R S S O N , R., A N D K N U D S E N , S. Design of oligonu-cleotides for microarrays and perspectives for design of multi-transcriptome arrays. Nucleic Acids Research 31 (2003), 3491-3496. [82] N O R D B E R G , E. K . Yoda: selecting signature oligonucleotides. Bioinfor-matics 21, 8 (Apr 2005), 1365-1370. [83] P E N C H O V S K Y , R., A N D A C K E R M A N N , J. DNA library design for molecu-lar computation. Journal of Computational Biology 10 (2003), 215-229. [84] P E V Z N E R , P. A . Computational Molecular Biology: An Algorithmic Ap-proach (Computational Molecular Biology). The MIT Press, August 2000. [85] P l R R U N G , M . How to make a DNA chip. Angew. Chem. Int. Ed. 41 (2002), 1276-1289. [86] R A H M A N N , S. Fast large scale oligonucleotide selection using the longest common factor approach. J Bioinform Comput Biol 1, 2 (Jul 2003), 343-361. [87] REIF , J. H . , L A B E A N , T. H . , A N D S E E M A N , N . C. Challenges and appli-cations for self-assembled DNA nanostructures. Lecture Notes in Computer Science 2054 (2001), 173. Bibliography 117 [88] R E L O O I O , A . , S C H W A G E R , C , R I C H T E R , A . , A N S O R A G E , W. , A N D V A L C A R C E L , J. Optimization of oligonucleotide-based DNA microarrays. Nucleic Acids Research 30 (2002). [89] R E Y M O N D , N . , C H A R L E S , H . , D U R E T , L . , C A L E V R O , F., B E S L O N , G., A N D F A Y A R D , J. M . Roso: optimizing oligonucleotide probes for microar-rays. Bioinformatics 20, 2 (Jan 2004), 271-273. [90] R I M O U R , S., H I L L , D., M I L I T O N , C , A N D P E Y R E T , P. GoArrays: highly dynamic and efficient microarray probe design. Bioinformatics 21, 7 (Apr 2005), 1094-1103. [91] ROSE, J., D E A T O N , R., F R A N C E S C H E T T I , D., G A R Z O N , M . , A N D S T E V E N S , S. A statistical mechanical treatment of error in the annealing biostep of DNA computation. In Proceedings of the Genetic and Evolution-ary Computation Conference (San Francisco, 1999), vol. 2, Morgan Kauf-mann, pp. 1829-1834. [92] R O T H , M . E., F E N G , L . , M C C O N N E L L , K. J., S C H A F F E R , P. J., G U E R R A , C . E. , AFFOURTIT, J. P., PIPER, K. R., G U C C I O N E , L . , H A R I H A R A N , J., F O R D , M . J., P O W E L L , S. W., K R I S H N A S W A M Y , H . , L A N E , J., G U C -CIONE, L . , INTRIERI, G., M E R K E L , J. S., PERBOST, C , V A L E R I O , A . , Z O L L A , B . , G R A H A M , C D . , H N A T H , J., M I C H A E L S O N , C , W A N G , R., Y I N G , B . , H A L L I N G , C , P A R M A N , C . E. , R A H A , D., O R R , B . , JE-D R Z K I E W I C Z , B . , L l A O , J., T E V E L E V , A . , M A T T E S S I C H , M . J., K R A N Z , D. M . , L A C E Y , M . , K A U F M A N , J. C , K I M , J., L A T I M E R , D. R., A N D L l Z A R D l , P. M . Expression profiling using a hexamer-based universal mi-croarray. Nat Biotechnol 22, 4 (Apr 2004), 418-426. [93] R O U I L L A R D , J. M . , H E R B E R T , C . J., A N D Z U K E R , M . Oligoarray: genome-scale oligonucleotide design for microarrays. Bioinformatics 18, 3 (Mar 2002), 486^187. [94] R O U I L L A R D , J. M . , Z U K E R , M . , A N D G U L A R I , E. Oligoarray 2.0: design of oligonucleotide probes for DNA microarrays using a thermodynamic ap-proach. Nucleic Acids Res 31, 12 (Jun 2003), 3057-3062. [95] S A N T A L U C I A , J. A unified view of polymer, dumbbell, and oligonucleotide DNA nearest-neighbor thermodynamics. Proc Natl Acad Sci U S A 95, 4 (Feb 1998), 1460-1465. [96] S A N T A L U C I A , J., A N D H I C K S , D. The thermodynamics of DNA structural motifs. Annu. Rev. Biophys. Biomol. Struct. 33 (2004), 415-440. Bibliography 118 [97] SHCHEPINOV, M . , C A S E - G R E E N , S., A N D S O U T H E R N , E . Steric factors influencing hybridisation of nucleic acids to oligonucleotide arrays. Nucleic Acids Research 25 (1997), 1155-1161. [98] S H O R T R E E D , M . R., C H A N G , S. B . , H O N G , D., PHILLIPS , M . , C A M -PION, B . , T U L P A N , D. C , A N D R O N E S C U , M . , C O N D O N , A . , H o o s , H . H . , A N D S M I T H , L . M . A thermodynamic approach to designing structure-free combinatorial DNA word sets. Nucleic Acids Res 33, 15 (2005), 4965-4977. [99] S L O A N E , N . Web page with code bounds, h t t p : //www. r e s e a r c h . a t t . com/~nj as, Last visited in October 2006. [100] S M I T H , F. W., A N D F E I G O N , J . Quadruplex structure of Oxytricha telom-eric DNA oligonucleotides. Nature 356, 6365 (Mar 1992), 164-168. [101] S M I T H , K. Universal microarrays: An algorithmic approach, h t t p : / / www . c s . m c ' g i l l . c a / ~ k a l e i g h / c o m p b i o / m i c r o a r r a y / microarray_summary% .pdf,Last visited in October 2006. [102] S M I T H , S., C U I , Y . , A N D B U S T A M A N T E , C . Overstreching B-DNA: the elastic response of individual double-stranded and single-stranded DNA molecules. Science 271 (1996), 795-799. [103] S M I T H , T. F., A N D W A T E R M A N , M . S. Identification of common molecu-lar subsequences. J Mol Biol 147, 1 (Mar 1981), 195-197. [104] S O U T H E R N , E . , M I R , K. , A N D SHCHEPINOV, M . Molecular interactions on microarrays. Nat Genet. 21 (1999), 5-9. [105] S T E N B E R G , J., N I L S S O N , M . , A N D L A N D E G R E N , U . ProbeMaker: an extensible framework for design of sets of oligonucleotide probes. BMC Bioinformatics 6 (2005), 229-229. [106] S T R I C K , T., DESSINGES, M . , C H A R V I N , G . , D E K K E R , N . , A L L E M A N D , J., B E N S I M O N , D., A N D C R O Q U E T T E , V . Streching of macromolecules and proteins. Rep. Prog. Phys. 66 (2003), 1^15. [107] S U G N E T , C . Microarray frequently asked questions, h t t p : / /www. soe . u c s c . e d u / ~ s u g n e t / m i c r o a r r a y / m i c r o a r r a y _ F A Q . h t m l , Last visited in October 2006. [108] TAO, S. C , L I , Y , L I U , Y . H . , M A , X . M . , A N D C H E N G , J . Room-temperature hybridization of target DNA with microarrays in concentrated Bibliography 119 solutions of guanidine thiocyanate. Biotechniques 34, 6 (Jun 2003), 1260-1262. [109] TONG, B. Y . , AND L E U N G , M . L. Correlation between percentage guanine-cytosine content and melting temperature of deoxyribonucleic acid. Biopolymers 16, 6 (Jun 1977), 1223-1231. [110] T U L P A N , D. Web site: A set with 128 DNA strands, which satisfies the Hamming distance, reverse complement and GC content constraint, h t t p : //www. c s .ubc . c a / ~ d c t u l p a n / h t m l / d w d . h t m l , Last visited in October 2006. [ I l l ] T U L P A N , D., A N D R O N E S C U , M . , S HORTREED , M . , C H A N G , S., C O N -D O N , A . , Hoos, H . , , A N D S M I T H , L . Thermodynarnically based DNA strand design. Nucleic Acids Research 33 (2005), 4951^-964. [112] T U L P A N , D., Hoos, H . , A N D C O N D O N , A . Stochastic local search al-gorithms for DNA word design. Lecture Notes in Computer Science 2568 (2003), 229-241. [113] T U L P A N , D. C , A N D Hoos, H . H. Hybrid randomised neighbourhoods improve stochastic local search for DNA code design. Lecture Notes in Computer Science 2671 (January 2003), 418-433. [114] T U R N E R , D. Chap. 8. Conformational Changes in Nucleic Acids: Struc-tures, Properties and Functions. University Science Books, Sausalito, CA., 2000. [115] U N I V E R S I T Y C O L L E G E L O N D O N , D E P A R T M E N T OF M O L E C U L A R B I O L -OGY , U . Microarray tutorial, h t t p : / / w w w . u c l . a c . u k / o n c o l o g y / M i c r o C o r e / H T M L _ r e s o u r c e / t u t _ f r a m e s e t . htm, Last visited in October 2006. [116] W A N G , X . , A N D SEED , B. Selection of oligonucleotide probes for protein coding sequences. Bioinformatics 19, 7 (May 2003), 796-802. [117] W I N F R E E , E., L i u , F., W E N Z L E R , L . A. , A N D S E E M A N , N . C. Design and self-assembly of two-dimensional DNA crystals. N394 (Aug. 1998), 539-544. [118] Y I L M A Z , L . S., A N D N O G U E R A , D. R. Mechanistic approach to the prob-lem of hybridization efficiency in fluorescent in situ hybridization. Appl Environ Microbiol 70, 12 (Dec 2004), 7126-7139. Bibliography 120 [119] Y O S H I D A , H . , A N D S U Y A M A , A . Solution to 3-SAT by breadth first search. DIM ACS Series in Discrete Mathematics and Theoretical Computer Science 54 (2000), 9-22. [120] Y U R K E , B . , T U R B E R F I E L D , A . J., M I L L S , A . P., S I M M E L , F. C , A N D N E U M A N N , J. L. A DNA-fuelled molecular machine made of DNA. Nature 406, 6796 (August-2000), 605-608. [121] Z H A N G , B. , A N D S H I N , S. Code optimization for DNA computing of max-imal cliques. Advances in Soft Computing - Engineering Design and Manu-facturing 1998 (1998), 735-742. [122] Z H A N G , B.-T., A N D S H I N , S.-Y. Molecular algorithms for efficient and reliable DNA computing. In Genetic Programming 1998: Proceedings of the Third Annual Conference (University of Wisconsin, Madison, Wiscon-sin, USA, 22-25 1998), J. R. Koza, W. Banzhaf, K. Chellapilla, K. Deb, M . Dorigo, D. B. Fogel, M . H. Garzon, D. E. Goldberg, H. Iba, and R. Ri-olo, Eds., Morgan Kaufmann, pp. 735-744. 121 Appendix A Time Complexity for Determining C(pfc) Scores Given N sequences of equal length L, probe length I and Simplified Pairfold com-putation time Ts Pair fold, the complexity of computing C(pk) scores following the Formula 6.1 provided in Chapter 6 can be obtained as follows. The total number of possible probes that can be selected from the set of N sequences is: number of probes = N * (L — I + 1) (A.l) The time required to compute the C(jpk) score for one probe is: time(one probe) — (TV - 1) * (L — I + 1) * Tspairfoid (A- 2) The time required to compute the C(Pk) score for all probes is: time(all probes) = N * (N - 1) * (L - I + l ) 2 * TSpairfoid (A.3) Thus the asymptotic complexity for computing the C(pk) score for all probes, when L » lis Q(N2 * L2 * TSpairfoid)- Note that Tspairfoid depends quadrati-cally on the length I of the sequences involved in the computation of MFE. 122 Appendix B DNA Strand Sets TAGCAGCT TAGCTCGA ATCGAGCT ATCGTCGA A A G G T G C A A A G G A C G T TTCCTGCA 1TCCACGT TACGTCCT TACGAGGA ATGCTCCT ATGCAGGA TTGGACCA AACCTGGT TTGGTGGT A C C A A A C C TCGAATCG TCGATAGC AGCTATCG AGCTTAGC ACGTTTCC ACGTAAGG TGCATTCC T G C A A A G G TCCTTACG TCCTATGC AGGATACG AGGAATGC TGGTAACC ACCATTGG ' TGGTTTGG C C A A C A A C GCTACTAG GCTAGATC CGATCTAG CGATGATC CCTTGTAC CCTTCATG GGAAGTAC GGAACATG G C A T G A A G GCATCTTC CGTAGAAG CGTACTTC GGTTCAAC CCAAGTTG GGTTGTTG C A A C C C A A GATCCGAT GATCGCTA CTAGCGAT CTAGGCTA CATGGGAA CATGCCTT GTACGGAA GTACCCTT GAAGGCAT GAAGCGTA CTTCGCAT CTTCCGTA GTTGCCAA CAACGGTT GTTGGGTT A C A C A C A C TCTCAGAG TCTCTCTC A G A G A G A G AGAGTCTC ACTGTGAC ACTGACTG TGACTGAC TGACACTG TCAGTCAG TCAGAGTC AGTCTCAG AGTCAGTC TGTGACAC ACACTGTG TGTGTGTG A A A A C C C C TATACGCG TATAGCGC ATATCGCG ATATGCGC AATTGGCC AATTCCGG TTAAGGCC TTAACCGG TAATGCCG TAATCGGC ATTAGCCG ATTACGGC TTTTCCCC A A A A G G G G T T r r G G G G C A C A A G A G GAGTAGTG GTGTTCTC GTCTACAG CAGATGTC CAGTTCAG ACTGGTGT A G T C G A G A AGTCCTCT A G A G C T G A TGACGTGT T C A G G A G A GTCAAGTC GTGATGAG TGTGGTCA TGTGCAGT A A C C A C C A Table B . l : A set of 112 DNA code words of length 8, satisfying the C I , C2 and C5 constraints. This set represents an improvement of a previous set containing 108 DNA words obtained by Frutos et al. in 1997 [40]. Appendix B. DNA Strand Sets 123 GAAGTACG GTACTTCC TCGATAGC AGGAGCAT GAAAGGCA AACTGCCA GTCCCTTT CACATCAG CCAATGGT TTACCGTG GAGAACTG GCAGTGTA CAACCAAG ATATGCGC TACCCGAT TATACGCG GGATTAGC AATCACCG TACGAGGA CAGTAGAG TACGTCCT TAGCTrCG TTGGCAAC TTGTCCCA CCCCAAAA AACCGTTG AGCGAGAT CGGAATGT TCTGCTCA GGACAGAA AAGGCTTC GAGCTAGA ACTGAGTC CAACACGA TGGTTCGT CTGCCATA AGCTl'GGA TGATGGTC ACGTCCTT ATTCCCTC GCTTCGAT GACTACAC AGTAAGGG ACACACAC ACCGTCAA ' CACCTTGT CCACTTTG CACACACA GCATCTTC GTTTCACC CTGGTAGT TCAGACTG TCACGACA TCGGTGAT AGTCCACA GGGGTTTT TGCAGGAA TCGCTCTA CGAGATAG AAGCCCAA TCCTAAGG GTAGGATC TTGGATGG ATCGCTAG AGACGAGT TGAAACGC CAATGACC GCCACTAA GCATGAAG AGGGTAAG CCTCTCAT TAACAGCC CGACAATC TAGTGGCT CCTTCATG TGGACTAG TTCCGAAG CCGGATTA CACAAGTC CCTTGTAC TGACCTCT TATCTCGC CTGAACAC TGCATACG AACCTACC GATAGAGC GTACGCAT AAGGTGCA TCAACGAC AGTCTGAC CATGCGTA GGGTCAAA GAGATGAC AAACTGGG ATTAGGCC TGTTCCAC ACCCTGTT AGCAGATC AAGTCACG AGTTGCTG TCTCCAGT TAGCGATC ATGCAGGA GGTTACCA TCCGATAC CCAGTAAC AATTCGGC CTCTTGAC GGAATCCT CGTAGAAG GGAACATG CGATAGCT CTCTGAGA ACGCAATG GTTGAGGT CCAAAACG CTGAGTCA TTAAGCCG GGCAAAGA TCAGGTGT ATrGTCGG CGCTCTAT TTCCACGT GAGACTCT GCTGTTAG CTCGATCT ACGTATGC AGTGCTGT Table B.2: A set of 128 DNA code words of length 8, satisfying the CI, C2 and C5 constraints. This set represents the best improvement over Frutos et al. set obtained in 1997 [40]. 124 Append ix C Parameter Settings for Microarray Probe Design Programs Parameter Oligopicker Osprey ROSO YODA Ours I'rube length 25 25 25 25 25 [K+] 1 M 1 M 1 M [Na+] 1 M ! M 1 M 'larget tone. I O - 7 M 1 0 - 7 M 1 0 ~ 7 M Melting temperature \"C] 5 ° C nmge [70,751 170,751 5 ° C range 170,751 Hairpin FE [kcal/mul) auto 0 Humoduplex FE [kcal/mul] -6,00 Selection area whole sequence whole sequence Hybridization temperature (without denaturating agents) 6 5 ° C Allow polyX (X=4) yes yes yes yes Number of probes per sequence 1 1 1 1 1 Decree ui' cross-reactivity 15 NLA ST score threshold 30 Max % identities 80% 80% CC percent range 12 Maximum consecutive matches 15 Continued on next page Appendix C. Parameter Settings for Microarray Probe Design Programs 125 Table C . l - con t inued f r o m prev ious page Parameter Oligopicker Osprey ROSO YODA Ours Dimer window 15 Dimer stringency 13 Hairpin window 6 Hairpin stringency 6 Hairpin min. gap 3 Hairpin max. gap 6 Prohibited sequences no Min. margin btw. target T „ , and closest secondary match 15 Maximum Vii of binding auto Table C . l : Summary of software parameter settings for Bacterio-phage_phi3626 and Buchnera_sp.
- Library Home /
- Search Collections /
- Open Collections /
- Browse Collections /
- UBC Theses and Dissertations /
- Effective heuristic methods for DNA strand design
Open Collections
UBC Theses and Dissertations
Featured Collection
UBC Theses and Dissertations
Effective heuristic methods for DNA strand design Tulpan, Dan C. 2006
pdf
Page Metadata
Item Metadata
Title | Effective heuristic methods for DNA strand design |
Creator |
Tulpan, Dan C. |
Date Issued | 2006 |
Description | Sets of DNA strands that satisfy combinatorial and thermodynamic properties play an important role in various approaches to biomolecular computations, nano structure design, molecular tagging, and DNA microarrays. The problem of designing such sets of DNA strands appears to be computationally hard. This thesis introduces new algorithms for design of DNA strand sets that satisfy any of several combinatorial and thermodynamic constraints, which aim to maximize desired hybridization between strands and their complements, while minimizing undesired cross-hybridizations. To heuristically search for good strand sets for bio-computing applications, our algorithms use a conflict-driven stochastic local search approach, which is known to be effective in solving comparable search problems. We describe new and improved thermodynamic measures of the quality of strand sets. With respect to these measures of quality, our algorithms consistently find, within reasonable time, sets that are significantly better than previously published sets in the literature. We also present a detailed analysis and selection of heuristics for improving the quality of DNA strand selection criteria with direct applications in microarray probe design. |
Genre |
Thesis/Dissertation |
Type |
Text |
Language | eng |
Date Available | 2010-01-18 |
Provider | Vancouver : University of British Columbia Library |
Rights | For non-commercial purposes only, such as research, private study and education. Additional conditions apply, see Terms of Use https://open.library.ubc.ca/terms_of_use. |
IsShownAt | 10.14288/1.0051756 |
URI | http://hdl.handle.net/2429/18592 |
Degree |
Doctor of Philosophy - PhD |
Program |
Computer Science |
Affiliation |
Science, Faculty of Computer Science, Department of |
Degree Grantor | University of British Columbia |
Campus |
UBCV |
Scholarly Level | Graduate |
AggregatedSourceRepository | DSpace |
Download
- Media
- 831-ubc_2006-201121.pdf [ 8.53MB ]
- Metadata
- JSON: 831-1.0051756.json
- JSON-LD: 831-1.0051756-ld.json
- RDF/XML (Pretty): 831-1.0051756-rdf.xml
- RDF/JSON: 831-1.0051756-rdf.json
- Turtle: 831-1.0051756-turtle.txt
- N-Triples: 831-1.0051756-rdf-ntriples.txt
- Original Record: 831-1.0051756-source.json
- Full Text
- 831-1.0051756-fulltext.txt
- Citation
- 831-1.0051756.ris
Full Text
Cite
Citation Scheme:
Usage Statistics
Share
Embed
Customize your widget with the following options, then copy and paste the code below into the HTML
of your page to embed this item in your website.
<div id="ubcOpenCollectionsWidgetDisplay">
<script id="ubcOpenCollectionsWidget"
src="{[{embed.src}]}"
data-item="{[{embed.item}]}"
data-collection="{[{embed.collection}]}"
data-metadata="{[{embed.showMetadata}]}"
data-width="{[{embed.width}]}"
async >
</script>
</div>
Our image viewer uses the IIIF 2.0 standard.
To load this item in other compatible viewers, use this url:
http://iiif.library.ubc.ca/presentation/dsp.831.1-0051756/manifest