Generalized distance and applications in protein folding

by

Ali Reza Mohazab

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY in The Faculty of Graduate Studies (Physics)

THE UNIVERSITY OF BRITISH COLUMBIA (Vancouver)

January 2013

© Ali Reza Mohazab 2012

Abstract

The Euclidean distance, D, between two points is generalized to the distance between strings or polymers. The problem is of great mathematical beauty and very rich in structure even for the simplest of cases. The necessary and sufficient conditions for finding minimal distance transformations are presented. Locally minimal solutions for one-link and two-link chains are discussed, and the large-N limit of a polymer is studied. Applications of D to protein folding and structural alignment are explored, in particular for finding minimal folding pathways. Non-crossing constraints and the resulting untangling moves in folding pathways are discussed as well. It is observed that these extra untangling moves constitute a small fraction of the total movement. The resulting extra distance from untangling movements (Dnx) is used to distinguish different protein classes, e.g. knotted proteins from unknotted proteins. By studying the ensembles of untangling moves, dominant folding pathways are constructed for three proteins, one of which is a knotted protein. Finally, applications of D and related metrics to protein folding rate prediction are discussed. It is seen that distance metrics are good at predicting the folding rates of 3-state folders.

Preface

The content of this thesis relies on five manuscripts, to each of which a chapter is dedicated. A background chapter and a conclusion are added to those five chapters. Three of the five manuscripts are already published and the remaining two will be published shortly.
I am the first author of all of those manuscripts and the bulk of the research in all of them was conducted by me, with guidance from my supervisor Dr. Steven S. Plotkin (SSP). All the figures and tables were generated by me (ARM) except as noted below.

Chapter 2 is based on (Mohazab, AR and SS Plotkin. JPhys.CM. 2008) [89].¹ The research was conducted by ARM and the text of the paper was written by SSP. All the figures, tables and results were generated by ARM except figures 2.14 and 2.15, which were generated by SSP. The material pertaining to these two figures was also largely developed by SSP.

¹ Some of the material of section 2.5 is taken from the introduction section of [91].

Chapter 3 is based on (Mohazab, AR and SS Plotkin. Biophys.J. 2008) [90]. The research was conducted by ARM and the text was written by SSP. All the figures and tables were generated by ARM, except figures 3.1, 3.2, 3.3a, 3.6a, and E.1, which were generated by SSP. Figure 3.3b was generated jointly.

Chapter 4 is based on (Mohazab, AR and SS Plotkin IJQC 2009) [91]. The research was conducted by ARM and the methods section of the paper was written by ARM as well. The introduction of the paper was written by SSP and the conclusion was written jointly. All the figures and tables were generated by ARM. SSP was responsible for the final editing.

Chapter 5 is based on (Mohazab AR, and SS Plotkin, PLoS Comput. Biol. 2012) [87]. The research was conducted by ARM. The methods section was written by ARM as well. The introduction of the paper was written by SSP, and the conclusion and results sections, in their final form, were written by SSP as well. All the figures and tables were generated by ARM. SSP was responsible for the final editing.

Chapter 6 is based on (Mohazab, AR and SS Plotkin, unpublished, 2012) [88]. The research was conducted by ARM. The methods and results sections and most of the conclusion section were written by ARM as well.
SSP wrote the introduction, edited the sections written by ARM, and elaborated on the conclusion. SSP also suggested that additional material be added to the paper, inspired by work done in [36]. The work for this additional material would be conducted by a third researcher, Atanu Das (AD), who would be the second author of the paper. None of AD's contributions are reflected in this thesis. All the tables and material presented in chapter 6 are the sole work of ARM.

The appendices of this thesis are derived from the appendices and the supplementary material of the aforementioned papers, each of which is referred to within the body of the relevant chapter. SSP is responsible for the text of Appendix A; the material in A.3 is also entirely his work. Appendices B, C, and D are the exclusive work of ARM. The material in Appendix E is the joint work of ARM and SSP, with text written by SSP. Figures E.1 and E.2a were also generated by SSP. Figures E.2b and E.2c were generated by ARM. Appendix F is the exclusive work of ARM.

Table of Contents

Abstract
Preface
Table of Contents
List of Tables
List of Figures
Acknowledgments
Dedication
1 Outline and background
  1.1 Protein
  1.2 Protein folding
  1.3 Order parameters in protein folding
2 Minimal distance transformations between links and polymers: principles and examples
  2.1 Introduction
  2.2 Distance for polymers or strings
    2.2.1 Discrete chains
    2.2.2 General variation of the distance functional
    2.2.3 Conditions for an extremum
  2.3 Single links
    2.3.1 Straight line transformations
    2.3.2 Piece-wise extremal transformations: transformations with rotations
    2.3.3 Systematically exploring transformations by varying link positions
  2.4 2-link chains
    2.4.1 Transformations involving a change in convexity
    2.4.2 Transformations with initial and final states in 3-D
  2.5 Limit of large link number
    2.5.1 MRSD as a metric for protein folding
  2.6 Conclusions
3 Minimal folding pathways for coarse-grained biopolymer fragments
  3.1 Introduction
  3.2 Methods
    3.2.1 Representative protein fragments
    3.2.2 Construction of minimal pathways
    3.2.3 RMSD and MRSD
  3.3 Results
    3.3.1 β-hairpin
    3.3.2 α-helix
    3.3.3 Crossover structure
  3.4 Discussion and conclusion
4 Structural alignment using the generalized Euclidean distance between conformations
  4.1 Introduction
  4.2 Method and results
  4.3 Conclusion and discussion
5 Polymer uncrossing and knotting in protein folding, and their role in minimal folding pathways
  5.1 Introduction
  5.2 Methods
    5.2.1 Calculation of the transformation distance
    5.2.2 Generating unfolded ensembles
    5.2.3 Proteins used
    5.2.4 Calculating distance metrics for the unfolded ensemble
  5.3 Results
    5.3.1 Quantifying minimal folding pathways
    5.3.2 Topological constraints induce folding pathways
  5.4 Conclusion and discussion
6 The role of polymer non-crossing and geometrical distance in protein folding kinetics
  6.1 Introduction
  6.2 Methods
    6.2.1 Proteins used with rate
  6.3 Results
  6.4 Conclusion and discussion
7 Conclusion and further thoughts
Bibliography
Appendices
A Sufficient conditions for an extremum to be a minimum
  A.1 Distance between points
  A.2 Geodesics on the surface of a sphere
  A.3 Harmonic oscillator
B Necessary conditions for straight line transformations
C Critical angles
D Minimal transformations in 2 dimensions
E Extremal trajectories of beads or links subject to steric excluded volume
  E.1 Point particle
  E.2 One link
F Cross correlation of order parameters

List of Tables

3.1 Values of the distance for various protein backbone fragments, as compared to other metrics
4.1 D/N (in units of link length squared) between the aligned structures in figure 4.1
4.2 MRSD (in units of link length) between the aligned structures in figure 4.1
5.1 Proteins analyzed
5.2 Order parameters for various classifications of proteins
6.1 Two-state proteins: correlation between folding rate and various order parameters indicated.
6.2 Three-state proteins: correlation between folding rate and various order parameters indicated.
6.3 α-helix dominated proteins that are 2-state folders: correlation between folding rate and various order parameters indicated
6.4 α-helix dominated proteins (both 2- and 3-state): correlation between folding rate and various order parameters indicated
6.5 β-sheet dominated proteins that are 2-state folders: correlation with various order parameters indicated.
6.6 β-sheet dominated proteins (both 2- and 3-state): correlation with various order parameters indicated.
6.7 Mixed secondary structure proteins that are 2-state folders: correlation with various order parameters indicated.
6.8 Mixed secondary structure proteins (both 2- and 3-state): correlation with various order parameters indicated.
6.9 Correlation of folding rate of all the studied proteins, for which folding rates were available, with various order parameters.
6.10 Best rate predictors for different classes of proteins, based on Kendall and Pearson correlations.
F.1 Two-state proteins: correlation between various order parameters.
F.2 Three-state proteins: correlation between various order parameters.
F.3 α-helix dominated proteins (both 2- and 3-state): Correlation between various order parameters.
F.4 β-sheet dominated proteins (both 2- and 3-state): Correlation between various order parameters.
F.5 Mixed secondary structure proteins: Correlation between various parameters.
F.6 Unknotted proteins: correlation between various order parameters
F.7 Knotted proteins: correlating between various order parameters
F.8 All proteins: correlating between various order parameters

List of Figures

1.1 Schematic representation of a generic amino acid.
1.2 Graphical representation of proteins
1.3 Funnel energy landscape
1.4 Q and RMSD
2.1 Distance between two points
2.2 Distance between two curves
2.3 Curve discretization
2.4 Broken extremal
2.5 Possible and impossible straight line transformations
2.6 Bowtie transformation
2.7 Link broken extremal
2.8 Successive transformations between two links through rotation
2.9 Successive transformations between two links through translation
2.10 Two-link transformations example
2.11 Non-degenerate 2-link transformations
2.12 A transformation between two states of opposite convexity
2.13 Sub-minimal and minimal transformations in a sample 2-link system
2.14 Two-link transformations in 3D
2.15 Examples of transformations between initial and final states of opposite convexity, for increasing numbers of links
2.16 MRSD explained
2.17 MRSD and RMSD in non-crossing constraints
2.18 Free energy surfaces for MRSD and Q
3.1 Residues 99-153 in regulatory chain B of Aspartate Carbamoyltransferase
3.2 β-hairpin fragment and the initial state
3.3 Overpass/underpass fragment
3.4 Illustration of the general recipe for obtaining minimal pathways
3.5 Minimal transformations to the β-hairpin
3.6 α-helix and its minimal pathway
3.7 Various steps in a minimal pathway obeying non-crossing
4.1 Alignments with different cost functions
4.2 Scale invariant distance resulting from different alignments with different cost functions
5.1 Transformation of a simple conformation with link size change shown
5.2 Scatter plot for link length deviation
5.3 Crossing detection using projections
5.4 Two possible untangling transformations
5.5 Minimal untangling using knowledge of future crossings
5.6 Snapshots of a transformation with two crossings
5.7 Leg substructure
5.8 Crossing substructures
5.9 The three types of Reidemeister moves. As it can be seen, Reidemeister move type III does not reverse the nature of any of the crossings.
5.10 Schematic illustration of the canonical leg movement
5.11 A single leg movement to undo several crossings
5.12 Topological loop twist
5.13 Schematic of the canonical elbow move
5.14 Various crossing substructures in a simple example
5.15 An example (subset) tree of possible transformations for a given crossing structure
5.16 Clustering of proteins depending on order parameter
5.17 Statistical significance for all order parameters in distinguishing between different classes of proteins
5.18 Renderings of the three proteins whose minimal transformations we investigate in detail
5.19 Bar plots for the noncrossing operations involved in minimal transformations, for the protein 2ABD
5.20 Bar plots of the noncrossing operations for the β-sheet protein 1PKS
5.21 Bar plots of the noncrossing operations for the knotted protein 3MLG
5.22 Consensus histograms of the transformations described in Figures 5.19-5.21
5.23 Schematic of the most representative transformation for the protein 2ABD.
5.24 Schematic of the most representative transformation for the knotted protein 3MLG.
5.25 Schematic diagram for the residues involved in noncrossing operations for two minimal transformations and . and the Sequence overlap of moves
5.26 Pathway overlap (Q) distributions for 3 proteins
6.1 Correlation between folding rate and RMSD for three-state folders.
6.2 Absolute value of Kendall correlation of a few order parameters and rate, across different classes of proteins.
B.1 A link in 3D space.
C.1 Transformation in which both ends stay on a linear track
C.2 Geometric proof for critical angle condition
C.3 A minimal transformation in s() parametrization
D.1 Hyper extended solution vs a more general compound straight-line transformation
D.2 Optimal compound straight line transformation
D.3 An optimal compound straight-line solution for 2 link
D.4 Minimal transformation restricted to 2 dimensions, for 2 links of opposite convexity which form opposite sides of a square.
E.1 Inequality constraints
E.2 Extremal trajectories and inequality constraints

Acknowledgments

I would like to thank the former and current members of Plotkin's research group for helpful discussions and Dr. Plotkin for supervising my research. I would also like to thank the members of my committee, Dr. G. Patey, Dr. C. Hansen and Dr. J. Rottler.

Dedication

This thesis is dedicated to my family.

Chapter 1

Outline and background

This thesis is about a mathematical construct known as generalized distance (D) and some of its applications in protein science. The central idea underlying D is to extend the conventional distance between two points to a distance between two extended objects (polymers). The problem by itself is of great mathematical beauty and, even in its simplest case (finding the minimal distance between two links), is incredibly rich in structure. We therefore spend the greater part of chapter 2 discussing the problem from a purely mathematical point of view. At the end of chapter 2 we propose D as an order parameter for studying protein folding.

In the subsequent chapters we explore the applications of D in various areas of protein science. In particular, in chapter 3, we will see how D can be used to construct folding pathways for protein fragments, such as α-helices and β-hairpins, and how non-crossing constraints can affect these folding pathways. Construction of geometric pathways for folding has long been of interest.
In our case, one benefit of studying folding pathways for fragments is the possibility of constructing mathematically exact solutions. As expected, we find that the alignment of the initial and final conformations affects the pathway and the resulting distance between the initial and final conformations. Therefore, in chapter 4, we study the differences in optimally aligned structures that result from using different alignment cost functions. We will see that a simple and computationally inexpensive approximation to D, called Mean Root Squared Distance or MRSD, is adequate for nearly optimal alignment of sufficiently long chains.

In chapter 5, we focus on minimal distances between full-length proteins. This problem is much more difficult to solve analytically, hence we develop an algorithmic method for constructing an approximate minimal solution. An important aspect to consider in constructing geometrical protein-folding pathways is the non-crossing constraints and the resulting untangling moves. We develop methods that approximately capture the minimal untangling moves required when folding a protein. We apply these methods to more than forty proteins and explore the importance of non-crossing constraints in distinguishing different protein structures. The results from our analysis concerning the dominant folding pathway of a knotted protein are potentially of the greatest interest for applications. Another important result from this chapter is that the contribution of untangling distance is generally very small compared to the total distance that the protein has to travel.

Figure 1.1: Schematic representation of a generic amino acid.

In chapter 6, we apply the formalism developed in chapter 5 to protein kinetics and explore how different metrics correlate with the protein folding rate.
One important result that emerges from this analysis is the relative success of distance-like metrics in predicting the folding rate of three-state proteins, which tend to fold through an intermediate state.

The distance D will be introduced and studied extensively in the next chapter. However, understanding the applications of D in protein folding requires familiarity with some of the fundamental concepts in protein science and protein folding. These concepts are addressed in the remainder of this chapter.

1.1 Protein

Proteins are macromolecules that perform a vast array of biological functions in living organisms. They are made of smaller constituents called amino acids. Amino acids are a group of biological molecules that are composed of a central carbon atom called Cα, a hydrogen atom, a carboxyl group (COOH), an amine group (NH2) and a side chain (R) that is specific to each amino acid; see figure 1.1. There are about one hundred amino acids found in nature, twenty of which are used as building blocks of proteins [14].

Two amino acids can be linked together by forming a special covalent bond called a peptide bond. A peptide bond results from the reaction of the carboxyl group of one amino acid with the amino group of the adjacent amino acid. Through the repetition of this mechanism several amino acid molecules can be linked together to form a long chain of residues connected by peptide bonds, called a polypeptide. From an all-atom perspective, a polypeptide consists of a backbone held together by peptide bonds and the side chains of its constituent amino acids. A protein is made of one or more polypeptide chains [14].

By construction, any polypeptide has one free amine group at one end and one free carboxyl group at the other end. The end that is characterized by the free amine group is called the N-terminus, and the end characterized by the carboxyl group is called the C-terminus.
When amino acids are part of a polypeptide, they are called residues, and are numbered from 1 to N, counting from the N-terminus to the C-terminus by convention.

Each protein has a unique sequence of amino acids. The sequence is encoded in the gene that is responsible for the synthesis of the protein inside the cell. Shortly after its synthesis, the protein (amino acid chain) is generally disordered, has high entropy, and lacks a specific structure. Through a complex process of interaction between the constituent amino acids, aided by the surrounding environment, the protein spontaneously "folds" into a well-defined 3D structural ensemble that is specific to that protein. Folding may start concurrently with synthesis, but current data on folding rates in comparison to translation rates indicate that most proteins tend to fold only after complete translation [95]. Sometimes helper proteins called chaperones kinetically proofread the folding process by kicking the proteins out of local-minimum traps [55]. The well-defined final structure is called the native structure or the native conformation.

The 3D shape of the protein is crucial for its biological function. In fact protein "misfolding", meaning folding to an incorrect final conformation, is involved in many degenerative diseases, such as Creutzfeldt-Jakob disease (the human form of mad cow disease), Alzheimer's disease, Huntington's disease and Parkinson's disease [70].

The native structure of a given protein can be determined experimentally using a variety of techniques, most importantly X-ray crystallography and NMR spectroscopy. Native structure coordinates, once determined, are usually stored in plain-text digital files and deposited in the Protein Data Bank (PDB: www.wwpdb.org), available to researchers on the Internet. Each determined native conformation has a unique 4-character alphanumeric identifier. For example, the three-dimensional structure of the protein acylphosphatase is available as 1aps.pdb.
At the time of writing this dissertation, about eighty thousand protein structures had been determined experimentally.

Four levels of protein structure can generally be identified: primary, secondary, tertiary and quaternary structure. The primary structure simply corresponds to the amino-acid sequence of the protein. The secondary structure is formed through the formation of hydrogen bonds between protein residues. During the process of folding, various segments of the protein chain form highly regular substructures called secondary structures. There are two common types of secondary substructure: the alpha-helix and the beta-strand or beta-sheet. The tertiary structure is equivalent to the native structure of a single protein and entails the relative positioning of the secondary substructures. The quaternary structure is the formation of structures comprising several peptide chains. In this thesis we are only concerned with the first three levels of structure.

Figure 1.2: Graphical representation of the protein acylphosphatase (1aps.pdb). (a) All-atom stick model, where green corresponds to carbon, white to hydrogen, blue to nitrogen, and red to oxygen. (b) Backbone representation with the secondary structures emphasized. Red corresponds to the alpha-helix secondary structure and yellow to the beta strands. (c) The solvent-accessible surface representation, with the same color scheme as described for sub-figure (a). All figures were generated using PyMOL.

Protein structures are graphically represented in several ways, which fall into three rough categories: all-atom representation, backbone representation (usually with the two types of secondary structure rendered differently; see figure 1.2) and solvent-accessible surface representation. The backbone representation can be further "coarse-grained" by representing each entire amino acid by its Cα atom, a process known as Cα coarse-graining.
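Cα coarse-graining is simple to carry out on a PDB file. The following is a minimal sketch, assuming the standard fixed-column layout of PDB ATOM records; the function name `ca_trace` is hypothetical and is not taken from this thesis. It keeps only the first model of a multi-model file and ignores alternate locations and chain boundaries, which a real pipeline would need to handle.

```python
def ca_trace(pdb_text):
    """Return a list of (x, y, z) tuples, one per C-alpha atom.

    Assumes fixed-column PDB ATOM records: atom name in columns 13-16,
    x, y, z in columns 31-38, 39-46 and 47-54 (1-based).
    """
    coords = []
    for line in pdb_text.splitlines():
        # Stop at the end of the first model (e.g. in NMR ensembles).
        if line.startswith("ENDMDL"):
            break
        # Only protein ATOM records; HETATM (e.g. calcium ions) is skipped.
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            x = float(line[30:38])
            y = float(line[38:46])
            z = float(line[46:54])
            coords.append((x, y, z))
    return coords
```

Applied to the text of a single-chain PDB entry, this would return one coordinate tuple per residue, in the order in which the residues appear in the file.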
In this thesis we work primarily with Cα coarse-grained structures.

1.2 Protein folding

The process in which the unstructured coil of polypeptide transforms to the native structure is called protein folding. From an energetic point of view, protein folding is a thermodynamic process in which the system equilibrates to its minimal free-energy state. Under normal conditions, the minimal free-energy state is the native conformation. Named after Christian B. Anfinsen, who won the Nobel prize for the discovery, Anfinsen's dogma states that under normal conditions the native structure of the protein is uniquely determined by its amino acid sequence [4].

The time scale of the folding process varies drastically across different proteins. The folding rate k_f (in s^-1) covers a range of a few orders of magnitude. Engrailed homeodomain (PDB: 1ENH, with a length of 54 residues) has log k_f = 10.53, whereas a knotted protein such as 2ouf-x2 (PDB: 3MLG, with a length of 169 residues) has log k_f = 6.91.

Currently it is not possible to capture the full dynamical mechanism of the complete folding process experimentally. Instead, a number of indirect experimental methods are used to gain insight. For example, single or multiple residues are mutated and the resulting changes in folding kinetics and native structure are studied. This method is known as φ-value analysis [101]. Computer simulations (known as in silico methods) have been invaluable tools as well. However, brute-force all-atom simulations are too computationally intensive at this stage to capture the process on a long enough time-scale at the high-throughput level.

The current prominent theoretical paradigm is that the folding process is a diffusion in protein conformation space on a funnel-like energy landscape [99]; see figure 1.3.
Any intermediate conformation has a transition probability to adjacent conformations, with the funnel energy gradient driving the overall diffusion towards the native structure. During the folding process, the protein chain loses overall conformational entropy (as the native state is well-structured) but loses internal energy to a greater extent, and hence the total Gibbs free energy goes down by the end of the process [99]. The funnel shape of the energy landscape (on a rough scale) ensures that, as the internal free energy of the system decreases, the intermediate conformation becomes more and more native-like.

The shape of the energy landscape is a result of evolution, through which the energetic frustrations of the system have been minimized or ameliorated. The protein sequences have been selected to yield funnel-shaped energy landscapes. A random sequence of amino acids has only a small probability of having a funnel-like energy landscape [99].

Figure 1.3: Funnel energy landscape. Image adapted from [99]. (Figure labels: Route 1, Route 2; configurational entropy; energy; degree of nativeness; energy scales 100 kT and 2-3 kT.)

It is deduced from this model that there is not a single fixed route from an unfolded conformation to the native structure. The complete folding trajectories on the funnel can vary and can start at different points, but they all converge to the same point: the native conformation. The detailed mechanism of folding is governed by a smaller energy scale, in which the internal free energy loss is compensated by conformational entropy loss. The total internal energy loss of the native state is of order 100 k_B T, whereas free energy barriers (when loss of conformational entropy is accounted for) are of order 10 k_B T. However, as we will see, protein folding is not a purely energetically driven process. Topology plays an important role as well [99].
A folding protein undergoes a complex interplay of energetics and entropics as it navigates through its accessible phase space; however, the resulting kinetics are often simple [11, 13, 112, 113]: many proteins fold across a single free energy barrier in a 2-state like fashion. The kinetics of many other proteins are only marginally more complex, folding by a 3-state mechanism.

Central in the study of protein folding are the ideas of commitment probability and reaction coordinates. Reaction coordinates are one-dimensional coordinates (essentially a number) that capture the progress along a reaction pathway. Commitment probability is the probability that the state diffuses in conformation space to the final state before reaching the initial state [96].

One of the conceptual refinements to arise from theoretical and simulation studies is the study of "good" reaction coordinates that correlate with commitment probability to complete a reaction such as the folding reaction [8, 9, 28, 33, 58, 85, 130]. Reaction coordinates must generally take into account the energy surface on which the molecule of interest is undergoing conformational diffusion [10, 39, 138], and the Markovian or non-Markovian nature of the diffusion [59, 114].

1.3 Order parameters in protein folding

In condensed matter systems, useful order parameters have historically had intuitive geometrical interpretations. Their definition did not require the knowledge of a particular Hamiltonian (although their temperature-dependence and time-evolution were affected by the energy function in the system). In chemical reactions, the distance between constituents in reactant and product has played a ubiquitous role in the construction of potential energy surfaces [77].

In protein folding, order parameters are generally used to compare structures, not always to look at phase transitions.
The study of various order parameters that might best represent progress in the folding reaction has generated much interest [8, 17, 18, 21, 26, 32, 44, 57, 73, 81, 114], with questions focusing on what parameter(s) or principal component-like motions might best correlate with splitting probability, or probability of folding before unfolding.

On the other hand, analyses using intuitive geometric order parameters have been developed to understand folding and are now commonly used. These include the fraction of native contacts $Q$ [8, 21, 64, 94, 111], which can be locally or globally defined, root mean square distance or deviation (RMSD) between structures [45, 56, 124], structural overlap parameter [19], Debye-Waller factors [117, 118] (see footnote 2), or fraction of correct dihedral angles [64].

Finding a simple geometrical order parameter that quantifies progress to the folded structure poses several challenges. These include an accurate account of the effects of polymer non-crossing [90], energetic and entropic heterogeneity in native driving forces (which will induce bottlenecks in folding pathways) [78, 110, 111], as well as non-native frustration and trapping [23, 108, 115].

Fortunately it has been borne out experimentally that wild type proteins are sufficiently minimally frustrated that non-native interactions do not play a strong role in either folding rate or mechanism, and native structure based models for folding rates and mechanisms have enjoyed considerable success [2, 5, 41].

2 The so-called B-factors, or temperature factors, indicate the relative vibrational motion. The lower the factor, the more ordered the structure is. In PDB files, each atom of the native state has an associated B-factor. Thus a Debye-Waller factor can be used as an order parameter to determine how structured a partially-unfolded ensemble is, and where it tends to be unstructured.
Many of these reaction coordinates have been used to describe the folding process despite being flawed in principle. These characterizations have been largely successful because the majority of conformations during folding are well-characterized by changes in these parameters: proteins undergo some collapse concurrently with folding, lower their internal energy, and adopt structures geometrically similar to the native structure.

Below we describe two of the most common order parameters used in the discipline. In subsequent chapters we expand on this topic.

Fraction of native contacts, or $Q$, is an order parameter that is commonly used as a measure of native-proximity. Given two structures denoted by A and B,
\[ Q_{AB} \equiv \Big( \sum_{i<j} \Delta^A_{ij}\,\Delta^B_{ij} \Big) \Big/ \Big( \sum_{i<j} \Delta^B_{ij} \Big) . \]
Here, $\Delta^A_{ij}$ is equal to unity if residues $i$ and $j$ are in contact in structure A (a concept we describe shortly), and is zero otherwise. If two non-hydrogen atoms of a pair of non-neighboring residues are within a prescribed cut-off distance (usually 4.9 Å), then the two residues are considered to be in contact. The contacts present in the native structure of the protein are called the native contacts, the total number of them being $\sum_{i<j} \Delta^B_{ij}$. For any arbitrary conformation of the polypeptide chain, a fraction of the contacts that were present in the native structure are present. Of course, other contacts may exist that are not present in the native structure, these being called the non-native contacts. The order parameter $Q$ of a conformation is the fraction of native contacts that are present in that conformation.

RMSD, or Root Mean Squared Deviation, is another commonly used order parameter.
\[ \mathrm{RMSD} \equiv \sqrt{ N^{-1} \sum_{i=1}^{N} (\mathbf{r}_{Ai} - \mathbf{r}_{Bi})^2 } \]
is a least-squares measure of similarity between structures A and B. Typically, this quantity is minimized given two structures, and so can be thought of as a "least squares fit". The sum may be over all atoms, or simply over the $C_\alpha$ atoms of the residues in coarse-grained models.
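The two definitions above can be made concrete with a small sketch. It is illustrative only and rests on several assumptions that are not in the text: one bead per residue stands in for the heavy-atom contact criterion, "non-neighboring" is taken to mean a sequence separation of at least 2, the RMSD is computed without the usual superposition step, and the function names and toy hairpin coordinates are hypothetical.

```python
import math

def contact_map(coords, cutoff=4.9, min_separation=2):
    """Set of residue pairs (i, j), i < j, whose beads lie within `cutoff`.

    Assumption: one bead per residue; the text uses heavy-atom distances.
    """
    contacts = set()
    n = len(coords)
    for i in range(n):
        for j in range(i + min_separation, n):
            if math.dist(coords[i], coords[j]) <= cutoff:
                contacts.add((i, j))
    return contacts

def fraction_native_contacts(coords, coords_native, cutoff=4.9):
    """Q: fraction of the native contacts that are present in `coords`."""
    native = contact_map(coords_native, cutoff)
    if not native:
        raise ValueError("native structure has no contacts")
    present = contact_map(coords, cutoff)
    return len(native & present) / len(native)

def rmsd(coords_a, coords_b):
    """Root mean squared deviation, with no rotational or translational fit."""
    n = len(coords_a)
    squared = sum(math.dist(a, b) ** 2 for a, b in zip(coords_a, coords_b))
    return math.sqrt(squared / n)

# Toy six-bead hairpin (hypothetical coordinates, in Angstrom).
native = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0),
          (7.6, 4.5, 0.0), (3.8, 4.5, 0.0), (0.0, 4.5, 0.0)]
# Same hairpin with the last bead pulled away from its partner.
partly_open = native[:5] + [(0.0, 9.0, 0.0)]

q_open = fraction_native_contacts(partly_open, native)
rmsd_open = rmsd(partly_open, native)
```

In this toy case the native hairpin has two contacts and pulling one bead away breaks one of them, so `q_open` drops to one half while `rmsd_open` reflects only the displacement of that single bead.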
Figure 1.4 shows two structures A and B with different measures of structural similarity to a "native" hairpin fragment N. These structures have different measures of proximity depending on the coordinate used to characterize them. If we use the fraction of native contacts, $Q$, to describe native proximity, structure A has $Q_A = 1/3$ while $Q_B = 0$, so by this measure A is more native-like. If we use the root mean square deviation RMSD, structure B is more native-like than A. Moreover, structure B would have a higher probability of folding before unfolding than A, i.e., it has a larger value of $p_{\mathrm{FOLD}}$ [33], and so is closer kinetically to the native structure. The longer the hairpin, the more likely a slightly expanded structure is to fold, so the discrepancy between $Q$ and RMSD for these pairs of structures becomes even larger.

Figure 1.4: Order parameters do not always correlate with kinetic proximity. Structure A above is more native-like according to the fraction of native contacts, while structure B is more native-like according to RMSD, and is also closer kinetically to the native structure. Image adapted from [91].

Chapter 2

Minimal distance transformations between links and polymers: principles and examples

In this chapter, the concept and calculation of generalized distance are introduced. We generalize the calculation of Euclidean distance between points to that between one-dimensional objects, such as strings or polymers. Then, we derive the necessary and sufficient conditions for the transformation between two polymer configurations to be minimal. We give numerous examples for the special cases of one and two links, and then investigate the transition to a large number of links, neglecting for the time being curvature and non-crossing constraints. Equipped with this new mathematical tool, we investigate applications of this metric to protein folding, specifically, to secondary and tertiary structural fragments.
For most of this chapter, we are interested in generalized distance ($\mathcal{D}$) purely as an interesting mathematical concept, analyzed using the calculus of variations. Certainly, applications of $\mathcal{D}$ need not be restricted to those in protein science. However, there are a few results from this chapter that are applicable in protein folding. In particular, the generalized distance $\mathcal{D}$ can be considered as an order parameter for the protein folding process, similar to the usage of root-mean-squared deviation from the native structure (RMSD). In the limit of a large number of chain links, and in the absence of curvature and non-crossing constraints, $\mathcal{D}$ is approximately equal to a metric that is comparable to, but different from, RMSD: MRSD (Mean Root Squared Distance). We argue that MRSD is the more physically meaningful of the two, being directly related to how much everything must move in 3D space in order for the protein to fold, while RMSD is the Euclidean distance in the $3N$-dimensional conformational space, where $N$ is the number of coarse-grained $C_\alpha$ beads. Lastly, we address the issue that an accurate account of structure proximity must respect the fact that real protein chains cannot cross themselves, so non-crossing constraints need to be included. Neither MRSD nor RMSD addresses this issue, but an accurate calculation of $\mathcal{D}$ should.

Figure 2.1: Distance between the two points A and B is the minimum length of the curve connecting the two points.

2.1 Introduction

The distance between two points can be thought of as a minimization problem in the calculus of variations, where we try to minimize an integral of infinitesimal distance segments. For example, in a Euclidean space, in order to find the distance between two points A and B, we have to minimize the following integral:
\[ \mathcal{D} = \int_{\mathbf{r}_A}^{\mathbf{r}_B} ds = \int_0^T dt\, \sqrt{\dot{\mathbf{r}}^2} \tag{2.1} \]
Here we have let $\dot{\mathbf{r}} \equiv d\mathbf{r}/dt$, and we use notation such that $\dot{\mathbf{r}}^2 \equiv \dot{\mathbf{r}} \cdot \dot{\mathbf{r}}$.
The boundary conditions on the extremal path are $\mathbf{r}(0) = \mathbf{r}_A$ and $\mathbf{r}(T) = \mathbf{r}_B$. The concept is depicted in figure 2.1. Taking the functional derivative in eq. (2.1) gives Euler-Lagrange (EL) equations for the Lagrangian $\mathcal{L} = \sqrt{\dot{\mathbf{r}}^2}$:
\[ \frac{d}{dt}\left( \frac{\partial \mathcal{L}}{\partial \dot{\mathbf{r}}} \right) = 0 \quad \text{or} \quad \dot{\hat{\mathbf{v}}} = 0 \tag{2.2} \]
with $\hat{\mathbf{v}}$ the unit vector in the direction of the velocity. Since the derivative of a unit vector is always orthogonal to that vector, equation (2.2) says that the direction of the velocity cannot change, and therefore straight line motion results. Applying the boundary conditions gives $\hat{\mathbf{v}} = (\mathbf{r}_B - \mathbf{r}_A)/|\mathbf{r}_B - \mathbf{r}_A|$. However, any function $\mathbf{v}(t) = |v_o(t)|\,\hat{\mathbf{v}}$ satisfying the boundary conditions is a solution, so long as $\int_0^T dt\, |v_o(t)| = |\mathbf{r}_B - \mathbf{r}_A|$. The solution is reparameterization-invariant. Then the extremal function $\mathbf{r}(t)$ is given by
\[ \mathbf{r}(t) = \mathbf{r}_A + \frac{\mathbf{r}_B - \mathbf{r}_A}{|\mathbf{r}_B - \mathbf{r}_A|} \int_0^t dt'\, |v_o(t')| \tag{2.3} \]
and the distance by
\[ \mathcal{D} = \int_0^T dt\, \sqrt{\dot{\mathbf{r}}^2} = \int_0^T dt\, |v_o(t)| = |\mathbf{r}_B - \mathbf{r}_A| \tag{2.4} \]
which represents the diagonal of a hypercube, as expected. At this point we could fix the parameterization by choosing $|v_o(t)| = |\mathbf{r}_B - \mathbf{r}_A|/T$ (constant speed), for example.

The extremal transformation (2.3) is also a minimum. In Appendix A we will give the sufficient conditions for an extremum to be a (local) minimum, where we will return to this example.

The above idea can be generalized to space curves, surfaces, or higher dimensional manifolds [109]. The distance is defined through the transformation between the objects that minimizes the cumulative amount of arc-length traveled by all parts of the manifold; see figure 2.2. The shortest distance from A to B is purely a geometry problem, but by choosing an artificial "time" parameter $t$, we express the problem as a dynamic variational problem [109]. The motivation for this is to avoid complications that might arise when one specific coordinate is no longer a single-valued function of the others.

Figure 2.2: The distance $\mathcal{D}_{AB}$ is the accumulation of how much everything moves.
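The reparameterization invariance of eq. (2.4) can be checked numerically. The following is a minimal sketch, not taken from the thesis: the end points, the two speed profiles, and the sinusoidal detour are all hypothetical choices, and the integral is approximated by summing chord lengths along the sampled path.

```python
import math

def path_length(r, T=1.0, steps=2000):
    """Approximate the integral of |dr/dt| over [0, T] by summing chords."""
    total = 0.0
    previous = r(0.0)
    for k in range(1, steps + 1):
        current = r(T * k / steps)
        total += math.dist(previous, current)
        previous = current
    return total

# Hypothetical end points, chosen so that |r_b - r_a| = 3.
r_a, r_b = (0.0, 0.0, 0.0), (1.0, 2.0, 2.0)

def straight(progress):
    """r(t) = r_a + progress(t) (r_b - r_a), with progress(0)=0 and progress(1)=1."""
    return lambda t: tuple(a + progress(t) * (b - a) for a, b in zip(r_a, r_b))

uniform = straight(lambda t: t)                     # constant speed
eased = straight(lambda t: t * t * (3.0 - 2.0 * t))  # monotonic, variable speed

def detour(t):
    """Same end points, but the path bulges away from the straight line."""
    x, y, z = uniform(t)
    return (x, y, z + 0.5 * math.sin(math.pi * t))

length_uniform = path_length(uniform)
length_eased = path_length(eased)
length_detour = path_length(detour)
```

Both straight-line parameterizations give the same length, equal to the end-to-end separation, while the detour is strictly longer, as expected for a path that is not extremal.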
2.2 Distance for polymers or strings

Describing the transformation $\mathbf{r}(s,t)$ between two space curves $\mathbf{r}_A(s)$ and $\mathbf{r}_B(s)$ requires two scalar parameters: $s$, the arc-length along the space curve, and $t$, the "time", as in the above zero-dimensional case, measuring progress during the transformation. The boundary conditions are then $\mathbf{r}(s,0) = \mathbf{r}_A(s)$ and $\mathbf{r}(s,T) = \mathbf{r}_B(s)$. The minimal transformation $\mathbf{r}^*(s,t)$ is an object of dimension one higher than A or B, i.e., it yields a distance that is two-dimensional. The distance is $\mathcal{D} = \mathcal{D}[\mathbf{r}^*(s,t)]$, where the functional $\mathcal{D}[\mathbf{r}]$ is given by
\[ \mathcal{D}[\mathbf{r}] = \int_0^L ds \int_0^T dt\, \sqrt{\dot{\mathbf{r}}^2} . \tag{2.5} \]
Here we have used the shorthand $\mathbf{r} \equiv \mathbf{r}(s,t) = (x(s,t), y(s,t), z(s,t))$ (a 3-vector), and $\dot{\mathbf{r}} \equiv \partial \mathbf{r}/\partial t$.

It has been shown previously that the problem of distance does not map to a simple soap film, nor to the minimal area of a world-sheet (which corresponds to the action of a classical relativistic string) [109].

Formulated as above, the string can contract and expand arbitrarily in order to minimize the distance traveled. The transforming object is akin to a rubber band, and all points on $\mathbf{r}_A(s)$ will move in straight lines to their partner points on $\mathbf{r}_B(s)$ to minimize the distance. It is worth mentioning that protein chains, for example, only change their length by about one percent at biological temperatures. To accurately represent the transformation of a non-extensible string, a Lagrange multiplier $\lambda(s,t)$ must be introduced into the effective Lagrangian, weighting the constraint:
\[ \sqrt{\mathbf{r}'^2} = 1 , \tag{2.6} \]
where $\mathbf{r}' \equiv \partial \mathbf{r}/\partial s$.

Under this constraint, points along the string can no longer move independently of each other, but must always be a fixed (infinitesimal) distance apart. The tangent vector $\hat{\mathbf{t}} = \mathbf{r}'$ is now a unit vector, and the total length of the string is $L = \int_0^L ds\, \sqrt{\mathbf{r}'^2} = \int_0^L ds$.

Consider the minimal distance transformation between two configurations $\mathbf{r}_A(s)$ and $\mathbf{r}_B(s)$ of an ideal polymer of length $L$.
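For contrast with the constrained problem, the unconstrained ("rubber band") case of eq. (2.5) is easy to evaluate on a chain sampled at discrete beads: every bead moves in a straight line, so the distance reduces to a sum of bead displacements, and dividing by the number of beads gives the MRSD mentioned in the chapter introduction. The sketch below is an illustration under stated assumptions: the function names and three-bead coordinates are hypothetical, no superposition is performed, and link lengths are not conserved along the way.

```python
import math

def unconstrained_distance(chain_a, chain_b):
    """Sum of straight-line bead displacements (no link-length constraint)."""
    return sum(math.dist(a, b) for a, b in zip(chain_a, chain_b))

def mrsd(chain_a, chain_b):
    """Mean root squared distance: the average bead displacement."""
    return unconstrained_distance(chain_a, chain_b) / len(chain_a)

def rmsd(chain_a, chain_b):
    """Root mean squared deviation of the same displacements."""
    squared = sum(math.dist(a, b) ** 2 for a, b in zip(chain_a, chain_b))
    return math.sqrt(squared / len(chain_a))

# Hypothetical two-link chain: straight, then with the last link bent by 90 degrees.
chain_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
chain_b = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0)]

d_free = unconstrained_distance(chain_a, chain_b)
mrsd_ab = mrsd(chain_a, chain_b)
rmsd_ab = rmsd(chain_a, chain_b)
```

Because the mean of a set of non-negative numbers never exceeds their root mean square, MRSD is always less than or equal to RMSD for the same pair of structures; the two coincide only when every bead moves by the same amount.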
Let us derive the Euler-Lagrange (EL) equations for this case. From equations (2.5) and (2.6), the effective action is
\[ \mathcal{D} = \int_0^L \!\! \int_0^T ds\, dt\; \mathcal{L}(\dot{\mathbf{r}}, \mathbf{r}') \tag{2.7a} \]
where
\[ \mathcal{L} = \sqrt{\dot{\mathbf{r}}^2} - \lambda \left( \sqrt{\mathbf{r}'^2} - 1 \right) \tag{2.7b} \]
and the Lagrange multiplier $\lambda \equiv \lambda(s,t)$ is a function of both $s$ and $t$.

Figure 2.3: Continuum (a) and discretized (b) polymer chain. The EL equation for the continuum polymer is a nonlinear (vector) PDE, while the EL equations for the discretized polymer are a set of nonlinear ODEs.

The extrema of the distance functional $\mathcal{D}$ in (2.7a) are found from $\delta \mathcal{D} = 0$. Taking the functional derivative gives EL equations [109]:
\[ \dot{\hat{\mathbf{v}}} = \lambda \boldsymbol{\kappa} + \lambda' \hat{\mathbf{t}} , \tag{2.8} \]
where $\hat{\mathbf{v}}$ is the unit velocity vector, $\hat{\mathbf{t}}$ is the unit tangent vector, and $\boldsymbol{\kappa}$ is the curvature vector. In eq. (2.8), we see explicitly that if the non-extensibility constraint is removed or, equivalently, if $\lambda = 0$, all points on $\mathbf{r}_A(s)$ move in straight lines to $\mathbf{r}_B(s)$.

2.2.1 Discrete chains

To make the problem more amenable to solution, we can discretize the spatial variables while letting the time variable remain continuous, i.e. we implement the method of lines to solve eq. (2.8). Rather than directly discretizing eq. (2.8), however, it is more natural to consider a discretized chain, as shown in figure 2.3, from the outset, and to calculate the EL equations for this system. This recipe then gives the same result as properly discretizing eq. (2.8). For the discretized chain, the constraint in eq. (2.6) becomes $|\Delta \mathbf{r}| = \Delta s = L/(N-1)$, giving the length of each link. As the number of beads $N \to \infty$, the system approaches a continuous chain. For finite $N$, the Lagrangian becomes a function of the positions and velocities $\{\mathbf{r}_i, \dot{\mathbf{r}}_i\}$ of all beads $i$, $1 \le i \le N + 1$. We use the shorthand notation $\mathcal{L}(\mathbf{r}_i, \dot{\mathbf{r}}_i)$.

This recipe yields the distance metric for an ideal, freely-jointed chain (meaning that the angle between two consecutive links can be of any value without any cost), which has no non-local interactions and no curvature
constraints. While this approximation is often used as a first step, real chains may behave quite differently, for several reasons. In many cases, the configuration which is an energetic minimum is a straight line, or a single conformation dictated by the chemistry of the polymeric bonds. At finite temperature, energy from the environment induces conformational fluctuations. Real polymers also cannot cross themselves, and, because of their stereochemistry, also take up volume. We leave these interesting features for later analysis.

Equation (2.6) for the discretized chain becomes $N$ constraint equations added to the effective Lagrangian:
\[ \sum_{i=1}^{N} \hat{\lambda}_{i,i+1} \left( \sqrt{(\mathbf{r}_{i+1} - \mathbf{r}_i)^2} - \Delta s \right) \]
where each $\hat{\lambda}_{i,i+1} \equiv \hat{\lambda}_{i,i+1}(t)$ is a function of $t$, and $\hat{\lambda}_{N,N+1} = 0$. Letting $\lambda \equiv 2 \hat{\lambda} \Delta s$ and $\mathbf{r}_{i+1/i} \equiv \mathbf{r}_{i+1} - \mathbf{r}_i$, we rewrite this strictly for convenience as
\[ \sum_i \frac{\lambda_{i,i+1}}{2} \left( \frac{\mathbf{r}_{i+1/i}^2}{\Delta s^2} - 1 \right) . \]
We next convert to dimensionless variables by letting $\mathbf{r} = (\Delta s)\,\hat{\mathbf{r}}$. To simplify the notation, from here on we simply refer to $\hat{\mathbf{r}}$ as $\mathbf{r}$. The distance for the discretized chain becomes
\[ \mathcal{D}[\mathbf{r}_i, \dot{\mathbf{r}}_i] = \Delta s^2 \int_0^T dt\; \mathcal{L}(\mathbf{r}_i, \dot{\mathbf{r}}_i) \tag{2.9} \]
with effective Lagrangian
\[ \mathcal{L}(\mathbf{r}_i, \dot{\mathbf{r}}_i) = \sum_{i=1}^{N} \left[ \sqrt{\dot{\mathbf{r}}_i^2} - \frac{\lambda_{i,i+1}}{2} \left( \mathbf{r}_{i+1/i}^2 - 1 \right) \right] . \tag{2.10} \]
The derivatives $\dot{\mathbf{r}}$ and $\mathbf{r}_{i+1/i}$ are raised to different powers in (2.10); however, so long as $\mathbf{r}_{i+1/i}$ satisfies the constraint $|\mathbf{r}_{i+1/i}| = 1$, the EL equations for $\mathbf{r}_i(t)$ will be the same whether the constraint $\sqrt{\mathbf{r}_{i+1/i}^2} = 1$ or $\mathbf{r}_{i+1/i}^2 = 1$ is used.

The reparameterization invariance present for point particles (c.f. section 2.1) is still present for beads on the chain, but the parameterization of arclength along the chain is taken to be fixed by the discretization.

2.2.2 General variation of the distance functional

For reasons that will become clear as we progress, we consider the general variation of the functional $\mathcal{D}$, allowing for broken extremals. That is, we allow the curves describing the particle trajectories to be non-smooth in principle at one or more points in time.
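Consider the case of one such point at time $t_1$. The distance can be written as
\[ \mathcal{D} = \int_0^{t_1} dt\; \mathcal{L}(\mathbf{r}_i, \dot{\mathbf{r}}_i) + \int_{t_1}^{T} dt\; \mathcal{L}(\mathbf{r}_i, \dot{\mathbf{r}}_i) \tag{2.11} \]
The space-trajectories of the particles must be continuous at time $t_1$, so $\mathbf{r}_i(t_1 - \epsilon)$ and $\mathbf{r}_i(t_1 + \epsilon)$ must have the same limit as $\epsilon \to 0$, or in shorthand:
\[ \mathbf{r}_i(t_1^-) = \mathbf{r}_i(t_1^+) . \tag{2.12} \]
Let $\mathbf{r}_i(t)$ and $\tilde{\mathbf{r}}_i(t)$ be two neighboring trajectories from $\mathbf{r}_i(0) = \mathbf{r}_{Ai}$ to $\mathbf{r}_i(T) = \mathbf{r}_{Bi}$ (see figure 2.4). Neighboring curves will differ by the first order quantity $\mathbf{h}_i(t) = \tilde{\mathbf{r}}_i(t) - \mathbf{r}_i(t)$. The fixed boundary conditions at $t = 0, T$ dictate that $\mathbf{h}_i(0) = \mathbf{h}_i(T) = 0$. The difference in distance between the two trajectories is
\[ \Delta \mathcal{D} = \mathcal{D}[\mathbf{r}_i + \mathbf{h}_i] - \mathcal{D}[\mathbf{r}_i] = \int_0^{t_1 + \delta t_1} dt\; \mathcal{L}(\mathbf{r}_i + \mathbf{h}_i, \dot{\mathbf{r}}_i + \dot{\mathbf{h}}_i) - \int_0^{t_1} dt\; \mathcal{L}(\mathbf{r}_i, \dot{\mathbf{r}}_i) + \int_{t_1 + \delta t_1}^{T} dt\; \mathcal{L}(\mathbf{r}_i + \mathbf{h}_i, \dot{\mathbf{r}}_i + \dot{\mathbf{h}}_i) - \int_{t_1}^{T} dt\; \mathcal{L}(\mathbf{r}_i, \dot{\mathbf{r}}_i) \tag{2.13} \]
Taylor expanding the Lagrangian to first order in $\mathbf{h}_i$ (see footnote ‡):
\[ \mathcal{L} \approx \mathcal{L}(\mathbf{r}_i, \dot{\mathbf{r}}_i) + \sum_{i=1}^{N} \left( \mathcal{L}_{\mathbf{r}_i} \cdot \mathbf{h}_i + \mathcal{L}_{\dot{\mathbf{r}}_i} \cdot \dot{\mathbf{h}}_i \right) \]
and integrating by parts using the fixed boundary conditions at $t = 0, T$, the difference in distance up to first order in $\mathbf{h}_i$ is
\[ \Delta \mathcal{D} \approx \int_0^{t_1} dt \sum_i \left( \mathcal{L}_{\mathbf{r}_i} - \frac{d}{dt} \mathcal{L}_{\dot{\mathbf{r}}_i} \right) \cdot \mathbf{h}_i + \int_{t_1}^{T} dt \sum_i \left( \mathcal{L}_{\mathbf{r}_i} - \frac{d}{dt} \mathcal{L}_{\dot{\mathbf{r}}_i} \right) \cdot \mathbf{h}_i + \mathcal{L}(t_1^-)\, \delta t_1 - \mathcal{L}(t_1^+)\, \delta t_1 + \sum_i \mathcal{L}_{\dot{\mathbf{r}}_i} \cdot \mathbf{h}_i \Big|_{t_1^-} - \sum_i \mathcal{L}_{\dot{\mathbf{r}}_i} \cdot \mathbf{h}_i \Big|_{t_1^+} \tag{2.14} \]
with the shorthand $\mathcal{L}(t) \equiv \mathcal{L}(\mathbf{r}_i(t), \dot{\mathbf{r}}_i(t))$.

‡ We use the notation $F_{\mathbf{r}} \equiv \partial F/\partial \mathbf{r}$, $F_{\dot{\mathbf{r}}} \equiv \partial F/\partial \dot{\mathbf{r}}$.

Figure 2.4: General variations of a functional with fixed end points allow for broken extremals. In the text we derive the extra "corner" conditions for a piecewise continuous path to still be extremal for our distance functional. (The figure marks the times $t_1$ and $t_1 + \delta t_1$, the displacement $\delta \mathbf{r}(t_1)$, and the variation $\mathbf{h}(t_1)$.)

2.2.3 Conditions for an extremum

The variation $\delta \mathcal{D}$ differs from $\Delta \mathcal{D}$ above only by second order terms. Then for the transformation from $\{\mathbf{r}_{Ai}\}$ to $\{\mathbf{r}_{Bi}\}$ to be an extremum, $\delta \mathcal{D} = 0$. Thus, the EL expressions (in the top line of eq. (2.14)) must vanish in each regime $[0, t_1)$, $(t_1, T]$. Using the form of the Lagrangian in eq. (2.10), the EL equations become:
\[ \dot{\hat{\mathbf{v}}}_1 + \lambda_{12}\, \mathbf{r}_{2/1} = 0 \tag{2.15a} \]
\[ \dot{\hat{\mathbf{v}}}_2 - \lambda_{12}\, \mathbf{r}_{2/1} + \lambda_{23}\, \mathbf{r}_{3/2} = 0 \tag{2.15b} \]
\[ \vdots \]
\[ \dot{\hat{\mathbf{v}}}_N - \lambda_{N-1,N}\, \mathbf{r}_{N/(N-1)} = 0 \tag{2.15c} \]
According to equation (2.14) there are additional conditions for the transformation to be an extremum. To find these, first note that up to first order (see figure 2.4)
\[ \mathbf{h}_i(t_1) \approx \delta \mathbf{r}_i(t_1) - \dot{\mathbf{r}}_i(t_1)\, \delta t_1 . \tag{2.16} \]
Then the first variation in the distance is
\[ \delta \mathcal{D} = \left[ \left( \mathcal{L} - \sum_i \dot{\mathbf{r}}_i \cdot \mathcal{L}_{\dot{\mathbf{r}}_i} \right)_{t_1^-} - \left( \mathcal{L} - \sum_i \dot{\mathbf{r}}_i \cdot \mathcal{L}_{\dot{\mathbf{r}}_i} \right)_{t_1^+} \right] \delta t_1 + \sum_i \left[ \mathcal{L}_{\dot{\mathbf{r}}_i} \big|_{t_1^-} - \mathcal{L}_{\dot{\mathbf{r}}_i} \big|_{t_1^+} \right] \cdot \delta \mathbf{r}_i(t_1) \tag{2.17} \]
which must vanish at an extremum. Because the variations $\delta \mathbf{r}_i$ and $\delta t_1$ are all independent, the terms in square brackets in equation (2.17) must vanish. Writing these expressions in terms of the conjugate momenta $\mathbf{p}_i = \mathcal{L}_{\dot{\mathbf{r}}_i}$ and Hamiltonian $H = \sum_i \dot{\mathbf{r}}_i \cdot \mathbf{p}_i - \mathcal{L}$ gives the conditions:
\[ \mathbf{p}_i \big|_{t_1^-} = \mathbf{p}_i \big|_{t_1^+} \tag{2.18a} \]
\[ H \big|_{t_1^-} = H \big|_{t_1^+} \tag{2.18b} \]
These conditions are called the Weierstrass-Erdmann conditions or corner conditions in the calculus of variations [46].

According to the Lagrangian in equation (2.10), the Hamiltonian is given by
\[ H = \sum_{i=1}^{N} \frac{\lambda_{i,i+1}}{2} \left( \mathbf{r}_{i+1/i}^2 - 1 \right) \]
which is identically zero, so corner condition (2.18b) provides no further information. The conjugate momenta according to (2.10) are given by
\[ \mathbf{p}_i = \frac{\dot{\mathbf{r}}_i}{|\dot{\mathbf{r}}_i|} = \hat{\mathbf{v}}_i . \tag{2.19} \]
Therefore, according to corner condition (2.18a), extremal trajectories cannot suddenly change direction: each $\mathbf{r}_i(t)$ follows a smooth path continuous up to first derivatives in the spatial coordinates.

The fact that one corner condition provided no information due to the vanishing of the Hamiltonian is related to our choice of parameterization in formulating the problem. For example, in the case of the distance of the single point particle mentioned in the introduction, the Lagrangian may be defined either through independent variable $x$ as $\mathcal{L}^{(x)} = \sqrt{1 + y'^2 + z'^2}$ (with e.g. $y' = dy/dx$), or parametrically through independent variable $t$ as $\mathcal{L}^{(t)} = \sqrt{\dot{\mathbf{r}}^2}$. The conjugate momenta are then either $\mathcal{L}^{(x)}_{y'} = y'/\sqrt{1 + y'^2 + z'^2}$ and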
The Hamiltonia are either H(x) = 1= p 1 + y02 + z02 orH(t) = L(t) _r(_r=j _rj) = 0. The corner conditions can be shown to be equivalent for both choices of independent variable: for L(t) they give v̂(t1 ) = v̂(t+1 ), so that the direction of the tangent to the curve cannot have a discontinuity. Together, the Hamiltonian and two conjugate momenta for L(x) can be interpreted as components of the unit tangent vector to the curve, i.e. t̂(x) = (̂i+y0ĵ+z0k̂)= p 1 + y02 + z02, and so once again, the corner conditions enforce a continuous tangent vector, here t̂(x1 ) = t̂(x + 1 ). Boundary conditions In the continuum limit, the boundary conditions on r(s; t) are r(s; 0) = rA(s), r(s; T ) = rB(s) where rA and rB are the two congurations of the polymer. For discrete chains, these boundary conditions become fri(0)g = fr(A)i g (2.20a) fri(T )g = fr(B)i g : (2.20b) There are also boundary conditions that hold for the end points of the chain at all times. From equations (2.15a, 2.15c), we see that there are three solutions for the end points of the chain: 1) If 6= 0, purely rotational motion results. This can be seen by taking the dot product of eq. (2.15a) with v1, which yields 12v1 r2=1 = 0, so the velocity of the end point is orthogonal to the link. The rotation must be about a point that is internal to the link, i.e., on the line between points 1 and 2 for end point 1. This can be seen straightforwardly for the case of one link by removing point 3 from equations (2.15a) and (2.15b). Then the accelerations _̂vi must be in opposite directions. This can only occur if rotation is about a point on the line between points 1 and 2. 2) If = 0, _̂vi = 0, and straight-line motion of the end point results. 3) Writing out the time-derivative in (2.15a) yields v21 _v1 (v1 _v1)v1 = 12 jv1j3 r2=1 (2.21) which has the trivial solution v1 = 0. The end point can be at rest, while other parts of the chain move. 
For a transformation to be minimal, it is necessary, but not sufficient, that it be an extremum. In Appendix A we derive the sufficient conditions for a given transformation to minimize the functional (2.9). We discuss sufficient conditions further below in the context of minimal transformations for links.

In the discrete version of our variational problem, minimizing $\mathcal{D}$ for chains with $N$ rigid links, one seeks to transform a chain from its initial configuration to the final configuration, while minimizing the total distance that the $N + 1$ beads travel. Similar to what was done for the problem of finding the distance between two points, we transform the problem from a geometric variational problem to a dynamic variational problem by choosing an artificial "time" parameter $t$, to avoid complications that might arise when one specific coordinate is no longer a single-valued function of the others.

The study of minimal transformations between small numbers of links has applications to the inverse kinematic problem in robotics and movement control. In the inverse kinematic problem, one is given the initial and final positions of the end-effector (the hand of the robot), and asked for the functional form of the joint variables for all intermediate states. Generally there is no unique solution until some optimization functional is introduced, such as minimizing the time rate of change of acceleration (the jerk), torque, or muscle tension (see the review [65] and references therein). The minimal distance transformation would be relevant if one sought the fastest transformation between initial and final states, without explicit regard to mechanical limitations. The indeterminate intermediate points can be handled variationally as a free boundary value problem. As we will see, the solutions to these problems involve smooth patches of combinations of rotations and straight-line motions.
2.3 Single links

In the limit of one link, equations (2.15a-2.15c) reduce to:
\[ \dot{\hat{\mathbf{v}}}_A + \lambda\, \mathbf{r}_{B/A} = 0 , \qquad \dot{\hat{\mathbf{v}}}_B - \lambda\, \mathbf{r}_{B/A} = 0 \tag{2.22} \]
where we have let A represent point 1, B point 2, and $\lambda \equiv \lambda_{12}$. The link has length 1 in our dimensionless formulation, so the vector $\mathbf{r}_{B/A}$ could also have been written as a unit vector $\hat{\mathbf{r}}_{B/A}$.

Both points A and B are end points and satisfy the boundary conditions of section 2.2.3. This means that points A and B move by either pure rotation, straight-line translation, or remain at rest. The initial and final conditions may be written $\mathbf{r}_A(0) = A$, $\mathbf{r}_B(0) = B$, $\mathbf{r}_A(T) = A'$, $\mathbf{r}_B(T) = B'$.

Figure 2.5: Possible (a, b) and impossible (c) straight line transformations between links $AB$ and $A'B'$. Figure b shows a straight line transformation where the initial and final states do not lie in the same plane. In the text we derive the conditions for the possibility of a straight line transformation between links. (Panel (a) marks the angles $a$ and $b$ and the travel distances $x_A$ and $r_B$.)

The link in our problem has direction, so $A$ must transform to $A'$ and $B$ to $B'$. We will often use arrowheads in figures to denote this direction.

2.3.1 Straight line transformations

As a first example, consider the two links shown in figure 2.5a. The four points $A, B, A', B'$ need not lie in a plane (see, for example, fig. 2.5b). Let angle $\angle BAA' \equiv a$ be obtuse. We draw straight lines from $A$ to $A'$ and $B$ to $B'$, and ask whether such a transformation is possible. We can thus derive the following rule:

For a straight line transformation to exist between two links, opposite angles of the quadrilateral made by $AB$, $A'B'$, $AA'$, $BB'$ must be obtuse.

Let the length that point $A$ travels be $x_A$, i.e., we imagine the point $A'$ and the distance $x_A = |AA'|$ to be variable. The length $r_B$ that point $B$ travels is then a function of $x_A$ and the original angle $a$, $r_B(x_A, a)$. We can now find conditions on the angle $b \equiv \angle BB'A'$ such that the transformation is possible.
After some distance $x_A$ traveled by point $A$, the squared length of the line from $B$ to $A'$ is
\[ \overline{BA'}^{\,2} = x_A^2 + 1 - 2 x_A \cos a = r_B^2 + 1 - 2 r_B \cos b \]
so that
\[ r_B(x_A, a) = \cos b \pm \sqrt{\cos^2 b + f(x_A, a)} \]
with $f(x_A, a) = x_A^2 - 2 x_A \cos a$. Since $a$ is obtuse, $f > 0$ when $x_A > 0$, and so the positive root must be taken for $r_B$ to be positive. When $x_A = 0$, $f(0, a) = 0$, and
\[ r_B(0, a) = \cos b + |\cos b| = 0 . \]
Therefore $b$ must also be an obtuse angle. If two opposite angles are obtuse, then the other two angles must be acute. This concludes the proof that the above conditions are sufficient. An additional proof that they are necessary is given in Appendix B.

We readily see that figure 2.5a is one pair of a larger set of straight line transformations that can continue until one or both of the obtuse angles reaches 90°. This collection forms a "bow tie" of admissible configurations, as in figure 2.6. Note that straight lines in the quadrilateral may cross, as in the transformation from $A, B$ to $A', B'$ in figure 2.6.

Trivial translations of the link without any concurrent rotation are a special case of general straight line transformations.

2.3.2 Piece-wise extremal transformations: transformations with rotations

An immediate question concerns the nature of the transformation between $AB$ and $A'B'$ in figure 2.5c, where opposite angles of the quadrilateral are not obtuse. Recall our link has direction, so $A$ cannot transform to $B'$. Then a direct straight-line solution is not possible, due to the constraint of constant link length.

The only remaining solution is for the link to rotate as part of the transformation. Consider first the rotation of link $AB$. The EL equations (2.22) allow for pure rotations about $A$, $B$, or a common center along the link. Likewise for link $A'B'$.

The rotation can occur from either link $AB$ (fig. 2.7a) or link $A'B'$ (fig. 2.7b). After the link rotates to a critical angle, it can then travel in
a straight line. The extremals are broken, in that they involve matching up a piece consisting of pure rotation with a piece consisting of pure translation of the end points of the link. Where the pieces match they must satisfy the corner conditions (2.18a, 2.18b). This means that the end points cannot suddenly change direction, a situation which is only satisfied by a straight line trajectory that lies tangent to the circle of rotation.

Figure 2.6: (a) An example of a set of link configurations connected by a straight-line transformation. The link rotates clockwise as it translates to allow the end points to move in straight lines. The translation can proceed no farther than the end points $AB$ and $A'B'$, which have link vectors $\overrightarrow{AB}$ or $\overrightarrow{A'B'}$ that are perpendicular to one or the other of the vectors $\hat{\mathbf{v}}_A$ or $\hat{\mathbf{v}}_B$. The totality of states thus connected forms a "bowtie". (b) A bowtie where the terminal states $AB$ and $A'B'$ happen to cross each other.

From figure 2.6, we see that a straight line transformation exists only when an angle between a link and one of the straight line trajectories reaches $\pi/2$. The critical angle that link $AB$ must rotate is then determined by the point where a line drawn from $B'$ is just tangent to the unit sphere centered at point $A$, point $B_1$ in figure 2.7a. There is generally a different critical angle if the rotation occurs at link $A'B'$, as in fig. 2.7b. It is shown in Appendix C that in general the critical angle is determined by drawing the tangent to a circle or sphere about one of the link ends. If the rotation were about a common center, we see that one or another of the link ends would violate a corner condition, so the rotation must be about one of the link ends.

According to eqs. (A.5) and (A.6), the matrix $P$ has a determinant of
zero due to the parametric formulation of the problem, and so is not positive definite. To show that the transformations in fig. 2.7a,b are indeed minimal, we then need to express the problem in non-parametric form. To do this, let the independent variable be the angle $\theta$ of the link with the vertical. Then the displacement $x$ along the line $AA'$ is the unknown function of $\theta$ to be determined by minimizing the total arc length traveled. This distance can be written as
\[ \mathcal{D}[x] = \int_{\theta_0}^{\theta_1} d\theta \left( \sqrt{x'^2 + 2 x' \cos\theta + 1} + \sqrt{x'^2} \right) \]
In this formulation, the scalar quantity $P(\theta) = \mathcal{L}_{x'x'}$ becomes
\[ P(\theta) = \frac{\sin^2\theta}{\left( x'^2 + 2 x' \cos\theta + 1 \right)^{3/2}} \]
which is always $> 0$ except for the isolated point $\theta = 0$; in particular it is positive along the extremal trajectory, which is necessary for a minimum. So we conclude that the transformation with the smaller angle of rotation in fig. 2.7b is here the global minimum, and the other transformation (fig. 2.7a) is a local minimum.

Figure 2.7: Transformations between two links involving broken extremals consisting of rotation and translation. (b) is the global minimum, with shortest distance traveled during the transformation. (a), (c), and (d) are local minima. (e) is extremal, but not minimal, as the trajectory of arc $B'B_1$ passes through a conjugate point; see Appendix A.

Figure 2.7e is also an extremal trajectory, satisfying corner conditions, and with positive definite $P$. However, it is not a local minimum because the trajectory passes through a conjugate point (denoted by point $CP$, where the dotted line along $A'B'$ meets the great circle about $A'$). According to the results in section A.2, if the extremal trajectory (a great circle) traverses an angle larger than $\pi$ radians, it passes through a conjugate point and thus becomes unstable to sinusoidal perturbations with roots at the end points of the great arc, but no roots in between (see section A.2).
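The closed form for $P(\theta)$ quoted above can be checked against a finite-difference second derivative of the integrand with respect to $x'$. This is only a numerical sanity check under stated assumptions: the integrand is taken to be $\sqrt{x'^2 + 2x'\cos\theta + 1} + \sqrt{x'^2}$, the sample points and step size are arbitrary choices, and $x'$ is kept positive so that $\sqrt{x'^2}$ is smooth.

```python
import math

def integrand(theta, xp):
    """L(theta, x') = sqrt(x'^2 + 2 x' cos(theta) + 1) + |x'|."""
    return math.sqrt(xp * xp + 2.0 * xp * math.cos(theta) + 1.0) + abs(xp)

def p_analytic(theta, xp):
    """Closed form P = sin^2(theta) / (x'^2 + 2 x' cos(theta) + 1)^(3/2)."""
    g = xp * xp + 2.0 * xp * math.cos(theta) + 1.0
    return math.sin(theta) ** 2 / g ** 1.5

def p_numeric(theta, xp, h=1e-4):
    """Central-difference estimate of the second derivative in x'."""
    return (integrand(theta, xp + h)
            - 2.0 * integrand(theta, xp)
            + integrand(theta, xp - h)) / (h * h)

# Arbitrary sample points (theta in radians, x' > 0).
samples = [(1.0, 0.7), (2.0, 0.3), (0.5, 1.5)]
differences = [abs(p_numeric(t, xp) - p_analytic(t, xp)) for t, xp in samples]
```

The agreement reflects the identity $(x'^2 + 2x'\cos\theta + 1) - (x' + \cos\theta)^2 = \sin^2\theta$; the $\sqrt{x'^2}$ term contributes nothing to the second derivative away from $x' = 0$.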
Transformations involving rotations about points $B$ or $B'$ in figure 2.7 both have conjugate points and so are not minimal. The transformation in fig. 2.7c does not pass through a conjugate point and so is in fact another local minimum. The part of the extremum along the straight-line section of the trajectory has no conjugate points, as discussed above.

2.3.3 Systematically exploring transformations by varying link positions

We can investigate what happens to the minimal transformation when one of the link positions or angles is varied with respect to the other. Let us start by putting the two links head to tail, as shown in figure 2.8a. The distance between them is 2 by simple translation of link end points. We can now increase the angle between the two vectors by rotating the right link, for example, as in figures 2.8b-h. So long as the angle between the two vectors is less than 90°, one link may slide along another and the distance is unchanged (figs 2.8a-c). This is a special case of the transformations shown in figure 2.6 (compare for example figure 2.8b with the middle three unlabeled links in that figure).

Beyond 90°, however, the transformation must include rotation. Fig 2.8d has an angle of 110°. The minimal transformation first rotates, for example with the tail of the horizontal black arrow fixed and the head tracing out the blue arc, until the critical angle is reached, where a straight line drawn from the final arrowhead (at the top of the figure) is just tangent to the circle made by the blue arc. This state is indicated by a red link in figure 2.8d. The link then translates to its reciprocal position at the opposite end of the bowtie, denoted by a second red link (c.f. also figure 2.6b). At this point, the arrowhead has completed the transformation. Finally, the tail rotates into its final position. The total distance traveled is slightly larger than 2.

When the angle between the vectors is 120°, as shown in 2.8e, the transformation consists of pure rotations.
Taking the initial state to be the horizontal black vector, the link first rotates about its fixed tail, the head tracing out the blue arc, until the link reaches the state shown in red, where the position of the arrowhead has reached its final end point. Then the link rotates about its head until the position of the tail reaches the final state.

When the angle between the links is larger than 120°, as shown in figs 2.8f-g, the transformation must involve rotation about an internal point along the link. Let points $A$ and $B$ denote the tail and head of the link respectively. If an infinitesimal rotation $\delta\theta$ occurs about an internal point $P$, the increment in distance traveled is

$$\delta D = |r_{B/P}|\,\delta\theta + |r_{P/A}|\,\delta\theta = \delta\theta,$$

which is independent of the position of the instantaneous center of rotation (ICR). This means that there is an infinity of transformations all giving the same distance, depending on the time-dependence of the ICR. Two simple alternatives with only two discrete positions of ICR are shown in figures 2.8f,g. Specifically, in figure 2.8f, the horizontal black vector first rotates about its tail to the red configuration, which is a mirror image of the final black vector. Then rotation is about an internal point determined by the intercept of the red vector with the final black vector, with end points tracing out the green arcs. Figure 2.8g depicts the transformation for overlapping, oppositely pointing vectors. Here only one ICR, at the center of the vectors, can implement the rotation of $\pi$ radians. The red vector shows an intermediate state.

Figure 2.9 illustrates what happens when one of the links is translated with respect to another, starting from two different scenarios shown in 2.9a and 2.9b. In 2.9a, the tail of the vertical link is displaced $(1/3, -1/3)$ with respect to the tail of the horizontal link. The minimal transformation is a pure rotation by $\pi/2$.
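As a numerical aside on the head-to-tail sequence of figure 2.8, the three regimes described above (sliding below 90°, rotation plus translation between 90° and 120°, and pure rotation from 120° onward) can be collected into one function of the angle. This is a sketch under stated assumptions, not the author's code: it assumes the initial link runs from $(0,0)$ to $(1,0)$, the final link has its tail at $(1,0)$ and makes angle $\varphi$ with the first, and the intermediate regime is symmetric (each end point rotates through the same critical angle and translates along a tangent of the same length). The function name is hypothetical.

```python
import math

def head_to_tail_distance(phi_deg):
    """Distance D between two unit links placed head to tail at angle phi (degrees).

    Assumed geometry: initial link (0,0)->(1,0); final link starts at (1,0).
    """
    phi = math.radians(phi_deg)
    if phi_deg <= 90.0:
        # Both end points slide in straight lines of unit length.
        return 2.0
    if phi_deg < 120.0:
        # Final head lies at distance 2*cos(phi/2) from the initial tail.
        reach = 2.0 * math.cos(phi / 2.0)
        # Length of the tangent from the final head to the unit circle about the tail.
        tangent = math.sqrt(max(0.0, reach * reach - 1.0))
        # Critical rotation angle of the head before it can translate.
        alpha = phi / 2.0 - math.acos(min(1.0, 1.0 / reach))
        # Head: arc + tangent; tail: tangent + arc (by symmetry).
        return 2.0 * (alpha + tangent)
    # Pure rotations: total rotation angle equals phi, and dD = d(theta).
    return phi

if __name__ == "__main__":
    for angle in (0, 60, 90, 110, 120, 150, 180):
        print(angle, round(head_to_tail_distance(angle), 3))
```

Under these assumptions the function returns 2.000 up to 90°, 2.020 at 110°, 2.094 at 120°, 2.618 at 150°, and 3.142 at 180°, which are the values quoted under the panels of figure 2.8.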
In figure 2.9b, the tail of the vertical link is now displaced to $(2/3, -1/3)$. Pure rotations again give a distance of $\pi/2$. Rotation about a point on the horizontal link that is equidistant from both arrowheads transforms the initial arrowhead to the final one (red intermediate state). Then, rotation of the tail about the arrowhead transforms to the final state.

In figure 2.9c, the minimal transformation first involves a translation by sliding the arrowhead along the vertical, until the arrowheads overlap (red intermediate state). The tail end of the link then rotates into place.

In figure 2.9d, straight lines from the end points will not satisfy the obtuse condition in section 2.3.2, so the transformation must involve rotations. Here a straight-line transformation takes the link almost to the final state. It then must undergo a small rotation to complete the transformation. Seen in reverse, the vertical arrow must rotate to a critical angle determined by the criterion in section 2.3.2, before the link can finish the transformation by pure translation.

Figure 2.9e is actually figure 2.8f. The final condition (the tilted link) will be systematically changed by translating it vertically away from the horizontal link (which we choose arbitrarily as the initial configuration).

In figure 2.9f, the tilted link is translated vertically by $1/3$. The transformation can be achieved by rotating the horizontal link about a point equidistant from both arrowheads, to the red intermediate configuration. The link then rotates about the arrowhead into the final configuration. The distance is still the angle rotated, for the reasons mentioned above in the context of figures 2.8f-g: $\theta = (150/180)\pi$, which is unchanged from 2.9e. In fact, so long as the arrowhead can be reached by rotation (the translated distance is less than $d$, where $d$ is the solution to $d^2 + d + 1 - \sqrt{3} = 0$ for this angle), the distance will be unchanged. The transformation at the critical distance is shown in figure 2.9g.
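The critical offset $d$ can be obtained directly from the geometry. The sketch below is illustrative only and rests on an assumed coordinate choice: the initial link runs from $(0,0)$ to $(1,0)$, and the tilted final link has its tail at $(1, d)$ and makes an angle of 150° with the initial link. The critical case is then the offset at which the final arrowhead sits exactly one link length from the initial tail. The function name is hypothetical.

```python
import math

def critical_offset(angle_deg=150.0):
    """Vertical offset d at which the final arrowhead is exactly one link
    length from the initial tail at the origin (assumed geometry, see lead-in)."""
    phi = math.radians(angle_deg)
    head_x = 1.0 + math.cos(phi)   # x of the final head; unaffected by the vertical shift
    head_y = math.sin(phi)         # y of the final head before the shift
    # Solve head_x^2 + (head_y + d)^2 = 1 for the positive root d.
    return -head_y + math.sqrt(1.0 - head_x * head_x)

if __name__ == "__main__":
    d = critical_offset()
    # For 150 degrees this root also satisfies d^2 + d + 1 - sqrt(3) = 0.
    print(d, d * d + d + 1.0 - math.sqrt(3.0))
```

With these assumptions the root is $d \approx 0.491$, which lies between the offset of $1/3$ used in figure 2.9f and the offset of 1 used in figure 2.9h.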
The rotations now occur about the end points: the tail and head of the link.

In figure 2.9h, the translated distance is now equal to 1. The transformation first consists of a rotation about the tail to a critical angle (blue arc and red intermediate state), then a translation much like that in figure 2.6 (grey straight lines between red intermediate states), and finally a rotation about the head (green arc) to the final configuration.

Figure 2.8: Successive transformations between two links made by rotating a link so that there is a progressively larger angle between the links as vectors (or smaller angle made between them as lines). The two boundary conditions (the initial and final conditions) are shown as black links, and an intermediate state is shown as a red link or links. The arcs traced out by the end points are shown in blue or green, while straight-line motions, when they are not along the links themselves, are shown in grey. The distance traveled over the course of the transformation is given below each figure: (a) D = 2.000, (b) D = 2.000, (c) D = 2.000, (d) D = 2.020, (e) D = 2.094, (f) D = 2.618, (g) D = 3.142.

2.4 2-link chains

We now consider the next simplest case of 2 links (3 beads). The Lagrangian now reads:

$$L(r_1, r_2, r_3, \dot{r}_1, \dot{r}_2, \dot{r}_3) = \sqrt{\dot{r}_1^2} + \sqrt{\dot{r}_2^2} + \sqrt{\dot{r}_3^2} - \frac{1}{2}\lambda_{12}\left((r_2 - r_1)^2 - 1\right) - \frac{1}{2}\lambda_{23}\left((r_3 - r_2)^2 - 1\right) \qquad (2.23)$$

which has EL equations (c.f. eqs. 2.15a-2.15c)‡:

$$\dot{\hat{v}}_A + \lambda_{AB}\, r_{B/A} = 0 \qquad (2.24a)$$
$$\dot{\hat{v}}_B - \lambda_{AB}\, r_{B/A} + \lambda_{BC}\, r_{C/B} = 0 \qquad (2.24b)$$
$$\dot{\hat{v}}_C - \lambda_{BC}\, r_{C/B} = 0. \qquad (2.24c)$$

The corner conditions (2.18a), (2.19) imply $\hat{v}_i|_{t^-} = \hat{v}_i|_{t^+}$, so the direction of motion cannot suddenly change, unless along one part of the extremal the velocity of point $i$ is zero (the point is at rest), where its direction $\hat{v}$ is then undefined. The boundary conditions described in section 2.2.3 hold as well, so the end points can either be at rest, move in straight lines, or purely rotate. This gives $3 \times 3 = 9$ possible scenarios to investigate here, many of which can readily be ruled out.
For example, consider the states in figure 2.10a. Because $A$ and $A'$ are in the same position, rotation and translation of $A$ are ruled out and point $A$ remains at rest, leaving 3 scenarios for the other end point $C$. However, since $C$ and $C'$ are at different positions and $ABC$ are along a straight line, $C$ cannot remain at rest initially, leaving either translation or rotation for point $C$.

‡ The links have length 1 in our dimensionless formulation, so the vectors $r_{B/A}$ and $r_{C/B}$ could also have been written as unit vectors $\hat{r}_{B/A}$ and $\hat{r}_{C/B}$.

Figure 2.9: Successive transformations between two links made by translating one link with respect to the other. In (a-d) the initial and final configurations are perpendicular, while in (e-h) they are at an angle of 150° to each other. Note the distances in (e-g) are all the same, even though the end points of the links are at varying distances from each other. Panel distances: (a) D = 1.571, (b) D = 1.571, (c) D = 1.730, (d) D = 2.374, (e) D = 2.618, (f) D = 2.618, (g) D = 2.618, (h) D = 3.181.

Figure 2.10: (a) Initial and final states for a chain of two links. The transformation in (b) is non-extremal because it violates a corner condition at $C''$. (c) and (d) are degenerate minima: rotations occurring about $B'$ or $B$ both have the same length (D = 4.498 in both panels). Intermediate states, shown in red, have opposite convexity in (c) and (d).

Figure 2.11: (a) Initial and final states for a polymer of 2 links. The angle between $AB$ and $A'B'$ is $\pi/4$. The minimal transformations in (b) and (c) are now no longer degenerate: (b) has D = 3.114 and (c), the global minimum, has D = 2.985.

Suppose $C$ translates towards $C'$, as in figure 2.10b.
Then, $\dot{\hat{v}}_C = 0$ and from (2.24c, 2.24b) $\lambda_{BC} = 0$ and $\dot{\hat{v}}_B = \lambda_{AB}\, r_{B/A}$. $B$ cannot move in a straight line without moving point $A$, so $\lambda_{AB} \neq 0$, and thus $B$ must rotate about point $A$. The transformation then proceeds as in figure 2.10b until $B$ reaches $B'$ and $C$ reaches $C''$. Then, however, if $C''$ were to rotate to $C'$, the trajectory would violate corner conditions at point $C''$. Therefore the direction of translation of $C$ must not be directly to $C'$ but must be tangential to the arc $\overset{\frown}{C'C''}$ as in figure 2.10c.

The reverse of this transformation is allowable as well, as can be seen by swapping the labels $ABC \rightarrow A'B'C'$. Here, $C$ first rotates to the critical angle shown in fig 2.10d, and then translates to $C'$.

In fact, one can see that links $BC$ and $B'C'$ along with lines $BB'$ and $CC'$ form a quadrilateral, as in figure 2.7, with the same consequences for rotation to a critical angle. For the links in fig 2.10 the situation is symmetric, so rotation can occur at the beginning or end of the transformation. Figure 2.11a shows an example with this symmetry broken, so that the distance is different depending on where the rotation occurs, as in figures 2.7a,b. In this case, the transformation in fig 2.11c has the minimal distance, and that in fig 2.11b is sub-minimal. Extensions of the transformation in figure 2.11 to large numbers of links were explored in [109].

2.4.1 Transformations involving a change in convexity

Figure 2.12: A transformation between two states of opposite convexity: $ABC$ has convexity down and right, while $A'B'C'$ has convexity up and left. There is no extremal transformation in the plane that can connect them without some apparent violation of corner conditions.

Transformations between configurations with opposite convexity involve motion out of the plane, even if the initial and final states lie in the plane. If
the transformation is constrained to lie in the plane, the trajectories of some points will be non-monotonic: those points must move farther away from their final positions before approaching them. We illustrate these ideas with some examples below.

Consider the initial and final states in figure 2.12. We again imagine $B$ rotating to $B'$. If $C$ were to translate to $C'$, one would have the intermediate configuration $A'B''C'$. Now $C'$ and $A'$ must remain at rest to satisfy corner conditions. Then the only way to finish the transformation is for $B''$ to rotate about the axis $A'C'$; however, the trajectory of $B$ then violates corner conditions and so is not extremal. In Appendix D we take up the issue of minimal transformations for this case when the links are constrained to lie in a plane.

We thus seek a point $B''$ and resulting trajectory $\overrightarrow{BB''B'}$ such that arc $\overset{\frown}{BB''}$ satisfies corner conditions with arc $\overset{\frown}{B''B'}$. One solution is to effectively place $B''$ at position $B'$ by considering the boundary condition with $C$ at rest (and $A$ at rest). Then $B$ rotates to $B'$ about axis $AC$, and the trajectory of $B$ lies on a circle defined by the intersection of two unit spheres centered at $A$ and $C$. The sphere about $A$ is drawn in figure 2.13 as a visual aid. Along arc $\overset{\frown}{BB'}$ both $\lambda_{AB} \neq 0$ and $\lambda_{BC} \neq 0$. Once in configuration $A'B'C$, $C$ can then undergo rotation about $B'$ to $C'$, with $A'$ and $B'$ stationary.

The transformation in 2.13a is a local minimum in distance; however, it is not the global minimum. A shorter-distance transformation can be seen by considering the reverse transformation. Imagine $A'$ and $C'$ stationary, while $B'$ rotates about axis $A'C'$ in figure 2.13b. This rotation of $B'$ follows a circular trajectory defined by the intersection of two unit spheres centered at $A'$ and $C'$. The rotation occurs until point $B''$, which is the point where the above circle is tangent to a great circle on the unit sphere about $A$ and passing through $B$.
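The tangency construction can be explored numerically. The sketch below is an illustration under explicit assumptions rather than the author's procedure: it places $A = A'$ at the origin, $B = (1,0,0)$, $C = (1,1,0)$, $B' = (0,1,0)$ and $C' = (1/\sqrt{2},\, 1 + 1/\sqrt{2},\, 0)$ (coordinates inferred for this example, with unit links), parametrizes the circle about axis $A'C'$ by $x$, and uses bisection to find where the circle's tangent is orthogonal to the normal of the plane through $A$, $B$ and the circle point. All function names are hypothetical.

```python
import math

SQ2 = math.sqrt(2.0)
# Assumed coordinates for the example (unit link length).
A = (0.0, 0.0, 0.0)          # A and A' taken to coincide
B = (1.0, 0.0, 0.0)
C = (1.0, 1.0, 0.0)
B_FINAL = (0.0, 1.0, 0.0)    # B'
C_FINAL = (1.0 / SQ2, 1.0 + 1.0 / SQ2, 0.0)   # C'

def sub(p, q):
    return tuple(a - b for a, b in zip(p, q))

def dot(p, q):
    return sum(a * b for a, b in zip(p, q))

def norm(p):
    return math.sqrt(dot(p, p))

def angle(p, q):
    c = dot(p, q) / (norm(p) * norm(q))
    return math.acos(max(-1.0, min(1.0, c)))   # clamp against round-off

def circle_point(x):
    """Point (z >= 0) with abscissa x on the circle where the unit spheres
    about the origin and about C' intersect."""
    y = 1.0 - SQ2 / (2.0 + SQ2) * x
    z = math.sqrt(max(0.0, 1.0 - x * x - y * y))
    return (x, y, z)

def tangency_residual(x, h=1e-6):
    """t . n with t the circle tangent (central difference in x) and
    n = AB x AB'' = (0, -z, y) the normal of the great-circle plane."""
    p = circle_point(x)
    lo, hi = circle_point(x - h), circle_point(x + h)
    t = tuple((b - a) / (2.0 * h) for a, b in zip(lo, hi))
    return dot(t, (0.0, -p[2], p[1]))

def find_tangency(lo=0.1, hi=0.6, iterations=60):
    """Bisection on the residual; the bracket is an assumption that holds here."""
    f_lo = tangency_residual(lo)
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        f_mid = tangency_residual(mid)
        if (f_mid > 0.0) == (f_lo > 0.0):
            lo, f_lo = mid, f_mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def minimal_distance():
    """Great-circle arc B->B'', small-circle arc B''->B', plus straight line C->C'."""
    b2 = circle_point(find_tangency())
    centre = tuple(0.5 * c for c in C_FINAL)
    radius = norm(sub(B_FINAL, centre))
    arc_small = radius * angle(sub(b2, centre), sub(B_FINAL, centre))
    return angle(B, b2) + arc_small + norm(sub(C_FINAL, C))

def subminimal_distance():
    """B rotates to B' about axis AC, then C rotates about B' to C'."""
    foot = tuple(dot(B, C) / dot(C, C) * c for c in C)
    arc_b = norm(sub(B, foot)) * angle(sub(B, foot), sub(B_FINAL, foot))
    arc_c = angle(sub(C, B_FINAL), sub(C_FINAL, B_FINAL))
    return arc_b + arc_c

if __name__ == "__main__":
    print(find_tangency(), minimal_distance(), subminimal_distance())
```

With these assumed coordinates the tangency point has $x_o = \sqrt{2} - 1$, and the two path lengths evaluate to approximately 2.576 and 3.007 link lengths, matching the values quoted for the minimal and sub-minimal transformations of figure 2.13.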
The arc $\overset{\frown}{BB''}$ is a great circle, because this is a geodesic for point $B$ given that $A$ is fixed, which follows from the Euler equations (2.24b, 2.24c) when $\lambda_{BC} = 0$. The great circle is defined by the plane containing the points $A$, $B$, and $B''$. The angle between the (variable) vector $\overrightarrow{BC}$ of link $BC$ and the tangent to the arc $\overset{\frown}{B'B''}$ is always $\geq \pi/2$, so once the corner condition is met, point $C$ on link $BC$ can move in straight-line motion from $C'$ to $C$ while $B$ moves on the great circle from $B''$ to $B$. That is, the quadrilateral criterion of section 2.3.1 is met for $BB''C'C$.

To find point $B''$, let its position be $r_{B''} = (x_o, y(x_o), z(x_o))$. The great circle is defined by the plane passing through the points $A$, $B$, and $B''$. This plane has normal $n \propto \overrightarrow{AB} \times \overrightarrow{AB''} = (1, 0, 0) \times (x_o, y(x_o), z(x_o)) = (0, -z(x_o), y(x_o))$. At the point $B''$ the normal is orthogonal to the tangent vector of the circle defined by rotation about the $AC'$ axis. This tangent vector is $\hat{t} = \partial r/\partial s = x_s (1, y_x, z_x)$ by the chain rule. At $B''$, $\hat{t} \cdot n = 0$, or

$$-z(x_o)\, y_x(x_o) + y(x_o)\, z_x(x_o) = 0. \qquad (2.25)$$

The functions $y(x)$ and $z(x)$ are defined by the intersection of two unit spheres centered at $(0, 0, 0)$ and $(1/\sqrt{2},\, 1 + 1/\sqrt{2},\, 0)$, giving

$$y(x) = 1 - \frac{\sqrt{2}}{2 + \sqrt{2}}\, x, \qquad z(x) = \sqrt{1 - x^2 - y(x)^2}. \qquad (2.26)$$

Together, (2.25) and (2.26) give

$$r_{B''} = \begin{pmatrix} \sqrt{2} - 1 \\ 2(\sqrt{2} - 1) \\ \sqrt{2(5\sqrt{2} - 7)} \end{pmatrix}.$$

The distance traveled along arc $\overset{\frown}{BB''}$ is $\theta_{BB''}$, where $\cos\theta_{BB''} = x_o = \sqrt{2} - 1$. The distance traveled along arc $\overset{\frown}{B''B'}$ can similarly be shown to be $r\,\theta_{B''B'} = \sin(\pi/8)\cos^{-1}(2\sqrt{2} - 3)$. Adding the distance $CC'$, the total (minimal) distance is thus $D = 2.576$. There is, of course, a degenerate solution to the above with $z \rightarrow -z$.

Figure 2.13: Sub-minimal (a) and minimal (b) transformations for the boundary conditions in figure 2.12. The distances for each transformation are approximately $3.007L^2$ and $2.576L^2$ respectively, where $L$ is the link length. Transformation (a) proceeds from $ABC$ by first rotating $B$ to $B'$ about axis $AC$, then rotating $C$ about point $B'$. Transformation (b) proceeds from $ABC$ by simultaneously translating $C$ to $C'$ while rotating $B$ about $A$ on a great circle to point $B''$. Finally point $B$ rotates from $B''$ to $B'$ about axis $A'C'$.

2.4.2 Transformations with initial and final states in 3-D

We now give a representative example where the initial and final configurations do not lie in the same plane, as shown in figure 2.14. Because $AB \perp AA'$ and $BC \perp CC'$, neither $A$ nor $C$ will rotate about $B$ as part of the transformation. Nor can $ABC$ simultaneously translate directly to $A'B'C'$, because, for example, quadrilateral $AA'B'B$ does not satisfy the rule of opposite angles $\geq \pi/2$, so link $AB$ cannot slide (translate) to $A'B'$. This leaves 3 options for the initial stages of the transformation:

1.) $A$ translates, $B$ rotates, $C$ remains fixed. $B$ then rotates about $C$ in the $CBB'$ plane. The initial direction of motion of $B$ is then $\hat{v}_B = (-\hat{i} + \hat{k})/\sqrt{2}$; however, $\hat{v}_A$ can then only move backward to preserve link length ($\hat{v}_A = \hat{k}$), similar to figure B.1. This rules out case (1).

2.) $A$ remains fixed, $B$ rotates, $C$ remains fixed. $B$ then rotates towards $B'$ about axis $AC$, until it reaches a critical angle where line $B''B'$ is tangent to its circular trajectory (see fig. 2.14a). At this point the quadrilateral $B''CC'B'$ does not have opposite obtuse angles, so a straight-line transformation to $A'B'C'$ is not possible. It is possible to transform to a configuration $A'B'C''$, where $C''$ is at position $(1, 1, 1)$ and angle $\angle B'C''C = \pi/2$, so that $\hat{v}_C = \hat{k}$. Then the transformation is completed by a $\pi/2$ rotation of $C''$ about $B'$. This transformation is sub-minimal as it has a larger distance.

3.) $A$ remains fixed, $B$ rotates, $C$ translates. In this case, $B$ rotates toward $B'$ in the $BAB'$ plane, while $C$ translates to $C'$, until the state $AB''C''$ is reached (see fig. 2.14b). State $AB''C''$ can be found as follows.
Because the rotation of $B$ is about the axis $(0, -1/\sqrt{2}, 1/\sqrt{2})$, the position $\overrightarrow{AB''}$ of $B''$ after rotation by the (critical) angle $\theta$ is $(\cos\theta,\, \sin\theta/\sqrt{2},\, \sin\theta/\sqrt{2})$. This angle is then determined by the condition $\overrightarrow{AB''} \cdot \overrightarrow{B''B'} = 0$, where $\overrightarrow{B''B'} = \overrightarrow{AB'} - \overrightarrow{AB''}$. The solution to this condition is simply $\theta = \pi/4$. The location of $C''$ is then determined from the condition that the link length from $B''$ to $C''$ is one: $|\overrightarrow{B''C''}| = 1$, where $\overrightarrow{B''C''} = -\overrightarrow{AB''} + t\,\overrightarrow{CC'}$. Solving this condition for $t$ gives the position of $C''$ as $\left(\frac{3 + \sqrt{2}}{5},\, 1,\, \frac{2(2 - \sqrt{2})}{5}\right)$. At this point the quadrilateral $B'B''C''C'$ has opposite obtuse angles, and quadrilateral $AB''B'A'$ has opposite angles $= \pi/2$, so it is in a bowtie configuration as in the end point configurations in figure 2.6. Therefore, all points $AB''C''$ can translate from this intermediate state to their final positions $A'B'C'$. The total distance traveled is $\theta + |AA'| + |CC'| + |B''B'|$, or $D = 2 + \pi/4 + \sqrt{5} \approx 5.022$. The reverse of this transformation is also possible, where point $B'$ rotates about $A'$ in the plane $B'AB$, while $C'$ translates along $\overrightarrow{C'C}$. Inspection reveals the distance covered is the same as the forward transformation.

Figure 2.14: (a) Sub-minimal transformation and (b) minimal transformation between $ABC$ and $A'B'C'$ (see text). Axes: X, Y, Z.

2.5 Limit of large link number

From the transformation discussed in section 2.4.1, we saw that if both $\angle ABC$ and $\angle A'B'C'$ were $\pi/2$ as in figure 2.15a, then the transformations in figures 2.13a and 2.13b became degenerate, having distance $D = \pi/\sqrt{2}$. The transformation is completed by a single rotation about axis 13.

We can now examine the effect of increasing the number of links. Let the number of links increase to 4, and let us preserve the symmetry that is present about the horizontal axis in fig 2.15a, so the initial and final states become an octagon (figure 2.15b). In the limit, as $N \rightarrow \infty$, the figure becomes a circle.

If we separated the links in figure 2.15a by some distance in the $y$ direction (perpendicular to axis 13), then the minimal transformation involved the same rotation of 2 about axis 13 up to a critical angle $\theta_c$, after which all three points 123 can translate in straight lines to $1'2'3'$. In the same fashion, the minimal transformation for the octagonal transformation in fig 2.15b involves a rotation of point 3 out of the plane about axis 24 to a critical angle $\theta_c$ at which the point is located at position $3''$. Once this critical angle is reached, point 3 translates in a straight line from $3''$ to $3'$. Because points 1 and 5 are stationary to satisfy corner conditions, points 2 and 4 must move in great circles about points 1 and 5.

However, points 2 and 4 cannot finish the transformation by moving on great circles. At the configuration $1'2''3'4''5'$ in figure 2.15b, point 3 has finished the transformation, but points 2 and 4 have not. To satisfy corner conditions at the points $2''$ and $4''$, the great circles must be out of plane as well. At points $2''$ and $4''$, the transformation finishes with rotations about axes $1'3'$ and $3'5'$. The total distance is $D \approx 7.93$. Of course, the time reverse of this transformation (equivalent to swapping primed and unprimed labels) is also a minimal transformation, as is the transformation obtained by reflection about the $z = 0$ plane.

Now consider increasing the chain to 6 links, so the combination of $r_i(0)$ and $r_i(T)$ becomes a dodecagon (12-sided polygon, see figures 2.15c-d). As before, the midpoint vertex (here $r_4$) must rotate out of the plane about axis 35 to a critical angle $\theta_c$ before translating in a straight line to $r_{4'}$. This critical angle is where $\overrightarrow{34''} \cdot \overrightarrow{4''4'} = \overrightarrow{54''} \cdot \overrightarrow{4''4'} = 0$. The quadrilaterals $22'3'3$ and $655'6'$ are of the type in figure 2.7, so point 3 must rotate about $r_2(0)$
to a critical angle where $\overrightarrow{23''} \cdot \overrightarrow{3''3'} = 0$, and likewise for point 5. While point 3 rotates to its critical angle, point 4 translates along line $4''4'$. Points $r_1(0)$ and $r_7(0)$ overlap with $r_1(T)$ and $r_7(T)$ and so remain fixed to satisfy corner conditions. After point 3 has reached its critical angle, it can translate along $3''3'$ as point 2 rotates about $r_1$. However, to satisfy corner conditions at point $2''$, the rotation cannot remain in the $xy$ plane. Point $r_{2''}$ is determined as the point where $\hat{t} \cdot n_{\mathrm{plane}} = 0$, where $\hat{t}$ is the tangent to the arc $\overset{\frown}{2'2''}$ defined by rotation about axis $13'$, and $n_{\mathrm{plane}}$ is the normal to the plane $122''$, i.e., $r_{2/1} \times r_{2''/1}$. The same process holds for point 6. These critical points and some intermediate states for the transformation are shown in figure 2.15d. The total distance covered by the transformation is $D \approx 16.3$.

Figure 2.15: Examples of transformations between initial and final states of opposite convexity, for increasing numbers of links. (a) illustrates the transformation for $N = 2$ links. (b) $N = 4$, and the initial and final states form an octagon. (c,d) $N = 6$, and the initial and final states form a dodecagon: (c) top view, (d) view in perspective. Rotations are shown as solid color lines (either green or blue). Translations are shown as dashed lines. The grey dashed lines underneath $3''3'$ in (b) and $4''4'$ in (d) are shown only to illustrate that those lines are above the plane.
It is sensible to consider the total length of chain as fixed, say to $L = 1$, and to let the link length $ds_N$ for the chain of $N$ links be determined by $N\,ds_N = L$. Because distances scale as $ds_N^2$, the $N = 2, 4, 6$ cases have $D_2 \approx 0.555L^2$, $D_4 \approx 0.496L^2$, $D_6 \approx 0.445L^2$. Note that this distance decreases with increasing number of links: the constraints on the motion of the various beads during the transformation are relaxed as the number of links is increased.

We can then imagine resting a piece of string on a table in the shape of a semi-circular arc, and then asking how one can move this string to a facing semicircle of opposite convexity. So long as the string has some non-zero persistence length $\ell_P$, the transformation of minimal distance must involve lifting the string off the table to change its local convexity. The vertical height the string must be lifted (see fig 2.15d) is of order $\sin(\ell_P/L) \approx \ell_P/L$, which goes to zero for an infinitely long chain.

As the number of links $N \rightarrow \infty$, some simplifications emerge. In particular, the contribution to the total distance due to rotations becomes negligible, and the translational component dominates. To see this, note that the distance due to straight-line motion scales as

$$D(\text{st. line}) \sim ds\, N L \sim L^2,$$

while the distance traveled during rotations scales as

$$D(\text{rot.}) \sim ds\, N (\theta_c\, ds) \sim L^2/N,$$

where we assume the worst-case scenario, in which an extensive number of links must rotate before translating. Because translation dominates the distance as $N \rightarrow \infty$, the distance traveled converges to $L$ times the mean root square distance (MRSD), i.e.,

$$D_\infty \rightarrow ds \sum_{i=1}^{N+1} |r_i(T) - r_i(0)| = L\, \frac{1}{N} \sum_i \sqrt{(r_{Bi} - r_{Ai})^2} = L \cdot (\mathrm{MRSD}). \qquad (2.27)$$

The MRSD for the examples in figures 2.15b,d are $0.394L$ and $0.400L$ respectively, which are both less than the actual distances traveled (in units of $L$). In the limit $N \rightarrow \infty$, where the polygon becomes a circle, the distance converges to $D_\infty = 4L^2/\pi^2 \approx 0.4053L^2$.
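The approach of the MRSD to its circular limit can be illustrated with a short calculation. The sketch below assumes, as in figures 2.15b-d, that the initial and final chains are the two halves of a regular $2N$-gon sharing their first and last vertices, so that vertex $k$ moves across the symmetry axis by $2R\sin(k\pi/N)$, where $R$ is the circumradius. It also follows the normalization of equation (2.27), dividing the summed displacements by $N$. The function name is hypothetical.

```python
import math

def polygon_mrsd(n_links, total_length=1.0):
    """MRSD between the two halves of a regular 2N-gon (assumed geometry).

    Each half is a chain of n_links links of length total_length / n_links;
    corresponding vertices are mirror images across the axis through the
    shared end points.
    """
    ds = total_length / n_links
    # Circumradius of a regular 2N-gon with side length ds.
    radius = ds / (2.0 * math.sin(math.pi / (2 * n_links)))
    # Vertex k sits at polar angle k*pi/N from the axis, so its mirror image
    # is a distance 2*R*sin(k*pi/N) away.
    total = sum(2.0 * radius * math.sin(k * math.pi / n_links)
                for k in range(n_links + 1))
    return total / n_links   # normalization by N, as in eq. (2.27)

if __name__ == "__main__":
    for n in (2, 4, 6, 20, 1000):
        print(n, polygon_mrsd(n))
    print("limit 4/pi^2 =", 4.0 / math.pi ** 2)
```

Under these assumptions the octagon ($N = 4$) and dodecagon ($N = 6$) give about 0.394 and 0.4005 in units of $L$, close to the values quoted above, and the sequence increases toward $4/\pi^2 \approx 0.4053$ as $N$ grows.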
For large $N$ systems then, it is a good first approximation to use MRSD for the distance. The MRSD is always less than the root mean square distance (RMSD), except in special cases when they are equal. To see this, we can apply Hölder's inequality

$$\sum_{k=1}^{N} (g_k)^{\alpha} (h_k)^{\beta} \leq \left( \sum_{k=1}^{N} g_k \right)^{\alpha} \left( \sum_{k=1}^{N} h_k \right)^{\beta},$$

where $g_k, h_k \geq 0$, $\alpha, \beta \geq 0$, and $\alpha + \beta = 1$. With the specific identifications $g_k = (r_{Bk} - r_{Ak})^2 \equiv r_k^2$, $h_k = 1$, and $\alpha = \beta = 1/2$, we have directly

$$\frac{1}{N} \sum_k \sqrt{r_k^2} \leq \sqrt{\frac{1}{N} \sum_k r_k^2}.$$

For example, the RMSD for the circle configuration discussed above is $\sqrt{2}L/\pi \approx 0.4502L$, which is greater than the MRSD.

The fact that the distance converges for large $N$ to MRSD rather than RMSD suggests that RMSD may not be the best metric for determining similarity between molecular structures, although it is ubiquitously used. This fact warrants future investigation: it has implications in research areas from structural alignment based pharmacophore identification [49, 75, 103] to protein structure and function prediction [6, 47].

MRSD has a simple intuitive physical meaning: the MRSD between two structures gives the average distance each residue in one structure would have to travel on a straight line to get to its counterpart in the other structure (fig 2.16).

Figure 2.16: The MRSD is the average length of the black line segments between corresponding residues of the initial and final configuration. Image adopted from [91].

Figure 2.17: The MRSD and RMSD between the two curves are close to zero (the curves in this figure are displaced for better viewing but should be imagined to be superimposed). However, because the curve cannot pass through itself, in order to undergo the transformation, one leg must undergo relatively large amplitude motions to travel from one conformation to another. This results in a non-zero distance between the conformations by accurate metrics that can account for non-crossing. Image adopted from [91].
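The inequality between MRSD and RMSD can be illustrated on the facing-semicircle example. The following is a minimal sketch with hypothetical function names: it computes both quantities for two point sets that are already in correspondence, without performing the structural alignment (minimization over translations and rotations) that would be needed for general structures. The semicircle sampling is an assumption made for the illustration.

```python
import math

def mrsd(a, b):
    """Mean root square distance: average straight-line distance between
    corresponding points of two equally long coordinate lists."""
    return sum(math.dist(p, q) for p, q in zip(a, b)) / len(a)

def rmsd(a, b):
    """Root mean square distance between corresponding points."""
    return math.sqrt(sum(math.dist(p, q) ** 2 for p, q in zip(a, b)) / len(a))

def facing_semicircles(n_points, length=1.0):
    """Two semicircular arcs of arc length `length`, mirror images across
    their common diameter, sampled at n_points evenly spaced midpoints."""
    radius = length / math.pi
    upper, lower = [], []
    for k in range(n_points):
        phi = (k + 0.5) * math.pi / n_points
        upper.append((radius * math.cos(phi), radius * math.sin(phi)))
        lower.append((radius * math.cos(phi), -radius * math.sin(phi)))
    return upper, lower

if __name__ == "__main__":
    upper, lower = facing_semicircles(2000)
    print("MRSD:", mrsd(upper, lower), "expected about", 4.0 / math.pi ** 2)
    print("RMSD:", rmsd(upper, lower), "expected about", math.sqrt(2.0) / math.pi)
```

For this sampling the two values approach $4L/\pi^2$ and $\sqrt{2}L/\pi$ respectively, and the MRSD stays below the RMSD as the Hölder argument requires.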
The interpretation of MRSD as an average straight-line travel distance points to a shortcoming of both MRSD and RMSD: neither accounts for chain non-crossing constraints. Consider the two curves depicted in fig 2.17, which differ by having opposite sense of underpass/overpass. When both curves are aligned by minimizing MRSD or RMSD, the respective values are almost zero. However, the physically relevant distance for one conformation to transform to the other is much larger, and must involve one arm of the backbone circumventing the other as it moves between conformations.

It was shown in [109] that chains with persistence length characterized by some radius of curvature $R$ have extensive corrections to the MRSD-derived minimal distance, which do not vanish as $N \rightarrow \infty$, but remain so long as $R/L$ is nonzero. Likewise, chains that cannot cross themselves have non-local EL equations and extensive corrections to the minimal distance. Nevertheless, it is worthwhile to investigate some more complex polymers with MRSD as an approximate distance metric. We pursue this in the next section.

2.5.1 MRSD as a metric for protein folding

Here we examine the use of MRSD as a metric or order parameter for protein folding. To this end we adopt an unfrustrated C$_\alpha$ model of segment 84-140 of src tyrosine-protein kinase (src-SH3), by applying a Gō-like Hamiltonian [23, 123, 129] to an off-lattice coarse-grained representation of the src-SH3 native structure (PDB: 1FMK). Amino acids are represented as single beads centered at their C$_\alpha$ positions. The Gō-like energy of a protein configuration $\Gamma$ is given by the following Hamiltonian, which we will explain term by term:

$$H(\Gamma|\Gamma_N) = \sum_{\text{bonds}} k_r (r - r_N)^2 + \sum_{\text{triples}} k_\theta (\theta - \theta_N)^2 + \sum_{n=1,3} k_\phi^{(n)} \sum_{\text{quads}} \left[1 - \cos\left(n(\phi - \phi_N)\right)\right] + \epsilon_N \sum_{j \geq i+3} \left[ 6\left(\frac{\sigma_{ij}}{r_{ij}}\right)^{10} - 5\left(\frac{\sigma_{ij}}{r_{ij}}\right)^{12} \right] + \epsilon_{NN} \sum_{j \geq i+3} \left(\frac{\sigma_{ij}}{r_{ij}}\right)^{12}. \qquad (2.28)$$

Adjacent beads are strung together into a polymer through harmonic bond interactions that preserve native bond distances between consecutive C$_\alpha$ residues.
Here $r$ and $r_N$ represent the distances between two subsequent residues in configuration $\Gamma$ and in the native state $\Gamma_N$. As with other parameters in the Hamiltonian, the distances $r_N$ are based on the PDB structure and may vary between pairs. The angles $\theta_N$ represent the angles formed by three subsequent C$_\alpha$ residues in the PDB structure, and the angles $\phi_N$ represent the dihedral angles defined by four subsequent residues. The dihedral potential consists of a sum of two terms, one with period $2\pi$ and another with period $2\pi/3$, which give cis and trans conformations for angles between successive planes of three amino acids, with a global dihedral potential minimum at $\phi_N \in [-\pi, \pi]$.

The parameters $k_r$, $k_\theta$, and $k_\phi$ are taken to accurately describe the energetics of the protein backbone: we used the values $k_r = 50$ kcal/mol, $k_\theta = 20$ kcal/mol, $k_\phi^{(1)} = 1$ kcal/mol and $k_\phi^{(3)} = 0.5$ kcal/mol for molecular dynamics (MD) simulations using the AMBER software package [104]. For MD simulations using LAMMPS [107], we used slightly different values: $k_r = 80$ kcal/mol, $k_\theta = 16$ kcal/mol, $k_\phi^{(1)} = 0.8$ kcal/mol and $k_\phi^{(3)} = 0.4$ kcal/mol.

The last line in equation (2.28) deals with non-local interactions, both native and non-native. If two amino acids are separated by 3 or more along the chain ($|i - j| \geq 3$), and have one or more pairs of heavy atoms within a cut-off distance of $r_c = 4.8$ Å in the PDB structure, the amino acids are said to have a native contact. Then the respective coarse-grained C$_\alpha$ residues are given a Lennard-Jones-like 10-12 potential of depth $\epsilon_N = -0.6$ kcal/mol ($-0.8$ kcal/mol for LAMMPS simulations) and a position of the potential minimum equal to the distance of the C$_\alpha$ atoms in the PDB structure. That is, $\sigma_{ij}$ is taken equal to the native distance between C$_\alpha$ residues $i$ and $j$ if $i$-$j$ have a native contact.

If two amino acids are not in contact, their respective C$_\alpha$ residues sterically repel each other ($\epsilon_{NN} = +0.6$ kcal/mol).
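The two non-local terms can be written as small functions to make the shape of the contact potential concrete. This is a hedged sketch and not the simulation code used for the results: it assumes the sign convention in which the native well depth $\epsilon_N$ is negative and multiplies $6(\sigma/r)^{10} - 5(\sigma/r)^{12}$, so that the energy at $r = \sigma$ equals $\epsilon_N$. The function names, default values and the sample native distance of 6 Å are illustrative assumptions.

```python
def native_contact_energy(r, sigma, eps_native=-0.6):
    """10-12 Lennard-Jones-like energy (kcal/mol) for a native pair.

    Assumed convention: eps_native < 0, so the minimum sits at r = sigma
    with energy eps_native, and the r^-12 term is repulsive at short range.
    """
    ratio = sigma / r
    return eps_native * (6.0 * ratio ** 10 - 5.0 * ratio ** 12)

def nonnative_repulsion(r, sigma=4.0, eps_nonnative=0.6):
    """Purely repulsive r^-12 energy (kcal/mol) for a non-native pair,
    with sigma = 4 Angstroms as the default."""
    return eps_nonnative * (sigma / r) ** 12

if __name__ == "__main__":
    sigma = 6.0   # hypothetical native C-alpha distance in Angstroms
    for r in (5.0, 5.7, 6.0, 6.3, 8.0):
        print(r, native_contact_energy(r, sigma), nonnative_repulsion(r))
```

With this convention the native term has its minimum of $-0.6$ kcal/mol at $r = \sigma_{ij}$ and turns repulsive at short distances, while the non-native term decays monotonically with separation.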
Thus $\epsilon_{NN} = 0$ if $i$-$j$ is a native residue pair, while $\epsilon_N = 0$ if $i$-$j$ is a non-native pair. For non-native residue pairs, $\sigma_{ij} = 4$ Å.

In an arbitrary configuration $\Gamma$, two C$_\alpha$ residues $i$ and $j$ are considered to have formed a native contact if they have a distance $r_{ij} \leq 1.2\,\sigma_{ij}$. The results of MD simulations do not strongly depend on the specific value of this cutoff. The fraction of native contacts present in the particular configuration is then defined as $Q_\Gamma$ (or $Q$).

The MRSD of configuration $\Gamma$ is found by aligning this configuration to the native structure, by minimizing MRSD over 3 translational and 3 rotational degrees of freedom.

Constant temperature molecular dynamics simulations were run for this system using both the AMBER and LAMMPS simulation packages, by other members of the Plotkin research group. The version of LAMMPS that was used for our simulation suffered from a bug, wherein different chiralities of dihedral angles were not energetically distinguished. This bug has been fixed in later versions of LAMMPS. Thus, these results show heuristically one of the arguable shortcomings of the order parameter $Q$, namely the failure to distinguish between two mirror configurations.

The probability for the system to have given values of $Q$ and MRSD within $(Q, Q + \Delta Q)$ and $(\mathrm{MRSD}, \mathrm{MRSD} + \Delta\mathrm{MRSD})$ is proportional to $\exp[-F(Q, \mathrm{MRSD})/k_B T]$, where $F(Q, \mathrm{MRSD})$ is the free energy. Thus the free energy can be directly obtained by sampling, binning, and taking the logarithm:

$$F(Q_1, \mathrm{MRSD}_1) - F(Q_2, \mathrm{MRSD}_2) = -k_B T \log \frac{p(Q_1, \mathrm{MRSD}_1)}{p(Q_2, \mathrm{MRSD}_2)} \qquad (2.29)$$

with $F(1, 0) = E_N$, the energy of the native structure.

Figure 2.18 shows the free energy surfaces obtained using the above recipe, for the AMBER (fig 2.18a) and LAMMPS (fig 2.18b) molecular dynamics routines. The temperature is taken to be the transition or folding temperature $T_F$, where the unfolded and folded free energies are equal.

Notice that $F(Q)$ is comparable for both, as it should be, as is $F(\mathrm{MRSD})$.
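The sampling recipe of equation (2.29) can be sketched in a few lines. The code below is an illustration under stated assumptions, not the analysis pipeline behind figure 2.18: it takes a list of $(Q, \mathrm{MRSD})$ pairs from a trajectory, bins them on a regular grid, and reports free energies relative to the most populated bin rather than relative to the native-state reference $F(1, 0) = E_N$. The function name, bin widths and sample data are hypothetical.

```python
import math
from collections import Counter

def free_energy_surface(samples, q_width, mrsd_width, kT=1.0):
    """Relative free energy per occupied (Q, MRSD) bin.

    samples: iterable of (Q, MRSD) pairs.
    Returns a dict mapping integer bin indices (iq, im) to
    F = -kT * log(n_bin / n_max), so the most populated bin has F = 0.
    Empty bins are simply absent from the result.
    """
    counts = Counter((math.floor(q / q_width), math.floor(m / mrsd_width))
                     for q, m in samples)
    n_max = max(counts.values())
    return {bin_index: -kT * math.log(n / n_max)
            for bin_index, n in counts.items()}

if __name__ == "__main__":
    # Hypothetical toy data: four samples near a folded-like bin, one elsewhere.
    samples = [(0.95, 1.2)] * 4 + [(0.25, 20.5)]
    surface = free_energy_surface(samples, q_width=0.1, mrsd_width=1.0)
    for bin_index, f in sorted(surface.items()):
        print(bin_index, f)
```

Because only differences of $F$ enter equation (2.29), the choice of reference bin does not affect the shape of the surface; a bin visited four times less often than the reference sits $k_B T \log 4$ higher.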
However, the free energy surface plotted as a function of both Q and MRSD shows a marked difference. In addition to a native minimum, the LAMMPS routine has an additional minimum at Q ≈ 0.95 and MRSD ≈ 8.4. The conformational states in this bin are closely related, with an average MRSD between them of 1.8 Å. We can take the most representative state in this bin as that which has a minimum MRSD from all the others in the bin (at Q ≈ 0.95, MRSD ≈ 8.4):

\min_i \left( \sum'_{j \neq i} \mathrm{MRSD}_{ij} \Big/ \sum'_{j \neq i} 1 \right) \approx 1.6\ \text{Å}.

Inspection reveals that this state is a mirror image of the PDB structure (see fig 2.18b): if we reflect this structure about one plane, and subsequently align this reflected structure to the PDB one, the MRSD is only 1.1 Å.

The discrepancy in free energy surfaces corresponding to the presence of a low energy mirror-image structure arises because the COMPASS class 2 dihedral potentials in the LAMMPS algorithm did not ascribe a sign to the angle φ, so the full range [−π, π] is projected onto [0, π]. This gives the set of actual dihedral angles {φi + π} the same energy as the set {φi}, so that the dihedral potentials have two minima rather than one, and thus a protein chain of the opposite chirality (a mirror image) is allowed and has the same energy as the PDB structure. We found that the CHARMM and harmonic dihedral styles do not have this problem; however, they have less versatile functional forms, so we favored modifying the COMPASS dihedrals to define φ over its full range.

2.6 Conclusions

Analogously to the distance between two points, the distance between two finite length space curves is defined using a variational problem, and may be calculated by minimizing a functional of 2 independent variables s and t, where s is the arc-length along the chain, and t is the 'elapsed time' during the transformation. We derived the Euler-Lagrange (EL) equation giving the solution to this problem, which is a vector partial differential equation, with extremal solution r(s, t).
We also derived sufficient conditions for the extremal solution to be a minimum, through the Jacobi equation. Once the minimal transformation r(s, t) is known, the distance D ≡ D[r] follows.

We provided a general recipe for the solution to the EL equation, using the method of lines. The resulting N + 1 EL equations for the discretized chain are ODEs that can be interpreted geometrically and solved for minimal solutions. Solutions consist generally of rotations and translations pieced together so the direction of velocity of any link end point does not suddenly change (the Weierstrass-Erdmann corner conditions).

Figure 2.18: Free energy surfaces for the folding of Go-model src-SH3 using two molecular dynamics simulation packages, AMBER (a) and LAMMPS (b). The contour plots give F(Q, MRSD). The projections F(Q) and F(MRSD) are also shown on each side. The COMPASS class 2 dihedral potential in LAMMPS allows for a mirror image of the folded structure (red color structure in inset) that is not immediately evident from the F(Q) or F(MRSD) surfaces. Future implementations of LAMMPS using COMPASS dihedrals for biomolecular simulations have corrected for dihedral angles defined on the interval [−π, π].

We explored the minimal transformations for the simplest polymers, consisting of 1 or 2 links, in depth. For transformations between 2 links, convexity becomes an issue (the analog to the direction of the radius of curvature for a continuous string).
For example, even if the initial and final states lie in the same plane, if the convexities of these states are of opposite sign the transformation must pass through intermediate states that are out of the plane. Similarly, given a semicircular piece of string lying on a table, to move it to a semicircle of opposite convexity using the minimal amount of motion, the string must be lifted off the table.

In the limit of a large number of links, some simplifications emerge. For chains without curvature or non-crossing constraints, the distance converges to L times the mean root square distance (MRSD) of the initial and final conformations. So for example, the distance between two strings of length L forming the top and bottom halves of a circle respectively is 4L²/π², the distance between horizontal and vertical straight lines of length L which touch at one end is L²/√2, and the distance to fold a straight line upon itself (to form a hairpin) is L²/4.

The fact that for large N the distance (over L) converges to MRSD rather than RMSD suggests that RMSD may not be the best metric for determining similarity between molecular structures, although it is ubiquitously used. Adopting MRSD may lead to improvements in structural alignment algorithms.

The MRSD was investigated as an approximate metric for protein folding. Free energy surfaces for folding were constructed for two simulation packages, AMBER and LAMMPS. It was found that including MRSD as an order parameter uncovered discrepancies between the two molecular dynamics algorithms. Because dihedral angles in LAMMPS (at least in COMPASS class 2 style) are only defined on [0, π], the potential admits a mirror image structure degenerate in energy with the native structure. This is easily remedied and should not be interpreted as a deficiency in the LAMMPS simulation package, so long as one is aware of it.
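The mirror-image degeneracy can be made concrete with a short sketch. It computes the signed dihedral angle of four points with the standard atan2 construction, reflects the points through a plane, and evaluates a two-term dihedral energy with and without the sign of the angle. The energy function is an assumed form with periods 2π and 2π/3, the coordinates are arbitrary, and none of the names below come from LAMMPS or AMBER.

```python
import math


def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def signed_dihedral(p0, p1, p2, p3):
    """Signed dihedral angle in (-pi, pi] defined by four consecutive points."""
    b1, b2, b3 = sub(p1, p0), sub(p2, p1), sub(p3, p2)
    n1, n2 = cross(b1, b2), cross(b2, b3)
    y = math.sqrt(dot(b2, b2)) * dot(b1, n2)
    x = dot(n1, n2)
    return math.atan2(y, x)


def dihedral_energy(phi, phi_native, k1=1.0, k3=0.5):
    """Assumed two-term form with periods 2*pi and 2*pi/3 (kcal/mol)."""
    d = phi - phi_native
    return k1 * (1.0 - math.cos(d)) + k3 * (1.0 - math.cos(3.0 * d))


# Four arbitrary points standing in for consecutive C-alpha positions.
points = [(1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.5, 1.5, 1.0)]
mirrored = [(x, y, -z) for x, y, z in points]  # reflect through the xy-plane

phi_native = signed_dihedral(*points)
phi_mirror = signed_dihedral(*mirrored)

# With signed angles the mirror image is penalized.
e_signed = dihedral_energy(phi_mirror, phi_native)
# If the sign is discarded, the mirror image becomes degenerate with the native state.
e_unsigned = dihedral_energy(abs(phi_mirror), abs(phi_native))
```

Reflection reverses the sign of the dihedral angle while leaving its magnitude unchanged, so an energy function that sees only the magnitude cannot separate the two chiralities.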
It should be mentioned that the mirror-image structure would also have been seen, had RMSD been used as an additional order parameter.

In subsequent chapters, we will focus on applications of the above concepts in protein folding and structural alignment. In particular, in chapter 3, we will apply the principles developed here to find minimal folding pathways for protein fragments and briefly touch on the idea of non-crossing constraints. The use case in structural alignment will be discussed in chapter 4. In chapter 5, we give a systematic treatment of non-crossing and in chapter 6 we investigate whether the distance D can be a predictor of folding kinetics.

Chapter 3

Minimal folding pathways for coarse-grained biopolymer fragments

In this chapter we apply the concept of generalized distance, introduced before, to find minimal folding pathways for several candidate protein fragments, including the α-helix, the β-hairpin, and a non-planar structure where chain non-crossing is important. Comparing the distances traveled with root mean-squared distance (RMSD) and mean root-squared distance (MRSD), we show that chain non-crossing can have large effects on the kinetic proximity of apparently similar conformations. Furthermore we see that structures that are aligned to the β-hairpin by minimizing MRSD show globally different orientation than structures aligned by minimizing RMSD.

3.1 Introduction

In section 1.3, we reviewed two of the most common order parameters used in protein folding. While the utility of simple order parameters is indisputable, it is easy to see that even for simple structures they can lead to inaccurate measurements of native proximity. For example, a β-hairpin that is only slightly expanded beyond the range of its hydrogen bonds is essentially committed to fold, but would have a Q value near zero. See figure 1.4.
Comparing two conformations of a piece of polymer chain that crosses either over itself or under itself would give an RMSD that could be quite small. The amount of motion the polymer would have to undergo to transform from one conformation to the other, however, respecting the non-crossing constraint, would have to be comparatively large. See figure 2.17. Here we propose D as an order parameter to capture the complexities of biomolecular folding. This distance depends only on the geometry of the initial and final configurations.

The minimal distance transformation between an initial polymer conformation A and the folded or native conformation N can be thought of as an optimal folding pathway that is the most direct route from A to N. Of course, the actual trajectory is a stochastic one. It is interesting to ask whether the typical or average dynamical trajectory resembles the minimal one after suitable averaging, but we do not answer this question here. Interaction energies in the system will certainly modify the weights of reactive trajectories, making some trajectories preferred over others. On the other hand, much of the folding mechanism is thought to be insensitive to specific sequence details [86], and depends more on the geometry of the native structure and its resultant topology of interactions [5].

A direct application of the minimal folding path to a full protein is an important future goal. We will address approximations to this problem in chapter 5. In this chapter, we take a more bottom-up, modular approach, and apply the minimal distance transformation to various representative protein fragments and construct exact solutions. In particular, we investigate the minimal folding pathways for a β-sheet, an α-helix, and an overpass-underpass problem, where chain non-crossing is important.

3.2 Methods

We refer to the transformation between structures A and N that minimizes the distance functional in Eq.
2.5 as the minimal transformation or optimal folding pathway. Solving the equations of motion for the discretized version gives solutions for straight line motions of the beads, preceded or followed by local intensive rotations as we saw earlier.

3.2.1 Representative protein fragments

As an example protein domain to which we apply our methods, we choose residues 99–153 in regulatory chain B of Aspartate Carbamoyltransferase [48] (PDB code 1AT1, see Fig. 3.1). From this domain, we select three fragments for investigation, as representatives of some commonly found secondary and tertiary structures:

• The β-hairpin containing β-strands 2 and 3, residues 126–137.
• The C-terminal α-helix, residues 147–151.
• The β-strand 1-turn-strand 2 tertiary motif, residues 101–130.

Figure 3.1: Residues 99–153 in regulatory chain B of Aspartate Carbamoyltransferase [48] (PDB code 1AT1) are chosen for analysis. From this domain we select three fragments for investigation. Two are outlined in dashed boxes: β-hairpin residues 126–137, and α-helix residues 147–151. The strand 1-turn-strand 2 tertiary motif, residues 101–130, is also used to investigate the importance of non-crossing.

We investigate an overpass/underpass problem for a simplified version of segment 3 for which chain non-crossing is important. The polymer fragments are coarse-grained by taking the Cα atom to represent each residue. The Cα–Cα distances in our fragments are sharply peaked: |r_{i+1/i}| = (3.81 ± 0.04) Å. We do not change the numbers present in the PDB structure: they are held fixed during the transformation. We investigate the minimal distance transformations between extended states of polymer and the above secondary structures. Extended states are constructed as follows. For the β-hairpin, we rotate the chain about the positions of Cα(132) and Cα(133) so that the initial state is an extended linear strand (Fig. 3.2 b). For the α-helix, we take the simplified case of a straight line for the initial condition.
For the over/under problem we imagine a scenario where the β-sheet in Fig. 3.3a is unformed, and the polymer chain involved in the turn has crossed under rather than over β-strand 2. The two configurations have the opposite sense, in that the chain must cross over itself (or go over the top or the bottom of the structure) to form the correct tertiary structure (Fig. 3.3b). Alternatively, β-strands 2 and 3 in Fig. 3.1 may cross over β-strand 1 to solve the underpass-overpass problem, but this would involve larger-scale motion, that is, a larger distance traveled.

Figure 3.2: (a) β-hairpin fragment, with all-atom and coarse-grained Cα representations superposed. (b) The extended initial state.

Figure 3.3: (a) Residues 101–130 of Aspartate Carbamoyltransferase can be taken as an example of an overpass/underpass problem where chain non-crossing is important. (b) Conformation of the segment in panel a with the β-sheet unformed. Both initial and final structures (with opposite over/under sense) are superposed in this stereo view. (c) A simplified model to capture the essence of the underpass-overpass problem. Both initial and final states are shown as viewed from above. Residues 1–8 must transform to residues 1′–8′, but cannot pass through the obstacle marked with a circled X, representing a long piece of polymer normal to the plane of the figure.

Figure 3.4: Illustration of the general recipe for obtaining minimal pathways.

A stereo view of initial and final states for such a scenario is shown in Fig. 3.3 b. We ask: What is the minimal distance pathway for conversion between these two structures? To make the problem more amenable to analysis, we simplify the structures in the spirit of lattice models, as shown in Fig. 3.3 c. The initial and final conditions are regular and symmetric, but intermediate configurations can be anywhere so long as they are consistent with the constraints of constant link length and non-crossing (i.e., they can be off-lattice).
3.2.2 Construction of minimal pathways

Minimal folding trajectories are constructed by the recipe described in chapter 2 (Fig. 3.4). The basic recipe is as follows. First we take the coordinate of one Cα residue, say r(Cαi) in the unfolded conformation, then we imagine rotating r(Cαi) about r(Cα(i−1)). The protein backbone is treated approximately as a freely jointed chain to carry out this procedure. All possible rotations of Cαi about Cα(i−1) form a sphere of radius |r(Cαi) − r(Cα(i−1))|. A cone is drawn from the final position of Cαi, i.e., rFOLDED(Cαi) in the folded structure, to be tangent to this sphere. In general, one particular direction will have the minimal amount of rotation before proceeding in a straight line to rFOLDED(Cαi). The arc of the great circle along this direction is then chosen as part of the minimal trajectory for residue i.

3.2.3 RMSD and MRSD

To review, in the limit of long polymer chains and in the absence of non-crossing, the distance accumulated by rotation of each link before translating gives a negligible contribution to the total distance, and the total distance traveled converges to the chain length L times the mean root-square distance (MRSD), i.e., for two structures A and B,

\lim_{N \to \infty} D = L \, \frac{1}{N} \sum_i \sqrt{(\mathbf{r}_{Bi} - \mathbf{r}_{Ai})^2} = L \cdot \mathrm{MRSD}. \qquad (3.1)

As we saw, the MRSD is never greater than the RMSD often used for structural comparison. Which of these quantities provides more accuracy for structural alignment is still an open question, although the MRSD may be less sensitive to large fluctuations of a subset of points.
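Eq. 3.1 and the ordering of MRSD and RMSD can be checked numerically with a minimal sketch. The helper names and the discretization are hypothetical: a straight chain of length L is compared, without any alignment step, to the same chain folded back on itself at its midpoint, for which the large-N distance quoted in chapter 2 is L²/4.

```python
import math


def mrsd(a, b):
    """Mean root-squared distance: average of the per-residue displacements."""
    return sum(math.dist(p, q) for p, q in zip(a, b)) / len(a)


def rmsd(a, b):
    """Root mean-squared distance between corresponding residues."""
    return math.sqrt(sum(math.dist(p, q) ** 2 for p, q in zip(a, b)) / len(a))


def straight_line(n, length):
    """n beads evenly spaced on the x-axis from 0 to length."""
    return [(length * i / (n - 1), 0.0, 0.0) for i in range(n)]


def folded_line(n, length):
    """The same chain folded at its midpoint so the second half retraces the first."""
    beads = []
    for i in range(n):
        s = length * i / (n - 1)
        beads.append((s if s <= length / 2 else length - s, 0.0, 0.0))
    return beads


L, N = 1.0, 2001
a, b = straight_line(N, L), folded_line(N, L)
distance_estimate = L * mrsd(a, b)  # large-N estimate of D from Eq. 3.1
```

For this pair the estimate approaches L²/4 as N grows, and the MRSD is smaller than the RMSD, because the displaced half of the chain dominates the squared average.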
To investigate the sensitivity of MRSD versus RMSD to perturbations in a residue's position, note that the change in RMSD with respect to moving one residue an amount δr_Ai is

\frac{\delta(\mathrm{RMSD})}{\delta r_{Ai}} \approx \frac{1}{N} \, \frac{|\mathbf{r}_{Ai} - \mathbf{r}_{Bi}|}{\mathrm{RMSD}},

while the change in MRSD with respect to moving one residue an amount δr_Ai is

\frac{\delta(\mathrm{MRSD})}{\delta r_{Ai}} \approx \frac{1}{N}.

So if residue i has a structural discrepancy larger than the average as measured by RMSD, changes in RMSD with respect to this residue's position will be larger than those for MRSD. Unfolded conformations were aligned to folded structures by minimizing MRSD and RMSD, and minimal transformations constructed for these conformation pairs. For the β-hairpin, the conformation pairs were observed to be globally different depending on whether the alignment cost function was MRSD or RMSD.

3.3 Results

3.3.1 β-hairpin

We coarse-grain the fragment containing residues 126–137 by considering only the Cα atoms (see Fig. 3.2 a). We consider folding to this structure from an extended state. The extended state is obtained by two rotations about residues 132 and 133, which extend the hairpin out to a quasilinear strand (the extended state in Fig. 3.2 b). This initial extended state is aligned to the final structure in four different ways:

1) One strand of the hairpin is directly aligned to the corresponding residues of the extended state (Fig. 3.5, a and b),
2) The center links of the hairpin and extended state are directly aligned to each other (Fig. 3.5 c),
3) The initial position/orientation of the extended state is found by minimizing the MRSD between the two coarse-grained Cα structures (hairpin and extended state) in Fig. 3.2, a and b (Fig. 3.5 d, blue extended strand), and
4) The initial position/orientation of the extended state is found by minimizing the RMSD between the two coarse-grained Cα structures (hairpin and extended state) in Fig. 3.2, a and b (Fig. 3.5 d, teal extended strand).
From these initial states, we have found minimal folding trajectories consisting of rotations and subsequent translations of the residues (or vice versa) as described in section 3.2.

To gain intuition for the transformations from the MRSD and RMSD aligned structures, we also considered minimal transformations from an idealized straight-line structure to an idealized β-hairpin, whose initial and final states are shown in Fig. 3.5 e. The distances for all the β-hairpin transformations, along with numbers for the RMSD and MRSD for the same transformations, are given in Table 3.1.

The resulting transformations for the above boundary conditions are shown in Fig. 3.5, a–c, and f–i. As described in section 3.2, the minimal folding pathways proceed by forming kinks or soliton-like waves that propagate along the backbone. The soliton-like object consists of a rotation of a bead until the link containing that bead reaches a critical angle. The bead subsequently translates until it reaches its final position.

For the idealized straight-line to β-hairpin transformation, the MRSD and RMSD aligned structures are globally different (Fig. 3.5 e). The MRSD between the two aligned straight-line structures is 15.39 Å, larger than the MRSD of either structure to the folded hairpin state (Table 3.1). The transformation from the RMSD-aligned line involves predominantly straight-line motion from the line to the hairpin (Fig. 3.5 f). Only 0.1% of the distance corresponds to rotational motion. The transformation from the MRSD-aligned line involves both rotations and translations, as shown in Fig. 3.5 g. This gives the MRSD-aligned pair a distance only marginally smaller (0.4%) than the RMSD-aligned pair (Table 3.1), even though the transformations have different initial states and very different character.

Figure 3.5: Minimal transformations to the β-hairpin. Distances are listed in Table 3.1. (a) Folding pathway in which one strand of the hairpin can be thought of as peeling away by rotations of the links to various critical angles, which are then followed by subsequent translations into their final positions. (b) A minimal pathway that can be thought of as involving kink propagation or peeling away from the extended strand, followed by translation of the links into their final positions in the β-hairpin. (c) A zippering mechanism, in which we have aligned the middle link of the hairpin and sought the minimal distance transformation. The distance here is somewhat larger than the distance for the transformations in panels a and b. (d) The extended strand is aligned to the β-hairpin by minimizing RMSD (blue), or minimizing MRSD (teal). (e) Idealized version of the extended strand and β-hairpin. The extended strand is again aligned to the β-hairpin by minimizing RMSD (blue), or minimizing MRSD (teal). (f) Transformation for the idealized β-hairpin, for RMSD-aligned structures. Initial state is blue, final state is red, and intermediate state is in green. (g) Transformation for the idealized β-hairpin, for MRSD-aligned structures. (h) RMSD-aligned transformation between the extended strand (blue) and β-hairpin (red). An intermediate state is shown in green. (i) MRSD-aligned transformation between the extended strand (blue) and β-hairpin (red). An intermediate state is shown in green.

Table 3.1: Values of the distance for various protein backbone fragments, as compared to other metrics

Backbone conformation             Figure    D/(Nℓ)    RMSD      MRSD
β-Hairpin (half-aligned)          3.5 a     10.372    15.538     9.926
β-Hairpin (half-aligned)          3.5 b     10.372    15.538     9.926
β-Hairpin (zipper)                3.5 c     12.787    13.560    11.317
β-Hairpin (RMSD-aligned)          3.5 h      9.749    10.501     9.730
β-Hairpin (MRSD-aligned)          3.5 i     10.277    12.681     9.412
Ideal β-hairpin (RMSD-aligned)    3.5 f†    12.25     13.24     12.24
Ideal β-hairpin (MRSD-aligned)    3.5 g†    12.18     16.31     11.27
α-Helix (MRSD aligned)            3.6 b      3.595     3.954     3.577
α-Helix (1-link aligned)          3.6 c      4.675     5.805     4.233
Over/under (non-crossing)         3.7†      13.991     6.173     5.239

Distance D is divided by N times the link length ℓ, so that all quantities in the table have units of Å.
† D is put in the same units as the above transformations, i.e., we take ℓ = 3.81 Å for the link length.

For the real β-hairpin and extended state, the transformations are reminiscent of the ideal case. The MRSD and RMSD aligned structures are globally different, as shown in Fig. 3.5 d. The MRSD between the two aligned extended structures is 9.83 Å, which is again larger than the MRSD of either structure to the folded hairpin state (Table 3.1). The MRSD-aligned pair has a distance 17% different than the ideal case and the RMSD-aligned pair has a distance 23% different than the ideal case. Fig. 3.5, h and i, depict the transformations for RMSD- and MRSD-aligned pairs, respectively. For the real β-hairpin, the RMSD-aligned extended state has a smaller distance than the MRSD-aligned extended state by 5%, i.e., the scenario present in the idealized case is reversed, somewhat surprisingly. This indicates that the aligned structures obtained by minimizing the actual distance need not resemble those structures obtained by either the RMSD or MRSD alignments. An alignment algorithm for general structures using distance D as a cost function is a nontrivial problem that we reserve for future work. However, we will discuss a simple case in chapter 4.
We note that the above transformations will not all have the same energy gain as they fold. The transformations in Fig. 3.5, a and c, are similar in the main to the energetically driven zippering and assembly mechanisms of conformational search proposed by Ozkan et al. [100]. A folding pathway similar to the transformation in Fig. 3.5 b would not have concurrent energy gain and so would be less likely thermodynamically. To implement the transformation shown in Fig. 3.5 c, the construction described in section 3.2 above and shown in the figure is only approximately correct, to about 1%. To find an exact minimal solution involves generalizing the methodology to allow for concurrent rotations of two links about a central axis, as described in more detail in Sections 2.3 and 2.4.

Figure 3.6: (a) Single α-helix of five residues 147–151 taken from PDB 1AT1. (b) Minimal pathway to fold the α-helix (red), from a straight line initial state which has been aligned by minimizing MRSD (shown in blue, see text for description). A conformation partway through the transition is shown in green. (c) Minimal pathway to fold the helix from a straight-line initial conformation with its second link directly aligned to the second link of the helix. Distances for both transformations are given in Table 3.1. We emphasize that this is a hypothetical, idealized transformation that is not realizable for the physical chain.

3.3.2 α-helix

We coarse-grain the helical fragment containing residues 147–151 by considering only the Cα atoms (see Fig. 3.6 a).

We consider folding to this structure from an extended state. The extended state is taken for simplicity to be a straight line. Of course more realistic extended conformations could be taken, but would give minor quantitative corrections to the numbers we obtain. We consider two different initial conditions for the straight line, one where link 2 is exactly aligned with link 2 of the α-helix (Fig.
3.6 c), and one where the straight line is aligned to the helix by minimizing the MRSD. This initial condition is such that the straight line threads the helix (Fig. 3.6 b). The aligned unfolded structure obtained by minimizing RMSD is similar in this case: the MRSD between the two aligned structures is only 1.53 Å. From these initial states, we found minimal folding trajectories consisting of rotations and subsequent translations from the straight-line conformation to the helix.

Fig. 3.6 b shows a minimal folding pathway to the α-helix. An intermediate conformation (partway through the transition) is shown in green. The distance traveled after minimizing MRSD is indeed less than the distance after alignment of one link. For both of these transformations, the distances traveled per residue are less than the corresponding distance per residue for the β-hairpin transformations.

3.3.3 Crossover structure

The fact that the polymer chain cannot cross itself is represented by inequality constraints in the equations of motion. We introduce the methods for solution of variational problems with inequality constraints in Appendix E. The upshot is that the minimal distance problem is a free problem until a residue on the chain touches the obstacle. At that point the residue is constrained to be on the surface of the obstacle and the trajectory is defined accordingly. Eventually the particle or residue leaves the surface, and the problem becomes a free problem once again, as the particle moves to its final position. The transformation is then piecewise, consisting of three pieces, and at the interface between the pieces, the corner conditions (Eq. 6) must hold.

The initial and final conditions of an idealized non-crossing chain are shown in Fig. 3.3 c. In our problem of chain non-crossing, the obstacle is an effectively infinite line, normal to the plane of Fig.
3.3 c (marked by a circled X), so residues only need to touch that point before proceeding to their final position. In this treatment residues are treated asymmetrically, in that one part of the chain has steric hindrance along bonds, while another only has steric hindrance for the masses or beads at the termini of bonds. This approximation is made to simplify the transition, and because the resulting distance differs only by a small finite-size effect from the distance obtained by employing links for all parts of the chain.

We found a solution that fully satisfies the Euler-Lagrange (EL) equations, Eqs. 5a–5c, and whose corner conditions satisfy Eq. 6. According to the analysis in Appendix A, this class of solutions is at least a local minimum. It involves the propagation of a kink starting at the end of the chain, in which the chain proceeds snakelike over the obstacle and then back down to its final position, and so is intuitively reasonable. The distance is given in Table 3.1, along with the RMSD and MRSD. In cases where non-crossing is important, the distance D will be significantly greater than either RMSD or MRSD.

The transformation starts by a rotation of link 7–8 about the point 7, until a critical angle π/2 is reached. Residue 8 subsequently translates to the crossover point O. Immediately as it starts translating, link 6–7 rotates about point 6 (Fig. 3.7 a) and residue 7 rotates to its critical angle of π/2. The process repeats until link 5–6 rotates to an angle of π/6, at which point residue 8 touches the obstacle (Fig. 3.7 b).

At this point, residue 8, which is touching a nondifferentiable (nonsmooth) surface, may violate corner conditions for the reasons discussed in Appendix E. Residue 8 moves horizontally to the left while residue 7 moves vertically, so the end points of the link slide in orthogonal directions (Fig. 3.7 c). After this part of the transformation is complete, the chain is in the configuration shown in Fig. 3.7 d.
At this point, link 4–5 begins to rotate, and this sets up a cascade of motions throughout the chain. Residue 8 slides vertically downward, residue 7 slides horizontally to the left, and residue 6 slides vertically upward (Fig. 3.7 e). Note that residue 8 appears to violate corner conditions in the opposite sense of residue 7. These violations are again due to the influence of the crossover constraint. When link 4–5 has rotated to π/6, link 6–7 is horizontal and link 7–8 is vertical (Fig. 3.7 f). As link 4–5 continues to rotate, residues 7 and 8 proceed vertically downward in Fig. 3.7 g, while residue 6 moves left horizontally, until the conformation in Fig. 3.7 h is reached when link 4–5 has finished its rotation to π/2.

At this point link 3–4 begins to rotate about position 3, moving residue 4 to the non-crossing position O, while the rest of the chain shifts downward vertically (Fig. 3.7, i and j). Finally residue 3 rotates about position 2 while residue 4 translates in a straight line to its final position, and all other residues translate downwards (Fig. 3.7, k and l). This completes the transformation. Note again that the distance in Table 3.1 is much larger than either the RMSD or MRSD. A second transformation is obtained by time-reversing the above solution, and swapping the right and left branches of the structure that serve as initial and final conditions.

Figure 3.7: Various steps in a minimal pathway obeying non-crossing. Two conformations are drawn for each step. By convention, we number residues in the conformation that is leading in the transformation. (See text for a description of the transformation.)

3.4 Discussion and conclusion

In this chapter, we have applied the general theory of distance between one-dimensional objects to find the minimal folding pathways for protein fragments.
We consider this to be a first step in building up ever-larger fragments to eventually look at the distance as an order parameter for the folding of an entire biomolecule. We investigated the minimal folding pathway for an α-helix, a β-hairpin, and a structure involving a crossover where the integrity of the chain is essential in determining the minimal transformation. The non-crossing problem has the largest distance per residue of all conformations considered. Not surprisingly, the α-helix has the shortest. It is an interesting question to address the consequence of the distance from an unfolded structure to a folded structure on its folding rate. We will address this question in chapter 6.

We have made several approximations in our model. In our analysis of minimal distance trajectories, we have not accounted for the steric excluded volume of the side chain and backbone degrees of freedom that have been coarse-grained out. It is possible to account for this in principle by applying the methods described in Appendix E. We take the trajectories derived here as a first approximation to the more fully constrained problem.

Another modification that must be considered is the range of allowed angles between consecutive triples of Cα residues. While sharp kinks in our transformations were the exception rather than the rule, we have assumed in our analysis that the full range of angles is allowable. The coarse-graining procedure does give greater flexibility for the resulting chain because there are six backbone bonds per Cα triple; however, a more thorough analysis would take into account a restricted range of allowable angles.

The construction of an efficient alignment algorithm based on the distance D as a cost function is a goal, and could have important future implications for structure prediction and biomolecular folding dynamics. We explore this question in some detail in chapter 4. For our purposes here we chose the approximate metrics MRSD and RMSD.
For the β-hairpin, the best-aligned MRSD structure was globally different from the best-aligned RMSD structure. The distance from a straight line to an idealized β-hairpin structure was slightly less when the structures were aligned by minimizing MRSD than for RMSD. However, the situation was reversed for the real β-hairpin structure, with the RMSD-aligned structures having a smaller distance by 5%. We will visit this problem again in chapter 4.

The non-crossing transformation raises interesting questions about the validity of structural comparison metrics when polymer non-crossing is important. The RMSD and MRSD were both quite small for the conformations we considered, comparable to the α-helix distances. However, the actual distance for a physically realizable transformation was large, larger than the distances in β-sheet transformations.

The solution we found for the case of non-crossing was extremal and minimal, at least locally. However, there is no guarantee that this is the globally minimal transformation; some preliminary results for small numbers of links indicate there can be shorter pathways in some instances. However, the difference in distances between ground-state and excited-state transformations involves rotations of links and so is nonextensive: in the limit of large numbers of links, the discrepancies go to zero (see section 2.5).

Noncrossing constraints introduce a mechanistic aspect to the folding process. A folding mechanism consists of a specific sequence of events, or pathway. In the context of our problem the chain had to cross over the obstacle before translating to its final position. In practice the chain can go up and over the top or bottom of the obstacle, or cross over it in different places with varying likelihood, so strictly speaking there are many pathways and we have just investigated the minimal distance pathway here.
Nevertheless, such constraints can further restrict the entropic bottleneck [137] governing folding rates. The physics of non-crossing is certainly important for knotted proteins, and the generalized distance may be useful as an order parameter for these proteins, whereas other structural comparison parameters would be flawed. The non-crossing constraints in a knotted protein slow its kinetics [38, 127], and lead to different molecular evolutionary pressures for fast and reliable folding [83, 125, 131].

For a simple stochastic process such as the one-dimensional diffusion of a point particle on a flat potential between two absorbing barriers, the splitting or commitment probability is $p_F = D/D_{TOT}$, where $D_{TOT}$ is the total distance between the two barriers, giving a correlation $\langle D\, p_F \rangle = 1$. The presence of such a correlation between distance and commitment probability for simple examples provides encouragement to investigate whether or not one would find a significant correlation for the more complex problem of protein folding, in particular when the presence of non-crossing constraints for configurational diffusion has been accounted for. In the above discussion, $p_F$ has tacitly been written in terms of D rather than the reverse. This underscores the conceptual importance of geometric order parameters in understanding the progress of a reaction.

In protein folding, an emergent simplicity has been the result that native topology determines the major features of the free energy landscape for a protein, and consequently a protein's folding rate and mechanism [5]. The distance D between disordered or partly disordered protein structures and the native structure may capture the evolution of topology during the folding process more accurately than many other order parameters proposed to characterize the folding kinetics and mechanisms of proteins: a full systematic comparison remains a problem for future research.
Aspects of this problem are discussed in chapters 5 and 6.

Useful order parameters have simple geometric interpretations. Here we have shown that in principle one can compute the distance that would have to be traveled to connect two arbitrary biopolymer structures, a simple geometric quantity that can include non-crossing constraints, as well as properties such as restricted allowable angles or chain stiffness. The problem of finding a minimal distance pathway for a biomolecule is now an algorithmic problem rather than a conceptual one. In the long run, it is feasible that the analysis of other reactions involving large numbers of degrees of freedom might benefit from order parameters similar to the one we studied here, which are capable of accounting for the structural complexities inherent in large molecules.

Chapter 4
Structural alignment using the generalized Euclidean distance between conformations

In the previous chapter we saw that aligning the folded and unfolded conformations using different cost functions (RMSD and MRSD) resulted in different total distance D undertaken in the transformation. In this chapter, we align structures using D itself as a cost function, to obtain globally minimum transformations. The unfolded structures that we consider are idealized straight-line segments with varying number of links, which are then aligned to idealized beta hairpins using D as a cost function. The alignment and resulting distance D are compared with the alignments and distances of RMSD and MRSD. More realistic extended structures that are consistent with the physicochemistry of peptide bonds could be taken. However, important lessons are learned from the idealized cases, which are generally easier to interpret (cf. Figure 3.5). This is a first step toward aligning more complex structures using D as a cost function.
We will also see that there exist approximations involving decimating the backbone chain, which capture many of the properties of a true D alignment. Applying these approximate metrics to align structures such as a full protein is a topic for future research.

It should be noted that our motivation for structural alignment is to find the alignment that results in minimal D. Generally speaking, however, structural alignment is used to establish homology between two or more polymers based on their shape. Therefore, polymers with a much higher degree of similarity than those structural pairs considered here are usually aligned. The metrics used here to align unfolded structures to the corresponding folded one can in principle be used to align homologous structures as well. We reserve this, however, as a topic for future work.

4.1 Introduction

In principle, minimal pathways can be computed for any initial and final configurations, just as RMSD can be computed between any two configurations. However, it is of special significance to anneal the configurations allowing translations and rotations, until the minimal distance transformation is achieved (i.e. the minimum of minimal distance transformations). This is analogous to the usual procedure of using RMSD or MRSD as a cost function between two structures and minimizing with respect to translations and rotations. While the minimization procedure is particularly straightforward for RMSD and involves the inversion of a matrix, the minimization using the distance D as a cost function involves a simplex or conjugate gradient minimization and so is more computationally intensive.

In short, the boundary conformations are allowed to translate and rotate in 3D space. Their position and orientation are modified to produce a pathway with minimal length, as compared to all other minimal pathways that can be obtained by positioning and orienting the same two structures in 3D space.
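The idea of annealing a pair of conformations over translations and rotations until a cost function is minimized can be illustrated with a small sketch. This is not the procedure used in this thesis: it is restricted to planar chains, it uses a simple compass (pattern) search as a stand-in for a simplex or conjugate gradient minimizer, and it uses MRSD and RMSD as cost functions because the full distance D is far more expensive to evaluate. The function names (`mrsd`, `rmsd`, `place`, `align`) and the 4-bead test pair are hypothetical.

```python
import math

def mrsd(a, b):
    """Mean root squared distance: average straight-line distance between paired beads."""
    return sum(math.dist(p, q) for p, q in zip(a, b)) / len(a)

def rmsd(a, b):
    """Root mean squared deviation between paired beads."""
    return math.sqrt(sum(math.dist(p, q) ** 2 for p, q in zip(a, b)) / len(a))

def place(points, pose):
    """Rotate planar points by theta about the origin, then translate by (tx, ty)."""
    theta, tx, ty = pose
    c, s = math.cos(theta), math.sin(theta)
    return [(c * x - s * y + tx, s * x + c * y + ty) for x, y in points]

def align(moving, fixed, cost, step=0.5, tol=1e-6, max_iter=100000):
    """Compass search over (theta, tx, ty).

    Any +/- step on a single parameter that lowers the cost is accepted;
    when no trial improves, the step is halved. Only a local minimum is found.
    """
    pose = [0.0, 0.0, 0.0]
    best = cost(place(moving, pose), fixed)
    for _ in range(max_iter):
        if step <= tol:
            break
        improved = False
        for k in range(3):
            for sign in (1.0, -1.0):
                trial = list(pose)
                trial[k] += sign * step
                value = cost(place(moving, trial), fixed)
                if value < best:
                    pose, best, improved = trial, value, True
        if not improved:
            step *= 0.5
    return tuple(pose), best

# Hypothetical planar test pair with unit link length:
# a straight 4-bead strand and an idealized 4-bead hairpin.
strand = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
hairpin = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]

for name, cost in (("MRSD", mrsd), ("RMSD", rmsd)):
    pose, value = align(strand, hairpin, cost)
    print(name, "before:", round(cost(strand, hairpin), 3), "after:", round(value, 3))
```

Because the search is local, the pose it returns depends on the starting orientation; a more careful treatment would restart from several initial poses and keep the best result.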
4.2 Method and results

For the purpose of generating accurate initial guesses for the minimal distance aligned structure, we introduce the following hierarchy:

\begin{align}
D_0 &= N \, \mathrm{MRSD} \tag{4.1a} \\
D_1 &= \sum_{i=1}^{N-1} D\big(\ell_i^{(A)}, \ell_i^{(B)}\big) \tag{4.1b} \\
D_2 &= \sum_{i=1}^{\mathrm{int}((N-1)/2)} D\big(\{\ell_i^{(A)}\}, \{\ell_i^{(B)}\}\big) + D(\text{end link})_1 \tag{4.1c} \\
&\;\;\vdots \notag \\
D_N &= D . \tag{4.1d}
\end{align}

In this hierarchy, the terms have the following interpretation: $D_0$ is the cumulative distance between the sets of points comprising the residue locations of conformations A and B, $D_1$ is the cumulative distance between the sets of single links, $\ell_i$, comprising configurations A and B, $D_2$ is the cumulative distance between the sets of double links, $\{\ell_i, \ell_{i+1}\}$, comprising configurations A and B plus any single-link remainder if one exists, and so on. That is, at a given level of the hierarchy the polymer chain is divided up into sub-segments whose link-length equals the level index, plus one segment constituting the remainder. When the level index equals N, the chain as a whole is considered, which is the true distance D. This procedure is also illustrated schematically adjacent to each equation above.

We observed that $D_1$ was a good approximation to the total D between two chains, was much easier in practice to calculate, and could be automated in a robust way, in the sense that human intervention and tuning was not necessary. For these reasons we used it to generate initial guesses for minimal distance aligned structures.

After the initial alignment using $D_1$ the chains were further aligned using the full distance D. At this stage the general form of the transformation is established and the computation can be automated. We used a Nelder-Mead simplex method in our algorithm to find the minimal distance alignment.

Figure 4.1 shows the aligned structures using RMSD, MRSD, $D_1$, and D, for increasing numbers of links. Several points can be observed. For the smallest number of links (3), MRSD, $D_1$, and D all give the same alignment (fig. 4.1a).
For 5 or more links, the MRSD-aligned structure breaks symmetry by choosing a particular diagonal direction, while $D_1$ and D retain this symmetry but begin to differ (fig. 4.1b). The deviation between MRSD and D is a finite-size effect [109], so we know that the two alignments must eventually converge as N is increased. At 9 links (fig. 4.1d), the $D_1$-alignment breaks symmetry in the same fashion as MRSD, yet the D-alignment remains similar to RMSD. By 11 links (fig. 4.1e), the D-aligned structure has broken symmetry as well, however with a smaller angle to the horizontal than either MRSD or $D_1$. As N is increased, $D_1$ and MRSD aligned structures quickly converge, while the angle with respect to the horizontal of the D-aligned structure continues to lag behind that of either the MRSD or $D_1$ structures, converging slowly as N continues to increase (figures 4.1f-j). The RMSD-aligned structure remains horizontal throughout.

Average lengths of β-hairpins in databases constructed from the PDB are about 17 residues [27], most consistent with fig. 4.1h. From this figure we see that hairpins of this length have a globally different structural alignment with extended structures depending on whether D or RMSD is used.

Figure 4.1: Alignments with different cost functions, panels (a) to (j). The hairpin is shown in red, the D alignment in green, $D_1$ in blue, MRSD in yellow, and RMSD in cyan.

          Alignment cost function
  N     D       $D_1$   MRSD    RMSD
  4     0.785   0.785   0.785   0.822
  6     1.391   1.415   1.473   1.419
  8     1.974   1.983   2.085   2.014
 10     2.559   2.574   2.654   2.615
 12     3.127   3.158   3.197   3.216
 14     3.674   3.705   3.726   3.817
 16     4.207   4.235   4.247   4.418
 18     4.732   4.769   4.762   5.019
 20     5.252   5.294   5.272   5.620
 22     5.767   5.802   5.783   6.221

Table 4.1: $D/N$ (in units of link length squared) between the aligned structures in figure 4.1. Each of the 4 columns represents the structural pairs for the cost function labeled.
For example, column 3 gives $D/N$ for structural pairs in figure 4.1 aligned using MRSD.

Table 4.1 and figure 4.2 summarize the results for the minimal distance transformations from the aligned structures. Table 4.1 gives the numerical value of the distance D for each aligned structure, aligned using the various cost functions listed: D, $D_1$, MRSD, and RMSD. Note that the distance D is always minimized for the distance-aligned structure, and tends to increase as one considers the $D_1$, MRSD and then RMSD-aligned structures for a given number of links. For comparison, in table 4.2 the corresponding values of MRSD are given for the aligned structures using each cost function. Note in each table that as N gets large, D tends to converge to MRSD.

The distance traveled per residue, in units of link length, is $D/(Nb)$. Dividing this measure by the chain length $(N-1)b$ gives a scale-invariant measure of the distance: $\tilde{D} = D/(N(N-1)b^2)$. This quantity is plotted in figure 4.2. We can see from the plot that the $D_1$-aligned structure generally gives a good approximation to the true D-aligned structure. Moreover, MRSD, $D_1$ and D all converge to the same value, while RMSD converges to a dissimilar value.

Figure 4.2: Scale-invariant distance resulting from alignments with different cost functions.

          Alignment cost function
  N     D       $D_1$   MRSD    RMSD
  4     0.707   0.707   0.707   0.809
  6     1.375   1.393   1.337   1.412
  8     1.961   1.960   1.899   2.008
 10     2.547   2.545   2.436   2.610
 12     3.062   3.108   2.959   3.211
 14     3.575   3.675   3.475   3.813
 16     4.081   4.004   3.987   4.414
 18     4.585   4.506   4.495   5.015
 20     5.088   5.008   5.002   5.616
 22     5.591   5.511   5.508   6.218

Table 4.2: MRSD (in units of link length) between the aligned structures in figure 4.1 using the four cost functions we considered. For example, column 1 gives MRSD for structural pairs in figure 4.1 aligned using the distance D.

4.3 Conclusion and discussion

In this chapter we used the generalized distance D and various approximations of it as cost functions to align unfolded idealized strands of various sizes to their corresponding idealized β-hairpin structures. The distance D for the minimal transformation between aligned structural pairs was compared for various alignment cost functions: RMSD, MRSD, $D_1$, and D itself. $D_1$ is the distance between conformational pairs if the chain were decimated to single links and the distances of all single-link transformations were summed.

We found that $D_1$-aligned structures generally gave a distance that was close to the true D-aligned structure, and in this sense was a good approximation. However, the aligned structures were noticeably different depending on the cost function, for the finite values of N (number of residues) that we studied. Our largest value of N was 22, while the average length of β-hairpins is about 17 residues. For these average hairpin lengths, the minimal D aligned structure is globally different from the RMSD structure. Whether this discrepancy is generally true for larger structures or whole proteins remains to be determined, but we feel it is likely.

It is not yet clear at this point whether alignment using distance will yield more accurate predictions for such problems as protein structure prediction or ab-initio drug design. What is clear is that the best-aligned structures using a reasonable alignment metric such as the true distance give very different results than RMSD, even for relatively simple structures such as the beta-hairpin.

Chapter 5
Polymer uncrossing and knotting in protein folding, and their role in minimal folding pathways

In this chapter we give a systematic treatment of non-crossing constraints in protein untangling. First we develop an approximate but easily automatable algorithm for minimal folding pathways of a polymer without considering non-crossing constraints.
Then we will study perturbations in the pathway that occur due to the presence of non-crossing constraints. Finally we apply the formalism to a number of proteins including knotted proteins. We will see how non-crossing distance can differentiate classes of proteins and how topological constraints, manifested by untangling operations in our formalism, induce folding pathways. We will also study how persistence of different untangling moves varies across protein classes. The formalism outlined in this chapter is the next logical and systematic improvement to what has been done in chapter 3.

5.1 Introduction

A transformation connecting unfolded states with the native folded state can be considered as a reaction coordinate. A transformation can also be used as a starting point for refinement, by examining commitment probability or other reaction coordinate formalism.

Several methods have been developed to find transformations between protein conformational pairs without specific reference to a molecular mechanical force field. These include coarse-grained elastic network models [66, 67], coarse-grained plastic network models [84], iterative cluster-normal mode analysis [120], restrained interpolation (the Morph server) [72], the FRODA method [135], and geometrical targeting (the geometrical pathways (GP) server) [36]. The GP method finds trajectories between conformation pairs by gradually decreasing the RMSD between the conformations, while preserving structural constraints within the protein. Dead-ends can be encountered. In this event, two recovery methods may be attempted: a random perturbation technique, and backtracking by temporarily increasing RMSD before attempting the transformation again.

In this chapter we consider transformations between polymer conformation pairs that would not be viable by a conjugate-gradient type or direct minimization approach, in that dead-ends would inevitably be encountered.
We focus specifically on how one might find geometrically optimal transformations that account for polymer non-crossing constraints, and would apply to knotted proteins for example.

By a geometrically optimal transformation, we mean a transformation in which every monomer in a polymer would travel the least distance in 3-dimensional space in moving from conformation A to conformation B. This is a variational problem, and the equations of motion, along with the minimal transformation and the Euclidean distance covered, have been worked out in previous chapters. Although minimal transformations have been found for the backbones of secondary structures, and the non-crossing problem has been treated in chapter 3, minimal transformations between unfolded and folded states for full protein chain lengths have not been treated before. We focus on this problem in this chapter.

The minimal transformation inevitably involves curvilinear motion if bond, angle, or stereochemical constraints are involved. If such constraints are neglected, the minimal distance corresponding to the minimal transformation converges to the mean of the root squared distance (MRSD), or the mean of the straight-line distances between pairs of atoms or monomers. This is not the RMSD. For any typical pair of conformations, the MRSD is always less than the RMSD, which can be proved by applying Hölder's inequality [89].

The RMSD can be thought of as a least squares fit between two structures. Alternatively, it may also be thought of as the straight-line Euclidean distance between two structures in a high-dimensional space of dimension 3N, where N is the number of atoms or residues considered in the protein. Fast algorithms have been constructed to align structures using RMSD [24, 25, 40, 62, 63, 69].

If several intermediate states are known along the pathway of a transformation between a pair of structures, then the RMSD may be calculated consecutively for each successive pair.
This notion of RMSD as an order parameter goes back to reaction dynamics papers from the early 1980s [7, 16, 35, 132]; however, in these approaches the potential energy governs the most likely reactive trajectories taken by the system, and RMSD is simply accumulated through the transition states. In the absence of a potential surface except for that corresponding to steric constraints, the incremental RMSD may be treated as a cost function and its minimal transformation between two structures found. This idea is behind the transformation approaches discussed above. However, the minimal transformation using RMSD (or 3N-dimensional Euclidean distance) as a cost function is different from the minimal transformation using 3D Euclidean distance (MRSD) as a cost function, and the RMSD-derived transformation does not correspond to the most straight-line trajectories. The RMSD is not equivalent to the total amount of motion a protein or polymer must undergo in transforming between structures, even in the absence of steric constraints enforcing deviations from straight-line motion.

In what follows, we first describe our method for calculating the distance corresponding to a minimal transformation that accounts for the extra distance traveled to avoid self-crossing of the polymer chain. This involves finding the different ways a polymer can uncross or "untangle" itself, and then calculating the corresponding distance for each of the untangling transformations. Since there are typically several avoided crossings during a minimal folding transformation, finding the optimal untangling strategy corresponds to finding the optimal combination of uncrossing operations with minimal total distance cost.

After quantifying such a procedure, we apply this to full length protein backbone chains for several structural classes, including α-helical proteins, β-sheet proteins, α-β proteins, 2-state and 3-state folders, and knotted proteins.
We generate unfolded ensembles for each of the proteins investigated, and calculate minimal distance transformations for each member of the unfolded ensemble to fold. We look for differences in the distance between structural and kinetic classes, and compare these to differences in other order parameters between the respective classes. The other order parameters investigated include absolute contact order ACO [105], relative contact order RCO [105], long-range order LRO [52], root-mean-squared deviation RMSD, mean-root-squared deviation MRSD, and chain length N [42, 54]. The variations of distance metrics considered include total distance D, distance per residue $D/N$, the "extra" distance traveled to avoid crossing, $D_{nx}$, and the extra non-crossing distance per residue, $D_{nx}/N$. We also investigate how the various order parameters either correlate with or are independent of each other. We finally discuss our results and conclude.

5.2 Methods

5.2.1 Calculation of the transformation distance

The value of $D_{nx}$ is calculated as follows. The chain transforms from conformation A to conformation B as a ghost chain, so the chain is allowed to pass through itself. The beads of the chain follow straight trajectories from initial to final positions. This is an approximation to the actual Euclidean distance D of the transformation, where straight-line transformations of the beads are generally preceded or followed by non-extensive local rotations to preserve the link length connecting the beads as a rigid constraint [89, 109]. The instances of self-crossing along with their times are recorded. The associated cost for these crossings is computed retroactively: for example, the distance cost for one arm of the chain to circumnavigate another obstructing part is added to the "ghost" distance to compute the total distance.
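As a rough illustration of this bookkeeping, the sketch below computes the ghost-chain distance as the sum of straight-line bead displacements and then adds a list of crossing costs. The function names and the bead coordinates are assumptions made for illustration. The test geometry, a 10-bead chain with unit spacing turning from horizontal to vertical about a corner one link away from the first bead, was chosen because it gives a ghost distance of $55\sqrt{2} \approx 77.78$ link lengths, the same value quoted later in this section for the approximate distance of the Figure 5.1 example; the actual coordinates used in that figure may differ.

```python
import math

def ghost_distance(conf_a, conf_b):
    """Sum of straight-line distances traveled by each bead (equal to N times the MRSD)."""
    return sum(math.dist(p, q) for p, q in zip(conf_a, conf_b))

def total_distance(conf_a, conf_b, crossing_costs=()):
    """Ghost distance plus the retroactively computed detour cost of each crossing."""
    return ghost_distance(conf_a, conf_b) + sum(crossing_costs)

# Assumed geometry: beads at distances 1..10 (in link lengths) from the corner.
horizontal = [(float(k), 0.0, 0.0) for k in range(1, 11)]
vertical = [(0.0, float(k), 0.0) for k in range(1, 11)]

d_ghost = ghost_distance(horizontal, vertical)
print("ghost distance:", round(d_ghost, 2))

# With no crossings the total equals the ghost distance; the value 1.5 below is a
# made-up detour cost used only to show how a crossing cost would be added.
print("total with one hypothetical crossing:", round(total_distance(horizontal, vertical, [1.5]), 2))
```

Here the extra non-crossing distance $D_{nx}$ corresponds to the sum of the entries of `crossing_costs`; how each entry is obtained is the subject of the crossing cost calculation described below.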
The method for calculating the non-crossing distance has three major components: evolution of the chain, crossing detection, and crossing cost calculation. Each is described in one of the subsections below.

Evolution of the chain

As mentioned above, the condition of constant link length between residues along the chain is relaxed, so that the non-extensive rotations that would generally contribute to the distance traveled are neglected here. This approximation becomes progressively more accurate for longer chains. Thus, ideal transformations only involve pure straight-line motion. The approximate transformation is carried out in a way to minimize deviations from the true transformation (D), such that link lengths are kept as constant as possible, given that all beads must follow straight-line motion. We thus only allow deviations from constant link length when rotations would be necessary to preserve it; this only occurs for a small fraction of the total trajectory, typically either at the beginning or the end of the transformation [89, 109].

A specific example

As an example of the amount of distance neglected by this approximation, consider the pair of configurations in Figure 5.1, where a chain of 10 residues that is initially horizontal transforms to a vertical orientation as shown in the figure. The distance neglecting rotations (our approximation) is 77.78, in reduced units of the link length, while the exact calculation including rotations [89, 109] gives a distance of 78.56.

A few intermediate conformations are shown in the figure. In particular note the link length change (and hence violation of the constant link length condition) in the fourth link for the gray conformation (conformation F), resulting from our approximation. If the link length is preserved, the transformation consists of local rotations at the boundary points.
Also note that when transforming from cyan to magenta the first bead moves less than $\delta$, because it reaches its final destination and "sticks" to the final point, and will not be moved subsequently.

General method

The algorithm to evolve the chain is as follows. Straight-line paths from the positions of the beads in the initial chain configuration to the corresponding positions of the beads in the final configuration are constructed. The bead furthest away from the destination, i.e., the bead whose path is the longest line, is chosen. Let this bead be denoted by index b, where $0 \le b \le N$. In the example of figure 5.1, this bead corresponds to bead number 9 ($b_9$). The bead is then moved toward its destination by a small pre-determined amount $\delta$, and the new position of bead b is recorded. In this way the transformation is divided into, say, M steps: $M = d_{max}/\delta$, where $d_{max}$ is the maximal distance. Let i be the step index, $0 \le i \le M$. If initially the chain configuration was at step i (e.g., i = 0), the spatial position of bead b at step i before the transformation is denoted by $\mathbf{r}_{b,i}$, and after the transformation by $\mathbf{r}_{b,i+1}$.

The upper bound on $\delta$ needed to capture the essence of the transformation dynamics differs according to the complexity of the problem. To capture all of the instances of self-crossing, a step size of two percent of the link length sufficed for all cases.

The neighboring beads ($b+1$ and $b-1$) should also follow paths on their corresponding straight-line trajectories. Their new positions on their paths ($\mathbf{r}_{b+1,i+1}$ and $\mathbf{r}_{b-1,i+1}$) are then calculated based on the constant link length constraints. These new positions correspond to moving the beads by $\delta_{b+1,i}$ and $\delta_{b-1,i}$ respectively. Once $\mathbf{r}_{b+1,i+1}$ and $\mathbf{r}_{b-1,i+1}$ are calculated, we proceed and calculate $\mathbf{r}_{b+2,i+1}$ and $\mathbf{r}_{b-2,i+1}$, until we reach the end points of the chain.

As an example consider figure 5.1, going from conformation B (green) to conformation C (yellow).
First, bead number 9, which is the bead farthest from its final destination, is moved by $\delta$; then, taking constant link length constraints and straight-line trajectories into account, the new position of bead 8 is calculated, and so on, until all the new bead positions, which correspond to the yellow conformation, are calculated.

If somewhere during the propagation to the endpoints a solution cannot be constructed or no continuous solution exists, i.e. $\lim_{\delta \to 0} (\mathbf{r}_{b+m,i+1} - \mathbf{r}_{b+m,i}) \neq 0$, then we set $\mathbf{r}_{b+m,i+1} = \mathbf{r}_{b+m,i}$. That is, the bead will remain stationary for a period of time (footnote 3). Consequently $\mathbf{r}_{b+n,i+1} = \mathbf{r}_{b+n,i}$ for all beads with $n > m$ that have not yet reached their final destination. This is because the new position of each bead is calculated from the position of the bead next to it for any particular step i. The same recipe is applied when propagating incremental motions $\delta_{b,i+1}$ along the other direction of the chain (going from $b-n$ to $b-n-1$) as well.

When a given bead that has been held stationary becomes the farthest bead away from its final position, it is then moved again. That is, stationary beads can move again at a later time during the transformation if they become the furthest beads away from the final conformational state. Such a scenario does not occur in the context of the simple example of figure 5.1.

Once the positions of all the beads in step $i+1$ are calculated, the same procedure is repeated for step $i+2$ and so on, until the chain reaches the final configuration.

If the position of a given bead b at step i is such that $|\mathbf{r}_{b,i} - \mathbf{R}_b| < \delta$, where $\mathbf{R}_b$ is the spatial position of bead b in the final conformation, then $\mathbf{r}_{b,i+1}$ is set to $\mathbf{R}_b$. In other words we discretely snap the bead to the final position if it is closer than the step size $\delta$. In the context of figure 5.1, this corresponds to going from conformation D (cyan) to conformation E (magenta). Bead 0 ($b_0$) is snapped to the final conformation. Once a bead reaches its destination it locks there and will never move again.
See conformation F (gray) in figure 5.1.

Figure 5.2 plots the standard deviation in link length vs. the mean link length, for transformations of 200 random structures generated by self-avoiding random walks (SAW) to one pre-specified SAW. The length of the random chains was 9 links. The chains were aligned by minimizing MRSD before the transformation took place [89–91], where MRSD stands for the mean root squared distance and is defined by

\[ \mathrm{MRSD} = \frac{1}{N}\sum_{n=1}^{N}\sqrt{(\mathbf{r}_{A_n}-\mathbf{r}_{B_n})^2} = \frac{1}{N}\sum_{n=1}^{N}\left|\mathbf{r}_{A_n}-\mathbf{r}_{B_n}\right| . \]

³ This in principle may result in a link length change for the corresponding link, and thus a constraint violation, in our approximation. An exact algorithm involves local link rotation instead.

Figure 5.1: (a) Several intermediate conformations for a transformation (A–G, proceeding along the color sequence red, green, yellow, cyan, magenta, gray, and blue) are shown. The step size $\delta$ is shown. Note the step in which the first bead of the chain ($b_0$) is "snapped" into the final conformation because its distance to the destination is less than $\delta$ (going from D to E). In the intermediate conformation F (gray), beads 0 to 3 have reached their final locations and no longer move. Note also the link length violation of link 4 in conformation F, due to the approximation that ignores end point rotations, for this intermediate figure. A milder violation is observed when going from D (cyan) to E (magenta), since beads 1 through N all assume a step size of $\delta$ while bead 0 moved a step size $< \delta$. (b) Surface plot of link length as a function of link number and step number during the transformation. For the whole process, the mean link length $\bar{\ell}$ is 0.98 units and the standard deviation is 0.063.

Figure 5.2: Scatter plot of the average link length (x-axis) and the standard deviation from the mean link length (y-axis), for transformations between 200 randomly generated structures of 9 links and the (randomly generated) reference structure shown in the inset to the figure. Each point in the scatter plot corresponds to a whole transformation between a randomly generated structure and the reference structure. The "native" or reference state is shown in the inset, along with several of the 200 initial states. In practice, transformations with a mean link length of unity have a standard deviation near zero because the link lengths have hardly changed. For the ensemble of transformations shown, the ensemble average of the mean link length is 0.96, and the average of the corresponding standard deviation is 0.076.

Crossing detection

As stated earlier, during the transformation the chain is initially treated as a ghost chain, and so is allowed to cross itself. To keep track of the crossing instances of the chain, a crossing matrix $X$ is updated at all time steps during the transformation. If the chain has $N$ beads and $N-1$ links, we can define an $(N-1)\times(N-1)$ matrix $X$ that contains the crossing properties of a 2D projection of the strand, in analogy with topological analysis of knots. The element $X_{ij}$ is nonzero if link $i$ is crossing link $j$ in the 2D projection at that instant. Without loss of generality we can assume that the projection is onto the XY plane, as in Figure 5.3. We use the XY plane projection throughout this chapter.⁴

⁴ We use the crossings in the projected image as a book-keeping device to detect real 3D crossings. A real crossing event is characterized by a sudden change in the over-under nature of a crossing on a projected plane. Since for any 3D crossing the change in the over-under order of the crossing links is present in any arbitrary projection, keeping track of a single projection is enough to detect 3D crossings. Of course a given projection plane may not be the optimal projection plane for a given crossing; however, if the time step is small enough any projection plane will be sufficient to detect a crossing.

We parametrize the chain uniformly and continuously in the direction of ascending link number by a parameter $s$ with range $0 \le s \le N$. So for example the middle of the second link is specified by $s = 3/2$. If the projection of link $i$ is crossing the projection of link $j$, then $|X_{ij}|$ is the value of $s$ at the crossing point of link $i$ and $|X_{ji}|$ is the value of $s$ at the crossing point of link $j$. If link $i$ is over $j$ (i.e. the corresponding point of the cross on link $i$ has a higher $z$ value than the corresponding point of the cross on link $j$) then $X_{ij} > 0$, otherwise $X_{ij} < 0$. Thus, after the sign operation, $\mathrm{sign}(X)$ is an anti-symmetric matrix.

Figure 5.3: (a) A 3-link chain with its vertical projection. A crossing in the projection is shown with a green circle. The crossing in the projection occurs at points $s = 0.29$ and $s = 2.82$, where the chain is parametrized uniformly from 0 to 3. Since link 1 is under link 3 at the point of projection crossing, 0.29 will appear with a negative sign in the corresponding $X$ (eq. 5.1a). (b) The blue chain and the red chain have the exact same vertical projection, however their corresponding $X$ matrices differ in sign, as given in Eq. 5.1b. This indicates that the over-under sense has changed for the links whose projections are crossing. This in turn indicates that a true crossing has occurred when going from the red conformation to the blue conformation, as opposed to a series of conformations where the chain has navigated to conformations having the opposite crossing sense without passing through itself.

A simple illustrative example of the value of $X$ for the 3-link chain in figures 5.3a and 5.3b is

\[ X(t_o) = \begin{bmatrix} 0 & 0 & -0.29 \\ 0 & 0 & 0 \\ +2.82 & 0 & 0 \end{bmatrix} \tag{5.1a} \]

\[ X(t_o + \delta) = \begin{bmatrix} 0 & 0 & +0.29 \\ 0 & 0 & 0 \\ -2.82 & 0 & 0 \end{bmatrix} \tag{5.1b} \]
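The bookkeeping described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation used in this work: the function names `crossing_matrix` and `true_crossings` are hypothetical, links are numbered from 1 as in the text, link $i$ joins beads $i-1$ and $i$, adjacent links are skipped because they share a bead, and a true crossing is flagged whenever an element of $X$ changes sign between two consecutive conformations without being zero in either.

```python
from typing import List, Tuple

Point = Tuple[float, float, float]


def crossing_matrix(beads: List[Point]) -> List[List[float]]:
    """Signed crossing matrix X for the XY projection of an open chain.

    |X[i][j]| is the chain parameter s at which link i+1 crosses link j+1
    in projection; the sign is positive if link i+1 lies above link j+1 there.
    """
    n_links = len(beads) - 1
    X = [[0.0] * n_links for _ in range(n_links)]
    for i in range(n_links):
        for j in range(i + 2, n_links):  # adjacent links only share a bead
            (x1, y1, z1), (x2, y2, z2) = beads[i], beads[i + 1]
            (x3, y3, z3), (x4, y4, z4) = beads[j], beads[j + 1]
            denom = (x2 - x1) * (y4 - y3) - (y2 - y1) * (x4 - x3)
            if abs(denom) < 1e-12:
                continue  # projected links are parallel
            # fractional positions of the projected intersection on each link
            t = ((x3 - x1) * (y4 - y3) - (y3 - y1) * (x4 - x3)) / denom
            u = ((x3 - x1) * (y2 - y1) - (y3 - y1) * (x2 - x1)) / denom
            if not (0.0 < t < 1.0 and 0.0 < u < 1.0):
                continue  # the projections do not cross
            z_i = z1 + t * (z2 - z1)
            z_j = z3 + u * (z4 - z3)
            sign = 1.0 if z_i > z_j else -1.0
            X[i][j] = sign * (i + t)
            X[j][i] = -sign * (j + u)
    return X


def true_crossings(X_prev: List[List[float]],
                   X_next: List[List[float]]) -> List[Tuple[int, int]]:
    """Link pairs (1-based, i < j) whose entry flips sign between two steps."""
    n = len(X_prev)
    return [(i + 1, j + 1)
            for i in range(n) for j in range(i + 1, n)
            if X_prev[i][j] * X_next[i][j] < 0]
```

For a 3-link chain in which link 3 passes over link 1 and then appears under it one step later, `true_crossings` reports the pair (1, 3), mirroring the sign change between Eqs. 5.1a and 5.1b.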
The fact that $X_{13}$ is negative at time $t_o$ indicates that at that instant, link 1 is under link 3 in 3D space, above the corresponding point on the plane on which the projections of the links have crossed (green circle in figure 5.3).

At each step during the transformation of the chain, the matrix $X$ is updated. A true crossing event is detected by looking at $X$ for two consecutive conformations. A crossing event occurs when any non-zero element in the matrix $X$ discontinuously changes sign without passing through zero. Once $X_{ij}$ changes sign, $X_{ji}$ must change sign as well. If the chain navigates through a series of conformations that changes the crossing sense and thus the sign of $X_{ij}$, but does not pass through itself in the process, the matrix elements $X_{ij}$ will not change sign discontinuously but will have values of zero at intermediate times before changing sign.

Crossing cost calculation

Even in the simplest case of crossing, there are multiple ways for the real chain to have avoided crossing itself. The extra distance that the chain must have traveled during the transformation to respect the fact that the chain cannot pass through itself is called the "non-crossing" distance $D_{nx}$. If the chain were a ghost chain which could pass through itself, the corresponding distance for the whole transformation would be the MRSD, along with relatively small modifications that account for the presence of a conserved link length. Accounting for non-crossing always introduces extra distance to be traveled.

As the chain is transforming from conformation A to conformation B as a ghost chain according to the procedure discussed above, a number of self-crossing incidents occur. Figure 5.4 shows a continuous but topologically equivalent version of the crossing event shown in figure 5.3(b). Even for this simple case, there are multiple ways for the transformation to have avoided the crossing event, each with a different cost.
Furthermore, later crossings can determine the best course of action for the previous crossings. Figure 5.5 illustrates how non-crossing distances are non-additive, so that one must look at the whole collection of crossing events. Therefore to find the optimum way to "untangle" the chain (reverse the sense of the crossings), one must look at all possible uncrossing transformations, in retrospect.

The recipe we follow is to evolve the chain as a ghost chain and write down all the incidents of self-crossing that happen during the transformation. Then, looking at the global transformation, we find the best untangling movement that the chain could have taken.

Figure 5.4: Two possible untangling transformations. The top transformation involves twisting of the loop. The lower transformation involves a snake-like movement of the vertical leg. A third one would involve moving the horizontal leg, in a similar snake-like fashion. Note that the moves represented here are not necessarily the most efficient ones in their topological class, but rather the most intuitive ones. There are transformations that are topologically equivalent but generally involve less total motion of the chain (see for example Figures 5.12(a), 5.12(b)).

Figure 5.5: The minimal untangling movement in going from A to C (through B') is less than the sum of the minimum untangling movements going from A to B and then from B to C.

Figure 5.6: A few snapshots during a transformation involving 2 instances of chain crossing. The transformation occurs clockwise starting from initial configuration I and proceeding to final configuration F.

To compute the extra cost introduced by non-crossing constraints we proceed as follows. We construct a matrix that we call the cumulative crossing matrix $Y$. $Y_{ij}$ is non-zero if link $i$ has truly (in 3D) crossed link $j$ at any time during the transformation.
This matrix is thus conceptually different from the matrix $X$, which holds only for one instant (one conformation) and which can have crossings in the 2D projection that are not true crossings during the transformation. The values of the elements of $Y$ are calculated in the same way that the values are calculated for $X$. The sign also depends on whether the link was crossed from over to under or from under to over, so that a given projection plane is still assumed. The order in which the crossings have happened is tracked in another matrix $Y_O$. The coordinates of all the beads at the instant of a given crossing are also recorded. For example, if two crossings have happened during the transformation of a chain, then two sets of coordinates for intermediate states are also stored. We describe a simple concrete example to illustrate the general method next.

A Concrete Example: Figure 5.6 shows a simple transformation of a 7-link chain. During the transformation the chain crosses itself in two instances. The first instance of self-crossing is between link 5 and link 7. The second instance is when link 2 crosses link 4. The location of the crossing along the chain is also recorded: i.e., if we assume that the chain is parametrized by $s = 0$ to $N$, then at the instant of the first crossing (link 5 and link 7) $s = 4.4$ (link 5) and $s = 6.9$ (link 7). The second crossing occurs at $s = 1.3$ (link 2) and $s = 3.8$ (link 4). The full coordinates of all beads are also known: we separately record the full coordinates of all beads at each instant of crossing. The information that indicates which links have crossed and their over-under structure can be aggregated into the cumulative crossing matrix $Y$.
For the example in figure 5.6, the cumulative matrix (up to a minus sign indicating what plane the crossing events have been projected on) is

\[ Y = \begin{bmatrix}
0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & -1.3 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 3.8 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 4.4 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & -6.9 & 0 & 0
\end{bmatrix} . \]

$Y$ tells us, during the whole process of transformation, which links have truly crossed one another and what the relative over-under structure has been at the time of crossing. For example, by glancing at the matrix we can see that two pairs of links, (5,7) and (2,4), have crossed one another. We also know from the sign of the elements in $Y$ that links 2 and 7 were underneath links 4 and 5, respectively, just prior to their crossings in the reference frame of the projection.

Two links will cross each other at most once during a transformation. If one link, e.g. link $i$, crosses several others during the transformation, elements $(i, j)$, $(i, k)$, etc., along with their transposes, will be nonzero.

The order of crossings can be represented in a similar fashion as a sparse matrix:

\[ Y_O = \begin{bmatrix}
0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 2 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 2 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 1 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 & 0
\end{bmatrix} . \]

Figure 5.7: For the crossing points indicated by the green circles, two legs, colored blue and red, can be identified. Each leg starts at the crossing and terminates at an end.

Analyzing the structure of the crossings is similar to analyzing the structure of a knot, wherein one studies a knot's 2D projections, noting the crossings and their over/under nature based on a given directional parameterization of the curve [1, 92, 136]. One difference here is that we are not dealing with true closed-curve knots (in the mathematical sense), as a knot is a representation of $S^1$ in $S^3$. Here we treat open curves.
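As a small illustration of how the two matrices of the concrete example could be assembled from a list of recorded crossing events, consider the following Python sketch. The function name `cumulative_matrices` and the event tuple layout are assumptions made for this illustration only; the signs follow the statement in the text that links 2 and 7 were underneath links 4 and 5.

```python
from typing import List, Tuple

# One recorded event: (link underneath, its s value, link on top, its s value).
Event = Tuple[int, float, int, float]


def cumulative_matrices(n_links: int, events: List[Event]):
    """Build the cumulative crossing matrix Y and the order matrix YO.

    Events must be listed in the order in which the crossings occurred.
    Links are numbered from 1, so link k maps to row/column k - 1.
    """
    Y = [[0.0] * n_links for _ in range(n_links)]
    YO = [[0] * n_links for _ in range(n_links)]
    for order, (under, s_under, over, s_over) in enumerate(events, start=1):
        Y[under - 1][over - 1] = -s_under  # link underneath: negative entry
        Y[over - 1][under - 1] = s_over    # link on top: positive entry
        YO[under - 1][over - 1] = order
        YO[over - 1][under - 1] = order
    return Y, YO


# The two crossings of the 7-link example, in order of occurrence.
events = [(7, 6.9, 5, 4.4), (2, 1.3, 4, 3.8)]
Y, YO = cumulative_matrices(7, events)
```

Storing the order separately keeps $Y$ purely geometric, so the same $Y$ can be analyzed with different assumptions about which crossings must be undone first.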
Crossing substructures

By studying the crossing structure of open-ended pseudo-knots in the most general sense, one can identify a number of sub-structures that recur in crossing transformations. Any act of reversing the nature of all the crossings of the polymer can be cast within the framework of some ordered combination of reversing the crossings of these substructures.

We identify three sub-structures: Leg, Loop, and Elbow.

Leg: Given any self-crossing point of a chain, a leg is defined from that crossing point to the end of the chain. Therefore for each self-crossing point two legs can be identified as the shortest distance along the chain from that crossing point to each end; see figure 5.7. A single leg structure is shown in figure 5.8(a).

Loop: As stated earlier, when traveling along the polymer one arrives at each crossing twice. If the two instances of a single crossing are encountered consecutively while traveling along the polymer, and no intermediate crossing occurs, then the substructure that was traced in between is a loop. See Fig 5.8(b).

Elbow: If two consecutive crossings have the same over-under sense, then they form an elbow; see Fig 5.8(c). Note that the same two consecutive crossing instances will occur in reverse order on the second visit of the crossings: these form a dual of the elbow. By convention the segment with the longer arc-length between the two consecutive crossings is defined as the elbow. This would be the horseshoe-shaped strand in figure 5.8(c).

Figure 5.8: (a) A single leg structure, (b) a loop structure, (c) an elbow structure.

Figure 5.9: The three types of Reidemeister moves (I, II, III). As can be seen, Reidemeister move type III does not reverse the nature of any of the crossings.

Reversing the crossing nature

The goal of this formalism is to assist in finding a series of movements that will result in reversing the over-under nature of all the crossings, with the least amount of movement required by the polymer.
So at this point we introduce basic movements that will reverse the nature of the crossings for the above substructures. Two of these moves, the loop move and the elbow move, are equivalent to Reidemeister moves of type I and II respectively in knot theory.⁵ See figure 5.9. The leg move has no equivalent Reidemeister move.

⁵ For an introduction to the basic concepts in knot theory see for example [1].

Figure 5.10: Schematic illustration of the canonical leg movement, either from left to right as in (a) or effectively its time reverse as in (b). Both transformations traverse the same distance. The transformation in (a) is equivalent to the "plug" transformation analyzed in the context of folding simulations for trefoil knotted proteins [125], while the transformation in (b) (see ref. [90] for a detailed description of this transformation) is equivalent to the "slip-knotting" transformation more often observed in the folding of knotted proteins [93].

Using leg movement: A transformation that reverses the over/under nature of a leg involves the motion of all the beads constituting the leg. Each bead must move to the location of the crossing (the "root" of the leg), and then move back to its original location [90]. The canonical leg movement is shown schematically in figure 5.10.

If more than one crossing occurs on a leg, we can reverse the nature of all of them through a single leg movement (see figure 5.11). The move is topologically equivalent to moving the free end of the leg along the leg up to the desired crossing, and then moving all the way back to the original position while reversing the nature of the crossings on the way back.

Figure 5.11: One can reverse the over-under nature of all the crossings that have occurred on a leg through a single leg movement. This move has no Reidemeister equivalent.

Loop twist and loop collapse: Reversing the crossing of a loop substructure can be achieved by a move that is topologically equivalent to a twist, see figure 5.12(a). This type of move is called a Reidemeister type I move in knot theory. However the optimal motion is generally not a twist or rotation in 3-dimensional space (3D). Figure 5.12(b) shows a move which is topologically equivalent to a twist in 3D, but costs a smaller distance, by simply moving the residues inside the loop in straight lines to their final positions, resulting in a "pinching" motion to close the loop and re-open it. From now on we refer to the optimal motion simply as a loop twist, because it is topologically equivalent, but we keep in mind that the actual optimal physical move, and the distance calculated from it, is different.

Figure 5.12: (a) Reversing the over-under nature of a crossing through a topological loop twist: Reidemeister move type I. (b) By "pinching" the loop before the twist, the cost in distance for changing the crossing nature is reduced.

Elbow moves: Reversing the crossings of an elbow substructure can be done by moving the elbow segment in the motion depicted in figure 5.13: each segment moves in a straight line to its corresponding closest point on the obstruction chain, and then it moves in a straight line to its final position.

Figure 5.13: Schematic of the canonical elbow move, from left to right. This is equivalent to Reidemeister move type II.

Operator notation: The transformations for leg movement, elbow, and twist can be expressed very naturally in terms of operator notation, where in order to untangle the chain the various operators are applied on the chain until the nature of all the self-crossings is reversed. If we uniquely identify each instance of self-crossing by a number, then a topological loop twist at crossing $i$ can be represented by the operator $R(i)$ (R for Reidemeister). An elbow move, for the elbow defined by crossings $i$ and $i+1$, can be represented as $E(i, i+1)$. As discussed above, for each self-crossing, two legs can be identified corresponding to the two termini of the chain. This was exemplified in figure 5.7 by the red and blue legs. Since we choose a direction of parametrization for the chain, we refer to the two leg movements as the "start leg" movement and the "end leg" movement, and for a generic crossing $i$ we denote them as $L_N(i)$ and $L_C(i)$ respectively.

The operators that we defined above are left acting (similar to matrix multiplication). So a loop twist at crossing $i$ followed by an elbow move at crossings $j$ and $j+1$ is represented by $E(j, j+1)R(i)$.

Figure 5.14: A chain with several self-crossing points before and after untangling. Various topological substructures that are discussed in the text are color coded. For the case of the legs (red and cyan) note that various other legs can be identified, for example a leg that starts at crossing 2 and ends at the red terminus. Here we color only the shortest legs from crossing 1 to the terminus as red, and crossing 2 to the opposite terminus as cyan.

Example: Figure 5.14 shows sample configurations before and after untangling. The direction of parametrization is from the red terminus to the cyan terminus. It can be seen that there are several ways to untangle the chain. One example would be $R(3)L_C(2)R(1)$, which consists of a twist of the green loop, followed by the cyan leg movement, followed by a twist of the blue loop. Another path of untangling would be $E(2,3)L_N(1)$, which is movement of the red leg followed by the magenta elbow move.

For the two above transformations, the order of operations can be swapped, i.e., they are commutative, and the resulting distance for each of the transformations will be the same. That is, $D[E(2,3)L_N(1)] = D[L_N(1)E(2,3)]$. However, $E(2,3)L_N(1)$ is a more efficient transformation than $R(3)L_C(2)R(1)$, i.e. $D[E(2,3)L_N(1)] < D[R(3)L_C(2)R(1)]$.
Other transformation moves are not commutative in the algorithm. For example, in Figure 5.14, $L_N(1)R(3)R(2)$ is not allowed, since $R(2)$ will only act on loops defined by two instances of a crossing that are encountered consecutively in traversing the polymer, i.e., no intermediate crossings can occur. Therefore even if crossing 2 happens kinetically before crossing 3 during the ghost transformation, only the transformation $L_N(1)R(2)R(3)$ is allowed in the algorithm.

Minimal untangling cost

For each operator in the above formalism, a transformation distance/cost can be calculated. Hence the optimal untangling strategy is finding the optimal set of operator applications with minimal total cost. This solution amounts to a search in the tree of all possible transformations, as illustrated in Figure 5.15. The optimal application of operators can be computed by applying a version of the depth-first tree search algorithm.

According to the algorithm, from any given conformation there are several moves that can be performed, each having a cost associated with the move. The pseudo-code for the search algorithm can be written as follows:

    procedure find_min_cost(moves_so_far=None, cost_so_far=0,
                            min_total_cost=Infinity):
        optim_moves = NULL_MOVE
        # prune: this branch already costs more than the best solution so far
        if cost_so_far > min_total_cost:
            return [Infinity, optim_moves]
        endif
        # leaf: no uncrossing moves remain, so report the accumulated cost
        if available_moves(moves_so_far) is empty:
            return [cost_so_far, optim_moves]
        endif
        for move in available_moves(moves_so_far):
            [temp_cost, temp_optim_moves] =
                find_min_cost(moves_so_far + move,
                              cost_so_far + cost(move),
                              min_total_cost)
            if temp_cost < min_total_cost:
                min_total_cost = temp_cost
                optim_moves = move + temp_optim_moves
            endif
        endfor
        return [min_total_cost, optim_moves]
    endprocedure

The values to the right side of the equality sign in the arguments of the procedure are the default values that the procedure starts with.
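The pseudo-code above can be turned into a short runnable Python sketch. Everything specific in it is an assumption made for illustration: the move table `MOVES` is hypothetical (operator names, crossing sets, and costs are loosely patterned on Figure 5.15 and are not measured distances), a move is taken to be available only when every crossing it reverses is still outstanding, and a branch is pruned as soon as its cost reaches, rather than exceeds, the best total found so far.

```python
import math
from typing import Dict, FrozenSet, List, Tuple

# Hypothetical move table: name -> (crossings whose sense it reverses, cost).
MOVES: Dict[str, Tuple[FrozenSet[int], float]] = {
    "R(1)": (frozenset({1}), 10.0),
    "LN(1)": (frozenset({1}), 20.0),
    "R(2)": (frozenset({2}), 10.0),
    "LC(2)": (frozenset({2, 3, 4}), 60.0),
    "LC(3)": (frozenset({3, 4}), 30.0),
    "E(3,4)": (frozenset({3, 4}), 5.0),
}


def find_min_cost(remaining: FrozenSet[int],
                  cost_so_far: float = 0.0,
                  bound: float = math.inf) -> Tuple[float, List[str]]:
    """Depth-first search with pruning over sequences of uncrossing moves.

    Returns (total cost, moves in order of application) for the cheapest way
    to reverse every crossing in `remaining`, or (inf, []) if no completion
    of this branch is cheaper than `bound`.
    """
    if cost_so_far >= bound:
        return math.inf, []        # prune: cannot beat the best known total
    if not remaining:
        return cost_so_far, []     # leaf: all crossings have been reversed
    best_cost, best_moves = math.inf, []
    for name, (crossings, cost) in MOVES.items():
        if not crossings <= remaining:
            continue               # move touches an already reversed crossing
        total, tail = find_min_cost(remaining - crossings,
                                    cost_so_far + cost,
                                    min(bound, best_cost))
        if total < best_cost:
            best_cost, best_moves = total, [name] + tail
    return best_cost, best_moves


cost, moves = find_min_cost(frozenset({1, 2, 3, 4}))
```

With this table the search settles on the moves R(1), R(2), E(3,4) applied in that order, which in the left-acting operator notation reads E(3,4)R(2)R(1), at a total cost of 25.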
The procedure is called recursively, and returns both the set of optimal uncrossing moves (for a given crossing matrix corresponding to a starting and final conformation), and the distance corresponding to that set of optimal uncrossing moves.

The algorithm visits all branches of the tree of possible uncrossing operations until it reaches the end. However it terminates the search along a branch if the cost of operations exceeds that of a solution already found. See figure 5.15 for an illustration of the depth-first search tree algorithm. The above procedure was implemented using both the GNU Octave programming language and C++. To optimize speed by eliminating redundant moves, only one permutation was considered when operators commuted.

Figure 5.15: An example (subset) tree of possible transformations for a given crossing structure. Accumulated distances are given inside the circles representing nodes of the tree; the non-crossing transformations and their corresponding distances are shown next to the branches of the tree. The algorithm starts from the bottom node and proceeds to the top nodes, starting in this case along the right-most branch. The possible transformations to be considered as candidate minimal transformations are: $[L_C(3)R(2)R(1)]$, $[E(3,4)R(2)R(1)]$, $[R(2)L_N(1)]$, which then terminates because the accumulated distance exceeds the minimum so far of 25, and $[L_C(2)L_N(1)]$.

5.2.2 Generating unfolded ensembles

To generate transformations between unfolded and folded conformations, we adopt an off-lattice coarse-grained C$\alpha$ model [22, 122], and generate an unfolded structural ensemble from the native structure as follows. For a native structure with $N$ links, we define three data sets:

• The set of C$\alpha$ residue indices $i$, for which $i = 1, \ldots, N$
• The set of native link angles $\theta_j$ between three consecutive residues, for which $j = 2, \ldots, N-1$
• The set of native dihedral angles $\phi_k$ between four consecutive residues, for which $k = 2, \ldots, N-2$

The distribution of C$\alpha$-C$\alpha$ distances in PDB structures is sharply peaked around 3.76 Å ($\sigma$ = 0.09 Å). In practice we took the first C$\alpha$-C$\alpha$ distance from the N-terminus as representative, and used that number for the equilibrium link length for all C$\alpha$-C$\alpha$ distances in the protein.

To generate an unfolded ensemble, we start by selecting at random a bead $i$ ($2 \le i \le N-1$) in the native conformation, and we then perform operations that change the angle centered at residue $i$, $\theta_i$, and the dihedral centered at bond $i$–$(i+1)$, $\phi_i$. If $i = N-1$ only the angle is changed. The new angle and dihedral are selected at random from the Boltzmann distribution described below. At the end of each operation, $\theta_i \to \theta_i^{\mathrm{new}}$ and $\phi_i \to \phi_i^{\mathrm{new}}$. Changing these values corresponds to rotating an entire substructure, where all the beads $j > i$ will end up in a new position.

This recipe corresponds to an extension of the pivot algorithm [74, 82], with the additional feature that the most probable rotation selects the values of the angles and dihedrals in the native structure. That is, if we define $\Delta\theta_i = \theta_i^{\mathrm{new}} - \theta_i^{\mathrm{Nat}}$ and $\Delta\phi_i = \phi_i^{\mathrm{new}} - \phi_i^{\mathrm{Nat}}$, then the most probable $\Delta\theta$ and most probable $\Delta\phi$ are both zero. By increasing an artificial temperature, larger $\Delta\theta_i$'s and $\Delta\phi_i$'s become more accessible.

The new angle is chosen from a probability distribution proportional to $\exp(-\beta E_\theta(\theta))$, where $E_\theta(\theta)$ is computed from:

\[ E_\theta(\theta_i) = k_\theta \left(\theta_i - \theta_i^{\mathrm{Nat}}\right)^2 . \tag{5.2} \]

The parameter $1/\beta$ plays the role of a temperature, which we have set to unity. We set $k_\theta = 20$. Similarly for $\phi$, the probability distribution function is proportional to $\exp(-\beta E_\phi(\phi))$, where $E_\phi(\phi)$ is computed from

\[ E_\phi(\phi_i) = k_1\left[1 + \cos(\Delta\phi_i)\right] + k_3\left[1 + \cos(3\Delta\phi_i)\right] \tag{5.3} \]

with $\Delta\phi_i = \phi_i - \phi_i^{\mathrm{Nat}}$, $k_1 = 1$, $k_3 = 0.5$, and again $\beta = 1$. The fact that the $k_\phi$'s are much smaller than $k_\theta$ means that for a given temperature, dihedral angles are more uniformly distributed than the $\theta$'s.
If $\beta$ is set to zero then all states are equally accessible and the algorithm reduces to the pivot algorithm, i.e., a random walk generator. If $\beta$ is set to $\infty$ then the chain behaves as a rigid object and does not deviate from its native state.

Each pivot operation results in a new structure that must be checked so that it has no steric overlap with itself, i.e., the chain must be self-avoiding. If the new chain conformation has steric overlap, then the attempted move is discarded, and a new residue is selected at random for a pivot operation.

In practice, we defined steric overlap by first finding an approximate contact or cut-off distance for the coarse-grained model. The contact distance was taken to be the smaller of either the minimum C$\alpha$-C$\alpha$ distance between those residues in native contact (where two residues are defined to be in native contact if any of their heavy atoms are within 4.9 Å), or the C$\alpha$-C$\alpha$ distance between the first two consecutive residues. For SH3, for example, the minimum C$\alpha$ distance in native contacts is 4.21 Å and the first link length is 3.77 Å, so for SH3 all non-neighbor beads must be further apart than 3.77 Å for a pivot move to be accepted. Future refinements of the acceptance criteria can involve the use of either the mean C$\alpha$-C$\alpha$ distance or other criteria more accurately representing the steric excluded volume of residue side chains.

In our recipe, to generate a single unfolded structure we start with the native structure and implement $\mathcal{N}$ successful pivot moves, where $\mathcal{N}$ is related to the number of residues $N$ by $\mathcal{N} = \ln(0.01)/\ln[0.99(N-2)/(N-1)]$. For the next unfolded structure we start again from the native structure and pivot $\mathcal{N}$ successful times, following the above recipe. Note that $\mathcal{N}$ successful pivots do not generally affect all beads of the chain. In the most likely scenario some beads are chosen several times and some beads are not chosen at all, according to a Poisson distribution.
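As a quick numerical check of the pivot count, the expression above can be evaluated directly. The sketch below reads the formula literally, with the factor 0.99 multiplying $(N-2)/(N-1)$, and the function name `pivot_count` and its keyword arguments are hypothetical. Taking $N = 56$ is an assumption made for illustration (it corresponds to counting the links of a 57-residue chain such as 1SHG).

```python
import math


def pivot_count(n: int, p_miss: float = 0.01, factor: float = 0.99) -> float:
    """Number of successful pivot moves, ln(p_miss) / ln[factor (n-2)/(n-1)].

    The formula is read literally from the text; n must be at least 3 so
    that the argument of the logarithm is positive.
    """
    return math.log(p_miss) / math.log(factor * (n - 2) / (n - 1))


n_pivots_56 = round(pivot_count(56))  # N = 56 is an illustrative assumption
```

Under these assumptions the expression evaluates to roughly 162 pivot moves, and the count grows with chain length until the factor 0.99 dominates the logarithm.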
This particular choice of $\mathcal{N}$ means that for polymers with $N < 101$, where $(N-2)/(N-1) < 0.99$, the chance that any given link is not pivoted at all during the $\mathcal{N}$ pivot operations is 0.01. On the other hand, for longer polymers, where $(N-2)/(N-1) > 0.99$, any particular segment of the protein with length 0.01 of the total length has a 0.01 chance of not having any of its beads pivoted. For any $N$, however, the sheer number of pivot moves generally ensures a large RMSD between the native and generated unfolded structures.

Each unfolded structure generally retains small amounts of native-like secondary and tertiary structure, due to the native biases in the angle and dihedral distributions. For example, for SH3 the number of successful pivot moves was 162 and the mean fraction of native contacts in the generated unfolded ensemble was 0.06.

5.2.3 Proteins used

The proteins used in this study are given in table 5.1. They consist of 25 2-state folders, 13 3-state folders, 11 all α-helix proteins, 14 all β-sheet proteins, 13 α-β proteins, and 5 knotted proteins.

Table 5.1: Proteins analyzed in this chapter.

PDB   fold  2ndry str.  log kf  LRO  RCO  ACO   MRSD  RMSD  ⟨Dnx⟩  ⟨Dnx⟩/N  ⟨D⟩×10⁻³  ⟨D⟩/N  N
1A6N  3     α-helix     1.10    1.4  0.1  14.0  26.2  29.2  285    1.9      4.24      28.1   151
1APS  2     Mixed       -1.48   4.2  0.2  21.8  22.7  25.4  201    2.1      2.43      24.8   98
1BDD  2     α-helix     11.75   0.9  0.1  5.2   14.0  14.9  76.5   1.3      0.91      15.2   60
1BNI  3     Mixed       2.60    2.5  0.1  12.3  20.8  22.8  209    1.9      2.46      22.8   108
1CBI  3     β-sheet     -3.20   2.8  0.1  18.8  25.1  27.9  286    2.1      3.70      27.2   136
1CEI  3     α-helix     5.80    1.0  0.1  9.1   16.7  18.9  71.4   0.8      1.49      17.5   85
1CIS  2     Mixed       3.87    3.3  0.2  10.8  15.1  16.8  99.7   1.5      1.10      16.6   66
1CSP  2     β-sheet     6.98    3.0  0.2  11.0  16.8  18.4  98.0   1.5      1.23      18.3   67
1EAL  3     β-sheet     1.30    2.5  0.1  15.7  24.9  27.9  278    2.2      3.44      27.1   127
1ENH  2     α-helix     10.53   0.4  0.1  7.4   13.5  14.9  28.0   0.5      0.76      14.1   54
1G6P  2     β-sheet     6.30    3.8  0.2  11.7  16.4  18.0  83.1   1.3      1.17      17.7   66
1GXT  3     Mixed       4.38    3.7  0.2  18.6  21.1  23.5  148    1.7      2.03      22.8   89
1HRC  2     α-helix     8.76    2.2  0.1  11.7  19.6  22.2  126    1.2      2.17      20.8   104
1IFC  3     β-sheet     3.40    2.8  0.1  17.7  25.1  27.9  284    2.2      3.58      27.3   131
1IMQ  2     α-helix     7.31    1.7  0.1  10.4  16.1  17.9  80.7   0.9      1.46      17.0   86
1LMB  2     α-helix     8.50    1.1  0.1  7.1   17.0  18.6  76.8   0.9      1.55      17.9   87
1MJC  2     β-sheet     5.24    3.0  0.2  11.0  17.5  19.2  110    1.6      1.32      19.1   69
1NYF  2     β-sheet     4.54    2.8  0.2  10.6  15.3  17.0  87.4   1.5      0.97      16.8   58
1PBA  2     Mixed       6.80    2.6  0.1  12.0  18.9  20.8  156    1.9      1.69      20.8   81
1PGB  2     Mixed       6.00    2.1  0.2  9.7   14.1  15.7  25.4   0.5      0.81      14.5   56
1PKS  2     β-sheet     -1.05   3.8  0.2  15.2  17.9  20.2  136    1.8      1.50      19.7   76
1PSF  3     β-sheet     3.22    2.8  0.2  11.7  16.8  19.4  72.1   1.0      1.23      17.8   69
1RA9  3     Mixed       -2.50   3.4  0.1  22.3  25.5  28.6  402    2.5      4.46      28.1   159
1RIS  2     Mixed       5.90    3.0  0.2  18.4  21.5  23.9  163    1.7      2.25      23.2   97
1SHG  2     β-sheet     1.41    3.0  0.2  10.9  15.1  16.7  92.3   1.6      0.95      16.7   57
1SRL  2     β-sheet     4.04    3.1  0.2  11.0  14.8  16.3  94.5   1.7      0.92      16.5   56
1TIT  3     β-sheet     3.47    4.1  0.2  15.8  18.7  20.8  154    1.7      1.82      20.4   89
1UBQ  2     Mixed       5.90    2.4  0.2  11.5  17.0  18.9  92.1   1.2      1.39      18.2   76
1VII  2     α-helix     11.52   0.4  0.1  4.0   8.1   9.2   4.1    0.1      0.30      8.2    36
1WIT  2     β-sheet     0.41    5.0  0.2  18.9  20.4  22.7  168    1.8      2.07      22.2   93
2A5E  3     Mixed       3.50    2.6  0.1  8.3   22.2  23.9  354    2.3      3.82      24.5   156
2ABD  2     α-helix     6.55    2.3  0.1  12.0  18.2  20.0  77.5   0.9      1.65      19.1   86
2AIT  2     β-sheet     4.20    4.1  0.2  14.4  16.9  18.7  107    1.5      1.36      18.3   74
2CI2  2     Mixed       3.90    2.7  0.2  10.0  15.1  16.9  78.3   1.2      1.06      16.4   65
2CRO  3     α-helix     3.70    1.2  0.1  7.3   14.0  15.5  37.3   0.6      0.95      14.6   65
2HQI  2     Mixed       0.18    4.3  0.2  13.6  16.3  18.4  86.9   1.2      1.26      17.5   72
2PDD  2     α-helix     9.80    1.0  0.1  4.8   10.6  11.5  19.9   0.5      0.48      11.0   43
2RN2  3     Mixed       0.10    3.6  0.1  19.3  27.7  30.9  521    3.4      4.81      31.0   155
1O6D  ?     Knotted     ?       3.1  0.1  18.9  26.2  28.7  515    3.5      4.36      29.7   147
2HA8  ?     Knotted     ?       3.3  0.1  16.2  25.7  28.5  671    4.1      4.84      29.9   162
2K0A  ?     Knotted     ?       3.4  0.1  14.6  22.4  24.5  369    3.4      2.81      25.8   109
2EFV  ?     Knotted     ?       2.1  0.2  12.6  20.0  21.8  147    1.8      1.79      21.8   82
1NS5  3     Knotted     -1.83   2.9  0.1  18.2  27.5  30.4  503    3.3      4.71      30.8   153
1MXI  3     Knotted     -2.56   2.8  0.1  16.7  26.1  29.0  643    4.0      4.85      30.1   161
3MLG  3     Knotted     -6.91   1.2  0.1  21.4  27.7  30.8  481    2.8      5.16      30.5   169

5.2.4 Calculating distance metrics for the unfolded ensemble

To obtain minimal transformations between unfolded and native structures for a given protein, the C$\alpha$ backbone was extracted from the PDB native structure, and 200 coarse-grained unfolded structures were generated using the methods described above. The unfolded structures were then aligned using RMSD and the average (residual) RMSD was calculated. The unfolded structures were then aligned by minimizing MRSD, and the residual MRSD was calculated. Then conformations were further coarse-grained (smoothed) by sampling every other bead, hence reducing the total number of beads.
By the above further coarse-graining, we eliminate all instances of potential self-crossing in which the loop size or elbow size is smaller than three links. Each structure was then transformed to the folded state by the algorithm discussed earlier in section 5.2.1. The self-crossing instances, along with the coordinates of all the beads, were recorded as well. Appropriate data structures were formed and relevant crossing substructures (leg, elbow, and loop) were detected. With topological data structures at hand, the minimal untangling cost was found through the depth-first search in the tree of possible uncrossing operations that was described above.

Finally, the minimal untangling cost, $D_{nx}$, and the total distance, $D$, are calculated for each unfolded conformation. These differ from one unfolded conformation to the other; the ensemble average is recorded and used below. The ensemble averages of MRSD and RMSD are also calculated from the 200 unfolded structures that were generated.

Importance of non-crossing

We define the importance of non-crossing (INX) as the ratio of the extra untangling movement caused by non-crossing constraints to the distance when no such constraints exist, i.e., if the chain behaved as a ghost chain. Mathematically this ratio is defined as $\mathrm{INX} = D_{nx}/(N \cdot \mathrm{MRSD})$.

Other metrics

Following [52], we define Long-range Order (LRO) as:

\[ \mathrm{LRO} = \sum_{i<j} n_{ij}/N , \qquad n_{ij} = \begin{cases} 1 & \text{if } |i-j| > 12 \\ 0 & \text{otherwise} \end{cases} \tag{5.4} \]

where $i$ and $j$ are the sequence indices for two residues for which the C$\alpha$-C$\alpha$ distance is $\le 8$ Å in the native structure.

Likewise we define Relative Contact Order (RCO) following [105]:

\[ \mathrm{RCO} = \frac{1}{L N} \sum_{i<j}^{N} L_{ij} , \tag{5.5} \]

where $N$ is the total number of contacts between nonhydrogen atoms in the protein that are within 6 Å in the native structure, $L$ is the number of residues, and $L_{ij}$ is the sequence separation between contacts in units of residues.
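The two contact-based metrics defined so far translate directly into code. The following Python sketch is illustrative only: the function names are hypothetical, `long_range_order` takes a list of C$\alpha$ coordinates and applies Eq. 5.4 with the 8 Å cutoff and the sequence separation of 12, and `relative_contact_order` assumes the list of contacts (as pairs of residue indices) has already been determined from the heavy-atom coordinates. It uses `math.dist`, which requires Python 3.8 or later.

```python
import math
from typing import List, Sequence, Tuple


def long_range_order(ca: Sequence[Sequence[float]],
                     cutoff: float = 8.0, min_sep: int = 12) -> float:
    """LRO of Eq. 5.4: contacts with |i - j| > min_sep, divided by chain length."""
    n = len(ca)
    count = 0
    for i in range(n):
        for j in range(i + 1, n):
            if j - i > min_sep and math.dist(ca[i], ca[j]) <= cutoff:
                count += 1
    return count / n


def relative_contact_order(contacts: List[Tuple[int, int]],
                           n_residues: int) -> float:
    """RCO of Eq. 5.5: mean sequence separation of contacts over chain length."""
    if not contacts:
        return 0.0
    total_separation = sum(abs(j - i) for i, j in contacts)
    return total_separation / (n_residues * len(contacts))


# Toy chain: 14 beads on a line, with the last bead folded back near the first.
toy_chain = [(3.8 * k, 0.0, 0.0) for k in range(14)]
toy_chain[13] = (0.0, 5.0, 0.0)
toy_lro = long_range_order(toy_chain)
toy_rco = relative_contact_order([(1, 5), (2, 10)], 20)
```

In the toy chain only the pair (first bead, last bead) is both close in space and more than 12 residues apart, so the LRO is 1/14; the two toy contacts have separations 4 and 8, giving an RCO of 12/(20 × 2) = 0.3.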
Similarly, Absolute Contact Order (ACO) [105] is defined to be:

\[ \mathrm{ACO} = \frac{1}{N} \sum_{i<j}^{N} L_{ij} = \mathrm{RCO} \times L \tag{5.6} \]

5.3 Results

Proteins were classified by several criteria:

• 2-state vs. 3-state folders
• α-helix dominated vs. β-sheet dominated vs. mixed
• knotted vs. unknotted proteins

Several questions are answered for each group of proteins:

• What fraction of the total transformation distance is due to non-crossing constraints?
• How do the different order parameters distinguish between the different classes of proteins?
• How do the different order parameters correlate with each other?

In table 5.2, we compare the unfolded ensemble-average of several metrics between different classes of proteins, and perform a p-value analysis based on the Welch t-test. The null hypothesis states that the two samples being compared come from normal distributions that have the same means but possibly different variances. Metrics compared in Table 5.2 are INX, LRO, RCO, ACO, MRSD, RMSD, $D_{nx}$, $D_{nx}/N$, $D$, $D/N$ and $N$.

The most obvious check of the general method outlined in the present paper is to compare the non-crossing distance $D_{nx}$ between knotted and unknotted proteins. Here we see that knotted proteins traverse about 3.5 times the distance as unknotted proteins in avoiding crossings, so that the two classes of proteins are different by this metric. The same conclusion holds for knotted vs. unknotted proteins if we use $D_{nx}/N$, $D$, $D/N$, or INX. Of all metrics, the statistical significance is highest when comparing $D/N$, which is important because the knotted proteins considered here tend to be significantly longer than the unknotted proteins, so that chain length $N$ distinguishes the two classes. Dividing by $N$ partially normalizes the chain-length dependence of $D$, however $D/N$ still correlates remarkably strongly with $N$ when compared for all proteins ($r = 0.824$, see Appendix F, Table F.8).
It was somewhat unusual that MRSD and RMSD distinguished knotted proteins from unknotted proteins better than $D$ (or $D_{nx}$), which accounts for non-crossing. All other quantities, including INX, ACO, and RCO, distinguish knotted from unknotted proteins. The only quantity that fails is LRO.

The importance of noncrossing INX, measuring the ratio of the uncrossing distance $D_{nx}$ to the ghost-chain distance $N \cdot \mathrm{MRSD}$, was largest for knotted proteins, followed by β proteins, with α proteins having the smallest INX. Mixed proteins had an average INX value in between that for α and β proteins.

In distinguishing all-α and all-β proteins, we find that LRO and RCO are by far the best discriminants. Interestingly, INX and $D_{nx}/N$ also discriminate these two classes comparably to or better than ACO does. $D_{nx}$ is marginal, while all other metrics fail. All metrics except for $N$ and $D$ are able to discriminate α from mixed α-β proteins, with LRO performing the best by far. Interestingly, none of the above metrics can distinguish β proteins from mixed α-β proteins.

It is sensible that energetic considerations would be the dominant distinguishing mechanism between two- and three-state folders. Intermediates are typically stabilized energetically. We can nevertheless investigate whether any geometrical quantities discriminate the two classes. Indeed LRO and RCO fail, as does INX. This supports the notion that intermediates are not governed by "topological traps" that are undone by uncrossing motion, but rather are energetically driven. ACO performs marginally. Three-state folders tend to be longer than 2-state folders, so that $N$ distinguishes them and in fact provides the strongest discriminant, consistent with previous results [43]. Interestingly, RMSD, MRSD, and $D$ perform comparably to $N$. However, these measures also correlate strongly with $N$ (see Appendix F, Table F.8). $D/N$, $D_{nx}$ and $D_{nx}/N$ also perform well, but still correlate with $N$, albeit more weakly than the above metrics.
Figure 5.16A shows a scatter plot of all proteins as a function of $D_{nx}/N$ and LRO. Knotted and unknotted proteins are indicated, as are α, β, and mixed α-β proteins. Two- and three-state proteins are indicated as triangles and squares respectively. From the figure, it is easy to visualize how LRO provides a successful discriminant between α/β and α/(mixed) proteins, but is unsuccessful in discriminating β/(mixed), knotted and unknotted, and two- and three-state folders. It is also clear from the figure how $D_{nx}/N$ discriminates knotted from unknotted proteins. One can also see distribution overlap, but nevertheless successful discrimination, between α and β proteins and between α and mixed proteins.

Figure 5.16B shows a scatter plot of all proteins as a function of $D_{nx}$ and $N$, using the same rendering scheme for protein classes as in Figure 5.16A. From the figure, one can see how the metrics correlate with each other, and how they both discriminate knotted from unknotted proteins and 2-state from 3-state proteins. Moreover, one can see how, despite the significant correlation between $D_{nx}$ and $N$, $D_{nx}$ can discriminate α proteins from either β proteins or mixed α/β proteins, while $N$ cannot.

As a control study for the above metrics, we took random selections of half of the proteins, to see if random partitioning of the proteins into two classes resulted in any of the metrics distinguishing the two sets with statistical significance. No metric in this study had significance: the p-values ranged from about 0.32 to 0.94.

Figure 5.17 shows a plot of the statistical significance for all the metrics in Table 5.2 to distinguish various pairs of protein classes: 2-state from 3-state, α from β, α from mixed α/β, β from mixed, and knotted from unknotted. We can define the most consistent discriminator between protein classes as the metric that is statistically significant for the most pairs of classes, and that has the highest statistical significance for those pairs.
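One way to make this criterion concrete is sketched below, using the p-values transcribed from Table 5.2. The scoring rule is an assumption made for illustration, not a definition taken from the text: metrics are ranked first by the number of class pairs with $p < 0.05$ and, among ties, by the summed $-\log_{10} p$ over those significant pairs. The names `P_VALUES`, `PAIRS`, and `consistency_score` are hypothetical.

```python
import math

# Order of the class pairs in each tuple below.
PAIRS = ("2s-3s", "alpha-beta", "beta-mixed", "alpha-mixed", "knot-unknot")

# p-values transcribed from Table 5.2, in the order given by PAIRS.
P_VALUES = {
    "INX":   (3.93e-01, 4.01e-05, 5.71e-01, 5.44e-04, 1.48e-03),
    "LRO":   (9.46e-01, 7.40e-08, 4.27e-01, 6.20e-07, 9.20e-01),
    "RCO":   (5.07e-02, 3.34e-07, 2.68e-01, 3.48e-03, 1.49e-02),
    "ACO":   (4.50e-02, 3.76e-04, 7.08e-01, 1.62e-03, 5.59e-03),
    "MRSD":  (5.89e-04, 1.19e-01, 4.50e-01, 4.11e-02, 1.79e-04),
    "RMSD":  (4.88e-04, 1.14e-01, 4.73e-01, 4.16e-02, 3.18e-04),
    "Dnx/N": (1.71e-02, 1.88e-04, 6.65e-01, 1.56e-03, 5.33e-04),
    "Dnx":   (3.30e-03, 4.50e-02, 2.99e-01, 2.30e-02, 2.05e-03),
    "D":     (8.06e-04, 4.14e-01, 3.10e-01, 1.06e-01, 2.67e-03),
    "D/N":   (8.56e-04, 6.95e-02, 4.67e-01, 2.68e-02, 1.04e-04),
    "N":     (4.17e-04, 6.57e-01, 2.49e-01, 1.59e-01, 3.54e-03),
}


def consistency_score(p_values, threshold=0.05):
    """Return (number of significant pairs, summed -log10 p over those pairs)."""
    significant = [p for p in p_values if p < threshold]
    return len(significant), sum(-math.log10(p) for p in significant)


# Tuples compare element by element, so the count dominates and the sum breaks ties.
ranking = sorted(P_VALUES, key=lambda m: consistency_score(P_VALUES[m]), reverse=True)
best_metric = ranking[0]
```

Other tie-breaking rules (for example, the smallest worst-case p-value) are equally defensible; the sketch only shows that the criterion can be evaluated mechanically from the table.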
By this criterion $D_{nx}/N$ is the most consistent discriminator between the general structural and kinetic classes considered here.

Interestingly, in all cases, the extra distance introduced by non-crossing constraints is a very small fraction (less than 13%) of the MRSD, which represents the ghost distance neglecting non-crossing. This was not an obvious result, but it was encouraging evidence for the reason simple order parameters that contain no explicit penalty for crossing have been so successful historically [5, 8, 21, 30, 94, 98, 105].

[Figure 5.16: two scatter-plot panels. Panel A axes: LRO and ⟨D_nx/N⟩. Panel B axes: ⟨D_nx⟩ and N.]

Figure 5.16: (A) Scatter plot of all proteins as a function of $D_{nx}/N$ and LRO. Knotted proteins are indicated as green circles and are clustered; unknotted proteins are clustered using the black closed curve, and contain α-helical proteins clustered in red, and mixed α-β proteins clustered in magenta. Beta proteins are indicated in blue. Two- and three-state proteins are indicated as triangles and squares respectively. LRO provides a strong discriminant between α and mixed proteins, but not knotted and unknotted proteins, while $D_{nx}/N$ discriminates knotted from unknotted proteins, and moderately discriminates α proteins from mixed proteins. (B) Scatter plot of all proteins as a function of $D_{nx}$ and $N$. The rendering scheme for protein classes is the same as in panel (A). Kinetic 2-state folders are indicated by the black dashed curve. Both $D_{nx}$ and $N$ distinguish knotted from unknotted proteins, and 2-state from 3-state proteins. By projecting α proteins and either mixed α/β or all-β proteins onto each order parameter, one can see how $D_{nx}$ can discriminate α proteins from both mixed and β proteins, while $N$ cannot. This is despite the significant correlation between $D_{nx}$ and $N$.

[Figure 5.17: plot of log(p-value) for the class pairs 2-3s, α-β, α-M, β-M, and knot-unknot, with one series per metric: INX, LRO, RCO, ACO, MRSD, RMSD, D_nx/N, D_nx, D, D/N, N.]

Figure 5.17: Statistical significance for all order parameters in distinguishing between different classes of proteins. The log of the statistical significance is plotted for various pairs of protein classes, so that a higher number indicates better ability to distinguish between different classes. The blue horizontal line indicates a threshold of 5% for statistical significance.

Class                      INX        P_INX            LRO   P_LRO            RCO        P_RCO
2-state folders            7.55e-02   (3.93e-01)       2.7   (9.46e-01)       1.58e-01   (5.07e-02)
3-state folders            8.25e-02                    2.6                    1.31e-01
α-helix proteins           5.21e-02   αβ: 4.01e-05     1.2   αβ: 7.40e-08     1.10e-01   αβ: 3.34e-07
β-sheet proteins           9.04e-02   βm: (5.71e-01)   3.3   βm: (4.27e-01)   1.72e-01   βm: (2.68e-01)
Mixed secondary structure  8.64e-02   αm: 5.44e-04     3.1   αm: 6.20e-07     1.56e-01   αm: 3.48e-03
Unknotted proteins         7.79e-02   1.48e-03         2.6   (9.20e-01)       1.49e-01   1.49e-02
Knotted proteins           1.30e-01                    2.7                    1.24e-01

Class                      ACO    P_ACO            MRSD   P_MRSD           RMSD   P_RMSD
2-state folders            11.4   4.50e-02         16.4   5.89e-04         18.1   4.88e-04
3-state folders            14.7                    21.9                    24.4
α-helix proteins           8.5    αβ: 3.76e-04     15.8   αβ: (1.19e-01)   17.5   αβ: (1.14e-01)
β-sheet proteins           13.9   βm: (7.08e-01)   18.7   βm: (4.50e-01)   20.8   βm: (4.73e-01)
Mixed secondary structure  14.5   αm: 1.62e-03     19.9   αm: 4.11e-02     22.1   αm: 4.16e-02
Unknotted proteins         12.5   5.59e-03         18.3   1.79e-04         20.3   3.18e-04
Knotted proteins           16.9                    25.1                    27.7

Class                      D_nx/N     P_Dnx/N          D_nx   P_Dnx            D      P_D
2-state folders            1.3        1.71e-02         94.9   3.30e-03         1309   8.06e-04
3-state folders            1.9                         238                     2924
α-helix proteins           8.74e-01   αβ: 1.88e-04     80.4   αβ: 4.50e-02     1450   αβ: (4.14e-01)
β-sheet proteins           1.7        βm: (6.65e-01)   146    βm: (2.99e-01)   1802   βm: (3.10e-01)
Mixed secondary structure  1.8        αm: 1.56e-03     195    αm: 2.30e-02     2274   αm: (1.06e-01)
Unknotted proteins         1.5        5.33e-04         144    2.05e-03         1862   2.67e-03
Knotted proteins           3.3                         476                     4074

Class                      D/N    P_D/N            N      P_N
2-state folders            17.6   8.56e-04         71.3   4.17e-04
3-state folders            23.8                    116
α-helix proteins           16.7   αβ: (6.95e-02)   77.9   αβ: (6.57e-01)
β-sheet proteins           20.4   βm: (4.67e-01)   83.4   βm: (2.49e-01)
Mixed secondary structure  21.6   αm: 2.68e-02     98.3   αm: (1.59e-01)
Unknotted proteins         19.7   1.04e-04         86.9   3.54e-03
Knotted proteins           28.4                    140

Pair labels: αβ, α-helix vs. β-sheet; βm, β-sheet vs. mixed; αm, α-helix vs. mixed. P-values shown in parentheses are those above the 5% threshold.

Table 5.2: Order parameters for various classifications of proteins. The data set of 2- and 3-state folders is the same as the data set for α-helical, β-sheet and mixed proteins, and is given in Table 5.1. This is also the same data set as the unknotted proteins. Knotted proteins are separately classified, and not included as either 2-state or 3-state proteins. A discrimination is deemed statistically significant if the probability of the null hypothesis is less than 5%.

Figure 5.18: Renderings of the three proteins whose minimal transformations we investigate in detail. (A) Acyl-coenzyme A binding protein, PDB id 2ABD [3], an all-α protein; (B) Src homology 3 (SH3) domain of phosphatidylinositol 3-kinase, PDB id 1PKS [71], a largely β protein; (C) the designed knotted protein 2ouf-knot, PDB id 3MLG [68].

5.3.1 Quantifying minimal folding pathways

The minimum folding pathway gives the most direct way that an unfolded protein conformation can transform by reconfiguration to the native structure. However, different configurations in the unfolded ensemble transform by different sequences of events. For example, one unfolded conformation may require a leg uncrossing move, followed by a Reidemeister move elsewhere on the chain, followed by an uncrossing move of the opposite leg, while another unfolded conformation may require only a single leg uncrossing move.

The sequence of moves can be represented as a color-coded bar plot, as shown in Figures 5.19-5.21. In these figures, the sequence of moves is taken from right to left, and the width of the bar indicates the non-crossing distance undertaken by that move. A scale bar is given underneath each figure indicating a distance of 100 in units of the link length.
Red bars indicate moves corresponding to the N-terminal leg ($L_N$) of the protein, while green bars indicate moves corresponding to the C-terminal leg ($L_C$). Blue bars indicate Reidemeister "pinch and twist" moves, while cyan bars indicate elbow uncrossing moves.

The typical sequence of moves varies depending on the protein. Figure 5.19 shows the uncrossing transformations of the all-α protein acyl-coenzyme A binding protein (PDB id 2ABD [3], see Figure 5.18A). Panels A and B depict the same set of transformations, but in A they are sorted from largest to smallest values of $L_N$ uncrossing, and in B they are sorted from largest to smallest values of $L_C$ uncrossing. The leg moves in each panel are aligned so that the left ends of the bars corresponding to the moves being sorted are all lined up. Some transformations partway down in panel A do not require an $L_N$ move; these are then ordered from largest to smallest $L_C$ move. The converse is applied in panel B. Some transformations do not require either leg move; these are sorted in decreasing order of the total distance of Reidemeister loop twist moves. Finally, some transformations require only elbow moves; these are sorted from largest to smallest total uncrossing distance.

Figure 5.20 shows the uncrossing transformations for the Src homology 3 (SH3) domain of phosphatidylinositol 3-kinase (PI3K), a largely-β protein (about 23% helix, including 3 short $3_{10}$ helical turns; PDB id 1PKS [71], see Figure 5.18B), sorted analogously to Figure 5.19. Figure 5.21 shows the uncrossing transformations involved in the minimal folding of the designed knotted protein 2ouf-knot (PDB id 3MLG [68], Figure 5.18C).

Interestingly, for the all-α protein 2ABD, 12% of the sample of 172 transformations considered did not require any uncrossing moves, and proceeded directly from the unfolded to the folded conformation. These transformations are not shown in Figure 5.19.
For the β protein and the knotted protein, every transformation that we considered (195 for 1PKS and 90 for 3MLG) required at least one uncrossing move.

As a specific example, the top-most transformation in Figure 5.21 panel B consists of a C-leg move (green) covering 90% of the non-crossing distance, followed by an N-leg move (red) covering 7% of the distance, then a short elbow move (cyan), a short Reidemeister loop move (blue), another short elbow move (cyan), and finally a short Reidemeister move (blue). In some cases the elbow and loop moves commute if they involve different parts of the chain, but generally they do not. For this reason we have not made any attempt to cluster loop and elbow moves; rather, we have just represented them in the order they occur. On the other hand, consecutive leg moves commute and can be taken in either order.

In Figures 5.19-5.21, one can see that significantly more motion is involved in the leg uncrossing moves than in other types of move. The total distance covered by leg moves is 82% for 3MLG, 69% for 1PKS, and 49% for 2ABD. For 3MLG, the total leg move distance is comprised of 44% $L_N$ moves and 38% $L_C$ moves. For 1PKS, leg move distance is comprised of 18% $L_N$ moves and 51% $L_C$ moves. For 2ABD, the distance for the leg moves is roughly symmetric, with 26% $L_N$ and 23% $L_C$.

One difference that can be seen for the all-α protein compared to the β and knotted proteins is in the persistence of the leg motion. For 2ABD, only 24% of the transformations require $L_N$ moves and only 30% of the transformations require $L_C$ moves. On the other hand, the persistence of leg moves is greater in the β protein and greatest in the knotted protein. For 1PKS, $L_N$ and $L_C$ moves persist in 74% and 66% of the transformations respectively. In 3MLG, $L_N$ and $L_C$ moves persist in 92% and 41% of the transformations respectively.

Figure 5.19: Bar plots for the noncrossing operations involved in minimal transformations, for the α protein 2ABD. The sequence of noncrossing operations in the transformation corresponding to a given pair of conformations is represented as a color-coded series of bars, with the sequence of moves going from right to left, and the length of the bar indicating the non-crossing distance undertaken by a particular move. Red bars indicate N-terminal leg ($L_N$) uncrossing, green bars indicate C-terminal leg ($L_C$) uncrossing, blue bars indicate Reidemeister "pinch and twist" loop uncrossing moves, and cyan bars indicate elbow uncrossing moves. The same set of 172 transformations is shown in panels A and B. Panel A sorts uncrossing transformations by rank ordering the following move types, largest to smallest: $L_N$, $L_C$, loop uncrossing, elbow move. Panel B sorts moves by $L_C$, $L_N$, loop uncrossing, elbow move. The scale bar underneath each panel indicates a distance of 100 in units of the link length. The arrow in each panel denotes the "most representative" transformation, as defined in the text.

Figure 5.20: Bar plots of the noncrossing operations for the β-sheet protein 1PKS (see Figure 5.19 and the text for more details). Red bars: $L_N$ uncrossing moves; green bars: $L_C$ uncrossing moves; blue bars: loop uncrossing moves; cyan bars: elbow uncrossing moves. The same set of 195 transformations is shown in panels A and B, sorted as in Figure 5.19. The scale bar underneath each panel indicates a distance of 100 in units of the link length.

Figure 5.21: Bar plots of the noncrossing operations for the knotted protein 3MLG (see Figure 5.19 and the text for more details). Red bars: $L_N$ uncrossing moves; green bars: $L_C$ uncrossing moves; blue bars: loop uncrossing moves; cyan bars: elbow uncrossing moves. The same set of 90 transformations is shown in panels A and B, sorted as in Figure 5.19. The scale bar underneath each panel indicates a distance of 100 in units of the link length. The arrow in each panel denotes the "most representative" transformation, as defined in the text. The transformation located 8 bars up from the bottom of Panel A requires both $L_N$ and $L_C$ moves; however, both leg motions are very small.

Inspection of the transformations for the β protein 1PKS in panels A and B of Figure 5.20 reveals that uncrossing moves generally cover a larger distance than in the α protein 2ABD (the mean uncrossing distance is 136 for 1PKS vs. 77.5 for 2ABD). We also notice that, in contrast to the leg uncrossing moves in 2ABD, both $L_N$ and $L_C$ moves are often required (44% of the transformations require both $L_N$ and $L_C$ moves, compared to 5% for 2ABD). The asymmetry of the protein is manifested in the asymmetry of the leg move distance: the $L_N$ moves are generally shorter than the $L_C$ moves, covering about 1/4 of the total leg move distance. As mentioned above, $L_C$ moves comprise about 51% of the total distance for the 195 transformations in Figure 5.20, while $L_N$ moves only comprise about 18% of the distance on average. Both $L_N$ and $L_C$ moves are persistent, as mentioned above. A leg move of either type is present in 95% of the transformations.

Inspection of the transformations in Figure 5.21 reveals that every transformation requires either an $L_N$ or an $L_C$ move. This is sensible for a knotted protein, and is in contrast to the transformations for the α protein 2ABD, where many transformations do not require any leg uncrossing at all and consist of only short Reidemeister loop and elbow moves. In this sense the diversity of folding routes [110, 111] for the knotted protein 3MLG is the smallest of the proteins considered here, and illustrates the concept that topological constraints induce a pathway-like aspect to the folding mechanism.
The N-terminal $L_N$ leg move is the most persistently required uncrossing move, present in about 92% of the transformations. This is generally the terminal end of the protein that we found was involved in forming the pseudo-trefoil knot. Sometimes, however, the C-terminal end is involved in forming the knot, though this move is less persistent and is present in only 41% of the transformations. However, when an $L_C$ move is undertaken, the distance traversed is significantly greater, as shown in Panel B of Figure 5.21. This asymmetry is a consequence of the asymmetry already present in the native structure of the protein.

Consensus minimal folding pathways

From the transformations described in Figures 5.19-5.21, we see that there are a multitude of different transformations that can fold each protein. The pathways for the α protein 2ABD are more diverse than those for the β or knotted proteins. From the ensemble of transformations for each protein, we can average the amount of motion for each uncrossing move to obtain a quantity representing the consensus or most representative minimal folding pathway for that protein. This takes the form of the histograms in Figure 5.22, with the x-axes representing the order of uncrossing/untangling events, right to left, and the y-axes representing the average amount of motion in each type of move.

The ensemble of untangling transformations can be divided into three different classes: transformations in which leg $L_N$ is the largest move, transformations in which leg $L_C$ is the largest move, and transformations in which an elbow E or loop R (for Reidemeister type I) is the largest move. Moreover, if $L_N$ and $L_C$ moves occur consecutively they can be commuted, so without loss of generality we take the $L_N$ move as occurring before the $L_C$ move in the x-axes of Figure 5.22. The leg moves, if they occur first, are then followed by elbow (E) and/or loop (R) moves, of which there may be several.
In general, the leg moves may both occur before the collection of loop and elbow moves, after them, or may bracket the elbow and loop moves (e.g. second bar in Figure 5.21). By the construction of our approximate algorithm, if two $L_N$ moves were encountered during a trajectory (they were encountered only a few times during the course of our studies), they would be aggregated into one $L_N$ move involving the larger of the two motions, in order to remove any possible redundancy of motion. Hence no more than one $L_N$ or $L_C$ move is obtained for all transformations. We found that three pairs of elbow and loop moves were sufficient to describe about 93% of all transformations (see the x-axes of Figure 5.22). In summary, the sequence $L_N$, $L_C$, R, E, R, E, R, E, $L_N$, $L_C$ (read from left to right) characterized almost all transformations, and so was adopted as a general scheme. Any exceptions simply had more small elbow and loop moves that were of minor consequence; for these transformations we simply accumulated the extra elbow and loop moves into the most appropriate R or E move.

The general recipe for rendering loops R in Figure 5.22 is as follows: if one R move is encountered (regardless of where), half of it is placed in the first R slot and half in the last (third) R slot of the general scheme. If two R moves are encountered, they are placed first and last, and if three R moves are encountered, they are simply partitioned in the order they occurred. For four or more R moves, the middle $N - 2$ moves (where $N$ here denotes the number of R moves) are accumulated into the middle slot in the general scheme. The same recipe is applied to elbow moves E. As a specific example, the first bar in Figure 5.21B consists of $L_C$, $L_N$, $E_1$, $R_1$, $E_2$, $R_2$, which after permutation of the first two leg moves falls into the general scheme above as $L_N$, $L_C$, $R_1$, $E_1$, 0, 0, $R_2$, $E_2$, 0, 0. The bottom-most transformation in Figure 5.21B consists of $R_1$, $R_2$, $R_3$, $E_1$, $E_2$, $E_3$, $L_N$, which becomes 0, 0, $R_1$, $E_1$, $R_2$, $E_2$, $R_3$, $E_3$, $L_N$, 0 in the general scheme.

Figure 5.22 shows histograms of the minimal folding mechanisms, obtained from the above-described procedure. Note again that there are 3 classes of transformation: one where $L_N$ is the largest move, one where $L_C$ is the largest move, and one where either loop R or elbow E is the largest move. Each uncrossing element of the transformation (C-leg, N-leg, Reidemeister loop, or elbow) contributes to the height of the corresponding bar, which represents the average over transformations in that class. The percentage of transformations that fall into each class is given in the legend to panels A-C of Figure 5.22.

Most of the transformations (about 72%) for the α protein 2ABD fall into the class with a dominant loop or elbow move, which itself tends to cover less uncrossing distance than either leg uncrossing (ordinates of Panels A-C, Figure 5.22). This is a signature of a diverse range of folding pathways: minimal folding pathways need not involve obligatory leg uncrossing constraints. In this sense, the β protein 1PKS has a more constrained folding mechanism than the α protein; there is a significantly larger percentage of transformations for which a leg transformation $L_C$ or $L_N$ dominates, though the mean distances undertaken when a leg move does dominate are comparable for $L_C$ and even larger for the α protein for $L_N$.

The knotted protein 3MLG has the most constrained minimal folding pathway. A leg move from either end dominates for 91% of the cases. Even for the transformations where loop or elbow moves dominate, there is still significant $L_N$ motion. The dominant pathways for knotting 3MLG involve leg crossing from either the N or the C terminus. When the C terminus is involved in the minimal transformation, the motion can be significant (Figure 5.22B).

Among all transformations of a given protein, a transformation can be found that is closest to the average transformation for one of the three classes in Figure 5.22.
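The recipe given earlier for placing loop (R) and elbow (E) moves into the ten-slot general scheme $L_N$, $L_C$, R, E, R, E, R, E, $L_N$, $L_C$ can be sketched as follows. This is an illustrative reading of that recipe rather than the code used to produce Figure 5.22: the function names are hypothetical, distances are plain numbers, and the caller is assumed to have already decided whether each leg move precedes or follows the loop and elbow moves.

```python
def fill_three_slots(distances):
    """Distribute an ordered list of R (or E) move distances over three slots."""
    n = len(distances)
    if n == 0:
        return [0.0, 0.0, 0.0]
    if n == 1:
        # A single move is split in half between the first and last slots.
        half = distances[0] / 2
        return [half, 0.0, half]
    if n == 2:
        return [distances[0], 0.0, distances[1]]
    # Three or more: first and last keep their slots, the middle ones are summed.
    return [distances[0], sum(distances[1:-1]), distances[-1]]


def general_scheme(leading_legs, loops, elbows, trailing_legs):
    """Return the 10-element vector (L_N, L_C, R, E, R, E, R, E, L_N, L_C).

    leading_legs and trailing_legs are (L_N, L_C) distance pairs, with 0 for
    an absent move.
    """
    r_slots = fill_three_slots(loops)
    e_slots = fill_three_slots(elbows)
    middle = []
    for r, e in zip(r_slots, e_slots):
        middle.extend([r, e])
    return list(leading_legs) + middle + list(trailing_legs)


# Hypothetical distances for a transformation of the form L_C, L_N, E1, R1, E2, R2.
example = general_scheme(leading_legs=(7, 90), loops=[2, 4], elbows=[1, 3],
                         trailing_legs=(0, 0))
```

Applied to a transformation with three loops, three elbows, and a trailing $L_N$ move, the same function fills all six middle slots in order and leaves the leading leg slots at zero, matching the second example given in the text.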
This consensus transformation has a sequence of moves that, when mapped to the scheme in Figure 5.22, has minimal deviations from the averages shown there. Further, we can find the transformation that has minimal deviation to any of the three classes in Figure 5.22. For the knotted protein 3MLG, the best fit transformation is to the class with $L_N$-dominated moves; for the α protein 2ABD, the best fit transformation is to the class with miscellaneous-dominated moves. For the α protein this is the transformation denoted by a short arrow to the left of the transformation in panels A and B of Figure 5.19, and illustrated in Figure 5.23. For the knotted protein this is the transformation denoted by a short arrow in panels A and B of Figure 5.21, and illustrated in Figure 5.24.

[Figure 5.22: three histogram panels. x-axis (Moves): L_C, L_N, E, R, E, R, E, R, L_C, L_N; y-axis: Average distance. Panel (a), "Consensus pathways with largest L_N move": 3MLG 73.3%, 2ABD 15.1%, 1PKS 16.4%. Panel (b), "Consensus pathways with largest L_C move": 3MLG 17.8%, 2ABD 13.4%, 1PKS 54.4%. Panel (c), "Consensus pathways with largest misc move": 3MLG 8.9%, 2ABD 71.5%, 1PKS 29.2%.]

Figure 5.22: Consensus histograms of the transformations described in Figures 5.19-5.21 (see text for a description of the construction). Each bar represents the distance of a corresponding move type: N or C leg ($L_N$ or $L_C$), elbow E, or loop R. The order of the sequence of moves is taken from right to left along the x-axis. An all-α protein (2ABD), an all-β protein (1PKS), and a knotted protein (3MLG) are considered. (a) Transformations with leg $L_N$ as the largest move. These encompass 15% of the transformations in the α protein, 16% of the transformations in the β protein, and 73% of the transformations for the knotted protein. (b) Transformations with leg $L_C$ as the largest move, which encompass 13% of the α protein transformations, 54% of β protein transformations, and 18% of knotted protein transformations. (c) Transformations with either an elbow E or loop R as the largest move, which encompass 71% of the α protein transformations, 29% of β protein transformations, and 9% of knotted protein transformations.

We can construct schematics of these most-representative folding transformations. Figure 5.23 shows the most representative transformation for the all-α protein 2ABD. It is noteworthy that the transformation requires remarkably little motion: it contains a negligible leg motion, followed by a loop uncrossing of modest distance, followed by a short elbow move that is also inconsequential. In shorthand this is E[9]R[20]$L_N$[1], where the numbers in brackets indicate the cost of moves in units where the link length is unity. In constructing a schematic of the representative transformation in Figure 5.23, we ignore the smaller leg and elbow moves and illustrate the loop move roughly to scale. Although additional crossing points appear from the perspective of the figure, the remainder of the transformation involves simple straight-line motion.

Figure 5.23: Schematic of the most representative transformation for the α protein 2ABD.

Figure 5.24 shows the most representative folding transformation for the knotted protein 3MLG. The sequence of events constructed from the minimal transformation, R[21]R[18]$L_N$[125] in the above notation, consists of a dominant leg move depicted in steps 4 and 5 of the transformation, and two relatively short loop moves that are neglected in the schematic as inconsequential. Loops appear from the perspective of the figure, and the crossing points appear to shift in position; however, the remainder of the transformation involves simple straight-line motion.

Figure 5.24: Schematic of the most representative transformation for the knotted protein 3MLG.

Figure 5.25: Schematic diagram of the residues involved in noncrossing operations for two minimal transformations α and β, and the sequence overlap of moves.

5.3.2 Topological constraints induce folding pathways

From Figures 5.19-5.21, one can see that topological non-crossing constraints can induce pathway-like folding mechanisms, particularly for knotted proteins, and in part for β-sheet proteins as well. The locality of interactions, in conjunction with the simple tertiary arrangement of helices in the α-helical protein, profoundly affects the nature of the transformations that fold the protein, such that the distribution of minimal folding pathways is diverse. Conversely, the knotted protein, although largely helical, has a non-trivial tertiary arrangement, which is manifested in the persistence of a leg crossing move in the minimal folding pathway. In this way, a folding "mechanism" is induced by the geometry of the native structure.

We can quantify this notion by calculating the similarity between minimal folding pathways. To this end we note that the transformation 6 from the bottom in Figure 5.21B, which contains an $L_N$ move followed by 2 short loops and an elbow, should not fundamentally be very different from the transformation 10 from the bottom in that figure, which contains a loop and 2 short elbows followed by a larger $L_N$ move. In general we treat the commonality of the moves as relevant to the overlap, rather than the specific number of residues involved or the order of the moves that arises from the depth-first tree search algorithm.

Thus for each transformation pair we define two sequence overlap vectors in the following way. Overlaying the residues involved in moves for each transformation along the primary sequence on top of each other as in Figure 5.25, we count those moves of the same type that overlap in sequence for both transformations.
So for example in Figure 5.25 the result is two vectors of binary numbers, one with 4 elements and one with 5 elements, based on the overlap of moves of the same type: here the first vector is $\vec{\omega}_\alpha = (1, 1, 0, 1)$ and the second is $\vec{\omega}_\beta = (1, 0, 1, 0, 1)$. To find the pathway overlap, we also record the noncrossing distances of the various transformations, which here would be two vectors of the form $\vec{D}_\alpha = (D^\alpha_{L_N}, D^\alpha_{R_1}, D^\alpha_{R_2}, D^\alpha_{L_C})^\top$ and $\vec{D}_\beta = (D^\beta_{L_N}, D^\beta_{R_1}, D^\beta_{R_2}, D^\beta_{E_1}, D^\beta_{L_C})^\top$. Square matrices $\Omega_\alpha$ and $\Omega_\beta$ are constructed for α and β, where each row is identical and equal to the vector $\vec{\omega}$. This matrix then operates on $\vec{D}$ to make a new vector that has distances for the elements that are nonzero in $\vec{\omega}$, and is the same length for both α and β. In the above example, $\Omega_\alpha \vec{D}_\alpha = (D^\alpha_{L_N}, D^\alpha_{R_1}, D^\alpha_{L_C})^\top$ and $\Omega_\beta \vec{D}_\beta = (D^\beta_{L_N}, D^\beta_{R_2}, D^\beta_{L_C})^\top$. These vectors are then multiplied through the inner product, and divided by the norms of $\vec{D}_\alpha$ and $\vec{D}_\beta$, to obtain the overlap $Q_{\alpha\beta}$. In the above example,

$$Q_{\alpha\beta} = \frac{D^\alpha_{L_N} D^\beta_{L_N} + D^\alpha_{R_1} D^\beta_{R_2} + D^\alpha_{L_C} D^\beta_{L_C}}{\sqrt{\sum_i (D_\alpha)_i^2 \, \sum_j (D_\beta)_j^2}}.$$

In general the formula for the overlap is given by

$$Q_{\alpha\beta} = \frac{(\Omega_\alpha \vec{D}_\alpha) \cdot (\Omega_\beta \vec{D}_\beta)}{\sqrt{(\vec{D}_\alpha \cdot \vec{D}_\alpha)(\vec{D}_\beta \cdot \vec{D}_\beta)}} \tag{5.7}$$

When α = β, $Q_{\alpha\beta} = 1$. In the above example, $Q_{\alpha\beta} < 1$ even if all loops were aligned, because there is no elbow move in transformation α. If two transformations have an identical set of moves, $Q_{\alpha\beta} = 1$ if all the moves have at least partial overlap with a move of the same type in primary sequence. If a loop move in transformation α overlaps two loop moves in transformation β, it is assigned to the loop with larger overlap in primary sequence. For the first two transformations in Figure 5.21A, $Q_{\alpha\beta} = 0.988$, and for the first two transformations in Figure 5.21B, $Q_{\alpha\beta} = 0.999$. On the other hand, for the first and last transformations in Figure 5.21B, $Q_{\alpha\beta} = 0.033$.

Figure 5.26 shows the distributions of overlaps $Q_{\alpha\beta}$ between all pairs of transformations indicated in Figures 5.19-5.21, for the three proteins shown in Figure 5.18.
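The overlap of Equation (5.7) can be evaluated directly once the moves of the two transformations have been matched. The sketch below is illustrative only and makes two assumptions: the matching of same-type moves that overlap in primary sequence has already been done and is passed in as index pairs, and the numerical distances are hypothetical. The function name `pathway_overlap` is likewise not taken from the original work.

```python
import math


def pathway_overlap(d_alpha, d_beta, matched_pairs):
    """Pathway overlap Q between two transformations, following Eq. (5.7).

    d_alpha, d_beta: noncrossing distances of the moves in each transformation.
    matched_pairs:   (i, j) index pairs of same-type moves that overlap in sequence.
    """
    # Numerator: inner product restricted to the matched moves.
    numerator = sum(d_alpha[i] * d_beta[j] for i, j in matched_pairs)
    # Denominator: norms of the full distance vectors, unmatched moves included.
    norm_alpha = math.sqrt(sum(d * d for d in d_alpha))
    norm_beta = math.sqrt(sum(d * d for d in d_beta))
    return numerator / (norm_alpha * norm_beta)


# Layout of the Figure 5.25 example with hypothetical distances:
# alpha = (L_N, R1, R2, L_C), beta = (L_N, R1, R2, E1, L_C);
# L_N matches L_N, alpha's R1 matches beta's R2, and L_C matches L_C.
q_example = pathway_overlap([10.0, 2.0, 1.0, 50.0],
                            [12.0, 3.0, 2.0, 1.0, 40.0],
                            [(0, 0), (1, 2), (3, 4)])

# A transformation compared with itself gives Q = 1.
q_self = pathway_overlap([3.0, 4.0], [3.0, 4.0], [(0, 0), (1, 1)])
```

Because the norms run over all moves while the numerator runs only over matched ones, any unmatched move lowers the overlap below unity, which is the behaviour described in the text for the missing elbow move.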
The distributions show a transition from multiple diverse minimal folding pathways for the α protein, to the emergence of a dominant minimal folding pathway for the knotted protein. The mean overlap $\langle Q \rangle$ between transformations can be obtained by averaging $Q_{\alpha\beta}$ in Equation (5.7) over all pairs of transformations, $\langle Q \rangle = \sum_{\alpha<\beta} Q_{\alpha\beta} / (N(N-1)/2)$. Mean overlaps for each protein are given in the caption to Figure 5.26. This illustrates that topological constraints induce mechanistic pathways in protein folding. We elaborate on this in the Discussion section.

[Figure 5.26. Panels: (a) Alpha-helical (2ABD), (b) Beta-sheet (1PKS), (c) Knotted (3MLG). Horizontal axis: $Q_{\alpha\beta}$; vertical axis: fraction.]

Figure 5.26: Pathway overlap ($Q$) distributions for the 3 proteins in Figure 5.18, as defined by Equation (5.7), operating on the transformations in Figures 5.19-5.21. (a) The pathway overlap distribution for the all-α protein 2ABD shows a large contribution at Q = 0, indicating that a diverse set of minimal transformations fold the protein. The average Q for these transformations is 0.18. (b) The pathway overlap distribution for the β protein shows the emergence of a peak around Q = 1, indicating partial restriction of folding pathways. The peak near Q = 0 still carries more weight in the distribution. The average Q = 0.45. (c) The peak around Q = 1 becomes dominant for the pathway overlap distribution of the knotted protein, indicating the emergence of a dominant restricted minimal folding pathway. The average Q = 0.62.

5.4 Conclusion and discussion

The Euclidean distance between points can be generalized mathematically to find the distance between polymer curves; this can be used to find the minimal folding transformation of a protein. Here, we have developed a method for calculating approximately minimal transformations between unfolded and folded states that accounts for polymer non-crossing constraints. The extra motion due to non-crossing constraints was calculated retroactively for all crossing events of a ghost chain transformation involving straight-line motion of all beads on a coarse-grained model chain containing every other Cα atom, from an ensemble of unfolded conformations, to the folded structure as defined from the coordinates in the protein databank archive. The distances undertaken by the uncrossing events correspond to straight-line motions of all the beads from the conformation before the crossing event, over and around the constraining polymer, and back to the essentially identical polymer conformation immediately after the crossing event. Given a set of chain crossing events, the various ways of undoing the crossings are explored using a depth-first tree search algorithm, and the transformation of least distance is recorded as the minimal transformation.

We found that knotted proteins quite sensibly must undergo more non-crossing motion to fold than unknotted proteins. We also find a similar conclusion for transformations between all-α and all-β proteins; all-α proteins generally undergo very little uncrossing motion during folding. In fact the unfolded ensemble-averaged uncrossing distance $D_{nx}$ can be used as a discrimination measure between various structural and kinetic classes of proteins. Comparing several metrics arising from this work with several common metrics in the literature such as RMSD, absolute contact order (ACO), and long range order (LRO), we found that the most reliable discriminator between structural classes, as well as between two- and three-state proteins, was $D_{nx}/N$.
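The depth-first search over ways of undoing the crossings can be illustrated with a toy model. In this hedged sketch, each candidate uncrossing move is assumed to carry a precomputed distance cost and the set of crossing events it undoes, so that a compound leg move may undo several crossings at once. The move labels, the costs, and the `minimal_untangling` helper are hypothetical and stand in for the geometric calculations of the actual algorithm.

```python
from typing import FrozenSet, List, Tuple

# (label, distance cost, indices of the crossing events the move undoes)
Move = Tuple[str, float, FrozenSet[int]]


def minimal_untangling(n_crossings: int, moves: List[Move]):
    """Depth-first search for the cheapest set of moves undoing all crossings."""
    best_cost = float("inf")
    best_plan: List[str] = []

    def search(resolved: FrozenSet[int], cost: float, plan: List[str]) -> None:
        nonlocal best_cost, best_plan
        if cost >= best_cost:
            return  # prune: this branch cannot beat the best plan found so far
        if len(resolved) == n_crossings:
            best_cost, best_plan = cost, list(plan)
            return
        # Branch on the earliest crossing that is still unresolved.
        target = min(c for c in range(n_crossings) if c not in resolved)
        for label, move_cost, undone in moves:
            if target in undone:
                plan.append(label)
                search(resolved | undone, cost + move_cost, plan)
                plan.pop()

    search(frozenset(), 0.0, [])
    return best_cost, best_plan


# Made-up example: three crossings; one leg move undoes crossings 0 and 1 together.
candidate_moves: List[Move] = [
    ("loop@0", 1.0, frozenset({0})),
    ("loop@1", 1.5, frozenset({1})),
    ("leg@0-1", 2.0, frozenset({0, 1})),
    ("elbow@2", 0.7, frozenset({2})),
]
cost, plan = minimal_untangling(3, candidate_moves)
print(cost, plan)
```

In this made-up example the compound leg move followed by the elbow move is cheaper than treating the three crossings separately, which mirrors the way combined moves can lower the total uncrossing distance.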
On the other hand, even for knotted proteins, the motion involved in avoiding non-crossing constraints is only about 13% of the total ghost chain motion undertaken had the noncrossing constraints been neglected. This was not an obvious result, to this author at least. In contrast to melts of long polymers, chain non-crossing and the resultant entanglement does not appear to be a significant factor in protein folding, at least for the structures and ensembles we have studied here. It is tempting to conclude from this that chain non-crossing constraints play a minor role in determining folding mechanisms. It is nevertheless an empirical fact that knotted proteins fold significantly slower than unknotted proteins. As well, raw percentages of total motion do not take into account the difficulty in certain types of special polymer movement, in particular when the entropy of folding routes is tightly constrained [18, 110, 111, 113]. However, the small percentage of non-crossing motion may offer some explanation as to why simple order parameters, such as absolute contact order, that do not explicitly account for noncrossing in characterizing folding mechanisms, have historically been so successful.

The non-crossing distance was calculated here for a chain of zero thickness, so that non-crossing is decoupled from steric constraints. Finite volume steric effects would likely enhance the importance of non-crossing constraints, since the volume of phase space where chains are non-overlapping is reduced, and thus chain motions must be further altered to respect these additional constraints.

One potential issue in the construction of the algorithm used here is that the minimal transformation is generally not equivalent to a kinetically realizable transformation.
In the depth-first tree search algorithm illustrated in Figure 5.15, the set of crossing points defines a set of uncrossing moves that may be permuted, or combined, for example through a compound leg movement as in Figure 5.11. However, the kinetic sequence of crossing events, in particular those significantly separated in "time" along the minimal transformation, may not be permutable or combinable physically, at least not without modifying the distance travelled.^6 Hence the transformations are treated here as approximations to the true minimal transformations that respect non-crossing.

The algorithm as described above may underrepresent the amount of motion involved in noncrossing by allowing kinetically separated moves to be commutable. On the other hand, the motion assumed in the algorithm to be undertaken by a crossing event contains abrupt changes in the direction of the velocity (corners) at the time of the uncrossing event, and so is larger than the true minimal distance, which contains no corners except possibly at the position of the infinitely thin chain, represented as a discontinuous obstacle. These errors cancel at least in part. It is an interesting topic for future research to develop an improved algorithm that computes minimal transformations, perhaps using these approximate transformations as a starting point for further optimization or modification.

In differentiating two- and three-state folders, chain length provided the best discriminant: three-state folders are longer chains than two-state folders. Other metrics such as RMSD, MRSD, and $D/N$ performed nearly as well. Knotted proteins, as compared to unknotted proteins, are the most distinguishable class of those we investigated.
That is, all metrics we investigated except for LRO significantly differentiated the knotted from unknotted proteins. This is followed by α proteins and mixed α-β proteins, for which all metrics except distance $D$ and chain length $N$ provide discrimination. When considered over all proteins, the physical motion of a polymer required for folding, $D$, correlates with quantities such as ACO or LRO (see Table F.8 in the Supplementary Content); however, when considering only knotted proteins, α-β proteins, or 3-state proteins, $D$ does not correlate with ACO.

Footnote 6: As a hypothetical example, suppose at time $t_1$ a crossing event occurs between residue a, which is 10 residues in from the N-terminus, and residue b somewhere else along the chain. Then at time $t_2$, the next crossing event involves a residue c that is 20 residues in from the N-terminus, and residue d somewhere along the chain. To avoid redundant motion, the minimal transformation is only taken to involve a leg motion between the residues from c to the N-terminus, about point d; this is assumed to encompass the motion in the first leg transformation, even though the crossing events occurred at different times.

The differentiation between structural or kinetic classes of proteins is a separate issue from the question of which order parameters may best correlate with folding rates within a given structural or kinetic class of proteins [52, 60, 61, 102, 106]; this latter question is an interesting topic for future research. Differentiating relevant native-structure based order parameters that provide good correlates of folding kinetics is a complicated issue, in that different structural classes may correlate better or worse with a given order parameter [60].

The mathematical construction of minimal folding transformations can elucidate folding pathways. To this end we have dissected the morphology of protein structure formation for several different native structures.
We found that the folding transformations of knotted proteins, and to a lesser extent β proteins, are dominated by persistent leg uncrossing moves, while α proteins have diverse folding pathways dominated simply by loop uncrossing. A pathway overlap function can then be defined, the structure of which is fundamentally different for α proteins and for knotted proteins. While the overlap function supports the notion of a diverse collection of folding pathways for the α protein, the overlap function for the knotted protein indicates that topological polymer constraints can induce "mechanism" into how a protein folds, i.e., these constraints induce a dominant sequence of events in the folding pathway. This effect is observed to some extent in the β protein we investigated, but is most pronounced for knotted proteins.

Coarse-grained simulation studies of the reversible folder YibK [83] showed that non-native interactions between the C-terminal end and residues towards the middle of the sequence were a prerequisite for reliable folding to the trefoil knotted native conformation [125], the evolutionary origins of which were supported by hydrophobicity and β-sheet propensity profiles of the SpoU methyltransferase family. This suggests a new aspect of evolutionary "design" involving selective non-native interactions, beyond the generic role that non-native interactions may play in accelerating folding rate [23, 108]. Low kinetic success rates of 1-2% in purely structure-based Gō simulations are also seen in coarse-grained simulation studies of YibK [126] and all-atom simulation studies of the small α/β knotted protein MJ0366 [93]. In these studies by Onuchic and colleagues, a "slip-knotting" mechanism driven by native contacts is proposed, rather than the "plug" mechanism in [125], which is driven by non-native contacts. Both slip-knotting and plug mechanisms were described by Mohazab and Plotkin as optimal un-crossing motions of protein chains in [90].
Bioinformatic studies that investigate evolutionary selection by strengthening critical native interactions in knotted proteins are an interesting topic for future research. There is certainly a precedent of selection for native interactions that penalize on-pathway intermediates in ribosomal protein S6 [78, 110, 111].

As well, Lua and Grosberg have found that, due to enhanced return probabilities originating from finite globule size along with secondary structural preferences, protein chains have a smaller degree of interpenetration than collapsed random walks, and thus fewer knots than would be expected for such collapsed random walks [80]. It is still not definitively answered whether this statistical selection against knots in the protein universe is a cause or consequence of the above size and structural preferences.

Chapter 6

The role of polymer non-crossing and geometrical distance in protein folding kinetics

In this chapter we apply the formalism developed in chapter 5 to the problem of folding kinetics. Then we compare different rate predictors across different classes of proteins and see that distance-like metrics do very well in predicting the folding rate of 3-state folders.

6.1 Introduction

Energetic driving forces towards the folded structure are essential for rapid and reliable folding. Models that randomly search for either the native ensemble or a loosened native-like topomer ensemble show slow kinetics and folding mechanisms that do not correlate with those determined from experimental φ-values [133]. The theory that strongly attractive native interactions bias a protein's configurational search towards the biologically-functional structure [12, 13, 29, 76, 128] leads to the notion that some topological or geometrical aspects of the native structures of various proteins could determine their folding rates and/or folding mechanisms [5, 20, 31, 34, 37, 52, 53, 60, 61, 102, 105, 106, 134].
However, no single parameter appears to be an accurate predictor of folding kinetics over all structural and kinetic classes. While some quantities such as contact order, relative contact order, and long range order (LRO) correlated well with the folding rate for 2-state proteins [52, 105, 106], they correlated poorly with the folding rates of 3-state proteins, where the size of the protein, as quantified simply by the chain length, seemed to be the best predictor [61].

Istomin et al. [60] found that chain length also correlated well with folding rate for the various structural classes of two-state proteins: α, β, and mixed α-β, when considered separately. They also found a strong correlation between LRO and folding rate when all 2-state proteins were considered together.

Information on the folding mechanism is gained from determining which quantity correlates with rate for a given structural or kinetic class of protein. The fact that ACO or LRO correlates well with rate for 2-state proteins indicates a dominance of the process of loop closure, through the formation of native contacts, as the rate limiting step in folding.

Energy also must play a role in driving folding and thus determining folding rates. Protein rates have been shown to correlate with stability for 2-state proteins [73].

Folding rates have also been shown to correlate with the variance of contact probability [78, 79, 110, 111], which yields a strong correlation between rate and the variance of experimentally-determined φ-values for two-state folders [102].

Perhaps surprisingly, the RMSD has not been used as an order parameter in predicting the rates of proteins. This is likely due to the fact that information on a pair of structures rather than a single structure is needed to calculate it.
Given a generated unfolded ensemble, the RMSD can be calculated between each unfolded conformation and the native conformation, and an average RMSD between unfolded and folded states can be calculated, and subsequently tested as a determinant of rate.

The RMSD can be thought of as a least squares fit between two structures. It may also be thought of as the straight-line Euclidean distance between two structures in a high-dimensional space of dimension 3N, where N is the number of atoms or residues considered in the protein.

If several intermediate states are known along the pathway of a transformation between a pair of structures, then the RMSD may be calculated consecutively for each successive pair. In that approach, energy is explicitly considered as modifying the pathway taken, and RMSD is accumulated along the pathway through the transition states [119].

However, as we have mentioned on various occasions, the RMSD is not equivalent to the total amount of motion a protein or polymer must undergo in transforming between structures, even in the absence of steric constraints enforcing deviations from straight-line motion. The accumulated straight-line motion of all residues is given by the number of residues times the mean root squared distance (MRSD) [89, 90, 109]. The MRSD is never larger than the RMSD.

As a rate-determining order parameter, the Euclidean distance can be tested in the same way as ACO or LRO, so long as an unfolded ensemble is generated. For each protein we obtain minimal transformations between individual structures in an unfolded ensemble and the corresponding native structure. The ensemble average of the quantity for each of the proteins forms the rate-determining order parameter.

6.2 Methods

We quickly recap the steps involved in generating ensemble averages for the quantities that require a starting and ending conformation of the protein. For a given protein, the PDB file is selected, and the Cα backbone is extracted.
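To make the distinction between RMSD and MRSD concrete, the following sketch evaluates both for a pair of conformations that are assumed to be already aligned. The bead coordinates and the function names are hypothetical and serve only as an illustration; no alignment step is performed here.

```python
import math


def bead_displacements(conf_a, conf_b):
    """Straight-line distance moved by each bead between two conformations."""
    if len(conf_a) != len(conf_b):
        raise ValueError("conformations must have the same number of beads")
    return [math.dist(p, q) for p, q in zip(conf_a, conf_b)]


def rmsd(conf_a, conf_b):
    """Root mean squared displacement of the beads."""
    d = bead_displacements(conf_a, conf_b)
    return math.sqrt(sum(x * x for x in d) / len(d))


def mrsd(conf_a, conf_b):
    """Mean root squared distance: the plain average of the bead displacements."""
    d = bead_displacements(conf_a, conf_b)
    return sum(d) / len(d)


# Made-up three-bead conformations (arbitrary length units).
unfolded = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
folded = [(0.0, 0.0, 0.0), (1.0, 1.0, 0.0), (2.0, 0.0, 2.0)]

rmsd_value = rmsd(unfolded, folded)
mrsd_value = mrsd(unfolded, folded)
accumulated = len(unfolded) * mrsd_value  # total straight-line motion of all beads
print(rmsd_value, mrsd_value, accumulated)
```

Here the beads move by 0, 1, and 2 units, so the MRSD is 1 and the accumulated straight-line motion is 3, while the RMSD is the square root of 5/3, which is larger than the MRSD because squaring weights the large displacements more heavily.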
Using the methods described in section 5.2.2, 200 coarse-grained unfolded structures are generated. The unfolded structures are then aligned using RMSD and the average (residual) RMSD is calculated. The unfolded structures are then aligned by minimizing MRSD, and the residual MRSD is calculated. Then conformations are further coarse-grained (smoothed) by sampling every other bead, hence reducing the total number of beads. Then each structure is transformed to the folded state by the algorithm discussed in section 5.2.1 and the minimal untangling cost is found. In the end, quantities such as the minimal untangling cost ($D_{nx}$), MRSD, and RMSD are calculated for each unfolded conformation. These differ from one unfolded conformation to the other; the ensemble average is recorded and used below.

6.2.1 Proteins used with rate

The proteins used in this study are given in Table 5.1. They consist of 25 2-state folders, 13 3-state folders, 11 all α-helix proteins, 14 all β-sheet proteins, 13 α-β proteins, and 5 knotted proteins.

6.3 Results

We use the same classification as in chapter 5 for the proteins. Proteins are classified by several criteria:

- 2-state vs. 3-state folders
- α-helix dominated vs. β-sheet dominated vs. mixed
- knotted vs. unknotted proteins

Two-state Proteins
Order parameter   Kendall corr.   Kendall p-value   Pearson corr.   Pearson p-value
LRO               -0.696          1.09e-06          -0.875          1.10e-08
RCO               -0.711          6.26e-07          -0.854          5.73e-08
ACO               -0.464          1.15e-03          -0.781          4.15e-06
⟨MRSD⟩            (-0.224)        (0.117)           -0.535          5.89e-03
⟨RMSD⟩            (-0.257)        (0.072)           -0.560          3.61e-03
⟨Dnx⟩             -0.437          2.18e-03          -0.624          8.53e-04
⟨Dnx⟩/N           -0.471          9.72e-04          -0.680          1.86e-04
⟨D⟩               (-0.184)        (0.198)           (-0.428)        (0.033)
⟨D⟩/N             (-0.250)        (0.079)           -0.573          2.77e-03
N                 (-0.131)        (0.358)           (-0.337)        (0.099)

Table 6.1: Two-state proteins: correlation between folding rate and various order parameters indicated.
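The Kendall and Pearson coefficients reported in the tables of this chapter can be computed along the following lines. This is a sketch only: the text does not state which software or which Kendall variant was used, so the tie-corrected tau-b form is assumed here, p-values are omitted, and the data (including the use of a logarithmic rate) are made up for illustration.

```python
import math
from itertools import combinations


def kendall_tau_b(x, y):
    """Kendall rank correlation with the tau-b correction for ties."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two samples of equal length >= 2")
    concordant = discordant = ties_x = ties_y = 0
    for (x1, y1), (x2, y2) in combinations(list(zip(x, y)), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0:
            ties_x += 1
        if dy == 0:
            ties_y += 1
        if dx * dy > 0:
            concordant += 1
        elif dx * dy < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    denom = math.sqrt((n_pairs - ties_x) * (n_pairs - ties_y))
    if denom == 0:
        raise ValueError("correlation undefined when one sample is constant")
    return (concordant - discordant) / denom


def pearson_r(x, y):
    """Pearson linear correlation coefficient."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data: an order parameter and a log folding rate for five proteins.
order_parameter = [1.0, 2.0, 3.0, 4.0, 5.0]
log_rate = [5.0, 4.0, 3.0, 2.0, 1.5]

tau = kendall_tau_b(order_parameter, log_rate)
r = pearson_r(order_parameter, log_rate)
print(tau, r)
```

Because the made-up rates decrease monotonically with the order parameter, the rank correlation is exactly -1, while the Pearson coefficient is slightly weaker since the relationship is not perfectly linear.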
We are specifically interested in how non-crossing distance, total distance, and other distance-related order parameters correlate with folding rate for different classes of proteins, and how they compare with other order parameters. The results are summarized in the tables.

Three-state Proteins
Order parameter   Kendall corr.   Kendall p-value   Pearson corr.   Pearson p-value
LRO               (-0.154)        (0.464)           (-0.292)        (0.332)
RCO               (-0.077)        (0.714)           (0.029)         (0.926)
ACO               (-0.462)        (0.028)           (-0.658)        (0.014)
MRSD              (-0.538)        (0.010)           (-0.672)        (0.012)
RMSD              -0.564          7.27e-03          -0.685          9.74e-03
⟨Dnx⟩             -0.564          7.27e-03          (-0.647)        (0.017)
⟨Dnx⟩/N           (-0.462)        (0.028)           (-0.601)        (0.030)
⟨D⟩               (-0.513)        (0.015)           -0.690          9.11e-03
⟨D⟩/N             (-0.538)        (0.010)           (-0.670)        (0.012)
N                 (-0.503)        (0.017)           (-0.644)        (0.018)

Table 6.2: Three-state proteins: correlation between folding rate and various order parameters indicated.

Figure 6.1: Correlation between folding rate and RMSD for three-state folders.

2-state α-helix Proteins
Order parameter   Kendall corr.   Kendall p-value   Pearson corr.   Pearson p-value
LRO               (-0.691)        (0.017)           (-0.817)        (0.013)
RCO               (-0.357)        (0.216)           (-0.367)        (0.371)
ACO               (-0.571)        (0.048)           -0.835          9.92e-03
MRSD              (-0.429)        (0.138)           (-0.714)        (0.047)
RMSD              (-0.500)        (0.083)           (-0.716)        (0.046)
⟨Dnx⟩             (-0.429)        (0.138)           (-0.525)        (0.181)
⟨Dnx⟩/N           (-0.143)        (0.621)           (-0.326)        (0.431)
⟨D⟩               (-0.429)        (0.138)           (-0.717)        (0.045)
⟨D⟩/N             (-0.429)        (0.138)           (-0.689)        (0.059)
N                 (-0.327)        (0.257)           (-0.741)        (0.035)

Table 6.3: α-helix dominated proteins that are 2-state folders: correlation between folding rate and various order parameters indicated. The sample size is 8.

α-helix Proteins
Order parameter   Kendall corr.   Kendall p-value   Pearson corr.   Pearson p-value
LRO               (-0.477)        (0.041)           (-0.384)        (0.243)
RCO               (0.018)         (0.938)           (0.074)         (0.828)
ACO               (-0.491)        (0.036)           (-0.710)        (0.014)
MRSD              (-0.418)        (0.073)           (-0.728)        (0.011)
RMSD              (-0.491)        (0.036)           (-0.733)        (0.010)
⟨Dnx⟩             (-0.309)        (0.186)           (-0.670)        (0.024)
⟨Dnx⟩/N           (-0.164)        (0.484)           (-0.523)        (0.099)
⟨D⟩               (-0.418)        (0.073)           -0.740          9.22e-03
⟨D⟩/N             (-0.382)        (0.102)           (-0.715)        (0.013)
N                 (-0.330)        (0.157)           -0.747          8.22e-03

Table 6.4: α-helix dominated proteins (both 2- and 3-state): correlation between folding rate and various order parameters indicated. The sample size is 11, with 8 of them being 2-state folders and 3 being 3-state folders.

2-state β-sheet Proteins
Order parameter   Kendall corr.   Kendall p-value   Pearson corr.   Pearson p-value
LRO               (-0.278)        (0.297)           (-0.471)        (0.200)
RCO               -0.722          6.71e-03          (-0.779)        (0.013)
ACO               (-0.222)        (0.404)           (-0.649)        (0.059)
MRSD              (-0.111)        (0.677)           (-0.390)        (0.300)
RMSD              (-0.111)        (0.677)           (-0.478)        (0.193)
⟨Dnx⟩             (-0.333)        (0.211)           (-0.689)        (0.040)
⟨Dnx⟩/N           (-0.611)        (0.022)           -0.814          7.51e-03
⟨D⟩               (-0.167)        (0.532)           (-0.465)        (0.207)
⟨D⟩/N             (-0.111)        (0.677)           (-0.452)        (0.222)
N                 (-0.167)        (0.532)           (-0.435)        (0.242)

Table 6.5: β-sheet dominated proteins that are 2-state folders: correlation with various order parameters indicated. The sample size is 9.

The performance of RMSD over all different classes of proteins can be compared to that of ACO and D. See Figure 6.2.

6.4 Conclusion and discussion

From Table 6.1 it is concluded that long range contact formation governs the rate of folding for 2-state folders. From Table 6.2 we infer that traditional measures fail to predict the kinetic mechanism of folding for 3-state proteins. However, a measure of native geometry still does correlate with folding rate, and thus can speak to the mechanism of folding. By native geometry we do not necessarily mean native topology, i.e. the chain properties of the network of native contacts, but rather something closer to the distance that all parts of the polymer chain have to move.
Native geometries that on average require large distances to be traveled via stochastic motion tend to have slower rates.

One might suspect that the physical motion of a polymer required for folding would correlate with quantities such as ACO or LRO; however, looking at the cross correlation tables (see Appendix F), it is seen that D only correlates with ACO in a significant manner when we consider all the proteins. If we look at 3-state folders or at only knotted proteins, even this correlation is not significant.

Table 6.4, consisting only of α-helical proteins, does not show correlation with LRO. This indicates that it is necessary to include β-proteins in the sample, so that there is a discrepancy in LRO between members of the ensemble, for LRO to be useful as a predictor. LRO does not do well with β proteins and mixed proteins either (see Tables 6.6 and 6.8). In fact it seems that adding 3-state folders to the mix is largely responsible for this lack of correlation; compare with Tables 6.3, 6.5 and 6.7, for which 3-state folders have been removed. So the tentative conclusion is that addition of 3-state folders and β proteins ruins the correlation. However, larger samples are required for stronger and more definitive conclusions in the cases where we divide and subdivide the classes.

β-sheet Proteins
Order parameter   Kendall corr.   Kendall p-value   Pearson corr.   Pearson p-value
LRO               (0.099)         (0.622)           (-0.064)        (0.827)
RCO               (-0.099)        (0.622)           (0.090)         (0.761)
ACO               (-0.429)        (0.033)           -0.675          8.03e-03
MRSD              (-0.319)        (0.112)           (-0.528)        (0.052)
RMSD              (-0.407)        (0.043)           (-0.556)        (0.039)
⟨Dnx⟩             (-0.385)        (0.055)           (-0.573)        (0.032)
⟨Dnx⟩/N           (-0.495)        (0.014)           (-0.583)        (0.029)
⟨D⟩               (-0.385)        (0.055)           (-0.545)        (0.044)
⟨D⟩/N             (-0.319)        (0.112)           (-0.540)        (0.046)
N                 (-0.398)        (0.048)           (-0.550)        (0.041)

Table 6.6: β-sheet dominated proteins (both 2- and 3-state): correlation with various order parameters indicated. The sample size is 14, with 9 of them being 2-state folders and 5 being 3-state folders.

2-state Mixed secondary structure Proteins
Order parameter   Kendall corr.   Kendall p-value   Pearson corr.   Pearson p-value
LRO               (-0.691)        (0.017)           -0.908          1.79e-03
RCO               (-0.546)        (0.059)           (-0.759)        (0.029)
ACO               (-0.327)        (0.257)           (-0.561)        (0.148)
MRSD              (-0.109)        (0.705)           (-0.296)        (0.477)
RMSD              (-0.109)        (0.705)           (-0.326)        (0.431)
⟨Dnx⟩             (-0.182)        (0.529)           (-0.317)        (0.444)
⟨Dnx⟩/N           (-0.109)        (0.705)           (-0.288)        (0.490)
⟨D⟩               (-0.182)        (0.529)           (-0.305)        (0.463)
⟨D⟩/N             (-0.182)        (0.529)           (-0.302)        (0.468)
N                 (-0.182)        (0.529)           (-0.279)        (0.503)

Table 6.7: Mixed secondary structure proteins that are 2-state folders: correlation with various order parameters indicated. The sample size is 8.

Mixed secondary structure Proteins
Order parameter   Kendall corr.   Kendall p-value   Pearson corr.   Pearson p-value
LRO               (-0.400)        (0.057)           -0.703          7.35e-03
RCO               (0.039)         (0.854)           (-0.039)        (0.898)
ACO               (-0.452)        (0.032)           (-0.632)        (0.021)
MRSD              (-0.400)        (0.057)           (-0.567)        (0.043)
RMSD              (-0.374)        (0.075)           (-0.591)        (0.033)
⟨Dnx⟩             (-0.400)        (0.057)           (-0.573)        (0.041)
⟨Dnx⟩/N           (-0.400)        (0.057)           (-0.547)        (0.053)
⟨D⟩               (-0.426)        (0.043)           (-0.587)        (0.035)
⟨D⟩/N             (-0.426)        (0.043)           (-0.570)        (0.042)
N                 (-0.426)        (0.043)           (-0.551)        (0.051)

Table 6.8: Mixed secondary structure proteins (both 2- and 3-state): correlation with various order parameters indicated. The sample size is 13, with 8 of them being 2-state folders and 5 being 3-state folders.

All proteins
Order parameter   Kendall corr.   Kendall p-value   Pearson corr.   Pearson p-value
LRO               -0.357          1.02e-03          -0.475          1.69e-03
RCO               (-0.155)        (0.153)           (-0.199)        (0.213)
ACO               -0.541          6.40e-07          -0.798          4.24e-10
MRSD              -0.460          2.27e-05          -0.742          2.80e-08
RMSD              -0.489          6.58e-06          -0.753          1.34e-08
⟨Dnx⟩             -0.538          7.18e-07          -0.723          9.58e-08
⟨Dnx⟩/N           -0.560          2.49e-07          -0.750          1.62e-08
⟨D⟩               -0.460          2.27e-05          -0.720          1.15e-07
⟨D⟩/N             -0.467          1.67e-05          -0.756          1.10e-08
N                 -0.441          4.84e-05          -0.689          6.45e-07

Table 6.9: Correlation of folding rate of all the studied proteins, for which folding rates were available, with various order parameters. The sample size is 41.

[Figure 6.2. Vertical axis: absolute value of correlation with rate. Horizontal axis: protein classes (2-state, 3-state, 2-state α, all α, 2-state β, all β, 2-state mixed, all mixed, all proteins). Series: RMSD, RCO, ACO, LRO, N, D/N.]

Figure 6.2: Absolute value of Kendall correlation of a few order parameters and rate, across different classes of proteins.

Protein class        Best Kendall corr.    Best Pearson corr.
2-state              RCO                   LRO
3-state              ⟨RMSD⟩ & ⟨D/N⟩        ⟨RMSD⟩
2-state α-helix      (LRO)                 ACO
all α-helix          (⟨RMSD⟩) & (ACO)      N
2-state β-sheet      RCO                   (RCO)
all β-sheet          ACO                   ACO
2-state Mixed Str.   (LRO)                 LRO
all Mixed Str.       (ACO)                 LRO
all prots            ACO                   ACO

Table 6.10: Best rate predictors for different classes of proteins, based on Kendall and Pearson correlations. The items in brackets made the 5% p-value cutoff but not the 1% cutoff. As described above, angle brackets here indicate an average over the conformations in the unfolded ensemble.

The fact that ACO is still a useful predictor of rates indicates that the mechanism of folding is still largely one of native contact formation and closure of loops. Not surprisingly, $D_{nx}$ did not predict the rates. The non-crossing distance is small; polymer non-crossing is unlikely to play a role for these α-helical proteins.

When it comes to correlation with folding rate for α-proteins, the improvement in RMSD over ACO is probably not significant (the correlation coefficients were -0.73 vs -0.71), although it is interesting that a simple measure such as RMSD is doing so well.
Interestingly enough, when 3-state folders are removed ACO becomes a better predictor for rate compared to RMSD.

When it comes to β-sheet proteins (Table 6.6), it seems that all standard measures fail except for ACO, which barely holds on (in Pearson coefficient only). Interestingly enough, when 3-state folders are removed ACO performs even worse, but RCO significantly improves and becomes the best predictor. But as stated, addition of a few 3-state folders (50%) makes the advantage of RCO disappear. Remember from Table 6.2 that RCO was the worst predictor of rate in 3-state folders.

Table 6.8 shows that only LRO is a predictor of folding rates for mixed α-β proteins. The mechanism may be governed significantly (but not exclusively) by loop closure and long range contact formation. As can be seen from Table 6.7, LRO is by far the best predictor when it comes to 2-state α-β proteins. Adding 3-state proteins to the mix erodes its advantage and gives D and RMSD (and to a lesser degree ACO) a better competitive edge. It can be inferred that adding enough proteins that are 3-state folders will make RMSD or perhaps D the best predictor for this class of proteins.

Chapter 7

Conclusion and further thoughts

In this thesis, we introduced the mathematical concept of the generalized Euclidean distance D between extended objects. We saw that the problem can be formulated as a variational problem. We then discretized the problem to that of a system of links and found extremum solutions to the corresponding Euler-Lagrange equations. Subsequently we derived the necessary and sufficient conditions for the extrema to be local minima. We explored toy models with a very small number of links and then extended the results to many links.

We posed the idea that D can be considered an order parameter when looking at the problem of protein folding, and showed that in its fullest form D does not have some of the problems of simple geometric order parameters such as Q and RMSD.
We saw that a zeroth approximation of D leads to an order parameter similar to, but different from, RMSD, called the Mean Root Squared Distance (MRSD). Using MRSD and Q as order parameters, we constructed free energy potential surfaces.

According to the energy landscape paradigm in protein folding, there are multiple folding pathways for unfolded conformations. With that in mind, we set out to calculate "minimal" folding pathways for fragments of proteins. By minimal folding pathways we mean geometrical folding pathways that minimize D traveled between the initial and final conformations. In doing so we also laid down some foundations for the systematic treatment of non-crossing constraints, as inequality constraints in the calculus of variations. We saw that folding pathways for α-helices are shorter than those of β-hairpins. It was also seen that non-crossing constraints can lead to significant extra movement (in our case snake-like movement) on the polymer side for seemingly close structures.

When minimizing total distance traveled, we were faced with the problem of structurally aligning the initial (unfolded) and final (folded) conformations before calculating the minimal distance. We saw that using MRSD instead of RMSD makes a global difference in the alignment when it comes to aligning β-hairpins and their corresponding unfolded structures. Therefore we further investigated the problem of structural alignment using different cost functions, including the full D, using ideal β-hairpins of different sizes as models. In doing so we introduced some higher order approximations to the true distance D. The results showed that, for a large number of residues, the dimensionless distance $D/N(N-1)\ell$ converges to the same value when D or MRSD or higher approximations of D are used as cost functions, but not when RMSD is used. This allows us to use MRSD as a computationally inexpensive alternative to D.
We then focused on the role of non-crossing constraints when minimally folding full proteins. Using concepts found in the mathematical theory of knots, we developed the formalism of finding approximate minimal untangling moves arising from non-crossing constraints during protein folding. The canonical untangling moves that we considered were leg moves, elbow moves and loop twists. We treated them as ordered operators. Solutions to minimal untangling problems were reduced to a depth-first search in the tree of possible untangling operator applications.

Geometrically speaking, the protein folding process is a many-to-one process, meaning that many different conformations fold to a single conformation; hence any exploration of the role of non-crossing and untangling moves should consider an ensemble of unfolded structures. Therefore we had to develop methods that quickly generate coarse-grained unfolded ensembles from given coarse-grained native structures.

Having developed the untangling formalism and the unfolded ensemble generator, we considered a few dozen proteins across different classes, including knotted proteins. We saw that perturbations caused by extra untangling moves introduced by non-crossing constraints play a small role in the total amount of chain movement. It was also observed that non-crossing constraints play a significantly larger role in the folding process of knotted proteins.

We observed that the perturbations in distance caused by the non-crossing constraints, when normalized by the zeroth approximation distance, are not different between 2-state folders and 3-state folders. However, since 3-state folders are on average significantly longer, all of the non-normalized distance-like quantities were larger. Across different classes of proteins, sorted by their secondary structure, we saw that for the unknotted proteins, non-crossing constraints are the least important in α-helical proteins and the most important in β-sheet proteins.
Furthermore, looking at the ensemble of untangling moves we constructed consensus unfolding pathways for several proteins, in particular a knotted protein. By studying overlaps of untangling operations, for ensembles of proteins, we observed that folding pathway mechanisms can be induced by the geometry of the native structure in the knotted protein. Such bottlenecks did not exist for the alpha helical protein, but existed to some extent for the beta sheet protein.

We further extended our studies to protein kinetics and possible correlations between folding rates and various distance-like quantities. We saw that distance-like quantities have success in predicting folding rates for 3-state folders. The normalized non-crossing distance $D_{nx}/N$, as well as D and others, significantly correlate with the folding rate of 3-state proteins. The surprising champion, however, was RMSD, which had the best correlation coefficient (although marginally). In short, for 3-state folders we saw that a few of the quantities that we proposed for the first time to be rate predictors performed better than all the traditional rate predictors: LRO, RCO, ACO and N. For 2-state folders LRO, RCO, and ACO performed better.

Future research in this subject can have two general directions: refinements to the model, and applications of the model to new areas. We will briefly sketch a few lines for future endeavors.

A possible refinement to the model would be the introduction of persistence length and curvature constraints to D. During the course of research we saw on a few occasions that some of the angles between the links of the conformation become very large, albeit for a short time, during the transformation. Curvature constraints on the chain can be added as inequality constraints to the variational problem, hence ensuring that $\theta_i \le \theta_{MAX}$ at all times. Introducing a soft potential for the angle between the links is another way to add curvature constraints.
It would be interesting to see how much deviation from the ideal extremum path and distance (obtained in the absence of such constraints) we get when curvature constraints are introduced, and how this effect compares to that of non-crossing constraints.

A limitation of the model is that the thickness of the chain is zero. The non-crossing constraints therefore do not take effect unless the two crossing links get extremely close. Changing the model from curves to tubes can improve this aspect.

The other refinement that we can make to the model is to allow sidechains. Currently D is defined for two curves or two chains. In principle there is nothing to stop the generalization to tree-like objects. This will add more ODEs to the set of coupled ODEs (see Eqs. 2.15a–2.15c) and will distort the block diagonal nature of the ODE matrix.

Another enhancement to the model could be to introduce some energetics into D when considering protein folding. We can introduce native interactions into the formal distance functional that is to be optimized. From a practical point of view, adding energetics in the form of native interactions to the model is equivalent to first finding the potential V that induces a folding pathway similar to what we introduced in chapter 5, and then adding native Gō-like terms to the potential.

Even without adding the energetics, a very interesting question that can be answered in future studies is the correlation of D with commitment probability. To address this question for any given protein we can proceed as follows: we simulate the protein or the protein ensemble using a molecular dynamics package, e.g. GROMACS, then sample the system at regular intervals, extract the conformations, and calculate D for each of them. The free energy surface obtained will be a two-well system for a 2-state folder.
A small fraction of the sampled states will be transition states that sit somewhere between the two wells. We can correlate D with commitment probability by looking at the fate of the transition states (either folding completely or unfolding). Our analysis in chapter 5 has shown that the effect of non-crossing constraints on D is about 0.07. This means that we can use MRSD, which is computationally inexpensive compared to D, to approximate D. We can benchmark our results against Q or RMSD.

It could also be informative to apply our formalism for D to the reaction pathways of 3-state folders. Considering that the pathway is Unfolded (U) → Intermediate (I) → Folded (F), it would be interesting to compute D between all the individual pairs and benchmark against, for example, RMSD and Q.

Also, as the folding rates for an ever increasing number of knotted proteins are determined experimentally, it would be beneficial to see how distance-like quantities correlate with the folding rates of knotted proteins. Considering the success of such quantities in correlating well with the folding kinetics of 3-state folders, we are optimistic that they will do very well when it comes to knotted proteins.

Another area of interest for future studies is the thermodynamic untangling distance $\langle D_{nx} \rangle_T$ between two conformations. The formalism that we developed in chapter 5 concerns itself with the "minimal" untangling cost. However, the minimal untangling cost is not necessarily the most entropically favorable. There is a well-defined set of untangling moves that gives the minimal untangling cost for a given transformation. There might, however, be a very large set of different untangling moves, each of which gives only a slightly higher untangling cost than the minimal one. From a thermodynamic perspective, under non-zero "temperature" the system is more likely to untangle itself following more entropically favorable untangling operations.
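The competition between the minimal cost and the number of near-minimal alternatives can be made concrete with a small sketch. The text does not define the thermodynamic average, so the Boltzmann-like weighting used here, in which the untangling cost plays the role of an energy and the "temperature" is measured in the same units, is purely an assumption for illustration; the function name `thermal_untangling_distance` and the numbers are hypothetical.

```python
import math
from typing import Sequence


def thermal_untangling_distance(costs: Sequence[float], temperature: float) -> float:
    """Boltzmann-weighted mean of candidate untangling costs (assumed weighting)."""
    d_min = min(costs)
    # Subtract the minimum before exponentiating for numerical stability.
    weights = [math.exp(-(d - d_min) / temperature) for d in costs]
    z = sum(weights)
    return sum(d * w for d, w in zip(costs, weights)) / z


# One minimal route of cost 1.0 and fifty slightly more expensive routes of cost 1.2.
costs = [1.0] + [1.2] * 50
cold = thermal_untangling_distance(costs, 0.01)
warm = thermal_untangling_distance(costs, 1.0)
print(cold, warm)
```

Under this assumed weighting the average approaches the minimal cost as the temperature goes to zero, while at higher temperature the many slightly costlier routes dominate.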
Quantifying such notions for different proteins is a rich subject for future studies.

It is also an interesting question to ask whether the actual dynamics between polymer configurations, after a suitable averaging over trajectories, resembles the minimal transformation. This question is linked with the role of the entropy of transformations described above. It is also related to the problem of finding the dominant pathway for a chemical reaction [97], which has recently been applied to the problem of protein folding [121]. We have focused here on the question of geometrical distance for complex systems, which can be separated from the calculation of quantities such as reaction paths that depend intrinsically on energetics, i.e. on the specific Hamiltonian of the system. Quantifying the relationship between geometrical distance and the dominant reaction path is a future question worthy of investigation.

Bibliography

[1] Colin C. Adams. The Knot Book. W H Freeman and Company, 1994.
[2] E. Alm and D. Baker. Prediction of protein-folding mechanisms from free-energy landscapes derived from native structures. Proc Nat Acad Sci USA, 96:11305–11310, 1999.
[3] K V Andersen and F M Poulsen. The three-dimensional structure of acyl-coenzyme A binding protein from bovine liver: structural refinement using heteronuclear multidimensional NMR. J. Biomol. NMR, 3:271–284, 1993. Comment 2abd.
[4] C. B. Anfinsen. Principles that govern the folding of protein chains. Science, 181:223, 1973.
[5] D. Baker. A surprising simplicity to protein folding. Nature, 405:39–42, 2000.
[6] D. Baker and A. Sali. Protein structure prediction and structural genomics. Science, 294:93–96, 2001.
[7] Stephen Bell and James S. Crighton. Locating transition states. The Journal of Chemical Physics, 80(6):2464–2475, 1984.
[8] Robert B. Best and Gerhard Hummer. Reaction coordinates and rates from transition paths. Proc Nat Acad Sci USA, 102(19):6732–6737, 2005.
[9] Peter G.
Bolhuis, David Chandler, Christoph Dellago, and Phillip L. Geissler. Transition path sampling: Throwing ropes over rough mountain passes, in the dark. Ann. Rev. Phys. Chem., 53:291–318, 2002.
[10] Davide Branduardi, Francesco Luigi Gervasio, and Michele Parrinello. From A to B in free energy space. The Journal of Chemical Physics, 126(5):054103, 2007.
[11] J. D. Bryngelson, J. N. Onuchic, N. D. Socci, and P. G. Wolynes. Funnels, pathways and the energy landscape of protein folding. Proteins: Struct. Funct. Genet., 21:167–195, 1995.
[12] J. D. Bryngelson and P. G. Wolynes. Spin glasses and the statistical mechanics of protein folding. Proc Nat Acad Sci USA, 84:7524–7528, 1987.
[13] J. D. Bryngelson and P. G. Wolynes. Intermediates and barrier crossing in a random energy model (with applications to protein folding). J Phys Chem, 93:6902–6915, 1989.
[14] N Campbell and J Reece. Biology. Benjamin Cummings, 6th edition, 2001.
[15] D. Cass. Optimum growth in an aggregative model of capital accumulation. Rev. Econ. Stud., 32:233–240, 1965.
[16] Charles J. Cerjan and William H. Miller. On finding transition states. The Journal of Chemical Physics, 75(6):2800–2806, 1981.
[17] Hue Sun Chan and Ken A. Dill. Transition states and folding dynamics of proteins and heteropolymers. J Chem Phys, 100(12):9238–9257, 15 June 1994.
[18] L. L. Chavez, J. N. Onuchic, and C. Clementi. Quantifying the roughness on the free energy landscape: Entropic bottlenecks and protein folding rates. J Am Chem Soc, 126:8426–8432, 2004.
[19] Margaret S. Cheung and D. Thirumalai. Nanopore-protein interactions dramatically alter stability and yield of the native state in restricted spaces. J Mol Biol, 357(2):632–643, 2006.
[20] F. Chiti, N. Taddei, P. M. White, M. Bucciantini, F. Magherini, M. Stefani, and C. M. Dobson. Mutational analysis of acylphosphatase suggests the importance of topology and contact order in protein folding. Nature Struct Biol, 6(11):1005–1009, 1999.
[21] Samuel S. Cho, Yaakov Levy, and Peter G. Wolynes. P versus Q: Structural reaction coordinates capture protein folding on smooth landscapes. Proc Nat Acad Sci USA, 103:586–591, 2006.
[22] C. Clementi, H. Nymeyer, and J. N. Onuchic. Topological and energetic factors: what determines the structural details of the transition state ensemble and en-route intermediates for protein folding? An investigation for small globular proteins. J Mol Biol, 298:937–953, 2000.
[23] C. Clementi and S. S. Plotkin. The effects of nonnative interactions on protein folding rates: Theory and simulation. Protein Sci, 13:1750–1766, 2004.
[24] Evangelos A. Coutsias, Chaok Seok, and Ken A. Dill. Using quaternions to calculate RMSD. Journal of Computational Chemistry, 25(15):1849–1857, 2004.
[25] Evangelos A. Coutsias, Chaok Seok, and Ken A. Dill. Rotational superposition and least squares: The SVD and quaternions approaches yield identical results. Reply to the preceding comment by G. Kneller. Journal of Computational Chemistry, 26(15):1663–1665, 2005.
[26] Payel Das, Mark Moll, Hernan Stamati, Lydia E. Kavraki, and Cecilia Clementi. Low-dimensional, free-energy landscapes of protein-folding reactions by nonlinear dimensionality reduction. Proc Nat Acad Sci USA, 103(26):9885–9890, 2006.
[27] Xavier de la Cruz, E. Gail Hutchinson, Adrian Shepherd, and Janet M. Thornton. Toward predicting protein topology: An approach to identifying β-hairpins. Proc Natl Acad Sci U.S.A., 99(17):11157–11162, 2002.
[28] Christoph Dellago, Peter G. Bolhuis, Felix S. Csajka, and David Chandler. Transition path sampling and the calculation of rate constants. The Journal of Chemical Physics, 108(5):1964–1977, 1998.
[29] K. A. Dill and H. S. Chan. From Levinthal to pathways to funnels. Nature Struct Biol, 4:10–19, 1997.
[30] Feng Ding, Weihua Guo, Nikolay V. Dokholyan, Eugene I. Shakhnovich, and Joan-Emma Shea.
Reconstruction of the src-SH3 protein domain transition state ensemble using multiscale molecular dynamics simulations. J Mol Biol, 350:1035–1050, 2005.
[31] A. R. Dinner and M. Karplus. The roles of stability and contact order in determining protein folding rates. Nature Struct. Biol., 8(1):21–22, 2001.
[32] Nikolay V. Dokholyan, Lewyn Li, Feng Ding, and Eugene I. Shakhnovich. Topological determinants of protein folding. Proc. Nat. Acad. Sci. USA, 99(13):8637–8641, 2002.
[33] R. Du, V. S. Pande, A. Yu. Grosberg, T. Tanaka, and E. S. Shakhnovich. On the transition coordinate for protein folding. J Chem Phys, 108:334–350, 1998.
[34] M. R. Ejtehadi, S. P. Avall, and S. S. Plotkin. Three-body interactions improve the prediction of rate and mechanism in protein folding models. Proc. Natl. Acad. Sci., 101(42):15088–15093, 2004.
[35] R. Elber and M. Karplus. A method for determining reaction paths in large molecules: Application to myoglobin. Chemical Physics Letters, 139(5):375–380, 1987.
[36] Daniel W. Farrell, Kirill Speranskiy, and M. F. Thorpe. Generating stereochemically acceptable protein pathways. Proteins: Structure, Function, and Bioinformatics, 78(14):2908–2921, 2010.
[37] A. R. Fersht. Transition-state structure as a unifying basis in protein-folding mechanisms: Contact order, chain topology, stability, and the extended nucleus mechanism. Proc Nat Acad Sci USA, 97:1525–1529, 2000.
[38] A. V. Finkelstein and A. Ya. Badretdinov. Influence of chain knotting on rate of folding. Folding & Design, 3:67–68, 1997.
[39] Stefan Fischer and Martin Karplus. Conjugate peak refinement: an algorithm for finding reaction paths and accurate transition states in systems with many degrees of freedom. Chemical Physics Letters, 194(3):252–261, 1992.
[40] Darren R. Flower. Rotational superposition: A review of methods. J Mol Graph Mod, 17:238–244, 1999.
[41] O. V. Galzitskaya and A. V. Finkelstein.
A theoretical search for folding/unfolding nuclei in three-dimensional protein structures. Proc Nat Acad Sci USA, 96:11299–11304, 1999.
[42] O. V. Galzitskaya, D. N. Ivankov, and A. V. Finkelstein. Folding nuclei in proteins. FEBS Lett, 489:113–118, 2001.
[43] Oxana V. Galzitskaya, Sergiy O. Garbuzynskiy, Dmitry N. Ivankov, and Alexei V. Finkelstein. Chain length is the main determinant of the folding rate for proteins with three-state folding kinetics. Proteins: Structure, Function, and Bioinformatics, 51(2):162–166, 2003.
[44] A. E. García. Large-amplitude nonlinear motions in proteins. Phys Rev Lett, 68:2696–2699, 1992.
[45] Angel E. Garcia and Jose N. Onuchic. Folding a protein in a computer: An atomic description of the folding/unfolding of protein A. Proc. Natl. Acad. Sci., 100(24):13898–13903, 2003.
[46] I. M. Gelfand and S. V. Fomin. Calculus of Variations. Dover, 2000.
[47] M. Gerstein and M. Levitt. Comprehensive assessment of automatic structural alignment against a manual standard. Protein Science, 7:445–456, 1998.
[48] J. E. Gouaux and W. N. Lipscomb. Crystal structures of phosphonoacetamide ligated T and phosphonoacetamide and malonate ligated R states of aspartate carbamoyltransferase at 2.8-Å resolution and neutral pH. Biochemistry, 29:389–402, 1990.
[49] J. Greene, S. Kahn, H. Savoj, P. Sprague, and S. Teig. Chemical function queries for 3D database search. J. Chem. Inf. Comput. Sci., 34:1297–1308, 1994.
[50] John Gregory and Cantian Lin. An unconstrained calculus of variations formulation for generalized optimal control problems and for the constrained problem of Bolza. J. Math. Anal. Appl., 187:826–841, 1994.
[51] John Gregory and Cantian Lin. Constrained Optimization in the Calculus of Variations and Optimal Control Theory. Springer, New York, first edition, 2007.
[52] M. Michael Gromiha and S Selvaraj.
Comparison between long-range interactions and contact order in determining the folding rate of two-state proteins: application of long-range order to folding rate prediction. Journal of Molecular Biology, 310(1):27–32, 2001.
[53] M. Michael Gromiha and S. Selvaraj. Inter-residue interactions in protein folding and stability. Progress in Biophysics and Molecular Biology, 86(2):235–277, 2004.
[54] A. M. Gutin, V. I. Abkevich, and E. I. Shakhnovich. Chain length scaling of protein folding time. Phys Rev Lett, 77:5433–5436, 1996.
[55] F. Ulrich Hartl. Molecular chaperones in cellular protein folding. Nature, 381(6583):571–580, Jun 1996.
[56] G. Hummer, A. E. García, and S. Garde. Conformational diffusion and helix formation kinetics. Phys Rev Lett, 85:2637–2640, 2000.
[57] G. Hummer, A. E. García, and S. Garde. Helix nucleation kinetics from molecular simulations in explicit solvent. Proteins, 42:77–84, 2001.
[58] Gerhard Hummer. From transition paths to transition states and rate coefficients. The Journal of Chemical Physics, 120(2):516–523, 2004.
[59] Gerhard Hummer and Ioannis G. Kevrekidis. Coarse molecular dynamics of a peptide fragment: Free energy, kinetics, and long-time dynamics computations. The Journal of Chemical Physics, 118(23):10762–10773, 2003.
[60] Andrei Y. Istomin, Donald J. Jacobs, and Dennis R. Livesay. On the role of structural class of a protein with two-state folding kinetics in determining correlations between its size, topology, and folding rate. Protein Science, 16(11):2564–2569, 2007.
[61] Dmitry N. Ivankov, Sergiy O. Garbuzynskiy, Eric Alm, Kevin W. Plaxco, David Baker, and Alexei V. Finkelstein. Contact order revisited: Influence of protein size on the folding rate. Protein Science, 12(9):2057–2062, 2003.
[62] W. Kabsch. A solution for the best rotation to relate two sets of vectors. Acta Crystallographica Section A, 32(5):922–923, Sep 1976.
[63] W. Kabsch.
A discussion of the solution for the best rotation to relate two sets of vectors. Acta Crystallographica Section A, 34(5):827–828, Sep 1978.
[64] John Karanicolas and C. L. Brooks III. The origins of asymmetry in the folding transition states of protein L and protein G. Protein Sci, 11(10):2351–2361, 2002.
[65] M. Kawato. Trajectory formation in arm movements: Minimization principles and procedures. In H. N. Zelaznik, editor, Advances in Motor Learning and Control, chapter 9, pages 225–259. Human Kinetics, 1996.
[66] Moon K Kim, Gregory S Chirikjian, and Robert L Jernigan. Elastic models of conformational transitions in macromolecules. Journal of Molecular Graphics and Modelling, 21(2):151–160, 2002.
[67] Moon K. Kim, Robert L. Jernigan, and Gregory S. Chirikjian. Efficient generation of feasible pathways for protein conformational transitions. Biophysical Journal, 83(3):1620–1630, 2002.
[68] Neil P. King, Alex W. Jacobitz, Michael R. Sawaya, Lukasz Goldschmidt, and Todd O. Yeates. Structure and folding of a designed knotted protein. Proceedings of the National Academy of Sciences, 107(48):20732–20737, 2010. Comment 3mlg.
[69] Gerald R. Kneller. Superposition of molecular structures using quaternions. Molecular Simulation, 7(1-2):113–119, 1991.
[70] Edward H. Koo, Peter T. Lansbury, and Jeffery W. Kelly. Amyloid diseases: Abnormal protein aggregation in neurodegeneration. Proceedings of the National Academy of Sciences, 96(18):9989–9990, 1999.
[71] S Koyama, H Yu, DC Dalgarno, TB Shin, LD Zydowsky, and SL Schreiber. Structure of the PI3K SH3 domain and analysis of the SH3 family. Cell, 72:945–952, 1993. Comment 1pks.
[72] Werner G. Krebs and Mark Gerstein. The morph server: a standardized system for analyzing and visualizing macromolecular motions in a database framework. Nucleic Acids Research, 28(8):1665–1675, 2000.
[73] Sergei V. Krivov, Stefanie Muff, Amedeo Caflisch, and Martin Karplus.
One-dimensional barrier-preserving free-energy projections of a β-sheet miniprotein: New insights into the folding process. Journal of Physical Chemistry B, 112(29):8701–8714, 2008.
[74] M. Lal. 'Monte Carlo' computer simulation of chain molecules. Mol. Phys., 17:57–64, 1969.
[75] C. Lemmen and T. Lengauer. Computational methods for the structural alignment of molecules. J. Comput. Aided Mol. Des., 14:215–231, 2000.
[76] Peter E. Leopold, Mauricio Montal, and Jose N. Onuchic. Protein folding funnels: Kinetic pathways through compact conformational space. Proc. Natl Acad. Sci. USA, 89:8721–8725, September 1992.
[77] R. D. Levine and R. B. Bernstein. Molecular reaction dynamics and chemical reactivity. Clarendon Press, Oxford, 1987.
[78] M. Lindberg, Jeanette Tangrot, and M. Oliveberg. Complete change of the protein folding transition state upon circular permutation. Nature Struct. Biol., 9(11):818–822, 2002.
[79] M. O. Lindberg, J. Tangrot, D. E. Otzen, D. A. Dolgikh, A. V. Finkelstein, and M. Oliveberg. Folding of circular permutants with decreased contact order: general trend balanced by protein stability. J. Mol. Biol., 314:891–900, 2001.
[80] Rhonald C Lua and Alexander Y Grosberg. Statistics of knots, geometry of conformations, and evolution of proteins. PLoS Comput Biol, 2(5):e45, 05 2006.
[81] Ao Ma and Aaron R. Dinner. Automatic method for identifying reaction coordinates in complex systems. J. Phys. Chem. B, 109(14):6769–6779, 2005.
[82] Neal Madras and Alan D. Sokal. The pivot algorithm: A highly efficient Monte Carlo method for the self-avoiding walk. Journal of Statistical Physics, 50(1–2):109–186, 1988.
[83] A. L. Mallam and S. E. Jackson. Probing nature's knots: The folding pathway of a knotted homodimeric protein. J. Mol. Biol., 359:1420–1436, 2006.
[84] Paul Maragakis and Martin Karplus. Large amplitude conformational change in proteins explored with a plastic network model: Adenylate kinase.
Journal of Molecular Biology, 352(4):807–822, 2005.
[85] Luca Maragliano, Alexander Fischer, Eric Vanden-Eijnden, and Giovanni Ciccotti. String method in collective variables: Minimum free energy paths and isocommittor surfaces. The Journal of Chemical Physics, 125(2):024106, 2006.
[86] G. A. Mines, T. Pascher, S. C. Lee, J. R. Winkler, and H. B. Gray. Cytochrome c folding triggered by electron transfer. Chem. and Biol., 3:491–497, 1996.
[87] Ali R Mohazab and Steve S Plotkin. Polymer untangling and unknotting in protein folding. PLOS computational biology (submitted), 2012.
[88] Ali R Mohazab and Steve S Plotkin. The role of polymer non-crossing and geometrical distance in protein folding kinetics. unpublished, 2012.
[89] Ali R. Mohazab and Steven S. Plotkin. Minimal distance transformations between links and polymers: principles and examples. J. Phys. Cond. Mat., 20:244133, 2008.
[90] Ali R. Mohazab and Steven S. Plotkin. Minimal folding pathways for coarse-grained biopolymer fragments. Biophys. J., 95:5496–5507, 2008.
[91] Ali R. Mohazab and Steven S. Plotkin. Structural alignment using the generalized Euclidean distance between conformations. IJQC, 109:3217–3228, November 2009.
[92] S. K. Nechaev. Statistics of Knots and Entangled Random Walks. World Scientific, 1996.
[93] Jeffrey K. Noel, Joanna I. Sulkowska, and Jose N. Onuchic. Slipknotting upon native-like loop formation in a trefoil knot protein. Proceedings of the National Academy of Sciences, 107(35):15403–15408, 2010.
[94] H. Nymeyer, N. D. Socci, and J. N. Onuchic. Landscape approaches for determining the ensemble of folding transition states: Success and failure hinge on the degree of minimal frustration. Proc. Natl Acad. Sci. USA, 97:634–639, 2000.
[95] E.P. O'Brien, M. Vendruscolo, and C.M. Dobson. Prediction of variable translation rate effects on cotranslational protein folding. Nature Communications, 3:868, 2012.
[96] L. Onsager. Initial recombination of ions. Phys.
Rev., 54:554–557, 1938.
[97] L. Onsager and S. Machlup. Fluctuations and irreversible processes. Phys Rev, 91(6):1505–1512, 1953.
[98] J. N. Onuchic, N. D. Socci, Z. Luthey-Schulten, and P. G. Wolynes. Protein folding funnels: The nature of the transition state ensemble. Folding and Design, 1:441–450, 1996.
[99] J. N. Onuchic and P. G. Wolynes. Theory of protein folding. Current Opinion in Structural Biology, 14:70–75, 2004.
[100] S. B. Ozkan, Ken A. Dill, and Ivet Bahar. Computing the transition state populations in simple protein models. Biopolymers, 68(1):35–46, 2003.
[101] S. Banu Ozkan, Ivet Bahar, and Ken A. Dill. Transition states and the meaning of Φ-values in protein folding kinetics. Nat Struct Mol Biol, 8(9):765–769, Sep 2001.
[102] B. Oztop, M. Reza Ejtehadi, and Steven S. Plotkin. Protein folding rates correlate with heterogeneity of folding mechanism. Phys. Rev. Lett., 93:208105, 2004.
[103] Y. Patel, V. J. Gillet, G. Bravi, and A. R. Leach. A comparison of the pharmacophore identification programs: Catalyst, DISCO and GASP. J. Comput. Aided Mol. Des., 16:653–681, 2002.
[104] D. A. Pearlman, D. A. Case, J. W. Caldwell, W. S. Ross, T. E. Cheatham, D. M. Ferguson, U. Chandra Singh, P. Weiner, and P. A. Kollman. AMBER, V. 4.1, 1995.
[105] K. W. Plaxco, K. T. Simons, and D. Baker. Contact order, transition state placement and the refolding rates of single domain proteins. J. Mol. Biol., 277:985–994, 1998.
[106] K. W. Plaxco, K. T. Simons, I. Ruczinski, and D. Baker. Topology, stability, sequence, and length: Defining the determinants of two-state protein folding kinetics. Biochemistry, 39:11177–11183, 2000.
[107] S. Plimpton. Fast parallel algorithms for short-range molecular dynamics. Journal of Computational Physics, 117(1):1–19, 1995.
[108] S. S. Plotkin. Speeding protein folding beyond the Gō model: How a little frustration sometimes helps. Proteins, 45:337–345, 2001.
[109] S. S. Plotkin.
Generalization of distance to higher dimensional objects. Proc. Natl Acad. Sci. USA, 104(38):14899–14904, 2007.
[110] S. S. Plotkin and J. N. Onuchic. Investigation of routes and funnels in protein folding by free energy functional methods. Proc. Natl Acad. Sci. USA, 97:6509–6514, 2000.
[111] S. S. Plotkin and J. N. Onuchic. Structural and energetic heterogeneity in protein folding I: Theory. J. Chem. Phys., 116(12):5263–5283, 2002.
[112] S. S. Plotkin and J. N. Onuchic. Understanding protein folding with energy landscape theory I: Basic concepts. Quart. Rev. Biophys., 35(2):111–167, 2002.
[113] S. S. Plotkin and J. N. Onuchic. Understanding protein folding with energy landscape theory II: Quantitative aspects. Quart. Rev. Biophys., 35(3):205–286, 2002.
[114] S. S. Plotkin and P. G. Wolynes. Non-Markovian configurational diffusion and reaction coordinates for protein folding. Phys. Rev. Lett., 80:5015–5018, 1998.
[115] S. S. Plotkin and P. G. Wolynes. Buffed energy landscapes: Another solution to the kinetic paradoxes of protein folding. Proc. Natl Acad. Sci. USA, 100(8):4417–4422, 2003.
[116] L. S. Pontryagin, V. G. Boltyanskii, R. V. Gamkrelidze, and E. F. Mishchenko. The mathematical theory of optimal processes. Wiley Interscience, New York and London, 1962.
[117] J. J. Portman, S. Takada, and P. G. Wolynes. Microscopic theory of protein folding rates. I. Fine structure of the free energy profile and folding routes from a variational approach. J. Chem. Phys., 114:5069–5081, 2001.
[118] J. J. Portman, S. Takada, and P. G. Wolynes. Microscopic theory of protein folding rates. II. Local reaction coordinates and chain dynamics. J. Chem. Phys., 114:5082–5096, 2001.
[119] Michael C. Prentiss, David J. Wales, and Peter G. Wolynes. The energy landscape, folding pathways and the kinetics of a knotted protein. PLoS Comput Biol, 6(7):e1000835, 07 2010.
[120] Adam D. Schuyler, Robert L. Jernigan, Pradman K. Qasba, Boopathy Ramakrishnan, and Gregory S.
Chirikjian. Iterative cluster-NMA: A tool for generating conformational transitions in proteins. Proteins: Structure, Function, and Bioinformatics, 74(3):760–776, 2009.
[121] M. Sega, P. Faccioli, F. Pederiva, G. Garberoglio, and H. Orland. Quantitative protein dynamics from dominant folding pathways. Phys. Rev. Lett., 99(118102), 2007.
[122] J. E. Shea, J. N. Onuchic, and C. L. Brooks. Energetic frustration and the nature of the transition state in protein folding. J. Chem. Phys., 113:7663–7671, 2000.
[123] J.E. Shea and C.L. Brooks III. From folding theories to folding proteins: A review and assessment of simulation studies of protein folding and unfolding. Ann. Rev. Phys. Chem., 52:499–535, 2001.
[124] C.D. Snow, H. Nguyen, V.S. Pande, and M. Gruebele. Absolute comparison of simulated and experimental protein-folding dynamics. Nature, 420:102–106, 2002.
[125] Stefan Wallin, Konstantin B. Zeldovich, and Eugene I. Shakhnovich. The folding mechanics of a knotted protein. J. Mol. Biol., 368:884–893, 2007.
[126] Joanna I. Sulkowska, Piotr Sulkowski, and Jose Onuchic. Dodging the crisis of folding proteins with knots. Proceedings of the National Academy of Sciences, 106(9):3119–3124, 2009.
[127] W. R. Taylor. A deeply knotted protein structure and how it might fold. Nature, 406:916–919, 2000.
[128] Y. Ueda, H. Taketomi, and Nobuhiro Go. Studies on protein folding, unfolding, and fluctuations by computer simulation. Int. J. Peptide Protein Res., 7:445–459, 1975.
[129] Yuzo Ueda, Hiroshi Taketomi, and Nobuhiro Go. Studies on protein folding, unfolding and fluctuations by computer simulation. I. The effects of specific amino acid sequence represented by specific inter-unit interactions. Int. J. Peptide Protein Res., 7:445–459, 1975.
[130] Arjan van der Vaart and Martin Karplus. Minimum free energy pathways and free energy profiles for conformational transitions based on atomistic molecular dynamics simulations.
The Journal of Chemical Physics, 126(16):164106, 2007.
[131] Peter Virnau, Leonid A Mirny, and Mehran Kardar. Intricate knots in proteins: Function and evolution. PLoS Comput Biol, 2(9):1074–1079, 2006.
[132] David J. Wales. Theoretical study of water trimer. Journal of the American Chemical Society, 115(24):11180–11190, 1993.
[133] Stefan Wallin and Hue Sun Chan. A critical assessment of the topomer search model of protein folding using a continuum explicit-chain model with extensive conformational sampling. Protein Science, 14(6):1643–1660, 2005.
[134] Jin Wang, Kun Zhang, Hongyang Lu, and Erkang Wang. Quantifying Kinetic Paths of Protein Folding. Biophys. J., 89(3):1612–1620, 2005.
[135] Stephen Wells, Scott Menor, Brandon Hespenheide, and M F Thorpe. Constrained geometric simulation of diffusive motion in proteins. Physical Biology, 2(4):S127, 2005.
[136] F. W. Wiegel. Introduction to path-integral methods in physics and polymer science. World Scientific, Singapore, 1986.
[137] P. G. Wolynes, J. N. Onuchic, and D. Thirumalai. Navigating the folding routes. Science, 267(5204):1619–1620, 1995.
[138] Haijun Yang, Hao Wu, Dawei Li, Li Han, and Shuanghong Huo. Temperature-dependent probabilistic roadmap algorithm for calculating variationally optimized conformational transition pathways. Journal of Chemical Theory and Computation, 3(1):17–25, 2007.

Appendix A

Sufficient conditions for an extremum to be a minimum

For a transformation to be minimal, it is necessary, but not sufficient, that it be an extremum. We now derive the sufficient conditions for a given transformation to minimize the functional (2.9). We describe the formalism in some detail because it is not typically taught to physicists; for further reading see for example reference [46].

According to Sylvester's criterion, a quadratic form $\sum_{ij} A_{ij} x_i x_j$ is positive definite if and only if all descending principal minors of the matrix $\|A_{ij}\|$ are positive, i.e.
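\[
A_{11} > 0, \quad
\begin{vmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{vmatrix} > 0, \quad
\begin{vmatrix} A_{11} & A_{12} & A_{13} \\ A_{21} & A_{22} & A_{23} \\ A_{31} & A_{32} & A_{33} \end{vmatrix} > 0, \quad \ldots, \quad
\det\|A_{ij}\| > 0, \tag{A.1}
\]
and a function $F$ of $x \equiv (x_1, x_2, \ldots, x_n)$ has a minimum at $x^\star$ if the Hessian matrix $\|\partial^2 F/\partial x_i \partial x_j\|$ is positive definite at the position of the extremum (where $\partial F/\partial x_i = 0$).

For a function to be a minimum of a given functional, it must satisfy similar sufficient conditions. Consider again the difference in distance between two trajectories in (2.9)‡. Taylor expanding the Lagrangian to second order in $\mathbf{h}_i$:
\[
\begin{aligned}
\Delta D &= D[\mathbf{r}_i + \mathbf{h}_i] - D[\mathbf{r}_i]
 = \int_0^T dt\, L(\mathbf{r}_i + \mathbf{h}_i, \dot{\mathbf{r}}_i + \dot{\mathbf{h}}_i) - \int_0^T dt\, L(\mathbf{r}_i, \dot{\mathbf{r}}_i) \\
&\approx \int_0^T dt \left[ \sum_{i=1}^{N} \left( L_{\mathbf{r}_i} \cdot \mathbf{h}_i + L_{\dot{\mathbf{r}}_i} \cdot \dot{\mathbf{h}}_i \right)
 + \frac{1}{2} \sum_{i,j}^{3N} \left( L_{x_i x_j} h_i h_j + 2 L_{x_i \dot{x}_j} h_i \dot{h}_j + L_{\dot{x}_i \dot{x}_j} \dot{h}_i \dot{h}_j \right) \right]
\end{aligned} \tag{A.2}
\]
‡ We ignore corner conditions for the purposes of the derivation. It can be shown that they do not modify the result.

At an extremum, the first order term in (A.2) is zero, and $\Delta D \approx \delta^2 D$, the second variation. A sufficient condition for the extremum to be a minimum is $\delta^2 D > 0$. From eq. (2.10), the matrix $\|L_{x_i \dot{x}_j}\| = \|0\|$. Assuming $\|L_{x_i \dot{x}_j}\|$ is in general a symmetric matrix, i.e., $L_{x_i \dot{x}_j} = L_{x_j \dot{x}_i}$, the second term in the quadratic form of (A.2) may be integrated by parts to give:
\[
\delta^2 D = \frac{1}{2} \int_0^T dt \left[ \langle \dot{h} | \mathbf{P} \dot{h} \rangle + \langle h | \mathbf{Q} h \rangle \right], \tag{A.3}
\]
where we have let $|h\rangle$ denote the vector $(h_1, h_2, \ldots, h_{3N})$, and used the shorthand $\mathbf{P}$ and $\mathbf{Q}$ for the matrices:
\[
\begin{aligned}
\mathbf{P}(t) &= \|P_{ij}\| = \|L_{\dot{x}_i \dot{x}_j}\| \\
\mathbf{Q}(t) &= \|Q_{ij}\| = \|L_{x_i x_j}\| - \frac{d}{dt} \|L_{x_i \dot{x}_j}\| .
\end{aligned} \tag{A.4}
\]
From (2.10) the explicit form for these matrices may be calculated. $\mathbf{P}$ is block diagonal:
\[
\mathbf{P} = \begin{bmatrix}
I^{(1)}_{ij} & 0 & \cdots & 0 \\
0 & I^{(2)}_{ij} & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & I^{(N)}_{ij}
\end{bmatrix} \tag{A.5}
\]
with each block matrix having elements
\[
\|I^{(J)}_{ij}\| = \frac{1}{|\dot{\mathbf{r}}_J|^3} \left\| \delta_{ij}\, \dot{\mathbf{r}}_{(J)}^2 - \dot{x}_{(J)i}\, \dot{x}_{(J)j} \right\|
 = \frac{1}{|\dot{\mathbf{r}}_J|^3}
\begin{bmatrix}
\dot{y}^2 + \dot{z}^2 & -\dot{x}\dot{y} & -\dot{x}\dot{z} \\
-\dot{x}\dot{y} & \dot{x}^2 + \dot{z}^2 & -\dot{y}\dot{z} \\
-\dot{x}\dot{z} & -\dot{y}\dot{z} & \dot{x}^2 + \dot{y}^2
\end{bmatrix}_{(\text{particle } J)} \tag{A.6}
\]
Interestingly, the numerator of (A.6) has the form of an inertia tensor for a point particle in velocity-space.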
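As a numerical sanity check on the block in (A.6), the sketch below compares the closed form with a finite-difference second derivative of the speed $|\dot{\mathbf{r}}|$ with respect to the velocity components. It assumes that each bead contributes a term $|\dot{\mathbf{r}}_J|$ to the Lagrangian, which is what the form of (A.6) suggests but is not restated in this appendix; the function names and the test velocity are hypothetical.

```python
import math
from typing import List, Sequence


def speed(v: Sequence[float]) -> float:
    """Euclidean norm of a 3-component velocity."""
    return math.sqrt(sum(c * c for c in v))


def block_closed_form(v: Sequence[float]) -> List[List[float]]:
    """(delta_ij |v|^2 - v_i v_j) / |v|^3, the 3x3 block written in (A.6)."""
    s = speed(v)
    return [[((s * s if i == j else 0.0) - v[i] * v[j]) / s ** 3
             for j in range(3)] for i in range(3)]


def block_finite_difference(v: Sequence[float], eps: float = 1e-4) -> List[List[float]]:
    """Central-difference second derivatives of |v| w.r.t. the components of v."""
    def shifted(i: int, j: int, si: int, sj: int) -> float:
        w = list(v)
        w[i] += si * eps
        w[j] += sj * eps
        return speed(w)

    return [[(shifted(i, j, 1, 1) - shifted(i, j, 1, -1)
              - shifted(i, j, -1, 1) + shifted(i, j, -1, -1)) / (4 * eps * eps)
             for j in range(3)] for i in range(3)]


v = (1.0, 2.0, 2.0)  # arbitrary test velocity with |v| = 3
exact = block_closed_form(v)
approx = block_finite_difference(v)
max_err = max(abs(exact[i][j] - approx[i][j]) for i in range(3) for j in range(3))
print(max_err)
```

For the test velocity the closed form gives a diagonal entry of $8/27$ and an off-diagonal entry of $-2/27$ in the first row, matching the pattern of signs in (A.6).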
The matrix $\mathbf{Q}$ is block tri-diagonal, because the spatial derivatives in (A.4) couple each bead to its two neighbors. Using indices $I, J$ to enumerate beads and $i, j$ to enumerate the $x, y, z$ components for each bead:
\[
\|Q_{IJ,ij}\| = \delta_{ij} \left[ \lambda_{J-1,J} \left( \delta_{IJ} - \delta_{I,J-1} \right) + \lambda_{J,J+1} \left( \delta_{IJ} - \delta_{I,J+1} \right) \right]
\]
or
\[
\mathbf{Q} = \begin{bmatrix}
\lambda_{12}\mathbf{1} & -\lambda_{12}\mathbf{1} & 0 & 0 & \cdots \\
-\lambda_{12}\mathbf{1} & (\lambda_{12} + \lambda_{23})\mathbf{1} & -\lambda_{23}\mathbf{1} & 0 & \cdots \\
0 & -\lambda_{23}\mathbf{1} & (\lambda_{23} + \lambda_{34})\mathbf{1} & -\lambda_{34}\mathbf{1} & \cdots \\
 & & \ddots & \ddots & \ddots \\
 & & & -\lambda_{N-1,N}\mathbf{1} & \lambda_{N-1,N}\mathbf{1}
\end{bmatrix} \tag{A.7}
\]
Here $\mathbf{1}_{ij} = \delta_{ij}$.

For the transformation $\mathbf{r}^*(t)$ to be a minimum of $D[\mathbf{r}]$, it suffices that the functional (A.3) be positive definite for all $|h\rangle$. To derive the conditions for this, we can temporarily ignore the fact that (A.3) arose from the second variation of (2.9), and treat (A.3) as a new functional acting on inputs $|h(t)\rangle = |h_1(t), \ldots, h_{3N}(t)\rangle$. We then ask what $|h(t)\rangle$ extremizes (A.3). If $\delta^2 D > 0$ the only extremal solution can be the trivial one, $|h(t)\rangle = |0\rangle$, because $\delta^2 D$ is homogeneous of degree 2. That is, changing the transformation $\{\mathbf{r}^*_i(t)\}$ from that which extremized (2.9) to a neighboring transformation $\{\mathbf{r}^*_i(t) + \mathbf{h}_i(t)\}$ would increase the distance traveled.

The system of $3N$ EL equations for $|h\rangle$ from (A.3) is
\[
-\frac{d}{dt} |\mathbf{P}\dot{h}\rangle + |\mathbf{Q}h\rangle = |0\rangle \tag{A.8}
\]
with boundary conditions
\[
|h(0)\rangle = |h(T)\rangle = |0\rangle . \tag{A.9}
\]
Equation (A.8) is referred to as the Jacobi equation in the calculus of variations. First note that if $|h\rangle$ satisfies the system of equations in (A.8) as well as the boundary conditions (A.9), then integration by parts gives
\[
\delta^2 D = \frac{1}{2} \int_0^T dt \left[ \langle \dot{h} | \mathbf{P}\dot{h} \rangle + \langle h | \mathbf{Q}h \rangle \right]
 = \frac{1}{2} \int_0^T dt\, \Big\langle h \Big| -\frac{d}{dt}\big(\mathbf{P}\dot{h}\big) + \mathbf{Q}h \Big\rangle = 0 . \tag{A.10}
\]
This means that for $\delta^2 D$ to be $> 0$, any nontrivial $|h(t)\rangle$ which satisfies the boundary conditions must not itself be an extremal solution of the Jacobi equation; otherwise solutions $|\mathbf{r}^*(t)\rangle$ perturbed by any constant times $|h(t)\rangle$ are themselves extremals.
One may think of this by analogy as the necessity for the absence of any "Goldstone modes", where excitations by various $C|h(t)\rangle$ would lead to a family of curves with zero cost in action, and thus zero effective restoring force, between them.

Alternatively we can ask what equation $\mathbf{h} \equiv |h\rangle$ must satisfy if the EL equations are satisfied for both $L(\mathbf{r}, \dot{\mathbf{r}})$ and the neighboring extremal $L(\mathbf{r} + \mathbf{h}, \dot{\mathbf{r}} + \dot{\mathbf{h}})$. Taylor expanding $L(\mathbf{r} + \mathbf{h}, \dot{\mathbf{r}} + \dot{\mathbf{h}})$ in
\[
L_{\mathbf{r}}(\mathbf{r} + \mathbf{h}, \dot{\mathbf{r}} + \dot{\mathbf{h}}) - \frac{d}{dt} L_{\dot{\mathbf{r}}}(\mathbf{r} + \mathbf{h}, \dot{\mathbf{r}} + \dot{\mathbf{h}}) = 0
\]
gives
\[
-\frac{d}{dt} \left( L_{\dot{\mathbf{r}}\dot{\mathbf{r}}}\, \dot{\mathbf{h}} \right) + \left( L_{\mathbf{r}\mathbf{r}} - \frac{d}{dt} L_{\mathbf{r}\dot{\mathbf{r}}} \right) \mathbf{h} = 0
\]
which is exactly Jacobi's equation (A.8) with definitions (A.4).

From here on, it is much simpler to elucidate the central concepts for sufficient conditions using the case of a single scalar function $h(t)$. The analysis can be generalized to the multi-dimensional case with a bit more effort, but the conclusions are essentially the same and so they will simply be stated along with the conclusions for the '1-D' case. For further details see [46]. We write equation (A.3) in 1-D as:
\[
\frac{1}{2} \int_0^T dt \left( P \dot{h}^2 + Q h^2 \right) \tag{A.11}
\]
It was realized originally by Legendre that the integral could be brought to simpler form by adding zero to it in the form of a total derivative. Since
\[
\int_0^T dt\, \frac{d}{dt} \left( w(t) h^2 \right) = 0
\]
for any $w(t)$ so long as $h(t)$ satisfies the boundary conditions (A.9), we can add it to the integral in (A.11) and seek a function $w(t)$ such that the expression
\[
\delta^2 D = \frac{1}{2} \int_0^T dt \left[ P \dot{h}^2 + 2 w h \dot{h} + (Q + \dot{w}) h^2 \right]
\]
may be written as a perfect square. This yields the differential equation
\[
P (Q + \dot{w}) = w^2 \tag{A.12}
\]
for $w(t)$, and second variation
\[
\delta^2 D[h] = \frac{1}{2} \int_0^T dt\, P \left( \dot{h} + \frac{w}{P} h \right)^2 . \tag{A.13}
\]
Therefore, given a solution $w(t)$ of (A.12), a sufficient condition for a minimum is for $P > 0$ (a necessary condition is for $P \geq 0$). The analogous condition in the multi-dimensional case is for the matrix $\|\mathbf{P}\|$ to be positive definite.
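The perfect-square step can be written out explicitly. The following worked expansion uses only the symbols $P$, $Q$, $w$ and $h$ defined above, and is included as an illustration of why (A.12) is the required condition:

```latex
\begin{align*}
P\dot{h}^2 + 2 w h \dot{h} + (Q + \dot{w}) h^2
  &= P\left( \dot{h}^2 + \frac{2w}{P} h \dot{h} + \frac{w^2}{P^2} h^2 \right)
     + \left( Q + \dot{w} - \frac{w^2}{P} \right) h^2 \\
  &= P\left( \dot{h} + \frac{w}{P} h \right)^2
     + \frac{P (Q + \dot{w}) - w^2}{P}\, h^2 .
\end{align*}
```

The remainder term vanishes identically exactly when $P(Q + \dot{w}) = w^2$, which leaves the integrand of (A.13).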
If the differential term $\dot{h} + \frac{w}{P} h$ in (A.13) were equal to zero for some $h(t)$, the boundary condition $h(0) = 0$ would then imply $\dot{h}(0) = 0$ and thus $h(t) = 0$ for all $t$ by the uniqueness theorem as applied to this first order differential equation. Therefore the functional (A.13) is positive definite if, and only if,
1.) $P > 0$,
2.) A solution for eq. (A.12) exists for the whole interval $[0, T]$.

In general, there is no guarantee of condition (2) even if condition (1) is valid. For example if $P = 1$, $Q = -1$, (A.12) has solution $w(t) = \tan(t + c)$, which has no finite solution on the whole interval if $|T| > \pi$.† If a solution $w$ for (A.12) has a pole at, say, $\tilde{t}$, then for the integral (A.13) to remain finite, $h(\tilde{t}) \to 0$. This point is said to be conjugate to the point $t_0 = 0$, i.e., it is a conjugate point.

† Because of reparameterization invariance in our problem, the value of $T$ is adjustable; however, precisely because of this invariance, $\det \|\mathbf{P}\| = 0$ and so $\mathbf{P}$ is no longer positive definite. We discuss this problem and its resolution below.

Equation (A.12) is a Riccati equation, which may be brought to linear form by the transformation $w(t) = -P \dot{H}/H$, with $H(t)$ an unknown function. Substitution in (A.12) gives
\[
-\frac{d}{dt} \left( P \dot{H} \right) + Q H = 0 \tag{A.14}
\]
which is precisely equation (A.8), the Jacobi equation for $h(t)$. This means that for equation (A.12) to have a solution on $[0, T]$, $H(t)$, as given by the solution to (A.14), must have no roots on $[0, T]$. But because equation (A.14) holds for $h(t)$ as well, $h(t)$ must have no roots (conjugate points) on $[0, T]$. Because $h(0) = h(T) = 0$, the only way to extremize (A.11) is to satisfy eq. (A.14) with the trivial solution $h(t) = 0$. If $h(t) \neq 0$ for $0 < t < T$ then it would mean that there was a conjugate point at $\tilde{t} = T$.

In the multi-dimensional case an extremal $|h\rangle$ is one of $3N$ vectors satisfying equations (A.8), i.e. $|h^{(\alpha)}\rangle = |h^{(\alpha)}_1 \ldots h^{(\alpha)}_{3N}\rangle$, $1 \leq \alpha \leq 3N$. A conjugate point is defined as a point where the determinant vanishes:
\[
\det \begin{bmatrix}
h^{(1)}_1(t) & \cdots & h^{(1)}_{3N}(t) \\
\vdots & & \vdots \\
h^{(3N)}_1(t) & \cdots & h^{(3N)}_{3N}(t)
\end{bmatrix} = 0
\]
The sufficient conditions for a transformation to be minimal are then:
1.) The transformation $|r^\star(t)\rangle = \{\mathbf{r}^\star_i(t)\}$ is extremal,
2.) Along $|r^\star(t)\rangle$, the matrix $\mathbf{P}(t) = \|L_{\dot{x}_i \dot{x}_j}\|$ is positive definite, and
3.) The interval $[0, T]$ contains no conjugate points to $t = 0$.

The above ideas can be made clear with a few examples below.

A.1 Distance between points

From the effective Lagrangian $L = \sqrt{\dot{\mathbf{r}}^2}$, $\mathbf{P} = \|L_{\dot{x}_i \dot{x}_j}\|$ is given in equation (A.6), which has determinant $\det \mathbf{P} = 0$, and so is not positive definite. This is due to our choice of parameterization. If we break symmetry by choosing one spatial direction as the independent variable, $L(x, y', z') = \sqrt{1 + y'^2 + z'^2}$ (with e.g. $y' \equiv dy/dx$ and $x_0 \leq x \leq x_1$). Then
\[
\mathbf{P} = \frac{1}{(1 + y'^2 + z'^2)^{3/2}}
\begin{bmatrix}
1 + z'^2 & -y' z' \\
-y' z' & 1 + y'^2
\end{bmatrix}
\]
which is positive definite, since $P_{11} > 0$ and $\det \|\mathbf{P}\| = (1 + y'^2 + z'^2)^{-2} > 0$ for any trajectory. From eq. (A.4), $\|\mathbf{Q}(t)\| = \|0\|$. Along the extremal, where $y(x) = ax + y_0$, $z(x) = bx + z_0$, equation (A.8) gives $\mathbf{P}\mathbf{h}' = \mathbf{c}$, with $\mathbf{c}$ a constant vector and $\mathbf{P}$ a positive definite matrix of constant values with respect to $x$. Solving this first-order equation gives straight line solutions for $\mathbf{h}(x)$. Because $\mathbf{h}(x_0) = 0$, there can be no conjugate points, and because $\mathbf{h}(x_1) = 0$, the only solution to (A.8) is the trivial one, and the extremum is a minimum.

A.2 Geodesics on the surface of a sphere

Taking the azimuthal angle $\phi$ as the independent variable, and polar angle $\theta(\phi)$ as the dependent variable, the arc-length on the surface of a unit sphere may be written as
\[
D[\theta] = \int_{\phi_0}^{\phi_1} d\phi\, \sqrt{\theta'^2 + \sin^2\theta} . \tag{A.15}
\]
The EL equations give the extremal trajectory as $\cos\theta = A \sin\theta \cos\phi + B \sin\theta \sin\phi$ with $A, B$ constants. This is the equation of a plane $z = Ax + By$, which intersects the surface of the sphere to make a great circle. The scalar $P = L_{\theta'\theta'} = \sin^2\theta / (\theta'^2 + \sin^2\theta)^{3/2}$, which is always positive. To simplify the problem, let $\phi_0 = 0$, and $\theta(0) = \theta(\phi_1) = \pi/2$, so the great circle lies in the $z = 0$ plane.
Along this extremal $P$ is constant and equal to 1, while $Q = -1$. The second variation, eq. (A.11), is then $(1/2) \int_0^{\phi_1} d\phi\, (h'^2 - h^2)$. For a minimum, the nontrivial solutions of the corresponding Jacobi equation, $h'' + h = 0$, with $h(0) = 0$ must not have a root in $(0, \phi_1]$. Every nontrivial solution to the Jacobi equation satisfying the initial condition $h(0) = 0$ has the form $h(\phi) = C \sin\phi$, $C \neq 0$, which reveals a conjugate point at $\phi = \pi$. Thus for the extremal curve to be minimal, $\phi_1$ must be $< \pi$, the location of the opposite pole on the sphere. If $\phi_1 < \pi$, there is no extremal solution for $h(\phi)$ other than the trivial one which satisfies the boundary conditions.

It is instructive to look at the arc-length under sinusoidal variations around the extremal path which satisfy the boundary conditions $h(0) = h(\phi_1) = 0$, so that $\theta(\phi) = \pi/2 + h(\phi) = \pi/2 + \epsilon \sin(\pi\phi/\phi_1)$. Inserting this into eq. (A.15) above and expanding to second order in $\epsilon$, we see that first order terms in $\epsilon$ vanish, and the difference in distance from the extremal path is
\[
\Delta D = \frac{\epsilon^2}{4\phi_1} \left( \pi^2 - \phi_1^2 \right) .
\]
For $\phi_1 < \pi$ this is always greater than zero, compatible with the fact that the extremal is a minimum. Further analysis using a general perturbation scheme would be required for a general proof. For $\phi_1 > \pi$ this is always less than zero, indicating the extremal is a maximum with respect to these perturbations: the length may be shortened. When $\phi_1 = \pi$, $\Delta D = 0$ to second order. When $h(\phi)$ represents the difference between great circles, $\Delta D$ is precisely zero.

A.3 Harmonic oscillator

It is not widely appreciated that the classical action for a simple harmonic oscillator is not always a minimum, and indeed in many cases can be a maximum with respect to some perturbations. The action for a harmonic oscillator with given spring constant is proportional to $S[x] = \int_0^T dt\, \frac{1}{2}(\dot{x}^2 - x^2)$, which has EL equation $\ddot{x} + x = 0$. Taking the specific initial conditions $x(0) = 1$, $\dot{x}(0) = 0$, the extremal solution is $x(t) = \cos t$.
The scalar $P(t) = L_{\dot{x}\dot{x}} = 1$, which is always positive and satisfies the necessary conditions for a minimum. The scalar $Q = L_{xx} - \frac{d}{dt} L_{x\dot{x}} = -1$. The second variation is $\delta^2 S[h] = \frac{1}{2} \int_0^T dt\, (\dot{h}^2 - h^2)$, which has Jacobi equation $\ddot{h} + h = 0$. This is the same Jacobi equation as that for geodesics on a sphere, so the sufficient conditions will parallel those above. The boundary condition $h(0) = 0$ gives $h(t) = A \sin t$, with conjugate points at $t = n\pi$, $n = 1, 2, \ldots$. This means that the action is a minimum only so long as $T < \pi$, i.e., a half-period.

If we let $x(t)$ be the extremal solution plus a half-wavelength sinusoidal perturbation that vanishes at both end points, $x(t) = \cos t + \epsilon \sin(\pi t/T)$, then the difference in action from the extremal path becomes
\[
\Delta S = \frac{\epsilon^2}{4T} \left( \pi^2 - T^2 \right) .
\]
This result is exact because the action for the oscillator is quadratic (as opposed to the action for geodesics). Because the action is quadratic, the original EL equation and Jacobi's equation (A.8) are guaranteed to be identical; in such cases it is not particularly necessary to explicitly identify $P$ and $Q$. When $T < \pi$, $\Delta S > 0$, compatible with minimality, as in section A.2. When $T$ is larger than a half-period, $\Delta S < 0$ and the extremal trajectory is a maximum (with respect to half-wavelength sinusoidal perturbations), and when $T = \pi$, the end point is the conjugate point and $\Delta S = 0$.

Appendix B. Necessary conditions for straight line transformations

It was shown in section 2.3.1 that to have straight line transformations between links, it is sufficient to have facing obtuse angles on opposite sides of the quadrilateral defined by the transformation, as shown in figure 2.5A. We now show that it is a necessary condition as well, i.e., we show that a slide in the correct direction is not possible in the absence of obtuse angles.

[Figure B.1: link with ends A and B and end-point displacements $\hat{v}_A\,dt$ and $\hat{v}_B\,dt$]
Figure B.1: A link in 3D space. Without loss of generality assume that the link is initially along the z axis. The paths traveled by the link ends are shown in the figure.
Note that the end point trajectories of A and B are in 3D space, so the paths traveled by A and B need not cross or lie in the same plane. Let the unit vector along A's path be $\hat{v}_A$ and the unit vector along B's path be $\hat{v}_B$. Because the angles that the path of A and the path of B make with the link are acute, the z-component of $\hat{v}_B$ ($z_B$) is negative and the z-component of $\hat{v}_A$ ($z_A$) is positive. One can write $\hat{v}_A$ and $\hat{v}_B$ as
\[
\hat{v}_A = \boldsymbol{\rho}_A + z_A \hat{z}, \qquad
\hat{v}_B = \boldsymbol{\rho}_B + z_B \hat{z},
\]
where $\boldsymbol{\rho}_A$ and $\boldsymbol{\rho}_B$ are vectors in the xy plane, and $z_A > 0$ and $z_B < 0$. Let $\mathbf{r}_A(t)$ and $\mathbf{r}_B(t)$ denote the positions of the A and B ends at time $t$:
\[
\mathbf{r}_A = t\, \hat{v}_A, \qquad
\mathbf{r}_B = g(t)\, \hat{v}_B + \hat{z} .
\]
The rigid link constraint dictates that
\[
(\mathbf{r}_A - \mathbf{r}_B) \cdot (\mathbf{r}_A - \mathbf{r}_B) = 1
\]
which translates to:
\[
g^2 + 2g \left( z_B - t (c + z_A z_B) \right) - 2 t z_A + t^2 + 1 = 1
\]
with $c = \boldsymbol{\rho}_A \cdot \boldsymbol{\rho}_B$. Solving for $g$ as a function of $t$, and choosing the root that satisfies $g(0) = 0$ (recall $z_B < 0$):
\[
g(t) = -\left( z_B - t (c + z_A z_B) \right) - \sqrt{ \left( z_B - t (c + z_A z_B) \right)^2 - t^2 + 2 t z_A } .
\]
Now if $g'(t) > 0$ it means that the B-end of the link is travelling in the assumed direction, and if $g'(t) < 0$ it means that the B-end is travelling in the opposite direction (which means that the angle is not acute anymore). Writing $g'(0)$ we get:
\[
g'(0) = \frac{2 z_B c + 2 z_A z_B^2 - 2 z_A}{2 |z_B|} + c + z_A z_B = -\frac{z_A}{|z_B|} < 0 .
\]
Thus point B can only travel in the opposite direction from what was assumed, which in turn means an all-acute slide is not possible. We conclude that the condition of "facing obtuse angles" is necessary and sufficient for transformations consisting only of pure translations.

Appendix C. Critical angles

The concept of critical angle was first introduced in section 2.3.2. In order for a straight-line slide of both ends to be possible, at some stage during the transformation the link needs to rotate about one of the ends, with the other end being stationary. In principle the rotation can be about either of the two ends and it can happen at the beginning or the end of the transformation.
The conditions on the critical angle or orientation can be readily derived from the broken extremal conditions. As seen from (2.18a) and (2.19), the non-trivial corner conditions read:
\[
\hat{v}_i \big|_{+} = \hat{v}_i \big|_{-} . \tag{C.1}
\]
We know that the path traveled by the moving bead during the rotation is circular and the path that is traveled during the slide part is a straight line. The broken extremal condition forces these two paths to be patched smoothly, which means that the straight-line path should be tangent to the circle. In the 3D case, for the broken extremal condition to be satisfied, the straight line slide path and the circular rotation path should lie in the same plane. For example in figure 2.7, where B is rotating about A initially to $B_1$ and then slides to $B'$, the rotation has to be in the plane formed by the three points $A B B'$.

Matching the directions of velocity as in (C.1) does not by itself mean that a link can subsequently slide in a straight line; however, at the tangent point the tangent line to the circle is perpendicular to the radius, hence this second condition is satisfied as well. Below we derive an analytical expression for the critical angle for a particular case of the single link problem, as an example and illustration of the discussed concepts. Furthermore the particular example will be used later in Appendix D to introduce minimal transformations in 2 dimensions.

Consider the single link action with the particular parametrization $s = s(\theta)$, as discussed in section 2.3.2:
\[
\int \left( \sqrt{\dot{s}^2 + 1 + 2 \dot{s} \cos\theta} + \sqrt{\dot{s}^2} \right) d\theta , \tag{C.2}
\]
where $s \equiv \overrightarrow{A(\theta)A}$ is the (signed) distance of the A-end from its initial position, and $\theta$ is the angle between the link and the horizontal line (see figure C.1).

[Figure C.1: link positions labeled A, B, A', B', with critical angle $\theta_c$]
Figure C.1: Transformation in which both ends stay on a linear track.
The Euler-Lagrange equation of motion reads:
\[
\frac{d}{d\theta} \left( \frac{\dot{s}}{\sqrt{\dot{s}^2}} + \frac{\dot{s} + \cos\theta}{\sqrt{\dot{s}^2 + 1 + 2 \dot{s} \cos\theta}} \right) = 0 \tag{C.3}
\]
We consider a transformation which is not (necessarily) a minimum:
\[
s = a \cos\theta - \sin\theta + b \tag{C.4}
\]
with $a$ and $b$ parameters to be determined. Such a transformation in fact forces the two ends to travel on a straight line (right from the beginning), but the A side may in fact retreat and then move forward. We call such a transformation a "hyperextended transformation". A sample transformation of this kind is shown in figure C.1. The parameters $a$ and $b$ in (C.4) can be tuned to meet the boundary conditions (see below).

In fact it is seen that point A on the link retreats backwards until it reaches some critical angle, which is when link AB makes an angle $\pi/2$ with the straight line $BB'$ that point B travels on. Subsequently A moves forward towards $A'$. Assume that $\theta$ runs from $\theta_1$ to $\theta_2$, where $0 < \theta_2 < \pi/2$. For simplicity assume that both these angles are between 0 and $\pi/2$. The boundary conditions dictate that:
\[
s(\theta_1) = 0 \tag{C.5}
\]
\[
s(\theta_2) = l \tag{C.6}
\]
where $l$ is the distance between A and $A'$. The parameters $a$ and $b$ can be explicitly solved to give:
\[
a = \frac{-\sin\theta_2 + \sin\theta_1 - l}{\cos\theta_1 - \cos\theta_2} \tag{C.7}
\]
\[
b = \frac{\cos\theta_1 \left( \sin\theta_2 + l \right) - \sin\theta_1 \cos\theta_2}{\cos\theta_1 - \cos\theta_2} \tag{C.8}
\]
For our purposes we only need to note that the critical angle occurs when $\dot{s} \equiv ds/d\theta$ becomes zero, that is, when A stops going backward and starts moving forward:
\[
\dot{s} = -a \sin\theta - \cos\theta = 0 \tag{C.9}
\]
where $a$ is given in (C.7). We can now ask what $\theta_1$ should be so that there is no need for the link to go backward, i.e., it moves forward from the beginning and the transformation is monotonic. Setting $\dot{s} = 0$ at the initial angle in (C.9), with $a$ from (C.7), gives (writing $\theta$ for $\theta_1$):
\[
\cos\theta + \frac{-\sin\theta_2 + \sin\theta - l}{\cos\theta - \cos\theta_2} \sin\theta = 0 \tag{C.10}
\]
For pedagogical reasons we prove condition (C.10) using analytic geometry as well.
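Before the geometric argument, conditions (C.5) to (C.10) can be checked numerically. The sketch below assumes the form $s = a\cos\theta - \sin\theta + b$ for (C.4); the function names are hypothetical, and the closed form used in `critical_l` is simply (C.10) rearranged with $\theta = \theta_1$, namely $l \sin\theta_1 = 1 - \cos(\theta_1 - \theta_2)$.

```python
import math

def hyperextended_params(theta1, theta2, l):
    """Solve s(theta1) = 0 and s(theta2) = l for s = a*cos(theta) - sin(theta) + b."""
    a = (-math.sin(theta2) + math.sin(theta1) - l) / (math.cos(theta1) - math.cos(theta2))
    b = math.sin(theta1) - a * math.cos(theta1)  # from s(theta1) = 0
    return a, b

def s(theta, a, b):
    """Signed displacement of the A-end, eq. (C.4) under the assumed sign convention."""
    return a * math.cos(theta) - math.sin(theta) + b

def s_dot(theta, a):
    """ds/dtheta, eq. (C.9)."""
    return -a * math.sin(theta) - math.cos(theta)

def critical_l(theta1, theta2):
    """Value of l for which theta1 is exactly the critical angle (s_dot(theta1) = 0)."""
    return (1.0 - math.cos(theta1 - theta2)) / math.sin(theta1)

theta1, theta2 = 1.2, 0.4          # illustrative angles, both in (0, pi/2)
l = critical_l(theta1, theta2)
a, b = hyperextended_params(theta1, theta2, l)
```

With these values the boundary conditions hold and `s_dot(theta1, a)` vanishes, so the A-end starts exactly at the turning point and does not retreat.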
Looking at figure C.2 we have the following (here $a$, $b$ and $g$ denote lengths in figure C.2, not the parameters of (C.4)):
\[
g^2 + l_1^2 = 1 \tag{C.11}
\]
\[
g^2 + l_2^2 = a^2 \tag{C.12}
\]
\[
\frac{g}{a} = \frac{1}{l + l_1 + l_2} \tag{C.13}
\]
We can solve $g = \sqrt{1 - l_1^2}$ and $a = \sqrt{1 - l_1^2 + l_2^2}$ from the first two equations and substitute in the third equation to give:
\[
l = \frac{\sqrt{1 - l_1^2 + l_2^2}}{\sqrt{1 - l_1^2}} - l_1 - l_2 \tag{C.14}
\]

[Figure C.2: construction with angles $\theta_1$, $\theta_2$ and lengths $l$, $l_1$, $l_2$, $a$, $b$, $g$]
Figure C.2: Geometric proof for critical angle condition.

On the other hand, based on our results for $g$ and $a$ we have:
\[
\sin\theta_1 = \frac{\sqrt{1 - l_1^2}}{\sqrt{1 - l_1^2 + l_2^2}} \tag{C.15}
\]
\[
\cos\theta_1 = \frac{l_2}{\sqrt{1 - l_1^2 + l_2^2}} \tag{C.16}
\]
\[
\sin\theta_2 = l_1 \tag{C.17}
\]
\[
\cos\theta_2 = \sqrt{1 - l_1^2} \tag{C.18}
\]
Substitution of eqns. (C.15)-(C.18) in equation (C.10) gives equation (C.14) after some simplification.

For the particular case that we have discussed, the proposed transformation is in fact a minimal solution if $\theta_1$ is greater than the critical angle, because in that case a simple slide would be possible. If $\theta_1$ is less than the critical angle, a locally minimal solution, as we know, is pure rotation to the critical angle and then a straight line slide. Pure rotation has a nice geometric interpretation in our parametrization. It corresponds to the null solution $s = 0$. Since at the critical angle $\dot{s} = 0$, we see that $s = 0$ will be smoothly patched with $s = a \cos\theta - \sin\theta + b$, as mandated by the corner conditions in equation (2.18a).

Figure C.3: A minimal transformation in $s(\theta)$ parametrization. The horizontal segment corresponds to pure rotation and the curved section corresponds to slide on straight paths. Here the corner conditions demand that the derivative $\dot{s}$ be continuous at the critical angle.

Appendix D. Minimal transformations in 2 dimensions

It was seen in section 2.4.1 that for the case of two links when one is confined to moving in a plane, satisfying the constant link length constraints and corner conditions does not seem to lead to solutions which are extremal. However, given the additional constraint that the links must lie in a plane, there must be one or a set of minimal transformations.
We need to look at other forms of transformations, namely compound straight line transformations. We will elaborate on the idea starting with single links.

The hyperextended solution that was discussed previously in Appendix C can be considered as a very special example of a compound straight line transformation. These are transformations that are made strictly from straight line paths with no pure rotation. A more general transformation is shown in figure D.1 beside the old transformation.

Note that the corners do not technically violate the corner conditions, because the speed of bead "A" is zero at the corner point in any parametrization that can simultaneously describe the A motion and the B motion: since at the corner point the link makes an angle of 90 degrees with the path that B travels, the speed of B at the critical angle is infinitely larger than the speed of A. In fact one sees that we have an instantaneous pure rotation about the A-bead when it is at the corner point. $\hat{v}_A$ is not clearly defined at the corners, and everywhere else (when the speed of the bead(s) is not zero), the two beads are travelling on a straight line.

The two solutions depicted in the figure come from two different parametrizations of the most general form of the action and result in different distances. But each of them is a local minimum once the direction of $\overrightarrow{AA''}$ is picked, and these local minima have different values for the distance.

[Figure D.1: two panels with points A, B, A', B', A'' (and B'' in the second panel); segment lengths labeled 0.47, 1.00, 2.24 in the first panel and 0.11, 1.08, 2.24 in the second]
Figure D.1: The previous hyperextended solution is shown along with a more general compound straight-line transformation, where $\overrightarrow{AA''}$ travels in some general direction. The length of each line segment is written beside it. For the hyperextended solution the value of $AA''$ is multiplied by two because the path is traveled twice.

We can then ask about the best position to put the corner point, to minimize the distance traveled in the compound straight line transformation, with respect to other compound straight line transformations. We assume the corner occurs on one side and we take it to be the "A" side. Note that at the corner, the link makes a right angle with the B-bead path $BB'$, meaning that the distance from the corner point to the B path is always the length of the link, i.e., unity. Also note that the total distance that the "A"-bead travels is the distance from the initial point A to the corner point $A''$, plus the distance from $A''$ to the final position $A'$. The locus of points with equal sum of distances from two points A and $A'$ defines an ellipse with foci at A and $A'$. Moreover the length of the major axis of the ellipse equals the sum of the distances from the foci. Thus the smaller the major axis of the ellipse with foci A and $A'$, the smaller the total distance traveled by the "A"-bead. Moreover $A''$ should sit on a line parallel to the B-path at a distance of 1 from the B-path line $BB'$.

So in seeking the shortest distance traveled by the A end of the link, we seek the point $A''$ such that it lies on an ellipse with foci A and $A'$, the ellipse shares at least one point with a line parallel to $BB'$ and distance 1 away from it, and lastly the ellipse has the smallest possible major axis (see figure D.2). So the ellipse giving the minimal distance is tangent to the parallel line, and $A''$ is the tangent point. This is illustrated in figure D.2.

[Figure D.2: points A, B, A', B', A'', B''; segment lengths labeled 2.24, 0.18, 0.93]
Figure D.2: Optimal compound straight line transformation.

This solution can be straightforwardly extended to 2 links, as depicted in figure D.3.
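The tangency construction has an equivalent reflection form that is simple to compute: reflecting $A'$ across the parallel line and joining the image to A by a straight segment gives the tangent point $A''$. The following is a minimal sketch of that idea. It assumes A and $A'$ lie strictly on the same side of the line; the function name and arguments are hypothetical; and it only minimizes the A-bead path $|AA''| + |A''A'|$, without checking the link-length constraint for the B-bead.

```python
import math

def optimal_corner(A, A_prime, P, d):
    """Point X on the line through P with unit direction d minimizing |A X| + |X A'|.

    Assumes A and A' lie strictly on the same side of the line.
    Returns (X, minimal total path length).
    """
    dx, dy = d
    nx, ny = -dy, dx  # unit normal to the line
    # Signed heights of A and A' above the line.
    h_a = (A[0] - P[0]) * nx + (A[1] - P[1]) * ny
    h_p = (A_prime[0] - P[0]) * nx + (A_prime[1] - P[1]) * ny
    # Mirror image of A' across the line.
    R = (A_prime[0] - 2.0 * h_p * nx, A_prime[1] - 2.0 * h_p * ny)
    h_r = -h_p
    # The segment A -> R crosses the line where the signed height vanishes.
    t = h_a / (h_a - h_r)
    X = (A[0] + t * (R[0] - A[0]), A[1] + t * (R[1] - A[1]))
    total = math.hypot(R[0] - A[0], R[1] - A[1])
    return X, total

# Illustrative numbers only: A = (0, 0), A' = (2, 0), line y = 1.
X, total = optimal_corner((0.0, 0.0), (2.0, 0.0), (0.0, 1.0), (1.0, 0.0))
```

For this symmetric example the corner point lies midway between the foci, at (1, 1), and the total path length is $2\sqrt{2}$, which is the major axis of the tangent ellipse.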
Consider then the example in figure 2.15a, where the links are no longer allowed to move out of the plane (see figure D.4). Here $\mathbf{r}_A = \mathbf{r}_{A'}$ and $\mathbf{r}_C = \mathbf{r}_{C'}$, and the above ellipses turn into circles centered at A and C. The circles have radii $1 - 1/\sqrt{2}$, so that the perpendicular distance from line $BB'$ to the farthest point on the circle is 1 and a fully extended intermediate state is allowed.

[Figure D.3: two-link construction; segment lengths labeled 2.24, 0.18, 0.93, 2.45, 0.94]
Figure D.3: An optimal compound straight-line solution for 2 links. For this particular class of solutions, the problem is divided into two disjoint problems (one for each link) and solved separately.

[Figure D.4: points A, B, C, A', B', C', A'', B'', C'']
Figure D.4: Minimal transformation restricted to 2 dimensions, for 2 links of opposite convexity which form opposite sides of a square.

Appendix E. Extremal trajectories of beads or links subject to steric excluded volume

Finding the extremal trajectories of beads or links subject to steric excluded volume is a variational problem in the presence of an inequality constraint. A bead can be outside a given region but not inside it, or must travel from point A to point B while avoiding an intervening volume.

E.1 Point particle

Variational problems subject to inequality constraints arose historically in the theory of optimal control [15, 50, 51, 116]. In our context we illustrate the idea with a simple example of a point particle moving from A to B, but with the constraint that the point and resulting trajectory must lie outside an infinite cylinder of radius $a$, i.e. $r \geq a$ in Fig. E.1. The distance traveled by the point is written as
\[
D[\mathbf{r}] = \int_0^T dt\, F(\dot{\mathbf{r}}, \lambda, \eta), \tag{E.1a}
\]
where
\[
F(\dot{\mathbf{r}}, \lambda, \eta) = \sqrt{\dot{\mathbf{r}}^2} + \lambda \left( a - |\mathbf{r}| + \eta^2 \right) \tag{E.1b}
\]
The second term in the integrand embodies the inequality constraint $a - |\mathbf{r}| \leq 0$.
The value $\lambda$ is the Lagrange multiplier enforcing the constraint, and the quantity $\eta^2$ may be thought of as an "excess parameter" whose significance will soon become clear. Let a vector $\mathbf{X} = (\mathbf{r}, \lambda, \eta)$ represent all the unknowns in the problem. The Euler-Lagrange (EL) equations are then
\[
\frac{d}{dt} F_{\dot{\mathbf{X}}} = F_{\mathbf{X}}, \tag{E.2}
\]
with the convention $F_{\mathbf{X}} \equiv \partial F/\partial \mathbf{X}$. The EL equations are
\[
a - r + \eta^2 = 0, \tag{E.3a}
\]
\[
\lambda \eta = 0, \tag{E.3b}
\]
\[
\dot{\hat{v}} = -\lambda \hat{r} . \tag{E.3c}
\]

Figure E.1: (a) Extremal trajectories for an inequality constraint problem. In this case, a path that is a minimal distance from point A at $(x_A, y_A) = (-1.5, 0)$ to point B at $(x_B, y_B) = (+1.5, 0)$ is sought subject to the constraint that the path must remain outside a circle of unit radius. Both positive and negative solutions are shown. (b) Lagrange multiplier $\lambda$ and excess parameter $\eta$ for the above problem. If $\lambda \neq 0$, $\eta = 0$, and if $\lambda = 0$, $\eta \neq 0$.

In addition to the EL equations, transversality or corner conditions must hold for the trajectory to be extremal [46]. These demand that
\[
F_{\dot{\mathbf{r}}}(t^-) = F_{\dot{\mathbf{r}}}(t^+) \tag{E.4a}
\]
and
\[
\left( F - \dot{\mathbf{r}} \cdot F_{\dot{\mathbf{r}}} \right) \big|_{t^-} = \left( F - \dot{\mathbf{r}} \cdot F_{\dot{\mathbf{r}}} \right) \big|_{t^+} \tag{E.4b}
\]
where $t^\pm = \lim_{\epsilon \to 0} (t \pm \epsilon)$. In this parameterization ($\mathbf{r}$ in terms of time), Eq. E.4b gives no new information, and Eq. E.4a demands that
\[
\hat{v}(t^-) = \hat{v}(t^+) . \tag{E.5}
\]
To solve these equations, first note that from Eq. E.3a, if $r > a$, the excess parameter is $\eta^2 > 0$. Then from Eq. E.3b, the Lagrange multiplier is $\lambda = 0$. Then from Eq. E.3c, $\dot{\hat{v}} = 0$ and the particle moves in a straight line. The particle moves in a straight line until a point where it touches the cylinder. Equation E.5 demands that the straight line must be tangent to the cylinder, otherwise we would have a corner at that point. Once on the cylinder, $r = a$ and so $\eta^2 = 0$. The quantity $\dot{\hat{v}}$ is determined kinematically by the trajectory which follows the boundary condition, here the surface of the cylinder at $r = a$. This then determines $\lambda(t) = |\dot{\hat{v}}|$. This gives the piecewise trajectory in Fig. E.1 a. Both positive and negative solutions are shown.
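The piecewise trajectory (tangent line, arc on the boundary, tangent line) has a simple closed-form length in the symmetric geometry of Fig. E.1. The sketch below is an illustration under stated assumptions: the circle is centered at the origin, both end points lie on the x axis on opposite sides of it, and the function name `detour_length` is hypothetical.

```python
import math

def detour_length(d_a, d_b, radius):
    """Length of the tangent-arc-tangent path from (-d_a, 0) to (d_b, 0)
    around a circle of the given radius centered at the origin."""
    if min(d_a, d_b) < radius:
        raise ValueError("both end points must lie outside the circle")
    # Straight pieces: tangent segments from each end point to the circle.
    tangent_a = math.sqrt(d_a ** 2 - radius ** 2)
    tangent_b = math.sqrt(d_b ** 2 - radius ** 2)
    # Angle at the center between each end point and its tangent point.
    alpha_a = math.acos(radius / d_a)
    alpha_b = math.acos(radius / d_b)
    # The remaining angle is traveled on the circle itself.
    arc = radius * (math.pi - alpha_a - alpha_b)
    return tangent_a + arc + tangent_b

length = detour_length(1.5, 1.5, 1.0)  # geometry of Fig. E.1
```

For the geometry of Fig. E.1 this gives a length of roughly 3.70, compared with 3 for the unobstructed straight line; as the radius shrinks to zero the arc disappears and the straight-line distance is recovered.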
For this extremal trajectory, the Lagrange multiplier and excess parameter can be found straightforwardly, for example as functions of $x$ (Fig. E.1 b). In particular $\lambda = 1/y(x)$ on the cylinder, and zero otherwise.

If the obstructing object is no longer a cylinder of circular cross section, but we compress the x axis of the cylinder so that its cross section is an ellipse, then in the limit that the minor axis (the x axis of the ellipse) $\to 0$, the obstructing object becomes a flat strip (or line in cross section). Then the extremal trajectory consists of two straight-line pieces with an apparent corner between them, due to the discontinuity at the surface of the excluded boundary.

E.2 One link

The above solution can be generalized to the case of a single link undergoing a transformation from one side of a sphere to the other side. For the initial conditions in Fig. E.2 a, the solution consists of one bead on the link moving in straight-line motion, and the other following a piecewise trajectory consisting of straight-line motion, a great circle geodesic, and finally straight-line motion again.

When one axis of the sphere is compressed so that the sphere becomes a disk, the minimal-distance solution acquires a discontinuity or cusp (Fig. E.2 b). This means that minimal-distance transformations can violate corner conditions if the inequality constraints are themselves discontinuous, or more precisely nonsmooth. The extremal transformation of the link AB in Fig. E.2 b involves a straight-line translation of A to $A_1$, while point B translates to $B_L$. Then point B rotates to point $B_1$ on the surface of the disk, where it experiences a corner as per the above discussion. It subsequently rotates again to $B_R$, then $A_1$ and $B_R$ translate together in straight lines to points $A'$ and $B'$, respectively.

As another example, consider the initial conditions in Fig. E.2 c, which involves the problem of one link transforming in the presence of an infinite strip.
This situation has applications to the problem of chain non-crossing discussed in the text. The minimal transformation consists of two piecewise rotations of B with a corner between them, at position $B_c$.

Figure E.2: (a) Extremal trajectory for a one-link transformation subject to inequality constraints. The link moves from configuration $AB$ to $A'B'$ in the presence of an obstructing sphere. The link length AB is conserved during this process. The distance traveled by the end-points A and B of the link is minimized by the transformation shown, which involves straight-line motion of A to $A'$, and straight-line motion of B along a trajectory tangent to the sphere. Point B traces out a great circle on the surface of the sphere before continuing to $B'$ on another trajectory tangent to the sphere. (b) When the sphere in panel a is compressed to form a two-dimensional disk of the same radius, the minimal transformation takes the form shown, with a discontinuity in the trajectory of B at point $B_1$. Moreover, the piecewise solution must still retain rotations and is not purely piecewise straight lines. (c) Transformation from $AB$ to $AB'$, in the presence of an intervening infinite strip. The minimal transformation consists of two piecewise rotations with a corner violation between them: the link rotates from B to $B_c$, then from $B_c$ to $B'$.

Appendix F. Cross correlation of order parameters

Cross correlations among the order parameters used in chapters 5 and 6, for various classifications of proteins, are shown below:
 | INX | LRO | RCO | ACO | MRSD | RMSD
INX | --- | (0.437,2.18e-03) | (0.413,3.78e-03) | (0.273,0.055) | (0.140,0.327) | (0.133,0.350)
LRO | 0.650,4.34e-04 | --- | 0.718,4.91e-07 | 0.591,3.46e-05 | (0.337,0.018) | (0.357,0.012)
RCO | 0.552,4.20e-03 | 0.853,6.25e-08 | --- | 0.513,3.22e-04 | (0.220,0.123) | (0.253,0.076)
ACO | (0.470,0.018) | 0.830,2.79e-07 | 0.786,3.26e-06 | --- | 0.667,3.00e-06 | 0.700,9.36e-07
MRSD | (0.454,0.023) | 0.629,7.60e-04 | (0.455,0.022) | 0.867,2.09e-08 | --- | 0.967,1.26e-11
RMSD | (0.446,0.025) | 0.642,5.41e-04 | (0.473,0.017) | 0.880,6.77e-09 | 0.998,0.00e+00 | ---
⟨Dnx⟩ | 0.669,2.55e-04 | 0.707,7.73e-05 | 0.546,4.73e-03 | 0.871,1.48e-08 | 0.909,3.27e-10 | 0.911,2.49e-10
⟨Dnx⟩/N | 0.918,1.07e-10 | 0.756,1.25e-05 | 0.636,6.36e-04 | 0.737,2.60e-05 | 0.737,2.61e-05 | 0.734,3.01e-05
⟨D⟩ | (0.299,0.146) | 0.514,8.55e-03 | (0.308,0.134) | 0.817,6.22e-07 | 0.963,1.33e-14 | 0.967,4.22e-15
⟨D⟩/N | 0.534,5.93e-03 | 0.667,2.68e-04 | (0.495,0.012) | 0.877,9.00e-09 | 0.995,0.00e+00 | 0.993,0.00e+00
N | (0.213,0.306) | (0.425,0.034) | (0.160,0.444) | 0.712,6.56e-05 | 0.917,1.22e-10 | 0.921,7.08e-11

 | ⟨Dnx⟩ | ⟨Dnx⟩/N | ⟨D⟩ | ⟨D⟩/N | N
INX | (0.487,6.50e-04) | 0.733,2.78e-07 | (0.060,0.674) | (0.207,0.148) | (0.00e+00,1.000)
LRO | 0.551,1.13e-04 | 0.518,2.88e-04 | (0.284,0.047) | (0.377,8.20e-03) | (0.212,0.138)
RCO | (0.487,6.50e-04) | 0.533,1.86e-04 | (0.153,0.283) | (0.273,0.055) | (0.081,0.573)
ACO | 0.693,1.19e-06 | 0.513,3.22e-04 | 0.627,1.13e-05 | 0.707,7.37e-07 | 0.570,6.41e-05
MRSD | 0.640,7.32e-06 | (0.407,4.38e-03) | 0.880,7.02e-10 | 0.933,6.18e-11 | 0.799,2.19e-08
RMSD | 0.647,5.87e-06 | (0.400,5.07e-03) | 0.873,9.42e-10 | 0.927,8.43e-11 | 0.792,2.87e-08
⟨Dnx⟩ | --- | 0.753,1.30e-07 | 0.573,5.89e-05 | 0.693,1.19e-06 | 0.503,4.21e-04
⟨Dnx⟩/N | 0.904,6.18e-10 | --- | (0.327,0.022) | (0.473,9.12e-04) | (0.255,0.074)
⟨D⟩ | 0.877,9.11e-09 | 0.621,9.27e-04 | --- | 0.840,3.97e-09 | 0.919,1.18e-10
⟨D⟩/N | 0.938,4.67e-12 | 0.799,1.70e-06 | 0.946,9.69e-13 | --- | 0.758,1.07e-07
N | 0.781,4.09e-06 | 0.512,8.88e-03 | 0.973,4.44e-16 | 0.889,2.81e-09 | ---

Table F.1: Two-state proteins: correlation between various order parameters.
The upper triangle (elements above the dash cell in each column) contains the Kendall correlation coefficient and the corresponding p-value, and the lower triangle contains the Pearson correlation coefficient and the corresponding p-value.

 | INX | LRO | RCO | ACO | MRSD | RMSD
INX | --- | (0.359,0.088) | (-2.56e-02,0.903) | (0.359,0.088) | (0.436,0.038) | (0.410,0.051)
LRO | 0.714,6.13e-03 | --- | 0.615,3.41e-03 | (0.538,0.010) | (0.256,0.222) | (0.282,0.180)
RCO | (-4.88e-02,0.874) | (0.612,0.026) | --- | (0.462,0.028) | (0.026,0.903) | (0.051,0.807)
ACO | (0.591,0.033) | 0.703,7.31e-03 | (0.516,0.071) | --- | 0.564,7.27e-03 | 0.590,5.01e-03
MRSD | 0.741,3.75e-03 | (0.351,0.240) | (-1.44e-01,0.639) | 0.717,5.78e-03 | --- | 0.974,3.54e-06
RMSD | 0.721,5.38e-03 | (0.356,0.233) | (-1.04e-01,0.736) | 0.743,3.63e-03 | 0.997,8.57e-14 | ---
⟨Dnx⟩ | 0.868,1.19e-04 | (0.408,0.166) | (-3.00e-01,0.320) | (0.588,0.034) | 0.898,3.13e-05 | 0.884,6.11e-05
⟨Dnx⟩/N | 0.947,9.03e-07 | (0.577,0.039) | (-1.20e-01,0.696) | (0.677,0.011) | 0.897,3.29e-05 | 0.885,5.89e-05
⟨D⟩ | 0.750,3.16e-03 | (0.260,0.390) | (-3.54e-01,0.235) | (0.586,0.035) | 0.955,3.70e-07 | 0.944,1.29e-06
⟨D⟩/N | 0.781,1.60e-03 | (0.389,0.189) | (-1.42e-01,0.643) | 0.721,5.44e-03 | 0.998,2.38e-14 | 0.994,8.54e-12
N | 0.733,4.36e-03 | (0.204,0.505) | (-4.61e-01,0.113) | (0.485,0.093) | 0.905,2.16e-05 | 0.884,5.96e-05

 | ⟨Dnx⟩ | ⟨Dnx⟩/N | ⟨D⟩ | ⟨D⟩/N | N
INX | 0.615,3.41e-03 | 0.769,2.52e-04 | (0.513,0.015) | (0.436,0.038) | (0.503,0.017)
LRO | (0.333,0.113) | (0.333,0.113) | (0.231,0.272) | (0.256,0.222) | (0.219,0.297)
RCO | (-5.13e-02,0.807) | (-5.13e-02,0.807) | (-5.13e-02,0.807) | (0.026,0.903) | (-9.03e-02,0.667)
ACO | (0.487,0.020) | (0.436,0.038) | (0.487,0.020) | 0.564,7.27e-03 | (0.452,0.032)
MRSD | 0.769,2.52e-04 | 0.667,1.51e-03 | 0.821,9.44e-05 | 1.000,1.95e-06 | 0.735,4.65e-04
RMSD | 0.795,1.55e-04 | 0.641,2.29e-03 | 0.846,5.66e-05 | 0.974,3.54e-06 | 0.761,2.91e-04
⟨Dnx⟩ | --- | 0.846,5.66e-05 | 0.897,1.95e-05 | 0.769,2.52e-04 | 0.890,2.27e-05
⟨Dnx⟩/N | 0.966,8.34e-08 | --- | 0.744,4.02e-04 | 0.667,1.51e-03 | 0.735,4.65e-04
⟨D⟩ | 0.959,2.42e-07 | 0.899,2.97e-05 | --- | 0.821,9.44e-05 | 0.916,1.30e-05
⟨D⟩/N | 0.920,8.61e-06 | 0.924,6.46e-06 | 0.959,2.17e-07 | --- | 0.735,4.65e-04
N | 0.934,2.93e-06 | 0.857,1.83e-04 | 0.983,1.78e-09 | 0.909,1.64e-05 | ---

Table F.2: Three-state proteins: correlation between various order parameters. The upper triangle (elements above the dash cell in each column) contains the Kendall correlation coefficient and the corresponding p-value, and the lower triangle contains the Pearson correlation coefficient and the corresponding p-value.

 | INX | LRO | RCO | ACO | MRSD | RMSD
INX | --- | (0.330,0.157) | (-3.09e-01,0.186) | (0.345,0.139) | (0.491,0.036) | (0.418,0.073)
LRO | (0.372,0.259) | --- | (0.183,0.432) | 0.624,7.56e-03 | 0.624,7.56e-03 | (0.550,0.018)
RCO | (-4.71e-01,0.144) | (0.208,0.538) | --- | (0.200,0.392) | (-9.09e-02,0.697) | (-1.82e-02,0.938)
ACO | (0.381,0.248) | 0.753,7.53e-03 | (0.172,0.613) | --- | 0.709,2.40e-03 | 0.782,8.15e-04
MRSD | (0.586,0.058) | (0.600,0.051) | (-2.04e-01,0.547) | 0.910,9.88e-05 | --- | 0.927,7.18e-05
RMSD | (0.560,0.073) | (0.599,0.051) | (-1.94e-01,0.567) | 0.918,6.68e-05 | 0.999,7.84e-13 | ---
⟨Dnx⟩ | (0.600,0.051) | (0.425,0.193) | (-3.60e-01,0.277) | 0.786,4.12e-03 | 0.928,3.77e-05 | 0.927,3.96e-05
⟨Dnx⟩/N | 0.852,8.71e-04 | (0.494,0.123) | (-4.22e-01,0.196) | (0.723,0.012) | 0.898,1.72e-04 | 0.886,2.80e-04
⟨D⟩ | (0.510,0.109) | (0.490,0.126) | (-2.78e-01,0.409) | 0.858,7.22e-04 | 0.964,1.72e-06 | 0.967,1.17e-06
⟨D⟩/N | (0.616,0.044) | (0.595,0.053) | (-2.26e-01,0.504) | 0.901,1.53e-04 | 0.999,7.26e-14 | 0.997,3.38e-11
N | (0.535,0.090) | (0.578,0.062) | (-2.65e-01,0.431) | 0.897,1.80e-04 | 0.984,4.42e-08 | 0.988,1.14e-08

 | ⟨Dnx⟩ | ⟨Dnx⟩/N | ⟨D⟩ | ⟨D⟩/N | N
INX | 0.673,3.97e-03 | 0.818,4.60e-04 | (0.491,0.036) | (0.527,0.024) | (0.587,0.012)
LRO | (0.587,0.012) | (0.440,0.059) | 0.624,7.56e-03 | (0.587,0.012) | 0.611,8.88e-03
RCO | (-1.27e-01,0.586) | (-2.00e-01,0.392) | (-9.09e-02,0.697) | (-1.27e-01,0.586) | (-1.10e-01,0.637)
ACO | 0.673,3.97e-03 | (0.527,0.024) | 0.709,2.40e-03 | 0.673,3.97e-03 | 0.697,2.83e-03
MRSD | 0.818,4.60e-04 | 0.673,3.97e-03 | 1.000,1.85e-05 | 0.964,3.69e-05 | 0.917,8.55e-05
RMSD | 0.745,1.41e-03 | (0.600,0.010) | 0.927,7.18e-05 | 0.891,1.36e-04 | 0.844,3.01e-04
⟨Dnx⟩ | --- | 0.855,2.53e-04 | 0.818,4.60e-04 | 0.855,2.53e-04 | 0.844,3.01e-04
⟨Dnx⟩/N | 0.925,4.64e-05 | --- | 0.673,3.97e-03 | 0.709,2.40e-03 | 0.697,2.83e-03
⟨D⟩ | 0.983,5.84e-08 | 0.882,3.24e-04 | --- | 0.964,3.69e-05 | 0.917,8.55e-05
⟨D⟩/N | 0.936,2.31e-05 | 0.915,7.76e-05 | 0.965,1.58e-06 | --- | 0.881,1.62e-04
N | 0.948,9.44e-06 | 0.881,3.36e-04 | 0.982,7.80e-08 | 0.983,5.91e-08 | ---

Table F.3: α-helix dominated proteins (both 2- and 3-state): correlation between various order parameters. The upper triangle (elements above the dash cell in each column) contains the Kendall correlation coefficient and the corresponding p-value, and the lower triangle contains the Pearson correlation coefficient and the corresponding p-value.

 | INX | LRO | RCO | ACO | MRSD | RMSD
INX | --- | (0.165,0.412) | (0.363,0.071) | (-2.75e-01,0.171) | (-2.97e-01,0.139) | (-3.85e-01,0.055)
LRO | (0.035,0.904) | --- | 0.626,1.81e-03 | (0.165,0.412) | (-3.30e-02,0.870) | (-7.69e-02,0.702)
RCO | (0.378,0.183) | 0.676,7.92e-03 | --- | (-7.69e-02,0.702) | (-3.63e-01,0.071) | (-3.19e-01,0.112)
ACO | (-1.86e-01,0.524) | (0.342,0.231) | (-2.82e-01,0.328) | --- | 0.714,3.73e-04 | 0.758,1.58e-04
MRSD | (-2.91e-01,0.313) | (-1.84e-01,0.529) | (-7.61e-01,1.57e-03) | 0.826,2.75e-04 | --- | 0.912,5.52e-06
RMSD | (-3.09e-01,0.283) | (-1.90e-01,0.516) | (-7.52e-01,1.92e-03) | 0.830,2.37e-04 | 0.998,1.55e-15 | ---
⟨Dnx⟩ | (-9.97e-02,0.734) | (-2.13e-01,0.465) | (-7.26e-01,3.26e-03) | 0.812,4.18e-04 | 0.978,1.73e-09 | 0.972,5.98e-09
⟨Dnx⟩/N | (0.328,0.252) | (-1.35e-01,0.644) | (-5.09e-01,0.063) | 0.707,4.68e-03 | 0.807,4.89e-04 | 0.794,7.03e-04
⟨D⟩ | (-2.46e-01,0.396) | (-2.30e-01,0.429) | (-7.68e-01,1.33e-03) | 0.812,4.24e-04 | 0.993,1.16e-12 | 0.991,7.58e-12
⟨D⟩/N | (-2.44e-01,0.400) | (-1.83e-01,0.532) | (-7.52e-01,1.94e-03) | 0.828,2.55e-04 | 0.999,0.00e+00 | 0.996,9.30e-14
N | (-2.80e-01,0.332) | (-1.69e-01,0.564) | (-7.43e-01,2.32e-03) | 0.844,1.47e-04 | 0.994,1.05e-12 | 0.992,2.89e-12

 | ⟨Dnx⟩ | ⟨Dnx⟩/N | ⟨D⟩ | ⟨D⟩/N | N
INX
(-9.89e-02,0.622) (0.187,0.352) (-3.63e-01,0.071) (-2.97e-01,0.139) (-3.76e-01,0.061) LRO (0.077,0.702) (-7.69e-02,0.702) (-1.10e-02,0.956) (-3.30e-02,0.870) (-2.21e-02,0.912) RCO (-1.21e-01,0.547) (-9.89e-02,0.622) (-2.97e-01,0.139) (-3.63e-01,0.071) (-2.87e-01,0.152) ACO 0.648,1.24e-03 (0.363,0.071) 0.780,1.02e-04 0.714,3.73e-04 0.796,7.39e-05 MRSD 0.758,1.58e-04 (0.516,0.010) 0.934,3.27e-06 1.000,6.30e-07 0.928,3.76e-06 RMSD 0.714,3.73e-04 (0.429,0.033) 0.934,3.27e-06 0.912,5.52e-06 0.950,2.20e-06 hDnxi ||| 0.714,3.73e-04 0.736,2.45e-04 0.758,1.58e-04 0.729,2.80e-04 hDnxi=N 0.900,1.14e-05 ||| (0.451,0.025) (0.516,0.010) (0.442,0.028) hDi 0.988,3.69e-11 0.824,2.90e-04 ||| 0.934,3.27e-06 0.994,7.26e-07 hDi=N 0.986,1.20e-10 0.835,2.04e-04 0.994,6.02e-13 ||| 0.928,3.76e-06 N 0.981,6.64e-10 0.805,5.11e-04 0.996,2.98e-14 0.993,2.25e-12 ||| Table F.4: -sheet dominated proteins (both 2- and 3- state): Correlation between various order parameters. The upper triangle matrix (containing elements above the dash cell in each column) contains Kendall correlation coecient and the corresponding p-value, and the lower triangle portion contains Pearson corr. coecient and the corresponding p-value.178 A p p en d ix F . 
C ro ss co rrela tion of ord er p aram eters INX LRO RCO ACO MRSD RMSD INX ||| (0.077,0.714) (-3.59e-01,0.088) (0.128,0.542) (0.385,0.067) (0.359,0.088) LRO (0.308,0.307) ||| (0.308,0.143) (0.487,0.020) (0.282,0.180) (0.308,0.143) RCO (-4.26e-01,0.147) (0.521,0.068) ||| (0.205,0.329) (-2.05e-01,0.329) (-1.79e-01,0.393) ACO (0.300,0.320) (0.664,0.013) (0.463,0.111) ||| 0.590,5.01e-03 0.615,3.41e-03 MRSD (0.670,0.012) (0.340,0.255) (-2.30e-01,0.451) 0.726,4.95e-03 ||| 0.974,3.54e-06 RMSD (0.659,0.014) (0.373,0.209) (-1.89e-01,0.537) 0.758,2.66e-03 0.998,8.22e-15 ||| hDnxi 0.751,3.09e-03 (0.217,0.475) (-5.13e-01,0.073) (0.481,0.096) 0.915,1.14e-05 0.909,1.70e-05 hDnxi=N 0.889,4.79e-05 (0.328,0.275) (-3.90e-01,0.187) (0.545,0.054) 0.921,7.87e-06 0.916,1.11e-05 hDi (0.683,0.010) (0.212,0.487) (-4.83e-01,0.095) (0.534,0.060) 0.940,1.85e-06 0.932,3.41e-06 hDi=N 0.709,6.61e-03 (0.342,0.253) (-2.56e-01,0.399) 0.706,6.94e-03 0.998,5.55e-15 0.996,6.81e-13 N (0.668,0.013) (0.143,0.641) (-5.76e-01,0.039) (0.437,0.136) 0.897,3.28e-05 0.882,6.66e-05 hDnxi hDnxi=N hDi hDi=N N INX 0.590,5.01e-03 0.590,5.01e-03 (0.513,0.015) (0.410,0.051) (0.462,0.028) LRO (0.179,0.393) (0.179,0.393) (0.205,0.329) (0.308,0.143) (0.154,0.464) RCO (-3.08e-01,0.143) (-3.08e-01,0.143) (-2.82e-01,0.180) (-1.79e-01,0.393) (-2.82e-01,0.180) ACO (0.436,0.038) (0.436,0.038) (0.513,0.015) 0.615,3.41e-03 (0.513,0.015) MRSD 0.795,1.55e-04 0.795,1.55e-04 0.872,3.35e-05 0.974,3.54e-06 0.821,9.44e-05 RMSD 0.769,2.52e-04 0.769,2.52e-04 0.846,5.66e-05 0.949,6.34e-06 0.795,1.55e-04 hDnxi ||| 0.949,6.34e-06 0.923,1.12e-05 0.821,9.44e-05 0.872,3.35e-05 hDnxi=N 0.948,8.33e-07 ||| 0.872,3.35e-05 0.821,9.44e-05 0.821,9.44e-05 hDi 0.986,6.07e-10 0.915,1.14e-05 ||| 0.897,1.95e-05 0.949,6.34e-06 hDi=N 0.929,4.31e-06 0.942,1.50e-06 0.946,1.07e-06 ||| 0.846,5.66e-05 N 0.960,1.93e-07 0.874,9.26e-05 0.985,8.98e-10 0.902,2.43e-05 ||| Table F.5: Mixed secondary structure proteins: Correlation between various parameters. 
The upper triangle matrix (containing elements above the dash cell in each column) contains Kendall correlation coecient and the corresponding p-value, and the lower triangle portion contains Pearson corr. coecient and the corresponding p-value.179 A p p en d ix F . C ro ss co rrela tion of ord er p aram eters INX LRO RCO ACO MRSD RMSD INX ||| (0.424,1.77e-04) (0.229,0.043) (0.292,9.96e-03) (0.252,0.026) (0.223,0.048) LRO 0.658,7.03e-06 ||| 0.615,5.48e-08 0.518,4.66e-06 (0.262,0.021) (0.268,0.018) RCO (0.297,0.071) 0.736,1.42e-07 ||| (0.340,2.66e-03) (4.27e-03,0.970) (0.033,0.772) ACO 0.516,9.10e-04 0.730,1.98e-07 (0.494,1.61e-03) ||| 0.642,1.43e-08 0.670,3.19e-09 MRSD 0.513,1.00e-03 (0.403,0.012) (-4.60e-02,0.784) 0.805,1.07e-09 ||| 0.954,0.00e+00 RMSD 0.502,1.33e-03 (0.411,0.010) (-2.62e-02,0.876) 0.819,3.32e-10 0.998,0.00e+00 ||| hDnxi 0.591,9.27e-05 (0.344,0.034) (-1.83e-01,0.271) 0.672,3.83e-06 0.901,1.31e-14 0.895,3.60e-14 hDnxi=N 0.856,7.02e-12 0.587,1.07e-04 (0.109,0.514) 0.745,8.28e-08 0.851,1.27e-11 0.844,2.69e-11 hDi (0.432,6.78e-03) (0.241,0.145) (-2.63e-01,0.111) 0.673,3.68e-06 0.949,0.00e+00 0.945,0.00e+00 hDi=N 0.566,2.12e-04 (0.434,6.48e-03) (-2.65e-02,0.874) 0.811,6.52e-10 0.998,0.00e+00 0.995,0.00e+00 N (0.395,0.014) (0.211,0.203) (-3.26e-01,0.046) 0.627,2.56e-05 0.934,0.00e+00 0.928,0.00e+00 hDnxi hDnxi=N hDi hDi=N N INX (0.488,1.62e-05) 0.633,2.21e-08 (0.218,0.054) (0.289,0.011) (0.180,0.111) LRO (0.379,8.18e-04) (0.381,7.47e-04) (0.211,0.063) (0.276,0.015) (0.167,0.139) RCO (0.115,0.309) (0.147,0.195) (-5.83e-02,0.606) (0.024,0.831) (-1.03e-01,0.363) ACO 0.633,2.21e-08 0.545,1.47e-06 0.596,1.38e-07 0.656,6.81e-09 0.560,7.31e-07 MRSD 0.747,4.10e-11 0.619,4.53e-08 0.886,4.88e-15 0.963,0.00e+00 0.826,2.82e-13 RMSD 0.724,1.56e-10 0.590,1.82e-07 0.881,7.11e-15 0.929,2.22e-16 0.823,3.40e-13 hDnxi ||| 0.849,6.13e-14 0.730,1.12e-10 0.778,6.12e-12 0.689,1.13e-09 hDnxi=N 0.907,4.22e-15 ||| 0.579,3.11e-07 0.656,6.81e-09 0.538,2.03e-06 hDi 0.961,0.00e+00 
0.814,5.13e-10 ||| 0.866,1.91e-14 0.941,0.00e+00 hDi=N 0.917,6.66e-16 0.885,1.61e-13 0.947,0.00e+00 ||| 0.806,1.03e-12 N 0.928,0.00e+00 0.769,1.74e-08 0.987,0.00e+00 0.928,0.00e+00 ||| Table F.6: Unknotted proteins: correlation between various order parameters the upper triangle matrix (containing elements above the dash cell in each column) contains Kendall correlation coecient and the corresponding p- value, and the lower triangle portion contains Pearson corr. coecient and the corresponding p-value. 180 A p p en d ix F . C ro ss co rrela tion of ord er p aram eters INX LRO RCO ACO MRSD RMSD INX ||| (0.524,0.099) (-5.24e-01,0.099) (-1.43e-01,0.652) (-2.38e-01,0.453) (-1.43e-01,0.652) LRO (0.767,0.044) ||| (-4.76e-02,0.881) (-2.38e-01,0.453) (-3.33e-01,0.293) (-4.29e-01,0.176) RCO (-7.22e-01,0.067) (-3.26e-01,0.476) ||| (-1.43e-01,0.652) (-2.38e-01,0.453) (-3.33e-01,0.293) ACO (-6.24e-02,0.894) (-3.92e-01,0.385) (-3.94e-01,0.382) ||| 0.905,4.32e-03 (0.810,0.011) MRSD (0.213,0.647) (-1.15e-01,0.805) (-7.13e-01,0.072) 0.901,5.68e-03 ||| 0.905,4.32e-03 RMSD (0.211,0.649) (-1.33e-01,0.776) (-7.19e-01,0.068) 0.900,5.77e-03 0.999,1.27e-08 ||| hDnxi (0.713,0.072) (0.304,0.508) (-9.73e-01,2.29e-04) (0.530,0.221) (0.789,0.035) (0.792,0.034) hDnxi=N 0.919,3.42e-03 (0.602,0.153) (-9.05e-01,5.09e-03) (0.287,0.533) (0.573,0.179) (0.571,0.180) hDi (0.344,0.450) (-8.12e-02,0.863) (-8.31e-01,0.021) (0.832,0.020) 0.970,2.84e-04 0.975,1.92e-04 hDi=N (0.396,0.379) (0.044,0.926) (-8.16e-01,0.025) (0.830,0.021) 0.981,9.09e-05 0.981,1.00e-04 N (0.365,0.421) (-8.59e-02,0.855) (-8.41e-01,0.018) (0.822,0.023) 0.957,7.22e-04 0.962,5.15e-04 hDnxi hDnxi=N hDi hDi=N N INX (0.714,0.024) 0.905,4.32e-03 (0.143,0.652) (0.048,0.881) (0.238,0.453) LRO (0.238,0.453) (0.429,0.176) (-3.33e-01,0.293) (-2.38e-01,0.453) (-2.38e-01,0.453) RCO (-8.10e-01,0.011) (-6.19e-01,0.051) (-6.19e-01,0.051) (-5.24e-01,0.099) (-7.14e-01,0.024) ACO (0.143,0.652) (-4.76e-02,0.881) (0.524,0.099) (0.619,0.051) 
(0.429,0.176) MRSD (0.048,0.881) (-1.43e-01,0.652) (0.619,0.051) (0.714,0.024) (0.524,0.099) RMSD (0.143,0.652) (-4.76e-02,0.881) (0.714,0.024) (0.810,0.011) (0.619,0.051) hDnxi ||| (0.810,0.011) (0.429,0.176) (0.333,0.293) (0.524,0.099) hDnxi=N 0.924,2.97e-03 ||| (0.238,0.453) (0.143,0.652) (0.333,0.293) hDi 0.890,7.33e-03 (0.678,0.094) ||| (0.714,0.024) 0.905,4.32e-03 hDi=N 0.885,8.07e-03 (0.720,0.068) 0.981,9.40e-05 ||| (0.619,0.051) N 0.899,5.90e-03 (0.690,0.086) 0.998,2.16e-07 0.972,2.37e-04 ||| Table F.7: Knotted proteins: correlating between various order parameters the upper triangle matrix (containing elements above the dash cell in each column) contains Kendall correlation coecient and the corresponding p-value, and the lower triangle portion contains Pearson corr. coecient and the corresponding p-value. 181 A p p en d ix F . C ro ss co rrela tion of ord er p aram eters INX LRO RCO ACO MRSD RMSD INX ||| (0.361,4.76e-04) (0.048,0.639) (0.352,6.63e-04) (0.380,2.35e-04) (0.358,5.34e-04) LRO 0.530,1.81e-04 ||| (0.488,2.28e-06) (0.448,1.45e-05) (0.217,0.035) (0.223,0.031) RCO (0.022,0.883) 0.662,7.48e-07 ||| (0.200,0.053) (-1.23e-01,0.233) (-1.05e-01,0.309) ACO 0.548,9.85e-05 0.619,5.95e-06 (0.334,0.025) ||| 0.661,1.58e-10 0.679,4.91e-11 MRSD 0.637,2.55e-06 (0.322,0.031) (-1.82e-01,0.232) 0.832,1.47e-12 ||| 0.962,0.00e+00 RMSD 0.625,4.48e-06 (0.329,0.027) (-1.64e-01,0.282) 0.843,3.57e-13 0.999,0.00e+00 ||| hDnxi 0.776,3.85e-10 (0.238,0.115) (-3.31e-01,0.027) 0.659,8.70e-07 0.875,4.00e-15 0.868,1.13e-14 hDnxi=N 0.919,0.00e+00 (0.419,4.21e-03) (-1.45e-01,0.344) 0.698,9.91e-08 0.851,1.34e-13 0.842,4.14e-13 hDi 0.622,5.13e-06 (0.174,0.254) (-3.68e-01,0.013) 0.722,2.17e-08 0.954,0.00e+00 0.951,0.00e+00 hDi=N 0.697,1.07e-07 (0.345,0.020) (-1.79e-01,0.239) 0.827,2.54e-12 0.996,0.00e+00 0.994,0.00e+00 N 0.583,2.66e-05 (0.160,0.293) (-4.13e-01,4.78e-03) 0.693,1.31e-07 0.947,0.00e+00 0.944,0.00e+00 hDnxi hDnxi=N hDi hDi=N N INX 0.592,9.90e-09 0.707,7.51e-12 
(0.370,3.43e-04) (0.420,4.71e-05) (0.329,1.42e-03) LRO (0.324,1.68e-03) (0.318,2.05e-03) (0.183,0.076) (0.233,0.024) (0.146,0.157) RCO (-5.66e-02,0.584) (-3.43e-02,0.739) (-1.74e-01,0.092) (-1.11e-01,0.282) (-2.20e-01,0.033) ACO 0.634,8.08e-10 0.556,7.44e-08 0.622,1.68e-09 0.669,9.43e-11 0.583,1.65e-08 MRSD 0.776,5.77e-14 0.673,7.27e-11 0.889,0.00e+00 0.960,0.00e+00 0.832,8.88e-16 RMSD 0.758,2.19e-13 0.651,2.98e-10 0.887,0.00e+00 0.933,0.00e+00 0.832,8.88e-16 hDnxi ||| 0.877,0.00e+00 0.778,4.97e-14 0.812,3.55e-15 0.735,1.10e-12 hDnxi=N 0.953,0.00e+00 ||| 0.655,2.31e-10 0.713,4.97e-12 0.611,3.23e-09 hDi 0.947,0.00e+00 0.862,2.69e-14 ||| 0.885,0.00e+00 0.940,0.00e+00 hDi=N 0.906,0.00e+00 0.893,2.22e-16 0.959,0.00e+00 ||| 0.824,1.33e-15 N 0.916,0.00e+00 0.823,3.78e-12 0.990,0.00e+00 0.947,0.00e+00 ||| Table F.8: All proteins: correlating between various order parameters the upper triangle matrix (containing elements above the dash cell in each column) contains Kendall correlation coecient and the corresponding p-value, and the lower triangle portion contains Pearson corr. coecient and the corresponding p-value. 182
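The layout described in the captions of Tables F.1 to F.8 (Kendall coefficients above the diagonal, Pearson coefficients below it) can be illustrated with a minimal Python sketch. This is not the code used in the thesis: the function names `pearson_r`, `kendall_tau`, and `mixed_correlation_matrix` are hypothetical, the toy data are made up, the tau-a variant of the Kendall coefficient is an assumption (the thesis does not say which variant was used), and p-values are omitted for brevity.

```python
from itertools import combinations
from math import sqrt


def pearson_r(x, y):
    """Pearson linear correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def kendall_tau(x, y):
    """Kendall tau-a (assumed variant): (concordant - discordant) / all pairs."""
    score = 0
    for i, j in combinations(range(len(x)), 2):
        prod = (x[i] - x[j]) * (y[i] - y[j])
        score += (prod > 0) - (prod < 0)  # +1 concordant, -1 discordant, 0 tie
    n = len(x)
    return score / (n * (n - 1) / 2)


def mixed_correlation_matrix(columns):
    """Kendall tau above the diagonal, Pearson r below it, None on the diagonal."""
    names = list(columns)
    matrix = {row: {col: None for col in names} for row in names}
    for i, row in enumerate(names):
        for j, col in enumerate(names):
            if i < j:
                matrix[row][col] = kendall_tau(columns[row], columns[col])
            elif i > j:
                matrix[row][col] = pearson_r(columns[row], columns[col])
    return matrix


# Made-up toy values for three order parameters; NOT data from the thesis.
toy = {
    "N": [50, 60, 80, 100, 120],
    "RMSD": [5.0, 6.5, 7.0, 9.5, 12.0],
    "RCO": [0.20, 0.12, 0.18, 0.10, 0.15],
}
table = mixed_correlation_matrix(toy)
for row_name, row in table.items():
    cells = ["---" if v is None else f"{v:.3f}" for v in row.values()]
    print(row_name, *cells)
```

Reading the result the same way as the tables, `table["N"]["RMSD"]` lies above the diagonal and is a Kendall coefficient, while `table["RMSD"]["N"]` lies below it and is a Pearson coefficient for the same pair.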
Item Metadata
Title | Generalized distance and applications in protein folding |
Creator |
Mohazab, Ali Reza |
Publisher | University of British Columbia |
Date Issued | 2013 |
Description | The Euclidean distance, D, between two points is generalized to the distance between strings or polymers. The problem is of great mathematical beauty and very rich in structure even for the simplest of cases. The necessary and sufficient conditions for finding minimal distance transformations are presented. Locally minimal solutions for one-link and two-link chains are discussed, and the large N limit of a polymer is studied. Applications of D to protein folding and structural alignment are explored, in particular for finding minimal folding pathways. Non-crossing constraints and the resulting untangling moves in folding pathways are discussed as well. It is observed that, compared to the total distance, these extra untangling moves constitute a small fraction of the total movement. The resulting extra distance from untangling movements (Dnx) is used to distinguish different protein classes, e.g. knotted proteins from unknotted proteins. By studying the ensembles of untangling moves, dominant folding pathways are constructed for three proteins, in particular a knotted protein. Finally, applications of D and related metrics to protein folding rate prediction are discussed. It is seen that distance metrics are good at predicting the folding rates of 3-state folders. |
Genre |
Thesis/Dissertation |
Type |
Text |
Language | eng |
Date Available | 2013-01-09 |
Provider | Vancouver : University of British Columbia Library |
Rights | Attribution-NonCommercial-NoDerivatives 4.0 International |
DOI | 10.14288/1.0073509 |
URI | http://hdl.handle.net/2429/43831 |
Degree |
Doctor of Philosophy - PhD |
Program |
Physics |
Affiliation |
Science, Faculty of; Physics and Astronomy, Department of |
Degree Grantor | University of British Columbia |
GraduationDate | 2013-05 |
Campus |
UBCV |
Scholarly Level | Graduate |
Rights URI | http://creativecommons.org/licenses/by-nc-nd/4.0/ |
AggregatedSourceRepository | DSpace |