Generalized distance and applications in protein folding

by

Ali Reza Mohazab

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY
in
The Faculty of Graduate Studies (Physics)

THE UNIVERSITY OF BRITISH COLUMBIA (Vancouver)
January 2013
© Ali Reza Mohazab 2012

Abstract

The Euclidean distance, D, between two points is generalized to the distance between strings or polymers. The problem is of great mathematical beauty and very rich in structure even for the simplest of cases. The necessary and sufficient conditions for finding minimal distance transformations are presented. Locally minimal solutions for one-link and two-link chains are discussed, and the large N limit of a polymer is studied.

Applications of D to protein folding and structural alignment are explored, in particular for finding minimal folding pathways. Non-crossing constraints and the resulting untangling moves in folding pathways are discussed as well. It is observed that these extra untangling moves constitute a small fraction of the total movement. The resulting extra distance from untangling movements (Dnx) is used to distinguish different protein classes, e.g. knotted proteins from unknotted proteins. By studying the ensembles of untangling moves, dominant folding pathways are constructed for three proteins, in particular a knotted protein.

Finally, applications of D and related metrics to protein folding rate prediction are discussed. It is seen that distance metrics are good at predicting the folding rates of 3-state folders.

Preface

The content of this thesis relies on five manuscripts, to each of which a chapter is dedicated. A background chapter and a conclusion are appended to those five chapters. Three of the five aforementioned manuscripts are already published and the remaining two will be published shortly.
I am the first author of all of those manuscripts and conducted the bulk of the research in each of them, with guidance from my supervisor Dr. Steven S. Plotkin (SSP). All the figures and tables were generated by me (ARM) except as noted below.

Chapter 2 is based on (Mohazab, AR and SS Plotkin. JPhys.CM. 2008) [89] (see footnote 1). The research was conducted by ARM and the text of the paper was written by SSP. All the figures, tables and results were generated by ARM except figures 2.14 and 2.15, which were generated by SSP. The material pertaining to these two figures was also largely developed by SSP.

Footnote 1: Some of the material of section 2.5 is taken from the introduction section of [91].

Chapter 3 is based on (Mohazab, AR and SS Plotkin. Biophys.J. 2008) [90]. The research was conducted by ARM and the text was written by SSP. All the figures and tables were generated by ARM, except figures 3.1, 3.2, 3.3a, 3.6a, and E.1, which were generated by SSP. Figure 3.3b was generated jointly.

Chapter 4 is based on (Mohazab, AR and SS Plotkin IJQC 2009) [91]. The research was conducted by ARM and the methods section of the paper was written by ARM as well. The introduction of the paper was written by SSP and the conclusion was written jointly. All the figures and tables were generated by ARM. SSP was responsible for the final editing.

Chapter 5 is based on (Mohazab AR, and SS Plotkin, PLoS Comput. Biol. 2012) [87]. The research was conducted by ARM. The methods section was written by ARM as well. The introduction of the paper was written by SSP, and the conclusion and results sections, in their final form, were written by SSP as well. All the figures and tables were generated by ARM. SSP was responsible for the final editing.

Chapter 6 is based on (Mohazab, AR and SS Plotkin, unpublished, 2012) [88]. The research was conducted by ARM. The methods and results sections and most of the conclusion section were written by ARM as well.
SSP wrote the introduction, edited the sections written by ARM, and elaborated on the conclusion. SSP also suggested that additional material be added to the paper, inspired by work done in [36]. The work for this additional material would be conducted by a third researcher, Atanu Das (AD), who would be the second author of the paper. None of AD’s contributions are reflected in this thesis. All the tables and material presented in chapter 6 are the sole work of ARM.

The appendices of this thesis are derived from the appendices and the supplementary material of the aforementioned papers, and each is referred to within the body of the relevant chapter. SSP is responsible for the text of Appendix A; the material in A.3 is also entirely his work. Appendices B, C, and D are the exclusive work of ARM. The material in Appendix E is the joint work of ARM and SSP, with text written by SSP. Figures E.1 and E.2a were also generated by SSP. Figures E.2b and E.2c were generated by ARM. Appendix F is the exclusive work of ARM.

Table of Contents

Abstract
Preface
Table of Contents
List of Tables
List of Figures
Acknowledgments
Dedication

1 Outline and background
  1.1 Protein
  1.2 Protein folding
  1.3 Order parameters in protein folding

2 Minimal distance transformations between links and polymers: principles and examples
  2.1 Introduction
  2.2 Distance for polymers or strings
    2.2.1 Discrete chains
    2.2.2 General variation of the distance functional
    2.2.3 Conditions for an extremum
  2.3 Single links
    2.3.1 Straight line transformations
    2.3.2 Piece-wise extremal transformations: transformations with rotations
    2.3.3 Systematically exploring transformations by varying link positions
  2.4 2-link chains
    2.4.1 Transformations involving a change in convexity
    2.4.2 Transformations with initial and final states in 3-D
  2.5 Limit of large link number
    2.5.1 MRSD as a metric for protein folding
  2.6 Conclusions

3 Minimal folding pathways for coarse-grained biopolymer fragments
  3.1 Introduction
  3.2 Methods
    3.2.1 Representative protein fragments
    3.2.2 Construction of minimal pathways
    3.2.3 RMSD and MRSD
  3.3 Results
    3.3.1 β-hairpin
    3.3.2 α-helix
    3.3.3 Crossover structure
  3.4 Discussion and conclusion

4 Structural alignment using the generalized Euclidean distance between conformations
  4.1 Introduction
  4.2 Method and results
  4.3 Conclusion and discussion

5 Polymer uncrossing and knotting in protein folding, and their role in minimal folding pathways
  5.1 Introduction
  5.2 Methods
    5.2.1 Calculation of the transformation distance
    5.2.2 Generating unfolded ensembles
    5.2.3 Proteins used
    5.2.4 Calculating distance metrics for the unfolded ensemble
  5.3 Results
    5.3.1 Quantifying minimal folding pathways
    5.3.2 Topological constraints induce folding pathways
  5.4 Conclusion and discussion

6 The role of polymer non-crossing and geometrical distance in protein folding kinetics
  6.1 Introduction
  6.2 Methods
    6.2.1 Proteins used with rate
  6.3 Results
  6.4 Conclusion and discussion

7 Conclusion and further thoughts

Bibliography

Appendices

A Sufficient conditions for an extremum to be a minimum
  A.1 Distance between points
  A.2 Geodesics on the surface of a sphere
  A.3 Harmonic oscillator
B Necessary conditions for straight line transformations
C Critical angles
D Minimal transformations in 2 dimensions
E Extremal trajectories of beads or links subject to steric excluded volume
  E.1 Point particle
  E.2 One link
F Cross correlation of order parameters

List of Tables

3.1 Values of the distance for various protein backbone fragments, as compared to other metrics
4.1 D/N (in units of link length squared) between the aligned structures in figure 4.1
4.2 MRSD (in units of link length) between the aligned structures in figure 4.1
5.1 Proteins analyzed
5.2 Order parameters for various classifications of proteins
6.1 Two-state proteins: correlation between folding rate and various order parameters indicated
6.2 Three-state proteins: correlation between folding rate and various order parameters indicated
6.3 α-helix dominated proteins that are 2-state folders: correlation between folding rate and various order parameters indicated
6.4 α-helix dominated proteins (both 2- and 3-state): correlation between folding rate and various order parameters indicated
6.5 β-sheet dominated proteins that are 2-state folders: correlation with various order parameters indicated
6.6 β-sheet dominated proteins (both 2- and 3-state): correlation with various order parameters indicated
6.7 Mixed secondary structure proteins that are 2-state folders: correlation with various order parameters indicated
6.8 Mixed secondary structure proteins (both 2- and 3-state): correlation with various order parameters indicated
6.9 Correlation of folding rate of all the studied proteins, for which folding rates were available, with various order parameters
6.10 Best rate predictors for different classes of proteins, based on Kendall and Pearson correlations
F.1 Two-state proteins: correlation between various order parameters
F.2 Three-state proteins: correlation between various order parameters
F.3 α-helix dominated proteins (both 2- and 3-state): correlation between various order parameters
F.4 β-sheet dominated proteins (both 2- and 3-state): correlation between various order parameters
F.5 Mixed secondary structure proteins: correlation between various parameters
F.6 Unknotted proteins: correlation between various order parameters
F.7 Knotted proteins: correlation between various order parameters
F.8 All proteins: correlation between various order parameters

List of Figures

1.1 Schematic representation of a generic amino acid
1.2 Graphical representation of proteins
1.3 Funnel energy landscape
1.4 Q and RMSD
2.1 Distance between two points
2.2 Distance between two curves
2.3 Curve discretization
2.4 Broken extremal
2.5 Possible and impossible straight line transformations
2.6 Bowtie transformation
2.7 Link broken extremal
2.8 Successive transformations between two links through rotation
2.9 Successive transformations between two links through translation
2.10 Two-link transformations example
2.11 Non-degenerate 2-link transformations
2.12 A transformation between two states of opposite convexity
2.13 Sub-minimal and minimal transformations in a sample 2-link system
2.14 Two-link transformations in 3D
2.15 Examples of transformations between initial and final states of opposite convexity, for increasing numbers of links
2.16 MRSD explained
2.17 MRSD and RMSD in non-crossing constraints
2.18 Free energy surfaces for MRSD and Q
3.1 Residues 99–153 in regulatory chain B of Aspartate Carbamoyltransferase
3.2 β-hairpin fragment and the initial state
3.3 Overpass/underpass fragment
3.4 Illustration of the general recipe for obtaining minimal pathways
3.5 Minimal transformations to the β-hairpin
3.6 α-helix and its minimal pathway
3.7 Various steps in a minimal pathway obeying non-crossing
4.1 Alignments with different cost functions
4.2 Scale invariant distance resulting from different alignments with different cost functions
5.1 Transformation of a simple conformation with link size change shown
5.2 Scatter plot for link length deviation
5.3 Crossing detection using projections
5.4 Two possible untangling transformations
5.5 Minimal untangling using knowledge of future crossings
5.6 Snapshots of a transformation with two crossings
5.7 Leg substructure
5.8 Crossing substructures
5.9 The three types of Reidemeister moves. As can be seen, Reidemeister move type III does not reverse the nature of any of the crossings.
5.10 Schematic illustration of the canonical leg movement
5.11 A single leg movement to undo several crossings
5.12 Topological loop twist
5.13 Schematic of the canonical elbow move
5.14 Various crossing substructures in a simple example
5.15 An example (subset) tree of possible transformations for a given crossing structure
5.16 Clustering of proteins depending on order parameter
5.17 Statistical significance for all order parameters in distinguishing between different classes of proteins
5.18 Renderings of the three proteins whose minimal transformations we investigate in detail
5.19 Bar plots for the noncrossing operations involved in minimal transformations, for the α protein 2ABD
5.20 Bar plots of the noncrossing operations for the β-sheet protein 1PKS
5.21 Bar plots of the noncrossing operations for the knotted protein 3MLG
5.22 Consensus histograms of the transformations described in Figures 5.19-5.21
5.23 Schematic of the most representative transformation for the α protein 2ABD
5.24 Schematic of the most representative transformation for the knotted protein 3MLG
5.25 Schematic diagram for the residues involved in noncrossing operations for two minimal transformations α and β, and the sequence overlap of moves
5.26 Pathway overlap (Qαβ) distributions for 3 proteins
6.1 Correlation between folding rate and RMSD for three-state folders
6.2 Absolute value of Kendall correlation of a few order parameters and rate, across different classes of proteins
B.1 A link in 3D space
C.1 Transformation in which both ends stay on a linear track
C.2 Geometric proof for critical angle condition
C.3 A minimal transformation in s(θ) parametrization
D.1 Hyper extended solution vs a more general compound straight-line transformation
D.2 Optimal compound straight line transformation
D.3 An optimal compound straight-line solution for 2 links
D.4 Minimal transformation restricted to 2 dimensions, for 2 links of opposite convexity which form opposite sides of a square
E.1 Inequality constraints
E.2 Extremal trajectories and inequality constraints

Acknowledgments

I would like to thank the former and current members of Plotkin’s research group for helpful discussions and Dr. Plotkin for supervising my research. I would also like to thank the members of my committee, Dr. G. Patey, Dr. C. Hansen and Dr. J. Rottler.

Dedication

This thesis is dedicated to my family.

Chapter 1
Outline and background

This thesis is about a mathematical construct known as generalized distance (D) and some of its applications in protein science. The central idea underlying D is to extend the conventional distance between two points to the distance between two extended objects (polymers). The problem by itself is of great mathematical beauty and, even in its simplest case (finding the minimal distance between two links), is incredibly rich in structure. Therefore we spend the greater part of chapter 2 discussing the problem from a purely mathematical point of view. At the end of chapter 2 we propose D as an order parameter for studying protein folding.

In the subsequent chapters we explore the applications of D in various areas of protein science. In particular, in chapter 3, we will see how D can be used to construct folding pathways for protein fragments, such as α-helices and β-hairpins, and how non-crossing constraints affect these folding pathways. Construction of geometric pathways for folding has long been of interest.
In our case, one benefit of studying folding pathways for fragments is the possibility of constructing mathematically exact solutions.

As expected, we find that the alignment of the initial and final conformations affects the pathway and the resulting distance between the initial and final conformations. Therefore, in chapter 4, we study the differences in optimally aligned structures that result from using different alignment cost functions. We will see that a simple and computationally inexpensive approximation to D, called Mean Root Squared Distance or MRSD, is adequate for nearly optimal alignment of sufficiently long chains.

In chapter 5, we focus on minimal distances between full-length proteins. This problem is much more difficult to solve analytically, hence we develop an algorithmic method for constructing an approximate minimal solution. An important aspect to consider in constructing geometrical protein-folding pathways is the non-crossing constraints and the resulting untangling moves. We develop methods that approximately capture the minimal untangling moves required when folding a protein. We apply these methods to more than forty proteins and explore the importance of non-crossing constraints in distinguishing different protein structures. The results from our analysis concerning the dominant folding pathway of a knotted protein are potentially of the greatest interest for applications. Another important result from this chapter is that the contribution of untangling distance is generally very small compared to the total distance that the protein has to travel.

Figure 1.1: Schematic representation of a generic amino acid.

In chapter 6, we apply the formalism developed in chapter 5 to protein kinetics and explore how different metrics correlate with the protein folding rate.
One important result that emerges from this analysis is the relative success of distance-like metrics in predicting the folding rate of three-state proteins, which tend to fold through an intermediate state.

The distance D will be introduced and studied extensively in the next chapter. However, understanding the applications of D in protein folding requires familiarity with some of the fundamental concepts in protein science and protein folding. These concepts are addressed in the remainder of this chapter.

1.1 Protein

Proteins are macromolecules that perform a vast array of biological functions in living organisms. They are made of smaller constituents called amino acids. Amino acids are a group of biological molecules that are composed of a central carbon atom called Cα, a hydrogen atom, a carboxyl group (−COOH), an amine group (−NH2) and a side chain (−R) that is specific to each amino acid; see figure 1.1. There are about one hundred amino acids found in nature, twenty of which are used as building blocks of proteins [14].

Two amino acids can be linked together by forming a special covalent bond called a peptide bond. A peptide bond results from the reaction of the carboxyl group of one amino acid with the amine group of the adjacent amino acid. Through the repetition of this mechanism several amino acid molecules can be linked together to form a long chain of residues connected by peptide bonds, called a polypeptide. From an all-atom perspective a polypeptide comprises a backbone held together by peptide bonds, together with the side chains of its constituent amino acids. A protein is made of one or more polypeptide chains [14].

By construction, any polypeptide has one free amine group at one end and one free carboxyl group at the other end. The end that is characterized by the free amine group is called the N-terminus, and the end characterized by the free carboxyl group is called the C-terminus.
When amino acids are part of a polypeptide, they are called residues, and are numbered from 1 to N, counting from the N-terminus to the C-terminus by convention.

Each protein has a unique sequence of amino acids. The sequence is encoded in the gene that is responsible for the synthesis of the protein inside the cell. Shortly after its synthesis, the protein (amino acid chain) is generally disordered, has high entropy, and lacks a specific structure. Through a complex process of interaction between the constituent amino acids, aided by the surrounding environment, the protein spontaneously “folds” into a well-defined 3D structural ensemble that is specific to that protein. Folding may start concurrently with synthesis, but current data on folding rates in comparison to translation rates indicate that most proteins tend to fold only after complete translation [95]. Sometimes helper proteins called chaperones kinetically proofread the folding process by kicking proteins out of local-minimum traps [55].

The well-defined final structure is called the native structure or the native conformation. The 3D shape of the protein is crucial for its biological function. In fact protein “misfolding”, meaning folding to an incorrect final conformation, is involved in many degenerative diseases, such as Creutzfeldt-Jakob disease (the human form of mad cow disease), Alzheimer’s disease, Huntington’s disease and Parkinson’s disease [70].

The native structure of a given protein can be determined experimentally using a variety of techniques, most importantly X-ray crystallography and NMR spectroscopy. Native structure coordinates, once determined, are usually stored in plain-text digital files and deposited in the Protein Data Bank (PDB: www.wwpdb.org), available to researchers on the Internet. Each determined native conformation has a unique 4-character alphanumeric identifier. For example the three-dimensional structure of the protein acylphosphatase is available as 1aps.pdb.
At the time of writing this dissertation, about eighty thousand protein structures have been determined experimentally.

Four levels of protein structure can generally be identified: primary, secondary, tertiary and quaternary structure. The primary structure simply corresponds to the amino-acid sequence of the protein. The secondary structure arises from hydrogen bonds between protein residues: during the process of folding, various segments of the protein chain form highly regular substructures called secondary structures. There are two common types of secondary structure, the alpha-helix and the beta-strand or beta-sheet. The tertiary structure is equivalent to the native structure of a single protein chain and entails the relative positioning of the secondary substructures. The quaternary structure is the assembly of several polypeptide chains into a larger structure. In this thesis we are only concerned with the first three levels of structure.

Figure 1.2: Graphical representation of the protein acylphosphatase (1aps.pdb). (a) All-atom stick model, where green corresponds to carbon, white to hydrogen, blue to nitrogen, and red to oxygen. (b) Backbone representation with the secondary structures emphasized. Red corresponds to the alpha-helix secondary structure and yellow to the beta strands. (c) The solvent-accessible surface representation, with the same color scheme as in sub-figure (a). All figures were generated using PyMOL.

Protein structures are graphically represented in several ways, which fall into three rough categories: all-atom representation, backbone representation (usually with the two types of secondary structure rendered differently, see figure 1.2) and solvent-accessible surface representation. The backbone representation can be “coarse-grained” further by representing each amino acid by its Cα atom alone, a process known as Cα coarse-graining.
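As a concrete illustration of Cα coarse-graining, the following minimal sketch extracts the Cα trace from the text of a PDB file. It is not the code used in this thesis; the function name `parse_ca_trace` and its `chain_id` parameter are hypothetical, and the sketch assumes the standard fixed-column layout of PDB ATOM records, keeps only the first model, and keeps only the first alternate location of each residue.

```python
def parse_ca_trace(pdb_text, chain_id=None):
    """Return [(residue_number, residue_name, (x, y, z)), ...] for C-alpha atoms.

    Assumes the fixed-column PDB format. Only ATOM records are read, so
    calcium ions (HETATM records that also use the name "CA") are skipped.
    """
    trace = []
    seen = set()
    for line in pdb_text.splitlines():
        if line.startswith("ENDMDL"):
            break  # multi-model (e.g. NMR) file: keep the first model only
        if not line.startswith("ATOM"):
            continue
        if line[12:16].strip() != "CA":
            continue  # columns 13-16 hold the atom name
        if line[16] not in (" ", "A"):
            continue  # column 17: keep the first alternate location only
        chain = line[21]  # column 22: chain identifier
        if chain_id is not None and chain != chain_id:
            continue
        res_seq = int(line[22:26])  # columns 23-26: residue number
        key = (chain, res_seq, line[26])  # column 27: insertion code
        if key in seen:
            continue
        seen.add(key)
        xyz = (float(line[30:38]), float(line[38:46]), float(line[46:54]))
        trace.append((res_seq, line[17:20].strip(), xyz))
    return trace
```

Given the contents of a file such as 1aps.pdb read into a string, `parse_ca_trace(text)` would return one entry per residue, so a chain of N residues is reduced to N points joined by N − 1 links.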
In this thesis we work primarily with Cα coarse-grained structures.

1.2 Protein folding

The process in which the unstructured coil of polypeptide transforms to the native structure is called protein folding. From an energetic point of view, protein folding is a thermodynamic process, in which the system equilibrates to its minimal free-energy state. Under normal conditions, the minimal free-energy state is the native conformation. Anfinsen’s dogma, named after Christian B. Anfinsen, who won the Nobel prize for the discovery, states that under normal conditions the native structure of a protein is uniquely determined by its amino acid sequence [4].

The time scale of the folding process varies drastically across different proteins. The folding rate kf (in units of s^-1) covers a range of several orders of magnitude. Engrailed homeodomain (PDB: 1ENH, with a length of 54 residues) has log kf = 10.53, whereas a knotted protein such as 2ouf-x2 (PDB: 3MLG, with a length of 169 residues) has log kf = −6.91.

Currently it is not possible to capture the full dynamical mechanism of the complete folding process experimentally. Instead, a number of indirect experimental methods are used to gain insight. For example, single or multiple residues are mutated and the resulting changes in folding kinetics and native structure are studied. This method is known as φ value analysis [101]. Computer simulations (known as in silico methods) have been invaluable tools as well. However, brute-force all-atom simulations are at this stage too computationally intensive to capture the process on a long enough time scale at the high-throughput level.

The current prominent theoretical paradigm is that the folding process is a diffusion in protein conformation space on a funnel-like energy landscape [99]; see figure 1.3.
Any intermediate conformation has a transition probability to adjacent conformations, with the funnel energy gradient driving the overall diffusion towards the native structure. During the folding process, the protein chain loses overall conformational entropy (as the native state is well-structured) but loses internal energy to a greater extent, and hence the total Gibbs free energy goes down by the end of the process [99]. The funnel shape of the energy landscape (on a rough scale) ensures that, as the internal free energy of the system decreases, the intermediate conformations become more and more native-like.

The shape of the energy landscape is a result of evolution, through which the energetic frustrations of the system have been minimized or ameliorated. Protein sequences have been selected to yield funnel-shaped energy landscapes. A random sequence of amino acids has only a small probability of having a funnel-like energy landscape [99].

Figure 1.3: Funnel energy landscape. Image adapted from [99]. (Labels in the figure: configurational entropy, energy, degree of nativeness, Route 1, Route 2; energy scales 100 kT and 2-3 kT.)

It is deduced from this model that there is not a single fixed route from an unfolded conformation to the native structure. The complete folding trajectories on the funnel can vary and can start at different points, but they all converge to the same point: the native conformation.

The detailed mechanism of folding is governed by a smaller energy scale, in which the internal free energy loss is compensated by conformational entropy loss. The total internal energy loss of the native state is of order 100 kBT, whereas free energy barriers (when loss of conformational entropy is accounted for) are of order 10 kBT. However, protein folding, as we will see, is not a purely energetically driven process. Topology plays an important role as well [99].
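The free-energy bookkeeping described above can be summarized in one line. The following is only a schematic restatement, assuming constant temperature T; the symbols ΔE and ΔS (the changes in internal energy and conformational entropy upon folding) are introduced here for illustration and are not notation from the text.

```latex
\Delta G_{\text{fold}} = \Delta E - T\,\Delta S,
\qquad \Delta E < 0, \quad \Delta S < 0
\;\;\Longrightarrow\;\; -T\,\Delta S > 0,
\qquad
\Delta G_{\text{fold}} < 0 \iff |\Delta E| > T\,|\Delta S| .
```

In this notation the energy loss |ΔE| is of order 100 kBT, while the barriers of order 10 kBT reflect how incompletely the energetic gain and the entropic cost cancel along the way.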
A folding protein undergoes a complex interplay of energetics and entropics as it navigates through its accessible phase space; however, the resulting kinetics are often simple [11, 13, 112, 113]: many proteins fold across a single free energy barrier in a 2-state-like fashion. The kinetics of many other proteins are only marginally more complex, folding by a 3-state mechanism.

Central to the study of protein folding are the ideas of commitment probability and reaction coordinates. A reaction coordinate is a one-dimensional coordinate (essentially a number) that captures the progress along a reaction pathway. The commitment probability is the probability that the state diffuses in conformation space to the final state before reaching the initial state [96]. One of the conceptual refinements to arise from theoretical and simulation studies is the study of "good" reaction coordinates that correlate with the commitment probability to complete a reaction such as the folding reaction [8, 9, 28, 33, 58, 85, 130]. Reaction coordinates must generally take into account the energy surface on which the molecule of interest is undergoing conformational diffusion [10, 39, 138], and the Markovian or non-Markovian nature of the diffusion [59, 114].

1.3 Order parameters in protein folding

In condensed matter systems, useful order parameters have historically had intuitive geometrical interpretations. Their definition did not require the knowledge of a particular Hamiltonian (although their temperature-dependence and time-evolution were affected by the energy function in the system). In chemical reactions, the distance between constituents in reactant and product has played a ubiquitous role in the construction of potential energy surfaces [77]. In protein folding, order parameters are generally used to compare structures, not always to look at phase transitions.
The study of various order parameters that might best represent progress in the folding reaction has generated much interest [8, 17, 18, 21, 26, 32, 44, 57, 73, 81, 114], with questions focusing on what parameter(s) or principal-component-like motions might best correlate with splitting probability, or probability of folding before unfolding. On the other hand, analyses using intuitive geometric order parameters have been developed to understand folding and are now commonly used. These include the fraction of native contacts Q [8, 21, 64, 94, 111], which can be locally or globally defined, the root mean square distance or deviation (RMSD) between structures [45, 56, 124], the structural overlap parameter χ [19], Debye-Waller factors [117, 118]², and the fraction of correct dihedral angles [64].

Finding a simple geometrical order parameter that quantifies progress to the folded structure poses several challenges. These include an accurate account of the effects of polymer non-crossing [90], energetic and entropic heterogeneity in native driving forces (which will induce bottlenecks in folding pathways) [78, 110, 111], as well as non-native frustration and trapping [23, 108, 115]. Fortunately, it has been borne out experimentally that wild type proteins are sufficiently minimally frustrated that non-native interactions do not play a strong role in either folding rate or mechanism, and native-structure-based models for folding rates and mechanisms have enjoyed considerable success [2, 5, 41].

² The so-called B-factors or temperature factors indicate the relative vibrational motion. The lower the factor, the more ordered the structure is. In PDB files, each atom of the native state has an associated B-factor. Thus a Debye-Waller factor can be used as an order parameter to determine how structured a partially-unfolded ensemble is, and where it tends to be unstructured.
Many of these reaction coordinates have been used to describe the folding process despite being flawed in principle. These characterizations have been largely successful because the majority of conformations during folding are well-characterized by changes in these parameters: proteins undergo some collapse concurrently with folding, lower their internal energy, and adopt structures geometrically similar to the native structure. Below we describe two of the most common order parameters used in the discipline. In subsequent chapters we expand on this topic.

Fraction of native contacts, or Q, is an order parameter that is commonly used as a measure of native-proximity. Given two structures denoted by A and B, $Q_{AB} \equiv \left(\sum_{i<j} \Delta^A_{ij}\,\Delta^B_{ij}\right) / \left(\sum_{i<j} \Delta^B_{ij}\right)$. Here, $\Delta^A_{ij}$ is equal to unity if residues i and j are in contact in structure A (a concept we describe shortly); otherwise it is zero. If two non-hydrogen atoms of a pair of non-neighboring residues are within a prescribed cut-off distance (usually 4.9 Å), then the two residues are considered to be in contact. The contacts present in the native structure of the protein are called the native contacts, the total number of them being $\sum_{i<j} \Delta^B_{ij}$. For any arbitrary conformation of the polypeptide chain, a fraction of the contacts that were present in the native structure are present. Of course, other contacts may exist that are not present in the native structure; these are called the non-native contacts. The order parameter Q of a conformation is the fraction of native contacts that are present in that conformation.

RMSD, or Root Mean Squared Deviation, is another commonly used order parameter. $\mathrm{RMSD} \equiv \sqrt{N^{-1} \sum_{i=1}^{N} (\mathbf{r}_{Ai} - \mathbf{r}_{Bi})^2}$ is a least-squares measure of similarity between structures A and B. Typically, this quantity is minimized given two structures, and so can be thought of as a "least squares fit". The sum may be over all atoms, or simply over the Cα atoms of the residues in coarse-grained models.
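The two definitions above can be illustrated with a minimal Python sketch. It makes several simplifying assumptions that are not part of the text: contacts are evaluated between single coarse-grained beads (e.g. Cα positions) rather than between all non-hydrogen atoms, the cutoff and the minimum sequence separation `min_sep` are free parameters chosen by the user, and the RMSD is computed without the superposition step that minimizes it. The function names are hypothetical.

```python
import math

def contact_map(coords, cutoff, min_sep=3):
    """Set of bead pairs (i, j), j >= i + min_sep, separated by at most `cutoff`.

    Assumption: one bead per residue; `min_sep` excludes near neighbors
    along the chain and is an illustrative choice.
    """
    pairs = set()
    n = len(coords)
    for i in range(n):
        for j in range(i + min_sep, n):
            if math.dist(coords[i], coords[j]) <= cutoff:
                pairs.add((i, j))
    return pairs

def fraction_native_contacts(conf, native, cutoff, min_sep=3):
    """Q: fraction of the native contacts that are also present in `conf`."""
    native_pairs = contact_map(native, cutoff, min_sep)
    if not native_pairs:
        raise ValueError("native structure has no contacts at this cutoff")
    conf_pairs = contact_map(conf, cutoff, min_sep)
    return len(native_pairs & conf_pairs) / len(native_pairs)

def rmsd_no_fit(a, b):
    """RMSD between two equal-length bead lists, WITHOUT optimal superposition."""
    if len(a) != len(b):
        raise ValueError("structures must have the same number of beads")
    return math.sqrt(sum(math.dist(p, q) ** 2 for p, q in zip(a, b)) / len(a))
```

For a toy four-bead hairpin and its fully extended counterpart, `fraction_native_contacts` distinguishes the two through the single end-to-end contact, while `rmsd_no_fit` reports how far the beads are displaced on average in the least-squares sense.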
Figure 1.4 shows two structures A and B with different measures of structural similarity to a "native" hairpin fragment N. These structures have different measures of proximity depending on the coordinate used to characterize them. If we use the fraction of native contacts, Q, to describe native proximity, structure A has QA = 1/3 while QB = 0, so by this measure A is more native-like. If we use the root mean square deviation, RMSD, structure B is more native-like than A. Moreover, structure B would have a higher probability of folding before unfolding than A, i.e., it has a larger value of pFOLD [33], and so is closer kinetically to the native structure. The longer the hairpin, the more likely a slightly expanded structure is to fold, so the discrepancy between Q and RMSD for these pairs of structures becomes even larger.

Figure 1.4: Order parameters do not always correlate with kinetic proximity. Structure A above is more native-like according to the fraction of native contacts, while structure B is more native-like according to RMSD, and is also closer kinetically to the native structure. Image adapted from [91].

Chapter 2

Minimal distance transformations between links and polymers: principles and examples

In this chapter, the concept and calculation of generalized distance are introduced. We generalize the calculation of the Euclidean distance between points to that between one-dimensional objects, such as strings or polymers. Then, we derive the necessary and sufficient conditions for the transformation between two polymer configurations to be minimal. We give numerous examples for the special cases of one and two links, and then investigate the transition to a large number of links, neglecting for the time being curvature and non-crossing constraints. Equipped with this new mathematical tool, we investigate applications of this metric to protein folding, specifically to secondary and tertiary structural fragments.
For most of this chapter, we are interested in the generalized distance (D) purely as an interesting mathematical concept, analyzed using the calculus of variations. Certainly, applications of D need not be restricted to those in protein science. However, a few results from this chapter are applicable to protein folding. In particular, the generalized distance D can be considered as an order parameter for the protein folding process, similar to the use of the root-mean-squared deviation from the native structure (RMSD). In the limit of a large number of chain links, and in the absence of curvature and non-crossing constraints, D is approximately equal to a metric that is comparable to, but different from, RMSD: MRSD (Mean Root Squared Distance). We argue that MRSD is the more physically meaningful of the two, being directly related to how much everything must move in 3D space in order for the protein to fold, while RMSD is the Euclidean distance in the 3N-dimensional conformational space, where N is the number of coarse-grained Cα beads. Lastly, we address the issue that an accurate account of structural proximity should take into account the fact that real protein chains cannot cross themselves; non-crossing constraints therefore need to be included. Neither MRSD nor RMSD addresses this issue, but an accurate calculation of D should.

Figure 2.1: Distance between the two points A and B is the minimum length of the curve connecting the two points.

2.1 Introduction

The distance between two points can be thought of as a minimization problem in the calculus of variations, where we try to minimize an integral of infinitesimal distance segments. For example, in a Euclidean space, in order to find the distance between two points A and B, we have to minimize the following integral:

$$D = \int_{\mathbf{r}_A}^{\mathbf{r}_B} ds = \int_0^T dt\, \sqrt{\dot{\mathbf{r}}^2} \tag{2.1}$$

Here we have let $\dot{\mathbf{r}} \equiv d\mathbf{r}/dt$, and we use notation such that $\dot{\mathbf{r}}^2 \equiv \dot{\mathbf{r}} \cdot \dot{\mathbf{r}}$.
The boundary conditions on the extremal path are $\mathbf{r}^*(0) = \mathbf{r}_A$ and $\mathbf{r}^*(T) = \mathbf{r}_B$. The concept is depicted in figure 2.1. Taking the functional derivative in eq. (2.1) gives Euler-Lagrange (EL) equations for the Lagrangian $L = \sqrt{\dot{\mathbf{r}}^2}$:

$$\frac{d}{dt}\left(\frac{\partial L}{\partial \dot{\mathbf{r}}}\right) = 0 \quad \text{or} \quad \dot{\hat{\mathbf{v}}} = 0 \tag{2.2}$$

with $\hat{\mathbf{v}}$ the unit vector in the direction of the velocity. Since the derivative of a unit vector is always orthogonal to that vector, equation (2.2) says that the direction of the velocity cannot change, and therefore straight line motion results. Applying the boundary conditions gives $\hat{\mathbf{v}} = (\mathbf{r}_B - \mathbf{r}_A)/|\mathbf{r}_B - \mathbf{r}_A|$. However, any function $\mathbf{v}(t) = |v_o(t)|\,\hat{\mathbf{v}}$ satisfying the boundary conditions is a solution, so long as $\int_0^T dt\, |v_o(t)| = |\mathbf{r}_B - \mathbf{r}_A|$. The solution is reparameterization-invariant. Then the extremal path $\mathbf{r}^*(t)$ is given by

$$\mathbf{r}^*(t) = \mathbf{r}_A + \frac{\mathbf{r}_B - \mathbf{r}_A}{|\mathbf{r}_B - \mathbf{r}_A|} \int_0^t dt'\, |v_o(t')| \tag{2.3}$$

and the distance by

$$D^* = \int_0^T dt\, \sqrt{\dot{\mathbf{r}}^{*2}} = \int_0^T dt\, |v_o(t)| = |\mathbf{r}_B - \mathbf{r}_A| \tag{2.4}$$

which represents the diagonal of a hypercube, as expected. At this point we could fix the parameterization by choosing $|v_o(t)| = |\mathbf{r}_B - \mathbf{r}_A|/T$ (constant speed), for example. The extremal transformation (2.3) is also a minimum. In Appendix A we will give the sufficient conditions for an extremum to be a (local) minimum, where we will return to this example.

Figure 2.2: The distance DAB is the accumulation of how much everything moves.

The above idea can be generalized to space curves, surfaces, or higher dimensional manifolds [109]. The distance is defined through the transformation between the objects that minimizes the cumulative amount of arc-length traveled by all parts of the manifold; see figure 2.2. The shortest distance from A to B is purely a geometry problem, but by choosing an artificial "time" parameter t, we express the problem as a dynamic variational problem [109].
The motivation for this is to avoid complications that might arise when one specific coordinate is no longer a single-valued function of the others.

2.2 Distance for polymers or strings

Describing the transformation $\mathbf{r}(s,t)$ between two space curves $\mathbf{r}_A(s)$ and $\mathbf{r}_B(s)$ requires two scalar parameters: s, the arc-length along the space curve, and t, the "time", as in the above zero-dimensional case, measuring progress during the transformation. The boundary conditions are then $\mathbf{r}(s,0) = \mathbf{r}_A(s)$ and $\mathbf{r}(s,T) = \mathbf{r}_B(s)$. The minimal transformation $\mathbf{r}^*(s,t)$ is an object of dimension one higher than A or B, i.e., it yields a distance that is two-dimensional. The distance is $D^* = D[\mathbf{r}^*(s,t)]$, where the functional $D[\mathbf{r}]$ is given by

$$D[\mathbf{r}] = \int_0^L ds \int_0^T dt\, \sqrt{\dot{\mathbf{r}}^2}. \tag{2.5}$$

Here we have used the shorthand $\mathbf{r} \equiv \mathbf{r}(s,t) = (x(s,t), y(s,t), z(s,t))$ (a 3-vector), and $\dot{\mathbf{r}} \equiv \partial\mathbf{r}/\partial t$. It has been shown previously that the problem of distance does not map to a simple soap film, nor to the minimal area of a world-sheet (which corresponds to the action of a classical relativistic string) [109].

Formulated as above, the string can contract and expand arbitrarily in order to minimize the distance traveled. The transforming object is akin to a rubber band, and all points on $\mathbf{r}_A(s)$ will move in straight lines to their partner points on $\mathbf{r}_B(s)$ to minimize the distance. It is worth mentioning that protein chains, for example, only change their length by about one percent at biological temperatures. To accurately represent the transformation of a non-extensible string, a Lagrange multiplier $\lambda(s,t)$ must be introduced into the effective Lagrangian, weighting the constraint:

$$\sqrt{\mathbf{r}'^2} = 1, \tag{2.6}$$

where $\mathbf{r}' \equiv \partial\mathbf{r}/\partial s$. Under this constraint, points along the string can no longer move independently of each other, but must always be a fixed (infinitesimal) distance apart.
The tangent vector $\hat{\mathbf{t}} = \mathbf{r}'$ is now a unit vector, and the total length L of the string is $L = \int_0^L ds\, \sqrt{\mathbf{r}'^2} = \int_0^L ds$.

Consider the minimal distance transformation between two configurations $\mathbf{r}_A(s)$ and $\mathbf{r}_B(s)$ of an ideal polymer of length L. Let us derive the Euler-Lagrange (EL) equations for this case. From equations (2.5) and (2.6), the effective action is

$$D = \int_0^L ds \int_0^T dt\, \mathcal{L}(\dot{\mathbf{r}}, \mathbf{r}') \tag{2.7a}$$

where

$$\mathcal{L} = \sqrt{\dot{\mathbf{r}}^2} - \lambda\left(\sqrt{\mathbf{r}'^2} - 1\right) \tag{2.7b}$$

and the Lagrange multiplier $\lambda \equiv \lambda(s,t)$ is a function of both s and t. The extrema of the distance functional D in (2.7a) are found from $\delta D = 0$. Taking the functional derivative gives EL equations [109]:

$$\dot{\hat{\mathbf{v}}} = \lambda\,\boldsymbol{\kappa} + \lambda'\,\hat{\mathbf{t}}. \tag{2.8}$$

where $\hat{\mathbf{v}}$ is the unit velocity vector, $\hat{\mathbf{t}}$ is the unit tangent vector, and $\boldsymbol{\kappa}$ is the curvature vector. In eq. (2.8), we see explicitly that if the non-extensibility constraint is removed or, equivalently, if $\lambda = 0$, all points on $\mathbf{r}_A(s)$ move in straight lines to $\mathbf{r}_B(s)$.

Figure 2.3: Continuum (a) and discretized (b) polymer chain. The EL equation for the continuum polymer is a nonlinear (vector) PDE, while the EL equations for the discretized polymer are a set of nonlinear ODEs.

2.2.1 Discrete chains

To make the problem more amenable to solution, we can discretize the spatial variables while letting the time variable remain continuous, i.e. we implement the method of lines to solve eq. (2.8). Rather than directly discretizing eq. (2.8), however, it is more natural to consider a discretized chain, as shown in figure 2.3, from the outset, and to calculate the EL equations for this system. This recipe then gives the same result as properly discretizing eq. (2.8). For the discretized chain, the constraint in eq. (2.6) becomes $|\Delta\mathbf{r}| = \Delta s = L/(N-1)$, giving the length of each link. As the number of beads $N \to \infty$ the system approaches a continuous chain. For finite N, the Lagrangian becomes a function of the positions and velocities $\{\mathbf{r}_i, \dot{\mathbf{r}}_i\}$ of all beads i, 1 ≤ i ≤ N + 1.
We use the shorthand notation $\mathcal{L}(\mathbf{r}_i, \dot{\mathbf{r}}_i)$. This recipe yields the distance metric for an ideal, freely-jointed chain (meaning that the angle between two consecutive links can take any value without any cost), which has no non-local interactions and no curvature constraints. While this approximation is often used as a first step, real chains may behave quite differently, for several reasons. In many cases, the configuration which is an energetic minimum is a straight line, or a single conformation dictated by the chemistry of the polymeric bonds. At finite temperature, energy from the environment induces conformational fluctuations. Real polymers also cannot cross themselves, and, because of their stereochemistry, also take up volume. We leave these interesting features for later analysis.

Equation (2.6) for the discretized chain becomes N constraint equations added to the effective Lagrangian:

$$\sum_{i=1}^{N} \hat{\lambda}_{i,i+1}\left(\sqrt{(\mathbf{r}_{i+1} - \mathbf{r}_i)^2} - \Delta s\right)$$

where each $\hat{\lambda}_{i,i+1} \equiv \hat{\lambda}_{i,i+1}(t)$ is a function of t, and $\hat{\lambda}_{N,N+1} = 0$. Letting $\hat{\lambda} \equiv 2\lambda\,\Delta s$ and $\mathbf{r}_{i+1/i} \equiv \mathbf{r}_{i+1} - \mathbf{r}_i$, we rewrite this strictly for convenience as

$$\sum_{i} \frac{\lambda_{i,i+1}}{2}\left(\frac{\mathbf{r}_{i+1/i}^2}{\Delta s^2} - 1\right).$$

We next convert to dimensionless variables by letting $\mathbf{r} = (\Delta s)\,\hat{\mathbf{r}}$. To simplify the notation, from here on we simply refer to $\hat{\mathbf{r}}$ as $\mathbf{r}$. The distance for the discretized chain becomes

$$D[\mathbf{r}_i, \dot{\mathbf{r}}_i] = \Delta s^2 \int_0^T dt\, \mathcal{L}(\mathbf{r}_i, \dot{\mathbf{r}}_i) \tag{2.9}$$

with effective Lagrangian

$$\mathcal{L}(\mathbf{r}_i, \dot{\mathbf{r}}_i) = \sum_{i=1}^{N}\left(\sqrt{\dot{\mathbf{r}}_i^2} - \frac{\lambda_{i,i+1}}{2}\left(\mathbf{r}_{i+1/i}^2 - 1\right)\right). \tag{2.10}$$

The derivatives $\dot{\mathbf{r}}$ and $\mathbf{r}_{i+1/i}$ are raised to different powers in (2.10); however, so long as $\mathbf{r}_{i+1/i}$ satisfies the constraint $|\mathbf{r}_{i+1/i}| = 1$, the EL equations for $\mathbf{r}_i(t)$ will be the same whether the constraint $\sqrt{\mathbf{r}_{i+1/i}^2} = 1$ or $\mathbf{r}_{i+1/i}^2 = 1$ is used. The reparameterization invariance present for point particles (cf. section 2.1) is still present for beads on the chain, but the parameterization of arclength along the chain is taken to be fixed by the discretization.
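To make the discretized functional concrete, the following Python sketch evaluates the dimensionless distance of eq. (2.9), without the $\Delta s^2$ prefactor, for a bead trajectory sampled at discrete times. It is an illustration only: the time integral of $\sqrt{\dot{\mathbf{r}}_i^2}$ is approximated by summing chord lengths between successive frames, and the names `chain_distance` and `max_link_violation`, as well as the `frames[k][i]` layout (position of bead i at time step k), are assumptions introduced here. The sketch evaluates a given trajectory; it does not search for the minimal one.

```python
import math

def chain_distance(frames):
    """Approximate sum over beads of the arc length each bead travels.

    frames[k][i] is the position (tuple of floats) of bead i at time step k.
    The time integral is replaced by a sum of straight chords between frames.
    """
    total = 0.0
    for prev, curr in zip(frames, frames[1:]):
        total += sum(math.dist(p, q) for p, q in zip(prev, curr))
    return total

def max_link_violation(frames, link_length=1.0):
    """Largest deviation of any link from `link_length` over all frames."""
    return max(
        abs(math.dist(frame[i], frame[i + 1]) - link_length)
        for frame in frames
        for i in range(len(frame) - 1)
    )

def rotate_about_tail(angle, steps):
    """Hypothetical helper: one unit link whose head sweeps `angle` about a fixed tail."""
    return [
        [(0.0, 0.0), (math.cos(angle * k / steps), math.sin(angle * k / steps))]
        for k in range(steps + 1)
    ]
```

For a single link rotated by a quarter turn about its tail, `chain_distance(rotate_about_tail(math.pi / 2, 1000))` approaches the arc length $\pi/2$ traced by the head, while `max_link_violation` confirms that the rigid-link constraint holds in every frame.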
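2.2.2 General variation of the distance functional

For reasons that will become clear as we progress, we consider the general variation of the functional D, allowing for broken extremals. That is, we allow the curves describing the particle trajectories to be non-smooth in principle at one or more points in time. Consider the case of one such point at time $t_1$. The distance can be written as

$$D = \int_0^{t_1} dt\, \mathcal{L}(\mathbf{r}_i, \dot{\mathbf{r}}_i) + \int_{t_1}^{T} dt\, \mathcal{L}(\mathbf{r}_i, \dot{\mathbf{r}}_i) \tag{2.11}$$

The space-trajectories of the particles must be continuous at time $t_1$, so $\mathbf{r}_i(t_1 - \epsilon)$ and $\mathbf{r}_i(t_1 + \epsilon)$ must have the same limit as $\epsilon \to 0$, or in shorthand:

$$\mathbf{r}_i\left(t_1^-\right) = \mathbf{r}_i\left(t_1^+\right). \tag{2.12}$$

Let $\mathbf{r}_i(t)$ and $\tilde{\mathbf{r}}_i(t)$ be two neighboring trajectories from $\mathbf{r}_i(0) = \mathbf{r}_{Ai}$ to $\mathbf{r}_i(T) = \mathbf{r}_{Bi}$ (see figure 2.4). Neighboring curves will differ by the first order quantity $\mathbf{h}_i(t) = \tilde{\mathbf{r}}_i(t) - \mathbf{r}_i(t)$. The fixed boundary conditions at t = 0, T dictate that $\mathbf{h}_i(0) = \mathbf{h}_i(T) = 0$. The difference in distance between the two trajectories is

$$\begin{aligned}
\Delta D &= D[\mathbf{r}_i + \mathbf{h}_i] - D[\mathbf{r}_i] \\
&= \int_0^{t_1+\delta t_1} dt\, \mathcal{L}(\mathbf{r}_i + \mathbf{h}_i, \dot{\mathbf{r}}_i + \dot{\mathbf{h}}_i) - \int_0^{t_1} dt\, \mathcal{L}(\mathbf{r}_i, \dot{\mathbf{r}}_i) \\
&\quad + \int_{t_1+\delta t_1}^{T} dt\, \mathcal{L}(\mathbf{r}_i + \mathbf{h}_i, \dot{\mathbf{r}}_i + \dot{\mathbf{h}}_i) - \int_{t_1}^{T} dt\, \mathcal{L}(\mathbf{r}_i, \dot{\mathbf{r}}_i)
\end{aligned} \tag{2.13}$$

Taylor expanding the Lagrangian to first order in $\mathbf{h}_i$:‡

$$\mathcal{L} \approx \mathcal{L}(\mathbf{r}_i, \dot{\mathbf{r}}_i) + \sum_{i=1}^{N}\left(\mathcal{L}_{\mathbf{r}_i} \cdot \mathbf{h}_i + \mathcal{L}_{\dot{\mathbf{r}}_i} \cdot \dot{\mathbf{h}}_i\right)$$

and integrating by parts using the fixed boundary conditions at t = 0, T, the difference in distance up to first order in $\mathbf{h}_i$ is

$$\begin{aligned}
\Delta D &\approx \int_0^{t_1} dt \sum_i \left(\mathcal{L}_{\mathbf{r}_i} - \frac{d}{dt}\mathcal{L}_{\dot{\mathbf{r}}_i}\right) \cdot \mathbf{h}_i + \int_{t_1}^{T} dt \sum_i \left(\mathcal{L}_{\mathbf{r}_i} - \frac{d}{dt}\mathcal{L}_{\dot{\mathbf{r}}_i}\right) \cdot \mathbf{h}_i \\
&\quad + \mathcal{L}(t_1^-)\,\delta t_1 - \mathcal{L}(t_1^+)\,\delta t_1 + \sum_i \mathcal{L}_{\dot{\mathbf{r}}_i} \cdot \mathbf{h}_i\Big|_{t_1^-} - \sum_i \mathcal{L}_{\dot{\mathbf{r}}_i} \cdot \mathbf{h}_i\Big|_{t_1^+}
\end{aligned} \tag{2.14}$$

with the shorthand $\mathcal{L}(t) \equiv \mathcal{L}(\mathbf{r}_i(t), \dot{\mathbf{r}}_i(t))$.

‡ We use the notation $F_{\mathbf{r}} \equiv \partial F/\partial\mathbf{r}$, $F_{\dot{\mathbf{r}}} \equiv \partial F/\partial\dot{\mathbf{r}}$.

Figure 2.4: General variations of a functional with fixed end points allow for broken extremals. In the text we derive the extra "corner" conditions for a piecewise continuous path to still be extremal for our distance functional.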
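For completeness, the integration by parts used in passing from the Taylor expansion to eq. (2.14) can be written out explicitly. This is the standard step from the calculus of variations, spelled out here as a reader aid rather than taken from the text. Because $\mathbf{h}_i(0) = \mathbf{h}_i(T) = 0$, only the corner at $t_1$ contributes boundary terms:

```latex
\begin{align*}
\int_0^{t_1} dt\, \mathcal{L}_{\dot{\mathbf{r}}_i} \cdot \dot{\mathbf{h}}_i
  &= \mathcal{L}_{\dot{\mathbf{r}}_i} \cdot \mathbf{h}_i \Big|_{t_1^-}
   - \int_0^{t_1} dt\, \Big(\frac{d}{dt}\mathcal{L}_{\dot{\mathbf{r}}_i}\Big) \cdot \mathbf{h}_i , \\
\int_{t_1}^{T} dt\, \mathcal{L}_{\dot{\mathbf{r}}_i} \cdot \dot{\mathbf{h}}_i
  &= -\,\mathcal{L}_{\dot{\mathbf{r}}_i} \cdot \mathbf{h}_i \Big|_{t_1^+}
   - \int_{t_1}^{T} dt\, \Big(\frac{d}{dt}\mathcal{L}_{\dot{\mathbf{r}}_i}\Big) \cdot \mathbf{h}_i .
\end{align*}
```

The remaining two terms in eq. (2.14), $\mathcal{L}(t_1^-)\,\delta t_1$ and $-\mathcal{L}(t_1^+)\,\delta t_1$, come from the short interval of length $\delta t_1$ by which the first integral is extended and the second is shortened in eq. (2.13).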
2.2.3 Conditions for an extremum

The variation δD differs from ∆D above only by second order terms. Then for the transformation from $\{\mathbf{r}_{Ai}\}$ to $\{\mathbf{r}_{Bi}\}$ to be an extremum, δD = 0. Thus, the EL expressions (in the top line of eq. (2.14)) must vanish in each regime $[0, t_1)$, $(t_1, T]$. Using the form of the Lagrangian in eq. (2.10), the EL equations become:

$$\dot{\hat{\mathbf{v}}}_1 + \lambda_{12}\,\mathbf{r}_{2/1} = 0 \tag{2.15a}$$
$$\dot{\hat{\mathbf{v}}}_2 - \lambda_{12}\,\mathbf{r}_{2/1} + \lambda_{23}\,\mathbf{r}_{3/2} = 0 \tag{2.15b}$$
$$\vdots$$
$$\dot{\hat{\mathbf{v}}}_N - \lambda_{N-1,N}\,\mathbf{r}_{N/(N-1)} = 0 \tag{2.15c}$$

According to equation (2.14) there are additional conditions for the transformation to be an extremum. To find these, first note that up to first order (see figure 2.4)

$$\mathbf{h}_i(t_1) \approx \delta\mathbf{r}_i(t_1) - \dot{\mathbf{r}}_i(t_1)\,\delta t_1. \tag{2.16}$$

Then the first variation in the distance is

$$\delta D = \left[\Big(\mathcal{L} - \sum_i \dot{\mathbf{r}}_i \cdot \mathcal{L}_{\dot{\mathbf{r}}_i}\Big)\Big|_{t_1^-} - \Big(\mathcal{L} - \sum_i \dot{\mathbf{r}}_i \cdot \mathcal{L}_{\dot{\mathbf{r}}_i}\Big)\Big|_{t_1^+}\right]\delta t_1 + \sum_i \left[\mathcal{L}_{\dot{\mathbf{r}}_i}\Big|_{t_1^-} - \mathcal{L}_{\dot{\mathbf{r}}_i}\Big|_{t_1^+}\right] \cdot \delta\mathbf{r}_i(t_1) \tag{2.17}$$

which must vanish at an extremum. Because the variations $\delta\mathbf{r}_i$ and $\delta t_1$ are all independent, the terms in square brackets in equation (2.17) must vanish. Writing these expressions in terms of the conjugate momenta $\mathbf{p}_i = \mathcal{L}_{\dot{\mathbf{r}}_i}$ and Hamiltonian $H = \sum_i \dot{\mathbf{r}}_i \cdot \mathbf{p}_i - \mathcal{L}$ gives the conditions:

$$\mathbf{p}_i\big|_{t_1^-} = \mathbf{p}_i\big|_{t_1^+} \tag{2.18a}$$
$$H\big|_{t_1^-} = H\big|_{t_1^+} \tag{2.18b}$$

These conditions are called the Weierstrass-Erdmann conditions, or corner conditions, in the calculus of variations [46]. According to the Lagrangian in equation (2.10), the Hamiltonian is given by

$$H = -\sum_{i=1}^{N} \frac{\lambda_{i,i+1}}{2}\left(\mathbf{r}_{i+1/i}^2 - 1\right)$$

which is identically zero, so corner condition (2.18b) provides no further information. The conjugate momenta according to (2.10) are given by

$$\mathbf{p}_i = \frac{\dot{\mathbf{r}}_i}{|\dot{\mathbf{r}}_i|} = \hat{\mathbf{v}}_i. \tag{2.19}$$

Therefore, according to corner condition (2.18a), extremal trajectories cannot suddenly change direction: each $\mathbf{r}_i(t)$ follows a smooth path, continuous up to first derivatives in the spatial coordinates.
The fact that one corner condition provided no information, due to the vanishing of the Hamiltonian, is related to our choice of parameterization in formulating the problem. For example, in the case of the distance of the single point particle mentioned in the introduction, the Lagrangian may be defined either through independent variable x as $L^{(x)} = \sqrt{1 + y'^2 + z'^2}$ (with e.g. $y' = dy/dx$), or parametrically through independent variable t as $L^{(t)} = \sqrt{\dot{\mathbf{r}}^2}$. The conjugate momenta are then either $L^{(x)}_{y'} = y'/\sqrt{1 + y'^2 + z'^2}$ and $L^{(x)}_{z'} = z'/\sqrt{1 + y'^2 + z'^2}$, or $L^{(t)}_{\dot{\mathbf{r}}} = \dot{\mathbf{r}}/|\dot{\mathbf{r}}| \equiv \hat{\mathbf{v}}$. The Hamiltonians are either $H^{(x)} = 1/\sqrt{1 + y'^2 + z'^2}$ or $H^{(t)} = L^{(t)} - \dot{\mathbf{r}} \cdot (\dot{\mathbf{r}}/|\dot{\mathbf{r}}|) = 0$. The corner conditions can be shown to be equivalent for both choices of independent variable: for $L^{(t)}$ they give $\hat{\mathbf{v}}(t_1^-) = \hat{\mathbf{v}}(t_1^+)$, so that the direction of the tangent to the curve cannot have a discontinuity. Together, the Hamiltonian and two conjugate momenta for $L^{(x)}$ can be interpreted as components of the unit tangent vector to the curve, i.e. $\hat{\mathbf{t}}(x) = (\hat{\mathbf{i}} + y'\hat{\mathbf{j}} + z'\hat{\mathbf{k}})/\sqrt{1 + y'^2 + z'^2}$, and so once again the corner conditions enforce a continuous tangent vector, here $\hat{\mathbf{t}}(x_1^-) = \hat{\mathbf{t}}(x_1^+)$.

Boundary conditions

In the continuum limit, the boundary conditions on $\mathbf{r}(s,t)$ are $\mathbf{r}(s,0) = \mathbf{r}_A(s)$, $\mathbf{r}(s,T) = \mathbf{r}_B(s)$, where $\mathbf{r}_A$ and $\mathbf{r}_B$ are the two configurations of the polymer. For discrete chains, these boundary conditions become

$$\{\mathbf{r}_i(0)\} = \{\mathbf{r}_i^{(A)}\} \tag{2.20a}$$
$$\{\mathbf{r}_i(T)\} = \{\mathbf{r}_i^{(B)}\}. \tag{2.20b}$$

There are also boundary conditions that hold for the end points of the chain at all times. From equations (2.15a, 2.15c), we see that there are three solutions for the end points of the chain:

1) If $\lambda \neq 0$, purely rotational motion results. This can be seen by taking the dot product of eq. (2.15a) with $\mathbf{v}_1$, which yields $\lambda_{12}\,\mathbf{v}_1 \cdot \mathbf{r}_{2/1} = 0$, so the velocity of the end point is orthogonal to the link.
The rotation must be about a point that is internal to the link, i.e., on the line between points 1 and 2 for end point 1. This can be seen straightforwardly for the case of one link by removing point 3 from equations (2.15a) and (2.15b). Then the accelerations $\dot{\hat{\mathbf{v}}}_i$ must be in opposite directions. This can only occur if the rotation is about a point on the line between points 1 and 2.

2) If $\lambda = 0$, $\dot{\hat{\mathbf{v}}}_i = 0$, and straight-line motion of the end point results.

3) Writing out the time-derivative in (2.15a) yields

$$v_1^2\,\dot{\mathbf{v}}_1 - (\mathbf{v}_1 \cdot \dot{\mathbf{v}}_1)\,\mathbf{v}_1 = -\lambda_{12}\,|\mathbf{v}_1|^3\,\mathbf{r}_{2/1} \tag{2.21}$$

which has the trivial solution $\mathbf{v}_1 = 0$. The end point can be at rest while other parts of the chain move.

For a transformation to be minimal, it is necessary, but not sufficient, that it be an extremum. In Appendix A we derive the sufficient conditions for a given transformation to minimize the functional (2.9). We discuss sufficient conditions further below in the context of minimal transformations for links.

In the discrete version of our variational problem, minimizing D for chains with N rigid links, one seeks to transform a chain from its initial configuration to the final configuration while minimizing the total distance that the N + 1 beads travel. As was done for the problem of finding the distance between two points, we transform the problem from a geometric variational problem to a dynamic variational problem by choosing an artificial "time" parameter t, to avoid complications that might arise when one specific coordinate is no longer a single-valued function of the others.

The study of minimal transformations between small numbers of links has applications to the inverse kinematic problem in robotics and movement control. In the inverse kinematic problem, one is given the initial and final positions of the end-effector (the hand of the robot), and asked for the functional form of the joint variables for all intermediate states.
Generally there is no unique solution until some optimization functional is introduced, such as minimizing the time rate of change of acceleration (the jerk), torque, or muscle tension (see the review [65] and references therein). The minimal distance transformation would be relevant if one sought the fastest transformation between initial and final states, without explicit regard to mechanical limitations. The indeterminate intermediate points can be handled variationally as a free boundary value problem. As we will see, the solutions to these problems involve smooth patches of combinations of rotations and straight-line motions.

2.3 Single links

In the limit of one link, equations (2.15a-2.15c) reduce to:

$$\dot{\hat{\mathbf{v}}}_A + \lambda\,\mathbf{r}_{B/A} = 0, \qquad \dot{\hat{\mathbf{v}}}_B - \lambda\,\mathbf{r}_{B/A} = 0 \tag{2.22}$$

where we have let A represent point 1, B point 2, and $\lambda \equiv \lambda_{12}$. The link has length 1 in our dimensionless formulation, so the vector $\mathbf{r}_{B/A}$ could also have been written as a unit vector $\hat{\mathbf{r}}_{B/A}$. Both points A and B are end points and satisfy the boundary conditions of section 2.2.3. This means that points A and B either move by pure rotation or straight-line translation, or remain at rest. The initial and final conditions may be written $\mathbf{r}_A(0) = A$, $\mathbf{r}_B(0) = B$, $\mathbf{r}_A(T) = A'$, $\mathbf{r}_B(T) = B'$. The link in our problem has direction, so A must transform to A′ and B to B′. We will often use arrowheads in figures to denote this direction.

Figure 2.5: Possible (a,b) and impossible (c) straight line transformations between links AB and A′B′. Figure b shows a straight line transformation where the initial and final states do not lie in the same plane. In the text we derive the conditions for the possibility of a straight line transformation between links.

2.3.1 Straight line transformations

As a first example, consider the two links shown in figure 2.5a. The four points A, B, A′, B′ need not lie in a plane (see, for example, fig. 2.5b). Let angle ∠BAA′ ≡ a be obtuse.
We draw straight lines from A to A′ and B to B′, and ask whether such a transformation is possible. We can thus derive the following rule:

• For a straight line transformation to exist between two links, opposite angles of the quadrilateral made by AB, A′B′, AA′, BB′ must be obtuse.

Let the length that point A travels be $x_A$, i.e., we imagine the point A′ and the distance $x_A = |AA'|$ to be variable. The length $r_B$ that point B travels is then a function of $x_A$ and the original angle a, $r_B(x_A, a)$. We can now find conditions on the angle b ≡ ∠BB′A′ such that the transformation is possible. After some distance $x_A$ traveled by point A, the length of the line from B to A′ satisfies

$$\overline{BA'}^{\,2} = x_A^2 + 1 - 2x_A\cos a = r_B^2 + 1 - 2r_B\cos b$$

so that

$$r_B(x_A, a) = \cos b \pm \sqrt{\cos^2 b + f(x_A, a)}$$

with $f(x_A, a) = x_A^2 - 2x_A\cos a$. Since a is obtuse, f > 0 when $x_A > 0$, and so the positive root must be taken for $r_B$ to be positive. When $x_A = 0$, $f(0, a) = 0$, and

$$r_B(0, a) = \cos b + |\cos b| = 0.$$

Therefore b must also be an obtuse angle. If two opposite angles are obtuse, then the other two angles must be acute. This concludes the proof that the above conditions are sufficient. An additional proof that they are necessary is given in Appendix B.

We readily see that figure 2.5a is one pair of a larger set of straight line transformations that can continue until one or both of the obtuse angles reaches 90°. This collection forms a "bow tie" of admissible configurations, as in figure 2.6. Note that straight lines in the quadrilateral may cross, as in the transformation from A, B to A′, B′ in figure 2.6. Trivial translations of the link without any concurrent rotation are a special case of general straight line transformations.

2.3.2 Piece-wise extremal transformations: transformations with rotations

An immediate question concerns the nature of the transformation between AB and A′B′ in figure 2.5c, where opposite angles of the quadrilateral are not obtuse.
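Whether a given pair of links falls into this case can be tested directly from the rule of section 2.3.1. The Python sketch below checks, through the signs of dot products, whether either pair of opposite angles of the quadrilateral A, B, B′, A′ is non-acute. Two choices here are assumptions made for illustration: right angles are counted as admissible (consistent with the bow tie terminating at 90°, and needed for pure translations), and the function name `straight_line_possible` is hypothetical. Points may be given in two or three dimensions.

```python
def _sub(p, q):
    """Vector from q to p."""
    return tuple(a - b for a, b in zip(p, q))

def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def straight_line_possible(A, B, A2, B2, tol=1e-12):
    """Obtuse-angle rule for links AB -> A'B' (A2, B2 stand for A', B').

    An angle is non-acute when the dot product of the two edge vectors
    leaving its vertex is <= 0. The rule requires one pair of opposite
    angles of the quadrilateral A, B, B', A' to be non-acute.
    """
    # Pair 1: angle at A (between AB and AA') and angle at B' (between B'B and B'A')
    at_A = _dot(_sub(B, A), _sub(A2, A))
    at_B2 = _dot(_sub(B, B2), _sub(A2, B2))
    # Pair 2: angle at B (between BA and BB') and angle at A' (between A'A and A'B')
    at_B = _dot(_sub(A, B), _sub(B2, B))
    at_A2 = _dot(_sub(A, A2), _sub(B2, A2))
    return (at_A <= tol and at_B2 <= tol) or (at_B <= tol and at_A2 <= tol)
```

For two unit links placed head to tail, the test returns `True` when the second link is turned by 60° relative to the first and `False` when it is turned by 120°, in line with the observation in section 2.3.3 that rotation becomes necessary beyond 90°.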
Recall our link has direction, so A cannot transform to B′. Then a direct straight-line solution is not possible, due to the constraint of constant link length. The only remaining solution is for the link to rotate as part of the transformation. Consider first the rotation of link AB. The EL equations (2.22) allow for pure rotations about A, B, or a common center along the link. Likewise for link A′B′. The rotation can occur from either link AB (fig 2.7a) or link A′B′ (fig 2.7b). After the link rotates to a critical angle, it can then travel in a straight line. The extremals are broken, in that they involve matching up a piece consisting of pure rotation with a piece consisting of pure translation of the end points of the link. Where the pieces match they must satisfy the corner conditions (2.18a, 2.18b). This means that the end points cannot suddenly change direction, a situation which is only satisfied by a straight line trajectory that lies tangent to the circle of rotation.

Figure 2.6: (a) An example of a set of link configurations connected by a straight-line transformation. The link rotates clockwise as it translates to allow the end points to move in straight lines. The translation can proceed no farther than the end points AB and A′B′, which have link vectors $\overrightarrow{AB}$ or $\overrightarrow{A'B'}$ that are perpendicular to one or the other of the vectors $\hat{\mathbf{v}}_A$ or $\hat{\mathbf{v}}_B$. The totality of states thus connected forms a "bowtie". (b) A bowtie where the terminal states AB and A′B′ happen to cross each other.

From figure 2.6, we see that a straight line transformation exists only when an angle between a link and one of the straight line trajectories reaches π/2. The critical angle that link AB must rotate is then determined by the point where a line drawn from B′ is just tangent to the unit sphere centered at point A, point B1 in figure 2.7a.
There is generally a different critical angle if the rotation occurs at link A′B′, as in fig 2.7b. It is shown in Appendix C that in general the critical angle is determined by drawing the tangent to a circle or sphere about one of the link ends. If the rotation were about a common center, we see that one or another of the link ends would violate a corner condition, so the rotation must be about one of the link ends.

Figure 2.7: Transformations between two links involving broken extremals consisting of rotation and translation. (b) is the global minimum, with shortest distance traveled during the transformation. (a), (c), and (d) are local minima. (e) is extremal, but not minimal, as the trajectory of arc B′B1 passes through a conjugate point; see Appendix A.

According to eqs. (A.5) and (A.6), the matrix P has a determinant of zero due to the parametric formulation of the problem, and so is not positive definite. To show that the transformations in fig. 2.7a,b are indeed minimal, we then need to express the problem in non-parametric form. To do this, let the independent variable be the angle θ of the link with the vertical. Then the displacement x along the line AA′ is the unknown function of θ to be determined by minimizing the total arc length traveled. This distance can be written as

$$D[x] = \int_{\theta_0}^{\theta_1} d\theta\, \left(\sqrt{x'^2 + 2x'\cos\theta + 1} + \sqrt{x'^2}\right)$$

In this formulation, the scalar quantity $P(\theta) = L_{x'x'}$ becomes

$$P(\theta) = \frac{\sin^2\theta}{\left(x'^2 + 2x'\cos\theta + 1\right)^{3/2}}$$

which is always > 0 except for the isolated point θ = 0; in particular, it is positive along the extremal trajectory, which is necessary for a minimum. So we conclude that the transformation with the smaller angle of rotation in fig 2.7b is here the global minimum, and the other transformation (fig 2.7a) is a local minimum.
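The closed form for $P(\theta)$ can be checked numerically against a finite-difference second derivative of the integrand of $D[x]$. The Python sketch below is such a consistency check only; the function names, the step size `h`, and the sample values of $x'$ and θ are arbitrary choices made for illustration, and the comparison is meant for $x' \neq 0$, where $\sqrt{x'^2} = |x'|$ is smooth.

```python
import math

def integrand(xp, theta):
    """Integrand of D[x]: sqrt(x'^2 + 2 x' cos(theta) + 1) + sqrt(x'^2)."""
    return math.sqrt(xp * xp + 2.0 * xp * math.cos(theta) + 1.0) + abs(xp)

def P_closed_form(xp, theta):
    """P(theta) = sin^2(theta) / (x'^2 + 2 x' cos(theta) + 1)^(3/2)."""
    return math.sin(theta) ** 2 / (xp * xp + 2.0 * xp * math.cos(theta) + 1.0) ** 1.5

def P_finite_difference(xp, theta, h=1e-4):
    """Central second difference of the integrand with respect to x'.

    Valid away from x' = 0, where |x'| has a kink.
    """
    return (
        integrand(xp + h, theta)
        - 2.0 * integrand(xp, theta)
        + integrand(xp - h, theta)
    ) / (h * h)
```

Evaluating both functions at, say, $x' = 0.7$ and θ = 1.0 gives values that agree to within the finite-difference error, and `P_closed_form` is positive for any θ that is not a multiple of π.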
Figure 2.7e is also an extremal trajectory, satisfying corner conditions, and with positive definite P. However, it is not a local minimum because the trajectory passes through a conjugate point (denoted by point CP, where the dotted line along A′B′ meets the great circle about A′). According to the results in section A.2, if the extremal trajectory (a great circle) traverses an angle larger than π radians, it passes through a conjugate point and thus becomes unstable to sinusoidal perturbations with roots at the end points of the great arc, but no roots in between. Transformations involving rotations about points B or B′ in figure 2.7 both have conjugate points and so are not minimal. The transformation in fig. 2.7c does not pass through a conjugate point and so is in fact another local minimum. The part of the extremum along the straight line section of the trajectory has no conjugate points, as discussed above.

2.3.3 Systematically exploring transformations by varying link positions

We can investigate what happens to the minimal transformation when one of the link positions or angles is varied with respect to the other. Let us start by putting the two links head to tail, as shown in figure 2.8a. The distance between them is 2, by simple translation of link end points. We can now increase the angle between the two vectors, by rotating the right link for example, as in figures 2.8b–g. So long as the angle between the two vectors is less than 90°, one link may slide along another and the distance is unchanged (figs 2.8a–c). This is a special case of the transformations shown in figure 2.6 (compare for example figure 2.8b with the middle three unlabeled links in that figure). Beyond 90°, however, the transformation must include rotation. Fig 2.8d has an angle of 110°.
The minimal transformation first rotates, for example with the tail of the horizontal black arrow fixed and the head tracing out the blue arc, until the critical angle is reached, where a straight line drawn from the final arrowhead (at the top of the figure) is just tangent to the circle containing the blue arc. This state is indicated by a red link in figure 2.8d. The link then translates to its reciprocal position at the opposite end of the bowtie, denoted by a second red link (cf. also figure 2.6b). At this point, the arrowhead has completed the transformation. Finally, the tail rotates into its final position. The total distance traveled is slightly larger than 2.

When the angle between the vectors is 120°, as shown in figure 2.8e, the transformation consists of pure rotations. Taking the initial state to be the horizontal black vector, the link first rotates about its fixed tail, the head tracing out the blue arc, until the link reaches the state shown in red, where the arrowhead has reached its final end point. Then the link rotates about its head until the tail reaches its final position.

When the angle between the links is larger than 120°, as shown in figs 2.8f–g, the transformation must involve rotation about an internal point along the link. Let points A and B denote the tail and head of the link respectively. If an infinitesimal rotation Δθ occurs about an internal point P, the increment in distance traveled is

$$\Delta D = |\mathbf{r}_{B/P}|\,\Delta\theta + |\mathbf{r}_{A/P}|\,\Delta\theta = \Delta\theta ,$$

because $|\mathbf{r}_{B/P}| + |\mathbf{r}_{A/P}| = 1$ for any point P on the unit-length link. The increment is therefore independent of the position of the instantaneous center of rotation (ICR). This means that there are an infinity of transformations all giving the same distance, depending on the time-dependence of the ICR. Two simple alternatives with only two discrete positions of ICR are shown in figures 2.8f,g. Specifically, in figure 2.8f, the horizontal black vector first rotates about its tail to the red configuration, which is a mirror image of the final black vector.
Then rotation is about an internal point determined by the intercept of the red vector with the final black vector, with end points tracing out the green arcs. Figure 2.8g depicts the transformation for overlapping, oppositely pointing vectors. Here only one ICR is allowed to implement the rotation of π radians: rotation can only occur about the point at the center of the vectors. The red vector shows an intermediate state.

Figure 2.9 illustrates what happens when one of the links is translated with respect to another, starting from two different scenarios shown in 2.9a and 2.9b. In 2.9a, the tail of the vertical link is displaced by (1/3, −1/3) with respect to the tail of the horizontal link. The minimal transformation is a pure rotation by π/2. In figure 2.9b, the tail of the vertical link is now displaced to (2/3, −1/3). Pure rotations again give a distance of π/2. Rotation about a point on the horizontal link that is equidistant from both arrowheads transforms the initial arrowhead to the final one (red intermediate state). Then, rotation of the tail about the arrowhead transforms to the final state. In figure 2.9c, the minimal transformation first involves a translation by sliding the arrowhead along the vertical, until the arrowheads overlap (red intermediate state). The tail end of the link then rotates into place. In figure 2.9d, straight lines from the end points will not satisfy the obtuse condition in section 2.3.2, so the transformation must involve rotations. Here a straight line transformation takes the link almost to the final state. It then must undergo a small rotation to complete the transformation. Seen in reverse, the vertical arrow must rotate to a critical angle determined by the criterion in section 2.3.2, before the link can finish the transformation by pure translation. Figure 2.9e is actually figure 2.8f.
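The three regimes described for the head-to-tail links of figure 2.8 (sliding, rotation plus translation, pure rotation) can be collected into a single function of the angle between the link vectors. The following is a minimal sketch, not the author's code. It assumes unit links placed head to tail, and it assumes that in the intermediate regime the tail's path mirrors the head's, so the total is twice the head's rotation plus translation. The function name is hypothetical. The 110° entry is an inferred value: it is the angle at which this sketch reproduces the distance D = 2.020 printed under figure 2.8d.

```python
import math

def head_to_tail_distance(phi):
    """Distance between two unit links placed head to tail.

    phi is the angle (radians) between the links as vectors.
    Assumption: in the rotate-translate-rotate regime the tail's path
    mirrors the head's, so the total is twice the head's path length.
    """
    if phi <= math.pi / 2:
        # One link slides along the other: each end moves a unit distance.
        return 2.0
    reach = 2.0 * math.cos(phi / 2.0)  # distance from fixed tail to final head
    if reach <= 1.0:
        # Final head lies within reach of rotations alone: distance equals
        # the angle rotated, independent of the center of rotation.
        return phi
    # Rotate about the tail to the tangent point, then translate in a line.
    rotation = phi / 2.0 - math.acos(1.0 / reach)
    translation = math.sqrt(reach * reach - 1.0)
    return 2.0 * (rotation + translation)

for deg in (60, 110, 120, 150, 180):
    print(deg, round(head_to_tail_distance(math.radians(deg)), 3))
```

The middle branch is the tangent construction of section 2.3.2: the head rotates about the fixed tail until the line to its final position is tangent to the unit circle, and then translates along that tangent.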
In figures 2.9e–h the final condition (the tilted link) is changed systematically, by translating it vertically away from the horizontal link (which we choose arbitrarily as the initial configuration). In figure 2.9f, the tilted link is translated vertically by 1/3. The transformation can be achieved by rotating the horizontal link about a point equidistant from both arrowheads, to the red intermediate configuration. The link then rotates about the arrowhead into the final configuration. The distance is still the angle rotated, for the reasons mentioned above in the context of figures 2.8f–g: θ = (150/180)π, which is unchanged from 2.9e. In fact, so long as the arrowhead can be reached by rotation (the translated distance is less than d, where d is the solution to $d^2 + d + 1 - \sqrt{3} = 0$ for this angle), the distance will be unchanged. The transformation at the critical distance is shown in figure 2.9g. The rotations now occur about the end points: the tail and head of the link. In figure 2.9h, the translated distance is now equal to 1. The transformation first consists of a rotation about the tail to a critical angle (blue arc and red intermediate state), then a translation much like that in figure 2.6 (grey straight lines between red intermediate states), and finally a rotation about the head (green arc) to the final configuration.

Figure 2.8: Successive transformations between two links made by rotating a link so that there is a progressively larger angle between the links as vectors (or smaller angle made between them as lines). The two boundary conditions (the initial and final conditions) are shown as black links, and an intermediate state is shown as a red link or links. The arcs traced out by the end points are shown in blue or green, while straight line motions, when they are not along the links themselves, are shown in grey. The distance traveled over the course of the transformation is given below each figure: (a) D = 2.000, (b) D = 2.000, (c) D = 2.000, (d) D = 2.020, (e) D = 2.094, (f) D = 2.618, (g) D = 3.142.

2.4 2-link chains

We now consider the next simplest case of 2 links (3 beads). The Lagrangian now reads:

$$L(\mathbf{r}_1,\mathbf{r}_2,\mathbf{r}_3,\dot{\mathbf{r}}_1,\dot{\mathbf{r}}_2,\dot{\mathbf{r}}_3) = \sqrt{\dot{\mathbf{r}}_1^2} + \sqrt{\dot{\mathbf{r}}_2^2} + \sqrt{\dot{\mathbf{r}}_3^2} - \frac{1}{2}\lambda_{12}\left((\mathbf{r}_2-\mathbf{r}_1)^2 - 1\right) - \frac{1}{2}\lambda_{23}\left((\mathbf{r}_3-\mathbf{r}_2)^2 - 1\right) \tag{2.23}$$

which has EL equations (cf. eqs. 2.15a–2.15c)‡:

$$\dot{\hat{\mathbf{v}}}_A + \lambda_{AB}\,\mathbf{r}_{B/A} = 0 \tag{2.24a}$$
$$\dot{\hat{\mathbf{v}}}_B - \lambda_{AB}\,\mathbf{r}_{B/A} + \lambda_{BC}\,\mathbf{r}_{C/B} = 0 \tag{2.24b}$$
$$\dot{\hat{\mathbf{v}}}_C - \lambda_{BC}\,\mathbf{r}_{C/B} = 0 . \tag{2.24c}$$

The corner conditions (2.18a), (2.19) imply

$$\hat{\mathbf{v}}_i\left(t^+\right) = \hat{\mathbf{v}}_i\left(t^-\right)$$

so the direction of motion cannot suddenly change, unless along one part of the extremal the velocity of point i is zero (the point is at rest), where its direction $\hat{\mathbf{v}}$ is then undefined.

The boundary conditions described in section 2.2.3 hold as well, so the end points can either be at rest, move in straight lines, or purely rotate. This gives 3 × 3 = 9 possible scenarios to investigate here, many of which can readily be ruled out. For example, consider the states in figure 2.10a. Because A and A′ are in the same position, rotation and translation of A are ruled out and point A remains at rest, leaving 3 scenarios for the other end point C. However, since C and C′ are at different positions and ABC lie along a straight line, C cannot remain at rest initially, leaving either translation or rotation for point C.

‡ The links have length 1 in our dimensionless formulation, so the vectors $\mathbf{r}_{B/A}$ and $\mathbf{r}_{C/B}$ could also have been written as unit vectors $\hat{\mathbf{r}}_{B/A}$ and $\hat{\mathbf{r}}_{C/B}$.

Figure 2.9 (panel distances: (a) D = 1.571, (b) D = 1.571, (c) D = 1.730, (d) D = 2.374, (e) D = 2.618, (f) D = 2.618, (g) D = 2.618, (h) D = 3.181): Successive transformations between two links made by translating one link with respect to the other.
In (a–d) the initial and final configurations are perpendicular, while in (e–h) they are at an angle of 150° to each other. Note the distances in (e–g) are all the same, even though the end points of the links are at varying distances from each other.

Figure 2.10: (a) Initial and final states for a chain of two links. The transformation in (b) is non-extremal because it violates a corner condition at C″. (c) and (d) are degenerate minima: rotations occurring about B or B′ both have the same length, D = 4.498. Intermediate states, shown in red, have opposite convexity in (c) and (d).

Figure 2.11: (a) Initial and final states for a polymer of 2 links. The angle between AB and A′B′ is π/4. The minimal transformations in (b), with D = 3.114, and (c), with D = 2.985, are now no longer degenerate. (c) is the global minimum.

Suppose C translates towards C′, as in figure 2.10b. Then $\dot{\hat{\mathbf{v}}}_C = 0$, and from (2.24c, 2.24b) $\lambda_{BC} = 0$ and $\dot{\hat{\mathbf{v}}}_B = \lambda_{AB}\,\mathbf{r}_{B/A}$. B cannot move in a straight line without moving point A, so $\lambda_{AB} \neq 0$, and thus B must rotate about point A. The transformation then proceeds as in figure 2.10b until B reaches B′ and C reaches C″. Then, however, if C″ were to rotate to C′, the trajectory would violate corner conditions at point C″. Therefore the direction of translation of C must not be directly towards C′ but must be tangential to the arc C″C′, as in figure 2.10c. The reverse of this transformation is allowable as well, as can be seen by swapping the labels ABC ↔ A′B′C′. Here, C first rotates to the critical angle θ shown in fig 2.10d, and then translates to C′. In fact, one can see that links BC and B′C′ along with lines BB′ and CC′ form a quadrilateral, as in figure 2.7, with the same consequences for rotation to a critical angle.
For the links in fig 2.10 the situation is symmetric, so rotation can occur at the beginning or end of the transformation. Figure 2.11a shows an example with this symmetry broken, so that the distance is different depending on where the rotation occurs, as in figures 2.7a,b. In this case, the transformation in fig 2.11c has the minimal distance, and that in fig 2.11b is sub-minimal. Extensions of the transformation in figure 2.11 to large numbers of links were explored in [109].

2.4.1 Transformations involving a change in convexity

Transformations between configurations with opposite convexity involve motion out of the plane, even if the initial and final states lie in the plane. If the transformation is constrained to lie in the plane, the trajectories of some points will be non-monotonic: those points must move farther away from their final positions before approaching them. We illustrate these ideas with some examples below.

Figure 2.12: A transformation between two states of opposite convexity: ABC has convexity down and right, while A′B′C′ has convexity up and left. There is no extremal transformation in the plane that can connect them without some apparent violation of corner conditions.

Consider the initial and final states in figure 2.12. We again imagine B rotating to B″. If C were to translate to C′, one would have the intermediate configuration A′B″C′. Now C′ and A′ must remain at rest to satisfy corner conditions. Then the only way to finish the transformation is for B″ to rotate about the axis A′C′; however, the trajectory of B then violates corner conditions and so is not extremal. In Appendix D we take up the issue of minimal transformations for this case when the links are constrained to lie in a plane.

We thus seek a point B″ and resulting trajectory BB″B′ such that arc BB″ satisfies corner conditions with arc B″B′.
One solution is to effectively place B″ at position B by considering the boundary condition with C at rest (and A at rest). Then B rotates to B′ about axis AC, and the trajectory of B lies on a circle defined by the intersection of two unit spheres centered at A and C. The sphere about A is drawn in figure 2.13 as a visual aid. Along arc BB′ both $\lambda_{AB} \neq 0$ and $\lambda_{BC} \neq 0$. Once in configuration AB′C, C can then undergo rotation about B′ to C′, with A and B′ stationary.

The transformation in 2.13a is a local minimum in distance; however, it is not the global minimum. A shorter distance transformation can be seen by considering the reverse transformation. Imagine A′ and C′ stationary, while B′ rotates about axis A′C′ in figure 2.13b. This rotation of B′ follows a circular trajectory defined by the intersection of two unit spheres centered at A′ and C′. The rotation occurs until point B″, which is the point where the above circle is tangent to a great circle on the unit sphere about A and passing through B. The arc BB″ is a great circle, because this is a geodesic for point B given that A is fixed, which follows from the Euler equations (2.24b, 2.24c) when $\lambda_{BC} = 0$. The great circle is defined by the plane containing the points A, B, and B″.

The angle between the (variable) vector $\overrightarrow{BC}$ of link BC and the tangent to the arc B″B is always π/2, so once the corner condition is met, point C on link BC can move in straight line motion from C′ to C while B moves on the great circle from B″ to B. That is, the quadrilateral criterion of section 2.3.1 is met for BB″C′C.

To find point B″, let its position be $\mathbf{r}_{B''} = (x_o, y(x_o), z(x_o))$. The great circle is defined by the plane passing through the points A, B, and B″. This plane has normal $\mathbf{n} \equiv \overrightarrow{AB} \times \overrightarrow{AB''} = (1,0,0) \times (x_o, y(x_o), z(x_o)) = (0, -z(x_o), y(x_o))$. At the point B″ the normal is orthogonal to the tangent vector of the circle defined by rotation about the A′C′ axis.
The tangent vector to this circle is $\hat{\mathbf{t}} = \partial\mathbf{r}/\partial s = x_s\,(1, y_x, z_x)$ by the chain rule. At B″, $\hat{\mathbf{t}} \cdot \mathbf{n} = 0$, or

$$-z(x_o)\,y_x(x_o) + y(x_o)\,z_x(x_o) = 0 . \tag{2.25}$$

The functions y(x) and z(x) are defined by the intersection of two unit spheres centered at (0, 0, 0) and $(1/\sqrt{2},\, 1 + 1/\sqrt{2},\, 0)$, giving

$$y(x) = 1 - \frac{\sqrt{2}}{2+\sqrt{2}}\,x , \qquad z(x) = \sqrt{1 - x^2 - y(x)^2} . \tag{2.26}$$

Together, (2.25) and (2.26) give

$$\mathbf{r}_{B''} = \left(\sqrt{2}-1,\; 2(\sqrt{2}-1),\; \sqrt{2(5\sqrt{2}-7)}\right).$$

The distance traveled along arc BB″ is $\theta_{BB''}$, where $\cos\theta_{BB''} = x_o = \sqrt{2}-1$. The distance traveled along arc B″B′ can similarly be shown to be $r\,\theta_{B''B'} = \sin(\pi/8)\cos^{-1}(2\sqrt{2}-3)$. Adding the distance CC′, the total (minimal) distance is thus D = 2.576. There is, of course, a degenerate solution to the above with z → −z.

Figure 2.13: Sub-minimal (a) and minimal (b) transformations for the boundary conditions in figure 2.12. The distances for each transformation are approximately 3.007L² and 2.576L² respectively, where L is the link length. Transformation (a) proceeds from ABC by first rotating B to B′ about axis AC, then rotating C about point B′. Transformation (b) proceeds from ABC by simultaneously translating C to C′ while rotating B about A on a great circle to point B″. Finally point B″ rotates from B″ to B′ about axis A′C′.

2.4.2 Transformations with initial and final states in 3-D

We now give a representative example where the initial and final configurations do not lie in the same plane, as shown in figure 2.14. Because AB ⊥ AA′ and BC ⊥ CC′, neither A nor C will rotate about B as part of the transformation. Nor can ABC simultaneously translate directly to A′B′C′, because, for example, quadrilateral AA′B′B does not satisfy the rule of opposite angles ≥ π/2, so link AB cannot slide (translate) to A′B′. This leaves 3 options for the initial stages of the transformation:

1.) A translates, B rotates, C remains fixed. B then rotates about C in the CBB′ plane. The initial direction of motion of B is then $\hat{\mathbf{v}}_B = (-\hat{\mathbf{i}} + \hat{\mathbf{k}})/\sqrt{2}$; however, A can then only move backward to preserve link length ($\hat{\mathbf{v}}_A = -\hat{\mathbf{k}}$), similar to figure B.1. This rules out case (1).

2.) A remains fixed, B rotates, C remains fixed. B then rotates towards B′ about axis AC, until it reaches a critical angle where line B″B′ is tangent to its circular trajectory (see fig. 2.14a). At this point the quadrilateral B″CC′B′ does not have opposite obtuse angles, so a straight line transformation to A′B′C′ is not possible. It is possible to transform to a configuration A′B′C″, where C″ is at position (1, 1, 1) and angle ∠B′C″C = π/2, so that $\hat{\mathbf{v}}_C = \hat{\mathbf{k}}$. Then the transformation is completed by a π/2 rotation of C″ about B′. This transformation is sub-minimal as it has a larger distance.

3.) A remains fixed, B rotates, C translates. In this case, B rotates toward B′ in the BAB′ plane, while C translates toward C′, until the state AB″C″ is reached (see fig. 2.14b). State AB″C″ can be found as follows. Because the rotation of B is about the axis $(0, -1/\sqrt{2}, 1/\sqrt{2})$, the position $\overrightarrow{AB''}$ of B after rotation by the (critical) angle θ is $(\cos\theta,\ \sin\theta/\sqrt{2},\ \sin\theta/\sqrt{2})$. This angle is then determined by the condition $\overrightarrow{AB''} \cdot \overrightarrow{B''B'} = 0$, where $\overrightarrow{B''B'} = \overrightarrow{AB'} - \overrightarrow{AB''}$. The solution to this condition is simply θ = π/4. The location of C″ is then determined from the condition that the link length from B″ to C″ is one: $|\overrightarrow{B''C''}| = 1$, where $\overrightarrow{B''C''} = \overrightarrow{B''C} + t\,\overrightarrow{CC'}$. Solving this condition for t gives the position of C″ as $\left(\frac{3+\sqrt{2}}{5},\ 1,\ \frac{2(2-\sqrt{2})}{5}\right)$. At this point the quadrilateral B″B′C′C″ has opposite obtuse angles, and quadrilateral AB″B′A′ has opposite angles = π/2, so it is in a bowtie configuration as in the end point configurations in figure 2.6. Therefore, all points AB″C″ can translate from this intermediate state to their final positions A′B′C′. The total distance traveled is θ + |AA′| + |CC′| + |B″B′|, or $D = 2 + \pi/4 + \sqrt{5} \approx 5.021$.
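The distances quoted for the two-link examples of sections 2.4.1 and 2.4.2 can be cross-checked numerically. The sketch below is an illustration, not the author's code, and the bead coordinates are assumptions: they are not given in the text and were inferred so as to be consistent with the sphere centers in eq. (2.26), the rotation axis in option 3, and the quoted distances. The helper names are hypothetical.

```python
import math

S2 = math.sqrt(2.0)

def sub(p, q):
    return tuple(a - b for a, b in zip(p, q))

def dot(p, q):
    return sum(a * b for a, b in zip(p, q))

def cross(p, q):
    return (p[1] * q[2] - p[2] * q[1],
            p[2] * q[0] - p[0] * q[2],
            p[0] * q[1] - p[1] * q[0])

def norm(p):
    return math.sqrt(dot(p, p))

def angle(p, q):
    c = dot(p, q) / (norm(p) * norm(q))
    return math.acos(max(-1.0, min(1.0, c)))  # clamp against round-off

def planar_example():
    """Section 2.4.1: (sub-minimal D, minimal D, tangency residual)."""
    # Assumed coordinates; A and A' both sit at the origin.
    B, C = (1.0, 0.0, 0.0), (1.0, 1.0, 0.0)
    Bp, Cp = (0.0, 1.0, 0.0), (1 / S2, 1 + 1 / S2, 0.0)
    # (a) B swings about axis AC to B', then C swings about B' to C'.
    mid_ac = tuple(0.5 * c for c in C)
    r_b, r_bp = sub(B, mid_ac), sub(Bp, mid_ac)
    d_sub = norm(r_b) * angle(r_b, r_bp) + angle(sub(C, Bp), sub(Cp, Bp))
    # (b) B follows a great circle about A to B'' while C moves straight
    #     to C'; B'' then swings about axis A'C' to B'.
    Bpp = (S2 - 1, 2 * (S2 - 1), math.sqrt(2 * (5 * S2 - 7)))
    mid = tuple(0.5 * c for c in Cp)
    r_bpp, r_bp2 = sub(Bpp, mid), sub(Bp, mid)
    d_min = (angle(B, Bpp) + norm(r_bpp) * angle(r_bpp, r_bp2)
             + norm(sub(Cp, C)))
    # Tangency check: circle tangent at B'' dotted with great-circle normal.
    residual = dot(cross(Cp, r_bpp), cross(B, Bpp))
    return d_sub, d_min, residual

def three_d_example():
    """Section 2.4.2, option 3: (critical angle, total distance)."""
    A, C = (0.0, 0.0, 0.0), (1.0, 1.0, 0.0)                   # assumed
    Ap, Bp, Cp = (0.0, 0.0, 1.0), (0.0, 1.0, 1.0), (0.0, 1.0, 2.0)

    def b_rotated(theta):
        return (math.cos(theta), math.sin(theta) / S2, math.sin(theta) / S2)

    def corner(theta):  # AB'' . B''B', which vanishes at the critical angle
        r = b_rotated(theta)
        return dot(r, sub(Bp, r))

    lo, hi = 0.0, math.pi / 2      # corner(lo) < 0 < corner(hi)
    for _ in range(60):            # plain bisection
        mid = 0.5 * (lo + hi)
        if corner(mid) < 0:
            lo = mid
        else:
            hi = mid
    theta = 0.5 * (lo + hi)
    total = (theta + norm(sub(Ap, A)) + norm(sub(Cp, C))
             + norm(sub(Bp, b_rotated(theta))))
    return theta, total

print(planar_example())
print(three_d_example())
```

Under these assumed coordinates the planar example returns distances close to the quoted 3.007 and 2.576, with a tangency residual at round-off level, and the 3-D example returns θ = π/4 with a total of 2 + π/4 + √5. The position of C″ is deliberately not computed here, since the total distance does not depend on it.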
The reverse of this transformation is also possible, where point B′ rotates about A′ in the plane B′AB, while C′ translates along $\overrightarrow{C'C}$. Inspection reveals the distance covered is the same as for the forward transformation.

Figure 2.14: (a) Sub-minimal transformation and (b) minimal transformation between ABC and A′B′C′ (see text). Axes: X, Y, Z.

2.5 Limit of large link number

From the transformation discussed in section 2.4.1, we saw that if both ∠ABC and ∠A′B′C′ were π/2, as in figure 2.15a, then the transformations in figures 2.13a and 2.13b became degenerate, having distance $D = \pi/\sqrt{2}$. The transformation is completed by a single rotation about axis 13. We can now examine the effect of increasing the number of links. Let the number of links increase to 4, and let us preserve the symmetry that is present about the horizontal axis in fig 2.15a, so the initial and final states become an octagon (figure 2.15b). In the limit, as N → ∞, the figure becomes a circle.

If we separated the links in figure 2.15a by some distance in the y direction (perpendicular to axis 13), then the minimal transformation involved the same rotation of 2 about axis 13 up to a critical angle θc, after which all three points 123 can translate in straight lines to 1′2′3′. In the same fashion, the minimal transformation for the octagonal configuration in fig 2.15b involves a rotation of point 3 out of the plane about axis 24 to a critical angle θc, at which the point is located at position 3″. Once this critical angle is reached, point 3 translates in a straight line from 3″ to 3′. Because points 1 and 5 are stationary to satisfy corner conditions, points 2 and 4 must move in great circles about points 1 and 5. However, points 2 and 4 cannot finish the transformation by moving on great circles.
At the configuration 1 2″ 3′ 4″ 5 in figure 2.15b, point 3 has finished the transformation, but points 2 and 4 have not. To satisfy corner conditions at the points 2″ and 4″, the great circles must be out of plane as well. At points 2″ and 4″, the transformation finishes with rotations about axes 1′3′ and 3′5′. The total distance is D ≈ 7.93. Of course, the time reverse of this transformation (equivalent to swapping primed and unprimed labels) is also a minimal transformation, as is the transformation obtained by reflection about the z = 0 plane.

Now consider increasing the chain to 6 links, so the combination of $\mathbf{r}_i(0)$ and $\mathbf{r}_i(T)$ becomes a dodecagon (12-sided polygon, see figures 2.15c–d). As before, the midpoint vertex (here $\mathbf{r}_4$) must rotate out of the plane about axis 35 to a critical angle θc before translating in a straight line to $\mathbf{r}_{4'}$. This critical angle is where $\overrightarrow{34''} \cdot \overrightarrow{4''4'} = \overrightarrow{54''} \cdot \overrightarrow{4''4'} = 0$. The quadrilaterals 22′3′3 and 655′6′ are of the type in figure 2.7, so point 3 must rotate about $\mathbf{r}_2(0)$ to a critical angle where $\overrightarrow{23''} \cdot \overrightarrow{3''3'} = 0$, and likewise for point 5. While point 3 rotates to its critical angle, point 4 translates along line 4″4′. Points $\mathbf{r}_1(0)$ and $\mathbf{r}_7(0)$ overlap with $\mathbf{r}_1(T)$ and $\mathbf{r}_7(T)$ and so remain fixed to satisfy corner conditions. After point 3 has reached its critical angle, it can translate along 3″3′ as point 2 rotates about $\mathbf{r}_1$. However, to satisfy corner conditions at point 2″, the rotation cannot remain in the x–y plane. Point $\mathbf{r}_{2''}$ is determined as the point where $\hat{\mathbf{t}} \cdot \mathbf{n}_{\rm plane} = 0$, where $\hat{\mathbf{t}}$ is the tangent to the arc 2″2′ defined by rotation about axis 13′, and $\mathbf{n}_{\rm plane}$ is the normal to the plane 122″, i.e., $\mathbf{r}_{2/1} \times \mathbf{r}_{2''/1}$. The same process holds for point 6. These critical points and some intermediate states for the transformation are shown in figure 2.15d. The total distance covered by the transformation is D ≈ 16.3.

Figure 2.15: Examples of transformations between initial and final states of opposite convexity, for increasing numbers of links. (a) illustrates the transformation for N = 2 links. (b) N = 4, and initial and final states form an octagon. (c,d) N = 6, and initial and final states form a dodecagon. (c) top view. (d) view in perspective. Rotations are shown as solid color lines (either green or blue). Translations are shown as dashed lines. The grey dashed lines underneath 3″3′ in (b) and 4″4′ in (d) are shown only to illustrate that those lines are above the plane.

It is sensible to consider the total length of chain as fixed, say to L = 1, and to let the link length $ds_N$ for the chain of N links be determined by $N\,ds_N = L$. Because distances scale as $ds_N^2$, the N = 2, 4, 6 cases have $D_2 \approx 0.555L^2$, $D_4 \approx 0.496L^2$, $D_6 \approx 0.445L^2$. Note that this distance decreases with increasing number of links: the constraints on the motion of the various beads during the transformation are relaxed as the number of links is increased. We can then imagine resting a piece of string on a table in the shape of a semi-circular arc, and then asking how one can move this string to a facing semicircle of opposite convexity. So long as the string has some non-zero persistence length $\ell_P$, the transformation of minimal distance must involve lifting the string off of the table to change its local convexity. The vertical height the string must be lifted (see fig 2.15d) is of order $\sim \sin(\pi \ell_P/L) \sim \ell_P/L$, which goes to zero for an infinitely long chain.
As the number of links N → ∞, some simplifications emerge. In particular the contribution to the total distance due to rotations becomes negligible, and the translational component dominates. To see this, note that the distance due to straight line motion scales as

$$D(\text{st. line}) \sim ds\,N\,L \sim L^2$$

while the distance traveled during rotations scales as

$$D(\text{rot.}) \sim ds\,N\,(\theta_c\,ds) \sim L^2/N$$

where we assume the worst case scenario, in which an extensive number of links must rotate before translating. Because translation dominates the distance as N → ∞, the distance traveled converges to L times the mean root square distance (MRSD), i.e.,

$$D_\infty \to ds \sum_{i=1}^{N+1} \left|\mathbf{r}_i(T) - \mathbf{r}_i(0)\right| = L\,\frac{1}{N}\sum_i \sqrt{(\mathbf{r}_{Bi} - \mathbf{r}_{Ai})^2} = L\,(\mathrm{MRSD}) . \tag{2.27}$$

The MRSD for the examples in figures 2.15b,d are 0.394 L and 0.400 L respectively, which are both less than the actual distances traveled (in units of L). In the limit N → ∞, where the polygon becomes a circle, the distance converges to $D_\infty = 4L^2/\pi^2 \approx 0.4053L^2$. For large N systems then, it is a good first approximation to use MRSD for the distance.

The MRSD is always less than the root mean square distance (RMSD), except in special cases when they are equal. To see this, we can apply Hölder's inequality

$$\sum_{k=1}^{N} (g_k)^\alpha (h_k)^\beta \le \left(\sum_{k=1}^{N} g_k\right)^{\alpha} \left(\sum_{k=1}^{N} h_k\right)^{\beta}$$

where $g_k, h_k \ge 0$, $\alpha, \beta \ge 0$, and $\alpha + \beta = 1$. With the specific identifications $g_k = (\mathbf{r}_{Bk} - \mathbf{r}_{Ak})^2 \equiv \Delta r_k^2$, $h_k = 1$, and $\alpha = \beta = 1/2$, we have directly

$$\frac{1}{N}\sum_k \sqrt{\Delta r_k^2} \le \sqrt{\frac{1}{N}\sum_k \Delta r_k^2} .$$

For example, the RMSD for the circle configuration discussed above is $\sqrt{2}L/\pi \approx 0.4502L$, which is greater than the MRSD. The fact that the distance converges for large N to MRSD rather than RMSD suggests that RMSD may not be the best metric for determining similarity between molecular structures, although it is ubiquitously used.
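The MRSD and RMSD values quoted for the polygon examples can be reproduced with a few lines. The sketch below is illustrative only: it assumes the initial and final chains are the two halves of a regular 2N-gon of side L/N that share their end beads, and it normalizes both averages by N, following the 1/N prefactor in eq. (2.27). The function name is hypothetical.

```python
import math

def polygon_mrsd_rmsd(n_links, total_length=1.0):
    """MRSD and RMSD between the two halves of a regular 2N-gon.

    Assumptions: each half is a chain of n_links links of length
    total_length / n_links, the halves share their end beads, and both
    averages are divided by N as in eq. (2.27).
    """
    ds = total_length / n_links
    radius = ds / (2.0 * math.sin(math.pi / (2 * n_links)))  # circumradius
    # Bead k sits at polar angle k*pi/N; its mirror image across the axis
    # through the shared end beads lies 2 R sin(angle) away.
    gaps = [2.0 * radius * math.sin(k * math.pi / n_links)
            for k in range(n_links + 1)]
    mrsd = sum(gaps) / n_links
    rmsd = math.sqrt(sum(g * g for g in gaps) / n_links)
    return mrsd, rmsd

for n in (4, 6, 10000):
    mrsd, rmsd = polygon_mrsd_rmsd(n)
    print(n, round(mrsd, 4), round(rmsd, 4))
```

For N = 4 and N = 6 this gives MRSD values close to the quoted 0.394 L and 0.400 L, and for large N the two metrics approach 4L/π² and √2 L/π, with MRSD below RMSD in every case, as the Hölder argument requires.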
This observation warrants future investigation: it has implications in research areas from structural alignment based pharmacophore identification [49, 75, 103] to protein structure and function prediction [6, 47]. MRSD has a simple intuitive physical meaning: the MRSD between two structures gives the average distance each residue in one structure would have to travel on a straight line to get to its counterpart in the other structure (fig 2.16).

Figure 2.16: The MRSD is the average length of the black line segments between corresponding residues of the initial and final configuration. Image adapted from [91].

Figure 2.17: The MRSD and RMSD between the two curves are close to zero (the curves in this figure are displaced for better viewing but should be imagined to be superimposed). However, because the curve cannot pass through itself, in order to undergo the transformation, one leg must undergo relatively large amplitude motions to travel from one conformation to another. This results in a non-zero distance between the conformations by accurate metrics that can account for non-crossing. Image adapted from [91].

This interpretation of MRSD points to a shortcoming of both MRSD and RMSD, namely that neither accounts for chain non-crossing constraints. Consider the two curves depicted in fig 2.17, which differ by having opposite sense of underpass/overpass. When both curves are aligned by minimizing MRSD or RMSD, the respective values are almost zero. However, the physically relevant distance for one conformation to transform to the other is much larger, and must involve one arm of the backbone circumventing the other as it moves between conformations.

It was shown in [109] that chains with persistence length characterized by some radius of curvature R have extensive corrections to the MRSD-derived minimal distance, which do not vanish as N → ∞, but remain so long as R/L is nonzero.
Likewise, chains that cannot cross themselves have non-local EL equations and extensive corrections to the minimal distance. Nevertheless, it is worthwhile to investigate some more complex polymers with MRSD as an approximate distance metric. We pursue this in the next section.

2.5.1 MRSD as a metric for protein folding

Here we examine the use of MRSD as a metric or order parameter for protein folding. To this end we adopt an unfrustrated Cα model of segment 84–140 of src tyrosine-protein kinase (src-SH3), by applying a Gō-like Hamiltonian [23, 123, 129] to an off-lattice coarse-grained representation of the src-SH3 native structure (PDB: 1FMK). Amino acids are represented as single beads centered at their Cα positions. The Gō-like energy of a protein configuration α is given by the following Hamiltonian, which we will explain term by term:

$$
\begin{aligned}
H(\alpha|N) ={}& k_r \sum_{\rm bonds} (r_\alpha - r_N)^2 + k_\theta \sum_{\rm triples} (\theta_\alpha - \theta_N)^2 \\
&+ \sum_{n=1,3} k_\phi^{(n)} \sum_{\rm quads} \left[1 - \cos\left(n \times (\phi_\alpha - \phi_N)\right)\right] \\
&+ \epsilon_N \sum_{j \ge i+3}^{\rm N} \left[ 6\left(\frac{\sigma_{ij}}{r_{ij}}\right)^{10} - 5\left(\frac{\sigma_{ij}}{r_{ij}}\right)^{12} \right] + \epsilon_{NN} \sum_{j \ge i+3}^{\rm NN} \left(\frac{\sigma_{ij}}{r_{ij}}\right)^{12} .
\end{aligned} \tag{2.28}
$$

Adjacent beads are strung together into a polymer through harmonic bond interactions that preserve native bond distances between consecutive Cα residues. Here $r_\alpha$ and $r_N$ represent the distances between two subsequent residues in configuration α and in the native state N. As with other parameters in the Hamiltonian, the distances $r_N$ are based on the PDB structure and may vary between pairs. The angles $\theta_N$ represent the angles formed by three subsequent Cα residues in the PDB structure, and the angles $\phi_N$ represent the dihedral angles defined by four subsequent residues. The dihedral potential consists of a sum of two terms, one with period 2π and another with period 2π/3, which give cis and trans conformations for angles between successive planes of three amino acids, with a global dihedral potential minimum at $\phi_N \in [-\pi, \pi]$.
The parameters $k_r$, $k_\theta$, and $k_\phi$ are taken to accurately describe the energetics of the protein backbone: we used the values $k_r = 50$ kcal/mol, $k_\theta = 20$ kcal/mol, $k_\phi^{(1)} = 1$ kcal/mol and $k_\phi^{(3)} = 0.5$ kcal/mol for molecular dynamics (MD) simulations using the AMBER software package [104]. For MD simulations using LAMMPS [107], we used slightly different values: $k_r = 80$ kcal/mol, $k_\theta = 16$ kcal/mol, $k_\phi^{(1)} = 0.8$ kcal/mol and $k_\phi^{(3)} = 0.4$ kcal/mol.

The last line in equation (2.28) deals with non-local interactions, both native and non-native. If two amino acids are separated by 3 or more along the chain (|i − j| ≥ 3), and have one or more pairs of heavy atoms within a cut-off distance of $r_c = 4.8$ Å in the PDB structure, the amino acids are said to have a native contact. Then the respective coarse-grained Cα residues are given a Lennard-Jones-like 10-12 potential of depth $\epsilon_N = -0.6$ kcal/mol (−0.8 kcal/mol for LAMMPS simulations) and a position of the potential minimum equal to the distance of the Cα atoms in the PDB structure. That is, $\sigma_{ij}$ is taken equal to the native distance between Cα residues i and j if i–j have a native contact. If two amino acids are not in contact, their respective Cα residues sterically repel each other ($\epsilon_{NN} = +0.6$ kcal/mol). Thus $\epsilon_{NN} = 0$ if i–j is a native residue pair, while $\epsilon_N = 0$ if i–j is a non-native pair. For non-native residue pairs, $\sigma_{ij} = 4$ Å.

In an arbitrary configuration α, two Cα residues i and j are considered to have formed a native contact if they have a distance $r_{ij} \le 1.2\,\sigma_{ij}$. The results of MD simulations do not strongly depend on the specific value of this cutoff. The fraction of native contacts present in the particular configuration α is then defined as Q (or $Q_\alpha$). The MRSD of configuration α is found by aligning this configuration to the native structure, by minimizing MRSD over 3 translational and 3 rotational degrees of freedom.
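The non-local terms and the order parameter Q can be illustrated with a small sketch. This is not the simulation code used for the results; it is a toy illustration with hypothetical function names. It assumes the native 10-12 term has the form $\epsilon_N[6(\sigma/r)^{10} - 5(\sigma/r)^{12}]$ with $\epsilon_N < 0$, so that the pair energy equals $\epsilon_N$ at the minimum r = σ, and it uses the AMBER parameter values quoted above.

```python
import math

EPS_NATIVE = -0.6       # kcal/mol, depth of the native 10-12 well (AMBER runs)
EPS_NONNATIVE = 0.6     # kcal/mol, strength of the non-native repulsion
SIGMA_NONNATIVE = 4.0   # Angstrom, sigma_ij for non-native pairs
CONTACT_CUTOFF = 1.2    # a contact counts as formed if r_ij <= 1.2 sigma_ij

def native_pair_energy(r, sigma, eps=EPS_NATIVE):
    """10-12 energy of a native pair (assumed form; minimum eps at r = sigma)."""
    x = sigma / r
    return eps * (6.0 * x ** 10 - 5.0 * x ** 12)

def nonnative_pair_energy(r, sigma=SIGMA_NONNATIVE, eps=EPS_NONNATIVE):
    """Purely repulsive r^-12 energy of a non-native pair."""
    return eps * (sigma / r) ** 12

def fraction_native_contacts(coords, native_pairs, cutoff=CONTACT_CUTOFF):
    """Q: fraction of native pairs (i, j) -> sigma_ij that are in contact.

    coords is a list of (x, y, z) C-alpha positions; native_pairs maps
    index pairs with |i - j| >= 3 to their native distances sigma_ij.
    """
    formed = sum(1 for (i, j), sigma in native_pairs.items()
                 if math.dist(coords[i], coords[j]) <= cutoff * sigma)
    return formed / len(native_pairs)

# Toy six-bead hairpin (made-up coordinates, for illustration only).
beads = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0),
         (7.6, 3.8, 0.0), (3.8, 3.8, 0.0), (0.0, 3.8, 0.0)]
contacts = {(0, 5): 3.8, (1, 4): 3.8, (0, 4): 4.0}
print(fraction_native_contacts(beads, contacts))
print(native_pair_energy(3.8, 3.8))
```

In the toy configuration two of the three listed pairs lie within 1.2σ, so Q = 2/3. Because Q depends only on pair distances, it takes the same value for a structure and its mirror image, which is the limitation of Q that the free energy surfaces discussed next bring out.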
Constant temperature molecular dynamics simulations were run for this system using both the AMBER and LAMMPS simulation packages, by other members of the Plotkin research group. The version of LAMMPS that was used for our simulation suffered from a bug, wherein different chiralities of dihedral angles were not energetically distinguished. This bug has been fixed in later versions of LAMMPS. Thus, these results show heuristically one of the arguable shortcomings of the order parameter Q, namely the failure to distinguish between two mirror configurations.

The probability p(Q, MRSD) for the system to have values of Q and MRSD within (Q, Q + ΔQ) and (MRSD, MRSD + ΔMRSD) is proportional to the Boltzmann factor exp[−F(Q, MRSD)/k_B T] of the free energy F(Q, MRSD). Thus the free energy can be directly obtained by sampling, binning, and taking the logarithm:

\[
F(Q_1, \mathrm{MRSD}_1) - F(Q_2, \mathrm{MRSD}_2) = -k_B T \log\left( \frac{p(Q_1, \mathrm{MRSD}_1)}{p(Q_2, \mathrm{MRSD}_2)} \right)
\tag{2.29}
\]

with F(1, 0) = E_N, the energy of the native structure.

Figure 2.18 shows the free energy surfaces obtained using the above recipe, for the AMBER (fig 2.18a) and LAMMPS (fig 2.18b) molecular dynamics routines. The temperature is taken to be the transition or folding temperature T_F, where the unfolded and folded free energies are equal. Notice that F(Q) is comparable for both, as it should be, as is F(MRSD). However, the free energy surface plotted as a function of both Q and MRSD shows a marked difference. In addition to a native minimum, the LAMMPS routine has an additional minimum at Q ≈ 0.95 and MRSD ≈ 8.4. The conformational states in this bin are closely related, with an average MRSD between them of 1.8 Å. We can take the most representative state in this bin (at Q ≈ 0.95, MRSD ≈ 8.4) as that which has a minimum average MRSD from all the others in the bin:

\[
\min_i \left( \sum_{j \neq i} \mathrm{MRSD}_{ij} \Big/ \sum_{j \neq i} 1 \right) \approx 1.6\ \text{Å}.
\]

Inspection reveals that this state is a mirror image of the PDB structure (see fig 2.18b): if we reflect this structure about one plane, and subsequently align this reflected structure to the PDB one, the MRSD is only 1.1 Å.

The discrepancy in free energy surfaces, corresponding to the presence of a low energy mirror-image structure, arises because the COMPASS class 2 dihedral potentials in the LAMMPS algorithm did not ascribe a sign to the angle φ, so the full range [−π, π] is projected onto [0, π]. This gives the set of actual dihedral angles {φ_i + π} the same energy as the set {φ_i}, so that the dihedral potentials have two minima rather than one, and thus a protein chain of the opposite chirality (a mirror image) is allowed and has the same energy as the PDB structure. We found that the CHARMM and harmonic dihedral styles do not have this problem; however, they have less versatile functional forms, so we favored modifying the COMPASS dihedrals to define φ over its full range.

2.6 Conclusions

Analogously to the distance between two points, the distance between two finite length space curves is defined using a variational problem, and may be calculated by minimizing a functional of 2 independent variables s and t, where s is the arc-length along the chain, and t is the 'elapsed time' during the transformation. We derived the Euler-Lagrange (EL) equation giving the solution to this problem, which is a vector partial differential equation, with extremal solution r*(s, t). We also derived sufficient conditions for the extremal solution to be a minimum, through the Jacobi equation. Once the minimal transformation r*(s, t) is known, the distance D* ≡ D[r*] follows. We provided a general recipe for the solution to the EL equation, using the method of lines. The resulting N + 1 EL equations for the discretized chain are ODEs that can be interpreted geometrically and solved for minimal solutions.
Figure 2.18: Free energy surfaces for the folding of Gō-model src-SH3 using two molecular dynamics simulation packages, AMBER (a) and LAMMPS (b). The contour plots give F(Q, MRSD). The projections F(Q) and F(MRSD) are also shown on each side. The COMPASS class 2 dihedral potential in LAMMPS allows for a mirror image of the folded structure (red color structure in inset) that is not immediately evident from the F(Q) or F(MRSD) surfaces. Later implementations of LAMMPS using COMPASS dihedrals for biomolecular simulations have corrected for dihedral angles defined on the interval [−π, π].

Solutions consist generally of rotations and translations pieced together so the direction of velocity of any link end point does not suddenly change (the Weierstrass-Erdmann corner conditions). We explored the minimal transformations for the simplest polymers, consisting of 1 or 2 links, in depth. For transformations between 2 links, convexity becomes an issue (the analog to the direction of the radius of curvature for a continuous string). For example, even if the initial and final states lie in the same plane, if the convexities of these states are of opposite sign the transformation must pass through intermediate states that are out of the plane. Similarly, given a semicircular piece of string lying on a table, to move it to a semicircle of opposite convexity using the minimal amount of motion, the string must be lifted off the table. In the limit of a large number of links, some simplifications emerge.
For chains without curvature or non-crossing constraints, the distance converges to L times the mean root square distance (MRSD) of the initial and final conformations. So for example, the distance between two strings of length L forming the top and bottom halves of a circle respectively is 4L²/π², the distance between horizontal and vertical straight lines of length L which touch at one end is L²/√2, and the distance to fold a straight line upon itself (to form a hairpin) is L²/4. The fact that for large N the distance (over L) converges to MRSD rather than RMSD suggests that RMSD may not be the best metric for determining similarity between molecular structures, although it is ubiquitously used. Adopting MRSD may lead to improvements in structural alignment algorithms.

The MRSD was investigated as an approximate metric for protein folding. Free energy surfaces for folding were constructed for two simulation packages, AMBER and LAMMPS. It was found that including MRSD as an order parameter uncovered discrepancies between the two molecular dynamics algorithms. Because dihedral angles in LAMMPS (at least in COMPASS class 2 style) are only defined on [0, π], the potential admits a mirror image structure degenerate in energy with the native structure. This is easily remedied and should not be interpreted as a deficiency in the LAMMPS simulation package, so long as one is aware of it. It should be mentioned that the mirror-image structure would also have been seen, had RMSD been used as an additional order parameter.

In subsequent chapters, we will focus on applications of the above concepts in protein folding and structural alignment. In particular, in chapter 3, we will apply the principles developed here to find minimal folding pathways for protein fragments, and briefly touch on the idea of non-crossing constraints. The use-case in structural alignment will be discussed in chapter 4.
In chapter 5, we give a systematic treatment of non-crossing and in chapter 6 we investigate whether the distance D can be a predictor of folding kinetics.

Chapter 3

Minimal folding pathways for coarse-grained biopolymer fragments

In this chapter we apply the concept of generalized distance, introduced before, to find minimal folding pathways for several candidate protein fragments, including the helix, the β-hairpin, and a non-planar structure where chain non-crossing is important. Comparing the distances traveled with root mean-squared distance (RMSD) and mean root-squared distance (MRSD), we show that chain non-crossing can have large effects on the kinetic proximity of apparently similar conformations. Furthermore we see that structures that are aligned to the β-hairpin by minimizing MRSD show a globally different orientation than structures aligned by minimizing RMSD.

3.1 Introduction

In section 1.3, we reviewed two of the most common order parameters used in protein folding. While the utility of simple order parameters is indisputable, it is easy to see that even for simple structures they can lead to inaccurate measurements of native proximity. For example, a β-hairpin that is only slightly expanded beyond the range of its hydrogen bonds is essentially committed to fold, but would have a Q value near zero (see figure 1.4). Comparing two conformations of a piece of polymer chain that crosses either over itself or under itself would give an RMSD that could be quite small. The amount of motion the polymer would have to undergo to transform from one conformation to the other while respecting the non-crossing constraint, however, would have to be comparatively large (see figure 2.17). Here we propose D as an order parameter to capture the complexities of biomolecular folding. This distance depends only on the geometry of the initial and final configurations.
The minimal distance transformation between an initial polymer conformation A and the folded or native conformation N can be thought of as an optimal folding pathway that is the most direct route from A to N. Of course, the actual trajectory is a stochastic one. It is interesting to ask whether the typical or average dynamical trajectory resembles the minimal one after suitable averaging, but we do not answer this question here. Interaction energies in the system will certainly modify the weights of reactive trajectories, making some trajectories preferred over others. On the other hand, much of the folding mechanism is thought to be insensitive to specific sequence details [86], and depends more on the geometry of the native structure and its resultant topology of interactions [5].

A direct application of the minimal folding path to a full protein is an important future goal. We will address approximations to this problem in chapter 5. In this chapter, we take a more bottom-up, modular approach, and apply the minimal distance transformation to various representative protein fragments and construct exact solutions. In particular, we investigate the minimal folding pathways for a β-sheet, an α-helix, and an overpass-underpass problem, where chain non-crossing is important.

3.2 Methods

We refer to the transformation between structures A and N that minimizes the distance functional in Eq. 2.5 as the minimal transformation or optimal folding pathway. Solving the equations of motion for the discretized version gives solutions for straight line motions of the beads, preceded or followed by local intensive rotations as we saw earlier.

3.2.1 Representative protein fragments

As an example protein domain to which we apply our methods, we choose residues 99–153 in regulatory chain B of Aspartate Carbamoyltransferase [48] (PDB code 1AT1, see Fig. 3.1).
From this domain, we select three fragments for investigation, as representatives of some commonly found secondary and tertiary structures:
• The β-hairpin containing β-strands 2 and 3, residues 126–137.
• The C-terminal α-helix, residues 147–151.
• The β-strand 1-turn-strand 2 tertiary motif, residues 101–130.

Figure 3.1: Residues 99–153 in regulatory chain B of Aspartate Carbamoyltransferase [48] (PDB code 1AT1) are chosen for analysis. From this domain we select three fragments for investigation. Two are outlined in dashed boxes: β-hairpin residues 126–137, and α-helix residues 147–151. The strand 1-turn-strand 2 tertiary motif, residues 101–130, is also used to investigate the importance of non-crossing.

We investigate an overpass/underpass problem for a simplified version of segment 3 for which chain non-crossing is important. The polymer fragments are coarse-grained by taking the Cα atom to represent each residue. The Cα–Cα distances in our fragments are sharply peaked: |r_{i+1/i}| = (3.81 ± 0.04) Å. We do not change the values present in the PDB structure: they are held fixed during the transformation. We investigate the minimal distance transformations between extended states of the polymer and the above secondary structures. Extended states are constructed as follows. For the β-hairpin, we rotate the chain about the positions of Cα(132) and Cα(133) so that the initial state is an extended linear strand (Fig. 3.2 b). For the α-helix, we take the simplified case of a straight line for the initial condition. For the over/under problem we imagine a scenario where the β-sheet in Fig. 3.3a is unformed, and the polymer chain involved in the turn has crossed under rather than over β-strand 2. The two configurations have the opposite sense, in that the chain must cross over itself (or go over the top or the bottom of the structure) to form the correct tertiary structure (Fig. 3.3b). Alternatively, β-strands 2 and 3 in Fig.
3.1 may cross over β-strand 1 to solve the underpass-overpass problem, but this would involve larger-scale motion, that is, a larger distance traveled. A stereo view of initial and final states for such a scenario is shown in Fig. 3.3 b. We ask: What is the minimal distance pathway for conversion between these two structures? To make the problem more amenable to analysis, we simplify the structures in the spirit of lattice models, as shown in Fig. 3.3 c. The initial and final conditions are regular and symmetric, but intermediate configurations can be anywhere so long as they are consistent with the constraints of constant link length and non-crossing (i.e., they can be off-lattice).

Figure 3.2: (a) β-hairpin fragment, with all-atom and coarse-grained Cα representations superposed. (b) The extended initial state.

Figure 3.3: (a) Residues 101–130 of Aspartate Carbamoyltransferase can be taken as an example of an overpass/underpass problem where chain non-crossing is important. (b) Conformation of the segment in panel a with the β-sheet unformed. Both initial and final structures (with opposite over/under sense) are superposed in this stereo view. (c) A simplified model to capture the essence of the underpass-overpass problem. Both initial and final states are shown as viewed from above. Residues 1–8 must transform to residues 1′–8′, but cannot pass through the obstacle marked with a circled X, representing a long piece of polymer normal to the plane of the figure.

Figure 3.4: Illustration of the general recipe for obtaining minimal pathways.

3.2.2 Construction of minimal pathways

Minimal folding trajectories are constructed by the recipe described in chapter 2 (Fig. 3.4). The basic recipe is as follows. First we take the coordinate of one Cα residue, say r(Cα_i) in the unfolded conformation, then we imagine rotating r(Cα_i) about r(Cα_{i−1}).
The protein backbone is treated approximately as a freely jointed chain to carry out this procedure. All possible rotations of Cα_i about Cα_{i−1} form a sphere of radius |r(Cα_i) − r(Cα_{i−1})|. A cone is drawn from the final position of Cα_i, i.e., r_FOLDED(Cα_i) in the folded structure, to be tangent to this sphere. In general, one particular direction will have the minimal amount of rotation before proceeding in a straight line to r_FOLDED(Cα_i). The arc of the great circle along this direction is then chosen as part of the minimal trajectory for residue i.

3.2.3 RMSD and MRSD

To review, in the limit of long polymer chains and in the absence of non-crossing, the distance accumulated by rotation of each link before translating gives a negligible contribution to the total distance, and the total distance traveled converges to the chain length L times the mean root-square distance (MRSD), i.e., for two structures A and B,

\[
\lim_{N \to \infty} D = L \times \frac{1}{N} \sum_i \sqrt{(\mathbf{r}_{Bi} - \mathbf{r}_{Ai})^2} = L \times (\mathrm{MRSD}).
\tag{3.1}
\]

As we saw, the MRSD is never larger than the RMSD often used for structural comparison. Which of these quantities provides more accuracy for structural alignment is still an open question, although the MRSD may be less sensitive to large fluctuations of a subset of points. To investigate the sensitivity of MRSD versus RMSD to perturbations in a residue's position, note that the change in RMSD with respect to moving one residue an amount δr_{Ai} is

\[
\frac{\delta(\mathrm{RMSD})}{\delta r_{Ai}} \approx \frac{1}{N} \frac{|\mathbf{r}_{Ai} - \mathbf{r}_{Bi}|}{\mathrm{RMSD}},
\]

while the change in MRSD with respect to moving one residue an amount δr_{Ai} is

\[
\frac{\delta(\mathrm{MRSD})}{\delta r_{Ai}} \approx \frac{1}{N}.
\]

So if residue i has a structural discrepancy larger than the average as measured by RMSD, changes in RMSD with respect to this residue's position will be larger than those for MRSD. Unfolded conformations were aligned to folded structures by minimizing MRSD and RMSD, and minimal transformations constructed for these conformation pairs.
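The two metrics and their different sensitivity to a single outlying residue can be illustrated with a short Python sketch. It compares two bead lists as given, without performing the rotational and translational alignment used in the text. The function names and the toy coordinates are hypothetical.

```python
import math

def rmsd(a, b):
    """Root mean-squared distance between corresponding beads."""
    return math.sqrt(sum(math.dist(p, q) ** 2 for p, q in zip(a, b)) / len(a))

def mrsd(a, b):
    """Mean root-squared distance: the plain average of bead displacements."""
    return sum(math.dist(p, q) for p, q in zip(a, b)) / len(a)

# Ten beads: nine are displaced by 1 Angstrom, one outlier by 11 Angstroms.
a = [(float(i), 0.0, 0.0) for i in range(10)]
b = [(float(i), 1.0, 0.0) for i in range(9)] + [(9.0, 11.0, 0.0)]

n = len(a)
# Analytic sensitivities to moving the outlier, from the expressions above.
rmsd_sensitivity = math.dist(a[-1], b[-1]) / (n * rmsd(a, b))
mrsd_sensitivity = 1.0 / n
print(mrsd(a, b), rmsd(a, b), rmsd_sensitivity, mrsd_sensitivity)
```

In this toy case the MRSD is 2 Å while the RMSD is √13 ≈ 3.6 Å, and the RMSD responds about three times more strongly than the MRSD to a small move of the outlying bead.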
For the β-hairpin, the conformation pairs were observed to be globally different depending on whether the alignment cost function was MRSD or RMSD.

3.3 Results

3.3.1 β-hairpin

We coarse-grain the fragment containing residues 126–137 by considering only the Cα atoms (see Fig. 3.2 a). We consider folding to this structure from an extended state. The extended state is obtained by two rotations about residues 132 and 133, which extend the hairpin out to a quasilinear strand (the extended state in Fig. 3.2 b). This initial extended state is aligned to the final structure in four different ways:
1) One strand of the hairpin is directly aligned to the corresponding residues of the extended state (Fig. 3.5, a and b).
2) The center links of the hairpin and extended state are directly aligned to each other (Fig. 3.5 c).
3) The initial position/orientation of the extended state is found by minimizing the MRSD between the two coarse-grained Cα structures (hairpin and extended state) in Fig. 3.2, a and b (Fig. 3.5 d, blue extended strand).
4) The initial position/orientation of the extended state is found by minimizing the RMSD between the two coarse-grained Cα structures (hairpin and extended state) in Fig. 3.2, a and b (Fig. 3.5 d, teal extended strand).

From these initial states, we have found minimal folding trajectories consisting of rotations and subsequent translations of the residues (or vice versa) as described in section 3.2. To gain intuition for the transformations from the MRSD and RMSD aligned structures, we also considered minimal transformations from an idealized straight-line structure to an idealized β-hairpin, whose initial and final states are shown in Fig. 3.5 e. The distances for all the β-hairpin transformations, along with numbers for the RMSD and MRSD for the same transformations, are given in Table 3.1. The resulting transformations for the above boundary conditions are shown in Fig. 3.5, a–c, and f–i.
As described in section 3.2, the minimal folding pathways proceed by forming kinks or soliton-like waves that propagate along the backbone. The soliton-like object consists of a rotation of a bead until the link containing that bead reaches a critical angle. The bead subsequently translates until it reaches its final position.

For the idealized straight-line to β-hairpin transformation, the MRSD and RMSD aligned structures are globally different (Fig. 3.5 e). The MRSD between the two aligned straight-line structures is 15.39 Å, larger than the MRSD of either structure to the folded hairpin state (Table 3.1). The transformation from the RMSD-aligned line involves predominantly straight-line motion from the line to the hairpin (Fig. 3.5 f). Only ∼0.1% of the distance corresponds to rotational motion. The transformation from the MRSD-aligned line involves both rotations and translations, as shown in Fig. 3.5 g. This gives the MRSD-aligned pair a distance only marginally smaller (0.4%) than the RMSD-aligned pair (Table 3.1), even though the transformations have different initial states and very different character.

Figure 3.5: Minimal transformations to the β-hairpin. Distances are listed in Table 3.1. (a) Folding pathway in which one strand of the hairpin can be thought of as peeling away by rotations of the links to various critical angles, which are then followed by subsequent translations into their final positions. (b) A minimal pathway that can be thought of as involving kink propagation or peeling away from the extended strand, followed by translation of the links into their final positions in the β-hairpin. (c) A zippering mechanism, in which we have aligned the middle link of the hairpin and sought the minimal distance transformation. The distance here is somewhat larger than the distance for the transformations in panels a and b. (d) The extended strand is aligned to the β-hairpin by minimizing RMSD (blue), or minimizing MRSD (teal). (e) Idealized version of the extended strand and β-hairpin. The extended strand is again aligned to the β-hairpin by minimizing RMSD (blue), or minimizing MRSD (teal). (f) Transformation for the idealized β-hairpin, for RMSD-aligned structures. Initial state is blue, final state is red, and intermediate state is in green. (g) Transformation for the idealized β-hairpin, for MRSD-aligned structures. (h) RMSD-aligned transformation between the extended strand (blue) and β-hairpin (red). An intermediate state is shown in green. (i) MRSD-aligned transformation between the extended strand (blue) and β-hairpin (red). An intermediate state is shown in green.

Table 3.1: Values of the distance for various protein backbone fragments, as compared to other metrics

Backbone conformation            | Figure  | D/(Nℓ)* | RMSD   | MRSD
β-Hairpin (half-aligned)         | 3.5 a   | 10.372  | 15.538 | 9.926
β-Hairpin (half-aligned)         | 3.5 b   | 10.372  | 15.538 | 9.926
β-Hairpin (zipper)               | 3.5 c   | 12.787  | 13.560 | 11.317
β-Hairpin (RMSD-aligned)         | 3.5 h   | 9.749   | 10.501 | 9.730
β-Hairpin (MRSD-aligned)         | 3.5 i   | 10.277  | 12.681 | 9.412
Ideal β-hairpin (RMSD-aligned)   | 3.5 f†  | 12.25   | 13.24  | 12.24
Ideal β-hairpin (MRSD-aligned)   | 3.5 g†  | 12.18   | 16.31  | 11.27
α-Helix (MRSD aligned)           | 3.6 b   | 3.595   | 3.954  | 3.577
α-Helix (1-link aligned)         | 3.6 c   | 4.675   | 5.805  | 4.233
Over/under (non-crossing)        | 3.7†    | 13.991  | 6.173  | 5.239

* Distance D is divided by N times the link length ℓ, so that all quantities in the table have units of Å.
† D is put in the same units as the above transformations, i.e., we take ℓ = 3.81 Å for the link length.

For the real β-hairpin and extended state, the transformations are reminiscent of the ideal case. The MRSD and RMSD aligned structures are globally different, as shown in Fig. 3.5 d. The MRSD between the two aligned extended structures is 9.83 Å, which is again larger than the MRSD of either structure to the folded hairpin state (Table 3.1).
The MRSD-aligned pair has a distance 17% different than the ideal case and the RMSD-aligned pair has a distance 23% different than the ideal case. Fig. 3.5, h and i, depict the transformations for RMSD- and MRSD-aligned pairs, respectively. For the real β-hairpin, the RMSD-aligned extended state has a smaller distance than the MRSD-aligned extended state by ∼5%, i.e., the scenario present in the idealized case is reversed, somewhat surprisingly. This indicates that the aligned structures obtained by minimizing the actual distance need not resemble those structures obtained by either the RMSD or MRSD alignments. An alignment algorithm for general structures using distance D as a cost function is a nontrivial problem that we reserve for future work. However, we will discuss a simple case in chapter 4.

Figure 3.6: (a) Single α-helix of five residues 147–151 taken from PDB 1AT1. (b) Minimal pathway to fold the α-helix (red), from a straight line initial state which has been aligned by minimizing MRSD (shown in blue, see text for description). A conformation partway through the transition is shown in green. (c) Minimal pathway to fold the helix from a straight-line initial conformation with its second link directly aligned to the second link of the helix. Distances for both transformations are given in Table 3.1. We emphasize that this is a hypothetical, idealized transformation that is not realizable for the physical chain.

We note that the above transformations will not all have the same energy gain as they fold. The transformations in Fig. 3.5 a and c are similar in the main to the energetically driven zippering and assembly mechanisms of conformational search proposed by Ozkan et al. [100]. A folding pathway similar to the transformation in Fig. 3.5 b would not have concurrent energy gain and so would be less likely thermodynamically. To implement the transformation shown in Fig.
3.5 c, the construction described in section 3.2 above and shown in the figure is only approximately correct, to 1%. Finding an exact minimal solution involves generalizing the methodology to allow for concurrent rotations of two links about a central axis, as described in more detail in Sections 2.3 and 2.4.

3.3.2 α-helix

We coarse-grain the helical fragment containing residues 147–151 by considering only the Cα atoms (see Fig. 3.6 a). We consider folding to this structure from an extended state. The extended state is taken for simplicity to be a straight line. Of course more realistic extended conformations could be taken, but would give minor quantitative corrections to the numbers we obtain. We consider two different initial conditions for the straight line, one where link 2 is exactly aligned with link 2 of the α-helix (Fig. 3.6 c), and one where the straight line is aligned to the helix by minimizing the MRSD. This initial condition is such that the straight line threads the helix (Fig. 3.6 b). The aligned unfolded structure obtained by minimizing RMSD is similar in this case: the MRSD between the two aligned structures is only 1.53 Å. From these initial states, we found minimal folding trajectories consisting of rotations and subsequent translations from the straight-line conformation to the helix. Fig. 3.6 b shows a minimal folding pathway to the α-helix. An intermediate conformation (partway through the transition) is shown in green. The distance traveled after minimizing MRSD is indeed less than the distance after alignment of one link. For both of these transformations, the distances traveled per residue are less than the corresponding distance per residue for the β-hairpin transformations.

3.3.3 Crossover structure

The fact that the polymer chain cannot cross itself is represented by inequality constraints in the equations of motion.
We introduce the methods for solution of variational problems with inequality constraints in Appendix E. The upshot is that the minimal distance problem is a free problem until a residue on the chain touches the obstacle. At that point the residue is constrained to be on the surface of the obstacle and the trajectory is defined accordingly. Eventually the particle or residue leaves the surface, and the problem becomes a free problem once again, as the particle moves to its final position. The transformation is then piecewise, consisting of three pieces, and at the interface between the pieces, the corner conditions (Eq. 6) must hold.

The initial and final conditions of an idealized non-crossing chain are shown in Fig. 3.3 c. In our problem of chain non-crossing, the obstacle is an effectively infinite line, normal to the plane of Fig. 3.3 c (marked by a circled X), so residues only need to touch that point before proceeding to their final position. In this treatment residues are treated asymmetrically, in that one part of the chain has steric hindrance along bonds, while another only has steric hindrance for the masses or beads at the termini of bonds. This approximation is made to simplify the transformation, and because the resulting distance only differs by a small finite-size effect from the distance obtained by employing links for all parts of the chain.

We found a solution that fully satisfies the Euler-Lagrange (EL) equations (Eqs. 5a–5c) and whose corner conditions satisfy Eq. 6. According to the analysis in Appendix A, this class of solutions is at least a local minimum. It involves the propagation of a kink starting at the end of the chain, in which the chain proceeds snakelike over the obstacle and then back down to its final position, and so is intuitively reasonable. The distance is given in Table 3.1, along with the RMSD and MRSD. In cases where non-crossing is important, the distance D will be significantly greater than either RMSD or MRSD.
The transformation starts by a rotation of link 7–8 about the point 7, until a critical angle π/2 is reached. Residue 8 subsequently translates to the crossover point O. Immediately as it starts translating, link 6–7 rotates about point 6 (Fig. 3.7 a) and residue 7 rotates to its critical angle of π/2. The process repeats until link 5–6 rotates to an angle of π/6, at which point residue 8 touches the obstacle (Fig. 3.7 b). At this point, residue 8, which is touching a nondifferentiable (nonsmooth) surface, may violate corner conditions for the reasons discussed in Appendix E. Residue 8 moves horizontally to the left while residue 7 moves vertically, so the end points of the link slide in orthogonal directions (Fig. 3.7 c). After this part of the transformation is complete, the chain is in the configuration shown in Fig. 3.7 d. At this point, link 4–5 begins to rotate, and this sets up a cascade of motions throughout the chain. Residue 8 slides vertically downward, residue 7 slides horizontally to the left, and residue 6 slides vertically upward (Fig. 3.7 e). Note that residue 8 appears to violate corner conditions in the opposite sense of residue 7. These violations are again due to the influence of the crossover constraint. When link 4–5 has rotated to π/6, link 6–7 is horizontal and link 7–8 is vertical (Fig. 3.7 f). As link 4–5 continues to rotate, residues 7 and 8 proceed vertically downward in Fig. 3.7 g, while residue 6 moves left horizontally, until the conformation in Fig. 3.7 h is reached when link 4–5 has finished its rotation to π/2. At this point link 3–4 begins to rotate about position 3, moving residue 4 to the non-crossing position O, while the rest of the chain shifts downward vertically in Fig. 3.7, i and j. Finally residue 3 rotates about position 2 while residue 4 translates in a straight line to its final position, and all other residues translate downwards (Fig. 3.7, k and l). This completes the transformation.
Note again that the distance in Table 3.1 is much larger than either the RMSD or MRSD. A second transformation is obtained by time-reversing the above solution, and swapping the right and left branches of the structure that serve as initial and final conditions.

Figure 3.7: Various steps in a minimal pathway obeying non-crossing. Two conformations are drawn for each step. By convention, we number residues in the conformation that is leading in the transformation. See text for a description of the transformation.

3.4 Discussion and conclusion

In this chapter, we have applied the general theory of distance between one-dimensional objects to find the minimal folding pathways for protein fragments. We consider this to be a first step in building up ever-larger fragments to eventually look at the distance as an order parameter for the folding of an entire biomolecule. We investigated the minimal folding pathway for a helix, a β-hairpin, and a structure involving a crossover where the integrity of the chain is essential in determining the minimal transformation. The non-crossing problem has the largest distance per residue of all conformations considered. Not surprisingly, the α-helix has the shortest. It is an interesting question to address the consequence of the distance from an unfolded structure to a folded structure on its folding rate. We will address this question in chapter 6.

We have made several approximations in our model. In our analysis of minimal distance trajectories, we have not accounted for the steric excluded volume of the side chain and backbone degrees of freedom that have been coarse-grained out. It is possible to account for this in principle by applying the methods described in Appendix E. We take the trajectories derived here as a first approximation to the more fully constrained problem.
Another modification that must be considered is the range of allowed angles between consecutive triples of Cα residues. While sharp kinks in our transformations were the exception rather than the rule, we have assumed in our analysis that the full range of angles is allowable. The coarse-graining procedure does give greater flexibility for the resulting chain because there are six backbone bonds per Cα triple; however, a more thorough analysis would take into account a restricted range of allowable angles.

The construction of an efficient alignment algorithm based on the distance D as a cost function is a goal, and could have important future implications for structure prediction and biomolecular folding dynamics. We explore this question in some detail in chapter 4. For our purposes here we chose the approximate metrics MRSD and RMSD. For the β-hairpin, the best-aligned MRSD structure was globally different from the best-aligned RMSD structure. The distance from a straight line to an idealized β-hairpin structure was slightly less when the structures were aligned by minimizing MRSD than for RMSD. However, the situation was reversed for the real β-hairpin structure, with the RMSD-aligned structures having a smaller distance by ∼5%. We will visit this problem again in chapter 4.

The non-crossing transformation raises interesting questions about the validity of structural comparison metrics when polymer non-crossing is important. The RMSD and MRSD were both quite small for the conformations we considered, comparable to the α-helix distances. However, the actual distance for a physically realizable transformation was large, larger than the distances in β-sheet transformations. The solution we found for the case of non-crossing was extremal and minimal, at least locally.
However, there is no guarantee that this is the globally minimal transformation; some preliminary results for small numbers of links indicate there can be shorter pathways in some instances. The difference in distances between ground-state and excited-state transformations involves rotations of links and so is nonextensive: in the limit of large numbers of links, the discrepancies go to zero (see section 2.5).

Non-crossing constraints introduce a mechanistic aspect to the folding process. A folding mechanism consists of a specific sequence of events, or pathway. In the context of our problem the chain had to cross over the obstacle before translating to its final position. In practice the chain can go up and over the top or bottom of the obstacle, or cross over it in different places with varying likelihood, so strictly speaking there are many pathways and we have just investigated the minimal distance pathway here. Nevertheless, such constraints can further restrict the entropic bottleneck [137] governing folding rates. The physics of non-crossing is certainly important for knotted proteins, and the generalized distance may be useful as an order parameter for these proteins, whereas other structural comparison parameters would be flawed. The non-crossing constraints in a knotted protein slow its kinetics [38, 127], and lead to different molecular evolutionary pressures for fast and reliable folding [83, 125, 131].

For a simple stochastic process such as the one-dimensional diffusion of a point particle on a flat potential between two absorbing barriers, the splitting or commitment probability is p_F = D/D_TOT, where D_TOT is the total distance between the two barriers, giving a correlation between D and p_F of unity.
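The linear relation between commitment probability and distance for this simple example can be checked numerically. The following is a minimal sketch, assuming an unbiased nearest-neighbour random walk on a lattice as a discrete stand-in for one-dimensional diffusion on a flat potential, with D measured from the barrier at x = 0. The lattice size, the function name `splitting_probability`, and the relaxation solver are illustrative assumptions and are not part of the original analysis.

```python
def splitting_probability(n_sites, sweeps=2000):
    """Probability of reaching the barrier at x = n_sites before the one at x = 0.

    Solves p(x) = [p(x-1) + p(x+1)] / 2 with p(0) = 0 and p(n_sites) = 1
    by Gauss-Seidel relaxation (flat potential, unbiased walk).
    """
    p = [0.0] * (n_sites + 1)
    p[n_sites] = 1.0
    for _ in range(sweeps):
        for x in range(1, n_sites):
            p[x] = 0.5 * (p[x - 1] + p[x + 1])
    return p


p = splitting_probability(10)
# Compare with the linear prediction p_F = D / D_TOT, taking D = x and D_TOT = 10.
max_error = max(abs(p[x] - x / 10) for x in range(11))
```

In this sketch `max_error` should be negligible, which is the lattice analogue of p_F being exactly proportional to D.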
The presence of such a correlation between distance and commitment probability for simple examples provides encouragement to investigate whether or not one would find a significant correlation for the more complex problem of protein folding, in particular when the presence of non-crossing constraints for configurational diffusion has been accounted for. In the above discussion, pF has tacitly been written in terms of D rather than the reverse. This underscores the conceptual importance of geometric order parameters in understanding the progress of a reaction.

In protein folding, an emergent simplicity has been the result that native topology determines the major features of the free energy landscape for a protein, and consequently a protein’s folding rate and mechanism [5]. The distance D between disordered or partly disordered protein structures and the native structure may capture the evolution of topology during the folding process more accurately than many other order parameters proposed to characterize the folding kinetics and mechanisms of proteins: a full systematic comparison remains a problem for future research. Aspects of this problem are discussed in chapters 5 and 6.

Useful order parameters have simple geometric interpretations. Here we have shown that in principle one can compute the distance that would have to be traveled to connect two arbitrary biopolymer structures, a simple geometric quantity that can include non-crossing constraints, as well as properties such as restricted allowable angles or chain stiffness. The problem of finding a minimal distance pathway for a biomolecule is now an algorithmic problem rather than a conceptual one. In the long run, it is feasible that the analysis of other reactions involving large numbers of degrees of freedom might benefit from order parameters similar to the one we studied here, which are capable of accounting for the structural complexities inherent in large molecules.
Chapter 4

Structural alignment using the generalized Euclidean distance between conformations

In the previous chapter we saw that aligning the folded and unfolded conformations using different cost functions (RMSD and MRSD) resulted in different total distances D traveled in the transformation. In this chapter, we align structures using D itself as a cost function, to obtain globally minimal transformations. The unfolded structures that we consider are idealized straight-line segments with varying numbers of links, which are then aligned to idealized β-hairpins using D as a cost function. The alignment and resulting distance D are compared with the alignments and distances of RMSD and MRSD. More realistic extended structures that are consistent with the physicochemistry of peptide bonds could be taken. However, important lessons are learned from the idealized cases, which are generally easier to interpret (cf. Figure 3.5). This is a first step toward aligning more complex structures using D as a cost function. We will also see that there exist approximations involving decimating the backbone chain, which capture many of the properties of a true D alignment. Applying these approximate metrics to align structures such as a full protein is a topic for future research.

It should be noted that our motivation for structural alignment is to find the alignment that results in minimal D. Generally speaking, however, structural alignment is used to establish homology between two or more polymers based on their shape. Therefore, polymers with a much higher degree of similarity than the structural pairs considered here are usually aligned. The metrics used here to align unfolded structures to the corresponding folded one can in principle be used to align homologous structures as well. We reserve this, however, as a topic for future work.
4.1 Introduction

In principle, minimal pathways can be computed for any initial and final configurations, just as RMSD can be computed between any two configurations. However, it is of special significance to anneal the configurations allowing translations and rotations, until the minimal distance transformation is achieved (i.e. the minimum of minimal distance transformations). This is analogous to the usual procedure of using RMSD or MRSD as a cost function between two structures and minimizing with respect to translations and rotations. While the minimization procedure is particularly straightforward for RMSD and involves the inversion of a matrix, the minimization using the distance D as a cost function involves a simplex or conjugate gradient minimization and so is more computationally intensive. In short, the boundary conformations are allowed to translate and rotate in 3D space. Their position and orientation are modified to produce a pathway with minimal length, as compared to all other minimal pathways that can be obtained by positioning and orienting the same two structures in 3D space.

4.2 Method and results

For the purpose of generating accurate initial guesses for the minimal distance aligned structure, we introduce the following hierarchy:

D_0 = N \times \mathrm{MRSD}    (4.1a)

D_1 = \sum_{i=1}^{N-1} D\left(\ell_i^{(A)}, \ell_i^{(B)}\right)    (4.1b)

D_2 = \sum_{i=1}^{\mathrm{int}((N-1)/2)} D\left(\{\ell_i^{(A)}\}, \{\ell_i^{(B)}\}\right) \quad (+\, D_1 \text{ for the end link})    (4.1c)

\vdots

D_N = D .    (4.1d)

In this hierarchy, the D_α have the following interpretation: D_0 is the cumulative distance between the sets of points comprising the residue locations of conformations A and B, D_1 is the cumulative distance between the sets of single links, ℓ_i, comprising configurations A and B, D_2 is the cumulative distance between the sets of double links, {ℓ_i, ℓ_{i+1}}, comprising configurations A and B plus any single-link remainder if one exists, and so on.
That is, at level α the polymer chain is divided up into sub-segments each of link-length α, plus one segment constituting the remainder. When α = N, the chain as a whole is considered, which is the true distance D. This procedure is also illustrated schematically adjacent to each equation above. We observed that D1 was a good approximation to the total D between two chains, was much easier in practice to calculate, and could be automated in a robust way, in the sense that human intervention and tuning were not necessary. For these reasons we used it to generate initial guesses for minimal distance aligned structures. After the initial alignment using D1, the chains were further aligned using the full distance D. At this stage the general form of the transformation is established and the computation can be automated. We used a Nelder-Mead simplex method in our algorithm to find the minimal distance alignment.

Figure 4.1 shows the aligned structures using RMSD, MRSD, D1, and D, for increasing numbers of links. Several points can be observed. For the smallest number of links (3), MRSD, D1, and D all give the same alignment (fig 4.1a). For 5 or more links, the MRSD-aligned structure breaks symmetry by choosing a particular diagonal direction, while D1 and D retain this symmetry but begin to differ (fig 4.1b). The deviation between MRSD and D is a finite-size effect [109], so we know that the two alignments must eventually converge as N is increased. At 9 links (fig 4.1d), the D1-alignment breaks symmetry in the same fashion as MRSD, yet the D-alignment remains similar to RMSD. By 11 links (fig 4.1e), the D-aligned structure has broken symmetry as well, however with a smaller angle to the horizontal than either MRSD or D1.
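The bookkeeping behind the hierarchy of equation 4.1 (dividing the chain into sub-segments of α links plus a remainder) can be sketched as follows. This is only an illustration of the segmentation; it does not compute the distance D of each sub-segment, which requires the variational machinery of the earlier chapters. The function name `partition_links` and the 9-link example are assumptions made for illustration.

```python
def partition_links(n_links, alpha):
    """Split links 1..n_links into consecutive segments of alpha links each.

    The last segment holds the remainder when alpha does not divide n_links.
    """
    links = list(range(1, n_links + 1))
    return [tuple(links[k:k + alpha]) for k in range(0, n_links, alpha)]


segments_d1 = partition_links(9, 1)   # nine single links (level of D1)
segments_d2 = partition_links(9, 2)   # four double links plus one single link (level of D2)
segments_dn = partition_links(9, 9)   # the whole chain, corresponding to the true D
```

At each level the approximate distance would be the sum of the distances of the listed segments, so the cost of evaluating it grows with the segment size α.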
As N is increased, the D1 and MRSD aligned structures quickly converge, while the angle with respect to the horizontal of the D-aligned structure continues to lag behind that of either the MRSD or D1 structures, converging slowly as N continues to increase (figures 4.1f-j). The RMSD-aligned structure remains horizontal throughout. Average lengths of β-hairpins in databases constructed from the PDB are about 17 residues [27], most consistent with fig 4.1h. From this figure we see that hairpins of this length have a globally different structural alignment with extended structures depending on whether D or RMSD is used.

Figure 4.1 (panels a-j): Alignments with different cost functions. The hairpin is shown in red, the D alignment in green, D1 in blue, MRSD in yellow, and RMSD in cyan.

        Alignment cost function
N       D       D1      MRSD    RMSD
4       0.785   0.785   0.785   0.822
6       1.391   1.415   1.473   1.419
8       1.974   1.983   2.085   2.014
10      2.559   2.574   2.654   2.615
12      3.127   3.158   3.197   3.216
14      3.674   3.705   3.726   3.817
16      4.207   4.235   4.247   4.418
18      4.732   4.769   4.762   5.019
20      5.252   5.294   5.272   5.620
22      5.767   5.802   5.783   6.221

Table 4.1: D/N (in units of link length squared) between the aligned structures in figure 4.1. Each of the 4 columns represents the structural pairs for the cost function labeled. For example, column 3 gives D/N for structural pairs in figure 4.1 aligned using MRSD.

Table 4.1 and figure 4.2 summarize the results for the minimal distance transformations from the aligned structures. Table 4.1 gives the numerical value of the distance D for each aligned structure, aligned using the various cost functions listed: D, D1, MRSD, and RMSD. Note that the distance D is always minimized for the distance-aligned structure, and tends to increase as one considers the D1, MRSD and then RMSD-aligned structures for a given number of links.
For comparison, in table 4.2 the corresponding values of MRSD are given for the aligned structures using each cost function. Note in each table that as N gets large, D tends to converge to MRSD. The distance traveled per residue, in units of link length, is D/(Nb). Dividing this measure by the chain length (N − 1)b gives a scale-invariant measure of the distance: \tilde{D} = D/(N(N − 1)b^2). This quantity is plotted in figure 4.2. We can see from the plot that the D1-aligned structure generally gives a good approximation to the true D-aligned structure. Moreover, MRSD, D1 and D all converge to the same value, while RMSD converges to a dissimilar value.

Figure 4.2: Scale-invariant distance resulting from different alignments with different cost functions.

        Alignment cost function
N       D       D1      MRSD    RMSD
4       0.707   0.707   0.707   0.809
6       1.375   1.393   1.337   1.412
8       1.961   1.960   1.899   2.008
10      2.547   2.545   2.436   2.610
12      3.062   3.108   2.959   3.211
14      3.575   3.675   3.475   3.813
16      4.081   4.004   3.987   4.414
18      4.585   4.506   4.495   5.015
20      5.088   5.008   5.002   5.616
22      5.591   5.511   5.508   6.218

Table 4.2: MRSD (in units of link length) between the aligned structures in figure 4.1 using the four cost functions we considered. For example, column 1 gives MRSD for structural pairs in figure 4.1 aligned using the distance D.

4.3 Conclusion and discussion

In this chapter we used the generalized distance D and various approximations of it as cost functions to align unfolded idealized strands of various sizes to their corresponding idealized β-hairpin structures. The distance D for the minimal transformation between aligned structural pairs was compared for various alignment cost functions: RMSD, MRSD, D1, and D itself. D1 is the distance between conformational pairs if the chain were decimated to single links and the distances of all single-link transformations were summed.
We found that D1-aligned structures generally gave a distance that was close to that of the true D-aligned structure, and in this sense D1 was a good approximation. However, the aligned structures were noticeably different depending on the cost function, for the finite values of N (number of residues) that we studied. Our largest value of N was 22, while the average length of β-hairpins is about 17 residues. For these average hairpin lengths, the minimal D aligned structure is globally different from the RMSD structure. Whether this discrepancy is generally true for larger structures or whole proteins remains to be determined, but we feel it is likely. It is not yet clear at this point whether alignment using distance will yield more accurate predictions for such problems as protein structure prediction or ab-initio drug design. What is clear is that the best-aligned structures using a reasonable alignment metric such as the true distance give very different results than RMSD, even for relatively simple structures such as the β-hairpin.

Chapter 5

Polymer uncrossing and knotting in protein folding, and their role in minimal folding pathways

In this chapter we give a systematic treatment of non-crossing constraints in protein untangling. First we develop an approximate but easily automatable algorithm for minimal folding pathways of a polymer without considering non-crossing constraints. Then we study perturbations in the pathway that occur due to the presence of non-crossing constraints. Finally we apply the formalism to a number of proteins, including knotted proteins. We will see how the non-crossing distance can differentiate classes of proteins and how topological constraints, manifested by untangling operations in our formalism, induce folding pathways. We will also study how the persistence of different untangling moves varies across protein classes. The formalism outlined in this chapter is the next logical and systematic improvement to what has been done in chapter 3.
5.1 Introduction

A transformation connecting unfolded states with the native folded state can be considered as a reaction coordinate. A transformation can also be used as a starting point for refinement, by examining commitment probability or other reaction coordinate formalism. Several methods have been developed to find transformations between protein conformational pairs without specific reference to a molecular mechanical force field. These include coarse-grained elastic network models [66, 67], coarse-grained plastic network models [84], iterative cluster-normal mode analysis [120], restrained interpolation (the Morph server) [72], the FRODA method [135], and geometrical targeting (the geometrical pathways (GP) server) [36]. The GP method finds trajectories between conformation pairs by gradually decreasing the RMSD between the conformations, while preserving structural constraints within the protein. Dead-ends can be encountered. In this event, two recovery methods may be attempted: a random perturbation technique, and backtracking by temporarily increasing RMSD before attempting the transformation again.

In this chapter we consider transformations between polymer conformation pairs that would not be viable by a conjugate-gradient type or direct minimization approach, in that dead-ends would inevitably be encountered. We focus specifically on how one might find geometrically optimal transformations that account for polymer non-crossing constraints, and would apply to knotted proteins for example. By a geometrically optimal transformation, we mean a transformation in which every monomer in a polymer would travel the least distance in 3-dimensional space in moving from conformation A to conformation B. This is a variational problem, and the equations of motion, along with the minimal transformation and the Euclidean distance covered, have been worked out in previous chapters.
Although minimal transformations have been found for the backbones of secondary structures, and the non-crossing problem has been treated in chapter 3, minimal transformations between unfolded and folded states for full protein chain lengths have not been treated before. We focus on this problem in this chapter.

The minimal transformation inevitably involves curvilinear motion if bond, angle, or stereochemical constraints are involved. If such constraints are neglected, the minimal distance corresponding to the minimal transformation converges to the mean of the root squared distance (MRSD), or the mean of the straight-line distances between pairs of atoms or monomers. This is not the RMSD. For any typical pair of conformations, the MRSD is always less than the RMSD, which can be proved by applying Hölder’s inequality [89].

The RMSD can be thought of as a least squares fit between two structures. Alternatively, it may also be thought of as the straight-line Euclidean distance between two structures in a high-dimensional space of dimension 3N, where N is the number of atoms or residues considered in the protein. Fast algorithms have been constructed to align structures using RMSD [24, 25, 40, 62, 63, 69]. If several intermediate states are known along the pathway of a transformation between a pair of structures, then the RMSD may be calculated consecutively for each successive pair. This notion of RMSD as an order parameter goes back to reaction dynamics papers from the early 1980’s [7, 16, 35, 132]; however, in these approaches the potential energy governs the most likely reactive trajectories taken by the system, and RMSD is simply accumulated through the transition states. In the absence of a potential surface except for that corresponding to steric constraints, the incremental RMSD may be treated as a cost function and its minimal transformation between two structures found. This idea is behind the transformation approaches discussed above.
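The distinction between MRSD and RMSD noted above can be made concrete with a small numerical example. This is a minimal sketch: the two 4-bead conformations are hypothetical coordinates chosen only for illustration, no alignment step is performed, and the function names `mrsd` and `rmsd` are assumptions rather than names used in the thesis.

```python
import math


def mrsd(a, b):
    """Mean of the straight-line distances between paired points."""
    return sum(math.dist(p, q) for p, q in zip(a, b)) / len(a)


def rmsd(a, b):
    """Square root of the mean squared distance between paired points."""
    return math.sqrt(sum(math.dist(p, q) ** 2 for p, q in zip(a, b)) / len(a))


# Hypothetical conformations: a straight segment and a bent one.
conf_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
conf_b = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0), (0.0, 1.0, 0.0)]

m = mrsd(conf_a, conf_b)   # (0 + 0 + sqrt(2) + sqrt(10)) / 4
r = rmsd(conf_a, conf_b)   # sqrt((0 + 0 + 2 + 10) / 4) = sqrt(3)
```

Because the squared average weights the largest displacements more heavily, `r` exceeds `m` whenever the bead displacements are not all equal.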
However, the minimal transformation using RMSD (or 3N-dimensional Euclidean distance) as a cost function is different from the minimal transformation using 3D Euclidean distance (MRSD) as a cost function, and the RMSD-derived transformation does not correspond to the most straight-line trajectories. The RMSD is not equivalent to the total amount of motion a protein or polymer must undergo in transforming between structures, even in the absence of steric constraints enforcing deviations from straight-line motion.

In what follows, we first describe our method for calculating the distance corresponding to a minimal transformation that accounts for the extra distance traveled to avoid self-crossing of the polymer chain. This involves finding the different ways a polymer can uncross or “untangle” itself, and then calculating the corresponding distance for each of the untangling transformations. Since there are typically several avoided crossings during a minimal folding transformation, finding the optimal untangling strategy corresponds to finding the optimal combination of uncrossing operations with minimal total distance cost.

After quantifying such a procedure, we apply this to full length protein backbone chains for several structural classes, including α-helical proteins, β-sheet proteins, α-β proteins, 2-state and 3-state folders, and knotted proteins. We generate unfolded ensembles for each of the proteins investigated, and calculate minimal distance transformations for each member of the unfolded ensemble to fold. We look for differences in the distance between structural and kinetic classes, and compare these to differences in other order parameters between the respective classes. The other order parameters investigated include absolute contact order ACO [105], relative contact order RCO [105], long-range order LRO [52], root-mean-squared deviation RMSD, mean-root-squared deviation MRSD, and chain length N [42, 54].
The variations of distance metrics considered include the total distance D, the distance per residue D/N, the “extra” distance required to avoid crossing, Dnx, and the extra non-crossing distance per residue Dnx/N. We also investigate how the various order parameters either correlate with or are independent of each other. We finally discuss our results and conclude.

5.2 Methods

5.2.1 Calculation of the transformation distance

The value of Dnx is calculated as follows: The chain transforms from conformation A to conformation B as a ghost chain, so the chain is allowed to pass through itself. The beads of the chain follow straight trajectories from initial to final positions. This is an approximation to the actual Euclidean distance D of the transformation, where straight line transformations of the beads are generally preceded or followed by non-extensive local rotations to preserve the link length connecting the beads as a rigid constraint [89, 109]. The instances of self-crossing along with their times are recorded. The associated cost for these crossings is computed retroactively; for example, the distance cost for one arm of the chain to circumnavigate another obstructing part is added to the “ghost” distance to compute the total distance. The method for calculating the non-crossing distance has three major components: evolution of the chain, crossing detection, and crossing cost calculation. Each is described in one of the subsections below.

Evolution of the chain

As mentioned above, the condition of constant link length between residues along the chain is relaxed, so that the non-extensive rotations that would generally contribute to the distance traveled are neglected here. This approximation becomes progressively more accurate for longer chains. Thus, ideal transformations only involve pure straight-line motion.
The approximate transformation is carried out in a way that minimizes deviations from the true transformation (D), such that link lengths are kept as constant as possible, given that all beads must follow straight-line motion. We thus only allow deviations from constant link length when rotations would be necessary to preserve it; this only occurs for a small fraction of the total trajectory, typically either at the beginning or the end of the transformation [89, 109].

A specific example

As an example of the amount of distance neglected by this approximation, consider the pair of configurations in Figure 5.1, where a chain of 10 residues that is initially horizontal transforms to a vertical orientation as shown in the figure. The distance neglecting rotations (our approximation) is 77.78, in reduced units of the link length, while the exact calculation including rotations [89, 109] gives a distance of 78.56. A few intermediate conformations are shown in the figure. In particular, note the link length change (and hence violation of the constant link length condition) in the fourth link for the gray conformation (conformation F), resulting from our approximation. If the link length is preserved, the transformation consists of local rotations at the boundary points. Also note that when transforming from cyan to magenta the first bead moves less than δ, because it reaches its final destination and “sticks” to the final point, and will not be moved subsequently.

General method

The algorithm to evolve the chain is as follows. Straight-line paths from the positions of the beads in the initial chain configuration to the corresponding positions of the beads in the final configuration are constructed. The bead furthest away from the destination, i.e., the bead whose path is the longest line, is chosen. Let this bead be denoted by index b where 0 ≤ b ≤ N. In the example of figure 5.1, this bead corresponds to bead number 9 (b9).
The bead is then moved toward its destination by a small pre-determined amount δ, and the new position of bead b is recorded. In this way the transformation is divided into, say, M steps: M = d_max/δ, where d_max is the maximal distance. Let i be the step index, 0 ≤ i ≤ M. If initially the chain configuration was at step i (e.g., i = 0), the spatial position of bead b at step i before the transformation δ is denoted by r_{b,i}, and after the transformation by r_{b,i+1}. The upper bound on δ needed to capture the essence of the transformation dynamics differs according to the complexity of the problem. To capture all of the instances of self-crossing, a step size δ of two percent of the link length sufficed for all cases.

The neighboring beads (b + 1 and b − 1) should also follow paths on their corresponding straight-line trajectories. Their new positions on their paths (r_{b+1,i+1} and r_{b−1,i+1}) are then calculated based on the constant link length constraints. These new positions correspond to moving the beads by δ_{b+1,i} and δ_{b−1,i} respectively. Once r_{b+1,i+1} and r_{b−1,i+1} are calculated, we proceed and calculate r_{b+2,i+1} and r_{b−2,i+1}, and so on until we reach the end points of the chain. As an example consider figure 5.1, going from conformation B (green) to conformation C (yellow). First, bead number 9, which is the bead farthest from its final destination, is moved by δ; then, taking constant link length constraints and straight line trajectories into account, the new position of bead 8 is calculated, and so on, until all the new bead positions which correspond to the yellow conformation are calculated.

If somewhere during the propagation to the endpoints a solution cannot be constructed or no continuous solution exists, i.e. lim_{δ→0} (r_{b+m,i+1} − r_{b+m,i}) ≠ 0, then we set r_{b+m,i+1} = r_{b+m,i}. That is, the bead will remain stationary for a period of time.3 Consequently r_{b+n,i+1} = r_{b+n,i} for all beads with n > m that have not yet reached their final destination.
This is because the new position of each bead is calculated from the position of the bead next to it for any particular step i. The same recipe is applied when propagating incremental motions δ_{b,i+1} along the other direction of the chain (going from b − n to b − n − 1) as well. When a given bead that has been held stationary becomes the farthest bead away from its final position, it is then moved again. That is, stationary beads can move again at a later time during the transformation if they become the furthest beads away from the final conformational state. Such a scenario does not occur in the context of the simple example of figure 5.1. Once the positions of all the beads in step i + 1 are calculated, the same procedure is repeated for step i + 2 and so on, until the chain reaches the final configuration.

If the position of a given bead b at step i is such that |r_{b,i} − R_b| < δ, where R_b is the spatial position of bead b in the final conformation, then r_{b,i+1} is set to R_b. In other words, we discretely snap the bead to the final position if it is closer than the step size δ. In the context of figure 5.1, this corresponds to going from conformation D (cyan) to conformation E (magenta). Bead 0 (b0) is snapped to the final conformation. Once a bead reaches its destination it locks there and will never move again. See conformation F (gray) in figure 5.1.

Figure 5.2 shows the standard deviation in link length versus the mean link length, for transformations of 200 random structures generated by self-avoiding random walks (SAW) to one pre-specified SAW. The length of the random chains was 9 links. The chains were aligned by minimizing MRSD before the transformation took place [89–91], where MRSD stands for the mean root squared distance and is defined by

\mathrm{MRSD} = \frac{1}{N} \sum_{n=1}^{N} \sqrt{(r_{A_n} - r_{B_n})^2} = \frac{1}{N} \sum_{n=1}^{N} |r_{A_n} - r_{B_n}| .

Crossing detection

As stated earlier, during the transformation the chain is initially treated as a ghost chain, and so is allowed to cross itself.
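The chain-evolution procedure described above (the farthest bead advances by δ, its neighbours follow along their own straight paths so as to preserve link length, beads that cannot follow stay put, and beads closer than δ to their destination snap onto it) can be sketched as follows. This is a hedged illustration rather than the implementation used in the thesis. Several details are assumptions: the function name `evolve_ghost_chain`, the choice of the smallest non-negative root when two positions satisfy the link constraint, and stopping the propagation at a bead that is already locked. The test geometry, beads at (k+1, 0) moving to (0, k+1), is likewise an assumed layout chosen because its straight-line total, 55√2, matches the value 77.78 quoted for the 10-residue example.

```python
import math


def evolve_ghost_chain(start, end, delta=0.02):
    """Evolve a ghost chain along straight bead paths; return distance travelled per bead."""
    n = len(start)
    pos = [tuple(p) for p in start]
    target = [tuple(p) for p in end]
    link = [math.dist(pos[k], pos[k + 1]) for k in range(n - 1)]
    travelled = [0.0] * n

    def move(k, t):
        """Advance bead k by t along its path, snapping to the target when close."""
        rem = math.dist(pos[k], target[k])
        if rem < delta or t >= rem:
            new = target[k]
        else:
            new = tuple(p + t * (q - p) / rem for p, q in zip(pos[k], target[k]))
        travelled[k] += math.dist(pos[k], new)
        pos[k] = new

    def follow(k, anchor, length):
        """Step that puts bead k at distance `length` from bead `anchor`, or None."""
        rem = math.dist(pos[k], target[k])
        if rem == 0.0:
            return None                       # bead is locked at its destination
        u = [(q - p) / rem for p, q in zip(pos[k], target[k])]
        d = [p - a for p, a in zip(pos[k], pos[anchor])]
        du = sum(x * y for x, y in zip(d, u))
        disc = du * du - (sum(x * x for x in d) - length * length)
        if disc < 0.0:
            return None                       # no point on the path satisfies the constraint
        roots = (-du - math.sqrt(disc), -du + math.sqrt(disc))
        ok = [max(t, 0.0) for t in roots if t >= -1e-12]
        return min(ok) if ok else None        # assumed rule: smallest forward step

    while True:
        rem = [math.dist(p, q) for p, q in zip(pos, target)]
        b = max(range(n), key=rem.__getitem__)
        if rem[b] == 0.0:
            break                             # every bead has reached its destination
        move(b, delta)
        for step in (1, -1):                  # propagate toward each end of the chain
            k = b + step
            while 0 <= k < n:
                t = follow(k, k - step, link[min(k, k - step)])
                if t is None:
                    break                     # this bead and those beyond it stay put
                move(k, t)
                k += step
    return travelled


# Assumed geometry for the horizontal-to-vertical example with 10 beads.
start = [(k + 1.0, 0.0) for k in range(10)]
end = [(0.0, k + 1.0) for k in range(10)]
travelled = evolve_ghost_chain(start, end)
total = sum(travelled)   # close to 55 * sqrt(2), about 77.78
```

Since every bead only moves forward along its own straight path, the total returned by this sketch equals the sum of straight-line distances; the extra distance from end-point rotations in the exact calculation is what this approximation omits.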
To keep track of the crossing instances of the chain, a crossing matrix $\mathcal{X}$ is updated at all time steps during the transformation. If the chain has N beads and N − 1 links, we can define an (N − 1) × (N − 1) matrix $\mathcal{X}$ that contains the crossing properties of a 2D projection of the strand, in analogy with topological analysis of knots. The element $\mathcal{X}_{ij}$ is nonzero if link i is crossing link j in the 2D projection at that instant. Without loss of generality we can assume that the projection is onto the XY plane, as in Figure 5.3. We use the XY plane projection throughout this chapter (see footnote 4).

Footnote 3: This in principle may result in a link length change for the corresponding link, and thus constraint violation, in our approximation. An exact algorithm involves local link rotation instead.

Footnote 4: We use the crossings in the projected image as a book-keeping device to detect real 3D crossings. A real crossing event is characterized by a sudden change in the over-under nature of a crossing on a projected plane. Since for any 3D crossing the change in the over-under order of the crossing links is present in any arbitrary projection, keeping track of a single projection is enough to detect 3D crossings. Of course a given projection plane may not be the optimal projection plane for a given crossing; however, if the time step is small enough, any projection plane will be sufficient to detect a crossing.

Figure 5.1: (a) Several intermediate conformations for a transformation (A–G, proceeding along the color sequence red, green, yellow, cyan, magenta, gray, and blue) are shown. The step size δ is shown. Note the step in which the first bead of the chain ($b_0$) is "snapped" into the final conformation because its distance to the destination is less than δ (going from D to E). In the intermediate conformation F (gray), beads 0 to 3 have reached their final locations and no longer move. Note also the link length violation of link 4 in conformation F, due to the approximation that ignores end point rotations, for this intermediate figure. A milder violation is observed when going from D (cyan) to E (magenta), since beads 1 through N all assume a step size of δ while bead 0 moved a step size < δ. (b) Surface plot of link length as a function of link number and step number during the transformation. For the whole process, the mean link length is 0.98 units and the standard deviation is 0.063.

Figure 5.2: Scatter plot of the average link length (x-axis) and the standard deviation from the mean link length (y-axis), for transformations between 200 randomly generated structures of 9 links and the (randomly generated) reference structure shown in the inset to the figure. Each point in the scatter plot corresponds to a whole transformation between a randomly generated structure and the reference structure. The "native" or reference state is shown in the inset, along with several of the 200 initial states. In practice, transformations with a mean link length of unity have a standard deviation near zero because the link lengths have hardly changed. For the ensemble of transformations shown, the ensemble average of the mean link length is 0.96, and the average of the corresponding standard deviation is 0.076.

We parametrize the chain uniformly and continuously in the direction of ascending link number by a parameter s with range 0 ≤ s ≤ N. So for example the middle of the second link is specified by s = 3/2. If the projection of link i is crossing the projection of link j, then $|\mathcal{X}_{ij}|$ is the value of s at the crossing point of link i and $|\mathcal{X}_{ji}|$ is the value of s at the crossing point of link j. If link i is over link j (i.e. the corresponding point of the cross on link i has a higher z value than the corresponding point of the cross on link j) then $\mathcal{X}_{ij} > 0$, otherwise $\mathcal{X}_{ij} < 0$. Thus, after the sign operation, $\mathrm{sign}(\mathcal{X})$ is an anti-symmetric matrix. A simple illustrative example of the value of $\mathcal{X}$ for the 3-link chain in figures 5.3a and 5.3b is

$$\mathcal{X}(t_o) = \begin{pmatrix} 0 & 0 & -0.29 \\ 0 & 0 & 0 \\ +2.82 & 0 & 0 \end{pmatrix} \qquad (5.1a)$$

$$\mathcal{X}(t_o + \delta) = \begin{pmatrix} 0 & 0 & +0.29 \\ 0 & 0 & 0 \\ -2.82 & 0 & 0 \end{pmatrix} \qquad (5.1b)$$

Figure 5.3: (a) A 3-link chain with its vertical projection. A crossing in the projection is shown with a green circle. The crossing in the projection occurs at points s = 0.29 and s = 2.82, where the chain is parametrized uniformly from 0 to 3. Since link 1 is under link 3 at the point of projection crossing, 0.29 appears with a negative sign in the corresponding $\mathcal{X}$ (Eq. 5.1a). (b) The blue chain and the red chain have exactly the same vertical projection, but their corresponding $\mathcal{X}$ matrices differ in sign, as given in Eq. 5.1b. This indicates that the over-under sense has changed for the links whose projections are crossing. This in turn indicates that a true crossing has occurred when going from the red conformation to the blue conformation, as opposed to a series of conformations where the chain has navigated to conformations having the opposite crossing sense without passing through itself.

The fact that $\mathcal{X}_{13}$ is negative at time $t_o$ indicates that at that instant, link 1 is under link 3 in 3D space, above the corresponding point on the plane on which the projections of the links have crossed (green circle in figure 5.3). At each step during the transformation of the chain, the matrix $\mathcal{X}$ is updated. A true crossing event is detected by comparing $\mathcal{X}$ for two consecutive conformations. A crossing event occurs when any non-zero element in the matrix $\mathcal{X}$ discontinuously changes sign without passing through zero. Once $\mathcal{X}_{ij}$ changes sign, $\mathcal{X}_{ji}$ must change sign as well.
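The projected-crossing bookkeeping described above can be sketched in a few lines. This is a minimal illustration rather than the implementation used in this work: the function names (`segment_crossing_2d`, `crossing_matrix`, `true_crossings`) and the 0-based link indexing are assumptions made for the example, and adjacent links are skipped because they share a bead.

```python
def segment_crossing_2d(p1, p2, q1, q2):
    """Fractions (t, u) along p1->p2 and q1->q2 where their XY projections cross, else None."""
    rx, ry = p2[0] - p1[0], p2[1] - p1[1]
    sx, sy = q2[0] - q1[0], q2[1] - q1[1]
    denom = rx * sy - ry * sx
    if abs(denom) < 1e-12:          # parallel projections: treated as no crossing
        return None
    dx, dy = q1[0] - p1[0], q1[1] - p1[1]
    t = (dx * sy - dy * sx) / denom
    u = (dx * ry - dy * rx) / denom
    return (t, u) if 0 < t < 1 and 0 < u < 1 else None


def crossing_matrix(beads):
    """Signed crossing matrix for one conformation (links are 0-indexed here).

    |X[i][j]| is the chain parameter s of the crossing on link i;
    the sign is positive when link i passes over link j (higher z).
    """
    n = len(beads) - 1
    X = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 2, n):   # skip adjacent links
            hit = segment_crossing_2d(beads[i], beads[i + 1], beads[j], beads[j + 1])
            if hit is None:
                continue
            t, u = hit
            z_i = beads[i][2] + t * (beads[i + 1][2] - beads[i][2])
            z_j = beads[j][2] + u * (beads[j + 1][2] - beads[j][2])
            sign = 1.0 if z_i > z_j else -1.0
            X[i][j] = sign * (i + t)
            X[j][i] = -sign * (j + u)
    return X


def true_crossings(X_prev, X_next):
    """Link pairs whose entry flips sign between two consecutive conformations."""
    n = len(X_prev)
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if X_prev[i][j] * X_next[i][j] < 0]


# Two consecutive conformations of a 3-link chain with the same XY projection.
before = [(0, 0, 0), (2, 0, 0), (2, 2, 1), (1, -1, 1)]
after = [(0, 0, 0), (2, 0, 0), (2, 2, -1), (1, -1, -1)]
flipped = true_crossings(crossing_matrix(before), crossing_matrix(after))
```

In this toy case the first link lies under the third link in `before` and over it in `after`, so the pair is reported as a true crossing.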
If the chain navigates through a series of conformations that changes the crossing sense, and thus the sign of the crossing-matrix element $\mathcal{X}_{ij}$, but does not pass through itself in the process, the matrix elements $\mathcal{X}_{ij}$ will not change sign discontinuously but will have values of zero at intermediate times before changing sign.

Crossing cost calculation

Even in the simplest case of crossing, there are multiple ways for the real chain to have avoided crossing itself. The extra distance that the chain must have traveled during the transformation to respect the fact that the chain cannot pass through itself is called the "non-crossing" distance $D_{nx}$. If the chain were a ghost chain which could pass through itself, the corresponding distance for the whole transformation would be the MRSD, along with relatively small modifications that account for the presence of a conserved link length. Accounting for non-crossing always introduces extra distance to be traveled.

As the chain is transforming from conformation A to conformation B as a ghost chain according to the procedure discussed above, a number of self-crossing incidents occur. Figure 5.4 shows a continuous but topologically equivalent version of the crossing event shown in figure 5.3(b). Even for this simple case, there are multiple ways for the transformation to have avoided the crossing event, each with a different cost. Furthermore, later crossings can determine the best course of action for the previous crossings. Figure 5.5 illustrates how non-crossing distances are non-additive, so that one must look at the whole collection of crossing events. Therefore, to find the optimum way to "untangle" the chain (reverse the sense of the crossings), one must look at all possible uncrossing transformations, in retrospect. The recipe we follow is to evolve the chain as a ghost chain and write down all the incidents of self-crossing that happen during the transformation.
Then looking at the global transformation, we find the best untangling movement that the chain could have taken.

Figure 5.4: Two possible untangling transformations. The top transformation involves twisting of the loop. The lower transformation involves a snake-like movement of the vertical leg. A third one would involve moving the horizontal leg in a similar snake-like fashion. Note that the moves represented here are not necessarily the most efficient ones in their topological class, but rather the most intuitive ones. There are transformations that are topologically equivalent but generally involve less total motion of the chain (see for example Figures 5.12(a), 5.12(b)).

Figure 5.5: The minimal untangling movement in going from A to C (through B') is less than the sum of the minimal untangling movements going from A to B and then from B to C.

Figure 5.6: A few snapshots during a transformation involving 2 instances of chain crossing. The transformation occurs clockwise, starting from initial configuration I and proceeding to final configuration F.

To compute the extra cost introduced by non-crossing constraints we proceed as follows. We construct a matrix that we call the cumulative crossing matrix $\mathcal{Y}$. $\mathcal{Y}_{ij}$ is non-zero if link i has truly (in 3D) crossed link j at any time during the transformation. This matrix is thus conceptually different from the crossing matrix $\mathcal{X}$, which holds only for one instant (one conformation) and which can have crossings in the 2D projection that are not true crossings during the transformation. The values of the elements of $\mathcal{Y}$ are calculated in the same way as the values of $\mathcal{X}$. The sign also depends on whether the link was crossed from over to under or from under to over, so that a given projection plane is still assumed. The order in which the crossings happened is tracked in another matrix $\mathcal{Y}_O$. The coordinates of all the beads at the instant of a given crossing are also recorded.
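The three records just described (signed cumulative entries, order of crossings, and bead coordinates at each crossing) can be held in one small container. The sketch below is illustrative only: the class name `CrossingLog`, its field names, and the 1-based link arguments are assumptions, and the signed values passed in are example numbers.

```python
from dataclasses import dataclass, field


@dataclass
class CrossingLog:
    """Cumulative record of true crossings during one ghost-chain transformation."""
    n_links: int
    Y: list = field(init=False)          # signed chain parameters of true crossings
    Y_order: list = field(init=False)    # 1, 2, ... in order of occurrence
    snapshots: list = field(default_factory=list)  # bead coordinates at each crossing

    def __post_init__(self):
        n = self.n_links
        self.Y = [[0.0] * n for _ in range(n)]
        self.Y_order = [[0] * n for _ in range(n)]

    def record(self, i, j, y_ij, y_ji, beads):
        """Store a true crossing between links i and j (1-based link numbers)."""
        order = len(self.snapshots) + 1
        self.Y[i - 1][j - 1] = y_ij
        self.Y[j - 1][i - 1] = y_ji
        self.Y_order[i - 1][j - 1] = order
        self.Y_order[j - 1][i - 1] = order
        self.snapshots.append(list(beads))


# Example: a 7-link chain with two true crossings (illustrative signed values).
log = CrossingLog(n_links=7)
log.record(5, 7, -4.4, 6.9, beads=[])
log.record(2, 4, 1.3, -3.8, beads=[])
```

Keeping the coordinate snapshots in the same object as the matrices means a later cost calculation can look up the conformation belonging to crossing number k as `log.snapshots[k - 1]`.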
For example, if two crossings have happened during the transformation of a chain, then two sets of coordinates for intermediate states are also stored. We describe a simple concrete example to illustrate the general method next.

A Concrete Example

Figure 5.6 shows a simple transformation of a 7-link chain. During the transformation the chain crosses itself in two instances. The first instance of self-crossing is between link 5 and link 7. The second instance is when link 2 crosses link 4. The location of the crossing along the chain is also recorded: if we assume that the chain is parametrized by s = 0 to N, then at the instant of the first crossing (link 5 and link 7) s = 4.4 (link 5) and s = 6.9 (link 7). The second crossing occurs at s = 1.3 (link 2) and s = 3.8 (link 4). The full coordinates of all beads are also known: we separately record the full coordinates of all beads at each instant of crossing.

The information that indicates which links have crossed and their over-under structure can be aggregated into the cumulative crossing matrix $\mathcal{Y}$. For the example in figure 5.6, the cumulative matrix (up to a minus sign indicating what plane the crossing events have been projected on) is

$$\mathcal{Y} = \begin{pmatrix}
0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 1.3 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & -3.8 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & -4.4 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 6.9 & 0 & 0
\end{pmatrix}.$$

$\mathcal{Y}$ tells us, for the whole process of transformation, which links have truly crossed one another and what the relative over-under structure was at the time of crossing. For example, by glancing at the matrix we can see that the link pairs (5, 7) and (2, 4) have crossed one another. We also know from the sign of the elements in $\mathcal{Y}$ that both links 2 and 7 were underneath links 4 and 5 just prior to their respective crossings in the reference frame of the projection. Two links will cross each other at most once during a transformation. If one link, e.g. link i, crosses several others during the transformation, elements (i, j), (i, k), etc., along with their transposes, will be nonzero. The order of crossings can be represented in a similar fashion as a sparse matrix:

$$\mathcal{Y}_O = \begin{pmatrix}
0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 2 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 2 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 1 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 & 0
\end{pmatrix}.$$

Analyzing the structure of the crossings is similar to analyzing the structure of a knot, wherein one studies a knot's 2D projections, noting the crossings and their over/under nature based on a given directional parameterization of the curve [1, 92, 136]. One difference here is that we are not dealing with true closed-curve knots (in the mathematical sense), as a knot is a representation of $S^1$ in $S^3$. Here we treat open curves.

Figure 5.7: For the crossing points indicated by the green circles, two legs, colored blue and red, can be identified. Each leg starts at the crossing and terminates at an end.

Crossing substructures

By studying the crossing structure of open-ended pseudo-knots in the most general sense, one can identify a number of sub-structures that recur in crossing transformations. Any act of reversing the nature of all the crossings of the polymer can be cast within the framework of some ordered combination of reversing the crossings of these substructures. We identify three sub-structures: Leg, Loop, and Elbow.

Leg

Given any self-crossing point of a chain, a leg is defined from that crossing point to the end of the chain. Therefore, for each self-crossing point two legs can be identified as the shortest distance along the chain from that crossing point to each end (see figure 5.7). A single leg structure is shown in figure 5.8(a).

Loop

As stated earlier, when traveling along the polymer one arrives at each crossing twice. If the two instances of a single crossing are encountered consecutively while traveling along the polymer, and no intermediate crossing occurs, then the substructure that was traced in between is a loop. See Fig 5.8(b).
Elbow

If two consecutive crossings have the same over-under sense, then they form an elbow; see Fig 5.8(c). Note that the same two consecutive crossing instances will occur in reverse order on the second visit of the crossings: these form a dual of the elbow. By convention the segment with the longer arclength between the two consecutive crossings is defined as the elbow. This would be the horseshoe-shaped strand in figure 5.8(c).

Figure 5.8: (a) A single leg structure, (b) a loop structure, (c) an elbow structure.

Figure 5.9: The three types of Reidemeister moves (I, II, III). As can be seen, Reidemeister move type III does not reverse the nature of any of the crossings.

Reversing the crossing nature

The goal of this formalism is to assist in finding a series of movements that will reverse the over-under nature of all the crossings with the least amount of movement required by the polymer. So at this point we introduce basic movements that will reverse the nature of the crossings for the above substructures. Two of these moves, the loop move and the elbow move, are equivalent to Reidemeister moves of type I and II respectively in knot theory (see footnote 5 and figure 5.9). The leg move has no equivalent Reidemeister move.

Footnote 5: For an introduction to the basic concepts in knot theory see for example [1].

Figure 5.10: Schematic illustration of the canonical leg movement, either from left to right as in (a) or effectively its time reverse as in (b). Both transformations traverse the same distance. The transformation in (a) is equivalent to the "plug" transformation analyzed in the context of folding simulations for trefoil knotted proteins [125], while the transformation in (b) (see ref. [90] for a detailed description of this transformation) is equivalent to the "slip-knotting" transformation more often observed in the folding of knotted proteins [93].
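The Leg, Loop, and Elbow definitions given above can be turned into a simple scan over the crossings in the order they are met along the chain. The sketch below is a hedged illustration: the visit format `(s, crossing_id, sense)`, with `sense = +1` when the traversed strand is on top and `-1` when it is underneath, and all function names are assumptions introduced for the example.

```python
def find_loops(visits):
    """Crossings whose two visits are consecutive along the chain (no crossing in between)."""
    ordered = sorted(visits)
    return [a[1] for a, b in zip(ordered, ordered[1:]) if a[1] == b[1]]


def find_elbows(visits):
    """Pairs of distinct consecutive crossings with the same over-under sense.

    Each elbow and its dual share the same pair of crossings, so a pair is listed once.
    """
    ordered = sorted(visits)
    found = []
    for a, b in zip(ordered, ordered[1:]):
        if a[1] != b[1] and a[2] == b[2]:
            pair = tuple(sorted((a[1], b[1])))
            if pair not in found:
                found.append(pair)
    return found


def leg_lengths(visits, crossing, n_links):
    """Arclengths of the two legs of a crossing: (start of chain -> crossing, crossing -> end)."""
    s_values = sorted(s for s, c, _ in visits if c == crossing)
    return s_values[0], n_links - s_values[-1]


# Hypothetical 6-link chain with three crossings, each visited twice.
visits = [(0.5, 1, +1), (1.5, 2, +1), (2.5, 3, -1),
          (3.5, 3, +1), (4.5, 2, -1), (5.5, 1, -1)]
loops = find_loops(visits)      # crossing 3 is visited twice in a row
elbows = find_elbows(visits)    # crossings 1 and 2 share the same sense
legs = leg_lengths(visits, 1, n_links=6)
```

The scan only classifies substructures; choosing which of them to reverse, and in which order, is a separate optimization.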
Using leg movement

A transformation that reverses the over/under nature of a leg involves the motion of all the beads constituting the leg. Each bead must move to the location of the crossing (the "root" of the leg), and then move back to its original location [90]. The canonical leg movement is shown schematically in figure 5.10. If more than one crossing occurs on a leg, we can reverse the nature of all of them through a single leg movement (see figure 5.11). The move is topologically equivalent to moving the free end of the leg along the leg up to the desired crossing, and then moving all the way back to the original position while reversing the nature of the crossings on the way back.

Figure 5.11: One can reverse the over-under nature of all the crossings that have occurred on a leg through a single leg movement. This move has no Reidemeister equivalent.

Loop twist and loop collapse

Reversing the crossing of a loop substructure can be achieved by a move that is topologically equivalent to a twist; see figure 5.12(a). This type of move is called a Reidemeister type I move in knot theory. However, the optimal motion is generally not a twist or rotation in 3-dimensional space (3D). Figure 5.12(b) shows a move which is topologically equivalent to a twist in 3D but costs a smaller distance, by simply moving the residues inside the loop in straight lines to their final positions, resulting in a "pinching" motion to close the loop and re-open it. From now on we refer to the optimal motion simply as a loop twist, because it is topologically equivalent, but we keep in mind that the actual optimal physical move, and the distance calculated from it, is different.
Elbow moves

Reversing the crossings of an elbow substructure can be done by moving the elbow segment in the motion depicted in figure 5.13: each segment moves in a straight line to its corresponding closest point on the obstruction chain, and then moves in a straight line to its final position.

Operator notation

The transformations for leg movement, elbow, and twist can be expressed very naturally in operator notation, where in order to untangle the chain the various operators are applied on the chain until the nature of all the self-crossings is reversed. If we uniquely identify each instance of self-crossing by a number, then a topological loop twist at crossing i can be represented by the operator R(i) (R for Reidemeister). An elbow move, for the elbow defined by crossings i and i + 1, can be represented as E(i, i + 1). As discussed above, for each self-crossing two legs can be identified, corresponding to the two termini of the chain. This was exemplified in figure 5.7 by the red and blue legs. Since we choose a direction of parametrization for the chain, we refer to the two leg movements as the "start leg" movement and the "end leg" movement, and for a generic crossing i we denote them as $L_N(i)$ and $L_C(i)$ respectively. The operators defined above are left acting (similar to matrix multiplication). So a loop twist at crossing i followed by an elbow move at crossings j and j + 1 is represented by E(j, j + 1)R(i).

Figure 5.12: (a) Reversing the over-under nature of a crossing through a topological loop twist: Reidemeister move type I. (b) By "pinching" the loop before the twist, the cost in distance for changing the crossing nature is reduced.

Figure 5.13: Schematic of the canonical elbow move, from left to right. This is equivalent to Reidemeister move type II.

Figure 5.14: A chain with several self-crossing points before and after untangling. Various topological substructures that are discussed in the text are color coded. For the case of the legs (red and cyan), note that various other legs can be identified, for example a leg that starts at crossing 2 and ends at the red terminus. Here we color only the shortest legs: from crossing 1 to the terminus as red, and from crossing 2 to the opposite terminus as cyan.

Example

Figure 5.14 shows sample configurations before and after untangling. The direction of parametrization is from the red terminus to the cyan terminus. It can be seen that there are several ways to untangle the chain. One example would be $R(3)L_C(2)R(1)$, which consists of a twist of the green loop, followed by the cyan leg movement, followed by a twist of the blue loop. Another path of untangling would be $E(2, 3)L_N(1)$, which is movement of the red leg followed by the magenta elbow move. For the two transformations above, the order of operations can be swapped, i.e., they are commutative, and the resulting distance for each of the transformations will be the same. That is, $D[E(2, 3)L_N(1)] = D[L_N(1)E(2, 3)]$. However, $E(2, 3)L_N(1)$ is a more efficient transformation than $R(3)L_C(2)R(1)$, i.e. $D[E(2, 3)L_N(1)] < D[R(3)L_C(2)R(1)]$.

Other transformation moves are not commutative in the algorithm. For example, in Figure 5.14, $L_N(1)R(3)R(2)$ is not allowed, since R(2) will only act on loops defined by two instances of a crossing that are encountered consecutively in traversing the polymer, i.e., no intermediate crossings can occur. Therefore, even if crossing 2 happens kinetically before crossing 3 during the ghost transformation, only the transformation $L_N(1)R(2)R(3)$ is allowed in the algorithm.

Minimal untangling cost

For each operator in the above formalism, a transformation distance/cost can be calculated. Hence the optimal untangling strategy is finding the optimal set of operator applications with minimal total cost.
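A search for the minimal-cost sequence of operators can be sketched as a depth-first search that abandons any branch whose accumulated cost already exceeds the best complete solution found so far. The version below is a minimal runnable illustration, not the Octave or C++ implementation used in this work: the move labels, the `TREE` of allowed follow-up moves, and all costs are hypothetical values chosen only to exercise the search.

```python
import math


def find_min_cost(available_moves, cost, moves_so_far=(), cost_so_far=0.0, best=math.inf):
    """Return (minimal total cost, moves from this node down) using depth-first search."""
    if cost_so_far > best:                 # prune: already worse than a known solution
        return math.inf, ()
    options = available_moves(moves_so_far)
    if not options:                        # leaf: every crossing has been reversed
        return cost_so_far, ()
    best_moves = ()
    for move in options:
        total, tail = find_min_cost(available_moves, cost,
                                    moves_so_far + (move,),
                                    cost_so_far + cost(move), best)
        if total < best:
            best, best_moves = total, (move,) + tail
    return best, best_moves


# Hypothetical move tree: keys are the moves applied so far, values the allowed next moves.
TREE = {
    (): ["R(1)", "LN(1)"],
    ("R(1)",): ["R(2)"],
    ("R(1)", "R(2)"): ["LC(3)", "E(3,4)"],
    ("LN(1)",): ["R(2)", "LC(2)"],
    ("LN(1)", "R(2)"): ["LC(3)"],
}
COST = {"R(1)": 10, "R(2)": 10, "LC(3)": 60, "E(3,4)": 5, "LN(1)": 20, "LC(2)": 30}

best_cost, best_path = find_min_cost(lambda moves: TREE.get(tuple(moves), []), COST.get)
```

Here `best_path` lists the moves in the order they are applied; in the left-acting operator notation of the text the same sequence would be written right to left.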
This solution amounts to a search in the tree of all possible transformations, as illustrated in Figure 5.15. The optimal application of operators can be computed by applying a version of the depth-first tree search algorithm. According to the algorithm, from any given conformation there are several moves that can be performed, each having a cost associated with the move. The pseudo-code for the search algorithm can be written as follows:

    procedure find_min_cost (moves_so_far=None, cost_so_far=0,
                             min_total_cost=Infinity):
        optim_moves = NULL_MOVE
        # prune: this branch already costs more than the best solution found
        if cost_so_far > min_total_cost:
            return [Infinity, optim_moves]
        endif
        # leaf: no moves remain, so the branch cost is the accumulated cost
        if available_moves(moves_so_far) is empty:
            return [cost_so_far, optim_moves]
        endif
        for move in available_moves(moves_so_far):
            [temp_cost, temp_optim_moves] = find_min_cost (moves_so_far + move,
                                                           cost_so_far + cost(move),
                                                           min_total_cost)
            if temp_cost < min_total_cost:
                min_total_cost = temp_cost
                optim_moves = move + temp_optim_moves
            endif
        endfor
        return [min_total_cost, optim_moves]
    endprocedure

The values to the right side of the equality sign in the arguments of the procedure are the default values that the procedure starts with. The procedure is called recursively, and returns both the set of optimal uncrossing moves (for a given crossing matrix corresponding to a starting and final conformation) and the distance corresponding to that set of optimal uncrossing moves. The algorithm visits all branches of the tree of possible uncrossing operations until it reaches the end. However, it terminates the search along a branch as soon as the cost of operations exceeds that of a solution already found. See figure 5.15 for an illustration of the depth-first tree search algorithm. The above procedure was implemented using both the GNU Octave programming language and C++. To optimize speed by eliminating redundant moves, only one permutation was considered when operators commuted.

Figure 5.15: An example (subset) tree of possible transformations for a given crossing structure. Accumulated distances are given inside the circles representing nodes of the tree; the non-crossing transformations and their corresponding distances are shown next to the branches of the tree. The algorithm starts from the bottom node and proceeds to the top nodes, starting in this case along the right-most branch. The possible transformations to be considered as candidate minimal transformations are: $[L_C(3)R(2)R(1)]$; $[E(3, 4)R(2)R(1)]$; $[R(2)L_N(1)]$, which then terminates because the accumulated distance exceeds the minimum so far of 25; and $[L_C(2)L_N(1)]$.

5.2.2 Generating unfolded ensembles

To generate transformations between unfolded and folded conformations, we adopt an off-lattice coarse-grained Cα model [22, 122], and generate an unfolded structural ensemble from the native structure as follows. For a native structure with N links, we define three data sets:

• The set of Cα residue indices i, for which i = 1, · · · , N
• The set of native link angles $\theta_j$ between three consecutive residues, for which j = 2, · · · , N − 1
• The set of native dihedral angles $\phi_k$ between four consecutive residues, for which k = 2, · · · , N − 2

The distribution of Cα-Cα distances in PDB structures is sharply peaked around 3.76 Å (σ = 0.09 Å). In practice we took the first Cα-Cα distance from the N-terminus as representative, and used that number as the equilibrium link length for all Cα-Cα distances in the protein.

To generate an unfolded ensemble, we start by selecting at random a bead i (2 ≤ i ≤ N − 1) in the native conformation, and we then perform operations that change the angle centered at residue i, $\theta_i$, and the dihedral centered at bond i–(i + 1), $\phi_i$. If i = N − 1 only the angle is changed. The new angle and dihedral are selected at random from the Boltzmann distribution described below. At the end of each operation, $\theta_i \to \theta_i^{new}$ and $\phi_i \to \phi_i^{new}$.
Changing these values corresponds to rotating an entire substructure, where all the beads j > i end up in a new position. This recipe corresponds to an extension of the pivot algorithm [74, 82], with the additional feature that the most probable rotation selects the values of the angles and dihedrals in the native structure. That is, if we define $\Delta\theta_i = \theta_i^{new} - \theta_i^{Nat}$ and $\Delta\phi_i = \phi_i^{new} - \phi_i^{Nat}$, then the most probable Δθ and the most probable Δφ are both zero. By increasing an artificial temperature, larger $\Delta\theta_i$'s and $\Delta\phi_i$'s become more accessible. The new angle θ is chosen from a probability distribution proportional to exp(−βE(θ)), where βE(θ) is computed from

$$\beta E(\theta_i) = k_\theta \left(\theta_i - \theta_i^{Nat}\right)^2. \qquad (5.2)$$

The parameter 1/β plays the role of a temperature, which we have set to unity. We set $k_\theta = 20$. Similarly for φ, the probability distribution function is proportional to exp(−βE(φ)), where βE(φ) is computed from

$$\beta E(\phi_i) = k_{\phi 1}\left[1 - \cos(\Delta\phi_i)\right] + k_{\phi 3}\left[1 - \cos(3\Delta\phi_i)\right] \qquad (5.3)$$

with $\Delta\phi_i = \phi_i - \phi_i^{Nat}$, $k_{\phi 1} = 1$, $k_{\phi 3} = 0.5$, and again β = 1. The fact that the $k_\phi$'s are much smaller than $k_\theta$ means that, for a given temperature, the dihedral angles are more uniformly distributed than the θ's. If β is set to zero then all states are equally accessible and the algorithm reduces to the pivot algorithm, i.e., a random walk generator. If β is set to ∞ then the chain behaves as a rigid object and does not deviate from its native state.

Each pivot operation results in a new structure that must be checked so that it has no steric overlap with itself, i.e., the chain must be self-avoiding. If the new chain conformation has steric overlap, then the attempted move is discarded, and a new residue is selected at random for a pivot operation. In practice, we defined steric overlap by first finding an approximate contact or cut-off distance for the coarse-grained model.
The contact distance was taken to be the smaller of either the minimum Cα-Cα distance between those residues in native contact (where two residues are defined to be in native contact if any of their heavy atoms are within 4.9 Å), or the Cα-Cα distance between the first two consecutive residues. For SH3, for example, the minimum Cα distance in native contacts is 4.21 Å and the first link length is 3.77 Å, so for SH3 all non-neighbor beads must be further apart than 3.77 Å for a pivot move to be accepted. Future refinements of the acceptance criteria can involve the use of either the mean Cα-Cα distance or other criteria more accurately representing the steric excluded volume of residue side chains.

In our recipe, to generate a single unfolded structure we start with the native structure and implement $\mathcal{N}$ successful pivot moves, where $\mathcal{N}$ is related to the number of residues N by

$$\mathcal{N} = \frac{\ln(0.01)}{\ln\left[0.99\,(N - 2)/(N - 1)\right]}.$$

For the next unfolded structure we start again from the native structure and pivot $\mathcal{N}$ successful times, following the above recipe. Note that $\mathcal{N}$ successful pivots do not generally affect all beads of the chain. In the most likely scenario some beads are chosen several times and some beads are not chosen at all, according to a Poisson distribution. This particular choice of $\mathcal{N}$ means that for polymers with N < 101, where (N − 2)/(N − 1) < 0.99, the chance that any given link is not pivoted at all during the $\mathcal{N}$ pivot operations is 0.01. On the other hand, for longer polymers where (N − 2)/(N − 1) > 0.99, any particular segment of the protein whose length is 0.01 of the total length has a 0.01 chance of not having any of its beads pivoted. For any N, however, the sheer number of pivot moves generally ensures a large RMSD between the native and generated unfolded structures. Each unfolded structure generally retains small amounts of native-like secondary and tertiary structure, due to the native biases in the angle and dihedral distributions.
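As a quick numerical check of the pivot-count formula above, the expression can be evaluated directly. The function name `n_pivots` and its keyword arguments are hypothetical, and the chain length used in the example is only an illustrative input.

```python
import math


def n_pivots(n_residues, miss_probability=0.01, factor=0.99):
    """Number of successful pivot moves: ln(0.01) / ln[0.99 (N - 2) / (N - 1)]."""
    ratio = factor * (n_residues - 2) / (n_residues - 1)
    return math.log(miss_probability) / math.log(ratio)


# Example: a chain with N = 56 gives roughly 162 pivot moves.
pivots_56 = round(n_pivots(56))
```

The result grows slowly with chain length because the factor 0.99 dominates the logarithm once (N − 2)/(N − 1) approaches one.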
For example, for SH3 the number of successful pivot moves was 162 and the mean fraction of native contacts in the generated unfolded ensemble was 0.06.

5.2.3 Proteins used

The proteins used in this study are given in table 5.1. They consist of 25 2-state folders, 13 3-state folders, 11 all α-helix proteins, 14 all β-sheet proteins, 13 α-β proteins, and 5 knotted proteins.

Table 5.1: Proteins analyzed in this chapter.

PDB   fold  2ndry str.  log kf   LRO  RCO   ACO  MRSD  RMSD   Dnx  Dnx/N  D (×10^3)   D/N    N
1A6N  3     α-helix       1.10   1.4  0.1  14.0  26.2  29.2   285    1.9       4.24  28.1  151
1APS  2     Mixed        -1.48   4.2  0.2  21.8  22.7  25.4   201    2.1       2.43  24.8   98
1BDD  2     α-helix      11.75   0.9  0.1   5.2  14.0  14.9  76.5    1.3       0.91  15.2   60
1BNI  3     Mixed         2.60   2.5  0.1  12.3  20.8  22.8   209    1.9       2.46  22.8  108
1CBI  3     β-sheet      -3.20   2.8  0.1  18.8  25.1  27.9   286    2.1       3.70  27.2  136
1CEI  3     α-helix       5.80   1.0  0.1   9.1  16.7  18.9  71.4    0.8       1.49  17.5   85
1CIS  2     Mixed         3.87   3.3  0.2  10.8  15.1  16.8  99.7    1.5       1.10  16.6   66
1CSP  2     β-sheet       6.98   3.0  0.2  11.0  16.8  18.4  98.0    1.5       1.23  18.3   67
1EAL  3     β-sheet       1.30   2.5  0.1  15.7  24.9  27.9   278    2.2       3.44  27.1  127
1ENH  2     α-helix      10.53   0.4  0.1   7.4  13.5  14.9  28.0    0.5       0.76  14.1   54
1G6P  2     β-sheet       6.30   3.8  0.2  11.7  16.4  18.0  83.1    1.3       1.17  17.7   66
1GXT  3     Mixed         4.38   3.7  0.2  18.6  21.1  23.5   148    1.7       2.03  22.8   89
1HRC  2     α-helix       8.76   2.2  0.1  11.7  19.6  22.2   126    1.2       2.17  20.8  104
1IFC  3     β-sheet       3.40   2.8  0.1  17.7  25.1  27.9   284    2.2       3.58  27.3  131
1IMQ  2     α-helix       7.31   1.7  0.1  10.4  16.1  17.9  80.7    0.9       1.46  17.0   86
1LMB  2     α-helix       8.50   1.1  0.1   7.1  17.0  18.6  76.8    0.9       1.55  17.9   87
1MJC  2     β-sheet       5.24   3.0  0.2  11.0  17.5  19.2   110    1.6       1.32  19.1   69
1NYF  2     β-sheet       4.54   2.8  0.2  10.6  15.3  17.0  87.4    1.5       0.97  16.8   58
1PBA  2     Mixed         6.80   2.6  0.1  12.0  18.9  20.8   156    1.9       1.69  20.8   81
1PGB  2     Mixed         6.00   2.1  0.2   9.7  14.1  15.7  25.4    0.5       0.81  14.5   56
1PKS  2     β-sheet      -1.05   3.8  0.2  15.2  17.9  20.2   136    1.8       1.50  19.7   76
1PSF  3     β-sheet       3.22   2.8  0.2  11.7  16.8  19.4  72.1    1.0       1.23  17.8   69
1RA9  3     Mixed        -2.50   3.4  0.1  22.3  25.5  28.6   402    2.5       4.46  28.1  159
1RIS  2     Mixed         5.90   3.0  0.2  18.4  21.5  23.9   163    1.7       2.25  23.2   97
1SHG  2     β-sheet       1.41   3.0  0.2  10.9  15.1  16.7  92.3    1.6       0.95  16.7   57
1SRL  2     β-sheet       4.04   3.1  0.2  11.0  14.8  16.3  94.5    1.7       0.92  16.5   56
1TIT  3     β-sheet       3.47   4.1  0.2  15.8  18.7  20.8   154    1.7       1.82  20.4   89
1UBQ  2     Mixed         5.90   2.4  0.2  11.5  17.0  18.9  92.1    1.2       1.39  18.2   76
1VII  2     α-helix      11.52   0.4  0.1   4.0   8.1   9.2   4.1    0.1       0.30   8.2   36
1WIT  2     β-sheet       0.41   5.0  0.2  18.9  20.4  22.7   168    1.8       2.07  22.2   93
2A5E  3     Mixed         3.50   2.6  0.1   8.3  22.2  23.9   354    2.3       3.82  24.5  156
2ABD  2     α-helix       6.55   2.3  0.1  12.0  18.2  20.0  77.5    0.9       1.65  19.1   86
2AIT  2     β-sheet       4.20   4.1  0.2  14.4  16.9  18.7   107    1.5       1.36  18.3   74
2CI2  2     Mixed         3.90   2.7  0.2  10.0  15.1  16.9  78.3    1.2       1.06  16.4   65
2CRO  3     α-helix       3.70   1.2  0.1   7.3  14.0  15.5  37.3    0.6       0.95  14.6   65
2HQI  2     Mixed         0.18   4.3  0.2  13.6  16.3  18.4  86.9    1.2       1.26  17.5   72
2PDD  2     α-helix       9.80   1.0  0.1   4.8  10.6  11.5  19.9    0.5       0.48  11.0   43
2RN2  3     Mixed         0.10   3.6  0.1  19.3  27.7  30.9   521    3.4       4.81  31.0  155
1O6D  ?     Knotted          ?   3.1  0.1  18.9  26.2  28.7   515    3.5       4.36  29.7  147
2HA8  ?     Knotted          ?   3.3  0.1  16.2  25.7  28.5   671    4.1       4.84  29.9  162
2K0A  ?     Knotted          ?   3.4  0.1  14.6  22.4  24.5   369    3.4       2.81  25.8  109
2EFV  ?     Knotted          ?   2.1  0.2  12.6  20.0  21.8   147    1.8       1.79  21.8   82
1NS5  3     Knotted      -1.83   2.9  0.1  18.2  27.5  30.4   503    3.3       4.71  30.8  153
1MXI  3     Knotted      -2.56   2.8  0.1  16.7  26.1  29.0   643    4.0       4.85  30.1  161
3MLG  3     Knotted      -6.91   1.2  0.1  21.4  27.7  30.8   481    2.8       5.16  30.5  169

5.2.4 Calculating distance metrics for the unfolded ensemble

To obtain minimal transformations between unfolded and native structures for a given protein, the Cα backbone was extracted from the PDB native structure, and 200 coarse-grained unfolded structures were generated using the methods described above. The unfolded structures were then aligned using RMSD and the average (residual) RMSD was calculated.
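The two residual distances used in this section differ only in where the square root is taken. A minimal sketch, assuming the two conformations are already aligned and given as equal-length lists of (x, y, z) tuples; the function names are hypothetical and the alignment step itself is not shown:

```python
import math


def mrsd(conf_a, conf_b):
    """Mean root squared distance: average of the per-bead distances."""
    return sum(math.dist(p, q) for p, q in zip(conf_a, conf_b)) / len(conf_a)


def rmsd(conf_a, conf_b):
    """Root mean squared distance: square root of the average squared per-bead distance."""
    mean_sq = sum(math.dist(p, q) ** 2 for p, q in zip(conf_a, conf_b)) / len(conf_a)
    return math.sqrt(mean_sq)


# Two-bead toy example: one bead moves by 5 units, the other does not move.
a = [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)]
b = [(3.0, 4.0, 0.0), (0.0, 0.0, 0.0)]
example_mrsd = mrsd(a, b)   # (5 + 0) / 2
example_rmsd = rmsd(a, b)   # sqrt((25 + 0) / 2)
```

For the same pair of conformations the MRSD never exceeds the RMSD, which is the ordering seen in the MRSD and RMSD columns of table 5.1.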
The unfolded structures were then aligned by minimizing MRSD, and the residual MRSD was calculated. The conformations were then further coarse-grained (smoothed) by sampling every other bead, hence reducing the total number of beads. By this further coarse-graining, we eliminate all instances of potential self-crossing in which the loop size or elbow size is smaller than three links. Each structure was then transformed to the folded state by the algorithm discussed earlier in section 5.2.1. The self-crossing instances, along with the coordinates of all the beads, were recorded as well. Appropriate data structures were formed and the relevant crossing substructures (leg, elbow, and loop) were detected. With the topological data structures at hand, the minimal untangling cost was found through the depth-first search in the tree of possible uncrossing operations that was described above.

Finally, the minimal untangling cost, $D_{nx}$, and the total distance, D, were calculated for each unfolded conformation. These differ from one unfolded conformation to the other; the ensemble average is recorded and used below. The ensemble averages of MRSD and RMSD were also calculated from the 200 unfolded structures that were generated.

Importance of non-crossing

We define the importance of non-crossing (INX) as the ratio of the extra untangling movement caused by non-crossing constraints to the distance when no such constraints exist, i.e., if the chain behaved as a ghost chain. Mathematically this ratio is defined as

$$\mathrm{INX} = \frac{D_{nx}}{\mathrm{MRSD} \times N}.$$

Other metrics

Following [52], we define Long-range Order (LRO) as:

$$\mathrm{LRO} = \sum_{i<j} n_{ij}/N \quad \text{where} \quad n_{ij} = \begin{cases} 1 & \text{if } |i - j| > 12 \\ 0 & \text{otherwise} \end{cases} \qquad (5.4)$$

where i and j are the sequence indices for two residues for which the Cα-Cα distance is ≤ 8 Å in the native structure.
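The INX ratio and the LRO of Eq. 5.4 translate directly into code. The sketch below is illustrative: the function names are hypothetical, residues are numbered from 1, and the contact list is built from Cα coordinates with the 8 Å cutoff stated above.

```python
import math


def inx(d_nx, mrsd_value, n):
    """Importance of non-crossing: D_nx divided by the ghost-chain distance MRSD x N."""
    return d_nx / (mrsd_value * n)


def ca_contacts(ca_coords, cutoff=8.0):
    """Residue pairs (i, j), i < j, 1-based, whose C-alpha atoms lie within the cutoff."""
    n = len(ca_coords)
    return [(i + 1, j + 1) for i in range(n) for j in range(i + 1, n)
            if math.dist(ca_coords[i], ca_coords[j]) <= cutoff]


def long_range_order(contacts, n_residues, min_separation=12):
    """LRO: number of contacts with sequence separation above 12, divided by N."""
    return sum(1 for i, j in contacts if abs(i - j) > min_separation) / n_residues


# Illustrative inputs only.
example_inx = inx(d_nx=100.0, mrsd_value=20.0, n=50)
example_lro = long_range_order([(1, 20), (2, 5), (3, 30)], n_residues=30)
```

Contacts between sequence neighbors are returned by `ca_contacts` but never counted by `long_range_order`, since their separation is below the threshold.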
Likewise we define Relative Contact Order (RCO) following [105]:

$$\mathrm{RCO} = \frac{1}{L \times N} \sum_{i<j}^{N} \Delta L_{ij}, \tag{5.5}$$

where N is the total number of contacts between nonhydrogen atoms in the protein that are within 6 Å in the native structure, L is the number of residues, and ∆Lij is the sequence separation between contacts in units of residues. Similarly, Absolute Contact Order (ACO) [105] is defined to be:

$$\mathrm{ACO} = \frac{1}{N} \sum_{i<j}^{N} \Delta L_{ij} = \mathrm{RCO} \times L \tag{5.6}$$

5.3 Results

Proteins were classified by several criteria:
• 2-state vs. 3-state folders
• α-helix dominated vs. β-sheet dominated vs. mixed
• knotted vs. unknotted proteins

Several questions are answered for each group of proteins:
• What fraction of the total transformation distance is due to noncrossing constraints?
• How do the different order parameters distinguish between the different classes of proteins?
• How do the different order parameters correlate with each other?

In table 5.2, we compare the unfolded ensemble-average of several metrics between different classes of proteins, and perform a p-value analysis based on the Welch t-test. The null hypothesis states that the two samples being compared come from normal distributions that have the same means but possibly different variances. Metrics compared in Table 5.2 are INX, LRO, RCO, ACO, MRSD, RMSD, Dnx, Dnx/N, D, D/N and N.

The most obvious check of the general method outlined in the present paper is to compare the non-crossing distance Dnx between knotted and unknotted proteins. Here we see that knotted proteins traverse about 3.5 times the distance as unknotted proteins in avoiding crossings, so that the two classes of proteins are different by this metric. The same conclusion holds for knotted vs. unknotted proteins if we use Dnx/N, D, D/N, or INX.
Of all metrics, the statistical significance is highest when comparing D/N, which is important because the knotted proteins considered here tend to be significantly longer than the unknotted proteins, so that chain length N distinguishes the two classes. Dividing by N partially normalizes the chain-length dependence of D; however, D/N still correlates remarkably strongly with N when compared for all proteins (r = 0.824, see Appendix F, Table F.8). It was somewhat unusual that MRSD and RMSD distinguished knotted proteins from unknotted proteins better than D (or Dnx), which accounts for non-crossing. All other quantities, including INX, ACO, and RCO, distinguish knotted from unknotted proteins. The only quantity that fails is LRO. The importance of non-crossing INX, measuring the ratio of the uncrossing distance Dnx to the ghost-chain distance N × MRSD, was largest for knotted proteins, followed by β proteins, with α proteins having the smallest INX. Mixed proteins had an average INX value in between that for α and β proteins.

In distinguishing all-α and all-β proteins, we find that LRO and RCO are by far the best discriminants. Interestingly, INX and Dnx/N also discriminate these two classes comparably or better than ACO does. Dnx is marginal, while all other metrics fail. All metrics except for N and D are able to discriminate α from mixed α-β proteins, with LRO performing the best by far. Interestingly, none of the above metrics can distinguish β proteins from mixed α-β proteins.

It is sensible that energetic considerations would be the dominant distinguishing mechanism between two- and three-state folders. Intermediates are typically stabilized energetically. We can nevertheless investigate whether any geometrical quantities discriminate the two classes. Indeed LRO and RCO fail, as does INX. This supports the notion that intermediates are not governed by “topological traps” that are undone by uncrossing motion, but rather are energetically driven.
ACO performs marginally. Three-state folders tend to be longer than 2-state folders, so that N distinguishes them and in fact provides the strongest discriminant, consistent with previous results [43]. Interestingly, RMSD, MRSD, and D perform comparably to N. However, these measures also correlate strongly with N (see Appendix F, Table F.8). D/N, Dnx, and Dnx/N also perform well, but still correlate with N, albeit more weakly than the above metrics.

Figure 5.16A shows a scatter plot of all proteins as a function of Dnx/N vs. LRO. Knotted and unknotted proteins are indicated, as are α, β, and mixed α-β proteins. Two- and three-state proteins are indicated as triangles and squares respectively. From the figure, it is easy to visualize how LRO provides a successful discriminant between α/β and α/(mixed) proteins, but is unsuccessful in discriminating β/(mixed), knotted and unknotted, and two- and three-state folders. It is also clear from the figure how Dnx/N discriminates knotted from unknotted proteins. One can also see distribution overlap, but nevertheless successful discrimination, between α and β and between α and mixed proteins.

Figure 5.16B shows a scatter plot of all proteins as a function of Dnx vs. N, using the same rendering scheme for protein classes as in Figure 5.16A. From the figure, one can see how the metrics correlate with each other, and how they both discriminate knotted from unknotted proteins and 2-state from 3-state proteins. Moreover, one can see how, despite the significant correlation between Dnx and N, Dnx can discriminate α proteins from either β proteins or mixed α/β proteins, while N cannot.

As a control study for the above metrics, we took random selections of half of the proteins, to see if random partitioning of the proteins into two classes resulted in any of the metrics distinguishing the two sets with statistical significance. No metric in this study had significance: the p-values ranged from about 0.32 to 0.94.
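The Welch t-test used for Table 5.2, and the random half-split control just described, can be sketched with the Python standard library alone. This is an illustrative sketch under stated assumptions, not the analysis code behind the thesis: the function names are invented here, the p-value is obtained by a simple Simpson-rule integration of the Student t density rather than a library routine, and the sample data are only the Dnx/N entries visible on the continued page of Table 5.1 (all 7 knotted proteins, but only 12 of the unknotted ones), so the resulting number is not expected to reproduce Table 5.2.

```python
import math
import random
import statistics


def welch_t(sample_a, sample_b):
    """Welch t statistic and Welch-Satterthwaite degrees of freedom
    (equal means under the null, variances allowed to differ)."""
    na, nb = len(sample_a), len(sample_b)
    va = statistics.variance(sample_a) / na
    vb = statistics.variance(sample_b) / nb
    t = (statistics.mean(sample_a) - statistics.mean(sample_b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = math.lgamma((df + 1) / 2) - math.lgamma(df / 2) - 0.5 * math.log(df * math.pi)
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))


def two_sided_p(t, df, steps=2000):
    """Two-sided p-value, 1 - 2 * integral_0^|t| pdf, via Simpson's rule
    (steps must be even)."""
    a = abs(t)
    if a == 0.0:
        return 1.0
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_pdf(k * h, df)
    return max(0.0, 1.0 - 2.0 * total * h / 3.0)


def welch_p_value(sample_a, sample_b):
    t, df = welch_t(sample_a, sample_b)
    return two_sided_p(t, df)


def random_half_split_p(values, rng):
    """Control: split the pooled values into two random halves and test them."""
    shuffled = list(values)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return welch_p_value(shuffled[:half], shuffled[half:])


# Dnx/N values copied from the continued page of Table 5.1 (partial data set).
knotted = [3.5, 4.1, 3.4, 1.8, 3.3, 4.0, 2.8]
unknotted_subset = [1.7, 1.2, 0.1, 1.8, 2.3, 0.9, 1.5, 1.2, 0.6, 1.2, 0.5, 3.4]

p_knot = welch_p_value(knotted, unknotted_subset)
p_control = random_half_split_p(knotted + unknotted_subset, random.Random(0))
print(p_knot, p_control)
```

With access to SciPy, the hand-written integration could be replaced by a library t-test with unequal variances; the standard-library version is used here only to keep the sketch self-contained.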
Figure 5.17 shows a plot of the statistical significance for all the metrics in Table 5.2 to distinguish various pairs of protein classes: 2-state from 3-state, α from β, α from mixed α/β, β from mixed, and knotted from unknotted. We can define the most consistent discriminator between protein classes as that metric that is statistically significant for the most classes, and for those classes has the highest statistical significance. By this criterion Dnx/N is the most consistent discriminator between the general structural and kinetic classes considered here.

Interestingly, in all cases, the extra distance introduced by non-crossing constraints is a very small fraction (less than 13%) of the MRSD, which represents the ghost distance neglecting non-crossing. This was not an obvious result, but it helps explain why simple order parameters that contain no explicit penalty for crossing have been so successful historically [5, 8, 21, 30, 94, 98, 105].

[Figure 5.16, two panels: (A) ⟨Dnx/N⟩ vs. LRO; (B) ⟨Dnx⟩ vs. N]

Figure 5.16: (A) Scatter plot of all proteins as a function of Dnx/N and LRO. Knotted proteins are indicated as green circles and are clustered; unknotted proteins are clustered using the black closed curve, and contain α-helical proteins clustered in red, and mixed α-β proteins clustered in magenta. Beta proteins are indicated in blue. Two- and three-state proteins are indicated as triangles and squares respectively. LRO provides a strong discriminant against α and mixed proteins, but not knotted and unknotted proteins, while Dnx/N discriminates knotted from unknotted proteins, and moderately discriminates α proteins from mixed proteins. (B) Scatter plot of all proteins as a function of Dnx and N. The rendering scheme for protein classes is the same as in panel (A). Kinetic 2-state folders are indicated by the black dashed curve.
Both Dnx and N distinguish knotted from unknotted proteins, and 2-state from 3-state proteins. By projecting α proteins and either mixed α/β or all-β proteins onto each order parameter, one can see how Dnx can discriminate α proteins from both mixed or β proteins, while N cannot. This is despite the significant correlation between Dnx and N.

[Figure 5.17: bar chart of −log(p-value) for INX, LRO, RCO, ACO, MRSD, RMSD, Dnx/N, Dnx, D, D/N, and N, grouped by class pair: 2-3s, α-β, α-M, β-M, knot-unknot]

Figure 5.17: Statistical significance for all order parameters in distinguishing between different classes of proteins. The − log of the statistical significance is plotted for various pairs of protein classes, so that a higher number indicates better ability to distinguish between different classes. The blue horizontal line indicates a threshold of 5% for statistical significance.

Class                      INX       LRO  RCO       ACO   MRSD  RMSD  Dnx   Dnx/N     D     D/N   N
2-state folders            7.55e-02  2.7  1.58e-01  11.4  16.4  18.1  94.9  1.3       1309  17.6  71.3
3-state folders            8.25e-02  2.6  1.31e-01  14.7  21.9  24.4  238   1.9       2924  23.8  116
α-helix proteins           5.21e-02  1.2  1.10e-01  8.5   15.8  17.5  80.4  8.74e-01  1450  16.7  77.9
β-sheet proteins           9.04e-02  3.3  1.72e-01  13.9  18.7  20.8  146   1.7       1802  20.4  83.4
Mixed secondary structure  8.64e-02  3.1  1.56e-01  14.5  19.9  22.1  195   1.8       2274  21.6  98.3
Unknotted proteins         7.79e-02  2.6  1.49e-01  12.5  18.3  20.3  144   1.5       1862  19.7  86.9
Knotted proteins           1.30e-01  2.7  1.24e-01  16.9  25.1  27.7  476   3.3       4074  28.4  140

Metric  P: 2-/3-state  P: αβ       P: βm       P: αm       P: knotted/unknotted
INX     (3.93e-01)     4.01e-05    (5.71e-01)  5.44e-04    1.48e-03
LRO     (9.46e-01)     7.40e-08    (4.27e-01)  6.20e-07    (9.20e-01)
RCO     (5.07e-02)     3.34e-07    (2.68e-01)  3.48e-03    1.49e-02
ACO     4.50e-02       3.76e-04    (7.08e-01)  1.62e-03    5.59e-03
MRSD    5.89e-04       (1.19e-01)  (4.50e-01)  4.11e-02    1.79e-04
RMSD    4.88e-04       (1.14e-01)  (4.73e-01)  4.16e-02    3.18e-04
Dnx     3.30e-03       4.50e-02    (2.99e-01)  2.30e-02    2.05e-03
Dnx/N   1.71e-02       1.88e-04    (6.65e-01)  1.56e-03    5.33e-04
D       8.06e-04       (4.14e-01)  (3.10e-01)  (1.06e-01)  2.67e-03
D/N     8.56e-04       (6.95e-02)  (4.67e-01)  2.68e-02    1.04e-04
N       4.17e-04       (6.57e-01)  (2.49e-01)  (1.59e-01)  3.54e-03

Table 5.2: Order parameters for various classifications of proteins. The data set of 2- and 3-state folders is the same as the data set for α-helical, β-sheet and mixed proteins, and is given in table 5.1. This is also the same data set as the unknotted proteins. Knotted proteins are separately classified, and not included as either 2-state or 3-state proteins. A discrimination is deemed statistically significant if the probability of the null hypothesis is less than 5%.

Figure 5.18: Renderings of the three proteins whose minimal transformations we investigate in detail. (A) acyl-coenzyme A binding protein, PDB id 2ABD [3], an all-α protein; (B) Src homology 3 (SH3) domain of phosphatidylinositol 3-kinase, PDB id 1PKS [71], a largely β protein; (C) The designed knotted protein 2ouf-knot, PDB id 3MLG [68].

5.3.1 Quantifying minimal folding pathways

The minimum folding pathway gives the most direct way that an unfolded protein conformation can transform by reconfiguration to the native structure. However, different configurations in the unfolded ensemble transform by different sequences of events, for example one unfolded conformation may require a leg uncrossing move, followed by a Reidemeister move elsewhere on the chain, followed by an uncrossing move of the opposite leg, while another unfolded conformation may require only a single leg uncrossing move.
The sequence of moves can be represented as a color-coded bar plot, as shown in Figures 5.19-5.21. In these figures, the sequence of moves is taken from right to left, and the width of the bar indicates the non-crossing distance undertaken by that move. A scale bar is given underneath each figure indicating a distance of 100 in units of the link length. Red bars indicate moves corresponding to the N-terminal leg (LN) of the protein, while green bars indicate moves corresponding to the C-terminal leg (LC). Blue bars indicate Reidemeister “pinch and twist” moves, while cyan bars indicate elbow uncrossing moves. The typical sequence of moves varies depending on the protein.

Figure 5.19 shows the uncrossing transformations of the all-α protein acyl-coenzyme A binding protein (PDB id 2ABD [3], see Figure 5.18A). Panels A and B depict the same set of transformations, but in A they are sorted from largest to smallest values of LN uncrossing, and in B they are sorted from largest to smallest values of LC uncrossing. The leg moves in each panel are aligned so that the left ends of the bars corresponding to the moves being sorted are all lined up. Some transformations partway down in panel A do not require an LN move; these are then ordered from largest to smallest LC move. The converse is applied in panel B. Some transformations do not require either leg move; these are sorted in decreasing order of the total distance of Reidemeister loop twist moves. Finally, some transformations require only elbow moves; these are sorted from largest to smallest total uncrossing distance.

Figure 5.20 shows the uncrossing transformations for the Src homology 3 (SH3) domain of phosphatidylinositol 3-kinase (PI3K), a largely-β protein (about 23% helix, including 3 short 3₁₀ helical turns; PDB id 1PKS [71], see Figure 5.18B), sorted analogously to Figure 5.19.
Figure 5.21 shows the uncrossing transformations involved in the minimal folding of the designed knotted protein 2ouf-knot (PDB id 3MLG [68], Figure 5.18C). Interestingly, for the all-α protein 2ABD, ≈ 12% of the sample of 172 transformations considered did not require any uncrossing moves, and proceed directly from the unfolded to the folded conformation. These transformations are not shown in Figure 5.19. For the β protein and knotted protein, every transformation that we considered (195 for 1PKS and 90 for 3MLG) required at least one uncrossing move. As a specific example, the top-most move in Figure 5.21 panel B consists of a C-leg move (green) covering ≈ 90% of the non-crossing distance, followed by N-leg move (red) covering ≈ 7% of the distance, then a short elbow move (cyan), a short Reidemeister loop move (blue), another short elbow move (cyan), and finally a short Reidemeister move (blue). In some cases the elbow and loop moves commute if they involve different parts of the chain, but generally they do not. For this reason we have not made any attempt to cluster loop and elbow moves, rather we have just represented them in the order they occur. On the other hand, consecutive leg moves commute and can be taken in either order. In Figures 5.19-5.21, one can see that significantly more motion is involved in the leg uncrossing moves than for other types of move. The total distance covered by leg moves is 82% for 3MLG, 69% for 1PKS, and 49% for 2ABD. For 3MLG, the total leg move distance is comprised of 44% LN moves, and 38% LC moves. For 1PKS, leg move distance is comprised of 18% LN moves, and 51% LC moves. For 2ABD, distance for the leg moves is roughly symmetric with 26% LN and 23% LC . One difference that can be seen for the all-α protein compared to the β and knotted proteins is in the persistence of the leg motion. For 2ABD, only 24% of the transformations require LN moves and only 30% of the transformations require LC moves. 
On the other hand the persistence of leg moves is greater in the β protein and greatest in the knotted protein. For 1PKS, LN and LC moves persist in 74% and 66% of the transformations respectively. In 3MLG, LN and LC moves persist in 92% and 41% of the transformations respectively.

Figure 5.19: Bar plots for the noncrossing operations involved in minimal transformations, for the α protein 2ABD. The sequence of noncrossing operations in the transformation corresponding to a given pair of conformations is represented as a color-coded series of bars, with the sequence of moves going from right to left, and the length of the bar indicating the non-crossing distance undertaken by a particular move. Red bars indicate N-terminal leg (LN) uncrossing, green bars indicate C-terminal leg (LC) uncrossing, blue bars indicate Reidemeister “pinch and twist” loop uncrossing moves, and cyan bars indicate elbow uncrossing moves. The same set of 172 transformations is shown in panels A and B. Panel A sorts uncrossing transformations by rank ordering the following move types, largest to smallest: LN, LC, loop uncrossing, elbow move. Panel B sorts moves by LC, LN, loop uncrossing, elbow move. The scale bar underneath each panel indicates a distance of 100 in units of the link length. The arrow in each panel denotes the “most representative” transformation, as defined in the text.

Figure 5.20: Bar plots of the noncrossing operations for the β-sheet protein 1PKS (see Figure 5.19 and the text for more details). Red bars: LN uncrossing moves; green bars: LC uncrossing moves; blue bars: loop uncrossing moves; cyan bars: elbow uncrossing moves. The same set of 195 transformations is shown in panels A and B, sorted as in Figure 5.19. The scale bar underneath each panel indicates a distance of 100 in units of the link length.

Figure 5.21: Bar plots of the noncrossing operations for the knotted protein 3MLG (see Figure 5.19 and the text for more details). Red bars: LN uncrossing moves; green bars: LC uncrossing moves; blue bars: loop uncrossing moves; cyan bars: elbow uncrossing moves. The same set of 90 transformations is shown in panels A and B, sorted as in Figure 5.19. The scale bar underneath each panel indicates a distance of 100 in units of the link length. The arrow in each panel denotes the “most representative” transformation, as defined in the text. The transformation located 8 bars up from the bottom of Panel A requires both LN and LC moves; however, both leg motions are very small.

Inspection of the transformations for the β protein 1PKS in panels A and B of Figure 5.20 reveals that uncrossing moves generally cover larger distance than in the α protein 2ABD (the mean uncrossing distance is 136 for 1PKS vs. 77.5 for 2ABD). We also notice that in contrast to the leg uncrossing moves in 2ABD, both LN and LC moves are often required (44% of the transformations require both LN and LC moves, compared to 5% for 2ABD). The asymmetry of the protein is manifested in the asymmetry of the leg move distance: the LN moves are generally shorter than the LC moves, covering about 1/4 of the total leg move distance. As mentioned above, LC moves comprise about 51% of the total distance for the 195 transformations in Figure 5.20, while LN moves only comprise about 18% of the distance on average. Both LN and LC moves are persistent as mentioned above. A leg move of either type is present in 95% of the transformations.

Inspection of the transformations in Figure 5.21 reveals that every transformation requires either an LN or an LC move.
This is sensible for a knotted protein, and is in contrast to the transformations for the α protein 2ABD, where many transformations do not require any leg uncrossing at all and consist of only short Reidemeister loop and elbow moves. In this sense the diversity of folding routes [110, 111] for the knotted protein 3MLG is the smallest of the proteins considered here, and illustrates the concept that topological constraints induce a pathway-like aspect to the folding mechanism. The N-terminal LN leg move is the most persistently required uncrossing move, present in about 92% of the transformations. This is generally the terminal end of the protein that we found was involved in forming the pseudo-trefoil knot. Sometimes, however, the C-terminal end is involved in forming the knot, though this move is less persistent and is present in only 41% of the transformations. However, when an LC move is undertaken, the distance traversed is significantly greater, as shown in Panel B of Figure 5.21. This asymmetry is a consequence of the asymmetry already present in the native structure of the protein.

Consensus minimal folding pathways

From the transformations described in Figures 5.19-5.21, we see that there is a multitude of different transformations that can fold each protein. The pathways for the α protein 2ABD are more diverse than those for the β or knotted proteins. From the ensemble of transformations for each protein, we can average the amount of motion for each uncrossing move to obtain a quantity representing the consensus or most representative minimal folding pathway for that protein. This takes the form of the histograms in Figure 5.22, with the x-axes representing the order of uncrossing/untangling events, right to left, and the y-axes representing the average amount of motion in each type of move.
The ensemble of untangling transformations can be divided into three different classes: transformations in which leg LN is the largest move, transformations in which leg LC is the largest move, and transformations in which an elbow E or loop R (for Reidemeister type I) are the largest moves. Moreover, if LN and LC moves occur consecutively they can be commuted, so without loss of generality we take the LN move as occurring before the LC move in the x-axes of Figure 5.22. The leg moves, if they occur first, are then followed by either elbow (E) and/or loop (R) moves, of which there may be several. In general, the leg moves may both occur before the collection of loop and elbow moves, after them, or may bracket the elbow and loop moves (e.g. second bar in Figure 5.21). By the construction of our approximate algorithm, if two LN moves were encountered during a trajectory (they were encountered only a few times during the course of our studies), they would be aggregated into one LN move involving the larger of the two motions, in order to remove any possible redundancy of motion. Hence no more than one LN or LC move is obtained for all transformations. We found that three pairs of elbow and loop moves were sufficient to describe about 93% of all transformations (see the x-axes of Figure 5.22). In summary, the sequence LN , LC , R, E, R, E, R, E, LN , LC (read from left to right) characterized almost all transformations, and so was adopted as a general scheme. Any exceptions simply had more small elbow and loop moves that were of minor consequence; for these transformations we simply accumulated the extra elbow and loop moves into the most appropriate R or E move. The general recipe for rendering loops R in Figure 5.22 is as follows: if one R move is encountered (regardless of where), each half is placed first and last (third) in the general scheme. 
If two R moves are encountered, they are placed first and last, and if three R moves are encountered, they are simply partitioned in the order they occurred. For four or more R moves, the middle N − 2 (with N here the number of R moves) are accumulated into the middle slot in the general scheme. The same recipe is applied to elbow moves E. As a specific example, the first bar in Figure 5.21B consists of LC, LN, E1, R1, E2, R2, which after permutation of the first two leg moves falls into the general scheme above as LN, LC, R1, E1, 0, 0, R2, E2, 0, 0. The bottom-most transformation in Figure 5.21B consists of R1, R2, R3, E1, E2, E3, LN, which becomes 0, 0, R1, E1, R2, E2, R3, E3, LN, 0 in the general scheme.

Figure 5.22 shows histograms of the minimal folding mechanisms, obtained from the above-described procedure. Note again there are 3 classes of transformation, one where LN is the largest move, one where LC is the largest move, and one where either loop R or elbow E is the largest move. Each uncrossing element of the transformation, C-leg, N-leg, Reidemeister loop, or elbow, contributes to the height of the corresponding bar, which represents the average over transformations in that class. The percentage of transformations that fall into each class is given in the legend to panels A-C of Figure 5.22. Most of the transformations (about 71%) for the α-protein 2ABD fall into the class with a dominant loop or elbow move, which itself tends to cover less uncrossing distance than either leg uncrossing (ordinates of Panels A-C, Figure 5.22). This is a signature of a diverse range of folding pathways: minimal folding pathways need not involve obligatory leg uncrossing constraints.
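The slot-assignment recipe described above for mapping a transformation onto the general scheme LN, LC, R, E, R, E, R, E, LN, LC can be sketched as follows. This is a minimal illustration rather than the thesis code: the function names and the numerical costs are hypothetical, and the rule used here to decide whether a leg move goes into the leading or the trailing leg slots (leading if it occurs before the first loop or elbow move, trailing otherwise) is an assumption that is consistent with the two worked examples in the text but is not spelled out there.

```python
def fill_three_slots(costs):
    """Distribute a list of loop (or elbow) move costs over three slots:
    one move is split in half between the first and last slot, two moves go
    first and last, three are kept in order, and for four or more the middle
    moves are accumulated into the middle slot."""
    n = len(costs)
    if n == 0:
        return [0.0, 0.0, 0.0]
    if n == 1:
        return [costs[0] / 2, 0.0, costs[0] / 2]
    if n == 2:
        return [costs[0], 0.0, costs[1]]
    return [costs[0], sum(costs[1:-1]), costs[-1]]


def to_general_scheme(moves):
    """Map a sequence of (kind, cost) moves, kind in {"LN", "LC", "R", "E"},
    onto the 10 slots [LN, LC, R, E, R, E, R, E, LN, LC]."""
    loops = [cost for kind, cost in moves if kind == "R"]
    elbows = [cost for kind, cost in moves if kind == "E"]
    first_loop_or_elbow = next(
        (idx for idx, (kind, _) in enumerate(moves) if kind in ("R", "E")),
        len(moves),
    )
    leading = {"LN": 0.0, "LC": 0.0}
    trailing = {"LN": 0.0, "LC": 0.0}
    for idx, (kind, cost) in enumerate(moves):
        if kind in leading:
            # Assumption: legs before the first R/E move are "leading".
            slot = leading if idx < first_loop_or_elbow else trailing
            # Repeated legs of one type keep only the larger motion.
            slot[kind] = max(slot[kind], cost)
    r, e = fill_three_slots(loops), fill_three_slots(elbows)
    return [leading["LN"], leading["LC"],
            r[0], e[0], r[1], e[1], r[2], e[2],
            trailing["LN"], trailing["LC"]]


# Move orders taken from the two examples in the text; costs are made up.
first_bar = [("LC", 90.0), ("LN", 7.0), ("E", 1.0), ("R", 1.5), ("E", 0.5), ("R", 0.25)]
bottom_bar = [("R", 1.0), ("R", 2.0), ("R", 3.0), ("E", 4.0), ("E", 5.0), ("E", 6.0), ("LN", 50.0)]
print(to_general_scheme(first_bar))
print(to_general_scheme(bottom_bar))
```

Averaging these 10-slot vectors over all transformations in a class would then give bar heights of the kind plotted in Figure 5.22.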
In this sense, the β protein 1PKS has a more constrained folding mechanism than the α protein; there is a significantly larger percentage of transformations for which a leg transformation LC or LN dominates, though the mean distances undertaken when a leg move does dominate are comparable for LC and even larger for the α protein for LN. The knotted protein 3MLG has the most constrained minimal folding pathway. A leg move from either end dominates for 91% of the cases. Even for the transformations where loop or elbow moves dominate, there is still significant LN motion. The dominant pathways for knotting 3MLG involve leg crossing from either N or C terminus. When the C terminus is involved in the minimal transformation, the motion can be significant (Figure 5.22B).

Among all transformations of a given protein, a transformation can be found that is closest to the average transformation for one of the three classes in Figure 5.22. This consensus transformation has a sequence of moves that, when mapped to the scheme in Figure 5.22, has minimal deviations from the averages shown there. Further, we can find the transformation that has minimal deviation to any of the three classes in Figure 5.22. For the knotted protein 3MLG, the best fit transformation is to the class with LN-dominated moves; for the α protein 2ABD, the best fit transformation is to the class with miscellaneous-dominated moves. For the α protein this is the transformation denoted by a short arrow to the left of the transformation in panels A and B of Figure 5.19, and illustrated in Figure 5.23. For the knotted protein this is the transformation denoted by a short arrow in panels A and B of Figure 5.21, and illustrated in Figure 5.24.

[Figure 5.22, three panels of average distance per move (x-axis labels as printed: LC LN E R E R E R LC LN): (a) consensus pathways with largest LN move (3MLG 73.3%, 2ABD 15.1%, 1PKS 16.4%); (b) consensus pathways with largest LC move (3MLG 17.8%, 2ABD 13.4%, 1PKS 54.4%); (c) consensus pathways with largest misc move (3MLG 8.9%, 2ABD 71.5%, 1PKS 29.2%)]

Figure 5.22: Consensus histograms of the transformations described in Figures 5.19-5.21 (see text for a description of the construction). Each bar represents the distance of a corresponding move type, N or C leg (LN or LC), elbow E, or loop R. The order of the sequence of moves is taken from right to left along the x-axis. An all-α protein (2ABD), an all-β protein (1PKS), and a knotted protein (3MLG) are considered. (a) Transformations with leg LN as the largest move. These encompass 15% of the transformations in the α protein, 16% of the transformations in the β protein, and 73% of the transformations for the knotted protein. (b) Transformations with leg LC as the largest move, which encompass 13% of the α protein transformations, 54% of β protein transformations, and 18% of knotted protein transformations. (c) Transformations with either an elbow E or loop R as the largest move, which encompass 71% of the α protein transformations, 29% of β protein transformations, and 9% of knotted protein transformations.

Figure 5.23: Schematic of the most representative transformation for the α protein 2ABD.

We can construct schematics of these most-representative folding transformations. Figure 5.23 shows the most representative transformation for the all-α protein 2ABD.
It is noteworthy that the transformation requires remarkably little motion: it contains a negligible leg motion followed by a loop uncrossing of modest distance, followed by a short elbow move that is also inconsequential. In shorthand this is E[9]R[20]LN[1], where the numbers in brackets indicate the cost of moves in units where the link length is unity. In constructing a schematic of the representative transformation in Figure 5.23, we ignore the smaller leg and elbow moves and illustrate the loop move roughly to scale. Although additional crossing points appear from the perspective of the figure, the remainder of the transformation involves simple straight-line motion.

Figure 5.24 shows the most representative folding transformation for the knotted protein 3MLG. The sequence of events constructed from the minimal transformation, R[21]R[18]LN[125] in the above notation, consists of a dominant leg move depicted in steps 4 and 5 of the transformation, and two relatively short loop moves that are neglected in the schematic as inconsequential. Loops appear from the perspective of the figure, and the crossing points appear to shift in position; however, the remainder of the transformation involves simple straight-line motion.

Figure 5.24: Schematic of the most representative transformation for the knotted protein 3MLG.

[Figure 5.25: moves laid out along the primary sequence for transformation α (LN, R1, R2, LC) and transformation β (LN, R1, R2, E1, LC)]

Figure 5.25: Schematic diagram for the residues involved in noncrossing operations for two minimal transformations α and β, and the sequence overlap of moves.

5.3.2 Topological constraints induce folding pathways

From Figures 5.19-5.21, one can see that topological non-crossing constraints can induce pathway-like folding mechanisms, particularly for knotted proteins, and in part for β-sheet proteins as well.
The locality of interactions in conjunction with simple tertiary arrangement of helices in the α-helical protein profoundly affects the nature of the transformations that fold the protein, such that the distribution of minimal folding pathways is diverse. Conversely, the knotted protein, although largely helical, has non-trivial tertiary arrangement, which is manifested in the persistence of a leg crossing move in the minimal folding pathway. In this way, a folding “mechanism” is induced by the geometry of the native structure.

We can quantify this notion by calculating the similarity between minimal folding pathways. To this end we note that the transformation 6 from the bottom in Figure 5.21B, which contains an LN move followed by 2 short loops and an elbow, should not fundamentally be very different than the transformation 10 from the bottom in that figure, which contains a loop and 2 short elbows followed by a larger LN move. In general we treat the commonality of the moves as relevant to the overlap rather than the specific number of residues involved, or the order of the moves that arises from the depth-first tree search algorithm. Thus for each transformation pair we define two sequence overlap vectors in the following way. Overlaying the residues involved in moves for each transformation along the primary sequence on top of each other as in Figure 5.25, we count those moves of the same type that overlap in sequence for both transformations. So for example in Figure 5.25 the result is two vectors of binary numbers, one with 4 elements and one with 5 elements, based on the overlap of moves of the same type: here the first vector is $\Delta^{\alpha} = (1, 1, 0, 1)$ and the second is $\Delta^{\beta} = (1, 0, 1, 0, 1)$. To find the pathway overlap, we also record the noncrossing distances of the various transformations, which here would be two vectors of the form $D^{\alpha} = (D^{\alpha}_{L_N}, D^{\alpha}_{R_1}, D^{\alpha}_{R_2}, D^{\alpha}_{L_C})$ and $D^{\beta} = (D^{\beta}_{L_N}, D^{\beta}_{R_1}, D^{\beta}_{R_2}, D^{\beta}_{E_1}, D^{\beta}_{L_C})$. Square matrices are constructed for α and β, where each row is identical and equal to the vector ∆. This matrix then operates on D to make a new vector that has distances for the elements that are nonzero in ∆, and is the same length for both α and β. In the above example, $\Delta^{\alpha} D^{\alpha} = (D^{\alpha}_{L_N}, D^{\alpha}_{R_1}, D^{\alpha}_{L_C})$ and $\Delta^{\beta} D^{\beta} = (D^{\beta}_{L_N}, D^{\beta}_{R_2}, D^{\beta}_{L_C})$. These vectors are then multiplied through the inner product, and divided by the norms of $D^{\alpha}$ and $D^{\beta}$ to obtain the overlap $Q^{\alpha\beta}$. In the above example, $Q^{\alpha\beta} = (D^{\alpha}_{L_N} D^{\beta}_{L_N} + D^{\alpha}_{R_1} D^{\beta}_{R_2} + D^{\alpha}_{L_C} D^{\beta}_{L_C}) / \sqrt{\sum_i (D^{\alpha})_i^2 \sum_j (D^{\beta})_j^2}$. In general the formula for the overlap is given by

$$Q^{\alpha\beta} = \frac{(\Delta^{\alpha} D^{\alpha}) \cdot (\Delta^{\beta} D^{\beta})}{\sqrt{(D^{\alpha} \cdot D^{\alpha})(D^{\beta} \cdot D^{\beta})}} \tag{5.7}$$

When α = β, $Q^{\alpha\beta} = 1$. In the above example, $Q^{\alpha\beta} < 1$ even if all loops were aligned, because there is no elbow move in transformation α. If two transformations have an identical set of moves, $Q^{\alpha\beta} = 1$ if all the moves have at least partial overlap with a move of the same type in primary sequence. If a loop move in transformation β overlaps two loop moves in transformation α, it is assigned to the loop with larger overlap in primary sequence. For the first two transformations in Figure 5.21A, $Q^{\alpha\beta} = 0.988$, and for the first two transformations in Figure 5.21B, $Q^{\alpha\beta} = 0.999$. On the other hand for the first and last transformations in Figure 5.21B, $Q^{\alpha\beta} = 0.033$.

Figure 5.26 shows the distributions of overlaps $Q^{\alpha\beta}$ between all pairs of transformations indicated in Figures 5.19-5.21, for the three proteins shown in Figure 5.18. The distributions show a transition from multiple diverse minimal folding pathways for the α protein, to the emergence of a dominant minimal folding pathway for the knotted protein. The mean overlap $\langle Q \rangle$ between transformations can be obtained by averaging $Q^{\alpha\beta}$ in Equation (5.7) over all pairs of transformations, $\langle Q \rangle = \sum_{\alpha<\beta} Q^{\alpha\beta} / (N(N-1)/2)$, where N here denotes the number of transformations. Mean overlaps for each protein are given in the caption to Figure 5.26.
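As a rough illustration of Equation (5.7) and of the mean overlap, the following Python sketch computes the overlap for the Figure 5.25 example. It assumes that the matching of same-type moves that overlap in primary sequence has already been done and is supplied as a list of index pairs; the function names and all distance values are hypothetical and chosen only for demonstration.

```python
import math


def pathway_overlap(d_alpha, d_beta, matched_pairs):
    """Overlap in the sense of Eq. (5.7).

    d_alpha, d_beta: noncrossing distances of the moves in each transformation.
    matched_pairs: (i, j) index pairs of same-type moves that overlap in
    primary sequence (assumed to be precomputed). Unmatched moves contribute
    only to the norms in the denominator."""
    numerator = sum(d_alpha[i] * d_beta[j] for i, j in matched_pairs)
    norm = math.sqrt(sum(x * x for x in d_alpha) * sum(y * y for y in d_beta))
    return numerator / norm


def mean_overlap(q_matrix):
    """Average of Q over all pairs alpha < beta of an N x N overlap matrix."""
    n = len(q_matrix)
    total = sum(q_matrix[a][b] for a in range(n) for b in range(a + 1, n))
    return total / (n * (n - 1) / 2)


# Figure 5.25 example with made-up distances.
# alpha moves: LN, R1, R2, LC        beta moves: LN, R1, R2, E1, LC
d_alpha = [120.0, 20.0, 15.0, 60.0]
d_beta = [110.0, 10.0, 25.0, 5.0, 70.0]
# Delta_alpha = (1, 1, 0, 1) and Delta_beta = (1, 0, 1, 0, 1):
# LN matches LN, alpha's R1 matches beta's R2, LC matches LC.
matches = [(0, 0), (1, 2), (3, 4)]

q_example = pathway_overlap(d_alpha, d_beta, matches)
q_self = pathway_overlap(d_alpha, d_alpha, [(i, i) for i in range(len(d_alpha))])
print(q_example, q_self)
```

Because unmatched moves still enter the norms, the overlap drops below 1 whenever one transformation contains a move with no counterpart in the other, as the text notes for the elbow move in transformation β.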
The overlap distributions illustrate that topological constraints induce mechanistic pathways in protein folding. We elaborate on this in the Discussion section.

[Figure 5.26: three histograms of Fraction versus $Q^{\alpha\beta}$: (a) Alpha-helical (2ABD), (b) Beta-sheet (1PKS), (c) Knotted (3MLG).]

Figure 5.26: Pathway overlap ($Q^{\alpha\beta}$) distributions for the 3 proteins in Figure 5.18, as defined by Equation (5.7), operating on the transformations in Figures 5.19-5.21. (a) The pathway overlap distribution for the all-α protein 2ABD shows a large contribution at $Q^{\alpha\beta} = 0$, indicating that a diverse set of minimal transformations folds the protein. The average Q for these transformations is 0.18. (b) The pathway overlap distribution for the β-protein shows the emergence of a peak around $Q^{\alpha\beta} = 1$, indicating partial restriction of folding pathways. The peak near $Q^{\alpha\beta} = 0$ still carries more weight in the distribution. The average Q = 0.45. (c) The peak around $Q^{\alpha\beta} = 1$ becomes dominant for the pathway overlap distribution of the knotted protein, indicating the emergence of a dominant restricted minimal folding pathway. The average Q = 0.62.

5.4 Conclusion and discussion

The Euclidean distance between points can be generalized mathematically to find the distance between polymer curves; this can be used to find the minimal folding transformation of a protein. Here, we have developed a method for calculating approximately minimal transformations between unfolded and folded states that accounts for polymer non-crossing constraints. The extra motion due to non-crossing constraints was calculated retroactively for all crossing events of a ghost chain transformation. This transformation involves straight-line motion of all beads on a coarse-grained model chain containing every other Cα atom, from an ensemble of unfolded conformations to the folded structure as defined from the coordinates in the protein databank archive. The distances undertaken by the uncrossing events correspond to straight-line motions of all the beads from the conformation before the crossing event, over and around the constraining polymer, and back to the essentially identical polymer conformation immediately after the crossing event. Given a set of chain crossing events, the various ways of undoing the crossings are explored using a depth-first tree search algorithm, and the transformation of least distance is recorded as the minimal transformation.

We found that knotted proteins, quite sensibly, must undergo more noncrossing motion to fold than unknotted proteins. We reached a similar conclusion for transformations between all-β and all-α proteins; all-α proteins generally undergo very little uncrossing motion during folding. In fact the unfolded ensemble-averaged uncrossing distance Dnx can be used as a discrimination measure between various structural and kinetic classes of proteins. Comparing several metrics arising from this work with several common metrics in the literature such as RMSD, absolute contact order (ACO), and long range order (LRO), we found that the most reliable discriminator between structural classes, as well as between two- and three-state proteins, was Dnx/N.

On the other hand, even for knotted proteins, the motion involved in avoiding non-crossing constraints is only about 13% of the total ghost chain motion undertaken had the noncrossing constraints been neglected. This was not an obvious result, to this author at least.
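The depth-first search over ways of undoing crossings can be sketched in a heavily simplified form as follows. This is only an illustration under stated assumptions, not the algorithm of section 5.2.1: here each crossing is assumed to come with a fixed list of candidate moves whose costs are independent and additive, whereas the actual method also permutes and combines moves. The function name, move labels, and costs are all hypothetical.

```python
def minimal_untangling(crossings):
    """Depth-first search for the cheapest choice of one move per crossing.

    crossings : list of dicts, one per crossing event, mapping a move label
                (e.g. "loop", "elbow", "leg_N") to its extra distance.
                Costs are assumed independent and additive (a simplification).
    Returns (minimal total cost, list of chosen move labels).
    """
    best = {"cost": float("inf"), "moves": None}

    def dfs(k, cost_so_far, chosen):
        # Prune branches that cannot beat the best complete solution found so far.
        if cost_so_far >= best["cost"]:
            return
        if k == len(crossings):
            best["cost"] = cost_so_far
            best["moves"] = list(chosen)
            return
        for move, cost in crossings[k].items():
            chosen.append(move)
            dfs(k + 1, cost_so_far + cost, chosen)
            chosen.pop()

    dfs(0, 0.0, [])
    return best["cost"], best["moves"]

# Hypothetical crossing events with candidate moves and their extra distances.
crossings = [
    {"leg_N": 5.0, "loop": 2.0},
    {"elbow": 1.5, "leg_C": 4.0},
    {"loop": 3.0, "leg_C": 2.5},
]
total, moves = minimal_untangling(crossings)
```

In this toy setting the search simply picks the cheapest move at each crossing; the tree search becomes non-trivial once a single move (such as a compound leg movement) can resolve several crossings at once, which this sketch deliberately does not model.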
In contrast to melts of long polymers, chain non-crossing and the resultant entanglement do not appear to be a significant factor in protein folding, at least for the structures and ensembles we have studied here. It is tempting to conclude from this that chain non-crossing constraints play a minor role in determining folding mechanisms. It is nevertheless an empirical fact that knotted proteins fold significantly more slowly than unknotted proteins. As well, raw percentages of total motion do not take into account the difficulty of certain types of special polymer movement, in particular when the entropy of folding routes is tightly constrained [18, 110, 111, 113]. However, the small percentage of non-crossing motion may offer some explanation as to why simple order parameters such as absolute contact order, which do not explicitly account for noncrossing in characterizing folding mechanisms, have historically been so successful.

The non-crossing distance was calculated here for a chain of zero thickness, so that non-crossing is decoupled from steric constraints. Finite volume steric effects would likely enhance the importance of non-crossing constraints, since the volume of phase space where chains are non-overlapping is reduced, and thus chain motions must be further altered to respect these additional constraints.

One potential issue in the construction of the algorithm used here is that the minimal transformation is generally not equivalent to a kinetically realizable transformation. In the depth-first tree search algorithm illustrated in Figure 5.15, the set of crossing points defines a set of uncrossing moves that may be permuted, or combined, for example through a compound leg movement as in Figure 5.11.
However the kinetic sequence of crossing events, in particular those significantly separated in “time” along the minimal transformation, may not be permutable or combinable physically, at least not without modifying the distance travelled.⁶ Hence the transformations are treated here as approximations to the true minimal transformations that respect non-crossing. The algorithm as described above may underrepresent the amount of motion involved in noncrossing by allowing kinetically separated moves to be commutable. On the other hand, the motion assumed in the algorithm to be undertaken by a crossing event contains abrupt changes in the direction of the velocity (corners) at the time of the uncrossing event, and so is larger than the true minimal distance, which contains no corners except possibly at the position of the infinitely thin chain, represented as a discontinuous obstacle. These errors cancel at least in part. It is an interesting topic for future research to develop an improved algorithm that computes minimal transformations, perhaps using these approximate transformations as a starting point for further optimization or modification.

⁶ As a hypothetical example, suppose at time t1 a crossing event occurs between residue a, which is 10 residues in from the N-terminus, and residue b somewhere else along the chain. Then at time t2, the next crossing event involves a residue c that is 20 residues in from the N-terminus, and residue d somewhere along the chain. To avoid redundant motion, the minimal transformation is only taken to involve a leg motion between the residues from c to the N-terminus, about point d; this is assumed to encompass the motion in the first leg transformation, even though the crossing events occurred at different times.

In differentiating two- and three-state folders, chain length provided the best discriminant: three-state folders have longer chains than two-state folders. Other metrics such as RMSD, MRSD, and D/N performed nearly as well. Knotted proteins, as compared to unknotted proteins, are the most distinguishable class of those we investigated. That is, all metrics we investigated except for LRO significantly differentiated the knotted from unknotted proteins. This is followed by α proteins and mixed α-β proteins, for which all metrics except distance D and chain length N provide discrimination.

When considered over all proteins, the physical motion of a polymer required for folding, D, correlates with quantities such as ACO or LRO (see Table F.8 in the Supplementary Content); however, when considering only knotted proteins, α-β proteins, or 3-state proteins, D does not correlate with ACO. The differentiation between structural or kinetic classes of proteins is a separate issue from the question of which order parameters may best correlate with folding rates within a given structural or kinetic class of proteins [52, 60, 61, 102, 106]; this latter question is an interesting topic for future research. Differentiating relevant native-structure based order parameters that provide good correlates of folding kinetics is a complicated issue, in that different structural classes may correlate better or worse with a given order parameter [60].

The mathematical construction of minimal folding transformations can elucidate folding pathways. To this end we have dissected the morphology of protein structure formation for several different native structures. We found that the folding transformations of knotted proteins, and to a lesser extent β proteins, are dominated by persistent leg uncrossing moves, while α proteins have diverse folding pathways dominated simply by loop uncrossing. A pathway overlap function can then be defined, the structure of which is fundamentally different for α proteins and for knotted proteins.
While the overlap function supports the notion of a diverse collection of folding pathways for the α protein, the overlap function for the knotted protein indicates that topological polymer constraints can induce “mechanism” into how a protein folds, i.e., these constraints induce a dominant sequence of events in the folding pathway. This effect is observed to some extent in the β protein we investigated, but is most pronounced for knotted proteins.

Coarse-grained simulation studies of the reversible folder YibK [83] showed that non-native interactions between the C-terminal end and residues towards the middle of the sequence were a prerequisite for reliable folding to the trefoil knotted native conformation [125], the evolutionary origins of which were supported by hydrophobicity and β-sheet propensity profiles of the SpoU methyltransferase family. This suggests a new aspect of evolutionary “design” involving selective non-native interactions, beyond the generic role that non-native interactions may play in accelerating folding rate [23, 108]. Low kinetic success rates of ∼1-2% in purely structure-based Gō simulations are also seen in coarse-grained simulation studies of YibK [126] and all-atom simulation studies of the small α/β knotted protein MJ0366 [93]. In these studies by Onuchic and colleagues, a “slip-knotting” mechanism driven by native contacts is proposed, rather than the “plug” mechanism in [125], which is driven by non-native contacts. Both slip-knotting and plug mechanisms were described by Mohazab and Plotkin as optimal un-crossing motions of protein chains in [90]. Bioinformatic studies that investigate evolutionary selection by strengthening critical native interactions in knotted proteins are an interesting topic for future research. There is certainly a precedent of selection for native interactions that penalize on-pathway intermediates in ribosomal protein S6 [78, 110, 111].
As well, Lua and Grosberg have found that, due to enhanced return probabilities originating from finite globule size along with secondary structural preferences, protein chains have a smaller degree of interpenetration than collapsed random walks, and thus fewer knots than would be expected for such collapsed random walks [80]. It is still not definitively answered whether this statistical selection against knots in the protein universe is a cause or consequence of the above size and structural preferences.

Chapter 6
The role of polymer non-crossing and geometrical distance in protein folding kinetics

In this chapter we apply the formalism developed in chapter 5 to the problem of folding kinetics. We then compare different rate predictors across different classes of proteins and see that distance-like metrics do very well in predicting the folding rate of 3-state folders.

6.1 Introduction

Energetic driving forces towards the folded structure are essential for rapid and reliable folding. Models that randomly search for either the native ensemble or a loosened native-like topomer ensemble show slow kinetics and folding mechanisms that do not correlate with those determined from experimental φ-values [133]. The theory that strongly attractive native interactions bias a protein’s configurational search towards the biologically-functional structure [12, 13, 29, 76, 128] leads to the notion that some topological or geometrical aspects of the native structures of various proteins could determine their folding rates and/or folding mechanisms [5, 20, 31, 34, 37, 52, 53, 60, 61, 102, 105, 106, 134]. However, no single parameter appears to be an accurate predictor of folding kinetics over all structural and kinetic classes.
While some quantities such as contact order, relative contact order, and long range order (LRO) correlated well with the folding rate for 2-state proteins [52, 105, 106], they correlated poorly with the folding rates of 3-state proteins, where the size of the protein, as quantified simply by the chain length, seemed to be the best predictor [61]. Istomin et al. [60] found that chain length also correlated well with folding rate for the various structural classes of two-state proteins (α, β, and mixed α-β) when considered separately. They also found a strong correlation between LRO and folding rate when all 2-state proteins were considered together.

Information on the folding mechanism is gained from determining which quantity correlates with rate for a given structural or kinetic class of protein. The fact that ACO or LRO correlates well with rate for 2-state proteins indicates a dominance of the process of loop closure, through the formation of native contacts, as the rate limiting step in folding.

Energy also must play a role in driving folding and thus determining folding rates. Protein rates have been shown to correlate with stability for 2-state proteins [73]. Folding rates have also been shown to correlate with the variance of contact probability [78, 79, 110, 111], which yields a strong correlation between rate and the variance of experimentally-determined φ-values for two-state folders [102].

Perhaps surprisingly, the RMSD has not been used as an order parameter in predicting the rates of proteins. This is likely due to the fact that information on a pair of structures rather than a single structure is needed to calculate it. Given a generated unfolded ensemble, the RMSD can be calculated between each unfolded conformation and the native conformation, and an average RMSD between unfolded and folded states can be calculated and subsequently tested as a determinant of rate.
The RMSD can be thought of as a least squares fit between two structures. It may also be thought of as the straight-line Euclidean distance between two structures in a high-dimensional space of dimension 3N, where N is the number of atoms or residues considered in the protein. If several intermediate states are known along the pathway of a transformation between a pair of structures, then the RMSD may be calculated consecutively for each successive pair. In that approach, energy is explicitly considered as modifying the pathway taken, and the RMSD is accumulated along the pathway through the transition states [119].

However, as we have mentioned on various occasions, the RMSD is not equivalent to the total amount of motion a protein or polymer must undergo in transforming between structures, even in the absence of steric constraints enforcing deviations from straight-line motion. The accumulated straight-line motion of all residues is given by the number of residues times the mean root squared distance (MRSD) [89, 90, 109]. The MRSD is always less than or equal to the RMSD, since the mean of the residue displacements cannot exceed their root mean square.

As a rate-determining order parameter, the Euclidean distance can be tested in the same way as ACO or LRO, so long as an unfolded ensemble is generated. For each protein we obtain minimal transformations between individual structures in an unfolded ensemble and the corresponding native structure. The ensemble average of the quantity for each of the proteins forms the rate-determining order parameter.

6.2 Methods

We quickly recap the steps involved in generating ensemble averages for the quantities that require a starting and ending conformation of the protein. For a given protein, the PDB file is selected, and the Cα backbone is extracted. Using the methods described in section 5.2.2, 200 coarse-grained unfolded structures are generated. The unfolded structures are then aligned using RMSD and the average (residual) RMSD is calculated.
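To make the distinction between RMSD and MRSD concrete, the following minimal sketch evaluates both for a pair of bead chains. It assumes the two structures have already been superimposed (no alignment step is performed here), and the coordinates and function names are hypothetical, chosen only for illustration.

```python
import math

def rmsd(a, b):
    """Root mean square of the per-bead displacements between chains a and b."""
    squared = [math.dist(p, q) ** 2 for p, q in zip(a, b)]
    return math.sqrt(sum(squared) / len(squared))

def mrsd(a, b):
    """Mean of the per-bead displacement magnitudes (mean root squared distance)."""
    distances = [math.dist(p, q) for p, q in zip(a, b)]
    return sum(distances) / len(distances)

# Hypothetical three-bead chains; the beads move by 0, 1 and 2 length units.
unfolded = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
folded = [(0.0, 0.0, 0.0), (1.0, 1.0, 0.0), (2.0, 0.0, 2.0)]

r = rmsd(unfolded, folded)              # sqrt((0 + 1 + 4) / 3)
m = mrsd(unfolded, folded)              # (0 + 1 + 2) / 3
total_motion = len(unfolded) * m        # accumulated straight-line motion, N * MRSD
```

For this example the MRSD is 1.0 while the RMSD is about 1.29, and the accumulated straight-line motion of all beads is N times the MRSD, i.e. 3.0 length units.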
The unfolded structures are then aligned by minimizing MRSD, and the residual MRSD is calculated. The conformations are then further coarse-grained (smoothed) by sampling every other bead, hence reducing the total number of beads. Each structure is then transformed to the folded state by the algorithm discussed in section 5.2.1, and the minimal untangling cost is found.

In the end, various quantities such as the minimal untangling cost (Dnx), MRSD, and RMSD are calculated for each unfolded conformation. These differ from one unfolded conformation to another; the ensemble average is recorded and used below.

6.2.1 Proteins used with rate

The proteins used in this study are given in table 5.1. They consist of 25 2-state folders, 13 3-state folders, 11 all α-helix proteins, 14 all β-sheet proteins, 13 α-β proteins, and 5 knotted proteins.

6.3 Results

We use the same classification of proteins as in chapter 5. Proteins are classified by several criteria:
• 2-state vs. 3-state folders
• α-helix dominated vs. β-sheet dominated vs. mixed
• knotted vs. unknotted proteins

We are specifically interested in how the non-crossing distance, the total distance, and other distance-related order parameters correlate with folding rate for different classes of proteins, and how they compare with other order parameters. The results are summarized in the tables.

Two-state Proteins
Order parameter   Kendall correlation   Kendall p-value   Pearson correlation   Pearson p-value
LRO               -0.696                1.09e-06          -0.875                1.10e-08
RCO               -0.711                6.26e-07          -0.854                5.73e-08
ACO               -0.464                1.15e-03          -0.781                4.15e-06
MRSD              (-0.224)              (0.117)           -0.535                5.89e-03
RMSD              (-0.257)              (0.072)           -0.560                3.61e-03
Dnx               -0.437                2.18e-03          -0.624                8.53e-04
Dnx/N             -0.471                9.72e-04          -0.680                1.86e-04
D                 (-0.184)              (0.198)           (-0.428)              (0.033)
D/N               (-0.250)              (0.079)           -0.573                2.77e-03
N                 (-0.131)              (0.358)           (-0.337)              (0.099)

Table 6.1: Two-state proteins: correlation between folding rate and various order parameters indicated.

Three-state Proteins
Order parameter   Kendall cor   K. p-value   Pearson cor   P. p-value
LRO               (-0.154)      (0.464)      (-0.292)      (0.332)
RCO               (-0.077)      (0.714)      (0.029)       (0.926)
ACO               (-0.462)      (0.028)      (-0.658)      (0.014)
MRSD              (-0.538)      (0.010)      (-0.672)      (0.012)
RMSD              -0.564        7.27e-03     -0.685        9.74e-03
Dnx               -0.564        7.27e-03     (-0.647)      (0.017)
Dnx/N             (-0.462)      (0.028)      (-0.601)      (0.030)
D                 (-0.513)      (0.015)      -0.690        9.11e-03
D/N               (-0.538)      (0.010)      (-0.670)      (0.012)
N                 (-0.503)      (0.017)      (-0.644)      (0.018)

Table 6.2: Three-state proteins: correlation between folding rate and various order parameters indicated.

Figure 6.1: Correlation between folding rate and RMSD for three-state folders.

2-state α-helix Proteins
Order parameter   Kendall cor   K. p-value   Pearson cor   P. p-value
LRO               (-0.691)      (0.017)      (-0.817)      (0.013)
RCO               (-0.357)      (0.216)      (-0.367)      (0.371)
ACO               (-0.571)      (0.048)      -0.835        9.92e-03
MRSD              (-0.429)      (0.138)      (-0.714)      (0.047)
RMSD              (-0.500)      (0.083)      (-0.716)      (0.046)
Dnx               (-0.429)      (0.138)      (-0.525)      (0.181)
Dnx/N             (-0.143)      (0.621)      (-0.326)      (0.431)
D                 (-0.429)      (0.138)      (-0.717)      (0.045)
D/N               (-0.429)      (0.138)      (-0.689)      (0.059)
N                 (-0.327)      (0.257)      (-0.741)      (0.035)

Table 6.3: α-helix dominated proteins that are 2-state folders: correlation between folding rate and various order parameters indicated. The sample size is 8.

α-helix Proteins
Order parameter   Kendall cor   K. p-value   Pearson cor   P. p-value
LRO               (-0.477)      (0.041)      (-0.384)      (0.243)
RCO               (0.018)       (0.938)      (0.074)       (0.828)
ACO               (-0.491)      (0.036)      (-0.710)      (0.014)
MRSD              (-0.418)      (0.073)      (-0.728)      (0.011)
RMSD              (-0.491)      (0.036)      (-0.733)      (0.010)
Dnx               (-0.309)      (0.186)      (-0.670)      (0.024)
Dnx/N             (-0.164)      (0.484)      (-0.523)      (0.099)
D                 (-0.418)      (0.073)      -0.740        9.22e-03
D/N               (-0.382)      (0.102)      (-0.715)      (0.013)
N                 (-0.330)      (0.157)      -0.747        8.22e-03

Table 6.4: α-helix dominated proteins (both 2- and 3-state): correlation between folding rate and various order parameters indicated. The sample size is 11, with 8 of them being 2-state folders and 3 being 3-state folders.

2-state β-sheet Proteins
Order parameter   Kendall cor   K. p-value   Pearson cor   P. p-value
LRO               (-0.278)      (0.297)      (-0.471)      (0.200)
RCO               -0.722        6.71e-03     (-0.779)      (0.013)
ACO               (-0.222)      (0.404)      (-0.649)      (0.059)
MRSD              (-0.111)      (0.677)      (-0.390)      (0.300)
RMSD              (-0.111)      (0.677)      (-0.478)      (0.193)
Dnx               (-0.333)      (0.211)      (-0.689)      (0.040)
Dnx/N             (-0.611)      (0.022)      -0.814        7.51e-03
D                 (-0.167)      (0.532)      (-0.465)      (0.207)
D/N               (-0.111)      (0.677)      (-0.452)      (0.222)
N                 (-0.167)      (0.532)      (-0.435)      (0.242)

Table 6.5: β-sheet dominated proteins that are 2-state folders: correlation with various order parameters indicated. The sample size is 9.

The performance of RMSD over all different classes of proteins can be compared to that of ACO and D. See figure 6.2.

6.4 Conclusion and discussion

From Table 6.1 it is concluded that long-range contact formation governs the rate of folding for 2-state folders. From Table 6.2 we infer that traditional measures fail to predict the kinetic mechanism of folding for 3-state proteins. However, a measure of native geometry still does correlate with folding rate, and thus can speak to the mechanism of folding. By native geometry we do not necessarily mean native topology, i.e. the chain properties of the network of native contacts, but something closer to the distance that all parts of the polymer chain have to move. Native geometries that on average require large distances to be traveled via stochastic motion tend to have slower rates.

One might suspect that the physical motion of a polymer required for folding would correlate with quantities such as ACO or LRO; however, looking at the cross-correlation tables (see Appendix F), it is seen that D correlates significantly with ACO only when we consider all the proteins. If we look at 3-state folders or at only knotted proteins, even this correlation is not significant.

Table 6.4, consisting only of α-helical proteins, does not show correlation with LRO.
This indicates that it is necessary to include β-proteins in the sample so that there is a discrepancy in LRO between members of the ensemble, for LRO to be useful as a predictor. LRO does not do well with β proteins and mixed proteins either. See tables 6.6 and 6.8.

β-sheet Proteins
Order parameter   Kendall cor   K. p-value   Pearson cor   P. p-value
LRO               (0.099)       (0.622)      (-0.064)      (0.827)
RCO               (-0.099)      (0.622)      (0.090)       (0.761)
ACO               (-0.429)      (0.033)      -0.675        8.03e-03
MRSD              (-0.319)      (0.112)      (-0.528)      (0.052)
RMSD              (-0.407)      (0.043)      (-0.556)      (0.039)
Dnx               (-0.385)      (0.055)      (-0.573)      (0.032)
Dnx/N             (-0.495)      (0.014)      (-0.583)      (0.029)
D                 (-0.385)      (0.055)      (-0.545)      (0.044)
D/N               (-0.319)      (0.112)      (-0.540)      (0.046)
N                 (-0.398)      (0.048)      (-0.550)      (0.041)

Table 6.6: β-sheet dominated proteins (both 2- and 3-state): correlation with various order parameters indicated. The sample size is 14, with 9 of them being 2-state folders and 5 being 3-state folders.

2-state Mixed secondary structure Proteins
Order parameter   Kendall cor   K. p-value   Pearson cor   P. p-value
LRO               (-0.691)      (0.017)      -0.908        1.79e-03
RCO               (-0.546)      (0.059)      (-0.759)      (0.029)
ACO               (-0.327)      (0.257)      (-0.561)      (0.148)
MRSD              (-0.109)      (0.705)      (-0.296)      (0.477)
RMSD              (-0.109)      (0.705)      (-0.326)      (0.431)
Dnx               (-0.182)      (0.529)      (-0.317)      (0.444)
Dnx/N             (-0.109)      (0.705)      (-0.288)      (0.490)
D                 (-0.182)      (0.529)      (-0.305)      (0.463)
D/N               (-0.182)      (0.529)      (-0.302)      (0.468)
N                 (-0.182)      (0.529)      (-0.279)      (0.503)

Table 6.7: Mixed secondary structure proteins that are 2-state folders: correlation with various order parameters indicated. The sample size is 8.

Mixed secondary structure Proteins
Order parameter   Kendall cor   K. p-value   Pearson cor   P. p-value
LRO               (-0.400)      (0.057)      -0.703        7.35e-03
RCO               (0.039)       (0.854)      (-0.039)      (0.898)
ACO               (-0.452)      (0.032)      (-0.632)      (0.021)
MRSD              (-0.400)      (0.057)      (-0.567)      (0.043)
RMSD              (-0.374)      (0.075)      (-0.591)      (0.033)
Dnx               (-0.400)      (0.057)      (-0.573)      (0.041)
Dnx/N             (-0.400)      (0.057)      (-0.547)      (0.053)
D                 (-0.426)      (0.043)      (-0.587)      (0.035)
D/N               (-0.426)      (0.043)      (-0.570)      (0.042)
N                 (-0.426)      (0.043)      (-0.551)      (0.051)

Table 6.8: Mixed secondary structure proteins (both 2- and 3-state): correlation with various order parameters indicated. The sample size is 13, with 8 of them being 2-state folders and 5 being 3-state folders.

All proteins
Order parameter   Kendall cor   K. p-value   Pearson cor   P. p-value
LRO               -0.357        1.02e-03     -0.475        1.69e-03
RCO               (-0.155)      (0.153)      (-0.199)      (0.213)
ACO               -0.541        6.40e-07     -0.798        4.24e-10
MRSD              -0.460        2.27e-05     -0.742        2.80e-08
RMSD              -0.489        6.58e-06     -0.753        1.34e-08
Dnx               -0.538        7.18e-07     -0.723        9.58e-08
Dnx/N             -0.560        2.49e-07     -0.750        1.62e-08
D                 -0.460        2.27e-05     -0.720        1.15e-07
D/N               -0.467        1.67e-05     -0.756        1.10e-08
N                 -0.441        4.84e-05     -0.689        6.45e-07

Table 6.9: Correlation of folding rate of all the studied proteins, for which folding rates were available, with various order parameters. The sample size is 41.

[Figure 6.2: bar chart of the absolute value of correlation with rate (vertical axis, 0 to 1) for RMSD, RCO, ACO, LRO, N, and D/N, across the classes 2-state, 3-state, 2-state alpha, all alpha, 2-state beta, all beta, 2-state mixed, all mixed, and all proteins.]

Figure 6.2: Absolute value of Kendall correlation of a few order parameters and rate, across different classes of proteins.

Protein class        Best Kendall cor.   Best Pearson cor.
2-state              RCO                 LRO
3-state              RMSD & D/N          RMSD
2-state α            (LRO)               ACO
all α-helix          (RMSD) & (ACO)      N
2-state β            RCO                 (RCO)
all β-sheet          ACO                 ACO
2-state Mixed Str.   (LRO)               LRO
all Mixed Str.       (ACO)               LRO
all prots            ACO                 ACO

Table 6.10: Best rate predictors for different classes of proteins, based on Kendall and Pearson correlations. The items in brackets made the 5% p-value cutoff but not the 1% cutoff. As described above, angle brackets here indicate an average over the conformations in the unfolded ensemble.
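For readers who wish to reproduce correlation coefficients of the kind reported in Tables 6.1-6.9, the following is a minimal sketch of the Kendall and Pearson coefficients in plain Python. It is not the code used to generate the tables: the tie-corrected tau-b variant of the Kendall coefficient is an assumption (the variant used is not stated here), p-values are not computed, and the data values are invented purely for illustration.

```python
import math

def kendall_tau_b(x, y):
    """Kendall rank correlation with tie correction (tau-b), O(n^2) pair count."""
    concordant = discordant = ties_x_only = ties_y_only = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0 and dy == 0:
                continue                      # tied in both: ignored by tau-b
            elif dx == 0:
                ties_x_only += 1
            elif dy == 0:
                ties_y_only += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    untied = concordant + discordant
    denom = math.sqrt((untied + ties_x_only) * (untied + ties_y_only))
    return (concordant - discordant) / denom

def pearson_r(x, y):
    """Pearson linear correlation coefficient."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Invented example: an order parameter for five proteins and a rate-like quantity.
order_parameter = [1.0, 2.0, 3.0, 4.0, 5.0]
rate_like = [5.0, 4.0, 3.0, 1.0, 2.0]

tau = kendall_tau_b(order_parameter, rate_like)   # 1 concordant, 9 discordant pairs
r = pearson_r(order_parameter, rate_like)
```

For this invented data set the Kendall coefficient is -0.8 and the Pearson coefficient is -0.9; both are negative, as in the tables, because larger values of the order parameter go with slower rates.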
In fact, it seems that adding 3-state folders to the mix is largely responsible for this lack of correlation between LRO and rate. Compare with Tables 6.3, 6.5 and 6.7, for which 3-state folders have been removed. So the tentative conclusion is that the addition of 3-state folders and β proteins ruins the correlation. However, larger samples are required for stronger and more definitive conclusions in the cases where we divide and subdivide the classes.

The fact that ACO is still a useful predictor of rates indicates that the mechanism of folding is still largely one of native contact formation and closure of loops. Not surprisingly, Dnx did not predict the rates. The non-crossing distance is small; polymer non-crossing is unlikely to play a role for these α-helical proteins. When it comes to correlation with folding rate for α-proteins, the improvement of RMSD over ACO is probably not significant (the magnitudes of the Pearson correlation coefficients were 0.73 vs. 0.71), although it is interesting that a simple measure such as RMSD does so well. Interestingly enough, when 3-state folders are removed, ACO becomes a better predictor of rate than RMSD.

When it comes to β-sheet proteins (Table 6.6), it seems that all standard measures fail except for ACO, which barely holds on (in the Pearson coefficient only). Interestingly enough, when 3-state folders are removed, ACO performs even worse, but RCO significantly improves and becomes the best predictor. But as stated, the addition of a few 3-state folders (50%) makes the advantage of RCO disappear. Recall from Table 6.2 that RCO was the worst predictor of rate in 3-state folders.

Table 6.8 shows that only LRO is a predictor of folding rates for α-β proteins. The mechanism may be governed significantly (but not exclusively) by loop closure and long-range contact formation. As can be seen from Table 6.7, LRO is by far the best predictor when it comes to 2-state α-β proteins.
Adding 3-state proteins to the mix erodes the advantage of LRO and gives D and RMSD (and to a lesser degree ACO) a better competitive edge. It can be inferred that adding enough α-β proteins that are 3-state folders will make RMSD or perhaps D the best predictor for this class of proteins.

Chapter 7
Conclusion and further thoughts

In this thesis, we introduced the mathematical concept of the generalized Euclidean distance D between extended objects. We saw that the problem can be formulated as a variational problem. We then discretized the problem to that of a system of links and found extremum solutions to the corresponding Euler-Lagrange equations. Subsequently we derived the necessary and sufficient conditions for the extrema to be local minima. We explored toy models with a very small number of links and then extended the results to many links.

We posed the idea that D can be considered an order parameter when looking at the problem of protein folding, and showed that in its fullest form D does not have some of the problems of simple geometric order parameters such as Q and RMSD. We saw that a zeroth approximation of D leads to an order parameter similar to, but different from, RMSD, called the Mean Root Squared Distance (MRSD). Using MRSD and Q as order parameters, we constructed free energy potential surfaces.

According to the energy landscape paradigm in protein folding, there are multiple folding pathways for unfolded conformations. With that in mind, we set out to calculate “minimal” folding pathways for fragments of proteins. By minimal folding pathways we mean geometrical folding pathways that minimize the distance D traveled between the initial and final conformations. In doing so we also laid down some foundations for the systematic treatment of non-crossing constraints as inequality constraints in the calculus of variations. We saw that folding pathways for α-helices are shorter than those for β-hairpins.
It was also seen that non-crossing constraints can lead to significant extra movement (in our case snake-like movement) on the part of the polymer for seemingly close structures.

When minimizing the total distance traveled, we were faced with the problem of structurally aligning the initial (unfolded) and final (folded) conformations before calculating the minimal distance. We saw that using MRSD instead of RMSD makes a global difference in the alignment when it comes to aligning hairpins and their corresponding unfolded structures. Therefore we further investigated the problem of structural alignment using different cost functions, including the full D, using ideal hairpins of different sizes as models. In doing so we introduced some higher order approximations to the true distance D. We observed that for a large number of residues the dimensionless distance D/N(N − 1) converges to the same value when D, MRSD, or higher approximations of D are used as cost functions, but not when RMSD is used. This allows us to use MRSD as a computationally inexpensive alternative to D.

We then focused on the role of non-crossing constraints when minimally folding full proteins. Using concepts found in the mathematical theory of knots, we developed the formalism of finding approximate minimal untangling moves arising from non-crossing constraints during protein folding. The canonical untangling moves that we considered were leg moves, elbow moves and loop twists. We treated them as ordered operators. Solutions to minimal untangling problems were reduced to a depth-first search in the tree of possible untangling operator applications.

Geometrically speaking, the protein folding process is a many-to-one process, meaning that many different conformations fold to a single conformation; hence any exploration of the role of non-crossing and untangling moves should consider an ensemble of unfolded structures.
Therefore we had to develop methods that quickly generate coarse-grained unfolded ensembles from given coarse-grained native structures. Having developed the untangling formalism and the unfolded ensemble generator, we considered a few dozen proteins across different classes, including knotted proteins. We saw that perturbations caused by extra untangling moves introduced by non-crossing constraints play a small role in the total amount of chain movement. It was also observed that non-crossing constraints play a significantly larger role in the folding process of knotted proteins.

We observed that the perturbations in distance caused by the non-crossing constraints, when normalized by the zeroth approximation distance, are not different between 2-state folders and 3-state folders. However, since 3-state folders are on average significantly longer, all of the non-normalized distance-like quantities were larger. Across different classes of proteins, sorted by their secondary structure, we saw that for the unknotted proteins, non-crossing constraints are the least important in α-helical proteins and the most important in β-sheet proteins.

Furthermore, looking at the ensemble of untangling moves, we constructed consensus unfolding pathways for several proteins, in particular a knotted protein. By studying overlaps of untangling operations for ensembles of proteins, we observed that folding pathway mechanisms can be induced by the geometry of the native structure in the knotted protein. Such bottlenecks did not exist for the α-helical protein, but existed to some extent for the β-sheet protein.

We further extended our studies to protein kinetics and possible correlations between folding rates and various distance-like quantities. We saw that distance-like quantities have success in predicting folding rates for 3-state folders.
The normalized non-crossing distance Dnx/N, as well as D and others, correlates significantly with the folding rate of 3-state proteins. The surprising champion, however, was RMSD, which had the best correlation coefficient (although marginally). In short, for 3-state folders we saw that a few of the quantities that we proposed for the first time as rate predictors performed better than all the traditional rate predictors: LRO, RCO, ACO and N. For 2-state folders LRO, RCO, and ACO performed better.

Future research in this subject can have two general directions: refinements to the model, and applications of the model to new areas. We will briefly sketch a few lines for future endeavors.

A possible refinement to the model would be the introduction of persistence length and curvature constraints to D. During the course of the research we saw on a few occasions that some of the angles θ between the links of the conformation become very large, albeit for a short time, during the transformation. Curvature constraints on the chain can be added as inequality constraints to the variational problem, ensuring that θ_i ≤ θ_MAX at all times. Introducing a soft potential for the angle between the links is another way to add curvature constraints. It would be interesting to see how much deviation from the ideal extremal path and distance (obtained in the absence of such constraints) results when curvature constraints are introduced, and how this effect compares to that of non-crossing constraints.

A limitation of the model is that the thickness of the chain is zero. Therefore the non-crossing constraints do not take effect unless the two crossing links get extremely close. Changing the model from curves to tubes can improve this aspect.

Another refinement that we can make to the model is to allow sidechains. Currently D is defined for two curves or two chains.
In principle there is nothing to stop the generalization to tree-like objects. This will add more ODEs to the set of coupled ODEs (see Eqs. 2.15a–2.15c) and will distort the block diagonal nature of the ODE matrix.

Another enhancement to the model could be to introduce some energetics into D when considering protein folding. We can introduce native interactions into the formal distance functional that is to be optimized. From a practical point of view, adding energetics in the form of native interactions to the model is equivalent to first finding the potential V that induces a folding pathway similar to what we introduced in chapter 5, and then adding native Gō-like terms to the potential.

Even without adding the energetics, a very interesting question that can be answered in future studies is the correlation of D with commitment probability. To address this question for any given protein we can proceed as follows: we simulate the protein or the protein ensemble using a molecular dynamics package, e.g. GROMACS, then sample the system at regular intervals, extract the conformations and calculate D for each of them. The free energy surface obtained will be a two-well system for a 2-state folder. A small fraction of the sampled states will be transition states that sit somewhere between the two wells. We can correlate D with commitment probability by looking at the fate of the transition states (either folding completely or unfolding). Our analysis in chapter 5 has shown that the effect of non-crossing constraints on D is about 0.07. This means that we can use MRSD, which is computationally inexpensive compared to D, to approximate D. We can benchmark our results against Q or RMSD.

It could also be informative to apply our formalism for D to the reaction pathways of 3-state folders.
Considering that the pathway is Unfolded (U) → Intermediate (I) → Folded (F), it would be interesting to compute D between all the individual pairs and benchmark it against, for example, RMSD and Q.

Also, as the folding rates for an ever increasing number of knotted proteins are determined experimentally, it would be beneficial to see how distance-like quantities correlate with the folding rates of knotted proteins. Considering the success of such quantities in correlating well with the folding kinetics of 3-state folders, we are optimistic that they will do very well when it comes to knotted proteins.

Another area of interest for future studies is the thermodynamic untangling distance Dnx^T between two conformations. The formalism that we developed in chapter 5 concerns itself with the “minimal” untangling cost. However, the minimal untangling cost is not necessarily the most entropically favorable. There is a well-defined set of untangling moves that gives the minimal untangling cost for a given transformation. However, there might be a very large set of different untangling moves that each give only a slightly higher untangling cost than the minimal one. From a thermodynamic perspective, under non-zero “temperature” the system is more likely to untangle itself following more entropically favorable untangling operations. Quantifying such notions for different proteins is a rich subject for future studies.

It is also an interesting question to ask whether the actual dynamics between polymer configurations, after a suitable averaging over trajectories, resembles the minimal transformation. This question is linked with the role of the entropy of transformations described above. It is also related to the problem of finding the dominant pathway for a chemical reaction [97], which has recently been applied to the problem of protein folding [121].
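The contrast drawn above between the minimal untangling cost and an entropically favored one can be pictured with a small numerical sketch. Everything in it is an assumption made for illustration: the thesis does not define the thermodynamic untangling distance, so the Boltzmann-like weighting exp(−c/T), the free-energy-like effective cost −T ln Σ exp(−c_k/T), the function names, and the cost values are all hypothetical.

```python
import math

def boltzmann_weights(costs, temperature):
    """Normalized weights exp(-c/T) for each candidate set of untangling moves."""
    c0 = min(costs)  # subtract the minimum for numerical stability
    w = [math.exp(-(c - c0) / temperature) for c in costs]
    z = sum(w)
    return [x / z for x in w]

def effective_cost(costs, temperature):
    """Free-energy-like cost -T ln(sum_k exp(-c_k / T)); an assumed form."""
    c0 = min(costs)
    z = sum(math.exp(-(c - c0) / temperature) for c in costs)
    return c0 - temperature * math.log(z)

# Hypothetical costs: one minimal move set (10.0) and fifty alternatives
# that are each only slightly more expensive (10.5).
costs = [10.0] + [10.5] * 50

weights = boltzmann_weights(costs, temperature=1.0)
p_minimal = weights[0]
p_alternatives = sum(weights[1:])

print("weight of the minimal move set:", p_minimal)
print("total weight of the alternatives:", p_alternatives)
print("effective cost at T = 1:", effective_cost(costs, 1.0))
```

In this toy setting the many slightly costlier move sets carry most of the weight at T = 1, while at very low temperature the effective cost returns to the minimal cost.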
We have focused here on the question of geometrical distance for complex systems, which can be separated from the calculation of quantities such as reaction paths that depend intrinsically on energetics, i.e. on the specific Hamiltonian of the system. Quantifying the relationship between geometrical distance and the dominant reaction path is a question worthy of future investigation.

Bibliography

[1] Colin C. Adams. The Knot Book. W H Freeman and Company, 1994.
[2] E. Alm and D. Baker. Prediction of protein-folding mechanisms from free-energy landscapes derived from native structures. Proc Nat Acad Sci USA, 96:11305–11310, 1999.
[3] K V Andersen and F M Poulsen. The three-dimensional structure of acyl-coenzyme a binding protein from bovine liver: structural refinement using heteronuclear multidimensional nmr. J. Biomol. NMR, 3:271–284, 1993. Comment 2abd.
[4] C. B. Anfinsen. Principles that govern the folding of protein chains. Science, 181:223, 1973.
[5] D. Baker. A surprising simplicity to protein folding. Nature, 405:39–42, 2000.
[6] D. Baker and A. Sali. Protein structure prediction and structural genomics. Science, 294:93–96, 2001.
[7] Stephen Bell and James S. Crighton. Locating transition states. The Journal of Chemical Physics, 80(6):2464–2475, 1984.
[8] Robert B. Best and Gerhard Hummer. Reaction coordinates and rates from transition paths. Proc Nat Acad Sci USA, 102(19):6732–6737, 2005.
[9] Peter G. Bolhuis, David Chandler, Christoph Dellago, and Phillip L. Geissler. Transition path sampling: Throwing ropes over rough mountain passes, in the dark. Ann. Rev. Phys. Chem., 53:291–318, 2002.
[10] Davide Branduardi, Francesco Luigi Gervasio, and Michele Parrinello. From a to b in free energy space. The Journal of Chemical Physics, 126(5):054103, 2007.
[11] J. D. Bryngelson, J. N. Onuchic, N. D. Socci, and P. G. Wolynes. Funnels, pathways and the energy landscape of protein folding. Proteins: Struct. Funct. Genet., 21:167–195, 1995.
[12] J. D. Bryngelson and P. G. Wolynes. Spin glasses and the statistical mechanics of protein folding. Proc Nat Acad Sci USA, 84:7524–7528, 1987.
[13] J. D. Bryngelson and P. G. Wolynes. Intermediates and barrier crossing in a random energy model (with applications to protein folding). J Phys Chem, 93:6902–6915, 1989.
[14] N Campbell and J Reece. Biology. Benjamin Cummings, 6 edition, 2001.
[15] D. Cass. Optimum growth in an aggregative model of capital accumulation. Rev. Econ. Stud., 32:233–240, 1965.
[16] Charles J. Cerjan and William H. Miller. On finding transition states. The Journal of Chemical Physics, 75(6):2800–2806, 1981.
[17] Hue Sun Chan and Ken A. Dill. Transition states and folding dynamics of proteins and heteropolymers. J Chem Phys, 100(12):9238–9257, 15 June 1994.
[18] L. L. Chavez, J. N. Onuchic, and C. Clementi. Quantifying the roughness on the free energy landscape: Entropic bottlenecks and protein folding rates. J Am Chem Soc, 126:8426–8432, 2004.
[19] Margaret S. Cheung and D. Thirumalai. Nanopore-protein interactions dramatically alter stability and yield of the native state in restricted spaces. J Mol Biol, 357(2):632–643, 2006.
[20] F. Chiti, N. Taddei, P. M. White, M. Bucciantini, F. Magherini, M. Stefani, and C. M. Dobson. Mutational analysis of acylphosphatase suggests the importance of topology and contact order in protein folding. Nature Struct Biol, 6(11):1005–1009, 1999.
[21] Samuel S. Cho, Yaakov Levy, and Peter G. Wolynes. P versus Q: Structural reaction coordinates capture protein folding on smooth landscapes. Proc Nat Acad Sci USA, 103:586–591, 2006.
[22] C. Clementi, H. Nymeyer, and J. N. Onuchic. Topological and energetic factors: what determines the structural details of the transition state ensemble and en-route intermediates for protein folding? An investigation for small globular proteins. J Mol Biol, 298:937–953, 2000.
[23] C. Clementi and S. S. Plotkin.
The effects of nonnative interactions on protein folding rates: Theory and simulation. Protein Sci, 13:1750–1766, 2004.
[24] Evangelos A. Coutsias, Chaok Seok, and Ken A. Dill. Using quaternions to calculate rmsd. Journal of Computational Chemistry, 25(15):1849–1857, 2004.
[25] Evangelos A. Coutsias, Chaok Seok, and Ken A. Dill. Rotational superposition and least squares: The svd and quaternions approaches yield identical results. reply to the preceding comment by G. Kneller. Journal of Computational Chemistry, 26(15):1663–1665, 2005.
[26] Payel Das, Mark Moll, Hernan Stamati, Lydia E. Kavraki, and Cecilia Clementi. Low-dimensional, free-energy landscapes of protein-folding reactions by nonlinear dimensionality reduction. Proc Nat Acad Sci USA, 103(26):9885–9890, 2006.
[27] Xavier de la Cruz, E. Gail Hutchinson, Adrian Shepherd, and Janet M. Thornton. Toward predicting protein topology: An approach to identifying β-hairpins. Proc Natl Acad Sci U.S.A., 99(17):11157–11162, 2002.
[28] Christoph Dellago, Peter G. Bolhuis, Felix S. Csajka, and David Chandler. Transition path sampling and the calculation of rate constants. The Journal of Chemical Physics, 108(5):1964–1977, 1998.
[29] K. A. Dill and H. S. Chan. From levinthal to pathways to funnels. Nature Struct Biol, 4:10–19, 1997.
[30] Feng Ding, Weihua Guo, Nikolay V. Dokholyan, Eugene I. Shakhnovich, and Joan-Emma Shea. Reconstruction of the src-sh3 protein domain transition state ensemble using multiscale molecular dynamics simulations. J Mol Biol, 350:1035–1050, 2005.
[31] A. R. Dinner and M. Karplus. The roles of stability and contact order in determining protein folding rates. Nature Struct. Biol., 8(1):21–22, 2001.
[32] Nikolay V. Dokholyan, Lewyn Li, Feng Ding, and Eugene I. Shakhnovich. Topological determinants of protein folding. Proc. Nat. Acad. Sci. USA, 99(13):8637–8641, 2002.
[33] R. Du, V. S. Pande, A. Yu. Grosberg, T. Tanaka, and E. S. Shakhnovich.
On the transition coordinate for protein folding. J Chem Phys, 108:334–350, 1998.
[34] M. R. Ejtehadi, S. P. Avall, and S. S. Plotkin. Three-body interactions improve the prediction of rate and mechanism in protein folding models. Proc. Natl. Acad. Sci., 101(42):15088–15093, 2004.
[35] R. Elber and M. Karplus. A method for determining reaction paths in large molecules: Application to myoglobin. Chemical Physics Letters, 139(5):375–380, 1987.
[36] Daniel W. Farrell, Kirill Speranskiy, and M. F. Thorpe. Generating stereochemically acceptable protein pathways. Proteins: Structure, Function, and Bioinformatics, 78(14):2908–2921, 2010.
[37] A. R. Fersht. Transition-state structure as a unifying basis in protein-folding mechanisms: Contact order, chain topology, stability, and the extended nucleus mechanism. Proc Nat Acad Sci USA, 97:1525–1529, 2000.
[38] A. V. Finkelstein and A. Ya. Badretdinov. Influence of chain knotting on rate of folding. Folding & Design, 3:67–68, 1997.
[39] Stefan Fischer and Martin Karplus. Conjugate peak refinement: an algorithm for finding reaction paths and accurate transition states in systems with many degrees of freedom. Chemical Physics Letters, 194(3):252–261, 1992.
[40] Darren R. Flower. Rotational superposition: A review of methods. J Mol Graph Mod, 17:238–244, 1999.
[41] O. V. Galzitskaya and A. V. Finkelstein. A theoretical search for folding/unfolding nuclei in three-dimensional protein structures. Proc Nat Acad Sci USA, 96:11299–11304, 1999.
[42] O. V. Galzitskaya, D. N. Ivankov, and A. V. Finkelstein. Folding nuclei in proteins. FEBS Lett, 489:113–118, 2001.
[43] Oxana V. Galzitskaya, Sergiy O. Garbuzynskiy, Dmitry N. Ivankov, and Alexei V. Finkelstein. Chain length is the main determinant of the folding rate for proteins with three-state folding kinetics. Proteins: Structure, Function, and Bioinformatics, 51(2):162–166, 2003.
[44] A. E. García. Large-amplitude nonlinear motions in proteins.
Phys Rev Lett, 68:2696–2699, 1992.
[45] Angel E. Garcia and Jose N. Onuchic. Folding a protein in a computer: An atomic description of the folding/unfolding of protein A. Proc. Natl. Acad. Sci., 100(24):13898–13903, 2003.
[46] I. M. Gelfand and S. V. Fomin. Calculus of Variations. Dover, 2000.
[47] M. Gerstein and M. Levitt. Comprehensive assessment of automatic structural alignment against a manual standard. Protein Science, 7:445–456, 1998.
[48] J. E. Gouaux and W. N. Lipscomb. Crystal structures of phosphonoacetamide ligated t and phosphonoacetamide and malonate ligated r states of aspartate carbamoyltransferase at 2.8-a resolution and neutral ph. Biochemistry, 29:389–402, 1990.
[49] J. Greene, S. Kahn, H. Savoj, P. Sprague, and S. Teig. Chemical function queries for 3d database search. J. Chem. Inf. Comput. Sci., 34:1297–1308, 1994.
[50] John Gregory and Cantian Lin. An unconstrained calculus of variations formulation for generalized optimal control problems and for the constrained problem of bolza. J. Math. Anal. Appl., 187:826–841, 1994.
[51] John Gregory and Cantian Lin. Constrained Optimization in the Calculus of Variations and Optimal Control Theory. Springer, New York, first edition, 2007.
[52] M. Michael Gromiha and S Selvaraj. Comparison between long-range interactions and contact order in determining the folding rate of two-state proteins: application of long-range order to folding rate prediction. Journal of Molecular Biology, 310(1):27–32, 2001.
[53] M. Michael Gromiha and S. Selvaraj. Inter-residue interactions in protein folding and stability. Progress in Biophysics and Molecular Biology, 86(2):235–277, 2004.
[54] A. M. Gutin, V. I. Abkevich, and E. I. Shakhnovich. Chain length scaling of protein folding time. Phys Rev Lett, 77:5433–5436, 1996.
[55] F. Ulrich Hartl. Molecular chaperones in cellular protein folding. Nature, 381(6583):571–580, Jun 1996.
[56] G. Hummer, A. E. García, and S. Garde.
Conformational diffusion and helix formation kinetics. Phys Rev Lett, 85:2637–2640, 2000.
[57] G. Hummer, A. E. García, and S. Garde. Helix nucleation kinetics from molecular simulations in explicit solvent. Proteins, 42:77–84, 2001.
[58] Gerhard Hummer. From transition paths to transition states and rate coefficients. The Journal of Chemical Physics, 120(2):516–523, 2004.
[59] Gerhard Hummer and Ioannis G. Kevrekidis. Coarse molecular dynamics of a peptide fragment: Free energy, kinetics, and long-time dynamics computations. The Journal of Chemical Physics, 118(23):10762–10773, 2003.
[60] Andrei Y. Istomin, Donald J. Jacobs, and Dennis R. Livesay. On the role of structural class of a protein with two-state folding kinetics in determining correlations between its size, topology, and folding rate. Protein Science, 16(11):2564–2569, 2007.
[61] Dmitry N. Ivankov, Sergiy O. Garbuzynskiy, Eric Alm, Kevin W. Plaxco, David Baker, and Alexei V. Finkelstein. Contact order revisited: Influence of protein size on the folding rate. Protein Science, 12(9):2057–2062, 2003.
[62] W. Kabsch. A solution for the best rotation to relate two sets of vectors. Acta Crystallographica Section A, 32(5):922–923, Sep 1976.
[63] W. Kabsch. A discussion of the solution for the best rotation to relate two sets of vectors. Acta Crystallographica Section A, 34(5):827–828, Sep 1978.
[64] John Karanicolas and C. L. Brooks III. The origins of asymmetry in the folding transition states of protein L and protein G. Protein Sci, 11(10):2351–2361, 2002.
[65] M. Kawato. Trajectory formation in arm movements: Minimization principles and procedures. In H. N. Zelaznik, editor, Advances in Motor Learning and Control, chapter 9, pages 225–259. Human Kinetics, 1996.
[66] Moon K Kim, Gregory S Chirikjian, and Robert L Jernigan. Elastic models of conformational transitions in macromolecules. Journal of Molecular Graphics and Modelling, 21(2):151–160, 2002.
[67] Moon K. Kim, Robert L.
Jernigan, and Gregory S. Chirikjian. Efficient generation of feasible pathways for protein conformational transitions. Biophysical Journal, 83(3):1620–1630, 2002.
[68] Neil P. King, Alex W. Jacobitz, Michael R. Sawaya, Lukasz Goldschmidt, and Todd O. Yeates. Structure and folding of a designed knotted protein. Proceedings of the National Academy of Sciences, 107(48):20732–20737, 2010. Comment 3mlg.
[69] Gerald R. Kneller. Superposition of molecular structures using quaternions. Molecular Simulation, 7(1-2):113–119, 1991.
[70] Edward H. Koo, Peter T. Lansbury, and Jeffery W. Kelly. Amyloid diseases: Abnormal protein aggregation in neurodegeneration. Proceedings of the National Academy of Sciences, 96(18):9989–9990, 1999.
[71] S Koyama, H Yu, DC Dalgarno, TB Shin, LD Zydowsky, and Schreiber SL. Structure of the pl3k sh3 domain and analysis of the sh3 family. Cell, 72:945–952, 1993. Comment 1pks.
[72] Werner G. Krebs and Mark Gerstein. The morph server: a standardized system for analyzing and visualizing macromolecular motions in a database framework. Nucleic Acids Research, 28(8):1665–1675, 2000.
[73] Sergei V. Krivov, Stefanie Muff, Amedeo Caflisch, and Martin Karplus. One-dimensional barrier-preserving free-energy projections of a β-sheet miniprotein: New insights into the folding process. Journal of Physical Chemistry B, 112(29):8701–8714, 2008.
[74] M. Lal. ’Monte Carlo’ computer simulation of chain molecules. Mol. Phys., 17:57–64, 1969.
[75] C. Lemmen and T. Lengauer. Computational methods for the structural alignment of molecules. J. Comput. Aided Mol. Des., 14:215–231, 2000.
[76] Peter E. Leopold, Mauricio Montal, and José N. Onuchic. Protein folding funnels: Kinetic pathways through compact conformational space. Proc. Natl Acad. Sci. USA, 89:8721–8725, September 1992.
[77] R. D. Levine and R. B. Bernstein. Molecular reaction dynamics and chemical reactivity. Clarendon Press, Oxford, 1987.
[78] M. Lindberg, Jeanette Tangrot, and M.
Oliveberg. Complete change of the protein folding transition state upon circular permutation. Nature Struct. Biol., 9(11):818–822, 2002.
[79] M. O. Lindberg, J. Tangrot, D. E. Otzen, D. A. Dolgikh, A. V. Finkelstein, and M. Oliveberg. Folding of circular permutants with decreased contact order: general trend balanced by protein stability. J. Mol. Biol., 314:891–900, 2001.
[80] Rhonald C Lua and Alexander Y Grosberg. Statistics of knots, geometry of conformations, and evolution of proteins. PLoS Comput Biol, 2(5):e45, 05 2006.
[81] Ao Ma and Aaron R. Dinner. Automatic method for identifying reaction coordinates in complex systems. J. Phys. Chem. B, 109(14):6769–6779, 2005.
[82] Neal Madras and Alan D. Sokal. The pivot algorithm: A highly efficient monte carlo method for the self-avoiding walk. Journal of Statistical Physics, 50(1–2):109–186, 1988.
[83] A. L. Mallam and S. E. Jackson. Probing nature’s knots: The folding pathway of a knotted homodimeric protein. J. Mol. Biol., 359:1420–1436, 2006.
[84] Paul Maragakis and Martin Karplus. Large amplitude conformational change in proteins explored with a plastic network model: Adenylate kinase. Journal of Molecular Biology, 352(4):807–822, 2005.
[85] Luca Maragliano, Alexander Fischer, Eric Vanden-Eijnden, and Giovanni Ciccotti. String method in collective variables: Minimum free energy paths and isocommittor surfaces. The Journal of Chemical Physics, 125(2):024106, 2006.
[86] G. A. Mines, T. Pascher, S. C. Lee, J. R. Winkler, and H. B. Gray. Cytochrome c folding triggered by electron transfer. Chem. and Biol., 3:491–497, 1996.
[87] Ali R Mohazab and Steve S Plotkin. Polymer untangling and unknotting in protein folding. PLOS computational biology (submitted), 2012.
[88] Ali R Mohazab and Steve S Plotkin. The role of polymer non-crossing and geometrical distance in protein folding kinetics. unpublished, 2012.
[89] Ali R. Mohazab and Steven S. Plotkin.
Minimal distance transformations between links and polymers: principles and examples. J. Phys. Cond. Mat., 20:244133, 2008.
[90] Ali R. Mohazab and Steven S. Plotkin. Minimal folding pathways for coarse-grained biopolymer fragments. Biophys. J., 95:5496–5507, 2008.
[91] Ali R. Mohazab and Steven S. Plotkin. Structural alignment using the generalized euclidean distance between conformations. IJQC, 109:3217–3228, November 2009.
[92] S. K. Nechaev. Statistics of Knots and Entangled Random Walks. World Scientific, 1996.
[93] Jeffrey K. Noel, Joanna I. Sulkowska, and José N. Onuchic. Slipknotting upon native-like loop formation in a trefoil knot protein. Proceedings of the National Academy of Sciences, 107(35):15403–15408, 2010.
[94] H. Nymeyer, N. D. Socci, and J. N. Onuchic. Landscape approaches for determining the ensemble of folding transition states: Success and failure hinge on the degree of minimal frustration. Proc. Natl Acad. Sci. USA, 97:634–639, 2000.
[95] E.P. O’Brien, M. Vendruscolo, and C.M. Dobson. Prediction of variable translation rate effects on cotranslational protein folding. Nature Communications, 3:868, 2012.
[96] L. Onsager. Initial recombination of ions. Phys. Rev., 54:554–557, 1938.
[97] L. Onsager and S. Machlup. Fluctuations and irreversible processes. Phys Rev, 91(6):1505–1512, 1953.
[98] J. N. Onuchic, N. D. Socci, Z. Luthey-Schulten, and P. G. Wolynes. Protein folding funnels: The nature of the transition state ensemble. Folding and Design, 1:441–450, 1996.
[99] J. N. Onuchic and P. G. Wolynes. Theory of protein folding. Current Opinion in Structural Biology, 14:70–75, 2004.
[100] S. B. Ozkan, Ken A. Dill, and Ivet Bahar. Computing the transition state populations in simple protein models. Biopolymers, 68(1):35–46, 2003.
[101] S. Banu Ozkan, Ivet Bahar, and Ken A. Dill. Transition states and the meaning of [phi]-values in protein folding kinetics. Nat Struct Mol Biol, 8(9):765–769, Sep 2001.
[102] B. Oztop, M.
Reza Ejtehadi, and Steven S. Plotkin. Protein folding rates correlate with heterogeneity of folding mechanism. Phys. Rev. Lett., 93:208105, 2004.
[103] Y. Patel, V. J. Gillet, G. Bravi, and A. R. Leach. A comparison of the pharmacophore identification programs: Catalyst, disco and gasp. J. Comput. Aided Mol. Des., 16:653–681, 2002.
[104] D. A. Pearlman, D. A. Case, J. W. Caldwell, W. S. Ross, T. E. Cheatham, D. M. Ferguson, U. Chandra Singh, P. Weiner, and P. A. Kollman. AMBER, V. 4.1, 1995.
[105] K. W. Plaxco, K. T. Simons, and D. Baker. Contact order, transition state placement and the refolding rates of single domain proteins. J. Mol. Biol., 277:985–994, 1998.
[106] K. W. Plaxco, K. T. Simons, I. Ruczinski, and D. Baker. Topology, stability, sequence, and length: Defining the determinants of two-state protein folding kinetics. Biochemistry, 39:11177–11183, 2000.
[107] S. Plimpton. Fast parallel algorithms for short-range molecular dynamics. Journal of Computational Physics, 117(1):1–19, 1995.
[108] S. S. Plotkin. Speeding protein folding beyond the Gō model: How a little frustration sometimes helps. Proteins, 45:337–345, 2001.
[109] S. S. Plotkin. Generalization of distance to higher dimensional objects. Proc. Natl Acad. Sci. USA, 104(38):14899–14904, 2007.
[110] S. S. Plotkin and J. N. Onuchic. Investigation of routes and funnels in protein folding by free energy functional methods. Proc. Natl Acad. Sci. USA, 97:6509–6514, 2000.
[111] S. S. Plotkin and J. N. Onuchic. Structural and energetic heterogeneity in protein folding i: Theory. J. Chem. Phys., 116(12):5263–5283, 2002.
[112] S. S. Plotkin and J. N. Onuchic. Understanding protein folding with energy landscape theory i: Basic concepts. Quart. Rev. Biophys., 35(2):111–167, 2002.
[113] S. S. Plotkin and J. N. Onuchic. Understanding protein folding with energy landscape theory ii: Quantitative aspects. Quart. Rev. Biophys., 35(3):205–286, 2002.
[114] S. S. Plotkin and P. G. Wolynes.
Non-markovian configurational diffusion and reaction coordinates for protein folding. Phys. Rev. Lett., 80:5015–5018, 1998.
[115] S. S. Plotkin and P. G. Wolynes. Buffed energy landscapes: Another solution to the kinetic paradoxes of protein folding. Proc. Natl Acad. Sci. USA, 100(8):4417–4422, 2003.
[116] L. S. Pontryagin, V. G. Boltyanskii, R. V. Gamkrelidze, and E. F. Mishchenko. The mathematical theory of optimal processes. Wiley Interscience, New York and London, 1962.
[117] J. J. Portman, S. Takada, and P. G. Wolynes. Microscopic theory of protein folding rates. I. Fine structure of the free energy profile and folding routes from a variational approach. J. Chem. Phys., 114:5069–5081, 2001.
[118] J. J. Portman, S. Takada, and P. G. Wolynes. Microscopic theory of protein folding rates. II. Local reaction coordinates and chain dynamics. J. Chem. Phys., 114:5082–5096, 2001.
[119] Michael C. Prentiss, David J. Wales, and Peter G. Wolynes. The energy landscape, folding pathways and the kinetics of a knotted protein. PLoS Comput Biol, 6(7):e1000835, 07 2010.
[120] Adam D. Schuyler, Robert L. Jernigan, Pradman K. Qasba, Boopathy Ramakrishnan, and Gregory S. Chirikjian. Iterative cluster-nma: A tool for generating conformational transitions in proteins. Proteins: Structure, Function, and Bioinformatics, 74(3):760–776, 2009.
[121] M. Sega, P. Faccioli, F. Pederiva, G. Garberoglio, and H. Orland. Quantitative protein dynamics from dominant folding pathways. Phys. Rev. Lett., 99(118102), 2007.
[122] J. E. Shea, J. N. Onuchic, and C. L. Brooks. Energetic frustration and the nature of the transition state in protein folding. J. Chem. Phys., 113:7663–7671, 2000.
[123] J.E. Shea and C.L. Brooks III. From folding theories to folding proteins: A review and assessment of simulation studies of protein folding and unfolding. Ann. Rev. Phys. Chem., 52:499–535, 2001.
[124] C.D. Snow, H. Nguyen, V.S. Pande, and M. Gruebele.
Absolute comparison of simulated and experimental protein-folding dynamics. Nature, 420:102–106, 2002.
[125] Stefan Wallin, Konstantin B. Zeldovich, and Eugene I. Shakhnovich. The folding mechanics of a knotted protein. J. Mol. Biol., 368:884–893, 2007.
[126] Joanna I. Sulkowska, Piotr Sulkowski, and José Onuchic. Dodging the crisis of folding proteins with knots. Proceedings of the National Academy of Sciences, 106(9):3119–3124, 2009.
[127] W. R. Taylor. A deeply knotted protein structure and how it might fold. Nature, 406:916–919, 2000.
[128] Y. Ueda, H. Taketomi, and Nobuhiro Gō. Studies on protein folding, unfolding, and fluctuations by computer simulation. Int. J. Peptide Protein Res., 7:445–459, 1975.
[129] Yuzo Ueda, Hiroshi Taketomi, and Nobuhiro Gō. Studies on protein folding, unfolding and fluctuations by computer simulation. I. The effects of specific amino acid sequence represented by specific interunit interactions. Int. J. Peptide Protein Res., 7:445–459, 1975.
[130] Arjan van der Vaart and Martin Karplus. Minimum free energy pathways and free energy profiles for conformational transitions based on atomistic molecular dynamics simulations. The Journal of Chemical Physics, 126(16):164106, 2007.
[131] Peter Virnau, Leonid A Mirny, and Mehran Kardar. Intricate knots in proteins: Function and evolution. PLoS Comput Biol, 2(9):1074–1079, 2006.
[132] David J. Wales. Theoretical study of water trimer. Journal of the American Chemical Society, 115(24):11180–11190, 1993.
[133] Stefan Wallin and Hue Sun Chan. A critical assessment of the topomer search model of protein folding using a continuum explicit-chain model with extensive conformational sampling. Protein Science, 14(6):1643–1660, 2005.
[134] Jin Wang, Kun Zhang, Hongyang Lu, and Erkang Wang. Quantifying Kinetic Paths of Protein Folding. Biophys. J., 89(3):1612–1620, 2005.
[135] Stephen Wells, Scott Menor, Brandon Hespenheide, and M F Thorpe.
Constrained geometric simulation of diffusive motion in proteins. Physical Biology, 2(4):S127, 2005.
[136] F. W. Wiegel. Introduction to path-integral methods in physics and polymer science. World Scientific, Singapore, 1986.
[137] P. G. Wolynes, J. N. Onuchic, and D. Thirumalai. Navigating the folding routes. Science, 267(5204):1619–1620, 1995.
[138] Haijun Yang, Hao Wu, Dawei Li, Li Han, and Shuanghong Huo. Temperature-dependent probabilistic roadmap algorithm for calculating variationally optimized conformational transition pathways. Journal of Chemical Theory and Computation, 3(1):17–25, 2007.

Appendix A Sufficient conditions for an extremum to be a minimum

For a transformation to be minimal, it is necessary, but not sufficient, that it be an extremum. We now derive the sufficient conditions for a given transformation to minimize the functional (2.9). We describe the formalism in some detail because it is not typically taught to physicists; for further reading see, for example, reference [46].

According to Sylvester’s criterion, a quadratic form $\sum_{ij} A_{ij} x_i x_j$ is positive definite if and only if all descending principal minors of the matrix $A_{ij}$ are positive, i.e.

\[
A_{11} > 0, \quad
\begin{vmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{vmatrix} > 0, \quad
\begin{vmatrix} A_{11} & A_{12} & A_{13} \\ A_{21} & A_{22} & A_{23} \\ A_{31} & A_{32} & A_{33} \end{vmatrix} > 0, \quad
\ldots, \quad \det A_{ij} > 0,
\tag{A.1}
\]

and a function $F$ of $\mathbf{x} \equiv (x_1, x_2, \ldots, x_n)$ has a minimum at $\mathbf{x}$ if the Hessian matrix $\partial^2 F/\partial x_i \partial x_j$ is positive definite at the position of the extremum (where $\partial F/\partial x_i = 0$).

For a function to be a minimum of a given functional, it must satisfy similar sufficient conditions. Consider again the difference in distance between two trajectories in (2.9)‡. Taylor expanding the Lagrangian to second order in $\mathbf{h}_i$:

\[
\begin{aligned}
\Delta D &= D[\mathbf{r}_i + \mathbf{h}_i] - D[\mathbf{r}_i]
= \int_0^T dt\, L(\mathbf{r}_i + \mathbf{h}_i, \dot{\mathbf{r}}_i + \dot{\mathbf{h}}_i) - \int_0^T dt\, L(\mathbf{r}_i, \dot{\mathbf{r}}_i) \\
&\approx \int_0^T dt \left[ \sum_{i=1}^{N} \left( L_{\mathbf{r}_i} \cdot \mathbf{h}_i + L_{\dot{\mathbf{r}}_i} \cdot \dot{\mathbf{h}}_i \right)
+ \frac{1}{2} \sum_{i,j}^{3N} \left( L_{x_i x_j} h_i h_j + 2 L_{x_i \dot{x}_j} h_i \dot{h}_j + L_{\dot{x}_i \dot{x}_j} \dot{h}_i \dot{h}_j \right) \right]
\end{aligned}
\tag{A.2}
\]

‡ We ignore corner conditions for purposes of the derivation.
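It can be shown that they do not modify the result.

At an extremum, the first order term in (A.2) is zero, and $\Delta D \approx \delta^2 D$, the second variation. A sufficient condition for the extremum to be a minimum is $\delta^2 D > 0$. From eq. (2.10), the matrix $L_{x_i \dot{x}_j} = 0$. Assuming $L_{x_i \dot{x}_j}$ is in general a symmetric matrix, i.e., $L_{x_i \dot{x}_j} = L_{x_j \dot{x}_i}$, the second term in the quadratic form of (A.2) may be integrated by parts to give:

$$\delta^2 D = \frac{1}{2} \int_0^T dt \left[ \langle \dot{h} | \mathbf{P} \dot{h} \rangle + \langle h | \mathbf{Q} h \rangle \right], \tag{A.3}$$

where we have let $|h\rangle$ denote the vector $(h_1, h_2, \ldots, h_{3N})$, and used the shorthand $\mathbf{P}$ and $\mathbf{Q}$ for the matrices:

$$\mathbf{P}(t) = P_{ij} = L_{\dot{x}_i \dot{x}_j}, \qquad \mathbf{Q}(t) = Q_{ij} = L_{x_i x_j} - \frac{d}{dt} L_{x_i \dot{x}_j}. \tag{A.4}$$

From (2.10) the explicit form for these matrices may be calculated. $\mathbf{P}$ is block diagonal:

$$\mathbf{P} = \begin{pmatrix} I^{(1)}_{ij} & 0 & \cdots & 0 \\ 0 & I^{(2)}_{ij} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & I^{(N)}_{ij} \end{pmatrix} \tag{A.5}$$

with each block matrix having elements

$$I^{(J)}_{ij} = \frac{1}{|\dot{\mathbf{r}}_J|^3} \left( \delta_{ij}\, \dot{\mathbf{r}}_J^2 - \dot{x}^{(J)}_i \dot{x}^{(J)}_j \right) = \frac{1}{|\dot{\mathbf{r}}_J|^3} \begin{pmatrix} \dot{y}^2 + \dot{z}^2 & -\dot{x}\dot{y} & -\dot{x}\dot{z} \\ -\dot{x}\dot{y} & \dot{x}^2 + \dot{z}^2 & -\dot{y}\dot{z} \\ -\dot{x}\dot{z} & -\dot{y}\dot{z} & \dot{x}^2 + \dot{y}^2 \end{pmatrix} \quad \text{(particle } J\text{)} \tag{A.6}$$

Interestingly the numerator of (A.6) has the form of an inertia tensor for a point particle in velocity-space. The matrix $\mathbf{Q}$ is block tri-diagonal, because the spatial derivatives in (A.4) couple each bead to its two neighbors. Using indices $I, J$ to enumerate beads and $i, j$ to enumerate $x, y, z$ components for each bead:

$$Q_{IJ,ij} = \delta_{ij} \left[ \lambda_{J-1,J} \left( \delta_{IJ} - \delta_{I,J-1} \right) + \lambda_{J,J+1} \left( \delta_{IJ} - \delta_{I,J+1} \right) \right]$$

or

$$\mathbf{Q} = \begin{pmatrix} \lambda_{12}\mathbf{1} & -\lambda_{12}\mathbf{1} & 0 & 0 & \cdots \\ -\lambda_{12}\mathbf{1} & (\lambda_{12} + \lambda_{23})\mathbf{1} & -\lambda_{23}\mathbf{1} & 0 & \cdots \\ 0 & -\lambda_{23}\mathbf{1} & (\lambda_{23} + \lambda_{34})\mathbf{1} & -\lambda_{34}\mathbf{1} & \cdots \\ \vdots & & \ddots & \ddots & \vdots \\ \cdots & & \cdots & -\lambda_{N-1,N}\mathbf{1} & \lambda_{N-1,N}\mathbf{1} \end{pmatrix} \tag{A.7}$$

Here $\mathbf{1}_{ij} = \delta_{ij}$. For the transformation $\mathbf{r}^*(t)$ to be a minimum of $D[\mathbf{r}]$, it suffices that the functional (A.3) be positive definite for all $|h\rangle$. To derive the conditions for this, we can temporarily ignore the fact that (A.3)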
arose from the second variation of (2.9), and treat (A.3) as a new functional acting on inputs $|h(t)\rangle = |h_1(t), \ldots, h_{3N}(t)\rangle$. We then ask what $|h(t)\rangle$ extremizes (A.3). If $\delta^2 D > 0$ the only extremal solution can be the trivial one, $|h(t)\rangle = |0\rangle$, because $\delta^2 D$ is homogeneous of degree 2. That is, changing the transformation $\{\mathbf{r}^*_i(t)\}$ from that which extremized (2.9) to a neighboring transformation $\{\mathbf{r}^*_i(t) + \mathbf{h}_i(t)\}$ would increase the distance traveled. The system of $3N$ EL equations for $|h\rangle$ from (A.3) is

$$-\frac{d}{dt} |\mathbf{P} \dot{h}\rangle + |\mathbf{Q} h\rangle = |0\rangle \tag{A.8}$$

with boundary conditions

$$|h(0)\rangle = |h(T)\rangle = |0\rangle. \tag{A.9}$$

Equation (A.8) is referred to as the Jacobi equation in the calculus of variations. First note that if $|h\rangle$ satisfies the system of equations in (A.8) as well as the boundary conditions (A.9), then integration by parts gives

$$\delta^2 D = \frac{1}{2} \int_0^T dt \left( \langle \dot{h} | \mathbf{P} \dot{h} \rangle + \langle h | \mathbf{Q} h \rangle \right) = \frac{1}{2} \int_0^T dt\, \Big\langle h \,\Big|\, -\frac{d}{dt} \left( \mathbf{P} \dot{h} \right) + \mathbf{Q} h \Big\rangle = 0. \tag{A.10}$$

This means that for $\delta^2 D$ to be $> 0$, any nontrivial $|h(t)\rangle$ which satisfies the boundary conditions must not itself be an extremal solution of the Jacobi equation, otherwise solutions $|r^*(t)\rangle$ perturbed by any constant times $|h(t)\rangle$ are themselves extremals. One may think of this by analogy as the necessity for the absence of any "Goldstone modes", where excitations by various $C|h(t)\rangle$ would lead to a family of curves with zero cost in action, and thus zero effective restoring force, between them.

Alternatively we can ask what equation $\mathbf{h} \equiv |h\rangle$ must satisfy if the EL equations are satisfied for both $L(\mathbf{r}, \dot{\mathbf{r}})$ and the neighboring extremal $L(\mathbf{r} + \mathbf{h}, \dot{\mathbf{r}} + \dot{\mathbf{h}})$. Taylor expanding $L(\mathbf{r} + \mathbf{h}, \dot{\mathbf{r}} + \dot{\mathbf{h}})$ in

$$L_{\mathbf{r}}(\mathbf{r} + \mathbf{h}, \dot{\mathbf{r}} + \dot{\mathbf{h}}) - \frac{d}{dt} L_{\dot{\mathbf{r}}}(\mathbf{r} + \mathbf{h}, \dot{\mathbf{r}} + \dot{\mathbf{h}}) = 0$$

gives

$$-\frac{d}{dt} \left( L_{\dot{\mathbf{r}} \dot{\mathbf{r}}} \cdot \dot{\mathbf{h}} \right) + \left( L_{\mathbf{r}\mathbf{r}} - \frac{d}{dt} L_{\mathbf{r} \dot{\mathbf{r}}} \right) \cdot \mathbf{h} = 0$$

which is exactly Jacobi's equation (A.8) with definitions (A.4).

From here on, it is much simpler to elucidate the central concepts for sufficient conditions using the case of a single scalar function $h(t)$. The
analysis can be generalized to the multi-dimensional case with a bit more effort, but the conclusions are essentially the same and so they will simply be stated along with the conclusions for the "1-D" case. For further details see [46]. We write equation (A.3) in 1-D as:

$$\frac{1}{2} \int_0^T dt \left( P \dot{h}^2 + Q h^2 \right) \tag{A.11}$$

It was realized originally by Legendre that the integral could be brought to simpler form by adding zero to it in the form of a total derivative. Since

$$\int_0^T dt\, \frac{d}{dt} \left( w(t) h^2 \right) = 0$$

for any $w(t)$ so long as $h(t)$ satisfies the boundary conditions (A.9), we can add it to the integral in (A.11) and seek a function $w(t)$ such that the expression

$$\delta^2 D = \frac{1}{2} \int_0^T dt \left( P \dot{h}^2 + 2 w h \dot{h} + (Q + \dot{w}) h^2 \right)$$

may be written as a perfect square. This yields the differential equation

$$P (Q + \dot{w}) = w^2 \tag{A.12}$$

for $w(t)$, and second variation

$$\delta^2 D[h] = \frac{1}{2} \int_0^T dt\, P \left( \dot{h} + \frac{w}{P} h \right)^2. \tag{A.13}$$

Therefore one condition required for a minimum is $P > 0$ (the corresponding necessary condition is $P \geq 0$). The analogous condition in the multi-dimensional case is for the matrix $\mathbf{P}$ to be positive definite.

If the differential term $\dot{h} + \frac{w}{P} h$ in (A.13) were equal to zero for some $h(t)$, the boundary condition $h(0) = 0$ would then imply $\dot{h}(0) = 0$ and thus $h(t) = 0$ for all $t$ by the uniqueness theorem as applied to this first order differential equation. Therefore the functional (A.13) is positive definite if, and only if,

1.) $P > 0$,
2.) a solution of eq. (A.12) exists on the whole interval $[0, T]$.

In general, there is no guarantee of condition (2) even if condition (1) is valid. For example if $P = 1$, $Q = -1$, (A.12) has solution $w(t) = \tan(t + c)$, which cannot remain finite on the whole interval if $T > \pi$, because the poles of the tangent are spaced $\pi$ apart.† If a solution $w$ of (A.12) has a pole at, say, $\tilde{t}$, then for the integral (A.13) to remain finite, $h(\tilde{t}) \to 0$. This point is said to be conjugate to the point $t_0 = 0$, i.e., it is a conjugate point.
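The blow-up in the example above can be checked numerically. The following is a minimal sketch, assuming $P = 1$ and $Q = -1$ as in the text; the function names, the offset $c = -1$, and the RK4 step count are illustrative choices that do not appear in the original derivation.

```python
import math

P, Q = 1.0, -1.0  # constant coefficients taken from the example in the text


def riccati_rhs(w):
    """Right-hand side of (A.12) solved for w': P (Q + w') = w^2."""
    return w * w / P - Q


def integrate_riccati(w0, t_end, steps=20000):
    """Integrate w' = w^2/P - Q from t = 0 to t_end with classical RK4."""
    w = w0
    dt = t_end / steps
    for _ in range(steps):
        k1 = riccati_rhs(w)
        k2 = riccati_rhs(w + 0.5 * dt * k1)
        k3 = riccati_rhs(w + 0.5 * dt * k2)
        k4 = riccati_rhs(w + dt * k3)
        w += dt * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0
    return w


def first_pole(c):
    """First t > 0 at which tan(t + c) diverges, for -pi/2 < c < pi/2."""
    return math.pi / 2.0 - c


c = -1.0  # hypothetical integration constant, chosen only for illustration
w_numeric = integrate_riccati(math.tan(c), 2.0)
w_exact = math.tan(2.0 + c)
print(abs(w_numeric - w_exact))  # small: tan(t + c) solves (A.12) for P = 1, Q = -1
print(first_pole(c))             # pole location; it lies below pi for any allowed c
```

Whatever value of $c$ is chosen, the first pole lies at a distance of less than $\pi$ from $t = 0$, so no finite $w$ exists on an interval longer than $\pi$.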
Equation (A.12) is a Riccati equation, which may be brought to linear form by the transformation $w(t) = -P \dot{H}/H$, with $H(t)$ an unknown function. Substitution in (A.12) gives

$$-\frac{d}{dt} \left( P \dot{H} \right) + Q H = 0 \tag{A.14}$$

which is precisely equation (A.8), the Jacobi equation for $h(t)$. This means that for equation (A.12) to have a solution on $[0, T]$, $H(t)$, as given by the solution to (A.14), must have no roots on $[0, T]$. But because equation (A.14) holds for $h(t)$ as well, $h(t)$ must have no roots (conjugate points) on $[0, T]$. Because $h(0) = h(T) = 0$, the only way to extremize (A.11) is to satisfy eq. (A.14) with the trivial solution $h(t) = 0$. If $h(t) \neq 0$ for $0 < t < T$ then it would mean that there was a conjugate point at $\tilde{t} = T$.

In the multi-dimensional case an extremal $|h\rangle$ is one of $3N$ vectors satisfying equations (A.8), i.e. $|h^{(\alpha)}\rangle = |h^{(\alpha)}_1 \ldots h^{(\alpha)}_{3N}\rangle$, $1 \leq \alpha \leq 3N$. A conjugate point is defined as a point where the determinant vanishes:

$$\det \begin{pmatrix} h^{(1)}_1(t) & \cdots & h^{(1)}_{3N}(t) \\ \vdots & & \vdots \\ h^{(3N)}_1(t) & \cdots & h^{(3N)}_{3N}(t) \end{pmatrix} = 0$$

The sufficient conditions for a transformation to be minimal are then:

1.) The transformation $|r^*(t)\rangle = \{\mathbf{r}^*_i(t)\}$ is extremal,
2.) Along $|r^*(t)\rangle$, the matrix $\mathbf{P}(t) = L_{\dot{x}_i \dot{x}_j}$ is positive definite, and
3.) The interval $[0, T]$ contains no conjugate points to $t = 0$.

The above ideas can be made clear with a few examples below.

† Because of reparameterization invariance in our problem, the value of $T$ is adjustable; however, precisely because of this invariance, $\det \mathbf{P} = 0$ and so $\mathbf{P}$ is no longer positive definite. We discuss this problem and its resolution below.

A.1 Distance between points

From the effective Lagrangian $L = \sqrt{\dot{\mathbf{r}}^2}$, $\mathbf{P} = L_{\dot{x}_i \dot{x}_j}$ is given in equation (A.6), which has determinant $\det \mathbf{P} = 0$, and so is not positive definite. This is due to our choice of parameterization. If we break symmetry by choosing one spatial direction as the independent variable, $L(x, y', z') = \sqrt{1 + y'^2 + z'^2}$ (with e.g.
$y' \equiv dy/dx$ and $x_0 \leq x \leq x_1$). Then

$$\mathbf{P} = \frac{1}{(1 + y'^2 + z'^2)^{3/2}} \begin{pmatrix} 1 + z'^2 & -y' z' \\ -y' z' & 1 + y'^2 \end{pmatrix}$$

with positive determinant $\det \mathbf{P} = (1 + y'^2 + z'^2)^{-2} > 0$ and positive diagonal elements, so that $\mathbf{P}$ is positive definite for any trajectory. From eq. (A.4), $\mathbf{Q}(t) = 0$. Along the extremal, where $y(x) = ax + y_0$, $z(x) = bx + z_0$, equation (A.8) gives $\mathbf{P} \cdot \mathbf{h}' = \mathbf{c}$, with $\mathbf{c}$ a constant vector and $\mathbf{P}$ a positive definite matrix of constant values with respect to $x$. Solving this first-order equation gives straight line solutions for $\mathbf{h}(x)$. Because $\mathbf{h}(x_0) = 0$, there can be no conjugate points, and because $\mathbf{h}(x_1) = 0$, the only solution to (A.8) is the trivial one, and the extremum is a minimum.

A.2 Geodesics on the surface of a sphere

Taking the azimuthal angle $\phi$ as the independent variable, and polar angle $\theta(\phi)$ as the dependent variable, the arc-length on the surface of a unit sphere may be written as

$$D[\theta] = \int_{\phi_0}^{\phi_1} d\phi\, \sqrt{\theta'^2 + \sin^2\theta}. \tag{A.15}$$

The EL equations give the extremal trajectory as $\cos\theta = A \sin\theta \cos\phi + B \sin\theta \sin\phi$ with $A$, $B$ constants. This is the equation of a plane $z = Ax + By$, which intersects the surface of the sphere to make a great circle. The scalar $P = L_{\theta'\theta'} = \sin^2\theta / (\theta'^2 + \sin^2\theta)^{3/2}$, which is always positive. To simplify the problem, let $\phi_0 = 0$, and $\theta(\phi_0) = \theta(\phi_1) = \pi/2$, so the great circle lies in the $z = 0$ plane. Along this extremal $P$ is constant and equal to 1, while $Q = -1$. The second variation, eq. (A.11), is then $\frac{1}{2} \int_0^{\phi_1} d\phi\, (h'^2 - h^2)$. A nontrivial solution of the corresponding Jacobi equation, $h'' + h = 0$, that vanishes at $\phi = 0$ must not have another root in $(0, \phi_1]$. Every nontrivial solution to the Jacobi equation satisfying the initial condition $h(0) = 0$ has the form $h(\phi) = C \sin\phi$, $C \neq 0$, which reveals a conjugate point at $\phi = \pi$. Thus for the extremal curve to be minimal, $\phi_1$ must be $< \pi$, the location of the antipodal point on the sphere. If $\phi_1 < \pi$, there is no extremal solution for $h(\phi)$ other than the trivial one which satisfies the boundary conditions.
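The conjugate-point criterion used in sections A.1 and A.2 can be illustrated with a short numerical sketch. It assumes the scalar Jacobi equation with constant $P$ and $Q$; the function name, step size, and the scalar stand-in for the straight-line case ($P = 1$, $Q = 0$) are illustrative assumptions rather than part of the original analysis.

```python
import math


def first_conjugate_point(P, Q, t_max, dt=1e-4):
    """Integrate the scalar Jacobi equation -(P h')' + Q h = 0 for constant
    P and Q, starting from h(0) = 0, h'(0) = 1, and return the first t > 0
    at which h returns to zero (None if no root is found before t_max)."""

    def accel(h):
        return Q * h / P  # h'' = Q h / P when P is constant

    h, v, t = 0.0, 1.0, 0.0
    while t < t_max:
        # classical RK4 step for the first-order system (h, v = h')
        k1h, k1v = v, accel(h)
        k2h, k2v = v + 0.5 * dt * k1v, accel(h + 0.5 * dt * k1h)
        k3h, k3v = v + 0.5 * dt * k2v, accel(h + 0.5 * dt * k2h)
        k4h, k4v = v + dt * k3v, accel(h + dt * k3h)
        h_new = h + dt * (k1h + 2.0 * k2h + 2.0 * k3h + k4h) / 6.0
        v_new = v + dt * (k1v + 2.0 * k2v + 2.0 * k3v + k4v) / 6.0
        if h > 0.0 and h_new <= 0.0:
            # linear interpolation of the zero crossing inside this step
            return t + dt * h / (h - h_new)
        h, v, t = h_new, v_new, t + dt
    return None


# Great circle through the equator (section A.2): P = 1, Q = -1.
sphere = first_conjugate_point(P=1.0, Q=-1.0, t_max=4.0)
# Scalar analogue of the straight line (section A.1): Q = 0, so h grows linearly.
line = first_conjugate_point(P=1.0, Q=0.0, t_max=4.0)
print(sphere, line)
```

Under these assumptions the first root for the sphere lands at $\phi = \pi$ to within the integration error, while the straight-line case never returns to zero on the scanned interval.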
It is instructive to look at the arc-length under sinusoidal variations around the extremal path which satisfy the boundary conditions $h(0) = h(\phi_1) = 0$, so that $\theta(\phi) = \pi/2 + h(\phi) = \pi/2 + \epsilon \sin(\pi\phi/\phi_1)$. Inserting this into eq. (A.15) above and expanding to second order in $\epsilon$, we see that first order terms in $\epsilon$ vanish, and the difference in distance from the extremal path is $\Delta D = (\epsilon^2 / 4\phi_1)(\pi^2 - \phi_1^2)$. For $\phi_1 < \pi$ this is always greater than zero, compatible with the fact that the extremal is a minimum. Further analysis using a general perturbation scheme would be required for a general proof. For $\phi_1 > \pi$ this is always less than zero, indicating the extremal is a maximum with respect to these perturbations: the length may be shortened. When $\phi_1 = \pi$, $\Delta D = 0$ to second order. When $h(\phi)$ represents the difference between great circles, $\Delta D$ is precisely zero.

A.3 Harmonic oscillator

It is not widely appreciated that the classical action for a simple harmonic oscillator is not always a minimum, and indeed in many cases can be a maximum with respect to some perturbations. The action for a harmonic oscillator with given spring constant is proportional to $S[x] = \int_0^T dt\, \frac{1}{2} (\dot{x}^2 - x^2)$, which has EL equation $\ddot{x} + x = 0$. Taking the specific initial conditions $x(0) = 1$, $\dot{x}(0) = 0$, the extremal solution is $x(t) = \cos t$. The scalar $P(t) = L_{\dot{x}\dot{x}} = 1$, which is always positive and satisfies the necessary conditions for a minimum. The scalar $Q = L_{xx} - \frac{d}{dt} L_{x\dot{x}} = -1$. The second variation is $\delta^2 S[h] = \frac{1}{2} \int_0^T dt\, (\dot{h}^2 - h^2)$, which has Jacobi equation $\ddot{h} + h = 0$. This is the same Jacobi equation as that for geodesics on a sphere, so the sufficient conditions will parallel those above. The boundary condition $h(0) = 0$ gives $h(t) = A \sin t$, with conjugate points at $t = n\pi$, $n = 1, 2, \ldots$. This means that the action is a minimum only so long as $T < \pi$, i.e., a half-period.
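Because the oscillator shares the second variation $\frac{1}{2}\int (\dot{h}^2 - h^2)$ with the geodesic problem of section A.2, the sign change at $T = \pi$ can be checked by direct quadrature. The sketch below is illustrative only: it assumes a half-wave perturbation $\epsilon \sin(\pi t/T)$ that vanishes at both end points, and compares the numerical change in action with the closed form of section A.2 with $\phi_1$ replaced by $T$. The function names and the values of $T$ and $\epsilon$ are hypothetical.

```python
import math


def action(x, x_dot, T, n=20000):
    """Midpoint-rule estimate of S[x] = integral_0^T dt (x_dot^2 - x^2) / 2."""
    dt = T / n
    total = 0.0
    for k in range(n):
        t = (k + 0.5) * dt
        total += 0.5 * (x_dot(t) ** 2 - x(t) ** 2) * dt
    return total


def delta_action(T, eps):
    """S[cos t + eps sin(pi t / T)] - S[cos t]; the perturbation vanishes at 0 and T."""
    w = math.pi / T
    x = lambda t: math.cos(t) + eps * math.sin(w * t)
    x_dot = lambda t: -math.sin(t) + eps * w * math.cos(w * t)
    return action(x, x_dot, T) - action(math.cos, lambda t: -math.sin(t), T)


def closed_form(T, eps):
    """Second variation for the half-wave perturbation: (eps^2 / 4T)(pi^2 - T^2)."""
    return eps ** 2 / (4.0 * T) * (math.pi ** 2 - T ** 2)


for T in (2.0, 4.0):  # one end time below a half-period, one above
    print(T, delta_action(T, 0.1), closed_form(T, 0.1))
```

For the shorter interval the action increases under the perturbation, and for the longer one it decreases, in line with the conjugate point at $T = \pi$.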
If we let $x(t)$ be the extremal solution plus a half-wavelength sinusoidal perturbation that vanishes at both end points (and that satisfies the Jacobi equation when the end point is a conjugate point), $x(t) = \cos t + \epsilon \sin(\pi t/T)$, then the difference in action from the extremal path becomes $\Delta S = (\epsilon^2 / 4T)(\pi^2 - T^2)$. This result is exact because the action for the oscillator is quadratic (as opposed to the action for geodesics). Because the action is quadratic, the original EL equation and Jacobi's equation (A.8) are guaranteed to be identical; in such cases it is not particularly necessary to explicitly identify $P$ and $Q$. When $T < \pi$, $\Delta S > 0$, compatible with minimality, as in section A.2. When $T$ is larger than a half-period, $\Delta S < 0$ and the extremal trajectory is a maximum (with respect to half-wavelength sinusoidal perturbations), and when $T = \pi$, the end point is the conjugate point and $\Delta S = 0$.

Appendix B. Necessary conditions for straight line transformations

It was shown in section 2.3.1 that to have straight line transformations between links, it is sufficient to have facing obtuse angles on opposite sides of the quadrilateral defined by the transformation, as shown in figure 2.5A. We now show that it is a necessary condition as well, i.e., we show that a slide in the correct direction is not possible in the absence of obtuse angles.

Figure B.1: A link in 3D space. Without loss of generality assume that the link is initially along the z axis. The paths traveled by the link ends are shown in the figure. Note that the end point trajectories of A and B are in 3D space so the paths traveled by A and B need not cross or lie in the same plane.

Let the unit vector along A's path be $\hat{\mathbf{v}}_A$ and the unit vector along B's path be $\hat{\mathbf{v}}_B$. Because the angles that the path of A and the path of B make with the link are acute, the z-component of $\hat{\mathbf{v}}_B$ ($\equiv z_B$) is negative and the z-component of $\hat{\mathbf{v}}_A$ ($z_A$)
One can write v ˆ A = ρA + zA z ˆ v ˆ B = ρB + zB z ˆ v where ρA and ρB are vectors in xy plane and zA > 0 and zB < 0. 158 Appendix B. Necessary conditions for straight line transformations Let rA (t) and rB (t) denote the positions of the A and B ends at time t: rA = tˆ vA ˆ rB = g(t)ˆ vB + z The rigid link constraint dictates that (rA − rB ) · (rA − rB ) = 1 which translates to: g 2 + 2g (zB − t (c + zA zB )) − 2tzA + t2 + 1 = 1 with c = ρA · ρB . Solving for g as a function of t, keeping in mind that g(0) = 0: √ g(t) = − (zB − t (c + zA zB )) + (zB − t (c + zA zB ))2 − t2 + 2tzA . Now if g (t) > 0 it means that the B-end of the link is travelling in the assumed direction, and if g (t) < 0 it means that B-end is travelling in the opposite direction (which means that the angle is not acute anymore). Writing g (0) we get: g (0) = −zA 2 zB c + 2 zA zB2 − 2 zA + c + zA zB = <0. 2 |zB | |zB | Thus point B can only travel in the opposite direction from what was assumed, which in turn means an all-acute slide is not possible. We conclude that the condition of “facing obtuse angles” is necessary and sufficient for transformations consisting only of pure translations. 159 Appendix C Critical angles The concept of critical angle was first introduced in 2.3.2. In order for a straight-line slide of both ends to be possible, at some stage during the transformation the link needs to rotate about one of the ends, with the other end being stationary. In principle the rotation can be about either of the two ends and it can happen at the beginning or the end of the transformation. The conditions on the critical angle or orientation can be readily derived from the broken extremal conditions. It was seen from 2.18a and 2.19, the non-trivial corner conditions read: ˆ i |+ = v ˆ i |− . v (C.1) We know that the path traveled by the moving bead during the rotation is circular and the path that is traveled during the slide part is a straight line. 
The broken extremal condition forces these two paths to be patched smoothly, which means that the straight-line path should be tangent to the circle. In the 3D case, for the broken extremal condition to be satisfied, the straight line slide path and the circular rotation path should lie in the same plane. For example in figure 2.7, where B is rotating about A initially to $B_1$ and then slides to $B'$, the rotation has to be in the plane formed by the three points $A B B'$. Matching the directions of velocity as in (C.1) does not itself mean that a link can subsequently slide in a straight line; however, at the tangent point the tangent line to the circle is perpendicular to the radius, hence one satisfies this second condition as well.

Below we derive an analytical expression for the critical angle for a particular case of the single link problem, as an example and illustration of the discussed concepts. Furthermore the particular example will be used later in Appendix D to introduce minimal transformations in 2 dimensions.

Consider the single link action with the particular parametrization $s = s(\theta)$, as discussed in section 2.3.2:

$$\int \left( \sqrt{\dot{s}^2 + 1 + 2 \dot{s} \cos\theta} + \sqrt{\dot{s}^2} \right) d\theta, \tag{C.2}$$

where $s \equiv \overrightarrow{A(\theta) A}$ is the (signed) distance of the A-end from its initial position, and $\theta$ is the angle between the link and the horizontal line (see figure C.1).

Figure C.1: Transformation in which both ends stay on a linear track.

The Euler-Lagrange equation of motion reads:

$$\frac{d}{d\theta} \left( \frac{\dot{s}}{\sqrt{\dot{s}^2}} + \frac{\dot{s} + \cos\theta}{\sqrt{\dot{s}^2 + 1 + 2 \dot{s} \cos\theta}} \right) = 0 \tag{C.3}$$

We consider a transformation which is not (necessarily) a minimum:

$$s = a \cos\theta - \sin\theta + b \tag{C.4}$$

with $a$ and $b$ parameters to be determined. Such a transformation in fact forces the two ends to travel on a straight line (right from the beginning), but the A side may in fact retreat and then move forward. We call such a transformation a "hyperextended transformation". A sample transformation of this kind is shown in figure C.1.
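That the form (C.4) satisfies (C.3) can be checked numerically: the bracketed quantity in (C.3) should be constant along the trajectory wherever $\dot{s}$ keeps its sign, and it jumps only at the angle where $\dot{s} = 0$. The following is a minimal sketch under that reading; the value $a = -2$ and the sample angles are hypothetical, chosen only so that the zero of $\dot{s}$ falls inside $(0, \pi/2)$.

```python
import math


def s_dot(theta, a):
    """Derivative of s = a cos(theta) - sin(theta) + b with respect to theta."""
    return -a * math.sin(theta) - math.cos(theta)


def el_first_integral(theta, a):
    """Bracketed quantity in (C.3), whose theta-derivative must vanish."""
    sd = s_dot(theta, a)
    radicand = sd * sd + 1.0 + 2.0 * sd * math.cos(theta)
    return sd / math.sqrt(sd * sd) + (sd + math.cos(theta)) / math.sqrt(radicand)


a = -2.0                       # hypothetical value of the parameter a in (C.4)
theta_c = math.atan2(1.0, -a)  # angle where s_dot = 0, i.e. tan(theta_c) = -1/a

before = [el_first_integral(th, a) for th in (0.1, 0.2, 0.3)]  # A retreating
after = [el_first_integral(th, a) for th in (0.6, 0.9, 1.2)]   # A advancing
print(theta_c, before, after)
```

The value is the same at every sampled angle on a given side of `theta_c`, and the two sides differ by 2 because $\dot{s}/\sqrt{\dot{s}^2}$ changes sign there.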
The parameters $a$ and $b$ in (C.4) can be tuned to meet the boundary conditions (see below). In fact it is seen that point A on the link retreats backwards until it reaches some critical angle, which is when link AB makes an angle $\pi/2$ with the straight line $BB'$ that point B travels on. Subsequently A moves forward towards $A'$.

Assume that $\theta$ runs from $\theta_1$ to $\theta_2$ and, for simplicity, that both these angles are between 0 and $\pi/2$. The boundary conditions dictate that:

$$s(\theta_1) = 0 \tag{C.5}$$

$$s(\theta_2) = l \tag{C.6}$$

where $l$ is the distance between $A$ and $A'$. The parameters $a$ and $b$ can be explicitly solved to give:

$$a = \frac{-\sin\theta_2 + \sin\theta_1 - l}{\cos\theta_1 - \cos\theta_2} \tag{C.7}$$

$$b = -\frac{\cos\theta_1 (-\sin\theta_2 - l) + \sin\theta_1 \cos\theta_2}{\cos\theta_1 - \cos\theta_2} \tag{C.8}$$

For our purposes we only need to note that the critical angle occurs when $\dot{s} \equiv ds/d\theta$ becomes zero, that is when A stops going backward and starts moving forward:

$$\dot{s} = -a \sin\theta - \cos\theta = 0 \tag{C.9}$$

where $a$ is given in (C.7). We can now ask what $\theta_1$ should be so that there is no need for the link to go backward, i.e., it moves forward from the beginning and the transformation is monotonic. Setting $\theta_1 = \theta$ in equations (C.9) and (C.7) gives:

$$\cos\theta + \frac{-\sin\theta_2 + \sin\theta - l}{\cos\theta - \cos\theta_2} \sin\theta = 0 \tag{C.10}$$

For pedagogical reasons we prove condition (C.10) using analytic geometry as well. Looking at figure C.2 (where $a$ now denotes a length in the figure rather than the parameter of (C.4)) we have the following:

$$g^2 + l_1^2 = 1 \tag{C.11}$$

$$g^2 + l_2^2 = a^2 \tag{C.12}$$

$$\frac{g}{a} = \frac{1}{l + l_1 + l_2} \tag{C.13}$$

We can solve $g = \sqrt{1 - l_1^2}$ and $a = \sqrt{1 - l_1^2 + l_2^2}$ from the first two equations and substitute in the third equation to give:

$$l = \frac{\sqrt{1 - l_1^2 + l_2^2}}{\sqrt{1 - l_1^2}} - l_1 - l_2 \tag{C.14}$$

Figure C.2: Geometric proof for critical angle condition.

On the other hand, based on our results for $g$ and $a$ we have:

$$\sin\theta_1 = \frac{\sqrt{1 - l_1^2}}{\sqrt{1 - l_1^2 + l_2^2}} \tag{C.15}$$

$$\cos\theta_1 = \frac{l_2}{\sqrt{1 - l_1^2 + l_2^2}} \tag{C.16}$$

$$\sin\theta_2 = l_1 \tag{C.17}$$

$$\cos\theta_2 = \sqrt{1 - l_1^2} \tag{C.18}$$

Substitution of eqns. (C.15)-(C.18) in equation (C.10) gives equation (C.14) after some simplification.
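The "some simplification" can be spot-checked numerically. The sketch below is illustrative: it picks hypothetical lengths $l_1$ and $l_2$, computes $l$ from (C.14) and the angles from (C.15)-(C.18), and evaluates the left-hand side of (C.10) with $a$ taken from (C.7). The function and variable names are assumptions introduced here.

```python
import math


def critical_angle_residual(l1, l2):
    """Left-hand side of (C.10) evaluated with l from (C.14) and the angles
    from (C.15)-(C.18); it should vanish if the two derivations agree."""
    g = math.sqrt(1.0 - l1 ** 2)                   # from (C.11)
    a_len = math.sqrt(1.0 - l1 ** 2 + l2 ** 2)     # length a in figure C.2, from (C.12)
    l = a_len / g - l1 - l2                        # (C.14)
    theta1 = math.atan2(g / a_len, l2 / a_len)     # (C.15), (C.16)
    theta2 = math.atan2(l1, g)                     # (C.17), (C.18)
    a_par = (-math.sin(theta2) + math.sin(theta1) - l) / (
        math.cos(theta1) - math.cos(theta2)
    )                                              # parameter a of (C.4), from (C.7)
    return math.cos(theta1) + a_par * math.sin(theta1)


# Hypothetical segment lengths with 0 < l1 < 1, chosen only for illustration.
for l1, l2 in ((0.3, 0.5), (0.1, 0.8)):
    print(l1, l2, critical_angle_residual(l1, l2))
```

The residual is zero to rounding error for both sample geometries, which is consistent with (C.10) and (C.14) being the same condition.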
For the particular case that we have discussed, the proposed transformation is in fact a minimal solution if $\theta_1$ is greater than the critical angle, because in that case a simple slide would be possible. If $\theta_1$ is less than the critical angle, a locally minimal solution, as we know, is pure rotation to the critical angle followed by a straight line slide. Pure rotation has a nice geometric interpretation in our parametrization. It corresponds to the null solution $s = 0$. Since at the critical angle $\dot{s} = 0$, we see that $s = 0$ will be smoothly patched with $s = a \cos\theta - \sin\theta + b$, as mandated by the corner conditions in equation (2.18a).

Figure C.3: A minimal transformation in the $s(\theta)$ parametrization. The horizontal segment corresponds to pure rotation and the curved section corresponds to a slide on straight paths. Here the corner conditions demand that the derivative $\dot{s}$ be continuous at the critical angle.

Appendix D. Minimal transformations in 2 dimensions

It was seen in section 2.4.1 that for the case of two links, when one is confined to moving in a plane, satisfying the constant link length constraints and corner conditions does not seem to lead to solutions which are extremal. However, given the additional constraint that the links must lie in a plane, there must be one or a set of minimal transformations. We need to look at other forms of transformations, namely compound straight line transformations. We will elaborate on the idea starting with single links.

The hyperextended solution that was discussed previously in Appendix C can be considered as a very special example of a compound straight line transformation. These are transformations that are made strictly from straight line paths with no pure rotation. A more general transformation is shown in figure D.1 beside the old transformation.
Note that the corners do not technically violate the corner conditions, because the speed of bead "A" is zero at the corner point in any parametrization that can simultaneously describe the motion of A and the motion of B: since at the corner point the link makes an angle of 90 degrees with the path that B travels, the speed of B at the critical angle is infinitely larger than the speed of A. In fact one sees that we have an instantaneous pure rotation about the A-bead when it is at the corner point. $\hat{\mathbf{v}}_A$ is not clearly defined at the corners, and everywhere else (when the speed of the bead(s) is not zero) the two beads are travelling on a straight line.

The two solutions depicted in the figure come from two different parametrizations of the most general form of the action and result in different distances. But each of them is a local minimum once the direction of $\overrightarrow{AA''}$ is picked, and these local minima have different values for the distance.

Figure D.1: The previous hyperextended solution is shown along with a more general compound straight-line transformation, where $\overrightarrow{AA''}$ travels in some general direction. The length of each line segment is written beside it. For the hyperextended solution the value of $AA''$ is multiplied by two because the path is traveled twice.

We can then ask about the best position to put the corner point, to minimize the distance traveled in the compound straight line transformation, with respect to other compound straight line transformations. We assume the corner occurs on one side and we take it to be the "A" side. Note that at the corner, the link makes a right angle with the B-bead
path $BB'$, meaning that the distance from the corner point to the B path is always the length of the link, i.e., unity. Also note that the total distance that the "A"-bead travels is the distance from the initial point $A$ to the corner point $A''$, plus the distance from $A''$ to the final position $A'$. The locus of points with equal sum of distances from two points $A$ and $A'$ defines an ellipse with foci at $A$ and $A'$. Moreover the length of the major axis of the ellipse equals the sum of the distances from the foci. Thus the smaller the major axis of the ellipse with foci $A$ and $A'$, the smaller the total distance traveled by the "A"-bead. Moreover $A''$ should sit on a line parallel to the B-path at a distance of 1 from the B-path line $BB'$. So in seeking the shortest distance traveled by the A end of the link, we seek the point $A''$ such that it lies on an ellipse with foci $A$ and $A'$, the ellipse shares at least one point with a line parallel to $BB'$ and distance 1 away from it, and lastly that the ellipse has the smallest possible major axis. So the ellipse giving the minimal distance is tangent to the parallel line, and $A''$ is the tangent point. This is illustrated in figure D.2.

Figure D.2: Optimal compound straight line transformation.

Figure D.3: An optimal compound straight-line solution for 2 links. For this particular class of solutions, the problem is divided into two disjoint problems (one for each link) and solved separately.

This solution can be straightforwardly extended to 2 links, as depicted in figure D.3. Consider then the example in figure 2.15a, where the links are no longer allowed to move out of the plane (see figure D.4). Here $\mathbf{r}_A = \mathbf{r}_{A'}$ and $\mathbf{r}_C = \mathbf{r}_{C'}$ and the above ellipses turn into circles centered at $A$ and $C$.
The circles have radii $1 - 1/\sqrt{2}$, so that the perpendicular distance from line $BB'$ to the farthest point on the circle is 1 and a fully extended intermediate state is allowed.

Figure D.4: Minimal transformation restricted to 2 dimensions, for 2 links of opposite convexity which form opposite sides of a square.

Appendix E. Extremal trajectories of beads or links subject to steric excluded volume

Finding the extremal trajectories of beads or links subject to steric excluded volume is a variational problem in the presence of an inequality constraint. A bead can be outside a given region but not inside it, or must travel from point A to point B while avoiding an intervening volume.

E.1 Point particle

Variational problems subject to inequality constraints arose historically in the theory of optimal control [15, 50, 51, 116]. In our context we illustrate the idea with a simple example of a point particle moving from A to B but with the constraint that the point and resulting trajectory must lie outside an infinite cylinder of radius $a$, i.e. outside the region $r \leq a$ in Fig. E.1. The distance traveled by the point is written as

$$D[\mathbf{r}] = \int_0^T dt\, F(\dot{\mathbf{r}}, \lambda, \epsilon), \tag{E.1a}$$

where

$$F(\dot{\mathbf{r}}, \lambda, \epsilon) = \sqrt{\dot{\mathbf{r}}^2} + \lambda \left( a - |\mathbf{r}| + \epsilon^2 \right) \tag{E.1b}$$

The second term in the integrand embodies the inequality constraint $a - |\mathbf{r}| \leq 0$. The value $\lambda$ is the Lagrange multiplier enforcing the constraint, and the quantity $\epsilon^2$ may be thought of as an "excess parameter" whose significance will soon become clear.

Let a vector $X = (\mathbf{r}, \lambda, \epsilon)$ represent all the unknowns in the problem. The Euler-Lagrange (EL) equations are then

$$\frac{d}{dt} F_{\dot{X}} = F_X, \tag{E.2}$$

with the convention $F_X \equiv \partial F/\partial X$. The EL equations are

$$a - r + \epsilon^2 = 0, \tag{E.3a}$$

$$\lambda \epsilon = 0, \tag{E.3b}$$

$$\dot{\hat{\mathbf{v}}} = -\lambda \hat{\mathbf{r}}. \tag{E.3c}$$

In addition to the EL equations, transversality or corner conditions must hold for the trajectory to be extremal [46]. These demand that

$$F_{\dot{\mathbf{r}}}(t_-) = F_{\dot{\mathbf{r}}}(t_+) \tag{E.4a}$$

and

$$\left( F - \dot{\mathbf{r}} \cdot F_{\dot{\mathbf{r}}} \right) \big|_{t_-} = \left( F - \dot{\mathbf{r}} \cdot F_{\dot{\mathbf{r}}} \right) \big|_{t_+} \tag{E.4b}$$

where $t_\pm = \lim_{\delta \to 0} (t \pm \delta)$. In this parameterization ($\mathbf{r}$ in terms of time), Eq. E.4b gives no new information, and Eq. E.4a demands that

$$\hat{\mathbf{v}}(t_-) = \hat{\mathbf{v}}(t_+). \tag{E.5}$$

To solve these equations, first note that from Eq. E.3a, if $r > a$, the excess parameter is $\epsilon^2 > 0$. Then from Eq. E.3b, the Lagrange multiplier is $\lambda = 0$. Then from Eq. E.3c, $\dot{\hat{\mathbf{v}}} = 0$ and the particle moves in a straight line until a point where it touches the cylinder. Equation E.5 demands that the straight line must be tangent to the cylinder, otherwise we would have a corner at that point. Once on the cylinder, $r = a$ and so $\epsilon^2 = 0$. The quantity $\dot{\hat{\mathbf{v}}}$ is determined kinematically by the trajectory which follows the boundary condition, here the surface of the cylinder at $r = a$. This then determines $\lambda(t) = |\dot{\hat{\mathbf{v}}}|$. This gives the piecewise trajectory in Fig. E.1 a. Both positive and negative solutions are shown. For this extremal trajectory, the Lagrange multiplier and excess parameter can be found straightforwardly, for example as functions of $x$ (Fig. E.1 b). In particular $\lambda = 1/y(x)$ on the cylinder, zero otherwise.

Figure E.1: (a) Extremal trajectories for an inequality constraint problem. In this case, a path that is a minimal distance from point A at $(x_A, y_A) = (-1.5, 0)$ to point B at $(x_B, y_B) = (+1.5, 0)$ is sought subject to the constraint that the path must remain outside a circle of unit radius. Both positive and negative solutions are shown. (b) Lagrange multiplier $\lambda$ and excess parameter $\epsilon$ for the above problem. If $\epsilon \neq 0$, $\lambda = 0$, and if $\epsilon = 0$, $\lambda \neq 0$.

If the obstructing object is no longer a cylinder of circular cross section, but we compress the x axis of the cylinder so that its cross section is an ellipse, then in the limit that the minor axis (the x axis of the ellipse) $\to 0$, the obstructing object becomes a flat strip (or line in cross section).
Then the extremal trajectory consists of two straight-line pieces with an apparent corner between them, due to the discontinuity at the surface of the excluded boundary.

E.2 One link

The above solution can be generalized to the case of a single link undergoing a transformation from one side of a sphere to the other side. For the initial conditions in Fig. E.2 a, the solution consists of one bead on the link moving in straight-line motion, and the other following a piecewise trajectory consisting of straight-line motion, a great circle geodesic, and finally straight-line motion again.

When one axis of the sphere is compressed so that the sphere becomes a disk, the minimal-distance solution acquires a discontinuity or cusp (Fig. E.2 b). This means that minimal-distance transformations can violate corner conditions if the inequality constraints are themselves discontinuous or, more precisely, nonsmooth. The extremal transformation of the link AB in Fig. E.2 b involves a straight-line translation of $A$ to $A_1$, while point $B$ translates to $B_L$. Then point $B$ rotates to point $B_1$ on the surface of the disk, where it experiences a corner as per the above discussion. It subsequently rotates again to $B_R$, then $A_1$ and $B_R$ translate together in straight lines to points $A'$ and $B'$, respectively.

As another example, consider the initial conditions in Fig. E.2 c, which involves the problem of one link transforming in the presence of an infinite strip. This situation has applications to the problem of chain non-crossing discussed in the text. The minimal transformation consists of two piecewise rotations of $B$ with a corner between them, at position $B_c$.

Figure E.2: (a) Extremal trajectory for a one-link transformation subject to inequality constraints. The link moves from configuration $AB$ to $A'B'$ in the presence of an obstructing sphere. The link length AB is conserved during this process.
The distance traveled by the end-points A and B of the link is minimized by the transformation shown, which involves straight-line motion of $A$ to $A'$, and straight-line motion of $B$ along a trajectory tangent to the sphere. Point $B$ traces out an arc of a great circle on the surface of the sphere before continuing to $B'$ on another trajectory tangent to the sphere. (b) When the sphere in panel a is compressed to form a two-dimensional disk of the same radius, the minimal transformation takes the form shown, with a discontinuity in the trajectory of $B$ at point $B_1$. Moreover, the piecewise solution must still retain rotations and is not purely piecewise straight lines. (c) Transformation from $AB$ to $AB'$, in the presence of an intervening infinite strip. The minimal transformation consists of two piecewise rotations with a corner violation between them: the link rotates from $B$ to $B_c$, then from $B_c$ to $B'$.

Appendix F. Cross correlation of order parameters

Cross correlations of the order parameters used in chapters 5 and 6 with each other, for various classifications of proteins, are shown below. Each cell lists the correlation coefficient followed by its p-value.

| | INX | LRO | RCO | ACO | MRSD | RMSD | Dnx | Dnx/N | D | D/N | N |
|---|---|---|---|---|---|---|---|---|---|---|---|
| INX | --- | (0.437,2.18e-03) | (0.413,3.78e-03) | (0.273,0.055) | (0.140,0.327) | (0.133,0.350) | (0.487,6.50e-04) | 0.733,2.78e-07 | (0.060,0.674) | (0.207,0.148) | (0.00e+00,1.000) |
| LRO | 0.650,4.34e-04 | --- | 0.718,4.91e-07 | 0.591,3.46e-05 | (0.337,0.018) | (0.357,0.012) | 0.551,1.13e-04 | 0.518,2.88e-04 | (0.284,0.047) | (0.377,8.20e-03) | (0.212,0.138) |
| RCO | 0.552,4.20e-03 | 0.853,6.25e-08 | --- | 0.513,3.22e-04 | (0.220,0.123) | (0.253,0.076) | (0.487,6.50e-04) | 0.533,1.86e-04 | (0.153,0.283) | (0.273,0.055) | (0.081,0.573) |
| ACO | (0.470,0.018) | 0.830,2.79e-07 | 0.786,3.26e-06 | --- | 0.667,3.00e-06 | 0.700,9.36e-07 | 0.693,1.19e-06 | 0.513,3.22e-04 | 0.627,1.13e-05 | 0.707,7.37e-07 | 0.570,6.41e-05 |
| MRSD | (0.454,0.023) | 0.629,7.60e-04 | (0.455,0.022) | 0.867,2.09e-08 | --- | 0.967,1.26e-11 | 0.640,7.32e-06 | (0.407,4.38e-03) | 0.880,7.02e-10 | 0.933,6.18e-11 | 0.799,2.19e-08 |
| nRMSD | (0.446,0.025) | 0.642,5.41e-04 | (0.473,0.017) | 0.880,6.77e-09 | 0.998,0.00e+00 | --- | 0.647,5.87e-06 | (0.400,5.07e-03) | 0.873,9.42e-10 | 0.927,8.43e-11 | 0.792,2.87e-08 |
| Dnx | 0.669,2.55e-04 | 0.707,7.73e-05 | 0.546,4.73e-03 | 0.871,1.48e-08 | 0.909,3.27e-10 | 0.911,2.49e-10 | --- | 0.753,1.30e-07 | 0.573,5.89e-05 | 0.693,1.19e-06 | 0.503,4.21e-04 |
| Dnx/N | 0.918,1.07e-10 | 0.756,1.25e-05 | 0.636,6.36e-04 | 0.737,2.60e-05 | 0.737,2.61e-05 | 0.734,3.01e-05 | 0.904,6.18e-10 | --- | (0.327,0.022) | (0.473,9.12e-04) | (0.255,0.074) |
| D | (0.299,0.146) | 0.514,8.55e-03 | (0.308,0.134) | 0.817,6.22e-07 | 0.963,1.33e-14 | 0.967,4.22e-15 | 0.877,9.11e-09 | 0.621,9.27e-04 | --- | 0.840,3.97e-09 | 0.919,1.18e-10 |
| D/N | 0.534,5.93e-03 | 0.667,2.68e-04 | (0.495,0.012) | 0.877,9.00e-09 | 0.995,0.00e+00 | 0.993,0.00e+00 | 0.938,4.67e-12 | 0.799,1.70e-06 | 0.946,9.69e-13 | --- | 0.758,1.07e-07 |
| N | (0.213,0.306) | (0.425,0.034) | (0.160,0.444) | 0.712,6.56e-05 | 0.917,1.22e-10 | 0.921,7.08e-11 | 0.781,4.09e-06 | 0.512,8.88e-03 | 0.973,4.44e-16 | 0.889,2.81e-09 | --- |

Table F.1: Two-state proteins: correlation between various order parameters. The upper triangle (the elements above the dash cell in each column) contains the Kendall correlation coefficient and the corresponding p-value, and the lower triangle contains the Pearson correlation coefficient and the corresponding p-value.
Cross correlation of order parameters INX LRO RCO ACO MRSD RMSD Dnx Dnx /N D D /N N 175 INX LRO RCO ACO MRSD RMSD Dnx Dnx /N D D /N N INX ——— 0.714,6.13e-03 (-4.88e-02,0.874) (0.591,0.033) 0.741,3.75e-03 0.721,5.38e-03 0.868,1.19e-04 0.947,9.03e-07 0.750,3.16e-03 0.781,1.60e-03 0.733,4.36e-03 Dnx 0.615,3.41e-03 (0.333,0.113) (-5.13e-02,0.807) (0.487,0.020) 0.769,2.52e-04 0.795,1.55e-04 ——— 0.966,8.34e-08 0.959,2.42e-07 0.920,8.61e-06 0.934,2.93e-06 LRO (0.359,0.088) ——— (0.612,0.026) 0.703,7.31e-03 (0.351,0.240) (0.356,0.233) (0.408,0.166) (0.577,0.039) (0.260,0.390) (0.389,0.189) (0.204,0.505) Dnx /N 0.769,2.52e-04 (0.333,0.113) (-5.13e-02,0.807) (0.436,0.038) 0.667,1.51e-03 0.641,2.29e-03 0.846,5.66e-05 ——— 0.899,2.97e-05 0.924,6.46e-06 0.857,1.83e-04 RCO (-2.56e-02,0.903) 0.615,3.41e-03 ——— (0.516,0.071) (-1.44e-01,0.639) (-1.04e-01,0.736) (-3.00e-01,0.320) (-1.20e-01,0.696) (-3.54e-01,0.235) (-1.42e-01,0.643) (-4.61e-01,0.113) D (0.513,0.015) (0.231,0.272) (-5.13e-02,0.807) (0.487,0.020) 0.821,9.44e-05 0.846,5.66e-05 0.897,1.95e-05 0.744,4.02e-04 ——— 0.959,2.17e-07 0.983,1.78e-09 ACO (0.359,0.088) (0.538,0.010) (0.462,0.028) ——— 0.717,5.78e-03 0.743,3.63e-03 (0.588,0.034) (0.677,0.011) (0.586,0.035) 0.721,5.44e-03 (0.485,0.093) D /N (0.436,0.038) (0.256,0.222) (0.026,0.903) 0.564,7.27e-03 1.000,1.95e-06 0.974,3.54e-06 0.769,2.52e-04 0.667,1.51e-03 0.821,9.44e-05 ——— 0.909,1.64e-05 MRSD (0.436,0.038) (0.256,0.222) (0.026,0.903) 0.564,7.27e-03 ——— 0.997,8.57e-14 0.898,3.13e-05 0.897,3.29e-05 0.955,3.70e-07 0.998,2.38e-14 0.905,2.16e-05 N (0.503,0.017) (0.219,0.297) (-9.03e-02,0.667) (0.452,0.032) 0.735,4.65e-04 0.761,2.91e-04 0.890,2.27e-05 0.735,4.65e-04 0.916,1.30e-05 0.735,4.65e-04 ——— RMSD (0.410,0.051) (0.282,0.180) (0.051,0.807) 0.590,5.01e-03 0.974,3.54e-06 ——— 0.884,6.11e-05 0.885,5.89e-05 0.944,1.29e-06 0.994,8.54e-12 0.884,5.96e-05 Table F.2: Three-state proteins: correlation between various order parameters. 
The upper triangle matrix (containing elements above the dash cell in each column) contains Kendall correlation coefficient and the corresponding p-value, and the lower triangle portion contains Pearson corr. coefficient and the corresponding p-value. Appendix F. Cross correlation of order parameters INX LRO RCO ACO MRSD RMSD Dnx Dnx /N D D /N N 176 INX LRO RCO ACO MRSD RMSD Dnx Dnx /N D D /N N INX ——— (0.372,0.259) (-4.71e-01,0.144) (0.381,0.248) (0.586,0.058) (0.560,0.073) (0.600,0.051) 0.852,8.71e-04 (0.510,0.109) (0.616,0.044) (0.535,0.090) Dnx 0.673,3.97e-03 (0.587,0.012) (-1.27e-01,0.586) 0.673,3.97e-03 0.818,4.60e-04 0.745,1.41e-03 ——— 0.925,4.64e-05 0.983,5.84e-08 0.936,2.31e-05 0.948,9.44e-06 LRO (0.330,0.157) ——— (0.208,0.538) 0.753,7.53e-03 (0.600,0.051) (0.599,0.051) (0.425,0.193) (0.494,0.123) (0.490,0.126) (0.595,0.053) (0.578,0.062) Dnx /N 0.818,4.60e-04 (0.440,0.059) (-2.00e-01,0.392) (0.527,0.024) 0.673,3.97e-03 (0.600,0.010) 0.855,2.53e-04 ——— 0.882,3.24e-04 0.915,7.76e-05 0.881,3.36e-04 RCO (-3.09e-01,0.186) (0.183,0.432) ——— (0.172,0.613) (-2.04e-01,0.547) (-1.94e-01,0.567) (-3.60e-01,0.277) (-4.22e-01,0.196) (-2.78e-01,0.409) (-2.26e-01,0.504) (-2.65e-01,0.431) D (0.491,0.036) 0.624,7.56e-03 (-9.09e-02,0.697) 0.709,2.40e-03 1.000,1.85e-05 0.927,7.18e-05 0.818,4.60e-04 0.673,3.97e-03 ——— 0.965,1.58e-06 0.982,7.80e-08 ACO (0.345,0.139) 0.624,7.56e-03 (0.200,0.392) ——— 0.910,9.88e-05 0.918,6.68e-05 0.786,4.12e-03 (0.723,0.012) 0.858,7.22e-04 0.901,1.53e-04 0.897,1.80e-04 D /N (0.527,0.024) (0.587,0.012) (-1.27e-01,0.586) 0.673,3.97e-03 0.964,3.69e-05 0.891,1.36e-04 0.855,2.53e-04 0.709,2.40e-03 0.964,3.69e-05 ——— 0.983,5.91e-08 MRSD (0.491,0.036) 0.624,7.56e-03 (-9.09e-02,0.697) 0.709,2.40e-03 ——— 0.999,7.84e-13 0.928,3.77e-05 0.898,1.72e-04 0.964,1.72e-06 0.999,7.26e-14 0.984,4.42e-08 N (0.587,0.012) 0.611,8.88e-03 (-1.10e-01,0.637) 0.697,2.83e-03 0.917,8.55e-05 0.844,3.01e-04 0.844,3.01e-04 0.697,2.83e-03 0.917,8.55e-05 0.881,1.62e-04 ——— RMSD 
(0.418,0.073) (0.550,0.018) (-1.82e-02,0.938) 0.782,8.15e-04 0.927,7.18e-05 ——— 0.927,3.96e-05 0.886,2.80e-04 0.967,1.17e-06 0.997,3.38e-11 0.988,1.14e-08 177 Table F.3: α-helix dominated proteins (both 2- and 3- state): Correlation between various order parameters. The upper triangle matrix (containing elements above the dash cell in each column) contains Kendall correlation coefficient and the corresponding p-value, and the lower triangle portion contains Pearson corr. coefficient and the corresponding p-value. Appendix F. Cross correlation of order parameters INX LRO RCO ACO MRSD RMSD Dnx Dnx /N D D /N N INX LRO RCO ACO MRSD RMSD Dnx Dnx /N D D /N N INX ——— (0.035,0.904) (0.378,0.183) (-1.86e-01,0.524) (-2.91e-01,0.313) (-3.09e-01,0.283) (-9.97e-02,0.734) (0.328,0.252) (-2.46e-01,0.396) (-2.44e-01,0.400) (-2.80e-01,0.332) Dnx (-9.89e-02,0.622) (0.077,0.702) (-1.21e-01,0.547) 0.648,1.24e-03 0.758,1.58e-04 0.714,3.73e-04 ——— 0.900,1.14e-05 0.988,3.69e-11 0.986,1.20e-10 0.981,6.64e-10 LRO (0.165,0.412) ——— 0.676,7.92e-03 (0.342,0.231) (-1.84e-01,0.529) (-1.90e-01,0.516) (-2.13e-01,0.465) (-1.35e-01,0.644) (-2.30e-01,0.429) (-1.83e-01,0.532) (-1.69e-01,0.564) Dnx /N (0.187,0.352) (-7.69e-02,0.702) (-9.89e-02,0.622) (0.363,0.071) (0.516,0.010) (0.429,0.033) 0.714,3.73e-04 ——— 0.824,2.90e-04 0.835,2.04e-04 0.805,5.11e-04 RCO (0.363,0.071) 0.626,1.81e-03 ——— (-2.82e-01,0.328) (-7.61e-01,1.57e-03) (-7.52e-01,1.92e-03) (-7.26e-01,3.26e-03) (-5.09e-01,0.063) (-7.68e-01,1.33e-03) (-7.52e-01,1.94e-03) (-7.43e-01,2.32e-03) D (-3.63e-01,0.071) (-1.10e-02,0.956) (-2.97e-01,0.139) 0.780,1.02e-04 0.934,3.27e-06 0.934,3.27e-06 0.736,2.45e-04 (0.451,0.025) ——— 0.994,6.02e-13 0.996,2.98e-14 ACO (-2.75e-01,0.171) (0.165,0.412) (-7.69e-02,0.702) ——— 0.826,2.75e-04 0.830,2.37e-04 0.812,4.18e-04 0.707,4.68e-03 0.812,4.24e-04 0.828,2.55e-04 0.844,1.47e-04 D /N (-2.97e-01,0.139) (-3.30e-02,0.870) (-3.63e-01,0.071) 0.714,3.73e-04 1.000,6.30e-07 0.912,5.52e-06 0.758,1.58e-04 (0.516,0.010) 
0.934,3.27e-06 ——— 0.993,2.25e-12 MRSD (-2.97e-01,0.139) (-3.30e-02,0.870) (-3.63e-01,0.071) 0.714,3.73e-04 ——— 0.998,1.55e-15 0.978,1.73e-09 0.807,4.89e-04 0.993,1.16e-12 0.999,0.00e+00 0.994,1.05e-12 N (-3.76e-01,0.061) (-2.21e-02,0.912) (-2.87e-01,0.152) 0.796,7.39e-05 0.928,3.76e-06 0.950,2.20e-06 0.729,2.80e-04 (0.442,0.028) 0.994,7.26e-07 0.928,3.76e-06 ——— RMSD (-3.85e-01,0.055) (-7.69e-02,0.702) (-3.19e-01,0.112) 0.758,1.58e-04 0.912,5.52e-06 ——— 0.972,5.98e-09 0.794,7.03e-04 0.991,7.58e-12 0.996,9.30e-14 0.992,2.89e-12 178 Table F.4: β-sheet dominated proteins (both 2- and 3- state): Correlation between various order parameters. The upper triangle matrix (containing elements above the dash cell in each column) contains Kendall correlation coefficient and the corresponding p-value, and the lower triangle portion contains Pearson corr. coefficient and the corresponding p-value. Appendix F. Cross correlation of order parameters INX LRO RCO ACO MRSD RMSD Dnx Dnx /N D D /N N INX LRO RCO ACO MRSD RMSD Dnx Dnx /N D D /N N INX ——— (0.308,0.307) (-4.26e-01,0.147) (0.300,0.320) (0.670,0.012) (0.659,0.014) 0.751,3.09e-03 0.889,4.79e-05 (0.683,0.010) 0.709,6.61e-03 (0.668,0.013) Dnx 0.590,5.01e-03 (0.179,0.393) (-3.08e-01,0.143) (0.436,0.038) 0.795,1.55e-04 0.769,2.52e-04 ——— 0.948,8.33e-07 0.986,6.07e-10 0.929,4.31e-06 0.960,1.93e-07 LRO (0.077,0.714) ——— (0.521,0.068) (0.664,0.013) (0.340,0.255) (0.373,0.209) (0.217,0.475) (0.328,0.275) (0.212,0.487) (0.342,0.253) (0.143,0.641) Dnx /N 0.590,5.01e-03 (0.179,0.393) (-3.08e-01,0.143) (0.436,0.038) 0.795,1.55e-04 0.769,2.52e-04 0.949,6.34e-06 ——— 0.915,1.14e-05 0.942,1.50e-06 0.874,9.26e-05 RCO (-3.59e-01,0.088) (0.308,0.143) ——— (0.463,0.111) (-2.30e-01,0.451) (-1.89e-01,0.537) (-5.13e-01,0.073) (-3.90e-01,0.187) (-4.83e-01,0.095) (-2.56e-01,0.399) (-5.76e-01,0.039) D (0.513,0.015) (0.205,0.329) (-2.82e-01,0.180) (0.513,0.015) 0.872,3.35e-05 0.846,5.66e-05 0.923,1.12e-05 0.872,3.35e-05 ——— 0.946,1.07e-06 0.985,8.98e-10 
ACO (0.128,0.542) (0.487,0.020) (0.205,0.329) ——— 0.726,4.95e-03 0.758,2.66e-03 (0.481,0.096) (0.545,0.054) (0.534,0.060) 0.706,6.94e-03 (0.437,0.136) D /N (0.410,0.051) (0.308,0.143) (-1.79e-01,0.393) 0.615,3.41e-03 0.974,3.54e-06 0.949,6.34e-06 0.821,9.44e-05 0.821,9.44e-05 0.897,1.95e-05 ——— 0.902,2.43e-05 MRSD (0.385,0.067) (0.282,0.180) (-2.05e-01,0.329) 0.590,5.01e-03 ——— 0.998,8.22e-15 0.915,1.14e-05 0.921,7.87e-06 0.940,1.85e-06 0.998,5.55e-15 0.897,3.28e-05 N (0.462,0.028) (0.154,0.464) (-2.82e-01,0.180) (0.513,0.015) 0.821,9.44e-05 0.795,1.55e-04 0.872,3.35e-05 0.821,9.44e-05 0.949,6.34e-06 0.846,5.66e-05 ——— RMSD (0.359,0.088) (0.308,0.143) (-1.79e-01,0.393) 0.615,3.41e-03 0.974,3.54e-06 ——— 0.909,1.70e-05 0.916,1.11e-05 0.932,3.41e-06 0.996,6.81e-13 0.882,6.66e-05 179 Table F.5: Mixed secondary structure proteins: Correlation between various parameters. The upper triangle matrix (containing elements above the dash cell in each column) contains Kendall correlation coefficient and the corresponding p-value, and the lower triangle portion contains Pearson corr. coefficient and the corresponding p-value. Appendix F. 
Cross correlation of order parameters INX LRO RCO ACO MRSD RMSD Dnx Dnx /N D D /N N INX LRO RCO ACO MRSD RMSD Dnx Dnx /N D D /N N INX ——— 0.658,7.03e-06 (0.297,0.071) 0.516,9.10e-04 0.513,1.00e-03 0.502,1.33e-03 0.591,9.27e-05 0.856,7.02e-12 (0.432,6.78e-03) 0.566,2.12e-04 (0.395,0.014) Dnx (0.488,1.62e-05) (0.379,8.18e-04) (0.115,0.309) 0.633,2.21e-08 0.747,4.10e-11 0.724,1.56e-10 ——— 0.907,4.22e-15 0.961,0.00e+00 0.917,6.66e-16 0.928,0.00e+00 LRO (0.424,1.77e-04) ——— 0.736,1.42e-07 0.730,1.98e-07 (0.403,0.012) (0.411,0.010) (0.344,0.034) 0.587,1.07e-04 (0.241,0.145) (0.434,6.48e-03) (0.211,0.203) Dnx /N 0.633,2.21e-08 (0.381,7.47e-04) (0.147,0.195) 0.545,1.47e-06 0.619,4.53e-08 0.590,1.82e-07 0.849,6.13e-14 ——— 0.814,5.13e-10 0.885,1.61e-13 0.769,1.74e-08 RCO (0.229,0.043) 0.615,5.48e-08 ——— (0.494,1.61e-03) (-4.60e-02,0.784) (-2.62e-02,0.876) (-1.83e-01,0.271) (0.109,0.514) (-2.63e-01,0.111) (-2.65e-02,0.874) (-3.26e-01,0.046) D (0.218,0.054) (0.211,0.063) (-5.83e-02,0.606) 0.596,1.38e-07 0.886,4.88e-15 0.881,7.11e-15 0.730,1.12e-10 0.579,3.11e-07 ——— 0.947,0.00e+00 0.987,0.00e+00 ACO (0.292,9.96e-03) 0.518,4.66e-06 (0.340,2.66e-03) ——— 0.805,1.07e-09 0.819,3.32e-10 0.672,3.83e-06 0.745,8.28e-08 0.673,3.68e-06 0.811,6.52e-10 0.627,2.56e-05 D /N (0.289,0.011) (0.276,0.015) (0.024,0.831) 0.656,6.81e-09 0.963,0.00e+00 0.929,2.22e-16 0.778,6.12e-12 0.656,6.81e-09 0.866,1.91e-14 ——— 0.928,0.00e+00 MRSD (0.252,0.026) (0.262,0.021) (4.27e-03,0.970) 0.642,1.43e-08 ——— 0.998,0.00e+00 0.901,1.31e-14 0.851,1.27e-11 0.949,0.00e+00 0.998,0.00e+00 0.934,0.00e+00 N (0.180,0.111) (0.167,0.139) (-1.03e-01,0.363) 0.560,7.31e-07 0.826,2.82e-13 0.823,3.40e-13 0.689,1.13e-09 0.538,2.03e-06 0.941,0.00e+00 0.806,1.03e-12 ——— RMSD (0.223,0.048) (0.268,0.018) (0.033,0.772) 0.670,3.19e-09 0.954,0.00e+00 ——— 0.895,3.60e-14 0.844,2.69e-11 0.945,0.00e+00 0.995,0.00e+00 0.928,0.00e+00 Table F.6: Unknotted proteins: correlation between various order parameters the upper triangle matrix 
(containing elements above the dash cell in each column) contains Kendall correlation coefficient and the corresponding pvalue, and the lower triangle portion contains Pearson corr. coefficient and the corresponding p-value. Appendix F. Cross correlation of order parameters INX LRO RCO ACO MRSD RMSD Dnx Dnx /N D D /N N 180 INX LRO RCO ACO MRSD RMSD Dnx Dnx /N D D /N N INX ——— (0.767,0.044) (-7.22e-01,0.067) (-6.24e-02,0.894) (0.213,0.647) (0.211,0.649) (0.713,0.072) 0.919,3.42e-03 (0.344,0.450) (0.396,0.379) (0.365,0.421) Dnx (0.714,0.024) (0.238,0.453) (-8.10e-01,0.011) (0.143,0.652) (0.048,0.881) (0.143,0.652) ——— 0.924,2.97e-03 0.890,7.33e-03 0.885,8.07e-03 0.899,5.90e-03 LRO (0.524,0.099) ——— (-3.26e-01,0.476) (-3.92e-01,0.385) (-1.15e-01,0.805) (-1.33e-01,0.776) (0.304,0.508) (0.602,0.153) (-8.12e-02,0.863) (0.044,0.926) (-8.59e-02,0.855) Dnx /N 0.905,4.32e-03 (0.429,0.176) (-6.19e-01,0.051) (-4.76e-02,0.881) (-1.43e-01,0.652) (-4.76e-02,0.881) (0.810,0.011) ——— (0.678,0.094) (0.720,0.068) (0.690,0.086) RCO (-5.24e-01,0.099) (-4.76e-02,0.881) ——— (-3.94e-01,0.382) (-7.13e-01,0.072) (-7.19e-01,0.068) (-9.73e-01,2.29e-04) (-9.05e-01,5.09e-03) (-8.31e-01,0.021) (-8.16e-01,0.025) (-8.41e-01,0.018) D (0.143,0.652) (-3.33e-01,0.293) (-6.19e-01,0.051) (0.524,0.099) (0.619,0.051) (0.714,0.024) (0.429,0.176) (0.238,0.453) ——— 0.981,9.40e-05 0.998,2.16e-07 ACO (-1.43e-01,0.652) (-2.38e-01,0.453) (-1.43e-01,0.652) ——— 0.901,5.68e-03 0.900,5.77e-03 (0.530,0.221) (0.287,0.533) (0.832,0.020) (0.830,0.021) (0.822,0.023) D /N (0.048,0.881) (-2.38e-01,0.453) (-5.24e-01,0.099) (0.619,0.051) (0.714,0.024) (0.810,0.011) (0.333,0.293) (0.143,0.652) (0.714,0.024) ——— 0.972,2.37e-04 MRSD (-2.38e-01,0.453) (-3.33e-01,0.293) (-2.38e-01,0.453) 0.905,4.32e-03 ——— 0.999,1.27e-08 (0.789,0.035) (0.573,0.179) 0.970,2.84e-04 0.981,9.09e-05 0.957,7.22e-04 N (0.238,0.453) (-2.38e-01,0.453) (-7.14e-01,0.024) (0.429,0.176) (0.524,0.099) (0.619,0.051) (0.524,0.099) (0.333,0.293) 0.905,4.32e-03 
(0.619,0.051) ——— RMSD (-1.43e-01,0.652) (-4.29e-01,0.176) (-3.33e-01,0.293) (0.810,0.011) 0.905,4.32e-03 ——— (0.792,0.034) (0.571,0.180) 0.975,1.92e-04 0.981,1.00e-04 0.962,5.15e-04 Table F.7: Knotted proteins: correlating between various order parameters the upper triangle matrix (containing elements above the dash cell in each column) contains Kendall correlation coefficient and the corresponding p-value, and the lower triangle portion contains Pearson corr. coefficient and the corresponding p-value. Appendix F. Cross correlation of order parameters INX LRO RCO ACO MRSD RMSD Dnx Dnx /N D D /N N 181 INX LRO RCO ACO MRSD RMSD Dnx Dnx /N D D /N N INX ——— 0.530,1.81e-04 (0.022,0.883) 0.548,9.85e-05 0.637,2.55e-06 0.625,4.48e-06 0.776,3.85e-10 0.919,0.00e+00 0.622,5.13e-06 0.697,1.07e-07 0.583,2.66e-05 Dnx 0.592,9.90e-09 (0.324,1.68e-03) (-5.66e-02,0.584) 0.634,8.08e-10 0.776,5.77e-14 0.758,2.19e-13 ——— 0.953,0.00e+00 0.947,0.00e+00 0.906,0.00e+00 0.916,0.00e+00 LRO (0.361,4.76e-04) ——— 0.662,7.48e-07 0.619,5.95e-06 (0.322,0.031) (0.329,0.027) (0.238,0.115) (0.419,4.21e-03) (0.174,0.254) (0.345,0.020) (0.160,0.293) Dnx /N 0.707,7.51e-12 (0.318,2.05e-03) (-3.43e-02,0.739) 0.556,7.44e-08 0.673,7.27e-11 0.651,2.98e-10 0.877,0.00e+00 ——— 0.862,2.69e-14 0.893,2.22e-16 0.823,3.78e-12 RCO (0.048,0.639) (0.488,2.28e-06) ——— (0.334,0.025) (-1.82e-01,0.232) (-1.64e-01,0.282) (-3.31e-01,0.027) (-1.45e-01,0.344) (-3.68e-01,0.013) (-1.79e-01,0.239) (-4.13e-01,4.78e-03) D (0.370,3.43e-04) (0.183,0.076) (-1.74e-01,0.092) 0.622,1.68e-09 0.889,0.00e+00 0.887,0.00e+00 0.778,4.97e-14 0.655,2.31e-10 ——— 0.959,0.00e+00 0.990,0.00e+00 ACO (0.352,6.63e-04) (0.448,1.45e-05) (0.200,0.053) ——— 0.832,1.47e-12 0.843,3.57e-13 0.659,8.70e-07 0.698,9.91e-08 0.722,2.17e-08 0.827,2.54e-12 0.693,1.31e-07 D /N (0.420,4.71e-05) (0.233,0.024) (-1.11e-01,0.282) 0.669,9.43e-11 0.960,0.00e+00 0.933,0.00e+00 0.812,3.55e-15 0.713,4.97e-12 0.885,0.00e+00 ——— 0.947,0.00e+00 MRSD (0.380,2.35e-04) (0.217,0.035) 
(-1.23e-01,0.233) 0.661,1.58e-10 ——— 0.999,0.00e+00 0.875,4.00e-15 0.851,1.34e-13 0.954,0.00e+00 0.996,0.00e+00 0.947,0.00e+00 N (0.329,1.42e-03) (0.146,0.157) (-2.20e-01,0.033) 0.583,1.65e-08 0.832,8.88e-16 0.832,8.88e-16 0.735,1.10e-12 0.611,3.23e-09 0.940,0.00e+00 0.824,1.33e-15 ——— RMSD (0.358,5.34e-04) (0.223,0.031) (-1.05e-01,0.309) 0.679,4.91e-11 0.962,0.00e+00 ——— 0.868,1.13e-14 0.842,4.14e-13 0.951,0.00e+00 0.994,0.00e+00 0.944,0.00e+00 Table F.8: All proteins: correlating between various order parameters the upper triangle matrix (containing elements above the dash cell in each column) contains Kendall correlation coefficient and the corresponding p-value, and the lower triangle portion contains Pearson corr. coefficient and the corresponding p-value. Appendix F. Cross correlation of order parameters INX LRO RCO ACO MRSD RMSD Dnx Dnx /N D D /N N 182
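To make the difference between the two statistics reported in these tables concrete, the following is a minimal Python sketch that computes a Pearson correlation coefficient and a Kendall rank correlation coefficient for two lists of values. It is not the code used to produce the tables, which this appendix does not specify. The function names `pearson_r` and `kendall_tau` and the sample data are illustrative assumptions; the Kendall version is the simple tau-a form, which assumes no tied values, and neither function computes a p-value.

```python
import math
from itertools import combinations


def pearson_r(x, y):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def kendall_tau(x, y):
    """Kendall tau-a (assumes no ties): (concordant - discordant) / number of pairs."""
    concordant = 0
    discordant = 0
    for (xi, yi), (xj, yj) in combinations(zip(x, y), 2):
        sign = (xi - xj) * (yi - yj)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical data: y increases monotonically but nonlinearly with x.
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]
print(kendall_tau(x, y))  # 1.0, because every pair is ordered the same way
print(pearson_r(x, y))    # about 0.981, because the relation is not linear
```

A rank-based coefficient can therefore equal 1 while the Pearson coefficient stays below 1. The same pattern appears for the D/N and MRSD pair in Table F.2, where the Kendall coefficient is 1.000 and the Pearson coefficient is 0.998.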
Item Metadata
Title | Generalized distance and applications in protein folding |
Creator |
Mohazab, Ali Reza |
Publisher | University of British Columbia |
Date Issued | 2013 |
Genre |
Thesis/Dissertation |
Type |
Text |
Language | eng |
Date Available | 2013-01-09 |
Provider | Vancouver : University of British Columbia Library |
Rights | Attribution-NonCommercial-NoDerivatives 4.0 International |
DOI | 10.14288/1.0073509 |
URI | http://hdl.handle.net/2429/43831 |
Degree |
Doctor of Philosophy - PhD |
Program |
Physics |
Affiliation |
Science, Faculty of Physics and Astronomy, Department of |
Degree Grantor | University of British Columbia |
Graduation Date | 2013-05 |
Campus |
UBCV |
Scholarly Level | Graduate |
Rights URI | http://creativecommons.org/licenses/by-nc-nd/4.0/ |
Aggregated Source Repository | DSpace |