Open Collections

UBC Theses and Dissertations

UBC Theses Logo

UBC Theses and Dissertations

Constraints on the organization and information properties of DNA sequences Sibbald, Peter Ramsay 1988

Your browser doesn't seem to have a PDF viewer, please download the PDF to view this item.

Item Metadata

Download

Media
831-UBC_1989_A1 S52.pdf [ 5.62MB ]
Metadata
JSON: 831-1.0098293.json
JSON-LD: 831-1.0098293-ld.json
RDF/XML (Pretty): 831-1.0098293-rdf.xml
RDF/JSON: 831-1.0098293-rdf.json
Turtle: 831-1.0098293-turtle.txt
N-Triples: 831-1.0098293-rdf-ntriples.txt
Original Record: 831-1.0098293-source.json
Full Text
831-1.0098293-fulltext.txt
Citation
831-1.0098293.ris

Full Text

C O N S T R A I N T S O N T H E O R G A N I Z A T I O N A N D I N F O R M A T I O N P R O P E R T I E S O F D N A S E Q U E N C E S by PETER RAMSAY SIBBALD B.Sc. Simon Fraser University, 1984 A THESIS SUBMITTED IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY in THE FACULTY OF GRADUATE STUDIES Botany We accept this thesis as conforming to the required standard THE UNIVERSITY OF BRITISH COLUMBIA 8 December 1988 ® Peter Ramsay Sibbald, 1988 In presenting this thesis in partial fulfilment of the requirements for an advanced degree at The University of B r i t i s h Columbia, I agree that the Li b r a r y shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the Head of my Department or by his or her representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission. Botany The University of British Columbia 2075 Wesbrook Place Vancouver, Canada V6T 1W5 Date: 8 December 1988 ABSTRACT In an investigation which concentrated p r imar i ly on the two completely sequenced chloroplast genomes, one from a tobacco and one from a l iverwort , an attempt has been -made to discover some of the factors which produce order in D N A sequences. This was done by 1. looking in detail at doublet organization throughout the genomes, 2'. by examining the ability of different methods to predict the existence of genes, based only on sequence organization and 3. by employing information theorj' to explore various levels of ordering in these sequences. The doublet analysis was performed on seven categories of D N A : t D N A , r D N A , ribosomal proteins, open reading frames not known to be genes ( U R F ) , other protein genes, non coding regions and introns. The r D N A has the most unusual doublet properties of al l categories although al l categories have, to a considerable extent, s imilar doublet properties. I suggest that these part icular doublet properties facilitate accurate replication of the genome. In addition it appears that doublets which have certain thermodynamic properties are more abundant that others, suggesting that there is a selection pressure at the level of doublets for certain thermodynamic properties. Nussinov 's hypothesis, that complementary doublets have s imilar relative abundances due to inverted duplication events has been tested and would not seem to explain the phenomenon. Ficket t ' s method to predict whether U R F s are genes was more successful than Sheperd's method. Fickett 's method was modified for use on the chloroplast genomes and its rate of successful prediction increased substantially. This modified i i method wi l l be useful for other chloroplast genomes as they are sequenced and also supports Fickett 's contention that the method could be improved for use on specific groups. The abil i ty to predict genes based only on sequence data shows that the requirement to code for protein exerts a detectable amount of order on the gene sequence and that this order is distinguishable from the order in non coding regions. Nea r ly a l l U R F s greater than 200 base pairs in both plants are predicted to be genes. Informational analysis showed that most order is at the level of single and double bases wi th a significant, lesser amount of order at the triplet and 4-plet level. This was true for both coding and noncoding regions in both plants. This is in contrast earlier work (Rowe and Trainor) which found that in viruses there was a significant difference between 4-plet ordering in coding and noncoding regions. It is suggested that D N A may be optimized for replication rather than protein production. Several new problems and experiments have been suggested. i i i TABLE OF CONTENTS Abstract i i L i s t of Tables v L i s t of Figures v i L i s t of Abbreviat ions v i i Acknowledgement v i i i I. General Introduction 1 II. Doublet or Neares t Neighbour Analyses 5 A . Mater ia l s and Methods 10 B . Results 11 C. Discussion 23 1. Nuss inov 's Hypothesis 23 2. Doublet suppression 29 3. Thermodynamic considerations 31 III. Prediction and identification of chloroplast genes 38 A . Descript ion of the problem 38 B . Mater ia l s and Methods 43 C . Results 45 D . Discussion 54 I V . Information theory and D N A 64 A . Introduction to information theory 64 B . Gat l in ' s Approach 65 C . A N e w Approach 69 D . Proof of Equivalence 71 E . Redundancy 72 F . Appl icat ion of information theory to sequence analysis 73 • G . Use of higher D-values 78 H . Inverted repeats as a source of redundancy 87 References 93 Appendix 1. A Selection of Computer Programs 106 A . P rog ram 1 106 B . P rog ram 2 109 C. P rog ram 3 115 D . P rog ram 4 127 E . P rogram 5 138 i v LIST OF TABLES Table 1. Correlation coefficients for tobacco doublet relative abundances analysed on the sense strand 14 Table 2. Correlation coefficients for l iverwort doublet relative abundances analysed on the sense strand : 20 Table 3. Correlation coefficients for tobacco doublet relative abundances analysed on the B-s t rand . 21 Table 4. Correlation coefficients for l iverwort doublet relative abundances analysed on the A-s t rand 22 Table 5. The G + C contents and percentage of the genome comprised of different categories of D N A 24 Table 6. Thermodynamic properties for doublets 32 Table 7. Correlations between doublet abundances and doublet properties 36 Table 8. N e w weighting parameters for Ficket t testcode procedure 44 Table 9. Prediction of genes in the tobacco chloroplast genome 46 Table 10. Prediction of genes i n the l iverwort chloroplast genome 50 Table 11. The performance' of modified Ficket t procedure on genes of other species 55 v LIST OF FIGURES Figure 1. Tobacco doublet relative abundances for various categories of D N A . . 12 Figure 2. L ive rwor t doublet relative abundances on the sense strand for the various categories of D N A 17 Figure 3. Tobacco doublets analyzed on the B-strand 18 Figure 4. L ive rwor t doublets analyzed on the sense strand 19 Figure 5. Correlations between doublet abundances +/- one IR (tobacco) 27 Figure 6. Correlations between doublet abundances +/- one IR (liverwort) 28 Figure 7. Cluster analysis of l iverwort genes and U R F s greater than 200 bp. 57 Figure 8. Cluster analysis of the tobacco genes and U R F s greater than 200 bp. 58 Figure 9. Two entropy scales for D N A 68 Figure 10. Two channels through . which D N A information flows 74 Figure 11. The G + C ranges for various t axa 77 Figure 12. D l to D 6 for tobacco and pseudorandom sequences. 79 Figure 13. Real and simulated D2-values 80 Figure 14. Real and simulated D3-values 81 Figure 15. Real and simulated D4-values 82 Figure 16. Real and simulated D5-values 83 Figure 17. Real and simulated D6-values 84 Figure 18. Redundancy calculated in nonoverlapping windows of 500 bases for the tobacco genome 88 Figure 19. Redundancy calculated in nonoverlapping windows of 500 bases for the l iverwort genome 89 Figure 20. Information parameters in unique and repeat regions for three species. 91 v i LIST OF ABBREVIATIONS A:adenine, C:cytosine, G:guanine, Trthymine 5 G + C content: the fraction of bases that are G or C. The human genome for example has G + C = 0.43, occasionally expressed as a percentage. . .5 d s D N A : double stranded D N A 6 strand: a single strand is those bases that are joined covalently to each other through the sugar phosphate backbone. A double strand is two of these held together by hydrogen bonds 6 relative abundance: the amount of something divided by the amount expected or predicted 7 IR: inverted repeat 8 U R F : unidentified reading frame; synonomous wi th O R F in much of the literature 10 I R A : inverted repeat A 11 I R B : inverted repeat B 16 S D : standard deviation 26 R:purine 31 Y:pyr imid ine 31 bp: base pairs 33 Z - D N A : D N A whose helix twists in the opposite direction to "normal" or B - D N A ; 39 v i i ACKNOWLEDGEMENT To m y fellow graduate students who have been helpful and supportive throughout, I a m most thankful. In part icular, Michae l J . White , M u r r a y Webb, M i s h t u Banerjee, Lacey Samuels, Leonard Dyck and K a l i Robson have all acted as sounding boards, helped sort through ideas and provided cr i t ic ism. Jack Maze , a friend and supervisor, has been more supportive than any student can reasonably expect. I w i sh to thank m y other supervisor, Tony Griffiths, and members of m y committee, M a x Taylor and C y Finnegan for taking time to help and having faith in m y work. D a n Brooks, John Collier and L i l a Ga t l in provided technical help. M y family and I are indebted to Lute and A n n a van den Berg , who so kindly offered us a place to stay during the final b i r th pangs of this dissertation. L a s t l y , I wish to thank both m y wife, M a r y Jo , for her considerable patience throughout, and Joseph who is teaching me about development. v i i i I. GENERAL INTRODUCTION Since efficient methods were first devised to sequence D N A (Sanger and Coulson 1975, M a x a m and Gilbert 1977), the amount of nucleotide sequence data has increased exponentially (DeLis i 1988). Pr ior to 1975 there were essentially no 7 such data. N o w , in GenBank release 54.0, there are in excess of 1.6x10 base pairs (bp) and this trend shows no sign of slowing. A s improved methods of sequencing are developed (Johnson-Dow et al. 1987, review article), and the process is automated, data w i l l accumulate even more rapidly. One result of all this act ivi ty is the emergence of a new discipline, theoretical molecular biology (Nussinov 1987). The goals of this discipline include developing the theoretical framework necessary to understanding the organization of protein, R N A , and D N A molecules. Since the data base is so large, a smal l increase in our understanding of it should result in a substantial increase in our understanding of biology. The principal aims of this dissertation are to further our understanding of D N A organization, and to seek some of the causes of that organization, since general principles must be found to act as a base for future work. D e L i s i (1988) has emphasised the magnitude of the problem in expressing a concern about what we w i l l do when the data base is growing at a rate of 6 10 bp per day, which is sure to happen when the human genome is sequenced. Thus methods must be found to thoroughly and rapidly analyse these data. A s Nuss inov (1987) has observed, theoretical molecular biology is f i rmly rooted in empir ical ly obtained data and should function in concert wi th more traditional experimental approaches. Therefore, most of the work described below is 1 General Introduction / 2 immediately juxtaposed to results obtained in a research laboratory. This has been a conscious choice, the result of which, i t is hoped, w i l l be to make that laboratory work more useful to a wide range of biologists, both theoretically and empirical ly inclined. In keeping wi th the above ideas, this dissertation has focussed on chloroplast D N A , since, in 1986, the complete sequences of the chloroplast genomes of tobacco (Shinozaki et al. 1986) and the l iverwort , Marchantia polymorpha (Ohyama et al. 1986) were reported. Reasons for choosing these sequences for a detailed analysis include m y own interest in photosynthesis as wel l as the completeness of the reported sequences. No detailed analysis of the type I have performed has yet been reported, perhaps because the sequences have become available only recently. In addition, the tobacco and l iverwort chloroplast genomes are 155,844 and 121,024 bp respectively, a size that is smal l enough for rapid and relatively inexpensive computer analysis yet large enough to be interesting and, for some of the analyses, "statistically significant". Fur ther information regarding these sequences w i l l be presented later, but a recent review article concerning them is available (Umesono and Ozeki 1987), and for a more general overview of chloroplast genomes, reviews by Zurawsk i and Clegg (1987) and Palmer (1985) are informative. More specialized reviews concerning the molecular genetics of Chlamydomonas plastids (Rochaix 1987) and topography or conformation of D N A wi th in algal chloroplasts (Coleman 1985) are also available. One other important reason for working wi th chloroplast D N A is the complete absence of modified bases. The only reported exception is Chlamdyomonas General Introduction / 3 chloroplast D N A which has 5-methylated cytosine (Palmer 1985). This is important since some of the analyses would be seriously changed i f the possibility of modified bases had to be allowed. Due to the absence of modified bases in both tobacco and l iverwort there is no concern about the possibility that cytosine, for example, might be methylated in vivo. This work is concerned wi th discovering what constraints act on D N A sequences and the effects of these constraints. A constraint in this regard is defined broadly as any factor, biotic or abiotic, internal or external to the organism, that has an effect on the D N A sequence of the organism. I think that we as biologists, are not fully aware of how t ru ly ignorant we are about the total D N A in most organisms. In mammals , for example, only about 1% of the D N A seems to code for protein while some of the rest is known to be involved in chromosome structure; but, for much of it, we cannot even postulate a reason for its existence. One way to solve problems in any new area (such as theoretical molecular biology), is by analogy, i . e. find s imilar already solved problems in an existing area and then adapt the solutions to the new problem but, is the analogy of D N A to language, suggested by Eb l ing and J imenez (1980), Brendel and Busse (1984) and Head (1987), a va l id one? I thought at the beginning of this study that it was and that a potentially valuable direction would be to search for "grammar" and "sentences", i . e., rules that govern D N A structure in a language-like fashion. M y view has now changed, principal ly because language is essentially a linear thing while D N A should, perhaps, not be viewed as such. General Introduction / 4 There have also been attempts (Gatlin 1972a, Holzmuller 1984, Brooks et al. 1984), to use information theory to understand some aspects of D N A sequence composition and ordering. Whi le examining the feasibility of using information theory for this purpose, I have found that in some reported cases information theory has been incorrectly applied. The last goal of this dissertation is to identify areas for further study in the relatively new area of theoretical molecular biology. II. DOUBLET OR NEAREST NEIGHBOUR ANALYSES A systematic method for code-breaking often involves analysis of strings of symbols from the smallest units of organization up to larger ones. If for example a cryptographer were attempting to decipher an intercepted message, the first step would be to quantify the frequency wi th which each symbol in the message occurs. Secondly, he might examine the frequency wi th which pairs of symbols occur. This procedure, and the steps which follow it, can often provide useful insights into the syntax of the underlying message. S imi la r ideas have been fruitfully employed in the analysis of D N A sequences. Class ical ly , the G + C content! of a D N A sequence of interest was determined by centrifugation through C s C l (this method is described in texts such as Freifelder 1982). Since, in a double stranded D N A , the amount of C equals the amount of G and A and T are also equimolar, a knowledge of the G + C content necessarily provides the amounts of al l four bases. This is analogous to determining the letter frequency in a message. W i t h the development of sequencing, base compositions are now usual ly determined by direct counting. Fo r reasons that are not understood, even in a single strand of D N A the amount of G is usual ly roughly equal to the amount of C and likewise for A and T, even though there is no known base pai r ing requirement that would force this to be so. To continue the analogy to cryptography, we m a y consider the frequencies of adjacent symbols, in this case adjacent bases in a D N A sequence. In 1961 Josse, Ka i se r and R o m b e r g reported the nearest neighbour frequencies for a t A list of abbreviations, and the page where they are first used appears on page v i . 5 Doublet or Nearest Neighbour Analyses / 6 variety of double stranded D N A s (dsDNA) . Due to the experiment being performed wi th d s D N A as a template, and given the rules of base pair ing, it was inevitable that the complementary doublets (Nussinov 1987) would have the same abundances. A doublet on one • strand read 5' to 3', pairs wi th its complement, which is read 5' to 3' on the other strand. Fo r example, in the sequence 5 ' - A C T G A G C A T A A C C G G T C T T - 3 ' 3 ' - T G A C T C G T A T T G G C C A G A A - 5' the amounts of the complementarj ' doublets A A and T T , C C and G G , A C and G T , A G and C T , G A and T C , T G and C A are each identical. The amount of A A must therefore equal the amount of T T simply due to pair ing rules between complementary strands. The amounts of T A , A T , C G and G C are not under this constraint since each is its own complementary doublet, - i . e. T A read 5' to 3' on one strand pairs wi th T A read 5' to 3' one the other strand. The experiment of Josse et al. (1961) is impressive as a piece of fine biochemistry; it showed that there is a bias in the doublet abundances of D N A , i . e. the doublets do not occur in amounts that would be indicative of random ordering. Consider for example, one such doublet, C A . Suppose, in a given sequence, the frequency of C is 0.20 and that of A is 0.30. Then any specific base in the sequence has a probability of 0.20 of being a C and the next base has a probability of 0.30 of being an A ; the frequency of the doublet C A would be 0.06 i f the bases in the given sequence were randomly arranged. Should an analysis of the sequence find that the frequency of C A is actually 0.08, then C A is more abundant in this sequence than would be expected. (One w ay to express this ari thmetically is as a relative abundance, in this case actual/expected Doublet or Nearest Neighbour Analyses / 7 -= 0.08/0.06 = 1.33.) In no organism examined to date do relative abundances indicate randomness in doublet organization (Nussinov 1987). Due to the complementarity of the two strands that make up a d s D N A , it is sufficient, for many purposes, to discuss the properties of one strand of a d s D N A and thus, the convention used herein is that any D N A sequence discussed w i l l show only one strand, from 5' to 3', wi th the other strand understood unless otherwise specified. F o r example, the d s D N A 5 ' - A A T G C C - 3 ' 3 ' - T T A C G G - 5 ' shall be wri t ten as A A T G C C (or G G C A T T ) . L ikewise it is convenient to discuss the doublet properties of one strand only. W i t h the appearance of sequencing, it became possible to find doublet relative abundances by direct counting, using a computer and such doublet data have been presented for a var iety of human nuclear sequences (Day and Blake 1984, Hinds and Blake 1984). The doublets C G and T A in that mater ia l have very low relative abundances. Doublet relative abundances have also been reported for several vertebrates, invertebrates, D N A viruses, vertebrate mitochondria (Nussinov 1984a), as well as more general groupings of prokaryotes and eukaryotes (Nussinov 1984b). A g a i n , wi th the exception of vertebrate mitochondrial sequences, T A is a rare doublet, while in al l groups C G is rare. Blake and E a r l e y . (1986), in an analysis of approximately 300,000 bp of E. coli found that doublet properties are a function of base composition wi th in a genome. They have also analysed the doublet properties of coding and noncoding regions and find that the Doublet or Neares t Neighbour Analyses / 8 properties in coding regions are very similar to those in noncoding regions, although there is the well characterized finding that the G + C content in the coding regions is greater than in noncoding regions. Nuss inov (1987) has pointed out that doublet properties are s imi lar in both coding and noncoding regions, though other workers (Borodovskii et al. 1987a,b) concentrated on the differences in doublet properties in coding and noncoding regions of E. coli. since they are not identical. This literature can be summarized as follows: 1. Doublet properties are nonrandom. 2. They are to some degree s imilar in noncoding and coding regions wi th in a genome. 3. Cer ta in doublets such as C G and T A tend to be rare. Another property of interest to the present study is that even wi thin a single strand of D N A , complementary doublets have s imi lar relative abundances (Nussinov 1984a,b 1987). Nuss inov has suggested that this is the result of numerous inverted duplication events during genome evolution. For example i f the double stranded D N A 5 ' - A C T G T G A C - 5 ' was inverted and duplicated to give A C T G C A G T on one strand, then the complementary doublets A C and G T , for example, would be equally abundant. The presence of large inverted repeats (IR) in the chloroplast genome provide an opportunity to test this hypothesis. The ra r i ty of the doublet C G (CG suppression), has been postulated to be due to Doublet or Nearest Neighbour Analyses / 9 the tendency for C to be methylated when in a C G doublet followed by deamination result ing i n the conversion of C G to T G . T A has been thought to be rare because of stop codon avoidance. Stop codons in the standard genetic code are T A A , T A G and T G A two of which contain T A . The predictions that would follow i f these two explanations accounted for the rar i ty of C G and T A are: 1. The "miss ing" C G should appear as an excess of T G . 2. T A should be rare i n protein coding regions but not par t icular ly rare in other regions of D N A . 3. In organisms where methylation doesn't occur, C G should not be selected against. Some of these predictions have been tested by M a z i n and V a n y u s h i n (1987 a,b), who found that in eukaryotes the C G loss was generally compensated by an increase in T G . However, they also found the same result in organisms which lack 5-methylcytosine. In order to explain this apparent dispari ty, they postulated that the ancestors of organisms which presently lack 5-methylcytosine must at one time have had it. A l l organelles seem to lack methylat ion of C , w i th the exception of Chlamydomonas chloroplasts (Palmer 1985). In a study of plant nuclear, mitochondrial and chloroplast protein genes, Boudraa and Pe r r in (1987) found that, in chloroplast D N A the doublet T A is avoided in al l codon positions, not just position I t , which suggests that stop codon avoidance is not the reason for T A avoidance. In addition, they found C G is avoided in nuclear D N A (methylated) as well as mitochondrial D N A , and in positions II and III of chloroplast D N A . This they argue, would suggest that methylat ion is not the cause of C G suppression, but rather that the physical-chemical properties of the doublets are s imply subject to selection pressure. To support this idea they cite work by Breslauer et al. (1986) in which it was shown that of a l l 16 doublets, t A codon consists of three bases. A T G for example codes for methionine. Position I is A , II is T and III is G . Doublet or Nearest Neighbour Analyses / 10 T A binds most weakly wi th the complementary doublet on the opposite strand while C G binds most strongly. They suggest that T A and C G m a y be avoided since they prevent thermodynamic optimization of D N A sequences. V e r y recently Sr in ivasan et al. (1987) have reported a detailed study of the twist , rol l and tilt properties of D N A residues in doublets. A goal of this dissertation is to assimilate such work into an explanation of doublet frequency. A. MATERIALS AND METHODS A l l chloroplast D N A sequence data were obtained v i a the Un ive r s i t y of Br i t i sh Columbia computer centre from GenBank Release 52.0. A l l analyses were performed on the U B C mainframe computer using programs wri t ten by the author in Pascal , except certain statistical analyses (such as autocorrelations and Spearman-Rank correlations) which were performed using the package M I D A S (Fox and Gui re 1976). Da ta were plotted using Tel lagraf (a product of Integrated Software Systems Corporation Graphics , San Diego, California). Descriptions of gene and unidentified reading frame ( U R F ) positions in both the l iverwort (Ohyama et al. 1986) and tobacco (Shinozaki et al. 1986) chloroplast genomes are available on the GenBank tapes. The analyses of doublets can be performed i n two fundamentally different ways depending on the question being addressed. The most common method is to analyse 5' to 3' on the sense strand; an approach based on a belief that i t is the function of D N A that is important, and that the function of D N A is to code for components which the organism requires. When using this method, an Doublet or Nearest Neighbour Analyses / 11 arb i t rary decision must be made regarding which strand and direction ought to be analysed in noncoding regions. A second method of analysis examines the same strand in the same direction for al l types of D N A in the genome. Pract ica l ly , this method has the disadvantage that i t is not a lways known, i n the case of incompletely sequenced genomes, which strand a given gene is on. However , since i t is based on an emphasis on structure rather than function, i t allows for a na tura l treatment of noncoding regions. Both methods have been employed below. B. RESULTS The entire tobacco genome was analyzed on the B-s t randt (Deno et al. 1983), 5' to 3'. Doublet relative abundances were determined, wi th the results shown in F i g . 1 (total). Removal of the 25,339 bp I R A followed by analysis of the remaining genome (total-IRA) has little effect on the doublet properties (Fig. 1). To determine how the different categories of D N A within total - IRA; protein genes, introns, noncoding, U R F s , r D N A , ribosomal protein genes, t D N A contribute to the overal l pattern, each category, except noncoding, . was analyzed 5' to 3' on the sense strand (Fig. 1). (Note that the sense strand is not a lways the B-strand.) Noncoding regions were analyzed 5' to 3' on the B-strand, which was an arb i t rary choice since noncoding regions are noncoding on both strands. In F i g . 1, the l iverwort genome, analyzed 5' to 3' on the A-s t rand , is shown for comparison. Examina t ion of F i g . 1 shows that noncoding regions are very s imilar to the total genome in their organization and that al l categories of D N A show a t T h e A-st rand is the one that codes for ribulose bis -phosphate carboxylase. The B-s t rand is the complementary strand. Doublet or Nearest Neighbour Analyses / 12 Figure 1. Tobacco doublet relative abundances for various categories of DNA. All the y axes are identical. All were analysed on the sense strand 5' to 3'. Total liverwort chloroplast DNA is provided for comparison. 1.4 1.3 1.2 1.1 1 0.9 0.8 0.7 0.6 H 1.3 1.2 1.1 1 0.9 0.8 0.7 0.6 H 1.3 1.2 H 1.1 1 0.9 0.8 0.7-0.6-1.3 1.2 1.1 1 0.9 0.8 0.7 0.6 1.3 1.2 1.1 H 0.9 0.8 0.7 0.6 TOTAL PROTEIN CENES TRNA GENES TOTAL-IRA INTRONS RJ BOSOM AL PROTEIN GENES o o ^  <L> • ~ O © O *" *-Doublet or Nearest Neighbour Analyses / 13 general downward slope from left to right, indicating that they have s imilar doublet properties. Relative to tobacco, l iverwort has a marked deficit in G A and T C doublets. To more readily compare the doublet properties, Spearman-Rank correlations were calculated for each pair of categories (Table 1). This nonparametric statistic (i. e. no assumptions are made about distribution or l ineari ty of the data), is close to 1.0 i f the doublet properties are strongly positively correlated, 0.0 i f uncor rec ted and close to -1.0 i f strongly negatively correlated. Table 1 shows that al l correlations are positive. If we construct a nul l model in which positive and negative correlations are equally l ikely, this 36 -35 model has a probability of 2/(2 )= 2 of producing al l positive or all negative correlations. Therefore the probability that a l l correlations have the same sign, -11 due to chance alone, based on this nu l l model is 3 x 10 . Therefore it is highly significant that a l l correlations are positive and i t seems reasonable to assert that a l l categories have at least somewhat s imilar doublet properties. Note that the correlation between total and to ta l - IRA is 1.00. Ribosomal D N A is least wel l correlated wi th the other categories. F o r N = 1 6 , the number of doublets, Spearman-Rank correlations are "significant" at the 0.05 level at 0.425 and at the 0.01 level at 0.601. There is some controversy regarding the use of inferential statistics when he whole population can be assessed. Here I assume that the sequences are complete and accurate. D u r i n g sequencing, each base is sequenced at least three times, more often i f this does not produce a consensus. This is not to imply that there are no other chloroplast genomes wi th different sequences, rather, that from a statistical viewpoint, I a m assuming absolute certainty that these two sequences are two Doublet or Nearest Neighbour Analyses / 14 Table 1. Spearman-Rank correlation coefficients for the relative abundances of doublets in tobacco wi th IRs (total), w i th one IR deleted (total-IRA), and seven categories wi th in tota l - IRA. Doublets were analysed on the sense strand, 5' to 3'. The l iverwort genome (total) is shown for comparison. Abbreviat ions: totahthe entire genome; total-IRA: the entire genome minus one of the inverted repeats; genesrprotein coding genes not including ribosomal proteins; URF:unident i f ied reading frames; nomnoncoding regions; r D N A : r i b o s o m a l D N A ; t D N A : c o d i n g for t R N A ; introns:intervening sequences wi th in a gene. total 1.00 tot- IRA 1.00 1.00 genes 0.50 0.50 1.00 rprotein 0.37 0.37 0.84 1.00 U R F 0.85 0.85 0.79 0.53 1.00 non 0.97 0.97 0.57 0.39 0.88 1.00 r D N A 0.46 0.46 0.39 0.51 0.37 0.49 1.00 t D N A 0.73 0.73 0.68 0.71 0.74 0.73 0.77 1.00 introns 0.59 0.59 0.89 0.77 0.83 0.67 0.60 0.78 1.00 total t - IRB gene rprot U R F non r D N A t D N A intron Doublet or Nearest Neighbour Analyses / 15 complete populations. Another justification for such an assumption is that the sequencing procedure itself takes into account a very large number of molecules. The question then arises; are inferential statistics appropriate to derive conclusions regarding these sequences? The workers at The Statist ical Research Laboratory at the Unive r s i ty of Mich igan (1976) have said "Fi rs t , i f one's dataset incorporates al l relevant cases in the • population of interest (except perhaps for a few missing observations), then the application of various inference procedures is. not only unnecessary; it is meaningless. In such cases, where one has essentially a census of the relevant population, it suffices s imply to describe and summarize the variables and their interrelationships in the dataset." For the most part I concur wi th this view, especially since "significance" can be achieved by a number of methods which to me seem suspect. One of these methods is to chose an alternative nul l model or alternative statistic, one which provides the desired level of significance i f previous methods fail to do so. This is a common practice. A second method is through manipulat ion of sample size. There is no rule for example that says that one must report a l l of one's data. Then it is possible, i f desired, to turn a significant finding into an insignificant one s imply by excluding some data even i f the data excluded are chosen randomly. Despite these considerable problems, I have calculated significance values in some cases but have tried to be verj ' explicit about the nul l model being employed. The l iverwort genome was also analyzed as above; 5' to 3' on the A-s t rand in Doublet or Nearest Neighbour Analyses / 16 total, total-IRB and noncoding regions, and 5' to 3' on the sense strand in the other categories (Fig. 2) In F i g . 2 tobacco doublet data are shown for comparison, and Table 2 presents correlations for this analysis , i . e. Spearman-Rank correlation coefficients for the relative abundances of doublets in l iverwort w i th IRs (total), wi th one IR deleted (total-IRB), and seven categories wi th in total-IRB. Doublets were analysed 5' to 3' on the sense strand. Aga in tobacco data are shown for comparison. In l iverwort , removal of the 10,058 bp I R B has a smal l effect on the doublet properties of the genome (Fig. 2, Table 2) as was found for tobacco, and, also s imi lar to tobacco are the exclusively positive correlations between the doublet properties of the categories (Table 2). In l iverwort , however, both t D N A and r D N A correlate poorly wi th the rest of the genome, while, in noncoding regions the high abundance of the doublet G G is par t icular ly s t r iking (Fig . 2). The strong C G suppression expected in protein genes does not appear to be compensated for by an overabundance of the doublet T G . A doublet analysis can also be performed all on one strand in one direction (5' to 3'), and this has been done for tobacco (Fig. 3) on the B-strand, and l iverwort (Fig. 4) on the A-s t rand , and the correlations calculated (Tables 3,4). A g a i n , al l correlations for both plants were positive (Tables 3,4), and a comparison of F i g . 1 wi th F i g . 3, and of F i g . 2 with F i g . 4, shows that, while there are smal l differences when the analysis is performed this way , the trends are s imi lar . Perhaps, due to the very strong tendency for complementary doublets to have s imilar relative abundances (Fig. 1, F i g . 2), this is what should be expected. Doublet or Nearest Neighbour Analyses / 17 Figure 2. Liverwort doublet relative abundances for various categories of DNA. All the y axes are identical. All were analysed on the sense strand 5' to 3'. Total tobacco chloroplast DNA is provided for comparison. 1.65 1.55 H 1.45 1.35 1.25 1.15 1.05 0.95 0.85 0.75 0.65 0.55 1.55 1.45 H 1.35 1.25 1.15 1.05 0.95 0.85-0.75-0.65 0.55 1.55-1.45-1.35 1.25 1.15 1.05 0.95 0.85 0.75 0.65 0.55 1.55 1.45 1.35 1.25 1.15 1.05 0.95 0.85 0.75 0.65 0.55 1.55 1.45 1.35 1.25 1.15 1.05 0.95 0.85 0.75 0.65 0.55 TOTAL TOTAL-1RB TRNA CENES R1BOSOMAL PROTEIN CENES tRNA CENES TOBACCO Doublet or Nearest Neighbour Analyses / 18 F igure 3. Tobacco doublets analysed on the B-strand 5' to 3' for various categories of D N A . Note that the axes wi th in a figure are constant but va ry from figure to figure. Doublet or Nearest Neighbour Analyses / 19 Figure 4. Liverwort doublets analysed on the A-strand 5' to 3' for various categories of DNA. 1.65-1 1.55 1.45 1.35 1.25 1.15 1.05 0.95 0.85 0.75 0.65 0.55-1.55-1.45-1.35 1.25 1.15 1.05 0.95 0.85 0.75 H 0.65 0.55-1.55-1.45-1.35 1.25 1.15 1.05 0.95 0.85 0.75 0.65 0.55 1.55 1.45 1.35 1.25 1.15 1.05 0.95 0.85 0.75 0.65 0.55H 1.55 1.45 1.35 1.25 1.15 1.05 0.95 0.85 0.75 0.65 0.55 TOTAL 1NTRONS TRNA GENES tRNA GENES TOTAL- 1RB 1 Z 221 RlBOSOM AL PROTEIN GENES V77X TOBACCO Doublet or Nearest Neighbour Analyses / 20 Table 2. Spearman-Rank correlation coefficients for the relative abundances of doublets in l iverwort wi th IRs (total), wi th one IR deleted (total-IRB), and seven categories wi th in total-IRB. Doublets were analysed 5' to 3 ' on the sense strand. Tobacco is shown for comparison. Abbreviat ions: totahthe entire genome; total-IRB: the entire genome minus one of the inverted repeats; genes:protein coding genes not including ribosomal proteins; URF:unident i f ied reading frames; nomnoncoding regions; r D N A : r i b o s o m a l D N A ; tDNA:cod ing for t R N A ; introns:intervening sequences wi th in a gene. total 1.00 tot-IRB 0.99 1.00 genes 0.91 0.93 1.00 rprotein 0.69 0.68 0.63 1.00 U R F 0.91 0.93 0.94 0.72 1.00 non 0.75 0.71 0.56 0.63 0.65 1.00 r D N A 0.49 0.46 0.39 0.21 0.31 0.34 1.00 t D N A 0.25 0.22 0.08 0.30 0.05 0.39 0.49 1.00 introns 0.84 0.82 0.69 0.87 0.83 0.85 0.39 0.31 1.00 total t - IRB gene rprot U R F non r D N A t D N A intron Doublet or Nearest Neighbour Analyses / 21 Table 3. Spearman-Rank correlation coefficients for the relative abundances of doublets in the the total tobacco chloroplast genome (total), the total tobacco chloroplast genome wi th the I R A deleted (total-IRA), and seven categories wi th in tota l - IRA. A l l analyses were 5' to 3' on the B-strand. Abbreviat ions: totahthe entire genome; total-IR: the entire genome minus one of the inverted repeats; genes:protein coding genes not including ribosomal proteins; URF:unident i f ied reading frames; nomnoncoding regions; r D N A : r i b o s o m a l D N A ; tDNA:cod ing for t R N A ; introns:intervening sequences wi th in a gene. total 1.00 tot-IRA 1.00 1.00 genes 0.78 0.79 1.00 rprotein 0.88 0.89 0.67 1.00 U R F 0.96 0.97 0.83 0.87 1.00 non 0.98 0.97 0.68 0.85 0.93 1.00 r D N A 0.47 0.46 0.38 0.30 0.36 0.48 1.00 t D N A 0.83 0.83 0.54 0.78 0.70 0.85 0.46 introns 0.93 0.92 0.66 0.83 0.91 0.93 0.41 total t - IRA gene rprot U R F non r D N A 1.00 0.77 1.00 Doublet or Nearest Neighbour Analyses / 22 Table 4. Spearman-Rank correlation coefficients for the relative abundances of doublets in l iverwort wi th IRs (total), w i th one IR deleted (total-IRB), and seven categories wi th in to ta l - IRB. A l l analyses were 5' to 3' on the A-s t rand. Abbreviat ions: totahthe entire genome; total - IRB: the entire genome minus one of the inverted repeats; genes:protein coding genes not including ribosomal proteins; U R F : unidentified reading frames; nomnoncoding regions; r D N A : r i b o s o m a l D N A ; tDNAtcod ing for t R N A ; introns:intervening sequences wi th in a gene. total 1.00 tot-IRB 0.99 1.00 genes 0.93 0.94 1.00 rprotein 0.75 0.74 0.66 1.00 U R F 0.91 0.92 0.90 0.83 1.00 non 0.75 0.71 0.55 0.78 0.64 1.00 r D N A 0.49 0.46 0.58 0.16 0.32 0.34 1.00 t D N A 0.51 0.47 0.35 0.34 0.37 0.64 0.55 1.00 introns 0.84 0.82 0.71 0.92 0.88 0.86 0.27 0.50 1.00 total t - IRB gene rprot U R F non r D N A t D N A intron Doublet or Nearest Neighbour Analyses / 23 The amounts of D N A in each category (genes, U R F s , noncoding etc.), were determined as a percentage of the total-IR for both plants (Table 5). Ribosomal D N A has the highest G + C content in both plants while noncoding regions have the lowest. The observation that U R F G + C contents are between those of genes and noncoding regions suggested that some U R F s might in fact be genes. This possibility is investigated in the next chapter. C. DISCUSSION 1. Nussinov's Hypothesis In many doublet analyses performed to date it has been observed that the relative abundances ' of complementary doublets are s imilar (Nussinov 1984a, b, 1987). This is so even when a single strand of D N A is analysed and therefore is not due simply to base pai r ing rules. Nussinov suggested that this phenomenon might be due to inverted duplication events during the evolution of a genome. F o r example, i f the double stranded D N A 5 ' - A C T G A A - 3 ' 3 ' - T G A C T T - 5 ' was inverted and duplicated to yield 5 ' - A C T G A A T T C A G T - 3 ' •3 ' - T G A C T T A A G A C A - 3 ' and then an analysis was performed on one of the resulting single strands, it would be found that complementary doublets would have the same abundances. In this case, the amount of A C must equal the amount of G T and so forth. Doublet or Nearest Neighbour Analyses / 24 Table 5. The G + C contents and percentage of the genome minus one IR which each category constitutes. Note that the G + C contents of U R F s for both plants lie between the value for genes and non-coding regions, suggesting that some U R F s are in fact genes or ribosomal proteins. to t+IR total gene rprot t R N A r R N A int non U R F Tobacco G + C 0.37 0.37 0.41 0.39 0.53 0.55 0.36 0.32 0.36 % total 100 17.47 6.12 1.67 3.48 7.54 33.09 30.63 L ive rwor t G + C 0.29 0.28 0.31 0.30 0.52 0.53 0.23 0.21 0.24 . % total 100 39.17 7.78 2.39 4.49 11.60 38.32 25.14 Doublet or Neares t Neighbour Analyses / 25 Nussinov 's hypothesis is reasonable, but difficult to test, because i t invokes, as an explanation, events that occurred very long ago and are hence inaccessible. Nonetheless, it is desirable that the hypothesis be tested. In both the total chloroplast genomes, I found that the complementary doublets (single strand analysed) had s imilar relative abundances (Fig. 1,2 Table 1,2). I f this were due to the presence of the two inverted repeats, it would be predicted that removal of one IR would change this situation. A s shown in F i g . 1,2 the removal of one IR in both plants has very little effect on the doublet properties of the rest of the genome. The parametric correlation (product-moment) between doublet relative abundances of the l iverwort genome wi th and without the IR was 0.9977. Fo r tobacco the value was 0.9991. Fo r N = 1 6 , correlations are "significant" at the 0.05 level at 0.468 and at the 0.01 level at 0.590 so there is a temptation to conclude that these correlations are highly significant. Unfortunatley, such a conclusion is quite misleading. In both tobacco and the l iverwort genomes the amounts of A and T are very similar and the amounts of C and G are also very s imi lar . Due to this, i f a single stranded pseudorandom sequence of sufficient length is generated, such that its base composition matches the real sequences, it w i l l be found that the amount of C C equals the amount of G G ; the amount of A C wi l l equal that of T G etc. In short, it w i l l be found that complementary doublets w i l l have nearly identical (I used the word "equal" loosely above) relative abundances. To circumvent this problem, and to decide i f the correlations real ly were significantly different from expectation, the following simulation was performed. To test the tobacco correlation coefficient for significance, 214 different, pseudorandom, single stranded sequences, wi th the same base composition as the real genome and the same length as the real Doublet or Nearest Neighbour Analyses / 26 genome minus one IR were generated. (The program had a time l imi t to keep costs down. When the time expired, 214 sequences had been generated and this number was judged adequate.) Fo r each sequence the doublet relative abundances were calculated. Then for each sequence, an amount of the sequence equivalent to the length of one IR was "inverted and duplicated" and spliced onto the end of its parent sequence. Then the doublet relative abundances for these larger sequences were calculated. Then 214 coefficients of correlation between the doublet relative abundances before and after the IR addition were calculated. These are plotted in F i g . 5. The simulation correlation coefficients ranged from 0.9777 to 0.9989 wi th a mean value of 0.9859 and standard deviation (SD) of 0.005387. The actual value of 0.9991 is both outside the range of simulated correlation values as well as being approximate^ ' 2.44 S D from the mean. For this reason, the tobacco correlation of 0.9991 is significantly higher than that expected due to chance at the 0.01 level, conservatively estimated. Note that the observation that 0.9991 is greater than the 0.590, to which one would normal ly attribute a significance level of 0.01, is irrelevant. A s imilar s imulation was performed for the l iverwort , again keeping both base composition and lengths identical to the real sequence. This time 100 sequences were generated. The distribution of correlation coefficients is shown in F i g . 6. The correlation coefficients had a m i n i m u m of 0.9375, a m a x i m u m of 0.9883, a mean of 0.9744 wi th a S D of 0.01033. A g a i n the real value, 0.9977 falls above the range, and more than 2 S D away from the mean. Since this simulation was performed wi th 100 random sequences, it is reasonable to say that this correlation is significantly higher than expected, at the 0.05 level. Doublet or Nearest Neighbour Analyses / 27 Figure 5. The distribution of correlation coefficients between the doublet relative abundances of 214 pseudorandom sequences with and without one IR. Length of sequence and base composition are the same as the tobacco chloroplast genome. The correlation between the doublet relative abundances of the tobacco chloroplast genome with and without one IR was 0.9991 which falls outside the range of the above values. O c D CT © \Y^Y{\\Y^Y(\\Y,AY(iV^XQY/\XYAY<,\\S{<\Y(\\YxAY<L n r , n r <x O' © o o- o- o- o o o © © o- o ° o o o-C o r r e l a t i o n Doublet or Nearest Neighbour Analyses / 28 Figure 6. The distribution of correlation coefficients between the doublet relative abundances of 100 pseudorandom sequences with and without one IR. Length of sequence and base composition are the same as the liverwort chloroplast genome. The correlation between the doublet relative abundances of the liverwort chloroplast genome with and without one IR was 0.9977 which falls outside the range of the above values. 3 0 - 1 — — — — 1 Correlat ion Doublet or Nearest Neighbour Analyses / 29 The first conclusion is that it is not the presence of IRs in the chloroplast genome which cause complementary doublets to have s imilar relative abundances. If Nuss inov ' s mechanism were operating here, it would have been expected that the correlation between the genome doublet relative abundances plus and minus IR, would have been considerably lower than it was. In fact, it might be expected to fall in the ranges shown in F i g . 5 and 6. I feel that this is strong evidence against Nuss inov 's hypothesis since, here is a case where the IRs are easy to identify, and one in which removal of one IR does not change the doublet properties (i. e. they remain highly significantly correlated wi th and without one IR, more so than in the simulation). It must st i l l be allowed that the whole genome could consist of smal l IRs , though Pa lmer (1985) has indicated that repeats are rare in the chloroplast genomes, except for the large IRs and a few smal l IRs involved in insertions and deletions. It can st i l l be argued that the smal l IRs have been "muddied" by point mutations so that they are no longer detectable but such that they st i l l contribute complementary doublets. This argument, I think must be allowed as a possibility even though i t is v i r tua l ly impossible to disprove. In conclusion, this test should be seen as evidence against Nuss inov 's hypothesis. 2. Doublet suppression In both genomes there is evidence of T A suppression. Significantly, this is least evident in protein genes of l iverwort and in protein genes, ribosomal protein genes and introns of tobacco (Fig. 2,4). If T A supression were a form of stop codon avoidance then T A should be, i f anything, rarer in protein genes. This is Doublet or Neares t Neighbour Analyses / 30 not observed and is in agreement wi th the conclusion of Boudraa and Per r in (1987) that stop codon avoidance is not the cause of T A suppression. It is also consistent wi th the finding that the doublet T A is rare in intervening sequences (introns and noncoding) (Smith et al. 1983). The two genomes examined here also exhibit C G suppression which B i r d (1980) (see also M a z i n and V a n y u s h i n 1987a,b) suggested was due to methylat ion of C followed by mutat ion of C G to T G ; i n these chloroplasts, there is no methylat ion of bases and no compensating overabundance of T G . M a z i n and V a n y u s h i n (1987a) found in Drosophila, in which C is not methylated, that C G was suppressed. They suggested that the D N A of ancestors of this genus was normal ly methylated at some time, which hypothesis can readily be extended to " in any organism in which C G is suppressed, this attests to methylat ion of the ancestral D N A of that organism". We do not know when chloroplasts lost their methylase or i f they ever had one, and the lack of compensating T G argues, at least in the case of chloroplasts, against this idea. Indeed, even more elaborate hypotheses can be put forward to explain C G suppression. Fo r example, a gene from the chloroplast could move to the nucleus (Lewin 1984), reside there and be methylated, and lose its C G before moving back into the chloroplast but, gene transfer into the chloroplast does not seem to occur, perhaps because inserts disrupt v i ta l functions (Lonsdale 1987, Shinozaki et al. 1986). One could just postulate that modified genes imported into the chloroplast could recombine with genes already there. (As an aside it might be mentioned that i t is possible that transfer into the chloroplast does not occur because chloroplasts seem to lack reverse transcriptase, while the nucleus and mitochondria probably have this enzyme (Schuster and Brennicke 1987).) Evidence supporting this view is that only transcribed sequences have Doublet or Nearest Neighbour Analyses / 31 been found in more than one genome in a cell, thus, the hypothesis of M a z i n and V a n y u s h i n (1987) and the more elaborate one above seem off the mark, par t ia l ly because they are not parsimonious and part ly because they have a "hole-patching" quality to them. It is simpler to propose that methylat ion is not a complete explanation of G G suppression. 3. Thermodynamic considerations The data appear to indicate that R Y and Y R doublets (R=pur ine , Y = pyrimidine) have low relative abundances. Calladine (1982) suggested that this was due to across-strand steric hinderance between the two purines in such a conformation. The evidence for this, taken together wi th the notion that bases or doublets are selected according to their physical-chemical properties, is to me more appealing than some of the other suggestions as discussed above. Calladine's hypothesis is easy to test. R Y and Y R doublets ought to be rare throughout the genome in order to reduce k inking and stress in the D N A (a reminder R = A , G Y = C,T) . Figures 1-4 show that R Y and Y R doublets are suppressed. C G and T A are also Y R doublets. Breslauer et al. (1986) have determined the standard biochemical thermodynamic enthalpy AH 0, standard biochemical entropy AS°, and standard biochemical Gibbs free energy AG 0 , for the breaking of the bond between a doublet on one strand and its complement on the other strand. These data have been taken from Table 2 of Breslauer et al. (1986) and reappear in m y Table 6. These data were obtained by differential scanning calorimetr}', circular dichromism and ultraviolet spectroscopy of synthetic oligomers. These data suggest that the doublet T A binds most weakly (AG°=0.9 kcal mol ) while C G Doublet or Nearest Neighbour Ana lyses / 32 Table 6. Thermodynamic parameters for each doublet. For each doublet A H ° (kcal mol ), A S ° (cal K mol ), A G (kcal mol ) (Breslauer et al. (1986), flexibili ty, the probability of rol l ing clockwise, the probability of roll ing counter-clockwise, the probability of t i l t ing clockwise and the probability of t i l t ing counter-clockwise (Srinivasan et al. 1987) are shown. For a more com description of these parameters the reader is urged to refer to the orij works. base A H ° • A S 0 A G ° flex r o l l l roll 2 t i l t l t i lt2 A A 9.1 24.0 1.9 7.1 41 59 66 34 A C 6.5 17.3 1.3 2.9 13 87 78 22 A G 7.8 20.8 1.6 3.5 12 88 76 24 A T 8.6 23.9 1.5 4.0 18 82 71 29 C A 5.8 12.9 1.9 6.0 29 71 65 35 C C 11.0 26.6 3.1 2.6 8 92 81 19 C G 11.9 27.8 3.6 4.6 11 89 69 31 C T 7.8 20.8 1.6 3.3 16 84 71 29 G A 5.6 13.5 1.6 5.1 33 67 58 42 G C 11.1 26.7 3.1 3.8 10 90 71 29 G G 11.0 26.2 3.1 2.7 8 92 63 37 G T 6.5 17.3 1.3 4.2 21 79 59 41 T A 6.0 16.9 0.9 7.2 40 60 57 43 T C 5.6 13.5 1.6 3.6 19 81 69 31 T G 5.8 12.9 1.9 5.0 20 80 62 38 T T 9.1 24.0 1.9 5.7 38 62 52 48 Doublet or Nearest Neighbour Analyses / 33 ( A G 0 =3.6 kcal mol ) binds most strongly. In this regard, Boudraa and Per r in (1987) suggested that T A and C G might be selected against in order to "thermodynamical ly optimize" D N A . In other words, perhaps T A binds too weakly and C G binds too strongly. I shall return to this idea shortly but first wish to consider other hypotheses proposed wi th regard to doublet abundances. Trif inov and Sussman (1980) performed autocorrelations on 36,000 bp from a var ie ty of sources and found that in eukaryotes, but not prokaryotes, doublets tend to be distributed along the D N A strand wi th a periodicity of about 10.5 bp which is the number of bases in one complete turn of the D N A helix in B - D N A t . They suggested that this is in order to curve D N A so that i t could bind histones and form chromatin. Prokaryotes lack histones. The fact that doublets have some tendency to repeat every 10.5 bp means that i f a d s D N A helix was examined from one side of the cylinder it would fit into, that side would have a preponderance of doublets of a type. In eukaryotes, this could be involved in histone binding (Trifinov and Sussman 1980). There are suggestions that chloroplasts contain a histone-like protein (Briat et al. 1982,1984) and Gant t and K e y (1987) have cloned a gene for a pea histone that is associated wi th chloroplasts though it is not clear i f this histone is inside the chloroplast. Other work showing chloroplast D N A i n chromatin-like formations (Coleman 1985) also suggests that plastid D N A could be bound to histones or histone-like proteins. Thus, the need to bind proteins could p lay a role in doublet selection and position in chloroplast D N A . However, autocorrelations of the two genomes revealed no peak at 10.5 bp (data not shown). St i l l it is necessary to keep in t B - D N A is the "normal" helix form of D N A . D N A twisted the oppposite way is Z - D N A . Doublet or Nearest Neighbour Analyses / 34 mind that the requirement that D N A and cytoplasm must interact may well exert a selection pressure on D N A organization, even at the single or doublet base level. Another suggestion concerning doublet abundances involves repair mechanisms. Occasionally when D N A is replicated, errors are made in the copying process. Polymerases that replicate D N A have error checking capabilities and may reverse and correct an error. In E. coli, in certain short sequences where errors m a y occur, systems exist that wi l l correct them (Jones et al. 1987). These systems depend heavi ly on the adjacent sequences and can only repair certain types of error. A s w i l l be discussed in a later chapter, the presence of certain doublets, in correct quantities, may a way to achieve sufficient Shannon-redundancy (Gatl in 1972a) to enable error-repairing systems to locate errors. M y a im in this ini t ial chapter is to incorporate the ideas of Sr in ivasan et al. (1987), Bres lauer et al. (1986) and Calladine (1982), along wi th the data in F i g . 1-4, into an explanation of doublet preference based on classical physical-chemistry. Table 6 shows A H ° , A S ° and A G 0 for breaking the H-bonding between a doublet on one strand and it's complement on the other strand. A large A G ° implies a strong binding. (These data are taken from Breslauer et al. (1986).) In addition, Table 6 also shows flexibili ty, probability of roll ing clockwise or counter-clockwise and the probabili ty of t i l t ing clockwise or counter-clockwise for bases involved in each doublet:complement arrangement. (These data are taken from Sr in ivasan et al. (1987).) In agreement wi th the low A G 0 for T A , this doublet also has the greatest flexibility. A G ° and flexibility are negatively Doublet or Nearest Neighbour Analyses / 35 correlated. This is reassuring since they were obtained independently by different groups of workers. M y prel iminary work showed that the doublet relative abundances for tobacco correlated positively wi th AG°, AH°, AS° and negatively wi th flexibil i ty. I also found that doublets seemed to be selected that rolled one way rather than the other and tilted one way rather than the other, which suggested that doublets might be selected on the basis of their thermodynamic properties. One testable idea that follows from this is that such should occur in all organisms. In addition to doublet relative abundances for tobacco and l iverwort , I took data from the literature for E. coli (Blake and Ea r l ey 1986), eukaryotes, mitochondria (Nussinov 1984b), vertebrates, nonvertebrates (a designation from Nuss inov 1984a), D N A viruses, nonretro R N A viruses and retroviruses (Nussinov 1984a). In al l groups doublet relative abundances correlate positively with AG° (Table 7), though only 5 of the 10 correlations are greater than 0.425 and significant at the 0.05 level. However the probabili ty of obtaining 10 positive or 10 negative correlations (since we would be impressed in either case) due to chance, assuming as a nu l l model that positive and negative 10 -9 -3 correlations are equally l ikely is 2/(2 ) = 2 = 1.95 x 10 . Therefore, conservatively, at the 0.01 level, based on this nul l model it is significant that all AG° values are positively correlated wi th doublet relative abundances. None of • the other thermodynamic or topological parameters is significantly correlated at the 0.05 level wi th doublet relative abundances under this nul l model. This also suggests that a f i rm, solid D N A structure is preferred to an overly flexible structure. This could help put doublet relative abundances on solid "non-biological" grounds. Doublet or Nearest Neighbour Analyses / 36 Table 7. Spearman-Rank correlationg between doublet relative abundances and thermodynamic properties of the doublets. Note the exclusively positive correlations between AG ° and abundance in all groups. The pi us and minus signs at t l bottom of the columns are summaries of the correlations for that column. source AH° AS 0 AG 0 flex r o l l l roll2 t i l t l t i lt2 tobacco 0.09 0.18 0.39 -0.22 -0.10 0.10 0.01 -0.01 l iverwort 0.50 0.48 0.72 -0.27 -0.34 0.34 0.15 -0.15 E. coli 0.19 0.15 0.44 0.42 0.24 -0.24 -0.12 0.12 eukaryotes 0.10 0.03 0.46 -0.19 -0.10 0.10 0.084 -0.08 mitochondria 0.39 0.37 0.15 -0.18 -0.22 0.22 0.11 -0.11 vertebrates -0.08 -0.15 0.30 -0.20 -0.07 0.07 0.17 -0.17 nonvertebrates 0.34 0.30 0.70 0.00 -0.02 0.02 -0.07 0.07 D N A viruses 0.21 0.16 0.49 -0.26 -0.14 0.14 0.05 -0.05 nonretro R N A viruses -0.40 -0.44 0.21 -0.00 0.14 -0.14 -0.14 0.14 retro viruses 0.10 0.06 0.35 -0.31 -0.15 0.15 0.11 -0.11 + + + + + . . Doublet or Nearest Neighbour Analyses / 37 The hypothesis that doublet selection is influenced by thermodynamic properties seems reasonable. A s imilar suggestion, that codons are chosen on the basis of thermodynamic properties has been made by Rowe and Trainor (1983a). Their work however was based almost entirely on a model rather than more empir ical ly as has been done here. The observation that R Y and Y R doublets are selected against, probably for reasons connected wi th steric hinderance avoidance (Calladine 1982) in a l l species (Nussinov 1987) suggests that the selection is associated specifically wi th D N A , rather than in general, the organisms themselves. The finding that doublet relative abundances correlate wi th at least one thermodynamic property is in agreement with such an idea and no special explanations are required to account for C G suppression, since, C G is, as suggested by Boudraa and Pe r r in (1987), s imply thermodynamical ly undesirable (because it binds too strongly). This is not to suggest that some other explanations do not account for some doublet abundances. Doublets m a y well be involved i n repair of mutation damage, methylat ion etc. W h a t is suggested is that this simple explanation has a certain intellectual appeal, though it requires much further evidence and testing. In conclusion, thermodynamic properties seem to explain some but by no means al l of the doublet relative abundances observed in the chloroplast D N A sequences, and the other sequences examined. In chapter 4 evidence is presented to show that informational properties also are important in doublet selection. Undoubtedly there is a hierarchy of constraints present in these systems. III. PREDICTION AND IDENTIFICATION OF CHLOROPLAST GENES A. DESCRIPTION OF THE PROBLEM This chapter is concerned wi th the identification of protein coding genes. V e r y often when D N A is sequenced i t is not known i f the D N A contains a gene or not. This can happen when a clone that hybridizes wi th a probe for a known gene is sequenced or i t can also happen when a large region, such as an organellar genome, is completely sequenced. It is desirable to be able to identify genes from sequence data for a var iety of reasons. F i r s t , biochemical methods of identifying genes, such as looking for transcripts, can fail i f a) the transcript is in low quantity or b) the gene for the transcript is " o f f due to developmental stage of the organism or effects of the environment from which the organism was taken; second, an accurate way to predict genes would facilitate experimental design. N o r m a l l y , a researcher has only enough resources for a subset of a l l desirable experiments and accurate predictions would reduce wasted time and material ; third, the way in which genes can be identified from sequence data might yield useful insights into genome organization and constraints, i . e., i f genes can be found based merely on the differences in their sequences, this can tell us something about gene us. nongene organization. A s Staden (1984) has explained, gene identification is a multistep process. Star t ing wi th a sequence, open reading frames (also called unidentified reading frames or U R F s ) are identified. Introns, i f any, are located. Then an attempt is made to determine whether this U R F is actually transcribed or whether i t is 38 Prediction and identification of chloroplast genes / 39 merely due to chance; i . e. even in a random sequence U R F s w i l l occur, and the challenge is to distinguish between U R F s of this type, and genes. These steps are not a lways performed in this order. Some methods, such as Fickett 's (1982)- and Staden's (1984) positional base methods, can indicate that a region contains a gene but not on which strand it is . A l l methods rely on one of two principles; either they locate promoters, splicing signals etc. based on known consensus sequences, or they rely on the differences between sequence properties of genes and nongenes, caused by the necessity of coding for protein. There is considerable information available in the literature regarding recognition sites in D N A . Chloroplast promoters have been well characterized and are in many cases s imi lar to prokaryotic ones (Hanley-Bowdoin and Chua 1987, see also Zurawsk i and Clegg 1987). In plants (An 1987) and in Drosophila (Nussinov and Lennon 1984), there is evidence that promoter regions m a y form Z - D N A . (For a brief review of Z - D N A see Wells (1988).) Intron splicing signals in chloroplasts are also well known (Ohyama et al. 1986, Shinozaki et al. 1986, Umesono and Ozeki 1987). This chapter then is dependent on published work performed by O h y a m a et al. (1986) and Shinozaki et al. (1986), reporting the positions of U R F s . This work was performed by searching for start and stop codons that are sufficiently far apart and in frame, while taking into account intron splicing signals. The problem for this dissertation is to determine which of the 39 U R F s i n tobacco and 18 in l iverwort greater than 200 bp are real ly genes. It is well known that amino acid usage in organisms is nonrandom (eg. Jukes et al. 1975). In tobacco, the codon usage for known genes has been tabulated Prediction and identification of chloroplast genes / 40 (Wakasugi et al. 1986) and is also not indicative of randomness. Fo r example the stop codon usages are T A A : 2 5 , T A G : 9 and T G A : 5 , and s imilar results apply to other codon selections so that it has proved possible to identify genes based on codon preference (Gribskov et al. 1984). I have (implicitly) employed this method using a clustering algori thm. Sheperd (1981) has shown that many genes, when read in the correct frame, are closer to having all R N Y codons (N = any base) than in the other two frames. He suggested that this might be due to the original code having been an R N Y code. In the mater ia l presented below, I have described the success that method has in distinguishing between genes and nongenes in the two chloroplast genomes. Ficket t ' s (1982) method is based on an autocorrelation wi th period 3, and is performed as follows: Four parameters called A-posit ion, and l ikewise C , G and T-position are determined. F o r A-posit ion, the number of A s in positions 3n, 3n +1 and 3n + 2 are found for the U R F in question. A-position = the m a x i m u m of these values/(minimum of them+1) . For example, suppose there are 8 A s in positions 3n, 20 in positions 3 n + l and 15 in positions 3n + 2. Then A-posit ion = 20/(15 +1). The other 3 position parameters are calculated identically for the corresponding bases. In addition, 4 content parameters A-content...T-content are s imply the fraction of the strand comprised of each base. Based on a look-up tablet (Table I of Ficket t 1982) 8 probabilities of coding are found for content and position. This gives p ...p .. Based on the Los Alamos data base, 1 8 Ficket t then assigned weightings w ...w to each probability (Table II Ficket t 1 8 1982). A "testcode" value is found t A look-up table is a compilation of data which provides an output value given an input value. Prediction and identification of chloroplast genes / 41 p w + p w ... p w , which, based on another look-up table, indicates the over 1 1 * 2 2 8 8 al l probability of the U R F being a coding unit (i. e. a gene), and it gives a decision (prediction) that the U R F is a gene, is not a gene, or that i t is unsure. It is important to note that the Los Alamos data base of 1982 might not be representative of "a l l D N A " , which Ficket t recognized in suggesting that the test could likety be improved for use on specific taxonomic groups. This has been done for chloroplast D N A below. In many organisms and in mitochondria the genetic code is unorthodox (Lagerkvist 1981, F o x 1987 review articles) but in chloroplasts, so far as is known, the standard code is used (Palmer 1985, Shinozaki et al. 1986, O h y a m a et al. 1986, Zu rawsk i and Clegg 1987), and al l the t R N A s . required by ,the chloroplast are coded on the chloroplast genome (Wakasugi et al. 1986, Umesono and Ozeki 1987). There are approximately 30 different t R N A s coded on the chloroplast genome, which is more than in mitochondria and less than E. coli (Umesono and Ozeki 1987). The relevance of this to gene identification is that there is a relationship between t R N A abundance (availability?), codon usage and protein abundance, though the cause and effect aspect of this sti l l seems obscure. Thus, Ikemura (1981a, b 1982) noted correlations between t R N A abundance and codon usage in E. coli and yeast and suggested that genes select codons that correspond to abundant t R N A s . Gran tham et al. (1980, 1981) found that the genes coding for the most abundant proteins had codons corresponding to the most abundant t R N A s but that less abundant proteins had rarer codons corresponding to less abundant t R N A s . General ly the less abundant the protein, the rarer the t R N A s required to synthesize it. Due to this observation, they put Prediction and identification of chloroplast genes / 42 forward the genome hypothesis. The idea is that codon choice is a way of regulating gene expressivity. Bennetzen and H a l l (1981) working wi th yeast, obtained results in good agreement with Grantham's group and wi th Ikemura, but noted in addition that the codons that are most abundant do not include those containing 100% G , C , A , T , , or G C or A T . The desirablity of equal binding energies for a l l codomtRNA interactions was postulated. A l l this work seems to suggest that the t R N A levels are the "given" and that the genome modifies its codon usage accordingly. However , a recent paper Adjustment of the tRNA population to the codon usage in chloroplasts (Pfitzinger et al. 1987), suggests that perhaps it is the codons which are the "given" and it is the t R N A levels which are adjusted accordingly. O n this point there seems to be no clear empir ical evidence either w ay but I would suggest that, since the genome (nuclear) is the same in al l cells of an organism but proteins are not, it follows that the t R N A levels must be adjusted to the codons present rather than vice versa. It is not difficult to imagine a situation, however, where the t R N A levels are the given and the codons for the proteins necessar} 7 at every stage of an organism's development have been selected in such a way that the t R N A s are required in the given amounts. Personally, I have difficult}' wi th this idea because the requirements of the organism are also a function of the environment, not just the developmental stage, and it seems most parsimonious to adjust the parts supply to the demands of the software, not the reverse, especially when there is no obvious a priori reason why the parts supply cannot be controlled. In any case, the patterns (of base usage, codon usage etc.) found in genes that distinguish them from nongenes, appear to be a function of a multitude of processes occuring in the cell or the organelle. In the work described below concerning gene prediction, Prediction and identification of chloroplast genes / 43 this should be kept in mind throughout; particularly, since one of the findings is that some prediction methods, which work well in other systems, work poorly in chloroplasts. Possible reasons for this will be discussed later. B. MATERIALS AND METHODS Programs were written by the author in Pascal. For Sheperd's test, the number of mutations required to convert all codons to RNY were tabulated in frame I (in frame) and in frames II and III. An URF was predicted to be a gene only if the number of such mutations was least in frame I. Fickett's method was performed exactly as per Fickett (1982), and the method was also modified as follows: the 46 known protein coding genes in tobacco and the first 46 noncoding regions greater than 200 bp, starting at base 1 on the B-strand and moving 5' to 3', were used; the four position and content parameters were calculated for each of the 92 sequences and these were used exactly as per Fickett to generate 8 new weighting parameters (Table II of Fickett 1982)(Table 8); Fickett's testcode is then used verbatim. Cluster analyses were performed using MIDAS (Fox and Guire 1976) with Ward's minimum variance algorithm to cluster euclidean distances. The data for the cluster were the fractional abundances of each of the 64 codons for each gene and URF. This procedure places genes which have the most similar codon usage closest together in cluster diagrams. Prediction and identification of chloroplast genes / 44 Table 8. N e w weighting parameters for Ficket t ' s testcode procedure. These were derived from the tobacco chloroplast genome. See methods. A C G T Position 0.22 0.24 0.26 0.43 Content -0.11 -0.15 0.41 0.09 Prediction and identification of chloroplast genes / 45 C. RESULTS The predictions of Sheperd's method appear in Table 9 for tobacco and Table 10 for l iverwort . This method predicts that in tobacco, 13 of 46 known genes listed are not genes and that in the l iverwort 11 of 54 known genes are not genes. This error rate (24%) for. known genes means that predictions regarding U R F s are not reliable. They are presented herein only in the interests of completeness. (Note that I assume that the ndh genes and other genes which bear strong s imi lar i ty to prokaryotic genes, are in fact genes even though in many cases no protein or t ranscript has been found.) Ficket t ' s testcode, when used on tobacco, misidentified 8 known genes as nongenes , and was unsure about 16 others (Table 9). F o r the l iverwort , the testcode mistook 1 known gene for a nongene and was unsure about 11 others (Table 10). Given these results, I modified Fickett 's test in an attempt to improve its performance on chloroplast sequences. This modified procedure produced 2 false negatives and 5 indecisions in the 46 known tobacco genes (Table 9), and for the l iverwort , there were 2 false negatives and 8 indecisions in the 54 known genes (Table 10). This is an error rate of less than 5% and an indecision rate of 12%, a considerable improvement over the original test result. Equa l ly importantly, none of the 46 tobacco non-coding regions were identified as genes by this method (data not shown). However , it is somewhat circular reasoning to develop the test using tobacco sequences, and then determine the ut i l i ty of the test using only two plants, one of which is tobacco. In order to check that the modified weightings (Table 8) were generally applicable to chloroplast D N A and not just to these two plants, the test was tried on a Prediction and identification of chloroplast genes / 46 Table 9. Predictions of genes in the tobacco chloroplast genome. For a l l tests c indicates coding, n indicates non-coding, and ? indicates indecision. Nomenclature as in Shinozaki et al. 1986. S denotes the decision made bj ' Sheperd's test. Prob is the probability of being a gene and D is the decision made by Ficket t ' s test in its original and modified forms. "gene" S_ Fickett 's Modified Prob D Prob D a tpA c 0.98 c 1.00 c atpB c 0.98 c 1.00 c a tpE c 0.98 c 1.00 c a tpF c 0.07 n 0.92 c a t p H n 1.00 c 1.00 c a tp l c 0.92 c 0.92 c infA c 0.07 n 0.40 ? n d h A n 0.77 ? 0.98 c ndhB c 0.29 n 0.04 n ndhC n 0.77 ? 1.00 c ndhD n 0.77 ? 1.00 c ndhE c 0.92 c. 0.77 ? ndhF c 0.40 ? 0.92 c U R F 3 9 n 0.40 ? 0.92 c U R F 6 2 c 0.98 c 0.92 c U R F 7 0 A n 0.00 n 0.07 . n U R F 7 OB c 0.92 c 1.00 c U R F 7 3 n 1.00 c 1.00 c Prediction and identification of chloroplast genes / 47 "gene" U R F 7 4 A U R F 7 4 B U R F 7 5 U R F 7 7 U R F 7 9 U R F 8 0 U R F 8 2 U R F 8 7 U R F 9 0 U R F 9 2 U R F 9 8 U R F 9 9 A U R F 9 9 B U R F 1 0 3 U R F 105 U R F 1 1 5 U R F 1 3 1 U R F 134 U R F 138 U R F 1 5 4 U R F 1 5 8 U R F 167 U R F 184 n c n c c n n n c c n c n Ficket t ' s Prob 0.00 0.40 0.07 0.07 0.04 0.92 0.77 0.04 0.92 0.04 0.07 0.29 0.92 0.07 0.07 0.04 0.40 0.92 0.04 0.04 0.40 0.29 0.29 D n n ? n n n Modified Prob 0.00 0.77 0.00 0.92 0.29 0.98 1.00 0.29 1.00 0.07 0.00 0.04 0.92 0.00 0.07 0.07 0.92 1.00 0.00 0.07 0.98 0.92 0.29 D n c c n c Prediction and identification of chloroplast genes / 48 "gene" U R F 2 2 8 U R F 2 2 9 U R F 2 8 4 U R F 3 1 3 U R F 3 5 0 U R F 3 9 3 U R F 5 0 9 A U R F 5 1 2 U R F 5 8 1 U R F 12 44 U R F 1 7 0 8 pe tA petB petD psaA psaB psbA psbB psbC psbD psbE psbF rbcL c c c n c c Ficket t ' s Prob 0.29 0.07 0.40 0.29 0.07 0.92 0.04 0.77 0.07 0.07 0.07 0.92 0.77 0.98 0.92 0.40 0.77 0.92 0.77 0.77 0.29 0.77 0.98 D n n ? n n c n n n c ? c c •? ? c ? ? n 9 Modified Prob 0.98 0.00 0.92 0.07 0.04 1.00 0.00 0.98 0.07 0.29 . 0.29 1.00 0.98 1.00 0.98 0.98 0.98 1.00 1.00 0.92 0.92 1.00 1.00 D n Prediction and identification of chloroplast genes / 49 "gene" _S Ficket t ' s Modified Prob JJ Prob JJ rpl2 c 0.92 c 1.00 c rp!14 c 0.92 c 1.00 c r p l l 6 n 1.00 c 1.00 c rpl20 c 0.77 ? 1.00 c rpl22 c 0.40 ? 0.92 c rpl23 c 0.40 ? 0.98 c rpl33 c 0.29 n 0.98 . c rpoA c 0.40 ? 0.77 ? rpoB c 0.92 c 1.00 c rps2 c 0.77 ? _ 1.00 c rps3 c 0.92 c 1.00 c rps4 n 0.92 c 0.98 c rps7 n 1.00 c 1.00 c r p s l l c 1.00 c 1.00 c r p s l 4 n 0.29 n 0.77 ? r p s l 5 n 0.04 n 0.04 n r p s l 6 c 0.92 c 1.00 c r p s l 8 n 0.29 n 0.98 c r p s l 9 c 0.92 c 0.92 c secX n 0.98 c 1.00 c ssb c 0.77 ? 0.77 ? Prediction and identification of chloroplast genes / 50 Table 10. Predictions of genes for Marchantia. Nomenclature as in O h y a m a et al. 1986. Abbreviations as in Table 9. "gene" _S Fickett 's Modified Prob D Prob JD a tpA c 0.92 c 1.00 • c a tpB c 0.98 c 1.00 c a tpE c 0.92 c 1.00 c a tpF c 0.92 c 1.00 c a t p H c 1.00 c 1.00 c a tp l c 0.98 c 0.98 c f r x A c 0.98 c 1.00 c f rxB c 0.77 ? 0.77 ? f rxC c 0.98 c 1.00 c in fA c 0.98 c 0.98 c m b p X n 0.77 ? 0.77 ? n d h l n 0.92 c 0.40 ? ndh2 c 0.77 ? 0.07 n ndh3 n 0.77 ? 0.40 ? ndh4 n 0.92 c 0.98 c ndh4L c 0.98 c 0.98 c ndh5 c 0.77 ? 0.92 c U R F 6 9 n 0.98 c 0.98 c U R F 7 4 c . 0.98 c 1.00 c U R F 1 3 5 c 0.40 ? 0.07 n U R F 1 6 7 c 0.98 c 1.00 c Prediction and identification of chloroplast genes / 51 "gene" U R F 169 U R F 1 8 4 U R F 2 0 3 U R F 2 2 8 U R F 3 1 6 U R F 3 2 0 U R F 3 7 0 i U R F 3 9 2 U R F 4 3 4 U R F 4 6 4 U R F 4 6 5 U R F 5 1 3 U R F 1 0 6 8 U R F 2 1 3 6 pe tA petB petD psaA psaB . psbA psbB psbC . psbD Ficket t ' s Prob 0.40 0.92 0.92 0.92 0.92 0.29 0.40 0.92 0.29 0.77 0.92 0.92 0.29 0.40 0.92 1.00 0.98 0.92 0.92 0.92 0.98 0.98 0.92 D c c n 7 c n 7 c c n ? Modified Prob 0.92 1.00 1.00 0.98 1.00 0.07 0.77 1.00 0.07 0.92 1.00 1.00 0.77 0.77 1.00 1.00 0.98 1.00 1.00 0.98 1.00 1.00 0.98 D c c c n ? Prediction and identification of chloroplast genes / 52 "gene" _S Ficket t ' s Modified Prob JJ Prob D psbE c 0.77 ? 1.00 c psbF n 0.92 c 1.00 c psbG c 0.77 ? 0.77 ? rbcL c 0.92 c 1.00 c r p l l 4 c 0.92 c 1.00 c r p l l 6 c 0.98 c 1.00 c rpl2 c 0.98 c 1.00 c rpl20 c 0.98 c 0.98 c rpl21 n 0.92 c 0.98 c rpl22 c 0.92 c 0.92 c rpl23 c 0.77 ? 0.92 c rpl33 c 0.92 c 0.92 c rpoA c 0.92 c 0.92 c rpoB c 0.98 c 0.98 c r p o C l c 0.98 c 0.98 c rpoC2 c 0.77 ? 0.77 r p s l l c 0.98 c 1.00 c r p s l 2 c 1.00 c 1.00 c r p s l 4 n 0.77 ? 0.77 ? r p s l 5 c 0.29 n 0.29 n r p s l 8 n 0.77 ? 0.77 ? r p s l 9 c 0.98 c 0.98 c rps2 c 0.92 c 0.92 c Prediction and identification of chloroplast genes / 53 "gene" _S Fickett 's Modified Prob D Prob D rps3 c 0.92 c 1.00 c rps4 n 0.92 c 0.92 c rps7 c 0.92 c 0.98 c rps8 c 0.92 c 0.98 c secX n 0.92 c 1.00 c Prediction and identification of chloroplast genes / 54 variety of other chloroplast sequences from different species. The results are in Table 11. Aga in , the modified test again performs better than the original, and also again does not mistake any non-coding regions for coding ones. Therefore, I am reasonably confident that any U R F s predicted in this work to be genes are in fact coding for protein. A s an independent check of the accuracy of the predictions of Fickett 's modified procedure, euclidean distances between codon usage for each gene and U R F were clustered for both plants (Fig. 7,8). There is a tendency for genes wi th related functions to cluster together. In tobacco two distinct clusters were found which contained 13 of the 21 putative noncoding U R F s , while, in l iverwort, the 3 indecisively identified U R F s were clustered. D. DISCUSSION Sheperd's test, which predicts genes rather poorly in these sequenced chloroplast genomes does so fair ly wel l in m a n y other organisms. The reasons for this failure are of considerable interest. Sheperd (1981) suggested that the pr imit ive code was an R N Y code and that enough of that code remains in present day sequences to identify genes, even though mutations have occurred. Staden (1984) inquired i f Sheperd's method was s imply detecting codon preferences. B y analyzing D N A . sequences, which were computer generated without codon bias, Staden showed that this is not what the method does. H e also wondered whether Sheperd's method was detecting an absence of stop codons, which are a l l Y R R codons. In an experiment using computer generated D N A sequences wi th various Prediction and identification of chloroplast genes / 55 Table 11. A comparison of Fickett 's testcode procedure and its modified form when applied to genes and non-coding regions from v other species. The modified form performs somewhat better in that it is not uncertain about the identity of genes in 3 cases where the original form is uncertain. In no case does the modified form identify a non-coding region as a gene. Acc # is Genbank access number. Other abbreviations as in Table 10. Species Acc # description Fickett 's Modified Prob D Prob D Amaranthus hybridus K 0 1 2 0 0 , psbA 0.92 c 0.98 c Hordeum vulgare X 0 0 6 3 0 rbcL 0.98 c 1.00 c Chlamydomonas reinhardtiM.13704 atpB 0.98 c 1.00 c Zea mays JO1421 cfB 0.92 c 1.00 c Zea mays J 0 1 4 2 1 cfE 0.92 c 1.00 c Zea mays M 1 2 7 0 4 psbG 0.40 ? 0.40 ? Zea mays M 1 1 2 0 3 non538 0.00 n 0.00 n Zea mays M 1 1 2 0 3 p s l a l 0.40 ? 0.98 c Zea mays M 1 1 2 0 3 p s l a 2 0.40 ? 0.77 ? Zea mays X 0 1 6 9 8 rps4 0.40 ? 0.98 c Oenothera hookeri X 0 3 7 8 0 psbE 0.77 7 1.00 c Pisum sativum X 0 3 5 7 5 non224 0.00 n 0.04 n Pisum sativum X 0 3 5 7 5 a tpA24 0.92 c 0.92 c Solanum nigrum X 0 1 6 5 1 non318 0.04 n 0.00 n Solanum nigrum X 0 1 6 5 1 psbA 0.92 c 0.98 c Spirodela oligorhiza X 0 3 8 3 4 r p l l 6 1.00 c 1.00 c Spinacia oleracea X 0 1 7 2 4 psbC. 0.92 c 1.00 c Prediction and identification of chloroplast genes / 56 Vicia faba X 0 0 6 8 2 U R F 0.00 n 0.00 n Triticum aestivum J01458 atpase 1.00 c 1.00 c Prediction and identification of chloroplast genes / 57 Figure 7. Cluster analysis of liverwort genes and URFs greater than 200 bp. Genes of a type tend to cluster together eg. the ndh genes are clustered as are the rpL genes. The three URFs for which the modified Fickett's test made no decision are also clustered. Unlike tobacco the URFs predicted to be noncoding are not clustered. The reason for this is unknown. • -i LIVERWORT n<Jh 2 • • -ndh4 • - -URF 320 •» - -URF228 • - -ndh3 • • -ndh5 • - * rwJfil • • -URF184 • - -n d M L • - -• t p l • - • p .bD • - -p*bC • - -p«bB • - -P»BB • p i « A « - I pe tB • - • pe tD • - -psbE « - -p i b F • - -• t p H • - • pabA • • -rpoB rpoC abpx r p s 2 p c t t f r x C URF5I3 URF485 URF 39 2 URF318 pabG URF434 URFIBB • t p F • tpE URF187 • tpA • tpB URF 20 3 r b c L URF74 t r « A r p o M URF1088 URF3701 URF 2138 URF484 r p i 1 5 URF135 rpL33 URF 6 9 rpoA r p s 3 r p L 2 t r p s 7 r p L 2 0 r p » 8 r p L 2 rpL14 r p e l B rpL23 f r u C r p L I S r p L Z Z r p » 1 2 InfA f r x B r p » 4 r p a l B r p i 1 4 M C X •I I I -I I I -I I-I I 1-1 • ..-I • - . - I -• •I Prediction and identification of chloroplast genes / 58 Figure 8. Cluster analysis of the tobacco genes and URFs greater than 200 bp. Of the 21 URFs predicted to be noncoding, 13 are in two distinct clumps. As in the liverwort genes of a type tend to cluster. The photosynthesis genes psa and psb are good examples of this. ---I i -• II — i - i i -•II TOBACCO • tpA •tpB •-.-1... URF393 • rpoB • patA » •tpE • rpL4 • • tpF • URF22B • URF167 • URF103 • URT134 • URF284 • URFS12 • rpa2 « URF1244* rpoA • rp»3 • mJhB • • URF1B4 • URF229 • URF350 • URF5090* URF5B1 I URF17D8*-I I URF115 • URFB7 • URF82 • URF73 • InfA • URF7 7 • URF74b • I URF 156 •• I-URF105 • URF 138 • URF B0 • URF92 • p«bE • URF 39 • URF75 • URF 80 • •tpH • p»bF • URF62 • patB » I p*aA • 1 I paaB * 1 p«bC • pabD • • pabB • • rbcL • pabA • atpl • ndhA • ndhO • I MhF • I -URF313 • petO • ndhC • ndhE • URF9BB URF 70s • URF7B • URF74a • URF154 • URFBB • URF 700 • URFBBa « rpa15 * aaB • rpL33 • rpalB • rpalB • URF131 • rpL23 • rpL20 » rpa14 • rpL2 • rpalB • rpL22 • rpa4 • rpa7 • rpLIB • rpa l l • — I -1 I-- I I •--1 -I I I I--I-I -I -II-— I Prediction and identification of chloroplast genes / 59 amounts of stop codons in each of the 3 frames, Staden showed that the lack of stop codons is part ly, but not solely responsible for the success of the method. Therefore, it appears that the success of the Sheperd method is not due to either of these two suggested artefacts. I f one supposes that the original code was an R N Y code, i t might be that those genes and genomes which had evolved furthest from the original code would be least R N Y - l i k e . Fo r this reason, I originally postulated that the oldest genomes might be least R N Y - l i k e , but, as a colleague pointed out, assuming a single origin for the code means al l genomes are the same age. Faced with this l imitat ion, I modified the hypothesis to the following: i f a genome is in a constant environment, then, ini t ia l ly , the mutation rate (here I mean the rate of fixed mutations in surv iv ing genomes) should be high, but as beneficial mutations accumulate in the genome, the mutation rate should decrease. Eventual ly the mutat ion rate should be a very smal l positive number, reflecting only "neutral" mutations ( K i m u r a 1986), or zero, i f selection pressure is so strong that no mutations are neutral . This would predict that the more mutations that have occurred in a genome's history, the further from an R N Y code it is and the slower the present mutation rate must be. Chloroplast D N A is considered to be mutat ing slowly relative to nuclear D N A (L i et al. 1985, Pa lmer 1987, Ri t land and Clegg 1987), and the poor performance of Sheperd's method of analysis illustrates that i t is now far from a pure R N Y code. Therefore, i f the original code was an R N Y code, and chloroplasts have remained rather constant for a long period of time, it is consistent that most beneficial mutations have already occurred and the genome is far from its original form. Prediction and identification of chloroplast genes / 60 There is another approach to the question of the efficacy of Sheperd's method. Is there another reason why a gene might tend to consist of R N Y codons other than origin? Trif inov (1987) found that genes have a codon pattern G-nonG-N and he presents rather convincing evidence that this is a frame monitoring mechanism. He found putative frame monitoring sites on 16S r R N A which "match up" wi th a G-nonG-N codon pattern. Because the recognition site on the ribosome corresponds to 12 codons of m R N A , not every codon has to be G-nonG-N, just a sufficient number to mainta in frame. Thus, Trif inov proposes that in a gene there are actually two simultaneous, par t ia l ly redundant codes; one that describes codons and amino acids, and the other that describes reading frame. This type of mechanism, i f it is real and widespread (which I suspect it is), would make possible another explanation of Sheperd's test. One can postulate that ribosomes, which stay in frame on G - n o n G - N codes, would have provided pressure for use of G-nonG-N codons, which are, of course, R - n o n G - N . It appears that base choice for the third position is correlated wi th G + C composition of the genome (Bernardi and Bernard i 1985). The question which arises, i f this hypothesis is to be credible, is whether the gene or ribosome arose first. Since ribosomes consist of gene products, the answer might seem obvious, but this is deceptive. The original genetic mater ia l was probably R N A , not D N A (Cedergen and Grosjean 1987, eg.). Taylor and Coates (1988) have shown that there are a series of correlations between codons that code for an amino acid and the chemical properties of the amino acid. Specifically, the first base in the codon corresponds to the biochemical synthetic pathway, the second base to hydrophobicity (see Rose et al. 1985), and the third base to molecular weight. In modern organisms this specificity is plainly due to specificity in t R N A s and the Prediction and identification of chloroplast genes / 61 enzymes that charge them, but the specificity probably did not arise this way . Original ly there was probably a tendency for amino acids to bind directly to certain R N A sequences, and of course, R N A binds to itself. This is the beginning of a pr imit ive R N A : a m i n o acid binding complex. Enzymat ic function could follow, eventually giving rise to a structure which was later stabilized wi th protein to give a protoribosome. The point is that amino a c i d : R N A interactions are fundamental and enzymatic molecules could arise prior to any evolution of the genetic code. If there was potential frame keeping function in the ribosome, then the codons used might be selected to be G - n o n G - N or R N Y . This hypothesis could predict that R N Y - n e s s is less prevalent in chloroplasts because 1) frame keeping is less important than in other systems or 2) frame keeping is performed another way . This question requires more work since the differences between frame keeping mechanisms in different organisms and cell compartments are at present a black box. It would be i l luminat ing perhaps to construct plasmids, each of which contains a gene for the same protein but which have various levels of G-nonG-N. This construction is possible due to the presence of synonomous codons. It would be predicted that the transcripts of the gene wi th the lowest level of G-nonG-N would be translated least efficiently. If for example, the gene to be altered was the ampici l l in resistance gene which was on a plasmid that also contained say, a normal tetracyline resistance gene, the tetracycline resistance shows that the bacterium contains the plasmid and inabili ty to grow on ampici l l in would indicate a defective ampicil l in gene. This design has the advantage that it allows the detection of a graded response. Another point which seems relevant herein; Tramontano et al. (1984) found that Prediction and identification of chloroplast genes / 62 the complementary strand of a gene is, more often than expected, an U R F . I f the code tends to be R N Y , then the complementary stand also tends to be R N Y simply due to pair ing rules. Tramontano et al. (1984) suggested that the "complementary inverted proteins" (cip) could be transcribed and they performed some tests, Fickett 's among them, to support this hypothesis. There are two codes that wi l l produce cip in the same frame as the other gene; R N Y and Y N R . If cip frequently prove to be transcribed, code selection might be a way to save space or increase the chance of genes exist ing on alternate strands. Ficket t ' s method, which relies on both position and composition data, out-performed Sheperd's method on chloroplast D N A (Table 9,10), and its performance was improved by modifying it for use on chloroplast D N A . In the modified form it also appears to correctly identify genes in a wide variety of chloroplasts (Table 11). Some predictions are of part icular interest. In l iverwort, the f r x A gene, which was rather tentatively suggested to be a gene by O h y a m a et al. (1986), is predicted to be a gene. Oh-oka et al. (1987) have shown that this is in fact a gene for a photosystem (PS) I FeS-protein. The putative psbG gene in both maize (Table 11) and l iverwort (Table 10) is neither predicted to be a gene nor a nongene. Steinmetz et al. (1986) made antibodies which should react wi th the psbG gene product i f that exists; they found a 24 k D a protein (apparent molecular weight on gels) that reacts wi th these antibodies. There is a possibility that these antibodies are detecting a protein that is not psbG encoded. The probability of this occurring is low (Sibbald and White 1987) but by no means zero. This gene could be a pseudogene or could be made in such low amounts that its structure is atypical (see the discussion of codon usage in the Prediction and identification of chloroplast genes / 63 introduction to this chapter). L a s t l y , the test makes a smal l percentage of errors and this could be one of them. The modified Ficket t ' s parameters have been published (Sibbald 1988) and should prove useful for -workers in other species. The cluster diagrams (Fig. 7,8) show a s imi lar i ty in codon usage within gene families. In F i g . 7 the photosynthesis genes (ps ) are clustered, except psbG. The ndh genes ( N A D H dehydrogenase) are also clustered. In both F i g . 7,8 the ribosomal protein genes are clustered. In F i g . 7 the U R F s about which modified Ficket t ' s procedure is unsure are clustered, suggesting that Ficket t ' s test is detecting codon usage to some extent. In F i g . 8, the photosynthesis genes are also clustered. The U R F s predicted not to be genes are fair ly wel l clustered into two groups, lending support to the predictions. In conclusion, the bases in coding sequences and noncoding sequences are organized sufficiently differently that coding and noncoding regions can be identified on this basis alone, indicating that being protein coding exerts a substantial constraint on D N A organization. Fickett 's suggestion that his testcode could be modified for various taxonomic groups and the results of an implementation of this suggestion in the present work shows that sequence organization is significantly different between species. Thus the treatment of D N A as a "context free" molecule, as is often implici t ly done by molecular biologists, is only an approximation and a test of the cip hypothesis (Tramontano et al. 1984) would be worthwhile. A search for transcripts from both strands of genes is necessary. IV. INFORMATION THEORY AND DNA A. INTRODUCTION TO INFORMATION THEORY Information theory is a development of this century and was created for a variety of reasons, including a desire to improve communication. Kul lback (1958) has presented a history of the development of information theory. A second valuable source-book is Key papers in the development of information theory (Slepian 1973) which contains most of the important work from 1948 to 1973, much of which is now difficult to obtain from other sources. Two other volumes, Gat l in (1972a) and Holzmuller (1984), concerned wi th information measures as applied to D N A , are very useful. Since a detailed history of information theory is not central to the biological arguments presented here, the reader is referred to the above references for a more detailed exposition of the field. Information theory, as originally conceived by Shannon (1948), is concerned wi th the t ransmission of a message through a channel to a receiver. A. message is a physical or temporal str ing or sequence of symbols; for example, a str ing of Morse code is a message, as is a strand of D N A . .Thus, while message has various connotations in normal language, herein i t is defined as a string of symbols. A channel is any medium that t ransmits a message. Channels may be noisy which means they introduce errors into the message. Information theory is often concerned wi th problems such as: 1. given a transmitter, channel and receiver what is the m a x i m u m rate of information transfer? and 2. how can information be transmitted accurately when a channel is noisy? Shannon (1949) showed, in his celebrated second theorem that a message can be transmitted 64 Information theory and D N A / 65 through a channel, no matter how noisy, provided the message is correctly coded at source, which is a surpris ing and important result. B . G A T L I N ' S A P P R O A C H In her seminal book, Gat l in (1972a) showed how to calculate a variety of information measures, or entropies, for biological sequences. These measures strictly apply only to infinite sequences and, for this reason, it is correct to speak of probabilities, rather than frequencies, of symbols. The following outline of Gat l in ' s method follows from Gat l in (1972a p. 62-67; 1984). Consider an infinite sequence of symbols drawn from a finite alphabet. The alphabet has a symbols. In the sequence, each symbol occurs wi th a probability p where i indicates wi th which of the a symbols the probabilitj ' is associated, i For any alphabet Z p = 1 i i Ga t l in defined H = -k Z p ; log p 1 i i j where k is a constant and is the entropy of the single symbols in the sequence. Fo r convenience base 2 logs are used, k is set to 1 and the units are M a x bits per symbol. H is a m a x i m u m (H ) when all p = 1/a, i . e. all symbols 1 1 i are equiprobable. If the p deviate from being equal, then the sequence entropy i Information theory and D N A / 66 M a x is less than . Gat l in defined the divergence from equiprobability as D ^ M a x Obs D = H - H 3 1 — 1 1 M a x Obs where H = log a and H is calculated using equation 2 wi th the p 1 1 i observed for the sequence. In addition to considering single symbols, i t is possible to consider substrings of symbols, or n-tuples where n is the length of the substring. In this terminology, doublets are 2-tuples, triplets are 3-tuples etc. I f the symbols in the sequence are independent then Ind H = - Z . Z . . . Z p. p. ... p log (p. p. ... p ) 4 n i j n i j n I J n where p , p ... p are the probabilities of single bases in the first, second ... i J n nth position of n-tuple respectively. Gat l in 's model is that of a M a r k o v source, which is generating the sequence under consideration. This source can have a (m) memory (m) of 0,1... Ga t l in defines H or " H - M a r k o v wi th a memory of m " as (m) H = - Z p , log p((m+l) |m) 5 M a m + 1 where p( (m+l) |m) is the conditional probability of the ( m + l ) t h symbol given the m symbols and p is the probability of the (m+l)- tuple . G iven this, it is m + 1 Information theory and D N A / 67 possible to define the entropy i f the symbols in the n-tuple are not independent. D This entropy is H n D (1) (2) (m-1) (m) H = H + H + H + ... H + (n-m)H 6 n 1 M M M M (1) Gat l in then defined a second D-value, D 2 (1) (1) D = H - H 7 2 1 M for a memory of 1 and more generally, for a memory of m (m) (m) D = H - H 2 1 M The most general expression for D-values is (n-2) (n-1) D • = H - H n M M This results in the entropy scale shown in F i g . 9 which is redrawn from F i g . 7 of Ga t l in (1972a). Note that Gat l in ' s approach is to define a varietj ' of H-values and then to define D-values in terms of them. Her derivation is based on the work of Shannon (1948, 1949). Unfortunately, her derivation is difficult to understand and the motivation behind the calculations is not a lways readily apparent. The motivation for her work seems to be that deviations fom equiprobability (at the level of 1-tuplets) and independence (at the level of n-tuplets, n > l ) can be partitioned. Specifically, i f the sequence had equiprobable symbols, independently Information theory and D N A / 68 Figure 9. Two entropy scales for sequences. On the left is Gatlin 's entropy scale, redrawn from Gatlin (1972a). See text for for explanation of notation. On the right is my new scale. 2-, 1.5 H CD CO CO .Q to +^ CD 0.5 H 0 J I log a a H, (2) D2 (i) (3) H (i) M (1) H (2) M (1) Hi X3) T log a H, H, H, H 4 Other D and H values Other D and H values zero zero Information theory and D N A / 69 distributed, then the entropy per symbol in the sequence would be log a bits. Divergence from equiprobability decreases this entropy by an amount D ^ to the level H ^ . A s is clear from Gat l in ' s examples (1972a) however, the sequence can also diverge from independence. In the sequence A C G T A C G T . . . the sequence has (1) equiprobable symbols and so D is 0 but D is log a = 2 bits/symbol because 1 2 al l divergences from independence occur at the level of 2-tuples. In the sequence (1) (1) 00110011. . . D = 0, D = 0 and D = 1 . Tha t is , the single symbols, 0 1 2 3 and 1 are equiprobable. A t the level of 2-tuples, 0 is followed by 1 half the time and by 0 half the time while 1 is also followed by 0 and 1 wi th equal probability. Therefore there is no deviation from independence at the 2-tuple level (1) and D = 0. A t the 3-tuple level however, p( l |00) = 1, p ( 0 | l l ) = 1, p(0|lO) 2 (1) = 1, p( l |01) = 1 and this is complete dependence at the 3-tuple level and D 3 = log a = log 2 = 1. C. A NEW APPROACH In an attempt to improve and clarify Gatl in 's approach, and in order to better understand the meaning of D-values, I derived another way to obtain D-values. (1) It seemed that D might be the entropy of the doublets that would be found i f 2 the single bases were independent, minus the entropy of the observed doublets. W h a t is postulated is that (1) D = - Z Z p . p. log (p. p.) - - Z p log p 10 A 1 J 1 J 1 J K K K where k refers to observed doublets and i and j to single symbols. To continue, D (assume the memory of 1 from now on) might be viewed as the entropy of 3 the triplets that would occur i f the doublets were independent, minus the entropy Information theory and D N A / 70 of the observed triplets. A s was done by Hinds and Blake (1984), I estimate the probabilities of the triplets that would occur i f the doublets were independent, as p. p / px for the triplet ixj. In this notation, p is the observed probability ix xj ix of the doublet ix , p the probability of xj and p is the probability of the xj x single symbol x. Symbolical ly I postulate D = - Z Z. p. p . /p log (p. p . / p ) • - I p log p 11 3 ix xj. ix xj x ix xj x k k k where k refers to observed triplets. General ly , T thought it was reasonable to think of D (always assuming a memory of one) as n D = - I Z. p. p . /p log (p. p . / p ) - - Z p log p 12 n ix xj ix xj x ix xj x k k k where, p is the probability of the n-l- tuple corresponding to the first n-1 ix symbols of the n-tuple, p is the probability of the n-l- tuple corresponding to xj the last n-1 symbols of the n-tuple and p is the probability of the central x n-2-tuple. A g a i n k refers to observed n-tuples. This follows from F i g . 9 where it (1) can be seen that Gat l in ' s D -values are independent and additive. B y calculating n D-values in this new way i t was felt that the order from the previous level was being corrected for and thus this approach would insure independence. Next , rather that having many H-values, it is possible to define one type of H-value as -H = log a - D 13 1 1 14 H = H - D n n-1 n Information theory and D N A / 71 15 as shown in F i g . 9. Note that this method is the inverse of Gat l in ' s . I define D-values first, in an intuitive w a y that follows from the definitions of D and (1) 1 D , and then define H-values. Also note that m y H is equivalent to Gatl ins V l ) n H • . Whi le using Gat l in ' s formulae and m y own for D-values, I found they a lways gave the same values wi th in the l imits of round-off error. Because this happened repeatedly, I concluded that they must be identical. A s Ga t l i n (personal communication) correctly pointed out however, I had not proven that they are identical. Here is a proof that m y D-values are equivalent to Gat l in ' s . D. PROOF OF EQUIVALENCE M y expression for D can be rewrit ten from equation 12 as n D = Z . p. . log p. -. - I Z. p. p . / p log (p. p . / p ) 16 n IXJ IXJ IXJ ix xj ix xj x ix xj x The expression is rewrit ten wi th the inclusion of a conditional probability and logs expanded D = Z . P. . log (p. p(j|ix))- Z Z. p. p ./p (log p. + log p . - log p ) 17 n ixj ixj ix ix xj ix xj x ix xj x A s can be seen from the definitions in equations 5 to 9, each of the five terms can be immediately expressed in Gat l in 's notation. (n-1) D = - H •- H + H + H - H 18 n n-1 ^ . - j j n-1 n-1 n-2 D = H - H - H 19 n n-1 M n-2 Information theory and D N A / 72 (n-2) (n-1) D = H - H 20 n M M This is Gat l in ' s D (equation 9) and the proof is complete. Before proceding to n use these equations to investigate D N A sequences, several points should be noted. Gat l in ' s (1984) algori thm (equation 9) is more convenient than m y equation 16 for computer programs. M y method however, uses a more streamlined notation and is more intui t ively obvious. It provides an alternative interpretation for Gat l in ' s D-values and a confirmation of their essential correctness. E . R E D U N D A N C Y Another concept, redundancy, must be introduced. M a x M a x R =redundancy(n) = (H - H ) / H 21 n 1 n 1 R , R . . .R are redundancies calculated from H , H . . . H . R is proportional to 1 2 n 1 2 n n the sum of D-values D to D . R lies between 0 and 1 inclusive. For a I n sequence of equally abundant bases wi th no higher order constraints R = 0 . Fo r a sequence A A A A . . . D ^ = 2, H ^ = 0 and R ^ = l , i . e. the sequence is completely redundant. Fo r a sequence A C G T A C G T . . . D =0 , H =2, D =2 , H =0 and 1 1 2 2 R ^ = 0 and R = 1 . It has been observed that the R of Engl i sh language is sufficiently large that about hal f the symbols can be deleted and the sense sti l l remains. Information theory and D N A / 73 F. APPLICATION OF INFORMATION THEORY TO SEQUENCE ANALYSIS The question to be addressed next is: can information theory be an appropriate tool for the analysis of D N A sequences, and, i f so, what special caveats must we be aware of? B a r r y (1986) has questioned the contribution made by information theoretical analyses of D N A because much of the information required to make an organism comes from external sources and D N A only functions in the context of the cytoplasm. I agree wi th these points. M y concern follows from the notion that D N A must ul t imately be considered in a biological context; but, just as i t is val id to fractionate a plant and to study its components as an aid to grasping the whole, it is val id to study D N A out of its normal context wi th the intention of (eventually) returning to study the organism as a biological unit. P la in ly , D N A is an informational molecule, and it is a large component of the genetic mater ia l . Therefore, it seems appropriate to use information theory to study it. However , it is necessary to discuss message and channel i n more detail. F i g . 10 shows that there are two possible channels (plus, possibly the reverse transcriptase channel about which very little is known): D N A - > protein and D N A - > D N A . The message for the transcription/translation channel is a description of amino acid order while the message for replication is just the D N A itself. The channel for D N A - > protein is the transcription and translation machinery whereas the D N A - > D N A channel is cell- > cell (ontogeny) or organism-> organism (phylogeny or reproduction depending on the time scale). Based on information theory, suggestions have been made that D N A and protein have both evolved in the direction of increasing entropy Hasegawa and Yano Information theory and DNA / 74 Figure 10. A schematic drawing of the two channels through which the information encoded in DNA can flow according to the central dogma. The reverse transcriptase channel could be included also. CHANNEL DNA—•RNA—• PROTEIN (J CHANNEL Information theory and D N A / 75 1975, Y a n o and Hasegawa 1974) or decreasing entropy (Reichert and Wong 1971, Reichert et al. 1976) but both these suggestions have been largely discredited (Gatl in 1974, Ga t l in 1975, Fe r rac in et al. 1976, S i t a ram and V a r m a 1984). Smi th (1969)(see also Ga t l in 1972b) showed that D N A has the m a x i m u m capacity to code for diverse proteins at a G + C of 0.43, and i t was suggested that vertebrate genome G + C values were clustered around 0.43 due to a pressure toward m a x i m u m protein coding capacity. I have verified that the protein coding capacity is a m a x i m u m at 0.43, but what is at issue is whether there is a pressure to evolve toward this value. F i r s t , I found that, for a var iety of protein genes, using synonomous codons the G + C value could tj^pically be made to va ry over a range of 0.35 and st i l l code for the same protein. The amino acid compositions of proteins from E. coli, yeast, mouse and human based on approximately 180 genes, as compiled by Blake et al. (1986), were examined. These amino acid compositions could have been achieved wi th G + C ranging from 0.30 to 0.68 (E. coli), 0.29 to 0.66 (yeast), 0.31 to 0.68 (mouse) and 0.29 to 0.66 (human). Therefore, on average at least, the requirement to code for protein is not seen as a very narrow constraint on G + C. One prediction that can be made, i f Smi th and Gat l in are correct, is that the G + C of coding regions w i l l be 0.43 and G + C elsewhere may not be. However , in E. coli (Blake and Ear l ey 1986), tobacco chloroplasts (chapter 1), Xenopus and humans (Moreau and Scherrer 1987) the coding regions are more G + C rich than the noncoding regions, which seems to be a very Information theory and D N A / 76 common property of genomes. In their study of coding and noncoding regions of vertebrate D N A , Moreau and Scherrer (1987) found that the G + C of coding regions was 0.55, quite different from 0.43. L a s t l y , in F i g . 11, the approximate ranges of G + C for a variety of taxonomic groups are shown. Ferns and fern-allies, which are not normal ly thought to be advanced, have G + C contents ranging from 0.37 to 0.46 (Green 1971). (See also Shapiro (1970).) It seems more l ikely, as suggested by Green (1971), that the clustering is due to the base composition of a recent common ancestor- and not to informational constraints. So, in this case, i t would seem that vertebrates, also a monophyletic group, probably also have a clustering of G + C around 0.43 due to historical constraint rather than selection pressure on informational properties, i . e. the information calculation is correct but is not the explanation for the phenomenon. This error might have been made in part due to the human tendency to see humans as more highly evolved than other l iv ing things. I f D N A is not optimized to code for protein perhaps it is optimized to code for more D N A . The fact that norandom doublet relative abundances, which are a source of Shannon redundancy, are s imilar in coding and noncoding regions might suggest that it is fidelity in replication that is cr i t ical . Yockey (1974), in a paper that was wel l ahead of its time, came to this point but then discarded the idea, probably because, in m y opinion, such a view ran contrary to the existing paradigm of the time, namely that D N A exists to code for protein. Since there now exists a new paradigm of selfish D N A (Dawkins 1976, 1982), it is no longer heretical to postulate that D N A is organized in such a way so as to optimize its own fidelity of replication. Fur ther evidence that it is replication that Information theory and DNA / 77 Figure 11. The G + C ranges for various taxa. The theoretical maximum information capacity for protein coding is at 0.43. These data are taken from Shapiro (1970). mammal — frog fish echinoderms arthropods — molluscs porifera protists anglosperms gymnosperms ferns — — green algae — brown algae — — — — red algae — — • — — — cyanobacters — — — — — — — bacteria I 1 1 1 1 1 1 1 1 1 1 1 1 1 0 . 2 5 0 . 3 0 0 . 3 5 0 . 4 0 0 . 4 5 0 . 5 0 0 . 5 5 0 . 6 0 0 . 6 5 0 . 7 0 0 . 7 5 0 . 8 0 0 . 8 5 0 . 9 0 G+C content Information theory and D N A / 78 is optimized rather than protein production, is the presence of error checking in D N A polymerases which in R N A polymerases is comparatively absent. Fur ther , errors made in synthesising protein are rapidly disposed of while errors in D N A replication tend to accumulate. G. USE OF HIGHER D-VALUES Another example of information theory being applied to sequence analysis is found in Brooks et al. (1984,1988). A discussion of what follows is the subject of a forthcoming paper (Sibbald, Banerjee, Maze submitted.). Brooks et al. show, for 23 real D N A sequences from the E M B L data base, that while D to D are not different from expectation, D and D show a jump in magnitude which lead 5 6 them to suggest (Brooks et al. 1988) that, since D N A helices complete a turn in 10.5 bp, they might be detecting topological features of the D N A . W h a t they are in fact detecting is an artefact due to sequence length of the type discussed by Gat l in (1974), Rowe and Trainor (1983b) and by Rowe and H a r v e y (1985). If a sequence is short, not all windows of length 5 or 6 can occur in the sequence and so an apparent "order" appears due to this and this inflates D and D . It 5 6 is shown (Fig . 12) that pseudorandom sequences behave in a s imilar fashion even though they are nonbiological. One procedure to avoid this problem is to compare the D-values obtained for a sequence wi th the D-values of a large set of pseudorandom sequences the same length (Rowe and Trainor 1983b). This has been done for the seven categories of D N A in both the l iverwort and tobacco (Fig. 13, 14, 15, 16, 17). In F i g . 13 D is outside the expected region in a l l 2 categories for both plants. D (Fig . 14) values are generally closer to the Information theory and DNA / 79 Figure 12. Dl to D6 for tobacco and pseudorandom sequences. Lengths of sequences are 600 (black bars), 6,000 (hatched) and 24,000 (white). BCL refers to the method of Brooks et al. and new, is our method. The effect of length is dramatic. a 33 Si D1 D2 D3 D4 D5 D6 D1 D2 D3 D4 D5 D6 Information theory and DNA / 80 Figure 13. Real and simulated D2-values. The upper curve is the maximum D2-value obtained for 13 different sequence lengths at 100 psuedorandom sequences per length. The lower curve is the minimum simulated D2. The tip of each arrow is the value of a real D2-value. The italic labels are tobacco; the others are liverwort. All D2-values fall outside the expected range of D2 values and therefore are significantly different from expectation. 0.03 © CO CO . Q CD Q . CO CD 0.02 0.01 0.00 1000 10000 100000 Log sequence length i 111 1000000 Information theory and D N A / 81 Figure 14. Real and simulated D3-values. Upper and lower curves are maximum and minimum simulated D3-values respectively. Italic labels refer to tobacco and others to liverwort. The r R N A genes f a l l inside the expected region. 0.05-1 0 . 0 4 -co .tl 0.02 m o.oo t rna tgene gone non 1000 I I I I I I 10000 100000 Log sequence length i i i i i 11 1 0 0 0 0 0 0 Information theory and DNA / 82 Figure 15. Real and simulated D4-values. Upper and lower curves are maximum and minimum simulated D4-values respectively. Italic labels refer to tobacco and others to bverwort. As in Fig. 14 the rRNA genes fall inside the expected region and all other categories are much closer to the expectation than they were for D3. 0.20-1 0.00 -\ 1000 10000 100000 Log sequence length 1 0 0 0 0 0 0 Information theory and D N A / 83 Figure 16. Real and simulated D5-values. Upper and lower curves are maximum and minimum simulated D5-values respectively. Italic labels refer to tobacco and others to liverwort. 0.6 gene 1000 10000 1 0 0 0 0 0 Log sequence length i i i i i 111 1 0 0 0 0 0 0 Information theory and D N A / 84 Figure 17. Real and simulated D6-values. Upper and lower curves are maximum and minimum simulated D6-values respectively. Italic labels refer to tobacco and others to liverwort. A t this level of order, a ll categories fal l very near or in the expected region. 0.8-£ 0.6-CD Q . CO f=L 0.4 CQ 0.2 * \ \ / r m a trna \ \ / / \ \ / r p r o t rma V\ / V\ / i n t r o n / N X / / U R F / g e n e / A \ / i n ° n y rprot / / / Intron ^ ^ ^ - ^ K non URF flene*^-—. 1000 10000 100000 Log sequence length 1000000 Information theory and D N A / 85 expected region and some values lie inside it. D , D and D values are 4 5 6 progressively closer to the expected region, though unlike Rowe and Trainor (1983b), some values even at D are outside the expected region. 6 The expected region was constructed as follows. For lengths of 1, 2, 4, 8, 12, 3 20, 26, 37, 45, 70, 100, 120, 160 x 10 bases, 100 pseudorandom sequences were generated. Fo r each sequence, D ^ to D ^ were calculated. F o r each of the 13 groups of 100 sequences, the mean, S D and m a x i m u m and m i n i m u m of each D-value were calculated. I chose to plot the m i n i m u m and m a x i m u m rather than the mean plus and minus one S D (Rowe and Trainor 1983b) or two S D (Gatl in 1979), because the data are not normally distributed. In the s imulat ion described above, the mean tends to be closer to the m i n i m u m value than to the m a x i m u m value. Averag ing these distances, the m i n i m u m and m a x i m u m are 2.43 S D from the mean. This test of significance is therefore at least as stringent as that applied by either Rowe and Trainor (1983b) or Gat l in (1979) and in addition, more accurately reflects the true (skewed) distribution of the simulated D-values. Note however, that Rowe and H a r v e y (1985) criticize this method of determining significance and suggest that i t is more appropriate to a lways use as the simulated sequence, one which has identical order at the n-1 level to determine the significance at the n-level. Above (and in Rowe and Trainor 1983b), this was not done; the same set of 100 sequences (with a l l bases equally probable) was used to determine all 6 D-values at each length. However it is plain that the difference is really due to a lack of clari ty regarding the correct nul l model. Here I have s imply chosen an arbi t rary (but reasonable) nul l model and stated significances relative to it. I a m not convinced that the Rowe and H a r v e y (1985) Information theory and D N A / 86 method is better, though I have found in a simulation that it provides slightly different results. The results shown in F i g . 13-17 show that rather than D and D being 5 6 further from the expectation they are in fact closer, contrary to Brooks et al. (1988). L a s t l y , Brooks et al. calculate their D-values by comparing, for D for 3 example, triplets occuring to triplets expected based on single bases. This makes D and D dependent and nonadditive. A more appropriate method, presented 2 3 above compares triplets that occur to those that would occur based on doublets that actually occur. General ly, using the m a x i m u m overlap method wi th windows of size n-1 (or Gat l in 's method) solves this problem for higher D-values and, M a x since H is 2 bits per base for D N A , no D-value can be greater than 2. Values that exceed 2, as presented i n Brooks et al. (1988) obviously are not in agreement w i t h this notion. Rowe and Tra inor (1983b) found that only in noncoding regions of several v i r a l genomes did D ^ differ from expectation. In no regions did D ^ and D ^ differ from expectation. The m a x i m u m simulated D ^ that I obtained was 0.0083 (not plotted). In both plants, only t R N A D ^-values were wi th in the expected region which is not surprising considering the G + C values involved (Table 5). For al l categories, D is much higher that the expectation (Fig 13). Tobacco and l iverwort r R N A genes have D ^ values not different from expectation but are the only categories to exhibit this. Keep in mind that the mean is closer to the minimum so that any values above the m a x i m u m are more than 2.4 S D from the mean. Contrary to Rowe and Trainor (1983b) both noncoding and coding Information theory and D N A / 87 regions wi th the exception of r R N A genes, ribosomal protein genes and tobacco introns have significantly high D values. Regarding D and D values, it seems 4 5 6 best to make no statement regarding their being different from expectation, even though some of them lie outside the expected region. P a r t of this caution stems from t R N A genes being very short, on the order of 70 bp and D and D 5 6 have limited meaning in such a context. In summary , the overwhelming amount of order is at the D and D levels w i th significant but lesser order at the D 1 2 3 and D ^ levels. The reason for the difference in results wi th chloroplasts compared to viruses is not known. H. INVERTED REPEATS AS A SOURCE OF REDUNDANCY I wish to present a pre l iminary finding concerning IRs in chloroplast genomes. If R , based on D and D as per Ga t l in (1972a), is calculated in nonoverlapping 2 1 2 windows of 500 bases for the two chloroplast genomes and plotted as a function of sequence position, the IRs are so much lower than the rest of the genome that I have not even labelled them (Fig. 18, 19). The mean R in the tobacco L S C was 0.053737 wi th an S D of 0.022736 based on an N of 173. I R B had a mean R of 0.039756 wi th an S D of 0.019134, wi th an N of 51. U s i n g a onetailed test for differences of means (Spiegel 1961) we calculate the standardized variable z = 4.385 and conclude that the R of the L S C is significantly higher than that of the I R B at the 0.002 level. The same test applied to the l iverwort showed that the mean R of the L S C region was significantly higher than the R of the I R B at the 0.05 level. Th is suggested that i f R is low, perhaps repeating the message twice (i. e. having an IR) is a way Information theory and DNA / 88 Figure 18. Redundancy calculated in nonoverlapping windows of 500 bases for the tobacco genome. 0.20 n 0.00 4 0 1 150000 200000 Information theory and DNA / 89 Figure 19. Redundancy calculated in nonoverlapping windows of 500 bases for the liverwort genome. 0.30-1 0.25 H 0.20 H 0.00-1 1 1 —i 1 1 1 1 0 20000 40000 60000 80000 100000 120000 140000 BASE Information theory and D N A / 90 to compensate. Consistent wi th this idea is the observation that the IRs contain the r R N A genes whose G + C is near 0.5, resulting in a low D ^ , and whose doublet properties differ strongly from the rest of the genome (chapter 2). Ribosomal R N A genes often occur in tandem (eg. Worton et al. 1988) and seem to undergo conversion or "correction of each other" (Metzenberg et al. 1983) since mutations found in one copy of the repeat invar iably appear i n the other. Chloroplast IRs likewise "communicate" and m a y keep each other accurate (Palmer 1985). This notion that IRs exist to compensate for low redundancy could be tested i f only there were an organism wi th but "one IR" in its chloroplast genome. Ac tua l ly there are several! (conifers, as a group, for example (Strauss et al. 1988)) Euglena is the one wi th the most sequence information available in the E M B L data base. I compiled a l l the sequences in the data base for the single copy regions and for the IR region and then I calculated D ^ , D ^ , R , H and H for the single copy regions and the IRs of tobacco, l iverwort 2 1 2 and Euglena. The results are in F i g . 20. The redundancy in the Euglena IRs is much greater than in tobacco and l iverwort , suggesting that perhaps i t was able to dispose of one IR since the remaining one could retain fidelity due to its internal structure or redundancy. This suggestion is based on very little data but, as more sequences accumulate, the idea could be tested properly. In Euglena, the R of the IR is also significantly lower (at the 0.05 level, one tailed) that the R of the "unique" region. Since I a m postulating that there might be a threshhold R that needs to be attained before the repeat can be discarded, this is not cr i t ical . Information theory and D N A / 91 Figure 20. A comparison of H i , H2, D l , D2 and R i n the unique and repeat regions in tobacco, liverwort and Euglena. Euglena has only one repeat region and the highest R. R values shown are lOx. TOBACCO LIVERWORT EUGLENA unique IR unique IR unique IR = H2 = 2 n — H i = Hi — H I u, R H2 = Hz H1 H2 1.5 H 1-o.H — " R 0 Information theory and D N A / 92 The idea which has emerged from this study is that the requirement for redundancy acts as a constraint. Perhaps in D N A informational systems, i f redundancy can be achieved by increasing D or one of the higher D-values, then it m a y be possible to discard the other copy of an important sequence. I intend to pursue this further to t ry to identify more generally the patterns of D-value constraint throughout the data base. R E F E R E N C E S A n G (1987) A potential Z - D N A - f o r m i n g sequence is an essential upstream element of a plant promoter. Bioessays 7:211-214 B a r r y J M (1986) Informational D N A : a useful concept? Trends Biochem. Sc i . 11:317-318 Bennetzen J L , H a l l B D (1982) Codon selection in yeast. J . B io l . Chern. 257:3026-3031 Bernard i G , Bernard i G (1985) Codon usage and genome composition. J . M o l . E v o l . 22:363-365 B i r d A P (1980) D N A methylat ion and the frequency of C p G in animal D N A . N u c l . Acids Res.8:1499-1504 Blake R D , Ea r l ey S (1986) Distr ibut ion and evolution of sequence characteristics in the E. coli genome. J . Biomol . Struct. D y n a m . 4:291-307 Blake R D , Hinds P W , Ea r l ey S, H i l l y a r d A L , D a y G R (1986) Evolution and functional significance of the bias in codon usage. Biomol . Stereodynam. 4:271-286 Borodovsii M Y u , Spr izki t sk i i A , Golovanov E I , Aleksandrov A A (1987a) Stat is t ical patterns in p r imary structures of the functional regions of the genome in Escherichia coli. I. Frequency characteristics. M o l . B io l . 21:826-833 93 / 94 Borodovsii M Y u , Spr izk i t sk i i A , Golovanov E I , Aleksandrov A A (1987b) Statist ical patterns in p r imary structures of the functional regions of the genome in Escher ichia coli. II. Nonuniform M a r k o v models. M o l . B io l . 21:833-840 Boudraa M , Pe r r in P (1987) C p G and T p A frequencies in the plant system. N u c l Acids Res. 15:5729-5737 Brendel V , Busse H G (1984) Genome structure described by formal languages. N u c l . Acids Res. 12:2561-2568 Breslauer K J , F r ank R, Blocker H , M a r k y L A (1986) Predict ing D N A duplex stabil i ty from the base sequence. Proc. N a t l . Acad . Sci . U S A 83:3746-3750 B r i a t J F , Gigot C , Laulhere J P , Mache R (1982) Visual iza t ion of a spinach plastid transcriptionally active D N A - p r o t e i n complex in a highly condensed structure. P lant Phys io l . 69:1205-1211 B r i a t J F , Letoffe S, Mache R, Rouvi iere-Yaniv J (1984) S imi la r i ty between the bacterial histone-like protein H U and a protein from spinach chloroplasts. F E B S Let t . 172:75-79 Brooks D R , LeBlond P H , C u m m i n g D D (1984) Information and entropy in a simple evolution model. J . Theor. B io l . 109:77-93 Brooks D R , C u m m i n g D D , LeBlond P H (1988) In: Entropy, information and evolution Eds . Weber B H , Depew D J , Smi th J D , pp 189-226. Cambidge M a s s . M I T Press / 95 Callaine C R (1982) Mechanics of sequence-dependent stacking of bases in B - D N A . J . M o l . B i o l . 161:343-352 Cedergen R, Grosjean H (1987) On the pr imacy of pr imordial R N A . Biosystems 20:175-180 Clegg M T , Ri t land K , Zurawsk i G (1986) Processes of chloroplast evolution. In S K a r l i n , E Nevo (Eds) Evolut i ionary processes and theory. Acad . Press N Y p275-293 Coleman A W (1985) Divers i ty of plastid D N A conformation among classes of eukaryotic algae. J . Phycol . 21:1-16 Dawkins R (1976) The selfish gene. Oxford Unive r s i ty Press. Dawkins R (1982) The extended pheriotype. Oxford Univers i ty Press. D a y G R , Blake R D (1984) Computer analysis and manipulat ion of D N A sequences. Computers and Chem. 8:67-73 D e L i s i C (1988) Computers in molecular biology: current applications and emerging trends. Science 240:47-52 Deno H , Shinozaki K , Sugiura M , (1983) Nucleotide sequence of tobacco chloroplast gene for the a subunit of proton-translocating A T P a s e . N u c l . Acids Res. 11:2185-2191 Ebeling W , Jimenez-Montano M A (1980) On grammars , complexitj ' and information measures of biological macromolecules. M a t h . Biosci . 52:53-71 / 96 Fe r rac in A , Paoluzi R, Benassi M (1976) The measure-unit of evolution. Biosystems 8:10-23 Ficket t J W (1982) Recognition of protein coding regions in D N A sequences. N u c l . Acids Res. 10:5305-5318 F o x D J , Guire K E (1976) Documentation for M I D A S . Stat is t ical Research Laboratory, Un ive r s i ty of Mich igan , A n n Arbor Fox T D (1987) N a t u r a l variat ion in the genetic code. A n n . Rev. Genet. 21:67-91 Freifelder D (1982) Phys ica l Biochemistry, 2nd ed. W H Freeman and Co. San Francisco. Gant t J S , K e y J L (1987) Molecular cloning of a pea H I histone c D N A . Eu r . J . Biochem. 166:119-125 Ga t l in L L (1972a) Information theory and the l iv ing system. Columbia Univers i ty Press. N Y Gat l in L L (1972b) The entropy m a x i m u m of protein. M a t h . Biosci . 13:213-227 Ga t l in L L (1974) Conservation of Shannon's redundancy for proteins. J . M o l . E v o l . 3:189-208 Gat l in L L (1975) The D-normal distribution. J . M o l . E v o l . 6:147-148 Gat l in L L (1979) A new measure of bias in finite sequences wi th applications to E S P data. J . A m e r . Soc. Psychical Res. 73:29-43 / 97 Ga t l in L L (1984) Parapsychology and D-measures. J . A m e r . Soc. Psychical Res. 78:331-340 G r a n t h a m R, Gautier C , Gouy M , Mercier R, Pave A (1980) Codon catalog usage and the genome hypothesis. N u c l . Acids Res. 8:r49-r62 G r a n t h a m R, Gautier C , Gouy M , Jacobzone M , Merc ier R (1981) Codon catalog usage is a genome strategy modulated for gene expressivity. N u c l . Acids Res. 9:r43-r74 Green B R (1971) Isolation and base composition of D N A ' s of pr imit ive land plants I. ferns and fern-allies. Biochem. Biophy. A c t a 254:402-406 Gribskov M , Devereaux J , Burgess R R (1984) The codon preference plot: graphic analysis of protein coding sequences and prediction of gene expression. N u c l . Acids Res. 12:539-549 Hanley-Bowdoin , Chua N - H (1987) Chloroplast promoters. Trends Bio l . Sci . 12:67-70 Hasegawa M , Yano T (1975) Ent ropy of the genetic information and evolution. Origin Life 6:219-227 Head T (1987) F o r m a l language theor3' and D N A : analysis of the generative capacity of specific recombinant behaviours. B u l l . M a t h . Bio l . 49:737-759 Hinds P W , Blake R D (1984) Degrees of divergence in the E. eoZigenome from correlations between dinucleotide, trinucleotide, and codon frequencies. J . Biomol . Struct. D y n a m . 2:101-118 / 98 Holzmul ler W (1984) Information in biological systems: the roll of macromolecules. Cambridge Univers i ty Press Ikemura T (1981a) Correlat ion between the abundance of Escherichia coli transfer R N A s and the occurance of the respective codons in its protein genes. J . M o l . B io l . 146:1-21 Ikemura T (1981b) Correlat ion between the abundance of Escherichia coli transfer R N A s and the occurance of the respective codons in its protein genes: a proposal for a synonomous codon choice that is optimal for the E. coli translational system. J . M o l . B io l . 151:389-409 Ikemura T (1982) Correlat ion between the abundance of yeast transfer R N A s and the occurance of the respective codons in protein genes. J . M o l . B io l . 158:573-597 Jensen R V (1987) Classical chaos. A m e r . Sc i . 75:168-181 Johnson-Dow L , M a r d i s E , Heiner C, Roe B A (1987) Optimized methods for fluorescent and radio labeled D N A sequencing. BioTechniques 5:754-765 Jones M , Wagner R, Radman M (1987) M i s m a t c h repair and recombination in E. coli. Cel l 50:621-626 Josse J , Ka i se r A D , Kornberg A (1961) Enzymat i c synthesis of deoxyribonucleic acid. V I I I . Frequencies of nearest neighbour base sequences in deoxyribonucleic acid. J . B io l . Chern. 236:864-875 Jukes T H , Holmquis t R, Moise H (1975) A m i n o acid composition of proteins: selection against the genetic code. Science 189:50-51 / 99 K i m u r a M (1986) D N A and the neutral theory. P h i l . Trans . R. Soc. Lond . B 312:343-354 Kul lback S (1958) Information theory and statistics. Wi ley , N Y Lagerkv i s t U (1981) Unorthodox codon reading and the evolution of the genetic code. Cel l 23:305-306 L e w i n R (1984) No genome barriers to promiscuous D N A . Science 224:970-971 L i W , L u o C, W u C (1985) Evolut ion of D N A sequences. In R J M a c l n t y r e (ed) Molecular Evolut ionary Genetics. P lenum Press. N Y and London Lonsdale D M (1987) The biochemistry of plants. Springer-Verlag. In Press. M a x a m A , Gilbert W (1977) A new method for sequencing D N A . Proc. N a t l . A c a d . Sci . U S A 74:560-564 M a z i n A L , Vanyush in B F (1987) Loss of C p G nucleotides from D N A . I. Methyla ted and nonmethylated compartments in eukaryotes w i th different content of 5-methylcytosine in D N A . M o l . B io l . 21:465-472 M a z i n A L , V a n y u s h i n B F (1987) Loss of C p G nucleotides from D N A . II. Methyla ted and nonmethylated genes of vertebrates. M o l . B io l . 21:473-481 Metzenberg R L , Stevens J N , Baisch T J (1983) 5S R N A genes of Neurospora crassa: conservation and divergence in a dispersed gene family. Microbiology-1983 190-194 / 100 Moreau J , Scherrer K (1987) Co-evolution of base composition and codon usage in Xenopus laevis and human globin genes wi th long-range D N A organization of their genome. F E B S Let t . 221:3-10 Nussinov R (1984a) Doublet frequencies in evolutionary distinct groups. N u c l . Acids Res. 12:1749-1763 Nussinov R (1984b) Strong doublet preferences in nucleotide sequences and D N A geometry. J . M o l . E v o l . 20:111-119 Nussinov R (1987) Theoretical molecular biology: prospectives and perspectives. J . Theor. B i o l . 125:219-235 Nussinov R, Lennon G G (1984) Structural features are as important as sequence homologies in Drosophila heat shock gene upstream regions. J . M o l . E v o l . 20:106-110 Oh-oka H , Takahash i Y , Wada K , Ma t suba ra H , O h y a m a K , Ozeki H (1987) The 8 k D a polypeptide in photosystem I is a probable candidate of an iron-sulfur protein coded by the chloroplast gene frxA. F E B S Let t . 218:52-54 O h y a m a K , F u k a z a w a H , Kohch i T , Sh i ra i H , Sano T , Sano S, Umesono K , Sh ik i Y , Takeuchi M , Chang Z , A o t a S, Inokuchi H , Ozeki H (1986) Chloroplast gene organization deduced from complete sequence of . l iverwort Marchantia polymorpha chloroplast D N A . Nature 327:572-574 Pa lmer J D (1985) Evolut ion of chloroplast and mitochondrial D N A in plants and algae. In M a c l n t y r e R J (ed), Molecular Evolut ionary Genetics, P lenum Press, N Y and London, pp 131-240 / 101 Pa lmer J D (1987) Chloroplast dna evolutiion and biosystematic uses of chloroplast D N A variat ion. A m e r . Na tu r . 130: s6-s29 (supplement) Pfitzinger H , Gui l lemaut P , W e i l J - H , P i l l ay D T N (1987) Adjustment ofthe t R N A population to the codon usage in chloroplasts. N u c l . Acids R E s . 15:1377-1386 Reichert T A , Y u J M C , Christensen R A (1976) Molecular evolution as a process of message refinement. J . M o l . Evo l . 8:41-54 Reichert T A , Wong A K C (1971) Toward a molecular taxonomy. J . M o l . E v o l . 1:97-111 Rit land K , Clegg M T (1987) Evolut ionary analysis of plant D N A sequences. A m e r . Na tu r . 130:s74-s l00 (supplement) Rochaix J - D (1987) Molecular genetics of chloroplasts an mitochondria in the unicellular green alga Chlamydomonas. F E M S Microbiol . Rev. 46:13-34 Rose G D , Geselowitz A R , Lesser G J , Lee R H , Zehfus M H (1985) Hydrophobici ty of amino acid residues in globular proteins. Science 229:834-838 Rowe G W , H a r v e y I F (1985) Information content in finite sequences: communication between dragonfly larvae. J . Theor. B io l . 116:275-290 Rowe G W , Trainor L E H (1983a) A thermodynamic bias in v i ra l genes. J . Theor. B i o l . 101:171-203 Rowe G W , Trainor L E H (1983b) O n the informational content of v i ra l D N A . J . Theor. Bio l . 101:151-170 / 102 Sanger F , Coulson A R (1975) A rapid method for determining D N A sequences in D N A by primed synthesis wi th D N A polymerase. J . M o l . B i o l . 94:441-448 Schuster W , Brennicke A (1987) Plas t id , nuclear and reverse transcriptase sequences in the mitochondrial genome of Oenothera: is genetic information transferred between organelles v i a R N A ? E M B O J . 6:2857-2863 Shannon C E (1948) A mathematical theory of communication. Be l l Syst . Tech. J . 27:379-423 Shannon C E (1949) Communicat ion in the presence of noise. Proc. I R E 37:10-21 Shapiro H S (1970) In Handbook of Biochemistry, H A Sober (ed), C R C Cleveland pp H 8 1 - H 9 9 Sheperd J C W (1981) Method to determine the reading frame of a protein from the purine/pyrimidine genome sequence and its possible evolutionary justificatiion. Proc. N a t l . Acad . Sci . U S A 7:1596-1600 Shinozaki K , Ohme M , T a n a k a M , Wakasu ig i T, Hayash ida N , Matsubayash i T, Za i t a N , Chunwongse J , Obakata J , Yamaguchi-Shinozaki K , Ohto C , Torazawa K , M e n g B Y , Sugita M , Deno H , Kamogash i ra T, Y a m a d a K , K u s a d a J , T a k a i w a F , K a t o A , Tohdoh N , Shimada H , Sugiura M (1986) The complete nucleotide sequence of the tobacco chloroplast genome: its gene organization and expression. E M B O J . 5:2043-2049 Sibbald P R , White M J (1987) H o w probable are antibody cross-reactions? J . Theor. B io l . 127:163-169 / 103 Sibbald P R (1988) Pat tern of base usage, nearest neighbour analysis and identification of genes in two completely sequenced chloroplast genomes. Cur r . Genet. In press. S i t a r am B R , V a r m a V S (1984) Information and onepoint mutations i n D N A . J . Theor. B io l . 110:523-532 Slepian D (1973) K e y papers in the development of information theory. I E E E Press N Y Smi th T F (1969) The genetic code, information density and evolution. M a t h . Biosci . 4:179-187 Smi th T F , Wate rman M S , Sadler J R (1983) Statist ical distribution of nucleic acid sequence functional domains. N u c l . A c i d Res. 11:2205-2220 Spiegel M R (1961) Schaum's outline of theory and problems of statistics. Schaum Publ ishing Co. N Y Sr in ivasan A R , Torres R, C la rk W , Olson W K (1987) Base sequence effects in double helical D N A . I. Potentiial energy estimates of local base morphology. J . Biomol . Struct, and D y n a m . 5:452-496 Staden R (1984) Measurements of the effects that coding for protein has on a D N A sequence and their use for finding genes. N u c l . Acids Res. 12:551-567 Stat is t ical Research Laboratory, Univers i ty of Mich igan (1976) Elementary Statistics Using Midas / 104 Steinmetz A A , Castroviejo M , Sayre R T , Bogorad L (1986) Protein P S I I - G an additional component of photosystem II identified through its plastid gene in maize. J . B io l . Chern. 261:2485-2488 Strauss S H , Pa lmer J D , Howe G T , Doerksen A H (1988) Chloroplast genomes of two conifers lack a large inverted repeat and are extensively rearranged. Proc. N a t l . A c a d . Sci . U S A 85:3898-3902 Taylor F J R , Coates (1988) The code wi th in the codons. In Press. Thomsen D E (1987) Fractals : magical fun or revolutionary science. Science News 131:184 Trifinov E N (1987) Translat ional framing code and frame-monitoring mechanism as suggested by the analysis of m R N A and 16S r R N A nucleotide sequences. J . M o l . B i o l . 194:643-652 Tramontano A , Scarlato V , B a r n i N , Cipollaro M , Franze A , Macchiato M F , Cascino A (1984) Statist ical evaluation of the coding capacity of complementary D N A strands. N u c l . Acids Res. 12:5049-5059 Trifinov E N , Sussman J L (1980) The pitch of chromatin D N A is reflected in its nucleotide sequence. Proc. N a t l . Acad . Sci . U . S . A . 77:3816-3820 Umesono K , Ozeki O (1987) Chloroplast gene organization in plants. Trends Genet. 3:281-287 / 105 Wakasug i T, Ohme M , Shinozaki K , Sugiura M (1986) Structures of tobacco . chloroplast genes for' t R N A B e ( C A U ) m t R N A L e U ( C A A ) , t R N A C y S ( G C A ) , Ser T h r t R N A ( U G A ) and t R N A (GGU) : a compilation of t R N A genes from tobacco chloroplasts. P lant M o l . Bio l . 7:385-392 Wells R D (1988) Unusua l D N A structures. J . Bio l . Chern. 263:1095-1098 Wit tman-Liebold B (1985) Ribosomal proteins: their structure and evolution. In Structure, function and genetics of ribosomes. E d . Hardes ty B , K r a m e r G , pp 326-361. Spr inger-Ver lag Worton R G , Sutherland J , Sylvester J E , Wi l l a rd H F , Bodrug S, Dube I, Duff C , K e a n V , R a y P N , Schmickel R D (1988) H u m a n ribosomal R N A genes: orientation of the tandem ar ray and conservation of the 5' end. Science 239:64-68 Yano T , Hasegawa M (1974) Ent ropy increase of amino acid sequence in protein. J . M o l . E v o l . 4:179-187 Yockey H P (1974) A n application of information theory to the central dogma and the sequence hypothesis. J . Theor. B io l . 46:369-406 Z u r a w s k i G , Clegg M T (1987) Evolut ion of higher-plant chloroplast DNA-encoded genes: implications for structure function andd phylogenetic studies. A n n . . Rev. P lant Phys io l . 38:391-418 APPENDIX 1. A SELECTION OF COMPUTER PROGRAMS. Below are some of the programs used to collect data and perform calculations that appear in this dissertation. All are written in Standard Pascal and were compiled on the UBC mainframe computer using Pascaljb. They are presented to assist the reader in determining exactly what was done. In this sense they are an aid to reproducibility rather than examples of sophisticated programming. A. PROGRAM 1 program dnamaker(output); {* This program generates a pseudorandom DNA sequence using *} {* a modulo type generator. It asks the user for the seed *} {* to be used for the generator and also asks for the length *} {* of the sequence wanted. It should probably not be used *} {* to generate sequences longer than about 30,000 bases due *} {* to the nature of the generator which w i l l repeat after *} {* 65,536 iterations. The particular regions in which a *} {* number produces a certain base can be editted by the user.*} ^*************************************** var seed, i , column, max: integer; {seed i s for. generator} 106 / 107 {i i s a counter for a loop} {column gets output formatted} {number is random} number: real; base: char; inp, out: text; procedure generator; {generates a pseudorandom number between 0 and l} begin seed := (13849 + 25173 * seed) MOD 65536; number := seed / 65536; end {generator}; {procedure generator} j^ QCf^ ri • ^ ^ I ' f ^ f ^ f ^ f ^ f ^ f T ^ f i ' f rosin ^ * , ' f ^ f ' ^ ^ ' ' f ' ' f ^ * j { f i r s t query the user for a seed and sequence length} reset(inp, 'file=*msource*, interactive'); rewrite(out, 1file=*msink*'); writeln(out, 'Give me an integer seed between 1 and 65535.'); get(inp); readln(inp, seed); w r i t e l n ( o u t , 'How many bases would you l i k e ? ' ) ; g e t ( i n p ) ; r e a d l n ( i n p , max); column := 1; { t h i s i s j u s t a marker to get 60 char l i n e s of output} f o r i := 1 t o max do begin {generate sequence} generator; i f number < 0.36640066 then base := 'A'; i f ((number >= 0.36640066) AND (number < 0.5090809)) then base := 'C ; i f ((number >= 0.5090809) AND (number < 0.6534122)) then base := 'G' ; i f number >= 0.6534122 then base := 'T'; w r i t e ( b a s e ) ; i f column MOD 60 = 0 then w r i t e l n ; column : = column + 1; end; end {dnamaker}. <lend 2> B. PROGRAM 2 program d o u b l e t ( o u t p u t ) ; I************************************************* * {* This program i s used to generate pseudo random {* sequences w i t h doublet p r o p e r t i e s ddtermined by {* the user. In i t s cu r r e n t c o n f i g u r a t i o n i t {* generates 7268 bases w i t h the same doublet {* p r o p e r i t e s as k a l i l o DNA. The v a r i a b l e s serve the {* same f u n c t i o n s as they d i d i n program 1. var seed, i , column: i n t e g e r ; random: r e a l ; base, next: char; procedure generator; begin seed := (13849 + 25173 * seed) MOD 65536; random := seed / 65536; I 110 end {generator}; begin £********* uigin base := 'C; w r i t e ( b a s e ) ; seed := 33333; column := 1; f o r i := 1 to 7268 do begin generator; i f i < 1355 then begin case base of 'A': begin i f random < (188 / 471) then next := 1A'; i f ((random >= (188 / 471)) AND (random < (254 / 471))) then next := 'C; i f ((random >= (254 / 471)) AND (random < (349 / 471))) then next := 'G'; i f random >= (349 / 471) then / 111 next := 1T'; end; ' C 1 : begin i f random < (63 / 219) then next := 'A'; i f ((random >= (63 / 219)) AND (random < (125 / 219))) then next := 1C 1; i f ((random >= (125 / 219)) AND (random < (159 / 219))) then next := 'G'; i f random >= (159 / 219) then next := 'T'; end; ' G' : begin i f random < (95 / 287) then next := 1A'; i f ((random >= (95 / 287)) AND (random < (132 / 287))) then next := 'C; i f ((random >= (132 / 287)) AND (random < (213 / 287))) then / 112 next := 1G 1; i f random >= (213 / 287) then next := 1T 1; end; 'T': begin i f random < (126 / 376) then next := 'A' ; i f ((random >= (126 / 376)) AND (random < (179 / 376))) then next := 'C'; i f ((random >= (179 / 376)) AND (random < (256 / 376))) then next := 'G'; i f random >= (256 / 376) then next := 1T'; end; end; {cases} end { i f i<1355} e l s e begin case base of 'A': begin i f random < (826 / 2165) then next := 'A'; i f ((random >= (826 / 2165)) AND then next := 1C'; i f ((random >= (1067 / 2165)) AND then next := 'G'; i f random >= (1385 / 2165) then next := 'T'; end; •C: begin i f random < (298 / 811) then next := 'A'; f ((random >= (298 / 811)) AND then next := 'C; f ((random >= (468 / 811)) AND then next := 'G'; i f random >= (499 / 811) then (random < (1067 / 2165))) (random < (1385 / 2165))) (random < (468 / 811))) (random < (499 / 8811))) next := *T'; end; 'G': begin i f random < (328 / 753) then next .:= ' A 1; i f ((random >= (328 / 753)) AND (random < (407 / 753))) then next := 'C; i f ((random >= (407 / 753)) AND (random < (545 / 753))) then next := 'G'; i f random >= (545 / 753) then next : = ' T'; end; 'T': begin i f random < (713 / 2127) then next : = ' A'; i f ((random >= (713 / 2127)) AND (random < (1033 / 2127))) then next := 'C; i f ((random >= (1033 / 2127)) AND (random < (1299 / 2127))) then / 115 next := 'G' ; i f random >= (1299 / 2127) then next := 'T'; end; end; {cases} end; { i f i>1354} base := next; w r i t e ( b a s e ) ; i f column MOD 60 = 0 then w r i t e l n ; column := column + 1 ; end; end {doublet}. C. PROGRAM 3 program brook(input, o u t p u t ) ; ^****************************************** {* This program c a l c u l a t e s D l to D6 as per Brooks et a l {* Reference i s JTB 109:77-93 £********************************* *} *} const blank = ' max = 6000; var i , j , k, 1, m, n, o: integer; {array counters} d l , d2, d3, d4, d5, d6: real; {deviations from max info due to clumps 1-6} length: integer; {length of sequence} sigma: real; {the second term of equations 5-8 in reference} firstterm: real; { likewise the f i r s t } base: char; sequence: packed array. [1 .. max] of char; single: packed array [1 4] of real; double: packed array [1 ... 16] of real; t r i p l e : packed array [1 .. 64] of real; quad: packed array [1 .. 256] of real; pent: packed array [1 .. 1024] of real; hex: packed array [1 .. 4096]. of real; f i r s t , second, third, fourth, f i f t h , sixth: integer; {read the bases into the array} length := max; base := blank; for i := 1 to max do begin while (base = ' ') do read(base); sequence[i] := base; base := blank; end; { reading of bases } for i := 1 to 4 do single[i] := 0; {set arrays to zero for} {singlets to sixes } for i := 1 to 16 do double[i] := 0; for i := 1 to 64 do t r i p l e [ i ] := 0; for i := 1 to 256 do quad[i] := 0; for i := 1 to 1024 do pent[i] := 0; for i := 1 to 4096 do hex[i] := 0; { now we go through the sequence and } { length singlets, doublets ... sixes} for i := 1 to (length - 5) do begin case sequence[i] of •A' : f i r s t := 1; •C : f i r s t := 2; 'G' : f i r s t := 3; f i r s t := 4; end; { case } case sequence[i + 1] of •A' : second := 1; •C : second := 2; •G': • second := 3; •T' : second := 4; end; { case } / 119 case sequence[i + 2] of 'A' : t h i r d := 1; 'C: t h i r d := 2; •G': t h i r d := 3; ' T ' 5 t h i r d := 4; end; { case } case sequence[i + 3] of •A': f o u r t h := 1; 'C*: f o u r t h := 2; 'G' : f o u r t h := 3; 'T' : f o u r t h := 4; end; { case } case sequence[i + 4] of •A': f i f t h := 1; 7 120 'C : f i f t h := 2; •G': f i f t h := 3; 'T' : f i f t h := 4; end; { case } case sequence[i + 5] of 'A' : sixth := 1; •C: sixth := 2; 'G' : sixth := 3; 'T': sixth := 4; end; { case } single[first] := single[first] + 1; double[4 * ( f i r s t - 1) + second] : = double[4 * ( f i r s t - 1) + second] +1; triple[16 * ( f i r s t - 1) + 4 * (second - 1) + third] := triple[16 * ( f i r s t - 1) + 4 * (second - 1 ) + third] + 1; quad[64 * ( f i r s t - 1) + 16 * (second - 1) + 4 * ( t h i r d - 1) + fourth] := quad[64 * ( f i r s t - 1) 16 * (second - 1) + 4 * ( t h i r d - 1) + fourth] + i pent[256 * ( f i r s t - 1) + 64 * (second - 1) + 16 * ( t h i r d - 1) + 4 * (fourth - 1) + f i f t h ] : = pent[256 * ( f i r s t - 1) + 64 * (second - 1) + 16 * ( t h i r d - 1) + 4 * (fourth - 1) + f i f t h ] + i ; hex[l024 * ( f i r s t - 1) + 256 * (second - 1) + 64 * ( t h i r d - 1) + 16 * (fourth - 1) + 4 * ( f i f t h - 1) + sixth] := hex[l024 * ( f i r s t - 1) + 256 * (second - 1) + 64 * ( t h i r d - 1) + 16 * ( fourth - 1) + 4 * ( f i f t h - 1) + sixth] + 1; end; go through the sequence and counting } now divide each count by the sequence length to get} a f r a c t i o n a l abundance} for i := 1 to 4 do s i n g l e [ i ] := s i n g l e [ i ] / (length - 5); for i := 1 to 16 do double[i] := double[i] / (length - 5); for i := 1 to 64 do t r i p l e [ i ] := t r i p l e [ i ] / (length - 5); for i := 1 to 256 do quad[i] := quad[i] / (length - 5); for i := 1 to 1024 do pent[i] := pent[i] / (length - 5); for i := 1 to 4096 do hexti] := hex[i] / (length - 5); { now we calculate d l to d6 } {dl} sigma := 0; for i := 1 to 4 do begin i f single[i] > 0 then sigma := sigma + single[i] * ln(single[i]) / ln(2); end; d l := 2 + sigma; writeln('Dl = ', d l : 3: 5); (d2> sigma := 0; firstterm := 0; for i •: = 1 to 16 do begin i f double [i] > 0 then sigma := sigma + double[i] * ln(double[i]) / ln(2); end; for j := 1 to 4 do begin for k := 1 to 4 do begin i f (single[j] * single[k]) > 0 then firstterm := firstterm + single[j] * single[k] * ln(single[j] * single[k]) / ln(2); end; end; d2 := sigma - firstterm; writeln('D2 = d2: 3: 5); {d3} sigma := 0; firstterm := 0; for i := 1 to 64 do begin i f t r i p l e [ i ] > 0 then sigma := sigma + t r i p l e [ i ] * l n ( t r i p l e [ i ] ) / ln(2); end; for j := 1 to 4 do begin for k := 1 to 4 do begin for 1 := 1 to 4 do begin i f (singletj] * single[k] * single[l]) > 0 then firstterm := firstterm + singletj] * single[k] (singletl]) * ln(single[j] * single[k] * single[l]) / ln(2); end; end; / 124 end; d3 := sigma - firstterm; writeln(*D3 = d3: 3: 5); {d4} sigma := 0; firstterm := 0; for i := 1 to 256 do begin i f quad[i] > 0 then sigma := sigma + quad[i] * ln(quad[i]) / ln(2); end; for j := 1 to 4 do begin for k := 1 to 4 do begin for 1 := 1 to 4 do begin for m := 1 to 4 do begin i f (single[j] * singlefk] * single[l] * single[m]) > 0 then firstterm := firstterm + single[j] * singlefk] * single[l] * single[m] * ln(single[j] * singlefk] * singlefl] * single[m]) / ln(2); end; end; end; end; d4 := sigma - firstterm; / 125 writeln('D4 = d4: 3: 5); {d5} sigma := 0; firstterm := 0; for i := 1 to 1024 do begin i f pent[i] > 0 then sigma := sigma + pent[i] * ln(pent[i]) / ln(2); end; for j := 1 to 4 do begin for k := 1 to 4 do begin for 1 := 1 to 4 do begin for m := 1 to 4 do begin for n := 1 to 4 do begin i f (single[j] * single[k] * single[l] * single[m] * single[n]) > 0 then firstterm := firstterm + singletj] * single[k] * singletl] * single[m] * single[n] * ln(singletj] * single[k] * single[l] * singletm] * singletn]) / ln(2); end; end; end; end; / 126 end; d5 := sigma - firstterm; writeln('D5 = d5: 3: 5); {d6} sigma := 0; firstterm := 0; for i := 1 to 4096 do begin i f hex[i] > 0 then sigma := sigma + hex[i] * ln(hex[i]) / ln(2); end; for j := 1 to 4 do begin for k := 1 to 4 do begin for 1 := 1 to 4 do begin for m := 1 to 4 do begin for n := 1 to 4 do begin for o := 1 to 4 do begin i f (single[j] * single[k] * single[l] * single[m] * single[n] * single[o]) > 0 then firstterm := firstterm + singletj] * singlefk] * singlefl] * singlefm] * singlefn] * singlefo] * ln(single[j] * singlefk] *_ singlefl] * singlefm] * singlefn] * singlefo]) / ln(2); end; end; end; end; end; end; d6 := sigma - f i r s t t e r m ; w r i t e l n ( ' D 6 = ', d6: 3: 5); end {brook}. D. PROGRAM 4 program newbrook(input, o u t p u t ) ; .£*************************************** {* This c a l c u l a t e s D l t D6 using the maximum overlap of *} {* windows of s i z e n-1. The D-values i t generates are *} {* independent of each o t h e r . I t handles a sequence of 6000*} {* bases i n i t s current c o n f i g u r a t i o n and can be e d i t t e d *} {* t o handle sequences of other s i z e s . |***************************************** const blank = ' '; max = 6000; var i , j , k, 1, m, n, o: integer; {array counters} d l , d2, d3, d4, d5, d6: real; {deviations from max info due to clumps 1-6} length: integer; {length of sequence} sigma: real; {the second term of equations 5-8 in reference} firstterm: real; { likewise the f i r s t } base: char; sequence: packed array [1 .. max] of char; single: packed array [1 .. 4] of real; double: packed array [ l .. 16] of real; t r i p l e : packed array [1 .. 64] of real; quad: packed array [ l .. 256] of real; pent: packed array [1 .. 1024] of real; hex: packed array [1 .. 4096] of real; f i r s t , second, third, fourth, f i f t h , sixth: integer; begin £************ main ***************} {read the bases into the array "sequence"} length := max; base := blank; / 129 for i := 1 to max do begin while (base = ' ') do read(base); sequence [i] := base; base := blank; end; { reading of bases } for i := 1 to 4 do single[i] := Of-fset arrays to zero for} {single-mers to 6-mers } for i := 1 to 16 do double[i] := 0; for i : = ' 1 to 64 do t r i p l e [ i ] := 0; for i := 1 to 256 do quad[i] := 0; for i := 1 to 1024 do pent[i] := 0; for i :=' 1 to 4096 do hex[i] := 0; { now we go through the sequence and length } .{ singlets, doublets ... sixes} for i := 1 to (length - 5) do begin case sequence[i] of / 130 'A' : f i r s t := 1; 'C : f i r s t := 2; •G' : f i r s t := 3; •T': f i r s t := 4; end; { case } case sequence[i + 1] of 'A' : second := 1; •C : second := 2; •G' : second := 3; •T' : second := 4; end; { case } case sequence[i + 2] of •A': third := 1; •C : / 131 third := 2; •G': third := 3; . T i . . third := 4; end; { case } case sequence[i + 3] of 'A' : fourth := 1; •Cf: fourth := 2; 'G': fourth := 3; *T': fourth := 4; end; { case } case sequence[i + 4] of •A': f i f t h := 1; 'C: f i f t h := 2; 'G' : f i f t h := 3; / 132 'T1 : f i f t h := 4; end; { case } case sequence[i + 5] of •A': sixth := 1; •C : sixth := 2; *G': sixth := 3; •T': sixth := 4; end; { case } single[first] := single[first] + 1; double[4 * ( f i r s t - 1) + second] := double[4 * ( f i r s t - 1) + second] +1; tri p l e [ l 6 * ( f i r s t - 1) + 4 * (second - 1) + third] := t r i p l e [ l 6 * ( f i r s t - 1) + 4 * (second - 1) + third] +1; quad[64 * ( f i r s t - 1) + 16 * (second - 1) + 4 * (third - 1) + fourth] := quad[64 * ( f i r s t - 1) + 16 * (second - 1) + 4 * (third - 1) + fourth] + 1; pent[256 * ( f i r s t - 1) + 64 * (second - 1) + 16 * (third - 1) + 4 * (fourth - 1) + f i f t h ] : = pent[256 * ( f i r s t - 1) + 64 * (second - 1) + 16 * (third - 1) + 4 * (fourth - 1) + f i f t h ] + i ; hex[l024 * ( f i r s t - 1) + 256 * (second - 1) + 64 * (third - 1) + 16 * (fourth - 1) + 4 * ( f i f t h - 1) + sixth] := hex[l024 * ( f i r s t - 1) + 256 * (second - 1) + 64 * (third - 1) + 16 * ( fourth - 1) + 4 * ( f i f t h - 1) + sixth] + 1; end; { go through the sequence and counting } { now divide each count by the sequence length to get} { a fractional abundance} for i := 1 to 4 do single[i] := single[i] / (length - 5); for i := 1 to 16 do double[i] := double[i] / (length - 5); for i := 1 to 64 do t r i p l e [ i ] := t r i p l e [ i ] / (length - 5); for i := 1 to 256 do quad[i] := quad[i] / (length - 5); for i := 1 to 1024 do pent[i] := pent[i] / (length - 5); for i := 1 to 4096 do hex[i] := hex[i] / (length - 5); { now we calculate d l to d6 } { d l and d2 are calculated as in reference } {dl} sigma := 0; for i := 1 to 4 do begin i f single[i] > 0 then sigma := sigma + single[i] * ln(single[i]) / ln(2); end; dl : = . 2 +. sigma; write(dl: 3: 5, '); ld2} sigma := 0; firstterm := 0; for i := 1 to 16 do begin i f double[i] > 0 then sigma := sigma + double[i] * ln(double[i]) / ln(2); end; for j := 1 to 4 do begin for k := 1 to 4 do begin i f (singletj] * single[k]) > 0 then firstterm := firstterm + single[j] * single[k] * ln<singletj] * single[k]> / ln(2); end; end; d2 := sigma - f i r s t t e r m ; w r i t e ( d 2 : 3: 5, * ) ; {end d2} {d3} sigma := 0; f i r s t t e r m := 0; f o r i := 1 to 64 do begin i f t r i p l e t i ] > 0 then £*************************** {* The modulo f u n c t i o n f i n d s {* window whose f i r s t bases {* match the l a s t bases of a ' {* previous window. { *************************** sigma := sigma + t r i p l e [ i ] * l n ( t r i p l e [ i ] ) / l n ( 2 ) ; end; f o r i := 1 to 16 do begin f o r j := (4 * ( ( i - 1) MOD 4) + 1) to(4 * ( ( i - 1) MOD 4) + 4) do begin i f ( d o u b l e [ i ] * d o u b l e t j ] / s i n g l e [ ( i - 1) MOD 4 + 1]) > 0 then f i r s t t e r m := f i r s t t e r m + d o u b l e [ i ] * d o u b l e [ j ] / s i n g l e [ ( i - 1) MOD 4 + 1] * l n ( d o u b l e t i ] * d o u b l e t j ] / s i n g l e [ ( i - 1) MOD 4 + 1]) / l n ( 2 ) ; end; end; d3 := sigma - f i r s t t e r m ; / 136 w r i t e ( d 3 : 3: 5, ' ) ; {d4} sigma := 0; f i r s t t e r m := 0; f o r i := 1 to 256 do begin i f quad[i] > 0 then sigma := sigma + quad[i] * l n ( q u a d [ i ] ) / l n ( 2 ) ; end; f o r i := 1 to 64 do begin f o r j := (4 * ( ( i - 1) MOD 16) + 1) to(4 * ( ( i - 1) MOD 16) + 4) do begin i f ( t r i p l e [ i ] * t r i p l e t j ] * d o u b l e [ ( i - 1) MOD 16 +1}) > 0 then f i r s t t e r m := f i r s t t e r m + t r i p l e [ i ] * t r i p l e [ j ] / d o u b l e [ ( i - 1) MOD 16 + 1] * l n ( t r i p l e [ i ] * t r i p l e t j ] / d o u b l e [ ( i - 1) MOD 16 + 1]) / l n ( 2 ) ; end; end; d4 := sigma - f i r s t t e r m ; w r i t e ( d 4 : 3: 5, ' ) ; {d5} sigma := 0; f i r s t t e r m := 0; f o r i := 1 to 1024 do begin / 137 i f p e n t [ i ] > 0 then sigma := sigma + p e n t [ i ] * l n ( p e n t [ i ] ) / l n ( 2 ) ; end; f o r i := 1 t o 256 do begin f o r j := (4 * ( ( i - 1) MOD 64) + 1) to(4 * ( ( i - 1) MOD 64) + 4) do begin i f (quad[i] * quad[j] * t r i p l e [ ( i - 1) MOD 64 + 1]) > 0 then f i r s t t e r m := f i r s t t e r m + quad[i] * quad[j] / t r i p l e t ( i - 1) MOD 64 + 1] * l n ( q u a d [ i ] * quad[j] / t r i p l e t ( i - 1) MOD 64 + 1]) / l n ( 2 ) ; end; end; d5 := sigma - f i r s t t e r m ; w r i t e ( d 5 : 3: 5, ', ' ) ; {d6} sigma := 0; f i r s t t e r m := 0; f o r i := 1 to 4096 do begin i f h e x [ i ] > 0 then sigma := sigma + h e x [ i ] * l n ( h e x [ i ] ) / l n ( 2 ) ; end; f o r i := 1 to 1024 do begin 7 138 f o r j := (4 * ( ( i - 1) MOD 256) + 1) to(4 * . ( ( i - 1) MOD 256) + 4) do begin i f ( p e n t t i ] * p e n t [ j ] * quad[(i - 1) MOD 256 + 1]) > 0 then f i r s t t e r m := f i r s t t e r m + p e n t [ i ] * p e n t [ j ] / quad[(i - 1) MOD 256 + 1] * l n ( p e n t [ i ] * pent [ j ] / q u a d [ ( i - 1) MOD 256 + l ] ) / l n ( 2 ) ; end; end; d6 := sigma - f i r s t t e r m ; w r i t e ( d 6 : 3: 5, ', ' ) ; w r i t e l n ; end {brook}. E. PROGRAM 5 program manhat(input, o u t p u t ) ; .£****************************************************j {* This program w r i t e s another program. The program *} {* i t w r i t e s i s a t e l l a g r a f program which w i l l then *} {* draw a Manhattan block p l o t (Gates p l o t ) of the *} {* sequence. I t a u t o m a t i c a l l y scales the t e l l a g r a f *} {* p l o t so that i t w i l l look w e l l proportioned. *} i********* ******************************************* "i / 139 const blank = ' *; var x, y, count: integer; xmax, xmin, ymax, ymin: integer; {plotting limits} base: char; begin reset(input); count := 0; base := blank; x := 0; y := 0; xmax := 0; ymax := 0; ymin := 0; xmin : = 0; {now make the program write the f i r s t part of the tellagraf program} writeln('generate a.plot.'); writeln('title i s • • * * * * * * * * * * * • • . ' ) ; writeln('x axis label i s '' — c g — ' ' . ' ) ; writeln('y axis label i s 1 1 — a t — ' ' . ' ) ; / 140 w r i t e l n ( ' c u r v e 1, symbol count=0.'); w r i t e l n ( ' c r o s s o f f . ' ) ; w r i t e l n ( ' i n p u t d a t a . ' ) ; w r i t e l n C ' d a t a ' " ) ; {now generate the p a i r s to be p l o t t e d and f i n d the max and min} I********** main **********} while NOT e o f ( i n p u t ) do begin base := blank; w h i l e ((base = ' ') AND (NOT eof)) do begin read(base); end; count := count + 1; i f base IN ['A', 'C, 'G', 'T' ] then base := base e l s e base := 'z'; case base of ' A 1 : y := y - 1; •C: x := x - 1; •G': x := x + 1; / 141 •T': y := y + 1; •Z': writeln('*** strange base * * * ' ) • end; {cases} {now adjust the max and min i f these x and y f a l l outside previous limits} i f x > xmax then xmax := x; i f y > ymax then ymax := y; i f x < xmin then xmin := x; i f y < ymin then ymin := y; writelnC ', x: 1, ', ', y: .1, ' ') ; count := 0; end; i f xmax > ymax then ymax := xmax e l s e xmax : = ymax; i f xmin > ymin then xmin := ymin e l s e ymin := xmin; {the above two i f s , ensure that the steps i n the p l o t w i l l be approximately the same length on the p l o t because the lengths of the two a x i s w i l l be s i m i l a r } xmax := (xmax DIV 10) * 10 + 10; ymax := (ymax DIV 10) * 10 + 10; xmin := (xmin DIV 10) * 10 - 10; ymin := (ymin DIV 10) * 10 - 10; {now w r i t e the par t of the t e l l a g r a f program that f o l l o w s the data} w r i t e l n ( ' e o d . 1 ) ; w r i t e l n ( ' x a x i s maximum ', xmax, ', minimum ', xmin,. ', s t e p s i z e 10.'); w r i t e l n ( ' y a x i s maximum ', ymax, ', minimum ', ymin, ', s t e p s i z e 10.'); w r i t e l n ( ' n o l e g e n d . 1 ) ; w r i t e l n ( 1 frame.'); end {manhat}. 

Cite

Citation Scheme:

        

Citations by CSL (citeproc-js)

Usage Statistics

Share

Embed

Customize your widget with the following options, then copy and paste the code below into the HTML of your page to embed this item in your website.
                        
                            <div id="ubcOpenCollectionsWidgetDisplay">
                            <script id="ubcOpenCollectionsWidget"
                            src="{[{embed.src}]}"
                            data-item="{[{embed.item}]}"
                            data-collection="{[{embed.collection}]}"
                            data-metadata="{[{embed.showMetadata}]}"
                            data-width="{[{embed.width}]}"
                            async >
                            </script>
                            </div>
                        
                    
IIIF logo Our image viewer uses the IIIF 2.0 standard. To load this item in other compatible viewers, use this url:
http://iiif.library.ubc.ca/presentation/dsp.831.1-0098293/manifest

Comment

Related Items