Open Collections

UBC Theses and Dissertations


Constraints on the organization and information properties of DNA sequences Sibbald, Peter Ramsay 1988

CONSTRAINTS ON THE ORGANIZATION AND INFORMATION PROPERTIES OF DNA SEQUENCES

by

PETER RAMSAY SIBBALD

B.Sc. Simon Fraser University, 1984

A THESIS SUBMITTED IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

in

THE FACULTY OF GRADUATE STUDIES

Botany

We accept this thesis as conforming to the required standard

THE UNIVERSITY OF BRITISH COLUMBIA

8 December 1988

© Peter Ramsay Sibbald, 1988

In presenting this thesis in partial fulfilment of the requirements for an advanced degree at The University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the Head of my Department or by his or her representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission.

Botany
The University of British Columbia
2075 Wesbrook Place
Vancouver, Canada V6T 1W5

Date: 8 December 1988

ABSTRACT

In an investigation which has concentrated primarily on the two completely sequenced chloroplast genomes, one from tobacco and the other from a liverwort, an attempt has been made to discover certain factors which order DNA sequences. This was done by examining in detail the organization at various levels throughout the genome: 1. by looking at doublet properties, 2. by employing methods to predict the existence of genes, and 3. by analysis based on information theory. The doublet analysis was performed on seven categories of DNA: protein genes, ribosomal proteins, open reading frames (URF), rDNA, tDNA, introns and noncoding regions. To a considerable extent the doublet properties of all categories are similar, although rDNA appears to have unusual doublet properties, suggesting that there is a selection pressure on the doublet properties of all categories of DNA, not only on coding regions. In particular, some doublets are more abundant than others, and I suggest that certain doublets have thermodynamic properties that facilitate replication. In addition, I have tested Nussinov's hypothesis, that complementary doublets have similar relative abundances due to inverted duplication events, and it would seem not to explain the phenomenon.

Fickett's method to predict whether URFs are genes was more successful than Shepherd's method. Fickett's method was modified for use on the chloroplast genomes and its rate of successful prediction increased substantially. This modified method will be useful as other chloroplast genomes are sequenced and could be improved for use on specific groups. The analysis also supports Fickett's contention that the ability to predict genes based only on sequence data shows that the requirement to code for protein exerts a detectable amount of order on genes, and that this order is distinguishable from that in noncoding regions. Nearly all URFs greater than 200 base pairs in both genomes are predicted to be true genes.

Informational analysis showed that the order at the level of single and double bases is most significant, with a lesser amount of order at the triplet and 4-plet level. This was true for both coding and noncoding regions in both plants. This is in contrast with earlier work (Rowe and Trainor) which found that in viruses there was a significant difference between coding and noncoding regions. It is suggested that the 4-plet ordering in coding and noncoding DNA may be optimized for replication rather than protein production. Several new problems and experiments have been suggested.

TABLE OF CONTENTS

Abstract
List of Tables
List of Figures
List of Abbreviations
Acknowledgement
I. General Introduction
II. Doublet or Nearest Neighbour Analyses
    A. Materials and Methods
    B. Results
    C. Discussion
        1. Nussinov's Hypothesis
        2. Doublet suppression
        3. Thermodynamic considerations
III. Prediction and identification of chloroplast genes
    A. Description of the problem
    B. Materials and Methods
    C. Results
    D. Discussion
IV. Information theory and DNA
    A. Introduction to information theory
    B. Gatlin's Approach
    C. A New Approach
    D. Proof of Equivalence
    E. Redundancy
    F. Application of information theory to sequence analysis
    G. Use of higher D-values
    H. Inverted repeats as a source of redundancy
References
Appendix 1. A Selection of Computer Programs
    A. Program 1
    B. Program 2
    C. Program 3
    D. Program 4
    E. Program 5

LIST OF TABLES

Table 1. Correlation coefficients for tobacco doublet relative abundances analysed on the sense strand
Table 2. Correlation coefficients for liverwort doublet relative abundances analysed on the sense strand
Table 3. Correlation coefficients for tobacco doublet relative abundances analysed on the B-strand
Table 4. Correlation coefficients for liverwort doublet relative abundances analysed on the A-strand
Table 5. The G + C contents and percentage of the genome comprised of different categories of DNA
Table 6.
Thermodynamic properties for doublets
Table 7. Correlations between doublet abundances and doublet properties
Table 8. New weighting parameters for Fickett testcode procedure
Table 9. Prediction of genes in the tobacco chloroplast genome
Table 10. Prediction of genes in the liverwort chloroplast genome
Table 11. The performance of modified Fickett procedure on genes of other species

LIST OF FIGURES

Figure 1. Tobacco doublet relative abundances for various categories of DNA
Figure 2. Liverwort doublet relative abundances on the sense strand for the various categories of DNA
Figure 3. Tobacco doublets analyzed on the B-strand
Figure 4. Liverwort doublets analyzed on the sense strand
Figure 5. Correlations between doublet abundances +/- one IR (tobacco)
Figure 6. Correlations between doublet abundances +/- one IR (liverwort)
Figure 7. Cluster analysis of liverwort genes and URFs greater than 200 bp
Figure 8. Cluster analysis of the tobacco genes and URFs greater than 200 bp
Figure 9. Two entropy scales for DNA
Figure 10. Two channels through which DNA information flows
Figure 11. The G + C ranges for various taxa
Figure 12. D1 to D6 for tobacco and pseudorandom sequences
Figure 13. Real and simulated D2-values
Figure 14. Real and simulated D3-values
Figure 15. Real and simulated D4-values
Figure 16. Real and simulated D5-values
Figure 17. Real and simulated D6-values
Figure 18. Redundancy calculated in nonoverlapping windows of 500 bases for the tobacco genome
Figure 19. Redundancy calculated in nonoverlapping windows of 500 bases for the liverwort genome
Figure 20. Information parameters in unique and repeat regions for three species

LIST OF ABBREVIATIONS

A: adenine, C: cytosine, G: guanine, T: thymine
G + C content: the fraction of bases that are G or C. The human genome for example has G + C = 0.43, occasionally expressed as a percentage.
dsDNA: double stranded DNA
strand: a single strand is those bases that are joined covalently to each other through the sugar phosphate backbone. A double strand is two of these held together by hydrogen bonds.
relative abundance: the amount of something divided by the amount expected or predicted
IR: inverted repeat
URF: unidentified reading frame; synonymous with ORF in much of the literature
IRA: inverted repeat A
IRB: inverted repeat B
SD: standard deviation
R: purine
Y: pyrimidine
bp: base pairs
Z-DNA: DNA whose helix twists in the opposite direction to "normal" or B-DNA

ACKNOWLEDGEMENT

To my fellow graduate students who have been helpful and supportive throughout, I am most thankful. In particular, Mishtu Banerjee, Lacey Samuels, Leonard Dyck, Michael White, Murray Webb and Kali Robson have acted as sounding boards, helped sort through ideas and all provided criticism. Jack Maze, my supervisor, has been a friend and more supportive than any student can reasonably expect. I wish to thank the other members of my committee, Max Taylor, Tony Griffiths, and Cy Finnegan, for taking time to help and having faith in my work. Dan Brooks, John Collier and Lila Gatlin provided technical help. My family have been supportive throughout, and I am indebted to Lute and Anna van den Berg, who so kindly offered us a place to stay during the final birth pangs of this dissertation. Lastly, I wish to thank both my wife, Mary Jo, for her considerable patience, and Joseph, who is teaching me about development.

I. GENERAL INTRODUCTION

Since efficient methods were first devised to sequence DNA (Sanger and Coulson 1975, Maxam and Gilbert 1977), the amount of nucleotide sequence data has increased exponentially (DeLisi 1988). Prior to 1975 there were essentially no such data. Now, in GenBank release 54.0, there are in excess of 1.6 x 10^7 base pairs (bp) and this trend shows no sign of slowing. As improved methods of sequencing are developed and this process is automated (Johnson-Dow et al. 1987, review article), data will accumulate even more rapidly. One result of this activity is the emergence of a new discipline, theoretical molecular biology (Nussinov 1987). The principal goals of this discipline include developing a theoretical framework for understanding the organization of DNA, RNA, and protein molecules. Since an understanding of DNA organization is necessary to the understanding of all biology, even a small increase in our understanding of DNA organization should result in a substantial increase in our understanding of biology. The aims of this dissertation are to seek some of the general principles and causes of DNA organization, so that what is found may act as a base for further work. DeLisi (1988) has emphasised the magnitude of the problem in our future, expressing concern about what we will do when the data base is growing at a rate of 10^6 bp per day, which is sure to happen when the human genome is sequenced. Thus methods must be found to thoroughly and rapidly analyse these data.

As Nussinov (1987) has observed, theoretical molecular biology is firmly rooted in empirically obtained data and should function in concert with more traditional experimental approaches. Therefore, most of the work described below is immediately juxtaposed to results obtained in a research laboratory. This has been a conscious choice, the result of which, it is hoped, will be to make the work more useful to a wide range of biologists, both theoretically and empirically inclined.

In keeping with the above ideas, this dissertation has focussed on chloroplast DNA, since, in 1986, the complete sequences of the chloroplast genomes of tobacco (Shinozaki et al. 1986) and the liverwort, Marchantia polymorpha (Ohyama et al. 1986) were reported. Reasons for choosing these sequences include my own interest in photosynthesis and the completeness of the reported sequences. No detailed analysis of the type I have performed has yet been reported, perhaps because the sequences have become available only recently. In addition, the tobacco and liverwort chloroplast genomes are 155,844 and 121,024 bp respectively, a size that is small enough for rapid and relatively inexpensive computer analysis yet large enough to be interesting and informative and, for some analyses, "statistically significant". Further information regarding these sequences will be presented later, but a recent review article concerning them (Umesono and Ozeki 1987), as well as more general overviews of chloroplast genomes by Zurawski and Clegg (1987) and Palmer (1985), are available. More specialized reviews concerning the molecular genetics of Chlamydomonas plastids (Rochaix 1987) and the topography or conformation of DNA within algal chloroplasts (Coleman 1985) are also available.

One important reason for working with chloroplast DNA is the complete absence of modified bases. The only reported exception is Chlamydomonas chloroplast DNA, which has 5-methylated cytosine (Palmer 1985). This is important since some of the analyses would have to be changed if the possibility of modified bases had to be allowed. Due to the absence of modified bases in both tobacco and liverwort there is no concern that cytosine, for example, might be methylated in vivo.

This work is concerned with discovering what constraints act on DNA sequences and the effects of these constraints. A constraint is defined broadly as any factor, biotic or abiotic, internal or external to the organism, that has an effect on the DNA sequence. In this regard I think that we, as biologists, are not fully aware of how truly ignorant we are about DNA. We know much about how DNA codes for protein but, in mammals for example, only about 1% of the total DNA is known to code for protein; of the rest of it, while some is thought to be involved in chromosome structure, we cannot even seriously postulate a reason for its existence.

One way to find solutions to problems in a new area (such as theoretical molecular biology) is by analogy, i.e., to adapt already solved problems in an existing area that seem similar. One potentially valuable analogy, suggested by Ebling and Jimenez (1980), Brendel and Busse (1984) and Head (1987), is that DNA is a language, and that the problem is then to search for the "grammar", i.e. the rules that govern which "sentences" are allowed. At the beginning of this study I thought that DNA would be essentially language-like, but is the analogy a valid one? My view has now changed, and I think that DNA should not be viewed as such, principally because a language is essentially a linear thing, while DNA, perhaps, does not seem to code in a linear fashion in most organisms.

There have also been attempts (Gatlin 1972a, Holzmuller 1984, Brooks et al. 1984) to use information theory to understand some aspects of DNA sequence composition and ordering. While examining the feasibility of using information theory for this purpose, I have found that in some reported cases information theory has been incorrectly applied.

The last goal of this dissertation is to identify areas for further study in the relatively new area of theoretical molecular biology.

II. DOUBLET OR NEAREST NEIGHBOUR ANALYSES

A systematic method for the analysis of strings of symbols involves working from the smallest units of organization up to larger ones. If, for example, a cryptographer were attempting to decipher an intercepted message, the first step would be to quantify the frequency with which each symbol in the message occurs. Secondly, he might examine the frequency with which adjacent pairs of symbols occur. This procedure, often employed in code-breaking, can provide insights into the syntax underlying such texts. Similar ideas can be fruitfully employed with DNA sequences.

The first step in the analysis of a DNA sequence is determining the amount of each of the four bases (this is analogous to determining the letter frequency in a message). Classically, the G + C content† was determined by CsCl centrifugation (Freifelder 1982). With the development of sequencing, the amounts of all four bases are now usually determined by direct counting. Since, in double stranded DNA, the amount of A is necessarily equal to the amount of T, and likewise the amount of C equals the amount of G, the G + C content provides useful knowledge of the base composition. In a single strand of DNA the amounts of A and T, and of G and C, are also usually roughly equimolar, even though there is no known base pairing requirement that would force this to be so.

† A list of abbreviations, and the page where they are first used, appears on page vi.

To continue the analogy with cryptography, we may consider the frequencies of adjacent symbols, in this case adjacent bases in a DNA sequence. In 1961 Josse, Kaiser and Kornberg reported the nearest neighbour frequencies for a variety of double stranded DNAs (dsDNA). Due to the rules of base pairing, and since the experiment was performed with dsDNA as a template, it was inevitable that complementary doublets would have the same abundances (Nussinov 1987). A given doublet on one strand, read 5' to 3', pairs with its complement, which is read 3' to 5' on the other strand. For example, in the sequence

5'-ACTGAGCATAACCGGTCTT-3'
3'-TGACTCGTATTGGCCAGAA-5'

the amounts of the complementary doublets AA and TT, CC and GG, AC and GT, AG and CT, GA and TC, TG and CA are identical. The amount of AA must therefore equal the amount of TT simply due to pairing rules between the complementary strands. The amounts of TA, AT, CG and GC are not under this constraint, since each pairs with its own complementary doublet: TA read 5' to 3' on one strand is TA read 5' to 3' on the other strand.

The experiment of Josse et al. (1961) is impressive as a piece of fine biochemistry; it showed that there is bias in the amounts of doublets, such that doublets do not occur with a frequency that would be indicative of random ordering. Consider for example the doublet CA. Suppose, in a given sequence, the frequency of C is 0.20 and the frequency of A is 0.30. Then any specific base in the sequence has a probability of 0.20 of being C, and the next base has a probability of 0.30 of being A; if the bases in the sequence were randomly arranged, the probability of finding the doublet CA would be 0.06. Should an analysis of the sequence find that the frequency of the doublet CA is actually 0.08, then in this case CA would be more abundant than expected. (One way to express this arithmetically is as a relative abundance, i.e. actual/expected = 0.08/0.06 = 1.33.) In no organism examined to date do relative abundances indicate randomness in doublet organization (Nussinov 1987).

Due to the complementarity of the two strands that make up a dsDNA, it is sufficient, for many purposes, to discuss only one strand. Thus, in the convention used herein, any DNA sequence discussed will be shown from 5' to 3', unless otherwise specified. For example, the dsDNA

5'-AATGCC-3'
3'-TTACGG-5'

shall be written as AATGCC (or GGCATT). With dsDNA it is understood that it is convenient to discuss the doublet properties of one strand only.

With the appearance of direct sequencing, it became possible to find doublet relative abundances by counting, using a computer. Such data have been presented for a variety of human nuclear material (Day and Blake 1984, Hinds and Blake 1984), and very low relative abundances of CG and TA were reported in that material. Doublet relative abundances as a function of base composition within several more general groupings have also been reported: invertebrates, vertebrates, DNA viruses and vertebrate mitochondria (Nussinov 1984a), as well as prokaryotes and eukaryotes (Nussinov 1984b). Again, TA is a rare doublet in all groups, while in all groups, with the exception of vertebrate mitochondrial sequences, CG is rare. Likewise Blake and Earley (1986), in an analysis using approximately 300,000 bp of the E. coli genome, found such doublets to be rare. They have also analysed the doublet properties of coding and noncoding regions and find that the doublet properties in coding regions are very similar to those in the noncoding regions, although there is a well characterized finding that the G + C content in coding regions is greater than in noncoding regions. Nussinov (1987) has pointed out that though doublet properties are similar in both coding and noncoding regions, other workers (Borodovskii et al. 1987a,b) have concentrated on the differences in the doublet properties in coding and noncoding regions of E. coli, since they are not identical.

This literature can be summarized as follows:
1. Doublet properties are nonrandom.
2. They are to some degree similar in noncoding and coding regions within a genome.
3. Certain doublets such as CG and TA tend to be rare.

Another property of interest to the present study is that even within a single strand of DNA, complementary doublets have similar relative abundances (Nussinov 1984a,b, 1987). Nussinov has suggested that this is the result of numerous inverted duplication events during genome evolution. For example, if the double stranded DNA

5'-ACTG-3'
3'-TGAC-5'

was inverted and duplicated to give ACTGCAGT on one strand, then the complementary doublets AC and GT, for example, would be equally abundant. The presence of large inverted repeats (IR) in the chloroplast genome provides an opportunity to test this hypothesis.

The rarity of the doublet CG (CG suppression) has been postulated to be due to the tendency of C to be methylated when followed by G, with deamination of the methylated C resulting in a conversion of CG to TG. TA has been thought to be rare because of stop codon avoidance. Stop codons in the standard genetic code are TAA, TAG and TGA, two of which contain the doublet TA. The predictions that would follow if these two explanations accounted for the rarity of CG and TA are:
1. The "missing" CG should appear as an excess of TG.
2. TA should be rare in protein coding regions but not particularly rare in other regions.
3. CG suppression should not be found in organisms which lack methylation of C.

Some of these predictions have been tested by Mazin and Vanyushin (1987 a,b), who found that in eukaryotes CG loss was compensated by an increase in TG. However, they also found the same result in mitochondrial DNA, which is not methylated. All organelles, with the exception of Chlamydomonas chloroplasts (Palmer 1985), generally lack 5-methylcytosine. In order to explain this apparent disparity, they postulated that the ancestors of these organisms must at one time have had methylation. CG suppression is nevertheless found in nuclear, mitochondrial and chloroplast DNA, which suggests that methylation may not be the cause of CG suppression in organisms which presently lack it.

Boudraa and Perrin (1987) have shown that TA is avoided in plant mitochondrial DNA, as well as in nuclear and chloroplast DNA. In protein genes TA is avoided in all codon positions†, not just in positions I and II. In addition, they argue that TA is selected against in regions which do not code for protein. They suggest, rather, that TA avoidance is simply the result of selection pressure subject to the physical-chemical properties of DNA. To support this idea they cite the work by Breslauer et al. (1986), in which it was shown that of all 16 doublets, TA binds most weakly with the complementary doublet on the opposite strand while CG binds most strongly. They suggest that TA and CG may be avoided since they prevent thermodynamic optimization of DNA sequences. Very recently Srinivasan et al. (1987) have reported a detailed study of the twist, roll and tilt properties of residues in doublets. A goal of this dissertation is to assimilate such work into an explanation of doublet frequency.

† A codon consists of three bases. ATG for example codes for methionine. Position I is A, II is T and III is G.

A.
MATERIALS AND METHODS

All chloroplast DNA sequence data were obtained from the University of British Columbia computer centre via GenBank Release 52.0. All analyses were performed on the UBC mainframe computer using programs written by the author in Pascal, except certain statistical analyses (such as autocorrelations and Spearman-Rank correlations), which were performed using the package MIDAS (Fox and Guire 1976). Data were plotted using Tellagraf (a product of Integrated Software Systems Corporation Graphics, San Diego, California). Descriptions of the positions of gene and unidentified reading frame (URF) components of both the liverwort (Ohyama et al. 1986) and tobacco (Shinozaki et al. 1986) chloroplast genomes are available on the GenBank tapes.

The analyses of doublets can be performed in two fundamentally different ways, depending on the question being addressed. The most common method is to analyse the gene 5' to 3' on the sense strand; an approach based on a belief that it is the function of DNA that is important, and that the function of DNA is to code for components which the organism requires. When using this method, an arbitrary decision must be made regarding which strand and direction ought to be analysed in the case of noncoding regions. A second method examines the same strand in the same direction for all types of DNA in a given genome. Practically, this method has the disadvantage, in the case of incompletely sequenced genomes, that it is not always known which strand a gene is on. However, since it is based on an emphasis of structure rather than function, it allows for a natural treatment of noncoding regions. Both methods have been employed below.

B.
RESULTS  The to  entire 3'.  Fig.  Doublet 1  (total).  remaining To  tobacco  relative  how the  sense  pattern, strand  Noncoding arbitrary the  each  of  the  different  (Note  were  were  genome,  total genome  IRA on  except  that  analyzed  the 5'  analyzed  5'  with  (Deno et al.  1983),  5'  the  shown  in  followed  the  results  by  doublet  of D N A w i t h i n  ribosomal protein  genes, t D N A  sense  not  3'  are to  strand on  is the  on  the  bis  to  (Fig.  1).  3'  the  A-strand,  is are  genes, to  the  on  the  B-strand.)  which  strands.  t h a t noncoding regions  ribulose  5'  B-strand,  i n their organization and that all categories  t T h e A - s t r a n d is the one t h a t codes for B - s t r a n d is the complementary strand.  the  contribute  always  noncoding on both 3'  of  t o t a l - I R A ; protein  analyzed  to  analysis  properties  noncoding, . was  comparison. E x a m i n a t i o n of F i g . 1 shows to the  bp  little effect  choice since noncoding regions  liverwort  B-strandt  determined,  25,339  categories  rDNA,  category,  (Fig. 1).  regions  analyzed on the  (total-IRA) has  introns, noncoding, U R F s , overall  was  abundances  Removal  genome  determine  genome  was  an  In F i g . 1, shown very  for  similar  of D N A show  -phosphate carboxylase.  a The  Doublet or Nearest Neighbour Analyses / 12 Figure 1. Tobacco doublet relative abundances for various categories of DNA. All the y axes are identical. All were analysed on the sense strand 5' to 3'. Total liverwort chloroplast DNA is provided for comparison.  1.4  TOTAL  1.3  TOTAL-IRA  1.2  1 . 1 1 0.9 0.8 0.7 0.6 H 1.3  PROTEIN  CENES  INTRONS  1.2  1 . 1 1 0.9 0.8 0.7 0.6 H 1.3 1.2 H 1 . 1 1 0.9 0.8 0.70.61.3  TRNA GENES  1.2  1 . 1 1 0.9 0.8 0.7 0.6 1.3 1.2  1 . 
1  H  0.9 0.8 0.7 0.6  o o ^ <L> • ~ O © O *" *-  RJBOSOMAL PROTEIN  GENES  Doublet or N e a r e s t Neighbour A n a l y s e s / general  downward  slope  from  left  to  right,  indicating  doublet properties.  Relative to tobacco, l i v e r w o r t has  TC  more  doublets.  To  correlations  were  nonparametric linearity  of  positively  null  calculated  statistic the  data),  correlated,  correlated.  Table  model  model has  readily  in  to  1.0  if  if uncorrected that  positive  Therefore  assumptions  close  the  of  the  and  36  negative  )=  2  -35  properties,  made  are  (Table  are  i f strongly  positive.  correlations  1).  are  This  distribution  properties  -1.0  similar  Spearman-Rank  about  doublet  close to  have  deficit in G A and  categories  are  a l l correlations and  a probability of 2/(2  correlations.  no  is  pair  they  a marked  doublet  each  e.  shows  which  the  for  (i.  0.0  1  compare  that  13  If  we  or  strongly negatively  construct  equally  likely,  a this  of producing a l l positive or all negative  probability that  a l l correlations  have  the  same  sign,  -11 due  to  highly  chance  alone,  significant  based  that  all  assert that a l l categories that  the  at  N=16, the  0.05  controversy can  be  During  the  have  level  at  chloroplast  the  Here  sequencing,  at  I  each  use  and  viewpoint, I a m  with  3  positive  somewhat total-IRA  x  and  similar  is  1.00.  Spearman-Rank  at  the  0.01  of inferential  is  that  different  a s s u m i n g absolute  the  sequenced  not produce a consensus. genomes  is  10 it  .  Therefore  seems  it  reasonable  doublet properties. Ribosomal  D N A is  is to  Note least  categories.  assume base  model  are  least  of doublets, 0.425  regarding  null  total and  other  number  assessed.  
There is some controversy regarding the use of inferential statistics when a whole population can be assessed. During sequencing, each base is sequenced at least three times, and often more if this does not produce a consensus. Here I assume that the sequences are accurate. This is not to imply that there are no other chloroplast genomes with different sequences; rather, from a statistical viewpoint, I am assuming absolute certainty that these two sequences are two complete populations. Another justification for such an assumption is that the sequencing procedure itself takes into account a very large number of molecules. The question then arises: are inferential statistics relevant for deriving conclusions regarding these sequences? The workers at The Statistical Research Laboratory at the University of Michigan (1976) have said "First, if one's dataset incorporates all relevant cases in the population of interest (except perhaps for a few missing observations), the application of various methods of inference is not only unnecessary; it is meaningless. In such cases, it suffices to describe the census of one's population, for example by methods which summarize all of one's variables and their interrelationships in the dataset." For the most part I concur with this view, especially since "significance" can be achieved through procedures which seem to me suspect. One method, where one has a number of alternative statistics available, is simply to choose the one which provides the desired level of significance if previous methods fail to do so. This is essentially common practice. There is no rule that says that one must have a null model or that data must be chosen randomly. A second method is manipulation of sample size. It is then possible, if desired, to turn an insignificant finding into a significant one by excluding some of the data. Despite these problems, I have calculated significance values in some cases but have tried to be very explicit about the null model being employed.

Table 1. Spearman-Rank correlation coefficients for the relative abundances of doublets in the tobacco genome with IRs (total), with one IR deleted (total-IRA), and seven categories within total-IRA. Doublets were analysed on the sense strand, 5' to 3'. The liverwort genome (total) is shown for comparison. Abbreviations: total: the entire genome; total-IRA: the entire genome minus one of the inverted repeats; genes: protein coding genes not including ribosomal proteins; URF: unidentified reading frames; non: noncoding regions; rDNA: ribosomal DNA; tDNA: coding for tRNA; introns: intervening sequences within a gene.

total      1.00
tot-IRA    1.00   1.00
genes      0.50   0.50   1.00
rprotein   0.37   0.37   0.84   1.00
URF        0.85   0.85   0.79   0.53   1.00
non        0.97   0.97   0.57   0.39   0.88   1.00
rDNA       0.46   0.46   0.39   0.51   0.37   0.49   1.00
tDNA       0.73   0.73   0.68   0.71   0.74   0.73   0.77   1.00
introns    0.59   0.59   0.89   0.77   0.83   0.67   0.60   0.78   1.00
           total  t-IRA  gene   rprot  URF    non    rDNA   tDNA   intron

The liverwort genome was analyzed as above: 5' to 3' on the A-strand for total, total-IRB and noncoding regions, and 5' to 3' on the sense strand for the other categories within total-IRB. Fig. 2 presents the doublet relative abundances, with tobacco shown for comparison, and Table 2 presents the Spearman-Rank correlation coefficients for liverwort with both IRs (total), with one IR deleted (total-IRB), and for the seven categories. In liverwort, removal of the 10,058 bp IRB has a small effect on the doublet properties of the genome (Fig. 2, Table 2), as was also found for tobacco. As in tobacco, the correlations between the doublet properties of the categories are exclusively positive (Table 2). In liverwort, however, tDNA and rDNA correlate poorly with the rest of the categories. A particularly striking feature is the strong CG suppression in the protein genes and in the noncoding regions (Fig. 2). This does not appear to be compensated for by the expected overabundance of the doublet TG.

A doublet analysis can also be performed all on one strand in one direction (5' to 3'). This has been done for both plants, tobacco (Fig. 3) on the B-strand and liverwort (Fig. 4) on the A-strand, and correlations were calculated (Tables 3,4). Again, all correlations are positive (Tables 3,4). A comparison of Fig. 1 with Fig. 3, and of Fig. 2 with Fig. 4, shows that, while there are small differences when the analysis is performed this way, the trends are similar. Given the very strong tendency for complementary doublets to have similar relative abundances (Fig. 1, Fig. 2), this is perhaps what should be expected.

Figure 2. Liverwort doublet relative abundances for various categories of DNA. All the y axes are identical. All were analysed on the sense strand 5' to 3'.
Total tobacco chloroplast DNA is provided for comparison.

[Figure not reproduced. Panel titles recoverable from the extraction: TOTAL, TOTAL-IRB, tRNA GENES, RIBOSOMAL PROTEIN GENES, TOBACCO. The y axes run from about 0.55 to 1.65.]

Figure 3. Tobacco doublets analysed on the B-strand 5' to 3' for various categories of DNA. Note that the axes within a figure are constant but vary from figure to figure.

Figure 4. Liverwort doublets analysed on the A-strand 5' to 3' for various categories of DNA.

[Figure not reproduced. Panel titles recoverable from the extraction: TOTAL, TOTAL-IRB, INTRONS, tRNA GENES, RIBOSOMAL PROTEIN GENES, TOBACCO. The y axes run from about 0.55 to 1.65.]

Table 2. Spearman-Rank correlation coefficients for the relative abundances of doublets in liverwort with IRs (total), with one IR deleted (total-IRB), and seven categories within total-IRB. Doublets were analysed 5' to 3' on the sense strand. Tobacco is shown for comparison. Abbreviations: total: the entire genome; total-IRB: the entire genome minus one of the inverted repeats; genes: protein coding genes not including ribosomal proteins; URF: unidentified reading frames; non: noncoding regions; rDNA: ribosomal DNA; tDNA: coding for tRNA; introns: intervening sequences within a gene.

total      1.00
tot-IRB    0.99   1.00
genes      0.91   0.93   1.00
rprotein   0.69   0.68   0.63   1.00
URF        0.91   0.93   0.94   0.72   1.00
non        0.75   0.71   0.56   0.63   0.65   1.00
rDNA       0.49   0.46   0.39   0.21   0.31   0.34   1.00
tDNA       0.25   0.22   0.08   0.30   0.05   0.39   0.49   1.00
introns    0.84   0.82   0.69   0.87   0.83   0.85   0.39   0.31   1.00
           total  t-IRB  gene   rprot  URF    non    rDNA   tDNA   intron

Table 3. Spearman-Rank correlation coefficients for the relative abundances of doublets in the tobacco chloroplast genome (total), the tobacco chloroplast genome with the IRA deleted (total-IRA), and seven categories within total-IRA. All analyses were 5' to 3' on the B-strand. Abbreviations: total: the entire genome; total-IRA: the entire genome minus one of the inverted repeats; genes: protein coding genes not including ribosomal proteins; URF: unidentified reading frames; non: noncoding regions; rDNA: ribosomal DNA; tDNA: coding for tRNA; introns: intervening sequences within a gene.

total      1.00
tot-IRA    1.00   1.00
genes      0.78   0.79   1.00
rprotein   0.88   0.89   0.67   1.00
URF        0.96   0.97   0.83   0.87   1.00
non        0.98   0.97   0.68   0.85   0.93   1.00
rDNA       0.47   0.46   0.38   0.30   0.36   0.48   1.00
tDNA       0.83   0.83   0.54   0.78   0.70   0.85   0.46   1.00
introns    0.93   0.92   0.66   0.83   0.91   0.93   0.41   0.77   1.00
           total  t-IRA  gene   rprot  URF    non    rDNA   tDNA   intron

Table 4. Spearman-Rank correlation coefficients for the relative abundances of doublets in liverwort with IRs (total), with one IR deleted (total-IRB), and seven categories within total-IRB. All analyses were 5' to 3' on the A-strand. Abbreviations: total: the entire genome; total-IRB: the entire genome minus one of the inverted repeats; genes: protein coding genes not including ribosomal proteins; URF: unidentified reading frames; non: noncoding regions; rDNA: ribosomal DNA; tDNA: coding for tRNA; introns: intervening sequences within a gene.

total      1.00
tot-IRB    0.99   1.00
genes      0.93   0.94   1.00
rprotein   0.75   0.74   0.66   1.00
URF        0.91   0.92   0.90   0.83   1.00
non        0.75   0.71   0.55   0.78   0.64   1.00
rDNA       0.49   0.46   0.58   0.16   0.32   0.34   1.00
tDNA       0.51   0.47   0.35   0.34   0.37   0.64   0.55   1.00
introns    0.84   0.82   0.71   0.92   0.88   0.86   0.27   0.50   1.00
           total  t-IRB  gene   rprot  URF    non    rDNA   tDNA   intron

The amounts of DNA in each of the categories (genes, URFs, etc.) were determined as a percentage of total-IR for both plants (Table 5). Ribosomal DNA has the highest G+C content in both plants while noncoding regions have the lowest. The observation that URF G+C contents are between those of genes and noncoding regions suggested that some URFs might in fact be genes. This possibility is investigated in the next chapter.

C. DISCUSSION

1. Nussinov's Hypothesis

In many doublet analyses performed to date it has been observed that the relative abundances of complementary doublets are similar (Nussinov 1984a, b, 1987). This is so even when a single strand of DNA is analysed and is therefore not due simply to base pairing rules.
Nussinov suggested that this phenomenon might be due to inverted duplication events during the evolution of a genome. For example, if the double stranded DNA

5'-ACTGAA-3'
3'-TGACTT-5'

was inverted and duplicated to yield

5'-ACTGAATTCAGT-3'
3'-TGACTTAAGTCA-5'

and then an analysis was performed on one of the resulting single strands, it would be found that the complementary doublets would have the same abundances. In this case, the amount of AC must equal the amount of GT and so forth.

Table 5. The G+C contents and the percentage of the genome minus one IR which each category constitutes. Note that for both plants the G+C contents of URFs lie between the values for genes and non-coding regions, suggesting that some URFs are in fact genes or ribosomal proteins.

             tot+IR  total   gene    rprot   tRNA    rRNA    int     non     URF
Tobacco
  G+C        0.37    0.37    0.41    0.39    0.53    0.55    0.36    0.32    0.36
  % total            100     17.47   6.12    1.67    3.48    7.54    33.09   30.63
Liverwort
  G+C        0.29    0.28    0.31    0.30    0.52    0.53    0.23    0.21    0.24
  % total            100     39.17   7.78    2.39    4.49    11.60   38.32   25.14

Nussinov's hypothesis is reasonable, but it is very difficult to test, because the events that it invokes as an explanation occurred very long ago and are hence inaccessible. Nonetheless, I found it desirable to test the hypothesis, and the presence of the two inverted repeats in both chloroplast genomes provides a very similar situation. As shown in Fig. 1,2, the relative abundances of complementary doublets are very similar in both the tobacco and liverwort genomes. If this were due to the presence of the IRs, it would be predicted that removal of one IR would change the doublet relative abundances of the rest of the genome. It was found that removal of one IR had very little effect on the doublet properties (Fig. 1,2 and Table 1,2). For tobacco, the parametric (product-moment) correlation between doublet relative abundances with and without the IR was 0.9991. For the liverwort genome the value was 0.9977. For N=16, correlations are "significant" at the 0.05 level at 0.468 and at the 0.01 level at 0.590, so there is a temptation to conclude that these correlations are highly significant. Unfortunately, such a conclusion is quite misleading. Due to base pairing, in double stranded sequences the amounts of A and T are equal, as are the amounts of C and G. In a single stranded sequence of sufficient length (single stranded because that is what is analysed) the amount of A will also be nearly "equal" to the amount of T, and the amount of C to that of G. (I use the word "equal" loosely.) In short, even in pseudorandom sequences of the same base composition, the amount of AC will nearly equal the amount of GT, the amount of CC that of GG, the amount of CA that of TG, etc., so that complementary doublets will have very similar abundances without any inverted duplication. To circumvent this problem, and to decide if the real correlation is really significantly different from expectation, the following simulation was performed.

To test the significance for tobacco, 214 pseudorandom sequences were generated, each of the same length and base composition as the real genome minus one IR. (To keep costs down, the program had a time limit. When the time expired, the number of sequences that had been generated was judged adequate.) For each sequence, the doublet relative abundances were calculated. Then a stretch of sequence equivalent in length to one IR was "inverted and duplicated" and spliced onto one end of its parent sequence, and the doublet relative abundances were calculated again. The 214 coefficients of correlation between the doublet relative abundances before and after addition of the IR were then calculated. These are plotted in Fig. 5. The simulated correlation coefficients ranged from 0.9777 to 0.9989, with a mean value of 0.9859 and a standard deviation (SD) of 0.005387. The actual tobacco correlation of 0.9991 is outside the range of these simulated values, as well as being approximately 2.44 SD from the mean. For this reason, the tobacco correlation of 0.9991 is significantly higher than that expected due to chance, at the 0.01 significance level, conservatively estimated. Note that the observation that 0.9991 is greater than 0.590, to which one would normally attribute a significance level of 0.01, is irrelevant.

The simulation was performed again for liverwort. In this case 100 random sequences were generated, with length and base composition identical to the real sequence. The distribution of correlation coefficients is shown in Fig. 6. This distribution had a mean of 0.9744, a minimum of 0.9375 and a maximum of 0.9883, with a SD of 0.01033. Again the real value, 0.9977, falls above the range and is more than 2 SD away from the mean. Since this simulation was performed keeping both the base composition and the length the same as in the real sequence, it is reasonable to say that this correlation is significantly higher than expected, at the 0.05 level.
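The simulation described above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the program actually used: the relative abundance is assumed to be the observed doublet frequency divided by the product of the two base frequencies, the "inverted and duplicated" stretch is assumed to be the reverse complement of the last `ir_length` bases of the parent sequence, and all function and parameter names are hypothetical.

```python
import random
from collections import Counter
from itertools import product

BASES = "ACGT"
DOUBLETS = ["".join(p) for p in product(BASES, repeat=2)]
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """The complementary strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def doublet_counts(seq):
    """Counts of overlapping doublets along one strand."""
    return Counter(seq[i:i + 2] for i in range(len(seq) - 1))


def relative_abundances(seq):
    """ASSUMED definition: observed doublet frequency / product of base frequencies."""
    n = len(seq)
    freq = {b: seq.count(b) / n for b in BASES}
    counts = doublet_counts(seq)
    values = []
    for d in DOUBLETS:
        expected = freq[d[0]] * freq[d[1]]
        values.append((counts[d] / (n - 1)) / expected if expected > 0 else 0.0)
    return values


def pearson(x, y):
    """Product-moment correlation; NaN if either variable is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / (sxx * syy) ** 0.5


def simulate(length, ir_length, composition, n_sequences, seed=0):
    """Correlation of doublet relative abundances before and after an inverted
    duplication is spliced onto pseudorandom sequences of a given composition."""
    rng = random.Random(seed)
    weights = [composition[b] for b in BASES]
    correlations = []
    for _ in range(n_sequences):
        parent = "".join(rng.choices(BASES, weights=weights, k=length))
        # ASSUMPTION: the duplicated stretch is the tail of the parent sequence.
        with_ir = parent + reverse_complement(parent[-ir_length:])
        correlations.append(
            pearson(relative_abundances(parent), relative_abundances(with_ir))
        )
    return correlations
```

For example, `simulate(2000, 300, {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}, 5)` returns five correlation coefficients. The observed correlation for a real genome with and without one IR can then be compared with the range, mean and SD of such a simulated distribution, as was done for Fig. 5 and Fig. 6. The same helpers also reproduce the small worked example: `"ACTGAA" + reverse_complement("ACTGAA")` is `ACTGAATTCAGT`, in which every doublet is exactly as frequent as its complement.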
Figure 5. The distribution of correlation coefficients between the doublet relative abundances of 214 pseudorandom sequences with and without one IR. Length of sequence and base composition are the same as the tobacco chloroplast genome. The correlation between the doublet relative abundances of the tobacco chloroplast genome with and without one IR was 0.9991, which falls outside the range of the above values.

[Histogram not reproduced. The x axis is labelled "Correlation".]

Figure 6. The distribution of correlation coefficients between the doublet relative abundances of 100 pseudorandom sequences with and without one IR. Length of sequence and base composition are the same as the liverwort chloroplast genome. The correlation between the doublet relative abundances of the liverwort chloroplast genome with and without one IR was 0.9977, which falls outside the range of the above values.

[Histogram not reproduced. The x axis is labelled "Correlation".]

The first conclusion is that it is not the presence of IRs in the chloroplast genome which causes complementary doublets to have similar relative abundances. If Nussinov's mechanism were operating here, the correlation between the doublet relative abundances of the whole genome plus and minus one IR would have been expected to be considerably lower than it was. In fact, it would have been expected to fall in the ranges shown in Fig. 5 and 6. I feel that this is strong evidence against Nussinov's hypothesis since, in the one case where IRs are easy to identify, their removal does not significantly change the doublet properties (i.e. they remain highly correlated with and without one IR, more so than in the simulation). It must still be allowed that small IRs might be involved in the similar abundances of complementary doublets. Palmer (1985) has indicated that repeats are rare in chloroplast genomes, except for the large IRs, but it can be argued that the genome could consist of small IRs which have been "muddied" by mutations, insertions and deletions, so that they are no longer detectable even though they still contribute to the abundances of complementary doublets. I think that such an argument is virtually impossible to disprove. In conclusion, this test should be seen as evidence against Nussinov's hypothesis.

2. Doublet suppression

In both genomes there is evidence of TA suppression. Significantly, this is least evident in protein genes and ribosomal protein genes of tobacco, and in protein genes and introns of liverwort (Fig. 2,4). If TA suppression were a form of stop codon avoidance then TA should be, if anything, rarer in protein genes. This is not observed, which is in agreement with the conclusion of Boudraa and Perrin (1987) that stop codon avoidance is not the cause of TA suppression.
It is also consistent with the finding that the doublet TA is rare in intervening sequences (introns and noncoding) (Smith et al. 1983).

The two genomes examined here also exhibit CG suppression. Bird (1980) suggested that CG suppression was due to methylation of C followed by mutation of CG to TG (see also Mazin and Vanyushin 1987a,b). In these chloroplasts, however, there is no methylation of bases and no compensating overabundance of TG. Mazin and Vanyushin (1987a) found that CG was suppressed in Drosophila, in which C is not methylated, and suggested that the DNA of ancestors of this genus was normally methylated. They also suggested that this hypothesis can readily be extended to any organism in which CG is suppressed: such suppression attests that the DNA of the ancestral organism was, at some time, methylated. We do not know when chloroplasts lost their methylase, or if they ever had one, and, in the case of chloroplasts, the lack of compensating TG argues, at least, against this idea. Even more elaborate hypotheses can be put forward to explain CG suppression. For example, it is possible that a gene from the chloroplast could move to the nucleus (Lewin 1987, Lonsdale 1984), reside there and be methylated, and lose its CG before moving back into the chloroplast. One could postulate that imported genes need not disrupt vital chloroplast functions because modified genes could just recombine with genes already there. (As an aside, it might be mentioned that transfer into the chloroplast does not seem to occur, perhaps because chloroplasts seem to lack reverse transcriptase (Shinozaki et al. 1986), while the nucleus and mitochondria have this enzyme. Evidence supporting this view is that only transcribed sequences have been found in more than one genome in a cell (Schuster and Brennicke 1987).) The more elaborate hypotheses above, and that of Mazin and Vanyushin (1987), seem off the mark, partly because they are not parsimonious and partly because they have a "hole-patching" quality to them. It is simpler to propose that methylation is not a complete explanation of CG suppression.

3. Thermodynamic considerations

The data also appear to indicate that RY and YR doublets (R=purine, Y=pyrimidine) have low relative abundances. Calladine (1982) suggested that across-strand steric hindrance between the two purines in such doublets is a stress which the DNA must reduce by kinking or by changing conformation. The biochemical evidence for this, taken together with the notion that bases are selected according to their physical-chemical properties, is appealing to me, and implies that some RY and YR doublets ought to be more rare than other doublets due to their biochemical properties. Calladine's hypothesis is easy to test. Figures 1-4 show that the YR doublets CG and TA are suppressed throughout the genome, as discussed above (a reminder: R=A,G and Y=C,T).

Breslauer et al. (1986) determined the standard enthalpy ΔH°, the standard entropy ΔS°, and the standard Gibbs free energy ΔG° for breaking of the bond between a doublet on one strand and its complement on the other strand. These data have been taken from Table 2 of Breslauer et al. (1986) and reappear in my Table 6. They were obtained by differential scanning calorimetry, circular dichroism and ultraviolet spectroscopy of synthetic oligomers. These data suggest that the doublet TA binds most weakly (ΔG°=0.9 kcal/mol) while CG (ΔG°=3.6 kcal/mol) binds most strongly.

Table 6. Thermodynamic parameters for each doublet. For each doublet are shown ΔH° (kcal/mol), ΔS° (cal/(K mol)) and ΔG° (kcal/mol) (Breslauer et al. 1986), and the flexibility, the probability of rolling clockwise (roll1), the probability of rolling counter-clockwise (roll2), the probability of tilting clockwise (tilt1) and the probability of tilting counter-clockwise (tilt2) (Srinivasan et al. 1987). For a more complete description of these parameters the reader is urged to refer to the original works.

doublet  ΔH°    ΔS°    ΔG°   flex   roll1  roll2  tilt1  tilt2
AA       9.1    24.0   1.9   7.1    41     59     66     34
AC       6.5    17.3   1.3   2.9    13     87     78     22
AG       7.8    20.8   1.6   3.5    12     88     76     24
AT       8.6    23.9   1.5   4.0    18     82     71     29
CA       5.8    12.9   1.9   6.0    29     71     65     35
CC       11.0   26.6   3.1   2.6    8      92     81     19
CG       11.9   27.8   3.6   4.6    11     89     69     31
CT       7.8    20.8   1.6   3.3    16     84     71     29
GA       5.6    13.5   1.6   5.1    33     67     58     42
GC       11.1   26.7   3.1   3.8    10     90     71     29
GG       11.0   26.2   3.1   2.7    8      92     63     37
GT       6.5    17.3   1.3   4.2    21     79     59     41
TA       6.0    16.9   0.9   7.2    40     60     57     43
TC       5.6    13.5   1.6   3.6    19     81     69     31
TG       5.8    12.9   1.9   5.0    20     80     62     38
TT       9.1    24.0   1.9   5.7    38     62     52     48
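As a quick consistency check on Table 6, the ΔG° and flexibility columns can be examined directly. The sketch below is illustrative only: the values are transcribed from Table 6 as printed here, and the helper names are hypothetical. It confirms that TA has the lowest ΔG° and CG the highest, that a doublet and its complement (read 5' to 3' on the other strand) share the same ΔG°, and it computes the Spearman-Rank correlation between ΔG° and flexibility.

```python
# ΔG° (kcal/mol) and flexibility for each doublet, transcribed from Table 6.
DELTA_G = {"AA": 1.9, "AC": 1.3, "AG": 1.6, "AT": 1.5, "CA": 1.9, "CC": 3.1,
           "CG": 3.6, "CT": 1.6, "GA": 1.6, "GC": 3.1, "GG": 3.1, "GT": 1.3,
           "TA": 0.9, "TC": 1.6, "TG": 1.9, "TT": 1.9}
FLEX = {"AA": 7.1, "AC": 2.9, "AG": 3.5, "AT": 4.0, "CA": 6.0, "CC": 2.6,
        "CG": 4.6, "CT": 3.3, "GA": 5.1, "GC": 3.8, "GG": 2.7, "GT": 4.2,
        "TA": 7.2, "TC": 3.6, "TG": 5.0, "TT": 5.7}
DOUBLETS = sorted(DELTA_G)
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """The complementary doublet, read 5' to 3' on the other strand."""
    return seq.translate(COMPLEMENT)[::-1]


def ranks(values):
    """Ranks starting at 1; tied values receive the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            out[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return out


def spearman(x, y):
    """Spearman-Rank correlation: Pearson correlation of the tie-averaged ranks."""
    rx, ry = ranks(x), ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sxx = sum((a - mx) ** 2 for a in rx)
    syy = sum((b - my) ** 2 for b in ry)
    return sxy / (sxx * syy) ** 0.5


weakest = min(DELTA_G, key=DELTA_G.get)    # doublet with the lowest ΔG°
strongest = max(DELTA_G, key=DELTA_G.get)  # doublet with the highest ΔG°
rho = spearman([DELTA_G[d] for d in DOUBLETS], [FLEX[d] for d in DOUBLETS])
```

The same `spearman` function can be applied to a list of 16 doublet relative abundances in place of the flexibility values to correlate abundance with any column of Table 6.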
In this regard, Boudraa and Perrin (1987) suggested that TA and CG might be selected against in order to "thermodynamically optimize" DNA. In other words, perhaps TA binds too weakly and CG binds too strongly. I shall return to this idea shortly but first wish to consider other hypotheses proposed with regard to doublet abundances.

Trifonov and Sussman (1980) performed autocorrelations with 36,000 bp of DNA from a variety of sources and found that some doublets tend to be distributed along the DNA strand with a periodicity of about 10.5 bp, which is the number of bases in one complete turn of the DNA helix in B-DNA†. They suggested that this tendency to repeat every 10.5 bp means that if a histone is a cylinder around which DNA could curve, the side of the helix that the histone fits into would have a preponderance of certain doublets, so that the dsDNA could bind to histones (Trifonov and Sussman 1980). Prokaryotes lack histones, though histone-like proteins are associated with their DNA. The fact that chloroplasts contain a histone-like protein (Briat et al. 1982,1984), and work showing chloroplast DNA in chromatin-like formations (Coleman 1985), suggest that a similar selection of doublets could be involved in chloroplasts. Gantt and Key (1987) cloned a histone gene from pea, though it is not clear if this histone is inside the plastid. Thus, to test this idea in chloroplast genomes, autocorrelations of doublet position were performed. These revealed no peak at 10.5 bp (data not shown). However, it is not clear that histone-like proteins need bind DNA in this way in order to play a role. Still, it is necessary to keep in mind that the requirement that DNA interact with proteins in the correct way may exert a selection pressure on DNA organization, even at the doublet level.

† B-DNA is the "normal" helix type. DNA twisted in the opposite way is Z-DNA.

Another suggestion concerning doublet abundances involves error repair. DNA must be replicated, and when DNA is replicated errors occur. Polymerases that replicate DNA have error checking capabilities, and repair systems exist to locate and correct errors. Occasionally, errors may be made in the copying process in which a single base or two adjacent bases are incorporated incorrectly. The repair systems may well depend heavily on sequence; in E. coli, repair has a preference for certain short sequences (Jones et al. 1987). As will be discussed in a later chapter, the presence of certain doublets, in the correct quantities, may enable sequences to achieve sufficient Shannon-redundancy (Gatlin 1972a) to enable error-repairing systems to locate and reverse an error. Such systems can only correct errors of certain types.

My initial aim here is to provide an explanation of the data in Fig. 1-4 based on the ideas of Calladine (1982), along with the data of Srinivasan et al. (1987) and Breslauer et al. (1986), that is, on classical physical-chemistry. Table 6 shows ΔH°, ΔS° and ΔG° for breaking the H-bonding between a doublet on one strand and its complement on the other strand. A large ΔG° implies a strong binding. (These data are taken from Breslauer et al. (1986).) In addition, Table 6 also shows the flexibility, the probability of rolling clockwise or counter-clockwise, and the probability of tilting clockwise or counter-clockwise for each doublet:complement arrangement. (These data are taken from Srinivasan et al. (1987).) In agreement with its low ΔG°, TA also has the greatest flexibility. ΔG° and flexibility are negatively correlated. This is reassuring since they were obtained independently by different groups of workers.
H o w e v e r the  correlations  likely  0.01  0.05  doublet  for  doublet  only  different  abundances  viruses,  a l l groups  of  is that  relative E.  than  basis  this  1984b),  7), though  assuming  equally  from  DNA  the  by  selected  rather  the  for  that  to be  way  on  literature  1984a),  1984a).  chance,  at  one  independently  AG°, AH°, AS°  seemed  doublet  the  negative  values are  other 0.05  10  are  conservatively,  be  significant at the  positive or  case)  tilted  (Nussinov  (Table  AG°  and  to  Nussinov  showed  positively w i t h  idea that follows  from  obtained  work  doublets  might  addition  data  were  that  other  One testable  liverwort,  preliminary  correlated  doublets  organisms.  since they  also found  than  that  properties. all  is r e a s s u r i n g  35  None of  correlated  model. This an  on solid  that  overly  at also  flexible  "non-biological"  Doublet or N e a r e s t Table  7.  Spearman-Rank  thermodynamic  properties  between  and  AG °  correlationg of the  abundance  in  between  doublets. all  Note the  groups.  bottom of the columns are  summaries  source  AH°  AS  tobacco  0.09  0.18  liverwort  0.50  E. coli  doublet  of the  AG  The  Neighbour Analyses / relative  36  abundances  and  exclusively positive correlations p i us  correlations  and  minus  signs  at  tl  for that column.  
flex  rolll  roll2  tiltl  tilt2  0.39  -0.22  -0.10  0.10  0.01  -0.01  0.48  0.72  -0.27  -0.34  0.34  0.15  -0.15  0.19  0.15  0.44  0.42  0.24  -0.24  -0.12  0.12  eukaryotes  0.10  0.03  0.46  -0.19  -0.10  0.10  0.084  -0.08  mitochondria  0.39  0.37  0.15  -0.18  -0.22  0.22  0.11  -0.11  vertebrates  -0.08  -0.15  0.30  -0.20  -0.07  0.07  0.17  -0.17  nonvertebrates  0.34  0.30  0.70  0.00  -0.02  0.02  -0.07  0.07  DNA  0.21  0.16  0.49  -0.26  -0.14  0.14  0.05  -0.05  nonretro R N A viruses  -0.40  -0.44  0.21  -0.00  0.14  -0.14  -0.14  0.14  retro  0.10  0.06  0.35  -0.31  -0.15  0.15  0.11  -0.11  +  +  +  +  +  .  viruses  viruses  0  0  .  Doublet or N e a r e s t N e i g h b o u r A n a l y s e s / 37 The  hypothesis  seems  that  reasonable.  thermodynamic work  are  least  been  against,  is  suggested  1982)  (because  it  Boudraa binds  explanations  do  involved i n  repair  this  are  not  too  for  a  that  (Nussinov DNA,  1987)  rather  such  simply  (1987),  for  mutation  evidence and  This  is  not  some doublet damage, a  to  suggest  abundances. etc.  the  general,  the  correlate  with  an  idea and  no  since, C G is,  as  Doublets  intellectual appeal,  undesirable  that  What  doublets  that  thermodynamically  methylation  certain  in  more  hinderance  suggests  than  with  YR  of  Their  than  steric  relative abundances  is i n agreement  basis  (1983a).  R Y and with  properties  the  rather  and  Perrin  on  Trainor  model  connected  t h a t doublet  strongly).  is  some may  other well  be  suggested  is  though  it  requires  testing.  conclusion, thermodynamic  all  of the  properties  doublet relative abundances sequences  and  observation  with  chosen  for C G suppression,  In  other  are  Rowe on  thermodynamic  required to account  simple explanation has  m u c h further  by  species  property  by  codons  reasons  all  finding  account of  that  The  specifically The  influenced  entirely  here.  
in  is  made  almost  thermodynamic  by  been  probably  themselves. one  has  done  associated  special explanations  that  suggestion,  based  (Calladine  organisms at  has  selected  selection  similar  was  as  avoidance  selection  properties  however  empirically  A  doublet  and  the  examined.  that  informational properties  also  there is a h i e r a r c h y of constraints  seem  to  observed In  are  chapter important  present  explain some  i n the 4  by  no  means  chloroplast D N A sequences,  evidence  i n doublet  i n these  but  systems.  is  presented  to  show  selection. Undoubtedly  III. PREDICTION AND  IDENTIFICATION OF CHLOROPLAST  GENES  A. DESCRIPTION OF THE PROBLEM  This  chapter  often  when  not.  This  gene  is  concerned  happen  sequenced  genes from  was  of the taken;  design.  data  such  low quantity  stage  or  b)  second, an  Normally,  a  material;  third,  genes can be tell us  As  Starting  made  a  can  clone  that  also  the  for  effects  accurate  way  insights  found based  of  the  with  (1984) a  or U R F s ) to  has  has  which into  merely  are  determine  only  enough  predictions  fail  region, to be  i f a)  is " o f f  due  from  for  gene a  as  to  the to  or  known  such  able  which  open  this  can  resources would  be  for  reduce  identified  genome  organization  on the  differences  an  identify  transcript  of is  developmental the  organism  a  wasted from  and  subset  time  and  sequence  constraints,  i n their  of a l l  i.  sequences,  data e.,  this  if can  organization.  gene  identification  reading  identified. Introns,  whether  can  environment  genes  explained,  sequence,  probe  a  Very  F i r s t , biochemical methods  transcript  the  a  large  is desirable  transcripts,  genes.  w a y to predict genes would facilitate experimental  accurate in  It  a  v a r i e t y of reasons.  
gene  or  when  coding  D N A contains  hybridizes w i t h  happen  looking for  and  the  useful  for a  protein  k n o w n i f the  something about gene us. nongene  Staden  frames  identification of  it is not  researcher  experiments  yield  as  organism  desirable  might  it  the  is completely sequenced.  sequence genes,  when  or  genome,  identifying  with  D N A is sequenced  can  organellar  in  is  frames if any,  U R F is  38  (also are  actually  is  a  called  multistep  process.  unidentified  reading  located. T h e n transcribed  or  an  attempt  whether  it  is is  Prediction and identification of chloroplast genes / merely the  due  to  challenge  steps  are  is  not  (1982)- and contains  chance; to  always  consensus  not  they  sequences,  of genes and nongenes,  There sites  is considerable in  many  Lennon  are  of  i n this  order.  Some  and  r e l y on  caused b y the  information  this  URFs  type,  can  the  signals  differences  necessity  in  have  also  known  1987). et  This al.  The  far  see  been  literature well  chapter  (1986)  apart  and  et al.  then  is  in  frame,  while  In  tobacco,  the  codon  may  splicing  on  (1986),  taking  into  known  recognition and  are  in also  (Nussinov  signals  in  and  (For  a  chloroplasts  1986,  Umesono  and  work  performed  by  reporting  200 bp are  for  properties  form Z - D N A .  published  usage i n organisms usage  known  1987, see  in Drosophila  is to determine  18 i n l i v e r w o r t greater t h a n  It is well k n o w n t h a t amino acid 1975).  Chua  by searching for start and  problem for this dissertation  i n tobacco and  al.  sequence  region of two  on  regarding  S h i n o z a k i et al.  dependent  a  r e l y on one based  These  Fickett's  that  characterized  regions  Intron  1986,  S h i n o z a k i et  performed and  ( A n 1987) and  (1988).)  as  and  of coding for protein.  the  that promoter  Wells  (Ohyama  T h i s w o r k was  sufficiently signals.  
1987). I n plants  of Z - D N A  well  Ohyama URFs.  Clegg  such  etc.  between  occur,  genes.  indicate  it is. A l l methods splicing  will  and  methods,  methods,  available  promoters  1984), there is evidence  review  Ozeki  al.  strand  promoters,  they  sequence  cases s i m i l a r to prokaryotic ones ( H a n l e y - B o w d o i n and  Zurawski  brief  random URFs  on w h i c h  D N A . Chloroplast  a  positional base  locate  or  in  between  performed (1984)  a gene but either  even  distinguish  Staden's  principles;  i . e.  39  the  positions  stop codons that  account which  intron  of the  of are  splicing  39  URFs  r e a l l y genes.  is nonrandom genes  has  (eg. J u k e s  been  et  tabulated  Prediction and identification of chloroplast genes / 40 (Wakasugi the to on  et al. 1986)  and  stop codon usages are other  codon  codon  method when  selections  preference  is  TAA:25, so  that  (Gribskov  read  original  in  the  i n the  code  described  correct other  h a v i n g been  the  success  an  that  proved  performed  as  follows:  T-position  are  and  3n + 2 are  these  values/(minimum 3n,  20  on  For  in  A-position = 20/(15 +1).  have  the  fraction  look-up  tablet  (Table  content  and  position.  Fickett  then  assigned A  t A look-up table an input value.  I  other  the  of F i c k e t t  This  identify  based  genes  employed  shown t h a t m a n y all R N Y codons  material  this  genes, (N = any to  the  presented below, I  distinguishing  called  For  3n+l 3  between  gives  weightings  strand 1982)  have  genes  and  p ...p .. 1 8 w ...w 1 8  compilation of data  of  are  and  of  each  to value  on  each  Los  probability is  w h i c h provides  is and  positions  3n,  m a x i m u m of are  8  As  3n + 2.  in  Then  identically  A-content...T-content base.  
Based  of coding are  the  and C,G  calculated  parameters  probabilities  Based  there  positions  parameters  comprised  in  A-position = the  in  3,  likewise  As  suppose  15  period  and  number  example,  8  with  A-position,  the  position  "testcode" is a  apply  that this m i g h t be due  U R F i n question.  them+1).  of  has  example  s i m i l a r results  (implicitly)  having  corresponding bases. In addition, 4 content  simply  to  autocorrelation  A-position,  positions The  in  an  parameters  found for the of  to  and  For  genomes.  based  Four  determined.  3n + 1  positions  is  I  H e suggested  has  randomness.  possible  (1981)  closer  method  method  (1982)  are  of  TGA:5,  R N Y code. In the  Fickett's  1982).  has  and  et al. 1984).  two frames.  two chloroplast  are  indicative  TAG:9  it  frame,  nongenes i n the  the  not  using a clustering a l g o r i t h m . Sheperd  base) than  for  also  on  found  A l a m o s data (Table  II  a for  base, Fickett  found an  output  value  given  Prediction p w + p w ... p w , which, 1 1 * 2 2 8 8 all  probability of the  decision It  on another  U R F being a  of chloroplast genes /  look-up table, indicates the  coding unit  (i. e.  a  gene),  (prediction) t h a t the U R F is a gene, is not a gene,  is i m p o r t a n t  representative could  based  and identification  to  note  that  the  Los Alamos  of " a l l D N A " , w h i c h  likety  be  improved  for  use  Fickett on  data  base  and  it  1982  over  gives  or that i t is  of  taxonomic  might  groups.  This  a  unsure. not  recognized i n suggesting that the  specific  41  be test  has  been  done for chloroplast D N A below.  In  many  organisms  (Lagerkvist  1981,  known,  the  standard  et  1986,  al.  chloroplast and  chloroplast (Umesono there protein  is  1987).  and  review  and  1987). T h e  codon  usage  though the  (1981a,  i n E. 
to  coding  most  abundant  the  coli  abundant  genes  corresponding  b  for  to  the  tRNAs less  tRNAs  and  abundant  suggested  Grantham  that  et  proteins less  tRNAs.  aspect  al.  1986, U m e s o n o the  than  E.  coli  identification  is  that  codon usage  and  less  tRNA  obscure.  abundance  select  1981)  codons  codons  found  that  corresponding  abundant  proteins  G e n e r a l l y the  less  ,the  on  genes  (1980,  had  by  of this still seems  that  is  coded  (availability?),  correlations between  and  abundant  gene  as  1986, O h y a m a  tRNAs  mitochondria and to  far  t R N A s . required  different  of this  unorthodox  so  et al.  ( W a k a s u g i et al.  30  and effect  is  chloroplasts,  a l l the  abundance  noted  yeast  but  in  relevance  cause  tRNAs. most  than  tRNA  1982) and  1987),  more  in  code  1985, Shinozaki  approximately  is  genetic  but  chloroplast genome  are  which  the  articles)  (Palmer  Clegg  relationship between  Ikemura  rarer  mitochondria  code is used  Ozeki  abundance,  correspond  1987  There  genome,  a  in  coded on the  Thus,  the  Fox  Zurawski  are  Ozeki  and  had  abundant  to  and that the the  rarer  codons  the  protein,  required to synthesize it. D u e to this observation, they  put  Prediction and forward  the  regulating  gene  obtained noted  hypothesis.  expressivity.  in  addition  the  that  the  codons  is  tRNA  codon the  levels  all  are  way  cells  codons  priori  the  a  a  are  organism  the  "given"  point  necessar}  7  at  such  a  Personally, I  have  difficult}'  it  of the  also  seems  a  software, why  the  (of base usage,  nongenes, or  the  appear  to  organelle.  way  function  most  not,  the  usage  the  work  seems  genome of the  tRNA  is  tRNA  the  genome  the  vice v e r s a .  It  of an  organism's  are  idea because the  environment, to  especially  cannot etc.)  not j u s t  adjust  be  found  the  there  controlled. 
i n genes  multitude  described  below  In  that  levels difficult  and  the  the  given  requirements of developmental supply is no any  to  gene  the  obvious case,  distinguish  of processes  concerning  are  development in  the  parts  when  of a  given  required  it  same i n  tRNA  stage  to  evidence  is not  the  codon  which  is the  that  suggest  its  empirical  every  w i t h this  those  perhaps  levels  are  function work  that  (nuclear)  follows  tRNAs  but  population  levels  the  of  yeast,  to  modifies  suggests  it  with  include  tRNA  reverse,  codon a  not  the  of the  supply  In  are  that  parts  be  the  parsimonious  not  it  since  where  in  are  and  do  1987),  present rather t h a n  however,  proteins  et al.  way  42  of equal binding energies  A l l this the  a  with Ikemura,  there seems to be no clear  proteins  codons  situation,  reason  cell  (Pfitzinger  and  abundant  that  is  working  group  most  and  choice  (1981)  recent paper Adjustment  but  to the  selected  and  patterns from  an  organism  demands  codon  Hall  postulated.  I would suggest that,  the  been  stage,  are  "given"  chloroplasts  which  adjusted  for  amounts. the  in  but  of  imagine  have  that  and  that  was  the  accordingly. O n this  m u s t be to  usage  codons  adjusted either  is  100% G , C , A , T , , or G C or A T . The desirablity  usage accordingly. H o w e v e r , the  idea  Bennetzen  a l l c o d o m t R N A interactions  that  The  results i n good agreement w i t h G r a n t h a m ' s  containing for  genome  identification of chloroplast genes /  a the  them  occuring i n prediction,  Prediction and identification of chloroplast genes / 43 this should be kept in mind throughout; particularly, since one of the findings is that some prediction methods, which work well in other systems, work poorly in chloroplasts. Possible reasons for this will be discussed later.  B. 
MATERIALS AND METHODS

Programs were written by the author in Pascal. For Sheperd's test, the number of mutations required to convert all codons to RNY was tabulated in frame I (in frame) and in frames II and III. An URF was predicted to be a gene only if the number of such mutations was least in frame I. Fickett's method was performed exactly as per Fickett (1982), and the method was also modified as follows: the 46 known protein coding genes in tobacco and the first 46 noncoding regions greater than 200 bp, starting at base 1 on the B-strand and moving 5' to 3', were used; the four position and content parameters were calculated for each of the 92 sequences and these were used to generate 8 new weighting parameters (Table II of Fickett 1982) (Table 8); Fickett's testcode was then used exactly as per Fickett.

Cluster analyses were performed using MIDAS (Fox and Guire 1976) with Ward's minimum variance algorithm to cluster euclidean distances. The data for the cluster were the fractional abundances of each of the 64 codons for each gene and URF. This procedure places genes which have the most similar codon usage closest together in cluster diagrams.

Table 8. New weighting parameters for Fickett's testcode. These were derived from the tobacco chloroplast genome. See methods for procedure.

              A        C        G        T
Position     0.22     0.24     0.26     0.43
Content     -0.11    -0.15     0.41     0.09

C. RESULTS

method 10 The predictions of Sheperd's for liverwort. This are not This are genes error and rate method that (24%) predicts in the that similarity I assume the Fickett's testcode, nongenes, and testcode was mistook (Table 10). when are improve its produced 2 these false 9), and in 54 known genes an indecision result.
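The two sequence statistics named in Materials and Methods can be illustrated with a short sketch. This is a minimal Python illustration, not the author's Pascal programs. `rny_mutations` counts the base changes needed to make every codon of a reading frame fit the RNY pattern (purine, any base, pyrimidine), and `fickett_parameters` computes the four position and four content parameters as defined by Fickett (1982): the position parameter of a base is its largest count among the three codon phases divided by its smallest count plus one, and the content parameter is its fraction of the sequence. The function names, the strict tie rule in `sheperd_predicts_gene`, and the choice to compare the same number of codons in every frame are assumptions made for this example. The weighting step (Table 8) and Fickett's look-up tables are not reproduced.

```python
PURINES = frozenset("AG")
PYRIMIDINES = frozenset("CT")


def rny_mutations(seq, frame):
    """Base changes needed so that each codon read in `frame` (0, 1 or 2) is R-N-Y."""
    seq = seq.upper()
    # Assumption: use the same number of codons in all three frames so that
    # the counts are directly comparable.
    n_codons = (len(seq) - 2) // 3
    changes = 0
    for k in range(n_codons):
        start = frame + 3 * k
        codon = seq[start:start + 3]
        if codon[0] not in PURINES:      # first base should be a purine (R)
            changes += 1
        if codon[2] not in PYRIMIDINES:  # third base should be a pyrimidine (Y)
            changes += 1
    return changes


def sheperd_predicts_gene(seq):
    """True only if frame I needs strictly fewer changes than frames II and III.

    Assumption: ties are treated as 'not a gene'; the text only says 'least'.
    """
    counts = [rny_mutations(seq, frame) for frame in range(3)]
    return counts[0] < min(counts[1], counts[2])


def fickett_parameters(seq):
    """Return (position, content) dictionaries keyed by base, after Fickett (1982)."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    position, content = {}, {}
    for base in "ACGT":
        # Count the base separately at positions 3n, 3n+1 and 3n+2.
        by_phase = [seq[phase::3].count(base) for phase in range(3)]
        position[base] = max(by_phase) / (min(by_phase) + 1)
        content[base] = sum(by_phase) / len(seq)
    return position, content
```

For instance, if a base's largest count across the three phases is 20 and its smallest is 15, the position parameter is 20/(15 + 1) = 1.25.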
rate Equally identified as for the of on by this to utility check of the that the tobacco, for 5 other a genes not genes. regarding URFs of completeness. which bear in many were method (data in test 46 strong cases no error 46 tobacco not shown). weightings (Table one 8) chloroplast DNA and not just to these two plants, the and over the non-coding However, it were liverwort, the 11 others attempt modified of less of which as an known rate genes about in negatives improvement plants, the unsure test using tobacco sequences, two known This the 2 false is an the was sequences. 8 For Fickett's indecisions of 9). and modified considerable none (Table nongene 10). This test using only modified interests misidentified others a I and circular reasoning to develop the the are predictions genes even though chloroplast (Table 12%, and in fact liverwort, there importantly, genes genes 16 results, negatives that herein only in the on about performance (Table the used unsure means known genes found.) 1 known gene Given genes ndh to prokaryotic genes, protein or transcript has been 13 of 46 known genes listed liverwort 11 of 54 presented that in Table 9 for tobacco and Table that in tobacco, for. known not reliable. They are (Note appear to procedure tobacco genes 8 indecisions than 5% and original test regions were is somewhat and then determine is tobacco. In generally order applicable test was tried on a to variety of other chloroplast sequences from different species. The results are in Table 11. Again, the modified test performs better than the original, and again it does not mistake any non-coding regions for coding ones. Therefore, I am reasonably confident that any URFs predicted in this work to be genes are in fact coding for protein.

As an independent check of the accuracy of the predictions of Fickett's modified procedure, euclidean distances between codon usage for each gene and URF were clustered for both plants (Fig. 7, 8). There is a tendency for genes with related functions to cluster together. In tobacco two distinct clusters were found which contained 13 of the 21 putative noncoding URFs, while, in liverwort, the 3 indecisively identified URFs were clustered.

Table 9. Predictions of genes in the tobacco chloroplast genome. c indicates coding, n indicates non-coding, and ? indicates indecision. Nomenclature as in Shinozaki et al. 1986. For all tests, "gene" denotes the gene and S is the decision made by Sheperd's test. D is the decision made by Fickett's test in its original and modified forms. Prob is the probability of being a gene.

                  Fickett's    Modified
"gene"      S     Prob   D     Prob   D
atpA        c     0.98   c     1.00   c
atpB        c     0.98   c     1.00   c
atpE        c     0.98   c     1.00   c
atpF        c     0.07   n     0.92   c
atpH        n     1.00   c     1.00   c
atpI        c     0.92   c     0.92   c
infA        c     0.07   n     0.40   ?
ndhA        n     0.77   ?     0.98   c
ndhB        c     0.29   n     0.04   n
ndhC        n     0.77   ?     1.00   c
ndhD        n     0.77   ?     1.00   c
ndhE        c     0.92   c     0.77   ?
ndhF        c     0.40   ?     0.92   c
URF39       n     0.40   ?     0.92   c
URF62       c     0.98   c     0.92   c
URF70A      n     0.00   n     0.07   n
URF70B      c     0.92   c     1.00   c
URF73       n     1.00   c     1.00   c
URF74A      n     0.00   n     0.00
URF74B      c     0.40         0.77
URF75       n     0.07         0.00
URF77       c     0.07         0.92
URF79       c     0.04         0.29
URF80       n     0.92         0.98   c
URF82       n     0.77         1.00   c
URF87       n     0.04         0.29   n
URF90             0.92         1.00   c
URF92             0.04         0.07
URF98             0.07         0.00
URF99A            0.29         0.04
URF99B      c     0.92         0.92
URF103      c     0.07         0.00
URF105      n     0.07         0.07
URF115      c     0.04   n     0.07
URF131      n     0.40   ?     0.92
URF134            0.92         1.00
URF138            0.04         0.00
URF154            0.04   n     0.07
URF158            0.40   n     0.98
URF167            0.29   n     0.92
URF184            0.29         0.29   n
URF228      c     0.29   n     0.98
URF229      c     0.07   n     0.00
URF284      c     0.40   ?     0.92
URF313      n     0.29   n     0.07
URF350      c     0.07   n     0.04
URF393      c     0.92   c     1.00
URF509A           0.04         0.00
URF512            0.77         0.98
URF581            0.07   n     0.07
URF1244           0.07   n     0.29
URF1708           0.07   n     0.29
petA              0.92   c     1.00
petB              0.77   ?     0.98
petD              0.98   c     1.00
psaA              0.92   c     0.98
psaB              0.40   ?     0.98
psbA              0.77   ?     0.98
psbB              0.92   c     1.00
psbC              0.77   ?     1.00
psbD              0.77   ?     0.92
psbE              0.29   n     0.92
psbF              0.77   ?     1.00
rbcL              0.98         1.00
rpl2        c     0.92   c     1.00   c
rpl14       c     0.92   c     1.00   c
rpl16       n     1.00   c     1.00   c
rpl20       c     0.77   ?     1.00   c
rpl22       c     0.40   ?     0.92   c
rpl23       c     0.40   ?     0.98   c
rpl33       c     0.29   n     0.98   c
rpoA        c     0.40   ?     0.77   ?
rpoB        c     0.92   c     1.00   c
rps2        c     0.77   ?     1.00   c
rps3        c     0.92   c     1.00   c
rps4        n     0.92   c     0.98   c
rps7        n     1.00   c     1.00   c
rps11       c     1.00   c     1.00   c
rps14       n     0.29   n     0.77   ?
rps15       n     0.04   n     0.04   n
rps16       c     0.92   c     1.00   c
rps18       n     0.29   n     0.98   c
rps19       c     0.92   c     0.92   c
secX        n     0.98   c     1.00   c
ssb         c     0.77   ?     0.77   ?

Table 10. Predictions of genes for Marchantia. Nomenclature as in Ohyama et al. 1986. Abbreviations as in Table 9.

                  Fickett's    Modified
"gene"      S     Prob   D     Prob   D
atpA        c     0.92   c     1.00   c
atpB        c     0.98   c     1.00   c
atpE        c     0.92   c     1.00   c
atpF        c     0.92   c     1.00   c
atpH        c     1.00   c     1.00   c
atpI        c     0.98   c     0.98   c
frxA        c     0.98   c     1.00   c
frxB        c     0.77   ?     0.77   ?
frxC        c     0.98   c     1.00   c
infA        c     0.98   c     0.98   c
mbpX        n     0.77   ?     0.77   ?
ndh1        n     0.92   c     0.40   ?
ndh2        c     0.77   ?     0.07   n
ndh3        n     0.77   ?     0.40   ?
ndh4        n     0.92   c     0.98   c
ndh4L       c     0.98   c     0.98   c
ndh5        c     0.77   ?     0.92   c
URF69       n     0.98   c     0.98   c
URF74       c     0.98   c     1.00   c
URF135      c     0.40   ?     0.07   n
URF167      c     0.98   c     1.00   c
URF169            0.40         0.92   c
URF184            0.92   c     1.00   c
URF203            0.92   c     1.00   c
URF228            0.92         0.98
URF316            0.92         1.00
URF320            0.29   n     0.07   n
URF370i           0.40   ?     0.77   ?
URF392            0.92   c     1.00
URF434            0.29   n     0.07
URF464            0.77   ?     0.92
URF465            0.92   c     1.00
URF513            0.92   c     1.00
URF1068           0.29   n     0.77
URF2136           0.40   ?     0.77
petA              0.92         1.00
petB              1.00         1.00
petD              0.98         0.98
psaA              0.92         1.00
psaB              0.92         1.00
psbA              0.92         0.98
psbB              0.98         1.00
psbC              0.98         1.00
psbD              0.92         0.98
psbE        c     0.77   ?     1.00   c
psbF        n     0.92   c     1.00   c
psbG        c     0.77   ?     0.77   ?
rbcL        c     0.92   c     1.00   c
rpl14       c     0.92   c     1.00   c
rpl16       c     0.98   c     1.00   c
rpl2        c     0.98   c     1.00   c
rpl20       c     0.98   c     0.98   c
rpl21       n     0.92   c     0.98   c
rpl22       c     0.92   c     0.92   c
rpl23       c     0.77   ?     0.92   c
rpl33       c     0.92   c     0.92   c
rpoA        c     0.92   c     0.92   c
rpoB        c     0.98   c     0.98   c
rpoC1       c     0.98   c     0.98   c
rpoC2       c     0.77   ?     0.77
rps11       c     0.98   c     1.00   c
rps12       c     1.00   c     1.00   c
rps14       n     0.77   ?     0.77   ?
rps15       c     0.29   n     0.29   n
rps18       n     0.77   ?     0.77   ?
rps19       c     0.98   c     0.98   c
rps2        c     0.92   c     0.92   c
rps3        c     0.92   c     1.00   c
rps4        n     0.92   c     0.92   c
rps7        c     0.92   c     0.98   c
rps8        c     0.92   c     0.98   c
secX        n     0.92   c     1.00   c

D. DISCUSSION

Sheperd's test, genomes failure code does are was sequences which so fairly an RNY code to identify DNA. sequences, Sheperd's well of considerable inquired if Sheperd's showed predicts that this method in is even was were not many that method was which rather interest. and genes, genes detecting other Sheperd enough though organisms. (1981) sequenced The suggested the an chloroplast reasons that the for code remains mutations have occurred. Staden generated method absence without in present does. He of stop codons. In an experiment using computer generated also codons, codon DNA sequences day (1984) By analyzing bias, wondered which this primitive of that simply detecting codon preferences. computer what poorly in these are with Staden whether all YRR various

Table 11. A comparison of Fickett's testcode procedure and its modified form when applied to genes and non-coding regions from other species. The modified form performs somewhat better in that it is not uncertain about the identity of genes in 3 cases where the original form is uncertain. In no case does the modified form identify a non-coding region as a gene. Acc # is the Genbank access number. Other abbreviations as in Table 10.

                                                   Fickett's    Modified
Species                     Acc #     description  Prob   D     Prob   D
Amaranthus hybridus         K01200    psbA         0.92   c     0.98   c
Hordeum vulgare             X00630    rbcL         0.98   c     1.00   c
Chlamydomonas reinhardtii   M13704    atpB         0.98   c     1.00   c
Zea mays                    J01421    cfB          0.92   c     1.00   c
Zea mays                    J01421    cfE          0.92   c     1.00   c
Zea mays                    M12704    psbG         0.40   ?     0.40   ?
Zea mays                    M11203    non538       0.00   n     0.00   n
Zea mays                    M11203    psla1        0.40   ?     0.98   c
Zea mays                    M11203    psla2        0.40   ?     0.77   ?
Zea mays                    X01698    rps4         0.40   ?     0.98   c
Oenothera hookeri           X03780    psbE         0.77   ?     1.00   c
Pisum sativum               X03575    non224       0.00   n     0.04   n
Pisum sativum               X03575    atpA24       0.92   c     0.92   c
Solanum nigrum              X01651    non318       0.04   n     0.00   n
Solanum nigrum              X01651    psbA         0.92   c     0.98   c
Spirodela oligorhiza        X03834    rpl16        1.00   c     1.00   c
Spinacia oleracea           X01724    psbC         0.92   c     1.00   c
Vicia faba                  X00682    URF          0.00   n     0.00   n
Triticum aestivum           J01458    atpase       1.00   c     1.00   c

Figure 7. Cluster analysis of liverwort genes and URFs greater than 200 bp. Genes of a type tend to cluster together, e.g. the ndh genes are clustered, as are the rpL genes. The three URFs for which the modified Fickett's test made no decision are also clustered. Unlike tobacco, the URFs predicted to be noncoding are not clustered. The reason for this is unknown.

[Figure 7: dendrogram, panel title LIVERWORT]

Figure 8. Cluster analysis of the tobacco genes and URFs greater than 200 bp. Of the 21 URFs predicted to be noncoding, 13 are in two distinct clumps. As in the liverwort, genes of a type tend to cluster. The photosynthesis genes psa and psb are good examples of this.

[Figure 8: dendrogram, panel title TOBACCO]

amounts of stop codons of codons is stop method. Therefore, to either and least genomes be least origin for the limitation, I accumulate modified the the mutations are the mutating and a genome's present to rate RNY code, and success that the lack of the is not due the same the following: rate rate should selection be. the from a pure that Faced with genome is mean is be genomes the single this in rate of Eventually so RNY code it the "neutral" strong mutations that that is no have and Chloroplast DNA is considered to of 1985, RNY code. have most its original Palmer Sheperd's rather beneficial mutations form. 1987, method Therefore, remained a beneficial mutations reflecting only more an oldest decrease. pressure that further as would a a (here I those assuming age.
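The cluster analyses of Figures 7 and 8 take as input, for each gene or URF, the fractional abundance of each of the 64 codons, and compare sequences by euclidean distance. A minimal Python sketch of that input step follows. The function names are illustrative, codons containing characters other than A, C, G and T are skipped by assumption, and the clustering step itself (MIDAS with Ward's minimum variance algorithm) is not reproduced.

```python
import math
from itertools import product

# The 64 codons in a fixed order, so that every sequence maps to a
# vector with the same layout.
CODONS = ["".join(bases) for bases in product("ACGT", repeat=3)]


def codon_fractions(seq):
    """Fractional abundance of each of the 64 codons, read in frame from base 1."""
    seq = seq.upper()
    counts = dict.fromkeys(CODONS, 0)
    total = 0
    for i in range(0, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if codon in counts:  # assumption: skip codons with ambiguous bases
            counts[codon] += 1
            total += 1
    if total == 0:
        raise ValueError("no complete unambiguous codons in sequence")
    return [counts[codon] / total for codon in CODONS]


def codon_usage_distance(seq_a, seq_b):
    """Euclidean distance between the codon-usage vectors of two sequences."""
    fa = codon_fractions(seq_a)
    fb = codon_fractions(seq_b)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(fa, fb)))
```

Two sequences with identical codon usage are at distance 0, and two sequences that share no codons and each use a single codon are at distance sqrt(2). A matrix of such pairwise distances is the kind of input a hierarchical clustering routine would need.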
if  code  the  out,  positive number,  chloroplasts  is far from  original  pointed  performance  it is consistent  genome  postulated  D N A ( L i et al.  from  the  are  predict  must  poor  an  if  the  the  code  occurred and the  zero,  far  the  for  should be high, but  small  nuclear  it is now  that  Sheperd method  from  mutation  mutation  to  that  showed  R N Y code, it m i g h t be that  colleague  the  would  mutation  and  long period of time,  or  history,  relative  1987),  the  This  illustrates was  a  genomes  very  1986),  neutral.  slowly  Clegg  as  initially,  a  of the  an  originally  hypothesis  genome,  should be  (Kimura  in  I  all  then,  Staden  responsible  success  i n s u r v i v i n g genomes)  mutations  occurred  but,  means  rate  solely  evolved furthest  reason,  code  in  frames,  artefacts.  had  RNY-like,  mutations  slower  this  environment,  mutation  not  3  original code was  which  For  might  fixed  but  it appears that the  that the  RNY-like.  constant  partly,  of the  of these two suggested  If one supposes genes  i n each  59  of  i f the constant have  the be  Ritland analysis original for  a  already  Prediction and identification of chloroplast genes / 60 There is another there  another  than  reason  origin?  and  he  presents He  "match  with  up"  ribosome  rather  a  that  in  gene  one  that  a  a  This  is), would  make  that  tend  to  12  are  codons  of to  actually  two  amino  have  mRNA,  stay  is  the  frame  on  for use of G - n o n G - N codons, w h i c h  genome  Bernardi  hypothesis  is  to  ribosomes  consist  deceptive.  The  and  Grosjean  series  of  chemical  credible, is  of gene  original  properties  hydrophobicity  whether  products,  genetic  to (see  organisms  the  biochemical  The the  acid.  
was  code  the  on  the  to  be  proposes  describes  would  which  reading  have  provided  It  appears  arose  if  pathway,  the  base  acid in  second  are  and the  specificity  in t R N A s  a the  codon  base  t h i r d base to molecular weight. to  is  D N A (Cedergen  amino  first  Since this  shown that there an  this  first.  obvious, but  the  due  it  postulate  arises,  R N A , not  for  codes;  (which I suspect  seem  (1988) have  synthetic  has  Trifinov  ribosome  might  Specifically,  is p l a i n l y  or  probably  that  1985), and  specificity  gene  which  G + C composition of the  question  answer  Coates  codons  amino  Rose et al. this  and  between  of the  the  material  1987, eg.). T a y l o r  correlations  corresponds  modern  be  1985).  codon  of course, R - n o n G - N .  t h i r d position is correlated w i t h  and  rRNA  test. One can  codes,  that base choice for the (Bernardi  monitoring  that  widespread  are,  frame  p a r t i a l l y redundant  other  G-nonG-N  G-nonG-N  Thus,  explanation of Sheperd's  in  other  recognition site  frame.  Is  pattern  16S  every  simultaneous,  acids, and  a  on  the  method.  R N Y codons  codon  not  maintain  of  sites  Because  of m e c h a n i s m , i f it is r e a l and  which  a this  monitoring  number  and  consist  that  codon pattern. codons  to  genes  frame  possible another  ribosomes,  pressure  found  putative  there  type  might  convincing evidence  sufficient  describes  question of the efficacy of Sheperd's  gene  G-nonG-N  corresponds just  a  (1987)  found  G-nonG-N,  that  why  Trifinov  mechanism.  frame.  approach to the  and  to In the  Prediction and identification of chloroplast genes / enzymes  that  Originally  charge  there  them,  was  probably  certain R N A sequences, of  a  give  a  giving  the  and  used  predict  keeping  is  performed  less  between  frame  are  present  at  plasmids, various  each  the  The  point  is  a  black  of which  codons.  
It  that  to  box.  gene  to that  be  altered  a  This be  also  was  contained  tetracycline resistance  shows  to grow on a m p i c i l l i n  would  advantage  Another  seems  to  bind  acid:RNA  prior  to  any  be  G-nonG-N in  or  chloroplasts  systems  requires  or  more  2)  work  organisms  be  illuminating  gene  for  the  that  same  the  ampicillin a  least  indicate a  defective  ribosome,  gene  ampicillin  hypothesis 1)  frame  keeping  the  is  differences  compartments to  construct  which  the  presence  of the  gene  If for  which  resistance  This  have of  with  example,  was  on  gene,  p l a s m i d and gene.  then  but  to  to  the  perhaps  efficiently.  tetracyline  herein; T r a m o n t a n o  protein  evolution of  cell  transcripts  follow,  are  since  that the bacterium contains the  relevant  to  interactions  frame  protein  resistance  normal  could  because  and  would  predicted  say,  way.  directly  R N Y . This  construction is possible due  the  this  stabilized w i t h  t h a t it allows the detection of a graded  point w h i c h  arise  keeping function i n the  i n different  It  contains  arise  other  question  would  acids  later  amino  prevalent  in  keeping mechanisms  was  could  less  than  This  amino  lowest level of G - n o n G - N would be translated  plasmid  the  is  selected  important way.  which  potential frame  RNY-ness  for  did not  R N A binds to itself. T h i s is the beginning  molecules  be  probably  binding complex. E n z y m a t i c function  structure  levels of G - n o n G - N .  synonomous the  might  another  specificity  tendency  a  enzymatic  that  a  acid  code. I f there was  codons  could  to  protoribosome.  fundamental genetic  rise  the  and of course,  primitive R N A : a m i n o  eventually  but  61  a the  inability  design  has  found  that  response.  et al.  
(1984)  Prediction and identification of chloroplast genes / the  complementary  the  code  simply  strand  due  to  pairing inverted  some  tests,  codes  that  YNR.  I f cip frequently  Fickett's will  space  out-performed  modified  form  chloroplasts frxA  et al.  was  gene,  Some  was  a  gene  i n both  maize  gene for (Table  gene nor  a  with  p s b G gene  the  on  nongene.  on  to  frame  are  tentatively a  gene.  and  The  that these antibodies  probability  means amounts  zero. that  of  This its  this gene  is  could  a  structure  is  gene;  of particular suggested  et al.  low  (1986) exists;  a  made they  (Sibbald  the  or  D N A . In  the  In  have  White  could  be  psbG to  which  should  kDa  protein  There  psbG  1987)  that  predicted  24  made  Ohyama  putative  neither  of  liverwort,  shown  these antibodies.  and  variety  a gene b y  a  data, its  antibodies found  to  and  interest.  is  a way  wide  (1987)  10)  two  R N Y and  9,10),  a protein that is not  pseudogene  a t y p i c a l (see  in  to be  are  composition  chloroplast  genes  the  strands.  (PS) I FeS-protein. The  detecting  occurring  they There  (Table  O h - o k a et al.  i f that  are  be  performed  and  on  l i v e r w o r t (Table  Steinmetz product  use  (apparent molecular weight on gels) that reacts w i t h possibility  and  other  DNA  identify  photosystem  11)  the  position  chloroplast  correctly  rather  a  that  hypothesis.  as  U R F . If  suggested  code selection m i g h t be  both  predictions  to be  (1984)  this  an  also tends to be R N Y  transcribed  support  same  expected,  stand  al.  be  modifying it for  appears  than  of genes existing on alternate  method  is predicted  is i n fact  react  relies  which  (1986),  a  which  11).  
this  be  chance  also  (Table  could  to  the  the  improved by  it  (cip)  et  prove to be transcribed,  Sheperd's  performance  in  often  complementary  them,  cip  is, more  Tramontano  proteins"  produce  method,  the  rules.  among  or increase  Fickett's  the  gene  tends to be R N Y , then  "complementary  save  of a  62  a  encoded.  but in  is  by  such  discussion of codon usage  in  no low the  Prediction and identification of chloroplast genes / introduction to this chapter). and  this  could  published  The  one  of  them.  families.  diagrams  In  ndh  Fig.  genes  7  ( F i g . 7,8) the  The  protein  Fickett's  procedure  detecting  codon  dehydrogenase)  genes  are  is  clustered.  unsure  usage  show  photosynthesis  (NADH  ribosomal  to  are  some  also clustered. T h e U R F s  conclusion,  organized  the  extent.  on  this  basis  constraint  on  could  modified  for  be  implementation  of  is  only  1984) is  small  Fickett's  percentage  parameters  this  an  would  necessary.  free"  similarity  genes are  (ps  also  in )  have  URFs  suggesting  F i g . 8,  that  sequences coding  indicating  various  the  and and  that  taxonomic  suggestion  in  different  molecule, as and  a  In  that  within  psbG.  F i g . 7,8 which  photosynthesis well  the  modified  Fickett's  fairly  gene  except  both  about  been  species.  usage  clustered,  clustered.  I n F i g . 7 the  In  codon  are  the  being  is often test  w o r t h w h i l e . A search  groups present  between  protein  suggestion and  test  is  genes  are  clustered  into  work  implicitly cip  done  from  can  be  coding  exerts  a  that  his  shows  that  by  are  regions  results  hypothesis  transcripts  sequences  the  species. T h u s the  of the for  noncoding noncoding  D N A organization. F i c k e t t ' s  approximation be  a  of errors  predictions.  
coding  alone,  organization is significantly "context  in  differently  substantial  a  modified  a  predicted not to be genes are  bases  sufficiently  identified  as  test m a k e s  clustered,  two groups, lending support to the  In  the  (Sibbald 1988) and should prove useful for -workers i n other  cluster  The  be  Lastly,  63  testcode of  an  sequence  treatment of D N A molecular biologists, (Tramontano  both  strands  et  al.  of genes  IV. INFORMATION  THEORY AND  A. INTRODUCTION TO INFORMATION  THEORY  Information  this  variety has  theory  of reasons,  presented  valuable  1973)  much  of w h i c h  Gatlin  (1972a)  applied  to  is  central  not  to the  a  a  history is  which  Morse  code  and  to  very  various  a  symbols.  which  often  concerned  receiver  information showed,  means  what  in  be his  of the  message string  in  they  with is  the  the  of  was  information  development  important  to obtain from  other  concerned  Since a  arguments  originally  or as  a  work  a  with  here,  language,  such  as:  rate  of  maximum  celebrated  of  to  accurately second  the  it  1.  a  information  when  64  given  a  that  a  while  referred  with is  string  message  as  a  a of  has  string  of  Channels m a y  be  transmitter, and  is noisy?  message  a  Information  transfer?  channel  theory  A. message  defined  message.  is  as  is concerned  example,  a message.  into the  volumes,  field.  receiver. for  is  1973,  measures  reader  of D N A . .Thus, herein  theorem  a  theory  of information  (1948),  symbols;  strand  introduce errors  transmitted  Shannon  second  to  T w o other  a  (1958)  A  1948  information  presented  m e d i u m that t r a n s m i t s  problems  for  information  from  sources.  history  channel  sequence is  created  theory.  of  detailed  conceived b y  through  normal  A channel is any  noisy  most  useful.  
and for a more detailed exposition of the message, connotations development (1984), century improve communication. Kullback in biological as temporal to papers Holzmuller the theory, is the is now difficult transmission of a or of of desire Key above references physical a contains DNA, are Information development including source-book (Slepian the is DNA can theory channel and 2. can how Shannon be is (1949) transmitted Information through a channel, no matter how noisy, provided the message is correctly coded at source, which is a surprising and important result.

B. GATLIN'S APPROACH

In her seminal book, Gatlin (1972a) showed how to calculate a variety of information theory measures, or entropies, for biological sequences. These measures strictly apply only to infinite sequences and, for this reason, it is correct to speak of probabilities, rather than frequencies, of symbols. The following outline of Gatlin's method follows from Gatlin (1972a p. 62-67; 1984).

Consider an infinite sequence of symbols drawn from a finite alphabet. The alphabet has $a$ symbols. In this sequence, each symbol occurs with a probability $p_i$, where $i$ indicates which symbol of the alphabet the probability is associated with. For any message,

$$\sum_i p_i = 1 \tag{1}$$

The entropy of the single symbols is defined as

$$H_1 = -k \sum_i p_i \log p_i \tag{2}$$

where $k$ is a constant and $H_1$ is the entropy of the single symbols in the sequence. For convenience base 2 logs are used, $k$ is set to 1 and the units are bits per symbol. $H_1$ is a maximum ($H_1^{Max}$) when all $p_i = 1/a$, i.e. all symbols are equiprobable. If the $p_i$ deviate from being equal, then the sequence entropy
G a t l i n defined  Max = H 1—1  D  H  observed  In  3  Obs  1  log 1  for the  to  are  a  and  H  is  considering single where  2-tuples,  independent  calculated  using  equation  2  with  the  p  i  sequence.  symbols, or n-tuples  are  from equiprobability as D ^  1  =  addition  doublets  divergence  Obs - H  Max where  the  symbols,  n is the  triplets  are  it is possible  length 3-tuples  to  consider  of the  substring.  etc.  the  If  In  substrings  this  symbols  in  of  terminology, the  sequence  then  Ind H  = i  n  where nth  - Z . Z . . . Z p. p. ... p log (p. p. ... p ) j n i j n I J n  p , p ... p are i J n  position  which  is  memory  of n-tuple  generating  (m)  the  probabilities  respectively.  the  sequence  of 0,1... G a t l i n  defines  of single  Gatlin's  model  4  bases is  under consideration. (m) H  or  in  that This  "H-Markov  the of  first,  a  Markov  source  with a  second  can  ...  source, have  memory  of  a  m"  as (m) H  =  M  where m  - Z p a  , log p ( ( m + l ) | m ) m+1  p((m+l)|m)  symbols  and  is the p  m+ 1  5  conditional probability of the is  the  probability  of  the  (m+l)th  (m+l)-tuple.  s y m b o l given Given  this,  the it  is  Information possible  to define  T h i s entropy  is  the D H n  D H  (1) =  n  H  +  1  G a t l i n then  H  M  ... H  n-tuple  (m-1)  (m) +  M  (n-m)H  are  not  67  independent.  6  M  (1) a second D - v a l u e , D 2  defined  H  - H  1  a memory  7  M  of 1 and  (m) D  +  symbols i n the  and D N A /  (1) =  2  for  i f the  (2) + H  M  (1) D  entropy  theory  H  The most  1  - H  general  =  H  expression  of G a t l i n  Note define  i n the  Gatlin's  D-values (1948,  motivation  motivation of  M  entropy  scale  shown i n F i g . 9 w h i c h is r e d r a w n  from F i g . 7  (1972a).  that  Shannon  for D-values is  (n-1) - H  M  T h i s results  level  of m  M  (n-2)  the  generally, for a m e m o r y  (m) =  2  D • n  more  for  in  terms  is  of  to  define  them.  
1949). U n f o r t u n a t e l y , behind her  1-tuplets)  partitioned.  approach  the  work and  Specifically,  to  be  independence i f the  Her her  calculations  seems  a  varietj'  derivation  of is  H-values based  derivation is difficult is  that (at  sequence had  not  always  deviations the  level  equiprobable  then  to  the  work  of  to understand  readily  fom of  on  and  apparent.  equiprobability  n-tuplets, symbols,  n>l)  and The  (at  the  can  be  independently  Information theory a n d D N A Figure  9. T w o  scale, r e d r a w n the r i g h t is m y  entropy from  scales  Gatlin  f o r sequences.  log a  I  H,  H  entropy  (2)  log a  H,  D  2  (i)  (3)  T  (i)  HM  H,  (1)  (2)  HM  to +^ CD  i s Gatlin's  (1972a). See text f o r for e x p l a n a t i o n of notation.  a  CD CO CO .Q  the left  new scale.  2-,  1.5  On  / 68  H,  (1)  0.5 H  X3)  Hi  Other D and H values  0  J  zero  H  4  Other D and H values  zero  On  Information distributed,  then  the  Divergence  from  equiprobability decreases  level also  entropy  H ^ . A s is clear from  symbol  Gatlin's  divergences  independence.  sequence  would an  be  this  entropy  by  examples  (1972a)  however, the  log a  amount  bits.  D ^ to  sequence  the can  independence occur at the level of 2-tuples. I n the sequence (1) (1) = 0, D = 0 and D = 1 . T h a t is, the single symbols, 0 2 3  D  and  1 are  equiprobable.  time  and  probability.  the  from  00110011...  by  In  i n the  sequence A C G T A C G T . . . the sequence has (1) equiprobable symbols and so D is 0 but D is log a = 2 bits/symbol because 1 2 all  diverge from  per  theory and D N A / 69  1  0 half  Therefore  A t the  the  time  level  while  of 2-tuples,  0  is  followed  1 is also followed  there is no deviation from  by  by  0 and  independence  at  1 half  1 with  the  the equal  2-tuple  level  (1) and  D  =  0. 
At the 3-tuple level however, $p(1|00) = 1$, $p(0|11) = 1$, $p(1|01) = 1$, $p(0|10) = 1$, and this is complete dependence at the 3-tuple level and $D_3^{(1)} = \log a = \log 2 = 1$.

C. A NEW APPROACH

In an attempt to improve and clarify Gatlin's approach, and in order to better understand the meaning of D-values, I derived another way to obtain D-values. It seemed that $D_2^{(1)}$ might be the entropy of the doublets that would be found if the single bases were independent, minus the entropy of the observed doublets. What is postulated is that

$$D_2 = -\sum_i \sum_j p_i p_j \log (p_i p_j) - \left(-\sum_k p_k \log p_k\right) \tag{10}$$

where $k$ refers to observed doublets and $i$ and $j$ to single symbols. To continue, $D_3$ (assume the memory of 1 from now on) might be viewed as the entropy of the triplets that would occur if the doublets were independent, minus the entropy of the observed triplets. As was done by Hinds and Blake (1984), I estimate the probabilities of the triplets that would occur if the doublets were independent as $p_{ix} p_{xj} / p_x$ for the triplet $ixj$. In this notation, $p_{ix}$ is the observed probability of the doublet $ix$, $p_{xj}$ the probability of the doublet $xj$, and $p_x$ is the probability of the single symbol $x$. Symbolically I think of $D_3$ as

$$D_3 = -\sum_{ix} \sum_{xj} \frac{p_{ix} p_{xj}}{p_x} \log \frac{p_{ix} p_{xj}}{p_x} - \left(-\sum_k p_k \log p_k\right) \tag{11}$$

where $k$ refers to observed triplets. Generally, it was thought reasonable to postulate (always assuming a memory of one)

$$D_n = -\sum_{ix} \sum_{xj} \frac{p_{ix} p_{xj}}{p_x} \log \frac{p_{ix} p_{xj}}{p_x} - \left(-\sum_k p_k \log p_k\right) \tag{12}$$

where $p_{ix}$ is the probability of the (n-1)-tuple corresponding to the first n-1 symbols of the n-tuple, $p_{xj}$ is the probability of the (n-1)-tuple corresponding to the last n-1 symbols of the n-tuple, and $p_x$ is the probability of the central (n-2)-tuple. Again $k$ refers to observed n-tuples. This approach follows from Fig. 9, where it can be seen that Gatlin's $D_n^{(1)}$-values are independent and thus additive. By calculating D-values in this new way it was felt that each order would be corrected for the previous level, to insure independence. Next, rather than having many H-values, it is possible to define one type of H-value:

$$H_1 = \log a - D_1 \tag{13}$$

$$H_n = H_{n-1} - D_n \tag{15}$$

as shown in Fig. 9. Note that this method is the inverse of Gatlin's. I define D-values first, in an intuitive way that follows from the definitions of $D_1$ and $D_2^{(1)}$, and then define H-values. Also note that my $H_n$ is equivalent to Gatlin's $H_M^{(n-1)}$.

While using my own formulae for D-values, I found that they always gave the same values as Gatlin's, within the limits of round-off error. Because this happened repeatedly, I concluded that they must be identical. As Gatlin (personal communication) correctly pointed out however, I had not proven that they are identical. Here is a proof that my D-values are equivalent to Gatlin's.

D. PROOF OF EQUIVALENCE

My expression for $D_n$ from equation 12 can be rewritten as

$$D_n = \sum_{ixj} p_{ixj} \log p_{ixj} - \sum_{ix} \sum_{xj} \frac{p_{ix} p_{xj}}{p_x} \log \frac{p_{ix} p_{xj}}{p_x} \tag{16}$$

The expression is rewritten with the inclusion of a conditional probability and the logs expanded:

$$D_n = \sum_{ixj} p_{ixj} \log \left(p_{ix}\, p(j|ix)\right) - \sum_{ix} \sum_{xj} \frac{p_{ix} p_{xj}}{p_x} \left(\log p_{ix} + \log p_{xj} - \log p_x\right) \tag{17}$$

As can be seen from the definitions in equations 5 to 9, each of the five terms can be immediately expressed in Gatlin's notation, writing $H_{n-1}^{D}$ and $H_{n-2}^{D}$ for the entropies of the observed (n-1)-tuples and (n-2)-tuples:

$$D_n = -H_{n-1}^{D} - H_M^{(n-1)} + H_{n-1}^{D} + H_{n-1}^{D} - H_{n-2}^{D} \tag{18}$$

$$D_n = H_{n-1}^{D} - H_{n-2}^{D} - H_M^{(n-1)} \tag{19}$$
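Before the final step, the identity being proved can also be checked numerically. The sketch below is an illustration under stated assumptions rather than the program used in the thesis: the function names and the test sequence are hypothetical, and n-tuples are counted circularly so that the (n-1)-tuple and (n-2)-tuple probabilities are exactly consistent with one another. It computes $D_n$ once from equation 12 (entropy expected from overlapping (n-1)-tuples, minus the observed entropy) and once from equation 9, writing each Markov entropy as a difference of n-tuple entropies.

```python
import random
from collections import Counter
from math import log2


def tuple_probs(seq, n):
    """Probabilities of n-tuples in seq, counted circularly (an assumption
    made so that marginal probabilities are exactly consistent)."""
    if n == 0:
        return {"": 1.0}
    ext = seq + seq[:n - 1]
    counts = Counter(ext[i:i + n] for i in range(len(seq)))
    return {t: c / len(seq) for t, c in counts.items()}


def entropy(probs):
    """Shannon entropy in bits of a dictionary of probabilities."""
    return -sum(p * log2(p) for p in probs.values())


def d_new(seq, n):
    """D_n from equation 12 (n >= 2)."""
    outer = tuple_probs(seq, n - 1)    # p_ix and p_xj
    centre = tuple_probs(seq, n - 2)   # p_x
    expected = 0.0
    for ix, p_ix in outer.items():
        for xj, p_xj in outer.items():
            if ix[1:] == xj[:-1]:      # the two tuples share the central x
                q = p_ix * p_xj / centre[ix[1:]]
                expected -= q * log2(q)
    return expected - entropy(tuple_probs(seq, n))


def d_gatlin(seq, n):
    """D_n from equation 9, with H_M^(m) taken as the entropy of the
    (m+1)-tuples minus the entropy of the m-tuples."""
    def h(k):
        return entropy(tuple_probs(seq, k))
    return (h(n - 1) - h(n - 2)) - (h(n) - h(n - 1))


def make_sequence(length, seed):
    """Hypothetical test data: a pseudorandom sequence of unequal base composition."""
    rng = random.Random(seed)
    return "".join(rng.choices("ACGT", weights=[3, 2, 2, 3], k=length))


if __name__ == "__main__":
    seq = make_sequence(2000, seed=1)
    for n in (2, 3, 4):
        print(n, d_new(seq, n), d_gatlin(seq, n))
```

The two columns printed for each n should agree to within round-off error, which is the behaviour reported above. The double loop in `d_new` is written for clarity, not speed.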
- log p ) xj x  9, each  of the  five  and  17  terms  notation.  - H  n-2  18 19  Information  (n-2) D  n  This use  =  H  for and  (n-1) - H M  M  is G a t l i n ' s  D n  these equations  Gatlin's  (1984)  computer is  20  (equation  9) and the proof  to investigate D N A sequences,  a l g o r i t h m (equation programs.  more  is complete.  9) is more  M y method  intuitively  obvious.  however,  It  convenient uses  provides  an  a  than  more  m y equation  streamlined  alternative  16  notation  interpretation  for  correctness.  REDUNDANCY  A n o t h e r concept,  R  m u s t be introduced.  redundancy,  Max =redundancy(n) = ( H n 1  R , R . . . R are redundancies 1 2 n the  Before proceding to  several points should be noted.  G a t l i n ' s D-values a n d a confirmation of their essential  E.  theory a n d D N A / 72  sum  of D-values  D  AAAA...  redundant.  R^ = 0  and  sufficiently remains.  For  D ^ = 2,  a  R =1. large  has  that about  R  bases w i t h  H^ =0  sequence  It  D .  lies  21  H , H ...H. R is proportional to 1 2 n n between  0  and  1  inclusive.  For a  n  sequence of equally abundant sequence  calculated from  to  I  Max -H ) / H n 1  and  no higher order constraints R = 0 . F o r a R^ = l ,  ACGTACGT...  been half  observed  i . e.  the  sequence  D =0, H =2, 1 1 2 that  the  R  of  the symbols can be deleted  is completely  D =2,  H =0 2  English  language  a n d the sense  and  is still  Information  theory  and D N A /  73  F. A P P L I C A T I O N OF INFORMATION T H E O R Y TO SEQUENCE A N A L Y S I S  The  question  to be  tool  for  the  analysis  we  be  aware  information to  make  an  of?  of the  the  notion t h a t  just  as  Barry  (1986)  and,  has  D N A must  from  external  a  whole, it is v a l i d  genetic  material.  Therefore,  study Fig.  shows  transcriptase DNA->  that  informational  channel  DNA.  The  there  it  and  D N A out  message  description of amino acid order itself.  
The  machinery organism-> Based  channel whereas organism  possible very  for  the  protein  the  DNA- >  DNA  both evolved in the  by  information required  a  biological  functions  of its  is to  a use  channels is  as  component  is  the  have  is  cell- >  depending  been  made  direction of increasing entropy  of  cell on  and is  a  the D N A translation  (ontogeny)  or  time  scale).  that D N A and  protein  Hasegawa  the  to  reverse  protein  and  the  detail.  channel  transcription  channel  with  theory  possibly the DNA->  aid  unit.  channel i n more  known):  but,  an  a biological  information  (plus,  from  n o r m a l context  large  in  context;  its components  o r g a n i s m as  it  reproduction  suggestions  must  made  message for replication is j u s t  DNA- >  on information theory,  special caveats  transcription/translation  for  or  appropriate  D N A only  message and  little  while the  (phylogeny  the  appropriate  to discuss  which  in  to study  molecule, and  two  an  these points. M y concern follows  study  seems  are  about  be  contribution  and  considered  plant to  it. H o w e v e r , it is necessary 10  the  sources  intention of (eventually) r e t u r n i n g to study  an  what  questioned  u l t i m a t e l y be  to fractionate  D N A is  i f so,  theory  of D N A because m u c h of the  comes  Plainly,  have  information  c y t o p l a s m . I agree w i t h  it is v a l i d the  is: can  of D N A sequences,  organism  context  to g r a s p i n g  next  theoretical analyses  the  the  addressed  and  Yano  Information theory and DNA Figure  10.  information  A  schematic  encoded  drawing  in DNA  can  of flow  the  two  according  channels to the  through  which  the  central dogma.  The  reverse transcriptase channel could be included also.  CHANNEL  DNA—•RNA—• PROTEIN  (J  CHANNEL  / 74  Information 1975,  Yano  1971,  Reichert  discredited  and  Hasegawa et  al.  
(Gatlin  1974)  1976)  1974,  or  but  Gatlin  decreasing  both  these  theory a n d D N A /  entropy  (Reichert  suggestions  have  1975, F e r r a c i n et al.  1976,  1972b)  DNA  and  Wong  been  Sitaram  75  largely  and  Varma  1984).  Smith  (1969)(see  capacity that  to  code  vertebrate  pressure protein  also for  diverse  genome  toward  the  same  yeast,  mouse  Blake  et  al.  0.31  least,  the  constraint correct,  on  is  not  (chapter are  G + C of 0.43,  were  coding  m a x i m u m at  clustered capacity.  human were  with  G+C  to  0.68  G + C.  the  0.43  have  verified  I  maximum  it was  around  0.43, but w h a t  the  rich  on  ranging  (mouse)  One  acid  is at  suggested due  to  that  issue is  a the  whether  to  These  from  and  0.29  code  for  of  to  to  protein can  coli  the  (Blake  ( M o r e a u and noncoding  proteins  180  genes,  acid  is  not  will  be  regions,  code  from  E.  coli,  compiled  by  (E.  0.66  0.43  Scherrer  still  as  coli),  0.29  Therefore, as  a  if Smith  Earley  and  have  seen  made,  the  compositions could  (human).  be  and  of 0.35  of  0.68  0.66  coding regions  and h u m a n s than  amino  0.30  prediction that  G+ C  using synonomous codons  compositions  approximately  examined.  H o w e v e r , i n E.  G+ C  and  to v a r y over a range  amino based  requirement  1), Xenopus  more  be made  The  (1986),  that be.  tj^pically  and  has  to evolve toward this value.  protein.  achieved  (yeast),  may  values  a  that  I found that, for a v a r i e t y of protein genes,  G + C value could  at  showed  at  protein  coding capacity is a  First,  been  proteins  G+ C  maximum  there is a pressure  for  Gatlin  and  seems  average  very  narrow  Gatlin  G+ C  1987) the  which  on  and  1986), tobacco  to  are  elsewhere chloroplasts  coding regions to  be  a  very  Information common  property  vertebrate  D N A , Moreau  regions  was  ranges  of  fern-allies, ranging  of genomes.  
0.55,  for  which  from  are  0.37  more  likely,  base  composition  constraints. group,  as  a  a  rather  have  than  This  might  humans  If  source suggest paper the  been  but  fact  that  in  that that  idea,  it was  is  fidelity  well  probably  ahead  because,  existing p a r a d i g m of the  for  are  in  in  my  the  and  living  the  not  Ferns  is  to  coding  contents It  due  seems to  the  informational  also a monophyletic  0.43  due  to  properties, the  human  historical i.  e.  the  phenomenon.  tendency  to  see  it is optimized to code  for  relative  abundances,  which  s i m i l a r i n coding and noncoding regions that  is  critical.  came  to  this  opinion,  and  things.  protein perhaps  time,  76  approximate  (1970).)  explanation for to  of  G+ C  clustering  around  due  doublet  have  Shapiro  informational the  G+ C  shown.  vertebrates,  G+ C  part  replication  of its  that  on  other  norandom  of S h a n n o n redundancy,  of  are  advanced, also  that  is not  made  optimized to code  D N A . The  seem  the  in F i g . 11, the  ancestor-  pressure  as more highly evolved than  D N A is not  more  have  (1971),  clustering  calculation is correct  (See  common  selection  be  and D N A /  noncoding regions of  that  groups  to  1971).  it would  information error  taxonomic  Green  a  found  0.43. L a s t l y ,  of  recent  So, i n this case, also  from  (Green by  of coding and  (1987)  n o r m a l l y thought  0.46  of  study  Scherrer  variety  not  to  their  different  suggested  probably  constraint  and  quite  G+ C  In  theory  such  a  Yockey  point but  view  ran  are  might  (1974), then  a  in  a  discarded  contrary  to  the  time, n a m e l y that D N A exists to code for protein. 
Since there now exists a new paradigm of selfish DNA (Dawkins 1976, 1982), it is no longer heretical to postulate that DNA is organized in such a way so as to optimize its own fidelity of replication. Further evidence that it is replication that is optimized rather than protein production, is the presence of error checking in DNA polymerases which in RNA polymerases is comparatively absent. Further, errors made in synthesising protein are rapidly disposed of while errors in DNA replication tend to accumulate.

Figure 11. The G+C ranges for various taxa. The theoretical maximum information capacity for protein coding is at 0.43. These data are taken from Shapiro (1970).
[Figure not reproduced: horizontal range bars of G+C content (x-axis, 0.25 to 0.90) for mammal, frog, fish, echinoderms, arthropods, molluscs, porifera, protists, angiosperms, gymnosperms, ferns, green algae, brown algae, red algae, cyanobacters and bacteria.]

G. USE OF HIGHER D-VALUES

Another example of information theory applied to sequence analysis is found in Brooks et al. (1984, 1988). A discussion of what follows is the subject of a forthcoming paper (Sibbald, Banerjee, Maze submitted.). Brooks et al. show, for 23 real DNA sequences from the EMBL data base, that while $D_1$ to $D_4$ are not different from expectation, $D_5$ and $D_6$ show a jump in magnitude, which leads them to suggest (Brooks et al. 1988) that, since DNA helices complete a turn in 10.5 bp, they might be detecting topological features of the DNA. What they are in fact detecting is an artefact due to sequence length, discussed by Gatlin (1974), Rowe and Trainor (1983b) and by Rowe and Harvey (1985). If a sequence is short, not all windows of length 5 or 6 can occur in the sequence, and so an apparent "order" appears due to this and this inflates $D_5$ and $D_6$. Pseudorandom sequences are shown (Fig. 12) to behave in a similar fashion even though they are nonbiological. One way to avoid this type of problem is to compare the D-values of a sequence with the D-values obtained for a large set of pseudorandom sequences of the same length and base composition (Rowe and Trainor 1983b). This procedure has been done for the seven categories of DNA in both the liverwort and the tobacco sequences (Fig. 13, 14, 15, 16, 17).

Figure 12. D1 to D6 for tobacco and pseudorandom sequences. Lengths of sequences are 600 (black bars), 6,000 (hatched) and 24,000 (white). BCL refers to the method of Brooks et al. and new is our method. The effect of length is dramatic.
[Figure not reproduced: bar charts of D1 to D6 for each method.]

Figure 13. Real and simulated D2-values. The upper curve is the maximum D2-value obtained for 13 different sequence lengths at 100 pseudorandom sequences per length. The lower curve is the minimum simulated D2. The tip of each arrow is the value of a real D2-value. The italic labels are tobacco; the others are liverwort. All D2-values fall outside the expected range of D2 values and therefore are significantly different from expectation.
[Figure not reproduced: x-axis, log sequence length (1000 to 1000000).]

Figure 14. Real and simulated D3-values. Upper and lower curves are maximum and minimum simulated D3-values respectively. Italic labels refer to tobacco and others to liverwort. The rRNA genes fall inside the expected region.
[Figure not reproduced: x-axis, log sequence length (1000 to 1000000).]

Figure 15. Real and simulated D4-values. Upper and lower curves are maximum and minimum simulated D4-values respectively. Italic labels refer to tobacco and others to liverwort. As in Fig. 14 the rRNA genes fall inside the expected region and all other categories are much closer to the expectation than they were for D3.
[Figure not reproduced: x-axis, log sequence length (1000 to 1000000).]

Figure 16. Real and simulated D5-values. Upper and lower curves are maximum and minimum simulated D5-values respectively. Italic labels refer to tobacco and others to liverwort.
[Figure not reproduced: x-axis, log sequence length (1000 to 1000000).]

Figure 17. Real and simulated D6-values. Upper and lower curves are maximum and minimum simulated D6-values respectively. Italic labels refer to tobacco and others to liverwort. At this level of order, all categories fall very near or in the expected region.
[Figure not reproduced: x-axis, log sequence length (1000 to 1000000).]

In Fig. 13 $D_2$ is outside the expected region in all categories for both plants. $D_3$ (Fig. 14) values are generally closer to the
the  more  test  either  accurately  simulated  significance at  not  done; to  difference Here  I  have  x  10  D ^ to  the  one  true  the  same  which  100  D ^ were  is  and  of  100  due  to  chosen  a  of  1,  2,  are  Trainor  4,  pseudorandom  8,  an  each of  of  arbitrary  of each than  at  least  Gatlin  order  to the m a x i m u m  m a x i m u m are as  at  2.43  and  the  from  as  that  i n addition,  simulated D-values.  this method to  SD  stringent  (1979)  appropriate  (with  of determining  always  n-1  a l l bases  length.  (but  the  s i m u l a t i o n described  value t h a n  and  or  clarity  12,  sequences  and m a x i m u m rather  distribution of the  identical  sequences  lack  85  T r a i n o r 1983b) or two S D (Gatlin  therefore  more  a l l 6 D-values at  simply  Rowe  calculated. F o r each  minimum  (1983b)  is  has  values  6  S D a n d m a x i m u m and m i n i m u m  (skewed)  it  D  level  use to  as  the  determine  n-level. Above (and i n Rowe and T r a i n o r 1983b), this  set  really  unlike  F o r lengths  bases,  minimum  Trainor  that  and  5  n o r m a l l y distributed. In the  significance  the  suggest  determine is  follows. 3  Rowe and H a r v e y (1985) criticize  sequence,  the  used  not  and  reflects  and  the  of  Rowe  Note however, t h a t significance  are  D  though  S D (Rowe and  these distances,  This  by  data  D , 4  tends to be closer to the m i n i m u m  Averaging  applied  160  the m e a n ,  minus one  the  mean  mean.  as  sequence,  it.  and D N A /  outside the expected region.  calculated. I chose to plot the  plus and  above, the  are  6  inside region,  constructed  13 groups of 100 sequences, D-value were  lie  theory  equally probable)  H o w e v e r it  regarding  the  reasonable)  significances relative to it. I a m not convinced t h a t the  was  null  is  plain  correct  that  null  model a n d  Rowe and  was the  model. 
stated  H a r v e y (1985)  Information method  is better,  different  The  shown  further  from  (1988).  Lastly,  the  triplets  D  D  and  that of  triplets  (or  H that  2  bits  exceed  it  provides  86  slightly  did  D^  In  is  2, as  categories,  The  based more that  this  D  to  being  6  for  D  This  method,  occur  based  on  doublets  higher  D-values  (1988)  be  for  3  presented  with  for  al.  makes  method  D-value can  i n B r o o k s et al.  D  B r o o k s et  bases.  appropriate would  and  5  comparing,  on single  problem  D N A , no  that  only  expectation.  only  much  tRNA  higher  only  to  mean.  D-values by  m a x i m u m overlap  windows  greater  and,  than  obviously are  that  exhibit any  Contrary  values to  Rowe  Keep  and  D^  regions  that were  G + C values  that  above  no  D ^-values  D ^ values this.  i n noncoding regions  In  m a x i m u m simulated  genes have  so  contrary  those  for  found  from  plants,  is  categories  closer,  A  s u r p r i s i n g considering the  D  fact  expected  solves  presented  liverwort r R N A  minimum  base  than  their  to  method)  per  differ  both  not  calculate triplets  rather  2.  not i n  this notion.  expectation.  plotted).  the  simulation t h a t  that  in  G e n e r a l l y , u s i n g the  T r a i n o r (1983b)  genomes  are  occur  Rowe  and  they  that  Gatlin's  is  show  nonadditive.  with  which  in a  and  agreement  from  found  13-17  occuring to  occur.  n-1 Max  Values  Fig.  dependent  3  actually  since  in  expectation  compares  size  have  Brooks et al.  example,  above  I  and D N A /  results.  results  2  though  theory  the  the  mind  that  within  (1983b)  D^  mean more  both  is  than  (not  region  5). F o r a l l Tobacco  expectation but  the  differ  expected  13).  
viral 0.0083 (Table (Fig from and was the involved maximum are Trainor D^ obtained expectation not different in I did of several closer 2.4 noncoding are the to the SD and and from coding regions with the introns have exception significantly of rRNA high D genes, ribosomal protein values. Regarding D 4 best to make though from have some tRNA and of them lie outside being meaning D levels. The 1 and D a on context. In and D 2 levels with reason for the order of summary, in expectation, of this caution 70 bp the 5 overwhelming significant but difference tobacco values, it seems from region. Part the and 6 being different expected short, in such is at the D^ very the their genes 5 statement regarding genes limited of order no and lesser results D even stems and D 6 amount order at the D with chloroplasts 3 compared to viruses is not known.

H. INVERTED REPEATS AS A SOURCE OF REDUNDANCY

I wish to present a preliminary finding concerning IRs in chloroplast genomes. R2, based on D1 and D2 as per Gatlin (1972a), is calculated in nonoverlapping windows of 500 bases for the two chloroplast genomes of are sequence that I have LSC was mean R not even 0.053737 of onetailed significantly to significantly with for the an with z = than an higher than of showed the much of of means and the that R of the repeating based at the mean R function of the genome in the tobacco a with a an N 1961) that the of we R 0.002 level. the the mean R IRB at the 0.05 the message rest a on an N of 173. IRB had (Spiegel conclude plotted as the 19). The 0.019134, IRB and lower than (Fig. 18, SD 4.385 that liverwort if R is low, perhaps so SD of 0.
022736 differences variable higher IRs labelled them 0.039756 test standardized applied position, the If 51. Using calculate of the LSC is same test region was level. This suggested that of The the LSC twice (i.e. having an IR) is a way

Figure 18. Redundancy calculated in nonoverlapping windows of 500 bases for the tobacco genome.

[Plot for Figure 18: redundancy along the tobacco genome, x-axis 0 to 200000.]

Figure 19. Redundancy calculated in nonoverlapping windows of 500 bases for the liverwort genome.

[Plot for Figure 19: redundancy along the liverwort genome, x-axis BASE, 0 to 140000.]

to compensate. the rRNA genes whose properties Ribosomal differ found Chloroplast in IRs tested (Strauss et available in the R 2 the one This if chloroplast genome. for 1 al. only and H there EMBL data for the than greater internal more R of the of the that critical. of one structure as observation resulting the sequences were are is the base. rest of for the single copy are since accumulate, be to keep the attained organism with the but regions and one lower (at the postulating that the repeat the other. accurate redundancy "one IR" sequence in its example information in the data base I calculated D^, D^, IRs of tobacco, liverwort in the Euglena that perhaps retain is based 0.05 since then the could 1983) a group, for sequences IR region and seem in low 2). and other for most idea could be tested before each whose (chapter 1988) appear compensate with IRs contain low D^, and several (conifers, as remaining "unique" region. Since I am to may This suggestion IR is also significantly needs exist one the genome invariably and (Metzenberg et al.
and liverwort, suggesting or redundancy. a the that (eg. Worton et al. repeat an in I compiled all the and in tobacco IR the IRs there Euglena 2 of that 1988)) much dispose from "communicate" Actually The results R copy notion and Euglena. to 0.5, occur in tandem single copy regions , H near strongly likewise 1985). be is is the conversion or "correction of each other" mutations (Palmer this idea G+C RNA genes often to undergo could Consistent with IRs is it was fidelity due able to its on very little data but, properly. In Euglena, the level, one tailed) that the there can be might be discarded, a R threshold this is not

Figure 20. A comparison of H1, H2, D1, D2 and R2 in the unique and repeat regions in tobacco, liverwort and Euglena. Euglena has only one repeat and the highest R. R values shown are 10x.

[Bar chart for Figure 20: panels TOBACCO, LIVERWORT and EUGLENA, each comparing the unique and IR regions.]

The idea which redundancy acts redundancy can then intend it may to be pursue has as be emerged a constraint. achieved possible to this from further D-value constraint throughout by this Perhaps increasing discard the to study try the data to in or one of copy of an identify base. that the requirement DNA informational D other is more the higher important generally systems, the for if D-values, sequence. patterns I of

REFERENCES

An G (1987) A potential Z-DNA-forming sequence is an essential upstream element of a plant promoter. Bioessays 7:211-214

Barry JM (1986) Informational DNA: a useful concept? Trends Biochem. Sci.
11:317-318

Bennetzen JL, Hall BD (1982) Codon selection in yeast. J. Biol. Chem. 257:3026-3031

Bernardi G, Bernardi G (1985) Codon usage and genome composition. J. Mol. Evol. 22:363-365

Bird AP (1980) DNA methylation and the frequency of CpG in animal DNA. Nucl. Acids Res. 8:1499-1504

Blake RD, Earley S (1986) Distribution and evolution of sequence characteristics in the E. coli genome. J. Biomol. Struct. Dynam. 4:291-307

Blake RD, Hinds PW, Earley S, Hillyard AL, Day GR (1986) Evolution and functional significance of the bias in codon usage. Biomol. Stereodynam. 4:271-286

Borodovsii MYu, Sprizkitskii A, Golovanov EI, Aleksandrov AA (1987a) Statistical patterns in primary structures of the functional regions of the genome in Escherichia coli. I. Frequency characteristics. Mol. Biol. 21:826-833

Borodovsii MYu, Sprizkitskii A, Golovanov EI, Aleksandrov AA (1987b) Statistical patterns in primary structures of the functional regions of the genome in Escherichia coli. II. Nonuniform Markov models. Mol. Biol. 21:833-840

Boudraa M, Perrin P (1987) CpG and TpA frequencies in the plant system. Nucl. Acids Res. 15:5729-5737

Brendel V, Busse HG (1984) Genome structure described by formal languages. Nucl. Acids Res. 12:2561-2568

Breslauer KJ, Frank R, Blocker H, Marky LA (1986) Predicting DNA duplex stability from the base sequence. Proc. Natl. Acad. Sci. USA 83:3746-3750

Briat JF, Gigot C, Laulhere JP, Mache R (1982) Visualization of a spinach plastid transcriptionally active DNA-protein complex in a highly condensed structure. Plant Physiol. 69:1205-1211

Briat JF, Letoffe S, Mache R, Rouviere-Yaniv J (1984) Similarity between the bacterial histone-like protein HU and a protein from spinach chloroplasts. FEBS Lett. 172:75-79

Brooks DR, LeBlond PH, Cumming DD (1984) Information and entropy in a simple evolution model. J. Theor. Biol. 109:77-93

Brooks DR, Cumming DD, LeBlond PH (1988) In: Entropy, information and evolution. Eds. Weber BH, Depew DJ, Smith JD, pp 189-226. MIT Press, Cambridge Mass.

Calladine CR (1982) Mechanics of sequence-dependent stacking of bases in B-DNA. J. Mol. Biol. 161:343-352

Cedergren R, Grosjean H (1987) On the primacy of primordial RNA. Biosystems 20:175-180

Clegg MT, Ritland K, Zurawski G (1986) Processes of chloroplast evolution. In S Karlin, E Nevo (Eds) Evolutionary processes and theory. Acad. Press NY p275-293

Coleman AW (1985) Diversity of plastid DNA conformation among classes of eukaryotic algae. J. Phycol. 21:1-16

Dawkins R (1976) The selfish gene. Oxford University Press.

Dawkins R (1982) The extended phenotype. Oxford University Press.

Day GR, Blake RD (1984) Computer analysis and manipulation of DNA sequences. Computers and Chem. 8:67-73

DeLisi C (1988) Computers in molecular biology: current applications and emerging trends. Science 240:47-52

Deno H, Shinozaki K, Sugiura M (1983) Nucleotide sequence of tobacco chloroplast gene for the alpha subunit of proton-translocating ATPase. Nucl. Acids Res. 11:2185-2191

Ebeling W, Jimenez-Montano MA (1980) On grammars, complexity and information measures of biological macromolecules. Math. Biosci. 52:53-71

Ferracin A, Paoluzi R, Benassi M (1976) The measure-unit of evolution. Biosystems 8:10-23

Fickett JW (1982) Recognition of protein coding regions in DNA sequences. Nucl. Acids Res. 10:5305-5318

Fox DJ, Guire KE (1976) Documentation for MIDAS. Statistical Research Laboratory, University of Michigan, Ann Arbor

Fox TD (1987) Natural variation in the genetic code. Ann. Rev. Genet. 21:67-91

Freifelder D (1982) Physical Biochemistry, 2nd ed. WH Freeman and Co. San Francisco.

Gantt JS, Key JL (1987) Molecular cloning of a pea H1 histone cDNA. Eur. J. Biochem. 166:119-125

Gatlin LL (1972a) Information theory and the living system. Columbia University Press. NY

Gatlin LL (1972b) The entropy maximum of protein. Math. Biosci. 13:213-227

Gatlin LL (1974) Conservation of Shannon's redundancy for proteins. J. Mol. Evol. 3:189-208

Gatlin LL (1975) The D-normal distribution. J. Mol. Evol. 6:147-148

Gatlin LL (1979) A new measure of bias in finite sequences with applications to ESP data. J. Amer. Soc. Psychical Res. 73:29-43

Gatlin LL (1984) Parapsychology and D-measures. J. Amer. Soc. Psychical Res. 78:331-340

Grantham R, Gautier C, Gouy M, Mercier R, Pave A (1980) Codon catalog usage and the genome hypothesis. Nucl. Acids Res. 8:r49-r62

Grantham R, Gautier C, Gouy M, Jacobzone M, Mercier R (1981) Codon catalog usage is a genome strategy modulated for gene expressivity. Nucl. Acids Res. 9:r43-r74

Green BR (1971) Isolation and base composition of DNA's of primitive land plants I. ferns and fern-allies. Biochem. Biophy. Acta 254:402-406

Gribskov M, Devereux J, Burgess RR (1984) The codon preference plot: graphic analysis of protein coding sequences and prediction of gene expression. Nucl. Acids Res. 12:539-549

Hanley-Bowdoin, Chua N-H (1987) Chloroplast promoters. Trends Biochem. Sci. 12:67-70

Hasegawa M, Yano T (1975) Entropy of the genetic information and evolution. Origin Life 6:219-227

Head T (1987) Formal language theory and DNA: analysis of the generative capacity of specific recombinant behaviours. Bull. Math. Biol. 49:737-759

Hinds PW, Blake RD (1984) Degrees of divergence in the E. coli genome from correlations between dinucleotide, trinucleotide, and codon frequencies. J. Biomol. Struct. Dynam. 2:101-118

Holzmuller W (1984) Information in biological systems: the role of macromolecules. Cambridge University Press

Ikemura T (1981a) Correlation between the abundance of Escherichia coli transfer RNAs and the occurrence of the respective codons in its protein genes. J. Mol. Biol. 146:1-21

Ikemura T (1981b) Correlation between the abundance of Escherichia coli transfer RNAs and the occurrence of the respective codons in its protein genes: a proposal for a synonymous codon choice that is optimal for the E. coli translational system. J. Mol. Biol. 151:389-409

Ikemura T (1982) Correlation between the abundance of yeast transfer RNAs and the occurrence of the respective codons in protein genes. J. Mol. Biol. 158:573-597

Jensen RV (1987) Classical chaos. Amer. Sci. 75:168-181

Johnson-Dow L, Mardis E, Heiner C, Roe BA (1987) Optimized methods for fluorescent and radio labeled DNA sequencing. BioTechniques 5:754-765

Jones M, Wagner R, Radman M (1987) Mismatch repair and recombination in E. coli. Cell 50:621-626

Josse J, Kaiser AD, Kornberg A (1961) Enzymatic synthesis of deoxyribonucleic acid. VIII. Frequencies of nearest neighbour base sequences in deoxyribonucleic acid. J. Biol. Chem. 236:864-875

Jukes TH, Holmquist R, Moise H (1975) Amino acid composition of proteins: selection against the genetic code. Science 189:50-51

Kimura M (1986) DNA and the neutral theory. Phil. Trans. R. Soc. Lond. B 312:343-354

Kullback S (1958) Information theory and statistics. Wiley, NY

Lagerkvist U (1981) Unorthodox codon reading and the evolution of the genetic code. Cell 23:305-306

Lewin R (1984) No genome barriers to promiscuous DNA. Science 224:970-971

Li W, Luo C, Wu C (1985) Evolution of DNA sequences. In RJ MacIntyre (ed) Molecular Evolutionary Genetics. Plenum Press. NY and London

Lonsdale DM (1987) The biochemistry of plants. Springer-Verlag. In Press.

Maxam A, Gilbert W (1977) A new method for sequencing DNA. Proc. Natl. Acad. Sci. USA 74:560-564

Mazin AL, Vanyushin BF (1987) Loss of CpG nucleotides from DNA. I. Methylated and nonmethylated compartments in eukaryotes with different content of 5-methylcytosine in DNA. Mol. Biol. 21:465-472

Mazin AL, Vanyushin BF (1987) Loss of CpG nucleotides from DNA. II. Methylated and nonmethylated genes of vertebrates. Mol. Biol. 21:473-481

Metzenberg RL, Stevens JN, Baisch TJ (1983) 5S RNA genes dispersed in Neurospora crassa: conservation and divergence of a gene family. Microbiology-1983 190-194

Moreau J, Scherrer K (1987) Co-evolution of base composition and codon usage in Xenopus laevis and human globin genes with long-range DNA organization of their genome. FEBS Lett. 221:3-10

Nussinov R (1984a) Doublet frequencies in evolutionary distinct groups. Nucl. Acids Res. 12:1749-1763

Nussinov R (1984b) Strong doublet preferences in nucleotide sequences and DNA geometry. J. Mol. Evol. 20:111-119

Nussinov R (1987) Theoretical molecular biology: prospectives and perspectives. J. Theor. Biol. 125:219-235

Nussinov R, Lennon GG (1984) Structural features are as important as sequence homologies in Drosophila heat shock gene upstream regions. J. Mol. Evol. 20:106-110

Oh-oka H, Takahashi Y, Wada K, Matsubara H, Ohyama K, Ozeki H (1987) The 8 kDa polypeptide in photosystem I is a probable candidate of an iron-sulfur protein coded by the chloroplast gene frxA. FEBS Lett. 218:52-54

Ohyama K, Fukuzawa H, Kohchi T, Shirai H, Sano T, Sano S, Umesono K, Shiki Y, Takeuchi M, Chang Z, Aota S, Inokuchi H, Ozeki H (1986) Chloroplast gene organization deduced from complete sequence of liverwort Marchantia polymorpha chloroplast DNA. Nature 322:572-574

Palmer JD (1985) Evolution of chloroplast and mitochondrial DNA in plants and algae. In MacIntyre RJ (ed), Molecular Evolutionary Genetics, Plenum Press, NY and London, pp 131-240

Palmer JD (1987) Chloroplast DNA evolution and biosystematic uses of chloroplast DNA variation. Amer. Natur. 130 (supplement):s6-s29

Pfitzinger H, Guillemaut P, Weil J-H, Pillay DTN (1987) Adjustment of the tRNA population to the codon usage in chloroplasts. Nucl. Acids Res. 15:1377-1386

Reichert TA, Yu JMC, Christensen RA (1976) Molecular evolution as a process of message refinement. J. Mol. Evol. 8:41-54

Reichert TA, Wong AKC (1971) Toward a molecular taxonomy. J. Mol. Evol. 1:97-111

Ritland K, Clegg MT (1987) Evolutionary analysis of plant DNA sequences. Amer. Natur. 130 (supplement):s74-s100

Rochaix J-D (1987) Molecular genetics of chloroplasts and mitochondria in the unicellular green alga Chlamydomonas. FEMS Microbiol. Rev. 46:13-34

Rose GD, Geselowitz AR, Lesser GJ, Lee RH, Zehfus MH (1985) Hydrophobicity of amino acid residues in globular proteins. Science 229:834-838

Rowe GW, Harvey IF (1985) Information content in finite sequences: communication between dragonfly larvae. J. Theor. Biol. 116:275-290

Rowe GW, Trainor LEH (1983a) A thermodynamic bias in viral genes. J. Theor. Biol. 101:171-203

Rowe GW, Trainor LEH (1983b) On the informational content of viral DNA. J. Theor. Biol. 101:151-170

Sanger F, Coulson AR (1975) A rapid method for determining DNA sequences in DNA by primed synthesis with DNA polymerase. J. Mol. Biol. 94:441-448

Schuster W, Brennicke A (1987) Plastid, nuclear and reverse transcriptase sequences in the mitochondrial genome of Oenothera: is genetic information transferred between organelles via RNA? EMBO J. 6:2857-2863

Shannon CE (1948) A mathematical theory of communication. Bell Syst. Tech. J. 27:379-423

Shannon CE (1949) Communication in the presence of noise. Proc. IRE 37:10-21

Shapiro HS (1970) In Handbook of Biochemistry, HA Sober (ed), CRC Cleveland pp H81-H99

Shepherd JCW (1981) Method to determine the reading frame of a protein from the purine/pyrimidine genome sequence and its possible evolutionary justification. Proc. Natl. Acad. Sci. USA 78:1596-1600

Shinozaki K, Ohme M, Tanaka M, Wakasugi T, Hayashida N, Matsubayashi T, Zaita N, Chunwongse J, Obokata J, Yamaguchi-Shinozaki K, Ohto C, Torazawa K, Meng BY, Sugita M, Deno H, Kamogashira T, Yamada K, Kusuda J, Takaiwa F, Kato A, Tohdoh N, Shimada H, Sugiura M (1986) The complete nucleotide sequence of the tobacco chloroplast genome: its gene organization and expression. EMBO J. 5:2043-2049

Sibbald PR, White MJ (1987) How probable are antibody cross-reactions? J. Theor. Biol. 127:163-169

Sibbald PR (1988) Pattern of base usage, nearest neighbour analysis and identification of genes in two completely sequenced chloroplast genomes. Curr. Genet. In press.

Sitaram BR, Varma VS (1984) Information and onepoint mutations in DNA. J. Theor. Biol. 110:523-532

Slepian D (1973) Key papers in the development of information theory. IEEE Press NY

Smith TF (1969) The genetic code, information density and evolution. Math. Biosci. 4:179-187

Smith TF, Waterman MS, Sadler JR (1983) Statistical distribution of nucleic acid sequence functional domains. Nucl. Acid Res. 11:2205-2220

Spiegel MR (1961) Schaum's outline of theory and problems of statistics. Schaum Publishing Co. NY

Srinivasan AR, Torres R, Clark W, Olson WK (1987) Base sequence effects in double helical DNA. I. Potential energy estimates of local base morphology. J. Biomol. Struct. and Dynam. 5:452-496

Staden R (1984) Measurements of the effects that coding for a protein has on a DNA sequence and their use for finding genes. Nucl. Acids Res. 12:551-567

Statistical Research Laboratory, University of Michigan (1976) Elementary Statistics Using Midas

Steinmetz AA, Castroviejo M, Sayre RT, Bogorad L (1986) Protein PSII-G an additional component of photosystem II identified through its plastid gene in maize. J. Biol. Chem. 261:2485-2488

Strauss SH, Palmer JD, Howe GT, Doerksen AH (1988) Chloroplast genomes of two conifers lack a large inverted repeat and are extensively rearranged. Proc. Natl. Acad. Sci. USA 85:3898-3902

Taylor FJR, Coates D (1988) The code within the codons. In Press.

Thomsen DE (1987) Fractals: magical fun or revolutionary science. Science News 131:184

Trifonov EN (1987) Translational framing code and frame-monitoring mechanism as suggested by the analysis of mRNA and 16S rRNA nucleotide sequences. J. Mol. Biol. 194:643-652

Tramontano A, Scarlato V, Barni N, Cipollaro M, Franze A, Macchiato MF, Cascino A (1984) Statistical evaluation of the coding capacity of complementary DNA strands. Nucl. Acids Res. 12:5049-5059

Trifonov EN, Sussman JL (1980) The pitch of chromatin DNA is reflected in its nucleotide sequence. Proc. Natl. Acad. Sci. U.S.A. 77:3816-3820

Umesono K, Ozeki H (1987) Chloroplast gene organization in plants. Trends Genet. 3:281-287

Wakasugi T, Ohme M, Shinozaki K, Sugiura M (1986) Structures of tobacco chloroplast genes for tRNA-Ile (CAU), tRNA-Leu (CAA), tRNA-Cys (GCA), tRNA-Ser (UGA) and tRNA-Thr (GGU): a compilation of tRNA genes from tobacco chloroplasts. Plant Mol. Biol. 7:385-392

Wells RD (1988) Unusual DNA structures. J. Biol. Chem. 263:1095-1098

Wittman-Liebold B (1985) Ribosomal proteins: their structure and evolution. In Structure, function and genetics of ribosomes. Ed. Hardesty B, Kramer G, pp 326-361. Springer-Verlag

Worton RG, Sutherland J, Sylvester JE, Willard HF, Bodrug S, Dube I, Duff C, Kean V, Ray PN, Schmickel RD (1988) Human ribosomal RNA genes: orientation of the tandem array and conservation of the 5' end. Science 239:64-68

Yano T, Hasegawa M (1974) Entropy increase of amino acid sequence in protein. J. Mol. Evol. 4:179-187

Yockey HP (1974) An application of information theory to the central dogma and the sequence hypothesis. J. Theor. Biol. 46:369-406

Zurawski G, Clegg MT (1987) Evolution of higher-plant chloroplast DNA-encoded genes: implications for structure function and phylogenetic studies. Ann. Rev. Plant Physiol. 38:391-418

APPENDIX 1. A SELECTION OF COMPUTER PROGRAMS.

Below are some of the programs used to collect data and perform calculations that appear in this dissertation. All are written in Standard Pascal and were compiled on the UBC mainframe computer using Pascaljb. They are presented to assist the reader in determining exactly what was done. In this sense they are an aid to reproducibility rather than examples of sophisticated programming.

A. PROGRAM 1

program dnamaker(output);

{**************************************************************}
{* This program generates a pseudorandom DNA sequence using   *}
{* a modulo type generator.  It asks the user for the seed    *}
{* to be used for the generator and also asks for the length  *}
{* of the sequence wanted. It should probably not be used     *}
{* to generate sequences longer than about 30,000 bases due   *}
{* to the nature of the generator which will repeat after     *}
{* 65,536 iterations. The particular regions in which a       *}
{* number produces a certain base can be edited by the user.  *}
{**************************************************************}

var
  seed, i, column, max: integer;
    {seed is for
    generator}
    {i is a counter for a loop}
    {column gets output formatted}
    {number is random}
  number: real;
  base: char;
  inp, out: text;

procedure generator;
  {generates a pseudorandom number between 0 and 1}
begin
  seed := (13849 + 25173 * seed) MOD 65536;
  number := seed / 65536;
end {procedure generator};

begin
  {first query the user for a seed and sequence length}
  reset(inp, 'file=*msource*, interactive');
  rewrite(out, 'file=*msink*');
  writeln(out, 'Give me an integer seed between 1 and 65535.');
  get(inp);
  readln(inp, seed);
  writeln(out, 'How many bases would you like?');
  get(inp);
  readln(inp, max);
  column := 1;
    {this is just a marker to get 60 char lines of output}
  for i := 1 to max do begin
    {generate sequence}
    generator;
    if number < 0.36640066 then
      base := 'A';
    if ((number >= 0.36640066) AND (number < 0.5090809)) then
      base := 'C';
    if ((number >= 0.5090809) AND (number < 0.6534122)) then
      base := 'G';
    if number >= 0.6534122 then
      base := 'T';
    write(base);
    if column MOD 60 = 0 then
      writeln;
    column := column + 1;
  end;
end {dnamaker}.

B. PROGRAM 2

program doublet(output);

{**************************************************************}
{* This program is used to generate pseudo random             *}
{* sequences with doublet properties determined by            *}
{* the user. In its current configuration it                  *}
{* generates 7268 bases with the same doublet                 *}
{* properties as kalilo DNA. The variables serve the          *}
{* same functions as they did in program 1.                   *}
{**************************************************************}
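{* The fractions tested in the main loop below appear to act  *}
{* as cumulative thresholds on the probability of the next    *}
{* base given the current one. As a worked reading, assuming  *}
{* they were derived from doublet counts: for a current base  *}
{* of A in the first 1354 positions, the thresholds 188/471,  *}
{* 254/471 and 349/471 correspond to AA = 188,                *}
{* AC = 254 - 188 = 66, AG = 349 - 254 = 95 and               *}
{* AT = 471 - 349 = 122 of the 471 doublets starting with A.  *}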
var
  seed, i, column: integer;
  random: real;
  base, next: char;

procedure generator;
begin
  seed := (13849 + 25173 * seed) MOD 65536;
  random := seed / 65536;
end {generator};

begin
  base := 'C';
  write(base);
  seed := 33333;
  column := 1;
  for i := 1 to 7268 do begin
    generator;
    if i < 1355 then begin
      case base of
        'A': begin
          if random < (188 / 471) then
            next := 'A';
          if ((random >= (188 / 471)) AND (random < (254 / 471))) then
            next := 'C';
          if ((random >= (254 / 471)) AND (random < (349 / 471))) then
            next := 'G';
          if random >= (349 / 471) then
            next := 'T';
        end;
        'C': begin
          if random < (63 / 219) then
            next := 'A';
          if ((random >= (63 / 219)) AND (random < (125 / 219))) then
            next := 'C';
          if ((random >= (125 / 219)) AND (random < (159 / 219))) then
            next := 'G';
          if random >= (159 / 219) then
            next := 'T';
        end;
        'G': begin
          if random < (95 / 287) then
            next := 'A';
          if ((random >= (95 / 287)) AND (random < (132 / 287))) then
            next := 'C';
          if ((random >= (132 / 287)) AND (random < (213 / 287))) then
            next := 'G';
          if random >= (213 / 287) then
            next := 'T';
        end;
        'T': begin
          if random < (126 / 376) then
            next := 'A';
          if ((random >= (126 / 376)) AND (random < (179 / 376))) then
            next := 'C';
          if ((random >= (179 / 376)) AND (random < (256 / 376))) then
            next := 'G';
          if random >= (256 / 376) then
            next := 'T';
        end;
      end; {cases}
    end {if i<1355}
    else begin
      case base of
        'A': begin
          if random < (826 / 2165) then
            next := 'A';
          if ((random >= (826 / 2165)) AND (random < (1067 / 2165))) then
            next := 'C';
          if ((random >= (1067 / 2165)) AND (random < (1385 / 2165))) then
            next := 'G';
          if random >= (1385 / 2165) then
            next := 'T';
        end;
        'C': begin
          if random < (298 / 811) then
            next := 'A';
          if ((random >= (298 / 811)) AND (random < (468 / 811))) then
            next := 'C';
          if ((random >= (468 / 811)) AND (random < (499 / 811))) then
            next := 'G';
          if random >= (499 / 811) then
            next := 'T';
        end;
        'G': begin
          if random < (328 / 753) then
            next := 'A';
          if ((random >= (328 / 753)) AND (random < (407 / 753))) then
            next := 'C';
          if ((random >= (407 / 753)) AND (random < (545 / 753))) then
            next := 'G';
          if random >= (545 / 753) then
            next := 'T';
        end;
        'T': begin
          if random < (713 / 2127) then
            next := 'A';
          if ((random >= (713 / 2127)) AND (random < (1033 / 2127))) then
            next := 'C';
          if ((random >= (1033 / 2127)) AND (random < (1299 / 2127))) then
            next := 'G';
          if random >= (1299 / 2127) then
            next := 'T';
        end;
      end; {cases}
    end; {if i>1354}
    base := next;
    write(base);
    if column MOD 60 = 0 then
      writeln;
    column := column + 1;
  end;
end {doublet}.

C. PROGRAM 3

program brook(input, output);

{**************************************************************}
{* This program calculates D1 to D6 as per Brooks et al       *}
{* Reference is JTB 109:77-93                                 *}
{**************************************************************}

const
  blank = ' ';
  max = 6000;

var
  i, j, k, l, m, n, o: integer;
    {array counters}
  d1, d2, d3, d4, d5, d6: real;
    {deviations from max info due to clumps 1-6}
  length: integer;
    {length of sequence}
  sigma: real;
    {the second term of equations 5-8 in reference}
  firstterm: real;
    {likewise the first}
  base: char;
  sequence: packed array [1 .. max] of char;
  single: packed array [1 .. 4] of real;
  double: packed array [1 .. 16] of real;
  triple: packed array [1 .. 64] of real;
  quad: packed array [1 .. 256] of real;
  pent: packed array [1 .. 1024] of real;
  hex: packed array [1 .. 4096]
of r e a l ;  second, t h i r d ,  fourth, f i f t h ,  s i x t h : integer;  {read the bases into the array} length  :=  max;  base := blank; for i := 1 to max  do begin  while (base = ' ') do read(base); sequence[i] := base; base := blank; end; { reading of bases } for i := 1 to 4 do single[i]  := 0;  {set arrays to zero for} {singlets to sixes  }  for i := 1 to 16 do double[i] := 0; for i := 1 to 64 do triple[i]  := 0;  for i := 1 to 256  do  quad[i] := 0; for i := 1 to 1024 pent[i]  do  := 0;  for i := 1 to 4096 do hex[i] := 0;  { now  we go through the sequence and }  { length singlets, doublets ... sixes} for  i := 1 to (length - 5) do begin case sequence[i] of •A' : first  := 1;  •C : first  := 2;  'G' : first  := 3;  first  := 4;  end; { case } case sequence[i + 1] of •A' : second := 1; •C : second := 2; •G': • second := 3; •T' : second := 4; end; { case }  / case s e q u e n c e [ i + 2] of 'A' : third  :=  1;  :=  2;  :=  3;  :=  4;  'C: third •G': third 'T '  5  third end; { case }  c a s e s e q u e n c e [ i + 3] o f •A': fourth  :=  1;  :=  2;  :=  3;  :=  4;  'C*: fourth 'G' : fourth 'T' : fourth end; { case } c a s e s e q u e n c e [ i + 4] of •A': fifth  :=  1;  119  7 'C : fifth  := 2;  •G': fifth  := 3;  'T' : fifth  := 4;  end; { case } case sequence[i + 5] of 'A' : sixth  := 1;  •C: sixth  := 2;  'G' : sixth  := 3;  'T': sixth  := 4;  end; { case } single[first]  := s i n g l e [ f i r s t ] + 1;  double[4 * ( f i r s t  - 1) + second] : =  double[4 * ( f i r s t triple[16 third]  * (first  - 1) + second]  +1;  - 1) + 4 * (second - 1) +  := t r i p l e [ 1 6  ) + third] + 1;  * (first  - 1) + 4 * (second - 1  120  quad[64 * ( f i r s t - 1) + 16 * (second - 1) + 4 * ( t h i r d - 1) + fourth]  := quad[64 * ( f i r s t - 1)  16 * (second - 1) + 4 * ( t h i r d - 1) + fourth] + i pent[256 * ( f i r s t - 1) + 64 * (second - 1) + 16 * ( t h i r d - 1) + 4 * (fourth - 1) + f i f t h ] : = pent[256 * ( f i r s t - 1) + 64 * (second - 1) + 16 * 
( t h i r d - 1) + 4 * (fourth - 1) + f i f t h ] + i; hex[l024 * ( f i r s t - 1) + 256 * (second - 1) + 64 * ( t h i r d - 1) + 16 * (fourth - 1) + 4 * ( f i f t h - 1) + s i x t h ] := hex[l024 * ( f i r s t - 1) + 256 * (second - 1) + 64 * ( t h i r d - 1) + 16 * ( f o u r t h - 1) + 4 * ( f i f t h - 1) + s i x t h ] + 1; end; go through the sequence and counting } now d i v i d e each count by the sequence length to get} a f r a c t i o n a l abundance} for i := 1 to 4 do single[i]  := s i n g l e [ i ] / (length - 5);  for i := 1 to 16 do double[i]  := double[i] / (length - 5);  for i := 1 to 64 do triple[i]  := t r i p l e [ i ] / (length - 5);  for i := 1 to 256 do quad[i] := quad[i] / (length - 5);  for i := 1 to 1024 do pent[i]  := pent[i] / (length - 5);  for i := 1 to 4096 do hexti]  := hex[i] / (length - 5);  { now we calculate d l to d6 } {dl} sigma := 0; for i := 1 to 4 do begin if  single[i] > 0 then sigma := sigma + s i n g l e [ i ] * l n ( s i n g l e [ i ] ) / l n ( 2 ) ;  end; dl  := 2 + sigma;  writeln('Dl = ', d l : 3: 5); (d2> sigma := 0; firstterm  := 0;  for i •: = 1 to 16 do begin i f double [i] > 0 then sigma := sigma + double[i]  * ln(double[i]) / l n ( 2 ) ;  end; for  j := 1 to 4 do begin  for k := 1 to 4 do begin if  ( s i n g l e [ j ] * single[k]) > 0  then firstterm  := f i r s t t e r m + s i n g l e [ j ] * single[k] *  l n ( s i n g l e [ j ] * single[k]) / l n ( 2 ) ; end; end; d2  := sigma - f i r s t t e r m ;  writeln('D2 =  d2: 3: 5);  {d3} sigma := 0; firstterm  := 0;  for i := 1 to 64 do begin if  triple[i] > 0 then sigma := sigma + t r i p l e [ i ]  * ln(triple[i])  / ln(2);  end; for  j := 1 to 4 do begin  for k := 1 to 4 do begin for 1 := 1 to 4 do begin if  ( s i n g l e t j ] * single[k] * s i n g l e [ l ] ) > 0 then firstterm  := f i r s t t e r m + s i n g l e t j ] * single[k]  ( s i n g l e t l ] ) * l n ( s i n g l e [ j ] * single[k] * s i n g l e [ l ] ) / ln(2); end; end;  / 124 end; d3  := sigma - f i 
r s t t e r m ;  writeln(*D3 =  d3: 3: 5);  {d4} sigma := 0; firstterm  := 0;  for i := 1 to 256 do begin i f quad[i] > 0 then sigma := sigma + quad[i] * ln(quad[i]) / ln(2); end; for j := 1 to 4 do begin for k := 1 to 4 do begin for 1 := 1 to 4 do begin for m := 1 to 4 do begin if  ( s i n g l e [ j ] * singlefk]  * s i n g l e [ l ] * single[m]) > 0  then firstterm  := f i r s t t e r m + s i n g l e [ j ] * singlefk]  s i n g l e [ l ] * single[m] * l n ( s i n g l e [ j ] * singlefk] end; end; end; end; d4  := sigma - f i r s t t e r m ;  * s i n g l e f l ] * single[m]) / ln(2);  *  /  writeln('D4 =  d4: 3: 5);  {d5} sigma := 0; firstterm  := 0;  for i := 1 to 1024 do begin i f pent[i] > 0 then sigma := sigma + pent[i] * ln(pent[i]) / ln(2); end; for  j := 1 to 4 do begin  for k := 1 to 4 do begin for 1 := 1 to 4 do begin for m := 1 to 4 do begin for n := 1 to 4 do begin if  ( s i n g l e [ j ] * single[k] * s i n g l e [ l ] * single[m] *  single[n]) > 0 then firstterm  := f i r s t t e r m + s i n g l e t j ] * single[k] *  s i n g l e t l ] * single[m] * single[n] * l n ( s i n g l e t j ] * single[k] * s i n g l e [ l ] * singletm] * singletn]) / ln(2); end; end; end; end;  125  / 126 end; d5  := sigma - f i r s t t e r m ;  writeln('D5 =  d5: 3: 5);  {d6} sigma := 0; firstterm  := 0;  for i := 1 to 4096 do begin i f hex[i] > 0 then sigma := sigma + hex[i] * ln(hex[i])  / ln(2);  end; for  j := 1 to 4 do begin  for k := 1 to 4 do begin for 1 := 1 to 4 do begin for m := 1 to 4 do begin for n := 1 to 4 do begin for o := 1 to 4 do begin if  ( s i n g l e [ j ] * single[k] * s i n g l e [ l ] * single[m] *  single[n] * single[o])  > 0  then firstterm  := f i r s t t e r m + s i n g l e t j ] * singlefk]  s i n g l e f l ] * singlefm] * * singlefo]  singlefn]  * l n ( s i n g l e [ j ] * singlefk]  s i n g l e f l ] * singlefm] * singlefn] singlefo])  / ln(2);  *  *_  *  end; end; end; end; end; end; d6  := sigma - f i r s t t e r m ;  w r i t e l n ( ' D 6 
= ', d6: 3: 5 ) ; end {brook}.  D. PROGRAM  4  program n e w b r o o k ( i n p u t , o u t p u t ) ; .£*************************************** {* T h i s c a l c u l a t e s D l t D6 u s i n g t h e maximum o v e r l a p o f  *}  {* windows o f s i z e n-1. The D-values i t g e n e r a t e s a r e  *}  {* independent o f each o t h e r .  I t h a n d l e s a sequence o f 6000*}  {* bases i n i t s c u r r e n t c o n f i g u r a t i o n and c a n be e d i t t e d {* t o h a n d l e sequences o f o t h e r  sizes.  |*****************************************  const blank  = ' ';  max = 6000;  *}  var i,  j , k, 1, m, n, o: integer;  {array counters} dl,  d2, d3, d4, d5, d6: r e a l ;  {deviations from max info due t o clumps 1-6} length: integer; {length of sequence} sigma: r e a l ; {the second term of equations 5-8 i n reference} firstterm:  real;  { likewise the f i r s t } base: char; sequence: packed array  [1 .. max]  of char;  single: packed array  [1 .. 4] of r e a l ;  double: packed array  [ l .. 16] of r e a l ;  t r i p l e : packed array  [1 .. 64] of r e a l ;  quad: packed array  [ l .. 256] of r e a l ;  pent: packed array  [1 .. 1024] of r e a l ;  hex: packed array first,  [1 .. 4096] of r e a l ;  second, t h i r d , fourth, f i f t h ,  sixth:  begin £************ main ***************} {read the bases into the array "sequence"} length  := max;  base := blank;  integer;  / 129 for i := 1 to max do begin while (base = ' ') do read(base); sequence [ i ]  := base;  base := blank; end; { reading of bases } for i := 1 to 4 do single[i]  := Of-  fset arrays to zero for} {single-mers to 6-mers } for i := 1 to 16 do double[i] := 0; for i : = ' 1 to 64 do triple[i]  := 0;  for i := 1 to 256 do quad[i] := 0; for i := 1 to 1024 do pent[i] := 0; for i :=' 1 to 4096 do hex[i] := 0; { now we go through the sequence and length } .{ singlets,  doublets  ... 
sixes}  for i := 1 to (length - 5) do begin case sequence[i] of  / 130 'A' : first  := 1;  'C : first  := 2;  •G' : first  := 3;  •T': first  := 4;  end; { case } case sequence[i + 1] of 'A' : second := 1; •C : second := 2; •G' : second := 3; •T' : second := 4; end; { case } case sequence[i + 2] of •A': third •C :  := 1;  / 131 third  := 2;  •G': third .  T  i .  := 3;  .  third  := 4;  end; { case } case sequence[i + 3] of 'A' : fourth  := 1;  •C : f  fourth := 2; 'G': fourth  := 3;  *T': fourth  := 4;  end; { case } case sequence[i + 4] of •A': fifth  := 1;  'C: fifth  := 2;  'G' : fifth  := 3;  / 132 'T : 1  fifth  := 4;  end; { case } case sequence[i + 5] of •A': sixth  := 1;  •C : sixth  := 2;  *G': sixth  := 3;  •T': sixth  := 4;  end; { case } single[first]  := s i n g l e [ f i r s t ] + 1;  double[4 * ( f i r s t (first  - 1) + second] := double[4 *  - 1) + second]  triple[l6 * (first third]  +1;  - 1) + 4 * (second - 1) +  := t r i p l e [ l 6 * ( f i r s t + third]  - 1) + 4 * (second - 1)  +1;  quad[64 * ( f i r s t  - 1) + 16 * (second - 1) +  4 * (third - 1) + fourth] := quad[64 * ( f i r s t  - 1) +  16 * (second - 1) + 4 * (third - 1) + fourth] + 1; pent[256 * ( f i r s t  - 1) + 64 * (second - 1) +  16 * (third - 1) + 4 * (fourth - 1) + f i f t h ] : = pent[256  * (first  - 1) + 64 * (second - 1) +  16 * (third - 1) + 4 * (fourth - 1) + f i f t h ] + i; hex[l024 * ( f i r s t  - 1) + 256 * (second - 1) +  64 * (third - 1) + 16 * (fourth - 1) + 4 * ( f i f t h - 1) + sixth] := hex[l024 * ( f i r s t  - 1) +  256 * (second - 1) + 64 * ( t h i r d - 1) + 16 * ( fourth - 1) + 4 * ( f i f t h - 1) + sixth] + 1; end; { go through the sequence and counting } { now divide each count by the sequence { a fractional  length to get}  abundance}  for i := 1 to 4 do single[i]  := s i n g l e [ i ] / (length - 5);  for i := 1 to 16 do double[i]  := double[i] / (length - 5);  for i := 1 to 64 do triple[i] for  := t r i p l e [ i ] / (length - 5);  i := 1 to 
256 do  quad[i] := quad[i] / (length - 5); for i := 1 to 1024 do pent[i] := pent[i] / (length - 5); for  i := 1 to 4096 do  hex[i] := hex[i] / (length - 5);  { now we calculate  d l to d6 }  { d l and d2 are calculated  as i n reference }  {dl} sigma := 0; for i := 1 to 4 do begin if  single[i] > 0 then sigma := sigma + s i n g l e [ i ] * l n ( s i n g l e [ i ] ) / ln(2);  end; dl  : = . 2 +. sigma;  write(dl:  3: 5,  ');  ld2} sigma := 0; firstterm  := 0;  for i := 1 to 16 do begin i f double[i] > 0 then sigma := sigma + double[i] * ln(double[i])  / ln(2);  end; for  j := 1 to 4 do begin  for k := 1 to 4 do begin if  ( s i n g l e t j ] * single[k])  > 0  then firstterm  := f i r s t t e r m + s i n g l e [ j ] * single[k]  * ln<singletj]  * single[k]>  / ln(2);  end; end; d2 := sigma -  firstterm;  w r i t e ( d 2 : 3: 5,  *);  {end d2} £***************************  {d3} sigma  := 0;  firstterm for  {* The modulo f u n c t i o n f i n d s  := 0;  {* window whose f i r s t  i := 1 t o 64 do  if  begin  bases  {* match t h e l a s t bases o f a '  tripleti] > 0  {* p r e v i o u s window.  
{* * * * * * * * * * * * * * * * * * * * * * * * * * *  then sigma  := sigma + t r i p l e [ i ] * l n ( t r i p l e [ i ] ) /  ln(2);  end; for  i := 1 t o 16 do  begin  for  j := (4 * ( ( i - 1) MOD  MOD  4) + 4) do  if  4) + 1) t o ( 4 * ( ( i - 1)  begin  ( d o u b l e [ i ] * d o u b l e t j ] / s i n g l e [ ( i - 1) MOD  4  + 1]) > 0 then firstterm  := f i r s t t e r m  / s i n g l e [ ( i - 1) MOD  + double[i] * double[j] 4 + 1] * l n ( d o u b l e t i ]  * d o u b l e t j ] / s i n g l e [ ( i - 1) MOD end; end; d3  := sigma -  firstterm;  4 + 1]) /  ln(2);  / write(d3:  3: 5,  ');  {d4} sigma  :=  0;  firstterm for  :=  0;  i := 1 t o 256 do  if  begin  quad[i] > 0 then sigma  := sigma + q u a d [ i ] * l n ( q u a d [ i ] ) /  ln(2);  end; for  i := 1 t o 64 do  for  j :=  MOD  begin  (4 * ( ( i - 1) MOD  16) + 4) do  if  16) + 1) t o ( 4 * ( ( i - 1)  begin  ( t r i p l e [ i ] * t r i p l e t j ] * d o u b l e [ ( i - 1) MOD  16 +1}) > 0  then firstterm  := f i r s t t e r m  / d o u b l e [ ( i - 1) MOD  + triple[i] * 16 + 1] *  * t r i p l e t j ] / d o u b l e [ ( i - 1) MOD end; end; d4  := sigma -  firstterm;  w r i t e ( d 4 : 3: 5,  ');  {d5} sigma  :=  firstterm for  0; :=  0;  i := 1 t o 1024 do  begin  triple[j]  ln(triple[i] 16 + 1 ] ) /  ln(2);  136  /  if  pent[i] > 0 then sigma  := sigma + p e n t [ i ] * l n ( p e n t [ i ] ) /  ln(2);  end; for  i := 1 t o 256 do  for  begin  j := (4 * ( ( i - 1) MOD  MOD  64) + 1) t o ( 4 * ( ( i -  1)  64) + 4) do b e g i n  if  (quad[i] * quad[j] * t r i p l e [ ( i  - 1) MOD  64 + 1]) > 0  then firstterm  := f i r s t t e r m  triplet(i  - 1) MOD  quad[j] / t r i p l e t ( i  + quad[i] * quad[j] /  64 + 1] * l n ( q u a d [ i ] * - 1) MOD  64 + 1]) /  ln(2);  end; end; d5  := sigma -  firstterm;  w r i t e ( d 5 : 3: 5,  ', ' ) ;  {d6} sigma  :=  firstterm for if  0; :=  0;  i := 1 t o 4096 do b e g i n hex[i] > 0 then sigma  := sigma + h e x [ i ] * l n ( h e x [ i ] ) /  end; for  i := 1 t o 
1024 do b e g i n  ln(2);  137  7 for  j := (4 * ( ( i - 1) MOD MOD if  256) + 1) t o ( 4 * . ( ( i - 1)  256) + 4) do b e g i n (pentti] * pent[j] * quad[(i  - 1) MOD  256 + 1 ] ) > 0  then firstterm quad[(i [j]  := f i r s t t e r m + p e n t [ i ] * p e n t [ j ] /  - 1) MOD  256 + 1] * l n ( p e n t [ i ] * pent  / quad[(i  - 1) MOD  256 + l ] ) / l n ( 2 ) ;  end; end; d6  := sigma - f i r s t t e r m ;  write(d6:  3: 5, ',  ');  writeln; end  {brook}.  E. PROGRAM 5  program m a n h a t ( i n p u t ,  output);  .£****************************************************j {* T h i s program w r i t e s a n o t h e r program. The program *} {* i t w r i t e s i s a t e l l a g r a f program which w i l l {* draw a Manhattan b l o c k  p l o t (Gates p l o t ) o f t h e  {* sequence. I t a u t o m a t i c a l l y {* p l o t so t h a t i t w i l l  then *}  scales the t e l l a g r a f  *} *}  look w e l l proportioned.  i********* ******************************************* "i  *}  138  / 139 const blank = ' *;  var x, y, count: integer; xmax, xmin, ymax, ymin: {plotting  integer;  limits}  base: char;  begin reset(input); count  := 0;  base := blank; x := 0; y  := 0;  xmax := 0; ymax := 0; ymin := 0; xmin : = 0; {now make the program write the f i r s t part of the t e l l a g r a f program} writeln('generate a . p l o t . ' ) ; writeln('title i s • • * * * * * * * * * * * • • . 
' ) ; writeln('x  axis  label i s '' — c g —  writeln('y axis l a b e l i s  1 1  —a t —  ''.'); ''.');  / 140 writeln('curve  1, symbol count=0.');  writeln('cross  off.');  writeln('input  data.');  writelnC'data'"); {now  g e n e r a t e t h e p a i r s t o be p l o t t e d  and f i n d  t h e max and min}  I**********  main **********}  w h i l e NOT e o f ( i n p u t ) do b e g i n base  := b l a n k ;  w h i l e ( ( b a s e = ' ') AND (NOT e o f ) ) do b e g i n read(base); end; count if  := count + 1;  base IN ['A', ' C , 'G', 'T' ] then base := base else base  :=  'z';  c a s e base o f 'A : 1  y  := y - 1;  •C: x  := x - 1;  •G': x  := x + 1;  / 141 •T': y := y + 1; •Z': writeln('***  strange base * * * ' ) •  end; {cases} {now adjust the max and min i f these x and y f a l l outside previous limits} i f x > xmax then xmax := x; i f y > ymax then ymax := y; i f x < xmin then xmin := x; i f y < ymin then ymin := y; writelnC count  ', x: 1, ', ', y: .1, ' ') ;  := 0;  end; i f xmax > ymax then  ymax := xmax else xmax : = ymax; if  xmin > ymin then xmin  := ymin  else ymin {the  := xmin;  above two i f s , e n s u r e t h a t  the steps i n the p l o t w i l l  be  a p p r o x i m a t e l y t h e same l e n g t h on t h e p l o t because t h e l e n g t h s of  t h e two a x i s w i l l  be s i m i l a r }  xmax := (xmax DIV 10) * 10 + 10; ymax := (ymax DIV 10) * 10 + 10; xmin  := (xmin DIV 10) * 10 - 10;  ymin  := (ymin DIV 10) * 10 - 10;  {now w r i t e t h e p a r t o f t h e t e l l a g r a f program t h a t f o l l o w s t h e data} writeln('eod. ); 1  w r i t e l n ( ' x a x i s maximum  ', xmax, ',  minimum ', xmin,. ', s t e p s i z e w r i t e l n ( ' y a x i s maximum minimum writeln('no  ', ymin,  1  end  {manhat}.  ', ymax, ',  ', s t e p s i z e 10.');  legend. );  w r i t e l n ( frame.');  10.');  1  
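
The stepping rule in Program 5 (A moves down, C moves left, G moves right, T moves up) can be traced by hand on a short sequence. The sketch below is not part of the thesis. The program name walkdemo, the variable seq and the six-base test sequence are hypothetical, and it assumes a Free Pascal style dialect in which the string type and the built-in length function are available.

```pascal
program walkdemo(output);
{ Hypothetical illustration, not from the thesis: traces the     }
{ Manhattan walk used by manhat on a short made-up sequence.     }
var
  seq: string;
  i, x, y: integer;
begin
  seq := 'GGATTC';   { assumed example sequence }
  x := 0;
  y := 0;
  for i := 1 to length(seq) do begin
    { same base-to-step mapping as in manhat }
    case seq[i] of
      'A': y := y - 1;
      'C': x := x - 1;
      'G': x := x + 1;
      'T': y := y + 1;
    end;
    { print the base and the position reached after the step }
    writeln(seq[i], ' ', x, ' ', y);
  end;
end.
```

For this sequence the walk moves right twice, down once, up twice and left once, so it ends at x = 1, y = 1. A sequence rich in G over C drifts to the right in such a plot, and one rich in T over A drifts upward.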
