UBC Theses and Dissertations
Constraints on the organization and information properties of DNA sequences
Sibbald, Peter Ramsay
In an investigation which concentrated primarily on the two completely sequenced chloroplast genomes, one from a tobacco and one from a liverwort, an attempt has been made to discover some of the factors which produce order in DNA sequences. This was done by (1) looking in detail at doublet organization throughout the genomes, (2) examining the ability of different methods to predict the existence of genes based only on sequence organization, and (3) employing information theory to explore various levels of ordering in these sequences. The doublet analysis was performed on seven categories of DNA: tDNA, rDNA, ribosomal proteins, open reading frames not known to be genes (URF), other protein genes, noncoding regions and introns. The rDNA has the most unusual doublet properties of all categories, although all categories have, to a considerable extent, similar doublet properties. I suggest that these particular doublet properties facilitate accurate replication of the genome. In addition, it appears that doublets which have certain thermodynamic properties are more abundant than others, suggesting that there is a selection pressure at the level of doublets for certain thermodynamic properties. Nussinov's hypothesis, that complementary doublets have similar relative abundances due to inverted duplication events, has been tested and does not appear to explain the phenomenon. Fickett's method to predict whether URFs are genes was more successful than Sheperd's method. Fickett's method was modified for use on the chloroplast genomes and its rate of successful prediction increased substantially. This modified method will be useful for other chloroplast genomes as they are sequenced and also supports Fickett's contention that the method could be improved for use on specific groups.
The ability to predict genes based only on sequence data shows that the requirement to code for protein imposes a detectable amount of order on the gene sequence and that this order is distinguishable from the order in noncoding regions. Nearly all URFs greater than 200 base pairs in both plants are predicted to be genes. Informational analysis showed that most order is at the level of single and double bases, with a significant but lesser amount of order at the triplet and 4-plet level. This was true for both coding and noncoding regions in both plants. This is in contrast to earlier work (Rowe and Trainor), which found that in viruses there was a significant difference between 4-plet ordering in coding and noncoding regions. It is suggested that DNA may be optimized for replication rather than protein production. Several new problems and experiments have been suggested.
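The two kinds of measurement mentioned above, doublet relative abundance and the amount of order at each n-plet level, can be illustrated with a minimal Python sketch. It is not the procedure used in the thesis, whose exact formulas are not given here. It assumes one common formulation: relative abundance as the observed doublet frequency divided by the product of the two single-base frequencies, and order at level k as the drop in conditional entropy when one more preceding base is taken into account. All function names are hypothetical.

```python
from collections import Counter
from math import log2

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def kmer_counts(seq, k):
    """Count overlapping k-plets in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def doublet_relative_abundance(seq):
    """Observed/expected ratio f(XY) / (f(X) * f(Y)) for each doublet present.

    Assumed definition for illustration; doublets that never occur are omitted.
    """
    n_single = len(seq)
    n_double = len(seq) - 1
    singles = kmer_counts(seq, 1)
    doublets = kmer_counts(seq, 2)
    return {
        d: (c / n_double)
        / ((singles[d[0]] / n_single) * (singles[d[1]] / n_single))
        for d, c in doublets.items()
    }


def reverse_complement(kmer):
    """Return the reverse complement, e.g. 'AC' -> 'GT'."""
    return "".join(COMPLEMENT[b] for b in reversed(kmer))


def block_entropy(seq, k):
    """Shannon entropy (bits) of the overlapping k-plet distribution."""
    counts = kmer_counts(seq, k)
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values())


def order_by_level(seq, max_k=4, alphabet_size=4):
    """Return [D1, ..., D_max_k], the order attributable to each n-plet level.

    h[0] is the maximum entropy per base; h[k] = H_k - H_(k-1) estimates the
    entropy of a base given its k-1 predecessors; D_k = h[k-1] - h[k].
    Estimates are biased for short sequences.
    """
    H = [0.0] + [block_entropy(seq, k) for k in range(1, max_k + 1)]
    h = [log2(alphabet_size)] + [H[k] - H[k - 1] for k in range(1, max_k + 1)]
    return [h[k - 1] - h[k] for k in range(1, max_k + 1)]
```

Comparing `doublet_relative_abundance(seq)[d]` with the value for `reverse_complement(d)` gives a simple check of whether complementary doublets have similar relative abundances, and running `order_by_level` separately on coding and noncoding regions gives a per-level comparison of the kind described above.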