UBC Theses and Dissertations

DNA sequence and structure analysis of the Drosophila gene Polyhomeotic. Daly, Mark K. 1990

DNA SEQUENCE AND STRUCTURE ANALYSIS OF THE DROSOPHILA GENE POLYHOMEOTIC

by

MARK K. DALY

B.Sc., University of British Columbia, 1987

A THESIS SUBMITTED IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE

in

THE FACULTY OF GRADUATE STUDIES

ZOOLOGY

We accept this thesis as conforming to the required standard

THE UNIVERSITY OF BRITISH COLUMBIA

April, 1990

© Mark K. Daly, 1990

In presenting this thesis in partial fulfilment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the head of my department or by his or her representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission.

Department of Zoology
The University of British Columbia
Vancouver, Canada

Abstract

polyhomeotic is a gene of the Polycomb-group required for proper segment determination in Drosophila. Genetic and molecular analysis has shown that ph has a repetitive structure. The DNA sequence presented here shows that ph consists of a direct tandem duplication with very high sequence conservation. Analysis of the sequence has revealed several conserved open reading frames and splice junctions, putative transcriptional promoter and terminator sequences, polyadenylation signals and translational start signals. In addition, the DNA sequence shows that ph contains a zinc finger sequence in each repeat. This suggests that ph may encode a DNA-binding protein.

TABLE OF CONTENTS

ABSTRACT
TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES
ACKNOWLEDGEMENT
GENERAL INTRODUCTION
MATERIALS AND METHODS
RESULTS AND DISCUSSION
SUMMARY
REFERENCES

LIST OF TABLES

TABLE 1. Long ORFs conserved in both repeats of ph.
TABLE 2. ORFs 70 amino acids in length or greater.
TABLE 3. Codon preference parameters of the 6 putative ORFs.
TABLE 4. A list of splice junction sequences and branch sequences present in ph.
TABLE 5. Comparison of splice junctions of ORF 4/5/6 to the consensus.
TABLE 6. Putative promoter signals in the ph sequence.
TABLE 7. Putative polyadenylation signals found in the ph sequence.
TABLE 8. Amino acid sequence similarity between the proximal ORFs and the distal ORFs.
TABLE 9. The three most abundant amino acids of each ORF.
TABLE 10. Structure, conformation and charge of the 6 putative ORFs.
TABLE 11. Comparison of the ph protein sequence to other proteins.

LIST OF FIGURES

FIGURE 1. Sequencing strategy for the distal repeat of ph.
FIGURE 2. Genomic DNA sequence of ph.
FIGURE 3. Optimal alignment of the proximal repeat with the distal repeat.
FIGURE 4. A physical summary map of ph.
FIGURE 5. Frequency distributions of ORFs in a random sequence (a) and in a coding sequence (b).
FIGURE 6. Amino acid sequence of ORF 2.
FIGURE 7. Amino acid sequence of ORF 3.
FIGURE 8. Amino acid sequence of ORF 4/5/6.
FIGURE 9. Amino acid sequence of ORF 8.
FIGURE 10. Amino acid sequence of ORF 9.
FIGURE 11. Amino acid sequence of ORF 10/11/12.
FIGURE 12. Optimal amino acid sequence alignment of ORF 2 and ORF 8.
FIGURE 13. Optimal amino acid sequence alignment of ORF 3 and ORF 9.
FIGURE 14. Optimal amino acid sequence alignment of ORF 4/5/6 and ORF 10/11/12.

ACKNOWLEDGEMENT

I would like to thank my supervisor, Dr. Hugh Brock, for his help and encouragement during the course of this work.

GENERAL INTRODUCTION

In Drosophila melanogaster, homeotic genes are required for the determination of segmental identity. The homeotic genes belong to two complexes. Genes of the Bithorax Complex (BX-C) are responsible for segmental differentiation of thoracic and abdominal segments (Lewis, 1978), and genes of the Antennapedia Complex (ANT-C) are responsible for proper segmental identity of head and thoracic segments (Kaufman et al., 1980). Because genes in these two complexes act in parasegments (PS; the posterior compartment of one segment and the anterior compartment of the next most posterior segment), I will refer to parasegments where necessary. As an example, PS6 includes posterior T3 and anterior A1.

The BX-C spans about 300 kilobases (kb) of DNA in the 89E region of chromosome 3 (Bender et al., 1983; Peifer et al., 1987). Individual recessive mutations in the BX-C give parasegments a tissue identity appropriate to a more anterior parasegment. Conversely, dominant BX-C mutations give parasegments a tissue identity appropriate to a more posterior parasegment (Lewis, 1978). The BX-C contains three structural genes: Ultrabithorax (Ubx), abdominal-A (abd-A), and Abdominal-B (Abd-B). These three genes take up less than 20 kb of coding DNA. Thus, most of the 300 kb of the BX-C is regulatory (Peifer et al., 1987).

Regulatory regions can be identified by complementation analysis. Mutations in distinct tissue-specific regulatory regions of Ubx complement one another. However, a mutation in a structural gene does not complement mutation(s) in regulators of that gene.
The complex complementation pattern distinguishes 9 regulatory regions within the BX-C: abx/bx and bxd/pbx (for control of Ubx; Lewis, 1978); iab2, iab3, iab4, and iab5 (for control of abd-A; Karch et al., 1985) and iab5 through 9 (for control of Abd-B; Karch et al., 1985). Each embryonic parasegment has a unique cellular mosaic pattern of Ubx, abd-A and Abd-B expression. In each more posterior parasegment (and all parasegments posterior to it) an additional set of cells gains a unique identity. This new identity is conferred by a successively more distal regulator in the BX-C. Thus, successive regulatory regions are activated in a proximal-to-distal direction along the BX-C in an anterior-to-posterior direction down the embryo.

The homeotic genes are regulated in three ways: 1) by maternal and early-acting zygotic segmentation genes (Carroll et al., 1986; Ingham and Martinez-Arias, 1986; Harding and Levine, 1988; Irish et al., 1989); 2) by a hierarchy of cross-regulatory interactions between the homeotic genes (Struhl, 1982; Hafen et al., 1984; Harding et al., 1985; Struhl and White, 1985; Riley et al., 1987); and 3) by the genes of the Polycomb (Pc) group (Jurgens, 1985).

The developing embryo is divided into successively smaller domains by the gap genes, the segmentation genes and the segment polarity genes. The initial patterns of Antp and Abd-B gene expression in the embryo depend on regulation by gap gene products (Harding and Levine, 1988). hunchback (hb) and Kruppel (Kr) are necessary for proper Antp expression in PS4 and PS5. Kr and knirps (kni) are required for normal Antp and Abd-B expression in T3 and PS12 and PS13 (Harding and Levine, 1988). The Antp gene contains two promoters that are differentially regulated by the gap genes hb and Kr and by the segmentation gene fushi tarazu (ftz; Irish et al., 1989).
The gap and segmentation gene products contain nucleic acid binding domains (eg, zinc fingers, homeoboxes) that could facilitate their interaction with regulatory regions within the homeotic genes (Tautz et al., 1987).

Regulation of homeotic gene expression also occurs by cross-regulatory interactions between the homeotic genes themselves. Posteriorly acting genes of the BX-C (abd-A, Abd-B) regulate anteriorly acting genes (eg, Ubx) by repression (Struhl and White, 1985). Likewise, the expression of Antp is repressed in its posterior domain by Ubx, abd-A and Abd-B (Hafen et al., 1984). Another ANT-C gene, Sex combs reduced (Scr), is controlled by Ubx (Struhl, 1982). The homeotic gene products contain a DNA-binding domain, the homeobox, that allows them to bind to promoters of other homeotic genes.

Genes of the Pc-group also control spatial distribution of the homeotic gene products. The Pc-group genes include Pc (Lewis, 1978; Duncan and Lewis, 1982), extra sex combs (esc; Struhl, 1981), Polycomb-like (Pcl; Duncan, 1982), super sex combs (sxc; Ingham, 1984), Additional sex combs (Asx), Posterior sex combs (Psc), Sex comb on midleg (Scm; Jurgens, 1985), Sex comb extra (Sce; Breen and Duncan, 1986), polyhomeotic (ph; Dura et al., 1985), polycombeotic (pco; Shearn et al., 1978) and pleiohomeotic (pho; Hochman et al., 1964; Gehring, 1970).

The Pc-group genes are repressors of BX-C gene expression. This has been shown genetically through dosage studies, and molecularly. Embryos mutant for esc and Pc show an initial distribution of Ubx that is normal. However, later in development, the Ubx gene product becomes ectopically expressed throughout the embryo. This suggests that these genes are not required for the initiation of homeotic gene expression, but rather for the maintenance of spatial regulation once it has been established (Struhl and Akam, 1985; Wedeen et al., 1986).
Mutations in the Pc-group have similar phenotypes, including posteriorly directed transformations of abdominal parasegments in the embryo (Jurgens, 1985).

Pc-group double mutants have stronger phenotypes than embryos mutant for only one Pc-group gene. This suggests that genes of the Pc-group act synergistically to control spatial expression of the BX-C genes. This could occur in two ways. Gene products of the Pc-group could form a complex multimer in which all products interact (Locke et al., 1988). This type of interaction has been observed for the gene products that form the contractile apparatus in Drosophila and Caenorhabditis (Homyk and Emerson, 1988; Fuller, 1986; Mogami and Hotta, 1981; Epstein et al., 1986; Park and Horvitz, 1986). The Pc-group genes show different temporal and spatial modes of expression. Thus, it is unlikely that all members of the Pc-group form the same multimer. However, it is possible that subsets of the Pc-group form distinct multimers that act in different tissues and at different times. This could explain why different Pc-group genes have related but distinct phenotypes. Alternatively, members of the Pc-group could interact indirectly as a regulatory network. Antibodies against Pc have been shown to bind to over 60 discrete sites on polytene chromosomes, including the BX-C and ANT-C, Scm, Asx, Psc, sxc, ph and pco (Zink and Paro, 1989). This shows that Pc regulates at least some members of the Pc-group.

polyhomeotic (ph) is a well studied member of the Pc-group. The gene was named to reflect the large number of phenotypes and transformations associated with its mutations. ph is unlike other Pc-group genes in that it is required for epidermal development as well as segmental specification (Dura et al., 1987).
Loss of ph function results in a pleiotropic phenotype that includes epidermal cell death, abnormal segmentation and homeotic gene expression in the epidermis and central nervous system (CNS), and abnormal axon pathway development in the CNS. Strong hypomorphs cause transformation of all thoracic and abdominal segments towards A8 (Dura et al., 1988). ph amorphs show cell death in the ventral epidermis, and die at 12 hours after egg deposition. Amorphs show an A8 transformation of every segment. This phenotype is consistent with ph being a repressor of BX-C function.

Unlike other Pc-group genes, ph is required both zygotically and maternally for normal embryonic development (Dura et al., 1988). Maternal ph amorphs are not rescued by two doses of paternal wild type product. Thus, the maternal effect mutation cannot be compensated for by an increase in dose of zygotic ph genes. Therefore, in early embryogenesis, ph function comes only from maternal gene products (Dura et al., 1988). ph and pho are the only Pc-group genes whose maternal effect cannot be rescued by a duplication of the paternal genes. Thus, only ph and pho are critically required in the maternal germ line. This sets these two genes apart from the other Pc-group genes.

ph is required for the proper expression of homeotic and segmentation genes. Initially, the patterns of Scr, Antp and Ubx expression are normal in the epidermis of ph mutants, but following germ band retraction, Scr, Antp and Ubx are ectopically expressed (Smouse et al., 1988). These results suggest that ph is not required for the initiation of spatial regulation but, like other Pc-group genes, is required for maintenance of spatial regulation in the epidermis. However, in the CNS there is an absolute requirement for ph to allow expression of the homeotic genes because expression of Scr and Ubx is abolished in the CNS of ph mutants. Surprisingly, Antp is ectopically expressed in the CNS of ph mutants (Smouse et al., 1988).

ph is also required for regulation of segmentation genes in the epidermis and CNS (Smouse et al., 1988). In ph embryos, engrailed (en), which is normally expressed only in the posterior compartment of each parasegment, becomes expressed also in the anterior compartment. Like the homeotic genes, en expression is absent in the CNS of ph embryos. The patterns of ftz and even-skipped (eve) expression in the CNS are altered in ph⁻ embryos. Loss of ph product causes suppression of ftz expression and ectopic eve expression in the CNS (Smouse, 1988). The expression of fusion genes containing regulatory regions of ftz or eve and the coding region of lacZ in ph⁻ embryos is the same as that of the wild type ftz or eve gene (Smouse, 1988). This shows that regulation of ftz and eve by ph occurs through their promoters.

Northern analysis of the ph region shows that embryonic and pupal transcription patterns differ (Freeman, 1988). In embryos, two major transcripts of 6.4 and 6.1 kb are observed. The 6.1 kb transcript must have a proximal promoter because ph^^^, an inversion with a proximal breakpoint, truncates the 6.1 kb transcript but does not affect the 6.4 kb transcript. The 6.4 kb transcript may have either a proximal or distal promoter. In pupae, the ph^^^ inversion truncates the 6.6 kb transcript and leaves the 6.1 kb transcript intact. Therefore, there are at least four different ph transcripts, at least two of which have proximal promoters (Freeman, 1988). It is possible that all ph promoters are proximal, in which case ph is a single transcription unit with multiple promoters, alternative splicing and/or alternative termination. Alternatively, ph could be two transcription units, with one proximal and one distal promoter. Besides the major ph transcripts, some smaller ph transcripts are also detected. Probes that hybridize to the 6.4 kb and 6.1 kb transcripts in embryos also hybridize to these smaller transcripts.
However, the synthesis of these small transcripts begins about 3.5 kb upstream of the major ph transcripts. Moreover, an independent transcription unit intervenes between the 5' end of the small ph transcripts and the 5' end of the major ph transcripts. It is not clear how these small transcripts are related to ph function.

Obtaining the DNA sequence of the ph region would be an important contribution towards an understanding of the structure of the ph gene or genes. This thesis examines the structure of ph. With my collaborators in France (J. Deatrick and N. Randsholt) the entire sequence of ph has been determined. This will allow us to map cDNAs to the genomic sequence and therefore locate exons and introns within the gene. The exact nature of the repeats was determined by aligning the proximal and distal sides of the gene. The degree of sequence similarity between the two repeats gives an indication of how recent the ph duplication is. Putative open reading frames (ORFs) were located by assuming that long regions of amino acid sequence uninterrupted by stop codons represent regions of functional significance. These ORFs were analyzed by computer to understand their structure. The DNA and protein sequence of ph was searched for putative regulatory motifs and splice junctions. The data that I have obtained provide the raw information from which inferences of ph structure can be made and on which further experiments to elucidate its structure and function can be based.

MATERIALS AND METHODS

Bacterial culture

All cultures were grown in Luria-Bertani medium (1% Bacto-tryptone, 0.5% Bacto-yeast extract, 1% NaCl with a pH of 7.5).

Competent cells

Competent cells were prepared using the calcium chloride procedure. One litre of LB broth was inoculated with 10 ml of an overnight culture of E. coli. Cells were vigorously shaken at 37°C until a density of about 5 x 10^ cells/ml was reached.
The cells were chilled in an ice-water bath for 5 min and then centrifuged at 4000 g for 5 min at 4°C. The supernatant was discarded and cells were resuspended in a half volume of ice cold 50 mM CaCl2 and 10 mM Tris-HCl (pH 8.0). The cells were kept on ice for 15 min and then centrifuged at 4000 g for 5 min at 4°C. The supernatant was discarded and the cells were resuspended in 1/15 volume of ice cold 50 mM CaCl2, 10 mM Tris-HCl (pH 8.0) and 50% glycerol. 0.5 ml aliquots were dispensed into pre-chilled microfuge tubes and stored at 4°C for 24 hours. The tubes were then transferred to a -70°C freezer for long-term storage.

Mini-plasmid preps

1.5 ml of bacterial culture was centrifuged at 10000 rpm for 2 min. The cell pellet was resuspended in 100 µl of cold 50 mM glucose, 10 mM EDTA, 25 mM Tris-HCl (pH 8.0). 200 µl of 0.2 N NaOH, 1% SDS was added and the tubes were gently agitated to mix and left on ice for 15 min. The tubes were centrifuged at 10000 rpm for 15 min. The supernatant was added to a new tube containing an equal volume of phenol-chloroform (1:1) and vigorously mixed for 10 min. The tubes were centrifuged at 10000 rpm for 5 min. The supernatant was transferred to a new tube and precipitated with 2 volumes of 95% ethanol. The tubes were centrifuged at 10000 rpm for 15 min to collect the pellets. The pellets were washed once with 1 ml of 70% ethanol and then dried under vacuum. Pellets were resuspended in water.

Maxi-plasmid preps

These preparations are a scaled-up version of the mini-prep but with several added steps to ensure sequenceable DNA.

50 ml overnight bacterial cultures were centrifuged at 10000 rpm for 5 min. Pellets were resuspended in 3.5 ml cold 50 mM glucose, 10 mM EDTA, 25 mM Tris-HCl (pH 8.0). 7 ml of 0.2 N NaOH, 1% SDS was added, the tubes were inverted to mix and placed on ice for 15 min. 5 ml KOAc (pH 4.8) was added, the tubes were inverted to mix and placed on ice for 15 min.
The tubes were centrifuged at 20000 rpm for 20 min at 4°C. The supernatant was extracted once with phenol-chloroform (1:1) and once with chloroform. To the supernatant, 2 volumes of 95% ethanol was added. The tubes were kept on ice for 5 min, and then centrifuged at 20000 rpm for 20 min. The pellets were washed with 70% ethanol and dried under vacuum. The pellets were resuspended in 600 µl H2O. 70 µg RNAse was added and the samples were incubated at 37°C for one hour. To each sample, 300 µl of 7.5 M NH4OAc was added. The tubes were frozen at -70°C for 10 min, thawed and centrifuged at 10000 rpm for 10 min. The supernatant was removed from the protein pellet and transferred to a new tube and 2 volumes of 95% ethanol was added. The samples were chilled at -70°C for 5 min and centrifuged at 10000 rpm for 15 min. The DNA pellets were washed with 70% ethanol and dried under vacuum. The pellets were resuspended in 200 µl H2O. 120 µl 20% polyethylene glycol, 2.5 M NaCl was added and the samples were kept at 4°C overnight. The tubes were centrifuged at 10000 rpm for 15 min. The supernatant was discarded and the pellets washed twice with 70% ethanol. The DNA was dried under vacuum and resuspended in H2O. Typical yields were 100 to 200 µg DNA.

Agarose gel electrophoresis

Agarose gels of 0.8% were made with BRL Ultrapure agarose and 1X Tris-borate EDTA buffer (100 mM Tris base, pH 8.0, 100 mM boric acid, 2 mM EDTA). Gels were electrophoresed in 1X TBE containing 500 µg/l ethidium bromide. Gels were visualized on a UV trans-illuminator.

Polyacrylamide gel electrophoresis

Sequencing gels consisted of 50% urea, 20% 38:2 acrylamide:N,N'-methylene-bis-acrylamide, 1X TBE, 0.08% ammonium persulfate and 0.02% N,N,N',N'-tetramethylethylenediamine (Bethesda Research Laboratories, Ultrapure). Gels were polymerized overnight at room temperature. Gels were pre-run in 1X TBE at 1600 V for 30 min. To the lower buffer chamber, 0.5 volumes of 3 M NaOAc were added.
Samples were loaded and the gel run at 1600 V for 2.5 hrs. Gels were soaked in 10% methanol, 10% acetic acid for 10 min and vacuum-dried at 80°C for 2 hrs. Gels were exposed to Kodak X-Omat-AR film overnight.

DNA sequencing

ph was sequenced on both strands using the dideoxy method of Sanger and Coulson (1977). Appropriate restriction fragments from the distal repeat (Figure 1) were subcloned and sequenced using directed deletions to generate nested sub-sets of clones on both strands (Sal 1.5, Sal 0.8), in one direction (Sal 2.3), or to obtain partial sequence (Sal 4.0). Gaps in sequence data were filled by synthesizing oligonucleotide primers.

1). Directed deletions

DNA to be subjected to directed deletions (Henikoff, 1984) was sub-cloned into either pUC18 or pUC19. About 30 µg of DNA was completely digested with a restriction enzyme that cuts within the polylinker site to leave a 3' overhang adjacent to the priming site. The DNA was extracted once with phenol-chloroform (1:1) and once with chloroform. The DNA was precipitated as above. Resuspended DNA was completely digested with a restriction enzyme that cuts within the polylinker to leave a 5' overhang that can be used as a substrate for exonuclease III digestion into the sub-cloned insert. The DNA was organically extracted and precipitated as above.

The doubly digested DNA was resuspended in 60 µl of 66 mM Tris-HCl (pH 8.0), 0.66 mM MgCl2. Half of this sample was treated with exonuclease III to determine the rate of digestion. 1 µl of DNA was removed for a control (zero digestion). The remaining DNA was heated to 37°C for 1 min. 1 µl (75 u) of exonuclease III (BRL) was added. Following a 30 sec pre-incubation period, 2.5 µl aliquots were removed every 30 sec and mixed with 7.5 µl of 0.25 M NaCl, 30 mM KOAc (pH 4.6), 1 mM ZnSO4, 5% glycerol and 67 Vogt u/ml of S1 nuclease (BRL) in tubes sitting on ice.
Once all timepoints were taken, these tubes were moved to room temperature for 30 min to allow for digestion by S1 nuclease. Reactions were stopped by the addition of 1 µl of 0.5 M Tris-HCl (pH 8.0), 0.125 M EDTA and heating the samples to 70°C for 10 min. 2 µl aliquots of each timepoint were separated electrophoretically on an agarose gel (as described above) to determine the digestion rate of exonuclease III. If the rate of digestion was close to 150 bp/min, the remainder of each timepoint was extracted once with phenol-chloroform (1:1) and once with chloroform. If the rate of digestion was too slow or too fast, exonuclease III digestion was repeated with an appropriate increase or decrease in the number of units of enzyme used. The DNA was precipitated as described previously and resuspended in 10 µl of 20 mM Tris-HCl (pH 8.0), 7 mM MgCl2 and 10 u/ml Klenow polymerase (BRL). The DNA was incubated at room temperature for 3 min. 1 µl of a mix of all four deoxyribonucleotides, each at 0.125 mM, was added and the reactions were incubated at 37°C for 15 min. Each sample was mixed with 66 mM Tris-HCl (pH 8.0), 6.6 mM MgCl2, 10 mM dithiothreitol, 0.2 mM ATP, 50% polyethylene glycol and 25 u/ml T4 DNA ligase (BRL). The reactions were incubated at room temperature overnight.

20 µl of each sample was added to 100 µl of competent E. coli JM83 and left on ice for 30 min. The cells were heat-shocked at 37°C for 30 sec and placed on ice for 1 min. 200 µl LB broth was added and the tubes placed in a 37°C incubator and shaken for 1 hr at 200 rpm. 50 µl 5-bromo-4-chloro-3-indolyl-β-D-galactoside (20 µg/ml in N,N-dimethylformamide) and 50 µl isopropylthiogalactoside (20 µg/ml in H2O) were added to each tube. The cells were plated out on LB media containing 100 µg/ml ampicillin. Plates were incubated at 37°C overnight. Five colonies from each timepoint were grown in LB broth containing 100 µg/ml ampicillin.
Plasmid mini-preps (as described above) were made from each timepoint. Each timepoint was digested to completion with two restriction enzymes that will cut out the sub-cloned fragment only if the priming site is intact. The digested DNA was separated electrophoretically on an agarose gel to determine insert size and suitability for sequencing. Maxi plasmid DNA preps (as described above) were made for suitable timepoints.

2). Oligonucleotide synthesis

In cases where directed deletions were not used, oligonucleotide primers were made to sequence directly off a restriction fragment. Oligonucleotides (18mers with a minimum GC content of 50%) were made using a Model 391 PCR-Mate DNA Synthesizer (Applied Biosystems) following the manufacturer's instructions. Cleavage and cyanoethyl deprotection of the primer was accomplished by drawing up concentrated ammonia solution (35%, BDH Aristar) into the column and letting it stand at room temperature for 30 min. This process was repeated three times. The expelled ammonia solution was collected in a glass vial with a Teflon-lined cap and the volume made up to 3 ml with concentrated ammonia solution. The solution was incubated at 55°C for 15 hrs. The solution was cooled, transferred to microcentrifuge tubes and lyophilized. Pellets were resuspended in a total volume of 200 µl H2O. Oligonucleotides were purified by spin-column chromatography using a Sephadex G50-50 matrix (Maniatis et al., 1982). A molar ratio of 5:1 primer:template was used for sequencing reactions.

3). Sequencing

DNA sequencing was performed using a modification of the T7 DNA polymerase protocol of Tabor and Richardson (1987). Either modified T7 polymerase (Sequenase, United States Biochemicals) or unmodified T7 polymerase (Pharmacia) was used. Both enzymes gave good results.

6 µg of plasmid DNA was denatured in 0.2 M NaOH in a volume of 100 µl at 65°C for 5 min.
The DNA was cooled in ice and 50 µl 7.5 M NH4OAc was added followed immediately by 600 µl 95% ethanol. The DNA was precipitated as above. The DNA pellet was resuspended in 7 µl H2O and heated to 55°C for 5 min. The sample was mixed by vigorous vortexing for 1 min. 2 µl of 200 mM Tris-HCl (pH 8.0), 100 mM MgCl2, 250 mM NaCl and 1 µl sequencing primer (0.5 pmol/µl) were added. The sample was heated to 65°C for 2 min in a heating block. The block was placed at room temperature so that the sample was cooled to 30°C over a period of about 1 hr. 2 µl of a solution containing dGTP, dCTP and dTTP each at a concentration of 1.5 µM, 1 µl 0.1 M dithiothreitol, 0.5 µl (α-35S)dATP and 2 µl T7 DNA polymerase (1.5 u/µl) were added and the tube was incubated at room temperature for 3 min. 3.5 µl aliquots were added to tubes containing 2.5 µl of each of four termination mixes (eg, the ddG termination mix contains 80 µM dGTP, dATP, dCTP, 80 µM dTTP, 8 µM ddGTP and 50 mM NaCl) that were pre-warmed to 37°C. The tubes were incubated at 37°C for 5 min. The reactions were stopped by the addition of 4 µl 95% deionized formamide, 20 mM EDTA (pH 7.5), 0.05% (weight/volume) xylene cyanol FF and 0.05% (w/v) bromphenol blue. The reactions were heated at 90°C for 2 min, quick-cooled in an ice-water bath for 1 min and loaded onto a polyacrylamide gel (described above).

4). DNASTAR computer programs

Several DNASTAR computer programs were used to analyze the ph sequence. The proximal and distal repeats were compared using ALIGN (for DNA sequence) and AALIGN (for protein sequence). These programs produce a locally optimal alignment of two partially homologous sequences. The resulting similarity index is the total number of matched bases divided by the sum of the number of mismatched bases and the number of gaps in the alignment. The program GENEPLOT was used to determine codon preference values for ph.
This program plots the codon preference value for each codon in each reading frame for both strands of the DNA sequence relative to a codon frequency table. The codon preference value is defined as the probability that a given reading frame is similar in codon usage to the frequency table. The codon frequency table used in this work is made up from several Drosophila genes (Ashburner, 1989). The program calculates the codon preference of each reading frame of the sequence entered into the program and for a random sequence with the same composition as the sequence entered. If the former value is significantly greater than that of a random sequence, the sequence would have coding potential. The PATTERNS program was used to search DNA and amino acid sequence for certain known consensus sequences (eg, exon-intron junctions, promoter sequences etc.). The program searches both strands of the DNA sequence for the consensus sequence in question. The program PROTEIN was used to provide information about the physical properties of the putative ph protein. The program determines the structure of a protein (ie, regions most likely to be helix, turn, coil or sheet), hydropathy profiles of a protein and the charge, molecular weight, isoelectric point and amino acid content of a protein. The FASTP (Lipman and Pearson, 1985) and FASTA (Pearson and Lipman, 1988) programs were used to search DNA and amino acid databases for genes with sequences similar to ph. The programs are fast because they initially screen sequences for similarity by searching for aligned identical amino acids. Other algorithms compare each nucleotide or amino acid of one sequence with every nucleotide or amino acid of the other sequence.

RESULTS AND DISCUSSION

DNA SEQUENCE

Sequencing strategy

The DNA sequence of ph presented here is 24925 base pairs (bp) in length.
This represents sequence that I obtained (from 15724 through 24925) and sequence that my collaborators in France (Janet Deatrick and Neel Randsholt) obtained (from 1 through 15730). Any comparisons between the distal and proximal repeats required the sequence data provided by my collaborators. All sequence data was analyzed using DNASTAR computer programs. For convenience, sequence is presented as the coding strand and any sequence coordinates mentioned are in reference to the coding strand.

Figure 1 shows my sequencing strategy for the distal repeat of ph. Arrows show the direction and length of sequence obtained. Sequence was obtained from the five subclones indicated in Figure 1. It can be seen from Figure 1 that over 80% of the sequence has been obtained from at least two independent clones on each strand and 91% of the sequence was obtained from clones on both strands. Sequencing on both strands is necessary to clarify ambiguities inevitably present in single-stranded sequence obtained by the Sanger method. The sequencing gels were read independently by two people to minimize reading errors, and to eliminate errors introduced entering the sequence into the computer. In the case of disagreement, both readers re-examined the original gels to arrive at a consensus. The 0.8 Sal fragment sequence is single-stranded sequence, although at least two clones have been sequenced in each region. Subsequent to this work, a cDNA from the region has been sequenced on both strands, and has confirmed the sequence presented here. Figure 2 shows the complete coding-strand sequence of the ph region.

To determine the sequence similarity between the two repeats, the proximal and distal repeats were aligned using the algorithm of Wilbur and Lipman (1983; Figure 3). As seen in Figure 3, sequence similarity varies between 100% and zero. One would expect that the regions of very high sequence similarity have been conserved over time for functional reasons.
I predict that these highly conserved regions will be exons or regulatory sequences, and that the regions of low sequence similarity will include intron sequences. The bottom part of Figure 4 shows a comparison of the repeated regions as determined by aligning the two sequences (Figure 3) and the repeats as determined by cross-hybridization studies (Freeman, 1988). The two methods support each other in that each of Freeman's repeats (a through e) contains a region of high sequence similarity, although the extent of the conserved sequence in d and e is less than in the other repeats. The size of the proximal c repeat was underestimated by the cross-hybridization study by about 400 bp. In addition, the cross-hybridization study did not accurately locate the region of unique sequence.

Figure 1. Sequencing strategy for the distal repeat of ph.

Sequence is presented in a proximal to distal direction (left to right). Arrows represent the direction of sequencing. The sub-clones used for sequencing are labelled at the top of the figure. In the case of the 8.7 Sal-Xho fragment, only a portion of this sub-clone was sequenced.

Figure 2. Genomic DNA sequence of ph.

Sequence is presented as the coding strand. The proximal repeat (J. Deatrick and N. Randsholt) includes sequence between 1 and 15730. The distal repeat includes sequence between 15731 and 24925.
19  1 61 121 181 241  CTCGAGGTGTGGACGCAATCTTCTCCTCACCACGGGCCGTATCGCACTGATAGCAGGGAC ACCAGGAAACCGAACTTTTCCACTAGACCTCTCGGGCTCTAGGTATATCACTATATATGG CGACGTTATCAGCCCCTCCGACTCTGCCCCTGCATGCGAAATTAGCATATTTATTATGGA CCACGCACACACACACTCGCACACACGCACACCGCAGCACGGTCTAGATTTGGTCTGGTT TGAAAAGTGCAATCCACGGTCCGTGGAGTCAAGATCTTTATGACTCCACACAGATTATTC  60 120 160 240 300  301 361 421 481 541  CCGGCAGGTAGATAGATCCCTACACAGAAACGGTCATAAAGCAA^TTGGCTCGCSGCCAG ATTGGAAGAGATACAGATTCGGATTCGGATACGGATACGGGTACGGGTATATGCATGGAT AGATATGCCTGGAGGATTTGCACCACCCGGTTACGGTGGATTAGCCTTCGTGCAAAAATG" TATTTGTATTTTGCAACGAACAATATTTCATGTTATGTACATATTTAAGACCAGTAGGCA TTAAATTCACTATTGCAATTGTTATATAATCTGGAGCTGCACATACGCAAGTTGTTAATT  360 420 480 540 600  601 661 721 781 641  TTCACAOGAGTATTTATAACACCTCCTTCTGTCTATCTCTTACATATTTAAGTTAAOTAC TTAATATAAATACTTTAAGTATAATGCATATATGAATATAGTCTTTTAGCGGGTTAATAG CAGTGCCCACCCTCGTAATCATTATTGAGATCATGTTTATCTCACTCGCTCTCTCTTOCT CTGTATTTTTGTCGTTTTGTGTTTCATTACGTCAAAAAATTCGAOCTTTTGTGTATGTGT CTGTGTGGGGCCGOGTGCGAACCCCTCTTOGCTCTAAACAACCCAGACAAACAGAAACAC  660 720 780 840 900  901 T O O G f T A A C A G G T A G C T G A T A A G C G C G A A A A C A A f A T C A G C A T c f o C A T A C A T A T A T A A A 961 CCGGTTGTGAAACGAATATCAGATGTGTCACTGCAGTTACTAAAGOGATTAATTAGGATC 1021 CTCATTCACCAACACACGACAAGTAACCGGGAATGGGAGAGCCTCACGCGCCAGAATCTC 1061 CCAATGGAAAATTGAAAAACCGAAAGAGAAACTTACGCAGAAGCACTCAAAAGCAGGTTG 1141 A T C G G A A C A C C A A C T G A A C T T A C A A G G T T T A T G T A C G G A T C G A T C G G G C T C A C G T C G A G G  960 1020 1080 1140 1200  1201 1261 1321 1381 1441  1260 1320 1380 1440 1500  TCCGGCTACTCCTTTCATCCATAAGTCAGTGATGTCCTTCAGAGCAOCGTCCAACCCOGA ATCACAACAAACCGCAACTAAAATTCGAAACGCAAAACAGGGGCAGATTAGCGTTAGTTA AGAGATACGATAGGCGCCAAACATCCCCCCCCCCCCCTCCATCAAAGAATTCAAAATGAT CACCGAGCGAAAGCTCCCCAGGAGAGAGAGCGAACTGAGCTTTCAGAGCGAAAGCGAGTG AGTGAGCACACCACAGCGGCGTTCTACAAATTTCAACAGTTCATATTCCGGATCGATTTG  1501 TCTTfTACTGTTTcfGTTTACCAcfTCTTTATGGATTCGTGAAc6GGAAATAACACAATA 1560  1561 1621 1681 1741  
TACTAATACAAAATTTTTTGTTTTTTGGGATGTTTTTTCTATAATTCAATTACTTCTAAT AACCGCAAACAAAATCTTAATGTCTTGGGAAATCAAATTGATAAATCTCTGTTTACCTAT TGGAATTTGTTATCAACAATTACCATTTACTATTATTATTACTTCTGTGTTTTATTTAGT CCCTAGTTGCTTTTCAAAATATAGCTAGCTGCATTGCAAATTCACTCTGTAGAAATGAAA  1620 1680 1740 1800  1801 1861 1921 1981 2041  GTCCTAACATCTTTTATATAAAGAGATTATATAA^CCGCTAGAATTAGTTTTAAATTTAA GCATATTTAATACATATTTTTATACTGTTTATTCCTAAAATTGCAAACTTGTAATCAAAA CCAGTACGGATATCACCTCGCATAACAACTGTTATACAGCCACTTTAATTCACCACACAT GCGTTAGCCACGCCCACTGAACCACTAAACATTTAGCTTGTCTCTGTCCTATTCTTCTCC GCAGGCGGCACAATGGTGCTATCTCTTTTGCACCCCATAAAAAACCTTTTCCGGTGGTGC  I860 1920 1980 2040 2100  2101 G C C C A T T T T T T A T T C G C T T C G A T C T T C C G T C G G C A T T T T G T A C T & C C A A T T A G C A O C G A C 2161 T T C T A C A A T T C A A T C A A C T T T G T T A T T G T A C C C A A A G A A A A A T G G A G A A G A T A A C A T T C T 2221 C A T A C A C T T G T A C C T G T T T C G C A A C T T G T G A G T T G A C T C C A A T A A T T G C T T C T T G G G T A T 2281 C G A T A A A G T C C C A A T A A C G A T T G C T A A T A A C T T A A A A T T T A T A G C T T A C A A T C A T G G A T G 2341 T C A A T A T T T A A T T T T G T A T C T C A T G T T G G O T A T G T T T T G A T T A A T T T G T G T G T C T C C A T A  2160 2220 2280 2340 2400  2401 2461 2521 2581 2641  CCGAJU^GGTATTAA&TTAAATGTCTATTGTTCTGTTTAGGATTA&TAAACACTTfcATTT TCTCTACCAGTAAAATTTTCCTAAAACCAGTAATTTAGTTCAGTTATTATGACAGTTTCT TAGGATTTTTCTTCAATACTTCTGAGTATCAACTGTTGGATCCATTAAATTTAGAACTTC TGCCAAGTAAAACGGACTCTACCCACGCATGGAAATTATAAAATAACGCCAAGTGCTCGA TTATATATATTGCAATTAAAACAATCGCAAGCGCAACCACACATCTCTGTTTGAAAATCG  2460 2520 2580 264 0 2700  2701 2761 2821 2881 2941  WTG£AAATTTACA6ACOGTGACCAA^TATAGO£ATATTTTGAAAATOGAGTTATTOCG CTTTAATAGAATACATAAAGAGGACATGATTAGAAGACAATTGAAACGGCAAATAGTCCT ATTTGGCTCCCCCTCCCATTTTCAATTCCTACCCTCGTTCACACCCCTAATGATCTTAGG GGTGGGATTACGTGTACACCCCTTGGGACATCAGAATTGATCCCCTTTTGAGGGTAATTG TATAATGTTTCTTTGGGAAATCOGTTTGTACAGGCCACGGAAAATCCTGGAAAGGGCTCG  2760 2820 2880 2940 3000  3001 
3061 3121 3181 3241  GTTGCTCCCAAAOGTCGCAAOTGTCACAOGAATCAACCCTTTATCGCAGTTTCCTCTTTT TTAAGGATCCGGCCTTGGTACAGTTTCGAAAGGGTTTTATATGTTGGGGTATTGAGTGAA AGTAGGGTTTATTTGCTTTGAATTTGAGTAATGGCTTACTAAAGATTAGAACGTTTCTTT TATAACTTTTTATTGTTATTTTAAAAGGTGCACAATAACTCCGATGAAATTTGAAAGTTG TTAACTTAAGTTCTTAAGTTCTTTCCAGCGAGACTATTTGTTGCTTAAATAAATTTTTGC  3060 3120 3180 3240 3300  3301 C A A A J j l A A A C A A T C T T A A A T T A A C & C T T C T A C C A f A T T T A C A T C G C C A A T G C A T f T C T A G 3361 AGTTGGGAGCATTCCTATTTGGATAATAATGAACAGTCACCACTTAAGGACCCGAAGGAC 3421 C T C C T C G A T A T T T C A T C C A C T G A A A A C T T C A A G C A A A A C T C G A G G A C T T C C T G T T T A C G C 3481 TGCACAGAGTTGGCAGGTCAATTAGCAGATTGGAAAGGACAATCGCAAATGTGTGCAATG 3541 CGCACAGGGTCAGTTATTAATAATGGAGAATAGGCGATATTAAAGCAAGTATGGGCTGTT  20  3360 3420 34B0 3540 3600  3601 3661 3721 3761 3841  CCAAJULXCXAGACCATTTCGCACTCGACTTACCTCAXATACCTG^TTGATCCATOCACTT CAAATCCAAJVATAGTTTATAAAACTTTTTACAATATGTAAAATGGTAAGAAATATGACAC CAGCATAAGGAATTTGGAATTTTAGAGTGTTTAGAAATCATGAAAAGAATTTTTAAGGTT CCACCTGAAAAAATAAAGCATTTACTTTTACTAGTATGAAAAAATGAATTTGATATCCTA AJXACAAAAATACAAATAAAAATCAAACTGGAAAGGAATTTATACTCCCATTTATTTCAGT  3660 3720 3780 3840 3900  3901 3961 4021 4081 4141  CCTGATCCCTTGCGATCCATTTTCATTTCTGGCCCCOCCTCAAAACCTTTAGAA6ACTTC 3960 ATTACGATTGCAATAAGCACAACAACAACAACCCACTTTTGGTTATCACATCCGCATGCA 4 0 2 0 CACCCCTTGGAAAOGCCAACAAAAATGGTGTAGGTAOCTCTGACAATTGCCAOCACCAAC-4080 AACCACACAAGTACAACAACAGGTTCGTTTAAATTTTTATTACAGCAGGGTGCGACCGAG 4140 AGGGAGGAAGGCCAGCGTTAACGTGCCAGAGTGAGAGGCAATATGACAAAAGCACCCCCA 4 2 0 0  4201 4261 4321 4361 4441  GCAATTTACGCTCATTAGGCATTT6TCATTTTTA6GAAATOCAAATATCGACTC^CTCGC TCATGCCCTAGGAGTACCATTCGCCCCAAACACAAGGAGCAATAAGTTGAAGGCAATTAT AAATGGCAAGAAGAAAAGCGCATGCTCGACTTCTTGTTGTGTTGAACTTGTGAGGAAAAG CCGAAAGAGGGAAATAACGGCGAAGGAGGAGGTCACAACTAATAGAAAAAAAGCTCCGAA AAAACAAGCATACACACACATGCAATGAGAACAACCAAAGCAAGGCAGAGAGCGAGAGAG  4260 4320 4380 4440 4500  4S01 4561 4621 4681 4741  
AGAGACCAAAAGCAGTTAAAAAGC&TAAAAATAA^ATGGCGGCA&CAAAAAAGAAAAGCA AACGAGACGAGACAAGCCAACAAAAAGCTAATCGGAATGAAAACATTTGTGGGGCTCAGC GACGTTGT7GTTGGATGAGGTGGGAAGAGTAAGAAGAAGACCAACAAGAGCCCAAAGCAA CGCACACATACTACAAATGGTGCAGGCACACACGCACGGGCTGGCACAAAATGAAAAATG AAATGACAAAGACAGGCACAGTGGGTCATGAGTGGTGGATATTTGAAACATTAATAAACT  4560 4620 4680 4740 4800  4801 4 86 1 4921 4981 5041  TAAAACTTAATAAA^CAACTCAATACTTATGTCTATTATATAATfcTATTTTAcfAAACG 4860 TTTTCATTTTACTTATCGTAACATTGTTTATATGTATAAAAAGTTTGGAAAAATTGCTGT 4920 TCTAAAAAATTGAACCGCTGTACTCTTTGTTCTCAAACTGCAACTGTAAAGCAATCCAAT 4980 AATAATGGA7CGCTTACCACTTTTCAATGGGTGGGAGAGAATAAGATTTCGCTCTGCCTC 5040 TGCGCATACGCTATCCTCCCCTTCTCAATCGACACACCCGTGTGTTACTGACAAGCACAA 5 1 0 0  5 1 0 1 CTAAfAATAAGGTAiAGCCGATCCGACCCGATCTCATGTGAAAAAGAACGAATCACAAAC 5 1 6 0  5161 5221 5281 5341  GAGGTAGAGGTAGGAAGGTAGTCGTAGTGGTGGTGGTTGTGATGGCACGAAAAAAGAAGA TGAAGTATAGCAACAATCGTTGTCGAGTCGGGCCGGGCGGTACACATTCGAGTCTACACA CATAAAACTGGCTTCGCGCGTATTTATTGATGTACATACCCGGTACCCACAAAGTAAAAG GGTATACTGGGCACTTGGGTTTAACTCGAATTTGTGTTAGTAGTCACCACCAATAACTTA  5220 5260 5340 5400  5401 5461 5521 5581 5641  CAAATAAATATTTAJkGAAGGGTTTfcATTTTAAGGATACGACTGAAGTTAGCGAGGAATG GTATATGAAAAAGGGGTATTTGAAAGTCGAGTCACCAGACCATGTTATGTTTTCAGCAAC GGAAGGGGGCTTTCAGTCGGGGTGGTTGAGCACGCATACACATGCCCGCCAGCTGTAGTC TTCCTGTTTTTTTTACTCGTTTTGTTTTTGCTTTGTCAAACGAACTTCTGCCGTTTCATT CCAACCCCAACCGAACGCACGGCATCCCTCTCGCACGCGCTAACTCGCCAGGGCCACTCC  5460 5520 5580 5640 5700  5701 5761 5821 5881 5941  TAGC&GCGTCGCAcfcGCACCCGTGCATTTCGGTATAGAGAAAA&TTATTACTC&CAACG ACATCTGAGAAGAGAGAAGCGGATCCAGCTGCACACCAACACACACACATATTCAGCGCA CATGCGCTCATTTTGTTTCCGATCCGAACGAAAAGTA6AGTTGTCGCTGTGGCGCGCCGT TCAGTTTGAAACTTAACTTGGCGGTGTACGGTTCTTGCGCTCTGCTCTCTCTGCGTTCGT CTTCCTGTGGTCTCGTATCTGTCGCATACCGCATCCCATCTGTATTCAACCAACAAAAAA  5760 5820 5880 5940 6000  6001 6061 6121 6181 6241  CCCG&CGACGCGACATACTCACTG&ATACCCTGCMU^TTTGTTAAATTTTTTTCMAAAG 6060 AGCTAACGCTOCTGTTGATTAGCTAGTGCAGATGTGCAGACATAAAAAGTGATGCCGCGC 6 1 2 0 CACAGTGGAGCCCCTAGCTGGCGAATCGTCGCTOGCGACGTAGGTAGTGCAGTTAAAACA 
6180 AGTACTTAGTGCTGTGACTGTGGCTTAATTTTATGTAAGATCOCGTGCACAGGTCCTTAG 6 2 4 0 TCGTTACACTGAGAAAAAGAAAACTGCCTGCACCCAGCGGGGAGAAGTTGAACTGGACTG 6 3 0 0  6301 6361 6421 64 81 6541  ACTGTGAGTGGAGGTTGTACTAATIACTACTCTCCOTGCACGAAACACTCATTTACACGC AAGCACACACACACACACACACACACACGAGACACTGGGACTTTTGTCGGATTTTCGTAT TCGTTTTGTGTACTTTTGTTGTCTTGCGTTCCACGTTACATACATATGTATATGTTTGGT GTTGCCTGTTGTATGTTTATATTTATTGCCTTCACACATGTGCGTGTTTGTTAATGTACA ATATAATACGGCAAATAGCAAAAAGAGAAGAAACTGACGACTAAAAAGAAACCGCCATGC  6360 6420 6480 6540 6600  6601 6661 6721 6781 6841  CGAC^ATAACAAM^CAACCACCA&TCTCTCCGC&CCCGCCCAAJLACAAAAACAJUIACOC CTAAGCCGACGCATGCATACCTATAATTTATTATAAATATTGTTTTTATTTTGAATAATG CATCGTCGTGCATTGAAGTTTATGCAAAAAAGGTATTTTTTGTTTAGTTTGOTTTTATTT TTATCACTGCTTTTGTACTGCCTGCACTTCTGCTTTTTGTTTACAATTTTTGGTTTAATC TGCTCTTGAGCATTGGATATTGATTCTATATCCTATTCGGATAATAATCACACGTTAATC  6660 6720 6780 6 8 40 6900  6901 6961 7021 7081 7141  ATAAfGCTCGTAAjJ^TGCAGCGA&TGGATATTcfGTGCTCTCT&TCATCATAAAATGCG CTCGACTGGCCGGCCAAGAAAAGAAAATAAATTCATGAAACCCAAAACGAGTTTCCCCTC CGTCGCCCTCTCCGCCATTCATCATCCAACCGACACACGCCCCCCGCTGTGCGGCGTTGT TGCATTTTTAACAACGAATTTCGCAATGCATGCCCCTGTCTGCATTTGTGTGTCTGCGTC CCCTCCTGCTCGTTGATTTCTCCCTCTCTCCGTGCAACCCTCTCTCACTCGCGCAACAAA  6960 7020 7080 7140 7200  21  7201 OCXACATAAAOAAXXCAXAAO^cfATATCOAAA&OTATCACTTTTTCTTOTTCCTOCOO 7260 7261 CAACCCTGCT7ACATTTTGCTTTACTCTGCCGAAATAAACACTC7AGGGT7C7CAGCTG7 7320 7321 AACGAATCTGAGAAATCTGTCTTCCGGTTTAAGCGAAGATTCAATTTTATTGAATACTTT 7380 7381 TATATT7C77ACAG7CCG7TATCACTGAA7GC7AAT777ACC7ACAAACT7AAAT7GT7C 74 40 7441 CTGGCAAGAAAATAAACCACAATGATTAAATTATTTTAGACTCCTTTAAAGCACAACAGT 7500 7501 7561 7621 7681 7741  TTTT6TAAATAAATAGTTGACAACTCTCTTTACACCTACGCTACACCCCACTCGCTCTGA  OTTTATGTGACTGTGTGTACGGCCTGGCCATAGCCTCATCTCGTTCACTCACATTTGTTT CATCTTCTTCGATTGCAGAACTCGTTTGTGTATCAATCAATAGGCTGTCAAGGCCCCCCC CCCTCCTCGCTCGCTCCCGCGCATTTTGCACGCCCGAATCCACTGGAGGIGATGTTTGAT TTGAGCGCGACCTTTCCCCCGCACCCCTATCGOCGTGTCTGTGTGTGTGTGTGCTGCCCC  7560 7620 7680 "7740 7800  7801 7861 7921 7981 8041  
GCATfTTGATCATTCCTT7CCCAC&CCTACTCCTTTCCCCCCGATTGCAAGCAG&CTGTT TTACGGGGGTTCTCACACACTCGAGCTCGAATGTATGTACCCTACTCCATGGCGACATGG AAACCAGTCTTTTTTTTTCTCTCAAGAGGTTTTTCGCAGCGTGCGGCCGTGAGCTIAACT TACACACACTTGCACACGCGCACCCACCACCCTGTGGCGAGTATTCCACCCCTCTGGACC ACCCACCCCATATTCCCTTTCTTTTACGGGGTAGCACCCACATGAGGGTTGCCAAAATCC  7860 7920 7980 8040 8100  8101 TTTTACCCGCTTTTATAACCCATTfcGCCCCTTTfTCTGTCTTTfTTGCACTCAfoCTTT 8160  8161 8221 6281 6341  TTATGCTCTTGCTGTTTATCGGCCTTGCGGGCTTTTTGGGCGATGGAAAGGGGTGOGGAT CAGGCTCTCTGGACTGCCGGGCATCACAGTCGCGGTCAATGCAATAGGACCTTGAAACCA CGCTTC7CCAG77AGA7CAT7CATACT7GAAC7ATATCAGGGAACTGATTCAGG AAGTAA TATTAG7TAAT7ATT7C7AGAAAAACATCCT7ACCATGTGGAGTACTCA0CCATTTACGT  8220 6280 8340 8400  64  01 6461 8521 6581 86 41  T7G7C!CAC7AA7CC7A77AGCT7GAGCAC7ACTG7ATAAAACACTAACCATCTC6TGT7G  8701 6761 8821 8 6 81 6941  GGCA0C7CC777C7&77CAT7CGTTGAATGCGAAAGTCCGTTAAG7AGT7GGCCAAAT0T TCAT7G7CGC7GAAAAAGGGCGGGT7CGGAAAAAGGTCCACAAAAAGAACGAATAT7TTC GGCATAGGGATTGGGATCGGGA7CGCGC7CGGGATC7CTACCTAAAAG77AACCAGGG7A CACAATTC7TAA7TAGAAAT7GAAATGACATG0TACATATCT7CAGGCT7GACCT7GTGA T7AAG7CGCAAC7A7AG7TT7CATATTATCCCGAGTTAAAGG7GATCGT7ACC7CAGAGT  8940 9000  9001 9061 9121 9181 9241  CATAT77ACTT7TCCCGCGT7TGTTTCTGCCACTCTCGAAAG7CATCTTT7ATGT77ACA T7GGAAATATGATTAT7TAATAAGAGCGT7777GT77GCTAG77CCCGCAT77GTCTCAC ACTTTTTTGAATGACTAAAACCAGAT7777G7A777ACCAAG7AGCCA7A777GC77AAA CAAAAAAJUaAAAAAAAAACGGATATCACT7TGTTATG7TAGTT77CCATGATCAT7AAA TTTAAT7777TAA7GTTTAATT7AAGGGCCGTTT7CAAAACATCACGGCCAG7ATAAATA  9060 9120 9180 9240 9300  9301 9361 9421 94 81 9541  ATAATAACAAATAATGA7GAGAAAGC7GACCT7ATTTTTGCGTCAAG7CCGAT7CTCAGC 9360 TGGGAGAGCTTGT777GCA7GAGCGGGGAGGGGAGCAGGGGGCAGAACGGGGATGC7GAC 9420 AACGTCT7TGCAT7CCC77CCTATTG777G7CGC7AGGACAAGAGAAGGCCA77A0GAAC 94 80 ACTGT7T7TCCAGTTTGTT7GCGTGTGGTG7GG7CCGAAATGACAT7CC7C77T777GC7 954 0 GAGCAAA7G777AC7GA7CGCCAAAC7TCTC77TT7CAA7CC7CAGAGCOGACACAGAAA 9600  9601 9661 9721 9781 9841  GCGATACGACCACACCCC7GAGCACCACA0CA7CCCA0G0GAT77CAGCATCAGCGAT7C 9660 TAGCAGGAGGCAC7C77CCC7TGAAGGACAAC7CGAACA7CCGCGAGAAGCCCC7CCACC 
9720 A7AACTACAACCACAA7AACAACAACAGC7CCCAGCACTCACAC7CGCACCAGCAGCAGC 9780 AACAACAGCAGG7GGG7GGCAAGCAGCTCGAGCGGCCACTAAAG7GCC7GGAAACGC7CG 9840 CCCAGAAGGCGGGAATCACCT7CGACGAGAAATACGATG7GGCCAGCCCCCCGCA7CCCG 9900  6460 CTCCCAACTGTCAGCAT7TTTAGCATTTCGATGGAAAGG7CCT7GAACT7CGCC7GCAAT 8520 TCCCATCGGC7TTATTGCCCTT77ACAATAATA7CGAACGGTGCCAGC7GCG7GGC7AAA 8 5 8 0 T7AG77T7CCGGGC7G77G77G7CGAACG7TGAACGTGGAAAACGAG7GCGACTACCATG 6640 CTCC7CATCGAAAT7CGAAAACCCATAGATAAAGA7CGATC7GAAATGCGAAGGCTGTCA 6700 8760 6820  8880  9901 9961 10021 10081 10141  GCATt GCCCAGCAOCAOGCGACTTC^ 9960 C C C C C A C AAGCCATCGGCACGGAACTCCGCCCACGGGCCGCAGGCAAACCCACACCCCAA 10020 OC AC7CCGAACAGGCCCAGTGC7CXCAGCACACCCAACACTAACTGCAACTCAA7TGCCC 10080 GCCACACCAGCCTCACGC7GGAGAAGGCGCAGAATCCCGGCCAGCAGGTGGCC0CCACCA 10140 CCACGC7GCCACTCCAGATA7CCCCTGAGCAGC7GCACCAG77C7ATCCGAGCAATCCCT 10200  10201 10261 10321 10361 10441  ACGCCAT7CAGGTGAAGCAAGAG^fTCCCACGCACACGACC 10260 TAAAGC ATGCAACCAACAT7ATGGAAG77CAGCAGCAG77GCAGC7GCAGCAGCTG7CGG 10320 AA0CCAACGGTGGAGGAGCAGCCTCGGCCGGAGCCGGA0GAGCAGCTAG7CCGGCCAAC7 10380 CGCAGCAAAGCCAGCAACAGCAGCACTCCAC AGCCATCAGCACCA7G7CGCCGA7GCAAT 10440TGGCAGCGGCCAC7GGAGCAGT7GGCGGGGA77CGACACAGCGAAGCACGC7CCACC7AA 10500  10501 10561 10621 10681 10741  T0CAJlccCTCCACCAG7TTCCTGTiTCCCCAAATGATTGTGTCG&GAAATCT6Tf GCATC 10560 CAGGAGGCC7CGG7CAGCAGCCAATCCAGG7GATCACCGCCGGCAAGCCAT7CCAAGGCA 10620 ACGGCCCCCAGATGCTTACCACCACGAC7CAAAACGCCAAGCAAATGATC0G7GCCCAAG 10680 CGGGAT7CGCTGGCGGAAA77ACGCGACC7GCAT7CCCACAAACCACAATCAATCGCCCC 1074 0 ACACGC7GC7CT7C7CACCCATCAACC7CATT7CCCCACAGCAGCAACAGAACCTGC7GC 10800  22  1 0 8 0 1 AATcJjlTGeCCCCTCCAGCTCASCAGCAQCXACTCACCCJaCACCACCAACACrrTTAACC 1 0 8 6 1 AGCAGCAACAACAGCAGCT7AC7CAGCAGCAACAGCAGT7GACAGC7GCTCTGGCCAAGG 10921 TCGGAGTGGA7GCGCAGGGCAAGC7GGCCCAGAAAGTGGTTCAGAAA0TGACTACCACCA  10860 10920 10980  1 0 9 8 1 G7AGCGCGG7CCAGGCGGCGACGGG7CCTGGA7C7AC7GGGTCAACACAGACCCAGCAGG 11040 1 1 0 4 1 TGCAGCA0GTTCAGCAACAGCAGCAGCAGACCACCCAAACCACTCA0CA07GCGTGCA6G 1 1 1 0 0 1 1 1 0 1 
TT7CCACG7CGACT77GCCAG7CGCTGTOGG7GGACACTCTGTTCACACTGCCCAACTTC 1 1 1 6 0 1 1 1 6 1 TGAACGC7GGCCAAGCGCAACAAA7GCAGA77CCCTGG7TCTTACAGAA7GCTGCAGGAC 1 1 2 2 0 11221 TGCAGCCGT77GGGCCAAACCAGATCA7CC7GCGAAACCAGCCAGACGGAACCCAAGGCA 11280 1 1 2 8 1 TGTTCATTCAACAGCAACCGGCGACGCAGACTTTGCAGACCCAGCAAAACCGTAAGAATA .11340  1 1 3 4 1 TT07CA7G7A7A77GCA7CGGATAGG7ACTAAAGTCAACTATCTTCC7ACAGAGATTATT 1 1 4 0 0 11401 11461 11521 11581 11641  CAATGCAACGTGACCCAGACGCCCACTAAGGCAC&CACTCAACT&GATGCACTT&CTCCC 1 1 4 6 0 AA0CAGCAACAGCAGCAGCAGCAGG77GGCAC7ACCAACCAGACGCAGCAGCA6CAACTA 1 1 5 2 0 GCGGTGGCTACTGCCCAGTTGCAGCAACAGCAGCACCAACTCACTGCAGCAGCTCTGCAG 11580 CGACCAGGAGCCCCTGTCATGCCCCACAATGGAACTCAAGTGCCTCCGGCCAGTTCCCTA 11640 TCCACACAGACTCCCCAGAACCAGAGCCTCCTGAACGCCAAAATGCGCAACAAGCACCAG 1 1 7 0 0  11701 11761 11821 11881  CCGGTGCGCCCCGcfTTAOCCACATTGAJUUkCCGAAATCGGTCAACTCGCAGGAtAAAAT 1 1 7 6 0 AAGG7AG7AGGCCACC7GACCACCG7GCAGCAGCAGCAACAGGCGACGAA7CTCCAGCAG 1 1 8 2 0 G7GG77AA7GCGCCGGGCAACAAG7CAG7A77G777C77TAT7ATC7GCCT7CTCACAC7 11880 AC77AT7777GCA7C777CCAGAA7GG77G7GA7GAGCACAACGGGCAC7CCGATCACCC 1 1 9 4 0  1 1 9 4 1 TCCAGAATGGACAGACCCTTCATGCAGCCACTGCGOCAGGAGTCGACAAGCAGCAACAGC  12000  1 2 0 0 1 AGCTACAACTGTTTCAGAAACAGCAAATCCTCCAACAACAACAAATGTTGCAACAGCAGA 12060  1 2 0 6 1 TTGC7GCCA77CAAA7GCAGCAGCAGCAAGCGGC7GTTCAG0CCCAGCAACAACAGCAGC 1 2 1 2 0 1 2 1 2 1 AACAGGTCTCTCAGCAGCAGCACGTTAACGCCCAGCAACAGCAAGCGGTGGCGCAACAAC 1 2 1 8 0  1 2 1 8 1 AACAGGCAG7CGCGCAGGC7CAGCAACAGCAGAGGGAGCAACAGCAGCAAG77GCCCAAG 1 2 2 4 1 CCCAGGCGCAGCATCAACAGGCTCTCGCGAATCCCACTCAGCAAATCCTTCAGGTGGCCC  12240 12300  1 2 3 0 1 CAAA7CAA77CA7CACG7CCCACCAGCAACAGCAGCAGCACCAACTTCACAACCAACTGA 1 2 3 6 0 1 2 3 6 1 TACAGCAGCAGCTACAGCAACAGGCGCAGGCACAAGTTCAAGCCCAAG7GCAGGCTCAAG 12420  12421 CGCAACAGCAACAACAGCAGCGAGAGCAGCAGCAGAA7ATTATCCAGCAGATTGTGGTGC  124 80  124 81 AACAG7CTGGAGCGAC77C7CAACAGAC77CCCAGCAGCAACAGCACCACCAA7CCGGGC 12540 1 2 5 4 1 AAC7ACAGC7AAGTAGCG7CCCC77CTCAG77TCTTCCTCAACGACGCCAGCCCGAA7AG 1 2 6 0 0  1 2 6 0 1 
C7ACCTC7AGTGC7CTGCAGGCAGCCCTCTCCGCCTCTGGCCCCATCTTTCAGACAGCTA 1 2 6 6 0 1 2 6 6 1 AGCCGGG7AC77GCAG77CC7CC7CCCCCACAAGCAC7C7GG7CACAAT7ACCAACCAGA 1 2 7 2 0 1 2 7 2 1 CCAGCAC7CC777GG7CACCAGCAGTACGG7GGCCAG7A7ACAGCAGGCTCAGACGCAAT 1 2 7 8 0  '12781 C7GC7CAGG7CCACCAACA7CAGCAGC7AA7CAGCGCCACAATTGCCGGAGGGACTCAAC  12840  1 2 8 4 1 AACAGCCACAGGGACCGCCATCACT7ACACCCACCACAAATCCAATTT7GGCCA7GACCT 1 2 9 0 0 1 2 9 0 1 CGA7GA7GAA7GCTACA07CGG7CACCTT7CCAC7GCTCCGCCTGTAACTCTTTCTGTGA 1 2 9 6 0 129 61 CAAGCACCGC7GTTAC77CGTCGCCGGGTCAGCTGGTTCTCTTAAGCACGGCTAGTAGCG 1 3 0 2 0  1 3 0 2 1 G7GGAGGAGG7AGCA7ACCAGCCACGCCCACCAAAGAGACACC77CGAAAGC0CCCACCG 1 3 0 8 1 CAACCC7GG7GCCCA77GG77CGCCCAAGAC7CC7G7A7CAGGAAAGGACACCTGCACTA  13080 1314 0  13201 13261 13321 13381 13441  13260 13320 13380 134 40 13500  1 3 1 4 1 CCCCCAAA7CA7CTAC7CC7GCCAC7G7CAGCGCATCCC7AGACGCCACTAGTTCCACAG 1 3 2 0 0 CCGAAGCCCTGTCCAArGGAGA7GCCTCAOA7AG&7CTTCCACG&TGTCAAAOo6cGCTA CCAC7CCCACCAGCAAGCAAAGCAA7GCAGCAG7GCAGCCACCGAG7AGCACCAC7CCCA ACAG7G7CAG7G0GAAAGAAGAGCCGAAGC7GGCAACC7CCGGCAGT77AACG7CCGCAA CA7CAAC7TCAACCACGACAACGA7CACCAATGGGA77GGAG7AGCCAGAACGACAGCCA CCACGGC7GTC7CAACCGC7AGCACAACCAC7ACCAGTTCTGGCACCTTTATCACAAG7T  1 3 5 0 1 GCACCAGCACAACCACAACCACCACGTCGAG7A7CAGTAAT0GAfCGAAGCA7CTCCCCA 13560 1 3 5 6 1 AGGCGA7GA77AAGCCGAACG7C77AAC7CACG7CA7CGA7G6CT7CA7CA7CCAGGAGG 13620 1 3 6 2 1 CCAACGAGCCA77TCCCG7CACCAGACAGCGA7A7GCAGACAAAGACGTCAGCGATGAGC 13680 13681  CGCCAAG7GAG7A7AAACT7C7GG7ACCAA7GC7777TCGCAATC77AACGTG7CAT7CC  13740  13801 13861 13921 13981  CTCCAGGCTCGGATATGGT7GCTTGCGAGCAG7G7GGAAAGATG&AGCACAAAGCAAAGC TGAAACGGAAGCGC7AC7GTTCGCCAGGA7GC7CGAGGCAGGCAAAGAACGGCA7CGG7G GAG77GGA7CAGGAGAGACGAACGGCC7CGGGACAGC7GG7A7AGTTCCCC7GGCAGCCA TGGCAT7GG7GGACAGGC7GGA7GAAGCCA7GGC7GAGGAGAAGA7GCAGACAGAGGCCA  13860 13920 13980 14040  1 3 7 4 1 TTCGCGCAGAGAAAAAGGCAACCA7GCAGGAGGACATCAAGCTAACTCGAATAGCA7CAG  13800  1 4 0 4 1 CCCCAAACCT77CAGAA7CGTTTCCTATT7TGGGACCC7CAACAGAAC7ACCTCCAATG7 1 4 1 0 0 14101 CACTGCCAG7CCAAGCGGCGATT76TGCGCCCTC6CCTCTTGCAATGCCTCTAGGATCGC 14160  1 4 1 
6 1 CATTGTCAG77GCAC77CCAAC7CTTGCACCAC7G7C7G7AG7CACTTCTGGCGCGGCGC 14220  1 4 2 2 1 CCAAG7C77CGGAAG7GAA7GGAACAGA7CG7CCGCCAA7CAGCAGCTGGAG7G7GGACG 14 280 14281 ATG7CAGCAAC7TCA77CGAGAAC7GCC7GG77GTCAGGAC7ACC7GGACGACTT7A7AC 14 34 0 1 4 3 4 1 AGCAGGAGA7CGACGGCCAAGCGCTTCTGTTGCTCAAGGAGAAGCATTTG6TGAACGC7A  23  14400  14401 14 4 61 14521 14581 14641  TOOCCATCXXCCTC&CTCCACCTcfTAAXATTOTGOCCXXCOTOCXCTCCXTTXXeOXGG TCCCGCCXCCTGGCGXGGCCXXGGXTCCXGGXGCGCXGTAGGGCXGCtXGXGCXCGXXXX CCCGXXXXXGXTGXTCTCCTXXCCGXGCXOTGGXCCTGGTTCXXCCXXGTCTGTCOTGCC AGGTGXTTCTGXTTCXATCGXGCXGGCGXXXXGGXCGCGXXTCCXTTTGCXXXXTXTTXT TAGCATCAGGCCATCAGGTCTTAAACCCATTGACTTTGTACATACTCCCAAGATCACTTA  14460 14520 14560 14640 14700  14701 TAAGCATATTCATTfATAAATTAAACTAACAGTCAACACTCAAAAACGXATOGXATTXCT 14761 TAXXCTAAGGAAAAGCTATGXXTTXXTTGCCXGCCXXGTXXATGGXXTAAAGTACATTTT 14 821 A T X X A T X A G C X T X X G T T T X T X G T C T X X G T X G C X T X T T X A T X X C T C C C X X C G T C X T G G X T X 14 881 C T T T G T X C X X G T X T T T T X X T C T T X G G X A T C X X A T G T A G C A C T A T G A T T G T T C T A A C A A C T 14941 AAGAATTTTAAGCCIATGAATAATAATTGATATCTAAATCTGAATTTGAACTTTIIACTA  14760 14B20 14880 14940 15000  15001 15061 15121 15181 15241  AATXATATTTGXXTGCCTXGXCCTXAGCTTTTTTfTCXACXTTTflTTTTTOCXXTTGCT CXAGXXATTAAAATGGCACTGAATAGTGTTTAATAAATGTGAAAAAGAATTTTCTGAGTT TTTTTXAGTCXCTTGXCCAXTTXCXXTAXTTGXTGTGXTTXGTTTACTTTGATTCGACCA TTTGATCAAATATAGGTGGGAAAAGTGTATAXXXTGCATACCTAGCTTTATCAAATGTAA CATGGAAATACTTCTGACTTACAATGACTAAGAAACCAGAGAGCAGAGCTTTIAATTCAG  15060 15120 15180 15240 15300  15301 15361 15421 1 5 4 81 15541  AATATGXAAATATA^CAJaCT^ 15360 CTACCGXXTCXTCTXCCGXXGXCTTTXXTCXTTCTTTTCCXGCTCXGCXGXXGXXTTGXC 15420 TTXATTTCACTCTGCGCXXXXTXXGTXTXCGTXCXXXCGTGGGCTTCGTTCXACGCAAAA 15480 CTGTTATACTACAATTTCACCACCGGGCTAAGCCCCGCCCTCGAGCACGACAATGTCGCA 15540 CAGAGAGAGAGTGCGXTTAAGCGCCGTATAGCTATCCCTGTCTTTTAGCGGGCACTCCCT 15600  15601 15661 15721  TAOCCAAAXACTAAAATCGAGTCCATTTTTCCCGTCGGCATTTT&TTATTGGCTGAGCAG 15660 
CGACGCTGGCXGXGCOGCGXCTTCACGCTTAGATCGGTGGACTCACCIATGAATAGTCGT 15720 TGTTGTCGAC 15730  24  15731 15791 15851 15911 15971  GTCGTCGTGGCTAT&GGCAAATCAAAATGCAGAACGTTAAATCC&CTCAACACA^GTATG TATCTTTGGTAGTATTTACAAGCATTACTTACAATTTTTCTCGCTCTTTCGCTGGTATAT CTATOAAATGGGATAACAAGATTATATTTTTCACCGAATAATAGTATAATCATTAGTACA GTTACGTCATTCGTACCACAGGTATGAAACGTAACAATCTATT6CAGTCTCAGAATCTTA AACTTTAATAAGTCACCTCAATGTAGCATTAAGAAAATAATTACCACACTGATCTATACA  15790 15850 15910 15970 16030  16031 16091 16151 16211 16271  AACATTGTCTCGACAATCTTATACCAAATAAGAAAAJUUlCATATfTAGAGTCCAATCCAA 16090 CTAGTTTCATCTTCGTTCTGACTTGCTCCTTCACAAACCCTATTATAGTGTTTAGCGGTC 16150 AAGCTCGGTACGATGCTCTCGTAATCCAATTGAATCCTAAATTGOGGCGCTGAGATCATG 16210 GTTGTATTTAATTATGCATCTTATTAAGTAGATATATTGAAGAAATTTATTTCAGATAAG 16270 AACTACTACTGAGATGTAAGAAAGTACAATTAAATGTATGGCAAGGTTTTGTTAGTTGTC 16330  16331 16391 16451 16511 16571  AGTTTCTACAAATTT AAAACATATCTTAACGTTC^TTTGGAACTTTTCCAOCACTAACTT 16390 TTAACTTCGGCTAACAATATATTCAAAAATTGCATTAGATCGTGATACGCAAAACOCAAT 16450 CATTTCTCTTCCCTTTTTTCTTTTGCATTTTGATTTCTTATTAACCTAAATTTAGATATA 16510 ACAGTTGCATGGTAGAATAACGCCCTTCGTGTGGGAGACCCCGAATCCAGATCCGTCTCG 16570 CCAATAAAATCCCTACCGTTTCTGCATCCCTGGAAACTTTGCAGCTACATATCTATTTAT 16630  16631 16691 16751 16811 16871  GTAT&TATGTATGTfTCTAAATACAAATGTTCAA&GAAGATTACACGTTTGCTGAAAAGG 16690 CGAAAGAGGGAAATAACCAACTAAGCAAAATAGAGAGTGAGAGAGAGAGAGAAAGCAAAA 16750 GCACTTAATAAGCATAAAAATAATATGGCGGCTGCAAAAAACAAAAGAAAACGAGAAAAA 16810 ACGAGCCGAGCTAAAAAAGCAATACTAATCGGAATGATGAACACTTGTGGGCCTCAGCGA 16870 CGTTGTTGTTCGAAGCCGTGGGAAGAGTCACAACACCCACTCCAACCAACGCACCCATAC 16930  16931 CACAJIATGATGCACGCACACACACACGGGCTGGAACAAAATGAAAAATGAAATGCCAAAG 16990 16991 ACAGGCACCTTGGGTCATGAACGGTAAATATGTGAAACATTTGAAAAATTAATTAATAAA 17050 17051 17111  TATATTAATATGTAAGTGAAACTCTAAATCTTATATCTATGTTCCTAATAATATAATAAT AATTATTTAATATAATAATAATTATTTAATAATAATAATAATTATTTTACACCATTACTT  17171 TTTTGTTTTACTGGCATAACCCAAOTCGACAGTTATGTACATACAAAGATATTTCTGTTC  17110 17170  17230  17231 TACACAGCTATGTAAATCCTAAGTTTTACGTTGCTGCTCTATAASATTGCTCCAOTGTGC 
17290 17291 CTTTGGTTGCCAAAGTACAACAATAATGCATTCCAATAACAATAGATCGAGTGCCACTGT 17350 17351  17411 CGTCTGTTTCCGACAAGCACAACCAATAATAAAGTAAGTTGTATAAAGTAAGAGCAAGGG 17471 AGAAAGGCGCTAGAGGTTGCCAAGTAGTGGTGGTTGTCAAGGAACCAAAAAAAGAAGATG  GGGAACATATCACTCTGTCTCTGCGCATACOCTACCCCTTCCCTTCGCAAACCCTTCGTT  17470 17530  17531 17591 17651 17711 17771  AAGTATAGCAGCAATCGAATTTGAGTCGGGTATTfTATTTTGTTfTCACAGTGGAAOGOG GCGCACAAGTCGGGGTGGTTGAGCACGCATACACATGCCCGCCAGCTGTAGTCTTGATTC CTGTTTTTTTTACTCGTTTTGGTTTTGCTTTGTCAACTGCTGCCGTTTCATTCCAACCCC AAGCGAACGCACGGCATCCCGCTCGCACGCGCTATCTGACCAGGGCCACTCCTAGTGGCG TCGCACTCGCACCCCTGCATTTCGGTATAGAGAAAAGTTATTAGTCGCAACGGTATCTGA  17590 17650 17710 17770 17830  17831 17891 17951 16011 18071  CAGGAGAGAAGCGGATCCAGCTGCACACCAACACACACACACACACATTCAOCACACATG 17890 CGCTCATTTTGTTTCCGATCCGAGCAAAAGTAGAGTTGTCGCTGTAGCGCGCGGTTCAGT 17950 TTGAAACTTAACTTGGCGGCGTGCGGTTCTTGCGCTCTGCTCTCTTTGCGTTCGTGTTGG 18010 TGTGGTGTCGATGTGTCGCATACCGCATGTGTATTGAACGGOGGAAAAAAAAAGCGCCGA 18070 COCGACGCACACACACCCTACCACCGTTTCGTAGTATTTATTTATATATTTATTTTTGGC 18130  18131 18191 18251 18311 16371  GATCAATGCAAATcScTGCTGACTATAACTGATTAGTCAAAACAATTAATGCTcfGCGGC 18190 GTCTAAAGTTGCGTCGTTTTGAATGTTAGGCATGTACAGGTCCCTCCAAATATACAATAA 18250 TACAGTACAGCAAGGAAAGCAAAAATGAAAACGGTAAAACATATTTTTTCTGATGAGACG 18310 CTGTCGTCGGCGCATGCTCTTTGCCTGTTGTTTGCGCCGCATTGTTGTCATTGCTCCCTG 18370 CCTCTTCGTTTTCTCCTTCTCCTCTTCCTCTGTTTCTCCCTCTTGCAACTCAATTTTTGG 18430  18431 18491 18551 18611 18671  CTAA6GGAAAAATCAAATAATGATGGCATTATTGfATGCCGCCACACGCACATGCACACA CACACATGCACATATCGTGGCGCGGTAGAAAGAAACCAGTCTTTCGGTTGCAAGTTGAAG ATGCTGCTGTGCCTAGGTTTATTACCCCCCTTCCAAATCCGTTCCGCTCTTAOCCTGCCG ATGGAAATGGAAGCCAAAATCCACCCCCGATACACACCAAATGTAACTGAAACTCCTAAC TATTATTGCTGGCGCCGGTTGCTCCTTCCGCTCCTCTGAGCCGCCGCCATTTCTTACTCC  18731 18791 18851 18911 18971  TGGACAACTGGCGAACAGCATTTTGAGCGCCCAGCTTAGGTCTTfGAACCTGTCCCCTTG GATATCACCATGATGGCTTATAACCATCCGCC AAGGGCCATCCGTAAAT AAAAGTGGTCT CCCTTACACCCACTCGAACTTCCTCTCTOCCTAGTAGATGGGAGGAOGTGATGGAGAAGG 
GGAATATGCAGCAGAGAGCAATCCTACTATAAATAGATCTTTCGGGCACAGGCAACGCTG ACAAGTCACCTCCTTTAGGTGTTGGTGTGCATCCACATATGAAAAACAGTGCACTTTTAG  19031 19091 19151 19211 19271  TGGGfcACGAATCATGAACAAATTAA^TATTTATTTACCTGTTfccGTAAAAAGfACTGA 19090 ATACACGGTGGGGCTAGTGTTGTACAAAGTTGTAGCTGCAATCATTTGTAATTTCCCATA 19150 TAATATGTTGGTGTTTAAGCTTACCACGTCTCACCTAAACAGTATCCACTGTAAGTAAAT 19210 TACCTCGTGAGTGAACGATCATCCAAAACAAAGTGTATCTCTCGTTCCCTTCTCACATTT 19270 TCGTCCCGACTTTGTTTTCGTTTTCGTTTTCTTTGGCCTTTTTGGACCCAGATCGATCCC 19330  25  17410  18490 18550 18610 18670 18730 16790 18850 18910. 18970 19030  19331 19391 19451 19511 19571  CTCTfTCTCTTTCTCXATGCTXCT&COTCCCTCTtcCACCCCCTTCTTTCTTTTTCTOTT CCTCGTCCCCTTTTCGCGTTGGAGAGCAAGAGAGAGAGTCTTTC6CGGGAGTAACATGCA GTCGCAGTTTGTTGTTGTCCGCTTTGTTCGTCAAGCOCTAAGAGAATATGAATAAGAATA TAAATATAAATTTAAACATGAATATGAAGCATGCCCCACGGATTCGGAGCCGCGCCAAGT CTGTCCGTGTTTGTTTCTGTTTAAAGCAATTTACTGAT AGTCTGTGTGCCTTTCCTCTTI  19390 19450 19510 19570 19630  19631 19691 19751 19 811 19671  C T C T f TCCAGCAGTGGGGACACCGAAAGTGAATCJIGCAACAACAA7ACGTACAC6ACCAC CTTCCCCGGAAGCCACCACATCTGTAAAGGTCAACTCCACCACTCGCGTGGACCCCCAGC C G C C A C T A A G G T G C C T G G A A A C A C T C G C C C A G A A G G C A G G C A T C A G C T T C G A C G A G G ACT TTGCCAAGAGTCCATCCCAATCGCCCAGCTCTAAGGCAGCACGTGGGTCAGTCCGAACGC CATCAATCAGACGGCGCCACCCACTACTACCGCTCACCAGCAGATCGCCAAGCCCACCCG  19690 19750 19810 19870 19930  19931 19991 20051 20111 20171  A C T C A A A G A C A A C C G G C C G C A A A c f GGAG A A G T C A C A G A G T C C A G C T C A A C C G A f G G C C G CCGCCACCAATGTGCCGCTGCAGATCTCCCCCGAGCAGCTGCAGCAGTTATATCCAAACA ATCCCTACGCCATTCAGGTGAAGCAAGAGTTTCCCACGCACACGACCAGTGGCAGTGGAA CTGAACTAAAGCATGCAACCAACATTATGGAAGTTCAGCAGCAGTTGCACGTGCAGCAGC AGCTGTCGGAAGCCAACGGTGGAGGAGCAGCCTCGGCCGGAGCCGGAGGAGCACCTAGTC  19990 20050 20110 20170 20230  20231 20291 20351 20411 20471  CGGC£AACTCGCAG£AAAGCCAGCAACAGCAGCA6TCCACAGCCATCAGCACCA{'CTCGC 20290 CGATGCAATTGGCAGGGCCCACTGGAGGAGTTGGCGGGGATTGGACACAGGGAAGGACGG 20350 
TGCAGCTAATGCAACCCTCCACCAGTTTCCTGTATCCCCAAATGATTGTGTCGGGAAATC 20410 TGTTGCATCCAGGAGGCCTCGGTCAGCAGCCAATCCAGGTGATCACCGCCGGCAAGCCAT 20470 TCCAAGGCAACGGCCCCCAGATGCTTACCACCACGACTCAAAACCCCAACCAAATGATCG 20530  20531 20591 20651 20711 20771  GTCG8CAAGCGGGATTCGCTGOCGGAAATTACGC6ACCTGCATT£CTTCGAACCACAATC AGTCGCCTCAGACGGTOCTCATCTCGCCGGTGAATGTCATCTCCCACTCGCCACAGCAGC AGCAAAACCTTCTGCAATCAATGGCCGCCGCAGCTCAACAACAGCAACTTACCCAACAGC AGCAGCAACAGCTTAACCAGCAGCAACAGCAGCTCAACCAGCAGCAGCAACAGCAACAGC TGACTGCCGCTCTGGCCAAGGTGGGAGTCGATGCGCACGGCAAGCTGCCCCAGAAAGTCC  20590 20650 20710 20770 20830  2 0 8 31 20891 20951 21011 21071  TTCAGAAGGTGACCACCACCAGCAGCACG6TGCAGCCGGCGACG&CTCCTGGATCTACTG CGTCAACACAGACCCAGCAGGTGCAGCAGGTTCAGCAACAGCAGCAGCAGACCACCCAAA CCACTCAGCAGTGCGTGCAGGTTTCACAGTCGACTTTGCCAGTCGGTOTGGGTCGACAGT CTGTTCAGACTGCCCAACTTTTGAACGCTGGCCAAGCGCAACAAATGCAGATCCCTTGGT TCTGGCAGAATGCGGCGGGCCTGCAACCCTTCGGCTCCAATCAGATCATCCTGCGAAACC  20890 20950 21010 21070 21130  21131 21191 21251 21311 21371  A O C C A G ACCG AACCCAAGGCATCTT C A T T C A A C A 6 C A A C C G G C G A C G C A G A C T T T GCAGA CCCAGCAAAACCGTAAGAATATTGTCATGTATATTGCATCGGATAGGTACTAAAGTCAAC T A T C T T C C T A C A G AGATTATTCAATGCAACGTGACGCAGACGCCCACTAAGGCACGCACT CAACTCGATGCACTTGCTCCCAAGCAGCAACAGCAGCAGCAGCAGGTTGGCACTACCAAC CAGACGCAGCAGCAGCAACTAOCGGTGGCTACTGCCCAGTTGCAGCAACAGCACCAGCAA  21190 21250 21310 21370 21430  21431 CTCACTGCAGCAGCTCTGCAGCGACCAGGAOCCcfcTGTCATGCCCCACAATGGAACTCAA 2 1 4 9 1 GTGCGTCCGGCCAGTTCCGTATCCACACAGACTGCCCAGAACCAGAGCCTGCTGAAGGCC 2 1 5 5 1 AAAATCCGCAACAAGCAGCAGCCGGTGCGCCCCCCTTTAGCCACATTCAAAACCGAAATC 21611 GGTCAAGTCGCAGGACAAAATAAGGTAGTAGGCCACCTGACCACCGTGCAGCAGCAGCAA 21671 CAGGCGACGAATCTCCAGCAGCTGGTTAATGCGGCGGGCAACAAGTCAGTATTCTTTGTT  21490 21550 21610 21670 21730  21731 21791 21851 21911 21971  TATTATCTCCCTTGfCACAGTACTTATTTTTGCAfCTTTCCAGAATGGTTGTGAfGAGCA CAACGGGCACTCCGATCACCCTGCAGAATGGACAGACCCTTCATGCAGCCACTGCGGCAG GAGTCGACAAGCAGCAACAGCAGCTACAACTGTTTCAGAAACAGCAAATCCTGCAACAAC 
AACAAATGTTGCAACAGCAGATTGCTGCCATTCAAATCCAGCAGCAGCAAGCGGCTGTTC AGCCCCAGCAACAACAGCAGCAACAGGTCTCTCAGCACCAGCAGGTTAACGCCCAGCAAC  21790 21850 21910 21970 22030  22031 22091 22151 22211 22271  AGCA^GCGGTGGCGCAACAACAACAGGCAGTCGC6CAGGCTCAG&AACAGCAGA6GGAGC AACAGCAGCAAGTTGCCCAAGCCCAGGCGCAGCATCAACAGGCTCTCGCGAATGCCACTC AGCAAATCCTTCAGCTGGCGCCAAATCAATTCATCACCTCCCACCAGCAACAGCAGCAGC A G C A A C T T C ACAACCAACTGATACAGCAGCAGCTACAGCAACAGGCGCAGGCACAAGTTC AAGCCCAAGTCCAGGCTCAAGCGCAACAGCAACAACACCAGCGAGAGCAGCAGCAGAATA  22090 22150 22210 22270 22330  22331 22391 22451 22511 22571  TTAT&CAGCAGATT&TGGTGCAACAGTCAACTGGAGCCACTTCTCAACAGCACCAGCAOC 2239 0 AACCGCAACAGCAGTCTGGACAGTTGCAGCTTAGCAGCCTGCCGTTTTCGGTTTCACCAT 22450 CCATGACGGCCGAAGATATTGCCGGAATAACATCCAGTCCCCTACAAGAAGCTCTCTCGG 22510 TGTCTGGCGCCATCTTTCAGACAACCAAACCGATTACTTGCAOTTCCTCTACGCTCCCCA 22570 CAAGCAGTGTCCTCACAATTACCACCCAGAGCAGCACTCCTCTGGTCACCAGCAGTACGG 22630  22631 22691 22751 22811 22871  TGGcfeAGTATGCAGCAGGCTCAGAfcGCAAGGTAcfcAGATCCATCAACATCAGCAGCTAA TCAGCGCCACTATTGCCGGAGGCTCTCAACAGCAGCAGCAOCAGCAGCAACTGGGACTAC CTTCACTTACACCCACCACCCCCTCACCTACAACAAATCCCATTCTGGCCATGACCTCGA T G ATG AATGCCACCGTGGGTCACCT A T C C A C T G C C C C A C C C G T T A G T G T T T C T A G C A C C G CTGTCACTCCATCGTCTGGACAGCTGGTCACACTAAGCAGTCCTAGTACCGGTCGAGGAG  26  22690 22750 22810 22870 22930  22931 22991 23051 23111 23171  CAGGCTTTCCAGCCACGCCCACCAAAGAGACACcfTCAAAAGGOCCCACCGCAA&CCTGG TGCCCATTGATTCGCCCAAGACTCCTGTATCAGGAAAGGACACCTGCACTACCCCCAAAT CATCTACTCCTGCCACTGTTAGCGCATCCOTAGAGGCCAOTAGTTCCACAGGCGAAGCCC TGTCCAATGGAGATGCCTCAGATAGGTCTTCCACGCCGTCAAAGGGCGCTACCACTCCCA CCAGCAAGCAAAGCAATGCACCAGTGCAGCCACCGAGTAGCACCATTCCCAACAGTGTCA  22990 23050 23110 23170 23230  23231 23291 23351 23411 23471  CTCGGAAACAAGAGCCGAAGCTGCACAACTGCCGCAGTTTAACGfcCCCAACATCAACAT CAACCACGACAACGATCACCAATGGGATTGGAGTAGCCAGAACGACAGCCAGCACGGCTG TCTCAACCGCTAGCACAACCACTACCAGTTCTGGCACCTTTACCACAAGTTGCACCAGCA CAACCACAACCACCACGTCGAGTATCAGTAA7GGATCGAAGGATCTCCCCAAGGCGATGA 
TTAAGCCGAACGTCTTAACTCACGTCATCGATGGCTTCATCATCCAGCAGGCCAACGAGC  23290 23350 23410 23470 23530  23531 23591 23651 23711 23771  CATTTCCCGTCACCAGACAGCGATATGCAGACAAAGACGTCAGCGATGAGCCGCCAAGTG AGTATAAACTTCTGGTACCAATGCTTTTTCGCAATCTTAACGTGTCATTCCTTCGCGCAG AGAAAAAGGCAACCATGCAGGAGGACATCAAGCTAAGTGGAATAGCATCAGCTCCAGGCT CGGA7ATGGTTGCTTGCGAGCAGTGTGGAAAGATGGAGCACAAAGCAAAGCTGAAACGGA AGCGCTACTGTTCGCCAGGATGCTCGAGGCAGCCAAAGAACGGCATCCGTGGAGTTGGAT  23590 23650 23710 23770 23630  23831 23891 23951 24 011 24 071  CAGGAGAGACGAAC&GCCTGGGGACAGGTCGTATAGTTGGGGTGGACCCAATGGCATTGG TGGACAGACTGGATGAAGCCATGGCTGAGGAGAAGATGCAGACAGAATCATACCAGACAG TATCGGACGCTT7GCCAATTCAAGCGCCTACGCCGGAGCTCCCACCCATTTCGATGCCAG TGCTGGCGGCTATGTCGACATCTTCACCACTTTCGTTGCCCCTGACATTGCCCTTGCCAA TTGCAATAGCTCCCACTGTGTCACTGCCAGTGGTTTCAOCTGGAGTGCTTGCGCCGGTCC  23690 23950 24010 24 070 24130  24131 24191 24251 24 311 24371  TAOCAATACCATCcfCGAATATAAATGGATCCGAfCOCCCTCCCATCAGCAOTT&GAOTG TGGAAGAAGTTAGCAATTTCATCCGAGAACTGCCTGGTTGCCAGGACTACGTGGACGACT TTATACAGCAGGAGATCGACGGCCAAGCGCTGCTGCTGCTCAAAGAAAACCATTTGGTTA ACGCCATGGGCATGAAGCTGGGTCCAGCTCTCAAAATTGTGGCCAAGGTGGAGTCCATTA AGGAGGTCCCGCCAGGCGATGTAAAGGATTAAAAACACGCAACAAAGTCAAGGTTTCAAA  24190 24250 24 310 24370 24430  24 431 24 491 24551 24 611 24671  AGACCGCTTTCTTTACTTTCCCGCGTTTCACCTAAATGTAACGACATTTACTTC6TGAGC GAATGTGATCAGACAGAACAAAGTGAATCACGTTCCGACTCACCACTTCTCACACGACGT ACACCCTAATCATCAGCTACATGCACCTAATCTACAAAGGGAACTCCCCAGAGAGCAACC GGTGCCTGGAATCACTGACTCTGTTGCGAGGCCCATCCCATCCAGAATCTATGCGAGAAA ICCATAATTAGGTGAT6T.AGTTGTTTTTCCCGCACATGACGAAAGCAAGGAATATGACCC  24 490 24550 24610 24670 24730  24731 24791 24 851 24911  TCCTfCGGCGCCGAAGCTCCAGCTAGTTTAAGCACCCCGATCAGACCCCAAGATfCTOGC AATAGTAGAGTCCATGACTCTGTGCGACGAAAAGGACGGGGAGGTTATAGGACCGCTCGC CCCTCCGCGTTCGATCAACAGTCTTCAGCAGTCTACCAGAGTCTGAGGATAGGAGCGGGC AGTATCTGAGCTCTA  24790 24850 24910 24925  27  Figure 3. Optimal alignment of the proximal repeat with the distal repeat.  The alignment was made using the method of Wilbur and Lipman (1983). 
The k-tuple size was 6, the gap penalty was 6, and the largest gap was 20 nucleotides. The top line of the alignment is the proximal repeat sequence, the bottom line is the distal repeat sequence, and the middle line shows the bases conserved in both repeats. Dashes are gaps introduced to maintain the alignment. The overall similarity between the repeats is 70.089%. The total number of bases not opposite a gap is 8264, the number of gaps is 94, and the number of bases opposite gaps is 4227. The overlapping region is 12491 bp long, and the total number of matched bases is 5858.

3510v  3520v  3530v  3540v  3550v  3560v  3570v  TTAGCAGATTGGAAAGGACAATGGCAAATGTCTGCAATGCOCAGAGGGTGAGTTATTAATAATCGAGAAT T G CG A GGCAAAT CTCGTCGTGGCTATAGGCAAATC— — — - —  10* 3590v  3560v  20* 3600v  3610v  3620v  3630v  3640v  AOGCGATATTAAAGGAAGTATOGOCTGTTOGAAAAAGAAGAOCJITTTGOaACTCGACTTACCTGAAATAG  3650v  3660v  3670v  3680v  3690v  3700v  3710v  GTGTTTGATCCATOCAGTTGAAATGCJLAAATAGTTTATAAAACTTTTTACAATATGTAAAATOGTAAGAA AAAATGG A . 
AAAATGGAGAACG  3720v  3730v  3740v  3750v  3760v  30-  3770v  3780v  ATATGACACGAGCATAAGGAATTTOGAATTTTAGAGTGTTTAGAAATCATGAAAAGAATTTTTAAGGTTG TA C A A T TTT AGT TTTA AA T A AATTTTT G T TTAAATCCGCTCAACACATGTATGTATCTTTGGTAGTATTTACAAGCATTACTTACAATTTTTCTCGCTC  40-  50-  3790v  603810v  3800v  70-  60-  90-  100" 3820v  CACCTGAAAAAATAAAGCATTTACTTTTAC TAGTATGAAAAAATG C A TA TAGTAT A A C TTTCGCTCGTATATCTATGAAATOGGATAACAAGATTATATTTTTCACCGAATAATAGTATAATCATTAG  110v  1203840v  1303850v  1403860v  1503870v  1603880v  170" 3890v  1903910v  2003920v  2103930v  2203940v  230" 3950v  240* 3960v  3960v  3990v  4000v  2604010v  2704020v  280* 4030v  3004050v  3104060v  3204070v  330" 4080v  '340" 4090v  350" 4100v  380-  390"  400"  410"  420"  AATTTGATATCCTAAAACAAAAATACAAATAAAAATCAAAGTGGAAAGGAATTTATACTC^ A C A A A AAA T A A T A G A T T TTT TAGAGTTACGTCATTCGTACCACAGGTATGAAACGTAACAATCTATTGCAGTCTCAGAATGTTAAACTTT'  180v  CAGTGGTGATCCCTTGCGATCCATTTTCATTTCTGGCCCCGCCTCAAAACCTTTAGAAGACTTCATTACG A GTGA CCTCAA AAGA A T AATAAGTCA CCTCAATOTAOCATTAAGAAAATAATTAC  250v  ATTGCJJlTAAGCACAACAACAACAACCCACTTTTGGTTATCACATCCGCATGCACACCCGTTGGAAAGGC C A A C AC T TA CA AT G A A T T GA G CACACTGATCTATACAAACATTGTCTCGACAATCTTATACCAAATAAGAAAAAAACATATTTAGAGTGCA  290v  CAACAAAAATGCTGTAGCTAGCTCTGACAATTCCCAGCACCAACAACCACACAACTACAACAACAGCTTC CAA A T T G TCTGAC CC CAC AAC A A A C ATCCAACTAGTTTCATCTTCGTTCTGACTTGCTCCTTCACAAACCCTATTATAOTGTTTAGCOGTCAAOC  360-  370-  •  G TCGGTACGATGCTCTCGTAATCCAATTGAATCCTAAATTGOOGCOCTGAGATCATGGTTGTATTTAATTA  430-  440-  450-  460"  470~  480"  490-  TGCATCTTATTAAGTAGATATATTGAAGAAATTTATTTCAGATAAGAACTACTACTGAGATGTAAGAAAG  500-  510-  520-  530-  540* 550" 5604110v 4120v  •  TTTAAATTTTTATTACAOCAOO TTTAAA T T A G TACAATTAAATGTATOGCAAOOTTTTOTTAOTTGTCAOTTTCTACAAATTTAAAACATATCTTAACGTTO  v  570-  4140v  580-  4150v  590-  4160v  600"  4170v  610*  4180v  620*  4190v  630-  OTGCGACCGAGAGGGAGGAAGGCCAGCGTTAACGTGCCAGAGTGAGAGGCAATATGACAAAAGCACGCGC T G AG A C GC A CAATAT AAA GC 
TTTTGGAACTTTTCCAGCACTAACTTTTAACTTCGGCTAA CAATATATTCAAAAATTGCAT  v  6406506604210v 4220v  670"  680" 4230v  AGCAATTTACGCTCATTAOQCATTTOT A T A T  690* 4240v  CATTTTTACGAAATOCAAXT CATTTT A T AA  7AGATCGTGATACGCAAAACGCAATCATTTCTGTTCCC1TTTTTCTTTTGCATTTTGATTTCTTATTAAC  700v  7104260v  7204270v  • 7304280v  740* 4290v  750" 4300v  7604310v  . 770-  780-  790-  600-  610*  620*  630-  ATCGACTCTCTCGCTCATGCCCTAGGAGTACCATTCGCCCCAAACACAAGGAGCAATAAGTTGAA T A T G A G CGCCC G C A T A CTAAATTTAGATATAACAGTTCCATGGTAGAATAACGCCCTTCGTGTGGGAGACCCCGAATCCAGATCCG  29  '  08  .099T .0»9T .0C9t -0J9T -0X9t .009t 3YXY3930X3X3X9X3X3Y3XYXY3YY999X9XaY339X9Y93X YSYXYY0YYXYY33XXY3 V X X 0 3 X YY 9X9Y93X 3 iY9XXYXXXYX939393XX399X3YYYYXY3Y3Y3YX9X9Y93XXY3Y9YX9939g9339993X9Y93XO AOtCS A00CSA06Z9 09ZS A0iJ9 A992SAQ9Z9 -oast OXYYXYYOYYO — X XYY3YY3 XX93XYY3YY39YXYX9YY9XY9YY9YYYYYYD3Y390XY9X9XX90X90X90X9YX03X9YX99YY99V AQK9 A o e Z S AQXC9 AQO«9 AQSts AQ819 A  X99Y9YX99Y99YYY9Y9XYY99YY9YYYYY9X9XY9X3XY9333Y933XY9339YYYX99YYXYYXYY* A0419 A0919 A09T9 AQKS A0CX9 A02X9 A0tt9 .rust .099X YxsYYYoaoiiooxxiaooiaxoYoa Y 3 X 3 3 3YY3Y39YY3Y9X3YXX9X0X9333Y3Y3Y93XYY3X3XX3333X33XYX093YXY3939X3X339X3X3 AOOIS A0609 A0909 A0t09 A0909 A0909 O»09 .099X .0>9t .0C9T .0J9T .0X9X .0091 .06»X X39XXY0YYXYX3X39X39XXO3YXXXX9YYX33XYYYX9XYX39Y3Y3YX3XX9X3XXXYXY9YYY3Y* 3 XXYOYYXY YY O YXX X O Y X 03XXXY0YYXY YOV0Y0OOXO0OXYY3XXXX3Y33YXX3O3XYOOX A0C09 A0Z09 A0109 A0009 A066* .09K .0L»X -09*T .09>X .0»>X Y3YX9XYXX9Y3Y93X9YY333YYXY399X3YXXXX9XXX XXX3YXXY33Y Y XY Y Y 3YY X XX9XXX XX 3 VYXYYXYY33XYY39YVYX9X3YY39X3YYY3X3XX9XXX3X3YX9X3933YY9XXYYYYYYX3XX9X30 A096» AO{,6> A096» A096» o»6» A0C6* *QZ6f .oe»x .oztx .oxtx ,oo»x .oeex • .09ex .otex 3YXXXXYXXYYXYYXYYXYYXYYXXXYXXYYXYYXYYXYXYYXXXYXXYYXYYXYYXYXYYXYYX33XXO Y XYYYYX X X X X Y Y XYX XX Y X Y Y Y X X XXYYYYY09XXXSYYYYYXYXSXYXYXXX0XXY3YYX93XYXX3YXXXXY3XXXX03YYYX3YXXXXYX3 A  A  A0I6*  A00g»  A069»  A089»  *0LQ*  AO»9>  A0C8»  *0Z9>  A0t8»  A009>  *099f  A098*  .09CT .09CX .OKX .0C6X .OZCX 
.OXEX -00CX XYX3XYXYXX 3XYYYX3X3YYY9X9YYX9XYXYYXXYXYXYYYXYYXXYYXXYYYYY9XXXY3YYY0 X XYXYXX Y X3 YY Y X YXYYXX Y Y YY XYYXXY YYYOXXXY XXYYXYXYXXYX3X9XYXX3YXYY3X3YY3XYYYXYYXX3YYYYXX3YYYXYYXXY3YYY9XXXY  *06l>  .0621 .082X -OUt .092X .09ZX .OKX .OCZX X9XYXYYYX993YY9XY3X999XX33Y399Y3Y0YYY339XYYY0XYYYVY9XYYYY3YY99X300O3Y3 XY XOO Y9XY3X999X 3Y300Y3Y0VYY3 9XYYY91YYYYY0XYYYY3Y 09X39093Y3 XY99X99X9Y9XY3X999X9Y3Y399Y3Y9YYY3Y9XYYY9XYYYYY9XYYYY3Y399X39993Y3 A<J9I> AOLL* A09t» A(J9J> AQtLt *0ZLf *QZLf .0Z2X -OXZX .OOZX .06XX .08XX .OIXX Y0Y3Y3Y393Y39XY9XYYY3Y33YXY333Y393YY3 3YY03X9Y933Y9YY9Y9X9Y9YY . 3V3Y3Y39 Y39X 9XYYY3Y 3YXY3 3V393YY3 YY3 Y9YY9Y X0Y9YY 93Y3Y3Y399Y39X99XYYY3YX3YXY3Y3Y393YY39YYY3339Y9YY3YY33Y9YY9YY9YYX9Y9YY AOU*  A00t»  A069>  A099»  A0L9»  A<)99*  AO99>  .09XX .09XX .0*XX .OCXX .OSXX .OXXX .OOXX 999X9999YY99XX9XX9XX93Y930Y3X33999X9XX3Y3YY9XY9XYY993XYYX3YXYY39YYYYYY' 999X99 9 Y99XX9XX3XX93Y939Y3X3 999X9XX Y Y9XYY993XYYX 39YYYYY 999X99Y9XY99XX9XX9XX93Y939Y3X39999X9XX-XY3YYYY9XYY993XYYX 39YYYYY0»9* A0E9» A0J9» A0X9> A009> A069* .060X .090X .0£0X .090X .090X .0*01 .OCOX X39Y9339Y93YYYYYY9Y93YYYY9YYYY9YYYYYY39X399399XYXYYXYYYYYXY39YYXYYXX3V Y 339Y 3Y Y Y3Y93YYY 9YYYY9YYYYYY30 399399XYXYYXYYYYYX 39VY YYXX3Y --3YY333YY3Y9Y93Y3Y93YYY39YYYY9YYYYYY39Y393339XYXYYXYYYYYX339YYYYYXX3Y A099* A0iS> A09S> A099» *0*S* *0ZSf A<)J9> -0Z0X .OXOX .OOOX .066 .086 39YYYY39VYY9Y9Y0Y9Y9Y9Y9X9Y9Y9YXYYYY39YY-X3YY33V 39YYYY3 9Y3Y9Y9Y9Y9Y9 9Y9Y9Y YY39YY 3Y 3 39YYYY3 9Y9Y9V9Y9Y9Y939Y3Y9Y399YY39VYY33YY3YY9Y9XYY39XY3Y3Y3Y3YXY30 A0X9* A009» A06»» A08»^ *0Ltf A<)9>* A .066 .096 .096 .0*6 .0C6 .0Z6 .0X6 YXYYY999Y9YYY90g9YYYY9X99XXX93Y3YXXY9YY99YY0XX9XYYY3YXYYYX9XXX9XYX9XY* Y YY Y9 YYYY X Y3Y X 9 Y 9 Y YY 9 OY YY^YYYYYY933X39YYYYYYY9YXYYX3YY3Y3X99Y99Y99YY03993YYXYYY999Y9YYY9399YT A  A0>>*  A0C»»  A02»*  A0X»»  A00»»"  A06O  A0(.C»  AQ9C9-  A09C8  AQ»C»  AQCCt  AQZC»  A  .006 .068 .098 .Oti .098 .099 .0*9 9XYX9XYXXXYX9XYXY3YX39Y39XXX3YYY99X333XY39X3XXX939YX333XYYYYXYY3993X3X. 
XX X X 9 XX OX 3 9 Y Y XYYD99 YY90Y9X9XX3YY9XX0X9XX9XX3XX3Y93X30XY3939YYYYgYY9YY390XYYYXYXXYY399---1  5320v 5330v 5340v 5350v 53e0v 5370v 5380v OTACATACCCGSTACCCACAXACTAAAAS0GTATACTCG0CACTT66GTTTAACTCCAATTTCTCTTAGT G C C AA G T C G OCTACCCCTTCCCTTCOCAAACGCTTCOTTCGTCTGTTTCCGACAAGCACA — 1660" 1670" 1680" 1690" 1700" 5390v 5400v 5410v 5420v 5430v 5440v 5450v AGTCAGCACCAATAACTTACAAATAAATATTTAAGAAGGGTTTTCATTTTAAGGATACOACTGAAGTTAG ACCAATAA A AA TAT A AAG G A C GA AG ACCAATAATAAAGTAAGTTCTATAAAGTAAGAGCAAGGGAGAAAGGCGCTAGAGGTTGCCAAG 1710" 1720" 1730" 1740" 1750" 1760" 5460v 5470v S480v CGAGGAATGGTATATG AAAAAG OGGTATT G G T T AAAAAG GGGTATT TAGTGGTCGTTGTGAACGAACCAAAAAAACAAGATCAAGTATAGCAGCAATCGAATTTGACTCGGGTATT 1770" 1780" 1790* 1800" 1810" 1820" 1830" 5490v 5500v 5510v 5520v 5530v 5540v TGAAAGTCGAGTCACCAGACCATCTTATGTTTTCAGCAACGGAAGGGGCCTTTC-AGTCGGGGTGGTTGA T A TGTTTTCA GGAAGGGOGC C AOTCGGOGTGGTTGA TTATTT TCTTTTCACAGT-OGAAGGGGGCOCACAAGTCCGGGTGGTTGA 1840" 1850" I860" 1870" i860" v 5560v 5570v 5580v 5590v 5600v 5610v GCACGCATACACATGCCCGCCAGCTGTAGTCTT——CCTCTTTTTTTTACTCGTTTTGTTTTTGCTTTG GCACGCATACACATGCCCGCCAGCTGTAGTCTT CCTGTTTTTTTTACTCGTTTTG TTTTGCTTTG CCACGCATACACATGCCCGCCAGCTGTAGTCTTGATTCCTGTTTTTTTTACTCGTTTTGGTTTTGCTTTG 1890" 1900" 1910" 1920" 1930" 1940" 1950" 5620v 5630v 5640v 5650v 5660v 5670v 5680v TCAAACGAACTTCTGCCGTTTCATTCCAACCCCAAGCGAACGCACGGCATCCCTCTCGCACGCGCTAACT TCAA G CTGCCGTTTCATTCCAACCCCAAGCGAACGCACGOCATCCC CTCGCACGCGCTA CT TCAACTG CTGCCGTTTCATTCCAACCCCAAGCGAACGCACGGCATCCCGCTCGCACGCGCTATCT 1960" 1970" 1980" 1990" 2000" 2010" 5690v 5700v 5710v 5720v 5730v 5740v 5750v COCCAGGGCCACTCCTAGCGGCGTCGCACTCGCACCCGTGCATTTCGGTATAGAGAAAAGTTATTAGTCG G CCAGGGCCACTCCTAG GGCGTCGCACTCGCACCCGTGCATTTCGGTATAGAGAAAAGTTATTAGTCG GACCAGGGCCACTCCTAGTGGCGTCGCACTCCCACCCGTGCATTTCGGTATAGAGAAAAGTTATTAGTCG 2030" 2040" 2050" 2060" 2070* 2080" 5760v 5770v 5780v 5790v 5800v 5810v 5820v CAACGACATCTGAGAAGAGAGAAGCGGATCCAGCTGCACACCAACACACACACAT-—ATTCAGCGCAC CAACG ATCTGAGA 
GAOAGAAGCGGATCCAGCTGCACACCAACACACACACA ATTCAGC CAC CAACGGTATCTGAGAGGAGAGAAGCGGATCCAGCTGCACACCAACACACACACACACACATTCAGCACAC 2100" 2110" 2120" 2130" 2140" 2150" 5830v 5840v 5850v 5860v 5870v 5880v 5890v ATGCGCTCATTTTGTTTCCGATCCGAACGAAAAGTAGAGTTGTCGCTGfGGCGCGCGGTTCAGTTTGAAA ATGCGCTCATTTTGTTTCCGATCCGA C AAAAGTAGAGTTGTCGCTGT 6CGCGCGGTTCAGTTTGAAA ATGCGCTCATTTTGTTTCCGATCCGAGC-AAAAGTAGAGTTGTCGCTGTAOCGCGCGGTTCAGTTTGAAA 2170" 2180" 2190" 2200" 2210* 2220" 5900v 5910v 5920v 5930v 5940v 5950v 5960v CTTAACTTGGCGGTGTACGGTTCTTGCGCTCTGCTCTCTCTGCGTTCGTGTTGGTGTGGTGTCGTATGTG CTTAACTTGGCGG GT CGGTTCTTGCGCTCTGCTCTCT TGCGTTCGTGTTGGTGTGGTGTCG ATOTG CTTAACTTGGCGGCGTGCGGTTCTTGCGCTCTGCTCTCTTTGCGTTCGTGTTGGTGTGGTGTCG-ATGTG 2240" 2250" 2260" 2270" 2280" 2290" 5970v 5980v 5990v €000v 6010v 6020v TCGCATACCCCATCGCATGTGTATTGAACGAAG AAAAAACCCGCCGACGCGACATACTCACTGGATA TCGCATACCGCAT GTGTATTGAACG C AAAAAA CGCCGACGCGAC AC CAC TA TCGCATACCGCAT CTGTATTGAACGCGGGAAAAAAAAAGCGCCGACCCGACGCACACACACCCTA 2300* 2310" 2320" 2330" 2340* 2350" 2360* v €040v 6050v 6060v 6070v 6080v CCCTGCAAATTTGTTAA ATTTTTTTCAAAAAGAGCTAACGGTGCTGTTGATTAGCT CC T TA ATTTTT CCACCGTTTCGTAOTATTTATTTATATATTTATTTTT-----------»---———————— 2370" 2380" 2390" 6090v 6100v 6110v 6120v 6130v €140v 6150v AGTGCAGATGTGCAGACATAAAAAGTGATGCCGCGCCACAGTGGAGCCCCTAGCTGGCGAATCGTCGCTG 6160v 6170v ClBOv 6190v 6200v 6210v 6220v OCGACGTAGCTAGTGCAGTTAAAACAAGTACTTAGTGCTGTGACTGTGGCTTAATTTTATGTAAGATCGC 6230v 6240v 6250v 6260v €270v 6280v 6290v OTGCAC^OGTOCTTAGTCGTTACACTGAGAAAAAOAAAAGTGCCTGCAGCCAGCGOOGAGAAGTTGAAGT  31  €300v $310v. 
6320v C330v 6340v 6350v 6360v OGAOTCACTCTCACTOCAOOTTGTACTAATTACTACTOTCCCTCCAWSAAACACTCATrTACACOCAXOC 6370v 6380v 6390v 6400v 6410v 6420v ACACAr^CACACACACACACACACGAGACACTGGGACTTTTGTCGGATTTTCGTATT  6430v  6440v 6450v 6460v 6470v 6480v 6490v 6500v TTTGTTGTCTTGCGTTCCACGTTACATACATATGTATATGTTTGGTGTTGCCTGTTGTATGTTTATATTT 6510v 6520v 6530v 6540v 6550v 6560v 6570v ATTGCCTTCACACATCTGCGTGTTTGTTAATGTACAATATAATACGGCAAATAGCAAAAAGAGAAGAAAC 6580v 6590v 6600v 6610v 6620v 6630v 6640v 9GAGGACTAAAAAGAAAGCGCCATGCCGACTATAACAAAAACAACCACCAGTGTCTCCGCCGCCGCCGAA 6650v 6660v 6670v 6680v 6690v 6700v 6710v JUCAAAAACAAAACGCCTAAGCCGACGCATGCATACCTATAATTTATTATAAATATTCTTTTTATTTTGA 6720v 6730v 6740v 6750v 6760v 6770v 6780v ATAATGGATCGTCGTGCATTGAAGTTTATGCAAAAAAGGTATTTTTTGTTIAGTTTGGTTTTATTTTTAT 6790v 6800v 6810v 6820v 6830v 6840v 6850v CACTGCTTTTGTACTGCCTGCACTTGTGCTTTTTGTTTACAATTTTTGGTTTAATCTGCTCTTGAGCATT 6860v 6870v 6680v 6890v 6900v 6910v 6920v (3GATATTGATTCTATATCGTATTCGGATAATAATGACACGTTAATCATAATGCTCGTAAAAATGCAGCGA 6930v 6940v 6950v 6960v 6970v 6980v 6990v GTGGATATTCTGTGCTCTCTGTCATCATAAAATCCGCTCGACTCGCGGGCCAACAAAAGAAAATAAATTC 7000v 7010v 7020v 7030v 7040v 7050v 7060v ATGAAACCCAAAACGAGTTTCCCCTCCGTCGCCCTCTCCGCCATTCATCATCCAACCGACACACCCCCCC 7070v 7080v 7090v 7100v 7110v 7120v 7130v COCTOTGCOOCGTTGTTGCATTTTTAACAACGAATTTCOCAATOCATGCCCGTGTCTCCATTTGTGTGTG 7140v 7150v 7160v 7170v 7180v 7190v 7200v TGCGTCGGCTGGTGGTGGTTGATTTCTCCCTCTCTCCGTGCAACCCTCTCTCACTCGCGCAACAAAGCAA 7210v 7220v 7230v 7240v 7250v 7260v 7270v CATAAAGAAAACAAAAGTTCTATATCOAAAGCTATCACTTTTTCTTGTTGCTGCGOCAACCCTGCTTACA 7280v 7290v 7300v 7310v 7320v 7330v 7340v TTTTGCTTTACTCTGCCGAAATAAACAGTCTAOGGTTGTCAGCTGTAAMAATCTGAGAAATCTGTGTTC  32  7350v 7360v 7370v 7380v 7390v 7400v 7410v CCOTTTAAOCCAACATTCAATTTTATTGAATXCTTTTATXTTTCTTACACTCCCTTATCACTOAATOCTA 7420v 7430v 7440v 7450v ATTTTACCTACAAACTTAAATTCTTCGTOOCAAGAAAA^  7460v  7470v  7490v 7500v 7510v 7520v 7530v 7540v TTTAAAGCACAACAGTTTTTGTXAATXXXTXGTTGXCXXCTCTGTTTXCXCCTXCOCT  7480v  7550v  7560v 7570v 7580v 
7590v 7600v 7610v 7620v CTCTGXCrTTTATGTGACTGTGTGTACGGCCTGGCCATAGCCTCATCTCCn'TCACTCACATTTGTTTCATC 7630v 7640v 7650v 7660v 7670v 7680v 7690v TTCTTCGATTGCAGAACTCGTTTGTGTATCAATCAATAGGCTGTCAA0GCCCCCCCCCC7?CCTCGCTCGC 7700v 7710v 7720v 7730v 7740v 7750v 7760v TCCCGCGCATTTTGCACGCCCGXATCCACTOGXOGTGXTGTTTGXTTTGXOCGCGACCTTTCCCCCGCAC 7770v 7780v 7790v 7800v 7810v 7820v 7830v CCCTATCGGCGTGTCTGTGTGTGTGTGTGCTGCCCCGCATTTTGATCATTCCTTTCCCACCOCTACTOCT 7840v 7850v 7660v 7870v 7880v 7890v 7900v TTCCCCCCGATTGCAAGCAGGCTGTTTTACGCGGGTTCTCACACACTCGAGCTCGAATGTATCTACCCTA 7910v 7920v 7930v 7940v 7950v 7960v 7970v CTCCATGGCGXCXTGGXAACCAGTCTTTTTTTTTCTCTCAAGAGGTTTTTCGCAOCOTOCOOCCCTGAGC 7980v 7990v BOOOv 8010v 8020v 8030v 8040v TTAACTTACACACACTTGCACACGCGCACCCACCACCCTGTGGCGAGTATTCCACCCCTCTGGACCACCC 6050v 8060v 8070v 6080v 8090v BlOOv 8110v ACCCCATATTCCCTTTCTTTIACGGGGTAOCAGCGACATGAOGGTTGCCAAAATGCTTTTACCGOCTTTT 8120v 8130v 8140v 8150v 8160v 8170v 8180v ATAACCCATTTCGCCCCTTTTTCTCTCTTTTTTGCACTCATGCTTTTTATGCTCTTCCTGTTTATGGGCC 8190V 8200v 8210v 8220v 8230v 8240v 8250v TTGCOGGCTTTTTGGGCGATOGAAAOGGGTGGGGATCAGOCTCTCTOGACTGCCGGGCATCACAGTCGCG GGCGAT 24008260v 8270v 8280v 8290v 8300v 8310v 8320v CTCAXTOCXXTXGGXCCTTGXXACCACOCTTCTCCACTIAGATCATTCATAGTTOXXCTXTXTCAOGGAA CAXTGCAX G TG ACTATX G A —CXATGCXXXTCGGTGCTG ACTATAAGTGXTTX 2410" 2420* 2430* 8330v 8340v 8350v 8360v 8370v 8380v 8390v CTGXTTCA6GAAGTAATATTACTTAATTATTTCTAGAAAAACATCCTTACCATGTGGAGTACTCAGCCAT TGA T T T G T G GTAC C T GTGAAAACUJLTTAATGCTGTGCGGCOTCTAAAGTTGCGTCGTTTTGAATGTTAGGCATGTACAGGTGCCT 2440245024602470" 24B0* 2490" 2500"  33  8400v 8410v 8420v 8430v TTACCTTTCTCCACTAATCCTATTASCTTCAOCACTACTCTA A T T A T A C A C A T CCAAATATACJUITAATACAGTACAGCAAGGAAAGCAAAAATG 2510" 2520" 2530" 2540" 2550"  8440v 8450v TAAAACACTAACCATCTCGTG TAAAACA TCT TG  2560" 2570" 8460v 8470v TTOCTCCCAACTCTC TTGCTCCC C AGACGCTGTCGTCGGCGCATGCTCTTTGCCTGTTGTTTGCGCCGCATTGTTGTCATTGCTCCCTCCCTCT 2580* 2590" 2600* 2610* 2620* 2630* 2640" 8480v 8490v 8500v 8510v 8520v 8530v 6540v 
AGCATTTTTAGCATTTCGATGGAAAGGTCCTTGAACTTCGCCTGCJU^TTCGCATCGGCTTTATTGCCCTT TTT T T C T GT TG T GC AT TCGTTTTCTCCTTCTCCTCTTCCTCTGTTTCTGOTCTTGCAACTCA^ 2650" 2660" 2670" 2680* 2690" 2700" 2710* 8550V 6560v 8570v TTACAATAATATCGAACOGTOCCAGCTG CGTGGC AATAAT G CGTGGC AATAATGATGGCATTATTGTATGCCGCCACACGCACATGCACACACJICACaiTGCACATATCGTGGC 2720" 2730" 2740" 2750" 2760" 2770" 2780" v 8590v 8600v 6610v 8620v 6630v 6640v TAAATTAOTTTTCCGGGCTGTTGTTGTCGAACGTTGAACGTGGAAAACGAGTGCGACTACCATGCTGCTC T T TT CT C A GTTGAA ATCCTGCT OCGGTAGAAAGAAACCAGTCTTTCGGTTGCAAGTTGAAG ATGCTGCTG 2790* 2800* 2810* 2820* v 8660v 8670v 8680v 6690v 8700v 8710v ATG<JAAATTCGAAAACCCATAGATAAAGATCGATOTGAAATGCGAAOGCTGTCAGGCAOCTC«TTTCTCT A T A ACCC AA TGCCTAGGTTTATTACCCCCCTTCCAAA—— 2840* 2850" v 8730v 8740v 8750v TCATTCGTTGAATGCGAAAGTCCCTTAAGTAGTTGG TCCGTT G TT G TCCGTTCCGCTCTTAGCCTGCCGATOGAAATGGAAGCCAAAATCCACCCC 2860" 2870* 2880* 2890" 2900* 6760v 8770v 8780v 8790v 6800v 8810v CCAAATGTTr^TTGTCCCTGAAAAAGGGCGGGTTCGGAAAAAGGTCCACAAAAAGAACGAA CCAAATGT CTGAAA T G CC CCATACACACCAAATCTAA CTGAAACTCCTAACTATTATTGCTOGCGCCGGTTGCTCCTTCCG 2920* 2930* 2940* 2950* 2960* 2970* 6820v 8830v 8640v 6B50v 8660v 8870v 8680v TATTTTCGGGATAGGGATTCGGATCGGGATCGGGCTCGGGATCTCTACCTAAAAGTTAACCAGGGTACAC T G G GG T GGC G AC G CCGT CTCCTCTGAGCCGCGGCCATTTCTTACTCCTGGACAACTGGKjAACAGCATTTTGAGCGCCCAGCTTAGG 2980* 2990* 3000* 3010* 3020* 3030* 3040* 8890v 6900v 8910v 6920v 8930v 6940v 8950v AATTCTTAATTAGAAATTGAAATGACATGCTACATATCTTCAGGCTTGACCTTGTGATTAAGTCOGAACT TT A T T A T C A T T A C T C C A AA TCTTTGAACCTGTCCCCTTGGATATCACCATGATGOCTTATAACCATCCOCCAAGGGCCATCCGTAAATA 3050* 3060* 3070* 3080* 3090* 3100* 3110* 8960v 8970v 8980v 6990v ATAGTTTTGATATTATCCCGAOTTAA AGGTGATC6TTACCT A AGT T T C C T AGGTCAT C A AAAGTGGTCTCCCTTACACCCACTCGAACTTCCTCTCTGCGTAOTAGATGGGAGGAGGTGATGGAGAAGG 3120* 3130* 3140* 3150* 3160* 3170* 3180* 9000v 9010v 9020v 9030v 9040v 9050v 9060v CACACTCATATTTACTTTTCCCOCCTTTOTTTCTGCCACTCTCGAAAOTCATCTTTTATCTTTACATTOG A A C T T T ATGTTT 
CGAATATGCAGCAGAGAGCAATCCTACTATAAATAG ATGTTTCOGGCAC 3190" 3200" 3210" 3220" 9070v 9080v 9090v 9100v 9110v 9120v AAATATGATTATTTAATAAGAGCGTTTTTGTTTGCTAGTTCCCGCATTTGTCTCA CACTTTTT A A T ATA TTT GT T GT C GCA T T A CACTTTT AGGCAACGCTGACAAGTGACCTGCTTTAGGTGTTGGTGTGCATGCACATATGAAAAACAGTGCACTTTTA 3240* 3250* 3260* 3270* 3280* 3290* v 9140v 9150v 9160v 9170v 9160v TGAATOACTAAAACCAGATTTTTG TATTTACOUlGTAGCCATATTTOCTTAAACAAAAAAA AA C GA TATTTACC T A A T AA A A OTCWGTCACGAATCATGAAWUUITTAAAATATTTATTTACCTGTTCCOTAAAAAGTACTGAATACACGGT 3310* 3320* 3330* 3340* 3350* 3360* v 9200v 9210v 9220v 9230v 9240v 9250v AAAAAAAAAAACOGATATCACTTTGTTATCTTAGTTTTCCATGATCATTAAATTTAATTTTTTAA A G A A TT ATCATT TAATTT A OGGGCTAGTCTTCTACAAAGTTGTAGCTGCA ATCATTTG TAATTTCCCATATAAT 3380" 3390" 3400" 3410" 3420*  34  9260v 9270v 9280v 9290v 9300v TGTTTAATTTAAGGGCCCnTTTCJIAAACATCACGGCCAOTATAAATAATAAT---TGTTTAA T A C T C AA CAGTAT A TAA ATGTTOOTGTTTAAGCTTACCACGTCTCACCTAAA CAGTATCCACTGIAAGTAAATTACCTC 3430* 3440* 34503460* 3470* 3480* 9310v 9320v 9330v 9340v 9350v AACAAATAATGATGAGAAAQCTGACCTTATTTTTGGCTCAACTCCGATTC AACAAA T C TTC I7TGAGTGAACGATCATCCAAAACAAAGTGTATCTCTCGTTCCC — TTC 3500351035203530* 9360v 9370v 9380v 9390v 9400v 9410v 9420v TCAGCTGOGAGAGC TTGTTTTGCATCAGCGGGGASGGGAGCAGGGGGCAGAACGGGGATGCTGAC TCA T G C TTGTTTT T G G A G C TCACATTTTCGTCCCGACTTTGTTTTCGTTTTCGTTTTCTTTOOCGTTTTTOGACCCAGATCGATCCCOT 3540355035603570~ 35803590" 3600* 9430v 9440v 9450v 9460v 9470v 9480v AACCTGTTTGCAWOCCTTCCTATTGTT-TGTCGCTAGGACAAGAGAAGOCCATTAGGAACACTCTTTTT TT C T T TGTCGC C CTTTGTCTTTCTCAATGGTAGTGCGTCCCTGTCGCACCCCCTTCTT 361036203630* 3640* v 9500v 9510v 9520v 9530v 9540v CCAOTTTCTTTGCCTGTCGTGTOGTCCGAAATGACATTGCTCTTTTTTGCT TCTTTTT T —---------------------------------------TCTTTTTCTGTTCCTCOTCCCCTTTTCGCG 3650* 3660* 3670* 9550v 9560v 9570v 9580v 9590v 9600v GAGCAAATGTTTACTGATCGCCAAACTTCTCTTTTTCAATCCTCAGAGCGGACACAGAAAOCOAT GAGCAA G T T C G G G T TTGGAGAGCAAGAGAGAGAGTCTTTGGCGGGAGTAACATGCAGTCGCAGTTTGTTGTTGTGCGGTTTGTT 3690370037103720* 
37303740* 9610v 9620v 9630v 9640v 9650v 9660v 9670v ACGACCACACCCGTGAGCACCACAGCATCCCAGGGCATTTCAGCATCAGCGATTCTAGCAGGAGGCACTC CC A A A T A A G AG A GC C CCTOUiGCCCTAAGAGAATATGAATAAGAATATAAATATAAATTTAAACATGAATATCAAGCATGCCCCA 3760* 3770* 3780* 3790* 3800* 3810* 9680v 9690v 9700v 9710v 9720v 9730v 9740v TTCCCTTGAAGGACAACTCGAACATCCGCGAGAAGCCCCTCCACCATAACTACJUICCACAATAACAACAA T G A G C C G T A AC A CGGATTCGGAGCCGCGCCAAGTGTGTCCGTGTTTGTTTGTGTTTAAAGCAATTTACTGATAGTCTGTGTG 3630* 384036503860* 3870" 36609750v 9760v 9770v 9780v 9790v CAGCTCCCAGCACTCACACTCGCACCAOCAG CAOCAACAACAGCAGGTGG C TCC CTC CCAGCAG CAGCAACAACA A GT CGTTTCCTCTTTCTCTTT' CCAGCAGTGGGGACACCGAAAOTGAATCAGCAACAACAATACOTAC 390039103920393039403950" 9800v 9810v 9820v 9830v 9840v 9850v 9660v. GTGGCAAGCAGCTCGAGCGGCCACTAAAGTGCCTGGAAACGCTCGCCCAGAAGGCGGGAATCACCTTCGA C A C C C G GCCAC A A C C C GCG G A C CC CG ACOVCCACCTTCCCCGGAAGCCACCACATCTGO'AAAGGTCAACTCCACCACTCGCGTGGACCCCCAGCGG 3960* 3970* 3980* 3990" 4000401040209870v 9880v 9890v 9900v 9910v 9920v 9930v CCAGAAATACGATCTGGCCAGCCCCCCGCATCCCGGCATTGCCCAGCA-OCAGGCGACTTCAOGAACAOC A AA G G C C CCCCAG A GCAGGC C C AC CCACTAAGCTGCCTGGAAACACTC GCCCAGAAOGCAGOCATCAOCTTCGACGAG 4030404040504060* 4070* 9940v 9950v 9960v 9970v OCCCAGCAACGGGATCAGGCTCAOTCACCCCCACAAGCCATCGG GCCCAGC G C G C A GCCATC (^CTTTGCCAAGAGTCCATCCCAATCGCCCAGCTCTAAGGCAGCACGTGGGTCAGTCGGAACGCCATCAA 4090410041104120* 4130* 4140v 9990v 10000V 10010V 10020v CACOOAACTCCGCCCACOOGCCGCAOOCAAACCCACACC CCAAOCACTCC G C ACCCAC C CCAAGC C CC TCAOACOOCOCC ACCCACTACTACCOCTCAOCAOCAOATCOCCAAOCOCACC 416041704180" 4190" v 10040v 10050v 10060v 10070v 10060v 10090v 4iAACAGOCCCAGTOCTC^CAGCACACCCAACACTAACTGCAACTCAATTQCCCGCCACACCAGCCT ACTCAA C C C CA CG ACTCAAAOACAACCOOCCOCAAA 4210* 4220v lOHOv 10120v 10130v 10140v 10150v 10160v CTCGAGAAGGCGCAGAATCCCGGCCACCAGGTGGCCGCCACCACCACGGTGCCACTGCAGATATCCCCTG CTCGAGAAG C CAGA TCC G CA C C TGGC OCC CCACCA CTGCC CTCCAGAT TCCCC G CTGGAGAAGTCACAGAGTCCAGCTCAACCGATGGCOCCCGCCACCAATGTGCCCCTGCAGATCTCCCCCG 4230* 
4240* 4250* 4260" 4270* . 4280* 4290-  35  v 10180v 10190v 10200v 10210v 10220v 10230v AGCAOCTGCAGCAGTTCTATGCGAOCAATCCCTACOCCATTCAOGTGAAOCAAGAOTTTCCCACGa^CAC AGCXGCTGCXGCXGTT TATGC A CAATCCCTACGCCATTCAGGTGAAGCAAGAGTTTCCCACGCACAC AGCAGCTGCAGCAGTTATATGCAAACAATCCCTACGCCATTCAGGTGAAGCAAGAGTTTCCCACGCACAC 4300* 4310* 4320* 4330* 4340* 4350* 4360* v 10250v 10260v 10270v 10280v 10290v 10300v CACCAGTGGCAGTGGAACTGAACTAAAGCATGCAACCAACATTATGGAAGTTCAGCAGCAGTTGCAGCTG GACCAGTGGCAGTGGAACTGAACTAAAGCATGCAACCAACATTATGGAAGTTCAGCAGCAGTTGCA TG GACCAGTGGCAGTGGAACTGAACTAAAGCATGCAACCAACATTATGGAAGTTCAGCAGCAGTTGCACGTG 4370* 4380* 4390* 4400* 4410* 4420* 4430* v 10320v 10330v 10340v 10350v 10360v 10370v CAGCAGC TGTCGGAAGCCAACGGTGGAGGAGCAGCCTCGGCCGGAGCCGGAGGAGCAGCTAGTCCGG CXGCXGC TGTCGGAAGCCAACGGTGGAGGAGCAGCCTCGGCCGGAGCCGGAGGAGCAGCTAGTCCGG CAGCAGCAGCTGTCGGAAGCCAACGGTCGAGGAGCAGCCTCGGCCGGAGCCGGAGGAGCAGCTAGTCCGG 4440* 4450* 4460* 4470* 4480* 4490* 4500* 10380v 10390v 10400v 10410v 10420v 10430v 10440v CCAACTCGCAGCAAAOCCAGCAACAGCAOCACTCCACAGCCATCAGCACCATGTCOCCGATGCAATTGGC CCXXCTCGCXGCXAAGCCAGCAACAGCAGCACTCCACACCCATCAGCACCATGTCGCCGATGCAATTGGC CCAACTCGCAGCJUU^GCCAGCJaCAGCAGC^CTCCACJlOCC^^ 4510* 4520* 4530* 4540* 4550* 4560* 4570* 10450V 10460V 10470v 10480v 10490v 10500v 10510v AGCGGCCACTGGAGGAGTTGGCGGGGATTGGACACAGGGAAGGACGGTGCAGCTAATGCAACCCTCCACC AG G CCACTGGAGGAGTTGGCGGGGATTGGACACAGGGAAGGACGGTGCAGCTAATGCAACCCTCCACC AGGGCCCACTGGAGGAGTTGGCGGGGATTGGACACAGGGAAGGACGGTGCAGCTAATGCAACCCTCCACC 4580* 4590* 4600* 4610* 4620* 4630* 4640* 10520V 10530v 10540v 10550v 10560v 10570v 10580v AGTTTCCTGTATCCCCAAATGATTGTGTCGGGAAATCTGTTGCATCCAGGAGGCCTCGGTCAGCAGCCAA AGTTTCCTGTATCCCCAAATGATTGTGTCGGGAAATCTGTTGCATCCAGGAGGCCTCGGTCAGCAGCCAA AGTTTCCTGTATCCCCAAATGATTGTGTCGGGAAATCTCTTGCATCCAGGAGGCCTCGGTCAGCAGCCAA 4650* 4660* 4670* 4680* 4690* 4700* 4710* 10590v 10600v 10610v 10620v 10630v 10640v 10650v "TCCAGGTGATCACCGCCGGCAAGCCATTCCAAGGCAACGGCCCCCAGATGCTTACCACCACGACTCAAAA 
TCCAGGTGATCACCGCCGGCAAGCCATTCCAAGGCAACGGCCCCCAGATGCTTACCACCACGACTCAAAA TCCAGGTGATCACCGCCCGCAAGCCATTCCAAGGCAACGGCCCCCAGATGCTTACCACCACGACTCAAAA 4720* 4730* 4740* 4750* 4760* 4770* 4780* 10660v 10670v 10680v 10690v 10700v 10710v 10720v CGCCAAGCAAATGATCGGTGGCCAAGCGGGATTCGCTGGCGCAAATTACCCGACCTGCATTCCCACAAAC CGCCXAGCXAATGATCGGTGGCCAAGCCGGATTCGCTGGCGGAAATTACCCGACCTCCATTCC C AAC CGCCAAGCAAATGATCGGTGGCCAAGCGGGATTCGCTGGCGGAAATTACGCGACCTCCATTCCTTCGAAC 4790* 4800* 4810* 4820* 4830* 4840* 4850* 10730V 10740V 10750v 10760v 10770v 10780v CACAATCAATCGCCCCAGACGGTGCTCTTCTCACCGATGAACGTCATT TCGCCACAOCAOCAAC CACAATCA TCGCC CAGACGGTGCTC TCTC CCG TGAA CTCAT TCGCCACAGCAGCA C CACAATCAGTCGCCTCAGACGGTCCTCATCTCGCCGCTCAATCTCATCTCCCACTCCCCACAGCAGCACC 486048704880* 48904900* ' 49104920* v 10800v 10810V ' 10820V 10630v 10840v 10850v AGAACCTGCTGCAATCAATGGCCGCTGCAGCTCAGCAGCAGCAXCTCACCCAXCXGCXG CAACAGTT A AACCT CTGCAATCAATGGCCGC CCXGCTCX CX CXGCXACT ACCCAACAGCAG CAACAG T AAAACCTTCTGCAATCAATGGCCGCCGCAGCTCAACAACAGCAACTTACCCAACXGCXOCAOCAACXGCT 4930* 4940* 4950* 4960* 4970* 4980* 4990* v 10870v 10860v 10890v 10900v 10910v TAACCAGCAGCAACAACAGCAGCTTACTCAGCAGCAACAOCAGTTGACAGCTG CTCTGOCCAA TAACCAGCAGCAACA CAGC CAGCAGCAACAGCA ACAGCTG CTCTGGCCAA TAACCAGCJkGCAACAGCAGCTCAACCAGCAGCACCAACAGCA ACAGCTGACTCCCGCTCTGCCCAA 5000* 5010* 5020* 5030* 5040* 5050* v 10930v 10940V 10950v 10960v 10970v 10980v GGTGGGAGTGGATGCGCAGGGCAAGCTGGCCCAGAAAGTGGTTCAGAAAGTGACTACCACCAGTAGCGCG OGTGGGAGTGGATGCGCAGGGCAAGCTGGCCCAGAAAG7GGTTCAGAA GTGAC ACCACCAG AGC CG GGTGGGAGTGGATGCGCAGGGCAAGCTGGCCCAGAXXGTGGTTCAGAAGGTGACCACCACCAGCAGCACG 5070* 5080* 5090* 5100* 5110* 5120* V HOOOv HOlOv 11020v 11030V 11040v 11050v GTCCAOGCGGCGACGGGTCCTGGATCTACTGCGTCAACACAGACCCAGCAOGTGCAGCAGGTTCAGCAAC GT CAGGCGGCGACGGGTCCTGGATCTACTGGGTCAACACAGACCCAGCAGGTGCAGCAGGTTCAGCAAC OTGCAGGCGGCGACCGGTCCTCGATCTACTGGCTCAACACAGACCCAGCAGGTOCAGCACGTTCAOCAAC 51405150" 5160* 5170* 5180* 5190* v 11070v 11080v 11090v lllOOv llllOv 11120v 
AGCAGCAOCAGACCACCCAAACCACTCAGCAGTGCGTGCAGGTTTCCACGTCGACTTTGCCAGTCGGTGT AGCAGCXOCXGXCCXCCCXAACCACTCAGCAGTGCGTGCAGGTTTC OTCGACTTTGCCAGTCGGTGT AGCAGCAGCAGACCACCCAAACCACTCAGCAGTCCGTGCAGGTTTCACAGTCGACTTTGCCAGTCGGTGT 5210* 5220* 5230* 5240* 5250* 5260* v 11140v 11150v 11160V 11170v 11180v 11190v CGOTOGACAGTCTCTTCAGACTGCCCAACTTCTGAACGCTGGCCAAGCGCAACAAATGCAGATTCCCTGG OGGTGGACAGTCTGTTCAGACTGCCCAACTT TGAACOCTGGCCAAGCGCAACAAATGCAGAT CC TGG CGGTGCACACTCTCTTCAGACTGCCCAACTTTTGAACGCTGGCC^AACCGCAACAAATGCAGATCCCTTGG 52805290* 5300* 5310* 5320* 5330*  36  v 11210v 11220v 11230v 11240v 11250v U260v TTCTTACAGAATGCTGCAGGACTGCAGCCGTTTGOGCCAAACCAGATCATCCTGCGAAACC^^ TTCT CAGAATGC GC GG CTGCA C C TT OG C AA CAGATCATCCTGCGAAACCAOCCAGACG TTCTGGCAGAATGCGGCGGGCCTGCAACCCTTCGGCTCCAATCAGATCATCCTGCGAAACCAOCCAGACG 5350* 53605370" 536053905400** v 11260v 11290v 11300v 11310v 11320v 11330v GAACCCAAGGCATGTTCATTCAACAGCAACCGGCGACGCAGACTTTGCAGACCCAGCAAAACCOTAAGAA GAACCCAAGGCATGTTCATTCAACAGCAACCGGCGACGCAGACTTTGCAGACCCAGCAAAACCGTAAGAA CAACCC^AGGCATGTTCATTCAACAGCAACCGGCGACGCAGACTTTGCAGACCCAGCAAAACCGTAAGAA 542054305440545054605470V 11350v 11360v 11370v 11360v 11390v 11400v TATTGTCATGTATATTGCATCGGATAGGTACTAAAGTCAACTATCTTCCTACAGAGATTATTCAATGCAA TATTGTCATGTATATTGCATCGGATAGGTACTAAAGTCAACTATCTTCCTACAGAGATTATTCAATGCAA TATTGTCATGTATATTGCATCGGATAGGTACTAAAGTCAACTATCTTCCTACAGAGATTATTCAATGCAA 5490* 550055105520" 5530" 5540" v 11420v 11430v 11440v 11450v 11460v 11470v CGTGACGCAGACGCCCACTAAGOCACGCACTCAACTGGATGCACTTGCTCCCAAGCAGCAACAOCAGCAO CGTGACGCAGACGCCCACTAAGGCACGCACTCAACTGGATGCACTTGCTCCCAAGCAGCAACAGCAGCAG CGTGACGCAGACGCCCACTAAGGCACGCACTCAACTGGATGCACTTGCTCCCAAGCAGCAACAGCAGCAG 55605570* 5580" 559056005610v -11490V 11500v 11510v - 11520v - -U530v 11540v CAGCAGGTTGGCACTACCAACCAGACGCAGCAGCAGCAACTAGCGGTGGCTACTGCCCAGTTGCAGCAAC CAGCAGGTTGGCACTACCAACCAGACGCAGCAGCAGCAACTAGCOGTGGCTACTGCCCAGTTGCAGCAAC C^GCAGGTTGGC^CTACCAACCAGACGCAGCAGCAGCAACTAGCGGTGGCTACTGCCCAGTTGCAGCAAC 5630564056505660* 5670* 5680v 11560v 11570v 11580v 11590v 
11600v 11610v AOCAGCAOCAACTCACTGCAGCAGCTCTGCAOCGACCAGGAGCCCCTGTCATOCCCCACAATOGAACTCA AGCAGCAGCAACTCACTGCAGCAGCTCTGCAGCGACCAGGAGCCCCTGTCATGCCCCACAATOGAACTCA AGCAGCAGCAACTCACTGCAGCAGCTCTGCAGCGACCAGGAGCCCCTCTCATGCCCCACAATGGAACTCA 57005710" 5720573057405750v 11630v 11640v 11650v 11660v 11670v 11680v AGTGCGTCCGGCCAGTTCCGTATCCACACAGACTGCCCAGAACCAGAGCCTGCTGAAGGCCAAAATGCGC AGTGCGTCCGGCCAGTTCCGTATCCACACAGACTGCCCAGAACCAGAGCCTGCTGAAGGCCAAAATGCGC AGTGCGTCCGGCCAGTTCCGTATCCACACAGACTGCCCAGAACCAGAGCCTGCTGAAGGCCAAAATGCGC 5770578057905800* 58105820* v 11700v 11710v 11720v 11730v 11740v 11750v AACAAGCAGCAGCCGGTGCGCCCCGCTTTAOCCACATTGAAAACCGAAATCGGTCAAGTCGCAGOACAAA AACAAGCAGCAGCCGGTGCGCCCCGCTTTAGCCACATTGAAAACCGAAATCGGTCAAGTCGCAGGACAAA AACAACCAGCAGCCGGTGCGCCCCGCTTTAGCCJiCATTGAAAACCGAAATCGGTCAACTCGCAGGACAAA 584058505860* 5870* 5880* 5890v 11770V 11780v 11790V 11800v 11810v 11820v ATAAGGTAGTAGGCCACCTGACCACCGTGCAGCAGCAGCAACAGGCGACGAATCTCCAGCAGGTGGTTAA ATAAGGTAGTAGGCCACCTGACCACCGTGCAGCAGCAGCAACAGGCGACGAATCTCCAGCAGGTGGTTAA W M  W W  ATAAGGTAGTAGGCCACCTGACCACCGTGCAGCAGCAGCAACAGGCGACGAATCTCCAGCAGGTGGTTAA  v  591011840V  592011850v  5930594011860v - 11670v  5950* 11880v  5960* 11890v  v  5980" 11910v  599011920v  6000* 11930V  601011940v  602011950v  603011960v  v  605011980v  606011990v  607012000v  608012010v  6090" 12020v  610012030v  TGCGGCGGGCAACAAGTGAGTATTGTTTGTTTATTATCTGCCTTGTCACAGTACTTATTTTTGCATCTTT TGCGGCGGGCAACAAGTGAGTATTGTTTGTTTATTATCTGCCTTGTCACAGTACTTATTTTTGCATCTTT TGCGGCGGGCAACAAGTGAGTATTGTTTGTTTATTATCTGCCTTGTCACAGTACTTATTTTTOCATCTTT CCAGAATGGTTGTGATGAGCACAACGGGCACTCCGATCACCCTGCAGAATGGACAGACCCTTCATGCAGC CCAGAATGGTTGTGATGAGCACAACGGGCACTCCGATCACCCTGCAGAATGGACAGACCCTTCATGCAGC CCAGAATGGTTGTGATGAGCACAACGGGCACTCCGATCACCCTGCAGAATGGACAGACCCTTCATGCAGC CACTGCGGCAGGAOTCGACAAGCACCAACAGCAGCTACAACTCTTTCAGAAACAGCAAATCCTOCAACAA CACTGCGGCAGGAGTCGACAAGCAGCAACAGCAGCTACAACTGTTTCAGAAACAGCAAATCCTGCAACAA CACTGCGGCAGGAGTCGACAAGC^GCAACAGCAGCTACAACTGTTTCAGAAACAGCAAATCCTGCAACAA  6120613061406150" 61606170v 
12050v 12060v 12070v 12080v 12090v 12100v CAACAAATGTTGCAACAGCAGATTGCTGCCATTCAAATGCAOCAOCAOCAAGCGGCTGTTCAGOCCCAGC CAACAAATGTTGCAACAGCAGATTGCTGCCATTCAAATGCAGCAGCAGCAAGCGGCTGTTCAGGCCCAGC CJUiCAAATGTTGCAACAGCAGATTGCTGCC^TTCJUaTCCAGc^GCAGCAAGCGGCTGTTCAGGCCCACC 6190620062106220* 62306240v 12120v 12130V 12140v 12150v 12160v 12170v AACAACAGCAGCAACAGGTCTCTCAGCAOCAOCAOGTTAACGCCCAGCAACAOCAAGCGGTOOCOCAACA AACAACAGCAGCAACAGGTCTCTCAGCAGCAGCAOGTTAACGCCCAGCAACAGCAAGCGGTGGCGCAACA AACAACAGCAGCAACAGGTCTCTC^GCAGCAGCAGGTTAACGCCCAGCAACAGCAAGCGCTGGCCCAACA * 6260* 6270" 6280" 629063006310V 12190v 12200v 12210v 12220v 12230v 12240v ACAACAGGCAGTCGCGCAGGCTCAGCAACAGCAGAGGGAGCAACAGCAGCAAGTTGCCCAAGCCCAGGCG ACAACAGGCAGTCGCGCAOGCTCAGCAACAGCAGAOGGAGCAACAGCAGCAAGTTGCCCAACCCCAGOCG A£AACAGGCAGTCGCGCAGGCTCAGCAACAGCAGAGGGAGCAACAGCACCAAGTTGCCCAAGCCCAGGCG * 63306340* 6350" 63606370" 6380-  37  v 12260v 12270V 12280v 12290v 12300v 12310v CAGCATCAA("AGGCTCTCGCGAATGCCACTCAGCAAATCCTTCAGCTCGCGCCAAATCAATTCATCACGT CAGCATCAACAGGCTCTCGCGAATGCCACTCAGCAAATCCTTCAOGTGGCGCCAAATCAATTCATCACGT CAGCATCAACAGGCTCTCGCGAATGCCACTCACCAAATCCTTCAOCTGGCGCCAAATCAATTCATCACGT 6400* 6410* 6420* 6430* 6440" 6450* v 12330v 12340v 12350v 12360v 12370v 12360v CCCACCAGCAACAGCAGCAGCAGCAACTTCACAACCAACTGATACAGCAGCAGCTACAGCAACAGGCOCA CCCACCAGCAACAGCAGCAGCAGCAACTTCACAACCAACTGATACAGCAGCAGCTACAGCAACAGGCGCA CCCACCAGCAACAGCAGCAGCAGCAACTTCACAACCAACTGATACAGCAGCAGCTACAGCAACAGGCGCA 6470648064906500" 6510* 6520V 12400v 12410V 12420V 12430v 12440v 12450v OGCACAAGTTCAAGCCCAAOTGCAOGCTCAAGCGCAACAOCAACAACAOCAOCGAGAOCAGCAOCAGAAT GGCACAAGTTCAAGCCCAAGTGCAGGCTCAAGCGCAACAGCAACAACAGCAGCGAGAGCAGCAGCAGAAT GGCACAAGTTCAAGCCCAAGTGCAGGCTCAAGCGCAACAGCAACAACAGCAGCGAGAGCAGCAGCAGAAT 654065506560" 6570* 6580" 6590v 12470v 12480v 12490v 12500v 12510v 12520v ATTATCCAGCAGATTGTGGTGCAACAGTC TGGAGCGACTTCTCAACAGACTTCCCAGCAGCAACAGC ATTATCCAGCAGATTGTGGTGCAACAGTC TGGAGCGACTTCTCAACAG CAGCAGCAAC GC ATTATCCAGCAGATTGTGGTGCAACAGTCAACTGGAGCGACTTCTCAACAGCAG CAGCAGCAACCGC 6610* 6620* 6630* 
6640* 66506660* v 12540v 12550V 12560v 12570v 12560v 12590v ACCACCAATCCGGGCAACTACAGCTAAGTAGCGTGCCeTTCTCACTTTCTTCGTCAACGACGCCA-A CA CA TC GG CA T CAGCT AG AGCGTGCC TT TC CTTTC C TC A GACG C AACAGCAGTCTGGACAGTTGCAGCTTACCAGCGTGCCGTTTTCGGTTTCACCATCCATGACGCCG6AACA 6680" 6690* . 6700*. .. 6710- .. . 6.720*_ .. 6730* . . 12600v 12610v 12620v 12630v 12640V 12650v GCCGGAATAGCTACCTCTAGTGCTCTGCA00CA6CCCTCTCCGCCTCTGGCGCCATCTTTCAGACA GCCGGAATA C CC T C A GC C TCTGGCGCCATCTTTCAGACA TATTGCCGGAATAACATCCAGTGCCCTACAAGAAGCTCTCTCOOTG TCTGGCGCCATCTTTCAGACA 675067606770* 6780" 6790* 6800v 12670v 12680v 12690v 12700v 12710v 12720v GCTAAGCCGGGTACTTCCAGTTCCTCCTC CCCCACAAGCACTGTGCTCACAATTACCAACCAGAGCA C AA CCG TACTTGCAGTTCCTC C CCCCACAAGCAGTGTGGTCACAATTACCA CCAGAGCA ACCAAACCGATTACTTGCAGTTCCTCTACGCTCCCCACAAGCAGTGTGGTCACAATTACCAGCCAGAGCA 68106620• 68306840* 6850" 68606870* 12730v 12740V 12750v 12760v 12770v 12780v ' 12790v CCACTCCTTTGGTCACCAGCAGTACGGTGGCCAGTATACAGCAGGCTCAGACGCAATCTGCTCAGGTCCA GCACTCCT TGGTCACCAGCAGTACGGTGGCCAGTAT CAGCAGGCTCAGACGCAA T CTCAG TCCA GCACTCCTCTGGTCACCAGCAGTACGGTGGCCAGTATGCAGCAGGCTCAGACGCAAGGTACTCAGATCCA 68806890* 6900" 6910* 692069306940" 12600V 12810v 12620V 12830v 12840v 12850v 12860v CCAACATCAGCAGCTAATCAGCGCCACAATTGCCGGACGGACTCAACAACAGCCACAGGGACCGCCA CAACATCAGCAGCTAATCAGCGCCAC ATTGCCGGAGGG CTCAACA CAGC CAG C GC A TCAACATCAGCAGCTAATCAGCGCCACTATTGCCGGAGGGTCTCAACAGCAGCAGCAGCAGCAGCAACTG 6950696069706980* 6990* 7000* 7010" 12870v 12880v 12890v 12900v • TCACTTACACCCACCAC — AAATCCAATTTTGOCCATGACCTCGATGA TCACTTACACCCACCAC AAATCC ATT TGOCCATGACCTCGATGA GGACTACCTTCACTTACACCCACCACGCCCTCACCTACAACAAATCCCATTCTGGCCATGACCTCGATGA 702070307040* 7050* 7060* 7070* 7080* v 12920v 12930v 12940v 12950v 12960v 12970v TGAATOCTACAGTGGGTCACCTTTCCACTGCTCCOCCTGTAACTGTTTCTGTGACAAOCACCGCTOTTAC TGAATGC AC GTGGGTCACCT TCCACTGC CC CC GT A TGTTTCT AGCACCGCTGT AC TGAATGCCACCCTGGGTCACCTATCCACTGCCCCACCCGTTAGTGTTTCT AGCACCGCTGTCAC 70907100711071207130* 7140v 12990v 13000v 13010v 13020v 13030v 13040v 
rrCGTCGCCGGGTCAGCTGGTTCTCTTAAGCACOGCTAGTAGCGGTOGAGGAGGTAGCATACCACCCACG T C TCG C GG CAGCTGGT TAAGCA GCTAGTAGCGGTGGAOGAG GC T CCAGCCACG TCCATCGTCTGGACAGCTGGTCACACTAAGCAGTGCTAGTACCGCTGGAGGAGCAGGCTTTCCACCCACG 7160* 7170* 7180* 7190* 7200* 7210* v 13060V 13070v 13060v 13090v 13100v 13110v CCCACCAAAGAGACACCTTCGAAAGGGCCCACCGCAACCCTGGTGCCCATTGGTTCGCCCAAGACTCCTG CCCACCAAAGAGACACCTTC AAAGGGCCCACCGCAACCCTGGTGCCCATTG TTCGCCCAAGACTCCTG CCCACCAAAGAGACACCTTCAAAAGGGCCCACCGCAACCCTGGTOCCCATTGATTCGCCCAAGACTCCTG 723072407250* 726072707280* v 13130v 13140v 13150v 13160v 13170v 13180v TATCAGGAAAGGACACCTGCACTACCCCCAAATCATCTACTCCTGCCACTGTCAGCGCATCCGTAGAGGC TATCAGGAAAGGACACCTGCACTACCCCCAAATCATCTACTCCTGCCACTGT AGCGCATCCGTAGAGGC -fATCAGGAAAGGACACCTGCACTACCCCCAAATCATCTACTCCTGCCACTGTTAGCGCATCCGTAGAGGC 7300* 73107320733073407350V 13200V 13210v 13220v 13230v 13240v 13250v CAGTAOTTCCACAGGCGAAGCCCTGTCCAATGGAGATOCCTCAGATAOGTCTTCCACGCTOTCAAAOOGC CAGTAGTTCCACAGGCGAAGCCCTGTCCAATGGAGATGCCTCAGATAGGTCTTCCACGC GTCAAAGGGC CAGTAGTTCCACAGGCGAAGCCCTGTCCAATGGAGATGCCTCAGATAGGTCTTCCACGCCGTCAAAGGOC 737073807390740074107420*  38  v 13270v 13260v 13290v 13300v 13310v 13320v OCTACCACTCCCACCAOCAAGCAAAGCAATGCAGCAGTGCAGCCACCGAGTAGCACCACTCCCAACAGTG CCTACCACTCCCACCAGCAAGCAAAGCAATGCAGCAGTGCAGCCACCGAGTAGCACCA TCCCAACAGTG GCTACCACTCCCACCAGCAAGCAAAGCAATGCAGCAGTGCAGCCACCGAGTAGCACCATTCCCAACAGTG 7440* 74507460" 74707480* 7490* V 13340v 13350v 13360V 13370v 13380v 13390v TCAGTGGGAAAGAAGAGCCGAAGCTCGCAACCTGCGGCAGTTTAACGTCCGCAACATCAACTTCAACCAC TCAGTGGGAAAGAAGAGCCGAAGCTG A CTGCGGCAGTTTAACGTCCGCAACATCAAC TCAACCAC TCAGTGGCAAAGAAGAGCCGAAGCTGCACAACTGCCCCAGTTTAACCTCCOCAACATCAACATCAACCAC 751075207530754075507560v 13410v 13420v 13430v 13440v 13450v 13460v 6ACAACGATCACCAATGGGATTGGAGTAGCCAGAACGACAGCCAGCACGGCTGTCTCAACCGCTAGCACA CACAACGATCACCAATGGGATTGGAGTAGCCAGAACGACAGCCAGCACGGCTGTCTCAACCGCTACCACA CACAACGATCACCAATGGGATTGGAGTAGCCAGAACGACAGCCAGCACGGCTGTCTCAACCGCTAGCACA 758075907600" 76107620" 7630" V 13480V 13490v 13500V 13510v 13520v 13530v 
ACCACTACCAGTTCTGGCACCTTTATCACAAGTTGCACCAGCACAACCACAACCACCACGTCGAGTATCA ACCACTACCAGTTCTGGCACCTTTA CACAAGTTGCACCAGCACAACCACAACCACCACGTCGAGTATCA ACCACTACCAGTTCTGGCACCTTTACCACAAGTTGCACCAGCACAACCACAACCACCACGTCGAGTATCA 765076607670" 7680* 76907700" V 13550v 13560V 13570v 13580v 13590v 13600v OTAATGGATCGAAGGATCTCCCCAAGGCGATGATTAAGCCGAACGTCTIAACTCACGTCATCGATGGCTT CTAATGGATCGAAGGATCTCCCCAAGGCGATGATTAAGCCGAACGTCTTAACTCACCTCATCGATOGCTT CTAATGGATCGAAGGATCTCCCCAAGGCGATGATTAAGCCGAACGTCTTAACTCACGTCATCGATGGCTT 772077307740" 7750* 7760" 7770* V 13620v 13630v 13640v 13650v 13660v 13670v CATCATCCAGGAGGCCAACGAGCCATTTCCCGTCACCAGACAGCGATATGCAGACAAAGACGTCAGCGAT CATCATCCAGGAGGCCAACGAGCCATTTCCCGTCACCAGACAGCGATATGCAGACAAAGACGTCAGCGAT CATCATCCAGGAGGCCAACGAGCCATTTCCCGTCACCAGACAGCGATATGCAGACAAAGACGTCAGCGAT 77907800781078207830* 7840v 13690v 13700v 13710v 13720v 13730v 13740v GAGCCGCCAAGTGAGTATAAACTTCTGGTACCAATGCTTTTTCGCAATCTTAACGTGTCATTCCTTCGCG GAGCCGCCAAGTGAGTATAAACTTCTGGTACCAATGCTTTTTCGCAATCTTAACCTGTCATTCCTTCGCC GAGCCGCCAAGTGAGTATAAACTTCTGGTACCAATGCTTTTTCGCAATCTTAACGTGTCATTCCTTCGCG 786078707880* 78907900" 7910* v 13760v 13770v 13760v 13790v 13800v 13810v CAGAGAAAAAGGCAACCATGCAGGAGGACATCAAGCTAAGTGGAATAGCATCAGCTCCAGGCTCGGATAT CAGAGAAAAAGGCAACCATGCAGGAGGACATCAAGCTAAGTGGAATAGCATCAGCTCCAGGCTCGGATAT CAGAGAAAAAGGCAACCATGCAGGAGGACATCAAGCTAAGTGCAATAGCATCAGCTCCAGGCTCGGATAT 793079407950* 79607970" 7980* V 13630v 13640V 13850v 13860v 13870v 13BB0v GGTTGCTTGCGAGCAGTGTGGAAAGATGGAGCACAAAGCAAAGCTGAAACGGAAGCGCTACTGTTCGCCA CGTTGCTTGCGAGCAGTGTCGAAAGATGGAGCACAAAGCAAAGCTGAAACGGAAGCGCTACTGTTCGCCA GGTTGCTTGCGAGCAGTGTGGAAAGATGGAGCACAAAGCAAAGCTGAAACGGAAGCGCTACTGTTCGCCA 80008010802080308040* 8050v 13900v 13910v 13920v 13930v 13940v 13950v GGATGCTCGAGGCAGGCAAAGAACGGCATCGGTGGAGTTGGATCAGGAGAGACGAACGGCCTGGGGACAG GGATGCTCGAGGCAGGCAAAGAACGGCATCGGTGGAGTTGGATCAGGAGAGACGAACGGCCTGGGGACAG GGATGCTCGAGGCACCCAAAGAACGCCATCGGTGGACTTGGATCACGAGAGACGAACGCCCTGGGGACAG 8070808080908100" 81108120* v 13970v 13980v 13990v 14000v 14010v 
14020v OTGGTATAGTTGGGGTGGCAGCCATGOCATTGGTGGACAOGCTGGATGAAGCCATGGCTGAOGAGAAGAT CTGGTATAGTTGGGGTGG CC ATGGCATTCGTCGACAG CTGGATGAAGCCATGGCTGAGGACAACAT GTGGTATAGTTGGGGTGGACGCAATGGCATTGGTGGACAGACTGGATGAAGCCATGGCTGAGGAGAAGAT 8140815081606170" 81808190* v 14040v 14050v 14060v 14070v 14080v 14090v GCAGACAGAGGCCACCCCAAAGCTTTCAGAATCGTTTCCTATTTTG^GAGCCTCAACAGAAGTACCTCCA GCAGACAGA C CC A T TC GA C TT CC ATT GCAGACAGAATCATACCAGACAGTATCGGACCCTTTGCCAATT 821082208230* 8240v 14110v 14120v 14130v ATGTCACTGCCAGTCCAAGCGGC GATTTCTGCGCCCTCGCCTC CAAGCGGC GATTTC CCC GC — »CAAGCGGCTACGCCGGAGGTCCCACCGATTTCGATGCCAGTGCTGGCGGCTATGT 62508260827062808290* 14140V  14150V  14160V  TTOCAATGCCTCTAOGATCGCCATT TTGCAAT CTC CGACATCTTCACCACTTTCGTTGCCCCTGACATTOCCCTTGCCAATTGCAATAGCTC— 83008310832083306340" 835014170v 14180v 14190v 14200v 14210v 14220v 14230v 43TCAGTTGCACTTCCAACTCTTGCACCACTGTCTGTAGTCACTTCTGGCGCGGCGCCCAAGTCTTCGGAA CCACTGT CTCACT C G G C G T CCACTGT GTCACTGCCAGTGGTTTCAGCTGGAGTGGTTGC •360637063806390-  39  14240v 14250v 14260v 14270v AATGGAACAGATCGTCCGCCAATCAOCAGCTGGAGTCTG  OTO G  AATGGA  C GATCG C C C C ATCAGCAG  TGGAGTGTG  GCCGGTCCTAGCAATACCATCCTCGAATATAAATGGATCCGATCCCCCTCCCATCAGCAGTTCCAGTCTG 8400* 8410* 8420* 8430* 8440* 6450* 6460* v 14290v 14300v 14310v 14320v 14330v 14340v CACGATGTCAGCAACTTCATTCGAGAACTGCCTGGTTGTCAGGACTACCTGGACGACTTTATACAGCAGG CA  GA G T AGCAA T T C A T  CGAGAACTGCCTGGTTG  CAGGACTACGTGGACGACTTTATACAGCAGG  GAAGAAGTTAGCAATTTCATCCGAGAACTGCCTGGTTGCCAGGACTACGTGGACGACTTTATACAGCAGG 8470* 8480* 8490* 8500* 8510* 8520* 6530* v 14360v 14370v 14380v 14390v 14400v 14410v AGATCGACGGCCAAGCGCTTCTGTTGCTCAAOGAOAACCATTTGGTGAACGCTATGGGCATGAAGCTOGG AGATCGACGGCCAAGCGCT  C T G T G C T C A A GA AA C A T T T G G T AACGC  ATGGGCATGAAGCTGGG  AGATCGACGGCCAAGCGCTGCTGCTGCTCAAAGAAAACCATTTGGTTAACGCCATGCCCATGAAGCTCGG 8540* 8550" 8560* 6570* 6580* 8590* 8600* v 14430v 14440v 14450v 14460v 14470v 14480v TCCAGCTCTTAAAATTGTGGCCAAGGTGGAGTCCATTAAGGAGGTCCCGCCACCTGOCGA6GCCAAGGAT T C C A G C T C T 
AAAATTGTGGCCAAGGTGGAGTCCATTAAGGAGGTCCCGCCA G G TCCAGCTCTCAAAATTGTGGCCAAGGTGGAGTCCATTAAGGAGGTCCCGCCAGGCGATGTA  AAGGAT AAGGAT  8610* 8620* 8630* 6640* 6650* 8660v 14500v 14510v 14520v 14530v 14540v 14550v CCAGGAGCGCAGTAGGGCAGCTAGAGCACCAAAAGCCGAAAAAGATGATCTCCTAACCGACCAGTGGACC  AAAAGA T CT C G AAAAGACCCCTTTCTTTAGTTTCCCGCGTTT  A A G TAAAAACACGCAACAAAGTCAAGGTTTC—  8680* 8690* 6700* 8710* 8720* v 14570v 14580v 14590v 14600v 14610v 14620v TOGTTCAACCAAGTCTGTCGTGCCAGGTGATTCTGATTCAATCGAOCAGGCGAAAAGGACGCGAATCCAT T AA A T C GTGA CACCTAAATGTAACGACATTTACTTCGTGA  8740* 14640v  v  GCGAAT GCGAATGTGA  8750* 14650V  6760*  TTGCAAAATATTATTAGCATCAGGC  —  CATCAOGT  T A A A A G ATCA G CATCAG T TCAGACAGAACAAAGTGAATCACGTTCCCACTCACCACTTCTCACACGACGTACACCCTAATCATCAGCT  8780* 14670v  v  8790" 14680V  8800" 14690v  8810* 14700v  8820* 14710v  8830* 14720v  CTTAAACCCATTGACTTTGTACATACTCCCAAGATCACTTATAAGCATATTCATTTATAAATTAAACTAA CC ACTCCC AGA C  ACATGCACCTAATCTACAAA0GGAACTCCCCA6AGAGCAACC0GTGCC 8850" 8660" 8870" 8880* .v 14740v 14750V 14760v 14770v 14780v 14790v •CUGTCAACAOTCAAAAACGAATCGAATTACTTAAACTAAGGAAAAGCTATGAArTAATTGCCAGCCAAGT TGGAAT A C T A  CT  G  A GC  G  TGGAATCACTGACTCTGTTGCOAOGCCCATCCCATCCAGAATCTATGCG 6890* 6900* 6910* 8920* 8930* 14B10v 14820v  v  AAAT  —  A A  OGAATAAAGTACATTTT  GGAATA  C  T T  AOAAATCCATAATTAOGTGATCTAGTTGTTTTTCCCOCACATGACGAAAGCAAGGAATATGACCCTCCTT 8940* 6950* 8960* 6970* 8980* 8990* 9000* 14830v 14840v 14850v 14860v 14870v 14880v 14890v ATAAATAACCATAACTTTATACTCTAAGTACCATArTAATAACTCCCAACGTCATGGATAGrTTGTACAA TAGTTT  CGGCOCCGAAOCTOCAGC  CA  TAGTTTAAGCAC  9010* 9020* 9030* 14900V 14910V 14920V 14930v 14940v 14950V 14960v CTATTTTAATCTTAOGAATCAAATGTAGCACTATGATTGTTCTAACAACTAAGAATTTTAAOCCTATGAA T A  C  GATTGT  CCCGATCAGACCCCAA  AA A  T A  AG  C  A  GATTGTOOCAATAOTAGAGTCCATGACTCTGTGCGA  9040* 9050* 9060* 9070* 9080* 14970v 14980v 14990v 15000v 15010v 15020v 15030v TAATAATTGATATCTAAATGTGAATTTGAACTTTTTACTAAATAATATTTGAATGCCTAGACCTAAGCTT T T  AAA  C  C  T  A  T  A  C  T  C  
CGAAAAGGACGGGGACCTTATACGACCCCTCGCGCCTCCCCGTTGGATCAACACTCTTCAGCACTCTACC 9100* 9110* 9120* 91309140" 9150* 15040v 15050v 15060v 15070v 15080v 15090v 15100v TTTTTTCAACATTTTTTTTTTGCAATTGCTGAAGAAATTAAAATOOCACTGLAATAGTOTTTAATAAATCT T  A  T  OCA T  CTGA  AOAGTCTGACGATAOGAOCGGGCAGTATCTGAGCTCTA 91709180* 9190-

Open reading frame analysis

To determine the direction of ph transcription, single-stranded RNA probes were hybridized to ph messages. Transcription was found to proceed from proximal to distal (J. Deatrick and N. Randsholt, pers. comm.).

The sequence data were used to locate open reading frames (ORFs) in ph. True ORFs will have to be distinguished from the untranslated 5' and 3' regions of mRNAs, and sequence data alone cannot identify ph exons. To locate ph exons unequivocally, S1 nuclease mapping of cDNAs will have to be done. This work is in progress and has already determined the splicing pattern of three exons in the proximal repeat (ORF 4/5/6; Denise Pierre, Hugh Brock, unpublished data). Because ph has a repeated structure with variable sequence conservation, conserved sequences are likely to be under selection pressure. Therefore, ORFs conserved in both repeats are likely to be genuine. Table 1 lists these conserved ORFs. A striking feature of all of these ORFs (2-6, 8-12) is that every insertion or deletion occurs in a multiple of three nucleotides, implying strong selection to maintain the reading frame.

However, if the ph repeats have slightly different functions or splicing patterns, then there could be genuine ORFs that are not conserved between the repeats. These ORFs could be identified using ORF length and codon preference as criteria. I plotted the distribution of ORF lengths in one frame of the coding strand and compared it with the distribution of ORF lengths in one frame of the non-coding strand (Figure 5).
If length is a good criterion of openness, the coding strand plot should be skewed to the right relative to the non-coding strand plot. In Figure 5b, the distribution is essentially normal up to a length of 70 amino acids. Beyond this point, ORFs are widely distributed along the X-axis. I assume that ORFs of about 70 amino acids or longer have coding potential because they lie outside the normal distribution. Therefore, I arbitrarily chose a length of 70 amino acids (aa) as the cutoff for significant ORFs. Every ORF of 70 aa or longer is listed in Table 2 (see also Fig. 4). Some of these ORFs may not be genuine, and some genuine ORFs shorter than 70 aa may be missing from the table. A comparison of cDNA sequence with the genomic sequence will be needed to determine unequivocally the locations of genuine ORFs in ph.

Table 1. Long ORFs conserved in both repeats of ph

ORF        Coordinates (bp)   Length* (bp)   Length* (aa)
2          5452-5700          249            83
3          5859-6083          225            75
4/5/6      9558-14498         4824           1608
8          17539-17811        273            91
9          17924-18163        240            80
10/11/12   19497-24399        4785           1595

*Lengths exclude introns for ORFs 4/5/6 and 10/11/12.

Table 2. ORFs 70 amino acids in length or greater

Reading frame 1
Coordinates (bp)   Length (aa)
4426-4635          70
5452-5700          83
6292-6531          80
6991-7206          72
7210-7452          81
7594-7806          71
7810-8082          91
9298-9618          107
9823-10212         130
10264-10497        78
11164-11844        227
11917-12357        147
15604-15888        95
17539-17811        91
19492-21240        583
21301-21630        110
21700-24399        900

Reading frame 2
Coordinates (bp)   Length (aa)
1-280              93
284-541            86
1442-1765          108
3293-3502          70
4187-4465          93
6308-6574          89
6914-7153          80
7157-7405          83
7745-7969          75
9740-10060         107
10370-10855        162
10985-11332        116
11780-12145        122
12149-12553        135
12662-13012        117
13730-14002        91
14018-14425        136
17924-18163        80
18251-18706        152
19205-19441        79
19760-20068        103
20115-20356        80
21035-21715        227
21788-22228        147
22232-22453        74

Reading frame 3
Coordinates (bp)   Length (aa)
4398-4649          84
4653-4922          90
5859-6083          75
6324-6581          86
7092-7295          68
7521-7736          72
7740-8264          175
9559-11369         603
11430-11759        110
11829-14498        890
18972-19201        76
19221-19502        94
19845-20111        89
20229-20723        165
20727-21203        159
21651-22016        122
22020-22421        134
22425-22853        143
23919-24200        94

Figure 4. A physical summary map of ph.

The sub-clones used for sequencing are labelled (616-617, 617-628, 1952SB, 21908, 214, 04 and 215 were used by Deatrick and Randsholt to sequence the proximal repeat; 1.5, 4.0, 0.8, 2.3 and 8.7 were used by myself to sequence the distal repeat). All open reading frames (ORFs) 70 aa in length or greater are drawn on the map. Thick lines represent ORFs conserved in both repeats (ORFs 2 through 6 and ORFs 8 through 12). A vertical dash marks the position of a stop codon lying between two ORFs. The known splicing pattern of ORFs 4, 5 and 6 and the suspected splicing pattern of ORFs 10, 11 and 12 are indicated with the hatched lines. Vertical arrows represent introns. The open boxes represent the zinc finger sequence found in each repeat. Putative promoters (P), translation initiation signals (Tr) and transcription termination signals (t) are indicated. The confirmed location of a polyA tail is marked (A). Thick lines at the bottom of the figure (i through vi) represent regions with a very high similarity index (over 95%). The line marked "unique" represents sequence found only in the proximal repeat. Regions of repetitive and unique DNA as determined by cross-hybridization studies (Freeman, 1988) are diagrammed as boxes at the bottom of the figure (a through e).

Figure 5. Frequency distributions of ORFs in a random sequence (a) and in a coding sequence (b).

This figure was made to see if length is a good criterion of openness. The length of every ORF in one frame of the non-coding strand and in one frame of the coding strand was plotted against its frequency. The mean length of ORFs in the coding sequence is 29 aa. The mean length of ORFs in the non-coding sequence is 24 aa.

Sequence conservation cannot be used to identify ORFs unique to either side of the gene. However, codon preference can be used to distinguish non-coding DNA from coding DNA (Gribskov et al., 1984). For most amino acids, several codons can specify the same amino acid (codon degeneracy). The frequency of synonymous codons varies depending on the organism and the gene. Non-coding DNA shows no preference for a specific codon in a family of synonymous codons. However, coding DNA can show a preference for one codon over another (Gribskov et al., 1984, and references therein). Thus, regions of relatively high codon preference should be indicative of DNA with coding potential. GENEPLOT was used to determine the codon preference of each putative ORF. A Drosophila codon frequency table (Ashburner, 1989) was used in the analysis. Table 3 shows the overall codon preference parameters for each ORF and for a random sequence of the same composition as the ORF. None of the ORFs shows a value significantly higher than the codon preference for a random sequence. Thus, for ph, codon preference cannot be used to indicate regions of coding potential. Abundantly expressed genes tend to have the best codon preference values (Ashburner, 1989), and since ph transcripts are rare, its codon preferences may be different.
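The length criterion used for Table 2 can be sketched in a few lines of Python. This is a minimal illustration and not the program used in this work: the function name `open_reading_frames`, its parameters, and the definition of an ORF as any stop-free run of codons in one frame (no initiator ATG required) are assumptions made for the example.

```python
# Standard genetic code stop codons.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def open_reading_frames(seq, frame, min_aa=70):
    """List stop-free runs of codons in one reading frame of one strand.

    seq    -- DNA sequence as a string
    frame  -- 0, 1 or 2 (offset of the first codon)
    min_aa -- shortest run, in codons, that is reported (assumed cutoff)

    Returns (start, end, length_aa) tuples with 1-based inclusive
    coordinates. The stop codon itself is not counted in the run.
    """
    seq = seq.upper()
    orfs = []
    start = frame  # 0-based index of the first codon of the current run
    for i in range(frame, len(seq) - 2, 3):
        if seq[i:i + 3] in STOP_CODONS:
            length = (i - start) // 3
            if length >= min_aa:
                orfs.append((start + 1, i, length))
            start = i + 3  # next run begins after the stop codon
    # Run that is still open at the end of the sequence.
    end = frame + ((len(seq) - frame) // 3) * 3
    length = (end - start) // 3
    if length >= min_aa:
        orfs.append((start + 1, end, length))
    return orfs
```

Calling the function once per frame (0, 1 and 2) with `min_aa=70` gives a list comparable in form to Table 2; collecting the lengths with `min_aa=0` gives the kind of distribution plotted in Figure 5.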
The splice junctions of ORFs 4-5 and 5-6 have been determined by comparing cDNA sequence with the genomic sequence (D. Pierre and H. Brock, unpublished data). Because the splice junction sequences of ORFs 10, 11 and 12 are identical to those of ORF 4/5/6, the same splicing pattern was assumed for ORFs 10, 11 and 12. This will have to be confirmed by comparing cDNA sequence with genomic sequence.

Table 3. Codon preference parameters of the 6 putative ORFs

ORF        Codon preference    Codon preference for random sequence
           of input strand     of same composition as input strand
2          0.94                0.96
3          0.96                0.97
4/5/6      0.97                0.95
8          0.95                0.96
9          0.95                0.97
10/11/12   0.97                0.95

Two regions of high sequence similarity (regions i and iv in Fig. 4) do not contain any long ORFs that are duplicated on each side of the gene. It is possible that ORFs are present in these regions but that they are less than 70 aa in length. Region i contains an ORF at its 5' end but this ORF is not duplicated in region iv. It is possible that these regions contain conserved regulatory sequences needed for proper expression of ph. This could be tested by placing these regions upstream of a marker gene (e.g., lacZ) and transforming Drosophila embryos with the chimaeric construct. If a region is required for ph regulation, it should direct a distribution of β-galactosidase that mimics the distribution of ph product in wild-type embryos. If it does, the exact positions of regulatory sequences within the region could be determined using deletion series or site-specific mutagenesis.

The proximal repeat contains 2248 bp of DNA not present in the distal repeat (the unique region, Fig. 4). The unique region could contain ORFs that are incorporated into the ph protein. Alternatively, this region could represent a large intron.
In any case, it cannot be decided whether the proximal repeat has gained DNA through an insertion event or the distal repeat has suffered a deletion. The proximal and distal repeats are therefore not perfect repeats, as their lengths differ. Subsequent to the duplication of the ph region, the two repeats have diverged.

Northern walk data (J. Deatrick and N. Randsholt, unpublished data) show that probes from coordinates 3.5 kb-5.2 kb and from 14 kb-17.5 kb hybridize to both of the major transcripts, suggesting that at least one sequence conserved on both sides must be present in the 6.1 kb and 6.4 kb mRNAs. Yet my analysis reveals no long ORFs conserved in both repeats in this region. There are nevertheless two small regions, marked i and iv on Figure 4, that are highly conserved. These sequences may represent conserved untranslated leader regions present in both messages. This does not rule out the possibility that there are also small ORFs in this region. As noted in the introduction, the region from 0-1.25 kb hybridizes to small ph messages but, as shown in Table 2, there are no long ORFs in this region. It may be that this region has many small ORFs whose locations will have to be confirmed by S1 nuclease mapping or by comparing cDNA sequence with the genomic sequence. The two largest ORFs in this region of the proximal repeat and the corresponding region of the distal repeat are labelled 1 and 7 on Figure 4 (also see Table 2).

Splice junctions

The presence of consensus splice junctions around putative ORFs can be used as an indication of how likely a particular ORF is to be genuine. The PATTERNS program was used to search the gene for potential exon-intron junction sequences and branch sequences. The following consensus sequences were screened (Mount, 1982):

MAG|GTRAGT     exon|intron junction
CTRAY          branch
(Y)11NYAG|     intron|exon junction

The minimum match percentage required for each sequence is as follows: MAGGTRAGT (79%), CTRAY (100%), and (Y)11NYAG (87%). A list of all splice junction and branch sequences found in ph is given in Table 4. All ten putative ORFs have potential junction sequences at their termini. However, the degree of similarity of these sequences to the consensus varies. Potential branch sequences (CTRAY) showing 100% similarity with the consensus exist in all putative introns except the sequence lying between ORFs 8 and 9. It is possible that a branch sequence with less than 100% similarity with the consensus exists in this sequence. All sequences contained in Table 4 contain the most conserved nucleotides of each consensus (boldfaced).

Table 4. A list of splice junction sequences and branch sequences present in ph

(Y)11NYAG         Coord.   MAGGTRAGT   Coord.   CTRAY   Coord.
ccccctccatcaaAG   1367     cAGGTagaT   309      CTgAt   50
tattcttctccgcAG   2044     cAGGTagcT   913      CTgAt   919
cattttctctaccAG   2470     cgaGTgAGT   1439     CTaAt   1566
ttcctgttttttaAG   3065     cttGTgAGT   2249     CTaAt   1619
tcccatttatttcAG   3899     agGGTaAtT   2935     CTaAc   1807
ttatatttcttacAG   7394     cAGGTcAaT   3498     CTaAt   2307
cttcttcgattgcAG   7638     agGGTgAGT   3550     CTaAt   2870
tttttttctctcaAG   7946     atGGTaAGa   3706     CTgAc   4063
tttttctctcaagAG   7948     cAGGTtcGT   4104     CTaAt   4422
ctcttttttgctgAG   9543     aAGGTaAag   5113     CTaAt   4591
tttttcaatcctcAG   9586     cAGGTgctT   6234     CTgAc   5091
actcttcccttgaAG   9686     actGTgAGT   6305     CTaAt   5104
ccctggttcttacAG   11207    cAtGTgcGT   6521     CTgAt   5136
actatcttcctacAG   11392    aAGGTattT   6754     CTaAc   5683
tctgccttgtcacAG   11902    aAtGTatGT   7894     CTaAc   6066
tttgcatctttccAG   11879    cAGGTggGT   9793     CTaAt   6323
ctcctcccccacaAG   12694    cAGGTgAag   10213    CTaAt   7415
cattccttcgcgcAG   13749    cAGGTgAtc   10591    CTgAt   8328
atcattcttttgcAG   15402    aAGGTggGa   10921    CTaAt   8411
atccctgtcttttAG   15588    aAaGTgAcT   10969    CTaAc   8447
ttctgttctacacAG   17237    cAaGTgcGT   11621    CTgAc   9329
attttgttttcacAG   17581    cAGGTggtT   11822    CTgAc   9419
ccgttccgctcttAG   18603    cAaGTgAGT   11845    CTgAt   9557
ttccgctcctctgAG   18710    cAaGTgAGT   13698    CTaAt   10063
ttcctctctgcgtAG   18884    cAGGTgAtT   14584    CTaAt   10500
tctttctctttccAG   19640    cAaGTaAaT   14799    CTgAc   11779
ttctctttccagcAG   19643    aAtGTgAaT   14981    CTgAt   12360
tcccactcgccacAG   20646    aAtGTaAGa   15238    CTaAt   12810
actatcttcctacAG   21263    cAtGTatGT   15787    CTaAc   14542
tttgcatctttccAG   21773    cAGGTatGa   15933    CTaAc   14729
tctgccttgtcacAG   21750    acGGTaAaT   17015    CTaAc   14930
cattccttcgcgcAG   23650    tAtGTaAGT   17063    CTgAc   15257
tcgccctcccatcAG   24179    aAaGTaAGT   17445    CTgAt   16022
ccctcccatcagcAG   24182    aAaGTaAGa   17459    CTgAc   16111
                           cAGGTgccT   18231    CTaAc   16387
                           aAtGTaAcT   18654    CTaAc   16404
                           actGTaAGT   19207    CTaAt   16838
                           ctcGTgAGT   19218    CTaAt   17098
                           cAaGTgtGT   19570    CTgAc   17749
                           aAaGTgAaT   19659    CTgAc   18152
                           aAGGTgccT   19762    CTgAt   18303
                           cAcGTggGT   19854    CTaAc   18669
                           cAGGTgAag   20069    CTgAc   18971
                           cAGGTgAtc   20450    CTgAt   19607
                           ccGGTgAaT   20621    CTaAt   20359
                           aAGGTggGa   20792    CTgAc   21650
                           aAGGTgAcc   20840    CTgAt   22231
                           cAaGTgcGT   21492    CTaAt   22690
                           cAGGTggtT   21693    CTgAc   24055
                           cAaGTgAGT   21716    CTaAt   24559
                           cAaGTgAGT   23589    CTaAt   24580
                           aAaGTgAaT   24514    CTgAc   24628

The splice pattern of ORFs 4, 5 and 6 was confirmed by comparing cDNA sequence with the genomic sequence (D. Pierre and H. Brock, pers. comm.). An intron of 60 bp lies between ORFs 4 and 5 and between ORFs 5 and 6. The splice junctions and their similarity with the respective consensus sequences are listed in Table 5. The two introns are highly conserved in the distal repeat. The four splice junctions between ORFs 10, 11 and 12 show 100% sequence identity with the corresponding junctions for ORF 4/5/6. It is probable that the distal repeat has the same splicing pattern as the proximal repeat in this region, and ORF 4/5/6 was compared to putative ORF 10/11/12 under this assumption. The very high sequence conservation of these introns suggests that alternative splicing may occur and that, in some ph transcripts, these introns represent coding sequence.
Thus, there may be other splicing patterns of ORFs 4, 5, 6 and ORFs 10, 11, 12 in addition to the pattern presented here. The splicing patterns of the remaining ORFs will have to be determined by comparing the sequences of cDNAs that contain these ORFs with the genomic sequence (work in progress).

Promoter sequences

Northern analysis suggests that the ph proximal promoter should be 3' of coordinate 3500, because a fragment 5' of this coordinate hybridizes to mRNA from the transcription unit 5' to ph. If there is a second ph promoter, it should be 3' to coordinate 7800 because an inversion (ph* ) lu truncates the 6.1 kb (embryonic) and 6.6 kb (pupal) ph transcripts, but does not affect the 6.4 kb (embryonic) and 6.1 kb (pupal) transcripts. The DNA sequence was screened for putative promoter sequences corresponding to the TATA (Breathnach and Chambon, 1981) and CCAAT (McKnight and Kingsbury, 1982) boxes. Potential sequences are listed in Table 6. Two of the more interesting sequences are located in the 5' regions of the two repeats. The sequence atGTATAaAaaGttt at coordinate 4897 in the proximal repeat is 87% similar to the consensus. Seventy-four bp upstream at coordinate 4818 is the sequence acTCAATac. This putative promoter lies upstream of the 5'-most ORF conserved in both repeats (ORF 2), downstream of the transcription unit 5' to ph, and is a good candidate for the proximal promoter. A similar putative promoter is found at the 5' end of the distal repeat. The sequence GtGTATAaAatGcat at coordinate 15211 is 93% similar to the consensus. The sequence GaCCAATta lies 75 bp upstream at coordinate 15137. This promoter lies upstream of ORF 7 and could therefore drive transcription of the distal repeat. It is also possible that ph contains promoters lacking these consensus sequences, in which case they would not be revealed by this search. Therefore, none of the promoter data listed above are conclusive.
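The consensus searches described above amount to sliding a pattern along the sequence and scoring each window. The Python sketch below counts the positions that agree with a consensus written in IUPAC ambiguity codes. It is an illustration only and not the PATTERNS program: the names `percent_match` and `scan` are hypothetical, and unweighted position counting is an assumption. Under that assumption the two donor junctions of ORF 4/5/6 score 89% (CAaGTGAGT) and 44% (AccGTGtaa), the values given in Table 5, but the program's own scoring and cutoffs may differ.

```python
# IUPAC nucleotide codes used by the consensus sequences in this section.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG",    # purine
    "Y": "CT",    # pyrimidine
    "M": "AC",
    "W": "AT",
    "N": "ACGT",  # any nucleotide
}


def percent_match(site, consensus):
    """Percentage of positions in `site` allowed by `consensus`.

    Both arguments must have the same length. Every position carries
    the same weight (an assumption made for this sketch).
    """
    if len(site) != len(consensus):
        raise ValueError("site and consensus must have the same length")
    matches = sum(
        base in IUPAC[symbol]
        for base, symbol in zip(site.upper(), consensus.upper())
    )
    return 100.0 * matches / len(consensus)


def scan(seq, consensus, min_percent):
    """Return (coordinate, site, score) for each window scoring at or
    above `min_percent`. Coordinates are 1-based and give the first
    nucleotide of the window."""
    seq = seq.upper()
    width = len(consensus)
    hits = []
    for i in range(len(seq) - width + 1):
        site = seq[i:i + width]
        score = percent_match(site, consensus)
        if score >= min_percent:
            hits.append((i + 1, site, score))
    return hits
```

The acceptor consensus can be written out for this function as `"Y" * 11 + "NYAG"`, and a branch search is the special case `scan(seq, "CTRAY", 100)`.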
Once primer extensions determine the extreme 5' end of the ph message, deletion series can be done on the DNA lying upstream to determine the locations of promoter sequences.

A translation initiation sequence (GGAATGG) at 5454 of the proximal repeat matches the consensus perfectly (Kozak, 1984). This sequence lies at the extreme 5' end of ORF 2 and is in frame with the ORF 2 sequence. ORF 8 does not contain a translation start site at its 5' end. However, ORF 7 does, at position 15708 (cCTATGa). Both of these translation start sites lie just downstream of potential promoters.

Table 5. Comparison of splice junctions of ORF 4/5/6 to the consensus

ORF   Junction sequence   Consensus    % similarity
4     AccGTGtaa           MAGGTRAGT    44
5     aCTaTCTTCCTACAG     (Y)11NYAG    87
5     CAaGTGAGT           MAGGTRAGT    89
6     TTTgCaTCTTTCCAG     (Y)11NYAG    87

Table 6. Putative promoter signals in the ph sequence

GXGTATAWAWXGXXG   Coord.   GGYCAAWCT   Coord.
atGTATAaAaaGttt   4897     cGcCAAaCa   1341
GaGTATAaActtctg   13695    tGtCAAtaT   2345
GtGTATAaAatGcat   15211    GacCAAaaT   2727
GaGTATAaActtctG   23596    GGtCAAtta   3502
GaaTATAaAtgGatc   24153    acTCAAtac   4818
                           tctCAAaCT   4957
                           tGtCAAaCg   5620
                           tGcCAAaaT   8096
                           GGtCAAtgc   8260
                           GGcCAAatg   8757
                           cGcCAAaCT   9565
                           aGcCAAtCc   10585
                           GGcCAAaCc   11239
                           GGcCAAaaT   11682
                           GGaCAAaaT   11758
                           cGcCAAtCa   14260
                           GacCAAtta   15137
                           GttCAAatT   15334
                           cGaCAAatc   16047
                           tGcCAAagT   17304
                           aGcCAAaaT   18628
                           aGcCAAtCc   20444
                           GGcCAAaaT   21553
                           GGaCAAaaT   21629

As discussed above, the probable locations of the promoters for the 6.1 kb and 6.4 kb mRNAs are at coordinates 4.8 kb and 15.2 kb respectively, although other sites cannot be ruled out. In particular, it seems likely that there should be a promoter responsible for the synthesis of the small transcripts detected with probes from coordinates 0-1372, and from 5781-24751.
In addition, there should be a promoter in the region from 1372-3464 responsible for the transcription of the 2 kb mRNA from the gene just upstream of ph. However, no obvious candidates were found. Locating these promoters will take additional work.

The 5' end of the distal repeat is less confusing. Northern data show a transcript hybridizing to the 1.5 Sal fragment. But the distal promoter lies upstream of this fragment. Thus, this promoter can also control transcription of the ORF(s) present in the 1.5 Sal fragment.

Termination sequences

Putative polyadenylation signals (AATAAA; Proudfoot and Brownlee, 1976) exist at the 3' end of the proximal repeat (Table 7). A polyA tail has been found in a proximal cDNA and maps just upstream of the putative distal promoter, at coordinate 15107 (D. Pierre and H. Brock, unpublished results). Just upstream lies a polyadenylation signal at 15092 (Table 7). This lends further support to the assignment of the distal promoter sequence. No polyadenylation signals showing 100% similarity to the consensus were found at the 3' end of the distal repeat. However, there are several sequences that vary from the consensus (e.g., AACAAA at 24512).

Table 7. Putative polyadenylation signals found in the ph sequence

Coordinate   Sequence
3288         AATAAA
3792         AATAAA
3854         AATAAA
4793         AATAAA
4809         AATAAA
5403         AATAAA
6986         AATAAA
7294         AATAAA
7451         AATAAA
7508         AATAAA
14806        AATAAA
15092        AATAAA
16573        AATAAA
17045        AATAAA
17438        AATAAA
18837        AATAAA

A transcription termination sequence (ATirTTCT) that has 100% identity with the consensus (Scott et al., 1988) is present between the two repeats at 15824. If both repeats were transcriptionally active at the same time, this sequence would not prevent a transcription complex from the proximal repeat from entering the putative distal promoter region. This is unlike the wolffish antifreeze protein genes, also organized as tandem repeats.
These genes possess a transcription termination sequence that lies between the two repeats to prevent the transcription of one repeat from influencing the regulation of the other repeat (Scott et al., 1988).

THE PH PROTEIN

ph contains a putative zinc finger

The function of ph is unknown. However, analysis of the amino acid sequence of ph may allow us to deduce a potential function. Due to the size of the gene, only the 6 putative ORFs (2, 3, 4/5/6, 8, 9 and 10/11/12; Figures 6 to 11) were studied in detail. These 6 ORFs were studied separately since the splicing pattern that joins them is unknown. The 6 ORFs show different degrees of amino acid sequence similarity between the two repeats (Table 8 and Figures 12 to 14). The degree of sequence similarity ranges from 43.5% to 86.8%. Regions with higher sequence similarity are more likely to be under greater selection pressure and, therefore, are more likely to be important for ph function.

The amino terminus of each repeat shows less sequence conservation than the rest of the gene. Thus, the carboxy end of the gene must play a more important role in the function of the protein than the amino terminus. This is supported by the presence of a canonical zinc finger in ORFs 4/5/6 and 10/11/12. The zinc finger has the sequence CEQCGKMEHKAKLKRKRYCSPGC and is located between 13824-13892 in the proximal repeat and between 23725-23793 in the distal repeat. The sequence shows 100% conservation between the two repeats.

Figure 6. Amino acid sequence of ORF 2.

1 €1  RGMVYEKGVFESRVTRPCYVFS RFIPTPSERTASLSHALTGQGHS  f°

Figure 7. Amino acid sequence of ORF 3.

1 SCRCCARFSLKLNLAWGS 61 i , L N F F Q K E L T V L L I S

Figure 8. Amino acid sequence of ORF 4/5/6.
1 gPNFSFSILRXDTESDTTTPVSTTXSOGISASAlLxGGTLPLKDNSNIREKPLHHNYNHN 60
61 HNNSSQHSHSHQQQQQQQVGGKQLERPLKCLETLAQKAGITFDEKyDVASPPHPGIAQQQ 120
121 ATSGTGPXTGSGSVTPTSHRHGTPPTGRRQTHTPSTPNRPSXPSTPNTNCNSIXRHTSLT 180
181 LEKAQNPGQQVXATTTVPLQ1SPEQLQQFXASNPXAIQVKQEFPTHTTSGSGTELKHATN 240
241 IKEVQQQLQLQQLSEANGGGXXSACAGGXXSPXNSQQSQQQQKSTXISTMSPMQLXXXTG 300
301 CVGG6vn QGRTVOLMQPSTSFLYP6MlVSGNLLHPOGLOOQPlQVITXGKPFQGNGPQML 360
361 TTTTQNAXQMIGGQAGFAGGNYATCIPTNHNQSPQTVLFSPHNV1SPQQQQNLLQSMAAA 420
421 AOOOOLTOOOOQFNQOOOQQLTQQOQOLTAALAKVGvDAQGKLAQKWQKVTTTSSAVQA 480
481 ATGPGSTGSTQTQQVQQVQQQQQQTTQTTQQCVQVSTSTLPVGVGGQSVQTAQLLNXGQX 540
541 OQMQIPWFLQNAAGLQPFGPNQIILRNQP0GTQGMF1QQQPATQTLQTQQNQIIQCNVTQ 600
601 TPTKARTOLDALXPKOQOOOOOVGfTNOTOOOOLAVATAQLQQQOQOLTAAALQRPGAPV 660
661 KPHNGTQVRPASSVSTQTXQNQSLLKXKMRNKQQPVRPXLXTLKTE1GQVXGQNKWGHL 720
721 TTV00000XTNLOQVVNXAGNKMWMSTTGTPITI.ONGQTLHAATAAGVDK0OQQL0LFQ 780
781 KQQILQQQQMLQQQIAAIQMQQQQAAVQAOQQQOQQVSQQQQVNAQQQQAVAQQQQAVAQ 840
841 AQOQQREQQQQVAOAQADHQQALANATQQILQVAPNQFITSHQQQQQQQLHNQLIQQQLQ 900
901 OQAOAQVQADVOAOAOBQOOOREQOQNIIOQIWQQSGATSOQTSQOOQHHQSGQLQLSS 960
961 VPFSVSSSTTPAGIATSSXLQXXLSXSGXIFQTXKPGTCSSSSPTSSWT1TN0SSTPLV 1020
1021 TSSTVXS100XOTQSXQVHQHQOL1SXT1XGGTOQOPOGPPSLTPTTNPILXMTSMMNXT 1080
1081 VGHLSTXPPVTVSVTSTXVTSSPGQLVLLSTXSSGGGGSIPXTPTKETPSKGPTXTLVPI 1140
1141 CSPKTPVSGKDTCTTPKSSTPXTVSXSVEXSSSTGEXLSNGDXSDRSSTLSKGXTTPTSK 1200
1201 OSKXXVQPPSSTTPNSVSGKEEPKLXTCGSLTSXTSTSTTTTITNGIGVXRTTXITXVST 1260
1261 XSTTTTSSGTFITSCTSTTTTTTSSISNGSKDLPKXMIKPNVLTHVIDGFIIQEXNEPFP 1320
1321 VTRCJRyADKDVSDEPPEEYKLLVPMLFRNLNVSFLRXEKKATMQEDIKLSGlASAPGSDM 1380
1381 VACEQCGKMEHKAKLKRKRYCSPGCSRQAKNGICGVGSGETNGLGTGGIVGVAXMALVDR 1440
1441 LDEAMAEEKMQTEATPKLSESFPILGASTEVPPMSLPVOAAISXPSPLXMPLGSPLSVAL 1500
1501 PTLX^LSVVTSGXXPKSSEVNGTDRPPISSWSVD£>VSKF1RELP&C0DYVDDFI60E1DG 1560
1561 QXLLLLKEKHLVNAMGMKLGPALKIVAKVESIKEVPPPGEAKDPGAQ 1607

Figure 9. Amino acid sequence of ORF8.

Figure 10. Amino acid sequence of ORF9.

1 SCRCSXRFSLKLNLXACGSCJaLsLRSCWCGVTJVSHTACVLNGOKKKRRRDAHTPYHRFV
61 VFIYIFIFGDOCKSVXTISD

Figure 11. Amino acid sequence of putative ORF 10/11/12.

1 BNHNKNININLNMNMKHXPRIRSRXKCVRVCLCLKQPTOSLCAPPLSLSSSCDT£SESAT 60
61 TIRTPPPSPEATTSVKVNSTTRVDPQRPLRCLETLAQKAGISFDEDFAKSPSQSPSSKAA 120
121 RGSVGTPSIRRRHPLLPLSSRSPSAPDSKTTGRKLEKSQSPAQPMAAATNVPLQ1SPEQL 180
181 QQLYANNPYAIQVKQEFPTHTTSGSGTELKHATNIMEVQQQLHVQQQLSEANGGGAASAG 240
241 AGGAA5PAN5QQSQQQQK5TAISTMSPHQLAGPTGGVGGDWTQGRTVQLHQPSTSFLYPQ 300
301 MIVSGNLLHPGGLCQQP1QV1TAGKPFQGNGPQM£,TTTTQNAKQMICOQAGFAGGNYATC 360
361 IPSNHNQSPQTVLISPVNVISHSPQQQQNLLQSMAAAAQQQQLTQQQQQQLNQQQQQLMQ 420
421 O0QQQQLTAALAKVGVOAQGKLAQKWQKVTTTSSTVQAATGPGSTGSTQTQQVQQVQQQ 480
481 QQQTTQTTQQCVQVSQSTLPVGVGGQSVQTAQLLNAGQAQQMQIPWFWQNAAGLQPFGSN 540
541 QIILRNQPDGTQGMFIQQQPATQTLQTQQNQIIQCNVTQTPTKARTQLDALAPKQQQQQQ 600
601 OVGTFNQTQQQQLAVATAQLQQOQOOLTAAALQRPGAPVMPHNGFQVRPASSVSTQTAQN 660
661 QSLLKAKMRNKQQPVRPALATLKTEIGQVAGQNKWGHLTTVQQQQQATNLQQWNAAGN 720
721 KMWHSTTGTPITLQNGOTLHAATAAGVDKOOQQLQLFQKOQILQQOOMLQQQ1AAIOMQ 780
781 OQQAAVQAQQQQQQQVSQQOQVNAQQQQAVAQQQQAVAQAQQQQREOOQQVAQAQAQHQQ 840
841 ALAMATQQILQVAPNQFITSKQQQOQQQLHNQLIQQQLQQQAQAQVQAQVQAQAQQQQQQ 900
901 REOQQNIIQQIWQ6STGATSQQQQQOPQQQSGQLQLSSVPFSVSPSKTAEDIA6ITSSA 960
961 LQEALSVSGAIFOTTKPITCSSSTLPTSSWT1TSQSSTPLVTSSTVASMQQADTQGTQ1 1020
1021 HQHQQLISATIAGGSQQQ0QQQQLGLPSLTPTTPSPTTNP1LAMTSMMNATVGHLSTAPP 1080
1081 VSVSSTAVTPSSGQLVTLSSASSGGGAGFPATPTKETPSKGPTATLVPIDSPKTPVSGKD 1140
1141 TCTTPKSSTPATVSASVEASSSTGEALSNGDA5DRSSTPSKGATTPTSKQSNAAVQPPSS 1200
1201 TLPNSVSGKEEPKLHNCGSLTSATSTSTTTTITNGIGVARTTASFAVSTASTTTTSSGTF 1260
1261 TTSCTSTTTTTTSSISNGSKDLPKAMIKPNVLTHVIDGFIIQEANEPFPVTRORYADKDV 1320
1321 SDEPPSEYKLLVPMLFRNLKVSFLRAEKKATMQEDIKLSGIASAPGSDMVACEQCGKMEH 1380
1381 KAKLKRKRYCSPGCSRQAKNGIGGVGSGETNGLGTGGIVGVDAKALVDRLDEAMAEEKMQ 1440
1441 TESYQTVSDALPIQAATPEVPPISMPVLAAKSTSSPLSLPLTLPI.PIAIAPTV5LPWSA 1500
1501 GWAPVLAIPSSNINGSDRPPISSWSVEEVSNFI&ELPGCQDYVDDFIQQEIDGQALLLL 1560
1561 KENHLVMAMGMKLGPALKIVAKVESIKEVPPGDVKD 1596

Table 8. Amino acid sequence similarity between the proximal ORFs and the distal ORFs.

ORF (prox.)   ORF (dist.)   Similarity index (%)   Overlap (aa)
2             8             43.5                   62
3             9             62.1                   66
4/5/6         10/11/12      86.8                   1541

Figure 12. Optimal amino acid sequence alignment of ORF2 and ORF8.

This alignment was made using the method of Wilbur and Lipman (1983). The gap penalty = 4 and the deletion penalty = 12. The top line of the alignment represents ORF2. The bottom line represents ORF8. The middle line represents amino acid sequence conserved in both ORFs. A colon indicates amino acids that are positively related by the protein similarity matrix (Lipman and Pearson, 1985). A dot indicates those with a zero value relationship. Dashes represent gaps introduced into the sequence to maintain sequence conservation. ORF2 and ORF8 have an overall identity of 43.5% in a 62 aa overlap.

ORF2  RQMWEKGVFESRVTRPCYVFSNGRGLSVGVVEHAYTCPPAVVF--LFFLLVLPLLCQTNFCRFIPTPSER
      .: . :VF: . G VGWEHAYTCPPAW: :FF F L. :. . F P ::.R
ORF8  OQSNLSRVFYFVFTVXGGAQVGWEHAYTCPPAWLIPVFFTRFGFALSTAAVS-FOPOANAR

ORF2  TASLSHALTGQGHS
ORF8  HPARTRYLTRATPSGVALAPVHFGIEKSY

Figure 13. Optimal amino acid sequence alignment of ORF 3 and ORF9.

See the legend to Figure 12 for the method of alignment and alignment parameters. ORF 3 (top line) and ORF9 (bottom line) have an overall identity of 63.6% in a 66 aa overlap.

ORF3  VLLIS
ORF9  DOCKSVLTISD

Figure 14. Optimal amino acid sequence alignment of ORF 4/5/6 and putative ORF 10/11/12.

See the legend to Figure 12 for the method of alignment and alignment parameters.
ORF 4/576 (top line) and putative ORF 10/11/12 (bottom line) have an overall identity of 86.8% in a 1541 aa overlap.  78  lOv  20v  30v  40v  50v  60v  70v  SPNFSFSILRADTESDTTTPVSTTASQCISASAILXOCTLPLKDNSNIREKPIIHHNYMHNMNMSSQHSKS  : N  . .  6 : .  :. .  .::L  8  ..  x : . .  :::  KNMNKNININLNKNMKHAPRIRSRAXCVRVCLCLKQFTDSLCAFPLSLSSSGDTESESATTIRTPPPSPEA  10203040506070" 60v 90v iOOv llOv 120v 130v 140v HQQQQQQQVGGKQLERPLKCLETLAQKAGITFDEKYDVASPPHPGIAQQQAT8CTGPATGSGSVTPTEHRH i t . . : :RPL:CLETLAQKAGI:FOE.: : . : P : :::s :: : .  TTSVKVNSTTRVDPQRPLRCLETLAQKAGIEFDEDF  AKSPSQSPSSKAARG-SV  6090100110* 120150v 160v 170v 180v 190v 200v 21t)v GTPPTGRRQTHTPSTPNRPSAPSTPNTNCNSIARHTSLTLEKAQNPGQQVAATTTVPLQISPEQLQQFYAS GTP:. RR:. P ::..PSAP :.::.:R .LEK:Q:P:Q.:AA:T.VPLQISPEQLQQ:YA: OTPSIRRRHPLLPLSSRSPSAP DSKTTGR KLEKSQSPAQPMAXATNVPLQISPEQLQQLYAN 130140* 150160170180" 220v 230v 240v 250v 260v 270v 260v NPYAIQVKQEFPTHTTSGSGTELKHATNIMEVQQQLQL-QQLSEANGGGAASAGAGGAASPANSQQSQQQQ NPYAIQVKQEFPTHTTSGSGTELKHATNIMEVQQQL:: OQLSEANGGGAASAGAGGAASPAKSQQSQQQQ NPYAIQVKQEFPTHTTSGSGTELKHATNIMEVOQQLHVQQOLSEANGGGAASAGAGGAASPANSOOSQQOO 190200210220230~ 240250290v 300v 310v 320v 330v 340v 350v HSTAISTMSPMQLAAATGGVGGDWTQGRTVQLMQPSTSFLYPQMIVSGNLLHPGGLGQQPIQVITAGKPFQ KSTAISTMSPMQLA::TGGVGGDWTQGRTVQLMQP STSFLYPQMIVS GNLLHPGGLGQQP1QVITAGKPFQ HSTAISTMSPMQLAGPTGGVGGDWTQGRTVQLMQPSTSFLYPQMIVSGNLLHPGGLGQQPIQVITAGKPFQ 270280290300310320360v 370v 380v 390v 400v 410v 420v CNGPQMLTTTTQNAKQMIGGQAGFAGGNYATC1PTHHNQSPQTVLFSPKNVI—SPQQQQNLLQSKAAAAQ GNGPQMLTTTTQNAKQKIGGQAGFAGGNYATCIP:NHNQSPQTVL:SP:NVI SPQQQQNLLQSMAAAAQ GNGPQKLTTTTQNAKQMIGGQAGFAGGNYATCIPENHNQSPQTVLI8PVNVI6HSPQQQQNLLQSMAAAAQ 340350360370" 380390430v 440v 450v 460v 470v 480v 490v QQQLTQQQQQFNQQQQQQLT--QQQQQLTAALAKVGVDAQGKLAQKVVQKVTTTSSAVQAATGPGSTGSTQ OQQLTQQQQQ :OQQQQL. 
QQQQQLTAALAKVGVDAQGKLAQKWQKVTTTSS: VQAATGPGSTGSTQ QQQLTQQQQQQLNQQQQQLNQQQQQQQLTAALAKVGVDAQGKLAQKWQKVTTTSSTVQAATGPGSTGSTQ 410420430440450460* 470" 500v 510v 520v 530v 540v 550v 560v TQQVQQVQQQQQOTTQTTQQCVQVSTSTLPVGVGGQSVQTAQLLNAGQAQOMQIPWFLQNAAGLQPFGPNQ TQQVQQVQQQQQQTTQTTQQCVQVS STLPVGVGGQSVQTAQLLNAGQAQQMQIPWF QNAAGLQPFG:HQ TQQVQQVQQQQQQTTQTTOQCVQVSQSTLPVGVGGQSVQTAQLLNAGQAQQMQIPWFWQNAAGLQPFGSNQ 480490500510" 520530* 540" 570v 580v 590v 600v 610v 620v 630v IILRNQPDGTQGMFIQQQPATQTLQTQQNQIIQCNVTQTPTKARTQLDALAPKQQQQQQQVGTTNQTQQQQ IILRNQPDGTQGMFIQQQPATQTLQTQQNQIIQCNVTQTPTKARTQLDALAPKQQQQQQQVGTTNQTQQQQ IILRNQPDGTQGMFIQQQPATQTLQTQQNQIIQCNVTQTPTKARTQLDALAPKQQQQQQQVGTTNQTQQQQ 550560570* 580590600* 610640v 650v 660v 670v 680v 690v 700v LAVATAQLQQQQQQLTAAALQRPGAPVMPHNCTQVRPASSVSTQTAQNQSLLKAKMRNKQQPVRPA1.ATLK LAVATAQLQQQQQQLTAAALQRPGAPVKPHNGTQVRPASSVSTQTAQNQSLLKAKMRNKQQPVRPALATLK LAVATAQLQQQQQQLTAAALQRPGAPVMPHNGTQVRPASSVSTQTAQNQSLLKAKHRNKQQPVRPAI.ATLK 620630640650660670680" 710v 720v 730v 740v 750v 760v 770v TElGQVAGQNKWGHLTTVQQQQQATNLQQVVNAAGNKMVVMSTTGTPITLQNGQTLHAATAAGVt>KQQQQ TEIGQVAGQNKWGHLTTVQQQQQATNLQQWNAAGNKMWMSTTGTPITLQNGQTLHAATAAGVDKQQQQ TEIGQVAGQNKWGHLTTVQQQQQATNLQQWNAAGNKMWMSTTGTPXTLQNGQTLHAATAAGVDKQQQQ 690700710720730740750" 780v 790v 800v 810v 820v 830v 640v LQLFQKQQILQQQQKLQQQIAAIQMQQQQAAVQAQQQQQQQVSQQQQVNAQQQQAVAQQQQAVAQAQQQQF LQLFQKQQILQQQQMLQQQIAAIQHQQQQAAVQAQQQQQQQVSQQQQVNAQQQQAVAQQQQAVAQAQQQQR LQLFQKQQILQQQQKLQQQIAAIQMQQQQAAVQAQQQQQQQVSQQQQVNAQQOQAVAQQQQAVAQAQQQQR 760770780" 790800810820850v 860v 870v 880v 690v 900v 910v EQQQQVAQAQAQHQQALANATQQILQVAPNQFITSHQQQQQQQLHNQLIQQQLQOQAQAOVQAQVOAQAQO BQQQQVAQAQAQHQQALANATQQILQVAPNQFITSHQQQQQQQLHNQLIQQQLQQQAQAQVQAQVQAQAQQ EQQQQVAQAQAQHQQALANATQQILQVAPNQFITSHQQQQQQQLHNQLIQQQLQQQAQAQVQAQVQAQAQQ 630640850660' 870680890*  79  v 930v 940v 950v 960v 970v 980v OOOQREQQONIIOQIWQQSGXTSQQTSOQQQHHQSGQLQLSSVPFSVSSSTTPXGIA—TSSALQAALSA OQQQREQQQNIIQQIWQQS.::: Q QQQ.::QSGQLQLSSVPFSVS:S T:.:IX TSSXLQ.XLS. 
OOQQREQQQN1IQQIWQQSTGATSQOOODQPOQQSGQLQLSSVPFSVSPSMTXEDIXGITSSALQEXliSV 900* 910" 920930940* 950960990v lOOOv lOlOv 1020v 1030v 1040v 1050v - _ SGXIFOTXKPGTCSSSS-PTSSWTITNQSSTPLVTSSTVXSIQQXQTQSXQVHQHQQLISXTIXGGT 8GXIF0T;KP TCSSS: PTSSWTIT:QSSTPLVTSSTVXS:QQAQTQ::Q:HQHQQL1SXTIXGG: 6GXIFQTTKP1TCSSSTLPTSSWTITSQSSTPLVTSSTVXSHQQXQTQGTQIHQHQQLISXTIXGGSQQQ 980990100010101020" 10301060v 1070v lOSOv 1090v HOOv lllOv -OQQPQGPPSLTPTT— NPILXMTSMMNXTVGHLSTXPPVTVSVTSTXVTSSPGQLVl,LSTXSSGGGG QQQ. G PSLTPTT HPILXMTSMMNXTVGHLSTXPPV:VS STXVT.'S :GQLV LS:XSSGGG: OQQQQLGLPSLTPTTPSPTTNPILXMTSKKNXTVGHLSTXPPVSVS~STXVTPSSGQLVTLSSXSSGGGX 105010601070108010901100v 1130v 1140v 1150v 1160v 1170v HBOv SIPXTPTKETPSKGPTXTLVPIGSPKTPVSGKDTCTTPKSSTPXTVSXSVEXSSSTGEXLSNGDXSDRSST ::PXTPTKETPSKGPTXTLVPI:SPKTPVSGKDTCTTPKSSTPXTVSXSVEXSSSTGEALSNGDXSDRSST GFPXTPTKETPSKGPTXTLVP1DSPKTPVSGKDTCTTPKSSTPXTVSXSVEXSSSTGEXLSNGDXSDRSST 112011301140115011601170" V 1200v 1210v 1220v 1230v 1240v 1250v 1260v LSKGXTTPTSKQSNXXVQPPSSTTPNSVSGKEEPKLXTCGSliTSXTSTSTTTTlTNGIGVXRTTXSTXVST SKGXTTPTSKQSNXAVQPPSST.PNSVSCKEEPKL .CGSLTSATSTSTTTTITNG1GVXRTTXSTXVST PSKGXTTPTSKQSNXXVQPPSSTIPHSVSGKEEPKLHNCGSLTSXTSTSTTTTITNGIGVXRTTXSTXVST 11901200121012201230* 12401270v 1280v 1290v 1300v 1310v 1320v 1330v ASTTTTSSGTF1TSCTSTTTTTTSS1SNGSKDLPKXMIKPNVLTHV1DGF1IQEXNEPFPVTRQRYXDKDV ASTTTTSSGTF.TSCTSTTTTTTSS1SNGSKDLPKXMIKPNVLTHVIDGFI1QEXNEPFPVTRQRYXDKDV XSTTTTSSGTFTTSCTSTTTTTTSS1SNCSKDLPKXMIKPNV1.THVIDGFI1QEXNEPFPVTRQRXXDKDV 1260* 1270" 12801290" 1300" 1310" 1320* 1340V 1350v 1360v 1370v 1380v 1390v 1400v SDEPPSEYKLLVPKLFRNLNVSFLRXEKKXTMQEDIKLSGIXSXPGSDMVXCEQCGKMEHKXKLKRKRYCS SDEPPSEYKLLVPMLFRNLNVSFLRXEKKATMQEDIKLSGIASAPGSDMVACEOCGKMEHKAKLKRKRYCS SDEPPSEYKLLVPMLFRNLNVSFLRXEKKXTMQEDIKLSGIXSXPGSDMVXCEQCGKMEHKXKLKRKRYCS 1330134013501360" 1370" 1380" 13901410v 1420v 1430v 1440v 1450v 1460v 1470v PGCSRQXKNG1GGVGSGETNGLGTGGIVGVXXMALVDRLDEAMAEEKMQTEATPKI.SESFPILGXSTEVPP PGCSRQAKNGIGGVGSGETNGLGTGGIVGV.XMXLVDRLDEXMXEEKMQTE: ..:S:::PI :X:.EVPP 
6>GCSRQXKNG1GGVGSGETNGLGTGGIVGVDXMXLVDRLDEXMXEEKMQTESYQTVSDX1,PIQXXTPEVPP 14001410* 142014301440* 1450" 14601480v 1490v 1500V 1510v 1520v 1530v MSLPVQAAISAPSPLAMPLGSPLSVALPTLAPLSWTSG -AAPKSSEVNCTDRPPISSWSVT>DVSNF :S:PV AX:S::SPL::PL. PL::X::. ,:L:W::C A SS::NG:DRPPISSWSV::VSNF ISMPVLXXMSTSSPLSLPLTLPLPIXIXPTVSLPWSXGWXPVLXIPSSNINGSDRPPISSWSVEEVSNF 147014801490" 1500* 151015201530* v 1550v 1560v 1570v 1580v 1590v 1600v IRELPGCQDYVDDFIOQEIDGQXLLLLKEKHLVNXMGMKLGPALKIVAKVESIKEVPPPGEAKDPGAQ 1RELPGCQDYVDDFIQQEIDGQXLLLLKE:HLVNXMGMKLGPXLKIVXKVES1KEVPP G:.KD 1RELPGCQDYVDDF1QQEIDGQXLLLLKENHLVNXMGMKLGPXLKIVXKVESIKEVPP-0DVKD 154015501560157015801590"

The zinc finger motif was originally discovered in TFIIIA of Xenopus (Miller et al., 1985). Since then, zinc fingers have been found in a host of other gene regulatory proteins in yeast (Struhl, 1987), mammals (Mitchell and Tjian, 1989) and Drosophila (Tautz et al., 1987; Rosenberg et al., 1986). There are two types of zinc fingers. One consists of two cysteine-histidine pairs separated by 12 to 14 aa (the cys2his2 finger; Mitchell and Tjian, 1989). The other consists of two cysteine-cysteine pairs separated by two or four aa (the cys2cys2 finger). The ph zinc finger is typical of the latter class (the conserved cysteine residues in the ph finger sequence above are boldfaced). The finger sequence is thought to form a tetrahedral complex with a zinc ion, and the residues intervening between the cys-his or cys-cys pairs are thought to loop out to form a finger that interacts directly with DNA. The loop of the ph finger is rich in positively charged amino acids (lysine, arginine, histidine). This would facilitate interaction of the ph protein with negatively charged DNA.

Other proteins with cys2cys2 zinc fingers include the glucocorticoid receptor (Evans, 1988), the estrogen receptor (Krust et al., 1986) and GAL4 of yeast (Keegan et al., 1986; Ma and Ptashne, 1987). Proteins of the cys2cys2 finger family have one or two fingers, like ph.
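The cysteine layout and the basic character of the finger loop can be checked directly from the finger sequence quoted in the text. The following is a minimal Python sketch and is not part of the original analysis; the function names (cysteine_positions, finger_loop, basic_fraction) are hypothetical and introduced only for illustration, and the loop is assumed to be the stretch between the second and third cysteine.

```python
# Illustrative sketch only: locate the cysteines in the ph finger sequence
# quoted in the text and summarise the loop between the two cysteine pairs.
FINGER = "CEQCGKMEHKAKLKRKRYCSPGC"  # finger sequence as given in the text
BASIC = set("KRH")  # lysine, arginine, histidine


def cysteine_positions(seq):
    """Return the 0-based positions of every cysteine in seq."""
    return [i for i, aa in enumerate(seq) if aa == "C"]


def finger_loop(seq):
    """Return the residues between the first and second cysteine pair.

    Assumption of this sketch: seq holds exactly four cysteines
    (a cys2cys2 layout), so the loop runs from cysteine 2 to cysteine 3.
    """
    pos = cysteine_positions(seq)
    if len(pos) != 4:
        raise ValueError("expected exactly four cysteines")
    return seq[pos[1] + 1:pos[2]]


def basic_fraction(seq):
    """Return the fraction of residues in seq that are K, R or H."""
    return sum(aa in BASIC for aa in seq) / len(seq) if seq else 0.0


if __name__ == "__main__":
    loop = finger_loop(FINGER)
    print("cysteines at (1-based):", [p + 1 for p in cysteine_positions(FINGER)])
    print("loop:", loop, "length:", len(loop))
    print("basic fraction of loop:", round(basic_fraction(loop), 2))
```

For the quoted sequence this finds cysteines at residues 1, 4, 19 and 23 and a 14-residue loop in which 8 residues are lysine, arginine or histidine. The 23 residues also account for the 69 nucleotides spanned by coordinates 13824-13892.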
Proteins of the cys2his2 finger family usually have multiple fingers (e.g., TFIIIA, Sp1). Several Drosophila proteins have cys2his2 fingers. Kr has five fingers (Rosenberg et al., 1986) and hb contains a total of six fingers (Tautz et al., 1987). The serendipity locus shows a structure similar to ph in that it is made up of a tandem repeat, each repeat containing several conserved zinc fingers (Vincent et al., 1985).

The presence of a DNA-binding motif in ph supports the hypothesis that ph is required for determination in the Drosophila embryo. ph is necessary for proper expression of the homeotic and segmentation genes. The ph protein could therefore regulate these genes by direct interaction with promoter sequences via the zinc finger.

Other domains and motifs

ORFs 4/5/6 and 10/11/12 are glutamine-rich (see Table 9). Transcription factor Sp1 contains four separate transcriptional-activating domains. Two of these are rich in glutamine (25%; Courey and Tjian, 1988). Other known or suspected transcription factors contain glutamine stretches (e.g., zeste, Antp, cut; Biggin et al., 1988; Pirotta et al., 1987; Schneuwly et al., 1986; Blochlinger et al., 1988). A glutamine-rich region of Antp has been shown to functionally substitute for an Sp1-activating domain (Mitchell and Tjian, 1989). These glutamine stretches could contact other proteins (e.g., RNA polymerase or other transcription factors) by hydrogen bonding and thereby influence the rate of transcription. Alanine stretches may also play a role in transcriptional activation (Courey and Tjian, 1988). Sp1, zeste and ph all contain a stretch of alternating glutamine and alanine. In ph, the sequence is QAQAQVQAQVQAQAQ, located at 12381 and 22252. Together, these data support the suggestion that ph encodes a transcription factor.

The amino acid sequences of the 6 ORFs were screened for other protein sequence motifs.
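A motif screen of this kind can be sketched with regular expressions. The Python example below is illustrative only: the two patterns (a run of at least five glutamines, and an alternating stretch of glutamine with alanine or valine) are assumptions chosen for this sketch, not the criteria used in the thesis, and the test fragment is a short piece of the ORF 10/11/12 sequence surrounding the alternating glutamine/alanine stretch.

```python
import re

# Hypothetical patterns for this sketch; they are NOT the screening
# criteria of the original analysis.
PATTERNS = {
    "glutamine run": re.compile(r"Q{5,}"),
    "alternating Q-A/V stretch": re.compile(r"(?:Q[AV]){4,}Q?"),
}


def scan(seq):
    """Return (motif name, 1-based start, matched text) for each hit."""
    hits = []
    for name, pattern in PATTERNS.items():
        for match in pattern.finditer(seq):
            hits.append((name, match.start() + 1, match.group()))
    # Report hits in order of their position along the sequence.
    return sorted(hits, key=lambda hit: hit[1])


# Fragment of ORF 10/11/12 around the alternating Q/A stretch.
FRAGMENT = "LIQQQLQQQAQAQVQAQVQAQAQQQQQQ"

if __name__ == "__main__":
    for name, start, text in scan(FRAGMENT):
        print(f"{name}: starts at residue {start} of the fragment: {text}")
```

Coordinates reported by this sketch are positions within the fragment, not the nucleotide coordinates (12381 and 22252) used in the text.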
The sequence does not contain a leucine zipper (Kouzarides and Ziff, 1988), nor does it contain any ATP-binding sites (Walker et al., 1982). The sequence does contain many putative signal peptide cut sites (Perlman and Halvorson, 1983). However, these sites are only three amino acids in length and could easily occur in a random sequence. The sequence does not contain a prd box (Frigerio et al., 1986) or a homeobox (Laughon and Scott, 1984; Frigerio et al., 1986). Because ph contains a zinc finger, the presence of a homeobox as a DNA-binding domain would be redundant, although the protein could be bifunctional. If ph is indeed a transcription factor, one would expect to find a nuclear transport signal in the sequence. So far, no consensus sequence is known, although short stretches of basic residues (lysine, arginine) have been identified as nuclear transport signals in some proteins (Dingwall and Laskey, 1986). ph contains two short arginine stretches: one in ORF 4/5/6 (RRRR at 11300) that is conserved in ORF 10/11/12, and another (RRRR at 15730) that is present only in ORF 7 of the distal repeat.

Table 9. The three most abundant amino acids of each ORF.

ORF        Amino acid       Content (%)
2          Valine           12
           Leucine          11
           Phenylalanine    10
3          Leucine          17
           Arginine         9
           Cysteine         9
4/5/6      Glutamine        18
           Threonine        11
           Alanine          10
           Serine           10
8          Alanine          13
           Valine           13
           Phenylalanine    10
           Proline          8
           Threonine        8
9          Serine           11
           Cysteine         10
           Leucine          10
           Arginine         9
10/11/12   Glutamine        18
           Alanine          10
           Serine           10
           Threonine        10
           Leucine          7
           Proline          7
           Valine           7

Protein structure and composition

The amino acid sequences of the 6 ORFs were analyzed using the PROTEIN program. Table 10 lists summary information about each ORF. The ph protein is, for the most part, in the extended (β-sheet) conformation (mean of 65%). Interspersed throughout the protein are varying amounts of helix, turn and coil.
The distribution of the four different protein conformations along a typical ORF is random. However, there are some regions of interest. ORF 3 contains stretches of helix not found at all in the corresponding region from the distal repeat, ORF 9. The longest of these stretches lie at 42-50 and 62-72 (coordinates refer to the amino acid sequences given in Figures 7 to 12). ORF 4/5/6 has a 12 aa stretch of alternating turn and coil (TPSTPNRPSAPS) at 153-164. ORF 4/5/6 also contains a long stretch of turn at 429-439. The same ORF contains two long regions of helix at 1439-1454 and 1561-1574. ORF 10/11/12 contains several regions of coil (48-56, 200-207 and 528-537). ORF 10/11/12 also contains several long stretches of helix (1337-1358, 1423-1444 and 1549-1585) that are not conserved in ORF 4/5/6.

The isoelectric points of the ORFs are high (mean of 9.04), indicating that ph is a basic protein. This is consistent with ph being a regulatory protein, since its overall positive charge would facilitate interaction with negatively charged DNA.

Table 10. Structure, conformation and charge of the 6 putative ORFs.

                      Robson conform. (%)                             Average hydrophobicity
ORF        Length     H    E    T    C     MW (g/mol)   Iso. pt.      Hopp   Kyte
2          83         0    66   20   14    9214         8.55          42     35
3          75         33   59   8    ?     8405         8.83          29     47
4/5/6      1607       4    64   14   19    169187       8.97          11     -52
8          91         0    64   11   25    9918         9.91          44     26
9          80         0    73   23   5     8903         9.09          17     21
10/11/12   1596       3    66   13   18    169123       8.90          10     -49

Hydrophobicity plots of each ORF show that ph is a hydrophilic protein. There are no putative trans-membrane domains.

Table 9 lists the three most abundant amino acids of each ORF. Some ORFs have a higher content of certain amino acids than others, and some contain long stretches of the same amino acid.
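A composition summary of the kind reported in Table 9, together with a search for the longest stretch of a single residue, can be sketched as follows. This Python example is an illustration under stated assumptions, not the PROTEIN program used in the thesis; the function names are hypothetical, and the example input is a short threonine-rich fragment of ORF 10/11/12 rather than a complete ORF.

```python
from collections import Counter
from itertools import groupby


def composition(seq, top=3):
    """Return the `top` most abundant residues as (residue, percent) pairs."""
    counts = Counter(seq)
    return [(aa, round(100 * n / len(seq), 1)) for aa, n in counts.most_common(top)]


def longest_run(seq):
    """Return (residue, length) for the longest stretch of one repeated residue."""
    best = ("", 0)
    for aa, group in groupby(seq):
        length = len(list(group))
        if length > best[1]:
            best = (aa, length)
    return best


# Short fragment of ORF 10/11/12, used only to exercise the functions.
EXAMPLE = "TSATSTSTTTTITNGIGV"

if __name__ == "__main__":
    print(composition(EXAMPLE, top=2))
    print(longest_run(EXAMPLE))
```

Applied to a full ORF, composition would give percentages comparable in kind to those in Table 9, although values computed from the sequences as transcribed here need not match the table exactly.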
Stretches of glutamine are thought to play a role in transcriptional activation (discussed above). ph also contains stretches of serine or threonine. Interestingly, Sp1, a transcription factor with a zinc finger and a glutamine-rich region, also has serine and threonine stretches (Courey and Tjian, 1988). The function of these stretches is unknown.

Similarities of the ph sequence to other proteins

The 6 putative ORFs were checked for amino acid similarity with sequences in the NBRF-PIR (rel 22) database using the algorithm of Lipman and Pearson (1985). In addition to the zinc finger result discussed earlier, ph showed other interesting similarities. The top five alignments for each ORF in the distal repeat are listed in Table 11.

Table 11. Comparisons of the ph protein sequence to other proteins.

Protein                                        Similarity index (%)   Overlap (aa)

A) ORF 10/11/12
mopa box protein (mouse)                       47.0                   134
alpha-beta-gliadin precursor                   33.8                   263
glutenin low molecular weight chain            30.2                   351
regulatory protein zeste (Drosophila)          37.5                   168
notch protein (Drosophila)                     19.0                   594

B) ORF 9
integrin beta-1 chain precursor                42.3                   26
fibronectin receptor beta chain                40.9                   22
thyroglobulin precursor (bovine)               21.7                   60
DNA-binding protein (herpes simplex virus)     30.0                   80
rubredoxin (Pseudomonas)                       33.3                   18

C) ORF 8
prolactin-inducible protein precursor          28.3                   53
cytochrome p450iic2 (rabbit)                   32.9                   73
retrovirus-related pol polyprotein             28.6                   56
cecropin b precursor (cecropia moth)           58.3                   12
phosphoenolpyruvate carboxylase                27.7                   47

D) 1166-1468 of ORF 10/11/12
salivary glue protein sgs-3 (Drosophila)       22.4                   165
balbiani ring protein 1-gamma (Drosophila)     26.8                   41
regulatory protein zeste (Drosophila)          29.5                   78
gene 62 protein (varicella-zoster)             19.5                   569
elastin precursor (chicken)                    38.2                   34

E) 1287-1446 of ORF 10/11/12
elastin precursor (chicken)                    38.2                   34
DNA-directed RNA polymerase II                 21.2                   104
L-arabinose-binding protein (E. coli)          21.1                   161
s-adenosylmethionine synthetase                20.4                   98
pal cross-reacting lipoprotein precursor       23.3                   103

ORF 10/11/12 is very rich in glutamine. Long stretches of glutamine have been termed opa repeats. ORF 10/11/12 shows high amino acid similarity with the five proteins listed in Table 11A. However, the regions of sequence similarity were almost exclusively the opa repeats. The high similarity indices of Table 11A therefore reflect the fact that ORF 10/11/12 shares long stretches of glutamine with the other proteins listed. To see if regions outside of the opa repeats of ORF 10/11/12 share sequence similarity with other proteins, a 302 aa stretch (1166-1468 in Figure 11) including the zinc finger but excluding the opa repeats was extracted from ORF 10/11/12 and used to search the database (Table 11D). This region of ph is threonine and serine rich and therefore shows sequence similarity with other threonine and serine rich proteins. As with Table 11A, the high similarity indices of Table 11D are due to the program aligning long stretches of threonine and serine between the region of ORF 10/11/12 analyzed and the other proteins. An even smaller region of ORF 10/11/12 that includes the zinc finger (1287-1446 in Figure 11) was screened for sequence similarities (Table 11E). This region of ORF 10/11/12 does not contain any long stretches of the same amino acid. Therefore, the similarity index values of Table 11E can be considered genuine measurements of sequence conservation between ph and other proteins.
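The effect described above, where shared single-residue stretches inflate a similarity index, can be illustrated with a small Python sketch. Note that this swaps in a different technique from the one used in the thesis: the thesis excluded repeat regions by extracting subregions by hand, whereas the sketch masks runs of identical residues automatically. The function names, the minimum run length of 4 and the two toy sequences are all hypothetical and chosen only for illustration.

```python
import re


def mask_runs(seq, min_len=4, mask="X"):
    """Replace every run of >= min_len identical residues with mask characters."""
    pattern = re.compile(r"(.)\1{%d,}" % (min_len - 1))
    return pattern.sub(lambda m: mask * len(m.group()), seq)


def percent_identity(a, b, skip="-X"):
    """Percent identity of two equal-length aligned strings.

    Columns in which either sequence holds a gap or mask character
    are left out of both the numerator and the denominator.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    columns = [(x, y) for x, y in zip(a, b) if x not in skip and y not in skip]
    if not columns:
        return 0.0
    return 100 * sum(x == y for x, y in columns) / len(columns)


# Toy sequences (not from ph) that share nothing except a glutamine run.
SEQ_A = "MKTQQQQQQLSA"
SEQ_B = "GHPQQQQQQWDE"

if __name__ == "__main__":
    print(percent_identity(SEQ_A, SEQ_B))
    print(percent_identity(mask_runs(SEQ_A), mask_runs(SEQ_B)))
```

For the toy pair, the unmasked identity is 50% and falls to 0% once the shared glutamine run is masked, which is the same reasoning the text applies to Tables 11A and 11D.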
DNA-directed RNA polymerase II, L-arabinose-binding protein and s-adenosylmethionine synthetase are all nucleotide binding proteins. If ph is indeed a DNA-binding protein, the similarity of these three proteins to ph is understandable. The other two proteins listed in Table 11E, however, show no apparent functional similarity to ph. Elastins are the major structural components of tissues that require rapid extension and complete recovery (Raju and Anwar, 1987). Lipoproteins are required for lipid transport.

ORF 9 and ORF 8 do not contain long stretches of the same amino acid; their amino acid composition is random. Therefore, the similarity index values of Table 11B and C can be considered genuine. ORF 9 is similar to several membrane-bound receptor proteins (e.g., integrins, fibronectins). Integrins are involved in cell-cell or cell-matrix interactions (Hynes, 1987) and fibronectins are required for cell motility and attachment (Dufour et al., 1988). ph is required for normal axonal pathway development in the CNS (Smouse et al., 1988). This function of ph has similarities to those of the fibronectins and integrins. Unlike the integrins, ph does not appear to contain a trans-membrane domain. If ph is a DNA-binding protein, one would not expect it to be membrane bound, unless the protein is bifunctional. The similarity of ph to the DNA-binding protein of herpes simplex virus is understandable if ph is indeed a DNA-binding protein. The reason for ORF 9 having similarities to thyroglobulin precursor and rubredoxin is not understood. Thyroglobulins synthesize thyroid hormones (Palumbo, 1987) and rubredoxins are iron-sulfur proteins necessary for electron transfer in bacteria (Frey et al., 1987). ORF 8 shows similarity to phosphoenolpyruvate carboxylase and retrovirus-related pol polyprotein, two known nucleotide-binding proteins. These similarities make sense if ph is a transcription factor.
It is not obvious why the other three proteins listed in Table 11C show similarity to ph. Prolactin induces the synthesis of prolactin-inducible protein (PIP). The function of PIP is unknown (Murphy et al., 1987). Cytochrome p450iic2 is a hydroxylase present in the endoplasmic reticulum (Green and MacLennan, 1967). Cecropins are antibacterial proteins present in the immune haemolymph of insects (Steiner et al., 1981). In summary, sequence similarities between known nucleotide or DNA-binding proteins and ph make sense if ph is indeed a transcription factor. The reason that proteins with diverse functional roles show similarities with ph is unknown.

SUMMARY

Genetic analysis of ph showed that two independent mutation events were required to make a ph null (Dura et al., 1987). This was the first evidence that ph has a repetitive structure. Using cross-hybridization studies, Freeman (1988) showed that ph consists of a large tandem repeat, each repeat separated by a region of unique sequence. The results presented in my thesis confirm the repetitive structure of the ph gene. My data show that ph does indeed consist of a large tandem repeat, and that the sequence conservation within the repeat is very high.

The structure of the gene is not certain. However, the data support the hypothesis that ph consists of two transcription units. Each repeat contains a putative promoter lying upstream of a putative translation start site. In addition, each repeat contains polyadenylation signals at its 3' end. Other models of ph structure cannot be ruled out by my analysis. Transcripts could be alternatively spliced. Both repeats could be transcribed off one promoter and then post-transcriptionally cleaved into separate messages.

The data presented here do not tell us the function of the ph protein. However, the data support the hypothesis that ph is a transcription factor.
Transcription factors require at least two domains for regulating gene expression at a promoter: a DNA-binding domain and a transcriptional-activating domain (Ma and Ptashne, 1987; Mitchell and Tjian, 1989). Both ph repeats contain a putative cys2cys2 zinc finger that is perfectly conserved between the repeats. The zinc finger is a known DNA-binding domain. The ph amino acid sequence is rich in glutamine. Glutamine stretches have been shown to activate transcription when linked to zinc finger domains of certain transcription factors (Mitchell and Tjian, 1989). In addition, several short stretches of basic residues that could be nuclear transport signals occur in the ph sequence. The above data support, but do not prove, the hypothesis that the ph protein is a transcription factor.

The repetitive structure of ph provides a good example of a duplicated gene. This structure is not without precedent: other eukaryotic genes have a similar organization. In Drosophila, engrailed (en) and invected (inv) are neighboring genes that share extensive homology over 117 aa (Coleman et al., 1987). Their functional relation (if any) is not yet understood. The two proteins contain a homeobox that lies within the 117 aa conserved region. The mammalian counterparts of en and inv also share extensive amino acid homology (Joyner et al., 1985). Since this conservation is preserved across phyla, it implies a functional conservation between the two genes.

The achaete-scute complex consists of two transcription units, each with three domains of highly conserved amino acid sequence (Villares and Cabrera, 1987). The two transcription units have a similar function, the differentiation of sensory organs. transformer is a Drosophila gene required for female sexual differentiation. The gene contains an 8 kb tandem duplication (Villares and Cabrera, 1987). Both components of the repeat are transcribed, yet the significance of the repeat is unclear.
Genes with a tandem repeat organization can be found in other eukaryotes. The wolffish antifreeze protein genes show a structure similar to ph (Scott et al., 1988). The major component genes exist as inverted tandem repeats, 8 kb in length. The two genes are separated by 1.3 kb. The minor component genes exist as direct tandem repeats. Like ph, each repeat possesses its own promoter and polyadenylation signals. Also like ph, a transcription termination sequence (ATTTTTNT) is located between the two repeats. The two genes are highly conserved but one contains a region of unique sequence. This is again similar to ph.

Future experiments include S1 nuclease mapping of cDNAs to the genomic sequence to determine the ph splicing pattern. Alternatively, cDNAs can be sequenced and their sequence compared with the genomic sequence to identify splice junctions. This could be done in several tissues and at various stages of development to see if ph is alternatively spliced. The 5' end(s) of the gene will have to be determined using primer extension analysis. This will allow us to pinpoint the location of the ph promoter(s). Once putative promoters have been located, their role in the expression of ph can be determined using site-specific mutagenesis. Suspected ph regulatory regions could be fused to the lacZ coding region. Embryos could then be transformed with the chimaeric gene constructs and the regulation of β-galactosidase expression assayed in vivo (Ashburner, 1989). The same method could be used to determine the functional significance of the putative zinc finger and glutamine-rich domains.

My determination of ph ORFs will allow for the synthesis of ph-specific antibodies. A long ORF (e.g., ORF 4/5/6) could be fused to the lacZ gene. The resulting fusion protein could then be injected into rabbits. Purified antibodies could be used to probe embryos and cells for ph protein distribution.
These antibodies could also be used to probe salivary gland chromosomes (Zink and Paro, 1989). If ph is indeed a transcription factor, and is expressed in this tissue, one would expect ph-specific antibodies to bind to the salivary gland chromosomes. If the ph protein does bind to a known gene(s), the exact nature of this interaction could be determined using DNase footprinting analysis.

REFERENCES

Ashburner, M. (1989). Drosophila, a laboratory handbook. Cold Spring Harbor Laboratory Press, Cold Spring Harbor.

Bender, W., Akam, M., Karch, F., Beachy, P.A., Peifer, M., Spierer, P., Lewis, E.B. and D.S. Hogness (1983). Molecular genetics of the bithorax complex in Drosophila melanogaster. Science 221, 23-29.

Biggin, M.D., Bickel, S., Benson, M., Pirotta, V. and R. Tjian (1988). Zeste encodes a sequence-specific transcription factor that activates the Ultrabithorax promoter in vitro. Cell 53, 713-722.

Birnstiel, M.L., Busslinger, M. and K. Strub (1985). Transcription termination and 3' processing. The end is in site! Cell 41, 349-359.

Blochlinger, K., Bodmer, R., Jack, J., Jan, J.Y. and Y.N. Jan (1988). Primary structure and expression of a product from cut, a locus involved in specifying sensory organ identity in Drosophila. Nature 333, 629-635.

Breathnach, R. and P. Chambon (1981). Organization and expression of eukaryotic split genes coding for proteins. Ann. Rev. Biochem. 50, 349-383.

Breen, T.R. and I.M. Duncan (1986). Maternal expression of genes that regulate the bithorax complex of Drosophila melanogaster. Dev. Biol. 118, 442-456.

Carroll, S.B., Laymon, R.A., McCutcheon, M.A., Riley, P.D. and M.P. Scott (1986). The localization and regulation of Antennapedia protein expression in Drosophila embryos. Cell 47, 113-122.

Courey, A.J. and R. Tjian (1988). Analysis of Sp1 in vivo reveals multiple transcriptional domains, including a novel glutamine-rich activation motif. Cell 55, 887-898.

Dingwall, C. and R.A. Laskey (1986). Protein import into the cell nucleus. Ann. Rev. Cell Biol. 2, 367-390.

Dufour, S., Duband, J.L., Kornblihtt, A.R. and J.P. Thiery (1988). The role of fibronectins in embryonic cell migrations. Trends in Genet. 4, 198-203.

Duncan, I. (1982). Polycomblike: a gene that appears to be required for the normal expression of the bithorax and Antennapedia gene complexes of Drosophila melanogaster. Genetics 102, 49-70.

Duncan, I. and E.B. Lewis (1982). Genetic control of body segment differentiation in Drosophila. In Developmental Order: Its Origin and Regulation (ed. S. Subtelny). New York: Liss. Symp. Soc. Devl. 40, 533-554.

Dura, J.-M., Brock, H.W. and P. Santamaria (1985). polyhomeotic: a gene of Drosophila melanogaster required for correct expression of segmental identity. Mol. Gen. Genet. 198, 213-220.

Dura, J.-M., Deatrick, J., Randsholt, N.B., Brock, H.W. and P. Santamaria (1988). Maternal and zygotic requirement for the polyhomeotic complex genetic locus in Drosophila. Roux's Arch. Dev. Biol. 197, 239-246.

Epstein, H.F., Ortiz, I. and L.A.T. MacKinnon (1986). The alteration of myosin isoform compartmentation in specific mutations of Caenorhabditis elegans. J. Cell. Biol. 103, 985-993.

Evans, R.M. (1988). The steroid and thyroid hormone receptor superfamily. Science 240, 889-895.

Freeman, S. (1988). M.Sc. Thesis: Molecular analysis of the Drosophila gene, polyhomeotic.

Frey, M., Sieker, L., Payan, F., Haser, R., Bruschi, M., Pepe, G. and J. LeGall (1987). Rubredoxin from Desulfovibrio gigas: A molecular model of the oxidized form at 1.4 Å resolution. J. Mol. Biol. 197, 525-541.

Frigerio, G., Burri, M., Bopp, D., Baumgartner, S. and M. Noll (1986). Structure of the segmentation gene paired and the Drosophila PRD gene set as part of a gene network. Cell 47, 735-746.

Fuller, M.T. (1986). Genetic analysis of spermatogenesis in Drosophila: the role of testes specific beta-tubulin and interacting genes in cellular morphogenesis. In "Gametogenesis and the early embryo", Gall, J.G. (ed.). pp. 19-41, Alan R. Liss Inc., New York.

Gehring, W. (1970). A recessive lethal with a homeotic effect in D. melanogaster. Dros. Inform. Serv. 45, 103.

Green, D.E. and D.H. MacLennan (1967). The mitochondrial system of enzymes. In D.M. Greenberg (ed.), Metabolic Pathways, 3rd ed., vol. 1, pp. 47-111, Academic Press Inc., New York.

Gribskov, M., Devereux, J. and R.R. Burgess (1984). The codon preference plot: graphic analysis of protein coding sequences and prediction of gene expression. Nuc. Acids Res. 12, 539-549.

Hafen, E., Levine, M. and W. Gehring (1984). Regulation of Antennapedia transcript distribution by the bithorax complex in Drosophila. Nature 307, 287-289.

Harding, K., Wedeen, C., McGinnis, W. and M. Levine (1985). Spatially regulated expression of homeotic genes in Drosophila. Science 229, 1236-1242.

Harding, K. and M. Levine (1988). Gap genes define the limits of Antennapedia and Bithorax gene expression during early development in Drosophila. EMBO J. 7, 205-214.

Henikoff, S. (1984). Unidirectional digestion with exonuclease III creates targeted breakpoints for DNA sequencing. Gene 28, 351-359.

Hochman, B., Gloor, H. and M.M. Green (1964). Analysis of chromosome 4 in Drosophila melanogaster. I. spontaneous and X-ray induced lethals. Genetics 35, 109-126.

Homyk, T. and C.P. Emerson (1988). Functional interactions between unlinked muscle genes within haplo insufficient regions of the Drosophila genome. Genetics 119, 105-121.

Hynes, R. (1987). Integrins: A family of cell surface receptors. Cell 48, 549-554.

Ingham, P. (1984). A gene that regulates the bithorax complex differentially in larval and adult cells of Drosophila. Cell 37, 815-823.

Ingham, P.W. and A. Martinez-Arias (1986). The correct activation of Antennapedia and bithorax complex genes requires the fushi-tarazu gene. Nature 324, 592-597.

Irish, V.I., Martinez-Arias, A. and M. Akam (1989). Spatial regulation of the Antennapedia and Ultrabithorax homeotic genes during Drosophila early development. EMBO J. 8, 1527-1537.

Joyner, A.L., Kornberg, T., Coleman, K.G., Cox, D.R. and G.R. Martin (1985). Expression during embryogenesis of a mouse gene with sequence homology to the Drosophila engrailed gene. Cell 43, 29-37.

Jurgens, G. (1985). A group of genes controlling the expression of the bithorax complex in Drosophila. Nature 316, 153-155.

Karch, F., Weiffenbach, B., Peifer, M., Bender, W., Duncan, I., Celniker, S., Crosby, M. and E.B. Lewis (1985). The abdominal region of the bithorax complex. Cell 43, 81-96.

Kaufman, T.C., Lewis, R. and B. Wakimoto (1980). Cytogenetic analysis of chromosome 3 in Drosophila melanogaster: The homeotic gene complex in polytene interval 84A-B. Genetics 94, 115-133.

Keegan, L., Gill, G. and M. Ptashne (1986). Separation of DNA binding from the transcription-activating function of a eukaryotic transcriptional activator protein. Science 231, 699-704.

Kouzarides, T. and E. Ziff (1988). The role of the leucine zipper in the fos-jun interaction. Nature 336, 646-651.

Kozak, M. (1984). Compilation and analysis of sequences upstream from the translational start site in eukaryotic mRNAs. Nuc. Acids Res. 12, 857-872.

Krust, A., Green, S., Argos, P., Kumar, V., Walter, P., Bornert, J.M. and P. Chambon (1986). The chicken oestrogen receptor sequence: homology with v-erbA and the human oestrogen and glucocorticoid receptors. EMBO J. 5, 891-897.

Laughon, A. and M.P. Scott (1984). Sequence of a Drosophila segmentation gene: protein structure homology with DNA-binding proteins. Nature 310, 25-31.

Lewis, E.B. (1978). A gene complex controlling segmentation in Drosophila. Nature 276, 565-570.

Lipman, D.J. and W.R. Pearson (1985). Rapid and sensitive protein similarity searches. Science 227, 1435-1441.

Locke, J.M., Kotarski, M.A. and K.D. Tartof (1988).
Dosage dependent modifiers of position-effect variegation in Drosophila and a mass action model that explains their effect. Genetics 120, 181-198.  Ma, J. and M. Ptashne (1987). Deletion analysis of GAL4 defines two transcriptional activating segments. Cell 4 8,847-853.  Maniatis, T., Fritsch, E.F. and J. Sambrook (1982). Molecular cloning: A laboratory manual. Cold Spring Harbour Laboratory.  McKnight, S.L. and R. Kingsbury (1982). Transcriptional control signals of a eukaryotic protein-coding gene. Science 21 7,316-324.  Miller, J., McLachlan, A.D. and A Klug (1985). Repetetive zinc-binding domains in the protein transcription factor IIIAfrom Xenopus oocytes. EMBO J. 4,1609-1614.  Mitchell, P.J. and R Tjian (1989). Transcriptional regulation in mammalian cells by sequence-specific DNA binding proteins. Science 245, 371-378.  Mogami, K and Y. Hotta (1981). Isolation of Drosophila flightless mutations which affect myofibrillar proteins of indirect flight muscle. Mol. Gen. Genet. 183,409-417.  Mount, S.M. (1982). A catalogue of splice junction sequences. Nuc. Acids Res. 10, 459-472.  Murphy, L.C., Tsuyuki, D., Myal, Y. and R.P.C. Shiu (1987). Isolation and sequencing of a cDNA clone for a prolactin-inducible protein (PIP). J. Biol. Chem. 262,15236-15241.  99  Palumbo, G. (1987). Thyroid hormonogenesis. J. Biol. Chem. 262,17182-17188.  Park, E.C. and H.R. Horvitz (1986). C. elegans  unc -105 mutations affect  muscle and are suppressed by other mutations that affect muscle. Genetics 113,853-867.  Peifer, M., Karch, F. and W. Bender (1987). The bithorax complex: control of segmental identity. Genes and Dev. 1,891-898.  Perlman, D. and H.O. Halvorson (1983). A putative signal peptidase recognition site and sequence in eukaryotic and prokaryotic signal peptides. J. Mol. Biol. 167, 391-409.  Pirrotta, V., Manet, E., Hardon, E., Bickel, S.E. and M. Benson (1987). Structure and sequence of the Drosophila  zeste gene. EMBO J. 6, 791-799.  Proudfoot, N.J. and G.G. Brownlee. 
(1976). 3' non-coding region sequences in eukaryotic messenger RNA Nature 263,211-214.  Raju, K and R A Anwar (1987). A comparative analysis of the amino acid and cDNA sequences of bovine elastin a and chick elastin. Biochem. Cell Biol. 65, 842-845.  Riley, P.D., Carroll, S.B. and M.P. Scott (1987). The expression and regulation of Sex combs reduced protein in Drosophila  embryos. Genes Dev. 1,716-730.  Rosenberg, V.B., Schroder, C, Preiss, A, Kienlin, A., Cote, S., Riede, I. and H. Jackie (1986). Structural homology of the product of the Drosophila Kruppel gene with Xenopus transcription factor IIIA. Nature 319, 336-339.  100  Schneuwly, S., Kuroiwa, A., Baumgartner, P. and W.J. Gehring (1986). Structural organization and sequence of the homeotic gene Antennapedia of Drosophila melanogaster. EMBO J. 5, 733-739.  Scott, G.K. Hayes, P.H., Fletcher, G.L and P.L. Davies (1988). Wolffish antifreeze protein genes are primarily organized as tandem repeats that each contain two genes in inverted orientation. Mol. and Cell Biol. 8, 3670-3675.  Shearn, A, Hersperger, G. and E. Hersperger (1978). Genetic analysis of two allelic temperature sensitive mutants of Drosophila melanogaster both of which are zygotic and maternal-effect lethals. Genetics 89, 341-353.  Smouse, D., Goodman, C , Mahowald, A and N. Perrimon (1988). polyhomeotic: A gene required for the embryonic development of axon pathways in the central nervous system of Drosophila. Genes Dev. 2, 830-842.  Steiner, H., Hultmark, D., Engstrom, A , Bennich, H. and H.G. Boman (1981). Sequence and specificity of two antibacterial proteins involved in insect immunity. Nature 292, 246-248.  Struhl, K. (1987). Promoters, activator proteins, and the mechanism of transcriptional initiation in yeast Cell 49,295-297.  Struhl, G. and M. Akam (1985). Altered distributions of transcripts in extra  Ultrabithorax  sex combs mutant embryos of Drosophila. EMBO J. 4,3259-3264.  101  Struhl, G. and RAH. White (1985). 
Regulation of the Ultrabithorax Drosophila  gene of  by other bithorax complex genes. Cell 43, 507-519.  Struhl, G. (1982). Genes controlling segmental specification in the  Drosophila  thorax. Proc. Natl. Acad. Sci. USA 79, 7380-7384.  Struhl, G. (1981). A gene product required for the correct initiation of segment determination in Drosophila. Nature 293,36-41.  Tabor, S. and CC. Richardson (1987). DNA sequence analysis with a modified bacteriophage T7 DNA polymerase. Proc. Natl. Acad. Sci. USA. 84,4767-4771.  Tautz, D., Lehmann, R., Schnurch, H., Schuh, R., Seifert, E., Kienlin, A., Jones, K. and H. Jackie (1987). Finger protein of novel structure encoded by hunchback, a second member of the gap class of Drosophila  segmentation genes. Nature 32 7, 383-389.  Tautz, D., Trick, M. and G.A Dover (1986). Cryptic simplicity in DNA is a major source of genetic variation. Nature 322, 652-656.  Villares, R and CV. Cabrera (1987). The achaete-scute  gene complex of D.  melanogaster: Conserved domains in a subset of genes required for neurogenesis and their homology to myc. Cell 50,415-424.  Vincent, A , Colot, H.V. and M. Rosbash (1985). Sequence and structure of the Serendipity  locus of Drosophila  melanogaster. J. Mol. Biol. 186,149-166.  102  Walker, J.E., Saraste, M., Runswick, M.J. and N.J. Gay (1982). Distantly related sequences in the a- and B-subunits of ATP-synthase, myosin, kinases and other ATPrequiring enzymes and a common nucleotide binding fold. EMBO J. 1, 945-951.  Wedeen, C , Harding, K and M. Levine (1986). Spatial regulation of Antennapedia  and bithorax  gene expression by the Polycomb locus. Cell 44, 739-  748.  Wilbur, W.J. and D.J. Lipman (1983). Rapid similarity searches of nucleic acid and protein data banks. Proc. Natl. Acad. Sci. 80, 726-730.  Wharton, K A , Yedvobnick, B., Finnerty, V.G. and S. Artavanis-Tsakonas (1985). opa: A novel family of transcribed repeats shared by the Notch locus and other developmentally regulated loci in D. 
melanogaster. Cell 4 0, 55-62.  Zink, B. and R. Paro (1989). In vivo binding pattern of a trans-regulator of homeotic genes in Drosophila  melanogaster. Nature 33 7,468-471.  103  

