Open Collections

UBC Theses and Dissertations

DNA sequence and structure analysis of the Drosophila gene Polyhomeotic Daly, Mark K. 1990

DNA SEQUENCE AND STRUCTURE ANALYSIS OF THE DROSOPHILA GENE POLYHOMEOTIC

by

MARK K. DALY

B.Sc., University of British Columbia, 1987

A THESIS SUBMITTED IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE

in

THE FACULTY OF GRADUATE STUDIES
ZOOLOGY

We accept this thesis as conforming to the required standard

THE UNIVERSITY OF BRITISH COLUMBIA
April, 1990
© Mark K. Daly, 1990

In presenting this thesis in partial fulfilment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the head of my department or by his or her representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission.

Department of Zoology
The University of British Columbia
Vancouver, Canada
Date: (illegible)

ABSTRACT

polyhomeotic (ph) is a gene of the Polycomb-group required for proper segment determination in Drosophila. Genetic and molecular analysis has shown that ph has a repetitive structure. The DNA sequence presented here shows that ph consists of a direct tandem duplication with very high sequence conservation. Analysis of the sequence has revealed several conserved open reading frames and splice junctions, putative transcriptional promoter and terminator sequences, polyadenylation signals and translational start signals. In addition, the DNA sequence shows that ph contains a zinc finger sequence in each repeat. This suggests that ph may encode a DNA-binding protein.

TABLE OF CONTENTS

ABSTRACT
TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES
ACKNOWLEDGEMENT
GENERAL INTRODUCTION
MATERIALS AND METHODS
RESULTS AND DISCUSSION
SUMMARY
REFERENCES

LIST OF TABLES

TABLE 1. Long ORFs conserved in both repeats of ph.
TABLE 2. ORFs 70 amino acids in length or greater.
TABLE 3. Codon preference parameters of the 6 putative ORFs.
TABLE 4. A list of splice junction sequences and branch sequences present in ph.
TABLE 5. Comparison of splice junctions of ORF 4/5/6 to the consensus.
TABLE 6. Putative promoter signals in the ph sequence.
TABLE 7. Putative polyadenylation signals found in the ph sequence.
TABLE 8. Amino acid sequence similarity between the proximal ORFs and the distal ORFs.
TABLE 9. The three most abundant amino acids of each ORF.
TABLE 10. Structure, conformation and charge of the 6 putative ORFs.
TABLE 11. Comparison of the ph protein sequence to other proteins.

LIST OF FIGURES

FIGURE 1. Sequencing strategy for the distal repeat of ph.
FIGURE 2. Genomic DNA sequence of ph.
FIGURE 3. Optimal alignment of the proximal repeat with the distal repeat.
FIGURE 4. A physical summary map of ph.
FIGURE 5. Frequency distributions of ORFs in a random sequence (a) and in a coding sequence (b).
FIGURE 6. Amino acid sequence of ORF 2.
FIGURE 7. Amino acid sequence of ORF 3.
FIGURE 8. Amino acid sequence of ORF 4/5/6.
FIGURE 9. Amino acid sequence of ORF 8.
FIGURE 10. Amino acid sequence of ORF 9.
FIGURE 11. Amino acid sequence of ORF 10/11/12.
FIGURE 12. Optimal amino acid sequence alignment of ORF 2 and ORF 8.
FIGURE 13. Optimal amino acid sequence alignment of ORF 3 and ORF 9.
FIGURE 14. Optimal amino acid sequence alignment of ORF 4/5/6 and ORF 10/11/12.

ACKNOWLEDGEMENT

I would like to thank my supervisor, Dr. Hugh Brock, for his help and encouragement during the course of this work.

GENERAL INTRODUCTION

In Drosophila melanogaster, homeotic genes are required for the determination of segmental identity. The homeotic genes belong to two complexes.
Genes of the Bithorax Complex (BX-C) are responsible for segmental differentiation of thoracic and abdominal segments (Lewis, 1978), and genes of the Antennapedia Complex (ANT-C) are responsible for proper segmental identity of head and thoracic segments (Kaufman et al., 1980). Because genes in these two complexes act in parasegments (PS; the posterior compartment of one segment and the anterior compartment of the next most posterior segment), I will refer to parasegments where necessary. As an example, PS6 includes posterior T3 and anterior A1.

The BX-C spans about 300 kilobases (kb) of DNA in the 89E region of chromosome 3 (Bender et al., 1983; Peifer et al., 1987). Individual recessive mutations in the BX-C give parasegments a tissue identity appropriate to a more anterior parasegment. Conversely, dominant BX-C mutations give parasegments a tissue identity appropriate to a more posterior parasegment (Lewis, 1978). The BX-C contains three structural genes: Ultrabithorax (Ubx), abdominal-A (abd-A), and Abdominal-B (Abd-B). These three genes take up less than 20 kb of coding DNA. Thus, most of the 300 kb of the BX-C is regulatory (Peifer et al., 1987).

Regulatory regions can be identified by complementation analysis. Mutations in distinct tissue-specific regulatory regions of Ubx complement one another. However, a mutation in a structural gene does not complement mutation(s) in regulators of that gene. The complex complementation pattern distinguishes 9 regulatory regions within the BX-C: abx/bx and bxd/pbx (for control of Ubx; Lewis, 1978); iab2, iab3, iab4, and iab5 (for control of abd-A; Karch et al., 1985); and iab5 through 9 (for control of Abd-B; Karch et al., 1985).

Each embryonic parasegment has a unique cellular mosaic pattern of Ubx, abd-A and Abd-B expression. In each more posterior parasegment (and all parasegments posterior to it) an additional set of cells gains a unique identity.
This new identity is conferred by a successively more distal regulator in the BX-C. Thus, successive regulatory regions are activated in a proximal-to-distal direction along the BX-C as one moves in an anterior-to-posterior direction down the embryo.

The homeotic genes are regulated in three ways: 1) by maternal and early-acting zygotic segmentation genes (Carroll et al., 1986; Ingham and Martinez-Arias, 1986; Harding and Levine, 1988; Irish et al., 1989); 2) by a hierarchy of cross-regulatory interactions between the homeotic genes (Struhl, 1982; Hafen et al., 1984; Harding et al., 1985; Struhl and White, 1985; Riley et al., 1987); and 3) by the genes of the Polycomb (Pc) group (Jurgens, 1985).

The developing embryo is divided into successively smaller domains by the gap genes, the segmentation genes and the segment polarity genes. The initial patterns of Antp and Abd-B gene expression in the embryo depend on regulation by gap gene products (Harding and Levine, 1988). hunchback (hb) and Kruppel (Kr) are necessary for proper Antp expression in PS4 and PS5. Kr and knirps (kni) are required for normal Antp and Abd-B expression in T3 and PS12 and PS13 (Harding and Levine, 1988). The Antp gene contains two promoters that are differentially regulated by the gap genes hb and Kr and by the segmentation gene fushi tarazu (ftz; Irish et al., 1989). The gap and segmentation gene products contain nucleic acid binding domains (e.g., zinc fingers, homeoboxes) that could facilitate their interaction with regulatory regions within the homeotic genes (Tautz et al., 1987).

Regulation of homeotic gene expression also occurs by cross-regulatory interactions between the homeotic genes themselves. Posteriorly acting genes of the BX-C (abd-A, Abd-B) regulate anteriorly acting genes (e.g., Ubx) by repression (Struhl and White, 1985). Likewise, the expression of Antp is repressed in its posterior domain by Ubx, abd-A and Abd-B (Hafen et al., 1984).
Another ANT-C gene, Sex combs reduced (Scr), is controlled by Ubx (Struhl, 1982). The homeotic gene products contain a DNA-binding domain, the homeobox, that allows them to bind to promoters of other homeotic genes.

Genes of the Pc-group also control spatial distribution of the homeotic gene products. The Pc-group genes include Pc (Lewis, 1978; Duncan and Lewis, 1982), extra sex combs (esc; Struhl, 1981), Polycomb-like (Pcl; Duncan, 1982), super sex combs (sxc; Ingham, 1984), Additional sex combs (Asx), Posterior sex combs (Psc), Sex comb on midleg (Scm; Jurgens, 1985), Sex comb extra (Sce; Breen and Duncan, 1986), polyhomeotic (ph; Dura et al., 1985), polycombeotic (pco; Shearn et al., 1978) and pleiohomeotic (pho; Hochman et al., 1964; Gehring, 1970). The Pc-group genes are repressors of BX-C gene expression. This has been shown both genetically, through dosage studies, and molecularly. Embryos mutant for esc and Pc show an initial distribution of Ubx that is normal. However, later in development, the Ubx gene product becomes ectopically expressed throughout the embryo. This suggests that these genes are not required for the initiation of homeotic gene expression, but rather for the maintenance of spatial regulation once it has been established (Struhl and Akam, 1985; Wedeen et al., 1986).

Mutations in the Pc-group have similar phenotypes, including posteriorly directed transformations of abdominal parasegments in the embryo (Jurgens, 1985). Pc-group double mutants have stronger phenotypes than embryos mutant for only one Pc-group gene. This suggests that genes of the Pc-group act synergistically to control spatial expression of the BX-C genes. This could occur in two ways. Gene products of the Pc-group could form a complex multimer in which all products interact (Locke et al., 1988).
This type of interaction has been observed for the gene products that form the contractile apparatus in Drosophila and Caenorhabditis (Homyk and Emerson, 1988; Fuller, 1986; Mogami and Hotta, 1981; Epstein et al., 1986; Park and Horvitz, 1986). The Pc-group genes show different temporal and spatial modes of expression. Thus, it is unlikely that all members of the Pc-group form the same multimer. However, it is possible that subsets of the Pc-group form distinct multimers that act in different tissues and at different times. This could explain why different Pc-group genes have related but distinct phenotypes. Alternatively, members of the Pc-group could interact indirectly as a regulatory network. Antibodies against Pc have been shown to bind to over 60 discrete sites on polytene chromosomes, including the BX-C and ANT-C, Scm, Asx, Psc, sxc, ph and pco (Zink and Paro, 1989). This shows that Pc regulates at least some members of the Pc-group.

polyhomeotic (ph) is a well studied member of the Pc-group. The gene was named to reflect the large number of phenotypes and transformations associated with its mutations. ph is unlike other Pc-group genes in that it is required for epidermal development as well as segmental specification (Dura et al., 1987). Loss of ph function results in a pleiotropic phenotype that includes epidermal cell death, abnormal segmentation and homeotic gene expression in the epidermis and central nervous system (CNS), and abnormal axon pathway development in the CNS. Strong hypomorphs cause transformation of all thoracic and abdominal segments towards A8 (Dura et al., 1988). ph amorphs show cell death in the ventral epidermis, and die at 12 hours after egg deposition. Amorphs show an A8 transformation of every segment. This phenotype is consistent with ph being a repressor of BX-C function.

Unlike other Pc-group genes, ph is required both zygotically and maternally for normal embryonic development (Dura et al., 1988).
Maternal ph amorphs are not rescued by two doses of paternal wild type product. Thus, the maternal effect mutation cannot be compensated for by an increase in dose of zygotic ph genes. Therefore, in early embryogenesis, ph+ function comes only from maternal gene products (Dura et al., 1988). ph and pho are the only Pc-group genes whose maternal effect cannot be rescued by a duplication of the paternal genes. Thus, only ph and pho are critically required in the maternal germ line. This sets these two genes apart from the other Pc-group genes.

ph is required for the proper expression of homeotic and segmentation genes. Initially, the expression patterns of Scr, Antp and Ubx are normal in the epidermis of ph mutants, but following germ band retraction, Scr, Antp and Ubx are ectopically expressed (Smouse et al., 1988). These results suggest that ph is not required for the initiation of spatial regulation but, like other Pc-group genes, is required for maintenance of spatial regulation in the epidermis. However, in the CNS there is an absolute requirement for ph to allow expression of the homeotic genes because expression of Scr and Ubx is abolished in the CNS of ph mutants. Surprisingly, Antp is ectopically expressed in the CNS of ph mutants (Smouse et al., 1988).

ph is also required for regulation of segmentation genes in the epidermis and CNS (Smouse et al., 1988). In ph embryos, engrailed (en), which is normally expressed only in the posterior compartment of each parasegment, becomes expressed also in the anterior compartment. As with the homeotic genes Scr and Ubx, en expression is absent in the CNS of ph embryos. The patterns of ftz and even-skipped (eve) expression in the CNS are altered in ph⁻ embryos. Loss of ph product causes suppression of ftz expression and ectopic eve expression in the CNS (Smouse, 1988).
The expression of fusion genes containing regulatory regions of ftz or eve and the coding region of lacZ in ph⁻ embryos is the same as that of the wild type ftz or eve gene (Smouse, 1988). This shows that regulation of ftz and eve by ph occurs through their promoters.

Northern analysis of the ph region shows that embryonic and pupal transcription patterns differ (Freeman, 1988). In embryos, two major transcripts of 6.4 and 6.1 kb are observed. The 6.1 kb transcript must have a proximal promoter because ph^^^, an inversion with a proximal breakpoint, truncates the 6.1 kb transcript but does not affect the 6.4 kb transcript. The 6.4 kb transcript may have either a proximal or distal promoter. In pupae, the ph^^^ inversion truncates the 6.6 kb transcript and leaves the 6.1 kb transcript intact. Therefore, there are at least four different ph transcripts, at least two of which have proximal promoters (Freeman, 1988). It is possible that all ph promoters are proximal, in which case ph is a single transcription unit with multiple promoters, alternative splicing and/or alternative termination. Alternatively, ph could be two transcription units, with one proximal and one distal promoter.

Besides the major ph transcripts, some smaller ph transcripts are also detected. Probes that hybridize to the 6.4 kb and 6.1 kb transcripts in embryos also hybridize to these smaller transcripts. However, the synthesis of these small transcripts begins about 3.5 kb upstream of the major ph transcripts. Moreover, an independent transcription unit intervenes between the 5' end of the small ph transcripts and the 5' end of the major ph transcripts. It is not clear how these small transcripts are related to ph function.

Obtaining the DNA sequence of the ph region would be an important contribution towards an understanding of the structure of the ph gene or genes. This thesis examines the structure of ph. With my collaborators in France (J. Deatrick and N. Randsholt), the entire sequence of ph has been determined. This will allow us to map cDNAs to the genomic sequence and therefore locate exons and introns within the gene. The exact nature of the repeats was determined by aligning the proximal and distal sides of the gene. The degree of sequence similarity between the two repeats gives an indication of how recent the ph duplication is. Putative open reading frames (ORFs) were located by assuming that long regions of amino acid sequence uninterrupted by stop codons represent regions of functional significance. These ORFs were analyzed by computer to understand their structure. The DNA and protein sequence of ph was searched for putative regulatory motifs and splice junctions. The data that I have obtained provide the raw information from which inferences of ph structure can be made and on which further experiments to elucidate its structure and function can be based.

MATERIALS AND METHODS

Bacterial culture

All cultures were grown in Luria-Bertani medium (1% Bacto-tryptone, 0.5% Bacto-yeast extract, 1% NaCl with a pH of 7.5).

Competent cells

Competent cells were prepared using the calcium chloride procedure. One litre of LB broth was inoculated with 10 ml of an overnight culture of E. coli. Cells were vigorously shaken at 37°C until a density of about 5 x 10^ cells/ml was reached. The cells were chilled in an ice-water bath for 5 min and then centrifuged at 4000 g for 5 min at 4°C. The supernatant was discarded and cells were resuspended in a half volume of ice cold 50 mM CaCl2 and 10 mM Tris-HCl (pH 8.0). The cells were kept on ice for 15 min and then centrifuged at 4000 g for 5 min at 4°C. The supernatant was discarded and the cells were resuspended in 1/15 volume of ice cold 50 mM CaCl2, 10 mM Tris-HCl (pH 8.0) and 50% glycerol. 0.5 ml aliquots were dispensed into pre-chilled microfuge tubes and stored at 4°C for 24 hours. The tubes were then transferred to a -70°C freezer for long-term storage.

Mini-plasmid preps

1.5 ml of bacterial culture was centrifuged at 10000 rpm for 2 min. The cell pellet was resuspended in 100 µl of cold 50 mM glucose, 10 mM EDTA, 25 mM Tris-HCl (pH 8.0). 200 µl of 0.2 N NaOH, 1% SDS was added and the tubes were gently agitated to mix and left on ice for 15 min. The tubes were centrifuged at 10000 rpm for 15 min. The supernatant was added to a new tube containing an equal volume of phenol-chloroform (1:1) and vigorously mixed for 10 min. The tubes were centrifuged at 10000 rpm for 5 min. The supernatant was transferred to a new tube and precipitated with 2 volumes of 95% ethanol. The tubes were centrifuged at 10000 rpm for 15 min to collect the pellets. The pellets were washed once with 1 ml of 70% ethanol and then dried under vacuum. Pellets were resuspended in water.

Maxi-plasmid preps

These preparations are a scaled-up version of the mini-prep but with several added steps to ensure sequenceable DNA. 50 ml overnight bacterial cultures were centrifuged at 10000 rpm for 5 min. Pellets were resuspended in 3.5 ml cold 50 mM glucose, 10 mM EDTA, 25 mM Tris-HCl (pH 8.0). 7 ml of 0.2 N NaOH, 1% SDS was added, the tubes were inverted to mix and placed on ice for 15 min. 5 ml KOAc (pH 4.8) was added, the tubes were inverted to mix and placed on ice for 15 min. The tubes were centrifuged at 20000 rpm for 20 min at 4°C. The supernatant was extracted once with phenol-chloroform (1:1) and once with chloroform. To the supernatant, 2 volumes of 95% ethanol were added. The tubes were kept on ice for 5 min, and then centrifuged at 20000 rpm for 20 min. The pellets were washed with 70% ethanol and dried under vacuum. The pellets were resuspended in 600 µl H2O. 70 µg RNase was added and the samples were incubated at 37°C for one hour. To each sample, 300 µl of 7.5 M NH4OAc was added. The tubes were frozen at -70°C for 10 min, thawed and centrifuged at 10000 rpm for 10 min.
The supernatant was removed from the protein pellet and transferred to a new tube and 2 volumes of 95% ethanol were added. The samples were chilled at -70°C for 5 min and centrifuged at 10000 rpm for 15 min. The DNA pellets were washed with 70% ethanol and dried under vacuum. The pellets were resuspended in 200 µl H2O. 120 µl 20% polyethylene glycol, 2.5 M NaCl was added and the samples were kept at 4°C overnight. The tubes were centrifuged at 10000 rpm for 15 min. The supernatant was discarded and the pellets washed twice with 70% ethanol. The DNA was dried under vacuum and resuspended in H2O. Typical yields were 100 to 200 µg DNA.

Agarose gel electrophoresis

Agarose gels of 0.8% were made with BRL Ultrapure agarose and 1X Tris-borate EDTA buffer (100 mM Tris base, pH 8.0, 100 mM boric acid, 2 mM EDTA). Gels were electrophoresed in 1X TBE containing 500 µg/l ethidium bromide. Gels were visualized on a UV trans-illuminator.

Polyacrylamide gel electrophoresis

Sequencing gels consisted of 50% urea, 20% 38:2 acrylamide:N,N'-methylene-bis-acrylamide, 1X TBE, 0.08% ammonium persulfate and 0.02% N,N,N',N'-tetramethylethylenediamine (Bethesda Research Laboratories, Ultrapure). Gels were polymerized overnight at room temperature. Gels were pre-run in 1X TBE at 1600 V for 30 min. To the lower buffer chamber, 0.5 volumes of 3 M NaOAc were added. Samples were loaded and the gel run at 1600 V for 2.5 hrs. Gels were soaked in 10% methanol, 10% acetic acid for 10 min and vacuum-dried at 80°C for 2 hrs. Gels were exposed to Kodak X-Omat-AR film overnight.

DNA sequencing

ph was sequenced on both strands using the dideoxy method of Sanger and Coulson (1977). Appropriate restriction fragments from the distal repeat (Figure 1) were sub-cloned and sequenced using directed deletions to generate nested sub-sets of clones on both strands (Sal 1.5, Sal 0.8), in one direction (Sal 2.3), or to obtain partial sequence (Sal 4.0).
Gaps in sequence data were filled by synthesizing oligonucleotide primers.

1). Directed deletions

DNA to be subjected to directed deletions (Henikoff, 1984) was sub-cloned into either pUC18 or pUC19. About 30 µg of DNA was completely digested with a restriction enzyme that cuts within the polylinker site to leave a 3' overhang adjacent to the priming site. The DNA was extracted once with phenol-chloroform (1:1) and once with chloroform. The DNA was precipitated as above. Resuspended DNA was completely digested with a restriction enzyme that cuts within the polylinker to leave a 5' overhang that can be used as a substrate for exonuclease III digestion into the sub-cloned insert. The DNA was organically extracted and precipitated as above. The doubly digested DNA was resuspended in 60 µl of 66 mM Tris-HCl (pH 8.0), 0.66 mM MgCl2. Half of this sample was treated with exonuclease III to determine the rate of digestion. 1 µl of DNA was removed for a control (zero digestion). The remaining DNA was heated to 37°C for 1 min. 1 µl (75 u) of exonuclease III (BRL) was added. Following a 30 sec pre-incubation period, 2.5 µl aliquots were removed every 30 sec and mixed with 7.5 µl of 0.25 M NaCl, 30 mM KOAc (pH 4.6), 1 mM ZnSO4, 5% glycerol and 67 Vogt u/ml of S1 nuclease (BRL) in tubes sitting on ice. Once all timepoints were taken, these tubes were moved to room temperature for 30 min to allow for digestion by S1 nuclease. Reactions were stopped by the addition of 1 µl of 0.5 M Tris-HCl (pH 8.0), 0.125 M EDTA and heating the samples to 70°C for 10 min. 2 µl aliquots of each timepoint were separated electrophoretically on an agarose gel (as described above) to determine the digestion rate of exonuclease III. If the rate of digestion was close to 150 bp/min, the remainder of each timepoint was extracted once with phenol-chloroform (1:1) and once with chloroform.
If the rate of digestion was too slow or too fast, exonuclease III digestion was repeated with an appropriate increase or decrease in the number of units of enzyme used. The DNA was precipitated as described previously and resuspended in 10 µl 20 mM Tris-HCl (pH 8.0), 7 mM MgCl2 and 10 u/ml Klenow polymerase (BRL). The DNA was incubated at room temperature for 3 min. 1 µl of a mix of all four deoxyribonucleotides, each at 0.125 mM, was added and the reactions were incubated at 37°C for 15 min. Each sample was mixed with 66 mM Tris-HCl (pH 8.0), 6.6 mM MgCl2, 10 mM dithiothreitol, 0.2 mM ATP, 50% polyethylene glycol and 25 u/ml T4 DNA ligase (BRL). The reactions were incubated at room temperature overnight. 20 µl of each sample was added to 100 µl of competent E. coli JM83 and left on ice for 30 min. The cells were heat-shocked at 37°C for 30 sec and placed on ice for 1 min. 200 µl LB broth was added and the tubes placed in a 37°C incubator and shaken for 1 hr at 200 rpm. 50 µl 5-bromo-4-chloro-3-indolyl-β-D-galactoside (20 µg/ml in N,N-dimethylformamide) and 50 µl isopropylthiogalactoside (20 µg/ml in H2O) were added to each tube. The cells were plated out on LB media containing 100 µg/ml ampicillin. Plates were incubated at 37°C overnight. Five colonies from each timepoint were grown in LB broth containing 100 µg/ml ampicillin. Plasmid mini-preps (as described above) were made from each timepoint. Each timepoint was digested to completion with two restriction enzymes that will cut out the sub-cloned fragment only if the priming site is intact. The digested DNA was separated electrophoretically on an agarose gel to determine insert size and suitability for sequencing. Maxi plasmid DNA preps (as described above) were made for suitable timepoints.

2). Oligonucleotide synthesis

In cases where directed deletions were not used, oligonucleotide primers were made to sequence directly off a restriction fragment.
Oligonucleotides (18mers with a minimum GC content of 50%) were made using a Model 391 PCR-Mate DNA Synthesizer (Applied Biosystems) following the manufacturer's instructions. Cleavage and cyanoethyl deprotection of the primer was accomplished by drawing up concentrated ammonia solution (35%, BDH Aristar) into the column and letting it stand at room temperature for 30 min. This process was repeated three times. The expelled ammonia solution was collected in a glass vial with a Teflon-lined cap and the volume made up to 3 ml with concentrated ammonia solution. The solution was incubated at 55°C for 15 hrs. The solution was cooled, transferred to microcentrifuge tubes and lyophilized. Pellets were resuspended in a total volume of 200 µl H2O. Oligonucleotides were purified by spin-column chromatography using a Sephadex G50-50 matrix (Maniatis et al., 1982). A molar ratio of 5:1 primer:template was used for sequencing reactions.

3). Sequencing

DNA sequencing was performed using a modification of the T7 DNA polymerase protocol of Tabor and Richardson (1987). Either modified T7 polymerase (Sequenase, United States Biochemicals) or unmodified T7 polymerase (Pharmacia) was used. Both enzymes gave good results. 6 µg of plasmid DNA was denatured in 0.2 M NaOH in a volume of 100 µl at 65°C for 5 min. The DNA was cooled in ice and 50 µl 7.5 M NH4OAc was added followed immediately by 600 µl 95% ethanol. The DNA was precipitated as above. The DNA pellet was resuspended in 7 µl H2O and heated to 55°C for 5 min. The sample was mixed by vigorous vortexing for 1 min. 2 µl of 200 mM Tris-HCl (pH 8.0), 100 mM MgCl2, 250 mM NaCl and 1 µl sequencing primer (0.5 pmol/µl) were added. The sample was heated to 65°C for 2 min in a heating block. The block was placed at room temperature so that the sample was cooled to 30°C over a period of about 1 hr. 2 µl of a solution containing dGTP, dCTP and dTTP each at a concentration of 1.5 µM, 1 µl 0.1 M dithiothreitol, 0.5 µl [α-35S]dATP and 2 µl T7 DNA polymerase (1.5 u/µl) were added and the tube was incubated at room temperature for 3 min. 3.5 µl aliquots were added to tubes containing 2.5 µl of each of four termination mixes (e.g., the ddG termination mix contains 80 µM dGTP, dATP, dCTP, 80 µM dTTP, 8 µM ddGTP and 50 mM NaCl) that were pre-warmed to 37°C. The tubes were incubated at 37°C for 5 min. The reactions were stopped by the addition of 4 µl 95% deionized formamide, 20 mM EDTA (pH 7.5), 0.05% (weight/volume) xylene cyanol FF and 0.05% (w/v) bromphenol blue. The reactions were heated at 90°C for 2 min, quick-cooled in an ice-water bath for 1 min and loaded onto a polyacrylamide gel (described above).

4). DNASTAR computer programs

Several DNASTAR computer programs were used to analyze the ph sequence. The proximal and distal repeats were compared using ALIGN (for DNA sequence) and AALIGN (for protein sequence). These programs produce a locally optimal alignment of two partially homologous sequences. The resulting similarity index is the total number of matched bases divided by the sum of the number of mismatched bases and the number of gaps in the alignment.

The program GENEPLOT was used to determine codon preference values for ph. This program plots the codon preference value for each codon in each reading frame for both strands of the DNA sequence relative to a codon frequency table. The codon preference value is defined as the probability that a given reading frame is similar in codon usage to the frequency table. The codon frequency table used in this work is made up from several Drosophila genes (Ashburner, 1989). The program calculates the codon preference of each reading frame of the sequence entered into the program and for a random sequence with the same composition as the sequence entered. If the former value is significantly greater than that of a random sequence, the sequence would have coding potential.
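
The comparison of a reading frame against a composition-matched random sequence can be illustrated with a short sketch. The text does not give GENEPLOT's actual calculation, so this is not that program's algorithm: it assumes a simple mean log-odds score of table codon frequency against the frequency expected from base composition alone. The function name, the three-codon table and the pseudo-frequency for unlisted codons are all hypothetical and are not taken from Ashburner (1989). Only the entered strand is scored.

```python
import math
from collections import Counter

# Hypothetical codon usage values for illustration only; these are NOT the
# Drosophila frequencies from the table used in the thesis.
TOY_CODON_FREQ = {"CTG": 0.05, "GCC": 0.04, "AAG": 0.04}
# Assumed pseudo-frequency for every codon missing from the toy table.
RARE_CODON_FREQ = 0.002


def codon_preference(seq, codon_freq, frame=0, rare=RARE_CODON_FREQ):
    """Mean log-odds per codon for one reading frame (0, 1 or 2).

    Each codon's table frequency is compared with the frequency expected
    for a random sequence of the same base composition. A positive mean
    suggests table-like codon usage; a negative mean suggests the opposite.
    """
    seq = seq.upper()
    base_counts = Counter(b for b in seq if b in "ACGT")
    total = sum(base_counts.values())
    if total == 0:
        return 0.0
    base_freq = {b: base_counts[b] / total for b in "ACGT"}

    scores = []
    for i in range(frame, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if any(b not in "ACGT" for b in codon):
            continue  # skip codons containing ambiguous bases
        expected = (base_freq[codon[0]]
                    * base_freq[codon[1]]
                    * base_freq[codon[2]])
        observed = codon_freq.get(codon, rare)
        scores.append(math.log(observed / expected))
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    # Made-up test sequence built from the favoured toy codons.
    demo = "CTGGCCAAG" * 4
    for frame in range(3):
        score = codon_preference(demo, TOY_CODON_FREQ, frame)
        print(f"frame {frame}: {score:+.3f}")
```

With a real usage table, the frame that reads through a genuine coding region would be expected to score higher than the other two frames; the opposite strand could be scored by passing its reverse complement.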

The PATTERNS program was used to search DNA and amino acid sequence for certain known consensus sequences (e.g., exon-intron junctions, promoter sequences etc.). The program searches both strands of the DNA sequence for the consensus sequence in question.

The program PROTEIN was used to provide information about the physical properties of the putative ph protein. The program determines the structure of a protein (i.e., regions most likely to be helix, turn, coil or sheet), hydropathy profiles of a protein and the charge, molecular weight, isoelectric point and amino acid content of a protein.

The FASTP (Lipman and Pearson, 1985) and FASTA (Pearson and Lipman, 1988) programs were used to search DNA and amino acid databases for genes with sequences similar to ph. The programs are fast because they initially screen sequences for similarity by searching for aligned identical amino acids. By contrast, other algorithms compare each nucleotide or amino acid of one sequence with every nucleotide or amino acid of the other sequence.

RESULTS AND DISCUSSION

DNA SEQUENCE

Sequencing strategy

The DNA sequence of ph presented here is 24925 base pairs (bp) in length. This represents sequence that I obtained (from 15724 through 24925) and sequence that my collaborators in France (Janet Deatrick and Neel Randsholt) obtained (from 1 through 15730). Any comparisons between the distal and proximal repeats required the sequence data provided by my collaborators. All sequence data was analyzed using DNASTAR computer programs. For convenience, sequence is presented as the coding strand and any sequence coordinates mentioned are in reference to the coding strand.

Figure 1 shows my sequencing strategy for the distal repeat of ph. Arrows show the direction and length of sequence obtained. Sequence was obtained from the five sub-clones indicated in Figure 1.
It can be seen from Figure 1 that over 80% of the sequence has been obtained from at least two independent clones on each strand and 91% of the sequence was obtained from clones on both strands. Sequencing on both strands is necessary to clarify ambiguities inevitably present in single-stranded sequence obtained by the Sanger method. The sequencing gels were read independently by two people to minimize reading errors, and to eliminate errors introduced when entering the sequence into the computer. In the case of disagreement, both readers re-examined the original gels to arrive at a consensus. The 0.8 Sal fragment sequence is single-stranded sequence, although at least two clones have been sequenced in each region. Subsequent to this work, a cDNA from the region has been sequenced on both strands, and has confirmed the sequence presented here. Figure 2 shows the complete coding-strand sequence of the ph region.

To determine the sequence similarity between the two repeats, the proximal and distal repeats were aligned using the algorithm of Wilbur and Lipman (1983; Figure 3). As seen in Figure 3, sequence similarity varies between 100% and zero. One would expect that the regions of very high sequence similarity have been conserved over time for functional reasons. I predict that these regions will be exons or regulatory sequences, and that the regions of low sequence similarity will include intron sequences.

The bottom part of Figure 4 shows a comparison of the repeated regions as determined by aligning the two sequences (Figure 3) and the repeats as determined by cross-hybridization studies (Freeman, 1988). The two methods support each other in that in each of Freeman's repeats (a through e) there is a region of high sequence similarity, although the extent of the conserved sequence in d and e is less than in the other repeats. The size of the proximal c repeat was under-estimated by the cross-hybridization study by about 400 bp.
In addition, the cross-hybridization study did not accurately locate the region of unique sequence.

Figure 1. Sequencing strategy for the distal repeat of ph. Sequence is presented in a proximal to distal direction (left to right). Arrows represent the direction of sequencing. The sub-clones used for sequencing are labelled at the top of the figure. In the case of the 8.7 Sal-Xho fragment, only a portion of this sub-clone was sequenced.

Figure 2. Genomic DNA sequence of ph. Sequence is presented as the coding strand. The proximal repeat (J. Deatrick and N. Randsholt) includes sequence between 1 and 15730. The distal repeat includes sequence between 15731 and 24925.

1 CTCGAGGTGTGGACGCAATCTTCTCCTCACCACGGGCCGTATCGCACTGATAGCAGGGAC 60 61 ACCAGGAAACCGAACTTTTCCACTAGACCTCTCGGGCTCTAGGTATATCACTATATATGG 120 121 CGACGTTATCAGCCCCTCCGACTCTGCCCCTGCATGCGAAATTAGCATATTTATTATGGA 180 181 CCACGCACACACACACTCGCACACACGCACACCGCAGCACGGTCTAGATTTGGTCTGGTT 240 241 TGAAAAGTGCAATCCACGGTCCGTGGAGTCAAGATCTTTATGACTCCACACAGATTATTC 300 301 CCGGCAGGTAGATAGATCCCTACACAGAAACGGTCATAAAGCAA^TTGGCTCGCSGCCAG 360 361 ATTGGAAGAGATACAGATTCGGATTCGGATACGGATACGGGTACGGGTATATGCATGGAT 420 421 AGATATGCCTGGAGGATTTGCACCACCCGGTTACGGTGGATTAGCCTTCGTGCAAAAATG 480 481 TATTTGTATTTTGCAACGAACAATATTTCATGTTATGTACATATTTAAGACCAGTAGGCA 540 541 TTAAATTCACTATTGCAATTGTTATATAATCTGGAGCTGCACATACGCAAGTTGTTAATT 600 601 TTCACAOGAGTATTTATAACACCTCCTTCTGTCTATCTCTTACATATTTAAGTTAAOTAC 660 661 TTAATATAAATACTTTAAGTATAATGCATATATGAATATAGTCTTTTAGCGGGTTAATAG 720 721 CAGTGCCCACCCTCGTAATCATTATTGAGATCATGTTTATCTCACTCGCTCTCTCTTOCT 780 781 CTGTATTTTTGTCGTTTTGTGTTTCATTACGTCAAAAAATTCGAOCTTTTGTGTATGTGT 840 841 CTGTGTGGGGCCGOGTGCGAACCCCTCTTOGCTCTAAACAACCCAGACAAACAGAAACAC 900 901 TOOGfTAACAGGTAGCTGATAAGCGCGAAAACAAfATCAGCATcfoCATACATATATAAA 960 961 CCGGTTGTGAAACGAATATCAGATGTGTCACTGCAGTTACTAAAGOGATTAATTAGGATC 1020 1021 CTCATTCACCAACACACGACAAGTAACCGGGAATGGGAGAGCCTCACGCGCCAGAATCTC 1080 1081
CCAATGGAAAATTGAAAAACCGAAAGAGAAACTTACGCAGAAGCACTCAAAAGCAGGTTG 1140 1141 ATCGGAACACCAACTGAACTTACAAGGTTTATGTACGGATCGATCGGGCTCACGTCGAGG 1200 1201 TCCGGCTACTCCTTTCATCCATAAGTCAGTGATGTCCTTCAGAGCAOCGTCCAACCCOGA 1260 1261 ATCACAACAAACCGCAACTAAAATTCGAAACGCAAAACAGGGGCAGATTAGCGTTAGTTA 1320 1321 AGAGATACGATAGGCGCCAAACATCCCCCCCCCCCCCTCCATCAAAGAATTCAAAATGAT 1380 1381 CACCGAGCGAAAGCTCCCCAGGAGAGAGAGCGAACTGAGCTTTCAGAGCGAAAGCGAGTG 1440 1441 AGTGAGCACACCACAGCGGCGTTCTACAAATTTCAACAGTTCATATTCCGGATCGATTTG 1500 1501 TCTTfTACTGTTTcfGTTTACCAcfTCTTTATGGATTCGTGAAc6GGAAATAACACAATA 1560 1561 TACTAATACAAAATTTTTTGTTTTTTGGGATGTTTTTTCTATAATTCAATTACTTCTAAT 1620 1621 AACCGCAAACAAAATCTTAATGTCTTGGGAAATCAAATTGATAAATCTCTGTTTACCTAT 1680 1681 TGGAATTTGTTATCAACAATTACCATTTACTATTATTATTACTTCTGTGTTTTATTTAGT 1740 1741 CCCTAGTTGCTTTTCAAAATATAGCTAGCTGCATTGCAAATTCACTCTGTAGAAATGAAA 1800 1801 GTCCTAACATCTTTTATATAAAGAGATTATATAA^CCGCTAGAATTAGTTTTAAATTTAA I860 1861 GCATATTTAATACATATTTTTATACTGTTTATTCCTAAAATTGCAAACTTGTAATCAAAA 1920 1921 CCAGTACGGATATCACCTCGCATAACAACTGTTATACAGCCACTTTAATTCACCACACAT 1980 1981 GCGTTAGCCACGCCCACTGAACCACTAAACATTTAGCTTGTCTCTGTCCTATTCTTCTCC 2040 2041 GCAGGCGGCACAATGGTGCTATCTCTTTTGCACCCCATAAAAAACCTTTTCCGGTGGTGC 2100 2101 GCCCATTTTTTATTCGCTTCGATCTTCCGTCGGCATTTTGTACT&CCAATTAGCAOCGAC 2160 2161 TTCTACAATTCAATCAACTTTGTTATTGTACCCAAAGAAAAATGGAGAAGATAACATTCT 2220 2221 CATACACTTGTACCTGTTTCGCAACTTGTGAGTTGACTCCAATAATTGCTTCTTGGGTAT 2280 2281 CGATAAAGTCCCAATAACGATTGCTAATAACTTAAAATTTATAGCTTACAATCATGGATG 2340 2341 TCAATATTTAATTTTGTATCTCATGTTGGOTATGTTTTGATTAATTTGTGTGTCTCCATA 2400 2401 CCGAJU^GGTATTAA&TTAAATGTCTATTGTTCTGTTTAGGATTA&TAAACACTTfcATTT 2460 2461 TCTCTACCAGTAAAATTTTCCTAAAACCAGTAATTTAGTTCAGTTATTATGACAGTTTCT 2520 2521 TAGGATTTTTCTTCAATACTTCTGAGTATCAACTGTTGGATCCATTAAATTTAGAACTTC 2580 2581 TGCCAAGTAAAACGGACTCTACCCACGCATGGAAATTATAAAATAACGCCAAGTGCTCGA 264 0 2641 TTATATATATTGCAATTAAAACAATCGCAAGCGCAACCACACATCTCTGTTTGAAAATCG 2700 2701 WTG£AAATTTACA6ACOGTGACCAA^TATAGO£ATATTTTGAAAATOGAGTTATTOCG 2760 2761 
CTTTAATAGAATACATAAAGAGGACATGATTAGAAGACAATTGAAACGGCAAATAGTCCT 2820 2821 ATTTGGCTCCCCCTCCCATTTTCAATTCCTACCCTCGTTCACACCCCTAATGATCTTAGG 2880 2881 GGTGGGATTACGTGTACACCCCTTGGGACATCAGAATTGATCCCCTTTTGAGGGTAATTG 2940 2941 TATAATGTTTCTTTGGGAAATCOGTTTGTACAGGCCACGGAAAATCCTGGAAAGGGCTCG 3000 3001 GTTGCTCCCAAAOGTCGCAAOTGTCACAOGAATCAACCCTTTATCGCAGTTTCCTCTTTT 3060 3061 TTAAGGATCCGGCCTTGGTACAGTTTCGAAAGGGTTTTATATGTTGGGGTATTGAGTGAA 3120 3121 AGTAGGGTTTATTTGCTTTGAATTTGAGTAATGGCTTACTAAAGATTAGAACGTTTCTTT 3180 3181 TATAACTTTTTATTGTTATTTTAAAAGGTGCACAATAACTCCGATGAAATTTGAAAGTTG 3240 3241 TTAACTTAAGTTCTTAAGTTCTTTCCAGCGAGACTATTTGTTGCTTAAATAAATTTTTGC 3300 3301 C A A A J j l A A A C A A T C T T A A A T T A A C & C T T C T A C C A f A T T T A C A T C G C C A A T G C A T fTCTAG 3360 3361 AGTTGGGAGCATTCCTATTTGGATAATAATGAACAGTCACCACTTAAGGACCCGAAGGAC 3420 3421 CTCCTCGATATTTCATCCACTGAAAACTTCAAGCAAAACTCGAGGACTTCCTGTTTACGC 34B0 3481 TGCACAGAGTTGGCAGGTCAATTAGCAGATTGGAAAGGACAATCGCAAATGTGTGCAATG 3540 3541 CGCACAGGGTCAGTTATTAATAATGGAGAATAGGCGATATTAAAGCAAGTATGGGCTGTT 3600 20 3 6 0 1 CCAAJULXCXAGACCATTTCGCACTCGACTTACCTCAXATACCTG^TTGATCCATOCACTT 3 6 6 0 3 6 6 1 CAAATCCAAJVATAGTTTATAAAACTTTTTACAATATGTAAAATGGTAAGAAATATGACAC 3 7 2 0 3 7 2 1 CAGCATAAGGAATTTGGAATTTTAGAGTGTTTAGAAATCATGAAAAGAATTTTTAAGGTT 3 7 8 0 3 7 6 1 CCACCTGAAAAAATAAAGCATTTACTTTTACTAGTATGAAAAAATGAATTTGATATCCTA 3 8 4 0 3 8 4 1 AJXACAAAAATACAAATAAAAATCAAACTGGAAAGGAATTTATACTCCCATTTATTTCAGT 3 9 0 0 3 9 0 1 C C T G A T C C C T T G C G A T C C A T T T T C A T T T C T G G C C C C O C C T C A A A A C C T T T A G A A 6 A C T T C 3 9 6 0 3 9 6 1 ATTACGATTGCAATAAGCACAACAACAACAACCCACTTTTGGTTATCACATCCGCATGCA 4 0 2 0 4 0 2 1 CACCCCTTGGAAAOGCCAACAAAAATGGTGTAGGTAOCTCTGACAATTGCCAOCACCAAC-4080 4 0 8 1 AACCACACAAGTACAACAACAGGTTCGTTTAAATTTTTATTACAGCAGGGTGCGACCGAG 4 1 4 0 4 1 4 1 AGGGAGGAAGGCCAGCGTTAACGTGCCAGAGTGAGAGGCAATATGACAAAAGCACCCCCA 4 2 0 0 4 2 0 1 GCAATTTACGCTCATTAGGCATTT6TCATTTTTA6GAAATOCAAATATCGACTC^CTCGC 4 2 6 0 4 2 6 1 
TCATGCCCTAGGAGTACCATTCGCCCCAAACACAAGGAGCAATAAGTTGAAGGCAATTAT 4 3 2 0 4 3 2 1 AAATGGCAAGAAGAAAAGCGCATGCTCGACTTCTTGTTGTGTTGAACTTGTGAGGAAAAG 4 3 8 0 4 3 6 1 CCGAAAGAGGGAAATAACGGCGAAGGAGGAGGTCACAACTAATAGAAAAAAAGCTCCGAA 4 4 4 0 4 4 4 1 AAAACAAGCATACACACACATGCAATGAGAACAACCAAAGCAAGGCAGAGAGCGAGAGAG 4 5 0 0 4 S 0 1 AGAGACCAAAAGCAGTTAAAAAGC&TAAAAATAA^ATGGCGGCA&CAAAAAAGAAAAGCA 4 5 6 0 4 5 6 1 AACGAGACGAGACAAGCCAACAAAAAGCTAATCGGAATGAAAACATTTGTGGGGCTCAGC 4 6 2 0 4 6 2 1 G A C G T T G T 7 G T T G G A T G A G G T G G G A A G A G T A A G A A G A A G A C C A A C A A G A G C C C A A A G C A A 4 6 8 0 4 6 8 1 CGCACACATACTACAAATGGTGCAGGCACACACGCACGGGCTGGCACAAAATGAAAAATG 4 7 4 0 4 7 4 1 AAATGACAAAGACAGGCACAGTGGGTCATGAGTGGTGGATATTTGAAACATTAATAAACT 4 8 0 0 4 8 0 1 T A A A A C T T A A T A A A ^ C A A C T C A A T A C T T A T G T C T A T T A T A T A A T f c T A T T T T A c f A A A C G 4 8 6 0 4 8 6 1 TTTTCATTTTACTTATCGTAACATTGTTTATATGTATAAAAAGTTTGGAAAAATTGCTGT 4 9 2 0 4 9 2 1 TCTAAAAAATTGAACCGCTGTACTCTTTGTTCTCAAACTGCAACTGTAAAGCAATCCAAT 4 9 8 0 4 9 8 1 A A T A A T G G A 7 C G C T T A C C A C T T T T C A A T G G G T G G G A G A G A A T A A G A T T T C G C T C T G C C T C 5 0 4 0 5 0 4 1 TGCGCATACGCTATCCTCCCCTTCTCAATCGACACACCCGTGTGTTACTGACAAGCACAA 5 1 0 0 5 1 0 1 CTAAfAATAAGGTAiAGCCGATCCGACCCGATCTCATGTGAAAAAGAACGAATCACAAAC 5 1 6 0 5 1 6 1 GAGGTAGAGGTAGGAAGGTAGTCGTAGTGGTGGTGGTTGTGATGGCACGAAAAAAGAAGA 5 2 2 0 5 2 2 1 TGAAGTATAGCAACAATCGTTGTCGAGTCGGGCCGGGCGGTACACATTCGAGTCTACACA 5 2 6 0 5 2 8 1 CATAAAACTGGCTTCGCGCGTATTTATTGATGTACATACCCGGTACCCACAAAGTAAAAG 5 3 4 0 5 3 4 1 GGTATACTGGGCACTTGGGTTTAACTCGAATTTGTGTTAGTAGTCACCACCAATAACTTA 5 4 0 0 5 4 0 1 CAAATAAATATTTAJkGAAGGGTTTfcATTTTAAGGATACGACTGAAGTTAGCGAGGAATG 5 4 6 0 5 4 6 1 GTATATGAAAAAGGGGTATTTGAAAGTCGAGTCACCAGACCATGTTATGTTTTCAGCAAC 5 5 2 0 5 5 2 1 GGAAGGGGGCTTTCAGTCGGGGTGGTTGAGCACGCATACACATGCCCGCCAGCTGTAGTC 5 5 8 0 5 5 8 1 TTCCTGTTTTTTTTACTCGTTTTGTTTTTGCTTTGTCAAACGAACTTCTGCCGTTTCATT 5 6 4 0 5 6 4 1 
CCAACCCCAACCGAACGCACGGCATCCCTCTCGCACGCGCTAACTCGCCAGGGCCACTCC 5 7 0 0 5 7 0 1 TAGC&GCGTCGCAcfcGCACCCGTGCATTTCGGTATAGAGAAAA&TTATTACTC&CAACG 5 7 6 0 5 7 6 1 ACATCTGAGAAGAGAGAAGCGGATCCAGCTGCACACCAACACACACACATATTCAGCGCA 5 8 2 0 5 8 2 1 C A T G C G C T C A T T T T G T T T C C G A T C C G A A C G A A A A G T A 6 A G T T G T C G C T G T G G C G C G C C G T 5 8 8 0 5 8 8 1 TCAGTTTGAAACTTAACTTGGCGGTGTACGGTTCTTGCGCTCTGCTCTCTCTGCGTTCGT 5 9 4 0 5 9 4 1 C T T C C T G T G G T C T C G T A T C T G T C G C A T A C C G C A T C C C A T C T G T A T T C A A C C A A C A A A A A A 6 0 0 0 6 0 0 1 CCCG&CGACGCGACATACTCACTG&ATACCCTGCMU^TTTGTTAAATTTTTTTCMAAAG 6 0 6 0 6 0 6 1 AGCTAACGCTOCTGTTGATTAGCTAGTGCAGATGTGCAGACATAAAAAGTGATGCCGCGC 6 1 2 0 6 1 2 1 CACAGTGGAGCCCCTAGCTGGCGAATCGTCGCTOGCGACGTAGGTAGTGCAGTTAAAACA 6 1 8 0 6 1 8 1 AGTACTTAGTGCTGTGACTGTGGCTTAATTTTATGTAAGATCOCGTGCACAGGTCCTTAG 6 2 4 0 6 2 4 1 TCGTTACACTGAGAAAAAGAAAACTGCCTGCACCCAGCGGGGAGAAGTTGAACTGGACTG 6 3 0 0 6 3 0 1 ACTGTGAGTGGAGGTTGTACTAATIACTACTCTCCOTGCACGAAACACTCATTTACACGC 6 3 6 0 6 3 6 1 AAGCACACACACACACACACACACACACGAGACACTGGGACTTTTGTCGGATTTTCGTAT 6 4 2 0 6 4 2 1 TCGTTTTGTGTACTTTTGTTGTCTTGCGTTCCACGTTACATACATATGTATATGTTTGGT 6 4 8 0 6 4 8 1 GTTGCCTGTTGTATGTTTATATTTATTGCCTTCACACATGTGCGTGTTTGTTAATGTACA 6 5 4 0 6 5 4 1 ATATAATACGGCAAATAGCAAAAAGAGAAGAAACTGACGACTAAAAAGAAACCGCCATGC 6 6 0 0 6 6 0 1 CGAC^ATAACAAM^CAACCACCA&TCTCTCCGC&CCCGCCCAAJLACAAAAACAJUIACOC 6 6 6 0 6 6 6 1 CTAAGCCGACGCATGCATACCTATAATTTATTATAAATATTGTTTTTATTTTGAATAATG 6 7 2 0 6 7 2 1 CATCGTCGTGCATTGAAGTTTATGCAAAAAAGGTATTTTTTGTTTAGTTTGOTTTTATTT 6 7 8 0 6 7 8 1 T T A T C A C T G C T T T T G T A C T G C C T G C A C T T C T G C T T T T T G T T T A C A A T T T T T G G T T T A A T C 6 8 4 0 6 8 4 1 T G C T C T T G A G C A T T G G A T A T T G A T T C T A T A T C C T A T T C G G A T A A T A A T C A C A C G T T A A T C 6 9 0 0 6 9 0 1 A T A A f G C T C G T A A j J ^ T G C A G C G A & T G G A T A T T c f G T G C T C T C T & T C A T C A T A A A A T G C G 6 9 6 0 6 9 6 1 
CTCGACTGGCCGGCCAAGAAAAGAAAATAAATTCATGAAACCCAAAACGAGTTTCCCCTC 7 0 2 0 7 0 2 1 CGTCGCCCTCTCCGCCATTCATCATCCAACCGACACACGCCCCCCGCTGTGCGGCGTTGT 7 0 8 0 7 0 8 1 TGCATTTTTAACAACGAATTTCGCAATGCATGCCCCTGTCTGCATTTGTGTGTCTGCGTC 7 1 4 0 7 1 4 1 C C C T C C T G C T C G T T G A T T T C T C C C T C T C T C C G T G C A A C C C T C T C T C A C T C G C G C A A C A A A 7 2 0 0 21 7201 OCXACATAAAOAAXXCAXAAO c^fATATCOAAA&OTATCACTTTTTCTTOTTCCTOCOO 7260 7261 CAACCCTGCT7ACATTTTGCTTTACTCTGCCGAAATAAACACTC7AGGGT7C7CAGCTG7 7320 7321 AACGAATCTGAGAAATCTGTCTTCCGGTTTAAGCGAAGATTCAATTTTATTGAATACTTT 7380 7381 TATATT7C77ACAG7CCG7TATCACTGAA7GC7AAT777ACC7ACAAACT7AAAT7GT7C 74 40 7441 CTGGCAAGAAAATAAACCACAATGATTAAATTATTTTAGACTCCTTTAAAGCACAACAGT 7500 7501 TTTT6TAAATAAATAGTTGACAACTCTCTTTACACCTACGCTACACCCCACTCGCTCTGA 7560 7561 OTTTATGTGACTGTGTGTACGGCCTGGCCATAGCCTCATCTCGTTCACTCACATTTGTTT 7620 7621 CATCTTCTTCGATTGCAGAACTCGTTTGTGTATCAATCAATAGGCTGTCAAGGCCCCCCC 7680 7681 CCCTCCTCGCTCGCTCCCGCGCATTTTGCACGCCCGAATCCACTGGAGGIGATGTTTGAT "7740 7741 TTGAGCGCGACCTTTCCCCCGCACCCCTATCGOCGTGTCTGTGTGTGTGTGTGCTGCCCC 7800 7801 GCATfTTGATCATTCCTT7CCCAC&CCTACTCCTTTCCCCCCGATTGCAAGCAG&CTGTT 7860 7861 TTACGGGGGTTCTCACACACTCGAGCTCGAATGTATGTACCCTACTCCATGGCGACATGG 7920 7921 AAACCAGTCTTTTTTTTTCTCTCAAGAGGTTTTTCGCAGCGTGCGGCCGTGAGCTIAACT 7980 7981 TACACACACTTGCACACGCGCACCCACCACCCTGTGGCGAGTATTCCACCCCTCTGGACC 8040 8041 ACCCACCCCATATTCCCTTTCTTTTACGGGGTAGCACCCACATGAGGGTTGCCAAAATCC 8100 8101 TTTTACCCGCTTTTATAACCCATTfcGCCCCTTTfTCTGTCTTTfTTGCACTCAfoCTTT 8160 8161 TTATGCTCTTGCTGTTTATCGGCCTTGCGGGCTTTTTGGGCGATGGAAAGGGGTGOGGAT 8220 8221 CAGGCTCTCTGGACTGCCGGGCATCACAGTCGCGGTCAATGCAATAGGACCTTGAAACCA 6280 6281 CGCTTC7CCAG77AGA7CAT7CATACT7GAAC7ATATCAGGGAACTGATTCAGG AAGTAA 8340 6341 TATTAG7TAAT7ATT7C7AGAAAAACATCCT7ACCATGTGGAGTACTCA0CCATTTACGT 8400 64 01 T7G7C!CAC7AA7CC7A77AGCT7GAGCAC7ACTG7ATAAAACACTAACCATCTC6TGT7G 6460 6461 CTCCCAACTGTCAGCAT7TTTAGCATTTCGATGGAAAGG7CCT7GAACT7CGCC7GCAAT 8520 8521 TCCCATCGGC7TTATTGCCCTT77ACAATAATA7CGAACGGTGCCAGC7GCG7GGC7AAA 8580 
6581 T7AG77T7CCGGGC7G77G77G7CGAACG7TGAACGTGGAAAACGAG7GCGACTACCATG 6640 86 41 CTCC7CATCGAAAT7CGAAAACCCATAGATAAAGA7CGATC7GAAATGCGAAGGCTGTCA 6700 8701 GGCA0C7CC777C7&77CAT7CGTTGAATGCGAAAGTCCGTTAAG7AGT7GGCCAAAT0T 8760 6761 TCAT7G7CGC7GAAAAAGGGCGGGT7CGGAAAAAGGTCCACAAAAAGAACGAATAT7TTC 6820 8821 GGCATAGGGATTGGGATCGGGA7CGCGC7CGGGATC7CTACCTAAAAG77AACCAGGG7A 8880 8 6 81 CACAATTC7TAA7TAGAAAT7GAAATGACATG0TACATATCT7CAGGCT7GACCT7GTGA 8940 6941 T7AAG7CGCAAC7A7AG7TT7CATATTATCCCGAGTTAAAGG7GATCGT7ACC7CAGAGT 9000 9001 CATAT77ACTT7TCCCGCGT7TGTTTCTGCCACTCTCGAAAG7CATCTTT7ATGT77ACA 9060 9061 T7GGAAATATGATTAT7TAATAAGAGCGT7777GT77GCTAG77CCCGCAT77GTCTCAC 9120 9121 ACTTTTTTGAATGACTAAAACCAGAT7777G7A777ACCAAG7AGCCA7A777GC77AAA 9180 9181 CAAAAAAJUaAAAAAAAAACGGATATCACT7TGTTATG7TAGTT77CCATGATCAT7AAA 9240 9241 TTTAAT7777TAA7GTTTAATT7AAGGGCCGTTT7CAAAACATCACGGCCAG7ATAAATA 9300 9301 ATAATAACAAATAATGA7GAGAAAGC7GACCT7ATTTTTGCGTCAAG7CCGAT7CTCAGC 9360 9361 TGGGAGAGCTTGT777GCA7GAGCGGGGAGGGGAGCAGGGGGCAGAACGGGGATGC7GAC 9420 9421 AACGTCT7TGCAT7CCC77CCTATTG777G7CGC7AGGACAAGAGAAGGCCA77A0GAAC 94 80 94 81 ACTGT7T7TCCAGTTTGTT7GCGTGTGGTG7GG7CCGAAATGACAT7CC7C77T777GC7 954 0 9541 GAGCAAA7G777AC7GA7CGCCAAAC7TCTC77TT7CAA7CC7CAGAGCOGACACAGAAA 9600 9601 GCGATACGACCACACCCC7GAGCACCACA0CA7CCCA0G0GAT77CAGCATCAGCGAT7C 9660 9661 TAGCAGGAGGCAC7C77CCC7TGAAGGACAAC7CGAACA7CCGCGAGAAGCCCC7CCACC 9720 9721 A7AACTACAACCACAA7AACAACAACAGC7CCCAGCACTCACAC7CGCACCAGCAGCAGC 9780 9781 AACAACAGCAGG7GGG7GGCAAGCAGCTCGAGCGGCCACTAAAG7GCC7GGAAACGC7CG 9840 9841 CCCAGAAGGCGGGAATCACCT7CGACGAGAAATACGATG7GGCCAGCCCCCCGCA7CCCG 9900 9901 GCATt GCCCAGCAOCAOGCGACTTC^  9 9 6 0 9961 CCCCCAC AAGCCATCGGCACGGAACTCCGCCCACGGGCCGCAGGCAAACCCACACCCCAA 10020 10021 OC AC7CCGAACAGGCCCAGTGC7CXCAGCACACCCAACACTAACTGCAACTCAA7TGCCC 10080 10081 GCCACACCAGCCTCACGC7GGAGAAGGCGCAGAATCCCGGCCAGCAGGTGGCC0CCACCA 10140 10141 CCACGC7GCCACTCCAGATA7CCCCTGAGCAGC7GCACCAG77C7ATCCGAGCAATCCCT 10200 10201 ACGCCAT7CAGGTGAAGCAAGAG^ fTCCCACGCACACGACC 10260 10261 TAAAGC 
ATGCAACCAACAT7ATGGAAG77CAGCAGCAG77GCAGC7GCAGCAGCTG7CGG 10320 10321 AA0CCAACGGTGGAGGAGCAGCCTCGGCCGGAGCCGGA0GAGCAGCTAG7CCGGCCAAC7 10380 10361 CGCAGCAAAGCCAGCAACAGCAGCACTCCAC AGCCATCAGCACCA7G7CGCCGA7GCAAT 10440-10441 TGGCAGCGGCCAC7GGAGCAGT7GGCGGGGA77CGACACAGCGAAGCACGC7CCACC7AA 10500 10501 T0CAJlccCTCCACCAG7TTCCTGTiTCCCCAAATGATTGTGTCG&GAAATCT6Tf GCATC 10560 10561 CAGGAGGCC7CGG7CAGCAGCCAATCCAGG7GATCACCGCCGGCAAGCCAT7CCAAGGCA 10620 10621 ACGGCCCCCAGATGCTTACCACCACGAC7CAAAACGCCAAGCAAATGATC0G7GCCCAAG 10680 10681 CGGGAT7CGCTGGCGGAAA77ACGCGACC7GCAT7CCCACAAACCACAATCAATCGCCCC 1074 0 10741 ACACGC7GC7CT7C7CACCCATCAACC7CATT7CCCCACAGCAGCAACAGAACCTGC7GC 10800 22 10801 AATcJjlTGeCCCCTCCAGCTCASCAGCAQCXACTCACCCJaCACCACCAACACrrTTAACC 10860 10861 AGCAGCAACAACAGCAGCT7AC7CAGCAGCAACAGCAGT7GACAGC7GCTCTGGCCAAGG 10920 10921 TCGGAGTGGA7GCGCAGGGCAAGC7GGCCCAGAAAGTGGTTCAGAAA0TGACTACCACCA 10980 10981 G7AGCGCGG7CCAGGCGGCGACGGG7CCTGGA7C7AC7GGGTCAACACAGACCCAGCAGG 11040 11041 TGCAGCA0GTTCAGCAACAGCAGCAGCAGACCACCCAAACCACTCA0CA07GCGTGCA6G 11100 11101 TT7CCACG7CGACT77GCCAG7CGCTGTOGG7GGACACTCTGTTCACACTGCCCAACTTC 11160 11161 TGAACGC7GGCCAAGCGCAACAAA7GCAGA77CCCTGG7TCTTACAGAA7GCTGCAGGAC 11220 11221 TGCAGCCGT77GGGCCAAACCAGATCA7CC7GCGAAACCAGCCAGACGGAACCCAAGGCA 11280 11281 TGTTCATTCAACAGCAACCGGCGACGCAGACTTTGCAGACCCAGCAAAACCGTAAGAATA .11340 11341 TT07CA7G7A7A77GCA7CGGATAGG7ACTAAAGTCAACTATCTTCC7ACAGAGATTATT 11400 11401 CAATGCAACGTGACCCAGACGCCCACTAAGGCAC&CACTCAACT&GATGCACTT&CTCCC 11460 11461 AA0CAGCAACAGCAGCAGCAGCAGG77GGCAC7ACCAACCAGACGCAGCAGCA6CAACTA 11520 11521 GCGGTGGCTACTGCCCAGTTGCAGCAACAGCAGCACCAACTCACTGCAGCAGCTCTGCAG 11580 11581 CGACCAGGAGCCCCTGTCATGCCCCACAATGGAACTCAAGTGCCTCCGGCCAGTTCCCTA 11640 11641 TCCACACAGACTCCCCAGAACCAGAGCCTCCTGAACGCCAAAATGCGCAACAAGCACCAG 11700 11701 CCGGTGCGCCCCGcfTTAOCCACATTGAJUUkCCGAAATCGGTCAACTCGCAGGAtAAAAT 11760 11761 AAGG7AG7AGGCCACC7GACCACCG7GCAGCAGCAGCAACAGGCGACGAA7CTCCAGCAG 11820 11821 G7GG77AA7GCGCCGGGCAACAAG7CAG7A77G777C77TAT7ATC7GCCT7CTCACAC7 11880 11881 
AC77AT7777GCA7C777CCAGAA7GG77G7GA7GAGCACAACGGGCAC7CCGATCACCC 11940 11941 TCCAGAATGGACAGACCCTTCATGCAGCCACTGCGOCAGGAGTCGACAAGCAGCAACAGC 12000 12001 AGCTACAACTGTTTCAGAAACAGCAAATCCTCCAACAACAACAAATGTTGCAACAGCAGA 12060 12061 TTGC7GCCA77CAAA7GCAGCAGCAGCAAGCGGC7GTTCAG0CCCAGCAACAACAGCAGC 12120 12121 AACAGGTCTCTCAGCAGCAGCACGTTAACGCCCAGCAACAGCAAGCGGTGGCGCAACAAC 12180 12181 AACAGGCAG7CGCGCAGGC7CAGCAACAGCAGAGGGAGCAACAGCAGCAAG77GCCCAAG 12240 12241 CCCAGGCGCAGCATCAACAGGCTCTCGCGAATCCCACTCAGCAAATCCTTCAGGTGGCCC 12300 12301 CAAA7CAA77CA7CACG7CCCACCAGCAACAGCAGCAGCACCAACTTCACAACCAACTGA 12360 12361 TACAGCAGCAGCTACAGCAACAGGCGCAGGCACAAGTTCAAGCCCAAG7GCAGGCTCAAG 12420 12421 CGCAACAGCAACAACAGCAGCGAGAGCAGCAGCAGAA7ATTATCCAGCAGATTGTGGTGC 124 80 124 81 AACAG7CTGGAGCGAC77C7CAACAGAC77CCCAGCAGCAACAGCACCACCAA7CCGGGC 12540 12541 AAC7ACAGC7AAGTAGCG7CCCC77CTCAG77TCTTCCTCAACGACGCCAGCCCGAA7AG 12600 12601 C7ACCTC7AGTGC7CTGCAGGCAGCCCTCTCCGCCTCTGGCCCCATCTTTCAGACAGCTA 12660 12661 AGCCGGG7AC77GCAG77CC7CC7CCCCCACAAGCAC7C7GG7CACAAT7ACCAACCAGA 12720 12721 CCAGCAC7CC777GG7CACCAGCAGTACGG7GGCCAG7A7ACAGCAGGCTCAGACGCAAT 12780 '12781 C7GC7CAGG7CCACCAACA7CAGCAGC7AA7CAGCGCCACAATTGCCGGAGGGACTCAAC 12840 12841 AACAGCCACAGGGACCGCCATCACT7ACACCCACCACAAATCCAATTT7GGCCA7GACCT 12900 12901 CGA7GA7GAA7GCTACA07CGG7CACCTT7CCAC7GCTCCGCCTGTAACTCTTTCTGTGA 12960 129 61 CAAGCACCGC7GTTAC77CGTCGCCGGGTCAGCTGGTTCTCTTAAGCACGGCTAGTAGCG 13020 13021 G7GGAGGAGG7AGCA7ACCAGCCACGCCCACCAAAGAGACACC77CGAAAGC0CCCACCG 13080 13081 CAACCC7GG7GCCCA77GG77CGCCCAAGAC7CC7G7A7CAGGAAAGGACACCTGCACTA 1314 0 13141 CCCCCAAA7CA7CTAC7CC7GCCAC7G7CAGCGCATCCC7AGACGCCACTAGTTCCACAG 13200 13201 CCGAAGCCCTGTCCAArGGAGA7GCCTCAOA7AG&7CTTCCACG&TGTCAAAOo6cGCTA 13260 13261 CCAC7CCCACCAGCAAGCAAAGCAA7GCAGCAG7GCAGCCACCGAG7AGCACCAC7CCCA 13320 13321 ACAG7G7CAG7G0GAAAGAAGAGCCGAAGC7GGCAACC7CCGGCAGT77AACG7CCGCAA 13380 13381 CA7CAAC7TCAACCACGACAACGA7CACCAATGGGA77GGAG7AGCCAGAACGACAGCCA 134 40 13441 CCACGGC7GTC7CAACCGC7AGCACAACCAC7ACCAGTTCTGGCACCTTTATCACAAG7T 13500 13501 
GCACCAGCACAACCACAACCACCACGTCGAG7A7CAGTAAT0GAfCGAAGCA7CTCCCCA 13560 13561 AGGCGA7GA77AAGCCGAACG7C77AAC7CACG7CA7CGA7G6CT7CA7CA7CCAGGAGG 13620 13621 CCAACGAGCCA77TCCCG7CACCAGACAGCGA7A7GCAGACAAAGACGTCAGCGATGAGC 13680 13681 CGCCAAG7GAG7A7AAACT7C7GG7ACCAA7GC7777TCGCAATC77AACGTG7CAT7CC 13740 13741 TTCGCGCAGAGAAAAAGGCAACCA7GCAGGAGGACATCAAGCTAACTCGAATAGCA7CAG 13800 13801 CTCCAGGCTCGGATATGGT7GCTTGCGAGCAG7G7GGAAAGATG&AGCACAAAGCAAAGC 13860 13861 TGAAACGGAAGCGC7AC7GTTCGCCAGGA7GC7CGAGGCAGGCAAAGAACGGCA7CGG7G 13920 13921 GAG77GGA7CAGGAGAGACGAACGGCC7CGGGACAGC7GG7A7AGTTCCCC7GGCAGCCA 13980 13981 TGGCAT7GG7GGACAGGC7GGA7GAAGCCA7GGC7GAGGAGAAGA7GCAGACAGAGGCCA 14040 14041 CCCCAAACCT77CAGAA7CGTTTCCTATT7TGGGACCC7CAACAGAAC7ACCTCCAATG7 14100 14101 CACTGCCAG7CCAAGCGGCGATT76TGCGCCCTC6CCTCTTGCAATGCCTCTAGGATCGC 14160 14161 CATTGTCAG77GCAC77CCAAC7CTTGCACCAC7G7C7G7AG7CACTTCTGGCGCGGCGC 14220 14221 CCAAG7C77CGGAAG7GAA7GGAACAGA7CG7CCGCCAA7CAGCAGCTGGAG7G7GGACG 14 280 14281 ATG7CAGCAAC7TCA77CGAGAAC7GCC7GG77GTCAGGAC7ACC7GGACGACTT7A7AC 14 34 0 14341 AGCAGGAGA7CGACGGCCAAGCGCTTCTGTTGCTCAAGGAGAAGCATTTG6TGAACGC7A 14400 23 14401 TOOCCATCXXCCTC&CTCCACCTcfTAAXATTOTGOCCXXCOTOCXCTCCXTTXXeOXGG 14460 14 4 61 TCCCGCCXCCTGGCGXGGCCXXGGXTCCXGGXGCGCXGTAGGGCXGCtXGXGCXCGXXXX 14520 14521 CCCGXXXXXGXTGXTCTCCTXXCCGXGCXOTGGXCCTGGTTCXXCCXXGTCTGTCOTGCC 14560 14581 AGGTGXTTCTGXTTCXATCGXGCXGGCGXXXXGGXCGCGXXTCCXTTTGCXXXXTXTTXT 14640 14641 T A G C A T C A G G C C A T C A G G T C T T A A A C C C A T T G A C T T T G T A C A T A C T C C C A A G A T C A C T T A 14700 14701 T A A G C A T A T T C A T T f A T A A A T T A A A C T A A C A G T C A A C A C T C A A A A A C G X A T O G X A T T X C T 14760 14761 TAXXCTAAGGAAAAGCTATGXXTTXXTTGCCXGCCXXGTXXATGGXXTAAAGTACATTTT 14B20 14 821 ATXXATXAGCXTXXGTTTXTXGTCTXXGTXGCXTXTTXATXXCTCCCXXCGTCXTGGXTX 14880 14 881 C T T T G T X C X X G T X T T T T X X T C T T X G G X A T C X X A T G T A G C A C T A T G A T T G T T C T A A C A A C T 14940 14941 A A G A A T T T T A A G C C I A T G A A T A A T A A T T G A 
T A T C T A A A T C T G A A T T T G A A C T T T I I A C T A 15000 15001 A A T X A T A T T T G X X T G C C T X G X C C T X A G C T T T T T T f T C X A C X T T T f l T T T T T O C X X T T G C T 15060 15061 CXAGXXATTAAAATGGCACTGAATAGTGTTTAATAAATGTGAAAAAGAATTTTCTGAGTT 15120 15121 T T T T T X A G T C X C T T G X C C A X T T X C X X T A X T T G X T G T G X T T X G T T T A C T T T G A T T C G A C C A 15180 15181 TTTGATCAAATATAGGTGGGAAAAGTGTATAXXXTGCATACCTAGCTTTATCAAATGTAA 15240 15241 CATGGAAATACTTCTGACTTACAATGACTAAGAAACCAGAGAGCAGAGCTTT IAATTCAG 15300 15301 A A T A T G X A A A T A T A ^ C A J a C T ^ 15360 15361 CTACCGXXTCXTCTXCCGXXGXCTTTXXTCXTTCTTTTCCXGCTCXGCXGXXGXXTTGXC 15420 15421 TTXATTTCACTCTGCGCXXXXTXXGTXTXCGTXCXXXCGTGGGCTTCGTTCXACGCAAAA 15480 154 81 CTGTTATACTACAATTTCACCACCGGGCTAAGCCCCGCCCTCGAGCACGACAATGTCGCA 15540 15541 CAGAGAGAGAGTGCGXTTAAGCGCCGTATAGCTATCCCTGTCTTTTAGCGGGCACTCCCT 15600 15601 T A O C C A A A X A C T A A A A T C G A G T C C A T T T T T C C C G T C G G C A T T T T & T T A T T G G C T G A G C A G 15660 15661 CGACGCTGGCXGXGCOGCGXCTTCACGCTTAGATCGGTGGACTCACCIATGAATAGTCGT 15720 15721 T G T T G T C G A C 15730 24 15731 GTCGTCGTGGCTAT&GGCAAATCAAAATGCAGAACGTTAAATCC&CTCAACACA^GTATG 15790 15791 TATCTTTGGTAGTATTTACAAGCATTACTTACAATTTTTCTCGCTCTTTCGCTGGTATAT 15850 15851 CTATOAAATGGGATAACAAGATTATATTTTTCACCGAATAATAGTATAATCATTAGTACA 15910 15911 GTTACGTCATTCGTACCACAGGTATGAAACGTAACAATCTATT6CAGTCTCAGAATCTTA 15970 15971 AACTTTAATAAGTCACCTCAATGTAGCATTAAGAAAATAATTACCACACTGATCTATACA 16030 16031 AACATTGTCTCGACAATCTTATACCAAATAAGAAAAJUUlCATATfTAGAGTCCAATCCAA 16090 16091 CTAGTTTCATCTTCGTTCTGACTTGCTCCTTCACAAACCCTATTATAGTGTTTAGCGGTC 16150 16151 AAGCTCGGTACGATGCTCTCGTAATCCAATTGAATCCTAAATTGOGGCGCTGAGATCATG 16210 16211 GTTGTATTTAATTATGCATCTTATTAAGTAGATATATTGAAGAAATTTATTTCAGATAAG 16270 16271 AACTACTACTGAGATGTAAGAAAGTACAATTAAATGTATGGCAAGGTTTTGTTAGTTGTC 16330 16331 AGTTTCTACAAATTT AAAACATATCTTAACGTTC^TTTGGAACTTTTCCAOCACTAACTT 16390 16391 
TTAACTTCGGCTAACAATATATTCAAAAATTGCATTAGATCGTGATACGCAAAACOCAAT 16450 16451 CATTTCTCTTCCCTTTTTTCTTTTGCATTTTGATTTCTTATTAACCTAAATTTAGATATA 16510 16511 ACAGTTGCATGGTAGAATAACGCCCTTCGTGTGGGAGACCCCGAATCCAGATCCGTCTCG 16570 16571 CCAATAAAATCCCTACCGTTTCTGCATCCCTGGAAACTTTGCAGCTACATATCTATTTAT 16630 16631 GTAT&TATGTATGTfTCTAAATACAAATGTTCAA&GAAGATTACACGTTTGCTGAAAAGG 16690 16691 CGAAAGAGGGAAATAACCAACTAAGCAAAATAGAGAGTGAGAGAGAGAGAGAAAGCAAAA 16750 16751 GCACTTAATAAGCATAAAAATAATATGGCGGCTGCAAAAAACAAAAGAAAACGAGAAAAA 16810 16811 ACGAGCCGAGCTAAAAAAGCAATACTAATCGGAATGATGAACACTTGTGGGCCTCAGCGA 16870 16871 CGTTGTTGTTCGAAGCCGTGGGAAGAGTCACAACACCCACTCCAACCAACGCACCCATAC 16930 16931 CACAJIATGATGCACGCACACACACACGGGCTGGAACAAAATGAAAAATGAAATGCCAAAG 16990 16991 ACAGGCACCTTGGGTCATGAACGGTAAATATGTGAAACATTTGAAAAATTAATTAATAAA 17050 17051 TATATTAATATGTAAGTGAAACTCTAAATCTTATATCTATGTTCCTAATAATATAATAAT 17110 17111 AATTATTTAATATAATAATAATTATTTAATAATAATAATAATTATTTTACACCATTACTT 17170 17171 TTTTGTTTTACTGGCATAACCCAAOTCGACAGTTATGTACATACAAAGATATTTCTGTTC 17230 17231 TACACAGCTATGTAAATCCTAAGTTTTACGTTGCTGCTCTATAASATTGCTCCAOTGTGC 17290 17291 CTTTGGTTGCCAAAGTACAACAATAATGCATTCCAATAACAATAGATCGAGTGCCACTGT 17350 17351 GGGAACATATCACTCTGTCTCTGCGCATACOCTACCCCTTCCCTTCGCAAACCCTTCGTT 17410 17411 CGTCTGTTTCCGACAAGCACAACCAATAATAAAGTAAGTTGTATAAAGTAAGAGCAAGGG 17470 17471 AGAAAGGCGCTAGAGGTTGCCAAGTAGTGGTGGTTGTCAAGGAACCAAAAAAAGAAGATG 17530 17531 AAGTATAGCAGCAATCGAATTTGAGTCGGGTATTfTATTTTGTTfTCACAGTGGAAOGOG 17590 17591 GCGCACAAGTCGGGGTGGTTGAGCACGCATACACATGCCCGCCAGCTGTAGTCTTGATTC 17650 17651 CTGTTTTTTTTACTCGTTTTGGTTTTGCTTTGTCAACTGCTGCCGTTTCATTCCAACCCC 17710 17711 AAGCGAACGCACGGCATCCCGCTCGCACGCGCTATCTGACCAGGGCCACTCCTAGTGGCG 17770 17771 TCGCACTCGCACCCCTGCATTTCGGTATAGAGAAAAGTTATTAGTCGCAACGGTATCTGA 17830 17831 CAGGAGAGAAGCGGATCCAGCTGCACACCAACACACACACACACACATTCAOCACACATG 17890 17891 CGCTCATTTTGTTTCCGATCCGAGCAAAAGTAGAGTTGTCGCTGTAGCGCGCGGTTCAGT 17950 17951 TTGAAACTTAACTTGGCGGCGTGCGGTTCTTGCGCTCTGCTCTCTTTGCGTTCGTGTTGG 18010 16011 
TGTGGTGTCGATGTGTCGCATACCGCATGTGTATTGAACGGOGGAAAAAAAAAGCGCCGA 18070 18071 COCGACGCACACACACCCTACCACCGTTTCGTAGTATTTATTTATATATTTATTTTTGGC 18130 18131 GATCAATGCAAATcScTGCTGACTATAACTGATTAGTCAAAACAATTAATGCTcfGCGGC 18190 18191 GTCTAAAGTTGCGTCGTTTTGAATGTTAGGCATGTACAGGTCCCTCCAAATATACAATAA 18250 18251 TACAGTACAGCAAGGAAAGCAAAAATGAAAACGGTAAAACATATTTTTTCTGATGAGACG 18310 18311 CTGTCGTCGGCGCATGCTCTTTGCCTGTTGTTTGCGCCGCATTGTTGTCATTGCTCCCTG 18370 16371 CCTCTTCGTTTTCTCCTTCTCCTCTTCCTCTGTTTCTCCCTCTTGCAACTCAATTTTTGG 18430 18431 CTAA6GGAAAAATCAAATAATGATGGCATTATTGfATGCCGCCACACGCACATGCACACA 18490 18491 CACACATGCACATATCGTGGCGCGGTAGAAAGAAACCAGTCTTTCGGTTGCAAGTTGAAG 18550 18551 ATGCTGCTGTGCCTAGGTTTATTACCCCCCTTCCAAATCCGTTCCGCTCTTAOCCTGCCG 18610 18611 ATGGAAATGGAAGCCAAAATCCACCCCCGATACACACCAAATGTAACTGAAACTCCTAAC 18670 18671 TATTATTGCTGGCGCCGGTTGCTCCTTCCGCTCCTCTGAGCCGCCGCCATTTCTTACTCC 18730 18731 TGGACAACTGGCGAACAGCATTTTGAGCGCCCAGCTTAGGTCTTfGAACCTGTCCCCTTG 16790 18791 GATATCACCATGATGGCTTATAACCATCCGCC AAGGGCCATCCGTAAAT AAAAGTGGTCT 18850 18851 CCCTTACACCCACTCGAACTTCCTCTCTOCCTAGTAGATGGGAGGAOGTGATGGAGAAGG 18910. 
18911 GGAATATGCAGCAGAGAGCAATCCTACTATAAATAGATCTTTCGGGCACAGGCAACGCTG 18970 18971 ACAAGTCACCTCCTTTAGGTGTTGGTGTGCATCCACATATGAAAAACAGTGCACTTTTAG 19030 19031 TGGGfcACGAATCATGAACAAATTAA^TATTTATTTACCTGTTfccGTAAAAAGfACTGA 19090 19091 ATACACGGTGGGGCTAGTGTTGTACAAAGTTGTAGCTGCAATCATTTGTAATTTCCCATA 19150 19151 TAATATGTTGGTGTTTAAGCTTACCACGTCTCACCTAAACAGTATCCACTGTAAGTAAAT 19210 19211 TACCTCGTGAGTGAACGATCATCCAAAACAAAGTGTATCTCTCGTTCCCTTCTCACATTT 19270 19271 TCGTCCCGACTTTGTTTTCGTTTTCGTTTTCTTTGGCCTTTTTGGACCCAGATCGATCCC 19330 25 19331 C T C T f T C T C T T T C T C X A T G C T X C T & C O T C C C T C T t c C A C C C C C T T C T T T C T T T T T C T O T T 19390 19391 CCTCGTCCCCTTTTCGCGTTGGAGAGCAAGAGAGAGAGTCTTTC6CGGGAGTAACATGCA 19450 19451 GTCGCAGTTTGTTGTTGTCCGCTTTGTTCGTCAAGCOCTAAGAGAATATGAATAAGAATA 19510 19511 TAAATATAAATTTAAACATGAATATGAAGCATGCCCCACGGATTCGGAGCCGCGCCAAGT 19570 19571 CTGTCCGTGTTTGTTTCTGTTTAAAGCAATTTACTGAT A G T C T G T G T G C C T T T C C T C T T I 19630 19631 C T C T f TCCAGCAGTGGGGACACCGAAAGTGAATCJIGCAACAACAA7ACGTACAC6ACCAC 19690 19691 CTTCCCCGGAAGCCACCACATCTGTAAAGGTCAACTCCACCACTCGCGTGGACCCCCAGC 19750 19751 CGCCACTAAGGTGCCTGG AAACACTCGCCCAGAAGGCAGGCATCAGCTTCGACGAGG ACT 19810 19 811 TTGCCAAGAGTCCATCCCAATCGCCCAGCTCTAAGGCAGCACGTGGGTCAGTCCGAACGC 19870 19671 CATCAATCAGACGGCGCCACCCACTACTACCGCTCACCAGCAGATCGCCAAGCCCACCCG 1 9 9 3 0 19931 ACTCAAAGACAACCGGCCGCAAAcf GGAG AAGTCACAGAGTCCAGCTCAACCGAfGGCCG 19990 19991 CCGCCACCAATGTGCCGCTGCAGATCTCCCCCGAGCAGCTGCAGCAGTTATATCCAAACA 20050 20051 ATCCCTACGCCATTCAGGTGAAGCAAGAGTTTCCCACGCACACGACCAGTGGCAGTGGAA 20110 20111 CTGAACTAAAGCATGCAACCAACATTATGGAAGTTCAGCAGCAGTTGCACGTGCAGCAGC 20170 20171 AGCTGTCGGAAGCCAACGGTGGAGGAGCAGCCTCGGCCGGAGCCGGAGGAGCACCTAGTC 20230 20231 CGGC£AACTCGCAG£AAAGCCAGCAACAGCAGCA6TCCACAGCCATCAGCACCA{'CTCGC 20290 20291 CGATGCAATTGGCAGGGCCCACTGGAGGAGTTGGCGGGGATTGGACACAGGGAAGGACGG 20350 20351 TGCAGCTAATGCAACCCTCCACCAGTTTCCTGTATCCCCAAATGATTGTGTCGGGAAATC 20410 20411 TGTTGCATCCAGGAGGCCTCGGTCAGCAGCCAATCCAGGTGATCACCGCCGGCAAGCCAT 20470 
20471 TCCAAGGCAACGGCCCCCAGATGCTTACCACCACGACTCAAAACCCCAACCAAATGATCG 20530 20531 GTCG8CAAGCGGGATTCGCTGOCGGAAATTACGC6ACCTGCATT£CTTCGAACCACAATC 20590 20591 AGTCGCCTCAGACGGTOCTCATCTCGCCGGTGAATGTCATCTCCCACTCGCCACAGCAGC 20650 20651 AGCAAAACCTTCTGCAATCAATGGCCGCCGCAGCTCAACAACAGCAACTTACCCAACAGC 20710 20711 AGCAGCAACAGCTTAACCAGCAGCAACAGCAGCTCAACCAGCAGCAGCAACAGCAACAGC 20770 20771 TGACTGCCGCTCTGGCCAAGGTGGGAGTCGATGCGCACGGCAAGCTGCCCCAGAAAGTCC 20830 208 31 TTCAGAAGGTGACCACCACCAGCAGCACG6TGCAGCCGGCGACG&CTCCTGGATCTACTG 20890 20891 CGTCAACACAGACCCAGCAGGTGCAGCAGGTTCAGCAACAGCAGCAGCAGACCACCCAAA 20950 20951 CCACTCAGCAGTGCGTGCAGGTTTCACAGTCGACTTTGCCAGTCGGTOTGGGTCGACAGT 21010 21011 CTGTTCAGACTGCCCAACTTTTGAACGCTGGCCAAGCGCAACAAATGCAGATCCCTTGGT 21070 21071 TCTGGCAGAATGCGGCGGGCCTGCAACCCTTCGGCTCCAATCAGATCATCCTGCGAAACC 21130 21131 AOCCAG ACCG AACCCAAGGCATCTT CATTCAACA6CAACCGGCGACGCAGACTTT GCAGA 21190 21191 CCCAGCAAAACCGTAAGAATATTGTCATGTATATTGCATCGGATAGGTACTAAAGTCAAC 21250 21251 T A T C T T C C T A C A G AGATTATTCAATGCAACGTGACGCAGACGCCCACTAAGGCACGCACT 21310 21311 CAACTCGATGCACTTGCTCCCAAGCAGCAACAGCAGCAGCAGCAGGTTGGCACTACCAAC 21370 21371 CAGACGCAGCAGCAGCAACTAOCGGTGGCTACTGCCCAGTTGCAGCAACAGCACCAGCAA 21430 21431 CTCACTGCAGCAGCTCTGCAGCGACCAGGAOCCcfcTGTCATGCCCCACAATGGAACTCAA 21490 21491 GTGCGTCCGGCCAGTTCCGTATCCACACAGACTGCCCAGAACCAGAGCCTGCTGAAGGCC 21550 21551 AAAATCCGCAACAAGCAGCAGCCGGTGCGCCCCCCTTTAGCCACATTCAAAACCGAAATC 21610 21611 GGTCAAGTCGCAGGACAAAATAAGGTAGTAGGCCACCTGACCACCGTGCAGCAGCAGCAA 21670 21671 CAGGCGACGAATCTCCAGCAGCTGGTTAATGCGGCGGGCAACAAGTCAGTATTCTTTGTT 21730 21731 T A T T A T C T C C C T T G f C A C A G T A C T T A T T T T T G C A f C T T T C C A G A A T G G T T G T G A f G A G C A 21790 21791 CAACGGGCACTCCGATCACCCTGCAGAATGGACAGACCCTTCATGCAGCCACTGCGGCAG 21850 21851 GAGTCGACAAGCAGCAACAGCAGCTACAACTGTTTCAGAAACAGCAAATCCTGCAACAAC 21910 21911 AACAAATGTTGCAACAGCAGATTGCTGCCATTCAAATCCAGCAGCAGCAAGCGGCTGTTC 21970 21971 AGCCCCAGCAACAACAGCAGCAACAGGTCTCTCAGCACCAGCAGGTTAACGCCCAGCAAC 22030 22031 
AGCA^GCGGTGGCGCAACAACAACAGGCAGTCGC6CAGGCTCAG&AACAGCAGA6GGAGC 22090 22091 AACAGCAGCAAGTTGCCCAAGCCCAGGCGCAGCATCAACAGGCTCTCGCGAATGCCACTC 22150 22151 AGCAAATCCTTCAGCTGGCGCCAAATCAATTCATCACCTCCCACCAGCAACAGCAGCAGC 22210 22211 AGCAACTTC ACAACCAACTGATACAGCAGCAGCTACAGCAACAGGCGCAGGCACAAGTTC 22270 22271 AAGCCCAAGTCCAGGCTCAAGCGCAACAGCAACAACACCAGCGAGAGCAGCAGCAGAATA 22330 22331 TTAT&CAGCAGATT&TGGTGCAACAGTCAACTGGAGCCACTTCTCAACAGCACCAGCAOC 2239 0 22391 AACCGCAACAGCAGTCTGGACAGTTGCAGCTTAGCAGCCTGCCGTTTTCGGTTTCACCAT 22450 22451 CCATGACGGCCGAAGATATTGCCGGAATAACATCCAGTCCCCTACAAGAAGCTCTCTCGG 22510 22511 TGTCTGGCGCCATCTTTCAGACAACCAAACCGATTACTTGCAOTTCCTCTACGCTCCCCA 22570 22571 CAAGCAGTGTCCTCACAATTACCACCCAGAGCAGCACTCCTCTGGTCACCAGCAGTACGG 22630 22631 TGGcfeAGTATGCAGCAGGCTCAGAfcGCAAGGTAcfcAGATCCATCAACATCAGCAGCTAA 22690 22691 TCAGCGCCACTATTGCCGGAGGCTCTCAACAGCAGCAGCAOCAGCAGCAACTGGGACTAC 22750 22751 CTTCACTTACACCCACCACCCCCTCACCTACAACAAATCCCATTCTGGCCATGACCTCGA 22810 22811 TG ATG AATGCCACCGTGGGTCACCT ATCCACTGCCCCACCCGTT AGTGTTTCTAGCACCG 22870 22871 CTGTCACTCCATCGTCTGGACAGCTGGTCACACTAAGCAGTCCTAGTACCGGTCGAGGAG 22930 26 22931 CAGGCTTTCCAGCCACGCCCACCAAAGAGACACcfTCAAAAGGOCCCACCGCAA&CCTGG 22990 22991 TGCCCATTGATTCGCCCAAGACTCCTGTATCAGGAAAGGACACCTGCACTACCCCCAAAT 23050 23051 CATCTACTCCTGCCACTGTTAGCGCATCCOTAGAGGCCAOTAGTTCCACAGGCGAAGCCC 23110 23111 TGTCCAATGGAGATGCCTCAGATAGGTCTTCCACGCCGTCAAAGGGCGCTACCACTCCCA 23170 23171 CCAGCAAGCAAAGCAATGCACCAGTGCAGCCACCGAGTAGCACCATTCCCAACAGTGTCA 23230 23231 CTCGGAAACAAGAGCCGAAGCTGCACAACTGCCGCAGTTTAACGfcCCCAACATCAACAT 23290 23291 CAACCACGACAACGATCACCAATGGGATTGGAGTAGCCAGAACGACAGCCAGCACGGCTG 23350 23351 TCTCAACCGCTAGCACAACCACTACCAGTTCTGGCACCTTTACCACAAGTTGCACCAGCA 23410 23411 CAACCACAACCACCACGTCGAGTATCAGTAA7GGATCGAAGGATCTCCCCAAGGCGATGA 23470 23471 TTAAGCCGAACGTCTTAACTCACGTCATCGATGGCTTCATCATCCAGCAGGCCAACGAGC 23530 23531 CATTTCCCGTCACCAGACAGCGATATGCAGACAAAGACGTCAGCGATGAGCCGCCAAGTG 23590 23591 A G T A T A A A C T T C T G G T A C C A A T G C T T T T T C G C A A T C T T A A C G T G T 
C A T T C C T T C G C G C A G 23650 23651 AGAAAAAGGCAACCATGCAGGAGGACATCAAGCTAAGTGGAATAGCATCAGCTCCAGGCT 23710 23711 CGGA7ATGGTTGCTTGCGAGCAGTGTGGAAAGATGGAGCACAAAGCAAAGCTGAAACGGA 23770 23771 AGCGCTACTGTTCGCCAGGATGCTCGAGGCAGCCAAAGAACGGCATCCGTGGAGTTGGAT 23830 23831 CAGGAGAGACGAAC&GCCTGGGGACAGGTCGTATAGTTGGGGTGGACCCAATGGCATTGG 23890 23891 TGGACAGACTGGATGAAGCCATGGCTGAGGAGAAGATGCAGACAGAATCATACCAGACAG 23950 23951 TATCGGACGCTT7GCCAATTCAAGCGCCTACGCCGGAGCTCCCACCCATTTCGATGCCAG 24010 24011 TGCTGGCGGCTATGTCGACATCTTCACCACTTTCGTTGCCCCTGACATTGCCCTTGCCAA 24070 24071 TTGCAATAGCTCCCACTGTGTCACTGCCAGTGGTTTCAOCTGGAGTGCTTGCGCCGGTCC 24130 24131 TAOCAATACCATCcfCGAATATAAATGGATCCGAfCOCCCTCCCATCAGCAOTT&GAOTG 24190 24191 TGGAAGAAGTTAGCAATTTCATCCGAGAACTGCCTGGTTGCCAGGACTACGTGGACGACT 24250 24251 TTATACAGCAGGAGATCGACGGCCAAGCGCTGCTGCTGCTCAAAGAAAACCATTTGGTTA 24310 24311 ACGCCATGGGCATGAAGCTGGGTCCAGCTCTCAAAATTGTGGCCAAGGTGGAGTCCATTA 24370 24371 AGGAGGTCCCGCCAGGCGATGTAAAGGATTAAAAACACGCAACAAAGTCAAGGTTTCAAA 24430 24431 AGACCGCTTTCTTTACTTTCCCGCGTTTCACCTAAATGTAACGACATTTACTTC6TGAGC 24490 24491 GAATGTGATCAGACAGAACAAAGTGAATCACGTTCCGACTCACCACTTCTCACACGACGT 24550 24551 ACACCCTAATCATCAGCTACATGCACCTAATCTACAAAGGGAACTCCCCAGAGAGCAACC 24610 24611 GGTGCCTGGAATCACTGACTCTGTTGCGAGGCCCATCCCATCCAGAATCTATGCGAGAAA 24670 24671 ICCATAATTAGGTGAT6T.AGTTGTTTTTCCCGCACATGACGAAAGCAAGGAATATGACCC 24730 24731 TCCTfCGGCGCCGAAGCTCCAGCTAGTTTAAGCACCCCGATCAGACCCCAAGATfCTOGC 24790 24791 AATAGTAGAGTCCATGACTCTGTGCGACGAAAAGGACGGGGAGGTTATAGGACCGCTCGC 24850 24851 CCCTCCGCGTTCGATCAACAGTCTTCAGCAGTCTACCAGAGTCTGAGGATAGGAGCGGGC 24910 24911 AGTATCTGAGCTCTA 24925

Figure 3. Optimal alignment of the proximal repeat with the distal repeat. The alignment was made using the method of Wilbur and Lipman (1983). The ktuple size = 6, the gap penalty = 6, and the largest gap = 20 nucleotides.
The top line of the alignment represents the proximal repeat sequence. The bottom line represents the distal repeat sequence. The middle line represents sequence conserved in both repeats. Dashed lines are gaps in the sequence necessary to maintain sequence conservation. The overall similarity between the repeats is 70.089%. The total number of bases not opposite a gap = 8264. The number of gaps = 94. The number of bases opposite gaps = 4227. The length of the overlapping region = 12491 bp and the total number of matched bases = 5858. 28 3510v 3520v 3530v 3540v 3550v 3560v 3570v T T A G C A G A T T G G A A A G G A C A A T G G C A A A T G T C T G C A A T G C O C A G A G G G T G A G T T A T T A A T A A T C G A G A A T T G C G A G G C A A A T C T C G T C G T G G C T A T A G G C A A A T C — — — - — 10* 20* 3560v 3590v 3600v 3610v 3620v 3630v 3640v A O G C G A T A T T A A A G G A A G T A T O G O C T G T T O G A A A A A G A A G A O C J I T T T G O a A C T C G A C T T A C C T G A A A T A G 3650v 3660v 3670v 3680v 3690v 3700v 3710v G T G T T T G A T C C A T O C A G T T G A A A T G C J L A A A T A G T T T A T A A A A C T T T T T A C A A T A T G T A A A A T O G T A A G A A A A A A T G G A . 
A A A A T G G A G A A C G 30-3720v 3730v 3740v 3750v 3760v 3770v 3780v A T A T G A C A C G A G C A T A A G G A A T T T O G A A T T T T A G A G T G T T T A G A A A T C A T G A A A A G A A T T T T T A A G G T T G T A C A A T T T T A G T T T T A A A T A A A T T T T T G T T T A A A T C C G C T C A A C A C A T G T A T G T A T C T T T G G T A G T A T T T A C A A G C A T T A C T T A C A A T T T T T C T C G C T C 40- 50- 60- 70- 60- 90- 100" 3790v 3800v 3810v 3820v C A C C T G A A A A A A T A A A G C A T T T A C T T T T A C T A G T A T G A A A A A A T G C A T A T A G T A T A A C T T T C G C T C G T A T A T C T A T G A A A T O G G A T A A C A A G A T T A T A T T T T T C A C C G A A T A A T A G T A T A A T C A T T A G 110- 120- 130- 140- 150- 160- 170" v 3840v 3850v 3860v 3870v 3880v 3890v A A T T T G A T A T C C T A A A A C A A A A A T A C A A A T A A A A A T C A A A G T G G A A A G G A A T T T A T A C T C ^ A C A A A A A A T A A T A G A T T T T T T A G A G T T A C G T C A T T C G T A C C A C A G G T A T G A A A C G T A A C A A T C T A T T G C A G T C T C A G A A T G T T A A A C T T T ' 180- 190- 200- 210- 220- 230" 240* v 3910v 3920v 3930v 3940v 3950v 3960v C A G T G G T G A T C C C T T G C G A T C C A T T T T C A T T T C T G G C C C C G C C T C A A A A C C T T T A G A A G A C T T C A T T A C G A G T G A C C T C A A A A G A A T A A T A A G T C A C C T C A A T O T A O C A T T A A G A A A A T A A T T A C 250- 260- 270- 280* v 3960v 3990v 4000v 4010v 4020v 4030v A T T G C J J l T A A G C A C A A C A A C A A C A A C C C A C T T T T G G T T A T C A C A T C C G C A T G C A C A C C C G T T G G A A A G G C C A A C A C T T A C A A T G A A T T G A G C A C A C T G A T C T A T A C A A A C A T T G T C T C G A C A A T C T T A T A C C A A A T A A G A A A A A A A C A T A T T T A G A G T G C A 290- 300- 310- 320- 330" '340" 350" v 4050v 4060v 4070v 4080v 4090v 4100v C A A C A A A A A T G C T G T A G C T A G C T C T G A C A A T T C C C A G C A C C A A C A A C C A C A C A A C T A C A A C A A 
C A G C T T C C A A A T T G T C T G A C C C C A C A A C A A A C A T C C A A C T A G T T T C A T C T T C G T T C T G A C T T G C T C C T T C A C A A A C C C T A T T A T A O T G T T T A G C O G T C A A O C 360- 370- • 380- 390" 400" 410" 420" G T C G G T A C G A T G C T C T C G T A A T C C A A T T G A A T C C T A A A T T G O O G C O C T G A G A T C A T G G T T G T A T T T A A T T A 430- 440- 450- 460" 470~ 480" 490-T G C A T C T T A T T A A G T A G A T A T A T T G A A G A A A T T T A T T T C A G A T A A G A A C T A C T A C T G A G A T G T A A G A A A G 500- 510- 520- 530- 540* 550" 560-4110v 4120v • T T T A A A T T T T T A T T A C A O C A O O T T T A A A T T A G T A C A A T T A A A T G T A T O G C A A O O T T T T O T T A O T T G T C A O T T T C T A C A A A T T T A A A A C A T A T C T T A A C G T T O 570- 580- 590- 600" 610* 620* 630-v 4140v 4150v 4160v 4170v 4180v 4190v O T G C G A C C G A G A G G G A G G A A G G C C A G C G T T A A C G T G C C A G A G T G A G A G G C A A T A T G A C A A A A G C A C G C G C T G A G A C G C A C A A T A T A A A G C T T T T G G A A C T T T T C C A G C A C T A A C T T T T A A C T T C G G C T A A C A A T A T A T T C A A A A A T T G C A T 640- 650- 660- 670" 680" 690* v 4210v 4220v 4230v 4240v A G C A A T T T A C G C T C A T T A O Q C A T T T O T C A T T T T T A C G A A A T O C A A X T ' A T A T C A T T T T A T A A 7AGATCGTGATACGCAAAACGCAATCATTTCTGTTCCC1TTTTTCTTTTGCATTTTGATTTCTTATTAAC 700- 710- 720- • 730- 740* 750" 760-v 4260v 4270v 4280v 4290v 4300v 4310v A T C G A C T C T C T C G C T C A T G C C C T A G G A G T A C C A T T C G C C C C A A A C A C A A G G A G C A A T A A G T T G A A T A T G A G C G C C C G C A T A C T A A A T T T A G A T A T A A C A G T T C C A T G G T A G A A T A A C G C C C T T C G T G T G G G A G A C C C C G A A T C C A G A T C C G . 
770- 780- 790- 600- 610* 620* 630-29 08 .099T .0»9T .0C9t -0J9T -0X9t .009t 3YXY3930X3X3X9X3X3Y3XYXY3YY999X9XaY339X9Y93X YSYXYY0YYXYY33XXY3 V X X 0 3 X YY 9X9Y93X 3 iY9XXYXXXYX939393XX399X3YYYYXY3Y3Y3YX9X9Y93XXY3Y9YX9939g9339993X9Y93XO AOtCS A00CS- A06Z9 A09ZS A0iJ9 A992S- AQ9Z9 - o a s t OXYYXYYOYYO — X XYY3YY3 XX93XYY3YY39YXYX9YY9XY9YY9YYYYYYD3Y390XY9X9XX90X90X90X9YX03X9YX99YY99V A Q K 9 A o e Z S AQXC9 AQO« 9 AQSts AQ819 X99Y9YX99Y99YYY9Y9XYY99YY9YYYYY9X9XY9X3XY9333Y933XY9339YYYX99YYXYYXYY* A0419 A0919 A09T9 AQKS A0CX9 A02X9 A0tt9 . r u s t .099X YxsYYYoaoiiooxxiaooiaxoYoa Y 3 X 3 3 3YY3Y39YY3Y9X3YXX9X0X9333Y3Y3Y93XYY3X3XX3333X33XYX093YXY3939X3X339X3X3 AOOIS A0609 A0909 A0t09 A0909 A0909 AO»09 .099X .0>9t .0C9T .0J9T .0X9X .0091 .06»X X39XXY0YYXYX3X39X39XXO3YXXXX9YYX33XYYYX9XYX39Y3Y3YX3XX9X3XXXYXY9YYY3Y* 3 XXYOYYXY Y Y O Y X X X O Y X 03XXXY0YYXY YOV0Y0OOXO0OXYY3XXXX3Y33YXX3O3XYOOX A0C09 A0Z09 A0109 A0009 A066* .09K .0L»X -09*T .09>X .0»>X Y3YX9XYXX9Y3Y93X9YY333YYXY399X3YXXXX9XXX XXX3YXXY33Y Y XY Y Y 3YY X XX9XXX XX 3 VYXYYXYY33XYY39YVYX9X3YY39X3YYY3X3XX9XXX3X3YX9X3933YY9XXYYYYYYX3XX9X30 A096» AO{,6> A096» A096» A o»6» A0C6* *QZ6f .oe»x .oztx . o x t x ,oo»x .oeex • .09ex . 
o t e x 3YXXXXYXXYYXYYXYYXYYXYYXXXYXXYYXYYXYYXYXYYXXXYXXYYXYYXYYXYXYYXYYX33XXO Y X Y Y Y Y X X X X X Y Y XYX XX Y X Y Y Y X X XXYYYYY09XXXSYYYYYXYXSXYXYXXX0XXY3YYX93XYXX3YXXXXY3XXXX03YYYX3YXXXXYX3 A 0 I 6 * A00g» A 0 6 9 » A089» *0LQ* *099f A 0 9 8 * .09CT .09CX .OKX .0C6X .OZCX .OXEX -00CX XYX3XYXYXX 3XYYYX3X3YYY9X9YYX9XYXYYXXYXYXYYYXYYXXYYXXYYYYY9XXXY3YYY0 X XYXYXX Y X3 YY Y X YXYYXX Y Y YY XYYXXY YYYOXXXY XXYYXYXYXXYX3X9XYXX3YXYY3X3YY3XYYYXYYXX3YYYYXX3YYYXYYXXY3YYY9XXXY AO»9> A 0 C 8 » *0Z9> A0t8» A009> *06l> .0621 .082X -OUt .092X .09ZX .OKX .OCZX X9XYXYYYX993YY9XY3X999XX33Y399Y3Y0YYY339XYYY0XYYYVY9XYYYY3YY99X300O3Y3 XY XOO Y9XY3X999X 3Y300Y3Y0VYY3 9XYYY91YYYYY0XYYYY3Y 09X39093Y3 XY99X99X9Y9XY3X999X9Y3Y399Y3Y9YYY3Y9XYYY9XYYYYY9XYYYY3Y399X39993Y3 A<J9I> AOLL* A09t» A(J9J> AQtLt *0ZLf *QZLf .0Z2X -OXZX .OOZX .06XX .08XX .OIXX Y0Y3Y3Y393Y39XY9XYYY3Y33YXY333Y393YY3 3YY03X9Y933Y9YY9Y9X9Y9YY . 3V3Y3Y39 Y39X 9XYYY3Y 3YXY3 3V393YY3 YY3 Y9YY9Y X0Y9YY 93Y3Y3Y399Y39X99XYYY3YX3YXY3Y3Y393YY39YYY3339Y9YY3YY33Y9YY9YY9YYX9Y9YY AOU* A00t» A069> A099» A 0 L 9 » A<)99* AO99> .09XX .09XX .0*XX .OCXX .OSXX .OXXX .OOXX 999X9999YY99XX9XX9XX93Y930Y3X33999X9XX3Y3YY9XY9XYY993XYYX3YXYY39YYYYYY' 999X99 9 Y99XX9XX3XX93Y939Y3X3 999X9XX Y Y9XYY993XYYX 39YYYYY 999X99Y9XY99XX9XX9XX93Y939Y3X39999X9XX-XY3YYYY9XYY993XYYX 39YYYYY-A0»9* A0E9» A0J9» A0X9> A009> A069* .060X .090X .0£0X .090X .090X .0*01 .OCOX X39Y9339Y93YYYYYY9Y93YYYY9YYYY9YYYYYY39X399399XYXYYXYYYYYXY39YYXYYXX3V Y 339Y 3Y Y Y3Y93YYY 9YYYY9YYYYYY30 399399XYXYYXYYYYYX 39VY YYXX3Y --3YY333YY3Y9Y93Y3Y93YYY39YYYY9YYYYYY39Y393339XYXYYXYYYYYX339YYYYYXX3Y A 0 9 9 * A0iS> A 0 9 S > A099» *0*S* *0ZSf A<)J9> -0Z0X .OXOX .OOOX .066 .086 39YYYY39VYY9Y9Y0Y9Y9Y9Y9X9Y9Y9YXYYYY39YY-- X3YY33V 39YYYY3 9Y3Y9Y9Y9Y9Y9 9Y9Y9Y YY39YY 3Y 3 39YYYY3 9Y9Y9V9Y9Y9Y939Y3Y9Y399YY39VYY33YY3YY9Y9XYY39XY3Y3Y3Y3YXY30 A0X9* A 0 0 9 » A06»» A08»^ *0Ltf A<)9>* A .066 .096 .096 .0*6 .0C6 .0Z6 .0X6 YXYYY999Y9YYY90g9YYYY9X99XXX93Y3YXXY9YY99YY0XX9XYYY3YXYYYX9XXX9XYX9XY* Y YY 
Y9 YYYY X Y3Y X 9 Y 9 Y YY 9 OY YY^YYYYYY933X39YYYYYYY9YXYYX3YY3Y3X99Y99Y99YY03993YYXYYY999Y9YYY9399YT A0>>* A 0 C » » A 0 2 » * A 0 X » » A00»»" A 0 6 O A . 006 . 0 6 8 .098 .Oti .098 .099 .0*9 9XYX9XYXXXYX9XYXY3YX39Y39XXX3YYY99X333XY39X3XXX939YX333XYYYYXYY3993X3X. XX X X 9 XX OX 3 9 Y Y XYYD99 YY90Y9X9XX3YY9XX0X9XX9XX3XX3Y93X30XY3939YYYYgYY9YY390XYYYXYXXYY399----A0(.C» AQ9C9- A09C81 AQ»C» AQCCt AQZC» 5320v 5330v 5340v 5350v 53e0v 5370v 5380v OTACATACCCGSTACCCACAXACTAAAAS0GTATACTCG0CACTT66GTTTAACTCCAATTTCTCTTAGT G C C AA G T C G OCTACCCCTTCCCTTCOCAAACGCTTCOTTCGTCTGTTTCCGACAAGCACA — 1660" 1670" 1680" 1690" 1700" 5390v 5400v 5410v 5420v 5430v 5440v 5450v AGTCAGCACCAATAACTTACAAATAAATATTTAAGAAGGGTTTTCATTTTAAGGATACOACTGAAGTTAG ACCAATAA A AA TAT A AAG G A C GA AG ACCAATAATAAAGTAAGTTCTATAAAGTAAGAGCAAGGGAGAAAGGCGCTAGAGGTTGCCAAG 1710" 1720" 1730" 1740" 1750" 1760" 5460v 5470v S480v CGAGGAATGGTATATG AAAAAG OGGTATT G G T T AAAAAG GGGTATT TAGTGGTCGTTGTGAACGAACCAAAAAAACAAGATCAAGTATAGCAGCAATCGAATTTGACTCGGGTATT 1770" 1780" 1790* 1800" 1810" 1820" 1830" 5490v 5500v 5510v 5520v 5530v 5540v TGAAAGTCGAGTCACCAGACCATCTTATGTTTTCAGCAACGGAAGGGGCCTTTC-AGTCGGGGTGGTTGA T A TGTTTTCA GGAAGGGOGC C AOTCGGOGTGGTTGA TTATTT TCTTTTCACAGT-OGAAGGGGGCOCACAAGTCCGGGTGGTTGA 1840" 1850" I860" 1870" i860" v 5560v 5570v 5580v 5590v 5600v 5610v GCACGCATACACATGCCCGCCAGCTGTAGTCTT——CCTCTTTTTTTTACTCGTTTTGTTTTTGCTTTG GCACGCATACACATGCCCGCCAGCTGTAGTCTT CCTGTTTTTTTTACTCGTTTTG TTTTGCTTTG CCACGCATACACATGCCCGCCAGCTGTAGTCTTGATTCCTGTTTTTTTTACTCGTTTTGGTTTTGCTTTG 1890" 1900" 1910" 1920" 1930" 1940" 1950" 5620v 5630v 5640v 5650v 5660v 5670v 5680v TCAAACGAACTTCTGCCGTTTCATTCCAACCCCAAGCGAACGCACGGCATCCCTCTCGCACGCGCTAACT TCAA G CTGCCGTTTCATTCCAACCCCAAGCGAACGCACGOCATCCC CTCGCACGCGCTA CT TCAACTG CTGCCGTTTCATTCCAACCCCAAGCGAACGCACGGCATCCCGCTCGCACGCGCTATCT 1960" 1970" 1980" 1990" 2000" 2010" 5690v 5700v 5710v 5720v 5730v 5740v 5750v COCCAGGGCCACTCCTAGCGGCGTCGCACTCGCACCCGTGCATTTCGGTATAGAGAAAAGTTATTAGTCG G CCAGGGCCACTCCTAG 
GGCGTCGCACTCGCACCCGTGCATTTCGGTATAGAGAAAAGTTATTAGTCG GACCAGGGCCACTCCTAGTGGCGTCGCACTCCCACCCGTGCATTTCGGTATAGAGAAAAGTTATTAGTCG 2030" 2040" 2050" 2060" 2070* 2080" 5760v 5770v 5780v 5790v 5800v 5810v 5820v CAACGACATCTGAGAAGAGAGAAGCGGATCCAGCTGCACACCAACACACACACAT-—ATTCAGCGCAC CAACG ATCTGAGA GAOAGAAGCGGATCCAGCTGCACACCAACACACACACA ATTCAGC CAC CAACGGTATCTGAGAGGAGAGAAGCGGATCCAGCTGCACACCAACACACACACACACACATTCAGCACAC 2100" 2110" 2120" 2130" 2140" 2150" 5830v 5840v 5850v 5860v 5870v 5880v 5890v ATGCGCTCATTTTGTTTCCGATCCGAACGAAAAGTAGAGTTGTCGCTGfGGCGCGCGGTTCAGTTTGAAA ATGCGCTCATTTTGTTTCCGATCCGA C AAAAGTAGAGTTGTCGCTGT 6CGCGCGGTTCAGTTTGAAA ATGCGCTCATTTTGTTTCCGATCCGAGC-AAAAGTAGAGTTGTCGCTGTAOCGCGCGGTTCAGTTTGAAA 2170" 2180" 2190" 2200" 2210* 2220" 5900v 5910v 5920v 5930v 5940v 5950v 5960v CTTAACTTGGCGGTGTACGGTTCTTGCGCTCTGCTCTCTCTGCGTTCGTGTTGGTGTGGTGTCGTATGTG CTTAACTTGGCGG GT CGGTTCTTGCGCTCTGCTCTCT TGCGTTCGTGTTGGTGTGGTGTCG ATOTG CTTAACTTGGCGGCGTGCGGTTCTTGCGCTCTGCTCTCTTTGCGTTCGTGTTGGTGTGGTGTCG-ATGTG 2240" 2250" 2260" 2270" 2280" 2290" 5970v 5980v 5990v €000v 6010v 6020v TCGCATACCCCATCGCATGTGTATTGAACGAAG AAAAAACCCGCCGACGCGACATACTCACTGGATA TCGCATACCGCAT GTGTATTGAACG C AAAAAA CGCCGACGCGAC AC CAC TA TCGCATACCGCAT CTGTATTGAACGCGGGAAAAAAAAAGCGCCGACCCGACGCACACACACCCTA 2300* 2310" 2320" 2330" 2340* 2350" 2360* v €040v 6050v 6060v 6070v 6080v CCCTGCAAATTTGTTAA ATTTTTTTCAAAAAGAGCTAACGGTGCTGTTGATTAGCT C C T TA ATTTTT CCACCGTTTCGTAOTATTTATTTATATATTTATTTTT-----------»---———————— 2370" 2380" 2390" 6090v 6100v 6110v 6120v 6130v €140v 6150v AGTGCAGATGTGCAGACATAAAAAGTGATGCCGCGCCACAGTGGAGCCCCTAGCTGGCGAATCGTCGCTG 6160v 6170v ClBOv 6190v 6200v 6210v 6220v OCGACGTAGCTAGTGCAGTTAAAACAAGTACTTAGTGCTGTGACTGTGGCTTAATTTTATGTAAGATCGC 6230v 6240v 6250v 6260v €270v 6280v 6290v OTGCAC^OGTOCTTAGTCGTTACACTGAGAAAAAOAAAAGTGCCTGCAGCCAGCGOOGAGAAGTTGAAGT 31 €300v $310v. 
6320v C330v 6340v 6350v 6360v OGAOTCACTCTCACTOCAOOTTGTACTAATTACTACTOTCCCTCCAWSAAACACTCATrTACACOCAXOC 6370v 6380v 6390v 6400v 6410v 6420v 6430v ACACAr^CACACACACACACACACGAGACACTGGGACTTTTGTCGGATTTTCGTATT 6440v 6450v 6460v 6470v 6480v 6490v 6500v TTTGTTGTCTTGCGTTCCACGTTACATACATATGTATATGTTTGGTGTTGCCTGTTGTATGTTTATATTT 6510v 6520v 6530v 6540v 6550v 6560v 6570v ATTGCCTTCACACATCTGCGTGTTTGTTAATGTACAATATAATACGGCAAATAGCAAAAAGAGAAGAAAC 6580v 6590v 6600v 6610v 6620v 6630v 6640v 9GAGGACTAAAAAGAAAGCGCCATGCCGACTATAACAAAAACAACCACCAGTGTCTCCGCCGCCGCCGAA 6650v 6660v 6670v 6680v 6690v 6700v 6710v JUCAAAAACAAAACGCCTAAGCCGACGCATGCATACCTATAATTTATTATAAATATTCTTTTTATTTTGA 6720v 6730v 6740v 6750v 6760v 6770v 6780v ATAATGGATCGTCGTGCATTGAAGTTTATGCAAAAAAGGTATTTTTTGTTIAGTTTGGTTTTATTTTTAT 6790v 6800v 6810v 6820v 6830v 6840v 6850v CACTGCTTTTGTACTGCCTGCACTTGTGCTTTTTGTTTACAATTTTTGGTTTAATCTGCTCTTGAGCATT 6860v 6870v 6680v 6890v 6900v 6910v 6920v (3GATATTGATTCTATATCGTATTCGGATAATAATGACACGTTAATCATAATGCTCGTAAAAATGCAGCGA 6930v 6940v 6950v 6960v 6970v 6980v 6990v GTGGATATTCTGTGCTCTCTGTCATCATAAAATCCGCTCGACTCGCGGGCCAACAAAAGAAAATAAATTC 7000v 7010v 7020v 7030v 7040v 7050v 7060v ATGAAACCCAAAACGAGTTTCCCCTCCGTCGCCCTCTCCGCCATTCATCATCCAACCGACACACCCCCCC 7070v 7080v 7090v 7100v 7110v 7120v 7130v COCTOTGCOOCGTTGTTGCATTTTTAACAACGAATTTCOCAATOCATGCCCGTGTCTCCATTTGTGTGTG 7140v 7150v 7160v 7170v 7180v 7190v 7200v TGCGTCGGCTGGTGGTGGTTGATTTCTCCCTCTCTCCGTGCAACCCTCTCTCACTCGCGCAACAAAGCAA 7210v 7220v 7230v 7240v 7250v 7260v 7270v CATAAAGAAAACAAAAGTTCTATATCOAAAGCTATCACTTTTTCTTGTTGCTGCGOCAACCCTGCTTACA 7280v 7290v 7300v 7310v 7320v 7330v 7340v TTTTGCTTTACTCTGCCGAAATAAACAGTCTAOGGTTGTCAGCTGTAAMAATCTGAGAAATCTGTGTTC 32 7350v 7360v 7370v 7380v 7390v 7400v 7410v CCOTTTAAOCCAACATTCAATTTTATTGAATXCTTTTATXTTTCTTACACTCCCTTATCACTOAATOCTA 7420v 7430v 7440v 7450v 7460v 7470v 7480v ATTTTACCTACAAACTTAAATTCTTCGTOOCAAGAAAA^ 7490v 7500v 7510v 7520v 7530v 7540v 7550v TTTAAAGCACAACAGTTTTTGTXAATXXXTXGTTGXCXXCTCTGTTTXCXCCTXCOCT 7560v 7570v 7580v 7590v 7600v 
7610v 7620v CTCTGXCrTTTATGTGACTGTGTGTACGGCCTGGCCATAGCCTCATCTCCn'TCACTCACATTTGTTTCATC 7630v 7640v 7650v 7660v 7670v 7680v 7690v TTCTTCGATTGCAGAACTCGTTTGTGTATCAATCAATAGGCTGTCAA0GCCCCCCCCCC7?CCTCGCTCGC 7700v 7710v 7720v 7730v 7740v 7750v 7760v TCCCGCGCATTTTGCACGCCCGXATCCACTOGXOGTGXTGTTTGXTTTGXOCGCGACCTTTCCCCCGCAC 7770v 7780v 7790v 7800v 7810v 7820v 7830v CCCTATCGGCGTGTCTGTGTGTGTGTGTGCTGCCCCGCATTTTGATCATTCCTTTCCCACCOCTACTOCT 7840v 7850v 7660v 7870v 7880v 7890v 7900v TTCCCCCCGATTGCAAGCAGGCTGTTTTACGCGGGTTCTCACACACTCGAGCTCGAATGTATCTACCCTA 7910v 7920v 7930v 7940v 7950v 7960v 7970v CTCCATGGCGXCXTGGXAACCAGTCTTTTTTTTTCTCTCAAGAGGTTTTTCGCAOCOTOCOOCCCTGAGC 7980v 7990v BOOOv 8010v 8020v 8030v 8040v TTAACTTACACACACTTGCACACGCGCACCCACCACCCTGTGGCGAGTATTCCACCCCTCTGGACCACCC 6050v 8060v 8070v 6080v 8090v BlOOv 8110v ACCCCATATTCCCTTTCTTTIACGGGGTAOCAGCGACATGAOGGTTGCCAAAATGCTTTTACCGOCTTTT 8120v 8130v 8140v 8150v 8160v 8170v 8180v ATAACCCATTTCGCCCCTTTTTCTCTCTTTTTTGCACTCATGCTTTTTATGCTCTTCCTGTTTATGGGCC 8190V 8200v 8210v 8220v 8230v 8240v 8250v TTGCOGGCTTTTTGGGCGATOGAAAOGGGTGGGGATCAGOCTCTCTOGACTGCCGGGCATCACAGTCGCG GGCGAT 2400-8260v 8270v 8280v 8290v 8300v 8310v 8320v CTCAXTOCXXTXGGXCCTTGXXACCACOCTTCTCCACTIAGATCATTCATAGTTOXXCTXTXTCAOGGAA CAXTGCAX G TG ACTATX G A —CXATGCXXXTCGGTGCTG ACTATAAGTGXTTX 2410" 2420* 2430* 8330v 8340v 8350v 8360v 8370v 8380v 8390v CTGXTTCA6GAAGTAATATTACTTAATTATTTCTAGAAAAACATCCTTACCATGTGGAGTACTCAGCCAT TGA T T T G T G GTAC C T GTGAAAACUJLTTAATGCTGTGCGGCOTCTAAAGTTGCGTCGTTTTGAATGTTAGGCATGTACAGGTGCCT 2440- 2450- 2460- 2470" 24B0* 2490" 2500" 33 8400v 8410v 8420v 8430v 8440v 8450v TTACCTTTCTCCACTAATCCTATTASCTTCAOCACTACTCTA TAAAACACTAACCATCTCGTG A T T A T A C A C A T TAAAACA TCT TG CCAAATATACJUITAATACAGTACAGCAAGGAAAGCAAAAATG 2510" 2520" 2530" 2540" 2550" 2560" 2570" 8460v 8470v TTOCTCCCAACTCTC TTGCTCCC C AGACGCTGTCGTCGGCGCATGCTCTTTGCCTGTTGTTTGCGCCGCATTGTTGTCATTGCTCCCTCCCTCT 2580* 2590" 2600* 2610* 2620* 2630* 2640" 8480v 8490v 8500v 8510v 8520v 8530v 6540v 
AGCATTTTTAGCATTTCGATGGAAAGGTCCTTGAACTTCGCCTGCJU^TTCGCATCGGCTTTATTGCCCTT TTT T T C T GT TG T GC A T TCGTTTTCTCCTTCTCCTCTTCCTCTGTTTCTGOTCTTGCAACTCA^ 2650" 2660" 2670" 2680* 2690" 2700" 2710* 8550V 6560v 8570v TTACAATAATATCGAACOGTOCCAGCTG CGTGGC AATAAT G CGTGGC AATAATGATGGCATTATTGTATGCCGCCACACGCACATGCACACACJICACaiTGCACATATCGTGGC 2720" 2730" 2740" 2750" 2760" 2770" 2780" v 8590v 8600v 6610v 8620v 6630v 6640v TAAATTAOTTTTCCGGGCTGTTGTTGTCGAACGTTGAACGTGGAAAACGAGTGCGACTACCATGCTGCTC T T TT CT C A GTTGAA ATCCTGCT OCGGTAGAAAGAAACCAGTCTTTCGGTTGCAAGTTGAAG ATGCTGCTG 2790* 2800* 2810* 2820* v 8660v 8670v 8680v 6690v 8700v 8710v ATG<JAAATTCGAAAACCCATAGATAAAGATCGATOTGAAATGCGAAOGCTGTCAGGCAOCTC«TTTCTCT A T A ACCC AA TGCCTAGGTTTATTACCCCCCTTCCAAA—— 2840* 2850" v 8730v 8740v 8750v TCATTCGTTGAATGCGAAAGTCCCTTAAGTAGTTGG TCCGTT G TT G TCCGTTCCGCTCTTAGCCTGCCGATOGAAATGGAAGCCAAAATCCACCCC 2860" 2870* 2880* 2890" 2900* 6760v 8770v 8780v 8790v 6800v 8810v CCAAATGTTr^TTGTCCCTGAAAAAGGGCGGGTTCGGAAAAAGGTCCACAAAAAGAACGAA CCAAATGT CTGAAA T G CC CCATACACACCAAATCTAA CTGAAACTCCTAACTATTATTGCTOGCGCCGGTTGCTCCTTCCG 2920* 2930* 2940* 2950* 2960* 2970* 6820v 8830v 8640v 6B50v 8660v 8870v 8680v TATTTTCGGGATAGGGATTCGGATCGGGATCGGGCTCGGGATCTCTACCTAAAAGTTAACCAGGGTACAC T G G GG T GGC G A C G C C G T CTCCTCTGAGCCGCGGCCATTTCTTACTCCTGGACAACTGGKjAACAGCATTTTGAGCGCCCAGCTTAGG 2980* 2990* 3000* 3010* 3020* 3030* 3040* 8890v 6900v 8910v 6920v 8930v 6940v 8950v AATTCTTAATTAGAAATTGAAATGACATGCTACATATCTTCAGGCTTGACCTTGTGATTAAGTCOGAACT TT A T T A T C A T T A C T C C A AA TCTTTGAACCTGTCCCCTTGGATATCACCATGATGOCTTATAACCATCCOCCAAGGGCCATCCGTAAATA 3050* 3060* 3070* 3080* 3090* 3100* 3110* 8960v 8970v 8980v 6990v ATAGTTTTGATATTATCCCGAOTTAA AGGTGATC6TTACCT A AGT T T C C T AGGTCAT C A AAAGTGGTCTCCCTTACACCCACTCGAACTTCCTCTCTGCGTAOTAGATGGGAGGAGGTGATGGAGAAGG 3120* 3130* 3140* 3150* 3160* 3170* 3180* 9000v 9010v 9020v 9030v 9040v 9050v 9060v CACACTCATATTTACTTTTCCCOCCTTTOTTTCTGCCACTCTCGAAAOTCATCTTTTATCTTTACATTOG A A C T T T ATGTTT 
CGAATATGCAGCAGAGAGCAATCCTACTATAAATAG ATGTTTCOGGCAC 3190" 3200" 3210" 3220" 9070v 9080v 9090v 9100v 9110v 9120v AAATATGATTATTTAATAAGAGCGTTTTTGTTTGCTAGTTCCCGCATTTGTCTCA CACTTTTT A A T A T A TTT GT T GT C GCA T T A CACTTTT AGGCAACGCTGACAAGTGACCTGCTTTAGGTGTTGGTGTGCATGCACATATGAAAAACAGTGCACTTTTA 3240* 3250* 3260* 3270* 3280* 3290* v 9140v 9150v 9160v 9170v 9160v TGAATOACTAAAACCAGATTTTTG TATTTACOUlGTAGCCATATTTOCTTAAACAAAAAAA AA C GA TATTTACC T A A T AA A A OTCWGTCACGAATCATGAAWUUITTAAAATATTTATTTACCTGTTCCOTAAAAAGTACTGAATACACGGT 3310* 3320* 3330* 3340* 3350* 3360* v 9200v 9210v 9220v 9230v 9240v 9250v AAAAAAAAAAACOGATATCACTTTGTTATCTTAGTTTTCCATGATCATTAAATTTAATTTTTTAA A G A A T T ATCATT TAATTT A OGGGCTAGTCTTCTACAAAGTTGTAGCTGCA ATCATTTG TAATTTCCCATATAAT 3380" 3390" 3400" 3410" 3420* 34 9260v 9270v 9280v 9290v 9300v TGTTTAATTTAAGGGCCCnTTTCJIAAACATCACGGCCAOTATAAATAATAAT----TGTTTAA T A C T C AA CAGTAT A TAA ATGTTOOTGTTTAAGCTTACCACGTCTCACCTAAA CAGTATCCACTGIAAGTAAATTACCTC 3430* 3440* 3450- 3460* 3470* 3480* 9310v 9320v 9330v 9340v 9350v AACAAATAATGATGAGAAAQCTGACCTTATTTTTGGCTCAACTCCGATTC AACAAA T C TTC I7TGAGTGAACGATCATCCAAAACAAAGTGTATCTCTCGTTCCC — TTC 3500- 3510- 3520- 3530* 9360v 9370v 9380v 9390v 9400v 9410v 9420v TCAGCTGOGAGAGC TTGTTTTGCATCAGCGGGGASGGGAGCAGGGGGCAGAACGGGGATGCTGAC TCA T G C TTGTTTT T G G A G C -TCACATTTTCGTCCCGACTTTGTTTTCGTTTTCGTTTTCTTTOOCGTTTTTOGACCCAGATCGATCCCOT 3540- 3550- 3560- 3570~ 3580- 3590" 3600* 9430v 9440v 9450v 9460v 9470v 9480v AACCTGTTTGCAWOCCTTCCTATTGTT-TGTCGCTAGGACAAGAGAAGOCCATTAGGAACACTCTTTTT TT C T T TGTCGC C CTTTGTCTTTCTCAATGGTAGTGCGTCCCTGTCGCACCCCCTTCTT 3610- 3620- 3630* 3640* v 9500v 9510v 9520v 9530v 9540v CCAOTTTCTTTGCCTGTCGTGTOGTCCGAAATGACATTGCTCTTTTTTGCT TCTTTTT T —---------------------------------------TCTTTTTCTGTTCCTCOTCCCCTTTTCGCG 3650* 3660* 3670* 9550v 9560v 9570v 9580v 9590v 9600v GAGCAAATGTTTACTGATCGCCAAACTTCTCTTTTTCAATCCTCAGAGCGGACACAGAAAOCOAT GAGCAA G T T C G G G T 
TTGGAGAGCAAGAGAGAGAGTCTTTGGCGGGAGTAACATGCAGTCGCAGTTTGTTGTTGTGCGGTTTGTT 3690- 3700- 3710- 3720* 3730- 3740* 9610v 9620v 9630v 9640v 9650v 9660v 9670v ACGACCACACCCGTGAGCACCACAGCATCCCAGGGCATTTCAGCATCAGCGATTCTAGCAGGAGGCACTC C C A A A T A A G AG A GC C CCTOUiGCCCTAAGAGAATATGAATAAGAATATAAATATAAATTTAAACATGAATATCAAGCATGCCCCA 3760* 3770* 3780* 3790* 3800* 3810* 9680v 9690v 9700v 9710v 9720v 9730v 9740v TTCCCTTGAAGGACAACTCGAACATCCGCGAGAAGCCCCTCCACCATAACTACJUICCACAATAACAACAA T G A G C C G T A AC A CGGATTCGGAGCCGCGCCAAGTGTGTCCGTGTTTGTTTGTGTTTAAAGCAATTTACTGATAGTCTGTGTG 3630* 3840- 3650- 3860* 3870" 3660-9750v 9760v 9770v 9780v 9790v CAGCTCCCAGCACTCACACTCGCACCAOCAG CAOCAACAACAGCAGGTGG C TCC CTC CCAGCAG CAGCAACAACA A GT CGTTTCCTCTTTCTCTTT' CCAGCAGTGGGGACACCGAAAOTGAATCAGCAACAACAATACOTAC 3900- 3910- 3920- 3930- 3940- 3950" 9800v 9810v 9820v 9830v 9840v 9850v 9660v. GTGGCAAGCAGCTCGAGCGGCCACTAAAGTGCCTGGAAACGCTCGCCCAGAAGGCGGGAATCACCTTCGA C A C C C G GCCAC A A C C C GCG G A C CC CG ACOVCCACCTTCCCCGGAAGCCACCACATCTGO'AAAGGTCAACTCCACCACTCGCGTGGACCCCCAGCGG 3960* 3970* 3980* 3990" 4000- 4010- 4020-9870v 9880v 9890v 9900v 9910v 9920v 9930v CCAGAAATACGATCTGGCCAGCCCCCCGCATCCCGGCATTGCCCAGCA-OCAGGCGACTTCAOGAACAO-C A AA G G C C CCCCAG A GCAGGC C C AC CCACTAAGCTGCCTGGAAACACTC GCCCAGAAOGCAGOCATCAOCTTCGACGAG 4030- 4040- 4050- 4060* 4070* 9940v 9950v 9960v 9970v OCCCAGCAACGGGATCAGGCTCAOTCACCCCCACAAGCCATCGG GCCCAGC G C G C A GCCATC (^CTTTGCCAAGAGTCCATCCCAATCGCCCAGCTCTAAGGCAGCACGTGGGTCAGTCGGAACGCCATCAA 4090- 4100- 4110- 4120* 4130* 4140-v 9990v 10000V 10010V 10020v CACOOAACTCCGCCCACOOGCCGCAOOCAAACCCACACC CCAAOCACTCC G C ACCCAC C CCAAGC C CC TCAOACOOCOCC ACCCACTACTACCOCTCAOCAOCAOATCOCCAAOCOCACC 4160- 4170- 4180" 4190" v 10040v 10050v 10060v 10070v 10060v 10090v 4iAACAGOCCCAGTOCTC^CAGCACACCCAACACTAACTGCAACTCAATTQCCCGCCACACCAGCCT ACTCAA C C C CA CG ACTCAAAOACAACCOOCCOCAAA 4210* 4220-v lOHOv 10120v 10130v 10140v 10150v 10160v CTCGAGAAGGCGCAGAATCCCGGCCACCAGGTGGCCGCCACCACCACGGTGCCACTGCAGATATCCCCTG CTCGAGAAG 
C CAGA TCC G CA C C TGGC OCC CCACCA CTGCC CTCCAGAT TCCCC G CTGGAGAAGTCACAGAGTCCAGCTCAACCGATGGCOCCCGCCACCAATGTGCCCCTGCAGATCTCCCCCG 4230* 4240* 4250* 4260" 4270* . 4280* 4290-35 v 10180v 10190v 10200v 10210v 10220v 10230v AGCAOCTGCAGCAGTTCTATGCGAOCAATCCCTACOCCATTCAOGTGAAOCAAGAOTTTCCCACGa^CAC AGCXGCTGCXGCXGTT TATGC A CAATCCCTACGCCATTCAGGTGAAGCAAGAGTTTCCCACGCACAC AGCAGCTGCAGCAGTTATATGCAAACAATCCCTACGCCATTCAGGTGAAGCAAGAGTTTCCCACGCACAC 4300* 4310* 4320* 4330* 4340* 4350* 4360* v 10250v 10260v 10270v 10280v 10290v 10300v CACCAGTGGCAGTGGAACTGAACTAAAGCATGCAACCAACATTATGGAAGTTCAGCAGCAGTTGCAGCTG GACCAGTGGCAGTGGAACTGAACTAAAGCATGCAACCAACATTATGGAAGTTCAGCAGCAGTTGCA TG GACCAGTGGCAGTGGAACTGAACTAAAGCATGCAACCAACATTATGGAAGTTCAGCAGCAGTTGCACGTG 4370* 4380* 4390* 4400* 4410* 4420* 4430* v 10320v 10330v 10340v 10350v 10360v 10370v CAGCAGC TGTCGGAAGCCAACGGTGGAGGAGCAGCCTCGGCCGGAGCCGGAGGAGCAGCTAGTCCGG CXGCXGC TGTCGGAAGCCAACGGTGGAGGAGCAGCCTCGGCCGGAGCCGGAGGAGCAGCTAGTCCGG CAGCAGCAGCTGTCGGAAGCCAACGGTCGAGGAGCAGCCTCGGCCGGAGCCGGAGGAGCAGCTAGTCCGG 4440* 4450* 4460* 4470* 4480* 4490* 4500* 10380v 10390v 10400v 10410v 10420v 10430v 10440v CCAACTCGCAGCAAAOCCAGCAACAGCAOCACTCCACAGCCATCAGCACCATGTCOCCGATGCAATTGGC CCXXCTCGCXGCXAAGCCAGCAACAGCAGCACTCCACACCCATCAGCACCATGTCGCCGATGCAATTGGC CCAACTCGCAGCJUU^GCCAGCJaCAGCAGC^CTCCACJlOCC^^ 4510* 4520* 4530* 4540* 4550* 4560* 4570* 10450V 10460V 10470v 10480v 10490v 10500v 10510v AGCGGCCACTGGAGGAGTTGGCGGGGATTGGACACAGGGAAGGACGGTGCAGCTAATGCAACCCTCCACC AG G CCACTGGAGGAGTTGGCGGGGATTGGACACAGGGAAGGACGGTGCAGCTAATGCAACCCTCCACC AGGGCCCACTGGAGGAGTTGGCGGGGATTGGACACAGGGAAGGACGGTGCAGCTAATGCAACCCTCCACC 4580* 4590* 4600* 4610* 4620* 4630* 4640* 10520V 10530v 10540v 10550v 10560v 10570v 10580v AGTTTCCTGTATCCCCAAATGATTGTGTCGGGAAATCTGTTGCATCCAGGAGGCCTCGGTCAGCAGCCAA AGTTTCCTGTATCCCCAAATGATTGTGTCGGGAAATCTGTTGCATCCAGGAGGCCTCGGTCAGCAGCCAA AGTTTCCTGTATCCCCAAATGATTGTGTCGGGAAATCTCTTGCATCCAGGAGGCCTCGGTCAGCAGCCAA 4650* 4660* 4670* 4680* 4690* 4700* 4710* 10590v 10600v 10610v 10620v 10630v 10640v 
10650v "TCCAGGTGATCACCGCCGGCAAGCCATTCCAAGGCAACGGCCCCCAGATGCTTACCACCACGACTCAAAA TCCAGGTGATCACCGCCGGCAAGCCATTCCAAGGCAACGGCCCCCAGATGCTTACCACCACGACTCAAAA TCCAGGTGATCACCGCCCGCAAGCCATTCCAAGGCAACGGCCCCCAGATGCTTACCACCACGACTCAAAA 4720* 4730* 4740* 4750* 4760* 4770* 4780* 10660v 10670v 10680v 10690v 10700v 10710v 10720v CGCCAAGCAAATGATCGGTGGCCAAGCGGGATTCGCTGGCGCAAATTACCCGACCTGCATTCCCACAAAC CGCCXAGCXAATGATCGGTGGCCAAGCCGGATTCGCTGGCGGAAATTACCCGACCTCCATTCC C AAC CGCCAAGCAAATGATCGGTGGCCAAGCGGGATTCGCTGGCGGAAATTACGCGACCTCCATTCCTTCGAAC 4790* 4800* 4810* 4820* 4830* 4840* 4850* 10730V 10740V 10750v 10760v 10770v 10780v CACAATCAATCGCCCCAGACGGTGCTCTTCTCACCGATGAACGTCATT TCGCCACAOCAOCAAC CACAATCA TCGCC CAGACGGTGCTC TCTC CCG TGAA CTCAT TCGCCACAGCAGCA C CACAATCAGTCGCCTCAGACGGTCCTCATCTCGCCGCTCAATCTCATCTCCCACTCCCCACAGCAGCACC 4860- 4870- 4880* 4890- 4900* ' 4910- 4920* v 10800v 10810V ' 10820V 10630v 10840v 10850v AGAACCTGCTGCAATCAATGGCCGCTGCAGCTCAGCAGCAGCAXCTCACCCAXCXGCXG CAACAGTT A AACCT CTGCAATCAATGGCCGC CCXGCTCX CX CXGCXACT ACCCAACAGCAG CAACAG T AAAACCTTCTGCAATCAATGGCCGCCGCAGCTCAACAACAGCAACTTACCCAACXGCXOCAOCAACXGCT 4930* 4940* 4950* 4960* 4970* 4980* 4990* v 10870v 10860v 10890v 10900v 10910v TAACCAGCAGCAACAACAGCAGCTTACTCAGCAGCAACAOCAGTTGACAGCTG CTCTGOCCAA TAACCAGCAGCAACA CAGC CAGCAGCAACAGCA ACAGCTG CTCTGGCCAA TAACCAGCJkGCAACAGCAGCTCAACCAGCAGCACCAACAGCA ACAGCTGACTCCCGCTCTGCCCAA 5000* 5010* 5020* 5030* 5040* 5050* v 10930v 10940V 10950v 10960v 10970v 10980v GGTGGGAGTGGATGCGCAGGGCAAGCTGGCCCAGAAAGTGGTTCAGAAAGTGACTACCACCAGTAGCGCG OGTGGGAGTGGATGCGCAGGGCAAGCTGGCCCAGAAAG7GGTTCAGAA GTGAC ACCACCAG AGC CG GGTGGGAGTGGATGCGCAGGGCAAGCTGGCCCAGAXXGTGGTTCAGAAGGTGACCACCACCAGCAGCACG 5070* 5080* 5090* 5100* 5110* 5120* V HOOOv HOlOv 11020v 11030V 11040v 11050v GTCCAOGCGGCGACGGGTCCTGGATCTACTGCGTCAACACAGACCCAGCAOGTGCAGCAGGTTCAGCAAC GT CAGGCGGCGACGGGTCCTGGATCTACTGGGTCAACACAGACCCAGCAGGTGCAGCAGGTTCAGCAAC OTGCAGGCGGCGACCGGTCCTCGATCTACTGGCTCAACACAGACCCAGCAGGTOCAGCACGTTCAOCAAC 5140- 5150" 5160* 5170* 5180* 5190* v 
11070v 11080v 11090v lllOOv l l l l O v 11120v AGCAGCAOCAGACCACCCAAACCACTCAGCAGTGCGTGCAGGTTTCCACGTCGACTTTGCCAGTCGGTGT AGCAGCXOCXGXCCXCCCXAACCACTCAGCAGTGCGTGCAGGTTTC OTCGACTTTGCCAGTCGGTGT AGCAGCAGCAGACCACCCAAACCACTCAGCAGTCCGTGCAGGTTTCACAGTCGACTTTGCCAGTCGGTGT 5210* 5220* 5230* 5240* 5250* 5260* v 11140v 11150v 11160V 11170v 11180v 11190v CGOTOGACAGTCTCTTCAGACTGCCCAACTTCTGAACGCTGGCCAAGCGCAACAAATGCAGATTCCCTGG OGGTGGACAGTCTGTTCAGACTGCCCAACTT TGAACOCTGGCCAAGCGCAACAAATGCAGAT CC TGG CGGTGCACACTCTCTTCAGACTGCCCAACTTTTGAACGCTGGCC^AACCGCAACAAATGCAGATCCCTTGG 5280- 5290* 5300* 5310* 5320* 5330* 36 v 11210v 11220v 11230v 11240v 11250v U260v TTCTTACAGAATGCTGCAGGACTGCAGCCGTTTGOGCCAAACCAGATCATCCTGCGAAACC^^ TTCT CAGAATGC GC GG CTGCA CC TT OG C AA CAGATCATCCTGCGAAACCAOCCAGACG TTCTGGCAGAATGCGGCGGGCCTGCAACCCTTCGGCTCCAATCAGATCATCCTGCGAAACCAOCCAGACG 5350* 5360- 5370" 5360- 5390- 5400- W M * * W W v 11260v 11290v 11300v 11310v 11320v 11330v GAACCCAAGGCATGTTCATTCAACAGCAACCGGCGACGCAGACTTTGCAGACCCAGCAAAACCOTAAGAA GAACCCAAGGCATGTTCATTCAACAGCAACCGGCGACGCAGACTTTGCAGACCCAGCAAAACCGTAAGAA CAACCC^AGGCATGTTCATTCAACAGCAACCGGCGACGCAGACTTTGCAGACCCAGCAAAACCGTAAGAA 5420- 5430- 5440- 5450- 5460- 5470-V 11350v 11360v 11370v 11360v 11390v 11400v TATTGTCATGTATATTGCATCGGATAGGTACTAAAGTCAACTATCTTCCTACAGAGATTATTCAATGCAA TATTGTCATGTATATTGCATCGGATAGGTACTAAAGTCAACTATCTTCCTACAGAGATTATTCAATGCAA TATTGTCATGTATATTGCATCGGATAGGTACTAAAGTCAACTATCTTCCTACAGAGATTATTCAATGCAA 5490* 5500- 5510- 5520" 5530" 5540" v 11420v 11430v 11440v 11450v 11460v 11470v CGTGACGCAGACGCCCACTAAGOCACGCACTCAACTGGATGCACTTGCTCCCAAGCAGCAACAOCAGCAO CGTGACGCAGACGCCCACTAAGGCACGCACTCAACTGGATGCACTTGCTCCCAAGCAGCAACAGCAGCAG CGTGACGCAGACGCCCACTAAGGCACGCACTCAACTGGATGCACTTGCTCCCAAGCAGCAACAGCAGCAG 5560- 5570* 5580" 5590- 5600- 5610-v -11490V 11500v 11510v - 11520v - -U530v - 11540v CAGCAGGTTGGCACTACCAACCAGACGCAGCAGCAGCAACTAGCGGTGGCTACTGCCCAGTTGCAGCAAC CAGCAGGTTGGCACTACCAACCAGACGCAGCAGCAGCAACTAGCOGTGGCTACTGCCCAGTTGCAGCAAC 
C^GCAGGTTGGC^CTACCAACCAGACGCAGCAGCAGCAACTAGCGGTGGCTACTGCCCAGTTGCAGCAAC 5630- 5640- 5650- 5660* 5670* 5680-v 11560v 11570v 11580v 11590v 11600v 11610v AOCAGCAOCAACTCACTGCAGCAGCTCTGCAOCGACCAGGAGCCCCTGTCATOCCCCACAATOGAACTCA AGCAGCAGCAACTCACTGCAGCAGCTCTGCAGCGACCAGGAGCCCCTGTCATGCCCCACAATOGAACTCA AGCAGCAGCAACTCACTGCAGCAGCTCTGCAGCGACCAGGAGCCCCTCTCATGCCCCACAATGGAACTCA 5700- 5710" 5720- 5730- 5740- 5750-v 11630v 11640v 11650v 11660v 11670v 11680v AGTGCGTCCGGCCAGTTCCGTATCCACACAGACTGCCCAGAACCAGAGCCTGCTGAAGGCCAAAATGCGC AGTGCGTCCGGCCAGTTCCGTATCCACACAGACTGCCCAGAACCAGAGCCTGCTGAAGGCCAAAATGCGC AGTGCGTCCGGCCAGTTCCGTATCCACACAGACTGCCCAGAACCAGAGCCTGCTGAAGGCCAAAATGCGC 5770- 5780- 5790- 5800* 5810- 5820* v 11700v 11710v 11720v 11730v 11740v 11750v AACAAGCAGCAGCCGGTGCGCCCCGCTTTAOCCACATTGAAAACCGAAATCGGTCAAGTCGCAGOACAAA AACAAGCAGCAGCCGGTGCGCCCCGCTTTAGCCACATTGAAAACCGAAATCGGTCAAGTCGCAGGACAAA AACAACCAGCAGCCGGTGCGCCCCGCTTTAGCCJiCATTGAAAACCGAAATCGGTCAACTCGCAGGACAAA 5840- 5850- 5860* 5870* 5880* 5890-v 11770V 11780v 11790V 11800v 11810v 11820v ATAAGGTAGTAGGCCACCTGACCACCGTGCAGCAGCAGCAACAGGCGACGAATCTCCAGCAGGTGGTTAA ATAAGGTAGTAGGCCACCTGACCACCGTGCAGCAGCAGCAACAGGCGACGAATCTCCAGCAGGTGGTTAA ATAAGGTAGTAGGCCACCTGACCACCGTGCAGCAGCAGCAACAGGCGACGAATCTCCAGCAGGTGGTTAA 5910- 5920- 5930- 5940- 5950* 5960* v 11840V 11850v 11860v - 11670v 11880v 11890v TGCGGCGGGCAACAAGTGAGTATTGTTTGTTTATTATCTGCCTTGTCACAGTACTTATTTTTGCATCTTT TGCGGCGGGCAACAAGTGAGTATTGTTTGTTTATTATCTGCCTTGTCACAGTACTTATTTTTGCATCTTT TGCGGCGGGCAACAAGTGAGTATTGTTTGTTTATTATCTGCCTTGTCACAGTACTTATTTTTOCATCTTT 5980" 5990- 6000* 6010- 6020- 6030-v 11910v 11920v 11930V 11940v 11950v 11960v CCAGAATGGTTGTGATGAGCACAACGGGCACTCCGATCACCCTGCAGAATGGACAGACCCTTCATGCAGC C C A G A A T G G T T G T G A T G A G C A C A A C G G G C A C T C C G A T C A C C C T G C A G A A T G G A C A G A C C C T T C A T G C A G C C C A G A A T G G T T G T G A T G A G C A C A A C G G G C A C T C C G A T C A C C C T G C A G A A T G G A C A G A C C C T T C A T G C A G C 6050- 6060- 6070- 6080- 6090" 
6100-v 11980v 11990v 12000v 12010v 12020v 12030v C A C T G C G G C A G G A O T C G A C A A G C A C C A A C A G C A G C T A C A A C T C T T T C A G A A A C A G C A A A T C C T O C A A C A A C A C T G C G G C A G G A G T C G A C A A G C A G C A A C A G C A G C T A C A A C T G T T T C A G A A A C A G C A A A T C C T G C A A C A A C A C T G C G G C A G G A G T C G A C A A G C ^ G C A A C A G C A G C T A C A A C T G T T T C A G A A A C A G C A A A T C C T G C A A C A A 6120- 6130- 6140- 6150" 6160- 6170-v 12050v 12060v 12070v 12080v 12090v 12100v CAACAAATGTTGCAACAGCAGATTGCTGCCATTCAAATGCAOCAOCAOCAAGCGGCTGTTCAGOCCCAGC CAACAAATGTTGCAACAGCAGATTGCTGCCATTCAAATGCAGCAGCAGCAAGCGGCTGTTCAGGCCCAGC CJUiCAAATGTTGCAACAGCAGATTGCTGCC^TTCJUaTCCAGc^GCAGCAAGCGGCTGTTCAGGCCCACC 6190- 6200- 6210- 6220* 6230- 6240-v 12120v 12130V 12140v 12150v 12160v 12170v AACAACAGCAGCAACAGGTCTCTCAGCAOCAOCAOGTTAACGCCCAGCAACAOCAAGCGGTOOCOCAACA AACAACAGCAGCAACAGGTCTCTCAGCAGCAGCAOGTTAACGCCCAGCAACAGCAAGCGGTGGCGCAACA AACAACAGCAGCAACAGGTCTCTC^GCAGCAGCAGGTTAACGCCCAGCAACAGCAAGCGCTGGCCCAACA * 6260* 6270" 6280" 6290- 6300- 6310-V 12190v 12200v 12210v 12220v 12230v 12240v ACAACAGGCAGTCGCGCAGGCTCAGCAACAGCAGAGGGAGCAACAGCAGCAAGTTGCCCAAGCCCAGGCG ACAACAGGCAGTCGCGCAOGCTCAGCAACAGCAGAOGGAGCAACAGCAGCAAGTTGCCCAACCCCAGOCG A£AACAGGCAGTCGCGCAGGCTCAGCAACAGCAGAGGGAGCAACAGCACCAAGTTGCCCAAGCCCAGGCG * 6330- 6340* 6350" 6360- 6370" 6380-37 v 12260v 12270V 12280v 12290v 12300v 12310v CAGCATCAA("AGGCTCTCGCGAATGCCACTCAGCAAATCCTTCAGCTCGCGCCAAATCAATTCATCACGT CAGCATCAACAGGCTCTCGCGAATGCCACTCAGCAAATCCTTCAOGTGGCGCCAAATCAATTCATCACGT CAGCATCAACAGGCTCTCGCGAATGCCACTCACCAAATCCTTCAOCTGGCGCCAAATCAATTCATCACGT 6400* 6410* 6420* 6430* 6440" 6450* v 12330v 12340v 12350v 12360v 12370v 12360v CCCACCAGCAACAGCAGCAGCAGCAACTTCACAACCAACTGATACAGCAGCAGCTACAGCAACAGGCOCA CCCACCAGCAACAGCAGCAGCAGCAACTTCACAACCAACTGATACAGCAGCAGCTACAGCAACAGGCGCA CCCACCAGCAACAGCAGCAGCAGCAACTTCACAACCAACTGATACAGCAGCAGCTACAGCAACAGGCGCA 6470- 6480- 6490- 6500" 6510* 6520-V 12400v 12410V 12420V 
12430v 12440v 12450v OGCACAAGTTCAAGCCCAAOTGCAOGCTCAAGCGCAACAOCAACAACAOCAOCGAGAOCAGCAOCAGAAT GGCACAAGTTCAAGCCCAAGTGCAGGCTCAAGCGCAACAGCAACAACAGCAGCGAGAGCAGCAGCAGAAT GGCACAAGTTCAAGCCCAAGTGCAGGCTCAAGCGCAACAGCAACAACAGCAGCGAGAGCAGCAGCAGAAT 6540- 6550- 6560" 6570* 6580" 6590-v 12470v 12480v 12490v 12500v 12510v 12520v ATTATCCAGCAGATTGTGGTGCAACAGTC TGGAGCGACTTCTCAACAGACTTCCCAGCAGCAACAGC ATTATCCAGCAGATTGTGGTGCAACAGTC TGGAGCGACTTCTCAACAG CAGCAGCAAC GC ATTATCCAGCAGATTGTGGTGCAACAGTCAACTGGAGCGACTTCTCAACAGCAG CAGCAGCAACCGC 6610* 6620* 6630* 6640* 6650- 6660* v 12540v 12550V 12560v 12570v 12560v 12590v ACCACCAATCCGGGCAACTACAGCTAAGTAGCGTGCCeTTCTCACTTTCTTCGTCAACGACGCCA--A CA CA TC GG CA T CAGCT AG AGCGTGCC TT TC CTTTC C TC A GACG C AACAGCAGTCTGGACAGTTGCAGCTTACCAGCGTGCCGTTTTCGGTTTCACCATCCATGACGCCG6AACA 6680" 6690* . 6700*. .. 6710- .. . 6.720*_ .. 6730* . . 12600v 12610v 12620v 12630v 12640V 12650v GCCGGAATAGCTACCTCTAGTGCTCTGCA00CA6CCCTCTCCGCCTCTGGCGCCATCTTTCAGACA GCCGGAATA C CC T C A GC C TCTGGCGCCATCTTTCAGACA TATTGCCGGAATAACATCCAGTGCCCTACAAGAAGCTCTCTCOOTG TCTGGCGCCATCTTTCAGACA 6750- 6760- 6770* 6780" 6790* 6800-v 12670v 12680v 12690v 12700v 12710v 12720v GCTAAGCCGGGTACTTCCAGTTCCTCCTC CCCCACAAGCACTGTGCTCACAATTACCAACCAGAGCA C AA CCG TACTTGCAGTTCCTC C CCCCACAAGCAGTGTGGTCACAATTACCA CCAGAGCA ACCAAACCGATTACTTGCAGTTCCTCTACGCTCCCCACAAGCAGTGTGGTCACAATTACCAGCCAGAGCA 6810- 6620- • 6830- 6840* 6850" 6860- 6870* 12730v 12740V 12750v 12760v 12770v 12780v ' 12790v CCACTCCTTTGGTCACCAGCAGTACGGTGGCCAGTATACAGCAGGCTCAGACGCAATCTGCTCAGGTCCA GCACTCCT TGGTCACCAGCAGTACGGTGGCCAGTAT CAGCAGGCTCAGACGCAA T CTCAG TCCA GCACTCCTCTGGTCACCAGCAGTACGGTGGCCAGTATGCAGCAGGCTCAGACGCAAGGTACTCAGATCCA 6880- 6890* 6900" 6910* 6920- 6930- 6940" 12600V 12810v 12620V 12830v 12840v 12850v 12860v CCAACATCAGCAGCTAATCAGCGCCACAATTGCCGGACGGACTCAACAACAGCCACAGGGACCGCCA CAACATCAGCAGCTAATCAGCGCCAC ATTGCCGGAGGG CTCAACA CAGC CAG C GC A TCAACATCAGCAGCTAATCAGCGCCACTATTGCCGGAGGGTCTCAACAGCAGCAGCAGCAGCAGCAACTG 6950- 6960- 6970- 6980* 6990* 7000* 
7010" 12870v 12880v 12890v 12900v • TCACTTACACCCACCAC — AAATCCAATTTTGOCCATGACCTCGATGA TCACTTACACCCACCAC AAATCC ATT TGOCCATGACCTCGATGA GGACTACCTTCACTTACACCCACCACGCCCTCACCTACAACAAATCCCATTCTGGCCATGACCTCGATGA 7020- 7030- 7040* 7050* 7060* 7070* 7080* v 12920v 12930v 12940v 12950v 12960v 12970v TGAATOCTACAGTGGGTCACCTTTCCACTGCTCCOCCTGTAACTGTTTCTGTGACAAOCACCGCTOTTAC TGAATGC AC GTGGGTCACCT TCCACTGC CC CC GT A TGTTTCT AGCACCGCTGT AC TGAATGCCACCCTGGGTCACCTATCCACTGCCCCACCCGTTAGTGTTTCT AGCACCGCTGTCAC 7090- 7100- 7110- 7120- 7130* 7140-v 12990v 13000v 13010v 13020v 13030v 13040v rrCGTCGCCGGGTCAGCTGGTTCTCTTAAGCACOGCTAGTAGCGGTOGAGGAGGTAGCATACCACCCACG T C TCG C GG CAGCTGGT TAAGCA GCTAGTAGCGGTGGAOGAG GC T CCAGCCACG TCCATCGTCTGGACAGCTGGTCACACTAAGCAGTGCTAGTACCGCTGGAGGAGCAGGCTTTCCACCCACG 7160* 7170* 7180* 7190* 7200* 7210* v 13060V 13070v 13060v 13090v 13100v 13110v CCCACCAAAGAGACACCTTCGAAAGGGCCCACCGCAACCCTGGTGCCCATTGGTTCGCCCAAGACTCCTG CCCACCAAAGAGACACCTTC AAAGGGCCCACCGCAACCCTGGTGCCCATTG TTCGCCCAAGACTCCTG CCCACCAAAGAGACACCTTCAAAAGGGCCCACCGCAACCCTGGTOCCCATTGATTCGCCCAAGACTCCTG 7230- 7240- 7250* 7260- 7270- 7280* v 13130v 13140v 13150v 13160v 13170v 13180v TATCAGGAAAGGACACCTGCACTACCCCCAAATCATCTACTCCTGCCACTGTCAGCGCATCCGTAGAGGC TATCAGGAAAGGACACCTGCACTACCCCCAAATCATCTACTCCTGCCACTGT AGCGCATCCGTAGAGGC -fATCAGGAAAGGACACCTGCACTACCCCCAAATCATCTACTCCTGCCACTGTTAGCGCATCCGTAGAGGC 7300* 7310- 7320- 7330- 7340- 7350-V 13200V 13210v 13220v 13230v 13240v 13250v CAGTAOTTCCACAGGCGAAGCCCTGTCCAATGGAGATOCCTCAGATAOGTCTTCCACGCTOTCAAAOOGC CAGTAGTTCCACAGGCGAAGCCCTGTCCAATGGAGATGCCTCAGATAGGTCTTCCACGC GTCAAAGGGC CAGTAGTTCCACAGGCGAAGCCCTGTCCAATGGAGATGCCTCAGATAGGTCTTCCACGCCGTCAAAGGOC 7370- 7380- 7390- 7400- 7410- 7420* 38 v 13270v 13260v 13290v 13300v 13310v 13320v OCTACCACTCCCACCAOCAAGCAAAGCAATGCAGCAGTGCAGCCACCGAGTAGCACCACTCCCAACAGTG CCTACCACTCCCACCAGCAAGCAAAGCAATGCAGCAGTGCAGCCACCGAGTAGCACCA TCCCAACAGTG GCTACCACTCCCACCAGCAAGCAAAGCAATGCAGCAGTGCAGCCACCGAGTAGCACCATTCCCAACAGTG 7440* 7450- 7460" 7470- 7480* 7490* V 13340v 
13350v 13360V 13370v 13380v 13390v TCAGTGGGAAAGAAGAGCCGAAGCTCGCAACCTGCGGCAGTTTAACGTCCGCAACATCAACTTCAACCAC TCAGTGGGAAAGAAGAGCCGAAGCTG A CTGCGGCAGTTTAACGTCCGCAACATCAAC TCAACCAC TCAGTGGCAAAGAAGAGCCGAAGCTGCACAACTGCCCCAGTTTAACCTCCOCAACATCAACATCAACCAC 7510- 7520- 7530- 7540- 7550- 7560-v 13410v 13420v 13430v 13440v 13450v 13460v 6ACAACGATCACCAATGGGATTGGAGTAGCCAGAACGACAGCCAGCACGGCTGTCTCAACCGCTAGCACA CACAACGATCACCAATGGGATTGGAGTAGCCAGAACGACAGCCAGCACGGCTGTCTCAACCGCTACCACA CACAACGATCACCAATGGGATTGGAGTAGCCAGAACGACAGCCAGCACGGCTGTCTCAACCGCTAGCACA 7580- 7590- 7600" 7610- 7620" 7630" V 13480V 13490v 13500V 13510v 13520v 13530v ACCACTACCAGTTCTGGCACCTTTATCACAAGTTGCACCAGCACAACCACAACCACCACGTCGAGTATCA ACCACTACCAGTTCTGGCACCTTTA CACAAGTTGCACCAGCACAACCACAACCACCACGTCGAGTATCA ACCACTACCAGTTCTGGCACCTTTACCACAAGTTGCACCAGCACAACCACAACCACCACGTCGAGTATCA 7650- 7660- 7670" 7680* 7690- 7700" V 13550v 13560V 13570v 13580v 13590v 13600v OTAATGGATCGAAGGATCTCCCCAAGGCGATGATTAAGCCGAACGTCTIAACTCACGTCATCGATGGCTT CTAATGGATCGAAGGATCTCCCCAAGGCGATGATTAAGCCGAACGTCTTAACTCACCTCATCGATOGCTT CTAATGGATCGAAGGATCTCCCCAAGGCGATGATTAAGCCGAACGTCTTAACTCACGTCATCGATGGCTT 7720- 7730- 7740" 7750* 7760" 7770* V 13620v 13630v 13640v 13650v 13660v 13670v CATCATCCAGGAGGCCAACGAGCCATTTCCCGTCACCAGACAGCGATATGCAGACAAAGACGTCAGCGAT CATCATCCAGGAGGCCAACGAGCCATTTCCCGTCACCAGACAGCGATATGCAGACAAAGACGTCAGCGAT CATCATCCAGGAGGCCAACGAGCCATTTCCCGTCACCAGACAGCGATATGCAGACAAAGACGTCAGCGAT 7790- 7800- 7810- 7820- 7830* 7840-v 13690v 13700v 13710v 13720v 13730v 13740v GAGCCGCCAAGTGAGTATAAACTTCTGGTACCAATGCTTTTTCGCAATCTTAACGTGTCATTCCTTCGCG GAGCCGCCAAGTGAGTATAAACTTCTGGTACCAATGCTTTTTCGCAATCTTAACCTGTCATTCCTTCGCC GAGCCGCCAAGTGAGTATAAACTTCTGGTACCAATGCTTTTTCGCAATCTTAACGTGTCATTCCTTCGCG 7860- 7870- 7880* 7890- 7900" 7910* v 13760v 13770v 13760v 13790v 13800v 13810v CAGAGAAAAAGGCAACCATGCAGGAGGACATCAAGCTAAGTGGAATAGCATCAGCTCCAGGCTCGGATAT CAGAGAAAAAGGCAACCATGCAGGAGGACATCAAGCTAAGTGGAATAGCATCAGCTCCAGGCTCGGATAT 
CAGAGAAAAAGGCAACCATGCAGGAGGACATCAAGCTAAGTGCAATAGCATCAGCTCCAGGCTCGGATAT 7930- 7940- 7950* 7960- 7970" 7980* V 13630v 13640V 13850v 13860v 13870v 13BB0v GGTTGCTTGCGAGCAGTGTGGAAAGATGGAGCACAAAGCAAAGCTGAAACGGAAGCGCTACTGTTCGCCA CGTTGCTTGCGAGCAGTGTCGAAAGATGGAGCACAAAGCAAAGCTGAAACGGAAGCGCTACTGTTCGCCA GGTTGCTTGCGAGCAGTGTGGAAAGATGGAGCACAAAGCAAAGCTGAAACGGAAGCGCTACTGTTCGCCA 8000- 8010- 8020- 8030- 8040* 8050-v 13900v 13910v 13920v 13930v 13940v 13950v GGATGCTCGAGGCAGGCAAAGAACGGCATCGGTGGAGTTGGATCAGGAGAGACGAACGGCCTGGGGACAG GGATGCTCGAGGCAGGCAAAGAACGGCATCGGTGGAGTTGGATCAGGAGAGACGAACGGCCTGGGGACAG GGATGCTCGAGGCACCCAAAGAACGCCATCGGTGGACTTGGATCACGAGAGACGAACGCCCTGGGGACAG 8070- 8080- 8090- 8100" 8110- 8120* v 13970v 13980v 13990v 14000v 14010v 14020v OTGGTATAGTTGGGGTGGCAGCCATGOCATTGGTGGACAOGCTGGATGAAGCCATGGCTGAOGAGAAGAT CTGGTATAGTTGGGGTGG CC ATGGCATTCGTCGACAG CTGGATGAAGCCATGGCTGAGGACAACAT GTGGTATAGTTGGGGTGGACGCAATGGCATTGGTGGACAGACTGGATGAAGCCATGGCTGAGGAGAAGAT 8140- 8150- 8160- 6170" 8180- 8190* v 14040v 14050v 14060v 14070v 14080v 14090v GCAGACAGAGGCCACCCCAAAGCTTTCAGAATCGTTTCCTATTTTG^GAGCCTCAACAGAAGTACCTCCA GCAGACAGA C CC A T TC GA C TT CC ATT GCAGACAGAATCATACCAGACAGTATCGGACCCTTTGCCAATT 8210- 8220- 8230* 8240-v 14110v 14120v 14130v ATGTCACTGCCAGTCCAAGCGGC GATTTCTGCGCCCTCGCCTC CAAGCGGC GATTTC CCC GC — »CAAGCGGCTACGCCGGAGGTCCCACCGATTTCGATGCCAGTGCTGGCGGCTATGT 6250- 8260- 8270- 6280- 8290* 14140V 14150V 14160V TTOCAATGCCTCTAOGATCGCCATT TTGCAAT CTC CGACATCTTCACCACTTTCGTTGCCCCTGACATTOCCCTTGCCAATTGCAATAGCTC— 8300- 8310- 8320- 8330- 6340" 8350-14170v 14180v 14190v 14200v 14210v 14220v 14230v 43TCAGTTGCACTTCCAACTCTTGCACCACTGTCTGTAGTCACTTCTGGCGCGGCGCCCAAGTCTTCGGAA CCACTGT CTCACT C G G C G T CCACTGT GTCACTGCCAGTGGTTTCAGCTGGAGTGGTTGC •360- 6370- 6380- 6390-39 14240v 14250v 14260v 14270v OTO AATGGAACAGATCGTCCGCCAATCAOCAGCTGGAGTCTG G AATGGA C GATCG CC CC ATCAGCAG TGGAGTGTG GCCGGTCCTAGCAATACCATCCTCGAATATAAATGGATCCGATCCCCCTCCCATCAGCAGTTCCAGTCTG 8400* 8410* 8420* 8430* 8440* 6450* 6460* v 14290v 14300v 
14310v 14320v 14330v 14340v CACGATGTCAGCAACTTCATTCGAGAACTGCCTGGTTGTCAGGACTACCTGGACGACTTTATACAGCAGG CA GA GT AGCAA TTCAT CGAGAACTGCCTGGTTG CAGGACTACGTGGACGACTTTATACAGCAGG GAAGAAGTTAGCAATTTCATCCGAGAACTGCCTGGTTGCCAGGACTACGTGGACGACTTTATACAGCAGG 8470* 8480* 8490* 8500* 8510* 8520* 6530* v 14360v 14370v 14380v 14390v 14400v 14410v AGATCGACGGCCAAGCGCTTCTGTTGCTCAAOGAOAACCATTTGGTGAACGCTATGGGCATGAAGCTOGG AGATCGACGGCCAAGCGCT CTG TGCTCAA GA AA CATTTGGT AACGC ATGGGCATGAAGCTGGG AGATCGACGGCCAAGCGCTGCTGCTGCTCAAAGAAAACCATTTGGTTAACGCCATGCCCATGAAGCTCGG 8540* 8550" 8560* 6570* 6580* 8590* 8600* v 14430v 14440v 14450v 14460v 14470v 14480v TCCAGCTCTTAAAATTGTGGCCAAGGTGGAGTCCATTAAGGAGGTCCCGCCACCTGOCGA6GCCAAGGAT TCCAGCTCT AAAATTGTGGCCAAGGTGGAGTCCATTAAGGAGGTCCCGCCA G G AAGGAT TCCAGCTCTCAAAATTGTGGCCAAGGTGGAGTCCATTAAGGAGGTCCCGCCAGGCGATGTA AAGGAT 8610* 8620* 8630* 6640* 6650* 8660-v 14500v 14510v 14520v 14530v 14540v 14550v CCAGGAGCGCAGTAGGGCAGCTAGAGCACCAAAAGCCGAAAAAGATGATCTCCTAACCGACCAGTGGACC A A G AAAAGA T CT C G TAAAAACACGCAACAAAGTCAAGGTTTC— AAAAGACCCCTTTCTTTAGTTTCCCGCGTTT 8680* 8690* 6700* 8710* 8720* v 14570v 14580v 14590v 14600v 14610v 14620v TOGTTCAACCAAGTCTGTCGTGCCAGGTGATTCTGATTCAATCGAOCAGGCGAAAAGGACGCGAATCCAT T AA A T C GTGA GCGAAT CACCTAAATGTAACGACATTTACTTCGTGA GCGAATGTGA 8740* 8750* 6760* v 14640v 14650V TTGCAAAATATTATTAGCATCAGGC — CATCAOGT T A A A A G ATCA G CATCAG T TCAGACAGAACAAAGTGAATCACGTTCCCACTCACCACTTCTCACACGACGTACACCCTAATCATCAGCT 8780* 8790" 8800" 8810* 8820* 8830* v 14670v 14680V 14690v 14700v 14710v 14720v CTTAAACCCATTGACTTTGTACATACTCCCAAGATCACTTATAAGCATATTCATTTATAAATTAAACTAA CC ACTCCC AGA C ACATGCACCTAATCTACAAA0GGAACTCCCCA6AGAGCAACC0GTGCC 8850" 8660" 8870" 8880* .v 14740v 14750V 14760v 14770v 14780v 14790v •CUGTCAACAOTCAAAAACGAATCGAATTACTTAAACTAAGGAAAAGCTATGAArTAATTGCCAGCCAAGT TGGAAT ACT A CT G A GC G TGGAATCACTGACTCTGTTGCOAOGCCCATCCCATCCAGAATCTATGCG 6890* 6900* 6910* 8920* 8930* v 14B10v 14820v AAAT — OGAATAAAGTACATTTT A A GGAATA C T T 
AOAAATCCATAATTAOGTGATCTAGTTGTTTTTCCCOCACATGACGAAAGCAAGGAATATGACCCTCCTT 8940* 6950* 8960* 6970* 8980* 8990* 9000* 14830v 14840v 14850v 14860v 14870v 14880v 14890v ATAAATAACCATAACTTTATACTCTAAGTACCATArTAATAACTCCCAACGTCATGGATAGrTTGTACAA TAGTTT CA CGGCOCCGAAOCTOCAGC TAGTTTAAGCAC 9010* 9020* 9030* 14900V 14910V 14920V 14930v 14940v 14950V 14960v CTATTTTAATCTTAOGAATCAAATGTAGCACTATGATTGTTCTAACAACTAAGAATTTTAAOCCTATGAA T A C GATTGT AA A AG T A C A CCCGATCAGACCCCAA GATTGTOOCAATAOTAGAGTCCATGACTCTGTGCGA 9040* 9050* 9060* 9070* 9080* 14970v 14980v 14990v 15000v 15010v 15020v 15030v TAATAATTGATATCTAAATGTGAATTTGAACTTTTTACTAAATAATATTTGAATGCCTAGACCTAAGCTT A A A T T C C T A T A C T C CGAAAAGGACGGGGACCTTATACGACCCCTCGCGCCTCCCCGTTGGATCAACACTCTTCAGCACTCTACC 9100* 9110* 9120* 9130- 9140" 9150* 15040v 15050v 15060v 15070v 15080v 15090v 15100v TTTTTTCAACATTTTTTTTTTGCAATTGCTGAAGAAATTAAAATOOCACTGLAATAGTOTTTAATAAATCT T A T OCA T CTGA AOAGTCTGACGATAOGAOCGGGCAGTATCTGAGCTCTA 9170- 9180* 9190-

Open reading frame analysis

To determine the direction of ph transcription, single-stranded RNA probes were hybridized to ph messages. Transcription was found to proceed from proximal to distal (J. Deatrick and N. Randsholt, pers. comm.). The sequence data were used to locate open reading frames (ORFs) in ph. True ORFs will have to be distinguished from the untranslated 5' and 3' regions of mRNAs, and sequence data alone cannot identify ph exons. To locate ph exons unequivocally, S1 nuclease mapping of cDNAs will have to be done. This work is in progress and has already determined the splicing pattern of three exons in the proximal repeat (ORF 4/5/6; Denise Pierre, Hugh Brock, unpublished data).

Because ph has a repeated structure with variable sequence conservation, it is likely that the conserved sequences are under selection pressure. Therefore, ORFs conserved in both repeats are likely to be genuine. Table 1 lists these conserved ORFs.
A striking observation about all of these ORFs (2-6, 8-12) is that every insertion or deletion occurs in a multiple of three nucleotides, implying strong selection to maintain the reading frame.

However, if the ph repeats have slightly different functions or splicing patterns, there could be genuine ORFs that are not conserved between the repeats. Such ORFs could be identified using ORF length and codon preference as criteria. I plotted the length distribution of ORFs in one frame of the coding strand and compared it with the length distribution of ORFs in one frame of the non-coding strand (Figure 5). If length is a good criterion of openness, the coding strand plot should be skewed to the right relative to the non-coding strand plot. In Figure 5b, the distribution is essentially normal up to a length of 70 amino acids. Beyond this point, ORFs are widely distributed along the x-axis.

Table 1. Long ORFs conserved in both repeats of ph

ORF         Coordinates (bp)    Length* (bp)    Length* (aa)
2           5452-5700           249             83
3           5859-6083           225             75
4/5/6       9558-14498          4824            1608
8           17539-17811         273             91
9           17924-18163         240             80
10/11/12    19497-24399         4785            1595

*Lengths exclude introns for ORFs 4/5/6 and 10/11/12.

Table 2. ORFs 70 amino acids in length or greater

Reading frame 1                Reading frame 2                Reading frame 3
Coordinates (bp)  Length (aa)  Coordinates (bp)  Length (aa)  Coordinates (bp)  Length (aa)
4426-4635         70           1-280             93           4398-4649         84
5452-5700         83           284-541           86           4653-4922         90
6292-6531         80           1442-1765         108          5859-6083         75
6991-7206         72           3293-3502         70           6324-6581         86
7210-7452         81           4187-4465         93           7092-7295         68
7594-7806         71           6308-6574         89           7521-7736         72
7810-8082         91           6914-7153         80           7740-8264         175
9298-9618         107          7157-7405         83           9559-11369        603
9823-10212        130          7745-7969         75           11430-11759       110
10264-10497       78           9740-10060        107          11829-14498       890
11164-11844       227          10370-10855       162          18972-19201       76
11917-12357       147          10985-11332       116          19221-19502       94
15604-15888       95           11780-12145       122          19845-20111       89
17539-17811       91           12149-12553       135          20229-20723       165
19492-21240       583          12662-13012       117          20727-21203       159
21301-21630       110          13730-14002       91           21651-22016       122
21700-24399       900          14018-14425       136          22020-22421       134
                               17924-18163       80           22425-22853       143
                               18251-18706       152          23919-24200       94
                               19205-19441       79
                               19760-20068       103
                               20115-20356       80
                               21035-21715       227
                               21788-22228       147
                               22232-22453       74

Figure 4. A physical summary map of ph. The sub-clones used for sequencing are labelled (616-617, 617-628, 1952SB, 21908, 214, 04 and 215 were used by Deatrick and Randsholt to sequence the proximal repeat; 1.5, 4.0, 0.8, 2.3 and 8.7 were used by myself to sequence the distal repeat). All open reading frames (ORFs) 70 aa in length or greater are drawn on the map. Thick lines represent ORFs conserved in both repeats (ORFs 2 through 6 and ORFs 8 through 12). A vertical dash marks the position of a stop codon lying between two ORFs. The known splicing pattern of ORFs 4, 5 and 6 and the suspected splicing pattern of ORFs 10, 11 and 12 are indicated with the hatched lines. Vertical arrows represent introns. The open boxes represent the zinc finger sequence found in each repeat. Putative promoters (P), translation initiation signals (Tr) and transcription termination signals (t) are indicated. The confirmed location of a polyA tail is marked (A). Thick lines at the bottom of the figure (i through vi) represent regions with a very high similarity index (over 95%). The line marked "unique" represents sequence found only in the proximal repeat. Regions of repetitive and unique DNA as determined by cross-hybridization studies (Freeman, 1988) are diagrammed as boxes at the bottom of the figure (a through e).

[Figure 4 map: scale in kb, 5' (proximal) to 3' (distal); reading frames RF1, RF2 and RF3.]

Figure 5. Frequency distributions of ORFs in a random sequence (a) and in a coding sequence (b). This figure was made to see if length is a good criterion of openness. The length of every ORF in one frame of the non-coding strand and in one frame of the coding strand was plotted against its frequency. The mean length of ORFs in the coding sequence is 29 aa. The mean length of ORFs in the non-coding sequence is 24 aa.

I assume that ORFs of about 70 amino acids (aa) or longer have coding potential because they lie outside the normal distribution. I therefore chose, arbitrarily, a length of 70 aa as the cutoff for significant ORFs. Every ORF of 70 aa or longer is listed in Table 2 (also see Fig. 4). Some of these ORFs may not be genuine, and some genuine ORFs shorter than 70 aa may be missing from the table. A comparison of cDNA sequence with the genomic sequence will be needed to determine unequivocally the locations of genuine ORFs in ph. Sequence conservation cannot be used to identify ORFs unique to either side of the gene.
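The length criterion described above can be sketched in a few lines of Python. This is only an illustration, not the procedure used to build Table 2: it assumes that an ORF is any stop-free run of codons in a single reading frame (stop codon to stop codon), and the names `open_frames`, `long_orfs` and `orf_length` are hypothetical. The 70-codon cutoff is the one chosen in the text, and the toy sequence is invented.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def open_frames(seq, frame=0):
    """Yield (start, end, codons) for every stop-free run in one reading frame.

    Coordinates are 1-based and inclusive. Treating an ORF as the stretch
    between two stop codons is an assumption made for this sketch.
    """
    seq = seq.upper()
    run_start = frame
    pos = frame
    while pos + 3 <= len(seq):
        if seq[pos:pos + 3] in STOP_CODONS:
            codons = (pos - run_start) // 3
            if codons > 0:
                yield (run_start + 1, pos, codons)
            run_start = pos + 3
        pos += 3
    # Report a run that reaches the end of the sequence without a stop codon.
    codons = (pos - run_start) // 3
    if codons > 0:
        yield (run_start + 1, pos, codons)


def long_orfs(seq, frame=0, min_codons=70):
    """Keep only the open frames that meet the length cutoff."""
    return [orf for orf in open_frames(seq, frame) if orf[2] >= min_codons]


def orf_length(start, end):
    """Convert inclusive bp coordinates to (length in bp, length in aa)."""
    bp = end - start + 1
    return bp, bp // 3


# Invented toy sequence: one 70-codon open frame and one 10-codon open frame.
toy = "TAA" + "GCT" * 70 + "TAG" + "GCT" * 10 + "TGA"
print(long_orfs(toy))          # [(4, 213, 70)]
print(orf_length(5452, 5700))  # (249, 83), the ORF 2 entry of Table 1
```

Collecting the `codons` values from `open_frames` for one frame of each strand would give the two length distributions compared in Figure 5.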
However, codon preference can be used to distinguish non-coding DNA from coding DNA (Gribskov et al., 1984). Most amino acids can be specified by several codons (codon degeneracy), and the frequency of synonymous codons varies with the organism and the gene. Non-coding DNA shows no preference for a specific codon within a family of synonymous codons, whereas coding DNA can show a preference for one codon over another (Gribskov et al., 1984, and references therein). Thus, regions of relatively high codon preference should indicate DNA with coding potential. GENEPLOT was used to determine the codon preference of each putative ORF, using a Drosophila codon frequency table (Ashburner, 1989). Table 3 shows the overall codon preference parameters for each ORF and for a random sequence of the same composition as the ORF. None of the ORFs shows a value significantly higher than the codon preference for a random sequence. Thus, for ph, codon preference cannot be used to indicate regions of coding potential. Abundantly expressed genes tend to have the best codon preference values (Ashburner, 1989), and since the ph message is rare, its codon preferences may be different.

Table 3. Codon preference parameters of the 6 putative ORFs

ORF         Codon preference     Codon preference for random sequence
            of input strand      of same composition as input strand
2           0.94                 0.96
3           0.96                 0.97
4/5/6       0.97                 0.95
8           0.95                 0.96
9           0.95                 0.97
10/11/12    0.97                 0.95

The splice junctions of ORFs 4-5 and 5-6 have been determined by comparing cDNA sequence with the genomic sequence (D. Pierre and H. Brock, unpublished data). Because the splice junction sequences of ORFs 10, 11 and 12 are identical to those of ORF 4/5/6, the same splicing pattern was assumed for ORFs 10, 11 and 12. This will have to be confirmed by comparing cDNA sequence with the genomic sequence.

Two regions of high sequence similarity (regions i and iv in Fig. 4) do not contain any long ORFs that are duplicated on each side of the gene. It is possible that ORFs are present in these regions but that they are less than 70 aa in length. Region i contains an ORF at its 5' end, but this ORF is not duplicated in region iv. It is possible that these regions contain conserved regulatory sequences needed for proper expression of ph. This could be tested by placing these regions upstream of a marker gene (e.g., lacZ) and transforming Drosophila embryos with the chimaeric construct. If a region is required for ph regulation, it should control the distribution of β-galactosidase such that it mimics the distribution of ph product in wild-type embryos. If it does, the exact positions of regulatory sequences within the region could be determined using deletion series or site-specific mutagenesis.

The proximal repeat contains 2248 bp of DNA not present in the distal repeat (the unique region, Fig. 4). The unique region could contain ORFs that are incorporated into the ph protein. Alternatively, this region could represent a large intron. In any case, it cannot be decided whether the proximal repeat has gained DNA through an insertion event or the distal repeat has suffered a deletion. The proximal and distal repeats are therefore not perfect repeats, as their lengths differ. Subsequent to the duplication of the ph region, the two repeats have diverged.

Northern walk data (J. Deatrick and N. Randsholt, unpublished data) show that probes from coordinates 3.5 kb-5.2 kb and from 14 kb-17.5 kb hybridize to both of the major transcripts, suggesting that at least one sequence conserved on both sides must be present in the 6.1 kb and 6.4 kb mRNAs. Yet my analysis reveals no long ORFs conserved in both repeats in this region. There are nevertheless two small regions, marked i and iv on Figure 4, that are highly conserved. These sequences may represent conserved untranslated leader regions present in both messages.
This does not rule out the possibility that there are also small ORFs in this region. As noted in the introduction, the region from 0-1.25 kb hybridizes to small ph messages, but as shown in Table 2, there are no long ORFs in this region. It may be that this region has many small ORFs whose locations will have to be confirmed by S1 nuclease mapping or by comparing cDNA sequence with the genomic sequence. The two largest ORFs in this region of the proximal repeat and the corresponding region of the distal repeat are labelled 1 and 7 on Figure 4 (also see Table 2).

Splice junctions

The presence of consensus splice junctions around putative ORFs can be used as an indication of how likely a particular ORF is to be genuine. The PATTERNS program was used to search the gene for potential exon-intron junction sequences and branch sequences. The following consensus sequences were screened (Mount, 1982):

    5' splice junction (exon | intron):    MAG | GTRAGT
    branch sequence (intron):              CTRAY
    3' splice junction (intron | exon):    (Y)11NYAG |

The minimum match percentage required for each sequence is as follows: MAGGTRAGT (79%), CTRAY (100%), and (Y)11NYAG (87%). A list of all splice junction and branch sequences found in ph is given in Table 4.

Table 4. A list of splice junction sequences and branch sequences present in ph

(Y)11XYAG        Coord.    MAGGTRAGT    Coord.    CTRAY    Coord.
ccccctccatcaaAG  1367      cAGGTagaT    309       CTgAt    50
tattcttctccgcAG  2044      cAGGTagcT    913       CTgAt    919
cattttctctaccAG  2470      cgaGTgAGT    1439      CTaAt    1566
ttcctgttttttaAG  3065      cttGTgAGT    2249      CTaAt    1619
tcccatttatttcAG  3899      agGGTaAtT    2935      CTaAc    1807
ttatatttcttacAG  7394      cAGGTcAaT    3498      CTaAt    2307
cttcttcgattgcAG  7638      agGGTgAGT    3550      CTaAt    2870
tttttttctctcaAG  7946      atGGTaAGa    3706      CTgAc    4063
tttttctctcaagAG  7948      cAGGTtcGT    4104      CTaAt    4422
ctcttttttgctgAG  9543      aAGGTaAag    5113      CTaAt    4591
tttttcaatcctcAG  9586      cAGGTgctT    6234      CTgAc    5091
actcttcccttgaAG  9686      actGTgAGT    6305      CTaAt    5104
ccctggttcttacAG  11207     cAtGTgcGT    6521      CTgAt    5136
actatcttcctacAG  11392     aAGGTattT    6754      CTaAc    5683
tctgccttgtcacAG  11902     aAtGTatGT    7894      CTaAc    6066
tttgcatctttccAG  11879     cAGGTggGT    9793      CTaAt    6323
ctcctcccccacaAG  12694     cAGGTgAag    10213     CTaAt    7415
cattccttcgcgcAG  13749     cAGGTgAtc    10591     CTgAt    8328
atcattcttttgcAG  15402     aAGGTggGa    10921     CTaAt    8411
atccctgtcttttAG  15588     aAaGTgAcT    10969     CTaAc    8447
ttctgttctacacAG  17237     cAaGTgcGT    11621     CTgAc    9329
attttgttttcacAG  17581     cAGGTggtT    11822     CTgAc    9419
ccgttccgctcttAG  18603     cAaGTgAGT    11845     CTgAt    9557
ttccgctcctctgAG  18710     cAaGTgAGT    13698     CTaAt    10063
ttcctctctgcgtAG  18884     cAGGTgAtT    14584     CTaAt    10500
tctttctctttccAG  19640     cAaGTaAaT    14799     CTgAc    11779
ttctctttccagcAG  19643     aAtGTgAaT    14981     CTgAt    12360
tcccactcgccacAG  20646     aAtGTaAGa    15238     CTaAt    12810
actatcttcctacAG  21263     cAtGTatGT    15787     CTaAc    14542
tttgcatctttccAG  21773     cAGGTatGa    15933     CTaAc    14729
tctgccttgtcacAG  21750     acGGTaAaT    17015     CTaAc    14930
cattccttcgcgcAG  23650     tAtGTaAGT    17063     CTgAc    15257
tcgccctcccatcAG  24179     aAaGTaAGT    17445     CTgAt    16022
ccctcccatcagcAG  24182     aAaGTaAGa    17459     CTgAc    16111
                           cAGGTgccT    18231     CTaAc    16387
                           aAtGTaAcT    18654     CTaAc    16404
                           actGTaAGT    19207     CTaAt    16838
                           ctcGTgAGT    19218     CTaAt    17098
                           cAaGTgtGT    19570     CTgAc    17749
                           aAaGTgAaT    19659     CTgAc    18152
                           aAGGTgccT    19762     CTgAt    18303
                           cAcGTggGT    19854     CTaAc    18669
                           cAGGTgAag    20069     CTgAc    18971
                           cAGGTgAtc    20450     CTgAt    19607
                           ccGGTgAaT    20621     CTaAt    20359
                           aAGGTggGa    20792     CTgAc    21650
                           aAGGTgAcc    20840     CTgAt    22231
                           cAaGTgcGT    21492     CTaAt    22690
                           cAGGTggtT    21693     CTgAc    24055
                           cAaGTgAGT    21716     CTaAt    24559
                           cAaGTgAGT    23589     CTaAt    24580
                           aAaGTgAaT    24514     CTgAc    24628

All ten putative ORFs have potential junction sequences at their termini. However, the degree of similarity of these sequences to the consensus varies. Potential branch sequences (CTRAY) showing 100% similarity with the consensus exist in all putative introns except the sequence lying between ORFs 8 and 9. It is possible that a branch sequence with less than 100% similarity to the consensus exists in this sequence. All sequences in Table 4 contain the most conserved nucleotides of each consensus (boldfaced).

The splice pattern of ORFs 4, 5 and 6 was confirmed by comparing cDNA sequence with the genomic sequence (D. Pierre and H. Brock, pers. comm.). An intron of 60 bp lies between ORFs 4 and 5 and between ORFs 5 and 6. The splice junctions and their similarity with the respective consensus sequences are listed in Table 5. The two introns are highly conserved in the distal repeat. The four splice junctions between ORFs 10, 11 and 12 show 100% sequence identity with the corresponding junctions for ORF 4/5/6. It is probable that the distal repeat has the same splicing pattern as the proximal repeat in this region, and ORF 4/5/6 was compared to putative ORF 10/11/12 under this assumption. The very high sequence conservation of these introns suggests that alternative splicing may occur and that in some ph transcripts these introns represent coding sequence. Thus, there may be other splicing patterns of ORFs 4, 5, 6 and ORFs 10, 11, 12 in addition to the pattern presented here.
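The percent similarity of a candidate junction to a degenerate consensus can be illustrated with a short Python sketch. It assumes the simplest possible scoring, one point per position at which the candidate base is allowed by the consensus symbol, which reproduces the Table 5 percentages for the ORF 4/5/6 junctions; how the PATTERNS program itself scores matches is not stated here and may differ. The function name `percent_similarity` is hypothetical, and `X` is treated as "any nucleotide", following the notation of the table headers.

```python
# Nucleotides allowed by each consensus symbol (standard ambiguity codes).
# "X" is treated as "any nucleotide", as in the headers of Tables 4 to 6.
ALLOWED = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "M": "AC", "W": "AT",
    "N": "ACGT", "X": "ACGT",
}

DONOR = "MAGGTRAGT"            # 5' splice junction consensus
ACCEPTOR = "Y" * 11 + "NYAG"   # 3' splice junction consensus, (Y)11NYAG
BRANCH = "CTRAY"               # branch sequence consensus


def percent_similarity(candidate, consensus):
    """Percentage of positions at which the candidate fits the consensus."""
    if len(candidate) != len(consensus):
        raise ValueError("candidate and consensus must have the same length")
    matches = sum(
        base in ALLOWED[symbol]
        for base, symbol in zip(candidate.upper(), consensus.upper())
    )
    return 100.0 * matches / len(consensus)


# The four ORF 4/5/6 junctions listed in Table 5.
for sequence, consensus in [
    ("AccGTGtaa", DONOR),
    ("aCTaTCTTCCTACAG", ACCEPTOR),
    ("CAaGTGAGT", DONOR),
    ("TTTgCaTCTTTCCAG", ACCEPTOR),
]:
    print(sequence, round(percent_similarity(sequence, consensus)))
```

Under this unweighted count, the four junctions score 44, 87, 89 and 87 percent, the values given in Table 5.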
The splicing patterns of the remaining ORFs will have to be determined by comparing the sequences of cDNAs that contain these ORFs with the genomic sequence (work in progress).

Promoter sequences

Northern analysis suggests that the ph proximal promoter should be 3' of coordinate 3500, because a fragment 5' of this point hybridizes to mRNA from the transcription unit 5' to ph. If there is a second ph promoter, it should be 3' of coordinate 7800, because an inversion (ph*lu) truncates the 6.1 kb (embryonic) and 6.6 kb (pupal) ph transcripts but does not affect the 6.4 kb (embryonic) and 6.1 kb (pupal) transcripts. The DNA sequence was screened for putative promoter sequences corresponding to the TATA (Breathnach and Chambon, 1981) and CCAAT (McKnight and Kingsbury, 1982) boxes. Potential sequences are listed in Table 6.

Two of the more interesting sequences are located in the 5' regions of each repeat. The sequence atGTATAaAaaGttt at coordinate 4897 in the proximal repeat is 87% similar to the consensus. Seventy-four bp upstream, at coordinate 4818, is the sequence acTCAATac. This putative promoter lies upstream of the 5'-most ORF conserved in both repeats (ORF 2) and downstream of the transcription unit 5' to ph, and is a good candidate for the proximal promoter. A similar putative promoter is found at the 5' end of the distal repeat. The sequence GtGTATAaAatGcat at coordinate 15211 is 93% similar to the consensus. The sequence GaCCAATta lies 75 bp upstream at coordinate 15137. This promoter lies upstream of ORF 7 and could therefore drive transcription of the distal repeat.

The ph gene could also contain promoters lacking these consensus sequences, in which case they would not be revealed by this search. Therefore, none of the promoter data listed above are conclusive. Once primer extensions determine the extreme 5' end of the ph message, deletion series can be done on the DNA lying upstream to determine the locations of promoter sequences.
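A consensus search of the kind described above can be sketched as a sliding-window scan in Python. This is a hedged illustration, not the search that produced Table 6: the function name `scan`, the 0.9 threshold and the poly-C flanking sequence are invented for the example, and each position is scored with a simple unweighted match count. `X` in the table header is read as "any nucleotide" and written here as `N`.

```python
# Nucleotides allowed by each consensus symbol (standard ambiguity codes).
ALLOWED = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "M": "AC", "W": "AT", "N": "ACGT",
}

TATA_BOX = "GNGTATAWAWNGNNG"   # TATA box consensus of Table 6
CCAAT_BOX = "GGYCAAWCT"        # CCAAT box consensus of Table 6


def scan(seq, consensus, min_fraction):
    """Return (coordinate, site, matches) for each window meeting the threshold.

    The coordinate is the 1-based position of the first base of the window.
    """
    seq = seq.upper()
    width = len(consensus)
    allowed = [ALLOWED[symbol] for symbol in consensus]
    hits = []
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        matches = sum(base in options for base, options in zip(window, allowed))
        if matches / width >= min_fraction:
            hits.append((start + 1, window, matches))
    return hits


# The distal TATA-like sequence reported in the text, placed in an invented
# poly-C context so that the scan has something to find.
toy = "C" * 10 + "GTGTATAAAATGCAT" + "C" * 10
print(scan(toy, TATA_BOX, 0.9))   # [(11, 'GTGTATAAAATGCAT', 14)]
```

The single hit matches 14 of 15 positions (93%), the similarity reported for the sequence at coordinate 15211; the proximal sequence at 4897 matches 13 of 15 (87%) under the same count.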
A translation initiation sequence (GGAATGG) at 5454 of the proximal repeat matches the consensus perfectly (Kozak, 1984). This sequence lies at the extreme 5' end of ORF 2 and is in frame with the ORF 2 sequence. ORF 8 does not contain a translation start site at its 5' end. However, ORF 7 does, at position 15708 (cCTATGa). Both of these translation start sites lie just downstream of potential promoters.

As discussed above, the probable locations of the promoters for the 6.1 kb and 6.4 kb mRNAs are at coordinates 4.8 kb and 15.2 kb respectively, although other sites cannot be ruled out. In particular, it seems likely that there should be a promoter responsible for the synthesis of the small transcripts detected with probes from coordinates 0-1372 and from 5781-24751. In addition, there should be a promoter in the region from 1372-3464 responsible for the transcription of the 2 kb mRNA from the gene just upstream of ph. However, no obvious candidates were found. Locating these promoters will take additional work.

Table 5. Comparison of splice junctions of ORF 4/5/6 to the consensus

ORF    Junction sequence    Consensus     % similarity
4      AccGTGtaa            MAGGTRAGT     44
5      aCTaTCTTCCTACAG      (Y)11XYAG     87
5      CAaGTGAGT            MAGGTRAGT     89
6      TTTgCaTCTTTCCAG      (Y)11XYAG     87

Table 6. Putative promoter signals in the ph sequence

GXGTATAWAWXGXXG    Coord.
atGTATAaAaaGttt    4897
GaGTATAaActtctg    13695
GtGTATAaAatGcat    15211
GaGTATAaActtctG    23596
GaaTATAaAtgGatc    24153

GGYCAAWCT    Coord.
cGcCAAaCa    1341
tGtCAAtaT    2345
GacCAAaaT    2727
GGtCAAtta    3502
acTCAAtac    4818
tctCAAaCT    4957
tGtCAAaCg    5620
tGcCAAaaT    8096
GGtCAAtgc    8260
GGcCAAatg    8757
cGcCAAaCT    9565
aGcCAAtCc    10585
GGcCAAaCc    11239
GGcCAAaaT    11682
GGaCAAaaT    11758
cGcCAAtCa    14260
GacCAAtta    15137
GttCAAatT    15334
cGaCAAatc    16047
tGcCAAagT    17304
aGcCAAaaT    18628
aGcCAAtCc    20444
GGcCAAaaT    21553
GGaCAAaaT    21629

The 5' end of the distal repeat is less confusing. Northern data show a transcript hybridizing to the 1.5 Sal fragment.
But the distal promoter lies upstream of this fragment. Thus, this promoter can also control transcription of the ORF(s) present in the 1.5 Sal fragment.

Termination sequences

Putative polyadenylation signals (AATAAA; Proudfoot and Brownlee, 1976) exist at the 3' end of the proximal repeat (Table 7). A polyA tail has been found in a proximal cDNA and maps just upstream of the putative distal promoter, at coordinate 15107 (D. Pierre and H. Brock, unpublished results). Just upstream of it lies a polyadenylation signal at 15092 (Table 7). This lends further support to the assignment of the distal promoter sequence. No polyadenylation signals showing 100% similarity to the consensus were found at the 3' end of the distal repeat. However, there are several sequences that vary from the consensus (e.g., AACAAA at 24512).

A transcription termination sequence (ATirTTCT) that has 100% identity with the consensus (Scott et al., 1988) is present between the two repeats at 15824. If both repeats were transcriptionally active at the same time, this sequence would not prevent a transcription complex from the proximal repeat from entering the putative distal promoter region. This is unlike the wolffish antifreeze protein genes, which are also organized as tandem repeats. These genes possess a transcription termination sequence that lies between the two repeats to prevent the transcription of one repeat from influencing the regulation of the other repeat (Scott et al., 1988).

Table 7. Putative polyadenylation signals found in the ph sequence

Sequence    Coordinate
AATAAA      3288
AATAAA      3792
AATAAA      3854
AATAAA      4793
AATAAA      4809
AATAAA      5403
AATAAA      6986
AATAAA      7294
AATAAA      7451
AATAAA      7508
AATAAA      14806
AATAAA      15092
AATAAA      16573
AATAAA      17045
AATAAA      17438
AATAAA      18837

THE PH PROTEIN

ph contains a putative zinc finger

The function of ph is unknown. However, analysis of the amino acid sequence of ph may allow us to deduce a potential function of ph.
Due to the size of the gene, only the 6 putative ORFs (2, 3, 4/5/6, 8, 9 and 10/11/12; Figures 6 to 11) were studied in detail. These 6 ORFs were studied separately since the splicing pattern of these ORFs is unknown. The 6 ORFs show different degrees of amino acid sequence similarity between the two repeats (Table 8 and Figures 12 to 14). The degree of sequence similarity ranges from 43.5% to 86.8%. Regions with higher sequence similarity are more likely to be under greater selection pressure and, therefore, are more likely to be important for ph function. The amino terminus of each repeat shows less sequence conservation than the rest of the gene. Thus, the carboxy-terminal portion of the protein is likely to play a more important role in its function than the amino terminus. This is supported by the presence of a canonical zinc finger in ORFs 4/5/6 and 10/11/12. The zinc finger has the sequence CEQCGKMEHKAKLKRKRYCSPGC and is located between 13824-13892 in the proximal repeat and between 23725-23793 in the distal repeat. The sequence shows 100% conservation between the two repeats.

Figure 6. Amino acid sequence of ORF 2.

1 RGMVYEKGVFESRVTRPCYVFS f °
61 RFIPTPSERTASLSHALTGQGHS

Figure 7. Amino acid sequence of ORF 3.

1 SCRCCARFSLKLNLAWGS
61 i , LNFFQKELTVLLIS

Figure 8. Amino acid sequence of ORF 4/5/6.
1 gPNFSFSILRXDTESDTTTPVSTTXSOGISASAlLxGGTLPLKDNSNIREKPLHHNYNHN 60
61 HNNSSQHSHSHQQQQQQQVGGKQLERPLKCLETLAQKAGITFDEKyDVASPPHPGIAQQQ 120
121 ATSGTGPXTGSGSVTPTSHRHGTPPTGRRQTHTPSTPNRPSXPSTPNTNCNSIXRHTSLT 180
181 LEKAQNPGQQVXATTTVPLQ1SPEQLQQFXASNPXAIQVKQEFPTHTTSGSGTELKHATN 240
241 IKEVQQQLQLQQLSEANGGGXXSACAGGXXSPXNSQQSQQQQKSTXISTMSPMQLXXXTG 300
301 CVGG6vn,QGRTVOLMQPSTSFLYP6MlVSGNLLHPOGLOOQPlQVITXGKPFQGNGPQML 360
361 TTTTQNAXQMIGGQAGFAGGNYATCIPTNHNQSPQTVLFSPHNV1SPQQQQNLLQSMAAA 420
421 AOOOOLTOOOOQFNQOOOQQLTQQOQOLTAALAKVGvDAQGKLAQKWQKVTTTSSAVQA 480
481 ATGPGSTGSTQTQQVQQVQQQQQQTTQTTQQCVQVSTSTLPVGVGGQSVQTAQLLNXGQX 540
541 OQMQIPWFLQNAAGLQPFGPNQIILRNQP0GTQGMF1QQQPATQTLQTQQNQIIQCNVTQ 600
601 TPTKARTOLDALXPKOQOOOOOVGfTNOTOOOOLAVATAQLQQQOQOLTAAALQRPGAPV 660
661 KPHNGTQVRPASSVSTQTXQNQSLLKXKMRNKQQPVRPXLXTLKTE1GQVXGQNKWGHL 720
721 TTV00000XTNLOQVVNXAGNKMWMSTTGTPITI.ONGQTLHAATAAGVDK0OQQL0LFQ 780
781 KQQILQQQQMLQQQIAAIQMQQQQAAVQAOQQQOQQVSQQQQVNAQQQQAVAQQQQAVAQ 840
841 AQOQQREQQQQVAOAQADHQQALANATQQILQVAPNQFITSHQQQQQQQLHNQLIQQQLQ 900
901 OQAOAQVQADVOAOAOBQOOOREQOQNIIOQIWQQSGATSOQTSQOOQHHQSGQLQLSS 960
961 VPFSVSSSTTPAGIATSSXLQXXLSXSGXIFQTXKPGTCSSSSPTSSWT1TN0SSTPLV 1020
1021 TSSTVXS100XOTQSXQVHQHQOL1SXT1XGGTOQOPOGPPSLTPTTNPILXMTSMMNXT 1080
1081 VGHLSTXPPVTVSVTSTXVTSSPGQLVLLSTXSSGGGGSIPXTPTKETPSKGPTXTLVPI 1140
1141 CSPKTPVSGKDTCTTPKSSTPXTVSXSVEXSSSTGEXLSNGDXSDRSSTLSKGXTTPTSK 1200
1201 OSKXXVQPPSSTTPNSVSGKEEPKLXTCGSLTSXTSTSTTTTITNGIGVXRTTXITXVST 1260
1261 XSTTTTSSGTFITSCTSTTTTTTSSISNGSKDLPKXMIKPNVLTHVIDGFIIQEXNEPFP 1320
1321 VTRCJRyADKDVSDEPPEEYKLLVPMLFRNLNVSFLRXEKKATMQEDIKLSGlASAPGSDM 1380
1381 VACEQCGKMEHKAKLKRKRYCSPGCSRQAKNGICGVGSGETNGLGTGGIVGVAXMALVDR 1440
1441 LDEAMAEEKMQTEATPKLSESFPILGASTEVPPMSLPVOAAISXPSPLXMPLGSPLSVAL 1500
1501 PTLX^LSVVTSGXXPKSSEVNGTDRPPISSWSVD£>VSKF1RELP&C0DYVDDFI60E1DG 1560
1561 QXLLLLKEKHLVNAMGMKLGPALKIVAKVESIKEVPPPGEAKDPGAQ 1607

Figure 9. Amino acid sequence of ORF 8.

Figure 10. Amino acid sequence of ORF 9.
   1 SCRCSXRFSLKLNLXACGSCJaLsLRSCWCGVTJVSHTACVLNGOKKKRRRDAHTPYHRFV 60
  61 VFIYIFIFGDOCKSVXTISD 80

Figure 11. Amino acid sequence of putative ORF 10/11/12.

   1 lBNHNKNININLNMNMKHXPRIRSRXKCVRVCLCLKQPTOSLCAPPLSLSSSCDT£SESAT 60
  61 TIRTPPPSPEATTSVKVNSTTRVDPQRPLRCLETLAQKAGISFDEDFAKSPSQSPSSKAA 120
 121 RGSVGTPSIRRRHPLLPLSSRSPSAPDSKTTGRKLEKSQSPAQPMAAATNVPLQ1SPEQL 180
 181 QQLYANNPYAIQVKQEFPTHTTSGSGTELKHATNIMEVQQQLHVQQQLSEANGGGAASAG 240
 241 AGGAA5PAN5QQSQQQQK5TAISTMSPHQLAGPTGGVGGDWTQGRTVQLHQPSTSFLYPQ 300
 301 MIVSGNLLHPGGLCQQP1QV1TAGKPFQGNGPQM£,TTTTQNAKQMICOQAGFAGGNYATC 360
 361 IPSNHNQSPQTVLISPVNVISHSPQQQQNLLQSMAAAAQQQQLTQQQQQQLNQQQQQLMQ 420
 421 O0QQQQLTAALAKVGVOAQGKLAQKWQKVTTTSSTVQAATGPGSTGSTQTQQVQQVQQQ 480
 481 QQQTTQTTQQCVQVSQSTLPVGVGGQSVQTAQLLNAGQAQQMQIPWFWQNAAGLQPFGSN 540
 541 QIILRNQPDGTQGMFIQQQPATQTLQTQQNQIIQCNVTQTPTKARTQLDALAPKQQQQQQ 600
 601 OVGTFNQTQQQQLAVATAQLQQOQOOLTAAALQRPGAPVMPHNGFQVRPASSVSTQTAQN 660
 661 QSLLKAKMRNKQQPVRPALATLKTEIGQVAGQNKWGHLTTVQQQQQATNLQQWNAAGN 720
 721 KMWHSTTGTPITLQNGOTLHAATAAGVDKOOQQLQLFQKOQILQQOOMLQQQ1AAIOMQ 780
 781 OQQAAVQAQQQQQQQVSQQOQVNAQQQQAVAQQQQAVAQAQQQQREOOQQVAQAQAQHQQ 840
 841 ALAMATQQILQVAPNQFITSKQQQOQQQLHNQLIQQQLQQQAQAQVQAQVQAQAQQQQQQ 900
 901 REOQQNIIQQIWQ6STGATSQQQQQOPQQQSGQLQLSSVPFSVSPSKTAEDIA6ITSSA 960
 961 LQEALSVSGAIFOTTKPITCSSSTLPTSSWT1TSQSSTPLVTSSTVASMQQADTQGTQ1 1020
1021 HQHQQLISATIAGGSQQQ0QQQQLGLPSLTPTTPSPTTNP1LAMTSMMNATVGHLSTAPP 1080
1081 VSVSSTAVTPSSGQLVTLSSASSGGGAGFPATPTKETPSKGPTATLVPIDSPKTPVSGKD 1140
1141 TCTTPKSSTPATVSASVEASSSTGEALSNGDA5DRSSTPSKGATTPTSKQSNAAVQPPSS 1200
1201 TLPNSVSGKEEPKLHNCGSLTSATSTSTTTTITNGIGVARTTASFAVSTASTTTTSSGTF 1260
1261 TTSCTSTTTTTTSSISNGSKDLPKAMIKPNVLTHVIDGFIIQEANEPFPVTRORYADKDV 1320
1321 SDEPPSEYKLLVPMLFRNLKVSFLRAEKKATMQEDIKLSGIASAPGSDMVACEQCGKMEH 1380
1381 KAKLKRKRYCSPGCSRQAKNGIGGVGSGETNGLGTGGIVGVDAKALVDRLDEAMAEEKMQ 1440
1441 TESYQTVSDALPIQAATPEVPPISMPVLAAKSTSSPLSLPLTLPI.PIAIAPTV5LPWSA 1500
1501 GWAPVLAIPSSNINGSDRPPISSWSVEEVSNFI&ELPGCQDYVDDFIQQEIDGQALLLL 1560
1561 KENHLVMAMGMKLGPALKIVAKVESIKEVPPGDVKD 1596

Table 8. Amino acid sequence similarity between the proximal ORFs and the distal ORFs

ORF (prox.)   ORF (dist.)   Similarity index (%)   Overlap (aa)
2             8             43.5                   62
3             9             62.1                   66
4/5/6         10/11/12      86.8                   1541

Figure 12. Optimal amino acid sequence alignment of ORF 2 and ORF 8.
This alignment was made using the method of Wilbur and Lipman (1983), with a gap penalty of 4 and a deletion penalty of 12. The top line of the alignment represents ORF 2 and the bottom line represents ORF 8. The middle line shows the amino acids conserved in both ORFs. A colon indicates amino acids that are positively related by the protein similarity matrix (Lipman and Pearson, 1985). A dot indicates those with a zero value relationship. Dashes represent gaps introduced into the sequence to maintain the alignment. ORF 2 and ORF 8 have an overall identity of 43.5% in a 62 aa overlap.

[Alignment figure: ORF 2 (top line) against ORF 8 (bottom line), with position rulers every 10 residues.]

Figure 13. Optimal amino acid sequence alignment of ORF 3 and ORF 9. See the legend to Figure 12 for the method of alignment and alignment parameters. ORF 3 (top line) and ORF 9 (bottom line) have an overall identity of 63.6% in a 66 aa overlap.

[Alignment figure: ORF 3 (top line) against ORF 9 (bottom line), with position rulers every 10 residues.]

Figure 14. Optimal amino acid sequence alignment of ORF 4/5/6 and putative ORF 10/11/12. See the legend to Figure 12 for the method of alignment and alignment parameters. ORF 4/5/6 (top line) and putative ORF 10/11/12 (bottom line) have an overall identity of 86.8% in a 1541 aa overlap.

[Alignment figure: ORF 4/5/6 (top line) against putative ORF 10/11/12 (bottom line), with position rulers every 10 residues.]

The zinc finger motif was originally discovered in TFIIIA of Xenopus (Miller et al., 1985). Since then, zinc fingers have been found in a host of other gene regulatory proteins in yeast (Struhl, 1987), mammals (Mitchell and Tjian, 1989) and Drosophila (Tautz et al., 1987; Rosenberg et al., 1986). There are two types of zinc fingers. One consists of two cysteine-histidine pairs separated by 12 to 14 aa (the cys2his2 finger; Mitchell and Tjian, 1989). The other consists of two cysteine-cysteine pairs separated by two or four aa (the cys2cys2 finger). The ph zinc finger is typical of the latter class (the four conserved cysteine residues are at positions 1, 4, 19 and 23 of the ph finger sequence given above). The finger sequence is thought to form a tetrahedral complex with a zinc ion, and the residues between the cys-his or cys-cys pairs are thought to loop out to form a finger that interacts directly with DNA. The loop of the ph finger is rich in positively charged amino acids (lysine, arginine, histidine).
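The spacing rule described above can be checked directly against the ph finger sequence. The sketch below is illustrative only: the regular expression, its loop-length bounds (8 to 20 residues) and the helper names `find_cys2cys2` and `basic_fraction` are assumptions made for this example, not a published consensus. Only the 23-residue finger sequence comes from the text. Each cysteine pair is allowed two to four intervening residues, because the second pair in ph (CSPGC) is separated by three.

```python
import re

# ph zinc finger as given in the text (identical in both repeats).
PH_FINGER = "CEQCGKMEHKAKLKRKRYCSPGC"

# Assumed pattern: Cys pair, a loop of 8-20 residues, second Cys pair.
# Each pair allows 2-4 intervening residues; these bounds are illustrative.
CYS2CYS2 = re.compile(r"C(.{2,4})C(.{8,20})C(.{2,4})C")

BASIC = set("KRH")  # lysine, arginine, histidine


def find_cys2cys2(seq):
    """Return (start, end, loop) for each non-overlapping candidate finger."""
    return [(m.start(), m.end(), m.group(2)) for m in CYS2CYS2.finditer(seq)]


def basic_fraction(peptide):
    """Fraction of residues in peptide that are K, R or H."""
    return sum(res in BASIC for res in peptide) / len(peptide)


hits = find_cys2cys2(PH_FINGER)
start, end, loop = hits[0]
print(start, end, loop)                # 0 23 GKMEHKAKLKRKRY
print(round(basic_fraction(loop), 2))  # 0.57
```

Under these assumptions, 8 of the 14 loop residues are basic, which is the property the text points to when it describes the loop as rich in positively charged amino acids.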
This positive charge would facilitate interaction of the ph protein with negatively charged DNA.

Other proteins with cys2cys2 zinc fingers include the glucocorticoid receptor (Evans, 1988), the estrogen receptor (Krust et al., 1986) and GAL4 of yeast (Keegan et al., 1986; Ma and Ptashne, 1987). Proteins of the cys2cys2 finger family have one or two fingers, like ph. Proteins of the cys2his2 finger family usually have multiple fingers (eg, TFIIIA, Sp1). Several Drosophila proteins have cys2his2 fingers. Kr has five fingers (Rosenberg et al., 1986) and hb contains a total of six fingers (Tautz et al., 1987). The serendipity locus shows a structure similar to ph in that it is made up of a tandem repeat, each repeat containing several conserved zinc fingers (Vincent et al., 1985).

The presence of a DNA-binding motif in ph supports the hypothesis that ph is required for determination in the Drosophila embryo. ph is necessary for proper expression of the homeotic and segmentation genes. The ph protein could therefore regulate these genes by direct interaction with promoter sequences via the zinc finger.

Other domains and motifs

ORFs 4/5/6 and 10/11/12 are glutamine-rich (see Table 9). Transcription factor Sp1 contains four separate transcriptional-activating domains. Two of these are domains rich in glutamine (25%; Courey and Tjian, 1988). Other known or suspected transcription factors contain glutamine stretches (eg, zeste, Antp, cut; Biggin et al., 1988; Pirotta et al., 1987; Schneuwly et al., 1986; Blochlinger et al., 1988). A glutamine-rich region of Antp has been shown to functionally substitute for an Sp1-activating domain (Mitchell and Tjian, 1989). These glutamine stretches could contact other proteins (eg, RNA polymerase or other transcription factors) by hydrogen bonding and thereby influence the rate of transcription. Alanine stretches may also play a role in transcriptional activation (Courey and Tjian, 1988).
Sp1, zeste and ph all contain a stretch of alternating glutamine and alanine. In ph, the sequence is QAQAQVQAQVQAQAQ, located at 12381 and 22252. Together, these data support the suggestion that ph encodes a transcription factor.

The amino acid sequences of the 6 ORFs were screened for other protein sequence motifs. The sequence does not contain a leucine zipper (Kouzarides and Ziff, 1988) nor does it contain any ATP-binding sites (Walker et al., 1982). The sequence does contain many putative signal peptide cut sites (Perlman and Halvorson, 1983). However, these sites are only three amino acids in length and could easily occur in a random sequence. The sequence does not contain a prd box (Frigerio et al., 1986) or a homeobox (Laughon and Scott, 1984; Frigerio et al., 1986). Because ph contains a zinc finger, the presence of a homeobox as a DNA-binding domain would be redundant, although the protein could be bi-functional.

If ph is indeed a transcription factor, one would expect to find a nuclear transport signal in the sequence. So far, no consensus sequence is known, although short stretches of basic residues (lysine, arginine) have been identified as nuclear transport signals in some proteins (Dingwall and Laskey, 1986). ph contains two short arginine stretches: one in ORF 4/5/6 (RRRR at 11300) that is conserved in ORF 10/11/12, and another (RRRR at 15730) that is present only in ORF 7 of the distal repeat.

Table 9. The three most abundant amino acids of each ORF.

ORF        Amino acid      Content (%)
2          Valine          12
           Leucine         11
           Phenylalanine   10
3          Leucine         17
           Arginine        9
           Cysteine        9
4/5/6      Glutamine       18
           Threonine       11
           Alanine         10
           Serine          10
8          Alanine         13
           Valine          13
           Phenylalanine   10
           Proline         8
           Threonine       8
9          Serine          11
           Cysteine        10
           Leucine         10
           Arginine        9
10/11/12   Glutamine       18
           Alanine         10
           Serine          10
           Threonine       10
           Leucine         7
           Proline         7
           Valine          7

Protein structure and composition

The amino acid sequences of the 6 ORFs were analyzed using the PROTEIN program.
Table 10 lists summary information about each ORF. The ph protein is, for the most part, in the extended (β-sheet) conformation (mean of 65%). Interspersed throughout the protein are varying amounts of helix, turn and coil. The distribution of the four different protein conformations along a typical ORF is random. However, there are some regions of interest. ORF 3 contains stretches of helix not found at all in the corresponding region from the distal repeat, ORF 9. The longest of these stretches lie between 42-50 and between 62-72 (coordinates refer to the amino acid sequences given in Figures 6 to 11). ORF 4/5/6 has a 12 aa stretch of alternating turn and coil (TPSTPNRPSAPS) between 153-164. ORF 4/5/6 also contains a long stretch of turn between 429-439. The same ORF contains two long regions of helix between 1439-1454 and between 1561-1574. ORF 10/11/12 contains several regions of coil (48-56, 200-207 and 528-537). ORF 10/11/12 contains several long stretches of helix (1337-1358, 1423-1444 and 1549-1585) that are not conserved in ORF 4/5/6. The isoelectric points of each ORF are high (mean of 9.04), indicating that ph is a basic protein. This is consistent with ph being a regulatory protein, since its overall positive charge would facilitate interaction with negatively charged DNA.

Table 10. Structure, conformation and charge of the 6 putative ORFs

                     Robson conform. (%)                          Average hydrophobicity
ORF        Length    H    E    T    C    MW (g/mol)   Iso. pt.   Hopp   Kyte
2          83        0    66   20   14   9214         8.55       42     35
3          75        33   59   8    1    8405         8.83       29     47
4/5/6      1607      4    64   14   19   169187       8.97       11     -52
8          91        0    64   11   25   9918         9.91       44     26
9          80        0    73   23   5    8903         9.09       17     21
10/11/12   1596      3    66   13   18   169123       8.90       10     -49

Hydrophobicity plots of each ORF show that ph is a hydrophilic protein. There are no putative trans-membrane domains. Table 9 lists the three most abundant amino acids of each ORF. The data in Table 9 show that some ORFs have higher contents of certain amino acids than other ORFs.
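The two means quoted above for the extended conformation and the isoelectric point can be reproduced from the columns of Table 10. The short check below is a sketch: the values are transcribed from Table 10, while the dictionary layout and variable names are introduced here for illustration only.

```python
from statistics import mean

# Values transcribed from Table 10: percent extended (E) conformation
# and isoelectric point for each of the 6 putative ORFs.
TABLE_10 = {
    "2":        {"extended": 66, "pI": 8.55},
    "3":        {"extended": 59, "pI": 8.83},
    "4/5/6":    {"extended": 64, "pI": 8.97},
    "8":        {"extended": 64, "pI": 9.91},
    "9":        {"extended": 73, "pI": 9.09},
    "10/11/12": {"extended": 66, "pI": 8.90},
}

# Unweighted means over the six ORFs (each ORF counts once, whatever its length).
mean_extended = mean(row["extended"] for row in TABLE_10.values())
mean_pi = mean(row["pI"] for row in TABLE_10.values())

print(round(mean_extended))  # 65
print(round(mean_pi, 2))     # 9.04
```

Both quoted figures correspond to unweighted means, so the two long ORFs (4/5/6 and 10/11/12) carry no more weight than the four short ones.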
Some ORFs have long stretches of the same amino acid. Stretches of glutamine are thought to play a role in transcriptional activation (discussed above). ph also contains stretches of serine or threonine. Interestingly, Sp1, a transcription factor with a zinc finger and a glutamine-rich region, also has serine and threonine stretches (Courey and Tjian, 1988). The function of these stretches is unknown.

Similarities of the ph sequence to other proteins

The 6 putative ORFs were checked for amino acid similarity with sequences in the NBRF-PIR (rel 22) database using the algorithm of Lipman and Pearson (1985). In addition to the zinc finger result discussed earlier, ph showed other interesting similarities. The top five alignments for each ORF in the distal repeat are listed in Table 11. ORF 10/11/12 is very rich in glutamine. Long stretches of glutamine have been termed opa repeats. ORF 10/11/12 shows high amino acid similarity with the five proteins listed in Table 11A. However, the regions of sequence similarity were almost exclusively the opa repeats. The high similarity indices of Table 11A therefore reflect the fact that ORF 10/11/12 shares long stretches of glutamine with the other proteins listed.

Table 11. Comparisons of the ph protein sequence to other proteins.

Protein                                        Similarity index (%)   Overlap (aa)

A) ORF 10/11/12
mopa box protein (mouse)                       47.0                   134
alpha-beta-gliadin precursor                   33.8                   263
glutenin low molecular weight chain            30.2                   351
regulatory protein zeste (Drosophila)          37.5                   168
notch protein (Drosophila)                     19.0                   594

B) ORF 9
integrin beta-1 chain precursor                42.3                   26
fibronectin receptor beta chain                40.9                   22
thyroglobulin precursor (bovine)               21.7                   60
DNA-binding protein (herpes simplex virus)     30.0                   80
rubredoxin (Pseudomonas)                       33.3                   18

C) ORF 8
prolactin-inducible protein precursor          28.3                   53
cytochrome p450iic2 (rabbit)                   32.9                   73
retrovirus-related pol polyprotein             28.6                   56
cecropin b precursor (cecropia moth)           58.3                   12
phosphoenolpyruvate carboxylase                27.7                   47

D) 1166-1468 of ORF 10/11/12
salivary glue protein sgs-3 (Drosophila)       22.4                   165
balbiani ring protein 1-gamma (Drosophila)     26.8                   41
regulatory protein zeste (Drosophila)          29.5                   78
gene 62 protein (varicella-zoste)              19.5                   569
elastin precursor (chicken)                    38.2                   34

E) 1287-1446 of ORF 10/11/12
elastin precursor (chicken)                    38.2                   34
DNA-directed RNA polymerase II                 21.2                   104
L-arabinose-binding protein (E. coli)          21.1                   161
S-adenosylmethionine synthetase                20.4                   98
pal cross-reacting lipoprotein precursor       23.3                   103

To see if regions outside of the opa repeats of ORF 10/11/12 share sequence similarity with other proteins, a 302 aa stretch (1166-1468 in Figure 11) including the zinc finger but excluding the opa repeats was extracted from ORF 10/11/12 and used to search the database (Table 11D). This region of ph is threonine and serine rich and therefore shows sequence similarity with other threonine and serine rich proteins. Like the results of Table 11A, the high similarity indices of Table 11D are due to the program aligning long stretches of threonine and serine between the region of ORF 10/11/12 analyzed and the other proteins.
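Because single-residue stretches dominate these database hits, it can help to locate such runs before interpreting a similarity index. The following is a minimal sketch and not the procedure used in this work: the function names `find_runs` and `mask_runs`, the minimum run length of 5 and the example fragment are all assumptions made for illustration.

```python
import re


def _run_pattern(min_len):
    """Regex matching min_len or more consecutive copies of one residue."""
    return re.compile(r"(.)\1{%d,}" % (min_len - 1))


def find_runs(seq, min_len=5):
    """Return (residue, start, length) for each single-residue run.

    start is a 0-based index into seq; runs shorter than min_len are ignored.
    """
    return [(m.group(1), m.start(), len(m.group(0)))
            for m in _run_pattern(min_len).finditer(seq)]


def mask_runs(seq, min_len=5, mask_char="x"):
    """Replace every qualifying run with mask_char, preserving sequence length."""
    return _run_pattern(min_len).sub(lambda m: mask_char * len(m.group(0)), seq)


# Hypothetical fragment containing a 7-residue glutamine run.
fragment = "HSHSH" + "Q" * 7 + "VGGKQLERPLKCLETLAQKAG"

print(find_runs(fragment))  # [('Q', 5, 7)]
print(mask_runs(fragment))  # HSHSHxxxxxxxVGGKQLERPLKCLETLAQKAG
```

Masking keeps the coordinates of the remaining residues unchanged, so any hit found in the masked sequence can still be mapped back to the original ORF.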
An even smaller region of ORF 10/11/12 that includes the zinc finger (1287-1446 in Figure 11) was screened for sequence similarities (Table 11E). This region of ORF 10/11/12 does not contain any long stretches of the same amino acid. Therefore, the similarity index values of Table 11E can be considered genuine measurements of sequence conservation between ph and other proteins. DNA-directed RNA polymerase II, L-arabinose-binding protein and S-adenosylmethionine synthetase are all nucleotide-binding proteins. If ph is indeed a DNA-binding protein, the similarity of these three proteins to ph is understandable. The other two proteins listed in Table 11E, however, show no apparent functional similarity to ph. Elastins are the major structural components of tissues that require rapid extension and complete recovery (Raju and Anwar, 1987). Lipoproteins are required for lipid transport.

ORF 9 and ORF 8 do not contain long stretches of the same amino acid; their amino acid composition is random. Therefore, the similarity index values of Table 11B and C can be considered genuine. ORF 9 is similar to several membrane-bound receptor proteins (eg, integrins, fibronectins). Integrins are involved in cell-cell or cell-matrix interactions (Hynes, 1987) and fibronectins are required for cell motility and attachment (Dufour et al., 1988). ph is required for normal axonal pathway development in the CNS (Smouse et al., 1988). This function of ph has similarities to those of the fibronectins and integrins. Unlike the integrins, ph does not appear to contain a trans-membrane domain. If ph is a DNA-binding protein, one would not expect it to be membrane bound, unless the protein is bi-functional. The similarity of ph to the DNA-binding protein of herpes simplex virus is understandable if ph is indeed a DNA-binding protein. The reason for ORF 9 having similarities to thyroglobulin precursor and rubredoxin is not understood.
Thyroglobulin is the precursor from which thyroid hormones are synthesized (Palumbo, 1987) and rubredoxins are iron-sulfur proteins necessary for electron transfer in bacteria (Frey et al., 1987).

ORF 8 shows similarity to phosphoenolpyruvate carboxylase and retrovirus-related pol polyprotein, two known nucleotide-binding proteins. These similarities make sense if ph is a transcription factor. It is not obvious why the other three proteins listed in Table 11C show similarity to ph. Prolactin induces the synthesis of prolactin-inducible protein (PIP). The function of PIP is unknown (Murphy et al., 1987). Cytochrome p450iic2 is a hydroxylase present in the endoplasmic reticulum (Green and MacLennan, 1967). Cecropins are antibacterial proteins present in the immune haemolymph of insects (Steiner et al., 1981).

In summary, sequence similarities between known nucleotide or DNA-binding proteins and ph make sense if ph is indeed a transcription factor. The reason that proteins with diverse functional roles have similarities with ph is unknown.

SUMMARY

Genetic analysis of ph showed that two independent mutation events were required to make a ph null (Dura et al., 1987). This was the first evidence that ph has a repetitive structure. Using cross-hybridization studies, Freeman (1988) showed that ph consists of a large tandem repeat, each repeat separated by a region of unique sequence. The results presented in my thesis confirm the repetitive structure of the ph gene. My data show that ph does indeed consist of a large tandem repeat, and that the sequence conservation within the repeat is very high.

The structure of the gene is not certain. However, the data support the hypothesis that ph consists of two transcription units. Each repeat contains a putative promoter lying upstream of a putative translation start site. In addition, each repeat contains polyadenylation signals at its 3' end. Other models of ph structure cannot be ruled out by my analysis. Transcripts could be alternatively spliced.
Both repeats could be transcribed off one promoter and then post-transcriptionally cleaved into separate messages.

The data presented here do not tell us the function of the ph protein. However, the data support the hypothesis that ph is a transcription factor. Transcription factors require at least two domains for regulating gene expression at a promoter: a DNA-binding domain and a transcriptional-activating domain (Ma and Ptashne, 1987; Mitchell and Tjian, 1989). Both ph repeats contain a putative cys2cys2 zinc finger that is perfectly conserved in each repeat. The zinc finger is a known DNA-binding domain. The ph amino acid sequence is rich in glutamine. Glutamine stretches have been shown to activate transcription when linked to zinc finger domains of certain transcription factors (Mitchell and Tjian, 1989). In addition, several short stretches of basic residues that could be nuclear transport signals occur in the ph sequence. The above data support, but do not prove, the hypothesis that the ph protein is a transcription factor.

The repetitive structure of ph provides a good example of a duplicated gene. This structure is not without precedent: other eukaryotic genes have a similar organization. In Drosophila, engrailed (en) and invected (inv) are neighboring genes that share extensive homology over 117 aa (Coleman et al., 1987). Their functional relation (if any) is not yet understood. The two proteins contain a homeobox that lies within the 117 aa conserved region. The mammalian counterparts of en and inv also share extensive amino acid homology (Joyner et al., 1985). Since this conservation is preserved across phyla, it implies a functional conservation between the two genes.

The achaete-scute complex consists of two transcription units, each with three domains of highly conserved amino acid sequence (Villares and Cabrera, 1987). The two transcription units have a similar function: the differentiation of sensory organs.
transformer is a Drosophila gene required for female sexual differentiation. The gene contains an 8 kb tandem duplication (Villares and Cabrera, 1987). Both components of the repeat are transcribed, yet the significance of the repeat is unclear.

Genes with a tandem repeat organization can be found in other eukaryotes. The wolffish antifreeze protein genes show a structure similar to ph (Scott et al., 1988). The major component genes exist as inverted tandem repeats, 8 kb in length. The two genes are separated by 1.3 kb. The minor component genes exist as direct tandem repeats. Like ph, each repeat possesses its own promoter and polyadenylation signals. Also like ph, a transcription termination sequence (ATTTTTNT) is located between the two repeats. The two genes are highly conserved but one contains a region of unique sequence. This is again similar to ph.

Future experiments include S1 nuclease mapping of cDNAs to the genomic sequence to determine the ph splicing pattern. Alternatively, cDNAs can be sequenced and their sequence compared with the genomic sequence to identify splice junctions. This could be done in several tissues and at various stages of development to see if ph is alternatively spliced. The 5' end(s) of the gene will have to be determined using primer extension analysis. This will allow us to pinpoint the location of the ph promoter(s).

Once putative promoters have been located, their role in the expression of ph can be determined using site-specific mutagenesis. Suspected ph regulatory regions could be fused to the lacZ coding region. Embryos could then be transformed with the chimaeric gene constructs and the regulation of β-galactosidase expression assayed in vivo (Ashburner, 1989). The same method could be used to determine the functional significance of the putative zinc finger and glutamine-rich domains.

My determination of ph ORFs will allow for the synthesis of ph-specific antibodies.
A long ORF (e.g., ORF 4/5/6) could be fused to the lacZ gene. The resulting fusion protein could then be injected into rabbits. Purified antibodies could be used to probe embryos and cells for ph protein distribution. These antibodies could also be used to probe salivary gland chromosomes (Zink and Paro, 1989). If ph is indeed a transcription factor, and is expressed in this tissue, one would expect ph-specific antibodies to bind to the salivary gland chromosomes. If the ph protein does bind to a known gene(s), the exact nature of this interaction could be determined using DNase footprinting analysis.

REFERENCES

Ashburner, M. (1989). Drosophila, a laboratory handbook. Cold Spring Harbor Laboratory Press, Cold Spring Harbor.
Bender, W., Akam, M., Karch, F., Beachy, P.A., Peifer, M., Spierer, P., Lewis, E.B. and D.S. Hogness (1983). Molecular genetics of the bithorax complex in Drosophila melanogaster. Science 221, 23-29.
Biggin, M.D., Bickel, S., Benson, M., Pirrotta, V. and R. Tjian (1988). Zeste encodes a sequence-specific transcription factor that activates the Ultrabithorax promoter in vitro. Cell 53, 713-722.
Birnstiel, M.L., Busslinger, M. and K. Strub (1985). Transcription termination and 3' processing. The end is in site! Cell 41, 349-359.
Blochlinger, K., Bodmer, R., Jack, J., Jan, L.Y. and Y.N. Jan (1988). Primary structure and expression of a product from cut, a locus involved in specifying sensory organ identity in Drosophila. Nature 333, 629-635.
Breathnach, R. and P. Chambon (1981). Organization and expression of eukaryotic split genes coding for proteins. Ann. Rev. Biochem. 50, 349-383.
Breen, T.R. and I.M. Duncan (1986). Maternal expression of genes that regulate the bithorax complex of Drosophila melanogaster. Dev. Biol. 118, 442-456.
Carroll, S.B., Laymon, R.A., McCutcheon, M.A., Riley, P.D. and M.P. Scott (1986). The localization and regulation of Antennapedia protein expression in Drosophila embryos. Cell 47, 113-122.
Courey, A.J. and R. Tjian (1988). Analysis of Sp1 in vivo reveals multiple transcriptional domains, including a novel glutamine-rich activation motif. Cell 55, 887-898.
Dingwall, C. and R.A. Laskey (1986). Protein import into the cell nucleus. Ann. Rev. Cell Biol. 2, 367-390.
Dufour, S., Duband, J.L., Kornblihtt, A.R. and J.P. Thiery (1988). The role of fibronectins in embryonic cell migrations. Trends in Genet. 4, 198-203.
Duncan, I. (1982). Polycomblike: a gene that appears to be required for the normal expression of the bithorax and Antennapedia gene complexes of Drosophila melanogaster. Genetics 102, 49-70.
Duncan, I. and E.B. Lewis (1982). Genetic control of body segment differentiation in Drosophila. In Developmental Order: Its Origin and Regulation (ed. S. Subtelny). New York: Liss. Symp. Soc. Devl. 40, 533-554.
Dura, J.-M., Brock, H.W. and P. Santamaria (1985). polyhomeotic: a gene of Drosophila melanogaster required for correct expression of segmental identity. Mol. Gen. Genet. 198, 213-220.
Dura, J.-M., Deatrick, J., Randsholt, N.B., Brock, H.W. and P. Santamaria (1988). Maternal and zygotic requirement for the polyhomeotic complex genetic locus in Drosophila. Roux's Arch. Dev. Biol. 197, 239-246.
Epstein, H.F., Ortiz, I. and L.A.T. MacKinnon (1986). The alteration of myosin isoform compartmentation in specific mutations of Caenorhabditis elegans. J. Cell. Biol. 103, 985-993.
Evans, R.M. (1988). The steroid and thyroid hormone receptor superfamily. Science 240, 889-895.
Freeman, S. (1988). M.Sc. Thesis: Molecular analysis of the Drosophila gene, polyhomeotic.
Frey, M., Sieker, L., Payan, F., Haser, R., Bruschi, M., Pepe, G. and J. LeGall (1987). Rubredoxin from Desulfovibrio gigas: A molecular model of the oxidized form at 1.4 Å resolution. J. Mol. Biol. 197, 525-541.
Frigerio, G., Burri, M., Bopp, D., Baumgartner, S. and M. Noll (1986). Structure of the segmentation gene paired and the Drosophila PRD gene set as part of a gene network. Cell 47, 735-746.
Fuller, M.T. (1986). Genetic analysis of spermatogenesis in Drosophila: the role of testes specific beta-tubulin and interacting genes in cellular morphogenesis. In "Gametogenesis and the early embryo", Gall, J.G. (ed.), pp. 19-41, Alan R. Liss Inc., New York.
Gehring, W. (1970). A recessive lethal with a homeotic effect in D. melanogaster. Dros. Inform. Serv. 45, 103.
Green, D.E. and D.H. MacLennan (1967). The mitochondrial system of enzymes. In D.M. Greenberg (ed.), Metabolic Pathways, 3rd ed., vol. 1, pp. 47-111, Academic Press Inc., New York.
Gribskov, M., Devereux, J. and R.R. Burgess (1984). The codon preference plot: graphic analysis of protein coding sequences and prediction of gene expression. Nuc. Acids Res. 12, 539-549.
Hafen, E., Levine, M. and W. Gehring (1984). Regulation of Antennapedia transcript distribution by the bithorax complex in Drosophila. Nature 307, 287-289.
Harding, K., Wedeen, C., McGinnis, W. and M. Levine (1985). Spatially regulated expression of homeotic genes in Drosophila. Science 229, 1236-1242.
Harding, K. and M. Levine (1988). Gap genes define the limits of Antennapedia and Bithorax gene expression during early development in Drosophila. EMBO J. 7, 205-214.
Henikoff, S. (1984). Unidirectional digestion with exonuclease III creates targeted breakpoints for DNA sequencing. Gene 28, 351-359.
Hochman, B., Gloor, H. and M.M. Green (1964). Analysis of chromosome 4 in Drosophila melanogaster. I. Spontaneous and X-ray induced lethals. Genetics 35, 109-126.
Homyk, T. and C.P. Emerson (1988). Functional interactions between unlinked muscle genes within haploinsufficient regions of the Drosophila genome. Genetics 119, 105-121.
Hynes, R. (1987). Integrins: A family of cell surface receptors. Cell 48, 549-554.
Ingham, P. (1984). A gene that regulates the bithorax complex differentially in larval and adult cells of Drosophila. Cell 37, 815-823.
Ingham, P.W. and A. Martinez-Arias (1986). The correct activation of Antennapedia and bithorax complex genes requires the fushi-tarazu gene. Nature 324, 592-597.
Irish, V.I., Martinez-Arias, A. and M. Akam (1989). Spatial regulation of the Antennapedia and Ultrabithorax homeotic genes during Drosophila early development. EMBO J. 8, 1527-1537.
Joyner, A.L., Kornberg, T., Coleman, K.G., Cox, D.R. and G.R. Martin (1985). Expression during embryogenesis of a mouse gene with sequence homology to the Drosophila engrailed gene. Cell 43, 29-37.
Jurgens, G. (1985). A group of genes controlling the expression of the bithorax complex in Drosophila. Nature 316, 153-155.
Karch, F., Weiffenbach, B., Peifer, M., Bender, W., Duncan, I., Celniker, S., Crosby, M. and E.B. Lewis (1985). The abdominal region of the bithorax complex. Cell 43, 81-96.
Kaufman, T.C., Lewis, R. and B. Wakimoto (1980). Cytogenetic analysis of chromosome 3 in Drosophila melanogaster: The homeotic gene complex in polytene interval 84A-B. Genetics 94, 115-133.
Keegan, L., Gill, G. and M. Ptashne (1986). Separation of DNA binding from the transcription-activating function of a eukaryotic transcriptional activator protein. Science 231, 699-704.
Kouzarides, T. and E. Ziff (1988). The role of the leucine zipper in the fos-jun interaction. Nature 336, 646-651.
Kozak, M. (1984). Compilation and analysis of sequences upstream from the translational start site in eukaryotic mRNAs. Nuc. Acids Res. 12, 857-872.
Krust, A., Green, S., Argos, P., Kumar, V., Walter, P., Bornert, J.M. and P. Chambon (1986). The chicken oestrogen receptor sequence: homology with v-erbA and the human oestrogen and glucocorticoid receptors. EMBO J. 5, 891-897.
Laughon, A. and M.P. Scott (1984). Sequence of a Drosophila segmentation gene: protein structure homology with DNA-binding proteins. Nature 310, 25-31.
Lewis, E.B. (1978). A gene complex controlling segmentation in Drosophila. Nature 276, 565-570.
Lipman, D.J. and W.R. Pearson (1985). Rapid and sensitive protein similarity searches. Science 227, 1435-1441.
Locke, J.M., Kotarski, M.A. and K.D. Tartof (1988). Dosage dependent modifiers of position-effect variegation in Drosophila and a mass action model that explains their effect. Genetics 120, 181-198.
Ma, J. and M. Ptashne (1987). Deletion analysis of GAL4 defines two transcriptional activating segments. Cell 48, 847-853.
Maniatis, T., Fritsch, E.F. and J. Sambrook (1982). Molecular cloning: A laboratory manual. Cold Spring Harbor Laboratory.
McKnight, S.L. and R. Kingsbury (1982). Transcriptional control signals of a eukaryotic protein-coding gene. Science 217, 316-324.
Miller, J., McLachlan, A.D. and A. Klug (1985). Repetitive zinc-binding domains in the protein transcription factor IIIA from Xenopus oocytes. EMBO J. 4, 1609-1614.
Mitchell, P.J. and R. Tjian (1989). Transcriptional regulation in mammalian cells by sequence-specific DNA binding proteins. Science 245, 371-378.
Mogami, K. and Y. Hotta (1981). Isolation of Drosophila flightless mutations which affect myofibrillar proteins of indirect flight muscle. Mol. Gen. Genet. 183, 409-417.
Mount, S.M. (1982). A catalogue of splice junction sequences. Nuc. Acids Res. 10, 459-472.
Murphy, L.C., Tsuyuki, D., Myal, Y. and R.P.C. Shiu (1987). Isolation and sequencing of a cDNA clone for a prolactin-inducible protein (PIP). J. Biol. Chem. 262, 15236-15241.
Palumbo, G. (1987). Thyroid hormonogenesis. J. Biol. Chem. 262, 17182-17188.
Park, E.C. and H.R. Horvitz (1986). C. elegans unc-105 mutations affect muscle and are suppressed by other mutations that affect muscle. Genetics 113, 853-867.
Peifer, M., Karch, F. and W. Bender (1987). The bithorax complex: control of segmental identity. Genes Dev. 1, 891-898.
Perlman, D. and H.O. Halvorson (1983). A putative signal peptidase recognition site and sequence in eukaryotic and prokaryotic signal peptides. J. Mol. Biol. 167, 391-409.
Pirrotta, V., Manet, E., Hardon, E., Bickel, S.E. and M. Benson (1987). Structure and sequence of the Drosophila zeste gene. EMBO J. 6, 791-799.
Proudfoot, N.J. and G.G. Brownlee (1976). 3' non-coding region sequences in eukaryotic messenger RNA. Nature 263, 211-214.
Raju, K. and R.A. Anwar (1987). A comparative analysis of the amino acid and cDNA sequences of bovine elastin a and chick elastin. Biochem. Cell Biol. 65, 842-845.
Riley, P.D., Carroll, S.B. and M.P. Scott (1987). The expression and regulation of Sex combs reduced protein in Drosophila embryos. Genes Dev. 1, 716-730.
Rosenberg, V.B., Schroder, C., Preiss, A., Kienlin, A., Cote, S., Riede, I. and H. Jackle (1986). Structural homology of the product of the Drosophila Kruppel gene with Xenopus transcription factor IIIA. Nature 319, 336-339.
Schneuwly, S., Kuroiwa, A., Baumgartner, P. and W.J. Gehring (1986). Structural organization and sequence of the homeotic gene Antennapedia of Drosophila melanogaster. EMBO J. 5, 733-739.
Scott, G.K., Hayes, P.H., Fletcher, G.L. and P.L. Davies (1988). Wolffish antifreeze protein genes are primarily organized as tandem repeats that each contain two genes in inverted orientation. Mol. and Cell Biol. 8, 3670-3675.
Shearn, A., Hersperger, G. and E. Hersperger (1978). Genetic analysis of two allelic temperature sensitive mutants of Drosophila melanogaster both of which are zygotic and maternal-effect lethals. Genetics 89, 341-353.
Smouse, D., Goodman, C., Mahowald, A. and N. Perrimon (1988). polyhomeotic: A gene required for the embryonic development of axon pathways in the central nervous system of Drosophila. Genes Dev. 2, 830-842.
Steiner, H., Hultmark, D., Engstrom, A., Bennich, H. and H.G. Boman (1981). Sequence and specificity of two antibacterial proteins involved in insect immunity. Nature 292, 246-248.
Struhl, K. (1987). Promoters, activator proteins, and the mechanism of transcriptional initiation in yeast. Cell 49, 295-297.
Struhl, G. and M. Akam (1985). Altered distributions of Ultrabithorax transcripts in extra sex combs mutant embryos of Drosophila. EMBO J. 4, 3259-3264.
Struhl, G. and R.A.H. White (1985). Regulation of the Ultrabithorax gene of Drosophila by other bithorax complex genes. Cell 43, 507-519.
Struhl, G. (1982). Genes controlling segmental specification in the Drosophila thorax. Proc. Natl. Acad. Sci. USA 79, 7380-7384.
Struhl, G. (1981). A gene product required for the correct initiation of segment determination in Drosophila. Nature 293, 36-41.
Tabor, S. and C.C. Richardson (1987). DNA sequence analysis with a modified bacteriophage T7 DNA polymerase. Proc. Natl. Acad. Sci. USA 84, 4767-4771.
Tautz, D., Lehmann, R., Schnurch, H., Schuh, R., Seifert, E., Kienlin, A., Jones, K. and H. Jackle (1987). Finger protein of novel structure encoded by hunchback, a second member of the gap class of Drosophila segmentation genes. Nature 327, 383-389.
Tautz, D., Trick, M. and G.A. Dover (1986). Cryptic simplicity in DNA is a major source of genetic variation. Nature 322, 652-656.
Villares, R. and C.V. Cabrera (1987). The achaete-scute gene complex of D. melanogaster: Conserved domains in a subset of genes required for neurogenesis and their homology to myc. Cell 50, 415-424.
Vincent, A., Colot, H.V. and M. Rosbash (1985). Sequence and structure of the Serendipity locus of Drosophila melanogaster. J. Mol. Biol. 186, 149-166.
Walker, J.E., Saraste, M., Runswick, M.J. and N.J. Gay (1982). Distantly related sequences in the α- and β-subunits of ATP-synthase, myosin, kinases and other ATP-requiring enzymes and a common nucleotide binding fold. EMBO J. 1, 945-951.
Wedeen, C., Harding, K. and M. Levine (1986). Spatial regulation of Antennapedia and bithorax gene expression by the Polycomb locus. Cell 44, 739-748.
Wilbur, W.J. and D.J. Lipman (1983). Rapid similarity searches of nucleic acid and protein data banks. Proc. Natl. Acad. Sci. USA 80, 726-730.
Wharton, K.A., Yedvobnick, B., Finnerty, V.G. and S. Artavanis-Tsakonas (1985). opa: A novel family of transcribed repeats shared by the Notch locus and other developmentally regulated loci in D. melanogaster. Cell 40, 55-62.
Zink, B. and R. Paro (1989). In vivo binding pattern of a trans-regulator of homeotic genes in Drosophila melanogaster. Nature 337, 468-471.

