UBC Theses and Dissertations


Lateral gene transfer from bacteria to protists de Koning, Audrey Patricia 2002

Full Text


LATERAL GENE TRANSFER FROM BACTERIA TO PROTISTS

by AUDREY PATRICIA DE KONING
B.Sc., The University of Northern British Columbia, 1999

A THESIS SUBMITTED IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE in THE FACULTY OF GRADUATE STUDIES, Genetics Graduate Programme

We accept this thesis as conforming to the required standard

THE UNIVERSITY OF BRITISH COLUMBIA
September 2002
© Audrey Patricia de Koning, 2002

In presenting this thesis in partial fulfilment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the head of my department or by his or her representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission.

ABSTRACT

The advent of genome-level sequencing has brought a wealth of genetic information that can be used to trace the evolution of organisms. Phylogenies based on homologous proteins often do not agree on implied species relationships, indicating that the evolutionary path taken by some of these genes differs from the evolution of the organisms. One explanation for this observed pattern is lateral gene transfer, the movement of genes across species boundaries. There are mechanisms to facilitate such movement in prokaryotic organisms. In eukaryotes, it is less clear what mechanisms might facilitate incorporation of foreign genes into a species' genome. Multicellular eukaryotes are presumably more resistant to lateral gene transfer, because the majority of their cells do not have access to genes from other organisms.
Moreover, germ and somatic cell lineages are often separate, which greatly reduces the probability that any gene that gains access to a foreign cell will be heritable and subsequently become part of its host species' genome. Phagotrophic protists, however, represent one type of eukaryote where these restrictions do not exist. Such organisms commonly ingest bacteria as food, and chance events may lead to the incorporation and use of bacterial genes. Lateral gene transfer from organelles of symbiotic origin (the chloroplast and mitochondrion) to the nucleus of the host organism represents a specialized case of lateral gene transfer resulting from phagotrophy. Lateral gene transfer in eukaryotes, therefore, is expected to involve bacterial genes incorporated into the genomes of phagocytic species (or those that have evolved from them).

Accordingly, genes of bacterial origin were sought among the genetic information available from protists. Over 2000 predicted genes from the genome-sequencing project for the malarial parasite, Plasmodium falciparum, were screened to find those that are highly similar to genes from bacterial species. Phylogenetic reconstruction methods were used to elucidate the evolution of 'bacteria-like' genes. Thirteen laterally transferred genes were identified on the basis of protein phylogeny, of which nine are probably transferred from symbiotic organelles and three appear to involve transfers with bacteria. The conclusion is that genes introduced to the genome through lateral gene transfer are not nearly as prevalent in this eukaryote as they are in bacteria.

TABLE OF CONTENTS

ABSTRACT
TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES
PREFACE
CHAPTER 1 INTRODUCTION
1.1 Defining Lateral Gene Transfer
1.2 Mechanisms that cause lateral gene transfer in eukaryotes
1.2.1 Acquisition of foreign genetic material through endocytosis
1.2.2 Introduction of foreign genes through viral infection
CHAPTER 2 LATERAL GENE TRANSFER IN A PROTIST GENOME
2.1 Introduction
2.1.1 Genomic evidence of lateral gene transfer
2.1.2 Plasmodium falciparum
2.2 Methods
2.2.1 P. falciparum sequence data source
2.2.2 Identification of 'bacteria-like' genes using BAEwatch
2.2.3 Identification of features of P. falciparum genes
2.2.4 Retrieval and identification of putative homologues
2.2.5 Sequence alignments
2.2.6 Phylogenetic reconstruction
2.3 Results and Discussion
2.3.1 BAEwatch
2.3.2 Phylogenetic analysis of individual 'bacteria-like' genes
    Type II topoisomerase
    Adenylosuccinate lyase
    Elongation factor G
    Ribosome release factor
    Proteasome subunit HslV
    Pseudouridine synthase
    Ribosomal RNA methyltransferase
    RNA 3' terminal phosphate cyclase
    Ribosomal RNA adenine dimethylase
    S-adenosyl methionine-dependent methyltransferase
    GTP-binding protein
2.3.3 Conclusion: Lateral gene transfer in P. falciparum
BIBLIOGRAPHY

LIST OF TABLES

Table 2.1 Protein sets used for BAEwatch comparison
Table 2.2 Complete genomes queried for homologues of P. falciparum genes
Table 2.3 Summary of BLAST searches with P. falciparum sequence sets
Table 2.4 List of P. falciparum genes evaluated by phylogenetic analyses

LIST OF FIGURES

Figure 1.1 Directional transmission of genetic information
Figure 2.1 Example of a divergent region in an alignment
Figure 2.2 Architecture of type II topoisomerases
Figure 2.3 Phylogeny of the B subunit of type II topoisomerase
Figure 2.4 Phylogeny of the A subunit of type II topoisomerase
Figure 2.5 Phylogeny of type II topoisomerase
Figure 2.6 Phylogeny of adenylosuccinate lyase
Figure 2.7 Multiple sequence alignment of adenylosuccinate lyase
Figure 2.8 Phylogeny of adenylosuccinate lyase 'mixed' clade
Figure 2.9 Phylogeny of elongation factor G
Figure 2.10 Multiple sequence alignment of elongation factor G
Figure 2.11 Phylogeny of ribosome release factor
Figure 2.12 Multiple sequence alignment of protease subunit HslV
Figure 2.13 Phylogeny of protease subunit HslV
Figure 2.14 Phylogeny of RsuA pseudouridine synthase
Figure 2.15 Phylogeny of RNA m5C methyltransferase
Figure 2.16 Phylogeny of RNA 3' terminal phosphate cyclase
Figure 2.17 Schematic alignment of rRNA adenine dimethylase sequences
Figure 2.18 Phylogeny of ribosomal RNA adenine dimethylase
Figure 2.19 Schematic alignment of SAM-dependent methyltransferase
Figure 2.20 Phylogeny of SAM-dependent methyltransferase sequences
Figure 2.21 Schematic alignment of GTP-binding protein sequences
Figure 2.22 Phylogeny of GTP-binding proteins

PREFACE

Lateral gene transfer has, of late, become a much-discussed topic in evolutionary genetics. The nature of these dialogues is, more so than most other biological debates, often hotly adversarial. I think this is because evidence of lateral gene transfer is almost always equivocal, and arguments supporting or refuting cases of lateral transfer inevitably invoke parsimony and rely on an opinion of how 'likely' a proposed series of evolutionary events is.
For example, in response to the claim of massive lateral transfer from Archaea to the eubacterium Aquifex, Kyrpides and Olsen [1] state "... vertical inheritance via a thermophilic lineage from the archaeal-bacterial common ancestor will be a more parsimonious explanation than independent lateral transfers as suggested by Aravind et al." Aravind et al. [2] then reply, "...one is forced to postulate that repeated loss of archaeal genes occurs during the evolution of Bacteria. This is not impossible but does make the hypothesis of Kyrpides and Olsen less parsimonious." Unfortunately for those of us interested in lateral transfer, differences in a 'gut-feeling' opinion are hard to reconcile.

Over the last 10 years, examples have accumulated of genes (and even larger genetic units) whose evolution may be explained by lateral transfer. Many researchers have celebrated this somewhat surprising revelation, and are making some rather broad claims about the importance of lateral gene transfer. According to de la Cruz and Davies [2], "genes have flowed through the biosphere, as in a global organism", which leads to the claim by some that phylogenetic classification may not be a useful concept because of rampant lateral gene transfer [3-5]. Gene transfer has been hypothesized to drive the formation of bacterial operons [6] and to maintain the universality of the genetic code [7]. Several researchers have put forward the idea that lateral gene transfer has played an integral role in bacterial speciation [8-10]. It has even been suggested that lateral transfer is more important than mutation in driving the evolution of novel functions [11].
In response to this seemingly blanket acceptance of the importance and contribution of lateral gene transfer to evolution, researchers who were not quite so convinced of its ubiquity began to present dissenting views and advocate that caution be used both in inferring cases of lateral transfer and in interpreting the evolutionary implications of lateral transfer [12-14]. Finally, in the same way that an interesting new song on the radio becomes irritating when overplayed, the tide of opinion is turning against gene transfer. Advocates of lateral gene transfer are now often greeted with cynicism, as illustrated in a recent quip from C.G. Kurland [15]: "Nothing in science is more self-aggrandizing than the claim that 'all that went before me is wrong' except perhaps the claim that 'I can explain everything'. Proponents of global HGT (horizontal gene transfer) have it both ways."

This thesis attempts to provide a balanced view of lateral gene transfer by keeping an open mind to all evidence, but still challenging all conclusions with scepticism.

-AD

CHAPTER 1 INTRODUCTION

1.1 Defining Lateral Gene Transfer

The fundamental meaning of lateral gene transfer is the movement of genes between genomes in a direction other than vertical - i.e. genes that are not acquired through inheritance from an immediate ancestor. Such lateral gene movement uncouples the acquisition of genetic information from reproduction. This has several consequences. First, genetic information may move between genomes with widely disparate gene content. Second, while reproduction involves the transfer of an entire genome from parent to offspring, lateral transfer can move much smaller units of information between organisms. Thus an organism that has incorporated laterally transferred information into its genome is largely unchanged, because that organism has a copy of all or nearly all of its parent's genes, and is much less similar to the donor of the laterally transferred information.
Finally, vertical movement of genetic information must occur with the creation of every new organism, and so most members of a species participate every generation. Lateral movement does not involve 'everybody, all the time,' and may occur sporadically in only a small subset of organisms.

Genetic information moves purely vertically in clonal reproduction (figure 1.1A). Prokaryotes, asexual single-celled eukaryotes, and the somatic cells of multicellular eukaryotes reproduce clonally through binary fission. Each newly created cell is genetically identical to its progenitor, barring any variation caused by mutation. In the prokaryotic world, parasexual processes such as conjugation, transduction, and transformation also generate genetic variation (figure 1.1B). Unlike clonal reproduction, parasexual processes cause lateral gene movement, which occurs with only some genes, between only some individuals, some of the time.

In eukaryotes, variation can be generated by recombination facilitated through sexual reproduction, which involves the vertical transmission of genetic information from two parents (figure 1.1C). In contrast with clonal reproduction, which creates assemblages of distinct genetic lineages, sexual reproduction mixes parental genetic information in the offspring and creates a gene pool. There are boundaries to the gene pool, however, and sexual reproduction generally involves mechanisms that prevent more distantly related individuals from contributing. These gene pool boundaries define what we typically call a species. This 'sex-centric' concept of species is characterized by wholesale chromosomal recombination and is more difficult to apply to the localized recombination caused by parasexual processes in the prokaryotic world. There are some parallels, however, as recombinatorial exchange in prokaryotes also has some limits based on degree of relatedness.
Recombination is mediated by the RecA enzyme, whose efficiency drops sharply as sequence identity decreases. Thus, recombination is more likely and more frequent among more closely related individuals [16]. As in sex-generated gene pools, it is probably these limits to recombination between distantly related prokaryotes that preserve the associations of certain microbiological characters (e.g. appearance, presence/absence of metabolic pathways, etc.) that traditionally define prokaryotic species.

[Figure 1.1: diagram with panels A-E, grouped under "Prokaryotes" and "Eukaryotes" and labelled parasexual processes, sexual reproduction, hybridization, and lateral gene transfer.]
Figure 1.1 Directional transmission of genetic information. In the top panel, dots depict individuals, while in the bottom panel dots represent species. (A) Genomic information moves vertically through clonal reproduction. (B) Parasexual processes move genetic elements laterally. (C) In sexual reproduction, each parent vertically transmits a copy of its genome. (D) Hybridization moves information vertically, outside of species boundaries. (E) Lateral gene transfer moves information between organisms, independent of inherited information.

Occasionally genetic information crosses species boundaries. If it moves vertically by sexual reproduction, we call the process "hybridization" (figure 1.1D). Typically only closely related (recently diverged) species will hybridize, and so this sort of genetic transmission does not strictly cross species boundaries but rather blurs them. In both prokaryotes and eukaryotes, genes may move laterally between species (figure 1.1E), either introducing novel genetic information to a species (akin to mutation in an individual) or replacing existing information (similar to recombination).
In eukaryotic organisms, this sort of lateral transmission does not blur the species boundary, because vertical transmission of information through clonal and/or sexual reproduction happens with more information (the whole genome), more often (every generation), and with less divergent information (from within the species). In prokaryotes, lateral gene transfer between species occurs right alongside the lateral transmission that contributes to within-species recombination. As such, lateral transfer in prokaryotes probably exists along a continuum: frequent lateral transfer occurs between closely related individuals and causes variation through recombination within the species; less frequent transfer occurs between members of closely related strains and (by analogy to hybridization) blurs species boundaries; and occasional lateral gene transfer occurs between more distantly related species.

Although in prokaryotic systems parasexual processes move genes laterally within a species, the term Lateral Gene Transfer (LGT) will be used specifically to describe the transfer of genetic information outside of normal species boundaries. Generally, LGT is used to describe lateral movement of any unit of genetic material, including small segments [17], genes [18], operons [19], and even whole genomes [20].

1.2 Mechanisms that cause lateral gene transfer in eukaryotes

The following sections detail what we know about possible mechanisms for lateral gene transfer in extant eukaryotic organisms. It must be remembered that conditions may have been very different in the past. For example, while viruses that can infect organisms from different kingdoms are unknown today, they may have existed in the past. Going further back in time, Woese and co-workers have put forth the idea that some of the evidence of lateral gene transfer that we see today results from gene movement at the time of the last universal common ancestor.
They note that simple organisms would have had very few barriers to gene exchange, and would represent a 'lateral exchange gene pool' [21, 22].

1.2.1 Acquisition of foreign genetic material through endocytosis

DNA cannot simply diffuse across cell walls and membranes, and so genetic material must have some route of access into a new host. In eukaryotes, genetic material can be imported into cells by endocytosis, in which invaginations of the cell membrane pinch off to form internal vacuoles [23]. Once the phagocytic vacuole is internalized, it generally fuses with a lysosome and the incoming microorganism is digested [24]. Heterotrophic unicellular eukaryotes do this for a source of nutrients, and bacteria-eaters are common in the protist world. Multicellular organisms have a variety of specialized cells (e.g. monocytes, macrophages and neutrophils) that phagocytose microorganisms, but these are generally components of an immune response that targets foreign cells.

Several microorganisms have ways to escape degradation once inside the cell, and this can lead to longer associations with eukaryotic cells. For instance, some mycobacteria, Legionella pneumophila, and the protist Toxoplasma gondii can inhibit fusion of the phagosome with the lysosome. Listeria monocytogenes, Shigella flexneri (bacteria) and Trypanosoma cruzi (eukaryote) can lyse the phagosome and escape into the cytosol. Leishmania spp. (a protist) and Salmonella typhimurium (bacterium) can simply inhabit the phago-lysosome and seem to resist digestion [25-27]. There are numerous instances of intracellular microorganisms that can replicate in the vacuole or cytosol and inhabit eukaryotic cells; such associations may be either transient or stable [28]. Both parasitic and commensal relationships exist; Chlamydia, Rickettsia, and Legionella are intracellular bacteria that cause disease in humans, but live mutualistically with Amoebae [29].
The most striking example of a long-term association is that of the eukaryotic organelles of symbiotic origin. These include plastids evolved from a cyanobacteria-like ancestor and mitochondria derived from an alpha-proteobacterial progenitor [30]. It has been estimated that these obligate associations began over 1.5 billion years ago for the mitochondria [31] and only slightly later for plastids. Some protist groups contain secondary plastids. These organelles are derived from endosymbiosis of plastid-containing eukaryotes (red or green algae) [30]. The existence of these secondary symbionts shows that uptake of organisms in eukaryotes is not limited to bacteria, and that internalization through endocytosis can import a source of eukaryotic genes into eukaryotic cells.

Microorganism-containing vacuoles are eventually digested in the host eukaryote, and it is at this point that DNA from the lysed microorganism may escape and gain access to the host nucleus. Organelles, too, are digested by lysosomes (in this case by specialized 'self-targeting' lysosomes), and some studies have revealed that liberated organellar DNA can end up in the nucleus of the host [32]. Escape of intact DNA from lytic vacuoles probably occurs accidentally and so may not be a frequent source of foreign DNA for lateral gene transfer. However, over evolutionary time, it may have occurred repeatedly. In the case of bacteria ingested for food, repeated uptake and digestion provide a constant source of DNA-containing vacuoles. For organelles, mutualists, and parasites, continued presence in the eukaryotic host (with accompanying deaths and reproduction) ensures that opportunity after opportunity occurs for exposure to foreign DNA [33].

There are wide ranges of both prokaryotic and eukaryotic organisms that can be the donors of DNA acquired by phagocytosis, and the only real limitation on who can be internalized is size - the host cell is generally larger than its prey.
As well, since DNA acquisition depends on contact between two cells, we would expect that lateral gene transfer events involving this mechanism would only occur between organisms that occupy a similar milieu.

The eukaryotes that can acquire genes in this manner are largely restricted to single-celled organisms. Among the multicellular eukaryotes, cell specialization leaves only a restricted set of cells with the ability to phagocytose. In animals, these are the assorted macrophages that function in the immune responses [34]. In plants, only specialized nodule cells in the root hairs of legumes are known to bring in organisms through phagocytosis, and they only internalize their own species-specific rhizobial symbiont to facilitate nitrogen fixation [35]. Neither of these is relevant to lateral gene transfer, as these cells are somatic, and only transfers to germ-line cells will be inherited.

1.2.2 Introduction of foreign genes through viral infection

A virus is an infectious element that, when independent of a host cell, exists as a nucleic acid genome enveloped in a capsid coat. Viruses represent a potential conduit for lateral gene transfer because they are specialists at slipping in and out of cells, and can move between organisms. Moreover, viruses replicate intracellularly, often in intimate contact with their host genome. If genetic material from the host is incorporated into the viral chromosome, the virus will transfer genes from its last host into its next host. Unlike endocytotic mechanisms for lateral gene transfer, viral transduction obviates the need for donor and recipient cells to be in close proximity. The capsid protects the viral genome from degradation and allows the virion to be widely dispersed [36]. Consequently, viruses can facilitate the dissemination of genes between organisms separated in space or time.

There is some evidence that eukaryotic viruses can incorporate host genes into their own genomes.
The sequencing of several viral genomes has identified ORFs that have high similarity to genes in these viruses' hosts, and this is proposed to be evidence of common origin [37]. Tidona and Darai [38], noting that the viral copies of eukaryote genes do not contain introns, suggested that DNA viruses might capture host genes either through reverse transcription of host mRNAs followed by recombination, or through recombination of viral and host DNA and subsequent deletion of introns. Becker [37] proposed that single-stranded RNA viruses, which replicate in the host cytoplasm, could pick up host genes by recombination with host RNA. Unfortunately, the means by which host genes might end up in viral genomes have not been investigated for most eukaryotic viruses.

Mechanisms of host gene capture have been studied the most in the retroviruses. Free retrovirions contain diploid RNA genomes that are reverse-transcribed into haploid DNA upon infection. The viral DNA inserts itself into the host genome, with the help of a virus-encoded enzyme, integrase. The virus then uses the host cell machinery to transcribe retroviral mRNAs and to process, splice and translate the transcripts. Two full mRNA transcripts (and some viral proteins) are packaged into viral particles that bud from the eukaryotic host to regenerate infective free virions [39].

Errors in viral transcription, reverse-transcription, and/or packaging may lead to production of viruses with host genes. A mutant retrovirus whose genome lacks a packaging recognition site was shown (in a xenologous cell line) to package host RNAs instead, leading to the production of pseudogenes in the genomes of subsequently infected host cells [40-42]. There is also some evidence that random host mRNAs are occasionally packaged along with viral RNA to create the diploid RNA of the free virion.
Upon reinfection, reverse-transcriptase (RT) can jump between the templates at short stretches of homology in otherwise unrelated RNAs, causing recombination with the host gene between the regions important for viral infectivity. This forms proviral DNA that has integrated a cellular gene into its genome [43, 44]. Viral/host fusion mRNAs are created by accidental read-through transcription [45]. The viral/host fusion mRNAs are packaged into virions along with a normal viral genomic RNA transcript and, as described above, RT-mediated recombination regenerates the infective virus [46]. While retrovirus-mediated gene transfer has been observed (e.g. [47-49]), these transfers are between cells of the same species. There are not yet any known examples of cross-species lateral transfer by retroviruses [50].

To detect virus-mediated lateral gene transfer in nature, the new sequence must cause some observable change in the recipient. The best evidence for retroviral gene transfer is the existence of viruses that can induce tumours upon infection because they carry oncogenes derived from cellular host genes [47-49]. Most oncogenic retroviruses are replication defective because they lack essential portions of the viral genome [51, 52], and thus only replicate if wild-type viruses are present in the same cell to provide viral products in trans. Oncogenic retroviruses do not induce significant numbers of naturally occurring tumours; this is presumably because oncogene capture is a rare event that produces viruses with a defective phenotype [53].

Among higher eukaryotes, viral infection can only lead to lateral gene transfer if genes enter germ cells. There is some evidence that this can happen. Viral infections of the reproductive tract are observed in mammals [54]. Moreover, endogenous retroviruses (ERVs) are found in the genomes of all higher eukaryotes in which they have been sought [55].
ERVs are integrated retroviruses that are vertically inherited by hosts in a Mendelian fashion [50]. Exogenous retroviruses and ERVs are very similar in sequence and genome organization and thus are thought to have originated from rare infections of germ cells [50, 55, 56].

For viruses to cause lateral gene transfer through movement of genes across species boundaries, they must be able to infect more than one species. Infection begins with an interaction of the free virion with a host cell surface receptor. A great many viruses are highly specific for their receptors and infect only a narrow range of hosts [57]. In general, viruses do not cross generic boundaries [36], and host gene movement primarily leads to horizontal transfer between close relatives. However, some broad-host-range viruses have been found. The bacteriophage PRD1 can infect an unusually broad range of hosts, including multiple genera of both gram-positive and proteobacterial groups [58]. There are also some broad-host-range eukaryotic viruses; for example, the Borna Disease virus occurs in at least all warm-blooded animals [59]. Furthermore, a study by Jensen et al. [57] suggests that the perception that phages are extremely host-specific may be an experimental artefact. The standard method of virus isolation involves serial rounds of viral growth on one host, which may select for those viruses with the highest host specificity. When Jensen et al. [57] incubated environmental samples on a mixture of two potential hosts, they isolated several phages (bacterial viruses) capable of infecting multiple host species. However, no viruses are known that can infect both bacteria and eukaryotes.

CHAPTER 2 LATERAL GENE TRANSFER IN A PROTIST GENOME

2.1 Introduction

2.1.1 Genomic evidence of lateral gene transfer

As detailed in section 1.2, studying the mechanisms of lateral gene transfer can give us some indication of the likelihood of transfer events.
However, such events introduce an allele or gene that may only have a transient association with the recipient organism and some of its descendants. To elucidate the contribution of lateral gene transfer to the evolution of a species, it is more relevant to focus on those genes that become a long-term component of the genetic repertoire of a species. Several recent studies have attempted to do this by measuring what proportion of a species' genome has arisen through the lateral transfer of genes, as opposed to de novo creation of new genes through mutational evolution (e.g. [60-63]). These studies have given widely varying estimates, which may in part reflect differences in the methodology used for detection (see [64-66] for discussions of anomalies caused by method).

Moreover, there is no reason to suppose that all organisms will have been affected by lateral transfer to the same degree. Differences in the mechanisms that cause lateral transfer in different organisms will certainly contribute to observed inequalities. As well, the ecology and lifestyle of a particular species will affect the amount of foreign DNA to which an organism is exposed. For example, we would expect the genomes of organisms that live in interdependent microbial communities, such as those found in a rumen, to have a larger percentage of laterally transferred genes than pelagic marine microbes. Indeed, some trends are emerging in genome-based studies of lateral transfer among eubacteria. While 10-15% of the genomes of some bacteria, such as E. coli and Synechocystis, are estimated to be laterally acquired, host-dependent endosymbionts and endoparasites, such as Buchnera [67] and Rickettsia [9], show no evidence of transfer. This is presumably because the intracellular lifestyle of these organisms precludes contact with other bacteria [67]. It remains to be seen whether this particular trend continues to hold as more genomes are analyzed.
Genome-based studies of lateral gene transfer have focused mostly on transfers between prokaryotes, because the large majority of sequenced genomes are from prokaryotes. Much less is known about the contribution of lateral transfer to the evolution of eukaryotic species. Only seven eukaryotes have been fully sequenced to date: a vertebrate (Homo sapiens), three fungi (Saccharomyces cerevisiae, Schizosaccharomyces pombe, Encephalitozoon cuniculi), two non-vertebrate animals (Caenorhabditis elegans and Drosophila melanogaster) and a plant (Arabidopsis thaliana). The Arabidopsis Genome Initiative [20] carried out a search for all Arabidopsis nuclear genes with highest pairwise similarity (based on BLAST scores) to a bacterial gene. The vast majority of the genes identified in this way were similar to cyanobacteria and probably reflect transfer of genes from the chloroplast. The International Human Genome Sequencing Consortium [68] identified 113 human genes as being of probable bacterial origin. Subsequently, more rigorous analyses [69-71] showed that this was a rather large overestimate of the proportion of laterally transferred genes in the human genome. Even so, the original estimate represents only 0.4% of the human genome, which is much less than estimates in many prokaryotic genomes.

The paucity of laterally transferred genes is hardly surprising. As discussed in section 1.2, several biological features of higher eukaryotes provide extra hurdles to the establishment of a laterally transferred gene in a species' genome. Animals, plants, and fungi comprise only a small subset of the diversity of eukaryotes, and so the current sample of sequenced genomes is not representative enough to evaluate the contribution of lateral transfer to eukaryotic evolution. The objective of this study was to determine the extent of laterally transferred genes in a protistan eukaryote.
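The top-hit screens described above for the Arabidopsis and human genomes (flag every gene whose highest-scoring BLAST match is bacterial) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the procedure used by either consortium or in this thesis: the `Hit` record, the domain labels, and the function names are hypothetical, and parsing of real BLAST output is left out. Ties in score are resolved in favour of the first hit seen.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One database match for a query gene (hypothetical record layout)."""
    query: str        # identifier of the query gene
    domain: str       # taxonomic domain of the matched sequence
    bit_score: float  # similarity score of the match


def best_hit_domain(hits):
    """Map each query gene to the domain of its highest-scoring hit."""
    best = {}
    for hit in hits:
        current = best.get(hit.query)
        if current is None or hit.bit_score > current.bit_score:
            best[hit.query] = hit
    return {query: hit.domain for query, hit in best.items()}


def bacteria_like(hits):
    """Return, sorted, the query genes whose top hit is bacterial."""
    domains = best_hit_domain(hits)
    return sorted(q for q, d in domains.items() if d == "Bacteria")


def percent_of_genes(n_candidates, n_genes):
    """Candidates expressed as a percentage of all genes screened."""
    return 100.0 * n_candidates / n_genes


if __name__ == "__main__":
    # Invented scores, for illustration only.
    example = [
        Hit("geneA", "Bacteria", 210.0),
        Hit("geneA", "Eukaryota", 95.0),
        Hit("geneB", "Eukaryota", 300.0),
        Hit("geneB", "Bacteria", 120.0),
    ]
    print(bacteria_like(example))
```

A screen of this kind is only a first-pass filter. As the follow-up analyses of the human candidates showed, a bacterial top hit does not by itself establish lateral transfer, which is why candidates are then examined phylogenetically. For scale, 113 candidates amounting to 0.4% implies a total of roughly 113 / 0.004, or about 28,000, genes.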
I chose to analyze the available sequence data from a genome-sequencing project currently underway: that of the nuclear genome of an alveolate, Plasmodium falciparum. This genome is being sequenced by a random shotgun method, and so although it is not complete, the set of genes that is available is a more-or-less unbiased sample. I will begin by reviewing salient facts about P. falciparum and its genome.

2.1.2 Plasmodium falciparum

Plasmodium falciparum is one of four Plasmodium species that cause human malaria [72], a debilitating and often deadly disease that is endemic in over 100 countries worldwide [73]. Of the four species, P. falciparum is the most virulent and is responsible for almost all of the one million deaths caused by malaria annually [73]. The P. falciparum parasite lives intracellularly in human liver and red blood cells and within mosquito midgut lumen epithelial cells. Free parasites also move through the liver and blood of humans and through the gut, hemolymph and saliva of mosquitoes. P. falciparum is exposed to bacteria in at least some stages of its lifecycle. An examination of mosquito midgut contents has demonstrated that malarial oocysts often coexist with a variety of bacteria [74, 75]. However, phagocytosis of bacterial cells has not been documented for this parasite, and it is doubtful that P. falciparum has any access to bacterial DNA through cell ingestion in its free-living stages. Once inside the human erythrocyte, endocytosis does occur and is a route for the parasite to bring in nutrients from both the host cell cytosol and from the parasitophorous vacuole in which it resides [76]. Nevertheless, these compartments are devoid of foreign DNA, and so any laterally transferred genes would have to come from free DNA in the extracellular environment. Studies investigating the antiplasmodial effect of antisense oligonucleotides have suggested that larger molecules can enter the parasite from the extracellular milieu [77].
More investigation is needed to determine how and which macromolecules may enter malarial parasites from the external surroundings [78] and whether a route exists whereby free foreign DNA can enter P. falciparum.

The P. falciparum genome consists of 14 chromosomes with a total size of about 26 Mb [79]. Overall, the genome has an unusually high AT content of 82% [80], representing both very AT-rich intergenic sequences and codon usage that is AT-biased [81]. Fully annotated sequences of both chromosome two [82] and chromosome three [83] have been published. An examination of sequences from these two chromosomes illuminates some notable features of P. falciparum genes. Only 42% of genes on chromosome 2 [82] and 33% of genes on chromosome 3 [83] have identifiable homologues in other species. This is less than half of what is found on sequenced chromosomes from other organisms [82]. The paucity of conserved proteins may be an indication that P. falciparum evolves at a faster rate than other organisms, or it may indicate that the divergence time is longer between Plasmodium and the organisms prevalent in sequence databases. On chromosomes two and three, about 90% of the predicted proteins contain regions characterized by high compositional bias and low complexity [82, 83]. Twenty percent of the low complexity regions consist of tandem arrays of short peptide motifs, while the other eighty percent represent non-repetitive regions containing runs of a single amino acid [84]. Such regions involve more than 60% of the total peptide in about half of all the proteins [80]. Although regions of low complexity have been found in other organisms, the prevalence and extent of such sequences are much higher in P. falciparum than in other eukaryotes, such as S. cerevisiae, C. elegans [82], or Dictyostelium discoideum [80]. The amino acid frequencies within low complexity regions correlate with A-richness in codons, and this bias results in an overall hydrophilic quality [80].
When low-complexity regions occur within globular proteins, it is assumed that these hydrophilic segments 'loop out' of the main core of the folded protein. These regions do not appear to be evolutionarily conserved, because they manifest as P. falciparum-specific insertions in alignments with homologues of other species [80, 81] and because their size is often polymorphic among P. falciparum strains [83].

About 45% of all genes on chromosomes two and three have introns [82, 83]. Although the majority of these genes have only one intron [83], some are quite complex and may contain up to 15 introns [83, 85]. Alternative splicing occurs in at least some P. falciparum genes. Leonard van Lin and his colleagues [86] examined a high gene density region of chromosome ten. Six genes were found in only 13.6 kb; three of these contained multiple introns. In two of these genes, sequencing of edited transcripts showed that alternative splicing and multiple polyadenylation sites led to several gene products. The presence of multi-exon genes makes it difficult to determine the correct set of open reading frames (ORFs) with automated gene finding programs. The researchers who sequenced chromosomes two and three used different computer algorithms, and the resulting sets of ORFs predicted by these programs do not agree [87]. Recently, R. Huestis and K. Fischer [88] performed a re-analysis of ORFs in the chromosome 2 sequence. They used P. falciparum-specific sequence features to create a computer algorithm that identified likely introns and exons. All identified genes were then subjected to an extensive manual analysis, and when the identified gene structure varied from the original chromosome 2 annotation, cDNA sequences were used to confirm the position of introns and exons. This analysis added 135 introns to chromosome 2 and put introns into at least 40 heretofore intronless genes.
The new exon structure of some genes resulted in significant structural changes to the predicted proteins. The work of Huestis and Fischer [88] underscores the need for experimental confirmation of the predicted protein sequences that are being produced in the P. falciparum genome-sequencing project.

Two extrachromosomal genetic elements are also present in P. falciparum. A tandemly repeated 5.9 kb linear molecule corresponds to the mitochondrial genome [89], while the apicoplast genome, which is derived from a chloroplast genome, is a 35 kb circle [90]. Both these organellar genomes are reduced when compared with counterparts in other eukaryotes. The mitochondrial genome only encodes rRNAs and a few respiratory chain proteins and is the smallest mitochondrial genome known [91]. Although the apicoplast genome is homologous to the typical plastid genome found in plants and algae, it lacks all genes associated with photosynthesis and is much smaller [92]. Unlike plant and algal chloroplasts, which are bound by two membranes, the apicoplast is surrounded by four membranes. These extra two membranes reflect the secondary endosymbiotic origins of the apicoplast, in which a cell containing a double-membrane bound chloroplast was engulfed by another eukaryotic cell [93].

Reduction of the P. falciparum organellar genomes has been accompanied by lateral gene transfer from the organelles to the nuclear genome [31, 92]. Nuclear-encoded genes of organellar origin can be identified through phylogenetic comparison with homologues. Extant plastids and mitochondria each form a monophyletic group whose nearest neighbour lies within a specific bacterial clade; mitochondria are derived from an alpha-proteobacterium [31], and plastids descended from a cyanobacterium-like ancestor [94]. Nuclear genes of organellar origin reflect these phyletic associations. Many nuclear-encoded genes of organellar origin produce proteins that still function in the organelle.
These proteins are translated in the cytoplasm and then directed to the appropriate organelle by cell machinery that recognizes leader sequences at the N-terminus of the translated protein. These leader sequences are subsequently cleaved from the protein upon import into the organelle [95]. In P. falciparum, nuclear-encoded, mitochondrion-targeted proteins (e.g. [96, 97]) possess leader sequences with properties typical of mitochondrion-targeting peptides identified in other eukaryotes [98, 99]. Proteins targeted to the P. falciparum apicoplast use a bipartite leader peptide [100]. The first portion of the leader sequence is a classic signal peptide of the kind that targets secreted proteins to the endoplasmic reticulum (ER) in eukaryotes [99]. Because secondary endosymbiotic plastids are topologically outside of their host cell, the signal peptide facilitates transport to the apicoplast. The second portion of the leader sequence guides the protein into the apicoplast [100]. This part of the leader sequence resembles the transit peptides used to direct proteins into the primary endosymbiotic chloroplasts of plants and algae and has a typical net positive charge [92]. However, unlike plant transit peptides, which are enriched in serine and threonine residues [101], P. falciparum transit sequences are rich in lysine and asparagine. This is probably a manifestation of the genome's AT bias [92]. Based on the number of plastid-targeted genes found in the A. thaliana genome and taking into account the absence of many plastid pathways in the apicoplast, Waller and his colleagues [100] estimated that between 6 and 17% of all proteins in the P. falciparum genome may be targeted to the apicoplast.

2.2 Methods

2.2.1 P. falciparum sequence data source

Deduced protein sequences for Plasmodium falciparum genes were obtained from the Malaria Genetics and Genomics website [102] at the National Center for Biotechnology Information (NCBI).
Several sequencing centres provided the data made available at this site. Sequence data for P. falciparum chromosomes 1, 3, 4, 5, 6, 7, 8, 9, and 13 were obtained from the Sanger Centre [103]. Sequence data for chromosome 12 were obtained from the Stanford DNA Sequencing and Technology Centre [104]. Preliminary sequence data for P. falciparum chromosomes 2, 10, 11, and 14 were obtained from the Institute for Genomic Research [105]. On January 20, 2001, 2073 P. falciparum protein sequences were available at NCBI and were downloaded for investigation. At the time of this analysis, only small amounts of sequence were available for most chromosomes, with the exceptions of chromosomes 2 and 3, which have been published, and chromosome 12, which was completely sequenced although not annotated.

2.2.2 Identification of 'bacteria-like' genes using BAEwatch

The BAEwatch automated sequence analysis system [106] was used to compare the set of P. falciparum proteins with a large set of proteins from other organisms and to identify those proteins that showed a higher similarity to bacterial proteins than to proteins from other eukaryotes. The 'comparison' data set consisted of the SWISSPROT and TrEMBL databases plus a number of unfinished bacterial genomes, as detailed in table 2.1. For each P. falciparum protein, BAEwatch used gapped BLAST version 2.0 [107] and MSPcrunch [108] algorithms to generate a list of top matching hits with the 'comparison' data set. The results were placed in a database that was then queried for those proteins whose top hits were bacterial. BAEwatch allows the user to ignore top hits of a particular taxonomic level; for example, if the level chosen is "Family", no top hits from organisms that belong to the same family as the query organism will be considered. For this analysis, all hits from within the phylum Apicomplexa were ignored, and so the identified 'bacteria-like' genes were those for which the P.
falciparum gene was more similar to bacterial genes than it was to any other non-apicomplexan sequence. The phylum level was chosen to allow detection of lateral gene transfer that occurred before the divergence of the apicomplexans, a group of parasitic protists with similar lifestyles.

Table 2.1 Protein sets used for BAEwatch comparison. All protein sets were those available in March of 2001. Genomes listed as unfinished were still in progress in March 2001.

Published bacterial genomes
  Borrelia burgdorferi B31
  Campylobacter jejuni NCTC11168
  Chlamydia trachomatis D
  Chlamydia muridarum MoPn
  Chlamydophila pneumoniae AR39
  Chlamydophila pneumoniae CWL029
  Chlamydophila pneumoniae J138
  Deinococcus radiodurans R1
  Escherichia coli O157
  Haemophilus influenzae Rd
  Helicobacter pylori 26695
  Helicobacter pylori J99
  Mycobacterium tuberculosis CSU93
  Mycoplasma genitalium G37
  Mycoplasma pneumoniae M129
  Neisseria meningitidis MC58
  Neisseria meningitidis Z2491
  Pasteurella multocida PM70
  Pseudomonas aeruginosa PAO1
  Rickettsia prowazekii MadridE
  Treponema pallidum Nichols
  Ureaplasma urealyticum serovar3
  Vibrio cholerae N16961
  Xylella fastidiosa NA

  Translations of all coding genes in GenBank
  SWISSPROT and TrEMBL databases [109]

Published eukaryotic genomes
  Arabidopsis thaliana
  Drosophila melanogaster
  Caenorhabditis elegans
  Saccharomyces cerevisiae
  Homo sapiens - ENSEMBL March 2001 data

Unfinished bacterial genomes [110]
  Aquifex aeolicus VF5
  Bacillus halodurans C-125
  Bacillus subtilis 168
  Buchnera sp APS
  Escherichia coli K12
  Lactococcus lactis subsp lactis IL1403
  Synechocystis sp PCC6803
  Thermotoga maritima MSB8

The 'bacteria-like' P. falciparum genes were assigned a score that essentially quantified the difference in similarity between the most similar bacterial hit and the most similar eukaryotic hit. Genes with low scores (below 10) were discarded from further consideration, because they probably reflect those genes that are highly conserved in all organisms.
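The scoring step described above can be illustrated with a minimal Python sketch. The exact formula used by BAEwatch is not given here, so the sketch assumes the score is simply the best bacterial BLAST bit score minus the best eukaryotic bit score, after hits from the ignored phylum are dropped. The `Hit` record and the names `bacteria_like_score` and `is_candidate` are hypothetical and are not part of BAEwatch.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One BLAST hit for a query protein (hypothetical record layout)."""
    organism: str
    domain: str       # "bacteria", "eukaryota", or "archaea"
    phylum: str
    bit_score: float


def bacteria_like_score(hits, ignored_phylum="Apicomplexa"):
    """Best bacterial bit score minus best eukaryotic bit score.

    Hits from the ignored phylum are dropped first. Returns None when
    no bacterial hit remains. This difference is an assumed stand-in
    for the BAEwatch score, not its published definition.
    """
    kept = [h for h in hits if h.phylum != ignored_phylum]
    bacterial = [h.bit_score for h in kept if h.domain == "bacteria"]
    eukaryotic = [h.bit_score for h in kept if h.domain == "eukaryota"]
    if not bacterial:
        return None
    return max(bacterial) - max(eukaryotic, default=0.0)


def is_candidate(hits, min_score=10.0):
    """Keep a gene only if its score reaches the cutoff (scores below 10 are discarded)."""
    score = bacteria_like_score(hits)
    return score is not None and score >= min_score
```

Under these assumptions, a generally conserved protein whose bacterial and eukaryotic hits differ by only a few bits would be rejected, which mirrors the reason given for the cutoff.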
Generally-conserved proteins pose a problem for analyses that rely on output lists from BLAST, because the small differences in BLAST bit scores do not sufficiently discriminate between levels of similarity [106]. The output from BAEwatch included a list of the subset of P. falciparum genes that have a noteworthy similarity to bacterial genes. BAEwatch was thus used as a first-pass tool to identify putative laterally transferred genes.

2.2.3 Identification of features of P. falciparum genes

All 'bacteria-like' P. falciparum sequences were evaluated to detect the presence of N-terminal leader sequences. TargetP [99] was used to identify probable signal peptides and mitochondrial targeting peptides. Sequences were subsequently run through the PATS prediction system [111] to detect apicoplast-targeting leader peptides. While TargetP is trained to identify plant transit peptides, it does not recognize the bipartite leaders used by P. falciparum to target proteins to the apicoplast [111]. Indeed, in this analysis TargetP recognized only the signal peptide portion of the bipartite leaders and thus invariably misclassified apicoplast-targeted proteins (as identified by PATS) as being targeted for secretion. 'Bacteria-like' genes were used to query the Interpro database [112]. Interpro detects sequence similarity to protein signatures derived by a number of methods. Signatures usually represent functional domains that are found in a variety of proteins. The functional domains identified in P. falciparum genes were considered as clues to the possible function of the protein.

2.2.4 Retrieval and identification of putative homologues

Each bacteria-like gene was used to query the non-redundant protein database at the National Center for Biotechnology Information (NCBI [113]) in October of 2001, using gapped BLAST 2.0 [107]. At this time, the NCBI protein database included a number of completely sequenced genomes from species listed in table 2.2.
The set of proteins with an expectation (E) value below 0.001 formed the basis for selection of homologues. The E value is a measure of the expected number of chance alignments that will be as good as the hit, given the size of the protein database [114]. This rather high (non-stringent) exclusion threshold was chosen because E values vary with the length of the query protein. For example, P. falciparum proteins with leaders and internal low complexity regions are much longer than bacterial homologues (which lack these features), and thus BLAST queries with such proteins result in high expect values even when the regions outside of the P. falciparum-specific segments are very similar.

Table 2.2 Complete genomes queried for homologues of P. falciparum genes

EUBACTERIA
  Aquificales: Aquifex aeolicus
  Thermotogales: Thermotoga maritima
  Deinococcus/Thermus: Deinococcus radiodurans
  Spirochaetes: Borrelia burgdorferi, Treponema pallidum
  Cyanobacteria: Synechocystis sp.
  Chlamydiales: Chlamydophila pneumoniae, Chlamydia trachomatis, Chlamydia muridarum
  Firmicutes
    Actinobacteria: Corynebacterium glutamicum, Mycobacterium tuberculosis, Mycobacterium leprae
    Low GC group: Bacillus subtilis, Bacillus halodurans, Staphylococcus aureus, Clostridium acetobutylicum, Lactococcus lactis, Mycoplasma genitalium, Mycoplasma pulmonis, Mycoplasma pneumoniae, Streptococcus pneumoniae, Ureaplasma urealyticum
  Proteobacteria
    Alpha: Agrobacterium tumefaciens, Caulobacter crescentus, Mesorhizobium loti, Sinorhizobium meliloti, Rickettsia conorii, Rickettsia prowazekii
    Beta: Neisseria meningitidis
    Epsilon: Campylobacter jejuni, Helicobacter pylori
    Gamma: Buchnera sp., Escherichia coli, Haemophilus influenzae, Pasteurella multocida, Pseudomonas aeruginosa, Vibrio cholerae, Xylella fastidiosa, Yersinia pestis

ARCHAEA
  Euryarchaeota: Archaeoglobus fulgidus, Halobacterium sp., Methanothermobacter thermoautotrophicus, Methanocaldococcus jannaschii, Pyrococcus abyssi, Pyrococcus horikoshii, Thermoplasma acidophilum, Thermoplasma volcanium
  Crenarchaeota: Aeropyrum pernix, Sulfolobus solfataricus, Sulfolobus tokodaii

EUKARYOTA
  Arabidopsis thaliana, Caenorhabditis elegans, Drosophila melanogaster, Guillardia theta (nucleomorph genome), Saccharomyces cerevisiae

Putative homologue sets were manually culled according to the following ad hoc rules:
1) Because a high expect score cutoff was used, some BLAST hits represented sequences with matches to only functional protein domains that are small and common in a wide variety of proteins. Such proteins may have evolved through domain shuffling and are not generally suitable for phylogenetic reconstruction. To remove such hits, genes with regions of similarity that extended over less than 25% of the hit's total peptide were excluded. In a few cases, this eliminated all BLAST search results as possible homologues of the P. falciparum gene. Furthermore, the schematic visual representation of search results that was provided with NCBI's BLAST output was inspected. Genes whose regions of similarity were limited to less than half of the regions of similarity of the majority of other search results were excluded. Hits excluded from the homologue set because of the above rules always had high expectation scores.
2) If a particular species had more than one hit to the query, they were considered paralogous genes, and only the gene with the lowest E score was retained.
3) If all (or most) species contained more than one hit and the multiple sequences formed clear similarity groups (i.e.
a copy in each species with the same E value), then the multiple hits were treated as orthologous gene families and were retained. Gene function annotations were sometimes helpful in identifying orthologous gene families.
4) To produce phylogenies that are useful for looking at lateral gene transfer, broad taxonomic sampling is needed. Therefore, the species with the lowest E value was chosen as a representative for its taxonomic family.
5) The final set of orthologues for each P. falciparum gene was limited to one hundred sequences, in order to facilitate timely phylogenetic reconstruction. If, following removal of sequences as described above, the set of orthologues exceeded 100 sequences, only those with the 100 lowest expect scores were retained.

2.2.5 Sequence alignments

Amino acid sequences of P. falciparum genes and their sets of homologues were aligned using the CLUSTALX sequence alignment program [115]. Analysis was conducted at the protein level because amino acid data only reflect replacement substitutions in the DNA (i.e. those substitutions that change the amino acid specified by the codon). Silent substitutions often have species-specific base biases, indicating that a different mutation or selective force is operating in different lineages. This sort of differential selection can lead to sequence convergence, which causes errors in inferred phylogenetic relationships [116]. This is especially a concern for the AT-biased P. falciparum. Analyses of amino acid instead of DNA sequences should thus reduce the confounding effect of sequence convergence on phylogenetic analysis. Multiple sequence alignments were visually inspected. The 'show low-scoring segments' option in CLUSTALX was invoked, which highlights regions of poor alignment quality. These regions represent segments of the peptide that are too divergent to be reliably aligned [115].
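The automatable parts of the homologue-culling rules listed in section 2.2.4 can be sketched in Python as follows. This is an illustration only: the culling in this study was done manually, rule 3 (recognising orthologous gene families) and the visual inspection in rule 1 required judgement and are not modelled, and the `BlastHit` record and `cull_hits` function are hypothetical names introduced for the example.

```python
from dataclasses import dataclass


@dataclass
class BlastHit:
    """Summary of one BLAST hit (hypothetical record layout)."""
    accession: str
    species: str
    family: str
    e_value: float
    aligned_length: int   # residues of the hit covered by the region of similarity
    hit_length: int       # total residues in the hit


def cull_hits(hits, max_e=1e-3, min_coverage=0.25, max_sequences=100):
    """Apply the E value cutoff and rules 1, 2, 4 and 5 in order."""
    # E value cutoff, then rule 1: similarity must span at least 25% of the hit.
    kept = [
        h for h in hits
        if h.e_value < max_e and h.aligned_length / h.hit_length >= min_coverage
    ]

    # Rule 2: keep only the lowest-E hit for each species.
    best_by_species = {}
    for h in kept:
        current = best_by_species.get(h.species)
        if current is None or h.e_value < current.e_value:
            best_by_species[h.species] = h

    # Rule 4: keep only the lowest-E species for each taxonomic family.
    best_by_family = {}
    for h in best_by_species.values():
        current = best_by_family.get(h.family)
        if current is None or h.e_value < current.e_value:
            best_by_family[h.family] = h

    # Rule 5: retain at most the 100 hits with the lowest E values.
    ranked = sorted(best_by_family.values(), key=lambda h: h.e_value)
    return ranked[:max_sequences]
```

The rules are applied sequentially so that the cap of one hundred sequences acts only on hits that survived the earlier filters.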
Sequences were removed if their low-scoring segments comprised more than one-third of the total sites in the alignment. When a putative homologue was removed because of a high proportion of low quality segments, the multiple alignment was re-calculated with the remaining homologues. In some cases, the P. falciparum sequence was very divergent and the alignment revealed low-scoring segments for over one third of the protein. In these cases, the P. falciparum gene was dropped from further analyses. Alignments were edited to remove, from all species, aligned sequence positions (sites) that contained gaps. Internal gaps represent insertions or deletions (indels) in the evolution of a gene. Because phylogenetic inference methods are based on models of amino acid substitution and indels are formed through quite different processes than substitution, gap sites must be excluded from phylogenetic analyses [116]. The proteins used in the alignments were of variable length, so terminal gaps were also present. The ends of sequences were cropped from the alignment to exclude all sites with terminal gaps. Very short sequences with long terminal gaps were excluded entirely from the homologous protein set, as removal of their terminal gaps left a sample size of sites that was too small for phylogenetic analyses. In addition, sites from all species were excluded from the alignment if positional homology of residues was uncertain. These were visibly apparent as positions where little residue similarity was evident (figure 2.1) and corresponded to sites where low-scoring segments were present in all or most sequences. Site-exclusion editing of alignments was done using GeneDoc [117] version 2.6.002.
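The gap-removal step can be illustrated with a short Python sketch. It covers only the mechanical part of the editing, namely dropping every alignment column in which any sequence has a gap (internal or terminal). The manual exclusion of sites with uncertain positional homology is not modelled, and the function name and the choice of gap characters are assumptions made for the example; the editing in this study was done interactively in GeneDoc.

```python
def strip_gapped_columns(alignment, gap_chars="-."):
    """Remove every column that contains a gap in at least one sequence.

    `alignment` maps sequence names to aligned sequences of equal length.
    Returns a new mapping with the same names and gap-free columns only.
    """
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all aligned sequences must have the same length")

    names = list(alignment)
    # Transpose the alignment so that each tuple is one column (site).
    columns = zip(*(alignment[name] for name in names))
    kept = [col for col in columns if not any(c in gap_chars for c in col)]

    # Transpose back to one string per sequence.
    return {name: "".join(col[i] for col in kept) for i, name in enumerate(names)}
```

For example, an alignment of `MK-LVA`, `MKQLV-` and `MRQLVA` loses its third and sixth columns, leaving four sites per sequence.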
Figure 2.1 Example of a divergent region in an alignment. Sites, such as those in the box, are excluded from consideration in distance calculations because the positional homology of the amino acids cannot be assumed. For instance, the circled leucine residue could be placed anywhere along the gap in front of it, without worsening the alignment.

2.2.6 Phylogenetic reconstruction

Phylogenetic inferences were made by constructing trees based on pairwise distances between sequences. Although maximum likelihood methods can outperform distance-based methods in providing the correct tree, maximum likelihood methods require much more computation time than distance methods [116]. The time required increases exponentially with each additional taxon, and maximum likelihood often becomes intractable when more than 15 taxa are to be evaluated. Consequently, distance-based methods were more suitable for phylogenetic analyses of the P. falciparum genes and their large sets of homologues. Maximum likelihood distance estimates were calculated with Tree-Puzzle [118] version 5.0, using the WAG substitution model [119] and amino acid frequencies estimated from the data. Distances were corrected for among-site rate heterogeneity with a "mixed" model, described by a discrete gamma-distribution with eight rate categories plus an additional category for invariant sites. Tree-Puzzle calculated both the alpha shape parameter of the gamma distribution (a) and the fraction of invariant sites (i). One hundred bootstrap datasets were generated using the Seqboot program of Phylip [120].
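The idea behind generating bootstrap datasets can be sketched in Python. The sketch shows generic nonparametric bootstrapping of alignment sites (sampling columns with replacement until the replicate has as many sites as the original); it is not the Seqboot implementation, and the function names and the fixed random seed are assumptions made for the example.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Build one replicate by sampling alignment columns with replacement."""
    names = list(alignment)
    n_sites = len(alignment[names[0]])
    sampled = [rng.randrange(n_sites) for _ in range(n_sites)]
    # The same sampled columns are used for every sequence, so each
    # replicate column is an intact column of the original alignment.
    return {name: "".join(alignment[name][c] for c in sampled) for name in names}


def bootstrap_replicates(alignment, n_replicates=100, seed=0):
    """Generate a list of bootstrap replicates of an alignment."""
    rng = random.Random(seed)
    return [bootstrap_alignment(alignment, rng) for _ in range(n_replicates)]
```

Each replicate would then be passed through the same distance and tree-building steps as the original alignment, and the proportion of replicate trees containing a given clade gives its bootstrap value.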
Distances for bootstrapped sequence sets were calculated by Tree-Puzzle using the same parameters as above, except that, to reduce computation time, values for the a and i parameters were set at the values previously calculated. The potential biases introduced by this simplification were not explored. The shell script Puzzleboot [121] was used to feed bootstrap datasets to Tree-Puzzle and to produce a concatenated file of distance matrices for all bootstrap replicates. Unrooted phylogenies were constructed from distance matrices using a neighbour-joining algorithm as implemented in BioNJ [122]. The Consense program in Phylip [120] was used to calculate bootstrap values.

2.3 Results and Discussion

This section presents results of the BAEwatch analysis and the phylogenetic analyses of individual genes. Extensive discussion is included with the results in each section.

2.3.1 BAEwatch

BAEwatch was used to evaluate 2073 P. falciparum ORFs. About 65% of these ORFs had some similarity to existing sequences in the protein database. The sequencing projects for chromosomes 2 and 3 [82, 83] performed a similar BLAST analysis on the set of ORFs identified on those chromosomes, but reported much lower match percentages (table 2.3). The higher value found by the BAEwatch analysis is probably not an indication of some intrinsic property of the set of 2073 P. falciparum ORFs (which includes the sequences from chromosomes two and three). Rather, this high value occurs because the protein databases searched by BAEwatch were substantially larger than those searched by the chromosome two and three sequencing projects two years earlier, and thus the likelihood of finding matching gene segments was increased. BAEwatch identified 1.4% of P. falciparum ORFs as being most similar to a bacterial gene. This value was lower than that reported for chromosome 2 (table 2.3).
This may be because the scoring system implemented by BAEwatch considers generally-conserved proteins as not significant, while the values reported for chromosome two were based purely on BLAST hits.

Table 2.3 Summary of BLAST searches with P. falciparum sequence sets.

Description                                              Chr 2 [82]    Chr 3 [83]      BAEwatch
Total ORFs evaluated                                     210           217             2073
ORFs with similarity to sequences in protein databases   120 (57%)     94 (44%)        1349 (65%)
ORFs most similar to bacterial sequences                 16 (7.6%)     not reported    28 (1.4%)

BAEwatch identified thirty P. falciparum ORFs as 'bacteria-like'. Two ORFs were discarded as false positives, because the query of the protein database at NCBI in October 2001 (see methods) revealed that the most similar proteins were from eukaryotes. Another fifteen ORFs were excluded from phylogenetic analysis for the following reasons:
1) One P. falciparum ORF was an exact copy of a short segment of another P. falciparum gene that was identified by BAEwatch. This short ORF was located on a different chromosome than the longer sequence. The BLAST lists of similar sequences were identical for the short and long ORFs, although BLAST hits had lower expect values with the longer sequence because regions of similarity existed that were not present in the shorter sequence. The shorter sequence was discarded from further analysis, as it was clearly a recent duplication.
2) Two ORFs were very short (<80 amino acids), and were discarded because they did not contain enough informative sites for phylogenetic analysis.
3) Five P. falciparum sequences were discarded because no significant potential homologues were found. In these genes, the regions of similarity that were identified by BLAST were limited to small common motifs that are found in a large number of proteins. Notably, in four of these five cases, the P. falciparum gene was unusually long (over 1800 amino acids).
4) Seven sequences were too divergent to give reasonable alignments with putative homologues.
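The counts given above can be checked with a few lines of Python. The numbers are taken directly from the text and table 2.3; the variable names and dictionary keys are labels chosen for this illustration.

```python
# ORFs flagged by BAEwatch, and those later shown to be false positives.
identified = 30
false_positives = 2

# ORFs excluded from phylogenetic analysis, by reason (items 1 to 4 above).
excluded = {
    "copy_of_longer_orf": 1,
    "too_short": 2,
    "no_significant_homologues": 5,
    "too_divergent_to_align": 7,
}

bacteria_like = identified - false_positives        # entry in table 2.3
analysed = bacteria_like - sum(excluded.values())   # genes taken forward

total_orfs = 2073
percent_bacteria_like = round(100 * bacteria_like / total_orfs, 1)

print(sum(excluded.values()), bacteria_like, analysed, percent_bacteria_like)
```

The four exclusion categories sum to the fifteen excluded ORFs, and the 28 ORFs remaining after removal of false positives correspond to the 1.4% reported in table 2.3.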
This left 13 sequences (named in table 2.4), which were subjected to phylogenetic analysis. It should be emphasized that the approach taken in this study does not guarantee that all lateral gene transfers in the P. falciparum dataset were found. BAEwatch is used as a starting point for likely candidates, and genes that were deemed not amenable to phylogenetic reconstruction may be good targets for study by other methods.

2.3.2 Phylogenetic analysis of individual 'bacteria-like' genes

The following eleven sections present the phylogenetic analyses conducted on the individual 'bacteria-like' genes identified by BAEwatch. Two of these sections each cover a pair of related proteins: the type II topoisomerase B and A subunits are evaluated together, as are GTP-binding proteins 1 and 2. For each gene, sequence features (e.g. the presence of targeting peptides) and phylogenetic relationships are examined to determine if a lateral gene transfer event has taken place, and if so, what the probable origin of the laterally transferred gene is.

Table 2.4 List of P. falciparum genes evaluated by phylogenetic analyses

Putative function                                       Chromosome location
Type II topoisomerase: B subunit                        12
Type II topoisomerase: A subunit                        12
Adenylosuccinate lyase                                  2
Elongation factor G                                     12
Ribosome release factor                                 2
Proteasome subunit HslV                                 12
Pseudouridine synthase                                  2
rRNA methyltransferase                                  12
rRNA 3' terminal phosphate cyclase                      12
rRNA adenine dimethylase                                12
S-adenosyl methionine-dependent methyltransferase       12
GTP-binding protein 1                                   12
GTP-binding protein 2                                   3

Type II topoisomerase

Two bacteria-like genes in P. falciparum encode type II DNA topoisomerases. These enzymes alter the topological state of DNA in the cell by creating transient double-stranded breaks. Type II topoisomerases are essential to all cells for replication, separation of replicated chromosomes, and maintenance of superhelicity [123].
Both eukaryotes and prokaryotes possess type II topoisomerases, although in different forms. The eukaryotic topoisomerase II is a homodimer of a single polypeptide of approximately 1500 amino acids. Eubacteria typically possess two type II topoisomerases: DNA gyrase and topoisomerase IV. Both of these enzymes are heterotetramers composed of two different subunits, designated A and B for gyrase, and ParC (or A) and ParE (or B) for topoisomerase IV [124]. These type II topoisomerases form a homologous family. The N-terminal half of the eukaryotic gene (topoisomerase II) is homologous to the B subunits from gyrase and topoisomerase IV, and the C-terminal half is homologous to the A subunits (figure 2.2). A number of individual bacterial species have only one type II topoisomerase (see figures 2.3, 2.4). Among mesophiles, these species are exceptions and probably reflect lineage-specific loss of one of the two enzymes. The eubacterial hyperthermophiles have only one type II topoisomerase. Interestingly, the archaea possess a type II topoisomerase known as topoisomerase VI that, while sharing the two-subunit heterotetramer structure of the eubacterial proteins, is non-homologous to the eukaryotic and bacterial types [124]. However, some of the completely sequenced archaeal genomes also possess A and B subunits of gyrase.

Figure 2.2 Architecture of Type II Topoisomerases. The four main forms of type II topoisomerase are shown: Eukaryotic topoisomerase II (top panel), Archaeal topoisomerase, Eubacterial gyrase, and Eubacterial topoisomerase IV. The Archaeal and Eubacterial proteins contain a B subunit (middle panel) and an A subunit (bottom panel). Species or taxonomic groups that differ from the main form are shown beneath that form. Insertions are depicted below sequences, and homologous insertions are shaded alike. Putative targeting peptides are cross-hatched. MT, mitochondrion-localized; P. falciparum, Plasmodium falciparum; G. theta, Guillardia theta nucleomorph; A. thaliana, Arabidopsis thaliana; D. discoideum, Dictyostelium discoideum.

P. falciparum has a eukaryotic topoisomerase II [125]. P. falciparum also appears to have the bacterial form of type II topoisomerase. The set of homologues for two 'bacteria-like' proteins identified by BAEwatch on chromosome 12 consists of A and B subunits of gyrase and topoisomerase IV from bacterial species.

The alignment of the B subunit of P. falciparum and its set of homologues provided 510 unambiguous positions that were used to build a phylogenetic tree. Gyrase and topoisomerase IV each form separate, well-supported clades (figure 2.3) among the low GC firmicutes and proteobacteria, suggesting that these genes arose from a duplication within these groups of eubacteria. P. falciparum, like many of the bacteria, has only one B subunit type II topoisomerase. The sequence groups with topoisomerase IV from the chlamydiales and Borrelia burgdorferi (a spirochaete). This placement has some bootstrap support (68%), but the clade should be treated with caution: it consists of relatively divergent sequences with long branches and could be an artefact of long branch attraction.

The A subunits of topoisomerase IV are well conserved only in the N-terminal half of the peptide, and so only 262 unambiguously aligned positions were used to build the phylogeny (figure 2.4).
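The bootstrap percentages quoted for these trees come from rebuilding the phylogeny on resampled versions of the alignment. As a minimal sketch of the resampling step only (not the distance calculation or tree building), assuming a toy alignment held as equal-length strings, columns can be drawn with replacement until the pseudo-replicate has the original length. The function name `bootstrap_columns` and the toy sequences are illustrative assumptions and are not taken from the analysis pipeline used here.

```python
import random


def bootstrap_columns(alignment, rng):
    """Return one bootstrap pseudo-replicate of an alignment.

    alignment: dict mapping sequence name -> aligned sequence (equal lengths).
    Columns are sampled with replacement, and the same column indices are
    applied to every sequence so that homologous sites stay together.
    """
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all aligned sequences must have the same length")
    n_sites = lengths.pop()
    picks = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {name: "".join(seq[i] for i in picks) for name, seq in alignment.items()}


# Hypothetical toy alignment, for illustration only
toy = {"taxon_a": "MKVLA", "taxon_b": "MKILA", "taxon_c": "MRVLG"}
replicate = bootstrap_columns(toy, random.Random(0))
```

In a full analysis, a tree would be inferred from each of many such replicates, and the support for a clade would be the percentage of replicate trees in which that clade appears.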
As in the B subunit tree, gyrase and topoisomerase IV fall into separate clades, and higher order bacterial taxa form well-supported groups. Again, the P. falciparum sequence groups with the chlamydiales and B. burgdorferi, albeit with low bootstrap support (50%) and a long branch.

In addition to P. falciparum, two other eukaryotes have B and A subunits of a type II topoisomerase: Arabidopsis thaliana, a plant, and Guillardia theta (the subunits are found in the nucleomorph genome, the remnant nucleus of a secondary symbiosis). These organisms have plastids and their sequences group convincingly with the cyanobacteria in both the B subunit (figure 2.3) and A subunit (figure 2.4) trees.

Figure 2.3 Phylogeny of the B subunit of Type II topoisomerases. The P. falciparum sequence is boxed for reference. The phylogeny is an unrooted BioNJ tree based on rate-corrected maximum-likelihood distances calculated with TreePuzzle. Bootstrap values over 50% are shown at nodes. Taxonomic groups are shown to the right of species' names for clades supported by bootstrap values. The scale bar represents 0.1 amino acid changes per site. G, gyrase; T, topoisomerase IV. Where no suffix letter is present for species names, only one type II topoisomerase form has been identified.

Figure 2.4 Phylogeny of the A subunit of Type II topoisomerases. See figure 2.3 for details.

The gene architecture for the B and A subunits clearly shows that P. falciparum, G. theta and A. thaliana all have N-terminal extensions reminiscent of leader sequences (figure 2.2). All three of these B subunit extensions are predicted to be plastid targeting, but of the A subunits, curiously, only the A. thaliana protein is predicted to be localized to the plastid. The G. theta leader is predicted to be a signal peptide (indicating that the A subunit is exported), and the P. falciparum leader has no predicted function. As type II topoisomerase subunits do not have any function on their own [124], it is likely that B and A subunits are localized to the same place in a cell if they are functional. It is possible that A subunits from G. theta and P. falciparum have some as yet unidentified way of ending up in the plastid alongside their B subunit counterparts; these sequences may use a targeting peptide that is not recognized by the PATS prediction algorithm, or perhaps the A subunit 'hitches a ride' with the B subunit.

The combination of localization leaders and specific phylogenetic affiliation with the cyanobacteria leads to the conclusion that the 'bacteria-like' type II topoisomerase subunits originated in the plastid and were transferred to the nucleus in the case of A. thaliana and to the nucleomorph (the remnant nucleus of the secondarily symbiotic plastid) in the case of G. theta. Similarly for P.
falciparum, the 'bacteria-like' topoisomerase subunits most probably function in the apicoplast, but the origin of the gene is less certain, as the apparent relationship to the B. burgdorferi/Chlamydiales clade is inconsistent with apicoplast inheritance.

Figure 2.5 Phylogeny of type II topoisomerase. The phylogeny is based on an alignment of eukaryotic topoisomerases with concatenated B and A subunits of bacterial topoisomerases. For clarity, only major nodes of large bacterial clades are labelled with bootstrap values. MT, mitochondrially localized; G, gyrase; T, topoisomerase IV; E, topoisomerase II; the protein of species with no suffix letter is of the bacterial form (with separate A and B subunits). See figure 2.3 for details.

To explore the relationship between prokaryotic and eukaryotic type II topoisomerases, the B and A subunits of prokaryotic gyrase and topoisomerase IV, as well as those of the 'bacteria-like' eukaryotic sequences (P. falciparum, A. thaliana, G. theta), were concatenated and aligned with eukaryotic topoisomerase II, providing 659 unequivocally aligned sites for phylogenetic analysis. Eukaryotic sequences form a well-defined clade separate from the prokaryotic sequences (figure 2.5), as we would expect from the architecture of type II topoisomerases (figure 2.2). Additionally, the phylogeny shows that in both A. thaliana and P. falciparum the typical eukaryotic topoisomerase II is not related to the bacterial subunit forms.
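The concatenation step can be pictured as joining each species' B subunit to its A subunit, in that order, so that the result is co-linear with the single-chain eukaryotic enzyme (whose N-terminal half matches B and whose C-terminal half matches A). The following is a minimal sketch under the assumption that sequences are stored in dictionaries keyed by species name; the function name and the toy sequences are hypothetical and do not reproduce the data used for figure 2.5.

```python
def concatenate_subunits(b_subunits, a_subunits):
    """Join B and A subunit sequences per species, B first.

    In this sketch, species lacking either subunit are skipped so that
    every concatenated sequence covers both halves of the enzyme.
    """
    shared = [species for species in b_subunits if species in a_subunits]
    return {species: b_subunits[species] + a_subunits[species] for species in shared}


# Hypothetical toy sequences, for illustration only
b = {"Species one": "MSNSY", "Species two": "MTEEQ"}
a = {"Species one": "MSDLA", "Species three": "MAEIR"}
joined = concatenate_subunits(b, a)
```

Putting B before A matters: reversing the order would place the gyrase A homologue against the N-terminal half of topoisomerase II, where it has no counterpart.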
In the concatenated tree, the eukaryotic clade forms the longest branch. P. falciparum has been displaced as the sister group to the Chlamydiales/B. burgdorferi clade by the long eukaryotic clade branch, and P. falciparum now falls in a relatively well-supported clade composed of the cyanobacterial and the plastid-targeted eukaryotic genes. The observation that the two longest branches unite in all three trees gives credence to the conjecture that the placement of the P. falciparum sequence in the B and A subunit trees (figures 2.3 and 2.4) is spurious. The placement of P. falciparum with the cyanobacteria in figure 2.5 lends support to a plastid origin of the bacterial form of type II topoisomerase in P. falciparum.

Adenylosuccinate lyase

The adenylosuccinate lyase gene in P. falciparum is located on chromosome 2. Adenylosuccinate lyase is a bifunctional enzyme that catalyzes a step in de novo purine biosynthesis; it can also form AMP from adenylosuccinate as part of a purine salvage pathway. P. falciparum is incapable of purine biosynthesis, so its enzyme functions only in purine salvage [126].

The adenylosuccinate lyase gene from P. falciparum encodes a protein 471 amino acids in length. It is easily aligned with its group of homologues and is relatively conserved, allowing 331 sites to be considered in the phylogenetic analysis. The tree inferred from this alignment nests the P. falciparum sequence within the gamma group of proteobacteria (figure 2.6). The tree recovers some major taxonomic groups; specifically, the animal and fungal eukaryotes, low GC firmicutes, and cyanobacteria all form highly supported clades. Only the alpha and epsilon divisions of proteobacteria are monophyletic. The gamma and beta proteobacteria fall in a highly supported mixed group with three eukaryotes and an archaeon, Halobacterium sp.
The archaea, with the exception of Halobacterium sp., form a well-supported group with internal branches leading to the animal/fungal eukaryotic clade and the mixed group. Both the fungal/animal eukaryotic clade and the mixed clade form distinct groups from the rest of the tree. This observation is borne out by the presence of group-defining indels in the multiple sequence alignment of adenylosuccinate lyase homologues (figure 2.7).

Figure 2.6 Phylogeny of adenylosuccinate lyase. The 'mixed' clade contains gamma and beta proteobacteria, an archaeon, and eukaryotes. Details as in figure 2.3.

The relationship of P. falciparum to the gamma proteobacteria is intriguing. Two other eukaryotic sequences, from A. thaliana and Leishmania major, also group within the gamma proteobacteria. None of these three eukaryotic sequences possesses a leader peptide. Moreover, the phylogenetic position of these eukaryotic sequences excludes both the cyanobacteria and the alpha proteobacteria, precluding an organellar origin for this gene. As an aside, the sequencing project of P. falciparum chromosome 2 [82] identified adenylosuccinate lyase as an organellar gene, presumably based on an apparent targeting peptide. There is no biochemical indication that this is so, and I found no evolutionary evidence for their conclusion. I analysed the adenylosuccinate lyase protein with the plastid-targeting peptide prediction software PATS [111], a neural network prediction tool that has been trained specifically on confirmed P. falciparum plastid-targeting sequences. PATS did not identify a plastid-targeting sequence.

The three eukaryotic sequences that fall in the mixed clade have in common their exclusion from the other eukaryotic sequences. The eukaryotic clade (figure 2.6) is composed of animals and fungi only, which are generally considered more closely related to each other than either is to the plants. Because the branching order within the mixed clade is not resolved in this tree, a new tree was constructed using sequences from just this clade. Alignment of only this clade allowed an additional 22 informative sites to be included in the analysis. Because Halobacterium sp. is excluded from the other sequences in the mixed clade with a bootstrap value of 92% (see figure 2.6), it was used to root the mixed clade sub-tree (figure 2.8).
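Restricting an alignment to a single clade can recover sites because a column only has to be reliable across the sequences that remain. The sketch below illustrates this with a deliberately crude stand-in for "unambiguously aligned", namely a column that contains no gap character in any sequence. The criterion actually used to select sites is not stated in this section, so the rule, the function name `gap_free_columns`, and the toy sequences are all assumptions made for illustration.

```python
def gap_free_columns(alignment, gap="-"):
    """Return the indices of columns in which no sequence has a gap."""
    seqs = list(alignment.values())
    if not seqs:
        return []
    n_sites = len(seqs[0])
    if any(len(s) != n_sites for s in seqs):
        raise ValueError("all aligned sequences must have the same length")
    return [i for i in range(n_sites) if all(s[i] != gap for s in seqs)]


# Hypothetical toy alignment: two 'mixed clade' sequences and two others
full = {
    "mixed_1": "MKVLAGT",
    "mixed_2": "MKILAGS",
    "other_1": "MK--AGT",
    "other_2": "MR-LA-T",
}
subset = {name: seq for name, seq in full.items() if name.startswith("mixed")}

usable_full = gap_free_columns(full)      # columns usable across all sequences
usable_subset = gap_free_columns(subset)  # columns usable within the clade only
```

In this toy case the clade-only alignment keeps three more columns than the full alignment, which mirrors, in miniature, the extra sites gained for the mixed clade sub-tree.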
This tree has slightly better resolution than the mixed clade in the large tree. The branching topology resulting from rooting with an archaeal species makes taxonomic sense: the trypanosomatid L. major and the apicomplexan P. falciparum are the most basal eukaryotes, with A. thaliana branching next, and the presumed anomalous gamma/beta proteobacteria form a single clade.

Figure 2.8 Phylogeny of adenylosuccinate lyase 'mixed' clade. Only sequences belonging to the mixed clade identified in figure 2.6 were used to build the phylogeny. The phylogeny is a BioNJ tree made with rate-corrected maximum likelihood distances, and rooted on the Halobacterium sp. sequence. Bootstrap values over 50% are shown at nodes.

The overall phylogeny (figure 2.6) supports four distinct clades: the bacteria (excluding beta and gamma proteobacteria), the animal and fungal eukaryotes, the archaea (excluding Halobacterium sp.), and the mixed clade. The bacterial clade is distinct from the animal and fungal eukaryotes, the archaea (excluding Halobacterium sp.), and the mixed clade, but the branching order between these last three clades is not resolved and all are equally likely to be related to the others.

The simplest hypothesis to explain the phylogeny of adenylosuccinate lyase (figure 2.6) involves at least two lateral transfers. If we assume that the mixed clade branching order is as shown in figure 2.8, then a lateral transfer event from an early eukaryote-like organism to an ancestor of the extant gamma and beta proteobacteria would account for the mixed clade.
Additionally, the animal/fungal clade would have had to acquire its gene from an archaeon, through lateral transfer, to account for its position outside Halobacterium sp. More adenylosuccinate lyase sequences are needed to determine whether these hypotheses are correct. Specifically, sequences from lower eukaryotes and plants are needed to break up the long branches connecting the animal/fungal eukaryotic clade with the mixed clade.

Elongation factor G

A 'bacteria-like' gene for elongation factor G was identified by BAEwatch on chromosome 12 of P. falciparum. Elongation factor G (EF-G) is one of a family of ubiquitous translation elongation factors that are found in all organisms [127]. EF-G is believed to have arisen from a very early duplication event, because a paralogous gene is found in organisms from all three domains. Consequently, EF-G has been used extensively to study various questions concerning evolutionary events at the base of the tree of life [128-131].

Because homologues were available from many organisms, the set of homologues of the P. falciparum sequence was chosen by taking the top 50 sequences (those with the lowest expect scores) from the BLAST search of the protein database at NCBI. These sequences provided 640 well-aligned sites to build the phylogeny (figure 2.9). Upon discovering that all the eukaryotic sequences that happened to be included in the set of homologues were likely of mitochondrial origin (figure 2.9), the phylogeny was regenerated with the inclusion of two archaeal and two nuclear eukaryotic EF-G (also known as EF-2) genes. This analysis (not shown) resulted in a less well-resolved phylogeny of the same overall topology, with a longer branch joining the eukaryotic and archaeal sequences to the backbone of the tree. In this additional phylogeny, the P. falciparum sequence grouped within a moderately supported (72 percent bootstrap) clade that included spirochaetes, cyanobacteria, organellar sequences, and alpha proteobacteria.
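Choosing the EF-G homologue set by lowest expect score amounts to sorting the BLAST hits by E-value and keeping the first 50. A minimal sketch of that selection follows, assuming the hits have already been parsed into (identifier, E-value) pairs. The function `top_hits`, the identifiers, and the E-values are invented for illustration and are not part of BLAST or BAEwatch.

```python
def top_hits(hits, n=50):
    """Return the n hits with the lowest expect (E) values.

    hits: iterable of (sequence_id, e_value) pairs.
    Python's sort is stable, so tied hits keep their input order.
    """
    return sorted(hits, key=lambda hit: hit[1])[:n]


# Hypothetical parsed hits, for illustration only
example_hits = [
    ("seq_c", 3e-40),
    ("seq_a", 1e-120),
    ("seq_d", 0.002),
    ("seq_b", 5e-88),
]
best_two = top_hits(example_hits, n=2)
```

A cutoff by rank rather than by E-value threshold fixes the size of the data set, but, as the text notes, it gives no control over which taxa are represented, which is why nuclear eukaryotic and archaeal sequences had to be added separately.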
The exclusion of the eukaryote sequences from this clade was taken as evidence that the P. falciparum sequence was 'bacteria-like' and not closely related to eukaryotic nuclear EF-G.

The EF-G phylogeny presented in figure 2.9 supports several accepted taxonomic relationships, but with some important differences. The low GC firmicutes are monophyletic (ignoring the undefined placement of Clostridium acetobutylicum), as are the actinobacteria and the alpha, beta, gamma, and epsilon subgroups of proteobacteria. The proteobacteria as a group, however, are polyphyletic. The mitochondrion- and plastid-targeted sequences are each monophyletic. Unexpectedly, the mitochondrion-targeted eukaryotic sequences do not branch with the alpha proteobacteria. Rather, the alpha proteobacteria and the plastid-targeted sequences form sister clades. The cyanobacteria appear to be polyphyletic, with three representatives forming a well-supported clade and one representative, Synechocystis sp., related to Vibrio cholerae (a gamma proteobacterium). A strongly supported branch groups the P. falciparum sequence with the spirochaetes and mitochondrion-targeted eukaryotic clades.
Figure 2.9 Phylogeny of Elongation Factor G. The P. falciparum sequence is boxed for reference. Bootstrap-supported higher-order taxonomic clades are labelled to the right of the species names. Bootstrap values over 50% are shown at nodes. MT: mitochondrion-targeted, CP: chloroplast-targeted, as annotated in sequence databases. Other details as in figure 2.3.

The P. falciparum sequence has an N-terminal leader of 89 amino acids, which is similar in size to the leaders of the chloroplast-targeted A. thaliana and Glycine max sequences (figure 2.10). Moreover, this leader is predicted to be an apicoplast transit peptide. However, the phylogeny of EF-G shows that the P. falciparum sequence is much more closely related to the mitochondrion-targeted and spirochaete clade than to the cyanobacteria (the presumed progenitors of the plastid genes). The P.
falciparum/spirochaete/mitochondrion-targeted clade is strongly supported by bootstrapping (99 percent) and is also united by a unique indel (figure 2.10).

Figure 2.10 Multiple sequence alignment of elongation factor G. An N-terminal segment of elongation factor G from P. falciparum is shown aligned with homologues from other species. Shading indicates the level of residue conservation in each site column: light grey, over 60% conserved; dark grey, over 80% conserved; black, 100% conserved. Numbers preceding the eukaryotic sequences indicate the length of the N-terminal leader sequence. MT: mitochondrion-targeted, CP: chloroplast-targeted, as labelled in gene annotation.

The unusual organellar-bacterial relationships in the EF-G phylogeny complicate the interpretation of the origin of the P. falciparum gene. Any interpretation of the EF-G phylogeny as it is presented in figure 2.9 would involve multiple lateral transfer events, some more biologically plausible than others. For example, the clade that unites alpha proteobacteria and plant chloroplast-targeted sequences could be explained as follows. First, an EF-G gene was transferred from an ancestor of the alpha proteobacteria to the nucleus of an ancestor of plants. Such an event may not be all that unlikely, as an extant alpha proteobacterium has the ability to insert tumour-inducing genes into the nucleus of the plant cells it infects [132]. Subsequently, the alpha proteobacterial gene acquired a transit peptide and functionally replaced the cyanobacteria-like EF-G in the plastid.

The P. falciparum/spirochaete/mitochondrion-targeted clade is much harder to reconcile. The highly supported grouping of the P. falciparum gene with other eukaryote mitochondrion-targeted genes suggests that the P. falciparum gene was transferred from the mitochondrion to the nucleus, where it picked up a transit peptide and functionally replaced the cyanobacteria-like EF-G in the plastid. Under this hypothesis, however, we would expect the P.
falciparum gene and all eukaryote mitochondrion-targeted genes to branch within the alpha proteobacteria/plant chloroplast clade, and they are excluded with a bootstrap value of 80% (figure 2.9). We also must suppose that the spirochaete (eubacteria) genes have a plastid origin, which represents a rather inexplicable lateral gene transfer.

It is, of course, quite possible that the EF-G phylogeny is an artefact of the method used to align the sequences and reconstruct the phylogeny, caused by using models of evolution that do not accurately describe the real processes. Lopez et al. [131] examined EF-G sequences and concluded that, for this gene, variation in rates of change between sites is not best described by the gamma distribution (as used in the phylogenetic analysis I have presented here), but rather that rates co-vary between sites and between species. However, the phylogeny they presented for EF-G, built taking the covarion model into account, showed the same curious features as the phylogeny in figure 2.9: proteobacteria were not monophyletic, mitochondrion-targeted sequences were monophyletic, and chloroplast-targeted sequences clustered with alpha proteobacteria, far from the cyanobacteria.

Because the P. falciparum sequence is specifically related to the bacterial/eukaryotic organelle group rather than to the eukaryotic group of EF-G, the gene has been laterally transferred from a eubacterium or organelle to the P. falciparum nucleus. Both my phylogenetic analysis and that of Lopez et al. [131] suggest that the EF-G gene has experienced a complex evolutionary history, and thus phylogenetic evidence for the identity of the donor of this transfer remains elusive.

Ribosome release factor

Chromosome 2 of P. falciparum contains a gene resembling ribosome release factor, a protein that releases ribosomes from mRNA at the termination of protein biosynthesis [133]. The P.
falciparum gene encodes a protein 498 amino acids in length, while its set of homologues includes eubacterial sequences of about 180 amino acids and eukaryote organellar sequences of about 250 amino acids. The P. falciparum protein is much longer because of a 320 amino acid N-terminal extension. About half of this leader peptide is composed of low complexity sequence that contains 10 repeats of a heptamer. The P. falciparum sequence is predicted to be targeted to the apicoplast. This small protein is quite well conserved, and 140 sites were used for phylogenetic analysis.

The phylogeny of ribosome release factor is shown in figure 2.11. The tree recovers several expected phylogenetic groupings, although the branching order between these groups is not resolved. Groups consisting of chloroplast- and mitochondrion-targeted eukaryotic genes were also identified, and these clades were sister to cyanobacteria and alpha proteobacteria, respectively. Unfortunately, the placement of the P. falciparum gene is not resolved, and it is very divergent. In the absence of any phylogenetic evidence, the presence of an apicoplast-targeting leader sequence is taken as an indication of the gene's plastid origin.
1001 Helicobacter pylori Campylobacter jejuni plant plast id- targeted cyanobacter ia epsi lon proteobacter ia - Treponema pallidium 741— Chlamydophiia pneumoniae ~jchlamvdiales '— Chlamydia trachomatis ' _ l Mycoplasma pulmonis Aquifex aeolicus Ureaplasma urealyticum Plasmodium falciparum 67r Thermotoga maritima 53 • Deinococcus radiodurans Thermus thermophilus 55j Mesorhizobium loti -57Ji— Agrobacterium tumefaciens Brucella melitensis Zymomonas mobilis Caulobacter crescentus - Rickettsia conorii 711 : alpha proteobacter ia Schizosaccharomyces pombe Saccharomyces cerevisiae MT 69 r j — Yersinia pestis 6 2 | — Escherichia coli Buchnera ap. — Homo sapiens Arabidopsis thaliana Drosophila melanogaster 78 97r Pasteurella multocida Vibrio cholerae — Pseudomonas aeruginosa Neisseria meningitidis Ralstonia solanacearum Xylella fastidiosa eukaryo te mi tochondr ion- ta rgeted g a m m a & beta proteobacter ia 0.1 Figure 2.11 Phylogeny of Ribosome Release Factor. The sequence from P. falciparum is boxed for reference. MT:mitochondrion-localized, CP:chloroplast-localized, as identified in protein database annotation. Other figure details given in figure 2.3 53 Proteasome subunit HslV A gene for the proteasome subunit HslV was identified on chromosome 12 in P. falciparum. Proteasomes are cylindrical assemblies of protein-degrading enzymes. Three types of proteasomes are known. Eukaryotes and archaea possess a 20S proteasome. This 20S proteasome is curiously also found in one sole group of bacteria, the actinomycetes [134], and it has been suggested that this represents a lateral gene transfer to the actinomycetes from a non-eubacterial source [135]. A second type of proteasome, ClpAP/ClpXP, has been identified from almost all groups of eubacteria, although it is notably absent from the mycoplasmas (low GC firmicutes lacking a cell wall). These proteasomes are also present in some chloroplasts and in the human mitochondrion [134]. 
The protease subunit of ClpAP/ClpXP proteasomes is ClpP, and this gene is unrelated to the protease subunit of 20S. A third type of proteasome, the HslVU, has been identified in bacteria, but its distribution among bacterial groups is sporadic. It is not present in the cyanobacteria, actinomycetes, chlamydiales or mycoplasmas. All bacterial groups that have HslVU proteasomes also have the ClpAP/ClpXP proteasomes [134]. HslV is the protease subunit of HslVU type proteasomes. It shares some sequence features with the 20S protease subunit, and may be a distant homologue [136].

The main proteasome used in eukaryotic cells is the 20S. The ClpAP/ClpXP proteasomes are also found in eukaryotes and are associated with organelles. ClpP subunits are encoded on plastid genomes (e.g. [30, 137]) and plastid-targeted ClpP genes occur in nuclear genomes (e.g. [138, 139]). ClpP subunits have been shown to localize to the mitochondrion as well [140]. In P. falciparum, a 20S proteasome [96, 141] and an apicoplast-targeted, nuclear encoded ClpP [83] have been identified. In contrast to the widespread occurrence of the ClpP proteasome subunit, genes of the HslVU proteasome are unknown in eukaryotes, with the exception of the HslV from P. falciparum used in the HslV phylogeny presented here. None of the five completely sequenced eukaryote genomes queried in this analysis (see table 2.2) possesses an HslV homologue.

The HslV gene from P. falciparum is 187 amino acids long and shows no substantial length variation from its homologues. These genes are easily alignable, although the P. falciparum sequence is somewhat divergent at the N-terminal end of the peptide (figure 2.12). One hundred thirty-four amino acid site positions were deemed co-linear and used to construct the tree shown in figure 2.13. The phylogeny of HslV supports the major groups of bacteria fairly well, but does not resolve relationships between them. The HslV gene from P.
falciparum groups within a clade comprised of alpha proteobacteria (figure 2.13). In P. falciparum, the HslV has no identifiable leader sequence and is not predicted to be targeted to an organelle. The specific grouping of the P. falciparum sequence with those from alpha proteobacteria, however, indicates that this gene was likely transferred from the mitochondrion to the nucleus. Whether it is targeted to the mitochondrion by some unknown mechanism, or has found some function in the cytosol, remains to be seen. This gene is not located in any extant mitochondrial genome. The absence of this gene in both the nucleus and mitochondrion of other eukaryotes may mean that while this gene was transferred from the mitochondrion to the nucleus in a P. falciparum ancestor, it was lost in all other eukaryote lineages and was a gene of dispensable function in mitochondria. This hypothesis is consistent with the absence of a mitochondrion-targeting peptide in P. falciparum.

[Figure 2.12: multiple sequence alignment of HslV, not reproduced here]

[Figure 2.13: phylogenetic tree, not reproduced here]
Figure 2.13 Phylogeny of protease subunit HslV. The P. falciparum sequence is boxed for reference. Supported clades are identified with taxonomic affiliation. Details are as in figure 2.3.

If the HslV subunit is still functional in the cytosol of P. falciparum, it may have a unique function, an interesting possibility that deserves further study.

Pseudouridine synthase

P. falciparum contains a gene located on chromosome 2 that encodes a pseudouridine synthase. This gene was sequenced and identified by the P. falciparum chromosome 2 sequencing project [82]. Pseudouridine synthases form the modified base pseudouridine from uridines in various RNAs [142]. The set of homologues for the P. falciparum sequence consists of eubacterial RsuA pseudouridine synthases. These enzymes modify a specific uracil in the rRNA of the small ribosomal subunit. Previous studies have noted that among the many types of pseudouridine synthases, the RsuA pseudouridine synthases appear to form a homologous group [142, 143].

The P. falciparum gene is 338 amino acids long. RsuA homologues from different bacteria vary in length and are relatively divergent. The conserved regions in the multiple sequence alignment are separated by stretches with little or no similarity. Consequently, only 113 amino acid sites were used for phylogenetic reconstruction. Unfortunately, the tree that resulted from this analysis is not well resolved, and has generally low bootstrap values (figure 2.14).

Many of the bacterial species have multiple RsuA homologues. The phylogeny of RsuA (figure 2.14) groups the multiple homologues into four clades for the proteobacteria, two clades for the low GC firmicutes, and two clades for the cyanobacteria. Although bootstrap support is low for these clades, all bacterial species with multiple homologues show the same pattern: a particular homologue is more closely related to sequences from close relatives (defined a priori - see table 2.2) than either to a sequence from a more distant relative or to another homologue from its own species. This lends confidence to the delineation of homologous groups.

[Figure 2.14: phylogenetic tree, not reproduced here]
Figure 2.14 Phylogeny of RsuA pseudouridine synthase. The P. falciparum sequence is boxed for reference. Taxonomic affiliation of species is labelled. Species suffix numbers represent general group associations, and are independent of the number of homologues in any one species (e.g. S. typhimurium has three homologues named 1, 3 and 4).

There is only one highly supported larger clade in the tree, which unites group one of cyanobacteria with the proteobacteria group four (figure 2.14). This association is supported in the multiple sequence alignment by the presence of a region at the N-terminal end of the peptide that is conserved in these sequences alone. The region was identified at the Interpro database [144] as a common RNA binding domain. The occurrence of this particular RNA binding domain in some homologues of RsuA pseudouridine synthases has been found by other researchers [145].

A well-supported clade unites A. thaliana with the chlamydiales. It has previously been noted that plants have a number of genes that are related to chlamydial genes. This relationship can appear when lineage-specific gene loss occurs in the cyanobacteria. Because chlamydiales and cyanobacteria appear to be sister-groups, plant genes of plastid origin are most closely related to chlamydial homologues in the absence of a cyanobacterial homologue [106]. Such an evolutionary history seems likely in the RsuA tree, as lineage-specific loss of homologues is common. This means that the A. thaliana and cyanobacterial RsuA sequences are paralogues, and that the chlamydial sequences are the closest orthologues for A. thaliana.

Because of the length heterogeneity among RsuA genes, no obvious N-terminal extension can be seen in the P. falciparum or A. thaliana sequence.
However, both of these genes are predicted to have plastid-targeting peptides. Unfortunately, there is no support for the P. falciparum branch in the RsuA tree (figure 2.14) and its placement is deemed unresolved. The presence of a targeting peptide, coupled with the phylogenetic position of A. thaliana, argues for a plastid origin of the RsuA gene in A. thaliana. The targeting peptide, and the fact that the only other eukaryote with an RsuA has one of plastid origin, lead to the tentative conclusion that the P. falciparum RsuA pseudouridine synthase was laterally transferred from the apicoplast progenitor to the P. falciparum nuclear genome.

Ribosomal RNA methyltransferase

BAEwatch identified an intron-containing gene on chromosome 12 in P. falciparum as 'bacteria-like'. The homologues of this gene in other species are all predicted to be ribosomal RNA methyltransferases. In E. coli, this gene was recently cloned and the enzyme was shown to methylate a C residue at a specific position in the 16S RNA [146]. Based on conservation of the identified E. coli functional sites in sequences from other species, it was suggested that these m5C rRNA methyltransferases comprise a family of proteins with representatives in all three domains of life [146]. Very few functional studies have been done on enzymes of this family, although some of the eukaryotic m5C RNA methyltransferases have been shown to localize to the nucleolus during cell proliferation. Reid et al. [147] grouped the m5C rRNA methyltransferases into six categories based on the length of variable C and N-terminal regions and on small sequence motifs. Anantharaman et al. [145] further grouped m5C rRNA methyltransferases into four homologous groups, according to the presence of specific RNA binding domains.

The P. falciparum gene is 376 amino acids in length.
The multiple sequence alignment for the gene reveals small regions of conserved sequence (about 5-20 amino acids) separated by larger non-conserved regions. The N and C-terminal regions of this protein show considerable size variation even among peptides from closely related species. Consequently, only 142 sites were used for phylogenetic reconstruction.

The first pattern evident from the phylogeny of m5C rRNA methyltransferases (figure 2.15) is that a long branch divides the tree in two. The difference between the members of these two groups is not obvious from the multiple sequence alignment. However, a search with these sequences in the Interpro database reveals that sequences in the first group possess a NusB RNA binding domain at the N-terminal region of the peptide. The grouping of sequences by the presence of the NusB domain is consistent with one of the homologous groups identified by Anantharaman et al. [145]. The NusB-containing group is eubacterial, with the P. falciparum sequence present as the only eukaryote.

Eight well-supported clades that encompass accepted related species are seen in the NusB-containing portion of the tree (fig 2.15). These describe a chlamydiales branch, an actinobacteria branch, a low GC firmicutes branch, a gamma proteobacteria branch, two alpha proteobacteria branches, and two beta proteobacteria branches. Synechocystis sp. and Thermotoga maritima, as sole representatives of their respective cyanobacteria and thermotogales groups, are not included within the larger groups, and are considered to define their own clades. The relationship between most of these groups is not resolved. However, a strongly supported branch unites the type two alpha proteobacteria, the chlamydiales, and the type two beta proteobacteria, and defines the sequences in these groups as orthologues.
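
The site-selection step mentioned above (keeping only alignment positions that can be compared with confidence) can be illustrated with a minimal sketch. The exact criterion used to choose the 142 sites is not stated here, so the gap-fraction threshold, the function names `select_columns` and `trim`, and the toy sequences are all assumptions made purely for illustration.

```python
def select_columns(alignment, max_gap_fraction=0.0):
    """Return indices of alignment columns whose gap fraction does not
    exceed max_gap_fraction (hypothetical criterion, for illustration)."""
    seqs = list(alignment.values())
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all aligned sequences must have the same length")
    keep = []
    for i in range(length):
        column = [s[i] for s in seqs]
        gap_fraction = column.count("-") / len(column)
        if gap_fraction <= max_gap_fraction:
            keep.append(i)
    return keep


def trim(alignment, columns):
    """Build a new alignment containing only the selected columns."""
    return {name: "".join(seq[i] for i in columns)
            for name, seq in alignment.items()}


if __name__ == "__main__":
    # Toy alignment (invented sequences, not real data).
    toy = {"a": "MK-LV", "b": "MKALV", "c": "M--LV"}
    cols = select_columns(toy)   # only gap-free columns are kept
    print(cols, trim(toy, cols))
```

A stricter or looser threshold changes how many sites survive, which is one reason the number of usable positions differs so much between the genes analysed in this chapter.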

[Figure 2.15: phylogenetic tree, not reproduced here]
Figure 2.15 Phylogeny of m5C rRNA methyltransferase. The P. falciparum sequence is boxed for reference. Suffix numbering of species denotes that multiple homologues exist. The dotted line divides the tree into two groups, one of which contains sequences that possess an N-terminal NusB RNA binding domain (see text). Details as in figure 2.3.

A highly supported branch links the P. falciparum sequence with the chlamydial branch. The gene in P.
falciparum has undoubtedly been transferred from a bacterium, as it represents a purely eubacterial form of m5C rRNA methyltransferase. The question then becomes: is an organelle the source of the gene, or was the gene transferred from a free-living bacterium? As the P. falciparum sequence is not most closely related to the alpha proteobacterial orthologues in the phylogeny, a mitochondrial source is unlikely. It is less easy to evaluate the relationship of the P. falciparum sequence with the cyanobacteria. If the event that led to paralogous genes in alpha and beta proteobacteria was gene duplication in the lineage that led to these two groups, then the Synechocystis sp. sequence is an orthologue of the P. falciparum sequence and the distance between them argues against a plastid origin. However, if a duplication occurred early in the divergence of Eubacteria, then lineage-specific loss has produced the singular homologues of the actinobacteria, low GC firmicutes, gamma proteobacteria, chlamydiales, thermotogales, and cyanobacteria. In this case, the cyanobacterial sequence may be more distantly related to the P. falciparum sequence because they are paralogous. If the P. falciparum gene has a plastid origin, the loss of the orthologue in the cyanobacterial lineage then leads to the apparent association between the P. falciparum sequence and its next closest relatives, the chlamydiales [106].

It is unlikely that the P. falciparum m5C rRNA methyltransferase is localized to the apicoplast. The protein sequence does not possess an identifiable N-terminal extension, and no targeting peptide was identified. Moreover, no plastid-targeted m5C rRNA methyltransferases have been identified in any other organisms. If m5C rRNA methyltransferase does have a plastid origin in P. falciparum, the protein has probably been co-opted for use by the cell rather than the organelle.
Unfortunately, the patterns left by the evolution of m5C RNA methyltransferases are not informative for determining the origin of the m5C RNA methyltransferase gene transfer in P. falciparum.

RNA 3' terminal phosphate cyclase

A homologue of RNA 3' terminal phosphate cyclase was identified on chromosome 12 of P. falciparum. This family of RNA-binding enzymes makes a cyclic phosphodiester at the end of RNA molecules [148]. The authors who originally described the E. coli and human RNA 3' terminal phosphate cyclase noted at the time that similar proteins appear in all three domains of life [149, 150]. The P. falciparum protein is 569 amino acids long, and has an 80 amino acid leader sequence that is identified as plastid targeting. A 90 amino acid insert is also found in the middle of the P. falciparum protein.

Two hundred thirty-nine homologous positions were selected from the multiple sequence alignment and used to make the RNA 3' terminal phosphate cyclase phylogeny (figure 2.16). A well-supported branch (bootstrap of 94 percent) divides the member species in two. The upper clade encompasses the bacteria and the P. falciparum sequence, and the lower clade is comprised of the archaea, the eukaryota, and a lone species of actinobacteria. The P. falciparum sequence branches well in the interior of the upper bacterial clade and is clearly more related to the bacteria than to the eukaryotes or archaea (figure 2.16). The bacterial clade is fully resolved and well supported, although a number of long branches are present. The long branches unite sequences that are not expected to group together by criteria of accepted phylogenetic relationships (see table 2.2) based on other genetic and morphological characters. Bacillus halodurans (a low GC firmicute) and Deinococcus radiodurans (a deinococcus) are grouped together by long branches with 98 percent bootstrap support. A 98 percent bootstrap separates Listeria monocytogenes from its fellow low GC firmicute, Clostridium acetobutylicum. A bootstrap value of 100 divides L. monocytogenes and B. halodurans, which are related at the family level. A 99 percent bootstrap value groups the P. falciparum sequence with the sequence from L. monocytogenes. It should be noted, however, that no sequence architecture features unite P. falciparum and L. monocytogenes.

[Figure 2.16: phylogenetic tree, not reproduced here]
Figure 2.16 Phylogeny of RNA 3' terminal phosphate cyclase. The P. falciparum sequence is boxed for reference. Higher order taxonomic groupings are shaded. See figure 2.3 for details.

The plastid localization signal in the P. falciparum sequence indicates that RNA 3' terminal phosphate cyclase functions in the apicoplast. However, the phylogeny shows a specific relationship of this gene with a low GC firmicute. Two explanations are possible for this conflicting evidence. 1) The gene may have been laterally transferred from an ancestor of the low GC firmicutes into the nucleus of a progenitor of P. falciparum, and then subsequently acquired a targeting peptide that allowed it to function in the plastid.
2) The relationship of the P. falciparum and L. monocytogenes genes is artefactual, caused by long branch attraction, and the RNA 3' terminal phosphate cyclase functions in the compartment where it evolved. I favour this second hypothesis as being more biologically plausible, and I tentatively conclude that RNA 3' terminal phosphate cyclase is a laterally transferred gene of plastid origin.

Ribosomal RNA adenine dimethylase

Ribosomal RNA adenine dimethylases are RNA modification enzymes that methylate two adjacent adenosines in rRNA. A gene was identified on chromosome 12 in P. falciparum that is similar to rRNA adenine dimethylases in other organisms. The group of homologues that was identified for this P. falciparum gene included archaeal, eukaryote and eubacterial sequences. Both the P. falciparum and the A. thaliana genes possess N-terminal leader sequences (figure 2.17) that are predicted to be chloroplast-targeting peptides. The P. falciparum target peptide is long (180 amino acids) and contains a region of low complexity made of 10 copies of a hexamer repeat. A second region of low complexity, characterized by long tracts of lysine and asparagine, comprises about half of the 160 amino acid insert in the P. falciparum gene (figure 2.17). In addition to the indels depicted in figure 2.17, the multiple alignment of this gene contains many small gaps and consists of blocks of well-conserved sites separated by divergent and gap-laden sequence. These divergent sequence sites were discounted and 177 sites remained to be used to calculate phylogenetic distances.

The distance tree for rRNA adenine dimethylase (figure 2.18) resolves clades that define major taxonomic groups such as the divisions of proteobacteria, the actinobacteria, the low GC firmicutes, the spirochaetes, the chlamydiales, the cyanobacteria and the eukaryotes. There is no support, however, for the branches that connect these groups. The position of the plastid-bearing organisms, A.
thaliana and P. falciparum, is poorly supported. In the absence of conflicting indications from the phylogeny, the presence of a plastid-targeting sequence is considered evidence that the rRNA adenine dimethylase gene was transferred from the plastid to the nucleus of P. falciparum.

[Figure 2.17: schematic alignment of rRNA adenine dimethylase sequences showing leader sequences and indels, not reproduced here]

[Figure 2.18: phylogenetic tree, not reproduced here]
Figure 2.18 Phylogeny of Ribosomal RNA adenine dimethylase. The P. falciparum sequence is boxed for reference. Plastid-bearing organisms are indicated in bold. mt: mitochondrion-targeted. General notation as in figure 2.3.

S-adenosyl methionine-dependent methyltransferase

A gene on chromosome 12 in P. falciparum was identified by BAEwatch as being 'bacteria-like'. This gene has a set of homologues for which function has not yet been determined. An S-adenosyl methionine (SAM) binding domain was identified in all the homologous proteins, which identifies them as a type of methyltransferase [151]. This is a very common domain found in almost 3000 proteins in the Interpro database [112]. The E. coli homologue of the P. falciparum gene has been shown to be active on membrane-localized substrates [152].

The homologues for the P. falciparum sequence included three other eukaryotes amongst a mostly eubacterial protein set. P. falciparum and two of the other eukaryotes have sequences with N-terminal extensions (fig 2.19). In P. falciparum, this leader is predicted to be an apicoplast-targeting peptide.
Two hundred and thirty-four positionally conserved residues were selected and used to construct a phylogeny (figure 2.20). The main taxonomic groupings of the eubacteria are resolved in the SAM methyltransferase tree, including proteobacterial groups alpha through epsilon, spirochaetes, low GC firmicutes (with the exception of Mycoplasma genitalium), chlamydiales, and the cyanobacteria. The two animal sequences group together as sister to the alpha proteobacteria, and probably represent sequences transferred from the mitochondrion. The position of A. thaliana is unresolved.

Figure 2.19 Schematic alignment of SAM-dependent methyltransferase sequences. The top line represents the general form of all bacterial homologues. The eukaryotic sequences are shown underneath. Putative organelle-targeting peptides are cross hatched. The amino acid length of the protein is given to the right of the specific sequences (P. falciparum, 508; A. thaliana, 434; D. melanogaster, 356; M. musculus, 339).

Figure 2.20 Phylogeny of SAM-dependent methyltransferase. The sequence from P. falciparum is boxed for reference. Eukaryotic sequences are named in bold. Taxonomic affiliations are shown to the right of supported clades. See figure 2.3 for general notation.

The sequence from P. falciparum is paired with the cell wall-less, low GC firmicute, Mycoplasma genitalium, with 94 percent bootstrap support (figure 2.20). At face value, this relationship signifies that the methyltransferase gene from P. falciparum shared a common ancestor with the gene from M. genitalium. However, the apicoplast-localization peptide argues for a plastid origin for this gene. The branches that join these sequences are long. A systematic error in phylogenetic analysis, such as that resulting from an incorrect model of sequence evolution, is compounded in long branches [116]. Bootstrap values address only random sampling error, so while the high bootstrap value on the branch that unites P. falciparum and M. genitalium is compelling, it may have no significance if the model of phylogenetic inference used is incorrect. The conspicuous absence of the other two fully sequenced mycoplasma species may indicate that the function of this gene is not essential for the parasitic mycoplasmas.
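To make concrete what a bootstrap value does and does not measure, here is a minimal Python sketch of nonparametric bootstrap resampling of alignment columns. It is not the procedure or software used in this study; the function name, the toy alignment and the fixed seed are illustrative assumptions. Each replicate resamples sites with replacement, so the resulting support values reflect sampling variation among sites and nothing else.

```python
import random


def bootstrap_alignment(alignment: dict[str, str], rng: random.Random) -> dict[str, str]:
    """Return one bootstrap replicate: alignment columns sampled with replacement."""
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all aligned sequences must have the same length")
    n_sites = lengths.pop()
    # Draw n_sites column indices with replacement.
    cols = [rng.randrange(n_sites) for _ in range(n_sites)]
    # Apply the same column sample to every sequence so columns stay intact.
    return {name: "".join(seq[i] for i in cols) for name, seq in alignment.items()}


# Toy alignment (hypothetical sequences, for illustration only).
toy = {
    "taxon_A": "MKTLLVAGGS",
    "taxon_B": "MKSLLVAGGT",
    "taxon_C": "MRTLIVSGGS",
}
rng = random.Random(0)  # fixed seed so the sketch is repeatable
replicates = [bootstrap_alignment(toy, rng) for _ in range(100)]
print(len(replicates), len(replicates[0]["taxon_A"]))
```

Each replicate would then be analysed with the same tree-building method, and the support for a branch is the percentage of replicate trees that contain it. Because the same model of sequence evolution is applied to every replicate, a systematic bias can be reproduced in most replicates and still yield a high value.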
A loss of functional constraint in the M. genitalium sequence can lead to mutational saturation, and thus to the apparent relationship to P. falciparum. For these reasons, a plastid origin is favoured for this SAM methyltransferase, although it deserves further study.

GTP-binding protein

Two of the thirteen 'bacteria-like' genes identified by BAEwatch had matching sets of homologues. These two proteins are not identical. They show marked differences in architecture (figure 2.21) and do not represent a recent duplication. The function of the proteins considered here is unknown. The sequences possess two copies of a characteristic GTP-binding motif, but little is known beyond that, as no functional studies of these proteins have been carried out. As part of a much larger study examining the evolution of GTPases, Leipe et al. [127] noticed the similarity between these proteins and named the group EngA.

The homologues for this gene are primarily bacterial, but four other eukaryotic sequences were also identified. Three of the six eukaryotic sequences had N-terminal extensions; these were P. falciparum 1, A. thaliana 1, and G. theta (figure 2.21). The leader peptide of P. falciparum 1 was very long (381 amino acids). Except for the first 50 amino acids, the leader is composed of low complexity sequence, containing 13 tandem repeats of a heptamer and several poly-lysine stretches. Nevertheless, this leader was identified as a plastid-targeting peptide, as were the leaders of A. thaliana 1 and G. theta. Of the other three eukaryotic sequences, P. falciparum 2 and Oryza sativa did not contain N-terminal extensions, and no targeting peptides were identified. The A. thaliana 2 sequence was predicted to have a mitochondrion-targeting peptide 118 amino acids long. One hundred and eighteen amino acids is more than a fifth of the total protein. Moreover, the first 118 amino acids share considerable similarity with the sequences of the other homologues, including those of bacteria (figure 2.21).
If the A. thaliana 2 sequence does have a 118 amino acid mitochondrion-localizing peptide, it is highly atypical.

The phylogeny constructed with this group of proteins shows that eukaryotic sequences have one of three associations (figure 2.22). First, the O. sativa and A. thaliana 2 sequences are well supported as a sister group to the alpha proteobacteria. These genes were most likely transferred from the mitochondrion to the nucleus. Second, the P. falciparum 1, G. theta and A. thaliana 1 sequences form a sister group to the cyanobacteria. The evidence presented here suggests that these genes were laterally transferred from the plastid to the nucleus, in the case of P. falciparum and A. thaliana, and to the nucleomorph in the case of G. theta. Third, the P. falciparum 2 sequence groups with Treponema pallidum, a spirochaete. This grouping is not well supported, and the very long branch of the P. falciparum 2 sequence casts suspicion on this placement. The P. falciparum 2 sequence has undoubtedly been transferred from a bacterium or organelle (because no archaeal or eukaryotic homologues exist), but there is not enough information to determine the likely donor of the sequence.

Figure 2.22 Phylogeny of GTP-binding protein. The sequences from P. falciparum are boxed for reference. Eukaryotes are shown in bold. Details as in figure 2.3.
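The reasoning used above to assign an origin to each eukaryotic sequence amounts to a lookup from sister group to inferred donor, applied only where the grouping is supported. The Python sketch below restates that logic for illustration; the function name and the boolean support flags are hypothetical simplifications of the text and are not values read from figure 2.22.

```python
def inferred_origin(sister_group: str, well_supported: bool) -> str:
    """Map the sister group of a eukaryotic sequence to a likely organellar donor."""
    organelle_ancestors = {
        "alpha proteobacteria": "mitochondrion",
        "cyanobacteria": "plastid",
    }
    # A poorly supported grouping gives no basis for naming a donor.
    if not well_supported:
        return "undetermined"
    return organelle_ancestors.get(sister_group, "undetermined")


# Sister groups as described in the text for the GTP-binding protein tree.
# The support flags are a simplified reading of the text, not measured values.
observations = {
    "O. sativa": ("alpha proteobacteria", True),
    "A. thaliana 2": ("alpha proteobacteria", True),
    "P. falciparum 1": ("cyanobacteria", True),
    "G. theta": ("cyanobacteria", True),
    "A. thaliana 1": ("cyanobacteria", True),
    "P. falciparum 2": ("spirochaeta", False),
}
for name, (group, supported) in observations.items():
    print(f"{name}: {inferred_origin(group, supported)}")
```

In practice the presence or absence of a targeting peptide is weighed alongside the tree, as in the preceding sections, so this lookup captures only the phylogenetic half of the argument.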
With the possible exception of the P. falciparum 2 sequence, the taxonomic distribution of the protein (figure 2.22) is limited to the eubacteria and those eukaryotes that have acquired the gene from an organelle. The O. sativa and A. thaliana sequences are the sole representatives of a mitochondrial homologue for this protein. The absence of other fully sequenced eukaryotes in the tree is interesting, as it may mean that the mitochondrial gene transfer event took place only in the plant lineage. Certainly, the animal and fungal groups with fully sequenced genomes do not encode a representative of the gene. We await the sequencing of additional eukaryotes from outside the higher eukaryotes to determine when this transfer occurred.

2.3.3 Conclusion: Lateral gene transfer in P. falciparum

In total, thirteen P. falciparum genes were analysed for evidence of lateral gene transfer. Of these, eight (type II topoisomerase B subunit, type II topoisomerase A subunit, ribosome release factor, pseudouridine synthase, RNA 3' terminal phosphate cyclase, rRNA adenine dimethylase, SAM-dependent methyltransferase, and GTP-binding protein 1) probably represent transfers from the plastid to the nucleus of an ancestor of P. falciparum. Two additional genes (rRNA methyltransferase, elongation factor G) show more complex relationships; while the evidence for a plastid origin is weak, the data do not exclude it. One gene (proteasome subunit HslV) is likely to have been transferred from the mitochondrion to the nucleus in an ancestor of P. falciparum. One further gene (GTP-binding protein 2) shows only equivocal evidence for a mitochondrial origin, although such an interpretation of the phylogenetic data is not ruled out. The remaining gene (adenylosuccinate lyase) provides evidence of other lateral gene transfers that do not include P. falciparum.

I conclude that the majority of laterally transferred genes in the P. falciparum genome are of organellar origin. Based on this study of a subset of 2073 P. falciparum genes, I predict that few non-organellar gene transfers will be observed in the complete genome once it has been fully sequenced. The paucity of laterally transferred genes may be a consequence of P. falciparum's parasitic lifestyle. The reduction of genome size that often accompanies the transition to parasitism and an intracellular lifestyle [153, 154] may preclude lateral gene transfer to P. falciparum, unless the acquired gene replaces an essential gene already present.

BIBLIOGRAPHY

1. Kyrpides NC, Olsen GJ: Archaeal and bacterial hyperthermophiles: horizontal gene exchange or common ancestry? Trends Genet 1999, 15:298-9.
2. Aravind L, Tatusov RL, Wolf YI, Walker DR, Koonin EV: Reply. Trends Genet 1999, 15:299-300.
3. Doolittle WF: Phylogenetic classification and the universal tree [see comments]. Science 1999, 284:2124-9.
4. Doolittle WF: Lateral genomics. Trends Cell Biol 1999, 9:M5-8.
5. Martin W: Mosaic bacterial chromosomes: a challenge en route to a tree of genomes. Bioessays 1999, 21:99-104.
6. Lawrence JG, Roth JR: Selfish operons: horizontal transfer may drive the evolution of gene clusters. Genetics 1996, 143:1843-60.
7. Woese CR, Olsen GJ, Ibba M, Söll D: Aminoacyl-tRNA synthetases, the genetic code, and the evolutionary process. Microbiol Mol Biol Rev 2000, 64:202-36.
8. Lawrence J, Roth J: Roles of horizontal transfer in bacterial evolution. In: Horizontal gene transfer Edited by Syvanen M, Kado CI, 1st ed. pp. 208-225. London; New York: Chapman & Hall; 1998: 208-225.
9. Ochman H, Lawrence JG, Groisman EA: Lateral gene transfer and the nature of bacterial innovation. Nature 2000, 405:299-304.
10. de la Cruz F, Davies J: Horizontal gene transfer and the origin of species: lessons from bacteria. Trends Microbiol 2000, 8:128-33.
11. Lawrence JG, Ochman H: Molecular archaeology of the Escherichia coli genome. Proc Natl Acad Sci USA 1998, 95:9413-7.
12.
Snel B, Bork P, Huynen MA: Genome phylogeny based on gene content. Nat Genet 1999, 21:108-10.
13. Eisen JA: Horizontal gene transfer among microbial genomes: new insights from complete genome analysis. Curr Opin Genet Dev 2000, 10:606-11.
14. Glansdorff N: About the last common ancestor, the universal life-tree and lateral gene transfer: a reappraisal. Mol Microbiol 2000, 38:177-85.
15. Kurland CG: Something for everyone. Horizontal gene transfer in evolution. EMBO Rep 2000, 1:92-5.
16. Spratt BG, Maiden MC: Bacterial population genetics, evolution and epidemiology. Philos Trans R Soc Lond B Biol Sci 1999, 354:701-10.
17. Wang Y, Zhang Z: Comparative sequence analyses reveal frequent occurrence of short segments containing an abnormally high number of non-random base variations in bacterial rRNA genes. Microbiology 2000, 146:2845-54.
18. Schenk S, Decker K: Horizontal gene transfer involved in the convergent evolution of the plasmid-encoded enantioselective 6-hydroxynicotine oxidases. J Mol Evol 1999, 48:178-86.
19. Macario AJ, de Macario EC: The archaeal molecular chaperone machine: peculiarities and paradoxes. Genetics 1999, 152:1277-83.
20. Arabidopsis Genome Initiative: Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature 2000, 408:796-815.
21. Woese C: The universal ancestor. Proc Natl Acad Sci USA 1998, 95:6854-9.
22. Woese CR: Interpreting the universal phylogenetic tree. Proc Natl Acad Sci USA 2000, 97:8392-6.
23. Munn AL: Molecular requirements for the internalisation step of endocytosis: insights from yeast. Biochim Biophys Acta 2001, 1535:236-57.
24. Knop M, Schiffer HH, Rupp S, Wolf DH: Vacuolar/lysosomal proteolysis: proteases, substrates, mechanisms. Curr Opin Cell Biol 1993, 5:990-6.
25. Sinai AP, Joiner KA: Safe haven: the cell biology of nonfusogenic pathogen vacuoles. Annu Rev Microbiol 1997, 51:415-62.
26. Haas A: Reprogramming the phagocytic pathway-intracellular pathogens and their vacuoles (review).
Mol Membr Biol 1998, 15:103-21.
27. Garcia-del Portillo F: Interaction of Salmonella with lysosomes of eukaryotic cells. Microbiologia 1996, 12:259-66.
28. Corsaro D, Venditti D, Padula M, Valassina M: Intracellular life. Crit Rev Microbiol 1999, 25:39-79.
29. Goebel W, Gross R: Intracellular survival strategies of mutualistic and parasitic prokaryotes. Trends Microbiol 2001, 9:267-73.
30. Gray MW: Evolution of organellar genomes. Curr Opin Genet Dev 1999, 9:678-87.
31. Kurland CG, Andersson SG: Origin and evolution of the mitochondrial proteome. Microbiol Mol Biol Rev 2000, 64:786-820.
32. Thorsness PE, Weber ER: Escape and migration of nucleic acids between chloroplasts, mitochondria, and the nucleus. Int Rev Cytol 1996, 165:207-34.
33. Doolittle WF: You are what you eat: a gene transfer ratchet could account for bacterial genes in eukaryotic nuclear genomes. Trends Genet 1998, 14:307-11.
34. Mellman I: Endocytosis and antigen processing. Semin Immunol 1990, 2:229-37.
35. Gage DJ, Margolin W: Hanging by a thread: invasion of legume plants by rhizobia. Curr Opin Microbiol 2000, 3:613-7.
36. Ackermann H-W, DuBow M: Viruses of prokaryotes. Boca Raton: CRC Press; 1987.
37. Becker Y: Evolution of viruses by acquisition of cellular RNA or DNA nucleotide sequences and genes: an introduction. Virus Genes 2000, 21:7-12.
38. Tidona CA, Darai G: Iridovirus homologues of cellular genes-implications for the molecular evolution of large DNA viruses. Virus Genes 2000, 21:77-81.
39. Varmus H: Retroviruses. Science 1988, 240:1427-35.
40. Levine KL, Steiner B, Johnson K, Aronoff R, Quinton TJ, Linial ML: Unusual features of integrated cDNAs generated by infection with genome-free retroviruses. Mol Cell Biol 1990, 10:1891-900.
41. Linial M: Creation of a processed pseudogene by retroviral infection. Cell 1987, 49:93-102.
42. Anderson DJ, Stone J, Lum R, Linial ML: The packaging phenotype of the SE21Q1b provirus is related to high proviral expression and not trans-acting factors.
J Virol 1995, 69:7319-23.
43. Jamain S, Girondot M, Leroy P, Clergue M, Quach H, Fellous M, Bourgeron T: Transduction of the human gene FAM8A1 by endogenous retrovirus during primate evolution. Genomics 2001, 78:38-45.
44. Hajjar AM, Linial ML: A model system for nonhomologous recombination between retroviral and cellular RNA. J Virol 1993, 67:3845-53.
45. Swain A, Coffin JM: Mechanism of transduction by retroviruses. Science 1992, 255:841-5.
46. Zhang J, Temin HM: Rate and mechanism of nonhomologous recombination during a single cycle of retroviral replication. Science 1993, 259:234-8.
47. Sutrave P, Bonner TI, Rapp UR, Jansen HW, Patschinsky T, Bister K: Nucleotide sequence of avian retroviral oncogene v-mil: homologue of murine retroviral oncogene v-raf. Nature 1984, 309:85-8.
48. Swanstrom R, Parker RC, Varmus HE, Bishop JM: Transduction of a cellular oncogene: the genesis of Rous sarcoma virus. Proc Natl Acad Sci USA 1983, 80:2519-23.
49. Wang LH, Hanafusa H: Avian sarcoma viruses. Virus Res 1988, 9:159-203.
50. Patience C, Wilkinson DA, Weiss RA: Our retroviral heritage. Trends Genet 1997, 13:116-20.
51. Bister K, Vogt PK: Genetic analysis of the defectiveness in strain MC29 avian leukosis virus. Virology 1978, 88:213-21.
52. Hu SS, Moscovici C, Vogt PK: The defectiveness of Mill Hill 2, a carcinoma-inducing avian oncovirus. Virology 1978, 89:162-78.
53. Butel JS: Viral carcinogenesis: revelation of molecular mechanisms and etiology of human disease. Carcinogenesis 2000, 21:405-26.
54. Dejucq N, Jegou B: Viruses in the mammalian male genital tract and their effects on the reproductive system. Microbiol Mol Biol Rev 2001, 65:208-31.
55. Sverdlov ED: Retroviruses and primate evolution. Bioessays 2000, 22:161-71.
56. Lower R: The pathogenic potential of endogenous retroviruses: facts and fantasies. Trends Microbiol 1999, 7:350-6.
57.
Jensen EC, Schrader HS, Rieland B, Thompson TL, Lee KW, Nickerson KW, Kokjohn TA: Prevalence of broad-host-range lytic bacteriophages of Sphaerotilus natans, Escherichia coli, and Pseudomonas aeruginosa. Appl Environ Microbiol 1998, 64:575-80.
58. Bamford DH, Caldentey J, Bamford JK: Bacteriophage PRD1: a broad host range dsDNA tectivirus with an internal membrane. Adv Virus Res 1995, 45:281-319.
59. Hatalski CG, Lewis AJ, Lipkin WI: Borna disease. Emerg Infect Dis 1997, 3:129-35.
60. Ruepp A, Graml W, Santos-Martinez ML, Koretke KK, Volker C, Mewes HW, Frishman D, Stocker S, Lupas AN, Baumeister W: The genome sequence of the thermoacidophilic scavenger Thermoplasma acidophilum. Nature 2000, 407:508-13.
61. Nelson KE, Clayton RA, Gill SR, Gwinn ML, Dodson RJ, Haft DH, Hickey EK, Peterson JD, Nelson WC, Ketchum KA, McDonald L, Utterback TR, Malek JA, Linher KD, Garrett MM, Stewart AM, Cotton MD, Pratt MS, Phillips CA, Richardson D, Heidelberg J, Sutton GG, Fleischmann RD, Eisen JA, Fraser CM, et al.: Evidence for lateral gene transfer between Archaea and bacteria from genome sequence of Thermotoga maritima. Nature 1999, 399:323-9.
62. Ochman H, Jones IB: Evolutionary dynamics of full genome content in Escherichia coli. Embo J 2000, 19:6637-43.
63. Garcia-Vallve S, Romeu A, Palau J: Horizontal gene transfer in bacterial and archaeal complete genomes. Genome Res 2000, 10:1719-25.
64. Ragan MA: On surrogate methods for detecting lateral gene transfer. FEMS Microbiol Lett 2001, 201:187-91.
65. Ragan MA: Detection of lateral gene transfer among microbial genomes. Curr Opin Genet Dev 2001, 11:620-6.
66. Lawrence JG, Ochman H: Reconciling the many faces of lateral gene transfer. Trends Microbiol 2002, 10:1-4.
67. Andersson JO: Evolutionary genomics: is Buchnera a bacterium or an organelle? Curr Biol 2000, 10:R866-8.
68.
Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, Baldwin J, Devon K, Dewar K, Doyle M, FitzHugh W, Funke R, Gage D, Harris K, Heaford A, Howland J, Kann L, Lehoczky J, LeVine R, McEwan P, McKernan K, Meldrim J, Mesirov JP, Miranda C, Morris W, Naylor J, Raymond C, Rosetti M, Santos R, Sheridan A, Sougnez C, Stange-Thomann N, Stojanovic N, Subramanian A, Wyman D, Rogers J, Sulston J, Ainscough R, Beck S, Bentley D, Burton J, Clee C, Carter N, Coulson A, Deadman R, Deloukas P, Dunham A, Dunham I, Durbin R, French L, Grafham D, Gregory S, Hubbard T, Humphray S, Hunt A, Jones M, Lloyd C, McMurray A, Matthews L, Mercer S, Milne S, Mullikin JC, Mungall A, Plumb R, Ross M, Shownkeen R, Sims S, Waterston RH, Wilson RK, Hillier LW, McPherson JD, Marra MA, Mardis ER, Fulton LA, Chinwalla AT, Pepin KH, Gish WR, Chissoe SL, Wendl MC, Delehaunty KD, Miner TL, Delehaunty A, Kramer JB, Cook LL, Fulton RS, Johnson DL, Minx PJ, Clifton SW, Hawkins T, Branscomb E, Predki P, Richardson P, Wenning S, Slezak T, Doggett N, Cheng JF, Olsen A, Lucas S, Elkin C, Uberbacher E, Frazier M, et al.: Initial sequencing and analysis of the human genome. Nature 2001, 409:860-921.
69. Salzberg SL, White O, Peterson J, Eisen JA: Microbial genes in the human genome: lateral transfer or gene loss? Science 2001, 292:1903-6.
70. Andersson JO, Doolittle WF, Nesbo CL: Genomics. Are there bugs in our genome? Science 2001, 292:1848-50.
71. Stanhope MJ, Lupas A, Italia MJ, Koretke KK, Volker C, Brown JR: Phylogenetic analyses do not support horizontal gene transfers from bacteria to vertebrates. Nature 2001, 411:940-4.
72. Carlton JM, Muller R, Yowell CA, Fluegge MR, Sturrock KA, Pritt JR, Vargas-Serrato E, Galinski MR, Barnwell JW, Mulder N, Kanapin A, Cawley SE, Hide WA, Dame JB: Profiling the malaria genome: a gene survey of three species of malaria parasite with comparison to other apicomplexan species. Mol Biochem Parasitol 2001, 118:201-10.
73. WHO: Malaria 1982-1997.
Weekly Epidemiological Record 1999, 74:265-270.
74. Richman AM, Dimopoulos G, Seeley D, Kafatos FC: Plasmodium activates the innate immune response of Anopheles gambiae mosquitoes. Embo J 1997, 16:6114-9.
75. Straif SC, Mbogo CN, Toure AM, Walker ED, Kaufman M, Toure YT, Beier JC: Midgut bacteria in Anopheles gambiae and An. funestus (Diptera: Culicidae) from Kenya and Mali. J Med Entomol 1998, 35:222-6.
76. Saliba KJ, Kirk K: Nutrient acquisition by intracellular apicomplexan parasites: staying in for dinner. Int J Parasitol 2001, 31:1321-30.
77. Barker RH, Jr., Metelev V, Rapaport E, Zamecnik P: Inhibition of Plasmodium falciparum malaria using antisense oligodeoxynucleotides. Proc Natl Acad Sci USA 1996, 93:514-8.
78. Kirk K: Membrane transport in the malaria-infected erythrocyte. Physiol Rev 2001, 81:495-537.
79. Lai Z, Jing J, Aston C, Clarke V, Apodaca J, Dimalanta ET, Carucci DJ, Gardner MJ, Mishra B, Anantharaman TS, Paxia S, Hoffman SL, Craig Venter J, Huff EJ, Schwartz DC: A shotgun optical map of the entire Plasmodium falciparum genome. Nat Genet 1999, 23:309-13.
80. Pizzi E, Frontali C: Low-complexity regions in Plasmodium falciparum proteins. Genome Res 2001, 11:218-29.
81. Pizzi E, Frontali C: Divergence of noncoding sequences and of insertions encoding nonglobular domains at a genomic region well conserved in Plasmodia. J Mol Evol 2000, 50:474-80.
82. Gardner MJ, Tettelin H, Carucci DJ, Cummings LM, Aravind L, Koonin EV, Shallom S, Mason T, Yu K, Fujii C, Pederson J, Shen K, Jing J, Aston C, Lai Z, Schwartz DC, Pertea M, Salzberg S, Zhou L, Sutton GG, Clayton R, White O, Smith HO, Fraser CM, Hoffman SL, et al.: Chromosome 2 sequence of the human malaria parasite Plasmodium falciparum. Science 1998, 282:1126-32.
83.
Bowman S, Lawson D, Basham D, Brown D, Chillingworth T, Churcher CM, Craig A, Davies RM, Devlin K, Feltwell T, Gentles S, Gwilliam R, Hamlin N, Harris D, Holroyd S, Hornsby T, Horrocks P, Jagels K, Jassal B, Kyes S, McLean J, Moule S, Mungall K, Murphy L, Barrell BG, et al.: The complete nucleotide sequence of chromosome 3 of Plasmodium falciparum. Nature 1999, 400:532-8.
84. Brocchieri L: Low-complexity regions in Plasmodium proteins: in search of a function. Genome Res 2001, 11:195-7.
85. Watanabe J, Sasaki M, Suzuki Y, Sugano S: Analysis of transcriptomes of human malaria parasite Plasmodium falciparum using full-length enriched library: identification of novel genes and diverse transcription start sites of messenger RNAs. Gene 2002, 291:105-13.
86. van Lin LH, Pace T, Janse CJ, Birago C, Ramesar J, Picci L, Ponzi M, Waters AP: Interspecies conservation of gene order and intron-exon structure in a genomic locus of high gene density and complexity in Plasmodium. Nucleic Acids Res 2001, 29:2059-68.
87. Pertea M, Salzberg SL, Gardner MJ: Finding genes in Plasmodium falciparum. Nature 2000, 404:34; discussion 34-5.
88. Huestis R, Fischer K: Prediction of many new exons and introns in Plasmodium falciparum chromosome 2. Mol Biochem Parasitol 2001, 118:187-99.
89. Feagin JE: The 6-kb element of Plasmodium falciparum encodes mitochondrial cytochrome genes. Mol Biochem Parasitol 1992, 52:145-8.
90. Kohler S, Delwiche CF, Denny PW, Tilney LG, Webster P, Wilson RJ, Palmer JD, Roos DS: A plastid of probable green algal origin in Apicomplexan parasites. Science 1997, 275:1485-9.
91. Lang AS, Beatty JT: The gene transfer agent of Rhodobacter capsulatus and "constitutive transduction" in prokaryotes. Arch Microbiol 2001, 175:241-9.
92. Waller RF, Keeling PJ, Donald RG, Striepen B, Handman E, Lang-Unnasch N, Cowman AF, Besra GS, Roos DS, McFadden GI: Nuclear-encoded proteins target to the plastid in Toxoplasma gondii and Plasmodium falciparum.
Proc Natl Acad Sci USA 1998, 95:12352-7.
93. McFadden GI, Roos DS: Apicomplexan plastids as drug targets. Trends Microbiol 1999, 7:328-33.
94. Douglas SE: Plastid evolution: origins, diversity, trends. Curr Opin Genet Dev 1998, 8:655-61.
95. Haucke V, Schatz G: Import of proteins into mitochondria and chloroplasts. Trends in Cell Biology 1997, 7:103-106.
96. Li J, Maga JA, Cermakian N, Cedergren R, Feagin JE: Identification and characterization of a Plasmodium falciparum RNA polymerase gene with similarity to mitochondrial RNA polymerases. Mol Biochem Parasitol 2001, 113:261-9.
97. Takeo S, Kokaze A, Ng CS, Mizuchi D, Watanabe JI, Tanabe K, Kojima S, Kita K: Succinate dehydrogenase in Plasmodium falciparum mitochondria: molecular characterization of the SDHA and SDHB genes for the catalytic subunits, the flavoprotein (Fp) and iron-sulfur (Ip) subunits. Mol Biochem Parasitol 2000, 107:191-205.
98. Hurt EC, van Loon APGM: How proteins find mitochondria and intramitochondrial compartments? Trends Biochem Sci 1986, 11:204-207.
99. Emanuelsson O, Nielsen H, Brunak S, von Heijne G: Predicting subcellular localization of proteins based on their N-terminal amino acid sequence. J Mol Biol 2000, 300:1005-16.
100. Waller RF, Reed MB, Cowman AF, McFadden GI: Protein trafficking to the plastid of Plasmodium falciparum is via the secretory pathway. Embo J 2000, 19:1794-802.
101. Cline K, Henry R: Import and routing of nucleus-encoded chloroplast proteins. Annu Rev Cell Dev Biol 1996, 12:1-26.
102. Malaria Genetics and Genomics: www.ncbi.nlm.nih.gov/projects/Malaria.
103. www.sanger.ac.uk/Projects/P_falciparum/ Sequencing of P. falciparum chromosomes 1, 3, 4, 5, 6, 7, 8, 9, and 13 was accomplished as part of the Malaria Genome Project with support by the Wellcome Trust.
104. www-sequence.Stanford.edu/group/malaria Sequencing of P. falciparum chromosome 12 was accomplished as part of the Malaria Genome Project with support by the Burroughs Wellcome Fund.
105.
www.tigr.org Sequencing of chromosomes 2, 10, 11, 14 was part of the International Malaria Genome Sequencing Project and was supported by awards from the U.S. Department of Defense.
106. Brinkman FS, Blanchard JL, Cherkasov A, Greberg H, Av-Gay Y, Brunham RC, Fernandez RC, Finlay BB, Otto SP, Ouellette BF, Keeling PJ, Rose AM, Hancock RE, Jones SJ: Evidence that plant-like genes in Chlamydia species reflect an ancestral relationship between Chlamydiaceae, cyanobacteria, and the chloroplast. Genome Res 2002, 12:1159-67.
107. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 1997, 25:3389-402.
108. Sonnhammer EL, Durbin R: A workbench for large-scale sequence homology analysis. Comput Appl Biosci 1994, 10:301-7.
109. SWISSPROT and TrEMBL databases: http://www.ebi.ac.uk/swissprot/.
110. Unfinished bacterial genomes at NCBI: http://www.ncbi.nlm.nih.gov/PMGifs/Genomes/eub_u.html.
111. Zuegge J, Ralph S, Schmuker M, McFadden GI, Schneider G: Deciphering apicoplast targeting signals-feature extraction from nuclear-encoded precursors of Plasmodium falciparum apicoplast proteins. Gene 2001, 280:19-26.
112. Apweiler R, Attwood TK, Bairoch A, Bateman A, Birney E, Biswas M, Bucher P, Cerutti L, Corpet F, Croning MD, Durbin R, Falquet L, Fleischmann W, Gouzy J, Hermjakob H, Hulo N, Jonassen I, Kahn D, Kanapin A, Karavidopoulou Y, Lopez R, Marx B, Mulder NJ, Oinn TM, Pagni M, Servant F, Sigrist CJ, Zdobnov EM: The InterPro database, an integrated documentation resource for protein families, domains and functional sites. Nucleic Acids Res 2001, 29:37-40.
113. National Center for Biotechnology Information: http://www.ncbi.nlm.nih.gov/.
114. Altschul SF, Boguski MS, Gish W, Wootton JC: Issues in searching molecular sequence databases. Nat Genet 1994, 6:119-29.
115.
Thompson JD, Gibson TJ, Plewniak F, Jeanmougin F, Higgins DG: The CLUSTAL_X windows interface: flexible strategies for multiple sequence alignment aided by quality analysis tools. Nucleic Acids Res 1997, 25:4876-82.
116. Swofford DL, Olsen GJ, Waddell PJ, Hillis DM: Phylogenetic Inference. In: Molecular Systematics Edited by Hillis DM, Moritz C, Mable BK, 2nd ed. pp. 407-514. Sunderland, Mass.: Sinauer Associates; 1996: 407-514.
117. Nicholas KB, Nicholas HB: A tool for editing and annotating multiple sequence alignments, http://www.psc.edu/biomed/genedoc/ 1997.
118. Strimmer K, von Haeseler A: Quartet puzzling: A quartet maximum likelihood method for reconstructing tree topologies. Molecular Biology and Evolution 1996, 13:964-969.
119. Whelan S, Goldman N: A general empirical model of protein evolution derived from multiple protein families using a maximum-likelihood approach. Mol Biol Evol 2001, 18:691-9.
120. Felsenstein J: PHYLIP (Phylogeny Inference Package) version 3.5c. Distributed by the author. Department of Genetics, University of Washington, Seattle, http://evolution.genetics.washington.edu/phylip.html 1993.
121. Holder M, Roger A: http://hades.biochem.dal.ca/Rogerlab/Software/software.html#puzzleboot.
122. Gascuel O: BIONJ: an improved version of the NJ algorithm based on a simple model of sequence data. Mol Biol Evol 1997, 14:685-95.
123. Huang WM: Bacterial diversity based on type II DNA topoisomerase genes. Annu Rev Genet 1996, 30:79-107.
124. Champoux JJ: DNA topoisomerase I-mediated nicking of circular duplex DNA. Methods Mol Biol 2001, 95:81-7.
125. Cheesman S, McAleese S, Goman M, Johnson D, Horrocks P, Ridley RG, Kilbey BJ: The gene encoding topoisomerase II from Plasmodium falciparum. Nucleic Acids Res 1994, 22:2547-51.
126. Marshall VM, Coppel RL: Characterisation of the gene encoding adenylosuccinate lyase of Plasmodium falciparum. Mol Biochem Parasitol 1997, 88:237-41.
127.
Leipe DD, Wolf YI, Koonin EV, Aravind L: Classification and evolution of P-loop GTPases and related ATPases. J Mol Biol 2002, 317:41-72. 128. Baldauf SL, Palmer JD, Doolittle WF: The root of the universal tree and the origin of eukaryotes based on elongation factor phylogeny. Proc Natl Acad SciUSAmt, 93:7749-54. 129. Iwabe N, Kuma K, Hasegawa M, Osawa S, Miyata T: Evolutionary relationship of archaebacteria, eubacteria, and eukaryotes inferred from phylogenetic trees of duplicated genes. Proc Natl Acad Sci USA 1989, 86:9355-9. 130. Bocchetta M, Gribaldo S, Sanangelantoni A, Cammarano P: Phylogenetic depth of the bacterial genera Aquifex and Thermotoga inferred from analysis of ribosomal protein, elongation factor, and RNA polymerase subunit sequences. J Mol Evol 2000, 50:366-80. 89 131. Lopez P, Forterre P, Philippe H: The root of the tree of life in the light of the covarion model. J Mol Evol 1999, 49:496-508. 132. Zupan J, Muth TR, Draper 0, Zambryski P: The transfer of DNA from agrobacterium tumefaciens into plants: a feast of fundamental insights. Plant J 2000, 23:11-28. 133. Ohnishi M, Janosi L, Shuda M, Matsumoto H, Hayashi T, Terawaki Y, Kaji A: Molecular cloning, sequencing, purification, and characterization of Pseudomonas aeruginosa ribosome recycling factor. J Bacteriol 1999, 181:1281-91. 134. De Mot R, Nagy I, Walz J, Baumeister W: Proteasomes and other self-compartmentalizing proteases in prokaryotes. Trends Microbiol 1999, 7:88-92. 135. Tamura T, Nagy I, Lupas A, Lottspeich F, Cejka Z, Schoofs G, Tanaka K, De Mot R, Baumeister W: The first characterization of a eubacterial proteasome: the 20S complex of Rhodococcus. Curr Biol 1995, 5:766-74. 136. Bochtler M, Ditzel L, Groll M, Hartmann C, Huber R: The proteasome. Annu Rev Biophys Biomol Struct 1999, 28:295-317. 137. Huang C, Wang S, Chen L, Lemieux C, Otis C, Turmel M, Liu XQ: The Chlamydomonas chloroplast clpP gene contains translated large insertion sequences and is essential for cell growth. 
Mol Gen Genet 1994, 244:151-9. 138. Schaller A, Ryan CA: Molecular cloning of a tomato leaf cDNA encoding an aspartic protease, a systemic wound response protein. Plant Mol Biol 1996, 31:1073-7. 139. Douglas S, Zauner S, Fraunholz M, Beaton M, Penny S, Deng LT, Wu X, Reith M, Cavalier-Smith T, Maier UG: The highly reduced genome of an enslaved algal nucleus. Nature 2001, 410:1091-6. 140. de Sagarra MR, Mayo I, Marco S, Rodriguez-Vilarino S, Oliva J, Carrascosa JL, Casta n JG: Mitochondrial localization and oligomeric structure of HClpP, the human homologue of E. coli ClpP. J Mol Biol 1999, 292:819-25. 141. Zou CB, Nakajima-Shimada J, Nara T, Aoki T: Cloning and functional expression of Rpn1, a regulatory-particle non-ATPase subunit 1, of proteasome from Trypanosoma cruzi. Mol Biochem Parasitol 2000, 110:323-31. 142. Gustafsson C, Reid R, Greene PJ, Santi DV: Identification of new RNA modifying enzymes by iterative genome search using known modifying enzymes as probes. Nucleic Acids Res 1996, 24:3756-62. 90 143. Koonin EV: Pseudouridine synthases: four families of enzymes containing a putative uridine-binding motif also conserved in dUTPases and dCTP deaminases. Nucleic Acids Res 1996, 24:2411-5. 144. Zdobnov EM, Apweiler R: lnterProScan--an integration platform for the signature-recognition methods in InterPro. Bioinformatics 2001, 17:847-8. 145. Anantharaman V, Koonin EV, Aravind L: Comparative genomics and evolution of proteins involved in RNA metabolism. Nucleic Acids Res 2002, 30:1427-64. 146. Tscherne JS, Nurse K, Popienick P, Michel H, Sochacki M, Ofengand J: Purification, cloning, and characterization of the 16S RNA m5C967 methyltransferase from Escherichia coli. Biochemistry 1999, 38:1884-92. 147. Reid R, Greene PJ, Santi DV: Exposition of a family of RNA m(5)C methyltransferases from searching genomic and proteomic sequences. Nucleic Acids Res 1999, 27:3138-45. 148. 
Billy E, Hess D, Hofsteenge J, Filipowicz W: Characterization of the adenylation site in the RNA 3"-terminal phosphate cyclase from Escherichia coli. J Biol Chem 1999, 274:34955-60. 149. Genschik P, Drabikowski K, Filipowicz W: Characterization of the Escherichia coli RNA 3'-terminal phosphate cyclase and its sigma54-regulated operon. J Biol Chem 1998, 273:25516-26. 150. Genschik P, Billy E, Swianiewicz M, Filipowicz W: The human RNA 3'-terminal phosphate cyclase is a member of a new family of proteins conserved in Eucarya, Bacteria and Archaea. EmboJ 1997, 16:2955-67. 151. Schluckebier G, O'Gara M, Saenger W, Cheng X: Universal catalytic domain structure of AdoMet-dependent methyltransferases. J Mol Biol 1995, 247:16-20. 152. Carrion M, Gomez MJ, Merchante-Schubert R, Dongarra S, Ayala JA: mraW, an essential gene at the dew cluster of Escherichia coli codes for a cytoplasmic protein with methyltransferase activity. Biochimie 1999, 81:879-88. 153. Brown DR: Mycoplasmosis and immunity of fish and reptiles. Front Biosci 2002, 7:d1338-46. 154. dePamphilis CW, Young ND, Wolfe AD: Evolution of plastid gene rps2 in a lineage of hemiparasitic and holoparasitic plants: many losses of photosynthesis and complex patterns of rate variation. Proc Natl Acad Sci U S A 1997, 94:7367-72. 91 

