UBC Theses and Dissertations


High-resolution mutation detection in Caenorhabditis elegans mutants and natural isolates using array… Maydan, Jason Stephen 2009

HIGH-RESOLUTION MUTATION DETECTION IN CAENORHABDITIS ELEGANS MUTANTS AND NATURAL ISOLATES USING ARRAY COMPARATIVE GENOMIC HYBRIDIZATION

by

JASON STEPHEN MAYDAN

M.Sc., The University of British Columbia, 2000
B.Ed., The University of Windsor, 1996
B.Sc., The University of Western Ontario, 1994

A THESIS SUBMITTED IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY in THE FACULTY OF GRADUATE STUDIES (Genetics)

THE UNIVERSITY OF BRITISH COLUMBIA (Vancouver)
March 2009

© Jason Stephen Maydan, 2009

Abstract

An essential requirement of genetic research is the ability to identify mutations. Forward genetic screens begin by selecting for a phenotype and proceed to search for the causative mutation. Reverse genetics experiments first identify the mutation and then seek to derive the mutant phenotype, if any. Both approaches depend on efficient means of detecting mutations. This thesis describes the development of methods to facilitate the detection of mutations in the model organism Caenorhabditis elegans using array Comparative Genomic Hybridization (aCGH). Exon-centric oligonucleotide microarrays targeting specific chromosomes and the whole genome were designed and used to detect both large multi-gene and small single-gene deletions. Both homozygous and heterozygous deletions were identified using this technique. I showed that even single nucleotide transitions and transversions are detectable when using microarrays with sufficient probe densities, which are achievable with target regions of two Mbp or less. I also used aCGH to detect extensive natural gene content variation between the N2 Bristol strain and twelve wild C. elegans isolates. Most of the DNA copy number alterations in these strains are deletions relative to Bristol. Over 5% of the genes present in the Bristol strain are absent in at least one of the natural isolates that were examined. This represents a significant increase in the number of genes with known null alleles.
These deletions were then used to infer relationships among the natural isolates, which proved to be complex. The methods described in this thesis will greatly assist in the identification of mutations in C. elegans and are also applicable to other organisms with sequenced reference genomes.

Table of Contents

Abstract
Table of Contents
List of Tables
List of Figures
Acknowledgements
Co-Authorship Statement
1. Introduction
  1.1. Thesis overview
  1.2. The nematode Caenorhabditis elegans as a model organism
  1.3. Current methods of generating and discovering null alleles in C. elegans
  1.4. Array Comparative Genomic Hybridization
  1.5. Development of an aCGH platform for deletion discovery in C. elegans
  1.6. Detecting single nucleotide mutations in C. elegans using aCGH
  1.7. Copy number variation in natural isolates of C. elegans
  1.8. Thesis objectives
  1.9. References
2. Efficient high-resolution deletion discovery in Caenorhabditis elegans by array Comparative Genomic Hybridization
  2.1. Introduction
  2.2. Results
    2.2.1. Oligonucleotide probe quality and detection of homozygous 50-kb and 1-kb deletions
    2.2.2. Detection of single-copy number differences between hermaphrodite and male X chromosomes and in a balanced chromosome II deficiency
    2.2.3. The mIn1 balancer chromosome
    2.2.4. Novel balanced lethal deletions on chromosome II
    2.2.5. Whole-genome array CGH: Comparing N2 Bristol to Hawaiian and Madeiran wild isolates
  2.3. Discussion
    2.3.1. The utility of aCGH in screening for novel deletions
    2.3.2. Natural gene content variation in wild populations
  2.4. Methods
    2.4.1. Probe selection, microarray design, and microarray manufacture
    2.4.2. Nematode culture, harvest, and DNA preparation
    2.4.3. DNA fragmentation and labeling
    2.4.4. Sample hybridization and imaging
    2.4.5. Data analysis
  2.5. References
3. De novo identification of single nucleotide mutations in Caenorhabditis elegans using array Comparative Genomic Hybridization
  3.1. Introduction
  3.2. Results
    3.2.1. Novel single-nucleotide mutations detected utilizing an exon-centric chromosome II microarray
    3.2.2. Single-nucleotide mutations detected in 13 strains with previously mapped mutations
    3.2.3. The sensitivity of single nucleotide polymorphism detection using aCGH
  3.3. Discussion
    3.3.1. Limitations to SNP discovery using aCGH
    3.3.2. Suggestions to improve the sensitivity and specificity of SNP detection by aCGH
    3.3.3. Online resources for SNP detection using aCGH
    3.3.4. SNP detection by aCGH as an alternative to high-throughput sequencing
  3.4. Methods
    3.4.1. Mutagenesis
    3.4.2. Nematode culturing and DNA preparation
    3.4.3. Probe selection, array design and aCGH
    3.4.4. Data analysis and mutation detection
  3.5. References
4. Copy number variation in the Caenorhabditis elegans genome reveals complex relationships among natural isolates
  4.1. Introduction
  4.2. Results
    4.2.1. aCGH reveals a bias favoring coding sequence deletions over amplifications in C. elegans
    4.2.2. Extensive copy number variation in the C. elegans genome allows even very closely related strains to be distinguished
    4.2.3. The distribution of indels in the genome and the overrepresentation of indels in particular gene families
    4.2.4. Relatedness inferences based on deletions shared by multiple natural isolates
  4.3. Discussion
    4.3.1. Bias favoring deletions targeting gene families involved in environmental sensation and innate immunity
    4.3.2. New insights into complex strain relationships resulting from recombination and outcrossing in the C. elegans lineage
    4.3.3. Very common indels, mutation hotspots, and the possibility of extreme sequence divergence masquerading as deletions
  4.4. Methods
    4.4.1. Strain selection, nematode culturing and DNA preparation
    4.4.2. aCGH
    4.4.3. Indel identification
    4.4.4. Chi-square tests, t-tests and ANOVAs
    4.4.5. Affected genes
    4.4.6. Strain relationships
    4.4.7. Linkage disequilibrium
  4.5. References
5. Conclusions
  5.1. Thesis summary
  5.2. The significance of this work and its potential applications
  5.3. Strategies to reduce the cost of detecting novel induced deletions using aCGH
  5.4. Single nucleotide mutation detection
  5.5. SNP-CGH Mapping
  5.6. Future directions, deep sequencing and site-specific gene conversion
  5.7. References
Appendix 1. Genes that are completely deleted from the Hawaiian strain (CB4856) genome
Appendix 2. Genes that are completely deleted from the Madeiran strain (JU258) genome
Appendix 3. The segmentation algorithm

List of Tables

Table 2.1. Gene family members deleted in natural isolates from Hawaii (CB4856) and Madeira (JU258).
Table 4.1. Indels detected in twelve natural isolates of C. elegans.
Table 4.2. Number of deletions shared by all strain pairs.
Table 4.3. Genes affected by copy number variants in C. elegans.

List of Figures

Figure 1.1. Array Comparative Genomic Hybridization.
Figure 1.2. Detection of a single nucleotide mutation by aCGH utilizing a microarray of highly overlapping oligonucleotide probes.
Figure 2.1. Detection of a 50-kb homozygous viable deletion in gkDf2.
Figure 2.2. Detection of a 1047-bp homozygous viable deletion in ceh-39 (gk329).
Figure 2.3. Comparison of the normalized average fluorescence ratios (XO male / XX hermaphrodite) for all probe pairs to chromosomes II and X.
Figure 2.4. Detection of the 1202-bp deletion in dab-1 (gk291) in a wash-sampled balanced heterozygous population.
Figure 2.5. Deletions detected in a screen for homozygous lethal mutations in six wash-sampled balanced heterozygous populations.
Figure 2.6. Whole-genome aCGH comparing Hawaiian (CB4856) and Bristol N2 (VC196) hermaphrodites.
Figure 2.7. Whole-genome aCGH comparing Madeiran (JU258) and Bristol N2 (VC196) hermaphrodites.
Figure 2.8. A homozygous viable deletion identified on chromosome V in the Hawaiian strain (CB4856).
Figure 3.1. Novel detection of an A→T transversion in syd-1.
Figure 3.2. Estimation of the sensitivity and specificity of the current SNP detection technique.
Figure 4.1. Indels on the left arm of chromosome II in twelve natural isolates of C. elegans.
Figure 4.2. Indels in the genomes of twelve natural isolates of C. elegans.
Figure 4.3. The number of deletions detected in each of 12 natural isolates of C. elegans.
Figure 4.4. Unrooted consensus tree for twelve natural isolates of C. elegans.

Acknowledgements

I would like to thank my graduate supervisor Dr. Donald Moerman for the opportunity to work in the C. elegans Gene Knockout Facility and for his generosity, support, guidance, encouragement and friendship during my studies. He provided me with opportunities to collaborate with many excellent researchers and attend numerous conferences to present my work. I would also like to thank Dr. Stephane Flibotte for all of his help, mentoring, guidance and support, as well as my other thesis committee members Drs. Sally Otto and Don Riddle for their helpful suggestions, guidance and advice. I enjoyed the opportunity to collaborate with Dr. James Thomas and appreciate his help and guidance.
I am grateful to Genome Canada, Genome British Columbia, the Michael Smith Research Foundation, the Canadian Institute of Health Research, and the Natural Sciences and Engineering Research Council of Canada for financially supporting my work. I would like to thank Mark Edgley for all of his mentoring, guidance, advice, assistance and friendship during my PhD studies as well as my time spent as a technician in the laboratory. I owe a great deal of thanks to past and present members of the C. elegans Gene Knockout Facility at UBC for all of their assistance and support over the years, including Joanne Lau, Jaryn Perkins, Bin Shen, Christine Lee, Owen Dadivas, Allison Hay, Angela Fisher, Candice Navaroli, Nadereh Rezania, Lucy Liu, Sarah Neil, Ola Rogulu, Iasha Chaudry, Adam Lorch, Jon Taylor, Rick Zapf, Carolina Chanis, Christine Kwitkowski, and many, many others. I am also grateful for the helpful suggestions of fellow lab members Ryan Viveiros, Adam Warner, Mariana Veiga, and Drs. Teresa Rogalski, Barbara Meissner and Aruna Somasiri. It has been a pleasure to work with so many outstanding people. Finally, I would especially like to thank my parents Steve, Gail and Andrew K., and my brothers Ryan and Andrew S. for their love and support throughout my life.

Co-Authorship Statement

Together with my supervisor, Donald Moerman, I was responsible for the design of the research program described in this thesis. I was primarily responsible for the research, data analyses and manuscript preparation. Portions of this thesis are part of multi-author publications. Co-authors of these publications contributed analyses, text, tables, figures, edits, advice, funding and supervision. O. Rogula assisted in the preparation of Figure 1.1. Specific contributions to Chapters 2, 3 and 4 are listed below.

I was the primary author of Chapter 2 and was responsible for all study design, analysis, text, figures and tables except where indicated below. S. Flibotte helped to design the study, wrote and edited portions of text, performed analyses, assisted with Fig. 2.3, selected probes for all microarrays, wrote software to calculate and normalize log2 ratios and perform segmentation of the CGH data, and provided editorial suggestions and advice. M. Edgley helped to design the study, wrote portions of text, assisted with nematode culturing and mutagenesis, and provided editorial suggestions and advice. J. Lau assisted with mutagenesis and performed the screen for lethal mutations (Section 2.2.4). R. Selzer, T. Richmond and N. Pofahl performed the aCGH work at NimbleGen. J. Thomas performed analyses, wrote portions of text and assisted with Tables 2.1, 2.2 and 2.3. D. Moerman helped to design the study, wrote portions of text, provided editorial suggestions and advice, and supervised and funded the project.

I was the primary author of Chapter 3 and was responsible for all study design, experiments, analysis, text and figures except where noted below. H.M. Okada created the software described in Section 3.3.3. S. Flibotte helped design the study, provided valuable advice, performed analyses including those in Section 3.2.3, wrote and edited portions of text, selected probes for all microarrays, and contributed Figure 3.2. M.L. Edgley contributed text, editorial suggestions and assistance with nematode culturing, and performed mutagenesis. D.G. Moerman helped design the study, contributed portions of text and editorial suggestions, performed analyses, provided valuable advice, and supervised and funded the project.

I was the primary author of Chapter 4 and was responsible for all study design, experiments, analysis, text, figures and tables except where indicated below. A. Lorch assisted in the preparation of Table 4.3. M.L. Edgley provided advice and helped to culture the RW7000 strain. S. Flibotte provided advice and designed the whole genome microarray and the segmentation algorithm. D.G. Moerman provided advice, editorial suggestions, supervision and funding for the project.

1. Introduction

1.1. Thesis overview

This thesis describes the development of array Comparative Genomic Hybridization (aCGH) as a method of mutation detection in Caenorhabditis elegans. The ability to reliably detect mutations is an essential requirement of genetic research. The methods described in this thesis are capable of detecting mutations of any size, from deletions several hundred kilobases in length to single nucleotide alterations. These methods have been applied to detect both novel induced mutations and natural gene content variation in wild isolates of C. elegans. The methods developed in this thesis should greatly facilitate the detection of mutations in C. elegans.

1.2. The nematode Caenorhabditis elegans as a model organism

The nematode C. elegans has become an extraordinarily popular and useful model organism in many fields of study, including development, the nervous system, behavior, aging and evolution (Riddle et al. 1997). C. elegans is a convenient and powerful research tool for a number of reasons, including its short 3.5-day reproductive lifecycle, the ease with which it is cultured in the lab, and the fact that mutants can be preserved by freezing (Brenner 1974), obviating the need for laborious strain maintenance. The complete C. elegans genome sequence was published in 1998 and contains approximately 20,000 genes, roughly 40% of which have human homologues while 34% appear to be nematode-specific (C. elegans Sequencing Consortium 1998). Remarkably, the entire cell lineage from zygote to 959-cell hermaphroditic adult has been described (Sulston and Horvitz 1977; Sulston et al. 1983), along with the complete anatomical structure of the 302-cell nervous system to the level of individual processes and synaptic connections (White et al. 1986).

A number of powerful tools and resources are available to C. elegans researchers, including a library of over 12,500 open reading frame clones (Lamesch et al. 2004), a nearly complete RNAi library of clones allowing the selective knockdown of 85% of the genes in the genome (Kamath et al. 2003), and a growing stock of mutants currently comprising approximately 7000 deletion alleles targeting over 5500 genes (Moerman and Barstead 2008).

Fully exploiting C. elegans as a model system requires mutations in all of its genes. Since 1998, the C. elegans Gene Knockout Consortium (http://celeganskoconsortium.omrf.org/) has been working towards the goal of generating single gene knockouts for all C. elegans genes. The Consortium includes the Barstead laboratory at the Oklahoma Medical Research Foundation in the USA, the Mitani laboratory at Tokyo Women’s Medical University in Japan, and the Moerman laboratory at the University of British Columbia in Vancouver, Canada.

1.3. Current methods of generating and discovering null alleles in C. elegans

A nearly complete set of deletions for all Saccharomyces cerevisiae genes was achieved in 2002 through the use of homologous recombination and gene disruption (Giaever et al. 2002). In the absence of a genome-scale method of site-directed mutagenesis in C. elegans, a number of alternative strategies for generating null alleles have been used. These methods have recently been reviewed (Moerman and Barstead 2008) and are briefly summarized here. Gene disruption methods utilizing transposon insertion and excision with either the Tc1 transposon (Zwaal et al. 1993) or the Drosophila transposable element Mos1 (Granger et al. 2004) have been developed in C. elegans, and a promising gene conversion method known as MosTIC has recently been described (Robert and Bessereau 2007). TILLING has recently been used to obtain single nucleotide mutations in C. elegans (Gilchrist et al. 2006), some of which are nonsense mutations, but it has not been applied to a large-scale effort to generate null mutations. The C. elegans Gene Knockout Consortium currently uses random mutagenesis with either ethyl methanesulfonate (EMS) or tri-methylpsoralen and ultraviolet irradiation (TMP/UV), followed by targeted deletion detection using PCR and gel electrophoresis (Edgley et al. 2002; Barstead and Moerman 2006), and this method remains the primary means of generating null alleles in C. elegans (Moerman and Barstead 2008).

1.4. Array Comparative Genomic Hybridization

An alternative to PCR-based detection of deletions is Comparative Genomic Hybridization (CGH) (Kallioniemi et al. 1992; Mantripragada et al. 2004), which allows the detection of copy number differences between two DNA samples. DNA samples are differentially labeled and cohybridized to a microarray consisting of DNA probes to target sequences in the genome, and the ratio of fluorescent intensities measured at each probe reveals copy number differences existing between the two genomes (Figure 1.1). Early aCGH experiments primarily used microarrays of bacterial artificial chromosomes (BACs), cDNA clones or PCR products (Solinas-Toldo et al. 1997; Pinkel et al. 1998; Mantripragada et al. 2004). More recently, oligonucleotide microarrays have become more popular because they allow higher resolution mutation discovery (Carvalho et al. 2004; Gresham et al. 2008). C. elegans is an attractive target for aCGH studies because of its relatively small 100 Mb genome. A less complex pool of labeled DNA fragments reduces non-specific hybridization to the probes on the microarray, increasing the signal-to-noise ratio relative to an equivalent human experiment (Flibotte and Moerman 2008) and facilitating the detection of smaller mutations.

1.5. Development of an aCGH platform for deletion discovery in C. elegans

Chapter 2 describes the development of a reliable and efficient aCGH platform to assist in the detection of deletions in C. elegans. This platform presents a number of advantages over the PCR-based method of deletion detection. Firstly, the amount of time and labour required to isolate each deletion is reduced because aCGH begins with the mutant animal already in hand, whereas the PCR-based method detects deletions among a large population of worms, subsequently requiring a lengthy process of “sibling selection” to isolate a single mutant animal (Barstead and Moerman 2006). Secondly, while PCR is limited to detecting deletions smaller than the amplicon size, aCGH has no constraint on the maximum detectable deletion size. Thirdly, aCGH can identify additional mutations elsewhere in the mutant genome that are not detected by the PCR-based method. These additional mutations could potentially confound the characterization of the mutant phenotype if not properly purged from the mutant genome by outcrossing. Of course, multiple mutations that are found in a single animal can be individually isolated through backcrosses with N2 and subsequently studied independently.

Chapter 2 also presents the first detailed description of natural copy number variation in coding sequences in the C. elegans genome. This work focused on two highly divergent wild isolates of C. elegans (CB4856 from Hawaii, USA and JU258 from Madeira, Portugal) and was later expanded upon significantly in Chapter 4. I also coauthored a paper not discussed in this thesis that describes the application of the aCGH platform to characterize several genetic deficiency and duplication strains, which enabled the subsequent positional cloning and identification of several previously unidentified mutations on chromosome III (Jones et al. 2007).

1.6. Detecting single nucleotide mutations in C. elegans using aCGH

aCGH can also be used to detect underlying mutations in individuals with previously mapped phenotypes. Mapping and positional cloning efforts can be extremely laborious and time-consuming, often culminating in a candidate region containing hundreds of genes. Genomic intervals of this size permit highly sensitive aCGH experiments. As the length of a candidate region decreases, the sensitivity of aCGH experiments increases because of the increased probe density in the region of interest, allowing smaller and smaller deletions to be detected. Even single nucleotide mutations can be detected in aCGH experiments if the probe density is high enough (Gresham et al. 2006; Gresham et al. 2008), as shown in Figure 1.2. Chapter 3 describes the first use of aCGH to detect single nucleotide mutations in C. elegans, demonstrating that aCGH is a viable means of detecting null alleles resulting not only from deletions but also from nonsense, frameshift or splice-site mutations. Of course, hypomorphic alleles resulting from missense mutations are also detectable. This method should also be useful for detecting single nucleotide mutations in other organisms with sequenced reference genomes such as Drosophila melanogaster.

EMS is the most commonly used mutagen for C. elegans and although it is used by the Knockout Consortium to generate deletions, it primarily creates single nucleotide mutations (Anderson 1995; Cuppen et al. 2007). Many mutants generated by EMS mutagenesis exist in the C. elegans research community, and many of the mutations in these strains have already been mapped to candidate regions small enough to permit single nucleotide polymorphism (SNP) detection using aCGH as described in Chapter 3.

As mentioned, achieving sufficient probe density to detect single nucleotide mutations using aCGH relies on previously mapping the mutation to a small enough candidate region.
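The link between candidate region size and probe density can be illustrated with a short calculation. The sketch below is only an idealized model, not the probe selection procedure used in this thesis: it assumes evenly spaced 50-mer probes on a single strand and a hypothetical array capacity of 400,000 probes (the names `tiling_step`, `probes_covering` and `ARRAY_CAPACITY` are introduced here for illustration). It counts how many probes overlap a given base as the tiled region shrinks.

```python
import math


def tiling_step(region_bp: int, n_probes: int) -> int:
    """Smallest whole-bp spacing between probe starts that lets
    n_probes evenly spaced probes span a region of region_bp bases."""
    return max(1, math.ceil(region_bp / n_probes))


def probes_covering(pos: int, probe_len: int, step: int, region_bp: int) -> int:
    """Count probes that contain base `pos` when probes of length
    probe_len start at positions 0, step, 2*step, ... within the region."""
    starts = range(0, region_bp - probe_len + 1, step)
    return sum(1 for s in starts if s <= pos < s + probe_len)


PROBE_LEN = 50            # 50-mer probes, as in Figure 1.2
ARRAY_CAPACITY = 400_000  # hypothetical number of probes on one array

# Whole genome (~100 Mb), a 10 Mb interval, and a 2 Mb candidate region.
for region_bp in (100_000_000, 10_000_000, 2_000_000):
    step = tiling_step(region_bp, ARRAY_CAPACITY)
    mid = region_bp // 2
    n = probes_covering(mid, PROBE_LEN, step, region_bp)
    print(f"region {region_bp:>11,} bp: step {step:>3} bp, "
          f"{n} probe(s) overlap the base at the region midpoint")
```

Under these assumptions, a step larger than the probe length leaves some bases untargeted, whereas a step much smaller than the probe length places several overlapping probes on every base, which is the situation Figure 1.2 describes as necessary for a single nucleotide change to produce a statistically significant shift.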
I coauthored a paper in which we presented a greatly improved method of mapping mutations to small genomic intervals (Flibotte et al. 2009), based on our ability to detect single nucleotide variation in C. elegans using aCGH. This method, called SNP-CGH mapping, enables the mapping of a mutation to within 200 kb after just a single genetic cross between the mutant and the highly polymorphic Hawaiian strain (see Section 5.5 for details). This rapid and simple method of mapping mutations with high resolution should be enormously useful in forward 4  genetic screens as well as in enhancer and suppressor screens. The SNP-CGH procedure maps C. elegans mutations with sufficient precision to permit the use of the method described in Chapter 3 in an effort to precisely identify the mutation. 1.7. Copy number variation in natural isolates of C. elegans Copy number variation is an important component of genetic diversity in humans (Sebat et al. 2004; Redon et al. 2006), mice (She et al. 2008) and flies (Emerson et al. 2008), and it factors into human disease susceptibility and prospects for personalized medicine (Sebat et al. 2004; Conrad and Hurles 2007; McCarroll and Altshuler 2007; Buchanan and Scherer 2008). Prior to my work, the extent of copy number variation in C. elegans was unknown. In Chapter 4, I describe copy number variation in the genomes of twelve natural isolates of C. elegans. Genes that are present in the canonical N2 reference genome but absent in these natural isolates are less likely to serve critical functions and can be deprioritized in the Knockout Consortium’s process of targeted deletion discovery using PCR. Researchers interested in deletions targeting these genes could isolate them from the rest of the mutations in the genetic background of natural isolates by serial backcrosses with the N2 strain. 
The indels that were detected also provided a large number of genetic markers throughout the genome and permitted the opportunity to more thoroughly characterize the complicated relationships among the strains (Denver et al. 2003; Haber et al. 2005).

1.8. Thesis objectives

The primary objective of this thesis was to develop an aCGH platform capable of detecting null mutations in C. elegans. The C. elegans Gene Knockout Consortium is mainly interested in discovering single gene deletions, so the platform needed to be efficient for this purpose. The capabilities of this platform were further extended to enable the detection of single nucleotide mutations, allowing the identification of other types of null alleles as well as missense or even silent mutations. I also wanted to use aCGH to investigate the gene content variation in wild isolates of C. elegans, out of an interest in genome evolution and an effort to clarify the complicated relationships among the strains. An important objective of the natural isolate work was to compile a list of genes present in N2 but absent in at least one of the wild strains, since the natural isolates were expected to contain a wealth of null alleles for non-essential N2 genes. These genes would then be deprioritized as targets for knockout by the C. elegans Gene Knockout Consortium.

Figure 1.1. Array Comparative Genomic Hybridization. Two genomic DNA samples are fragmented by sonication and differentially end-labeled with either Cy3 or Cy5 fluorescent dyes. The samples are then mixed in equal proportions and cohybridized to a microarray of oligonucleotide probes. Labeled fragments hybridize to complementary probe sequences on the microarray, and the ratio of fluorescent signals (Cy3 / Cy5) is measured at each probe location.
As shown here, ratios significantly greater than 1 (log2 ratio > 0) are indicative of amplifications in the mutant genome relative to the wild-type reference sample, while ratios significantly less than 1 (log2 ratio < 0) indicate deletions. Figure reprinted with permission from Don Moerman and Oxford University Press: Briefings in Functional Genomics and Proteomics 7(3): 195-204, copyright 2008.

Figure 1.2. Detection of a single nucleotide mutation by aCGH utilizing a microarray of highly overlapping oligonucleotide probes. A single nucleotide mutation is sufficient to cause a detectable shift in log2 ratios on a microarray of 50-mer oligonucleotide probes. Several highly overlapping probes, each of which targets the mutation, are required in order for a shift of this magnitude to be statistically significant. The first position of each probe is indicated by a . The length of each probe targeted by the mutation is illustrated by a horizontal bar, and the position of the mutation in each probe sequence is indicated by an *. Details are given in Chapter 3.

1.9. References

Anderson, P. 1995. Mutagenesis. Methods Cell Biol 48: 31-58. Barstead, R.J. and D.G. Moerman. 2006. C. elegans deletion mutant screening. Methods Mol Biol 351: 51-58. Brenner, S. 1974. The genetics of Caenorhabditis elegans. Genetics 77: 71-94. Buchanan, J.A. and S.W. Scherer. 2008. Contemplating effects of genomic structural variation. Genet Med 10: 639-647. C. elegans Sequencing Consortium. 1998. Genome sequence of the nematode C. elegans: a platform for investigating biology. Science 282: 2012-2018. Carvalho, B., E. Ouwerkerk, G.A. Meijer, and B. Ylstra. 2004. High resolution microarray comparative genomic hybridisation analysis using spotted oligonucleotides. J Clin Pathol 57: 644-646. Conrad, D.F. and M.E. Hurles. 2007. The population genetics of structural variation. Nat Genet 39: S30-36. Cuppen, E., E. Gort, E. Hazendonk, J. Mudde, J. van de Belt, I.J. Nijman, V.
Guryev, and R.H.A. Plasterk. 2007. Efficient target-selected mutagenesis in Caenorhabditis elegans: Toward a knockout for every gene. Genome Res. 17: 649-658. Denver, D.R., K. Morris, and W.K. Thomas. 2003. Phylogenetics in Caenorhabditis elegans: an analysis of divergence and outcrossing. Mol Biol Evol 20: 393-400. Edgley, M., A. D'Souza, G. Moulder, S. McKay, B. Shen, E. Gilchrist, D. Moerman, and R. Barstead. 2002. Improved detection of small deletions in complex pools of DNA. Nucleic Acids Res 30: e52. Emerson, J.J., M. Cardoso-Moreira, J.O. Borevitz, and M. Long. 2008. Natural selection shapes genome-wide patterns of copy-number polymorphism in Drosophila melanogaster. Science 320: 1629-1631. Flibotte, S., M.L. Edgley, J. Maydan, J. Taylor, R. Zapf, R. Waterston, and D.G. Moerman. 2009. Rapid High Resolution Single Nucleotide Polymorphism-Comparative Genome Hybridization Mapping in Caenorhabditis elegans. Genetics 181: 33-37. Flibotte, S. and D.G. Moerman. 2008. Experimental analysis of oligonucleotide microarray design criteria to detect deletions by comparative genomic hybridization. BMC Genomics 9: 497. Giaever, G., A.M. Chu, L. Ni, C. Connelly, L. Riles, S. Veronneau, S. Dow, A. Lucau-Danila, K. Anderson, B. Andre, A.P. Arkin, A. Astromoff, M. El-Bakkoury, R. Bangham, R. Benito, S. Brachat, S. Campanaro, M. Curtiss, K. Davis, A. Deutschbauer, K.D. Entian, P. Flaherty, F. Foury, D.J. Garfinkel, M. Gerstein, D. Gotte, U. Guldener, J.H. Hegemann, S. Hempel, Z. Herman, D.F. Jaramillo, D.E. Kelly, S.L. Kelly, P. Kotter, D. LaBonte, D.C. Lamb, N. Lan, H. Liang, H. Liao, L. Liu, C. Luo, M. Lussier, R. Mao, P. Menard, S.L. Ooi, J.L. Revuelta, C.J. Roberts, M. Rose, P. Ross-Macdonald, B. Scherens, G. Schimmack, B. Shafer, D.D. Shoemaker, S. Sookhai-Mahadeo, R.K. Storms, J.N. Strathern, G. Valle, M. Voet, G. Volckaert, C.Y. Wang, T.R. Ward, J. Wilhelmy, E.A. Winzeler, Y. Yang, G. Yen, E. Youngman, K. Yu, H. Bussey, J.D. Boeke, M. Snyder, P. Philippsen, R.W. 
Davis, and M. Johnston. 2002. Functional profiling of the Saccharomyces cerevisiae genome. Nature 418: 387-391. Gilchrist, E.J., N.J. O'Neil, A.M. Rose, M.C. Zetka, and G.W. Haughn. 2006. TILLING is an effective reverse genetics technique for Caenorhabditis elegans. BMC Genomics 7: 262. Granger, L., E. Martin, and L. Segalat. 2004. Mos as a tool for genome-wide insertional mutagenesis in Caenorhabditis elegans: results of a pilot study. Nucleic Acids Res 32: e117. Gresham, D., M.J. Dunham, and D. Botstein. 2008. Comparing whole genomes using DNA microarrays. Nat Rev Genet 9: 291-302. Gresham, D., D.M. Ruderfer, S.C. Pratt, J. Schacherer, M.J. Dunham, D. Botstein, and L. Kruglyak. 2006. Genome-Wide Detection of Polymorphisms at Nucleotide Resolution with a Single DNA Microarray. Science 311: 1932-1936. Haber, M., M. Schungel, A. Putz, S. Muller, B. Hasert, and H. Schulenburg. 2005. Evolutionary history of Caenorhabditis elegans inferred from microsatellites: evidence for spatial and temporal genetic differentiation and the occurrence of outbreeding. Mol Biol Evol 22: 160-173. Jones, M.R., J.S. Maydan, S. Flibotte, D.G. Moerman, and D.L. Baillie. 2007. Oligonucleotide Array Comparative Genomic Hybridization (oaCGH) based characterization of genetic deficiencies as an aid to gene mapping in Caenorhabditis elegans. BMC Genomics 8: 402. Kallioniemi, A., O.P. Kallioniemi, D. Sudar, D. Rutovitz, J.W. Gray, F. Waldman, and D. Pinkel. 1992. Comparative genomic hybridization for molecular cytogenetic analysis of solid tumors. Science 258: 818-821. Kamath, R.S., A.G. Fraser, Y. Dong, G. Poulin, R. Durbin, M. Gotta, A. Kanapin, N. Le Bot, S. Moreno, M. Sohrmann, D.P. Welchman, P. Zipperlen, and J. Ahringer. 2003. Systematic functional analysis of the Caenorhabditis elegans genome using RNAi. Nature 421: 231. Lamesch, P., S. Milstein, T. Hao, J. Rosenberg, N. Li, R. Sequerra, S. Bosak, L. Doucette-Stamm, J. Vandenhaute, D.E. Hill, and M. Vidal. 2004. C.
elegans ORFeome version 3.1: increasing the coverage of ORFeome resources with improved gene predictions. Genome Res 14: 2064-2069. Mantripragada, K.K., P.G. Buckley, T.D. de Stahl, and J.P. Dumanski. 2004. Genomic microarrays in the spotlight. Trends Genet 20: 87-94. McCarroll, S.A. and D.M. Altshuler. 2007. Copy-number variation and association studies of human disease. Nat Genet 39: S37-42. Moerman, D.G. and R.J. Barstead. 2008. Towards a mutation in every gene in Caenorhabditis elegans. Brief Funct Genomic Proteomic 7: 195-204. Pinkel, D., R. Segraves, D. Sudar, S. Clark, I. Poole, D. Kowbel, C. Collins, W.L. Kuo, C. Chen, Y. Zhai, S.H. Dairkee, B.M. Ljung, J.W. Gray, and D.G. Albertson. 1998. High resolution analysis of DNA copy number variation using comparative genomic hybridization to microarrays. Nat Genet 20: 207-211. Redon, R., S. Ishikawa, K.R. Fitch, L. Feuk, G.H. Perry, T.D. Andrews, H. Fiegler, M.H. Shapero, A.R. Carson, W. Chen, E.K. Cho, S. Dallaire, J.L. Freeman, J.R. Gonzalez, M. Gratacos, J. Huang, D. Kalaitzopoulos, D. Komura, J.R. MacDonald, C.R. Marshall, R. Mei, L. Montgomery, K. Nishimura, K. Okamura, F. Shen, M.J. Somerville, J. Tchinda, A. Valsesia, C. Woodwark, F. Yang, J. Zhang, T. Zerjal, J. Zhang, L. Armengol, D.F. Conrad, X. Estivill, C. Tyler-Smith, N.P. Carter, H. Aburatani, C. Lee, K.W. Jones, S.W. Scherer, and M.E. Hurles. 2006. Global variation in copy number in the human genome. Nature 444: 444-454. Riddle, D.L., T. Blumenthal, B.J. Meyer, and J.R. Priess. 1997. C. elegans II. Cold Spring Harbor Laboratory Press, Plainview, NY. Robert, V. and J.L. Bessereau. 2007. Targeted engineering of the Caenorhabditis elegans genome following Mos1-triggered chromosomal breaks. Embo J 26: 170-183. Sebat, J., B. Lakshmi, J. Troge, J. Alexander, J. Young, P. Lundin, S. Maner, H. Massa, M. Walker, M. Chi, N. Navin, R. Lucito, J. Healy, J. Hicks, K. Ye, A. Reiner, T.C. Gilliam, B. Trask, N. Patterson, A. Zetterberg, and M. Wigler. 2004.
Large-scale copy number polymorphism in the human genome. Science 305: 525-528. She, X., Z. Cheng, S. Zollner, D.M. Church, and E.E. Eichler. 2008. Mouse segmental duplication and copy number variation. Nat Genet 40: 909-914. Solinas-Toldo, S., S. Lampel, S. Stilgenbauer, J. Nickolenko, A. Benner, H. Dohner, T. Cremer, and P. Lichter. 1997. Matrix-based comparative genomic hybridization: biochips to screen for genomic imbalances. Genes Chromosomes Cancer 20: 399-407. Sulston, J.E. and H.R. Horvitz. 1977. Post-embryonic cell lineages of the nematode, Caenorhabditis elegans. Dev Biol 56: 110-156. Sulston, J.E., E. Schierenberg, J.G. White, and J.N. Thomson. 1983. The embryonic cell lineage of the nematode Caenorhabditis elegans. Dev Biol 100: 64-119. White, J.G., E. Southgate, J.N. Thomson, and S. Brenner. 1986. The structure of the nervous system of the nematode C. elegans. Philosophical Transactions of the Royal Society of London - Series B: Biological Sciences: 1-340. Zwaal, R.R., A. Broeks, J. van Meurs, J.T. Groenen, and R.H. Plasterk. 1993. Target-selected gene inactivation in Caenorhabditis elegans by using a frozen transposon insertion mutant bank. Proc Natl Acad Sci U S A 90: 7431-7435.

2. Efficient high-resolution deletion discovery in Caenorhabditis elegans by array Comparative Genomic Hybridization 1

2.1. Introduction

Comparative Genomic Hybridization (CGH) allows the detection of copy number differences between two DNA samples (Kallioniemi et al. 1992; Mantripragada et al. 2004). The two DNA samples, one a reference and the other a test sample, are differentially labeled and hybridized to a representative genome arrayed on a matrix. Over the past several years a number of different array platforms have been utilized for CGH, from bacterial artificial chromosomes (BACs) and cosmids to cDNA clones and oligonucleotides (Solinas-Toldo et al. 1997; Pinkel et al. 1998; Mantripragada et al. 2004).
As the ability to detect small alterations is limited by the spacing and size of the probes on the matrix, there has been a move away from BAC clones to oligonucleotide arrays for experiments where high resolution is required (Carvalho et al. 2004; Ishkanian et al. 2004; Sebat et al. 2004; Selzer et al. 2005). For example, oligonucleotide array-based CGH (aCGH) was recently used to measure copy number variation at specific exons in several human genes with a resolution between 50 and 500 bp (Dhami et al. 2005; Selzer et al. 2005). We were interested in determining whether aCGH could be used to detect copy number alterations (insertions and deletions, or “indels”) among different DNA samples of the nematode Caenorhabditis elegans. Specifically, we wished to determine whether aCGH has the required sensitivity and resolving power to detect single-gene knockouts, where the deletions may be small and the animals may be heterozygous. Our laboratory is a member of the C. elegans Knockout Consortium (http://celeganskoconsortium.omrf.org/) and we are interested in examining techniques that might help us identify and clone single-gene knockouts more efficiently. Array CGH, if efficient, has a number of potential advantages over our current PCR-based method (Barstead 1999) of screening for deletions, including the ability to screen thousands of genes in a single experiment, no constraint on the maximum detectable deletion size, and identification of copy number alterations at other loci in the mutant genome.

1  A version of this chapter has been published. Maydan, J.S., S. Flibotte, M.L. Edgley, J. Lau, R.R. Selzer, T.A. Richmond, N.J. Pofahl, J.H. Thomas, and D.G. Moerman. 2007. Efficient high-resolution deletion discovery in Caenorhabditis elegans by array comparative genomic hybridization. Genome Res. 17: 337-347.
As an example, the ability to detect large deletions would be useful for screening for tandem gene family knockouts, such as the Serpentine Receptor class AB (srab) family of seven-transmembrane chemoreceptors and integral membrane proteins, where consecutive genes may share functional redundancies (Chen et al. 2005; Thomas 2006b). We have designed three exon-tiled oligonucleotide arrays, one for two chromosomes (II and X), one for a single chromosome (II), and one for the entire genome. Using these arrays, we have detected both previously characterized deletions of 1–50 kb in control experiments and new deletion alleles of genes with no known mutations. The sensitivity of aCGH is such that we can detect small deletions even in heterozygous animals. The ability to detect single-copy single-gene deletions at this resolution will allow us to use aCGH to screen for novel induced deletions in mutagenized populations. This will greatly aid our efforts to generate knockout strains for the research community. The resolution of aCGH may also make it an attractive tool for those studying the population biology and evolution of C. elegans. The large number of indel differences observed among the Bristol, Hawaiian, and Madeiran nematode strains points to the dynamic nature of genomes and the flux of many of the gene families within this organism.

2.2. Results

2.2.1. Oligonucleotide probe quality and detection of homozygous 50-kb and 1-kb deletions

We designed a pilot microarray composed of a tiled set of oligonucleotide probes to nearly 90% of the exons and 94% of the genes on chromosomes X and II. This set revealed a remarkable consistency in signal-to-noise ratios over all of the experiments. Our initial aCGH experiment was designed to determine whether a large (50 kb) homozygous deletion could be distinguished reliably from wild-type DNA. For this experiment we used gkDf2, a homozygous-viable deletion of the dim-1 locus on chromosome X.
PCR analysis indicated that the deletion breakpoints lay between 8,046,205 and 8,046,422 on the left and 8,088,676 and 8,108,916 on the right, a physical interval of ∼50 kb that potentially included up to 12 genes. For this large deletion experiment, fluorescence intensities were collected for all probes, and we calculated log2 fluorescence intensity ratios for the mutant test sample with the reference wild-type sample (gkDf2/WT). After normalizing with a LOESS regression, the average log2 intensity ratios for the probe pairs gave a SD of 0.13 with very few outliers (see Fig. 2.1a). Note that these few outliers are from a plot of > 92,000 forward and reverse complement probe pairs. The gkDf2 deletion was identified unambiguously in this plot as a prominent peak of negative log2 ratios for probe pairs targeting the chromosome X region around dim-1. An enlarged view of this region is shown in Figure 2.1b, showing that nine genes are affected by this deletion. The deletion breakpoints are clearly defined at the resolution of individual exons. These results indicated that we could certainly identify deletions smaller than 50 kb. Interestingly, probes adjacent to the breakpoints exhibit a positive log2 ratio (Fig. 2.1b), possibly indicating previously unknown duplications of the flanking sequences, which have been periodically observed for deletions caused by this type of mutagenesis (data not shown). Using the same X:II array design, we next examined whether aCGH could detect a smaller homozygous deletion elsewhere on the X chromosome. The mutation gk329 is a 1047-bp deletion in the gene ceh-39. A hybridization plot comparing gk329 with wild-type DNA (Fig. 2.2a) showed a deleted region in the chromosome X region around ceh-39, the site of the gk329 deletion. This region is enlarged in Figure 2.2b and aligned with a diagram of the coding regions for ceh-41, ceh-21, and ceh-39.
The 30 probe pairs representing exons 1, 2, and 3 of ceh-39 (T26C11.7) showed strong negative fluorescence log2 ratios, while the nine probe pairs representing exon 4 of ceh-39 and the probe pairs targeting the five nearest exons of ceh-21 (T26C11.6) yielded lower amplitudes, but still statistically significant non-zero log2 ratios (with P-values of 3 × 10^-8 and 7 × 10^-5, respectively). Probe pairs targeting ceh-41 (T26C11.5) had log2 ratios closer to zero. The negative ratios for probes targeting exons 1–3 of gk329 corresponded exactly with deletion breakpoints determined by DNA sequencing (chromosome X coordinates 1,854,827/1,855,875). The next gene to the right of the deletion, T26C11.t1 (encoding a tRNA-Glu), was not represented among the probes on the array. Fluorescence ratios for probes to the next closest gene, tbx-41 (T26C11.1), lying 9365 bp beyond the distal deletion breakpoint, showed no evidence of reduced signal intensity in the gk329 sample (data not shown). From this experiment it was clear that the X:II chip design permits detection of deletion breakpoints at the resolution of individual exons.

2.2.2. Detection of single-copy number differences between hermaphrodite and male X chromosomes and in a balanced chromosome II deficiency

Broader application of aCGH in C. elegans research and other model systems would be feasible if its sensitivity extended to detecting heterozygous (single-copy) deletions. We performed two experiments to determine whether log2 fluorescence ratios from a heterozygote are sufficient to give unambiguous identification of deletions. In the first experiment we compared the hybridization signal from wild-type C. elegans hermaphrodite DNA (two X chromosomes) to male DNA (one X chromosome) for all probes on chromosomes II and X (Fig. 2.3). The median log2 fluorescence ratio for probe pairs to chromosome II (which should be equally represented in the two samples) was set to zero, and these exhibited a SD of 0.23.
Setting the log2 ratios for chromosome II to zero led to a median log2 ratio for forward and reverse complement probe pairs to chromosome X of -0.82, with a SD of 0.22. These distributions for chromosomes II and X overlapped by only 4% (Fig. 2.3). In a second experiment we compared wild-type DNA with that from a heterozygous 1202-bp deletion on chromosome II, using a balanced strain of genotype dab-1(gk291)/mIn1[mIs14 dpy-10(e128)]. Heterozygous animals are wild-type with a pharyngeal green fluorescent protein (GFP) signal conferred by the mIn1 balancer chromosome, and they segregate ∼50% heterozygotes, 25% gk291 homozygotes (viable and fertile but slow-growing), and 25% mIn1 homozygotes (viable and fertile Dpy with small broods and a strong pharyngeal GFP signal). Initially, we compared the hybridization signal from wild-type DNA to that from DNA made from confirmed gk291/mIn1 heterozygotes. A separate hybridization compared the wild-type signal with that from DNA made from a population containing all progeny genotypes in their normal proportions. To obtain this latter sample we simply washed animals off a plate and isolated DNA from the mixed population of animals. The data plots from these two hybridizations were virtually indistinguishable, and both yielded reliable detection of the gk291 deletion (P = 4 × 10^-13 [data not shown] and P = 8 × 10^-14 [Fig. 2.4], respectively). These experiments demonstrate that single-copy deletions within a single gene can be reliably detected using aCGH. Figure 2.4b shows a fluorescence ratio plot for probe pairs to the dab-1 locus aligned with a diagram of the dab-1 gene model from WormBase WS120. The sequenced deletion breakpoints lie at chromosome II coordinates 8,226,388 and 8,227,590, and agree perfectly with breakpoints predicted by the log2 fluorescence ratios.
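One way to see why the confirmed-heterozygote sample and the mixed-population sample gave virtually indistinguishable plots is to compute the idealized log2 ratio expected in each case. The sketch below is an illustration only: the function names are hypothetical, and it assumes that signal is strictly proportional to copy number and that each progeny genotype contributes DNA in proportion to its Mendelian frequency, neither of which holds exactly in practice.

```python
import math


def expected_log2_ratio(test_copies: float, ref_copies: float = 2.0) -> float:
    """Idealized log2(test / reference) ratio for a locus, given copy numbers."""
    return math.log2(test_copies / ref_copies)


def mean_copies(genotypes) -> float:
    """Average copy number of a locus in a pooled DNA sample.

    genotypes: iterable of (fraction_of_population, copies_of_locus).
    Assumes each genotype contributes DNA in proportion to its frequency.
    """
    return sum(fraction * copies for fraction, copies in genotypes)


# Confirmed gk291/mIn1 heterozygotes: one intact copy of the deleted interval.
het_ratio = expected_log2_ratio(1)

# Mixed progeny washed off a plate: 50% heterozygotes (1 copy),
# 25% gk291 homozygotes (0 copies), 25% mIn1 homozygotes (2 copies).
pooled = mean_copies([(0.50, 1), (0.25, 0), (0.25, 2)])
pooled_ratio = expected_log2_ratio(pooled)

print(het_ratio, pooled, pooled_ratio)
```

Under these assumptions both samples carry an average of one copy of the deleted interval per diploid genome, so both have an idealized log2 ratio of -1. The same idealized value applies to X-linked probes in the male versus hermaphrodite comparison, where the observed median of -0.82 is somewhat compressed relative to this theoretical value.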
The log2 fluorescence ratios for probes to the deleted region were similar to those observed for probes to the X chromosome in the male/hermaphrodite experiment. The proximal deletion breakpoint is within an intron, while the distal deletion breakpoint is within an exon. The oligo probes around these breakpoints serve to illustrate the high resolution of aCGH. Since we targeted all aCGH probes to exons rather than using a tiling-path approach, the resolution of the proximal breakpoint was only about 400 nucleotides, because the first intron causes a gap of more than 400 nucleotides between adjacent oligos in that region. However, the distal deletion breakpoint, which lies within an exon, was more accurately resolved since it is targeted by two overlapping oligos (Fig. 2.4b). Together, these two probes span just 73 base pairs, thus resolving the distal deletion breakpoint to < 50 nucleotides. In this experiment, probes flanking the deletion on either side yielded significant positive log2 fluorescence ratios.

2.2.3. The mIn1 balancer chromosome

Figure 2.4a reveals several copy number alterations in addition to the gk291 deletion, including a deletion (roughly chromosome II coordinates 1,020,000–1,050,000) and four amplifications (roughly chromosome II coordinates 1,561,000–1,567,000, 11,698,000–11,701,000, 12,847,000–12,848,000, and 13,482,000–13,490,000). Since these experiments were the first to include the mIn1 balancer chromosome, we speculated that these additional features might represent deletions or amplifications on the balancer chromosome, or elsewhere in the balancer strain genome. Formally, the additional chromosome features could be linked to the gk291-bearing chromosome, the balancer, or distributed between them or other chromosomes (this latter possibility applies to amplifications only). We suspected that these were features of mIn1, as the construction of this balancer chromosome required two rounds of mutagenesis (Edgley and Riddle 2001).
Comparison of N2 DNA with DNA from mIn1 homozygotes showed that all of the additional features observed in the dab-1/mIn1 heterozygote (Fig. 2.4a) were indeed derived from the mIn1 strain (data not shown). These alterations included the deletion on the left arm plus the various amplified regions throughout the chromosome. We can only be certain that the deletion involves the mIn1 chromosome II, since the amplifications of chromosome II sequences in the mIn1 strain do not necessarily reside on chromosome II.

2.2.4. Novel balanced lethal deletions on chromosome II

To demonstrate that aCGH can be used to reliably detect novel deletions, we conducted a screen for lethal mutations on chromosome II balanced by the mIn1 inversion, using TMP/UV as a mutagen. We mutagenized a predominantly L4 population of DR2078 [bli-2(e768) unc-4(e120)/mIn1[mIs14 dpy-10(e128)]], set up clonal populations from the F1 progeny, and screened the F2 for absence of viable, fertile Bli-2 Unc-4 adults. Such lines indicated the presence of new recessive lethal mutations linked to bli-2 unc-4 and balanced by mIn1. Approximately 200 balanced lethal lines were obtained. We analyzed 30 of these strains by array CGH, using DNA prepared from mixed populations washed off plates, and detected 25 new deletions (0–3 deletions per strain). We describe six of these new deletions here (see Fig. 2.5). For one of the candidates, gk463, PCR using primers to the regions flanking the deletion confirmed the presence of an 8-kb deletion; sequencing this PCR product demonstrated that gk463 deletes 8063 bp between chromosome II coordinates 4,131,236/4,139,298 encoding the ast-1 (T08H4.3) gene (Fig. 2.5a). Similar experiments confirmed three other deletions. The gk460 lesion is a 4.5-kb deletion encompassing the genes F10E7.4 (spon-1), F10E7.11, and F10E7.2 (breakpoints at chromosome II coordinates 7,118,909 and 7,123,417; Fig. 2.5b); gk462 is a 2.2-kb deletion encompassing Y51B9A.5 and an internal tRNA (Fig.
2.5c); and gk465 is a 141-bp deletion affecting the gene C06A8.1 (breakpoints at chromosome II coordinates 7,774,796 and 7,774,938; Fig. 2.5d). In addition to these small deletions, we identified several larger deletions spanning several genes. The gk488 deletion is nearly 500 kb in size, spanning chromosome II coordinates 10,662,230 to 11,160,425, affecting 93 genes (Fig. 2.5e). An even larger deletion affecting 274 genes was identified in gk487. The deletion spans chromosome II coordinates 3,057,725 through 3,841,090, completely deleting over 783 kb with the exception of ∼4.5 kb (from chromosome II coordinates 3,131,948 through 3,136,511) (Fig. 2.5f). From these experiments, we conclude that aCGH is a powerful and efficient method for discovery of knockout mutations and for characterizing large deletions in this organism.

2.2.5. Whole-genome array CGH: Comparing N2 Bristol to Hawaiian and Madeiran wild isolates

The amount of gene content variability within natural populations of animal species is largely unknown. High-resolution aCGH appears to be an excellent method for exploring this variability. To examine this variability we designed a whole-genome oligonucleotide CGH array to detect alterations among wild-type nematode strains. This array targets the entire C. elegans genome, with 94% coverage of the exons and 98% coverage of the genes. Using our selection criteria it was not possible to obtain 100% coverage with unique probes (see Methods). Here we compared the N2 Bristol strain, isolated in England, to the strain CB4856 that was isolated on one of the Hawaiian Islands and to a strain isolated on the island of Madeira (JU258). We chose the Hawaiian strain because it is a popular strain for single nucleotide polymorphism (SNP) mapping, as it has many sequence variations compared with N2 (Wicks et al. 2001).
Comparisons among wild isolates are potentially complicated by the presence of single nucleotide changes relative to N2, which might cause reduced log2 fluorescence ratios that do not reflect deletions. To minimize this possibility, we used a conservative analysis for copy number changes and counted regions as deleted only if they had a consistently low log2 fluorescence ratio over a substantial distance covering many probes (see Methods). Using these conservative criteria we were able to detect many indel differences between N2 and the Hawaiian strain (Fig. 2.6), illustrating that natural large-scale gene-content variation exists between populations. We observed similar differences between N2 and the Madeiran strain (Fig. 2.7). The Hawaiian strain exhibited 141 deletions relative to N2, with a total length of 1.54 Mb of DNA deleted (1.54% of the genome). These deletions removed 483 predicted genes and 48 predicted pseudogenes (2.54% of all genes) (Table 2.1). The Madeiran isolate had 122 deletions relative to N2, deleting 1.94 Mb (1.94% of the genome), removing 670 loci (39 of which are pseudogenes) (Table 2.1). Appendices 1 and 2 show chromosomal coordinates and interpretations for every deleted gene for pairwise genome comparisons between N2 and the Hawaiian and Madeiran strains, respectively. Alterations in the Hawaiian and Madeiran strains relative to N2 Bristol are unevenly distributed both within and between chromosomes: they appear more often on the chromosome arms than in the centers, with a large number of changes on chromosomes II and V but relatively few changes on chromosome X (Figures 2.6 and 2.7). Most of the copy number alterations detected appear to be deletions in the Hawaiian and Madeiran strains relative to N2, but a few amplifications are also evident.
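The idea behind such a conservative criterion can be sketched as a run-length filter: a region is reported only when a minimum number of consecutive probes all fall below a log2 ratio cutoff, so that an isolated low probe (as might result from a SNP) is ignored. The function name, the cutoff of -2.0 and the minimum of five probes below are hypothetical placeholders chosen for illustration; the actual criteria used in this work are those given in the Methods.

```python
from typing import List, Sequence, Tuple


def call_deletions(
    positions: Sequence[int],
    log2_ratios: Sequence[float],
    threshold: float = -2.0,  # placeholder cutoff, not the thesis value
    min_probes: int = 5,      # placeholder run length, not the thesis value
) -> List[Tuple[int, int, int]]:
    """Return (first_probe_pos, last_probe_pos, n_probes) for each run of at
    least `min_probes` consecutive probes with log2 ratio <= `threshold`.

    Probes are assumed to be sorted by genomic position on one chromosome.
    """
    calls = []
    run_start = None  # index of the first probe in the current low run
    for i, ratio in enumerate(log2_ratios):
        if ratio <= threshold:
            if run_start is None:
                run_start = i
        else:
            if run_start is not None and i - run_start >= min_probes:
                calls.append((positions[run_start], positions[i - 1], i - run_start))
            run_start = None
    # Close a run that extends to the last probe.
    if run_start is not None and len(log2_ratios) - run_start >= min_probes:
        calls.append((positions[run_start], positions[-1], len(log2_ratios) - run_start))
    return calls


# Toy data: five consecutive low probes, then one isolated low probe.
example_positions = [100, 150, 200, 250, 300, 350, 400, 450]
example_ratios = [0.1, -2.5, -2.4, -2.6, -2.3, -2.7, 0.0, -2.5]
print(call_deletions(example_positions, example_ratios))
```

In this toy example only the run of five low probes is reported; the single low probe at the end is not. Raising `min_probes` or lowering `threshold` makes the call more conservative, at the cost of missing small or partially divergent deletions.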
The genome regions deleted in the Hawaiian and Madeiran strains are not gene poor or enriched in known pseudogenes, indicating that there are major differences in the functional gene content among these isolates. Among many gene families analyzed, a few were overrepresented among deleted genes (Table 2.1). The frequency of deletions was particularly high for the MATH-BTB, F-box, C-type lectin, and Srz chemoreceptor families. It was impractical in this study to validate all copy number changes detected by aCGH between these strains, but we did test one representative deletion extending over several probes. We identified a 2942-bp deletion on chromosome V in the Hawaiian strain, CB4856, which affects two adjacent genes, C49G7.1 and D1065.3. Both are uncharacterized genes containing ankyrin repeats as well as BRCT and WSN domains. We designed primers flanking the deletion, amplified the affected region using PCR, and sequenced the region to determine the deletion breakpoints. The deletion falls between chromosome V coordinates 4,057,455 and 4,057,457 for the proximal breakpoint and 4,060,396 and 4,060,398 for the distal breakpoint, confirming a deletion for these two genes in the Hawaiian strain relative to N2 Bristol (Fig. 2.8). We also examined a gene, gst-38, that has been sequenced from the Hawaiian strain and is known to have several SNPs relative to the Bristol strain (Denver et al. 2003). Probe targets in the Hawaiian genome contain 0–3 SNPs each, which resulted in a significantly negative log2 ratio in that region of the genome (-1.6), but not of sufficient amplitude to pass our conservative criteria for identifying deletions (see Methods).

2.3. Discussion

2.3.1. The utility of aCGH in screening for novel deletions

We have demonstrated that aCGH is a viable platform for detecting heterozygous deletions as small as 141 bp in size in C. elegans. By targeting exons it is more likely that any detected deletion alters the structure of the gene product.
Depending on the overlap of oligonucleotides on the array, the resolution of a deletion breakpoint can be < 50 bp. To increase resolution, chromosome-specific arrays can be manufactured as we did for chromosome II, which may be desirable depending on the experiment being undertaken. For identifying lethal mutations this may be the most fruitful approach, as the lethal mutation will already be balanced (as described above). PCR amplification and DNA sequencing of the deleted region in the mutant genome can be utilized to precisely identify the breakpoints after aCGH has made the initial identification. The ability to detect deletion and amplification events in heterozygous animals is a testament to the sensitivity of aCGH. This is particularly important when screening for lethal mutations, as it means one can use DNA samples from balanced heterozygous populations that are simply washed from a plate. The added convenience of not having to separate out mutant animals should make this type of analysis more amenable as a high-throughput method. An important feature of aCGH is that it yields a high-resolution view of a whole chromosome, or even a whole genome, without the size limitation of ∼100 kb when using a BAC-based platform. Combining an oligonucleotide-based approach with a high-density array format (∼385,000 unique probes) is already leading to widespread adoption of this method for high-resolution mapping of DNA breakpoints for larger sized chromosomal rearrangements in tumors and microdeletion syndromes in humans (Pollack et al. 2002; Selzer et al. 2005; Stallings et al. 2006; Strefford et al. 2006; Urban et al. 2006), as well as the detection of amplifications and deletions such as copy number polymorphisms < 0.1 Mb in size (Lucito et al. 2003; Sebat et al. 2004; Conrad et al. 2006).
The power of screening a whole chromosome or whole genome for gains and losses of genomic DNA was amply illustrated when we tested for mutations balanced by the inversion mIn1. Besides the known inversion, the mIn1 strain contained several previously unknown deletions and amplifications, some linked to the inversion, but others possibly resident elsewhere in the genome (aCGH identifies only the presence of a sequence in a genome, not its location). We also found a previously undetected deletion of exons 4 and 5 of the gene K05F6.2 (fbxb-50) in our N2 strain. Curiously, this deletion must have occurred relatively recently, as all of the mutations studied here were isolated from N2 in this laboratory. Without whole-genome testing by aCGH, these novel features present in the genomes of N2 and the balancer strain would have remained undetected.

2.3.2. Natural gene content variation in wild populations

The results of our whole-genome experiment comparing N2 to the Hawaiian and Madeiran wild-type strains revealed a large amount of gene-content variation among these natural isolates. Most of these differences are deletions in the Hawaiian or Madeiran strains relative to N2. Obviously, there is a bias in favor of detecting deletions, because all probes target sequences that are present in the N2 genome. Probe targets containing several SNPs could potentially cause the identification of spurious deletions, so we have used very conservative criteria to ameliorate this possibility and observed that even a gene as divergent as gst-38 is not mistakenly identified as a deletion. Our exon-centric probe selection should also help to reduce the impact of SNPs on hybridization, since SNPs are less common in coding sequences. To identify N2 deletions we will need to compare N2 to a sequenced Hawaiian or Madeiran strain.
Previous work in nematodes has shown that chromosomal rearrangements, repeat elements, and transposons are all more common on chromosome arms than in the central region of the chromosomes (Stein et al. 2003). Homologous gene clusters are also more common on the chromosome arms, particularly on the proximal arm of chromosome II and both arms of chromosome V (Thomas 2006b), where we observe the largest number of deletions in the Hawaiian and Madeiran strains. This result suggests that non-allelic homologous recombination (Lupski 1998) on chromosome arms between repeat sequences and/or homologous gene clusters could be responsible for many of the deletions observed in these strains relative to N2. This could also explain the smaller number of gene content alterations observed between N2 and the Hawaiian and Madeiran strains on the X chromosome, where chromosomal rearrangements are less common (Stein et al. 2003). Our array designs targeted only annotated exons in the sequenced N2 genome, but the large number of deletions observed in the Hawaiian and Madeiran strains relative to N2 implies the likelihood that N2 has also lost novel genes present in the other natural isolates. The frequency of deletions was particularly high for the MATH-BTB, F-box, C-type lectin, and Srz chemoreceptor families. These four gene families are among those with the highest rates of birth–death evolution among Caenorhabditis species (J.H. Thomas, unpublished data). The correlation indicates that indel population diversity within the C. elegans species is related to long-term evolutionary stability in gene families. The nature and level of deletion polymorphisms that we find in the nematode is mirrored in human populations (Conrad et al. 2006; Hinds et al. 2006; Locke et al. 2006; McCarroll et al. 2006). In the study by Conrad et al.
(2006), they reported that genes involved in immunity and defense, sensory perception, cell adhesion, and signal transduction were especially prone to deletion, categories that overlap the gene families highlighted as prone to deletion in C. elegans (Thomas et al. 2005; Thomas 2006a). Array studies in nematodes and humans are the first experiments to view wholesale gene-content variation of large numbers of genes in many diverse gene families between populations. These observations from humans and nematodes offer strong support for the “less-is-more” hypothesis of evolutionary change (Olson 1999). In his review, Olson argued that “loss of gene function may represent a common evolutionary response of populations undergoing a shift in environment and, consequently, a change in the pattern of selective pressures.” He went on to suggest that “adaptive loss of function may occur regularly and may spread rapidly through small populations.” With their small genome size, rapid life cycle, and self-fertilizing mode of reproduction, dispersed wild populations of nematodes are perhaps ideally suited to monitor genomic responses to environmental selective pressures. Like others, we observe that the Hawaiian and Madeiran strains are more similar to each other than either is to the Bristol (N2) strain (Haber et al. 2005; Stewart et al. 2005). At first this seems surprising; why should nematodes from the Hawaiian Islands located in the middle of the Pacific Ocean and nematodes from Madeira, an island in the Atlantic off the coast of the African continent, be so similar? As previously suggested (Stewart et al. 2005), we think there may be a simple explanation based on the migration of human populations. During the last half of the nineteenth and first half of the twentieth centuries, sugar cane was a major crop in Hawaii.
The workers in the cane fields came from many countries including China, Japan, the Philippines, and after 1878, from Portugal (Bartholomew and Bailey 1994). Almost all of the new immigrants from Portugal came from either the Azores or the island of Madeira (Bartholomew and Bailey 1994), and these immigrants may have inadvertently brought C. elegans with them. If this is true, we have a fairly precise timeline for the introduction of a new strain of C. elegans to Hawaii. (Subsequent analysis presented in Section 4.3.2 indicates that these two strains are not closely related.) In the experiments described here we have demonstrated that aCGH is a robust technology with many possible applications. These range from screens for novel induced deletions to population genetic studies comparing evolutionary differences among natural isolates. The protocols and chips described here for the C. elegans genome can similarly be made for other organisms, as is already evident in human, mouse, and yeast studies. The high-resolution genome-wide investigation of DNA copy number changes reported here for C. elegans will likely prove to be a powerful tool in genome-wide studies of other model organisms, such as the fly and zebrafish genomes, and the more recently sequenced chicken and dog genomes.

2.4. Methods

2.4.1. Probe selection, microarray design, and microarray manufacture

The pilot project focused initially on chromosomes II and X. DNA oligonucleotides, 50 nucleotides in length, were selected to tile open reading frames from both chromosomes. Several types of filters were applied in the selection process in order to maximize the sensitivity and specificity of the oligonucleotides and the signal-to-noise ratio. The applied filters were intentionally relatively mild in order to produce data that would reveal the most important characteristics of oligonucleotides for future chip designs.
As a result, ∼90% of the exons and 94% of the genes from both chromosomes are represented on the array. Our oligonucleotide selection can be arbitrarily divided into eight sequential phases. Unless stated otherwise, all of the computer programs have been developed as part of the current work and are freely available from one of the authors (S. Flibotte).
(1) The sequences of all curated exons and RNA transcripts on chromosomes II and X were extracted from WormBase (data freeze WS120). Sequences smaller than 50 bases were extended to 50 bases and overlapping sequences were merged.
(2) All of the repeats annotated in WormBase were masked. All non-masked subsequences < 50 bases in length were then masked (this was also done after phases 3 and 4).
(3) All of the 20-mers occurring more than once in the genome were masked.
(4) Homopolymers > 5 bases in length were masked.
(5) All possible 50-mers were extracted from the non-masked subsequences and only those with GC content between 30% and 56% were kept, which corresponds to a melting temperature range of Tm = 72.6 ± 5°C.
(6) All of the 50-mers with folding energy larger than -1 kcal/mol according to a hybrid-ssmin calculation (Markham 2003) were kept.
(7) Following a MegaBLAST (Zhang et al. 2000) calculation, 50-mers without significant homology with other locations in the genome were kept.
(8) For all remaining subsequences, the 50-mers with the lowest overall 15-mer counts were selected using a greedy algorithm and probe spacing parameter, ensuring that the distance between the starting positions of two neighboring oligonucleotides is at least 22 bases for chromosome II and 21 bases for chromosome X, except for the region around dim-1, where the distance was set to 6 bases. For each subsequence, the selection continued until no further oligonucleotides could be selected while respecting the overlap constraint.
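The filtering and selection phases can be illustrated with a minimal Python sketch. It is not the authors' software: it covers only the homopolymer check (phase 4), the GC window (phase 5), and one possible reading of the greedy, spacing-constrained selection (phase 8), scoring each candidate by the summed genomic frequency of its 15-mers. The function names, the tie-breaking rule (lowest start position first), and the omission of repeat masking, folding-energy, and MegaBLAST filters are all assumptions made for illustration.

```python
from collections import Counter

PROBE_LEN = 50   # probe length used in the text
KMER = 15        # k-mer size used for the uniqueness score


def count_kmers(genome, k=KMER):
    """Count every k-mer in a genome string (hypothetical helper)."""
    return Counter(genome[i:i + k] for i in range(len(genome) - k + 1))


def gc_fraction(seq):
    """Fraction of G and C bases in a sequence."""
    return sum(base in "GC" for base in seq) / len(seq)


def has_long_homopolymer(seq, max_run=5):
    """True if any single-base run is longer than max_run bases."""
    run = 1
    for prev, cur in zip(seq, seq[1:]):
        run = run + 1 if cur == prev else 1
        if run > max_run:
            return True
    return False


def kmer_score(probe, genome_kmer_counts, k=KMER):
    """Sum of the genomic frequencies of all k-mers contained in the probe."""
    return sum(genome_kmer_counts.get(probe[i:i + k], 0)
               for i in range(len(probe) - k + 1))


def select_probes(region, genome_kmer_counts, min_spacing, probe_len=PROBE_LEN):
    """Greedy selection: take the lowest-scoring candidates first, skipping any
    whose start lies closer than min_spacing to an already chosen probe."""
    candidates = []
    for start in range(len(region) - probe_len + 1):
        probe = region[start:start + probe_len]
        if not 0.30 <= gc_fraction(probe) <= 0.56:
            continue
        if has_long_homopolymer(probe):
            continue
        candidates.append((kmer_score(probe, genome_kmer_counts), start))
    chosen = []
    for _score, start in sorted(candidates):  # ties resolved by start position
        if all(abs(start - s) >= min_spacing for s in chosen):
            chosen.append(start)
    return sorted(chosen)
```

Under these assumptions, calling `select_probes(region, count_kmers(genome), 22)` would mimic the chromosome II spacing, and a spacing of 6 would mimic the denser tiling described for the dim-1 region.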
The overall 15-mer count of an oligonucleotide is defined as the sum of the genomic frequencies of all constituent 15-mers. The application of all of these filters resulted in the selection of 97,481 oligonucleotides for chromosome II and 92,209 oligonucleotides for chromosome X. Microarrays were manufactured by NimbleGen Systems, with each oligonucleotide and its corresponding reverse complement synthesized at random positions on the array. A similar procedure was used to design a chip targeting the whole C. elegans genome (using release WS139) and a chip targeting chromosome II alone (using data freeze WS150). The only differences were that no reverse complement probes were synthesized, the probe spacing parameter was adjusted, and a procedure was introduced to rescue exons targeted by fewer than two oligonucleotides. For the whole-genome chip we tried to select one probe upstream and one probe downstream as close as possible to the underrepresented exon following the filters 2–7 described in the previous paragraph. With a probe spacing parameter of 39, this resulted in 61,910 probes for chromosome I, 64,165 for chromosome II, 56,856 for chromosome III, 59,422 for chromosome IV, 82,944 for chromosome V, and 59,564 for chromosome X. For the chromosome II chip we selected 332,334 probes targeting annotated exons with a probe spacing parameter of 6, and 47,853 probes targeting noncoding sequences with a spacing parameter of 85.

2.4.2. Nematode culture, harvest, and DNA preparation

Nematodes were generally grown as previously described (Brenner 1974) on 60- or 150-mm NGM agar plates seeded with Escherichia coli strain OP50 or χ1666.
Strains used were N2 (VC196, a hermaphrodite subculture of N2 received from the Caenorhabditis Genetics Center in 2002); N2 males (male stock of CGC N2 received in 1998); mIn1[mIs14 dpy-10(e128)] homozygotes derived from a single Dpy animal selected from CGC strain DR2078 (strain not kept); VC100 (unc-112(r367) V; gkDf2 X); VC615 (dab-1(gk291)/mIn1[mIs14 dpy-10(e128)] II); VC766 (ceh-39(gk329) X); CB4856 (a subculture of the Hawaiian C. elegans wild isolate HA-8); and JU258 (a wild C. elegans isolate from Madeira). All mutant strains (excluding mIn1) were generated by mutagenesis with trimethylpsoralen (TMP) and UV-irradiation. For DNAs prepared from plate cultures, populations were grown to starvation and harvested by washing into 15-mL centrifuge tubes with 10 mL of M9 buffer containing 0.01% Triton X-100. Each population was washed seven times by centrifugation, removal of supernatant by aspiration, and resuspension and vortexing in fresh M9/Triton X-100. After the final wash, populations were plated on unseeded agar plates and left overnight at 20°C to digest any bacteria remaining in their guts, then reharvested by washing and centrifugation. For DNA from N2 males and confirmed dab-1(gk291)/mIn1[mIs14 dpy-10(e128)] II balanced heterozygotes, worms were picked directly into M9/Triton X-100 in labeled 1.8-mL microcentrifuge tubes, and washed free of bacteria in seven rounds of dilution/centrifugation/aspiration. Aliquots of pelleted worms were transferred to 1.5-mL microcentrifuge tubes containing lysis buffer (50 mM KCl, 10 mM Tris-HCl at pH 8.3, 2.5 mM MgCl2, 0.45% NP-40 [Igepal], 0.45% Tween-20, 0.01% gelatin, 300 µg/mL Proteinase K), frozen at -20°C, and incubated at 55°C–60°C for 3 hours. DNA was prepared either by standard phenol-chloroform extraction followed by ethanol precipitation or with the Puregene DNA Purification Kit (D-7000A, Gentra Systems) using the solid tissue protocol.
Purified DNAs were resuspended in nuclease-free sterile dH2O or TE (10 mM Tris-HCl, 1 mM EDTA at pH 7.0–8.0). DNA concentrations were determined with a spectrophotometer (Biomate3, Thermo Spectronic) and adjusted to 500 ng/µL for submission to NimbleGen Systems, Inc. for further processing.

2.4.3. DNA fragmentation and labeling

Samples were fragmented and labeled in the NimbleGen Service Laboratory as follows. Two micrograms of each genomic DNA sample were diluted to 80 µL with deionized (DI) water and fragmented by sonication. A portion (0.3 µg) of each sonicated sample was run on a 1% agarose gel to confirm that most of the DNA fragments were between 500 and 2000 bp in length. Cy3 and Cy5 dye-labeled random 9-mers (TriLink BioTechnologies, Inc.) were diluted to 1 O.D./42 µL of buffer containing 0.125 M Tris-HCl (pH 8.0), 0.125 M MgCl2, and 1.75 µL/mL β-mercaptoethanol. Mutant DNA samples were labeled with Cy3 and the wild-type DNA sample (VC196) was labeled with Cy5. One microgram of genomic DNA was added to each random 9-mer buffer solution, denatured at 95°C, and then chilled on ice in 0.2 mL PCR tubes. A total of 10 µL of 50x dNTP mixture (1x TE buffer, 10 mM each of dATP, dCTP, dGTP, and dTTP), 8 µL of DI water, and 100 U of Klenow fragment (exo-) was added to each tube and mixed well with a pipette. Samples were centrifuged and incubated at 37°C for 2 hours and 10 µL of 0.5 M EDTA was added and mixed well to stop the labeling reaction. DNA was precipitated by adding 11.5 µL of 0.5 M NaCl and 110 µL of isopropanol, vortexing, incubating in the dark for 10 min at room temperature, and centrifuging at 12,000g for 10 min. The supernatant was removed and the DNA pellet was washed with 500 µL of 80% ethanol. After centrifugation at 12,000g for 2 min, the supernatant was removed, and the pellet was dried in a SpeedVac on low heat for 5 min before being rehydrated in 25 µL of DI water. DNA concentration was measured using a spectrophotometer.

2.4.4.
Sample hybridization and imaging

Samples were hybridized in the NimbleGen Service Facility using standard operating procedures, as previously described (Selzer et al. 2005). Briefly, 15 µg of each labeled test and reference DNA sample were added to a single 1.5 mL tube and dried down in the dark in a SpeedVac on low heat. The DNA was resuspended in 3.5 µL of DI water and vortexed; 41.5 µL of NimbleGen hybridization buffer was added to the tube, mixed well, and heated at 95°C for 5 min in the dark. Samples were hybridized for 16–20 hours at 42°C and then washed with NimbleGen wash buffers and scanned on an Axon scanner (Model # 4000B).

2.4.5. Data analysis

The fluorescence intensity of each feature on the array was extracted with the NimbleScan 2.1 software for the sample and reference images. The intensity ratios were normalized by robust LOESS regression on the so-called M-A plot, where M = log2(I1/I2) and A = log2(sqrt(I1 × I2)), with I1 and I2 being the intensities of the feature in the two images, similar to the procedure described in Yang et al. (2002). The LOESS regression was implemented with the library from Cleveland et al. (1992). The log2 ratios, M, corresponding to the probes targeting the forward and reverse strands at the same genomic location, were averaged. No outliers were excluded from the subsequent analysis. Copy number aberrations were detected both by careful visual inspection and with a segmentation algorithm developed and currently being tested by one of the authors (S. Flibotte). This segmentation algorithm is a very efficient implementation of a bottom-up approach (see Appendix 3). The P-value for each aberration was calculated with a one-sample t-test (however, with the total number of non-aberrant data points being very large, one-sample and Welch two-sample t-tests give essentially the same P-values).
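As a rough illustration of the quantities defined above, the following Python sketch computes M and A for a single feature and the one-sample t statistic for the log2 ratios within a candidate aberration. It is only a sketch under stated assumptions: the LOESS normalization step is omitted, the null mean is assumed to be 0 (the expected log2 ratio for unchanged copy number after normalization), and the function names are hypothetical. Converting the t statistic to a P-value requires the Student t distribution, which is left to a statistics library.

```python
import math
from statistics import mean, stdev


def ma_values(i1, i2):
    """Return (M, A) for one feature from its two channel intensities."""
    m = math.log2(i1 / i2)              # log2 intensity ratio
    a = math.log2(math.sqrt(i1 * i2))   # log2 of the geometric mean intensity
    return m, a


def one_sample_t(segment_m, baseline=0.0):
    """t statistic and degrees of freedom for the log2 ratios in a segment,
    tested against an assumed baseline mean (0 by default)."""
    n = len(segment_m)
    standard_error = stdev(segment_m) / math.sqrt(n)
    t_stat = (mean(segment_m) - baseline) / standard_error
    return t_stat, n - 1
```

For example, a feature with intensities 256 and 1024 has M = -2 and A = 9, so it would sit at the deletion cutoff used later in this section if the ratio persisted across a segment.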
A total of 9500 50-mer oligonucleotides of random sequence but with the same GC content distribution as our probes were synthesized at random locations on each microarray. Use of data from these probes as an estimate of background tends to increase the overall standard deviation of the data, and therefore our analysis includes no background subtraction. For indel comparisons between wild-type strains, genes were taken from WormBase release WS150 and were classified into families using a combination of the blastclust clustering algorithm and protein alignments and trees, performed using clustalw and phyml (Thompson et al. 1994; Guindon and Gascuel 2003). We set conservative cutoff values for identifying indels, requiring log2 ratios of ≥ 1 for amplification segments and ≤ -2 for deletions. Chromosomal start and end coordinates for each gene were used to determine whether the gene was entirely contained within an assigned deletion; genes that spanned the end of a deletion were not included.

Table 2.1. Gene family members deleted in natural isolates from Hawaii (CB4856) and Madeira (JU258). P-values were computed only for families with potentially higher rates of deletions, and only values < 0.05 are shown. P-values are relative to all genes and are one-sided and computed by a 2 × 2 chi-square test with Yates correction. P-values are not corrected for multiple testing, and those with marginal values after Bonferroni correction are enclosed in parentheses. NA, not applicable; NS, not significantly different.
Gene family                  No. of    Hawaiian vs. N2                   Madeiran vs. N2
                             genes     No. deleted  % Deleted  P-value   No. deleted  % Deleted  P-value
All genes                    20,873    531          2.54       NA        670          3.21       NA
MATH only                    50        33           66.00      <0.0001   35           70.00      <0.0001
MATH-BTB                     47        17           36.17      <0.0001   24           51.06      <0.0001
E3 ubiquitin ligase          38        11           28.95      <0.0001   12           31.58      <0.0001
F-box                        536       71           13.25      <0.0001   94           17.54      <0.0001
Ubiquitin                    35        5            14.29      <0.0001   5            14.29      0.0001
Lectin C-type                304       26           8.55       <0.0001   25           8.22       <0.0001
DUF130                       52        6            11.54      <0.0001   24           46.15      <0.0001
DUF19                        84        15           17.86      <0.0001   23           27.38      <0.0001
Srh chemoreceptor            311       34           10.93      <0.0001   33           10.61      <0.0001
Srz chemoreceptor            115       23           20.00      <0.0001   10           8.70       (0.0023)
SNF-2-like helicase          105       11           10.48      <0.0001   10           9.52       0.0008
Srbc chemoreceptor           84        6            7.14       (0.0205)  10           11.90      <0.0001
DUF274                       22        0            0.00       NS        6            27.27      <0.0001
Srw chemoreceptor            148       9            6.08       (0.014)   13           8.78       0.0003
Str-Srj chemoreceptor        325       8            2.46       NS        22           6.77       0.0006
Sri chemoreceptor            81        8            9.88       0.0001    5            6.17       NS
Thioredoxin                  44        2            4.55       NS        6            13.64      0.0005
Srt chemoreceptor            75        1            1.33       NS        7            9.33       (0.0077)
Nuclear receptor             285       3            1.05                 5            1.75
Homeodomain                  108       2            1.85                 2            1.85
Collagen                     231       0            0.00                 0            0.00
Major facilitator permease   213       0            0.00                 0            0.00
Ser-thr protein kinase       309       2            0.65                 0            0.00
DUF18 (ShTK)                 122       2            1.64                 2            1.64
Major sperm protein          111       0            0.00                 0            0.00
Transthyretin                96        0            0.00                 0            0.00
Ligand-gated ion channels    94        0            0.00                 0            0.00
Srd chemoreceptor            76        0            0.00                 0            0.00
Acyltransferase              59        1            1.69                 1            1.69
Rab-ras                      71        1            1.41                 0            0.00
Srg chemoreceptor            68        1            1.47                 1            1.47
DEAD-box helicase            63        0            0.00                 0            0.00
ABC transporter              61        0            0.00                 1            1.64
Receptor L                   62        1            1.61                 1            1.61
Sre chemoreceptor            56        0            0.00                 0            0.00
Sru chemoreceptor            48        0            0.00                 0            0.00
Insulin                      38        0            0.00                 0            0.00
Glycosyl hydrolase           37        1            2.70                 0            0.00
Tyr protein kinase           37        0            0.00                 0            0.00
Galectin                     24        0            0.00                 0            0.00

Figure 2.1. Detection of a 50-kb homozygous viable deletion in gkDf2.
(a) Normalized log2 ratios (gkDf2/WT) of the average fluorescent intensities for each of the 92,209 forward and reverse pairs of probes to the X chromosome are represented by circles. The deletion is identified by negative log2 ratios and indicated by an arrow. (b) A higher resolution view of fluorescence ratios for probe pairs targeting the 50-kb deletion. Horizontal bars indicate the positions of the nine genes targeted by the deletion. Duplications of sequences flanking the deletion are indicated by positive log2 ratios. Adjacent 50-mer probes in this region overlap by as much as 44 bp.

Figure 2.2. Detection of a 1047-bp homozygous viable deletion in ceh-39 (gk329). (a) Normalized log2 ratios (gk329/WT) for the average fluorescence intensities for all probe pairs to the X chromosome are shown. The arrow indicates the deletion. (b) Intensity ratios for probes to ceh-39, ceh-21, and ceh-41 are shown with WormBase gene models to illustrate probe coverage in exons near the deletion. Sequenced deletion breakpoints are indicated by dotted lines. aCGH accurately identified the left breakpoint between exons 3 and 4 of ceh-39.

Figure 2.3. Comparison of the normalized average fluorescence ratios (XO male / XX hermaphrodite) for all probe pairs to chromosomes II and X. The graph in the top right corner plots the probe density versus the log2 fluorescence ratio for probes to chromosomes II and X. The curve peaking on the left is for the X chromosome and the curve peaking on the right is for chromosome II. The distributions overlap by ∼4%.

Figure 2.4. Detection of the 1202-bp deletion in dab-1 (gk291) in a wash-sampled balanced heterozygous population. (a) The normalized log2 ratios [(dab-1(-)/mIn1)/WT] of the average fluorescence intensities for probe pairs to chromosome II are plotted. The arrow indicates the dab-1 deletion (other features are discussed in the text). (b) Normalized fluorescence ratios for probe pairs targeting dab-1 are shown.
Sequenced deletion breakpoints are indicated by dotted lines and were accurately predicted by aCGH. The left breakpoint lies within the second intron. Overlapping probes targeting the right breakpoint span just 73 bp, allowing resolution of the right breakpoint to within fewer than 50 bp.

Figure 2.5. Deletions detected in a screen for homozygous lethal mutations in six wash-sampled balanced heterozygous populations. The following normalized log2 ratios are shown: (a) (gk463/mIn1)/WT; (b) (gk460/mIn1)/WT; (c) (gk462/mIn1)/WT; (d) (gk465/mIn1)/WT; (e) (gk488/mIn1)/WT; and (f) (gk487/mIn1)/WT.

Figure 2.6. Whole-genome aCGH comparing Hawaiian (CB4856) and Bristol N2 (VC196) hermaphrodites. Large-scale copy number polymorphism is evident between these two wild-type isolates. Normalized log2 fluorescence ratios (CB4856/N2) for all probes on the chip are shown.

Figure 2.7. Whole-genome aCGH comparing Madeiran (JU258) and Bristol N2 (VC196) hermaphrodites. Large-scale copy number polymorphism is evident between these two wild-type isolates. The distribution of deletions both within and between chromosomes is similar to that seen in the Hawaiian strain (CB4856; Figure 2.6). Normalized log2 fluorescence ratios (JU258/N2) for all probes on the chip are shown.

Figure 2.8. A homozygous viable deletion identified on chromosome V in the Hawaiian strain (CB4856). Normalized log2 ratios (CB4856/N2) indicate that the deletion targets the genes C49G7.1 and D1065.3.

2.5. References

Barstead, R.J. 1999. Reverse Genetics. In C. elegans: A Practical Approach (ed. I.A. Hope), pp. 97-118. Oxford University Press, Oxford, UK.
Bartholomew, G. and B. Bailey. 1994. Maui Remembers: A Local History. Mutual Publishing, Honolulu, Hawaii.
Brenner, S. 1974. The genetics of Caenorhabditis elegans. Genetics 77: 71-94.
Carvalho, B., E. Ouwerkerk, G.A. Meijer, and B. Ylstra. 2004.
High resolution microarray comparative genomic hybridisation analysis using spotted oligonucleotides. J Clin Pathol 57: 644-646.
Chen, N., S. Pai, Z. Zhao, A. Mah, R. Newbury, R.C. Johnsen, Z. Altun, D.G. Moerman, D.L. Baillie, and L.D. Stein. 2005. Identification of a nematode chemosensory gene family. Proc Natl Acad Sci U S A 102: 146-151.
Cleveland, W.S., E. Grosse, and M.J. Shyu. 1992. A Package of C and Fortran Routines for Fitting Local Regression Models. Chapman and Hall, Ltd., London, UK.
Conrad, D.F., T.D. Andrews, N.P. Carter, M.E. Hurles, and J.K. Pritchard. 2006. A high-resolution survey of deletion polymorphism in the human genome. Nat Genet 38: 75-81.
Denver, D.R., K. Morris, and W.K. Thomas. 2003. Phylogenetics in Caenorhabditis elegans: an analysis of divergence and outcrossing. Mol Biol Evol 20: 393-400.
Dhami, P., A.J. Coffey, S. Abbs, J.R. Vermeesch, J.P. Dumanski, K.J. Woodward, R.M. Andrews, C. Langford, and D. Vetrie. 2005. Exon Array CGH: Detection of Copy-Number Changes at the Resolution of Individual Exons in the Human Genome. Am J Hum Genet 76: 750-762.
Edgley, M.L. and D.L. Riddle. 2001. LG II balancer chromosomes in Caenorhabditis elegans: mT1(II;III) and the mIn1 set of dominantly and recessively marked inversions. Mol Genet Genomics 266: 385-395.
Guindon, S. and O. Gascuel. 2003. A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood. Syst Biol 52: 696-704.
Haber, M., M. Schungel, A. Putz, S. Muller, B. Hasert, and H. Schulenburg. 2005. Evolutionary history of Caenorhabditis elegans inferred from microsatellites: evidence for spatial and temporal genetic differentiation and the occurrence of outbreeding. Mol Biol Evol 22: 160-173.
Hinds, D.A., A.P. Kloek, M. Jen, X. Chen, and K.A. Frazer. 2006. Common deletions and SNPs are in linkage disequilibrium in the human genome. Nat Genet 38: 82-85.
Ishkanian, A.S., C.A. Malloff, S.K. Watson, R.J. DeLeeuw, B. Chi, B.P. Coe, A. Snijders, D.G. Albertson, D.
Pinkel, M.A. Marra, V. Ling, C. MacAulay, and W.L. Lam. 2004. A tiling resolution DNA microarray with complete coverage of the human genome. Nat Genet 36: 299-303.
Kallioniemi, A., O.P. Kallioniemi, D. Sudar, D. Rutovitz, J.W. Gray, F. Waldman, and D. Pinkel. 1992. Comparative genomic hybridization for molecular cytogenetic analysis of solid tumors. Science 258: 818-821.
Locke, D.P., A.J. Sharp, S.A. McCarroll, S.D. McGrath, T.L. Newman, Z. Cheng, S. Schwartz, D.G. Albertson, D. Pinkel, D.M. Altshuler, and E.E. Eichler. 2006. Linkage disequilibrium and heritability of copy-number polymorphisms within duplicated regions of the human genome. Am J Hum Genet 79: 275-290.
Lucito, R., J. Healy, J. Alexander, A. Reiner, D. Esposito, M. Chi, L. Rodgers, A. Brady, J. Sebat, J. Troge, J.A. West, S. Rostan, K.C. Nguyen, S. Powers, K.Q. Ye, A. Olshen, E. Venkatraman, L. Norton, and M. Wigler. 2003. Representational oligonucleotide microarray analysis: a high-resolution method to detect genome copy number variation. Genome Res 13: 2291-2305.
Lupski, J.R. 1998. Genomic disorders: structural features of the genome can lead to DNA rearrangements and human disease traits. Trends Genet 14: 417-422.
Mantripragada, K.K., P.G. Buckley, T.D. de Stahl, and J.P. Dumanski. 2004. Genomic microarrays in the spotlight. Trends Genet 20: 87-94.
Markham, N.R. 2003. Hybrid: a software system for nucleic acid folding, hybridizing and melting predictions. Rensselaer Polytechnic Institute, Troy, NY, USA.
McCarroll, S.A., T.N. Hadnott, G.H. Perry, P.C. Sabeti, M.C. Zody, J.C. Barrett, S. Dallaire, S.B. Gabriel, C. Lee, M.J. Daly, and D.M. Altshuler. 2006. Common deletion polymorphisms in the human genome. Nat Genet 38: 86-92.
Olson, M.V. 1999. When less is more: gene loss as an engine of evolutionary change. Am J Hum Genet 64: 18-23.
Pinkel, D., R. Segraves, D. Sudar, S. Clark, I. Poole, D. Kowbel, C. Collins, W.L. Kuo, C. Chen, Y. Zhai, S.H. Dairkee, B.M. Ljung, J.W. Gray, and D.G.
Albertson. 1998. High resolution analysis of DNA copy number variation using comparative genomic hybridization to microarrays. Nat Genet 20: 207-211.
Pollack, J.R., T. Sorlie, C.M. Perou, C.A. Rees, S.S. Jeffrey, P.E. Lonning, R. Tibshirani, D. Botstein, A.L. Borresen-Dale, and P.O. Brown. 2002. Microarray analysis reveals a major direct role of DNA copy number alteration in the transcriptional program of human breast tumors. Proc Natl Acad Sci U S A 99: 12963-12968.
Sebat, J., B. Lakshmi, J. Troge, J. Alexander, J. Young, P. Lundin, S. Maner, H. Massa, M. Walker, M. Chi, N. Navin, R. Lucito, J. Healy, J. Hicks, K. Ye, A. Reiner, T.C. Gilliam, B. Trask, N. Patterson, A. Zetterberg, and M. Wigler. 2004. Large-scale copy number polymorphism in the human genome. Science 305: 525-528.
Selzer, R.R., T.A. Richmond, N.J. Pofahl, R.D. Green, P.S. Eis, P. Nair, A.R. Brothman, and R.L. Stallings. 2005. Analysis of chromosome breakpoints in neuroblastoma at subkilobase resolution using fine-tiling oligonucleotide array CGH. Genes Chromosomes Cancer 44: 305-319.
Solinas-Toldo, S., S. Lampel, S. Stilgenbauer, J. Nickolenko, A. Benner, H. Dohner, T. Cremer, and P. Lichter. 1997. Matrix-based comparative genomic hybridization: biochips to screen for genomic imbalances. Genes Chromosomes Cancer 20: 399-407.
Stallings, R.L., P. Nair, J.M. Maris, D. Catchpoole, M. McDermott, A. O'Meara, and F. Breatnach. 2006. High-resolution analysis of chromosomal breakpoints and genomic instability identifies PTPRD as a candidate tumor suppressor gene in neuroblastoma. Cancer Res 66: 3673-3680.
Stein, L.D., Z. Bao, D. Blasiar, T. Blumenthal, M.R. Brent, N. Chen, A. Chinwalla, L. Clarke, C. Clee, A. Coghlan, A. Coulson, P. Eustachio, D.H.A. Fitch, L.A. Fulton, R.E. Fulton, S. Griffiths-Jones, T.W. Harris, L.W. Hillier, R. Kamath, P.E. Kuwabara, E.R. Mardis, M.A. Marra, T.L. Miner, P. Minx, J.C. Mullikin, R.W. Plumb, J. Rogers, J.E. Schein, M. Sohrmann, J. Spieth, J.E. Stajich, C. Wei, D.
Willey, R.K. Wilson, R. Durbin, and R.H. Waterston. 2003. The Genome Sequence of Caenorhabditis briggsae: A Platform for Comparative Genomics. PLoS Biology 1: e45.
Stewart, M.K., N.L. Clark, G. Merrihew, E.M. Galloway, and J.H. Thomas. 2005. High genetic diversity in the chemoreceptor superfamily of Caenorhabditis elegans. Genetics 169: 1985-1996.
Strefford, J.C., F.W. van Delft, H.M. Robinson, H. Worley, O. Yiannikouris, R. Selzer, T. Richmond, I. Hann, T. Bellotti, M. Raghavan, B.D. Young, V. Saha, and C.J. Harrison. 2006. Complex genomic alterations and gene expression in acute lymphoblastic leukemia with intrachromosomal amplification of chromosome 21. Proc Natl Acad Sci U S A 103: 8167-8172.
Thomas, J.H. 2006a. Adaptive evolution in two large families of ubiquitin-ligase adapters in nematodes and plants. Genome Res 16: 1017-1030.
Thomas, J.H. 2006b. Analysis of homologous gene clusters in Caenorhabditis elegans reveals striking regional cluster domains. Genetics 172: 127-143.
Thomas, J.H., J.L. Kelley, H.M. Robertson, K. Ly, and W.J. Swanson. 2005. Adaptive evolution in the SRZ chemoreceptor families of Caenorhabditis elegans and Caenorhabditis briggsae. Proc Natl Acad Sci U S A 102: 4476-4481.
Thompson, J.D., D.G. Higgins, and T.J. Gibson. 1994. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res 22: 4673-4680.
Urban, A.E., J.O. Korbel, R. Selzer, T. Richmond, A. Hacker, G.V. Popescu, J.F. Cubells, R. Green, B.S. Emanuel, M.B. Gerstein, S.M. Weissman, and M. Snyder. 2006. High-resolution mapping of DNA copy alterations in human chromosome 22 using high-density tiling oligonucleotide arrays. Proc Natl Acad Sci U S A 103: 4534-4539.
Wicks, S.R., R.T. Yeh, W.R. Gish, R.H. Waterston, and R.H. Plasterk. 2001. Rapid gene mapping in Caenorhabditis elegans using a high density polymorphism map. Nat Genet 28: 160-164.
Yang, Y.H., S.
Dudoit, P. Luu, D.M. Lin, V. Peng, J. Ngai, and T.P. Speed. 2002. Normalization for cDNA microarray data: a robust composite method addressing single and multiple slide systematic variation. Nucleic Acids Res 30: e15. Zhang, Z., S. Schwartz, L. Wagner, and W. Miller. 2000. A greedy algorithm for aligning DNA sequences. J Comput Biol 7: 203-214.

3. De novo identification of single nucleotide mutations in Caenorhabditis elegans using array Comparative Genomic Hybridization [2]

3.1. Introduction

A major roadblock in genetic research lies in the molecular identification of mutations responsible for an observed phenotype. Traditional positional cloning techniques are laborious, time-consuming and sometimes impractical for mapping mutations to regions smaller than a few Mbp, particularly in regions with low recombination frequencies such as the centers of C. elegans chromosomes (Barnes et al. 1995). Sequencing such a large region remains impractical for most laboratories, and as a result many mutations remain uncharacterized. aCGH has been used to detect many types of genome diversity in a variety of organisms (Gresham et al. 2008), including single nucleotide variation in the 12.5 Mb yeast genome using short 25-mer probes (Gresham et al. 2006). We have been using aCGH with exon-centric tiling arrays of 50-mer oligonucleotide probes to screen for deletions in the C. elegans genome following mutagenesis with trimethylpsoralen (TMP) and ultraviolet (UV) irradiation (Maydan et al. 2007). Here we demonstrate the use of 50-mer probes to detect single nucleotide mutations in the 100 Mb C. elegans genome.

3.2. Results

3.2.1. Novel single-nucleotide mutations detected utilizing an exon-centric chromosome II microarray

In one set of experiments utilizing a microarray with probes targeting primarily exons on C. elegans chromosome II, we screened individuals homozygous for a mutagenized chromosome II.
In these experiments we identified three statistically significant putative mutations (P-values ranged from 2.7 × 10⁻⁵ to 1.8 × 10⁻¹⁴ according to one-sample t-tests). These putative mutations affected just a few adjacent overlapping probes and produced modest signals comparable to those normally observed for heterozygous deletions. We hypothesized that very small homozygous mutations (much shorter than the length of a probe) could produce signals of this magnitude. The mutations would have to be very small in order to affect only a few overlapping probes and still permit some hybridization of complementary sequence to the array. Mutations of this size would not have produced statistically significant signals on our whole-genome tiling arrays because each mutation would affect only one or two probes. Our hypothesis was confirmed when PCR and DNA sequencing identified single nucleotide mutations in all three mutants. The strain VC10078 carries gk802, an A→T transversion allele of syd-1 at II: 7586645 (see Fig. 3.1), causing a non-conservative amino acid substitution (I(887) → K); VC10079 carries gk803, an A→G transition at II: 10825740 that results in a synonymous substitution in mix-1 at the third position of a codon for leucine (CUA → CUG); and VC10077 carries gk801, an allele with two closely linked mutations in Y46E12BL.2: a G→A transition at II: 15240024, causing a conservative amino acid substitution (V(714) → I), and an A→G transition at II: 15240052, resulting in a non-conservative amino acid substitution (Y(723) → C).

[2] A version of this chapter has been accepted for publication. Maydan, J.S., H.M. Okada, S. Flibotte, M.L. Edgley and D.G. Moerman. In Press. De novo identification of single nucleotide mutations in C. elegans using array Comparative Genomic Hybridization. Genetics.

3.2.2. Single-nucleotide mutations detected in 13 strains with previously mapped mutations

Dense tiling with oligonucleotides is necessary to obtain sufficient statistical power to detect single nucleotide alterations. In a previous study we showed that a window of about 20 bases in each 50-mer probe contains a strong log2 ratio signal (see Figure 1 in Flibotte et al. 2009); because we require about four probes to place the mutated site within that window, the maximum probe spacing is about five bases. The plot in that figure also shows that it would be useful to target both strands and use the small shift in the peak position on opposite strands to help distinguish SNPs from artifacts. Utilizing these probe spacing guidelines, we conducted an additional 13 aCGH experiments comparing homozygous mutants to their parental strains, using 50-mer oligonucleotide microarrays probing regions from 0.65 – 2.60 Mb in length (see Methods) that are known to include unidentified mutations based on prior mapping experiments. From these experiments we selected 58 candidate single nucleotide mutations on the basis of visual inspection of the data and identification using a segmentation algorithm (Maydan et al. 2007; see Appendix 3) or a sliding window technique. We then performed PCR and DNA sequencing in order to gauge the accuracy of our mutation predictions. For each candidate mutation, we calculated a SNP Score by averaging the log2 fluorescence ratios (mutant / wild-type [WT]) in a small window containing probes putatively affected by the mutation, and renormalizing by subtracting from that the average log2 ratio in the immediate flanking regions. This renormalization is necessary to account for local bias, which varies both within and among experiments and makes the detection of SNPs more difficult, since artifacts associated with a strong local bias in log2 ratio could easily be confused with the signature expected for a SNP.
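The SNP Score calculation can be sketched in a few lines of Python. This is a minimal illustration and not the analysis code used for this work: the window of positions 5-17 from the 5’ end and the 50-base flanks are taken from the Methods (Section 3.4.4), while the function name `snp_score`, the `(start, log2_ratio)` input format, the restriction to plus-strand probes, and the choice to build the flanks from probes that do not contain the candidate site are all assumptions made for the example.

```python
from statistics import mean

PROBE_LEN = 50            # 50-mer probes
WINDOW = range(5, 18)     # probe positions 5-17, counted from the 5' end
FLANK = 50                # width of each flanking region, in bases


def snp_score(probes, site):
    """Return the adjusted mean log2 ratio for a candidate site.

    probes: iterable of (start, log2_ratio) pairs, where start is the
            coordinate of the probe's 5' base (plus-strand probes assumed).
    site:   coordinate of the candidate mutation.
    """
    in_window, in_flanks = [], []
    for start, ratio in probes:
        offset = site - start + 1          # 1-based position of the site in the probe
        if offset in WINDOW:
            in_window.append(ratio)
        elif PROBE_LEN < offset <= PROBE_LEN + FLANK:
            in_flanks.append(ratio)        # probe lies entirely to the left of the site
        elif -FLANK < offset <= 0:
            in_flanks.append(ratio)        # probe lies entirely to the right of the site
    if not in_window or not in_flanks:
        raise ValueError("need probes in both the window and the flanks")
    return mean(in_window) - mean(in_flanks)


# Toy data (hypothetical): probes at one-bp spacing around a site at 1000.
example = [(s, -0.6) for s in range(984, 997)]      # site at probe positions 5-17
example += [(s, 0.1) for s in range(900, 950)]      # left flank
example += [(s, 0.1) for s in range(1001, 1051)]    # right flank
score = snp_score(example, 1000)
```

With this toy input the window probes average -0.6 and the flanks average 0.1, so the score is about -0.7; subtracting the flank mean is what removes a local offset shared by all probes in the neighbourhood.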
Unlike previous observations that mutations near the centers of 25-mer probes are most inhibitory to efficient hybridization (Sharp et al. 2007), we observed that mutations located away from the glass slide and freely floating in the solution, closer to the 5’ ends of our 50-mer probes, produced a larger perturbation to the hybridization process, with a maximum perturbation at seven bases in from the 5’ end (probably due to steric effects; see Figure 1 in Flibotte et al. 2009). The location of the window used to calculate the score reflects this observation. This sensitivity to mutations at the 5’ end of NimbleGen probes has also been observed by Wei et al. (2008). The sequencing results (summarized in Figure 3.2a) confirmed the presence of a single nucleotide mutation in 16 of the 58 candidates, for an overall success rate, or specificity, of 28%. All mutations were either C-to-T or G-to-A transitions, as expected from EMS mutagenesis. The locations of the mutations were usually predicted to within 10 bp of their true positions, and to within 1 bp in one case.

3.2.3. The sensitivity of single nucleotide polymorphism detection using aCGH

In order to estimate the sensitivity of our single nucleotide mutation detection technique, we performed aCGH experiments to test our ability to detect 2639 known single nucleotide polymorphisms (SNPs) in the CB4856 strain isolated in Hawaii (see Methods for array design details). Examples of all possible transitions and transversions were detected. The SNP detection sensitivity is shown in Figure 3.2b for various thresholds in the SNP Score. At the reasonable threshold of -0.45, the specificity (the percentage of predicted SNPs that are real) would be 31%, with a sensitivity (the percentage of real SNPs that are successfully detected) of 37%.
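The two success rates quoted above follow directly from the sequencing counts, as the short Python sketch below shows. Note that what this chapter calls specificity (the fraction of predicted SNPs that are real) is more commonly called precision or positive predictive value. The function names are illustrative only, and the counts of 16 confirmed and 36 unconfirmed candidates at the -0.45 threshold are those given in the legend of Figure 3.2a.

```python
def fraction_confirmed(confirmed, unconfirmed):
    """Fraction of sequenced candidates that turned out to be real mutations."""
    return confirmed / (confirmed + unconfirmed)


def candidates_per_real_snp(confirmed, unconfirmed):
    """Number of candidates sequenced for each confirmed mutation."""
    return (confirmed + unconfirmed) / confirmed


# All 58 sequenced candidates: 16 confirmed, 42 not confirmed.
overall = fraction_confirmed(16, 42)          # about 0.28

# Candidates with a SNP Score below -0.45: 16 confirmed, 36 not confirmed.
at_threshold = fraction_confirmed(16, 36)     # about 0.31
effort = candidates_per_real_snp(16, 36)      # 3.25 candidates per real SNP
```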
In other words, with the current SNP detection technique one can expect to detect roughly one out of every three SNPs present in the targeted region, and one will have to sequence roughly three candidates in order to find a real SNP. As expected, the SNP detection sensitivity of the current technique depends on the type of transition or transversion being investigated, and as can be seen in Figure 3.2c the sensitivity reaches around 50% for the most commonly induced EMS mutations (C-to-T and G-to-A).

3.3. Discussion

3.3.1. Limitations to SNP discovery using aCGH

The optimal probe length for single nucleotide mutation detection by aCGH is unclear and likely depends on the hybridization conditions. Single nucleotide mutations should have a greater impact on hybridization to shorter oligonucleotides, but longer oligonucleotides allow a greater number of overlapping probes to target a given single nucleotide mutation, and arrays with longer oligonucleotides tend to have smaller standard deviations in log2 ratios (Sharp et al. 2007). Further experiments will need to be done to determine the optimal probe length to achieve the greatest sensitivity and specificity as a function of the size of the targeted region; such an optimal length will probably vary with the complexity of the genome being studied. Although this technique is particularly well suited to detecting SNPs generated by EMS mutagenesis, some single nucleotide mutations may not be detectable by aCGH even with higher probe densities than we have used here. We suspected that some of the Hawaiian SNPs that we failed to detect might have been missed because they were found in regions with significant homology to other regions of the genome. In these cases, multiple regions of the genome could have hybridized to our probes, making it unlikely that the effect of a SNP on the log2 ratios would be detectable.
However, filtering the oligonucleotide properties according to our best practices and standard microarray design recommendations (Flibotte and Moerman 2008) failed to improve the SNP detection sensitivity. It is also possible that SNPs are more difficult to detect with aCGH when present in the background of the Hawaiian genome, which contains significant structural variation relative to the N2 reference genome (Maydan et al. 2007); consequently, for a more typical SNP detection experiment the sensitivity of the technique might be slightly better than what we have reported here. However, limiting the analysis to SNPs that are located far away from other known mutations did not improve the SNP detection sensitivity. Lastly, we have not yet attempted to detect heterozygous single nucleotide mutations using this technique, but this would be nearly impossible to accomplish with current microarrays.

3.3.2. Suggestions to improve the sensitivity and specificity of SNP detection by aCGH

The ability of aCGH to detect homozygous single nucleotide mutations in addition to deletions and duplications makes it possible to quickly and affordably identify mutations mapped by traditional positional cloning approaches. A clear example of the feasibility of this technique is the study by O’Meara et al. (in press), in which two single base lesions were mapped to the promoter of the C. elegans gene cog-1 using aCGH. We recommend a probe spacing of no more than five bp in order to have a reasonable chance at successful SNP detection with this technique. This probe spacing corresponds to about two Mbp of genomic sequence on a microarray with 380,000 probes, the oligo capacity of the chips we used in this study. We prefer to apply this SNP detection technique to situations where the mutation is mapped to a region of at most one Mbp, as this provides denser coverage of the mutation site and allows us to target both strands.
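The relationship between array capacity, probe spacing and the size of the region that can be tiled is simple arithmetic, sketched below in Python. The function name and the `strands` parameter are illustrative, and the calculation ignores probes dropped because of repeats or synthesis constraints, so the figures are approximate.

```python
def max_region_bp(n_probes, spacing_bp, strands=1):
    """Largest region (bp) that n_probes can tile at a given per-strand spacing."""
    return n_probes * spacing_bp // strands


ARRAY_CAPACITY = 380_000   # probes per array, as stated in the text

one_strand = max_region_bp(ARRAY_CAPACITY, 5)               # 1,900,000 bp
both_strands = max_region_bp(ARRAY_CAPACITY, 5, strands=2)  # 950,000 bp
```

Under these assumptions a single array covers about 1.9 Mbp on one strand, or a little under one Mbp when both strands are tiled at the same five-bp spacing, which is consistent with the preference for regions of at most one Mbp.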
Further reducing the size of the candidate region should improve the likelihood of successful base change detection as more probes target any specific base. If any sequences in the mapped region can be excluded (such as non-coding DNA, repeat elements or genes which can be ruled out as candidate genes) the probe density can be further increased in the remaining regions of interest. Of course, it is possible to use more than one microarray to probe the candidate region if the region is too large to achieve the desired probe density on a single array. When the search region is small enough to allow very high density tiling, one can take advantage of the fact that the effect of a SNP on hybridization is dependent on its position in the probe by including probes that target both strands, and then primarily pursuing candidates showing a small shift between the plus and minus strand log2 ratio profiles. Targeting both strands for this purpose should result in fewer false positives.

3.3.3. Online resources for SNP detection using aCGH

In order to make the current SNP detection technique more accessible, we have made available a web application, built by H.M. Okada, for designing oligonucleotide microarrays. The application can be found at http://hokkaido.bcgsc.ca/SNPdetection/. Downstream analysis tools to calculate and normalize the log2 ratios are also available on this web site. Given the criteria set by the user, such as the probe target region and strand(s), the oligonucleotides are selected so as to distribute the probes evenly across the selected region. Probes are selected to avoid repeat regions, non-coding regions (optional), and probe sequences that cannot be synthesized due to the cycle number constraint in NimbleGen’s manufacturing process. Once the criteria have been selected, the file is sent to the user in a format ready for submission to NimbleGen. Currently the probe selection application has been set to support the C.
elegans and Drosophila melanogaster genomes, and genomes from other species will be added upon request.

3.3.4. SNP detection by aCGH as an alternative to high-throughput sequencing

With the advent of whole genome sequencing using new high-throughput sequencing machines (Hillier et al. 2008) it might be asked if SNP detection on microarrays is a reasonable technique for mutation detection. Deep sequencing may become the method of choice in the future, but for now our method is easier to perform, especially given we have provided a website for oligo design and data analysis. Our method involves less labor and the aCGH work can be outsourced to NimbleGen. For the time being our method is less expensive but this may change in the future. Genetic mapping of mutations remains essential. For our SNP detection method, one needs to do initial mapping to limit the mutation of interest to a small region of the genome. Although deep sequencing without prior genetic mapping is possible, one must then determine which of several hundred changes in the genome is the causative mutation (Hillier et al. 2008; David Spencer, personal communication).

3.4. Methods

3.4.1. Mutagenesis

A mixed-stage population of VC1415 (unc-4(e120)/mIn1[mIs14 dpy-10(e128)] II) was subjected to mutagenesis with TMP at 10 µg/ml for 1 hour followed by UV irradiation for 90 seconds at 340 µW/cm², and then placed on food at 20° C. Both unc-4 and dpy-10 mutations are recessive, and the mIn1 inversion suppresses recombination along the middle of chromosome II from lin-31 to rol-1 (Edgley and Riddle 2001); the mIs14 element confers a semi-dominant GFP signal confined to the pharyngeal muscle. After 48 hours, 30 gravid WT GFP+ P0 adults were singly picked onto 60mm Petri plates and allowed to self. Seven WT GFP+ F1 progeny were singly picked from each parent for a total of 210 clones, from which 100 were selected that segregated viable fertile Unc-4 F2 progeny.
Single gravid Unc-4 progeny were picked from each of these plates and used to establish 100 new populations homozygous for unc-4 and any newly induced mutations within the genetic interval balanced by mIn1. The 13 mutant strains with previously mapped mutations were generated by standard ethyl methanesulfonate (EMS) mutagenesis, which yields approximately one single nucleotide mutation every 100 – 400 kb (Anderson 1995; Cuppen et al. 2007), and then serially backcrossed with their parental strains prior to this work.

3.4.2. Nematode culturing and DNA preparation

Nematodes were grown on NGM agar plates spread with a lawn of Escherichia coli strain OP50 or χ1666. Nematode populations were grown to starvation on three 60mm Petri plates, harvested by washing, centrifugation and aspiration of supernatant, and frozen at -80° C in 2.5 volumes of worm lysis buffer (50 mM KCl; 10 mM Tris-HCl, pH 8.3; 2.5 mM MgCl2; 0.45% NP-40 (Igepal); 0.45% Tween-20; 0.01% gelatin; 300 µg/ml Proteinase K). Crude lysates were prepared from frozen samples by incubation at 65° C for two hours. Genomic DNA was prepared from the lysates as described previously by Maydan et al. (2007).

3.4.3. Probe selection, array design and aCGH

The filters used to select the 50-mer oligonucleotides for the exon-centric chromosome II chip have been described by Maydan et al. (2007; see section 2.4.1). Microarrays for the 13 previously mapped mutations were designed by tiling the target regions with equally spaced overlapping 50-mer oligonucleotides without any filtering except for the elimination of the repeats listed in WormBase and the exclusion of probes that were not possible due to the cycle number constraint in the microarray manufacturing process. The earlier arrays were designed using WormBase data freeze version WS170 while the more recent designs used WS180.
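As an illustration of the tiling step, the following Python sketch enumerates equally spaced, overlapping 50-mers across a target sequence. It is not the design software used here: the function name `tile_probes` and its parameters are hypothetical, only one strand is tiled, and neither the WormBase repeat filter nor the synthesis cycle limit is implemented.

```python
def tile_probes(sequence, region_start=1, probe_len=50, spacing=5):
    """Yield (coordinate, probe_sequence) for equally spaced overlapping probes.

    coordinate is the 1-based position of the probe's first base; spacing is
    the distance between the first bases of consecutive probes.
    """
    last_offset = len(sequence) - probe_len
    for offset in range(0, last_offset + 1, spacing):
        yield region_start + offset, sequence[offset:offset + probe_len]


# Toy 120-bp target (hypothetical sequence) tiled at five-bp spacing.
target = "ACGT" * 30
probes = list(tile_probes(target, region_start=1001))
```

For the toy target this gives 15 probes, starting at coordinates 1001, 1006 and so on up to 1071; a target shorter than one probe length yields no probes.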
A single 380,000-oligonucleotide array was designed for each region of interest, except for one experiment in which two arrays were used to cover a genomic region 4.9 Mb wide. The probe spacing, i.e. the distance between the 5’ ends of consecutive probes, on these arrays ranged from 1-5 bp. Unlike our previous exon-centric arrays, no other constraints were applied to the oligonucleotides.

From all the CB4856 SNPs present in WormBase data freeze WS170, we selected 2639 that were far enough from all the known mutations in that strain to minimize the presence of mutations in the immediate flanking regions of the selected SNPs. Once again, the only filter used in the design process was to eliminate the known repeats. Each SNP was represented on the array by a maximum of 150 50-mer oligonucleotides spaced one bp apart: up to 50 oligonucleotides affected by the mutation and up to 50 oligonucleotides for each of the immediate left and right flanking regions. For each SNP the set of probes alternated between the sequence from the plus and minus strand templates; thus, for a given strand, the minimum spacing between probes was two bases. For the CB4856 experiment we performed dye-flip hybridizations in order to evaluate the Cy3/Cy5 bias; each SNP log2 ratio profile was therefore measured four times, with two separate hybridizations and on both strands each time. Microarray manufacture, DNA sample handling, labeling with Cy3 (mutants and CB4856) or Cy5 (WT N2 [VC196] reference), hybridization, imaging and fluorescence intensity extraction were performed by Roche NimbleGen, Inc. (Selzer et al. 2005). Oligonucleotides were synthesized at random positions on all arrays.

3.4.4. Data analysis and mutation detection

Log2 fluorescence ratios (Cy3/Cy5) were calculated and normalized as previously described (Maydan et al. 2007; see Section 2.4.5).
Many initial experiments were performed using the same chromosome II array design, which allowed an approximate determination (by simply averaging) and subsequent subtraction of local bias in the log2 ratio signal for individual experiments. The signature of a SNP in the log2 ratio signal is similar to that of a deletion except that the log2 ratio shows only a modest reduction for the affected probes and of course only a few probes are affected. Mutation candidates were selected by analyzing the aCGH data by visual inspection and use of a segmentation algorithm (Maydan et al. 2007; Appendix 3) or a sliding window technique. The SNP Score, or adjusted mean log2 ratio, is the average log2 ratio of the probes in which the mutation falls within a window 13 bases wide (covering positions 5-17) near the 5’ end of the 50-mer oligonucleotide, the end that is away from the slide and therefore freely floating in the solution. This average was then renormalized by subtracting the mean of the log2 ratio in the immediate left and right 50-base wide flanking regions, calculated from oligonucleotides not overlapping this window. To test each mutation candidate, PCR was used to amplify products of a few hundred base pairs surrounding the candidate regions. DNA sequencing of these products precisely identified the mutations. When calculating the SNP detection sensitivity in the CB4856 experiments, each of the four log2 ratio measurements was considered separately because each profile is associated with an oligonucleotide spacing of two bp, which is more representative of the SNP detection experiments we used to evaluate the specificity of the technique. We could have averaged the four profiles to reduce the standard deviation before calculating the sensitivity, but this would not have allowed a direct and meaningful comparison with the data from our SNP detection experiments.

Figure 3.1. Novel detection of an A→T transversion in syd-1.
Normalized log2 ratios of fluorescence intensities (mutant / wild-type) are plotted at the first (5’) free-floating base of each 50-mer probe. The length of each probe targeted by the SNP is illustrated by a horizontal bar, and the position of the SNP is indicated by an *. Multiple adjacent overlapping probes targeted the point mutation, so its effect on hybridization was assayed several times. Aberrant fluorescence ratios at probes targeting the SNP stand out from nearby probes targeting wild-type sequence.

Figure 3.2. Estimation of the sensitivity and specificity of the current SNP detection technique. (a) The SNP Score (see Methods) is shown for the 58 candidate SNPs we have sequenced, with the candidates ordered according to their score. Separate symbols mark the candidates confirmed and not confirmed by sequencing. For example, a score smaller than -0.45 would include all 16 confirmed cases and 36 non-confirmed candidates, corresponding to a specificity of 31%. (b) The detection sensitivity for the SNPs in the CB4856 (Hawaiian) experiments is shown as a function of the threshold in the SNP Score. Using a threshold of -0.45 as before would correspond to a sensitivity of 37%. (c) The sensitivity is shown separately for each transition and transversion type when using the same threshold of -0.45.

3.5. References

Anderson, P. 1995. Mutagenesis. Methods Cell Biol 48: 31-58. Barnes, T.M., Y. Kohara, A. Coulson, and S. Hekimi. 1995. Meiotic recombination, noncoding DNA and genomic organization in Caenorhabditis elegans. Genetics 141: 159-179. Cuppen, E., E. Gort, E. Hazendonk, J. Mudde, J. van de Belt, I.J. Nijman, V. Guryev, and R.H.A. Plasterk. 2007. Efficient target-selected mutagenesis in Caenorhabditis elegans: Toward a knockout for every gene. Genome Res. 17: 649-658. Edgley, M.L. and D.L. Riddle. 2001.
LG II balancer chromosomes in Caenorhabditis elegans: mT1(II;III) and the mIn1 set of dominantly and recessively marked inversions. Mol Genet Genomics 266: 385-395. Flibotte, S., M.L. Edgley, J. Maydan, J. Taylor, R. Zapf, R. Waterston, and D.G. Moerman. 2009. Rapid High Resolution Single Nucleotide Polymorphism-Comparative Genome Hybridization Mapping in Caenorhabditis elegans. Genetics 181: 33-37. Flibotte, S. and D.G. Moerman. 2008. Experimental analysis of oligonucleotide microarray design criteria to detect deletions by comparative genomic hybridization. BMC Genomics 9: 497. Gresham, D., M.J. Dunham, and D. Botstein. 2008. Comparing whole genomes using DNA microarrays. Nat Rev Genet 9: 291-302. Gresham, D., D.M. Ruderfer, S.C. Pratt, J. Schacherer, M.J. Dunham, D. Botstein, and L. Kruglyak. 2006. Genome-Wide Detection of Polymorphisms at Nucleotide Resolution with a Single DNA Microarray. Science 311: 1932-1936. Hillier, L.W., G.T. Marth, A.R. Quinlan, D. Dooling, G. Fewell, D. Barnett, P. Fox, J.I. Glasscock, M. Hickenbotham, W. Huang, V.J. Magrini, R.J. Richt, S.N. Sander, D.A. Stewart, M. Stromberg, E.F. Tsung, T. Wylie, T. Schedl, R.K. Wilson, and E.R. Mardis. 2008. Whole-genome sequencing and variant discovery in C. elegans. Nat Methods 5: 183-188. Maydan, J.S., S. Flibotte, M.L. Edgley, J. Lau, R.R. Selzer, T.A. Richmond, N.J. Pofahl, J.H. Thomas, and D.G. Moerman. 2007. Efficient high-resolution deletion discovery in Caenorhabditis elegans by array comparative genomic hybridization. Genome Res. 17: 337-347. O'Meara M.M., H. Bigelow, S. Flibotte, J.F. Etchberger, D.G. Moerman, and O. Hobert. In press. Cis-regulatory mutations in the C. elegans homeobox gene locus cog-1 affect neuronal development. Genetics. Selzer, R.R., T.A. Richmond, N.J. Pofahl, R.D. Green, P.S. Eis, P. Nair, A.R. Brothman, and R.L. Stallings. 2005. Analysis of chromosome breakpoints in neuroblastoma at subkilobase resolution using fine-tiling oligonucleotide array CGH. 
Genes Chromosomes Cancer 44: 305-319. Sharp, A.J., A. Itsara, Z. Cheng, C. Alkan, S. Schwartz, and E.E. Eichler. 2007. Optimal design of oligonucleotide microarrays for measurement of DNA copy-number. Hum Mol Genet 16: 2770-2779. Wei, H., P.F. Kuan, S. Tian, C. Yang, J. Nie, S. Sengupta, V. Ruotti, G.A. Jonsdottir, S. Keles, J.A. Thomson, and R. Stewart. 2008. A study of the relationships between oligonucleotide properties and hybridization signal intensities from NimbleGen microarray datasets. Nucleic Acids Res 36: 2926-2938.

4. Copy number variation in the Caenorhabditis elegans genome reveals complex relationships among natural isolates [3]

4.1. Introduction

Copy number variation is an important component of genetic diversity in both Caenorhabditis elegans (Denver et al. 2003) and humans (Sebat et al. 2004; Conrad et al. 2006; Redon et al. 2006) and has been associated with complex traits including autism spectrum disorder (Marshall et al. 2008), mental retardation (Friedman et al. 2006; Madrigal et al. 2007), and schizophrenia (Stefansson et al. 2008). Genes involved in sensory perception, innate immunity and cell adhesion are common targets of copy number variants (CNVs) (Conrad et al. 2006; Maydan et al. 2007). We had previously described extensive copy number variation in the genomes of two highly divergent strains of C. elegans, CB4856 (Hawaii) and JU258 (Madeira) (Maydan et al. 2007), and were interested in examining additional natural isolates to measure the extent of variation in the genomes of less divergent strains. We also recognized that a large number of indel loci spread throughout the genome would allow us to more thoroughly characterize the relationships among strains. Previous studies utilizing nuclear markers have identified some recombinant strains, but they have been limited in their ability to precisely identify which regions of the genome have been exchanged due to the relatively small number of loci at hand (Denver et al.
2003; Haber et al. 2005). Although C. elegans reproduces primarily through selfing of hermaphrodites, males allow rare outcrossing in the wild at an estimated rate of 1-2% (Cutter and Payseur 2003; Barriere and Felix 2005; Barriere and Felix 2007). The frequency of outcrossing varies among populations, with estimates ranging from 0.01% based on linkage disequilibrium measurements (Barriere and Felix 2005) to as high as 20% based on heterozygote frequencies (Sivasundar and Hey 2005).

[3] A version of this chapter will be submitted for publication. Maydan, J.S., A. Lorch, M.L. Edgley, S. Flibotte and D.G. Moerman. Copy number variation in the Caenorhabditis elegans genome reveals complex relationships among natural isolates.

In this study we have used array comparative genomic hybridization (aCGH) to detect copy number variation in ten additional natural isolates of C. elegans. We have chosen to use the term indel to refer to the deletions and duplications that we have detected, since many of them are < 1 kb in length and thus would not commonly be considered CNVs (Feuk et al. 2006; Redon et al. 2006). We use the term CNV only for larger aberrations. We use “deletion” to refer to a sequence absent in a natural isolate but present in N2, and “amplification” to refer to an increase in copy number relative to N2. An “amplification” detected in a natural isolate might alternatively represent the deletion of a duplicate copy in the N2 lineage. Only genes that are present in a single copy in the N2 genome are represented on our microarrays; therefore the presence of a gene (as in N2) cannot be reconstituted in an isolate by mutation or conversion from an ancestor carrying a deletion allele. Nevertheless, it is possible for a gene to reappear in a lineage as the result of outcrossing, recombining the intact gene into the lineage. Also, it is possible that extreme sequence divergence could be misinterpreted as a deletion in some cases (see Section 4.3.3).
Among the strains in our study, the Hawaiian and Madeiran strains (CB4856 and JU258) carry the largest number of deletions, followed by the Vancouver strain (KR314). Overall, we detected 510 different deletions affecting 1136 genes, or over 5% of the genes in the canonical N2 genome. Indels had a median length of 2.7 kb and deletions relative to N2 were far more common than amplifications. We used deletion loci as markers to derive an unrooted tree and observed complex relationships among the strains. Close associations were identified between CB4853 and CB4858, and CB3191 and RW7000. Different regions of the genome clearly possess different genealogies due to recombination throughout the natural history of the species.

4.2. Results

4.2.1. aCGH reveals a bias favoring coding sequence deletions over amplifications in C. elegans

We performed ten new aCGH experiments utilizing our exon-centric whole genome microarray (Maydan et al. 2007), which includes probes to 94% of the exons and 98% of the genes in the N2 reference genome. Each aCGH experiment compared a different natural isolate to N2. Table 4.1 summarizes the number and lengths of indels that were detected in each strain, including CB4856 and JU258 from our previous study (Maydan et al. 2007), as well as the number of genes and pseudogenes that were deleted in each strain. It is important to note that nearly all of the indels detected in this study target coding sequences and are thus unlikely to be selectively neutral. One “amplification” initially detected in all of the strains was subsequently identified by PCR and DNA sequencing as a 1788-bp deletion in our N2 strain (VC196), targeting exons 5 and 6 of alh-2, so that false amplification was ignored in all strains. We detected 883 deletions (510 different types) and 84 amplifications in the twelve natural isolates.
Amplifications were less robustly detected than deletions due in part to the conservative criteria that we used (see Methods), but this alone does not account for the large preponderance of deletions in the strains. For example, relaxing the cutoff on the log2 fluorescence ratio of mutant to wild-type (log2 ratio) for amplifications from 1.0 to 0.9 or even 0.8 captured only a modest number of additional amplifications and did not affect the strong bias towards deletions in the strains (data not shown). The bias towards deletions is clearly evident from plots of log2 ratios on each chromosome (Maydan et al. 2007; see Figures 2.6 and 2.7). An analysis of variance (ANOVA) indicated that indel length varied significantly among the strains in our study (F = 1.8706, P = 0.0395), but not if RW7000 was excluded from the analysis (F = 1.4823, P = 0.1407). With just eight deletions and two amplifications, the mean length of indels in RW7000 (22.6 kb) is greater than in the other strains, due largely to an 86-kb deletion on chromosome IV and a 117-kb duplication on chromosome V. RW7000 is the only strain in our study known to have a high Tc1 transposon count (Hodgkin and Doniach 1997), so transposon activity may have been more important in generating CNVs in this strain than in the others. The median indel length among all strains was 2.7 kb, with a mean of 8476 bp, but our ability to detect very small aberrations was limited by the probe density on the arrays. Indel length also varied significantly among chromosomes (F = 5.8772, P = 0.0000233), with larger indels on chromosomes V, IV and II. The mean size of indels on each chromosome ranged from 12.8 kb on chromosome V to just one kb on chromosome X. Furthermore, a Welch’s two-sample t-test revealed a significant difference (t = 6.9256, P = 1.091 × 10⁻¹¹) between the mean indel lengths in the autosome arms (as defined by recombination rate analysis (Barnes et al.
1995)) and the autosome centers or the X chromosome (9505 bp and 3277 bp, respectively).  54  4.2.2. Extensive copy number variation in the C. elegans genome allows even very closely related strains to be distinguished A single deletion was detected in CB3191 and subsequently confirmed by PCR and DNA sequencing. The deletion completely knocks out math-15 (sequenced deletion breakpoints are at chromosome coordinates II: 1866365 and II: 1868059). This illustrates the power of our aCGH experiments to distinguish among strains, since CB3191 had previously appeared identical to N2 based on nearly complete mtDNA sequencing (Denver et al. 2003) and multilocus microsatellite genotyping (Haber et al. 2005). Still, with only one deletion, CB3191 appears as similar to the canonical N2 genome as our laboratory’s N2 strain does. In effect, the aCGH experiment comparing CB3191 to N2 served as a self-versus-self negative control and demonstrated that our whole genome array design, in itself, does not produce false positives. CB4856 and JU258 were clearly the most divergent strains relative to N2, with more indels and more genes deleted than any of the other strains. KR314 (Vancouver) was the most divergent among the other strains and had more deletions on chromosome II than did any other strain, including CB4856 and JU258, but curiously had no deletions at all on the left half of chromosome V (V-L) where 18 deletions were present in CB4856 and 23 in JU258. The absence of deletions on V-L suggests the possibility that KR314 acquired an N2-like V-L through recombination. Interestingly, RW7000 was by far the least divergent strain other than the N2-like CB3191, despite appearing to be more divergent than six of our strains in a study counting alleles among 31 chemoreceptor genes (Stewart et al. 2005). 4.2.3. 
The distribution of indels in the genome and the overrepresentation of indels in particular gene families Figure 4.1 plots all of the indels we detected in each strain on the left half of chromosome II (IIL). Figure 4.2 plots a higher resolution figure of this type for the entire genome. Chi-square tests indicated that both the number of deletions and the number of amplifications relative to N2 varied significantly among chromosomes (P < 2.2 x 10-16 and P = 1.281 x 10-5, respectively), after adjusting for the different number of probes targeting each chromosome. As previously described for CB4856 and JU258 (Maydan et al. 2007), indels in C. elegans are strikingly more common on autosome arms than in the autosome centers or on the X chromosome. The 55  autosome arms span just 38% of the probes on our Whole Genome microarray but include 83% of the indels that we identified (85% of the deletions and 64% of the amplifications). Table 4.3 lists all of the genes targeted by the indels we detected. 1136 different genes were either wholly or partially deleted in at least one strain. The same gene families that we found to be overrepresented in indels in CB4856 and JU258 are clearly also the most common targets of indels in the new strains in this study, most notably the MATH-BTB, F-box, lectin and serpentine chemoreceptor gene families. All of these gene families cluster on autosome arms (Thomas 2006).  4.2.4. Relatedness inferences based on deletions shared by multiple natural isolates Many of the indels we detected were found in multiple strains. Figure 4.3 shows the number of deletions that were found in each strain and the number of other strains that carry the same deletions. The strains with the most deletions also have the highest proportion of unique deletions in our data set, excluding CB3191 and RW7000, which carry only unique deletions. 
By this measure, CB4856 was the most divergent strain, with 66% (113/172) of its deletions being unique among the twelve strains. This agreed closely with a study reporting that 70% of CB4856 SNPs were not present in nine other natural isolates (Koch et al. 2000). Unique deletions comprise 64% (75/140) of the deletions in JU258, 42% (33/78) of the deletions in MY2, and 39% (48/124) of the deletions in KR314. Table 4.2 displays the number of deletions shared by all pairs of strains.

Deletions were used as markers to infer a strain phylogeny under both Camin-Sokal parsimony (Camin and Sokal 1965), which assumes that deletions are derived states and transitions to deletions are more likely than transitions to the absence of deletions, and Wagner parsimony (Eck and Dayhoff 1966; Kluge and Farris 1969), which considers the appearance or disappearance of deletions equally likely and does not presume ancestral states. Because of our focus on genes that are present in a single copy in N2, the presence of a gene cannot be reconstituted by mutation or conversion in a strain carrying a deletion allele, but it is possible for a gene to reappear in a lineage as the result of outcrossing. Both parsimony methods gave the same consensus tree (Figure 4.4) based on 1000 bootstrap replicates drawn with replacement from the 510 deletion loci that we identified. We chose to exclude amplifications from this analysis because they were less robustly detected than deletions. Deletions were treated as independent characters despite the presence of linkage between loci. We detected significant multilocus linkage disequilibrium (standardized index of association (IAS) = 0.072, P < 0.001; see Haubold and Hudson 2000) consistent with other studies (Barriere and Felix 2005; Cutter 2006; Barriere and Felix 2007). It is important to note that because deletions are more common on the autosome arms, these regions of the genome factored heavily into the relationships that we inferred.
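The bookkeeping behind the shared-deletion counts and the bootstrap replicates described above can be illustrated with a minimal Python sketch. It is not the analysis actually performed (trees were inferred with an external phylogenetics package); it only shows, under stated assumptions, how deletion loci scored as presence-absence characters yield pairwise shared counts and how one bootstrap replicate is drawn by resampling loci with replacement. The strain names, the toy matrix and the function names are hypothetical.

```python
import random


def shared_deletion_counts(presence):
    """Count deletions shared by every ordered pair of strains.

    `presence` maps a strain name to a list of 0/1 flags, one per deletion
    locus (1 = deletion present). A strain paired with itself gives its
    total number of deletions.
    """
    strains = list(presence)
    return {
        (a, b): sum(x & y for x, y in zip(presence[a], presence[b]))
        for a in strains
        for b in strains
    }


def bootstrap_loci(presence, rng):
    """Draw one bootstrap replicate by resampling loci with replacement."""
    n_loci = len(next(iter(presence.values())))
    picks = [rng.randrange(n_loci) for _ in range(n_loci)]
    return {strain: [flags[i] for i in picks] for strain, flags in presence.items()}


# Toy data (hypothetical): three strains scored at five deletion loci.
toy = {
    "strainA": [1, 1, 0, 0, 1],
    "strainB": [1, 1, 0, 1, 0],
    "strainC": [0, 0, 0, 0, 1],
}
counts = shared_deletion_counts(toy)
replicate = bootstrap_loci(toy, random.Random(0))
```

Each replicate has the same number of loci as the original matrix, and every column is a copy of an original column, so linked loci are resampled as if they were independent, in line with the treatment of deletions as independent characters.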
Overall, close relationships were inferred between CB4853 and CB4858, and between CB3191 and RW7000 (which closely resemble N2). The confidence at the node leading to CB3191 and RW7000 under Camin-Sokal (CS) parsimony is lower than under Wagner (W) parsimony because the two strains occurred in isolation on a subset of the CS trees, but neither strain shared any deletions with any other strain. Different relationships are predicted by different regions of the genome due to the presence of recombination in the lineage, and this is reflected in the low bootstrap confidence at many of the nodes on the consensus tree. An example of a possible recombination event is indicated in Figure 4.1. AB1 was previously identified as a recombinant strain based on discrepancies between the phylogenies inferred by mitochondrial and nuclear marker data sets (Denver et al. 2003), and the bootstrap support at the node leading to AB1 on our consensus tree is particularly low. While JU258 was most similar to JU263 on I-R, it shared more deletions with KR314 on II-L, III-L and V-R. JU258 and KR314 were identified as each other’s closest relative on 27% of CS (32% W) trees inferred from all bootstrap replicates, and the group of JU258/KR314/JU263 was also fairly common (19% CS, 28% W). KR314 was more similar to CB4853 and CB4858 on chromosome II. Other groups with appreciable bootstrap support but not appearing on the consensus tree included AB1/CB4854 (32% CS, 20% W), CB4853/CB4858/CB4854/JU322 (21% CS, 29% W), MY2/JU258 (18% CS, 22% W) and MY2/JU258/JU263 (12% CS, 23% W).

4.3. Discussion

4.3.1. Bias favoring deletions targeting gene families involved in environmental sensation and innate immunity

We detected far more deletions than amplifications in all of the natural isolates. A bias favoring deletions over insertions has previously been observed in patterns of pseudogene variation (Robertson 2000).
It has also been suggested that there may be a high rate of spontaneous deletion in the C. elegans genome (Witherspoon and Robertson 2003) and perhaps selection for small genome size (Denver et al. 2004). It is possible that some of the deletion candidates we detected are actually regions of extreme sequence divergence (see Section 4.3.3). It is also possible that some very ancient duplications are not detected because enough time has passed for the subsequent accumulation of sequence variation to prevent hybridization to our microarrays. It is important to note that duplications present in N2 are not represented on our whole genome microarray because we selected only unique probe sequences with limited homology to other sequences in the genome. Genes involved in sensory perception and innate immunity are enriched in indels in both C. elegans (Maydan et al. 2007) and humans (Nguyen et al. 2006). Chemoreceptor gene families have undergone significant expansion in C. elegans since its common ancestor with Caenorhabditis briggsae (Stein et al. 2003), including multiple rounds of tandem duplication in the sra and srab families (Chen et al. 2005), and appear prone to gene gains and losses. We found that indels were longer on the autosome arms, where homologous gene clusters of these same gene families are the most common (Thomas 2006). Higher recombination rates (Barnes et al. 1995) and the presence of homologous gene clusters and repeat sequences probably predispose autosome arms to non-allelic homologous recombination (NAHR) events, which tend to generate larger CNVs, whereas smaller indels are more likely created by non-homology-based mechanisms (Conrad et al. 2006; Redon et al. 2006; Conrad and Hurles 2007).
We chose to include pseudogenes in Table 4.3 because many genes annotated as pseudogenes in the N2 strain probably have functional copies in other natural isolates, especially genes with a single defect in N2 (usually a premature stop codon or deletion) (Stewart et al. 2005). The large number of deletions relative to N2 that we detected suggests that N2 itself may lack many unknown genes that are present in other natural isolates. Of course, only probes to N2 sequences are included on our microarrays.

4.3.2. New insights into complex strain relationships resulting from recombination and outcrossing in the C. elegans lineage

Different portions of the genome will possess different genealogies due to recombination. Trees are therefore inherently flawed in their generalized depiction of strain relationships when recombination has occurred between lineages and should not be interpreted as phylogenies. Nevertheless, our results largely agree with trees inferred from nuclear markers (Denver et al. 2003; Haber et al. 2005). Strains that are close together on these trees generally share more deletions than do strains that are further apart, with a few notable exceptions probably resulting from the limited number of loci available in earlier studies. For example, our results differ markedly from the study in which CB4858 and CB4854 appeared identical at 10/10 microsatellite loci (Haber et al. 2005). We found that CB4858 was much more closely related to CB4853, sharing 52/66 of its deletions with CB4853 (52/63 CB4853 deletions were found in CB4858) but sharing just 21/66 deletions with CB4854. CB4858 and CB4853 appeared identical on II-L and throughout chromosomes III and V (one small two-probe CB4858 deletion candidate on chromosome III did not quite meet our P-value cutoff in CB4853, but both probes showed log2 ratios < -2).
Only five CB4853 deletions not found in CB4858 were present in any other strains, including two deletions on the X chromosome that were found in MY2, and three deletions on II-R that were present in CB4854. Four of the eight deletions found in CB4858, but not in CB4853, were present in JU263 (on chromosomes I, II and X). CB4858 and CB4854 did share deletions in the regions of some of Haber et al.’s microsatellites but not in all cases. For instance, a deletion that CB4858 shared with CB4853 and CB4856, but that is not found in CB4854, lies just six kb away from the microsatellite allele shared by CB4858 and CB4854 on II-L. This highlights the importance of using a large number of loci spread throughout the genome when estimating strain relatedness.

Another discrepancy between the relationships inferred in our study and those estimated by Haber et al. (2005) involves JU263. Microsatellites did not reveal close relatedness between JU263 and JU258, but JU263 shared more deletions with JU258 (31/68) than with any other strain in our study. JU263 was most similar to KR314 and CB4854 on V-R, and shared 26/68 deletions with KR314 overall. Although it does not appear on the consensus tree, a JU263/JU258/KR314 group appeared on 28% of W trees inferred from all bootstrap replicates. Chromosome III told yet another story, where JU263 shared 4/4 deletions on III-L with JU322, and none of these with any other strain.

Overall, CB4856 did not appear particularly closely related to any of the other strains. CB4856 shared the most deletions with KR314, JU322, JU258, JU263 and MY2 (18, 18, 17, 17, and 16, respectively), and slightly fewer with the remaining strains (but none with CB3191 and RW7000). However, specific regions of the CB4856 genome more closely resembled particular strains. For example, CB4856 and JU258 shared five deletions over a 10.6-Mb interval on chromosome V, but each shared only one deletion over the same interval with any other strain (MY2).
Nevertheless, CB4856 and JU258 were significantly diverged from one another in this region, which included 18 deletions unique to CB4856 and ten deletions unique to JU258. Many other regions of similarity among different groups of strains are evident in Figure 4.2 and Table 4.3, illustrating that the relationships among strains are complicated due to recombination and outcrossing, probably to a greater extent than previously appreciated in studies utilizing fewer genetic markers (Denver et al. 2003; Haber et al. 2005).

4.3.3. Very common indels, mutation hotspots, and the possibility of extreme sequence divergence masquerading as deletions

Remarkably, we found 6.7-kb deletions (in CB4854, CB4856 and JU322) and duplications (in KR314 and MY2) that affected exactly the same 117 probes on chromosome III, suggesting the possibility that both CNVs arose from a single NAHR event and subsequently survived the process of genetic drift. Furthermore, the log2 ratios for these amplifications (particularly for MY2) indicated a possible four-fold amplification, perhaps suggesting a second duplication had occurred. Some deletions and amplifications were more common among the strains in our study than the allele observed in N2 (see Figures 4.1 and 4.2). For example, a 10-kb deletion on chromosome III was found in all strains except CB3191 and RW7000. This deletion targeted three uncharacterized genes, Y75B8A.31, Y75B8A.32 and Y75B8A.34, and could possibly be of ancient origin. It is possible that some common indels have arisen by independent mutations, particularly if there are hotspots susceptible to mutation by NAHR (Conrad and Hurles 2007) and/or subject to positive selection. There are several very common deletions found on V-R (see Figure 4.2 and Table 4.3), which is a region rich in copy number variation.
Regions with very high sequence divergence could potentially produce log2 ratios that are sufficiently negative to appear as deletions in aCGH data, but are unlikely to account for a sizeable proportion of the deletions that we detected. On average, roughly 10% or more of the nucleotides in each of several adjacent 50-mer probes would need to be mutated in order to approach our conservative log2 ratio cutoff for deletions (Flibotte and Moerman 2008). This level of sequence variability in our probe sequences is particularly unlikely because our probes target coding sequences. On average, single nucleotide polymorphisms relative to N2 exist at 1/840 nucleotides in CB4856 (Wicks et al. 2001; Swan et al. 2002) and 1/1500 nucleotides in CB4858 (Hillier et al. 2008), but recent whole genome sequencing of the CB4856 genome has identified several regions of much higher sequence diversity (> 10x) that sometimes coincide with deletions identified in our aCGH data (David Spencer and Ryan Morin, personal communication). Most of the deletions we detected (~ 70%) were found to coincide with gaps in the CB4856 genome sequence, suggesting they are likely to be true deletions. The majority of the deletions we inferred that were not found to coincide with gaps in the genome sequence data were small deletions (< 1000 bp) that were not robustly detected because the sequence gap analysis used a sliding window method with a minimum window size of 1000 bp. However, 15/172 Hawaiian deletions that we detected that were larger than 1000 bp were found to overlap with regions of extreme sequence divergence. Nevertheless, one CB4856 deletion that we detected and an overlapping region of high sequence diversity have both been independently identified and confirmed, and are associated with the genetic incompatibility between CB4856 and N2 (Seidel et al. 2008). This partial reproductive isolation has probably allowed for the accumulation of sequence diversity in this region. 
Still, we cannot rule out that some of the deletions we report could be false positives resulting from regions of extreme sequence divergence. This is probably more likely in the most divergent strains. Genes in some of the gene families that are overrepresented amongst the deletions we report are known to be subject to positive selection for changes in amino acid sequence (Thomas et al. 2005).

We have shown that there is substantial copy number variation in coding sequences in the C. elegans genome. Indels are most common on the autosome arms, especially on chromosomes II and V. Deletions relative to N2 are much more common than amplifications. This bias may partly be the result of selection, as it is unlikely that many of these deletions are selectively neutral because they target coding sequences. Over 5% of the annotated genes in the N2 genome overlap with indels in at least one of the twelve strains that we examined. The indels that were detected should be useful in explaining natural phenotypic variation, particularly in chemosensation (Jovelin et al. 2003) and innate immunity (Schulenburg and Muller 2004). The figure of over 5% underestimates the copy number variation in the C. elegans genome because we examined only twelve natural isolates, our ability to detect very small indels is limited by the probe density on our microarrays, and we did not attempt to detect indels that do not target exons. Approximately 26% of the C. elegans genome is intronic and 47% is intergenic sequence (The C. elegans Sequencing Consortium 1998). aCGH is a powerful method of quickly obtaining a large number of genetic markers spread throughout the genome, and has revealed complex relationships among wild C. elegans isolates resulting from recombination and outcrossing events throughout the natural history of the species.

4.4. Methods

4.4.1. Strain selection, nematode culturing and DNA preparation

The N2 reference strain in all experiments was VC196, a subculture of N2 received from the Caenorhabditis Genetics Center (CGC) in 2002. RW7000 was acquired from the lab of Robert Waterston in 1987, and was submitted by that lab to the CGC in 1991. All other strains were received directly from the CGC and grown for a minimal number of generations prior to DNA preparation. We selected the strains in an attempt to sample a range of the microsatellite diversity observed by Haber et al. (2005) but also included strains thought to be very closely related to each other in an attempt to better distinguish them with a larger number of loci. We also included JU322, which was not part of the Haber et al. study. We intentionally selected some strains suspected to be recombinant, including AB1 (Denver et al. 2003) and CB4854 (Haber et al. 2005), to test our ability to identify particular regions of the genome that have been exchanged as the result of outcrossing.

Nematodes were grown as previously described (Brenner 1974) on 150-mm NGM agar plates seeded with Escherichia coli strain χ1666. Nematode populations were grown to starvation, harvested by washing with M9 containing 0.01% Triton X-100, and washed an additional seven times by centrifugation, removal of the supernatant by aspiration, resuspension and vortexing in M9/Triton X-100. After the final wash, DNA was prepared by standard phenol-chloroform extraction and ethanol precipitation as previously described (Maydan et al. 2007).

4.4.2. aCGH

Probes on our whole genome microarray were initially selected from build WS139 of the C. elegans genome (Maydan et al. 2007). Microarray manufacture, DNA fragmentation and labeling, sample hybridization and imaging, and fluorescence intensity measurement were performed by Roche NimbleGen, Inc. as previously described (Maydan et al. 2007).
Log2 ratios (Natural Isolate / N2) were calculated and then normalized using the robust LOWESS regression (Cleveland 1981) implemented in the R programming language (R Development Core Team 2006) with a smoother span setting of f = 0.4.

4.4.3. Indel identification

Indels were detected with a segmentation algorithm developed by S. Flibotte (Maydan et al. 2007; Appendix 3). Aberrant segments with a P-value ≤ 0.01 were called deletions if the mean log2 ratio of probes in the segment was ≤ -2, and called amplifications if the mean log2 ratio was ≥ 1. Most indels that we classified as deletions are probably truly deletions as opposed to insertions in the N2 lineage, since we used probes targeting coding sequences that contain no non-unique 20-mers and no more than 70% homology to other genome sequences. A novel N2 gene arising by duplication and subsequent sequence divergence in the N2 lineage would have to accumulate many coding sequence mutations in order to pass our probe selection filters. N2 genes arising from recent duplications are probably among the 2% of genes not represented on our microarrays. Nonetheless, we cannot completely rule out the possibility that some insertions or rearrangements in the N2 lineage could have been misclassified as deletions in the natural isolates.

The P-value of each indel was calculated with a one-sample t-test (note that with a very large number of non-aberrant data points, this gives nearly the same value as a Welch’s two-sample t-test). P-values were not corrected for multiple tests, but a Bonferroni correction would not exclude many indels since a large majority of them have P-values far below the cutoff (91% of all indels have P-values < 0.001, and 83% have P-values < 0.0001). All indels affected three or more consecutive probes, with the exception of ten deletion candidates (~1% of the indels) that were detected by just two probes with very negative log2 ratios.
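The calling thresholds stated above can be summarized in a short Python sketch. It assumes that the segmentation step has already produced a candidate segment and that its P-value has been computed elsewhere; neither the segmentation algorithm nor the t-test is reproduced here. The function name, constant names and input values are hypothetical and serve only to illustrate the cutoffs.

```python
from statistics import mean

# Cutoffs taken from the text of this section.
P_CUTOFF = 0.01             # segments must have P-value <= 0.01
DELETION_CUTOFF = -2.0      # mean log2 ratio <= -2 is called a deletion
AMPLIFICATION_CUTOFF = 1.0  # mean log2 ratio >= 1 is called an amplification


def call_segment(log2_ratios, p_value):
    """Classify one candidate segment from its probe log2 ratios.

    `p_value` is assumed to be precomputed for the segment (hypothetical
    input); only the thresholding step is shown.
    """
    if p_value > P_CUTOFF:
        return "none"
    segment_mean = mean(log2_ratios)
    if segment_mean <= DELETION_CUTOFF:
        return "deletion"
    if segment_mean >= AMPLIFICATION_CUTOFF:
        return "amplification"
    return "none"


# Hypothetical segments for illustration only.
strong_loss = call_segment([-2.5, -2.1, -2.4], p_value=0.0001)
modest_gain = call_segment([1.2, 0.9, 1.1], p_value=0.005)
weak_loss = call_segment([-1.0, -1.5, -1.2], p_value=0.001)
```

The asymmetry between the two cutoffs is visible in the sketch: a segment must drop much further below zero to be called a deletion than it must rise above zero to be called an amplification.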
All aberrant segments were examined manually and some adjustments were made to fine-tune the selection of the leftmost and rightmost probes (“breakpoint probes”) within the indels. Some segments were interrupted by stretches of probes that did not give log2 ratios consistent with an indel and were manually split into multiple segments. After these adjustments, the mean log2 ratios and P-values were recalculated for all segments with one-sample t-tests using R to ensure that our cutoffs were still met. In most cases these adjustments further decreased the P-values of the indels. Occasionally, probes flanking indels show unusual log2 ratios outside the normal range for unaffected probes (Maydan et al. 2007). This can sometimes make identification of the indel breakpoints less certain. For each indel, we identified “flanking probes” beyond the left and right breakpoint probes to demark the point at which more normal log2 ratios begin to consistently appear again (log2 ratios > -0.8 for deletions, and log2 ratios < 0.5 for amplifications). 78% of these flanking probes were adjacent to their corresponding breakpoint probes, and 90% were within three probes of indels.

4.4.4. Chi-square tests, t-tests and ANOVAs

All remaining chi-square tests, t-tests and ANOVA tests were done using R. For the chi-square tests, the expected number of indels on each chromosome was calculated based on the proportion of probes targeting that chromosome, which essentially corrects for differences in length and gene content among chromosomes. Indel lengths were measured from the middle of the left breakpoint probe to the middle of the right breakpoint probe. Indels found in more than one strain were counted multiple times in these tests.

4.4.5. Affected genes

Probe coordinates were obtained by remapping probe sequences to the most recent genome data freeze (WS190) using MegaBLAST, which utilizes a greedy algorithm (Zhang et al. 2000) to align DNA sequences.
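As an aside on the chi-square tests of Section 4.4.4, the probe-proportional expectation can be illustrated with a minimal Python sketch. The tests themselves were run in R; the function names and the two-chromosome toy counts below are hypothetical, and only the test statistic is computed (no P-value).

```python
def expected_counts(observed, probes):
    """Expected indels per chromosome, proportional to probe counts."""
    total_observed = sum(observed.values())
    total_probes = sum(probes.values())
    return {
        chrom: total_observed * probes[chrom] / total_probes
        for chrom in observed
    }


def chi_square_statistic(observed, expected):
    """Pearson chi-square statistic: sum of (O - E)^2 / E over chromosomes."""
    return sum(
        (observed[chrom] - expected[chrom]) ** 2 / expected[chrom]
        for chrom in observed
    )


# Toy numbers (hypothetical): chrB carries three times as many probes as
# chrA, yet both carry the same number of indels.
toy_probes = {"chrA": 100, "chrB": 300}
toy_observed = {"chrA": 30, "chrB": 30}

toy_expected = expected_counts(toy_observed, toy_probes)
toy_statistic = chi_square_statistic(toy_observed, toy_expected)
```

In the toy case the expected counts are 15 and 45, so the equal observed counts represent an excess on the probe-poor chromosome, which is the kind of departure the test is designed to detect.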
24 probes on the array no longer had perfect sequence matches in the genome due to changes in the genome sequence from WS139 to WS190, but none of these probes were present in any of the indels that we detected. In order to generate the list of genes affected by indels in Table 4.3, we first extracted the start and stop coordinates for all genes in genome build WS190 from WormBase. From that list, we extracted only the genes that overlapped the coordinates spanned by the indel breakpoint probes. Genes completely contained within the region spanned by an indel were listed as entirely affected. Discrepancies between Table 4.3 and the list of deleted genes in CB4856 and JU258 given in Maydan et al. (2007) are due to the adjustments we made to the indel breakpoints and because the genes listed in the previous study were extracted from an older genome build (WS150).

4.4.6. Strain relationships

All deletion loci were treated as discrete presence-absence characters. Strains were considered to carry the same deletion allele if their respective deletions overlapped and both their left and right breakpoint probes were within three probes of each other, or in a small number of cases (where the breakpoint and flanking probes were more ambiguous) if the breakpoint probes in one strain fell within the region spanned by the flanking probes in another strain. Remarkably, most deletions found in multiple strains according to these criteria had exactly the same breakpoint probes, illustrating the reliability of the log2 ratios. The single case where we identified deletions and amplifications affecting the same probes (see F44E2.2a in Table 4.3) could have been treated as a multi-state locus, but this was not done because we chose to exclude amplifications from this analysis. Unrooted trees were inferred from the 510 deletion loci using Phylip 3.66 (Felsenstein 1989).
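The primary rule given above for deciding whether two strains carry the same deletion allele can be sketched in Python as follows. The sketch assumes that each deletion is represented by the positions of its left and right breakpoint probes, expressed as indices in the ordered list of probes on one chromosome, and that "within three probes" means an index difference of at most three. The secondary rule based on flanking probes is not shown. The function name and the example values are hypothetical.

```python
def same_deletion_allele(first, second, max_probe_offset=3):
    """Apply the primary matching rule to two deletions on one chromosome.

    Each deletion is a (left_index, right_index) pair of breakpoint probe
    indices (an assumed representation). The deletions must overlap, and
    their left and right breakpoint probes must each lie within
    `max_probe_offset` probes of one another.
    """
    left_a, right_a = first
    left_b, right_b = second
    overlaps = left_a <= right_b and left_b <= right_a
    breakpoints_close = (
        abs(left_a - left_b) <= max_probe_offset
        and abs(right_a - right_b) <= max_probe_offset
    )
    return overlaps and breakpoints_close


# Hypothetical examples.
matching = same_deletion_allele((10, 20), (12, 22))        # overlap, offsets of 2
shifted_left = same_deletion_allele((10, 20), (15, 20))    # left offset of 5
adjacent_only = same_deletion_allele((10, 12), (13, 15))   # close, but no overlap
```

The third example shows why the overlap condition is stated separately: two short deletions can have breakpoint probes within three probes of each other without sharing any affected probe.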
The most parsimonious trees were inferred under both Camin-Sokal and Wagner parsimony methods with 1000 bootstrap replicates drawn with replacement from the loci. A consensus tree was then inferred separately for each method. An unrooted consensus tree was drawn with CB4856 at the base simply because it shared the fewest alleles with all other strains. The true position of the root of the consensus tree is unknown.

4.4.7. Linkage disequilibrium

We used LIAN 3.5 (Haubold and Hudson 2000) to calculate a standardized index of association (IAS) based on the original formulae given by Brown et al. (1980) and Smith et al. (1993). IAS is 0 at linkage equilibrium. The program tested significance with a Monte Carlo simulation, resampling loci without replacement over 1000 iterations, in order to scramble their order and generate a null distribution of IAS. The P-value is the probability, under the null hypothesis of linkage equilibrium, of IAS being greater than or equal to the value observed for our data set.

Table 4.1. Indels detected in twelve natural isolates of C. elegans. The number of deletions and amplifications detected in each isolate is shown, along with statistics summarizing their lengths. The overall number of indels is the sum of those found in all strains, so indels found in more than one strain were counted multiple times. The number of deleted genes and pseudogenes includes those either wholly or partially deleted. The overall number of deleted genes and pseudogenes is the number deleted in at least one of the twelve strains (genes deleted in more than one strain were counted only once). The location refers to the site the strain was initially isolated from.
Strain   Deletions  Amplifications  Median Indel Length (bp)  Mean Indel Length (bp)  Maximum Indel Length (bp)  Deleted Genes  Deleted Pseudogenes  Location
AB1      48         9               2481                      7564                    75120                      147            26                   Adelaide, Australia
CB3191   1          0               1262                      1262                    1262                       1              0                    Altadena, California, USA
CB4853   63         10              2460                      8651                    109400                     237            16                   Altadena, California, USA
CB4854   49         7               2240                      6201                    68860                      122            20                   Altadena, California, USA
CB4856   172        10              2885                      7279                    103200                     517            91                   Oahu, Hawaii, USA
CB4858   66         7               2481                      7502                    109400                     211            16                   Pasadena, California, USA
JU258    140        10              3271                      12390                   185700                     671            117                  Ribeiro Frio, Madeira, Portugal
JU263    68         6               2680                      5080                    68860                      145            18                   Le Blanc, France
JU322    66         12              2484                      6270                    70040                      174            15                   Merlet, France
KR314    124        6               2689                      8674                    118100                     417            77                   Vancouver, British Columbia, Canada
MY2      78         5               3485                      10080                   184100                     301            44                   Roxel, Munster, Germany
RW7000   8          2               2593                      22600                   116900                     38             2                    Bergerac, France
Overall  883        84              2700                      8476                    185700                     1136           216                  NA

Table 4.2. Number of deletions shared by all strain pairs. The number listed for all comparisons between a strain and itself (indicated by an *) is simply the total number of deletions detected in that strain.

Strain  AB1     CB3191  CB4853  CB4854  CB4856  CB4858  JU258   JU263   JU322   KR314   MY2     RW7000
AB1     48*     0       14      20      12      13      20      18      12      21      13      0
CB3191  0       1*      0       0       0       0       0       0       0       0       0       0
CB4853  14      0       63*     24      12      52      16      13      23      34      12      0
CB4854  20      0       24      49*     13      21      15      15      23      19      11      0
CB4856  12      0       12      13      172*    13      17      17      18      18      16      0
CB4858  13      0       52      21      13      66*     18      17      23      34      11      0
JU258   20      0       16      15      17      18      140*    31      17      36      26      0
JU263   18      0       13      15      17      17      31      68*     18      26      15      0
JU322   12      0       23      23      18      23      17      18      66*     20      12      0
KR314   21      0       34      19      18      34      36      26      20      124*    14      0
MY2     13      0       12      11      16      11      26      15      12      14      78*     0
RW7000  0       0       0       0       0       0       0       0       0       0       0       8*

Table 4.3. Genes affected by copy number variants in C. elegans. Gene Start and Gene Stop coordinates refer to the positions of the first and last base of each gene in WS190. Indels are identified as amplifications (A) or deletions (D) relative to N2.
Only genes that are completely contained by the interval spanned by the breakpoint probes are listed as entirely affected, despite the possibility that the indel extends beyond those probes. For example, math-15 is listed as partially deleted in CB3191, but DNA sequencing has shown that the gene is entirely deleted. The coordinates listed for all flanking and breakpoint probes (see Methods) refer to the position of the first base of each probe. In some cases we did not identify a flanking probe because there was no probe on our microarray to either the left or right of the indel on that chromosome. The Indel Length is the difference between the left breakpoint and right breakpoint coordinates, and is equivalent to the distance between the middle of the first and the last probe affected by the indel. This table can be accessed at: http://www.zoology.ubc.ca/~alorch/jason/Table.4.3.pdf

Figure 4.1. Indels on the left arm of chromosome II in twelve natural isolates of C. elegans. Deletions unique to a strain are plotted in grey and deletions found in multiple strains are plotted in black. Amplified sequences present in only one strain are shown in orange and those found in multiple strains are shown in red. The actual position of amplified sequences in the genome is unknown. The position of amplifications shown here corresponds to the position of the single copy of that sequence in the N2 reference genome. Small indels are not shown to scale. The blue arrows indicate the site of a possible recombination event. KR314 shares alleles with CB4853 and CB4858 to the right of the arrows but not to the left.

Figure 4.2. Indels in the genomes of twelve natural isolates of C. elegans. Deletions unique to a strain are plotted in grey and deletions found in multiple strains are plotted in black. Amplified sequences present in only one strain are shown in orange and those found in multiple strains are shown in red.
The actual position of amplified sequences in the genome is unknown. The position of amplifications shown here corresponds to the position of the single copy of that sequence in the N2 reference genome. Small indels are not shown to scale. This figure can be accessed at: http://www.zoology.ubc.ca/~alorch/jason/Figure.4.2.jpg

Figure 4.3. The number of deletions detected in each of 12 natural isolates of C. elegans. The numbers of other isolates that carry the same deletions are indicated by the colors in the figure legend.

Figure 4.4. Unrooted consensus tree for twelve natural isolates of C. elegans. The two numbers listed in parentheses next to each node are the percentage of trees among 1000 bootstrap replicates that included all strains distal from CB4856 under Camin-Sokal and Wagner parsimony, respectively. The tree should not be interpreted strictly as a phylogeny due to recombination between strains.

4.5. References

Barnes, T.M., Y. Kohara, A. Coulson, and S. Hekimi. 1995. Meiotic recombination, noncoding DNA and genomic organization in Caenorhabditis elegans. Genetics 141: 159-179.
Barriere, A. and M.A. Felix. 2005. High local genetic diversity and low outcrossing rate in Caenorhabditis elegans natural populations. Curr Biol 15: 1176-1184.
Barriere, A. and M.A. Felix. 2007. Temporal dynamics and linkage disequilibrium in natural Caenorhabditis elegans populations. Genetics 176: 999-1011.
Brenner, S. 1974. The genetics of Caenorhabditis elegans. Genetics 77: 71-94.
Brown, A.H., M.W. Feldman, and E. Nevo. 1980. Multilocus structure of natural populations of Hordeum spontaneum. Genetics 96: 523-536.
Camin, J.H. and R.R. Sokal. 1965. A method for deducing branching sequences in phylogeny. Evolution 19: 311-326.
Chen, N., S. Pai, Z. Zhao, A. Mah, R. Newbury, R.C. Johnsen, Z. Altun, D.G. Moerman, D.L. Baillie, and L.D. Stein. 2005. Identification of a nematode chemosensory gene family. Proc Natl Acad Sci U S A 102: 146-151.
Cleveland, W.S. 1981. LOWESS: A program for smoothing scatterplots by robust locally weighted regression. The American Statistician 35: 54.
Conrad, D.F., T.D. Andrews, N.P. Carter, M.E. Hurles, and J.K. Pritchard. 2006. A high-resolution survey of deletion polymorphism in the human genome. Nat Genet 38: 75-81.
Conrad, D.F. and M.E. Hurles. 2007. The population genetics of structural variation. Nat Genet 39 (7 Suppl): S30-6.
Cutter, A.D. 2006. Nucleotide polymorphism and linkage disequilibrium in wild populations of the partial selfer Caenorhabditis elegans. Genetics 172: 171-184.
Cutter, A.D. and B.A. Payseur. 2003. Selection at linked sites in the partial selfer Caenorhabditis elegans. Mol Biol Evol 20: 665-673.
Denver, D.R., K. Morris, M. Lynch, and W.K. Thomas. 2004. High mutation rate and predominance of insertions in the Caenorhabditis elegans nuclear genome. Nature 430: 679-682.
Denver, D.R., K. Morris, and W.K. Thomas. 2003. Phylogenetics in Caenorhabditis elegans: an analysis of divergence and outcrossing. Mol Biol Evol 20: 393-400.
Eck, R.V. and M.O. Dayhoff. 1966. Atlas of Protein Sequence and Structure 1966. National Biomedical Research Foundation, Silver Spring, Maryland.
Felsenstein, J. 1989. PHYLIP - Phylogeny Inference Package (Version 3.2). Cladistics 5: 164-166.
Feuk, L., A.R. Carson, and S.W. Scherer. 2006. Structural variation in the human genome. Nature Reviews Genetics 7: 85-97.
Flibotte, S. and D.G. Moerman. 2008. Experimental analysis of oligonucleotide microarray design criteria to detect deletions by comparative genomic hybridization. BMC Genomics 9: 497.
Friedman, J.M., A. Baross, A.D. Delaney, A. Ally, L. Arbour, L. Armstrong, J. Asano, D.K. Bailey, S. Barber, P. Birch, M. Brown-John, M. Cao, S. Chan, D.L. Charest, N. Farnoud, N. Fernandes, S. Flibotte, A. Go, W.T. Gibson, R.A. Holt, S.J. Jones, G.C. Kennedy, M. Krzywinski, S. Langlois, H.I. Li, B.C. McGillivray, T. Nayar, T.J. Pugh, E. Rajcan-Separovic, J.E. Schein, A. Schnerch, A. Siddiqui, M.I. Van Allen, G. Wilson, S.L. Yong, F. Zahir, P. Eydoux, and M.A. Marra. 2006. Oligonucleotide microarray analysis of genomic imbalance in children with mental retardation. Am J Hum Genet 79: 500-513.
Haber, M., M. Schungel, A. Putz, S. Muller, B. Hasert, and H. Schulenburg. 2005. Evolutionary history of Caenorhabditis elegans inferred from microsatellites: evidence for spatial and temporal genetic differentiation and the occurrence of outbreeding. Mol Biol Evol 22: 160-173.
Haubold, B. and R.R. Hudson. 2000. LIAN 3.0: detecting linkage disequilibrium in multilocus data. Bioinformatics 16: 847-848.
Hillier, L.W., G.T. Marth, A.R. Quinlan, D. Dooling, G. Fewell, D. Barnett, P. Fox, J.I. Glasscock, M. Hickenbotham, W. Huang, V.J. Magrini, R.J. Richt, S.N. Sander, D.A. Stewart, M. Stromberg, E.F. Tsung, T. Wylie, T. Schedl, R.K. Wilson, and E.R. Mardis. 2008. Whole-genome sequencing and variant discovery in C. elegans. Nat Methods 5: 183-188.
Hodgkin, J. and T. Doniach. 1997. Natural variation and copulatory plug formation in Caenorhabditis elegans. Genetics 146: 149-164.
Jovelin, R., B.C. Ajie, and P.C. Phillips. 2003. Molecular evolution and quantitative variation for chemosensory behaviour in the nematode genus Caenorhabditis. Mol Ecol 12: 1325-1337.
Kluge, A.G. and J.S. Farris. 1969. Quantitative phyletics and the evolution of anurans. Systematic Zoology 18: 1-32.
Koch, R., H.G. van Luenen, M. van der Horst, K.L. Thijssen, and R.H. Plasterk. 2000. Single nucleotide polymorphisms in wild isolates of Caenorhabditis elegans. Genome Res 10: 1690-1696.
Madrigal, I., L. Rodriguez-Revenga, L. Armengol, E. Gonzalez, B. Rodriguez, C. Badenas, A. Sanchez, F. Martinez, M. Guitart, I. Fernandez, J.A. Arranz, M. Tejada, L.A. Perez-Jurado, X. Estivill, and M. Mila. 2007. X-chromosome tiling path array detection of copy number variants in patients with chromosome X-linked mental retardation. BMC Genomics 8: 443.
Marshall, C.R., A. Noor, J.B. Vincent, A.C. Lionel, L. Feuk, J. Skaug, M. Shago, R. Moessner, D. Pinto, Y. Ren, B. Thiruvahindrapduram, A. Fiebig, S. Schreiber, J. Friedman, C.E. Ketelaars, Y.J. Vos, C. Ficicioglu, S. Kirkpatrick, R. Nicolson, L. Sloman, A. Summers, C.A. Gibbons, A. Teebi, D. Chitayat, R. Weksberg, A. Thompson, C. Vardy, V. Crosbie, S. Luscombe, R. Baatjes, L. Zwaigenbaum, W. Roberts, B. Fernandez, P. Szatmari, and S.W. Scherer. 2008. Structural variation of chromosomes in autism spectrum disorder. Am J Hum Genet 82: 477-488.
Maydan, J.S., S. Flibotte, M.L. Edgley, J. Lau, R.R. Selzer, T.A. Richmond, N.J. Pofahl, J.H. Thomas, and D.G. Moerman. 2007. Efficient high-resolution deletion discovery in Caenorhabditis elegans by array comparative genomic hybridization. Genome Res. 17: 337-347.
Nguyen, D.Q., C. Webber, and C.P. Ponting. 2006. Bias of selection on human copy-number variants. PLoS Genet 2: e20.
R Development Core Team. 2006. R: A language and environment for statistical computing. Vienna, Austria.
Redon, R., S. Ishikawa, K.R. Fitch, L. Feuk, G.H. Perry, T.D. Andrews, H. Fiegler, M.H. Shapero, A.R. Carson, W. Chen, E.K. Cho, S. Dallaire, J.L. Freeman, J.R. Gonzalez, M. Gratacos, J. Huang, D. Kalaitzopoulos, D. Komura, J.R. MacDonald, C.R. Marshall, R. Mei, L. Montgomery, K. Nishimura, K. Okamura, F. Shen, M.J. Somerville, J. Tchinda, A. Valsesia, C. Woodwark, F. Yang, J. Zhang, T. Zerjal, J. Zhang, L. Armengol, D.F. Conrad, X. Estivill, C. Tyler-Smith, N.P. Carter, H. Aburatani, C. Lee, K.W. Jones, S.W. Scherer, and M.E. Hurles. 2006. Global variation in copy number in the human genome. Nature 444: 444-454.
Robertson, H.M. 2000. The large srh family of chemoreceptor genes in Caenorhabditis nematodes reveals processes of genome evolution involving large duplications and deletions and intron gains and losses. Genome Res 10: 192-203.
Schulenburg, H. and S. Muller. 2004. Natural variation in the response of Caenorhabditis elegans towards Bacillus thuringiensis. Parasitology 128: 433-443.
Sebat, J., B. Lakshmi, J. Troge, J. Alexander, J. Young, P. Lundin, S. Maner, H. Massa, M. Walker, M. Chi, N. Navin, R. Lucito, J. Healy, J. Hicks, K. Ye, A. Reiner, T.C. Gilliam, B. Trask, N. Patterson, A. Zetterberg, and M. Wigler. 2004. Large-scale copy number polymorphism in the human genome. Science 305: 525-528.
Seidel, H.S., M.V. Rockman, and L. Kruglyak. 2008. Widespread genetic incompatibility in C. elegans maintained by balancing selection. Science 319: 589-594.
Sivasundar, A. and J. Hey. 2005. Sampling from natural populations with RNAi reveals high outcrossing and population structure in Caenorhabditis elegans. Curr Biol 15: 1598-1602.
Smith, J.M., N.H. Smith, M. O'Rourke, and B.G. Spratt. 1993. How clonal are bacteria? Proc Natl Acad Sci U S A 90: 4384-4388.
Stefansson, H., D. Rujescu, S. Cichon, A. Ingason, S. Steinberg, R. Fossdal, E. Sigurdsson, T. Sigmundsson, J.E. Buizer-Voskamp, T. Hansen, K.D. Jakobsen, P. Muglia, C. Francks, P.M. Matthews, A. Gylfason, B.V. Halldorsson, D. Gudbjartsson, T.E. Thorgeirsson, A. Sigurdsson, A. Jonasdottir, A. Jonasdottir, A. Bjornsson, S. Mattiasdottir, T. Blondal, M. Haraldsson, B.B. Magnusdottir, I. Giegling, H.J. Moller, A. Hartmann, K.V. Shianna, D. Ge, A.C. Need, C. Crombie, G. Fraser, N. Walker, J. Lonnqvist, J. Suvisaari, A. Tuulio-Henriksson, T. Paunio, T. Toulopoulou, E. Bramon, M. Di Forti, R. Murray, M. Ruggeri, E. Vassos, S. Tosato, M. Walshe, T. Li, C. Vasilescu, T.W. Muhleisen, A.G. Wang, H. Ullum, S. Djurovic, I. Melle, J. Olesen, L.A. Kiemeney, B. Franke, C. Sabatti, N.B. Freimer, J.R. Gulcher, U. Thorsteinsdottir, A. Kong, O.A. Andreassen, R.A. Ophoff, A. Georgi, M. Rietschel, T. Werge, H. Petursson, D.B. Goldstein, M.M. Nothen, L. Peltonen, D.A. Collier, D. St Clair, and K. Stefansson. 2008. Large recurrent microdeletions associated with schizophrenia. Nature 455: 232-236.
Stein, L.D., Z. Bao, D. Blasiar, T. Blumenthal, M.R. Brent, N. Chen, A. Chinwalla, L. Clarke, C. Clee, A. Coghlan, A. Coulson, P. Eustachio, D.H.A. Fitch, L.A. Fulton, R.E. Fulton, S. Griffiths-Jones, T.W. Harris, L.W. Hillier, R. Kamath, P.E. Kuwabara, E.R. Mardis, M.A. Marra, T.L. Miner, P. Minx, J.C. Mullikin, R.W. Plumb, J. Rogers, J.E. Schein, M. Sohrmann, J. Spieth, J.E. Stajich, C. Wei, D. Willey, R.K. Wilson, R. Durbin, and R.H. Waterston. 2003. The Genome Sequence of Caenorhabditis briggsae: A Platform for Comparative Genomics. PLoS Biology 1: e45.
Stewart, M.K., N.L. Clark, G. Merrihew, E.M. Galloway, and J.H. Thomas. 2005. High genetic diversity in the chemoreceptor superfamily of Caenorhabditis elegans. Genetics 169: 1985-1996.
Swan, K.A., D.E. Curtis, K.B. McKusick, A.V. Voinov, F.A. Mapa, and M.R. Cancilla. 2002. High-throughput gene mapping in Caenorhabditis elegans. Genome Res 12: 1100-1105.
The C. elegans Sequencing Consortium. 1998. Genome sequence of the nematode C. elegans: a platform for investigating biology. Science 282: 2012-2018.
Thomas, J.H. 2006. Analysis of homologous gene clusters in Caenorhabditis elegans reveals striking regional cluster domains. Genetics 172: 127-143.
Thomas, J.H., J.L. Kelley, H.M. Robertson, K. Ly, and W.J. Swanson. 2005. Adaptive evolution in the SRZ chemoreceptor families of Caenorhabditis elegans and Caenorhabditis briggsae. Proc Natl Acad Sci U S A 102: 4476-4481.
Wicks, S.R., R.T. Yeh, W.R. Gish, R.H. Waterston, and R.H. Plasterk. 2001. Rapid gene mapping in Caenorhabditis elegans using a high density polymorphism map. Nat Genet 28: 160-164.
Witherspoon, D.J. and H.M. Robertson. 2003. Neutral evolution of ten types of mariner transposons in the genomes of Caenorhabditis elegans and Caenorhabditis briggsae. J Mol Evol 56: 751-769.
Zhang, Z., S. Schwartz, L. Wagner, and W. Miller. 2000. A greedy algorithm for aligning DNA sequences. J Comput Biol 7: 203-214.

5. Conclusions

5.1. Thesis summary

This thesis describes the development of an aCGH platform that permits very high-resolution detection of mutations in C. elegans.
Exon-centric microarrays targeting specific chromosomes and the whole genome were used to detect novel induced mutations as small as 141 bp in length. Further restricting the candidate region for mutation detection to two Mbp or less allowed the detection of single nucleotide mutations, and many mutations have been discovered using this technique. aCGH will facilitate the identification of mutations for the C. elegans community. The whole genome array was used to characterize natural copy number variation in twelve wild isolates of C. elegans and identified over 500 different deletions affecting more than five percent of the annotated genes in the genome. The deletions present in the natural isolates largely affected genes thought to be involved in environmental sensation and innate immunity, and revealed that the relationships among strains are complicated by recombination resulting from outcrossing with males.

5.2. The significance of this work and its potential applications

aCGH is an attractive and important adjunct to the PCR-based method of deletion detection used by the C. elegans Gene Knockout Consortium and the research community. aCGH avoids the time-consuming and labor-intensive process of sibling selection and is not constrained to finding deletions smaller than PCR amplicon sizes of 2-3 kb. Large deletions may be particularly desirable for tandem gene families, such as the Serpentine Receptor class AB (srab) family of 7-transmembrane chemoreceptors and integral membrane proteins, in which consecutive genes may share functional redundancies (Stein et al. 2003; Chen et al. 2005; Thomas 2006). Because aCGH can reveal additional mutations elsewhere in mutant genomes that are not detected by the PCR-based method, it may help to prevent incorrect associations from being inferred between mutations and phenotypes. Thus far, the cost per deletion isolated has been comparable between the PCR-based method and aCGH experiments comparing known mutants to N2.
The deletions that were detected in the natural isolate work presented in Chapter 4 significantly increased the number of genes with known null alleles, reduced the number of remaining targets for the Knockout Consortium, and may also be helpful in identifying genes that are responsible for phenotypic differences among the strains such as body length and reproductive behavior (Hodgkin and Doniach 1997), social behavior (de Bono and Bargmann 1998), aging (Gems and Riddle 2000), sperm morphology (LaMunyon and Ward 2002), chemosensation (Jovelin et al. 2003) and innate immunity (Schulenburg and Muller 2004). The relationships that were inferred among the strains and the specification of which strains carry the largest numbers of unique indel alleles can be used to inform decisions regarding which strains to select for deep sequencing in the hopes of identifying novel loss-of-function mutations, further contributing to the number of known knockout mutations in C. elegans.

5.3. Strategies to reduce the cost of detecting novel induced deletions using aCGH

Both aCGH and the PCR-based method depend on reliable mutagenesis for their success. Since implementing the aCGH method, the Knockout Consortium has had greater success using aCGH to identify deletions in mutagenized strains that display a phenotype (thus far, balanced lethals (see Chapter 2), uncoordinated (Unc) strains, and unc-22 mutants that twitch in the presence of 1% nicotine) than in strains that are phenotypically wild-type. In most cases the mutations that have been detected have not been correlated with the phenotypes, but the phenotypes serve to ensure that the animal has been mutated. We do not know precisely how common it is for mutant strains to carry multiple deletions after the standard mutagenesis procedure (Barstead and Moerman 2006), but several of the balanced lethal strains described in Chapter 2 carried multiple deletions on a single chromosome II.
Recent aCGH experiments have identified multiple mutations in the genomes of individual unc-22 mutants (Jon Taylor, personal communication). These results suggest that there are likely to be multiple deletions in many of the mutants that we generate.

One way to ensure that there are mutations in the genomes to be screened by aCGH was presented in Chapter 2 in the screens for lethal deletions on chromosome II. Another way would be to perform the standard TMP/UV or EMS mutagenesis (Barstead and Moerman 2006), select single F2 mutants that display a phenotype, propagate those animals clonally through several generations of picking single hermaphroditic parents in order to drive mutations to homozygosity through genetic drift, and then perform aCGH comparing DNA from the F7 mutants to the N2 strain. In the absence of selection, F7 animals resulting from this procedure will have lost roughly half of the mutations present in the F1s but be homozygous for nearly all of the remaining mutations that they carry. Screening homozygous mutants is not necessary for successful deletion detection but it does eliminate downstream sibling selection required to isolate the mutation. Also, heterozygous deletions are less reliably detected and could lead to time being wasted attempting to confirm false deletion candidates.

A simpler method of ensuring that mutant genomes are used in the aCGH experiments, which would not require additional screening for phenotypes, would be to simply screen mutants that have been previously identified by the PCR-based method. These strains have already been clonally propagated by picking single parents for a handful of generations through the normal process of sibling selection, and could be further clonally propagated as described above to drive heterozygous mutations elsewhere in the genome to homozygosity if desired. Deletions previously detected by PCR would serve as positive controls in these aCGH experiments.
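The estimate given above for F7 animals (roughly half of the F1 mutations lost, and nearly all of the remainder homozygous) follows from simple Mendelian arithmetic. The sketch below is a minimal illustration, assuming strict self-fertilization from a single picked hermaphrodite each generation, no selection at the locus, and a mutation that is heterozygous in the F1. The function name is hypothetical and is not part of any protocol described in this thesis.

```python
from fractions import Fraction

def selfing_outcome(generations):
    """Genotype probabilities at one locus after repeated selfing from a heterozygote.

    A picked offspring of a heterozygous parent is itself heterozygous with
    probability 1/2; otherwise it is fixed, with equal chance of being
    homozygous mutant or homozygous wild-type (mutation lost).
    """
    het = Fraction(1, 2) ** generations
    hom_mutant = (1 - het) / 2
    lost = (1 - het) / 2
    return het, hom_mutant, lost

# Going from the F1 to the F7 takes six rounds of selfing.
het, hom, lost = selfing_outcome(6)
retained = het + hom

print(f"lost: {float(lost):.3f}")                                  # 0.492
print(f"still heterozygous: {float(het):.3f}")                     # 0.016
print(f"homozygous among retained: {float(hom / retained):.3f}")   # 0.969
```

Under these assumptions about 49% of unselected mutations are lost by the F7, and about 97% of those that remain are homozygous, which agrees with the qualitative statement in the text.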
Additional mutations found in strains previously characterized by the PCR-based method could be separated from the previously known mutations through backcrosses with the N2 strain. These experiments would also further characterize the genetic background of existing mutant strains, making them more valuable tools to researchers.

Perhaps the simplest way to increase the number of deletions that are detected per aCGH experiment would be to compare two mutant DNA samples to each other instead of comparing a mutant to the N2 reference strain. This would effectively halve the cost of the aCGH method per mutant screened. In the case of homozygous mutations, there would be little ambiguity as to whether a candidate mutation was a deletion in one sample or an amplification in the other, because the log2 ratio produced by a homozygous deletion in one strain would otherwise have to be interpreted as a multi-fold amplification in the other strain, and such amplifications are comparatively rare. Although heterozygous deletions could be misinterpreted as homozygous amplifications, PCR and DNA sequencing are subsequently used to precisely characterize all candidate mutations and would distinguish these possibilities.

At the current cost of experiments using the 380,000-probe microarrays, detecting roughly one deletion in every three experiments would make aCGH cost-efficient relative to the PCR method. This success rate seems to be achievable. In the most recent set of deletion screens utilizing the whole genome microarray, four novel deletions were detected among six aCGH experiments comparing F7 twitcher (unc-22) mutants generated by TMP/UV mutagenesis with wild-type N2 animals, not including deletions affecting unc-22 (Jon Taylor, personal communication).

The availability of higher density microarrays should also make the aCGH method increasingly cost-efficient. Higher density arrays can be used to probe a candidate region with increased probe density and sensitivity, or the arrays can be subdivided to allow multiple individual experiments on a single slide.
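The channel ambiguity discussed above for mutant-versus-mutant comparisons can be made concrete with idealized log2 ratios. The following is a rough sketch only: it assumes that probe signal is proportional to copy number in a diploid animal, and it adds a small hypothetical `background` term to stand in for non-specific hybridization so that a homozygous deletion gives a finite ratio. Neither the function nor the background value comes from the thesis, and real log2 ratios are compressed relative to these idealized values.

```python
import math

def expected_log2_ratio(test_copies, ref_copies, background=0.1):
    """Idealized log2(test / reference) signal ratio for one probe.

    `background` is a hypothetical constant representing non-specific signal.
    """
    return math.log2((test_copies + background) / (ref_copies + background))

# Copy numbers are (test-channel mutant, reference-channel mutant).
scenarios = {
    "homozygous deletion in reference-channel mutant": (2, 0),
    "heterozygous deletion in reference-channel mutant": (2, 1),
    "homozygous duplication in test-channel mutant": (4, 2),
    "homozygous deletion in test-channel mutant": (0, 2),
}

for label, (test, ref) in scenarios.items():
    print(f"{label}: {expected_log2_ratio(test, ref):+.2f}")
```

In this idealization a heterozygous deletion in one sample and a homozygous duplication in the other both give a log2 ratio close to +1, which is why they are hard to tell apart from the array alone, whereas a homozygous deletion gives a ratio far larger in magnitude than any plausible duplication.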
The impact that higher density arrays will have on the cost-efficiency of screening for deletions will, of course, depend on their pricing. The Knockout Consortium plans to purchase microarrays from manufacturers and perform the sample handling, hybridizations and microarray imaging in house, which should further reduce the cost of aCGH experiments. It is also possible to gently but thoroughly remove hybridized DNA samples from a microarray after an aCGH experiment in order to permit reuse of the array for subsequent hybridizations, further reducing the cost per experiment.

5.4. Single nucleotide mutation detection

The ability to detect single nucleotide mutations using aCGH provides a boon to C. elegans researchers and can also benefit research in other model organisms with sequenced genomes. Currently, the SNP detection strategy presented in Chapter 3 is probably the cheapest method of detecting a SNP in a candidate region up to two Mb in length. However, the technique is not sensitive enough to detect all single nucleotide mutations in an interval of this size. Approximately 50% of C/G-to-T/A mutations that are generated by EMS mutagenesis are detectable in a one Mb candidate region, but A/T-to-T/A mutations are more difficult to detect. Of course, microarrays with this probe density are also capable of detecting very small deletions, and in one experiment a 10-bp deletion was detected and confirmed (data not shown). Further decreasing the size of the candidate region allows increased probe density and the targeting of both strands, which should improve the sensitivity and specificity of the technique.

5.5. SNP-CGH Mapping

The ability to detect SNPs using aCGH is exploited in the SNP-CGH mapping protocol that we have developed (Flibotte et al. 2009), which provides an extremely rapid and high-resolution means of mapping mutations in C. elegans.
In this technique, ten F1 cross progeny are picked following a genetic cross of homozygous mutant hermaphrodites with a recessive phenotype and CB4856 males. One hundred homozygous mutant F2s and their progeny are then selected and allowed to grow as a mixed population. DNA is prepared from the resulting population of mutants, labeled with Cy3 and compared to Cy5-labeled N2 DNA by aCGH. A custom microarray targeting over 3,000 CB4856 SNPs spread evenly throughout the genome is used to compare the two DNA samples. This microarray includes up to 22 different probe sequences for both the N2 and the CB4856 alleles at each SNP locus. Due to recombination, mutants have inherited zero, one or two CB4856 SNP alleles at each SNP locus. N2 alleles are more likely at loci that are closely linked to the selected mutation, with expected log2 ratios (Cy3 / Cy5) = 0 for all probes. Loci that are unlinked to the mutation should include an equal proportion of N2 and CB4856 alleles in the absence of selection (although selection does occur at one locus; see Seidel et al. (2008)), giving log2 ratios < 0 for N2-specific probes and log2 ratios > 0 for CB4856-specific probes. The mapping signal is therefore strongest where the difference between the median log2 ratios measured for the CB4856-specific and N2-specific probes is the smallest. Fitting a cubic smoothing spline to this mapping signal allows the mutation to be mapped to within approximately 200 kb (Flibotte et al. 2009). SNP-CGH mapping should become the method of choice for mapping mutations in C. elegans. This technique can also be applied to other model organisms. Mutations mapped in this way should then be relatively easy to identify using the method described in Chapter 3 because the candidate region is reduced to such a small interval. 
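The mapping signal described above can be sketched in a few lines. This is an illustrative toy, not the published SNP-CGH analysis: the data are invented, the function names are hypothetical, and a simple moving average is swapped in for the cubic smoothing spline used by Flibotte et al. (2009) so that the example needs only the standard library.

```python
from statistics import median

def mapping_signal(loci):
    """Per-locus difference between median CB4856-probe and N2-probe log2 ratios.

    `loci` is a list of (position_bp, n2_probe_log2_ratios, cb4856_probe_log2_ratios).
    The difference approaches zero where the mutant pool carries only N2 alleles.
    """
    return [(pos, median(cb) - median(n2)) for pos, n2, cb in loci]

def smooth(signal, window=3):
    """Moving average over neighbouring loci (a stand-in for a smoothing spline)."""
    half = window // 2
    smoothed = []
    for i, (pos, _) in enumerate(signal):
        values = [v for _, v in signal[max(0, i - half): i + half + 1]]
        smoothed.append((pos, sum(values) / len(values)))
    return smoothed

def best_locus(signal):
    """Position with the smallest difference, i.e. the strongest linkage to the mutation."""
    return min(signal, key=lambda item: item[1])[0]

# Invented log2 ratios (Cy3 mutant pool / Cy5 N2) along one chromosome.
loci = [
    (1_000_000, [-1.0, -0.9, -1.1], [0.8, 0.9, 1.0]),
    (2_000_000, [-0.8, -0.7, -0.9], [0.6, 0.7, 0.8]),
    (3_000_000, [-0.4, -0.5, -0.3], [0.3, 0.4, 0.2]),
    (4_000_000, [0.0, 0.1, -0.1], [0.0, 0.05, -0.05]),
    (5_000_000, [-0.4, -0.3, -0.5], [0.3, 0.2, 0.4]),
    (6_000_000, [-0.8, -0.9, -0.7], [0.7, 0.6, 0.8]),
]

print(best_locus(smooth(mapping_signal(loci))))
```

With these invented values the smoothed difference is smallest at the 4 Mb locus, which is where the mutation would be placed. Taking medians across the probes at each locus limits the influence of any single poorly performing probe.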
A candidate region as small as the interval obtained by SNP-CGH mapping would allow probe spacing of 1 bp with probes targeting both strands, which should increase the sensitivity and specificity of the technique compared to our results in Chapter 3 with much larger candidate regions.

5.6. Future directions, deep sequencing and site-specific gene conversion

A currently more costly and less accessible alternative for SNP detection would be to utilize massively parallel high-throughput methods of DNA sequencing, or “deep sequencing”, such as the Illumina platform (Quail et al. 2008). The cost of this approach could potentially be reduced by sequencing only the candidate region instead of the entire mutant genome by first isolating DNA from the candidate region using the method known as sequence capture (Hodges et al. 2007; Okou et al. 2007). Sequence capture involves hybridizing fragmented mutant DNA to a microarray of probes to the target region, washing away sample fragments that do not hybridize to the array, and then eluting the fragments bound to the array. These fragments are then amplified using PCR in order to prepare enough DNA for sequencing. This enriches for regions of interest, such as exons, and can avoid undesired sequencing of intergenic or repeat sequences. However, whether or not sequence capture will be worthwhile given the small size of the C. elegans genome and the decreasing cost of deep sequencing is uncertain.

While TILLING should remain an efficient method of locating an allelic series of mutations in a gene of interest (Till et al. 2003; Gilchrist et al. 2006), with increasing affordability and accessibility, deep sequencing will probably become the preferred method for untargeted genome-wide searches for null alleles. A recent resequencing study suggested that EMS mutagenesis can produce individual animals that are heterozygous for as many as 25 loss-of-function mutations (Cuppen et al. 2007).
Another study provided a proof of principle for using deep sequencing to detect single nucleotide mutations throughout the C. elegans genome by sequencing an N2 isolate and the CB4858 (Pasadena) strain (Hillier et al. 2008). The Knockout Consortium recently sequenced the genome of an unc-22 mutant generated by standard EMS mutagenesis and was able to find a single base pair mutation responsible for the unc-22 mutation, along with mutations leading to 34 non-conservative amino acid changes and one other nonsense mutation (Moerman 2008). In this experiment the gene responsible for the mutant phenotype was already known, but the SNP-CGH protocol described above could be used to narrow the list of candidate genes if deep sequencing is used in forward genetic screens. In the absence of an efficient site-directed method of mutagenesis, random mutagenesis followed by deep sequencing may eventually become the primary method of detecting null alleles for the Knockout Consortium.

Deep sequencing could also be used to identify null alleles in natural isolates. For instance, over 100 nonsense mutations were identified in a recent sequencing of the CB4856 genome (David Spencer, personal communication). The natural isolate study in Chapter 4 can be used to guide the selection of other strains for deep sequencing in the search for novel null mutations. The most promising strains would be those with the highest number of unique alleles, indicating significant divergence from the other strains in the study. These would include JU258, KR314 and MY2.

A significant disadvantage to the strategy of random mutagenesis followed by either aCGH or deep sequencing is that both methods will increasingly discover mutations in genes with existing mutations as the number of discovered mutations increases.
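The diminishing returns described above can be illustrated with a simple occupancy calculation. This is a deliberately crude sketch: it assumes that every gene is equally likely to be hit by each new random mutation, that each hit counts as a knockout, and it uses a hypothetical round figure of 20,000 genes. None of these assumptions or names come from the thesis.

```python
TOTAL_GENES = 20_000  # hypothetical round number, for illustration only

def novel_hit_probability(total_genes, hits_so_far):
    """Chance that the next random hit lands in a gene with no earlier hit."""
    return (1 - 1 / total_genes) ** hits_so_far

def expected_genes_covered(total_genes, hits):
    """Expected number of distinct genes hit at least once after `hits` random hits."""
    return total_genes * (1 - (1 - 1 / total_genes) ** hits)

for hits in (0, 5_000, 20_000, 60_000):
    p = novel_hit_probability(TOTAL_GENES, hits)
    covered = expected_genes_covered(TOTAL_GENES, hits)
    print(f"after {hits:>6} hits: P(next hit is novel) = {p:.2f}, "
          f"genes covered = {covered:.0f}")
```

Under these assumptions the fraction of newly discovered mutations that fall in previously unmutated genes decays roughly exponentially, so the last few percent of genes would require a disproportionate number of random hits.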
While it is possible to use microarrays or sequence capture strategies that do not target genes for which knockouts currently exist, both aCGH and deep sequencing will eventually become less efficient for identifying desirable knockouts as the number of gene targets decreases. Currently, both methods are attractive because knockout mutations do not exist for most C. elegans genes, so mutations that are discovered are likely to target genes for which mutations do not yet exist. The PCR-based method involves site-directed mutation discovery but still relies on random mutagenesis. As the number of genes remaining to be knocked out decreases, site-directed methods of mutagenesis such as MosTIC (Robert and Bessereau 2007) or some other method utilizing homologous recombination that is practical at a genome scale will likely be needed to achieve comprehensive mutation coverage of the C. elegans genome. For the foreseeable future, the aCGH platform developed in this thesis provides an efficient means of identifying novel mutations in C. elegans.

5.7. References

Barstead, R.J. and D.G. Moerman. 2006. C. elegans deletion mutant screening. Methods Mol Biol 351: 51-58.
Chen, N., S. Pai, Z. Zhao, A. Mah, R. Newbury, R.C. Johnsen, Z. Altun, D.G. Moerman, D.L. Baillie, and L.D. Stein. 2005. Identification of a nematode chemosensory gene family. Proc Natl Acad Sci U S A 102: 146-151.
Cuppen, E., E. Gort, E. Hazendonk, J. Mudde, J. van de Belt, I.J. Nijman, V. Guryev, and R.H.A. Plasterk. 2007. Efficient target-selected mutagenesis in Caenorhabditis elegans: Toward a knockout for every gene. Genome Res. 17: 649-658.
de Bono, M. and C.I. Bargmann. 1998. Natural variation in a neuropeptide Y receptor homolog modifies social behavior and food response in C. elegans. Cell 94: 679-689.
Flibotte, S., M. Edgley, J. Maydan, J. Taylor, R. Zapf, R. Waterston, and D.G. Moerman. 2009. Rapid High Resolution SNP-CGH mapping in Caenorhabditis elegans. Genetics 181: 33-37.
Gems, D. and D.L. Riddle. 2000. Defining wild-type life span in Caenorhabditis elegans. J Gerontol A Biol Sci Med Sci 55: B215-219.
Gilchrist, E.J., N.J. O'Neil, A.M. Rose, M.C. Zetka, and G.W. Haughn. 2006. TILLING is an effective reverse genetics technique for Caenorhabditis elegans. BMC Genomics 7: 262.
Hillier, L.W., G.T. Marth, A.R. Quinlan, D. Dooling, G. Fewell, D. Barnett, P. Fox, J.I. Glasscock, M. Hickenbotham, W. Huang, V.J. Magrini, R.J. Richt, S.N. Sander, D.A. Stewart, M. Stromberg, E.F. Tsung, T. Wylie, T. Schedl, R.K. Wilson, and E.R. Mardis. 2008. Whole-genome sequencing and variant discovery in C. elegans. Nat Methods 5: 183-188.
Hodges, E., Z. Xuan, V. Balija, M. Kramer, M.N. Molla, S.W. Smith, C.M. Middle, M.J. Rodesch, T.J. Albert, G.J. Hannon, and W.R. McCombie. 2007. Genome-wide in situ exon capture for selective resequencing. Nat Genet 39: 1522-1527.
Hodgkin, J. and T. Doniach. 1997. Natural variation and copulatory plug formation in Caenorhabditis elegans. Genetics 146: 149-164.
Jovelin, R., B.C. Ajie, and P.C. Phillips. 2003. Molecular evolution and quantitative variation for chemosensory behaviour in the nematode genus Caenorhabditis. Mol Ecol 12: 1325-1337.
LaMunyon, C.W. and S. Ward. 2002. Evolution of larger sperm in response to experimentally increased sperm competition in Caenorhabditis elegans. Proc Biol Sci 269: 1125-1128.
Moerman, D. 2008. Deep sequencing of an unc-22 mutant following EMS mutagenesis in C. elegans. Personal communication. Vancouver, BC, Canada.
Okou, D.T., K.M. Steinberg, C. Middle, D.J. Cutler, T.J. Albert, and M.E. Zwick. 2007. Microarray-based genomic selection for high-throughput resequencing. Nat Methods 4: 907-909.
Quail, M.A., I. Kozarewa, F. Smith, A. Scally, P.J. Stephens, R. Durbin, H. Swerdlow, and D.J. Turner. 2008. A large genome center's improvements to the Illumina sequencing system. Nat Methods 5: 1005-1010.
Robert, V. and J.L. Bessereau. 2007. Targeted engineering of the Caenorhabditis elegans genome following Mos1-triggered chromosomal breaks. EMBO J 26: 170-183.
Schulenburg, H. and S. Muller. 2004. Natural variation in the response of Caenorhabditis elegans towards Bacillus thuringiensis. Parasitology 128: 433-443.
Seidel, H.S., M.V. Rockman, and L. Kruglyak. 2008. Widespread genetic incompatibility in C. elegans maintained by balancing selection. Science 319: 589-594.
Stein, L.D., Z. Bao, D. Blasiar, T. Blumenthal, M.R. Brent, N. Chen, A. Chinwalla, L. Clarke, C. Clee, A. Coghlan, A. Coulson, P. Eustachio, D.H.A. Fitch, L.A. Fulton, R.E. Fulton, S. Griffiths-Jones, T.W. Harris, L.W. Hillier, R. Kamath, P.E. Kuwabara, E.R. Mardis, M.A. Marra, T.L. Miner, P. Minx, J.C. Mullikin, R.W. Plumb, J. Rogers, J.E. Schein, M. Sohrmann, J. Spieth, J.E. Stajich, C. Wei, D. Willey, R.K. Wilson, R. Durbin, and R.H. Waterston. 2003. The Genome Sequence of Caenorhabditis briggsae: A Platform for Comparative Genomics. PLoS Biology 1: e45.
Thomas, J.H. 2006. Analysis of homologous gene clusters in Caenorhabditis elegans reveals striking regional cluster domains. Genetics 172: 127-143.
Till, B.J., T. Colbert, R. Tompa, L.C. Enns, C.A. Codomo, J.E. Johnson, S.H. Reynolds, J.G. Henikoff, E.A. Greene, M.N. Steine, L. Comai, and S. Henikoff. 2003. High-throughput TILLING for functional genomics. Methods Mol Biol 236: 205-220.

Appendix 1. Genes that are completely deleted from the Hawaiian strain (CB4856) genome. The position listed is the middle of the deleted gene.
Genome Name ZK993.2 Y39G10AR.5 F35E2.3 T02G6.6 Y47H9C.14 Y47H9C.9 Y47H9C.10 M01G12.14 T15D6.1 E03H4.12 H16D19.4 T26E3.8 T26E3.1 W02A11.6 W02A11.8 F44F1.1 K05C4.9 C03H5.1 F28A10.3 T07D3.5 T07D3.4 K02E7.9 K02E7.5 K02E7.10 K02E7.12 Y51H7BR.3 Y51H7BR.2 Y51H7BR.1 K05F6.5 K05F6.4 K05F6.6 K05F6.3 K05F6.7 K05F6.2 K05F6.8 K05F6.9 K05F6.1 K05F6.10 C08E3.7 C08E3.8 C08E3.9 C08E3.10 C08E3.11 C08E3.12 ZC204.9 F58E1.12  Genetic Name / Family SR-unclass fbxb  fbxa cfam4 duf595  clec bath-35  clec-10  btb fbx cathA  fbxb-43 fbxb-42 fbxb-44 fbxb fbxb-52 fbxb-51 fbxb-54 fbxb-50 fbx fbxb-46 fbxb-49 btb fbxa fbxa fbxa fbxa fbxa fbxa fbxb-20 fbxc  Chromosome I I I I I I I I I I I I I I I I I II II II II II II II II II II II II II II II II II II II II II II II II II II II II II  Position 1,112,217 2,346,427 11,743,123 11,829,186 11,906,180 11,908,246 11,910,701 12,115,787 12,380,121 12,444,750 12,650,704 12,666,675 12,691,915 12,762,730 12,767,118 13,260,146 14,736,917 403,400 840,887 886,263 889,574 1,060,237 1,063,042 1,066,288 1,071,265 1,534,818 1,536,773 1,538,130 1,540,811 1,545,252 1,548,512 1,552,095 1,555,788 1,559,902 1,567,541 1,570,611 1,570,990 1,573,768 1,613,422 1,615,632 1,617,948 1,620,582 1,622,249 1,624,294 1,649,688 1,693,959  Note  pseudogene  87  F58E1.13 F36H5.9 F36H5.11 F36H5.3 F36H5.2b F36H5.1 C08F1.4a C08F1.5 C08F1.10 C08F1.6 C08F1.3 C08F1.2 C08F1.1 C08F1.7 C08F1.8 C08F1.9 T08E11.6 T08E11.7 T08E11.5 T08E11.4 T08E11.3 T08E11.2 T08E11.8 T08E11.1 C52E2.5 C52E2.4 C52E2.6 C52E2.7 C52E2.3 C52E2.2 C52E2.1 C52E2.8 C16C4.7 C16C4.6 C16C4.5 C16C4.4 C16C4.15 C16C4.16 C16C4.3 C16C4.8 C16C4.9 C16C4.10 C16C4.11 C16C4.12 C16C4.13 C16C4.14 C16C4.2 C16C4.1 C46F9.4 C46F9.3 C46F9.2 C46F9.1 F52C6.5 F52C6.6 F52C6.7 F52C6.8  btb fbxb fbxb-12 math-28 math-27 math-26 math-3 math-4 cfam10 cfam10 fbxb-13 str-21 math-2 str-22 cfam10 revt fbxb-10 fbxa-3 fbxc math-41 math-40 math-39 fbx fbx fbx fbx fbxb-97 fbxb-96 duf130 clec fbxb-95  fbxb-98 math-15 math-14 math-10 math-11 math-13 math-16 
math-17 math-5 math-6 math-7 math-8 math-9 math-12 math-25 math-24 math-23 math-22 math-30 math-31 bath-11 bath-4  II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II  1,695,061 1,752,923 1,754,848 1,764,854 1,769,883 1,773,253 1,777,284 1,780,607 1,785,801 1,787,697 1,789,688 1,792,475 1,800,515 1,802,708 1,806,255 1,809,929 1,817,844 1,819,328 1,822,276 1,827,335 1,831,237 1,832,898 1,834,708 1,838,364 1,841,959 1,844,889 1,847,375 1,849,922 1,851,866 1,852,943 1,854,863 1,856,710 1,859,901 1,863,210 1,866,964 1,870,059 1,872,349 1,874,693 1,876,411 1,878,336 1,880,206 1,882,210 1,884,168 1,885,961 1,888,192 1,890,348 1,891,823 1,894,512 1,896,912 1,901,084 1,903,366 1,905,674 1,908,950 1,912,655 1,914,743 1,917,377  pseudogene pseudogene  88  F52C6.9 F52C6.10 F52C6.11 F52C6.4 F52C6.3 F52C6.2 F52C6.1 F52C6.12 F52C6.13 F52C6.14 C40D2.2 C40D2.1 C40D2.4 F59H6.5 F59H6.6 F59H6.4 F59H6.3 F59H6.2 F59H6.8 F59H6.9 F59H6.10 F59H6.11 F59H6.12 F59H6.1 B0047.1 B0047.2 B0047.3 B0047.4 B0047.5 F07E5.4 F07E5.2 F07E5.5 T16A1.4 T16A1.5 T16A1.9 T16A1.1 K09F6.6 K09F6.9 K09F6.10 K09F6.7 K09F6.8 B0281.4 B0281.5 B0281.6 B0281.3 B0281.2 B0281.7 B0281.8 B0281.1 ZK1240.4 ZK1240.5 ZK1240.9 ZK1240.3 ZK1240.6 ZK1240.2 ZK1240.8  bath-6 bath-7 bath-2 ubq ubq ubq bath-22 ubql-e2  math-20 math-19 homeodomain pif-helicase math-32  bath-21 bath-1 bath-3 bath-5 btb bath-19 bath-20 btb bath-24 math-1 bath-14 fbxb-35 ubql-e3 clec  math-42  ubql-e3 fbxc btb btb btb ubql-e3 revt ubql-e3  ubql-e3 ubql-e3 ubql-e3 ubql-e3 ubql-e3 ubql-e3  II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II  1,918,881 1,920,562 1,922,183 1,923,564 1,925,115 1,926,664 1,928,308 1,929,757 1,930,470 1,932,827 1,997,515 1,999,261 2,000,366 2,005,815 2,011,981 2,012,019 2,018,442 2,021,049 2,026,127 
2,028,248 2,029,999 2,031,390 2,033,198 2,037,553 2,041,105 2,043,359 2,045,145 2,046,356 2,048,442 2,050,852 2,052,197 2,056,538 2,081,788 2,083,712 2,094,932 2,101,501 2,277,033 2,281,786 2,288,421 2,291,523 2,295,399 2,298,227 2,300,381 2,301,816 2,307,558 2,308,984 2,310,574 2,312,170 2,313,998 2,315,354 2,317,122 2,318,984 2,321,053 2,322,662 2,324,267 2,328,025  pseudogene  89  ZK1240.1 F43C11.11 F43C11.12 F16G10.5 F16G10.4 F16G10.3 F42G2.5 Y27F2A.3 Y27F2A.6 Y27F2A.8 Y27F2A.2 Y27F2A.1 Y27F2A.9 ZC239.7 ZC239.8 ZC239.9 ZC239.10 ZC239.19 ZC239.12 ZC239.6 ZC239.5 ZC239.4 ZC239.3 ZC239.2 ZC239.14 ZC239.13 ZC239.15 ZC239.16 ZC239.17 ZC239.1 C17F4.4 C17F4.8 C17F4.10 F14D2.7 F14D2.5 F14D2.11 F14D2.15 F14D2.13 F14D2.12 F14D2.14 F14D2.8 F14D2.10 Y49F6A.1 F19B10.11 F19B10.10 F19B10.1 F19B10.12 F40H7.4 F40H7.3 T24E12.5 F54D10.7 F53C3.7 R03H10.7 F59E12.6a F10E7.2 F10E7.3  ubql-e3 duf130 duf130 duf130 duf130 ubql-e3 sri-40 fbxb sri-78 sri-58 duf130 gcy-15 sri-51 sri-48 sri-53 sri-50 btb btb btb btb btb btb btb btb btb btb btb srh-297 srz-67  ubq btb bath-28 bath-30 fbxo  cfam19 srx-99 srx-101 srx-100 ubq  II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II  2,330,624 2,360,120 2,365,867 2,367,873 2,369,674 2,373,287 2,417,192 3,179,286 3,181,603 3,184,153 3,185,969 3,188,052 3,191,463 3,194,643 3,198,351 3,200,257 3,201,907 3,204,624 3,208,305 3,210,123 3,212,757 3,214,931 3,216,336 3,218,444 3,221,697 3,222,750 3,224,089 3,225,023 3,225,676 3,227,772 3,229,699 3,232,840 3,236,542 3,337,169 3,340,797 3,341,568 3,342,960 3,344,466 3,345,825 3,347,163 3,348,452 3,352,058 3,605,380 3,652,760 3,679,013 3,683,812 3,686,537 3,690,101 3,692,502 3,755,628 3,820,745 3,895,533 4,177,232 5,628,313 7,123,244 7,124,918  pseudogene  pseudogene  pseudogene pseudogene  90  F10E7.1 F15A4.8b Y46G5A.7 Y46G5A.8 E01G4.5 Y39G8C.2 Y53F4B.5 cTel54X.1 W05G11.5 C29F9.4 C29F9.12 
C29F9.14 C29F9.3a Y71H2AM.14 F44E2.2b Y75B8A.32 Y75B8A.34 3R5.1 F38A1.7 F38A1.14 F38A1.13 R05C11.2 Y69A2AR.24 Y69A2AR.25 Y69A2AR.13 Y69A2AR.12 Y69A2AR.26 Y69A2AR.11 Y69A2AR.10 Y69A2AR.9 Y69A2AR.8 Y69A2AR.27 Y94H6A.10 Y46C8AL.5 Y46C8AL.6 Y46C8AL.1 Y46C8AL.8 Y46C8AL.9b Y46C8AR.1 F49F1.7 F49F1.8 F49F1.9 F49F1.10 F49F1.11 F49F1.12 F49F1.13 R07C12.3 R07C12.2 R07C12.4 R07C12.1 K08D10.10 K08D10.9 K08D10.8 K08D10.7 Y7A9C.9 Y7A9C.8  duf1114 glyh fbxb fbxb pkin-st fbxa-6 btb  glyct  SR-unclass clec clec clec  nhr-242 revt  clec-72 revt clec-73 clec-74 clec-75 clec-76 duf18 glec glec glec  clec clec clec clec clec clec scram scram srz-75 srz-76  II II II II II II II III III III III III III III III III III III IV IV IV IV IV IV IV IV IV IV IV IV IV IV IV IV IV IV IV IV IV IV IV IV IV IV IV IV IV IV IV IV IV IV IV IV IV IV  7,127,730 12,477,119 12,755,280 12,756,890 13,473,126 14,096,391 14,983,191 2,094 60,258 120,719 121,877 124,486 125,611 2,744,544 8,857,128 12,356,679 12,360,282 13,780,116 1,258,214 1,262,935 1,265,538 2,058,494 2,563,882 2,567,101 2,570,034 2,571,508 2,579,348 2,586,164 2,591,779 2,595,463 2,596,557 2,597,860 2,710,391 3,946,043 3,950,340 3,955,131 3,958,253 3,962,690 3,967,076 4,124,638 4,132,560 4,137,201 4,138,758 4,140,539 4,142,226 4,143,825 4,145,897 4,149,451 4,151,495 4,153,281 4,155,742 4,159,983 4,163,466 4,165,471 16,287,201 16,288,673  pseudogene  91  Y7A9C.3 Y7A9C.5 Y7A9C.1 Y7A9C.7 Y7A9C.6 Y7A9C.2 Y7A9C.4 K03D3.10d K03D3.8 K03D3.6 K03D3.5 K03D3.4 K03D3.11 K03D3.3 K03D3.2 K03D3.1 C35D6.10 C35D6.9 C35D6.1 C35D6.2 C35D6.8 VY10G11R.1 Y50D4B.6 F59A7.10 F59A7.8 F59A7.4 Y40B10A.3 C45H4.9 C29G2.2 C29G2.1 C31B8.6 K12D9.6 Y73C8C.3 T28A11.21 T28A11.2 T28A11.1 F35F10.5 C17B7.8 C17B7.7 C17B7.9 C17B7.5 C17B7.4 C17B7.10 C17B7.3 C17B7.11 C17B7.2 C17B7.1 C17B7.12 C04E12.6 C04E12.7 C04E12.5 C04E12.4 C04E12.2 C04E12.8 C04E12.9 C04E12.10  srz-73 srz-40 srz-72 srz-41 srh-225 srz-39 rac-2|rab  cfam23 srz-61 srz-like srz-21 srz-74 srz-71 srz-38 srh-228 srh-227 
srh-224 pkin-st snfh hil-6 srab-23 srbc-23  str-46 srw-125 cfam19 fbxa-64 duf19 str-64 duf19 nepp duf19 duf750 duf19 nepp duf19 fbxa-65 duf19 str-63 duf19 duf976 scram duf750 duf750 duf19 srx-121 srbc-2 duf750  IV IV IV IV IV IV IV IV IV IV IV IV IV IV IV IV IV IV IV IV IV IV V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V  16,290,158 16,292,490 16,296,141 16,299,673 16,301,128 16,304,885 16,306,537 16,311,721 16,314,466 16,317,566 16,320,284 16,322,142 16,324,469 16,327,404 16,328,336 16,330,220 16,335,623 16,335,881 16,340,481 16,343,306 16,343,495 16,469,991 1,085,623 2,007,363 2,009,954 2,030,894 2,034,985 2,154,171 2,590,434 2,591,294 2,899,029 2,997,332 3,116,543 3,277,435 3,281,339 3,283,753 3,285,440 3,322,256 3,327,764 3,331,340 3,336,142 3,339,516 3,341,306 3,343,161 3,345,781 3,349,595 3,351,996 3,353,418 3,354,917 3,356,385 3,359,694 3,365,365 3,369,327 3,371,664 3,375,638 3,379,587  pseudogene pseudogene  pseudogene pseudogene pseudogene  pseudogene  pseudogene  pseudogene  pseudogene  92  C04E12.1 C04E12.11 C04E12.12 T20D4.13 T20D4.15 T20D4.16 T20D4.17 T20D4.12 T20D4.11 T20D4.10 T20D4.9 T20D4.8 T20D4.7 T20D4.6 T20D4.5 T20D4.4 T20D4.3 T20D4.18 T20D4.2 T20D4.1 T20D4.19 F47D2.6 Y60C6A.1 T15B7.10 C03G6.1 Y97E10B.6 C54D10.7 ZK1037.4 T23F1.3 K08G2.9 K08G2.10 K08G2.8 C06C6.7 F44G3.8 F11A5.7 F11A5.8 F57E7.1 F57E7.2 T03E6.2 F14F8.6 F14F8.7 F09C6.9 F09C6.10 Y102A5C.1 Y102A5C.2 Y102A5C.3 Y68A4A.2 T19C9.5 T19C9.6 T19C9.8 Y68A4B.3 Y68A4B.2 Y68A4B.1 Y61B8A.3 Y61B8A.1 Y61B8A.2  srbc-4  duf19 duf19 duf19 duf19 duf19 duf19 nepp nepp thiordx duf750 duf750 duf750 srab-21 srab-22 srab-20 duf19 srt-38  srx-57 srx-11 nhr-246 str-72 srh-299 srh-294 srh-293 fbxa-87 recepL acylt cfam23 str-126 srw-44 srw-36 nhr-116 fbxa fbxa srz-47 scp-like CUB2 CUB2 clec clec srh-116 srh-115  V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V  3,384,361 3,387,183 3,389,956 3,392,602 3,393,567 3,394,365 
3,395,578 3,396,799 3,398,867 3,401,132 3,403,879 3,406,627 3,408,974 3,411,448 3,414,406 3,416,686 3,420,324 3,423,521 3,426,881 3,429,530 3,430,884 4,270,236 4,784,696 6,813,742 7,379,669 7,917,665 12,443,219 15,320,952 15,459,341 15,866,484 15,868,034 15,869,320 15,996,490 16,132,156 16,208,586 16,211,006 16,484,385 16,485,590 16,584,688 16,684,051 16,686,614 16,916,337 16,918,409 16,920,452 16,921,743 16,923,131 17,194,164 17,230,147 17,232,325 17,242,116 17,244,463 17,247,214 17,250,421 17,253,804 17,256,607 17,259,022  pseudogene  pseudogene pseudogene  pseudogene  pseudogene  93  K10G4.2 K10G4.3 K10G4.6 K10G4.7 K10G4.1 K10G4.4 K10G4.9 K10G4.8 K10G4.5 Y61B8B.1 Y61B8B.2 F31E9.6 F31E9.7 F31E9.5 F31E9.1 F31E9.3 F31E9.4 F31E9.2 F47H4.5 F47H4.4 F47H4.6 F47H4.7 F47H4.8 F47H4.9 F47H4.10 F47H4.1 F47H4.11 F47H4.2 T27C5.7 T27C5.12 T27C5.8 T27C5.14 T27C5.10 F20E11.15 F20E11.16 F20E11.2 F20E11.11 F20E11.12 F20E11.13 F20E11.3 F20E11.8 F20E11.9 F20E11.14 F20E11.10 F20E11.4 F20E11.1 F20E11.5 F20E11.7 F20E11.6 F08E10.1 F08E10.8 F08E10.3 F08E10.2 F08E10.4 F08E10.5 F08E10.6  srw-47 srh-117 srh-262  srw-30 srw-46 fbx sri-70  srz-26 srz-58 fbxa fbx fbxb srg-44 fbxa fbxa fbxa fbxa fbxa skr-5 fbxa-134 fbx clec duf595 srh-96 srw srbc-27 srbc-28 srsx-2 srh-175 srh-154 srh-158 srh-160 srh-157 srh-156 srh-161 srh-203 str-200 srz-48  srw-72 srh-235 srh-114 srh-123 srbc-61 srh-110 srh-253 srh-111  V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V  17,263,954 17,267,087 17,274,886 17,279,657 17,282,723 17,286,351 17,290,602 17,295,549 17,300,703 17,306,323 17,308,868 17,316,992 17,319,612 17,321,408 17,323,597 17,327,083 17,329,822 17,332,281 17,335,906 17,336,170 17,339,956 17,343,841 17,346,496 17,349,046 17,350,819 17,352,023 17,354,051 17,359,324 17,416,725 17,418,588 17,420,351 17,422,304 17,432,342 17,434,429 17,436,678 17,439,625 17,442,319 17,443,742 17,446,702 17,449,569 17,451,538 17,454,047 17,455,818 17,457,973 
17,459,971 17,461,924 17,464,200 17,468,180 17,470,398 17,473,520 17,475,475 17,477,311 17,480,025 17,481,436 17,483,929 17,487,188  pseudogene pseudogene  pseudogene  pseudogene  pseudogene  pseudogene  pseudogene pseudogene pseudogene pseudogene pseudogene pseudogene pseudogene  pseudogene  pseudogene pseudogene  94  F08E10.7 K03D7.9 K03D7.8 K03D7.7 K03D7.6 K03D7.5 K03D7.4 C18D4.3 Y6G8.1 Y6G8.2 F57G4.5 F57G4.6 F57G4.9 F57G4.7 F57G4.8 F59A1.5 F59A1.9 F59A1.8 F59A1.12 Y94A7B.5 Y94A7B.6 Y94A7B.8 Y94A7B.9 Y94A7B.7 F16H6.3 F16H6.4 F16H6.5 F16H6.6 F16H6.7 F16H6.8 F16H6.9 F16H6.10 Y37H2B.1 R10E8.7 Y51A2A.2 Y51A2A.3 Y51A2A.4 Y51A2A.11 Y51A2A.5 C08E8.3 Y69H2.10b F11D11.1 T26H2.2 T26H2.1 F21D9.6 C43D7.7 C43D7.5 C43D7.4 C25F9.6 C25F9.5 C25F9.4 C25F9.9 C25F9.2 C25F9.1 C25F9.t5 M04C3.1a  scp-like  fbxa-102 srh-118 srh-261 srz-45 fbx  tc-related fbxa fbxa fbxa-129 srh-298 srh-300 srh-301 srh-304 srh-303 duf18 duf976 duf976 duf976  clec clec clec  clec fbxb-115 fbxb-1  sdz-6  snfh snfh snfh srw-85 snfh  V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V  17,489,394 17,490,666 17,493,132 17,495,320 17,497,815 17,499,492 17,500,773 17,535,146 17,595,891 17,599,921 17,644,968 17,646,034 17,646,969 17,649,407 17,652,300 17,654,479 17,656,958 17,658,506 17,668,843 17,822,511 17,825,573 17,828,918 17,831,873 17,835,110 18,199,159 18,202,398 18,205,191 18,208,395 18,210,452 18,216,921 18,222,266 18,226,306 18,230,180 18,234,791 18,290,424 18,292,662 18,293,715 18,297,685 18,302,334 18,354,009 18,679,902 18,771,813 19,241,708 19,243,562 19,246,505 19,321,820 19,323,487 19,324,767 19,405,779 19,411,733 19,415,190 19,420,199 19,426,665 19,433,165 19,433,900 19,443,365  pseudogene  pseudogene pseudogene  pseudogene pseudogene  pseudogene  95  M04C3.3 M04C3.2 Y43F8B.14 Y43F8B.13 Y116F11B.7 Y116F11B.11 Y113G7A.12 Y113G7A.13 F19B2.7 F19B2.6 F19B2.5 F19B2.4 F19B2.3 F19B2.2 F19B2.10 F19B2.1 F19B2.8 Y113G7B.1 Y113G7B.3 Y113G7B.12 
Y113G7B.11 Y113G7B.t2 Y113G7B.26 Y113G7B.14 Y113G7B.15 F26F2.1 F26F2.2 F26F2.3 F26F2.4 F26F2.5 H02F09.4 Y75D11A.1 Y75D11A.4 Y75D11A.5 ZC53.4 Y59E1A.1 Y48D7A.1  snfh snfh snfh  snfh srz-36 srw-39 srz-34 srz-33 srz-16 fbxa-116 fbxa-115  snfh cathA  revt revt  fbxa-40  V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V X X X X X X X  19,447,117 19,450,053 19,456,220 19,462,461 19,848,780 19,877,121 20,143,284 20,146,441 20,150,161 20,158,010 20,161,086 20,166,293 20,169,056 20,171,407 20,174,905 20,176,869 20,183,258 20,186,293 20,189,348 20,208,787 20,212,649 20,212,902 20,223,154 20,225,829 20,229,351 20,557,803 20,561,715 20,564,027 20,567,907 20,571,910 1,564,033 1,757,098 1,763,420 1,770,272 1,931,714 1,940,067 3,009,828  pseudogene pseudogene pseudogene pseudogene  pseudogene  96  Appendix 2. Genes that are completely deleted from the Madeiran strain (JU258) genome. The position listed is the middle of the deleted gene. Genome Name Y74C10AR.3 Y74C10AR.2 F27C1.6 ZK39.3 Y53H1A.3 Y53H1C.3 W04G5.5 T02G6.7 Y47H9C.14 Y47H9C.9 Y47H9C.10 Y47H10A.1 F15H9.1 T09E11.3 T15D6.1 K07E8.3 Y51H7BR.2 Y51H7BR.1 K05F6.5 K05F6.4 K05F6.6 K05F6.3 K05F6.7 K05F6.2 K05F6.8 K05F6.9 K05F6.1 K05F6.10 T07H3.3 T07H3.4 T07H3.5 T07H3.2 T07H3.6 T07H3.1 C08E3.3 C08E3.4 C08E3.5 C08E3.7 C08E3.8 C08E3.9 C08E3.10 C08E3.11 C08E3.12 ZC204.8 ZC204.9 ZC204.10 ZC204.7  GeneticName / Family abc-tr  clec clec duf750 glec  fbxa clp-3 duf316 duf595 duf595 sdz-24 fbxb-43 fbxb-42 fbxb-44 fbxb fbxb-52 fbxb-51 fbxb-54 fbxb-50 fbx fbxb-46 fbxb-49 btb math-38 clec-21 clec-20 bath-46 bath-26 bath-47 bath-33 fbxa fbxa fbxa fbxa fbxa fbxa fbxa fbxa fbxb fbxb-20 fbxb-16 fbxb-15  Chromosome I I I I I I I I I I I I I I I II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II  Position 2,478,901 2,483,169 5,424,977 11,155,497 11,242,863 11,427,683 11,643,850 11,836,206 11,906,180 11,908,246 11,910,701 12,085,997 12,166,054 12,378,020 12,380,121 659,188 1,536,773 1,538,130 
1,540,811 1,545,252 1,548,512 1,552,095 1,555,788 1,559,902 1,567,541 1,570,611 1,570,990 1,573,768 1,578,871 1,582,514 1,594,364 1,598,756 1,600,143 1,603,214 1,605,409 1,606,917 1,608,608 1,613,422 1,615,632 1,617,948 1,620,582 1,622,249 1,624,294 1,647,165 1,649,688 1,651,203 1,652,998  Note  97  ZC204.11 ZC204.3 ZC204.12 ZC204.13 F58E1.8 F58E1.9 F58E1.10 F58E1.11 F58E1.12 F58E1.13 F36H5.9 F36H5.11 F36H5.10 F36H5.5 F36H5.4 F36H5.3 F36H5.2b F36H5.1 C08F1.4a C08F1.5 C08F1.10 C08F1.6 C08F1.3 C08F1.2 C08F1.1 C08F1.7 C08F1.8 C08F1.9 T08E11.6 T08E11.7 T08E11.5 T08E11.4 T08E11.3 T08E11.2 T08E11.8 T08E11.1 C52E2.5 C52E2.4 C52E2.6 C52E2.7 C52E2.3 C52E2.2 C52E2.1 C52E2.8 C16C4.7 C16C4.6 C16C4.5 C16C4.4 C16C4.15 C16C4.16 C16C4.3 C16C4.8 C16C4.9 C16C4.10 C16C4.11 C16C4.12  btb btb btb fbxb-18 fbxb-19 fbxc fbxc fbxc btb|fbxa fbxb fbxb-12 cfam10 cfam10 math-28 math-27 math-26 math-3 math-4 cfam10 cfam10 fbxb-13 str-21 math-2 str-22 cfam10 revt fbxb-10 fbxa-3 fbxc math-41 math-40 math-39 fbx fbx fbx fbx fbxb-97 fbxb-96 duf130 clec fbxb-95  fbxb-98 math-15 math-14 math-10 math-11 math-13 math-16 math-17 math-5 math-6 math-7  II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II  1,654,288 1,655,686 1,656,831 1,660,734 1,686,937 1,688,707 1,690,444 1,692,234 1,693,959 1,695,061 1,752,923 1,754,848 1,757,483 1,759,501 1,762,364 1,764,854 1,769,883 1,773,253 1,777,284 1,780,607 1,785,801 1,787,697 1,789,688 1,792,475 1,800,515 1,802,708 1,806,255 1,809,929 1,817,844 1,819,328 1,822,276 1,827,335 1,831,237 1,832,898 1,834,708 1,838,364 1,841,959 1,844,889 1,847,375 1,849,922 1,851,866 1,852,943 1,854,863 1,856,710 1,859,901 1,863,210 1,866,964 1,870,059 1,872,349 1,874,693 1,876,411 1,878,336 1,880,206 1,882,210 1,884,168 1,885,961  pseudogene pseudogene  98  C16C4.13 C16C4.14 C16C4.2 C16C4.1 C46F9.4 C46F9.3 C46F9.2 C46F9.1 F52C6.5 F52C6.6 F52C6.7 F52C6.8 F52C6.9 
F52C6.10 F52C6.11 F52C6.4 F52C6.3 F52C6.2 F52C6.1 F52C6.12 F52C6.13 F52C6.14 C40D2.4 F59H6.5 F59H6.6 F59H6.4 F59H6.3 F59H6.2 F59H6.7 F59H6.8 F59H6.9 F59H6.10 F59H6.11 F59H6.12 F59H6.1 B0047.1 B0047.2 B0047.3 B0047.4 B0047.5 F07E5.1 F07E5.7 F07E5.8 F07E5.10 F07E5.9 T16A1.4 T16A1.5 T16A1.3 T16A1.7 T16A1.8 T16A1.9 T16A1.2 T16A1.1 R52.3 R52.4 R52.5  math-8 math-9 math-12 math-25 math-24 math-23 math-22 math-30 math-31 bath-11 bath-4 bath-6 bath-7 bath-2 ubq ubq ubq bath-22 ubql-e2  homeodomain pif-helicase math-32  cya-2 bath-21 bath-1 bath-3 bath-5 btb bath-19 bath-20 btb bath-24 math-1 bath-14 fbxb-6 duf130 duf-wsn  clec fbxc pqn-66 fbxb-37 duf-wsn math-42 math-35 duf130 duf130  II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II  1,888,192 1,890,348 1,891,823 1,894,512 1,896,912 1,901,084 1,903,366 1,905,674 1,908,950 1,912,655 1,914,743 1,917,377 1,918,881 1,920,562 1,922,183 1,923,564 1,925,115 1,926,664 1,928,308 1,929,757 1,930,470 1,932,827 2,000,366 2,005,815 2,011,981 2,012,019 2,018,442 2,021,049 2,023,091 2,026,127 2,028,248 2,029,999 2,031,390 2,033,198 2,037,553 2,041,105 2,043,359 2,045,145 2,046,356 2,048,442 2,066,718 2,071,062 2,074,097 2,077,463 2,077,850 2,081,788 2,083,712 2,084,511 2,088,014 2,092,797 2,094,932 2,096,989 2,101,501 2,105,696 2,107,797 2,109,416  pseudogene  99  R52.6 R52.7 R52.8 R52.9 R52.10 R52.2 R52.1 C40A11.5 C40A11.10 C40A11.4 K09F6.6 K09F6.9 K09F6.10 K09F6.7 K09F6.8 B0281.4 B0281.5 B0281.6 B0281.3 B0281.2 B0281.7 B0281.8 B0281.1 ZK1240.4 ZK1240.5 ZK1240.9 ZK1240.3 ZK1240.6 ZK1240.2 ZK1240.8 ZK1240.1 F43C11.8 F43C11.7 F43C11.9 F43C11.6 F43C11.5 F43C11.4 F43C11.10 F43C11.3 F43C11.2 F43C11.1 F43C11.11 F43C11.12 F16G10.5 F16G10.4 F16G10.3 F16G10.2 F16G10.6 F16G10.7 F16G10.8 F16G10.9 F16G10.10 F16G10.11 F16G10.13 F29A7.3 F08D12.8  duf130 srh-195 math-36 math-37 math-btb math-btb pphos-y btb  ubql-e3 fbxc btb btb btb 
ubql-e3 revt ubql-e3  ubql-e3 ubql-e3 ubql-e3 ubql-e3 ubql-e3 ubql-e3 ubql-e3 ubql-e3 ubql-e3  duf130 duf130  duf130 duf130 duf130 duf130 duf130 duf130 duf130 duf130 duf130 duf130 duf130 duf130 duf130 duf130 fbxb-105  II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II  2,112,662 2,115,764 2,118,724 2,121,461 2,126,732 2,127,158 2,131,674 2,134,686 2,137,036 2,138,139 2,277,033 2,281,786 2,288,421 2,291,523 2,295,399 2,298,227 2,300,381 2,301,816 2,307,558 2,308,984 2,310,574 2,312,170 2,313,998 2,315,354 2,317,122 2,318,984 2,321,053 2,322,662 2,324,267 2,328,025 2,330,624 2,333,396 2,336,796 2,339,351 2,343,009 2,345,574 2,348,524 2,350,975 2,353,839 2,355,261 2,357,584 2,360,120 2,365,867 2,367,873 2,369,674 2,373,287 2,375,633 2,377,520 2,379,928 2,382,693 2,385,469 2,390,324 2,392,784 2,395,441 2,753,579 2,770,341  pseudogene  100  Y110A2AL.4a Y110A2AL.6 Y110A2AL.7 Y110A2AL.1 T11F1.3 ZC239.8 ZC239.9 ZC239.10 ZC239.19 ZC239.12 ZC239.6 ZC239.5 ZC239.4 ZC239.3 ZC239.2 ZC239.14 ZC239.13 ZC239.15 ZC239.16 ZC239.17 ZC239.1 C17F4.4 C17F4.8 C17F4.10 C17F4.9 C17F4.3 C17F4.7 C17F4.5 C17F4.2 F39E9.7 F39E9.5 F39E9.6 F39E9.1 Y46D2A.2 Y46D2A.1 Y46D2A.3 F14D2.6 F14D2.7 F14D2.5 F14D2.11 F14D2.15 F14D2.13 F14D2.12 F14D2.14 F14D2.8 F14D2.10 F14D2.9 F14D2.4a F14D2.2 F14D2.1 Y49F6C.6 Y49F6C.7 Y49F6C.8 Y49F6C.5 F19B10.9 F19B10.2  sri-51 sri-48 sri-53 sri-50 btb btb btb btb btb btb btb btb btb btb btb srh-297 srz-67 srz-68  fbxc  nepp duf274 duf274 duf274 cfam15 recepL  ubq btb bath-28 bath-30 fbxo tc-related bath-29 duf278 bath-27  bath-23 tbx-18  II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II II  2,839,841 2,841,340 2,842,897 2,875,860 2,955,703 3,198,351 3,200,257 3,201,907 3,204,624 3,208,305 3,210,123 3,212,757 3,214,931 3,216,336 3,218,444 
3,221,697 3,222,750 3,224,089 3,225,023 3,225,676 3,227,772 3,229,699 3,232,840 3,236,542 3,239,101 3,241,105 3,245,573 3,247,242 3,249,006 3,306,570 3,308,326 3,311,104 3,317,702 3,322,296 3,325,425 3,326,771 3,331,956 3,337,169 3,340,797 3,341,568 3,342,960 3,344,466 3,345,825 3,347,163 3,348,452 3,352,058 3,354,397 3,354,580 3,359,468 3,360,783 3,364,649 3,367,723 3,369,682 3,371,723 3,672,798 3,676,144  pseudogene  pseudogene  101  F19B10.10 F19B10.1 F19B10.12 F40H7.4 F40H7.3 T24E12.5 T24E12.4 F54D10.7 F53C3.7 Y14H12A.1 W03C9.5 Y17G7B.11 F49C5.6 Y46G5A.7 Y46G5A.8 E01G4.5 T12B5.11 T12B5.4 T12B5.3 T12B5.2 T12B5.12 T12B5.1 R06B10.1 Y119D3A.3 Y119D3A.2 Y119D3A.1 Y82E9BL.13 Y82E9BL.12 Y82E9BL.16 Y82E9BL.14 Y82E9BL.15 Y82E9BL.5 Y82E9BL.4 Y82E9BL.3 Y82E9BL.2 Y82E9BL.1 Y82E9BR.8 Y82E9BR.9 Y82E9BR.7 Y82E9BR.20 Y82E9BR.10 Y82E9BR.11 Y82E9BR.12 Y82E9BR.6 Y82E9BR.5 Y82E9BR.13 Y82E9BR.4 Y82E9BR.21 Y82E9BR.22 Y82E9BR.14b B0524.4 H04J21.1 Y75B8A.31 Y75B8A.32 Y75B8A.34 Y75B8A.33  cfam19 srx-99 srx-101 srx-100 srx-111 ubq  str-223 fbxb fbxb fbxa-67 fbxa-11 fbxa-10 fbxa-54 fbxa-70 fbxa-51 fbxa-35 fbxa-28 fbxa-75 fbxa-79 duf13 fbxa-20 fbxa-80 fbxa-19 cfam10 fbxa-25 cfam10 cfam10  fbxa-138  duf-wsn  SR-unclass  II II II II II II II II II II II II II II II II III III III III III III III III III III III III III III III III III III III III III III III III III III III III III III III III III III III III III III III III  3,679,013 3,683,812 3,686,537 3,690,101 3,692,502 3,755,628 3,758,520 3,820,745 3,895,533 3,993,551 11,967,252 12,061,375 12,564,965 12,755,280 12,756,890 13,473,126 949,918 951,456 953,161 957,063 961,133 962,902 969,327 1,296,367 1,298,684 1,304,578 1,307,354 1,310,189 1,312,165 1,314,267 1,315,955 1,346,712 1,348,584 1,350,284 1,352,507 1,355,199 1,358,098 1,359,502 1,361,582 1,374,462 1,375,408 1,377,211 1,381,351 1,381,526 1,387,640 1,391,180 1,401,338 1,405,468 1,406,649 1,411,187 1,884,902 2,369,758 12,353,130 12,356,679 12,360,282 12,364,522  pseudogene 
pseudogene  102  Y49E10.6 F38A1.11 Y69A2AR.24 Y69A2AR.25 Y69A2AR.13 Y69A2AR.12 Y69A2AR.26 Y69A2AR.11 Y69A2AR.10 Y69A2AR.9 Y69A2AR.8 Y69A2AR.27 Y94H6A.10 Y46C8AL.4 F49F1.7 F49F1.8 F49F1.9 F49F1.10 F49F1.11 F49F1.12 F49F1.13 R07C12.3 R07C12.2 R07C12.4 R07C12.1 K08D10.10 K08D10.9 K08D10.8 K08D10.7 F19C7.6 F19C7.5 F19C7.3 F55B11.5 T27E7.9 Y105C5B.13 Y105C5B.27 K03D3.8 K03D3.6 Y116A8C.1 R13D11.5 K10C9.4 K10C9.8 F59A7.8 Y19D10B.6 Y19D10B.7 F15E11.14 F15E11.15a F15E11.12 F54E2.1 F54E2.5 F54E2.6 H27D07.2 H27D07.3 H27D07.4 H27D07.6 H27D07.1  his-72  nhr-242 revt  clec-71 duf18 glec glec glec  clec clec clec clec clec clec scram scram  fbxb srz-like skr-10  srab-15 str-224 snfh nepp cfam7 cfam7 cfam7 cfam7 duf274 srt-34 srw-141 srw-143 srw-137 srh-87 srw-128  III IV IV IV IV IV IV IV IV IV IV IV IV IV IV IV IV IV IV IV IV IV IV IV IV IV IV IV IV IV IV IV IV IV IV IV IV IV IV V V V V V V V V V V V V V V V V V  12,367,798 1,248,206 2,563,882 2,567,101 2,570,034 2,571,508 2,579,348 2,586,164 2,591,779 2,595,463 2,596,557 2,597,860 2,710,391 3,938,688 4,124,638 4,132,560 4,137,201 4,138,758 4,140,539 4,142,226 4,143,825 4,145,897 4,149,451 4,151,495 4,153,281 4,155,742 4,159,983 4,163,466 4,165,471 4,595,282 4,596,309 4,598,308 14,428,030 14,548,126 15,947,058 16,162,201 16,314,466 16,317,566 16,901,344 787,858 1,051,400 1,057,598 2,009,954 2,319,701 2,322,117 2,323,097 2,325,569 2,326,855 2,811,049 2,814,059 2,817,377 2,937,781 2,940,381 2,943,092 2,944,858 2,947,442  pseudogene  pseudogene  103  H27D07.5 H05B21.1 H05B21.2 C50H11.5 C50H11.14 C50H11.4 C50H11.3 C50H11.2 C50H11.1 T28A11.9 T28A11.13 T28A11.8 T28A11.15 T28A11.16 T28A11.7 T28A11.6 T28A11.17 T28A11.18 T28A11.19 T28A11.5 T28A11.20 T28A11.4 T28A11.3 T28A11.21 T28A11.2 T28A11.1 F35F10.5 F35F10.7 F35F10.6 F35F10.4 F35F10.8 F35F10.9 F35F10.10 F35F10.2 F35F10.11 F35F10.12 F35F10.1 F35F10.13 F35F10.14 C17B7.8 C17B7.7 C17B7.9 C17B7.5 C17B7.4 C17B7.10 C17B7.3 C17B7.11 C17B7.2 C17B7.1 C17B7.12 C04E12.6 C04E12.7 C04E12.5 
C04E12.4 C04E12.2 C04E12.8  srw-122 srw-126 srh-248 srt-9 srt-5 srt-7 srt-71 srt-8 srj-8 thiordx str-58 srt-63 duf19 srbc-5 duf750 nepp duf19 duf19 duf19 nepp duf19 fbxa-64 duf19 str-64 duf19 duf976 duf976 duf750 srx-122 srbc-1 duf750 srbc-3  duf750 duf19 nepp duf1258 duf19 duf750 duf19 nepp duf19 fbxa-65 duf19 str-63 duf19 duf976 scram duf750 duf750 duf19 srx-121  V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V  2,951,284 2,953,568 2,954,750 3,078,513 3,079,998 3,082,226 3,084,284 3,087,100 3,089,994 3,248,305 3,249,801 3,253,857 3,255,291 3,258,695 3,261,056 3,262,666 3,264,767 3,267,381 3,268,984 3,270,724 3,271,969 3,274,041 3,275,555 3,277,435 3,281,339 3,283,753 3,285,440 3,286,742 3,288,422 3,292,580 3,296,812 3,300,268 3,304,715 3,308,905 3,311,016 3,314,180 3,317,032 3,318,461 3,319,409 3,322,256 3,327,764 3,331,340 3,336,142 3,339,516 3,341,306 3,343,161 3,345,781 3,349,595 3,351,996 3,353,418 3,354,917 3,356,385 3,359,694 3,365,365 3,369,327 3,371,664  pseudogene  pseudogene  104  C04E12.9 C04E12.10 C04E12.1 C04E12.11 C04E12.12 T20D4.13 T20D4.15 T20D4.16 T20D4.17 T20D4.12 T20D4.11 T20D4.10 T20D4.9 T20D4.8 T20D4.7 T20D4.6 T20D4.5 T20D4.4 T20D4.3 T20D4.18 T20D4.2 T20D4.1 T20D4.19 T20C4.1 C07G3.5 C07G3.4 C07G3.3 F59B1.5 C17E7.1 T20C7.2 T20C7.1 K07C6.11 K07C6.10 K07C6.9 K07C6.8 K07C6.13 K07C6.7 K07C6.6 K07C6.15 K07C6.5 K07C6.4 K07C6.3 K07C6.2 K07C6.1 T09H2.1 B0213.9 B0213.10 K09D9.9 K09D9.10 K09D9.11 K09D9.12 K09D9.8 C49G7.1 T15B7.10 C51E3.2 F38B7.3  srbc-2 duf750 srbc-4  duf750 duf19 duf19 duf19 duf19 duf19 duf19 nepp nepp thiordx duf750 duf750 duf750 srab-21 srab-22 srab-20 duf19 srj-9 str-228 str-226 str-227 srx-94 nhr-156 nhr-284 srx-61 srx-68 srx-67 srx-66 srx-65 srx-69 srx-64 srx-63 srx-70 cyp-35A5 cyp-35B1 cyp-35B2 cyp-35B3 srz-89 cyp-34A4 str-247 cyp-34A5 srx-62  srh-8 brca-like srsx-27  V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V 
V V V V V V  3,375,638 3,379,587 3,384,361 3,387,183 3,389,956 3,392,602 3,393,567 3,394,365 3,395,578 3,396,799 3,398,867 3,401,132 3,403,879 3,406,627 3,408,974 3,411,448 3,414,406 3,416,686 3,420,324 3,423,521 3,426,881 3,429,530 3,430,884 3,432,785 3,510,885 3,513,338 3,518,821 3,622,856 3,906,357 3,907,900 3,909,835 3,913,463 3,917,456 3,920,015 3,922,263 3,923,630 3,929,914 3,932,323 3,934,252 3,937,224 3,939,969 3,943,149 3,946,012 3,947,972 3,950,828 3,954,529 3,956,537 3,993,210 3,995,172 3,999,108 4,003,568 4,006,355 4,057,429 6,813,742 10,154,026 11,552,951  pseudogene  pseudogene  105  Y75B12A.2 ZK1037.4 Y6E2A.8 Y6E2A.9a T23D5.2 T23D5.3 T23D5.1 T23D5.5 T23D5.6 T23D5.7 T23D5.8 T23D5.10 T23D5.9 T23D5.12 T23D5.11 F57A10.1 F44G3.8 F11A5.8 F49H6.9 F49H6.11 F49H6.10 T19C9.1 T19C9.4 T19C9.3 T19C9.2 T19C9.5 T19C9.6 T19C9.8 Y68A4B.3 Y68A4B.2 Y68A4B.1 Y61B8A.3 Y61B8A.1 Y61B8A.2 K10G4.2 K10G4.3 K10G4.6 K10G4.7 K10G4.1 K10G4.4 K10G4.9 K10G4.8 K10G4.5 Y61B8B.1 Y61B8B.2 F31E9.6 F31E9.7 F31E9.5 F31E9.1 F31E9.3 F31E9.4 F31E9.2 F47H4.5 F47H4.4 F47H4.6 F47H4.7  nhr-246 tricarbox-carrier str-38 duf130 str-27 str-17 str-18 str-19 str-6 str-43 str-5 str-8 str-9 fbxa-87 acylt srz-92 srz-94 srz-93 srbc-62 srh-109 srh-252 srh-112 scp-like CUB2 CUB2 clec clec srh-116 srh-115 srw-47 srh-117 srh-262  srw-30 srw-46 fbx sri-70  srz-26 srz-58 fbxa fbx fbxb srg-44 fbxa fbxa fbxa  V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V  15,112,547 15,320,952 15,727,763 15,730,183 15,733,312 15,735,850 15,737,506 15,741,816 15,743,823 15,746,443 15,750,523 15,752,198 15,754,125 15,756,421 15,758,297 15,762,180 16,132,156 16,211,006 17,034,333 17,037,162 17,039,714 17,221,074 17,222,426 17,225,473 17,228,230 17,230,147 17,232,325 17,242,116 17,244,463 17,247,214 17,250,421 17,253,804 17,256,607 17,259,022 17,263,954 17,267,087 17,274,886 17,279,657 17,282,723 17,286,351 17,290,602 17,295,549 17,300,703 17,306,323 17,308,868 
17,316,992 17,319,612 17,321,408 17,323,597 17,327,083 17,329,822 17,332,281 17,335,906 17,336,170 17,339,956 17,343,841  pseudogene  pseudogene pseudogene  pseudogene  pseudogene pseudogene  pseudogene  pseudogene  pseudogene  106  F47H4.8 F47H4.9 F47H4.10 F47H4.1 F47H4.11 F47H4.2 T27C5.14 T27C5.10 F20E11.15 F20E11.16 F20E11.2 F20E11.11 F20E11.12 F20E11.13 F20E11.3 F20E11.8 F20E11.9 F20E11.14 F20E11.10 F20E11.4 F20E11.1 F20E11.5 F20E11.7 F20E11.6 F08E10.1 F08E10.8 F08E10.3 F08E10.2 F08E10.4 F08E10.5 F08E10.6 F08E10.7 K03D7.9 K03D7.8 K03D7.7 K03D7.6 K03D7.5 K03D7.4 C38D9.2 C38D9.3 C38D9.4 C38D9.5 F57G4.5 F57G4.6 F57G4.9 F57G4.7 F57G4.8 F59A1.5 F59A1.9 F59A1.8 F59A1.7 Y94A7B.1 Y94A7B.3 Y94A7B.4 Y94A7B.5 C31G12.2  fbxa fbxa skr-5 fbxa-134 fbx srh-96 srw srbc-27 srbc-28 srsx-2 srh-175 srh-154 srh-158 srh-160 srh-157 srh-156 srh-161 srh-203 str-200 srz-48  srw-72 srh-235 srh-114 srh-123 srbc-61 srh-110 srh-253 srh-111 scp-like  fbxa-102 srh-118 srh-261  fbxa-133 duf-wsn  tc-related fbxa fbxa fbxa-129 fbxa-108 srh-292 srh-291 srh-296 srh-298 clec  V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V  17,346,496 17,349,046 17,350,819 17,352,023 17,354,051 17,359,324 17,422,304 17,432,342 17,434,429 17,436,678 17,439,625 17,442,319 17,443,742 17,446,702 17,449,569 17,451,538 17,454,047 17,455,818 17,457,973 17,459,971 17,461,924 17,464,200 17,468,180 17,470,398 17,473,520 17,475,475 17,477,311 17,480,025 17,481,436 17,483,929 17,487,188 17,489,394 17,490,666 17,493,132 17,495,320 17,497,815 17,499,492 17,500,773 17,571,597 17,581,072 17,587,911 17,591,990 17,644,968 17,646,034 17,646,969 17,649,407 17,652,300 17,654,479 17,656,958 17,658,506 17,659,910 17,805,922 17,809,625 17,814,150 17,822,511 18,189,760  pseudogene  pseudogene pseudogene pseudogene pseudogene pseudogene pseudogene pseudogene  pseudogene  pseudogene pseudogene  pseudogene  pseudogene pseudogene  107  F16H6.2 F16H6.1 F16H6.3 F16H6.4 F16H6.5 
F16H6.6 F16H6.7 F16H6.8 F16H6.9 F16H6.10 Y37H2B.1 R10E8.7 R10E8.3 R10E8.8 R10E8.1 Y51A2A.1 B0462.1 F11D11.6 F11D11.5 F11D11.3 F11D11.4 F11D11.1 Y17D7B.6 Y17D7B.5 C54E10.2 Y17D7A.4 Y17D7A.3a T26H2.3 C25F9.8 C25F9.7 C25F9.6 C25F9.5 C25F9.4 C25F9.9 C25F9.2 C25F9.1 M04C3.1a M04C3.3 M04C3.2 Y43F8B.14 Y43F8B.13 Y43F8B.12 Y43F8B.11 Y43F8B.10 Y43F8B.9 Y43F8B.8 Y43F8B.7 Y43F8B.6 Y43F8B.5 W04E12.6 Y116F11B.7 Y113G7B.6 Y113G7B.8 Y113G7B.9 Y113G7B.12 Y113G7B.11  clec clec-42 duf18 duf976 duf976 duf976  clec, duf130 clec clec duf274 duf274 clec clec clec cyp-33D3 nhr-65 fbxb-2 srw-86 snfh snfh snfh srw-85 snfh snfh snfh snfh snfh duf19 duf19  scp-like clec-49 fbxa-113 fbxb-59 srbc-34 revt  V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V  18,192,878 18,195,355 18,199,159 18,202,398 18,205,191 18,208,395 18,210,452 18,216,921 18,222,266 18,226,306 18,230,180 18,234,791 18,238,098 18,242,101 18,245,743 18,287,304 18,329,487 18,754,286 18,756,538 18,761,700 18,765,026 18,771,813 18,775,431 18,778,043 18,831,641 18,837,335 18,844,876 19,239,268 19,398,190 19,401,837 19,405,779 19,411,733 19,415,190 19,420,199 19,426,665 19,433,165 19,443,365 19,447,117 19,450,053 19,456,220 19,462,461 19,471,884 19,474,156 19,478,277 19,482,756 19,486,339 19,493,568 19,495,973 19,497,797 19,748,097 19,848,780 20,201,960 20,203,371 20,205,286 20,208,787 20,212,649  pseudogene  pseudogene  108  Y113G7B.26 Y113G7B.14 Y113G7B.15 F48F5.1 F56C3.5 F47B7.5 F46F2.1  snfh cathA pphos-y duf130 cfam19  V V V V X X X  20,223,154 20,225,829 20,229,351 20,440,391 1,363,402 3,768,290 15,252,260  109  Appendix 3. The segmentation algorithm. The segmentation algorithm used throughout this thesis was designed and created in the C programming language by Stephane Flibotte. It is a highly efficient implementation of a bottomup approach. 
The program assumes that log2 ratios for all probes are drawn from a normal distribution and begins by considering each probe as an individual segment. The algorithm performs t-tests for all possible mergers of adjacent segments to estimate the probability that the log2 ratios were drawn from samples with the same mean. These P-values are stored in a heap, which prioritizes all possible mergers. The data structure is an ordered doubly linked list. The program makes the most likely merger of adjacent segments according to these P-values, calculates the new mean log2 ratio of the segment resulting from the merger, calculates P-values for subsequent mergers of the new segment with its neighbours, and updates the heap. This procedure is repeated until the P-value of the next most likely merger is less than a critical value supplied by the user (typically 0.05). The program then calculates a P-value for each remaining segment using a one-sample t-test. Segments are labeled as candidate “amplifications” or “deletions” if both their mean log2 ratios and their P-values meet or exceed cutoff values supplied by the user, as described earlier in Chapters 2, 3 and 4. Segments that do not meet both of these criteria are labeled as “normal”. If desired, the program can then merge all adjacent segments sharing the same label, recalculating the mean log2 ratios and P-values for the merged segments. The entire process takes just a few seconds for data sets of 380,000 probes like those used in this thesis.
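The merge loop described above can be illustrated with a minimal Python sketch. It is not the original C program, and several details are assumptions made only for illustration: the function names `merge_p_value` and `segment`, the dictionary-based segment records, and the use of a z-test with a user-supplied noise level `sigma` in place of the t-tests named in the text (a two-sample t-test is undefined for two single-probe segments, and the thesis does not say how the original program handles that case). Stale heap entries are skipped by comparing per-segment version counters, which is one common way to keep a heap consistent with a linked list and is likewise an assumption.

```python
import heapq
import math

def merge_p_value(a, b, sigma):
    """Two-sided P-value that segments a and b share one mean.
    Assumption: z-test with known noise sigma, standing in for the t-test."""
    z = (a["mean"] - b["mean"]) / (sigma * math.sqrt(1.0 / a["n"] + 1.0 / b["n"]))
    return math.erfc(abs(z) / math.sqrt(2.0))

def segment(log2_ratios, sigma, alpha=0.05):
    """Bottom-up segmentation: start with one segment per probe and keep
    merging the most similar adjacent pair until the best P-value < alpha."""
    # Segments form a doubly linked list via "prev"/"next" indices (-1 = none).
    segs = [{"start": i, "n": 1, "mean": x, "prev": i - 1, "next": i + 1,
             "alive": True, "version": 0} for i, x in enumerate(log2_ratios)]
    if segs:
        segs[-1]["next"] = -1
    heap = []

    def push(i):
        """Queue the candidate merger of segment i with its right neighbour."""
        j = segs[i]["next"]
        if j != -1:
            p = merge_p_value(segs[i], segs[j], sigma)
            # heapq is a min-heap, so store -p to pop the largest P-value first.
            heapq.heappush(heap, (-p, i, j, segs[i]["version"], segs[j]["version"]))

    for i in range(len(segs)):
        push(i)

    while heap:
        neg_p, i, j, vi, vj = heapq.heappop(heap)
        a, b = segs[i], segs[j]
        # Skip entries made obsolete by an earlier merger.
        if not (a["alive"] and b["alive"]) or a["version"] != vi or b["version"] != vj:
            continue
        if -neg_p < alpha:
            break  # the most likely remaining merger is already unlikely
        # Merge b into a and recompute the mean of the combined segment.
        total = a["n"] + b["n"]
        a["mean"] = (a["mean"] * a["n"] + b["mean"] * b["n"]) / total
        a["n"] = total
        a["version"] += 1
        a["next"] = b["next"]
        b["alive"] = False
        if a["next"] != -1:
            segs[a["next"]]["prev"] = i
        # Re-evaluate mergers of the new segment with both neighbours.
        if a["prev"] != -1:
            push(a["prev"])
        push(i)

    # Report (first probe index, last probe index, mean log2 ratio).
    return [(s["start"], s["start"] + s["n"] - 1, s["mean"]) for s in segs if s["alive"]]
```

For example, `segment([0.0, 0.1, -0.1, 0.05, -2.0, -2.1, -1.9, 0.0, 0.05, -0.05], sigma=0.15)` groups the ten hypothetical probes into three segments, with the middle one (probes 4 to 6, mean log2 ratio near -2) being the kind of segment that the labeling step would then test as a candidate deletion.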
