Computational studies of the genome dynamics of mammalian transposable elements and their relationships… Zhang, Ying 2012

COMPUTATIONAL STUDIES OF THE GENOME DYNAMICS OF MAMMALIAN TRANSPOSABLE ELEMENTS AND THEIR RELATIONSHIPS TO GENES

by

Ying Zhang

M.Sc., Katholieke Universiteit Leuven (BELGIUM), 2004
B.E., Harbin Institute of Technology (CHINA), 1993

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY in THE FACULTY OF GRADUATE STUDIES (Genetics)

THE UNIVERSITY OF BRITISH COLUMBIA (Vancouver)

May, 2012

© Ying Zhang, 2012

Abstract

Sequences derived from transposable elements (TEs) comprise approximately 40-50% of the genomic DNA of most mammalian species, including mouse and human. What impact they exert on their hosts, however, remains an intriguing question. Originally considered merely genomic parasites or “selfish DNA”, these mobile elements show their detrimental effects through a variety of mechanisms, from physical DNA disruption to epigenetic regulation. On the other hand, evidence has been mounting to suggest that TEs may sometimes also play important roles by participating in essential biological processes in the host cell. The dual roles of TE-host interactions make it critical for us to understand the relationship between TEs and the host, which may ultimately help us to better understand both normal cellular functions and disease.

This thesis encompasses my three genome-wide computational studies of TE-gene dynamics in mammals. In the first, I identified high levels of TE insertional polymorphisms among inbred mouse strains, and systematically analyzed their distributional features and biological effects, through mining tens of millions of mouse genomic DNA sequences. In the second, I examined the properties of TEs located in introns, and identified key factors, such as the distance to the intron-exon boundary, insertional orientation, and proximity to splice sites, that influence the probability that TEs will be retained in genes. In the third, a study specifically focused on genes with extremely high or low TE content in three mammalian species, I showed associations between TE density and the function/conservation of genes, as well as the relevance of chromatin state to TE accumulation in genes.

While most of my results clearly support the idea that today’s TE distribution pattern is an outcome of natural selection or genetic drift during evolution, the final part of my work, which compares TE density to chromatin state in embryonic stem cells, suggests that traces of the initial integration preference of TEs still exist. Taken together, these results demonstrate the effects of both initial TE integration and natural selection in shaping the landscape of today’s mammalian genomes and, most importantly, shed light on the roles of mobile elements in evolution.

Preface

A version of Chapter 2 has been published: Ying Zhang, Irina A. Maksakova, Liane Gagnier, Louie N. van de Lagemaat, Dixie L. Mager (2008). “Genome-wide assessments reveal extremely high levels of polymorphism of two active families of mouse endogenous retroviral elements.” PLoS Genet 4(2): e1000007. I designed all the computational methods for identifying mouse ERV polymorphisms, performed most computational data analyses, and wrote sections of the paper. I.A.M. and L.G. performed all the biological experiments for Figures 2.5 and 2.7, L.N.v.d.L. analyzed the data in Figure 2.6, and D.L.M. and I.A.M. wrote sections of the paper.

A version of Chapter 3 has been published: Ying Zhang, Mark T. Romanish, Dixie L. Mager (2011). “Distributions of transposable elements reveal hazardous zones in Mammalian introns.” PLoS Comput Biol 7(5): e1002046. I designed and performed all the computational data analyses, and wrote sections of the paper. M.T.R. performed the biological experiments for Figure 2.8. D.L.M. and M.T.R. wrote sections of the paper.

A version of Chapter 4 has been published: Ying Zhang and Dixie L. Mager (2012).
“Gene properties and chromatin state influence the accumulation of transposable elements in genes.” PLoS One 7(1): e30158. I designed and performed all the computational data analyses, and wrote the paper. D.L.M. corrected and revised the paper.

The mouse work was covered by Animal Care Certificate A09-0372 issued by the UBC Animal Care Committee.

Table of Contents

Abstract
Preface
Table of Contents
List of Tables
List of Figures
Acknowledgements
Chapter 1: Introduction
  1.1  Mammalian Transposable Elements
    1.1.1  Long interspersed nucleotide elements (LINEs)
    1.1.2  Short interspersed nucleotide elements (SINEs)
    1.1.3  Long terminal repeat (LTR) retrotransposons
    1.1.4  DNA transposons
  1.2  Distribution of TEs in Mammalian Genomes
    1.2.1  TE distribution and the local G/C content
    1.2.2  TE orientation bias in genes
    1.2.3  TE density within genes
  1.3  Initial Integration Site Preference of TEs
    1.3.1  Integration site preference of retroviruses and ERVs
    1.3.2  Integration site preference of other TEs
  1.4  Activity and Polymorphism of TEs in Mammals
    1.4.1  TE activity and polymorphism in humans
    1.4.2  TE activity and polymorphism in mice
    1.4.3  TE activity and polymorphism in other mammals
  1.5  Effects of TE Integration in the Host Genome
    1.5.1  TE-mediated physical damage at the DNA level
    1.5.2  Transcriptional influence of TE sequences on host genes
    1.5.3  Epigenetic Effects of TEs
  1.6  Transposable Elements and Host Evolution
    1.6.1  TEs vs. the host genome: an everlasting battle
    1.6.2  TE exaptation: turning “junk” into “gold”
  1.7  Thesis Objectives
Chapter 2: Identification and Investigation of ERV Polymorphisms in Mice
  2.1  Background
  2.2  Results and Discussion
    2.2.1  Prevalence of ETn/MusDs and IAPs in different strains
    2.2.2  Identification and frequency of polymorphic ERVs
    2.2.3  Genic distribution patterns of the youngest ERVs are distinct from older elements
    2.2.4  Confirmation of polymorphic ERVs in gene introns
    2.2.5  Potential gene expression effects mediated by polymorphic ERVs
  2.3  Concluding Remarks
  2.4  Materials and Methods
    2.4.1  Source data
    2.4.2  Design of ERV probes and detection of ERVs in the assembled B6 genome
    2.4.3  Detection of ERV insertions in test strains
    2.4.4  Determining the polymorphism status of ERVs present in B6 or in the test strains
    2.4.5  B6 trace sampling and screening simulations
    2.4.6  Experimental verification of polymorphic insertions
    2.4.7  RT-PCR
    2.4.8  Northern blotting
    2.4.9  Microarray analysis
    2.4.10  PolyERV custom tracks for Genome Browser
Chapter 3: TE Distributions Reveal Hazardous Zones and Residential Preferences in Gene Introns
  3.1  Background
  3.2  Results and Discussion
    3.2.1  Intronic regions near exon boundaries are depleted of TE insertions
    3.2.2  TEs within their U-Zones are shorter
    3.2.3  TEs near exons exhibit strong orientation and splice-site bias
    3.2.4  A high fraction of known mutagenic intronic TEs reside within U-zones
    3.2.5  Polymorphic LTR elements in mice show an intermediate distribution pattern
    3.2.6  Chimeric transcripts and cryptic splice signals differ within and outside the U-zone
    3.2.7  Abnormal gene splicing linked to polymorphic LTR element insertions near intron boundaries
  3.3  Concluding Remarks
  3.4  Materials and Methods
    3.4.1  Source Data
    3.4.2  Computer Simulation of Random TE insertions
    3.4.3  Normalization of intronic TE distribution by G/C content
    3.4.4  Calculation of the standardized TE frequency levels
    3.4.5  Computational analysis of potential splice sites in LTRs
    3.4.6  RT- and qRT-PCR of polymorphic LTR insertions in mouse genes
Chapter 4: Gene Properties and Chromatin State Influence the Accumulation of TEs in Genes
  4.1  Background
  4.2  Results and Discussion
    4.2.1  Determining sets of orthologous genes with the same extreme of TE density
    4.2.2  Chromosomal distribution and TE composition of shared outlier genes
    4.2.3  Extreme TE density is associated with the function and conservation level of genes
    4.2.4  Binding of Polymerase-II at gene promoters reveals association between TE-content and tissue-specificity of genes
    4.2.5  Histone marks at promoters confirm the overall open status of SUO genes in ESCs
    4.2.6  Expression data of SUO and SLO genes in ESCs support the ‘open chromatin status’ hypothesis
    4.2.7  Gene function and chromatin status both contribute to the low TE density of SLO genes
  4.3  Concluding Remarks
  4.4  Materials and Methods
    4.4.1  Selection of source datasets
    4.4.2  Identification of genes with extreme TE densities
    4.4.3  Normalization of gene size and exon density
    4.4.4  Identification of shared-outlier genes
    4.4.5  Classification of gene conservation levels
    4.4.6  Comparison of the average TE age between SUO and SLO genes
    4.4.7  Examination of the tissue specificity of Polr2a binding at SUO/SLO promoters
    4.4.8  Statistical tests
Chapter 5: Thesis Summary and Conclusions
  5.1  Thesis Summary
    5.1.1  Which ERVs are polymorphic in mice?
    5.1.2  What features of TE insertions are linked to their residential probability in genes?
    5.1.3  Why are some genes highly enriched in TEs while others are depleted of TEs?
  5.2  Conclusions
References
Appendices
  Appendix A Supporting information of Chapter 2
    A.1  Supplementary figures
    A.2  Supplementary tables
    A.3  Accession numbers
  Appendix B Supporting information of Chapter 3
    B.1  Supplementary figures
    B.2  Supplementary tables
  Appendix C Supporting information of Chapter 4
    C.1  Supplementary figures
    C.2  Supplementary tables

List of Tables

Table 1.1 Genomic compositions of major TE families in human, mouse and cow
Table 2.1 Genomic PCR verification of ETn/MusD intronic insertions in different mouse strains
Table 3.1 Intronic underrepresentation zones by TE type
Table 3.2 Intronic distributional biases of mutagenic, polymorphic, and all full-length TEs
Table A.1 Details of ERV probes
Table A.2 Polymorphic IAP cases in genes
Table A.3 Polymorphic ETn/MusD cases in genes
Table A.4 Primer sequences
Table B.1 Mutagenic intronic Alu insertions in humans
Table B.2 Mutagenic intronic L1 insertions in humans
Table B.3 Mutagenic intronic ERV insertions in mice
Table C.1 SLOs identified among human, mouse and cow
Table C.2 SUOs identified among human, mouse and cow
Table C.3 Overrepresentation of GO terms for SLOs (BiNGO results)
Table C.4 Overrepresentation of GO terms for SUOs (BiNGO results)
Table C.5 The optimization table for outlier threshold selection
Table C.6 Number of TUs/genes derived in each step of SUO/SLO calculation

List of Figures

Figure 1.1 Composition of TEs in the human genome.
Figure 1.2 The four major types of TEs in mammals.
Figure 1.3 Target integration site preferences of HIV, MLV and ASLV.
Figure 1.4 Transcriptional influences of TEs.
Figure 2.1 Screening strategy for detection of polymorphic ERV insertions.
Figure 2.2 Fractions of polymorphic ERVs based on the four strains.
Figure 2.3 Distributions of young versus older ERV elements with respect to genes.
Figure 2.4 An apparent heterozygous ETn insertion in the Sytl3 gene in an A/J mouse.
Figure 2.5 Detection of ETn-gene chimeric transcripts.
Figure 2.6 Normalized, tissue-averaged expression of Dnajc10 across strains.
Figure 2.7 Transcript levels of Opcml in A/J versus B6.
Figure 3.1 Intronic distributions of the four major TE types in human (normalized).
Figure 3.2 Average size of human TEs within and outside the U-zone.
Figure 3.3 Distributional biases of full-length human intronic TEs.
Figure 3.4 Orientation bias of human full-length intronic TEs based on their proximity to different types of splice sites.
Figure 3.5 Comparisons of TE frequency within the U-zone.
Figure 3.6 Chimeric transcripts and cryptic splice signals of TEs within and outside the U-zone.
Figure 3.7 Chimeric transcripts of the Trpc6 gene and Kcnh6 gene in mice.
Figure 3.8 Effect of polymorphic LTR element insertions on transcription of the Trpc6 and Kcnh6 genes.
Figure 4.1 TE composition patterns of SUO genes.
Figure 4.2 TE density and conservation level of outlier genes.
Figure 4.3 Tissue-specificity of Polr2a binding at outlier genes.
Figure 4.4 Histone marks at promoters of shared outlier genes.
Figure 4.5 Gene expression data analyses for SUOs and SLOs in mouse ESCs.
Figure 4.6 Effect-evaluation matrix for gene function and chromatin status of SLOs.
Figure A.1 Mouse trace sequence archive composition (May 2007).
Figure A.2 Design of probes.
Figure A.3 Estimation of fractional loss of detection of ERVs using sequence traces.
Figure A.4 Splice sites in ETn elements detected in chimeric transcripts.
Figure A.5 Comparative microarray expression data of Dnajc10 in different strains.
Figure A.6 Comparative microarray expression data of Opcml in different strains.
Figure A.7 Semi-quantitative RT-PCR of Opcml in A/J versus B6.
Figure A.8 UCSC Genome Browser screen shot of the PolyERV track.
Figure B.1 Intronic distributions of the four major TE types in human (non-normalized).
Figure B.2 Intronic distributions of the four major TE types in mouse (normalized).
Figure B.3 Average size of mouse TEs within and outside the U-zone.
Figure B.4 Distributional biases of mouse full-length intronic TEs.
Figure B.5 Orientation bias of mouse full-length intronic TEs based on proximity to different types of splice sites.
Figure C.1 TE density distribution of human genes.
Figure C.2 The relationship between gene size and exon density in human.
Figure C.3 Identification of outlier genes by controlling gene size and exon density.
Figure C.4 Chromosomal distribution of SUOs and SLOs in human.
Figure C.5 The relationship between the number of SUOs/SLOs on human chromosomes and the chromosome size.
Figure C.6 Correlation analysis of the TE composition of SUOs between human and mouse.
Figure C.7 Correlation analyses of G+C content vs. LINE/SINE composition of SUOs in human.
Figure C.8 Tissue-type composition of tissue-specific outlier genes.
Figure C.9 Histone marks at promoters of all outlier genes.

Acknowledgements

I wish to thank the many people who have contributed, in various ways, to the research work presented here. Their kind assistance and many contributions have been essential to my completion of this thesis.

Many thanks go to my supervisor, Dr. Dixie Mager, who gave me the great opportunity to undertake research in this fascinating field of biological science. At the beginning, coming from a computer science/engineering background, I knew very little about molecular biology. It was Dr. Mager’s great patience and supportive guidance that gave me the confidence to overcome the numerous difficulties encountered along the way. I thank her not only for the knowledge of biology that she imparted, but also for helping me to think as a real biologist. Her tutelage will benefit me throughout my career as a scientist.

Furthermore, I express here my deep appreciation to everyone who has been involved in this thesis work. This includes Dr. Mark Wilkinson, Dr. Ryan Brinkman and Dr. Robert Kay from my thesis committee, who gave me many insightful ideas, not only on technical details but also on strategic guidelines. Also included are the many current and past lab colleagues involved in my thesis projects, who contributed significant time and skill in helping to design and conduct related experiments. In particular, I express my appreciation to Dr. Louie van de Lagemaat, my lab mentor in bioinformatics, who helped me to adapt quickly to this new field of research during the early days of my graduate study.
I also thank my friend Blair Shakell, who offered his precious time in reading my thesis and correcting errors.

My thesis work has been generously funded by the Canadian Institutes of Health Research, with core support provided by the BC Cancer Agency, as well as fellowship support from the Natural Sciences and Engineering Research Council of Canada. I thank all of these organizations for their support.

Lastly, I thank my family and friends for being so supportive through all my years of study. I especially thank my dear wife, Yuan Xue, my best friend and the real hero behind any of my little successes and achievements. Without her love, sacrifice and forgiveness, it would not have been possible for me to achieve what I have. I thank my parents, Youfang Zhang and Ping Jin, who always gave me their greatest support, encouragement and understanding, and by so doing, enabled me to turn my childhood interest in science into a real-life career. I thank my parents-in-law, Yi Xue and Shuying Zhang, for so diligently helping to take care of our family while I was working on my thesis. And finally, I thank my dear Gavin, among the most important persons in my life, for gifting me with the extraordinary joy and great happiness of being his father and watching him grow up day by day, which has been so amazing and rewarding.

Chapter 1: Introduction

1.1  Mammalian Transposable Elements

Transposable elements (TEs), also known as transposons, are DNA fragments that can move from one genomic location to another and proliferate within the genome as molecular hitchhikers. Due to this “moving” feature, they are sometimes also referred to as “jumping genes”. Since the identification of TEs in maize by Barbara McClintock more than half a century ago (McClintock, 1950), these mobile DNA sequences have been found in almost all living organisms from bacteria to humans.
Sequencing of the human genome has revealed, for example, that at least 45% of the total human genomic DNA is derived from TEs (Figure 1.1) (Lander et al., 2001), which can be divided into four major types: the long interspersed nucleotide element (LINE; Figure 1.2A), the short interspersed nucleotide element (SINE; Figure 1.2B), the long terminal repeat (LTR) retrotransposon (Figure 1.2C), and the DNA transposon (Figure 1.2D) (Smit, 1996). The first three types, known collectively as retrotransposons or Class I transposons, account for most TEs in mammalian genomes and utilize an RNA intermediate during their retrotransposition, a hallmark “copy-and-paste” translocation process of TEs (see below for details). Unlike retrotransposons, DNA transposons (Class II transposons) proliferate in the host genome without an RNA intermediate. Instead, most of them move directly to new genomic loci in a “cut-and-paste” manner, without leaving the original copy at the donor site.

Figure 1.1 Composition of TEs in the human genome.

Figure 1.2 The four major types of TEs in mammals. The molecular structures of A) LINE, B) SINE, C) LTR retrotransposon and D) DNA transposon are shown. Note that target-site direct repeats (TSDs) are not part of the TE. EN – endonuclease; RT – reverse transcriptase; C – C-terminal domain; A(n) – poly-A tail; TSD – target site direct repeat; LTR – long terminal repeat; IR – inverted repeat.

Most mammals contain all four types of TEs mentioned above. Each of these TE types can be further divided into different families/clades, which present distinct genomic compositions in different host species. For example, about 83% of all human LINEs belong to the relatively young L1 family, some copies of which are still active in humans. In some other mammals such as the cow, however, only half of LINEs are L1s, while most of the remainder belong to the retrotransposable element (RTE) family (Elsik et al., 2009).
Similarly, about 80% of SINE elements in humans are from the Alu family, which exists only in primates (Batzer and Deininger, 2002). As a simplified overview of the TE composition in different mammals, Table 1.1 lists the genomic compositions of major TE families present in human, mouse and cow.

As molecular symbionts residing in the host genome, TEs have coevolved with their host species for many millions of years. Ancient elements have been utilized by biologists as fossils to calibrate the neutral rate of nucleotide divergence (Lander et al., 2001), while insertions of “modern” or still active elements are often studied for their effects on genes (see Section 1.5 below). Since different types of TEs show distinct characteristics in terms of their sequence, structure, age, life cycle, as well as distribution in the host genome, it is important to study them separately, based on their type or family.

In this introductory chapter, I describe briefly some essential features of each major TE type in mammals, and summarize in greater detail the known information in terms of TE genomic distributions, integration site preferences, activities and polymorphisms, biological/molecular effects, and influences on host evolution. At the end of this chapter I provide an overview of my thesis studies on the genome dynamics of TEs in mammals, which themselves are discussed more thoroughly in the following chapters.
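The family proportions quoted above can be checked with simple arithmetic against Table 1.1. The following is a minimal Python sketch, not part of the original analysis. It assumes that the quoted percentages refer to genomic coverage rather than copy number, and the dictionary layout and the function name family_share are illustrative only.

```python
# Genomic coverage (%) values copied from Table 1.1.
# Assumption: the percentages quoted in the text are based on genomic
# coverage, not on copy number.
COVERAGE = {
    "human": {"LINEs": 20.42, "L1": 16.89, "SINEs": 13.14, "B1 (Alu)": 10.60},
    "cow": {"LINEs": 23.29, "L1": 11.26, "RTE (BovB)": 10.74},
}


def family_share(species: str, family: str, te_type: str) -> float:
    """Percentage of a TE type's genomic coverage contributed by one family."""
    values = COVERAGE[species]
    return 100.0 * values[family] / values[te_type]


if __name__ == "__main__":
    # (species, family, TE type) combinations mentioned in the text.
    checks = [
        ("human", "L1", "LINEs"),
        ("human", "B1 (Alu)", "SINEs"),
        ("cow", "L1", "LINEs"),
        ("cow", "RTE (BovB)", "LINEs"),
    ]
    for species, family, te_type in checks:
        share = family_share(species, family, te_type)
        print(f"{species}: {family} = {share:.1f}% of {te_type} coverage")
```

Under this assumption the human L1 share comes to roughly 83% of LINE coverage and the cow L1 share to just under half, in line with the text. Measured by copy number instead (516 of 868 thousand human LINE copies), the human L1 share would be noticeably lower.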
Table 1.1 Genomic compositions of major TE families in human, mouse and cow

                   Human (1)              Mouse (2)              Cow (3)
Type/Family        Copy no.   Coverage    Copy no.   Coverage    Copy no.   Coverage
                   (x1000)    (%)         (x1000)    (%)         (x1000)    (%)
LINEs              868        20.42       660        19.20       1139       23.29
  L1               516        16.89       599        18.78       616        11.26
  L2               315        3.22        53         0.38        132        1.18
  L3/CR1           37         0.31        1.2        0.05        -          -
  RTE (BovB)       -          -           -          -           376        10.74
SINEs              1558       13.14       1498       8.22        2883       17.66
  B1 (Alu)         1090       10.60       564        2.66        -          -
  B2               -          -           348        2.39        -          -
  B4/RSINE         -          -           391        2.36        -          -
  ID               -          -           79         0.25        -          -
  MIR/MIR3         468        2.54        115        0.57        301        1.39
  BOV-A2           -          -           -          -           378        2.36
  Bov-tA           -          -           -          -           1462       7.73
  ART2A            -          -           -          -           349        4.18
  tRNA             -          -           -          -           389        1.99
LTR elements       443        8.29        631        9.87        311        3.20
  ERV class I      112        2.89        34         0.68
  ERV class II     8          0.31        127        3.14
  ERV class III    83         1.44        37         0.58
  MaLR             240        3.65        388        4.82
DNA elements       294        2.84        112        0.88        244        1.96
Total                         44.69                  38.17                  46.11

1. Data taken from (Lander et al., 2001).
2. Data taken from (Waterston et al., 2002).
3. Data taken from (Elsik et al., 2009).

1.1.1  Long interspersed nucleotide elements (LINEs)

As the most prevalent TEs in humans, LINEs cover more than 20% of our entire genome (Lander et al., 2001). Although several LINE families exist in mammals based on sequence similarity (Table 1.1), many of them show similar molecular structures and life cycles. Generally, LINEs are autonomous non-LTR retrotransposons, which means that they encode their own proteins required for retrotransposition. For example, a full-length copy of L1, the most abundant and active LINE family in human, is about 6 kb long and carries an internal RNA polymerase II (Pol-II) promoter in the 5′ untranslated region (5′ UTR), two open reading frames (ORFs), and a short 3′ UTR immediately followed by a poly(A) tail (Figure 1.2A).
L1 ORF1 encodes a 40-kDa protein with both RNA-binding and protein-protein interaction domains, but the specific roles of this protein in L1 retrotransposition are still not fully understood. L1 ORF2 encodes a 150-kDa protein that contains an endonuclease (EN) domain, a reverse transcriptase (RT) domain, and a C-terminal zinc knuckle-like domain. During retrotransposition, the EN domain cleaves the genomic DNA at the target site, while the RT domain plays a key role in converting the L1 RNA into complementary DNA (cDNA) through a process known as target primed reverse transcription (TPRT) (Ostertag and Kazazian, 2001). The role of the ORF2 C-terminal zinc knuckle-like domain is not fully clear, but studies of similar domains in retroviral nucleocapsid proteins suggested that it might bind to the L1 RNA during the formation of retrotransposition intermediates and facilitate unfolding of structured RNA (Ostertag and Kazazian, 2001).

The TPRT retrotransposition process of non-LTR retrotransposons such as L1s happens in the nucleus during their life cycle. First, a full-length active L1 element is transcribed from its 5′ internal promoter. After the L1 RNA is exported from the nucleus into the cytoplasm, the ORF1 and ORF2 proteins (ORF1p and ORF2p) are translated from it and, by interacting in cis with the L1 RNA itself, form an intermediate ribonucleoprotein (RNP) complex. In the next step, this RNP complex is imported back into the nucleus to complete TPRT with the help of the EN and RT domains of ORF2p. During this process, the L1 RNA is reverse transcribed to cDNA and integrated into the host genome. Due to the staggered cleavage of the target DNA by the EN domain at the new L1 integration site, a perfect pair of target site duplications (TSDs), ranging in length from 7 to 20 bp, is created flanking the newly generated copy of L1 (Ostertag and Kazazian, 2001).
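Because a TSD is the same short sequence repeated immediately upstream and immediately downstream of the inserted element, it can in principle be recognized computationally. The following is only a minimal sketch under simplifying assumptions: the function name `find_tsd` and the example sequences are hypothetical, the element boundaries are assumed to be known exactly, and only perfect (mismatch-free) duplications within the 7 to 20 bp range quoted above are considered.

```python
def find_tsd(upstream: str, downstream: str,
             min_len: int = 7, max_len: int = 20) -> str:
    """Return the longest perfect target site duplication, or '' if none.

    upstream   -- genomic sequence ending right before the element's 5' end
    downstream -- genomic sequence starting right after the element's 3' end
    A TSD of length k exists when the last k bases of `upstream` equal the
    first k bases of `downstream`.
    """
    upstream = upstream.upper()
    downstream = downstream.upper()
    longest = min(max_len, len(upstream), len(downstream))
    # Try the longest candidate first so the maximal duplication is reported.
    for k in range(longest, min_len - 1, -1):
        if upstream[-k:] == downstream[:k]:
            return upstream[-k:]
    return ""


# Hypothetical flanks of an insertion: the 11-bp motif AAGAAAGTCTA is duplicated.
up_flank = "GGCCTTAAGAAAGTCTA"
down_flank = "AAGAAAGTCTACCGGT"
tsd = find_tsd(up_flank, down_flank)
print(tsd, len(tsd))
```

In real data, TSDs of older elements are usually degraded by mutation, so a practical search would have to tolerate mismatches and imprecise element boundaries.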
Notably, the vast majority of L1s in the human genome are 5′-truncated, most likely due to the incomplete reverse transcription of the template L1 RNA during TPRT (Lander et al., 2001). Moreover, since L1s carry only a weak polyadenylation (poly(A)) signal, very often the transcription of the L1 element bypasses its own poly(A) signal so that the transcriptional machinery keeps moving downstream until it reaches the next poly(A) signal. Thus, the accidentally transcribed downstream sequence could be copied into a new L1 integration site. This process is called “L1-mediated transduction” (Moran et al., 1999).

1.1.2  Short interspersed nucleotide elements (SINEs)

SINE elements, as indicated by the name, are much shorter than other TEs, a full-length element being typically less than 500 bp. Due to their limited size, SINEs do not usually encode any proteins, but instead are thought of as hitchhikers, which only retrotranspose with the help of other autonomous TEs such as LINEs (Ohshima and Okada, 2005). For example, Alu SINEs, the most abundant TE family in the human genome (> one million copies, see Table 1.1), contain structural features very similar to those of L1s, such as the poly(A) tail at the 3′ end (Figure 1.2B), which might be critical during the TPRT process. Furthermore, the coincidence of bursts of amplification of both Alus and L1s in primates during the past 150 million years also suggests that Alus might have utilized the L1 endonuclease and reverse transcriptase during their retrotransposition. Indeed, a well-designed transposition assay of marked Alus in human HeLa cells clearly demonstrated that the presence of L1 ORF2p protein and the poly(A) tail of the Alu sequence are both essential for active Alu retrotransposition (Dewannieux et al., 2003), confirming the role of the Alu SINE as a “parasite of a parasite”.
While most SINE families originated from cellular tRNAs, human Alus differ from the others in being ancestrally derived from the 7SL RNA gene (Batzer and Deininger, 2002), which encodes a small RNA molecule that forms part of the signal recognition particle (SRP). Like L1 elements, Alus also contain a 5′ internal promoter, but usually are transcribed by RNA polymerase III (Pol-III) instead of Pol-II. However, the internal promoter of Alu is generally too weak to effectively initiate transcription of Alu RNAs by itself, and upstream sequences adjacent to the integration site could be very important for active Alu transcription (Ullu and Weiner, 1985). This feature has likely greatly lowered the total possible number of retrotransposition-competent Alus in host genomes.

1.1.3  Long terminal repeat (LTR) retrotransposons

Like non-LTR retrotransposons, donor LTR elements also amplify in the host genome via an intermediate RNA template in a “copy-and-paste” manner. These repeats, however, contain a hallmark structure known as the “long terminal repeat” or LTR, at both the 5′ and 3′ ends of the integrated DNA sequence (Figure 1.2C). The two identical LTRs at either end of these elements, in fact, are products of a very sophisticated reverse-transcription process conducted in the cytoplasm rather than in the nucleus. Briefly, after an LTR retrotransposon is transcribed from its own promoter located within the 5′ LTR, the resulting mRNA is exported into the cytoplasm, where it uses a cellular tRNA molecule that is complementary to a special region called the “primer binding site” (PBS) at the 5′ end of the internal sequence to initiate reverse transcription. Since the cDNA synthesis does not start from the beginning of the LTR retrotransposon, the completion of the entire reverse transcription process without losing sequence information is achieved through a series of biochemical reactions, which are not discussed here (see a detailed review in (Telesnitsky and Goff, 1997)).
Interestingly, the LTR structure and the cytoplasmic reverse transcription process of LTR retrotransposons greatly resemble those of infectious retroviruses such as the murine leukemia virus (MLV) and the human immunodeficiency virus (HIV). Moreover, many LTR retrotransposons encode at least two ORFs that are also found in exogenous retroviruses (Figure 1.2C): gag, which produces proteins involved in viral particle assembly, and pol, which encodes key enzymes such as the reverse transcriptase (RT) and integrase (IN), which are crucial during the reverse-transcription and integration process (Boeke and Stoye, 1997). In fact, it has been postulated that in certain circumstances some LTR retrotransposons may capture an envelope gene and thus become able to infect other cells, giving rise to true exogenous retroviruses (Eickbush and Jamburuthugoda, 2008; Gifford and Tristem, 2003). Other evidence indicates that some exogenous retroviruses have lost their envelope genes (or at least the function of these genes) and, as a result, their ability to enter other cells. When these retroviruses are trapped in host germ cells such as the sperm or egg (or other germ cell progenitors), they can be passed vertically to the next host generation as part of the genome, a process known as “endogenization”. For this reason, these germline-trapped retroviruses are also known as endogenous retroviruses or ERVs.

Notably, there is also a substantial number of LTR retrotransposons in mammals that do not show any similarity to retroviral genes (e.g. the mammalian apparent LTR retrotransposons (MaLRs) and the early transposons (ETns)), yet maintain the same LTR structures as ERVs. It is likely that these noncoding LTR retrotransposons have amplified in the genome by hijacking the reverse transcription machinery provided by autonomous ERVs that encode competent gag and pol proteins for retrotransposition.
An example of this phenomenon is the family of nonautonomous ETn elements in mice and their autonomous counterparts, the Mouse type-D endogenous proviruses (MusDs) (Mager and Freeman, 2000; Ribet et al., 2004). Since the origins of all LTR-like structures in mammalian genomes are not clear, the terms “LTR element”, “LTR retrotransposon” and “ERV” are used interchangeably in the field and in this thesis.

1.1.4  DNA transposons

DNA transposons, or Class II transposons (Figure 1.2D), differ fundamentally from all retrotransposons discussed above by transposing without an RNA intermediate template. Many DNA transposons, in fact, move from one genomic locus to another as double-stranded DNA (dsDNA): the original element is excised by DNA transposase (an enzyme encoded by DNA transposons) and integrates back into the host genome at a different location. Not all DNA transposons use the “cut-and-paste” strategy for transposition, however. Helitrons, first identified in the thale cress Arabidopsis thaliana, use a “rolling-circle” replication mechanism that involves the displacement of a single-stranded DNA (ssDNA) intermediate to increase their genomic copy numbers (Kapitonov and Jurka, 2007). Another type of DNA transposon, known as Polintons (also named “Mavericks”), are very large transposons, typically 15 to 20 kb long, that even encode their own DNA polymerase for element duplication (Kapitonov and Jurka, 2006; Pritham et al., 2007). Since the numbers of both Helitrons and Polintons are negligible in most mammals, I focus only on classical “cut-and-paste” DNA transposons in this thesis.

1.2  Distribution of TEs in Mammalian Genomes

Significant advances in whole genome sequencing techniques during the past decade have provided unprecedented opportunities for researchers to study properties of TEs and their relationships to the host genome on a genome-wide scale.
Analysis of the many assembled genomes available today has revealed that the abundance of TEs varies greatly between different species. In the puffer fish Fugu rubripes, for example, only 2.7% of the genome matches TE sequences (Aparicio et al., 2002), whereas in the B73 Maize (Zea mays), the genomic coverage of TEs reaches 85% (Schnable et al., 2009). In most mammals, the abundance of TEs lies in between the above two extremes, typically with a genomic coverage of 40 - 50% as seen in humans. Interestingly, genome-wide analyses of TE distributions in mammals, as well as in many other species, have shown apparent non-random patterns, which are likely consequences of at least three independent mechanisms: the initial integration site preference, genetic drift, and natural selection. Therefore, studying the genomic distributions of TEs becomes a powerful tool in understanding the biological interactions between these mobile elements and their host genomes through evolution.

1.2.1  TE distribution and the local G/C content

At the beginning of the 21st century, both the human and the mouse genomes were fully sequenced and assembled (Lander et al., 2001; Waterston et al., 2002), providing excellent platforms and opportunities for researchers to generate a comprehensive view of the TE-host relationships, based on the distributions of different cohorts of TEs in the two remotely related mammals. For example, LINEs and SINEs show high correlations with respect to the local G/C content in both genomes. Specifically, LINEs are typically enriched in A/T-rich regions, whereas their non-autonomous partners, SINEs, are significantly overrepresented within G/C-rich regions. While the reasons behind this surprising difference are still under debate, several possible explanations have been proposed and are discussed below.
First of all, since the preferred target integration site of L1 is A/T-rich (TTTT/AA), the initial target site preference seems to be the most straightforward and reasonable explanation for the enrichment of LINEs in A/T-rich regions. However, the high occurrence of SINEs in G/C-rich regions is puzzling. Since SINEs share similar base-composition patterns with LINEs near the target site and are likely facilitated by LINE proteins during retrotransposition (Jurka, 1997), it is unlikely that SINEs have an opposite integration site preference.

The second possible explanation is selection, either purifying clearance of SINEs from A/T-rich regions or positive selection for them in G/C-rich regions. Since most A/T-rich DNA in the genome is gene-poor (Lander et al., 2001) and tolerates the accumulation of most other TEs well, it is unlikely that there is stronger selection pressure to eliminate SINEs from such regions. Alternatively, high G/C content is known to correlate positively with gene density (Lander et al., 2001), and the enrichment of SINEs in G/C-rich regions could be due to their possible beneficial roles when located near/within genes. Indeed, evidence has shown that SINE RNAs can promote cellular protein translation by acting as an eIF2 kinase (PKR)-inhibitor and thus might benefit the host cells under stress (Schmid, 1998). Since SINE RNAs function here directly at the RNA level, the accumulation of SINEs in the more readily transcribed gene-rich regions promises a faster cellular response (Lander et al., 2001). However, this model was based mostly on speculation, and measuring the degree of such effects could be difficult.

Lastly, spontaneous loss of SINEs in A/T-rich (gene-poor) regions may also explain this phenomenon.
While such loss could be neutral and the loss rate correlates positively with the density of SINEs in gene-poor regions, negative selection pressure against deleterious effects on nearby genes in gene-rich regions may lead to slower rates of both SINE accumulation and recombination/deletion, resulting in a relatively higher SINE density in G/C-rich regions (Medstrand et al., 2002). Although each of the above explanations has its own supporting evidence, the real situation could be far more complex with a combination of multiple factors, and more computational and experimental studies are needed to gain a deeper understanding of such effects.

1.2.2  TE orientation bias in genes

Even before the publication of the human genome sequence, it had been noticed that the majority of intronic TEs (especially LINEs and LTR elements) are in antisense orientation with respect to the enclosing gene (Smit, 1999). Subsequently, the availability of whole-genome sequence data for many species confirmed this observation (Cutter et al., 2005; Lander et al., 2001; Medstrand et al., 2002; Waterston et al., 2002). Limited experiments to measure de novo integration patterns, however, show no evidence of any initial orientation bias when exogenous retroviruses or ERV/LTR retrotransposons integrate into the genome (Dewannieux et al., 2004; Gasior et al., 2007; Ribet et al., 2004; Schroder et al., 2002). These results suggest a strong selection pressure against sense-oriented TE insertions within genes, very likely due to the cryptic regulatory signals these elements carry in their sequence (Smit, 1999; van de Lagemaat et al., 2006). For example, substantial evidence has shown that intronic TE insertions can cause alternative splicing or premature polyadenylation, or be used as alternative promoters to produce aberrant transcripts of the host gene (see Section 1.5 of this chapter).
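Whether a set of intronic insertions departs from the 50:50 sense/antisense split expected in the absence of any orientation bias can be checked with an exact binomial test. The sketch below is illustrative only: the function name `binom_two_sided_p` and the counts are hypothetical and are not taken from any of the cited studies, and the implementation is sized for modest counts (at most a few hundred insertions).

```python
from math import comb


def binom_two_sided_p(k: int, n: int, p: float = 0.5) -> float:
    """Exact two-sided binomial test.

    k -- number of antisense insertions observed
    n -- total number of intronic insertions
    p -- antisense probability under the null hypothesis (0.5 = no bias)
    The p-value sums the probabilities of all outcomes that are no more
    likely than the observed one.
    """
    probs = [comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(n + 1)]
    observed = probs[k]
    # Small relative tolerance guards against floating-point ties.
    total = sum(q for q in probs if q <= observed * (1 + 1e-9))
    return min(1.0, total)


# Hypothetical example: 130 of 200 fixed intronic elements are antisense.
print(binom_two_sided_p(130, 200))
# Hypothetical example: 52 of 100 de novo insertions are antisense.
print(binom_two_sided_p(52, 100))
```

A small p-value for fixed elements combined with a large one for de novo insertions would match the pattern described above, although a test of this kind says nothing about the mechanism behind the bias.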
Notably, most de novo intronic TE insertions that do cause mutations or diseases are, indeed, in the same orientation as the enclosing gene (van de Lagemaat et al., 2006), suggesting a much higher chance of being detrimental when TEs integrate in sense.

1.2.3  TE density within genes

When analysis of the first human genome was published, it was noticed that TE density in some genomic regions was extremely low compared to the average level even after correction for G/C content (Lander et al., 2001). For example, the four homeobox gene clusters, namely HOXA, HOXB, HOXC and HOXD, sized at ~100 kb each, contain less than 2% TEs, in contrast to the 45% genomic level. Based on the crucial functions played by these genes during human development, it is not hard to imagine how detrimental a TE integration event could be, and how important it is for the host to maintain the integrity of these regions by purifying selection. In fact, a general trend of low TE density within genes has been observed for most TE families (Medstrand et al., 2002), and a genome-wide study of transposon-free regions (TFRs) larger than 10 kb identified nearly a thousand such regions in both human and mouse (Simons et al., 2006). While most bases covered by TFRs are noncoding, the majority (85%) of TFRs, in fact, overlap one or more annotated genes. Further Gene Ontology (GO) analysis of TFR-related genes revealed significant enrichment of gene functions such as transcriptional regulation and development, which agrees with several other studies on TE density within/around genes (Grover et al., 2003; Huda et al., 2009; Mortada et al., 2010; Sironi et al., 2006).

Exceptions do exist, however. In a recent study of Hox gene clusters in the green anole lizard Anolis carolinensis, researchers surprisingly found massive accumulations of TEs in these regions, as well as in many other development-related genes (Di-Poi et al., 2009).
This is inconsistent with the observation in most other vertebrates. Since there is huge morphological variation among various Squamata species (e.g. Anolis lizards have largely radiated into ~400 species that are highly adapted within a variety of ecological niches (Alfoldi et al., 2011)), a speculative explanation for this exception could be the potential benefit of the increased genetic divergence introduced by allowing more TEs into these genomic regions that are important in lizard morphogenesis. Nonetheless, the above data demonstrate an uneven distribution of TEs relative to genes, suggesting a general theme of purifying selection around fundamentally important coding regions.

1.3  Initial Integration Site Preference of TEs

The distributions of fixed TEs in today’s mammalian genomes are combined outcomes of the initial target site preference, natural selection and genetic drift. However, except for a limited number of lineages, most TEs in mammals are only molecular fossils carrying numerous mutations, and do not actively transpose. For this reason, directly detecting the initial TE integration site preference is largely infeasible, especially for old elements. Nonetheless, currently active copies of LINEs and SINEs have been isolated and utilized to investigate their de novo integration features (Dewannieux et al., 2003; Gilbert et al., 2002). Alternatively, based on the kinship between exogenous and endogenous retroviruses (see Section 1.1.3), a glimpse of potential integration site preferences of related ERVs can be inferred from insertion site analyses of selected retroviruses (Mitchell et al., 2004; Wu and Burgess, 2004).
Moreover, successful synthetic reconstructions to create functional copies of the presently inactive TEs such as the human ERV, HERV-K, have been performed based on consensus sequences (Dewannieux et al., 2006; Lee and Bieniasz, 2007) and, as discussed below, this “resurrected” HERV-K has been used to determine its initial target site preferences.

1.3.1  Integration site preference of retroviruses and ERVs

Exogenous retroviruses have been intensely studied over a long time because of their often pathogenic effects on host organisms. As close relatives to their endogenous counterparts, these infectious pathogens possess very similar features, from their sequences to their life cycle, compared to ERVs. Deep understanding of exogenous retroviruses can greatly assist biologists to infer various characteristics of ERVs, such as integration mechanisms and target site preferences, without being blinded by post-integration changes during evolution.

Shortly after the publication of the first human genome sequence assembly, several genome-wide studies of retrovirus integration patterns were reported. For example, by using human immunodeficiency virus (HIV) and HIV-based vectors to infect a human T cell line (SupT1) and the naked DNA from the same cell line as an in vitro control, a high degree of initial integration preference of HIV within genes was unexpectedly revealed (Schroder et al., 2002). Specifically, while only about 35% of the 111 HIV integration sites collected in vitro were located within transcription units (which is close to the genomic coverage of genes in human), nearly 70% of the 524 HIV integration sites in human SupT1 cells resided in genes. Interestingly, unlike most ERVs fixed in the genome, these de novo HIV integrations showed no evidence of orientation bias relative to the enclosing genes.
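The size of the difference between the in vivo and in vitro HIV datasets can be illustrated with a two-proportion z-test. This is only a sketch and not the analysis performed in the cited study: the function name `two_proportion_z` is hypothetical, and the counts 367 of 524 and 39 of 111 are assumptions obtained by rounding the approximate percentages quoted above ("nearly 70%" and "about 35%").

```python
from math import erfc, sqrt


def two_proportion_z(x1: int, n1: int, x2: int, n2: int):
    """Two-sided z-test for the difference between two proportions.

    x1/n1 -- e.g. integration sites inside genes / all sites, infected cells
    x2/n2 -- the same counts for the naked-DNA (in vitro) control
    Returns (z, p_value), using the pooled-proportion standard error and
    the normal approximation.
    """
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided tail probability of a standard normal variable.
    p_value = erfc(abs(z) / sqrt(2))
    return z, p_value


# Assumed counts reconstructed from the rounded percentages in the text.
z, p = two_proportion_z(367, 524, 39, 111)
print(f"z = {z:.2f}, p = {p:.2e}")
```

With these assumed counts the difference is far larger than expected by chance, in line with the strong preference for genes described above; exact values would require the original site counts.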
Similarly, integration assays of murine leukaemia virus (MLV) in human HeLa cells and avian sarcoma-leukosis virus (ASLV) in both human 293T cells and HeLa cells have also been conducted (Mitchell et al., 2004; Wu et al., 2003), with the former showing a strong preference for gene promoter regions and the latter a much more random distribution of integration sites. A schematic illustration of the target integration site preferences of the above-mentioned retroviruses is given in Figure 1.3.

Figure 1.3 Target integration site preferences of HIV, MLV and ASLV. A conceptual gene structure is shown with boxes in different colors representing the promoter region (yellow), exons (dark blue) and introns (light blue). Arrows with different colors represent insertion sites of HIV (red), MLV (black) and ASLV (green).

To investigate the potential factors that may influence retrovirus integration site preference, additional analyses were performed in the studies cited above. By comparing the transcriptional profile of target cells and the HIV integration site distribution, a trend towards integration in highly expressed genes was revealed based on different target cell types (Schroder et al., 2002). Such a result supports the hypothesis that retroviruses may prefer to integrate into open chromosomal regions, where genes are actively transcribed and thus potentially easier to access. However, the distinct patterns of target site selection of different retroviruses indicate that DNA accessibility is unlikely to be the only underlying mechanism. According to evidence shown for Ty elements (LTR retrotransposons) in yeast, a “tethering model” involving the interactions between the retroviral integration complex and sequence-specific DNA binding proteins has also been proposed (Bushman, 2003).

Theoretically, selection forces can be quite different between exogenous and endogenous retroviruses due to their distinct criteria of fitness.
Given enough time for evolution, this might lead to different preferences of integration targeting (Bushman et al., 2005). For example, as a survival strategy, HIV may benefit from targeting actively transcribed genes in order to maximize its proliferation rate in a cell before the cell is killed. Similarly, MLV has a higher capacity for utilizing an adjacent promoter to gain higher expression levels (De Palma et al., 2005), leading to a selective advantage of targeting the 5′ end of genes. The integration sites of ASLV, however, are much more random, which might lower the chance of being deleterious and rapidly selected against. By contrast, endogenous LTR elements such as the Ty elements in yeast showed an apparent targeting preference for less critical regions in the host genome, probably because of the absence of an extracellular stage in their life cycle and, therefore, their dependence on the survival of the host cell (Wu and Burgess, 2004). For these reasons, caution is necessary when inferring the general target site preferences of ERVs from studies on their exogenous relatives.

Although many ERV families in mammals have accumulated enough mutations to become molecular fossils, the original forms of some inactive ERV families have been successfully reconstructed and used to directly study their initial integration site preferences. For example, in a recent study, Brady et al. examined the integration pattern of a reconstructed HERV-K element (HERV-Kcon) that can efficiently retrotranspose in selected human cell lines (Brady et al., 2009). In contrast to the distribution of existing HERV-Ks in the human genome, de novo integration sites of HERV-Kcon show highly distinct features including the preference for transcription units and the lack of orientation bias within genes.
Notably, the youngest HERV-Ks in the human genome show an intermediate distribution between that of the resurrected HERV-Kcon and older HERV-Ks fixed in the genome, indicating that the current distribution pattern of ERVs is a consequence of initial integration followed by genetic drift and natural selection.

1.3.2  Integration site preference of other TEs

Although there is no evidence for on-going transposition activity of either ERVs or DNA transposons in the human genome, active LINEs and SINEs still exist and have been successfully isolated/engineered for de novo retrotransposition assays (Dewannieux et al., 2003; Dewannieux and Heidmann, 2005; Gilbert et al., 2002; Moran et al., 1996). To conduct an evolutionarily unbiased investigation of the target integration site preference of LINEs and SINEs in the human genome, Gasior and colleagues collected over one hundred de novo L1 integration loci in human HeLa cells from various sources, as well as a total of 13 de novo SINE integration loci (including human Alu, mouse B1 and B2 elements) obtained from similar experiments (Gasior et al., 2007). Unexpectedly, while an A/T-rich bias was observed in the 50 bp flanking regions of these de novo L1 insertions as predicted by the “TTAAAA” consensus sequence of L1 integration sites (Jurka, 1997), only a neutral G/C content was revealed for such elements based on a 20 kb window size. This observation is significantly different from the genomic distribution of existing L1s, which shows a strong bias toward A/T-rich sequence even at relatively large genomic intervals. Consistent with the results for de novo L1s, the same study also showed a similar pattern of target site preferences for de novo Alu insertions based on either the 50 bp or the 20 kb window size, but the size of the Alu dataset was too small to support any strong conclusions (Gasior et al., 2007).
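The contrast between a short and a long window around an insertion site can be made concrete with a small calculation. The sketch below is not the procedure used by Gasior et al.; the function names `gc_fraction` and `window_gc`, the choice of a window centred on the insertion coordinate, and the toy sequence are all assumptions made for illustration.

```python
def gc_fraction(seq: str) -> float:
    """Fraction of G and C among unambiguous bases (A, C, G, T) in seq."""
    seq = seq.upper()
    called = sum(seq.count(base) for base in "ACGT")
    if called == 0:
        return float("nan")  # no callable bases, e.g. a run of N
    return (seq.count("G") + seq.count("C")) / called


def window_gc(chrom_seq: str, site: int, window: int) -> float:
    """G/C fraction in a window of about `window` bp centred on `site`.

    `site` is a 0-based insertion coordinate; the window is clipped at the
    sequence ends.
    """
    half = window // 2
    start = max(0, site - half)
    end = min(len(chrom_seq), site + half)
    return gc_fraction(chrom_seq[start:end])


# Toy sequence: a G/C-rich island (100 bp) embedded in A/T-rich DNA.
toy = "AT" * 50 + "GC" * 50 + "AT" * 50
print(window_gc(toy, 150, 50))    # local window lies inside the island
print(window_gc(toy, 150, 300))   # broad window averages over the flanks
```

The same insertion can therefore look G/C-rich at one scale and A/T-rich at another, which is why the window size (50 bp versus 20 kb above) has to be stated alongside any G/C-content result.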
Comparing these results to the genomic distributions of LINEs and SINEs that have been fixed in the human genome, it is clear that the current distributions of TEs are combined outcomes of both initial integration preferences and evolutionary constraints, although the latter factor shows a stronger long-term effect, especially for old TE families.

1.4  Activity and Polymorphism of TEs in Mammals

The time elapsed since a given TE copy integrated into the genome can be estimated from the sequence divergence of that copy from the consensus of its TE family and the neutral substitution rate of genomic DNA. Generally, most TEs in the mammalian genome today are ancient molecular fossils of an age ranging from tens to hundreds of Myr (Lander et al., 2001), which have lost their ability to transpose through accumulation of mutations. In addition to genetic degradation leading to inactivity, the activity of TEs can sometimes also be repressed by active silencing mechanisms of the host cell, such as epigenetic modifications and RNA silencing (see Section 1.6 for a detailed discussion). However, it is noteworthy that the age distributions and activities of TEs in various mammalian species can be surprisingly different. Moreover, on-going TE activity can give rise to TE insertional polymorphisms, that is, unfixed TE insertions present in only a subpopulation of the host species. This information can be very useful when examining the effects and properties of TE integrations, and can serve as an additional source of genomic variation when studying the evolutionary history and disease susceptibility of the host organisms.

1.4.1  TE activity and polymorphism in humans

In the contemporary human genome, active TE transposition events have been generally silenced.
In fact, more than 97% of human TEs are remnants of ancient elements older than 25 Myr (Lander et al., 2001), most of which have accumulated enough mutations to completely lose their ability to generate new transposition events. Initial analysis of the human genome, for example, showed a flourishing period of LTRs (mostly ERV-Ls and MaLRs) in humans that had lasted for about 100 Myr, most of which appeared to have died out ~40 Myr ago (Lander et al., 2001). Today, about 85% of the LTR retrotransposons in the human genome have lost their internal sequence due to homologous recombination between the flanking LTRs, resulting in only a solitary LTR as their entire remaining sequence. So far only a single family (HERV-K) is known to have transposed since our divergence from the chimpanzee ~7 Myr ago, and is polymorphic in today’s human population (Belshaw et al., 2005; Cordaux and Batzer, 2009; Mills et al., 2007). However, no evidence has been reported for ongoing de novo HERV-K retrotransposition. Even more dramatically, the activity of DNA transposons in humans, which comprise less than 3% of the genome, was completely extinguished about 37 - 50 Myr ago (Lander et al., 2001; Pace and Feschotte, 2007).

In contrast to LTR and DNA transposons, LINE and SINE elements have extremely long lives in mammals, and are responsible for most current TE activities in humans. Based on the draft sequence assembly of the haploid human genome (95% complete), Brouha et al. estimated 80-100 retrotransposition-competent (RC) L1s in an average human being, with the bulk of retrotransposition activity coming from only a few “hot L1s” (Brouha et al., 2003). More interestingly, the same study also showed that half of the identified RC-L1s are polymorphic in humans, indicating their relatively young age compared with elements fixed in the human population.
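The element ages quoted in this section (for example, "older than 25 Myr") rest on the divergence-based estimate described at the start of Section 1.4: age is approximately the divergence from the family consensus divided by the neutral substitution rate. A minimal sketch of this calculation is given below. The function names, the optional Jukes-Cantor correction for multiple substitutions at the same site, and the rate used in the example are assumptions made for illustration; they are not the values or the exact procedure of the cited studies.

```python
from math import log


def jukes_cantor_distance(p: float) -> float:
    """Correct an observed fraction of mismatches p for multiple hits.

    Uses the Jukes-Cantor model: K = -(3/4) * ln(1 - 4p/3).
    """
    if not 0.0 <= p < 0.75:
        raise ValueError("observed divergence must be in [0, 0.75)")
    return -0.75 * log(1.0 - 4.0 * p / 3.0)


def te_age_myr(divergence: float, rate_per_site_per_myr: float,
               correct: bool = True) -> float:
    """Estimate the time (Myr) since a TE copy integrated.

    divergence            -- fraction of sites differing from the consensus
    rate_per_site_per_myr -- assumed neutral substitution rate
    correct               -- apply the Jukes-Cantor correction if True
    """
    d = jukes_cantor_distance(divergence) if correct else divergence
    return d / rate_per_site_per_myr


# Illustrative only: 10% divergence and an ASSUMED rate of 0.002
# substitutions per site per Myr (not a value taken from the text).
print(te_age_myr(0.10, 0.002, correct=False))  # uncorrected estimate
print(te_age_myr(0.10, 0.002))                 # Jukes-Cantor corrected
```

The estimate is only as good as the assumed rate, which differs between lineages, and it treats the consensus as the sequence of the active source element at the time of insertion.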
Recently, the emergence of high throughput technologies has facilitated the discovery of an increasing number of human LINE/SINE polymorphisms (Beck et al., 2010; Ewing and Kazazian, 2010; Huang et al., 2010; Iskow et al., 2010). Using a paired-end DNA sequencing strategy, Beck and colleagues identified 68 full-length L1s that are differentially present among individuals but absent from the reference human genome (Beck et al., 2010). Among these polymorphic human L1s, 37 (or 55%) were hot L1s with strong retrotransposition activity, a number much greater than the total of six hot L1s found previously by Brouha and colleagues. Similarly, using high throughput pyrosequencing technology, another research group reported as many as 742 new polymorphic L1 insertions and 403 Alu insertions from a larger set of human tissue samples (Iskow et al., 2010). While some of the L1 polymorphisms reported in the above two independent studies are identical, both studies revealed many unique low frequency L1s present in only one or a few individuals, indicating the very young age of these L1s in the human population.

L1s and Alus are the most active TE families in humans, and de novo germline integrations of both have been reported in association with human diseases. The detection of a de novo L1 insertion in a patient with X-linked retinitis pigmentosa (XLRP) led to the identification of mutations in a novel gene, RP2, which are responsible for the progressive retinal degeneration (Schwahn et al., 1998). In another case, the insertion of Alu sequences in the fibroblast growth-factor receptor 2 (FGFR2) gene appeared to cause Apert syndrome in two patients (Oldridge et al., 1999). As of 2011, at least 51 disease-associated de novo germline TE insertions (including 33 Alus, 15 L1s and 3 others) have been documented in the dbRIP database (http://dbrip.brocku.ca/) (Wang et al., 2006). This clearly demonstrates the ongoing activity and mutagenic effects of LINEs/SINEs in humans.
In addition to germline retrotranspositions, activity of TEs has also been reported in human somatic cells. In 2005, it was successfully demonstrated that engineered human L1s can retrotranspose in adult rat hippocampus progenitor cells in vitro and in the mouse brain in vivo (Muotri et al., 2005). Subsequently, the same group showed that neural progenitor cells, either isolated from human fetal brain or derived from human embryonic stem cells, support the retrotransposition of engineered human L1s in vitro, along with clear evidence of increased copy number of endogenous L1s in several human brain regions compared to other somatic cell types such as liver or heart from the same donor (Coufal et al., 2009). Recently, by applying a high throughput method, Baillie et al. directly showed the high retrotransposition activity of endogenous L1s in the hippocampus and caudate nucleus of three human individuals by revealing more than 7,700 de novo somatic L1 insertions and nearly twice as many Alu insertions in these brain regions. This suggests that somatic genome mosaicism driven by active retrotransposition may reshape the genetic circuitry underlying normal and abnormal neurobiological processes (Baillie et al., 2011). Notably, the massive L1/Alu retrotransposition activity in human brain seems exceptional compared to most other adult tissues, in which TEs are largely repressed. However, in the study by Iskow et al. cited above (Iskow et al., 2010), the authors also identified frequent (30%) somatic L1 insertions in lung tumors, suggesting an escalated TE activity in at least some tumors, and a possibility that new retrotransposition events may be involved in tumorigenesis.

1.4.2  TE activity and polymorphism in mice

Unlike in humans, LTR retrotransposons/ERVs are highly active in mice, causing about 10% of all spontaneous germline mutations (Maksakova et al., 2006).
The retroviral-like Intracisternal A Particle (IAP) and the MusD/Early Transposon (ETn), for example, are two high copy number ERV families that are responsible for most of the insertional germline mutations described in mice. IAP elements have been extensively studied since the early 1980s (Kuff and Lueders, 1988), and appear to cause both germline mutations and oncogene or growth factor gene activation in somatic cells (Druker and Whitelaw, 2004; Wang et al., 1997). ETn elements were originally reported in the early 1980s as a non-coding transposon-like sequence specifically expressed in early embryogenesis (Brulet et al., 1985; Loebel et al., 2004; Shell et al., 1990), and are also capable of causing new mutations. It is now known that ETns represent a non-autonomous partner of the retroviral-like MusD elements (Baust et al., 2003; Mager and Freeman, 2000), which provide the proteins in trans necessary for ETns to retrotranspose (Ribet et al., 2004). For this reason, they are usually referred to as ETn/MusD elements.

Notably, unlike most wild species, laboratory mice were derived from intensive inbreeding of ancestral species of Mus, and have been carefully maintained in artificial living conditions. Thus, the effective population size of each inbred strain is down to only two, leading to a significantly higher probability of fixing random mutations (including TE insertions) in each strain by genetic drift. Due to the lack of heterogeneity within each inbred strain, strain-specific mutations are not subject to selection pressure based on the entire mouse population. Nonetheless, in this thesis I use the population genetics term “polymorphism” to describe strain-specific variants in mice, as is common in other studies of mouse genetic variation (Frazer et al., 2007; Li et al., 2004; Wade and Daly, 2005).
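The effect of such a small effective population size on fixation by drift can be illustrated with the standard neutral-theory result, given here only as a rough sketch. It assumes an idealized diploid Wright-Fisher population and a new, selectively neutral insertion present as a single copy. The symbols $N_e$ (effective population size) and $P_{\mathrm{fix}}$ (fixation probability) are introduced for illustration, and the value $N_e = 10^{4}$ is a purely illustrative figure for a large outbred population rather than an estimate taken from the studies cited here.

```latex
P_{\mathrm{fix}} = \frac{1}{2N_e}
\qquad\Rightarrow\qquad
P_{\mathrm{fix}}\Big|_{N_e = 2} = \frac{1}{4},
\qquad
P_{\mathrm{fix}}\Big|_{N_e = 10^{4}} = 5 \times 10^{-5}
```

Under these assumptions, a new neutral insertion in an inbred line is several thousand times more likely to become fixed than the same insertion arising in a large outbred population.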
According to a list compiled at the end of 2005 (Maksakova et al., 2006), six strain polymorphisms and 26 mutations in mice due to germline insertions of IAPs have been documented, and at least four polymorphisms and 19 mutations have been reported for ETns/MusDs. Genomic hybridization and Polymerase Chain Reaction (PCR) methods have demonstrated that the IAP (Kaushik and Stoye, 1994; Lueders and Frankel, 1994; Lueders et al., 1993) and ETn/MusD (Baust et al., 2002) families are polymorphic among strains. The extent of this variation was unknown in 2006 when I commenced my genome-wide study of ERV polymorphisms in mice (see Chapter 2 for details).

The retrotransposition activity of LINEs/SINEs in mice is also much higher than in humans. The copy number of active mouse full-length L1s estimated by the cell culture assay could be more than 3000, about 30 times the estimate of active human L1s and close to the excess in the proportion of disease-causing L1 insertions in mouse relative to human (2.5% vs. 0.07%, or 35-fold) (Ostertag and Kazazian, 2001). Furthermore, due to their relatively younger age and stronger retrotransposition activity, L1s show a much higher level of polymorphism among different mouse strains. In 2008, by analyzing DNA sequence traces derived from whole genome sequencing (WGS), Akagi et al. reported nearly 7,000 polymorphic L1 insertions present in the C57BL/6J strain but absent from at least one of four other inbred mouse strains (Akagi et al., 2008). In fact, the above total of polymorphic L1s is very likely an underestimate, and the number is expected to increase when sequence information becomes available for additional mouse strains.

In contrast to the higher activity of most other TE types in mice, DNA transposons have been even more highly restricted in this rodent species than in humans.
The analysis of the mouse genomic sequence revealed only four lineage-specific DNA transposons in mice, whereas more than 14 have been identified in humans, most of which were deposited in early primate evolution (Waterston et al., 2002). Since the transposase of DNA transposons usually works in trans during the transposition process (i.e. it can mobilize an element other than the one that encodes the enzyme) and may thus facilitate the proliferation of “dead” elements, DNA transposons intrinsically require periodic horizontal transfer (HT) to refresh their transposition capacity (Feschotte and Pritham, 2007). Based on the observation that DNA transposons are the least active of the four major TE types in both human and mouse, it has been postulated that the highly developed mammalian immune system has probably contributed significantly to suppressing the invasion and amplification of these TEs (Waterston et al., 2002).

1.4.3  TE activity and polymorphism in other mammals

With an increasing number of fully sequenced genomes, it would be interesting to evaluate and compare TE distributions and activities in other mammalian species. A comparison of genomic sequences between human and chimpanzee revealed similar activities of LINEs in the two primate species, but a significant difference for Alu SINEs (Mikkelsen et al., 2005). In contrast to the ~7,000 lineage-specific Alu insertions in the human, the chimpanzee genome contains ~2,300 such insertions, indicating either an escalated Alu activity in human or a declined Alu activity in chimpanzee since the time of species divergence. Recently, the availability of the orangutan genomic sequence provided an opportunity to calibrate the above analysis of lineage-specific Alu insertions in each primate species.
Surprisingly, while the TE content in general was basically consistent with the previous findings for human and chimpanzee, only ~250 orangutan-specific Alu insertions were identified (Locke et al., 2011), indicating that Alu activity has been strongly limited within the orangutan genome since its divergence from the other two primates. However, an LTR retrotransposon family named Pan troglodytes endogenous retrovirus (PtERV1) was identified with nearly 200 copies in chimpanzee and several other primate species, but not in human (Mikkelsen et al., 2005). Unlike HERV-K, the youngest ERV family in human (with a majority being only solitary LTRs resulting from LTR-LTR recombination), PtERV1s in chimpanzee are much more homogeneous, and more than half of them are still full-length. Until Kaiser et al. showed a link between PtERV1 and the human TRIM5α antiviral protein (Kaiser et al., 2007), it had been a conundrum as to why other primates, but not humans, harbor this young ERV family. This is discussed further in the next section.

In addition to primates, genomic studies of other mammalian species distantly related to humans could help provide a more complete picture of TE activities in mammals. The identification of both the endogenous koala retrovirus (KoRV) in the germline and the exogenous version of the same virus in the peripheral blood mononuclear cells (PBMCs) of diseased koalas, for instance, illustrates an ongoing bombardment of retroviral infection and endogenization in these already endangered mammals (Stoye, 2006; Tarlinton et al., 2006). KoRV was originally identified as an endogenous retrovirus, but its competency in producing viral particles and its high sequence similarity to an exogenous retrovirus called gibbon ape leukemia virus (GALV) indicate that it is more likely a retrovirus that was endogenized only recently.
Indeed, geographical studies of koalas carrying KoRV (Tarlinton et al., 2006) showed a highly variable distribution of this ERV within the whole koala population, with 100% invasion of all koalas living in the northeast of Australia, and none on Kangaroo Island (a small isolated island located off southern Australia). Based on the fact that the koala subpopulation on Kangaroo Island was only established in the early twentieth century, it seems that the invasion of KoRV into the koala genome happened only within the last 100 years.

Like KoRVs in koalas, the recent or even ongoing amplification of a large number of DNA transposons discovered in bats was another major surprise with respect to TE activities in mammals. Based on the results of initial analyses of the human and mouse genomic sequences (Lander et al., 2001; Waterston et al., 2002), it was widely assumed that DNA transposons essentially died out in mammalian genomes tens of millions of years ago. However, when Ray et al. recently examined the genomic sequence data of a distant mammalian lineage, the bat genus Myotis, they identified six hAT-like DNA transposon families showing low sequence divergence among individual copies and polymorphisms between different Myotis species, suggesting their relatively young age (6.4 – 15 Myr) and their recent activity in bats. Further studies on these elements, as well as other bat DNA transposons, showed multiple waves of recent DNA transposon activity in these flying mammals (Ray et al., 2008), including some very recent and probably currently active piggyBac-like transposons, and a massive amplification of two Helitron DNA transposon families that reached at least 3% of the bat M. lucifugus genome (Pritham and Feschotte, 2007).
1.5  Effects of TE Integration in the Host Genome

Although most mammalian TEs are neutral components of the genome with no significant biological effects (Brouha et al., 2003; Mills et al., 2007), some elements do impact the cell/organism by acting as insertional mutagens, inducing DNA rearrangements, assuming cellular functions, and altering gene regulation (Batzer and Deininger, 2002; Cordaux and Batzer, 2009; Maksakova et al., 2006; Mills et al., 2007). Not surprisingly, such mutations are usually associated with various types of genetic disorders, but sometimes may also contribute to innovative changes that are beneficial to the host organisms. In this section I review some of the common mechanisms observed in TE mutagenesis, as well as potential roles that TEs might have played during the host evolution.

1.5.1  TE-mediated physical damage at the DNA level

Perhaps the most intuitive effect of TE integration in the host genome is the increased genome size. Primarily as selfish genomic parasites, these mobile DNA elements continuously replicate themselves, producing more and more new copies within host genomes. Although each host species may bear a different load of TEs, the ancient origin and ubiquitous presence of these repetitive elements suggest a global genome expansion for most of today’s living creatures. Indeed, a genome size comparison of 12 eukaryote species revealed a nearly linear relationship between the size of the host genome and TE content (Kidwell, 2002), implying non-coding repetitive DNA as an important contributor of genome size. Consistent with this observation, genomic comparisons of multiple species have also shown that the increase of genome size is an ongoing process in many species including the human, which has already accumulated ~2,000 L1, ~7,000 Alu and ~1,000 SVA SINE copies over the past 6 Myr (Mikkelsen et al., 2005).
However, in some non-mammalian species only a small proportion of the genome is occupied by TEs, especially in those with a compact genome of less than 500 Mb (Kidwell, 2002). Most of these small-genome species still contain many TE families, but the copy number of each TE family is much lower compared with larger genomes (Kidwell, 2002). Taking the fruit fly as an example, the 117 Mb fully sequenced euchromatic portion comprises two thirds of the entire 180 Mb Drosophila melanogaster genome, but only ~1500 full or partial TEs have been identified, corresponding to just 4% of this genomic portion (Adams et al., 2000). Further studies of Drosophila TEs revealed more than 90 distinct TE families (i.e. ~16 copies/family), with the largest family reaching a total copy number of only 146 (Celniker and Rubin, 2003). Indeed, higher DNA deletion rates in general have been shown for both fly and puffer fish (only 2.7% of the 365 Mb puffer fish genome is occupied by TEs (Aparicio et al., 2002)), which potentially explains the relatively compact genome size of both species.

Insertional disruption of coding sequences (i.e. exons) is one of the most straightforward mechanisms by which TEs may physically influence the functionality of host genes. In most cases, the affected genes are only able to produce truncated or nonsense RNA templates, leading to a severe dysfunction of the gene. A growing number of human genetic disorders caused by de novo TE insertions have been documented, with Apert syndrome, cystic fibrosis, neurofibromatosis, muscular dystrophy and breast and colon cancers as a few of many examples (Belancio et al., 2008; Chen et al., 2005; Deininger and Batzer, 1999).
Sometimes, the inserted TE also concomitantly brings extra flanking sequences to the integration site (a process known as “TE-mediated transduction”), whereas in some other cases, retrotransposition-mediated deletions from 1 bp to several hundred kb of the target DNA sequence can occur. Such genomic rearrangements have been clearly demonstrated by both TE retrotransposition assays in cultured cells (Gilbert et al., 2002; Symer et al., 2002) and genomic comparisons between multiple host genomes (Callinan et al., 2005; Han et al., 2005), and instances related to disease have also been reported (Callinan and Batzer, 2006; Mine et al., 2007; Solyom et al., 2012). A recent study of a Japanese boy with Duchenne muscular dystrophy (Solyom et al., 2012) revealed an insertion of a 212 bp non-coding unique sequence from 11q22.3 plus a 115 bp poly(A) tail in exon 67 of the patient’s Dystrophin gene on Chromosome X. Remarkably, this turned out to be a 3′ transduction mediated by a polymorphic L1 that has only 6% allele frequency in Japanese people and had never been previously detected. Since the 3′ transduction of the downstream unique sequence was coupled with severe 5′ truncation during TPRT, no L1 sequence could be found at the insertion site at all, a phenomenon known as “orphan transduction”.

In addition to the above-mentioned TE-mediated physical alterations of the target DNA, non-allelic or unequal homologous recombination involving existing TE loci is another major mechanism causing physical damage/rearrangements commonly seen in TE-mediated mutagenesis (Burwinkel and Kilimann, 1998; Cordaux and Batzer, 2009). Through ectopic recombination between non-allelic homologous elements, various types of genomic rearrangements (including deletions, segmental duplications and gene inversions) can occur within the same chromosome or between different chromosomes.
For example, Alu recombination-mediated deletions (RMDs) in humans have been recognized for decades, and many dozens of cases related to various genetic disorders and cancers have been reported (Cordaux and Batzer, 2009; Deininger and Batzer, 1999). Further, genomic comparisons between the human and chimpanzee genomes revealed 492 Alu and 73 L1 RMDs in the human genome since the divergence between the two primate species, including hundreds of cases located within genes, a few of them involving deletions of exons (Han et al., 2008; Sen et al., 2006). Similarly, genome-wide chromosomal inversions have been investigated by comparative genomics, and at least 44% of the 252 inversions found in the human and chimpanzee genomes are related to either L1 or Alu elements (Lee et al., 2008a). Genomic segmental duplications are also often found associated with TEs. In a genome-wide study of nearly 10,000 segmental duplication junctions, Alus were found to be highly overrepresented within these regions (27% of segmental duplication junctions terminate within Alus), and seem to have contributed significantly to such genomic rearrangements in the human genome during the past 30 – 40 Myr (Bailey et al., 2003).

1.5.2  Transcriptional influence of TE sequences on host genes

Being equipped with native/cryptic transcriptional regulatory signals such as transcription factor binding sites (TFBSs) and splice/polyadenylation signals, TEs serve as a unique reservoir of regulatory elements for host gene expression. It is now well appreciated that TEs may work as alternative promoters or enhancers to drive transcription of nearby genes (Cohen et al., 2009; Thornburg et al., 2006), or facilitate alternative splicing or polyadenylation of mRNA transcripts (Lee et al., 2008b; Lev-Maor et al., 2008; Sorek et al., 2002).
During the past decades, accumulating evidence has shown that TEs have been involved in shaping gene regulatory networks in the host genome, though many of the mechanisms underlying TE-gene dynamics remain poorly understood (Feschotte, 2008).

In order to move around and proliferate in the host genome, autonomous TEs usually encode their own proteins critical for transposition. In most cases, these TE-encoded protein products are transcribed in the same way as cellular genes but under the control of TE internal promoters, which sometimes can also promote the transcription of a nearby gene (Figure 1.4A). Sequence mutations, on the other hand, may lead to the creation of new TFBSs in a TE, which again may act as an alternative promoter or enhancer and drive gene expression. An LTR-derived alternative promoter of the human β1,3-galactosyltransferase 5 (β3Gal-T5) gene, for example, was shown to be the dominant promoter in the colon (Dunn et al., 2003), indicating that this ERV-derived element may play an important role in tissue-specific regulation of the gene. In another example, an alternative promoter derived from a HERV-P LTR contributes, specifically in testis, ~12% of the normal total transcription of the human gene encoding the neuronal apoptosis inhibitory protein (NAIP), whereas an unrelated LTR promoter in rodents confers constitutive expression of the orthologous gene (Romanish et al., 2007).

Genome-wide bioinformatics studies also revealed numerous examples of TE-derived alternative promoters. A 2003 computational survey of the human and mouse mRNA databases showed that more than 27% and 18% of genes, respectively, have at least one mRNA with TE sequence in either the 5′ or the 3′ untranslated regions (UTRs) (van de Lagemaat et al., 2003).
While it had previously been shown that the human CYP19 gene, which encodes the aromatase P450 (a key enzyme in estrogen biosynthesis), has tissue-specific expression in placenta, it was not until the above-mentioned bioinformatics study that this expression was recognized as being driven by an LTR promoter sequence that had integrated into the genome during early primate evolution. Strikingly, the same study discovered a total of 16 such examples of the apparent usage of TE-derived promoters that had not been appreciated before. Recently, based on high-throughput sequencing data, Conley et al. discovered more than 50,000 ERV-initiated transcripts in the human genome, and demonstrated a total of 114 transcription start sites (TSSs) located within ERV sequences that contribute to the transcription of 97 genes (Conley et al., 2008). Notably, while TE-derived promoters have been found for various types of TEs, ERV/LTR elements are the dominant type due to the abundance of cryptic TFBSs in their sequences. A more complete assessment of this phenomenon was given by Cohen et al. in a 2009 review (Cohen et al., 2009), in which both experimental and bioinformatic results were summarized and thoroughly discussed.

Alternative splicing and polyadenylation are also common mechanisms by which TEs alter the normal transcription of host genes (Figure 1.4B, C). It has long been appreciated that Alu elements are frequently involved in alternative splicing and a process known as exonization (Hasler and Strub, 2006; Keren et al., 2010; Sorek et al., 2002), during which part of an Alu sequence is incorporated into the mature mRNA transcripts of a host gene. A comparative analysis of 1,176 alternatively spliced and 4,151 constitutively spliced internal exons in the human genome has shown that more than 5% of the former exon set contain Alu repeats, whereas none of the latter do (Sorek et al., 2002).
A further study revealed that the left but not the right arm of the Alu dimer sequence seems to be critical to promote alternative splicing and Alu exonization events, probably due to the weaker splice signals and lower density of exonic splicing regulatory elements (ESRs) (Gal-Mark et al., 2008). Furthermore, it has been experimentally demonstrated that two Alu elements inserted into an intron in opposite orientation undergo base-pairing and RNA editing, which, in turn, affects the splicing patterns of the downstream exon by shifting it from constitutive to alternative (Lev-Maor et al., 2008). Indeed, a recent computational study of exonized Alus vs. their non-exonized counterparts revealed multiple features which significantly influence the probability of an Alu sequence being recognized as an exon by the splicing machinery, including splice signal strength, density of exonic splicing enhancers (ESEs) and silencers (ESSs), length of flanking introns, and more interestingly, the stability of mRNA secondary structures (Schwartz et al., 2009).

Figure 1.4 Transcriptional influences of TEs. TE insertions are depicted as thick green arrows. Blue boxes are exons of a hypothetical gene. Black arrows are transcription start sites. Red dotted lines show splicing of the mRNA transcripts of the gene. A) TEs as alternative promoter. The TE promoter upstream of the gene may contribute to tissue-specific transcription, and the TE promoter within an intron may give rise to a truncated form of the protein product. B) Alternative splicing by TEs. The intronic TE insertion is exonized and may also cause exon skipping. C) Premature polyadenylation by TEs. The gene transcription terminates at an intronic TE in the middle of the gene.
Alternative splicing has also been reported for many other TE types, such as LINEs and LTR retrotransposons (Belancio et al., 2006; Maksakova et al., 2006), and genome-wide computational assessments have identified hundreds of alternative splicing events involving TEs in human genes (Kim et al., 2010). Moreover, like TE-induced alternative splicing, the cryptic polyadenylation signals carried by many TEs may cause premature termination of normal gene transcription, which has also been widely documented and studied for various TE types in mammals (Lee et al., 2008b; Maksakova et al., 2006; Perepelitsa-Belancio and Deininger, 2003).

1.5.3  Epigenetic Effects of TEs

Epigenetic changes such as DNA methylation, histone modifications and small RNA targeting are used by the host cell as a fast and efficient mechanism to inhibit TE expression, and can repress the transcription of various types of TEs (Leung and Lorincz, 2011; Slotkin and Martienssen, 2007; Yoder et al., 1997). However, such an epigenetic defense may sometimes cause “side-effects” on normal gene transcription, and TEs may also escape repression and change the expression pattern of nearby genes due to the reversible nature of epigenetic modifications.

One fascinating example of TE methylation influencing gene expression is the Avy allele generated by an IAP ERV insertion near the mouse agouti locus (Duhl et al., 1994). While the wild-type allele (a) of the agouti gene in C57BL/6 mice produces a normal black coat color, mutant mice possessing an Avy allele show a varying color range from yellow to agouti. Detailed analyses of this allele revealed that an IAP LTR-retrotransposon, which is absent in wild-type C57BL/6 mice, is located in antisense orientation 100 kb upstream of the conventional TSS of the agouti gene and works as an alternative promoter for the gene, causing ectopic expression of the agouti protein and resulting in yellow fur, obesity, diabetes and increased susceptibility to tumors.
Interestingly, the variable coat color of the mutant mouse is correlated with the methylation state of the IAP element (i.e. hypomethylated – yellow; hypermethylated – agouti), which displays epigenetic inheritance following maternal but not paternal transmission (Morgan et al., 1999). Moreover, a further study showed dynamic reprogramming of DNA methylation of this allele in early mouse development, during which the maternally inherited allele is demethylated much later than its paternal counterpart, with methylation being reestablished later in mouse embryogenesis (Blewitt et al., 2006). However, the lack of methylation of the maternally inherited allele at a certain stage during mouse development indicates that DNA methylation may not be the direct mechanism underlying its inheritance. The fact that haplo-insufficiency of a polycomb protein induces the epigenetic inheritance of the paternally derived Avy epiallele suggests the involvement of other epigenetic modifiers (e.g. histone modifications) that regulate chromatin state and DNA methylation (Blewitt et al., 2006).

Intimately related to DNA methylation, chromatin state change induced by histone modifications is also linked to TE repression (Leung and Lorincz, 2011) and, in certain circumstances, may also influence nearby genes. It has been proposed that LINEs are related to imprinting and X chromosome inactivation (Allen et al., 2003; Lyon, 2006), and examples of repressive chromatin spreading caused by TEs have been shown in both plants and animals (Martin et al., 2009; Rebollo et al., 2011; Sun et al., 2004), although reported cases are rare. Using strain-specific polymorphic ERV insertions in hybrid mice (reported in Chapter 2), our lab’s recent collaborative study of the epigenetic effects of mouse ERVs showed a general trend of heterochromatin formation induced by IAP elements, and at least one apparent example of heterochromatin spreading from an IAP into a nearby gene (Rebollo et al., 2011).
Specifically, based on a genome-wide assessment using chromatin immunoprecipitation sequencing (ChIP-seq), we showed a significant enrichment of the heterochromatin mark H3K9me3 around common IAP insertion sites in two ES cell lines derived from different mouse strains. The same analysis of polymorphic IAPs present only in one of the two strains clearly demonstrated a TE-induced enrichment of H3K9me3 flanking the polymorphic IAPs, but not at the orthologous loci in the mouse strain in which these IAPs are missing. More interestingly, the same study also reported an apparent case of TE-induced heterochromatin spreading into a neighboring gene, in which the H3K9me3 mark induced by a solitary antisense IAP LTR extends into the promoter region of the downstream beta 1,3-galactosyltransferase-like (B3galtl) gene, causing a decrease in the expression of this gene. In comparison, the mouse strain lacking this IAP insertion showed neither heterochromatin spreading nor impaired gene expression. To eliminate any possible genetic background differences between the two mouse strains, a hybrid strain was used for allelic methylation and expression analyses, which confirmed the IAP as the cause of heterochromatin spreading and down-regulation of the neighboring gene. It is believed to be the first such case involving ERVs documented in mammals.

The recently appreciated post-transcriptional gene regulation by small RNAs such as microRNAs (miRNAs) and small interfering RNAs (siRNAs) may also have originated from ancient cellular defense against virus infection and TE transposition (Obbard et al., 2009).
These short RNA molecules are about 20 – 30 bp in length, usually derived from double stranded RNA (dsRNA) fragments cleaved by the RNase III enzyme Dicer, and can interact with Argonaute proteins to form the RNA-induced silencing complex (RISC), which can bind to mRNA targets complementary in sequence to the small RNA and cause either target mRNA degradation or translational inhibition (Almeida and Allshire, 2005; Okamura, 2011). While TEs are usually targets of RNA silencing, it has been shown that these repetitive sequences can also be used as a source of small RNA biogenesis, and may repress the transcriptional activity of both TEs and cellular genes (Watanabe et al., 2006; Yang and Kazazian, 2006). Indeed, a computational study of human miRNAs showed that ~12% of annotated human miRNAs are completely derived from sequences of various TE families (Piriyapongsa et al., 2007), which might further regulate the transcription of thousands of host genes.

1.6  Transposable Elements and Host Evolution

1.6.1  TEs vs. the host genome: an everlasting battle

The arms race between TEs and their hosts has never stopped during the long course of evolution. Essentially as genomic parasites, TEs need to survive in the host genome by continuously generating new copies and escaping genomic silencing. Since TE activity could be highly deleterious, the host organisms have developed various molecular mechanisms to limit the transposition and proliferation of these “jumping genes”.

Insertional tolerance can be considered a passive solution that the host genome uses to relieve itself from the pressure of genomic TE bombardment. Generally, TE insertions may happen anywhere in the host genome, but presumably only a small fraction of such insertions become fixed either by random genetic drift or, in rare instances, due to positive selection for retaining the TE.
As mentioned earlier, organisms with relatively small genome size usually contain a small proportion of fixed TEs, presumably due to stronger negative selection, higher deletion rate or more stringent TE integration control (see Section 1.5.1). On the other hand, many larger genomes (e.g. the mammalian genome) are filled with TE sequences that are located mostly within heterochromatin or intergenic regions, where TEs are presumably less likely to have any detrimental effects on host genes. Even when TEs do insert into genes, the intron-exon structure of genes ensures a much higher chance of acquiring TE insertions in introns rather than in protein coding exons. As a consequence, the larger the proportion of the host genome covered by TEs, the lower the probability of having new TE insertions in critical regions (e.g. exons). However, the insertional tolerance described above may only be efficient for genomes that are already large enough and contain many TEs and/or non-functional DNA; otherwise, the probability of deleterious TE insertions would be too high to be outweighed by the potential benefits obtained from relaxing the control of genome size.

In addition to passive insertional tolerance, epigenetic silencing is an important active mechanism used by the host to effectively control TE activities. As noted above, DNA methylation and histone modifications can change the regional chromatin into a repressive state so that the embedded TEs can no longer be transcribed. This largely limits new TE transpositions and their interference with host gene regulation. However, this mechanism raises a new challenge to the host genome. When a TE is epigenetically silenced by the state change of local chromatin, the epigenetic marks associated with the target TE may spread into nearby genes and cause deleterious side effects on regular gene expression.
Indeed, a study in the plant Arabidopsis thaliana showed a negative correlation between the density of methylated TEs (mTEs) and the expression level of nearby genes. Population genetic analysis in the same study found that mTEs near genes, but not their unmethylated counterparts (uTEs), occur at lower frequencies than neutral variants (e.g. synonymous single nucleotide polymorphisms), suggesting purifying selection against mTEs (Hollister and Gaut, 2009). This intriguing study clearly demonstrated an evolutionary trade-off between reduced TE activity and potential side effects on nearby genes when DNA methylation is used for TE silencing. To date, the spreading of epigenetic marks from silenced TEs into nearby genomic regions in mammals has been robustly confirmed for at least the B1 SINE and IAP LTR retrotransposon elements in mice, though cases of spreading into nearby genes are rare (Rebollo et al., 2011; Yates et al., 1999). While it is still unclear how the host genome prevents the deleterious spreading of epigenetic silencing marks, some evidence points to a buffer zone between the silenced TE and its nearby gene, suggesting that unknown insulators might be involved (Rebollo et al., 2011; unpublished data in the Mager lab).

Besides DNA methylation and histone modifications, many host organisms can also efficiently inhibit TE activity via small RNA interference pathways involving miRNAs, siRNAs and piRNAs. Among them, the piRNA pathway is particularly interesting for the following reasons: 1) piRNAs are mainly produced from TE sequences; 2) their major targets are TEs (Khurana and Theurkauf, 2010). This is a spectacular example of how the host genome may use existing TEs to prevent new TE transpositions. Notably, the piRNA pathway is only available in germ-line cells, in which other mechanisms of TE inhibition such as DNA methylation are often weak or absent, at least during certain developmental stages (Sasaki and Matsui, 2008).
Since only in the germ line can newly transposed TEs be passed to the next generation of the host organism, the maintenance of genetic integrity and genomic stability is even more important in germ cells. For this reason, piRNA silencing is sometimes referred to as “the vanguard of genome defense” (Siomi et al., 2011).

Although TEs are largely considered endogenous and thus usually do not induce any immunological response, evidence has indicated that the intrinsic immunity of the host cell could be involved, at least for TEs with an exogenous origin (e.g. ERVs). A classical example is the restriction of an extinct retrovirus by the human TRIM5α antiviral protein shortly after the divergence between human and chimpanzee (Kaiser et al., 2007). The Pan troglodytes endogenous retrovirus, or PtERV1, is present in more than 100 copies in both the chimpanzee and gorilla genomes, but it is absent in humans (Yohn et al., 2005). While phylogenetic analysis strongly suggests that the invasion of exogenous PtERV1 into multiple primate species occurred about 3-4 Myr ago (Yohn et al., 2005), it was not understood why the pandemic apparently did not affect humans, who by then were cohabiting with other Old World primates. To answer this question, Kaiser and colleagues studied the relationship between PtERV1 and the TRIM5α protein in a panel of selected primates, and found intriguing results (Kaiser et al., 2007). For example, their sequence analysis showed an arginine (R) residue at position 332 of the human TRIM5α protein, while the hominoid ancestral sequence carries a glutamine (Q) at the same position. When these researchers resurrected the silenced PtERV1 by reconstructing its DNA from the consensus sequence, they tested the restriction power of TRIM5α from a panel of primate species and found only the human version of this antiviral protein to be highly effective against the infectious PtERV1 retrovirus.
When the arginine at position 332 of the human TRIM5α protein was replaced by a glutamine (R332Q), however, the restriction of PtERV1 by human TRIM5α was significantly reduced. More interestingly, when challenged with the human immunodeficiency virus type 1 (HIV-1), the human TRIM5α carrying the R332Q mutation reduced HIV infectivity significantly more than the wild-type TRIM5α bearing R332. It appears that the ability to repel infection by PtERV1 became fixed in the human population, accounting for the lack of endogenous PtERV copies in the human genome. This TRIM5 genetic variation, which was advantageous against PtERV, now renders humans susceptible to the modern virus, HIV.

1.6.2  TE exaptation: turning “junk” into “gold”

“Ex-aptation” is a term introduced by the paleontologists Stephen J. Gould and Elisabeth Vrba to describe the evolutionary scenario in which biological features that now enhance the fitness of an organism were not built by natural selection for their current role (Gould and Vrba, 1982). Exaptation can be further classified into two categories: 1) functional shift (or type 1 exaptation), which refers to the reuse by natural selection of a character that previously served a different purpose; 2) functional cooptation (or type 2 exaptation), which describes the situation where a character whose origin cannot be linked directly to natural selection (i.e. a “non-aptation”, as coined by Gould and Vrba) is co-opted for a current use (Gould and Vrba, 1982). Unlike the widely (and often mistakenly) used term “ad-aptation”, which refers to the evolutionary development of a brand new feature by natural selection, “ex-aptation” did not receive sufficient attention until recently, but has now become an important subject in the study of evolution.

In the context of TE-host relations, exaptation has played critical roles during host evolution. When a TE is “exonized”, i.e.
incorporated into the mature mRNA of a cellular gene as an exon, it may become part of the coding sequence and alter the protein product that the gene makes. If the new protein product is not only functional but also beneficial to the host, it may be positively selected and eventually fixed in the host population (i.e. type 2 exaptation). It is noteworthy, however, that not all exonization events lead to exaptation. Alu SINEs, for instance, are well known as frequent exonization targets, but in a genome-wide study of Alu exonization in primates, Krull et al. identified only 153 human chromosomal loci where Alu elements were conceivably exonized (Krull et al., 2005). Further examination of these cases showed that a large proportion were lost again in some descendant primate lineages after exonization. The authors proposed that such dynamic exonization of Alus was due to the relatively young age of Alu elements (~60 Myr) and, indeed, a similar study of the mammalian-wide interspersed repeat elements (MIRs) revealed a much higher extent of stable exonization for this older SINE family (~130 Myr) (Krull et al., 2007). A closer examination of five selected MIR-derived exons showed that four of these exonized TEs contribute only a fraction of the total gene transcripts as alternative splice forms, with the remaining case (in the gene ZNF639) being constitutively spliced in all mammals tested. The studies cited above suggest that true exaptation of a TE may need a long evolutionary time to either test or obtain all the mutations required for a beneficial role, which helps it withstand the pressure of being lost during evolution.

More dramatically, exaptation of TE sequences that encode proteins may sometimes give rise to brand new genes that can significantly benefit the host organism, a phenomenon known as “domestication”.
One fascinating example of TE domestication is the Syncytin genes, which perform crucial functions in placenta formation and were adopted from endogenous retroviral elements independently several times during the mammalian radiation (Dupressoir et al., 2005; Heidmann et al., 2009; Mi et al., 2000; Vernochet et al., 2011). In humans, HERV-W is an endogenous retrovirus family specifically expressed in the placenta, and analyses of HERV-W mRNA from this tissue have revealed the expression of an intact open reading frame (ORF) of the retroviral envelope gene from one specific HERV-W copy. Since the life cycle of ERVs is intracellular and usually does not need the retroviral envelope protein for virion secretion and reinfection, the identification of functional transcripts from the HERV-W envelope gene (named “Syncytin 1”) suggests positive selection. Indeed, in situ hybridization experiments clearly showed that Syncytin is a highly fusogenic membrane glycoprotein, and its expression in placenta is crucial in cell fusion and syncytia formation (the generation of multinucleated giant cells) (Blond et al., 2000; Mi et al., 2000). Shortly after the identification of Syncytin 1, a genome-wide screen of fusogenic ERV envelopes in primates identified Syncytin 2, a conserved retroviral envelope gene that was co-opted from an independent ERV family, HERV-FRD, and is also involved in placental morphogenesis in primates. Similar genome-wide screens of ERV envelope genes in other mammals such as the mouse (Dupressoir et al., 2005) and the rabbit (Heidmann et al., 2009) revealed multiple examples of ERV-derived genes that originated from independent exaptation events, all of which are convergently involved in placenta formation (Malik, 2012).

In addition to directly contributing to the reservoir of protein-coding sequences of the host genome, more TE exaptation events happen at the regulatory level.
As mentioned earlier in this chapter, many TEs carry cryptic or ready-to-use transcriptional signals (e.g. TFBSs) in their sequences, and numerous cases of TE-derived promoters and enhancers that may contribute to the regulation of normal host gene functions have been reported. Indeed, our laboratory has recently created a web site to catalog the growing number of such cases (http://sites.google.com/site/tecatalog/welcome). More importantly, the abundance and wide dispersion of TEs in the host genome may lead to significant rewiring of various gene regulatory pathways at a network level. An examination of genome-wide ChIP-seq data for p53 showed that about one third of the identified p53 binding sites occur within human ERV sequences (Wang et al., 2007). Similarly, another genome-wide ChIP-seq study of three key regulatory proteins (OCT4, NANOG and CTCF) in human and mouse ES cells (Kunarso et al., 2010) showed that repeat-associated binding sites (RABSs) contribute up to 28% of all mapped regions, and presented multiple lines of evidence for the functional relevance of these RABSs. It is worth remarking that, while both OCT4 and NANOG are conserved between human and mouse and regulate similar target genes, only 5% of their binding sites are homologously occupied in the two species. By examining the origin of the RABSs of these regulatory proteins, the authors found most to be located in species-specific TEs and, as a result, revealed a group of human-specific target genes apparently rewired into the regulatory networks of OCT4 and NANOG via species-specific RABSs. All the above studies, plus the fact that thousands of conserved non-coding regions (presumably functional) in mammalian genomes are derived from TEs (Lowe et al., 2007), clearly demonstrate the importance of TEs to host gene regulation in mammals. This strongly implies a beneficial role played by these once-considered “selfish” genomic parasites during host evolution.
1.7  Thesis Objectives

The overall objective of this thesis work has been to study the genome dynamics of mammalian TEs and their relationships to host genes using computational approaches. My work can be divided into three relatively independent projects, which form the foundations of the following three chapters.

Chapter 2 describes my study of genome-wide identification and evaluation of polymorphic ERV/LTR retrotransposons in mice. Although it had been recognized that ERVs in mice were highly active and caused ~10% of all reported germ-line spontaneous mutations, the level of ERV polymorphism among lab mouse strains was unclear at the time I initiated this project. Using publicly available data comprising millions of short mouse genomic DNA sequences with an average size of around 800 bp, I designed a computational algorithm that identified both common and polymorphic insertions of the ETn/MusD and IAP elements, the two most active ERV families in mice, in four inbred mouse strains. My comparison of the genomic distributions of common and polymorphic ERVs was based on the hypothesis that if polymorphic ERVs generally represent younger elements not fixed in the host population, they should show a distribution pattern closer to the one expected by chance, which would imply ongoing selection on these polymorphic elements. With the aid of lab colleagues, I also investigated the transcriptional effects of selected strain-specific ERV insertions.

Chapter 3 describes my study of intronic distributions of TEs in mammalian genes. While many genomic studies had been done on global TE distributions in various types of host organisms, there was no comprehensive investigation of TE distributions within genes at the time I started this project. Previous studies had shown that TEs were largely underrepresented in genic regions, very likely due to strong purifying selection.
However, I was intrigued by the question of why, while many intronic TE insertions can be highly mutagenic and have been selected against, numerous fixed TEs reside within introns and apparently do no harm. Examining intronic distributions of the four major types of TEs in human and mouse, I identified genomic features that may influence the retention probability of de novo TE integrations in gene introns. To test whether natural selection is the major driving force underlying the TE distribution pattern in genes, I compared the intronic distribution of fixed (and thus presumably harmless) TEs with that of either the mutagenic intronic TE insertions reported in the literature or the intronic polymorphic ERVs (presumably still under selection) identified in Chapter 2, to determine whether a shift in distribution patterns could be revealed. In addition, I sought assistance from my colleagues to use experimental approaches to evaluate the transcriptional effects of selected TEs.

Chapter 4 describes my study of the properties of genes that have either very high or very low TE densities, as well as the relevance of these gene properties to the accumulation of TEs in introns. Although previous studies had touched on the properties of genes with extreme TE densities (i.e. the outlier genes), none had examined genes sharing the same TE content extremity in multiple species. Using orthologous genes in three mammalian species (human, mouse and cow), I computationally identified two sets of genes that have either extremely high or extremely low TE density in all three species. With these two highly reliable gene sets in hand, I considered whether common properties could be found for each gene set and what the biological relevance would be.
Based on the hypothesis that the extreme density of TEs might be a consequence of natural selection, I examined gene properties such as chromosomal distribution, biological function and conservation level of both gene sets, to see whether any difference exists between the two groups. Since it had long been hypothesized that genes with an open chromatin state in either embryonic stem (ES) cells or germ-line cells are more prone to heritable TE insertions (i.e. all TEs that we can observe in the host genome today), I searched for evidence to support that hypothesis by examining the polymerase II occupancy and histone modification status of the shared outlier genes in mouse ES cells.

Chapter 2: Identification and Investigation of ERV Polymorphisms in Mice

A version of Chapter 2 has been published: Ying Zhang, Irina A. Maksakova, Liane Gagnier, Louie N. van de Lagemaat, Dixie L. Mager (2008). "Genome-wide assessments reveal extremely high levels of polymorphism of two active families of mouse endogenous retroviral elements." PLoS Genet 4(2): e1000007.

2.1  Background

The laboratory mouse is the model of choice for mammalian biological research, and a plethora of mouse genomic resources and databases now exist (Peters et al., 2007). Notably, fueled by the availability of genomic sequence for the common strain C57BL/6J (B6) (Waterston et al., 2002), several research groups have documented genetic variation among strains using single nucleotide polymorphisms (SNPs) (Frazer et al., 2007; Wade and Daly, 2005; Yang et al., 2007). Surveys of mouse polymorphism due to segmental duplications or copy number variations have also recently been published (Graubert et al., 2007; Li et al., 2004). Such resources are invaluable in trait mapping, tracing strain origins, and genotype/phenotype studies. However, genome-wide studies to document other types of genetic variation have been lacking.
ERVs/LTR retrotransposons are known to be highly active in inbred mice, causing ~10% of spontaneous mutations (Maksakova et al., 2006), but relatively little is known about the level of polymorphism of such sequences (Chapter 1, Section 1.4). Southern blotting and extensive genetic mapping have clearly demonstrated that ERVs related to murine leukemia virus (MLV) are highly polymorphic (Boeke and Stoye, 1997; Frankel et al., 1990; Stoye and Coffin, 1988), but such techniques are feasible only for low copy number ERVs, which constitute a very small fraction of the ERVs and LTR retrotransposons in the mouse genome. Due to limitations of the array-based technology employed, the largest mouse polymorphism study, performed by Perlegen, focused only on SNPs in non-repetitive genomic regions and was not designed to detect insertional ERV polymorphisms (Yang et al., 2007).

Compared to a single nucleotide difference, genetic variation due to insertion of an ERV obviously has a much greater probability of affecting the host (Chapter 1, Section 1.5). The phenotypes of most mouse germ-line mutations caused by ERV insertions result not from simple physical disruption of coding regions (although this does occur), but from transcriptional abnormalities mediated by ERVs located in introns or near the affected genes (Maksakova et al., 2006). Further, it is well appreciated that retroviruses can activate oncogenes or growth control genes, leading to malignancy (Boeke and Stoye, 1997; Kung et al., 1991; Rosenberg and Jolicoeur, 1997), and, indeed, are used as tags to identify genes involved in cancer (Dudley, 2003; Theodorou et al., 2007). Determining the extent of mouse ERV polymorphism, therefore, is critical to understanding how ERVs contribute to diversity and disease susceptibility among inbred strains.
The retroviral-like IAP and ETn/MusD families are two high copy number ERV families responsible for most of the insertional germ-line mutations described in mice, and have been postulated to be highly polymorphic among inbred mouse strains (Chapter 1, Section 1.4.2). The goal of the study presented in this chapter was to quantitatively assess the genome-wide polymorphism levels of the IAP and ETn/MusD families, and to identify those polymorphic ERVs with the highest probability of affecting host genes. By comparing only the few strains for which sufficient genomic sequence is available, I found high levels of insertional polymorphism for both the IAP and ETn/MusD families. Moreover, I detected 695 polymorphic members of these ERV families located within genes, and found evidence that some of these affect gene transcription. Such polymorphisms represent a substantial source of genetic variability among inbred strains, and may play a major role in strain-specific traits.

2.2  Results and Discussion

2.2.1  Prevalence of ETn/MusDs and IAPs in different strains

As the first step in assessing ERV polymorphisms in mice, I surveyed the overall copy numbers of IAP and ETn/MusD elements in the well-sequenced, assembled B6 genome using BLAST (see Section 2.4.2 of this chapter for details). For the IAP family, I detected 2,595 full-length or partly deleted elements, plus 2,477 solitary LTRs, for a total of 5,072. ETn/MusD elements are less numerous than IAPs, with 1,873 sequences in the B6 genome, 1,457 of which are solitary LTRs. In accord with previous studies (Mager and Medstrand, 2003), my results indicated that solitary LTRs, the result of recombination between the 5′ and 3′ LTRs of proviral forms, are typically more common than full-length ERVs.
At the time of this study, sufficient whole genome shotgun (WGS) sequence traces (Figure A.1) were available for only three mouse strains other than B6: A/J, DBA/2J, and 129X1/SvJ (referred to hereafter as “the three test strains”). To identify all traces containing IAP or ETn/MusD sequences, I used specifically designed ERV probes (Figure A.2 and Table A.1) to screen the trace archives of the three test strains by local sequence alignment. Sequences flanking the ERV segment in each trace were used to map the region to a unique position in the assembled B6 genome, and to combine redundant traces (Section 2.4.3). This screening method identified 1,659, 1,509 and 1,379 ETn/MusD elements that could be assigned a unique location in A/J, DBA/2J and 129X1/SvJ, respectively. Similarly, for the IAP elements, I identified 4,696 elements in A/J, 4,320 in DBA/2J, and 3,878 in 129X1/SvJ. As discussed earlier, my genomic survey detected 1,873 ETn/MusD elements and 5,072 IAPs in the assembled B6 genome. The lower ERV numbers detected in the three test strains compared to B6 are likely due mainly to incomplete sequence coverage of the traces available for each strain. Another factor contributing to the loss of detectable ERV insertions is the inability to map a trace to a unique location, usually because the flanking non-ERV portion is too short, is composed of other types of repeats, or is located within duplicated genomic regions.

To estimate the fraction of elements in each of the three test strains that was not detectable due to incomplete sequence coverage or other reasons, I determined how many elements in the assembled B6 genome could be found with my method, using randomly sampled sets of WGS traces from the B6 trace archive database.
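The sampling experiment described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure used in the thesis: `trace_hits` is a hypothetical mapping from trace identifiers to the B6 insertion loci that each trace would recover after screening and mapping, and the toy data are invented for illustration only.

```python
import random


def recovered_fraction(trace_hits, known_insertions, sample_size, seed=0):
    """Fraction of known insertions recovered from a random subset of traces.

    trace_hits       -- hypothetical dict: trace ID -> set of insertion IDs that
                        the trace recovers after ERV screening and unique mapping
    known_insertions -- set of insertion IDs annotated in the assembled genome
    sample_size      -- number of traces to draw without replacement
    """
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    sampled = rng.sample(sorted(trace_hits), sample_size)
    found = set()
    for trace_id in sampled:
        found |= trace_hits[trace_id]
    return len(found & known_insertions) / len(known_insertions)


# Toy data (invented): four traces, three annotated insertions.
toy_hits = {"t1": {"ins_a"}, "t2": {"ins_b"}, "t3": set(), "t4": {"ins_a"}}
toy_known = {"ins_a", "ins_b", "ins_c"}

full = recovered_fraction(toy_hits, toy_known, sample_size=4)
print(f"recovered with all traces: {full:.3f}")
```

Repeating the call with sample sizes matched to the number of traces available for each test strain would give an estimate of the fraction of insertions that go undetected at that sequencing depth.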
Using numbers of B6 traces equivalent to those available for A/J (11,646,236), DBA/2J (7,998,826) and 129X1/SvJ (5,998,950), I detected, respectively, 83.8%, 77.9% and 68.6% of the 1,865 ETn/MusD insertions present in the assembled B6 genome (Figure A.3). It therefore seems reasonable to assume that approximately 16.2%, 22.1% and 31.4% of the ERVs present in the three test strains could not be found due to incomplete coverage or mapping difficulties. Moreover, this B6 trace sampling experiment also allowed me to conservatively estimate the false discovery rate of this procedure to be ~0.4% (Section 2.4.5).

2.2.2  Identification and frequency of polymorphic ERVs

As outlined in Figure 2.1 and described fully in Section 2.4.4, I designed a four-phase screening process to identify polymorphic ERVs. In the first phase, probes derived from known ERV sequences were used to screen the B6 assembled genome, and a collection of ETn/MusD or IAP elements in B6 was obtained. In the second phase (illustrated in Figure 2.1A), I determined whether the ERVs identified in the three test strains were also present in B6 by checking for the existence of such ERV sequences at corresponding loci in the assembled B6 genome. In the third phase (represented in Figure 2.1B), I took the dataset of all IAP and ETn/MusD elements found in B6, and determined the presence of these ERVs in the three test strains. To do so, I retrieved the 5′ and 3′ flanking sequences from elements present in the assembled B6 genome, kept those flanking segments that could be uniquely mapped to the genome, and identified sequence traces from the test strains that contain these flanking segments. The traces were then checked for presence of the ERV. In the final phase, a similar strategy was applied to the polymorphic ERV insertions found in each test strain (but not in B6), so that the existence of corresponding ERVs in the other two test strains could be assessed.

Figure 2.1 Screening strategy for detection of polymorphic ERV insertions. A) Identification of ERV insertions in test strains. In the first step, ERV probes of different lengths were designed based on known ERV sequences (see Section 2.4.2 and Table A.1 for more details). Next, the ERV probes were aligned to trace sequences of the test strain with WU-BLAST, and all traces containing the target ERV sequences were retrieved (step 2). From each ERV-containing trace, a chimeric tag was constructed by taking the flanking genomic sequence appended with a small tail (≤ 50 bp) of the target ERV sequence (step 3). In the final step, all chimeric tags were mapped to the assembled B6 genome with BLAT, and the existence of corresponding ERVs in B6 was determined by checking whether the small ERV tail was included in the alignment (step 4). B) Determining the polymorphism status of ERVs present in B6. In the first step, probes were built based on the sequences flanking all ERV insertions in the B6 genome. In the next step, these probes were used to select all traces containing such flanking sequences in the test strains. In the third step, a 35-bp region adjacent to the mapped flanking sequence in each trace obtained from the previous step was compared to the corresponding ERV sequence in the B6 genome, and the existence of the ERV element in the test strain was assessed according to the sequence identity. In both panels, solid blue bars represent genomic sequences flanking the ERV insertions in mice. Green hatched bars or arrows with solid borders are ERV internal or LTR sequences, respectively. Gray shaded bars or arrows with broken borders are suspected ERV sequences, whose existence is determined by the alignment score of the regions annotated with “?”s.
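The chimeric-tag check in panel A of Figure 2.1 (steps 3 and 4) can be sketched as follows. This is an illustrative sketch only: it handles just the case where the flank precedes the ERV in the trace, the function names are hypothetical, and the `min_tail_fraction` cutoff is an assumed parameter rather than a value taken from the thesis. The alignment end coordinate would come from a mapping tool such as BLAT; here it is supplied by hand.

```python
TAIL_MAX = 50  # maximum ERV tail kept on the tag; the legend gives "≤ 50 bp"


def build_chimeric_tag(trace, erv_start):
    """Build a tag from the flank of a trace plus a short ERV tail.

    trace     -- trace sequence (string) with the flank first, then ERV sequence
    erv_start -- 0-based index in `trace` where the ERV match begins
    Returns (tag, flank_len).
    """
    flank = trace[:erv_start]
    tail = trace[erv_start:erv_start + TAIL_MAX]
    return flank + tail, len(flank)


def erv_present_in_reference(flank_len, tag_len, aln_end, min_tail_fraction=0.8):
    """Decide whether the reference genome carries the ERV at this locus.

    aln_end           -- end (exclusive, in tag coordinates) of the tag's
                         alignment to the reference genome
    min_tail_fraction -- hypothetical cutoff: share of the ERV tail that must
                         fall inside the alignment to call the ERV present
    """
    tail_len = tag_len - flank_len
    if tail_len <= 0:
        return False  # no ERV tail on the tag, nothing to test
    tail_aligned = max(0, min(aln_end, tag_len) - flank_len)
    return tail_aligned / tail_len >= min_tail_fraction


# Toy trace (invented): 100 bp of flank followed by 200 bp of ERV sequence.
trace = "A" * 100 + "G" * 200
tag, flank_len = build_chimeric_tag(trace, erv_start=100)

# Alignment stops at the flank/ERV junction: ERV absent from the reference.
absent_call = erv_present_in_reference(flank_len, len(tag), aln_end=100)
# Alignment runs through the whole tag: ERV present in the reference.
present_call = erv_present_in_reference(flank_len, len(tag), aln_end=150)
print(absent_call, present_call)
```

Keeping the ERV tail short means the tag is still placed by its unique flank, while the tail acts only as a presence/absence indicator at the mapped locus.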
The combination of these steps allowed me to compile lists of ERV genomic locations and the polymorphism status of each ERV in the four strains. Because many ERV flanking regions could not be uniquely mapped to the short, unassembled sequence traces, the status of many elements present in the assembled B6 genome could not be computationally determined in the test strains (see below). Additionally, as discussed above, incomplete sequence coverage of the test strains resulted in an “unknown” status for a proportion of ERVs in each test strain.

In spite of these limitations, I identified a large number of polymorphic ERVs (Figure 2.2). Of all IAP elements detected in at least one strain, 2,143 were present in all four strains, while 3,394 elements were scored as polymorphic (i.e. absent in at least one of the four strains), giving an overall polymorphic fraction of 61.3%. For ETn/MusD elements, 1,087 were mapped as present in all four strains, and 375 could be scored as absent in one or more strains, a polymorphic fraction of 25.6% of all the elements having a determinable status. Another 1,767 IAP and 660 ETn/MusD elements present in the assembled B6 genome could not be mapped to the test strain traces due to incomplete trace coverage or repetitive flanking regions, so their polymorphic status could not be computationally determined. Remarkably, these high levels of insertional polymorphism were obtained by considering just four strains, despite the fact that the status of many elements could not be ascertained in some strains. Thus, the numbers of polymorphic ERVs among all inbred mice must be significantly higher. Details of all the polymorphic ERV insertions identified in this study can be found online in the form of custom tracks for the UCSC Genome Browser (see Section 2.4.10 for details).

Figure 2.2 Fractions of polymorphic ERVs based on the four strains. Pie charts indicate the status of all detectable A) IAP elements and B) ETn/MusD elements. White sections indicate the fraction of elements that could be scored as present in all four strains (annotated as ‘common’). Dark blue sections represent the fraction of elements scored as absent in at least one strain (annotated as ‘polymorphic’). Side bars illustrate the data composition of polymorphic ERVs, with dotted and striped sections indicating polymorphic elements for which status could and could not be confirmed in all four strains, respectively.

2.2.3  Genic distribution patterns of the youngest ERVs are distinct from older elements

Given that the genomic distributions of ERVs fixed in a species are strongly shaped by selection, I predicted that recently inserted ERVs would display genic distributions different from their older cousins. To test this prediction, I compared the distributional properties of a subset enriched for the youngest ERVs with those of ERVs common to all four strains. To obtain the youngest elements, I chose those present in one strain but absent in the other three strains. Many of these likely still represent older polymorphic elements due to the fact that lab strains are genetic mixtures of subspecies of Mus (Beck et al., 2000; Wade and Daly, 2005; Yang et al., 2007). However, this group will contain all the truly young elements inserted after strain divergence. As shown in Figure 2.3, these datasets enriched for the “youngest” elements are more likely to be found in genes (Figure 2.3A) and in the sense orientation within genes (Figure 2.3B), compared with elements shared between all four strains. The higher prevalence in genes and the reduced intronic orientation bias displayed by ERV subsets enriched for the youngest elements suggest that some of them have inserted very recently, may be deleterious, and have not yet been eliminated completely by selection.
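A comparison of two such fractions can be made with a two-sample z-test for proportions, the type of test reported with Figure 2.3. The sketch below uses the pooled-variance form, which is an assumption since the exact variant is not stated here, and the counts are invented for illustration rather than taken from the figure.

```python
import math


def two_proportion_z(x1, n1, x2, n2):
    """Two-sided, pooled two-sample z-test for a difference in proportions.

    x1, n1 -- successes and total in group 1 (e.g. young ERVs found in genes)
    x2, n2 -- successes and total in group 2 (e.g. common ERVs found in genes)
    Returns (z, p_value).
    """
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided tail probability of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Invented counts: 60 of 100 "young" elements vs 40 of 100 "old" elements.
z, p = two_proportion_z(60, 100, 40, 100)
print(f"z = {z:.3f}, p = {p:.4f}")
```

With small groups, such as the ETn/MusD orientation subsets, the normal approximation becomes less reliable and an exact test may be preferable.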
Figure 2.3 Distributions of young versus older ERV elements with respect to genes. A) Fraction of elements located within genes. B) Fraction of genic elements oriented in the same transcriptional direction as the gene. Dark blue bars represent elements found in all four strains. White bars represent ERVs present in only one of the four strains. Dashed lines indicate the expected fractions assuming a random genomic integration pattern. Error bars show standard errors, and P-values based on a two-sample z-test comparing the young and old groups are shown. All differences between the “young” and “old” subsets are statistically significant except for the orientation bias of ETn/MusD elements (marked with “*” in B), owing to the low numbers of elements in this category. Actual numbers of elements in each category are shown as numerators in fractions, with denominators being the total numbers of elements in the different groups.

2.2.4  Confirmation of polymorphic ERVs in gene introns

My bioinformatics screens identified 623 polymorphic IAP elements and 72 polymorphic ETn/MusD elements located within genes in one or more of the four strains. Complete lists of these elements and their locations with respect to the B6 genome are provided in Tables A.2 and A.3. These tables list the strains in which each element was computationally detected by my screens. The question marks in the tables are mainly due to mapping difficulties or incomplete sequence coverage of the trace databases. A subset of these elements was analyzed using genomic PCR on DNA from a panel of mouse strains (including B6 and the three test strains) with primers flanking the insertion site to verify the insertion status. For this analysis, I chose all 28 cases of ETn/MusD elements found in A/J gene introns but absent in B6, and 12 cases of ETn/MusD elements present in B6 gene introns but scored as absent in A/J (Table 2.1).
For the 28 cases of elements computationally detected in A/J (cases 1-28 in Table 2.1), the ETn insertion in the dysferlin (Dysf) gene (case #9) is the only previously reported case and occurred 20-30 years ago in the A/J breeding stocks (Ho et al., 2004). For the set of 12 elements present in B6 (cases 29-40 in Table 2.1), the ETn element in the Wiz gene (case #40) has also previously been reported as polymorphic (Baust et al., 2002). In total, these 40 selected cases and four strains generated an experimental space of 160 predictions. In Table 2.1, columns headed by a strain name followed by “(p)” give the computational predictions of the presence of the ERV insertion in the corresponding strain. After excluding 16 undeterminable instances (denoted as “?” in these columns in Table 2.1), my screens yielded a total of 144 computational predictions of insertion status across the four strains. For 140 of these, my computational predictions matched precisely the experimental confirmation of ERV insertion status using genomic PCR (performed by my colleague, Liane Gagnier), demonstrating the high accuracy of my bioinformatics screens. In one instance (case #39 in DBA), the PCR failed, so the predicted insertion could not be tested. As a result, only three cases showed anomalous PCR results that did not match my bioinformatics predictions. One of these cases was #24 in Table 2.1, where I predicted an ETn/MusD insertion in an intron of the Sytl3 gene in A/J mice. However, the PCR verification result showed no evidence of this insertion in the A/J DNA sample used. I consequently reexamined the A/J sequence dataset, and detected orthologous sequence traces both with and without this particular ERV element (Figure 2.4).
The most likely explanation for this finding is that this ERV represents a very recent insertion present in a heterozygous state in the A/J genomic DNA used to generate the trace sequence data. Since the rate of ETn/MusD retrotransposition in A/J is relatively high compared with other strains (Maksakova et al., 2006), it is not surprising that individual A/J mice will have occasional “private” insertions.

Figure 2.4 An apparent heterozygous ETn insertion in the Sytl3 gene in an A/J mouse. The top line is the first 30 bp of an ETnII LTR. The second line is from the A/J trace gnl|ti|1104656312, which consists of a non-ERV part and an ETn LTR part. The third line is a different trace sequence from A/J (gnl|ti|1344398576). The bottom line is from the RefSeq gene Sytl3 in the assembled B6 genome (build 36) with genomic coordinates shown. ETn sequences are bold red, and non-ETn sequences are blue.

The second anomalous case was #34, an ETn/MusD LTR found in the B6 genome within the Cadm4 gene and confirmed as present in all tested strains by PCR (Table 2.1). My computational screens correctly scored this LTR as present in DBA/2J and 129X1/SvJ, but as absent in A/J. Upon further examination of the sequence data, I found one of the two available A/J sequence traces mapping to this location to be an artifact, since it contains a segment of unknown origin. The other trace is also unusual, as segments of it map to two locations several kb apart. Thus, this case can be explained by artifactual sequence traces, demonstrating that the trace archives and, therefore, my dataset are not without errors. The last inconsistent case was #37, an insertion located within the Slfn8 gene and predicted as present in both B6 and DBA. In this case, the PCR verification in DBA showed that the element is not present.
Since both the computational and experimental results were clear yet contradictory, I do not have a definitive explanation for this case, although it is possible the trace is not of DBA origin. In any event, this case was regarded as a false positive. In several instances, the PCR data also allowed me to assign a definite insertion status to elements in test strains that could not be predicted in silico due to incomplete sequence coverage of the traces (Table 2.1).

Table 2.1 Genomic PCR verification of ETn/MusD intronic insertions in different mouse strains a

case#  gene           location   intron size  insert_site b  orient. c  B6(p) e  A/J(p) e  DBA(p) e  Size f
1      2310035C23Rik  intron 13  1924         1:107544558    A          N        Y         N         350
2      Dnajc10        intron 3   1920         2:80121109     S          N        Y         N         5400
3      Atp9a          intron 14  5914         2:168357724    A          N        Y         N         5700
4      Gem            intron 2   4896         4:11637425     A          N        Y         N         5500
5      B230396O12Rik  intron 6   2946         4:153708611    A          N        Y         N         5500
6      Art3           intron 2   8817         5:93471996     A          N        Y         N         5500
7      Foxk1          intron 1   33098        5:142663608    A          N        Y         N         5500
8      Stk31          intron 3   6516         6:49330234     A          N        Y         ?         5500
9      Dysf           intron 4   4795         6:84024675     S          N        Y         N         6000
10     Zfp82          intron 5   4708         7:29768570     A          N        Y         ?         1700
11     Pde8a          intron 12  1773         7:81189399     A          N        Y         N         400
12     Pgbd5          intron 1   49175        8:127312767    A          N        Y         N         400
13     Opcml          intron 2   270720       9:28170744     S          N        Y         N         5400
14     Alg9           intron 14  20017        9:50582311     A          N        Y         Y         unk.
15     Zfp291         intron 8   8166         9:55688688     A          N        Y         Y         350
16     Cdh23          intron 32  2103         10:59779699    A          N        Y         N         5600
17     Odz2           intron 8   35382        11:36015943    A          N        Y         N         7500
18     Prkca          intron 3   134255       11:107934734   S          N        Y         N         5500
19     Akap6          intron 7   71966        12:53930690    A          N        Y         N         unk.
20     Mark3          intron 14  7429         12:112088888   A          N        Y         N         7800
21     LOC432723      intron 1   61826        13:4788783     A          N        Y         N         350
22     Cacna2d3       intron 11  109422       14:28067384    A          N        Y         ?         400
23     A2bp1          intron 3   321339       16:6641077     S          N        Y         Y         5500
24     Sytl3          intron 4   12675        17:6573500     S          N        Y         N         N/A
25     Dlgap1         intron 4   55340        17:70603001    A          N        Y         Y         350
26     Dym            intron 16  43220        18:75389795    A          N        Y         N         5700
27     Mtm1           intron 8   4446         X:67552081     S          N        Y         N         5500
28     Col4a6         intron 2   146075       X:136618162    S          N        Y         N         5500
29     Sh3bp4         intron 1   36659        1:90927936     A          Y        N         N         5550
30     Tor3a          intron 4   10001        1:158497287    A          Y        N         N         330
31     Cd84           intron 2   20560        1:173693834    A          Y        N         N         1574
32     Ttbk2          intron 4   16402        2:120487764    A          Y        N         N         5478
33     Unc13b         intron 7   46281        4:43167395     A          Y        N         N         322
34     Cadm4          intron 1   16992        7:24206094     A          Y        N         Y         338
35     Dpep1          intron 1   7590         8:126077748    A          Y        N         Y         322
36     Vnn3           intron 3   7946         10:23546116    A          Y        N         Y         5696
37     Slfn8          intron 4   8583         11:82825929    A          Y        N         Y         334
38     Klhl1          intron 8   15107        14:95020963    A          Y        N         ?         320
39     Mapk14         intron 6   2511         17:28455067    A          Y        N         Y         320
40     Wiz            intron 2   19667        17:32100856    A          Y        N         N         7121

Remaining columns, whose row alignment is not recoverable from this text, listed in source order:
B6 d: + + + + + + + + + + + +
A/J d: + + + + + + + + + + + + + + + + + + + + + + + + + + + + -
DBA d: + + + + + + + + F -
129X1 d: + + + + + + + + + + + + + + -
129X1(p) e and A/WySn d (interleaved): N + N + N ? + ? + ? F N + N + ? ? + N + ? + N N + Y + N N + ? + N + ? + Y + Y + Y + ? Y + N ? + Y + N ? ? N Y Y + N Y N Y Y N -
SWR/J d: + + + + + + + + + F -
C3H d: + + + + + + + + + + + + -
Balb d: + + + + + + + + + + + + + + + + + -

a Full names of mouse strains are given in Materials and Methods (Section 2.4.6). Highlighted frame shows the four strains used in computational analysis.
b Corresponding position in B6 genome (version mm8)
c A: insertion is antisense to gene; S: insertion is sense to the gene
d + indicates presence of insertion; - indicates absence of insertion; “F” indicates PCR failure (All PCR experiments listed in this table were performed by Liane Gagnier)
e Y: computational prediction of having insertion; N: computational prediction of no insertion; ?: presence of insertion could not be determined computationally
f Size of ERV insertion (approximate for cases 1-28)

As expected, some of the identified insertions are not specific to a single strain.
This finding indicates that many of the polymorphic ERV insertions arose prior to divergence of common inbred strains, or represent even older polymorphisms due to different origins of chromosomal segments in the genomes of today’s lab mice. For the 28 cases present in A/J but absent from B6, the short A/J sequence traces do not contain the entire ERV. However, the length of the inserted element could be estimated from the size of the genomic PCR product for 25 of these cases (last column in Table 2.1). In 15 cases, the size matches a full-length ETn element (5.5-6 kb), whereas two other cases appear to be full-length MusD elements (7.5-7.8 kb) and one case is likely a partly deleted element (case #10). Seven are solitary LTRs (320-400 bp), so the nature of the original insertion cannot be determined since the LTRs of ETnII elements and MusDs are extremely similar (Baust et al., 2003). For the 12 elements present in the assembled B6 genome, seven are solitary LTRs, one is a partial element and four are ETn elements based on size and sequence. The element in the Wiz gene is a longer ETn variant (Baust et al., 2002). Given that most published mutagenic insertions of this family are of the ETnII subfamily (Baust et al., 2002; Maksakova et al., 2006), the preponderance of polymorphic ETn elements over MusDs was expected.

2.2.5  Potential gene expression effects mediated by polymorphic ERVs

Since ERVs/LTRs can affect gene transcription through a variety of mechanisms (Section 1.5), some of the polymorphic ERVs detected in this study may contribute to gene expression differences between strains, possibly leading to phenotypic differences. However, the factors that determine whether transcription of a gene will be affected by a nearby or intragenic ERV insertion are not understood, and in all likelihood, are complex. It is therefore not possible to estimate what fraction of the polymorphic insertions documented here may have functional consequences.
Nonetheless, it can be predicted which cases may be more likely to affect gene expression. In the majority of documented cases where a new mutagenic ETn/MusD insertion causes significant transcriptional defects, the element has been located within an intron in the sense orientation, disrupting splicing patterns of the gene (Maksakova et al., 2006; van de Lagemaat et al., 2006). Thus, it is reasonable to postulate that ETn elements within introns and oriented in the same direction as the enclosing gene have a relatively high probability of affecting mRNA processing. Moreover, compared with older insertions, the youngest, polymorphic subsets of these elements are potentially more likely to affect host gene expression, as selection may still be operating in these cases. Based on the foregoing reasoning, I chose to examine further a subset of cases using three criteria. First, since the consequences of IAP insertions can involve LTR bidirectional promoter effects (Maksakova et al., 2006) that are more complicated and difficult to predict, I chose to focus on ETn/MusD insertions. Second, only intronic ETn elements oriented in the same direction as the gene were selected. Third, I chose elements verified as present in A/J but lacking in B6 using genomic PCR (see Table 2.1). Seven such cases exist, involving ETns in the Dnajc10, Dysf, Opcml, Prkca, A2bp1, Mtm1 and Col4a6 genes. My colleague, Irina Maksakova, performed RT-PCR on RNA from A/J mice using primers from the gene exon upstream of the ETn insertion, coupled with primers from within the ETn chosen to detect the most frequently reported types of ETn-mediated transcriptional fusions from the literature (Maksakova et al., 2006). Sources of RNAs were chosen based on known expression patterns of the gene. As shown in Figure 2.5, chimeric transcripts were detected for all five of the genes tested, namely Dnajc10, Prkca, Mtm1, Opcml and Col4a6. The sense-oriented ETn element found in the Dysf gene in A/J has already been shown to cause similar splicing defects (Ho et al., 2004), and I did not examine A2bp1. In most cases, the splice sites used in the ETn element in the examples analyzed were analogous to those characterized in known mutagenic cases. For Prkca, however, analysis showed that the insertion is a member of the ETnI subfamily, as opposed to ETnII, and revealed usage of cryptic splice acceptor sites not previously documented (see Figure A.4 for sequences of splice sites).

Figure 2.5 Detection of ETn-gene chimeric transcripts. Aberrant transcripts induced by a novel ETn insertion into a gene intron are shown. Gene direction, from left to right, is the same as ETn orientation. Genes Dnajc10, Opcml, Mtm1 and Col4a6 harbor ETnII insertions, while Prkca has an ETnI insertion. ETn sections that are different in an ETnI element compared to ETnII are shown as striped. Cryptic and natural splice acceptor sites are designated as blue (previously identified) or yellow (newly identified in this study) vertical arrows. Splice donor sites are represented as red (previously known) or light blue (newly identified) triangles. Natural and cryptic polyadenylation sites are marked as pA. Numbered thin arrows denote primers used to amplify chimeric transcripts. Sense primers in the upstream exon are as follows: 1, Prkca-up-ex-s; 2, Dnajc10-up-ex-s; 3, Mtm1-up-ex-s; 4, Opcml-up-ex-s; 5, Col4a6-up-ex-s. Antisense primers in the ETn are as follows: 6, IM_3as; 7, MusD2_7130as; 8, LTR_2as. Number of clones for each transcript variant compared to all clones sequenced for each primer pair is indicated. Chimeric ETn transcripts and number of their examples identified previously (Maksakova et al., 2006) are shown at the bottom. Sequences of all splice sites used in these cases are shown in Figure A.4. Note: Irina A. Maksakova performed the experiments and made this figure.
It should be noted that the subset of chimeric transcripts shown in Figure 2.5 is likely an underestimate, since a limited number of clones were sequenced and not all transcript variants would have been detected with the primers used. This RT-PCR analysis demonstrates that these ETn elements cause patterns of aberrant splicing similar to those documented in cases of known mutations due to new ETn integrations. Nevertheless, further quantitative analyses are required to determine the significance of these splicing abnormalities in affecting overall levels of gene expression. Such in-depth experimental investigations for each case were beyond the scope of this study. Microarray data on gene expression differences in inbred strains available through the Gene Expression Omnibus (Barrett et al., 2007) were also surveyed (see online resource at http://www.ncbi.nlm.nih.gov/geo/). My colleague, Louie van de Lagemaat, examined all cases listed in Table 2.1 for correlations between presence of the insertion and differences in gene transcript levels compared with strains lacking the insertion (Section 2.4.9). Specifically, he analyzed the microarray data available through NCBI GEO accession GSE3594 (Zapala et al., 2005), which includes data on gene expression in 10 tissues profiled in A/J, B6, C3H/HeJ, DBA/2J and 129S6/SvEvTac mice. For the Dnajc10 gene, tissue-wide reduction in expression was noted in A/J mice relative to the other four strains (p < 10⁻⁴, binomial distribution) (Figure 2.6). Microarray data available through the GeneNetwork web site (http://www.genenetwork.org/) also showed that transcript levels of this gene in A/J are much lower than in all other tested strains, based on whole brain, cerebellum, hippocampus and eye datasets (Figure A.5). Dnajc10 has a sense-oriented ETn element in the third intron in A/J and the related A/WySn mice, but in no other strain tested (Table 2.1).
This gene (also termed ERdj5) encodes an endoplasmic reticulum (ER) chaperone protein induced during ER stress, and is likely involved in protein folding (Corazzari et al., 2007; Cunnea et al., 2003).

Figure 2.6 Normalized, tissue-averaged expression of Dnajc10 across strains. Analysis of microarray data of transcript levels in 10 tissues profiled in A/J, B6, C3H/HeJ, DBA/2J and 129S6/SvEvTac mice. Data retrieved from NCBI GEO accession GSE3594 (Zapala et al., 2005). Replicates within a tissue/strain were averaged and a variation pattern of relative expression levels across the above five strains was computed for each tissue. These patterns were found to be similar across tissues (except white adipose tissue which was an outlier). Note: Louie N. van de Lagemaat performed the data analysis and made the figure.

Another gene for which significant differences in expression correlate with presence of an ETn element is Opcml. No data is available from the Zapala et al. study on this gene, but datasets accessed through GeneNetwork show that transcript levels in A/J, the only tested strain carrying an ETn insertion (Table 2.1), are significantly lower than in any other strain in cerebellum, whole brain, hippocampus and eye, the only tissues where A/J microarray information is available for this gene (Figure A.6). Opcml (Opioid binding protein/cell adhesion molecule-like), also termed Obcam, is a member of the IgLON gene family and encodes a synaptic neural cell adhesion molecule (Schofield et al., 1989; Yamada et al., 2007). Loss of expression and/or promoter hypermethylation of this gene has been reported in some human cancers, suggesting that it may play a tumor suppressive role (Reed et al., 2007; Sellar et al., 2003). Northern blot analysis on total RNA from A/J and B6 cerebellum was performed by Irina Maksakova using a probe derived from the exon upstream of the insertion site, and results are shown in Figure 2.7A.
The ~6.5 kb band corresponding to Opcml full-length mRNA is markedly decreased in A/J compared with B6. A similar reduction in Opcml RNA was also observed in A/J using an exon probe downstream of the insertion site (data not shown). The two bands at 3-3.5 kb are due to cross-hybridization to another gene, neurotrimin (Hnt), which is a closely linked member of the IgLON family and highly related to Opcml in the region used as a probe (Struyk et al., 1995). Irina Maksakova also performed semi-quantitative RT-PCR on total RNA from A/J and B6 cerebral hemispheres using primers from Opcml exons just upstream and downstream of the ETn insertion site, and found an approximately 4.6-fold reduction in the correctly spliced Opcml RNA in A/J relative to B6 (Figure 2.7B). These results confirm the microarray data (Figure A.6), and indicate that presence of the ETn insertion correlates with a substantial decrease in full-length, correctly spliced Opcml mRNA. While there may be other reasons for the reduced transcript levels, such patterns suggest that the ETn element in these genes significantly affects expression by causing aberrant splicing (Figure 2.5), allowing only a minor fraction of normal transcripts to be produced.

Figure 2.7 Transcript levels of Opcml in A/J versus B6. A) Northern blotting. Total RNA from the cerebellum of an A/J and B6 mouse was hybridized to a probe from the Opcml exon upstream of the ETn insertion in A/J. The lower part of the figure is the ethidium bromide-stained gel, showing even loading as indicated by the ribosomal RNA bands. The band corresponding to Opcml is marked with an arrow. The probe cross-hybridized with a related gene Hnt, as explained in the text. B) Semi-quantitative PCR on cDNA from A/J and B6 cerebral hemispheres. Opcml cDNA was amplified with primers from upstream and downstream of the ETn insertion site. Opcml and Gapdh fragments were amplified from cDNA dilutions; for each dilution, the intensity of the resulting band was quantified and graphed as transcript levels of Opcml relative to Gapdh (see Figure A.7). The average and standard deviation for all experiments is shown. Note: Irina A. Maksakova performed the experiments and made the figure.

For all other cases from Table 2.1, including the other genes with insertions that cause aberrant splicing detected by RT-PCR (Figure 2.5), available microarray data was either inconsistent or did not show a clear relationship between presence of the insertion and altered levels of transcripts. These findings suggest that, in most cases, the ETn insertion has no significant effect on expression. This result is not surprising, given that thousands of ERVs or LTRs have become fixed during evolution in human and mouse genes, indicating that they can reside within introns without a functional impact (van de Lagemaat et al., 2006). As illustrated by the Dysf case, however, the microarray data should be treated with caution. It has been convincingly shown by Northern analysis that A/J mice with the ETn insertion lack full-length Dysf mRNA and protein in skeletal muscle (Ho et al., 2004). However, the available microarray data for Dysf is limited to cerebellum and whole brain, neither of which shows abnormally low transcript levels in A/J (data not shown). There could be several reasons for this discrepancy, but it illustrates that wet lab approaches are necessary to properly evaluate each case. It is well established that IAP LTRs can promote ectopic gene transcription in cases of somatic oncogene activations and germ line mutations (Boeke and Stoye, 1997; Druker and Whitelaw, 2004; Kung et al., 1991; Maksakova et al., 2006; Wang et al., 1997) in addition to causing gene splicing defects similar to ETns.
Moreover, a few mutations caused by IAP-driven aberrant gene expression have been shown to act as metastable epialleles, exhibiting variable expressivity among genetically identical mice linked to the variable epigenetic state of the IAP LTR (Blewitt et al., 2006; Druker and Whitelaw, 2004; Wang et al., 1997). In a recent study, Horie et al. identified transcripts from 11 loci in 129 strain embryonic stem cells that initiate in an IAP LTR and read into flanking sequence, in five cases giving rise to chimeric RNAs between an intronic IAP and the enclosing gene (Horie et al., 2007). In six of the 11 loci analyzed, the IAP element was not present in the B6 genome, prompting the authors to postulate that variations in IAPs may contribute to strain-specific traits. Although I have not functionally examined any cases of polymorphic IAPs identified here to look for LTR-initiated fusion gene transcripts, it is likely that numerous such cases exist.

2.3  Concluding Remarks

In this initial genomic assessment of ERV polymorphisms in mice, I have used the available DNA sequence from four inbred strains to assess the level of insertional polymorphism of the currently active IAP and ETn/MusD ERV families. Despite mapping limitations and incomplete sequence coverage, I identified 3394 IAP and 375 ETn/MusD elements that are polymorphic among the four strains, resulting in polymorphic fractions of 61.3% and 25.6%, respectively. This is the first genome-wide determination of the extent of polymorphism of these ERV families. Given that this study was based on only a few strains, the total numbers of polymorphic elements must be substantially higher and represent a large source of genetic variation among inbred strains. Among the polymorphic copies, 623 IAPs and 72 ETn/MusD elements reside in gene introns.
In all five cases of sense-oriented ETn elements in A/J introns that we examined, evidence for gene splicing disruption was found by RT-PCR and, for two genes, further evidence of lower gene expression in A/J mice was observed through surveys of microarray data. While most polymorphic ERVs likely have little effect on host genes, I found that the prevalence within genes and the intronic orientation bias exhibited by polymorphic ERV subsets enriched for the youngest elements are distinctly different from those of older elements. This observation suggests that some of the polymorphic ERVs are deleterious but have not yet been eliminated by selection due to their short time in the genome or the controlled breeding environment of laboratory mice. Indeed, new insertions of these elements could play a significant role in genetic drift and inbreeding depression of mouse lines (Taft et al., 2006). I propose that a comprehensive effort to document ERV and other transposable element polymorphisms among multiple inbred strains would complement SNP data, and contribute greatly to our understanding of mouse genetic history, and genotypic and phenotypic variation.

2.4  Materials and Methods

2.4.1  Source data

The NCBI Trace Archive included a total of 195,993,571 traces from 38 mouse strains as of May, 2007 (http://0-www.ncbi.nlm.nih.gov.ilsprod.lib.neu.edu/Traces/trace.cgi). The majority of these traces were obtained by CHIP-related resequencing techniques, which exclude most repetitive sequences. In this study, I used only sequence traces obtained by whole genome shotgun (WGS) sequencing, which are unbiased in their content of repetitive elements. Three mouse strains (A/J, DBA/2J, 129X1/SvJ) were chosen to compare to the assembled B6 genome (version mm8) because these were the only strains with sufficient traces sequenced by shotgun-related strategies (Figure A.1).
RefSeq gene annotations were retrieved from the RefGene annotation table (version mm8, April 2007 annotation) downloaded from the UCSC Genome Browser. When an ERV insertion was found in a genomic region with multiple overlapping annotations, the smallest gene was chosen to improve specificity. I also calculated the genomic coverage of annotated RefSeq genes in the mouse genome (used as the “expected value” in Figure 2.3A), based on the same annotation table. After merging overlapping RefSeq annotations and removing redundancies, the total coverage of genic regions in the mouse genome was calculated at 31.58%.

2.4.2  Design of ERV probes and detection of ERVs in the assembled B6 genome

Three types of probes were designed based on template ERV sequences (only the type-1 probe is shown in Figure 2.1, step A1). For IAP, probes were based on a recently inserted polymorphic IAP 1Δ1 element (accession #EU183301) (Juriloff et al., 2005). For ETn/MusD, probes were based on a mutation-causing ETnII element (accession #Y17106) (Hofmann et al., 1998). MusD and ETnII elements are on average over 90% identical in the regions of the probes. To capture ETnI elements, which differ from ETnII/MusD elements in the 3′ part of the LTR and 5′ internal region (Baust et al., 2003; Shell et al., 1990), I used a representative ETnI element (accession #AC068908). As shown in Figure A.2, the type-1 probe included the full-length LTR and a small fragment of the internal ERV sequence; type-2 included only the full LTR; type-3 was only the first/last 60 bp of the 5′/3′ LTR. Additional information about probe design is summarized in Table A.1. I conducted a survey of both ETn/MusD and IAP insertions in the B6 genome using the 60 bp type-3 probes because these insertions are in regions of low divergence between family members (data not shown), ensuring that all ERVs of each group would be detected. The probes were aligned to the B6 genome using the WU-BLAST 2.0 program (Gish, W. 1996-2004 http://blast.wustl.edu/), and any hit above the cut-off threshold was scored as an ERV insertion. To keep both sensitivity and specificity as high as possible, I designed an experiment to optimize the parameters of alignment identity and length of the aligned region. Results suggested a value of 80% for both parameters. To estimate the sizes and numbers of ETn/MusD and IAP elements, all mapped ERV fragments (LTR termini) were merged into one individual element if they were: 1) on the same chromosome; 2) in the same orientation; 3) within 10 kb of each other.

2.4.3  Detection of ERV insertions in test strains

The standalone version of the WU-BLAST v2.0 program (Gish, W. 1996-2004 http://blast.wustl.edu/) was used to make local alignments between ERV probes and mouse traces in the NCBI trace archive database (step 2 in Figure 2.1A). The threshold parameters for BLAST were 80% for sequence identity, and 80% for length of the aligned region. A usable ERV-containing trace consists of two parts: a non-ERV flanking sequence and the target-ERV sequence. All ERV-containing traces with a flanking portion shorter than 30 bp were discarded. Once identified, a chimeric tag was constructed by taking the whole flanking portion appended with a small tail of its target-ERV sequence (Figure 2.1A, step 3). I required the target-ERV tail of the tag to be 1/5 the length of the flanking portion, up to a maximum of 50 bp. The ERV-containing traces were mapped to the assembled B6 genome. I used the chimeric tags derived from the previous step as the input query for BLAT (Kent, 2002), and mapped them to the B6 genome (version mm8) (Figure 2.1A, step 4). The sequencing error rate of the mouse traces was estimated to be ~5% (data not shown).
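The chimeric tag construction described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the code used in this study: the function name and arguments are hypothetical, the 1/5 tail length is rounded down to a whole base, and the ERV portion is assumed to lie entirely on one side of the flanking sequence within the trace.

```python
MIN_FLANK_BP = 30   # traces with a shorter flanking portion are discarded
MAX_TAIL_BP = 50    # upper bound on the target-ERV tail


def build_chimeric_tag(flank, erv, erv_on_right=True):
    """Return flank plus a short ERV tail, or None if the flank is too short.

    flank        -- non-ERV portion of the trace (hypothetical input)
    erv          -- ERV portion of the same trace (hypothetical input)
    erv_on_right -- True if the trace reads flank-then-ERV, False otherwise
    """
    if len(flank) < MIN_FLANK_BP:
        return None
    # Tail is 1/5 of the flank length, capped at 50 bp.
    # Rounding down with integer division is an assumption.
    tail_len = min(len(flank) // 5, MAX_TAIL_BP)
    if erv_on_right:
        return flank + erv[:tail_len]    # ERV bases adjacent to the junction
    return erv[-tail_len:] + flank       # junction is at the end of the ERV


# Toy sequences for illustration only.
tag = build_chimeric_tag("A" * 100, "C" * 200)
print(len(tag))
```

Keeping the ERV tail short relative to the flank means the tag maps mostly by its unique sequence, while the tail still provides enough bases to test whether the ERV is present at the mapped B6 locus.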
I therefore defined criteria for a significant mapping as follows: 1) the score of alignment being the highest mapping score among all BLAT hits; 2) the best hit being at least 2% higher in identity and 10% longer in mapping length compared to the second hit; 3) the alignment identity between the chimeric tag and the target locus being greater than 90%; 4) the length of aligned region being more than 70% of the tag length. Once a significant BLAT mapping site was identified, it was straightforward to check for presence of the ERV in the B6 genome based on alignment of the small target-ERV tail of the chimeric tag. If the BLAT mapping included more than 2/3 of the target-ERV tail, it was considered a common insertion also present in B6; if the mapping included less than 1/3 of the target-ERV tail, it was scored as absent from B6. Situations in between these two boundaries were extremely rare and were discarded.

2.4.4  Determining the polymorphism status of ERVs present in B6 or in the test strains

All sequences in the B6 genome with a length of 35 bp flanking both the 5′ and 3′ end of each detectable ERV element were aligned back to the B6 genome with BLAT, and only those with a unique location were retained. All retained 35-bp flanking sequences were used as queries of the WU-BLAST program, and all traces from the test strains containing such flanking sequences were collected (Figure 2.1, step B2). A minimum identity of 90% and a minimum mapping length of 80% were required. Because of incomplete genomic coverage of traces of test strains, many ERV flanking regions in B6 have no corresponding traces in the trace archive database, and their polymorphism status could not therefore be determined (denoted as “?” in Tables A.2 and A.3).
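For reference, the mapping filter and the B6 presence call defined in Section 2.4.3 can be written compactly. The Python sketch below is illustrative only: the BlatHit record and both function names are hypothetical, "2% higher in identity" is read as two percentage points, and "10% longer" is read as at least 1.10 times the aligned length of the second hit. Both readings are assumptions about the original criteria.

```python
from dataclasses import dataclass


@dataclass
class BlatHit:
    """Hypothetical summary of one BLAT hit for a chimeric tag."""
    score: int
    identity: float    # percent identity, 0-100
    aligned_len: int   # number of tag bases aligned


def significant_hit(hits, tag_len):
    """Return the best hit if it satisfies all four criteria, else None."""
    if not hits:
        return None
    ranked = sorted(hits, key=lambda h: h.score, reverse=True)
    best = ranked[0]                       # criterion 1: highest score
    if len(ranked) > 1:                    # criterion 2: margin over runner-up
        second = ranked[1]
        if best.identity - second.identity < 2.0:
            return None
        if best.aligned_len < 1.10 * second.aligned_len:
            return None
    if best.identity <= 90.0:              # criterion 3: identity > 90%
        return None
    if best.aligned_len <= 0.70 * tag_len:  # criterion 4: > 70% of tag aligned
        return None
    return best


def call_b6_status(tail_aligned, tail_len):
    """Classify the ERV at a mapped locus from the aligned part of the tail."""
    fraction = tail_aligned / tail_len
    if fraction > 2 / 3:
        return "present"   # common insertion, also in B6
    if fraction < 1 / 3:
        return "absent"    # insertion not found in B6
    return None            # ambiguous, discarded


hits = [BlatHit(200, 98.0, 110), BlatHit(150, 93.0, 90)]
print(significant_hit(hits, tag_len=120), call_b6_status(18, 20))
```

Requiring a margin over the second-best hit is what guards against tags that fall in repetitive flanking sequence, where several loci align almost equally well.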
However, for ERVs in B6 with unique flanking sequence found in one or more test strain traces, presence of the ERV in test strains was determined by assessing identity between the ERV sequence in B6 and the sequence adjacent to the flanking sequence in the trace of the test strain. To align the two sequences, I used an implementation of the Needleman-Wunsch algorithm (Needleman and Wunsch, 1970). I required a minimum identity of 90% and an alignment length of at least 35 bp to score the ERV as present in the test strain. Using a strategy similar to the one just described, I assessed the polymorphism status of ERV insertions found in a test strain but not in B6. Using their locations with respect to the B6 genome, probes based on flanking genomic sequences were built, and the trace archive database was searched to check if traces with the same flanking sequences were present for other test strains. All qualified traces obtained from other test strains were aligned to sequences of corresponding ERV families based on the same mapping criteria used above, and the existence of such ERV elements in other test strains was determined. Instead of using the ERV portion in the original ERV-containing traces, I used exemplar ERV sequences because, for some traces, the ERV portion was too short (less than 35 bp) to make an effective alignment.

2.4.5  B6 trace sampling and screening simulations

The ERV numbers found in the three test strains are lower than the numbers detected in the assembled B6 genome. Incomplete sequence coverage and the inability to map the trace to a unique location are responsible for most of the loss of detectable ERV insertions. To estimate the fraction of ERVs that were not detected in each test strain, I applied my screening method using random samples of the unassembled B6 traces, and plotted an ERV detection curve based on this simulation (Figure A.3).
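The idea behind the detection curve can be illustrated with a short Python sketch. It is a simplification under stated assumptions: instead of rerunning the full screen on each sample, it assumes that the locus detected by each trace (if any) has already been tabulated in a hypothetical dictionary, and it draws each sample independently with a fixed random seed. The toy data are invented for the example.

```python
import random


def detection_curve(trace_to_locus, sample_sizes, seed=0):
    """Count distinct ERV loci detected in random trace samples.

    trace_to_locus -- dict mapping a trace ID to the ERV locus it detects,
                      or to None if the trace detects no insertion
                      (hypothetical precomputed table)
    sample_sizes   -- numbers of traces to draw for each point on the curve
    Returns a list of (sample_size, number_of_distinct_loci) pairs.
    """
    rng = random.Random(seed)
    trace_ids = sorted(trace_to_locus)
    curve = []
    for n in sample_sizes:
        sample = rng.sample(trace_ids, n)   # sampling without replacement
        loci = {trace_to_locus[t] for t in sample
                if trace_to_locus[t] is not None}
        curve.append((n, len(loci)))
    return curve


# Toy data: 40 traces, half of which detect one of 5 loci.
toy = {f"t{i}": (f"locus{i % 5}" if i % 2 == 0 else None) for i in range(40)}
for size, detected in detection_curve(toy, [5, 20, 40]):
    print(size, detected)
```

Reading the curve at the number of traces available for a test strain gives an estimate of the fraction of B6 insertions that would be recovered at that depth, and hence of the fraction missed.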
Since the sequence quality of the B6 trace archive is generally lower than that of the three test strains, the sampling process was based only on B6 traces with less than 1% “N”s. Sample trace datasets of different sizes were built into simulated trace databases, and the corresponding numbers of B6 ERV insertions detected with these datasets were plotted in Figure A.3. Independent random sampling was applied twice for datasets smaller than 12 million traces.

The second purpose of the screening simulations with B6 traces was to evaluate the accuracy of my screening method. Theoretically, all insertions found in the B6 traces in the simulation assays should also be detected in the B6 reference genome. However, I found a few cases of insertions cataloged as “polymorphic”, meaning that they came from B6 traces and were mapped to a significant locus in the B6 reference genome where no such insertion was found. One possible explanation is that the assembly of the B6 genome is not perfect, especially in repetitive regions. Indeed, only 49 of the 54 nonecotropic murine leukemia viruses (MLV) known to be present in B6 can be found in the mouse B6 assembly (Jern et al., 2007). Nonetheless, I considered all the “polymorphic” cases in each simulation assay to be false positives, and derived a conservative estimate of the accuracy of my screening method, with an average false discovery rate of 0.4% ± 0.1%.

2.4.6  Experimental verification of polymorphic insertions

The presence of an insertion was tested by amplifying genomic DNA from the following strains: SWR/J, C3H/HeJ, Balb/cJ, B6, A/J, DBA/2J, 129X1/SvJ and A/WySn. All strains or DNA were from the Jackson Laboratory. Primers (see Table A.2) flanking the potential insertion sites were used to amplify specific sequences from 75 ng of genomic DNA in a 25 µl reaction with Phusion DNA polymerase (New England Biolabs).
As per the manufacturer’s instructions, cycling conditions used annealing temperatures of 55-65°C and extension times between 20 seconds and 4 minutes. PCR products were visualized on agarose gels. In some cases, amplification with the flanking primers did not produce a product, so one flanking primer and one LTR primer were used to confirm the presence of an insertion; in these cases, the size of the ERV insertion could not be estimated. In two cases, marked as “F” in Table 2.1, the PCRs were unsuccessful in one of the strains, suggesting a structural rearrangement or the presence of other polymorphisms that prevented amplification with the primers used. Some products were sequenced directly from Minelute (Qiagen) gel-purified PCR fragments using the BigDye Terminator v3.1 Cycle Sequencing Kit (ABI) on an ABI PRISM® 3730XL DNA Analyzer system.

2.4.7  RT-PCR

RNA from mouse tissues was extracted using the RNeasy RNA isolation kit (Qiagen) according to the manufacturer’s recommendations. The presence of native transcripts was confirmed using primers located in the exons flanking the intron with the ETn insertion, with the following primer pairs: Col4a6-up-ex-s and Col4a6-down-ex-as; Dnajc10-up-ex-s and Dnajc10-down-ex-as; Mtm1-up-ex-s and Mtm1-down-ex-as; Opcml-up-ex-s and Opcml-down-ex-as; Prkca-up-ex-s and Prkca-down-ex-as. Then, RT-PCRs designed to look for chimeric transcripts between gene exons and the intronic ETn were performed. To search for transcripts utilizing the 2nd and 3rd splice acceptor sites in the LTR (see Figure 2.5), cDNA from the A/J tissues specified in parentheses was amplified using a common ETn primer located downstream of the LTR, IM_3as, and the following upper exon-specific primers: Col4a6-up-ex-s (eye), Dnajc10-up-ex-s (testis), Mtm1-up-ex-s (lung), Opcml-up-ex-s (cerebral hemisphere), and Prkca-up-ex-s (eye).
The same exon-specific primers and cDNA were used to search for transcripts utilizing the first splice acceptor site, this time with the LTR-specific primer located upstream of the first polyA site, MusD2_7130as. For Dnajc10, an additional PCR was performed with an upstream exon primer and a primer located at the very end of the LTR, IM_LTR_2as. Semi-quantitative RT-PCR for the Opcml gene was performed with a series of A/J and B6 cerebral hemisphere cDNA dilutions, using the primers Opcml-ex2-s and Opcml-ex3-as in the exons upstream and downstream of the intronic ETn insertion. For Gapdh, primers Gapdh_ex6F and Gapdh_ex7R were used. Opcml and Gapdh fragments were amplified from cDNA dilutions of 1/20, 1/40 and 1/80 (Figure A.7A). For each dilution, the intensity of the resulting band was quantified using ImageQuant LT (GE Healthcare) software, and graphed as the intensity of Opcml relative to Gapdh (Figure A.7B). The average levels of all experiments are displayed in Figure A.7B. All primer sequences for RT-PCR experiments are listed in Table A.2.

2.4.8  Northern blotting

RNA from A/J and B6 cerebellum was used. For each lane, 6 µg of RNA was denatured, electrophoresed in a 1% agarose, 3.7% formaldehyde gel in 1x MOPS buffer, transferred overnight to a Zeta-probe nylon membrane (Bio-Rad) and baked at 80°C. A probe specific for the Opcml exon upstream of the ETn insertion was synthesized by PCR using primers Opcml-ex2-s and Opcml-ex2-as and labeled with 32P using a Random Primers DNA Labeling System (Invitrogen). Membranes were prehybridized in ExpressHyb (BD Biosciences) for 4 hours at 68°C, hybridized overnight at the same temperature in fresh ExpressHyb, washed according to the manufacturer’s instructions and exposed to film.

2.4.9  Microarray analysis

The microarray data of mRNA expression were obtained through NCBI GEO accession GSE3594 (Zapala et al., 2005), which included 10 tissues profiled in A/J, B6, C3H/HeJ, DBA/2J and 129S6/SvEvTac mice.
The expression values for a given probeset replicated within the same strain and tissue were averaged, and the probeset expression rank was examined in two ways. First, each strain’s expression rank across genes within a given tissue was determined, and second, the inserter strain’s expression rank for a given gene was determined across tissues.

2.4.10  PolyERV custom tracks for Genome Browser

To give other researchers easy access to detailed information on all the polymorphic ERV insertions identified in this study, I produced custom tracks for the UCSC Genome Browser (named “PolyERV”), and have made them available online:

Mouse mm8 version URL: http://genome.ucsc.edu/cgi-bin/hgTracks?db=mm8&hgt.customText=http://142.103.207.67/people/dmager/PolyERV/track_data.txt
Mouse mm9 version URL: http://genome.ucsc.edu/cgi-bin/hgTracks?db=mm9&hgt.customText=http://142.103.207.67/people/dmager/PolyERV/track_data_mm9.txt

A screenshot of the PolyERV track is also given in Figure A.8. The polymorphic IAP and ETn/MusD insertions are annotated in red and green, respectively. An ERV insertion displayed as a horizontal bar is present in the B6 reference genome, and its orientation is shown by arrows inside the bar. An ERV insertion displayed as a vertical line is not present in B6. More detailed information, such as the genomic coordinates and the presence or absence of the polymorphic ERV in different inbred strains, can be found by clicking on the ERV id.

Chapter 3: TE Distributions Reveal Hazardous Zones and Residential Preferences in Gene Introns

A version of Chapter 3 has been published: Ying Zhang, Mark T. Romanish, Dixie L. Mager (2011). "Distributions of transposable elements reveal hazardous zones in Mammalian introns." PLoS Comput Biol 7(5): e1002046.
3.1  Background

As discussed in Chapter 1, mammalian TEs, which comprise at least one-third to one-half of mammalian genomic DNA, are major factors that have shaped the landscape of the mammalian genome through evolution. Although the majority of TEs reside outside genic regions (Lander et al., 2001; Medstrand et al., 2002; Waterston et al., 2002), about 90% of all human RefSeq genes still contain at least some TE sequences in their introns. Due to their proximity to host genes, intronic TE insertions are generally considered to possess a greater potential for affecting gene expression through various molecular mechanisms (Section 1.5). Our current knowledge of the distribution of TEs within gene introns is very limited, however, and it remains unclear why some intronic TEs perturb gene transcription while most do not. To fully understand their biological effects, it would be useful to determine which intronic TEs are most likely to affect gene expression, so that they may be prioritized for functional analyses. With a growing appreciation for SINE and LINE insertional polymorphisms in human (see Section 1.4.1), such predictions would be particularly helpful in identifying those polymorphic TE insertions with the greatest probability of affecting gene transcription, and thereby possibly contributing to phenotypic variability or disease susceptibility in humans.

In this study, I conducted a set of bioinformatics analyses of TE distribution patterns within human and mouse genes, which revealed TE underrepresentation zones and distributional biases in gene introns. TEs that do occur within the underrepresentation zones are more likely to be involved in aberrant gene splicing. Known cases of intronic disease-causing TE insertions are primarily located within these zones, strongly suggesting that TEs in these locations are more likely to be harmful and to be selected against.
The results of my study reveal a distinct tendency for TEs to affect gene transcription when poised near exons, and point to their continued role in catalyzing genome evolution.

3.2  Results and Discussion

3.2.1  Intronic regions near exon boundaries are depleted of TE insertions

According to my genomic survey, 85-90% of mouse and human protein-coding genes contain TE sequences in their introns. In a computational study of the relationship between Alu SINEs and alternative splicing, Lev-Maor et al. reported a drop in Alu density within 150 bp of intron boundaries (Lev-Maor et al., 2008). Based on this observation and the fact that most intronic splice signals are located at the 5′ and 3′ ends of introns (Lodish et al., 2007), I hypothesized that de novo intronic TE insertions near exons are more likely to be mutagenic and, consequently, that the frequency of TEs near intron ends would be significantly lower than expected. To analyze the distributions of various TE types within introns, I conducted computer simulations to determine theoretical TE distribution patterns (see Section 3.4.2 for details). I then determined the actual distribution pattern of intronic TEs according to their distance to the nearest exon. To address the potential effect of “distance shifting”, a hypothesized result of later TE insertions or other rearrangements occurring between a specific TE and its nearest exon, I also analyzed the distribution of the youngest 20% of intronic TEs. However, only minor differences were observed compared to all intronic TEs in the genome (data not shown). To show clearly the difference between simulated and actual TE distributions at each predefined position in introns, I calculated the “standardized frequency” of observed TEs (see Section 3.4.4).
The level of TE representation at each predefined intronic interval is determined from the difference between the actual TE distribution in the genome (observed) and the computer simulation of random TE insertions (expected). An overrepresentation of a given TE type within the corresponding intronic region is reflected by a positive value; an underrepresentation is indicated by a negative value. I found, as expected, that all four major TE types are highly underrepresented near intron boundaries in both human (Figure B.1A) and mouse (data not shown). I next applied the same distribution analysis to only full-length or near full-length TE sequences (see Table 3.1 for “full-length” definitions). As shown in Figure B.1B for human, full-length TEs were again highly underrepresented close to exons, but most TE types except SINEs showed larger underrepresentation zones (hereafter shortened to “U-zone”) compared with the all-TE distributions.

Table 3.1 Intronic underrepresentation zones by TE type*

TE Type   Human U-zone     Mouse U-zone     Human U-zone    Mouse U-zone    Human cutoff size   Mouse cutoff size
          for All (bp) a   for All (bp) a   for FL (bp) b   for FL (bp) b   of FL (bp) c        of FL (bp) c
SINE      100              100              100             100             >250                >100
LINE      50               100              2000            2000            >5000               >5000
LTR       2000             1000             5000            2000            >5000               >5000
DNA       50               100              2000            2000            >1000               >1000

* The distributions of TEs were normalized by the overall G/C content preference of each TE type
a Underrepresentation zone based on distribution of all elements of each TE type
b Underrepresentation zone based on distribution of only 'near full-length' (FL) elements
c The cutoff size of full-length elements for each TE type was determined as slightly shorter than the average full-length elements as described in (Lander et al., 2001) and (Waterston et al., 2002)

I also noticed that intronic regions more than 20 kb from exons showed a significant underrepresentation of SINEs compared to random simulations.
Unlike the patterns close to exons, intronic TE distributions more than 20 kb from exons are less likely to be due to purifying selection, so I searched for other explanations. SINE elements are more abundant in G/C-rich regions (Lander et al., 2001; Medstrand et al., 2002) and, since large introns resemble intergenic regions in terms of G/C content (being generally A/T rich) (Kalari et al., 2006), I postulated that the drop in SINE frequency compared to random simulations in deep intronic regions was an effect of local G/C content. To determine if this is indeed the case, I normalized my random simulations by the local G/C content as described in Materials and Methods (Section 3.4.3). Indeed, after applying such normalization, the underrepresentation of SINEs in deep intronic regions greatly flattened out, while the sizes of the U-zones near exons were not affected. All my subsequent analyses, therefore, employed this normalization. Figure 3.1 shows the normalized plots for all human TEs (Figure 3.1A) and full-length TEs (Figure 3.1B); the plots for mouse TEs are very similar (Figure B.2).

It is of interest to note that the sizes of the U-zones near intron boundaries differ between TE types (Table 3.1). Original insertion site preferences, natural selection and genetic drift can all contribute to global TE distributions. While determining the initial integration site preference of TEs is difficult, if not impossible (especially for ancient families), a limited number of de novo TE integration studies showed that TEs in today’s human genome are distributed very differently from their initial target site preferences (see Section 1.3).
Since 97% of TEs in the human genome and 93% in the mouse genome have been fixed for more than 25 million years (Lander et al., 2001), it is indeed reasonable that their current distributions bear little resemblance to any original insertion site preferences, and are primarily the result of selection and genetic drift. Therefore, the TE U-zones identified here most likely result from purifying selection, rather than original avoidance of these regions during the integration process.

Figure 3.1 Intronic distributions of the four major TE types in human (normalized). The distributions of all (A) and full-length (B) intronic TEs in human are shown separately. The sizes of the U-zone observed for each TE type are specified in Table 3.1. In both A and B, the x-axis shows a series of predefined intronic regions based on the distance from a TE to the nearest exon. The y-axis represents the standardized frequency of TEs at each intronic region, and is normalized by G/C content for each TE type. The red dotted line indicates the expected distribution of TEs based on random computational simulations. Error bars are standard errors derived from the total number of corresponding TEs (sample size) in each bin.

3.2.2  TEs within their U-zones are shorter

The larger U-zones for full-length TEs (compare Figures 3.1A and B) suggest that purifying selection acts at much greater distances on full-length elements than on their partly deleted counterparts. This effect is not observed for SINEs, but these elements have a much shorter full-length size (~300 bp for human Alus) (Batzer and Deininger, 2002; Lander et al., 2001), generally carry fewer cryptic transcriptional regulatory signals, and are less harmful to the enclosing genes than other TEs (Cordaux et al., 2006). For these reasons, full-length SINE elements may be better tolerated at a closer distance to exons.
I next compared the average length of intronic TEs within and outside their full-length U-zones, and discovered a significant difference for all TE types in both species (Figure 3.2 for human; Figure B.3 for mouse). In fact, most elements within their respective U-zones are truncated, while a greater proportion of TEs beyond such zones are full-size elements, resulting in a larger size variance (see the difference between upper whiskers in Figure 3.2 for human and also Figure B.3 for mouse). The length of individual TEs is therefore an important factor in dictating their genomic distributions, indicating that larger elements are more likely to be genotoxic when positioned near exons. These results support previous work on L1 LINEs indicating that, compared to shorter elements, full-length L1s have more potentially disruptive splice and polyadenylation signals (Belancio et al., 2006), greater effects on expression of enclosing genes (Ustyugova et al., 2006), and a greater fitness cost (Boissinot et al., 2006).

Figure 3.2 Average size of human TEs within and outside the U-zone. Each TE type is divided into two groups as shown on the x-axis: one group for elements located within the corresponding full-length U-zone, and another group for those beyond. The average size of each TE group is indicated as the horizontal bar within each box, which represents the central 50% of data points of the group. Outliers beyond the 1.5x IQR (interquartile range) whiskers are not shown. P-values shown on top of each boxplot are based on the two-sample Wilcoxon test.

3.2.3  TEs near exons exhibit strong orientation and splice-site bias

I next examined the distribution of intronic TEs in the sense orientation versus those in antisense with respect to the enclosing genes (see Figure 3.3A for human and Figure B.4A for mouse).
Figure 3.3 Distributional biases of full-length human intronic TEs. A) Orientation bias of full-length intronic TEs. The y-axis shows the logarithmic fold-difference of TE frequency between sense and antisense oriented full-length TEs. B) Splice site bias of full-length intronic TEs. The y-axis shows the logarithmic fold-difference of full-length TE frequency between TEs close to the SA site and TEs close to the SD site. The x-axis shows a series of predefined intronic regions based on the distance from a TE to the nearest exon. Error bars are standard errors derived from the total number of corresponding TEs (sample size) in each bin.

Since DNA transposons comprise only about 3% of both the human and the mouse genomes, almost all of them being ancient elements without evidence of any transposition activity during the past 37-50 Myr (Section 1.4.1), I excluded them from the following analyses to avoid uncertainties introduced by their relatively small numbers. While previous studies have found an overall antisense orientation bias in genes (particularly for LTR elements and LINEs; see Section 1.2.2), I show in this study the existence of a much stronger antisense bias for both LINEs and LTR elements near exons. The excess of antisense TEs compared with sense elements near intron boundaries is probably the result of purifying selection, like the genome-wide orientation bias of TEs in genes. This indicates that, in general, sense-oriented TEs near splice sites have a higher probability of influencing normal gene transcription, and are potentially more harmful to the host gene. Interestingly, for SINEs I observed the same strong antisense bias in the mouse (Figure B.4A), but in the human genome I observed a sense orientation bias instead for SINEs at a close distance of 20-200 bp from exons (Figure 3.3A). These data are consistent with the Alu SINE study of Lev-Maor et al.
(Lev-Maor et al., 2008), in which the authors observed more sense-oriented Alu elements near intron termini. Since Alus account for two-thirds of human SINE elements, and many antisense Alus possess a strong cryptic SA signal (Gal-Mark et al., 2008), selection against antisense-oriented elements may explain the unusual underrepresentation of antisense-oriented SINEs near splice sites in humans.

I further looked for evidence of any distributional bias of intronic TEs in terms of their proximity to either splice donor sites (SDs) or splice acceptor sites (SAs). I found the total numbers of elements near SA sites to be much lower than near SD sites for all three retrotransposon classes examined (see Figure 3.3B for human and Figure B.4B for mouse). Since the core intronic splice signals at SD sites usually consist of only about 6 bp of terminal intron sequence compared with 20-50 bp at SA sites (Lodish et al., 2007), selection against physical disruption of critical splice motifs likely underlies this TE underrepresentation near SA sites.

Theoretically, harmful antisense transcripts of protein-coding exons may be generated by read-through transcription of antisense TEs near SD sites. If such antisense transcripts have significant detrimental effects, one might expect, due to purifying selection, a larger proportion of TEs near SD sites to be in sense rather than in antisense. As shown in Figure 3.4A (human) and Figure B.5A (mouse), the predicted bias toward sense-oriented TEs near SD sites was not found, except for human SINEs. This is likely explained by the previously noted fact that antisense Alus possess cryptic SA signals. For other TE types I observed more SD-associated elements oriented in antisense, indicating that antisense transcription is most likely effectively silenced or not a general problem, and that sense-oriented TE insertions are, in fact, more detrimental. The same analysis of TEs near SA sites revealed similar orientation bias patterns.

Figure 3.4 Orientation bias of human full-length intronic TEs based on their proximity to different types of splice sites. Orientation bias of full-length TEs near SD sites (A) and SA sites (B). The x-axis shows a series of predefined intronic regions based on the distance from a TE to the nearest exon. The y-axis shows the logarithmic fold-difference of TE frequency between sense and antisense oriented TEs. Error bars are standard errors derived from the total number of corresponding TEs (sample size) in each bin.

3.2.4  A high fraction of known mutagenic intronic TEs reside within U-zones

If the reduced frequency of TEs near intron boundaries reflects the force of selection against harmful insertions, one would predict that a higher fraction of mutagenic TEs in gene introns is located within these TE underrepresentation zones. To test this prediction, I compiled information on documented intronic mutagenic TE insertions, and examined their integration sites in introns. Based on TE activity and data availability, I focused on the following three TE families in my analyses: human Alu (SINE), human L1 (LINE), and mouse LTR elements.

Alus, the most abundant TE family, have successfully propagated in the human genome and reached a total number of over one million copies (Lander et al., 2001). Some of these elements are still active today, generating new insertions and causing mutations linked to diseases (Batzer and Deininger, 2002; Gallus et al., 2010; Ganguly et al., 2003). Based on the information provided by the dbRIP database (http://dbrip.brocku.ca/) (Wang et al., 2006), I found six de novo Alu insertions associated with human diseases within introns, all belonging to the AluY subfamily (the youngest subfamily of Alu) and causing splice defects of the enclosing gene (Table B.1).
De novo disease-causing insertions of L1, the active LINE family in humans, have also been reported (Brouha et al., 2003; Chen et al., 2005; Ostertag and Kazazian, 2001; Yoshida et al., 1998). These elements play important roles in human retrotransposon-mediated pathogenesis, not only by encoding reverse transcriptase (RT) and other proteins required for their own retrotransposition, but also by mobilizing Alus (see Section 1.1.2). My search of the dbRIP database in this study identified a total of five intronic L1s associated with human diseases (Table B.2), all of which cause transcriptional disruptions.

Since no mutagenic LTR insertions and only a few insertionally polymorphic ERVs or LTRs have been reported in human (Section 1.4.1), I turned to the mouse genome, where ERVs/LTR elements cause ~10% of germline mutations, many of which have been well studied (Maksakova et al., 2006). I collected in total 40 cases of mutagenic LTR elements in mice: 15 from the IAP family, 18 from the ETn/MusD family, and seven from other ERV/LTR retrotransposons. All these ERV-induced intronic mutations in mice are due to transcriptional disruptions of the enclosing gene (Table B.3).

For the three TE families listed above, I compared the intronic distribution of mutagenic elements with all full-length counterparts in the reference genomes, and found highly consistent results (Figure 3.5 and Table 3.2). As indicated in Figure 3.5A, all six mutagenic Alu insertions are within the U-zone of SINEs (i.e. <100 bp from the nearest exon), and all are oriented antisense with respect to the enclosing gene. Moreover, five out of the six cases are near SA sites. In comparison, only 1.83% of all full-length AluYs in the reference human genome are located within the 100 bp U-zone, strikingly lower than the mutagenic elements and also more than two-fold lower than expected by chance (p < 2.2e-16; one-sample proportion test).
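As a worked illustration of the one-sample comparison just reported, the sketch below tests the observed count of full-length AluYs in the U-zone (989 of 54136, Table 3.2) against the simulated expectation of 4.5%. It assumes a two-sided normal-approximation test with a continuity correction; the text does not state which implementation was used, so the function name and the choice of correction are assumptions rather than the method of this study.

```python
import math
from typing import Tuple


def one_sample_proportion_test(successes: int, n: int, p0: float,
                               continuity: bool = True) -> Tuple[float, float]:
    """Two-sided z-test of an observed proportion against a fixed value p0.

    Uses the normal approximation to the binomial; the optional continuity
    correction subtracts 0.5 from the absolute count difference.
    """
    diff = abs(successes - n * p0)
    if continuity:
        diff = max(0.0, diff - 0.5)
    sd = math.sqrt(n * p0 * (1.0 - p0))
    z = diff / sd
    p_value = math.erfc(z / math.sqrt(2.0))   # two-sided tail probability
    return z, p_value


# Counts and expected fraction taken from Table 3.2 (human Alu, full-length row).
z_alu, p_alu = one_sample_proportion_test(successes=989, n=54136, p0=0.045)
observed_pct = 100.0 * 989 / 54136
```

With these inputs the observed fraction is well under half of the expected 4.5%, and the resulting p-value falls far below the 2.2e-16 bound quoted in the text.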
For all full-length AluYs within the U-zone, I observed 47.7% of elements in antisense, slightly lower than the random level (50%) but much lower than for mutagenic insertions. Since intronic TEs show their strongest splice-site bias when they are in extremely close proximity to an exon (Figure 3.3B), I examined full-length intronic AluYs located less than 20 bp from exons, and observed only 10% of such elements near SA sites. Although this result cannot be directly compared to the case of mutagenic Alus due to their insufficient number within 20 bp of exons, the fact that five out of six mutagenic Alus are near SAs is noteworthy.

Figure 3.5 Comparisons of TE frequency within the U-zone. Three TE types were examined and results were plotted in panels A, B, and C for the human Alu, human L1, and mouse LTR elements, respectively. In each plot, three groups of comparisons are shown: ‘U-zone’ stands for TE insertions within the U-zone; ‘antisense’ for human Alu or ‘sense’ for human L1 and mouse ERV indicates TEs within the U-zone in the corresponding orientation with respect to the enclosing gene; ‘< 20 bp to SA’ indicates TE insertions within 20 bp of SA sites, with an exception for human Alu and L1 mutagenic TEs (marked by shading), for which all cases were included due to limited total numbers. The y-axis shows the percentage of TEs that belong to the corresponding groups. Bars in each group represent mutagenic TE insertions (green), polymorphic TE insertions (light blue), all full-length TE insertions in the reference genome (yellow), and computational simulation as a random control (dark blue). Error bars represent standard errors derived from the total number of cases (sample size) for each category.

Similarly to mutagenic Alus, Figure 3.5B shows that all five mutagenic L1 elements are within the U-zone for full-length LINEs (i.e. <2 kb from the nearest exon).
Among them, four are sense-oriented and four are near SA sites (three of them are common to both sets). In contrast, only 23.0% of full-length intronic L1s in the reference genome are within the U-zone, which is significantly lower than both the mutagenic L1s and my random simulation (p < 0.0004 and p < 2.2e-16, respectively; two-/one-sample proportion test). Of those elements within the U-zone, only 27.7% are in sense, significantly lower than both mutagenic insertions and the simulation (p < 0.035 and p < 2.2e-16, respectively; two-/one-sample proportion test). Although the number of full-length L1s in the reference genome within 20 bp of exons is very limited, among a total of seven cases only two were found near SA sites.

Table 3.2 Intronic distributional biases of mutagenic, polymorphic, and all full-length TEs

TE Type     Group           Total TE   TEs in U-zone /        Sense TEs /         TEs near SA /
                            cases      total TEs              TEs in U-zone       TEs ≤ 20 bp to exon
Human Alu   Mutagenic a     6          6/6 (100%)             6/6 (100%)          5/6* (83.3%)
            Full-length b   54136      989/54136 (1.8%)       472/989 (47.7%)     3/30 (10%)
            Expected c                 4.5%                   50%                 50%
Human L1    Mutagenic       5          5/5 (100%)             4/5 (80%)           4/5* (80%)
            Full-length     10134      2328/10134 (23.0%)     644/2328 (27.7%)    2/7 (28.6%)
            Expected                   28.6%                  50%                 50%
Mouse LTR   Mutagenic       40         29/40 (72.5%)          20/26 (76.9%)       4/6 (66.7%)
            Polymorphic d   161        56/161 (34.8%)         13/56 (23.2%)       0/0
            Full-length     10150      1447/10150 (14.3%)     435/1447 (30.1%)    1/6 (16.7%)
            Expected                   36.6%                  50%                 50%

* Due to the limited number of cases, all human Alu and L1 mutagenic insertions are included rather than only using elements within 20 bp to SAs.
a Mutagenic insertions documented in the literature
b All full-length TEs (see cut-off size of full-length TEs in Table 3.1) in the reference human/mouse genome
c Based on random computational simulation
d Polymorphic ERV insertions present in the B6 mouse reference genome and at least one other mouse strain
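The two-sample comparisons for L1 can be illustrated in the same spirit. The sketch below assumes a pooled two-proportion z-test with a Yates-style continuity correction, which is one common form of the "two-sample proportion test"; the text does not name the software or the correction, so both the function name and that choice are assumptions. The counts are those listed for human L1 in Table 3.2.

```python
import math
from typing import Tuple


def two_sample_proportion_test(x1: int, n1: int, x2: int, n2: int,
                               continuity: bool = True) -> Tuple[float, float]:
    """Two-sided pooled z-test for the difference between two proportions.

    The optional continuity correction shrinks the absolute difference by
    0.5 * (1/n1 + 1/n2), which matters when one sample is very small.
    """
    pooled = (x1 + x2) / (n1 + n2)
    scale = 1.0 / n1 + 1.0 / n2
    diff = abs(x1 / n1 - x2 / n2)
    if continuity:
        diff = max(0.0, diff - 0.5 * scale)
    se = math.sqrt(pooled * (1.0 - pooled) * scale)
    z = diff / se
    p_value = math.erfc(z / math.sqrt(2.0))   # two-sided tail probability
    return z, p_value


# Mutagenic L1s vs. all full-length intronic L1s: fraction inside the U-zone.
z_zone, p_zone = two_sample_proportion_test(5, 5, 2328, 10134)

# Same two groups, restricted to the U-zone: fraction in sense orientation.
z_sense, p_sense = two_sample_proportion_test(4, 5, 644, 2328)
```

Under these assumptions both p-values come out just below the bounds quoted in the text (0.0004 for the U-zone fraction and 0.035 for the sense fraction). With only five mutagenic cases the normal approximation is rough, so an exact test would be a reasonable cross-check.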
I also examined the same parameters for mouse LTR elements (Figure 3.5C and Table 3.2). As expected, a high fraction of mutagenic insertions (72.5%) are within the U-zone of full-length mouse LTR elements (i.e. <2 kb from the nearest exon). Remarkably, all 15 mutagenic insertions from the IAP family were within the 2 kb U-zone. Since the orientation information of some mutagenic LTR elements within the U-zone was not indicated in their original reports, I checked the remaining 26 cases, and found that 20 (76.9%) were oriented in sense. Among these mutagenic insertions in mice, five are located within 20 bp of exons, three of which are near SA sites (60%). The situation is completely different for all full-length LTR elements in the sequenced mouse genome (strain C57BL/6J, or B6), however. In contrast to mutagenic insertions, only 14.3% of full-length LTR elements in the reference genome were located within the 2 kb U-zone (p < 2.2e-16; two-sample proportion test), and of these elements only 30.1% are in the sense orientation (p < 2.65e-09; two-sample proportion test). At a distance of less than 20 bp from exons, I found six full-length LTR elements in the B6 reference genome, only one of which is near the SA site (16.7%).

In summary, the foregoing analyses of mutagenic versus all full-length elements for the three retrotransposon families consistently showed an overrepresentation of mutagenic TEs within their respective U-zones, but an underrepresentation of all full-length elements within the same regions. Moreover, apparent differences in orientation and splice-site biases were also observed between mutagenic TEs and all full-length elements in the reference genomes. These observations strongly suggest that intronic TE insertions within the U-zone have a much higher potential to be deleterious to the enclosing gene, particularly when oriented in antisense for human SINEs and in sense for LINEs and LTR elements.
When intronic TE insertions are in extreme proximity (e.g. < 20 bp) to an SA site, they are very likely to be harmful and may cause functional abnormality of the enclosing gene.

3.2.5  Polymorphic LTR elements in mice show an intermediate distribution pattern

I next extended my analyses to polymorphic AluY and L1 insertions not associated with any disease based on the dbRIP data. These elements are considered relatively young since they are not fixed in humans. If, indeed, selection is still working upon these TEs, one might expect to see an intermediate distribution pattern between that of mutagenic and all elements. However, for both polymorphic AluYs and L1s, I observed no significant differences from all full-length elements in the reference human genome (data not shown). Though this result may be partially accounted for by the limited total number of polymorphic insertions documented in dbRIP, it is very likely that the distribution of these polymorphic TEs has already been shaped by selection. For the youngest insertionally polymorphic mouse LTR elements, however, I had previously shown that they do have a distinct prevalence in introns and orientation bias compared with older elements (Section 2.2.3). This finding suggests that some of these insertions are detrimental, but have not been eliminated due to the artificial breeding environment of inbred strains (Maksakova et al., 2006; Waterston et al., 2002; Zhang et al., 2008). Indeed, some known detrimental LTR insertions have even become fixed in one or a few mouse strains (Druker et al., 2004; Ho et al., 2004). I therefore analyzed a list of polymorphic LTR insertions in four mouse strains from my previous study (see Chapter 2), in which I had detected different distributions between polymorphic and common LTR elements.
Here I used polymorphic IAP and ETn/MusD elements present in only one of the four analyzed mouse strains (presumed to be the youngest elements), and found that 34.8% of intronic insertions were within the 2 kb U-zone (Figure 3.5C and Table 3.2), a fraction very close to the simulated prediction of a random distribution, but significantly higher than all full-length LTR elements in the mouse reference genome (14.3%; p < 5.58e-13; two-sample proportion test) and lower than the mutagenic insertions (72.5%; p < 9.79e-05; two-sample proportion test). Moreover, I observed 23.2% of polymorphic LTRs in the U-zone to be sense-oriented, which, though showing no statistical difference from that of all LTRs, is highly significantly lower than the mutagenic cases (p < 6.26e-07; two-sample proportion test). Since my list of polymorphic LTR insertions in mice does not contain any intronic insertions within 20 bp of an exon, I could not perform the analysis of splice site proximity bias. Nonetheless, the above observation of an intermediate distribution pattern of polymorphic LTRs between mutagenic LTR insertions and all full-length LTRs in the reference genome indicates that purifying selection is the most likely underlying force shaping the observed intronic TE distribution patterns, and the evidence further suggests that such a process is ongoing.

3.2.6  Chimeric transcripts and cryptic splice signals differ within and outside the U-zone

If TEs within their respective U-zones are more likely to be harmful by causing splicing abnormalities, one can make two predictions. The first is that TEs located in the U-zones would be associated with chimeric TE-gene transcripts more often than TEs located elsewhere in introns. To test this prediction, I downloaded and analyzed the human expressed sequence tag (EST) data from the UCSC Genome Browser, in which only spliced transcripts were included. I then screened for all spliced ESTs overlapping with intronic TEs (i.e.
chimeric ESTs). As Figure 3.6A shows, 11.7% of human SINE elements within the 100 bp U-zone are associated with chimeric ESTs. In contrast, this percentage is only 1.6% for SINE elements outside the U-zone. Similarly, for human LINEs in their 2 kb U-zone, I found 4.6% to be associated with chimeric ESTs, while the percentage outside the U-zone drops significantly to 0.7%. I also identified 2.9% of human LTR elements as chimeric-EST-related in the 5 kb human LTR U-zone, but only 0.9% of elements outside the U-zone. All these results are highly statistically significant (all p-values < 2.2e-16; two-sample proportion test), which reinforces the notion that TEs within their U-zones are more likely to be involved in aberrant splicing. It should be pointed out, however, that the splicing events detected by this analysis are of unknown relevance and, indeed, are unlikely to have significant detrimental effects because these TEs are fixed.

The second prediction is that TEs not eliminated from the U-zone would have weaker splicing signals compared with other TEs. To examine this issue, I computationally analyzed potential splice sites within randomly selected solitary LTR sequences in human introns using NNSplice (Reese et al., 1997) (see Section 3.4.5 for details). As shown in Figure 3.6B, as the distance between the intronic LTR and its nearest exon decreases, the average number and the strength of predicted splice sites in these LTR sequences also decrease. This observation indicates that LTRs carry fewer and weaker cryptic splice sites within the U-zone, especially when they are located in close proximity to exons.

Figure 3.6 Chimeric transcripts and cryptic splice signals of TEs within and outside the U-zone. A) EST-associated human intronic TEs within and outside the U-zone. Each TE type is shown as a group on the x-axis. The y-axis shows the percentage of intronic TEs that contribute to chimeric ESTs with the enclosing gene.
The white/dark bar represents all TEs of each TE type within/outside the U-zone, respectively. The fractions beside each bar indicate the total number of TEs in each category (denominator) and the number of cases involved in chimeric ESTs (numerator). Error bars represent standard errors derived from the total number of cases (sample size) for each category. B) Predicted number and strength of cryptic splice sites in human LTRs. The top panel gives the average strength of predicted splice sites within sampled LTR sequences in each bin (the vertical axis) based on the distance from the LTR to its nearest intron boundary (the horizontal axis). The bottom panel shows the same but for the average total number of predicted splice sites.

3.2.7  Abnormal gene splicing linked to polymorphic LTR element insertions near intron boundaries

While the foregoing EST analysis suggests the importance of U-zones in TE-gene interactions, it would be useful to predict which particular intronic TEs are most likely to influence gene transcription based on their size, distance to the nearest exon, orientation, and proximity to particular splice sites. In my initial evaluation of this concept, I examined a panel of polymorphic LTR element insertions in inbred mouse strains since they are currently highly active and, as previously discussed, their genomic distribution suggests that some are likely detrimental yet are being maintained due to the artificial breeding environment. In order to take advantage of the available EST/mRNA data in the B6 reference genome, I restricted the intronic polymorphic LTR elements being examined to those present in the B6 mouse strain (Zhang et al., 2008). After excluding solitary LTRs and complex cases due to multigene families, I identified 44 full-length polymorphic LTR elements within the 2 kb U-zone (data not shown).
I then inspected each region using the UCSC Genome Browser (mouse genome version: mm9) to look for chimeric ESTs/mRNAs involving the LTR element and the enclosing gene, and found such transcripts for 19 of the 44 genes. For most of these 19 genes, the aberrant forms appear to be minor in abundance, so it is difficult to estimate their overall impact on gene expression. Among these 19 genes, however, transcription of three (Cdk5rap1, Adamts13, and Wiz) has been shown previously to be affected significantly by the embedded LTR element (Banno et al., 2004; Baust et al., 2002; Druker et al., 2004). Judging from the frequency of annotated chimeric transcripts, two other genes among the group of 19 are of special interest: Kcnh6 (potassium voltage-gated channel, subfamily H (eag-related), member 6) and Trpc6 (transient receptor potential cation channel, subfamily C, member 6). While no evidence of transcriptional disruption caused by LTR element insertions for these genes has been reported in the literature, UCSC Genome Browser snapshots of their deposited mRNAs suggest significant involvement in the transcription of each gene. For Trpc6, two of seven mRNAs in the database terminate within a polymorphic IAP LTR element (Figure 3.7A), and for Kcnh6, one of three annotated mRNAs terminates within another IAP insertion (Figure 3.7B).

Figure 3.7 Chimeric transcripts of the Trpc6 gene and Kcnh6 gene in mice. Snapshots of the Trpc6 gene (A) and Kcnh6 gene (B) in UCSC Genome Browser are shown, with protein domains indicated. The red bar above the RefSeq gene annotation track shows the polymorphic LTR element insertion in B6 mice. For each gene, the mRNA track is shown, including the mRNAs terminating in the LTR element. Positions of primer sets used in the qRT-PCR experiments are indicated as arrowheads below the snapshot for each gene, with the upper pair (blue) for primers upstream of the polymorphic LTR element insertion, and the lower pair (green) for primers flanking the position of the LTR element insertion.

Trpc6 plays an important role in vascular and pulmonary smooth muscle cells, and its deficiency impairs certain allergic immune responses and smooth muscle contraction (Gonzalez-Cobos and Trebak, 2010). Kcnh6, also termed Erg2 (eag related protein 2), encodes a pore forming (alpha) subunit of potassium channels, and may serve a role in neural activation (Elmedyb et al., 2007). To examine the potential effect of the IAP polymorphisms on transcription of these two genes, the presence or absence of these insertions was confirmed by my colleague, Mark Romanish, using genomic PCR in a panel of mouse strains that included B6, A/J, and 129SvEv. As a result, we were able to confirm that, indeed, an IAP is present in B6 and A/J but not in 129SvEv for the Trpc6 gene, and that the IAP in the Kcnh6 gene is present only in B6 but not in A/J and 129SvEv (data not shown). Since both genes are highly expressed in the brain, quantitative RT-PCR was performed on brain cDNA from all three mouse strains by setting one primer pair upstream of the insertion site and another primer pair flanking the insertion site, as indicated in Figure 3.7. In mouse strains carrying the IAP insertion, we found a significant decrease in the amount of normally spliced transcripts involving exons flanking the ERV insertion compared with exons upstream of the insertion. In contrast, we saw less difference between the upstream and flanking primer sets in strain(s) without the IAP insertion (Figure 3.8). The blockage of normal Kcnh6 transcription is particularly striking, with very little normal splicing occurring for exons flanking the IAP in the B6 strain. These data suggest significant transcriptional interference of these two genes mediated by the embedded IAPs. It would be interesting to determine if this interference does, in fact, result in phenotypic differences between mouse strains with and without these insertions.

Figure 3.8 Effect of polymorphic LTR element insertions on transcription of the Trpc6 and Kcnh6 genes. Quantitative RT-PCR of the Trpc6 (A) and Kcnh6 (B) genes using brain RNA from the indicated mouse strains. Green bars show the amount of transcripts detected by the primer set upstream of the polymorphic IAP insertion, and blue bars show the amount of transcripts detected by the primer set flanking the location of the IAP insertion. Each bar represents the mean of at least 4 experiments ± standard deviation, which was first normalized to β-actin levels in the queried strain, and then represented relative to 5′ expression levels for each gene in B6 mice. The plus and minus signs show the presence and absence of the IAP insertion in the corresponding mouse strain, respectively. Note: Experiments performed by Mark Romanish.

3.3  Concluding Remarks

In this genomic study of the mammalian TE distributions within genes, I have identified intronic underrepresentation zones near exons, where fixed TEs occur less often than expected by chance. Strikingly, all documented human intronic Alu and L1 insertions and most mouse intronic LTR elements known to cause disease are located within these U-zones, strongly suggesting that TEs in these locations are more likely to cause transcriptional disruptions and to be eliminated by selection. Moreover, TEs within their U-zones are more likely to be involved in spliced chimeric transcripts than those located elsewhere in introns, suggesting that some may be slightly detrimental. Presumably in most of these cases the transcriptional effects must be insufficient to cause such insertions to be eliminated by purifying selection. It is possible, however, that even apparently subtle effects on gene splicing could have functional consequences.
Previous studies have demonstrated, on the other hand, that TEs fixed in the host genome can participate in gene transcription, producing alternative transcript isoforms that might have functional importance (see Section 1.6 in Chapter 1). Just as it is important to identify potentially deleterious TE insertions, it is also of great value to identify fixed TEs that contribute to normal gene expression and cell functionality. The U-zones identified in this study, together with TE size, orientation bias, and location relative to SD or SA sites, can be combined to help predict those TEs with a higher likelihood of functional significance, while also yielding new insights into the effects of TEs on gene regulation and evolution.

3.4  Materials and Methods

3.4.1  Source Data

TE annotations

The original TE annotation data obtained from the RepeatMasker tracks of the human hg18 genome and the mouse mm9 genome at the UCSC Genome Browser (http://genome.ucsc.edu) were further processed to fit this study. Since the annotations of TEs defined by RepeatMasker (http://www.repeatmasker.org) are only fragments based on the similarity to the consensus sequences of different TE families, a single full-length TE element may have multiple RepeatMasker entries if its sequence is not continuous in the genome. Therefore, in my analyses I computationally merged such TE fragments into a single element and counted them only once as an independent TE insertion event when they: 1) belong to the same TE family; 2) are on the same chromosome; 3) are within 10 kb of each other; 4) are in the same orientation.

EST data

The human EST data were also downloaded from the UCSC Genome Browser. Here I used the UCSC Table Browser to download only the spliced EST data, which were stored in the intronEst table. According to a reference at the UCSC Genome Browser (https://lists.soe.ucsc.edu/pipermail/genome/2008-June/016560.html), TE-only transcripts were not included in these datasets.
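The four fragment-merging rules listed under "TE annotations" can be sketched as follows. This is a minimal illustration rather than the code used in the study: the `Fragment` record and the `merge_fragments` function are hypothetical names, and the sketch assumes that "within 10 kb" means the gap between the end of the previous fragment and the start of the next one is at most 10,000 bp.

```python
from collections import namedtuple

# Hypothetical record for one RepeatMasker entry (not the UCSC table schema).
Fragment = namedtuple("Fragment", "family chrom start end strand")


def merge_fragments(fragments, max_gap=10_000):
    """Merge fragments that share family, chromosome and orientation and
    lie within max_gap bp of each other into single insertion events."""
    merged = []
    last = {}  # (family, chrom, strand) -> index of latest element in merged
    for frag in sorted(fragments, key=lambda f: (f.chrom, f.start)):
        key = (frag.family, frag.chrom, frag.strand)
        i = last.get(key)
        if i is not None and frag.start - merged[i].end <= max_gap:
            # Extend the existing element to cover this fragment.
            prev = merged[i]
            merged[i] = prev._replace(end=max(prev.end, frag.end))
        else:
            # Start a new independent insertion event.
            merged.append(frag)
            last[key] = len(merged) - 1
    return merged
```

Under these assumptions, two same-strand L1 fragments 600 bp apart would be counted as one insertion, whereas a fragment of the same family 17 kb further along, or one on the opposite strand, would be counted separately.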
3.4.2  Computer Simulation of Random TE insertions

To establish a baseline of TE distributions in gene introns, I applied computational simulations of random TE insertions in both the hg18 human genome and the mm9 mouse genome. In this study, I used the RefSeq gene annotation data downloaded from the UCSC Genome Browser. For each round of simulation, I generated 1,000,000 random genomic loci across the entire host genome to mimic randomized TE insertions. I divided intronic regions into 13 bins with gradually increasing bin size according to their distance to the nearest exon: 0-20 bp, 20-50 bp, 50-100 bp, 100-200 bp, 200-500 bp, 500-1000 bp, 1-2 kb, 2-5 kb, 5-10 kb, 10-20 kb, 20-50 kb, 50-100 kb, >100 kb. My intention in using increasing bin size was to establish a higher resolution at the interesting regions near intron boundaries, while at the same time maintaining a good overview of other intronic regions. I then calculated the fraction of simulated TE insertions located in each bin with respect to the total simulated insertions in introns. This simulation process was applied three times, and the average was taken for each genome as a control distribution for all further analyses. My calculation of the standard error of the mean based on three rounds of simulations showed a negligible sampling error for each bin (data not shown), confirming the suitability of using these results to represent the theoretical random TE distribution.

3.4.3  Normalization of intronic TE distribution by G/C content

To minimize the influence on TE distribution by local G/C content, I corrected my computational simulations of random TE distribution according to the overall G/C preference of each TE type. Specifically, I first performed a genome-wide evaluation of the G/C preference of each TE type by dividing the entire host genome into a set of consecutive 20 kb windows and calculating both the density of each TE type and the G/C density for each window.
Then I grouped these 20 kb windows by G/C density level (with a resolution of 1%) and calculated their average TE density at each G/C level. Based on the assumption that TE density should be close to the overall genome-wide TE density anywhere in the genome when there is no G/C preference, I calculated the fold-difference of the actual TE density at each G/C level compared with the genomic background level for each TE type. In this way, I derived a list of fold change values of TE density at each G/C level, which were then used as the normalization coefficients to correct the simulated distribution of random TE insertions.

3.4.4  Calculation of the standardized TE frequency levels

To determine the difference between the “observed” and the “expected” TE frequency at each predefined distance bin, I used the concept of residual to measure the standardized TE frequency:

\[
c = \log_{10}\left(\frac{obs - exp}{exp} + 1\right) = \log_{10}\left(\frac{obs}{exp}\right)
\]

where c is the residual of a given distance bin, obs is the total observed occurrence of a given TE type in that bin, and exp is the expected number of such TE insertions derived from my computational simulations. The common logarithm (log10) was used here to equalize the value ranges of over- and under-represented data, and the addition of “1” in the formula ensures that the argument of the logarithm is not a negative number. The absolute value of the residual c shows the degree of relative difference between the “observed” and the “expected”. When c is positive, the corresponding TE type is overrepresented in this region; when c is negative, the TE type is underrepresented.
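As a small worked sketch of the residual defined in Section 3.4.4, together with the distance bins of Section 3.4.2, the following may help. The function names are hypothetical, and two details are assumptions because the text does not specify them: a distance that falls exactly on a bin edge is assigned to the upper bin, and bins with zero observed or expected counts are rejected rather than handled specially.

```python
import math
from bisect import bisect_right

# Upper bin edges in bp, taken from the 13 distance bins of Section 3.4.2;
# the final bin (>100 kb) has no upper edge.
BIN_EDGES = [20, 50, 100, 200, 500, 1000, 2000, 5000,
             10000, 20000, 50000, 100000]


def bin_index(distance_bp):
    """Map a distance to the nearest exon onto a bin index 0..12.

    Assumption: a distance equal to an edge (e.g. exactly 20 bp) goes to
    the upper bin; the text does not say how such ties are handled.
    """
    return bisect_right(BIN_EDGES, distance_bp)


def residual(obs, exp):
    """Standardized TE frequency c = log10(obs / exp) for one bin."""
    if obs <= 0 or exp <= 0:
        raise ValueError("obs and exp must be positive for the logarithm")
    return math.log10(obs / exp)
```

For example, a bin that holds half as many TEs as the simulation predicts gives c ≈ -0.30, and a bin with twice as many gives c ≈ +0.30, which shows how the logarithm makes over- and underrepresentation symmetric around zero.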
3.4.5  Computational analysis of potential splice sites in LTRs

I used the web-based interface of the NNSplice program (http://www.fruitfly.org/seq_tools/splice.html), a bioinformatics tool based on artificial neural networks that predicts the presence and the strength of potential splice sites in any given input DNA sequence. Due to the limitation of the maximum length of total input sequences that NNSplice can take, in this analysis I only chose sense-oriented solitary LTR sequences (i.e. LTR sequences annotated with a size between 200-600 bp) in human introns. The intronic region was divided into a set of consecutive bins with bin size increasing according to the following distances from the LTR to the nearest exon: 0-200 bp, 200-500 bp, 500-1000 bp, 1-2 kb, 2-5 kb, >5 kb. For all bins except the first, 100 LTR sequences were sampled randomly three times, independently. The averaged total numbers and strength of potential splice sites based on the three samples were taken as the final values for each bin. Since the first bin (0-200 bp) contains only 101 cases in total, I took all those cases to calculate the average total number and strength of potential splice sites for this bin without sampling. Notably, when I calculated the average strength of potential splice sites for each bin, only the site with the highest score was considered for each LTR sequence.

3.4.6  RT- and qRT-PCR of polymorphic LTR insertions in mouse genes

DNA and RNA Isolation

Primary mouse tissue samples were dissected from healthy adult male C57BL/6J, 129SvEv and A/J mice, and later preserved in RNAlater (Ambion). Genomic DNA and total RNA were isolated from the indicated strains and tissues using TRIzol (Invitrogen) according to the manufacturer’s specification. Subsequently, nucleic material was quantified using a NanoDrop UV spectrophotometer (Thermo Scientific), and quality was assessed by gel electrophoresis.
Confirmation of LTR polymorphisms

The presence/absence of ERV/LTR polymorphisms was initially determined computationally (Chapter 2; Zhang et al., 2008). These predicted events were confirmed by PCR using 50 ng of genomic DNA from C57BL/6J, 129SvEv, and A/J mice. Primers flanking the predicted LTR polymorphisms in the Kcnh6 intron (gKcnh6-F: catcccagagctcaaagtgg; gKcnh6-R: tgcaccagtgcatgcatgc) and the Trpc6 intron (gTrpc6-F: gaagcatgccactctagagc; gTrpc6-R: tgtgcatgattgtgtaggtg) were used in a standard Platinum Taq DNA polymerase reaction (94°C-5 min; [94°C-0.5 min; 58°C-0.5 min; 72°C-0.5 min] x35; 72°C-7 min; 4°C-∞).

Quantitative RT-PCR

One microgram of total C57BL/6J, 129SvEv and A/J RNA was reverse transcribed using Superscript III (Invitrogen) following the manufacturer’s recommended protocol. The effect of the LTR polymorphisms on the expression of Trpc6 and Kcnh6 was assessed by qRT-PCR using primers situated 5′ and 3′ of the respective LTR insertions. Relative quantification of the indicated targets was carried out by the ΔΔCT method, essentially as before (Romanish et al., 2009), using the following primer sets: Kcnh6 5′ (qKcnh6-Ex11-F: cgagagaagctggattgctg; qKcnh6-Ex12-R: ctgtggatgctgaagtagctg); Kcnh6 3′ (qKcnh6-3′-F3: ctcagagttcagagtcgatgc; qKcnh6-3′-R: caccagagatttgtccattgc); Trpc6 5′ (qTrpc6-Ex2-F: cttagccaatgagctggcagtg; qTrpc6-Ex3-R: ccacttcctctgtgtttctgc); Trpc6 3′ (qTrpc6-Ex3-F2: agtatgaagtaaaaaaatttgtggctc; qTrpc6-Ex4-R2: aatggcaacagcaaggaccac); β-actin (β-actin-F: aaggccaaccgtgaaaagat; β-actin-R: gtggtacgaccagaggcatac). The amplification efficiency of each primer set was derived across a template dilution series. The achieved efficiencies were: Kcnh6 5′ 87%; Kcnh6 3′ 101%; Trpc6 5′ 98%; Trpc6 3′ 94%; and β-actin 96%. These values were incorporated into calculations to determine relative expression levels of the target and control genes.
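One common way to incorporate primer efficiencies into relative quantification is sketched below. The exact calculation used for these experiments is not given in the text, so this efficiency-corrected ratio is an assumption, and the function name and the Ct values in the example are purely illustrative (they are not measurements from this study). An efficiency of 100% corresponds to a doubling per cycle, in which case the expression reduces to the familiar 2^(-ΔΔCT).

```python
def relative_expression(eff_target, ct_target_sample, ct_target_calibrator,
                        eff_ref, ct_ref_sample, ct_ref_calibrator):
    """Efficiency-corrected expression of a target relative to a calibrator.

    eff_* are amplification efficiencies as fractions (0.96 for 96%), so
    the per-cycle amplification factor is 1 + eff. Ct values are threshold
    cycles for the target and the reference (normalization) gene.
    """
    target = (1 + eff_target) ** (ct_target_calibrator - ct_target_sample)
    ref = (1 + eff_ref) ** (ct_ref_calibrator - ct_ref_sample)
    return target / ref


# Illustrative Ct values only: the target crosses threshold two cycles later
# in the sample than in the calibrator, with an unchanged reference gene.
# Efficiencies are those reported above for Kcnh6 3' (101%) and beta-actin (96%).
ratio = relative_expression(1.01, 24.0, 22.0, 0.96, 18.0, 18.0)
```

With these made-up inputs the target is expressed at roughly a quarter of the calibrator level; with equal 100% efficiencies the ratio would be exactly 0.25.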
All Kcnh6- and Trpc6-specific primer sets were validated to amplify at an equal efficiency with respect to β-actin (normalization gene), and were determined to be suitable for subsequent ΔΔCT relative quantification qRT-PCR experiments.

Chapter 4: Gene Properties and Chromatin State Influence the Accumulation of TEs in Genes

A version of Chapter 4 has been published: Ying Zhang and Dixie L. Mager (2012). "Gene properties and chromatin state influence the accumulation of transposable elements in genes." PLoS One 7(1): e30158.

4.1  Background

Increasing availability of whole genome sequences during the last decade has revealed insights into the relationships of TEs and genes. As discussed in Chapter 1, it is widely accepted that the genome-wide distribution of TEs is far from random, which is likely due primarily to the long-term effects of natural selection (Section 1.2). While most previous studies have focused on identifying properties of TEs that might contribute to their genomic distribution (e.g. insertional orientation, local G/C content preference, etc.), in the study described in this chapter, I focused on evaluating the host-TE relationship by examining properties of the genes in which they reside. Several studies of intronic TE density and gene function have shown that TE-poor genes are usually linked to developmental processes, while TE-rich genes tend to be involved in metabolic pathways (Grover et al., 2003; Mortada et al., 2010; Sironi et al., 2006). Moreover, TE density in genes has been reported to be associated with gene expression patterns (Jjingo et al., 2011; Mortada et al., 2010; Sironi et al., 2006). Indeed, a study of TE accumulation in Drosophila melanogaster euchromatin examined TE density differences between soma- vs. germline-expressed genes (Fontanillas et al., 2007), finding a higher TE insertion rate for the latter. Notably, Kunarso et al.
recently reported that TEs have impacted the core regulatory networks of both human and mouse embryonic stem cells (ESCs) by being involved in ES-specific transcription factor binding and gene regulation (Kunarso et al., 2010). The previous studies cited above, while providing evidence that multiple gene properties correlate with TE accumulation/fixation rate in genes, are limited in the following respects. First, TE density conservation among orthologous genes in multiple species was not considered when identifying TE-poor and TE-rich genes. Given the low probability of having the same extreme density of independent TE insertions in orthologous genes among multiple species by chance, an enrichment of TE-rich/poor genes shared across species may lead to new insights into TE-host relations. Second, while a study had addressed the relationship between TE distribution and gene expression in germline/early embryo tissues (Fontanillas et al., 2007), it focused on invertebrates and was based only on indirect measurement of chromatin status such as EST abundance and microarray expression data. To address these limitations, I examined TE densities in genes of three mammalian species: human, mouse and cow, which are sufficiently diverged that most recognizable TE insertions are independent among these genomes (Elsik et al., 2009; Waterston et al., 2002). Additionally, since more than half of TEs in cows belong to ruminant-specific TE families that do not exist in humans and mice (Elsik et al., 2009), including the cow provided higher confidence when evaluating possible factors that may contribute to the extreme density of TEs. I then examined various properties of the TE-rich/poor genes shared across the above three species, including gene function, conservation, tissue-specificity of gene expression, and histone modification profiles in mouse ESCs.
My results indicated that TE distributions in genes have been determined by or correlate with multiple properties of host genes, reflecting the influence of TEs in shaping the landscape of host genomes.

4.2  Results and Discussion

4.2.1  Determining sets of orthologous genes with the same extreme of TE density

Although ~90% of human RefSeq genes contain sequences derived from TEs (mostly in introns), the coverage of TEs in each gene can be quite different. To determine if the difference is completely random, I identified genes that are either unusually enriched or depleted of TE sequences in three mammalian species: human, mouse and cow. As determined by the initial sequencing and comparative analyses of the mouse genome (Waterston et al., 2002), more than half of all human TEs are lineage-specific elements inserted after the human-mouse divergence, and 87% of all TEs recognizable in the mouse are lineage-specific, probably due to higher deletion/mutation rates in the mouse. Further, the whole-genome sequencing of taurine cattle revealed that at least 58% of TEs in the cow belong to ruminant-specific repeat families (Elsik et al., 2009). These data show that the majority of TEs detectable in the human, mouse and cow today were independent insertions introduced after their divergence. To find outlier genes enriched/depleted of TEs in each species, I downloaded the RepeatMasker annotation of TEs from the UCSC Genome Browser, and calculated the TE density of each gene by taking the ratio of TE coverage and gene size. Since most small genes contain few TE sequences due to the limitation of their size (Figure C.1; see Section 4.4.1), I restricted my analyses to genes larger than 10 kb. I next sought outlier genes using “top 10%” as an optimized cutoff threshold for genes with the highest/lowest TE densities (see Section 4.4.2 for details on optimization). Since TE/gene fraction (i.e.
TE density in genes) is linked to the gene length (Jjingo et al., 2011), I decided to control my TE density outlier analysis by gene size. Moreover, in my previous study of TE underrepresentation zones (U-zones) in mammalian gene introns (see Chapter 3), I had found a significant decrease of TE density near exons, suggesting that exon density is also a potential factor limiting the coverage of TEs in genes (i.e. genes with higher exon density are expected to have fewer TEs). Despite the fact that exon density and gene size are not independent (Figure C.2), I controlled my outlier analysis for both factors (Figure C.3; see Section 4.4.3) because of their moderate association strength (r = 0.688, r is the correlation coefficient for gene size vs. 1/exon density). By intersecting my datasets from the three species (see Section 4.4.4), those genes identified as “shared outliers” were selected for further analysis because of the low probability that their outlier status is due to random chance (single-side frequency < 0.0005). I found a total of 189 shared lower-outlier genes (SLOs; Table C.1) and 84 shared upper-outlier genes (SUOs; Table C.2), both numbers much higher than expected by chance (p < 2.2e-16 for both SUOs and SLOs, proportion equality test).

4.2.2  Chromosomal distribution and TE composition of shared outlier genes

Before performing any functional analysis of the SUO/SLO genes identified above, I was curious to see if there is any distributional bias at the chromosome level. To answer this question, I plotted physical locations of both SUOs and SLOs along each human chromosome. As shown in Figure C.4, apart from a disproportionately high number of SUOs on chromosome 19, no apparent enrichment was found on any particular chromosome for either type of shared-outlier gene.
When I counted the total SUOs and SLOs separately for each chromosome and produced scatter-plots according to chromosome size (Figure C.5), chromosome 19 stood out as an outlier with a relatively short size (2% of the human genome) yet encoding 13 of 84 (15.5%) SUO genes. Since human chromosome 19 is gene-rich compared to all other chromosomes (Lander et al., 2001), I wondered whether the majority of TEs in SUOs are SINE elements, especially Alus, which have been shown to be highly enriched near genes and G/C-rich regions (Section 1.2.1). Indeed, further examination revealed that 53 of 84 (61%) SUOs contain more than 50% TEs as SINEs, including all of the 13 SUOs located on chromosome 19. LINEs, however, are known to associate with A/T-rich or gene-poor regions (see Section 1.2.1), and to be overrepresented in gene-poor chromosomes such as chromosome X (Graham and Boissinot, 2006; Medstrand et al., 2002). Notably, while only three SUOs are on chromosome X, two of them contain more than 50% TEs as LINEs. These results suggest that the distribution of SUOs is not random and is likely associated with the TE composition and the overall G/C content (or gene density) of each chromosome. Further, I wondered if orthologous SUOs in different species might contain the same types of TEs at similar percentages. When I examined the TE composition of the 84 SUOs, I found similar patterns for most between human and mouse (Figure 4.1A, B), but not for cow (Figure 4.1C). To quantitatively measure the correlation of TE compositions between human and mouse SUOs, I calculated the proportion contributed by each of the four major TE types (i.e. the normalized TE composition) and the corresponding correlation coefficient (r).
Figure C.6 shows that the proportions of LINE and SINE elements are highly correlated between human and mouse (r = 0.827 for LINEs; r = 0.836 for SINEs), with only poor correlations being found for ERV and DNA elements (r = 0.408 for ERVs; r = 0.267 for DNA elements). Since the genomic density of LINEs or SINEs is highly associated with local G/C content, I calculated for all SUOs the correlation coefficient between the percentage of LINEs/SINEs and the local G/C content. As shown in Figure C.7, a strong correlation with G/C content is found for both TE types (r = -0.702 for LINEs and r = 0.775 for SINEs), suggesting that it is a major factor in determining the insertion/fixation probability of different TE types.

Figure 4.1 TE composition patterns of SUO genes. Patterns for human, mouse and cow are shown in (A), (B) and (C), respectively. Relative proportions taken by different TE types for each SUO gene are shown as a stacked bar, with the color scheme for the four major TE types indicated at the top. Genes are arranged in the same order for all three species. Gene names at the bottom are from the annotation of human RefSeq genes.

However, the same analysis in the cow genome showed much less tendency for similar TE composition patterns in SUOs compared with the human or mouse (Figure 4.1C). I reasoned that this is likely due to the fact that a large number of TEs in the cow are extremely rare in the other two species (Elsik et al., 2009). Almost half of the LINE elements in the cow, for example, belong to the RTE clade, which is also found in many other species including reptiles, insects and nematodes, but is absent from most mammals including primates and rodents (Adelson et al., 2009; Elsik et al., 2009; Malik and Eickbush, 1998). Moreover, while most of the remaining LINEs are L1 elements (the most abundant LINE family in human and mouse), only 60% belong to subfamilies present in humans (Adelson et al., 2009).
More dramatically, more than 92% of bovine SINEs are found in neither human nor mouse. Unlike the major SINE families such as Alu in humans and B1 in mice, which are derived from 7SL RNA, SINE elements in cows are derived from either tRNAs or truncated LINEs (Malik and Eickbush, 1998). Based on this unique genomic composition of TEs, it is not surprising to see a distinct composition pattern of TEs in cow SUOs. Indeed, a previous study reported that L1s in the cow show little correlation with local G/C content, while RTE LINEs and most SINEs are both negatively associated with the density of local G/C (Adelson et al., 2009). This feature is clearly different from that of human and mouse, in which LINEs (mostly L1) are negatively correlated with local G/C content, while SINEs (mostly Alu/B1) show the opposite trend (Section 1.2).

4.2.3  Extreme TE density is associated with the function and conservation level of genes

In a pioneering study of Alu SINE distribution in genes based on sequence data of human chromosomes 21 and 22, Grover et al. reported an enrichment of Alu elements in genes involved in metabolic pathways and signaling/transport processes, as well as a depletion from genes coding for information pathways and structural proteins (Grover et al., 2003). The authors postulated that both positive and negative selection forces had been involved in shaping the current distribution of Alus in human genes. In a different study on transposon-free regions (TFRs) in mammals (Simons et al., 2006), the authors found almost 1000 genomic regions larger than 10 kb completely depleted of TEs. Although over 90% of the bases covered by these regions are non-coding, genes within them are significantly associated with developmental functions, suggesting that these regions are largely unable to accept or tolerate TE insertions.
While the above and some other genome-wide studies (Mortada et al., 2010; Sironi et al., 2006) showed an association between TE density and specific gene functions, I wanted to determine if such trends exist for the outlier genes shared among multiple mammalian species, after excluding the potential confounding effects of exon density and gene size. I examined my SUO/SLO gene lists derived from human, mouse and cow using BiNGO (Maere et al., 2005), a Gene Ontology (GO) tool used to identify statistically significant over-/under-representation of certain gene functions for a given gene set compared to the genomic background. In accordance with previous studies of non-shared TE density outlier genes, I observed significant enrichment of genes involved in developmental processes for SLOs (Table C.3), and enrichment of genes involved in metabolic pathways and DNA repair for SUOs (Table C.4). Since these orthologous genes show similar extremes of TE density in several species and have been controlled for size and exon density, my observations firmly support the hypothesis that the density of TEs in genes is not random, and is evidently associated with specific gene functions. Intrigued by the possible association between TE density and functional importance of genes, the next gene property I examined was the phylogenetic conservation level. In a recent study, Mortada et al. performed a comprehensive analysis of TE-free vs. TE-rich genes based on the difference in selection pressure ratio (Ka/Ks) among four primate species (Mortada et al., 2010). That study could not find significant support for the association between TE density and gene conservation level. The authors proposed that the four primates they examined might be too closely related phylogenetically. Indeed, when they looked at the selection pressure ratio between human and mouse, evidence was found that TE-free genes tend to be more conserved than TE-rich genes.
To further test this hypothesis, I defined three levels of gene conservation according to the HomoloGene database (release 63; http://www.ncbi.nlm.nih.gov/homologene): species-specific, mammalian-specific, and ancient genes (see Section 4.4.5 for details). In order to include species-specific genes, I expanded my analysis to all outlier genes, including both shared and non-shared among human, mouse and cow, and obtained consistent results for all three mammalian species. In accord with the results of Mortada et al. (Mortada et al., 2010), I found clear evidence showing a higher proportion of species-specific genes among SUOs, and a higher proportion of ancient genes in SLOs (Figure 4.2). These findings indicate that TE-poor genes are more likely to be conserved among distantly related species, while genes extremely rich in TEs show the opposite trend, suggesting that TE insertions within highly conserved genes have generally been selected against due to their detrimental effects on these genes. On the other hand, the emergence of non-conserved, species-specific genes has increased the genetic variation of the host population, and my data support the idea that TEs may have contributed to, or even played an important role in creating genetic diversity during evolution (Biemont, 2010; Rebollo et al., 2010).

Figure 4.2 TE density and conservation level of outlier genes. Patterns for human, mouse and cow are shown in (A), (B) and (C), respectively. Proportions corresponding to different gene conservation levels for each gene set are shown as a stacked bar, with the color scheme for species-specific, mammalian-specific and ancient genes indicated at the top.

I also compared the average TE ages between SUO and SLO genes in human (Section 4.4.6).
This comparison revealed a highly significant difference (p < 2.2e-16, Wilcoxon Rank Sum test), with TEs in SUOs at 16.1% and TEs in SLOs at 21.3% sequence divergence from their consensus sequences (for reference, 16–18% of unconstrained nucleotides have been substituted since the split of primates from other mammalian orders (Smit, 1999)). To evaluate whether this difference in TE age reflects a possible age difference between SUO and SLO genes, I randomly selected 200 mammalian-specific genes, and another 200 ancient genes also conserved in fish or invertebrates, and compared the average ages of their TEs (Section 4.4.6). Although the age of genes in the two random control groups is clearly different, the average age of TEs is about the same (19.7% vs. 19.1% divergence; p = 0.1367, Wilcoxon Rank Sum test). Based on the fact that the pattern of TE distributions is a combined effect of both initial TE integration patterns and outcomes of selection/genetic drift, and thus may change through evolution (Medstrand et al., 2002), I postulate that TEs in SLOs may show more features of selection outcomes due to a longer evolutionary period, while the high TE density of SUOs may result partially from the fact that young TEs in such genes are not yet stably fixed and are still subject to natural loss and purifying selection.

4.2.4  Binding of Polymerase-II at gene promoters reveals association between TE-content and tissue-specificity of genes

Recently, Jjingo et al. reported that both higher TE density and larger gene size generally associate with less tissue-specificity of gene expression (Jjingo et al., 2011). To determine which factor is more important to the tissue-specificity of genes, the authors applied multiple regression analyses and found a higher contribution of TE density (66%) than gene size (53%).
While these results provided important insights into the relationship between TE density and gene expression, the confounding effects of stochastic TE integration in a single species and gene structure constraints greatly impair the accurate evaluation of such relations. To overcome this problem, I directly compared polymerase-II (Pol-II, and more specifically, the Polr2a subunit of Pol-II) binding states between the SUO and SLO gene groups, controlled for both exon density and gene size. To obtain accurate information concerning the tissue-specificity of genes, I downloaded the genome-wide ChIP-chip maps of Polr2a binding at active promoters in mouse embryonic stem cells (ESCs) and adult organs (brain, heart, kidney, liver) (Barrera et al., 2008). According to the tissue-specificity (TS) values described by the authors, I classified the SUO and SLO genes into two categories: high-TS and low-TS (see Section 4.4.7 for methods). As a control, all genes larger than 10 kb in the mouse genome were similarly categorized, and the proportions were compared with those of SUOs and SLOs. As indicated in Figure 4.3A, my results reveal that SUO genes exhibit significantly lower tissue-specificity compared to the genomic background (30% vs. 46%; p = 0.0338), and vice versa for SLOs (64% vs. 46%; p = 0.0001). These results accord with previous findings (Jjingo et al., 2011).

Figure 4.3 Tissue-specificity of Polr2a binding at outlier genes. (A) Associations between TE density and tissue-specificity of Polr2a binding. All gene sets are divided into two categories according to the Polr2a binding pattern of genes: the high tissue-specific (high-TS) and the low tissue-specific (low-TS). The proportions corresponding to the above two categories are shown for SUOs, SLOs and the genomic background as side-by-side bars. Error bars are standard errors derived from the total number of genes (sample size) in each gene set. (B) Tissue-type composition of SUOs/SLOs associated with strong Polr2a binding in only one tissue. For each gene set, the proportion taken by each tissue type is shown with corresponding color in a stacked bar. The 'genomic background' was calculated based on all mouse genes > 10 kb that show strong Polr2a binding in only one tissue. The color scheme for different tissue types is shown at the top.

What next drew my interest was whether there is any difference in tissue-type composition between tissue-specific SUOs and SLOs. For those outlier genes expressed only in one tissue, it would be interesting to know if specific tissue-types are significantly overrepresented. Using the same data set from Barrera et al., I examined the tissue-type composition of all SUOs and SLOs showing strong Polr2a binding in only one tissue type (Section 4.4.7). Although only 15 SUO genes satisfied my criteria, nine are observed in ESCs, a proportion (60%) much higher than either SLOs (40%) or the genomic background (37%) (Figure 4.3B). In order to gain more statistical power, I expanded my analysis to all mouse outlier genes, including both shared and non-shared. As shown in Figure C.8, I observed 89 ESC-specific upper outliers among a total of 181 upper outlier genes (49%). This is significantly higher than the genomic background level (37%; p = 0.00146). More interestingly, after expanding my analysis from shared-only to all outlier genes, I observed a significant depletion of ESC-specific genes among tissue-specific lower-outliers (93 out of 367, or 25%; p = 1.406e-05).
Based on these observations and the fact that only TE insertions occurring in the germline and early embryonic cells could be inherited by the next generation, I propose that actively transcribed genes (presumably with an open chromatin state) are prone to TE integrations, which could, in turn, lead to a faster accumulation of TEs in such genes during evolution.

4.2.5  Histone marks at promoters confirm the overall open status of SUO genes in ESCs

To investigate the hypothesis that the chromatin regions of upper outlier genes are generally open in ES cells, I examined histone modification marks at gene promoters in mouse ESCs. In 2007, Mikkelsen et al. used ChIP-seq to generate genome-wide maps of various histone modification marks, including the open-chromatin mark histone H3 lysine 4 trimethylation (H3K4me3) and the condensed-chromatin mark histone H3 lysine 27 trimethylation (H3K27me3) in mouse ESCs (Mikkelsen et al., 2007). Although histone modification data were largely limited to gene promoters in this study, the authors showed that both H3K4me3 and H3K27me3 can effectively discriminate genes that are expressed, are poised for expression, or are stably repressed, and therefore reflect both chromatin and transcriptional states. I used this dataset to evaluate the general chromatin status of both SUO and SLO genes in mouse ES cells (Figure 4.4). Intriguingly, my results revealed a significant enrichment of the open-chromatin mark H3K4me3 at promoters of SUOs (78%), but a depletion at SLO promoters (43%) compared with the genomic background (64%; p = 0.0333 and p = 5.653e-08 for SUOs and SLOs, respectively), indicating a general open state of SUOs and closed state of SLOs in mouse ESCs.

Figure 4.4 Histone marks at promoters of shared outlier genes. The proportions of genes associated with different histone marks are shown for SUOs, SLOs and the genomic background as side-by-side bars. Error bars are standard errors derived from the total number of genes (sample size) in each gene set.

When I looked at the closed-chromatin mark H3K27me3, I observed a trend opposite to the case of H3K4me3. SLOs are more often associated with H3K27me3 (1.9%) than the genomic background (0.5%; p = 0.0766, proportion equality test), whereas no SUOs fell into this category (0 of 60 associated with H3K27me3). While the above observation is intriguing, there is unavoidably a lack of statistical significance due to the limited total number of SUOs and SLOs associated with H3K27me3 alone. When I examined the enrichment of the bivalent mark H3K4me3+H3K27me3, however, SLOs showed a much higher enrichment than the genomic background (39% vs. 17%; p = 5.591e-13), whereas SUOs showed the opposite effect (6.7% vs. 17%; p = 0.05). Since it has been shown that the H3K4me3+H3K27me3 bivalent mark is strongly associated with the “poised state” of developmental genes that are temporarily repressed in ES cells yet will become activated upon differentiation (Spivakov and Fisher, 2007), the enrichment of the bivalent histone mark in SLOs supports my earlier observation that many SLO genes are involved in essential developmental processes and, consequently, are depleted of TE insertions. For shared outlier genes not linked to either H3K4me3 or H3K27me3, no significant differences were detected compared to the genomic background. The same analysis expanded to both shared and non-shared outlier genes showed very similar results, as illustrated in Figure C.9.

4.2.6  Expression data of SUO and SLO genes in ESCs support the ‘open chromatin status’ hypothesis

If indeed genes with an open chromatin status in ES cells are more prone to heritable TE insertions, one might expect to see a larger proportion of genes that are highly expressed in ES cells among SUOs. To investigate this, I downloaded mouse ES cell expression data previously generated by Mikkelsen et al. (Mikkelsen et al., 2007).
As expected, when SUOs and SLOs were classified into lowly- and highly-expressed genes based on the median genomic transcriptional level in mouse ESCs, I observed that 53% of SUOs are highly expressed in ESCs, compared with only 41% for SLOs (Figure 4.5A). However, while it shows a clear trend that SUOs contain a larger proportion of active genes in ESCs than SLOs, the result lacks statistical significance (p = 0.1428). Since chip-based techniques commonly suffer from cross-hybridization (where non-specific binding of probes produces unavoidable experimental noise), I performed the same analysis using a different mouse ESC expression dataset recently generated by whole transcriptome sequencing (RNA-seq) (Karimi et al., 2011). As shown in Figure 4.5B, I made a similar observation with the RNA-seq data as with GeneChip, but with a more dramatic difference between the two outlier groups (60% highly expressed genes among SUOs vs. 43% for SLOs; p = 0.0155).

Figure 4.5 Gene expression data analyses for SUOs and SLOs in mouse ESCs. (A) Gene expression data based on GeneChip technology. (B) Gene expression data based on RNA-seq technology. In both (A) and (B), SUOs and SLOs are divided into 'low' and 'high' expression genes in mouse ESCs based on the median expression level of all genes > 10 kb. The red dotted line shows the 50% level, which is the proportion taken by lowly/highly expressed genes for all genes > 10 kb in mouse ESCs. Error bars are standard errors derived from the total number of genes (sample size) in each gene set.

These results suggest that genes with open chromatin status and high expression level in ES cells may be more prone to TE insertions that can be inherited, which could lead, in turn, to the accumulation of TEs in such genes.
Of note is a recent study documenting large numbers of somatic L1 retrotransposition events in the human brain, which showed that such events occur significantly more often in genes expressed in the brain compared to random expectations (Baillie et al., 2011). This study lends support to the view that insertions of at least some TE types are more likely to occur in open chromatin.

4.2.7  Gene function and chromatin status both contribute to the low TE density of SLO genes

Although the evaluation of chromatin status at SUO/SLO genes using genome-wide histone modification maps clearly shows a correlation between the density of TEs and chromatin status in ES cells, such observations may be explained by at least two possible underlying mechanisms. One mechanism could be that TEs are more likely to insert into genes with open chromatin, leading to an accumulation of TE sequences in genes active in ESCs. Alternatively, the strong association between low TE density and poised/inactive genes in ESCs could be explained by the essential roles of these genes in cell differentiation, and thus, their limited tolerance to TE residence. As the two mechanisms are not mutually exclusive, I was curious to evaluate separately the contributions of chromatin status and gene function. I began by classifying all SLO genes depending on whether or not they are under the GO term of developmental processes. I then checked the SLOs associated with either H3K4me3-only (open state) or H3K4me3+H3K27me3 bivalent mark (poised state), and calculated the proportion of genes involved in developmental processes in each set. (Due to the limited number of SLOs associated with the H3K27me3-only mark, I did not include such analysis in this study.) As seen in section A of Figure 4.6, 42% (24 out of 57) of SLOs associated with an open chromatin state in ESCs were developmental genes, which is higher than the whole-genome background (31%), although not significantly so (p = 0.1).
The analysis of SLOs associated with the H3K4me3+H3K27me3 bivalent mark showed 65% (30 out of 46) were developmental genes (section B of Figure 4.6), which is twofold more than the genomic background (p = 1.470e-06). Notably, the predicted chromatin status was not a variable in either analysis, confirming that gene function definitely plays a role in influencing the density of TEs in genes.

Figure 4.6 Effect-evaluation matrix for gene function and chromatin status of SLOs. As shown in the inner 2x2 matrix (white area), SLOs are divided into four groups according to their associated histone mark and gene function. ‘K4’ stands for H3K4me3; ‘K4+K27’ stands for H3K4me3 + H3K27me3; ‘Dvlp’ stands for developmental gene; ‘Non-Dvlp’ stands for non-developmental gene. Bar plots in the rightmost column show the proportion of SLOs associated with different histone marks compared with the genomic background when gene function is controlled. Bar plots in the bottom row show the proportion of SLOs associated with different gene functions compared with the genomic background when chromatin status is controlled. Error bars are standard errors derived from the total number of genes (sample size) in each gene set.

Moreover, the higher proportion of genes involved in developmental processes for SLOs associated with the H3K4me3+H3K27me3 bivalent mark (65%; B/T2 in Figure 4.6) compared with SLOs associated only with H3K4me3 (42%; A/T1 in Figure 4.6) implies an effect of chromatin status of genes. Indeed, when I looked at those SLOs that are either involved in developmental processes or not (i.e. 
controlled by gene function), the association between SLOs and the H3K4me3+H3K27me3 bivalent mark (poised status) is consistently and significantly higher than expected: 56% when controlled for developmental genes (B/T3 in Figure 4.6) and 33% for non-developmental genes (D/T4 in Figure 4.6), compared with 17% for genomic background (p = 2.825e-13 and p = 0.007, respectively), showing that the chromatin status in ES cells is also an important factor contributing to the overall TE density of genes.

4.3  Concluding Remarks

Using the genomic sequences of three distantly related mammalian species (i.e. human, mouse and cow), I identified “shared outlier” genes bearing highly conserved patterns of unusually high or low TE densities. These outlier genes, denoted as SUOs and SLOs respectively, provide a highly reliable resource for the study of TE-gene relations. Specifically, I showed that SUOs are enriched for metabolic genes, and that SLOs are enriched for genes involved in developmental processes. Moreover, SUOs show in general less conservation at the protein sequence level, and an expanded analysis involving non-shared outlier genes in each species revealed a disproportionate enrichment of species-specific genes for TE-rich genes (upper outliers). These findings are also in agreement with previous studies of the association between TE density and gene function (Grover et al., 2003; Mortada et al., 2010; Sironi et al., 2006). On the other hand, additional factors besides natural selection have also contributed to TE-host evolution, and therefore, cannot be neglected. The comparison of the average TE ages, for example, shows SUOs contain significantly younger TEs than SLOs, implying that the high TE density of SUOs could also be due to recent TE insertions, some of which will be lost with time. Furthermore, other results presented in this study also suggest that the genomic signature of initial insertion site preference may still exist.
I found that extreme TE content in introns is clearly associated with the chromatin status and expression level of genes in embryonic stem cells, with upper outlier genes being more likely to be active in ES cells, and lower outliers being the opposite. Given that all heritable TEs are the result of integration events that occurred in either germline cells or early embryos, and if I postulate that TEs are more likely to insert into actively transcribing genes, it is possible that more TEs would accumulate in such genes and fewer in genes that are inactive in these tissues. Indeed, even a very minor tendency for TEs to insert into actively transcribing genes in the germ line or early embryo could contribute, over evolutionary time, to their resultant densities in genes. In conclusion, my data support the view that both selection and initial insertion site preference have played roles in the extreme intronic TE densities observed in mammalian genomes. While my data provide intriguing insights on TE-gene relationships based on highly reliable gene sets shared in multiple species, due to the limited numbers of genes studied, caution should be taken when generalizing these results. On the other hand, my analyses, using all (i.e. both shared and non-shared) outlier genes, show very similar results, suggesting that the findings in this study very likely reflect general characteristics of TE-gene relationships in mammals.

4.4  Materials and Methods

4.4.1  Selection of source datasets

To obtain the genomic coordinates of TEs in human, mouse and cow, I downloaded the RepeatMasker annotation of TEs from the UCSC Genome Browser (http://genome.ucsc.edu). The versions of genome assemblies used were hg18 for human, mm9 for mouse, and bosTau4 for cow. For the annotation of genes, I downloaded RefSeq gene tracks based on the same genome assemblies listed above.
I defined TE density as the ratio between the total length of TE sequences present in a given gene and the size of the longest RefSeq isoform of that gene. An analysis of TE density in human genes revealed a normal distribution, except for a single peak at near-zero TE density (Figure C.1A). After excluding all genes shorter than 10 kb, this peak disappeared (Figure C.1B). I applied the same analysis for mouse and found similar results (data not shown). This observation indicates that most genes with near-zero TE density are simply very small genes, very likely low in TEs due to their limited size. For this reason, my study included only genes larger than 10 kb. Since my intention was to identify outlier genes shared among multiple species, I also confined my data to only genes included in both the NCBI HomoloGene and RefSeq databases.

4.4.2  Identification of genes with extreme TE densities

In order to identify outlier genes, a cutoff percentile for genes with high/low TE densities was required. The selection of this threshold was a balance between the significance of effect and a reasonable outlier population size for the ease of further data analyses. I performed a computational optimization experiment by testing multiple cutoff thresholds of the outlier percentile from 2.5% to 25%, with an incremental step of 2.5%. For each percentile tested, I calculated both the total number of SUOs and SLOs derived, and the ratio between the theoretical probability by chance and the actual frequency of observing shared outliers among the three species (Table C.5). Based on these optimization results, I chose 10% as the optimized cutoff percentile for outlier genes in each species, with the number of genes derived in each step being indicated in Table C.6.

4.4.3  Normalization of gene size and exon density

In order to eliminate the confounding effects of the size and exon density of the gene, I controlled my outlier selection for both factors.
Specifically, I designed a 5x5 factor matrix composed of the following two dimensions (Figure C.3): the first dimension (horizontal) containing five predefined consecutive ranges of gene size, each including 20% of all genes; the second dimension (vertical) made up of five consecutive bins of exon density levels, each containing 20% of all genes within the corresponding range of gene size. In this way, I arranged my gene data into a well-organized lattice, in which each unit contains the same number of genes. Based on this factor matrix, I applied the optimized cutoff percentile threshold (i.e. 10%) on each lattice unit to obtain outlier genes with either the highest or the lowest TE density. All outlier genes identified in each unit (the shaded part at either side of the gene distribution within each unit in Figure C.3) were then merged into two categories according to whether they are high or low in TEs (i.e. total upper-/lower-outliers). To verify the efficiency of this normalization strategy, I applied statistical tests to the final sets of total upper- and lower-outlier genes, and found no significant difference of either gene size or exon density (data not shown).

4.4.4  Identification of shared-outlier genes

To identify shared outlier genes among the three mammals used in this study, I took the HomoloGene IDs (HIDs) of the upper- and lower-outlier genes commonly identified in all three species, and defined the resulting gene set as SUOs and SLOs, respectively. Notably, based on the normalization of gene size and exon density, even outlier genes with a different category of gene size or exon density in each species were, nevertheless, still able to be correctly identified as long as they share the same extreme of TE density in all three species.

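The lattice-based outlier selection of Section 4.4.3 and the intersection step of Section 4.4.4 can be sketched as follows. This is an illustrative reconstruction rather than the code used in the study: the record fields ('id', 'size', 'exon_density', 'te_density') and the function names are hypothetical, and the sketch assumes that the 10% cutoff is applied separately to each tail of the TE density distribution within every lattice unit.

```python
def quantile_bins(items, key, n_bins):
    """Sort items by key and split them into n_bins consecutive, near-equal groups."""
    ordered = sorted(items, key=key)
    n = len(ordered)
    return [ordered[i * n // n_bins:(i + 1) * n // n_bins] for i in range(n_bins)]


def lattice_outliers(genes, n_bins=5, cutoff=0.10):
    """Select upper and lower TE density outliers within each size x exon-density unit.

    genes: list of dicts with hypothetical keys 'id', 'size',
    'exon_density' and 'te_density'.
    Returns (upper_ids, lower_ids) merged over all lattice units.
    """
    upper, lower = set(), set()
    # First dimension: five consecutive ranges of gene size, 20% of genes each.
    for size_bin in quantile_bins(genes, lambda g: g["size"], n_bins):
        # Second dimension: five exon-density bins within that size range.
        for unit in quantile_bins(size_bin, lambda g: g["exon_density"], n_bins):
            ranked = sorted(unit, key=lambda g: g["te_density"])
            k = int(len(ranked) * cutoff)  # assumption: cutoff applies to each tail
            if k == 0:
                continue
            lower.update(g["id"] for g in ranked[:k])
            upper.update(g["id"] for g in ranked[-k:])
    return upper, lower


def shared_outliers(per_species_ids):
    """Intersect outlier identifier sets (e.g. HomoloGene IDs) across species."""
    return set.intersection(*per_species_ids)
```

Because outliers are ranked only against genes in the same lattice unit, an orthologue that falls into different size or exon-density bins in different species can still be recovered as a shared outlier, which is the property noted at the end of Section 4.4.4.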
4.4.5  Classification of gene conservation levels

To evaluate the relationship between TE density and gene conservation level, I classified all genes in each species into three conservation levels: species-specific, mammalian-specific, and ancient genes. Based on the HomoloGene Database, I defined “species-specific” genes as genes found only in one of the three mammalian species used in this study. I defined “mammalian-specific” genes as genes shared between at least two of the three species, due to the relatively poor annotation of the cow genome. I defined “ancient genes” as genes present in mammals and at least one of the following species: zebrafish, fruit fly and yeast.

4.4.6  Comparison of the average TE age between SUO and SLO genes

To calculate approximately the average TE age for a gene, I collected the RepeatMasker annotation of TEs from the UCSC Genome Browser, and obtained the divergence of each TE fragment from the consensus sequence of its TE family. The average age of all TEs in a given gene is calculated with the equation:

$$D = \frac{\sum_i d_i \times l_i}{L}$$

where $D$ is the average divergence of TE sequences from the consensus (i.e. the age of TEs), $d_i$ is the sequence divergence of each TE fragment in the gene, $l_i$ is the length (in bp) of each TE fragment, and $L$ is the total coverage (in bp) of all TEs in the gene. Based on this methodology, the average TE ages of SUO and SLO genes were calculated and subsequently compared using the Wilcoxon Rank Sum test.

4.4.7  Examination of the tissue specificity of Polr2a binding at SUO/SLO promoters

As suggested by the authors of the Polr2a binding data (Barrera et al., 2008), I defined “ubiquitous” promoter activity, based on all five tissues tested, as genes with an overall Polr2a binding entropy $H > 2$, where $H = -\sum_{1 \le t \le N} p_t \log_2 p_t$ ($p_t$ is the relative Polr2a binding strength in tissue $t$, and $N$ is the total number of tissues tested). I took genes with $H \le 2$ as showing “tissue-specific” Polr2a binding.
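The two quantities defined so far in Sections 4.4.6 and 4.4.7, the length-weighted divergence D and the binding entropy H, can be sketched in a few lines. The sketch assumes that each TE fragment is given as a (divergence, length) pair and that the per-tissue binding strengths are rescaled to sum to 1 before the entropy is computed; the function names are hypothetical. With five tissues, H can be at most log2(5), roughly 2.32, so the threshold of 2 selects promoters bound at near-equal strength in all tissues.

```python
from math import log2


def mean_te_divergence(fragments):
    """Length-weighted mean divergence D = sum(d_i * l_i) / L for one gene.

    fragments: list of (divergence, length_bp) pairs, one per TE fragment.
    """
    total_length = sum(length for _, length in fragments)
    if total_length == 0:
        raise ValueError("gene contains no TE sequence")
    return sum(d * length for d, length in fragments) / total_length


def binding_entropy(strengths):
    """Shannon entropy (in bits) of per-tissue binding strengths.

    Strengths are rescaled to sum to 1 (an assumption about how the
    relative binding strength p_t is obtained); zero terms contribute nothing.
    """
    total = sum(strengths)
    probabilities = [s / total for s in strengths if s > 0]
    return -sum(p * log2(p) for p in probabilities)


def is_ubiquitous(strengths, threshold=2.0):
    """Apply the H > 2 criterion for ubiquitous promoter activity."""
    return binding_entropy(strengths) > threshold
```

For example, a gene with a 100 bp fragment at 10% divergence and a 300 bp fragment at 20% divergence has D = 17.5%, since the longer fragment carries three times the weight.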
The tissue in which Polr2a binding is strongest at a given promoter was identified as the one with the lowest value of “categorical tissue-specificity” $Q_t$ among all five tissues, where $Q_t = H - \log_2(p_t)$ as described by Barrera and colleagues (Barrera et al., 2008).

4.4.8  Statistical tests

All statistical tests were done in R (version 2.9.2). All p-values described in the text were based on the equality of proportion test (also known as Binomial proportion test), unless specifically noted otherwise.

Chapter 5: Thesis Summary and Conclusions

5.1  Thesis Summary

5.1.1  Which ERVs are polymorphic in mice?

ERVs in mice are significant genomic mutagens, causing ~10% of all reported spontaneous germ line mutations in laboratory strains (Maksakova et al., 2006). The majority of these mutations are due to insertions of two high copy ERV families, the IAP and ETn/MusD elements. This significant level of ongoing retrotranspositional activity suggests that the content of these two ERV groups could be highly variable in inbred mice. At the time of my study, however, no comprehensive genome-wide studies had been performed to assess their level of polymorphism. In the study described in Chapter 2, three test strains, for which sufficient genomic sequence is available, were compared to each other and to the reference C57BL/6J genome, and very high levels of insertional polymorphism for both ERV families were detected (with an estimated false discovery rate of only 0.4%). Specifically, I found that at least 60% of IAP and 25% of ETn/MusD elements detected in any strain are absent in one or more of the other three strains. The polymorphic nature of a set of 40 ETn/MusD elements found within gene introns was confirmed using genomic PCR on DNA from a panel of mouse strains.
In some cases, gene-splicing abnormalities involving the ERV were detected, and additional evidence for decreased gene expression in strains carrying the insertion was obtained. In total, I identified nearly 700 polymorphic IAP or ETn/MusD ERVs or solitary LTRs that reside in gene introns, providing potential candidates that may contribute to gene expression differences among strains. These previously unappreciated extreme levels of polymorphism suggest that ERV insertions play a significant role in the genetic drift of mouse lines.

The identification of such polymorphic ERV elements can be very useful in studying TE-host interactions and their evolutionary effects. In a recent study of ERV methylation patterns in the mouse, for example, our laboratory utilized the polymorphic ERV data described in this thesis, revealing a positive relationship between the age and methylation level of ETns in mice, as well as a negative effect of age on the methylation variation of particular ETn insertions between cells/individuals (Reiss et al., 2010). In another study, we took advantage of strain-specific ERV insertions in allelic expression assays, and showed a general trend of heterochromatin formation induced by IAP ERV elements, as well as at least one apparent example of heterochromatin spreading from an IAP into a nearby gene (Rebollo et al., 2011). Along with other potential applications such as genetic association studies and population/evolution genetics, the mouse polymorphic ERV data presented here sheds light on our understanding of fundamental mouse genetics, phenotypic variability and TE-host gene relations.

5.1.2  What features of TE insertions are linked to their residential probability in genes?

Comprising nearly half of mammalian genomes, TEs are found within most genes.
Although the vast majority of TEs in introns are fixed in the species and presumably exert no significant effects on the enclosing gene, some markedly perturb transcription and result in disease or a mutated phenotype. The factors that determine the likelihood that an intronic TE will affect transcription are not all clear. In the study described in Chapter 3, I examined intronic TE distributions in both human and mouse, and found that several factors likely contribute to whether a particular TE can influence gene transcription. Specifically, I observed that TEs near exons are greatly underrepresented compared to random distributions, and further that the size of these “underrepresentation zones” (U-zones) differs between TE types. Compared to elsewhere in introns, TEs within these zones are shorter, on average, and show stronger orientation biases. Moreover, TEs in extremely close proximity (< 20 bp) to exons show a strong bias to be near splice-donor sites. Interestingly, disease-causing intronic TE insertions show the opposite distributional trends. By examining EST databases, I found that the proportion of TEs contributing to chimeric TE-gene transcripts is significantly higher within their U-zones. In addition, an analysis of predicted splice sites within human long terminal repeat (LTR) elements showed a significantly lower total number and weaker strength for intronic LTRs near exons. I selectively examined a list of polymorphic mouse LTR elements in introns based on these factors, and showed clear evidence of transcriptional disruption by LTR element insertions in the Trpc6 and Kcnh6 genes. These studies, taken together, provide insight into the potential selective forces that have shaped intronic TE distributions, and facilitate identification of TEs most likely to exert transcriptional effects on genes.

5.1.3  Why are some genes highly enriched in TEs while others are depleted of TEs?
By measuring the normalized coverage of TE sequences within genes, as described in Chapter 4, I identified sets of genes with conserved extremes of high/low TE density in the genomes of human, mouse and cow, and denoted them as “shared upper/lower outliers (SUOs/SLOs)”. By comparing these outlier genes to the genomic background, I showed that a large proportion of SUOs are involved in metabolic pathways and that they tend to be mammal-specific, whereas many SLOs are related to developmental processes and have more ancient origins. Furthermore, the proportions of different types of TEs within human and mouse orthologous SUOs showed high similarity, even though most detectable TEs in these two genomes inserted after their divergence. My computational analysis of polymerase II (Pol-II) occupancy at gene promoters in different mouse tissues showed that 60% of tissue-specific SUOs have strong Pol-II binding in embryonic stem cells (ESCs), a proportion significantly higher than the genomic background (37%). In addition, the analysis of histone marks, such as H3K4me3 and H3K27me3 in mouse ESCs, also suggests a strong association between TE-rich genes and open chromatin at promoters. Two independent whole-transcriptome datasets also showed a positive association between TE density and gene expression level in ESCs. While this study focused on genes with extreme TE densities, the results clearly show that TE accumulation/fixation in mammalian genes is not random and is likely associated with different factors/gene properties. Most importantly, my results show an association between the TE insertion/fixation rate and gene activity status in ES cells.

5.2  Conclusions

This thesis work set out to study TE-gene dynamics in mammals, using computational approaches to analyze various biological data at the whole genome level. To achieve this, my research began by identifying polymorphic ERVs in mice.
Since relatively little evolutionary time has elapsed since these young ERVs integrated, many of them are not yet fixed in the host population (i.e. they are polymorphic ERVs). For the same reason, some of these polymorphic elements may cause slightly or moderately deleterious effects to the host, and are probably still under selection. However, polymorphic ERVs do show genomic location features resulting from selection pressure, although their distributional biases are generally weaker than those of their fixed counterparts. In Chapter 2, I showed an intermediate level of underrepresentation of polymorphic ERVs in mouse genes, as well as a moderate orientation bias that lies between the distributions for random simulations and the actual pattern for fixed elements. Finding such features for polymorphic TEs is important because, along with fixed elements and the expected random distribution derived from computational simulations, polymorphic TEs can help us capture a “snapshot” of the ongoing processes of natural selection imposed on TEs and the host genome. As an example, in Section 3.2.5 I show distinct “U-zone effects” for mutagenic, random and fixed TEs, and adding mouse polymorphic ERV data to the analysis makes it much more convincing to conclude that the U-zone effect is a result of natural selection.

In addition to polymorphic/active TEs, ancient elements that have been fixed in the host genome for millions of years may also contain valuable information concerning their characteristics and evolutionary roles, in the same way that the information found in fossils is valuable to paleontologists. Although over a million TE copies have become fixed in mammalian gene introns during evolution, the vast majority of them apparently have no functional impact on the gene. Yet, new disease-causing TE insertions do occur in introns and exert detrimental effects on the host.
Intrigued by the question of why some intronic TEs are harmful whereas others are not, I looked for clues by examining the distribution of fixed TEs in gene introns and comparing it to the distribution of non-fixed insertions, such as polymorphic or mutagenic ones. My results showed multiple genomic features that may influence the retention probability of de novo TE integrations in gene introns, including the distance to the intron-exon boundary, orientation, and proximity to splice sites. Moreover, by using the mouse polymorphic ERV data obtained in Chapter 2, as well as mutagenic TE insertions reported in the literature, I have been able to show convincing evidence to support the premise that natural selection was the major driving force underlying the identified TE distribution patterns in genes.

While the study described in Chapter 3 revealed insertional characteristics of TEs that may determine the “harmfulness” of a TE insertion in the gene, the distinct properties of genes themselves may also carry their own influences. By identifying orthologous genes that share extremes of either low or high density of TEs (i.e. SLOs and SUOs, respectively) in three distant mammalian species, I showed that TE density within genes is associated with factors such as the biological function and conservation level of genes, which is likely, again, an outcome of natural selection. In theory, initial insertional preference, natural selection and genetic drift can all contribute to the current distribution pattern of TEs in host genes. Although evaluating initial insertional preference is difficult for many ancient TEs, a limited number of experiments using reconstructed TE sequences have shown distinct distribution patterns of de novo TE insertions compared to fixed genomic distributions of corresponding TEs (Brady et al., 2009; Gasior et al., 2007), suggesting much greater effects of natural selection or genetic drift on fixed elements.
It is therefore natural to propose that the enrichment of genes involved in developmental processes among SLOs, for example, is due to the essential roles these genes play in the development of the host organism and, as a result, to strong selection against TE insertions in such genes. One can further postulate that the enrichment of metabolic genes among the SUOs could in fact be beneficial, since such TE-rich genes can potentially contribute to the genetic variation of the host species. While it is apparent that the effect of natural selection is dominant compared to the initial integration preference of TEs, my results also suggest that signatures of the latter are still evident. Using genome-wide chromatin state data generated by high-throughput techniques, I observed an enrichment of TEs in genes with open chromatin in ES cells, where new TE integrations in the genome can be passed on to the next generation. Furthermore, by controlling for either the functional category of genes or the chromatin state in ES cells, a quantitative analysis shows effects linked to both factors (Section 4.2.6), providing more insight into the relationship between TEs and genes.

The relationship between TEs and host genes has intrigued biologists since the discovery of these mobile DNA sequences in various host genomes more than half a century ago. What are these repeats? Why are they there? What do they do? What can they tell us about genome evolution and speciation? All of these are intriguing questions that I hope my thesis work has helped to address. We now know that, from selfish genomic parasites to evolutionary catalysts, TEs have played various roles during host evolution (Belancio et al., 2008; Biemont, 2010). Boosted by many major achievements in molecular biology and genomics during the past few decades, our knowledge of TEs has been rapidly advancing, and our understanding of TE-gene relationships is also evolving.
However, while various new technologies have significantly changed the fundamental methodology of biological research, numerous new challenges have arisen. For example, the great advances in whole genome sequencing and next generation sequencing techniques have provided unprecedented opportunities to systematically study TEs at the genomic level, but how to find biological insights in the overwhelming “-omics” data being produced today poses new challenges at the same time. The emergence of high throughput technologies has facilitated the discovery of an increasing number of TE germline polymorphisms and somatic insertions in human cancers, with the recent advances in studies of human L1 polymorphisms being good examples (Beck et al., 2010; Ewing and Kazazian, 2010; Huang et al., 2010; Iskow et al., 2010). Another recent study on the sequencing of 17 mouse strains has uncovered over 700,000 structural variants among these strains, of which over half are due to TE insertional polymorphisms (Yalcin et al., 2011). While this mouse study reported a potential biological effect of one ERV polymorphism (i.e. an IAP ERV insertion upstream of the Eps15 gene, which severely disrupted the transcription of the gene and led to lower locomotor activity), there have been no systematic efforts published to identify which human or mouse polymorphic or somatically-acquired TEs may contribute to allele-specific gene expression differences and phenotypic variation or disease.

One future direction continuing from the work presented in this thesis could be the development of comprehensive methods to help choose potential candidate TE insertions that have functional effects on cellular genes. To achieve this purpose, many of the findings from this thesis work, as well as results from other studies, could be integrated and used to build a “TE-gene interaction” model that takes into account all the genomic features of TE insertions (e.g.
results from Chapter 3) and the characteristics of genes harboring TEs (e.g. results from Chapter 4). Aided by new sequencing technologies and sophisticated computational/bioinformatics methods, an ambitious but worthy goal would be to make reliable predictions of the most influential TE insertions, which are either potentially linked to various diseases or to phenotypic variations (including both somatic and germline mutations). Ultimately, such knowledge will contribute to our understanding of the effects of TEs and their relationships to their host organisms, which is important to a broad range of studies from evolution to human medicine.

References

Adams, M.D., Celniker, S.E., Holt, R.A., Evans, C.A., Gocayne, J.D., Amanatides, P.G., Scherer, S.E., Li, P.W., Hoskins, R.A., Galle, R.F., et al. (2000). The genome sequence of Drosophila melanogaster. Science 287, 2185-2195.
Adelson, D.L., Raison, J.M., and Edgar, R.C. (2009). Characterization and distribution of retrotransposons and simple sequence repeats in the bovine genome. Proc Natl Acad Sci U S A 106, 12855-12860.
Akagi, K., Li, J., Stephens, R.M., Volfovsky, N., and Symer, D.E. (2008). Extensive variation between inbred mouse strains due to endogenous L1 retrotransposition. Genome Res 18, 869-880.
Alfoldi, J., Di Palma, F., Grabherr, M., Williams, C., Kong, L., Mauceli, E., Russell, P., Lowe, C.B., Glor, R.E., Jaffe, J.D., et al. (2011). The genome of the green anole lizard and a comparative analysis with birds and mammals. Nature 477, 587-591.
Allen, E., Horvath, S., Tong, F., Kraft, P., Spiteri, E., Riggs, A.D., and Marahrens, Y. (2003). High concentrations of long interspersed nuclear element sequence distinguish monoallelically expressed genes. Proc Natl Acad Sci U S A 100, 9940-9945.
Almeida, R., and Allshire, R.C. (2005). RNA silencing and genome regulation. Trends Cell Biol 15, 251-258.
Aparicio, S., Chapman, J., Stupka, E., Putnam, N., Chia, J.M., Dehal, P., Christoffels, A., Rash, S., Hoon, S., Smit, A., et al. (2002). Whole-genome shotgun assembly and analysis of the genome of Fugu rubripes. Science 297, 1301-1310.
Bailey, J.A., Liu, G., and Eichler, E.E. (2003). An Alu transposition model for the origin and expansion of human segmental duplications. Am J Hum Genet 73, 823-834.
Baillie, J.K., Barnett, M.W., Upton, K.R., Gerhardt, D.J., Richmond, T.A., De Sapio, F., Brennan, P., Rizzu, P., Smith, S., Fell, M., et al. (2011). Somatic retrotransposition alters the genetic landscape of the human brain. Nature.
Banno, F., Kaminaka, K., Soejima, K., Kokame, K., and Miyata, T. (2004). Identification of strain-specific variants of mouse Adamts13 gene encoding von Willebrand factor-cleaving protease. J Biol Chem 279, 30896-30903.
Barrera, L.O., Li, Z., Smith, A.D., Arden, K.C., Cavenee, W.K., Zhang, M.Q., Green, R.D., and Ren, B. (2008). Genome-wide mapping and analysis of active promoters in mouse embryonic stem cells and adult organs. Genome Res 18, 46-59.
Barrett, T., Troup, D.B., Wilhite, S.E., Ledoux, P., Rudnev, D., Evangelista, C., Kim, I.F., Soboleva, A., Tomashevsky, M., and Edgar, R. (2007). NCBI GEO: mining tens of millions of expression profiles--database and tools update. Nucleic Acids Res 35, D760-765.
Batzer, M.A., and Deininger, P.L. (2002). Alu repeats and human genomic diversity. Nat Rev Genet 3, 370-379.
Baust, C., Baillie, G.J., and Mager, D.L. (2002). Insertional polymorphisms of ETn retrotransposons include a disruption of the wiz gene in C57BL/6 mice. Mamm Genome 13, 423-428.
Baust, C., Gagnier, L., Baillie, G.J., Harris, M.J., Juriloff, D.M., and Mager, D.L. (2003). Structure and expression of mobile ETnII retroelements and their coding-competent MusD relatives in the mouse. J Virol 77, 11448-11458.
Beck, C.R., Collier, P., Macfarlane, C., Malig, M., Kidd, J.M., Eichler, E.E., Badge, R.M., and Moran, J.V. (2010).
LINE-1 retrotransposition activity in human genomes. Cell 141, 1159-1170.
Beck, J.A., Lloyd, S., Hafezparast, M., Lennon-Pierce, M., Eppig, J.T., Festing, M.F., and Fisher, E.M. (2000). Genealogies of mouse inbred strains. Nat Genet 24, 23-25.
Belancio, V.P., Hedges, D.J., and Deininger, P. (2006). LINE-1 RNA splicing and influences on mammalian gene expression. Nucleic Acids Res 34, 1512-1521.
Belancio, V.P., Hedges, D.J., and Deininger, P. (2008). Mammalian non-LTR retrotransposons: for better or worse, in sickness and in health. Genome Res 18, 343-358.
Belshaw, R., Dawson, A.L., Woolven-Allen, J., Redding, J., Burt, A., and Tristem, M. (2005). Genomewide screening reveals high levels of insertional polymorphism in the human endogenous retrovirus family HERV-K(HML2): implications for present-day activity. J Virol 79, 12507-12514.
Biemont, C. (2010). A brief history of the status of transposable elements: from junk DNA to major players in evolution. Genetics 186, 1085-1093.
Blewitt, M.E., Vickaryous, N.K., Paldi, A., Koseki, H., and Whitelaw, E. (2006). Dynamic reprogramming of DNA methylation at an epigenetically sensitive allele in mice. PLoS Genet 2, e49.
Blond, J.L., Lavillette, D., Cheynet, V., Bouton, O., Oriol, G., Chapel-Fernandes, S., Mandrand, B., Mallet, F., and Cosset, F.L. (2000). An envelope glycoprotein of the human endogenous retrovirus HERV-W is expressed in the human placenta and fuses cells expressing the type D mammalian retrovirus receptor. J Virol 74, 3321-3329.
Boeke, J.D., and Stoye, J.P. (1997). Retrotransposons, Endogenous Retroviruses, and the Evolution of Retroelements. In Retroviruses, J.M. Coffin, L. Hughes, and H.E. Varmus, eds. (Cold Spring Harbor: Cold Spring Harbor Laboratory Press), pp. 343-435.
Boissinot, S., Davis, J., Entezam, A., Petrov, D., and Furano, A.V. (2006). Fitness cost of LINE-1 (L1) activity in humans. Proc Natl Acad Sci U S A 103, 9590-9594.
Brady, T., Lee, Y.N., Ronen, K., Malani, N., Berry, C.C., Bieniasz, P.D., and Bushman, F.D. (2009). Integration target site selection by a resurrected human endogenous retrovirus. Genes Dev 23, 633-642.
Brouha, B., Schustak, J., Badge, R.M., Lutz-Prigge, S., Farley, A.H., Moran, J.V., and Kazazian, H.H., Jr. (2003). Hot L1s account for the bulk of retrotransposition in the human population. Proc Natl Acad Sci U S A 100, 5280-5285.
Brulet, P., Condamine, H., and Jacob, F. (1985). Spatial distribution of transcripts of the long repeated ETn sequence during early mouse embryogenesis. Proc Natl Acad Sci U S A 82, 2054-2058.
Burwinkel, B., and Kilimann, M.W. (1998). Unequal homologous recombination between LINE-1 elements as a mutational mechanism in human genetic disease. J Mol Biol 277, 513-517.
Bushman, F., Lewinski, M., Ciuffi, A., Barr, S., Leipzig, J., Hannenhalli, S., and Hoffmann, C. (2005). Genome-wide analysis of retroviral DNA integration. Nat Rev Microbiol 3, 848-858.
Bushman, F.D. (2003). Targeting survival: integration site selection by retroviruses and LTR-retrotransposons. Cell 115, 135-138.
Callinan, P.A., and Batzer, M.A. (2006). Retrotransposable elements and human disease. Genome Dyn 1, 104-115.
Callinan, P.A., Wang, J., Herke, S.W., Garber, R.K., Liang, P., and Batzer, M.A. (2005). Alu retrotransposition-mediated deletion. J Mol Biol 348, 791-800.
Celniker, S.E., and Rubin, G.M. (2003). The Drosophila melanogaster genome. Annu Rev Genomics Hum Genet 4, 89-117.
Chen, J.M., Stenson, P.D., Cooper, D.N., and Ferec, C. (2005). A systematic analysis of LINE-1 endonuclease-dependent retrotranspositional events causing human genetic disease. Hum Genet 117, 411-427.
Cohen, C.J., Lock, W.M., and Mager, D.L. (2009). Endogenous retroviral LTRs as promoters for human genes: a critical assessment. Gene 448, 105-114.
Conley, A.B., Piriyapongsa, J., and Jordan, I.K. (2008). Retroviral promoters in the human genome. Bioinformatics 24, 1563-1567.
Corazzari, M., Lovat, P.E., Armstrong, J.L., Fimia, G.M., Hill, D.S., Birch-Machin, M., Redfern, C.P., and Piacentini, M. (2007). Targeting homeostatic mechanisms of endoplasmic reticulum stress to increase susceptibility of cancer cells to fenretinide-induced apoptosis: the role of stress proteins ERdj5 and ERp57. Br J Cancer 96, 1062-1071.
Cordaux, R., and Batzer, M.A. (2009). The impact of retrotransposons on human genome evolution. Nat Rev Genet 10, 691-703.
Cordaux, R., Lee, J., Dinoso, L., and Batzer, M.A. (2006). Recently integrated Alu retrotransposons are essentially neutral residents of the human genome. Gene 373, 138-144.
Coufal, N.G., Garcia-Perez, J.L., Peng, G.E., Yeo, G.W., Mu, Y., Lovci, M.T., Morell, M., O'Shea, K.S., Moran, J.V., and Gage, F.H. (2009). L1 retrotransposition in human neural progenitor cells. Nature 460, 1127-1131.
Cunnea, P.M., Miranda-Vizuete, A., Bertoli, G., Simmen, T., Damdimopoulos, A.E., Hermann, S., Leinonen, S., Huikko, M.P., Gustafsson, J.A., Sitia, R., et al. (2003). ERdj5, an endoplasmic reticulum (ER)-resident protein containing DnaJ and thioredoxin domains, is expressed in secretory cells or following ER stress. J Biol Chem 278, 1059-1066.
Cutter, A.D., Good, J.M., Pappas, C.T., Saunders, M.A., Starrett, D.M., and Wheeler, T.J. (2005). Transposable element orientation bias in the Drosophila melanogaster genome. J Mol Evol 61, 733-741.
De Palma, M., Montini, E., Santoni de Sio, F.R., Benedicenti, F., Gentile, A., Medico, E., and Naldini, L. (2005). Promoter trapping reveals significant differences in integration site selection between MLV and HIV vectors in primary hematopoietic cells. Blood 105, 2307-2315.
Deininger, P.L., and Batzer, M.A. (1999). Alu repeats and human disease. Mol Genet Metab 67, 183-193.
Dewannieux, M., Dupressoir, A., Harper, F., Pierron, G., and Heidmann, T. (2004). Identification of autonomous IAP LTR retrotransposons mobile in mammalian cells. Nat Genet 36, 534-539.
Dewannieux, M., Esnault, C., and Heidmann, T. (2003). LINE-mediated retrotransposition of marked Alu sequences. Nat Genet 35, 41-48.
Dewannieux, M., Harper, F., Richaud, A., Letzelter, C., Ribet, D., Pierron, G., and Heidmann, T. (2006). Identification of an infectious progenitor for the multiple-copy HERV-K human endogenous retroelements. Genome Res 16, 1548-1556.
Dewannieux, M., and Heidmann, T. (2005). L1-mediated retrotransposition of murine B1 and B2 SINEs recapitulated in cultured cells. J Mol Biol 349, 241-247.
Di-Poi, N., Montoya-Burgos, J.I., and Duboule, D. (2009). Atypical relaxation of structural constraints in Hox gene clusters of the green anole lizard. Genome Res 19, 602-610.
Druker, R., Bruxner, T.J., Lehrbach, N.J., and Whitelaw, E. (2004). Complex patterns of transcription at the insertion site of a retrotransposon in the mouse. Nucleic Acids Res 32, 5800-5808.
Druker, R., and Whitelaw, E. (2004). Retrotransposon-derived elements in the mammalian genome: a potential source of disease. J Inherit Metab Dis 27, 319-330.
Dudley, J.P. (2003). Tag, you're hit: retroviral insertions identify genes involved in cancer. Trends Mol Med 9, 43-45.
Duhl, D.M., Vrieling, H., Miller, K.A., Wolff, G.L., and Barsh, G.S. (1994). Neomorphic agouti mutations in obese yellow mice. Nat Genet 8, 59-65.
Dunn, C.A., Medstrand, P., and Mager, D.L. (2003). An endogenous retroviral long terminal repeat is the dominant promoter for human beta1,3-galactosyltransferase 5 in the colon. Proc Natl Acad Sci U S A 100, 12841-12846.
Dupressoir, A., Marceau, G., Vernochet, C., Benit, L., Kanellopoulos, C., Sapin, V., and Heidmann, T. (2005). Syncytin-A and syncytin-B, two fusogenic placenta-specific murine envelope genes of retroviral origin conserved in Muridae. Proc Natl Acad Sci U S A 102, 725-730.
Eickbush, T.H., and Jamburuthugoda, V.K. (2008). The diversity of retrotransposons and the properties of their reverse transcriptases. Virus Res 134, 221-234.
Elmedyb, P., Olesen, S.P., and Grunnet, M. (2007). Activation of ERG2 potassium channels by the diphenylurea NS1643. Neuropharmacology 53, 283-294.
Elsik, C.G., Tellam, R.L., Worley, K.C., Gibbs, R.A., Muzny, D.M., Weinstock, G.M., Adelson, D.L., Eichler, E.E., Elnitski, L., Guigo, R., et al. (2009). The genome sequence of taurine cattle: a window to ruminant biology and evolution. Science 324, 522-528.
Ewing, A.D., and Kazazian, H.H., Jr. (2010). High-throughput sequencing reveals extensive variation in human-specific L1 content in individual human genomes. Genome Res 20, 1262-1270.
Feschotte, C. (2008). Transposable elements and the evolution of regulatory networks. Nat Rev Genet 9, 397-405.
Feschotte, C., and Pritham, E.J. (2007). DNA transposons and the evolution of eukaryotic genomes. Annu Rev Genet 41, 331-368.
Fontanillas, P., Hartl, D.L., and Reuter, M. (2007). Genome organization and gene expression shape the transposable element distribution in the Drosophila melanogaster euchromatin. PLoS Genet 3, e210.
Frankel, W.N., Stoye, J.P., Taylor, B.A., and Coffin, J.M. (1990). A linkage map of endogenous murine leukemia proviruses. Genetics 124, 221-236.
Frazer, K.A., Eskin, E., Kang, H.M., Bogue, M.A., Hinds, D.A., Beilharz, E.J., Gupta, R.V., Montgomery, J., Morenzoni, M.M., Nilsen, G.B., et al. (2007). A sequence-based variation map of 8.27 million SNPs in inbred mouse strains. Nature 448, 1050-1053.
Gal-Mark, N., Schwartz, S., and Ast, G. (2008). Alternative splicing of Alu exons--two arms are better than one. Nucleic Acids Res 36, 2012-2023.
Gallus, G.N., Cardaioli, E., Rufa, A., Da Pozzo, P., Bianchi, S., D'Eramo, C., Collura, M., Tumino, M., Pavone, L., and Federico, A. (2010). Alu-element insertion in an OPA1 intron sequence associated with autosomal dominant optic atrophy. Mol Vis 16, 178-183.
Ganguly, A., Dunbar, T., Chen, P., Godmilow, L., and Ganguly, T. (2003).
Exon skipping caused by an intronic insertion of a young Alu Yb9 element leads to severe hemophilia A. Hum Genet 113, 348-352.
Gasior, S.L., Preston, G., Hedges, D.J., Gilbert, N., Moran, J.V., and Deininger, P.L. (2007). Characterization of pre-insertion loci of de novo L1 insertions. Gene 390, 190-198.
Gifford, R., and Tristem, M. (2003). The evolution, distribution and diversity of endogenous retroviruses. Virus genes 26, 291-315.
Gilbert, N., Lutz-Prigge, S., and Moran, J.V. (2002). Genomic deletions created upon LINE-1 retrotransposition. Cell 110, 315-325.
Gonzalez-Cobos, J.C., and Trebak, M. (2010). TRPC channels in smooth muscle cells. Front Biosci 15, 1023-1039.
Gould, S.J., and Vrba, E.S. (1982). Exaptation - a Missing Term in the Science of Form. Paleobiology 8, 4-15.
Graham, T., and Boissinot, S. (2006). The genomic distribution of L1 elements: the role of insertion bias and natural selection. J Biomed Biotechnol 2006, 75327.
Graubert, T.A., Cahan, P., Edwin, D., Selzer, R.R., Richmond, T.A., Eis, P.S., Shannon, W.D., Li, X., McLeod, H.L., Cheverud, J.M., et al. (2007). A high-resolution map of segmental DNA copy number variation in the mouse genome. PLoS Genet 3, e3.
Grover, D., Majumder, P.P., C, B.R., Brahmachari, S.K., and Mukerji, M. (2003). Nonrandom distribution of alu elements in genes of various functional categories: insight from analysis of human chromosomes 21 and 22. Mol Biol Evol 20, 1420-1424.
Han, K., Lee, J., Meyer, T.J., Remedios, P., Goodwin, L., and Batzer, M.A. (2008). L1 recombination-associated deletions generate human genomic variation. Proc Natl Acad Sci U S A 105, 19366-19371.
Han, K., Sen, S.K., Wang, J., Callinan, P.A., Lee, J., Cordaux, R., Liang, P., and Batzer, M.A. (2005). Genomic rearrangements by LINE-1 insertion-mediated deletion in the human and chimpanzee lineages. Nucleic Acids Res 33, 4040-4052.
Hasler, J., and Strub, K. (2006). Alu elements as regulators of gene expression. Nucleic Acids Res 34, 5491-5497.
Heidmann, O., Vernochet, C., Dupressoir, A., and Heidmann, T. (2009). Identification of an endogenous retroviral envelope gene with fusogenic activity and placenta-specific expression in the rabbit: a new "syncytin" in a third order of mammals. Retrovirology 6, 107.
Ho, M., Post, C.M., Donahue, L.R., Lidov, H.G., Bronson, R.T., Goolsby, H., Watkins, S.C., Cox, G.A., and Brown, R.H., Jr. (2004). Disruption of muscle membrane and phenotype divergence in two novel mouse models of dysferlin deficiency. Hum Mol Genet 13, 1999-2010.
Hofmann, M., Harris, M., Juriloff, D., and Boehm, T. (1998). Spontaneous mutations in SELH/Bc mice due to insertions of early transposons: molecular characterization of null alleles at the nude and albino loci. Genomics 52, 107-109.
Hollister, J.D., and Gaut, B.S. (2009). Epigenetic silencing of transposable elements: a tradeoff between reduced transposition and deleterious effects on neighboring gene expression. Genome Res 19, 1419-1428.
Horie, K., Saito, E.S., Keng, V.W., Ikeda, R., Ishihara, H., and Takeda, J. (2007). Retrotransposons influence the mouse transcriptome: implication for the divergence of genetic traits. Genetics 176, 815-827.
Huang, C.R., Schneider, A.M., Lu, Y., Niranjan, T., Shen, P., Robinson, M.A., Steranka, J.P., Valle, D., Civin, C.I., Wang, T., et al. (2010). Mobile interspersed repeats are major structural variants in the human genome. Cell 141, 1171-1182.
Huda, A., Marino-Ramirez, L., Landsman, D., and Jordan, I.K. (2009). Repetitive DNA elements, nucleosome binding and human gene expression. Gene 436, 12-22.
Iskow, R.C., McCabe, M.T., Mills, R.E., Torene, S., Pittard, W.S., Neuwald, A.F., Van Meir, E.G., Vertino, P.M., and Devine, S.E. (2010). Natural mutagenesis of human genomes by endogenous retrotransposons. Cell 141, 1253-1261.
Jern, P., Stoye, J.P., and Coffin, J.M. (2007). Role of APOBEC3 in genetic diversity among endogenous murine leukemia viruses. PLoS Genet 3, 2014-2022.
Jjingo, D., Huda, A., Gundapuneni, M., Marino-Ramirez, L., and Jordan, I.K. (2011). Effect of the transposable element environment of human genes on gene length and expression. Genome Biol Evol, 259-271.
Juriloff, D.M., Harris, M.J., Dewell, S.L., Brown, C.J., Mager, D.L., Gagnier, L., and Mah, D.G. (2005). Investigations of the genomic region that contains the clf1 mutation, a causal gene in multifactorial cleft lip and palate in mice. Birth Defects Res A Clin Mol Teratol 73, 103-113.
Jurka, J. (1997). Sequence patterns indicate an enzymatic involvement in integration of mammalian retroposons. Proc Natl Acad Sci U S A 94, 1872-1877.
Kaiser, S.M., Malik, H.S., and Emerman, M. (2007). Restriction of an extinct retrovirus by the human TRIM5alpha antiviral protein. Science 316, 1756-1758.
Kalari, K.R., Casavant, M., Bair, T.B., Keen, H.L., Comeron, J.M., Casavant, T.L., and Scheetz, T.E. (2006). First exons and introns--a survey of GC content and gene structure in the human genome. In Silico Biol 6, 237-242.
Kapitonov, V.V., and Jurka, J. (2006). Self-synthesizing DNA transposons in eukaryotes. Proc Natl Acad Sci U S A 103, 4540-4545.
Kapitonov, V.V., and Jurka, J. (2007). Helitrons on a roll: eukaryotic rolling-circle transposons. Trends Genet 23, 521-529.
Karimi, M.M., Goyal, P., Maksakova, I.A., Bilenky, M., Leung, D., Tang, J.X., Shinkai, Y., Mager, D.L., Jones, S., Hirst, M., et al. (2011). DNA Methylation and SETDB1/H3K9me3 Regulate Predominantly Distinct Sets of Genes, Retroelements, and Chimeric Transcripts in mESCs. Cell Stem Cell 8, 676-687.
Kaushik, N., and Stoye, J.P. (1994). Intracisternal A-type particle elements as genetic markers: detection by repeat element viral element amplified locus-PCR. Mamm Genome 5, 688-695.
Kent, W.J. (2002). BLAT--the BLAST-like alignment tool. Genome Res 12, 656-664.
Keren, H., Lev-Maor, G., and Ast, G. (2010). Alternative splicing and evolution: diversification, exon definition and function.
Nat Rev Genet 11, 345-355. Khurana, J.S., and Theurkauf, W. (2010). piRNAs, transposon silencing, and Drosophila germline development. J Cell Biol 191, 905-913. Kidwell, M.G. (2002). Transposable elements and the evolution of genome size in eukaryotes. Genetica 115, 49-63. Kim, D.S., Huh, J.W., Kim, Y.H., Park, S.J., and Chang, K.T. (2010). Functional impact of transposable elements using bioinformatic analysis and a comparative genomic approach. Mol Cells 30, 77-87. Krull, M., Brosius, J., and Schmitz, J. (2005). Alu-SINE exonization: en route to proteincoding function. Mol Biol Evol 22, 1702-1711. Krull, M., Petrusma, M., Makalowski, W., Brosius, J., and Schmitz, J. (2007). Functional persistence of exonized mammalian-wide interspersed repeat elements (MIRs). Genome Res 17, 1139-1145. Kuff, E.L., and Lueders, K.K. (1988). The intracisternal A-particle gene family: structure and functional aspects. Adv Cancer Res 51, 183-276. 152  Kunarso, G., Chia, N.Y., Jeyakani, J., Hwang, C., Lu, X., Chan, Y.S., Ng, H.H., and Bourque, G. (2010). Transposable elements have rewired the core regulatory network of human embryonic stem cells. Nat Genet 42, 631-634. Kung, H.J., Boerkoel, C., and Carter, T.H. (1991). Retroviral mutagenesis of cellular oncogenes: a review with insights into the mechanisms of insertional activation. Curr Top Microbiol Immunol 171, 1-25. Lander, E.S., Linton, L.M., Birren, B., Nusbaum, C., Zody, M.C., Baldwin, J., Devon, K., Dewar, K., Doyle, M., FitzHugh, W., et al. (2001). Initial sequencing and analysis of the human genome. Nature 409, 860-921. Lee, J., Han, K., Meyer, T.J., Kim, H.S., and Batzer, M.A. (2008a). Chromosomal inversions between human and chimpanzee lineages caused by retrotransposons. PLoS One 3, e4047. Lee, J.Y., Ji, Z., and Tian, B. (2008b). Phylogenetic analysis of mRNA polyadenylation sites reveals a role of transposable elements in evolution of the 3'-end of genes. Nucleic Acids Res 36, 5581-5590. Lee, Y.N., and Bieniasz, P.D. 
(2007). Reconstitution of an infectious human endogenous retrovirus. PLoS Pathog 3, e10. Leung, D.C., and Lorincz, M.C. (2011). Silencing of endogenous retroviruses: when and why do histone marks predominate? Trends Biochem Sci. Lev-Maor, G., Ram, O., Kim, E., Sela, N., Goren, A., Levanon, E.Y., and Ast, G. (2008). Intronic Alus influence alternative splicing. PLoS Genet 4, e1000204. Li, J., Jiang, T., Mao, J.H., Balmain, A., Peterson, L., Harris, C., Rao, P.H., Havlak, P., Gibbs, R., and Cai, W.W. (2004). Genomic segmental polymorphisms in inbred mouse strains. Nat Genet 36, 952-954. Locke, D.P., Hillier, L.W., Warren, W.C., Worley, K.C., Nazareth, L.V., Muzny, D.M., Yang, S.P., Wang, Z., Chinwalla, A.T., Minx, P., et al. (2011). Comparative and demographic analysis of orang-utan genomes. Nature 469, 529-533. Lodish, H., Berk, A., Kaiser, C.A., Krieger, M., Scott, M.P., Bretscher, A., Ploegh, H., and Matsudaira, P. (2007). Molecular Cell Biology. In (New York: W. H. Freeman and Company), pp. 329-330. Loebel, D.A., Tsoi, B., Wong, N., O'Rourke, M.P., and Tam, P.P. (2004). Restricted expression of ETn-related sequences during post-implantation mouse development. Gene Expr Patterns 4, 467-471. 153  Lowe, C.B., Bejerano, G., and Haussler, D. (2007). Thousands of human mobile element fragments undergo strong purifying selection near developmental genes. Proc Natl Acad Sci U S A 104, 8005-8010. Lueders, K.K., and Frankel, W.N. (1994). Mapping of mouse intracisternal A-particle proviral markers in an interspecific backcross. Mamm Genome 5, 473-478. Lueders, K.K., Frankel, W.N., Mietz, J.A., and Kuff, E.L. (1993). Genomic mapping of intracisternal A-particle proviral elements. Mamm Genome 4, 69-77. Lyon, M.F. (2006). Do LINEs have a role in X-chromosome inactivation? J Biomed Biotechnol 2006, 59746. Maere, S., Heymans, K., and Kuiper, M. (2005). BiNGO: a Cytoscape plugin to assess overrepresentation of gene ontology categories in biological networks. 
Bioinformatics 21, 3448-3449. Mager, D.L., and Freeman, J.D. (2000). Novel mouse type D endogenous proviruses and ETn elements share long terminal repeat and internal sequences. J Virol 74, 7221-7229. Mager, D.L., and Medstrand, P. (2003). Retroviral repeat sequences. In Nature Encyclopedia of the Human Genome (Nature Publishing Group), pp. 57-63. Maksakova, I.A., Romanish, M.T., Gagnier, L., Dunn, C.A., van de Lagemaat, L.N., and Mager, D.L. (2006). Retroviral elements and their hosts: insertional mutagenesis in the mouse germ line. PLoS Genet 2, e2. Malik, H.S. (2012). Retroviruses push the envelope for mammalian placentation. Proc Natl Acad Sci U S A. Malik, H.S., and Eickbush, T.H. (1998). The RTE class of non-LTR retrotransposons is widely distributed in animals and is the origin of many SINEs. Mol Biol Evol 15, 1123-1134. Martin, A., Troadec, C., Boualem, A., Rajab, M., Fernandez, R., Morin, H., Pitrat, M., Dogimont, C., and Bendahmane, A. (2009). A transposon-induced epigenetic change leads to sex determination in melon. Nature 461, 1135-1138. McClintock, B. (1950). The origin and behavior of mutable loci in maize. Proc Natl Acad Sci U S A 36, 344-355. Medstrand, P., van de Lagemaat, L.N., and Mager, D.L. (2002). Retroelement distributions in the human genome: variations associated with age and proximity to genes. Genome Res 12, 1483-1495. 154  Mi, S., Lee, X., Li, X., Veldman, G.M., Finnerty, H., Racie, L., LaVallie, E., Tang, X.Y., Edouard, P., Howes, S., et al. (2000). Syncytin is a captive retroviral envelope protein involved in human placental morphogenesis. Nature 403, 785-789. Mikkelsen, T.S., Hillier, L.W., Eichler, E.E., and al., e. (2005). Initial sequence of the chimpanzee genome and comparison with the human genome. Nature 437, 69-87. Mikkelsen, T.S., Ku, M., Jaffe, D.B., Issac, B., Lieberman, E., Giannoukos, G., Alvarez, P., Brockman, W., Kim, T.K., Koche, R.P., et al. (2007). 
Genome-wide maps of chromatin state in pluripotent and lineage-committed cells. Nature 448, 553-560. Mills, R.E., Bennett, E.A., Iskow, R.C., and Devine, S.E. (2007). Which transposable elements are active in the human genome? Trends Genet 23, 183-191. Mine, M., Chen, J.M., Brivet, M., Desguerre, I., Marchant, D., de Lonlay, P., Bernard, A., Ferec, C., Abitbol, M., Ricquier, D., et al. (2007). A large genomic deletion in the PDHX gene caused by the retrotranspositional insertion of a full-length LINE-1 element. Hum Mutat 28, 137-142. Mitchell, R.S., Beitzel, B.F., Schroder, A.R., Shinn, P., Chen, H., Berry, C.C., Ecker, J.R., and Bushman, F.D. (2004). Retroviral DNA integration: ASLV, HIV, and MLV show distinct target site preferences. PLoS Biol 2, E234. Moran, J.V., DeBerardinis, R.J., and Kazazian, H.H., Jr. (1999). Exon shuffling by L1 retrotransposition. Science 283, 1530-1534. Moran, J.V., Holmes, S.E., Naas, T.P., DeBerardinis, R.J., Boeke, J.D., and Kazazian, H.H., Jr. (1996). High frequency retrotransposition in cultured mammalian cells. Cell 87, 917-927. Morgan, H.D., Sutherland, H.G., Martin, D.I., and Whitelaw, E. (1999). Epigenetic inheritance at the agouti locus in the mouse. Nat Genet 23, 314-318. Mortada, H., Vieira, C., and Lerat, E. (2010). Genes devoid of full-length transposable element insertions are involved in development and in the regulation of transcription in human and closely related species. J Mol Evol 71, 180-191. Muotri, A.R., Chu, V.T., Marchetto, M.C., Deng, W., Moran, J.V., and Gage, F.H. (2005). Somatic mosaicism in neuronal precursor cells mediated by L1 retrotransposition. Nature 435, 903-910. Needleman, S.B., and Wunsch, C.D. (1970). A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol 48, 443-453.  155  Obbard, D.J., Gordon, K.H., Buck, A.H., and Jiggins, F.M. (2009). The evolution of RNAi as a defence against viruses and transposable elements. 
Philos Trans R Soc Lond B Biol Sci 364, 99-115. Ohshima, K., and Okada, N. (2005). SINEs and LINEs: symbionts of eukaryotic genomes with a common tail. Cytogenet Genome Res 110, 475-490. Okamura, K. (2011). Diversity of animal small RNA pathways and their biological utility. Wiley Interdiscip Rev RNA. Oldridge, M., Zackai, E.H., McDonald-McGinn, D.M., Iseki, S., Morriss-Kay, G.M., Twigg, S.R., Johnson, D., Wall, S.A., Jiang, W., Theda, C., et al. (1999). De novo alu-element insertions in FGFR2 identify a distinct pathological basis for Apert syndrome. Am J Hum Genet 64, 446-461. Ostertag, E.M., and Kazazian, H.H., Jr. (2001). Biology of mammalian L1 retrotransposons. Annu Rev Genet 35, 501-538. Pace, J.K., 2nd, and Feschotte, C. (2007). The evolutionary history of human DNA transposons: evidence for intense activity in the primate lineage. Genome Res 17, 422-432. Perepelitsa-Belancio, V., and Deininger, P. (2003). RNA truncation by premature polyadenylation attenuates human mobile element activity. Nat Genet 35, 363-366. Peters, L.L., Robledo, R.F., Bult, C.J., Churchill, G.A., Paigen, B.J., and Svenson, K.L. (2007). The mouse as a model for human biology: a resource guide for complex trait analysis. Nat Rev Genet 8, 58-69. Piriyapongsa, J., Marino-Ramirez, L., and Jordan, I.K. (2007). Origin and evolution of human microRNAs from transposable elements. Genetics 176, 1323-1337. Pritham, E.J., and Feschotte, C. (2007). Massive amplification of rolling-circle transposons in the lineage of the bat Myotis lucifugus. Proc Natl Acad Sci U S A 104, 1895-1900. Pritham, E.J., Putliwala, T., and Feschotte, C. (2007). Mavericks, a novel class of giant transposable elements widespread in eukaryotes and related to DNA viruses. Gene 390, 3-17. Ray, D.A., Feschotte, C., Pagan, H.J., Smith, J.D., Pritham, E.J., Arensburger, P., Atkinson, P.W., and Craig, N.L. (2008). Multiple waves of recent DNA transposon activity in the bat, Myotis lucifugus. Genome Res 18, 717-728. 
Rebollo, R., Horard, B., Hubert, B., and Vieira, C. (2010). Jumping genes and epigenetics: Towards new species. Gene 454, 1-7.
Rebollo, R., Karimi, M.M., Bilenky, M., Gagnier, L., Miceli-Royer, K., Zhang, Y., Goyal, P., Keane, T.M., Jones, S., Hirst, M., et al. (2011). Retrotransposon-induced heterochromatin spreading in the mouse revealed by insertional polymorphisms. PLoS Genet 7, e1002301.
Reed, J.E., Dunn, J.R., du Plessis, D.G., Shaw, E.J., Reeves, P., Gee, A.L., Warnke, P.C., Sellar, G.C., Moss, D.J., and Walker, C. (2007). Expression of cellular adhesion molecule 'OPCML' is down-regulated in gliomas and other brain tumours. Neuropathol Appl Neurobiol 33, 77-85.
Reese, M.G., Eeckman, F.H., Kulp, D., and Haussler, D. (1997). Improved splice site detection in Genie. J Comput Biol 4, 311-323.
Reiss, D., Zhang, Y., Rouhi, A., Reuter, M., and Mager, D.L. (2010). Variable DNA methylation of transposable elements: the case study of mouse Early Transposons. Epigenetics 5, 68-79.
Ribet, D., Dewannieux, M., and Heidmann, T. (2004). An active murine transposon family pair: retrotransposition of "master" MusD copies and ETn trans-mobilization. Genome Res 14, 2261-2267.
Romanish, M.T., Lock, W.M., van de Lagemaat, L.N., Dunn, C.A., and Mager, D.L. (2007). Repeated recruitment of LTR retrotransposons as promoters by the anti-apoptotic locus NAIP during mammalian evolution. PLoS Genet 3, e10.
Romanish, M.T., Nakamura, H., Lai, C.B., Wang, Y., and Mager, D.L. (2009). A novel protein isoform of the multicopy human NAIP gene derives from intragenic Alu SINE promoters. PLoS One 4, e5761.
Rosenberg, N., and Jolicoeur, P. (1997). Retroviral pathogenesis. In Retroviruses, J.M. Coffin, S.H. Hughes, and H.E. Varmus, eds. (Cold Spring Harbor: Cold Spring Harbor Laboratory Press), pp. 475-585.
Sasaki, H., and Matsui, Y. (2008). Epigenetic events in mammalian germ-cell development: reprogramming and beyond. Nat Rev Genet 9, 129-140.
Schmid, C.W. (1998). Does SINE evolution preclude Alu function? Nucleic Acids Res 26, 4541-4550.
Schnable, P.S., Ware, D., Fulton, R.S., Stein, J.C., Wei, F., Pasternak, S., Liang, C., Zhang, J., Fulton, L., Graves, T.A., et al. (2009). The B73 maize genome: complexity, diversity, and dynamics. Science 326, 1112-1115.
Schofield, P.R., McFarland, K.C., Hayflick, J.S., Wilcox, J.N., Cho, T.M., Roy, S., Lee, N.M., Loh, H.H., and Seeburg, P.H. (1989). Molecular characterization of a new immunoglobulin superfamily protein with potential roles in opioid binding and cell contact. EMBO J 8, 489-495.
Schroder, A.R., Shinn, P., Chen, H., Berry, C., Ecker, J.R., and Bushman, F. (2002). HIV-1 integration in the human genome favors active genes and local hotspots. Cell 110, 521-529.
Schwahn, U., Lenzner, S., Dong, J., Feil, S., Hinzmann, B., van Duijnhoven, G., Kirschner, R., Hemberger, M., Bergen, A.A., Rosenberg, T., et al. (1998). Positional cloning of the gene for X-linked retinitis pigmentosa 2. Nat Genet 19, 327-332.
Schwartz, S., Gal-Mark, N., Kfir, N., Oren, R., Kim, E., and Ast, G. (2009). Alu exonization events reveal features required for precise recognition of exons by the splicing machinery. PLoS Comput Biol 5, e1000300.
Sellar, G.C., Watt, K.P., Rabiasz, G.J., Stronach, E.A., Li, L., Miller, E.P., Massie, C.E., Miller, J., Contreras-Moreira, B., Scott, D., et al. (2003). OPCML at 11q25 is epigenetically inactivated and has tumor-suppressor function in epithelial ovarian cancer. Nat Genet 34, 337-343.
Sen, S.K., Han, K., Wang, J., Lee, J., Wang, H., Callinan, P.A., Dyer, M., Cordaux, R., Liang, P., and Batzer, M.A. (2006). Human genomic deletions mediated by recombination between Alu elements. Am J Hum Genet 79, 41-53.
Shell, B.E., Collins, J.T., Elenich, L.A., Szurek, P.F., and Dunnick, W.A. (1990). Two subfamilies of murine retrotransposon ETn sequences. Gene 86, 269-274.
Simons, C., Pheasant, M., Makunin, I.V., and Mattick, J.S. (2006). Transposon-free regions in mammalian genomes. Genome Res 16, 164-172.
Siomi, M.C., Sato, K., Pezic, D., and Aravin, A.A. (2011). PIWI-interacting small RNAs: the vanguard of genome defence. Nat Rev Mol Cell Biol 12, 246-258.
Sironi, M., Menozzi, G., Comi, G.P., Cereda, M., Cagliani, R., Bresolin, N., and Pozzoli, U. (2006). Gene function and expression level influence the insertion/fixation dynamics of distinct transposon families in mammalian introns. Genome Biol 7, R120.
Slotkin, R.K., and Martienssen, R. (2007). Transposable elements and the epigenetic regulation of the genome. Nat Rev Genet 8, 272-285.
Smit, A.F. (1996). The origin of interspersed repeats in the human genome. Curr Opin Genet Dev 6, 743-748.
Smit, A.F. (1999). Interspersed repeats and other mementos of transposable elements in mammalian genomes. Curr Opin Genet Dev 9, 657-663.
Solyom, S., Ewing, A.D., Hancks, D.C., Takeshima, Y., Awano, H., Matsuo, M., and Kazazian, H.H., Jr. (2012). Pathogenic orphan transduction created by a nonreference LINE-1 retrotransposon. Hum Mutat 33, 369-371.
Sorek, R., Ast, G., and Graur, D. (2002). Alu-containing exons are alternatively spliced. Genome Res 12, 1060-1067.
Spivakov, M., and Fisher, A.G. (2007). Epigenetic signatures of stem-cell identity. Nat Rev Genet 8, 263-271.
Stoye, J.P. (2006). Koala retrovirus: a genome invasion in real time. Genome Biol 7, 241.
Stoye, J.P., and Coffin, J.M. (1988). Polymorphism of murine endogenous proviruses revealed by using virus class-specific oligonucleotide probes. J Virol 62, 168-175.
Struyk, A.F., Canoll, P.D., Wolfgang, M.J., Rosen, C.L., D'Eustachio, P., and Salzer, J.L. (1995). Cloning of neurotrimin defines a new subfamily of differentially expressed neural cell adhesion molecules. J Neurosci 15, 2141-2156.
Sun, F.L., Haynes, K., Simpson, C.L., Lee, S.D., Collins, L., Wuller, J., Eissenberg, J.C., and Elgin, S.C. (2004). cis-Acting determinants of heterochromatin formation on Drosophila melanogaster chromosome four. Mol Cell Biol 24, 8210-8220.
Symer, D.E., Connelly, C., Szak, S.T., Caputo, E.M., Cost, G.J., Parmigiani, G., and Boeke, J.D. (2002). Human L1 retrotransposition is associated with genetic instability in vivo. Cell 110, 327-338.
Taft, R.A., Davisson, M., and Wiles, M.V. (2006). Know thy mouse. Trends Genet 22, 649-653.
Tarlinton, R.E., Meers, J., and Young, P.R. (2006). Retroviral invasion of the koala genome. Nature 442, 79-81.
Telesnitsky, A., and Goff, S.P. (1997). Reverse Transcriptase and the Generation of Retroviral DNA. In Retroviruses, J.M. Coffin, S.H. Hughes, and H.E. Varmus, eds. (Cold Spring Harbor: Cold Spring Harbor Laboratory Press), pp. 121-160.
Theodorou, V., Kimm, M.A., Boer, M., Wessels, L., Theelen, W., Jonkers, J., and Hilkens, J. (2007). MMTV insertional mutagenesis identifies genes, gene families and pathways involved in mammary cancer. Nat Genet 39, 759-769.
Thornburg, B.G., Gotea, V., and Makalowski, W. (2006). Transposable elements as a significant source of transcription regulating signals. Gene 365, 104-110.
Ullu, E., and Weiner, A.M. (1985). Upstream sequences modulate the internal promoter of the human 7SL RNA gene. Nature 318, 371-374.
Ustyugova, S.V., Lebedev, Y.B., and Sverdlov, E.D. (2006). Long L1 insertions in human gene introns specifically reduce the content of corresponding primary transcripts. Genetica 128, 261-272.
van de Lagemaat, L.N., Landry, J.R., Mager, D.L., and Medstrand, P. (2003). Transposable elements in mammals promote regulatory variation and diversification of genes with specialized functions. Trends Genet 19, 530-536.
van de Lagemaat, L.N., Medstrand, P., and Mager, D.L. (2006). Multiple effects govern endogenous retrovirus survival patterns in human gene introns. Genome Biol 7, R86.
Vernochet, C., Heidmann, O., Dupressoir, A., Cornelis, G., Dessen, P., Catzeflis, F., and Heidmann, T. (2011). A syncytin-like endogenous retrovirus envelope gene of the guinea pig specifically expressed in the placenta junctional zone and conserved in Caviomorpha. Placenta 32, 885-892.
Wade, C.M., and Daly, M.J. (2005). Genetic variation in laboratory mice. Nat Genet 37, 1175-1180.
Wang, J., Song, L., Grover, D., Azrak, S., Batzer, M.A., and Liang, P. (2006). dbRIP: a highly integrated database of retrotransposon insertion polymorphisms in humans. Hum Mutat 27, 323-329.
Wang, T., Zeng, J., Lowe, C.B., Sellers, R.G., Salama, S.R., Yang, M., Burgess, S.M., Brachmann, R.K., and Haussler, D. (2007). Species-specific endogenous retroviruses shape the transcriptional network of the human tumor suppressor protein p53. Proc Natl Acad Sci U S A 104, 18613-18618.
Wang, X.Y., Steelman, L.S., and McCubrey, J.A. (1997). Abnormal activation of cytokine gene expression by intracisternal type A particle transposition: effects of mutations that result in autocrine growth stimulation and malignant transformation. Cytokines Cell Mol Ther 3, 3-19.
Watanabe, T., Takeda, A., Tsukiyama, T., Mise, K., Okuno, T., Sasaki, H., Minami, N., and Imai, H. (2006). Identification and characterization of two novel classes of small RNAs in the mouse germline: retrotransposon-derived siRNAs in oocytes and germline small RNAs in testes. Genes Dev 20, 1732-1743.
Waterston, R.H., Lindblad-Toh, K., Birney, E., Rogers, J., Abril, J.F., Agarwal, P., Agarwala, R., Ainscough, R., Alexandersson, M., An, P., et al. (2002). Initial sequencing and comparative analysis of the mouse genome. Nature 420, 520-562.
Wu, X., and Burgess, S.M. (2004). Integration target site selection for retroviruses and transposable elements. Cell Mol Life Sci 61, 2588-2596.
Wu, X., Li, Y., Crise, B., and Burgess, S.M. (2003). Transcription start regions in the human genome are favored targets for MLV integration. Science 300, 1749-1751.
Yalcin, B., Wong, K., Agam, A., Goodson, M., Keane, T.M., Gan, X., Nellaker, C., Goodstadt, L., Nicod, J., Bhomra, A., et al. (2011). Sequence-based characterization of structural variation in the mouse genome. Nature 477, 326-329.
Yamada, M., Hashimoto, T., Hayashi, N., Higuchi, M., Murakami, A., Nakashima, T., Maekawa, S., and Miyata, S. (2007). Synaptic adhesion molecule OBCAM; synaptogenesis and dynamic internalization. Brain Res 1165, 5-14.
Yang, H., Bell, T.A., Churchill, G.A., and Pardo-Manuel de Villena, F. (2007). On the subspecific origin of the laboratory mouse. Nat Genet 39, 1100-1107.
Yang, N., and Kazazian, H.H., Jr. (2006). L1 retrotransposition is suppressed by endogenously encoded small interfering RNAs in human cultured cells. Nat Struct Mol Biol 13, 763-771.
Yates, P.A., Burman, R.W., Mummaneni, P., Krussel, S., and Turker, M.S. (1999). Tandem B1 elements located in a mouse methylation center provide a target for de novo DNA methylation. J Biol Chem 274, 36357-36361.
Yoder, J.A., Walsh, C.P., and Bestor, T.H. (1997). Cytosine methylation and the ecology of intragenomic parasites. Trends Genet 13, 335-340.
Yohn, C.T., Jiang, Z., McGrath, S.D., Hayden, K.E., Khaitovich, P., Johnson, M.E., Eichler, M.Y., McPherson, J.D., Zhao, S., Paabo, S., et al. (2005). Lineage-specific expansions of retroviral insertions within the genomes of African great apes but not humans and orangutans. PLoS Biol 3, e110.
Yoshida, K., Nakamura, A., Yazaki, M., Ikeda, S., and Takeda, S. (1998). Insertional mutation by transposable element, L1, in the DMD gene results in X-linked dilated cardiomyopathy. Hum Mol Genet 7, 1129-1132.
Zapala, M.A., Hovatta, I., Ellison, J.A., Wodicka, L., Del Rio, J.A., Tennant, R., Tynan, W., Broide, R.S., Helton, R., Stoveken, B.S., et al. (2005). Adult mouse brain gene expression patterns bear an embryologic imprint. Proc Natl Acad Sci U S A 102, 10357-10362.
Zhang, Y., Maksakova, I.A., Gagnier, L., van de Lagemaat, L.N., and Mager, D.L. (2008). Genome-wide assessments reveal extremely high levels of polymorphism of two active families of mouse endogenous retroviral elements. PLoS Genet 4, e1000007.

Appendices

Appendix A Supporting information of Chapter 2

A.1  Supplementary figures

Figure A.1 Mouse trace sequence archive composition (May 2007). Numbers of sequence traces produced by whole genome shotgun methods are shown in yellow, traces produced by a re-sequencing CHIP technology are shown in blue and traces produced by other methods are shown in green.

Figure A.2 Design of probes. Type-1 probes include the full-length LTR and a small 23 bp fragment of the internal 5′ or 3′ ERV sequence. Type-2 probes consist only of the full LTR. Type-3 probes cover only the first/last 60 bp of the 5′/3′ LTR.

Figure A.3 Estimation of fractional loss of detection of ERVs using sequence traces. The graph shows the fraction of ERVs present in the assembled B6 genome that were found using raw sequence traces from B6. Varying numbers of whole genome shotgun (WGS) traces from B6 were mapped back to the assembled genome to detect ETn/MusD sequences as described in Materials and Methods. Arrows show the fraction of insertions found with a number of WGS traces equivalent to that available for each test strain. Standard deviations in percentages are plotted but are too small to see, the largest being 1% for the sample size of four million traces.

Figure A.4 Splice sites in ETn elements detected in chimeric transcripts. An alignment of an ETnI element (chr2:110,262,865-110,268,371 of the mm8 version of the B6 genome) and an ETnII element is shown. The 5′ LTR and part of the 3′ LTR are shown with some interior sequence. Filled arrows indicate SA sites identified in cases of ETn mutations in the literature and also found here. Open arrows show SA sites newly identified in this study. The open triangle shows an SD site in ETnI newly identified here. "PolyA site" marks a polyadenylation site used in some published cases. The sequence of the Dnajc10 ETnII insertion differs from that of the ETnII shown: two C-to-A mutations produce new SA sites, one suspected (marked with a "?") and the other sequenced. An ETnII element harboring these and one other mutation present in the Dnajc10 insertion is present in the B6 genome at chr13:23,177,615-23,184,720. Locations of primers used are shown with arrowed lines. Note: Figure made by Irina A. Maksakova.

Figure A.5 Comparative microarray expression data of Dnajc10 in different strains. Graphs of relative transcript levels are computer screen shots of datasets available on the GeneNetwork site (www.GeneNetwork.org). The line at the top represents the intron/exon structure of the gene, with the location of the ETn insertion in A/J indicated and the position of the probe used for microarrays shown. A) Dataset from The Hippocampus Consortium M430v2 (Dec. 2005) RMA (Robust Multi-array Average) series. Mouse strains are 129S1/Svlmj, A/J, AKR/J, Balb/cJ, C3H/HeJ, C57BL/6J, CAST/Ei, DBA/2J, KK/HIJ, LG/J, NOD/LtJ, NZO/HILtJ, PWD/PhJ, PWK/PhJ, WSB/EiJ, B6D2F1, D2B6F1. B) Dataset from the Hamilton Eye Institute mouse eye M430v2 (Nov. 2005) RMA series. Mouse strains are 129S1/Svlmj, A/J, BALB/cByJ, C3H/HeJ, C57BL/6J, CAST/EiJ, DBA/2J, KK/HIJ, LG/J, NOD/LtJ, NZO/HILtJ, PWD/PhJ, PWK/PhJ, WSB/EiJ, B6D2F1, D2B6F1. C) Dataset from the Univ. of Colorado at Denver and Health Sciences Center whole brain M430v2 (Nov06) RMA series. Mouse strains are 129P3/J, 129S1/Svlmj, A/J, AKR/J, BALB/cByJ, BALB/cJ, C3H/HeJ, C57BL/6J, C58/J, CAST/EiJ, CBA/J, DBA/2J, FVB/NJ, KK/HIJ, MOLF/EiJ, NOD/LtJ, NZW/LacJ, PWD/PhJ, SJL/J. D) Dataset from the GE-NIAAA (National Institute on Alcohol Abuse and Alcoholism) cerebellum Affymetrix M430v2 (May05) PDNN (Position-Dependent Nearest Neighbor) series. Mouse strains are 129S1/Svlmj, A/J, AKR/J, BALB/cByJ, BALB/cJ, C3H/HeJ, C57BL/6J, CAST/EiJ, DBA/2J, KK/HIJ, LG/J, NOD/LtJ. Values for A/J in each graph are marked with a star and fall significantly below the normal distribution displayed by all other strains.

Figure A.6 Comparative microarray expression data of Opcml in different strains. Graphs of relative transcript levels are computer screen shots of datasets available on the GeneNetwork site (www.GeneNetwork.org). The line at the top represents the intron/exon structure of the gene, with the location of the ETn insertion in A/J indicated and the position of the probe used for microarrays shown. A) Dataset from The Hippocampus Consortium M430v2 (Dec. 2005) RMA (Robust Multi-array Average) series. B) Dataset from the Hamilton Eye Institute mouse eye M430v2 (Nov. 2005) RMA series. C) Dataset from the Univ. of Colorado at Denver and Health Sciences Center whole brain M430v2 (Nov06) RMA series. D) Dataset from the GE-NIAAA (National Institute on Alcohol Abuse and Alcoholism) cerebellum Affymetrix M430v2 (May05) PDNN (Position-Dependent Nearest Neighbor) series. Mouse strains for all datasets are listed in the legend to Figure A.5. Values for A/J in each graph are marked with a star and fall significantly below the normal distribution displayed by all other strains.

Figure A.7 Semi-quantitative RT-PCR of Opcml in A/J versus B6. A) Semi-quantitative RT-PCR. Opcml cDNA was amplified with primers from upstream and downstream of the ETn insertion. Opcml and Gapdh fragments were amplified from undiluted cDNA and dilutions of 1/20, 1/40 and 1/80. B) Graphical representation of RT-PCR. For each dilution, the intensity of the resulting band was quantified and graphed as transcript levels of Opcml relative to Gapdh. The results of one of two representative experiments are shown. Note: Irina A. Maksakova performed the experiments and made this figure.

Figure A.8 UCSC Genome Browser screen shot of the PolyERV track. The polymorphic IAP and ETn/MusD insertions in the mouse are annotated in red and green, respectively. When the ERV insertion is displayed as a horizontal bar, it is present in the B6 reference genome, and its orientation is shown as arrows inside the bar. When the ERV insertion is displayed as a vertical line, it is not present in B6. More detailed information, such as the genomic coordinates and the presence or absence of the polymorphic ERV in different inbred strains, can be found by clicking on the ERV id.

A.2  Supplementary tables

Table A.1 Details of ERV probes
Target ERV  Probe Type  Probe Name  Probe Structure  Length  Template ERV a
ETn/MusD  Type-1  probe1_5p  5’ LTR + ERV internal  340 bp  Y17106 (ETn II)
ETn/MusD  Type-1  probe1_3p  ERV internal + 3’ LTR  340 bp  Y17106 (ETn II)
ETn/MusD  Type-2  probe2  full LTR  317 bp  Y17106 (ETn II)
ETn/MusD  Type-3  Probe3_U3  5’-end of the LTR  60 bp  Y17106 (ETn II)
ETn/MusD  Type-3  Probe3_U5A  3’-end of the LTR  60 bp  Y17106 (ETn II)
ETn/MusD  Type-3  Probe3_U5B  3’-end of the LTR  60 bp  AC068908 (ETn I)
IAP  Type-1  probe1_5p  5’ LTR + ERV internal  376 bp  EU183301 (IΔ1)
IAP  Type-1  probe1_3p  ERV internal + 3’ LTR  376 bp  EU183301 (IΔ1)
IAP  Type-2  probe2  full LTR  317 bp  EU183301 (IΔ1)
IAP  Type-3  Probe3_U3  5’-end of the LTR  60 bp  EU183301 (IΔ1)
IAP  Type-3  Probe3_U5  3’-end of the LTR  60 bp  EU183301 (IΔ1)
a Accession numbers of ERVs used for probe design.
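The probe layouts described in Figure A.2 and Table A.1 can be illustrated with a short Python sketch. This is not the code used in the thesis: the function name `design_probes`, the dictionary keys and the toy sequences are hypothetical, and the sketch assumes that the 5′ and 3′ LTRs share a single sequence. Only the 23 bp internal fragment and the 60 bp LTR ends are taken from Figure A.2.

```python
# Illustrative sketch only; names and toy sequences are hypothetical.
INTERNAL_FLANK_BP = 23  # Type-1: internal ERV fragment joined to the LTR (Figure A.2)
LTR_END_BP = 60         # Type-3: first/last 60 bp of the LTR (Figure A.2)


def design_probes(ltr: str, internal: str) -> dict:
    """Build Type-1, Type-2 and Type-3 probe sequences from one LTR.

    Assumption: the 5' and 3' LTRs are identical, so a single `ltr`
    string stands in for both. `internal` is the ERV internal sequence
    lying between the two LTRs.
    """
    if len(ltr) < LTR_END_BP:
        raise ValueError("LTR is shorter than the Type-3 probe length")
    if len(internal) < INTERNAL_FLANK_BP:
        raise ValueError("internal sequence is shorter than the Type-1 flank")
    return {
        # Type-1: full LTR plus the adjacent 23 bp of internal sequence
        "probe1_5p": ltr + internal[:INTERNAL_FLANK_BP],
        "probe1_3p": internal[-INTERNAL_FLANK_BP:] + ltr,
        # Type-2: the full LTR only
        "probe2": ltr,
        # Type-3: only the outermost 60 bp at either end of the LTR
        "probe3_U3": ltr[:LTR_END_BP],
        "probe3_U5": ltr[-LTR_END_BP:],
    }


if __name__ == "__main__":
    toy_ltr = ("ACGT" * 80)[:317]   # toy 317 bp LTR, not a real ETn sequence
    toy_internal = "TTGACC" * 20    # toy 120 bp internal region
    for name, seq in design_probes(toy_ltr, toy_internal).items():
        print(name, len(seq), "bp")
```

With a 317 bp LTR, the two Type-1 probes come out at 317 + 23 = 340 bp, which matches the lengths listed for the ETn/MusD probes in Table A.1.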
Table A.2 Polymorphic IAP cases in genes
insertion site a  gene  orient b  location  intron_size  B6 c  AJ c  DBA  129X1
chr1:10631189  Cpa6  S  intron 1  123915  N  Y  N  Y
chr1:107135130  Rnf152  S  intron 2  70085  N  ?  Y  ?
chr1:112765156  Cdh19  S  intron 3  16857  Y  N  Y  Y
chr1:117712805  Cntnap5a  A  intron 1  215710  Y  N  ?  N
chr1:11872330  A830018L16Rik  A  intron 10  103651  Y  N  ?  N
chr1:11932192  A830018L16Rik  A  intron 11  17370  N  ?  Y  ?
chr1:128251571  E030049G20Rik  A  intron 2  66267  Y  N  N  N
chr1:129385804  2610024A01Rik  S  intron 2  115845  Y  N  N  N
chr1:129702627  Rab3gap1  A  intron 3  14999  Y  N  N  Y
chr1:13141941  Ncoa2  A  intron 15  3624  N  Y  N  Y
chr1:133540602  5430435G22Rik  A  exon 6  1846  N  N  Y  Y
chr1:140332028  Nek7  S  intron 7  12958  N  Y  ?  ?
chr1:141088753  Crb1  A  intron 5  65494  N  Y  N  N
chr1:141541180  Cfhrc  A  intron 13  29547  Y  N  N  Y
chr1:148306255  B830045N13Rik  A  intron 2  167621  N  Y  ?  ?
chr1:148568011  B830045N13Rik  A  intron 6  79568  Y  Y  N  N
chr1:154325012  Rgl1  A  intron 4  14009  N  N  Y  N
chr1:154998354  Lamc1  A  intron 18  4600  N  N  ?  Y
chr1:159356956  Lztr2  S  intron 1  17923  N  N  N  Y
chr1:162444080  Rabgap1l  A  intron 13  155102  N  ?  ?  Y
chr1:162950622  Klhl20  A  intron 2  20592  N  N  Y  N
chr1:164677213  Fmo1  A  intron 4  10300  Y  N  N  N
chr1:166636721  Dpt  S  intron 1  11868  N  N  Y  N
chr1:16687299  Ly96  A  intron 3  5248  Y  N  N  Y
chr1:171886435  Ddr2  A  intron 2  52772  N  Y  N  N
chr1:172255709  Nos1ap  A  intron 2  123790  N  Y  N  Y
chr1:172644970  Atf6  A  intron 9  15676  N  Y  N  Y
chr1:175833708  Ifi202b  S  intron 1  48881  N  Y  N  N
chr1:176449911  Fmn2  A  intron 7  34453  N  ?  N  Y
chr1:177104097  Rgs7  A  intron 3  81588  N  Y  Y  Y
chr1:181190740  Smyd3  A  intron 5  310879  N  Y  N  N
chr1:18238083  Defb41  S  intron 1  13691  Y  Y  Y  N
chr1:187004287  Iars2  A  intron 12  12491  Y  Y  Y  N
chr1:20147389  Pkhd1  S  intron 60  80069  N  Y  ?  ?
chr1:25364967  Bai3  A  intron 15  24077  N  N  Y  ?
chr1:25694047  Bai3  A  intron 2  266137  N  N  ?  Y
chr1:36489068  Ankrd39  A  intron 1  4219  N  ?  ?  Y
chr1:37197114  Cnga3  A  intron 3  8943  N  Y  N  N
chr1:38226945  Aff3  A  intron 10  50349  N  N  Y  ?
chr1:40642830  Slc9a2  S  intron 1  36428  Y  N  Y  N
chr1:44606648  Gulp1  A  intron 1  137270  N  Y  Y  N
chr1:4930959  Rgs20  S  intron 1  95320  N  Y  ?  ?
chr1:58875297  Trak2  A  intron 2  9017  Y  Y  Y  N
chr1:79679405  Serpine2  A  intron 6  1900  N  N  Y  N
chr1:8648290  Sntg1  S  intron 10  52924  Y  N  Y  Y
chr1:8942202  Sntg1  S  intron 3  116970  N  N  Y  Y
chr1:9128706  Sntg1  A  intron 2  203928  N  Y  N  N
chr2:102508928  Slc1a2  A  intron 1  76641  N  N  N  Y
chr2:104809033  Ga17  A  intron 5  6010  N  Y  Y  ?
chr2:106367986  C130023O10Rik  A  intron 2  335222  Y  Y  Y  N
chr2:110408677  Slc5a12  A  intron 2  3625  N  Y  Y  Y
chr2:112349790  Aven  A  intron 2  110526  Y  N  N  N
chr2:113610781  Scg5  S  intron 2  35308  Y  N  N  Y
chr2:119908942  Pla2g4e  S  intron 1  44399  Y  N  N  N
chr2:120505366  Ttbk2  A  intron 3  15565  Y  N  N  N
chr2:122147455  Slc28a2  S  intron 15  563  N  Y  Y  N
chr2:122381289  Slc30a4  A  intron 2  4841  N  Y  Y  Y
chr2:122487190  Sqrdl  A  intron 4  5633  Y  N  N  N
chr2:125741346  Fgf7  A  intron 2  52233  N  Y  N  ?
chr2:127443107  Nphp1  A  intron 16  5859  N  N  Y  N
chr2:131622077  Prnp  A  intron 2  24126  Y  Y  Y  N
chr2:134675433  Plcb1  S  intron 2  174444  N  N  Y  N
chr2:140069047  Sel1l2  A  intron 1  49416  N  N  Y  N
chr2:141440826  LOC433479  S  intron 2  85934  N  N  Y  N
chr2:142050494  LOC433479  A  intron 11  55722  N  ?  Y  ?
chr2:143544044  Bfsp1  S  intron 2  9032  N  N  Y  Y
chr2:144319334  Dtd1  A  intron 3  18329  Y  Y  N  N
chr2:144360438  Dtd1  A  intron 4  111035  Y  Y  N  N
chr2:144442331  Dtd1  S  intron 5  21136  Y  Y  N  N
chr2:145048177  Slc24a3  A  intron 2  203415  N  N  Y  Y
chr2:145316680  Slc24a3  A  intron 15  17425  Y  Y  N  ?
chr2:154045704  Cdk5rap1  A  intron 6  6345  Y  N  N  N
chr2:155549661  2410003P15Rik  A  intron 7  28918  N  N  ?  Y
chr2:155902424  Phf20  S  intron 1  26536  Y  Y  Y  N
chr2:161660353  Ptprt  A  intron 7  207432  N  N  N  Y
chr2:164053003  Slpi  S  intron 1  13248  N  N  N  Y
chr2:164091225  Matn4  A  intron 3  672  N  Y  N  ?
chr2:165366393  Eya2  A  intron 2  17949  Y  Y  N  ?
chr2:166361281  Setd6  S  intron 1  66468  N  N  N  Y
chr2:166812758  Kcnb1  A  intron 1  81696  N  N  N  Y
chr2:179589378  Cdh4  A  intron 2  335441  N  N  Y  Y
chr2:20231325  BC026657  S  intron 1  49879  N  Y  Y  N
chr2:26155230  Gpsm1  A  intron 9  12219  Y  ?  N  N
chr2:26820307  Adamts13  S  intron 23  7828  Y  Y  Y  N
chr2:38777659  Olfml2a  S  intron 6  2896  N  ?  Y  ?
chr2:41486698  Lrp1b  S  intron 7  159272  Y  N  ?  ?
chr2:41566379  Lrp1b  A  intron 5  42033  Y  N  ?  N
chr2:54323971  Galnt13  A  intron 2  38722  N  Y  Y  Y
chr2:54744114  Galnt13  A  intron 8  53215  N  Y  Y  Y
chr2:5601298  Camk1d  A  intron 1  148364  Y  Y  N  N
chr2:62018942  Slc4a10  A  intron 4  17072  Y  Y  Y  N
chr2:68091288  Stk39  S  intron 14  39948  N  Y  Y  N
chr2:69001191  Spc25  A  intron 3  2481  N  Y  ?  Y
chr2:69049150  Abcb11  A  intron 25  2761  N  Y  N  Y
chr2:71059935  Dync1i2  A  intron 14  2596  Y  N  Y  Y
chr2:79157475  Cerkl  A  intron 4  6258  N  Y  Y  N
insertion sitea  gene  orientb  location  chr2:79900254 chr2:90336025  Pde1a Ptprj 1110051M20Rik Spag17  S A S A  intron 1 intron 1 intron 2 intron 34  Sycp1 Dph5 Agl Alg14 Abca4 Arsj Hadh Slc39a8 Bank1 Bank1  S S A A A S A A A A  chr3:84192475 chr3:84211092  B930007M17Rik Unc5c Bmpr1b Gipc2 St6galnac5 Tnni3k Negr1 Negr1 Pld1 4930558O21Rik Ccrn4l 4631416L12Rik Rsrc1 Rsrc1 Serpini1 Fstl5 Fstl5 Fstl5 Mnd1 Mnd1  chr3:84228309 chr3:85708209 chr3:86996186 chr4:100621986 chr4:102373594 chr4:106662737 chr4:106671735 chr4:111069681 chr4:120433792 chr4:123292955 chr4:133452899 chr4:140076408 chr4:140184613 chr4:140211091 chr4:146246790  chr2:91185450 chr3:100216189 chr3:103057526 chr3:115901132 chr3:116752419 chr3:121317947 chr3:122158252 chr3:126427417 chr3:131247008 chr3:135782671 chr3:136055611 chr3:136111166 chr3:139595022 chr3:141594386 chr3:141868130 chr3:152097382 chr3:152826279 chr3:154890028 chr3:156734836 chr3:156960231 chr3:28253814 chr3:30861316 chr3:51327897 chr3:62486655 chr3:67115872 chr3:67237206 chr3:75705022 chr3:76519100 chr3:76531573 chr3:76693308  intron_size  B6  AJ  DBA  129X1  200944 105165 38005 2856  N N N N  Y Y Y N  Y Y Y Y  N N N N  intron 8 intron 4 intron 25 intron 3 intron 43 intron 1 intron 1 intron 1 intron 7 intron 7  3988 15235 9868 41343 2752 72832 21943 22324 109900 109900  N Y Y N N N N N Y Y  N Y N ? N N N N Y N  N Y N Y Y Y Y Y Y N  Y N N ? N N ? ? N N  A A A S A A A A A S S A S A S S A A S S  intron 12 intron 1 intron 1 intron 1 intron 2 intron 11 intron 1 intron 3 intron 12 intron 7 intron 1 intron 5 intron 3 intron 4 intron 4 intron 6 intron 6 intron 10 intron 5 intron 4  178647 212148 101403 27888 134386 15244 297083 52857 14620 12042 22516 38556 85944 98118 2355 52361 52361 26901 11376 17515  N N N N N N Y N Y N Y N Y N N N N N N N  Y N N N Y Y N Y N Y Y ? Y N N Y N Y N Y  N Y Y ? ? Y ? N Y ? Y Y N Y Y ? Y N Y Y  Y N N Y ? N N N N ? N ? Y ?
N Y N N N Y  Mnd1 Pet112l  S A  intron 1 intron 9  13896 17909  Y Y  N N  N N  ? N  Dcamkl2 Raver2  A A  intron 1 intron 5  13532 10683  N Y  Y N  ? Y  ? N  Sgip1 2210012G02Rik  A S  intron 7 intron 3  29800 17838  Y N  N Y  Y N  Y Y  2210012G02Rik 4931433A01Rik  S S  intron 2 intron 9  11874 48229  N Y  N N  Y ?  N N  Zfp69 Rhbdl2  S S  intron 2 intron 1  12094 21941  Y N  ? N  N N  ? Y  Ccdc21 Padi3 Padi2 Padi2 LOC195531  S A S A S  intron 1 intron 1 intron 1 intron 12 intron 1  14061 6867 11022 6376 334847  N N N Y N  Y ? N Y ?  N Y Y N ?  N ? N N Y  175  insertion sitea  gene  orientb  location  chr4:146251747 chr4:146280727  LOC195531 LOC195531 LOC195531 MGC67181  S S A S  intron 1 intron 1 intron 1 intron 1  Nphp4 Mmp16 E130310K16Rik Olfr157 Olfr157 Olfr157 Olfr157 Olfr157 Olfr157 Dnaja1  A A A A S A S A S A  chr5:115217895 chr5:116255931  Kif24 4930417M19Rik Olfr157 Trim14 Gabbr2 Akap2 Rgs3 Dennd4c BC057079 Tek Asph Itgb3bp Enoph1 Hsd17b13 Pkd2 Lrrc8b Lrrc8d Golga3 Tcf1 Cit  chr5:117755717 chr5:118371007 chr5:122266574 chr5:124578173 chr5:12465510 chr5:12523097 chr5:127958002 chr5:127975659 chr5:14208932 chr5:143269838 chr5:144543260 chr5:144561331 chr5:144803458 chr5:147070481 chr5:148307541  chr4:146329110 chr4:146677302 chr4:151368737 chr4:17877006 chr4:20396970 chr4:37843384 chr4:38148828 chr4:38616468 chr4:39058766 chr4:40141430 chr4:40630644 chr4:40922556 chr4:41643406 chr4:43318824 chr4:43459920 chr4:46532187 chr4:46936870 chr4:57976790 chr4:62059985 chr4:86274553 chr4:87586931 chr4:94305261 chr4:9450826 chr4:99310397 chr5:100291561 chr5:104196736 chr5:104737400 chr5:105659638 chr5:105964456 chr5:110435196  intron_size  B6  AJ  DBA  129X1  334847 334847 334847 19480  N N N Y  Y ? N Y  ? ? Y ?  N Y N N  intron 11 intron 1 intron 3 intron 7 intron 8 intron 10 intron 11 intron 16 intron 17 exon 9  11540 133718 186380 535229 144288 272908 504175 797293 272631 2238  N N N N N Y Y Y N N  N Y Y N ? N N N ? Y  N N Y N ? Y Y Y Y N  Y Y Y Y Y N N N ? 
N  A A A A A A S A A A A A S A A A A A A A  intron 1 intron 3 intron 20 intron 4 intron 1 intron 1 intron 2 intron 13 intron 1 intron 7 intron 17 intron 1 intron 1 intron 6 intron 10 intron 1 intron 1 intron 9 intron 2 intron 42  35726 10849 794532 11451 115304 25985 15221 2316 26593 8666 12150 14981 18836 9487 3448 52883 31143 8073 4012 1051  Y Y N Y N N Y N N Y N Y Y N N Y Y Y N N  N N ? N N N N Y N N Y Y N N Y N N N N N  N N Y N Y N N ? Y Y N ? ? N Y N ? N Y Y  Y Y ? ? Y Y N Y N ? ? N Y Y N Y Y N Y Y  Ksr2 Fbxw8  A A  intron 1 intron 5  85703 17881  N N  Y Y  ? N  ? N  Cutl2 Mphosph9  A A  intron 2 intron 3  118853 4947  Y N  N Y  Y N  N Y  Sema3d Sema3d  A A  intron 2 intron 7  15035 16029  N Y  N Y  ? Y  Y N  Glt1d1 Glt1d1  A A  intron 7 intron 9  12495 11695  N N  Y Y  ? N  N N  Sema3e 2810055G22Rik  S A  intron 4 intron 15  46111 22051  N Y  Y Y  ? N  ? N  Baiap2l1 Baiap2l1 Nptx2 Usp12 C130038G02Rik  S A A A A  intron 3 intron 3 intron 2 intron 3 intron 1  32616 32616 5063 8600 118143  N N N N N  Y N Y Y ?  ? N Y N N  ? Y ? ? 
Y  176  insertion sitea  gene  orientb  location  chr5:151339140 chr5:15796736  Stard13 Cacna2d1 Sema3c A530088I07Rik  S A A A  intron 4 intron 10 intron 2 intron 12  Fbxl13 Armc10 Reln Lhfpl3 Lhfpl3 Srpk2 Letm1 Cdk6 Stx18 Slc2a9  S A A S S S A A A A  chr6:122007148 chr6:128789794  Bst1 Kcnip4 Kcnip4 Tbc1d1 3732412D22Rik Gabrb1 Dcun1d4 Dcun1d4 Tmprss11d Adamts3 4932413O14Rik Fras1 Cntn6 Cntn4 Arl8b Grm7 Ninj2 Atp6v1e1 Mug2 Klrb1b  chr6:136801806 chr6:137520031 chr6:138416793 chr6:139788455 chr6:149027741 chr6:149391024 chr6:17074195 chr6:18439758 chr6:45030853 chr6:46608110 chr6:54884614 chr6:5796637 chr6:63704980 chr6:71884936 chr6:72534267  chr5:17097612 chr5:20747242 chr5:21148060 chr5:21168567 chr5:21712860 chr5:22638613 chr5:22696864 chr5:23081808 chr5:34076523 chr5:3413005 chr5:38418593 chr5:38693940 chr5:44117077 chr5:48958008 chr5:49108457 chr5:64548798 chr5:67123782 chr5:72348808 chr5:73814057 chr5:73831442 chr5:87392834 chr5:90905921 chr5:93689748 chr5:96725801 chr6:104611850 chr6:105663923 chr6:108767487 chr6:111326901 chr6:120093223 chr6:120777881  intron_size  B6  AJ  DBA  129X1  9643 32452 76736 13557  N Y N Y  N Y N Y  N ? Y ?  Y N N N  intron 1 intron 4 intron 3 intron 2 intron 2 intron 1 intron 6 intron 2 intron 9 intron 6  12598 6999 56243 119215 119215 51060 8110 38173 7107 44714  N N Y N Y Y Y N Y Y  ? Y Y N Y Y Y Y N N  ? Y N Y N ? N ? Y Y  Y N ? Y N N ? ? Y Y  A A S A A A A A S A A A A S A A A A A S  intron 6 intron 1 intron 1 intron 6 intron 1 intron 5 intron 7 intron 8 intron 7 intron 3 intron 22 intron 2 intron 3 intron 1 intron 1 intron 8 intron 1 intron 2 intron 12 intron 1  11103 406709 406709 6630 111876 78397 9348 11202 16883 85856 10764 142399 7365 162525 30248 136571 104443 9861 6664 5117  Y Y N N N Y N N Y Y Y Y N Y N N Y Y N N  Y N Y Y Y Y Y N Y Y Y Y N Y N Y Y N ? Y  N N Y N Y N N Y N N Y ? N Y Y Y Y Y ? Y  Y Y N ? N Y N N Y N N N Y N N Y N ? Y Y  BC049715 Eps8  A A  intron 2 intron 2  8742 52015  N N  Y ?  Y ?  ? 
Y  Lmo3 Pik3c2g  A A  intron 2 intron 18  72279 71383  Y Y  N N  ? ?  Y ?  D030011O10Rik Bicd1  S A  intron 2 intron 1  12088 74341  N N  N ?  Y Y  N ?  EG232599 Cttnbp2  A S  intron 1 intron 2  84126 53478  N N  N Y  N N  Y N  Cntnap2 Cntnap2  S A  intron 1 intron 13  593602 275609  N Y  N N  Y Y  Y Y  Nod1 Dync1i1 Grid2 Rpo1-4 0610039N19Rik  A A A A S  intron 1 intron 6 intron 2 intron 20 intron 5  22514 104213 405417 6677 1013  N Y N Y N  Y Y N Y N  Y N Y N Y  Y Y N Y N  177  insertion sitea  gene  orientb  location  chr6:92170731 chr6:93756060  Zfyve20 Magi1 Suclg2 C130034I18Rik  A A A A  intron 3 intron 3 intron 1 intron 3  Parva Spon1 Tmc7 Gdpd3 Ctbp2 4933400E14Rik Dock1 Dock1 Psg16 BC024868  A S A S A A A A A A  chr7:56898887 chr7:56983305  2210010C17Rik Nalp4e Irgc1 Zfp566 Gpatc1 Pira6 BC043301 1700127D06Rik Isoc2b Fut2 Ush1c Sergef Nell1 Nell1 Gas2 Luzp2 Nipa2 p Gabrg3 Gabrg3  chr7:57007475 chr7:57026889 chr7:57088731 chr7:57280026 chr7:65527529 chr7:66294972 chr7:71980533 chr7:75733740 chr7:76290339 chr7:76310437 chr7:76423064 chr7:78957337 chr7:79210579 chr7:80808851 chr7:81468649  chr6:95653698 chr6:96899757 chr7:112221386 chr7:113733465 chr7:118356014 chr7:126557998 chr7:132882066 chr7:134162112 chr7:134799549 chr7:134852230 chr7:16258225 chr7:18698208 chr7:19100806 chr7:23016479 chr7:24150314 chr7:29792842 chr7:34989957 chr7:3879939 chr7:43438369 chr7:43813698 chr7:4452434 chr7:45528137 chr7:46092624 chr7:46380203 chr7:50100373 chr7:50668010 chr7:51851912 chr7:54807033 chr7:55814787 chr7:56278670  intron_size  B6  AJ  DBA  129X1  4373 6699 63088 150455  N N N N  Y Y N N  Y N Y Y  Y Y N N  intron 1 intron 6 intron 2 intron 1 intron 2 intron 2 intron 27 intron 29 intron 5 intron 1  116075 88865 1818 291 55883 50310 114205 77815 32340 24342  Y Y N N N N N N N N  N Y N N Y Y ? N N N  N N Y N Y N Y Y N N  N Y Y Y Y Y N ? 
Y Y  A S S S A A A S A A A A A A A A A A A A  intron 2 intron 1 intron 1 intron 2 intron 19 intron 5 intron 3 intron 2 intron 5 intron 1 intron 8 intron 10 intron 9 intron 15 intron 7 intron 1 intron 1 intron 19 intron 3 intron 3  2683 18807 9545 5881 4519 4553 4844 371 4083 14848 1934 72131 19381 125007 39988 216731 16365 56499 338483 338483  N N N N N Y N N N N N Y Y N N Y Y N N Y  N Y Y Y Y N N N Y N N Y N N Y Y ? N N ?  N ? N Y ? N ? N Y Y N Y Y N N N N N N N  Y ? N N N N Y Y N Y Y N N Y N N ? Y Y ?  Gabrg3 Gabrg3  S S  intron 3 intron 3  338483 338483  Y N  N Y  Y N  N N  Gabrg3 Gabra5  A A  intron 3 intron 9  338483 4675  Y N  Y Y  Y N  N N  Tarsl2 Aldh1a3  A A  intron 3 intron 3  4590 6071  N N  N Y  Y N  ? ?  Gm489 Klhl25  A A  intron 19 intron 1  18443 16852  N N  N Y  Y Y  ? N  EG244071 EG244071  A A  intron 3 intron 10  3941 11341  N N  Y Y  N N  N ?  EG244071 Agc1 Abhd2 Sec11l1 Whdc1  A A A A A  intron 11 intron 2 intron 5 intron 1 intron 10  144374 2201 22657 12264 1765  N N Y Y N  N N N N N  N N ? 
Y N  Y Y Y N Y  178  insertion sitea  gene  orientb  location  chr7:82613322 chr7:89519920  Eftud1 Me3 Me3 Me3  A A A S  intron 15 intron 1 intron 1 intron 6  Dlg2 Dlg2 Rab30 Rab30 4632434I11Rik Prcp Odz4 Gdpd4 Mlycd BC021611  A A A A A S A A S A  chr8:64410863 chr8:68653188  Disc1 Pard3 1700008F21Rik Tmco3 Tmco3 Csmd1 Csmd1 LOC436177 Nek3 Slc20a2 Slc20a2 Fut10 Tusc3 Mtmr7 Mtus1 Vegfc Sh3rf1 Sh3rf1 Palld March1  chr8:68967805 chr8:68984286 chr8:69009827 chr8:69217878 chr8:69258126 chr8:71437203 chr8:84929641 chr8:88047485 chr8:90680847 chr8:94277841 chr8:94454988 chr8:97505641 chr9:101947479 chr9:103330213 chr9:104482616  chr7:89525385 chr7:89693986 chr7:91940259 chr7:92100499 chr7:92637224 chr7:92661278 chr7:92744697 chr7:92809075 chr7:96192800 chr7:97882033 chr8:122288343 chr8:125686127 chr8:128131514 chr8:130449167 chr8:131991574 chr8:13317388 chr8:13317701 chr8:16790154 chr8:16896338 chr8:20395567 chr8:23600168 chr8:23962594 chr8:24016713 chr8:32657254 chr8:40595747 chr8:42108760 chr8:42575690 chr8:55641586 chr8:64179531 chr8:64259164  intron_size  B6  AJ  DBA  129X1  52773 103533 103533 26945  Y N Y Y  N Y N N  N N Y N  ? Y N N  intron 11 intron 13 intron 1 intron 1 intron 2 exon 9 intron 1 intron 15 intron 2 intron 6  47694 159847 78139 78139 4570 1029 144584 8581 839 11814  N Y N N N N Y Y N N  Y N N N N Y N Y N Y  N N N ? Y N N N Y N  ? Y Y Y Y N Y Y N Y  A S S S A S A S S A S S A A A A A A A A  intron 10 intron 22 intron 2 intron 10 intron 10 intron 3 intron 3 intron 7 intron 11 intron 1 intron 5 intron 1 intron 7 intron 1 intron 2 intron 1 intron 2 intron 8 intron 16 intron 1  17889 118444 47626 4885 4885 316521 316521 127285 9793 58092 13340 7490 4149 25680 6185 79031 102899 9261 6140 260565  N N N Y Y Y Y N N N Y N N Y N Y N Y N N  N N N Y N Y N Y Y Y N Y Y N Y N ? N Y N  Y N N Y Y Y N ? Y ? N Y Y N ? Y Y N N Y  N Y Y N ? N N ? Y N N Y ? ? 
Y N N Y N N  March1 March1  A A  intron 3 intron 3  309405 309405  N N  Y N  N Y  N N  March1 March1  A A  intron 3 intron 4  309405 109726  Y N  N Y  N ?  Y ?  March1 4732435N03Rik  A S  intron 4 intron 3  109726 95119  N N  N Y  Y Y  ? Y  Inpp4b EG624855  A S  intron 16 intron 1  1150 19301  N N  Y Y  Y Y  N Y  Zfp423 Fto  A A  intron 3 intron 1  75869 88185  N N  Y N  ? N  ? Y  Fto Rspry1 Ephb1 Bfsp2 Cpne4  A A A A S  intron 8 intron 1 intron 3 intron 1 intron 1  143384 20402 127746 26589 116322  N Y N N N  N ? Y Y Y  ? N Y N Y  Y N Y Y Y  179  insertion sitea  gene  orientb  location  chr9:104825982 chr9:104841903  Cpne4 Cpne4 Nek11 Dock3  A A A A  intron 7 intron 7 intron 2 intron 4  Dock3 Slc25a20 Glb1 Osbpl10 Tgfbr2 Itga9 Trak1 Kif15 Ccr9 Kirrel3  A A A A A A A S A A  chr10:106410062 chr10:120994919  EG434396 Grik4 Grik4 D930028F11Rik Bcdo2 4930535E21Rik Thsd4 Ppib Rora AA407270 Scg3 Bmp5 AK129341 Bckdhb Trpc6 Mod1 1700057G04Rik Armc8 8430416H19Rik Xpot  chr10:121066964 chr10:122235925 chr10:13359312 chr10:13375267 chr10:20727441 chr10:22027763 chr10:24375847 chr10:25159767 chr10:28477359 chr10:40281941 chr10:5384312 chr10:73468636 chr10:84145835 chr10:85850816 chr10:88393677  chr9:105238657 chr9:106982772 chr9:107085292 chr9:108535147 chr9:114265329 chr9:114926725 chr9:115988048 chr9:118735079 chr9:121255107 chr9:122805410 chr9:123591415 chr9:34626673 chr9:36288332 chr9:42465562 chr9:42495504 chr9:48054003 chr9:50294820 chr9:57827226 chr9:60115842 chr9:65862744 chr9:68960009 chr9:71317979 chr9:75433883 chr9:75568793 chr9:8121393 chr9:83774731 chr9:8590112 chr9:86482671 chr9:92152022 chr9:99360375  intron_size  B6  AJ  DBA  129X1  63731 63731 44709 26329  N N N N  Y ? Y Y  Y Y Y ?  N ? Y ?  intron 1 intron 4 intron 1 intron 1 intron 2 intron 26 intron 2 intron 1 intron 1 intron 1  52306 2411 15690 42073 26556 10539 39426 7963 95843 422730  N N N Y N N Y N Y N  Y Y N N Y Y N Y Y N  N Y Y Y N N N ? Y Y  ? ? Y N N ? Y ? N ?  
A S A A A A A S A A A A A A S S A S S A  intron 1 intron 3 intron 3 intron 1 intron 4 intron 18 intron 6 intron 3 intron 1 intron 11 intron 11 intron 1 intron 2 intron 3 intron 3 intron 1 intron 3 intron 2 intron 7 intron 24  6757 120394 120394 192726 3284 963 123165 2403 542101 9611 7685 51434 9407 34993 9303 15823 3080 13782 18837 3445  N N N N N N Y Y N N N N N Y Y N N N N N  N N N Y Y Y N Y N Y Y Y N Y Y ? Y Y Y Y  Y Y Y Y Y N Y N N Y Y Y N N Y Y ? N Y Y  ? Y Y Y ? N N Y Y ? N N Y Y N ? N Y N N  D930020B18Rik Ppm1h  A A  intron 4 intron 4  6259 45501  N Y  Y N  Y Y  N N  Aig1 Aig1  A A  intron 4 intron 4  36773 36773  Y N  Y Y  N N  Y Y  Ahi1 Raet1c  A S  intron 18 intron 5  23151 191660  N N  Y Y  N N  N Y  Enpp1 Epb4.1l2  A A  intron 1 intron 5  32544 2146  Y N  N Y  Y N  ? N  E430004N04Rik Slc22a16  A A  intron 4 intron 6  80556 3361  N N  ? N  Y N  ? Y  Esr1 Pcdh15 Polr3b Syn3 Tmem16d  S S S A A  intron 6 intron 4 intron 25 intron 3 intron 23  29807 118784 3997 41160 10565  N Y N N Y  N Y Y ? N  Y Y N Y ?  N N N ? 
N  180  insertion sitea  gene  orientb  location  chr10:90516147 chr10:9355865  1200009F10Rik E130306M17Rik Nmt1 Kcnh6  A A A A  intron 1 intron 1 intron 7 intron 12  Pitpnc1 Prkca Ccdc46 Ccdc46 Ccdc46 Ccdc46 Ccdc46 Sdk2 Gprc5c Actr2  A A S A A S A A A A  chr11:74700606 chr11:75966048  Sfi1 Sfi1 Sfi1 Sfi1 Sfi1 Sgcd Col23a1 Col23a1 Fstl4 Ankrd48 Ankrd48 Nudcd3 Adcy1 Nalp1 Ube2g1 1200014J11Rik Garnl4 Garnl4 Rutbc1 Vps53  chr11:81361676 chr11:84929126 chr11:88907472 chr11:98623955 chr12:102733042 chr12:11130003 chr12:11374493 chr12:117020350 chr12:21436950 chr12:32613270 chr12:38731481 chr12:40912515 chr12:42508770 chr12:42634020 chr12:51477047  chr11:102874052 chr11:105844336 chr11:107113157 chr11:107773664 chr11:108291118 chr11:108294001 chr11:108304916 chr11:108465990 chr11:108642440 chr11:113728274 chr11:114674870 chr11:19976586 chr11:3032912 chr11:3052573 chr11:3062637 chr11:3072587 chr11:3093109 chr11:47005426 chr11:51213654 chr11:51281139 chr11:52728669 chr11:5517981 chr11:5588311 chr11:6041515 chr11:7000997 chr11:70950223 chr11:72498818 chr11:72879659 chr11:74356372 chr11:74394456  intron_size  B6  AJ  DBA  129X1  9841 45454 668 6044  N N N Y  Y Y Y N  ? N N N  Y N N ?  intron 4 intron 14 intron 7 intron 7 intron 9 intron 21 intron 25 intron 2 intron 1 intron 7  24521 7822 14497 14497 11252 58099 47042 39977 11192 4598  N N Y N N Y Y Y N N  Y Y N Y Y N N N Y Y  Y Y N Y Y ? N Y N N  N Y N Y N N N N N ?  A S S S A A A S A A A S A S A S A S A A  exon 31 intron 14 intron 9 intron 6 intron 1 intron 3 intron 2 intron 2 intron 3 intron 13 intron 26 intron 3 intron 3 intron 1 intron 5 intron 3 intron 2 intron 1 intron 2 intron 4  80 7138 4409 6721 4302 62344 200242 200242 185537 2895 1656 24932 8311 17943 4030 11458 82243 22707 22817 25402  N N N N N N N Y N N N Y N N N Y Y N Y N  N Y N Y Y ? Y Y ? Y Y Y N ? N Y N Y N Y  N N Y N Y Y Y N Y N N Y N ? Y N N Y N Y  Y Y N N N ? ? Y ? Y ? N Y Y N Y N Y N ?  Accn1 Usp32  A A  intron 1 intron 1  996014 35683  Y N  N ?  
Y Y  Y N  Gm525 Casc3  A S  intron 3 intron 1  4193 4688  Y N  N N  N Y  ? Y  Rin3 Kcns3  A S  intron 3 intron 2  24968 27002  N Y  Y N  ? N  N N  Vsnl1 Ptprn2  A A  intron 2 intron 1  54597 94424  N N  Y Y  N Y  N N  Ddef2 Prkar2b  S A  intron 3 intron 2  14299 44837  N Y  Y N  N Y  N N  Dgkb Zfp277 Immp2l Immp2l Prkcm  A A A A S  intron 18 intron 2 intron 5 intron 6 intron 1  11853 31700 74824 251222 158698  N Y N N N  Y Y N Y N  ? N Y Y N  N Y N N Y  181  insertion sitea  gene  orientb  location  chr12:53469443 chr12:53651725  Arhgap5 Akap6 Slc25a21 Mia2  A A S A  intron 2 intron 1 intron 6 intron 1  Map4k5 Rtn1 Birc1g Pde4d Mier3 Arl15 Itga1 AF397014 AF397014 Cdc2l5  A S A S A A A A A A  chr13:9431905 chr13:94978085  Amph Stard3nl Agtr1a Fars2 Gfod1 Cap2 Tpmt Smad5 Ntrk2 Ntrk2 2010111I01Rik 2010111I01Rik Cntnap3 Spata9 AK129128 Mctp1 Mctp1 Xrcc4 Dip2c Arsb  chr13:94988399 chr14:10866447 chr14:115044689 chr14:116738973 chr14:118510725 chr14:118613808 chr14:12098529 chr14:121126269 chr14:12840210 chr14:13548865 chr14:13719761 chr14:17452304 chr14:17491551 chr14:17599622 chr14:19962286  chr12:57665867 chr12:60018304 chr12:70792838 chr12:73247588 chr13:101397364 chr13:110972570 chr13:112815113 chr13:114964103 chr13:116144653 chr13:17223595 chr13:17364703 chr13:17572651 chr13:19122471 chr13:19400469 chr13:30377593 chr13:36372845 chr13:43288515 chr13:46636254 chr13:47037177 chr13:56722503 chr13:59010939 chr13:59131106 chr13:63166458 chr13:63227285 chr13:64892851 chr13:76462125 chr13:76652452 chr13:76986607 chr13:77101796 chr13:90447941  intron_size  B6  AJ  DBA  129X1  14851 96393 27382 5555  N Y Y N  N N Y Y  N Y Y Y  Y ? N ?  intron 2 intron 1 intron 10 intron 7 intron 3 intron 1 intron 6 intron 9 intron 6 intron 1  17412 99380 1685 130895 12246 127549 13774 129179 67198 30479  N N N Y Y N N N N N  Y ? ? Y N N Y N Y Y  Y N ? N N Y N N Y Y  ? Y Y Y Y Y N Y ? ?  
S A S A A A A A A A A A A A A S S A A S  intron 12 intron 1 intron 2 intron 4 intron 1 intron 7 intron 8 intron 1 intron 13 intron 16 intron 5 intron 10 intron 2 intron 4 intron 43 intron 1 intron 1 intron 6 intron 1 intron 5  3016 19041 35880 67981 101999 20264 1773 20296 51975 65969 13957 29920 28900 5374 2348 254004 254004 49962 216037 18156  Y N Y N Y N Y Y N Y Y Y Y N N N N N Y N  N Y Y Y N N N N N N N N ? Y N Y Y ? Y Y  Y Y Y N N N ? N Y Y ? N ? N Y N N ? Y N  Y N N N ? Y N ? ? Y N N N Y ? Y N Y N ?  Arsb Ptprg  S A  intron 6 intron 5  58667 53977  N N  N N  Y Y  Y Y  Gpc5 Gpc6  S A  intron 7 intron 4  735946 245801  N Y  Y Y  Y Y  Y N  Hs6st3 Hs6st3  A A  intron 1 intron 1  729767 729767  Y Y  Y Y  ? N  N N  Synpr Phgdhl1  S A  intron 2 intron 6  208300 10908  N N  N Y  Y N  Y N  Atxn7 Slc4a7  A A  intron 3 intron 9  39262 6761  Y N  Y ?  N Y  N ?  Nek10 Ube2e2 Ube2e2 Ube2e2 Adk  S A A A A  intron 3 intron 3 intron 3 intron 3 intron 4  45445 241061 241061 241061 64953  N N N N N  ? N N N N  Y Y Y Y Y  ? Y Y ? 
N  182  insertion sitea  gene  orientb  location  chr14:25733698 chr14:26820315  Asb14 D14Ertd171e 3110001K24Rik Mmrn2  A A S A  intron 6 intron 7 intron 2 intron 1  Grid1 Samd4 LOC434459 Rpgrip1 A630098A13Rik D14Ertd500e LOC544988 Rp1hl1 Ptk2b Rcbtb2  A A S A S A S A A A  chr15:53531677 chr15:64192357  4930452B06Rik D230005D02Rik D230005D02Rik Diap3 Scn8a Adamts12 Pdzd2 Dnahc5 Ctnnd2 5730557B15Rik Sema5a Pgcp Laptm4b Stk3 Zfpm2 Ttc35 Nudcd1 Slc30a8 Samd12 Ddef1  chr15:64696361 chr15:66275903 chr15:74573404 chr15:79411098 chr15:82939357 chr15:86140030 chr15:95163649 chr16:11593592 chr16:31360592 chr16:31638524 chr16:32021504 chr16:32067824 chr16:34964523 chr16:35741481 chr16:36243599  chr14:33106287 chr14:33212947 chr14:34343293 chr14:45904677 chr14:50480097 chr14:51059678 chr14:52937443 chr14:53501754 chr14:6142859 chr14:62978260 chr14:65228565 chr14:71893381 chr14:7419275 chr14:76357616 chr14:76417949 chr14:85842569 chr15:100768243 chr15:11025828 chr15:12471499 chr15:28264316 chr15:30531730 chr15:31299195 chr15:32617793 chr15:33339872 chr15:34204246 chr15:34947807 chr15:40647677 chr15:43333952 chr15:44220503 chr15:52161185  intron_size  B6  AJ  DBA  129X1  7807 45071 39502 20385  N N N Y  Y Y N Y  ? N Y N  ? N N N  intron 12 intron 2 intron 1 intron 11 intron 2 intron 5 intron 1 intron 2 intron 1 intron 4  68288 131259 101657 1919 161469 3042 174953 3521 67329 7045  Y N N N N N Y N N N  Y Y ? N Y Y N N ? N  Y ? Y Y Y Y Y Y Y Y  N N ? N Y Y Y Y ? ?  A S S A A A A A A S A S A A S A A A A A  intron 4 intron 13 intron 16 intron 5 intron 2 intron 2 intron 1 intron 35 intron 3 intron 1 intron 17 intron 5 intron 2 intron 7 intron 3 intron 5 intron 7 intron 5 intron 3 intron 1  65933 22058 30400 30054 18808 80053 133780 12606 138326 41849 3685 85177 13550 8203 96380 10525 7915 5855 61123 32802  N N N N N N Y N N Y N N N N N N N N N N  N Y Y Y N N Y Y Y Y N ? N N N N ? ? Y Y  Y N N N N N Y ? ? Y N Y N ? ? Y Y Y Y N  Y ? N N Y Y N ? ? N Y ? Y Y Y N Y ? ? 
N  Adcy8 Lrrc6  A A  intron 2 intron 7  49330 5369  N N  Y Y  ? N  N N  2300005B03Rik Dmc1  S A  intron 1 intron 12  3133 15562  N N  N Y  N Y  Y Y  Serhl Tbc1d22a  A A  intron 8 intron 8  6518 39811  N Y  N N  N N  Y Y  Nell2 4933437K13Rik  A S  intron 14 intron 11  47780 54285  N Y  Y Y  Y Y  ? N  Bdh1 Dlgh1  S A  intron 2 intron 4  8473 58810  Y N  N ?  Y Y  Y Y  1500031L02Rik Lrrc33 Ptplb Hspbap1 Stfa1  S A A A A  intron 1 intron 2 intron 2 intron 6 intron 1  2854 14563 22776 5925 11207  N Y N N Y  N N N Y N  Y ? Y N Y  Y ? ? Y ?  183  insertion sitea  gene  orientb  location  chr16:36684294 chr16:56154949  Slc15a2 Impg2 A2bp1 Epha6  S A A A  intron 7 intron 7 intron 1 intron 7  Epha6 A2bp1 A2bp1 Gbe1 Grik1 Cryzl1 Park2 Park2 Map3k4 Abca17  A A A S A A A A A A  chr17:7897922 chr17:80192514  Zfand3 Vps52 AA388235 EG547347 Opn5 Gpr110 Rftn1 Rftn1 Plcl2 Pcaf 2810458H16Rik Tulp4 D930040M24Rik Myom1 Smchd1 Alk Alk Ttc27 Gm1604 Dhx57  chr17:83275521 chr17:83674547 chr17:90760663 chr17:90922764 chr18:22924347 chr18:30473818 chr18:53659933 chr18:62094087 chr18:62132682 chr18:68217607 chr18:68537613 chr18:70608529 chr18:72353140 chr18:74063506 chr18:74843234  chr16:5991812 chr16:59935140 chr16:60313157 chr16:6308438 chr16:6598891 chr16:70425126 chr16:87824174 chr16:91599319 chr17:11283761 chr17:11516654 chr17:12127053 chr17:24002501 chr17:29772262 chr17:33574868 chr17:33594630 chr17:35657073 chr17:42054881 chr17:42755596 chr17:49543989 chr17:49571223 chr17:50126162 chr17:53047148 chr17:55269308 chr17:6188639 chr17:68201613 chr17:71003229 chr17:71300790 chr17:72259504 chr17:72479884 chr17:74664131  intron_size  B6  AJ  DBA  129X1  9675 5595 287739 45093  Y N N N  N Y Y ?  N Y ? ?  Y Y N Y  intron 3 intron 2 intron 3 intron 14 intron 13 intron 4 intron 5 intron 7 intron 1 intron 25  203583 314150 321339 32012 12594 3320 58044 203154 40385 11309  N Y Y Y N N N N Y N  Y N N N N N Y Y N Y  Y N ? ? ? N Y Y N Y  N ? N ? Y Y N ? 
Y Y  A A S A A A S A A A A A S A A A S A A S  intron 1 intron 17 exon 1 intron 1 intron 4 intron 2 intron 4 intron 3 intron 5 intron 1 intron 8 exon 13 intron 1 intron 22 intron 10 intron 3 intron 1 intron 9 intron 2 intron 1  55500 2036 3868 13134 11897 6068 30939 20392 27806 43284 13647 2514 40481 7738 3606 154558 222865 19113 50426 8626  N Y N Y N N Y N Y N Y N Y Y N Y Y Y Y N  Y Y N ? ? Y N Y N Y ? Y N ? Y ? ? N Y Y  N ? Y N Y ? N Y N Y ? ? Y ? ? ? ? N N N  N N N N ? N N Y Y Y N ? N N Y N N N N N  Eml4 Mta3  A S  intron 1 intron 7  58785 3492  N N  Y N  N Y  ? ?  Nrxn1 Nrxn1  A A  intron 4 intron 2  283719 92475  N N  Y N  N Y  Y N  Nol4 Pik3c3  A A  intron 6 intron 21  52425 8541  Y N  N N  N N  N Y  Prdm6 Sh3tc2  S A  intron 2 intron 3  62882 2824  N N  Y N  N N  ? Y  Sh3tc2 D18Ertd653e  A A  intron 12 intron 2  16328 41863  N N  N Y  ? N  Y N  Mc2r Stard6 Dcc Mapk4 Myo5b  A A A A S  intron 1 intron 5 intron 1 intron 2 intron 26  19292 6979 395280 32615 3728  Y N N N N  N N Y Y Y  ? Y N ? N  N ? N N ?  
184  insertion sitea  gene  orientb  location  chr18:82664133 chr18:89875351  Mbp Dok6 Vps37c Cep78  A A A S  intron 3 intron 1 intron 1 intron 5  Trpm3 Mamdc2 1700028P14Rik Exoc6 Cyp2c55 Cyp2c50 Crtac1 LOC545291 LOC545291 Abcc2  A S A A S A A A A A  chrX:73839228 chrX:7674806  Scd3 Sorcs3 Trub1 Atrnl1 D630002G06Rik EG331493 Il1rapl2 Gucy2f ORF34 Phf8 Klf8 EG546400 Phex Egfl6 Olfr157 Olfr157 Nsdhl Olfr157 Tbl1x Ssxb9  chrX:77695025 chrX:9241304 chrX:92550530  chr19:10761586 chr19:16043758 chr19:22461274 chr19:23378836 chr19:23696514 chr19:37682580 chr19:39089896 chr19:40152902 chr19:42399286 chr19:42992748 chr19:43100351 chr19:43859474 chr19:44288550 chr19:48570031 chr19:57507282 chr19:57880359 chr19:8278386 chrX:102893625 chrX:133309958 chrX:137384135 chrX:146727042 chrX:146918864 chrX:148612513 chrX:150478771 chrX:152550699 chrX:161879907 chrX:67247535 chrX:68580222 chrX:69194673 chrX:69587660  intron_size  B6  AJ  DBA  129X1  32181 170283 17229 2417  N N N N  Y N Y Y  N ? N N  ? Y ? ?  intron 1 intron 11 intron 1 intron 18 intron 4 intron 7 intron 2 intron 6 intron 3 intron 9  558119 19863 35949 30515 12457 13132 79867 35297 280983 1895  N N N Y N N Y Y Y N  Y N Y ? Y N N N ? Y  Y Y ? N Y Y N N N Y  Y ? ? Y ? N Y Y ? N  A A S A A A A A S A A A A A S A S S A S  intron 3 intron 3 intron 1 intron 26 intron 2 intron 3 intron 2 intron 7 intron 3 intron 13 intron 1 intron 1 intron 15 intron 9 intron 22 intron 18 intron 3 intron 6 intron 1 intron 7  1416 47964 4892 138092 10344 34408 522726 13304 17680 23783 70758 12614 26562 4125 709627 276602 19381 586741 122275 379739  N Y N Y N Y Y N N Y N N N N N N Y Y N Y  Y N Y N ? Y Y ? ? Y Y Y Y ? Y ? N Y N Y  Y ? N N Y N N ? N Y N N N Y ? Y N N Y Y  N ? Y N ? Y Y Y Y N N Y Y ? ? ? Y Y ? N  4930595M18Rik Srpx  S S  intron 1 intron 1  31292 43614  N N  Y N  Y N  N Y  B230358A15Rik  A  intron 2  18589  Y  N  ?  ?  
a  Genomic location with respect to B6 genome (mm8 version)
b  S=Sense; A=Antisense with respect to gene transcription
c  Computation prediction of element presence within the strain. N=not present; Y=present; ?=unknown

Table A.3 Polymorphic ETn/MusD cases in genes

insertion site(a)  gene            orient(b)  location   intron_size  B6(c)  AJ(c)  DBA(c)  129X1(c)
chr1:107544558     2310035C23Rik   A          intron 13  1924         N      Y      N       N
chr1:11761263      A830018L16Rik   A          intron 8   50388        Y      ?      ?       N
chr1:158497287     Tor3a           A          intron 4   10001        Y      N      N       ?
chr1:173693834     Cd84            A          intron 2   20560        Y      N      N       ?
chr1:176956278     Rgs7            A          intron 9   27848        N      N      Y       ?
chr1:179852136     EG545391        S          intron 3   712          N      N      Y       N
chr1:90927936      Sh3bp4          A          intron 1   36659        Y      N      N       N
chr2:120487764     Ttbk2           A          intron 4   16402        Y      N      N       N
chr2:161740844     Ptprt           A          intron 7   207432       N      ?      ?       Y
chr2:168357724     Atp9a           A          intron 14  5914         N      Y      N       N
chr2:80121108      Dnajc10         S          intron 3   1920         N      Y      N       N
chr3:101214201     Ptgfrn          A          intron 1   29594        Y      Y      N       Y
chr3:53201348      Lhfp            A          intron 2   188037       Y      Y      N       Y
chr4:11637421      Gem             A          intron 2   4896         N      Y      N       ?
chr4:141314667     Tmem51          A          intron 1   46127        Y      Y      N       N
chr4:153708611     B230396O12Rik   A          intron 6   2946         N      Y      N       ?
chr4:153833591     Plcl4           A          exon 18    86           N      Y      N       N
chr4:43167395      Unc13b          A          intron 7   46281        Y      N      N       Y
chr4:69880108      Cdk5rap2        A          intron 3   20942        N      N      ?       Y
chr4:83332767      4930473A06Rik   A          intron 25  46078        N      N      N       Y
chr5:142663608     Foxk1           A          intron 1   33098        N      Y      N       N
chr5:148304836     C130038G02Rik   S          intron 1   118143       N      N      N       Y
chr5:21020983      Fbxl13          A          intron 16  21545        N      N      N       Y
chr5:90278338      Slc4a4          A          intron 18  14656        Y      Y      N       N
chr5:93471996      Art3            A          intron 2   8817         N      Y      N       ?
chr6:37085228      Dgki            S          intron 1   149358       N      N      N       Y
chr6:49330234      Stk31           A          intron 3   6516         N      Y      ?       N
chr6:57722531      AW146242        A          intron 1   34890        Y      Y      Y       N
chr6:84024675      Dysf            S          intron 4   4795         N      Y      N       ?
chr7:24206094      Cadm4           A          intron 1   16992        Y      N      Y       Y
chr7:26386874      EG243881        A          intron 4   4219         N      N      Y       Y
chr7:29768570      Zfp82           A          intron 5   4708         N      Y      ?       ?
chr7:81189399      Pde8a           A          intron 12  1773         N      Y      N       N
chr8:126077748     Dpep1           A          intron 1   7590         Y      N      Y       N
chr8:127312767     Pgbd5           A          intron 1   49175        N      Y      N       ?
chr9:120202643     Myrip           A          intron 2   52370        N      N      Y       ?
chr9:28170744      Opcml           S          intron 2   270720       N      Y      N       N
chr9:30821365      BC038156        A          exon 1     3731         N      Y      N       Y
chr9:50582311      Alg9            A          intron 14  20017        N      Y      Y       N
chr9:55688688      Zfp291          A          intron 8   8166         N      Y      Y       Y
chr9:83896502      Bckdhb          A          intron 9   53041        Y      Y      N       Y
chr10:23546116     Vnn3            A          intron 3   7946         Y      N      Y       Y
chr10:59779699     Cdh23           A          intron 32  2103         N      Y      N       N
chr11:103427206    Gm884           A          intron 1   2885         N      N      N       Y
chr11:107934667    Prkca           S          intron 3   134255       N      Y      N       ?
chr11:35233113     Slit3           A          intron 4   273313       N      ?      ?       Y
chr11:36015957     Odz2            A          intron 8   35382        N      Y      N       N
chr11:75331799     Scarf1          A          exon 4     524          N      Y      N       N
chr11:82825929     Slfn8           A          intron 4   8583         Y      N      Y       N
chr12:112088887    Mark3           A          intron 14  7429         N      Y      N       ?
chr12:53930690     Akap6           A          intron 7   71966        N      Y      N       N
chr13:4788787      LOC432723       A          intron 1   61826        N      Y      N       Y
chr14:28067385     Cacna2d3        A          intron 11  109422       N      Y      ?       Y
chr14:6880103      Rpp14           A          intron 4   1177         Y      Y      N       N
chr14:95020963     Klhl1           A          intron 8   15107        Y      N      ?       Y
chr15:44040286     Trhr            A          intron 2   31282        Y      Y      Y       N
chr15:96269646     Sfrs2ip         A          intron 3   8946         Y      Y      Y       N
chr16:6641081      A2bp1           S          intron 3   321339       N      Y      Y       Y
chr16:90324059     Hunk            A          intron 2   14502        N      N      Y       Y
chr16:96314083     Sh3bgr          A          intron 2   7344         Y      N      Y       Y
chr17:28455067     Mapk14          A          intron 6   2511         Y      N      Y       Y
chr17:32100856     Wiz             A          intron 2   19667        Y      N      N       N
chr17:52975332     4921523A10Rik   A          intron 2   6341         N      ?      Y       ?
chr17:6430672      Sytl3           A          intron 1   247001       N      ?      Y       ?
chr17:6573500      Sytl3           S          intron 4   12675        N      Y      N       ?
chr17:70602997     Dlgap1          A          intron 4   55340        N      Y      Y       Y
chr17:75006067     Ltbp1           A          intron 3   85061        N      ?      ?       Y
chr18:75389792     Dym             A          intron 16  43220        N      Y      N       N
chr18:76066620     Zbtb7c          A          intron 2   110866       N      N      N       Y
chrX:136618166     Col4a6          S          intron 2   146075       N      Y      N       Y
chrX:66814504      Olfr157         S          intron 23  685526       N      Y      ?       ?
chrX:67552081      Mtm1            S          intron 8   4446         N      Y      N       ?

a  Genomic location with respect to B6 genome (mm8 version)
b  S=Sense; A=Antisense with respect to gene transcription
c  Computation prediction of element presence within the strain. N=not present; Y=present; ?=unknown

Table A.4 Primer sequences

case#(a)  primer name         sequence 5' to 3'
16        Cdh23-s             CTGTGGCAGTGTGAACTTAG
16        Cdh23-as            ACGAGGCAGTCATCATGGAG
17        Odz2-s              CACAGACTCAAAGCCACTGC
17        Odz2-as             GGGTTATGGTTGATTTCTG
20        Mark3-s             ATGCAGTCAGGCTGACAGTC
20        Mark3-as            TTTCTTAGGCAGTCAGGTCAC
22        Cacna2d3-s          AGAGAGTGATAGTAGTCATGC
22        Cacna2d3-as         TTCTCATAAGCTCTAGGAAGC
3         Atp9a-s             CAGAGATTACTCCAGCCTGC
3         Atp9a-as            GAGGAAGACTATCAACAAAGC
2         Dnajc10-s           CATCAGGTCACAGGTCACAG
2         Dnajc10-as          GTCTTGCTGGAGAAAGTCAGT
4         Gem-s               CACAATGGAGTTCCCATAAAG
4         Gem-as              CTAAGCAAATCTCCCACAGC
7         Foxk1-s             CAGACACTGACTACAGTTGC
7         Foxk1-as            ACACCACCATTGCCTGTTAC
12        Pgbd5-s             TCTGATGGCTCTGGTTTCC
12        Pgbd5-as            CGAGAGACAGCACTAGAGC
13        Opcml-s             TCTACCACCTGTCTTGTTTAC
13        Opcml-as            CCTAACACATCCTTCCTTGC
27        Mtm1-s              GCTACCAAATTCCAGTAACAG
27        Mtm1-as             CACGACAGTACCTATGGTAG
26        Dym-s               ACAGTCCTCTGAGCCCTTGA
26        Dym-as              GAAAAAGAGCCAGGGGTAGG
18        Prkca-s             TTGGGATGGAGTGTTGGTTT
18        Prkca-as            CAAGCATCTTTGCTTCCACA
11        Pde8a-s             CATTCAGTGCTCTGGCATCT
11        Pde8a-as            TGCGAACGTACTCACTATTCG
28        Col4a6-s            TACATTCACTCCTGCCCTTG
28        Col4a6-as           TCATTCGGGGTCTACTGACA
1         2310035C23Rik-s     AGAGTCCTGTGGAGCATTGG
1         2310035C23Rik-as    AACCAAATTCCACAACAAGTCTC
25        Dlgap1-s            GAGAGAGAGACGGTCACATGG
25        Dlgap1-as           CTACATACCCAGCCCCTGAA
9         Dysf-s              CACAAACCATTCCCAGTCCT
9         Dysf-as             ATGTCGGGTTCAGCAGTAGC
23        A2bp1-s             TTTTCCCCAAGTGAGTGC
23        A2bp1-as            CAGGTTTCTGGCTAGGAAAGG
24        Sytl3-s             TTGGTTGGCATTTTCCTTTC
24        Sytl3-as            AATCATGGGGCATTCACAGT
19        Akap6-s             CTGTGTGTGAGAAGCCCAGA
-         241s                CGTCTAGATTCCTCTCTTACAGC
-         210as               TGATCAA(A/G)GCTCAAATTTTATTG
14        Alg9-as             CAGTCTCTGTCGCTCTGCTG
6         Art3-s              GTCTCAGAAACAGGGAAGG
6         Art3-as             TGAAGGGCTGAATGCTCTTT
5         B230396O12Rik-s     TTCAGCCCTCCTAGCAAAAA
5         B230396O12Rik-as    GGACTGGTTTCCTCCTGTGA
21        LOC432723-s         TTTCCATCTCAGAGGCTTCC
21        LOC432723-as        TCCCTCTTTCCTTAGGTCTGC
8         Stk31-s             GGGGTAGGGAGAGGAAATTG
8         Stk31-as            AAACCACATCCCCAGTCCTT
15        Zfp291-s            AAGAAATGGCAAGTGGATGG
15        Zfp291-as           TGCTGCCAATACATGAAAGC
10        Zfp82-s             TCAACGACCCATGTCAAAGA
10        Zfp82-as            ACCCTCACCCACCACTGTAA
-         Col4a6-up-ex-s      GAACAAACTGCCAAGCATC
-         Col4a6-down-ex-as   TCAGGAAAGCATGTACAGAC
-         Dnajc10-up-ex-s     AGCACTGAAGTTACATCCTG
-         Dnajc10-down-ex-as  GTACTCAAAGATGAAGATCTAC
-         Gapdh_ex6F          GACTTCAACAGCAACTCCCAC
-         Gapdh_ex7R          TCCACCACCCTGTTGCTGT
-         Mtm1-up-ex-s        AACAAGTGCTATGAGCTCTG
-         Mtm1-down-ex-as     CTGGGTGAATCCACGACAGT
-         Opcml-up-ex-s       CGCAGCGGAGATGCCAC
-         Opcml-down-ex-as    AGGATTGTGCTGCGGTTTAG
-         Prkca-up-ex-s       GGTTCATAAGAGGTGCCATG
-         Prkca-down-ex-as    TAGGGCTTCCGTATGTGTG
-         MusD2-7130as        GAATGAGCAGGAAGCTCCACC
-         IM_LTR_2as          TGTTGC(G/A)GCCGCCAGCAGC
-         IM_3as              ATC(T/C)CTCTGCCATTCTTCAGG
-         Opcml-ex2-s         CCTTGTACCCACAGGAGTG
-         Opcml-ex2-as        AGGAAGGCTCCCTACCTGA
-         Opcml-ex3-as        CTTGCACTATGAGATGGAC
29        Sh3bp4-s            AGAGGGAAGGAAGGCTCTTG
29        Sh3bp4-as           TGGGTTATCCGTGAGTGTGA
30        Tor3a-s             ATTCCGCTGTCTCTGCTTGT
30        Tor3a-as            CTGTGCTCTGTATGCCCAGA
31        Cd84-s              TCCCTGCATCCTATCAGTACG
31        Cd84-as             AGAACCCAGGTTGTGGTCTG
32        Ttbk2-s             AACCAAGCAAACAAAGCAAAA
32        Ttbk2-as            TTTTTCTGGGCTGGTGAGAT
33        Unc13b-s            TCAAGAACAGCCATGCTCAG
33        Unc13b-as           CGGTGGCTCACACCTAGAAT
34        Cadm4-s             TGGGGGAAGTAGCATAGGTG
34        Cadm4-as            GGAGCCAGACAAGACTCAGG
35        Dpep1-s             GGGCCAATCAAAAGTTCAGA
35        Dpep1-as            TGTGGAAAGCACAAAAGCAG
36        Vnn3-s              ATGGCAAATATTGGGGACAA
36        Vnn3-as             GTGCAGCTGTAGCCAAGTGA
37        Slfn8-s             TGCTAGCCAGTGGACAAGTG
37        Slfn8-as            CTGGTCCTCCTGGTGTTGTT
38        Klhl1-s             TGGAATTGTTTTGTTGACAGGA
38        Klhl1-as            GAGCAGCCTCAATTTTCTGC
39        Mapk14-s            CCATGGGTTTCTGTCTCACA
39        Mapk14-as           AGGAGGGGACACTCAAACCT
40        Wiz-s               TATCTGCCTGTGGCTGACTG
40        Wiz-as              GCCCTTGAGTTGAAGCAGAC

a  case # from Table 2.1

A.3  Accession numbers

The National Center for Biotechnology Information (NCBI) Nucleotide database (http://www.ncbi.nlm.nih.gov/sites/entrez?db=Nucleotide) accession number for the ETnII element used for probe design and to align in Figures 5 and S4 is Y17106.
The ETnI element used for probe design is located in a BAC clone with accession number AC068908. The accession number for the IAP element used in probe design is EU183301.

Appendix B Supporting information of Chapter 3

B.1  Supplementary figures

Figure B.1 Intronic distributions of the four major TE types in human (non-normalized). The distributions of all (A) and full-length (B) intronic TEs in mouse are shown. The x-axis shows a series of intronic regions based on the distance from a TE to the nearest exon. The y-axis represents the standardized frequency of TEs. The red dotted line indicates the expected distribution of TEs based on random computational simulations. Error bars are standard errors derived from the total number of corresponding TEs (sample size) in each bin.

Figure B.2 Intronic distributions of the four major TE types in mouse (normalized). The distributions of all (A) and full-length (B) intronic TEs in mouse are shown. The x-axis shows a series of intronic regions based on the distance from a TE to the nearest exon. The y-axis represents the standardized frequency of TEs and is normalized by G/C content for each TE type. The red dotted line indicates the expected distribution of TEs based on random computational simulations. Error bars are standard errors derived from the total number of corresponding TEs (sample size) in each bin.

Figure B.3 Average size of mouse TEs within and outside the U-zone. Each TE type is divided into two groups as shown on the x-axis: one group for elements located within the corresponding U-zone and another group for those beyond. The average size of each TE group is indicated as the horizontal bar within each box, which represents the central 50% of data points of the group. Outliers beyond the 1.5x IQR (interquartile range) whiskers are not shown. P-values shown on top of each boxplot are based on the two sample Wilcoxon test.
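The two statistics named in the caption of Figure B.3 (the 1.5x IQR whisker limits and the two-sample Wilcoxon test) can be illustrated with a minimal sketch. This is not the software used in the thesis, which is not stated here. The function names `whisker_fences` and `rank_sum_test` and the TE sizes in the usage lines are hypothetical, the quartile convention is an assumption, and the p-value uses a plain normal approximation without continuity or tie-variance correction, so it will differ slightly from an exact test.

```python
import math
from statistics import quantiles


def whisker_fences(values):
    """Return the (lower, upper) limits located 1.5 x IQR beyond the quartiles."""
    # Assumption: "inclusive" quartiles; other conventions give slightly different fences.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def rank_sum_test(x, y):
    """Two-sample Wilcoxon rank-sum test; returns (U statistic of x, two-sided p)."""
    # Pool both samples, remembering which group (0 = x, 1 = y) each value came from.
    pooled = sorted((v, g) for g, sample in enumerate((x, y)) for v in sample)
    rank_sum_x = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        # Tied values occupy ranks i+1 .. j and each receives their average.
        avg_rank = (i + 1 + j) / 2
        rank_sum_x += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    n1, n2 = len(x), len(y)
    u = rank_sum_x - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean_u) / sd_u
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail probability
    return u, p


if __name__ == "__main__":
    # Hypothetical TE sizes (bp) inside and outside a U-zone.
    inside = [120, 150, 180, 210, 260, 300]
    outside = [280, 320, 410, 600, 950, 1200]
    print(whisker_fences(outside))
    print(rank_sum_test(inside, outside))
```

Values outside the returned fences correspond to the outliers that the caption says are hidden from the boxplots.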
Figure B.4 Distributional biases of mouse full-length intronic TEs. A) Orientation bias of full-length intronic TEs. The y-axis shows the logarithmic fold-difference of TE frequency between sense and antisense oriented full-length TEs. B) Splice site bias of full-length intronic TEs. The y-axis shows the logarithmic fold-difference of TE frequency between full-length TEs close to the SA site and TEs close to the SD site. The x-axis shows a series of intronic regions based on the distance from a TE to the nearest exon. Error bars are standard errors derived from the total number of corresponding TEs (sample size) in each bin.

Figure B.5 Orientation bias of mouse full-length intronic TEs based on proximity to different types of splice sites. Orientation bias of full-length TEs near A) SD sites and B) SA sites. The x-axis shows a series of intronic regions based on the distance from a TE to the nearest exon. The y-axis shows the logarithmic fold-difference of TE frequency between sense and antisense oriented TEs. Error bars are standard errors derived from the total number of corresponding TEs (sample size) in each bin.
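The orientation bias plotted in Figures B.4 and B.5 can be sketched as a per-bin log ratio of sense to antisense TE counts. The sketch below rests on several assumptions: the captions do not state the logarithm base (log2 is used here), the bin boundaries in `BIN_EDGES` are hypothetical, raw counts stand in for the standardized frequencies, and the function name `orientation_bias` is illustrative only.

```python
import math
from bisect import bisect_right
from collections import Counter

# Hypothetical bin boundaries (bp): [0, 100), [100, 1000), [1000, 10000), >= 10000.
BIN_EDGES = [100, 1000, 10000]


def orientation_bias(tes, edges=BIN_EDGES):
    """Return log2(sense / antisense) per distance bin.

    tes: iterable of (distance_to_nearest_exon_bp, orientation) pairs,
         where orientation is "+" (sense) or "-" (antisense).
    Bins in which either count is zero yield None, since the ratio is undefined.
    """
    counts = Counter()
    for distance, orientation in tes:
        counts[(bisect_right(edges, distance), orientation)] += 1
    biases = []
    for bin_index in range(len(edges) + 1):
        sense = counts[(bin_index, "+")]
        antisense = counts[(bin_index, "-")]
        if sense and antisense:
            biases.append(math.log2(sense / antisense))
        else:
            biases.append(None)
    return biases


if __name__ == "__main__":
    # Hypothetical intronic TEs: one sense and four antisense elements near the exon.
    example = [(50, "+"), (60, "-"), (70, "-"), (80, "-"), (90, "-"),
               (500, "+"), (600, "-"), (20000, "+")]
    print(orientation_bias(example))
```

A negative value in a bin means antisense elements outnumber sense elements there. The same function applies to the splice site bias in Figure B.4B if the two labels are taken to mean "nearest to SA" and "nearest to SD" instead of orientation.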
B.2  Supplementary tables

Table B.1 Mutagenic intronic Alu insertions in humans

Gene with mutation  Alu subfamily  Orientation#  Nearest splice-site  Distance to exon (bp)  Related diseases
FAS                 Sb1            -             SA                   50                     Autoimmune lymphoproliferative syndrome (ALPS)
FGFR2               Yb8            -             SA                   10                     Apert Syndrome
APC                 Yb9            -             SD                   20                     Familial Adenomatous Polyposis
GK                  Ya5            -             SA                   30                     Glycerol kinase deficiency
F8                  Yb9            -             SA                   10                     Hemophilia A
OPA1                Yb8            -             SA                   21                     Autosomal dominant optic atrophy

# The "-" sign represents antisense orientation of the TE with respect to the enclosing gene

Table B.2 Mutagenic intronic L1 insertions in humans

Gene with mutation  Orientation#  Nearest splice-site  Distance to exon (bp)  Related diseases
HBB                 +             SA                   100                    Beta-thalassemia
FKTN                +             SA                   24                     Fukuyama-type congenital muscular dystrophy (FCMD)
RPS6KA3             -             SA                   10                     Coffin-Lowry syndrome (CLS)
CYBB                +             SA                   265                    Chronic granulomatous disease
RP2                 +             SD                   653                    X-linked retinitis pigmentosa (XLRP)

# The "+"/"-" sign represents sense/antisense orientation of the TE with respect to the enclosing gene

Table B.3 Mutagenic intronic ERV insertions in mice Mutation Ap3d1  mh2J  Orientation#  ERV family  Nearest splice-site*  Distance to exon (bp)  IAP  +  SD  6  Atrn  mg  IAP  -  SD  136  Atrn  mg-L  IAP  +  SD  ~420  bor  IAP  +  SA  1575  IAP  +  SA  btw. 850-1100  Pas  IAP  +  SD  ~300  IAP  IAP  -  SA  1  Eya1  Gusmps2J Lama2  LamB3  196  Mutation  Nearest splice-site*  Distance to exon (bp)  +  SD  616  IAP  +  SA  ~942  IAP  +  SA  ~1126  IAP  ?  SA  1  IAP  -  SA  24  Mgrn1  md-2J  IAP  Mgrn1  md  vb  Dem  Pofut1cax  Pitpna Spna1  Pmca2  joggle  Orientation#  ERV family  IAP  +  SD  ~15  spkw1  IAP  +  SD  ~720  SJL  IAP  ?  SA  965  ETna  +  SD  ~1700  Cacng2  stg  ETn  +  SD  btw. 1500-2100  Cacng2  stg-3J  ETn  +  SD  btw. 
2500-4100  ETn  +  SD  1033  ETn  +  SA  3500  ETn  +  SD  ~14000  ETn  +  SD  ~60000  Gria4 Zfp69  Adcy1brl  Clcn1 Fas  adr  lpr  Fbxw4Dac-2J Fign  fi  Foxn1  nu-Bc  ETn  -  SA  ~5200  pdn  ETn  +  SA  24514  Hk1dea  ETn  ?  SA  901  0b-2J  ETn  +  SD  ~3200  Cat-Fr  ETn  +  SA  ~800  mu  ETn  +  SD  2362  ETn  +  SA  57  Hsf4  lop11  ETn  +  SA  61  Fig4  paletremor  Gli3  Lep  Mip  Muted Ttc7fsn  ETn  +  SA  384  Dysf  prmd  ETn  +  SD  495  Zhx2  Afr1  ETn  +  SA  ~20600  VL30b  -  SD  ~1200  MuLV  -  SA  4  MuLV  +  SD  ~500  d  ?  SD  ~4000  Pdcd8Hq  MuLV  +  SD  3432  rd1  MuLV  -  SD  1511  MuERV  +  ?  <250  a Abcb1a Myo5a Nox3  d  het  Pde6b Lmf1  mds  cld  c  unk

# The "+"/"-" indicates orientation of the TE with respect to the enclosing gene; "?" indicates orientation is unknown
* SD: splice donor site; SA: splice acceptor site; "?" indicates that the nearest splice site was not given
a ETn represents the ETn/MusD family
b Virus-like 30 element
c Murine leukemia virus
d "unk" indicates the ERV type was not given

Appendix C Supporting information of Chapter 4

C.1  Supplementary figures

Figure C.1 TE density distribution of human genes. A) TE density distribution of all human RefSeq genes. B) TE density distribution of human RefSeq genes larger than 10 kb.

Figure C.2 The relationship between gene size and exon density in human. A) The negative association between gene size and exon density. B) The linear regression between gene size and the inverse of exon density. r is the correlation coefficient.

Figure C.3 Identification of outlier genes by controlling for gene size and exon density. In any given species, all genes ≥ 10 kb were divided into 25 subsets based on both gene size and exon density and were put into a 5 x 5 matrix. For genes in each subset, upper/lower outliers were identified by taking the top or bottom 10% of genes with the most extreme TE density.
The final set of upper/lower outlier genes is obtained by merging the upper/lower outliers from each subset, for which the variations of both gene size and exon density are controlled.

Figure C.4 Chromosomal distribution of SUOs and SLOs in human. The short red lines along the left side of each chromosome show the chromosomal locations of SLOs. The short blue lines along the right side of each chromosome show the chromosomal locations of SUOs.

Figure C.5 The relationship between the number of SUOs/SLOs on human chromosomes and the chromosome size. Results for SUOs and SLOs are shown in A) and B), respectively. In both (A) and (B), the x-axis shows the genomic coverage of each chromosome as a percentage, and the y-axis shows the total number of SUOs/SLOs on a given chromosome.

Figure C.6 Correlation analysis of the TE composition of SUOs between human and mouse. Results for LINE, SINE, LTR retroelement and DNA transposon are shown as linear regression plots in A), B), C) and D), respectively. In each plot, each open circle represents an SUO gene and its location is determined by the density of the corresponding TE type of the SUO orthologs in the two species. The line across the data points in each plot represents the regression line, and r is the correlation coefficient.

Figure C.7 Correlation analyses of G+C content vs. LINE/SINE composition of SUOs in human. Results for LINE and SINE are shown in (A) and (B), respectively. The x-axis shows the proportion covered by LINEs/SINEs relative to all TEs in each SUO gene. The y-axis shows the average G+C content of human SUOs. The line across the data points in each plot represents the regression line, and r is the correlation coefficient.

Figure C.8 Tissue-type composition of tissue-specific outlier genes. For each gene set, the proportion corresponding to each tissue type is shown in a stacked bar according to the color scheme indicated at the top.
The 'genomic background' was calculated based on all mouse genes > 10 kb that show strong Polr2a binding in only one tissue.  205  Figure C.9 Histone marks at promoters of all outlier genes. The proportions of genes associated with different histone marks are shown for all upper outliers, all lower outliers and the genomic background as side-by-side bars. Error bars are standard errors derived from the total number of genes (sample size) in each gene set.  206  C.2  Supplementary tables  Table C.1 SLOs identified among human, mouse and cow Hid   gene_name   gene_size   te_coverage   te_density   chr   gStart   gEnd   exonCount   exon_density   13092   FOXP1   628405   154657   0.24611   chr3   71087425   71715830   21   0.334179   4086   NFIA   380480   82899   0.21788   chr1   61320567   61701047   11   0.289108   8641   CADM1   330897   67999   0.205499   chr11   1.15E+08   1.15E+08   9   0.271988   8822   TOX   313791   69067   0.220105   chr8   59880530   60194321   9   0.286815   20574   PBX1   292244   50083   0.171374   chr1   1.63E+08   1.63E+08   9   0.307962   9151   ZNF521   290226   60633   0.208916   chr18   20895888   21186114   8   0.275647   22888   FAM19A5   269799   41124   0.152425   chr22   47263951   47533750   4   0.148259   8556   TRPS1   260503   41795   0.16044   chr8   1.16E+08   1.17E+08   7   0.268711   88724   NFIB   232099   38043   0.163909   chr9   14071846   14303945   9   0.387766   7564   TCF7L2   216062   44874   0.20769   chr10   1.15E+08   1.15E+08   14   0.647962   2354   SPTBN1   215130   48890   0.227258   chr2   54536957   54752087   36   1.673407   48068   FYN   212143   57295   0.270077   chr6   1.12E+08   1.12E+08   14   0.659932   22712   LPHN2   192026   31203   0.162494   chr1   82038669   82230695   20   1.041526   21214   ZBTB16   190967   42541   0.222766   chr11   1.13E+08   1.14E+08   7   0.366555   31087   MEF2C   185811   36817   0.198142   chr5   88049814   88235625   11   0.591999   75187   CTBP2   173207   
28127   0.16239   chr10   1.27E+08   1.27E+08   11   0.635078   8549   MACROD1   167489   41650   0.248673   chr11   63522605   63690094   11   0.65676   4222   LHFPL2   163611   39124   0.239128   chr5   77816793   77980404   5   0.305603   20437   COL4A1   158187   25652   0.162163   chr13   1.1E+08   1.1E+08   52   3.287249   20933   EPHA4   154264   30651   0.198692   chr2   2.22E+08   2.22E+08   18   1.166831   1320   CALCR   149952   34159   0.2278   chr7   92891734   93041686   13   0.866944   11224   SCUBE1   140126   36733   0.262143   chr22   41929173   42069299   22   1.570016   86803   MEIS1   137360   19802   0.144161   chr2   66516035   66653395   13   0.946418   207  Hid   gene_name   gene_size   te_coverage   te_density   chr   gStart   gEnd   exonCount   exon_density   3801   RUNX1T1   136292   26739   0.196189   chr8   93040327   93176619   11   0.807091   4340   AFF1   134039   30853   0.230179   chr4   88147176   88281215   20   1.492103   32336   NFATC1   133552   16317   0.122177   chr18   75256759   75390311   10   0.748772   32426   SEMA6A   131301   19913   0.151659   chr5   1.16E+08   1.16E+08   19   1.447057   9635   BANP   125887   27349   0.21725   chr16   86542538   86668425   12   0.953236   55948   IKZF1   125369   9354   0.074612   chr7   50314923   50440292   8   0.638116   7813   LEF1   120878   26086   0.215804   chr4   1.09E+08   1.09E+08   10   0.82728   23384   LMF1   117351   26577   0.226474   chr16   843634   960985   11   0.937359   86984   NRXN2   117015   17984   0.15369   chr11   64130221   64247236   24   2.051019   2875   NRP2   115634   13558   0.117249   chr2   2.06E+08   2.06E+08   17   1.470156   6528   VAC14   113720   29396   0.258495   chr16   69278842   69392562   19   1.67077   7673   COL18A1   108538   11784   0.10857   chr21   45649524   45758062   43   3.961746   1952   PDE2A   98228   22426   0.228306   chr11   71964832   72063060   32   3.257727   7485   PPARGC1A   98057   16479   0.168055   chr4   
23402741   23500798   13   1.32576   3636   ETV1   97909   20020   0.204476   chr7   13897380   13995289   12   1.225628   20623   PTPRF   92797   20639   0.22241   chr1   43769133   43861930   33   3.556149   2232   SATB1   90987   5610   0.061657   chr3   18364269   18455256   11   1.208964   1095   EPAS1   89274   16729   0.187389   chr2   46378066   46467340   16   1.792235   515   IGF1   84734   19863   0.234416   chr12   1.01E+08   1.01E+08   5   0.590082   187   KIT   82787   17732   0.214188   chr4   55218851   55301638   21   2.53663   6279   ERCC6   82657   22096   0.267322   chr10   50334496   50417153   21   2.54062   9697   BAIAP2   82271   9214   0.111996   chr17   76623556   76705827   15   1.823243   4526   ENPP2   81788   19739   0.241343   chr8   1.21E+08   1.21E+08   26   3.17895   3332   MYT1   77780   7674   0.098663   chr20   62266270   62344050   23   2.957058   6389   DDX31   76113   19908   0.261558   chr9   1.34E+08   1.35E+08   20   2.627672   7889   PIK3R1   75188   9631   0.128092   chr5   67558217   67633405   15   1.994999   2119   PTPN1   74196   20218   0.272494   chr20   48560297   48634493   10   1.347782   9341   PHF20L1   73449   10362   0.141077   chr8   1.34E+08   1.34E+08   21   2.859127   208  Hid   gene_name   gene_size   te_coverage   te_density   chr   gStart   gEnd   exonCount   exon_density   13251   DICER1   71195   15054   0.211447   chr14   94622317   94693512   27   3.792401   20579   PFKP   69245   16905   0.244133   chr10   3099751   3168996   22   3.177125   36206   SLA   66338   17229   0.259715   chr8   1.34E+08   1.34E+08   9   1.356688   22401   ANGPT2   63612   14023   0.220446   chr8   6344580   6408192   9   1.414827   3837   ETS1   63502   6190   0.097477   chr11   1.28E+08   1.28E+08   8   1.259803   10608   PMEPA1   63090   10837   0.17177   chr20   55656857   55719947   4   0.634015   55619   SPEG   58655   9830   0.16759   chr2   2.2E+08   2.2E+08   41   6.990026   1040   DHX15   57097   14739   
0.25814   chr4   24138185   24195282   14   2.451968   6899   ASS1   56568   13927   0.246199   chr9   1.32E+08   1.32E+08   16   2.828454   32466   SMYD2   55913   14631   0.261674   chr1   2.13E+08   2.13E+08   12   2.146191   870   ADCYAP1R1   54170   13374   0.246889   chr7   31058666   31112836   16   2.953664   55740   EZR   53684   13920   0.259295   chr6   1.59E+08   1.59E+08   14   2.607853   110441   ALDH1A1   52383   8426   0.160854   chr9   74705406   74757789   13   2.481721   8860   TSC22D2   50828   9664   0.190131   chr3   1.52E+08   1.52E+08   4   0.786968   21059   TLE3   49714   6723   0.135234   chr15   68127596   68177310   20   4.023012   5109   ADAMTS5   49208   6728   0.136726   chr21   27212102   27261310   8   1.625752   2069   PROX1   47903   2248   0.046928   chr1   2.12E+08   2.12E+08   5   1.043776   2266   SFRP1   47503   10678   0.224786   chr8   41238634   41286137   3   0.631539   55639   KDR   47336   8499   0.179546   chr4   55639183   55686519   30   6.337671   49690   ATP2B3   46808   6779   0.144826   chrX   1.52E+08   1.53E+08   20   4.272774   8612   OLFM1   45942   4815   0.104806   chr9   1.37E+08   1.37E+08   6   1.305995   37922   DSP   45077   7749   0.171906   chr6   7486868   7531945   24   5.324223   56497   LHX4   44747   5529   0.123561   chr1   1.78E+08   1.79E+08   6   1.340872   635   GAD1   44460   8423   0.189451   chr2   1.71E+08   1.71E+08   17   3.823662   6628   WDR1   42611   6027   0.141442   chr4   9685060   9727671   12   2.816174   74967   PCDH10   42263   4231   0.100111   chr4   1.34E+08   1.34E+08   5   1.183068   16375   TPCN2   41723   6654   0.15948   chr11   68572925   68614648   25   5.991899   7301   EHF   40414   6637   0.164225   chr11   34599243   34639657   9   2.226951   209  Hid   gene_name   gene_size   te_coverage   te_density   chr   gStart   gEnd   exonCount   exon_density   55709   SH3GL1   40105   10130   0.252587   chr19   4311366   4351471   10   2.493455   55759   SLC7A5   
39472   6557   0.166118   chr16   86421129   86460601   10   2.533441   1750   LSP1   39294   2147   0.054639   chr11   1830775   1870069   11   2.79941   1531   FMR1   39133   5526   0.141211   chrX   1.47E+08   1.47E+08   17   4.34416   4912   SSFA2   38993   4639   0.11897   chr2   1.82E+08   1.83E+08   18   4.616213   20445   CSPG4   38527   6043   0.156851   chr15   73753717   73792244   10   2.595582   55433   COL3A1   38374   1906   0.049669   chr2   1.9E+08   1.9E+08   51   13.29025   2438   THBS2   38263   3500   0.091472   chr6   1.69E+08   1.69E+08   23   6.011029   1997   PLCG1   38197   6434   0.168443   chr20   39199574   39237771   32   8.377621   69   COL1A2   36672   1276   0.034795   chr7   93861808   93898480   52   14.17976   12206   CCDC80   36569   6735   0.184172   chr3   1.14E+08   1.14E+08   8   2.187645   9768   ANKRD10   36530   5791   0.158527   chr13   1.1E+08   1.1E+08   6   1.642486   180   JAG1   36363   3770   0.103677   chr20   10566331   10602694   26   7.150125   10551   SON   34463   3778   0.109625   chr21   33837219   33871682   12   3.481995   53111   INTS1   34106   4037   0.118366   chr7   1476438   1510544   48   14.07377   68520   SLC2A1   33802   8665   0.256346   chr1   43163632   43197434   10   2.958405   21389   TPPP   33534   2471   0.073686   chr5   712976   746510   4   1.192819   2410   TCF7   33518   6829   0.203741   chr5   1.33E+08   1.34E+08   10   2.983472   41463   C10orf84   33268   6156   0.185043   chr10   1.2E+08   1.2E+08   8   2.404713   1212   PAX6   33170   0   0   chr11   31762915   31796085   13   3.919204   201   KCNH2   32966   4919   0.149214   chr7   1.5E+08   1.5E+08   15   4.550143   443   VLDLR   32693   4898   0.149818   chr9   2611792   2644485   19   5.811642   37525   CCND2   31579   3290   0.104183   chr12   4253198   4284777   5   1.583331   10919   CADM3   31556   2790   0.088414   chr1   1.57E+08   1.57E+08   10   3.168969   4314   SMAD7   30859   3146   0.101948   chr18   44700220  
44731079   4   1.296218   9482   SLCO4A1   29851   2357   0.078959   chr20   60744241   60774092   12   4.019966   22547   COL11A2   29777   1847   0.062028   chr6   33238446   33268223   66   22.16476   11384   TBX18   29743   2605   0.087584   chr6   85500875   85530618   8   2.689709   37370   PDE1B   29620   4568   0.15422   chr12   53229670   53259290   16   5.401756   74841   CSNK1D   29334   5595   0.190734   chr17   77795528   77824862   10   3.409013   55612   CRHR2   29278   5598   0.191202   chr7   30659387   30688665   12   4.098641   20688   TFAP2B   28888   607   0.021012   chr6   50894397   50923285   7   2.423151   23129   DAZAP1   28099   2366   0.084202   chr19   1358583   1386682   12   4.270615   7968   TBX4   27858   5380   0.193122   chr17   56888588   56916446   8   2.871707   88554   PFKFB2   27749   4271   0.153915   chr1   2.05E+08   2.05E+08   15   5.4056   10772   RIPK4   27721   3385   0.12211   chr21   42032597   42060318   8   2.885899   55668   PFKL   27327   1945   0.071175   chr21   44544357   44571684   22   8.050646   2709   CPZ   27054   5218   0.192874   chr4   8645334   8672388   11   4.065942   12767   CLPTM1L   27003   1717   0.063586   chr5   1370999   1398002   17   6.295597   32531   EGLN3   26864   3899   0.145138   chr14   33463171   33490035   5   1.861227   8443   AZI2   26772   3543   0.13234   chr3   28338850   28365622   8   2.988197   57205   SEPT8   26559   2382   0.089687   chr5   1.32E+08   1.32E+08   10   3.765202   82484   ARHGEF16   26531   2562   0.096566   chr1   3361006   3387537   15   5.653764   3098   USP2   26512   3367   0.126999   chr11   1.19E+08   1.19E+08   13   4.90344   23133   NLGN3   26341   5542   0.210394   chrX   70281435   70307776   7   2.657454   3638   NR5A1   26185   3645   0.139202   chr9   1.26E+08   1.26E+08   7   2.673286   37509   ATP1B1   26014   3832   0.147305   
chr1   1.67E+08   1.67E+08   6   2.30645   35319   TMEM201   25959   4385   0.16892   chr1   9571563   9597522   10   3.852229   1078   CELSR2   25738   1568   0.060922   chr1   1.1E+08   1.1E+08   34   13.21004   1918   SLC22A18   25526   4204   0.164695   chr11   2877526   2903052   11   4.309332   21322   HSPH1   25355   2019   0.079629   chr13   30608762   30634117   18   7.099191   2252   SDC1   24637   2476   0.100499   chr2   20264038   20288675   6   2.435361   925   PRDM1   23620   1736   0.073497   chr6   1.07E+08   1.07E+08   7   2.96359   1391   COL6A1   23301   208   0.008927   chr21   46226090   46249391   35   15.02081   2421   TFAP2A   22882   149   0.006512   chr6   10504901   10527783   7   3.059173   37750   NR2E1   22752   2416   0.106188   chr6   1.09E+08   1.09E+08   9   3.955696   211  Hid   gene_name   gene_size   te_coverage   te_density   chr   gStart   gEnd   exonCount   exon_density   68427   KCNC4   22602   2756   0.121936   chr1   1.11E+08   1.11E+08   4   1.769755   32055   PDGFA   22583   3215   0.142364   chr7   503424   526007   6   2.656866   3845   FOSL2   21738   3153   0.145046   chr2   28469282   28491020   4   1.840096   20263   RARG   21684   2062   0.095093   chr12   51890619   51912303   10   4.611695   5075   ALX1   21526   1724   0.080089   chr12   84198166   84219692   4   1.858218   40725   HOXA3   20831   385   0.018482   chr7   27112333   27133164   4   1.920215   22410   HNRNPD   20683   3041   0.147029   chr4   83493490   83514173   9   4.3514   1550   GATA3   20498   1384   0.067519   chr10   8136672   8157170   6   2.927115   510   IGF2   20487   0   0   chr11   2106922   2127409   5   2.440572   8636   PATZ1   20460   2244   0.109677   chr22   30051789   30072249   4   1.955034   4927   LMO4   20453   168   0.008214   chr1   87566738   87587191   5   2.444629   31360   PAX9   20230   851   0.042066   chr14   36196532   36216762   5   2.471577   55800   FASN   19893   631   0.03172   chr17   77629502   77649395  
 43   21.61564   1877   NGFR   19718   3016   0.152957   chr17   44927653   44947371   6   3.042905   31405   TNNT3   19138   779   0.040704   chr11   1897374   1916512   15   7.83781   11450   KCTD15   18916   2573   0.136022   chr19   38979590   38998506   7   3.700571   4691   DPYSL4   18867   962   0.050988   chr10   1.34E+08   1.34E+08   14   7.420364   37520   KLF5   18535   1381   0.074508   chr13   72531142   72549677   4   2.158079   40883   GDF6   18463   1887   0.102204   chr8   97223733   97242196   2   1.083248   68057   MAT1A   17859   1653   0.092558   chr10   82021555   82039414   9   5.039476   7816   LHX9   17639   54   0.003061   chr1   1.96E+08   1.96E+08   6   3.401553   73874   COL1A1   17544   1055   0.060135   chr17   45616455   45633999   51   29.06977   55799   EMX1   17417   821   0.047138   chr2   72998111   73015528   3   1.722455   31142   THBS1   16389   641   0.039112   chr15   37660571   37676960   22   13.42364   2534   VEGFA   16271   541   0.033249   chr6   43845930   43862201   7   4.302133   7605   ZBTB7B   15888   888   0.055891   chr1   1.53E+08   1.53E+08   4   2.517623   4582   TNFAIP3   15869   689   0.043418   chr6   1.38E+08   1.38E+08   9   5.671435   55437   FGFR3   15560   373   0.023972   chr4   1764836   1780396   16   10.28278   212  Hid   gene_name   gene_size   te_coverage   te_density   chr   gStart   gEnd   exonCount   exon_density   10233   GABRQ   15189   1981   0.130423   chrX   1.52E+08   1.52E+08   9   5.925341   99848   RAP2C   15137   777   0.051331   chrX   1.31E+08   1.31E+08   4   2.642532   2753   STC2   14781   595   0.040254   chr5   1.73E+08   1.73E+08   4   2.706177   1293   BGN   14594   457   0.031314   chrX   1.52E+08   1.52E+08   8   5.481705   4712   MXD4   14580   793   0.05439   chr4   2218957   2233537   6   4.115226   23132   SLC38A2   14577   893   0.061261   chr12   45038237   45052814   16   10.9762   11186   ITM2C   14343   1269   0.088475   chr2   2.31E+08   2.31E+08   5   3.486021 
  6461   SF1   14164   303   0.021392   chr11   64288653   64302817   13   9.178198   1653   INHBA   14106   797   0.056501   chr7   41695125   41709231   3   2.126755   2694   ENC1   14016   1282   0.091467   chr5   73958989   73973005   3   2.140411   20128   L1CAM   13925   659   0.047325   chrX   1.53E+08   1.53E+08   28   20.10772   21215   ZFAND5   13823   927   0.067062   chr9   74156160   74169983   6   4.340592   3263   EFNB1   13167   514   0.039037   chrX   67965564   67978731   5   3.797372   41636   ARRDC4   13136   385   0.029309   chr15   96304936   96318072   8   6.090134   8614   PUF60   12991   1184   0.09114   chr8   1.45E+08   1.45E+08   11   8.467401   24916   DISP2   12823   1730   0.134914   chr15   38437725   38450548   8   6.23879   81909   HNRNPK   12572   114   0.009068   chr9   85772817   85785389   17   13.52211   68089   CYP2E1   11754   713   0.06066   chr10   1.35E+08   1.35E+08   9   7.656968   22705   HEY2   11684   833   0.071294   chr6   1.26E+08   1.26E+08   5   4.279356   36929   VASN   11681   1172   0.100334   chr16   4361849   4373530   2   1.712182   69317   NLGN2   11678   118   0.010104   chr17   7252225   7263903   7   5.994177   1661   ISL1   11606   52   0.00448   chr5   50714714   50726320   6   5.16974   37900   SLC16A3   11077   305   0.027535   chr17   77779581   77790658   5   4.513858   37251   AGXT   10375   267   0.025735   chr2   2.41E+08   2.41E+08   11   10.60241   3354   OVOL1   10157   63   0.006203   chr11   65311104   65321261   4   3.938171   7884   PEA15   10036   255   0.025409   chr1   1.58E+08   1.58E+08   4   3.985652   213  hid: the HomoloGene Database ID of the gene gene_name: the human gene name annotated by RefSeq Database (hg18) gene_size: the human gene size annotated by the longest RefSeq isoform te_coverage: the coverage of TE sequences in the human gene in basepairs (bp) te_density: the density of TE sequences derived from the ratio of te_coverage/gene_size chr: the name of the human 
chromosome that the gene is located on
gStart: the genomic coordinate of the 5' start site of the gene in the human genome (hg18)
gEnd: the genomic coordinate of the 3' end site of the gene in the human genome (hg18)
exonCount: the number of exons in the gene
exon_density: the exon density, derived from the ratio exonCount/gene_size (/kb)

Table C.2 SUOs identified among human, mouse and cow

Hid  gene_name  gene_size  te_coverage  te_density  chr  gStart  gEnd  exonCount  exon_density
8584  COMMD9  17158  9849  0.574018  chr11  36250417  36267575  5  2.914093
9800  PPP1R14D  13265  9717  0.732529  chr15  38894934  38908199  5  3.769318
13945  HSCB  15454  10629  0.687783  chr22  27468042  27483496  6  3.88249
57091  CALR3  17129  12222  0.713527  chr19  16450874  16468003  9  5.254247
12018  KLHL10  10557  6214  0.588614  chr17  37247568  37258125  5  4.736194
19263  TMEM219  11023  6169  0.559648  chr16  29880851  29891874  6  5.443164
7862  NFATC2IP  15450  8514  0.551068  chr16  28869818  28885268  8  5.177994
6289  ITPA  14451  7869  0.54453  chr20  3138055  3152506  8  5.535949
1501  ERCC1  14306  9285  0.649028  chr19  50604711  50619017  10  6.990074
55906  AP1M2  14645  9176  0.626562  chr19  10544346  10558991  12  8.193923
8709  DHDH  11288  6703  0.593816  chr19  54128750  54140038  7  6.201276
36381  PRODH2  13310  7541  0.566566  chr19  40982731  40996041  11  8.264463
1592  HARS  17482  9279  0.530775  chr5  1.4E+08  1.4E+08  13  7.43622
3280  FARSA  11275  6439  0.571086  chr19  12894283  12905558  13  11.52993
9232  AAAS  14173  7581  0.53489  chr12  51987506  52001679  16  11.28907
27612  ANKRD13D  13136  6658  0.506851  chr11  66813394  66826530  16  12.18027
81844  ACTL6B  13359  6771  0.506849  chr7  1E+08  1E+08  14  10.47983
16533  CCDC151  14709  7448  0.506357  chr19  11392271  11406980  13  8.838126
12877  FERMT3  17151  8426  0.491283  chr11  63730788  63747939  15  8.745846
12723  RILPL2  21329  15612  0.731961  chr12  1.22E+08  1.22E+08  4  1.875381
12083  ZBTB8OS  28879  21946  0.759929  chr1  32859893  32888772  7  2.423907
2628  NME5  24272  16218  0.668177  chr5  1.37E+08  1.38E+08  6  2.471984
75168  CHIA  29702  20082  0.676116  chr1  1.12E+08  1.12E+08  9  3.030099
8080  ECSIT  23187  15605  0.673006  chr19  11477743  11500930  8  3.450209
9315  NOSIP  24836  16374  0.659285  chr19  54750779  54775615  10  4.026413
12320  MTFMT  28128  16829  0.598301  chr15  63080902  63109030  9  3.199659
37712  RPA2  23188  13722  0.591772  chr1  28090635  28113823  9  3.881318
14939  PHYHD1  21147  13327  0.630208  chr9  1.31E+08  1.31E+08  12  5.674564
1160  GTF2H3  26954  16713  0.620056  chr12  1.23E+08  1.23E+08  13  4.823032
83  DDB2  24277  13649  0.562219  chr11  47193068  47217345  10  4.119125
7893  PLD3  30059  16855  0.560731  chr19  45546171  45576230  13  4.324828
1620  HPD  19337  13533  0.69985  chr12  1.21E+08  1.21E+08  14  7.240006
426  F2  20314  13259  0.652703  chr11  46697318  46717632  14  6.891799
23489  RABEP2  20791  12803  0.615795  chr16  28823242  28844033  13  6.252705
14189  DNTTIP1  19491  11456  0.587758  chr20  43853982  43873473  13  6.669745
38185  POLR3C  18280  10411  0.56953  chr1  1.44E+08  1.44E+08  15  8.205689
730  PRIM1  20783  11772  0.566424  chr12  55411630  55432413  13  6.255112
74971  PTPN18  18527  10472  0.565229  chr2  1.31E+08  1.31E+08  15  8.096292
20712  TYK2  30045  16184  0.538659  chr19  10322203  10352248  25  8.320852
83157  KIFC1  18387  9829  0.534562  chr6  33467290  33485677  11  5.982488
3426  CDC23  25696  13721  0.533974  chr5  1.38E+08  1.38E+08  16  6.22665
24933  COMMD7  41322  26774  0.647936  chr20  30754153  30795475  9  2.178017
4309  IPP  47727  30873  0.646867  chr1  45936993  45984720  8  1.6762
1363  CDK7  42636  29275  0.686626  chr5  68566377  68609013  12  2.814523
7973  TEKT1  31761  21780  0.685747  chr17  6644023  6675784  8  2.518812
1356  CDC25C  46558  30750  0.660467  chr5  1.38E+08  1.38E+08  11  2.362644
9897  C2orf42  41135  26304  0.639455  chr2  70230520  70271655  10  2.43102
37483  XRCC6  42758  26761  0.625871  chr22  40347240  40389998  13  3.040367
39824  PTGR2  33489  20749  0.619577  chr14  73388426  73421915  10  2.986055
9502  CDK5RAP1  42693  25233  0.591034  chr20  31410305  31452998  14  3.279226
32123  SHCBP1  40844  23840  0.583684  chr16  45171968  45212812  13  3.182842
14239  GLMN  52612  30573  0.581103  chr1  92484542  92537154  19  3.611343
55978  CCT6B  33569  21982  0.65483  chr17  30279050  30312619  14  4.170514
1159  GTF2H2  32547  20007  0.614711  chr5  70366706  70399253  16  4.915968
57000  DHX40  42817  23949  0.559334  chr17  54997667  55040484  18  4.203938
4597  SMC1A  48549  26669  0.549321  chrX  53417794  53466343  25  5.149437
41349  ACN9  65171  40285  0.618143  chr7  96583840  96649011  2  0.306885
78167  C3orf70  74965  44978  0.599987  chr3  1.86E+08  1.86E+08  2  0.266791
17076  UBXN2A  60318  42993  0.712772  chr2  24016879  24077197  7  1.160516
45811  AMN1  58038  36082  0.621696  chr12  31715337  31773375  7  1.206106
18584  CYP20A1  67400  45675  0.677671  chr2  2.04E+08  2.04E+08  13  1.928783
41769  TTLL9  72354  48746  0.673715  chr20  29922165  29994519  15  2.07314
37744  TBCE  81553  52240  0.640565  chr1  2.34E+08  2.34E+08  17  2.084534
33650  BOLL  59336  36308  0.611905  chr2  1.98E+08  1.98E+08  11  1.853849
2376  NEK4  60151  41944  0.697312  chr3  52719840  52779991  16  2.659972
105654  PFKFB1  60922  42087  0.690834  chrX  54976314  55037236  14  2.29802
72261  NWD1  97988  66889  0.682624  chr19  16691786  16789774  21  2.14312
65105  NLRP5  62083  38729  0.623826  chr19  61202903  61264986  15  2.41612
6920  ALG6  69581  41927  0.602564  chr1  63605885  63675466  15  2.155761
1371  CENPC1  73268  44148  0.602555  chr4  68020583  68093851  19  2.593219
3322  KIF11  62326  34719  0.557055  chr10  94342804  94405130  22  3.529827
5092  ZPBP  155788  118841  0.762838  chr7  49947584  50103372  8  0.513518
20218  GABRA3  284198  191698  0.674523  chrX  1.51E+08  1.51E+08  10  0.351867
12423  ITFG1  305718  180212  0.589471  chr16  45746798  46052516  18  0.588778
49914  NHEDC1  134682  91702  0.680878  chr4  1.04E+08  1.04E+08  12  0.890988
10410  SPATA6  173568  117494  0.676934  chr1  48536864  48710432  13  0.748986
52235  CCDC73  192562  124780  0.647999  chr11  32580201  32772763  18  0.934764
17021  ADAM32  177387  133355  0.751774  chr8  39084206  39261593  25  1.409348
51868  ALS2CR11  131753  90074  0.683658  chr2  2.02E+08  2.02E+08  15  1.138494
10778  SENP7  188968  122521  0.648369  chr3  1.03E+08  1.03E+08  23  1.217137
49799  EIF2C3  125292  79686  0.636002  chr1  36169358  36294650  19  1.516458
10225  ASH1L  227273  143196  0.630062  chr1  1.54E+08  1.54E+08  28  1.231999
2120  PTPN4  217831  127661  0.586055  chr2  1.2E+08  1.2E+08  27  1.239493
9084  TRIM37  124267  70509  0.567399  chr17  54414781  54539048  25  2.011797

Hid: the HomoloGene Database ID of the gene
gene_name: the human gene name annotated by RefSeq Database (hg18)
gene_size: the human gene size annotated by the longest RefSeq isoform
te_coverage: the coverage of TE sequences in the human gene in basepairs (bp)
te_density: the density of TE sequences, derived from the ratio te_coverage/gene_size
chr: the name of the human chromosome that the gene is located on
gStart: the genomic coordinate of the 5' start site of the gene in the human genome (hg18)
gEnd: the genomic coordinate of the 3' end site of the gene in the human genome (hg18)
exonCount: the number of exons in the gene
exon_density: the exon density, derived from the ratio exonCount/gene_size (the tabulated values correspond to exons per 10 kb)

Table C.3 Overrepresentation of GO terms for SLOs (BiNGO results)

GO-ID  p-value  corr p-value  x  n  X  N  Description
48731  3.60E-24  7.24E-21  87  2421  173  14301  system development
48856  5.05E-22  5.09E-19  88  2655  173  14301  anatomical structure development
7275  3.55E-21  2.38E-18  92  2971  173  14301  multicellular organismal development
32502  2.11E-20  1.06E-17  95  3234  173  14301  developmental process
48513  1.43E-18  5.78E-16  67  1791  173  14301  organ development
30154  3.03E-16  1.02E-13  61  1668  173  14301  cell differentiation
48869  1.10E-15  3.16E-13  61  1714  173  14301  cellular developmental process
32501  2.40E-15  6.04E-13  103  4375  173  14301  multicellular organismal process
9653  1.84E-14  4.12E-12  49  1217  173  14301  anatomical structure morphogenesis
9888  2.67E-14  5.38E-12  38  749  173  14301  tissue development
51252  1.69E-13  3.10E-11  60  1857  173  14301  regulation of RNA metabolic process
6355  1.92E-13  3.23E-11  59  1808  173  14301  regulation of transcription, DNA-dependent
1944  2.21E-13  3.43E-11  23  273  173  14301  vasculature development
9887  9.24E-13  1.33E-10  33  635  173  14301  organ morphogenesis
1568  1.03E-12  1.38E-10  22  265  173  14301  blood vessel development
7399  3.85E-12  4.85E-10  44  1155  173  14301  nervous system development
50794  1.32E-11  1.57E-09  119  6223  173  14301  regulation of cellular process
50789  9.52E-11  1.07E-08  121  6552  173  14301  regulation of biological process
65007  1.56E-10  1.64E-08  125  6941  173  14301  biological regulation
10468  1.69E-10  1.64E-08  72  2928  173  14301  regulation of gene expression
48514  1.71E-10  1.64E-08  18  220  173  14301  blood vessel morphogenesis
10556  4.37E-09  4.00E-07  68  2877  173  14301  regulation of macromolecule biosynthetic process
45449  4.58E-09  4.01E-07  64  2622  173  14301  regulation of transcription
31326  5.25E-09  4.40E-07  70  3021  173  14301  regulation of cellular biosynthetic process
6357  5.89E-09  4.74E-07  30  748  173  14301  regulation of transcription from RNA polymerase II promoter
9889  7.42E-09  5.75E-07  70  3045  173  14301  regulation of biosynthetic process
60255  1.74E-08  1.30E-06  74  3377  173  14301  regulation of macromolecule metabolic process
48646  2.76E-08  1.99E-06  20  375  173  14301  anatomical structure formation involved in morphogenesis
19219  3.11E-08  2.16E-06  68  3014  173  14301  regulation of nucleobase, nucleoside, nucleotide and nucleic acid metabolic process
7167  3.96E-08  2.66E-06  19  346  173  14301  enzyme linked receptor protein signaling pathway
51171  4.44E-08  2.88E-06  68  3040  173  14301  regulation of nitrogen compound metabolic process
48608  9.49E-08  5.97E-06  13  164  173  14301  reproductive structure development
42127  9.86E-08  6.02E-06  30  848  173  14301  regulation of cell proliferation
3006  1.13E-07  6.69E-06  17  296  173  14301  reproductive developmental process
48518  1.29E-07  7.42E-06  54  2209  173  14301  positive regulation of biological process
43588  1.37E-07  7.62E-06  7  34  173  14301  skin development
60429  1.40E-07  7.62E-06  18  337  173  14301  epithelium development
7166  1.49E-07  7.92E-06  38  1280  173  14301  cell surface receptor linked signaling pathway
80090  1.64E-07  8.46E-06  74  3554  173  14301  regulation of primary metabolic process
1974  1.83E-07  9.19E-06  6  22  173  14301  blood vessel remodeling
48729  2.28E-07  1.12E-05  16  275  173  14301  tissue morphogenesis
48522  2.49E-07  1.20E-05  50  2005  173  14301  positive regulation of cellular process
31323  2.62E-07  1.23E-05  76  3735  173  14301  regulation of cellular metabolic process
50793  2.77E-07  1.25E-05  28  791  173  14301  regulation of developmental process
10628  2.79E-07  1.25E-05  24  604  173  14301  positive regulation of gene expression
1525  3.14E-07  1.38E-05  12  152  173  14301  angiogenesis
19222  4.11E-07  1.76E-05  78  3918  173  14301  regulation of metabolic process
1501  5.74E-07  2.41E-05  17  332  173  14301  skeletal system development
51254  8.51E-07  3.50E-05  21  507  173  14301  positive regulation of RNA metabolic process
10604  9.32E-07  3.75E-05  30  942  173  14301  positive regulation of macromolecule metabolic process
45944  1.15E-06  4.52E-05  18  389  173  14301  positive regulation of transcription from RNA polymerase II promoter
45935  1.26E-06  4.87E-05  24  657  173  14301  positive regulation of nucleobase, nucleoside, nucleotide and nucleic acid metabolic process
7548  1.51E-06  5.74E-05  12  176  173  14301  sex differentiation
45941  1.76E-06  6.58E-05  22  576  173  14301  positive regulation of transcription
32583  1.81E-06  6.63E-05  13  212  173  14301  regulation of gene-specific transcription
122  2.04E-06  7.32E-05  15  286  173  14301  negative regulation of transcription from RNA polymerase II promoter
10557  2.07E-06  7.32E-05  24  676  173  14301  positive regulation of macromolecule biosynthetic process
51173  2.18E-06  7.58E-05  24  678  173  14301  positive regulation of nitrogen compound metabolic process
48468  2.28E-06  7.79E-05  23  632  173  14301  cell development
45893  2.78E-06  9.34E-05  20  501  173  14301  positive regulation of transcription, DNA-dependent
45446  3.27E-06  1.08E-04  5  20  173  14301  endothelial cell differentiation
8406  3.68E-06  1.18E-04  10  129  173  14301  gonad development
48771  3.68E-06  1.18E-04  7  54  173  14301  tissue remodeling
16055  3.95E-06  1.23E-04  10  130  173  14301  Wnt receptor signaling pathway
23052  3.97E-06  1.23E-04  64  3132  173  14301  signaling
61061  4.16E-06  1.27E-04  14  265  173  14301  muscle structure development
9893  4.63E-06  1.39E-04  30  1019  173  14301  positive regulation of metabolic process
3158  5.45E-06  1.61E-04  5  22  173  14301  endothelium development
30855  6.18E-06  1.80E-04  11  168  173  14301  epithelial cell differentiation
31328  6.26E-06  1.80E-04  24  721  173  14301  positive regulation of cellular biosynthetic process
7398  6.33E-06  1.80E-04  12  202  173  14301  ectoderm development
45892  6.52E-06  1.82E-04  17  397  173  14301  negative regulation of transcription, DNA-dependent
51270  6.77E-06  1.87E-04  13  239  173  14301  regulation of cellular component movement
9891  8.08E-06  2.20E-04  24  732  173  14301  positive regulation of biosynthetic process
51253  8.20E-06  2.20E-04  17  404  173  14301  negative regulation of RNA metabolic process
8584  9.43E-06  2.50E-04  7  62  173  14301  male gonad development
51093  1.12E-05  2.90E-04  14  289  173  14301  negative regulation of developmental process
42692  1.13E-05  2.90E-04  9  116  173  14301  muscle cell differentiation
51239  1.15E-05  2.90E-04  30  1067  173  14301  regulation of multicellular organismal process
8284  1.15E-05  2.90E-04  18  459  173  14301  positive regulation of cell proliferation
45137  1.18E-05  2.93E-04  10  147  173  14301  development of primary sexual characteristics
30334  1.25E-05  3.07E-04  12  216  173  14301  regulation of cell migration
48523  1.30E-05  3.15E-04  43  1844  173  14301  negative regulation of cellular process
31325  1.37E-05  3.29E-04  28  967  173  14301  positive regulation of cellular metabolic process
22414  1.39E-05  3.30E-04  25  808  173  14301  reproductive process
7169  1.44E-05  3.36E-04  12  219  173  14301  transmembrane receptor protein tyrosine kinase signaling pathway
3  1.48E-05  3.43E-04  25  811  173  14301  reproduction
35295  1.52E-05  3.49E-04  14  297  173  14301  tube development
48534  1.60E-05  3.63E-04  13  259  173  14301  hemopoietic or lymphoid organ development
48519  2.47E-05  5.46E-04  45  2020  173  14301  negative regulation of biological process
43062  2.47E-05  5.46E-04  10  160  173  14301  extracellular structure organization
9987  2.53E-05  5.54E-04  138  9366  173  14301  cellular process
30097  2.66E-05  5.76E-04  12  233  173  14301  hemopoiesis
8585  2.79E-05  5.96E-04  7  73  173  14301  female gonad development
51128  2.81E-05  5.96E-04  19  538  173  14301  regulation of cellular component organization
23033  2.99E-05  6.24E-04  46  2100  173  14301  signaling pathway
2520  3.01E-05  6.24E-04  13  275  173  14301  immune system development
6916  3.04E-05  6.24E-04  11  199  173  14301  anti-apoptosis
43009  3.20E-05  6.51E-04  15  360  173  14301  chordate embryonic development

GO-ID: the Gene Ontology term ID
p-value: the p-value derived from the hypergeometric test
corr p-value: the corrected p-value derived from the Benjamini & Hochberg False Discovery Rate (FDR) correction
x: the number of sample genes (i.e. SUOs/SLOs) belonging to the given GO term
n: the number of population genes (i.e. genomic background) belonging to the given GO term
X: the sample size (i.e. all SUOs/SLOs with available GO annotations)
N: the population size (i.e. genomic background)
Description: the GO term description

Table C.4 Overrepresentation of GO terms for SUOs (BiNGO results)

GO-ID  p-value  corr p-value  x  n  X  N  Description
718  2.95E-10  2.52E-07  6  21  63  14304  nucleotide-excision repair, DNA damage removal
6308  1.74E-07  4.97E-05  6  57  63  14304  DNA catabolic process
6289  1.74E-07  4.97E-05  6  57  63  14304  nucleotide-excision repair
6281  4.85E-05  7.37E-03  8  300  63  14304  DNA repair
44238  4.90E-05  7.37E-03  39  5286  63  14304  primary metabolic process
8152  5.17E-05  7.37E-03  42  5957  63  14304  metabolic process
6259  7.95E-05  9.71E-03  10  517  63  14304  DNA metabolic process
9056  9.46E-05  1.01E-02  14  1005  63  14304  catabolic process
43170  1.12E-04  1.06E-02  32  4016  63  14304  macromolecule metabolic process
44260  1.58E-04  1.35E-02  29  3507  63  14304  cellular macromolecule metabolic process
6807  2.94E-04  2.20E-02  21  2194  63  14304  nitrogen compound metabolic process
6974  3.29E-04  2.20E-02  8  396  63  14304  response to DNA damage stimulus
9057  3.34E-04  2.20E-02  9  503  63  14304  macromolecule catabolic process
34641  3.99E-04  2.38E-02  20  2076  63  14304  cellular nitrogen compound metabolic process
51301  4.43E-04  2.38E-02  7  314  63  14304  cell division
90304  4.45E-04  2.38E-02  16  1459  63  14304  nucleic acid metabolic process
6139  4.98E-04  2.49E-02  18  1785  63  14304  nucleobase, nucleoside, nucleotide and nucleic acid metabolic process
70  5.24E-04  2.49E-02  3  36  63  14304  mitotic sister chromatid segregation
819  5.69E-04  2.55E-02  3  37  63  14304  sister chromatid segregation
44237  6.05E-04  2.55E-02  35  4988  63  14304  cellular metabolic process
87  6.27E-04  2.55E-02  6  239  63  14304  M phase of mitotic cell cycle
44265  6.71E-04  2.61E-02  8  441  63  14304  cellular macromolecule catabolic process
6350  7.75E-04  2.88E-02  7  345  63  14304  transcription
279  8.58E-04  3.06E-02  7  351  63  14304  M phase
6368  1.22E-03  4.18E-02  3  48  63  14304  RNA elongation from RNA polymerase II promoter
6354  1.46E-03  4.80E-02  3  51  63  14304  RNA elongation

GO-ID: the Gene Ontology term ID
p-value: the p-value derived from the hypergeometric test
corr p-value: the corrected p-value derived from the Benjamini & Hochberg False Discovery Rate (FDR) correction
x: the number of sample genes (i.e. SUOs/SLOs) belonging to the given GO term
n: the number of population genes (i.e. genomic background) belonging to the given GO term
X: the sample size (i.e. all SUOs/SLOs with available GO annotations)
N: the population size (i.e. genomic background)
Description: the GO term description

Table C.5 The optimization table for outlier threshold selection

cutoff  SLO_count  SLO P-fold diff  SUO_count  SUO P-fold diff
0.025  39  414.8246635  6  63.81917899
0.05  81  107.6948646  14  18.61392721
0.075  121  47.66741147  36  14.18203978
0.1  183  30.41382749  66  10.96892139
0.125  239  20.33704504  115  9.785607446
0.15  293  14.42825574  160  7.878910987
0.175  353  10.94663274  213  6.605191995
0.2  437  9.078444408  265  5.505235167
0.225  521  7.601690045  329  4.800299472
0.25  603  6.413827489  398  4.233338873

cutoff: the cutoff threshold, given as the top/bottom fraction of the TE density distribution used to call upper/lower outlier genes
SLO_count: the number of SLOs derived from the three species using the corresponding cutoff
SLO P-fold diff: the fold difference between the observed frequency of SLOs and the probability of observing SLOs at random, based on the genes shared among the three species
SUO_count: the number of SUOs derived from the three species using the corresponding cutoff
SUO P-fold diff: the fold difference between the observed frequency of SUOs and the probability of observing SUOs at random, based on the genes shared among the three species

Table C.6 Number of TUs/genes derived in each step of SUO/SLO calculation

A) Human
Step  TU/gene type*  Total of TUs/genes involved
1. source data  whole genome RefSeq TUs (hg18)  27212
2. HomoloGene  RefSeq genes with HID  17357
3. gene size cutoff  genes >= 10 kb  12423
4. all outliers  all upper outliers  1231
4. all outliers  all lower outliers  1228
5. shared outliers  SUOs  84
5. shared outliers  SLOs  189

B) Mouse
Step  TU/gene type*  Total of TUs/genes involved
1. source data  whole genome RefSeq TUs (mm9)  21873
2. HomoloGene  RefSeq genes with HID  18248
3. gene size cutoff  genes >= 10 kb  11670
4. all outliers  all upper outliers  1175
4. all outliers  all lower outliers  1175
5. shared outliers  SUOs  84
5. shared outliers  SLOs  189

C) Cow
Step  TU/gene type*  Total of TUs/genes involved
1. source data  whole genome RefSeq TUs (bosTau4)  12441
2. HomoloGene  RefSeq genes with HID  9389
3. gene size cutoff  genes >= 10 kb  6206
4. all outliers  all upper outliers  615
4. all outliers  all lower outliers  608
5. shared outliers  SUOs  84
5. shared outliers  SLOs  189

Abbreviations:
TU - Transcription Unit
SUO - Shared Upper Outlier
SLO - Shared Lower Outlier
* Genes are defined as the longest isoforms of multiple TUs associated with the same gene
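The derived columns of Table C.2 follow directly from the raw ones. As a minimal sketch (the function name `derived_columns` is hypothetical), the snippet below recomputes gene_size, te_density and exon_density for the COMMD9 row. It assumes that gene_size equals gEnd minus gStart and that exon_density is scaled to exons per 10 kb, which is the scaling the tabulated values match.

```python
def derived_columns(g_start, g_end, te_coverage, exon_count):
    """Recompute the derived columns of one Table C.2 row."""
    gene_size = g_end - g_start                     # assumption: gene size in bp
    te_density = te_coverage / gene_size            # fraction of the gene covered by TEs
    exon_density = exon_count / gene_size * 10_000  # assumption: exons per 10 kb
    return gene_size, te_density, exon_density


# COMMD9 row of Table C.2: gStart, gEnd, te_coverage, exonCount
size, te_density, exon_density = derived_columns(36250417, 36267575, 9849, 5)
print(size, round(te_density, 6), round(exon_density, 6))
```

Rows whose coordinates appear only in rounded scientific notation (for example 1.4E+08) cannot be checked this way, because the difference gEnd - gStart is lost in the rounding.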
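The p-value and corr p-value columns of Tables C.3 and C.4 can be illustrated with a short sketch. This is not BiNGO's implementation: it assumes a one-sided (upper-tail) hypergeometric test and the standard Benjamini & Hochberg step-up adjustment, and the function names `hypergeom_upper_tail` and `benjamini_hochberg` are hypothetical. For the first row of Table C.4 (x = 6, n = 21, X = 63, N = 14304) the tail probability should land close to the tabulated 2.95E-10. The list passed to the adjustment function is made-up illustrative input, since the full set of tested GO terms is not listed in the tables.

```python
from math import comb


def hypergeom_upper_tail(x, n, X, N):
    """P(at least x of the X sample genes carry a term held by n of N genes)."""
    total = comb(N, X)
    hits = sum(comb(n, k) * comb(N - n, X - k) for k in range(x, min(n, X) + 1))
    return hits / total


def benjamini_hochberg(pvals):
    """Return BH-adjusted p-values in the same order as the input."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# First row of Table C.4: nucleotide-excision repair, DNA damage removal
p = hypergeom_upper_tail(6, 21, 63, 14304)
print(f"{p:.2e}")

# Illustrative p-values only (not taken from the thesis)
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```

Exact integer binomials keep the tail sum free of floating-point underflow, at the cost of speed; a statistics library would be the usual choice for thousands of GO terms.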
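The "P-fold diff" columns of Table C.5 can be reproduced with a short sketch under two assumptions that the legend does not state explicitly: the chance expectation for a gene being an outlier in all three species is taken as the cutoff cubed (independence between species), and the number of shared genes is taken as 6017, a value back-calculated from the tabulated figures rather than quoted from the text. The function name `p_fold_diff` is hypothetical.

```python
N_SHARED_GENES = 6017  # assumption: inferred from Table C.5, not stated in the legend
N_SPECIES = 3          # human, mouse and cow


def p_fold_diff(outlier_count, cutoff, n_shared=N_SHARED_GENES, n_species=N_SPECIES):
    """Observed shared-outlier frequency divided by the chance expectation."""
    observed = outlier_count / n_shared
    expected = cutoff ** n_species  # assumption: outlier status independent across species
    return observed / expected


if __name__ == "__main__":
    # (cutoff, SLO_count, SUO_count) taken from three rows of Table C.5
    for cutoff, slo, suo in [(0.025, 39, 6), (0.05, 81, 14), (0.1, 183, 66)]:
        print(cutoff, round(p_fold_diff(slo, cutoff), 4), round(p_fold_diff(suo, cutoff), 4))
```

Under these assumptions the fold difference shrinks as the cutoff loosens, because the chance expectation grows with the cube of the cutoff while the shared-outlier counts grow more slowly.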
