UBC Theses and Dissertations

Prioritizing genes with functionally distinct splice isoforms Bhuiyan, Shamsuddin A. 2020

PRIORITIZING GENES WITH FUNCTIONALLY DISTINCT SPLICE ISOFORMS

by

Shamsuddin A. Bhuiyan

B.H.Sc, University of Calgary, 2015

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF

DOCTOR OF PHILOSOPHY

in

THE FACULTY OF GRADUATE AND POSTDOCTORAL STUDIES
(Bioinformatics)

THE UNIVERSITY OF BRITISH COLUMBIA
(Vancouver)

December 2020

© Shamsuddin Bhuiyan, 2020

The following individuals certify that they have read, and recommend to the Faculty of Graduate and Postdoctoral Studies for acceptance, the dissertation entitled:

Prioritizing genes with functionally distinct splice isoforms

submitted by Shamsuddin Bhuiyan in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Bioinformatics

Examining Committee:
Dr. Paul Pavlidis, Psychiatry, UBC, Supervisor
Dr. Wyeth Wasserman, Medical Genetics, UBC, Supervisory Committee Member
Dr. Thibault Mayor, Biochemistry and Molecular Biology, UBC, University Examiner
Dr. Inanc Birol, Medical Genetics, UBC, University Examiner

Additional Supervisory Committee Members:
Dr. Douglas Allan, Cellular and Physiological Sciences, UBC, Supervisory Committee Member
Dr. Ann Marie Craig, Psychiatry, UBC, Supervisory Committee Member

Abstract

Most mammalian genes generate multiple transcripts via splicing, and we do not know the function of most splice variants. Currently, there is a debate about how many splice variants are likely nonfunctional or “noisy” transcripts. My thesis explores the claim that alternative splicing vastly increases the genome’s functional diversity in the context of noisy splicing, and in doing so attempts to identify candidate cases for which alternative splicing is likely to be of consequence. To ground computational analyses of genes with multiple splice variants in experimental data, the field needs a corpus of genes that have experimental evidence of functionally distinct splice isoforms (FDSIs).
We curated the literature for 743 genes and found that ~5% had literature evidence of FDSIs. This suggests that the claim that alternative splicing vastly increases genomic functional diversity is extrapolated from a few key genes. Next, I developed a pipeline to identify candidate genes with FDSIs using long-read RNA-seq data. The output of my pipeline is a computationally prioritized list of candidate genes likely to have FDSIs based on features such as expression, conservation, functional domains, and coding potential. From an initial set of 6,799 genes with multiple splice variants, I prioritized 79 candidate genes. Although I had limited long-read data, my work helps establish guidelines for high-throughput prioritization of genes with FDSIs for future study. With our collaborators, I investigated a specific application of my pipeline to the voltage-gated calcium channel gene Cacna1e. Using novel long-read data, I established a set of 2,110 splice variants for Cacna1e. Based on properties of the channel, I determined that at most 154 splice variants are likely to encode a functional channel. My results highlight the amount of potential noise produced by one gene’s expression.

Through my investigation, I added to the growing body of literature in support of noisy splicing. I also provided the field with a list of interesting genes with multiple splice variants. This includes a gold standard set of genes from the experimental literature, and a novel set of prioritized genes. Both sets of genes will be useful for future studies of gene function.

Lay Summary

Traditionally, biologists believed that one gene made one functional product, or protein. However, we now know that portions of the gene (exons) can be “shuffled” to create multiple different proteins (alternative splicing). In the field, there is a debate regarding whether alternative splicing results in mostly functional or nonfunctional proteins.
My thesis adds to this debate by (1) providing a list of experimentally verified genes that code for multiple functional proteins, (2) building a computational pipeline that prioritizes candidate genes likely to have multiple functional proteins, and (3) applying this pipeline to publicly available and novel datasets. In general, the results of my work provide more evidence that only a few genes likely code for multiple functional proteins. Nonetheless, I highlight genes that are most interesting in the context of alternative splicing, and would be good candidates for experimental validation.

Preface

All the work presented in this dissertation was conducted in the Michael Smith Laboratories at the University of British Columbia. I was solely responsible for conducting the research and was the primary author of every chapter and the corresponding publications. My supervisor, Dr. Paul Pavlidis, contributed ideas, supervision, text, and editorial suggestions for all chapters.

At the time of writing, portions of Chapter 1 are being prepared for submission as a review article. I was the lead investigator for this literature review, responsible for all major areas of data collection and analysis. Dr. Paul Pavlidis will be the supervisory author for this submission.

A version of Chapter 2 has been published: Bhuiyan, Shamsuddin A., et al. "Systematic evaluation of isoform function in literature reports of alternative splicing." BMC Genomics 19.1 (2018): 637. I was the lead investigator, responsible for all major areas of concept formation, data collection and analysis, as well as manuscript composition. Dr. Paul Pavlidis was the supervisory author on this project and was involved throughout the project in concept formation and manuscript composition and writing. The manual curation of the literature was done by Sophia Ly, Minh Phan, Ellie Hogan, Chao Chun Liu, Brandon Huntington, and James Liu.

At the time of writing, Chapter 3 has not been published.
I was the lead investigator, responsible for all major areas of concept formation, most data collection and data analysis, as well as manuscript composition. Dr. Paul Pavlidis was the supervisory author on this project and was involved throughout the project in concept formation and manuscript composition and writing. Dr. John Tyson and Dr. Terrance Snutch at the Michael Smith Laboratories generated the mouse colliculus data. Manuel Belmadani processed the publicly available short-read RNA-seq data. Jordan Sicherman built the IsoVision tool I use for visualization. Jose Rafael Dimayacyac performed comparisons between PFAM and CDD.

At the time of writing, Chapter 4 has not been published. I was the co-lead investigator, responsible for all major areas of concept formation, most data collection and data analysis, as well as manuscript composition. Dr. Paul Pavlidis was the supervisory author on this project and was involved throughout the project in concept formation and manuscript composition and writing. My co-lead investigator, Dr. John Tyson, from Dr. Terrance Snutch’s lab at the Michael Smith Laboratories, generated the rat data for my analysis. Sections 4.2.2 and 4.2.3 were originally drafted by Dr. John Tyson. Dr. John Tyson aided in the design of Figure 4.4. Manuel Belmadani and Guillaume Poirier-Morency processed the publicly available short-read RNA-seq data. Jordan Sicherman built the IsoVision tool I use for visualization.

Table of Contents

Abstract
Lay Summary
Preface
Table of Contents
List of Tables
List of Figures
List of Abbreviations
Acknowledgements
Dedication
Chapter 1: Introduction
  1.1 The term “alternative splicing” is a potential source of confusion
  1.2 Defining genes with functionally distinct splice isoforms
    1.2.1 Systematic definitions of biological function
    1.2.2 How do NMD-targeted splice variants fit my operational definitions?
  1.3 Biochemical sources of noise in mRNA splicing
    1.3.1 The mRNA splicing process
    1.3.2 Sources of splicing noise
  1.4 Most splice variants are lowly-expressed and not detected at the peptide level
    1.4.1 Transcript level evidence of most splice variants
      1.4.1.1 Nearly all multi-exonic genes have multiple splice variants
      1.4.1.2 Condition-specific function or condition-specific noise
      1.4.1.3 Parallels to ENCODE criticisms
      1.4.1.4 Can we assess noise from RNA-seq data?
    1.4.2 Proteomics: Ribosomal profiling vs. Mass spectrometry
      1.4.2.1 Ribosomal profiling
      1.4.2.2 Mass spectrometry
  1.5 Impact of splicing on protein domains
      1.5.1.1 Adding or removing a domain
      1.5.1.2 Modifying a domain
      1.5.1.3 Maintaining domains while modifying other parts of a protein
  1.6 Most splice variants evolve neutrally
    1.6.1 Few splice variants have conservation evidence
    1.6.2 Limited evidence for positively selected splice variants
      1.6.2.1 Exceptional splicing in humans?
  1.7 Thesis outline and research contributions
Chapter 2: Systematic evaluation of isoform function in literature reports of alternative splicing
  2.1 Introduction
  2.2 Methods
    2.2.1 Determining the type of functional distinctness
    2.2.2 Expression pattern distinctness
    2.2.3 Intrinsic-functional distinctness
    2.2.4 Literature selection
    2.2.5 Gene-centric curation
    2.2.6 Paper-centric curation
    2.2.7 Curation process
    2.2.8 Curator Validation
    2.2.9 Linking FDSIs to Ensembl
    2.2.10 Computational predictions of genes with FDSIs
  2.3 Results
    2.3.1 Landscape of the alternative splicing literature
    2.3.2 Curation summary
    2.3.3 Identification of 23 human genes with direct evidence of functionally distinct splice isoforms
    2.3.4 Genes tend to express functionally distinct splice isoforms in the same condition
    2.3.5 Challenges linking FDSIs to sequence databases
    2.3.6 Only a quarter of genes with FDSIs are predicted by a computational classifier
  2.4 Discussion
    2.4.1 Evaluating the evidence for FDSIs at the gene level
    2.4.2 Types of functional distinctness in FDSIs
    2.4.3 Disconnect between literature and gene databases
    2.4.4 Implication for alternative splicing’s impact on gene function
  2.5 Conclusion
Chapter 3: Prioritizing genes likely to have functionally distinct splice isoforms using long-read RNA-seq data
  3.1 Background
  3.2 Methods
    3.2.1 Data collection
      3.2.1.1 Publicly available mouse brain and liver transcriptome data collection
      3.2.1.2 Mouse colliculus full length cDNA sequencing using Nanopore sequencing
    3.2.2 FLAIR Processing
    3.2.3 Splice-variant specific annotations
      3.2.3.1 Expression Annotations
      3.2.3.2 Open reading frame (ORF) annotations
      3.2.3.3 Protein domain annotations
      3.2.3.4 Conservation
    3.2.4 Summary of the prioritization process
    3.2.5 Retrieving previously known “biologically interesting” genes in our prioritization
    3.2.6 IsoVision visualization
  3.3 Results
    3.3.1 Data processing
    3.3.2 79 candidate genes likely to have FDSIs
    3.3.3 Over a quarter of genes have multiple appreciably expressed splice variants
    3.3.4 Nearly all splice variants had an open reading frame (ORF)
    3.3.5 83% of genes have two splice variants with an annotatable protein domain
    3.3.6 Few genes have at least one conserved spliced in exon
    3.3.7 Description of selected candidate genes with FDSIs
      3.3.7.1 Previously known case of FDSIs: Cdc42
      3.3.7.2 Literature pointing towards functional distinctness: Tpd52l1
      3.3.7.3 A novel candidate FDSI gene: Gstz1
  3.4 Discussion
    3.4.1 Effects of thresholds
    3.4.2 Support for the noisy splicing model
  3.5 Conclusions
Chapter 4: Cataloging the potential functional diversity of Cacna1e splice variants
  4.1 Background
  4.2 Methods
    4.2.1 Targeted amplification for five a1-subunit genes of GAERS and NEC rats
    4.2.2 ONT MinION sequencing of amplicons
    4.2.3 Short-read sequencing of amplicons
    4.2.4 Processing short-read sequencing data
    4.2.5 Determining splice variants using long read RNA-seq data and FLAIR
    4.2.6 Definition of a functional Cacna1e splice variant
    4.2.7 Annotating splice variants for Cacna1e
    4.2.8 Visualization tool
  4.3 Results
    4.3.1 Detection of candidate Cacna1e splice variants
    4.3.2 Expression profiles of Cacna1e splice variants
    4.3.3 Cacna1e splice variants contain a conserved cassette exon 19 and 45
    4.3.4 Novel Cacna1e splicing events
  4.4 Discussion
  4.5 Conclusions
Chapter 5: Conclusion
  5.1 Strengths and limitations of research
    5.1.1 Strengths
    5.1.2 Limitations
  5.2 Applications of research findings
  5.3 Future directions
  5.4 Final remarks
References
Appendices
  Appendix A Supplementary material for Chapter 2
    A.1 Curation standards (Filling in master spreadsheet)
    A.2 Additional Files
  Appendix B Supplementary material for Chapter 3
    B.1 Tables
  Appendix C Supplementary material for Chapter 5
    C.1 Tables

List of Tables

Table 1.1 Estimates of human genes with multiple functional isoforms from key noisy splicing papers
Table 1.2 Pittsburgh model of function applied to my definition of genes with FDSIs
Table 1.3 Estimates of splicing error rates from different studies
Table 1.4 The percentage of genes with multiple splice variants in a species is likely driven by the number of sequencing reads and effective population size
Table 2.1 Curation of alternative splicing literature revealed 23 human genes and 20 mouse genes with functionally distinct splice isoforms (FDSIs)
Table 2.2 Genes with positive literature evidence of FDSIs
Table 2.3 Genes with evidence failing to support FDSIs
Table 2.4 Most genes with FDSIs have intrinsically-functionally distinct FDSIs
Table 3.1 Summary of mouse long-read data
Table 3.2 FLAIR processing of mouse brain and liver transcriptomes
Table 3.3 Summary of transcript variants found after FLAIR processing of mouse long-reads
Table 3.4 Summary of splice variant classes found after FLAIR processing of mouse long-reads
Table 3.5 33 candidate genes likely to have FDSIs (cassette exons)
Table 3.6 45 candidate genes likely to have FDSIs (intron retention)
Table 3.7 9 candidate genes likely to have FDSIs (alternative 5’ splice site)
Table 3.8 7 candidate genes likely to have FDSIs (alternative 3’ splice site)
Table 4.1 Summary of targeted MinION RNA-seq data
Table 4.2 Summary of Cacna1e-targeted short-read RNA-seq data in rat thalamus
Table 4.3 Summary of putative splice variants detected in long-read RNA-seq data
Table B.1 Summary of cDNA brain data from Sessegolo et al., 2019
Table D.2 Metrics to assess genes with literature evidence for FDSIs

List of Figures

Figure 1.1 Types of functional distinctness for genes with FDSIs
Figure 1.2 The structure of U2 snRNP (major) introns are defined by motifs
Figure 1.3 Splice variants entry into Ensembl steadily increases over time
Figure 1.4 Long read RNA-seq reduces transcript structure ambiguity
Figure 1.5 Hierarchy for the functional distinctness of protein domains for a gene’s splice variants
Figure 2.1 Non-mutually exclusive types of functional distinctness for literature reported genes with FDSIs
Figure 2.2 Overview of literature curation
Figure 2.3 Number of alternative splicing studies linked to human or mouse genes
Figure 3.1 Workflow for annotation and prioritization of splice variants found in long-read RNA-seq
Figure 3.2 Most genes have one dominantly expressed splice variant across both tissue types
Figure 3.3 All genes with multiple splice variants have an annotatable ORF
Figure 3.4 Rank 1 splice variants mostly have complete open reading frames that are similar to rank 2 splice variants
Figure 3.5 Levels of protein domain distinctness for genes with multiple splice variants
Figure 3.6 6% of genes have at least one conserved alternatively spliced exon
Figure 3.7 Example candidate genes with FDSIs
Figure 4.1 The majority of detected splice variants did not contain an ORF of at least 2,000 amino acids
Figure 4.2 Cacna1e has four similarly expressed splice variants
Figure 4.3 Structure of top 4 most expressed splice variants
Figure 4.4 Impact of splicing on channel structure of 6 previously known cassette exons on 154 Cacna1e splice variants with 4 transmembrane domains

List of Abbreviations

APPRIS: Annotation of principal alternative splice isoforms
BED: Browser extensible data
BLAST: Basic local alignment search tool
CDD: Conserved Domain Database
CSV: Comma-separated values
DNA: Deoxyribonucleic acid
ENCODE: Encyclopedia of DNA elements
EST: Expressed sequence tag
FDSIs: Functionally distinct splice isoforms
FLAIR: Full-length alternative isoform analysis of RNA
GAERS: Genetic absence epilepsy rats from Strasbourg
GTEx: Genotype-Tissue Expression
MS: Mass spectrometry
NEC: Nonepileptic controls
NMD: Nonsense mediated decay
nr: Non-redundant
ONT: Oxford Nanopore Technologies
ORF: Open reading frame
PCR: Polymerase chain reaction
RefSeq: Reference sequence
Ribo-seq: Ribosomal-sequencing or ribosomal profiling
RNA: Ribonucleic acid
RNA-seq: RNA-sequencing
RNPs: Ribonucleoproteins
siRNA: small interfering RNA
SS: Splice site
STAR: Spliced transcript alignment to a reference
snRNPs: Small nuclear RNPs
UCSC: University of California – Santa Cruz
VGCC: Voltage-gated calcium channels
WT: Wildtype
Y2H: Yeast-two-hybrid

Acknowledgements

“It takes a village to raise a scientist” -Shams Bhuiyan, probably.

I would like to thank Dr. Paul Pavlidis for his guidance throughout my doctoral studies. This project was born out of mutual curiosity. Thanks to Paul’s patience, guidance, and scientific acumen, I was able to do this work. More importantly, Paul continued to have faith in my abilities, often when I did not, and pushed me towards more scientific pursuits. If I am fortunate enough to run my own lab, I hope I am as good of a mentor as Paul. I would also like to thank my committee members, Dr. Douglas Allan, Dr. Wyeth Wasserman, Dr. Ann-Marie Craig, and Dr. Tara Klassen, for their valuable feedback. Their perspective would often challenge me, and it forced me to grow as a scientist.

During this COVID-19 pandemic, I realized that I may not get a chance to say goodbye to all my friends at the Pavlidis lab.
So let me take this opportunity to say thank you for everything. Of particular note, I would like to thank Dr. Sanja Rogic for being just the best lab manager ever – past, present, and future. I was fortunate enough to mentor some undergrads who taught me a lot. Thank you to Sophia Ly, Ellie Hogan, John Phan, Jimmy Liu, Brandon Huntington, James Liu, Calvin Chang, Owen Tsai, Danja Currie-Olsen, and Jose Raphael Dimayacyac. To Patrick Savage, I did not get to work with you, but I wanted to write your last name in my thesis. Additionally, I would like to thank Manuel Belmadani, Guillaume Poirier-Morency, and Jordan Sicherman for directly helping out on my project. A shoutout to Ben Callaghan for his emotional support as my work husband. Finally, I would like to thank the following lab members for their thoughts throughout grad school: Pippin, Matt Jacobson, Dr. Shreejoy Tripathy, Dr. Lilah Toker, Dr. Marjan Farahbod, Dr. Niklous Fortelny, Alex Morin, Eric Chu, Margot Gunning, Min Feng, and Ogan Mancarci.     xxi I made some life-long friends at the Allan Lab: Justin Fong, Dr. Katerina Othonos, and Aarya V. Payel Ganguly wanted her own sentence here. Thanks for the memories! I had a fun Bioinformatics cohort. Special thanks to Phillip Richmond for his scientific and beard insights. And of course, thank you to Rachelle Farkas for trying out new dipping sauces with me, and letting me blame you for everything (Joe too!). I would also like to extend my thanks to my undergraduate lab, the de Koning lab. Dr. Jason de Koning was the one who first sparked my interest in science. I would also like to thank Dr. Ivan Kryukov, Dr. Arnab Saha-Mandal, and Rumika Macencaras for their continued collaborations with me.  The past 5 years has fostered many friendships, new and old. I would like to single out Nathan Cormier, and Dr. Christopher Smith for always providing me feedback on my work. Furthermore, many thanks to Hilda Doan, Dr. Rebecca Manion, Dr. 
Chloe Gerak, Stephanie Tran, Madeline Herman, Sonja Soo, Jack Middleton, Elesha Hoffarth, Keenan Dagenais, Elizabeth Fisher, Rema K, and Jeff Wintersinger. These people are just fantastic human beings. They say challenges make you stronger, so I guess I should thank my siblings – Sharif Bhuiyan, Samina Bhuiyan, and Shaden Bhuiyan – for giving me many problems. Also Samih Raiyan Bhuiyan has brought me much joy. Finally, as is oft the case, I would like to thank my partner, Areesha Salman. She probably read this thesis as many times as my supervisor. I wanted to write an entire page acknowledging Areesha, but that would just embarrass her. So for the first time ever, I will actively choose not to embarrass her, but rather say that I hope we will continue to hold hands as we surmount life together.

Dedication
To my father, Dr. Abul Jalaluddin Bhuiyan, for teaching me to love science, and to my mother, Rokeya Begum, for teaching me to love everything else

Chapter 1: Introduction
Most mammalian genes have multiple transcripts via “alternative” splicing, and we continue to detect more transcripts as our sequencing technologies improve in sensitivity. Although we do not know the function of most of these transcripts, many claim that alternative splicing vastly increases the functional diversity of the genome. Many computational studies operate under the assumption that this is true and are dedicated to predicting the function of all transcripts. However, the mere presence of a transcript is insufficient evidence for functionality, and a growing body of work has provided evidence that many transcripts are likely nonfunctional “noise”. For my thesis, I explore how much splicing drives the genome’s functional diversity.
In the debate about the functional consequences of alternative splicing, one side argues that we can never prove non-functionality, and future studies will eventually reveal the function of most transcripts (Blencowe, 2017). 
In contrast, the other side argues that most alternative splicing is a result of splicing error or biological “noise”, and only a few genes increase their functional diversity via this posttranscriptional modification (Pertea et al., 2018a; Tress et al., 2017a). This chapter introduces the relevant literature about “noisy splicing” that makes interpreting transcriptomic diversity challenging.
The noisy splicing model proposes that most mRNA splicing stochastically produces erroneous transcripts (Zhang et al., 2009). The model bears similarities to pervasive transcription, where RNA polymerase randomly transcribes the likely nonfunctional parts of the genome (Jensen et al., 2013; Meer et al., 2019; Pertea et al., 2018a). Likewise, the splicing machinery stochastically binds to pre-mRNAs to produce noisy transcripts. Due to the random error introduced by the splicing machinery, some noise must exist. Disagreements arise in determining how much of splicing produces noise.
Noisy splicing model proponents agree that the set of genes for which alternative splicing increases functional diversity is limited. The literature discussed in this chapter yields estimates that 2% to 15% of genes likely have multiple functional transcripts (Table 1.1) (Abascal et al., 2015a; Bhuiyan et al., 2018; Hao et al., 2015; Melamud and Moult, 2009; Pickrell et al., 2010; Reyes et al., 2013; Saudemont et al., 2017; Tress et al., 2017b; Xiong et al., 2018; Zhang et al., 2009). While these studies are mostly high-throughput and computational, most claims about alternative splicing increasing the genome’s functional diversity derive from computational and high-throughput studies as well. Only a limited number of genes have direct experimental evidence of multiple functional transcripts.  
Study | Study type | % Genes with multiple functional isoforms
Reyes et al., 2013 | Differential expression and conservation | 20
Melamud and Moult, 2009 | Computational models of alternative splicing | 20
Tress et al., 2017 | Mass spectrometry and conservation | 2
Zhang et al., 2009 | Differential expression and conservation | 4
Pickrell et al., 2010 | Computational modeling of splice isoform expression | 2

Table 1.1 Estimates of human genes with multiple functional isoforms from key noisy splicing papers
I calculated the proportion of genes with multiple functional isoforms based on the initial set of detected genes, and the number of genes with multiple functional isoforms based on the investigator’s definition of function. For example, in Tress et al. 2017, the investigators defined a gene with multiple functional isoforms as a gene with multiple splice variants detected in mass spectrometry experiments. They initially had a set of 12,227 genes in their mass spectrometry datasets, but only 246 (2%) of these genes had multiple splice variants.

Whether alternative splicing functionally diversifies 5% or 95% of genes, the focus of my thesis is prioritizing the specific genes where alternative splicing drives functional diversity. However, if the evidence only supports this in a few exceptional genes, then the assumed functional importance of most alternative splicing would need re-evaluation by molecular biologists. For example, determining gene function through wet-lab experiments on randomly picked splice variants would be wasteful, and computational algorithms trained on all splice variants would make erroneous gene function predictions. This applies not only to mRNA splicing, but also to other RNA modifications (Li et al., 2016; Liu and Zhang, 2018; Mudge et al., 2013).

1.1 The term “alternative splicing” is a potential source of confusion
The term used to describe when a gene has at least two transcripts generated by mRNA splicing is “alternative splicing”. 
However, this is a term I tend to avoid, because the word “alternative” is often taken as implying a secondary function to a “primary” transcript. Instead, we consider all transcripts as potentially equally important without necessitating any arbitrary definition of what is “primary”. For similar reasons I reserve the term “isoform” for cases where the transcripts are shown to be functional. “Transcript variant” would be a more general term in some contexts (RNA modifications unrelated to splicing), but I use “splice variant” to emphasize my focus on transcripts distinguished by splicing events. Thus, a gene can produce multiple splice variants, and our objective is to determine which of those actually represent isoforms and which are likely to be noise. The production of different splice variants requires differences in “splicing events”, which are defined as the use of a splicing junction.

1.2 Defining genes with functionally distinct splice isoforms
A central theme of my thesis project is isoform function. Importantly, I have an operational definition of “functional” throughout my thesis: A functional isoform is a splice variant that is necessary for phenotype(s) associated with the gene. As discussed below (1.2.1), this definition is intentionally a liberal one compared to a definition based on organismal fitness (however, the associated phenotype ideally impacts organismal fitness). For genes with multiple splice variants to have functionally distinct splice isoforms (FDSIs), they must have at least two splice variants that are necessary for the gene’s wildtype phenotype. Thus, the loss of either isoform would cause a change in phenotype. This definition of genes with FDSIs allows for cases where the FDSIs are necessary for the same phenotype, as in, the loss of either isoform causes a change in the same phenotype. 
However, if the removal of a splice variant does not cause a change in phenotype, then it is likely functionally redundant, and mRNA splicing has not diversified the gene’s function. Note that for the purposes of my thesis, I will focus on the splicing of protein-coding genes. However, non-coding genes can also have multiple splice variants.

Figure 1.1 Types of functional distinctness for genes with FDSIs
For genes with FDSIs, I categorized the specific subtypes of functional distinctness which contributed to the distinctness between FDSIs. FDSIs from the same gene can generally be distinct by having non-overlapping patterns of expression (e.g. each FDSI is expressed in a different tissue type), or they can be distinct due to intrinsic functional differences (e.g. each FDSI has a different protein domain). These categories are not mutually exclusive. I use these categories to guide part of my work in Chapter 2.
[Figure 1.1 labels: Functional distinctness; Intrinsic functional distinctness; Expression-pattern distinctness; Cell-type-specific; Subcellular localization; Other condition-specific; Developmental-stage-specific; Tissue-specific; Dominant negative; Protein domain; Protein terminus change]

The distinctness between FDSIs of the same gene can be due to intrinsic functional differences (e.g. distinct protein domains), expression pattern differences (e.g. non-overlapping tissue-specific expression patterns), or both (see Figure 1.1). The human gene TP63 has two FDSIs with intrinsic functional differences (Mitani et al., 2011). The isoform TAp63 contains a transactivation domain, whereas the isoform ΔNp63 lacks this domain. In contrast, mouse Calca is a gene with FDSIs (Calcitonin and CGRP) distinct at least in part due to non-overlapping tissue-specific expression pattern differences. In the thyroid gland, Calcitonin regulates calcium and phosphate metabolism, while in the brain, CGRP regulates vasodilation (Hoff et al., 2002; Schinke et al., 2004; Toda et al., 2008; Yang et al., 2013). 
Thus Calca requires both isoforms, even though the peptides encoded are very similar and could be interchangeable (Hoff et al., 2002).

1.2.1 Systematic definitions of biological function
In biology, the term “function” can have different definitions depending on the speaker and context; nevertheless, my thesis requires a definition. I make use of the recently proposed Pittsburgh model, which attempts to systemize the term into multi-level definitions (Table 1.2) (Keeling et al., 2019). The Pittsburgh model describes different biological definitions or “levels” of function: evolution, physiological, interaction, capacities, expression, and vague. My definition of genes with FDSIs requires genes to have multiple functional splice variants by these levels. The cellular machinery must transcribe, splice, and – for protein-coding genes – translate each splice variant into a protein that “does something”. The isoform’s existence is necessary but insufficient alone to establish functional distinctness. To be biologically necessary, there must either be evidence of evolutionary constraint (Section 1.6) or a physiological effect (Chapter 2). Throughout this chapter, I discuss how each level of function applies to splice variants.

Level of function | Example of evidence in splicing context | Evidence of functionally distinct?
Evolutionary implications | Splice variant has highly conserved exons or splice junctions | Yes
Physiological implications | Knockdown of isoforms produces a change in phenotype | Yes
Interactions | Splice variant binds to other proteins in column; position in a protein-protein interaction network | No
Capacities | Splice variant has complete protein domain | No
Expression | Splice variant exists in RNA-seq data or MS data | No
Vague | “The absence of evidence is not evidence of absence” | No

Table 1.2 Pittsburgh model of function applied to my definition of genes with FDSIs
The column “Level of function” is defined by the Pittsburgh Model of Function. 
The columns “Example of evidence” and “Evidence of functionally distinct?” indicate how my operational definition of genes with FDSIs fits with the Pittsburgh model of function. For “Example of evidence” and “Evidence of functionally distinct?” the evidence must apply to at least two splice variants from the same gene. “Evidence of functionally distinct?” is based on my interpretation of whether the evidence shows that the splice variant is necessary. Note that these columns contain key examples of the types of evidence that support these levels, but other types of evidence for each level likely exist. See (Keeling et al., 2019) for more details on the Pittsburgh model.

Doolittle and colleagues adapted a philosophical definition of function into a systemized definition of biological function, this one heavily delineated by molecular evolution (Doolittle, 2018). Biological components can have “causal roles”, or “selected effects”. Causal roles explain what the biological component does in a system, which is predefined by an investigator. A biological component can have innumerable causal roles. For example, one causal role of the heart is to make a “lub dub” sound. In the context of splicing of CALCA, CGRP’s causal roles include increasing the length of chromosome 11. While some causal roles can be important properties for the component, they do not have to explain why the component exists (Graur, 2017). In contrast to causal roles, selected effects explain why the biological component exists, if for any reason at all. These effects are necessary, etiological, and under evolutionary constraint. Biologists generally do not want to focus on causal roles if a selected effect exists. For example, the selected effect of the heart is to pump blood, and the selected effect of CGRP is to regulate vasodilation in the brain. These are likely more biologically relevant than the sound the heart makes, or the number of nucleotides CGRP adds to the genome. 
Importantly, all selected effects are causal roles, and while the converse is not true, we need to investigate the causal roles for a biological component to learn the selected effects. Nevertheless, Doolittle et al. emphasized that we should only use a component’s selected effect when describing the component’s function, though most biologists use causal roles to describe a component’s function. For my thesis, I will refer to splice isoform function by the selected effects, rather than causal roles. Thus, splice variants do not necessarily have a function – though they could be assigned causal roles – while the subset I refer to as splice isoforms will have a function in the selected effects sense. I note the work described in Chapter 2 is based on observed phenotypes assayed by independent investigators. While strictly speaking this is a causal role definition, it is likely the phenotypes measured are associated with selected effects, on the basis that experimental work tends to measure phenotypes which plausibly impact fitness, such as cell growth rates. The distinction between causal roles and selected effects can help systemize splice isoform function. Since causal roles can be trivial (e.g. splice variant mass), all biological components theoretically have causal roles; however, not all components have selected effects. Doolittle et al. provided many reasons why this may be the case, but the most applicable reason for my purposes is that biological components can arise through noisy processes (Doolittle et al., 2014). These components from noisy processes can have causal roles depending on how the component is tested. For example, the detection of a splice variant at the peptide level, or its interaction with another protein may provide proof of a causal role, but it does not necessarily support a selected effect. 
In section 1.3, I discuss the sources of noise in the mRNA splicing process, and in section 1.6, I discuss the lack of selected effect evidence for most splice variants.

1.2.2 How do NMD-targeted splice variants fit my operational definitions?
Nonsense mediated decay (NMD) is thought to primarily act as a quality-control mechanism by which some erroneous transcripts can be efficiently targeted for destruction. However, it has been proposed that some mammalian genes have evolved to express NMD-targeted splice variants with a defined regulatory function (Jaillon et al., 2008; McGlincy and Smith, 2008). The proposal is that the gene will shift to express an NMD-targeted splice variant instead of a non-NMD-targeted splice variant until the gene requires the non-NMD-targeted splice variant. However, in the noisy splicing literature, this assumption is challenged, as most NMD-targeted splice variants are not evolving under selection (see section 1.6) (Zhang et al., 2009). In my definition of FDSIs, the presence of such NMD-targeted splice variants does not mean that the gene has more than one functional isoform. By any reasonable definition, the NMD-targeted splice variants are non-functional and failing to express them would presumably have little consequence so long as the expression of the other splice variant was appropriately regulated. However, at least one exceptional case has been reported where an NMD-targeted transcript is first translated a single time prior to destruction to produce a required protein (Chen et al., 2008). As such, I do not immediately exclude splice variants predicted to undergo NMD in my analysis in Chapters 3 and 4. Rather, I first determine whether these splice variants evolve under selection. 

1.3 Biochemical sources of noise in mRNA splicing
In order to understand the impact of noisy splicing on the functional diversity of the genome, we must first understand the potential biochemical sources for error (Eling et al., 2019; Hsu and Hertel, 2009; Kurmangaliyev and Gelfand, 2008; Mironov et al., 1999). By understanding that splicing must give rise to noise, we understand that it would be erroneous to assume that every splice variant is functional. Biochemical reactions are imperfect processes. Given how many different molecules are involved, their high concentration, and their thermodynamics, it is essentially impossible for a biochemical process to generate only one outcome with 100% probability. This applies not only to splicing, but to all biochemical reactions. While more precise processes will produce less noise, even low levels of noise will contribute to measurements if the measurements are sufficiently sensitive; current RNA-seq technologies can detect molecules expressed at less than one copy per cell (Marioni et al., 2008; Palazzo and Lee, 2015; Pertea et al., 2018b; Raj et al., 2006; Trapnell et al., 2010, 2012).

1.3.1 The mRNA splicing process
In order to appreciate the sources of noise in the mRNA splicing process, it is important to understand the general biochemical process of removing introns. Here, I briefly describe the biochemical steps relevant to producing noise.
Small nuclear ribonucleoproteins (snRNPs) make up the splicing machinery, or spliceosome, which recognizes specific motifs in order to excise introns from pre-mRNA (Figure 1.2) (Lim and Burge, 2001). At the 5’ side, the motif GT defines the exon-intron boundary while at the 3’ side, AG defines the intron-exon boundary. Another necessary intronic feature is the branchpoint (a single nucleotide ‘A’). Between the 3’ splice site (SS) and the branchpoint is a polypyrimidine-rich region.   
Figure 1.2 The structure of U2 snRNP (major) introns are defined by motifs
The splicing machinery recognizes key motifs at the 5’ and 3’ ends of the intron. In bold are the nucleotides that are 100% present across all introns.
[Figure 1.2 labels: Exon; Intron; Exon; 5’ splice site; Branch point; Polypyrimidine tract (10 - 50 bp); 3’ splice site. Sequence shown: NNNNNNNNNN GTRAGT NNNNNN YURAC YYYYYYYYYYYY NCAG NNNNNNNNNNN]

These motifs guide the binding of RNPs throughout the splicing process, and the RNPs catalyze the two necessary transesterification steps (Lee and Rio, 2015; Raj and Blencowe, 2015). When U1 snRNP binds to the 5’ SS, Branch Point Protein (BPP) binds to the branch point and a U2AF65/U2AF35 complex binds to the 3’ SS. U2AF65/U2AF35 and BPP recruit U2 snRNP to bind at the BP, which in turn causes U2AF65/U2AF35 and BPP to dissociate. Next, a U4/U5/U6 snRNP trimer bridges U2 and U1 snRNP, causing the mRNA strand to bend. An OH group on the BP nucleophilically attacks the 5’ SS (first transesterification step) and causes the mRNA strand to form a loop (the intron lariat) where the exon at the 5’ SS is cleaved. At this point an OH group on the 5’ exon nucleophilically attacks (second transesterification step) the 3’ SS. The intron separates from the mRNA strand while the two exons are ligated using ATP hydrolysis. Nearly all mRNA splicing involves U2-dependent (major) introns; however, about 1% of mammalian splicing is U12-dependent (minor) splicing (Turunen et al., 2013). The key difference between major and minor splicing is that the snRNPs guiding the splicing process recognize different motifs at the 5’ and 3’ splice sites. In my thesis, I treat these as the same: a gene can have multiple splice variants via major or minor splicing. The question remains whether the gene requires these splice variants. A minor intron may erroneously arise from similar biochemical activities that cause a major intron to erroneously arise. Likewise, a gene can have FDSIs via major or minor splicing.   
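To give a rough sense of how little sequence the minimal GT and AG boundary motifs constrain, the sketch below enumerates every GT...AG span in a short made-up sequence. It is an illustration only, not a splice site predictor and not part of the analyses in this thesis: the toy sequence, the function name `candidate_introns`, and the `min_len` cutoff are all assumptions introduced here, and the branch point and polypyrimidine tract are ignored.

```python
import re

def candidate_introns(seq, min_len=20):
    """Return (start, end) pairs such that seq[start:end] begins with GT and ends with AG.

    min_len is a hypothetical minimum span length, chosen only for illustration.
    """
    seq = seq.upper()
    # Zero-width lookaheads report every occurrence, including overlapping ones.
    donors = [m.start() for m in re.finditer(r"(?=GT)", seq)]
    acceptors = [m.start() + 2 for m in re.finditer(r"(?=AG)", seq)]
    return [(d, a) for d in donors for a in acceptors if a - d >= min_len]

# Toy sequence (an assumption): a GTAAGT donor-like region, filler, then an AG.
toy = "CC" + "GTAAGT" + "C" * 18 + "AG" + "CC"
for start, end in candidate_introns(toy):
    print(start, end, toy[start:end])
```

Even this 30-nucleotide toy sequence yields two candidate spans sharing one 3’ end, because a second GT sits inside the donor-like region.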
There are multiple types of splicing events that can cause variation in transcript structure. The spliceosome may skip an exon entirely (“cassette exon”), shift the 3’ or the 5’ splice sites (alternative 3’ splice site, or alternative 5’ splice site), include an entire intron between two exons (intron retention), recognize a different first exon (alternative first exon, or alternative transcriptional start site), or recognize a different last exon (alternative last exon, or alternative transcriptional stop site). For the purposes of my thesis, I also include differences in transcriptional start and stop sites. Transcription initiates with RNA polymerase binding to a promoter, and some eukaryotic genes have multiple promoters (Xu et al., 2019). RNA polymerase will continue to transcribe the gene until the terminator, and some eukaryotic genes have multiple termination sequences. Via the use of these alternative transcriptional initiation or termination sites, a single gene may have multiple transcripts with different first or last exons. While these are not always counted as differences in splicing, they change exon boundaries and thus fit into my analytical framework.
Regardless of the splicing mechanism, two parts of the mRNA splicing process are important for understanding the biochemical source of splicing errors. First, different snRNPs work in concert and, second, the snRNPs bind to small splice site recognition motifs. While both steps are necessary for mRNA splicing, in Section 1.3.2, I outline how these steps can produce splicing noise.

1.3.2 Sources of splicing noise
In the nucleus, pre-mRNAs interact stochastically with components of the splicing machinery, and some of these interactions will result in splicing (Coulon et al., 2014; Hu et al., 2017; Rino et al., 2007). The snRNPs are present at high concentrations and have many opportunities to interact with pre-mRNAs. 
The probability that an interaction results in splicing is a function of the affinity of the sites in the mRNA, and there is no strict threshold for this. Thus, even at non-optimal sites, we would expect splicing to occur at some rate. If this “non-specific” splicing is frequent enough, then it will be detected (Pickrell et al., 2010; Warnecke and Hurst, 2011). Previous studies have shown that lowly expressed splice junctions are enriched near highly expressed splice sites, suggesting that the spliceosome has easy access to many splice sites and can miss its “intended” splice site (Chern et al., 2006; Pickrell et al., 2010). Given the short length of the splicing motifs and the multi-step splicing process, a single pre-mRNA strand can theoretically lead to many different transcripts (Fekete et al., 2017; Wang et al., 2014). For example, the spliceosome may bind to the “correct” 5’ splice site but an “erroneous” 3’ splice site because the spliceosome commits to each splice site separately. As another example, introns tend to harbor sequence variants (discussed further in section 1.6) which can introduce novel splice sites for the spliceosome to erroneously recognize (Lim and Hertel, 2004; Lynch, 2007; Sahebi et al., 2016).
In order to understand the “alternative splicing code” – the regulatory principles which the cell uses to convert a pre-mRNA into processed mRNA – previous studies have investigated the error introduced throughout splicing (Table 1.3). Hsu and Hertel (2009) provided the first estimate of splicing error, 0.001% per splice junction, based on qRT-PCR of the human genes UBA52 and RPL23. For three other genes (HPRT, POLB, TRPV), Skandalis reported an error rate estimate between 0.1% and 0.3% per splice junction (Skandalis, 2016). However, the investigators in both of these studies treated any unknown transcript for the genes as erroneous. 
Furthermore, they biased their gene selection to a limited set of highly expressed, well known genes with multiple splice variants.

Study | % estimate error
Hsu and Hertel, 2009 | 0.001% per splice junction
Skandalis 2016 | 0.1-0.3% per splice junction
Melamud and Moult, 2009 | 1% - 10% per gene
Pickrell et al., 2010 | 0.7% per splice junction

Table 1.3 Estimates of splicing error rates from different studies
Error estimates for splice junctions refer to the splicing reactions which occur at the 5’ and 3’ splice sites of a single intron.

To address these limitations, two studies estimate the splicing error using high throughput transcriptome-wide approaches. In the first study, Melamud and Moult used EST libraries and microarray expression levels to estimate a splicing error rate of 1% to 10% per gene. In the second study, using splice junctions found in GENCODE, evolutionary conservation, and RNA-seq, Pickrell and colleagues estimated that 0.7% of splicing reactions per splice junction produce erroneous transcripts (Pickrell et al., 2010). Both studies provided a splicing model where genes expressed most splice variants based on a probability distribution, rather than a deterministic, regulated outcome. They furthered our understanding of the splicing code by hypothesizing the parameters that guide this distribution.
The above estimates (Table 1.3) are mostly at the level of a splice junction. At the gene level, the outcome of these studies suggests that very large numbers of “erroneous” transcripts are produced. Even if the rate of error per junction is low, the cumulative effect per gene can be high. Consider that the average gene has 6 splice junctions. If the spliceosome has an error rate of 1% at each junction, the binomial probability of producing an erroneously processed transcript (that is, an error at one or more of the six junctions, 1 − 0.99^6) is ~5.8% for each pre-mRNA. Simplistically, this would imply that 5.8% of mRNA molecules in the cell might be erroneous.  
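The per-gene calculation above can be reproduced with a few lines of Python. This is a minimal sketch assuming that errors at different junctions are independent and share one rate; the function name `per_transcript_error` is introduced here for illustration. The per-junction rates are those listed in Table 1.3, and applying them to a 6-junction gene is an extrapolation made only to show how the rates compound.

```python
def per_transcript_error(per_junction_error, n_junctions):
    """Probability that at least one of n independent splicing reactions is erroneous."""
    return 1.0 - (1.0 - per_junction_error) ** n_junctions

# Worked example from the text: 6 junctions, 1% error per junction.
print(f"{per_transcript_error(0.01, 6):.4f}")  # 0.0585

# Per-junction estimates from Table 1.3 applied to the same 6-junction gene.
# Treating every junction as equally error-prone is a simplifying assumption.
per_junction_estimates = {
    "Hsu and Hertel, 2009": 0.00001,
    "Skandalis, 2016 (lower bound)": 0.001,
    "Skandalis, 2016 (upper bound)": 0.003,
    "Pickrell et al., 2010": 0.007,
}
for study, rate in per_junction_estimates.items():
    print(f"{study}: {per_transcript_error(rate, 6):.5f}")
```

Because the result depends on the number of junctions as well as the per-junction rate, genes with many introns accumulate error faster under this simple model.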
Although the above calculation gives a sense of how much noise could exist, for some genes noise might be more or less tolerated (in terms of selection pressure against cryptic splice sites, for example), and the error rates are likely not uniform. Previous studies have found that splicing error is inversely proportional to gene expression and the number of introns in the gene (Melamud and Moult, 2009; Saudemont et al., 2017). The central principle is that cells can tolerate the accumulation of only a certain number of erroneous transcripts. Since the probability of producing erroneous transcripts increases with the number of introns, selection pressures for intron-rich genes might drive error rates lower. Furthermore, error rates might be lower for highly expressed genes to avoid too many erroneous transcripts. In the intron-rich Paramecium tetraurelia, Saudemont and colleagues provide support for these hypotheses (Saudemont et al., 2017). By measuring the effects of disabling the NMD pathway, they reported a negative correlation between erroneous splicing (defined operationally as NMD-targeted transcripts), and both gene expression and number of introns.

1.4 Most splice variants are lowly-expressed and not detected at the peptide level
In section 1.3, I discussed how the multi-step process of splicing can produce erroneous transcripts. In this section, I will discuss the expression level evidence of splice variants at both the mRNA level (Section 1.4.1) and the protein level (Section 1.4.2).

1.4.1 Transcript level evidence of most splice variants

1.4.1.1 Nearly all multi-exonic genes have multiple splice variants
Two studies published in 2008 changed the field’s perception of alternative splicing in mammals. Pan and colleagues (cited 3,069 times as of July 16th, 2020), and Wang and colleagues (cited 4,233 times as of July 16th, 2020) both claimed that ~95% of multi-exonic genes have multiple splice isoforms (Pan et al., 2008; Wang et al., 2008a). 
Earlier estimates based on ESTs had been lower, but still nontrivial (40-60% of genes) (Blencowe, 2006; Matlin et al., 2005; Modrek and Lee, 2002). RNA-seq provides far greater sensitivity than ESTs, so the 95% value has been widely accepted. Based on these types of studies, many researchers have assumed that most of the “alternative splicing” reported increases functional diversity, because (essentially) they assume it must be there for a reason – that is, not noise (Blencowe, 2017). The effect of this approach is apparent in genomic databases, which tend to take a permissive approach in including observed transcripts. Thus, the number of splice variants per gene has increased dramatically over the last 10 years (Figure 1.3). However, there are multiple reasons not to take the observed RNA at face value beyond the known error rate of splicing.

Figure 1.3 Splice variants entry into Ensembl steadily increases over time
Using BioMart, I downloaded all human protein-coding genes and all transcript IDs associated to those protein coding genes. Though these data are only for protein-coding genes, these transcript IDs may correspond to transcripts that do not encode for a protein. The Ensembl versions used to generate these figures were 50, 54, 59, 63, 67, 75, 80, 87, 89, 93, 97 and 100. A) The mean number of transcript IDs per Ensembl protein-coding gene has increased over time. Annotated with each data point is the number of protein coding genes for that version of Ensembl. The number of protein-coding genes that each data point corresponds to: 21 779 (2008), 21 411 (2009), 21 882 (2010), 21 783 (2011), 21 941 (2012), 22 805 (2014), 21 997 (2015), 22 280 (2016), 22 352 (2017), 22 638 (2018), 22 717 (2019), and 22 797 (2020) B) The maximum number of transcript IDs linked to a protein coding gene has increased over time. 
These are the genes that each data point corresponds to: CACNA1C (2008 - 2009), DDR1 (2010), EEF1D (2011), ADGRG1 (2012 - 2014), NDRG2 (2015), TCF4 (2016), KCNMA1 (2017), and MAPK10 (2018 - 2020) C) The total number of transcript IDs associated with protein coding genes has increased over time.
[Figure 1.3 panels, each plotted against year (2008 - 2020): A) mean number of transcript IDs per gene; B) max transcript IDs found for a single gene; C) total number of transcript IDs]

Short-read RNA-seq faces a challenge in assigning reads to specific splice variants (Figure 1.4). One critical step in RNA-seq processing is mapping the reads to a reference transcriptome. Typically, there is ambiguity in which transcript a short read comes from. RNA-seq quantification algorithms attempt to address this by using probabilistic methods to allocate reads to transcripts (Merino et al., 2019). However, if a read maps to an exon that is shared between different splice variants for the same gene, the algorithm may assign the read to the wrong splice variant (Hardwick et al., 2016; Steijger et al., 2013). Not only does this mean that RNA-seq pipelines can characterize transcript structure incorrectly, but a lowly expressed, and potentially noisy, splice variant may be reported at a significantly higher expression. It is thus not surprising that algorithms disagree substantially when assigning reads to splice variants (Hong et al., 2018).

Figure 1.4 Long read RNA-seq reduces transcript structure ambiguity
In this toy example, the gene has a total of 5 exons (colored boxes) and has two different splice variants. On the left side of the figure, we have an overview of short-read sequencing. 
In brief, short-read computational pipelines involve taking small reads and mapping them to previously known transcript structures. Since these splice variants have overlapping exons, computational pipelines have to make probabilistic guesses in determining which read belongs to which splice variant. This can lead to error in estimating expression of splice variants. On the right, with long read sequencing the entire transcript can be sequenced in one read. Thus, computational pipelines do not have to map reads to previously known transcript structures. The sequence simply represents the entire transcript.
[Figure 1.4 panel labels: Gene; Short read sequencing (sequence fragments of strands; probabilistic “guesses” of transcript structure); Long read sequencing (sequence entire strand; accurate transcript structure)]

In my work (Chapters 3 and 4), I use long read RNA-seq data as a promising alternative (or at least adjunct) to short-reads (Figure 1.4). Long-read technologies attempt to sequence an entire transcript at once, in principle removing the problem of reconstructing transcript structure (Križanovic et al., 2018). Given the advantages of the approach, long-read sequencing has been used to identify splice variants with more confidence. For example, a recent analysis used this approach to sequence a cell line (GM12878) transcriptome, and reported 78,199 splice variants for more than 12,000 genes (Workman et al., 2018). Furthermore, the authors could not connect ~50% of these splice variants to GENCODE transcripts and suggested that their approach detected “novel” splice variants. While the number of long-read studies grows, to my knowledge my research is the first to examine splice variants identified with long-read RNA-seq in the context of noisy splicing. While there are clear advantages to using long-read sequencing to study splice variants, the current technology has a high base call error rate (5 to 15%) relative to short-read sequencing (<1%) (Zhang et al., 2019).
This could lead to erroneous splice junctions (a technical error, rather than biological error) in the sequenced reads. One approach to alleviate this error is to combine short-read data with long-read data. Junction-spanning short-reads provide a more accurate picture of where specific splice junctions are expressed in the genome. Long-read sequencing pipelines, including mine (Chapters 3 and 4), often use short-reads to correct potentially erroneous splice junctions in the long-read data.

1.4.1.2 Condition-specific function or condition-specific noise
A common argument for “alternative splicing” vastly increasing genomic functional diversity is the differential expression of splice variants (Han et al., 2017). Many claim that differential expression of splice variants implies condition-specific regulation of splice variants, and condition-specific regulation implies condition-specific function. For example, in both landmark papers from Pan et al. and Wang et al., the investigators described the plethora of differentially expressed splice “isoforms” between tissue types (Pan et al., 2008; Wang et al., 2008a). Furthermore, in cases where a gene is not differentially expressed, the splice variant could still be differentially expressed (isoform-switch) (Chen et al., 2019; Merino et al., 2019; Trapnell et al., 2012). However, there are reasons to doubt the interpretation that differentially expressed splice variants imply condition-specific function. Most genes have one dominantly expressed splice variant regardless of tissue-type or cell-type, suggesting that previous claims exaggerate the significance of how much splicing changes across conditions (Gonzalez-Porta et al., 2013). Summed across human genes, the most highly expressed splice variant of each gene accounts for about 85% of the total transcriptomic expression.
Though genes do vary somewhat in the multiple splice variants expressed between tissue- or cell-type, only 35% of genes have an isoform-switch between conditions where a different splice variant has a significantly higher expression level than the gene’s most expressed splice variant in another condition (Gonzalez-Porta et al., 2013). This represents more than a third of protein-coding genes, which is a nontrivial proportion, but much less than 95% of genes.

Crucially, changes in expression are not prima facie evidence for function or distinct function. The splicing process often involves condition-specific regulators – these regulators can bind to mRNA in a stochastic process, similarly to snRNPs (Zhang et al., 2016). The stochastic process and promiscuity of the condition-specific regulators give rise to condition-specific noise (Hu et al., 2017). As the field begins to collect more data about splicing regulation, the challenge remains to separate signal from noise. I am unaware of any studies which specifically explore condition-specific noise for splicing regulators, though there is discussion of condition-specific noise in overall gene expression levels (de Jong et al., 2019).

1.4.1.3 Parallels to ENCODE criticisms
In sections 1.4.1.1 and 1.4.1.2, I discussed the evidence for and against noisy splicing from transcriptomic studies, which parallels the criticism of the initial ENCyclopedia Of DNA Elements (ENCODE) project. The ENCODE project aimed to provide a “parts list” to the human genome, and broadly interpreted any RNA transcription in any of the 147 cell-types they tested as evidence of function (Consortium, 2012; ENCODE Project Consortium, 2004). They found evidence that most of the human genome was transcribed or bound by a protein, and concluded that most of the genome was functional. While the ENCODE project did provide a “parts list”, this key conclusion was heavily criticized.
One central criticism of the ENCODE project was its causal role definition of function (section 1.2) based on any RNA transcription. Challenging this assumption that any reproducible, biochemical activity is biologically meaningful, Eddy (2013) posed a thought experiment known as the “Random Genome Project”. If one were to construct a genome with a random sequence of nucleotides, put the genome in a cell, and provide this genome with all regulatory factors and RNA polymerase, then the cell would surely transcribe this random genome. This is because transcriptional regulatory factors bind to small sites in the genome, and these binding sites will frequently occur at random. While this random genome might be unrealistic, if ENCODE had analyzed the random genome instead of the human genome, they would have erroneously concluded that the random genome was mostly functional. Eddy also pointed out that this principle applies to other biochemical processes, such as mRNA splicing.

A commonly used argument in support of the ENCODE project’s conclusions is that a nontrivial proportion of the detected RNA transcription is cell-type specific and thus regulated for an important function. However, the principles of the random genome still apply (Clark et al., 2011). Eddy points out that condition-specific regulators would still bind to a random genome (Eddy, 2013). This would result in condition-specific noise and applies equally to splice variants. From a modern molecular evolutionary perspective, the criticisms of ENCODE again parallel the criticism of widespread functional alternative splicing (Jensen et al., 2013). I discuss the evolutionary perspective of gene function diversification and alternative splicing in Section 1.6.

1.4.1.4 Can we assess noise from RNA-seq data?
If the observation of a splice variant is insufficient to determine its functionality, it raises the question of how functionality (and distinct functionality) can be determined purely from transcriptomics. To my knowledge, no one has put forth standards for how to determine if a splice variant is biological noise. However, there are standards for whether a sequence of nucleotides with a transcribed product is a gene or biological noise (Pertea et al., 2018b). A proportion of erroneous transcripts are likely lowly-expressed, and some may consider the (standard) removal of lowly expressed transcripts in RNA-seq pipelines as a first step at filtering for noise. However, bioinformaticians may need to be more deliberate as to what they consider “noisy” transcripts. While observing a splice variant is not evidence of function, the case for functionality can be strengthened if the transcript is expressed “appreciably”.

The evaluation of “appreciable expression” may have to be on a gene-by-gene basis, and few studies explore this (Gonzalez-Porta et al., 2013; Meer et al., 2019). The central point is that a splice variant that is present at one molecule per cell is unlikely to have a function, especially if that molecule must interact with another species having a much higher concentration. Thus, it is not only reasonable to prioritize splice variants expressed at high levels, it is less reasonable to focus on splice variants expressed at extremely low levels. As part of Chapter 2 and Chapter 3, I explore the impact of prioritizing genes with FDSIs based on splice variant expression relative to overall gene expression.

1.4.2 Proteomics: Ribosomal profiling vs. Mass spectrometry
The difficulties of determining whether a certain expression level of an RNA is compatible with function can be ameliorated if the transcript is known to result in protein expression.
This is reasonable even while recognizing that the translational machinery has the same potential for producing noise as the transcriptional machinery. Observation of a protein thus can be taken to be consistent with functionality. For this reason, there is interest in gathering evidence of splice variant translation. There are two dominant high-throughput approaches used: mass spectrometry (MS) and ribosomal profiling (or ribo-seq). In this section, I review applications of proteomics to understand the impact of genes with multiple splice variants on protein diversity.

1.4.2.1 Ribosomal profiling
Ribosomal profiling (ribo-seq) aims to isolate ribosome-bound mRNA. The motivation for this approach rests on the assumption that any mRNA bound to the ribosome is being translated. If a gene has multiple splice variants detected via ribo-seq, one can infer that the gene has multiple translated splice variants. In the ribo-seq literature, I found a wide range of claims about the exact impact of genes with multiple splice variants on functional diversity – estimates range from 40% to 70% of the protein coding genes having multiple translated splice variants (Blencowe, 2017; Weatheritt et al., 2016). Weatheritt et al. reported one of these extremes. They surveyed exon-skipping events across 5,463 genes and found evidence that between 75% and 85% of splicing events detected in RNA-seq were also bound to the ribosome. Thus, they concluded that ~70% of genes have multiple splice variants that translate into a protein. However, ribo-seq has some limitations that reduce confidence in this estimate. The initial step to isolate ribosome-bound mRNA sometimes retrieves non-ribosomal RNA-protein complexes (Ingolia, 2016; Ji et al., 2016). Ribo-seq experiments have been shown to detect RNAs with no translation potential, such as lncRNAs (Guttman et al., 2013).
Furthermore, due to ribosomal stalling, scanning and other quality control mechanisms, a bound mRNA is not guaranteed to undergo successful translation (Inada, 2017).

Accounting for some of the aforementioned technical limitations in ribosomal profiling and the ribosome’s quality control mechanisms, Reixachs-Solé et al. determined that 53% of genes have multiple translated splice variants (Reixachs-Solé et al., 2019). This contrasts with the 95% statistic reported based on transcript-level observations (section 1.4.1), for which Reixachs-Solé et al. offered two explanations. First, ribosomal profiling has limited coverage of the transcriptome in comparison to RNA-seq. Second, RNA-seq tends to artificially amplify fragments of erroneous transcripts. Reixachs-Solé and colleagues concluded that the latter reason is more likely, as their 53% statistic agrees more with mass spectrometry and conservation studies. In the next section, I discuss these mass spectrometry studies, and in Section 1.6, I discuss the conservation studies.

1.4.2.2 Mass spectrometry
In contrast to the RNA-based approaches discussed thus far, protein mass spectrometry (MS) can provide more direct evidence of translation. The mass spectrometer breaks down proteins into small peptides and calculates the mass-to-charge ratio of a single small peptide in increasingly larger fragments (e.g. first fragment is first amino acid of peptide, second fragment is the first two amino acids, etc.) (Coon et al., 2005). This allows investigators to determine the small peptide’s sequence. In the context of genes with multiple splice variants, one can look for whether distinct exons of splice variants for the same gene are present in MS data. This is often referred to as looking for the “discriminating peptide”.
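The search for discriminating peptides described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the procedure used by any of the cited studies: it assumes a simplified in-silico tryptic digest (cleavage after K or R unless followed by P, no missed cleavages), a hypothetical minimum peptide length of 6, and toy isoform sequences and function names invented for the example.

```python
from collections import defaultdict


def tryptic_peptides(protein, min_len=6):
    """Simplified trypsin digest: cut after K or R unless the next residue is P."""
    peptides, start = set(), 0
    for i, aa in enumerate(protein):
        is_last = i == len(protein) - 1
        cuts_here = aa in "KR" and (is_last or protein[i + 1] != "P")
        if cuts_here or is_last:
            peptide = protein[start:i + 1]
            if len(peptide) >= min_len:  # assumed cutoff; short peptides are dropped
                peptides.add(peptide)
            start = i + 1
    return peptides


def discriminating_peptides(isoforms):
    """Map each isoform ID to the peptides produced by no other isoform of the gene."""
    owners = defaultdict(set)
    for iso_id, seq in isoforms.items():
        for peptide in tryptic_peptides(seq):
            owners[peptide].add(iso_id)
    unique = {iso_id: set() for iso_id in isoforms}
    for peptide, ids in owners.items():
        if len(ids) == 1:
            unique[next(iter(ids))].add(peptide)
    return unique


def isoforms_with_support(isoforms, observed_peptides):
    """Return isoform IDs that have at least one observed discriminating peptide."""
    unique = discriminating_peptides(isoforms)
    return sorted(i for i, peps in unique.items() if peps & observed_peptides)


# Toy gene (invented sequences): two isoforms differing by one mutually exclusive exon.
toy_gene = {
    "isoform_1": "MAGSLTEK" + "VLDNPQAR" + "LLSDEAFGK",
    "isoform_2": "MAGSLTEK" + "TTWEGHIK" + "LLSDEAFGK",
}
observed = {"LLSDEAFGK", "VLDNPQAR"}  # hypothetical peptides identified by MS
supported = isoforms_with_support(toy_gene, observed)
```

In this toy case only isoform_1 is supported, because the shared peptide LLSDEAFGK cannot distinguish the two isoforms. A gene would count as having MS evidence for multiple translated splice variants only if `isoforms_with_support` returned at least two isoform IDs.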
I found a wide range of estimates for the number of genes with MS evidence of multiple splice variants, ranging from 2% to 100% (Abascal et al., 2015a; Ly et al., 2014a; Wilhelm et al., 2014; Yang et al., 2016). At the high end, Ly and colleagues concluded that all genes likely have multiple translated splice variants. The peer reviewers of their study and other research groups have reported the limitations of this study, such as the lack of replicates (Ly et al., 2014b). For my purposes, the central limitation is that this study (and other studies yielding high estimates of splice variants in mass spectrometry experiments) identifies the discriminating peptide of only one splice variant per gene, instead of at least two splice variants from the same gene. But even ignoring the Ly et al. study, the upper bound of the estimates still remains large: 57% of genes.

While I do not use MS data in my thesis, the study by Abascal et al. (2015a) and the response to the study are important to outline because they had a profound impact on proteomics studies into genes with multiple splice variants. Abascal and colleagues reported that ~2% of genes have MS evidence for multiple translated splice variants, based on whether a gene had at least two splice variants with discriminating peptides (Abascal et al., 2015a). For this analysis, these investigators collected over 100 MS experiments across different human cell types and tissues from different studies. Since the groups who originally performed the MS experiments did not use identical methods, Abascal and colleagues controlled for certain differences. These differences included the type of spectra search engine used, the spectra scoring scale, and the accepted FDR threshold. Furthermore, Abascal and colleagues only accepted a splice variant as a true positive if they detected the splice variant in more than one experiment.
Using these filters, they reported that only 246 out of 12,227 genes (~2%) have support for multiple translated splice variants, despite initially detecting 277,244 proteins. In agreement with this statistic, a recent MS analysis from the GTEx Consortium reported that 353 out of 6,963 genes (~5%) have evidence for multiple splice variants (Jiang et al., 2020).

The dearth of alternative splicing evidence reported by Abascal and colleagues sparked a rebuttal from Blencowe (2017). Blencowe had three main criticisms. First, he claimed the literature has hundreds of examples of functionally important splice variants, and that this number is only limited by the effort needed to study each gene. Second, he argued mass spectrometry experiments have technical limitations in detecting lowly expressed or condition-specific peptides. Third, Blencowe cited ribo-seq studies, which tend to capture a wider breadth of splice variants, closer to those found in RNA-seq studies. Thus, using ribosomal profiling data, Blencowe claimed that alternative splicing vastly increases proteomic complexity despite the limited protein evidence.

In response, Tress and colleagues (Tress et al., 2017a) argued that the mass spectrometry data was only an adjunct to evidence based on conservation and protein domain analysis (discussed further in sections 1.5 and 1.6). Furthermore, they challenged whether the limitations of mass spectrometry really explained the substantial dearth of alternative splicing evidence in their study. The mass-spectrometry based findings are more compatible with the current understanding of molecular evolution (neutral theory) whereas the ribo-seq studies cited by Blencowe did not provide an evolutionary analysis (see section 1.6) (Kimura, 1968; Ohta, 1992; Saudemont et al., 2017).

Despite these counterarguments, there is some validity to Blencowe’s critiques.
The mass spectrometry analysis done by Abascal and colleagues failed to detect some well-documented genes with FDSIs, such as BDNF and CALCA. This is a challenge of prioritizing putative genes with FDSIs without a gold standard list. Perhaps the stringent filters applied to the mass spectrometry experiments excluded these genes, and future studies require somewhat relaxed constraints. Computational and high-throughput studies need a list of gold-standard genes with FDSIs to improve the tools we use to understand alternative splicing (see Chapter 2). Nevertheless, in order to resolve the noisy splicing debate and determine the functional relevance of genes with multiple splice variants, we will require more experimental data, and computational prioritization will help on this matter. The genes that Abascal et al. found in their investigation, despite their arguably stringent approach, are of interest for follow-up. In Chapter 3, I provide my own prioritized list of genes with FDSIs.

1.5 Impact of splicing on protein domains
Alteration of protein domains among splice variants from the same gene may provide evidence of functional diversity. Protein domains are evolutionarily conserved structural and functional amino acid sequences. Intuitively, one may expect their alteration to result in alteration of protein function (Lees et al., 2016; Light and Elofsson, 2013; Sulakhe et al., 2018). In general, there are three scenarios in which one can use protein domain annotations to infer whether alternative splicing has increased functional diversity of a gene:
1. Addition or removal of a domain.
2. Modification of a domain (annotatable domains remain the same, but with a different amino acid sequence).
3. Modification of the remainder of the protein, without modification of a domain.
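The three scenarios listed above can be expressed as a simple comparison between two splice variants of the same gene. The sketch below is illustrative only: the data layout (a protein sequence plus a list of `(name, start, end)` domain annotations with 0-based, end-exclusive coordinates), the function names, and the toy sequences and domain names are all assumptions made for the example, not the format of any particular domain database.

```python
from collections import Counter


def domain_sequences(isoform):
    """Sorted (domain name, amino acid sequence) pairs for the annotated domains."""
    protein = isoform["protein"]
    return sorted((name, protein[start:end]) for name, start, end in isoform["domains"])


def classify_pair(iso_a, iso_b):
    """Assign a pair of splice variants to one of the three domain scenarios."""
    names_a = Counter(name for name, _, _ in iso_a["domains"])
    names_b = Counter(name for name, _, _ in iso_b["domains"])
    if names_a != names_b:
        return "1: domain added or removed"
    if domain_sequences(iso_a) != domain_sequences(iso_b):
        return "2: domain modified"
    if iso_a["protein"] != iso_b["protein"]:
        return "3: remainder of protein modified"
    return "no protein-level difference"


# Toy splice variants (invented sequences and hypothetical domain names).
reference = {
    "protein": "MKT" + "AAAAWWWW" + "GGS" + "CCCCDDDD" + "LE",
    "domains": [("DomX", 3, 11), ("DomY", 14, 22)],
}
lacks_domain = {
    "protein": "MKT" + "AAAAWWWW" + "GGS" + "LE",
    "domains": [("DomX", 3, 11)],
}
altered_domain = {
    "protein": "MKT" + "AAAAYYYY" + "GGS" + "CCCCDDDD" + "LE",
    "domains": [("DomX", 3, 11), ("DomY", 14, 22)],
}
altered_linker = {
    "protein": "MKT" + "AAAAWWWW" + "GGSPPPP" + "CCCCDDDD" + "LE",
    "domains": [("DomX", 3, 11), ("DomY", 18, 26)],
}
```

The checks are ordered from the coarsest difference to the finest, so a pair that both loses a domain and changes a linker is reported under scenario 1. A real analysis would also need a rule for partially annotated (incomplete) domains, which this sketch ignores.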
None of the above scenarios sufficiently constitute evidence of a gene with FDSIs; however, given that we can infer function from protein domains, cataloging genes with splice variants that have distinct domains seems a viable approach to prioritization of genes more likely to harbor FDSIs. As such, I use protein domain annotations for my analyses in Chapters 3 and 4.

For the purposes of using protein domain annotations on high-throughput datasets, I developed a hierarchy of how protein domains can relate to functional distinctness (Figure 1.5). At level 1 are all genes with multiple splice variants. At this level, a gene can have only one splice variant with an annotatable domain, while the remainder of the gene’s splice variants lack any annotatable domain. For level 2, genes must have at least two splice variants where both splice variants have at least one complete domain. Both splice variants could have the exact same set of domains among them. At level 3, a gene must have at least two splice variants where each splice variant has a different set of domains. One might assume that the gene has FDSIs and the differences in domains guide that distinctness. Finally, at level 4, the gene has at least two splice variants where each splice variant has at least one domain from different domain families. As domains can be grouped together based on sequence and functional similarity, a gene at this level may have more intrinsic functional differences in its FDSIs than a gene only at level 3. The protein domains, the annotation approach and the domain families can be defined by a protein domain database (such as Pfam (El-Gebali et al., 2019) or CDD (Marchler-Bauer et al., 2015)). This hierarchy guides part of the analysis in Chapter 3.

Figure 1.5 Hierarchy for the functional distinctness of protein domains for a gene’s splice variants
I produced this hierarchy to aid in my study of splice variants likely to have FDSIs based on splice variant domain annotations.
Lines with or without shapes represent splice variants, and splice variants with the same color belong to the same gene. Each shape represents a protein domain. Level 1: All genes with multiple splice variants are in this level, regardless of their annotatable domains. Level 2: At this level, the gene must have at least two splice variants where each splice variant has at least one complete annotatable domain. The domains can be the same among a gene’s splice variants, or different. Level 3: At this level, a gene must have at least two splice variants where the protein domains differ between the splice variants. Level 4: Genes in this level have at least two splice variants where the annotated domains for each splice variant are from a different domain family.
[Figure 1.5 values, number of genes (% genes) at each level by domain database:
Level 1, gene has multiple splice variants: CDD 6,799 (100%); Pfam 6,799 (100%)
Level 2, gene has at least two splice variants with complete protein domains: CDD 5,649 (83%); Pfam 5,437 (80%)
Level 3, gene has at least two splice variants with different protein domains: CDD 435 (5%); Pfam 1,323 (19%)
Level 4, gene has at least two splice variants with different domain families: CDD 424 (4%); Pfam 714 (10%)]

1.5.1.1 Adding or removing a domain
Alternative splicing can excise or include an exon which wholly contains the protein domain within the exon’s boundaries (Kriventseva et al., 2003). This may result in the swapping of protein domains among splice variants of the same gene, or the creation of splice variants that contain extra or missing domains. In short, a gene can have multiple splice variants each with its distinct set of protein domains (Level 3 in Figure 1.5).

High-throughput studies, such as the mass-spectrometry study by Abascal and colleagues, sometimes consider whether alternative splicing added or removed protein domains (Abascal et al., 2015a; Tress et al., 2017b).
They reported that 7.1% of their mass spectrometry-validated splice variants (246 genes) resulted in the loss of a domain, while the majority of splicing events had no effect on domain architecture. In contrast, when they analyzed GENCODE splicing events, they reported that about one-third of splicing events led to the exclusion of at least one whole protein domain. Similarly, Eksi and colleagues reported that a third of mouse genes have an addition or loss of an entire protein domain due to alternative splicing (Eksi et al., 2013).

1.5.1.2 Modifying a domain
In addition to adding and removing domains, alternative splicing can modify domains by swapping in mutually exclusive exons for a given domain. Doing so maintains the general structure of the domain but might alter the specific function. For example, a gene’s splice variants may have a protein domain that plays a role in cell-recognition. Variation in this domain among splice variants could enable interactions with additional binding partners.

A famous example of domain modification via splicing comes from studies of Drosophila melanogaster. In the Drosophila nervous system, the expression pattern of Dscam1 splice isoforms plays a role in regulating synaptic composition and cellular self-avoidance (Millard et al., 2010; Wojtowicz et al., 2007). Dscam1 has three clusters of exons where the combination of exons can potentially yield up to 38,000 splice variants. All splice variants for this neural receptor contain 10 Ig domains, with changes in the precise structure among isoforms (Schmucker et al., 2000; Wojtowicz et al., 2004). Each cell stochastically selects a specific combination of exons for expression (how this is accomplished is unknown), so that no neuron has the exact same combination of Dscam1 isoforms expressed on its surface (Miura et al., 2013).
Each neuron’s unique Dscam1 identity facilitates the neuron’s binding to other neurons with a different Dscam1 splice isoform repertoire while inhibiting neurons from binding to themselves (self-avoidance) (Matthews et al., 2007). Critically, limiting Dscam1 splice isoform diversity has phenotypic effects, providing direct evidence that Dscam1 has multiple FDSIs (Huang et al., 2011; Schmucker et al., 2000).

The example of Dscam1 suggests that variation of amino acid sequence, while maintaining domain structure, is a potential mechanism for isoform functional diversity. However, it is important to note that the evidence for functional diversity in these cases does not rely simply on observing the presence of the RNA molecules, but on evidence from conservation, protein expression, and experimentally showing that the transcriptomic diversity contributes to the function of the gene (Han et al., 2010; Nicoludis et al., 2015; Schreiner et al., 2015; Watson et al., 2005).

1.5.1.3 Maintaining domains while modifying other parts of a protein
Alternative splicing can cause a single gene’s splice variants to have a different amino acid composition in the parts of a protein without an annotatable domain. This might preserve the overall function of the gene, as the necessary protein domains remain intact for each splice variant, but the variability in the rest of the protein causes some change in regulation. In a survey of Ensembl splice variants, Tress et al. noted that most splicing events between protein domains are frame-shifting, which disrupts protein domains (Tress et al., 2017a). For only a few genes did they report that the domains were maintained.

There are documented cases of non-domain-affecting splicing having a role in generating functional diversity. For example, the CACNA1 genes encode the α1 subunit component of voltage-gated calcium channels (VGCCs), and each gene produces numerous splice variants. The
The    30 functional splice variants have the same four pore-forming domains (Bourinet et al., 1999; Catterall, 2011). VGCCs in the nervous and cardiovascular system require these four pore-forming domains to regulate calcium ion influx when the cell depolarizes (Weyrer et al., 2019). The absence or change of any of these four pore-forming domains causes the loss of a functional VGCC (Arikkath and Campbell, 2003; Catterall, 1995; Guida et al., 2001; Heinemann et al., 1992).  Rather than change the pore-forming domains of the a1 subunit, alternative splicing causes the amino acid residues between the domains to differ between splice variants of the CACNA1 genes (Adams et al., 2009; Bourinet et al., 1999; David et al., 2010; Hirano et al., 2017; Powell et al., 2009; Snutch et al., 1991). This difference between the transmembrane domains is hypothesized to provide binding sites for regulators. For example, Calnexin binds to specific splice variants of CACNA1 genes to inhibit calcium ion influx (Proft et al., 2017). Splicing of VGCC genes is often a focus of study. CACNA1 genes and their splice variants are evolutionarily conserved and have evidence of translation (Lipscombe et al., 2013a; Richards et al., 2010). Knockdowns of Cacna1h’s exon 25 in rat epilepsy models show a reduction in seizures (Cain et al., 2018). However, there are challenges to determining FDSIs in CACNA1 genes. One challenge is short read sequencing, since this provides inaccurate characterization of the splice variant profiles of each CACNA1 gene. There has been some effort to get a more accurate picture for CACNA1 splice variants using long read RNA-sequencing technology (Clark et al., 2018). In Chapter 4, I investigate the splicing profile of one relatively poorly studied VGCC gene, Cacna1e.     31 1.6 Most splice variants evolve neutrally As with many biological processes, biologists seek an evolutionary explanation for the fact that most genes have multiple splice variants. 
Many claim that alternative splicing solves the “G-value paradox” – the finding that the number of genes in a genome does not predict organismal complexity (Bush et al., 2017; Hahn and Wray, 2002; Schad et al., 2011). However, we must first ground any evolutionary analysis with modern evolutionary concepts.  The exact evolutionary explanation of spliceosomal introns continues to be an ongoing area of research, but the most explored ideas do not hypothesize that splicing has evolved to generate within-gene functional diversity (Doolittle et al., 2014). One popular concept is the intron-late hypothesis (Koonin, 2006). Under this hypothesis, around the time of eukaryogenesis, ancestral cells were invaded by selfish group II introns (self-splicing). Along with the development of other eukaryotic properties (nucleus, linear chromosomes, etc.), the spliceosome evolved to excise these introns and the evolutionary constraint for these introns to self-splice relaxed. The intron-late hypothesis also postulates that with the relaxation of selection pressure in specific eukaryotic lineages, introns grew, and genomes began to accumulate more introns. The evolution of alternative splicing is likely a consequence of mis-splicing occasionally creating advantageous splice variants that then evolved under evolutionary constraint (Roy and Irimia, 2008, 2009). These evolutionary models position alternative splicing as a source of variability which may or may not be subject to selection, rather than an evolved mechanism to generate functional diversity in the first place. This reinforces our reasoning that we should not assume every single splice variant is functional. As I will elaborate in this section, identifying genes with multiple splice variants evolving under constraint is central to computational studies of genes with FDSIs.    
Modern evolutionary theory – and how we view molecular processes like splicing – shifted with the emergence of the neutral theory of molecular evolution (Kimura, 1968; Kimura and Ohta, 1972). Kimura proposed that random genetic drift, rather than Darwinian selection, can be a cause of allele fixation, and in fact is predicted to be a major force in evolution (Kimura, 1983). This is particularly true in species with small effective population sizes, including humans and most mammals, such that selection against mildly detrimental variation is too weak to eliminate it effectively. Thus it is thought that many variants have become fixed in the population despite offering no advantage to organismal reproductive success (Charlesworth, 2009). Though many evolutionary biologists originally challenged Kimura’s claims (the selectionist-neutralist debate), the principles of neutral evolution and drift are now widely accepted and integrated into evolutionary theory and quantitative models of population genetics (Charlesworth and Charlesworth, 2018; Nei, 2005).

As a consequence of the neutral theory, modern molecular evolution studies center around separating neutrally evolving biological processes from biological processes that evolve under selection (Li and Zhang, 2019; Liu and Zhang, 2018; Xu et al., 2019; Zhang, 2018). In these studies, the investigators treat a given molecular event as a hypothesis test: the null hypothesis is neutral evolution while the alternative hypothesis is selection. They then analyze the data to determine whether there is enough evidence to reject the null hypothesis of neutral evolution. Thus, the absence of evidence for selection is the failure to reject neutral evolution.

In this section, I discuss the three possible evolutionary scenarios for a splice variant: evolution under the two different types of selection (negative or positive selection) and evolution under genetic drift (neutral evolution).
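The point above about small effective population sizes can be made concrete with a short calculation. The sketch below uses Kimura's standard diffusion approximation for the fixation probability of a new mutation; it is an illustration added here, not an analysis from this thesis, and it assumes a diploid population whose effective size equals its census size N, a single new copy of the mutation, and a selection coefficient s acting additively. The population sizes and the value of s are illustrative choices.

```python
import math


def fixation_probability(s, n):
    """Fixation probability of a new mutation (Kimura's diffusion approximation).

    Assumes a diploid population with effective size equal to census size n,
    initial frequency 1/(2n), and selection coefficient s (s < 0 is deleterious).
    """
    if s == 0:
        return 1.0 / (2 * n)  # neutral case: fixation by drift alone
    # (1 - exp(-2s)) / (1 - exp(-4ns)), written with expm1 for numerical stability
    return math.expm1(-2 * s) / math.expm1(-4 * n * s)


def relative_to_neutral(s, n):
    """Fixation probability divided by that of a neutral mutation, 1/(2n)."""
    return fixation_probability(s, n) * 2 * n


# A mildly deleterious variant (illustrative s) in a small and a large population.
s_mild = -1e-5
small_population = relative_to_neutral(s_mild, 10_000)
large_population = relative_to_neutral(s_mild, 1_000_000)
```

With these illustrative values, the mildly deleterious variant fixes roughly four-fifths as often as a neutral one when N is 10,000, so selection barely distinguishes it from a neutral variant, whereas in a population of one million it essentially never fixes. This is the sense in which weak selection is ineffective in small populations.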
1.6.1 Few splice variants have conservation evidence

Splice variant conservation, or the presence of a splice variant in multiple species, indicates that a splice variant evolves under negative selection, and mutations detrimental to the splice variant will be purified from the population. In other words, the splice variant affects reproductive success. Splice variant conservation is a powerful metric in assessing whether a gene has multiple necessary splice variants. If a gene has two or more conserved splice variants, this provides evidence that the gene has multiple necessary splice variants. Thus, this gene likely has FDSIs, based on my operational definition. A lack of conservation evidence means that the null hypothesis of neutral evolution cannot be rejected.

While an early RNA-seq study reported that ~86% of multi-exonic human genes have multiple conserved “alternative” splice junctions (Wang et al., 2008a), the current consensus is that most mammalian genes do not have multiple conserved splice variants. The work by Wang et al. used PhastCons, a conservation metric which is not suitable for nucleotide-level observations, but rather designed to detect conserved genomic regions (Pollard et al., 2010; Siepel et al., 2005). When Pickrell and colleagues reanalyzed splice junctions using a more appropriate method, PhyloP, they reported a much lower proportion of conserved “alternative” splice junctions, and concluded that drift drives most splice variant evolution (Pickrell et al., 2010). Indeed, the same research group as Wang et al. has also since published work agreeing that only a minority of splice variants were conserved throughout vertebrate evolution (Merkin et al., 2012). The findings of Pickrell et al. and Merkin et al. were later confirmed in studies of RNA-seq datasets from primates (Reyes et al., 2013; Xiong et al., 2018).
Thus, the general agreement is that most mammalian genes do not have multiple splice variants evolving under negative selection.

In proteomic studies, splice variant conservation was correlated with splice variants present in mass-spectrometry datasets (Abascal et al., 2015a; Tress et al., 2017b). These analyses found that 246 of 12,227 genes (~2%) have multiple detected splice variants in mass-spectrometry datasets, and the investigators confirmed the conservation of these genes’ multiple splice variants using BLAST homology searches. They found that all 246 of these genes had orthologous splice variants in 5 distantly related species, thus providing multiple lines of evidence for FDSIs in these genes.

In this section, I described the dearth of evidence for conservation of most splice variants. Though their methods may differ, multiple research groups have concluded that most mammalian genes have only one conserved splice variant. This means that most genes have only one splice variant present in multiple species, and the remaining splice variants are likely evolving under genetic drift. In the next section, I discuss the dearth of evidence that genes with multiple splice variants are lineage-specific innovations.

1.6.2 Limited evidence for positively selected splice variants

A counterargument to the dearth of multi-isoform genes detected by conservation is the fact that many “alternative” splicing events are lineage-specific and may have lineage-specific functions. Support for this could come from signs of positive selection. In contrast to negative selection, splice variants evolving under positive selection would be detected as adaptive lineage-specific innovations and suggest functional importance to one species (Blencowe, 2017; Lu et al., 2009; Xing and Lee, 2005a, 2006). In other words, lack of conservation in a phylogeny could be misleading and most species-specific splice variants could, in theory, have selected effect functions (Section 1.3).
However, only a low proportion of mammalian genomes is thought to be under positive selection – for humans, about 0.03% (Lunter et al., 2006) – which suggests an upper bound on how often this could hold at the level of splice variants. Detection of positive selection is challenging in general, especially in species with low effective population sizes where the lineage-specific fixation of neutral or nearly-neutral sites is common (Booker et al., 2017). As such, tests for positive selection use neutral evolution as the null hypothesis. Additionally, distinguishing between positive selection and neutral evolution generally requires species-specific population-level sequence data to test for evidence of within-lineage constraint, though some tests may be done with multiple species. Under our operational definition, if a gene has multiple splice variants evolving under selection (constraint), and at least one of the variants has evidence of evolving under positive selection, it would be evidence in favor of the gene having FDSIs.

The evidence is strongly against the idea that many mammalian splice variants are evolving under positive selection. The same two research groups which published the claim that ~95% of multi-exonic genes have multiple splice variants published two high-profile, multi-species, multi-tissue RNA-seq studies with the conclusion that most “alternative” splicing is species-specific (Barbosa-Morais et al., 2012; Merkin et al., 2012). While this observation is likely true, the authors from both groups concluded that this species-specific splicing drove species-specific phenotypic diversity without acknowledging the effects of genetic drift in their analyses. In contrast, in a later study, Reyes et al. performed a similar analysis but tested for selection. They were able to reject the null hypothesis of neutrality for only a minority of splice variants (Reyes et al., 2013).
While the inability to reject the null hypothesis is not evidence in favor of the null hypothesis, Reyes et al. concluded that most splice variants were neutrally evolving, given our understanding of the neutral theory and noisy splicing.

In terms of direct tests of splice variants evolving under positive selection, there has not been substantial evidence. To my knowledge, the only experimental support for this claim originates from frequently cited work by Xing and Lee (Xing and Lee, 2005b, 2005a). Unfortunately, their claim appears to be based on a misinterpretation of their data and is not supported by other work. Xing and Lee calculated selection pressures based on the ratio of nonsynonymous to synonymous substitutions (Ka/Ks). A ratio of less than 1.0 is generally taken to indicate negative selection, while a ratio greater than 1.0 signals positive selection; a ratio near 1.0 indicates neutral evolution (Spielman and Wilke, 2015). Xing and Lee reported that most lineage-specific splice variants have a Ka/Ks ratio closer to (but still less than) 1.0 compared to conserved splice variants, consistent with weak negative selection or neutrality, not positive selection. Furthermore, they did not perform appropriate hypothesis testing (Jukes, 2000; Yang, 1998). Nevertheless, they reported that non-conserved splice variants evolved rapidly and then falsely equated “relaxed conservation” or “relaxed selection” with lineage-specific innovations, a severe mischaracterization of modern evolutionary theory (Tress et al., 2017a). In fact, the lack of conservation is more consistent with genetic drift. It is important to point this out because Xing and Lee are frequently cited in support of claims against the noisy splicing model (essentially, that the alleged noise is actually important evolutionary innovation).

Importantly, other studies of positive selection have failed to support the claims of Xing and Lee. Xiong et al.
(2018) used a standard interpretation of Ka/Ks ratios to suggest that in fact most non-conserved exons in primates are evolving neutrally (Xiong et al., 2018). Tress and colleagues attempted to provide evidence for positively selected splice variants from GENCODE but failed to find any (Tress et al., 2017b). Using allele frequencies from the 1000 Genomes Project, they annotated the variation in non-conserved splice variants. If a positively selected splice variant exists, then that splice variant should be responding to evolutionary constraints and be less variable in the human population. All non-conserved splice variants had an enriched amount of variation, suggesting that these splice variants evolved neutrally, and not under positive selection.

While the weight of the evidence is against positive selection being a general feature of non-conserved exons (and splice variants), any individual case would be interesting – no matter how few may exist. With the increase in population-level data, there are many methods to determine positive selection on splice variants across the genome (Hsiao et al., 2016; Ramensky et al., 2008). To better refine the search for positively selected splice variants, one likely needs to focus on specific genes and molecular processes that may be biased towards positive selection. For example, genes whose products function at the interface between the environment and organism tend to evolve under positive selection, such as genes involved in reproduction, digestion, sensory and immune function (Voight et al., 2006). Furthermore, some intrinsically disordered domains (IDDs) evolve under positive selection (Afanasyeva et al., 2018). Given the subset of splice variants that have IDDs, some new cases of genes with FDSIs may emerge with better tests of, and data on, positive selection.

1.6.2.1 Exceptional splicing in humans?
A common misunderstanding of genes with multiple splice variants is based on the observation that many splice variants are only present in humans (and not other primates or mammals), and therefore alternative splicing is at least partially responsible for our “special” status and complexity (Shabalina et al., 2014). Besides the flawed tests of positive selection discussed above, I offer two likely explanations for the correlation between number of splice variants and organismal complexity. First, cellular complexity (as measured by number of cell types, a commonly used proxy for organismal complexity) and number of splicing events tend to negatively correlate with effective population size (Bush et al., 2017). Neutral and slightly detrimental variants tend to become fixed more often in species with low effective population sizes. This observation may derive from the breeding rate in small populations, or sampling approaches for allele frequencies. Regardless, the low population sizes in mammalian species mean that many neutrally evolving splice sites will become fixed in the population. Second, the number of splice variants documented per species is likely a function of factors such as sequencing depth (Zhang et al., 2017). Increasing sequencing depth will increase sensitivity in detecting more splice variants. Humans, for example, tend to be the subject of more RNA-seq studies, with deeper sequencing, than other species. Thus, we are more likely to detect more splice variants in humans than in other species. In Table 1.4, I highlight key RNA-seq studies from a variety of species, the sequencing depth, and the splicing proportion reported. The suggestion that humans are exceptional because we have more genes with multiple splice variants is likely just a result of deeper sequencing.
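The sequencing-depth argument can be illustrated with a short calculation. The sketch below is illustrative only and is not part of any cited analysis: it assumes reads sample splice junctions independently (a Poisson approximation), and it uses a hypothetical junction that contributes one in ten million reads. The two read counts are those reported for the human and rabbit studies in Table 1.4; the function name and the `min_reads` parameter are assumptions made for the example.

```python
import math


def detection_probability(total_reads: int, junction_fraction: float,
                          min_reads: int = 1) -> float:
    """Probability of observing at least `min_reads` reads from one junction.

    Assumes the number of supporting reads is Poisson-distributed with
    mean total_reads * junction_fraction (an idealization).
    """
    lam = total_reads * junction_fraction
    # P(X < min_reads) for X ~ Poisson(lam)
    p_below = sum(math.exp(-lam) * lam ** k / math.factorial(k)
                  for k in range(min_reads))
    return 1.0 - p_below


# Hypothetical rare junction: one read in ten million (assumption).
RARE_JUNCTION = 1e-7

# Read counts as listed in Table 1.4.
human = detection_probability(435_000_000, RARE_JUNCTION)
rabbit = detection_probability(1_194_102, RARE_JUNCTION)

print(f"human study:  {human:.3f}")
print(f"rabbit study: {rabbit:.3f}")
```

Under these assumptions the same rare junction is almost certain to be seen at the human study's depth and is missed most of the time at the rabbit study's depth, even if the junction were equally abundant in both species.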
Species | Proportion of multi-exonic genes with multiple splice variants | Number of reads | Effective population size | PMID
Human | 95% | 435,000,000 | 10,000 | 18978772
Mouse | 93% | 140,000,000 | 25,000 | 18516045
Rat | 84% | 20,000,000 | - | 29116075
Rabbit | 44% | 1,194,102 | - | 28794490
Drosophila melanogaster | 37% | 6,453,796,999 | 822,351 | 27935948
Saccharomyces cerevisiae | 1-4% | - | 2,562,065 | 26469855

Table 1.4 The percentage of genes with multiple splice variants in a species is likely driven by the number of sequencing reads and effective population size
The data in “Proportion of multi-exonic genes with multiple splice variants” and “Number of reads” correspond to the study under the “PMID” column. Effective population size estimates were from (Charlesworth and Charlesworth, 2018).

In summary, the neutralist perspective of alternative splicing parallels the noisy splicing model (Saudemont et al., 2017). Much of human splicing is present only in humans, and most of these splice variants are likely evolving neutrally (Bush et al., 2017). Since neutral evolution does not equate to any selective advantage for reproductive success, humans and other organisms likely do not need most splice variants. Thus, for my thesis, unless a gene has two or more splice variants evolving under selection rather than neutral evolution, it likely does not have FDSIs.

1.7 Thesis outline and research contributions

As I have described, there is a debate about how much alternative splicing contributes to genomic functional diversity. One of the primary arguments in favor of broad, functional importance to most splice variants is that the lack of functionality of most splice variants has not been proven. Proving a negative is impossible, so no amount of negative evidence will be satisfying to those who prefer to assume an observed transcript is functional. I take the stance that to make progress in this discussion, we must seek out positive evidence for more genes with FDSIs.
At the same time, I am convinced that the noisy splicing model is valid, implying there is a potentially vast number of irrelevant splice variants to sift through. To address this needle-in-a-haystack problem in a pragmatic fashion, I have taken the approach that the null hypothesis for any observed transcript is that it is non-functional, and computational evidence can lead to a provisional rejection of that null hypothesis. In this framework, the genes (and transcripts) for which we have the most evidence against the null hypothesis are candidates for further study. Genes for which we fail to reject the null hypothesis are simply “not guilty” of having multiple functional isoforms based on the evidence presented, and we withhold judgement on whether they are truly “innocent”. Below I briefly review my objectives and findings from the three research chapters.

As a first step, I felt it was important to document as many cases of genes with FDSIs as possible. It could be that most genes in fact do have experimental evidence of FDSIs already reported. Even if not, a list of known cases would strongly inform computational approaches for automated identification of candidates. Therefore, in Chapter 2, I document the literature support for genes with FDSIs, where each FDSI has ex silico evidence of necessity. My first goal was to ground the field in the evidence-based reality of alternative splicing function by producing a gold standard set of genes with literature evidence of FDSIs. With a team of trained curators, I curated the literature for 743 human and mouse genes and found that only ~5% have support for FDSIs. I conclude that the claim that “alternative splicing vastly increases the functional diversity of the genome” is extrapolated, at least in part, from a very small number of cases.

In Chapter 3, I annotate splice variants from long-read mouse brain and liver transcriptomes to prioritize genes likely to have FDSIs.
Due to the low yield of gold standard genes with FDSIs in Chapter 2, my goal was to determine the “low-hanging fruit” of genes likely to have FDSIs using annotations based on lessons learned from both noisy splicing and my manual literature curation. As part of this research project, I annotated long-read splice variants, which removed much of the transcript structure ambiguity found in short-read data. From a set of 6,799 genes, I prioritize a set of 79 genes likely to have FDSIs.

In Chapter 4, I prioritize splice variants of the voltage-gated calcium channel (VGCC) gene, Cacna1e, that our collaborators hypothesize to have FDSIs. VGCCs are large genes with complex transcript structures. Cacna1e is predicted to have hundreds, if not thousands, of splice variants. As with Chapter 3, I annotate splice variants from long-read data to alleviate the transcript structure ambiguity; however, my annotations are also based on the known functional properties of voltage-gated calcium channels. With a priori knowledge of Cacna1e’s function, I provide the field with the gene’s splicing profile and support Cacna1e’s candidacy as a gene with FDSIs.

In Chapter 5, I provide some concluding thoughts on the implications of my research and considerations for future work. In brief, my results agree with the hypothesis that most splice variants are likely biological noise. On the other hand, the small number of genes that I hypothesize to have FDSIs throughout my thesis will aid the field in determining the specific contexts where alternative splicing increases a gene’s functional diversity.

Chapter 2: Systematic evaluation of isoform function in literature reports of alternative splicing

2.1 Introduction

As described in Chapter 1, an ongoing debate is whether most mammalian genes produce more than one functional isoform (Blencowe, 2017; Tress et al., 2017b, 2017a).
The mere presence of multiple splice variants in public sequence databases is clearly insufficient to settle the question (Light and Elofsson, 2013). Arguments against widespread functional alternative isoforms include the fact that the splicing machinery’s limited fidelity causes the stochastic generation of “junk” splice variants (Hsu and Hertel, 2009; Melamud and Moult, 2009). Analyses using proteomics and molecular evolution approaches have also failed to support the expression and conservation of most splice variants (Abascal et al., 2015b; Pickrell et al., 2010; Reyes et al., 2013; Saudemont et al., 2017; Tress et al., 2017b). Nevertheless, the question lingers because a lack of evidence for function is not generally accepted as evidence of a lack of function, and the function of most splice variants remains unknown (Blencowe, 2017; Light and Elofsson, 2013). Beyond the question of whether most genes have more than one functional isoform is a critical issue: whether these variants increase the functional repertoire of genes, or are merely functionally redundant (Kriventseva et al., 2003; Lipscombe et al., 2002, 2013a; Stetefeld and Ruegg, 2005). In this chapter, I take steps to address the gap between the commonplace assumption that most genes have more than one distinct functional product and the evidence-based reality.

Establishing whether a gene has functionally distinct isoforms requires experimental validation. While databases that contain information on transcript isoforms gather information on isoform features, none attempt to assess functionally distinct isoform reports from the experimental literature. For example, Ensembl, RefSeq, and UniProt catalog and annotate splice isoforms based on evidence that they exist as a transcript or protein (Aken et al., 2016; Bely et al., 2010; Light and Elofsson, 2013; Zhao and Zhang, 2015). However, the existence of a splice isoform alone does not provide direct support for its functionality, much less functional distinctness.
To establish the extent to which splice variants increase the functional repertoire of the genome, we need data on which genes have functionally distinct splice isoforms (FDSIs). Identification of genes with FDSIs requires experimental support to demonstrate the necessity of each splice isoform. A classical method to determine the function of a given gene is to knock it out and observe the phenotypic consequence (Alberts et al., 2002; Shehu et al., 2016). This idea readily extends to splice variants; if a single splice variant is made absent and that variant is necessary for the normal function of the gene, then a consequence (change in phenotype) would be expected. A gene has FDSIs if two or more isoforms meet this criterion independently (Figure 2.1). In contrast, the depletion of an unnecessary or redundant splice variant will not cause a phenotype. Another approach that is often used to probe the function of a splice variant is overexpression. However, overexpression is well known to be fraught with interpretational challenges, including artifacts, so the gold standard is to generate loss-of-function alleles (Gibson et al., 2013). Note that a negative result from experiments is not evidence of a lack of functional distinctness, as it is possible the functional distinction between the splice variants may be eventually discovered. Curating the genes with FDSIs is of obvious importance to evaluate the state of the literature support for the commonplace claim that alternative splicing increases the functional repertoire of the genome.

Figure 2.1 Non-mutually exclusive types of functional distinctness for literature reported genes with FDSIs
Generally, the distinctness of FDSIs of the same gene can be attributed to expression-pattern distinctness or biochemical distinctness. Expression-pattern distinctness is defined as a gene having specific splice isoforms necessary in distinct conditions. The depletion of the splice isoform in its distinct condition causes a phenotype. Biochemical distinctness is defined as a protein structure difference between splice isoforms of the same gene. While the FDSIs of the gene can be expressed in the same condition, the depletion of either splice isoform causes a phenotype.

Beyond identifying knowledge gaps, establishing a set of genes with FDSIs provides potential avenues for improving computational approaches to analyzing alternative splicing. For example, classifiers, such as PULSE, attempt to predict genes with multiple functional splice isoforms (Hao et al., 2015). Hao et al. trained PULSE using a set of splice isoforms confirmed by Western blot experiments. PULSE predicted that one-third of human protein-coding genes have multiple functional isoforms (not necessarily functionally distinct). A difficulty cited by Hao et al. was in the identification of training data, an issue which is even worse if one is interested in functional distinctness. Having lists of experimentally validated genes with FDSIs could open the door to improved algorithmic approaches in characterizing isoform function.

Here I present a literature-based analysis of experimental evidence for functionally distinct splice isoforms (FDSIs) for over 700 human and mouse genes. Despite a gene selection strategy that was highly biased towards genes suggested to have multiple functional isoforms, I found good experimental evidence for FDSIs for fewer than 10% of genes.

2.2 Methods

2.2.1 Determining the type of functional distinctness

I developed a scheme to describe non-mutually exclusive types of functional distinctness found in genes with FDSIs. I recognize two general biological mechanisms by which functional distinctness could arise, schematized in Figure 2.1, and elaborated on further below: “expression-pattern distinctness” or “biochemical distinctness”. Figure 1.1 outlines our full scheme for classifying functional distinctness.
The subclasses I identified were designed to accommodate how functional distinctness is reported in the literature I curated, that is, I did not create this classification wholly ab initio. I determined the type of functional distinctness using the publication which provided the evidence for FDSI, but some cases required an inference based on other literature by the authors. I stress that a gene can have multiple types of functional distinctness. For example, biochemically distinct isoforms could also have expression pattern distinctness. I annotated as many types of functional distinctness as were provided by the literature reports.

2.2.2 Expression pattern distinctness

Expression-pattern distinctness requires the condition-dependent expression of isoforms of a single gene. Generally, in this category, splice isoforms of the same gene have functional relevance in distinct conditions. I further specified expression-pattern distinctness as “subcellular-localization-specific”, “cell-type-specific”, “tissue-specific”, “developmental stage-specific”, and “other-condition-specific”. Thus, genes with cell-type-specific FDSIs express their splice isoforms in distinct cell types, and the elimination of expression of either splice isoform causes a phenotype (Figure 2.1). These isoforms’ final products could be identical (that is, they are not intrinsically functionally distinct). However, they are still functionally distinct because they have partially different expression patterns and one cannot fully compensate for the other.

2.2.3 Intrinsic-functional distinctness

Intrinsic-functional distinctness is defined as differences in biochemical properties or activities between isoforms, such that the isoforms cannot compensate for each other even if co-expressed in the same condition (Figure 1.1). I further specified intrinsic-functional distinctness as “protein domain change”, “dominant negative”, “subcellular localization”, “UTR change” and “protein terminus change”.
Genes categorized as having FDSIs with distinct protein domains are those in which each splice isoform has a unique structural or functional unit in its final protein product. I manually extracted information about the specific protein domain from the authors providing the evidence of functional distinctness. In some cases, this could involve the presence or absence of one or more protein domains. Genes categorized as “protein terminus change” have FDSIs whose final protein products differ from each other in either their C-terminus or their N-terminus. These changes to the C- or N-termini usually do not affect the presence or absence of protein domains (or the paper did not make any note of changes to protein domains). Genes with dominant-negative FDSIs have splice isoforms with antagonistic phenotypes. Typically, these splice isoforms regulate each other’s function. The loss of one splice isoform generally affects the function of the other splice isoform. Genes categorized as “UTR change” have FDSIs that differ in the UTRs of the mRNA (coding regions may change as well).

2.2.4 Literature selection

On July 17th, 2017, I generated a “starting set” of publications associated with human and mouse genes to curate using PubMed e-utilities and the search term “alternative splicing”. From here curation was both “gene-centric” and “paper-centric.”

2.2.5 Gene-centric curation

The gene-centric approach attempted to curate all relevant studies associated with a specific gene. PubMed linked each study from our starting set to a specific gene, which provided a list of genes with literature. The genes I selected to curate from this list were genes suggested to us by the community, genes in PULSE’s training set, or genes commonly discussed in the literature (Hao et al., 2015). As suggestions from the community might be biased, 100 random genes were also selected for gene-centric curation.
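As an illustration of the kind of PubMed e-utilities query described in Section 2.2.4, the sketch below builds an ESearch request URL for the term “alternative splicing”. It is a minimal sketch and not the query actually used for the thesis: the function name, the choice to quote the phrase, and the `retmax` and `retmode` values are assumptions, and no organism or date filters are shown because the text does not specify how they were applied. The code only constructs the URL; it does not contact NCBI.

```python
from urllib.parse import urlencode

# Public NCBI E-utilities ESearch endpoint.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(term: str, retmax: int = 100) -> str:
    """Return an ESearch URL for a PubMed query (hypothetical helper).

    `term` is passed through unchanged; quoting it as a phrase is an
    assumption made for this example.
    """
    params = {
        "db": "pubmed",      # search the PubMed database
        "term": term,        # free-text query
        "retmax": retmax,    # number of PMIDs returned per request
        "retmode": "json",   # ask for a JSON response
    }
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"


url = build_esearch_url('"alternative splicing"')
print(url)
```

The returned PMIDs would still need to be linked to genes in a separate step, which the text attributes to PubMed's own gene links.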
2.2.6 Paper-centric curation

The paper-centric approach attempted to curate literature likely enriched for evidence of genes with FDSIs. Using this approach, I make no attempt to curate all relevant reports for any specific gene. As a targeted source of literature likely to be enriched for functional evidence, I used review articles on the function of alternative splicing that provided citations for 603 genes (Kelemen et al., 2013; Kovacs et al., 2010; Lipscombe et al., 2013a; Ramanouskaya and Grinev, 2017; Stamm et al., 2005; Tress et al., 2017b). I further extended paper-centric curation with specific search phrases in PubMed. Search terms were: “functionally distinct splice isoforms”, “CRISPR alternative splicing”, “alternative splicing knockdown” and “alternative splicing knockout.” These queries identified an additional 260 papers for our starting set of papers. The genes found in the publications retrieved by these PubMed queries and provided in the aforementioned reviews further informed us of which genes to gene-centrically curate. For example, BDNF and XBP1 were commonly reviewed in the literature and consequently, I gene-centrically curated them.

2.2.7 Curation process

For each paper, a trained curator first identified general features of that study by manually extracting the following information: the investigated gene, the reported number of the splice variants for the gene, the names used by the authors for the splice variants, the number of splice variants specifically investigated in the paper (“the investigated variants”), the experiments performed, the organism where the gene was identified, the organism or cell line used for the experiments, and any claims of functional distinctness.

Next, using a decision tree (Figure 2.2), we annotated each paper as to whether the data provided positive evidence of functional distinctness for the investigated splice isoforms.
We sought evidence where the loss of one isoform (via knockdown, knockout or other means of isoform-specific depletion) produced a phenotype in the test system. We also curated experiments which performed overexpression analyses, which were retained as a separate category from the isoform loss studies (as an example, see the study by Scotton and colleagues (Scotton et al., 2006)). We did not accept studies of aberrant splice variants caused by rare mutations (for example in cancer), as we deemed these as not relevant to the normal function of the gene as we have defined it (as an example, see Cogan et al. (Cogan et al., 2012)). If a study provided evidence where investigators depleted multiple splice variants of a single gene but at most one splice variant caused a phenotype, we classified the gene as having negative support for FDSIs. Finally, regardless of study type, the curators provided a concise explanation of the functions investigated.

Figure 2.2 Overview of literature curation
We sought papers which study the functional distinctness of a single human or mouse gene’s splice isoforms. Positive studies are those that provide evidence where multiple splice isoforms of a single gene are depleted and at least two isoforms show a phenotype. We annotated studies as providing negative evidence for functional distinctness when investigators deplete multiple splice isoforms of the same gene but only one produces an observable phenotype. The numbers in bold represent the number of studies in each category. Clip art designed from Flaticon (free license with attribution).

For our definition of genes with FDSIs, I required evidence for the independent depletion of at least two splice isoforms of the same gene. If the curated study investigated the outcome of the absence of a single isoform for a given gene, then that study alone does not provide sufficient evidence of FDSIs.
While such studies demonstrate the existence of a single functional isoform, the support for FDSIs requires data on at least two isoforms from the same gene. However, I subsequently attempted to identify a second paper that provided functional evidence for a different splice isoform of the same gene. In situations where a second paper identified evidence of a different functional splice isoform, I recorded the gene as having FDSIs.

2.2.8 Curator Validation

To ensure consistent curation, I evaluated the curators. These tests consisted of all curators curating the exact same randomly selected 50 papers. After the test, I addressed any discrepancy between curators, and I updated the curation standards with any necessary clarifications (curation standards provided in Appendix A1). This evaluation process was conducted three times. I also further scrutinized papers annotated as providing positive evidence of a gene with FDSI to eliminate any false positives.

2.2.9 Linking FDSIs to Ensembl

If a paper provided positive evidence for FDSIs, we linked the splice isoforms with the appropriate Ensembl transcript ID. Generally, studies provided GenBank or RefSeq accession IDs and these accession IDs linked to Ensembl. In the absence of an accession ID, we referred to the literature for sequence information about the splice isoforms and aligned splice isoform sequences to Ensembl using ClustalOmega (Sievers et al., 2011).

2.2.10 Computational predictions of genes with FDSIs

PULSE, a computational classifier developed by Hao et al., predicted 2,419 of 15,639 UniProt genes to have multiple functional isoforms based on a training set of 145 genes (Hao et al., 2015). I downloaded the supplementary data provided by Hao et al. to determine whether PULSE predicted our genes with FDSIs to have multiple functional splice isoforms. I also investigated whether any of our genes with FDSIs were part of PULSE’s training and validation sets of genes.
This was of interest because a training set enriched for genes with FDSIs may yield predictions for genes with FDSIs, even though PULSE was only designed to detect function, not distinct function. For our comparison to PULSE predictions, I used the human orthologue for any mouse gene with FDSIs, as determined by BioMart (Smedley et al., 2009).

2.3 Results

2.3.1 Landscape of the alternative splicing literature

To generate a starting set of papers to curate, I queried PubMed in August 2017 using the term “alternative splicing”. I found 19,049 human studies and 8,197 mouse studies representing 12,891 human genes and 7,585 mouse genes. While the median number of papers per gene was one, there was a large variance (see Figure 2.3). Most human genes (7,738) had only one such paper associated with them, while some had up to 100 (for example, SRSF1). I also observed that genes with many “alternative splicing”-mentioning papers tend to have many papers in PubMed overall (Spearman’s rank correlation = 0.55). For example, I identified 86 studies linked to human TP53 with the term “alternative splicing” (rank 2), but this is not particularly remarkable because overall, PubMed contains 8,261 studies linked to TP53 – the most studies for any single gene. This suggests, unsurprisingly, that heavily studied genes tend to have more research done on their splicing.

Figure 2.3 Number of alternative splicing studies linked to human or mouse genes. A) Most human genes have one study linked to alternative splicing in PubMed. The total number of human studies retrieved with the term “alternative splicing” was 19,049. These studies linked to 12,891 human genes. Genes (taken from Ensembl) that were not retrieved from this query were labelled as 0. B) Most mouse genes have one or two studies linked to alternative splicing on PubMed. The total number of mouse studies retrieved with the term “alternative splicing” was 8,203. These studies linked to 28,167 mouse genes.
Note that this gene count, unlike the human gene count, includes non-protein-coding genes. This was likely due to six transcriptome-wide mouse studies which include the term “alternative splicing” and are linked to all mouse genes. Removal of these six studies from the query results led to only 7,585 mouse genes associated with a total of 8,197 “alternative splicing”-mentioning studies, as described in the Results. Furthermore, after filtering these six studies, most mouse genes had one study which mentioned alternative splicing. We did not see a similar issue in our human query shown in Figure S1. Genes (taken from Ensembl) that were not retrieved from this query were labelled as 0.

2.3.2 Curation summary

We manually curated primary studies which provide evidence for the function of splice variants. As described in Methods, we selected genes and publications for curation in a manner that we expected should enrich for documentation of functional distinctness – for example, using review articles on splicing function. The curation process primarily focused on determining whether the elimination of expression of each splice variant from a single gene caused an observable phenotype. Table 2.1 provides a summary of the knowledgebase as of July 20th, 2018, and Additional File A1 and Additional File A2 contain full details of all curated studies for human and mouse. In total, we curated 1,127 human and mouse studies. This encompasses 903 human studies (555 genes) and 272 mouse studies (227 genes). We have curated a median of 1 study per human gene and 1 study per mouse gene (mean = 1.5 studies and 1.2 studies, respectively). Our curation evaluations (see Methods) revealed that the curators agreed on the interpretation of a paper 98% of the time. Errors were generally false positives for functional distinctness, which we addressed in the final review (see Methods).
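The agreement figure above can be made concrete with a small sketch. The text does not state which statistic was used, so the definition here (the fraction of shared papers on which every curator assigned the same label) is an assumption, as are the curator names, paper IDs, and labels, which are hypothetical.

```python
def _shared_papers(annotations):
    """Papers annotated by every curator (hypothetical data layout:
    curator name -> {paper_id: label})."""
    label_maps = list(annotations.values())
    if not label_maps:
        raise ValueError("no curators supplied")
    shared = set(label_maps[0])
    for labels in label_maps[1:]:
        shared &= set(labels)
    if not shared:
        raise ValueError("curators share no papers")
    return shared


def disagreeing_papers(annotations):
    """Sorted IDs of shared papers on which at least two curators differ."""
    label_maps = list(annotations.values())
    return sorted(
        paper for paper in _shared_papers(annotations)
        if len({labels[paper] for labels in label_maps}) > 1
    )


def percent_agreement(annotations):
    """Fraction of shared papers on which all curators gave the same label."""
    shared = _shared_papers(annotations)
    return 1 - len(disagreeing_papers(annotations)) / len(shared)


if __name__ == "__main__":
    # Hypothetical labels for four papers from three curators.
    example = {
        "curator_1": {"p1": "positive", "p2": "negative", "p3": "other", "p4": "other"},
        "curator_2": {"p1": "positive", "p2": "negative", "p3": "positive", "p4": "other"},
        "curator_3": {"p1": "positive", "p2": "negative", "p3": "other", "p4": "other"},
    }
    print(percent_agreement(example))
    print(disagreeing_papers(example))
```

Listing the disagreeing papers alongside the summary fraction mirrors the process described in Methods, where discrepancies were discussed and the curation standards updated after each evaluation round.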
Table 2.1 Curation of alternative splicing literature revealed 23 human genes and 20 mouse genes with functionally distinct splice isoforms (FDSIs). The 23 human genes with FDSIs accounted for almost 4% of human genes annotated in this knowledgebase, while the 20 mouse genes accounted for 9% of all mouse genes annotated. The majority of curated studies could be classified into three different types: “splice variant removal”, “overexpression” and “localization”. Splice variant removal studies have experiments where expression of at least one splice variant is eliminated and a phenotypic change is evaluated. Overexpression studies have experiments where at least one splice variant is overexpressed. This “abundance” of the splice variant can cause a phenotype (not necessarily distinct). Localization studies have experiments that characterize where in the cell or organism the splice isoform is expressed. A single study can report experiments with multiple study types. The total number of human and mouse studies curated does not sum to 1,158 studies because some publications investigated both human and mouse forms of a single gene.

2.3.3 Identification of 23 human genes with direct evidence of functionally distinct splice isoforms

By definition, a gene with functionally distinct splice isoforms (FDSIs) has at least two splice isoforms necessary for the gene’s normal function. We find that genes with such evidence are rare: about 4% of curated human genes (9% of mouse genes) have FDSIs, based on reports in a total of 64 studies out of 1,127 studies. Note that 138 studies depleted only one splice isoform of a gene and no other study we curated had depleted any other isoforms of the same gene. I provided the full list of the 23 human genes and the 20 mouse genes with FDSIs in Table 2.2 with additional information in Additional File A3.
RNAi knockdown experiments provided support for over 75% of these FDSIs, while the remaining FDSIs were characterized using gene knockouts combined with isoform-specific rescue.

Table 2.1 (data)
                                                            Study type
Species  Curated genes  Genes with FDSIs  Studies curated  Splice variant removal  Overexpression  Localization  Other study types
Human    555            23                903              149                     294             80            380
Mouse    227            20                272              82                      70              37            83
Total    782            43                1,127            222                     353             109           443

Table 2.2 Genes with positive literature evidence of FDSIs. Studies have provided positive evidence of functional distinctness for these genes in experiments where individual splice isoforms were eliminated, and a phenotypic change was observed. See Additional File A3 for the study demonstrating functional distinctness. “Number of FDSIs” indicates the number of splice isoforms where depletion of splice isoforms causes a phenotype. “Number of Ensembl Transcripts” indicates the number of transcripts found in the Ensembl entry for the gene. “Number of Studies” indicates the number of studies associated with the gene retrieved with the term “alternative splicing” on PubMed. The highest number of FDSIs found in a single gene is three. “Mappable to Ensembl” indicates genes where we successfully linked all FDSIs back to Ensembl. “PULSE” indicates whether the gene was used at all by Hao and colleagues in their computational predictions. “Training” in this column means that the gene was used as part of PULSE’s training set. “Predicted” means that PULSE predicted that the gene has multiple functional splice isoforms. “Missed” means that PULSE failed to predict that the gene has multiple functional splice isoforms. “NA” means that the gene was not an input for PULSE.

We sought genes with negative evidence for FDSIs. For these cases, experiments individually depleted multiple splice variants for a single gene, but only one splice variant’s depletion caused a phenotype, while the depletion of the other splice variants caused no phenotype.
We found 16 genes with such evidence (shown in Table 2.3).

Table 2.2 (data)
Gene      Number of FDSIs  Number of Ensembl Transcripts  Number of Studies  Mappable to Ensembl?  PULSE
Human
AR        3                9                              31                 Yes                   NA
BCAR1     3                17                             1                  No                    Missed
BDNF      3                19                             12                 No                    NA
BIRC5     3                11                             34                 Yes                   NA
BOK       2                2                              1                  No                    Missed
CD44      2                39                             58                 Yes                   Predicted
CFLAR     2                25                             6                  Yes                   Predicted
CSPP1     2                7                              1                  Yes                   NA
DPF3      2                16                             2                  Yes                   NA
EIF4G1    2                38                             6                  No                    NA
EIF4G2    3                32                             3                  No                    NA
HBS1L     2                14                             1                  Yes                   Predicted
KLF6      2                7                              16                 No                    Missed
MADD      2                23                             2                  Yes                   Predicted
MST1R     2                15                             11                 Yes                   Predicted
PML       2                22                             12                 Yes                   Predicted
PGAM5     2                4                              1                  Yes                   Missed
PRMT5     2                20                             4                  Yes                   Missed
STIM2     2                12                             2                  No                    NA
SUN1      2                35                             2                  No                    NA
TICAM1    2                2                              1                  No                    NA
TICAM2    2                2                              1                  No                    NA
TP63      2                14                             27                 Yes                   NA
Mouse
Cacna1b   2                10                             2                  No                    Missed
Calca     2                7                              10                 Yes                   NA
Cdc42     2                2                              8                  Yes                   NA
Enc1      2                1                              3                  No                    NA
Homer1    2                12                             7                  Yes                   NA
Il1rap    2                7                              4                  Yes                   Training
Lpin1     2                8                              5                  No                    Missed
Lrp8      2                12                             13                 No                    Predicted
Mecp2     2                6                              9                  Yes                   Missed
Myh10     2                10                             7                  No                    Predicted
Nf1       2                9                              5                  Yes                   Missed
Opn4      2                3                              2                  Yes                   NA
Oprm1     3                31                             20                 Yes                   Predicted
Rbfox1    2                2                              22                 No                    Predicted
Robo3     2                6                              5                  Yes                   Missed
Rock2     2                12                             4                  No                    Missed
Ryr3      2                12                             5                  No                    NA
Sirt3     2                10                             3                  No                    Missed
Snap25    2                3                              12                 Yes                   NA
Tp63      2                8                              9                  Yes                   NA

As mentioned, I biased our gene and paper selection in such a way that our estimate of ~4% (~9% for mouse) might be too high. To help clarify this issue, I randomly selected 100 human genes (from those that had at least one alternative splicing related paper) for gene-centric curation. Of these 100 genes, two genes (PML and DPF3; 2% of the curated genes) had experimental evidence of FDSIs. We also curated gain-of-function experiments where investigators overexpressed multiple splice isoforms of the same gene. From our 555 curated human genes and 227 curated mouse genes, we found 50 human genes (~9%) and 14 mouse genes (~4%) where investigators overexpressed individual splice isoforms and yielded multiple distinct phenotypes.
Such studies did not meet our criteria for FDSIs, but I report them in case this relaxed criterion is of interest to others.

Table 2.3 Genes with evidence failing to support FDSIs. These genes had multiple isoforms tested; however, only one splice isoform caused a change in phenotype.

Gene       Experimental method                    Tissue/Cell Type           Reference (PubMed ID)
Ank3       Isoform-specific rescue                Neuron                     25552556
Ar         Knockdown                              Prostate cancer cell line  20823238
Ccnd1      Isoform-specific rescue                Embryonic fibroblast       21200149
Dab1       Isoform-specific rescue                Neuron                     28968791
Dntt       Isoform-specific rescue                Bone marrow                11136823
FANCE      Isoform-specific rescue                Breast cancer cell line    26277624
FNBP1L     Isoform-specific rescue                MDCK cell line             26063734
Pcdha1     Isoform-specific rescue                Brain                      18973563
PDE4D      Knockdown                              Kidney                     16030021, 17673687
PEX19      Isoform-specific rescue                Fibroblast                 11883941
Pparg      Knockdown and isoform-specific rescue  Adipose                    11782442
RAP1GSDS1  Knockdown                              Breast cancer cell line    24197117
RREB1      Knockdown                              Bladder                    21703425
SIRT1      Knockdown                              Colon cancer cell line     22124156
Smad2      Isoform-specific rescue                Embryonic stem cells       15630024
STAT1      Knockdown                              Embryonic cells            21914475

2.3.4 Genes tend to express functionally distinct splice isoforms in the same condition

To further explore functional distinctness in splicing, I identified non-mutually exclusive types of functional distinctness between FDSIs of the same gene, summarized in Table 2.4. I classified two main types of distinctness: expression-pattern distinctness and intrinsic-functional distinctness. Genes with expression-pattern distinct FDSIs have splice isoforms necessary for specific conditions, while genes with intrinsic-functional distinctness have FDSIs with distinct biochemical properties that cannot compensate for each other even when co-expressed (for further description see Methods and Figure 1). The majority of genes (27/43) have intrinsically-functionally distinct isoforms, rather than expression-pattern distinct isoforms.
I identified “dominant-negative” as the most common subtype of biochemically distinct FDSIs (12/31 genes). For example, the mouse gene Enc1 has two FDSIs, named “57 kDa” and “67 kDa” by the authors, interacting in the Wnt-signalling pathway (Worton et al., 2017). Knockdown of 57 kDa promoted osteoblast mineralization while the knockdown of 67 kDa inhibited osteoblast mineralization.

In contrast to intrinsic-functional distinctness, I identified fewer cases of genes with expression-pattern distinct FDSIs. Only a total of 17 human and mouse genes had FDSIs in which the distinctness arises from distinct expression patterns. For example, the mouse gene Myh10 has two FDSIs, named B1 and B2 by the authors (Ma et al., 2006). Cells in the brainstem express B1 to promote normal migration of facial neurons, while cells in the cerebellum express B2 to promote normal cerebellar Purkinje cell development.

Table 2.4 Most genes with FDSIs have intrinsically-functionally distinct FDSIs. Genes with FDSIs were categorized by functional type based on the literature that reported on the FDSIs using the scheme outlined in Figure 1. Genes categorized as “distinct expression patterns” express FDSIs in specific conditions. Genes categorized as “intrinsically-functionally distinct” have FDSIs whose functional distinctness is a consequence of biochemical differences in their final protein product. Genes can be categorized as both “distinct expression patterns” and “intrinsically-functionally distinct”, such as Myh10 and Robo3.

2.3.5 Challenges linking FDSIs to sequence databases

We attempted to link all identified FDSIs back to Ensembl transcript identifiers and were successful in 25/43 cases. Our process was as follows. First, in the studies for seven genes, investigators provided a GenBank or RefSeq ID.
We were able to map three of these to Ensembl (which includes GenBank and RefSeq data), but not the other four (for more details see Additional File A3), accounting for four of the 16 failures. Next, for 36 genes with missing accession information, we used sequence alignment or other information to identify likely matches (see Methods). This was successful in 25 cases. In a further 6 cases, we were able to determine a sequence by referring to other papers by the same authors. Despite extensive efforts, we were unable to find matching Ensembl transcripts or sequence data for the isoforms of 5 genes. This situation was not specific to Ensembl as we failed to link isoforms of 8 genes to UniProt; see Additional File A3.

Table 2.4 (data)
Types of distinctness                Human genes                                              Mouse genes
Distinct expression patterns
  Cell-type-specific                 AR, MADD
  Developmental-stage-specific       CD44                                                     Myh10, Robo3
  Cellular localization              BIRC5, CSPP1, PRMT5, PML                                 Myh10, Rbfox1, Robo3, Sirt3
  Tissue-specific                    MST1R                                                    Calca, Rock2
  Other-condition-specific           BOK
Intrinsically-functionally distinct
  Protein domain                     CFLAR, DPF3, EIF4G1, TICAM1, TP63                        Lrp8
  Dominant negative                  BIRC5, HBS1L, KLF6, Nf1, PRMT5, STIM2, SUN1, TICAM       Enc1, Nf1, Robo3, Ryr3, Tp63
  Protein terminus change            BCAR1, BDNF, EIF4G2, IL1RAP, PGAM5                       Cacna1b, Mecp2, Oprm1, Pn4
  UTR Change                         BDNF

2.3.6 Only a quarter of genes with FDSIs are predicted by a computational classifier

Hao et al. (2015) developed a machine learning algorithm (PULSE) that predicted 1/3 of human genes have more than one functional isoform (but not necessarily functionally distinct). I hypothesized that our curated genes with FDSIs would be enriched among those predictions, because even though Hao and colleagues were not attempting to predict functional distinctness, genes with FDSIs by definition have more than one functional isoform.
Though I included PULSE’s training genes in our gene-centric curation, only two genes with FDSIs (including human orthologues of our curated mouse genes) were used by Hao et al. in their training data. In their validation gene set of 212 genes, I found none of our genes with FDSIs. Hao et al. predicted 2,419 genes to each have multiple functional splice isoforms. Ten of our genes with FDSIs are included in this set. Based on the input set used for PULSE predictions, the classifier failed to predict 12 of our genes with experimentally-validated FDSIs to have multiple functional splice isoforms. However, our interpretation of these is limited because of the small number of genes with FDSIs.

2.4 Discussion

This chapter represents progress towards documenting and evaluating the breadth of evidence for functionally distinct splice isoforms (FDSIs) for human and mouse genes. The inspiration for our study was strong arguments against the likelihood of most genes having multiple functional isoforms, contrasted with the ubiquitous claim that splicing vastly increases the functional repertoire of the genome (Auboeuf, 2018; Kriventseva et al., 2003; Lipscombe et al., 2002, 2013a; Stetefeld and Ruegg, 2005; Tress et al., 2017b; Wang et al., 2008a). This led us to ask where this latter claim comes from: while surely there are interesting cases of multi-isoform genes, has this been optimistically extrapolated to the entire human genome? Our analysis suggests this is the case and supports the hypothesis that the functions of the majority of splice variants remain unknown (Frankish et al., 2012; Light and Elofsson, 2013; Mudge et al., 2011). While it was not surprising that there is no evidence of FDSIs for most genes, I was surprised by the low fraction for which there is supporting data, a mere 4% in human genes and 9% in mouse genes.
Regardless of whether this number holds true with more curated studies, by contributing a list of genes with documented functionally distinct isoforms, we start to identify the scope of the gaps and the parameters for future experimental work, and assist computational methods that require training examples. The low fraction of genes surveyed for which we found evidence of FDSIs (~4-9%) agrees with the general sense that we still have limited concrete evidence of more than one functional splice isoform per gene (Kelemen et al., 2013; Reyes et al., 2013; Tress et al., 2017b). Even if we loosen our criteria to include overexpression studies, this fraction rises only to ~12-13%. Furthermore, we only considered genes for which some literature exists for their isoforms, so the range 4% to 9% is relative to genes that have at least one publication about them associated with splicing. Based on our PubMed queries, I estimate that one-third of human protein-coding genes do not have any type of specific experimental study of differences among their isoforms. For most genes the main available sources of information come from genome-wide studies of transcript expression patterns, which do not address function. One might question whether the fraction 4% will rise substantially as we continue our curation efforts, but I hypothesize a lower true fraction of genes with documentation of FDSIs in the literature. First, I aimed the gene-centric aspect of our curation at genes mentioned in review articles or otherwise prominent genes, and it is thus highly biased towards genes with experimentally-backed function, yielding an over-estimate. Second, the gene-centric survey of 100 randomly-selected human genes yielded only two genes with evidence of FDSIs. Third, I found a median of only one study per gene from PubMed.
Since the genes with FDSIs tended to be genes with relatively more associated studies (Table 2.2), genes with few associated studies seem less likely to yield existing positive evidence for functional distinctness. Fourth, investigators face technical and/or resource challenges when testing the functional distinctness of splice variants, requiring either the ability to conduct splice variant-specific depletion experiments, or splice variant-specific rescue following a complete gene knockout. Reasonably, one might suppose that in many cases the experiments have not been done. The essential problem remains that most genes simply have not had their splice variants tested in such a way as to establish distinct functions.

We also sought negative evidence of genes having FDSIs from experiments where the depletion of only one splice isoform causes a phenotype while the depletion of the remaining splice variants of the same gene causes no phenotype. However, we only identified eight human genes and eight mouse genes from 16 studies with this type of evidence in our current curation of 1,127 studies (Table 2.3). Since most studies consider only one type of functional assay, it remains possible that tests of different functions would yield positive results for these genes. Nevertheless, the “file-drawer effect” – a type of publication bias against negative results – potentially plays a role in the dearth of negative evidence (Kennedy, 2004).

A natural question is whether genes with FDSIs have distinguishing features compared to genes without FDSIs. However, we identified too few genes to perform an adequately powered analysis. Furthermore, both the literature and our curation process have large biases in the identification of FDSIs. They tend to involve highly studied genes, while at the same time the extent and types of investigations into isoform function are highly variable.
If there are biological principles that explain the distribution of FDSIs in the genome, discovering them will require a larger and less biased source of data than is currently available.

2.4.1 Evaluating the evidence for FDSIs at the gene level

After the curation of over 1,000 alternative splicing studies, we identified 23 human genes and 20 mouse genes with evidence for functionally distinct splice isoforms, mostly determined by RNAi knockdown experiments. RNAi knockdowns naturally align with our definition of a functional splice isoform and how researchers traditionally determine function in molecular biology. One question that arises in discussing RNAi is target specificity and efficacy. In most, but not all, of the papers we curated as having FDSIs, the authors demonstrate the target specificity of their siRNA to effectively deplete a single isoform. I raise this as a reminder that reports of evidence for functional distinctness may vary in quality. Isoform-specific rescues demonstrating functional distinctness provide an alternative option to knockdown studies, but the method has limitations when determining whether the splice isoforms rescue distinct phenotypes. In some studies, splice isoforms of the same gene clearly rescued distinct functions. For example, Candi and colleagues performed rescue experiments on Tp63-null mice (Candi et al., 2006). The knockout of Tp63 impeded the development of skin. In the rescue experiments, the splice isoform DNp63 restored the skin’s basal layer while the TAp63 restored the skin’s upper layers. In contrast, other studies rescued the same phenotype with each splice isoform, which makes evidence of functional distinctness unclear. For example, in the investigation by Coldwell and colleagues, each splice isoform of EIF4G1 (eIF4G1e and eIF4G1f) rescued the phenotype of translation by restoring the translation rate (Coldwell et al., 2012). It is unclear whether this constitutes evidence of functional distinctness.
Since both splice isoforms rescued the same phenotype, they appear functionally redundant. Nevertheless, in cases such as these, we accepted the claim of the authors that the gene has FDSIs. I resisted accepting overexpression studies as demonstrating FDSIs for two reasons. First, overexpression experiments are known to be subject to a variety of artifacts (Gibson et al., 2013). Second, and more importantly, overexpression experiments fail to provide evidence for a splice variant’s necessity. In molecular biology, a molecule’s necessity can only be supported by the effects of the molecule’s absence (Gannett, 1999; Gifford, 1990). Thus, I have more confidence in splice variant depletion experiments to provide support for genes with FDSIs compared to overexpression. I draw a parallel to the standards of evidence for characterizing gene function, in which evaluation of a loss of function is the gold standard (Kopp and Mendell, 2018). I argue that the same criteria used to establish gene function must be applied to isoforms.

2.4.2 Types of functional distinctness in FDSIs

It has been speculated that many poorly-characterized variants may have function because genes express splice variants in specific conditions, perhaps yet to be studied (Blencowe, 2017; Pan et al., 2008; Wang et al., 2008b). It is therefore relevant that only a minority (17) of genes had functional distinctness due to condition-specificity. This may simply be due to a lack of condition-specific studies, as it might be generally easier to study splice variants expressed in the same conditions. Our results thus point to a potential gap in the literature.

2.4.3 Disconnect between literature and gene databases

In one-third of the genes with FDSIs, the isoforms studied in a paper could not be matched to transcripts in Ensembl (as mentioned, this is not an Ensembl-specific problem; ~20% of genes had functional isoforms that could not be matched to UniProt).
Conversely, Ensembl contains many transcripts that the literature ignores. This observation has fairly serious implications for basing splice isoform research on the contents of Ensembl (or related databases). If one developed experiments to functionally test the splice isoforms of the genes I identified to have FDSIs based on Ensembl transcripts for that gene, their experiments would not contain the correct FDSIs in at least one-third of the genes. In bioinformatics research, computational methods that make predictions based on Ensembl transcripts might be valueless to experimental biologists as Ensembl does not reflect the literature. Large-scale databases specialized for alternative splicing, such as the Alternative Splicing encyclopedia (ASpedia) and the APPRIS database, tend to anchor to Ensembl (Hyung et al., 2018; Rodriguez et al., 2018). Of note, previous discussions used APPRIS to understand the functional impact of alternative splicing (Tress et al., 2017b). The disconnect between Ensembl and the literature also impacts datasets not specific to splicing but where splice isoform information is important. For example, the GTEx consortium provides transcript-level quantification based on the Ensembl transcriptome (Lonsdale et al., 2013). The FDSIs that are not in Ensembl are therefore not included in GTEx. Given the few known cases of genes with FDSIs and PULSE’s inability to predict all our genes with FDSIs, it remains crucial that computational resources contain FDSIs and that experimentalists ensure that they submit their sequence data to these resources.

2.4.4 Implication for alternative splicing’s impact on gene function

Recent studies have challenged whether most genes can produce multiple functional splice isoforms and our results can offer something to both sides of the debate.
I acknowledge that other researchers may have different definitions of a functional splice isoform, but I view the debate within our operational definition – a functional splice isoform is one that is necessary for the gene’s overall function.

One side of the debate claims that most genes have multiple functionally distinct isoforms (Blencowe, 2017). Viewing our findings optimistically, I provide what is to our knowledge the only substantial list of human and mouse genes for which this is actually documented to be true. The low number of genes with such evidence can be interpreted as a vast opportunity for experimentalists to identify the functions of the splice variants for >80% of genes. The other side of the debate approaches alternative splicing with a less Panglossian view, with the null hypothesis being that most splice variants do not have a specific distinct function (Gould and Lewontin, 1979). Multiple studies taking a genomic or evolutionary perspective have concluded that it is unlikely that most genes have multiple functional splice isoforms (Abascal et al., 2015a; Hsu and Hertel, 2009; Hu et al., 2017; Kurmangaliyev and Gelfand, 2008; Light and Elofsson, 2013; Melamud and Moult, 2009; Pickrell et al., 2010; Reyes et al., 2013; Saudemont et al., 2017; Tress et al., 2017b; Wang et al., 2014; Zhang et al., 2009). Viewed pessimistically, our data is consistent with this body of work. If the literature lacks supporting evidence for widespread FDSIs, the null hypothesis should be maintained and claims that every observed splice variant has a function to be discovered should be viewed skeptically.

2.5 Conclusion

To our knowledge, the work in this chapter represents the first effort to curate the literature in order to determine the genes where splicing increases the genome’s functional potential.
Such individual reports have been generally ignored in the debate about the function of alternative splicing, which has instead focused on databases and high-throughput data sets. Our estimate that only 4% of human and 9% of mouse genes have evidence for functionally distinct isoforms serves both as a sobering reminder of the limited evidence and as a motivation for increased experimental efforts to settle the debate. At the same time, I also recognize there are likely genes with FDSIs that I did not curate and should be included.

The dearth of experimental evidence for genes with FDSIs poses challenges for future computational predictions of genes with FDSIs. For example, building a classifier for genes with FDSIs would require more gold standard genes with FDSIs. Nevertheless, there are likely genes with FDSIs that are yet to be experimentally evaluated. In my next chapter, I look at transcriptomic data to prioritize the most likely genes to have FDSIs using functional genomic annotations, based on what I learned throughout my literature curation.

Chapter 3: Prioritizing genes likely to have functionally distinct splice isoforms using long-read RNA-seq data

3.1 Background

As explored in Chapter 2, the best evidence for genes with FDSIs must be established ex silico, by directly testing the phenotypic effects of disrupting each splice variant in turn. However, in contrast to the gene-level case, there seems to be only a small number of cases of FDSI genes. Our previous curation of the alternative splicing literature found that only 5% of genes (at most) are likely to have such evidence (Chapter 2; Bhuiyan et al., 2018). In lieu of ex silico experiments, computational annotations can help us establish genes that are likely to have FDSIs. This is the goal of the work presented in this chapter.
A key aspect of my approach is that I consider the null hypothesis to be that observed splice variants are non-functional; that is, I treat the “noisy splicing model” as the source of all observed splice variants, until evidence is collected to suggest otherwise. The noisy splicing model arises from the observation that there are many biochemical steps in splicing, all with finite precision, resulting in erroneous, non-functional RNAs (Hsu and Hertel, 2009; Melamud and Moult, 2009; Saudemont et al., 2017). Previous work has sought to identify how much splicing arises from these stochastic, “noisy” events, versus how much arises from a biologically necessary and regulated process (Section 1.3). Since most splice variants are lowly expressed (Section 1.4), evolving neutrally (Section 1.6), and lack evidence of protein translation (Section 1.4), many have concluded that most genes do not have multiple functional splice variants (Ezkurdia et al., 2015; Gonzalez-Porta et al., 2013; Pickrell et al., 2010; Reyes et al., 2013; Tress et al., 2017b). Therefore, I do not assume that every gene has FDSIs, and in fact there is good evidence to suggest that genome-wide, most genes do not have multiple functional splice variants.

For the purposes of this chapter, a key issue is how computational methods can help prioritize candidate genes with FDSIs. In line with past work, I can use functional genomic annotations to help determine genes likely to have FDSIs. For this chapter, I group functional genomic annotations into four categories: conservation, expression level, coding potential, and protein domain annotations. For my purposes of prioritizing genes likely to have FDSIs (multiple necessary splice variants), a gene must have at least two splice variants that satisfy criteria in each of these categories.

Notably, previous high-throughput studies of functional genomic annotations of splice variants exist.
One high-throughput resource, APPRIS, uses annotations based on conservation and protein structure to assign a principal splice variant for each Ensembl gene ID (Rodriguez et al., 2013). Though their main goal was not to determine genes with FDSIs, their annotation scheme provides direction towards the most necessary splice variant for a gene’s overall function. Based on their annotations, about 70% of Ensembl splice variants are lacking amino acids key to their gene’s overall function; that is, they are likely to be non-functional.

Annotations by APPRIS and similar studies represent significant progress towards understanding the functional consequences of alternative splicing, but the use of short-read RNA-sequencing data and large genomic databases like Ensembl are limitations (Hyung et al., 2018; Rodriguez et al., 2013). Our previous curation of the literature revealed that about a third of splice variants studied in the literature could not be found in Ensembl (Bhuiyan et al., 2018). Furthermore, annotating splice variants found in short-read RNA-sequencing is problematic due to the difficulties in transcript-structure reconstruction (Merino et al., 2019). If a short read maps to an exon shared by multiple splice variants, computational pipelines perform probabilistic assignment (essentially educated guesses) to predict which read belongs to which splice variant (Hardwick et al., 2016; Steijger et al., 2013). While this is not a problem when studying gene-level expression, it creates ambiguity when interpreting splice variant expression profiles (Hong et al., 2018).

In contrast to short-read RNA-sequencing, long-read RNA-sequencing has the potential to remove much transcript structure ambiguity (Križanovic et al., 2018). Long-reads, such as those from Oxford Nanopore Technologies’ MinION sequencer or PacBio’s SMRT sequencer, are long enough to span an entire mRNA molecule.
For example, MinION sequencing will sequence the entire transcript as one read (Clark et al., 2019). Consequently, the MinION sequencer can detect known splice variants as well as detect novel splice variants (Workman et al., 2019). In this chapter, I use splice variants found in MinION transcriptomes. Previous studies have annotated splice variants found in long-read transcriptomes. Wang and colleagues recently used functional genomics to annotate splice variants found in a PacBio SMRT rat hippocampal transcriptome (Wang et al., 2019). Their annotation scheme focused primarily on the translation potential of the splice variants. From a starting set of 102,377 splice variants that mapped to 22,629 gene loci, they produced an annotated set of 22,268 “high-confidence” isoforms for 6,380 genes. A subset of these high-confidence isoforms was conserved. This work offers molecular biologists a direction towards splice variants that are most likely to be necessary for the gene’s overall function. However, this must also be implemented for other species (such as mouse and human) and must be considered in the context of noisy splicing.

In this chapter, I describe a computational pipeline for prioritizing splice variants in MinION long-read RNA-seq data based on the noisy splicing model and the criteria outlined above. I designed the prioritization approach using splice variant-specific conservation (PhastCons and PhyloP), expression, coding-potential, and protein domain annotations. Based on these annotations, the pipeline outputs a prioritized list of genes likely to have FDSIs, and I report a small number of highly prioritized genes for mouse. While more long-read data from additional tissues are needed to do a full genome-wide prioritization, and for more species, our work establishes methods and guidelines for high-throughput prioritization of genes with FDSIs.
3.2 Methods

3.2.1 Data collection

3.2.1.1 Publicly available mouse brain and liver transcriptome data collection

I downloaded a mouse brain MinION dataset (7 samples) and liver MinION dataset (2 samples) from a published study by Sessegolo and colleagues (Sessegolo et al., 2019). The study also provided short-read datasets for the mouse brain and liver. These were used for long-read error correction later in our pipeline. I chose these datasets because they were the only publicly available MinION RNA-seq datasets for mice at the time.

3.2.1.2 Mouse colliculus full length cDNA sequencing using Nanopore sequencing

Our collaborators in the Snutch Lab at UBC performed the Nanopore sequencing of mouse colliculus cDNA. Total RNA was extracted from the mouse inferior and superior colliculus using a MagMax Kit (Ambion) and full length cDNAs were generated using Maxima H Minus reverse transcriptase (ThermoFisher #EP0751) with oligo-dT priming (Invitrogen) as per the Oxford Nanopore Technologies (ONT) cDNA-PCR Sequencing kit (SQK-PCS109). Full length PCR amplified cDNA was generated as per the SQK-PCS109 kit for each of the four samples. These individual amplified cDNA pools were each end-repaired and barcoded using ONT Native barcodes (EXP-NBD103), before being prepared for sequencing as per the ONT SQK-LSK109 adapter ligation procedure. The individual sample libraries were each run sequentially on a single ONT MinION flow cell with DNaseI mediated clearing between each sample library addition. Signal data was captured for base calling and sequence data generated from the raw captured data using ONT specific software (Guppy 3.4) on a GPU enabled desktop PC. A breakdown of the long-read datasets used can be found in Table 3.1.
                       Brain             Liver
Reads                  12,943,803        3,097,077
Bases                  13,935,095,347    2,986,120,105
cDNA samples           7                 1
RNA samples            4                 1
Median length          793               820
Mean length            1,076             964
Median base quality    9.3               9.2
Mean base quality      8.4               8.7

Table 3.1 Summary of mouse long-read data. A total of 11 brain datasets and two liver datasets were identified. Four brain datasets were generated within this research, with the remaining taken from published work by Sessegolo and colleagues (2019).

3.2.2 FLAIR Processing

For the purposes of processing the raw long-read data into genes with multiple splice variants, I used FLAIR (Tang et al., 2020), with default settings. At the time I started my work, FLAIR was the primary tool cited in the literature for MinION long-read RNA-seq analysis, and thus represented the best practices and state of the art (furthermore, later in the project I evaluated new tools including TALON (Wyman et al., 2020) and StringTie2 (Kovaka et al., 2019), and found they had no substantial effect on my findings [not shown]). Previous work in the field considered each splice variant that FLAIR outputs as representing a molecule that existed in the sample, and thus potentially has a function, regardless of length or structure (Tang et al., 2020; Workman et al., 2019). I followed the same practice, as my goal was to evaluate each transcript on biologically-motivated grounds rather than making potentially contentious decisions about what to consider a “technical artifact”.

FLAIR aligns raw long-reads to the genome using MiniMap2 (Li, 2018). Since long-read sequencers can have high base call error rates, FLAIR will “correct” erroneous splice junctions. This is accomplished by using the splice junctions identified in a short-read RNA-seq pipeline. Specifically, FLAIR corrects a splice junction identified in the long-read data to a splice junction identified in the short-read data if they are within a 10-bp window.
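The correction rule can be illustrated with a minimal Python sketch. This is not FLAIR's implementation: the function name `correct_junction`, the representation of a junction as an (intron start, intron end) pair, and the decision to require both splice sites to lie within the window are assumptions made for illustration only.

```python
from typing import Iterable, Optional, Tuple

# Hypothetical representation: (intron start, intron end) on one chromosome and strand.
Junction = Tuple[int, int]


def correct_junction(long_read_junction: Junction,
                     short_read_junctions: Iterable[Junction],
                     window: int = 10) -> Optional[Junction]:
    """Return the closest short-read junction whose two splice sites both lie
    within `window` bp of the long-read junction, or None if there is none."""
    start, end = long_read_junction
    best = None
    best_distance = None
    for s, e in short_read_junctions:
        d_start, d_end = abs(s - start), abs(e - end)
        if d_start <= window and d_end <= window:
            distance = d_start + d_end
            if best_distance is None or distance < best_distance:
                best, best_distance = (s, e), distance
    return best


short = [(1000, 2000), (5000, 6000)]
print(correct_junction((1004, 1997), short))  # (1000, 2000)
print(correct_junction((1020, 2000), short))  # None: start site is 20 bp away
```

In this sketch a long-read junction with no short-read support inside the window is returned as `None`; how such junctions are handled downstream is left open here.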
After this, FLAIR performs a collapse step, where reads with the exact same exon boundaries are clustered together. FLAIR removes splice variants clustered with fewer than 3 reads total across all samples. Also, in this step, FLAIR will annotate reads to Ensembl transcripts if the reads have the exact same exon boundaries as an Ensembl transcript. This processing yields a set of splice variants across the dataset. These are the splice variants I annotated, and I describe the annotation approach in Section 3.2.3. Next, at the quantify step, FLAIR maps the aligned reads to a specific splice variant and outputs expression values (raw read counts and normalized TPM values) for each splice variant for each sample. FLAIR next removes splice variants without at least one read in the majority of the samples. Finally, FLAIR outputs the set of splice variants organized by their mechanism of alternative splicing (for example, skipped exons).

In order to correct for potential erroneous splice junctions in the long-read data, we processed the mouse short-read data provided by Sessegolo and colleagues (2019). The mouse splice junction data were produced using the "rsem-prepare-reference" script from the RSEM RNA-Seq quantification software (Li and Dewey, 2011). The build used was Ensembl GRCm38. All files pertaining to this assembly were downloaded from the Illumina iGenomes collection for this build (https://support.illumina.com/sequencing/sequencing_software/igenome.html). Each accession was processed separately. Illumina short reads were aligned using STAR version 2.4.0h (Dobin et al., 2013). RSEM version 1.2.31 was used for gene and splice variant quantification and count matrix generation.

I processed raw long-read data using FLAIR -align, -correct, -collapse, -quantify, and -diffSplice (Tang et al., 2020). FLAIR was downloaded on January 9th, 2020 and all default settings were used. I aligned raw reads to the GRCm38 genome.
The aligned long-reads were then corrected using the SJ.out.tab file produced from processing the study’s short read data. For the purposes of prioritization, I used the following FLAIR outputs: flair.collapse.isoforms.fasta, flair.collapse.isoforms.bed, counts_matrix.tsv (normalized TPM values and raw read counts), and flair.diffsplice.* files.

3.2.3 Splice-variant specific annotations

Scripts for prioritization will be available on GitHub. Figure 3.1 provides an overview of the prioritization scheme, which I detail in the following sub-sections. The figure also shows key statistics for the outcome of the analysis, described in detail in the Results section (3.3.2).

Figure 3.1 Workflow for annotation and prioritization of splice variants found in long-read RNA-seq. See main text for details.

3.2.3.1 Expression Annotations

As many transcripts are lowly expressed, I considered genes with multiple appreciably expressed splice variants as stronger candidate genes with FDSIs (Gonzalez-Porta et al., 2013). This could mean the gene has two splice variants expressed at similar levels in the same condition.

Using the FLAIR output of counts_matrix.tsv (raw read counts or normalized TPM values), I annotated each splice variant with the total expression across all samples, average expression across all samples, total expression per tissue type (brain and liver), and average expression per tissue type. For our annotation purposes, I treated the brain samples and the colliculus samples as the same “brain” tissue. I then calculated total gene expression by summing all the read counts or TPM values for the splice variants for a given gene across all samples.
[Figure 3.1 labels: Long-read data; Short read correction; Genes with multiple splice variants (6,799 genes; 41,281 splice variants); Splice variant-specific annotations (2+ splice variants): Appreciable expression, Coding potential, Protein domains (CDD, Pfam), Conservation; Candidate genes with FDSIs (79 genes). Percentages shown in the figure: 83%, 6%, 29%, 100%.]

I annotated each splice variant with a gene-specific ranking and gene-specific expression proportion. For each gene, the most highly expressed splice variant was ranked ‘1’, the second most expressed splice variant was ranked ‘2’, and so on until all splice variants were ranked. I produced ranks based on the total expression across all samples and ranks for each tissue type. Using the gene-specific rankings, I calculated two separate ratios for each gene. First, I calculated an expression ratio between a gene’s rank 1 splice variant and its rank 2 splice variant. Second, I calculated a sum-expression ratio between a gene’s rank 1 splice variant and the sum of the expression of all other splice variants for the gene.

For our final set of candidate genes with FDSIs, I considered genes where the expression level of the rank 2 splice variant was more than 50% of the expression level of the rank 1 splice variant (expression ratio less than 2).

3.2.3.2 Open reading frame (ORF) annotations

When prioritizing protein-coding genes as candidates with FDSIs, I want to ensure that the candidates have multiple splice variants that can each permit translation into a protein. Since I do not have appropriate mass spectrometry or ribo-seq data, at minimum I wanted to ensure that a candidate gene had at least two splice variants that each had an ORF.

Using TransDecoder (https://github.com/TransDecoder/TransDecoder/wiki) on each splice variant’s cDNA sequence, I annotated each splice variant with the type of ORF and the length of the ORF.
TransDecoder categorizes types of ORFs as complete (start codon, stop codon and at least 30 amino acids in-frame), 5’ incomplete (no start codon, but a stop codon with at least 30 preceding in-frame amino acids), 3’ incomplete (no stop codon, but a start codon followed by at least 30 in-frame amino acids), and incomplete (no start or stop codon, but at least 30 in-frame amino acids). For our purposes, I do not have a priori thresholds for open reading frame length across all genes. As such, I use the longest open reading frame predicted by TransDecoder. If a splice variant has no ORF in any of these categories, I remove it from FDSI consideration. For each gene with multiple splice variants, I aligned the two most expressed splice variants using MAFFT, and counted the differences (amino acid substitutions and insertions/deletions) between the rank 1 splice variant’s ORF and the rank 2 splice variant’s ORF using EMBOSS infoalign (Katoh et al., 2002; Rice et al., 2000). For our final candidate set of genes with FDSIs, I only considered genes with at least two splice variants, each with any type of ORF.

3.2.3.3 Protein domain annotations

In order to determine whether a single gene had at least two splice variants that encoded different proteins, I used the Conserved Domain Database (CDD) and their Perl script (pwrpsb.pl) to annotate the longest ORF for each splice variant (Marchler-Bauer et al., 2015). The CDD uses BLAST to categorize what domains a protein sequence has, what domain families these protein domains are from, and whether the domains are complete. As an alternative protein annotation approach, we also annotated splice variants using the Pfam domain database (El-Gebali et al., 2019).

At minimum, all genes should have at least one splice variant with a protein domain.
In order to assess whether the CDD had functional domains for each gene, I also annotated Ensembl genes and Ensembl splice variants to determine whether most genes would have known protein domains.

I then organized each gene and splice variant from our long-read data based on our post hoc domain hierarchy (introduced in Figure 1.5, with CDD and Pfam statistics in Figure 3.5; the statistics in Figure 3.5 are discussed in the Results section 3.3.5):

• Level 1: The gene has multiple splice variants.
• Level 2: The gene has at least two splice variants, each with at least one complete protein domain.
• Level 3: The gene has at least two splice variants with different sets of domains. These domains can have some overlap. For example, splice variant 1 can contain domains A, B and C, and splice variant 2 can have domains B and C.
• Level 4: The gene has at least two splice variants with domains from different domain families.

Since about 83% of Ensembl gene IDs have at least one splice variant with a protein domain, I only considered genes that pass level 2 of our hierarchy in our final set of candidate genes. This means that a candidate gene can have multiple splice variants that encode for similar proteins, and something else explains the functional diversity of the splice variants (e.g. tissue-specific expression). However, level 3 genes and level 4 genes are considered stronger candidates in the prioritization.

3.2.3.4 Conservation

Splice variant conservation across multiple species indicates that the splice variant is important for the organism’s fitness; hence, I used splice variant conservation as a signal for functional importance. When a gene has at least two conserved splice variants, that gene is likely a candidate for having FDSIs.

PhyloP (Pollard et al., 2010) and PhastCons (Siepel et al., 2005) scores are widely used methods to determine if genomic sequences are under evolutionary constraint.
Starting from a multiple DNA sequence alignment, PhyloP produces base-specific scores that are negative log p-values. These p-values are derived from likelihood ratio tests where the null hypothesis is neutral evolution, and the alternative hypothesis is evolutionary selection (either positive or negative selection). A PhyloP score greater than 1.0 suggests that a base is more conserved across the alignment than one would expect by chance if the null hypothesis were true. The interpretation of this is that the base is evolving under negative selection. PhastCons similarly produces base-specific scores from multiple sequence alignments; however, these scores are only evidence of conservation (negative selection). Furthermore, PhastCons scores are calculated using a Hidden Markov Model (HMM), which also considers the sequence similarity of neighboring positions across the alignment. A base with a higher PhastCons score lies in a region that is more similar across the alignment than a base with a lower PhastCons score.

I downloaded 40-way mammalian PhyloP and PhastCons scores for the GRCm38 build of the mouse genome from the UCSC genome browser. Using the FLAIR diffSplice outputs for our splice variants and KentUtils (https://github.com/ENCODE-DCC/kentUtils) on these conservation scores, I annotated each discriminating exon (an alternatively spliced exon or one that is spliced in) with its average PhyloP score and average PhastCons score.

For our prioritization, I considered genes with at least one discriminating exon with a PhastCons score greater than 0.75 or a PhyloP score greater than 1.0 as candidates to have FDSIs. These scores are based on previous investigations of splice variant conservation (Kovalak et al., 2019).
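As a minimal sketch of this thresholding step, the Python below averages per-base scores over each discriminating exon and applies the two cut-offs. It assumes the per-base PhastCons and PhyloP scores have already been extracted for each exon; the function names and the input layout are hypothetical and are not taken from the pipeline's scripts.

```python
from statistics import mean
from typing import Iterable, Sequence, Tuple

PHASTCONS_THRESHOLD = 0.75  # thresholds as stated in the text
PHYLOP_THRESHOLD = 1.0


def exon_is_conserved(phastcons: Sequence[float], phylop: Sequence[float]) -> bool:
    """An exon passes if its average PhastCons OR average PhyloP score
    exceeds the corresponding threshold."""
    return mean(phastcons) > PHASTCONS_THRESHOLD or mean(phylop) > PHYLOP_THRESHOLD


def gene_passes_conservation(
        exons: Iterable[Tuple[Sequence[float], Sequence[float]]]) -> bool:
    """A gene passes if at least one discriminating exon is conserved.
    Each exon is given as (per-base PhastCons scores, per-base PhyloP scores)."""
    return any(exon_is_conserved(pc, pp) for pc, pp in exons)


# Hypothetical per-base scores for two discriminating exons of one gene.
weak_exon = ([0.10, 0.20], [0.5, 1.2])            # means 0.15 and 0.85: fails both
strong_exon = ([0.90, 0.80, 0.70], [0.2, 0.3, 0.1])  # mean PhastCons 0.8: passes
print(gene_passes_conservation([weak_exon]))               # False
print(gene_passes_conservation([weak_exon, strong_exon]))  # True
```

If PhyloP scores are taken to be base-10 negative log p-values, which is an assumption here since the text does not state the base, a score of 1.0 corresponds to p = 0.1.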
By using PhyloP or PhastCons scores to interpret the functional diversity of a gene with multiple splice variants, I assumed that a discriminating exon with a high score was present in other species and was also alternatively spliced. I interpret this as an exon evolving under negative selection (conservation).

3.2.4 Summary of the prioritization process

In summary of the steps described above (as outlined in Figure 3.1), I prioritized genes as likely candidates to have FDSIs based on the following:

• The gene has an expression ratio less than 2 (expression ratio = gene’s most expressed splice variant/gene’s second most expressed splice variant)
• The gene has at least two splice variants that each have an ORF
• The gene has at least two splice variants that each have at least one complete domain
• The gene has at least one alternatively spliced exon with a PhastCons score greater than 0.75 or a PhyloP score greater than 1.0

3.2.5 Retrieving previously known “biologically interesting” genes in our prioritization

In Chapter 2, I manually curated alternative splicing experiments for 743 human and mouse genes, and found that 43 genes have evidence of FDSIs in the literature (Bhuiyan et al., 2018). Of these genes, 20 were from mouse, which I extracted for our current purposes. Where possible I also collected the cDNA sequence and ORF sequence of the FDSIs.

I then mapped these 20 mouse genes and their FDSIs to our prioritization scheme. First, I determined whether I detected these genes in the data, and then I determined whether I detected their FDSIs in the data using either the cDNA sequence or the ORF sequence. Then I determined if the genes were included in our final candidate genes with FDSIs. If a gene was not present in the final candidate set, I determined at what part of the prioritization scheme the gene was lost.
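The four criteria of Section 3.2.4, together with the check of where a gene drops out of the scheme, can be sketched in Python as follows. This is an illustrative sketch rather than the pipeline's code: the class and function names are hypothetical, each criterion is evaluated independently of the others (the text does not specify whether the same two splice variants must satisfy every criterion), and the order of the checks follows the bullet list rather than any documented order.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SpliceVariant:
    variant_id: str
    expression: float       # total reads (or TPM) summed across samples
    has_orf: bool           # an ORF of any category was predicted
    complete_domains: int   # complete domains annotated on the longest ORF


@dataclass
class Gene:
    gene_id: str
    variants: List[SpliceVariant]
    conserved_exon: bool    # at least one discriminating exon passed the score thresholds


def expression_ratio(gene: Gene) -> float:
    """Rank 1 expression divided by rank 2 expression (infinite if undefined)."""
    ranked = sorted((v.expression for v in gene.variants), reverse=True)
    if len(ranked) < 2 or ranked[1] == 0:
        return float("inf")
    return ranked[0] / ranked[1]


def first_failed_criterion(gene: Gene) -> Optional[str]:
    """Return the first criterion the gene fails, or None if it is a candidate."""
    checks = [
        ("expression", expression_ratio(gene) < 2),
        ("orf", sum(v.has_orf for v in gene.variants) >= 2),
        ("domain", sum(v.complete_domains >= 1 for v in gene.variants) >= 2),
        ("conservation", gene.conserved_exon),
    ]
    for name, passed in checks:
        if not passed:
            return name
    return None


# Hypothetical genes: one with balanced expression, one with a dominant variant.
balanced = Gene("geneA", [SpliceVariant("A.1", 100, True, 2),
                          SpliceVariant("A.2", 60, True, 1)], True)
dominant = Gene("geneB", [SpliceVariant("B.1", 100, True, 2),
                          SpliceVariant("B.2", 10, True, 1)], True)
print(first_failed_criterion(balanced))  # None
print(first_failed_criterion(dominant))  # expression
```

Returning the name of the first failed criterion, rather than a plain pass/fail flag, mirrors the question asked in Section 3.2.5 of where in the scheme a known gene was lost.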
3.2.6 IsoVision visualization

We developed a visualization tool for splice variants, IsoVision, to assist in our prioritization and presentation of results. IsoVision is implemented in R and takes three input files: a BED-formatted file of splice variants for a gene (FLAIR output), a tab-delimited CDD output file, and a CSV-formatted file of expression levels for each variant. The visualization tool calculates, based on chromosomal position and exon sizes, the set of similarly annotated exons that are preserved between all splice variants of interest, and draws the result as a stack of aligned exon patterns. IsoVision and instructions for use are freely available online at https://github.com/jsicherman/IsoVision2.

3.3 Results

In order to prioritize genes likely to have functionally distinct splice isoforms (FDSIs), I processed long-read RNA-seq data from mouse brain and liver samples, and then annotated the splice variants found in these datasets. I then prioritized the genes likely to have FDSIs based on these splice-variant specific annotations, as outlined in Figure 3.1. In the remainder of Section 3.3, I first describe the overall results of the raw data processing and the number of prioritized candidate genes with FDSIs. I then provide detailed results of each functional genomic annotation, and characterize the genes with multiple splice variants that “pass” each functional genomic annotation.

3.3.1 Data processing

For the long-read data (Table 3.1), I used 4 novel mouse colliculus datasets from the Snutch lab, 7 mouse brain datasets from Sessegolo and colleagues, and 2 mouse liver datasets from Sessegolo and colleagues (summary of each dataset is provided in Appendix B, Tables B1, B2, and B3) (Sessegolo et al., 2019). In total, I had 16,027,088 raw reads with 14,224,215,452 bases. Across the brain and liver transcriptomes, our data have a mean read length of ~1,317 bases.
I used FLAIR to align, correct, and collapse the long-reads into transcript variants for genomic loci (Table 3.2). A total of 14,208,862 of 16,040,880 reads (88%) aligned to the mouse genome. If these aligned reads could be collapsed (clustered) together into groups of at least 3 (FLAIR default settings), they would be considered the same transcript variant. Of the 14,208,862 aligned reads, 11,312,080 (70%) reads collapsed into 221,190 transcript variants (Table 3.3). I defined the transcript variants that FLAIR mapped to Ensembl gene IDs as splice variants. Of the 221,190 transcript variants, 46,991 (21%) mapped to 12,509 Ensembl genes (Table 3.3).

The 79% of transcript variants that did not map to any known Ensembl gene are in effect treated as technical or biological artifacts and ignored by the rest of my analysis; these removed reads tend to be short (<500 bases), which suggests they are enriched for incomplete transcripts. At the transcript level, 6% of the 46,991 splice variants were annotatable to Ensembl transcript IDs.

                     Brain         Liver
Total reads          12,943,803    3,097,077
Aligned reads        12,007,910    2,200,952
Unaligned reads      935,893       896,125
Collapsed reads      9,693,640     1,618,440
Uncollapsed reads    2,314,270     582,512

Table 3.2 FLAIR processing of mouse brain and liver transcriptomes. Here I summarize the FLAIR processing of a total of 16,040,880 reads for mouse brain and liver. The rows “Aligned reads” and “Unaligned reads” relate to the successful or unsuccessful mapping of reads directly to the mouse genome. “Collapsed reads” are the total number of reads that were clustered in groups of at least 3, due to similar transcript structures, indicating these reads were for the same splice variant. If a cluster had fewer than 3 reads, FLAIR removed them from the results (Uncollapsed reads).

Transcript variants                                              221,190
Splice variants (mappable to gene)                               46,991
Ensembl genes                                                    12,509
Ensembl splice variants                                          2,751
Median splice variants per gene                                  2.0
Mean splice variants per gene                                    3.8
Genes with multiple splice variants                              6,799
Mean splice variants per genes with multiple splice variants     6.1
Gene with most splice variants                                   169 (Mup3)

Table 3.3 Summary of transcript variants found after FLAIR processing of mouse long-reads

Alternative 3’ splice site    1,261 genes
Alternative 5’ splice site    1,027 genes
Cassette exons                4,477 genes
Intron retention              1,192 genes

Table 3.4 Summary of splice variant classes found after FLAIR processing of mouse long-reads. Each splicing event can only map to one gene; however, a single gene can have multiple splicing events.

About 54% of genes (6,799/12,509) had multiple splice variants and I used these 6,799 genes for our prioritization purposes (Table 3.3). In Table 3.4, I provide a summary of the non-mutually exclusive alternative splicing events occurring in these genes. As these splicing events are defined by FLAIR, I note that these splicing events must be present with at least one read in the majority of the samples of either the brain or liver transcriptome.

3.3.2 79 candidate genes likely to have FDSIs

Of the 6,799 genes in our mouse brain and liver long-read transcriptomes, I prioritized 79 genes as likely candidates with FDSIs (Table 3.5 for 33 genes with cassette exons; Table 3.6 for 45 genes with intron retention; Table 3.7 for 9 genes with alternative 5’ splice sites; Table 3.8 for 7 genes with alternative 3’ splice sites; a single candidate gene can have multiple splicing mechanisms). In Sections 3.3.3 to 3.3.6, I describe how the annotation scheme (Figure 3.1) led me to prioritize these 79 candidates. In summary, these 79 genes had at least two appreciably expressed and conserved splice variants.
Furthermore, each splice variant had a CDD-annotatable protein domain and open reading frame (ORF). The median amino acid difference between the ORFs of the two most expressed splice variants for each gene was 0% (range: 0% to 34%). Only 18/79 genes have two splice variants that each encode for at least one different protein domain (Level 3 in Figure 3.5).

Borcs7 (BLOC-1 Related Complex Subunit 7)
Hikeshi (Hikeshi)
Stx8 (Syntaxin-8)
Cdc42 (Cell Division Cycle 42)
Hspbp1 (HSPA binding protein, cytoplasmic cochaperone 1)
Tle5 (TLE Family Member 5, Transcriptional Modulator)
Cdipt (CDP-Diacylglycerol--Inositol 3-Phosphatidyltransferase)
Iah1 (Isoamyl Acetate Hydrolyzing Esterase 1)
Tpd52 (Tumor Protein D52)
Chmp2a (Charged Multivesicular Body Protein 2A)
Nars (Asparaginyl-tRNA synthetase)
Trappc1 (Trafficking Protein Particle Complex 1)
Coro1b (Coronin 1B)
Nipsnap1 (Nipsnap Homolog 1)
Ugp2 (UDP-Glucose Pyrophosphorylase 2)
Cuedc2 (CUE Domain Containing 2)
Otub1 (OTU Deubiquitinase, Ubiquitin Aldehyde Binding 1)
Zfand6 (Zinc Finger AN1-Type Containing 6)
Dap3 (Death Associated Protein 3)
Psmc4 (Proteasome 26S Subunit, ATPase 4)
Dctn2 (Dynactin Subunit 2)
Rpl22l1 (Ribosomal Protein L22 Like 1)
Dhrs7b (Dehydrogenase/Reductase 7B)
Rtn3 (Reticulon-3)
Ech1 (Delta-Delta-dienoyl-CoA isomerase)
Scn1a (sodium channel, voltage-gated, type I, alpha)
Fxyd7 (FXYD Domain Containing Ion Transport Regulator 7)
Sdcbp (Syntenin-1)
Gm20390 (Nucleoside diphosphate kinase)
Selenow (Selenoprotein W)
Gstz1 (Glutathione S-Transferase Zeta 1)
Stoml2 (Stomatin Like 2)

Table 3.5 33 candidate genes likely to have FDSIs (cassette exons)

Akap8l (A kinase (PRKA) anchor protein 8-like)
Doc2a (double C2, alpha)
Nsd1 (nuclear receptor-binding SET-domain protein 1)
Smarcc2 (SWI/SNF related, matrix associated, actin dependent regulator of chromatin, subfamily c, member 2)
Ambp (alpha 1 microglobulin/bikunin precursor)
Enpp5 (ectonucleotide pyrophosphatase/phosphodiesterase 5)
Pcnp (PEST proteolytic signal containing nuclear protein)
Srek1 (splicing regulatory glutamine/lysine-rich protein 1)
Arnt2 (aryl hydrocarbon receptor nuclear)
Ewsr1 (Ewing sarcoma breakpoint region 1)
Phf1 (PHD finger protein 1)
Srrm1 (serine/arginine repetitive matrix 1)
Atxn2l (ataxin 2-like)
Gprc5b (G protein-coupled receptor, family C, group 5, member B)
Phf24 (PHD finger protein 24)
Srrt (serrate RNA effector molecule homolog)
Brd1 (bromodomain containing 1)
Gria2 (glutamate receptor, ionotropic, AMPA2)
Pigo (phosphatidylinositol glycan anchor biosynthesis, class O)
Srsf12 (serine and arginine-rich splicing factor 12)
Bag6 (BCL2-associated athanogene 6)
Hnrnpm (heterogeneous nuclear ribonucleoprotein M)
Plp1 (proteolipid protein (myelin) 1)
Thoc1 (THO complex 1)
Cdk11b (cyclin-dependent kinase 11B)
Ivns1abp (influenza virus NS1A binding protein)
Prpf39 (pre-mRNA processing factor 39)
Xpot (exportin, tRNA)
Cfl2 (cofilin 2, muscle)
Kat5 (K(lysine) acetyltransferase 5)
Rgl2 (ral guanine nucleotide dissociation stimulator-like 2)
Zfp131 (zinc finger protein 131)
Chmp2a (charged multivesicular body protein 2A)
Med24 (mediator complex subunit 24)
Rit2 (Ras-like without CAAX 2)
Clk4 (CDC like kinase 4)
Mkrn1 (makorin, ring finger protein, 1)
Serbp1 (serpine1 mRNA binding protein 1)
Cryab (crystallin, alpha B)
Nacc2 (nucleus accumbens associated 2, BEN and BTB (POZ) domain containing)
Serpinc1 (serine (or cysteine) peptidase inhibitor, clade C (antithrombin), member 1)
Cstf2 (cleavage stimulation factor, 3' pre-RNA subunit 2)
Nap1l1 (nucleosome assembly protein 1-like 1)
Sf3b1 (splicing factor 3b, subunit 1)

Table 3.6 45 candidate genes likely to have FDSIs (intron retention)

Cfl2 (cofilin 2, muscle)
Gria2 (glutamate receptor, ionotropic, AMPA2 (alpha 2))
Sh2b1 (SH2B adaptor protein 1)
Ehbp1 (EH domain binding protein 1)
Hnrnpc (heterogeneous nuclear ribonucleoprotein C)
Rps6kb1 (ribosomal protein S6 kinase, polypeptide 1)
Ewsr1 (Ewing sarcoma breakpoint region 1)
Vcpkmt (valosin containing protein lysine (K) methyltransferase)

Table 3.7 9 candidate genes likely to have FDSIs (alternative 5’ splice site)

Phf1 (PHD finger protein 1)
Sf1 (splicing factor 1)
Zswim8 (zinc finger SWIM-type containing 8)
Plp1 (proteolipid protein (myelin) 1)
Thra (thyroid hormone receptor alpha)
Serbp1 (serpine1 mRNA binding protein 1)
Ttc14 (tetratricopeptide repeat domain 14)

Table 3.8 7 candidate genes likely to have FDSIs (alternative 3’ splice site)

I determined whether our approach prioritized genes with experimental evidence of FDSIs from Chapter 2 (Bhuiyan et al., 2018). In Chapter 2, we identified 20 mouse genes with FDSIs, of which I detected expression of 12 in our long-read data. However, only 8/12 genes had multiple splice variants detectable in our data, while 4/12 genes had only one detectable splice variant. For the 4/12 genes with only one detectable splice variant, the variant detected matches one of the isoforms annotated in Chapter 2. Of the 8/12 genes with multiple detectable splice variants, I did not detect the expected isoforms for five genes (Il1rap, Myh10, Nf1, Snap25, Cacna1b). For one gene (Rock2), I detected one splice variant matching one of two literature-reported isoforms, but the remaining detected splice variants did not match the remaining reported isoform. Another gene (Lpin1) had both FDSIs detected in our data; however, the gene was lowly expressed in a minority of our samples and had been filtered out by FLAIR. Thus, of the genes previously identified in Chapter 2, only one emerged as being among the 79 genes prioritized by my pipeline. The fact that I only prioritized one gene with FDSIs from Chapter 2 in my long-read data is likely a consequence of my data’s limitations (see Discussion in Section 3.4).
3.3.3 Over a quarter of genes have multiple appreciably expressed splice variants

I defined a gene to have multiple appreciably expressed splice variants based on an expression ratio between the gene’s most expressed splice variant and the gene’s second-most expressed splice variant. If a gene’s expression ratio was less than 2, I considered the gene to have multiple appreciably expressed splice variants. Using the total reads for each splice variant in both brain and liver transcriptomes, I found that 1,960/6,799 genes (29%) have an expression ratio less than 2 (Figure 3.2A). I further assessed how dominant the expression of rank 1 splice variants was compared to splice variants of differing ranks (Figure 3.2B). Rank 1 splice variants tend to dominate their gene’s overall expression, suggesting that splice variants contribute to transcriptomic diversity unevenly. Finally, I calculated a sum-expression ratio for each gene based on the gene’s rank 1 splice variant’s expression to the sum of the gene’s other splice variants. About 54% (3,666 genes) had a sum expression ratio (the gene’s most expressed splice variant/sum of the gene’s other splice variants) of at least 2 (Figure 3.2C).

[Figure 3.2 axes: A) x-axis: log10(splice variant rank 1 expression/splice variant rank 2 expression), y-axis: count of genes; B) x-axis: splice variant rank within gene, y-axis: splice variant expression relative to gene expression; C) x-axis: log10(splice variant rank 1 expression/all other splice variants' expression), y-axis: count of genes]

Figure 3.2 Most genes have one dominantly expressed splice variant across both tissue types. A) 71% of genes have a rank 1 splice variant that is at least double in expression compared to its rank 2 expression (red line). For each gene, I ranked the splice variants by their total expression. For example, the most expressed splice variant was ranked 1, the second most expressed splice variant was ranked 2.
Then I divided the rank 1 splice variant’s expression by the rank 2 splice variant’s expression to produce the expression ratio plotted along the x-axis. At the red line, the expression ratio equals 2.0. At the cyan line, the expression ratio equals 5.0. For prioritization purposes, the genes to the left of the red line were retained. B) Rank 1 splice variants tend to dominate gene expression relative to other splice variants. Stratified by splice variant expression rank (x-axis), the contribution of each splice variant’s expression relative to the gene’s overall expression (y-axis) is plotted. C) 54% of genes have a rank 1 splice variant whose expression is at least double the summed expression of all other splice variants for the same gene. For each gene, I ranked the splice variants by their total expression. For example, the most expressed splice variant was ranked 1, the second most expressed splice variant was ranked 2, and so on until each splice variant was ranked. I then divided the rank 1 splice variant’s expression by the sum of all other splice variants for the same gene to produce the ratio plotted along the x-axis.

I also investigated whether genes had a different rank 1 splice variant in our brain and liver samples. About 41% of genes (2,781/6,799) had a tissue-specific rank 1 splice variant. Many of these genes (1,295) also had an expression ratio of less than 2.

3.3.4 Nearly all splice variants had an open reading frame (ORF)

I annotated each splice variant with a TransDecoder-predicted ORF in order to determine whether a gene had at least two splice variants that encoded a protein. A splice variant could have a “complete”, “5’ incomplete”, “3’ incomplete”, or “incomplete” ORF. For our prioritization purposes, I kept any splice variant with any annotatable ORF of at least 30 amino acids. Of the 41,285 splice variants mapped to our 6,799 genes, nearly all (41,174) have an ORF (Figure 3.3A).
While I removed 111 splice variants without a predicted ORF from further analysis, all 6,799 genes still had multiple splice variants with annotatable ORFs. Thus, no genes were removed from analysis by this step. Furthermore, about 69% of genes (4,688/6,799) have two or more splice variants with complete ORFs (Figure 3.3B).

[Figure 3.3 panels: A) count of genes vs. number of splice variants with ORFs (log10); B) count of genes vs. number of splice variants with complete ORFs (log10)]

Figure 3.3 All genes with multiple splice variants have an annotatable ORF
A) All 6,799 genes with multiple splice variants had multiple splice variants with an annotatable ORF. B) About 69% of genes have multiple splice variants with complete ORFs.

I combined our previous expression rank annotations with our ORF annotations (Figure 3.4). For our 6,799 genes with multiple splice variants, about 57% (3,893) had a rank 1 splice variant with a complete ORF (Figure 3.4A). However, regardless of the type of ORF, 40% of the genes (2,762) had the exact same amino acid sequence in their rank 1 and rank 2 splice variants (Figure 3.4B).

Figure 3.4 Rank 1 splice variants mostly have complete open reading frames that are similar to rank 2 splice variants
A) About 57% of genes have a rank 1 splice variant with a complete open reading frame (ORF). I annotated the most abundantly expressed splice variant for each gene (rank 1 splice variant) with an ORF of at least 30 amino acids using TransDecoder. TransDecoder classifies ORFs as complete (start and stop codon), 5prime_partial (missing a start codon), 3prime_partial (missing a stop codon), and internal (missing both start and stop codons, but at least 30 amino acids in frame). B) 40% of genes have a rank 1 splice variant ORF that is the same as their rank 2 splice variant ORF (red). Using EMBOSS, I compared the differences in amino acid composition (x-axis) and length (y-axis) of the rank 1 and 2 splice variants for each gene.
Diagonal with a slope of 1 added as a visual guide.

[Figure 3.4 panels: A) count vs. ORF type (3prime_partial, 5prime_partial, complete, internal); B) log10(# of amino acid differences + 0.001) vs. log10(# of gaps + 0.001)]

3.3.5 83% of genes have two splice variants with an annotatable protein domain

I annotated each ORF-containing splice variant with protein domain annotations from the CDD (Marchler-Bauer et al., 2011) and produced a hierarchy of protein domain distinctness for genes with multiple splice variants (Figure 3.5). Of the 6,799 genes with multiple splice variants, 83% had at least two splice variants where both splice variants had an annotatable domain (Level 2 in Figure 3.5). Annotating the splice variants using Pfam instead of CDD yielded similar results (80%).

For the purposes of my prioritization, I kept the 83% of genes at Level 2 when annotating splice variants using the CDD. However, the subset of Level 2 genes that are also part of Levels 3 and 4 may be more interesting for follow-up. Level 3 (CDD: 20%; Pfam: 19%) genes are cases where at least two splice variants have a distinct set of protein domains. In principle, this would be a good indication of distinctness in protein function. At Level 4 (CDD: 20%; Pfam: 10%), genes have two splice variants where the splice variants were annotated with at least one domain from a different domain family. As differences in molecular function between domain families tend to be greater than those within families, Level 4 genes could represent larger effects of alternative splicing on “functional diversity”.

Figure 3.5 Levels of protein domain distinctness for genes with multiple splice variants
The scheme introduced in Figure 1.5 is shown augmented with results statistics.
Level    Criterion                                                                          CDD, genes (%)   Pfam, genes (%)
Level 1  Gene has multiple splice variants                                                  6,799 (100%)     6,799 (100%)
Level 2  Gene has at least two splice variants with at least one complete protein domain   5,649 (83%)      5,437 (80%)
Level 3  Gene has at least two splice variants with different protein domains              1,399 (20%)      1,323 (19%)
Level 4  Gene has at least two splice variants with different domain families              1,373 (20%)      714 (10%)

3.3.6 Few genes have at least one conserved spliced-in exon

I used conservation of a discriminating exon (one that is spliced in or out) as a signal for biological importance. For our candidate set of genes, I annotated four different splicing events: cassette exons, retained introns, alternative 5’ splice sites, and alternative 3’ splice sites. I extracted these splicing events from FLAIR’s output and annotated each one with PhastCons and PhyloP scores. Of the 6,799 genes with multiple splice variants in our long-read data, 483 genes have at least one conserved alternatively spliced exon (Figure 3.6; for PhastCons distributions, see appendix). Broken down by splicing events, 341 genes have at least one conserved cassette exon (Figure 3.6A), 113 genes have at least one conserved retained intron (Figure 3.6B), 9 genes have at least one conserved alternative 5’ splice site (Figure 3.6C), and 15 genes have at least one conserved alternative 3’ splice site (Figure 3.6D). I used this set of 483 genes for our list of 79 genes likely to have FDSIs. The conservation of one exon may be difficult to interpret. A transcript that contains a spliced-in exon with a high PhyloP score suggests functional importance; however, the same principle does not apply to a transcript without that exon. The splice variant without the conserved exon could simply be a consequence of splicing error. As such, I also subsetted for genes with two conserved alternatively spliced exons on separate splice variants, resulting in 283 genes.
Broken down by splicing events, 236 genes have multiple conserved cassette exons, 44 genes have multiple conserved retained introns, 1 gene has multiple conserved alternative 5’ splice sites, and 2 genes have multiple conserved alternative 3’ splice sites. If I prioritize only genes with at least two conserved and alternatively spliced exons, the final set consists of 48 prioritized candidates when considering our other annotations (protein domains, expression, and coding potential).

[Figure 3.6 panel labels: A) 341 genes; B) 113 genes; C) 9 genes; D) 15 genes]

Figure 3.6 6% of genes have at least one conserved alternatively spliced exon.
Each histogram shows the distribution of PhyloP scores for each alternatively spliced exon, organized by splicing mechanism. The orange line is our threshold for a conserved exon (1.0), and next to the orange line is the number of genes with at least one conserved alternatively spliced exon. A) Skipped exons: Of the 8,215 skipped exons, 910 (11%) are conserved. B) Retained introns: Of the 6,196 retained introns, only 311 (3%) are conserved. C) Alternative 5’ splice sites: Of the 6,722 alternative 5’ splice sites, only 10 (~0.14%) are conserved. D) Alternative 3’ splice sites: Of the 7,037 alternative 3’ splice sites, only 17 (~0.2%) are conserved.

3.3.7 Description of selected candidate genes with FDSIs

Here I detail three example candidate genes with FDSIs, chosen to illustrate three distinct scenarios in terms of previous evidence supporting my findings. The first gene, Cdc42, is a gene with literature evidence of FDSIs from Chapter 2 of my thesis. The selected effect of the Cdc42 splice variants has experimental support of necessity. For the second gene, Tpd52l1, I could not find literature to support it as a gene with FDSIs, though there have been studies into the causal role of this gene’s splice variants. In future experiments, these may be established as selected effects.
Finally, for the third gene, Gstz1, I found no literature investigating the function of its splice variants, whether as selected effects or causal roles. As I have prioritized Gstz1 based on conservation, protein domains, expression, and coding potential, Gstz1 may be a truly novel gene with FDSIs. Details on these genes are given in the next subsections.

3.3.7.1 Previously known case of FDSIs: Cdc42

In my long-read dataset, I prioritized one gene from our previous curation of genes with literature evidence of FDSIs, Cdc42 (Bhuiyan et al., 2018; Yap et al., 2016). Cdc42 plays a role in cell projection growth and cell polarity. In neuronal precursors, an isoform containing exon 7 is expressed. As the neuronal precursor acquires its neuronal identity, the cell co-expresses the Cdc42 isoform containing exon 7 and another isoform containing exon 6. Yap and colleagues showed that a knockdown of isoforms containing exon 7 reduces axonogenesis. In contrast, the knockdown of isoforms containing exon 6 decreases dendritic spine density. The investigators concluded that Cdc42 had FDSIs important to the development of the nervous system. In our long-read data, Cdc42’s two most abundantly expressed splice variants are these same two FDSIs tested by Yap and colleagues.

3.3.7.2 Literature pointing towards functional distinctness: Tpd52l1

In humans, the TPD52 family consists of 4 genes: TPD52 (tumor protein D52), TPD52L1 (tumor protein D52 like-1 or D53), TPD52L2 (tumor protein 52 like-2 or D54), and TPD52L3 (tumor protein 52 like-3 or D55) (Boutros et al., 2004). The TPD52 family is generally implicated in cell proliferation. TPD52 proteins hetero- or homodimerize via the coiled-coil motif and do not harbor any catalytic domains; therefore, the general consensus is that this gene family encodes adaptor proteins. As adaptor proteins found in the brain, they are implicated in calcium signaling (Boutros et al., 2004). My dataset has 9 splice variants for mouse Tpd52l1.
The two most expressed splice variants had a total expression of 167 and 110 reads (Figure 3.7A). The most expressed was novel, whereas the second most abundantly expressed splice variant corresponds to an Ensembl transcript (ENSMUST00000000305). Structurally, they have alternative first exons, and the second most expressed splice variant lacks exon 5. These splice variants encoded open reading frames that differ by one amino acid. My literature review indicated that the 5th exon of our Tpd52l1 splice variants has been studied, but in humans rather than mice. Boutros and colleagues used yeast two-hybrid (Y2H) assays to investigate the interaction partners of human TPD52L1 splice variants (Boutros et al., 2003). They found that exon 5 affects the binding partners of D53. Homodimers of D53 (where both D53s have exon 5) were more likely to bind to 14-3-3 proteins than those that lacked exon 5. The human exon 5 that they characterized contained the amino acid sequence SHSIG, whereas our mouse exon 5 contained SHSFG. The high degree of protein sequence similarity between the mouse and human exon 5 likely indicates similarity in function.

Additionally, Nourse and colleagues demonstrated the functional importance of TPD52L1 splice variants lacking exon 5 (similar exons were not found in other TPD52 genes) (Nourse et al., 1998). Tpd52l1 splice variants that lacked exons 5 and 6 were similar to the splice variants of other genes in the TPD52 family. This suggested to Nourse and colleagues that Tpd52l1 splice variants that lack exons 5 and 6 have a distinct functional role. To strengthen this point, Cho et al. found that TPD52L1 splice variants lacking exon 5 interacted with the C-terminal regulatory domain of ASK1 (Cho et al., 2004). When they overexpressed these splice variants, they induced ASK1-promoted apoptosis in HEK cells.

Based on previous literature about Tpd52l1 and its splice variants, and my prioritization scheme, I hypothesize that Tpd52l1 may have FDSIs.
As Tpd52l1 encodes adaptor proteins, the presence or absence of exon 5 influences the binding partner of Tpd52l1 and ultimately allows the gene to function in distinct molecular pathways.

[Figure 3.7 image: exon structure diagrams for panels A and B; annotated domain label “TPD52 coil-coil”]

Figure 3.7 Example candidate genes with FDSIs
Annotated domains are highlighted in color. Figures generated using IsoVision. A) Most abundantly expressed Tpd52l1 splice variants. B) Most abundantly expressed Gstz1 splice variants.

3.3.7.3 A novel candidate FDSI gene: Gstz1

Gstz1 is a member of the glutathione S-transferase (GST) gene family. These genes encode enzymes that aid in the detoxification of electrophilic molecules by conjugation with glutathione (Nebert and Vasiliou, 2004). GSTs are extremely diverse in their function. Gstz1 is specifically known for catalyzing the glutathione-dependent isomerization of maleylacetoacetate to fumarylacetoacetate, which is the second-to-last step in the vital phenylalanine and tyrosine degradation pathway. It is the only enzyme in the GST family that catalyzes a significant process in intermediary metabolism, and it can be found in a variety of species, from humans to bacteria. I observed 18 splice variants for Gstz1, all of which were expressed in the liver, but few of which were expressed in the brain (gene expression was 866 reads in liver vs. 279 reads in brain). The most abundantly expressed splice variant was primarily expressed in the liver and was not found in Ensembl. In contrast, the second-most abundantly expressed splice variant was similarly expressed in brain and liver samples and corresponds to an Ensembl transcript (ENSMUST00000063117). Structurally, the two most expressed Gstz1 splice variants were similar, differing only in their first exons (Figure 3.7B). The splice variants encoded the exact same amino acid sequence. Thus, both variants had the maleylacetoacetate isomerase domain, which is necessary for catalysis.
I investigated the Gstz1 literature and found that both first exons were present in CAGE-seq reports (Lizio et al., 2015). However, I failed to find any indication in the literature that there was “functional distinctness” between the splice variants, making Gstz1 a truly novel candidate gene. Given that both splice variants encoded identical protein sequences, I hypothesize that Gstz1 has FDSIs that are distinct via expression pattern differences (Figure 1.1 in Chapter 1). That is, the two isoforms should have different expression patterns (in time and/or space), such that they are not completely redundant. The data on hand are insufficient to establish this, so it remains a topic for future study.

3.4 Discussion

Previous work in the field has used functional genomic annotations to determine genes likely to have multiple functional splice variants; however, to our knowledge, our work represents the first step in applying these annotations to long-read splice variants. Nearly all splice variants (95%) were apparently novel to Ensembl. By applying functional genomic annotations to long-read splice variants, I prioritized a set of 79 genes out of 6,799 (1.1%) as genes most likely to have functionally distinct splice isoforms (FDSIs). These genes had multiple conserved splice variants that were appreciably expressed relative to the gene’s overall expression. Furthermore, these genes had more than one splice variant encoding an open reading frame with an annotatable protein domain. While these functional genomics properties cannot be taken as conclusive evidence of functional distinctness, these genes have more convincing evidence of functional distinctness than the remaining genes in our long-read data. For example, a gene with multiple conserved splice variants is more likely to have multiple necessary splice variants than a gene with only one conserved splice variant. My analysis is clearly incomplete, largely due to data limitations.
I had data for only about 1/3 of annotated mouse protein-coding genes, and I detected only 12 of the mouse genes with FDSIs from Chapter 2. In addition, the long-read data has relatively shallow sequencing depth compared to short-read datasets and covers just two tissues (Table 3.1). Short-read datasets used in comparable analyses tend to have more sequencing depth and many more tissue types. I believe this explains why I missed prioritizing two detected genes (Rock2 and Lpin1) with previous experimental evidence of FDSIs. Rock2 has literature evidence of FDSIs in muscle cells, while I only had samples for the brain and liver. In the case of Lpin1, I detected the two literature FDSIs, but I detected gene expression in only a minority of the samples. Nevertheless, the literature for some of the 79 prioritized genes, such as Gstz1 and Tpd52l1, suggests some promise towards their candidacy.

3.4.1 Effects of thresholds

One may ask how many more genes would remain for our prioritization if we loosened our thresholds. For example, if we kept genes where the expression ratio between the most expressed splice variant and the second most expressed splice variant is less than 2.5, the remaining genes for prioritization would change from 1,960 (29%) to 2,498 (36%) genes. Changing the threshold to less than 3 would leave 2,891 (42%) genes for prioritization. For our conservation thresholds, our use of 1.0 as a PhyloP score yielded 483 genes (~7%) with one conserved alternatively spliced exon. A threshold of 1.3 or 0.6 would change our yield to 396 (~6%) or 646 (9%) genes, respectively. While loosening either the expression or the conservation threshold would increase the yield of prioritized genes, it may introduce more lowly expressed or non-conserved splice variants. These are more likely to be noise (Abascal et al., 2015a; Ezkurdia et al., 2015; Saudemont et al., 2017). However, to enable threshold-free investigation of our results, complete annotations are available from the authors.
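The expression-ratio sensitivity analysis described above can be illustrated with a short sketch. This is not the pipeline used in this chapter: the input layout (a dictionary mapping each gene to its per-variant read counts), the function names, and the toy counts are all assumptions made for illustration. The only elements taken from the text are the definition of the ratio (rank 1 expression divided by rank 2 expression) and the rule that a gene is retained when the ratio is strictly below the threshold.

```python
from typing import Dict, Iterable, List


def expression_ratio(variant_reads: List[int]) -> float:
    """Rank 1 / rank 2 expression for one gene.

    variant_reads holds total read counts, one entry per splice variant
    (hypothetical input format).
    """
    if len(variant_reads) < 2:
        raise ValueError("need at least two splice variants")
    ranked = sorted(variant_reads, reverse=True)
    if ranked[1] == 0:
        # Assumption: a zero-count rank 2 variant is treated as fully dominated.
        return float("inf")
    return ranked[0] / ranked[1]


def genes_retained(
    reads_by_gene: Dict[str, List[int]], thresholds: Iterable[float]
) -> Dict[float, int]:
    """Count genes whose expression ratio is strictly below each threshold."""
    ratios = [expression_ratio(reads) for reads in reads_by_gene.values()]
    return {t: sum(1 for r in ratios if r < t) for t in thresholds}


# Toy read counts for hypothetical genes, not data from this thesis.
toy_counts = {
    "gene_a": [100, 80, 5],  # ratio 1.25
    "gene_b": [100, 45],     # ratio ~2.22
    "gene_c": [100, 35],     # ratio ~2.86
    "gene_d": [100, 10],     # ratio 10.0
}
retained = genes_retained(toy_counts, [2.0, 2.5, 3.0])
```

In this toy input, each looser threshold admits one more gene, which mirrors the qualitative trend reported above (more genes retained at 2.5 and 3 than at 2), without reproducing the reported counts.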
I used base-specific PhyloP scores (Figure 3.6) to determine whether an exon evolved under negative selection (conservation). These PhyloP scores were calculated using a 40-mammalian species multiple sequence alignment. A high PhyloP score suggests a high degree of similarity of the exon across all 40 mammals. This similarity indicates functional importance, as evolution has likely maintained this sequence for organismal fitness. While PhyloP and similar methods provide a useful metric for assessing conservation, it is important to acknowledge that they have limitations. For example, multiple sequence alignments can have poorly aligned regions, especially if they include species with poor reference genomes (Gouin et al., 2015; Prakash and Tompa, 2007). Even with a lenient conservation threshold, a few genes likely to have FDSIs may be missed in my prioritization. I based our PhyloP threshold on a previous study by Kovalak and colleagues (Kovalak et al., 2019). These investigators sampled PhyloP scores for random 5’ splice sites and 3’ splice sites, where they detected an average PhyloP score of ~1.0 and ~0.6, respectively. Since PhyloP scores are negative logged p-values (-log10(0.05) = 1.3), some may consider these thresholds too lenient for a standard conservation analysis. However, Kovalak and colleagues established these PhyloP scores for the purposes of investigating the functional consequences of alternative splicing.

For the purposes of prioritizing genes with multiple appreciably expressed splice variants, we kept genes where the ratio of the most expressed splice variant to the second most expressed splice variant was less than 2. Consequently, of 6,799 genes, only 1,960 (29%) remained (Figure 3.2A). We had three reasons for using this specific approach in our prioritization. First, to our knowledge, there exist no general expression thresholds in the field for long-read data.
In short-read data, lowly expressed transcripts are typically removed either as technical artefacts or as biological noise (Pertea et al., 2018a). Without a standard, we did not want to apply a sweeping threshold across all genes, but rather a gene-specific threshold. Second, Gonzalez-Porta and colleagues defined a gene with a dominantly expressed splice variant as a gene whose most expressed splice variant has double the expression of its second-most expressed splice variant (Gonzalez-Porta et al., 2013). We assume that every gene has at least one functional splice variant, and that the most expressed splice variant is functional. By that reasoning, if the gene has another splice variant that contributes similarly to transcriptomic diversity, then that splice variant is also likely to be functional. Thus, we maintain that the ratio between the most expressed splice variant and second-most expressed splice variant must be less than 2. However, to allow readers to decide the best threshold for themselves, I can provide all gene expression ratios on request. Finally, in Figure 3.2C, we demonstrate just how dominantly each gene’s most expressed splice variant contributes to transcriptomic diversity over other splice variants. About 54% of genes have a most expressed splice variant whose expression is at least double the summed expression of all other splice variants of the same gene.

With such a low proportion of genes (1.1%) considered most likely to have FDSIs in our analysis, it is reasonable to ask whether this proportion would change with additional or deeper long-read data for the same tissues. However, there are several reasons why the proportion will likely remain low. First, most splice variants are lowly expressed relative to the gene’s overall expression. Increasing our sequencing depth will likely only detect more lowly expressed splice variants.
Second, we report, like other studies, that most splice variants evolve neutrally rather than under natural selection (Reyes et al., 2013). The lack of a selection signal suggests a lack of biological importance. There tends to be an inverse correlation between gene expression and the amount of unconserved, erroneous transcripts (Saudemont et al., 2017). Again, increasing our sequencing depth would likely result in more lowly expressed splice variants, which are likely to be lowly conserved. Finally, though we did not exhaustively explore proteomic datasets, we do not believe that we would be able to detect the vast majority of these long-read splice variants in mass spectrometry experiments. Mass spectrometry fails to detect the vast majority of splice variants found in RNA-seq studies (Tress et al., 2017b). While some of this can be explained by technical limitations, splice variants found in mass spectrometry datasets tend to be highly expressed and well-conserved. Taking these lines of reasoning together, we hypothesize that additional data will not greatly increase the proportion of genes prioritized.

After considering my annotation scheme, readers may question whether functional genomic annotations other than protein domains, which might also functionally distinguish isoforms, would provide more candidate genes with FDSIs. Besides the fact that there is an upper bound on how many genes could pass my (liberal) conservation criterion for multiple isoforms (6%), my preliminary investigations suggest that adding other annotations will have little effect. I tested annotations for intrinsically disordered domains, positive selection metrics, signal peptides, and ribosome profiling expression, none of which yielded any additional positive predictions of FDSIs (data not shown).

3.4.2 Support for the noisy splicing model

There is an ongoing debate in the field about the extent to which alternative splicing diversifies the function of mammalian proteomes.
Some adopt a Panglossian view of alternative splicing where each splice variant produces a novel, functionally distinct product and thus greatly increases functional diversity (Blencowe, 2017; Kelemen et al., 2013; Li et al., 2014; Ryu et al., 2017; Shabalina et al., 2014). However, in the absence of any evidence of functionality (much less distinct functionality), many of these alternative transcripts could also be “noise” (Chapter 1) (Melamud and Moult, 2009; Pickrell et al., 2010; Tress et al., 2017b; Zhang et al., 2009). We investigated whether our long-read RNA-seq data would also support the noisy splicing model. One key consequence of noisy splicing is that most splice variants are lowly expressed. In our data, we detected that 71% of genes have their total expression dominated by one splice variant (Figure 3.2). Furthermore, about 60% of genes have the same most expressed splice variant in our brain and liver samples. These statistics mirror those from a study by Gonzàlez-Porta et al. (2013), who used Illumina BodyMap data to demonstrate that the vast majority of the mRNA pool in a transcriptome comes from one splice variant per gene, despite having data for 16 different tissue types. By combining mass spectrometry and RNA-seq, Ezkurdia and colleagues provided further support for this observation (Ezkurdia et al., 2015).

Another consequence of the noisy splicing model is that most splice variants are poorly conserved. Conservation suggests that natural selection has maintained the splice variant throughout evolution, and therefore is a signal of biological importance. We investigated the conservation of alternative exons with the multiple sequence alignment-based method PhyloP (Figure 3.6). Regardless of splicing mechanism, the vast majority of discriminating exons remain poorly conserved in the mammalian phylogeny. These results are also consistent with the noisy splicing literature (Pickrell et al., 2010; Reyes et al., 2013).
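The conservation call applied to alternatively spliced exons in this chapter (a PhyloP threshold of 1.0) can be sketched as follows. This is an illustration under stated assumptions rather than the analysis code: the text does not specify how base-specific PhyloP scores were summarized per exon, so the arithmetic mean is assumed here, as is the use of "greater than or equal to" at the threshold. The function names and toy scores are hypothetical.

```python
import math
from statistics import mean
from typing import Dict, Sequence


def exon_score(base_scores: Sequence[float]) -> float:
    """Summarize base-wise PhyloP scores for one exon.

    ASSUMPTION: the per-exon summary is the arithmetic mean; the thesis
    text does not state which summary statistic was used.
    """
    if not base_scores:
        raise ValueError("exon has no scored bases")
    return mean(base_scores)


def is_conserved(base_scores: Sequence[float], threshold: float = 1.0) -> bool:
    """Call an exon conserved when its summary score meets the threshold."""
    return exon_score(base_scores) >= threshold


def fraction_conserved(
    exons: Dict[str, Sequence[float]], threshold: float = 1.0
) -> float:
    """Fraction of exons called conserved at the given threshold."""
    calls = [is_conserved(scores, threshold) for scores in exons.values()]
    return sum(calls) / len(calls)


def pvalue_to_score(p: float) -> float:
    """Magnitude of a score expressed as a negative log10 p-value."""
    return -math.log10(p)


# Toy base-wise scores for hypothetical exons, not data from this thesis.
toy_exons = {
    "exon_1": [2.1, 1.8, 2.5, 1.9],
    "exon_2": [0.2, -0.4, 0.9, 0.1],
    "exon_3": [1.2, 0.9, 1.1, 1.0],
    "exon_4": [0.7, 0.8, 0.6, 0.7],
}
by_threshold = {t: fraction_conserved(toy_exons, t) for t in (0.6, 1.0, 1.3)}
```

Sweeping the threshold over 0.6, 1.0, and 1.3 in this way corresponds to the sensitivity check discussed in Section 3.4.1; `pvalue_to_score(0.05)` reproduces the value of about 1.3 quoted there.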
3.5 Conclusions

While 95% of protein-coding genes have multiple splice variants, the functional consequences of most remain unclear. In this study, we prioritize a set of 79 mouse genes we consider most likely to have functionally distinct splice isoforms (FDSIs). The long-read data provides more confidence in the transcript structure than short-read data, while the functional genomic annotations provide insight into the most biologically interesting genes for wet-lab follow-up. Some of our candidates, such as Gstz1, meet all of our expectations of a gene with FDSIs, but apparently lack any literature support, indicating the potential of our candidate genes for uncovering novel biology. Finally, our results from long-read data generally agree with the noisy splicing model, whereas previous support for the noisy splicing model comes from short-read data. As such, while the addition of more long-read data may provide the field with more candidate genes with FDSIs, the proportion of candidates is unlikely to increase. Based on the analysis presented in this chapter, I expect that additional data would provide more evidence that alternative splicing only increases the functional diversity of a limited set of genes.

Chapter 4: Cataloging the potential functional diversity of Cacna1e splice variants

4.1 Background

Voltage-gated calcium channels (VGCCs) play a crucial role in regulating the influx of calcium ions into neuronal cells. The pore-forming α1-subunit genes (CACNA1s, also termed Cavs) are linked to developmental disorders such as epilepsy, schizophrenia, and autism spectrum disorder (Heyes et al., 2015). Despite the importance of these genes, our understanding of CACNA1s remains incomplete, largely because each gene undergoes extensive mRNA splicing (Adams et al., 2009).
To fully appreciate VGCCs, we must first have a complete splicing repertoire of each CACNA1 gene and determine which splice variants, if any, impact the overall function of the gene. Here, I describe our work to profile the distinct splice variants of a single CACNA1 gene, Cacna1e.

As a member of the CACNA1 gene family, Cacna1e encodes the α1-subunit (pore-forming subunit) of the VGCC Cav2.3 (Wormuth et al., 2016). Knockout of Cacna1e in rodents causes a decrease in calcium current in pyramidal cells, a decreased sensitivity to pain, and a resistance to drug-induced seizures (Park and Luo, 2010; Saegusa et al., 2002; Simms and Zamponi, 2014; Weiergräber et al., 2007; Zaman et al., 2011). Gain-of-function mutations in human CACNA1E have been associated with epileptic encephalopathy, macrocephaly, and dyskinesia (Helbig et al., 2018; Weiergräber et al., 2006; Wormuth et al., 2016; Zaman et al., 2011). Furthermore, CACNA1E potentially has tissue-specific functions because CACNA1E’s splice variants have tissue-specific expression (Fang et al., 2007).

As described in previous chapters, the extent to which splice variants contribute to genomic functional diversity remains an ongoing debate, but the issues are magnified with CACNA1E (Blencowe, 2017; Light and Elofsson, 2013; Pickrell et al., 2010; Tress et al., 2017b; Zhang et al., 2009). CACNA1E undergoes widespread alternative splicing, and each splice variant’s function is an active area of research (Donaldson and Beazley-Long, 2016; Scott and Kammermeier, 2017). Cacna1 genes are large and contain at least 36 exons. The α1-subunit must be at least 2000 amino acids long and contain four pore-forming domains to make a functional pore (Catterall, 2011; Lee, 2013). The loss of or change to any of these pore-forming domains likely results in a dysfunctional channel (Guida et al., 2001).
Consequently, the multiple splice variants for a single CACNA1 gene have similar exons encoding the pore-forming domains; however, they vary in their use of exons in between the pore-forming domains (the linker regions) (Lipscombe and Andrade, 2015; Lipscombe et al., 2013a).

Cacna1e’s splicing pattern is an ongoing topic of research. Pereverzev and colleagues studied the electrophysiological characteristics of seven splice variants in HEK-293 cells (Pereverzev et al., 2002). They found that different splice variants encode VGCCs with distinct channel inactivation and recovery time courses. With similar splice variants and the same model system, Klöckner and colleagues noted distinct binding affinities among Cav2.3 regulatory proteins and CACNA1E splice variants (Klöckner et al., 2004). Furthermore, as CACNA1E antagonists are used to treat epilepsy, understanding the role of CACNA1E splice variants may help in developing better drug targets, as different splice variants have different pharmacological sensitivities (Lipscombe and Andrade, 2015).

Extreme estimates of CACNA1E’s splicing predict thousands of splice variants (Lipscombe et al., 2013a). To date, only short-read sequencing has been used to characterize CACNA1E’s full splicing profile (Lipscombe et al., 2013b). Short-read sequencing presents a number of challenges for splice variant-level study, the primary one being that the relationship of an individual read with a full-length transcript structure is fundamentally unresolvable (Hardwick et al., 2016; Mehmood et al., 2019; Steijger et al., 2013). Large genes like those of the Cacna1 gene family particularly amplify these short-read sequencing problems. Short-read sequencing’s problems with capturing CACNA1E’s splicing profile can be seen in public genomic databases. Many large-scale genomic databases, such as Ensembl, report a single gene’s splice variants based on short-read sequencing (Aken et al., 2016, 2017).
In Ensembl, human CACNA1E has 11 splice variants, mouse Cacna1e has 9, and rat Cacna1e has 3; these numbers differ greatly from the thousands of splice variants estimated in the literature (Lipscombe et al., 2013a). Given how much we have learned and can potentially learn from rodent models, the low number of splice variants reported in our databases for rodent Cacna1e is problematic (Jarre et al., 2017). As an alternative to short-read sequencing, long-read RNA-seq with Oxford Nanopore’s MinION sequencer provides an option that potentially improves our characterization of a CACNA1 gene’s splicing repertoire (Clark et al., 2019). Specific to CACNA1 genes, Clark et al. recently used MinION sequencing to establish CACNA1C’s transcriptional complexity in the human brain, though they did not correct for the base caller’s error rate (Clark et al., 2019). They found that only a small minority of CACNA1C’s splice variants were present in GENCODE. Clark and colleagues further hypothesized that these splice variants contributed to the functional diversity of CACNA1C in different brain regions. In mammals, 10 genes (Cacna1a, Cacna1b, Cacna1c, Cacna1d, Cacna1e, Cacna1f, Cacna1g, Cacna1h, Cacna1i, and Cacna1s) have potentially thousands of splice variants that encode the entire α1-subunit (Lipscombe et al., 2013b). Previous work studying the Cacna1 genes in rodent models has provided evidence that the regions surrounding the pore-forming domains provide regulatory binding sites. For example, with the influx of calcium ions, the linker II-III region regulates channel inactivation (Welsby et al., 2003). Furthermore, rodent models have improved our understanding of the role that CACNA1 genes and their splice variants play in disease. The knockdown of exon 25 in Cacna1h splice variants decreased seizures in the disease model Genetic Absence Epilepsy Rat from Strasbourg, or GAERS (Cain et al., 2018; Powell et al., 2009).
Consequently, much anti-epileptic drug development occurs in GAERS (Wang and Chen, 2019). The lack of a comprehensive CACNA1E splice variant catalog in Ensembl likely impacts the various computational splicing tools that use Ensembl, and further represents the disconnect between computational tools and the experimental literature (Bhuiyan et al., 2018; Hyung et al., 2018; Rodriguez et al., 2018). We previously reported that we failed to find the splice variants in Ensembl for about a third of the genes with evidence for functionally distinct splice isoforms (FDSIs), despite the splice variants’ use in the literature (Chapter 2). This disconnect likely impacts our ability to evaluate alternative splicing in all genes, but especially in genes with complex transcript structures. An accurate evaluation of a splice variant’s impact on any gene’s function requires an accurate repertoire of the gene’s splice variants. In this study, I characterize the splicing profile of Cacna1e using targeted transcriptomics and MinION RNA-seq data generated from GAERS and NEC rats. I provide the structure and splice junctions for thousands of novel Cacna1e splice variants. Using protein domain annotations, splice variant expression, and conservation metrics, I establish a putative set of splice variants for Cacna1e. In doing so, I demonstrate the potential functional diversity of the gene while maintaining a more accurate characterization of the splice variants’ structure than previously available. This improved transcript catalog can aid our computational tools and provide experimentalists with potentially interesting splice variants for investigation.

4.2 Methods

4.2.1 Targeted amplification for five α1-subunit genes of GAERS and NEC rats

My collaborators in the Snutch lab at UBC performed the targeted amplification of five α1-subunit genes of GAERS and NEC (Non-Epileptic Control) rats.
Total RNA was extracted from the thalamus of 2 GAERS (10 and 90 days) and 2 NEC (10 and 90 days) rats using a MagMax Kit (Ambion), and full-length cDNAs were generated using SuperScriptII with oligo-dT priming (Invitrogen). These samples will be referred to as GAERS10, GAERS90, NEC10 and NEC90. Gene-specific amplicons for 5 VGCC genes (Cacna1c, Cacna1g, Cacna1e, Cacna1h, and Cacna1i) were generated using PCR with the Elongase enzyme (Invitrogen). PCR products were then gel purified using a gel extraction kit (QIAGEN), and DNA was eluted in TE and stored at -20 °C ready for sequencing.

4.2.2 ONT MinION sequencing of amplicons

My collaborators in the Snutch lab at UBC performed the MinION sequencing of amplicons. The generated gene-specific amplicons were processed as per the Oxford Nanopore Technologies (ONT) SQK-LSK108 adapter ligation procedure. In brief, the DNA amplicon molecules were treated to generate A-tailed molecules, allowing the ligation of the ONT-specific adapter onto the amplicon ends. These adapted molecules were then run on the ONT MinION device, and signal data were captured for base calling over a 48 h period. Sequence data were generated from the raw captured data using ONT-specific software (Guppy 3.4) on a GPU-enabled desktop PC.

4.2.3 Short-read sequencing of amplicons

My collaborators in the Snutch lab at UBC performed the short-read sequencing of amplicons. The generated gene-specific amplicons were short-read sequenced at the BC Genome Sciences Center, and data were provided back in SOLEXA format. Raw short reads will be available through the Sequence Read Archive (SRA).

4.2.4 Processing short-read sequencing data

We reprocessed the rat transcriptomic data obtained in 4.2.3. Since these reads were in SOLEXA format, we converted them into FASTQ format in order to run them through our short-read RNA-seq pipeline.
The rat transcriptome reference was prepared using the “rsem-prepare-reference” script provided by the software package RSEM (“RNA-Seq by Expectation-Maximization”) (Li and Dewey, 2011). The assembly version used was Ensembl Rnor6.0, obtained through Illumina’s iGenomes collection (https://support.illumina.com/sequencing/sequencing_software/igenome.html). Short reads were processed as single-end (no mate pairs) and aligned using the STAR aligner version 2.4.0h (Dobin et al., 2013); the alignments were provided as input to the quantification scripts from RSEM v1.2.31. Default parameters were used (with the exception of parallel processing and logging-related options). I used STAR’s table of splice junction counts (SJ.out.tab) for analysis.

4.2.5 Determining splice variants using long read RNA-seq data and FLAIR

I aligned my long reads for each sample to the Ensembl rat genome (rn6) using FLAIR (downloaded July 2019; Tang et al., 2018) and minimap2 (Li, 2018). With the addition of the splice junctions found in the short-read data, FLAIR corrected the reads in each alignment file. In short, this means that any novel splice junction is merged with an existing splice junction if it lies within a 10-nucleotide window of the existing junction. Finally, FLAIR collapsed any reads having the same transcription start site and the same splice junctions across all samples into a single splice variant. FLAIR removed any splice variants without support from at least 3 reads.

4.2.6 Definition of a functional Cacna1e splice variant

A gene with functionally distinct splice isoforms (FDSIs) is a gene where two or more splice isoforms are necessary for the gene’s overall function. Experimentally, the depletion of each individual isoform causes a phenotype.
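The correction and collapsing rules summarized in Section 4.2.5 can be illustrated with a small sketch. This is a simplified illustration of the rules as stated there, not FLAIR's implementation: the function names, the representation of a junction as a (start, end) pair, the requirement that both ends fall within the window, and the exact matching of transcription start sites are all assumptions made for the example, and the coordinates are hypothetical toy data.

```python
from collections import Counter

WINDOW = 10     # nucleotides; window size stated in Section 4.2.5
MIN_READS = 3   # minimum supporting reads stated in Section 4.2.5


def correct_junction(junction, known, window=WINDOW):
    """Snap a (start, end) junction to the closest known junction if both
    ends lie within `window` nucleotides; otherwise return it unchanged."""
    if junction in known:
        return junction
    start, end = junction
    nearby = [k for k in sorted(known)
              if abs(k[0] - start) <= window and abs(k[1] - end) <= window]
    if not nearby:
        return junction
    return min(nearby, key=lambda k: abs(k[0] - start) + abs(k[1] - end))


def collapse_reads(reads, known, min_reads=MIN_READS):
    """Group reads by (transcription start site, corrected junction chain)
    and keep only groups supported by at least `min_reads` reads."""
    counts = Counter()
    for tss, junctions in reads:
        chain = tuple(correct_junction(j, known) for j in junctions)
        counts[(tss, chain)] += 1
    return {variant: n for variant, n in counts.items() if n >= min_reads}


# Hypothetical toy data: two known junctions and four reads.
known_junctions = {(100, 200), (300, 400)}
toy_reads = [
    (10, [(100, 200), (300, 400)]),
    (10, [(103, 198), (300, 400)]),   # within the window, gets corrected
    (10, [(95, 205), (301, 399)]),    # within the window, gets corrected
    (10, [(150, 200), (300, 400)]),   # too far from any known junction
]
variants = collapse_reads(toy_reads, known_junctions)
```

In this toy example the first three reads collapse into one splice variant with three supporting reads, while the fourth read keeps its uncorrected junction and is dropped for lack of support.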
For the purposes of computationally characterizing a candidate FDSI from Cacna1e’s splice variants, we defined a candidate Cacna1e FDSI as: “a splice variant evolving under selection (conservation) with all 4 pore-forming domains necessary for calcium passage into the cell and is relatively appreciably expressed”. The requirement that the splice variant has all 4 pore-forming domains implicitly means that the splice variant must have an open reading frame (ORF) of at least 6000 bp (2000 amino acids). Conservation of a splice variant suggests that it is necessary for reproductive success, and sequence conservation acts as a proxy for functional importance. In line with my work in Chapter 3, we expect genes with FDSIs to have appreciably expressed splice variants, and I apply similar criteria in this chapter.

This chapter focuses on Cacna1e splice variants; however, I had long-read data for four other VGCC rat genes (Cacna1c, Cacna1g, Cacna1h, and Cacna1i). Splice variants for the four other VGCC genes must still encode a protein of at least 2000 amino acids that contains four annotatable pore-forming domains. As I show later in Table 4.3, the splice variants detected for three of the genes (Cacna1c, Cacna1g, and Cacna1i) did not meet my definition of a functional VGCC. I also do not include methods and results for Cacna1h in this chapter because the transcript expected by our collaborators was missing, suggesting a problem at the data collection step.

4.2.7 Annotating splice variants for Cacna1e

I adjusted my pipeline from Chapter 3 to be applied to Cacna1e splice variants. Primarily, this change meant filtering for splice variants that could encode a peptide of at least 2,000 amino acids containing 4 annotatable pore-forming domains. My annotation pipeline will be available in a git repository. I subsetted the 6,252 splice variants from FLAIR to 2,110 splice variants for Cacna1e.
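The ORF-length and pore-domain criteria described above amount to a simple filter. The following is a minimal sketch under stated assumptions rather than the annotation pipeline itself: the `SpliceVariant` record, its field names, and the example values are hypothetical, the domain name `Ion_trans` is the CDD annotation (pfam00520) used for pore-forming domains in the figures of this chapter, and treating "4 annotatable pore-forming domains" as "at least 4 hits" is an assumption.

```python
from dataclasses import dataclass, field

MIN_ORF_AA = 2000            # minimum ORF length in amino acids (Section 4.2.6)
REQUIRED_PORE_DOMAINS = 4    # number of pore-forming domains required
PORE_DOMAIN = "Ion_trans"    # CDD name for the pore-forming domain (pfam00520)


@dataclass
class SpliceVariant:
    """Hypothetical record for one splice variant and its longest ORF."""
    variant_id: str
    orf_length_aa: int
    domains: list = field(default_factory=list)  # CDD domain names, one per hit


def is_candidate_channel(variant):
    """True if the ORF is long enough and carries all pore-forming domains."""
    pore_hits = sum(1 for name in variant.domains if name == PORE_DOMAIN)
    return (variant.orf_length_aa >= MIN_ORF_AA
            and pore_hits >= REQUIRED_PORE_DOMAINS)


# Hypothetical examples: one full-length variant and two that fail a criterion.
examples = [
    SpliceVariant("full_length", 2222, ["Ion_trans"] * 4 + ["GPHH", "Ca_chan_IQ"]),
    SpliceVariant("short_orf", 1500, ["Ion_trans"] * 4),
    SpliceVariant("missing_domain", 2100, ["Ion_trans"] * 3 + ["GPHH"]),
]
candidates = [v.variant_id for v in examples if is_candidate_channel(v)]
```

Expression and conservation, the remaining parts of the candidate FDSI definition, are deliberately left out of this sketch because they are applied as separate annotations.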
I then reduced all splice variants into their exons and identified the exons shared between splice variants. Furthermore, I annotated any transcript that was not entirely found in Ensembl as novel. I then annotated each exon with the number of MinION reads that support it, based on the FLAIR output. As done in Chapter 3 for all genes and their splice variants, I annotated each Cacna1e splice variant with an expression ratio. I calculated each splice variant’s expression ratio by dividing the expression of the most expressed Cacna1e splice variant by the splice variant’s expression. I annotated all splice variants and their individual exons with their average PhastCons (Siepel et al., 2005) and PhyloP (Pollard et al., 2010) basewise scores from a 20-vertebrate-species alignment, downloaded from the UCSC genome browser for build rn6 (Rosenbloom et al., 2015). The average score for each exon was calculated using Kentutils. Since PhastCons scores may produce false positives in identifying conservation in a small genomic region, I normalized our PhastCons scores by the size of the exon. In order to determine whether there was any selection pressure upon the splice sites, I also annotated the average PhyloP conservation of the two intronic bases next to the 5’ side of the exon and the two intronic bases next to the 3’ side of the exon (Pickrell et al., 2010). Using TransDecoder (https://github.com/TransDecoder/TransDecoder), I predicted translation products of the 2,110 splice variants from Cacna1e for any ORFs larger than 2,000 amino acids with a start codon. These ORFs do not necessarily contain a stop codon. I then queried the Conserved Domain Database (CDD) with the translated sequences and annotated each splice variant with the protein domain hits the database returned (Marchler-Bauer et al., 2015). I then performed homology searches against the GenBank nr database (Pruitt et al., 2007) on all exons using NCBI’s tBLASTx (Camacho et al., 2009).
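The expression ratio defined earlier in this section reduces to a single division per splice variant. The sketch below is a minimal illustration, assuming read counts are used as the expression measure; the function name and the variant labels are hypothetical. The first two read counts and the 832-read count are values reported in the Results of this chapter, and the cutoff of 2 is the one drawn in Figure 4.2A.

```python
def expression_ratios(read_counts):
    """Expression ratio per variant: reads of the most expressed variant
    divided by the reads of the variant itself (variants with 0 reads skipped)."""
    top = max(read_counts.values())
    return {vid: top / n for vid, n in read_counts.items() if n > 0}


# Read counts keyed by hypothetical variant labels.
reads_per_variant = {
    "most_expressed": 498_616,
    "second_most_expressed": 435_458,
    "low_expression_example": 832,
}
ratios = expression_ratios(reads_per_variant)

# Variants expressed at a level similar to the top variant (ratio below 2).
similar_to_top = sorted(vid for vid, r in ratios.items() if r < 2)
```

By construction the most expressed splice variant has a ratio of exactly 1, and larger ratios indicate lower relative expression.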
First, I extracted the sequences for all alternatively spliced exons. Each alternatively spliced exon’s sequence was concatenated with the sequences of its flanking exons. The set of three exonic sequences was BLAST-ed against the human, mouse, rat, zebrafish, fugu, coelacanth, lamprey and spotted gar data in the nr database. We filtered our BLAST results using an e-value threshold of 0.0001, a gap threshold of 30% or less, a query coverage threshold of 80% or more, and a percent identity threshold of 30% or more.

4.2.8 Visualization tool

We developed a visualization tool in R that uses three input files: a BED-formatted file of all splice variants (FLAIR output), a tab-delimited CDD output file, and a CSV-formatted file of each splice variant’s expression. The visualization tool calculates, based on chromosomal position and exon sizes, the set of similarly annotated exons that are preserved between all splice variants of interest, and draws the result as a stack of aligned exon patterns. Since introns are disproportionately large compared to the exons present in Cacna1e splice variants, they are shown schematically, just to illustrate exon use between splice variants. The tool and example inputs are available online at https://github.com/jsicherman/IsoVision2.

4.3 Results

I profiled Cacna1e’s splicing repertoire in the rat thalamus using targeted MinION sequencing of cDNAs. Our analysis encompassed 4,060,847 reads and initially yielded 2,110 potential splice variants. The goal of the analyses we describe was to identify potentially functional isoforms and to identify candidates that might contribute to the functional diversity of the gene.

4.3.1 Detection of candidate Cacna1e splice variants

I first filtered the data for reads that mapped to Cacna1e, yielding an initial pool of 2,110 splice variants for Cacna1e that passed quality control (see Methods, Table 4.1 and Table 4.2).
However, of these 2,110 variants, only a subset is likely to code for a functional product. Their median length was 5,725 nucleotides, still slightly below what we expect for an α1-subunit splice variant (>6,000 nucleotides). About 53% of the reads supported a splice variant of at least 6,000 nucleotides. I used the number of reads that map to the same splice variant as a proxy for the splice variant’s expression level. I observed a moderate correlation (Spearman r = 0.59) between the size of the splice variants and the number of supporting reads per splice variant.

                                                     NEC 10           NEC 90           GAERS 10         GAERS 90
# Reads                                              3,514,669        4,958,824        3,255,180        6,127,231
# Bases                                              20,259,569,128   23,582,737,080   19,092,042,421   31,856,992,171
Unmapped reads                                       31,356           31,356           17,311           5,773
Mean length of reads                                 5,764            4,756            5,865            5,199
Median length of reads                               7,952            8,948            8,134            7,902
Number of reads FLAIR assigned to splice variants    3,494,983        4,881,246        3,243,064        6,076,453
Number of reads removed                              19,686           77,578           12,116           50,778
Mean read quality                                    10.50            9.80             9.90             9.90
Median read quality                                  10.80            9.90             10.20            9.70

Table 4.1 Summary of targeted MinION RNA-seq data. All four samples were sequenced with PCR primers specific for Cacna1c, Cacna1e, Cacna1g, Cacna1h, and Cacna1i. Column headings: NEC 10 – Non-epileptic control at 10 days, NEC 90 – Non-epileptic control at 90 days, GAERS 10 – Genetic Absence Epilepsy Rat from Strasbourg at 10 days, GAERS 90 – Genetic Absence Epilepsy Rat from Strasbourg at 90 days.

                          Cacna1e - Thalamus
# Reads                   35,828,354
# Bases                   1,898,902,762
Aligned reads             28,104,950
Unaligned reads           5,272,945
Median length of reads    50.0

Table 4.2 Summary of Cacna1e-targeted short-read RNA-seq data in rat thalamus

I determined how many of the α1-subunit splice variants had the translational potential for a functional α1-subunit (Table 4.3, Figure 4.1). Only 238 of the 2,110 Cacna1e splice variants were predicted to translate to a protein longer than 2,000 amino acids.
Furthermore, even fewer of these splice variants are likely to have an ORF with all four pore-forming domains. Based on protein domain annotations from CDD, only 154 of the 238 Cacna1e splice variants had an ORF longer than 2,000 amino acids that spans the sequences for the four pore-forming domains.

Gene       Splice variants (minimap alignment    Splice variants with    Splice variants with 4 complete
           and FLAIR processing)                 ORF ≳2000 AA            pore-forming domains
Cacna1c    21                                    0                       0
Cacna1e    2,110                                 238                     154
Cacna1g    175                                   0                       0
Cacna1h    715                                   102                     37
Cacna1i    333                                   0                       0

Table 4.3 Summary of putative splice variants detected in long-read RNA-seq data. The initial set of Cacna1e splice variants comprised 2,110 splice variants. Of those 2,110 splice variants, 238 contained an open reading frame of 2,000 amino acids or greater. I further filtered these splice variants to 154 based on whether they contain the four necessary pore-forming domains. While my focus in this chapter is Cacna1e, I provide the splice variant data for four other VGCC genes. For three of these genes (Cacna1c, Cacna1g, and Cacna1i), none of the detected splice variants had an ORF of at least 2,000 amino acids.

Figure 4.1 The majority of detected splice variants did not contain an ORF of at least 2,000 amino acids. ORFs were defined as containing a start codon, but not necessarily a stop codon. The X-axis shows the distribution of log10(ORF) lengths. The dashed blue line indicates the minimum size expected for a functional channel.

4.3.2 Expression profiles of Cacna1e splice variants

In my putative set of Cacna1e splice variants, I detected the splice variant characterized by Soong and colleagues (Soong et al., 1993). None of the splice variants in our long-read data matched the transcripts reported in Ensembl, including the sole Ensembl transcript that encodes an ORF of over 2,000 amino acids (ENSRNOT00000003928).
This remained the case even after inspecting the MinION data at the raw read level to ensure that data relevant to this transcript had not been filtered out by my processing pipeline. Using the number of reads mapped to each splice variant as a proxy for splice variant expression, I observed that the top four splice variants were expressed at similar levels (Figure 4.2). These four most expressed splice variants had an expression ratio of less than 2 (Figure 4.2A). The most abundantly expressed Cacna1e splice variant matches the splice variant structure and protein sequence of the Soong et al. splice variant (Soong et al., 1993). I mapped 498,616 reads to the most abundantly expressed splice variant, while the second most abundantly expressed splice variant had 435,458 reads. The top four splice variants contributed similarly to Cacna1e’s total expression across the 4 samples: 12.2%, 10.7%, 8.8%, and 6.5% (Figure 4.2B). However, when stratifying the number of reads per splice variant by sample, I observed that the Soong et al. splice variant is the most abundant splice variant in our GAERS90 and NEC90 samples, and the fifth most abundant in GAERS10 and NEC10 (Figure 4.2C).

Figure 4.2 Cacna1e has four similarly expressed splice variants. A) Four splice variants have an expression ratio of less than 2. Each splice variant’s expression ratio is calculated as the most expressed splice variant’s expression divided by the splice variant’s expression. The dotted line indicates where the expression ratio would equal 2. B) Top 10 most expressed Cacna1e splice variants pooled across all samples. The X-axis shows the rank of the splice variant based on expression. The Y-axis shows the proportion of the splice variant relative to Cacna1e’s total expression. C) Cacna1e splice variant rank in each sample. The X-axis shows the expression rank for each splice variant summed across all 4 samples, while the Y-axis shows the splice variant’s expression rank in each sample.
Sample is indicated by color. G10 is Genetic Absence Epilepsy Rat from Strasbourg at 10 days, G90 is Genetic Absence Epilepsy Rat from Strasbourg at 90 days, N10 is non-epileptic control at 10 days and N90 is non-epileptic control at 90 days. Example: the splice variant at expression rank 1 summed across all 4 samples is the most expressed splice variant in G10 and N10. However, for samples G90 and N90, that same splice variant is at rank 4 and 5 respectively.

In order to assess the potential functional diversity of Cacna1e splice variants, we downloaded the protein sequence for Ensembl Cacna1e (ENSRNOP00000003928) and the Cacna1e splice variant characterized by Soong et al. (1993). We annotated protein domains as a guide for analyzing our Cacna1e splice variants (Figure 4.3). Both sequences contained the four pore-forming domains necessary for calcium influx. Furthermore, the Ensembl sequence and the Soong et al. sequence contained the GPHH and IQ domains (GPHH – pfam16905 and Ca_chan_IQ – pfam08763 respectively) necessary for Ca-calmodulin regulation of channel activity (Van Petegem et al., 2005).

Figure 4.3 Structure of top 4 most expressed splice variants. Red indicates a pore-forming domain annotation from CDD (Ion_trans – pfam00520). Blue and purple indicate calmodulin-binding domain annotations from CDD (GPHH – pfam16905 and Ca_chan_IQ – pfam08763, respectively). The Y-axis is the expression of each splice variant as total number of reads; the X-axis is exon number. Visualization was done using IsoVision.

4.3.3 Cacna1e splice variants contain conserved cassette exons 19 and 45

I further prioritized Cacna1e’s 154 splice variants based on our splice variant-level conservation and expression annotations. I detected 37 splicing events in our 154 Cacna1e splice variants, six of which were previously found by our collaborators (Figure 4.4). The three most commonly occurring cassette exon events in Cacna1e splice variants were: skipping of exon 10 (40% of reads), insertion of exon 19 (40% of reads), and skipping of exon 45 (45% of reads).

[Figure 4.3 labels: expression (reads) of the four splice variants: 498,616; 435,458; 360,132; 262,713. Legend: transmembrane, GPHH, IQ domain. Exon groups on the X-axis: 1, 2-8, 9, 10, 11-18, 19, 20-22, 23-38, 39-40, 41-42, 43-44, 45, 46-47, 48.]

Figure 4.4 Impact of splicing on channel structure of 6 previously known cassette exons on 154 Cacna1e splice variants with 4 transmembrane domains. Six cassette exons were detected in our data and previously reported in the literature. Here we show where each of these six cassette exons would likely impact the channel, and the percentage of reads that do not contain the exon. The channel figure was created by Dr. John Tyson, and I annotated the image with cassette exons.

The sequences of exons 10, 19 and 45 are all highly conserved, with a PhyloP score over 1.3. The question is whether skipping these exons is an inherent aspect of the function of the gene, or simply due to splicing errors. I investigated whether the exclusion of these exons was conserved; that is, whether transcripts lacking these exons are present in other species. I performed a BLAST analysis in which I checked whether the protein sequences for each exon together with their flanking exons (i.e. exons 9-10-11, exons 18-19-20, and exons 44-45-46) were present in transcribed sequences in other species. I found evidence of conservation for exon 19 and exon 45 within the jawed-vertebrate phylogeny (fish, rodents and human), but only evidence of conservation for exon 10 skipping within the rodent and human phylogeny.
This is consistent with the potential for skipping of these exons having a conserved function, but as I discuss later, the interpretation is not straightforward due to the potential for conservation artefacts.

4.3.4 Novel Cacna1e splicing events

The 154 Cacna1e splice variants that had the translation potential to encode a functional VGCC contained a total of 31 splicing events novel to my collaborators (in various combinations), and I could not find any literature reports of 25 of them. In total, these splicing events affect about 11% of the 154 splice variants (17 splice variants), and the splice variants that contained these splicing events accounted for less than 10% of the total reads. I investigated whether there was any conservation of these novel splicing events and whether there was any impact on the protein domains. Sixteen of the splicing events involve exons in a pore-forming domain; however, they did not affect the CDD annotation for the pore-forming domain. Two splicing events are the cassette exons encoding the GPHH and IQ domains. The seven remaining splicing events involve exons in the N- or C-termini, or in the linker regions. One subclass of variants (14/153) is predicted to lack a complete calmodulin-binding domain. The most expressed splice variant lacking a complete calmodulin-binding domain contributed ~0.02% of Cacna1e’s total expression (832 reads). Notably, many of these splicing events are found in other species: one is conserved across jawed vertebrates, nine are conserved within mammals, and two are conserved only within a rodent phylogeny.

4.4 Discussion

In this chapter, I describe the splicing repertoire of the voltage-gated calcium channel (VGCC) gene Cacna1e using novel targeted long-read RNA-seq data from the rat thalamus. I then assessed how much potential Cacna1e’s transcriptional diversity has for functional diversity.
This chapter represents a necessary step towards bridging the gap between claims that alternative splicing vastly increases the functional diversity of VGCC genes and the evidence-based reality of noisy splicing. Though I focused on a specific gene, the approaches we used to computationally assess the biological relevance of splice variants would be applicable to many genes (Chapter 3), helping to bridge the gap between raw measures of transcriptional diversity and estimates of functional diversity. While I detected 2,110 different splice variants for Cacna1e, my analysis suggests that the large majority of these are unlikely to be biologically relevant. In particular, only 7% are predicted to plausibly encode a functional α1-subunit. Of the remaining 93%, only 1,273/1,965 (~68% of the total) have an open reading frame of at least 30 amino acids. Furthermore, these 1,965 likely non-functional splice variants account for 38% of the gene’s total expression. One may wonder why I dismiss these shorter, lowly expressed splice variants as biological noise rather than removing them at an earlier quality control step as technical errors introduced in the library preparation or sequencing. The fact is that other studies include such fragments in support of their claims of the functional importance of widespread alternative splicing (Tang et al., 2020; Workman et al., 2019). To avoid opening myself to the criticism of filtering them out inappropriately, I begin from the liberal assumption that these Cacna1e splice variants exist in the cell. By annotating these splice variants based on ORF length, I provide a more specific reason to doubt their functional importance, leading to the exclusion of 93% of the splice variants.
As I discussed in Section 1.3, the observation of splice variants that do not encode functional products agrees with extensive evidence that RNA splicing is imprecise, such that much “biochemical noise” is produced, which can be captured by modern sensitive molecular biology methods even when present at low levels (Pickrell et al., 2010; Saudemont et al., 2017; Tress et al., 2017b; Zhang et al., 2009). While I cannot formally exclude the possibility that any of the non-channel-encoding splice variants have a function, for follow-up studies I feel it is appropriate to prioritize splice variants for which the most plausible case for functionality can be made. A further distinction of interest is among splice variants that have different and required functions from one another, which I called FDSIs in previous chapters. Computational analyses cannot establish FDSIs, but I applied methods that provide a prioritization. While I detected 154 “full-length” splice variants for CACNA1E, evidence from domain analysis, expression and homology focuses my attention – as it did in Chapter 3 – on a subset of splice variants involving cassette splicing events for exons 10, 19 and 45 in various combinations. I believe the functional consequences of these three exons are worthy of follow-up. I failed to find any literature discussing the functional effects of exon 10, and our BLAST results mapped to a direct GenBank submission without any associated studies (Human NCBI Reference Sequence: XM_017002244.1). The splicing of exon 10 impacts the linker region between pore-forming domains I and II, the part of the channel where the β-subunit (Cacnb1 genes) interacts with the α1-subunit (Cacna1 genes) to chaperone the VGCC to the cell membrane (Gonzalez-Gutierrez et al., 2010). The Conserved Domain Database (CDD) does not contain the “AID motif” where the β-subunit interacts with the α-subunit.
Consequently, my interpretation of whether the skipping of exon 10 impacts channel function remains limited. Both Williams et al. and Schneider et al. reported the cassette splicing of exon 19 in 1994 (Schneider et al., 1994, 2020; Williams et al., 1994). Williams and colleagues showed that splice variants that contained exon 19 changed channel inactivation times. We could not access Schneider and colleagues’ publication, though our BLAST results led us to their GenBank submission (GenBank: L29385.1). The effects of splice variants that contain exon 19 on cell current were later characterized in HEK293 cells and human pancreatic cell lines (Pereverzev et al., 2002; Vajna et al., 2001). Finally, when investigating the alternative splicing of a different, but functionally similar VGCC gene, Cacna1b, Gray and colleagues noted that Cacna1b and Cacna1e both had an alternatively spliced 19th exon (Gray et al., 2007). Given the conservation of exon 19 and the location of the exon between domains II and III in both Cacna1b and Cacna1e, Gray and colleagues hypothesized that the exon has an important regulatory role for channel function. Based on RT-PCR, exon 19 was more prevalent during fetal development than postpartum, and its expression was distinctly absent in the peripheral nervous system (Gray et al., 2007). Our results provide more transcript structural context for the previous analyses of exon 19, and the combinations of exons that are spliced together with exon 19 will be useful for future follow-up.

While our BLAST results for exon 45 mapped to direct GenBank submissions without any associated study (Human GenBank: AH009158.2, AL161734.12, NG_050616.1), we found literature reporting the cassette splicing of exon 45. Williams and colleagues report the existence of a human CACNA1E splice variant that lacks both exon 19 and exon 45, and another human splice variant that lacks only exon 45 (Williams et al., 1994).
In the same study, they also provide evidence that mice have splice variants that lack exon 45. Based on their expression profiling, the CACNA1E splice variant that lacks exon 45 but has exon 19 is the major neuronal splice variant. Based on patch clamp experiments in HEK cells, Williams and colleagues concluded that exon 45 influences channel inactivation times. These electrophysiological findings were further strengthened by other studies (Olcese et al., 1994; Pereverzev et al., 2002). Furthermore, exon 45 has been reported to contain de novo variants causing epileptic encephalopathy (Helbig et al., 2018).

The splice variants we observed for Cacna1e and exon 45 differ somewhat from previous studies. The most expressed splice variant in our rat thalamus data lacks exon 19 but contains exon 45. This splice variant was previously characterized by Soong and colleagues in the rat brain (Soong et al., 1993). Furthermore, my two most expressed splice variants lack exon 19, but have exon 45. My third most expressed splice variant has exon 19 but lacks exon 45. The human and mouse splice variants first described by Williams and colleagues lacked both exon 19 and 45, or lacked only exon 45. Given the potential role of exon 45 in disease and the strong conservation of exon 45’s cassette splicing, exon 45 may be interesting for further follow-up. Do humans, mice, and rats utilize Cacna1e splice variants differently? Can Cacna1e have a functional channel if the correct combination of exons is present? Or is the strong conservation of skipping exon 45 an artefact of our conservation approach? The full repertoire of Cacna1e’s splice variants provides the field with a list of splice variants and their transcript structures to investigate this further.

My study has some important caveats and limitations.
Though others have used PCR-based strategies to estimate expression level, our PCR-based strategy to isolate CACNA1E splice variants is not necessarily expected to lead to accurate expression level estimates for different splice variants, so my relative abundance estimates should be viewed with some caution (Clark et al., 2019). The use of PCR primers targeting the ends of the only annotated CACNA1E ORF means that I would not detect splice variants containing potential alternative translation starts and stops, or alternative transcription starts and stops. Another limitation is that we only studied the thalamus, and the splice variants in other brain regions could be different. Furthermore, with only one sample per condition, and the potential for bias in quantification in cDNA samples which were PCR amplified, I was unable to confidently assess differences in splice variant expression in the GAERS model. Finally, while I used conservation of exon skipping events as a way to evaluate the potential for functional relevance, there is an important caveat. The imprecision of splicing is present in all species. Thus, I cannot exclude the possibility that observation of an exon skipping event in another species’ transcriptome is simply the result of chance capture of an “erroneous” isoform. This is made clearer by the observation of “conserved” splicing events that would result in a non-functional channel. Given these limitations, I view this study as prioritizing a set of putative FDSIs for CACNA1E, and further high-throughput and low-throughput investigations are required to establish their roles or importance.

4.5 Conclusions

My work with CACNA1E demonstrates the importance of providing a gene-specific splicing profile using targeted long-read RNA sequencing. I provided transcript structures for 2,110 Cacna1e splice variants that were previously missing in our genomic databases.
Furthermore, my prioritization using splice variant expression and conservation helps point the field towards potentially interesting splice variants. For Cacna1e, the prioritization using expression and conservation provided a mixture of splice variants that have been functionally investigated in the literature (e.g. splice variants containing exon 19) and splice variants without any literature validation (e.g. splice variants containing exon 10). The splice variants I detected will provide a foundation for future research into VGCCs for both basic research and therapeutic studies.

Chapter 5: Conclusion

Nearly all mammalian genes have multiple splice variants, but we do not know the function of most of these splice variants (Pan et al., 2008; Wang et al., 2008a). Despite this gap in knowledge, many tout alternative splicing as a key driver of proteomic and functional diversity in the mammalian genome, and even the basis of human exceptionalism (Blencowe, 2017; Kelemen et al., 2013; Li et al., 2014; Ryu et al., 2017; Shabalina et al., 2014). Our genomic databases continue to gain more splice variants, and a plethora of computational studies are dedicated to predicting the functional consequences of alternative splicing. These studies include machine learning-based splice variant function prediction, splice variant regulatory networks, and exon-based ontologies (Li et al., 2016; Tranchevent et al., 2017). Most of these studies are based on the assumption that alternative splicing vastly increases genomic functional diversity. My thesis takes an alternative position to investigate alternative splicing function. The noisy splicing model is based on evidence that implies that most splice variants are non-functional. If most splice variants are non-functional, then it is unlikely that most genes have functionally distinct splice isoforms (FDSIs) (Light and Elofsson, 2013; Pickrell et al., 2010; Tress et al., 2017b; Zhang et al., 2009).
Yet, we know of genes with FDSIs. My thesis attempts to prioritize the genes most likely to have FDSIs, in light of the noisy splicing model.

In Chapter 2, I used manual literature curation to determine which genes have literature support for FDSIs. After curating the literature for 743 human and mouse genes, I provided the field with a gold-standard set of 43 genes with FDSIs. Overall, I concluded that the claim that alternative splicing vastly increases the functional diversity of the genome is extrapolated from a limited number of cases. I also point towards a disconnect between our genomic databases and the experimental literature. As many alternative splicing computational tools use genomic databases, my results suggest that these tools may be built on missing data.

In Section 1.4.2, I outlined the debate between Tress et al. and Blencowe about proteomic evidence for most splice variants. In addition to the points that Tress et al. raised, I had concerns with Blencowe’s critiques based on my Chapter 2 results. Blencowe refers to “hundreds of examples” in the literature supporting the functional contributions for specific splice variants. In Chapter 2, though we curated all the examples cited by Blencowe (2017), we failed to find the level of literature support he claimed (Bhuiyan et al., 2018). Furthermore, at least 15% of the studies Blencowe referred to were about NMD-targeted splice variants, which are not likely to play a role in the functional diversity of a gene.

In Chapter 3, I annotated splice variants found in long-read transcriptomes, and prioritized genes most likely to have FDSIs, in line with the noisy splicing model and my literature curation. From 6,799 genes, I prioritized a set of 79 genes. For some genes, like Tpd52l1, I was able to find literature that at least provides suggestions for why the splice variants would be functionally distinct.
Other top candidates, such as Gstz1, are more novel in that it is difficult to develop an informed hypothesis about why they would have FDSIs, since the predicted protein structures of the variants were identical. I hypothesized that differences in expression pattern may be relevant in this case. In general, genes like Gstz1 are perhaps the most interesting genes from my prioritization process because the functional consequences of their multiple splice variants have yet to be explored.

In Chapter 4, I annotated splice variants found in a novel long-read transcriptomic dataset, targeted for Cacna1e in the rat thalamus. A key difference between Chapter 4 and Chapter 3 is that we have detailed a priori knowledge of Cacna1e function, and therefore I can evaluate the splice variants in more detail. Cacna1e requires four pore-forming domains, and an amino acid sequence of at least 2000 residues. This information alone reduced the likelihood of functionality for 1,946/2,100 observed splice variants. Furthermore, there is a growing interest in the field for targeted transcriptomics of voltage-gated calcium channel genes using long-read sequencing (Clark et al., 2019). The work done in Chapter 4 serves the field with annotation of Cacna1e’s splicing profile that can directly inform downstream studies of the gene’s function.

Some might consider the general findings from my thesis to be negative, in that I found evidence of FDSIs for very few genes, and in the case study of Cacna1e, only a few transcripts are likely to be worthy of interest. However, I interpret these findings as positive, both in the set of candidates identified and in clarifying the functional landscape of transcription in light of the knowledge of evolution, protein structure, and the fidelity of splicing.
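The two Cacna1e criteria summarized above for Chapter 4 (four pore-forming domains and at least 2,000 residues) can be illustrated with a minimal Python sketch. The `SpliceVariant` record, its field names, and the helper functions are hypothetical and introduced only for illustration; they are not the pipeline used in the thesis, and the domain counts are assumed to come from an upstream annotation step.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Thresholds taken from the text; everything else below is an assumption.
MIN_PORE_DOMAINS = 4
MIN_PROTEIN_LENGTH = 2000


@dataclass
class SpliceVariant:
    """Hypothetical record for one annotated splice variant."""
    variant_id: str
    protein_length: int  # predicted protein length in residues
    pore_domains: int    # number of pore-forming domains annotated


def is_plausibly_functional(variant: SpliceVariant) -> bool:
    """True if the variant meets both the domain and the length criterion."""
    return (variant.pore_domains >= MIN_PORE_DOMAINS
            and variant.protein_length >= MIN_PROTEIN_LENGTH)


def partition_variants(
    variants: List[SpliceVariant],
) -> Tuple[List[SpliceVariant], List[SpliceVariant]]:
    """Split variants into (kept, deprioritized) using the two criteria."""
    kept = [v for v in variants if is_plausibly_functional(v)]
    deprioritized = [v for v in variants if not is_plausibly_functional(v)]
    return kept, deprioritized
```

A variant failing either criterion is deprioritized rather than declared non-functional, which matches the hedged wording ("reduced the likelihood of functionality") used in the text.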
5.1 Strengths and limitations of research

5.1.1 Strengths

In Chapter 2, I intentionally biased the curation for literature evidence of genes with FDSIs towards well-studied genes both in the context of alternative splicing and genomics. By targeting our curation for well-studied genes, I likely increased our chances of finding genes with literature evidence of FDSIs. Alongside its benefits, biasing my curation this way introduced a limitation to the interpretation of my results (see limitations in Section 5.1.2). My research in Chapters 3 and 4 benefitted from the use of long-read data. Previous work in alternative splicing predominantly uses short-read data. Computational pipelines for short-read data have issues with predicting transcript structure. Thus, I have more confidence in my transcripts. Processing short-read data is also tethered to previous annotations in genomic databases, and the detection of novel transcript structures is difficult to do with confidence.

Another strength of my work is that I do not limit myself to isoforms reported in sequence databases. In Chapter 2, I reported that one-third of genes have FDSIs in the literature that are not found in Ensembl, exposing a problem with reliance on reference transcriptomes. Indeed, in Chapter 3, I reported that 95% of splice variants were not found in Ensembl, and in Chapter 4, none of the Cacna1e splice variants were present in Ensembl (not even the “primary” transcript reported in the literature since 1993). Thus, the vast majority of my splice variants are likely novel to computational databases. Other long-read based investigations echo similar claims about splice variants novel to computational databases.

5.1.2 Limitations

In Chapter 2, I biased the curation for well-studied genes. While this bias is beneficial, it also creates an interpretation problem. The 43 genes with FDSIs have conserved, appreciably expressed splice variants that have evidence of translation.
In general, well-studied genes are conserved, appreciably expressed, and encode a protein, presenting a confound which means extrapolating from those 43 genes to the entire genome must be done with some caution. Chapter 3 was in part designed to remedy this gap by applying reasonable criteria to seek evidence of FDSI genes at scale, but like any computational predictions, the results should be taken as informative for future studies, not definitive.

When we manually curated the literature in Chapter 2, I assumed that the investigators designed an appropriate experiment for the gene’s splice variants, and performed the experiment correctly. However, if one were to assess the studies for the 43 genes with literature evidence of FDSIs, the quality and extent of the experimental evidence may vary. For example, not all studies included controls for the specificity and effectiveness of siRNAs, and studies that do might be considered more convincing. Other experimental aspects to assess in these studies can include the appropriateness of the control experiments, whether the investigators tested the correct phenotype, and if the correct statistical test was used. In the absence of a high-quality study, the number of studies confirming the result, or the conservation of the FDSIs, are other sources of confidence. I did not consider these factors in considering whether a paper met the threshold for evidence for FDSIs, preferring not to make judgement calls and taking the authors’ claims at face value. Nevertheless, the enumeration of these factors (Table D.1 in Appendix D) confirms that the extent of the evidence is greater for some genes with FDSIs. While somewhat subjective, in my view, this suggests that some of these genes could be false positives.

For Chapter 3, I was limited in the availability of long-read MinION sequencing data. In comparison to short-read data, my analysis did not have comparable sequencing depth, nor number of tissue types.
Consequently, I only detected splice variants for about a third of protein-coding genes. Nevertheless, the sequencing depth and the number of tissue types are comparable to current long-read data, whether MinION or PacBio. Most long-read studies detect multiple splice variants for 6,000 to 7,000 genes. As such, Chapter 3 should be compared not only to short-read RNA-seq studies, but other long-read RNA-seq studies as well.

Only one gene with literature support for FDSIs from Chapter 2 emerged as a prioritized gene with FDSIs in Chapter 3. I believe that this is a consequence of the aforementioned limitations of my long-read data because I could not detect the majority of the genes and their FDSIs. In Table D.1, I have enumerated the protein domain annotations and conservation annotations of the 43 genes with FDSIs from Chapter 2. Only one gene (Oprm1) was not annotatable using CDD protein domains, but three genes (PML5, Lpin1 and Sirt3) do not have a discriminating exon with a PhyloP score greater than 1 (my threshold). As I discussed in Chapter 3, this may reflect a limitation in PhyloP scores or the CDD. It may also indicate that the literature support for the gene with FDSIs was based on problematic results. Regardless, the majority of genes with FDSIs from Chapter 2 would pass my protein domain and conservation thresholds in Chapter 3. This suggests that if I had detected them in my long-read data, I would have likely prioritized these genes.

One limitation to Chapters 3 and 4 is the use of expression levels computed from relatively low depth long-read data. If these estimates are inaccurate, it would potentially impact my prioritization based on expression levels. Ideally, the expression levels of prioritized genes and their splice variants would be validated using more accurate approaches. This would help delineate between actual transcripts, and potential technical artefacts detected by the MinION sequencer.
Furthermore, this would ground gene expression metrics from MinION sequencing to a common laboratory technique. I believe this would not only benefit my work, but also work across the alternative splicing field as we continue to study splice variants using long-read sequencing. For example, the study describing FLAIR, the tool I use to quantify splice variant expression in Chapters 3 and 4, did not validate their splice variant expression levels (Tang et al., 2018). Furthermore, in a targeted transcriptomic study of human CACNA1C, the investigators did not validate splice variant expression (Clark et al., 2019). In order for us to best use long-read sequencing for splice variant function, appropriate validation must become a field standard, at least until the properties of the method are better understood.

In Chapters 3 and 4, my evaluation of splice variants using functional genomics is dependent on specific functional genomic annotations. Thus, any limitations of the different annotations will limit my prioritization approach. To alleviate these limitations, I maintained reasonably loose annotation thresholds. For example, in order for a gene in Chapter 3 to be a candidate with FDSIs, the gene had to have two splice variants with an open reading frame at least 30 amino acids long. Most proteins are 300 amino acids long, and many would define a reading frame with fewer than 100 amino acids as a small ORF (Couso and Patraquim, 2017). In Chapter 4, I used a threshold of 2,000 amino acids to determine whether the Cacna1e splice variant could encode a likely functional channel. However, the smallest Cacna1e protein I found in the literature or in UniProt was ~2,200 amino acids (Bateman et al., 2017). Despite maintaining loose annotation thresholds, my prioritized set of genes with FDSIs remained low in Chapter 3, and my filtering still eliminated most Cacna1e splice variants in Chapter 4.
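To make the Chapter 3 ORF criterion above concrete (at least two splice variants per gene, each with an open reading frame of at least 30 amino acids), the following is a minimal Python sketch. It assumes forward-strand transcript sequences, the standard stop codons, and ORFs that start at ATG and end at an in-frame stop; the function names are hypothetical, and the ORF-calling procedure actually used in the thesis may differ.

```python
from typing import Iterable

STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code
MIN_ORF_AA = 30                      # loose threshold quoted in the text


def longest_orf_aa(transcript: str) -> int:
    """Length in residues of the longest ATG-to-stop ORF in any forward frame.

    ORFs that run off the end of the transcript without a stop codon are
    ignored in this sketch (an assumption, not a rule from the thesis).
    """
    seq = transcript.upper()
    best = 0
    for frame in range(3):
        in_orf = False
        length = 0
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if not in_orf:
                if codon == "ATG":
                    in_orf = True
                    length = 1  # count the initiating methionine
            elif codon in STOP_CODONS:
                best = max(best, length)
                in_orf = False
            else:
                length += 1
    return best


def passes_orf_criterion(variant_sequences: Iterable[str]) -> bool:
    """True if at least two splice variants carry an ORF of >= MIN_ORF_AA."""
    qualifying = sum(
        1 for s in variant_sequences if longest_orf_aa(s) >= MIN_ORF_AA
    )
    return qualifying >= 2
```

Because the threshold is far below the typical protein length cited in the text, this filter alone removes very little; in the thesis it is combined with the expression, protein domain, and conservation annotations.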
5.2 Applications of research findings

As with many in silico studies, an immediate application of my research findings will be wet-lab validation. Ideally, the experimental validation would be aimed at providing evidence that a gene has FDSIs. This application would not only improve our understanding of the functional consequences of alternative splicing, but would also change our knowledge of the gene’s function.

For Chapter 2, my literature curation for genes with FDSIs helps the field quickly identify the extent to which a gene’s splice variants have been explored. For example, we classified 88/743 genes as having “potentially positive” literature evidence of FDSIs. This “potentially positive” annotation was from two scenarios. The first scenario was when investigators depleted one splice variant for a gene, and no other splice variant was depleted. The second scenario was when investigators simultaneously depleted multiple splice variants for a single gene, but they did not deplete all the gene’s splice variants. Either scenario could be promising evidence towards FDSIs but could also turn out to be cases of functional redundancy.

We do not know the functional consequences of most splice variants, and my work in Chapter 3 provides direction towards some promising genes with multiple functional splice variants. In Chapter 3 of my thesis, I prioritized a set of 79 genes likely to have FDSIs using functional genomic annotations. Nearly all of these genes (78/79) do not appear to have previous suggestions that they would be likely to have FDSIs by the criteria set in Chapter 2. Besides protein domain annotations, my annotations do not explain “why” the gene would have FDSIs. For example, one prioritized gene, Tpd52l1, has previous literature evidence suggesting that different exons change the TPD52L1 protein’s binding partners, and eventual molecular pathways.
Given my annotations and this literature, I hypothesized that Tpd52l1 has FDSIs to function in distinct molecular pathways. In contrast, another prioritized gene, Gstz1, does not have comparable literature. Regardless, Tpd52l1, Gstz1, and the remaining prioritized genes require ex silico evidence to be considered genes with FDSIs. Furthermore, with additional tissue-specific long-read data, my annotations could provide tissue-specificity of expression, and thus non-redundancy in function. Currently, I have data for only the mouse brain and liver, and even within the liver transcriptome, I only have 2 samples. Previous claims of tissue-specific splice variant function were based on short-read data from GTEx (54 tissues/15,253 samples), or Illumina BodyMap (16 tissues/48 samples), and comparable long-read data would be desirable (Gonzalez-Porta et al., 2013; Lonsdale et al., 2013).

In Chapter 4, I investigate the splicing profile of rat Cacna1e, and provide support for its candidacy as a gene with FDSIs. The next step is ex silico validation: siRNA experiments that individually target the topmost expressed splice variant and the second most expressed splice variant, and determine whether either depletion causes a change in phenotype. Based on previous literature, the change in phenotype will be based on channel activation time. In Chapter 2, we found literature evidence for FDSIs for another member of the Cacna1 family, Cacna1b. If ex silico experiments verify that Cacna1e has FDSIs, this would support the non-interchangeable model of alternative splicing evolution following gene duplication (Abascal et al., 2015c).

5.3 Future directions

Hypothetically, an increase in genes with FDSIs in the literature would lead to many exciting areas of research. For example, one could build a machine learning classifier to predict more genes with FDSIs, investigate the regulation of genes with FDSIs, or build a transcriptome-wide network for the genes with FDSIs.
While I believe these ideas will eventually answer important questions, it is unlikely we will see a large increase in genes with FDSIs in the literature. Certain challenges, such as determining the correct function to test in an assay or producing a high-throughput assay, remain. Thus, for the remainder of this section, I will discuss ideas that are more immediately feasible.

A central discussion about alternative splicing and functional diversity is finding evidence of translation of RNA-seq splice variants. Whether with ribosomal profiling data, or with mass spectrometry, investigators look to determine whether short-read splice variants have evidence for translation (Chait, 2006). As we generate more long-read datasets, an interesting future direction could be determining how many long-read splice variants have translation evidence in an aggregated set of ribosomal profiling data, or mass spectrometry data. In brief, using long-read splice variants, one could build a putative protein database. Then, using reads from ribosomal profiling data, or peptides from mass spectrometry data, one would “align” reads/peptides to the putative protein database.

Some investigators claim that many splice variants are species-specific innovations, that is, they evolve under positive selection (Blencowe, 2017; Ramensky et al., 2008). As I described in Section 1.6, due to the low effective population size of mammals, it is extremely unlikely that most or even many species-specific splice variants are evolving under positive selection. Studies have indicated that less than 2% of human genes evolve under positive selection (Kelley et al., 2006; Pickrell et al., 2009). However, a limited number of splice variants could evolve under positive selection, which is an interesting avenue for future research. In order to properly test for positively selected splice variants, we would require massive amounts of population variation data (Tress et al., 2017a).
Fortunately, the latest update to the gnomAD project may be useful here (Karczewski et al., 2020). Genomic features that evolve under negative (conservation) or positive selection tend to have low variation. In brief, we can divide splice variants into two groups: splice variants that are present in multiple species (negative selection, or conservation), or splice variants present only in humans. If any of the splice variants present only in humans have a similar level of intra-species variability to conserved splice variants, this may indicate that the splice variants evolve under positive selection. This potential investigation would not only find more interesting cases of genes with FDSIs, but also point towards genes experiencing adaptive functional changes in the human lineage.

For my thesis, I defined genes with FDSIs in the context of wildtype function. An interesting future research direction would be to explore alternative splicing in the context of disease. This idea stems from the early ENCODE discussions of function where function could be defined in an evolutionary context, a clinical context, and a biochemical context. My definition of functional fits best with the evolutionary context of function: a splice variant that is necessary for the overall function of a conserved gene would be more likely to affect reproductive success. In contrast, each of my research chapters could be expanded to explore the clinical context, where a lowly-expressed, non-conserved splice variant in the wildtype condition might be upregulated in a disease condition. In such a scenario, a transcript might become functional in a way that is detrimental to the organism (Bergsma et al., 2017; Evrony et al., 2017). Thus, the work in Chapter 2 could be expanded to curate such situations. Equivalent studies for Chapters 3 and 4 could be done as well by using disease-specific long-read datasets (Minervini et al., 2020).
This type of investigation would potentially lead to identifying disruptions of splicing or transcript expression that contribute to disease.

RNA splicing is just one of at least 100 posttranscriptional modifications (Liu and Zhang, 2018). The concept of noise has been explored with other posttranscriptional modifications besides splicing. Just like splicing, previous studies have challenged the idea that different posttranscriptional modifications vastly increase genomic functional diversity. For example, based on conservation metrics, RNA editing is likely nonfunctional for most genes (Xu and Zhang, 2014, 2015). Similar conclusions have been drawn for alternative polyadenylation, alternative transcription initiation, and RNA methylation (Li and Zhang, 2019; Liu and Zhang, 2018; Xu and Zhang, 2020; Xu et al., 2019). But as for splicing, there are known cases where these processes do have a functional role, suggesting that there may be more such cases to discover, even if they are rare. Given appropriate data, one could reapply the types of approaches I used in my research to any one of these posttranscriptional modifications. Doing so would result in prioritizing genes where posttranscriptional modifications are more likely to be interesting to study.

5.4 Final remarks

This thesis investigates the claim that alternative splicing vastly increases the functional diversity of the genome in the context of noisy splicing evidence. While my main goal was to aid the field in identifying novel genes with FDSIs, a secondary goal was to explore the common claim that genes with FDSIs are the norm. In agreement with my hypothesis, my results provide more support for the noisy splicing model. My thesis shows that it is challenging to find support for genes with FDSIs – either from the literature, or from computational analysis.
If the functional diversity caused by alternative splicing is allegedly so vast, it now seems reasonable to ask where the evidence is, that is, to shift the burden of proof onto the positive claims. In the meantime, my work can help scientists who tend to take the statement that “splicing vastly increases functional diversity” at face value understand the lack of evidence for that claim.

The tendency for molecular biologists to treat every observable molecular trait as functionally important is common, despite this claim being consistently challenged (Doolittle, 2018; Doolittle et al., 2014). As novel sequencing platforms identify ever more transcripts and post-transcriptional modifications with increased sensitivity, it becomes paramount that experimental and computational biologists understand that biochemical processes are inherently noisy, and that this noise makes its way into the data. Not everything observed is important, and having this in mind can help appropriately (or at least most impactfully) focus scientific resources. I believe that my thesis aids this education by providing the field with the genes likely to have FDSIs, rather than stopping with just a list of genes that have multiple detected splice variants. Obviously, nothing prevents researchers from pursuing all genes with multiple detected splice variants. There are clearly interesting genes with FDSIs waiting to be discovered. However, the clear prediction from my work is that the yield will be low.

References

Abascal, F., Ezkurdia, I., Rodriguez-Rivas, J., Rodriguez, J.M., Pozo, A. del, Vázquez, J., Valencia, A., and Tress, M.L. (2015a). Alternatively Spliced Homologous Exons Have Ancient Origins and Are Highly Expressed at the Protein Level. PLOS Comput. Biol. 11, e1004325. Abascal, F., Tress, M.L., and Valencia, A. (2015b). Alternative splicing and co-option of transposable elements: the case of TMPO/LAP2α and ZNF451 in mammals. Bioinformatics 31, 2257–2261.
Abascal, F., Tress, M.L., and Valencia, A. (2015c). The evolutionary fate of alternatively spliced homologous exons after gene duplication. Genome Biol. Evol. 7, 1392–1403. Adams, P.J., Garcia, E., David, L.S., Mulatz, K.J., Spacey, S.D., and Snutch, T.P. (2009). Ca(V)2.1 P/Q-type calcium channel alternative splicing affects the functional impact of familial hemiplegic migraine mutations: implications for calcium channelopathies. Channels Austin Tex 3, 110–121. Afanasyeva, A., Bockwoldt, M., Cooney, C.R., Heiland, I., and Gossmann, T.I. (2018). Human long intrinsically disordered protein regions are frequent targets of positive selection. Genome Res. 28, 975–982. Aken, B.L., Ayling, S., Barrell, D., Clarke, L., Curwen, V., Fairley, S., Banet, J.F., Billis, K., Girón, C.G., Hourlier, T., et al. (2016). The Ensembl gene annotation system. Database 2016, baw093. Aken, B.L., Achuthan, P., Akanni, W., Amode, M.R., Bernsdorff, F., Bhai, J., Billis, K., Carvalho-Silva, D., Cummins, C., Clapham, P., et al. (2017). Ensembl 2017. Nucleic Acids Res. 45, D635–D642. Alberts, B., Johnson, A., Lewis, J., Raff, M., Roberts, K., and Walter, P. (2002). Studying Gene Expression and Function. Arikkath, J., and Campbell, K.P. (2003). Auxiliary subunits: essential components of the voltage-gated calcium channel complex. Curr. Opin. Neurobiol. 13, 298–307. Auboeuf, D. (2018). Alternative mRNA processing sites decrease genetic variability while increasing functional diversity. Transcription 9, 75–87. Bateman, A., Martin, M.J., O’Donovan, C., Magrane, M., Alpi, E., Antunes, R., Bely, B., Bingley, M., Bonilla, C., Britto, R., et al. (2017). UniProt: the universal protein knowledgebase. Nucleic Acids Res. 45, D158–D169. Bely, B., Martin, M.J., and Apweiler, R. (2010). Source of annotations in the UniProt Knowledgebase. F1000Posters 1. Bergsma, A.J., van der Wal, E., Broeders, M., van der Ploeg, A.T., and Pim Pijnappel, W.W.M. (2017).
Alternative Splicing in Genetic Diseases: Improved Diagnosis and Novel Treatment Options. In International Review of Cell and Molecular Biology, (Elsevier), p. Bhuiyan, S.A., Ly, S., Phan, M., Huntington, B., Hogan, E., Liu, C.C., Liu, J., and Pavlidis, P. (2018). Systematic evaluation of isoform function in literature reports of alternative splicing. BMC Genomics 19, 637. Blencowe, B.J. (2006). Alternative Splicing: New Insights from Global Analyses. Cell 126, 37–47. Blencowe, B.J. (2017). The Relationship between Alternative Splicing and Proteomic Complexity. Trends Biochem. Sci. 0. Booker, T.R., Jackson, B.C., and Keightley, P.D. (2017). Detecting positive selection in the genome. BMC Biol. 15, 98. Bourinet, E., Soong, T.W., Sutton, K., Slaymaker, S., Mathews, E., Monteil, A., Zamponi, G.W., Nargeot, J., and Snutch, T.P. (1999). Splicing of α1A subunit gene generates phenotypic variants of P- and Q-type calcium channels. Nat. Neurosci. 2, 407–415. Boutros, R., Bailey, A.M., Wilson, S.H.D., and Byrne, J.A. (2003). Alternative Splicing as a Mechanism for Regulating 14-3-3 Binding: Interactions between hD53 (TPD52L1) and 14-3-3 Proteins. J. Mol. Biol. 332, 675–687. Boutros, R., Fanayan, S., Shehata, M., and Byrne, J.A. (2004). The tumor protein D52 family: many pieces, many puzzles. Biochem. Biophys. Res. Commun. 325, 1115–1121. Bush, S.J., Chen, L., Tovar-Corona, J.M., and Urrutia, A.O. (2017). Alternative splicing and the evolution of phenotypic novelty. Phil Trans R Soc B 372, 20150474. Cain, S.M., Tyson, J.R., Choi, H., Ko, R., Lin, P.J.C., LeDue, J.M., Powell, K.L., Bernier, L., Rungta, R.L., Yang, Y., et al. (2018). CaV3.2 drives sustained burst‐firing, which is critical for absence seizure propagation in reticular thalamic neurons. Epilepsia 59, 778–791. Camacho, C., Coulouris, G., Avagyan, V., Ma, N., Papadopoulos, J., Bealer, K., and Madden, T.L. (2009). BLAST+: architecture and applications. BMC Bioinformatics 10, 421. 
Candi, E., Rufini, A., Terrinoni, A., Dinsdale, D., Ranalli, M., Paradisi, A., De Laurenzi, V., Spagnoli, L.G., Catani, M.V., Ramadan, S., et al. (2006). Differential roles of p63 isoforms in epidermal development: selective genetic complementation in p63 null mice. Cell Death Differ. 13, 1037–1047. Catterall, W.A. (1995). Structure and function of voltage-gated ion channels. Annu. Rev. Biochem. 64, 493–531. Catterall, W.A. (2011). Voltage-gated calcium channels. Cold Spring Harb. Perspect. Biol. 3, a003947. Chait, B.T. (2006). Mass Spectrometry: Bottom-Up or Top-Down? Science 314, 65–66. Charlesworth, B. (2009). Fundamental concepts in genetics: effective population size and patterns of molecular evolution and variation. Nat. Rev. Genet. 10, 195–205. Charlesworth, B., and Charlesworth, D. (2018). Neutral Variation in the Context of Selection. Mol. Biol. Evol. 35, 1359–1361. Chen, H., Shaw, D., Zeng, J., Bu, D., and Jiang, T. (2019). DIFFUSE: predicting isoform functions from sequences and expression profiles via deep learning. Bioinformatics 35, i284–i294. Chen, Z., Gore, B.B., Long, H., Ma, L., and Tessier-Lavigne, M. (2008). Alternative splicing of the Robo3 axon guidance receptor governs the midline switch from attraction to repulsion. Neuron 58, 325–332. Chern, T.-M., van Nimwegen, E., Kai, C., Kawai, J., Carninci, P., Hayashizaki, Y., and Zavolan, M. (2006). A simple physical model predicts small exon length variations. PLoS Genet. 2, e45. Cho, S., Ko, H.-M., Kim, J.-M., Lee, J.-A., Park, J.-E., Jang, M.-S., Park, S.G., Lee, D.H., Ryu, S.-E., and Park, B.-C. (2004). Positive Regulation of Apoptosis Signal-regulating Kinase 1 by hD53L1. J. Biol. Chem. 279, 16050–16056. Clark, M., Wrzesinski, T., Garcia-Bea, A., Kleinman, J., Hyde, T., Weinberger, D., Haerty, W., and Tunbridge, E. (2018). Long-read sequencing reveals the splicing profile of the calcium channel gene CACNA1C in human brain. BioRxiv 260562.
Clark, M.B., Amaral, P.P., Schlesinger, F.J., Dinger, M.E., Taft, R.J., Rinn, J.L., Ponting, C.P., Stadler, P.F., Morris, K.V., Morillon, A., et al. (2011). The Reality of Pervasive Transcription. PLOS Biol. 9, e1000625. Clark, M.B., Wrzesinski, T., Garcia, A.B., Hall, N.A.L., Kleinman, J.E., Hyde, T., Weinberger, D.R., Harrison, P.J., Haerty, W., and Tunbridge, E.M. (2019). Long-read sequencing reveals the complex splicing profile of the psychiatric risk gene CACNA1C in human brain. Mol. Psychiatry 1–11. Cogan, J., Austin, E., Hedges, L., Womack, B., West, J., Loyd, J., and Hamid, R. (2012). Role of BMPR2 alternative splicing in heritable pulmonary arterial hypertension penetrance. Circulation 126, 1907–1916. Coldwell, M.J., Sack, U., Cowan, J.L., Barrett, R.M., Vlasak, M., Sivakumaran, K., and Morley, S.J. (2012). Multiple isoforms of the translation initiation factor eIF4GII are generated via use of alternative promoters, splice sites and a non-canonical initiation codon. Biochem. J. 448, 1–11.    143 Consortium, T.E.P. (2012). An integrated encyclopedia of DNA elements in the human genome. Nature 489, 57–74. Coulon, A., Ferguson, M.L., de Turris, V., Palangat, M., Chow, C.C., and Larson, D.R. (2014). Kinetic competition during the transcription cycle results in stochastic RNA processing. ELife 3, e03939. Couso, J.-P., and Patraquim, P. (2017). Classification and function of small open reading frames. Nat. Rev. Mol. Cell Biol. 18, 575–589. David, L.S., Garcia, E., Cain, S.M., Thau, E., Tyson, J.R., and Snutch, T.P. (2010). Splice-variant changes of the Ca(V)3.2 T-type calcium channel mediate voltage-dependent facilitation and associate with cardiac hypertrophy and development. Channels Austin Tex 4, 375–389. Dobin, A., Davis, C.A., Schlesinger, F., Drenkow, J., Zaleski, C., Jha, S., Batut, P., Chaisson, M., and Gingeras, T.R. (2013). STAR: ultrafast universal RNA-seq aligner. Bioinforma. Oxf. Engl. 29, 15–21. Donaldson, L.F., and Beazley-Long, N. (2016). 
Alternative RNA splicing: contribution to pain and potential therapeutic strategy. Drug Discov. Today 21, 1787–1798. Doolittle, W.F. (2018). We simply cannot go on being so vague about ‘function.’ Genome Biol. 19, 223. Doolittle, W.F., Brunet, T.D.P., Linquist, S., and Gregory, T.R. (2014). Distinguishing between “Function” and “Effect” in Genome Biology. Genome Biol. Evol. 6, 1234–1237. Eddy, S.R. (2013). The ENCODE project: Missteps overshadowing a success. Curr. Biol. 23, R259–R261. Eksi, R., Li, H.-D., Menon, R., Wen, Y., Omenn, G.S., Kretzler, M., and Guan, Y. (2013). Systematically Differentiating Functions for Alternatively Spliced Isoforms through Integrating RNA-seq Data. PLoS Comput Biol 9, e1003314. El-Gebali, S., Mistry, J., Bateman, A., Eddy, S.R., Luciani, A., Potter, S.C., Qureshi, M., Richardson, L.J., Salazar, G.A., Smart, A., et al. (2019). The Pfam protein families database in 2019. Nucleic Acids Res. 47, D427–D432. Eling, N., Morgan, M.D., and Marioni, J.C. (2019). Challenges in measuring and understanding biological noise. Nat. Rev. Genet. 1. ENCODE Project Consortium (2004). The ENCODE (ENCyclopedia Of DNA Elements) Project. Science 306, 636–640. Evrony, G.D., Cordero, D.R., Shen, J., Partlow, J.N., Yu, T.W., Rodin, R.E., Hill, R.S., Coulter, M.E., Lam, A.-T.N., Jayaraman, D., et al. (2017). Integrated genome and transcriptome    144 sequencing identifies a noncoding mutation in the genome replication factor DONSON as the cause of microcephaly-micromelia syndrome. Genome Res. Ezkurdia, I., Rodriguez, J.M., Carrillo-de Santa Pau, E., Vázquez, J., Valencia, A., and Tress, M.L. (2015). Most Highly Expressed Protein-Coding Genes Have a Single Dominant Isoform. J. Proteome Res. 14, 1880–1887. Fang, Z., Park, C.-K., Li, H.Y., Kim, H.Y., Park, S.-H., Jung, S.J., Kim, J.S., Monteil, A., Oh, S.B., and Miller, R.J. (2007). Molecular Basis of Cav2.3 Calcium Channels in Rat Nociceptive Neurons. J. Biol. Chem. 282, 4757–4764. 
Fekete, E., Flipphi, M., Ág, N., Kavalecz, N., Cerqueira, G., Scazzocchio, C., and Karaffa, L. (2017). A mechanism for a single nucleotide intron shift. Nucleic Acids Res. Frankish, A., Mudge, J.M., Thomas, M., and Harrow, J. (2012). The importance of identifying alternative splicing in vertebrate genome annotation. Database J. Biol. Databases Curation 2012, bas014. Gannett, L. (1999). What’s in a Cause?: The Pragmatic Dimensions of Genetic Explanations. Biol. Philos. 14, 349–373. Gibson, T.J., Seiler, M., and Veitia, R.A. (2013). The transience of transient overexpression. Gifford, F. (1990). Genetic traits. Biol. Philos. 5, 327–347. Gonzalez-Gutierrez, G., Miranda-Laferte, E., Contreras, G., Neely, A., and Hidalgo, P. (2010). Swapping the I-II intracellular linker between L-type CaV1.2 and R-type CaV2.3 high-voltage gated calcium channels exchanges activation attributes. Channels Austin Tex 4, 42–50. Gonzalez-Porta, M., Frankish, A., Rung, J., Harrow, J., and Brazma, A. (2013). Transcriptome analysis of human tissues and cell lines reveals one dominant transcript per gene. Genome Biol. 14, R70. Gouin, A., Legeai, F., Nouhaud, P., Whibley, A., Simon, J.-C., and Lemaitre, C. (2015). Whole-genome re-sequencing of non-model organisms: lessons from unmapped reads. Heredity 114, 494–501. Gould, S.J., and Lewontin, R.C. (1979). The spandrels of San Marco and the Panglossian paradigm: a critique of the adaptationist programme. Proc R Soc Lond B 205, 581–598. Graur, D. (2017). Rubbish DNA: The Functionless Fraction of the Human Genome. In Evolution of the Human Genome I: The Genome and Genes, N. Saitou, ed. (Tokyo: Springer Japan), pp. 19–60. Gray, A.C., Raingo, J., and Lipscombe, D. (2007). Neuronal calcium channels: Splicing for optimal performance. Cell Calcium 42, 409–417.    145 Guida, S., Trettel, F., Pagnutti, S., Mantuano, E., Tottene, A., Veneziano, L., Fellin, T., Spadaro, M., Stauderman, K.A., Williams, M.E., et al. (2001). 
Complete Loss of P/Q Calcium Channel Activity Caused by a CACNA1A Missense Mutation Carried by Patients with Episodic Ataxia Type 2. Am. J. Hum. Genet. 68, 759–764. Guttman, M., Russell, P., Ingolia, N.T., Weissman, J.S., and Lander, E.S. (2013). Ribosome Profiling Provides Evidence that Large Noncoding RNAs Do Not Encode Proteins. Cell 154, 240–251. Hahn, M.W., and Wray, G.A. (2002). The g-value paradox. Evol. Dev. 4, 73–75. Han, H., Braunschweig, U., Gonatopoulos-Pournatzis, T., Weatheritt, R.J., Hirsch, C.L., Ha, K.C.H., Radovani, E., Nabeel-Shah, S., Sterne-Weiler, T., Wang, J., et al. (2017). Multilayered Control of Alternative Splicing Regulatory Networks by Transcription Factors. Mol. Cell 65, 539-553.e7. Han, M.-H., Lin, C., Meng, S., and Wang, X. (2010). Proteomics analysis reveals overlapping functions of clustered protocadherins. Mol. Cell. Proteomics MCP 9, 71–83. Hao, Y., Colak, R., Teyra, J., Corbi-Verge, C., Ignatchenko, A., Hahne, H., Wilhelm, M., Kuster, B., Braun, P., Kaida, D., et al. (2015). Semi-supervised Learning Predicts Approximately One Third of the Alternative Splicing Isoforms as Functional Proteins. Cell Rep. 12, 183–189. Hardwick, S.A., Chen, W.Y., Wong, T., Deveson, I.W., Blackburn, J., Andersen, S.B., Nielsen, L.K., Mattick, J.S., and Mercer, T.R. (2016). Spliced synthetic genes as internal controls in RNA sequencing experiments. Nat. Methods 13, 792–798. Heinemann, S.H., Terlau, H., Stühmer, W., Imoto, K., and Numa, S. (1992). Calcium channel characteristics conferred on the sodium channel by single mutations. Nature 356, 441. Helbig, K.L., Lauerer, R.J., Bahr, J.C., Souza, I.A., Myers, C.T., Uysal, B., Schwarz, N., Gandini, M.A., Huang, S., Keren, B., et al. (2018). De Novo Pathogenic Variants in CACNA1E Cause Developmental and Epileptic Encephalopathy with Contractures, Macrocephaly, and Dyskinesias. Am. J. Hum. Genet. 103, 666–678. Heyes, S., Pratt, W.S., Rees, E., Dahimene, S., Ferron, L., Owen, M.J., and Dolphin, A.C. (2015). 
Genetic disruption of voltage-gated calcium channels in psychiatric and neurological disorders. Prog. Neurobiol. 134, 36–54. Hirano, M., Takada, Y., Wong, C.F., Yamaguchi, K., Kotani, H., Kurokawa, T., Mori, M.X., Snutch, T.P., Ronjat, M., De Waard, M., et al. (2017). C-terminal splice variants of P/Q-type Ca2+ channel CaV2.1 α1 subunits are differentially regulated by Rab3-interacting molecule proteins. J. Biol. Chem. 292, 9365–9381. Hoff, A.O., Catala-Lehnen, P., Thomas, P.M., Priemel, M., Rueger, J.M., Nasonkin, I., Bradley, A., Hughes, M.R., Ordonez, N., Cote, G.J., et al. (2002). Increased bone mass is an unexpected phenotype associated with deletion of the calcitonin gene. J. Clin. Invest. 110, 1849–1857.    146 Hong, J.H., Ko, Y.H., and Kang, K. (2018). RNA variant identification discrepancy among splice-aware alignment algorithms. PLOS ONE 13, e0201822. Hsiao, Y.-H.E., Bahn, J.H., Lin, X., Chan, T.-M., Wang, R., and Xiao, X. (2016). Alternative splicing modulated by genetic variants demonstrates accelerated evolution regulated by highly conserved proteins. Genome Res. 26, 440–450. Hsu, S.-N., and Hertel, K.J. (2009). Spliceosomes walk the line: splicing errors and their impact on cellular function. RNA Biol. 6, 526. Hu, J., Boritz, E., Wylie, W., and Douek, D.C. (2017). Stochastic principles governing alternative splicing of RNA. PLOS Comput. Biol. 13, e1005761. Huang, J., Wang, Y., Raghavan, S., Feng, S., Kiesewetter, K., and Wang, J. (2011). Human down syndrome cell adhesion molecules (DSCAMs) are functionally conserved with Drosophila Dscam[TM1] isoforms in controlling neurodevelopment. Insect Biochem. Mol. Biol. 41, 778–787. Hyung, D., Kim, J., Cho, S.Y., and Park, C. (2018). ASpedia: a comprehensive encyclopedia of human alternative splicing. Nucleic Acids Res. 46, D58–D63. Inada, T. (2017). The Ribosome as a Platform for mRNA and Nascent Polypeptide Quality Control. Trends Biochem. Sci. 42, 5–15. Ingolia, N.T. (2016). 
Ribosome Footprint Profiling of Translation throughout the Genome. Cell 165, 22–33. Jaillon, O., Bouhouche, K., Gout, J.-F., Aury, J.-M., Noel, B., Saudemont, B., Nowacki, M., Serrano, V., Porcel, B.M., Ségurens, B., et al. (2008). Translational control of intron splicing in eukaryotes. Nature 451, 359–362. Jarre, G., Guillemain, I., Deransart, C., and Depaulis, A. (2017). Chapter 32 - Genetic Models of Absence Epilepsy in Rats and Mice. In Models of Seizures and Epilepsy (Second Edition), A. Pitkänen, P.S. Buckmaster, A.S. Galanopoulou, and S.L. Moshé, eds. (Academic Press), pp. 455–471. Jensen, T.H., Jacquier, A., and Libri, D. (2013). Dealing with Pervasive Transcription. Mol. Cell 52, 473–484. Ji, Z., Song, R., Huang, H., Regev, A., and Struhl, K. (2016). Transcriptome-scale RNase-footprinting of RNA-protein complexes. Nat. Biotechnol. 34, 410–413. Jiang, L., Wang, M., Lin, S., Jian, R., Li, X., Chan, J., Dong, G., Fang, H., Robinson, A.E., Aguet, F., et al. (2020). A Quantitative Proteome Map of the Human Body. Cell 0. de Jong, T.V., Moshkin, Y.M., and Guryev, V. (2019). Gene expression variability: the other dimension in transcriptome analysis. Physiol. Genomics 51, 145–158.    147 Jukes, T.H. (2000). The Neutral Theory of Molecular Evolution. Genetics 154, 956–958. Karczewski, K.J., Francioli, L.C., Tiao, G., Cummings, B.B., Alföldi, J., Wang, Q., Collins, R.L., Laricchia, K.M., Ganna, A., Birnbaum, D.P., et al. (2020). The mutational constraint spectrum quantified from variation in 141,456 humans. Nature 581, 434–443. Katoh, K., Misawa, K., Kuma, K., and Miyata, T. (2002). MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res. 30, 3059–3066. Keeling, D.M., Garza, P., Nartey, C.M., and Carvunis, A.-R. (2019). The meanings of “function” in biology and the problematic case of de novo gene emergence. ELife 8, e47014. Kelemen, O., Convertini, P., Zhang, Z., Wen, Y., Shen, M., Falaleeva, M., and Stamm, S. 
(2013). Function of alternative splicing. Gene 514, 1–30. Kelley, J.L., Madeoy, J., Calhoun, J.C., Swanson, W., and Akey, J.M. (2006). Genomic signatures of positive selection in humans and the limits of outlier approaches. Genome Res. 16, 980–989. Kennedy, D. (2004). The old file-drawer problem. Science 305, 451. Kimura, M. (1968). Evolutionary Rate at the Molecular Level. Nature 217, 624–626. Kimura, M. (1983). The Neutral Theory of Molecular Evolution (Cambridge University Press). Kimura, M., and Ohta, T. (1972). On the stochastic model for estimation of mutational distance between homologous proteins. J. Mol. Evol. 2, 87–90. Klöckner, U., Pereverzev, A., Leroy, J., Krieger, A., Vajna, R., Pfitzer, G., Hescheler, J., Malécot, C.O., and Schneider, T. (2004). The cytosolic II-III loop of Cav2.3 provides an essential determinant for the phorbol ester-mediated stimulation of E-type Ca2+ channel activity. Eur. J. Neurosci. 19, 2659–2668. Koonin, E.V. (2006). The origin of introns and their role in eukaryogenesis: a compromise solution to the introns-early versus introns-late debate? Biol. Direct 1, 22. Kopp, F., and Mendell, J.T. (2018). Functional Classification and Experimental Dissection of Long Noncoding RNAs. Cell 172, 393–407. Kovacs, E., Tompa, P., Liliom, K., and Kalmar, L. (2010). Dual coding in alternative reading frames correlates with intrinsic protein disorder. Proc. Natl. Acad. Sci. U. S. A. 107, 5429–5434. Kovaka, S., Zimin, A.V., Pertea, G.M., Razaghi, R., Salzberg, S.L., and Pertea, M. (2019). Transcriptome assembly from long-read RNA-seq alignments with StringTie2. Genome Biol. 20, 278.    148 Kovalak, C., Metkar, M., and Moore, M.J. (2019). Deep sequencing of pre-translational mRNPs reveals hidden flux through evolutionarily conserved AS-NMD pathways. BioRxiv 847004. Kriventseva, E.V., Koch, I., Apweiler, R., Vingron, M., Bork, P., Gelfand, M.S., and Sunyaev, S. (2003). Increase of functional diversity by alternative splicing. Trends Genet. 
19, 124–128. Križanovic, K., Echchiki, A., Roux, J., and Šikic, M. (2018). Evaluation of tools for long read RNA-seq splice-aware alignment. Bioinforma. Oxf. Engl. 34, 748–754. Kurmangaliyev, Y.Z., and Gelfand, M.S. (2008). Computational analysis of splicing errors and mutations in human transcripts. BMC Genomics 9, 13. Lee, S. (2013). Pharmacological Inhibition of Voltage-gated Ca2+ Channels for Chronic Pain Relief. Curr. Neuropharmacol. 11, 606–620. Lee, Y., and Rio, D.C. (2015). Mechanisms and Regulation of Alternative Pre-mRNA Splicing. Annu. Rev. Biochem. 84, 291–323. Lees, J.G., Dawson, N.L., Sillitoe, I., and Orengo, C.A. (2016). Functional innovation from changes in protein domains and their combinations. Curr. Opin. Struct. Biol. 38, 44–52. Li, H. (2018). Minimap2: pairwise alignment for nucleotide sequences. Bioinforma. Oxf. Engl. 34, 3094–3100. Li, B., and Dewey, C.N. (2011). RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome. BMC Bioinformatics 12, 323. Li, C., and Zhang, J. (2019). Stop-codon read-through arises largely from molecular errors and is generally nonadaptive. PLOS Genet. 15, e1008141. Li, H.-D., Menon, R., Omenn, G.S., and Guan, Y. (2014). The emerging era of genomic data integration for analyzing splice isoform function. Trends Genet. 30, 340–347. Li, H.-D., Omenn, G.S., and Guan, Y. (2016). A proteogenomic approach to understand splice isoform functions through sequence and expression-based computational modeling. Brief. Bioinform. 17, 1024–1031. Light, S., and Elofsson, A. (2013). The impact of splicing on protein domain architecture. Curr. Opin. Struct. Biol. 23, 451–458. Lim, L.P., and Burge, C.B. (2001). A computational analysis of sequence features involved in recognition of short introns. Proc. Natl. Acad. Sci. 98, 11193–11198. Lim, S.R., and Hertel, K.J. (2004). Commitment to Splice Site Pairing Coincides with A Complex Formation. Mol. Cell 15, 477–483.    
149 Lipscombe, D., and Andrade, A. (2015). Calcium Channel CaVα₁ Splice Isoforms - Tissue Specificity and Drug Action. Curr. Mol. Pharmacol. 8, 22–31. Lipscombe, D., Pan, J.Q., and Gray, A.C. (2002). Functional diversity in neuronal voltage-gated calcium channels by alternative splicing of Ca(v)alpha1. Mol. Neurobiol. 26, 21–44. Lipscombe, D., Andrade, A., and Allen, S.E. (2013a). Alternative splicing: functional diversity among voltage-gated calcium channels and behavioral consequences. Biochim. Biophys. Acta 1828, 1522–1529. Lipscombe, D., Allen, S.E., and Toro, C.P. (2013b). Control of Neuronal Voltage-Gated Calcium Ion Channels From RNA to Protein. Trends Neurosci. 36, 598–609. Liu, Z., and Zhang, J. (2018). Most m6A RNA Modifications in Protein-Coding Regions Are Evolutionarily Unconserved and Likely Nonfunctional. Mol. Biol. Evol. 35, 666–675. Lizio, M., Harshbarger, J., Shimoji, H., Severin, J., Kasukawa, T., Sahin, S., Abugessaisa, I., Fukuda, S., Hori, F., Ishikawa-Kato, S., et al. (2015). Gateways to the FANTOM5 promoter level mammalian expression atlas. Genome Biol. 16, 22. Lonsdale, J., Thomas, J., Salvatore, M., Phillips, R., Lo, E., Shad, S., Hasz, R., Walters, G., Garcia, F., Young, N., et al. (2013). The Genotype-Tissue Expression (GTEx) project. Nat. Genet. 45, 580–585. Lu, H., Lin, L., Sato, S., Xing, Y., and Lee, C.J. (2009). Predicting Functional Alternative Splicing by Measuring RNA Selection Pressure from Multigenome Alignments. PLOS Comput. Biol. 5, e1000608. Lunter, G., Ponting, C.P., and Hein, J. (2006). Genome-Wide Identification of Human Functional DNA Using a Neutral Indel Model. PLOS Comput. Biol. 2, e5. Ly, T., Ahmad, Y., Shlien, A., Soroka, D., Mills, A., Emanuele, M.J., Stratton, M.R., and Lamond, A.I. (2014a). A proteomic chronology of gene expression through the cell cycle in human myeloid leukemia cells. ELife 3, e01630. Ly, T., Ahmad, Y., Shlien, A., Soroka, D., Mills, A., Emanuele, M.J., Stratton, M.R., and Lamond, A.I. (2014b). 
A proteomic chronology of gene expression through the cell cycle in human myeloid leukemia cells. ELife 3, e01630.027. Lynch, M. (2007). The Origins of genome architecture (Sunderland (Mass.): Sinauer). Ma, X., Kawamoto, S., Uribe, J., and Adelstein, R.S. (2006). Function of the neuron-specific alternatively spliced isoforms of nonmuscle myosin II-B during mouse brain development. Mol. Biol. Cell 17, 2138–2149.    150 Marchler-Bauer, A., Lu, S., Anderson, J.B., Chitsaz, F., Derbyshire, M.K., DeWeese-Scott, C., Fong, J.H., Geer, L.Y., Geer, R.C., Gonzales, N.R., et al. (2011). CDD: a Conserved Domain Database for the functional annotation of proteins. Nucleic Acids Res. 39, D225–D229. Marchler-Bauer, A., Derbyshire, M.K., Gonzales, N.R., Lu, S., Chitsaz, F., Geer, L.Y., Geer, R.C., He, J., Gwadz, M., Hurwitz, D.I., et al. (2015). CDD: NCBI’s conserved domain database. Nucleic Acids Res. 43, D222–D226. Marioni, J.C., Mason, C.E., Mane, S.M., Stephens, M., and Gilad, Y. (2008). RNA-seq: An assessment of technical reproducibility and comparison with gene expression arrays. Genome Res. 18, 1509–1517. Matlin, A.J., Clark, F., and Smith, C.W.J. (2005). Understanding alternative splicing: towards a cellular code. Nat. Rev. Mol. Cell Biol. 6, 386. Matthews, B.J., Kim, M.E., Flanagan, J.J., Hattori, D., Clemens, J.C., Zipursky, S.L., and Grueber, W.B. (2007). Dendrite Self-Avoidance Is Controlled by Dscam. Cell 129, 593–604. McGlincy, N.J., and Smith, C.W.J. (2008). Alternative splicing resulting in nonsense-mediated mRNA decay: what is the meaning of nonsense? Trends Biochem. Sci. 33, 385–393. Meer, K.M., Nelson, P.G., Xiong, K., and Masel, J. (2019). Transcriptional Error Rates Vary by Gene Expression Level in E. coli but not S. cerevisiae. BioRxiv 554329. Mehmood, A., Laiho, A., Venäläinen, M.S., McGlinchey, A.J., Wang, N., and Elo, L.L. (2019). Systematic evaluation of differential splicing tools for RNA-seq studies. Brief. Bioinform. Melamud, E., and Moult, J. (2009). 
Stochastic noise in splicing machinery. Nucleic Acids Res. 37, 4873–4886. Merino, G.A., Conesa, A., and Fernández, E.A. (2019). A benchmarking of workflows for detecting differential splicing and differential expression at isoform level in human RNA-seq studies. Brief. Bioinform. 20, 471–481. Merkin, J., Russell, C., Chen, P., and Burge, C.B. (2012). Evolutionary dynamics of gene and isoform regulation in Mammalian tissues. Science 338, 1593–1599. Millard, S.S., Lu, Z., Zipursky, S.L., and Meinertzhagen, I.A. (2010). Drosophila Dscam Proteins Regulate Postsynaptic Specificity at Multiple-Contact Synapses. Neuron 67, 761–768. Minervini, C.F., Cumbo, C., Orsini, P., Anelli, L., Zagaria, A., Specchia, G., and Albano, F. (2020). Nanopore Sequencing in Blood Diseases: A Wide Range of Opportunities. Front. Genet. 11. Mironov, A.A., Fickett, J.W., and Gelfand, M.S. (1999). Frequent alternative splicing of human genes. Genome Res. 9, 1288–1293.    151 Mitani, Y., Li, J., Weber, R.S., Lippman, S.L., Flores, E.R., Caulin, C., and El-Naggar, A.K. (2011). Expression and regulation of the ΔN and TAp63 isoforms in salivary gland tumorigenesis clinical and experimental findings. Am. J. Pathol. 179, 391–399. Miura, S.K., Martins, A., Zhang, K.X., Graveley, B.R., and Zipursky, S.L. (2013). Probabilistic Splicing of Dscam1 Establishes Identity at the Level of Single Neurons. Cell 155, 1166–1177. Modrek, B., and Lee, C. (2002). A genomic view of alternative splicing. Nat. Genet. 30, 13–19. Mudge, J.M., Frankish, A., Fernandez-Banet, J., Alioto, T., Derrien, T., Howald, C., Reymond, A., Guigó, R., Hubbard, T., and Harrow, J. (2011). The origins, evolution, and functional potential of alternative splicing in vertebrates. Mol. Biol. Evol. 28, 2949–2959. Mudge, J.M., Frankish, A., and Harrow, J. (2013). Functional transcriptomics in the post-ENCODE era. Genome Res. 23, 1961–1973. Nei, M. (2005). Selectionism and Neutralism in Molecular Evolution. Mol. Biol. Evol. 22, 2318–2342. 
Nicoludis, J.M., Lau, S.-Y., Schärfe, C.P.I., Marks, D.S., Weihofen, W.A., and Gaudet, R. (2015). Structure and Sequence Analyses of Clustered Protocadherins Reveal Antiparallel Interactions that Mediate Homophilic Specificity. Structure 23, 2087–2098. Nourse, C.R., Mattei, M.-G., Gunning, P., and Byrne, J.A. (1998). Cloning of a third member of the D52 gene family indicates alternative coding sequence usage in D52-like transcripts. Biochim. Biophys. Acta BBA - Gene Struct. Expr. 1443, 155–168. Ohta, T. (1992). The Nearly Neutral Theory of Molecular Evolution. Annu. Rev. Ecol. Syst. 23, 263–286. Olcese, R., Qin, N., Schneider, T., Neely, A., Wei, X., Stefani, E., and Birnbaumer, L. (1994). The amino terminus of a calcium channel beta subunit sets rates of channel inactivation independently of the subunit’s effect on activation. Neuron 13, 1433–1438. Palazzo, A.F., and Lee, E.S. (2015). Non-coding RNA: what is functional and what is junk? Genet. Aging 6, 2. Pan, Q., Shai, O., Lee, L.J., Frey, B.J., and Blencowe, B.J. (2008). Deep surveying of alternative splicing complexity in the human transcriptome by high-throughput sequencing. Nat. Genet. 40, 1413–1415. Park, J.F., and Luo, Z.D. (2010). Calcium channel functions in pain processing. Channels 4, 510–517. Pereverzev, A., Leroy, J., Krieger, A., Malécot, C.O., Hescheler, J., Pfitzer, G., Klöckner, U., and Schneider, T. (2002). Alternate Splicing in the Cytosolic II–III Loop and the Carboxy    152 Terminus of Human E-type Voltage-Gated Ca2+ Channels: Electrophysiological Characterization of Isoforms. Mol. Cell. Neurosci. 21, 352–365. Pertea, M., Shumate, A., Pertea, G., Varabyou, A., Chang, Y.-C., Madugundu, A.K., Pandey, A., and Salzberg, S. (2018a). Thousands of large-scale RNA sequencing experiments yield a comprehensive new human gene list and reveal extensive transcriptional noise. BioRxiv 332825. 
Pertea, M., Shumate, A., Pertea, G., Varabyou, A., Breitwieser, F.P., Chang, Y.-C., Madugundu, A.K., Pandey, A., and Salzberg, S.L. (2018b). CHESS: a new human gene catalog curated from thousands of large-scale RNA sequencing experiments reveals extensive transcriptional noise. Genome Biol. 19, 208. Pickrell, J.K., Coop, G., Novembre, J., Kudaravalli, S., Li, J.Z., Absher, D., Srinivasan, B.S., Barsh, G.S., Myers, R.M., Feldman, M.W., et al. (2009). Signals of recent positive selection in a worldwide sample of human populations. Genome Res. 19, 826–837. Pickrell, J.K., Pai, A.A., Gilad, Y., and Pritchard, J.K. (2010). Noisy Splicing Drives mRNA Isoform Diversity in Human Cells. PLOS Genet 6, e1001236. Pollard, K.S., Hubisz, M.J., Rosenbloom, K.R., and Siepel, A. (2010). Detection of nonneutral substitution rates on mammalian phylogenies. Genome Res. 20, 110–121. Powell, K.L., Cain, S.M., Ng, C., Sirdesai, S., David, L.S., Kyi, M., Garcia, E., Tyson, J.R., Reid, C.A., Bahlo, M., et al. (2009). A Cav3.2 T-type calcium channel point mutation has splice-variant-specific effects on function and segregates with seizure expression in a polygenic rat model of absence epilepsy. J. Neurosci. Off. J. Soc. Neurosci. 29, 371–380. Prakash, A., and Tompa, M. (2007). Measuring the accuracy of genome-size multiple alignments. Genome Biol. 8, R124. Proft, J., Rzhepetskyy, Y., Lazniewska, J., Zhang, F.-X., Cain, S.M., Snutch, T.P., Zamponi, G.W., and Weiss, N. (2017). The Cacna1h mutation in the GAERS model of absence epilepsy enhances T-type Ca2+ currents by altering calnexin-dependent trafficking of Cav3.2 channels. Sci. Rep. 7, 11513. Pruitt, K.D., Tatusova, T., and Maglott, D.R. (2007). NCBI reference sequences (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res. 35, D61-65. Raj, B., and Blencowe, B.J. (2015). Alternative Splicing in the Mammalian Nervous System: Recent Insights into Mechanisms and Functional Roles. 
Neuron 87, 14–27. Raj, A., Peskin, C.S., Tranchina, D., Vargas, D.Y., and Tyagi, S. (2006). Stochastic mRNA Synthesis in Mammalian Cells. PLOS Biol. 4, e309. Ramanouskaya, T.V., and Grinev, V.V. (2017). The determinants of alternative RNA splicing in human cells. Mol. Genet. Genomics 1–21.    153 Ramensky, V.E., Nurtdinov, R.N., Neverov, A.D., Mironov, A.A., and Gelfand, M.S. (2008). Positive Selection in Alternatively Spliced Exons of Human Genes. Am. J. Hum. Genet. 83, 94–98. Reixachs-Solé, M., Ruiz-Orera, J., Alba, M.M., and Eyras, E. (2019). Ribosome profiling at isoform level reveals an evolutionary conserved impact of differential splicing on the proteome. BioRxiv 582031. Reyes, A., Anders, S., Weatheritt, R.J., Gibson, T.J., Steinmetz, L.M., and Huber, W. (2013). Drift and conservation of differential exon usage across tissues in primate species. Proc. Natl. Acad. Sci. 110, 15377–15382. Rice, P., Longden, I., and Bleasby, A. (2000). EMBOSS: the European Molecular Biology Open Software Suite. Trends Genet. TIG 16, 276–277. Richards, A.J., Muller, B., Shotwell, M., Cowart, L.A., Rohrer, B., and Lu, X. (2010). Assessing the functional coherence of gene sets with metrics based on the Gene Ontology graph. Bioinformatics 26, i79-87. Rino, J., Carvalho, T., Braga, J., Desterro, J.M.P., Lührmann, R., and Carmo-Fonseca, M. (2007). A Stochastic View of Spliceosome Assembly and Recycling in the Nucleus. PLOS Comput. Biol. 3, e201. Rodriguez, J.M., Maietta, P., Ezkurdia, I., Pietrelli, A., Wesselink, J.-J., Lopez, G., Valencia, A., and Tress, M.L. (2013). APPRIS: annotation of principal and alternative splice isoforms. Nucleic Acids Res. 41, D110–D117. Rodriguez, J.M., Rodriguez-Rivas, J., Di Domenico, T., Vázquez, J., Valencia, A., and Tress, M.L. (2018). APPRIS 2017: principal isoforms for multiple gene sets. Nucleic Acids Res. 
Rosenbloom, K.R., Armstrong, J., Barber, G.P., Casper, J., Clawson, H., Diekhans, M., Dreszer, T.R., Fujita, P.A., Guruvadoo, L., Haeussler, M., et al. (2015). The UCSC Genome Browser database: 2015 update. Nucleic Acids Res. 43, D670-681. Roy, S.W., and Irimia, M. (2008). Intron mis-splicing: no alternative? Genome Biol. 9, 208. Roy, S.W., and Irimia, M. (2009). Splicing in the eukaryotic ancestor: form, function and dysfunction. Trends Ecol. Evol. 24, 447–455. Ryu, J.Y., Kim, H.U., and Lee, S.Y. (2017). Framework and resource for more than 11,000 gene-transcript-protein-reaction associations in human metabolism. Proc. Natl. Acad. Sci. 114, E9740–E9749. Saegusa, H., Matsuda, Y., and Tanabe, T. (2002). Effects of ablation of N- and R-type Ca(2+) channels on pain transmission. Neurosci. Res. 43, 1–7.    154 Sahebi, M., Hanafi, M.M., van Wijnen, A.J., Azizi, P., Abiri, R., Ashkani, S., and Taheri, S. (2016). Towards understanding pre-mRNA splicing mechanisms and the role of SR proteins. Gene 587, 107–119. Saudemont, B., Popa, A., Parmley, J.L., Rocher, V., Blugeon, C., Necsulea, A., Meyer, E., and Duret, L. (2017). The fitness cost of mis-splicing is the main determinant of alternative splicing patterns. Genome Biol. 18, 208. Schad, E., Tompa, P., and Hegyi, H. (2011). The relationship between proteome size, structural disorder and organism complexity. Genome Biol. 12, R120. Schinke, T., Liese, S., Priemel, M., Haberland, M., Schilling, A.F., Catala-Lehnen, P., Blicharski, D., Rueger, J.M., Gagel, R.F., Emeson, R.B., et al. (2004). Decreased bone formation and osteopenia in mice lacking alpha-calcitonin gene-related peptide. J. Bone Miner. Res. Off. J. Am. Soc. Bone Miner. Res. 19, 2049–2056. Schmucker, D., Clemens, J.C., Shu, H., Worby, C.A., Xiao, J., Muda, M., Dixon, J.E., and Zipursky, S.L. (2000). Drosophila Dscam Is an Axon Guidance Receptor Exhibiting Extraordinary Molecular Diversity. Cell 101, 671–684. 
Schneider, T., Wei, X., Olcese, R., Costantin, J.L., Neely, A., Palade, P., Perez-Reyes, E., Qin, N., Zhou, J., and Crawford, G.D. (1994). Molecular analysis and functional expression of the human type E neuronal Ca2+ channel alpha 1 subunit. Receptors Channels 2, 255–270. Schneider, T., Neumaier, F., Hescheler, J., and Alpdogan, S. (2020). Cav2.3 R-type calcium channels: from its discovery to pathogenic de novo CACNA1E variants: a historical perspective. Pflüg. Arch. - Eur. J. Physiol. Schreiner, D., Simicevic, J., Ahrné, E., Schmidt, A., and Scheiffele, P. (2015). Quantitative isoform-profiling of highly diversified recognition molecules. ELife 4, e07794. Scott, M.B., and Kammermeier, P.J. (2017). CaV2 channel subtype expression in rat sympathetic neurons is selectively regulated by α2δ subunits. Channels 11, 555–573. Scotton, P., Bleckmann, D., Stebler, M., Sciandra, F., Brancaccio, A., Meier, T., Stetefeld, J., and Ruegg, M.A. (2006). Activation of muscle-specific receptor tyrosine kinase and binding to dystroglycan are regulated by alternative mRNA splicing of agrin. J. Biol. Chem. 281, 36835–36845. Sessegolo, C., Cruaud, C., Silva, C.D., Cologne, A., Dubarry, M., Derrien, T., Lacroix, V., and Aury, J.-M. (2019). Transcriptome profiling of mouse samples using nanopore sequencing of cDNA and RNA molecules. Sci. Rep. 9, 1–12. Shabalina, S.A., Ogurtsov, A.Y., Spiridonov, N.A., and Koonin, E.V. (2014). Evolution at protein ends: major contribution of alternative transcription initiation and termination to the transcriptome and proteome diversity in mammals. Nucleic Acids Res. 42, 7132–7144.    155 Shehu, A., Barbará, D., and Molloy, K. (2016). A Survey of Computational Methods for Protein Function Prediction. In Big Data Analytics in Genomics, K.-C. Wong, ed. (Springer International Publishing), pp. 225–298. Siepel, A., Bejerano, G., Pedersen, J.S., Hinrichs, A.S., Hou, M., Rosenbloom, K., Clawson, H., Spieth, J., Hillier, L.W., Richards, S., et al. (2005). 
Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res. 15, 1034–1050. Sievers, F., Wilm, A., Dineen, D., Gibson, T.J., Karplus, K., Li, W., Lopez, R., McWilliam, H., Remmert, M., Söding, J., et al. (2011). Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega. Mol. Syst. Biol. 7, 539. Simms, B.A., and Zamponi, G.W. (2014). Neuronal voltage-gated calcium channels: structure, function, and dysfunction. Neuron 82, 24–45. Skandalis, A. (2016). Estimation of the minimum mRNA splicing error rate in vertebrates. Mutat. Res. Mol. Mech. Mutagen. 784–785, 34–38. Smedley, D., Haider, S., Ballester, B., Holland, R., London, D., Thorisson, G., and Kasprzyk, A. (2009). BioMart--biological queries made easy. BMC Genomics 10, 22. Snutch, T.P., Tomlinson, W.J., Leonard, J.P., and Gilbert, M.M. (1991). Distinct calcium channels are generated by alternative splicing and are differentially expressed in the mammalian CNS. Neuron 7, 45–57. Soong, T.W., Stea, A., Hodson, C.D., Dubel, S.J., Vincent, S.R., and Snutch, T.P. (1993). Structure and functional expression of a member of the low voltage-activated calcium channel family. Science 260, 1133–1136. Spielman, S.J., and Wilke, C.O. (2015). The Relationship between dN/dS and Scaled Selection Coefficients. Mol. Biol. Evol. 32, 1097–1108. Stamm, S., Ben-Ari, S., Rafalska, I., Tang, Y., Zhang, Z., Toiber, D., Thanaraj, T.A., and Soreq, H. (2005). Function of alternative splicing. Gene 344, 1–20. Steijger, T., Abril, J.F., Engström, P.G., Kokocinski, F., The RGASP Consortium, Hubbard, T.J., Guigó, R., Harrow, J., and Bertone, P. (2013). Assessment of transcript reconstruction methods for RNA-seq. Nat. Methods advance online publication. Stetefeld, J., and Ruegg, M.A. (2005). Structural and functional diversity generated by alternative mRNA splicing. Trends Biochem. Sci. 30, 515–521. 
Sulakhe, D., D’Souza, M., Wang, S., Balasubramanian, S., Athri, P., Xie, B., Canzar, S., Agam, G., Gilliam, T.C., and Maltsev, N. (2018). Exploring the functional impact of alternative splicing on human protein isoforms using available annotation sources. Brief. Bioinform.
Tang, A.D., Soulette, C.M., Baren, M.J. van, Hart, K., Hrabeta-Robinson, E., Wu, C.J., and Brooks, A.N. (2018). Full-length transcript characterization of SF3B1 mutation in chronic lymphocytic leukemia reveals downregulation of retained introns. BioRxiv 410183.
Tang, A.D., Soulette, C.M., van Baren, M.J., Hart, K., Hrabeta-Robinson, E., Wu, C.J., and Brooks, A.N. (2020). Full-length transcript characterization of SF3B1 mutation in chronic lymphocytic leukemia reveals downregulation of retained introns. Nat. Commun. 11, 1–12.
Toda, M., Suzuki, T., Hosono, K., Kurihara, Y., Kurihara, H., Hayashi, I., Kitasato, H., Hoka, S., and Majima, M. (2008). Roles of calcitonin gene-related peptide in facilitation of wound healing and angiogenesis. Biomed. Pharmacother. Biomedecine Pharmacother. 62, 352–359.
Tranchevent, L.-C., Aubé, F., Dulaurier, L., Benoit-Pilven, C., Rey, A., Poret, A., Chautard, E., Mortada, H., Desmet, F.-O., Chakrama, F.Z., et al. (2017). Identification of protein features encoded by alternative exons using Exon Ontology. Genome Res. 27, 1087–1097.
Trapnell, C., Williams, B.A., Pertea, G., Mortazavi, A., Kwan, G., van Baren, M.J., Salzberg, S.L., Wold, B.J., and Pachter, L. (2010). Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat. Biotechnol. 28, 511–515.
Trapnell, C., Hendrickson, D.G., Sauvageau, M., Goff, L., Rinn, J.L., and Pachter, L. (2012). Differential analysis of gene regulation at transcript resolution with RNA-seq. Nat. Biotechnol. 31, 46–53.
Tress, M.L., Abascal, F., and Valencia, A. (2017a). Most Alternative Isoforms Are Not Functionally Important. Trends Biochem. Sci.
Tress, M.L., Abascal, F., and Valencia, A. (2017b). Alternative Splicing May Not Be the Key to Proteome Complexity. Trends Biochem. Sci. 42, 98–110.
Turunen, J.J., Niemelä, E.H., Verma, B., and Frilander, M.J. (2013). The significant other: splicing by the minor spliceosome. Wiley Interdiscip. Rev. RNA 4, 61–76.
Vajna, R., Klöckner, U., Pereverzev, A., Weiergräber, M., Chen, X., Miljanich, G., Klugbauer, N., Hescheler, J., Perez-Reyes, E., and Schneider, T. (2001). Functional coupling between “R-type” Ca2+ channels and insulin secretion in the insulinoma cell line INS-1. Eur. J. Biochem. 268, 1066–1075.
Van Petegem, F., Chatelain, F.C., and Minor, D.L. (2005). Insights into voltage-gated calcium channel regulation from the structure of the CaV1.2 IQ domain-Ca2+/calmodulin complex. Nat. Struct. Mol. Biol. 12, 1108–1115.
Voight, B.F., Kudaravalli, S., Wen, X., and Pritchard, J.K. (2006). A map of recent positive selection in the human genome. PLoS Biol. 4, e72.
Wang, Y., and Chen, Z. (2019). An update for epilepsy research and antiepileptic drug development: Toward precise circuit therapy. Pharmacol. Ther. 201, 77–93.
Wang, E.T., Sandberg, R., Luo, S., Khrebtukova, I., Zhang, L., Mayr, C., Kingsmore, S.F., Schroth, G.P., and Burge, C.B. (2008a). Alternative isoform regulation in human tissue transcriptomes. Nature 456, 470–476.
Wang, E.T., Sandberg, R., Luo, S., Khrebtukova, I., Zhang, L., Mayr, C., Kingsmore, S.F., Schroth, G.P., and Burge, C.B. (2008b). Alternative isoform regulation in human tissue transcriptomes. Nature 456, 470–476.
Wang, M., Zhang, P., Shu, Y., Yuan, F., Zhang, Y., Zhou, Y., Jiang, M., Zhu, Y., Hu, L., Kong, X., et al. (2014). Alternative splicing at GYNNGY 5′ splice sites: more noise, less regulation. Nucleic Acids Res. 42, 13969–13980.
Wang, X., You, X., Langer, J.D., Hou, J., Rupprecht, F., Vlatkovic, I., Quedenau, C., Tushev, G., Epstein, I., Schaefke, B., et al. (2019). Full-length transcriptome reconstruction reveals a large diversity of RNA and protein isoforms in rat hippocampus. Nat. Commun. 10, 1–15.
Warnecke, T., and Hurst, L.D. (2011). Error prevention and mitigation as forces in the evolution of genes and genomes. Nat. Rev. Genet. 12, 875–881.
Watson, F.L., Püttmann-Holgado, R., Thomas, F., Lamar, D.L., Hughes, M., Kondo, M., Rebel, V.I., and Schmucker, D. (2005). Extensive Diversity of Ig-Superfamily Proteins in the Immune System of Insects. Science 309, 1874–1878.
Weatheritt, R.J., Sterne-Weiler, T., and Blencowe, B.J. (2016). The ribosome-engaged landscape of alternative splicing. Nat. Struct. Mol. Biol. 23, 1117–1123.
Weiergräber, M., Kamp, M.A., Radhakrishnan, K., Hescheler, J., and Schneider, T. (2006). The Cav2.3 voltage-gated calcium channel in epileptogenesis - Shedding new light on an enigmatic channel. Neurosci. Biobehav. Rev. 30, 1122–1144.
Weiergräber, M., Henry, M., Radhakrishnan, K., Hescheler, J., and Schneider, T. (2007). Hippocampal seizure resistance and reduced neuronal excitotoxicity in mice lacking the Cav2.3 E/R-type voltage-gated calcium channel. J. Neurophysiol. 97, 3660–3669.
Welsby, P.J., Wang, H., Wolfe, J.T., Colbran, R.J., Johnson, M.L., and Barrett, P.Q. (2003). A Mechanism for the Direct Regulation of T-Type Calcium Channels by Ca2+/Calmodulin-Dependent Kinase II. J. Neurosci. 23, 10116–10121.
Weyrer, C., Turecek, J., Niday, Z., Liu, P.W., Nanou, E., Catterall, W.A., Bean, B.P., and Regehr, W.G. (2019). The Role of CaV2.1 Channel Facilitation in Synaptic Facilitation. Cell Rep. 26, 2289-2297.e3.
Wilhelm, M., Schlegl, J., Hahne, H., Gholami, A.M., Lieberenz, M., Savitski, M.M., Ziegler, E., Butzmann, L., Gessulat, S., Marx, H., et al. (2014). Mass-spectrometry-based draft of the human proteome. Nature 509, 582–587.
Williams, M.E., Marubio, L.M., Deal, C.R., Hans, M., Brust, P.F., Philipson, L.H., Miller, R.J., Johnson, E.C., Harpold, M.M., and Ellis, S.B. (1994). Structure and functional characterization of neuronal alpha 1E calcium channel subtypes. J. Biol. Chem. 269, 22347–22357.
Wojtowicz, W.M., Flanagan, J.J., Millard, S.S., Zipursky, S.L., and Clemens, J.C. (2004). Alternative Splicing of Drosophila Dscam Generates Axon Guidance Receptors that Exhibit Isoform-Specific Homophilic Binding. Cell 118, 619–633.
Wojtowicz, W.M., Wu, W., Andre, I., Qian, B., Baker, D., and Zipursky, S.L. (2007). A Vast Repertoire of Dscam Binding Specificities Arises from Modular Interactions of Variable Ig Domains. Cell 130, 1134–1145.
Workman, R.E., Tang, A., Tang, P.S., Jain, M., Tyson, J.R., Zuzarte, P.C., Gilpatrick, T., Razaghi, R., Quick, J., Sadowski, N., et al. (2018). Nanopore native RNA sequencing of a human poly(A) transcriptome. BioRxiv 459529.
Workman, R.E., Tang, A.D., Tang, P.S., Jain, M., Tyson, J.R., Razaghi, R., Zuzarte, P.C., Gilpatrick, T., Payne, A., Quick, J., et al. (2019). Nanopore native RNA sequencing of a human poly(A) transcriptome. Nat. Methods.
Wormuth, C., Lundt, A., Henseler, C., Müller, R., Broich, K., Papazoglou, A., and Weiergräber, M. (2016). Review: Cav2.3 R-type Voltage-Gated Ca2+ Channels - Functional Implications in Convulsive and Non-convulsive Seizure Activity. Open Neurol. J. 10, 99–126.
Worton, L.E., Shi, Y.-C., Smith, E.J., Barry, S.C., Gonda, T.J., Whitehead, J.P., and Gardiner, E.M. (2017). Ectodermal-Neural Cortex 1 Isoforms Have Contrasting Effects on MC3T3-E1 Osteoblast Mineralization and Gene Expression. J. Cell. Biochem. 118, 2141–2150.
Wyman, D., Balderrama-Gutierrez, G., Reese, F., Jiang, S., Rahmanian, S., Forner, S., Matheos, D., Zeng, W., Williams, B., Trout, D., et al. (2020). A technology-agnostic long-read analysis pipeline for transcriptome discovery and quantification. BioRxiv 672931.
Xing, Y., and Lee, C. (2005a). Evidence of functional selection pressure for alternative splicing events that accelerate evolution of protein subsequences. Genome Biol. 6, P8.
Xing, Y., and Lee, C. (2005b). Evidence of functional selection pressure for alternative splicing events that accelerate evolution of protein subsequences. Proc. Natl. Acad. Sci. U. S. A. 102, 13526–13531.
Xing, Y., and Lee, C. (2006). Alternative splicing and RNA selection pressure--evolutionary consequences for eukaryotic genomes. Nat. Rev. Genet. 7, 499–509.
Xiong, J., Jiang, X., Ditsiou, A., Gao, Y., Sun, J., Lowenstein, E.D., Huang, S., and Khaitovich, P. (2018). Predominant patterns of splicing evolution on human, chimpanzee and macaque evolutionary lineages. Hum. Mol. Genet. 27, 1474–1485.
Xu, C., and Zhang, J. (2020). A different perspective on alternative cleavage and polyadenylation. Nat. Rev. Genet. 21, 63–63.
Xu, G., and Zhang, J. (2014). Human coding RNA editing is generally nonadaptive. Proc. Natl. Acad. Sci. 201321745.
Xu, G., and Zhang, J. (2015). In Search of Beneficial Coding RNA Editing. Mol. Biol. Evol. 32, 536–541.
Xu, C., Park, J.-K., and Zhang, J. (2019). Evidence that alternative transcriptional initiation is largely nonadaptive. PLOS Biol. 17, e3000197.
Yang, Z. (1998). Likelihood ratio tests for detecting positive selection and application to primate lysozyme evolution. Mol. Biol. Evol. 15, 568–573.
Yang, L., Sakurai, T., Kamiyoshi, A., Ichikawa-Shindo, Y., Kawate, H., Yoshizawa, T., Koyama, T., Iesato, Y., Uetake, R., Yamauchi, A., et al. (2013). Endogenous CGRP protects against neointimal hyperplasia following wire-induced vascular injury. J. Mol. Cell. Cardiol. 59, 55–66.
Yang, X., Coulombe-Huntington, J., Kang, S., Sheynkman, G.M., Hao, T., Richardson, A., Sun, S., Yang, F., Shen, Y.A., Murray, R.R., et al. (2016). Widespread Expansion of Protein Interaction Capabilities by Alternative Splicing. Cell 164, 805–817.
Yap, K., Xiao, Y., Friedman, B.A., Je, H.S., and Makeyev, E.V. (2016). Polarizing the Neuron through Sustained Co-expression of Alternatively Spliced Isoforms. Cell Rep.
Zaman, T., Lee, K., Park, C., Paydar, A., Choi, J.H., Cheong, E., Lee, C.J., and Shin, H.-S. (2011). Cav2.3 channels are critical for oscillatory burst discharges in the reticular thalamus and absence epilepsy. Neuron 70, 95–108.
Zhang, J. (2018). Neutral Theory and Phenotypic Evolution. Mol. Biol. Evol. 35, 1327–1331.
Zhang, C., Zhang, B., Lin, L.-L., and Zhao, S. (2017). Evaluation and comparison of computational tools for RNA-seq isoform quantification. BMC Genomics 18, 583.
Zhang, H., Jain, C., and Aluru, S. (2019). A comprehensive evaluation of long read error correction methods. BioRxiv 519330.
Zhang, X., Chen, M.H., Wu, X., Kodani, A., Fan, J., Doan, R., Ozawa, M., Ma, J., Yoshida, N., Reiter, J.F., et al. (2016). Cell-Type-Specific Alternative Splicing Governs Cell Fate in the Developing Cerebral Cortex. Cell 166, 1147-1162.e15.
Zhang, Z., Xin, D., Wang, P., Zhou, L., Hu, L., Kong, X., and Hurst, L.D. (2009). Noisy splicing, more than expression regulation, explains why some exons are subject to nonsense-mediated mRNA decay. BMC Biol. 7, 23.
Zhao, S., and Zhang, B. (2015). A comprehensive evaluation of ensembl, RefSeq, and UCSC annotations in the context of RNA-seq read mapping and gene quantification. BMC Genomics 16, 97.

Appendices

Appendix A  Supplementary material for Chapter 2

A.1 Curation standards (Filling in master spreadsheet)

1. Double check the Master Spreadsheet to ensure that the paper you have curated has not already been curated.
2. If the paper is a review article, fill in all columns with NA, with the exception of study type and PubMed ID. Fill in PubMed ID with the study’s PubMed ID and study type with “review article”.
3. Note that even if the species is not one of interest (i.e., not human or mouse), we still curate the experiment.
4. Identify the investigated gene(s). Fill out the gene column with the NCBI gene name and the PMID column with the article’s PubMed ID.
   a. If multiple genes are investigated, then use one row per gene.
   b. If a paper does not have a PMID, enter the citation for the paper.
   c. The gene name should be written to the standards of the organism in which the isoforms are endogenously expressed.
5. Identify the number of splice isoforms for the investigated gene. Identify the names of the splice isoforms if possible.
   a. This is reported by the authors but may be different from the number of splice isoforms they actually test in the experiment.
   b. Fill out the “# of splice isoforms” column and the “isoform name” column.
6. Identify whether the paper is actually about identifying the function of the gene’s endogenously expressed splice isoforms.
   a. Splicing papers in the pipeline are sometimes not about the function of the gene’s splice isoforms, but about the regulation of the alternative splicing of that gene.
      i. This can be further confusing if the investigated gene is a splicing regulator. Sometimes, these splicing regulators have splice isoforms whose function is being investigated.
         1. If the study is about the regulation of the splice isoforms and not their function, then make a note of this in the “Evidence of Functionality” column. Fill out the remaining columns as NA. Stop curating the study.
   b. Endogenously expressed isoforms are the ones that are expressed in a healthy wildtype population.
      i. Splice isoforms expressed solely in a disease condition are not what we are looking for. E.g., in cancer we observe the expression of novel splice isoforms, and we do not report on this.
         1. If the study is about disease-associated splice isoforms, make a note of this in the “Evidence of Functionality” column and mark the “Isoform causes a disease” column as yes. Fill out the remaining columns as NA and stop curating. Annotate the “study type” column as “disease association”.
      ii. This can be further complicated if the upregulation of an endogenously expressed splice isoform causes a disease. Certain studies identify a splice isoform that is necessary for the overall function of the gene at the wildtype expression level, but at an upregulated level there is a disease phenotype. We are still interested in these splice isoforms.
         1. Another source of confusion is that a lowly expressed wildtype splice isoform can often be slightly detrimental but have no effect on the cell because it is lowly expressed. When the splice isoform is upregulated, or in an overexpression study, a disease is caused because the splice isoform is now abundant.
7. Determine which splice isoforms the authors use for their experiments and whether or not the authors are claiming the splice isoforms are functional.
   a. Fill out the “# of function isoforms” column.
8. Determine the organism the splice isoform has been studied in and fill out the “Organism” column.
   a. This is generally the organism where the splice isoform is endogenously expressed.
      i. However, studies will occasionally test their splice isoforms in multiple organisms.
      ii. Add this to the organism column (e.g. mouse, rat, ferret).
9. In the ‘study types’ column, fill out the experiment that was performed to investigate the splice isoforms. Common experiments we have seen are: knockdown, knockdown (1 isoform), knockdown (non-isoform-specific), knockout, knockout (1 isoform), knockout (non-isoform-specific), rescue, rescue (1-isoform), rescue (non-isoform-specific), overexpression, tissue distribution, subcellular localization, activity assay, protein interaction, disease association, regulation of isoform expression, detection, structural characterization, mutation isoform absence, immunodepletion, timecourse distribution.
   a. If multiple experiments were used, then annotate the column with all experiments. The “main” experiment should be listed first.
      i. Usually, if the experiments depleted the expression of a splice isoform, then we want that experiment listed first.
         1. E.g., if an experiment investigated isoform tissue distribution and knockdown, then the column should be annotated with “knockdown, tissue distribution”.
   b. If any of the experiments eliminated the expression of a single isoform and looked for an effect, then mark the ‘Only evidence is presence’ column as no. These are the studies we are most interested in.
      i. Be careful with rescue experiments when more than two splice isoforms are investigated. When only two isoforms are involved in the isoform-specific rescues, then using only one splice isoform to rescue a phenotype is fine.
         1. Rescued phenotypes can be redundant or non-redundant. Redundancy is allowed for rescue experiments because the phenotype might be quantitative.
      ii. If more than one splice isoform is shown to be necessary (i.e., the absence of each splice isoform causes a phenotype), then mark this gene as a gold standard gene in the “Gold Standard Gene” column.
   c. If the splice isoforms have different functions from each other or have the same function as each other, then fill out the appropriate column to reflect that (“same function” column or “different function” column).
10. Fill out the “Evidence of Functionality” column with a short, concise description of the study in the present tense. This description will often explain how the isoforms were molecularly characterized and what was tested. If the paper does provide evidence of a “Gold Standard Gene”, please include in the description which figure(s) best show this evidence.
11. Fill out the “sequence accession” column with any sequence accession information provided for the investigated splice isoforms.
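The annotation rules in steps 9a, 9b and 9b.ii can be sketched in code. This is only an illustrative sketch, not the tooling used for the curation: the class and function names, the choice of which study-type labels count as depletion, the "yes" default for the 'Only evidence is presence' column, and the example gene and PMID are all assumptions made here for illustration.

```python
from dataclasses import dataclass, field

# Assumption: study-type labels that count as depleting isoform expression.
DEPLETION_PREFIXES = ("knockdown", "knockout")
# Assumption: labels that eliminate the expression of exactly one isoform.
SINGLE_ISOFORM_DEPLETION = {"knockdown (1 isoform)", "knockout (1 isoform)"}


def order_study_types(study_types):
    """Step 9a: list depletion experiments first; sorted() is stable, so the
    remaining experiments keep their original relative order."""
    return sorted(study_types, key=lambda s: not s.startswith(DEPLETION_PREFIXES))


def only_evidence_is_presence(study_types):
    """Step 9b: 'no' if any experiment removed a single isoform.
    Returning 'yes' otherwise is an assumption of this sketch."""
    hit = any(s in SINGLE_ISOFORM_DEPLETION for s in study_types)
    return "no" if hit else "yes"


def is_gold_standard(isoforms_shown_necessary):
    """Step 9b.ii: more than one isoform whose absence causes a phenotype."""
    return len(set(isoforms_shown_necessary)) > 1


@dataclass
class CurationRow:
    """One spreadsheet row (hypothetical structure: one gene in one paper)."""
    gene: str
    pmid: str
    study_types: list
    isoforms_shown_necessary: list = field(default_factory=list)

    def annotations(self):
        ordered = order_study_types(self.study_types)
        gold = is_gold_standard(self.isoforms_shown_necessary)
        return {
            "study type": ", ".join(ordered),
            "Only evidence is presence": only_evidence_is_presence(ordered),
            "Gold Standard Gene": "yes" if gold else "no",
        }


# Hypothetical example row; the gene name and PMID are placeholders.
row = CurationRow(
    gene="GeneX",
    pmid="00000000",
    study_types=["tissue distribution", "knockdown (1 isoform)"],
    isoforms_shown_necessary=["isoform 1", "isoform 2"],
)
print(row.annotations())
```

For the placeholder row, the knockdown is moved ahead of tissue distribution, matching the ordering example in step 9a. The rescue caveats in step 9b.i are deliberately left out, since they require curator judgment.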
A.2 Additional Files

Additional files are available on request from the authors or in the publication (Bhuiyan et al., 2018).

Additional File A1 – Supplementary_data.pdf: Contains explanations of each heading found in Additional File A2, the standards curators used to evaluate studies for evidence of functionally distinct splice isoforms, three genes where literature-reported splice isoforms were not found in Ensembl, supplemental tables and figures, and citations for all human and mouse literature curated for this study.

Additional File A2 – Master_curation_spreadsheet.xlsx: Contains a list of all human and mouse literature we curated for this study, and the curators’ annotations for each study.

Additional File A3 – Genes_with_functionally_distinct_splice_isoforms.xlsx: Contains the list of human and mouse genes with literature evidence for functionally distinct splice isoforms found in our curation.

Appendix B  Supplementary material for Chapter 3

B.1 Tables

| | Brain1 cDNA (ERR2680377/ERX2695238) | Brain2 cDNA (ERR3363658/ERX3387950) | Brain2 cDNA (ERR3363660/ERX3387952) | Brain1 cDNA-TL (ERR2844019/ERX2850744) |
|---|---|---|---|---|
| # Reads | 1,267,830 | 5,834,882 | 3,003,844 | 1,691,454 |
| # Bases | 1,304,518,778 | 7,028,355,083 | 3,279,758,852 | 1,312,184,503 |
| Mean length of reads | 1028.9 | 1204.5 | 1091.9 | 775.8 |
| Median length of reads | 858.0 | 982.0 | 894.0 | 684.0 |
| Read length at N50 | 1,283 | 1,749 | 1,591 | 896 |
| Mean read quality | 8.7 | 8.0 | 8.3 | 9.6 |
| Median read quality | 9.0 | 8.9 | 9.3 | 10.2 |
| Aligned reads | 1,004,822 | 4,391,586 | 2,207,214 | 1,301,790 |
| Unaligned reads | 263,008 | 1,443,296 | 796,630 | 389,664 |
| Number of reads assigned to isoforms in collapse | 633,120 | 2,997,680 | 1,488,200 | 1,064,600 |
| Number of reads removed after collapse | 371,702 | 1,393,906 | 719,014 | 237,190 |

Table B.1 Summary of cDNA brain data from Sessegolo et al., 2019

| | Brain1 RNA (ERR2680375/ERX2695236) | Brain2 RNA (ERR3363657/ERX3387949) | Brain2 RNA (ERR3363659/ERX3387951) |
|---|---|---|---|
| # Reads | 571,098 | 364,041 | 210,654 |
| # Bases | 432,921,136 | 375,725,839 | 201,631,156 |
| Mean length of reads | 758.1 | 1032.1 | 957.2 |
| Median length of reads | 551.0 | 793.0 | 736.0 |
| Read length at N50 | 1,357 | 1,492 | 1,417 |
| Mean read quality | 6.6 | 8.8 | 8.8 |
| Median read quality | 7.1 | 9.1 | 9.1 |
| Aligned reads | 278,210 | 256,013 | 143,051 |
| Unaligned reads | 274,317 | 108,028 | 67,603 |
| Number of reads assigned to isoforms in collapse | 145,920 | 132,200 | 74,120 |
| Number of reads removed after collapse step | 132,290 | 123,813 | 68,931 |

Table B.2 Summary of RNA brain data from Sessegolo et al., 2019

| | Liver1 RNA (ERR2680379) | Liver1 cDNA TL (ERR2844020) |
|---|---|---|
| # Reads | 418,102 | 2,668,975 |
| # Bases | 344,223,164 | 2,641,896,941 |
| Mean length of reads | 823.3 | 989.9 |
| Median length of reads | 728.0 | 913 |
| Read length at N50 | 1,153 | 1,116 |
| Mean read quality | 7.0 | 10.3 |
| Median read quality | 7.5 | 10.9 |
| Aligned reads | 245,767 | 1,955,185 |
| Unaligned reads | 172,335 | 713,790 |
| Number of reads assigned to isoforms in collapse | 124,040 | 1,494,400 |
| Number of reads removed after collapse step | 121,727 | 460,785 |

Table B.3 Summary of liver data from Sessegolo et al., 2019

Appendix C  Supplementary material for Chapter 5

C.1 Tables

| Gene | Number of studies | Num of FDSIs with sequences/Num of FDSIs | Conservation (TBLASTN human-mouse) | Conservation (40-way Mammalian PhyloP) | Domain annotations (CDD or Pfam) |
|---|---|---|---|---|---|
| AR | 11 | 3/3 | 3/3 | Yes | 3/3 |
| BDNF | 1 | 1/3 | 1/1 | NA | 1/1 |
| BIRC5 | 2 | 2/2 | 2/2 | Yes | 2/2 |
| BOK | 1 | 2/2 | 2/2 | NA | 2/2 |
| Calca | 3 | 2/2 | 2/2 | Yes | 2/2 |
| CD44 | 2 | 2/2 | 2/2 | Yes | 2/2 |
| Cdc42 | 1 | 2/2 | 2/2 | Yes | 2/2 |
| CFLAR | 3 | 2/2 | 2/2 | Yes | 2/2 |
| CSPP1 | 1 | 2/2 | 2/2 | Yes | 0/2 |
| DPF3 | 1 | 2/2 | 2/2 | Yes | 2/2 |
| Enc1 | 1 | 1/2 | 1/1 | NA | 1/1 |
| Homer1 | 1 | 2/2 | 2/2 | Yes | 2/2 |
| Il1rap | 1 | 2/2 | 2/2 | Yes | 2/2 |
| KLF6 | 1 | 2/2 | 2/2 | NA | 2/2 |
| Lpin1 | 1 | 2/2 | 2/2 | No | 2/2 |
| Lrp8 | 1 | 1/2 | 1/1 | NA | 1/1 |
| Mecp2 | 1 | 2/2 | 2/2 | Yes | 2/2 |
| MST1R | 1 | 2/2 | 2/2 | Yes | 2/2 |
| Myh10 | 1 | 0/2 | 0/0 | NA | 0/0 |
| Nf1 | 1 | 2/2 | 2/2 | Yes | 2/2 |
| Opn4 | 1 | 2/2 | 2/2 | Yes | 2/2 |
| Oprm1 | 1 | 3/3 | 1/3 | NA | 1/3 |
| PGAM5 | 1 | 2/2 | 2/2 | Yes | 2/2 |
| PML | 2 | 2/2 | 2/2 | No | 2/2 |
| PRMT5 | 1 | 2/2 | 2/2 | Yes | 2/2 |
| Rbfox1 | 5 | 2/2 | 2/2 | Yes | 2/2 |
| Robo3 | 1 | 2/2 | 2/2 | Yes | 2/2 |
| Rock2 | 1 | 2/2 | 2/2 | Yes | 2/2 |
| Ryr3 | 1 | 2/2 | 2/2 | No | 2/2 |
| Sirt3 | 1 | 2/2 | 2/2 | NA | 2/2 |
| Snap25 | 1 | 2/2 | 2/2 | Yes | 2/2 |
| STIM2 | 1 | 2/2 | 2/2 | Yes | 2/2 |
| SUN1 | 1 | 3/3 | 3/3 | NA | 3/3 |
| TICAM1 | 1 | 2/2 | 2/2 | NA | 2/2 |
| TICAM2 | 1 | 2/2 | 2/2 | NA | 2/2 |
| TP63 | 3 | 2/2 | 2/2 | Yes | 2/2 |
| Trp63 | 1 | 2/2 | 2/2 | Yes | 2/2 |
| HBS1L | 1 | 2/2 | 2/2 | Yes | 2/2 |
| BCAR1 | 1 | 3/3 | 2/2 | NA | 3/3 |
| EIF4G1 | 1 | 2/2 | 2/2 | Yes | 2/2 |
| EIF4G2 | 1 | 0/3 | 0/0 | NA | 0/0 |
| Cacna1b | 1 | 0/2 | 0/0 | NA | 0/0 |
| MADD | 1 | 2/2 | 2/2 | Yes | 2/2 |

Table D.2 Metrics to assess genes with literature evidence for FDSIs

Here I present an enumeration of factors to assess the genes with literature evidence of FDSIs (Chapter 2). “Number of studies” shows the number of studies that provide evidence that the gene has FDSIs. “Num of FDSIs with sequences/Num of FDSIs” shows the number of FDSIs for which the authors provide the sequence versus the number of FDSIs established by the studies. “Conservation (TBLASTN human-mouse)” shows the number of FDSIs with sequence data available that were conserved. Human FDSIs were BLASTed against the Ensembl mouse database, and mouse FDSIs were BLASTed against the Ensembl human database. “Conservation (40-way Mammalian PhyloP)” indicates whether at least one discriminating exon has a PhyloP score of 1 based on 40-species mammalian alignments on the UCSC Genome Browser. NA in this column indicates that genomic coordinates were missing for at least one FDSI. This was because an Ensembl transcript ID was not retrievable for the FDSI (see Chapter 2). “Domain annotations (CDD or Pfam)” shows the number of FDSIs with sequences that have an annotatable domain, either through the Conserved Domain Database (CDD) or Pfam.
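The rule behind the “Conservation (40-way Mammalian PhyloP)” column can be written out as a small function. This is a hedged sketch rather than the code used in the thesis: the function name, the input layout (one list of per-exon scores per FDSI, with None standing for missing genomic coordinates), the example scores, and the reading of “a PhyloP score of 1” as “at least 1” are all assumptions made for illustration.

```python
def phylop_call(exon_scores_by_fdsi, threshold=1.0):
    """Return 'Yes', 'No' or 'NA' for one gene.

    exon_scores_by_fdsi maps an FDSI name to a list of PhyloP scores for its
    discriminating exons, or to None when genomic coordinates are missing
    (hypothetical layout). The threshold comparison (>=) is an assumption.
    """
    # NA when coordinates are missing for at least one FDSI.
    if any(scores is None for scores in exon_scores_by_fdsi.values()):
        return "NA"
    # Yes when at least one discriminating exon reaches the threshold.
    conserved = any(
        score >= threshold
        for scores in exon_scores_by_fdsi.values()
        for score in scores
    )
    return "Yes" if conserved else "No"


# Made-up scores, for illustration only.
print(phylop_call({"isoform 1": [0.2, 1.3], "isoform 2": [0.1]}))
print(phylop_call({"isoform 1": [0.2], "isoform 2": None}))
```

Checking for missing coordinates before looking at scores mirrors the description above, where NA is assigned whenever any FDSI of the gene lacks a retrievable Ensembl transcript ID, even if another FDSI has a conserved exon.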
