A Systems Biology Study of Alternative Splicing Regulations and Functions

by

Seyed Alborz Mazloomian

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY
in
The Faculty of Graduate and Postdoctoral Studies
(Bioinformatics)

The University of British Columbia
(Vancouver)

June 2017

© Seyed Alborz Mazloomian, 2017

Abstract

Alternative splicing is highly appreciated as a major contributor to cellular complexity, and its dysregulation has been associated with several diseases. Despite being the focus of numerous studies in recent years, much remains unknown about the functions and regulation of alternative splicing in mammalian systems. Here, I take a systems biology approach to study alternative splicing using high-throughput sequencing data.

In Chapter 2, I use tissue-specific high-throughput libraries of Drosophila melanogaster to explore the potential inter-relation of RNA editing and alternative splicing. I first develop a pipeline to accurately detect editing events. Next, I find regions where editing and splicing are likely to influence each other, and report conserved RNA structures that can mediate the inter-relation.

In Chapter 3, I study functions of Cyclin dependent kinase 12 (CDK12) using human cell line data. I show that CDK12 influences the differential usage of alternative last exons. Additionally, the results demonstrate that CDK12 modulates the expression of DNA damage response genes, and increases the tumorigenicity of breast cancer cells by down-regulating the long isoform of the DNAJB6 gene.

Finally, in Chapter 4, I first present a review of methods that search for underlying mechanisms explaining variations between high-throughput measurements of two biological conditions.
Next, I introduce our RNA-seq data derived from progressively inhibiting splicing-related proteins at multiple concentrations of pharmaceuticals, and I discuss how the reviewed methods should be adopted to benefit most from our type of data.

Our systems biology research provides new insights into how the studied components of the splicing machinery contribute to splicing functions and regulations, and these findings can help to improve our understanding of related diseases.

Lay Summary

Genes contain information that determines what cells should do at specific times, and they are sometimes referred to as the blueprint for life. Alternative splicing is a mechanism through which multiple products are generated from a single gene, and these products (e.g. proteins) can have different functions; therefore, the mechanism expands the capacity of genes. Disruption in alternative splicing has been associated with many genetic diseases. However, the mechanism is not fully understood. Fortunately, recent advances in technology have brought new opportunities to better investigate this mechanism. In this thesis, I study how the alternative splicing mechanism and its functions are regulated by some genes and cellular machineries using data generated by new sequencing technologies. The selected genes and mechanisms are known to play important roles in human diseases such as cancers. Our findings can help to improve our understanding of the alternative splicing mechanism and related diseases.

Preface

A version of Chapter 2 has been published in the journal RNA Biology [1]. I performed the analysis and wrote the manuscript under the supervision of Professor Irmtraud Meyer.

A version of Chapter 3 has been accepted for publication in the journal Nucleic Acids Research [2]. I am a co-first author of the paper with Jerry F. Tien. This chapter was supervised by Professor Shah and Professor Morin. I performed the computational analysis of the data, made figures and wrote the manuscript. Jerry F.
Tien designed and performed experiments, analyzed data, made figures and wrote the manuscript. S.-W. Grace Cheng designed and performed experiments. Ali Bashashati and James Xu performed data analysis. Christopher S. Hughes, Christalle C.T. Chow, Leanna T. Canapi, Arusha Oloumi, Genny Trigo-Gonzalez, Vicky C.-D. Chang and Stella S. Chun performed experiments. Professor Samuel Aparicio assisted in research design and data interpretation. Professor Gregg Morin conceived the project, designed experiments, and wrote the manuscript. Professor Shah supervised the computational analysis and assisted in research design and data interpretation.

Chapter 4 has not been submitted for publication yet. The project was supervised by Professor Sohrab Shah and Professor Samuel Aparicio. I performed the analysis, made figures and wrote the chapter. All the small compounds used for inhibiting the studied proteins affecting alternative splicing were developed by Takeda Pharmaceutical Company Limited.

[1] Mazloomian, A. & Meyer, I.M., 2015. Genome-wide identification and characterization of tissue-specific RNA editing events in D. melanogaster and their potential role in regulating alternative splicing. RNA Biology 12 (12): 1391-1401.

[2] Tien, J.F.*, Mazloomian, A.*, Cheng, S., Hughes, C.S., Chow, C., Canapi, L.T., Oloumi, A., Trigo-Gonzalez, G., Bashashati, A., Xu, J. et al., 2017. CDK12 regulates alternative last exon mRNA splicing and promotes breast cancer cell invasion. To appear in Nucleic Acids Research. (*: co-first authors).

Table of Contents

Abstract
Lay Summary
Preface
Table of Contents
List of Tables
List of Figures
Acknowledgments
Dedication
1 Introduction
    1.1 Alternative splicing
        1.1.1 Splicing mechanism
        1.1.2 Functions and regulations of alternative splicing
        1.1.3 Computational identification of alternative splicing using RNA-seq data
    1.2 RNA editing by ADAR proteins
        1.2.1 Mechanism and abundance of A-to-I RNA editing
        1.2.2 Functions of RNA Editing
        1.2.3 Computational detection of RNA editing
    1.3 Phosphorylation by Cyclin Dependent Kinase 12 (CDK12)
        1.3.1 CDK12 is a protein kinase
        1.3.2 Functions of CDK12
    1.4 Research contributions
2 Genome-wide Identification and Characterisation of Tissue-specific RNA Editing Events in Drosophila melanogaster and their Potential Role in Regulating Alternative Splicing
    2.1 Introduction
    2.2 Materials and methods
        2.2.1 Data set
        2.2.2 Prediction pipeline
        2.2.3 Finding alternatively spliced exons
    2.3 Results
        2.3.1 Our pipeline accurately distinguishes genuine editing sites from SNPs, and sequencing and mapping artifacts
        2.3.2 Characterisation of identified RNA editing sites
        2.3.3 Evidence for cross-regulation of RNA editing and alternative splicing and the potential underlying regulatory mechanism
    2.4 Discussion
3 The Regulation of Alternative Last Exon Splicing by CDK12 Promotes the Oncogenic Potential of Breast Cancer Cells
    3.1 Introduction
    3.2 Materials and methods
        3.2.1 Data
        3.2.2 Differential gene expression and alternative splicing analysis
        3.2.3 TCGA data analysis
        3.2.4 Motif analysis
    3.3 Results
        3.3.1 CDK12 regulates alternative last exon splicing of genes with long transcripts and many exons
        3.3.2 Tumors defective in CDK12 function exhibit mis-regulation of ALE splicing
        3.3.3 Regulation of gene expression by CDK12 is gene- and cell type-specific but modulates a core set of common pathways
        3.3.4 CDK12 can modulate the expression of DNA damage response genes in SK-BR-3 cells through alternative splicing
        3.3.5 CDK12 down-regulates the long isoform of DNAJB6 and increases the tumorigenicity of breast cancer cells
    3.4 Discussion
4 Investigating Cellular Responses upon Inhibiting Components of Splicing Machinery
    4.1 Introduction
    4.2 Identifying Pathways and genes contributing most to cellular responses: A short review
    4.3 Analyzing gene expression and splicing through inhibiting splicing components at multiple levels: preliminary results
        4.3.1 Materials and methods
        4.3.2 Results
    4.4 Discussion
5 Conclusion
Bibliography
A Supporting Materials for Chapter 2
    A.1 Details of the proposed pipeline
    A.2 Editing events within or in close vicinity of alternatively spliced exonic regions
    A.3 Genomic regions with evidence for the inter-relation of RNA editing and alternative splicing
B Supporting Materials for Chapter 3
    B.1 Selected TCGA ovarian serous cystadenocarcinoma samples
    B.2 qRT-PCR validation of identified ALE splicing events
    B.3 Proteomics analysis of SK-BR-3 after CDK12 depletion
    B.4 Up-regulation of cell proliferation pathways in MDA-MB-231 cells by CDK12

List of Tables

Table 1.1 Summary of methods proposed to identify editing events
Table 2.1 Tissue specific data sets selected from the MODENCODE project
Table A.1 Alternatively spliced exonic parts for which we found editing events in close vicinity
Table A.2 Genomic regions with evidence for the inter-relation of RNA editing and alternative splicing
Table B.1 Ovarian serous cystadenocarcinoma samples selected from TCGA

List of Figures

Figure 1.1 Transesterification steps in the splicing mechanism
Figure 1.2 Different classes of alternative splicing
Figure 1.3 An example of alternative splicing regulation
Figure 1.4 A-to-I mechanism by ADAR proteins
Figure 1.5 Organization of domains in ADAR proteins
Figure 1.6 An example of RNA editing in a human pre-mRNA molecule
Figure 1.7 A diagram of my research presented in this dissertation
Figure 2.1 Outline of the computational analysis pipeline
Figure 2.2 Types of identified conversions
Figure 2.3 Characterisation of the identified editing sites
Figure 2.4 Number of conversion types for four tissues
Figure 2.5 Comparing the editing mechanism in different tissues
Figure 2.7 An example of a region where RNA editing and alternative splicing may affect each other
Figure 2.8 An example of a structure that can mediate the influence of editing on splicing
Figure 3.1 CDK12 regulates alternative last exon (ALE) splicing
Figure 3.2 ALE regulation by CDK12 is cell type-specific
Figure 3.3 Regulation of ALE splicing is a universal function of CDK12
Figure 3.4 CDK12 regulates ALE splicing of genes with long transcripts and a large number of exons
Figure 3.5 CDK12 interacts with the RNA splicing machinery
Figure 3.6 The 3'UTR of ALEs regulated by CDK12 do not feature unique patterns of polyadenylation motifs
Figure 3.7 Alterations in CDK12 correlate with mis-regulation of ALE splicing in ovarian tumor samples
Figure 3.8 CDK12 differentially regulates gene expression in a cell type-specific manner
Figure 3.9 CDK12 regulates the expression of a core set of genes and pathways
Figure 3.10 Differential protein expression due to CDK12 regulation
Figure 3.11 CDK12 regulates the expression of full-length ATM
Figure 3.12 CDK12 down-regulates the long isoform of DNAJB6 through ALE splicing
Figure 4.1 Methods proposed to perform mechanistic inference using high-throughput data
Figure 4.2 Our systematic approach to study proteins via gradual inhibition
Figure 4.3 Splicing response patterns upon increasing inhibitor levels
Figure 4.4 The overlap of the splicing events detected in the inhibitor and siRNA experiments
Figure 4.5 Clustering of expression response patterns upon inhibiting EIF4A3
Figure 4.6 GO enrichment analysis for gene clusters
Figure 4.7 Clustering of NMD isoforms response patterns upon inhibiting EIF4A3
Figure 4.8 An auto-regressive hidden Markov model proposed to analyze ordered data
Figure B.1 qRT-PCR analysis showing regulation of alternative splicing specifically by CDK12
Figure B.2 Proteomic analysis of SK-BR-3 after CDK12 depletion
Figure B.3 CDK12 up-regulates cell proliferation pathways in MDA-MB-231 triple-negative breast cancer cells

Acknowledgments

I would like to express my deep gratitude to my supervisors, Professor Irmtraud Meyer and Professor Sohrab Shah, for their guidance and encouragement. I would like to thank my supervisory committee, Professor Wyeth Wasserman, Professor Steven Jones, and Professor Samuel Aparicio, for their help and feedback during my PhD studies. I am also very thankful for my collaborations with Professor Gregg Morin and Professor Samuel Aparicio.

I extend my sincere thanks to Meyer Lab members, Shah Lab members, Morin Lab members, and all my other friends for their discussions and for creating a pleasant environment.

I would like to thank the University of British Columbia and the Bioinformatics Training Program for their generous four-year fellowship funding.

Finally, my heartfelt thanks go to my parents and my three brothers for all their support, encouragement, and love. Specific to my PhD research, I would like to thank Amin for all his great help and mentorship.

Dedicated to:
Baba & Maman

Chapter 1

Introduction

Complexity of molecular responses is created through the interplay between biological mechanisms. Hundreds and thousands of genes, RNA molecules and proteins communicate through signalling pathways to provide appropriate responses to environmental stimuli.
Disruption in any of these cellular processes, including transcription, translation, DNA repair, cell division and cell adhesion, can initiate disease development. Consequently, investigating the inter-relations and interactions between these mechanisms is one key step towards deciphering their regulations and functions.

Alternative splicing, a mechanism through which multiple products are generated from a single gene, is highly appreciated as a major contributor to cellular complexity [3]. Many disease mutations have been associated with mis-regulation of alternative splicing [4, 5]. As a result, this process has become a critical topic for thorough research. Fortunately, recent advances in technology [6, 7] have brought new opportunities to better investigate alternative splicing and related mechanisms such as transcription, RNA editing, and poly-adenylation.

In this thesis, I took a systems biology approach to study the inter-relation between alternative splicing and two other mechanisms: RNA editing, and phosphorylation of splicing-related proteins by Cyclin Dependent Kinase 12 (CDK12). I also studied the global consequences of intervening in the splicing machinery. From a systems biology perspective, understanding the properties of a system requires simultaneously modelling its components and integrating the results of multiple types of experiments. The modelling can benefit from experiments in different biological backgrounds, where different conditions are screened, or from intervening in the system and knocking down parts of it to monitor the influence on the other parts.

I start by studying the inter-relation of RNA editing and alternative splicing in a model organism, Drosophila melanogaster, using tissue-specific data sets. Model organisms have been broadly studied for understanding biological machineries and developing therapeutics [8–10].
Fewer repeat regions, fewer overlapping genes, and a smaller average number of transcripts per gene in Drosophila melanogaster compared to human [11, 12] make the computational study of editing and splicing in Drosophila melanogaster easier and less error-prone. Thus, D. melanogaster offers a great opportunity to study the inter-relation of these two mechanisms in a context less complex than human, and its interpreted results provide a test bed to explore various hypotheses.

Furthermore, I study the influence of CDK12 on the regulation of alternative splicing through inhibiting CDK12 in human cell lines. I use data sets where CDK12 expression was manipulated to quantify and model its influence on global RNA processing. Moreover, I show how properly gathering information from multiple cell lines leads to an understanding of the main functions of a protein. Human cell lines provide closer estimates of the rules governing cellular responses in human and are broadly used to carefully investigate findings [13, 14].

Finally, I investigate how inhibiting components of the splicing machinery impacts cellular responses when the inhibition level is gradually increased. With the development of pharmacologic agents, there is the opportunity to systematically interfere with the spliceosome components to inhibit their functions in human.
By progressively increasing drug levels, measuring responses and studying response curves, one can develop more accurate assumptions regarding the primary and secondary effects of disrupted components.

In parallel to my focus on the main topic of this thesis, how splicing relates to its components and other mechanisms, the thesis is designed to cover the different steps that should be taken to understand a mechanism and target it in relevant diseases.

1.1 Alternative splicing

In this section, I will briefly present the current knowledge of the splicing mechanism, the functions and regulations of alternative splicing, and the computational approaches developed to study alternative splicing.

1.1.1 Splicing mechanism

Splicing is a mechanism responsible for removing introns from a pre-mRNA molecule and merging exons together [15, 16]. The process is carried out by spliceosomes in the nucleus, where the spliceosome can cooperate and couple with other RNA processing machineries such as transcription [17]. In human, a typical exon is about one order of magnitude shorter than an average-size intron [18] (a few hundred and a few thousand nucleotides in a typical exon and intron, respectively). Thus, exon recognition remains non-trivial for spliceosomes.

Spliceosomes employ the information in conserved regulatory sequence motifs to accomplish RNA splicing [19]. These complex macromolecular machines identify intron boundaries with the help of the 5' and 3' conserved sequences [19]. More specifically, there exists a highly conserved GU di-nucleotide at the 5' end (splice donor site) and a conserved AG di-nucleotide at the 3' end (splice acceptor site) of introns. Other conserved, informative sequences in the primary sequence include the branch point, located close to the acceptor site, followed by a pyrimidine-rich region [18–20].
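To make the donor and acceptor signals concrete, here is a minimal Python sketch, not part of the analyses in this thesis, that checks whether an intron sequence starts with GU and ends with AG. The function name `has_canonical_splice_sites` and the toy sequences are illustrative assumptions; real splice-site recognition also depends on the branch point, the pyrimidine-rich region and many trans-acting factors.

```python
def has_canonical_splice_sites(intron: str) -> bool:
    """Return True if an intron starts with GU and ends with AG.

    Accepts RNA (U) or DNA (T) letters in either case. Only the two
    terminal dinucleotides are tested; this is not a splice-site predictor.
    """
    rna = intron.upper().replace("T", "U")
    if len(rna) < 4:  # too short to hold both dinucleotides
        return False
    return rna.startswith("GU") and rna.endswith("AG")


# Toy introns (hypothetical sequences, for illustration only).
print(has_canonical_splice_sites("GUAAGUCUAACUUUUCCCUUCAG"))  # True
print(has_canonical_splice_sites("GTAAGTCTAACTTTTCCCTTCAG"))  # True (DNA letters)
print(has_canonical_splice_sites("AUAAGUCUAACUUUUCCCUUCAC"))  # False
```
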
Mutations in these conserved sequences can change open reading frames and result in degradation of transcripts, or in the production of incorrect amino acids and non-functional proteins.

[Figure 1.1 image: panels (A) and (B); labels include exon 1, exon 2, intron, GU, AG, branch point A, U1 snRNP, U2 snRNP, U4, U5, U6, first step, second step, mRNA, lariat intron, lariat intermediate.]
Figure 1.1: The two transesterification steps of the splicing mechanism. (A) Before the catalytic reactions of splicing, snRNA molecules in the spliceosome interact with the pre-mRNA molecule. These interactions, followed by conformational changes in the splicing machinery, initiate splicing. (B) The transesterification steps carried out by the spliceosome. The first step reactions are shown in red, and the second step reactions in blue. Conserved consensus motifs are shown in green, and the circled P's represent phosphates. Figure modified from [21] and [22].

Through detection of conserved sequence motifs by small nuclear ribonucleoproteins (snRNPs) of the spliceosomal machinery, two transesterification steps are carried out [21]. In the first transesterification step (Figure 1.1A), RNA molecules in snRNPs interact with and detect conserved motifs to trigger the transesterification steps. Once the region is identified, the 2' hydroxyl of the branch point adenine nucleotide in the intron attacks the 5' splice site and cuts the sugar phosphate backbone of the pre-mRNA molecule (Figure 1.1B). Subsequently, the 5' end of the intron covalently bonds to the adenine nucleotide and forms a lariat structure. In the second transesterification step, the spliceosome undergoes a conformational rearrangement to bring the exons together, and guides the 3' hydroxyl group of the detached exon to react with the 5' end of the other exon.
Finally, the two exons are merged into a continuous sequence and the lariat is released and degraded [21].

The spliceosome is composed of five snRNPs (U1, U2, U4, U5, and U6) and hundreds of other protein components [21, 23]. The two transesterification reactions required for the splicing mechanism cannot completely explain the necessity for such a complicated machinery. Some of the spliceosomal proteins are required to avoid making defective mRNAs, and others link splicing to transcription or to post-splicing events such as mRNA transport [24].

1.1.2 Functions and regulations of alternative splicing

Alternative splicing (AS), a process by which multiple transcripts are produced from a single pre-mRNA molecule, is one major cause of cellular complexity [25]. Splicing patterns are determined by cell type, developmental stage, or external stimuli [26]. Moreover, studying the alternatively spliced genes reveals that the process is most important where differential processing is critical and a high level of diversity is required, especially in the brain [27, 28]. Brain-specific AS events play crucial roles in neuronal differentiation and development, regulating protein-protein interactions, and regulating transcription networks [29, 30]. The accessibility of cis-regulatory sites and their interactions with trans-acting proteins can modify splice site selection. As a result, the final splicing product is not always uniquely defined, and spliceosome decisions dictate the final conformation when alternative junction choices are available. Based on the consequences of such decisions, AS events are classified into multiple types, as illustrated in Figure 1.2.

[Figure 1.2 image: one panel per class. Skipped exon; Retained intron; Alternative 3' splice site (A3SS); Alternative 5' splice site (A5SS); Mutually exclusive exon (MXE); Alternative first exon (AFE); Alternative last exon (ALE); Tandem 3' UTR.]
Figure 1.2: Types of alternative splicing defined by the alternative choices that the spliceosome can make.
The black boxes represent constitutive exons and the white boxes represent alternative regions whose inclusion depends on splicing choices. Only in the "Retained intron" type is an intron contained in the final product. The red solid lines and the blue dashed lines illustrate the two possible patterns of splicing for each class. Figure modified from [31].

Specific features of RNA regions qualify them as candidates for each AS type. For example, skipped exons are usually shorter than constitutive exons and are flanked by long intronic regions [32]. In addition, the number of nucleotides in a skipped exon is often a multiple of three, which prevents a change of reading frame and the introduction of a premature stop codon [32]. On the other hand, retained introns tend to be short and possess weak splicing signals around their junctions [33]. Finally, alternative 3' and 5' splice sites mainly evolved from constitutive exons through mutations that created competing splice sites [33].

The alternative splicing mechanism is evolutionarily conserved and is observed abundantly in multicellular eukaryotic organisms [34]; however, its prevalence increases in more behaviorally complex species such as humans [35]. The prevalence of alternatively spliced genes grows from ∼25% of the genes in C. elegans and ∼60% in Drosophila melanogaster to ∼95% of the genes in human [36]. Also, the relative abundance of splicing types changes among organisms. As an example, in lower metazoans intron retention is common, while the relative abundance of skipped exons grows for more complex species [37].

Because the AS mechanism is conserved, one promising way to evaluate related hypotheses is to use model organisms. In the second chapter of this thesis, we chose to investigate the regulation of alternative splicing in Drosophila melanogaster.
Drosophila melanogaster shares a large amount of its genetic content with human and has been broadly used to improve our understanding of many cellular mechanisms, including alternative splicing [38]. For example, Reiter et al. showed that ∼77% of human disease genes have statistically significant related sequences in Drosophila melanogaster [39]. According to FLYBASE [40] (release: May 24, 2016), the genome of Drosophila melanogaster contains ∼17,700 genes, of which ∼13,900 are protein coding. These 13,900 protein coding genes encode ∼30,400 protein coding isoforms in total (an average of ∼2.2 isoforms per gene), manifesting the potential of AS regulation in D. melanogaster. In particular, more than 40% of the genes are alternatively spliced, and there exists a set of highly complex genes (∼50 genes) each encoding over 1000 isoforms [41]. Similar to human, different patterns of splicing are detected in Drosophila melanogaster [38, 41]. Dscam (Down syndrome cell adhesion molecule) is an example of a gene displaying complex AS patterns. The gene contains 20 constitutive and 95 alternatively spliced exons [42]. Combinatorial assembly of the observed local splicing patterns in Dscam can potentially encode more than 38,000 isoforms, a number greater than the number of genes in the entire D. melanogaster genome. Dscam encodes an axon guidance receptor, and this huge level of complexity seems essential for its functional roles [42].

In addition to the interesting features of splicing in Drosophila melanogaster that resemble AS in human, a massive volume of publicly available data makes Drosophila melanogaster a promising candidate model to study alternative splicing. The UCSC genome browser [43, 44] provides a genome annotation of Drosophila melanogaster and an alignment of its genome to 14 other Drosophila species. This alignment makes it possible to exploit evolutionary information and comparative methods.
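The isoform figures quoted above for D. melanogaster can be checked with a few lines of arithmetic. The Python sketch below assumes the commonly reported organisation of the 95 alternative Dscam exons into four mutually exclusive clusters of 12, 48, 33 and 2 variants; this breakdown comes from the Dscam literature and is not stated in the text above. The gene and isoform counts are the approximate FLYBASE values quoted earlier.

```python
from math import prod

# Assumed cluster sizes for the alternative Dscam exons (not given in the text).
dscam_cluster_sizes = [12, 48, 33, 2]

alternative_exons = sum(dscam_cluster_sizes)    # exons available across clusters
possible_isoforms = prod(dscam_cluster_sizes)   # one exon chosen per cluster

# Approximate FLYBASE counts quoted in the text.
protein_coding_genes = 13_900
protein_coding_isoforms = 30_400
isoforms_per_gene = protein_coding_isoforms / protein_coding_genes

print(alternative_exons)            # 95
print(possible_isoforms)            # 38016
print(round(isoforms_per_gene, 1))  # 2.2
```

Under this assumption the product of the cluster sizes agrees with the "more than 38,000 isoforms" figure, and the sum agrees with the 95 alternatively spliced exons.
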
In addition, FLYBASE is a rich and growing source of information on gene expression, gene interactions, observed phenotypes, and genome features gathered from thousands of papers [40]. Additionally, the MODENCODE project [45] provides experimental data sets generated by different experimental pipelines, including RNA-seq and chromatin immunoprecipitation (ChIP). These experiments are replicated on different tissues and through various developmental stages, and the information can be used to study cellular behaviors in a condition-specific manner.

Apart from the cis-acting regulatory sites discussed, there are two other main classes of such primary sequence signals, known as splicing silencers and splicing enhancers [46, 47]. These sites and the corresponding trans-acting proteins tend to decrease (in the case of silencers) or increase (in the case of enhancers) the probability that a neighboring intron is spliced. These signals can be intronic or exonic (Exonic Splicing Silencers (ESS)/Enhancers (ESE), Intronic Splicing Silencers (ISS)/Enhancers (ISE)). The same regulatory motif can function as a silencer or an enhancer depending on its location in a pre-mRNA sequence. For instance, G triplets act as enhancers when they occur in introns, whereas in an exonic context they usually act as silencers [31].

Recent studies have significantly improved our understanding of the AS mechanism. Apart from the regulatory elements discussed, splicing is found to be modulated by other factors such as transcription rates, pre-mRNA structures, histone marks and nucleosome positioning [48]. Among these, transcription rates and pre-mRNA structures are of special interest in this thesis, as they are more closely related to the splicing components I investigate.

RNA splicing generally happens co-transcriptionally [49, 50]. Being co-transcriptional provides further opportunities to regulate the splicing mechanism.
For example, the C-terminal domain (CTD) of RNA polymerase II (RNA pol II) helps in the recruitment of splicing factors [50]. Additionally, when RNA pol II elongates slowly, weak 3’ splice sites acquire a higher chance of being properly processed without competing with stronger downstream 3’ splice sites [50]. Also, the transcription rate can influence the formation of alternative structures, which in turn can affect splicing, as discussed in the following [51].

Pre-mRNA structures act as additional regulators of alternative splicing [52]. Several mechanisms through which the formation of structure influences splicing patterns have been reported in the literature [53]. A considerable number of studies have reported that the substrate selection of RNA binding proteins depends not only on the primary sequence of a target, but also on the target structure [52], and this conformation dependence exists for many of the major SR proteins (proteins with RNA binding motifs important in AS regulation) as well. As a simple example, in the human MAPT gene (microtubule-associated protein tau), an RNA sequence that is detected by the spliceosome machinery (e.g. 5’ and 3’ splice sites) becomes inaccessible because of its base-pairings with other parts of the sequence. Meanwhile, the splicing machinery requires a well-suited distance between the splicing consensus motifs in order to carry out splicing in a certain way. Pre-mRNA structures tend to decrease the effective distance between the conserved motifs, thus resulting in a modified pattern of splicing. These findings are supported by further computational evidence as well. Pervouchine et al [54] searched for pairs of complementary sequences around splice sites that can form stable hairpin structures. By studying mammalian protein-coding genes, they identified hundreds of such pairs where the energy of the suggested structures could modify the pattern of splicing.
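As a rough illustration of the kind of search described for Pervouchine et al [54], the following Python sketch scans two sequence windows (for example, regions flanking a splice site) for segments that are fully complementary and could therefore pair into a stem. It is only a minimal sketch and not the published method: the function names, the minimum stem length, the decision to allow G-U wobble pairs and the toy sequences are all assumptions made for illustration, and no free energy is computed.

```python
# Watson-Crick pairs plus G-U wobble pairs (allowing wobble is an assumption of this sketch).
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def can_pair(left, right):
    """Return True if `left` can base-pair with `right` in antiparallel orientation."""
    if len(left) != len(right):
        return False
    return all((a, b) in PAIRS for a, b in zip(left, reversed(right)))


def find_stems(upstream, downstream, min_len=6):
    """Return (i, j, segment_up, segment_down) for every pair of windows of
    length `min_len` (one from each sequence) that are fully complementary."""
    hits = []
    for i in range(len(upstream) - min_len + 1):
        seg_up = upstream[i:i + min_len]
        for j in range(len(downstream) - min_len + 1):
            seg_down = downstream[j:j + min_len]
            if can_pair(seg_up, seg_down):
                hits.append((i, j, seg_up, seg_down))
    return hits


# Toy sequences (hypothetical, not taken from any real gene).
upstream = "AAGGCUCAGAA"
downstream = "UUCUGAGCCAA"
stems = find_stems(upstream, downstream, min_len=6)
```

A realistic search would additionally score each candidate stem with a thermodynamic model and test it for evolutionary conservation, which this sketch deliberately leaves out.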
In an earlier study, Meyer and Miklos [55] analyzed the alignment of 11 human genes to other vertebrates and found conserved double-stranded structures in coding regions. They showed that among codons encoding the same amino acid (due to the degeneracy of the genetic code) there is a selective pressure towards those leading to a more appropriate double-stranded structure. Also, in a case study, Meyer and Miklos [55] predicted secondary-structures of regions around exon 12 of the human CFTR (cystic fibrosis transmembrane conductance regulator) gene for the wild type sequence and also for sequences carrying synonymous mutations. They showed that the secondary-structures of sequences with high (experimentally evaluated) splicing efficiencies are more similar to each other than to those of sequences with low splicing efficiencies. These studies suggest a global regulation of alternative splicing by the formation of different structures; consequently, taking the secondary-structure of pre-mRNAs into consideration facilitates understanding splicing patterns [53].

A considerable number of human diseases have been linked to aberrant splicing events [56]. Genomic mutations can create or destroy splice sites or splicing enhancers and silencers, and in this way they sometimes alter splicing patterns [52, 56]. For example, the severity of spinal muscular atrophy is affected by the creation of an ESS (Exonic Splicing Silencer) [57]. Also, mutations in genes involved in the splicing mechanism have been linked to some human diseases such as retinitis pigmentosa [58]. After the causes of these and many other diseases were uncovered, splicing events became important therapeutic targets.

The importance of AS regulation can be illustrated by an example involving tau protein. A gene located on chromosome 17, MAPT, encodes tau, which is a microtubule-associated protein required for the polymerization and stability of axonal transport in neurons [26].
Through the alternative splicing of exons 2, 3 and 10, six protein isoforms are produced in the adult human brain. There exist three patterns of splicing which involve exons 2 and 3, and exon 10 is included or skipped independently [58]. Exon 10 and three other exons (9, 11, and 12) encode four microtubule-binding domains. Depending on the inclusion or exclusion of exon 10, the C-terminal region of the resulting protein can have 3 or 4 microtubule-binding domains. In a normal human brain, the abundance of isoforms including exon 10 is equal to that of isoforms where exon 10 is spliced out, and the ratio of these two sets of isoforms seems to be important for neuronal function [26, 58]. Furthermore, it has been shown that mutations in tau protein cause neurodegeneration accounting for frontotemporal dementia and parkinsonism. Further analysis of the pre-mRNA structure revealed that a stem-loop structure could be formed involving the 5’ exon/intron junction of exon 10, and that the accessibility of the splice site is regulated by the structural configuration of this region [53] (see Figure 1.3). Mutations that destabilize the helix structure enhance the accessibility of the region and result in an increase of the set of isoforms that include exon 10. Accordingly, therapeutic agents have been suggested to stabilize the stem-loop configuration.

Figure 1.3: Alteration of alternative splicing in a human disease (modified from [53]). A hairpin-loop structure plays an important role in regulating the inclusion of exon 10 in the human tau protein. (A) The U1 snRNA structure and its potential interaction with the exon-intron hairpin-loop structure, revealed by NMR studies. Exonic and intronic nucleotides of the hairpin-loop are shown by uppercase and lowercase letters, respectively. If the structure is formed, the interaction of the region and U1 snRNA (the dashed line) does not happen properly, and U1 snRNA cannot detect the region. (B) Mutations in the primary sequence disrupt the hairpin-loop structure and increase the recognition of the region by U1 snRNA. As a result, more transcripts will include exon 10 (the abundance of the isoforms is represented by the thickness of the red and blue lines in the two conditions); the condition leads to frontotemporal dementia and parkinsonism.

Genome-wide studies have broadened our understanding of the regulation and functions of alternative splicing; however, considering the diversity of cis-acting elements and the huge number of corresponding trans-acting factors, much is still unknown regarding position-dependent and context-dependent regulation and functions of alternative splicing [59]. Moreover, the role and importance of factors such as non-coding RNAs, dsRNAs and upstream pathways are being increasingly appreciated based on recent studies [60]; accordingly, the discovery of many new AS regulators is anticipated [61]. Finally, to get closer to understanding the splicing code, we need to investigate genes that are affected by specific splicing factors [62].

Despite the numerous studies on alternative splicing, some fundamental questions are yet to be answered. For example, it is still not clear what percentage of observed isoforms are essential for regulating cellular responses [49], or how abundant and functionally relevant the coupling of alternative splicing with other mechanisms such as transcription is [49]. In this thesis, I investigate some of these questions.

1.1.3 Computational identification of alternative splicing using RNA-seq data

Although RNA-seq data are remarkably helpful in improving our understanding of the AS mechanism, interpreting them properly is challenging.
Short reads are sometimes misaligned, especially when they originate from repetitive regions of genomes (e.g. more than 50% of the human genome consists of repetitive elements [63]) or when they harbor multiple sequencing errors. Besides, sequencing biases due to non-uniform sampling by sequencing machines should be appropriately addressed [64, 65]. Additionally, many isoforms share common parts, making the prediction of reads’ origins nontrivial. Understanding and modeling these and other potential issues helps avoid misleading conclusions.

The importance of the AS mechanism and the inherent complexity of studying AS using RNA-seq data have motivated the development of several computational methods [66–76]. Alamancos et al published a comprehensive review of the proposed methods along with their strengths and limitations [77]. Conceptually, these methods can be classified into two groups: the first group contains algorithms that model the problem at the isoform level and assess the differential usage of entire isoforms, and the second group contains methods that model AS at local regions (e.g. an exon or an intron) without considering the other parts of transcripts. Another criterion that differentiates methods is whether they are restricted to existing transcriptome annotations or detect de novo AS events as well.

Some of the methods performing differential splicing analysis at the isoform level are CUFFDIFF2 [71], BITSEQ [72] and MISO [73]. All three methods take aligned reads in addition to annotation files as input and provide information on the differential regulation of genes and isoforms. Apart from the distinct statistics used in these methods, there are some other clear distinctions as well. CUFFDIFF2 can detect de novo isoforms and AS events. Both BITSEQ and CUFFDIFF2 allow incorporating biological replicates, while MISO does not.
CUFFDIFF2 assigns a p-value to candidate events, MISO reports a Bayes factor, and BITSEQ uses a one-sided Bayesian test to rank genes based on their probability of being up- or down-regulated.

Event-based differential alternative splicing analysis includes methods such as MISO [73], DEXSEQ [74], DSGSEQ [75], and DIFFSPLICE [76]. MISO can be applied in both isoform-based and event-based analysis and is therefore placed in both categories. All methods accept aligned read files and identify differential regulation in local regions of transcripts. Among them, DIFFSPLICE is the only method that does not rely on annotation files. It uses the aligned reads to construct alternatively spliced modules (ASMs), which represent regions where transcripts diverge. Accordingly, it is able to identify complex splicing events. On the other hand, MISO is able to distinguish between 8 different pre-annotated types of splicing (those shown in Figure 1.2). In contrast, DEXSEQ and DSGSEQ are specialized for only one type of AS event, the skipped exon. All the methods except MISO incorporate information from multiple replicates. MISO assigns a Bayes factor to each of the identified events as a measure of confidence, DEXSEQ reports corrected p-values, DSGSEQ uses negative binomial statistics to rank AS candidates, and DIFFSPLICE outputs events under a given false discovery rate based on the test statistic it introduces.

Isoform-based methods model the AS problem in more detail and take into account all potential transcripts and all reads aligned to the genes under investigation. Therefore, properly solving these models can provide helpful clues on the global regulation of isoforms. However, sequencing biases with regard to non-uniform sampling, as well as shared regions among transcripts, complicate the inference problem.
In situations where genes give rise to many alternatively spliced isoforms, these methods encounter problems [78]. On the other hand, local event-based methods resolve this issue by only considering reads aligned to a small region of interest. However, these methods disregard information from many reads and are especially error prone when local events are short [78]. The appropriate method should be selected according to the research question, the type and amount of available data (e.g. read length and read depth), and the completeness of existing annotations for the species of interest.

1.2 RNA editing by ADAR proteins

In the second chapter of this thesis, I investigate the inter-relation between alternative splicing and RNA editing. Here, I briefly summarize what we already know about the editing mechanism.

1.2.1 Mechanism and abundance of A-to-I RNA editing

RNA editing is a widespread molecular mechanism which modifies transcripts [79]. The mechanism was first discovered in 1986 in trypanosomes, where nucleotide insertions cause reading frame shifts [80]. However, in mammals, the most frequent type of RNA editing is A-to-I conversion, carried out by ADAR (Adenosine Deaminase Acting on RNA) proteins through a deamination process (Figure 1.4). Most cellular mechanisms, including splicing and translation, interpret inosine as guanosine. Therefore, edited adenosines also appear as guanosines in RNA-seq data. Some cellular factors (e.g., Tudor staphylococcal nuclease involved in RNA interference), however, can distinguish inosine from guanosine (as shown in Xenopus laevis [81]).

Figure 1.4: A-to-I mechanism carried out by ADAR proteins. (A) Chemical process of deamination through which an adenosine is converted to an inosine (Part A from [82]). (B) ADARs target double-stranded structures in pre-mRNA molecules. Many of these structures are formed by base pairings between exons and flanking introns, and usually upon ADAR binding, multiple nucleotides are converted until the structure is destabilized and ADAR is released [83] (more details in the text).

Figure 1.5: Organization of domains in ADAR proteins (from [85]). Domains that are identified in ADAR family members are shown for three genes in the human genome (ADAR1 encodes two expressed isoforms), two genes in C. elegans and one gene in D. melanogaster. The deaminase domain and the dsRNA binding domains are common across species, whereas there are other domains specific to some of the genes.

Different ADAR genes show non-identical behavior based on their distinct conserved protein domains [84]. All members of the ADAR gene family share protein domains essential for RNA binding and catalytic activities (Figure 1.5). The dsRNA (double-stranded RNA) binding domains in the N-terminal region of ADAR proteins are responsible for the recognition of the substrate and the binding of the protein to it. The highly conserved catalytic domain carries out the deamination process in all ADAR proteins. Additionally, there are some other domains which make human ADARs work uniquely [85]. The functional impact of the Z-DNA binding domains in ADAR1 is still unclear, but one hypothesis is that the Zα domain localizes ADAR1 at genes being transcribed, which enables ADAR1 to more efficiently use intronic regions required for editing before they are removed [85].
Finally, ADAR3 has been shown to bind to single-stranded RNA with the aid of its arginine-rich RNA binding domain (R-domain) [85].

A remarkable number of RNA editing events have been found in different species, indicating their significant potential to contribute to the regulation of other cellular mechanisms. The RADAR [86] (Rigorously Annotated Database of A-to-I RNA editing) database gathered ∼5,000 editing sites in fly, ∼9,000 editing sites in mouse and a surprisingly large number, over 2.5 million, of editing sites in human (version 2, update: December 24, 2014).

The identified editing sites occur in both exons and introns. Based on RADAR annotations, intronic editing happens ∼20 times more often than exonic editing in human (∼2,000,000 intronic sites compared to ∼100,000 sites in coding regions and untranslated regions). In Drosophila, however, exonic sites are observed ∼1.5 times more often than intronic sites (∼2,700 exonic sites compared to ∼1,750 intronic sites). The dominance of intronic sites observed only in human can be at least partially explained by the prevalence of repetitive elements in the human genome that can form dsRNA structures serving as ADAR targets. Meanwhile, it should be noted that most RNA-seq libraries are enriched for mRNA molecules from which most introns have been removed. Thus, the ratio of detected intronic events presents only a lower bound on genuine intronic targets.

In the second chapter, I study the reciprocal influence of RNA editing and alternative splicing in D. melanogaster. Considering the large number of identified A-to-I RNA editing events in Drosophila melanogaster, in addition to the valuable publicly available data discussed before, we use Drosophila melanogaster data to study ADAR mechanisms as well as alternative splicing. D. melanogaster has one ADAR gene (dADAR), and ADAR2 is the most similar gene to it among vertebrate ADARs [85].
dADAR is highly expressed in the central nervous system and, similar to vertebrates, its expression shows temporal regulation [87]. In recent years, there has been an increasing number of studies on the importance and abundance of RNA editing in Drosophila melanogaster [88, 89].

1.2.2 Functions of RNA Editing

ADARs require double-stranded RNA regions to perform the deamination process [84]. Double-stranded RNA regions are held together by hydrogen bonds that form between pairs of complementary nucleotides (A-U, C-G, and G-U) in an RNA molecule. In primary transcripts, these regions are typically formed by local RNA secondary-structure features such as hairpins, and they can be very long (>500 nucleotides). Once an appropriate double-stranded region is found, ADARs bind a base-paired adenosine and edit it without being very specific about the primary sequence surrounding the substrate [90]. In other words, the requirement for a double-stranded structural context is much more important than the primary nucleotide composition in specifying a potential ADAR binding site [91]. Somewhat surprisingly, this key feature has not yet been directly exploited in most RNA editing prediction programs [92, 93].

One of the key features of ADAR-mediated RNA editing is that, even in the same cell, the editing of two transcripts of the same gene does not necessarily involve identical RNA editing sites, but only the same double-stranded region, which seems to be a necessary and sufficient requirement for RNA editing to have the desired functional effect. Many of the known double-stranded regions serving as ADAR binding sites are formed between exonic sequences and complementary intronic sequences [94] (known as editing site complementary sequences). This supports the idea that editing usually precedes splicing [95].
Also, for many editing sites, the levels of pre-mRNA editing and mRNA editing correlate well in Drosophila melanogaster, showing that RNA editing can happen co-transcriptionally [89].

A well-studied example is the editing of RNA structures formed between inverted Alu repeats in human transcripts [96]. Alu repeats constitute more than 10% of the human genome and can readily form double-stranded regions, and thus potential RNA editing sites, by binding to their inverted copies in the same primary transcript. When one site is edited, other adenosine nucleotides in the same double-stranded region have a high chance of also being edited by the same ADAR protein; this may result in the conversion of several adenosines in a small region [97, 98].

Several functions of RNA editing have been identified so far. I briefly review some of these functions in the following.

RNA editing generally destabilizes the structure of its targets [83]. The function of an RNA molecule is mainly determined by its structure [99]. In recent years, the crucial role of RNA structure in regulating other cellular mechanisms has become clearer [100, 101]. Blow et al [83] studied several editing sites in the human transcriptome to investigate the global effect of RNA editing on the stability of the target’s structure. By predicting the secondary-structure of editing regions before and after the corresponding editing events, the authors illustrated that the abundance of edited A:U matches (which are changed to G:U mismatches) reduces the stability of the target molecule. Alteration in the structure of a target molecule changes the way it interacts with other molecules or the way it responds to cellular machineries.

Diversifying protein products of a single gene is another known function attributed to RNA editing [102]. Non-synonymous modifications in coding regions of transcripts produce protein products with altered functionalities.
Most of these editing events occur in genes involved in rapid electrical and chemical neurotransmission in Drosophila melanogaster [94], which are strongly expressed in the central nervous system. Stark et al [103] observed that the high conservation in coding regions of 12 aligned Drosophila genomes continues for hundreds of nucleotides after the stop codon, and argued that the editing of stop codons in these genes could be a candidate underlying mechanism in such cases, generating two significantly different isoforms.

ADAR proteins can also affect gene expression through the editing of miRNA molecules and their targets [104]. MiRNAs bind to complementary sites in the 3’ untranslated regions of messenger RNAs and suppress gene expression by preventing translation or causing target degradation. Several studies have revealed that a notable number of editing events occur in 3’ UTRs in human and mouse [104, 105]. The large number of editing events in 3’ UTRs indicates the potential effect of RNA editing on post-transcriptional gene silencing. Additionally, some editing events are known to happen in miRNA sequences, which could be considered another way in which ADAR proteins affect gene expression [104].

Severe phenotypes have been associated with the deficiency of ADARs. Drosophila deficient for ADAR shows severe neurological disorders such as locomotor incoordination and temperature-sensitive paralysis [106]; mice deficient for ADAR1 have a heterozygous embryonic lethal phenotype [107]; and in humans, variations of RNA editing have been linked to neurological and psychiatric disorders [108]. These severe phenotypes also make it challenging to investigate ADAR functions.

Figure 1.6 illustrates an example of RNA editing in protein coding regions of 5-HT2CR. 5-HT receptors are G-protein coupled receptors that cross the cell membrane several times and play roles in signal transduction.
A double-stranded structure is formed between exon 5 and intron 5 of its pre-mRNA molecule, which serves as an editing substrate. Accordingly, five sites in exon 5 of 5-HT2CR undergo A-to-I editing. If all the sites are edited, the G-protein-coupling activity of the corresponding protein will be hugely different from that of the unedited protein [109]. As a consequence, RNA editing of this receptor gene has been associated with some psychiatric disorders such as depression [109].

Figure 1.6: Modification of the amino acid sequence of the human 5-HT2CR through RNA editing (figure from [109]). A part of the pre-mRNA molecule of 5-HT2CR, a transmembrane receptor, is shown in this figure. ADAR1 and ADAR2 target 5 sites (A, B, C, D, and E) of exon 5 to produce proteins with highly modified properties. These 5 sites are embedded within a hairpin structure formed by base-pairings of exon 5 (thick blue line) and the exonic complementary sequences of intron 5 (thin blue line).

Although studies have suggested some primary sequence features and also proteins that affect ADAR activity in specific target regions, the general regulation of RNA editing is unclear. Inverted copy sequences in proximity of a region increase the editing probability of that region, probably by having the potential to form the double-stranded region required for ADAR binding; in support of this, Alu repeats were observed to constitute the majority of ADAR targets in human transcripts [110, 111]. Moreover, some short primary sequence preferences have been observed for ADAR proteins in human [112, 113], mouse [93] and fly [114]. On the other hand, only a few RNA-binding proteins have so far been shown to suppress the editing levels of specific targets [115]. The SFRS9 gene, which encodes a splicing factor, represses the editing of the cyFIP2 gene.
This could be the result of competition between the two proteins for common substrates or of a protein-protein interaction between ADAR2 and SFRS9 [115]. The level of ADAR expression is another regulatory factor, although it does not usually correlate well with the level of RNA editing [116].

Moreover, considering the huge number of identified editing sites, there is still much to be discovered and understood regarding the molecular mechanisms and functional roles of RNA editing. The way RNA editing interacts with and affects other mechanisms is still unclear [90]. For example, given the abundance of RNA editing events in noncoding RNAs and the growing evidence for the influence of RNA editing on gene expression, a more detailed study of how editing affects RNA interference seems promising [84]. Additionally, recent studies suggest that alternative splicing and RNA editing mechanisms have the potential to influence each other [95, 117]. Considering the co-occurrence of RNA editing and alternative splicing in the same genes [114, 117], we study their potential inter-relation in this study.

1.2.3 Computational detection of RNA editing

The number of detected RNA editing sites has grown rapidly since the development of RNA-seq technologies. Sequencing machines are able to generate hundreds of millions of reads at a much lower cost compared to Sanger sequencing. As a result, the large number of reads aligned to a single location makes the detection of RNA editing sites with a low level of editing much easier. In the following, I discuss some of the computational methods proposed to counter potential errors when using RNA-seq data. Table 1.1 summarizes these methods.

Author/year             Strategy                    ADAR features incorporated   Confidence measure?     Reference
Peng et al (2011)       Using thresholds            None                         No                      [104]
Danecek et al (2012)    Using thresholds            Vicinity of targets          No                      [98]
Li et al (2012)         Likelihood model            None                         Log-likelihood ratios   [113]
Guiliany et al (2012)   Bayesian model              None                         Probability based       [118]
Laurent et al (2013)    Thresholds/Random forest    None                         Rank based              [114]
Zhang et al (2015)      Mutual information based    Randomness of editing        Rank based              [119]

Table 1.1: Summary of the methods proposed to identify editing events.

Early methods of identifying editing events from high-throughput sequencing data were threshold based, mainly for their simplicity [104, 111, 120]. The major concern when applying empirically determined thresholds is that the margin by which a threshold is passed is not considered in making the final conclusions [118]. Li et al [120] claimed the discovery of thousands of editing events for each of the twelve possible conversions, most of which were repeatedly reported as being the consequence of sequencing artifacts [121, 122]. One of the convincing arguments was that most of those events were predicted to occur at either end of the reads, where the probability of error is much higher [121]. Accordingly, more stringent filters were used in the following studies to prevent an abundance of false positive predictions.

Likelihood models offer the ability to quantify the significance of predictions. To overcome the shortcomings of threshold-based models, Bahn et al [113] applied a statistical approach. By considering the quality scores of the aligned reads and the position of nucleotides in a read, the authors proposed a model to compute the likelihood of a site being edited with ratio r. Then, in order to find the ratio of editing, this likelihood function is maximized with respect to r.
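To make the idea of such a likelihood model concrete, the following Python sketch estimates an editing ratio r at a single site. It is a simplified illustration under stated assumptions and not the formulation published in [113]: it assumes that each transcript carries G with probability r and A otherwise, that the error probability of a base call follows from its Phred quality score and is spread evenly over the three other bases, and it ignores the position of the nucleotide within the read. The function names, the grid search and the example pileup are hypothetical.

```python
import math


def phred_to_error(q):
    """Convert a Phred quality score to a base-call error probability."""
    return 10 ** (-q / 10)


def log_likelihood(r, reads):
    """Log-likelihood of editing ratio r given (base, quality) observations.

    Assumed model: the sequenced transcript carries G with probability r and A
    otherwise; a base call is wrong with probability e, spread evenly over the
    three other bases. Quality scores are assumed to be greater than zero.
    """
    total = 0.0
    for base, q in reads:
        e = phred_to_error(q)
        if base == "G":
            p = r * (1 - e) + (1 - r) * e / 3
        elif base == "A":
            p = (1 - r) * (1 - e) + r * e / 3
        else:  # C or T can only arise from a sequencing error in this model
            p = e / 3
        total += math.log(p)
    return total


def estimate_ratio(reads, steps=1000):
    """Grid search for the r in [0, 1] that maximizes the log-likelihood."""
    grid = [i / steps for i in range(steps + 1)]
    return max(grid, key=lambda r: log_likelihood(r, reads))


# Hypothetical pileup at one site: 30 reads showing A and 10 showing G, all Q30.
reads = [("A", 30)] * 30 + [("G", 30)] * 10
r_hat = estimate_ratio(reads)
# Log-likelihood ratio statistic against the "no editing" hypothesis r = 0.
llr = 2 * (log_likelihood(r_hat, reads) - log_likelihood(0.0, reads))
```

For this toy pileup the estimate lies close to the observed fraction of G reads, and the large positive statistic reflects that ten high-quality G calls are very unlikely to be sequencing errors alone.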
Finally, the confidence in predictions of editing ratios is assessed by comparing them against the null hypothesis using a log-likelihood ratio test.

A study by Guiliany et al [118] involved the joint modeling of whole genome sequencing and RNA-seq data using mixture models. This model, called Auditor, requires DNA and RNA base counts as input, and calculates the probability of editing at each position in the genome. To benefit most from the data, the transcriptotype (mRNA genotype) is modeled as a function of the genotype, using a transition matrix to represent the probability of observing a specific transcriptotype given a defined genotype. The transition values can be learned by the expectation maximization method. When the pipeline is coupled with MUTATION-SEQ [123], a method to detect somatic mutations, the enzymatic modifications carried out by ADARs can be effectively distinguished from other types of observed discrepancies.

In more recent studies, other machine learning approaches have also been proposed. Laurent et al [114] developed a method based on multiple rounds of detection and validation to adjust the applied thresholds. They also applied the Random Forest method to train a classifier based on true positive and true negative events found in their validations, and also to assess the importance of different features. In an interesting and completely different approach, Zhang et al [119] used the fact that if a read covers two nearby SNPs (single nucleotide polymorphisms), the two variable positions will have a fixed allelic linkage; however, the fixed linkage breaks in the case of an RNA editing event coupled with a SNP, due to the inherent randomness in editing. Accordingly, they compute the mutual information (MI) for observed variants.
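The mutual information between two variable positions can be estimated directly from the reads that cover both of them. The following Python sketch shows one way to do this; it is not the implementation of [119], and the function name and the toy read sets are assumptions made for illustration. Each read contributes one pair (allele at the first position, allele at the second position).

```python
import math
from collections import Counter


def mutual_information(pairs):
    """Mutual information (in bits) between the alleles observed at two
    variable positions, estimated from reads covering both positions."""
    n = len(pairs)
    joint = Counter(pairs)
    left = Counter(a for a, _ in pairs)
    right = Counter(b for _, b in pairs)
    mi = 0.0
    for (a, b), count in joint.items():
        p_ab = count / n
        # p_ab / (p_a * p_b), written with counts to avoid extra divisions.
        mi += p_ab * math.log2(p_ab * n * n / (left[a] * right[b]))
    return mi


# Toy data: two linked SNPs, so every read carries one of two haplotypes.
linked = [("A", "C")] * 20 + [("G", "T")] * 20
# Toy data: a SNP next to a partially edited site, where the A/G call at the
# second position is independent of the allele at the first position.
mixed = ([("A", "A")] * 10 + [("A", "G")] * 10
         + [("G", "A")] * 10 + [("G", "G")] * 10)

mi_linked = mutual_information(linked)
mi_mixed = mutual_information(mixed)
```

In these idealized examples the perfectly linked SNP pair yields 1 bit of mutual information, whereas the SNP paired with a randomly edited site yields 0 bits; real data fall between these extremes.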
If the MI value deviates significantly from the distribution of publicly available MI values for SNPs, the variant is considered an editing event.

Other known ADAR features can be incorporated to improve the computational power of existing pipelines. As an example, the fact that ADARs edit multiple sites within a small region has been incorporated in a pipeline introduced by Danecek et al [98] to extend the list of detected editing sites. As discussed, one of the requirements of ADAR targets is that they must be dsRNA regions. In the following, I briefly explain methods developed to computationally detect structural regions within RNA transcripts.

Predicting tertiary structures is computationally hard, and experimentally costly and time consuming. Fortunately, RNA secondary-structures are also informative for uncovering functional roles of RNA molecules [51]. Conceptually, RNA secondary-structure prediction methods are classified into two main categories: energy-based methods and evolution-based methods.

Energy-based methods are commonly established upon the idea that the ultimate structure of an RNA molecule is the one minimizing the overall Gibbs free energy. RNAFOLD [124], MFOLD [125] and SFOLD [126, 127] are some of the methods that try to solve this optimization problem by introducing time- and memory-efficient algorithms. These methods are fast and work for sequences of thousands of nucleotides; however, they have some limitations. Several assumptions of these methods are violated in vivo, and their accuracy drops rapidly with sequence length for sequences longer than a few hundred nucleotides. First, in vivo, proteins and other molecules interact with a folding transcript and impose further folding constraints.
Second, the folding time for RNA molecules is finite and the structure may never reach the energy optimum; and finally, there exist uncertainties in the experimentally measured parameters (stacking energies, energy of bulges, etc.) required by these methods.

The second class of RNA secondary-structure prediction methods relies on a completely different assumption. The basic idea is that homologous sequences diverge through evolution in a way that preserves functionally important structures. In other words, if a base pair is functionally important, then although the primary sequence of the corresponding nucleotides may diverge, the changes happen in a way that maintains the pairing potential (e.g. a C:G base pair changes to A:U and not A:G). Based on this concept, methods such as TRANSAT [128], EVOFOLD [129], RNA-DECODER [130, 131] and PFOLD [132] search for the linkage between the divergence of pairs of bases. Given a set of aligned sequences and an evolutionary tree, most of the proposed methods apply a phylo-SCFG (stochastic context-free grammar) to model and score the evolution of paired columns and unpaired columns statistically, and find an optimal solution based on the resulting conservation scores. Moreover, the flexibility of phylo-SCFGs allows assigning prior probabilities to predictions and also capturing additional hypotheses on secondary-structures. For example, if functional structures form in coding regions, apart from the structural restrictions, the amino acid sequences should also be preserved. In other words, only the third codon position can freely hold structural information because changes there usually do not alter the amino acid, in contrast to the first and the second positions.
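The statement about codon positions can be checked against the standard genetic code. The following Python sketch computes, for each codon position, the fraction of single-nucleotide substitutions that leave the encoded amino acid unchanged. It is only an illustration: the function and variable names are hypothetical, substitutions are counted uniformly over all sense codons (ignoring codon usage and mutation biases), and substitutions that create a stop codon are counted as amino acid changes.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... ('*' = stop).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def synonymous_fraction(position):
    """Fraction of single-base substitutions at `position` (0, 1 or 2) that
    leave the encoded amino acid unchanged, taken over all sense codons."""
    same = total = 0
    for codon, aa in CODON_TABLE.items():
        if aa == "*":
            continue  # skip stop codons as starting points
        for base in BASES:
            if base == codon[position]:
                continue
            mutant = codon[:position] + base + codon[position + 1:]
            total += 1
            if CODON_TABLE[mutant] == aa:
                same += 1
    return same / total


# One value per codon position (first, second, third).
fractions = [synonymous_fraction(p) for p in range(3)]
```

Under these assumptions the third position has by far the largest synonymous fraction, a small number of first-position changes are synonymous (within the leucine and arginine codon families), and no second-position change is synonymous.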
RNA-DECODER is the only method that properly models these different modes of evolution. In the case of RNA editing, because many editing events happen in coding genes, the method could be helpful by incorporating coding information as well.

1.3 Phosphorylation by Cyclin Dependent Kinase 12 (CDK12)

In the third chapter of this thesis, I study how CDK12 influences the regulation of RNA processing and specifically alternative splicing. In this section, I review our current understanding of CDK12 mechanisms and functions.

1.3.1 CDK12 is a protein kinase

Another mechanism that contributes to expanding the genome repertoire is post-translational phosphorylation. Over 500 protein kinases that perform phosphorylation have been annotated [133]. This large family of regulatory enzymes supplies cells with an additional level of regulation to control most cellular processes [134]. The functions of these enzymes are crucial for determining cell fate; accordingly, impaired kinase function has been linked to diseases including multiple cancer types.

Kinases have been investigated thoroughly as they constitute attractive targets for therapeutics. Besides, studies have shown the potential to develop highly selective drugs for kinases. For instance, imatinib is one such drug, with a high success rate in chronic-phase CML (chronic myelogenous leukemia) patients [135]. Therefore, considering the general role of kinases in regulating many signalling pathways and the small number of targeting agents designed so far, further investigation into targeting other members of the family seems promising and essential [135].

One class of regulatory kinases is the cyclin dependent kinases (CDKs). CDKs are inactive in their monomeric form, and form holoenzymes with their cyclin partners for activation [136]. 
Although initial studies based on their cyclin domain confirmed their involvement in cell cycle regulation, CDKs are now known to be engaged in a variety of other mechanisms such as transcription, splicing and DNA repair [137–139]. Similar to other kinases, impaired CDKs are a hallmark of several diseases. CDK12 is one member of this family of enzymes, and it associates with Cyclin K to become active [140].

Cyclin dependent kinase 12 is evolutionarily conserved [141]. Human CDK12 is a large protein (1,490 amino acids) [142] encoded on chromosome 17. The Drosophila melanogaster orthologue is 41% identical to human CDK12, and the C. elegans orthologue shows 53% identity [143]. Among the human genes, CDK13 has a very similar kinase domain, but otherwise differs from CDK12 [143]. CDK12 contains an RS domain (a domain rich in alternating arginine and serine residues), which is usually observed in SR proteins, proteins known to play crucial roles in the regulation of pre-mRNA splicing. Furthermore, the protein is found to be co-localized with the splicing machinery and the hyperphosphorylated form of RNA pol II [144].

1.3.2 Functions of CDK12

CDK12 is involved in the regulation of transcription elongation. The protein helps in the productive elongation of RNA pol II by phosphorylating the C-terminal domain (CTD) of RNA polymerase II, as shown both in human and Drosophila [140, 141, 145]. An expression microarray study showed that the phosphorylation only modulates the transcription of a small set of genes, primarily long genes with many exons [146]. Some of these target genes are involved in genome stability, including BRCA1 (Breast and ovarian cancer type 1 susceptibility protein 1), ATR (Ataxia telangiectasia and Rad3-related) and FANCI (Fanconi anemia complementation group I).

In addition, CDK12 contributes actively to the regulation of alternative splicing. The idea is supported by several lines of evidence. First, as explained before, CDK12 proteins contain RS domains. 
RS domains are usually observed in SR proteins and are believed to be important for recruiting proteins of the splicing machinery [147]. Second, over 30 splicing proteins, including SRSF1 and U2AF2, and several 3’-end formation factors interact with CDK12 [148]. Finally, a study by Chen et al [149] illustrated that the expression of CDK12 can modulate the splicing pattern of a synthetic E1A minigene, and Rodrigues et al [150] found that CDK12 is essential for regulating the splicing activity carried out by the HOW protein. Further genome-wide investigation seems necessary to uncover the general regulation of AS by CDK12.

Similar to many other protein kinases, the functional and structural properties of CDK12 qualify it as a promising drug target. CDK12 is one of the few genes recurrently mutated in ovarian cancer, and these mutations are usually mutually exclusive with mutations in BRCA1 or BRCA2, two of the most frequently mutated genes in ovarian cancer [151]. Disruption of CDK12 has also been observed in breast and gastric cancers [152]. Furthermore, CDK12 over-expression has been associated with poor prognosis and a higher risk of tumour recurrence [144]. Thus, CDK12 seems to be an attractive drug target for investigation, and a better understanding of its general effect on RNA processing could help advance therapeutics.

In chapter 3, I explore how CDK12 regulates alternative splicing and gene expression, and how target genes are selected at a genome-wide scale.

1.4 Research contributions

In the following, I briefly summarize my research questions and my objectives in the three main chapters of this dissertation (Figure 1.7).

In chapter 2, my main research question was how RNA editing regulates splicing patterns. 
Based on the importance of RNA editing and alternative splicing mechanisms in diversifying gene products, as described in this section, I investigated the potential inter-relation between these two mechanisms. Considering the existing evidence on some cases where editing regulates splicing [95], I hypothesized that this regulation happens more frequently than was previously known.

[Figure 1.7 panels, under the title "Alternative Splicing Regulations and Functions". Chapter 2, Inter-relation of Alternative Splicing and RNA Editing: tissue specific data sets from the modENCODE project; find co-occurrences of editing and alternative splicing; search for mechanisms that may mediate the inter-relation. Chapter 3, Regulation of Alternative Splicing by CDK12: CDK12 knockdown data in human cell lines, TCGA data; specify types of alternative splicing regulated by CDK12; identify potential mechanisms in cancer biology context. Chapter 4, Impacts of Components of Splicing Machinery on Cellular Responses: review of methods for inferring drivers of cellular responses; data from pharmaceuticals interrupting splicing mechanism; determine functions of inhibited components.]

Figure 1.7: A diagram of my research presented in this dissertation. The figure shows the data sets I use and the analyses I perform in the three main chapters of this study.

My main goal was to find local regions where splicing patterns are modulated by RNA editing and also to uncover the mechanism of regulation. I hypothesized that splicing and editing can compete for common targets in local regions, or that editing can influence splicing by modifying sequence motifs and structural features at a genome-wide scale. I addressed this problem in the context of Drosophila melanogaster, using tissue-specific RNA-seq data from the MODENCODE project.

In Chapter 3, the research question I explored is how CDK12 contributes to the regulation of alternative splicing. 
Despite the growing number of studies investigating functions of CDK12, the mechanism through which CDK12 contributes to cancer development and progression is still unclear. I hypothesized that part of the functional role of CDK12 in cancer biology occurs through regulation of alternative splicing. Therefore, I performed a genome-wide analysis of splicing and expression regulation by CDK12 using knockdown and control libraries of breast cell line data. I also examined whether my findings could be generalized to tumour cells using The Cancer Genome Atlas (TCGA) ovarian data [153].

In Chapter 4, my main objective was to assess how different methods in the literature can be employed to provide mechanistic insights when a gene is systematically inhibited by pharmaceutical agents at different levels. The inhibited genes are splicing-related genes, and the data were generated to study their contributions to splicing regulation and to better understand the splicing mechanism as a complex machinery. I summarized appropriate methods in the literature, and compared their advantages and limitations. I also examined the usefulness of the data using one of the appropriate methods, and finally discussed how the methods in the literature should be adopted to properly benefit and extract information from this type of data.

Chapter 2

Genome-wide Identification and Characterisation of Tissue-specific RNA Editing Events in Drosophila melanogaster and their Potential Role in Regulating Alternative Splicing

2.1 Introduction

Recent studies suggest that alternative splicing and RNA editing mechanisms have the potential to influence each other [95, 117]. Obviously, RNA editing can directly modify splicing patterns by editing required primary sequence motifs such as splice sites, splicing enhancers or silencers [95, 154]. Other studies in human and fly suggest that many editing sites occur in transcripts encoding RNA-binding proteins that play roles in alternative splicing. 
This may alter the expression, efficiency or binding properties of these proteins, which may in turn affect the splicing of many genes [114, 117]. On the other hand, different ADAR isoforms have different editing efficiencies [87], so the splicing machinery also has the potential to influence RNA editing. It thus seems natural to hypothesise that there are feedback loops between RNA editing and alternative splicing waiting to be discovered.

In the past few years, thousands of editing sites have been discovered by calling A-to-G differences between the reference genome and the transcriptome reads in human [104, 111, 113], mouse [98], and fly [38, 89] using RNA-seq data. One key challenge when analysing RNA-seq data is to discriminate true editing events from artifacts [104, 111, 113]; as explained in Chapter 1, RNA-seq data require sophisticated statistical data analysis methods for reliably detecting RNA editing events.

Fortunately, the large number of experimentally confirmed A-to-I RNA editing events in Drosophila melanogaster and the considerable amount of publicly available data make the fly a promising model organism to study ADAR mechanisms. In recent years, there has been an increasing number of studies on the importance and abundance of RNA editing in this organism [38, 87, 89]. Drosophila melanogaster has one ADAR gene (dADAR), and among vertebrate ADARs, ADAR2 is the most similar gene to dADAR [85]. In fly, dADAR is highly expressed in the central nervous system, and similar to vertebrates, its expression shows tight temporal regulation [87].

Here, we use tissue-specific high-throughput data sets of Drosophila melanogaster from the MODENCODE project [155] to identify RNA editing events in multiple tissues. To achieve this, we introduce a new computational analysis pipeline to accurately identify editing events and to distinguish genuine editing events from sequencing and mapping artifacts. 
Among the resulting predicted cases of RNA editing, we search for cases of differential exon usage between pairs of different tissues to identify regions where RNA editing and alternative splicing may influence each other. Finally, in order to discover potential molecular mechanisms underlying this interplay, we identify many cases of evolutionarily conserved RNA secondary structures that have the potential to regulate alternative splicing via RNA editing.

2.2 Materials and methods

2.2.1 Data set

To study tissue-specific RNA editing events, we selected tissue-specific RNA-seq libraries of Drosophila melanogaster from the MODENCODE project [45, 155]. These libraries correspond to paired-end, strand-specific RNA-seq reads of 74–120 nucleotides in length. The strand-specificity of the reads allows us to assess the correct conversion types in overlapping or incompletely annotated parts of the genome [98], whereas the paired ends improve the alignment of reads to repeat-rich regions of the genome, which would otherwise easily result in incorrectly aligned reads or the false positive prediction of SNPs or RNA editing sites. The 29 selected libraries are classified into 10 tissues (Table 2.1). Some of these libraries are extracted from multiple tissues. For each library there exist two to five technical replicates. 
All libraries derive from the OregonR strain of Drosophila melanogaster, which is, however, not the strain of the Drosophila melanogaster reference genome.

Dataset  Tissue   Dataset  Tissue            Dataset  Tissue
MOD4241  Head     MOD4266  Ovaries           MOD4259  Digestive system
MOD4242  Head     MOD4247  Accessory glands  MOD4256  Central nervous system
MOD4243  Head     MOD4249  Testes            MOD4257  Central nervous system
MOD4245  Head     MOD4250  Carcass           MOD4260  Fat body
MOD4246  Head     MOD4252  Carcass           MOD4267  Fat body
MOD4248  Head     MOD4254  Carcass           MOD4268  Fat body
MOD4263  Head     MOD4258  Carcass           MOD4261  Imaginal discs
MOD4264  Head     MOD4251  Digestive system  MOD4262  Salivary glands
MOD4265  Head     MOD4253  Digestive system  MOD4269  Salivary glands
MOD4244  Ovaries  MOD4255  Digestive system

Table 2.1: Tissue specific data sets selected from the MODENCODE project. The IDs of the selected libraries and the tissues from which these libraries are sampled are shown in this table. The data contain 29 libraries from 10 tissue types.

Since we do not have genomic DNA sequencing reads in our data, it is essential to align the short transcriptome reads to the genome of the OregonR strain when searching for DNA/RNA discrepancies; otherwise, genomic differences between the OregonR genome and the D. melanogaster reference genome could be misinterpreted as RNA editing events. We therefore generate an annotation for the OregonR genome by aligning the genome of the OregonR strain to the D. melanogaster reference genome. We first use MUMMER [156, 157] to find a set of consecutive matches, each at least 20 nucleotides long. Next, we align the remaining parts between these matches using the NEEDLEMAN-WUNSCH algorithm [158] with default parameter values. 
Finally, we convert the coordinates of the reference annotation of Drosophila melanogaster in ENSEMBL [159] to the corresponding coordinates of the resulting OregonR genome.

2.2.2 Prediction pipeline

Figure 2.1 gives an overview of the steps of our computational analysis pipeline for identifying RNA editing events using multiple RNA-seq libraries and the reference genome as input. Considering the potential challenges in reliably detecting RNA editing events [160], we designed a probabilistic pipeline to achieve the following in an efficient manner: (1) filter variants against artifacts due to mapping and sequencing errors; (2) explicitly capture ADAR-specific features, such as the requirement for double-stranded regions, to distinguish RNA editing events from other types of observed variants; and (3) leverage the statistical power derived from the size and number of our input data sets. In the following, we briefly explain the steps of our pipeline.

We use TOPHAT2 [163] to align short reads to the genome in a splice-aware manner. We allow up to five mismatches in the alignment step to permit TOPHAT2 to successfully align reads that have been RNA-edited multiple times. Next, we employ PICARD-TOOLS (http://broadinstitute.github.io/picard) to remove duplicates from each technical replicate. These duplicate reads may be generated during RT-PCR as a result of amplification bias [111, 113]. Finally, technical replicates are merged and positions showing DNA/RNA discrepancies are extracted for further analysis.

Our analysis pipeline combines a set of statistical and deterministic filters that apply two sets of threshold values, one called the flexible set and one called the stringent set (Figure 2.1). By employing these two sets of threshold values and leveraging the large size of the input data, it is possible to simultaneously lower the false positive and the false negative error rates. 
If only the stringent threshold values were used to distinguish genuine RNA editing sites from mapping and sequencing errors, filtering the potential artifacts could result in discarding many true RNA editing events, i.e., a high false negative error rate. On the other hand, using only the set of flexible threshold values could lead to an increased false positive error rate. To overcome this issue, our pipeline combines the two sets of threshold values. The potential editing sites that pass only the flexible threshold values are reported in the final output only if they are detected in multiple samples and are close to other predicted sites.

Figure 2.1: Outline of the computational analysis pipeline for identifying editing events from multiple RNA-seq libraries. The input consists of several RNA-seq libraries and the reference genome. As shown, first, reads are aligned and RNA/DNA mismatches are extracted. Then, two sets of values (for flexible and stringent filtering) are used for several filters ((A)-(D)) to remove potential experimental artifacts. Finally, our pipeline considers clustering of identified candidates and the number of times they are detected in multiple libraries to output a final set of predicted editing events. (A)-(D) show the statistical tests and filters used in our pipeline. (A) A set of primary filters used to assess the initial requirements for candidate sites. (B) The statistical graphical model (modified from [161]) that we use to find the maximum likelihood editing ratio, and to compute a log-likelihood ratio score. Shaded circles are the random variables that are observed in data and unshaded circles are the ones that are inferred. The rounded square is fixed to represent the reference genotype. a is a binary variable which indicates whether or not a read aligned to a position comes from an edited molecule. z is also a binary variable that indicates whether the read is aligned correctly. The editing ratio of position i is represented by node r; and nodes m and b represent mapping and base qualities. (C) Statistical tests in SAMTOOLS/BCFTOOLS [162] to check for potential biases in reads. (D) The energy of local structures and base pairing probabilities of nucleotides in close vicinity of candidate sites are used to ensure the structural requirements of candidates are met. (E) We use the fact that editing events occur in clusters to improve our predictions. (F) A less confident site must be detected in multiple libraries in order to be reported in our final set.

After the alignment step, a set of primary filters is applied to reduce the identified DNA/RNA discrepancies and remove those that are likely to be due to mapping and sequencing errors (Figure 2.1.A). These primary filters examine, for example, the number of reads covering a candidate RNA editing site, the read and mapping qualities of the input data, and also the distance to both ends of the read. In addition, any known variants listed in the ENSEMBL fly variant files are removed. Some of these variants may correspond to genuine RNA editing events, similar to what has been observed in human SNP databases [102], yet we decided to be conservative and to remove all known variants in the absence of any corresponding DNA sequencing reads.

Figure 2.1.B shows the graphical model we use to compute the maximum likelihood editing ratio, and to apply a log-likelihood ratio test. The model is a modification of the model introduced in SNVMIX2 [161]. The original model considers both mapping and base qualities of the reads and takes uncertainties of bases and alignments into account. We took a part of the model and added a new node (shown as "ri" in the figure) that represents the editing ratio (r). This can take values ranging from 0 (not edited) to 1 (always edited) with a uniform prior. 
We model $a_{ij}$ (which indicates whether read j aligned to position i comes from an edited RNA molecule) as a Bernoulli variable with its parameter set to the editing ratio r. The conditional probability distributions for the other three nodes ("z", "b" and "m") are the original ones used in SNVMIX2 [161]. Using this statistical model, the null hypothesis of a position having an RNA editing level of zero is compared to the hypothesis of the position being edited with the inferred maximum likelihood level of editing. More precisely, for each candidate position i, we compute the following log-likelihood ratio score:

\[
\mathrm{score}(i) = \operatorname*{arg\,max}_{\alpha} \, \log \frac{P(D_i \mid M_i, B_i, r_i = \alpha)}{P(D_i \mid M_i, B_i, r_i = 0)} \tag{2.1}
\]

where $r_i$ is the editing ratio; $D_i$ denotes the observed reads overlapping position i; and $B_i$ and $M_i$ are the base and the mapping qualities, respectively, of the reads overlapping position i.

In the following step, the pipeline applies SAMTOOLS/BCFTOOLS [162] tests to identify and remove positions that are discovered as a result of potential biases. These tests have been used in the literature to improve the quality of variant calls [98]. Base quality and mapping quality tests gauge the bias of the corresponding scores between the reads showing the reference allele and the reads showing the variant allele. Two additional tests evaluate the strand bias and the tail distance bias. The strand bias test gauges the difference between the strand distribution of reference reads and that of non-reference reads. 
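Equation 2.1 can be illustrated with a deliberately simplified Python sketch. It is not the SNVMIX2-derived model used in the pipeline: the sketch ignores mapping quality and the alignment-correctness node z, assumes only two observable bases (reference or variant), converts Phred base qualities into an error probability that is split evenly over the three alternative bases, and maximises the likelihood over a grid of editing ratios. All function names are hypothetical. The score is read here as the log-likelihood ratio evaluated at the maximising editing ratio.

```python
import math


def phred_to_error(qual):
    """Convert a Phred base quality into a base-calling error probability."""
    return 10 ** (-qual / 10)


def log_likelihood(alpha, reads):
    """Log-likelihood of the reads at one position for editing ratio alpha.

    reads: list of (is_variant, base_quality) tuples, one per read.
    """
    total = 0.0
    for is_variant, qual in reads:
        e = phred_to_error(qual)
        # A read comes from an edited molecule with probability alpha; the
        # called base is wrong with probability e (assumed split over 3 bases).
        p_variant = alpha * (1 - e) + (1 - alpha) * (e / 3)
        p_reference = alpha * (e / 3) + (1 - alpha) * (1 - e)
        total += math.log(p_variant if is_variant else p_reference)
    return total


def llr_score(reads, grid_size=1001):
    """Return (maximum likelihood editing ratio, log-likelihood ratio score)."""
    grid = [k / (grid_size - 1) for k in range(grid_size)]
    best_alpha = max(grid, key=lambda a: log_likelihood(a, reads))
    score = log_likelihood(best_alpha, reads) - log_likelihood(0.0, reads)
    return best_alpha, score


# Hypothetical position: 6 reads show G, 14 show the reference A, all at Q30.
reads = [(True, 30)] * 6 + [(False, 30)] * 14
alpha_hat, score = llr_score(reads)
print(f"editing ratio = {alpha_hat:.2f}, score = {score:.1f}")
```

A position covered only by reference reads yields an editing ratio of 0 and a score of 0, whereas the position above receives a clearly positive score; how large a score must be to pass the flexible or the stringent filter is a parameter of the pipeline and is not modelled here.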
The tail distance bias test investigates whether nucleotide reads from one allele tend to occur closer to read ends compared to nucleotide reads from the other allele.

Unlike most other existing prediction methods for RNA editing sites, our analysis pipeline explicitly utilises the requirement for double-stranded regions in potential ADAR target regions [164] to further improve our predictions (Figure 2.1.D). Long double-stranded regions constitute perfect potential target sites for ADARs [84], and structured regions also recruit ADARs to nearby sites that are not in the same double-stranded region [101]. Consequently, the stability of potential structures has been used to rank output candidates [92], although edited double-stranded regions have been observed to have a wide range of stabilities [97]. Also, the vicinity of complementary nucleotide regions, which allows the formation of RNA secondary structures, has been used to improve prediction results [93]. Most of the double-stranded regions bound by ADAR have been shown to correspond to intramolecular interactions, i.e., RNA structure features in the same transcript [83]. We therefore use local RNA secondary-structure prediction algorithms in our pipeline. We employ RNAFOLD [125] on a sequence interval of 200 nucleotides around each candidate editing site to calculate the minimum-free-energy (MFE) RNA structure predicted for this region. We use the corresponding minimum free energy as an indicator of the stability of all potential local RNA structures that can be formed in that region. Additionally, as ADAR binding and editing predominantly happen in double-stranded regions [84], we use RNAPLFOLD [165] to estimate the probability of a potential RNA editing site being in a double-stranded region. For this we examine sequence intervals of five nucleotides around the candidate editing site. 
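As a rough illustration of how two tiers of thresholds could be applied to these two structural features, the Python sketch below assigns a candidate site to a tier from a precomputed MFE and pairing probability. The threshold values, the names `Thresholds` and `structural_tier`, and the choice to require both criteria at once are assumptions made for this example; the values actually used in the pipeline are given in Appendix A, and the RNAFOLD and RNAPLFOLD runs themselves are not reproduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    max_mfe: float        # kcal/mol; more negative means a more stable structure
    min_pair_prob: float  # minimum base pairing probability around the site


# Hypothetical values, chosen only so that the example runs.
STRINGENT = Thresholds(max_mfe=-40.0, min_pair_prob=0.7)
FLEXIBLE = Thresholds(max_mfe=-25.0, min_pair_prob=0.4)


def passes(mfe, pair_prob, thresholds):
    """True if the local structure is stable enough and the site is likely paired."""
    return mfe <= thresholds.max_mfe and pair_prob >= thresholds.min_pair_prob


def structural_tier(mfe, pair_prob):
    """Assign a candidate site to the stringent tier, the flexible tier, or neither."""
    if passes(mfe, pair_prob, STRINGENT):
        return "stringent"
    if passes(mfe, pair_prob, FLEXIBLE):
        return "flexible"
    return "rejected"


# (MFE of the 200-nt window, pairing probability near the site), made-up inputs.
for mfe, pair_prob in [(-45.0, 0.8), (-30.0, 0.5), (-10.0, 0.9)]:
    print(mfe, pair_prob, structural_tier(mfe, pair_prob))
```

Keeping the tier label rather than a simple pass or fail lets later steps treat confident and less confident candidates differently.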
Finally, the two sets of thresholds (stringent and flexible) introduced above are applied to these potential RNA editing sites in order to incorporate structural information in our pipeline.

It is well known that ADAR tends to edit several sites in the same double-stranded region upon binding [97], which our analysis pipeline explicitly exploits (Figure 2.1.E). In addition, we expect true RNA editing events to show up in several libraries due to the large amount of input reads. To use these features, we first include all candidate editing sites that pass the stringent threshold values. Any remaining candidate sites are then added if: (a) the same position passes the stringent threshold values in another sample, or (b) the position has been predicted (passes the flexible threshold values) at least twice and there is another identified site showing the same conversion type within a distance of 25 nucleotides.

To summarise, by using a large number of samples as input, by explicitly capturing ADAR-specific features and requirements, and by combining two distinct sets of threshold values, we create an analysis pipeline that has a low false positive rate as well as a high true positive rate (see results section below). Assuming similar mutation rates for transitions and similar mutation rates for transversions, we can use the ratio of A-to-G conversions in our predictions to estimate the false positive rate [98, 166]. Based on this estimate, we chose a set of pipeline parameters that results in a decent overall number of predictions and also a high ratio of the A-to-G conversion type. Details of the pipeline, including parameter values, are explained in Appendix A.

2.2.3 Finding alternatively spliced exons

To find alternatively expressed exons between pairs of tissues, we use DEXSEQ [74]. DEXSEQ applies a generalised linear model to detect exonic regions that are differentially expressed between two conditions. We consider libraries from the same tissue as replicates, as required by DEXSEQ. 
Furthermore, we only consider genes that show an expression higher than a pre-defined threshold in both conditions in our analysis (by applying a threshold on expression predicted by CUFFLINKS [167]). Additionally, we discard genes for which many of the exonic parts are predicted to be alternatively used, keeping only those genes for which the number of alternatively used exons is smaller than max(2, 1/4 · number of exons). The main reason for doing so is to focus our analysis of the potential interplay between alternative splicing and RNA editing on genes that are more likely to be regulated locally.

2.3 Results

2.3.1 Our pipeline accurately distinguishes genuine editing sites from SNPs, and sequencing and mapping artifacts

The set of sites predicted by our pipeline is highly enriched in A-to-G conversions. Figure 2.2.A shows the number of unique RNA editing sites identified for each of the twelve possible types of DNA/RNA differences after applying our pipeline to the combined data set comprising all 29 libraries. We find 3680 unique conversion sites in multiple tissues of Drosophila melanogaster, of which 2879 (78.2%) correspond to A-to-G conversions. Assuming similar A-to-G and G-to-A mutation rates as well as similar rates of sequencing and mapping errors for these two types of transitions, we can estimate the false positive error rate of our predictions. Of the 3680 sites in our set, 112 are G-to-A conversions. By assuming that up to 112 of the detected A-to-G sites are false positive predictions, we estimate the false positive rate to be at most 3.9% (112/2879).

Figure 2.2.B shows the extent of overlap between the 2879 RNA editing sites identified by us and those of two other genome-wide studies in Drosophila melanogaster by Graveley et al [38] and Laurent et al [114]. In contrast to our study, Graveley et al analyse RNA-seq data sets of the MODENCODE project from different developmental stages, i.e., their read samples do not overlap our tissue-specific data sets. 
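The false positive bound derived above from the G-to-A count is simple arithmetic, and the short Python sketch below reproduces it from the three counts reported in the text. The function name `false_positive_bound` and the dictionary layout are assumptions for illustration; the calculation only restates the argument that G-to-A calls, which ADAR cannot produce, indicate how many A-to-G calls may be artifacts.

```python
def false_positive_bound(counts, signal="A>G", control="G>A"):
    """Upper bound on the false positive rate among the signal conversions.

    Assumes, as in the text, that non-editing sources produce the signal and
    the control transition at similar rates.
    """
    return counts[control] / counts[signal]


# Counts reported for the combined data set of all 29 libraries.
total_sites = 3680
counts = {"A>G": 2879, "G>A": 112}

a_to_g_share = counts["A>G"] / total_sites
fp_bound = false_positive_bound(counts)

print(f"A-to-G share of all sites: {a_to_g_share:.1%}")   # 78.2%
print(f"Estimated false positive bound: {fp_bound:.1%}")  # 3.9%
```

The same function can be applied to the conversion counts of a single tissue to compare error estimates between heavily and weakly edited tissues.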
In another high-throughput genome-wide study of RNA editing in D. melanogaster, Laurent et al employ single molecule sequencing for data generation. As Figure 2.2.B shows, the overlap of sites predicted by this and both previous studies is not very high (369/2879 (13%)), yet the overlap between our sites and each study separately, especially Laurent et al, is considerable (874/2879 (30%) for Laurent et al, 407/2879 (14%) for Graveley et al). This implies that a reassuring third (912/2879 (32%)) of our RNA editing sites have been detected by at least one of these earlier studies, while our study still adds a large number (1967) of new potential RNA editing sites to the existing Drosophila melanogaster annotation.

Figure 2.2: Types of identified conversions and the overlap of A-to-G conversions with other high-throughput studies. (A) Number of different types of conversions identified by our analysis pipeline. Most of the identified sites correspond to A-to-G conversions. (B) Venn diagram showing the overlap between our study and two other high-throughput studies by Graveley et al [38] and Laurent et al [114].

Apart from the obvious differences in the sampled cells and the transcripts that may be highly expressed in one cell and not in another, effects such as random sampling in high-throughput studies, sequencing errors and other challenges in distinguishing editing sites from artifacts [160] can account for the observed differences in detected sites. One of the key features of ADAR-derived RNA editing is that even in the same cell, the editing of two transcripts of the same gene does not necessarily involve identical RNA editing sites, but only the same double-stranded region, which seems to be a necessary and sufficient requirement for RNA editing to have the desired functional effect. 
Furthermore, differences in the pipelines proposed in these studies (a different set of thresholds, tests and engaged features) could at least partially account for some of the observed variations in results.

Among all output sites of our pipeline, 45% (1288/2879) have been predicted by at least one of four existing RNA-seq studies of RNA editing in D. melanogaster [38, 89, 114, 166], and at least 14% (400/2879) have been experimentally validated. In summary, the number of previously validated and identified sites combined with the low estimated false positive error rate indicates that most of our reported sites are likely to be genuine RNA editing sites.

2.3.2 Characterisation of identified RNA editing sites

As would be expected when analysing libraries of RNA-seq data (i.e., reads derived from mRNAs), the largest share (40%, 1149/2879) of our RNA editing sites occur in coding regions. Figure 2.3.A represents the distribution of RNA editing sites (which obviously derive from the transcriptome) across different types of genomic regions. The abundance of RNA editing events in coding regions when analysing pre-mRNAs of the fly genome has been reported earlier [89]. Editing in coding regions can cause non-synonymous changes. These may alter the sequence of a protein (and possibly its length) and also change the protein’s structure and function.

The next class of genomic regions with a large number of identified sites are 3’ untranslated regions (3’ UTRs). Editing of 3’ UTRs may alter gene expression by changing nucleotides in target sequences, e.g., of miRNAs. On the other hand, binding of ADAR to a target region can also prevent miRNAs and other molecules from binding [104]. Indeed, we find that 165 of our editing sites overlap known miRNA target regions. 
Another mechanism for altering gene expression patterns is to directly edit the miRNA molecules themselves or to interfere with miRNA processing [169–171]. We find 6 editing sites in 4 miRNA molecules: mir-4971 (1 site), mir-2a-2 (2 sites), mir-4961 (2 sites), and mir-4956 (1 site). These miRNA editing sites have the potential to influence miRNA processing and targeting.

[Figure 2.3 panels. (A) Legend: Intron, Exon, 3’UTR, 5’UTR, CDS, Intergenic; values: 580 [20%], 852 [30%], 1149 [40%], 88 [3%], 67 [2%], 143 [5%]. (B) Sequence logo (WebLogo 3.4); x-axis: nucleotide position, 5’ to 3’; y-axis: percentage (%). (C) x-axis: nucleotide position (-5 to 5); y-axis: base pairing probability (0.00 to 1.00); series: our study, previous studies.]

Figure 2.3: Characterisation of the identified editing sites. (A) Number and percentage of identified sites in different genomic regions. Coding regions contain more sites than other regions. (B) The frequency of each nucleotide at each position relative to the predicted editing sites. Guanosine is depleted at the position directly 5’ of editing sites. (C) Average base pairing probabilities computed using RNAPLFOLD [168] for regions close to ADAR targets for sites predicted in our study and in previous studies [38, 89, 114, 166]. Positions -1 to 1 show higher average pairing probabilities compared to other loci. Using structural features in our pipeline may bias our predictions towards sites with higher base pairing probabilities around reported sites; however, a similar pattern has also been observed when considering sites predicted in previous studies. Part (B) is generated using WEBLOGO [http://weblogo.berkeley.edu/logo.cgi].

Although our data derive from spliced transcripts, i.e., mRNAs (polyA enriched), we find 580 editing sites (20%) in genomic regions that are annotated as being intronic. The prevalence of editing in retained introns has already been reported [89]. 
Editing in introns can happen when the editing site falls into an editing site complementary sequence (ECS) which forms a double-stranded region with a region in an adjacent exon [94]. RNA editing may then lead to changes in the local RNA secondary structure which may result in the exon being retained [55]. Via this molecular mechanism, RNA editing thus has the potential to alter splicing patterns.

Our remaining sites overlap intergenic regions, 5’ UTRs and exons of non-coding genes. Sites classified as intergenic may be due to an incomplete annotation of the Drosophila melanogaster genome. The number of editing sites in the other two classes is small, but may have interesting biological consequences.

We took advantage of the large number of predicted RNA editing sites to investigate the primary sequence and structural binding preferences of ADAR. In agreement with earlier studies [114], we find that a guanosine directly adjacent in the 5’ position of an adenosine decreases the chance of the adenosine being edited (see Figure 2.3.B). Analysing the estimated base-pairing probabilities of small regions around the predicted RNA editing sites using RNAPLFOLD [165], we find that the two nucleotides directly adjacent to the site are the most important to be base-paired in ADAR target regions (Figure 2.3.C).

Analysing different tissue-specific data, we find that RNA editing happens in multiple tissues of D. melanogaster, predominantly in head. We highlight the number of DNA/RNA mismatches for four tissues in Figure 2.4. The majority of detected editing sites occur exclusively in head and central nervous system.
In other tissues, RNA editing is rare. Reassuringly, we find that in heavily edited tissues most of the predicted sites are A-to-G conversions that can be attributed to ADAR activity, so the false positive rate of our analysis is low in these tissues. Conversely, in other tissues the estimated error rate is higher (Figure 2.5.A).

[Figure 2.4: bar charts of counts (0 to 3000) for each of the 12 conversion types (A->C, A->G, A->T, C->A, C->G, C->T, G->A, G->C, G->T, T->A, T->C, T->G), coloured as A->G or Other, in four panels (CNS, Digestive System, Head, Imaginal Discs). Panel annotations: A->G (81.2%), A->G (32.8%), A->G (84.2%), A->G (47%).]

Figure 2.4: The number of all 12 types of conversions for four tissues of our study: central nervous system (CNS), digestive system, head, and imaginal discs. Head and CNS contribute most to the list of our predictions.

Editing patterns differ considerably between different types of tissues. Figure 2.5.B illustrates the relative overlap between sets of predicted sites in the ten studied tissues. Generally, different pairs of tissues do not share most of their editing events. One obvious candidate for regulating RNA editing is the expression of the ADAR gene itself. We find that ADAR expression is highest in head and central nervous system (CNS), but that the gene is also expressed in other tissues (Figure 2.5.C).
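The pairwise overlap shown in Figure 2.5.B is defined in the figure caption as the number of sites common to two tissues divided by the smaller of the two site counts. A minimal sketch of that ratio is given below; the function name, the representation of a site as a (chromosome, position) tuple, and the example values are illustrative assumptions, not part of the original pipeline.

```python
def overlap_ratio(sites_a, sites_b):
    """Shared sites divided by the size of the smaller site set.

    Sites are assumed (for illustration) to be hashable identifiers,
    e.g. (chromosome, position) tuples.
    """
    sites_a, sites_b = set(sites_a), set(sites_b)
    smaller = min(len(sites_a), len(sites_b))
    if smaller == 0:
        return 0.0  # no sites in one tissue: define the overlap as zero
    return len(sites_a & sites_b) / smaller


# Made-up example: two of the three sites in the smaller set are shared.
head_sites = {("2L", 100), ("2L", 250), ("3R", 40), ("X", 7)}
cns_sites = {("2L", 100), ("3R", 40), ("3R", 900)}
ratio = overlap_ratio(head_sites, cns_sites)
```

Dividing by the smaller set keeps the ratio at 1 when one tissue's sites are a subset of the other's, which a union-based measure such as the Jaccard index would not.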
The higher expression of ADAR in head and CNS is in agreement with the number of detected sites in these tissues. However, a higher expression of ADAR in one tissue compared to another does not necessarily imply a greater level of RNA editing; thus, as suggested before [116], the level of ADAR expression alone cannot explain how RNA editing levels are regulated.

[Figure 2.5, panels A to C. Panel A: estimated false positive rate (%) versus number of predicted sites per tissue. Panel B: tissue-by-tissue heat map with colour key from 0 to 1. Panel C: average expression (FPKM) per tissue. Tissues: Accessory Glands, Carcass, CNS, Digestive System, Fat Body, Head, Imaginal Discs, Ovaries, Salivary Glands, Testes.]

Figure 2.5: Comparing the editing mechanism in different tissues of Drosophila melanogaster. (A) Estimated error rate versus the number of predicted sites in different tissues of our study. (B) Percentage of overlapping sites between pairs of tissues as encoded by color shading. To compute the overlap ratio, the number of common sites between a pair of tissues is divided by the smaller of the two tissues’ numbers of detected sites. (C) Average expression of dADAR in tissues of the MODENCODE project. Expression values are measured in FPKM (fragments per kilobase of transcript per million fragments mapped) using CUFFLINKS [167]. Although dADAR expression is highest in CNS (central nervous system) and head, the gene is expressed in other tissues as well.

Our functional enrichment analysis using DAVID [172] confirms that edited genes are involved in ion transport (Benjamini Hochberg (BH) adjusted p-value: 2 · 10^-13), gated channel activity (BH adjusted p-value: 3 · 10^-8) and cell-cell signalling (BH adjusted p-value: 8 · 10^-8), the well known functions of ADAR targets [94, 173].
Additionally, functional annotation clustering using DAVID [172] identifies a cluster of genes involved in locomotory behaviour (BH adjusted p-value: 2 · 10^-3) and related functions, which is in agreement with the phenotype associated with ADAR knock-down flies [106, 174].

2.3.3 Evidence for cross-regulation of RNA editing and alternative splicing and the potential underlying regulatory mechanism

As discussed in the introduction, there already exists some evidence for an inter-relation between alternative splicing and RNA editing mechanisms. Leveraging the large number of selected tissue-specific data sets used in our study, we decided to investigate the reciprocal effect between alternative splicing and RNA editing in much greater detail and to discover potential underlying regulatory mechanisms. Alternative splicing and RNA editing both play key roles in diversifying gene products and in fine tuning gene expression on the RNA level. It would thus be of great conceptual importance to identify potential mechanisms of their cross-regulation.

We find that a gene with a greater number of known isoforms has a higher chance of being edited. Figure 2.6.A illustrates the positive correlation (Pearson R = 0.33, p-value < 2 · 10^-15) between the number of annotated isoforms and the number of predicted RNA editing sites in our study. One would expect longer genes to have a higher probability of being edited and to also have more splice variants (based on the larger number of exons). In order to test if the correlation observed in our data can be explained by gene length alone, we grouped genes according to their lengths and calculated the average number of known isoforms per group, once for the sub-group of edited and once for the complementary sub-group of un-edited genes (Figure 2.6.B). Although we find that longer genes tend to contain more editing sites, edited genes have a significantly greater number of known isoforms than un-edited genes (Figure 2.6.C). Other features such as exon lengths, intron
Other features such as exon lengths, intronlengths, and nucleotide bias may also affect the number of editing sites in genes.4302460 1 2−4 5−8 >9Number of editing sites per geneR = 0.33Average number of annotated transcripts per gene123456<1K1K−2K2K−3K3K−5K5K−8K8K−12K>12KGene lengthAverage number of annotatedEditedNot editedisoformsPearsonA)C)B)5K10K15K20K25K30K<1K1K−2K2K−3K3K−5K5K−8K8K−12K>12KGene lengths in each binAverage gene lengthGroupEditedNot editedFigure 2.6: There is a positive correlation between genes that are targets of RNA editingand genes that are alternatively spliced. (A) The number of annotated isoforms vs.the number of predicted sites in our study. The number of detected sites is found tobe greater in genes that express more annotated isoforms. (B) We group genes basedon their length and compare the average number of annotated isoforms for genes ofsimilar length between those that are edited and those that are un-edited genes. Forgenes with similar length, edited genes have a higher chance of being alternativelyspliced. (C) Here we tested whether genes in the same length bins from edited and un-edited groups have similar lengths. The plot shows that for most of our bins, averagegene lengths is almost equal for edited and un-edited group.44Even more interestingly, we find that editing events tend to preferentially occur nearexons with multiple splicing donor/acceptor sites (χ2 test, p-value < 2 · 10−15). For this,we classify exons (including UTR exons) into two groups, those with multiple knownacceptor and/or donor sites and those with unique acceptor and donor sites. Within eachgroup, we count the number of RNA editing sites and normalise by the combined lengthsof all exons in that group. 
Based on the resulting numbers, RNA editing sites are 3.2 times more likely to occur in exons with multiple splicing donor/acceptor sites compared to those with unique acceptor and donor sites (χ² test, p-value < 2 · 10^-15; this p-value is calculated for the null hypothesis of a 1:1 ratio). To further confirm our findings that are based on our set of predicted RNA editing events, we repeated the same analysis for all sites reported by four existing high-throughput studies of RNA editing in Drosophila melanogaster [38, 89, 114, 166] and find again that RNA editing is 1.9 times more likely to occur in exons with multiple acceptor/donor sites (χ² test, p-value < 2 · 10^-15).

We then identified 244 regions where RNA editing and tissue-specific alternative splicing can have a reciprocal effect (Appendix A). For this, we searched for RNA editing sites in and around exons (between -150 and +150 around each exonic part) that are alternatively spliced when comparing expression for pairs of tissues using DEXSEQ [74]. Figure 2.7 shows an example of a region that is predicted to be highly edited and observed to be alternatively spliced. The figure shows that many more editing sites are predicted in the head tissue (blue arrows) compared to digestive system (red arrow). This is also true for the exonic region that is not predicted to be alternatively used (E002).

One reason for the alternative splicing of the 3’ exon could be the formation of a double-stranded structure; alternatively, the binding of ADAR could prevent the splicing machinery from detecting splicing signals, so that the last exonic part is spliced out. We should mention that the predicted editing level is low even in head tissue, and the low editing level and random sampling may have caused the editing events not to be predicted in the digestive system samples.
Dedicated follow-up experiments are required to understand how the two mechanisms affect each other.

Figure 2.7: An example of a region where RNA editing and alternative splicing may affect each other. Rectangles at the bottom represent exonic parts of gene CG5850 located on the reverse strand of the left arm of chromosome 2. Exonic parts are numbered by E001, E002, ..., E009, where E001 is the 3’ most exonic part and E009 is the 5’ most exonic part. The Y axis shows the number of reads aligned to each exonic bin, normalised by library size. Blue lines correspond to the number of reads from the libraries of the head tissue and red lines correspond to libraries from the digestive system tissue. The purple rectangle marks the exonic part that is predicted to be alternatively expressed between the two tissue types. In this region, multiple arrows are shown for identified editing sites for head (blue arrows) and digestive system (red arrows). Figure generated using DEXSEQ [74].

To discover potential mechanisms regulating the interplay between alternative splicing and RNA editing, we also searched for statistically significant conserved RNA secondary-structure features in the vicinity of exons where we found RNA editing and alternative splicing to co-occur. For this, we employed TRANSAT [128] on input alignments of 15 fly species downloaded from UCSC [176] (we also added the OregonR sequence to the alignment; see Appendix A for more details) around splice sites of alternatively spliced exonic parts where editing sites are also predicted (extended by 150 nucleotides on either side, a total of 167 regions). There already exist quite a few computational methods to predict evolutionarily conserved RNA secondary-structure [132, 168, 177].
These programs, however, expect the input alignment to contain one more or less global secondary-structure, i.e., a structure spanning the entire alignment. As there is a priori no reason to expect secondary-structure features relevant for RNA editing to involve the entire transcript (especially not long fly pre-mRNAs in vivo), we use TRANSAT, as this program has been specifically designed to identify local, conserved RNA secondary-structure features such as the double-stranded regions needed for ADAR binding and RNA editing. The TRANSAT method takes a set of aligned sequences and an evolutionary tree as input, extracts potential helices in the alignment, and assigns a p-value to each of these helices.

Figure 2.8: An example of a region where a conserved RNA secondary structure feature detected by investigating editing events can potentially influence alternative splicing. Rectangles at the bottom of the figure show exonic parts of the Cip4 gene located on the reverse strand of the left arm of chromosome 3. The figure shows the structure predicted using RNAALIFOLD in a region of 100 nucleotides around the splice site of an exonic region which is predicted to be alternatively used between tissues. Red arrows show predicted editing sites. Black arcs indicate alignment columns that are predicted to be base-paired, and black columns correspond to un-paired nucleotides. Green squares within the alignment show valid base-pairs and orange squares invalid base-pairs. Dark blue squares represent valid base-pairs with two-sided mutations (compared to the most common base-pair in the pair of columns), probably in order to retain base-pairing potential. Likewise, light blue colour represents single mutations to retain base-pairing potential. The existence of multiple compensatory mutations provides evidence for the functional importance of the structure throughout evolution. Figure generated using R-CHIE [175].
For 96 of the 167 regions (57%) where alternative splicing and RNA editing co-occur in our data we find one or more conserved RNA secondary-structure features (after filtering out helices with p-values greater than 0.05 and helices shorter than 8 nucleotides). Figure 2.8 shows an example of these regions and the corresponding, conserved RNA secondary-structure detected by RNAALIFOLD [168] in this region. Multiple compensatory mutations for conserved base-pairs provide evolutionary evidence for a likely functional role of this double-stranded region. The list of the identified regions is presented in Appendix A. Finally, we applied RNAALIFOLD to assess the stability of the global structures in these regions. The list of the identified regions sorted based on the energy of the predicted global structure by RNAALIFOLD can be found in Appendix A.

2.4 Discussion

We identify 2879 A-to-I RNA editing sites in different tissues of Drosophila melanogaster with high precision. More than half of these have not been identified previously. The high proportion of A-to-G conversions among the detected DNA/RNA discrepancies indicates that most of our predictions are likely to be true editing events and not the result of experimental or computational artifacts. Also, our study suggests that other types of possible RNA editing apart from A-to-I RNA editing are very rare or do not happen at all in the investigated tissues of D. melanogaster.

Furthermore, our results show that editing occurs in multiple tissues, with many of the sites being edited exclusively in brain and central nervous system where ADAR expression is also higher than in other tissues. Moreover, patterns of editing differ significantly between tissues, implying a tissue-specific underlying regulatory mechanism.

Our study demonstrates how the appropriate use of ADAR specific features enhances the detection of RNA editing events when DNA reads are not available.
A previous study by Ramaswami et al. [166] shows that evolutionary information can be used to detect editing sites in the absence of DNA reads. Here, we explicitly capture ADAR specific features (in particular the requirement for the formation of local RNA secondary structures around target sites and the clustering of editing sites), in addition to utilising a large number of selected data sets, to distinguish editing events from artifacts and SNPs.

We identify more than 200 regions where RNA editing and alternative exon usage between tissues co-occur when comparing libraries. Many of these regions were found in multiple pair-wise comparisons of tissues. Previous studies showed the co-occurrence of RNA editing and alternative splicing in the same genes [114, 117], similar to what we find in this analysis. Solomon et al. reported the enrichment of editing events in cassette exons in human, although they reported that most of the sites are far from exon boundaries. We here show that editing events tend to happen much more abundantly in exons with multiple known acceptor or donor sites, or 3’ and 5’ UTRs that contain alternative splicing potential. Further, we find 96 regions around splice sites with significant statistical evidence for the overlap of evolutionarily conserved, local RNA secondary-structures. The actual formation of these RNA structure features in vivo is supported by both computational RNA secondary-structure prediction programs and predicted RNA editing sites. RNA editing thus has the potential to regulate alternative splicing via changes of local RNA secondary structures.
This suggests a potential, tissue-specific molecular mechanism of regulation for alternative splicing whose potential mediation via changes of local RNA structure we showed earlier [55].

Overall, we find strong evidence for our hypothesis that RNA editing and alternative splicing mechanisms directly influence each other in specific regions of the transcriptome. Both RNA editing and alternative splicing are abundant in the CNS and are known to be temporally and spatially regulated [87]. Also, target genes of the two mechanisms correlate well. These mechanisms may influence each other in several ways. First, the splicing machinery may compete with ADAR for common substrates. This is plausible given that RNA editing and splicing can happen at the same time co-transcriptionally in Drosophila melanogaster [89, 178]. Targeting of a specific location by one machinery limits the simultaneous access of the other machinery and can thereby affect its functionality. Second, considering the potential importance of RNA secondary structures in regulating alternative splicing, ADAR may edit and thereby alter local secondary structures which can in turn change exon usage. Blow et al. showed earlier that RNA editing of double-stranded regions has the overall effect of destabilising these features. Finally, editing of splicing silencers and enhancers or splice site motifs could additionally affect splicing. Based on our results, this type of local co-regulation through changing RNA structures happens predominantly within exons with multiple acceptor or donor sites. In these regions, the primary sequence splicing signals may be weak, and these weak signals can prevent the splicing machinery from always making the same decision.

The formation and RNA editing-mediated modification of local RNA secondary structures therefore has the potential to significantly alter splicing patterns in these genes as local RNA structure features can be “encoded” in a transcript-specific way.
In fact, the necessity for encoding RNA structure features that are involved in regulating the alternative splicing of their own transcript may explain why introns tend to be longer in more complex organisms: these RNA structure features are (at least partly) encoded in introns, thus imposing no undue additional evolutionary constraints on the protein-coding exons.

Previous studies suggest that the dominant way in which editing regulates splicing is by editing RNA-binding proteins [117]. This would, however, imply a more indirect and global way of regulating alternative splicing and could not easily happen in a gene-specific way. Our results support a gene-specific mechanism where alternative splicing can be directly regulated via tissue-specific changes of RNA editing. Also, one of the roles of pre-mRNA sequences may be to encode not only amino-acid information, but also RNA secondary-structure motifs that determine the correct splicing patterns in a tissue-specific way. Detailed follow-up experiments, e.g., ADAR knockdowns and mutational studies of specific genes, are now required to experimentally confirm our results.

Chapter 3

The Regulation of Alternative Last Exon Splicing by CDK12 Promotes the Oncogenic Potential of Breast Cancer Cells

3.1 Introduction

Cyclin-dependent kinases (CDKs) and their activating cyclin partners integrate numerous signal transduction pathways to regulate a variety of critical cellular processes [138, 179]. CDK12 (CRK7, CrkRS) is one of several CDKs that regulate transcription through the differential phosphorylation of the C-terminal domain (CTD) of RNA Polymerase II [137], as discussed in Chapter 1. There is still much unknown regarding how CDK12 regulates alternative splicing and gene expression at a genome-wide scale.

The Cancer Genome Atlas (TCGA) project identified recurrent somatic alterations in CDK12 in 13% of breast cancers and 5% of ovarian cancers [153, 180–182].
CDK12 mutations are commonly nonsense mutations or impair CDK12 kinase activity [183], and are frequently coupled with loss of heterozygosity [180, 184]. Recent studies show that CDK12 functions in maintaining genome stability. In cell-based assays, depletion of CDK12 is associated with defects in DNA damage response (DDR) and decreases expression of genes involved in the homology-directed repair (HDR) pathway [146, 151, 152, 183, 185]. Though it is generally classified as a tumor suppressor gene from its role in DDR, additional evidence indicates that CDK12 may have pro-oncogenic functions in breast cancers. CDK12 is located on chromosome 17, 165-267 kb proximal to HER2 (ERBB2), an oncogene that is frequently amplified in breast cancers. CDK12 is co-amplified with HER2 in 27-92% of breast tumors or tumor cell lines [186–194]. Similar to HER2, overexpression of CDK12 also correlates with high proliferative index and grade 3 tumor status based on tissue microarrays of invasive breast carcinomas [134]. It is noteworthy that in about 13% of HER2+ (HER2-amplified) breast tumors, the amplification breakpoint resides in the CDK12 allele and likely results in the functional loss of one CDK12 allele [185]. In related observations, recurrent CDK12-HER2 gene fusions in gastric cancers result in impaired CDK12 protein levels [195]. It is currently unknown how alterations in CDK12 contribute to the myriad of changes seen in breast tumors. Overall, these data suggest CDK12 may have oncogenic roles in cancer progression, but the mechanisms underlying this effect have not been explored.

To address the oncogenic roles of CDK12, we performed a comprehensive and systematic genomic and proteomic analysis of CDK12 function in a breast cancer cell line with genomic amplification of CDK12. We sought to determine if the role of CDK12 in tumorigenesis and DDR was related to its hypothesized ability to regulate splicing or AS in addition to its role in transcription.
Instead of having a general effect on transcription or splicing, we found that CDK12 regulated the expression and AS of a distinct set of mRNAs in a cell type-specific manner. Furthermore, CDK12 predominantly regulated only the alternative last exon (ALE) sub-type of AS. Functionally, events regulated by CDK12 potentiated tumorigenic processes, indicating that aberrant CDK12 expression can have oncogenic properties.

3.2 Materials and methods

3.2.1 Data

The RNA-seq data consist of biological triplicates of SK-BR-3 and 184-hTERT cells treated with CDK12 siRNA-1 or scrambled siRNA. The libraries contain 103 ± 12 million (mean ± s.d.) strand-specific, paired-end reads of 75 nucleotides. The paired-end reads were aligned to the hg19 reference genome (downloaded from the UCSC genome browser [196]) using GSNAP [197]. The corresponding gene annotation file was downloaded from ENSEMBL [159]. The “novel splicing” parameter of GSNAP was enabled to allow the discovery and use of novel junctions in the alignment step. In the final step, duplicate reads were removed using SAMTOOLS [162, 198]. The procedure resulted in an average of ∼92% successfully aligned reads.

To assess our findings in an independent data set, we downloaded the RNA-seq data published in a previous study [145]. The data contain two control and two CDK12 shRNAs, each in two replicates, from HCT-116 cells. These libraries consisted of ∼14 million to ∼48 million single-end, un-stranded reads of 50 nucleotides. On average 84% of the reads were successfully aligned to the hg19 reference genome using TOPHAT2 [163]. Because of the small number and short lengths of these reads, duplicate reads were kept in the aligned files and reads with mapping quality of less than 10 were removed.

3.2.2 Differential gene expression and alternative splicing analysis

DESEQ2 [199] was applied to detect genes that were differentially expressed in CDK12 siRNA-treated libraries as compared to control libraries.
Given a table of raw read counts for genes in the genome, DESEQ2 applies a statistical model to compare counts between the two conditions, and it calculates a fold change for each gene and assigns a statistical measure of confidence for differential regulation of the gene. The gene read counts required as input by DESEQ2 were generated using HTSEQ-COUNT [200] with the “mode” parameter set to “union”. Genes with adjusted p-values < 0.01 form the list of confident differentially expressed genes between siRNA-treated and control conditions in each cell line. CUFFLINKS [201] was also used to quantify gene and isoform expression.

Given a ranked list of genes, GSEA [202] applies a statistical method to identify pathways for which the genes involved in that pathway are over-represented at the top or bottom of the ranked list. For the RNA-seq data, genes with very few aligned reads (<100 on average) were removed and the remaining genes (∼12,000 in SK-BR-3 and 184-hTERT cell lines) were sorted based on the estimated fold changes. For the global proteome data, the sign of the fold change was multiplied by the inverse of the FDR and the genes were sorted based on the corresponding values. Here, FDR values represented the statistical significance of evidence for the differential expression of proteins. The GSEA pre-ranked analysis assigned a normalized enrichment score (NES) representing the extent of over-representation of genes of a pathway at the top or bottom of the ranked list. All of the 1,454 GO (Gene Ontology) gene sets from the Molecular Signature Database (MSigDB) [202] were used. Gene sets having fewer than 15 or more than 500 genes common to each list were filtered out. GSEA was applied in classic enrichment statistics mode with 1,000 permutations.

The ENRICHMENTMAP plugin [203] in CYTOSCAPE [204] was used to make enrichment map plots.
A p-value cut-off of 0.005 and an FDR q-value cut-off of 0.01 were applied. These are output p-values and FDR q-values from the GSEA analysis that represent the statistical significance of evidence for pathways being enriched in the top up-regulated or top down-regulated genes. The clustering feature of ENRICHMENTMAP with default parameters was used to cluster gene sets that share common genes, and cluster names were manually curated based on the contained pathways.

The MISO package [73] was used to investigate the regulation of alternative splicing (AS) by CDK12. MISO applies a statistical framework to distinguish eight different types of annotated AS and processing events. These events are skipped exons, mutually exclusive exons, retained introns, alternative 3’ and 5’ splice sites, alternative first and last exons, and tandem 3’ UTRs. The method takes a pair of samples as input and reports a ∆Ψ value between the two samples for each annotated event. The Ψ (Percent Spliced In) value represents the fraction of inclusion of one isoform when two isoforms are being considered in a splicing event (a value between 0 and 1). The method also reports a Bayes Factor (BF) value to quantify the support for the model where the Ψ value is altered between the two samples compared to the alternative model of no difference in Ψ value between the two samples.

For each cell line, siRNA-treated and control samples were randomly paired and events with BF values ≥ 20 and |∆Ψ| values ≥ 0.1 were selected. To form the most confident set of detected splicing events, events were required to have been predicted in all three pair-wise comparisons for each cell line (SK-BR-3 and 184-hTERT).
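The selection rule described above (both thresholds met in every pairwise comparison) can be sketched as a filter followed by a set intersection. The snippet below is an illustration under stated assumptions: it does not use or reproduce the MISO interface, and the dictionary layout (event ID mapped to a Bayes factor and ∆Ψ pair), the function name, and the example values are hypothetical.

```python
def confident_events(comparisons, min_bf=20.0, min_abs_dpsi=0.1):
    """Return IDs of events that pass both thresholds in every comparison.

    `comparisons` is a list with one dict per siRNA-vs-control pair, mapping a
    (hypothetical) event ID to a (Bayes factor, delta psi) tuple.
    """
    passing_sets = []
    for events in comparisons:
        passing = {
            event_id
            for event_id, (bayes_factor, delta_psi) in events.items()
            if bayes_factor >= min_bf and abs(delta_psi) >= min_abs_dpsi
        }
        passing_sets.append(passing)
    # An event is kept only if it passed in all comparisons.
    return set.intersection(*passing_sets) if passing_sets else set()


# Made-up results for three pairwise comparisons.
comparisons = [
    {"ALE_1": (35.0, 0.25), "ALE_2": (50.0, -0.30), "SE_1": (5.0, 0.40)},
    {"ALE_1": (22.0, 0.18), "ALE_2": (12.0, -0.28), "SE_1": (40.0, 0.05)},
    {"ALE_1": (90.0, 0.21), "ALE_2": (60.0, -0.35)},
]
selected = confident_events(comparisons)
```

Lowering `min_bf` to 10 in the same function would correspond to a more permissive threshold for lower-coverage data.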
To compare the MISO analyses in SK-BR-3 and 184-hTERT cells with an independent data set previously published [145], a smaller BF threshold value of 10 was applied to allow the discovery of more events (considering the smaller number of reads present in the Liang et al. [145] data).

3.2.3 TCGA data analysis

High-grade serous ovarian cancer data generated by the TCGA Research Network (http://cancergenome.nih.gov/) were analyzed for AS events. A total of 70 raw fastq case files were downloaded from the Cancer Genome Hub (CGHub) repository (https://cghub.ucsc.edu); 14 cases have CDK12 alterations, including 7 cases with point mutations, 3 cases with homozygous deletions, and 4 cases with amplifications (Appendix B). The remaining 56 cases were control tumor samples with no reported alterations in CDK12. Additionally, these control samples did not have alterations in CDK13, BRCA1, BRCA2, PALB2, and BRIP1 genes. These criteria were applied to exclude tumors with alterations that may potentially phenocopy the effects of CDK12 alterations.

Fastq files were aligned to the hg19 reference genome using GSNAP with the same gene annotation file and parameters used for the datasets from SK-BR-3 and 184-hTERT cell lines. MISO was used for pairwise comparisons between two different datasets (e.g. CDK12-mutated cases vs. cases with no CDK12 alterations). The analysis of AS events in the TCGA data was restricted to the union set of ALE events detected in 184-hTERT and SK-BR-3. This list comprised 133 ALE events (predicted in all 3 pairwise comparisons for each cell line with BF > 20 and |∆Ψ| > 0.1), with 23 events common between the two cell lines.

To analyze the effects of CDK12 point mutations, each of the 7 CDK12-mutated cases was paired with 2 random unique control cases (without CDK12 alterations), generating 14 total comparisons for MISO analysis. The number of times that each of the 133 ALE events was detected in these 14 comparisons (with BF ≥ 20) was calculated.
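The pairing and counting steps described above can be sketched as follows. This is a hedged illustration rather than the analysis code: it assumes that "unique" means no control case is reused across pairs, and the function names, case identifiers, fixed random seed, and the per-comparison dictionary of Bayes factors are all hypothetical.

```python
import random


def pair_cases_with_controls(cases, controls, controls_per_case=2, seed=0):
    """Pair each case with distinct, randomly drawn controls (no control reused)."""
    rng = random.Random(seed)  # fixed seed only to make the sketch reproducible
    needed = len(cases) * controls_per_case
    if needed > len(controls):
        raise ValueError("not enough controls for unique pairing")
    chosen = rng.sample(controls, needed)  # sampling without replacement
    pairs = []
    for i, case in enumerate(cases):
        start = i * controls_per_case
        for control in chosen[start:start + controls_per_case]:
            pairs.append((case, control))
    return pairs


def detection_counts(comparison_results, min_bf=20.0):
    """Count in how many comparisons each event reaches the Bayes factor cut-off."""
    counts = {}
    for result in comparison_results:  # one dict (event ID -> BF) per comparison
        for event_id, bayes_factor in result.items():
            if bayes_factor >= min_bf:
                counts[event_id] = counts.get(event_id, 0) + 1
    return counts


cases = [f"mutated_{i}" for i in range(7)]
controls = [f"control_{i}" for i in range(56)]
pairs = pair_cases_with_controls(cases, controls)
counts = detection_counts([{"ALE_1": 25.0, "ALE_2": 3.0}, {"ALE_1": 40.0}])
```

Setting `controls_per_case=4` gives the pairing scheme for a smaller group of cases, provided enough controls remain.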
As a control, 7 other random cases without CDK12 alteration were selected and paired with the same 14 control samples. The control experiment was replicated three times to determine if the identified ALE events were over-represented when comparing CDK12-altered cases with other random cases. P-values were calculated using the Mann-Whitney U test. Similar analyses were performed for cases with CDK12 amplifications and homozygous deletions. Due to the smaller number of CDK12-amplified and -deleted cases, each sample was randomly paired with 4 control cases rather than 2.

3.2.4 Motif analysis

The 3’UTRs (3’ untranslated regions) of ALE (alternative last exon) events regulated by CDK12 were searched to identify polyadenylation motifs based on published Position Weight Matrices (PWMs) [205]. The number of identified ALE events in SK-BR-3 cells was small; therefore, to expand the list and include more potential targets in the motif analysis, all ALE events that were identified in at least 2 of the 3 pairwise comparisons when using the thresholds of |∆Ψ| > 0.1 and BF > 10 were included. For each MISO ALE event representing two isoforms, the best two overlapping isoforms from ENSEMBL gene annotations that explained the ALE event in the RNA-seq data were determined. This was done by taking into account isoform expression computed by CUFFLINKS, the overlap between the MISO last exon and the ENSEMBL annotated isoform, and the percentage of exon-exon junctions from each candidate ENSEMBL isoform that were verified in the RNA-seq data. The positive samples comprised all the genes for which ALE events were predicted in SK-BR-3 cells. These genes were divided into two groups: genes with over-expressed proximal last exons and genes with over-expressed distal last exons after CDK12 depletion. Negative samples contained genes that were annotated to have ALE events in MISO annotations, but were not predicted to be regulated by CDK12.
Genes for which the total FPKM expression value of the two isoforms was smaller than 0.5 were filtered out. Positive and negative samples were split into bins of similar UTR length, and for each positive sample UTR, 20 negative samples from the same UTR length bin were randomly selected.

To count motif abundance, background nucleotide frequencies in the extracted regions were determined, and then, using the PWM for each motif, the log odds score of a sequence being a motif hit compared to being randomly generated according to the background frequencies was calculated. The sequence with the maximum log odds score for each motif was identified. All sequences whose score was above 80 percent of this maximum were counted as motif hits. For the count-based motif analysis, the number of hits in each region was normalized by the length of the region, and the Mann-Whitney U test was used to compare the normalized hits in positive samples to the normalized hits in negative samples. The Benjamini-Hochberg correction was used to correct the calculated p-values for multiple testing. For the distance-based motif analysis, the distance of each motif hit to the 5’ and to the 3’ end of the 3’UTR was calculated. The significance (Benjamini-Hochberg corrected) of the difference between the distances for hits in the positive and negative samples was then calculated.

3.3 Results

3.3.1 CDK12 regulates alternative last exon splicing of genes with long transcripts and many exons

To explore the function of CDK12 in splicing regulation, we performed mRNA sequencing (RNA-seq) on SK-BR-3 cells treated with a scrambled siRNA control or siRNA directed to CDK12 (achieving 8- and 7-fold reductions in CDK12 mRNA and protein, respectively). SK-BR-3 cells are a HER2+ epithelial breast cancer cell line in which CDK12 is co-amplified with HER2.
As a result, SK-BR-3 cells over-express CDK12 protein [185]. We also performed RNA-seq on CDK12 siRNA-treated 184-hTERT cells, an immortalized normal mammary epithelial cell line that does not over-express CDK12. In our RNA-seq libraries, the transcriptome was deeply sequenced (103 ± 12 million reads per sample) to enable the identification of low-level alternative splicing events.

To find differentially spliced events, we used the MISO package [73], which applies a statistical framework to distinguish eight different types of annotated AS events in pairwise RNA-seq comparisons. From three independent pairs of CDK12 siRNA:scrambled siRNA samples, we identified 102 AS events common to all SK-BR-3 samples and 86 AS events common to all 184-hTERT samples (Figure 3.1.A). The regulation of specific AS events by CDK12 was cell type-specific, and only 23 AS events were common to both datasets (Figure 3.1.B). However, the mechanism of regulation appears conserved: 86% and 79% of AS events observed in CDK12-depleted SK-BR-3 and 184-hTERT cells, respectively, were alternative last exon (ALE) splicing events. Moreover, all 23 AS events common to both cell lines were ALE events. ALE events regulated by CDK12 had an average MISO |∆Ψ| value of 0.23 ± 0.09 (Figure 3.1.C).

The cell type-specific effects we observed (Figure 3.2.A) are unlikely to be an indirect result of low gene expression in either cell type; genes that were regulated by CDK12 in only one cell type were similarly expressed in the other cell type (Figure 3.2.B). Genes with ALE events regulated by CDK12 were expressed with average FPKM (fragments per kb of exon per million fragments mapped) values of 12 and 15 in SK-BR-3 and 184-hTERT cells, respectively. For the 23 genes common to both cell lines, the average FPKM values were 15 and 16 in SK-BR-3 and 184-hTERT cells, respectively.
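For readers unfamiliar with the FPKM unit, the following short Python sketch works through its basic definition. It shows only the naive formula; tools such as CUFFLINKS estimate FPKM with a statistical model, so this is an illustration under simplifying assumptions, and the numbers in the example are invented.

```python
def fpkm(fragments, exon_length_bp, total_mapped_fragments):
    """Fragments per kilobase of exon per million mapped fragments (naive definition)."""
    kilobases = exon_length_bp / 1e3
    millions_mapped = total_mapped_fragments / 1e6
    return fragments / (kilobases * millions_mapped)


# Invented example: 3,000 fragments on a gene with 2,000 bp of exons,
# in a library with 100 million mapped fragments
example_fpkm = fpkm(3000, 2000, 100_000_000)
print(example_fpkm)  # approximately 15
```

Under this definition, a gene at FPKM 15 in a library of about 100 million fragments contributes roughly 1,500 fragments per kilobase of exon.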
Genes with SK-BR-3-specific ALEs had an average FPKM of 15 in 184-hTERT cells, and genes with 184-hTERT-specific ALEs had an average FPKM of 16 in SK-BR-3 cells.

To further explore the universality of this type of regulation, we performed MISO analysis on published RNA-seq data from HCT-116 cells (derived from colorectal cancer) treated with CDK12 shRNAs [145]. The experiments in HCT-116 were performed in duplicate with two different shRNA constructs. Consistent with our findings in SK-BR-3 and 184-hTERT cells, ALE events accounted for 33% and 41% of all AS types in HCT-116 cells for the two shRNAs, respectively (Figure 3.3.A). Common AS events resulting from treatment with CDK12 siRNA-1 (SK-BR-3 and 184-hTERT cells) and either of the two shRNAs (HCT-116) were all ALEs (n = 9, Figure 3.3.B).

[Figure 3.1, panels A-C: fraction of AS events by AS type in SK-BR-3 and 184-hTERT cells; overlap of all AS events and of ALE events between the two cell lines; histogram of |∆Ψ| for ALE events.]

Figure 3.1: CDK12 regulates alternative last exon (ALE) splicing. A. MISO analysis identified AS events that resulted from depletion of CDK12 in SK-BR-3 and 184-hTERT cells (Bayes Factor ≥ 20, |∆Ψ| ≥ 0.1). AS events present in all three RNA-seq replicates were primarily alternative last exon splicing in both cell types. SE, skipped exons; RI, retained introns; A3SS, alternative 3’ splice sites; A5SS, alternative 5’ splice sites; MXE, mutually exclusive exons; AFE, alternative first exons; ALE, alternative last exons; T-UTR, tandem 3’ untranslated regions. B. The majority of AS events are cell type-specific, and events common to both SK-BR-3 and 184-hTERT cells are all ALEs. C. Distribution of |∆Ψ| values for ALE events (total n = 156) regulated by CDK12 in SK-BR-3 (n = 88) and 184-hTERT (n = 68) cells.

[Figure 3.2, panels A-B: heat map of relative expression (CDK12 siRNA-1/scrambled siRNA) for ALE events across SK-BR-3 and 184-hTERT replicates; expression (FPKM) of genes with SK-BR-3-specific, 184-hTERT-specific, and common ALEs in each cell line.]

Figure 3.2: ALE regulation by CDK12 is cell type-specific. A. CDK12-regulated alternative last exon (ALE) events are both common and unique between SK-BR-3 and 184-hTERT cells. Biological replicates assist in classifying cell type-specific and common ALE events. For this analysis, a lower statistical threshold was applied (Bayes Factor ≥ 10, |∆Ψ| ≥ 0.1) to increase the number of ALE events for the heat map. B. Distribution of FPKM values in SK-BR-3 cells (blue boxes) or 184-hTERT cells (green boxes) for genes with CDK12-regulated ALE events specific to SK-BR-3 cells, specific to 184-hTERT cells, or common to both cell types.

[Figure 3.3, panels A-B: fraction of AS events by AS type for SK-BR-3 + siRNA-1 (n = 161), 184-hTERT + siRNA-1 (n = 134), HCT-116 + shRNA-1 (n = 64*), and HCT-116 + shRNA-2 (n = 133*), with HCT-116 data from Liang et al., 2015; set sizes and numbers of AS and ALE events in each intersection.]

Figure 3.3: Regulation of ALE splicing is a universal function of CDK12. A. Comparison of AS events identified by MISO in SK-BR-3 and 184-hTERT cells (treated with CDK12 siRNA-1, this study) and HCT-116 cells (treated with two shRNA constructs, Liang et al. [145]). Only two replicates of the SK-BR-3 and 184-hTERT RNA-seq were used in order to match the conditions of the HCT-116 RNA-seq data. In the HCT-116 experiment, RNA-seq was performed on total RNA after depletion of rRNA. The RNA was not enriched for mRNA, which could explain the enrichment of retained introns (denoted by asterisks) observed in the HCT-116 data versus the SK-BR-3 and 184-hTERT data. B. Intersection set analysis showing the number of AS and ALE events common to SK-BR-3, 184-hTERT, and HCT-116 cells. Top: set sizes of each group are shown. Bottom: numbers of AS and ALE events in each intersection group. Graph created using the UpSetR package in R (https://cran.r-project.org/web/packages/UpSetR/).

The regulation of AS by CDK12 is largely cell type-specific, but the preponderance of ALE events suggests that the regulated genes may share a common feature. When compared to the total set of protein coding genes, genes whose ALEs were regulated by CDK12 had significantly longer transcripts and contained a greater number of exons (Figure 3.4.A). It was previously reported that genes transcriptionally regulated by CDK12 generally had longer transcripts [146]. In our analysis, we found that genes with ALE events regulated by CDK12 were significantly longer than those transcriptionally regulated by CDK12 (Figure 3.4.A). Genes with ALE events regulated by CDK12 were also longer than all genes with annotated ALEs. Notably, of all genes with annotated ALEs and transcripts longer than the average, only 3% were regulated by CDK12 in SK-BR-3 or 184-hTERT cells. In other words, only a subset of long genes was regulated by CDK12-dependent ALE splicing, suggesting that additional gene-specific factors direct AS by CDK12.

[Figure 3.4, panels A-B: transcript length (×10^5 bp) and number of exons for all protein coding genes (n = 23,393), all genes with annotated ALEs (n = 4,485), genes with CDK12-regulated ALEs (SK-BR-3, n = 80; 184-hTERT, n = 52), and differentially expressed genes (SK-BR-3, n = 3,163; 184-hTERT, n = 3,940); ∆Ψ of ALE events in SK-BR-3 (n = 88), 184-hTERT (n = 68), and both cell lines (n = 23), split into proximal and distal ALE usage.]

Figure 3.4: CDK12 regulates ALE splicing of genes with long transcripts and a large number of exons. A. Distributions of gene transcript length and number of exons. All protein coding genes are compared to all genes with annotated ALE events and genes regulated by CDK12 (ALE splicing or differential expression in SK-BR-3 and 184-hTERT cells). Red lines represent the means. Pairwise statistical comparisons were performed using the Kolmogorov-Smirnov test (*p < 0.0005, **p < 1 × 10^-6, n.s. not significant). B. Depletion of CDK12 generally results in the utilization of proximal ALEs (negative ∆Ψ values).

In 76% of ALE events, CDK12 depletion resulted in the enrichment of mRNA isoforms utilizing the proximal ALE (Figure 3.4.B). When considering only ALE events common to both SK-BR-3 and 184-hTERT cells, the proximal ALE was utilized more in 83% of the cases. These results were independently validated by qRT-PCR (performed by Christalle Chow and Jerry Tien) on a select number of ALE events in SK-BR-3 and 184-hTERT cells depleted of CDK12, with good correlation of ∆Ψ values between the MISO and qRT-PCR data (Appendix B).
These observations were also not due to off-target effects; we obtained similar results with a different CDK12 siRNA construct (CDK12 siRNA-2; Appendix B), but not with siRNA constructs targeting CDK9 or CDK13.

Furthermore, immunoprecipitation experiments (carried out by Christopher S. Hughes and Jerry Tien) determined that the CDK12 interactome is enriched in spliceosomal proteins. CDK12-interacting proteins were highly enriched for RNA splicing function (Figure 3.5), and could be generally classified into core spliceosome components (pre-catalytic complexes A and B, and the associated Prp19 complex) and regulators of constitutive and alternative splicing (SR proteins, RBM proteins, and hnRNPs) (Figure 3.5.C) [206]. The interactions between CDK12 and hnRNPs were sensitive to nuclease treatment and were therefore likely dependent on RNA intermediates, such as the pre-mRNA upon which hnRNPs are assembled. By contrast, interactions between CDK12 and core spliceosome and SR proteins were largely unaffected by nuclease treatment. The universality of interactions between CDK12 and core spliceosome components was further supported by immunoprecipitation experiments in HEK-293T cells [145, 207], Jurkat T-cells [208], and HeLa cells [148, 209]; however, many of the regulatory splicing components differ across cell types. This could be a product of cell type-specific regulation or of differences in experimental methodology. Together, these results suggest that CDK12 is a bona fide component of the splicing machinery.

While the regulation of ALE usage by CDK12 can be achieved through its association with regulatory splicing factors, it could also be an indirect product of transcription termination processes (such as alternative polyadenylation) initiated by termination signals in the 3’ untranslated regions (UTRs) [211]. To address this possibility, we searched for polyadenylation motifs in the 3’UTRs of proximal and distal ALEs that were regulated by CDK12 (Figure 3.6).
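The motif search follows the log odds scoring of Section 3.2.4, which can be sketched in Python as below. This is a minimal illustration rather than the code used in this study. The toy PWM, the uniform background, and the function names are hypothetical, and the "maximum score" is assumed here to be the best score attainable under the PWM.

```python
import math

BASES = "ACGT"


def log_odds(window, pwm, background):
    """Log2 odds of `window` under the PWM versus the background model."""
    return sum(math.log2(column[base] / background[base])
               for column, base in zip(pwm, window))


def max_log_odds(pwm, background):
    """Best attainable log odds score (assumed meaning of the 'maximum score')."""
    return sum(max(math.log2(column[b] / background[b]) for b in BASES)
               for column in pwm)


def motif_hits(sequence, pwm, background, fraction=0.8):
    """Start positions of windows scoring above `fraction` of the maximum score."""
    cutoff = fraction * max_log_odds(pwm, background)
    width = len(pwm)
    sequence = sequence.upper()
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        # windows containing ambiguous bases (e.g. N) are skipped
        if set(window) <= set(BASES) and log_odds(window, pwm, background) > cutoff:
            hits.append(start)
    return hits


# Toy PWM for an AATAAA-like signal; the probabilities are illustrative only
toy_pwm = [{b: (0.85 if b == c else 0.05) for b in BASES} for c in "AATAAA"]
uniform_background = {b: 0.25 for b in BASES}

utr = "GGCAATAAAGGC"
hits = motif_hits(utr, toy_pwm, uniform_background)
density = len(hits) / len(utr)  # hits normalized by region length
print(hits)  # [3]
```

In the analysis itself, the background frequencies would be estimated from the extracted 3’UTR regions rather than fixed to a uniform distribution.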
We observed no differences in the distribution or density of polyadenylation motifs in ALEs regulated by CDK12 as compared to ALEs unaffected by CDK12 function. This observation further suggests that the regulation of ALE usage by CDK12 occurs through a splicing mechanism rather than through gene-specific recruitment of polyadenylation factors.

[Figure 3.5, panels A-C: padj-value versus log2(enrichment score) for all identified proteins (n = 983), highlighting RNA splicing proteins (n = 37) among significant interactors (padj < 0.05, n = 121); -log10(padj-value) of enriched GO Biological Process terms (RNA splicing; mRNA splicing via spliceosome; mRNA processing; RNA processing; mRNA metabolic process; regulation of mRNA processing; regulation of mRNA metabolic process); CDK12-interacting splicing proteins grouped into core spliceosome proteins, hnRNPs, SR proteins, and other splicing proteins.]

Figure 3.5: CDK12 interacts with the RNA splicing machinery. A. Immunoprecipitation of FLAG-CDK12 and mass spectrometry were used to identify 121 CDK12-interacting proteins in SK-BR-3 cells (enrichment score > 0, padj < 0.05). B. Interacting proteins were highly enriched for RNA splicing functions as determined by Gene Ontology (GO) analysis [210]. C. CDK12-interacting splicing proteins can be generally divided into core spliceosome proteins (blue) and regulatory splicing factors (green, orange, and brown).

3.3.2 Tumors defective in CDK12 function exhibit mis-regulation of ALE splicing

Alterations in CDK12 have been described in numerous tumor types, including breast, ovarian, uterine, gastric, and bladder cancers [153, 181, 185, 194, 195].
The TCGA consortium has performed large-scale analyses on collections of tumor samples, including RNA-seq for 311 cases of ovarian serous cystadenocarcinoma [153]. CDK12 is recurrently altered in 6% of these cases (Figure 3.7.A). Tumors containing the CDK12 mutations are notably not amplified for HER2, and previous studies demonstrated that these ovarian cancer mutations impair the kinase activity of CDK12 in vitro [152, 183]. Therefore, these samples are well suited to explore the consequences of modulating CDK12 function in a tumor setting.

To generalize the regulation of ALE events by CDK12 to tumor cells, we used the MISO package to perform pairwise comparisons of tumor samples containing CDK12 alterations to tumor samples without CDK12 alterations (Figure 3.7.B). For this analysis, we utilized data from four types of available TCGA RNA-seq samples [153]: tumors with CDK12 point mutations (n = 7), tumors with homozygous CDK12 deletions (n = 3), tumors with genomic amplification of CDK12 (n = 4), and tumors with no alterations in CDK12 (n = 56 control samples). We queried the point mutation, deletion, amplification, and control samples for the occurrence of the 133 ALE events that resulted from CDK12 depletion in SK-BR-3 and 184-hTERT cells. On average, each ALE event was found in 49% of point mutation:control comparisons, as compared to 23% of control:control comparisons (Figure 3.7.B i). When considering only the 23 events common to both SK-BR-3 and 184-hTERT cells, each ALE event was found in 71% and 27% of mutation and control comparisons on average, respectively.
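The per-event frequencies reported above, and the Mann-Whitney U test used to compare them (Section 3.2.3), can be sketched in Python as follows. This is a hedged illustration, not the analysis code: the function names and toy inputs are hypothetical, and the p-value uses a tie-corrected normal approximation without continuity correction, which may differ slightly from the implementation used in this study.

```python
import math
from collections import Counter


def detection_fractions(calls_per_comparison, events):
    """Fraction of comparisons in which each queried event was called."""
    n = len(calls_per_comparison)
    return [sum(event in calls for calls in calls_per_comparison) / n
            for event in events]


def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test (midranks, tie-corrected normal approximation)."""
    pooled = sorted(list(x) + list(y))
    midrank = {}
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j] == pooled[i]:
            j += 1
        midrank[pooled[i]] = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        i = j
    n1, n2 = len(x), len(y)
    n = n1 + n2
    u1 = sum(midrank[v] for v in x) - n1 * (n1 + 1) / 2
    tie_term = sum(t ** 3 - t for t in Counter(pooled).values())
    variance = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if variance == 0:
        return u1, 1.0
    z = (u1 - n1 * n2 / 2) / math.sqrt(variance)
    return u1, math.erfc(abs(z) / math.sqrt(2))


# Toy input: sets of event IDs called in four comparisons
toy_calls = [{"e1", "e2"}, {"e1"}, set(), {"e1"}]
fractions = detection_fractions(toy_calls, ["e1", "e2", "e3"])
print(fractions)  # [0.75, 0.25, 0.0]

u_stat, p_value = mann_whitney_u([0.9, 0.8, 0.7], [0.1, 0.2, 0.3])
```

In the setting described here, `x` would hold the 133 per-event fractions from the alteration:control comparisons and `y` the corresponding fractions from the control:control comparisons.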
Similar trends were obtained with tumors containing homozygous CDK12 deletions (Figure 3.7.B ii), demonstrating that these ALE events were identified more frequently in tumors impaired in CDK12 function.

[Figure 3.6, panels A-C: schematic of the 3’UTRs of proximal and distal ALEs; probability density (×10^-3) of the distance of motifs to the 5’ and 3’ junctions (bp) for SK-BR-3 ALEs and control ALEs; motif density (×10^-3 per bp) for SK-BR-3 ALEs and control ALEs, each shown for ∆Ψ < -0.1 and ∆Ψ > 0.1.]

Figure 3.6: The 3’UTRs of ALEs regulated by CDK12 do not feature unique patterns of polyadenylation motifs. Distribution plots are shown for ALEs regulated by CDK12 (green lines and boxes) and control ALEs (black lines and boxes). ALE events are divided into those that result in greater usage of the proximal ALE (∆Ψ < -0.1) and those that favor the distal ALE (∆Ψ > 0.1). A. Analysis of polyadenylation motifs was performed on the 3’UTRs of proximal and distal ALEs regulated by CDK12. B. Distributions of distances of polyadenylation motifs from the 5’ and 3’ junctions of 3’UTRs. The n values represent total numbers of polyadenylation motifs identified. C. Distributions of the densities of polyadenylation motifs in the 3’UTRs.
The differences between the distributions of SK-BR-3 ALEs and control ALEs are not statistically significant in any comparison (Mann-Whitney U test, Benjamini-Hochberg corrected p > 0.05).

[Figure 3.7, panels A-B: percentage of ovarian tumors (n = 311) with CDK12 amplifications, deletions, and mutations; fraction of comparisons with each ALE event for (i) mutation vs. control (n = 14) and control vs. control (n = 42), (ii) deletion vs. control (n = 12) and control vs. control (n = 36), and (iii) amplification vs. control (n = 16) and control vs. control (n = 48), shown for the 133 ALE events in SK-BR-3 ∪ 184-hTERT and the 23 ALE events in SK-BR-3 ∩ 184-hTERT.]

Figure 3.7: Alterations in CDK12 correlate with mis-regulation of ALE splicing in ovarian tumor samples. A. CDK12 is recurrently altered in ovarian serous cystadenocarcinomas [153]. From this dataset, RNA-seq data were available for tumors containing CDK12 point mutations (blue, n = 7), homozygous deletions (green, n = 3), and amplifications (red, n = 4). B. Using the MISO package, changes in AS (Bayes Factor ≥ 20) were determined based on the following comparisons: (i) CDK12 point mutation vs. control, (ii) CDK12 deletion vs. control, and (iii) CDK12 amplification vs. control. Changes in CDK12-regulated AS events were compared to AS events found in control vs. control comparisons. To obtain a similar number of comparisons in each scenario, each point mutation sample (i) was compared to two unique control samples (n = 14 comparisons), while each deletion (ii) and amplification sample (iii) was compared to four unique control samples (n = 12 and 16 comparisons, respectively). Control vs. control comparisons were likewise paired, and additionally performed in triplicate (n = 36, 42, or 48 comparisons).
A total of 133 ALE events were queried, representing the events found in either the SK-BR-3 or 184-hTERT experiments (grey circles). We also queried 23 ALE events common to both SK-BR-3 and 184-hTERT cells (purple triangles). Red lines represent the means. The significance of comparisons (grey and purple lines) was determined using the Mann-Whitney U test (*p < 0.05, **p < 0.005, ***p < 1 × 10^-5). The results published here are in whole or part based upon data generated by The Cancer Genome Atlas managed by the NCI and NHGRI. Information about TCGA can be found at: http://cancergenome.nih.gov/.

In breast cancers, CDK12 is commonly co-amplified with HER2. Similarly, the four ovarian tumor samples with CDK12 amplifications also contain HER2 amplifications. In contrast to cases containing CDK12 point mutations or deletions, tumors amplified for CDK12 showed the queried ALE events less frequently (13% of amplification:control and 20% of control:control comparisons; Figure 3.7.B iii). These findings with the CDK12-amplified samples mirror our results in SK-BR-3 cells, where the ALE events were identified after depletion of CDK12 from an over-expressed state. Together, these results suggest that mis-regulation of ALE splicing occurs due to aberrations in CDK12 and support a functional role of CDK12 alterations in the development of ovarian tumors.

3.3.3 Regulation of gene expression by CDK12 is gene- and cell type-specific but modulates a core set of common pathways

CDK12 regulates ALE splicing of long transcripts in multiple cell and tumor types; however, only a small subset of the regulated genes is common to multiple cell types. To determine whether CDK12 also regulates gene transcription in a cell type-specific manner, we evaluated the effects of CDK12 on global gene expression.

We analyzed the triplicate CDK12 siRNA and control siRNA RNA-seq data from SK-BR-3 and 184-hTERT cells using DESEQ2 [199, 212].
The analysis found that depletion of CDK12 resulted in modest changes in gene expression (Figure 3.8.A). In SK-BR-3 cells, 3,163 statistically significant (padj < 0.01) events were evenly divided into up-regulated (50%, mean fold change = 1.5) and down-regulated (50%, mean fold change = -1.5) genes. Of these events, only 386 exhibited more than a 2-fold change in gene expression (Figure 3.8.C). Depletion of CDK12 in 184-hTERT cells resulted in slightly more differential expression events (n = 3,940 with padj < 0.01). Again, events were differentially expressed in both directions (49% up-regulated an average of 1.7-fold; 51% down-regulated an average of 1.6-fold). Only 678 changed more than 2-fold in expression. Of these genes, 37 were differentially expressed in both cell lines (Figure 3.8.C). These analyses contrast with a previous study in HCT-116 cells, which reported that 98% of differentially expressed genes were down-regulated after CDK12 depletion [145]. Taken together, our observations suggest that, similar to the regulation of ALE splicing, regulation of gene expression by CDK12 is highly gene- and cell type-specific.

[Figure 3.8: log2(fold change) versus mean expression (DESeq2 counts) for SK-BR-3 and 184-hTERT cells, colored by padj-value; overlap of events with padj < 0.01 and |log2(fold change)| > 1 (SK-BR-3 only, 349; 184-hTERT only, 641; both, 37).]

Figure 3.8: CDK12 differentially regulates gene expression in a cell type-specific manner. A. Differential gene expression analysis by RNA-seq following CDK12 depletion in SK-BR-3 and 184-hTERT cells. Mean expression (DESEQ2 counts) is plotted against fold change (CDK12 siRNA-1 versus scrambled siRNA). Dotted lines delineate events with |fold change| > 2. Events with padj < 0.01 are colored. C. Few differential gene expression events with padj < 0.01 and |fold change| > 2 are common between SK-BR-3 and 184-hTERT cells.

While the regulation of individual genes by CDK12 was cell type-specific, an examination of the affected cellular pathways using Gene Set Enrichment Analysis (GSEA) [202] offered additional insight. We found that in both SK-BR-3 and 184-hTERT cells, loss of CDK12 altered similar pathways. Identification of these pathways also supports previously reported functions of CDK12 [145, 146, 151, 152, 183, 185, 213, 214]. Namely, depletion of CDK12 resulted in the down-regulation of genes involved in RNA splicing and processing, cell cycle progression, and regulation of DNA damage response pathways in both cell types (Figure 3.9). Since these processes were previously reported in different cell types, they appear to represent universal functions of CDK12.

The pathway analysis also aided in determining cell type-specific properties of CDK12. Depletion of CDK12 in SK-BR-3 cells decreased expression of genes associated with mitochondrial function (Figure 3.9). This change was not observed in 184-hTERT cells. Instead, depletion of CDK12 in 184-hTERT cells increased expression of genes associated with the plasma membrane or related to development and extracellular activity (Figure 3.9). In general, CDK12 expression both increased and decreased the expression of genes in various pathways in 184-hTERT cells, but primarily up-regulated pathways in SK-BR-3 cells (Figure 3.9).
Taken together, these results demonstrate that while transcriptional regulation by CDK12 is largely gene- and cell type-specific, common cellular processes are modulated by CDK12 activity across different cell types.

We next sought to determine how changes in gene expression due to CDK12 function manifest at the protein level to affect the phenotype of SK-BR-3 cells. A global proteomics experiment was performed (by Grace Cheng, Christalle Chow, Jerry Tien, and Christopher Hughes) to quantify alterations in protein expression after depletion of CDK12 in SK-BR-3 cells, and we compared the results to the matching RNA-seq data (Figure 3.10). We found that the proteome data represented a smaller subset of the transcriptome data (Figure 3.10.A). Of the 11,072 expressed genes in the RNA-seq data (defined as FPKM ≥ 1), 7,031 (64%) were identified at the protein level by mass spectrometry (Figure 3.10.A).

Moreover, similar to the transcriptome data, only a small proportion of proteins were differentially expressed (n = 444, padj < 0.01) in SK-BR-3 cells after depletion of CDK12 (Figure 3.10.B). There was a high correlation in the fold change values of the 197 genes that were significantly differentially expressed in both the transcriptome and proteome datasets (Figure 3.10.C). We note that 242 genes were changed at the protein level and not at the mRNA level, and that 1,136 genes were changed at the mRNA level and not at the protein level.
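The comparison of the two datasets amounts to partitioning genes by significance in each dataset and correlating their fold changes. A minimal Python sketch of that bookkeeping is given below, assuming per-gene adjusted p-values and log2 fold changes are available as dictionaries and lists. The gene names and values are invented placeholders, and the helper names are hypothetical.

```python
def classify_significant(rna_padj, protein_padj, alpha=0.01):
    """Partition genes measured in both datasets by where they are significant."""
    groups = {"both": [], "transcriptome_only": [], "proteome_only": []}
    for gene in sorted(rna_padj.keys() & protein_padj.keys()):
        rna_sig = rna_padj[gene] < alpha
        protein_sig = protein_padj[gene] < alpha
        if rna_sig and protein_sig:
            groups["both"].append(gene)
        elif rna_sig:
            groups["transcriptome_only"].append(gene)
        elif protein_sig:
            groups["proteome_only"].append(gene)
    return groups


def r_squared(x, y):
    """Squared Pearson correlation between two equally long lists of values."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)


# Invented placeholder values for four hypothetical genes
rna_padj = {"geneA": 0.001, "geneB": 0.2, "geneC": 0.005, "geneD": 0.5}
protein_padj = {"geneA": 0.002, "geneB": 0.003, "geneC": 0.4, "geneD": 0.9}
groups = classify_significant(rna_padj, protein_padj)
print(groups)
```

The `r_squared` helper would then be applied to the log2 fold changes of the genes in `groups["both"]` to obtain the kind of correlation shown in Figure 3.10.C.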
Pathway analyses demonstrated that the core functions of CDK12 (e.g., RNA processing and DNA damage response) were all observed in the proteomics experiment (Appendix B). Functions specific to SK-BR-3 cells, such as the involvement of mitochondrial processes, were also found at the protein level. However, the regulation of proteins involved in cell cycle and cell division, which was prominent in the transcriptome data, was absent in the proteome data. This is likely a result of mRNA-independent means of regulating protein expression and turnover, and may also be cell type-specific.

[Figure 3.9, panels A-B: enrichment maps of up- and down-regulated pathways for SK-BR-3 + CDK12 siRNA-1 and 184-hTERT + CDK12 siRNA-1 (transcriptome); pathway enrichment (NES) in the 184-hTERT transcriptome versus the SK-BR-3 transcriptome, colored by FDR significance in both, one, or neither cell line.]

Figure 3.9: CDK12 regulates the expression of a core set of genes and pathways. A. Enrichment maps from GSEA analysis of differential gene expression resulting from CDK12 depletion in SK-BR-3 and 184-hTERT cells. B. For each pathway, GSEA pre-ranked analysis assigned a normalized enrichment score (NES) representing the extent of over-representation of genes of a pathway at the top or bottom of a ranked list. Positive and negative NES values represent up- and down-regulated pathways, respectively. For each pathway, NES values in SK-BR-3 and 184-hTERT are shown. Red markers represent NES values significant in both cell lines (FDR < 0.1). The dotted red line shows the general trend of these points. Blue and yellow markers represent NES values only significant in SK-BR-3 and 184-hTERT cells, respectively.

[Figure 3.10, panels A-C: histogram of log2(FPKM) for all coding genes with FPKM > 0 (n = 17,826), genes detected in the proteome (n = 7,651), and genes detected in the proteome with ≥ 2 peptides (n = 6,047); volcano plot and fold change distribution for the SK-BR-3 proteome (padj < 0.01, n = 444; up-regulated, n = 270, µ = 0.43; down-regulated, n = 174, µ = -0.43); log2(fold change) in the transcriptome versus the proteome (n = 7,508).]

Figure 3.10: Differential protein expression due to CDK12 regulation represents a subset of differential gene expression events. A. Histogram of RNA-seq expression values (FPKM) for all coding genes and for genes with corresponding proteins detected by mass spectrometry with ≥ 1 unique peptide (blue bars) or ≥ 2 unique peptides (green bars). B. Top: volcano plot of the global proteome analysis in SK-BR-3 cells. The dotted horizontal line denotes the point at which padj = 0.01. Dotted vertical lines delineate events with |fold change| > 1 s.d. (σ) from the mean. Bottom: distribution of fold change values for all differential protein expression events with padj < 0.01. Green vertical lines denote mean fold change (µ) values for up- and down-regulation. Dotted lines are the ± 1 σ lines extended from the top plot. C. Correlation of fold change values from global transcriptome and proteome analysis in SK-BR-3 cells (r² = 0.14 for all events, p < 10^-5). Events with significant fold change values (padj < 0.01) in both datasets are shown in red (r² = 0.88, p < 10^-5). Events significant only in the transcriptome or only in the proteome are colored yellow and blue, respectively.

3.3.4 CDK12 can modulate the expression of DNA damage response genes in SK-BR-3 cells through alternative splicing

Based on microarray differential gene expression analysis, it was proposed that CDK12 regulates the expression of DNA damage repair genes [146]. Our analysis suggests that AS may be a significant mechanism of regulation by CDK12, especially for genes with long transcripts and many exons. One such example we identified in our SK-BR-3 RNA-seq data was the gene encoding the ATM (Ataxia Telangiectasia Mutated) protein. ATM is a key regulatory kinase that responds to DNA double-strand breaks and initiates DNA repair pathways [215]. The canonical isoform of ATM is a 350 kDa protein translated from a 13,147-bp transcript containing 63 exons (Figure 3.11.A). As with many other DDR genes, treatment of SK-BR-3 and 184-hTERT cells with CDK12 siRNA resulted in a down-regulation of ATM mRNA expression (Appendix B). Specific to SK-BR-3 cells, however, CDK12 regulates the expression of ATM through ALE splicing (Figure 3.11.B and C). Examination of the expression of individual ATM exons (Figure 3.11), confirmed by qRT-PCR (Figure 3.11.C), showed that CDK12 depletion resulted in a 1.3-fold down-regulation of most of the exons. However, the terminal exon and 3’UTR were down-regulated more than 4-fold.
This was in contrast to 184-hTERT cells, where CDK12 depletion resulted in a 1.4-fold down-regulation across the entire length of the ATM gene.

These data indicate that in SK-BR-3 cells, the expression of the full-length ATM isoform could be regulated through AS in addition to direct transcriptional control. Using a monoclonal antibody targeting ATM residues 980-1,512 (exons 20-30), it was confirmed at the protein level that the expression of full-length ATM was decreased 3-fold after CDK12 depletion (by Jerry Tien, Christalle Chow, and Leanna Canapi; Figure 3.11.D). These observations suggest that CDK12 can modulate the protein expression of full-length ATM by altering the ratio of different ATM splice isoforms. These results demonstrate that AS is an additional mechanism by which CDK12 can control DNA repair pathways.

[Figure 3.11, panels A to D. Panel A: ATM exon structure on chromosome 11 (genomic coordinates 108,093,211 to 108,239,829) with qRT-PCR regions i (exons 1-3), ii (exons 13-14), iii (exons 31-32) and iv (exons 62-63), and the proximal and distal ALEs marked. Panel B: RNA-seq; log2 exon expression relative to scrambled siRNA against exon number, and normalized counts for ATM exons 62-63 plus 3'UTR (distal ALE), for SK-BR-3 and 184-hTERT cells treated with scrambled siRNA or CDK12 siRNA-1. Panel C: qRT-PCR; log2 expression relative to scrambled siRNA for regions i to iv. Panel D: SK-BR-3 western blot with α-ATM and α-Actin at 80, 40 and 20 μg lysate, with band intensities in arbitrary units (A.U.).]

Figure 3.11: CDK12 regulates the expression of full-length ATM through ALE splicing in SK-BR-3 cells. A. Exon structure of the canonical ATM isoform, corresponding to Ensembl transcript ENST00000278616. Primers for qRT-PCR in (C) were designed to target four exon junctions (i-iv) and are shown as red arrowheads. B. Top: relative expression of each ATM exon after CDK12 depletion in SK-BR-3 (blue circles) and 184-hTERT (orange triangles) cells. Bottom: normalized read counts for the 3' end of the canonical ATM isoform after CDK12 depletion in SK-BR-3 (dark and light blue traces) and 184-hTERT (dark and light orange traces) cells. C. Validation of RNA-seq exon expression analysis by qRT-PCR. Expression levels were determined for the four regions of ATM (i-iv, shown in (A)) after CDK12 depletion in SK-BR-3 (blue circles) and 184-hTERT (orange triangles) cells. Error bars denote the 99% confidence interval range. D. Relative quantification of full-length ATM protein expression due to CDK12 depletion by western blot analysis.

3.3.5 CDK12 down-regulates the long isoform of DNAJB6 and increases the tumorigenicity of breast cancer cells

Pathway analysis of differential gene and protein expression suggests that some CDK12 functions are conserved across cell types. In addition to the cell type-specific regulation described above, we identified common ALE events that were regulated by CDK12 in multiple cell lines. From our experiments with SK-BR-3 and 184-hTERT cells, and the available datasets from HCT-116 cells [145], we found that loss of CDK12 is frequently associated with changes in ALE splicing of the DNAJB6 (DnaJ homolog subfamily B member 6, MRJ) gene (in SK-BR-3 and 184-hTERT, ΔΨavg = 0.21). In our analysis of TCGA RNA-seq data for tumors containing homozygous CDK12 deletions (12 deletion:control pairs), the DNAJB6 ALE event was found in 92% of comparisons on average, as compared to 44% of control (36 control:control pairs) comparisons (Fisher's exact test p = 0.006).

Unlike the long genes that were regulated in a cell type-specific manner, DNAJB6 encodes two small protein isoforms (36 and 27 kDa) from transcripts containing 10 and 8 exons, respectively (Figure 3.12.A). The short isoform of the DNAJB6 protein (DNAJB6-S) is an HSP40 family cytosolic chaperone with implicated roles in Huntington's disease [216, 217].
By contrast, ALE splicing introduces a nuclear localization signal in the long isoform of DNAJB6 (DNAJB6-L), and therefore it operates primarily in the nucleus. Increased nuclear localization of DNAJB6-L has been reported to mitigate tumorigenicity and metastasis in breast and esophageal cancer cells [218, 219].

Our RNA-seq data showed that through ALE splicing, higher CDK12 expression in SK-BR-3 cells reduced the expression of DNAJB6-L (Figure 3.12.B), consistent with CDK12 functioning as an oncogene. We tested this hypothesis using MDA-MB-231 cells, a highly invasive triple-negative breast cancer cell line where DNAJB6-L had been previously shown to decrease cell migration potential [218]. Global proteome pathway analysis of CDK12-depleted MDA-MB-231 cells largely resembled results from SK-BR-3 cells, with the exception of the down-regulation of cell cycle and cell division proteins that was not seen in SK-BR-3 cells (Appendix B). This analysis further supported the use of MDA-MB-231 cells to examine the effects of CDK12 on tumorigenicity. Using qRT-PCR and western blot analysis, we confirmed that MDA-MB-231 cells treated with CDK12 siRNA increased gene and protein expression of DNAJB6-L (and decreased expression of DNAJB6-S) as compared to a scrambled siRNA control (Figure 3.12.C and D).

[Figure 3.12, panels A to D. Panel A: exon structure of DNAJB6-L and DNAJB6-S on chromosome 7 (genomic coordinates 157,128,075 to 157,210,133). Panel B: SK-BR-3 RNA-seq (Cufflinks); FPKM of DNAJB6-L and DNAJB6-S for scrambled siRNA and CDK12 siRNA-1. Panel C: qRT-PCR in SK-BR-3 and MDA-MB-231 cells; log2 expression relative to scrambled siRNA for CDK12 (exons 12-13) and for the DNAJB6 -L and -S ALEs. Panel D: MDA-MB-231 western blot with α-DNAJB6 and α-Actin at 32, 16 and 8 μg lysate, with band intensities in arbitrary units (A.U.).]

Figure 3.12: CDK12 down-regulates the long isoform of DNAJB6 through ALE splicing. A. Exon structure of the long (-L) and short (-S) isoforms of DNAJB6, corresponding to Ensembl transcripts ENST00000262177 and ENST00000429029, respectively. B. Quantification of DNAJB6-L and DNAJB6-S transcript levels (FPKM) after CDK12 depletion in SK-BR-3 cells by RNA-seq using CUFFLINKS. Error bars represent s.d. C. Validation of changes in DNAJB6-L and DNAJB6-S transcript expression after CDK12 depletion in SK-BR-3 and MDA-MB-231 cells by qRT-PCR. Error bars denote the 99% confidence interval range. D. Relative quantification of changes in DNAJB6-L and DNAJB6-S protein expression due to CDK12 depletion in MDA-MB-231 cells by western blot analysis.

To examine the cellular phenotype associated with CDK12 expression, a scratch wound assay and live cell imaging of MDA-MB-231 cells were used (by Grace Cheng and Jerry Tien) as a functional test for cell migration. The experiments show that the ability of MDA-MB-231 cells to invade is correlated with CDK12 expression and inversely correlated with the expression level of DNAJB6-L, and suggest that CDK12 can increase the tumorigenicity of an invasive breast cancer cell line, likely through ALE splicing of the DNAJB6 gene [2].

3.4 Discussion

We showed that CDK12 regulates ALE splicing in a cell type-specific manner. Prior to this study, the global effect of CDK12 on AS was uncharacterized, and opposing conclusions had been made regarding its role in gene expression. While several studies proposed that CDK12 specifically affects a small number of genes [146, 220], another report suggested a global up-regulation of transcription [145].
Here, we applied stringent criteria, combining RNA-seq datasets in biological triplicates from two different cell lines to identify AS and differential gene expression events with high confidence.

We found that the regulation of ALE splicing and differential gene expression by CDK12 was limited to a small subset of genes and the nature of this regulation was highly cell type-specific. In 184-hTERT cells, CDK12 both up- and down-regulated the expression of genes and pathways. Using the same statistical criteria in SK-BR-3 cells, CDK12 both up- and down-regulated genes, but the most significantly affected pathways were all down-regulated after CDK12 depletion. Down-regulation of pathways in SK-BR-3 cells is consistent with the role of CDK12 in increasing the rate of transcription elongation.

Importantly, our proteomic analysis of SK-BR-3 cells suggests that not all CDK12-mediated transcriptional regulation manifests at the protein level. For example, pathways relating to cell cycle and cell division were down-regulated in the transcriptome of SK-BR-3 cells after CDK12 depletion, but not in the proteome. These results could reflect additional layers of regulation at the protein level, including the modulation of translation, post-translational modifications, and protein turnover/proteolysis. An additional factor to explain this observation could be a dominant effect of HER2 over-expression on many pathways [221]. Consistent with this idea, loss of CDK12 significantly down-regulates cell cycle and cell division proteins in MDA-MB-231 cells, which do not have HER2 amplification.

In general, we found that CDK12 regulates ALE splicing of genes with long transcripts and high numbers of exons. This trend was significantly more pronounced in ALE splicing events regulated by CDK12, rather than in differential gene expression events as previously reported for HeLa cells [146].
Furthermore, in a majority of events, native CDK12 promoted the splicing of the longer mRNA isoform.

The simplest model for CDK12 regulation of pre-mRNA processing is that CDK12 increases the processivity and/or rate of elongation to achieve successful splicing of one exon to the next exon. In the absence of CDK12, this splicing event is reduced due to decreased processivity, and transcription defaults to termination and polyadenylation of what then becomes the last exon (the proximal ALE). However, this simple model cannot explain all our major observations. For instance, it is unclear how the proximal ALE is selected amongst all the exons within a long transcript. Notably, we did not observe any difference in the density of polyadenylation motifs in the 3'UTRs of CDK12-regulated ALEs.

It is also not known how CDK12 achieves regulation of only a small subset of genes that differs depending on cell type. This is possibly accomplished by the various tissue-specific splicing regulatory factors that associate with CDK12 or by signal transduction processes that regulate the action of CDK12 and/or its interacting proteins. The processivity and elongation model also does not explain ALE splicing that promotes the shorter mRNA isoform, as observed with a minority of genes (~20%). One such gene, DNAJB6, is regulated by CDK12 in multiple cell types and tumors, suggesting a gene-specific regulation that differs from the possible length-dependent regulation common to other ALE events. Therefore, it is probable that regulation of AS by CDK12 also requires additional splicing factors such as the SR proteins, hnRNPs, and RNA processing factors identified in our immunoprecipitation experiments. Future studies should be aimed at determining the precise role of these regulatory proteins in CDK12-dependent regulation of splicing.

Our results show that CDK12 regulates the DNA damage response through multiple mechanisms.
One of the most consistently reported functions of CDK12 has been the regulation of the DDR. Differential expression of specific DDR genes was first identified by microarray analysis [146], and changes in DDR pathways were determined from transcriptome analysis [145]. Furthermore, CDK12 depletion was found to be synthetic lethal with PARP inhibition [151, 183, 185]. This behavior is reminiscent of the sensitivity of BRCA1/BRCA2-deficient tumors to PARP inhibitors [222–224], suggesting that CDK12 may be specifically involved in the HDR pathway. Indeed, ovarian tumors containing CDK12 mutations exhibited down-regulation of several HDR genes [152].

In all cell types we examined, CDK12 regulated gene and protein expression of components of the DDR pathway. Furthermore, our RNA-seq data for SK-BR-3 cells suggest that CDK12 may be a key regulator of HDR through ALE splicing of ATM, a master regulating kinase that directly responds to DNA damage. The splicing-dependent regulation of ATM in SK-BR-3 cells was independent of transcriptional regulation, whereas in 184-hTERT cells there was modest transcriptional regulation of ATM. By compiling our data and those on gene regulation from the literature [145, 146], it is apparent that gene regulation and AS regulation by CDK12 are both cell type-specific and gene-specific. Furthermore, while CDK12 alters the transcription of some genes, it can also modulate the splicing of functional isoforms of DDR genes.

In line with our findings, experiments exploring the effect of loss-of-function mutations in CDK12 on the DDR suggest that CDK12 is a tumor suppressor gene. However, several observations show that CDK12 can also function as an oncogene. This is particularly pertinent in breast cancers, where CDK12 is frequently co-amplified with the HER2 oncogene. Over-expression of CDK12 is correlated with aggressive tumor behaviour and poor survival [134, 180, 182].
Notably, these properties also apply to the small fraction of tumors where CDK12 is amplified but HER2 is not, suggesting an oncogenic potential independent of HER2 [144].

Our RNA-seq experiments examining a breast cancer cell line over-expressing CDK12 (SK-BR-3 cells) identified AS events that could promote tumorigenesis. These events were also found in our analysis of TCGA RNA-seq data of ovarian tumors containing CDK12 amplifications. One notable AS event regulated by CDK12 and identified in multiple cell types and tumors was the ALE splicing of DNAJB6. Recent studies show that the long isoform of DNAJB6 (DNAJB6-L) suppresses cell migration and invasion in MDA-MB-231 cells [218]. While the mechanism driving this activity was unclear, it was dependent on the ALE splicing and subsequent nuclear localization of DNAJB6-L.

Using the same MDA-MB-231 cell line model, we showed that CDK12 expression is inversely correlated with ALE splicing of DNAJB6-L. The ability of cancer cells to migrate and invade is a fundamental mechanism underlying tumorigenesis and metastasis [225]. MDA-MB-231 cells can seed tumors in mouse models, and increasing DNAJB6-L expression decreases tumor growth and metastasis in athymic mice [218]. Therefore, the ability of CDK12 over-expression to down-regulate DNAJB6-L through ALE splicing represents a specific cellular mechanism by which CDK12 can increase the tumorigenicity of breast cancer cells. This could be a significant factor contributing to the progression of HER2+ breast cancers, where CDK12 is co-amplified in 27-92% of cases [186–194].

In this study, we applied a comprehensive genomic and proteomic approach to define the cellular functions of CDK12 and to investigate its oncogenic properties. We showed that in multiple cell lines, CDK12 regulated a core set of cellular processes including RNA processing and DNA repair. We also found that CDK12 regulated ALE splicing, primarily of genes with long transcripts and a large number of exons.
While this regulation mechanism is present in multiple cell lines, the affected genes are highly cell type-specific. In SK-BR-3 cells, CDK12 modulated ALE splicing to promote the generation of full-length ATM, a key component of DNA repair associated with tumorigenesis. CDK12 also regulated splicing of DNAJB6, whose nuclear localization attenuates tumor invasion. In MDA-MB-231 cells, CDK12 promoted tumor migration and invasion in a dose-dependent manner. Together, these results show how loss of CDK12 can disrupt DNA repair, but also demonstrate an AS-dependent mechanism by which CDK12 over-expression can increase the tumorigenicity of breast cancer cells.

Chapter 4

Investigating Cellular Responses upon Inhibiting Components of Splicing Machinery

4.1 Introduction

Alternative splicing is precisely regulated through complex interactions of a large number of proteins, RNA molecules, and environmental stimuli [26]. The complex interplay between components of this machinery is essential to maintain cell functions. Consequently, a considerable number of genetic diseases have been linked to mutations that impair splicing. For instance, more than 15% of disease-causing genetic mutations are believed to disturb splicing [226].

Disruption of splicing has been implicated in many diseases, including growth hormone deficiency, Parkinson's disease, cystic fibrosis, retinitis pigmentosa, spinal muscular atrophy, and several types of cancer [5, 227, 228]. Genes important for tumor biology, involved in processes such as cell cycle regulation and apoptosis, are usually regulated by alternative splicing [228]. In this case, aberrant splicing events in genes with specific functions can lead to uncontrolled growth and survival of cells [229, 230]. These aberrant splicing events are usually a consequence of mutations in components of the splicing machinery, 3' and 5' splice sites, or splicing silencers and enhancers.
As a result, the splicing mechanism contributes to the development and progression of tumors [231].

Since the recognition of cancer-specific splice variants, splicing has been increasingly appreciated as a potential therapeutic target [232]. Conceptually, two strategies are being investigated. The first strategy interferes with the components of the splicing machinery: if these components are more crucial for tumor cells than for normal cells, the interference may provide a therapeutic advantage [233]. Nevertheless, because the spliceosome components modulate the splicing of an extensive number of genes, the corresponding drugs may display cytotoxic effects. As an alternative, in the second strategy, tumor-specific splicing events are targeted directly [233]. This approach is expected to have fewer off-target effects, but it requires identifying the key splicing events in order to establish better treatment potential.

A primary step in the development of splicing-related therapeutics is understanding how components of the spliceosome contribute to the regulation of splicing, and uncovering how they interact to maintain the balance between isoforms. In general, the splicing machinery can be modelled as a dynamic system of interactions. To understand the regulation of this system, we can interfere with it at multiple points (e.g., inhibiting one protein at a time) and evaluate the system's response. Systematically integrating the results of these measurements then leads to a model capable of explaining our observations and predicting the system's responses in further conditions. Several methods have already been proposed to infer genetic interactions and relations using perturbation screens [234–236].

Advances in the development of pharmacological agents have improved the opportunities for the systematic study of biological systems.
Despite the growing evidence of the importance of the splicing mechanism in maintaining normal cellular functions, much remains unknown about its regulation in mammalian systems. Most of our current understanding of the spliceosome has been obtained by studying model organisms. Only recently, with the development of pharmacological agents, have we acquired the opportunity to systematically interfere with spliceosome components at different levels to inhibit their functions in human cells. In other words, several inhibition levels can be tested by applying different dosages of pharmacological agents in order to investigate gradual changes in cellular responses. Moreover, following the inhibition, RNA-seq enables us to gauge the corresponding changes on a genome-wide scale. Finally, we can cluster response patterns to determine groups of genes that may undergo similar regulation. For instance, genes that show monotonically increasing or decreasing response patterns have a higher chance of being primary targets of the inhibited proteins.

When a gene is inhibited, we are interested in knowing what its direct targets are, which pathways undergo differential regulation as a result, and what the main regulators of the observed differences are. Here, I first briefly review the methods proposed to identify the primary pathways and genes that are most likely to trigger the differential regulation of the large number of other genes typically observed in genome-wide RNA-seq studies. Following that, I present the data set in which multiple components of the splicing machinery are inhibited using small compounds, and finally, I show some preliminary results on how our data can help to understand the functions of these components.

4.2 Identifying Pathways and genes contributing most to cellular responses: A short review

Human cells are remarkably complex systems with thousands of genes whose interactions are organized in order to maintain appropriate responses based on a given condition.
One important goal in biology is to understand, predict, and ideally advantageously manipulate emergent responses of these complex systems [237].

To gain mechanistic insight into how a stimulus drives a cellular response, how a tissue differentially regulates genes, or how a disease state deviates from a normal state, one needs to interpret the measurable differences between conditions. RNA-seq is one of the promising tools for studying complex cellular mechanisms. By simultaneously measuring transcript abundance, RNA-seq provides snapshots of a cell's status in a particular condition. When two conditions are compared using RNA-seq snapshots, hundreds or thousands of genes may exhibit alterations in transcript abundance. Some changes are a direct consequence of the modified condition (these are of most interest), some are secondary effects of the direct targets, and others may be due to errors or inherent stochasticity in RNA-seq sampling.

An additional source of knowledge to complement RNA-seq measurements is the information on known gene interactions stored in knowledge bases [238–243]. The number of these interactions grows rapidly. As an example, the number of non-redundant physical interactions in the BIOGRID [242, 243] database has increased more than tenfold over the past ten years. It should be noted that many of the stored interactions are context-specific and do not always apply to a particular cell state, some are inferred from high-throughput experiments with lower confidence, and some may be reported in very few studies. Thus, a careful strategy should be designed to benefit most from these knowledge bases, while filtering out irrelevant information.

In a study by Rolland et al. [244], it was estimated that among all possible binary interactions (direct interactions) between human proteins, fewer than 10% are known.
This estimate does not consider the impact of alternative splicing and could be optimistic, and thus highlights the limitation of our knowledge in this research area. Therefore, when investigating cellular responses, potential novel interactions could also be taken into account. Here, our focus is on methods that are knowledge-driven, and we do not consider the problem of inferring novel interactions from RNA-seq data.

Given access to prior knowledge of interactions and to RNA-seq measurements, a central question is therefore which pathways or genes, through their differential regulation in an experiment, contribute most to the observed variations. In this short review, we explain the two alternative approaches, but focus only on the methods that aim to provide rules and mechanisms governing the observed differences.

We group methods proposed to interpret variation between conditions into two broad classes. The first class consists of approaches that try to find gene sets or pathways that are enriched in differentially expressed genes. These methods usually assess whether the genes of a specific gene set are over-represented in a given set of N differentially expressed genes compared to a randomly selected set of N genes. In a conceptually different approach, methods in the second class are designed to infer pathways or a small set of upstream genes driving cascades of changes that lead to the observed measurements.

The first class of methods aims to summarize a list of identified differentially transcribed genes into smaller sets of genes that are somehow connected: either they participate in the same pathway, or they take part in related functions. A large number of methods have been proposed based on this point of view [245, 246]. Khatri et al. [247] categorized these approaches into three smaller groups.
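Before turning to these groups, the over-representation question posed above (are the genes of a gene set more frequent among N differentially expressed genes than among N randomly selected genes?) can be illustrated with a hypergeometric tail probability. The following is only a minimal sketch under stated assumptions: the function name, its parameters and the example numbers are hypothetical, and the reviewed tools may use different statistics and multiple-testing corrections.

```python
from math import comb


def over_representation_pvalue(universe_size, set_size, n_de, overlap):
    """Hypergeometric tail probability P(X >= overlap).

    Hypothetical parameters (not taken from any reviewed tool):
      universe_size: number of genes that could have been selected
      set_size:      number of universe genes belonging to the gene set
      n_de:          number N of differentially expressed genes
      overlap:       number of differentially expressed genes in the gene set
    """
    max_overlap = min(set_size, n_de)
    total = comb(universe_size, n_de)
    # Sum the probabilities of observing `overlap` or more gene-set members
    # when n_de genes are drawn at random without replacement.
    tail = sum(
        comb(set_size, k) * comb(universe_size - set_size, n_de - k)
        for k in range(overlap, max_overlap + 1)
    )
    return tail / total


# Illustrative numbers only: 20,000 genes, a 100-gene pathway,
# 500 differentially expressed genes, 10 of them in the pathway.
p_value = over_representation_pvalue(20000, 100, 500, 10)
print(p_value)
```

A small p-value indicates that the overlap is larger than expected for a random set of N genes; in practice, such values are corrected for the number of gene sets tested.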
In the first group, a threshold is applied to select a set of genes showing significant alterations, and the selected genes are then statistically assessed to detect their over-representation in predefined pathways or gene sets [248–252]. As an improvement, the second group incorporates the fold change magnitudes of the gene expression values in the statistical evaluation [202, 253, 254]. Finally, the last group modifies pathway scores to also account for interactions between pathway genes [255, 256].

Although gene set and pathway enrichment based methods have been successful in organizing results and highlighting affected functions and pathways, they usually fall short in predicting driver genes. These techniques are ineffective in spotting the genes that govern the usually large number of genes that undergo differential regulation. In other words, these methods cannot provide insights into the underlying mechanisms controlling the transition between the two given conditions [257]. Another limitation of these methods is that they only present a transcriptional view of the variations; however, many of the interactions do not directly influence changes in transcript abundance [257].

To address the limitations of the pathway and gene set enrichment based methods, a second class of methods focuses on detecting a small number of driver genes or a few pathways whose mis-regulation provides a mechanistic explanation of the measured variations (Figure 4.1). These methods, which rely on the quality and quantity of knowledge bases for the existence and direction of interactions in regulatory networks, are explained in the following.

Among the methods we discuss here that try to perform mechanistic inference, LPIA (Latent Pathway Identification Analysis) [257] is the most similar to pathway enrichment based methods.
Similar to previous methods, the output of the algorithm consists of pathways; however, here the pathways are scored based on their potential to initiate cascades of changes in other pathways. More specifically, the method first constructs a network of pathways, with one node for each pathway of a given knowledge base. Next, it assigns weighted edges between pathways. For each pair of pathways (pair of nodes in the generated network), the assigned weight reflects the number of GO (Gene Ontology) terms that are common between the genes of the two pathways, and also the number of common genes between the pathways that are differentially expressed. As a result, the assigned weights express both the prior belief on how much the two pathways are related, and the context-dependent (based on the experiment) measure of their interactions. In the final step, the pathways that are more central in the constructed network are reported as potential causal pathways.

[Figure 4.1, panels A to D. Labels recovered from the image: pathways P1 to P4; genes G1 to G6; Control; Treatment; Transcription factors; Observations; Protein kinases; Candidate regulator.]

Figure 4.1: Methods proposed to perform mechanistic inference using high-throughput sequencing data. A. The LPIA method [257] searches for pathways central to a set of disrupted pathways that have the potential to initiate alterations in other pathways. Each Pi represents a pathway consisting of a set of genes. Edge thickness displays how strongly pathways are believed to be inter-related. For instance, the pathway shown in blue (P3) will be reported as an upstream regulatory pathway based on its connections. B. The DEMAND method [258] searches for dysregulated interactions between genes. The joint probability density of the expression of interacting genes is compared between the two conditions (here between the genes G1 and G6), and genes whose interactions are significantly altered are reported as upstream causal genes. C. The third group of methods uses the direction and sign of interactions to compare the predicted versus observed changes upon disruption of a given gene (here the gene shown in orange). Shaded circles display genes whose transcript abundance is observed and is expected to alter. D. The last group of methods performs multiple levels of inference in order to connect several regulatory levels. For example, based on the observed changes (black nodes), active transcription factors are identified (dark grey nodes), and in the next level of the analysis, candidate regulatory kinases are determined (the light grey node). For more information see the main text.

A main advantage of this method compared to pathway enrichment analysis is that the genes in the top reported pathways not only show differential regulation, but can also describe observed changes in other pathways. On the other hand, one limitation of LPIA is that it does not use the interactions between genes and their directions to increase the confidence of causal inference. Additionally, taking the magnitude of the observed variations in gene expression into account could improve the scoring scheme.

To find genes driving the transcriptional transition between the two conditions, DEMAND [258] (DEtecting Mechanism of Action by Network Dysregulation) searches for genes whose known interactions are significantly dysregulated. The method was primarily proposed to identify the mechanism of action (MoA) of a compound, defined as the targets essential to cause the pharmacological effect of the compound.

The underlying assumption in the DEMAND method is that if a gene belongs to the MoA of a compound, then its direct targets are more likely to be dysregulated compared to random genes. As a result, DEMAND evaluates the changes in the joint gene expression probability density of candidate genes using the Kullback-Leibler divergence (KLD). Estimating joint probability distributions may require many samples; however, DEMAND is claimed to efficiently detect the corresponding changes by applying KLD. In the final step, the evaluated dysregulations between a gene and its neighbors are combined and a p-value is assigned to each candidate gene. The method has been successfully applied to classify compounds with similar functions and targets.

The first limitation of the DEMAND method is that it only considers first order neighbours (directly connected to the gene of interest), without considering the direction of regulation. Incorporating this additional information may improve the algorithm. Also, the assumption of expected alterations of gene expression can be violated when the regulation does not happen at the transcriptional level. These issues are addressed in the next class of methods.

The next group of methods consolidates the direction and sign of known interactions with differential expression analysis to discriminate upstream causal genes from others [259–266]. In most of these methods, a directed graph is constructed with one node for each network entity. The nodes represent transcripts, proteins, small molecules and compounds. Interactions between these entities are compiled from various databases. Interactions should ideally be signed and directed, showing the direction of the regulation and whether the regulation is activation or inhibition. In the inference step, the common framework is to predict the expected change of downstream genes using signed directed paths, and then to score candidates by comparing the expected and measured values.

In one of these methods, Chindelevitch et al. [259] evaluated the expected direction of changes for measured entities (transcript abundance) downstream of a candidate gene, assuming the candidate gene is disrupted. In their evaluation, they considered directions and signs of shortest paths from the candidate genes to those measured values.
Next, they introduced a scoring scheme that rewards correct predictions and penalizes incorrect ones. In the final step, they compared the computed scores to those of randomized situations in order to assign a p-value to each candidate upstream regulator. Several improvements have been applied in the IPA (Ingenuity Pathway Analysis) approach [261]. The technique also takes advantage of edge weights (indicating the confidence in edge direction), and additionally determines interactions between upstream regulators that are relevant to explaining the variations. Zarringhalam et al. implemented similar ideas in a Bayesian framework [262] and also incorporated the context dependence of edges in their study. Their approach is, however, limited to direct interactions (paths of length one): applying similar statistical inference to genes connected by longer paths rapidly increases the computational complexity of the problem, so the information carried in cascades of interactions is inevitably ignored. Finally, several algorithms that differ in their use of edge weights, fold change values, and types of paths between nodes (i.e. only shortest vs. all paths) were compared by Jaeger et al. [263].

A clear limitation of these methods is their strong dependence on the quality of the prior networks generated from available knowledge bases. Sign and direction information is scarce, and much of it depends heavily on the context. As a consequence, some of the methods use knowledge bases that are not publicly available (for example, from Ingenuity Inc. (http://www.ingenuity.com) or Selventa Inc. (http://www.selventa.com)) with more curated information.
This limitation may be temporary given the huge amount of high-throughput data generated these days; however, a well-studied interaction network together with high-quality high-throughput data seems essential to benchmark the proposed methods.

Finally, the last group of methods in our classification contains methods that apply one layer/type of regulator-target detection at each step. Lefebvre et al. constructed a network consisting of context-specific transcription factors and their binding sites in human B cells, and introduced a method called MARINA (MAster Regulator INference algorithm) to identify context-specific transcription factors with regulatory roles [267]. Genes are ranked by the magnitude of their down- or up-regulation, and the method uses GSEA [202] to evaluate whether the targets of a candidate transcription factor are enriched at the top or bottom of the ranked list. The algorithms proposed in [268, 269] also investigate this layer of regulation, but they use different knowledge bases to build the initial graph and apply different enrichment techniques. In addition, as a further layer, these methods incorporate protein-protein interaction databases to detect key proteins involved in the differential regulation. As a final step, the EXPRESSION2KINASE method [270] employs kinase enrichment analysis to find kinases that potentially phosphorylate the proteins detected in the previous layer of the analysis.

As with some of the methods discussed before, the strength of these methods also depends strongly on the quality of the generated influence graphs.
The methods are very informative, providing clear insights into the mechanisms underlying the regulation at different regulatory levels; however, approaches that integrate information from multiple regulatory layers [260, 261] may be more sensitive in detecting weaker signals.

Most of the methods we discussed here were proposed to efficiently identify the minimal set of genes or pathways necessary for a cell to make a transition from state A to state B. However, it would be helpful to take several snapshots of the transient states between the initial and final states as an additional guideline. For instance, when we want to study the function of a gene as a component of a machinery, instead of only comparing the two states where the gene is active (control experiment) and where it is knocked down, it would be helpful to investigate the situation where the gene is 50% active. In the remaining part of this chapter, we discuss the response curves of cells upon inhibiting genes at different levels using pharmaceutical compounds. We present preliminary results showing that clustering of response patterns helps to identify gene functions, and in the discussion section, we explain how we think this type of data can be incorporated into causal reasoning algorithms.

4.3 Analyzing gene expression and splicing through inhibiting splicing components at multiple levels: preliminary results

So far, we have reviewed methods that extract biological knowledge from the comparison of two RNA-seq experiments by identifying a smaller number of genes or pathways essential to attain the observed results. These two experiments may originate, for example, from a disease state and a normal state, or from a knockdown experiment and the analogous control experiment. In addition to the samples from the two conditions, measurements from intermediate conditions can also be informative for better understanding and tracking of variations.
For instance, data may provide measurements at multiple time points of the transition, or at different imposed inhibition levels.

In this section, we present results of analyzing a data set consisting of RNA-seq experiments in which proteins are targeted with pharmaceuticals. The target proteins of these pharmaceuticals are known to directly or indirectly influence splicing. We show that by appropriately using our data, we get consistent results when we investigate different cell lines, or different compounds that target the same protein.

4.3.1 Materials and methods

Our data consists of samples from multiple concentrations of three pharmaceuticals. The first compound, T3, targets CDC-like kinases (CLKs) and has been shown to have a high specificity for CLK1-3 protein isoforms [271]. The two other compounds (T-202 and T-595) target the EIF4A3 (Eukaryotic translation initiation factor 4A-III) protein.

EIF4A3 is known to play roles in translation initiation, splicing and ribosome assembly [272, 273]. The EIF4A3 data consists of RNA-seq libraries generated from 5 inhibition levels of the EIF4A3 protein, each applied with two different pharmaceutical compounds (T-202 and T-595) in two cell lines: HeLa (derived from cervical cancer) and HCT-116 (derived from human colon carcinoma). In total, the EIF4A3 drug data comprises 22 RNA-seq libraries: 2 control libraries in addition to 20 drug-treated libraries. Additionally, we have RNA-seq libraries from 3 different siRNAs directed against EIF4A3 in the HeLa cell line, as well as a corresponding control RNAi experiment.

CLKs are also known to contribute to the regulation of splicing. In particular, phosphorylation by CLK proteins is required for SR proteins to participate in the splicing mechanism [274]. The CLK data consists of libraries generated from treating HCT116 and hTert cell lines with three different concentrations of the T3 compound, and two control libraries (one for each cell line).
For the CLK data, we use stranded libraries previously published by Funnell et al. [271]. Figure 4.2 summarizes our drug RNA-seq libraries and our analysis workflow. These drugs were developed by Takeda Pharmaceutical Company Limited; their specificity and efficacy were previously investigated [271, 275, 276], and the RNA-seq experimental procedures were previously described [271].

The paired-end reads of our libraries were aligned to the reference genome (hg19, downloaded from the UCSC genome browser [43]) using GSNAP [197] with its “novel splicing” parameter enabled. The corresponding gene annotation file was downloaded from ENSEMBL [159]. Following the alignment step, duplicate reads were removed using SAMTOOLS [162, 198]. Next, gene and isoform abundances were computed with the CUFFLINKS [167] package, yielding one FPKM value per gene for each inhibition level. The computed FPKM values for a cell line and a compound were combined to form gene responses across the multiple treatments.

Next, we applied WGCNA (Weighted correlation network analysis) [277] to cluster genes exhibiting correlated response patterns. We filtered out genes and isoforms with FPKM value < 1. Moreover, we only considered genes for which the maximum expression level is at least 50% larger than the smallest observed expression value in a set of experiments performed by changing compound levels. This was done to remove genes with small variations across treatments. Next, to determine gene functions, we applied GO enrichment analysis to the gene clusters using BINGO [278]. BINGO takes a list of genes as input, examines the over-representation of genes in GO sets within that list, and reports an FDR (false discovery rate) value for each GO set. We consider GO terms with an FDR value smaller than 0.05 to be statistically significant. Finally, to summarize the enriched GO terms, we cluster them using the ENRICHMENTMAP plugin [203] of CYTOSCAPE [204].

[Figure 4.2 image not reproduced: workflow from RNA extraction through unstranded RNA-seq (HCT116 and HeLa treated with T_202 and T_595) or stranded RNA-seq (HCT116 and 184hTert treated with T3), GSNAP alignment, alignment post-processing, and MISO and Cufflinks analyses; dose axes in μM.]
Figure 4.2: Our systematic approach to study proteins via gradual inhibition. For each protein, two cell lines were treated with multiple concentrations of pharmaceuticals for 6 hours. Stranded paired-end RNA-seq libraries were generated for CLK and unstranded paired-end RNA-seq libraries for the EIF4A3 protein. Reads were aligned using GSNAP [197]; MISO [73] and CUFFLINKS [167] analyses were performed on each data set separately.

We used the MISO package [73] to find differential splicing events when drug-treated RNA-seq samples were compared to control (no treatment) samples. As explained in previous chapters, MISO detects and differentiates 8 types of splicing events by applying a statistical framework. These event types are: skipped exons (SE), retained introns (RI), alternative 3’/5’ splice sites (A3SS/A5SS), tandem 3’ UTRs, mutually exclusive exons (MXE), alternative first exons (AFE), and alternative last exons (ALE). The method assumes two potential splice variants for each event, and assigns a Ψ value (percent spliced in) to one of the isoforms in each of the two given conditions. Additionally, it reports a BF (Bayes factor) value as a measure of confidence that the event is differentially expressed. We filter out events with a |∆Ψ| value smaller than 0.1 or a BF value smaller than 20.

4.3.2 Results

Inhibiting our target proteins using pharmaceuticals imposes dose-dependent splicing regulation. First, we investigated the regulation of alternative splicing upon increasing inhibition levels of the proteins. Figure 4.3 illustrates the number of identified differentially spliced events when treated samples were compared to control samples.
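As an illustration of the filtering criteria from Section 4.3.1 that underlie these counts, the two thresholds can be sketched in Python as follows. This is a minimal sketch rather than the scripts used in this work: the function names, the input layout, and the reading of the FPKM filter as "the maximum FPKM across doses must reach 1" are assumptions.

```python
def passes_expression_filter(fpkm_by_dose, min_fpkm=1.0, min_ratio=1.5):
    """Keep a gene whose response varies enough across inhibitor levels.

    fpkm_by_dose: FPKM values of one gene, one per compound concentration.
    Assumption: a gene is considered expressed if its maximum FPKM is at
    least `min_fpkm`; it is considered variable if its maximum is at least
    50% larger than its minimum (min_ratio = 1.5).
    """
    highest = max(fpkm_by_dose)
    lowest = min(fpkm_by_dose)
    return highest >= min_fpkm and highest >= min_ratio * lowest


def is_differential_event(delta_psi, bayes_factor,
                          min_abs_delta_psi=0.1, min_bayes_factor=20.0):
    """Keep a splicing event with |delta Psi| >= 0.1 and Bayes factor >= 20."""
    return (abs(delta_psi) >= min_abs_delta_psi
            and bayes_factor >= min_bayes_factor)


# Hypothetical events as (event id, delta Psi, Bayes factor) tuples.
events = [
    ("SE_1", -0.25, 35.0),   # kept
    ("RI_1", 0.05, 100.0),   # dropped: |delta Psi| too small
    ("ALE_1", 0.30, 10.0),   # dropped: Bayes factor too small
]
kept_events = [name for name, dpsi, bf in events
               if is_differential_event(dpsi, bf)]
```

Counting the entries of `kept_events` per event type and per dose would give the kind of summary that Figure 4.3 presents.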
All three inhibitors cause an increase in the number of detected splicing events at higher compound concentrations compared to lower concentrations. Moreover, the type of regulation is maintained across the different cell lines inspected and also between the two drugs targeting EIF4A3.

The results also suggest distinct contributions of the two proteins to regulating different splicing types. Although the same database of splicing events was used when applying the MISO pipeline, the proportions of splicing types regulated by the proteins are different. While 4 AS types are almost equally abundant among the events detected for EIF4A3, CLKs seem to predominantly regulate the SE type. Additionally, a much larger number of AS events are influenced by CLK inhibition. The results of these experiments can be utilized to determine genes and splicing regions that are more sensitive to disruption of a gene function. Events detected at lower drug levels may help in uncovering cis-regulatory motifs related to a protein.

Inhibiting proteins with pharmaceutical compounds partially reproduces the results of knockdown experiments. We next sought to determine whether treating cells with drugs reproduces the results of knockdown experiments with siRNAs. To compare the results, we took advantage of the EIF4A3 data, for which we have 3 different siRNAs targeting EIF4A3 transcripts and the corresponding control siRNA. We paired the data from each EIF4A3 siRNA with the control siRNA data and detected splicing events using the MISO package [73]. Events showing BF value ≥ 10 and |∆Ψ| ≥ 0.1 in the three knockdown:control comparisons were reported, and the overlaps between them are represented in Figure 4.4.A.
About 31% of all events are observed in at least two of the three knockdown:control comparisons.

[Figure 4.3 image not reproduced: six panels (EIF4A3 HCT116 T_202, EIF4A3 HCT116 T_595, CLKs HCT116 T3, EIF4A3 HeLa T_202, EIF4A3 HeLa T_595, CLKs 184hTert T3) showing Count against Dose (μM) for the event types A3SS, A5SS, AFE, ALE, MXE, RI and SE.]
Figure 4.3: Splicing response patterns upon increasing inhibitor levels. The figure presents the number of events of each splicing type at multiple inhibitor levels for three compounds. EIF4A3 and CLK proteins are inhibited at 5 and 3 different levels, respectively. For all types of splicing, the number of differentially spliced events generally increases with increasing inhibitor level. Similar patterns of AS regulation are observed when EIF4A3 is inhibited in the two different cell lines and using the two distinct drugs; this pattern is different from the pattern of events undergoing AS regulation upon CLK inhibition.

Figure 4.4.B shows a Venn diagram of the overlap between events found in the siRNA knockdown experiments and the drug inhibition experiments. We classified the identified events in the T-202 drug inhibition data into “low dose” and “high dose” groups based on the concentration of the treatment at which the events were predicted. Events detected at drug concentrations of 0.5 µM and 2 µM were classified as “low dose” events and those predicted at higher drug concentrations were classified as “high dose” events. The majority of events detected in the “low dose” group were also detected in the “high dose” group (60%); however, only 38% of the events in the union of the three siRNAs were also identified in the drug inhibition experiments. The different specificities of the siRNAs and the drug and their distinct off-target effects, together with the uncertainties and inevitable noise in RNA-seq data, can explain some of the observed dissimilarity.
This shows the importance of performing independent experiments to distinguish more confident or more sensitive events from events less dependent on the gene of interest or potential artifacts.

Genes showing monotonic responses in different cell lines account for similar functions. In order to assess the potential of utilizing multiple-level inhibition data in exploring gene and drug functions, we clustered gene response patterns for each cell line and compound in the EIF4A3 inhibition data using WGCNA [277]. Figure 4.5 presents the clusters sorted by the number of genes in them. For 3 out of 4 of our compound-cell line pair experiments, there exist two dominant clusters consisting of genes following a generally monotonically increasing or monotonically decreasing pattern. The two patterns are also observed in the remaining compound-cell line experiment, where they constitute two of the top four dominant clusters.

Here, we assume that primary targets of a protein tend to show monotonic responses upon increasing inhibition much more often than a set of randomly selected genes. Genes subject to secondary effects, and random genes, are expected to receive the inhibition signal at a lower magnitude, only after several other rounds of regulation have been imposed on the signal. Based on this idea, we performed GO enrichment analysis by applying BINGO [278] to uncover primary functions of the EIF4A3 protein in the four sets of libraries that we have.

Parts A and B in Figure 4.6 show the enriched GO terms found by BINGO for the sets of monotonically increasing and monotonically decreasing genes. For this analysis, we used the libraries generated by applying the first compound to the HCT116 cell line. The GO terms (one node per term) are clustered based on their common genes that are also present in the list of monotonically changing genes. The enriched terms for up-regulated genes include regulation of metabolic process, regulation of transcription and gene expression, DNA damage response, regulation of kinase activity and some others. Similarly, the enriched terms for down-regulated genes include regulation of cell cycle, protein localization, regulation of signal transduction, and some terms shared with the up-regulated genes, such as regulation of metabolic processes and gene expression.

[Figure 4.4 image not reproduced: A) Venn diagram of siRNA-1, siRNA-2 and siRNA-3; B) Venn diagram of HeLa (T_202) low doses, HeLa (T_202) high doses and siRNAs.]
Figure 4.4: The overlap of the splicing events detected in the inhibitor and siRNA experiments. A. A Venn diagram showing the overlap of the detected events between the three different siRNA experiments where EIF4A3 was knocked down. B. The overlap between the results of knocking down EIF4A3 with siRNAs, inhibiting EIF4A3 with low drug concentrations, and inhibiting EIF4A3 with high drug concentrations. Almost 38% of the siRNA knockdown events were also detected in the drug inhibition experiments.

Next, we analyzed the other EIF4A3 RNA-seq libraries, generated using the other compound or in the other cell line, to check whether the terms and functions associated with EIF4A3 could also be retrieved in the other datasets.
[Figure 4.5 image not reproduced: for each of the four compound-cell line combinations (Drug 1 and Drug 2 in HCT116 and HeLa), six panels (Group 1 to Group 6, each labelled with its number of genes N) plot normalized expression against dose (0, 0.5, 2, 5, 10, 20).]
Figure 4.5: Clustering of expression response patterns upon inhibiting EIF4A3. Results are presented for distinct compounds in two cell lines. Gene expression values are clustered using WGCNA [277]. For each case, only the six clusters with the largest number of genes are shown. In each of our experiments, two of the largest clusters can be attributed to genes showing mostly monotonically increasing or decreasing responses. The blue line shows the consensus response pattern for each cluster (obtained by connecting the average gene expression values at the different compound levels).

We replicated the GO enrichment analysis in the 3 remaining compound-cell line combinations (2 compounds and 2 cell lines), and found many common enriched terms among the four experiments. For instance, parts C and D in Figure 4.6 illustrate the -log10 false discovery rate for the top 15 GO terms identified in the list of genes showing monotonically increasing expression patterns in the HCT116 cell line, when the cells were treated with the first compound. Their presence or absence in the other data sets is also shown in the figure. Out of the 15 top enriched terms in HCT116 cells treated with compound 1, 14 are also detected in the three other libraries for up-regulated genes (part B).
For the terms enriched in the list of down-regulated genes, 12 out of 15 were detected in all three other data sets, and another term was detected in two of the other data sets.

To further assess whether our type of data can uncover functions of a targeted protein, we inspected a previously known function of EIF4A3 in our data. EIF4A3 is known to be a core component of the exon junction complex, an important member of the nonsense-mediated decay (NMD) mechanism. Through NMD, mRNA molecules that contain premature stop codons are eliminated before being translated. Inhibiting EIF4A3 interferes with NMD; thus, the isoforms that are supposed to undergo NMD are expected to be expressed at higher levels.

To analyze the consequence of EIF4A3 inhibition on NMD using our data, we first extracted ∼14,000 isoforms known to undergo NMD from the ENSEMBL database. Next, similar to our gene expression analysis, we clustered isoform expression profiles for the isoforms having average FPKM value ≥ 1 and median value ≥ 0. Figure 4.7 shows the WGCNA clustering results. Unlike the gene expression clusters, where we usually found two dominant clusters with both up- and down-regulated genes, here for all the experiments we found only 1 dominant cluster, which predominantly contains genes with monotonically increasing expression patterns. Moreover, we confirmed that the observed pattern cannot be attributed to the up-regulation of the corresponding genes (results not shown), and therefore the up-regulation may be mainly attributed to the inactivation of the NMD process.
Thus, the clustering of isoform expression patterns confirms a known function of EIF4A3.

[Figure 4.6 image not reproduced. A) Network of enriched GO terms with cluster labels: regulation of metabolic process and gene expression; RNA/protein metabolic process; oocyte maturation; response to unfolded protein; negative regulation of DNA replication; DNA damage response; negative regulation of kinase activity; cholesterol biosynthetic process; protein/DNA complex assembly; chromatin modification; negative regulation of cellular process and transcription. B) Network with cluster labels: regulation of metabolic process and gene expression; cell cycle; regulation of cell cycle; regulation of blood vessel endothelial cell migration; protein import; regulation of signal transduction; protein localization; protein catabolic process; cellular metabolic process. C) Bar plots of FDR (-log 10) with markers for HCT116 Drug_2, Hela Drug_1 and Hela Drug_2.]
Figure 4.6: GO enrichment analysis for clusters of genes showing similar expression change patterns. A. For the cluster of genes with a monotonically increasing consensus pattern after inhibiting EIF4A3 with the first inhibitor in the HCT116 cell line, we performed GO enrichment analysis using BINGO. Each node represents a GO term enriched in our analysis with false discovery rate ≤ 0.05, and edges connect gene sets with common genes present in the input list. GO terms are clustered using the ENRICHMENTMAP software. B. The same analysis as in part A, carried out for genes in the cluster with a monotonically decreasing consensus pattern. C. For the top 15 GO terms with the lowest FDR, we checked whether replicating the analysis with the other compound, or in the other cell line, could detect similar GO terms. Bar plots illustrate FDR values; a black circle indicates that the same term was also detected in the corresponding data with FDR ≤ 0.05, and a grey circle indicates that the same GO term was not detected.

[Figure 4.7 image not reproduced: for each of the four compound-cell line combinations (Drug 1 and Drug 2 in HCT116 and HeLa), six panels (Group 1 to Group 6, each labelled with its size N) plot normalized expression against dose (0, 0.5, 2, 5, 10, 20).]
Figure 4.7: Clustering of NMD isoform response patterns upon inhibiting EIF4A3. Results are presented for distinct compounds in two cell lines. Expression profiles of isoforms known to undergo NMD are clustered using WGCNA [277]. In each case, only the six clusters with the largest number of genes are shown. In contrast to the clustering of gene expression, where we observed two dominant clusters (Figure 4.5), here there exists only one dominant cluster, consisting of genes with monotonically increasing response patterns. The blue line shows the consensus response pattern for each cluster (obtained by connecting the average isoform expression values at the different drug levels).

4.4 Discussion

In this chapter, we discussed an important goal of molecular biology research: understanding the functions and regulation of genes. We reviewed methods developed to infer functional and regulatory knowledge from high-throughput sequencing data. The appropriate method should be chosen based on the power and limitations of the methods discussed, the type of experiment, and the research question.
With improvements in technology and the reduction of sequencing costs, data is being generated at a much faster rate. Therefore, the methods should also be adapted to benefit from the extra information available.

The methods performing mechanistic inference reviewed here have been successfully applied to improve our understanding of how a specific response emerges when a condition is modified [257, 267]. Despite being helpful, the methods have some limitations as well. Our knowledge of the biological interactions essential for the success of the discussed methods is still incomplete. Additionally, many of the known interactions are context-specific, without the context being specified in public databases. Fortunately, the increasing amount of data generated these days seems likely to make many of these limitations only temporary.

We also presented our RNA-seq data obtained by inhibiting proteins at multiple levels. Advancements in therapeutics have made similar data sets much more abundant than before, and consequently, adapting methods to incorporate dose-dependent responses in computational analyses is of great interest. Most of the methods reviewed are only intended to handle situations where two conditions are compared, and are thus not optimized to benefit from the extra information provided by inducing various inhibition levels.

Our preliminary analysis showed that increasing inhibition levels of the genes we investigated imposes gradual effects, both at the splicing level and at the expression level. This type of effect can be further investigated to identify primary functions of targets and differentiate them from secondary consequences.
When applying a correlation-based clustering method (WGCNA), the results suggested that the data could be used to consistently derive gene functions when distinct compounds and cell lines were used.

Appropriately modifying methods that perform mechanistic inference can enhance our findings from pharmaceutical inhibition data. In order to benefit from these methods, we need to determine which of the known prior interactions are active in a given condition, and accordingly which genes are being regulated by a given candidate gene. An obvious approach is to define interactions based on response correlations, instead of considering the direction and magnitude of changes for the interacting genes as in the two-condition case. One issue with correlation-based methods is that they assume linear dependencies among responses, an assumption that can be violated [234].

Hidden Markov models (HMMs) [279] are also appropriate tools to model the sequence of observed responses in our data. HMMs have been extensively applied to problems where there could be long-range dependencies among a sequence of observations. For instance, the observations can be represented by a series of fold changes, and the hidden states (which control the generative probabilistic components explaining the observations) can take three values: “Up”, “Down”, and “No change”, indicating whether the gene is up-regulated, down-regulated or unchanged compared to the previous inhibitor level. In addition, a probability distribution from which the observations are drawn is assigned to each hidden state. Finally, the probability of each possible path (a sequence of “Up”, “Down” and “No change” states) could be calculated to assess the probability of genes showing similar patterns of responses.

In a recent study, Leng et al. [280] proposed auto-regressive hidden Markov models (AR-HMMs) to infer the probability of potential paths (sequences of “Up”, “Down” and “No change” states).
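To make the plain three-state HMM described above concrete, the following Python sketch decodes the most probable state path for one gene from its log2 fold changes between consecutive inhibitor levels. It is an illustrative toy and not the AR-HMM of Leng et al.: the Gaussian emission means and standard deviation, the "sticky" transition probability, the uniform initial distribution, and the choice of Viterbi decoding (which returns only the single best path instead of scoring every path) are all assumptions made for this example.

```python
import math

STATES = ("Up", "Down", "No change")
# Assumed emission model: log2 fold change ~ Normal(mean[state], sd).
EMISSION_MEAN = {"Up": 1.0, "Down": -1.0, "No change": 0.0}
EMISSION_SD = 0.5
# Assumed transition model: stay in the same state with probability 0.8.
STAY_PROBABILITY = 0.8


def log_gaussian(x, mean, sd):
    """Log density of a normal distribution."""
    return (-0.5 * math.log(2.0 * math.pi * sd * sd)
            - (x - mean) ** 2 / (2.0 * sd * sd))


def log_transition(previous, current):
    """Log probability of moving from one hidden state to the next."""
    if previous == current:
        return math.log(STAY_PROBABILITY)
    return math.log((1.0 - STAY_PROBABILITY) / (len(STATES) - 1))


def viterbi(log_fold_changes):
    """Return (most probable state path, its joint log probability)."""
    if not log_fold_changes:
        raise ValueError("at least one observation is required")
    first = log_fold_changes[0]
    score = {s: math.log(1.0 / len(STATES))
                + log_gaussian(first, EMISSION_MEAN[s], EMISSION_SD)
             for s in STATES}
    back_pointers = []
    for x in log_fold_changes[1:]:
        new_score, pointers = {}, {}
        for cur in STATES:
            best_prev = max(
                STATES, key=lambda prev: score[prev] + log_transition(prev, cur))
            new_score[cur] = (score[best_prev]
                              + log_transition(best_prev, cur)
                              + log_gaussian(x, EMISSION_MEAN[cur], EMISSION_SD))
            pointers[cur] = best_prev
        score = new_score
        back_pointers.append(pointers)
    last = max(STATES, key=lambda s: score[s])
    path = [last]
    for pointers in reversed(back_pointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, score[last]


# Hypothetical gene that rises over the first doses and then falls.
path, log_probability = viterbi([1.0, 1.0, -1.0, -1.0])
```

Genes whose decoded paths consist only of “Up” (or only of “Down”) states would correspond to the monotonic response clusters discussed in Section 4.3.2.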
The AR-HMM captures the dependence of an observed FPKM value or read count in an experiment not only on the current hidden state (up-regulation, down-regulation or no change), but also on the previously observed FPKM value (Figure 4.8). As an extension, separate models could be designed and trained for potential regulation between any two given genes in interaction databases. Paired observations (transcript abundances of two interacting genes) are derived based on the hidden state of the upstream gene in each model, and the model that best describes the observations can define the type of interaction.

In this chapter we have taken the first steps towards developing methods that, in future, can help to study biological systems, drug effects, and gene functions with the increasing amount of data provided by pharmaceutical agents. We discussed the existing methods, our type of data, a preliminary analysis of their usefulness, and the way we think the data should be incorporated into existing pipelines.

Figure 4.8: An auto-regressive hidden Markov model proposed by Leng et al. [280] to analyze ordered high-throughput sequencing data. The Zi’s are hidden states that can take the values “Up”, “Down” and “No change” to represent the direction of change between consecutive observations. Shaded circles (Yi’s) represent observations, which could be FPKM values or read counts for each gene. Connected nodes enable modelling dependencies among an ordered set of observations.

Chapter 5
Conclusion

In this thesis, I took a systems biology approach to investigate the functions and regulation of alternative splicing (AS). Through the AS mechanism, cells expand the capacity of their genomes and orchestrate complex responses. The regulated interplay between components of the splicing machinery is essential to maintain normal cellular functions, and consequently, many genetic diseases have been associated with impaired splicing.
Our approach offers new insight into how AS is regulated and how it affects related mechanisms. Our study provides an additional perspective towards a more comprehensive picture of alternative splicing.

Advances in high-throughput sequencing technologies and the development of cost-effective methods have brought new opportunities to better understand the AS mechanism. In all research questions explored here, we benefited from RNA-seq libraries to perform a genome-wide identification of AS events and the corresponding global consequences on transcriptome regulation. Additionally, by taking advantage of replicated experiments, multiple cell lines, and state-of-the-art computational methods, we addressed limitations and uncertainties of RNA-seq data.

In Chapter 2, I presented our findings on tissue-specific RNA editing in Drosophila melanogaster and its potential role in regulating alternative splicing. We designed a pipeline that utilizes large input data and ADAR’s requirement for double-stranded targets to distinguish genuine editing sites from mapping and sequencing errors. We showed that editing events happen 3 times more frequently in exons with multiple acceptor/donor sites than in exons with a unique splice site. This finding indicates a potential inter-relation between AS and RNA editing. Next, we searched for conserved secondary structures in regions where alternative splicing and RNA editing co-occur, and reported conserved structures that may mediate their inter-relation. Our research suggests a tissue-specific and gene-specific regulation of alternative splicing by RNA editing, mediated through the formation of RNA structures.

Considering the huge number of editing sites that have already been reported in human, exploring a similar hypothesis in human in future could uncover regions where a similar inter-relation may happen. Additionally, it should be noted that in our study we used mRNA libraries (poly-A enriched), in which most intronic signals were removed.
In future studies, using pre-mRNA sequencing data would enable investigating editing in more detail, especially in human, where a large number of editing sites have been predicted to occur in intronic regions [104, 113].

From a different perspective, our study identifies RNA structures that form in vivo. Although potential RNA structures can be predicted computationally, it is hard to determine whether they actually form in vivo in a specific tissue or at a given time. However, because ADAR requires double-stranded structures, an observed editing event confirms that the corresponding structure forms. Once these structures are detected, their potential roles in regulating splicing, or their relevance to diseases regardless of RNA editing, can be further analyzed. Furthermore, in future studies, mutational experiments will be required to validate the importance of these structures in regulating splicing.

In Chapter 3, we studied the roles of CDK12 in regulating RNA splicing and transcription. Our RNA-seq data demonstrate that CDK12 expression predominantly influences splicing by regulating the differential usage of alternative last exons. The regulation could be modulated at either the transcription or the splicing level. Furthermore, our proteomics data indicate that CDK12 interacts with components of the splicing machinery, especially those associated with splice site selection. We showed that the genes differentially regulated upon CDK12 knockdown tend to be long genes with many exons. We also showed that the regulation of gene expression by CDK12 is tissue specific; however, common pathways are influenced in the two cell lines that we analyzed. DNA damage response genes are one class of commonly affected genes.
We analyzed TCGA data and showed that the regulation of alternative last exon events that we found in our data could also be observed when comparing samples from patients with CDK12 mutations to samples from control patients.

In future studies, our findings on the differential regulation of ATM and DNAJB6 and their potential contribution to the tumorigenicity of breast cancer cells can be further investigated to better understand the tumor biology of breast cancer cells harboring genomic alterations in CDK12. Additionally, our study is limited in providing mechanisms that regulate the tissue-specific splicing of events such as the one in DNAJB6, which should be addressed in future studies.

In this study, we only considered splicing events that are already annotated and present in the MISO [73] database. Using methods that enable the discovery of de novo splicing events, as discussed in Chapter 1, could further increase the number of identified regulated genes, and might help to infer more plausible models to explain the functions of CDK12. Also, ChIP-seq experiments could be used to measure the occupancy of RNA polymerase II across the genome in the same cell lines, to help distinguish events that are a consequence of disruption in the transcription elongation rate from those that are not.

In Chapter 4, I presented a review of methods developed to perform mechanistic inference using high-throughput sequencing data, and of methods that try to identify a small set of genes and pathways that control the transition between two given conditions. The methods have been successfully applied to uncover how a response emerges when conditions are modified. Our review provides a guideline for choosing appropriate methods based on research questions and available data sets. We also introduced our data sets, in which genes known to directly or indirectly affect splicing regulation were progressively inhibited using multiple concentrations of the inhibitor.
Using clustering of response patterns and gene set enrichment analysis, we showed that the data can contribute to exploring the functions and regulation of proteins.

With advances in therapeutics and decreasing sequencing costs, this type of data will become more accessible. Although we discussed how the reviewed methods could be adapted to incorporate the additional information provided by the introduced data, no such method has been developed so far. The next step would be to develop methods (e.g. HMM-based methods) that benefit more from the additional information provided by systematic gene inhibition using pharmaceuticals.

Taken together, our systems biology approach in this thesis provides additional insight into the regulation and functions of alternative splicing. I hope this study can motivate further investigation of the mechanisms discussed and their roles in associated diseases, and eventually lead to advances in therapeutics.

Bibliography

[1] Mazloomian, A. & Meyer, I.M., 2015. Genome-wide identification and characterization of tissue-specific RNA editing events in D. melanogaster and their potential role in regulating alternative splicing. RNA biology 12(12): 1391–1401. → pages iv
[2] Tien, J.F., Mazloomian, A., Cheng, S., Hughes, C.S., Chow, C., Canapi, L.T., Oloumi, A., Trigo-Gonzalez, G., Bashashati, A., Xu, J. et al., 2017. CDK12 regulates alternative last exon mRNA splicing and promotes breast cancer cell invasion. Nucleic acids research. → pages iv, 77
[3] Johnson, J.M., Castle, J., Garrett-Engele, P., Kan, Z., Loerch, P.M., Armour, C.D., Santos, R., Schadt, E.E., Stoughton, R., & Shoemaker, D.D., 2003. Genome-wide survey of human alternative pre-mRNA splicing with exon junction microarrays. Science 302(5653): 2141–2144. → pages 1
[4] Krawczak, M., Reiss, J., & Cooper, D.N., 1992. The mutational spectrum of single base-pair substitutions in mRNA splice junctions of human genes: causes and consequences. Human genetics 90(1-2): 41–54.
→ pages 1
[5] Venables, J.P., 2004. Aberrant and alternative splicing in cancer. Cancer research 64(21): 7647–7654. → pages 1, 81
[6] Marguerat, S. & Bähler, J., 2010. RNA-seq: from technology to biology. Cellular and molecular life sciences 67(4): 569–579. → pages 1
[7] Mortazavi, A., Williams, B.A., McCue, K., Schaeffer, L., & Wold, B., 2008. Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nature methods 5(7): 621–628. → pages 1
[8] Kaletta, T. & Hengartner, M.O., 2006. Finding function in novel targets: C. elegans as a model organism. Nature Reviews Drug Discovery 5(5): 387–399. → pages 2
[9] Pandey, U.B. & Nichols, C.D., 2011. Human disease models in Drosophila melanogaster and the role of the fly in therapeutic drug discovery. Pharmacological reviews 63(2): 411–436.
[10] Chintapalli, V.R., Wang, J., & Dow, J.A., 2007. Using FlyAtlas to identify better Drosophila melanogaster models of human disease. Nature genetics 39(6): 715–720. → pages 2
[11] Smit, A.F., Hubley, R., & Green, P., 1996. RepeatMasker. Published on the web at http://www.repeatmasker.org. → pages 2
[12] Yates, A., Akanni, W., Amode, M.R., Barrell, D., Billis, K., Carvalho-Silva, D., Cummins, C., Clapham, P., Fitzgerald, S., Gil, L. et al., 2016. Ensembl 2016. Nucleic acids research 44(D1): D710–D716. → pages 2
[13] Shoemaker, R.H., 2006. The NCI60 human tumour cell line anticancer drug screen. Nature Reviews Cancer 6(10): 813–823. → pages 2
[14] Barretina, J., Caponigro, G., Stransky, N., Venkatesan, K., Margolin, A.A., Kim, S., Wilson, C.J., Lehár, J., Kryukov, G.V., Sonkin, D. et al., 2012. The Cancer Cell Line Encyclopedia enables predictive modelling of anticancer drug sensitivity. Nature 483(7391): 603–607. → pages 2
[15] Berget, S.M., Moore, C., & Sharp, P.A., 1977. Spliced segments at the 5’ terminus of adenovirus 2 late mRNA. Proceedings of the National Academy of Sciences 74(8): 3171–3175. → pages 3
[16] Chow, L.T., Gelinas, R.E., Broker, T.R., & Roberts, R.J., 1977.
An amazing sequence arrangement at the 5’ ends of adenovirus 2 messenger RNA. Cell 12(1): 1–8. → pages 3
[17] Han, J., Xiong, J., Wang, D., & Fu, X.D., 2011. Pre-mRNA splicing: where and when in the nucleus. Trends in cell biology 21(6): 336–343. → pages 3
[18] Lim, L. & Burge, C., 2001. A computational analysis of sequence features involved in recognition of short introns. Proceedings of the National Academy of Sciences 98(20): 11193–11198. → pages 3
[19] Black, D., 2003. Mechanisms of alternative pre-messenger RNA splicing. Annual review of biochemistry 72(1): 291–336. → pages 3
[20] Smith, C.W. & Valcárcel, J., 2000. Alternative pre-mRNA splicing: the logic of combinatorial control. Trends in biochemical sciences 25(8): 381–388. → pages 3
[21] Horowitz, D.S., 2012. The mechanism of the second step of pre-mRNA splicing. Wiley Interdisciplinary Reviews: RNA 3(3): 331–350. → pages 3, 4
[22] Will, C.L. & Lührmann, R., 2011. Spliceosome structure and function. Cold Spring Harbor perspectives in biology 3(7): a003707. → pages 4
[23] Wahl, M.C., Will, C.L., & Lührmann, R., 2009. The spliceosome: design principles of a dynamic RNP machine. Cell 136(4): 701–718. → pages 4
[24] Nilsen, T., 2003. The spliceosome: the most complex macromolecular machine in the cell? Bioessays 25(12): 1147–1149. → pages 4
[25] Nilsen, T.W. & Graveley, B.R., 2010. Expansion of the eukaryotic proteome by alternative splicing. Nature 463(7280): 457–463. → pages 5
[26] Faustino, N. & Cooper, T., 2003. Pre-mRNA splicing and human disease. Genes & development 17(4): 419–437. → pages 5, 9, 81
[27] Barbosa-Morais, N.L., Irimia, M., Pan, Q., Xiong, H.Y., Gueroussov, S., Lee, L.J., Slobodeniuc, V., Kutter, C., Watt, S., Çolak, R. et al., 2012. The evolutionary landscape of alternative splicing in vertebrate species. Science 338(6114): 1587–1593. → pages 5
[28] Yeo, G., Holste, D., Kreiman, G., & Burge, C.B., 2004. Variation in alternative splicing across human tissues. Genome biology 5(10): 1. → pages 5
[29] Raj, B.
& Blencowe, B.J., 2015. Alternative splicing in the mammalian nervous system: recent insights into mechanisms and functional roles. Neuron 87(1): 14–27. → pages 5
[30] Vuong, C.K., Black, D.L., & Zheng, S., 2016. The neurogenetics of alternative splicing. Nature Reviews Neuroscience 17(5): 265–281. → pages 5
[31] Wang, Z. & Burge, C.B., 2008. Splicing regulation: from a parts list of regulatory elements to an integrated splicing code. RNA 14(5): 802–813. → pages 5, 7
[32] Keren, H., Lev-Maor, G., & Ast, G., 2010. Alternative splicing and evolution: diversification, exon definition and function. Nature Reviews Genetics 11(5): 345–355. → pages 6
[33] McManus, C.J., Coolon, J.D., Eipper-Mains, J., Wittkopp, P.J., & Graveley, B.R., 2014. Evolution of splicing regulatory networks in Drosophila. Genome research 24(5): 786–796. → pages 6
[34] Ast, G., 2004. How did alternative splicing evolve? Nature Reviews Genetics 5(10): 773–782. → pages 6
[35] Lee, Y. & Rio, D.C., 2015. Mechanisms and Regulation of Alternative Pre-mRNA Splicing. Annual Review of Biochemistry 84: 291–323. → pages 6
[36] McManus, C.J. & Graveley, B.R., 2011. RNA structure and the mechanisms of alternative splicing. Current opinion in genetics & development 21(4): 373–379. → pages 6
[37] Kim, E., Goren, A., & Ast, G., 2008. Alternative splicing: current perspectives. Bioessays 30(1): 38–47. → pages 6
[38] Graveley, B.R., Brooks, A.N., Carlson, J.W., Duff, M.O., Landolin, J.M., Yang, L., Artieri, C.G., van Baren, M.J., Boley, N., Booth, B.W. et al., 2011. The developmental transcriptome of Drosophila melanogaster. Nature 471(7339): 473–479. → pages 6, 29, 36, 37, 38, 39, 45
[39] Reiter, L.T., Potocki, L., Chien, S., Gribskov, M., & Bier, E., 2001. A systematic analysis of human disease-associated gene sequences in Drosophila melanogaster. Genome research 11(6): 1114–1125. → pages 6
[40] McQuilton, P., Pierre, S., Thurmond, J. et al., 2012. FlyBase 101–the basics of navigating FlyBase. Nucleic Acids Research 40(D1): D706–D714.
→ pages 6, 7
[41] Gibilisco, L., Zhou, Q., Mahajan, S., & Bachtrog, D., 2016. The evolution of alternative splicing in Drosophila. bioRxiv page 054700. → pages 6
[42] Celotto, A.M. & Graveley, B.R., 2001. Alternative splicing of the Drosophila Dscam pre-mRNA is both temporally and spatially regulated. Genetics 159(2): 599–608. → pages 7
[43] Karolchik, D., Baertsch, R., Diekhans, M., Furey, T.S., Hinrichs, A., Lu, Y., Roskin, K.M., Schwartz, M., Sugnet, C.W., Thomas, D.J. et al., 2003. The UCSC genome browser database. Nucleic acids research 31(1): 51–54. → pages 7, 91
[44] Speir, M.L., Zweig, A.S., Rosenbloom, K.R., Raney, B.J., Paten, B., Nejad, P., Lee, B.T., Learned, K., Karolchik, D., Hinrichs, A.S. et al., 2016. The UCSC Genome Browser database: 2016 update. Nucleic acids research 44(D1): D717–D725. → pages 7
[45] Roy, S., Ernst, J., Kharchenko, P.V., Kheradpour, P., Negre, N., Eaton, M.L., Landolin, J.M., Bristow, C.A., Ma, L., Lin, M.F. et al., 2010. Identification of functional elements and regulatory circuits by Drosophila modENCODE. Science 330(6012): 1787–1797. → pages 7, 30
[46] Matlin, A.J., Clark, F., & Smith, C.W., 2005. Understanding alternative splicing: towards a cellular code. Nature reviews Molecular cell biology 6(5): 386–398. → pages 7
[47] Maniatis, T. & Tasic, B., 2002. Alternative pre-mRNA splicing and proteome expansion in metazoans. Nature 418(6894): 236–243. → pages 7
[48] Naftelberg, S., Schor, I.E., Ast, G., & Kornblihtt, A.R., 2015. Regulation of alternative splicing through coupling with transcription and chromatin structure. Annual review of biochemistry 84: 165–198. → pages 7
[49] Kornblihtt, A.R., Schor, I.E., Alló, M., Dujardin, G., Petrillo, E., & Muñoz, M.J., 2013. Alternative splicing: a pivotal step between eukaryotic transcription and translation. Nature reviews Molecular cell biology 14(3): 153–165. → pages 8, 11
[50] Neugebauer, K.M., 2002. On the importance of being co-transcriptional. Journal of cell science 115(20): 3865–3871.
→ pages 8
[51] Lai, D., Proctor, J.R., & Meyer, I.M., 2013. On the importance of cotranscriptional RNA structure formation. RNA 19(11): 1461–1473. → pages 8, 22
[52] Baralle, D. & Baralle, M., 2005. Splicing in action: assessing disease causing sequence changes. Journal of medical genetics 42(10): 737–748. → pages 8, 9
[53] Buratti, E. & Baralle, F.E., 2004. Influence of RNA secondary structure on the pre-mRNA splicing process. Molecular and cellular biology 24(24): 10505–10514. → pages 8, 9, 10
[54] Pervouchine, D., Khrameeva, E., Pichugina, M., Nikolaienko, O., Gelfand, M., Rubtsov, P., & Mironov, A., 2012. Evidence for widespread association of mammalian splicing and conserved long-range RNA structures. RNA 18(1): 1–15. → pages 8
[55] Meyer, I. & Miklós, I., 2005. Statistical evidence for conserved, local secondary structure in the coding regions of eukaryotic mRNAs and pre-mRNAs. Nucleic acids research 33(19): 6338–6348. → pages 8, 40, 49
[56] Tazi, J., Bakkour, N., & Stamm, S., 2009. Alternative splicing and disease. Biochimica et Biophysica Acta (BBA)-Molecular Basis of Disease 1792(1): 14–26. → pages 9
[57] Singh, N.N., Androphy, E.J., & Singh, R.N., 2004. An extended inhibitory context causes skipping of exon 7 of SMN2 in spinal muscular atrophy. Biochemical and biophysical research communications 315(2): 381–388. → pages 9
[58] Garcia-Blanco, M., Baraniak, A., & Lasda, E., 2004. Alternative splicing in disease and therapy. Nature biotechnology 22(5): 535–546. → pages 9
[59] Fu, X.D. & Ares Jr, M., 2014. Context-dependent control of alternative splicing by RNA-binding proteins. Nature Reviews Genetics 15(10): 689–701. → pages 11
[60] Kelemen, O., Convertini, P., Zhang, Z., Wen, Y., Shen, M., Falaleeva, M., & Stamm, S., 2013. Function of alternative splicing. Gene 514(1): 1–30. → pages 11
[61] Irimia, M. & Blencowe, B.J., 2012. Alternative splicing: decoding an expansive regulatory layer. Current opinion in cell biology 24(3): 323–332.
→ pages 11
[62] Stamm, S., Ben-Ari, S., Rafalska, I., Tang, Y., Zhang, Z., Toiber, D., Thanaraj, T., & Soreq, H., 2005. Function of alternative splicing. Gene 344: 1–20. → pages 11
[63] Treangen, T.J. & Salzberg, S.L., 2012. Repetitive DNA and next-generation sequencing: computational challenges and solutions. Nature Reviews Genetics 13(1): 36–46. → pages 11
[64] Oshlack, A. & Wakefield, M., 2009. Transcript length bias in RNA-seq data confounds systems biology. Biology direct 4(4): 14. → pages 11
[65] Garber, M., Grabherr, M.G., Guttman, M., & Trapnell, C., 2011. Computational methods for transcriptome annotation and quantification using RNA-seq. Nature methods 8(6): 469–477. → pages 11
[66] Ryan, M.C., Cleland, J., Kim, R., Wong, W.C., & Weinstein, J.N., 2012. SpliceSeq: a resource for analysis and visualization of RNA-Seq data on alternative splicing and its functional impacts. Bioinformatics 28(18): 2385–2387. → pages 11
[67] Griffith, M., Griffith, O.L., Mwenifumbo, J., Goya, R., Morrissy, A.S., Morin, R.D., Corbett, R., Tang, M.J., Hou, Y.C., Pugh, T.J. et al., 2010. Alternative expression analysis by RNA sequencing. Nature methods 7(10): 843–847.
[68] Glaus, P., Honkela, A., & Rattray, M., 2012. Identifying differentially expressed transcripts from RNA-seq data with biological variation. Bioinformatics 28(13): 1721–1728.
[69] Shen, S., Park, J.W., Huang, J., Dittmar, K.A., Lu, Z.x., Zhou, Q., Carstens, R.P., & Xing, Y., 2012. MATS: a Bayesian framework for flexible detection of differential alternative splicing from RNA-Seq data. Nucleic acids research page gkr1291.
[70] Aschoff, M., Hotz-Wagenblatt, A., Glatting, K.H., Fischer, M., Eils, R., & König, R., 2013. SplicingCompass: differential splicing detection using RNA-Seq data. Bioinformatics 29(9): 1141–1148.
[71] Trapnell, C., Hendrickson, D.G., Sauvageau, M., Goff, L., Rinn, J.L., & Pachter, L., 2013. Differential analysis of gene regulation at transcript resolution with RNA-seq.
Nature biotechnology 31(1): 46–53. → pages 12
[72] Leng, N., Dawson, J.A., Thomson, J.A., Ruotti, V., Rissman, A.I., Smits, B.M., Haag, J.D., Gould, M.N., Stewart, R.M., & Kendziorski, C., 2013. EBSeq: an empirical Bayes hierarchical model for inference in RNA-seq experiments. Bioinformatics 29(8): 1035–1043. → pages 12
[73] Katz, Y., Wang, E.T., Airoldi, E.M., & Burge, C.B., 2010. Analysis and design of RNA sequencing experiments for identifying isoform regulation. Nature methods 7(12): 1009–1015. → pages 12, 54, 57, 92, 93, 106
[74] Anders, S., Reyes, A., & Huber, W., 2012. Detecting differential usage of exons from RNA-seq data. Genome research 22(10): 2008–2017. → pages 12, 35, 45, 46, 135
[75] Wang, W., Qin, Z., Feng, Z., Wang, X., & Zhang, X., 2013. Identifying differentially spliced genes from two groups of RNA-seq samples. Gene 518(1): 164–170. → pages 12
[76] Hu, Y., Huang, Y., Du, Y., Orellana, C.F., Singh, D., Johnson, A.R., Monroy, A., Kuan, P.F., Hammond, S.M., Makowski, L. et al., 2013. DiffSplice: the genome-wide detection of differential splicing events with RNA-seq. Nucleic acids research 41(2): e39–e39. → pages 12
[77] Alamancos, G.P., Agirre, E., & Eyras, E., 2014. Methods to study splicing from high-throughput RNA Sequencing data. Spliceosomal Pre-mRNA Splicing: Methods and Protocols pages 357–397. → pages 12
[78] Hooper, J.E., 2014. A survey of software for genome-wide discovery of differential splicing in RNA-Seq data. Human Genomics 8(1): 1–6. → pages 13
[79] Gray, M.W., 2012. Evolutionary origin of RNA editing. Biochemistry 51(26): 5235–5242. → pages 13
[80] Benne, R., Van Den Burg, J., Brakenhoff, J.P., Sloof, P., Van Boom, J.H., & Tromp, M.C., 1986. Major transcript of the frameshifted coxII gene from trypanosome mitochondria contains four nucleotides that are not encoded in the DNA. Cell 46(6): 819–826. → pages 13
[81] Scadden, A., 2005. The RISC subunit Tudor-SN binds to hyper-edited double-stranded RNA and promotes its cleavage.
Nature structural & molecular biology 12(6): 489–496. → pages 13
[82] Farajollahi, S. & Maas, S., 2010. Molecular diversity through RNA editing: a balancing act. Trends in Genetics 26(5): 221–230. → pages 14
[83] Blow, M., Futreal, P.A., Wooster, R., & Stratton, M.R., 2004. A survey of RNA editing in human brain. Genome research 14(12): 2379–2387. → pages 14, 17, 34
[84] Nishikura, K., 2010. Functions and Regulation of RNA Editing by ADAR Deaminases. Annual Review of Biochemistry 79: 321–349. → pages 14, 16, 20, 34, 35
[85] Barraud, P. & Allain, F., 2012. ADAR Proteins: Double-stranded RNA and Z-DNA Binding Domains. Current Topics in Microbiology and Immunology 353: 35–60. → pages 14, 15, 16, 29
[86] Ramaswami, G. & Li, J.B., 2014. RADAR: a rigorously annotated database of A-to-I RNA editing. Nucleic Acids Research 42(D1): D109–D113. → pages 15
[87] Paro, S., Li, X., O’Connell, M., & Keegan, L., 2012. Regulation and functions of ADAR in Drosophila. Current topics in microbiology and immunology 353: 221–236. → pages 16, 28, 29, 49
[88] Graveley, B., Brooks, A., Carlson, J., Duff, M., Landolin, J., Yang, L., Artieri, C., van Baren, M., Boley, N., Booth, B. et al., 2010. The developmental transcriptome of Drosophila melanogaster. Nature 471(7339): 473–479. → pages 16
[89] Rodriguez, J., Menet, J.S., & Rosbash, M., 2012. Nascent-seq Indicates Widespread Cotranscriptional RNA Editing in Drosophila. Molecular cell 47(1): 27–37. → pages 16, 17, 29, 38, 39, 40, 45, 49, 135
[90] Bass, B.L., 2002. RNA editing by adenosine deaminases that act on RNA. Annual review of biochemistry 71: 817–846. → pages 16, 20
[91] Bass, B.L., 1997. RNA editing and hypermutation by adenosine deamination. Trends in biochemical sciences 22(5): 157–162. → pages 16
[92] Maas, S., Godfried Sie, C., Stoev, I., Dupuis, D., Latona, J., Porman, A., Evans, B., Rekawek, P., Kluempers, V., Mutter, M. et al., 2011. Genome-wide evaluation and discovery of vertebrate A-to-I RNA editing sites.
Biochemical and biophysical research communications 412(3): 407–412. → pages 16, 34
[93] Neeman, Y., Levanon, E.Y., Jantsch, M.F., & Eisenberg, E., 2006. RNA editing level in the mouse is determined by the genomic repeat repertoire. RNA 12(10): 1802–1809. → pages 16, 19, 34
[94] Hoopengardner, B., Bhalla, T., Staber, C., & Reenan, R., 2003. Nervous system targets of RNA editing identified by comparative genomics. Science’s STKE 301(5634): 832–836. → pages 17, 18, 40, 43
[95] Rieder, L.E. & Reenan, R.A., 2012. The intricate relationship between RNA structure, editing, and splicing. In Seminars in cell & developmental biology, volume 23, pages 281–288. Elsevier. → pages 17, 20, 26, 28
[96] Athanasiadis, A., Rich, A., & Maas, S., 2004. Widespread A-to-I RNA editing of Alu-containing mRNAs in the human transcriptome. PLoS biology 2(12): e391. → pages 17
[97] Morse, D.P., Aruscavage, P.J., & Bass, B.L., 2002. RNA hairpins in noncoding regions of human brain and Caenorhabditis elegans mRNA are edited by adenosine deaminases that act on RNA. Proceedings of the National Academy of Sciences 99(12): 7906–7911. → pages 17, 34, 35
[98] Danecek, P., Nellåker, C., McIntyre, R.E., Buendia-Buendia, J.E., Bumpstead, S., Ponting, C.P., Flint, J., Durbin, R., Keane, T.M., & Adams, D.J., 2012. High levels of RNA-editing site conservation amongst 15 laboratory mouse strains. Genome Biology 13(4): 1–12. → pages 17, 20, 22, 29, 30, 34, 35
[99] Wan, Y., Kertesz, M., Spitale, R., Segal, E., & Chang, H., 2011. Understanding the transcriptome through RNA structure. Nature Reviews Genetics 12(9): 641–655. → pages 17
[100] Yang, Y., Sun, F., Wang, X., Yue, Y., Wang, W., Zhang, W., Zhan, L., Tian, N., Jin, Y. et al., 2012. Conservation and regulation of alternative splicing by dynamic inter- and intra-intron base pairings in Lepidoptera 14-3-3z pre-mRNAs. RNA biology 9(5): 691–700. → pages 17
[101] Daniel, C., Venø, M.T., Ekdahl, Y., Kjems, J., & Öhman, M., 2012.
A distant cis acting intronic element induces site-selective RNA editing. Nucleic Acids Research 40(19): 9876–9886. → pages 17, 34
[102] Levanon, E., Hallegger, M., Kinar, Y., Shemesh, R., Djinovic-Carugo, K., Rechavi, G., Jantsch, M., & Eisenberg, E., 2005. Evolutionarily conserved human targets of adenosine to inosine RNA editing. Nucleic acids research 33(4): 1162–1168. → pages 17, 33
[103] Stark, A., Lin, M., Kheradpour, P., Pedersen, J., Parts, L., Carlson, J., Crosby, M., Rasmussen, M., Roy, S., Deoras, A. et al., 2007. Discovery of functional elements in 12 Drosophila genomes using evolutionary signatures. Nature 450(7167): 219–232. → pages 18
[104] Peng, Z., Cheng, Y., Tan, B.C.M., Kang, L., Tian, Z., Zhu, Y., Zhang, W., Liang, Y., Hu, X., Tan, X. et al., 2012. Comprehensive analysis of RNA-Seq data reveals extensive RNA editing in a human transcriptome. Nature biotechnology 30(3): 253–260. → pages 18, 20, 29, 38, 105
[105] Gu, T., Buaas, F.W., Simons, A.K., Ackert-Bicknell, C.L., Braun, R.E., & Hibbs, M.A., 2012. Canonical A-to-I and C-to-U RNA Editing Is Enriched at 3′UTRs and microRNA Target Sites in Multiple Mouse Tissues. PLoS ONE 7(3): e33720. doi:10.1371/journal.pone.0033720. → pages 18
[106] Palladino, M.J., Keegan, L.P., O’Connell, M.A., & Reenan, R.A., 2000. A-to-I pre-mRNA editing in Drosophila is primarily involved in adult nervous system function and integrity. Cell 102(4): 437–449. → pages 18, 43
[107] Wang, Q., Miyakoda, M., Yang, W., Khillan, J., Stachura, D.L., Weiss, M.J., & Nishikura, K., 2004. Stress-induced apoptosis associated with null mutation of ADAR1 RNA editing deaminase gene. Journal of Biological Chemistry 279(6): 4952–4961. → pages 18
[108] Maas, S., Kawahara, Y., Tamburro, K., & Nishikura, K., 2006. A-to-I RNA editing and human disease. RNA biology 3(1): 1–9. → pages 18
[109] Nishikura, K., 2006. Editor meets silencer: crosstalk between RNA editing and RNA interference. Nature Reviews Molecular Cell Biology 7(12): 919–931.
→ pages 18, 19
[110] Bazak, L., Haviv, A., Barak, M., Jacob-Hirsch, J., Deng, P., Zhang, R., Isaacs, F.J., Rechavi, G., Li, J.B., Eisenberg, E. et al., 2014. A-to-I RNA editing occurs at over a hundred million genomic sites, located in a majority of human genes. Genome Research 24(3): 365–376. → pages 19
[111] Ramaswami, G., Lin, W., Piskol, R., Tan, M.H., Davis, C., & Li, J.B., 2012. Accurate identification of human Alu and non-Alu RNA editing sites. Nature methods 9(6): 579–581. → pages 19, 20, 29, 31
[112] Eggington, J., Greene, T., & Bass, B., 2011. Predicting sites of ADAR editing in double-stranded RNA. Nature communications 2: 319. → pages 19
[113] Bahn, J., Lee, J., Li, G., Greer, C., Peng, G., & Xiao, X., 2012. Accurate identification of A-to-I RNA editing in human by transcriptome sequencing. Genome research 22(1): 142–150. → pages 19, 20, 21, 29, 31, 105
[114] St Laurent, G., Tackett, M.R., Nechkin, S., Shtokalo, D., Antonets, D., Savva, Y.A., Maloney, R., Kapranov, P., Lawrence, C.E., & Reenan, R.A., 2013. Genome-wide analysis of A-to-I RNA editing by single-molecule sequencing in Drosophila. Nature structural & molecular biology 20(11): 1333–1339. → pages 19, 20, 21, 28, 36, 37, 38, 39, 40, 45, 49
[115] Tariq, A., Garncarz, W., Handl, C., Balik, A., Pusch, O., & Jantsch, M.F., 2013. RNA-interacting proteins act as site-specific repressors of ADAR2-mediated RNA editing and fluctuate upon neuronal stimulation. Nucleic acids research 41(4): 2581–2593. → pages 19
[116] Wahlstedt, H., Daniel, C., Ensterö, M., & Öhman, M., 2009. Large-scale mRNA sequencing determines global regulation of RNA editing during brain development. Genome research 19(6): 978–986. → pages 19, 41
[117] Solomon, O., Oren, S., Safran, M., Deshet-Unger, N., Akiva, P., Jacob-Hirsch, J., Cesarkas, K., Kabesa, R., Amariglio, N., Unger, R. et al., 2013. Global regulation of alternative splicing by adenosine deaminase acting on RNA (ADAR). RNA 19(5): 591–604. → pages 20, 28, 49, 50
[118] Giuliany, R.S., 2012.
A Novel Statistical Framework for the Accurate Identification of RNA-edits with Application to Human Cancers. Ph.D. thesis, University of British Columbia. → pages 20, 21
[119] Zhang, Q. & Xiao, X., 2015. Genome sequence-independent identification of RNA editing sites. Nature methods 12(4): 347–350. → pages 20, 21
[120] Li, M., Wang, I., Li, Y., Bruzel, A., Richards, A., Toung, J., & Cheung, V., 2011. Widespread RNA and DNA sequence differences in the human transcriptome. Science 333(6038): 53–58. → pages 20
[121] Pickrell, J.K., Gilad, Y., & Pritchard, J.K., 2012. Comment on “Widespread RNA and DNA Sequence Differences in the Human Transcriptome”. Science 335(6074): 1302. → pages 21
[122] Kleinman, C.L. & Majewski, J., 2012. Comment on “Widespread RNA and DNA Sequence Differences in the Human Transcriptome”. Science 335(6074): 1302. → pages 21
[123] Ding, J., Bashashati, A., Roth, A., Oloumi, A., Tse, K., Zeng, T., Haffari, G., Hirst, M., Marra, M.A., Condon, A. et al., 2012. Feature-based classifiers for somatic mutation detection in tumour–normal paired sequencing data. Bioinformatics 28(2): 167–175. → pages 21
[124] Hofacker, I.L., Fontana, W., Stadler, P.F., Bonhoeffer, L.S., Tacker, M., & Schuster, P., 1994. Fast folding and comparison of RNA secondary structures. Monatshefte für Chemie/Chemical Monthly 125(2): 167–188. → pages 22
[125] Zuker, M. & Stiegler, P., 1981. Optimal computer folding of large RNA sequences using thermodynamics and auxiliary information. Nucleic acids research 9(1): 133–148. → pages 22, 34, 135
[126] Ding, Y. & Lawrence, C.E., 2003. A statistical sampling algorithm for RNA secondary structure prediction. Nucleic acids research 31(24): 7280–7301. → pages 22
[127] Ding, Y., Chan, C.Y., & Lawrence, C.E., 2004. Sfold web server for statistical folding and rational design of nucleic acids. Nucleic acids research 32(suppl 2): W135–W141. → pages 22
[128] Wiebe, N.J. & Meyer, I.M., 2010.
Transat–a method for detecting the conserved helices of functional RNA structures, including transient, pseudo-knotted and alternative structures. PLoS Comput Biol 6(6): e1000823. → pages 23, 46, 136
[129] Pedersen, J.S., Bejerano, G., Siepel, A., Rosenbloom, K., Lindblad-Toh, K., Lander, E.S., Kent, J., Miller, W., & Haussler, D., 2006. Identification and classification of conserved RNA secondary structures in the human genome. PLoS Comput Biol 2(4): e33. → pages 23
[130] Pedersen, J.S., Meyer, I.M., Forsberg, R., Simmonds, P., & Hein, J., 2004. A comparative method for finding and folding RNA secondary structures within protein-coding regions. Nucleic acids research 32(16): 4925–4936. → pages 23
[131] Pedersen, J.S., Forsberg, R., Meyer, I.M., & Hein, J., 2004. An evolutionary model for protein-coding regions with conserved RNA structure. Molecular biology and evolution 21(10): 1913–1922. → pages 23
[132] Knudsen, B. & Hein, J., 2003. Pfold: RNA secondary structure prediction using stochastic context-free grammars. Nucleic acids research 31(13): 3423–3428. → pages 23, 46
[133] Walsh, C.T., Garneau-Tsodikova, S., & Gatto, G.J., 2005. Protein posttranslational modifications: the chemistry of proteome diversifications. Angewandte Chemie International Edition 44(45): 7342–7372. → pages 23
[134] Capra, M., Nuciforo, P.G., Confalonieri, S., Quarto, M., Bianchi, M., Nebuloni, M., Boldorini, R., Pallotti, F., Viale, G., Gishizky, M.L. et al., 2006. Frequent alterations in the expression of serine/threonine kinases in human cancers. Cancer research 66(16): 8147–8154. → pages 23, 52, 79
[135] Zhang, J., Yang, P.L., & Gray, N.S., 2009. Targeting cancer with small molecule kinase inhibitors. Nature Reviews Cancer 9(1): 28–39. → pages 24
[136] Harper, J. & Adams, P., 2001. Cyclin-dependent kinases. Chemical Reviews 101(8): 2511–2526. → pages 24
[137] Malumbres, M., 2014. Cyclin-dependent kinases. Genome Biology 15(6): 1–10.
→ pages 24, 51
[138] Loyer, P., Trembley, J.H., Katona, R., Kidd, V.J., & Lahti, J.M., 2005. Role of CDK/cyclin complexes in transcription and RNA splicing. Cellular signalling 17(9): 1033–1051. → pages 51
[139] Even, Y., Durieux, S., Escande, M.L., Lozano, J.C., Peaucellier, G., Weil, D., & Genevière, A.M., 2006. CDC2L5, a Cdk-like kinase with RS domain, interacts with the ASF/SF2-associated protein p32 and affects splicing in vivo. Journal of cellular biochemistry 99(3): 890–904. → pages 24
[140] Cheng, S.W.G., Kuzyk, M.A., Moradian, A., Ichu, T.A., Chang, V.C.D., Tien, J.F., Vollett, S.E., Griffith, M., Marra, M.A., & Morin, G.B., 2012. Interaction of cyclin-dependent kinase 12/CrkRS with cyclin K1 is required for the phosphorylation of the C-terminal domain of RNA polymerase II. Molecular and cellular biology 32(22): 4691–4704. → pages 24
[141] Bartkowiak, B., Liu, P., Phatnani, H.P., Fuda, N.J., Cooper, J.J., Price, D.H., Adelman, K., Lis, J.T., & Greenleaf, A.L., 2010. CDK12 is a transcription elongation-associated CTD kinase, the metazoan ortholog of yeast Ctk1. Genes & development 24(20): 2303–2316. → pages 24
[142] Dixon-Clarke, S., Elkins, J., Cheng, S., Morin, G., & Bullock, A., 2014. Structures of the CDK12/CycK complex with AMP-PNP reveal a flexible C-terminal kinase extension important for ATP binding. Scientific reports 5: 17122–17122. → pages 24
[143] Ko, T.K., Kelly, E., & Pines, J., 2001. CrkRS: a novel conserved Cdc2-related protein kinase that colocalises with SC35 speckles. Journal of cell science 114(14): 2591–2603. → pages 24
[144] Taglialatela, A. CDK12 is a novel oncogene with clinical and pathogenetic relevance in breast cancer. Ph.D. thesis. → pages 24, 25, 80
[145] Liang, K., Gao, X., Gilmore, J.M., Florens, L., Washburn, M.P., Smith, E., & Shilatifard, A., 2015.
Characterization of human cyclin-dependent kinase 12 (CDK12) and CDK13 complexes in C-terminal domain phosphorylation, gene transcription, and RNA processing. Molecular and cellular biology 35(6): 928–938. → pages 24, 53, 55, 58, 61, 63, 68, 70, 75, 77, 79
[146] Blazek, D., Kohoutek, J., Bartholomeeusen, K., Johansen, E., Hulinkova, P., Luo, Z., Cimermancic, P., Ule, J., & Peterlin, B.M., 2011. The Cyclin K/Cdk12 complex maintains genomic stability via regulation of expression of DNA damage response genes. Genes & development 25(20): 2158–2172. → pages 24, 52, 58, 70, 73, 77, 78, 79
[147] Chen, M. & Manley, J.L., 2009. Mechanisms of alternative splicing regulation: insights from molecular and genomics approaches. Nature reviews Molecular cell biology 10(11): 741–754. → pages 25
[148] Bartkowiak, B. & Greenleaf, A.L., 2015. Expression, purification, and identification of associated proteins of the full-length hCDK12/CyclinK complex. Journal of Biological Chemistry 290(3): 1786–1795. → pages 25, 63
[149] Chen, H.H., Wang, Y.C., & Fann, M.J., 2006. Identification and characterization of the CDK12/cyclin L1 complex involved in alternative splicing regulation. Molecular and cellular biology 26(7): 2736–2745. → pages 25
[150] Rodrigues, F., Thuma, L., & Klämbt, C., 2012. The regulation of glial-specific splicing of Neurexin IV requires how and cdk12 activity. Development 139(10): 1765–1776. → pages 25
[151] Bajrami, I., Frankum, J.R., Konde, A., Miller, R.E., Rehman, F.L., Brough, R., Campbell, J., Sims, D., Rafiq, R., Hooper, S. et al., 2014. Genome-wide profiling of genetic synthetic lethality identifies CDK12 as a novel determinant of PARP1/2 inhibitor sensitivity. Cancer research 74(1): 287–297. → pages 25, 52, 70, 79
[152] Ekumi, K.M., Paculova, H., Lenasi, T., Pospichalova, V., Bösken, C.A., Rybarikova, J., Bryja, V., Geyer, M., Blazek, D., & Barboric, M., 2015.
Ovarian carcinoma CDK12 mutations misregulate expression of DNA repair genes via deficient formation and function of the Cdk12/CycK complex. Nucleic Acids Research 43(5): 2575–2589. → pages 25, 52, 65, 70, 79
[153] Network, C.G.A.R. et al., 2011. Integrated genomic analyses of ovarian carcinoma. Nature 474(7353): 609–615. → pages 27, 51, 65, 67
[154] Dawson, T.R., Sansam, C.L., & Emeson, R.B., 2004. Structure and sequence determinants required for the RNA editing of ADAR2 substrates. Journal of Biological Chemistry 279(6): 4941–4951. → pages 28
[155] Celniker, S.E., Dillon, L.A., Gerstein, M.B., Gunsalus, K.C., Henikoff, S., Karpen, G.H., Kellis, M., Lai, E.C., Lieb, J.D., MacAlpine, D.M. et al., 2009. Unlocking the secrets of the genome. Nature 459(7249): 927–930. → pages 29, 30
[156] Delcher, A.L., Kasif, S., Fleischmann, R.D., Peterson, J., White, O., & Salzberg, S.L., 1999. Alignment of whole genomes. Nucleic Acids Research 27(11): 2369–2376. → pages 31, 134
[157] Delcher, A.L., Phillippy, A., Carlton, J., & Salzberg, S.L., 2002. Fast algorithms for large-scale genome alignment and comparison. Nucleic acids research 30(11): 2478–2483. → pages 31, 134
[158] Needleman, S.B. & Wunsch, C.D., 1970. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of molecular biology 48(3): 443–453. → pages 31, 134
[159] Flicek, P., Amode, M.R., Barrell, D., Beal, K., Billis, K., Brent, S., Carvalho-Silva, D., Clapham, P., Coates, G., Fitzgerald, S. et al., 2013. Ensembl 2014. Nucleic acids research pages D749–D755. → pages 31, 53, 91
[160] Pachter, L., 2012. A closer look at RNA editing. Nature biotechnology 30(3): 246–247. → pages 31, 38
[161] Goya, R., Sun, M.G., Morin, R.D., Leung, G., Ha, G., Wiegand, K.C., Senz, J., Crisan, A., Marra, M.A., Hirst, M. et al., 2010. SNVMix: predicting single nucleotide variants from next-generation sequencing of tumors. Bioinformatics 26(6): 730–736. → pages 32, 33
[162] Li, H., 2011.
A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics 27(21): 2987–2993. → pages 32, 34, 53, 91, 134
[163] Kim, D., Pertea, G., Trapnell, C., Pimentel, H., Kelley, R., & Salzberg, S.L., 2013. TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biology 14(4): 1–13. → pages 31, 53
[164] Higuchi, M., Single, F.N., Köhler, M., Sommer, B., Sprengel, R., & Seeburg, P.H., 1993. RNA editing of AMPA receptor subunit GluR-B: a base-paired intron-exon structure determines position and efficiency. Cell 75(7): 1361–1370. → pages 34
[165] Bernhart, S.H., Hofacker, I.L., & Stadler, P.F., 2006. Local RNA base pairing probabilities in large sequences. Bioinformatics 22(5): 614–615. → pages 35, 40, 135
[166] Ramaswami, G., Zhang, R., Piskol, R., Keegan, L.P., Deng, P., O'Connell, M.A., & Li, J.B., 2013. Identifying RNA editing sites using RNA sequencing data alone. Nature Methods 10(2): 128–132. → pages 35, 38, 39, 45, 49, 135
[167] Trapnell, C., Roberts, A., Goff, L., Pertea, G., Kim, D., Kelley, D.R., Pimentel, H., Salzberg, S.L., Rinn, J.L., & Pachter, L., 2012. Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nature protocols 7(3): 562–578. → pages 36, 42, 91, 92, 136
[168] Bernhart, S.H., Hofacker, I.L., Will, S., Gruber, A.R., & Stadler, P.F., 2008. RNAalifold: improved consensus structure prediction for RNA alignments. BMC Bioinformatics 9(1): 1–13. → pages 39, 46, 48
[169] Chawla, G. & Sokol, N.S., 2014. ADAR mediates differential expression of polycistronic microRNAs. Nucleic Acids Research 42(8): 5245–5255. → pages 40
[170] Vesely, C., Tauber, S., Sedlazeck, F.J., von Haeseler, A., & Jantsch, M.F., 2012. Adenosine deaminases that act on RNA induce reproducible changes in abundance and sequence of embryonic miRNAs. Genome research 22(8): 1468–1476.
[171] de Hoon, M., Taft, R., Hashimoto, T., Kanamori-Katayama, M., Kawaji, H., Kawano, M., Kishima, M., Lassmann, T., Faulkner, G., Mattick, J. et al., 2010. Cross-mapping and the identification of editing sites in mature microRNAs in high-throughput sequencing libraries. Genome research 20(2): 257–264. → pages 40
[172] Huang, D.W., Sherman, B.T., & Lempicki, R.A., 2008. Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nature protocols 4(1): 44–57. → pages 43
[173] Hood, J.L. & Emeson, R.B., 2012. Editing of neurotransmitter receptor and ion channel RNAs in the nervous system. In Adenosine Deaminases Acting on RNA (ADARs) and A-to-I Editing, pages 61–90. Springer. → pages 43
[174] Jepson, J., Savva, Y., Yokose, C., Sugden, A., Sahin, A., & Reenan, R., 2011. Engineered alterations in RNA editing modulate complex behavior in Drosophila: regulatory diversity of adenosine deaminase acting on RNA (ADAR) targets. The Journal of biological chemistry 286(10): 8325–8337. → pages 43
[175] Lai, D., Proctor, J.R., Zhu, J.Y.A., & Meyer, I.M., 2012. R-CHIE: a web server and R package for visualizing RNA secondary structures. Nucleic acids research page gks241. → pages 47
[176] Fujita, P.A., Rhead, B., Zweig, A.S., Hinrichs, A.S., Karolchik, D., Cline, M.S., Goldman, M., Barber, G.P., Clawson, H., Coelho, A. et al., 2011. The UCSC Genome Browser database: update 2011. Nucleic acids research 39(suppl 1): D876–D882. → pages 46
[177] Meyer, I.M. & Miklós, I., 2007. SimulFold: simultaneously inferring RNA structures including pseudoknots, alignments, and trees using a Bayesian MCMC framework. PLoS computational biology 3(8): e149. → pages 46
[178] Khodor, Y.L., Rodriguez, J., Abruzzi, K.C., Tang, C.H.A., Marr, M.T., & Rosbash, M., 2011. Nascent-seq Indicates Widespread Cotranscriptional pre-mRNA splicing in Drosophila. Genes & development 25(23): 2502–2512. → pages 49
[179] Romano, G. & Giordano, A., 2008.
Role of the cyclin-dependent kinase 9-related pathway in mammalian gene expression and human diseases. Cell Cycle 7(23): 3664–3668. → pages 51
[180] Cerami, E., Gao, J., Dogrusoz, U., Gross, B.E., Sumer, S.O., Aksoy, B.A., Jacobsen, A., Byrne, C.J., Heuer, M.L., Larsson, E. et al., 2012. The cBio cancer genomics portal: an open platform for exploring multidimensional cancer genomics data. Cancer discovery 2(5): 401–404. → pages 51, 79
[181] Kandoth, C., McLellan, M.D., Vandin, F., Ye, K., Niu, B., Lu, C., Xie, M., Zhang, Q., McMichael, J.F., Wyczalkowski, M.A. et al., 2013. Mutational landscape and significance across 12 major cancer types. Nature 502(7471): 333–339. → pages 65
[182] Network, C.G.A. et al., 2012. Comprehensive molecular portraits of human breast tumours. Nature 490(7418): 61–70. → pages 51, 79
[183] Joshi, P.M., Sutor, S.L., Huntoon, C.J., & Karnitz, L.M., 2014. Ovarian cancer-associated mutations disable catalytic activity of CDK12, a kinase that promotes homologous recombination repair and resistance to cisplatin and poly (ADP-ribose) polymerase inhibitors. Journal of Biological Chemistry 289(13): 9247–9253. → pages 51, 52, 65, 70, 79
[184] Carter, S.L., Cibulskis, K., Helman, E., McKenna, A., Shen, H., Zack, T., Laird, P.W., Onofrio, R.C., Winckler, W., Weir, B.A. et al., 2012. Absolute quantification of somatic DNA alterations in human cancer. Nature biotechnology 30(5): 413–421. → pages 51
[185] Natrajan, R., Wilkerson, P.M., Marchiò, C., Piscuoglio, S., Ng, C.K., Wai, P., Lambros, M.B., Samartzis, E.P., Dedes, K.J., Frankum, J. et al., 2014. Characterization of the genomic features and expressed fusion genes in micropapillary carcinomas of the breast. The Journal of pathology 232(5): 553–565. → pages 52, 57, 65, 70, 79
[186] Kauraniemi, P., Kuukasjärvi, T., Sauter, G., & Kallioniemi, A., 2003. Amplification of a 280-kilobase core region at the ERBB2 locus leads to activation of two hypothetical proteins in breast cancer.
The American journal of pathology 163(5): 1979–1984. → pages 52, 80
[187] Hyman, E., Kauraniemi, P., Hautaniemi, S., Wolf, M., Mousses, S., Rozenblum, E., Ringnér, M., Sauter, G., Monni, O., Elkahloun, A. et al., 2002. Impact of DNA amplification on gene expression patterns in breast cancer. Cancer research 62(21): 6240–6245.
[188] Kao, J. & Pollack, J.R., 2006. RNA interference-based functional dissection of the 17q12 amplicon in breast cancer reveals contribution of coamplified genes. Genes, Chromosomes and Cancer 45(8): 761–769.
[189] Kauraniemi, P., Bärlund, M., Monni, O., & Kallioniemi, A., 2001. New amplified and highly expressed genes discovered in the ERBB2 amplicon in breast cancer by cDNA microarrays. Cancer research 61(22): 8235–8240.
[190] Neve, R.M., Chin, K., Fridlyand, J., Yeh, J., Baehner, F.L., Fevr, T., Clark, L., Bayani, N., Coppe, J.P., Tong, F. et al., 2006. A collection of breast cancer cell lines for the study of functionally distinct cancer subtypes. Cancer cell 10(6): 515–527.
[191] Pollack, J.R., Sørlie, T., Perou, C.M., Rees, C.A., Jeffrey, S.S., Lonning, P.E., Tibshirani, R., Botstein, D., Børresen-Dale, A.L., & Brown, P.O., 2002. Microarray analysis reveals a major direct role of DNA copy number alteration in the transcriptional program of human breast tumors. Proceedings of the National Academy of Sciences 99(20): 12963–12968.
[192] Curtis, C., Shah, S.P., Chin, S.F., Turashvili, G., Rueda, O.M., Dunning, M.J., Speed, D., Lynch, A.G., Samarajiwa, S., Yuan, Y. et al., 2012. The genomic and transcriptomic architecture of 2,000 breast tumours reveals novel subgroups. Nature 486(7403): 346–352.
[193] Lawrence, R.T., Perez, E.M., Hernández, D., Miller, C.P., Haas, K.M., Irie, H.Y., Lee, S.I., Blau, C.A., & Villén, J., 2015. The proteomic landscape of triple-negative breast cancer. Cell reports 11(4): 630–644.
[194] Ciriello, G., Gatza, M.L., Beck, A.H., Wilkerson, M.D., Rhie, S.K., Pastore, A., Zhang, H., McLellan, M., Yau, C., Kandoth, C. et al., 2015. Comprehensive molecular portraits of invasive lobular breast cancer. Cell 163(2): 506–519. → pages 52, 65, 80
[195] Zang, Z.J., Ong, C.K., Cutcutache, I., Yu, W., Zhang, S.L., Huang, D., Ler, L.D., Dykema, K., Gan, A., Tao, J. et al., 2011. Genetic and structural variation in the gastric cancer kinome revealed through targeted deep sequencing. Cancer research 71(1): 29–39. → pages 52, 65
[196] Meyer, L.R., Zweig, A.S., Hinrichs, A.S., Karolchik, D., Kuhn, R.M., Wong, M., Sloan, C.A., Rosenbloom, K.R., Roe, G., Rhead, B. et al., 2013. The UCSC Genome Browser database: extensions and updates 2013. Nucleic acids research 41(D1): D64–D69. → pages 53
[197] Wu, T.D. & Nacu, S., 2010. Fast and SNP-tolerant detection of complex variants and splicing in short reads. Bioinformatics 26(7): 873–881. → pages 53, 91, 92
[198] Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., Marth, G., Abecasis, G., Durbin, R. et al., 2009. The sequence alignment/map format and SAMtools. Bioinformatics 25(16): 2078–2079. → pages 53, 91, 134
[199] Love, M.I., Huber, W., & Anders, S., 2014. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology 15(12): 550. → pages 53, 68
[200] Anders, S., Pyl, P.T., & Huber, W., 2015. HTSeq–a Python framework to work with high-throughput sequencing data. Bioinformatics 31(2): 166–169. → pages 53
[201] Trapnell, C., Williams, B.A., Pertea, G., Mortazavi, A., Kwan, G., Van Baren, M.J., Salzberg, S.L., Wold, B.J., & Pachter, L., 2010. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nature biotechnology 28(5): 511–515. → pages 53
[202] Subramanian, A., Tamayo, P., Mootha, V.K., Mukherjee, S., Ebert, B.L., Gillette, M.A., Paulovich, A., Pomeroy, S.L., Golub, T.R., Lander, E.S. et al., 2005.
Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences 102(43): 15545–15550. → pages 54, 69, 85, 89
[203] Merico, D., Isserlin, R., Stueker, O., Emili, A., & Bader, G.D., 2010. Enrichment map: a network-based method for gene-set enrichment visualization and interpretation. PloS one 5(11): e13984. → pages 54, 92
[204] Shannon, P., Markiel, A., Ozier, O., Baliga, N.S., Wang, J.T., Ramage, D., Amin, N., Schwikowski, B., & Ideker, T., 2003. Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome research 13(11): 2498–2504. → pages 54, 92
[205] Retelska, D., Iseli, C., Bucher, P., Jongeneel, C.V., & Naef, F., 2006. Similarities and differences of polyadenylation signals in human and fly. BMC genomics 7(1): 1. → pages 56
[206] Busch, A. & Hertel, K.J., 2012. Evolution of SR protein and hnRNP splicing regulatory factors. Wiley Interdisciplinary Reviews: RNA 3(1): 1–12. → pages 63
[207] Eifler, T.T., Shao, W., Bartholomeeusen, K., Fujinaga, K., Jäger, S., Johnson, J.R., Luo, Z., Krogan, N.J., & Peterlin, B.M., 2015. Cyclin-dependent kinase 12 increases 3′ end processing of growth factor-induced c-FOS transcripts. Molecular and cellular biology 35(2): 468–478. → pages 63
[208] Ingham, R.J., Colwill, K., Howard, C., Dettwiler, S., Lim, C.S., Yu, J., Hersi, K., Raaijmakers, J., Gish, G., Mbamalu, G. et al., 2005. WW domains provide a platform for the assembly of multiprotein networks. Molecular and cellular biology 25(16): 7092–7106. → pages 63
[209] Jung, S.Y., Malovannaya, A., Wei, J., O'Malley, B.W., & Qin, J., 2005. Proteomic analysis of steady-state nuclear hormone receptor coactivator complexes. Molecular endocrinology 19(10): 2451–2465. → pages 63
[210] Mi, H., Lazareva-Ulitsky, B., Loo, R., Kejariwal, A., Vandergriff, J., Rabkin, S., Guo, N., Muruganujan, A., Doremieux, O., Campbell, M.J. et al., 2005.
The PANTHER database of protein families, subfamilies, functions and pathways. Nucleic acids research 33(suppl 1): D284–D288. → pages 64
[211] Elkon, R., Ugalde, A.P., & Agami, R., 2013. Alternative cleavage and polyadenylation: extent, regulation and function. Nature Reviews Genetics 14(7): 496–506. → pages 63
[212] Anders, S. & Huber, W., 2010. Differential expression analysis for sequence count data. Genome Biology 11: R106–R106. → pages 68
[213] Juan, H., Lin, Y., Chen, H., & Fann, M., 2016. Cdk12 is essential for embryonic development and the maintenance of genomic stability. Cell Death & Differentiation 23(6): 1038–1048. → pages 70
[214] Iorns, E., Martens-de Kemp, S.R., Lord, C.J., & Ashworth, A., 2009. CRK7 modifies the MAPK pathway and influences the response to endocrine therapy. Carcinogenesis 30(10): 1696–1701. → pages 70
[215] Guleria, A. & Chandna, S., 2016. ATM kinase: Much more than a DNA damage responsive protein. DNA repair 39: 1–20. → pages 73
[216] Chuang, J.Z., Zhou, H., Zhu, M., Li, S.H., Li, X.J., & Sung, C.H., 2002. Characterization of a brain-enriched chaperone, MRJ, that inhibits Huntingtin aggregation and toxicity independently. Journal of Biological Chemistry 277(22): 19831–19838. → pages 75
[217] Fayazi, Z., Ghosh, S., Marion, S., Bao, X., Shero, M., & Kazemi-Esfarjani, P., 2006. A Drosophila ortholog of the human mrj modulates polyglutamine toxicity and aggregation. Neurobiology of disease 24(2): 226–244. → pages 75
[218] Mitra, A., Fillmore, R.A., Metge, B.J., Rajesh, M., Xi, Y., King, J., Ju, J., Pannell, L., Shevde, L.A., & Samant, R.S., 2008. Large isoform of MRJ (DNAJB6) reduces malignant activity of breast cancer. Breast Cancer Research 10(2): R22. → pages 75, 80
[219] Yu, V.Z., Wong, V.C.L., Dai, W., Ko, J.M.Y., Lam, A.K.Y., Chan, K.W., Samant, R.S., Lung, H.L., Shuen, W.H., Law, S. et al., 2015. Nuclear localization of DNAJB6 is associated with survival of patients with esophageal cancer and reduces Akt signaling and proliferation of cancer cells.
Gastroenterology 149(7): 1825–1836. → pages 75
[220] Li, X., Chatterjee, N., Spirohn, K., Boutros, M., & Bohmann, D., 2016. Cdk12 Is A Gene-Selective RNA Polymerase II Kinase That Regulates a Subset of the Transcriptome, Including Nrf2 Target Genes. Scientific reports 6: 21455. → pages 77
[221] Moasser, M.M., 2007. The oncogene HER2: its signaling and transforming functions and its role in human cancer pathogenesis. Oncogene 26(45): 6469–6487. → pages 78
[222] Bryant, H.E., Schultz, N., Thomas, H.D., Parker, K.M., Flower, D., Lopez, E., Kyle, S., Meuth, M., Curtin, N.J., & Helleday, T., 2005. Specific killing of BRCA2-deficient tumours with inhibitors of poly (adp-ribose) polymerase. Nature 434(7035): 913–917. → pages 79
[223] Farmer, H., McCabe, N., Lord, C.J., Tutt, A.N., Johnson, D.A., Richardson, T.B., Santarosa, M., Dillon, K.J., Hickson, I., Knights, C. et al., 2005. Targeting the DNA repair defect in BRCA mutant cells as a therapeutic strategy. Nature 434(7035): 917–921.
[224] Fong, P.C., Boss, D.S., Yap, T.A., Tutt, A., Wu, P., Mergui-Roelvink, M., Mortimer, P., Swaisland, H., Lau, A., O'Connor, M.J. et al., 2009. Inhibition of poly (adp-ribose) polymerase in tumors from BRCA mutation carriers. New England Journal of Medicine 361(2): 123–134. → pages 79
[225] Thompson, E.W. & Price, J.T., 2002. Mechanisms of tumour invasion and metastasis: emerging targets for therapy. Expert opinion on therapeutic targets 6(2): 217–233. → pages 80
[226] Modrek, B., Lee, C. et al., 2002. A genomic view of alternative splicing. Nature genetics 30(1): 13–19. → pages 81
[227] Zhang, J. & Manley, J.L., 2013. Misregulation of pre-mRNA alternative splicing in cancer. Cancer discovery 3(11): 1228–1237. → pages 81
[228] Danan-Gotthold, M., Golan-Gerstl, R., Eisenberg, E., Meir, K., Karni, R., & Levanon, E., 2015. Identification of recurrent regulated alternative splicing events across human solid tumors. Nucleic acids research 43(10): 5130–5144.
→ pages 81
[229] Sveen, A., Kilpinen, S., Ruusulehto, A., Lothe, R., & Skotheim, R., 2016. Aberrant RNA splicing in cancer; expression changes and driver mutations of splicing factor genes. Oncogene 35(19): 2413–2427. → pages 81
[230] Chen, J. & Weiss, W., 2015. Alternative splicing in cancer: implications for biology and therapy. Oncogene 34(1): 1–14. → pages 81
[231] Kim, E., Goren, A., & Ast, G., 2008. Insights into the connection between cancer and alternative splicing. Trends in Genetics 24(1): 7–10. → pages 82
[232] Tavares, R., Scherer, N.M., Ferreira, C.G., Costa, F.F., & Passetti, F., 2015. Splice variants in the proteome: a promising and challenging field to targeted drug discovery. Drug discovery today 20(3): 353–360. → pages 82
[233] Anczuków, O. & Krainer, A.R., 2015. The spliceosome, a potential Achilles heel of MYC-driven tumors. Genome Medicine 1(7): 1–4. → pages 82
[234] Nelander, S., Wang, W., Nilsson, B., She, Q.B., Pratilas, C., Rosen, N., Gennemark, P., & Sander, C., 2008. Models from experiments: combinatorial drug perturbations of cancer cells. Molecular systems biology 4(1): 216. → pages 82, 102
[235] Liberali, P., Snijder, B., & Pelkmans, L., 2015. Single-cell and multivariate approaches in genetic perturbation screens. Nature Reviews Genetics 16(1): 18–32.
[236] Lehár, J., Zimmermann, G.R., Krueger, A.S., Molnar, R.A., Ledell, J.T., Heilbut, A.M., Short, G.F., Giusti, L.C., Nolan, G.P., Magid, O.A. et al., 2007. Chemical combination effects predict connectivity in biological systems. Molecular systems biology 3(1): 80. → pages 82
[237] Ideker, T., Dutkowski, J., & Hood, L., 2011. Boosting signal-to-noise in complex biology: prior knowledge is power. Cell 144(6): 860–863. → pages 83
[238] Cerami, E.G., Gross, B.E., Demir, E., Rodchenkov, I., Babur, Ö., Anwar, N., Schultz, N., Bader, G.D., & Sander, C., 2011. Pathway Commons, a web resource for biological pathway data. Nucleic acids research 39(suppl 1): D685–D690. → pages 84
[239] Kanehisa, M.
& Goto, S., 2000. KEGG: kyoto encyclopedia of genes and genomes. Nucleic acids research 28(1): 27–30.
[240] Franceschini, A., Szklarczyk, D., Frankild, S., Kuhn, M., Simonovic, M., Roth, A., Lin, J., Minguez, P., Bork, P., Von Mering, C. et al., 2013. STRING v9.1: protein-protein interaction networks, with increased coverage and integration. Nucleic acids research 41(D1): D808–D815.
[241] Szklarczyk, D., Franceschini, A., Wyder, S., Forslund, K., Heller, D., Huerta-Cepas, J., Simonovic, M., Roth, A., Santos, A., Tsafou, K.P. et al., 2014. STRING v10: protein–protein interaction networks, integrated over the tree of life. Nucleic acids research pages D447–D452.
[242] Stark, C., Breitkreutz, B.J., Reguly, T., Boucher, L., Breitkreutz, A., & Tyers, M., 2006. BioGRID: a general repository for interaction datasets. Nucleic acids research 34(suppl 1): D535–D539. → pages 84
[243] Chatr-Aryamontri, A., Breitkreutz, B.J., Oughtred, R., Boucher, L., Heinicke, S., Chen, D., Stark, C., Breitkreutz, A., Kolas, N., O'Donnell, L. et al., 2015. The BioGRID interaction database: 2015 update. Nucleic acids research 43(D1): D470–D478. → pages 84
[244] Rolland, T., Taşan, M., Charloteaux, B., Pevzner, S.J., Zhong, Q., Sahni, N., Yi, S., Lemmens, I., Fontanillo, C., Mosca, R. et al., 2014. A proteome-scale map of the human interactome network. Cell 159(5): 1212–1226. → pages 84
[245] Huang, D.W., Sherman, B.T., & Lempicki, R.A., 2009. Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists. Nucleic acids research 37(1): 1–13. → pages 85
[246] Hung, J.H., Yang, T.H., Hu, Z., Weng, Z., & DeLisi, C., 2012. Gene set enrichment analysis: performance evaluation and usage guidelines. Briefings in bioinformatics 13(3): 281–291. → pages 85
[247] Khatri, P., Sirota, M., & Butte, A., 2011. Ten years of pathway analysis: current approaches and outstanding challenges. PLoS computational biology 8(2): e1002375–e1002375.
→ pages 85
[248] Al-Shahrour, F., Díaz-Uriarte, R., & Dopazo, J., 2004. FatiGO: a web tool for finding significant associations of Gene Ontology terms with groups of genes. Bioinformatics 20(4): 578–580. → pages 85
[249] Zeeberg, B.R., Feng, W., Wang, G., Wang, M.D., Fojo, A.T., Sunshine, M., Narasimhan, S., Kane, D.W., Reinhold, W.C., Lababidi, S. et al., 2003. GoMiner: a resource for biological interpretation of genomic and proteomic data. Genome Biology 4(4): R28.
[250] Bindea, G., Mlecnik, B., Hackl, H., Charoentong, P., Tosolini, M., Kirilovsky, A., Fridman, W.H., Pagès, F., Trajanoski, Z., & Galon, J., 2009. ClueGO: a Cytoscape plug-in to decipher functionally grouped gene ontology and pathway annotation networks. Bioinformatics 25(8): 1091–1093.
[251] Du, Z., Zhou, X., Ling, Y., Zhang, Z., & Su, Z., 2010. agriGO: a GO analysis toolkit for the agricultural community. Nucleic Acids Research 38(suppl 2): W64–W70.
[252] Ye, J., Fang, L., Zheng, H., Zhang, Y., Chen, J., Zhang, Z., Wang, J., Li, S., Li, R., Bolund, L. et al., 2006. WEGO: a web tool for plotting GO annotations. Nucleic acids research 34(suppl 2): W293–W297. → pages 85
[253] Tian, L., Greenberg, S.A., Kong, S.W., Altschuler, J., Kohane, I.S., & Park, P.J., 2005. Discovering statistically significant pathways in expression profiling studies. Proceedings of the National Academy of Sciences of the United States of America 102(38): 13544–13549. → pages 85
[254] Goeman, J.J., Van De Geer, S.A., De Kort, F., & Van Houwelingen, H.C., 2004. A global test for groups of genes: testing association with a clinical outcome. Bioinformatics 20(1): 93–99. → pages 85
[255] Draghici, S., Khatri, P., Tarca, A.L., Amin, K., Done, A., Voichita, C., Georgescu, C., & Romero, R., 2007. A systems biology approach for pathway level analysis. Genome research 17(10): 1537–1545. → pages 85
[256] Tarca, A.L., Draghici, S., Khatri, P., Hassan, S.S., Mittal, P., Kim, J.s., Kim, C.J., Kusanovic, J.P., & Romero, R., 2009.
A novel signaling pathway impact analysis. Bioinformatics 25(1): 75–82. → pages 85
[257] Pham, L., Christadore, L., Schaus, S., & Kolaczyk, E.D., 2011. Network-based prediction for sources of transcriptional dysregulation using latent pathway identification analysis. Proceedings of the National Academy of Sciences 108(32): 13347–13352. → pages 85, 86, 101
[258] Woo, J.H., Shimoni, Y., Yang, W.S., Subramaniam, P., Iyer, A., Nicoletti, P., Martínez, M.R., López, G., Mattioli, M., Realubit, R. et al., 2015. Elucidating compound mechanism of action by network perturbation analysis. Cell 162(2): 441–451. → pages 86, 87
[259] Chindelevitch, L., Ziemek, D., Enayetallah, A., Randhawa, R., Sidders, B., Brockel, C., & Huang, E.S., 2012. Causal reasoning on biological networks: interpreting transcriptional changes. Bioinformatics 28(8): 1114–1121. → pages 88
[260] Chindelevitch, L., Loh, P.R., Enayetallah, A., Berger, B., & Ziemek, D., 2012. Assessing statistical significance in causal graphs. BMC bioinformatics 13(1): 35. → pages 89
[261] Krämer, A., Green, J., Pollard Jr, J., & Tugendreich, S., 2014. Causal analysis approaches in Ingenuity Pathway Analysis. Bioinformatics 30(4): 523–530. → pages 88, 89
[262] Zarringhalam, K., Enayetallah, A., Gutteridge, A., Sidders, B., & Ziemek, D., 2013. Molecular causes of transcriptional response: a Bayesian prior knowledge approach. Bioinformatics 29(24): 3167–3173. → pages 88
[263] Jaeger, S., Min, J., Nigsch, F., Camargo, M., Hutz, J., Cornett, A., Cleaver, S., Buckler, A., & Jenkins, J.L., 2014. Causal network models for predicting compound targets and driving pathways in cancer. Journal of biomolecular screening 19(5): 791–802. → pages 88
[264] Martin, F., Thomson, T.M., Sewer, A., Drubin, D.A., Mathis, C., Weisensee, D., Pratt, D., Hoeng, J., & Peitsch, M.C., 2012. Assessment of network perturbation amplitudes by applying high-throughput data to causal biological networks.
BMC Systems Biology 6: 54.
[265] Vasilyev, D.M., Thomson, T.M., Frushour, B.P., Martin, F., & Sewer, A., 2014. An algorithm for score aggregation over causal biological networks based on random walk sampling. BMC Research Notes 7(1): 516.
[266] Laenen, G., Ardeshirdavani, A., Moreau, Y., & Thorrez, L., 2015. Galahad: a web server for drug effect analysis from gene expression. Nucleic Acids Research 43(W1): W208–W212. → pages 88
[267] Lefebvre, C., Rajbhandari, P., Alvarez, M.J., Bandaru, P., Lim, W.K., Sato, M., Wang, K., Sumazin, P., Kustagi, M., Bisikirska, B.C. et al., 2010. A human B-cell interactome identifies MYB and FOXM1 as master regulators of proliferation in germinal centers. Molecular Systems Biology 6: 377. → pages 89, 101
[268] Lachmann, A. & Ma'ayan, A., 2009. KEA: kinase enrichment analysis. Bioinformatics 25(5): 684–686. → pages 89
[269] Koschmann, J., Bhar, A., Stegmaier, P., Kel, A.E., & Wingender, E., 2015. Upstream Analysis: An Integrated Promoter-Pathway Analysis Approach to Causal Interpretation of Microarray Data. Microarrays 4(2): 270–286. → pages 89
[270] Chen, E.Y., Xu, H., Gordonov, S., Lim, M.P., Perkins, M.H., & Ma'ayan, A., 2012. Expression2Kinases: mRNA profiling linked to multiple upstream regulatory layers. Bioinformatics 28(1): 105–111. → pages 89
[271] Funnell, T., Tasaki, S., Oloumi, A., Araki, S., Kong, E., Yap, D., Nakayama, Y., Hughes, C.S., Cheng, S.W.G., Tozaki, H. et al., 2017. CLK-dependent exon recognition and conjoined gene formation revealed with a novel small molecule inhibitor. Nature Communications 8(1): 7. → pages 90, 91
[272] Chazal, P.E., Daguenet, E., Wendling, C., Ulryck, N., Tomasetto, C., Sargueil, B., & Le Hir, H., 2013. EJC core component MLN51 interacts with eIF3 and activates translation. Proceedings of the National Academy of Sciences 110(15): 5903–5908. → pages 91
[273] Haremaki, T. & Weinstein, D.C., 2012. Eif4a3 is required for accurate splicing of the Xenopus laevis ryanodine receptor pre-mRNA.
Developmental biology 372(1): 103–110.→ pages 91[274] Ngo, J.C.K., Chakrabarti, S., Ding, J.H., Velazquez-Dones, A., Nolen, B., Aubol, B.E.,Adams, J.A., Fu, X.D., & Ghosh, G., 2005. Interplay between SRPK and Clk/Sty kinasesin phosphorylation of the splicing factor ASF/SF2 is regulated by a docking motif inASF/SF2. Molecular cell 20(1): 77–89. → pages 91[275] Ito, M., Iwatani, M., Kamada, Y., Sogabe, S., Nakao, S., Tanaka, T., Kawamoto, T.,Aparicio, S., Nakanishi, A., & Imaeda, Y., 2017. Discovery of selective ATP-competitiveeIF4A3 inhibitors. Bioorganic & Medicinal Chemistry 25(7): 2200–2209. → pages 91[276] Iwatani-Yoshihara, M., Ito, M., Ishibashi, Y., Oki, H., Tanaka, T., Morishita, D., Ito, T.,Kimura, H., Imaeda, Y., Aparicio, S.A. et al., 2017. Discovery and characterization of aeukaryotic initiation factor 4A-3-selective inhibitor that suppresses nonsense-mediatedmRNA decay. ACS Chemical Biology . → pages 91[277] Langfelder, P. & Horvath, S., 2008. WGCNA: an R package for weighted correlationnetwork analysis. BMC Bioinformatics 9: 559. → pages 91, 95, 97, 100[278] Maere, S., Heymans, K., & Kuiper, M., 2005. BiNGO: a Cytoscape plugin to assessoverrepresentation of gene ontology categories in biological networks. Bioinformatics21(16): 3448–3449. → pages 91, 95[279] Rabiner, L.R., 1989. A tutorial on hidden Markov models and selected applications inspeech recognition. Proceedings of the IEEE 77(2): 257–286. → pages 102[280] Leng, N., Li, Y., McIntosh, B.E., Nguyen, B.K., Duffin, B., Tian, S., Thomson, J.A.,Dewey, C.N., Stewart, R., & Kendziorski, C., 2015. EBSeq-HMM: a Bayesian approach132for identifying gene-expression changes in ordered RNA-seq experiments. Bioinformatics31(16): 2614–2622. → pages 102, 103[281] Lange, S.J., Maticzka, D., Mo¨hl, M., Gagnon, J.N., Brown, C.M., & Backofen, R., 2012.Global or local? Predicting secondary structure and accessibility in mRNAs. Nucleic acidsresearch 40(12): 5215–5226. → pages 135[282] Edgar, R., 2004. 
MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Research 32(5): 1792–1797. → pages 136

Appendix A
Supporting Materials for Chapter 2

A.1 Details of the proposed pipeline

In our analysis, we used the dm3 reference genome (FASTA file) from UCSC (http://genome.ucsc.edu). The Drosophila melanogaster annotation file (BDGP5.74 ensembl.gtf) was downloaded from the Ensembl web page (http://www.ensembl.org). To build the corresponding annotation file for the OregonR genome, we employed MUMMER [156, 157] version 3.23 and the NEEDLEMAN-WUNSCH [158] program version 0.3.5. To align short reads to the OregonR genome we executed "tophat2 -F 0 -i 40 -g 40 --library-type fr-secondstrand -r 200 --mate-std-dev 20 --segment-length 16 --read-mismatches 5 --read-edit-dist 5" using TOPHAT2 version v2.0.10. Settings such as the library type and the mate standard deviation were chosen based on the information provided at http://www.modencode.org/.

For each candidate position, we require at least 2 and 5 reads in the flexible and stringent threshold sets, respectively. We employed SAMTOOLS [162, 198] mpileup to extract the reads covering each position. Sites that contain asterisks in the SAMTOOLS mpileup tracks are also discarded, because asterisks indicate small insertions or deletions near a candidate site.

Additionally, at least one of the observed nucleotides from each variant should come from a high-quality read (phred score of at least 20) and lie more than 5 nucleotides from the read ends. This filter can improve the results in two ways: first, random hexamer priming can cause errors in the 5’ starting positions of reads [166]; and second, read ends at splice junctions are prone to being misaligned [89]. We also discard sites where two or more alleles other than the reference allele are observed.

To filter known variations, we use the Ensembl fly variant file from http://uswest.ensembl.org/info/data/ftp/index.html (Ensembl release 74).
Because this file only contains variations on chromosomes X, 2 and 3, we ignored all predictions from other regions.

We filter out candidates with a log likelihood score smaller than 3. Additionally, because sequencing and mapping errors are inevitable, we require the editing ratio to be between 0.03 and 0.97 in order to lower the chance of including homozygous sites in our predictions [166]. These thresholds are the same in both threshold sets.

The thresholds for all four of the SAMTOOLS/BCFTOOLS tests are set to 0.15 in the flexible threshold set and 0.02 in the stringent threshold set. Our results were generated using SAMTOOLS version 0.1.19.

We employ RNAFOLD [125] with default parameters and RNAPLFOLD [165] with "-W 200 -L 150 -u 1" as suggested [281]. For each site we calculate the average of the pairing probabilities over a local region of length 5 (the candidate position extended by two nucleotides on each side). A candidate site passes the structural filter if it is in a highly structured region (based on the RNAFOLD [125] energy) or it shows evidence of being part of a stem (based on the RNAPLFOLD [165] pairing probabilities). We set the RNAFOLD [125] thresholds to -10 and -50 for the flexible and stringent threshold sets, and the RNAPLFOLD [165] thresholds to 0.2 and 0.7, respectively. The analysis in the paper was carried out using RNAFOLD version 2.0.4 and RNAPLFOLD version 2.0.7.

To find alternatively used exons, we applied DEXSEQ [74] version 1.8.0. Where transcripts have overlapping exons with different boundaries, DEXSEQ cuts the exons into multiple parts (see [74] for more details) and analyses their usage separately. Each of these exonic parts is treated as an exon in our analysis when we investigate the potential inter-relation between editing and splicing; however, we only report the ones that are longer than 10 nucleotides.
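The site-level filters described in this section can be summarized in a short Python sketch. This is only an illustration of how the quoted thresholds might combine, not the code used in the thesis: the class and field names (`ThresholdSet`, `Candidate`, `n_reads`, and so on) are hypothetical, and the comparison directions (energy at or below the threshold, mean pairing probability at or above it, inclusive editing-ratio bounds) are assumptions. The SAMTOOLS/BCFTOOLS tests, the read-quality filter and the known-variant filter are deliberately left out.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ThresholdSet:
    min_reads: int          # minimum number of reads required at the site
    max_fold_energy: float  # RNAfold energy must be at or below this (assumed direction)
    min_pair_prob: float    # mean RNAplfold pairing probability must reach this (assumed)


# Threshold values quoted in the text for the two threshold sets.
FLEXIBLE = ThresholdSet(min_reads=2, max_fold_energy=-10.0, min_pair_prob=0.2)
STRINGENT = ThresholdSet(min_reads=5, max_fold_energy=-50.0, min_pair_prob=0.7)

# These two thresholds are the same in both sets.
MIN_LOG_LIKELIHOOD = 3.0
EDITING_RATIO_RANGE = (0.03, 0.97)


@dataclass(frozen=True)
class Candidate:
    """Hypothetical per-site summary; the field names are illustrative only."""
    n_reads: int
    log_likelihood: float
    editing_ratio: float
    fold_energy: float
    pair_probs: Tuple[float, ...]  # pairing probabilities for the 5-nt window


def mean_pair_prob(site: Candidate) -> float:
    """Average pairing probability over the local window around the site."""
    return sum(site.pair_probs) / len(site.pair_probs)


def passes_filters(site: Candidate, thresholds: ThresholdSet) -> bool:
    """Apply the coverage, likelihood, editing-ratio and structural filters."""
    low, high = EDITING_RATIO_RANGE
    if site.n_reads < thresholds.min_reads:
        return False
    if site.log_likelihood < MIN_LOG_LIKELIHOOD:
        return False
    if not (low <= site.editing_ratio <= high):
        return False
    # Structural filter: highly structured region OR evidence of being in a stem.
    structured = site.fold_energy <= thresholds.max_fold_energy
    in_stem = mean_pair_prob(site) >= thresholds.min_pair_prob
    return structured or in_stem


example = Candidate(n_reads=3, log_likelihood=4.2, editing_ratio=0.4,
                    fold_energy=-20.0, pair_probs=(0.5, 0.6, 0.4, 0.5, 0.5))
print(passes_filters(example, FLEXIBLE), passes_filters(example, STRINGENT))
```

Keeping the two threshold sets as plain data makes it easy to run the same filter under both settings and compare the resulting candidate lists.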
Additionally, when we compare two tissues, we only consider genes that are predicted to have FPKM (fragments per kilobase of transcript per million fragments mapped) expression values greater than 2. Expression values were computed with the CUFFLINKS [167] package version 2.2.1.

In our analysis, we classify exonic regions into two groups. For each gene, we pool all the exons from all the transcripts and find the union of these exonic regions. Next, if a region comprises multiple exons that are not identical, we call it an exonic region with multiple acceptor/donor sites. The other group contains all the remaining exonic regions.

When we searched for structural features using TRANSAT [128], we only considered helices that contain at least 8 base-pairs. The 15 fly species alignment was downloaded from UCSC (http://genome.ucsc.edu) for the regions of interest. We added the OregonR genome to the alignments and realigned the 16 sequences in each region using MUSCLE [282] (version 3.8.31).

Micro-RNA target sites were downloaded from http://microrna.org (August 2010 release), and miRNA sites were downloaded from http://www.mirbase.org/ (miRBase v19).

A.2 Editing events within or in close vicinity of alternatively spliced exonic regions

The following table presents alternatively spliced exonic parts for which we found editing events within them or in their close vicinity (+/- 150 nt).

Chrom  Strand  OregonR start  OregonR end  Reference start  Reference end  Ensembl ID  Gene name
chr2L  +  2752952  2753088  2753235  2753371  FBgn0031453  CG9894
chr2L  +  2756391  2757424  2756664  2757703  FBgn0031453  CG9894
chr2L  +  2964499  2968083  2964852  2968434  FBgn0010263  Rbp9
chr2L  +  4307994  4308077  4308776  4308859  FBgn0010473  tutl
chr2L  +  4313423  4313836  4314202  4314615  FBgn0010473  tutl
chr2L  +  4313837  4315272  4314616  4316050  FBgn0010473  tutl
chr2L  +  5126158  5138115  5127058  5139015  FBgn0261836  Msp-300
chr2L  +  5157555  5164499  5158455  5165399  FBgn0261836  Msp-300
chr2L  +  5164500  5185064  5165400  5185964  FBgn0261836  Msp-300
chr2L  +  5205305  5205461  5206205  5206361  FBgn0065104  snmRNA:158
chr2L  +  5205462  5206102  5206362  5207002  FBgn0261836  Msp-300
chr2L  +  6497626  6498803  6498642  6499819  FBgn0051637  CG31637
chr2L  +  7050813  7052418  7051760  7053365  FBgn0262872  milt
chr2L  +  8109514  8115564  8110481  8116531  FBgn0261822  Bsg
chr2L  +  8176785  8177403  8177779  8178397  FBgn0031993  Piezo
chr2L  +  9255522  9255655  9256563  9256696  FBgn0028433,FBgn0263984  Ggamma30A,CG43733
chr2L  +  9255656  9255773  9256697  9256814  FBgn0028433  Ggamma30A
chr2L  +  12723188  12723334  12724472  12724618  FBgn0032456  MRP
chr2L  +  14082195  14082304  14083458  14083567  FBgn0028875  nAcRalpha-34E
chr2L  +  16172868  16173007  16174057  16174196  FBgn0001991  Ca-alpha1D
chr2L  +  16748691  16750760  16749944  16752013  FBgn0032600  CG17912
chr2L  +  16750761  16752451  16752014  16753704  FBgn0032600  CG17912
chr2L  +  16778450  16778567  16779728  16779845  FBgn0264695  Mhc
chr2L  +  21124662  21124877  21126159  21126374  FBgn0040297  Nhe2
chr2L  +  22443026  22447535  22444518  22449027  FBgn0040010  CG17493
chr2L  +  22735693  22736072  22737200  22737579  FBgn0041004  CG17715
chr2L  -  227580  228164  227548  228132  FBgn0086902  kis
chr2L  -  1008285  1011311  1008378  1011417  FBgn0031294  IA-2
chr2L  -  2782981  2785243  2783275  2785538  FBgn0004242  Syt1
chr2L  -  3461785  3462873  3462213  3463289  FBgn0005616  msl-2
chr2L  -  3503294  3503904  3503689  3504299  FBgn0014396  tim
chr2L  -  6631823  6633573  6632798  6634548  FBgn0051635  CG31635
chr2L  -  6792277  6792422  6793228  6793373  FBgn0015777  nrv2
chr2L  -  6792423  6792453  6793374  6793404  FBgn0015777  nrv2
chr2L  -  7227712  7228024  7228700  7229012  FBgn0259111  Ndae1
chr2L  -  7802122  7802654  7803044  7803576  FBgn0031952  cdc14
chr2L  -  9788515  9788572  9789675  9789732  FBgn0042174  CR18854
chr2L  -  9789454  9789583  9790616  9790745  FBgn0042174  CR18854
chr2L  -  9808062  9808242  9809222  9809402  FBgn0032151  nAcRalpha-30D
chr2L  -  9934485  9936150  9935570  9937237  FBgn0051712  CG31712
chr2L  -  9958179  9959380  9959248  9960449  FBgn0032172  CG5850
chr2L  -  11158599  11158963  11159880  11160244  FBgn0259822  Ca-beta
chr2L  -  11828938  11829492  11830277  11830831  FBgn0259225  Pde1c
chr2L  -  17369521  17370498  17370843  17371820  FBgn0032633  Lrch
chr2L  -  21662309  21664103  21663806  21665595  FBgn0032957  CG2225
chr2R  +  2704047  2705453  2704298  2705704  FBgn0033107  koi
chr2R  +  4790682  4792342  4790636  4792296  FBgn0004921  Ggamma1
chr2R  +  5996022  5997233  5995980  5997192  FBgn0004907  14-3-3zeta
chr2R  +  6166227  6172274  6166251  6172306  FBgn0033504  CAP
chr2R  +  6499104  6499267  6499021  6499184  FBgn0263102  psq
chr2R  +  9704073  9704297  9704260  9704484  FBgn0261041  stj
chr2R  +  10164604  10166977  10164786  10167147  FBgn0263397  Ih
chr2R  +  10185231  10185547  10185402  10185718  FBgn0263397  Ih
chr2R  +  12002296  12002709  12002445  12002858  FBgn0034075  Asph
chr2R  +  13249308  13252422  13249777  13252891  FBgn0261642  mbl
chr2R  +  13260853  13266410  13261323  13266881  FBgn0261642  mbl
chr2R  +  13458532  13458825  13458978  13459271  FBgn0040294  POSH
chr2R  +  13458887  13459381  13459333  13459827  FBgn0040294  POSH
chr2R  +  14708584  14708731  14708995  14709142  FBgn0010551  l(2)03709
chr2R  +  15110431  15111243  15110745  15111557  FBgn0263395  hppy
chr2R  +  15112756  15112977  15113070  15113291  FBgn0263395  hppy
chr2R  +  16893090  16894180  16894147  16895245  FBgn0034570  CG10543
chr2R  +  17030976  17030998  17032011  17032033  FBgn0021872  Xbp1
chr2R  +  20770449  20772868  20772049  20774468  FBgn0085442  NKAIN
chr2R  +  20796894  20797221  20798487  20798814  FBgn0085434  NaCP60E
chr2R  +  20797929  20798103  20799522  20799696  FBgn0085434  NaCP60E
chr2R  +  20801661  20801761  20803254  20803354  FBgn0085434  NaCP60E
chr2R  -  674709  675916  674643  675850  FBgn0250830  CG12547
chr2R  -  2814780  2815538  2814895  2815653  FBgn0053558  mim
chr2R  -  2818453  2819232  2818568  2819347  FBgn0053558  mim
chr2R  -  5121984  5123114  5121950  5123080  FBgn0010114  hig
chr2R  -  5172488  5172577  5172464  5172553  FBgn0020621  Pkn
chr2R  -  5610974  5611177  5611045  5611248  FBgn0259678  sqa
chr2R  -  5771494  5771588  5771508  5771589  FBgn0033463  CG1513
chr2R  -  5911719  5911875  5911708  5911864  FBgn0022382  Pka-R2
chr2R  -  9400439  9405293  9400637  9405491  FBgn0260964  Vmat
chr2R  -  9771446  9781481  9771600  9781634  FBgn0013733  shot
chr2R  -  9954582  9955019  9954752  9955189  FBgn0040752  Prosap
chr2R  -  11150036  11151989  11150309  11152262  FBgn0083959  trpm
chr2R  -  11652369  11652498  11652628  11652757  FBgn0083919  Zasp52
chr2R  -  11661984  11662733  11662263  11663012  FBgn0083919  Zasp52
chr2R  -  19049472  19051556  19050535  19052619  FBgn0085400  CG34371
chr3L  +  893317  895105  893521  895313  FBgn0052479  CG32479
chr3L  +  1620402  1620503  1620817  1620918  FBgn0035244  ABCB7
chr3L  +  2923450  2924848  2924078  2925476  FBgn0262593  Shab
chr3L  +  3085267  3085738  3085865  3086336  FBgn0035397  CG11486
chr3L  +  4429108  4429621  4429716  4430229  FBgn0000038  nAcRbeta-64B
chr3L  +  9068900  9069378  9070078  9070556  FBgn0023479  Tequila
chr3L  +  9824730  9826147  9826028  9827445  FBgn0264489  CG43897
chr3L  +  20368371  20368907  20372412  20372948  FBgn0036980  RhoBTB
chr3L  +  20761984  20762135  20766064  20766215  FBgn0016696  Pitslre
chr3L  +  20762136  20762813  20766216  20766893  FBgn0016696  Pitslre
chr3L  +  21202134  21202496  21206229  21206591  FBgn0037060  CG10508
chr3L  +  21915634  21921636  21919800  21925802  FBgn0262737  mub
chr3L  +  23273132  23276049  23277312  23280229  FBgn0037212  nAcRalpha-80B
chr3L  +  24531314  24532438  24535510  24536634  FBgn0044510  mRpS5
chr3L  -  2039221  2040112  2039681  2040572  FBgn0086906  sls
chr3L  -  2561442  2562353  2561932  2562843  FBgn0010909  msn
chr3L  -  4096290  4096364  4096904  4096978  FBgn0035497  CG14995
chr3L  -  4322064  4322921  4322672  4323529  FBgn0035533  Cip4
chr3L  -  4367031  4368514  4367649  4369133  FBgn0035538  DopEcR
chr3L  -  4824734  4825093  4825328  4825687  FBgn0261797  Dhc64C
chr3L  -  5148383  5150468  5148921  5151007  FBgn0052423  shep
chr3L  -  6944092  6946063  6944904  6946872  FBgn0035720  CG10077
chr3L  -  7172281  7172729  7173238  7173686  FBgn0263218  Dscam2
chr3L  -  7822499  7823929  7823487  7824910  FBgn0016694  Pdp1
chr3L  -  7920585  7922201  7921569  7923185  FBgn0024187  syd
chr3L  -  11549184  11549651  11551372  11551839  FBgn0259481  Mob2
chr3L  -  12199603  12203376  12201959  12205752  FBgn0260941  app
chr3L  -  13424958  13425445  13427842  13428329  FBgn0036360  CG10713
chr3L  -  14497699  14500007  14500714  14503019  FBgn0087007  bbg
chr3L  -  17048362  17054741  17052007  17058366  FBgn0260943  Rbp6
chr3L  -  17959290  17961085  17962744  17964539  FBgn0000568  Eip75B
chr3L  -  18047937  18049283  18051337  18052698  FBgn0000568  Eip75B
chr3L  -  19132438  19136013  19135998  19139573  FBgn0016797  fz2
chr3L  -  19880748  19884381  19884505  19888138  FBgn0014037  Su(Tpl)
chr3L  -  20145870  20146412  20149780  20150322  FBgn0261556  CG42674
chr3L  -  21186766  21187109  21190860  21191203  FBgn0053054  CG33054
chr3L  -  21187110  21187176  21191204  21191270  FBgn0053054  CG33054
chr3R  +  121421  122682  121423  122684  FBgn0041605  cpx
chr3R  +  528121  530970  528134  530983  FBgn0263346  CG43427
chr3R  +  3019037  3019078  3018693  3018734  FBgn0086372  lap
chr3R  +  3829261  3829801  3828942  3829482  FBgn0037536  CG2698
chr3R  +  5274501  5275314  5274377  5275190  FBgn0261552  ps
chr3R  +  6021459  6021790  6021438  6021769  FBgn0004575  Syn
chr3R  +  6067512  6073989  6067519  6073996  FBgn0261928  CG42795
chr3R  +  7217000  7217144  7217251  7217395  FBgn0004595  pros
chr3R  +  9489835  9490352  9490657  9491174  FBgn0004587  B52
chr3R  +  9516531  9518800  9517357  9519629  FBgn0024555  flfl
chr3R  +  10615974  10618161  10616975  10619164  FBgn0263929  jvl
chr3R  +  11237203  11238136  11238301  11239234  FBgn0041188  Atx2
chr3R  +  11780587  11780904  11781712  11782029  FBgn0013334  Sap47
chr3R  +  12127035  12127383  12128095  12128443  FBgn0250823  gish
chr3R  +  13744178  13745815  13745554  13747191  FBgn0263995  cpo
chr3R  +  13835530  13836146  13836819  13837435  FBgn0263995  cpo
chr3R  +  13998906  13999410  14000262  14000766  FBgn0042693  PP2A-B’
chr3R  +  14002949  14004109  14004328  14005496  FBgn0042693  PP2A-B’
chr3R  +  14794533  14794750  14796353  14796570  FBgn0261262,FBgn0263983  CG42613,CG43732
chr3R  +  14795191  14795317  14797012  14797138  FBgn0261262,FBgn0263983  CG42613,CG43732
chr3R  +  15589530  15589636  15591367  15591473  FBgn0024963  GluClalpha
chr3R  +  16145871  16147064  16147671  16148864  FBgn0261550  CG42668
chr3R  +  16834442  16834471  16836308  16836337  FBgn0013995  Calx
chr3R  +  17036498  17038080  17038410  17039991  FBgn0264357  SNF4Agamma
chr3R  +  18426603  18427004  18428784  18429185  FBgn0051158  Efa6
chr3R  +  20529756  20529892  20532259  20532395  FBgn0003429  slo
chr3R  +  20530492  20531282  20532995  20533784  FBgn0003429  slo
chr3R  +  20531283  20531615  20533785  20534111  FBgn0003429  slo
chr3R  +  20531616  20534411  20534112  20536911  FBgn0003429  slo
chr3R  +  21425417  21431422  21428144  21434145  FBgn0011666  msi
chr3R  +  23530288  23531240  23533099  23534051  FBgn0039544  CG12877
chr3R  +  24737936  24738410  24740662  24741136  FBgn0259220  Doa
chr3R  +  27659530  27659581  27662904  27662955  FBgn0039883  RhoGAP100F
chr3R  -  622768  623397  622790  623419  FBgn0260794  ctrip
chr3R  -  1108329  1109043  1108387  1109101  FBgn0013576  mtd
chr3R  -  1827447  1827676  1827555  1827784  FBgn0003261  Rm62
chr3R  -  1828380  1828724  1828476  1828820  FBgn0003261  Rm62
chr3R  -  1828725  1829298  1828821  1829394  FBgn0003261  Rm62
chr3R  -  3687162  3687752  3686861  3687451  FBgn0037525  CG17816
chr3R  -  4661546  4663665  4661313  4663432  FBgn0262614  pyd
chr3R  -  5845458  5847136  5845427  5847105  FBgn0053208  Mical
chr3R  -  7590636  7591909  7590844  7592117  FBgn0086910  l(3)neo38
chr3R  -  7629238  7630103  7629438  7630309  FBgn0051116  ClC-a
chr3R  -  7772790  7773053  7773103  7773362  FBgn0037963  Cad87A
chr3R  -  10638940  10642622  10639971  10643653  FBgn0053555  btsz
chr3R  -  11921737  11922221  11922858  11923342  FBgn0026059  Mhcl
chr3R  -  12163638  12164821  12164665  12165848  FBgn0040284  SF2
chr3R  -  13598188  13600676  13599554  13602042  FBgn0262562  CG43102
chr3R  -  13706530  13708449  13707906  13709825  FBgn0053547  Rim
chr3R  -  13708450  13709028  13709826  13710404  FBgn0053547  Rim
chr3R  -  14016821  14019193  14018216  14020580  FBgn0011481  Ssdp
chr3R  -  14963038  14965628  14964776  14967365  FBgn0261285  Ppcs
chr3R  -  19926699  19926891  19929101  19929293  FBgn0013343  Syx1A
chr3R  -  19926892  19929063  19929294  19931465  FBgn0013343  Syx1A
chr3R  -  21179732  21180606  21182356  21183230  FBgn0004509  Fur1
chr3R  -  27424594  27425436  27427953  27428795  FBgn0039858  CycG
chrX  +  936417  936786  936466  936835  FBgn0003638  su(w[a])
chrX  +  936787  936917  936836  936966  FBgn0003638  su(w[a])
chrX  +  1541203  1543216  1541398  1543411  FBgn0000210  br
chrX  +  1677514  1677859  1677689  1678034  FBgn0026086  Adar
chrX  +  1677860  1681927  1678035  1682100  FBgn0026086  Adar
chrX  +  2005630  2007591  2005737  2007698  FBgn0000382  csw
chrX  +  2561982  2562647  2562199  2562864  FBgn0003371  sgg
chrX  +  2568738  2569418  2568955  2569635  FBgn0003371  sgg
chrX  +  2569419  2569437  2569636  2569654  FBgn0003371  sgg
chrX  +  2570917  2571662  2571134  2571879  FBgn0003371  sgg
chrX  +  3232900  3237343  3233361  3237800  FBgn0000479  dnc
chrX  +  3846095  3847427  3846800  3848138  FBgn0029687  Vap-33-1
chrX  +  5132003  5132968  5133459  5134415  FBgn0086911  rg
chrX  +  5293695  5295921  5295157  5297372  FBgn0029761  SK
chrX  +  8124249  8124726  8126462  8126939  FBgn0261873  sdt
chrX  +  8131254  8132287  8133475  8134507  FBgn0261873  sdt
chrX  +  9007936  9009533  9010179  9011772  FBgn0030089  AP-1gamma
chrX  +  9066982  9081106  9069224  9083342  FBgn0026206  mei-P26
chrX  +  10742995  10743629  10745738  10746360  FBgn0030240  CG2202
chrX  +  11384395  11384582  11387231  11387418  FBgn0052666  Drak
chrX  +  11594708  11595593  11597542  11598427  FBgn0011754  PhKgamma
chrX  +  11691173  11692768  11694026  11695620  FBgn0000259  CkIIbeta
chrX  +  11726864  11728598  11729760  11731493  FBgn0262684  CG43154
chrX  +  12539706  12540359  12542817  12543464  FBgn0030412  tomosyn
chrX  +  12661863  12663735  12665012  12666884  FBgn0030421  CG3812
chrX  +  13604153  13605164  13607457  13608468  FBgn0052627  NnaD
chrX  +  14823230  14823799  14827047  14827616  FBgn0264078  Flo-2
chrX  +  14888251  14889359  14892167  14893275  FBgn0000535  eag
chrX  +  14890535  14890950  14894446  14894861  FBgn0000535  eag
chrX  +  14892240  14892527  14896148  14896435  FBgn0000535  eag
chrX  +  15793632  15795071  15797823  15799262  FBgn0003392  shi
chrX  +  15795072  15795334  15799263  15799525  FBgn0003392  shi
chrX  +  16228492  16230171  16232805  16234476  FBgn0011764  Dsp1
chrX  +  16326218  16326300  16330538  16330620  FBgn0026575  hang
chrX  +  16823620  16824025  16828071  16828476  FBgn0027556  CG4928
chrX  +  18767196  18769775  18772088  18774669  FBgn0085430  CG34401
chrX  +  19395304  19397722  19400519  19402937  FBgn0027621  Pfrx
chrX  +  21249448  21250319  21255317  21256188  FBgn0003423  slgA
chrX  -  6681872  6683314  6683745  6685194  FBgn0259228  C3G
chrX  -  6977733  6977754  6979803  6979824  FBgn0263563  mir-4956
chrX  -  6977755  6977765  6979825  6979835  FBgn0263563  mir-4956
chrX  -  6977766  6977787  6979836  6979857  FBgn0263563  mir-4956
chrX  -  6977918  6977941  6979988  6980011  FBgn0264270  Sxl
chrX  -  7940386  7941253  7942563  7943432  FBgn0004656  fs(1)h
chrX  -  9170062  9170707  9172285  9172930  FBgn0040236  c11.1
chrX  -  9949290  9949353  9951699  9951762  FBgn0030174  CG15312
chrX  -  10217262  10217284  10219824  10219846  FBgn0259170  alpha-Man-I
chrX  -  11859737  11860300  11862684  11863247  FBgn0263111  cac
chrX  -  11913895  11916122  11916861  11919090  FBgn0030366  Usp7
chrX  -  13093610  13093927  13096773  13097090  FBgn0005410  sno
chrX  -  13094040  13094586  13097203  13097742  FBgn0005410  sno
chrX  -  13161461  13161511  13164676  13164726  FBgn0041210  HDAC4
chrX  -  14677766  14679544  14681511  14683283  FBgn0003301  rut
chrX  -  14681551  14682656  14685281  14686386  FBgn0003301  rut
chrX  -  15818777  15818930  15822964  15823117  FBgn0053180  Ranbp16
chrX  -  15879107  15879164  15883302  15883359  FBgn0030719  eIF5
chrX  -  15967717  15968843  15971920  15973046  FBgn0028397  Tob
chrX  -  16451371  16454003  16455710  16458346  FBgn0030758  CanA-14F
chrX  -  17819913  17820103  17824619  17824809  FBgn0003380  Sh
chrX  -  17855688  17856565  17860381  17861258  FBgn0003380  Sh
chrX  -  19462271  19462409  19467498  19467636  FBgn0031030  Tao
chrX  -  19462410  19462649  19467637  19467876  FBgn0031030  Tao
chrX  -  19462650  19462663  19467877  19467890  FBgn0031030  Tao
chrX  -  20626419  20630935  20632098  20636616  FBgn0085387  shakB
chrX  -  20630936  20632981  20636617  20638654  FBgn0085387  shakB
chrX  -  21072832  21074690  21078702  21080560  FBgn0052521  CG32521
chrX  -  21074691  21075035  21080561  21080905  FBgn0052521  CG32521
chrX  -  21493913  21494001  21499782  21499870  FBgn0024807  DIP1
chrX  -  21494951  21495131  21500820  21501000  FBgn0024807  DIP1

Table A.1: Alternatively spliced exonic parts for which we found editing events in close vicinity

A.3 Genomic regions with evidence for the inter-relation of RNA editing and alternative splicing

The following table contains the list of genomic regions for which we found evidence that RNA editing may regulate alternative splicing.

Chrom  Strand  OregonR start position  OregonR end position  Reference start position  Reference end position  Ensembl gene ID  Gene name  RNAalifold Energy
chr2R  +  17030848  17031148  17031883  17032183  FBgn0021872  Xbp1  -47.47
chr2R  +  17030826  17031126  17031861  17032161  FBgn0021872  Xbp1  -43.97
chr3R  +  6021309  6021609  6021288  6021588  FBgn0004575  Syn  -35.74
chr2R  +  15112606  15112906  15112920  15113220  FBgn0263395  hppy  -35.32
chr3R  +  20530342  20530642  20532845  20533145  FBgn0003429  slo  -34.33
chr2L  +  4307844  4308144  4308626  4308926  FBgn0010473  tutl  -32.99
chr3R  -  1827526  1827826  1827634  1827934  FBgn0003261  Rm62  -32.71
chrX  -  6977637  6977937  6979707  6980007  FBgn0263563  mir-4956  -30.22
chr2L  +  4313686  4313986  4314465  4314765  FBgn0010473  tutl  -28.02
chr2R  +  15112827  15113127  15113141  15113441  FBgn0263395  hppy  -28.01
chr2L  -  3503144  3503444  3503539  3503839  FBgn0014396  tim  -27.77
chr2R  +  20796744  20797044  20798337  20798637  FBgn0085434  NaCP60E  -27.50
chr3R  +  20529742  20530042  20532245  20532545  FBgn0003429  slo  -27.01
chr3L  +  1620353  1620653  1620768  1621068  FBgn0035244  ABCB7  -26.77
chr3L  +  21201984  21202284  21206079  21206379  FBgn0037060  CG10508  -24.89
chr3R  +  20529606  20529906  20532109  20532409  FBgn0003429  slo  -24.83
chr2L  -  9808092  9808392  9809252  9809552  FBgn0032151  nAcRalpha-30D  -24.60
chr3R  +  14795167  14795467  14796988  14797292  FBgn0261262,FBgn0263983  CG42613,CG43732  -24.09
chr2R  -  2819082  2819382  2819197  2819497  FBgn0053558  mim  -23.82
chrX  +  14889209  14889509  14893125  14893424  FBgn0000535  eag  -21.86
chr2R  -  2814630  2814930  2814745  2815045  FBgn0053558  mim  -21.81
chr2L  -  9807912  9808212  9809070  9809372  FBgn0032151  nAcRalpha-30D  -21.19
chr2L  +  5205311  5205611  5206211  5206511  FBgn0065104  snmRNA:158  -20.95
chr2L  +  9255372  9255672  9256413  9256713  FBgn0028433,FBgn0263984  Ggamma30A,CG43733  -20.94
chrX  +  16823875  16824175  16828326  16828626  FBgn0027556  CG4928  -20.77
chr2L  -  9789304  9789604  9790464  9790766  FBgn0042174  CR18854  -20.15
chr3L  +  1620252  1620552  1620667  1620967  FBgn0035244  ABCB7  -19.52
chr2L  -  7227562  7227862  7228550  7228850  FBgn0259111  Ndae1  -18.93
chr3L  -  4322771  4323071  4323379  4323679  FBgn0035533  Cip4  -18.87
chr3L  -  7172579  7172879  7173536  7173836  FBgn0263218  Dscam2  -18.84
chr3L  +  3085588  3085888  3086186  3086486  FBgn0035397  CG11486  -18.05
chr3R  +  5274351  5274651  5274219  5274527  FBgn0261552  ps  -17.27
chr3R  +  3018887  3019187  3018543  3018843  FBgn0086372  lap  -17.03
chr2L  +  16778417  16778717  16779695  16779995  FBgn0264695  Mhc  -16.91
chr2L  +  5205155  5205455  5206055  5206355  FBgn0065104  snmRNA:158  -16.51
chr2L  +  9255505  9255805  9256546  9256846  FBgn0028433,FBgn0263984  Ggamma30A,CG43733  -16.25
chr3R  +  27659431  27659731  27662805  27663105  FBgn0039883  RhoGAP100F  -16.14
chr3R  +  24738260  24738560  24740986  24741286  FBgn0259220  Doa  -16.02
chr2L  +  21124727  21125027  21126224  21126524  FBgn0040297  Nhe2  -15.11
chr2R  +  13458737  13459037  13459183  13459483  FBgn0040294  POSH  -14.58
chr3R  +  9490202  9490502  9491024  9491324  FBgn0004587  B52  -14.17
chr3R  +  27659380  27659680  27662754  27663054  FBgn0039883  RhoGAP100F  -13.76
chr2L  +  12723038  12723338  12724322  12724622  FBgn0032456  MRP  -13.49
chr3R  -  27424444  27424744  27427803  27428103  FBgn0039858  CycG  -11.81
chrX  -  21493851  21494151  21499720  21500020  FBgn0024807  DIP1  -11.79
chr2L  -  11829342  11829642  11830681  11830979  FBgn0259225  Pde1c  -11.54
chr2R  +  9704147  9704447  9704334  9704634  FBgn0261041  stj  -11.38
chrX  -  13161361  13161661  13164576  13164874  FBgn0041210  HDAC4  -11.27
chr3R  +  14795041  14795341  14796861  14797162  FBgn0261262,FBgn0263983  CG42613,CG43732  -10.94
chr2L  +  12723184  12723484  12724468  12724768  FBgn0032456  MRP  -10.76
chr3R  -  11921587  11921887  11922708  11923008  FBgn0026059  Mhcl  -10.73
chr3L  +  21202346  21202646  21206441  21206741  FBgn0037060  CG10508  -10.50
chr3L  -  19135863  19136163  19139423  19139723  FBgn0016797  fz2  -9.94
chr2R  +  20797779  20798079  20799372  20799672  FBgn0085434  NaCP60E  -9.58
chr2R  +  6498954  6499254  6498871  6499171  FBgn0263102  psq  -9.31
chr3R  +  7216850  7217150  7217101  7217401  FBgn0004595  pros  -9.15
chrX  -  9949140  9949440  9951549  9951849  FBgn0030174  CG15312  -8.74
chr3L  -  4368364  4368664  4368982  4369283  FBgn0035538  DopEcR  -8.32
chrX  +  16326068  16326368  16330388  16330688  FBgn0026575  hang  -7.85
chr3R  +  122532  122832  122534  122834  FBgn0041605  cpx  -7.50
chr3L  -  4824584  4824884  4825178  4825478  FBgn0261797  Dhc64C  -7.43
chr2R  +  14708581  14708881  14708992  14709292  FBgn0010551  l(2)03709  -7.37
chr3L  -  21186959  21187259  21191053  21191353  FBgn0053054  CG33054  -7.05
chr2R  -  5610824  5611124  5610895  5611195  FBgn0259678  sqa  -6.95
chr3L  +  24531164  24531464  24535360  24535660  FBgn0044510  mRpS5  -6.70
chrX  +  10742845  10743145  10745588  10745888  FBgn0030240  CG2202  -5.77
chrX  -  13093777  13094077  13096940  13097240  FBgn0005410  sno  -5.31
chr3R  -  1827297  1827597  1827405  1827705  FBgn0003261  Rm62  -4.48
chrX  -  13161311  13161611  13164526  13164824  FBgn0041210  HDAC4  -4.06
chr2R  -  5911569  5911869  5911558  5911858  FBgn0022382  Pka-R2  -3.63
chr3L  -  13425295  13425595  13428179  13428479  FBgn0036360  CG10713  -3.60
chrX  -  15879014  15879314  15883210  15883509  FBgn0030719  eIF5  -3.57
chr3R  -  7590486  7590786  7590694  7590994  FBgn0086910  l(3)neo38  -2.95
chr2R  +  13458675  13458975  13459121  13459421  FBgn0040294  POSH  -2.66
chr3R  +  12126885  12127185  12127945  12128245  FBgn0250823  gish  -2.31
chr2L  +  22735922  22736222  22737429  22737729  FBgn0041004  CG17715  -2.03
chr3R  +  18426453  18426753  18428640  18428934  FBgn0051158  Efa6  -1.89
chr3R  +  23531090  23531390  23533901  23534201  FBgn0039544  CG12877  -1.70
chrX  +  11384432  11384732  11387268  11387568  FBgn0052666  Drak  -1.62
chr2L  -  6792303  6792603  6793254  6793554  FBgn0015777  nrv2  -1.14
chr2R  +  16894030  16894330  16895092  16895395  FBgn0034570  CG10543  -1.14
chr3L  -  14497549  14497849  14500564  14500864  FBgn0087007  bbg  -0.79
chr3L  -  11549034  11549334  11551222  11551522  FBgn0259481  Mob2  -0.60
chr2R  -  5911725  5912025  5911714  5912014  FBgn0022382  Pka-R2  -0.20
chrX  +  15795184  15795484  15799375  15799675  FBgn0003392  shi  -0.02
chr2L  -  1011161  1011461  1011267  1011570  FBgn0031294  IA-2  0.22
chrX  -  14679394  14679694  14683133  14683433  FBgn0003301  rut  0.96
chr3R  +  12127233  12127533  12128293  12128593  FBgn0250823  gish  1.12
chrX  -  17855538  17855838  17860231  17860531  FBgn0003380  Sh  1.56
chrX  -  13093890  13094190  13097053  13097353  FBgn0005410  sno  1.80
chrX  +  1677709  1678009  1677884  1678184  FBgn0026086  Adar  1.98
chr3L  -  21187026  21187326  21191120  21191420  FBgn0053054  CG33054  2.08
chrX  +  15794921  15795221  15799112  15799412  FBgn0003392  shi  3.13
chrX  +  2569268  2569568  2569485  2569785  FBgn0003371  sgg  3.31
chrX  +  2569287  2569587  2569504  2569804  FBgn0003371  sgg  3.43
chrX  +  14823080  14823380  14826897  14827197  FBgn0264078  Flo-2  4.06

Table A.2: Genomic regions with evidence for the inter-relation of RNA editing and alternative splicing

Appendix B
Supporting Materials for Chapter 3

B.1 Selected TCGA ovarian serous cystadenocarcinoma samples

Sample ID  Alteration  Amino acid change  Type of mutation
TCGA-13-0891-01  Mutation  T10114 Q1016del  In frame deletion
TCGA-13-1495-01  Mutation  Q602*  Nonsense
TCGA-20-0987-01  Mutation  E928Gfs*27  Frame shift insertion
TCGA-25-1322-01  Mutation  Y901C  Missense
TCGA-25-2392-01  Mutation  R882L  Missense
TCGA-31-1953-01  Mutation  W719*  Nonsense
TCGA-59-2351-01  Mutation  K975E  Missense
TCGA-04-1332-01  Deletion  -  -
TCGA-23-1030-01  Deletion  -  -
TCGA-61-2003-01  Deletion  -  -
TCGA-10-0934-01  Amplification  -  -
TCGA-24-1431-01  Amplification  -  -
TCGA-61-2002-01  Amplification  -  -
TCGA-61-2092-01  Amplification  -  -
TCGA-04-1343-01  Control  -  -
TCGA-04-1348-01  Control  -  -
TCGA-04-1361-01  Control  -  -
TCGA-04-1517-01  Control  -  -
TCGA-09-1662-01  Control  -  -
TCGA-09-2053-01  Control  -  -
TCGA-10-0927-01  Control  -  -
TCGA-10-0933-01  Control  -  -
TCGA-13-0720-01  Control  -  -
TCGA-13-0724-01  Control  -  -
TCGA-13-0800-01  Control  -  -
TCGA-13-0884-01  Control  -  -
TCGA-13-0897-01  Control  -  -
TCGA-13-0905-01  Control  -  -
TCGA-13-1407-01  Control  -  -
TCGA-13-1483-01  Control  -  -
TCGA-13-1492-01  Control  -  -
TCGA-13-1505-01  Control  -  -
TCGA-13-1506-01  Control  -  -
TCGA-13-1507-01  Control  -  -
TCGA-20-0991-01  Control  -  -
TCGA-23-1032-01  Control  -  -
TCGA-23-1116-01  Control  -  -
TCGA-24-0966-01  Control  -  -
TCGA-24-1419-01  Control  -  -
TCGA-24-1422-01  Control  -  -
TCGA-24-1436-01  Control  -  -
TCGA-24-1471-01  Control  -  -
TCGA-24-1545-01  Control  -  -
TCGA-24-1552-01  Control  -  -
TCGA-24-1553-01  Control  -  -
TCGA-24-1558-01  Control  -  -
TCGA-24-1563-01  Control  -  -
TCGA-24-1564-01  Control  -  -
TCGA-24-1565-01  Control  -  -
TCGA-24-1567-01  Control  -  -
TCGA-24-1603-01  Control  -  -
TCGA-24-1604-01  Control  -  -
TCGA-24-2038-01  Control  -  -
TCGA-24-2261-01  Control  -  -
TCGA-24-2290-01  Control  -  -
TCGA-25-1320-01  Control  -  -
TCGA-25-1321-01  Control  -  -
TCGA-25-2396-01  Control  -  -
TCGA-25-2399-01  Control  -  -
TCGA-30-1862-01  Control  -  -
TCGA-36-1570-01  Control  -  -
TCGA-36-1574-01  Control  -  -
TCGA-59-2355-01  Control  -  -
TCGA-61-1728-01  Control  -  -
TCGA-61-1919-01  Control  -  -
TCGA-61-2009-01  Control  -  -
TCGA-61-2016-01  Control  -  -
TCGA-61-2095-01  Control  -  -
TCGA-61-2104-01  Control  -  -
TCGA-61-2111-01  Control  -  -

Table B.1: Ovarian serous cystadenocarcinoma samples selected from TCGA

B.2 qRT-PCR validation of identified ALE splicing events

[Figure, panels A–C. Scatter plots: ΔΨ qRT-PCR versus ΔΨ MISO for SK-BR-3 and 184-hTERT cells (r2 = 0.79, p = 0.0003); ΔΨ qRT-PCR (CDK12 siRNA-1) versus ΔΨ qRT-PCR (CDK12 siRNA-2) (r2 = 0.85, p = 0.004). Bar charts: log2 expression relative to scrambled siRNA of ALE-L and ALE-S isoforms and of CDK12, CDK13, CDK9 and CCNK after treatment with CDK12 siRNA-1, CDK13 siRNA or CDK9 siRNA.]

Figure B.1: qRT-PCR analysis of identified alternative splicing events. A. Correlation of ∆Ψ values for a panel of ALE events regulated by CDK12 in SK-BR-3 and 184-hTERT cells, as determined by RNA-seq (MISO) versus qRT-PCR, for the genes NFX1, RIBF1, DNAJB6, BRCA2, DPP9, THADA, ZFYVE26, PADI2 and ATM. B. Correlation of ∆Ψ values for SK-BR-3 cells treated with two different CDK12 siRNA constructs. C. Depletion of CDK13 or CDK9 did not phenocopy the effect of CDK12 depletion on ALE splicing. For nine genes with ALEs regulated by CDK12, qRT-PCR was used to measure changes in the expression of long (ALE-L) and short (ALE-S) mRNA isoforms after depletion of CDK12, CDK13, or CDK9 in SK-BR-3 cells. The depletion of the CDKs also did not affect the expression of CCNK.

B.3 Proteomics analysis of SK-BR-3 after CDK12 depletion
[Figure B.2, panels A-B. Panel A: enrichment map, SK-BR-3 + CDK12 siRNA-1 (proteome), up- and down-regulated nodes: Vesicle Membrane, DNA Repair/Replication, RNA Processing, Nuclear Lumen, Mitochondrial Ribosome, Structural Molecule Activity, Signal Transduction, Mitochondrion. Panel B: SK-BR-3 proteome pathway enrichment (NES) versus SK-BR-3 transcriptome pathway enrichment (NES); labelled pathways: DNA Repair, RNA processing, Mitochondrion, Cell cycle, Cell division; markers grouped by FDR < 0.1 in both datasets, transcriptome only, proteome only, or neither.]

Figure B.2: Proteomic analysis of SK-BR-3 after CDK12 depletion confirms trends observed by differential gene expression analysis. A. Enrichment map from global proteome analysis in SK-BR-3 cells by GSEA. B. For each pathway, GSEA pre-ranked analysis assigned a normalized enrichment score (NES) representing the extent of over-representation of genes of a pathway at the top or bottom of a ranked list. Positive and negative NES values represent up- and down-regulated pathways, respectively. The dotted red line shows the general trend for NES values significant in both proteomics and transcriptomics datasets (FDR < 0.1).

B.4 Up-regulation of cell proliferation pathways in MDA-MB-231 cells by CDK12

[Figure B.3, panels A-C. Panel A, top: volcano plot of the MDA-MB-231 proteome, -log10(padj-value) versus log2(fold change), with CDK12 marked, padj < 0.01 (n = 1,428), and +1σ and -1σ thresholds (labelled n = 542 and n = 486). Panel A, bottom: counts versus log2(fold change), μdown = -0.60 (n = 743), μup = 0.55 (n = 685). Panel B: enrichment map, MDA-MB-231 + CDK12 siRNA-1 (proteome), up- and down-regulated nodes: Nuclear Lumen/Chromosome, RNA Processing, DNA Binding, Biosynthetic Process, Chromosome Organization, DNA Repair/Replication, Cell Cycle. Panel C: SK-BR-3 proteome pathway enrichment (NES) versus MDA-MB-231 proteome pathway enrichment (NES); labelled pathways: DNA Repair, RNA processing, Cell cycle, Cell division, Mitochondrion; markers grouped by FDR < 0.1 in both cell lines, MDA-MB-231 only, SK-BR-3 only, or neither.]

Figure B.3: CDK12 up-regulates cell proliferation pathways in MDA-MB-231 triple-negative breast cancer cells. A. Top: volcano plot of the global proteome analysis in MDA-MB-231 cells. Bottom: distribution of fold change values for all differential protein expression events with padj < 0.01. B. Enrichment map from global proteome analysis in MDA-MB-231 cells by GSEA. C. For each pathway, GSEA pre-ranked analysis assigned a normalized enrichment score (NES) representing the extent of over-representation of genes of a pathway at the top or bottom of a ranked list. Positive and negative NES values represent up- and down-regulated pathways, respectively. For each pathway, NES values in the MDA-MB-231 and SK-BR-3 proteome are shown. Red markers represent NES values significant in both cell lines (FDR < 0.1). The dotted red line shows the general trend of these points.
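The NES comparison described in the captions above can be illustrated with a minimal sketch. It groups pathways by whether their FDR is below 0.1 in both, one, or neither dataset, then fits a line through the pathways significant in both. The ordinary least-squares fit is an assumption, since the thesis does not state how the dotted trend line was computed. The function names, dictionary keys and numeric values are hypothetical and are not data from this study.

```python
FDR_CUTOFF = 0.1  # significance threshold quoted in the figure captions


def classify(fdr_a, fdr_b, cutoff=FDR_CUTOFF):
    """Label a pathway by the dataset(s) in which it passes the FDR cutoff."""
    sig_a, sig_b = fdr_a < cutoff, fdr_b < cutoff
    if sig_a and sig_b:
        return "both"
    if sig_a:
        return "a_only"
    if sig_b:
        return "b_only"
    return "neither"


def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept (assumed trend method)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def trend_through_shared(pathways):
    """Fit the trend using only pathways significant in both datasets."""
    shared = [p for p in pathways
              if classify(p["fdr_a"], p["fdr_b"]) == "both"]
    slope, intercept = fit_line([p["nes_a"] for p in shared],
                                [p["nes_b"] for p in shared])
    return slope, intercept, [p["name"] for p in shared]


# Made-up illustrative values, NOT measurements from the thesis.
pathways = [
    {"name": "pathway_1", "nes_a": 2.0, "fdr_a": 0.01, "nes_b": 1.0, "fdr_b": 0.02},
    {"name": "pathway_2", "nes_a": -2.0, "fdr_a": 0.03, "nes_b": -1.0, "fdr_b": 0.05},
    {"name": "pathway_3", "nes_a": 4.0, "fdr_a": 0.001, "nes_b": 2.0, "fdr_b": 0.04},
    {"name": "pathway_4", "nes_a": 1.0, "fdr_a": 0.50, "nes_b": 3.0, "fdr_b": 0.01},
]

slope, intercept, used = trend_through_shared(pathways)
```

Here `pathway_4` is excluded from the fit because it passes the cutoff in only one dataset, which mirrors the restriction of the trend line to the red markers.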