BIRS Workshop Lecture Videos



Learning a mapping from pre-mRNA sequence to splice site usage to understand RNA splicing variation
Knowles, David


RNA splicing is a complex process, carefully regulated by the coordinated action of hundreds of proteins and associated small nuclear RNAs comprising the core spliceosome and trans-acting splicing factors. To determine which regions to remove as introns, this splicing machinery must interpret information encoded in the pre-mRNA sequence: consensus splice site and branch point motifs, as well as splicing regulatory elements (SREs) such as exonic splicing enhancers.
We approach computational modeling of the splicing process by learning a mapping from pre-mRNA sequence to splice site (SS) usage levels. Following our recent work on LeafCutter, which we developed to study intron splicing variation, we use spliced reads from RNA-seq to quantify local intron usage in a straightforward, annotation-agnostic manner. For each 5’ SS we predict the proportion of corresponding spliced reads mapping to each possible 3’ SS (specifically, every AG dinucleotide within 100 kb) as a function of the 3’ SS sequence context. An analogous model describes 5’ SS choice for each 3’ SS. Compared to previous work predicting exon inclusion, we model more splicing events, including 3’ and 5’ extensions, and additionally leverage signal from constitutive splicing.
We choose a deep neural network (NN) to represent the mapping from sequence to SS usage proportion. Convolutional NNs (CNNs) can naturally detect regulatory elements: the first convolutional layer corresponds to scanning learnt position weight matrices (PWMs) along the sequence, and subsequent layers allow combinatorial and spatial logic on top of the resulting detections. Max-pooling layers confer limited translational invariance on where motifs are detected, which is appropriate for SREs but undesirable for the SS consensus. We therefore use a CNN on a large sequence context (~800 bp) combined with a dense network locally around the SS (~70 bp). A Dirichlet-multinomial likelihood is used to account for overdispersion in RNA-seq read counts.
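As an illustration of the likelihood mentioned above, the following is a minimal sketch of a Dirichlet-multinomial log-probability for the spliced-read counts of one anchor SS. It assumes, purely for illustration, that the network emits one score per candidate SS, that a softmax turns the scores into expected usage proportions, and that a single `concentration` parameter scales those proportions into Dirichlet parameters. The function names and this parameterization are hypothetical and are not taken from the talk.

```python
import math

def softmax(scores):
    """Convert raw per-candidate scores into proportions that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dirichlet_multinomial_logpmf(counts, alphas):
    """Log-probability of `counts` under a Dirichlet-multinomial with parameters `alphas`."""
    n = sum(counts)
    a0 = sum(alphas)
    # log of the multinomial coefficient n! / prod(c_k!)
    logp = math.lgamma(n + 1) - sum(math.lgamma(c + 1) for c in counts)
    # log of Gamma(a0) / Gamma(n + a0)
    logp += math.lgamma(a0) - math.lgamma(n + a0)
    # log of prod Gamma(c_k + a_k) / Gamma(a_k)
    logp += sum(math.lgamma(c + a) - math.lgamma(a) for c, a in zip(counts, alphas))
    return logp

def splice_site_loglik(counts, scores, concentration):
    """Assumed parameterization: alphas = concentration * softmax(scores)."""
    alphas = [concentration * p for p in softmax(scores)]
    return dirichlet_multinomial_logpmf(counts, alphas)

# Reads from one 5' SS spliced to each of three candidate 3' SSs (made-up numbers).
example_loglik = splice_site_loglik([40, 2, 0], [3.0, 0.5, -2.0], concentration=10.0)
```

Under this parameterization, a small `concentration` allows more sample-to-sample variability than a plain multinomial, while a large value approaches the multinomial limit.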
A multi-output extension readily allows modeling of tissue-specific splicing patterns. We assess model performance using 110 male muscle RNA-seq samples from GTEx, training on odd chromosomes and testing on even chromosomes. Out of the typically ~2000 canonical dinucleotides within 100 kb, we correctly predict the most frequently used one 68% of the time for 3’ SS choice (65% for 5’ SS choice), compared to 9.6% when picking the strongest SS by MaxEntScore. The model learns to detect expected features, including the branchpoint consensus and polypyrimidine tract, and can distinguish between canonical dinucleotides that are never spliced, noisily spliced (<1% of reads), and constitutively spliced (>99%). Using GTEx WGS data, we predict which SNVs will create cryptic splice sites with an AUC of 99%. Using 12 diverse tissues from GTEx, we predict tissue-specific SS usage: the correlation between predicted and observed differences in SS usage, averaged across all pairs of tissues, is 0.31 (compared to 0.07 for existing work on exonic PSI).
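The evaluation protocol described above (odd chromosomes for training, even for testing, and the fraction of anchor SSs for which the most-used partner SS is predicted correctly) might be sketched as follows. The function names and data layout are hypothetical. How sex and mitochondrial chromosomes were handled is not stated in the abstract, so this sketch simply leaves them unassigned.

```python
def is_training_chromosome(chrom):
    """True for odd-numbered autosomes, False for even ones, None otherwise."""
    name = chrom[3:] if chrom.startswith("chr") else chrom
    if not name.isdigit():
        return None  # e.g. chrX, chrY, chrM: handling not specified in the source
    return int(name) % 2 == 1

def top1_accuracy(examples):
    """Fraction of anchors whose top-scoring candidate is also the most-used one.

    `examples` is a list of (observed_counts, predicted_scores) pairs, one pair
    per anchor SS, with one entry per candidate partner SS in each list.
    """
    hits = 0
    for counts, scores in examples:
        observed_best = max(range(len(counts)), key=counts.__getitem__)
        predicted_best = max(range(len(scores)), key=scores.__getitem__)
        hits += observed_best == predicted_best
    return hits / len(examples)

# Two made-up anchors: the first is mispredicted, the second is predicted correctly.
toy_examples = [([10, 0, 2], [0.1, 0.5, 0.2]), ([0, 5], [0.2, 0.9])]
toy_accuracy = top1_accuracy(toy_examples)
```

Ties are broken in favor of the first candidate here, which is an arbitrary choice made for the sketch.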



Attribution-NonCommercial-NoDerivatives 4.0 International