Learning a mapping from pre-mRNA sequence to splice site usage to understand RNA splicing variation

Knowles, David

Description

RNA splicing is a complex process, carefully regulated by the coordinated action of the hundreds of proteins and associated small nuclear RNAs that comprise the core spliceosome and trans-acting splicing factors. To determine which regions to remove as introns, this splicing machinery must interpret information encoded in the pre-mRNA sequence: consensus splice site and branch point motifs, as well as splicing regulatory elements (SREs) such as exonic splicing enhancers. We approach computational modeling of the splicing process by learning a mapping from pre-mRNA sequence to splice site (SS) usage levels. Following our recent work on LeafCutter, which we developed to study intron splicing variation, we use spliced reads from RNA-seq to quantify local intron usage in a straightforward, annotation-agnostic manner. For each 5’ SS we predict the proportion of corresponding spliced reads mapping to each possible 3’ SS (specifically, every AG dinucleotide within 100 kb) as a function of the 3’ SS sequence context. An analogous model predicts 5’ SS choice for each 3’ SS. Compared to previous work predicting exon inclusion, we model more splicing events, including 3’ and 5’ extensions, and additionally leverage signal from constitutive splicing. We choose a deep neural network (NN) to represent the mapping from sequence to SS usage proportion. Convolutional NNs (CNNs) can naturally detect regulatory elements: the first convolutional layer corresponds to scanning learnt position weight matrices (PWMs) along the sequence, and subsequent layers allow combinatorial and spatial logic on top of the resulting detections. Max-pooling layers confer limited translational invariance on where motifs are detected, which is appropriate for SREs but undesirable for the SS consensus. We therefore use a CNN on a large sequence context (~800 bp) combined with a dense network locally around the SS (~70 bp). A Dirichlet-multinomial likelihood is used to account for overdispersion in RNA-seq read counts.
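The Dirichlet-multinomial likelihood mentioned above can be sketched in a few lines. This is a minimal illustration, not the model from the talk: it assumes that the network emits one real-valued score per candidate splice site, that a softmax turns those scores into expected usage proportions, and that a single positive `concentration` parameter controls overdispersion. The function names and this parameterization are hypothetical choices made for the example.

```python
import math


def softmax(scores):
    """Convert per-site scores into proportions that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dirichlet_multinomial_logpmf(counts, alphas):
    """Log-probability of read counts under a Dirichlet-multinomial.

    counts: non-negative integer read counts, one per candidate site.
    alphas: positive Dirichlet parameters, one per candidate site.
    """
    n = sum(counts)
    alpha_sum = sum(alphas)
    # Multinomial coefficient: n! / prod(x_k!)
    logp = math.lgamma(n + 1) - sum(math.lgamma(x + 1) for x in counts)
    # Ratio of Dirichlet normalizers after and before observing the counts
    logp += math.lgamma(alpha_sum) - math.lgamma(n + alpha_sum)
    logp += sum(
        math.lgamma(x + a) - math.lgamma(a) for x, a in zip(counts, alphas)
    )
    return logp


def splice_site_loglik(counts, scores, concentration):
    """Log-likelihood of spliced-read counts given per-site scores.

    Assumption for this sketch: alphas = concentration * softmax(scores).
    Smaller concentration means more overdispersion; as concentration
    grows, the distribution approaches a plain multinomial.
    """
    alphas = [concentration * p for p in softmax(scores)]
    return dirichlet_multinomial_logpmf(counts, alphas)


# Toy usage: three candidate 3' SS for one 5' SS, with 20 spliced reads.
example = splice_site_loglik(
    counts=[18, 2, 0], scores=[3.0, 0.5, -2.0], concentration=10.0
)
```

Training would maximize the sum of such log-likelihoods over all splice sites; the negative of that sum is the loss under these assumptions.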
A multi-output extension readily allows modeling of tissue-specific splicing patterns. We assess model performance using 110 male muscle RNA-seq samples from GTEx, training on odd chromosomes and testing on even chromosomes. Out of the typically ~2000 canonical dinucleotides within 100 kb, we correctly predict the most frequently used site 68% of the time for 3’ SS choice (65% for 5’ SS choice), compared to 9.6% when picking the strongest SS by MaxEntScore. The model learns to detect expected features, including the branch point consensus and polypyrimidine tract, and can distinguish between canonical dinucleotides that are never spliced, noisily spliced (<1% of reads) and constitutively spliced (>99%). Using GTEx whole-genome sequencing (WGS) data, we predict which single-nucleotide variants (SNVs) will create cryptic splice sites with an AUC of 99%. Using 12 diverse tissues from GTEx, we predict tissue-specific SS usage, with an average correlation between predicted and observed differences in SS usage across all pairs of tissues of 0.31 (compared to 0.07 for existing work on exonic PSI).
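The evaluation protocol described above (train on odd chromosomes, test on even chromosomes, and score how often the most frequently used site is ranked first) can be sketched as follows. This is an illustration under stated assumptions, not the evaluation code behind the reported numbers: the event records, their keys (`chrom`, `counts`, `predicted`) and the choice to skip non-numeric chromosomes such as chrX are all hypothetical, since the abstract does not say how those were handled.

```python
def chromosome_number(chrom):
    """Return the number for names like 'chr7' or '7', else None."""
    name = chrom[3:] if chrom.lower().startswith("chr") else chrom
    return int(name) if name.isdigit() else None


def split_by_parity(events):
    """Odd-numbered chromosomes go to training, even-numbered to testing.

    Assumption for this sketch: non-numeric chromosomes (X, Y, MT) are
    dropped from both sets.
    """
    train, test = [], []
    for event in events:
        number = chromosome_number(event["chrom"])
        if number is None:
            continue
        (train if number % 2 == 1 else test).append(event)
    return train, test


def argmax(values):
    """Index of the largest value (first one wins on ties)."""
    return max(range(len(values)), key=lambda i: values[i])


def top1_accuracy(events, score_key):
    """Fraction of events whose top-scoring candidate site is also the
    site with the most observed spliced reads."""
    if not events:
        return float("nan")
    hits = sum(
        argmax(event[score_key]) == argmax(event["counts"]) for event in events
    )
    return hits / len(events)


# Toy usage: each event is one 5' SS with three candidate 3' SS.
toy_events = [
    {"chrom": "chr1", "counts": [0, 9, 1], "predicted": [0.1, 0.8, 0.1]},
    {"chrom": "chr2", "counts": [5, 0, 0], "predicted": [0.2, 0.7, 0.1]},
    {"chrom": "chr4", "counts": [0, 0, 7], "predicted": [0.1, 0.1, 0.8]},
    {"chrom": "chrX", "counts": [3, 0, 0], "predicted": [0.9, 0.05, 0.05]},
]
train_events, test_events = split_by_parity(toy_events)
test_accuracy = top1_accuracy(test_events, "predicted")
```

The same `top1_accuracy` function can score a baseline by pointing `score_key` at a field holding per-site motif-strength scores instead of model predictions.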

Item Metadata

Title	Learning a mapping from pre-mRNA sequence to splice site usage to understand RNA splicing variation
Creator	Knowles, David
Publisher	Banff International Research Station for Mathematical Innovation and Discovery
Date Issued	2017-03-29T10:00
Extent	33 minutes
Subject	Mathematics; Statistics; Biology and other natural sciences
Type	Moving Image
File Format	video/mp4
Language	eng
Notes	Author affiliation: Stanford University
Series	BIRS Workshop Lecture Videos (Banff, Alta)
Date Available	2017-09-25
Provider	Vancouver : University of British Columbia Library
Rights	Attribution-NonCommercial-NoDerivatives 4.0 International
DOI	10.14288/1.0355770
URI	http://hdl.handle.net/2429/63116
Affiliation	Non UBC
Peer Review Status	Unreviewed
Scholarly Level	Postdoctoral
Rights URI	http://creativecommons.org/licenses/by-nc-nd/4.0/
Aggregated Source Repository	DSpace

Item Media

201703291000-Knowles_lrv.mp4 -- 102.76MB
