Open Collections
BIRS Workshop Lecture Videos
Learning a mapping from pre-mRNA sequence to splice site usage to understand RNA splicing variation
Knowles, David
Description
RNA splicing is a complex process, carefully regulated by the coordinated action of the hundreds of proteins and associated small nuclear RNAs that make up the core spliceosome and trans-acting splicing factors. To determine which regions to remove as introns, this splicing machinery must interpret information encoded in the pre-mRNA sequence: consensus splice site and branch point motifs, as well as splicing regulatory elements (SREs) such as exonic splicing enhancers. We approach computational modeling of the splicing process by learning a mapping from pre-mRNA sequence to splice site (SS) usage levels.

Following our recent work on LeafCutter, which we developed to study intron splicing variation, we use spliced reads from RNA-seq to quantify local intron usage in a straightforward, annotation-agnostic manner. For each 5’ SS we predict the proportion of corresponding spliced reads mapping to each possible 3’ SS (specifically every AG dinucleotide within 100kb) as a function of the 3’ SS sequence context. An analogous model is used for 5’ SS choice for each 3’ SS. Compared to previous work predicting exon inclusion, we model more splicing events, including 3’ and 5’ extensions, and additionally leverage signal from constitutive splicing.

We choose a deep neural network (NN) to represent the mapping from sequence to SS usage proportion. Convolutional NNs (CNNs) can naturally detect regulatory elements: the first convolutional layer corresponds to scanning learnt PWMs along the sequence, and subsequent layers allow combinatorial and spatial logic on top of the resulting detections. Max-pooling layers confer limited translational invariance on where motifs are detected, which is appropriate for SREs but undesirable for the SS consensus. We therefore use a CNN on a large sequence context (~800bp) combined with a dense network locally around the SS (~70bp). A Dirichlet-multinomial likelihood is used to appropriately account for overdispersion in RNA-seq read counts.
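The likelihood described above can be illustrated with a minimal sketch. It is not the talk's implementation: it assumes the network emits one score per candidate 3’ SS, that a softmax turns those scores into usage proportions, and that a single hypothetical `concentration` parameter controls overdispersion. The function names and the example numbers are illustrative only.

```python
import math


def softmax(scores):
    """Convert per-candidate scores into usage proportions that sum to 1."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dirichlet_multinomial_logpmf(counts, proportions, concentration):
    """Log-probability of read counts under a Dirichlet-multinomial.

    counts        -- spliced-read counts, one per candidate splice site
    proportions   -- predicted usage proportions (positive, summing to 1)
    concentration -- assumed overdispersion parameter; larger values
                     approach a plain multinomial
    """
    n = sum(counts)
    alphas = [concentration * p for p in proportions]
    a_sum = sum(alphas)
    # Multinomial coefficient: n! / prod(x_k!)
    logp = math.lgamma(n + 1) - sum(math.lgamma(x + 1) for x in counts)
    # Ratio of Dirichlet normalising constants
    logp += math.lgamma(a_sum) - math.lgamma(n + a_sum)
    logp += sum(math.lgamma(x + a) - math.lgamma(a)
                for x, a in zip(counts, alphas))
    return logp


# Hypothetical example: three candidate 3' SS for one 5' SS.
scores = [2.0, 0.5, -1.0]   # illustrative network outputs
counts = [40, 8, 2]         # illustrative spliced-read counts
proportions = softmax(scores)
loglik = dirichlet_multinomial_logpmf(counts, proportions, concentration=10.0)
```

Training under these assumptions would maximise the summed log-likelihood over all 5’ SS; a small `concentration` tolerates read counts that deviate more from the predicted proportions than a multinomial would allow.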
A multi-output extension readily allows modeling of tissue-specific splicing patterns.

We assess model performance using 110 male muscle RNA-seq samples from GTEx, training on odd chromosomes and testing on even chromosomes. Out of the typically ~2000 canonical dinucleotides within 100kb, we correctly predict the most frequently used one 68% of the time for 3’ SS choice (65% for 5’ SS choice), compared to 9.6% when picking the strongest SS by MaxEntScore. The model learns to detect expected features, including the branchpoint consensus and polypyrimidine tract, and can distinguish between canonical dinucleotides which are never spliced, noisily spliced (<1% of reads) and constitutively spliced (>99%). Using GTEx WGS data we are able to predict which SNVs will create cryptic splice sites with an AUC of 99%. Using 12 diverse tissues from GTEx we predict tissue-specific SS usage, with an average correlation between predicted and observed differences in SS usage across all pairs of tissues of 0.31 (compared to 0.07 for existing work on exonic PSI).
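The evaluation protocol can be sketched as follows. This is an assumption-laden illustration, not the authors' code: the abstract does not say how sex chromosomes or non-numeric contigs are handled, so the hypothetical `chromosome_split` helper simply leaves them out, and `top1_accuracy` treats a prediction as correct when the highest predicted proportion falls on the candidate with the most observed reads.

```python
def chromosome_split(chrom):
    """Assign a chromosome to 'train' (odd) or 'test' (even).

    Returns None for names that are not plain autosome numbers
    (e.g. chrX, chrM); excluding them is an assumption made here.
    """
    name = chrom[3:] if chrom.startswith("chr") else chrom
    if not name.isdigit():
        return None
    return "train" if int(name) % 2 == 1 else "test"


def argmax(values):
    """Index of the largest value (first one on ties)."""
    return max(range(len(values)), key=values.__getitem__)


def top1_accuracy(events):
    """Fraction of events whose top predicted site is the most used one.

    events -- list of (predicted_proportions, observed_counts) pairs,
              one pair per splice site being modelled
    """
    hits = sum(argmax(pred) == argmax(obs) for pred, obs in events)
    return hits / len(events)


# Hypothetical held-out events: the first is predicted correctly,
# the second is not.
events = [
    ([0.7, 0.2, 0.1], [50, 3, 0]),
    ([0.1, 0.6, 0.3], [0, 2, 40]),
]
accuracy = top1_accuracy(events)
```

The same `top1_accuracy` function could score a baseline such as picking the candidate with the highest MaxEntScore, by passing those scores in place of the predicted proportions.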
Item Metadata
Title |
Learning a mapping from pre-mRNA sequence to splice site usage to understand RNA splicing variation
|
Creator |
Knowles, David
|
Publisher |
Banff International Research Station for Mathematical Innovation and Discovery
|
Date Issued |
2017-03-29T10:00
|
Extent |
33 minutes
|
Subject | |
Type | |
File Format |
video/mp4
|
Language |
eng
|
Notes |
Author affiliation: Stanford University
|
Series | |
Date Available |
2017-09-26
|
Provider |
Vancouver : University of British Columbia Library
|
Rights |
Attribution-NonCommercial-NoDerivatives 4.0 International
|
DOI |
10.14288/1.0355770
|
URI | |
Affiliation | |
Peer Review Status |
Unreviewed
|
Scholarly Level |
Postdoctoral
|
Rights URI | |
Aggregated Source Repository |
DSpace
|