Data integration, variant aggregation and combined annotation

BIRS Workshop Lecture Videos

Data integration, variant aggregation and combined annotation Goldenberg, Anna

Description

The majority of human diseases are complex, arising from a multitude of factors. Identifying these factors is critical to understanding diseases and improving health care, yet it is a very difficult computational problem owing to a low signal-to-noise ratio (only a few variants out of millions are likely to be causal), heterogeneity of causes (e.g. coding, regulatory, epigenetic), epistasis (gene interaction patterns), and other factors. We propose to combine two mostly complementary data sources: coding variants and gene expression. These two data sources reflect different kinds of protein aberrations. Combining them allows us to survey both coding and regulatory aberrations genome-wide without underpowering the model. We developed a biologically motivated hierarchical factor graph model that efficiently combines these two sources of data. We use variant harmfulness and gene interactions as priors to increase the likelihood of identifying the genes correctly. To our knowledge, this is the first work that takes into account the complementarity of exome and gene expression data sources in a principled way, integrating variant harmfulness and gene interaction information in the inference process of the model. Our approach a) allows the integration of different data modalities; b) provides a principled way to aggregate rare (and common) variants; c) improves the power of detecting genes associated with a given disease; and d) implicates proteins that have been affected in the population in a variety of ways, rather than solely through the coding DNA sequence. Our extensive simulations confirm that our method has superior sensitivity and precision compared to other methods that aggregate rare variants. We have tested our approach on a large breast cancer dataset as a proof of concept and found that our method is able to identify important breast cancer genes. Interestingly, we find genes that have DNA mutations or coding variants in some patients and gene expression aberrations in other patients, indicating that our method is able to effectively explain the disease in more patients.

Item Metadata

Title	Data integration, variant aggregation and combined annotation
Creator	Goldenberg, Anna
Publisher	Banff International Research Station for Mathematical Innovation and Discovery
Date Issued	2015-08-04T11:36
Extent	37 minutes
Subject	Mathematics; Statistics; Biology and other natural sciences
Type	Moving Image
File Format	video/mp4
Language	eng
Notes	Author affiliation: SickKids Research Institute / University of Toronto
Series	BIRS Workshop Lecture Videos (Banff, Alta)
Date Available	2016-04-18
Provider	Vancouver : University of British Columbia Library
Rights	Attribution-NonCommercial-NoDerivatives 4.0 International
DOI	10.14288/1.0300006
URI	http://hdl.handle.net/2429/57676
Affiliation	Non UBC
Peer Review Status	Unreviewed
Scholarly Level	Faculty
Rights URI	http://creativecommons.org/licenses/by-nc-nd/4.0/
Aggregated Source Repository	DSpace

Item Media

201508041136-Goldenberg_lrv.mp4 -- 98.09MB
