Open Collections

UBC Faculty Research and Publications

Development and application of an integrated allele-specific pipeline for methylomic and epigenomic analysis… Richard Albert, Julien; Koike, Tasuku; Younesy, Hamid; Thompson, Richard; Bogutz, Aaron B; Karimi, Mohammad M; Lorincz, Matthew C Jun 15, 2018

Your browser doesn't seem to have a PDF viewer, please download the PDF to view this item.

Item Metadata


52383-12864_2018_Article_4835.pdf [ 4.21MB ]
JSON: 52383-1.0368684.json
JSON-LD: 52383-1.0368684-ld.json
RDF/XML (Pretty): 52383-1.0368684-rdf.xml
RDF/JSON: 52383-1.0368684-rdf.json
Turtle: 52383-1.0368684-turtle.txt
N-Triples: 52383-1.0368684-rdf-ntriples.txt
Original Record: 52383-1.0368684-source.json
Full Text

Full Text

SOFTWARE Open AccessDevelopment and application of anintegrated allele-specific pipeline formethylomic and epigenomic analysis (MEA)Julien Richard Albert1, Tasuku Koike2, Hamid Younesy3,4,5, Richard Thompson6, Aaron B. Bogutz1,Mohammad M. Karimi7* and Matthew C. Lorincz1*AbstractBackground: Allele-specific transcriptional regulation, including of imprinted genes, is essential for normal mammaliandevelopment. While the regulatory regions controlling imprinted genes are associated with DNA methylation (DNAme)and specific histone modifications, the interplay between transcription and these epigenetic marks at allelic resolutionis typically not investigated genome-wide due to a lack of bioinformatic packages that can process and integratemultiple epigenomic datasets with allelic resolution. In addition, existing ad-hoc software only consider SNVs forallele-specific read discovery. This limitation omits potentially informative INDELs, which constitute about one fifthof the number of SNVs in mice, and introduces a systematic reference bias in allele-specific analyses.Results: Here, we describe MEA, an INDEL-aware Methylomic and Epigenomic Allele-specific analysis pipelinewhich enables user-friendly data exploration, visualization and interpretation of allelic imbalance. Applying MEAto mouse embryonic datasets yields robust allele-specific DNAme maps and low reference bias. We validateallele-specific DNAme at known differentially methylated regions and show that automated integration of suchmethylation data with RNA- and ChIP-seq datasets yields an intuitive, multidimensional view of allelic generegulation. MEA uncovers numerous novel dynamically methylated loci, highlighting the sensitivity of our pipeline.Furthermore, processing and visualization of epigenomic datasets from human brain reveals the expected allele-specificenrichment of H3K27ac and DNAme at imprinted as well as novel monoallelically expressed genes, highlighting MEA’sutility for integrating human datasets of distinct provenance for genome-wide analysis of allelic phenomena.Conclusions: Our novel pipeline for standardized allele-specific processing and visualization of disparate epigenomicand methylomic datasets enables rapid analysis and navigation with allelic resolution. MEA is freely available as a Dockercontainer at Epigenomics, Allele-specific, Allelic, RNA-seq, Chromatin immunoprecipitation, ChIP, ChIP-seq,Whole genome bisulphite-sequencing, WGBS, Imprinting, MEABackgroundNext-generation sequencing (NGS)-based approaches forgenome-wide analysis of RNA, histone post-translationalmodifications (PTMs), DNA methylation (DNAme) andchromatin conformation are now routinely conducted onboth model organisms and human samples. Such studieshave yielded many insights into the interplay betweenchromatin structure and transcription, including thesurprising observation that allele-specific phenomenamay be more widespread than previously believed [1, 2].Unfortunately, while such datasets, including RNA se-quencing (RNA-seq), chromatin immunoprecipitationfollowed by sequencing (ChIP-seq) and whole genomebisulphite sequencing (WGBS), are theoretically amen-able to allele-specific profiling, NGS analysis softwaregenerally does not discriminate between parental allelesfrom diploid genomes. Indeed, popular read alignersdepend on alignment to a single reference genome,* Correspondence:; mlorincz@mail.ubc.ca7MRC London Institute of Medical Sciences, Imperial College, London, UK1Department of Medical Genetics, The University of British Columbia,Vancouver, BC, CanadaFull list of author information is available at the end of the article© The Author(s). 2018 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0International License (, which permits unrestricted use, distribution, andreproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link tothe Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver( applies to the data made available in this article, unless otherwise stated.Richard Albert et al. BMC Genomics  (2018) 19:463 considering the sequencing reads generatedfrom autosomes (and the X-chromosome in the case offemales) as originating from isogenic rather than outbredindividuals. In merging both parental alleles into a singlemeasurement, these aligners neglect allele-specific phe-nomena, such as genomic imprinting [1], X-chromosomeinactivation [2] and sequence-dependent cis-regulatoryeffects [3].To overcome this shortcoming, a number of softwarepackages have recently been developed that assign NGSsequencing reads to a specific parental allele. For example,MMSEQ [4], QuASAR [5], MBASED [6] and SCALE [7]were designed to analyze RNA-seq data, while MethPipe[8], epiG [9] and BSPAT [10] were designed to processDNAme data. Several independent custom scripts forallele-specific analyses have also been reported [11–13],but the details required for implementing them were notincluded. Pipelines such as Allelome.PRO [14], WASP[15] and our previously published toolbox, ALEA [16]accommodate both RNA- and ChIP-seq datasets, yetno pipeline offers the additional capability of processingDNAme data. The lack of a universal allele-specific pipe-line has precluded robust integration of allele-specifictranscription, histone PTMs and DNAme profiles. Import-antly, while such pipelines can be applied in parallel toanalyze distinct epigenomic features, installation and im-plementation of multiple software packages can be timeconsuming, even for experienced bioinformaticians. Add-itionally, comparing allelic results generated using differ-ent software can introduce confounding factors, as thestrategies used to process reads depend on multiple pa-rameters, including read trimming, alignment mismatchscoring and read alignment filtering (mapping quality,PCR duplicate reads). For example, several allele-specificanalysis packages rely on reference genome alignmentfollowed by variant calling [8, 10, 14], while others lever-age publicly available single nucleotide variant (SNV) datato derive a diploid genome for read alignment [5, 15, 16].This “pseudogenome” strategy is a significant improve-ment over the former as it enables alignment over lociwith high levels of genetic variation. However, currentpipelines exclude short insertions and deletions (INDELs)for pseudogenome reconstruction, as they modify refer-ence chromosome sequence lengths and annotated genecoordinates required for downstream analyses. Given therelative abundance of INDELs, this shortcoming may leadto the omission of a significant fraction of informative al-lelic reads. Indeed, analysis of high quality genotyping in-formation for mouse strains reveals that, exclusive ofstructural variants, INDELs compose up to 20% of geneticvariation [17]. Thus, an INDEL-aware allele-specificpipeline that considers both SNVs and INDELs forpseudogenome reconstruction would offer a significantimprovement over existing software.Here, we present MEA, an “all-in-one” bioinformaticstoolbox that exploits both SNVs and INDELs to enableallele-specific analyses of RNA-seq and ChIP-seq as wellas WGBS datasets generated using short-read sequen-cing technology (Fig. 1a). MEA is shipped in a Dockercontainer, enabling one step installation of all dependen-cies independent of operating system type. After provid-ing a reference genome assembly (e.g. hg19 or mm10)and a VCF file containing the relevant genetic variants,users simply input an NGS dataset in FASTQ format.MEA will then automatically generate allele-specific gen-omic coverage files in BigWig format and allele-specificanalyses over user-defined regions of interest in a tab-delimited table. To benchmark the performance of ourINDEL-aware software, we present both theoretical andreal-world evidence for improved allele-specific DNAmeanalysis relative to an INDEL-agnostic pipeline. Further-more, to highlight the utility of MEA, we investigateDNAme data processed in parallel with RNA- andChIP-seq data from mouse hybrid embryos and uncovernovel differentially methylated regions (DMRs). Add-itionally, using human brain cell data, we observe the ex-pected H3K27ac and DNAme enrichment at knownimprinted genes and uncover novel monoallelicallyexpressed genes, further demonstrating the power of in-tegrating epigenetic and expression analyses in a unifiedworkflow. The MEA toolbox harmonizes NGS read pro-cessing, with all dependencies consolidated in a Dockercontainer, includes pan-species compatibility, maxi-mizing its utility for allele-specific profiling of modelorganisms as well as human samples.ImplementationTo generate a harmonized workflow for processing ofDNAme, RNA-seq and ChIP-seq datasets, we developeda universal strategy for detecting allele-specific reads.Further, to maximize the number of experimental readsthat can be assigned to a specific allele for each datatype, MEA was designed to exploit underlying geneticvariation by incorporating both SNVs and INDELs duringpseudogenome construction. For each data type, allelicreads are captured by constructing an in silico pseudo-genome comprised of both parental genomes followedby NGS read alignment. Aligning reads simultaneouslyto both haplotype sequences of a diploid genome facili-tates the appropriate alignment of reads that map toheterozygous loci onto their cognate allele, reads whichotherwise would be discarded due to “sequencing er-rors”. Such reads are thus extracted and can be used tode-convolute allelic phenomena.An allele-specific DNA methylation pipelineTo establish a pipeline for allele-specific DNAme ana-lysis, we began by incorporating Bismark [18], a widelyRichard Albert et al. BMC Genomics  (2018) 19:463 Page 2 of 19adopted bisulphite-seq read aligner and methylationcaller, into ALEA, our previously developed tool forallele-specific analyses of RNA-seq and ChIP-seq data-sets [16]. We first quantified the hypothetical increase inthe percentage of informative CpG sites from which wecan infer allelic information by incorporation of INDELsin addition to SNVs during pseudogenome reconstruc-tion. As high-quality genetic variation information ofinbred mouse strains is available [19], we constructed apseudogenome from two mouse strains, namely DBA/2 J and the reference strain C57BL/6 J (build mm10),incorporating known genetic variants (SNVs and/orINDELs). By counting CpGs within 200 bp (an insertsize typical of WGBS libraries) of an INDEL or SNV,we found that INDEL incorporation leads to a theoret-ical increase in the number of informative CpGs (i.e.CpGs for which DNAme differences between allelescan be deduced) of 12.9% for this pseudogenome (Fig. 1b).Notably, a subset of genomic regions with associatedINDELs are entirely devoid of SNVs and therefore includenearby CpGs that theoretically can only be assessed bypipelines that are “INDEL-aware”.ResultsMEA is informative for significantly more CpGs than anINDEL-agnostic scriptTo test whether the inclusion of INDELs increases thenumber of informative CpGs for which allelic methylationstate can be calculated in practice, we processed raw readsfrom a previously published WGBS dataset from C57BL/6 Jx DBA/2 J mouse F1 inner cell mass (ICM) cells [11]. Ap-plying the same filtering parameters allowed us to directlycompare results obtained with the MEA pipeline to those ofthe Bismark-based INDEL-agnostic custom script employedby Wang et al. [11]. MEA yielded a 62.5% increase in thenumber of CpGs covered by at least 5 allele-specific C57BL/6 J reads (Fig. 1c). Importantly, informative CpGs gainedusing MEA overlapped almost exclusively with CpGs within200 bp of an INDEL or SNV, as expected. This gain is likelythe result of an increase in the number of informative het-erozygous sites (quantified in Fig. 1b) as well as efficaciousalignment of reads to the non-reference genome over re-gions with high INDEL density.Reads from regions with high INDEL density were pre-sumably excluded by the pipeline from Wang et al. [11] asFig. 1 A bioinformatics toolkit for allele-specific epigenomic analysis. a MEA pipeline flow chart. Supplied with a reference genome assembly andrelevant genetic variants, MEA first reconstructs a diploid pseudogenome. Subsequently, allele-specific analysis is performed on the input geneexpression (RNA-seq), histone PTM (ChIP-seq) or DNAme (WGBS) data in FASTQ format. MEA calculates allelic imbalance values using the resultingallele-specific genomic coverage files and generates a tab-delimited table for the user-defined regions of interest. Mouse and human exon, genebody and transcription start site coordinates are provided to facilitate analyses of such regions. b Venn diagram showing the theoretical number ofCpG dinucleotides for which allele-specific DNAme levels can be calculated using C57BL/6 J and DBA/2 J SNVs (blue) or INDELs (green) alone. CpGs forwhich allelic information can theoretically be extracted are defined as those that fall within 200 bp (an insert size typical of WGBS libraries) of a geneticvariant. c Venn diagram showing the observed number of C57BL/6 J-specific CpG dinucleotides for which allele-specific DNAme levels were calculatedusing MEA (yellow) versus an INDEL-agnostic contemporary allele-specific DNAme script [11] using the same dataset (red)Richard Albert et al. BMC Genomics  (2018) 19:463 Page 3 of 19“sequencing errors”, rather than assigned as allelic variants.To confirm that MEA increases the alignment rate ofnon-reference reads, we repeated the alignment of C57BL/6 J x DBA/2 J F1 WGBS reads to a reference genome aswell as the MEA-constructed diploid pseudogenome (com-posed of the reference and DBA/2 J genomes) and deter-mined the number of reads that aligned to each genome 0,1 or > 1 time (Fig. 2a-b). Alignment to a pseudogenome in-creased the overall alignment rate by 1.25% (80.83 to82.08%), most likely due to alignment ofnon-reference-originating reads at loci that show signifi-cant genetic divergence (high SNV and INDEL density)from the reference. As expected, the majority of readsaligned uniquely to the haploid reference genome alignedat least twice to the pseudogenome, except over regionscontaining genetic variants. This crucial distinction allowedthe uniquely aligned reads to be extracted and assigned totheir cognate parental genomes, with 8.8 and 8.2% ofall aligned reads specific to C57BL6J and DBA/2 Jstrains, respectively (Fig. 2c). By capturing a greaternumber of sites at which we can measure allelic DNAmelevels, a higher proportion of experimental reads can beassigned to a specific parental haplotype, thus enablingthe evaluation of allelic differences in DNAme levels for ahigher fraction of the genome.MEA significantly reduces reference genome alignment biasA major concern when exploring allele-specific data is thepotential for reference bias caused by differences in gen-omic sequence quality between the reference and non-ref-erence genomes, which may lead to preferentialalignment of reads to the former and artefactual allelic im-balance results [20]. For example, using an INDEL-agnosticpipeline similar to that employed by Wang et al. [11],Keown et al, reported a reference bias of 15.4% in theirstudy of allele-specific DNAme in C57BL/6 J x SPRET/EiJcells [21] (SPRET/EiJ has > 5 times the number of SNVsrelative to C57BL/6 J than does DBA/2 J [19]). To deter-mine the extent of reference bias in our MEA pipeline, webenchmarked the observed parental contribution to allelicread alignment for each autosome from the C57BL/6 J xDBA/2 J ICM WGBS dataset generated by Wang et al. [11](Fig. 2d). Notably, MEA yielded an alignment reference biason all autosomes of 3.81%, only ~ 54% of that reported bythe INDEL-agnostic pipeline (6.98%, Fig. 2e). This re-duction in alignment bias is consistent with the increasedfraction of allele-specific reads aligned to the non-referencegenome.Estimation of allele-specific alignment error rate usingisogenic miceFalse positives caused by erroneous allelic read alignmentat regions devoid of true genetic variation can lead to anunderestimation of reference bias in allele-specificexperiments. To quantify the false positive allelic align-ment rate of our pipeline, we processed pure C57BL/6 JWGBS data using the C57BL/6 J x DBA/2 J pseudogen-ome described above and determined the parental contri-bution to allelic read alignment (Fig. 3a). Curiously, 0.8%of all aligned reads (5.13% of allelic reads) were scoredas DBA/2 J-specific, indicating that MEA has an FDRof ~ 5%. When calculating the parental contribution toallelic read alignment over each autosome, we foundthat the majority of false-positive (“DBA/2 J-specific”) al-lelic read alignments clustered on chromosomes 2 and 9(Fig. 3b). Closer inspection revealed that these regions areannotated by RepeatMasker as Satellite DNA (Fig. 3c).Such allele-specific calls at sites lacking genetic variantsare the result of Bismark’s mapping quality algorithm,which calculates an erroneously high mapping score atthese highly repetitive regions. Analysis of processedWGBS data from pure DBA/2 J spermatozoa withoutblack-listing of repetitive regions revealed a C57BL/6 J-spe-cific alignment rate of 3.80% (Additional file 1: Figure S1),indicating that a global false positive rate of ~ 5% may beexpected when using the MEA pipeline for analysis ofWGBS data without excluding repetitive regions. Sincesatellite DNA is generally omitted in studies of the tran-scriptome or epigenome, we excluded reads aligned to an-notated satellite repeats (0.19% of the mappable genome)and recalculated the false-positive rate for the C57BL/6 Jdataset, which dropped to 1.62% of allelic reads, with nospecific chromosome enriched (Fig. 3d). Thus, when ap-plying the MEA pipeline, the majority of false positiveread alignments can likely be removed by black-listing sat-ellite repeats.MEA reports the expected allelic imbalance in DNAmethylation at known gametic differentially methylatedregions (gDMRs)To establish the accuracy of calculating allele-specificDNAme levels using the MEA pipeline, we measuredallele-specific DNAme levels over known imprinted gDMRs.Such regions are densely methylated on one allele andunmethylated on the other as a result of parent-of-origindependent differences in methylation established in thegametes, representing a unique resource for benchmarkingallele-specific DNAme calling. Of the 23 known mousegDMRs, 9 harbor SNVs and/or INDELs between theC57BL/6 J and DBA/2 J genomes and can therefore beassessed for allele-specific DNAme levels. For consistency,we directly compared our allele-specific results over theseregions with those reported by Wang et al. [11] (Fig. 4a).For most gDMRs, MEA yielded average allelic DNAmelevels similar to those reported by the INDEL-agnosticpipeline. However, MEA consistently yielded allele-specificinformation over a greater number of CpGs (mean ± SD:72 ± 24 vs 38 ± 21 CpGs on either allele), increasing theRichard Albert et al. BMC Genomics  (2018) 19:463 Page 4 of 19ab c edFig. 2 (See legend on next page.)Richard Albert et al. BMC Genomics  (2018) 19:463 Page 5 of 19statistical power of allelic imbalance calculations. For ex-ample, MEA detected a total of 68 CpGs informative forallelic methylation state at the Meg3 gDMR, nearly threetimes greater than the number reported by Wang et al.(Table 1). As expected, when calculated over the same 129CpGs covered by at least five reads in the gDMR, DNAmelevels calculated by the two pipelines independent of alleliccalling were nearly identical (30.2% vs 30.6%). However,the discordance between the percentage of methylationcalculated for the CpGs that are informative at an alleliclevel was significantly lower using the MEA pipeline(0.13% vs 5.8%), indicating that the accurate determinationof allelic DNAme levels at specific loci can be adversely im-pacted by sampling errors. Furthermore, as expected, onlythe MEA pipeline yields informative results for CpGsproximal to INDELs at the Meg3 gDMR locus (Fig. 4b),confirming the benefit of incorporating the latter duringpseudogenome reconstruction. Taken together, these(See figure on previous page.)Fig. 2 Empirical benchmarking of allele-specific read alignment reveals reduced reference bias. a Graphical representation of MEA’s unified strategy fordetecting allele-specific reads from RNA-, ChIP-seq and WGBS datasets. Aligning F1 hybrid reads to a pseudogenome enables alignment to their cognategenome even when originating from highly variable loci. b Paired-end WGBS reads (101 bp) from a previously published dataset of C57BL/6 J x DBA/2 JICM cells [11] were aligned using the Bismark aligner to the (haploid) reference genome (mm10 build) and a MEA-constructed diploid pseudogenome.When using MEA, multiple (2 or more) alignments reflect non-allelic reads, while uniquely aligned reads are allele-specific. Reads aligning uniquely to thepseudogenome were extracted and retroactively assigned to their parental haplotype. c The percentages of allele-specific reads called for each parentalhaplotype and the number of aligned reads that did not overlap with a genetic variant (non-allelic) is shown. d Allelic contribution of read alignments toeach parental haplotype (C57BL/6 J or DBA/2 J) on each autosome. Relative to the script employed by Wang et al. [11], MEA displays abouthalf the reference bias on the majority of autosomes. e Global reference bias for each pipeline is shownc da bFig. 3 Quantifying allele-specific alignment error rates. To estimate the rate of false-positive errors for allelic analysis of DNAme data, WGBSreads generated from C57BL/6 J mice [11] were aligned to the MEA-generated C57BL/6 J x DBA/2 J pseudogenome, and the percentage ofDBA/2 J-specific read alignments was scored. The expected allelic contribution from C57BL/6 J is 100%, as these cells are of C57BL/6 J origin.a The percentages of reads aligning uniquely to the C57BL/6 J and DBA/2 J (false-positive) pseudogenomes, as well as the number of alignedreads that did not overlap with a genetic variant (non-allelic) is shown. b The false-positive alignment rate for each autosome, along with thetotal number of aligned allelic read pairs, is shown. c Genome browser screenshot of a locus that displays a high rate of false-positive allele-specific alignment to a repeat annotated as Satellite DNA by RepeatMasker and devoid of genetic variants. d To assess the false-positive rateexclusive of repetitive Satellite DNA, allele-specific read alignments over these Repbase annotated repetitive sequences, as recognized byRepeatMasker, were culled and the rate of false-positive allele-specific alignments recalculated over each autosome as in (b)Richard Albert et al. BMC Genomics  (2018) 19:463 Page 6 of 19Fig. 4 Validation of allele-specific DNA methylation level calculations over known gDMRs. C57BL/6 J x DBA/2 J ICM WGBS reads were processedin parallel with MEA and a published pipeline [11] using identical parameters. a Allelic methylation levels over 9 known gDMRs are shown for bothpipelines. b UCSC genome browser screenshot of the Meg3 gDMR including the allele-agnostic percentage of DNAme calculated using each pipeline(total) as well as allelic calls for each informative CpG. The location of each informative CpG for each pipeline (blue tracks) is also included. Only MEAdetects allele-specific reads in a region within the gDMR that lacks SNVs but contains several INDELs (dashed box). A summary of the total number ofallelic CpG counts and DNAme levels over this locus is included in Table 1Table 1 Allele-specific DNA methylation level analysis over the Meg3 gDMRPipeline Allelic call CpGs covered Mean Methylation (%)MEA – 129 30.24C57BL/6 J 31 1.66DBA/2 J 37 58.55Total allelic informative 68 30.11Wang et al. (Table S7) – 129 30.63C57BL/6 J 12 1.59DBA/2 J 12 48.09Total allelic informative 24 24.84Richard Albert et al. BMC Genomics  (2018) 19:463 Page 7 of 19analyses demonstrate that MEA outperforms an INDEL-agnostic pipeline.MEA uncovers novel putative transient DMRs atannotated transcription start sites (TSSs)A recent study employing MeDIP on genomic DNA iso-lated from early mouse embryos revealed the presenceof maternally-methylated DMRs that are resolved duringpost-implantation development [22]. While these “tran-sient DMRs” may have important biological functionsduring pre-implantation development [22, 23], the ex-tent of transient imprinting remains unclear. To deter-mine whether MEA can be used to identify novel DMRs,we assayed the subset of informative regions gainedusing our refined pipeline, namely loci exclusively over-lapping INDELs, using the aforementioned WGBS datafrom C57BL/6 J x DBA/2 J ICM cells. As expected forpreimplantation cells, which are characterized by glo-bally low DNAme levels [24], hypomethylation of bothparental alleles was generally observed over such inform-ative regions, including at those with high CpG density(Fig. 5a). Importantly, analysis agnostic to allelic align-ment also revealed hypomethylation across such regions(for example, see Additional file 1: Figure S2). However,focusing on regions within 200 bp of annotated tran-scription start sites (TSSs) reveals that a subset showclear asymmetric DNAme levels (Fig. 5b), with eithermaternal or paternal bias.UCSC genome browser screen shots of two putativeTSS proximal DMRs, including the apparently paternallymethylated Kiss1 (a suppressor of metastasis) and mater-nally methylated Lpar6 (a lysophosphatidic acid recep-tor) genes, are shown in Fig. 5c and d. Using the MEApipeline, 15 and 34 CpGs respectively, are informativeon either allele at these loci. Importantly, the absolutemethylation levels reported by the allele-agnostic pipe-line (27.2 and 26.5%) are similar to those of the meanallele-specific methylation (29.4 and 34.6%), consistentwith the observation that methylation at these loci isallele-specific. Moreover, intersection of these ICM datawith WGBS data from mature gametes [25, 26] revealsthat paternal DNAme at the Kiss1 gene in the former islikely the result of methylation already present in sperm-atozoa, indicating that this locus potentially protectedfrom the wave of genome-wide DNA demethylation thatoccurs early in mouse embryonic development [27]. Par-ental asymmetry at the Kiss1 locus is resolved by E7.5,when the maternal allele gains DNAme coincident withthe wave of global de novo DNAme that occurs duringearly post implantation development [28]. On the otherhand, the short, intron-less gene Lpar6 is hypermethylatedin both mature oocytes and spermatozoa, indicating thatthe paternal but not the maternal allele is susceptible tothe global wave of DNAme erasure that takes place afterfertilization. Parental asymmetry of DNAme is resolved byloss of maternal DNAme in the E7.5 post-implantationembryo, revealing that the allelic bias in DNAme at thislocus is also transient but involves sequential loss ofDNAme on the paternal followed by the maternal allele.Whether these non-canonical DNAme dynamics aredriven by genetic or parent-of-origin effects, and theircontribution to the development of the early embryo,remains to be tested. Regardless, the novel DMRs iden-tified proximal to the Kiss1 and Lpar6 TSSs exemplifythe merit of increasing the number of allelic reads ex-tracted from experimental datasets and underscores thepotential for future discoveries using this approach.Comparison of RNA- and ChIP-seq read aligners using theMEA pipelineIn order to integrate epigenomic and transcriptomic-based datasets, alignment to the same genomic sequenceis required. Transcriptomic data presents a uniquechallenge when aligning to a genome, as processedmessenger RNA contains many gaps (introns) relativeto the template DNA sequence. In our previously pub-lished pipeline ALEA [16], RNA-seq alignment was car-ried out using the short-read aligner BWA, which doesnot allow alignment of intron-spanning reads. Thus, toenable integration of transcriptomic and epigenomicdatasets, gapped read alignment is essential. Tophat2[29] and STAR [30], two widely used aligners that in-corporate this feature, were recently shown to performwell in short-read RNA-seq alignment [31]. To deter-mine which of the two shows superior allele-specificgapped read alignment, we carried out a side by sidecomparison of these aligners, as well as the non-gappedread aligner BWA, using a published RNA-seq datasetfrom C57BL/6 J x DBA/2 J F1 ICM cells. STAR clearlyoutperformed both Tophat2 and BWA (Fig. 6a), likelydue to its advanced gapped read alignment algorithm[30] and ability to properly assign paired-end reads as-sociated with the same DNA molecule (if a read alignsto a region including a genetic variant, its mate is alsoidentified as allelic regardless of whether it overlaps agenetic variant). Thus, analysis of paired-end sequencingdata using the STAR aligner and MEA pipeline increasesthe fraction of regions showing relatively high sequenceconservation over which allele-specific NGS reads can bealigned, an improvement over using flanking regions as aproxy. Based on these observations, we currently rec-ommend the STAR aligner, but MEA’s flexibility in in-corporating new NGS aligners facilitates its adoptionfor analyzing epigenomic and expression datasets usingalternative/next generation aligners, such as those thatcan accommodate increased read lengths.In our previously published pipeline ALEA [16], allele-specific alignment of ChIP-seq datasets was limited toRichard Albert et al. BMC Genomics  (2018) 19:463 Page 8 of 19a bc dFig. 5 Identification of novel DMRs using the MEA pipeline. Allele-specific DNAme levels were calculated over 133,065 regions containing INDELsbut lacking SNVs (representing novel informative regions gained employing MEA) using C57BL/6 J x DBA/2 J ICM WGBS data [11]. a Maternalversus paternal DNAme levels and CpG density (data point size) are plotted for informative regions overlapping with at least 10 CpGs from whichallele-specific DNAme levels can be ascertained (746 data points). b CpG density (data point size) and allele-specific DNAme levels are shown, asin (a) over the subset of novel informative regions +/− 200 bp from annotated TSSs (with at least five informative CpGs on both alleles).Representative novel informative regions for which screenshots are provided are circled in red. c-d UCSC genome browser screenshots of differentiallymethylated regions (dashed boxes) near the promoters of the Kiss1 and Lpar6 genes. Tracks from Wang et al. [11] are included to illustrate differencesin pipeline sensitivity. DNAme tracks of male and female germ cells [25, 26] as well as E7.5 embryos [11] are also shown, along with the location ofinformative CpGs (highlighted in blue)Richard Albert et al. BMC Genomics  (2018) 19:463 Page 9 of 19baFig. 6 Validation of allele-specific transcription level calculations and integration with ChIP-seq and WGBS datasets at allelic resolution. MEA wasextended to accommodate contemporary RNA-seq aligners and to automatically organize allelic and total genomic tracks into UCSC Track Hubsto aid data visualization and interpretation. a The number of annotated genic exons covered by allelic reads using BWA, Tophat2 and STAR aligners isshown for an RNA-seq dataset generated from C57BL/6 J x DBA/2 J ICM cells [48]. b UCSC genome browser screenshot of the Meg3 gDMR anddownstream gene using the default MEA output for visualization of allelic (WGBS, RNA- and ChIP-seq) data. MEA automatically generates compositetracks containing total (allele-agnostic, grey), reference (blue) and non-reference (red) genomic tracks for visualization of allelic RNA- and ChIP-seqdatasets. Bottom three tracks show MEA output from previously published C57BL/6 J x PWK/PhJ F1 ICM ChIP-seq data [13, 47]Richard Albert et al. BMC Genomics  (2018) 19:463 Page 10 of 19the BWA-aln algorithm. To enhance MEA’s flexibility,we incorporated another popular ChIP-seq alignerBowtie2. To compare the performance of BWA-alnand Bowtie2 for allele-specific ChIP-seq alignment, weprocessed H3K4me3 ChIP-seq data generated from pureC57BL/6 J and PWK/PhJ gametes [13]. While both align-ment algorithms yield a low false-positive alignment rate of~ 0.2–4.8%, BWA-aln clearly reports more allele-specificread alignments than Bowtie2 (Additional file 1: Figure S3and Additional file 2: Table S1). Thus, while users canchoose between BWA-aln and Bowtie2, we recommendthe former for allele-specific analysis of ChIP-seq datausing MEA.Integration of WGBS, RNA-seq and ChIP-seq datasets usingthe MEA pipelineDissecting the interplay between epigenetic marks andtranscription was greatly facilitated by the advent ofNGS-based approaches for measuring RNA levels andthe genome-wide distribution of DNAme and histonePTMs. However, as such datasets are commonly processedusing different pipelines, integrating and visualizing allelicinformation embedded therein is non-trivial. To automatedataset integration, MEA processes WGBS, RNA- andChIP-seq alignment data using the same allele-specific readidentification strategy, yielding standardized allele-specificgenomic tracks. This unification of file types allows simul-taneous visualization of each datatype (in BigWig format)using popular genome browsers. Further, to automate theprocess of reporting allelic imbalance, MEA generates atab-delimited table containing allelic imbalance mea-surements over user-defined regions of interest, such astranscription start sites, genic exons or gene bodies (seeAdditional file 3: Table S2).This approach solves two important considerations inthe presentation of allele-specific data. First, allelic gen-omic tracks, i.e. those displaying only read coverage thatis informative for allelic alignment, are inherently sparse,especially at regions devoid of genetic variants. To de-lineate signal from noise, allele-specific genomic trackvisualization should be considered in the context of allaligned reads and the position of the genetic variantsites. Second, allele-specific enrichment is greatest atsites of genetic variation and therefore does not necessar-ily coincide with the profiles generated from all reads ag-nostic of allelic assignment. For example, while readsderived from H3K4me3 ChIP-seq datasets are enrichedover active TSSs, allelic H3K4me3 reads may align any-where within the set of allele-agnostic peaks. Thus, allelicreads aligning at the edge of a region of H3K4me3 enrich-ment that is devoid of genetic variants at its center may beincorrectly discarded as noise.The MEA pipeline standardizes such integrated trackvisualization by organizing genomic tracks into a UCSCTrack Hub [32]. These hubs agglomerate multiple colour-coded data tracks, enabling the concurrent visualizationof allele-specific and “total” (allele-agnostic) alignmentprofiles, and in turn interpretation of allelic imbalance.Variant files used for pseudogenome reconstruction canalso be directly visualized as UCSC custom tracks. Theutility of this approach is illustrated using the Meg3gene and its governing gDMR as a representative locus(Fig. 6b). Imprinting is simultaneously displayed in fourindependent datasets generated from two distinct F1hybrid crosses. The Meg3 gDMR is paternally methylatedand weakly enriched for both permissive (H3K4me3) andrepressive (H3K27me3) histone PTMs (grey). Interest-ingly, H3K4me3 and H3K27me3 asymmetrically mark thematernal and paternal alleles, respectively, as expected forthe promoter of a gene expressed exclusively from the ma-ternal allele. Notably, each dataset is consistent with pater-nal imprinting, with repressive marks associated with thepaternal allele and active marks with the expressed mater-nal allele. Profiles of the maternally imprinted Snrpn andImpact loci reveal similar patterns (see Additional file 1:Figures S4 and S5). Note that for the Impact locus, a sin-gle genetic variant in the F1 hybrid analyzed is sufficientto score DNAme asymmetry between parental alleles. Theobserved enrichment of both H3K4me3 and H3K27me3at imprinted DMRs is consistent with a previous report[33], and evidence of H3K4me3 and H3K27me3 enrich-ment asymmetry on active and repressed alleles hasbeen documented for individual genes [34]. Thus, theallele-specific genomic tracks and dataset integrationemployed by MEA enhances the visualization of allelicdifferences between epigenetic marks and transcriptionacross the genome.Application of the MEA pipeline to human WGBS, RNA-seqand ChIP-seq datasetsTo demonstrate the utility of MEA for the study of NGSdatasets from human samples, we used the STAR alignerto analyze an RNA-seq dataset generated from humanbrain tissue. For individuals whose parental genomic se-quences are unavailable, MEA uses Shape-IT [35] tophase individual genetic variants into inferred haplo-types. For each annotated gene, the haplotype-specificcontribution to allelic read alignment was calculated usingMEA (Additional file 3: Table S2). As expected, humanimprinted genes [36] such as MEST, MEG3, PEG3 andPEG10 display monoallelic expression (Fig. 7a), confirm-ing the suitability of MEA for the analysis of RNA-seqdata from human samples.We next generated UCSC Track Hubs to visualize theRNA-seq data analyzed above, as well as matched DNAme(WGBS) and histone PTM (cross-linked ChIP-seq) datafrom human brain and focused on imprinted genes thatinclude genetic variants in their exons and respectiveRichard Albert et al. BMC Genomics  (2018) 19:463 Page 11 of 19Fig. 7 (See legend on next page.)Richard Albert et al. BMC Genomics  (2018) 19:463 Page 12 of 19DMRs. Thirteen known imprinted genes were expressed(RPKM > 1) and had at least 10 allele-specific mappedread coverage on either allele, 6 of which show > 80% ex-pression from one allele (see Additional file 3: Table S2).A screen shot of the imprintedMEST gene, which is pater-nally expressed in somatic tissues, is shown in Fig. 7b. Asexpected, analysis of sperm and oocyte WGBS data fromunrelated individuals reveals a DMR at the MEST TSSthat is methylated exclusively in the oocyte and shows ~50% methylation across the annotated DMR in adult braincells. MEA output reveals one allele with dense methyla-tion in this region, haplotype 1 (hap1) and the other withvery low methylation (hap2). Importantly, only the latter,which is transcriptionally active, shows enrichment ofH3K27ac, a histone modification associated with activegenes. Based on allele-specific DNAme, transcription andhistone PTM patterns, we surmise that haplotypes 1 and 2of the MEST locus were inherited from the proband’smother and father, respectively. Taken together, these re-sults reveal that MEA successfully integrates allele-specificRNA-seq data with WGBS and ChIP-seq data for identifi-cation and visualization of human loci harbouring geneticvariants.To determine whether H3K27ac shows allele-specificenrichment in the promoter regions of genes exhibitingallele-specific transcription, we identified all genes thatharbor genetic variants over annotated exons and theTSS and calculated their allelic ratios (Fig. 8a). Whilethe correlation between expression and H3K27ac allele-specific ratios is low (Pearson r2 = 0.29), many genesdisplaying strong allele-specific expression bias (overtwo standard deviations from the mean) are also enrichedfor H3K27ac on the active allele (χ2 test p values for bias to-wards haplotype 1 = 1.38E -24 and haplotype 2 = 4.8E -38),as expected. Moreover, manual inspection of a subsetof genes displaying monoallelic expression and biallelicH3K27ac reveals that transcription originates at alter-native promoters. To further quantify the relationshipbetween allele-specific H3K27ac and transcription, wecategorized genes based on allele-specific transcriptionbias and measured the distribution of allele-specificH3K27ac at TSSs (Fig. 8b). Notably, while allele-specificH3K27ac was positively correlated with transcriptional ac-tivity, the ChIP-seq input (control) dataset also showed ahigher level of enrichment on the active allele for eachhaplotype. This observation is consistent with previousstudies demonstrating that the promoter regions of activegenes are inherently more sensitive to sonication than in-active genes [37, 38]. That this bias also applies to individ-ual genes exhibiting allelic differences in expression/PTMsreiterates the importance of input-correction of ChIP-seqmaterial and highlights the sensitivity of the MEA pipelinefor quantifying allele-specific differences in enrichment.To determine whether MEA can be employed to identifynovel monoallelically expressed transcripts in human sam-ples, we revisited the brain RNA-seq data described above.Applying thresholds for total expression (RPKM > 1),allele-specific coverage (mapped reads > 100) and expres-sion bias (> 90% of transcript levels from one allele), weidentified 222 monoallelically expressed transcripts(Fig. 7a). Ten of these 222 transcripts showed sufficientH3K27ac ChIP-seq coverage for allele-specific calling(total RPKM > 1 and allele-specific CpGs on each allele.While seven of these transcripts (PIK3R3, ZNF662,PSMC1, LOC145784, CYP4F24P, C19orf48 and ZNF805)showed biallelic or minor allele-specific bias in H3K27ac,perhaps indicative of allele-specific post-transcriptionalregulation, three (MEST, MIR4458HG and PCDHA5)showed strong H3K27ac bias toward the active allele (>90% allelic reads). Importantly, the latter represent knownand candidate novel imprinted genes. PCDHA5 belongs toa large gene family of protocadherins, complicating al-lelic interpretation. However, analysis of the previouslydescribed imprinted gene MEST (Fig. 7b) and theuncharacterized non-coding RNA gene MIR4458HG(Fig. 8c), revealed H3K27ac enrichment and intermediatemethylation at their TSSs. As described above for theMEST gene, allelic deconvolution at the MIR4458HG pro-moter using MEA reveals H3K27ac enrichment and theabsence of DNAme exclusively on the active allele. Fur-thermore, analysis of published WGBS data from gametesreveals hypomethylation of the MIR4458HG TSS in bothsperm and oocyte, indicating that the allelic gain ofDNAme at this locus occurs in somatic tissues. Thus,using MEA to integrate complementary RNA-, ChIP-seqand DNAme datasets allows for the allele-specific reso-lution of epigenetic states at the regulatory regions of bothknown and novel monoallelically expressed genes.(See figure on previous page.)Fig. 7 Allelic integration of RNA-, ChIP-seq and WGBS datasets from human brain. a Analysis of allele-specific gene expression using RNA-seq data fromadult human brain. Imprinted genes are highlighted in red and monoallelically expressed genes (defined by total expression (RPKM > 1), allele-specificcoverage (mapped reads > 100) and expression bias (> 90% of transcript levels from one allele)) are highlighted in blue and orange. MEST, an imprintedgene, is highly expressed in brain and shows the expected allelic bias. b UCSC genome browser screenshot of the MEST locus showing allele-agnostic(total) and allele-specific (blue and red) DNAme levels in adult brain. DNAme levels in gametes (oocyte & spermatozoa) are also shown [49]. RNA-seq andH3K27ac ChIP-seq data from human brain were integrated using MEA and allele-agnostic (total) as well as allele-specific coverage is shown for each.Note that only the expressed allele, haplotype 2 (hap2) is unmethylated and enriched for H3K27ac. Also see Additional file 2: Table S2Richard Albert et al. BMC Genomics  (2018) 19:463 Page 13 of 19Consolidation of all dependencies into a Docker containerThe proper installation and configuration of bioinfor-matics dependencies is a major hurdle for both new andexperienced users. To address this challenge, we packagedMEA into a Docker container, an open-source softwarepackaging and distribution system (see Materials andMethods). The self-contained nature of the containerallows one-step installation of all 15 bioinformatic de-pendencies (STAR, bwa, Bedtools, Bowtie2, Tophat2,Bismark, Java, etc.), providing a consistent user experienceindependent of operating system (Windows, MacOS,Linux, etc.). Furthermore, the consolidation of all MEAtool installation steps will greatly facilitate future in-corporation of alternative NGS aligners.DiscussionThe surge of publicly available NGS epigenomic and ex-pression datasets generated by international consortia,has outpaced the development and dissemination of bio-informatic pipelines that can be used to analyze disparateepigenomic datasets at allelic resolution. To address thisneed, we developed a universal pipeline that generatesFig. 8 Allele-specific transcription, H3K27ac and DNA methylation at the MIR4458HG locus. a Integration of allele-specific gene expression andpromoter H3K27ac enrichment using human brain RNA-seq and matched ChIP-seq datasets. Only transcripts with informative allele-specific RNA-seqcoverage over exons and ChIP-seq coverage over TSSs (+/− 300 bp) are shown (n = 1759). b Distribution of H3K27ac and input/control allelic ratios atTSSs of transcripts expressed from one or both alleles. Note the allelic ratio bias even in the input control. c UCSC genome browser screenshot of theMIR4458HG locus. Only the expressed allele (hap2) is enriched for H3K27ac and hypomethylated at the CpG island promoterRichard Albert et al. BMC Genomics  (2018) 19:463 Page 14 of 19integrated allele-specific genomic tracks for DNA methy-lation (WGBS or Reduced Representation Bisulphite Se-quencing (RRBS)), expression (RNA-seq) and histonemodification (ChIP-seq) data. Using a unique strategythat incorporates INDELs in addition to SNVs duringpseudogenome reconstruction, MEA increases thequality of non-reference genomic sequences, yielding areduction in reference genome alignment bias. Addition-ally, in the case of mouse datasets, false positiveallele-specific alignments can be minimized by excludingsatellite repeats from post-alignment analysis. By consider-ing INDELs and SNVs, MEA captures significantly moreallelic CpGs than an INDEL-agnostic script and in turn in-creases the sensitivity of allele-specific, parent-of-originDNAme level calculations. Furthermore, by implementingRNA-seq aligners developed specifically to address splicedread alignment, such as STAR [30], MEA reportsallele-specific expression over a greater proportion of thetranscriptome relative to other aligners.The fraction of the genome for which allele-specificstate can be calculated is a function of several experimen-tal variables, including the choice of parental strains in thecase of F1 hybrid studies in model organisms. We wereable to measure allele-specific DNA levels over 20.4% ofall CpGs in C57BL/6 J x DBA/2 J F1 hybrid mice. TheDBA/2 J strain is quite similar genetically to the referenceC57BL/6 J, containing on average one SNV per 530 bp(0.19%), at the lower limit of the optimal sequence diver-gence range of 0.1 to 5% for genome-wide allelic analysis[37]. Wild and inbred mouse strains such as PWK/PhJ,CAST/EiJ or SPRET/EiJ are up to eight times more diver-gent than commonly used strains, such as DBA/2 J,129S1/SvImJ and C3H/HeJ [19]. Thus, when crossed withany other strain, such F1 hybrids will yield a significant in-crease in the fraction of informative reads. Regardless ofparental genome diversity, the incorporation of INDELs inaddition to SNVs during pseudogenome reconstruction,as implemented in MEA, significantly increases the num-ber of regions over which allele-specific methylation canbe discerned. For strains with available SNV and INDELannotations, such as those provided by the Sanger Insti-tute’s Mouse Genomes Project [17], the average geneticvariant frequency between parental genomes can easily becalculated, and in turn, the fraction of the genome likelyto be informative for discriminating allele-specific readsdetermined a priori.By increasing the number of allele-specific reads ex-tracted from NGS datasets of outbred individuals, includingF1 hybrid model organisms as well as human subjects,MEA enables the identification of novel DMRs in WGBSdata, allelic-specific gene expression from RNA-seq dataand the discrimination of histone marks showingparent-of-origin specific patterns from true bivalent marksby ChIP-seq. As this toolbox was developed to processnext generation sequencing reads regardless of experi-ment type, MEA can also be used to analyze additionalchromatin features with allelic resolution. For example,to map chromatin accessibility at an allelic level,DNase I hypersensitivity site-sequencing (DNase-seq,[38]) or transposase-accessible chromatin followed byhigh-throughput sequencing (ATAC-seq, [39]) datasetscan be interrogated and the results integrated with thedata types described above. Importantly, if allele-specificresolution is desirable, previously generated datasets usingany of these approaches can be revisited using MEA.While MEA can be applied to datasets generated fromany diploid organism, there are several important limita-tions that must be considered for clinical studies. Aseach individual has a unique diploid genome (except inthe case of monozygotic twins), pseudogenome reconstruc-tion is essential. While MEA exploits publicly availablewhole genome sequencing datasets from the Sanger Insti-tute’s Mouse Genomes Project [17] and the human-focused1000 genomes project [40], additional genotyping and var-iant-calling steps will be required for haplotypes notcovered by these population level sequencing projects.Nevertheless, large-scale efforts such as The CancerGenome Atlas (TCGA) project that harmonize variouscancer-related dataset types, including genotype infor-mation, may be analyzed using MEA to deconvolutecomplex relationships that may operate at an allele-spe-cific level. For example, a recent publication combinedgenetic, DNAme and gene expression variation to explainaberrant gene regulatory networks in thyroid carcinomasamples [41]. Given the high frequency of heterozygoussomatic mutations in many cancer types, MEA may be ap-plied to directly measure the effect of these mutations onDNAme and gene expression levels on the same allele byusing the other allele as a control, potentially allowing forthe identification of additional driver mutations. Since insilico diploid genome sequences are twice as large as theirrespective reference assemblies, such population-basedstudies (encompassing thousands of individuals) will re-quire extensive computational infrastructure. These tech-nical restrictions limit the number of unique individualsthat can be practically evaluated. Therefore, for studiesencompassing large outbred populations, an alternativeapproach that combines genotyping and allele-specificread calling is more suitable [42]. Nevertheless, forsmaller scale epigenomic studies, such as those involvingtrios, MEA can be applied to study the role of genetics inepigenetic variation, and in turn, to facilitate the discoveryor validation of variants of interest, complementingepigenome-wide association studies (EWAS) [43].ConclusionsTo our knowledge, MEA is the first software package toprovide integrated allele-specific analysis of DNARichard Albert et al. BMC Genomics  (2018) 19:463 Page 15 of 19methylation, histone modification and expression data.Exploiting both SNV and INDEL information, this pipe-line increases the sensitivity and specificity of allelic ana-lyses relative to an INDEL-agnostic approach. MEAautomates diploid pseudogenome reconstruction,allele-specific read detection and haplotype-resolved gen-omic track agglomeration for intuitive data visualizationand allelic imbalance detection. With one-step installationand user-friendly file outputs, MEA can be applied with-out relying on extensive bioinformatic expertise. Intersec-tion of epigenomic and transcriptomic datasets usingthis novel toolbox will facilitate studies ofparent-of-origin effects as well as the interplay betweengenomic sequence, the epigenome and transcriptionalregulation in both humans and model organisms.MethodsSamples used in this studyWe validated our tool using previously published bisulphite-seq data generated from inner cell mass (ICM) cells from anF1 hybrid between mouse strains C57BL/6 J and DBA/2 J(Wang et al. (2014) [11]). DBA/2 J differs from the referencestrain (C57BL/6 J) by 5,126,997 SNVs (roughly 1 SNV/530 bp) and 1,019,400 INDELs, comparable to othercommonly used lab mouse strains (see Discussion).ICM bisulphite-sequencing (GSM1386023) was comple-mented with RNA-seq (GSM1845307–8) from ICM cellsisolated from C57BL/6 J x PWK/PhJ F1 mice as well asChIP-sequencing data for H3K4me3 (GSM1845274–5)and H3K27me3 (GSM2041078–9), permissive and re-pressive histone post-translation modifications respect-ively. RNA-seq data from C57BL/6 J x DBA/2 J ICM(GSM1625868) was used to test allele-specific alignmentperformance of contemporary RNA-seq aligner software.Bisulphite sequencing datasets from C57BL/6 J MII oocytes(GSM1386019) and DBA/2 J spermatozoa (GSM1386020)were analyzed to directly measure false-positive allele-specific alignment rates. Processed fully grown oocyte(DRX001583) and sperm (DRX001141–9) bisulphite-seqwere used for visualization. Processed human sperm andoocyte WGBS was obtained from JGAS00000000006.Adult human brain datasets were obtained as part ofthe Canadian Epigenetics, Environment and Health Re-search Consortium (CEEHRC) Network.In silico diploid genome reconstructionAs published previously, MEA constructs a diploid pseu-dogenome using a reference sequence (.fasta) and knowngenetic variants (.vcf ) including SNVs and INDELs [16].For samples requiring genotype phasing, MEA utilizesSHAPEIT2 [35] and a publicly available reference panelof haplotypes provided by the 1000 Genomes Project[40] to output phased haplotypes. These steps generatean in silico diploid genome containing two copies of eachchromosome, one for each parental genome. AligningNGS reads to a diploid genome enables the extraction ofuniquely aligned allele-specific reads, which are sepa-rated into parent-of-origin read alignment files. Anautomatically-generated index file (.refmap) enables re-versal of coordinate alterations in non-reference allelicread alignments caused by differential parental INDELlengths. This allows projection of allelic genomic tracksback onto the original reference genome for consistentdata visualization in genome browsers (which are builtaround reference genomes) and downstream analysesover coordinate-based regions of interest.MEA exploits widely used NGS alignment softwareIn order to detect allele-specific reads from RNA-, ChIP-seqand WGBS data, we designed MEA to align reads using anin silico pseudogenome and extract uniquely mapped reads.This approach allows allele-specific alignment of reads con-taining sequencing errors, which is critical for datasets withlong (100+ bp) reads commonly sequenced on Illumina se-quencers, which have approximately 0.26–0.80% sequenceerror rates [44]. This pipeline modification assures adoptionand operation of our tool well into the future as sequencingtechnologies continue to extend read lengths without neces-sarily improving error rates.Special considerations for allele specific DNAme analysisDNAme levels can be accurately measured genome wideusing sodium bisulphite conversion of unmethylated cy-tosines to thymines followed by whole genome sequen-cing (bisulphite-seq). To measure allele-specific DNAmelevels, MEA detects allelic reads and calculates the pro-portion of cytosines and thymines at CpG dinucleotides.To do so, MEA aligns bisulphite-seq reads to the insilico diploid genome using the popular aligner andmethylation caller Bismark [18]. Unlike ChIP- or RNA-seqaligners, Bismark considers cytosine to thymine mutations(introduced during sodium bisulphite conversion) inorder to accurately align reads to a genomic sequence.Allele-specific DNAme levels therefore reflect both geneticand epigenetic effects: users can retroactively delineateboth effects using their original list of genetic variants.UCSC track hubs for allelic track visualizationUCSC Track Hubs are a hierarchical file organization sys-tem that allow combining multiple genomic tracks intoone for convenient data visualization and interpretation[45, 32]. MEA automatically normalizes allele-specifictracks by sequencing depth and generate correspondingtrack hub database files. Using UCSC binaries (hubCheck),we ensure the integrity of MEA-generated track hubs forstandardized visualization experiences. Additionally, weprovide scripts for the automatic computation of allelicRNA- and ChIP-seq coverage over user-defined regions ofRichard Albert et al. BMC Genomics  (2018) 19:463 Page 16 of 19interest (for example: transcription start sites, genes,enhancers, etc.), outputting a tab-delimited table. WhileRPKM- and coverage-calculating software already exist,confounding variables are inherent to allelic analyses,requiring custom scripting. For example, calculating allelicRPKM values using conventional tools is complicated bythe variability in SNV and INDEL density between regionsof interest. To account for such effects, MEA’s default out-put includes allelic read coverage for both alleles (to calcu-late allelic imbalance) and total RPKM (to filter forenrichment). Users can easily interpret allelic imbalancecalculations with the combination of these two metrics(allelic read coverage and total RPKM) over their regionsof interest. In this study, VisRseq [46] was used to plot al-lelic read coverage for RNA-seq data from human brain.Consolidation of tool dependencies into self-sufficientpipelinePackaging MEA into a Docker container allows the one-step installation of all ~ 15 dependencies, significantly re-ducing the work required by the end-users. Simply, theDocker container is a text file containing instructions forinstalling a virtual system and setting environment vari-ables, followed by standardized installation of each bio-informatic dependency. Once installed through the Dockercontainer, MEA is immediately operational. The Dockerfile is uploaded to a third-party website and available fordownload (see Availability and requirements).Software tool requirementsUsers are encouraged to install MEA through Docker.Alternatively, manual installation requires the followingsoftware (with specific versions used during developmentof MEA): Java v-1.6, Python v-2.4, Bismark v-0.15.0,Bowtie2 v-2.2.3, BWA v-0.7.10, STAR v-2.5.1b, Tophat2v-1.1, SAMtools v-0.1.16, Bedtools v-2.22.1, VCFtoolsv-0.1.10, SHAPEIT2, bgzip v-1.1, bedGraphToBigWigv-1.1, wigToBigWig v-4 & hubCheck.Availability and requirementsProject name: MEA.Project home page: Installation1. Download: Run: $ docker build -t taskkoike:mea.1.0 /path/to/directory-containing-Dockerfile/Operating system(s): Platform independentProgramming language: Java, Python, Awk, Bash.Other requirements: Docker v1.13.1 and aboveLicense: The MIT LicenseAdditional filesAdditional file 1: Supplementary Figures S1-S5. Figure S1 False-positive allele-specific alignments using a dataset derived from DBA/2 Jspermatozoa. To estimate the rate of false-positive errors for WGBS ana-lyses, raw data generated from DBA/2 J mice [11] was aligned to theMEA-generated C57BL/6 J x DBA/2 J pseudogenome and the percent-age of C57BL/6 J-specific read alignments was scored. The expected alleliccontribution from C57BL/6 J is 0%, as these cells are of DBA/2 J origin. (a)The percentage of reads aligning to C57BL/6 J (false-positive) and DBA/2 J as well as the number of aligned reads that did not overlap with agenetic variant (non-allelic) is shown. (b) The false-positive alignmentrate for each autosome, along with the number of aligned allelic readpairs, is shown. (c) Genome browser screenshot of a representative false-positive locus. C57BL/6 J-specific reads aligned in large stretches of false-positive alignment regions, suggesting that the parental strain DBA/2 J fromthis study was not pure. Indeed, when manually inspecting thesestretches of false-positive read alignments, experimental reads perfectlymatched the reference sequence over known DBA/2 J SNVs andINDELs, again suggesting that “DBA/2 J” mice analyzed by Wang et al.[11] contained C57BL/6 J sequence. Figure S2 DNA methylation dy-namics over the Foxj3 CpG island promoter. Allele-specific DNAmelevels were calculated over 133,065 regions containing INDELs but lack-ing SNVs (representing novel informative regions gained employingMEA) using C57BL/6 J x DBA/2 J ICM WGBS data [11]. UCSC genomebrowser screenshot of a representative region over which an allele-agnosticpipeline calculated a total DNAme level of < 1% (dashed box). Accordingly,the levels of allele-specific DNAme on both parental alleles, as calculated byMEA, are < 1%. DNAme tracks of male and female germ cells [25, 26] arealso shown, as well as a track indicating the location of each informativeCpG (highlighted in blue). Figure S3 Comparison of ChIP-seq software forallele-specific read alignment. To estimate the rate of allele-specific readalignments and false-positive errors for ChIP-seq analyses, raw H3K4me3ChIP-seq data generated from C57BL/6 J (fully grown oocytes) and PWK/PhJ(spermatozoa) mice [13] was aligned to the MEA-generated C57BL/6 J xPWK/PhJ pseudogenome and the number of C57BL/6 J- and PWK/PhJ-specific read alignments was scored. The number of reads aligning toC57BL/6 J and PWK/PhJ as well as the total number of allele-specificalignments on each autosome is shown for each analysis. (a) Allele-specific alignment using the BWA-aln algorithm. (b) Allele-specificalignment using Bowtie2. (a-b) The expected allelic contribution forC57BL/6 J is 100%, as these cells are of C57BL/6 J origin. (c) Allele-specificalignment using the BWA-aln algorithm. (d) Allele-specific alignment usingBowtie2. (c-d) The expected allelic contribution for C57BL/6 J is 0%, as thesecells are of PWK/PhJ origin. Also see Additional file 2: Table S1. Figure S4Integration of WGBS with allele-specific RNA- and ChIP-seq over thepaternally-expressed imprinted gene Snrpn. UCSC genome browserscreenshot of the Snrpn gDMR and downstream gene using the defaultMEA output. MEA automatically generates composite tracks containingtotal/allele-agnostic (grey), reference (blue) and non-reference (red)genomic tracks for visualization of allelic RNA- and ChIP-seq datasets, shownfrom references [48], [47] and [13] here. An additional track indicating thelocation of each informative CpG (highlighted in blue) is also included.Notably, only the expressed paternal allele is enriched for H3K4me3 whilethe inactive maternal allele is enriched for H3K27me3 and DNAme.Figure S5 Integration of WGBS with allele-specific RNA- and ChIP-seqdata over the paternally-expressed imprinted gene Impact. UCSCgenome browser screenshot of the Impact gDMR and downstreamgene using the default MEA output for visualization of allelic data(WGBS, RNA- and ChIP-seq), as shown in Additional file 1: Figure S4.This locus demonstrates that a single genetic variant is apparentlysufficient to score DNAme level asymmetry between parental allelesin an F1 hybrid. (PDF 9823 kb)Additional file 2: Table S1. BWA and Bowtie2 allele-specific alignmentresults. (XLSX 49 kb)Additional file 3: Table S2. Human RNAseq, ChIPseq and WGBS allele-specific alignment results. (TXT 6957 kb)Additional file 4: Table S3. List of datasets used in this study and theirsource. (XLSX 11 kb)Richard Albert et al. BMC Genomics  (2018) 19:463 Page 17 of 19Abbreviationsbp: Base pair; ChIP: Chromatin immunoprecipitation; FDR: False discovery rate;gDMR: Gametic differentially methylated region; ICM: Inner cell mass;INDEL: Insertion or deletion; mat: Maternal; mESC: Mouse embryonic stemcell; pat: Paternal; RNA: Ribonucleic acid; seq: Sequencing; SNV: Singlenucleotide variant; TSS: Transcription start siteAcknowledgementsWe thank Julie Brind’Amour, Inanç Birol, and Kristoffer Jensen for criticalreview of the manuscript and Louis Lefebvre, Hendrik Marks and AlexanderDobin for helpful discussions.FundingThe Lorincz lab is supported by grants from the Canadian Institutes of HealthResearch (MOP-133417) and the Natural Sciences and Engineering ResearchCouncil of Canada (RGPIN-2015-05228). JRA is a recipient of an NSERCPostgraduate Scholarships-Doctoral Program award (PGSD3–476000-2015), aKillam Doctoral Scholarship and a UBC Four Year Doctoral Fellowship.Availability of data and materialsThe datasets analyzed in this study are available in the Gene ExpressionOmnibus (GEO) and DNA Data Bank of Japan (DDBJ) repositories under codes:GSM1386023, GSM1845274, GSM1845275, GSM1845307, GSM2041078,GSM2041079, GSM1625868, GSM1386019, GSM1386020, DRX001583,DRX001141–9 & JGAS00000000006 from the publications: [11, 13, 25, 26, 47–49].No cell lines or datasets derived from cell lines were utilised in this study. Adetailed description of all datasets can be found in Additional file 4: Table S3.The results published here are in part based upon data generated by TheCanadian Epigenetics, Epigenomics, Environment and Health ResearchConsortium (CEEHRC) initiative funded by the Canadian Institutes of HealthResearch (CIHR), Genome BC, and Genome Quebec. Information aboutCEEHRC and the participating investigators and institutions can be foundat The datasets supporting theconclusions of this article are included within the article and its additional files.Authors’ contributionsJRA participated in study design, developed code, performed validation analysis,and wrote the manuscript. TK developed code and performed validation analysis.HY participated in study design and developed code. RT and ABB ran bug tests.RT and HY wrote the user manual. MMK participated in study design andperformed validation analysis. MCL participated in study design and wrote themanuscript. All authors read and approved the final manuscript.Ethics approval and consent to participateNot applicable.Competing interestsThe authors declare that they have no competing interests.Publisher’s NoteSpringer Nature remains neutral with regard to jurisdictional claims inpublished maps and institutional affiliations.Author details1Department of Medical Genetics, The University of British Columbia,Vancouver, BC, Canada. 2Department of BioScience, Tokyo University ofAgriculture, Setagaya-ku, Tokyo, Japan. 3Graphics Usability and VisualizationLab, School of Computing Science, Simon Fraser University, Burnaby, BC,Canada. 4Canada’s Michael Smith Genome Sciences Centre, British ColumbiaCancer Agency, Vancouver, BC, Canada. 5Biomedical Research Centre, TheUniversity of British Columbia, Vancouver, BC, Canada. 6Qatar BiomedicalResearch Institute, Hamad Bin Khalifa University, Doha, Qatar. 7MRC LondonInstitute of Medical Sciences, Imperial College, London, UK.Received: 23 February 2018 Accepted: 29 May 2018References1. Holliday R. Genomic imprinting and allelic exclusion. Development. TheCompany of Biologists Ltd. 1990;108(Supplement):125–9.2. Pinheiro I, Heard E. X chromosome inactivation: new players in the initiationof gene silencing. F1000Research. 2017;6:344.3. Goncalves A, Leigh-Brown S, Thybert D, Stefflova K, Turro E, Flicek P, et al.Extensive compensatory cis-trans regulation in the evolution of mouse geneexpression. Genome research. Cold Spring Harbor Lab. 2012;22(12):2376–84.4. Turro E, Su S-Y, Goncalves A, Coin LJM, Richardson S, Lewin A. Haplotypeand isoform specific expression estimation using multi-mapping RNA-seqreads. Genome Biology BioMed Central Ltd. 2011;12(2):R13.5. Harvey CT, Moyerbrailean GA, Davis GO, Wen X, Luca F, Pique-Regi R.QuASAR: quantitative allele-specific analysis of reads. Bioinformatics OxfordUniversity Press. 2014;31(8):btu802–1242.6. Mayba O, Gilbert HN, Liu J, Haverty PM, Jhunjhunwala S, Jiang Z, et al.MBASED: allele-specific expression detection in cancer tissues and cell lines.Genome biology. BioMed Central. 2014;15(8):405.7. Jiang Y, Zhang NR, Li M. SCALE: modeling allele-specific gene expression bysingle-cell RNA sequencing. Genome Biol BioMed Central. 2017;18(1):74.8. Song Q, Decato B, Hong EE, Zhou M, Fang F, Qu J, et al. A referenceMethylome database and analysis pipeline to facilitate integrative andcomparative Epigenomics. El-Maarri O, editor. PLoS One 2013;8(12):e81148.9. Vincent M, Mundbjerg K, Skou Pedersen J, Liang G, Jones PA, Ørntoft TF, etal. epiG: statistical inference and profiling of DNA methylation from whole-genome bisulfite sequencing data. Genome biology. BioMed Central. 2017;18(1):245–16.10. Hu K, Ting AH, Li J. BSPAT: a fast online tool for DNA methylation co-occurrence pattern analysis based on high-throughput bisulfite sequencingdata. BMC Bioinformatics BioMed Central. 2015;16(1):97.11. Wang L, Zhang J, Duan J, Gao X, Zhu W, Lu X, et al. Programming andinheritance of parental DNA methylomes in mammals. Cell. 2014;15(4):979–91.12. Leung D, Jung I, Rajagopal N, Schmitt A, Selvaraj S, Lee AY, et al. Integrativeanalysis of haplotype-resolved epigenomes across human tissues. NatureNature Publishing Group. 2015;518(7539):354.13. Zhang B, Zheng H, Huang B, Li W, Xiang Y, Peng X, et al. Allelicreprogramming of the histone modification H3K4me3 in early mammaliandevelopment. Nature Nature Research. 2016;537(7621):553.14. Andergassen D, Dotter CP, Kulinski TM, Guenzl PM, Bammer PC, Barlow DP,et al. Allelome.PRO, a pipeline to define allele-specific genomic featuresfrom high-throughput sequencing data. Nucleic Acids Res Oxford UniversityPress. 2015;43(21):e164.15. van de Geijn B, McVicker G, Gilad Y, Pritchard JK. WASP: allele-specificsoftware for robust molecular quantitative trait locus discovery. Nat ChemBiol Nature Research. 2015;12(11):1061–3.16. Younesy H, Möller T, Heravi-Moussavi A, Cheng JB, Costello JF, Lorincz MC,et al. ALEA: a toolbox for allele-specific epigenomics analysis. BioinformaticsOxford University Press. 2013;30(8):1174.17. Adams DJ, Doran AG, Lilue J, Keane TM. The mouse genomes project: arepository of inbred laboratory mouse strain genomes. Mamm GenomeSpringer US. 2015;26(9–10):403–12.18. Krueger F, Andrews SR. Bismark: a flexible aligner and methylation callerfor bisulfite-Seq applications. Bioinformatics Oxford University Press.2011;27(11):1571–2.19. Keane TM, Goodstadt L, Danecek P, White MA. Mouse genomic variation andits effect on phenotypes and gene regulation. Nature. 2011;477(7364):289–94.20. Degner JF, Marioni JC, Pai AA, Pickrell JK, Nkadori E, Gilad Y, et al. Effect ofread-mapping biases on detecting allele-specific expression from RNA-sequencing data. Bioinformatics. 2009;25(24):3207–12.21. Keown CL, Berletch JB, Castanon R, Nery JR, Disteche CM, Ecker JR, et al.Allele-specific non-CG DNA methylation marks domains of active chromatinin female mouse brain. Proc Natl Acad Sci USA. National Acad. Sciences.2017;114(14):201611905–2890.22. Proudhon C, Duffié R, Ajjan S, Cowley M, Iranzo J, Carbajosa G, et al. Protectionagainst de novo methylation is instrumental in maintaining parent-of-originmethylation inherited from the gametes. Mol Cell. 2012;47(6):909–20.23. Greenberg MVC, Glaser J, Borsos M, Marjou FE, Walter M, Teissandier A, et al.Transient transcription in the early embryo sets an epigenetic state thatprograms postnatal growth. Nat Genet. 2016;49(1):110–8.24. Smith ZD, Chan MM, Mikkelsen TS, Gu H, Gnirke A, Regev A, et al. A uniqueregulatory phase of DNA methylation in the early mammalian embryo.Nature. 2012;484(7394):339–44.25. Shirane K, Toh H, Kobayashi H, Miura F, Chiba H, Ito T, et al. Mouse oocyteMethylomes at base resolution reveal genome-wide accumulation of non-Richard Albert et al. BMC Genomics  (2018) 19:463 Page 18 of 19CpG methylation and role of DNA methyltransferases. Bartolomei MS, editorPLoS Genet Public Library of Science. 2013;9(4):e1003439.26. Kobayashi H, Sakurai T, Imai M, Takahashi N, Fukuda A, Yayoi O, et al.Contribution of intragenic DNA methylation in mouse gametic DNAmethylomes to establish oocyte-specific heritable marks. Reik W, editorPLoS Genet 2012;8(1):e1002440.27. Leitch HG, McEwen KR, Turp A, Encheva V, Carroll T, Grabole N, et al. Naivepluripotency is associated with global DNA hypomethylation. Nat StructMol biol. Nature Research. 2013;20(3):311–6.28. Borgel J, Guibert S, Li Y, Chiba H, Schübeler D, Sasaki H, et al. Targets anddynamics of promoter DNA methylation during early mouse development.Nat Genet. 2010;42(12):1093–100.29. Kim D, Pertea G, Trapnell C, Pimentel H, Kelley R, Salzberg SL. TopHat2:accurate alignment of transcriptomes in the presence of insertions, deletionsand gene fusions. Genome Biology BioMed Central. 2013;14(4):R36.30. Dobin A, Davis CA, Schlesinger F, Drenkow J, Zaleski C, Jha S, et al. STAR:ultrafast universal RNA-seq aligner. Bioinformatics. Oxford University Press.2013;29(1):15–21.31. Križanovic K, Echchiki A, Roux J, Šikic M. Evaluation of tools for long readRNA-seq splice-aware alignment. Birol I, editor. Bioinformatics 2018;34(5):748–754.32. Raney BJ, Dreszer TR, Barber GP, Clawson H, Fujita PA, Wang T, et al. Trackdata hubs enable visualization of user-defined genome-wide annotationson the UCSC genome browser. Bioinformatics. 2014;30(7):1003–5.33. McEwen KR, Ferguson-Smith AC. Distinguishing epigenetic marks ofdevelopmental and imprinting regulation. Epigenetics Chromatin BioMedCentral. 2010;3(1):2.34. Maupetit-Méhouas S, Montibus B, Nury D, Tayama C, Wassef M, Kota SK, etal. Imprinting control regions (ICRs) are marked by mono-allelic bivalentchromatin when transcriptionally inactive. Nucleic Acids Res. 2016;44(2):621–35.35. Delaneau O, Coulonges C, Zagury J-F. Shape-IT: new rapid and accuratealgorithm for haplotype inference. BMC Bioinformatics. BioMed Central.2008;9(1):540.36. White CR, MacDonald WA, Mann MRW. Conservation of DNA methylationprogramming between mouse and human gametes and preimplantationembryos. Biol Reprod. 2016;95(3):61.37. Wang X, Clark AG. Using next-generation RNA sequencing to identifyimprinted genes. Heredity (Edinb). Nat Publ Group. 2014;113(2):156–66.38. Song L, Crawford GE. DNase-seq: a high-resolution technique for mappingactive gene regulatory elements across the genome from mammalian cells.Cold Spring Harb Protoc. Cold Spring Harbor Laboratory Press; 2010;2010(2):pdb.prot5384–pdb.prot5384.39. Buenrostro JD, Wu B, Chang HY, Greenleaf WJ. ATAC-seq: A Method forAssaying Chromatin Accessibility Genome-Wide, vol. 11. Hoboken: JohnWiley & Sons, Inc; 2001. 9 p40. McVean GA, Altshuler Co-Chair DM, Durbin Co-Chair RM, Abecasis GR,Bentley DR, Chakravarti A, et al. An integrated map of genetic variation from1,092 human genomes. Nature. 2012;491(7422):56–65.41. Chen Y-C, Gotea V, Margolin G, Elnitski L. Significant associations betweendriver gene mutations and DNA methylation alterations across many cancertypes. Fertig EJ, editor. PLoS Comput biol. Public Libr Sci; 2017;13(11):e1005840.42. Cheung WA, Shao X, Morin A, Siroux V, Kwan T, Ge B, et al. Functionalvariation in allelic methylomes underscores a strong genetic contributionand reveals novel epigenetic alterations in the human epigenome. Genomebiology. BioMed Central. 2017;18(1):50.43. Pastinen T. Genome-wide allele-specific analysis: insights into regulatoryvariation. Nat Rev Genet Nature Publishing Group. 2010;11(8):533–8.44. Quail MA, Smith M, Coupland P, Otto TD, Harris SR, Connor TR, et al. A taleof three next generation sequencing platforms: comparison of ion torrent,Pacific biosciences and Illumina MiSeq sequencers. BMC genomics. BioMedCentral. 2012;13(1):341.45. Kent WJ, Sugnet CW, Furey TS, Roskin KM, Pringle TH, Zahler AM, et al. Thehuman genome browser at UCSC. Genome research. Cold Spring HarborLab. 2002;12(6):996–1006.46. Younesy H, Möller T, Lorincz MC, Karimi MM, Jones SJM. VisRseq: R-basedvisual framework for analysis of sequencing data. BMC BioinformaticsBioMed Central Ltd. 2015;16(Suppl 11):S2.47. Zheng H, Huang B, Zhang B, Xiang Y, Du Z, Xu Q, et al. Resetting epigeneticmemory by reprogramming of histone modifications in mammals. Mol Cell.2016;63(6):1066–79.48. Wu J, Huang B, Chen H, Yin Q, Liu Y, Xiang Y, et al. The landscape ofaccessible chromatin in mammalian preimplantation embryos. Nature.2016;534(7609):652.49. Okae H, Chiba H, Hiura H, Hamada H, Sato A, Utsunomiya T, et al. Genome-wide analysis of DNA methylation dynamics during early human development.Oakey RJ, editor PLoS Genet Public Library of Science; 2014;10(12):e1004868.Richard Albert et al. BMC Genomics  (2018) 19:463 Page 19 of 19


Citation Scheme:


Citations by CSL (citeproc-js)

Usage Statistics



Customize your widget with the following options, then copy and paste the code below into the HTML of your page to embed this item in your website.
                            <div id="ubcOpenCollectionsWidgetDisplay">
                            <script id="ubcOpenCollectionsWidget"
                            async >
IIIF logo Our image viewer uses the IIIF 2.0 standard. To load this item in other compatible viewers, use this url:


Related Items