{"http:\/\/dx.doi.org\/10.14288\/1.0395957":{"http:\/\/vivoweb.org\/ontology\/core#departmentOrSchool":[{"value":"Medicine, Faculty of","type":"literal","lang":"en"},{"value":"Science, Faculty of","type":"literal","lang":"en"},{"value":"Other UBC","type":"literal","lang":"en"},{"value":"Non UBC","type":"literal","lang":"en"},{"value":"Medical Genetics, Department of","type":"literal","lang":"en"},{"value":"Microbiology and Immunology, Department of","type":"literal","lang":"en"}],"http:\/\/www.europeana.eu\/schemas\/edm\/dataProvider":[{"value":"DSpace","type":"literal","lang":"en"}],"https:\/\/open.library.ubc.ca\/terms#identifierCitation":[{"value":"Genome Biology. 2021 Feb 22;22(1):68","type":"literal","lang":"en"}],"http:\/\/purl.org\/dc\/terms\/contributor":[{"value":"Michael Smith Laboratories","type":"literal","lang":"en"}],"https:\/\/open.library.ubc.ca\/terms#rightsCopyright":[{"value":"The Author(s)","type":"literal","lang":"en"}],"http:\/\/purl.org\/dc\/terms\/creator":[{"value":"Akbari, Vahid","type":"literal","lang":"en"},{"value":"Garant, Jean-Michel","type":"literal","lang":"en"},{"value":"O\u2019Neill, Kieran","type":"literal","lang":"en"},{"value":"Pandoh, Pawan","type":"literal","lang":"en"},{"value":"Moore, Richard","type":"literal","lang":"en"},{"value":"Marra, Marco, 1966-","type":"literal","lang":"en"},{"value":"Hirst, Martin","type":"literal","lang":"en"},{"value":"Jones, Steven J. M.","type":"literal","lang":"en"}],"http:\/\/purl.org\/dc\/terms\/issued":[{"value":"2021-02-24T18:52:33Z","type":"literal","lang":"en"},{"value":"2021-02-22","type":"literal","lang":"en"}],"http:\/\/purl.org\/dc\/terms\/description":[{"value":"The ability of nanopore sequencing to simultaneously detect modified nucleotides while producing long reads makes it ideal for detecting and phasing allele-specific methylation. 
However, there is currently no complete software for detecting SNPs, phasing haplotypes, and mapping methylation to these from nanopore sequence data. Here, we present NanoMethPhase, a software tool to phase 5-methylcytosine from nanopore sequencing. We also present SNVoter, which can post-process nanopore SNV calls to improve accuracy in low coverage regions. Together, these tools can accurately detect allele-specific methylation genome-wide using nanopore sequence data with low coverage of about ten-fold redundancy.","type":"literal","lang":"en"}],"http:\/\/www.europeana.eu\/schemas\/edm\/aggregatedCHO":[{"value":"https:\/\/circle.library.ubc.ca\/rest\/handle\/2429\/77374?expand=metadata","type":"literal","lang":"en"}],"http:\/\/www.w3.org\/2009\/08\/skos-reference\/skos.html#note":[{"value":"SOFTWARE Open AccessMegabase-scale methylation phasing usingnanopore long reads and NanoMethPhaseVahid Akbari1,2, Jean-Michel Garant1, Kieran O\u2019Neill1, Pawan Pandoh1, Richard Moore1, Marco A. Marra1,2,Martin Hirst1,3 and Steven J. M. Jones1,2** Correspondence: sjones@bcgsc.ca1Canada\u2019s Michael Smith GenomeSciences Centre, BC Cancer,Vancouver, British Columbia,Canada2Department of Medical Genetics,University of British Columbia,Vancouver, British Columbia,CanadaFull list of author information isavailable at the end of the articleAbstractThe ability of nanopore sequencing to simultaneously detect modified nucleotideswhile producing long reads makes it ideal for detecting and phasing allele-specificmethylation. However, there is currently no complete software for detecting SNPs,phasing haplotypes, and mapping methylation to these from nanopore sequencedata. Here, we present NanoMethPhase, a software tool to phase 5-methylcytosinefrom nanopore sequencing. We also present SNVoter, which can post-processnanopore SNV calls to improve accuracy in low coverage regions. 
Together, thesetools can accurately detect allele-specific methylation genome-wide using nanoporesequence data with low coverage of about ten-fold redundancy.Keywords: Nanopore sequencing, Allele-specific methylation, Phasing, NanoMethPhaseIntroductionSomatic cells of diploid organisms comprise two alleles for each gene, and most genesare expressed from both alleles [1]. However, some genes only express from one allele,often in a lineage- or tissue-specific manner [1]. Various mechanisms can controlmono-allelic expression (MAE), including DNA polymorphisms at regulatory regions,and differential epigenetic modifications [1, 2]. Imprinting is a specific type of MAE inwhich the expressed allele is defined based upon the parent of origin through epigen-etic modifications, primarily DNA methylation at CpG sites located in imprinting con-trol regions (ICRs) [2]. In addition to imprinting, MAE can be a result of randommono-allelic expression (RME) where allelic choice randomly occurs somatically in atissue- and cell type-specific and non-parent of origin manner, and X chromosome in-activation (XCI) is a well-established RME [3]. Both RME and imprinting have sub-stantial roles in normal growth and development, behavior, and metabolism [3, 4].Aberrant DNA methylation at ICRs results in various developmental disorders and lossof imprinting is frequently observed in human tumors [2, 5, 6].Detection of allele-specific methylation (ASM) requires profiling both allele-specificSNPs and DNA methylation on the same or linked data. The current gold standardapproach to study DNA methylation is whole-genome bisulfite sequencing (WGBS)\u00a9 The Author(s). 
2021 Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http:\/\/creativecommons.org\/licenses\/by\/4.0\/. The Creative Commons Public Domain Dedication waiver (http:\/\/creativecommons.org\/publicdomain\/zero\/1.0\/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.\n\nAkbari et al. Genome Biology (2021) 22:68 https:\/\/doi.org\/10.1186\/s13059-021-02283-5\n\n[7, 8]. Discrimination of methylated and unmethylated CpGs in WGBS is based on bisulfite conversion. During bisulfite treatment, unmethylated cytosine is converted to uracil, which is then replaced by thymidine through downstream PCR reactions prior to short-read sequencing. Subsequent short-read sequencing and mapping of reads to the reference genome can detect DNA methylation genome-wide [7, 8]. However, short-read pairs typically cannot span adjacent allele-specific SNPs in regions of low variant density. Moreover, bisulfite conversion is a challenging molecular protocol, which introduces errors both from the harsh chemical treatment and the difficulty of mapping converted reads to the reference genome [7, 9]. 
Third-generation long-read sequencing provided by Oxford Nanopore Technologies (ONT) and Pacific Biosciences single-molecule real-time sequencing not only sequences DNA but can also detect DNA methylation through picoampere signal intensities and polymerase kinetics, respectively [10, 11]. Long reads produced by these technologies can span several kilobases and can resolve the phasing problem at regions of low SNP density. However, the higher sequencing error rate can impede the accurate detection of SNPs and the haplotype phasing of long reads. Additionally, Pacific Biosciences sequencing is prohibitively expensive as it requires very high coverage (250\u00d7 per strand) to confidently detect 5-methylcytosine, which makes it impractical on a mammalian-size genome [12].\n\nIn nanopore sequencing, as nucleic acids are propelled through a protein nanopore embedded in an electrically resistant membrane, the chemical composition of the approximately five nucleotides (5-mers) present in the narrowest region of the pore defines the measurable current signal across the membrane (pore version R9.4) [13]. These distinctive signal characteristics are interpretable by a trained artificial neural network during typical base calling. A similar approach is used to call 5-methylcytosine at CpG sites by tools such as Megalodon [14], Nanopolish [10], DeepSignal [15], DeepMod [16], and SignalAlign [17]. Therefore, in nanopore sequencing, both DNA sequences and modifications are detectable using raw signal information and there is no need for bisulfite conversion and PCR amplification prior to sequencing [16, 18]. Statistically based approaches, such as Nanoraw [19] and NanoMod [20], are also available to detect base modifications using comparative tests on signals from pairs of samples (e.g., wild-type and knockout). Statistically based approaches typically call all detectable base modifications between the samples without distinction of the specific type of modification [20]. 
Model-based approaches can potentially differentiate several types of modification, assuming the model used was trained accordingly. These sets of tools can be used together on the same signal data to generate both DNA sequence and CpG methylation, which is ideal for detecting ASM. Using these features, Gigante et al. [21] successfully used known SNPs from parental mice to phase methylation in the F1 and detect ASM. However, there is a lack of a straightforward workflow and software tools to phase both reads and methylation data from nanopore sequencing. More importantly, an approach which can phase nanopore reads using only nanopore sequencing, without having to detect SNVs using other platforms or known parental SNVs, has not been previously shown.\n\nHere, we have developed a workflow and associated software, SNVoter and NanoMethPhase, to detect ASM from a single sample using only nanopore sequence data with redundant sequence coverage as low as about 10\u00d7. We called SNVs from nanopore sequencing data using Clair [22]. Clair is designed to call germline small variants from nanopore reads based on pileup format, and the authors demonstrated its superiority over other pileup-based tools [22]. We demonstrated that we can improve SNV detection significantly by using SNVoter. Subsequently, we phased the SNVs detected from nanopore reads using WhatsHap v0.18 [23] and used our tool, NanoMethPhase, to phase both the sequence reads and the CpG methylation. Using a normal human B-lymphocyte cell line (NA19240), we demonstrated that NanoMethPhase can accurately detect ASM and parent of origin when trio data (i.e., known SNPs from mother, father, and child) are available. Moreover, NanoMethPhase detected ASM for this sample using only nanopore sequence, showing high concordance with the trio-based phasing. 
We further demonstrated the detection of ASM with about 10\u00d7 coverage using a Colo829BL B-lymphoblast cell line.\n\nResults\n\nBenchmarking of nanopore methylation calling software\n\nWe benchmarked three tools able to detect CpG methylation from nanopore sequencing using a pre-trained model: Nanopolish [10], Megalodon [14], and DeepSignal [15]. The CpG methylation calls obtained were compared to matching data from WGBS and Illumina Infinium HumanMethylation 27 BeadChip (henceforth, 27k methylation array). We used 12 MinION flow cells (~ 12\u00d7 coverage) of publicly available nanopore sequencing data for NA12878 [24] (Additional file 1) to compare with ENCODE WGBS data (ENCFF835NTC) and 27k methylation array [25] (GSM670984) data for this cell line. DeepSignal, Nanopolish, and Megalodon were able to call methylation for ~ 30M, ~ 29.7M, and ~ 29.2M CpG sites, respectively. Because the ENCODE WGBS workflow used an index generated from the GRCh38 assembly with alternative contigs removed, only called CpG sites on the main chromosomes (1\u201322 and X) were considered. DeepSignal, Nanopolish, Megalodon, and WGBS were able to call methylation for 28.827M, 28.825M, 28.186M, and 28.159M CpG sites on the main chromosomes, respectively (Fig. 1a). Nanopolish and DeepSignal showed a higher correlation with WGBS (Fig. 1b, 0.89 for Nanopolish and 0.90 for DeepSignal; Additional file 2: Fig. S1a-f) and 27k methylation array (Fig. 1c, 0.88 for Nanopolish and 0.89 for DeepSignal) compared to Megalodon (0.83 with WGBS and 0.81 with 27k array). The correlation with WGBS was also evaluated based on CpG density and at each genomic region (Additional file 2: Fig. S1g, h). These analyses showed that the correlation is lower at low- and high-density CpG sites (< 4% CpG per 2 kb and > 8% CpG per 2 kb), and that DeepSignal works better at low-density CpG sites while Nanopolish gives better correlation at high-density CpG regions. 
Moreover, the best correlation with WGBS is obtained at enhancers and promoters, while repeats and intergenic regions had the lowest correlation. In agreement with correlation analysis based on CpG density, Nanopolish showed slightly better correlation at CpG islands while DeepSignal worked better in other regions. The distribution of methylation levels obtained from Nanopolish and DeepSignal closely followed the trend of WGBS values across CpG islands (CGIs, Fig. 1d). They also outperformed Megalodon around transcription start (TSS) and end sites (TES) (Fig. 1e), with Nanopolish showing closer concordance with WGBS near TSS. Consequently, we selected Nanopolish as the most appropriate tool to call methylation, with the added benefits of faster processing time and fewer pre-processing steps.\n\nTo further validate methylation calling using Nanopolish, we compared methylation calling for NA19240 (ERR3046935 or run 1; Additional file 1) [26] with the 27k methylation array (GSM671076) [25]. The Pearson correlation between methylation frequencies from Nanopolish and beta-values from 27k methylation array data was 0.93 (Fig. 1f). The distribution of methylation calls showed the same close concordance across CGIs (Fig. 1g) and TSS-TES (Fig. 1h).\n\nAs shown in Fig. 1a, Nanopolish and DeepSignal called methylation at considerably more CpG sites compared to WGBS. Nanopolish and DeepSignal had ~ 945,000 and ~ 871,000 CpGs not present in WGBS, respectively, and ~ 843,000 of them were common to both Nanopolish and DeepSignal. Approximately 50% of these nanopore-specific\n\nFig. 1 Methylation calling from nanopore data and comparison to gold standard platforms. a UpSet plot of the intersections of CpG sites detected using DeepSignal, Nanopolish, Megalodon, and whole-genome bisulfite sequencing (WGBS) for NA12878. 
Because the ENCODE WGBS workflow used an index generated from the GRCh38 assembly with alternative contigs removed, only CpG methylations on main chromosomes (1\u201322 and X) were considered. b, c Pearson correlation matrix of methylation levels from the tools with WGBS (b) and Illumina\u2019s 27k methylation array (c). For comparison to WGBS, only CpGs with at least 5 calls were considered. The number represents common CpGs in all methods. d Distribution of methylation over CpG islands (CGIs). e Distribution of methylation at transcription start (TSS) and end sites (TES). f Scatter plot of methylation level obtained from Nanopolish and Illumina\u2019s 27k methylation array for NA19240 sample. Pearson correlation coefficient presented as r. g\u2013h Distribution of methylation over common CpGs between Nanopolish and 27k array at CpG islands (CGIs) (g) or transcription start (TSS) and end sites (TES) (h). Gold standard methods for CpG methylation detection are indicated by an asterisk\n\nCpGs mapped to satellite repeats (Additional file 2: Fig. S2a), which are complex repeats mostly found in centric and pericentric regions. Of satellite repeat types, approximately 98% of the nanopore-specific CpGs mapped to ALR\/Alpha repeats (Additional file 2: Fig. S2b). This demonstrates the advantages of nanopore sequencing in mapping to complex repeat regions.\n\nHigh false positives in variant calling results\n\nSNPs are the most commonly occurring variants and are routinely used to phase reads between haplotypes. Thus, to phase nanopore reads and CpG methylation, detection of SNPs is first required. In order to evaluate single-sample variant calling from nanopore data, we used Clair [22] to call SNVs on 20 flow cells of the NA12878 sample (24\u00d7; Additional file 1) described above and compared these calls to gold standard SNVs from the Genome in a Bottle database (GIAB, V3-3-2) [27]. 
Clair detected 6.712M SNVs, 3.711M of which exceeded the variant call quality filtering threshold of 730 (Additional file 2: Fig. S3a). We then counted false positives (943,304), false negatives (223,581), true positives (2,768,084), and true negatives (2,777,463) to calculate accuracy, precision, recall, and F1 score, measured at 82.6%, 74.6%, 92.5%, and 82.6%, respectively. The majority (90%) of true SNVs were determined in Clair as high-quality SNV calls. However, there were still a considerable number of false positives which can confound downstream analysis such as phasing. Therefore, we sought to improve the accuracy and the precision and reduce the false-positive rate as much as possible while retaining true positives.\n\nBase qualities and mutation frequencies are informative\n\nStudies have demonstrated that in nanopore sequencing the signal distribution is driven mostly by the five to six bases that occupy the narrowest part of the pore [28]. During the base-calling process, base callers segment signals and translate them to the appropriate 5-mer using a pre-trained model. We hypothesized that if an SNV is a false call, the base caller must have been mistaken in calling the 5-mer, and therefore, additional sequencing errors should be found in the 5-mer windows adjacent to that SNV. Moreover, the Phred quality of called bases should be affected. To test this, we extracted Phred scores, mismatch, deletion, and insertion frequencies for SNVs where the Clair output agreed with the GIAB standard (true positives), and compared these with the false-positive SNVs. Indeed, higher mutation frequencies and lower base quality scores were observed in the adjacent 5-mer windows of false SNVs compared to true SNVs (Fig. 2a). Moreover, in a PCA analysis, they separated into two distinct groups where mismatch frequencies and base qualities seemed to have the highest effect on discriminating true-positive SNVs from false positives (Fig. 2b). 
These results demonstrate that base qualities and mutationerrors in 5-mer windows are informative and can be used to further discriminate true pos-itives and false positives in order to improve SNV calling from nanopore data.A neural network SNV classifier improves Clair SNV callsWe then used base quality and mutation frequencies to train an artificial neural net-work to discriminate true-positive SNVs from false calls (see the \u201cMaterials andmethods\u201d section), which we packaged into a tool we called SNVoter. Three differentAkbari et al. Genome Biology           (2021) 22:68 Page 5 of 21models were trained using various fold coverages from three datasets includingNA12878 [24] (20 flow cells, 24\u00d7), NA12878 [24] (all flow cells, 44\u00d7), and HG003 [29](80\u00d7). See additional file 1 for a full description of datasets used in our study. SNVoterclassifies each 5-mer that includes an SNV. Since each base represents itself in five 5-mers (e.g., ATCGACGTC will be represented in ATCGA, TCGAC, CGACG, GACGT,and ACGTC), the final prediction for each SNV is an average of all five predictions. Fi-nally, we used the predictions from our classifier as weights, and for each SNV, weFig. 2 Improvement in Clair SNV calls from nanopore data. a Base quality and mutation frequenciesobtained for 1 million randomly selected 5-mers with SNV in the base 3 position (1:1 true positives to falsepositives). b PCA analysis of the reference 5-mer GTACT shows separation of true positive from falsepositives based on quality scores and mismatch frequencies. c Clair variant calling quality distribution forthe NA19240 run 0 sample. d Quality distribution upon normalization of Clair\u2019s qualities using the weightsgiven by SNVoter to each SNV. The highlighted region represents the optimal threshold area to filter outlow-quality calls. e Receiver operating characteristic curves for SNV calling using Clair or using Clair+SNVoterfor different coverage depths. 
NA19240 run 1, NA19240 run 0, and Colo829BL are processed by SNVoter using the model trained on NA12878 20FCs (24\u00d7). NA19240 runs 0&1 and NA19240 runs 1&2 are processed using the model trained on the NA12878 whole dataset (44\u00d7)\n\nmultiplied the variant call quality from Clair with its weight from our model to produce normalized quality scores.\n\nTo test the classifier and to obtain the optimal quality threshold for normalized base quality scores, we used the NA19240 run 0 (whole dataset includes five PromethION runs, Additional file 1) basecalled nanopore sequencing dataset [26] (ERR3219853, 18\u00d7 coverage) and variant calling data from the 1000 Genomes Project (1KGP) phase3 [30]. As presented in Fig. 2c and d, there is an obvious shift toward the low-quality region of the quality distribution plot after using SNVoter. This is largely due to the false-positive SNVs with high variant call quality from Clair that were assigned low weights by SNVoter. We plotted receiver operating characteristic (ROC) curves across a range of thresholds for normalized quality and used these to obtain the new optimal threshold (Fig. 2e). The ROC curve analysis demonstrates an improvement in discriminating true-positive SNV calls from false positives, and we determined the new optimal threshold to be the end of the first peak and the start of the valley (highlighted region in Fig. 2d; this threshold was further confirmed by the additional datasets described below). Other important metrics including accuracy, precision, recall, and F1 score are presented in Table 1 (for a complete table of different thresholds, see Additional file 3).\n\nFinally, to comprehensively test SNVoter on different coverages, we used a Colo829BL sample which we sequenced (~10\u00d7), NA19240 run 1 [26] (ERR3219854 or ERR3046935. ~22\u00d7), NA19240 runs 1&2 (ERR3219854-5. ~32\u00d7), NA19240 runs 0&1 (ERR3219853-4. ~40\u00d7), and NA19240 4 runs (ERR3219854-7. 
~65\u00d7). As presented in Fig. 2eand in Table 1, SNVoter can significantly improve SNV calls in data with up to 30\u00d7 coverage(Clair variant call quality and normalized quality distributions are shown in Additional file 2:Fig. S3c). We used different models and random down-sampling of coverage for NA192404 runs (ERR3219854-7. ~ 65\u00d7), which did not provide any improvement when using SNVoter(Additional file 2: Fig. S3b). Overall, using SNVoter, we could raise the accuracy, precision, re-call, and F1 using low coverage data to be comparable to that from high coverage data.Phasing of SNVs detected by nanopore sequencingAfter normalizing Clair qualities using SNVoter and filtering out low-quality calls basedon an optimized threshold (Table 1; Additional file 2: Fig. S3c) for NA19240 run 1 andTable 1 Different metrics for SNV calls using Clair and Clair + SNVoterTool Sample QT All TP TP FP Acc Pre Rec F1C Colo829BL 725 3,638,416 2,664,923 774,123 0.88 0.77 0.79 0.78C+S Colo829BL 160 3,638,416 2,851,683 336,594 0.93 0.89 0.84 0.87C NA19240 Run0 755 3,973,249 2,997,917 1,462,313 0.86 0.67 0.81 0.74C+S NA19240 Run0 170 3,973,249 3,160,297 956,846 0.90 0.77 0.86 0.81C NA19240 Run1 780 3,973,249 3,465,350 990,793 0.84 0.78 0.92 0.84C+S NA19240 Run1 350 3,973,249 3,436,275 658,067 0.88 0.84 0.91 0.87C NA19240 runs 1&2 800 3,973,249 3,644,703 812,335 0.84 0.82 0.96 0.88C+S NA19240 runs 1&2 180 3,973,249 3,595,674 683,727 0.86 0.84 0.94 0.89C NA19240 runs 0&1 820 3,973,249 3,639,922 750,137 0.85 0.83 0.95 0.89C+S NA19240 runs 0&1 170 3,973,249 3,592,467 653,594 0.85 0.85 0.94 0.89C Clair, S SNVoter, QT quality threshold, TP true positive, FP false positive, Acc accuracy, Pre precision, Rec recall. All TPrefers to all high-quality SNVs from 1KGP for NA19240 and from Strelka for Colo829BL Illumina sequencingAkbari et al. Genome Biology           (2021) 22:68 Page 7 of 21Colo829BL samples, these SNVs were leveraged to phase nanopore reads. 
We phased SNVs using WhatsHap [31]. In the Colo829BL sample, ~ 37.5% of haplotype blocks were > 1Mb (mean = 1116.5Kb, median = 588.3Kb), and in the NA19240 sample, ~ 9.25% of haplotype blocks were > 1Mb (mean = 315.9Kb, median = 82.2Kb) (Fig. 3a, b). While the median read lengths were 8Kb and 12Kb for Colo829BL and NA19240 run 1, respectively, Colo829BL had numerous large reads. For example, 23% (678,900) of Colo829BL reads were > 20Kb while in NA19240 8.8% (513,373) of reads were > 20Kb, which resulted in a higher average read length in Colo829BL compared to NA19240 (Additional file 2: Fig. S4a and S4b). This can explain the numerous larger haplotype blocks in Colo829BL and highlights the value of longer reads, specifically as a proportion of all reads, for obtaining longer haplotype blocks.\n\nNanoMethPhase detects allele-specific methylation\n\nThe phased SNVs, alignment file, methylation call file from Nanopolish, and the reference genome are supplied to our tool, NanoMethPhase, to phase reads and CpG methylations (Fig. 3c; the \u201cMaterials and methods\u201d section). In NA19240 run 1, out of 4.165M reads which were tagged with at least one phased SNV, 1.883M were assigned to the first haplotype (reference haplotype, HP1) and 1.871M were assigned to the second haplotype (alternative haplotype, HP2) (Fig. 3d). In the Colo829BL sample, out of 1.9M reads which were tagged with at least one phased SNV, 0.743M reads were assigned to HP1 and 0.738M reads were assigned to HP2 (Fig. 3f). In terms of speed, using 48 CPU cores, it took 10 h for NanoMethPhase to process the NA19240 run 1 sample (22\u00d7) and 5 h for the Colo829BL sample (10\u00d7).\n\nWe used two approaches to evaluate the phasing results. Firstly, we used trio data (parental and child variants) for NA19240 from 1KGP [30] to create a mock phased vcf file to use as input for NanoMethPhase (software code in the repository https:\/\/github.com\/vahidAK\/NanoMethPhase). 
Out of 4.075M reads assigned to at least one SNV, 1.802M reads were assigned to maternal and 1.797M reads to paternal haplotypes (Fig. 3e). 3.464M reads (96.3% of phased reads using trio data and 92.3% of phased reads using nanopore sequencing alone) were congruent between trio and nanopore phasing alone.\n\nTrio phasing itself was confirmed by examining the methylation status of paternal and maternal haplotypes at established and putative ICRs. This includes 43 known regions, as well as 14 novel regions from Court et al. [32] and 34 novel regions from Joshi et al. [33] (Additional file 4). As presented in Fig. 4a and b (first two heatmap columns and Additional file 4), NanoMethPhase correctly detected ASM using trio data. The differences in methylation level and correct parental origin are captured at known and novel ICRs. The only obvious inconsistency with the reported parent of origin was DLX2-AS1 from Joshi et al., but this ICR is only supported by four CpGs. Joshi et al. used a 450k methylation array to study ASM and parent of origin in uniparental disomic subjects [33], and ~ 90% of their reported ICRs are supported by fewer than seven probes within differentially methylated regions (DMRs). Court et al. used WGBS and high-density methylation microarrays to study ASM [32]. We then compared the phased status of reads from the trio to nanopore sequencing alone. The human genome reference is not completely contiguous, and it is represented by chromosome scaffolds with gaps of\n\nFig. 3 NanoMethPhase workflow and read phasing. a, b Haplotype block sizes following phasing of NA19240 and Colo829BL detected high-quality SNVs using WhatsHap. c NanoMethPhase workflow representing inputs, processing steps, and outputs. The output options can be requested independently to fit the needs. 
d\u2013f Number of reads that were phased, filtered out, or could not be assigned to any phased SNV (left panel) and their length distribution (right panel, for ease in visualization reads with length < 50 kb are shown). d Obtained from NA19240 run 1 using nanopore phasing alone, e NA19240 run 1 trio phasing, and f Colo829BL sample. *NanoMethPhase phasing step ignores duplicated, QC failed, unmapped, and secondary reads. Supplementary reads are also excluded by default but can be included as an optional parameter. The plots represent reads using default parameters\n\nunknown sequence (represented by \u201cN\u201d in the reference). Moreover, the sequence coverage across the genome is not evenly distributed and there are regions lacking reads. Therefore, when using SNVs from a single sample to phase reads for that sample, we would not expect consistent correct haplotype assignment (i.e., all reads of maternal origin always being assigned to HP1 or HP2, and vice versa). Rather, we would expect reads mapped to a region to group together (e.g., reads that are from the maternal haplotype for a given region all being designated either HP1 or HP2 at that region).\n\nFig. 4 Methylation levels and phased CpGs at human ICRs. a, b Methylation levels of phased CpGs presented with haplotypes of origin at reported ICRs as heatmaps. a CpGs mapped to known ICRs. b CpGs mapped to novel ICRs from Court et al. and Joshi et al. The heatmap colors represent the mean of methylation at the regions. Origin bar indicates known or reported origin from previous studies, and heatmap column labels represent assigned haplotype by NanoMethPhase. In trio phasing, Pat stands for paternal and Mat for maternal. c, d Integrative Genomics Viewer screen captures of phased bam files converted to mock WGBS format for samples NA19240 run 1 and Colo829BL at two well-known ICRs\n\n
Therefore, to investigate this, we sampled reads at every 10 kb for trio phasing and nanopore phasing alone and we kept regions with more than two phased reads on one or both haplotypes. We then compared the read haplotype assignment at each 10-kb region to check whether reads that belong to the paternal or maternal haplotype at a given position in trio phasing group together as HP1 or HP2 in nanopore phasing alone. We examined 3.3M reads at 236,478 genomic positions. Of these, only 54,592 reads (1.66%) were incorrectly phased in nanopore phasing alone compared to the trio phasing.\n\nSecondly, we investigated the methylation status of phased haplotypes at known ICRs for Colo829BL and NA19240 run 1. In both samples, NanoMethPhase recapitulated methylation differences at known and novel ICRs (Fig. 4a, b; the last four heatmap columns and Additional file 4), although, as mentioned earlier, haplotype switches are frequent and haplotype assignment is not consistent. We also used Integrative Genomics Viewer [34] to visualize mock bisulfite-converted phased bam files output by NanoMethPhase. These views show accurate ASM at two well-known ICRs (Fig. 4c, d). SNRPN and KvDMR1 aberrant imprinting are involved in Prader-Willi\/Angelman and Beckwith\u2013Wiedemann syndromes, respectively [35].\n\nDifferentially methylated regions on chromosome X map to known inactive genes\n\nTo investigate DMRs genome-wide, we performed differential methylation analysis (DMA) using the default parameters of the dma module in our tool, which uses the Dispersion Shrinkage for Sequencing data (DSS) [36] R Bioconductor package to find DMRs. We detected 2205 DMRs in NA19240 run 1 nanopore phasing alone and 2109 in trio phasing (Fig. 5a; Additional file 5). Ninety-three percent (1964 DMRs) of DMRs in the trio phasing overlapped with DMRs from nanopore phasing alone. In the male Colo829BL sample, we detected 854 DMRs (Fig. 5a; Additional file 5).\n\nFig. 5 Differentially methylated regions mapping and imprinted genes. 
a Number of DMRs detected on each chromosome in Colo829BL, NA19240 run 1 nanopore phasing alone, and NA19240 run 1 trio phasing. The numerous DMRs on the X chromosome of the NA19240 cell line are explained by its X chromosome inactivation. b DMRs mapped to 4 Mb upstream and downstream of known, predicted, conflicting, and provisional imprinted genes from the GeneImprint and catalog of human imprinted genes databases. NA19240 NA stands for NA19240 nanopore phasing alone.
NA19240 is a female B-lymphocyte cell line and therefore presents XCI [3]. Several genes escape XCI (i.e., they are expressed from both alleles) and are mostly responsible for sex-specific characteristics in women, while most genes are inactivated and display MAE [37]. CpG methylation is an important mechanism through which cells can achieve and maintain XCI. Not surprisingly, ~ 40% of all DMRs in NA19240 mapped to chromosome X. We assessed whether DMRs detected by NanoMethPhase mapped to any escapee genes. To this end, we collected data from three previous studies [37\u201339] on genic XCI, which include 371, 204, 75, and 71 inactive, variable, escapee, and unknown genes, respectively (Additional file 6). Given that methylation at the promoter region has a direct effect on gene silencing, we mapped DMRs to the region from 1 kb upstream to 200 bp downstream of each TSS. DMRs mapped to inactive genes in a greater proportion than to escapee genes (Table 2; Additional file 7). The presence of methylated CpGs in gene bodies is associated with active genes [40]. Therefore, we would also expect to observe DMRs mapping to the gene bodies of inactive genes, because only one of the alleles is active. We mapped DMRs to the gene body of the 721 genes. 
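The promoter window described above (1 kb upstream to 200 bp downstream of a TSS) can be sketched as a simple interval test. This is a minimal illustration only, not the code used in the study: the function names are hypothetical, coordinates are assumed to be 0-based and half-open, and orienting the window by gene strand is an assumption that the text does not state.

```python
# Minimal sketch; all names are hypothetical and not part of NanoMethPhase.
UPSTREAM = 1000   # 1 kb upstream of the TSS (value from the text)
DOWNSTREAM = 200  # 200 bp downstream of the TSS (value from the text)


def promoter_window(tss, strand):
    # Assumption: upstream and downstream are defined relative to the gene strand.
    if strand == '+':
        start, end = tss - UPSTREAM, tss + DOWNSTREAM
    elif strand == '-':
        start, end = tss - DOWNSTREAM, tss + UPSTREAM
    else:
        raise ValueError('strand must be + or -')
    # Clamp at the chromosome start.
    return max(0, start), end


def overlaps(a_start, a_end, b_start, b_end):
    # Half-open intervals overlap when each one starts before the other ends.
    return a_start < b_end and b_start < a_end


def dmr_hits_promoter(dmr_start, dmr_end, tss, strand):
    p_start, p_end = promoter_window(tss, strand)
    return overlaps(dmr_start, dmr_end, p_start, p_end)
```

A DMR is counted here as mapping to a promoter if it overlaps the window by at least one base; a stricter overlap requirement would be an equally reasonable choice.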
DMRs mapped significantly more often to inactive genes than to escapee genes (Table 2; Additional file 7), with the expected methylation direction for genes with DMRs mapped to both their promoter and body (i.e., inactive genes with hypermethylation at the promoter mostly displayed hypomethylation at the gene body and vice versa). It is also worth mentioning that the inactive state of the majority of inactive genes with a mapped DMR was supported by at least two of the studies, while the majority of escapee genes were supported by only one of the studies.
Table 2 Mapping DMRs from NA19240 run 1 to genes from chromosome X
Gene category; Sample; Inactive; Escapee; Variable; Unknown; Total*
# of DMRs to gene body; NA**; 396; 16; 89; 28; 519
# of DMRs to gene body; Trio; 380; 17; 86; 26; 503
# of genes with DMR in body; NA; 154; 9; 48; 15; 226
# of genes with DMR in body; Trio; 151; 10; 50; 15; 226
# of DMRs to promoter; NA; 178; 8; 68; 19; 264
# of DMRs to promoter; Trio; 175; 9; 60; 21; 257
# of genes with DMR in promoter; NA; 190; 8; 69; 19; 286
# of genes with DMR in promoter; Trio; 187; 9; 61; 20; 277
# of genes with DMR in body and promoter; NA; 87; 2; 19; 5; 113
# of genes with DMR in body and promoter; Trio; 88; 2; 19; 6; 115
*All numbers are counts of unique DMRs or genes. **NA nanopore alone, phasing reads using only nanopore sequencing for a single sample
Autosomal DMRs mapped to known ICRs and known imprinted genes
We then mapped autosomal DMRs to known and novel ICRs from Court et al. [32] and Joshi et al. [33] (Table 3; Additional file 8). In addition to these ICRs, we mapped DMRs to 166 novel DMRs from Zink et al. [41], of which 83 are on chr15 and 83 are on other chromosomes (Table 3; Additional file 8). Overall, DMRs mapped to 60% of known ICRs, to novel ICRs from Court et al., and to several novel ICRs from Zink et al. and Joshi et al. (Table 3). Moreover, the parent of origin identified from NA19240 trio phasing for DMRs that mapped to ICRs was consistent with the reported parental origin except for DLX2-AS1 from Joshi et al. (see the section \u201cNanoMethPhase detects allele-specific methylation\u201d) and RP11-33B1.1 from Zink et al. 
RP11-33B1.1 was reported with a modest increase in methylation from the maternal allele (0.17) in Zink et al. The consistency in the parent of origin of DMRs further highlights the accurate phasing of CpG methylation and parent-of-origin detection using NanoMethPhase and trio information.
Most imprinted genes are located in clusters which can span up to approximately 4 Mb and are controlled by their ICR [42, 43]. Therefore, we investigated the distance of DMRs to known imprinted genes, those predicted to be imprinted, and those genes with conflicting or provisional data. We gathered a list of known, predicted, and conflicting imprinted genes from the GeneImprint (http:\/\/www.geneimprint.com\/) and catalog of imprinted genes (http:\/\/igc.otago.ac.nz\/) [44] databases, which include 107 imprinted, 103 predicted, 14 conflicting, and 6 provisional genes (Additional file 9). We mapped DMRs to regions spanning 4 Mb upstream and downstream of these genes (Fig. 5b; Additional file 10). Fifty percent of autosomal DMRs in NA19240 run 1 and 45% in Colo829BL mapped to this window. Eighty percent of known imprinted genes in NA19240 run 1 and 74% in Colo829BL had at least one DMR mapped within 1 Mb of the gene boundaries.
The best correlation between nanopore CpG methylation calls and WGBS was obtained at regions with a CpG ratio of 4\u20138% (number of CpGs\/region length; Additional file 2: Fig. S1g). We noticed that known ICRs with a CpG ratio of 4\u20138% tended to be detected more often. Overall, 28 of the known ICRs have a CpG ratio of 4\u20138% and 15 have a ratio of less than 4% or more than 8%. In Colo829BL, 26 known ICRs were detected, of which 21 have a 4\u20138% CpG ratio. In NA19240, out of 28 detected known ICRs, 21 have a 4\u20138% CpG ratio. However, in the HG002 sample (~50\u00d7, 
Ashkenazi son; see below and Additional file 1: Section 5), which has higher coverage compared to NA19240 run 1 and Colo829BL, out of 38 detected known ICRs, 26 have a 4\u20138% CpG ratio.
As ONT frequently releases new versions of Guppy with claimed higher accuracy, we aimed to investigate the effect of a new version of the basecaller on ASM detection (Additional file 1: Section 4). We re-basecalled Colo829BL and NA19240 run 1 using Guppy v4.2.2. Even though more true-positive SNVs were detected at the optimal threshold, more false positives were also called (Table 1; Additional file 1: Table S2). Therefore, no overall improvement in SNV detection was observed. However, using Guppy v4.2.2, the detection of ICRs slightly improved and we could detect one more reported ICR [32, 33, 41] in NA19240 and 6 more in Colo829BL (Table 3; Additional file 1: Table S3) and, on average, about 82% of DMRs detected with Guppy v4.2.2 overlapped with DMRs detected with the older Guppy version.
Table 3 Mapping DMRs to ICRs
Sample; DMRs mapped; Known; Novel Court; Novel Joshi; Novel Zink
NA19240 NA*; 74; 28; 8; 5; 24 (6 on chr15)
NA19240 Trio; 69; 26; 8; 4; 22 (6 on chr15)
Colo829BL; 60; 26; 10; 5; 16 (4 on chr15)
*NA nanopore alone, phasing reads using only nanopore sequencing for a single sample
In NA19240 trio phasing, the SNV calls are obtained from short-read data. To further investigate ASM detection in a trio where SNV calls originate from nanopore sequencing, we used nanopore data for the GIAB Ashkenazi trio [29] samples, including son (HG002), father (HG003), and mother (HG004) (Additional file 1: Section 5). This analysis demonstrated the detection of most of the known ICRs (86%) (Additional file 1: Table S4). 
Moreover, the DMRs and phased reads obtained when phasing with SNVs detected from nanopore data for all samples in the trio showed high overlap with those obtained when phasing with SNVs detected from short-read sequencing (high-confidence variant calls obtained from short-read sequencing for the Ashkenazi trio are also available in GIAB).
Discussion
ASM is involved in various processes such as development and tissue differentiation. Its dysregulation can result in developmental disorders and promote cancer [2, 5, 6]. Short-read sequencing coupled with bisulfite treatment is used for the detection of ASM. However, the length of reads and the complexity and biases introduced by bisulfite treatment can impede its investigation, especially in regions of low SNP density. In these regions, reads can be too short to span multiple SNPs, preventing their association in a contiguous haplotype. This conceptual limit is not shared by long-read sequencing via nanopore. The span of the reads reduces the risk of them not reaching enough SNP positions. It also enables the filtering of low-quality SNPs, reducing the presence of erroneous SNPs with a minimal impact on the number of reads that span multiple SNPs. Long-read sequencing powered by ONT is also applicable to the detection of mono-allelic methylation, as it offers detection of both DNA bases and their modifications. A mono-allelic bisulfite-converted C is indistinguishable from a C-to-T SNP present in the sample when using WGBS alone [8]. The signal from the pore can be used to obtain the sequence confirmation that would otherwise require another input for the SNPs present in the sample, a feature that can also be leveraged to investigate cancers presenting a loss of heterozygosity [45].
In this study, we benchmarked methylation calling tools for nanopore sequence data, improved SNV calling for lower coverage data, and detected ASM. 
We developed the required tools, SNVoter and NanoMethPhase, which allow users to detect ASM using a command-line interface. We demonstrated that nanopore methylation calls are concordant with gold standard platforms (Fig. 1; Additional file 2: Fig. S1) and that this technology is advantageous in highly repetitive regions, particularly at highly repetitive ALR\/Alpha satellites (Additional file 2: Fig. S2), whose expression has been shown to be enriched in cancer [46]. A drawback of nanopore sequencing is the high cost, which increases dramatically when several flow cells need to be run to obtain adequate coverage. We determined that our workflow is capable of detecting ASM using low coverage (~ 10\u00d7) of nanopore data from a single PromethION flow cell run (one PromethION run typically provides a coverage of 10\u201325\u00d7 using the r9.4 pore). Therefore, the usage of NanoMethPhase is advised to leverage the depth of the long-read data. We demonstrated detection of ASM and parent of origin using trio information (Fig. 4). We were able to detect ASM by nanopore sequencing exclusively from a single sample, which was concordant with trio phasing, although the parent-of-origin labeling is inconsistent between non-overlapping haplotyped regions (Fig. 4).
Minimap2 [47] is a widely used aligner for nanopore long-read data; we therefore also investigated whether SNV detection could be improved using Winnowmap [48], a recently developed aligner for nanopore long reads. We compared SNV detection for Colo829BL aligned to hg38 using Winnowmap with that using Minimap2, and no improvement was observed (Additional file 2: Fig. S8).
ASMs on chromosome X were detected by NanoMethPhase in the NA19240 cell line. They mostly mapped to known inactive genes on chromosome X (Table 2; Additional file 7). 
Previous studies mostly relied on transcriptome analysis to investigate genic XCI [37\u201339], but this provides indirect evidence of the inactivation rather than detecting the actual mechanism of inactivation. Our approach can be applied, along with expression-based approaches, to further investigate genic XCI and CpG methylation as a potential mechanism for gene inactivation. We showed that autosomal DMRs detected by NanoMethPhase mapped to most of the known ICRs and several other novel ICRs from previous studies [32, 33, 41] (Fig. 4 and Table 3, Additional file 4 and Additional file 8). The majority of known imprinted genes were also detected with one or more autosomal DMRs mapped inside or in close vicinity to the gene (Fig. 5; Additional file 10). Numerous autosomal DMRs, which did not map to any known ICRs, mapped to the close vicinity of known imprinted genes (Additional file 10). This was also observed in previous studies [21, 41], and these DMRs could be secondary ICRs (also known as somatic DMRs), which are usually regulated by nearby germline or primary ICRs and established post-fertilization [21, 41, 49]. In addition to imprinting, a proportion of ASM might result in random MAE. Approximately 22% of autosomal DMRs from NA19240 and 18% from Colo829BL mapped to the gene body or the promoter of 251 and 143 MAE genes from Savova et al., respectively [50] (Additional file 11).
We noticed that more than 90% of DMRs mapped to ICRs and promoters of genes on chromosome X and more than 80% of DMRs that mapped near imprinted genes had area statistics (the sum of the test statistics of all CpG sites within the DMR) \u2265 |100|, while approximately 40% of DMRs in each sample had area statistics < |100|. This is consistent with the fact that DMRs with higher area statistics are more likely to be true positives, and demonstrates the positive impact of filtering the numerous detected DMRs. 
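The area-statistic filter discussed above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not part of NanoMethPhase: the function name is hypothetical, the example coordinates and values are invented for demonstration, and the key areaStat is assumed to mirror the column name in the DMR table returned by DSS.

```python
# Minimal sketch; filter_dmrs is a hypothetical helper, and the records below
# are invented example values, not results from the study.
AREA_STAT_CUTOFF = 100.0  # absolute area-statistic threshold discussed in the text


def filter_dmrs(dmrs, cutoff=AREA_STAT_CUTOFF):
    # Keep DMRs whose area statistic is at least the cutoff in absolute value,
    # so that strongly hyper- and hypomethylated regions are both retained.
    return [d for d in dmrs if abs(d['areaStat']) >= cutoff]


example_dmrs = [
    {'chr': 'chr15', 'start': 1000, 'end': 1900, 'areaStat': -350.2},
    {'chr': 'chr11', 'start': 5000, 'end': 5400, 'areaStat': 42.7},
    {'chr': 'chrX', 'start': 8000, 'end': 8800, 'areaStat': 100.0},
]

kept = filter_dmrs(example_dmrs)
```

The sign of the area statistic reflects the direction of the difference between the two haplotypes, which is why the comparison uses the absolute value.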
Several regions were lacking high-quality mapped reads in the original alignment file and phased alignment results. For example, five to seven of the known and novel ICRs reported by Court et al. [32] and Joshi et al. [33] lacked mapped reads in the original alignment file and\/or the phased alignment results (Fig. 4a, b; Additional file 4). Moreover, there were several ICRs with moderate differences in methylation between haplotypes. For example, approximately 15 to 18 ICRs showed a 0.1\u20130.3 delta in mean methylation, while none of them was captured by the DMA. The Nanopolish algorithm assigns the same log-likelihood ratio to all CpGs within less than 11 bp of each other. Therefore, improvements in methylation calling can improve ASM detection, specifically at dense CpG sites. Although final methylation frequencies obtained by nanopore are similar to those from WGBS, and tools that were developed for WGBS DMA are expected to be applicable to nanopore, it is advantageous to have dedicated software to perform DMA from nanopore data. As discussed by Gigante et al. [21], WGBS calls are binary while nanopore tools output continuous predictions or log-likelihood ratios. Algorithms that could leverage non-discrete data present an opportunity to improve DMA. We also noticed that averaging CpG methylation in genomic bins significantly improves correlation with WGBS, even when disregarding the coverage filters (Additional file 2: Fig. S5). This suggests that a sliding window approach might be beneficial for nanopore DMA.
Materials and methods
Nanopore sequencing and datasets
Nanopore sequencing data for the NA19240 [26], NA12878 [24], and Ashkenazi trio [29] human cell lines are publicly available. 
A complete description of the datasets, their base calling, mapping, and usage in our study is provided in Additional file 1, along with links to the sources.
We also sequenced the Colo829BL B-lymphoblast cell line using one nanopore PromethION flow cell and Illumina paired-end sequencing at 30\u00d7 coverage. A complete description of the nanopore and Illumina sequencing protocols and the data obtained is also provided in Additional file 1.
CpG methylation calling from nanopore data
To call CpG methylation, we benchmarked three model-based approaches: Nanopolish [10], Megalodon [14], and DeepSignal [15]. Nanopolish uses a hidden Markov model to call CpG methylation from raw nanopore data, while Megalodon and DeepSignal use neural networks. We called CpG methylation using these tools (with the default parameters) for 12 flow cells of publicly available NA12878 data (Additional file 1) and compared the results with WGBS data from the ENCODE project (ENCFF835NTC) [51] and the Human Methylation 27 (27k) array from Fraser et al. [25].
Variant calling
We used Clair to call SNVs [22]. We called variants for each chromosome using clair.py callVarBam --threshold 0.2 and the HG122HD34 model. Indels were filtered out. To evaluate variant calling, we compared SNVs called by Clair from nanopore data to those from 1KGP phase 3 [30] (GRCh37 coordinates). Clair\u2019s variant calls were lifted over to GRCh37 human reference genome coordinates using CrossMap [52] for comparison to 1KGP data. For our in-house Colo829BL sample, we compared Clair variant calls to Strelka [53] v2.9.10 calls made from paired-end Illumina reads (Additional file 1).
Model training to improve SNV calling
We calculated average qualities and mutation frequencies for each position of each 5-mer window containing an SNV. Mutation frequencies were calculated as the number of instances over coverage for each genomic position in the 5-mer window. 
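The per-position features described for each 5-mer window can be illustrated with a short sketch. This is not SNVoter code: the function names and the feature layout are hypothetical, and reading the number of instances as the count of reads supporting a non-reference base at that position is an assumption.

```python
# Minimal sketch; names and feature ordering are hypothetical, not SNVoter's.
WINDOW = 5  # 5-mer window containing the SNV (from the text)


def mutation_frequency(alt_count, coverage):
    # Assumption: instances = reads supporting a non-reference base here.
    # Positions without coverage get a frequency of zero.
    if coverage == 0:
        return 0.0
    return alt_count / coverage


def window_features(alt_counts, coverages, mean_quals):
    # One mutation frequency and one average base quality per window position.
    if not (len(alt_counts) == len(coverages) == len(mean_quals) == WINDOW):
        raise ValueError('expected one value per position of the 5-mer window')
    freqs = [mutation_frequency(a, c) for a, c in zip(alt_counts, coverages)]
    return freqs + list(mean_quals)


features = window_features(
    alt_counts=[0, 1, 6, 0, 2],
    coverages=[10, 10, 12, 0, 8],
    mean_quals=[12.0, 11.5, 9.0, 0.0, 13.2],
)
```

The resulting vector holds ten values per window, five frequencies followed by five average qualities; the actual input encoding used for the classifier may differ.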
Base qualities for a given position were calculated as the average of all base qualities mapped to the position. We used these as inputs to a fully connected artificial neural network classifier composed of four hidden layers with a ReLU activation function. The first hidden layer is six times larger than the input layer, and the size of each subsequent hidden layer decreases by a factor of two.
We trained three models to compare the classifier using different coverages. NA12878 20 flow cells (24\u00d7), NA12878 all flow cells (44\u00d7), and HG003 (80\u00d7) were used for training. First, we called variants for each dataset using Clair and then determined true and false positives using high-quality variants from the Genome in a Bottle (GIAB) database [27]. Using the NA12878 20 flow cell data, a randomly selected balanced dataset of 25 million 5-mers was used for training and 4 million unseen randomly selected 5-mers were used as the validation set. For the NA12878 whole dataset and the HG003 sample, the training datasets were 18M and 14.9M, respectively, and the validation sets were 2.5M and 2M, respectively (Additional file 2: Fig. S6). The NA12878 20 flow cell model was used for < 30\u00d7 coverage data, the NA12878 all flow cells model for 30\u00d7\u201345\u00d7 coverage data, and the HG003 model for > 45\u00d7 coverage data.
Phasing single nucleotide variants detected from nanopore sequencing
In order to phase nanopore reads and CpG methylation, we first called SNVs for both samples (NA19240 run 1 and Colo829BL) using Clair [22], then used SNVoter to normalize the quality scores and filter out false positives (Fig. 2e and Table 1). Finally, we used WhatsHap [23, 31] v0.18 with the default parameters and the --ignore-read-groups option to determine the haplotype status of each SNV.
Phasing of nanopore reads and CpG methylations
Phased SNVs and CpG methylation calls were leveraged to phase reads, along with their CpG methylation, to diploid haplotypes. 
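The read filters that this section goes on to describe (a minimum base quality of seven, at least two phased SNVs per read, and a haplotype ratio of 0.75) can be sketched as follows. The exact assignment criteria are not reproduced in this text, so the logic below is only one plausible reading: it assumes that the ratio is computed as the minority haplotype count over the majority count, that a read is kept when this ratio does not exceed the threshold, and that the minimum SNV count applies to the majority haplotype. The function name and input layout are hypothetical.

```python
# Minimal sketch of one possible reading of the filters; not NanoMethPhase code.
MIN_BASE_QUALITY = 7    # minimum base quality at a phased SNV (from the text)
MIN_PHASED_SNVS = 2     # minimum number of phased SNVs per read (from the text)
HAPLOTYPE_RATIO = 0.75  # haplotype ratio threshold (from the text)


def assign_haplotype(snv_observations):
    # snv_observations: list of (haplotype, base_quality) pairs, one per phased
    # SNV position covered by the read, with haplotype equal to 1 or 2.
    hp1 = sum(1 for hp, q in snv_observations if hp == 1 and q >= MIN_BASE_QUALITY)
    hp2 = sum(1 for hp, q in snv_observations if hp == 2 and q >= MIN_BASE_QUALITY)
    if hp1 == hp2:
        return None  # no majority haplotype, the read stays unassigned
    if hp1 > hp2:
        major, minor, label = hp1, hp2, 'HP1'
    else:
        major, minor, label = hp2, hp1, 'HP2'
    if major < MIN_PHASED_SNVS:
        return None  # too few supporting SNVs
    # Assumption: ratio = minority count / majority count, kept if <= threshold.
    if minor / major > HAPLOTYPE_RATIO:
        return None  # conflicting evidence from both haplotypes
    return label
```

For instance, a read carrying four high-quality HP2 alleles and three HP1 alleles would be assigned to HP2 under this reading (ratio 0.75), whereas five against four (ratio 0.8) would leave it unassigned.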
After filtering out a considerable number of false-positive SNVs using SNVoter, we still noticed 10\u201320% false-positive SNV calls in the datasets (Table 1). These unfiltered false-positive calls, in addition to sequencing errors, can result in reads incorrectly mapping to the SNVs from haplotype 1 when the read actually belongs to haplotype 2, and vice versa. We noticed reads presenting SNVs from both haplotypes when mapping them to phased SNVs. In NA19240 run 0, out of ~ 3M reads which mapped to at least one phased SNV, ~ 2M reads had SNVs from both haplotypes (Additional file 2: Fig. S7a). To further overcome false positives and the sequencing error problem, we implemented several filtering steps to account for remnant false-positive SNVs and the haplotype ratio (number of SNVs from HP1\/HP2 or HP2\/HP1). As we analyzed NA19240 run 0, we noticed a lower base quality distribution for false-positive SNVs that could not be filtered out by SNVoter compared to true positives (Additional file 2: Fig. S7b). Therefore, we assigned a minimum base quality threshold required to map a read at a phased SNV position. To manage reads containing SNVs from both haplotypes, we defined another threshold, the haplotype ratio, which ensures that reads are assigned to a single haplotype. Based on the quality distribution of SNVs (Additional file 2: Fig. S7b), the proportion of false positives, which is between 10 and 20% (Table 1), the haplotype ratios (Additional file 2: Fig. S7a), and empirical phasing at a few known imprinted regions, we used seven as the minimum base quality and 0.75 as the haplotype ratio. We also used two as the minimum number of phased SNVs a read must present to be considered for phasing. In order to assign a read to a defined haplotype, a read must satisfy the following criteria:
As the reads are separated into different haplotypes, their associated CpG methylation calls from the processed methylation call file are also separated into the corresponding haplotypes. We have integrated all the steps and filters in our python3 command-line tool, NanoMethPhase. Users can input methylation call data from Nanopolish, a phased variant calling file, an alignment file, and a reference genome to NanoMethPhase (Fig. 3c). NanoMethPhase will output phased reads in aligned format, a phased mock WGBS converted format for visualization (see the \u201cVisualization\u201d section; Fig. 4c, d), phased methylation calls, and methylation frequency files. The latter can be used for differential methylation analysis to detect DMRs between haplotypes.
Differential methylation analysis
After phasing reads and CpG methylation to haplotypes, NanoMethPhase can perform DMA to detect mono-allelic methylated regions. It uses the DSS R package [36] for DMA. Users can perform all analyses in a command-line interface and directly perform DMA using the dma module of NanoMethPhase on the output phased methylation frequency data to detect DMRs.
Visualization
NanoMethPhase can convert phased reads into separate mock-WGBS bam files using the processed methylation call file from its methyl_call_processor module. Each cytosine in each CpG in each read is converted to a T, A, or N depending on the CpG being called as methylated, unmethylated, or uncalled. These pairs of files can be loaded into a genome browser such as IGV [34] in bisulfite mode for visualization (Fig. 4c, d).
Supplementary Information
The online version contains supplementary material available at https:\/\/doi.org\/10.1186\/s13059-021-02283-5.
Additional file 1. This file includes additional notes for the materials and methods section, and a description of the datasets and the sources of publicly available data. 
The results of further analyses using a newer version of the Guppy basecaller and the Ashkenazi trio are also provided in this file.
Additional file 2. Contains additional figures for the paper, such as further correlation analysis comparing nanopore methylation calls with WGBS, mapping of nanopore-specific CpG methylation calls to genomic regions, supporting figures for SNV improvement using SNVoter, read length distributions for Colo829BL and NA19240, model training plots for SNVoter, etc.
Additional file 3. Contains a table of true and false positives using different quality thresholds for Clair and Clair+SNVoter.
Additional file 4. Contains the average methylation at known and novel ICRs from Court et al. and Joshi et al.
Additional file 5. Contains DMRs found in each sample.
Additional file 6. List of genes studied by Carrel and Willard, Cotton et al., and Tukiainen et al. for genic XCI.
Additional file 7. Mapping of DMRs on chromosome X for NA19240 run 1 to the list of genes studied for XCI.
Additional file 8. Mapping of autosomal DMRs to known and novel ICRs.
Additional file 9. List of imprinted, predicted, conflicting, and provisional imprinted genes from the GeneImprint (http:\/\/www.geneimprint.com\/) and catalog of human imprinted genes (http:\/\/igc.otago.ac.nz\/home.html) databases.
Additional file 10. Mapping of DMRs to 4 Mb upstream and downstream of the list of imprinted, predicted, conflicting, and provisional imprinted genes.
Additional file 11. Mapping of DMRs to the MAE genes reported by Savova et al.
Additional file 12. Review history.
Acknowledgements
Not applicable
Review history
The review history is available as Additional file 12.
Peer review information
Alison Cuff and Barbara Cheifet were the primary editors of this article and managed its editorial process and peer review in collaboration with the rest of the editorial team.
Authors\u2019 contributions
All authors jointly conceived the project. V. 
A and J.M.G developed the pipeline software and packaging. V.A and K.O performed base calling for nanopore sequencing data. V. A analyzed variant calling data for NA12878, NA19240, Colo829BL, and Ashkenazi trio. V. A analyzed NA19240, Colo829, and HG002 (Ashkenazi son) sample methylation phasing. R. M and P. P performed Illumina and nanopore sequencing for Colo829BL. V. A wrote the manuscript with input from S.J.M.J, J.M.G, and K.O. The authors read and approved the final manuscript.
Funding
This study is supported by Canada Research Chairs and the University of British Columbia Four Year Doctoral Fellowship award.
Availability of data and materials
NanoMethPhase and SNVoter source codes, installation instructions, and tutorials are available on GitHub [54, 55] (https:\/\/github.com\/vahidAK\/NanoMethPhase and https:\/\/github.com\/vahidAK\/SNVoter) under GNU General Public License v3.0 (https:\/\/www.gnu.org\/licenses\/). Source codes for NanoMethPhase and SNVoter are also deposited in Zenodo as DOI-assigned repositories [56, 57]. The Illumina sequencing alignment file and the nanopore raw fast5 and base-called fastq files for the Colo829BL sample are available at the European Genome-phenome Archive (EGA: https:\/\/www.ebi.ac.uk\/ega\/home) under the accession number EGAS00001001385 [58]. NA19240 data from De Coster et al. [26] are publicly available at ENA (https:\/\/www.ebi.ac.uk\/ena\/browser\/home) under the accession number PRJEB26791. Nanopore data for NA12878 from Jain et al. [24] are publicly available as an Amazon Web Services Open Data set at https:\/\/github.com\/nanopore-wgs-consortium\/NA12878. Ashkenazi trio data (HG002, HG003, and HG004) from Zook et al. 
[29] are publicly available at SRA (https:\/\/www.ncbi.nlm.nih.gov\/sra) under the accession number PRJNA200694 and at the Genome in a Bottle GitHub (https:\/\/github.com\/genome-in-a-bottle\/giab_data_indexes).
Ethics approval and consent to participate
Not applicable
Consent for publication
Not applicable
Competing interests
We declare that there is no conflict of interest associated with this publication.
Author details
1Canada\u2019s Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, British Columbia, Canada. 2Department of Medical Genetics, University of British Columbia, Vancouver, British Columbia, Canada. 3Department of Microbiology and Immunology, Michael Smith Laboratories, University of British Columbia, Vancouver, British Columbia, Canada.
Received: 1 September 2020 Accepted: 29 January 2021
References
1. Khamlichi AA, Feil R. Parallels between mammalian mechanisms of monoallelic gene expression. Trends Genet. 2018;34:954\u201371. 2. Goovaerts T, Steyaert S, Vandenbussche CA, Galle J, Thas O, Van Criekinge W, et al. A comprehensive overview of genomic imprinting in breast and its deregulation in cancer. Nat Commun. 2018;9:1\u201314. 3. Reinius B, Sandberg R. Random monoallelic expression of autosomal genes: stochastic transcription and allele-level regulation. Nat Rev Genet. 2015;16:653\u201364. https:\/\/doi.org\/10.1038\/nrg3888. 4. Morcos L, Ge B, Koka V, KCL L, Pokholok DK, Gunderson KL, et al. Genome-wide assessment of imprinted expression in human cells. Genome Biol. 2011;12:R25. 5. Jelinic P, Shaw P. Loss of imprinting and cancer. J Pathol. 2007;211:261\u20138. https:\/\/doi.org\/10.1002\/path.2116. 6. Tomizawa S, Sasaki H. Genomic imprinting and its relevance to congenital disease, infertility, molar pregnancy and induced pluripotent stem cell. J Hum Genet. 2012;57:84\u201391. https:\/\/doi.org\/10.1038\/jhg.2011.151. 7. Kurdyukov S, Bullock M. DNA methylation analysis: choosing the right method. 
Biology (Basel). 2016;5:3 Available from:https:\/\/www.ncbi.nlm.nih.gov\/pubmed\/26751487.8. Krueger F, Kreck B, Franke A, Andrews SR. DNA methylome analysis using short bisulfite sequencing data. Nat Methods.2012;9:145\u201351. https:\/\/doi.org\/10.1038\/nmeth.1828.9. Li Y, TO T. DNA methylation detection: bisulfite genomic sequencing analysis. Methods Mol Biol. 2011;791:11\u201321Available from: https:\/\/pubmed.ncbi.nlm.nih.gov\/21913068.10. Simpson JT, Workman RE, Zuzarte PC, David M, Dursi LJ, Timp W. Detecting DNA cytosine methylation using nanoporesequencing. Nat Methods. 2017;14:407. https:\/\/doi.org\/10.1038\/nmeth.4184.11. Flusberg BA, Webster DR, Lee JH, Travers KJ, Olivares EC, Clark TA, et al. Direct detection of DNA methylation during single-molecule, real-time sequencing. Nat Methods. 2010;7:461\u20135 Available from: https:\/\/pubmed.ncbi.nlm.nih.gov\/20453866.12. Biosciences P. Detecting DNA base modifications using single molecule, real-time sequencing. White Pap Base Modif.2015. Available from: https:\/\/www.pacb.com\/wp-content\/uploads\/2015\/09\/WP_Detecting_DNA_Base_Modifications_Using_SMRT_Sequencing.pdf.13. Wick RR, Judd LM, Holt KE. Performance of neural network basecalling tools for Oxford Nanopore sequencing. GenomeBiol. 2019;20:129. https:\/\/doi.org\/10.1186\/s13059-019-1727-y.14. Oxford Nanopore Technologies. Megalodon. GitHub. 2020. Available from: https:\/\/github.com\/nanoporetech\/megalodon. Accessed 27 July 2019.15. Ni P, Huang N, Zhang Z, Wang D-P, Liang F, Miao Y, et al. DeepSignal: detecting DNA methylation state from Nanoporesequencing reads using deep-learning. Bioinformatics. 2019;35:4586\u201395. https:\/\/doi.org\/10.1093\/bioinformatics\/btz276.16. Liu Q, Fang L, Yu G, Wang D, Xiao C-L, Wang K. Detection of DNA base modifications by deep recurrent neural networkon Oxford Nanopore sequencing data. Nat Commun. 2019;10:2449. https:\/\/doi.org\/10.1038\/s41467-019-10168-2.17. 
Rand AC, Jain M, Eizenga JM, Musselman-Brown A, Olsen HE, Akeson M, et al. Mapping DNA methylation with high-throughput nanopore sequencing. Nat Methods. 2017;14:411. https:\/\/doi.org\/10.1038\/nmeth.4189.18. Xu L, Seki M. Recent advances in the detection of base modifications using the Nanopore sequencer. J Hum Genet.2020;65:25-33. https:\/\/doi.org\/10.1038\/s10038-019-0679-0.19. Stoiber M, Quick J, Egan R, Lee JE, Celniker S, Neely RK, et al. De novo identification of DNA modifications enabled bygenome-guided nanopore signal processing. BioRxiv. 2016;94672. https:\/\/doi.org\/10.1101\/094672.20. Liu Q, Georgieva DC, Egli D, Wang K. NanoMod: a computational tool to detect DNA modifications using Nanoporelong-read sequencing data. BMC Genomics. 2019;20:78. https:\/\/doi.org\/10.1186\/s12864-018-5372-8.21. Gigante S, Gouil Q, Lucattini A, Keniry A, Beck T, Tinning M, et al. Using long-read sequencing to detect imprinted DNAmethylation. Nucleic Acids Res. 2019;47:e46. https:\/\/doi.org\/10.1093\/nar\/gkz107.22. Luo R, Wong C-L, Wong Y-S, Tang C-I, Liu C-M, Leung C-M, et al. Exploring the limit of using a deep neural network onpileup data for germline variant calling. Nat Mach Intell. 2020;2:220\u20137. https:\/\/doi.org\/10.1038\/s42256-020-0167-4.23. Martin M, Patterson M, Garg S, O Fischer S, Pisanti N, Klau GW, et al. WhatsHap: fast and accurate read-based phasing.bioRxiv. 2016;85050. https:\/\/doi.org\/10.1101\/085050.24. Jain M, Koren S, Miga KH, Quick J, Rand AC, Sasani TA, et al. Nanopore sequencing and assembly of a human genomewith ultra-long reads. Nat Biotechnol. 2018;36:338\u201345. https:\/\/doi.org\/10.1038\/nbt.4060.25. Fraser HB, Lam LL, Neumann SM, Kobor MS. Population-specificity of human DNA methylation. Genome Biol. 2012;13:R8.https:\/\/doi.org\/10.1186\/gb-2012-13-2-r8.26. De Coster W, De Rijk P, De Roeck A, De Pooter T, D\u2019Hert S, Strazisar M, et al. 
Structural variants identified by OxfordNanopore PromethION sequencing of the human genome. Genome Res. 2019;29:1178\u201387.27. Zook JM, McDaniel J, Olson ND, Wagner J, Parikh H, Heaton H, et al. An open resource for accurately benchmarkingsmall variant and reference calls. Nat Biotechnol. 2019;37:561\u20136. https:\/\/doi.org\/10.1038\/s41587-019-0074-6.28. Rang FJ, Kloosterman WP, de Ridder J. From squiggle to basepair: computational approaches for improving nanoporesequencing read accuracy. Genome Biol. 2018;19:90 Available from: https:\/\/pubmed.ncbi.nlm.nih.gov\/30005597.29. Zook JM, Catoe D, McDaniel J, Vang L, Spies N, Sidow A, et al. Extensive sequencing of seven human genomes tocharacterize benchmark reference materials. Sci Data. 2016;3:160025. https:\/\/doi.org\/10.1038\/sdata.2016.25.30. Auton A, Abecasis GR, Altshuler DM, Durbin RM, Abecasis GR, Bentley DR, et al. A global reference for human geneticvariation. Nature. 2015;526:68\u201374. https:\/\/doi.org\/10.1038\/nature15393.31. Patterson M, Marschall T, Pisanti N, van Iersel L, Stougie L, Klau GW, et al. WhatsHap: weighted haplotype assembly forfuture-generation sequencing reads. J Comput Biol. 2015;22:498\u2013509. https:\/\/doi.org\/10.1089\/cmb.2014.0157.32. Court F, Tayama C, Romanelli V, Martin-Trujillo A, Iglesias-Platas I, Okamura K, et al. Genome-wide parent-of-origin DNAmethylation analysis reveals the intricacies of human imprinting and suggests a germline methylation-independentmechanism of establishment. Genome Res. 2014;24:554\u201369.33. Joshi RS, Garg P, Zaitlen N, Lappalainen T, Watson CT, Azam N, et al. DNA methylation profiling of uniparental disomysubjects provides a map of parental epigenetic bias in the human genome. Am J Hum Genet. 2016;99:555\u201366. https:\/\/doi.org\/10.1016\/j.ajhg.2016.06.032.34. Robinson JT, Thorvaldsd\u00f3ttir H, Winckler W, Guttman M, Lander ES, Getz G, et al. Integrative genomics viewer. NatBiotechnol. 2011;29:24\u20136. 
https:\/\/doi.org\/10.1038\/nbt.1754. 35. Soejima H, Higashimoto K. Epigenetic and genetic alterations of the imprinting disorder Beckwith\u2013Wiedemann syndrome and related disorders. J Hum Genet. 2013;58:402\u20139. https:\/\/doi.org\/10.1038\/jhg.2013.51. 36. Park Y, Wu H. Differential methylation analysis for BS-seq data under general experimental design. Bioinformatics. 2016;32:1446\u201353. https:\/\/doi.org\/10.1093\/bioinformatics\/btw026. 37. Cotton AM, Ge B, Light N, Adoue V, Pastinen T, Brown CJ. Analysis of expressed SNPs identifies variable extents of expression from the human inactive X chromosome. Genome Biol. 2013;14:1\u201317. 38. Carrel L, Willard HF. X-inactivation profile reveals extensive variability in X-linked gene expression in females. Nature. 2005;434:400\u20134. 39. Tukiainen T, Villani A-C, Yen A, Rivas MA, Marshall JL, Satija R, et al. Landscape of X chromosome inactivation across human tissues. Nature. 2017;550:244\u20138. 40. Yang X, Han H, De Carvalho DD, Lay FD, Jones PA, Liang G. Gene body methylation can alter gene expression and is a therapeutic target in cancer. Cancer Cell. 2014;26:577\u201390 Available from: https:\/\/pubmed.ncbi.nlm.nih.gov\/25263941. 41. Zink F, Magnusdottir DN, Magnusson OT, Walker NJ, Morris TJ, Sigurdsson A, et al. Insights into imprinting from parent-of-origin phased methylomes and transcriptomes. Nat Genet. 2018;50:1542\u201352. https:\/\/doi.org\/10.1038\/s41588-018-0232-7. 42. Barlow DP, Bartolomei MS. Genomic imprinting in mammals. Cold Spring Harb Perspect Biol. 2014;6:a018382. https:\/\/doi.org\/10.1101\/cshperspect.a018382. 43. da Rocha ST, Gendrel A-V. The influence of DNA methylation on monoallelic expression. Essays Biochem. 2019;63:663\u201376. https:\/\/doi.org\/10.1042\/EBC20190034. 44. Morison IM, Reeve AE. A catalogue of imprinted genes and parent-of-origin effects in humans and animals. Hum Mol Genet. 1998;7:1599\u2013609. 
https:\/\/doi.org\/10.1093\/hmg\/7.10.1599. 45. Nichols CA, Gibson WJ, Brown MS, Kosmicki JA, Busanovich JP, Wei H, et al. Loss of heterozygosity of essential genes represents a widespread class of potential cancer vulnerabilities. Nat Commun. 2020;11:2517. https:\/\/doi.org\/10.1038\/s41467-020-16399-y. 46. Bersani F, Lee E, Kharchenko PV, Xu AW, Liu M, Xega K, et al. Pericentromeric satellite repeat expansions through RNA-derived DNA intermediates in cancer. Proc Natl Acad Sci U S A. 2015;112:15148\u201353 Available from: https:\/\/pubmed.ncbi.nlm.nih.gov\/26575630. 47. Li H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics. 2018;34:3094\u2013100. https:\/\/doi.org\/10.1093\/bioinformatics\/bty191. 48. Jain C, Rhie A, Zhang H, Chu C, Walenz BP, Koren S, et al. Weighted minimizer sampling improves long read mapping. Bioinformatics. 2020;36:i111\u20138. https:\/\/doi.org\/10.1093\/bioinformatics\/btaa435. 49. Pervjakova N, Kasela S, Morris AP, Kals M, Metspalu A, Lindgren CM, et al. Imprinted genes and imprinting control regions show predominant intermediate methylation in adult somatic tissues. Epigenomics. 2016;8:789\u201399 Available from: https:\/\/pubmed.ncbi.nlm.nih.gov\/27004446. 50. Savova V, Chun S, Sohail M, RB MC, Witwicki R, Gai L, et al. Genes with monoallelic expression contribute disproportionately to genetic diversity in humans. Nat Genet. 2016;48:231\u20137. https:\/\/doi.org\/10.1038\/ng.3493. 51. Consortium EP. An integrated encyclopedia of DNA elements in the human genome. Nature. 2012;489:57. 52. Zhao H, Sun Z, Wang J, Huang H, Kocher J-P, Wang L. CrossMap: a versatile tool for coordinate conversion between genome assemblies. Bioinformatics. 2014;30:1006\u20137. 53. Kim S, Scheffler K, Halpern AL, Bekritsky MA, Noh E, K\u00e4llberg M, et al. Strelka2: fast and accurate calling of germline and somatic variants. Nat Methods. 2018;15:591\u20134. https:\/\/doi.org\/10.1038\/s41592-018-0051-x. 54. 
Akbari V, Garant J-M, O\u2019Neill K, Pandoh P, Moore R, Marra M, et al. NanoMethPhase. GitHub. 2020; Available from: https:\/\/github.com\/vahidAK\/NanoMethPhase. 55. Akbari V, Garant J-M, O\u2019Neill K, Pandoh P, Moore R, Marra M, et al. SNVoter. GitHub. 2020; Available from: https:\/\/github.com\/vahidAK\/SNVoter. 56. Akbari V, Garant J-M, O\u2019Neill K, Pandoh P, Moore R, Marra M, et al. NanoMethPhase. Zenodo. 2021; Available from: https:\/\/doi.org\/10.5281\/zenodo.4474430. 57. Akbari V, Garant J-M, O\u2019Neill K, Pandoh P, Moore R, Marra M, et al. SNVoter. Zenodo. 2021; Available from: https:\/\/doi.org\/10.5281\/zenodo.4474436. 58. Akbari V, Garant J-M, O\u2019Neill K, Pandoh P, Moore R, Marra M, et al. EGAS00001001385. Eur Genome-phenome Arch. 2021; Available from: https:\/\/www.ebi.ac.uk\/ega\/studies\/EGAS00001001385. Publisher\u2019s Note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.","type":"literal","lang":"en"}],"http:\/\/www.europeana.eu\/schemas\/edm\/hasType":[{"value":"Article","type":"literal","lang":"en"}],"http:\/\/www.europeana.eu\/schemas\/edm\/isShownAt":[{"value":"10.14288\/1.0395957","type":"literal","lang":"en"}],"http:\/\/purl.org\/dc\/terms\/language":[{"value":"eng","type":"literal","lang":"en"}],"https:\/\/open.library.ubc.ca\/terms#peerReviewStatus":[{"value":"Reviewed","type":"literal","lang":"en"}],"http:\/\/www.europeana.eu\/schemas\/edm\/provider":[{"value":"Vancouver : University of British Columbia Library","type":"literal","lang":"en"}],"http:\/\/purl.org\/dc\/terms\/publisher":[{"value":"BioMed Central","type":"literal","lang":"en"}],"https:\/\/open.library.ubc.ca\/terms#publisherDOI":[{"value":"10.1186\/s13059-021-02283-5","type":"literal","lang":"en"}],"http:\/\/purl.org\/dc\/terms\/rights":[{"value":"Attribution 4.0 International (CC BY 
4.0)","type":"literal","lang":"en"}],"https:\/\/open.library.ubc.ca\/terms#rightsURI":[{"value":"http:\/\/creativecommons.org\/licenses\/by\/4.0\/","type":"literal","lang":"en"}],"https:\/\/open.library.ubc.ca\/terms#scholarLevel":[{"value":"Faculty","type":"literal","lang":"en"}],"http:\/\/purl.org\/dc\/terms\/subject":[{"value":"Nanopore sequencing","type":"literal","lang":"en"},{"value":"Allele-specific methylation","type":"literal","lang":"en"},{"value":"Phasing","type":"literal","lang":"en"},{"value":"NanoMethPhase","type":"literal","lang":"en"}],"http:\/\/purl.org\/dc\/terms\/title":[{"value":"Megabase-scale methylation phasing using nanopore long reads and NanoMethPhase","type":"literal","lang":"en"}],"http:\/\/purl.org\/dc\/terms\/type":[{"value":"Text","type":"literal","lang":"en"}],"https:\/\/open.library.ubc.ca\/terms#identifierURI":[{"value":"http:\/\/hdl.handle.net\/2429\/77374","type":"literal","lang":"en"}]}}