UBC Theses and Dissertations

Meta-analysis of human methylomes reveals stably methylated sequences surrounding CpG islands associated… Edgar, Rachel 2014

Meta-analysis of Human Methylomes Reveals Stably Methylated Sequences Surrounding CpG Islands Associated with High Gene Expression

by

Rachel Edgar

B.Sc. (Honours), University of Western Ontario, 2008

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF

Master of Science

in

THE FACULTY OF GRADUATE AND POSTDOCTORAL STUDIES
(Genome Science and Technology)

The University of British Columbia
(Vancouver)

August 2014

© Rachel Edgar, 2014

Abstract

DNA methylation is thought to play an important role in the regulation of mammalian gene expression. Part of the evidence for this role is the observation that lack of CpG island methylation in gene promoters is associated with high transcriptional activity. However, CpG island methylation level only accounts for a fraction of the variance in gene expression, and methylation in other domains is hypothesized to play a role (e.g., island shores and shelves). We set out to improve understanding of the human methylome through a meta-analysis approach, using 1737 samples from 30 publicly available studies. An initial screen identified 15224 CpGs that are ultra-stable in their state, being always fully methylated or unmethylated across diverse tissues, cell types and developmental stages (974 always methylated; 14250 always unmethylated). A further analysis of ultra-stable CpGs led us to identify a novel class of CpG islands, ravines, that exhibit a markedly consistent pattern of low methylation with highly methylated flanking shores and shelves. Our findings were validated using independent and heterogeneous datasets assayed on the same and different technologies. Building on additional existing data types such as gene expression microarrays, DNase hypersensitive sites, and histone modifications, we found that ravines are associated with higher gene expression, compared to typical unmethylated CpG islands.
This finding suggests a novel role for methylation in promoters, markedly different from the traditional view that active promoters need to be unmethylated. We propose ravines are a new class of CpG islands, established early in development and maintained through differentiation, that mark universally active genes and provide new evidence that methylation beyond the CpG island could play a role in gene expression.

Preface

This dissertation is original, independent work by the author, R. Edgar, and has been submitted for publication. The gene expression data was assembled and provided by Powell Patrick Cheng Tan. All other data was collected from public repositories by the author. All data analysis was performed by the author, as well as the majority of manuscript composition. Paul Pavlidis and Elodie Portales-Casamar were the supervisory authors on this project and were involved throughout the project in concept formation and manuscript edits. The author, R. Edgar, has accepted a research position in the laboratory of Dr. Michael Kobor, a member of the examining committee, after completion of the degree.

Table of Contents

Abstract
Preface
Table of Contents
List of Tables
List of Figures
Acknowledgements
1 Introduction
  1.1 450K Public Data
  1.2 Scope and Objectives
  1.3 Thesis Outline
2 Methods
  2.1 Data Collection
  2.2 Quality Control
    2.2.1 Individual Study Quality Control
  2.3 Ultra-Stable CpG Calling
    2.3.1 ENCODE Confirmation of Ultra-Stable CpGs
    2.3.2 Methyltransferase Confirmation of Ultra-Stable CpGs
    2.3.3 Ultra-Stable CpG Characterization
    2.3.4 Concurrence of Ultra-stable CpGs in CGIs
  2.4 Composite Profile of Resorts
  2.5 Resort Classifier Based on Ravine Steepness
    2.5.1 Resort Class Comparison to Previously Defined Methylome Domains
    2.5.2 Ravine Confirmation in Independent Samples
    2.5.3 Ravine Overlap with Gene Bodies
  2.6 Resort Class Characterization
    2.6.1 ENCODE DNase Sensitivity Data
    2.6.2 CGI-to-Gene Associations
    2.6.3 Gene Expression Data
    2.6.4 RNA Polymerase II Binding Data
    2.6.5 Histone Modifications Data
    2.6.6 Polycomb Binding Sites Data
  2.7 Steep Ravine-Associated Gene Function
3 Results
  3.1 Ultra-stable DNA methylation sites
    3.1.1 Ultra-stable site Confirmation
  3.2 Distribution of ultra-stable CpG sites in the human genome
  3.3 Profiles of regions containing ultra-stable CpG sites
    3.3.1 Ravine Independence from Previously Defined Methylome Domains
    3.3.2 Ravine Confirmation in Independent Samples
    3.3.3 Ravines are not a by-product of gene body methylation
    3.3.4 CGI Categorization by Gene Relation
  3.4 Ravines are associated with active transcription
    3.4.1 RNA Polymerase II Binding at Resorts
    3.4.2 Histone Modifications at Resorts
    3.4.3 Polycomb Binding Sites at Resorts
  3.5 Ravines are associated with housekeeping genes
4 Discussion
  4.1 Conclusion
Bibliography
A GEO series used
B Detailed Tissues Available
C Independent GEO series used
D Uniformly Unmethylated Resort GO and DO Enrichment
  D.1 Uniformly Unmethylated Resort GO Enrichment
  D.2 Uniformly Unmethylated Resort DO Enrichment

List of Tables

Table 3.1 Summary of comparison to previously reported methylome features.
Table 3.2 Summary of the resort classes features.
Table A.1 Summary of the GEO series Available
Table A.2 Summary of the GEO series Available (Continued)
Table B.1 Summary of tissue types used in analysis.
Table C.1 Summary of the GEO series available between April 30, 2013 and July 29, 2013
Table D.1 Uniformly Unmethylated Resort GO Enrichment
Table D.2 Uniformly Unmethylated Resort GO Enrichment (Continued)
Table D.3 Uniformly Unmethylated Resort GO Enrichment (Continued)
Table D.4 Uniformly Unmethylated Resort GO Enrichment (Continued)
Table D.5 Uniformly Unmethylated Resort DO Enrichment

List of Figures

Figure 1.1 Growth in 450K Samples Available on GEO
Figure 1.2 Outline of Analysis
Figure 2.1 Beta Mixture Model To Call Methylated and Unmethylated Thresholds
Figure 3.1 Tissues used in meta-analysis
Figure 3.2 Representative CpGs of Each Stability State
Figure 3.3 ENCODE RRBS Confirmation
Figure 3.4 Distribution of SNPs in 450K Probes
Figure 3.5 Ultra-stable CpG Distribution Among Chromosomes
Figure 3.6 Ultra-stable distance from TSS
Figure 3.7 Co-occurrence of Ultra-stable CpGs
Figure 3.8 Ultra-stable CpG Distribution in Resorts
Figure 3.9 Structure of a Resort
Figure 3.10 Resort Composite Profiles
Figure 3.11 Representative Resorts
Figure 3.12 Ravine Confirmation in Independent Series
Figure 3.13 Ravines Patterns Around Gene Bodies
Figure 3.14 Representative CGI with Multiple Gene Associations
Figure 3.15 CGI Gene Relations
Figure 3.16 Resort Classification and Characterization
Figure 3.17 POLR2A Binding in Resort Classes
Figure 3.18 Resort Class H3K27me3 and H3K4me1 Marks
Figure 3.19 Ravines Association with Housekeeping Genes
Acknowledgements

I wish to thank my supervisor, Paul Pavlidis, for his guidance and patience throughout my project, and for quickly becoming an expert in a new field. Thank you to Elodie Portales-Casamar for her mentorship and out-of-the-box thinking, which made the project unique and kept me excited. I would also like to thank Michael Kobor and Inanc Birol, my committee members. Thank you to Martin Hirst, Sanja Rogic, Shreejoy Tripathy, and Magda Price for comments on the thesis content.

Thank you to my parents and brother for supporting me in my journey to Vancouver to work on my master's. You have always been there for me and I could not have done any of the things I have without knowing I had you to fall back on.

All the Pavlidis lab members, past and present, have enriched my master's experience. You all made it a delight to come to the lab each day. I also want to acknowledge all the researchers who have taken the time and effort to make the data I have used publicly available; I hope my work provides some justification for that investment.

Chapter 1

Introduction

Variation in the methylation state of DNA across cell types, developmental stages and physiological or disease conditions is of intense interest to understanding mammalian gene regulation. To this end, numerous studies have been carried out to measure DNA methylation states among cell types or conditions at the resolution of single cytosine guanine dinucleotides (CpG). Currently the field is undergoing an explosion of characterization of methylomes, leading to a growing but still highly incomplete understanding of the relationships among methylation, gene expression, normal cellular function and disease [1]. The conceptually simplest approach is to divide chromosomes into domains or clusters of similar methylation states and correlate such domains with the location of genes or their regulatory sequences, and with other epigenetic marks such as histone acetylation or methylation.
However, even with massive efforts such as ENCODE [2], numerous gaps in our knowledge exist, particularly in the variation (and functional significance) of epigenetic states across multiple cell types and conditions.

Early studies focused on CpG islands (CGIs), defined as short (approximately 1 kb) regions of high CpG density in an otherwise CpG-sparse genome [3]. Many CGIs are associated with gene promoters [4, 5], and methylation at CGIs is associated with repression of transcription [6, 7]. More recently, the utility of the concept of the CGI has been challenged as it has become more technologically feasible to directly measure methylation, rather than relying on inferred states based on CpG density [8]. Genome-wide analysis has thus helped define a growing geography of biologically significant methylation patterns besides that associated with CGIs near promoters. CGI “shores”, defined as the 2 kb of sequence flanking a CGI, have been reported to be more dynamic than the CGI itself [9, 10]. Beyond shores are “shelves” [11] and “open sea” sites [12]. More recently, large DNA methylation “valleys” and “canyons” of low methylation have been identified [13, 14]. Other domains, identified in tumour cells, are termed “low-methylated regions” (LMRs) and “long-range epigenetic activation” (LREA) or silencing (LRES) domains of relatively low or high methylation [15–17]. We note that the definition of these domains inevitably relies on investigator-specified parameters of length and methylation level, and they are not mutually exclusive; for example, canyons often overlap CGIs. In addition, the relative stability of domains such as LREAs and canyons across cell types and conditions is still not completely documented.

In general, the largest changes in DNA methylation are seen during development, which involves global methylation erasure and reestablishment [18], and in cancer, which is characterized by extensive and often gene-specific changes compared to normal tissues [19].
Beyond this, many studies have emphasized the general stability of the methylome. Even between different tissues or tumour types, the number of differentially methylated CpGs reported ranges from 0.5% to 20% (depending in part on the statistical tests and significance cut-offs) [20, 21]. Understanding which sites and domains are relatively static or dynamic is an important step to assigning function to DNA methylation.

Because many previous studies focused on differences in methylation across conditions or cell types, there is likely to be additional information on stability waiting to be identified. Here we analyze a large collection of DNA methylation data to identify a set of ultra-stable CpG sites. We associate many of these sites with a novel class of CGIs we refer to as “ravines”, which tend to be near housekeeping genes and associated with high expression activity and open chromatin states. We propose a new classification of CGIs that takes into account the methylation state of the island as well as the shores and shelves.

Figure 1.1: Number of 450K samples available on GEO at various time points from the project start. (x-axis: days since project start, April 30, 2013 to June 6, 2014; y-axis: 450K samples available.)

1.1 450K Public Data

DNA methylation analysis in human tissues is becoming increasingly common. The primary source of DNA methylation data used in this study was the Gene Expression Omnibus (GEO), restricted to the Illumina Infinium Methylation array (450K) [11, 22]. The array was selected because of its high resolution and popularity. The 450K has many more probes than its predecessor, the Illumina Infinium 27K. While the 450K has lower resolution than sequencing-based methylation methods, fewer datasets are currently available from sequencing-based methods. The popularity of methylation analysis with the 450K greatly increased the power of the study, because there was high diversity among sample types.
The data collection from GEO was done on April 30, 2013, but the number of 450K samples on GEO has increased rapidly since then (Figure 1.1).

1.2 Scope and Objectives

There remains limited understanding in the field of epigenomics of the effects of methylation in various regions of the human genome. This thesis work, by defining the relation between the methylation state of specific regions and gene regulation, serves to further the understanding of the human methylome. There are broadly two aims of this thesis work:

1. Identify CpGs in the human genome which show stable methylation states across all samples available
2. Characterize the genomic regions containing stable CpGs

1.3 Thesis Outline

This thesis includes an analysis of a large collection of DNA methylation data to identify a set of ultra-stable CpG sites. Many of these sites were then found to be associated with a novel class of CGIs we refer to as “ravines”, which tend to be near housekeeping genes and associated with high expression activity and open chromatin states. We propose a new classification of CGIs that takes into account the methylation state of the island as well as the shores and shelves. An outline of the major steps and data types incorporated into this analysis is shown in Figure 1.2.

Figure 1.2: The data analysis pipeline. Additional analyses are discussed in the thesis which support the steps shown here. (Boxes in the figure: Data Collection from GEO; Quality Control; Ultra-stable CpG Calling; Ultra-stable Confirmation; CGI Level Analysis; Resort Classification; Resort Class Characterization; Gene and Disease Enrichment. Data sources in the figure: Methyltransferase Studies; RRBS Data; ENCODE Histone Data; TFBS Data; Polycomb Binding Sites; Previous Domains; Housekeeper Gene List; Gemma Gene Expression Data.)

Chapter 2

Methods

2.1 Data Collection

As of April 30th 2013, 58 unique sample series run on the Illumina 450K platform (GPL13534 or GPL16304) were available in GEO [22].
Using the R Bioconductor package “GEOquery” 2.26.2 [23], the series were collected and considered for quality control (Appendix A).

2.2 Quality Control

To qualify for inclusion in our study, samples had to have beta values (the proportion of DNA in a sample that is methylated, from 0 to 1, with 1 being entirely methylated) for all 485577 probes, disqualifying 19 series. An additional 4 studies which involved direct global manipulation (genetic or chemical) of DNA methylation were also removed (DNMT1;DNMT3b double knockouts or methyltransferase treatment). Within each study, individual samples were further assessed for quality. Eight samples with unusually high numbers of missing values (5 standard deviations (SD) from the mean, corresponding to >0.4% or 1957 missing values) were removed.

2.2.1 Individual Study Quality Control

In addition to the initial quality control to remove series, five more series were considered unsuitable for the meta-analysis, for individual reasons, and removed. These studies were flagged during initial quality control steps because of design features which make them unsuitable for meta-analysis. GSE38271 [24] contains 42 Formalin-Fixed Paraffin-Embedded (FFPE) samples. FFPE samples showed methylation totals beyond the standard deviation limit set for sample quality, while the untreated samples did not. Therefore GSE38271 was excluded from further analysis. GSE30338 [25] was listed as having beta values provided. However, the data range from -4 to 4. The values are likely M values, but out of caution we removed the series from the meta-analysis. GSE20945 [26] was removed as the series is provided with multiple array data in one file, a structure that would not easily fit into the collection script. GSE41273 [27] was removed as it had a mean Pearson correlation of 0.023 with the other series, compared to an average of 0.96 between all other series.
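The series-level correlation check applied to GSE41273 can be illustrated with a small sketch. It assumes, hypothetically, that each series is summarized by a vector of per-probe mean beta values; the thesis does not state how the correlation was computed, and the function names and toy series IDs below are illustrative only.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def mean_correlation_with_others(profiles):
    """For each series, average its Pearson correlation with every other series.

    profiles: dict mapping a series ID to a list of per-probe mean beta values
    (an assumed summary; all lists must cover the same probes in the same order).
    """
    result = {}
    for series_id, profile in profiles.items():
        correlations = [
            pearson(profile, other)
            for other_id, other in profiles.items()
            if other_id != series_id
        ]
        result[series_id] = sum(correlations) / len(correlations)
    return result


# Toy data: three concordant series and one discordant series (hypothetical values).
profiles = {
    "series_A": [0.05, 0.10, 0.90, 0.85, 0.50],
    "series_B": [0.04, 0.12, 0.88, 0.90, 0.45],
    "series_C": [0.06, 0.09, 0.92, 0.80, 0.55],
    "series_X": [0.50, 0.90, 0.40, 0.10, 0.60],
}
means = mean_correlation_with_others(profiles)
least_concordant = min(means, key=means.get)
print(least_concordant)
```

With only a handful of series, one discordant series also lowers the averages of the others, so the sketch reports the least concordant series rather than applying a fixed cut-off.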
Though the cause of the low series correlation with GSE41273 is unclear, the series was removed from the meta-analysis.

2.3 Ultra-Stable CpG Calling

A three-component mixture model was fit to each series' beta distribution using the R “mixtools” package [28]. The mean was calculated for each component; the unmethylated component mean +2 SD and the methylated component mean -2 SD were used as the unmethylated and methylated beta value thresholds, respectively, for each series separately (Figure 2.1). For each sample, unmethylated and methylated probes were called based on the thresholds computed for the series. Average thresholds were at beta values of 0.15 and 0.74. Probes which were scored as methylated in all 1737 samples, or as unmethylated in all 1737 samples, were deemed “ultra-stable”.

2.3.1 ENCODE Confirmation of Ultra-Stable CpGs

Data from 102 ENCODE RRBS samples was collected from UCSC (Release 3 of ENCODE/HudsonAlpha RRBS data; [2]). In many RRBS studies, reads with <10 fold coverage are discarded [29, 30]; therefore a 10 fold coverage cutoff was used on the ENCODE RRBS data. Unlike the 450K data, simple thresholds were used for all ENCODE RRBS data since it is one dataset. CpGs were considered methylated in ENCODE RRBS data if their percent methylation was >80 and unmethylated if their percent methylation was <20.

Figure 2.1: Methylated and unmethylated thresholds were called separately for each series based on the series beta distribution. The histogram and broken line show the distribution of beta values for series GSE42118 as a representative example. Solid lines are the 3 fitted Gaussian components of the distribution (red unmethylated peak, blue methylated peak and green partially methylated peak). Vertical black lines indicate 2 SD away from the unmethylated and methylated component means, which were used as the methylation state thresholds.

2.3.2 Methyltransferase Confirmation of Ultra-Stable CpGs

Four methyltransferase 450K studies (DNMT1;DNMT3b double knockouts, methyltransferase inhibitor or methyltransferase treated) with a total of 68 samples were available on GEO (see Appendix: GEO series used). The studies were excluded from the ultra-stable site calling, and the state of the ultra-stable sites was then checked in the 68 samples.

2.3.3 Ultra-Stable CpG Characterization

To annotate the CpGs, we used two sources of information. The first was that provided by Illumina [11] and included which UCSC CGI a CpG site is associated with, if any. The Illumina annotation also includes the relation of a CpG to a CGI. CpG shores and shelves are defined by distance in base pairs from the UCSC defined CGI start and stop coordinates. Shores extend up to 2 kb from the CGI boundaries [10], and shelves are 2-4 kb from the CGI boundary [11]. The second annotation, available on GEO under GPL16304, contains additional CpG annotations to those provided by Illumina under GPL13534 [31], including distance to nearest TSS. A Student's t-test was performed to determine whether distance to TSS differed significantly between all CpGs and ultra-stable CpGs.

2.3.4 Concurrence of Ultra-stable CpGs in CGIs

Ten sets of 15224 CpGs (the same number as the number of ultra-stable CpGs) were randomly sampled from all 450K CpGs. The average number of randomly selected CpGs present within each CGI was taken as an expected distribution of 15224 CpGs across the CGIs to compare to the distribution of the 15224 ultra-stable CpGs.

2.4 Composite Profile of Resorts

The 27176 resorts measured by the 450K have a range of lengths (minimum 201 bp, maximum 45710 bp, mean 935 bp). To allow comparison of resorts, the position of a CpG in a CGI was converted to the CpG's relative position in a CGI of the mean CGI size (935 bp).
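The rescaling just described can be sketched as a simple proportional mapping. This is a minimal illustration under the assumption that the conversion is linear in the CpG's offset from the CGI start; the function and constant names are hypothetical and not taken from the thesis.

```python
MEAN_CGI_LENGTH_BP = 935  # mean CGI length reported for the 450K resorts


def relative_cgi_position(offset_bp, cgi_length_bp,
                          reference_length_bp=MEAN_CGI_LENGTH_BP):
    """Map a CpG's offset from the CGI start onto a CGI of the reference length.

    offset_bp: distance of the CpG from the CGI start, in base pairs.
    cgi_length_bp: length of the CGI that contains the CpG.
    """
    if cgi_length_bp <= 0:
        raise ValueError("CGI length must be positive")
    if not 0 <= offset_bp <= cgi_length_bp:
        raise ValueError("CpG offset must lie within the CGI")
    return offset_bp / cgi_length_bp * reference_length_bp


# A CpG at the midpoint of the longest CGI (45710 bp) lands at the midpoint
# of the 935 bp reference island.
print(relative_cgi_position(22855, 45710))  # 467.5
```

Because the mapping is proportional, CpGs at the start, middle and end of any island line up at the same composite positions regardless of the island's original length.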
As an example, a CpG 200 bp from the start of a 1200 bp CGI would be shown at 155.83 bp from the start of the CGI in the composite plot. Conversion of CpG position to a relative value allowed comparisons of CGIs of varying sizes. Resort shores include all CpGs less than 2 kb from the CGI start or end. CGI shelves include all CpGs 2-4 kb from the CGI start or end. Since shores and shelves are fixed sizes, CpG positions within shores and shelves are shown at their actual, not relative, distance from the CGI boundaries.

2.5 Resort Classifier Based on Ravine Steepness

Steepness was calculated as the mean beta methylation level of the CGI CpGs subtracted from the mean beta methylation of the shore and shelf CpGs. Steepness of a ravine was only calculated for those resorts which had at least one CpG measured on the 450K array in each relevant part of the resort (CGI, the north shore or shelf and the south shore or shelf; 22290 resorts). Steep ravines were arbitrarily defined as those with the 1500 highest steepness values. Uniformly unmethylated resorts were defined as those with a CGI mean methylation <0.3 and the 1500 lowest steepness values.

2.5.1 Resort Class Comparison to Previously Defined Methylome Domains

We compared ravines to several previous classifications of sites. For comparison to human data, we used UCSC LiftOver [32] to convert mouse genome coordinates or previous human genome builds to corresponding orthologous sites in hg19. The comparisons are summarized in Table 3.1. In all cases regions or domains were considered overlapping if any base pairs were shared between them.

2.5.2 Ravine Confirmation in Independent Samples

The original 1737 samples were collected from all 450K studies available before April 30th 2013. Between April 30th 2013 and July 29th, 2013 additional samples were collected as an independent sample set for ravine confirmation.
A total of 27 new series were available and 15 passed quality control, leaving 757 samples to confirm the ravine pattern (see Appendix: Independent GEO series used). The new samples were from a similar variety of tissues as the original samples (Figure).

2.5.3 Ravine Overlap with Gene Bodies

Using our CGI-to-Gene associations, we separated all resorts into 4 categories: “5 Prime”, where one shore/shelf and the CGI core are located 5′ of the gene but the other shore/shelf overlaps the gene body; “Gene Body”, where the whole resort is contained within the gene body; “3 Prime”, where one shore/shelf overlaps the gene body but the CGI core and the other shore/shelf are downstream of the gene; and “Intergenic”, where the whole resort is away from any gene.

2.6 Resort Class Characterization

2.6.1 ENCODE DNase Sensitivity Data

The ENCODE UCSC DNase clusters track (wgEncodeRegDnaseClusteredV2) from University of Washington and Duke University was collected for 125 cell types [2]. The DNase score for a CGI was calculated by taking the score for any DNase hot spot overlapping a CGI body. If multiple DNase hot spots overlapped a CGI, the scores were weighted by the amount of the CGI the DNase peak overlapped. Wilcoxon Rank Sum (Wilcoxon RS) tests on the DNase data were performed between the three classes of resorts.

2.6.2 CGI-to-Gene Associations

A CGI was considered associated with a gene if the CGI is located in the gene body or in the promoter region of a gene. We assigned each CGI to a class based on its proximity to Refseq or ncRNA genes using the Maunakea et al. [33] definitions for promoter, intragenic, intergenic, and 3′ CGIs. CGIs with multiple annotations (3368 CGIs, e.g., 3′ to one gene and promoter to another) were excluded. Refseq genes were downloaded from UCSC. For Refseq genes with multiple transcripts the longest form was used, to capture any possible intragenic functions. Non-coding RNA (ncRNA) annotations were collected from Ensembl [34].

The final list included 40721 unique transcription units.
There are 21743 CGIs on the 450K array which associate with one of 17725 genes or ncRNAs.

To look for over-representation of any resort class in any gene feature (gene body, promoter etc.) we sampled 100 ravine, uniformly unmethylated, and other resorts 25 times. We then performed a Wilcoxon RS test on the relation of each resort type to genes.

2.6.3 Gene Expression Data

Gene expression data from 2021 GEO expression studies were assembled from the Gemma database [35], representing 97388 samples and 34 tissue types. Expression information was available for 21733 genes, 14809 of which were associated with one of 22290 450K CGIs (only those CGIs in resorts previously classed by steepness were compared for expression). Wilcoxon RS tests on the gene expression data were performed between the three classes of resorts. Linear regression was done with 17127 CGIs (CGIs with an associated gene expression level and steepness class). Models were fit for expression variance with associated resort steepness, associated CGI mean methylation, and the interaction between resort steepness and CGI methylation. An F-test was used to show significant interaction of resort steepness and CGI methylation.

2.6.4 RNA Polymerase II Binding Data

ENCODE collected transcription factor binding site (TFBS) information for 161 transcription factors in 91 cell types [2]. We used the ChIP-seq clusters V3 data from UCSC for our analysis. We scored each 450K CGI for all 161 factors (data not shown).

2.6.5 Histone Modifications Data

ENCODE collected histone modification data for 12 marks in 46 cell types, with a similar variety of cell types to our 450K methylation data [2]. We scored each CGI for each histone mark by overlapping histone peaks with the CGI. Different scores for a histone mark from different ENCODE samples in one CGI were weighted by percent overlap with the CGI and then averaged.
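The overlap weighting described above can be sketched as follows. This is one possible reading, not the thesis's confirmed procedure: each peak score is multiplied by the fraction of the CGI that the peak covers, and the weighted scores are then averaged. Half-open coordinates, a score of 0 for CGIs without overlapping peaks, and all names are assumptions made for illustration.

```python
def overlap_bp(a_start, a_end, b_start, b_end):
    """Number of shared base pairs between two half-open intervals."""
    return max(0, min(a_end, b_end) - max(a_start, b_start))


def cgi_mark_score(cgi, peaks):
    """Average of peak scores, each weighted by the fraction of the CGI it covers.

    cgi: (start, end) of the CpG island.
    peaks: iterable of (start, end, score) for one histone mark, pooled
    across samples (assumed input layout).
    """
    cgi_start, cgi_end = cgi
    cgi_length = cgi_end - cgi_start
    weighted_scores = []
    for peak_start, peak_end, score in peaks:
        shared = overlap_bp(cgi_start, cgi_end, peak_start, peak_end)
        if shared > 0:
            weighted_scores.append(score * shared / cgi_length)
    if not weighted_scores:
        return 0.0  # assumption: no overlapping peak means no signal
    return sum(weighted_scores) / len(weighted_scores)


# Hypothetical 1 kb CGI with one half-covering peak, one full-covering peak,
# and one peak that does not overlap at all.
cgi = (1000, 2000)
peaks = [(900, 1500, 800), (1000, 2000, 600), (3000, 3500, 999)]
print(cgi_mark_score(cgi, peaks))  # 500.0
```

A normalized weighted mean (dividing by the sum of the weights instead of the number of peaks) is an equally plausible reading of the text and would give different values.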
Some differences in histone mark scores between resort classes were significant but with a very small actual difference in mean (e.g., H3K4me3 scores were significantly different between uniformly unmethylated and ravine CGIs (p<0.001, Wilcoxon RS test), but the actual difference in mean scores was only 500, on a histone mark score scale of 0-50000). Therefore only comparisons with a difference in mean >3000 were considered as possibly biologically significant.

2.6.6 Polycomb Binding Sites Data

Lee et al. [36] defined 3710 polycomb binding sites in human embryonic stem cells using ChIP-seq with the SUZ12 subunit of polycomb repressive complex 2. The distance to the closest SUZ12 binding site was calculated for each resort as a measure of polycomb activity at the resort.

2.7 Steep Ravine-Associated Gene Function

The lists of steep ravine- and uniform resort-associated genes are the same as those used with the gene expression data. One hundred random gene lists of the same length as the steep ravine and uniform resort gene association lists (1573 and 1465, respectively) were generated. Percent overlap of each random gene list with either the housekeeping or tissue-specific list was calculated. The mean overlap of the 100 random lists with the housekeeping and tissue-specific lists was taken as the expected overlap for comparison with the steep ravine and uniform resort gene lists. Fisher's exact tests were performed between each random gene list overlap and the steepness class gene lists.

We used the GO annotations of the 19389 genes associated with the 450K probes [34] and Disease Ontologies (DO) from the Phenocarta database [37] for enrichment analysis. Enrichment of GO and DO groups in uniformly unmethylated resort and ravine associated genes using over-representation analysis was done in ErmineJ [38]. Statistical significance is reported as false discovery rates computed using the Benjamini-Hochberg method in ErmineJ.
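For readers unfamiliar with the Benjamini-Hochberg procedure, a minimal sketch of the standard step-up adjustment is given here. It illustrates the textbook method only and is not ErmineJ's implementation; the function name and the example p values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p values (false discovery rates).

    The result is in the same order as the input. Each p value is scaled by
    m / rank, and monotonicity is enforced from the largest rank downwards.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


# Hypothetical p values for four gene sets.
fdr = benjamini_hochberg([0.01, 0.04, 0.03, 0.20])
print([round(q, 4) for q in fdr])  # [0.04, 0.0533, 0.0533, 0.2]
```

Gene sets whose adjusted value falls below the chosen false discovery rate would be reported as enriched.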
Also calculated are the multi-functionality scores of the ontology gene sets [39], as well as the p values correctedfor multifunctionality.13Chapter 3Results3.1 Ultra-stable DNA methylation sitesOur initial analysis was to identify CpGs which have a consistent methylation state,across all available tissue, developmental stage, and disease variation. To do thiswe took advantage of the large amount of data available from the Illumina InfiniumHumanMethylation450 BeadChip (450K) [11]. The 450K assays 485577 CpGs inthe human genome and is widely used in methylation studies, many of which arepublicly available through the Gene Expression Omnibus [22]. Careful qualitycontrol (see methods) yielded a set of 1737 samples from 30 different GEO series(a series typically reflects a single publication, see Appendix A), covering 26 tissuetypes and a wide range of conditions (Figure 3.1 and Appendix B). We used asimple but stringent computational approach to identify candidate CpGs that wereconsistently methylated or unmethylated in all samples (see methods). Based onthis analysis, 974 CpGs were considered consistently methylated in every sampleand 14250 consistently unmethylated (Figure 3.2 , Appendix C). Together we referto these as “ultra-stable” CpGs. These represent 3.1% of the CpG sites measuredon the 450K. A less stringent definition of “ultra-stable” would expand this set, butfor our initial analysis we considered these as our starting pool.14Blood (1144)Brain (34)Bone Marrow (38)Colon (69)Oral (76)Stem Cell (78)Lung (82)Trophoblast (90)Other (126)Healthy (968)Rett Syndrome (2)HGP and Werner (7)Ulcerative Colitis (11)Chron's Disease (16)Schizophrenia (62)Cancer (317)Arthritis (354)Mesoderm (1248)Ectoderm (34)Endoderm (266)Figure 3.1: Counts of 450K samples disease tissue and germ layer samplesused in analysis. 
See Appendix B for the complete list of tissue types used.

[Figure 3.2 plot: beta values (0 to 1) of three example probes, cg18276943, cg03734035 and cg02859992, grouped by CpG state (ultra-stable methylated, ultra-stable unmethylated, not ultra-stable).]

Figure 3.2: Representative CpGs of the methylation stability states used in this study (not ultra-stable, ultra-stable unmethylated and ultra-stable methylated). Each point represents the methylation level of the given CpG in a single individual sample. The colour scheme for ultra-stable CpGs is maintained throughout the paper.

3.1.1 Ultra-stable site Confirmation

One concern is that the apparent stability of a CpG might be a function of the platform and methodology. We therefore checked the methylation state of the ultra-stable CpGs in the ENCODE reduced representation bisulfite sequencing (RRBS) data as validation. The 1.2 million CpGs measured in ENCODE RRBS data include 17% of the sites assayed by the 450K, including 5063 (33%) of the ultra-stable CpGs. Of the 121 ultra-stable methylated and 4942 ultra-stable unmethylated CpGs of interest for which there was ENCODE RRBS data, 80% and 98% were methylated and unmethylated, respectively, in 90% of RRBS samples (Figure 3.3). The agreement of ENCODE RRBS data with our results was correlated with sequencing depth, so that higher-quality ENCODE sites tended to agree more closely with our methylation calls (that is, failures to verify tended to be poorly-covered sites in the ENCODE data). This suggests that the large majority of the ultra-stable CpGs are not merely artifacts of the 450K. We further tested whether these CpGs might be giving erroneous measurements due to unusual resistance or sensitivity to bisulfite conversion, which is used by both the 450K and RRBS methods. We examined the ultra-stable CpGs in data sets that purposefully manipulated methylation, either by direct enzymatic treatment of the DNA, or by genetic knockout of DNA methyltransferases.
This analysis showed that under appropriate conditions, the ultra-stable sites can be measured in their opposite state. This suggests that there is no inherent problem with the ultra-stable CpGs being measured at either methylation state, but that under a wide range of biological conditions, the CpGs are always in one state.

Of the final 1737 samples, 750 had detection p values available to estimate the quality of ultra-stable probes. Of 485577 probes, only 43388 had a detection p value >0.001 in 1% of samples. Of these 43388 lower quality probes, only one is an ultra-stable probe. Therefore, the ultra-stable calls are not a consequence of generally poor probe performance.

An additional control was to examine the proportion of SNPs in ultra-stable 450K probes and of SNPs at the CpG assayed by the 450K probe, using the Price et al. [31] annotation. The number of SNPs in ultra-stable probes was similar to the number in all probes, and there were fewer SNPs at ultra-stable CpG sites (Figure 3.4) [31].

[Figure 3.3 plot: panels Ultra Stable Methylated (216/974) and Ultra Stable Unmethylated (6534/14250); x axis, average coverage of CpG; y axis, weighted average percent methylated; point colour, number of samples.]

Figure 3.3: Ultra-stable CpG states were confirmed by ENCODE RRBS data. Panels show the average methylation level across samples, weighted by the CpG coverage in a sample, of the ultra-stable methylated CpGs (left panel; 216 captured in RRBS) or unmethylated CpGs (right panel; 6534 captured in RRBS). Average coverage was cut off at 10 fold for state confirmation, indicated by the vertical marker. Point colour indicates in how many RRBS runs the CpG was measured (max 216, 102 samples with replicates), so the darker the point the greater the confidence.

3.2 Distribution of ultra-stable CpG sites in the human genome

Ultra-stable CpGs are evenly distributed across autosomes, but not on the sex chromosomes.
There are no ultra-stable CpGs on the Y chromosome and only 4 unmethylated ultra-stable CpGs on the X (Figure 3.5).

Because the ultra-stable sites are consistent across a wide range of tissues, developmental stages and conditions, we hypothesized they would be of biological significance. Both classes of ultra-stable sites tend to be near transcription start sites (TSS; p<0.001, t-test; accounting for the distribution of sites on the 450K; Figure 3.6). Concomitantly, ultra-stable CpGs tend to be associated with CGIs. Of all 450K CpGs assayed, 62% are CGI-associated (in CGI, shore or shelf), while 95.5% of the ultra-stable CpGs are CGI-associated. We also observed that ultra-stable CpGs tend to be found in CGIs in groups of two or more, rather than in isolation, more often than expected by chance (Figure 3.7). The ultra-stable unmethylated CpGs are overrepresented in CGIs, rather than in shores and shelves. In contrast, ultra-stable methylated CpGs are underrepresented in CGIs but overrepresented in CGI shelves (Figure 3.8). This distribution is expected, as CpGs in CGIs are generally unmethylated and those in the rest of the genome tend to be methylated. However, the extreme stability of these sites led us to hypothesize that the ultra-stable CpGs might reflect other features of the CGIs they associate with, leading us to focus further investigation on CGIs. We leave a deeper analysis of the 1134 non-CGI-associated ultra-stable sites as a topic for future study.

[Figure 3.4 plot: percent of 450K probes for All CpG, Ultra-stable Methylated and Ultra-stable Unmethylated; panels Probe SNP and Target CpG SNP.]

Figure 3.4: Ultra-stable CpGs are not the result of SNPs affecting probe hybridization. Left panel shows the percent of 450K probes with a SNP in the probe body. Right panel shows the percent of probes with a SNP at the assayed CpG.

[Figure 3.5 plot: percent ultra-stable (methylated and unmethylated) by chromosome, 1 to 22, X and Y.]

Figure 3.5: Ultra-stable CpGs are distributed evenly across chromosomes. Percent of ultra-stable CpGs out of all probes on each chromosome is shown.

[Figure 3.6 plot: distance to closest TSS (-1000000 to 1000000) for ultra-stable methylated CpGs, ultra-stable unmethylated CpGs, all CpGs and random CpGs (15224).]

Figure 3.6: Ultra-stable CpGs are located closer to TSSs. Distance of CpGs to the closest TSS (regardless of gene directionality) is shown.

3.3 Profiles of regions containing ultra-stable CpG sites

We stratified CGIs and their associated flanking shores and shelves into four categories based on the presence or absence of an ultra-stable CpG. For brevity, following the terminology of Sofer et al. [40], we use the term “resort” to refer to the complex of a CGI and its flanking shores and shelves (Figure 3.9). We created a methylation profile for each resort category by aligning the CGIs, shores and shelves and plotting the mean methylation level of each CpG assayed in the resorts (see legend to Figure 3.10 and methods). As shown in Figure 3.10, an interesting pattern emerges. Resorts that contain at least one methylated and one unmethylated ultra-stable CpG (top panel) have a strikingly high contrast between the low methylation level of the CGI and the highly methylated shores and shelves. In comparison, resorts that lack ultra-stable CpGs do not show this pattern (bottom panel); in such resorts the CGI can be either methylated or unmethylated, as can the shores and shelves. Resorts that have only methylated or only unmethylated ultra-stable sites show an intermediate pattern (middle panels).
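The four-way stratification of resorts described above can be sketched as follows. This is a minimal illustration, not the code used in the study: the function name `resort_category`, the state labels and the example resorts are all hypothetical, and each resort is assumed to be represented by the list of stability calls of its assayed CpGs.

```python
from collections import Counter


def resort_category(cpg_states):
    """Classify one resort from the stability calls of its assayed CpGs.

    cpg_states: iterable of "methylated", "unmethylated" (ultra-stable calls)
    or None (CpG is not ultra-stable). Labels are illustrative assumptions.
    """
    states = set(cpg_states)
    has_methylated = "methylated" in states
    has_unmethylated = "unmethylated" in states
    if has_methylated and has_unmethylated:
        return "both"
    if has_unmethylated:
        return "unmethylated_only"
    if has_methylated:
        return "methylated_only"
    return "none"


# Hypothetical resorts, each listed with the calls of its assayed CpGs.
example_resorts = {
    "resort_A": ["unmethylated", "unmethylated", "methylated", None],
    "resort_B": ["unmethylated", None],
    "resort_C": [None, "methylated"],
    "resort_D": [None, None],
}
category_counts = Counter(
    resort_category(states) for states in example_resorts.values()
)
```

Applied to all resorts on the array, a tally of this kind would give the per-category counts from which the composite methylation profiles are built.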
To get a better sense of the correlation structure of methylation levels across single resorts, we visualized the data at sample level for two characteristic resorts (Figure 3.11). Generally, and in the examples shown, resorts with high contrast between CGI and shore/shelf show a very consistent pattern across samples whereas others do not. By analogy to the previously reported methylation “valleys” and “canyons” [13, 14], we refer to the sharp pattern shown in the top panel of Figure 3.11 as a “ravine”. We note that the genomic positions of ravines do not overlap with canyons or valleys, and that ravines are also smaller (ravines average 785bp of unmethylated region, canyons >3.5kb and valleys >5kb).

[Figure 3.7 plot: CGI count by number of CpGs from the set in a CGI (1 to 15), for ultra-stable CpGs (15224 CpGs) and the mean of random sets of 15224 CpGs.]

Figure 3.7: Ultra-stable CpGs co-occur more than expected by chance. Bars show counts of CGIs containing from 1 to 15 CpGs from the set of ultra-stable CpGs (grey) or random sets of CpGs (green; means of 10 sets are shown with standard error).

[Figure 3.8 plot: percent of CpG sites in each resort feature (N. Shelf, N. Shore, CGI, S. Shore, S. Shelf) for all 450K CpGs, ultra-stable methylated CpGs and ultra-stable unmethylated CpGs.]

Figure 3.8: Ultra-stable CpGs follow known trends of methylated and unmethylated CpGs. Proportions of CpGs in each resort feature from both CpG ultra-stable types as well as all CpGs on the 450K.

[Figure 3.9 diagram: North Shelf (2000bp), North Shore (2000bp), Island (CGI, 935bp), South Shore (2000bp), South Shelf (2000bp), shown relative to a gene body.]

Figure 3.9: Resorts are centered around a CGI (mean size of 935bp) with 4000bp of shore and shelf on either side.

[Figure 3.10 plot: mean beta across 1737 samples by resort feature (N. Shelf, N. Shore, CGI, S. Shore, S. Shelf). Panels: Resorts With Both Ultra-stable CpG Types (85); Resorts With Unmethylated Ultra-stable CpGs (5598); Resorts With Methylated Ultra-stable CpGs (395); Resorts With No Ultra-stable CpGs (21098).]

Figure 3.10: Ultra-stable CpGs allow observation of a unique resort methylation pattern. Composite profiles are shown for all 27176 resorts on the 450K. As CGIs have variable lengths, the position of a CpG within a CGI is shown relative to the length of the CGI it is positioned in. The CGIs are plotted as 935.23 bp (mean length of all CGIs measured on the 450K). Beyond the boundaries of the CGIs on the plot (i.e., start at 0, end at 935.23), the actual distance of each CpG, in base pairs, from the CGI start or end is used. Horizontal lines indicate the CGI, shore and shelf boundaries. The four panels show resorts with both types of ultra-stable CpGs, only ultra-stable unmethylated CpGs, only ultra-stable methylated CpGs and no ultra-stable CpGs.

[Figure 3.11 plot: two 5kb hg19 windows on chr12 (MARS, top; TBX5, bottom) with tracks for methylation, 450K probes, resort features (N. Shelf, N. Shore, CGI, S. Shore, S. Shelf) and CpG density.]

Figure 3.11: Example resorts associated with the genes MARS (top; CGI chr12:57881750-57882035; ravine) and TBX5 (bottom; CGI chr12:114845861-114847650; not ravine) are depicted, with each smoothed line showing the methylation pattern of an individual sample across the resort associated with the gene. Resort feature positions are indicated by coloured labelled bars. Positions of the 450K probes assaying the resorts are indicated with lines; ultra-stable CpGs are highlighted with taller red lines. The histogram shows CpG density for bins of 50bp on a scale of 0-0.2 CpG/bp.
The gene track is extracted from the UCSC Genome Browser (RefSeq track).

3.3.1 Ravine Independence from Previously Defined Methylome Domains

A further extensive comparison of ravines to a number of previously defined methylation domain types shows that ravines represent a novel aspect of the methylome (Table 3.1). Irizarry et al. [10] assembled a list of tissue specific (TDMR) and cancer specific (CDMR) differentially methylated regions. Irizarry et al. [10] defined shores after finding them to be more frequently dynamic than CGIs. We assessed the overlap of TDMRs and CDMRs with ravine shores and uniformly unmethylated shores and observed that ravine shores are more stable and less frequently overlap with differentially methylated shores (Table 3.1). de la Rica et al. [20] assembled reprogramming specific differentially methylated regions (RDMRs). Both ravines and uniformly unmethylated resorts are underrepresented for RDMRs (Table 3.1). Combined, these results show that, as expected given their apparent stability, ravines are not often differentially methylated. Canyons defined in Jeong et al. [14] and valleys defined in Xie et al. [13] frequently encompass uniformly unmethylated resorts, but rarely steep ravines (Table 3.1). Jeong et al. [14] reported “control unmethylated regions” (cUMRs), which show a large overlap with all resorts on the 450K; this could be due to the resorts having generally unmethylated CGIs (Table 3.1).

Stadler et al. [15] identified low methylated and unmethylated regions (LMR, UMR) in the mouse genome. UMRs are expected to overlap with CGIs, and our results confirm that UMRs tend to be orthologous to unmethylated CGIs in human, but not specifically to one class of resort as we have defined them (Table 3.1).

We also assessed the overlap of the resort classes with LREA and LRES domains [16, 17] and observed no over-representation in ravines (Table 3.1).

Table 3.1: Summary of each resort class overlap with previously reported methylome features.
Percent of resorts overlapped by a previously defined domain is shown. The number of domains used for each domain type is shown in brackets. For canyons, LMR and UMR, this is the number of mouse genome (mm9) regions successfully converted to human genome (hg19) coordinates.

Domain type (number used)         Uniformly Unmethylated (1500)   Ravine (1500)   Other (19290)
TDMR/CDMR (12059/1866 regions)    1.73                            0.1             18.53
RDMR (4401 regions)               1.93                            0.08            12.47
Canyons (1092 domains)            25.12                           1.93            3.03
cUMR (11964 regions)              60.33                           61.32           70.8
Valleys (2380 domains)            36.87                           3.6             5.78
LMR (26335 regions)               0.2                             0.2             0.24
UMR (33675 regions)               42.69                           37.07           27.2
LRES (47 domains)                 4.68                            2.87            2.74
LREA (35 domains)                 2.08                            0.64            0.96

Sources listed with the table: Irizarry et al., 2009; Doi et al., 2009; Jeong et al., 2014; Xie et al., 2013; Stadler et al., 2011; Coolen et al., 2010; Bert et al., 2013.

3.3.2 Ravine Confirmation in Independent Samples

To confirm that our findings were not due to some idiosyncrasy of the set of 450K samples or the parameters we used to define ravines, we tested whether the ravines had the same properties in an additional set of 757 samples of similar variety that became available after we started our study (Figure 3.12; see Appendix: Independent GEO series used). The 757 new samples have 16619 ultra-stable unmethylated and 1123 ultra-stable methylated CpGs (7020 unmethylated and 207 methylated CpGs overlap with the original ultra-stable sites). The results show that CGIs we classify as ravines, uniformly unmethylated or “other” have the same features in the new data set, strongly supporting the idea that ravines are stable features of these genomic regions.

3.3.3 Ravines are not a by-product of gene body methylation

Because gene body methylation has been previously reported to be positively correlated with gene expression [41, 42], we further tested whether the super-additive effect we observe could be explained by a ravine being equivalent to a CGI next to a highly methylated gene body.
This appears not to be the case, as ravines are symmetrical with respect to transcription direction, and ravines can be found away from gene bodies in intergenic regions (Figure 3.13).

3.3.4 CGI Categorization by Gene Relation

There are multiple methods of annotating a CGI with a gene association, including annotating a CGI with the gene TSS closest to the CpGs making up the CGI [31], using the position of the CpGs making up a CGI in a gene's body or promoter [11], or using the overlap of an entire CGI with a gene's body or promoter [33]. Each yields slightly different CGI to gene associations. Even with a given method, a CGI can end up associated with more than one gene (Figure 3.14). For this study an inclusive CGI to gene association was used. Genes which overlap a CGI in their promoter or gene body were considered associated with that CGI. An inclusive association was used because the exact role of CGIs and resorts in regulating gene expression is unclear. Using inclusive associations should capture any possible CGI effects on gene expression.

[Figure 3.12 panel A sample counts. Disease: Arthritis (12), Cancer (199), Healthy (546). Tissue: Blood (532), Breast (25), Cartilage (51), Brain (40), Saliva (34), Sperm (26), Epithelial (25), Other (24). Germ layer: Ectoderm (74), Endoderm (33), Mesoderm (650). Panel B: mean beta across 757 GEO samples by resort position (bp, -2500 to 2500) for Uniformly Unmethylated Resorts (1500), Steep Ravines (1500) and Other Resorts (19290).]

Figure 3.12: Ravines are confirmed in independent 450K samples. (A) Counts of new 450K samples by disease, tissue and germ layer used in analysis. (B) Resorts are classified based on steepness in the original samples, but the mean methylation levels shown here are from the new 757 samples.

While CGIs are classically thought of as gene promoter features, only 45% of CGIs are located in the promoter region of a RefSeq gene [33].
We associated the resorts with gene features and found that steep ravine CGIs are significantly overrepresented in promoters and underrepresented in gene bodies (Figure 3.15).

[Figure 3.13 plot: mean beta across 1737 GEO samples by resort position (with respect to adjacent gene direction, -2500 to 2500). Columns: 5 Prime, Gene Body, 3 Prime, Intergenic. Rows: Uniformly Unmethylated Resorts, Ravines, Other Resorts.]

Figure 3.13: The ravine pattern is not due to the shores/shelves being located within highly methylated gene bodies. Columns of plots separate the resorts by the position of a resort in relation to the associated gene. Rows of plots separate resorts by our defined classes. Resort position is distance from the CGI start.

[Figure 3.14 plot: beta values across chr11:64,040,000-64,055,000, with CGIs at chr11:64036875-64037974, chr11:64038964-64039306, chr11:64052607-64053652 and chr11:64055167-64055388, and gene labels BAD and GRP137.]

Figure 3.14: CGI to gene annotations are not all one-to-one. A representative CGI (chr11:64052607-64053652) with multiple possible gene associations. The CGI is located in the promoter regions of both genes, by the definition used in this paper.

[Figure 3.15 plot: percent of resorts by gene relation (Promoter, Intragenic, Intergenic, 3' of Gene) for each resort class (Ravine, Uniformly Unmethylated, All).]

Figure 3.15: Ravines are more frequent in promoters than other resorts. Bars show the percent of gene-associated CGIs located in either promoter, intragenic, 3' of gene, or intergenic regions. Asterisks indicate gene associations where ravines were significantly different.

3.4 Ravines are associated with active transcription

To identify ravines more comprehensively, we quantified the difference between the CGI and shore/shelf methylation levels for all 450K resorts. In this manner we ranked all 27176 resorts assayed on the 450K for their “ravine-ness”, independent of whether they contained an ultra-stable CpG. As depicted in Figure 3.16A, the 1500 resorts with the steepest ravines (mean steepness 0.638) represent the most extreme ravine pattern (hereafter referred to as “steep ravines”), whereas the 1500 unmethylated resorts with the lowest ravine steepness (CGI mean methylation <0.3 and mean steepness 0.097) show a more uniform pattern (hereafter referred to as “uniformly unmethylated resorts”; mean methylation and CpG density of resorts are reported in Table 3.2).

To test whether the high methylation in the shores had an impact on the associated gene expression, we used the ENCODE DNase-seq data [2] as an indirect measure of non tissue-specific transcriptional activity. Unmethylated CGIs are generally associated with high transcriptional activity at their associated gene [7]. As expected, uniformly unmethylated resorts show significantly (p<0.001, Wilcoxon RS test) higher DNase sensitivity than all other resorts. Interestingly, the steep ravines show significantly (p<0.001, Wilcoxon RS test) higher DNase sensitivity than the uniform resorts (Figure 3.16B). Since the main difference between the steep ravines and the uniformly unmethylated resorts is the highly methylated shores and shelves, this suggests that high methylation on the edges of CGIs facilitates a transcriptionally permissive state. The relationship between high ravine steepness and high transcriptional activity is supported by analysis of a diverse set of microarray expression experiments (see methods). Averaged across expression data sets, the expression of genes associated with steep ravines is significantly higher (p<0.001, Wilcoxon RS test) than for the uniformly unmethylated resorts (Figure 3.16C).

We next tested whether the steepness of ravines was predictive of gene expression, beyond what is possible using the methylation level of the CGI alone, using a regression approach (see methods). Gene expression variance (R2) explained by CGI methylation level alone is 4.6%, comparable to previous reports [43, 44] even though our expression and methylation data come from different sources.
Variance in expression levels explained by resort steepness alone is 3.4%. In combination, ravine steepness and CGI methylation level explain 9.8% of the expression variance, significantly greater than would be expected if they were purely additive (significant interaction, p<0.001, ANOVA).

3.4.1 RNA Polymerase II Binding at Resorts

As another measure of the transcription level of the genes associated with resorts, we used the data ENCODE collected on transcription factor binding. While many of the transcription factors could be of interest in comparisons between resort classes, we focused on the RNA polymerase subunit POLR2A as confirmation that ravines are associated with higher transcriptional activity. Ravine CGIs did in fact show significantly higher POLR2A scores (p<0.001, Wilcoxon RS test; Figure 3.17).

[Figure 3.16 plot: (A) mean beta across 1737 GEO samples by resort position (bp, -2500 to 2500) for Uniformly Unmethylated Resorts (1500), Steep Ravines (1500) and Other Resorts (19290); (B) DNase hypersensitivity score by resort class; (C) mean relative expression of each associated gene across 1850 Gemma datasets for Uniformly Unmethylated Resort Genes (1255 genes), Steep Ravine Resort Genes (1389 genes) and Other Resort Genes (12165 genes).]

Figure 3.16: Ravines are associated with higher transcriptional activity. (A) Resorts are classified based on steepness, with the steepest 1500 resorts forming the steep ravine class and the least steep unmethylated resorts forming the uniformly unmethylated class. (B) Distribution of DNase sensitivity scores for each resort class.
(C) Distribution of gene expression levels of all genes associated (5' promoter or intragenic) with a CGI in the uniformly unmethylated, ravine or other resort classes. Density of DNase scores and expression levels is shown by the violin plots behind the box plots.

Table 3.2: Summary of the features of the resort classes. 450K density represents the probes per base pair in a feature. Genomic CpG density represents the number of CpGs per base pair present, whether or not they are measured by the 450K. Mean methylation is the mean beta of all probes in a feature across 1737 samples. Methylation variability is the standard deviation of probe beta values across 1737 samples in a feature.

Resort Feature   Resort Class             450K Density   Genomic CpG Density   Mean Methylation   Methylation Variability
CGI              Uniformly Unmethylated   0.009          0.098                 0.146              1.02E-003
CGI              Ravine                   0.009          0.107                 0.081              6.47E-004
CGI              Other                    0.009          0.099                 0.25               8.60E-004
Shores           Uniformly Unmethylated   0.003          0.025                 0.187              1.49E-003
Shores           Ravine                   0.002          0.02                  0.641              3.68E-003
Shores           Other                    0.002          0.021                 0.465              1.09E-003
Shelves          Uniformly Unmethylated   0.001          0.018                 0.438              5.06E-003
Shelves          Ravine                   0.002          0.015                 0.798              1.68E-003
Shelves          Other                    0.001          0.017                 0.717              8.48E-004

[Figure 3.17 plot: POLR2A peak score (log10) and score density for steep ravines, uniformly unmethylated resorts and other resorts.]

Figure 3.17: RNA Polymerase II binding activity is higher in ravine CGIs. Box plots show the distribution of POLR2A scores in CGIs on a log scale. Violin plots show the density of the scores.

[Figure 3.18 plot: histone peak score (0 to 50000) and score density for H3K27me3 and H3K4me1 in Uniformly Unmethylated Resorts (1500), Steep Ravines (1500) and Other Resorts (19290).]

Figure 3.18: Uniformly unmethylated resort CGIs show differences in histone marks. Data from 2 of the 12 ENCODE histone marks are shown (H3K27me3 and H3K4me1). Histone peak scores >50 000 (max in score data 165 000) are not shown so the differences between resort classes are noticeable on the scale.
Grey violin plots show the density of histone scores in each resort class.

3.4.2 Histone Modifications at Resorts

Using the 12 ENCODE histone modification data sets, we examined the distribution of histone marks on resorts. Only two histone marks were significantly different between resort classes, H3K27me3 and H3K4me1. Uniformly unmethylated resorts have significantly less H3K27me3 and significantly more H3K4me1 (p<0.001, Wilcoxon RS test; Figure 3.18). Given that H3K27me3 is a repressive mark and H3K4me1 is a mark of regulatory elements or TSS [2], this does not explain, and is somewhat opposite to, the gene expression and DNase sensitivity results. The role of ravines in permitting gene expression may therefore involve other histone modifications or perhaps transcription factor binding.

3.4.3 Polycomb Binding Sites at Resorts

Lee et al. [36] identified the polycomb binding sites in the human genome by measuring SUZ12 binding. The SUZ12 sites were significantly closer to uniformly unmethylated resort CGIs (p<0.001, Wilcoxon RS test). Steep ravine CGIs were not significantly different in distance from SUZ12 sites than all CGIs.

3.5 Ravines are associated with housekeeping genes

Taken together, the consistency of the ravine pattern, the high DNase sensitivity and the high associated gene expression across a variety of tissues and conditions suggest that the genes associated with steep ravines are universally active in human cells. Indeed, we find that steep ravine-associated genes are significantly associated with a curated set of housekeeping genes (p<0.001, Fisher's exact test; Figure 3.19), but not with tissue-specific genes [45]. In contrast, the uniformly unmethylated resorts are not significantly associated with either set of genes (Figure 3.19).
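The expected overlap used as the baseline in this comparison (the mean overlap of random gene lists with a reference list) can be sketched as below. This is a minimal illustration under stated assumptions rather than the analysis code: the function names, the toy gene identifiers and the seed are hypothetical, and the Fisher's exact test step is not reproduced here.

```python
import random


def percent_overlap(gene_list, reference_set):
    """Percent of genes in gene_list that are also in reference_set."""
    if not gene_list:
        return 0.0
    hits = sum(1 for gene in gene_list if gene in reference_set)
    return 100.0 * hits / len(gene_list)


def expected_overlap(background, list_size, reference_set, n_lists=100, seed=0):
    """Mean percent overlap of n_lists random gene lists of length list_size.

    background is the pool of genes to sample from; the seed is only there
    to make this sketch reproducible.
    """
    rng = random.Random(seed)
    overlaps = [
        percent_overlap(rng.sample(background, list_size), reference_set)
        for _ in range(n_lists)
    ]
    return sum(overlaps) / n_lists


# Toy data: 1000 hypothetical genes, the first 100 flagged as housekeeping.
toy_background = [f"gene{i}" for i in range(1000)]
toy_housekeeping = set(toy_background[:100])
toy_expected = expected_overlap(toy_background, 200, toy_housekeeping)
```

The observed overlap of a resort class gene list with the same reference list can then be compared against this expected value.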
Uniformly unmethylated resorts are, however, overrepresented for Gene Ontology (GO) groups for development and Disease Ontology (DO) groups for development-associated diseases (see Appendix: Uniformly Unmethylated Resort GO and DO Enrichment) [37]. Ravine-associated genes had no significant enrichment for GO groups or diseases. This suggests that ravines, which are maintained across tissues and conditions, may be regulatory features associated with the expression of ubiquitous genes, while uniformly unmethylated resort-associated genes function in development.

[Figure 3.19 plot: percent of genes overlapping the housekeeping and tissue specific gene categories for genes associated with Steep Ravines (1573 genes) and Uniformly Unmethylated Resorts (1465 genes), alongside random gene lists.]

Figure 3.19: Steep ravine genes are overrepresented for housekeeping genes. Dark bars show the percent overlap of genes associated with steep ravines or uniformly unmethylated resorts with a list of housekeeping genes (2064 genes) or tissue specific genes (2293 genes). Light bars show the mean overlap of the housekeeping and tissue specific lists with random gene lists from all 450K resort-associated genes.

Chapter 4
Discussion

Our contributions in this paper are two-fold. First, we identified a subset of CpGs in the human genome that appear to be highly stable in their methylated or unmethylated state, across diverse developmental states and cell types. Second, we identified a subclass of CGIs that have an unusually high contrast between the methylation state of the CGI and the flanking shores. We found that such CGIs tend to be found near highly expressed genes.
These results provide new insight into the structure and function of the human DNA methylome, and uncover a possibly more specific function of shores than has been previously proposed, linking them to the regulation of genes which are “always on”.

The existence of CpGs with ultra-stable methylation states reveals a previously undocumented feature of the human methylome. While some level of stability has been previously noted in differentiated somatic cells, the dramatic changes in methylation during development and differences between tissues [46] suggested that much of the methylome is dynamic. In contrast, our analysis suggests that a subset of the human methylome is highly stable across differentiated cell types, cancer cells, embryonic stem cells, induced pluripotent stem cells, trophoblasts and germ cells. The consistency of the CpGs across our developmental and germ cell samples suggests that the state of the sites we found to be ultra-stable is established early in development and then maintained in all studied differentiated tissues.

We cannot rule out the possibility that some of the ultra-stable CpGs we identified will have a different state in some cell type or physiological state not yet examined. However, the data set we have assembled covers many of the states previously identified with variability, including between tissues [21], developmental states [9] and diseases [47, 48]. Indeed, we suspect there are many other CpGs in the human genome that show unusual stability but were not revealed by our study. Our analysis used a very stringent threshold, disallowing even single exceptions; additional CpGs are “nearly stable”. Furthermore, the 450K array does not assay most of the CpGs in the genome, some of which are likely to also be ultra-stable. As there are only a few samples of certain tissue types, we could not assess the potential existence of tissue-specific ultra-stable CpGs.
Future experiments to assess additional CpGs and larger numbers and varieties of samples will help further elucidate the scope of ultra-stable CpGs.

Our association of ultra-stable CpGs with TSSs and resorts (94.5% in resorts) agrees with the previous observations that differentially methylated (that is, dynamic) regions are primarily located far from TSSs, outside of resorts [21]. However, we note that the 450K array is biased towards resort CpGs. The small subset of ultra-stable CpGs we observe that are not in resorts (5.5%) hints that many other ultra-stable CpGs may be outside resorts. Because most of the CpGs we identified are in or near resorts, we focused our analysis on their potential roles in resort function. Regarding the ravine pattern, we note that some degree of contrast between shores and CGIs is expected: the majority of CpGs are known to be methylated, with CGIs being the exception. However, we show that this contrast is not observed in all resorts, and is particularly striking in resorts that also contain ultra-stable unmethylated CpGs. The association of steep ravines with higher gene expression levels, high DNase I sensitivity and high POLR2A occupancy provides a novel and biologically meaningful classification of human CGIs that complements earlier efforts [49].

In our study (as in many others) we attempted to relate the methylation state of a region to the expression level of nearby genes. However, it is not clear how to tell if a CGI is in a position to influence (or be influenced by) a gene. The examination of ravines may provide some insight. The classic association of genes with CGIs is based on the presence of a CGI in a gene's 5′ promoter. This association is sufficiently strong that it was originally used to annotate human genes [50]. However, many CGIs do not appear to function as 5′ gene promoters [33] (Supplementary Figure S9). In contrast, the ravine CGIs are more strongly overrepresented in 5′ promoter regions of genes.
Thus ravines fit the classic CGI archetype: an unmethylated CGI in the 5′ promoter of a highly expressed gene. Ravines share an additional feature with the classical CGI, an association with housekeeping genes [51, 52]. The image of unmethylated 5′ promoter CGIs leading to gene expression may be more specific to ravines and not true for resorts and CGIs in general. A stable ravine pattern at many 5′ promoters supports the emerging idea that it is crucial to examine non-promoter CpGs and CGIs in differential methylation analysis, as non-promoter regions may have more dynamic methylation than 5′ promoter regions.

While CGIs are the classic unit of focus for human methylation studies, other groups have focused on identifying other types of methylation domains [13–17], which have some overlap with the CGI classes we identified. Specifically, uniformly unmethylated resorts (non-ravines) are encompassed by canyons and valleys [13, 14] more than other resorts, suggesting that uniformly unmethylated resorts, canyons and valleys may be related domains (see Supplement). Subsets of canyons and valleys lack H3K27me3, similar to uniformly unmethylated resorts. Additionally, uniformly unmethylated resorts, canyons and valleys are all enriched for genes which function in development. Thus uniformly unmethylated resorts confirm canyons and valleys as features of the methylome in a greater variety of tissues. Ravines, on the other hand, minimally overlap with canyons, valleys or most other previously defined methylation domains. Additionally, ravines show no obvious relation to histone marks. Other regulatory mechanisms are likely to be involved in explaining the association of ravines with stable and high gene expression.

To our knowledge the ravine pattern has not been previously reported, and there is very little information available on the potential function of resort shores and shelves. One potentially relevant study, from Wu et al.
[53], found that the shoresand shelves of unmethylated 5′ promoter CGIs, are associated with high Dnmt3aactivity in the mouse genome. Wu et al. also found Dnmt3a- shore and shelf DNAmethylation is associated with increased gene expression. We can speculate thatthe regions identified by Wu et al. may correspond to ravines, but we were unableto confirm this with the information available. We speculate that possible Dmnt3aactivity at steep ravines′ shores and shelves could function to antagonize the bind-39ing of transcriptional repressors. The previous work on Dnmt3a binding at genepromoters also found that shore and shelf methylation in proximal promoters an-tagonized polycomb protein-binding [53]. Interestingly, uniformly unmethylatedresorts had a higher association with polycomb binding sites than steep ravines[36](see supplement). On the other hand, we did not find evidence that ravinesare associated with low H3k27me3, as would be predicted from polycomb bindinginhibition (see Supplement). To resolve the function of ravines it will be importantto further explore their relationship with polycomb binding and other regulatorymechanisms.An alternate model for ravine function is that transcription factors that bindmethylated CpGs could be directly affected by shore and shelf methylation. How-ever, most studies of methyl-CpG-binding proteins show they function to repressgene expression, agreeing with the classical model of any methylation in promot-ers being repressive [7]. There is, however, recent evidence of the methyl-CpG-binding protein MeCP2 having transcription activating function at promoters withmethylated CpGs [54]. A model where MeCP2 binds methylated shores at genepromoters and performs its transcription activation function could explain the as-sociation of methylated shores and shelves with high gene expression.4.1 ConclusionIn summary, ravines are a novel class of CGI, distinct from previously identifiedmethylome domains. 
The consistency of ultra-stable CpGs and ravines across samples suggests they are a stable component of the human methylome. While the ravines suggest that CGI shore methylation is stably associated with high gene expression, other work has shown some CGI shore methylation to be highly dynamic. Both results support the overall importance of shores for gene expression. The presence of ravines in the 5′ gene promoters of many actively transcribed genes supports a complex role for methylation in both activating and repressing expression.

Bibliography

[1] P. A. Jones. Functions of DNA methylation: islands, start sites, gene bodies and beyond. Nature reviews. Genetics, 13(7):484–492, May 29 2012.

[2] ENCODE Project Consortium, B. E. Bernstein, E. Birney, I. Dunham, E. D. Green, C. Gunter, and M. Snyder. An integrated encyclopedia of DNA elements in the human genome. Nature, 489(7414):57–74, Sep 6 2012.

[3] M. Gardiner-Garden and M. Frommer. CpG islands in vertebrate genomes. Journal of Molecular Biology, 196(2):261–282, Jul 20 1987.

[4] I. P. Ioshikhes and M. Q. Zhang. Large-scale human promoter mapping using CpG islands. Nature genetics, 26(1):61–63, Sep 2000.

[5] S. Saxonov, P. Berg, and D. L. Brutlag. A genome-wide analysis of CpG dinucleotides in the human genome distinguishes two distinct classes of promoters. Proceedings of the National Academy of Sciences of the United States of America, 103(5):1412–1417, Jan 31 2006.

[6] M. Esteller. CpG island hypermethylation and tumor suppressor genes: a booming present, a brighter future. Oncogene, 21(35):5427–5440, Aug 12 2002.

[7] A. Bird. DNA methylation patterns and epigenetic memory. Genes & development, 16(1):6–21, Jan 1 2002.

[8] J. M. Greally. Bidding the CpG island goodbye. eLife, 2:e00593, 2013.

[9] A. Doi, I. H. Park, B. Wen, P. Murakami, M. J. Aryee, R. Irizarry, B. Herb, C. Ladd-Acosta, J. Rho, S. Loewer, J. Miller, T. Schlaeger, G. Q. Daley, and A. P. Feinberg. Differential methylation of tissue- and cancer-specific CpG island shores distinguishes human induced pluripotent stem cells, embryonic stem cells and fibroblasts. Nature genetics, 41(12):1350–1353, Dec 2009.

[10] R. A. Irizarry, C. Ladd-Acosta, B. Wen, Z. Wu, C. Montano, P. Onyango, H. Cui, K. Gabo, M. Rongione, M. Webster, H. Ji, J. B. Potash, S. Sabunciyan, and A. P. Feinberg. The human colon cancer methylome shows similar hypo- and hypermethylation at conserved tissue-specific CpG island shores. Nature genetics, 41(2):178–186, Feb 2009.

[11] M. Bibikova, B. Barnes, C. Tsan, V. Ho, B. Klotzle, J. M. Le, D. Delano, L. Zhang, G. P. Schroth, K. L. Gunderson, J. B. Fan, and R. Shen. High density DNA methylation array with single CpG site resolution. Genomics, 98(4):288–295, Oct 2011.

[12] J. Sandoval, H. Heyn, S. Moran, J. Serra-Musach, M. A. Pujana, M. Bibikova, and M. Esteller. Validation of a DNA methylation microarray for 450,000 CpG sites in the human genome. Epigenetics : official journal of the DNA Methylation Society, 6(6):692–702, Jun 2011.

[13] W. Xie, M. D. Schultz, R. Lister, Z. Hou, N. Rajagopal, P. Ray, J. W. Whitaker, S. Tian, R. D. Hawkins, D. Leung, H. Yang, T. Wang, A. Y. Lee, S. A. Swanson, J. Zhang, Y. Zhu, A. Kim, J. R. Nery, M. A. Urich, S. Kuan, C. A. Yen, S. Klugman, P. Yu, K. Suknuntha, N. E. Propson, H. Chen, L. E. Edsall, U. Wagner, Y. Li, Z. Ye, A. Kulkarni, Z. Xuan, W. Y. Chung, N. C. Chi, J. E. Antosiewicz-Bourget, I. Slukvin, R. Stewart, M. Q. Zhang, W. Wang, J. A. Thomson, J. R. Ecker, and B. Ren. Epigenomic analysis of multilineage differentiation of human embryonic stem cells. Cell, 153(5):1134–1148, May 23 2013.

[14] M. Jeong, D. Sun, M. Luo, Y. Huang, G. A. Challen, B. Rodriguez, X. Zhang, L. Chavez, H. Wang, R. Hannah, S. B. Kim, L. Yang, M. Ko, R. Chen, B. Gottgens, J. S. Lee, P. Gunaratne, L. A. Godley, G. J. Darlington, A. Rao, W. Li, and M. A. Goodell. Large conserved domains of low DNA methylation maintained by Dnmt3a. Nature genetics, 46(1):17–23, Jan 2014.

[15] M. B. Stadler, R. Murr, L. Burger, R. Ivanek, F. Lienert, A. Scholer, E. van Nimwegen, C. Wirbelauer, E. J. Oakeley, D. Gaidatzis, V. K. Tiwari, and D. Schubeler. DNA-binding factors shape the mouse methylome at distal regulatory regions. Nature, 480(7378):490–495, Dec 14 2011.

[16] S. A. Bert, M. D. Robinson, D. Strbenac, A. L. Statham, J. Z. Song, T. Hulf, R. L. Sutherland, M. W. Coolen, C. Stirzaker, and S. J. Clark. Regional activation of the cancer genome by long-range epigenetic remodeling. Cancer cell, 23(1):9–22, Jan 14 2013.

[17] M. W. Coolen, C. Stirzaker, J. Z. Song, A. L. Statham, Z. Kassir, C. S. Moreno, A. N. Young, V. Varma, T. P. Speed, M. Cowley, P. Lacaze, W. Kaplan, M. D. Robinson, and S. J. Clark. Consolidation of the cancer genome into domains of repressive chromatin by long-range epigenetic silencing (LRES) reduces transcriptional plasticity. Nature cell biology, 12(3):235–246, Mar 2010.

[18] W. Mayer, A. Niveleau, J. Walter, R. Fundele, and T. Haaf. Demethylation of the zygotic paternal genome. Nature, 403(6769):501–502, Feb 3 2000.

[19] A. P. Feinberg and B. Tycko. The history of cancer epigenetics. Nature reviews. Cancer, 4(2):143–153, Feb 2004.

[20] L. de la Rica, J. M. Urquiza, D. Gomez-Cabrero, A. B. Islam, N. Lopez-Bigas, J. Tegner, R. E. Toes, and E. Ballestar. Identification of novel markers in rheumatoid arthritis through integrated analysis of DNA methylation and microRNA expression. Journal of Autoimmunity, 41:6–16, Mar 2013.

[21] M. J. Ziller, H. Gu, F. Muller, J. Donaghey, L. T. Tsai, O. Kohlbacher, P. L. De Jager, E. D. Rosen, D. A. Bennett, B. E. Bernstein, A. Gnirke, and A. Meissner. Charting a dynamic DNA methylation landscape of the human genome. Nature, 500(7463):477–481, Aug 22 2013.

[22] R. Edgar, M. Domrachev, and A. E. Lash. Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic acids research, 30(1):207–210, Jan 1 2002.

[23] S. Davis and P. S. Meltzer. GEOquery: a bridge between the Gene Expression Omnibus (GEO) and BioConductor. Bioinformatics (Oxford, England), 23(14):1846–1847, Jul 15 2007.

[24] M. Lechner, T. Fenton, J. West, G. Wilson, A. Feber, S. Henderson, C. Thirlwell, H. K. Dibra, A. Jay, L. Butcher, A. R. Chakravarthy, F. Gratrix, N. Patel, F. Vaz, P. O’Flynn, N. Kalavrezos, A. E. Teschendorff, C. Boshoff, and S. Beck. Identification and functional validation of HPV-mediated hypermethylation in head and neck squamous cell carcinoma. Genome medicine, 5(2):15, Feb 5 2013.

[25] S. Turcan, D. Rohle, A. Goenka, L. A. Walsh, F. Fang, E. Yilmaz, C. Campos, A. W. Fabius, C. Lu, P. S. Ward, C. B. Thompson, A. Kaufman, O. Guryanova, R. Levine, A. Heguy, A. Viale, L. G. Morris, J. T. Huse, I. K. Mellinghoff, and T. A. Chan. IDH1 mutation is sufficient to establish the glioma hypermethylator phenotype. Nature, 483(7390):479–483, Feb 15 2012.

[26] H. C. Tsai, H. Li, L. Van Neste, Y. Cai, C. Robert, F. V. Rassool, J. J. Shin, K. M. Harbom, R. Beaty, E. Pappou, J. Harris, R. W. Yen, N. Ahuja, M. V. Brock, V. Stearns, D. Feller-Kopman, L. B. Yarmus, Y. C. Lin, A. L. Welm, J. P. Issa, I. Minn, W. Matsui, Y. Y. Jang, S. J. Sharkis, S. B. Baylin, and C. A. Zahnow. Transient low doses of DNA-demethylating agents exert durable antitumor effects on hematological and epithelial tumor cells. Cancer cell, 21(3):430–446, Mar 20 2012.

[27] R. S. Alisch, T. Wang, P. Chopra, J. Visootsak, K. N. Conneely, and S. T. Warren. Genome-wide analysis validates aberrant methylation in fragile X syndrome is specific to the FMR1 locus. BMC medical genetics, 14:18, Jan 29 2013.

[28] T. Benaglia, D. Chauveau, D. R. Hunter, and D. Young. mixtools: An R package for analyzing finite mixture models. J. Stat. Softw, 32(6), 2009.

[29] A. Akalin, F. E. Garrett-Bakelman, M. Kormaksson, J. Busuttil, L. Zhang, I. Khrebtukova, T. A. Milne, Y. Huang, D. Biswas, J. L. Hess, C. D. Allis, R. G. Roeder, P. J. Valk, B. Lowenberg, R. Delwel, H. F. Fernandez, E. Paietta, M. S. Tallman, G. P. Schroth, C. E. Mason, A. Melnick, and M. E. Figueroa. Base-pair resolution DNA methylation sequencing reveals profoundly divergent epigenetic landscapes in acute myeloid leukemia. PLoS genetics, 8(6):e1002781, 2012.

[30] K. E. Varley, J. Gertz, K. M. Bowling, S. L. Parker, T. E. Reddy, F. Pauli-Behn, M. K. Cross, B. A. Williams, J. A. Stamatoyannopoulos, G. E. Crawford, D. M. Absher, B. J. Wold, and R. M. Myers. Dynamic DNA methylation across diverse human cell lines and tissues. Genome research, 23(3):555–567, Mar 2013.

[31] M. E. Price, A. M. Cotton, L. L. Lam, P. Farre, E. Emberly, C. J. Brown, W. P. Robinson, and M. S. Kobor. Additional annotation enhances potential for biologically-relevant analysis of the Illumina Infinium HumanMethylation450 BeadChip array. Epigenetics & chromatin, 6(1):4, Mar 3 2013.

[32] A. S. Hinrichs, D. Karolchik, R. Baertsch, G. P. Barber, G. Bejerano, H. Clawson, M. Diekhans, T. S. Furey, R. A. Harte, F. Hsu, J. Hillman-Jackson, R. M. Kuhn, J. S. Pedersen, A. Pohl, B. J. Raney, K. R. Rosenbloom, A. Siepel, K. E. Smith, C. W. Sugnet, A. Sultan-Qurraie, D. J. Thomas, H. Trumbower, R. J. Weber, M. Weirauch, A. S. Zweig, D. Haussler, and W. J. Kent. The UCSC genome browser database: update 2006. Nucleic acids research, 34(Database issue):D590–8, Jan 1 2006.

[33] A. K. Maunakea, R. P. Nagarajan, M. Bilenky, T. J. Ballinger, C. D’Souza, S. D. Fouse, B. E. Johnson, C. Hong, C. Nielsen, Y. Zhao, G. Turecki, A. Delaney, R. Varhol, N. Thiessen, K. Shchors, V. M. Heine, D. H. Rowitch, X. Xing, C. Fiore, M. Schillebeeckx, S. J. Jones, D. Haussler, M. A. Marra, M. Hirst, T. Wang, and J. F. Costello. Conserved role of intragenic DNA methylation in regulating alternative promoters. Nature, 466(7303):253–257, Jul 8 2010.

[34] M. Guttman, I. Amit, M. Garber, C. French, M. F. Lin, D. Feldser, M. Huarte, O. Zuk, B. W. Carey, J. P. Cassady, M. N. Cabili, R. Jaenisch, T. S. Mikkelsen, T. Jacks, N. Hacohen, B. E. Bernstein, M. Kellis, A. Regev, J. L. Rinn, and E. S. Lander. Chromatin signature reveals over a thousand highly conserved large non-coding RNAs in mammals. Nature, 458(7235):223–227, Mar 12 2009.

[35] A. Zoubarev, K. M. Hamer, K. D. Keshav, E. L. McCarthy, J. R. Santos, T. Van Rossum, C. McDonald, A. Hall, X. Wan, R. Lim, J. Gillis, and P. Pavlidis. Gemma: a resource for the reuse, sharing and meta-analysis of expression profiling data. Bioinformatics (Oxford, England), 28(17):2272–2273, Sep 1 2012.

[36] T. I. Lee, R. G. Jenner, L. A. Boyer, M. G. Guenther, S. S. Levine, R. M. Kumar, B. Chevalier, S. E. Johnstone, M. F. Cole, K. Isono, H. Koseki, T. Fuchikami, K. Abe, H. L. Murray, J. P. Zucker, B. Yuan, G. W. Bell, E. Herbolsheimer, N. M. Hannett, K. Sun, D. T. Odom, A. P. Otte, T. L. Volkert, D. P. Bartel, D. A. Melton, D. K. Gifford, R. Jaenisch, and R. A. Young. Control of developmental regulators by polycomb in human embryonic stem cells. Cell, 125(2):301–313, Apr 21 2006.

[37] E. Portales-Casamar, C. Ch’ng, F. Lui, N. St-Georges, A. Zoubarev, A. Y. Lai, M. Lee, C. Kwok, W. Kwok, L. Tseng, and P. Pavlidis. Neurocarta: aggregating and sharing disease-gene relations for the neurosciences. BMC genomics, 14:129, Feb 26 2013.

[38] H. K. Lee, W. Braynen, K. Keshav, and P. Pavlidis. ErmineJ: tool for functional analysis of gene expression data sets. BMC bioinformatics, 6:269, Nov 9 2005.

[39] J. Gillis and P. Pavlidis. The impact of multifunctional genes on "guilt by association" analysis. PloS one, 6(2):e17258, Feb 18 2011.

[40] T. Sofer, E. D. Schifano, J. A. Hoppin, L. Hou, and A. A. Baccarelli. A-clustering: a novel method for the detection of co-regulated methylation regions, and regions associated with exposure. Bioinformatics (Oxford, England), 29(22):2884–2891, Nov 15 2013.

[41] R. Lister, M. Pelizzola, R. H. Dowen, R. D. Hawkins, G. Hon, J. Tonti-Filippini, J. R. Nery, L. Lee, Z. Ye, Q. M. Ngo, L. Edsall, J. Antosiewicz-Bourget, R. Stewart, V. Ruotti, A. H. Millar, J. A. Thomson, B. Ren, and J. R. Ecker. Human DNA methylomes at base resolution show widespread epigenomic differences. Nature, 462(7271):315–322, Nov 19 2009.

[42] M. P. Ball, J. B. Li, Y. Gao, J. H. Lee, E. M. LeProust, I. H. Park, B. Xie, G. Q. Daley, and G. M. Church. Targeted and genome-scale strategies reveal gene-body methylation signatures in human cells. Nature biotechnology, 27(4):361–368, Apr 2009.

[43] L. L. Lam, E. Emberly, H. B. Fraser, S. M. Neumann, E. Chen, G. E. Miller, and M. S. Kobor. Factors underlying variable DNA methylation in a human community cohort. Proceedings of the National Academy of Sciences of the United States of America, 109 Suppl 2:17253–17260, Oct 16 2012.

[44] K. R. van Eijk, S. de Jong, M. P. Boks, T. Langeveld, F. Colas, J. H. Veldink, C. G. de Kovel, E. Janson, E. Strengman, P. Langfelder, R. S. Kahn, L. H. van den Berg, S. Horvath, and R. A. Ophoff. Genetic analysis of DNA methylation and gene expression levels in whole blood of healthy human subjects. BMC genomics, 13:636, Nov 17 2012.

[45] C. W. Chang, W. C. Cheng, C. R. Chen, W. Y. Shu, M. L. Tsai, C. L. Huang, and I. C. Hsu. Identification of human housekeeping genes and tissue-selective genes by microarray meta-analysis. PloS one, 6(7):e22859, 2011.

[46] J. A. Hackett and M. A. Surani. DNA methylation dynamics during the mammalian life cycle. Philosophical transactions of the Royal Society of London. Series B, Biological sciences, 368(1609):20110328, Jan 5 2013.

[47] G. C. Hon, R. D. Hawkins, O. L. Caballero, C. Lo, R. Lister, M. Pelizzola, A. Valsesia, Z. Ye, S. Kuan, L. E. Edsall, A. A. Camargo, B. J. Stevenson, J. R. Ecker, V. Bafna, R. L. Strausberg, A. J. Simpson, and B. Ren. Global DNA hypomethylation coupled to repressive chromatin domain formation and gene silencing in breast cancer. Genome research, 22(2):246–258, Feb 2012.

[48] D. Sproul and R. R. Meehan. Genomic insights into cancer-associated aberrant CpG island hypermethylation. Briefings in functional genomics, 12(3):174–190, May 2013.

[49] S. Saxonov, P. Berg, and D. L. Brutlag. A genome-wide analysis of CpG dinucleotides in the human genome distinguishes two distinct classes of promoters. Proceedings of the National Academy of Sciences of the United States of America, 103(5):1412–1417, Jan 31 2006.

[50] J. C. Venter, M. D. Adams, E. W. Myers, P. W. Li, R. J. Mural, G. G. Sutton, H. O. Smith, M. Yandell, C. A. Evans, R. A. Holt, J. D. Gocayne, P. Amanatides, R. M. Ballew, D. H. Huson, J. R. Wortman, Q. Zhang, C. D. Kodira, X. H. Zheng, L. Chen, M. Skupski, G. Subramanian, P. D. Thomas, J. Zhang, G. L. Gabor Miklos, C. Nelson, S. Broder, A. G. Clark, J. Nadeau, V. A. McKusick, N. Zinder, A. J. Levine, R. J. Roberts, M. Simon, C. Slayman, M. Hunkapiller, R. Bolanos, A. Delcher, I. Dew, D. Fasulo, M. Flanigan, L. Florea, A. Halpern, S. Hannenhalli, S. Kravitz, S. Levy, C. Mobarry, K. Reinert, K. Remington, J. Abu-Threideh, E. Beasley, K. Biddick, V. Bonazzi, R. Brandon, M. Cargill, I. Chandramouliswaran, R. Charlab, K. Chaturvedi, Z. Deng, V. Di Francesco, P. Dunn, K. Eilbeck, C. Evangelista, A. E. Gabrielian, W. Gan, W. Ge, F. Gong, Z. Gu, P. Guan, T. J. Heiman, M. E. Higgins, R. R. Ji, Z. Ke, K. A. Ketchum, Z. Lai, Y. Lei, Z. Li, J. Li, Y. Liang, X. Lin, F. Lu, G. V. Merkulov, N. Milshina, H. M. Moore, A. K. Naik, V. A. Narayan, B. Neelam, D. Nusskern, D. B. Rusch, S. Salzberg, W. Shao, B. Shue, J. Sun, Z. Wang, A. Wang, X. Wang, J. Wang, M. Wei, R. Wides, C. Xiao, C. Yan, A. Yao, J. Ye, M. Zhan, W. Zhang, H. Zhang, Q. Zhao, L. Zheng, F. Zhong, W. Zhong, S. Zhu, S. Zhao, D. Gilbert, S. Baumhueter, G. Spier, C. Carter, A. Cravchik, T. Woodage, F. Ali, H. An, A. Awe, D. Baldwin, H. Baden, M. Barnstead, I. Barrow, K. Beeson, D. Busam, A. Carver, A. Center, M. L. Cheng, L. Curry, S. Danaher, L. Davenport, R. Desilets, S. Dietz, K. Dodson, L. Doup, S. Ferriera, N. Garg, A. Gluecksmann, B. Hart, J. Haynes, C. Haynes, C. Heiner, S. Hladun, D. Hostin, J. Houck, T. Howland, C. Ibegwam, J. Johnson, F. Kalush, L. Kline, S. Koduru, A. Love, F. Mann, D. May, S. McCawley, T. McIntosh, I. McMullen, M. Moy, L. Moy, B. Murphy, K. Nelson, C. Pfannkoch, E. Pratts, V. Puri, H. Qureshi, M. Reardon, R. Rodriguez, Y. H. Rogers, D. Romblad, B. Ruhfel, R. Scott, C. Sitter, M. Smallwood, E. Stewart, R. Strong, E. Suh, R. Thomas, N. N. Tint, S. Tse, C. Vech, G. Wang, J. Wetter, S. Williams, M. Williams, S. Windsor, E. Winn-Deen, K. Wolfe, J. Zaveri, K. Zaveri, J. F. Abril, R. Guigo, M. J. Campbell, K. V. Sjolander, B. Karlak, A. Kejariwal, H. Mi, B. Lazareva, T. Hatton, A. Narechania, K. Diemer, A. Muruganujan, N. Guo, S. Sato, V. Bafna, S. Istrail, R. Lippert, R. Schwartz, B. Walenz, S. Yooseph, D. Allen, A. Basu, J. Baxendale, L. Blick, M. Caminha, J. Carnes-Stine, P. Caulk, Y. H. Chiang, M. Coyne, C. Dahlke, A. Mays, M. Dombroski, M. Donnelly, D. Ely, S. Esparham, C. Fosler, H. Gire, S. Glanowski, K. Glasser, A. Glodek, M. Gorokhov, K. Graham, B. Gropman, M. Harris, J. Heil, S. Henderson, J. Hoover, D. Jennings, C. Jordan, J. Jordan, J. Kasha, L. Kagan, C. Kraft, A. Levitsky, M. Lewis, X. Liu, J. Lopez, D. Ma, W. Majoros, J. McDaniel, S. Murphy, M. Newman, T. Nguyen, N. Nguyen, M. Nodell, S. Pan, J. Peck, M. Peterson, W. Rowe, R. Sanders, J. Scott, M. Simpson, T. Smith, A. Sprague, T. Stockwell, R. Turner, E. Venter, M. Wang, M. Wen, D. Wu, M. Wu, A. Xia, A. Zandieh, and X. Zhu. The sequence of the human genome. Science (New York, N.Y.), 291(5507):1304–1351, Feb 16 2001.

[51] F. Larsen, G. Gundersen, R. Lopez, and H. Prydz. CpG islands as gene markers in the human genome. Genomics, 13(4):1095–1107, Aug 1992.

[52] J. Zhu, F. He, S. Hu, and J. Yu. On the nature of human housekeeping genes. Trends in genetics : TIG, 24(10):481–484, Oct 2008.

[53] H. Wu, V. Coskun, J. Tao, W. Xie, W. Ge, K. Yoshikawa, E. Li, Y. Zhang, and Y. E. Sun. Dnmt3a-dependent nonpromoter DNA methylation facilitates transcription of neurogenic genes. Science (New York, N.Y.), 329(5990):444–448, Jul 23 2010.

[54] D. H. Yasui, S. Peddada, M. C. Bieda, R. O. Vallero, A. Hogart, R. P. Nagarajan, K. N. Thatcher, P. J. Farnham, and J. M. Lasalle. Integrated epigenomic analyses of neuronal MeCP2 reveal a role for long-range interaction with active genes. Proceedings of the National Academy of Sciences of the United States of America, 104(49):19416–19421, Dec 4 2007.

Appendix A
GEO series used

Table A.1: Summary of the GEO series available on the 450K as of April 30, 2013. QC is the reason a series was not included in the analysis.
Samples removed is the reason individual samples from a study were not included. Probes is the number of probes (of 485577) provided on GEO.

Available (April 30) | Samples | QC | Samples Used | Samples Removed | Probes | Data Type | Tissues | PubMed ID
GSE38271 | 72 | FFPE | 0 | - | 485577 | Beta | Head and neck | 23419152
GSE43976 | 95 | Filtered | 0 | - | 473097 | Beta | Blood | 23422812
GSE45529 | 22 | Filtered | 0 | - | 441946 | Beta | Blood, placenta, sperm, buccal | 23538714
GSE41826 | 145 | Filtered | 0 | - | 480492 | Beta | Brain | 23426267
GSE42700 | 53 | Filtered | 0 | - | 330168 | Beta | Buccal | NA
GSE42409 | 12 | Filtered | 0 | - | 428216 | Beta | Buccal, blood, chorionic villus | 23452981
GSE34486 | 16 | Filtered | 0 | - | 485512 | Beta | Endothelium | 22434260
GSE34487 | 16 | Filtered | 0 | - | 485512 | Beta | Endothelium | 22434260
GSE36278 | 142 | Filtered | 0 | - | 485512 | Beta | Glioblastoma | NA
GSE37067 | 38 | Filtered | 0 | - | 350447 | Beta | iPSC | NA
GSE37066 | 24 | Filtered | 0 | - | 350447 | Beta | iPSC | NA
GSE34688 | 16 | Filtered | 0 | - | 350447 | Beta | iPSC, ES | 23032973
GSE46364 | 11 | Filtered | 0 | - | 485512 | Beta | Joints (synoviocytes) | 22736089
GSE37633 | 2 | Filtered | 0 | - | 483823 | Beta | Liver cell line | NA
GSE40279 | 656 | Filtered | 0 | - | 473034 | Beta | Whole Blood | NA
GSE30338 | 134 | M Values | 0 | - | 485577 | Beta | glioma cells |
GSE40870 | 46 | Methyltransferase | 0 | - | 485577 | Beta | Bone marrow | NA
GSE43851 | 20 | Methyltransferase | 0 | - | 485577 | Beta | Cell line | 23408854
GSE32286 | 139 | Methyltransferase | 0 | - | 485577 | Beta | glioblastomas | NA
GSE44830 | 26 | Methyltransferase | 0 | - | 485577 | Beta | Xenograft |
GSE20945 | 9 | Multiple arrays | 0 | - | 485577 | Beta | Leukemia cell line | 22439938
GSE42882 | 23 | NA Count | 0 | - | 485577 | Beta | Nervous system |
GSE34639 | 48 | Not Beta | 0 | - | 462176 | Signal intensity | Blood | NA
GSE43414 | 696 | Not Beta | 0 | - | - | None | Brain | 23631413
GSE40699 | 62 | Not Beta | 0 | - | - | IDAT | Encode Tissues | NA
GSE34779 | 24 | Not Beta | 0 | - | 98090 | M-Values | lymphoblast | NA
GSE34777 | 8 | Not Beta | 0 | - | 98090 | M-Values | Lymphoblasts | NA
GSE41273 | 62 | Low Sample Correlation | 0 | - | 485577 | Beta | Blood | 23356558
GSE32283 | 86 | Subseries | 0 | - | 485577 | Beta | Glioblastoma | NA
GSE38266 | 42 | Subseries | 0 | - | 485577 | Beta | Head and neck | 23419152
GSE38128 | 13 | Superseries | 0 | - | 485577 | Beta | Blood | 22919074
GSE42119 | 61 | Superseries | 0 | - | 485577 | Beta | Bone marrow | 23152544
GSE40871 | 115 | Superseries | 0 | - | 485577 | Beta | Bone marrow | 23297133
GSE32149 | 48 | Superseries | 0 | - | 485577 | Beta | Colon mucosa, blood | NA
GSE33130 | 4 | Superseries | 0 | - | 485577 | Beta | embryocarcinoma | 22569366

Table A.2: Summary of the GEO series available (continued)

Available (April 30) | Samples | QC | Samples Used | Samples Removed | Probes | Data Type | Tissues | PubMed ID
GSE32079 | 68 | Superseries | 0 | - | 485577 | Beta | Epithelial | NA
GSE40927 | 50 | Superseries | 0 | - | 485577 | Beta | Fibroblasts | 23202434
GSE30339 | 134 | Superseries | 0 | - | 485577 | Beta | glioma | 22343889
GSE30654 | 153 | Superseries | 0 | - | 485577 | Beta | iPS,ES | 22560082
GSE41117 | 132 | Superseries | 0 | - | 485577 | Beta | oral squamous | 22466171
GSE34403 | 8 | Superseries | 0 | - | 485577 | Beta | Prostate | 22466171
GSE45459 | 22 | Used | 18 | NA Count | 485577 | Beta | B-cell | 23074194
GSE46168 | 14 | Used | 13 | NA Count | 485577 | Beta | B-cell | NA
GSE42865 | 16 | Used | 16 | - | 485577 | Beta | B-cells | 23257959
GSE40005 | 24 | Used | 23 | NA Count | 485577 | Beta | blood | NA
GSE37966 | 7 | Used | 7 | - | 485577 | Beta | blood, sperm | 22919074
GSE42118 | 10 | Used | 10 | - | 485577 | Beta | Bone marrow | NA
GSE39141 | 33 | Used | 33 | - | 485577 | Beta | Bone marrow, b cell | 23110451
GSE29290 | 22 | Used | 22 | - | 485577 | Beta | Breast and colon | NA
GSE41114 | 46 | Used | 46 | - | 485577 | Beta | Buccal, and blood | NA
GSE42752 | 63 | Used | 63 | - | 485577 | Beta | colon | NA
GSE32362 | 4 | Used | 4 | - | 485577 | Beta | embryocarcinoma | 22569366
GSE40909 | 25 | Used | 25 | - | 485577 | Beta | endothelial, stem | NA
GSE45198 | 28 | Used | 28 | - | 485577 | Beta | glioma cells | 23558169
GSE31848 | 153 | Used | 153 | - | 485577 | Beta | hPSCs | 22560082
GSE40790 | 4 | Used | 4 | - | 485577 | Beta | ips, hsc | 23386128
GSE36216 | 75 | Used | 75 | - | 485577 | Beta | lung | 22261801
GSE32148 | 48 | Used | 47 | NA Count | 485577 | Beta | peripheral blood | 22467598
GSE33233 | 19 | Used | 18 | NA Count | 485577 | Beta | peripheral blood | 22689993
GSE36064 | 78 | Used | 78 | - | 485577 | Beta | peripheral blood | 22300631
GSE42861 | 689 | Used | 689 | - | 485577 | Beta | peripheral blood | 23334450
GSE34340 | 6 | Used | 6 | - | 485577 | Beta | prostate | NA
GSE38240 | 12 | Used | 12 | - | 485577 | Beta | prostate | 23345608
GSE43141 | 2 | Used | 2 | - | 485577 | Beta | skin fibroblast | NA
GSE38268 | 6 | Used | 6 | - | 485577 | Beta | tongue | NA
GSE38270 | 24 | Used | 24 | - | 485577 | Beta | tongue | NA
GSE41336 | 90 | Used | 90 | - | 485577 | Beta | trophoblast | 23314690
GSE30870 | 40 | Used | 40 | - | 485577 | Beta | whole blood | 22689993
GSE37965 | 30 | Used | 30 | - | 485577 | Beta | whole blood | 23054610
GSE41169 | 95 | Used | 95 | - | 485577 | Beta | whole blood | 23034122
GSE35069 | 60 | Used | 60 | - | 485577 | Beta | whole, peripheral blood | 22848472
Total | 4614 | 30 series used | 1737

Appendix B
Detailed Tissues Available

Table B.1: Summary of tissue types used in analysis.

Tissue | Samples
Adipose | 2
Bladder | 2
Blood | 119
Blood (B Cell) | 35
Blood (Peripheral) | 841
Blood (Whole) | 149
Blood Vessel | 8
Bone Marrow | 38
Brain | 34
Breast | 16
Colon | 69
Epithelial | 5
Fibroblast | 9
Kidney | 12
Liver | 4
Lung | 82
Lymph | 2
Muscle | 6
Oral | 76
Pancreas | 2
Prostate | 18
Somatic Cell Line | 21
Sperm | 2
Spleen | 5
Stem Cell | 78
Stomach | 6
Testes | 4
Thymus | 2
Trophoblast | 90
Total | 1737

Appendix C
Independent GEO series used

Table C.1: Summary of the GEO series available on the 450K between April 30, 2013 and July 29, 2013. QC is the reason a series was not included in the confirmation dataset. Probes is the number of probes (of 485577) provided on GEO.

Available (April 30 - July 29) | Samples | QC | Samples Used | Probes | Data Type | Tissues | PubMed ID
GSE41782 | 32 | Filtered | 0 | 137797 | Beta | Blood | 24039605
GSE44132 | 55 | Filtered | 0 | 483266 | Beta | Blood | 23689534
GSE44667 | 40 | Filtered | 0 | 430685 | Beta | Placenta | 23770704
GSE44684 | 67 | Filtered | 0 | 485512 | Beta | Cerebellum | 23660940
GSE46573 | 22 | Filtered | 0 | 441946 | Beta | Blood, sperm, placenta, pancreas, buccal | 23538714
GSE43226 | 12 | Mouse samples | 0 | 485577 | Beta | Mouse (Bone marrow, liver, heart, spleen) | 23639479
GSE42707 | 1 | Sample size | 0 | 485577 | Beta | Embryonic stem cell | 23598999
GSE42310 | 14 | Superseries | 0 | 485577 | Beta | Colon | 23598999
GSE43298 | 176 | Superseries | 0 | 485577 | Beta | paraganglioma | 23707781
GSE44712 | 56 | Superseries | 0 | 430685 | Beta | placental | 23770704
GSE44838 | 52 | Superseries | 0 | 485577 | Beta | breast | 23844228
GSE47639 | 5 | Superseries | 0 | 485577 | Beta | esophageal | 24022190
GSE36369 | 322 | Used | 255 | 485577 | Beta | Cell line and blood | NA
GSE38235 | 92 | Used | 92 | 485577 | Beta | Blood (B cell) | 23722552
GSE39560 | 34 | Used | 34 | 485577 | Beta | Saliva | 23706164
GSE39672 | 133 | Used | 133 | 485577 | Beta | Lymphoblastoid | 23792949
GSE40853 | 51 | Used | 51 | 485577 | Beta | Cartilage | 23863747
GSE42308 | 6 | Used | 6 | 485577 | Beta | Colon | 23598999
GSE43293 | 24 | Used | 24 | 485577 | Beta | Adrenal glands | 23707781
GSE44798 | 50 | Used | 50 | 485577 | Beta | Blood | 23716672
GSE44837 | 26 | Used | 26 | 485577 | Beta | Breast | 23844228
GSE45707 | 4 | Used | 4 | 485577 | Beta | Monocytes | NA
GSE45958 | 24 | Used | 24 | 485577 | Beta | Breast | NA
GSE46650 | 12 | Used | 12 | 485577 | Beta | Fibroblast | 23306098
GSE47627 | 28 | Used | 28 | 485577 | Beta | Sperm | 23071498
GSE47637 | 2 | Used | 2 | 485577 | Beta | Esophageal | 24022190
GSE45199 | 16 | Used | 16 | 485577 | Beta | Oligodendroglioma | 23558169
Total | 1356 | 15 series used | 757

Appendix D
Uniformly Unmethylated Resort GO and DO Enrichment

D.1 Uniformly Unmethylated Resort GO Enrichment

Table D.1: Significantly over-represented (corrected p value < 0.001) GO groups in the uniformly unmethylated resort associated genes. Columns are: name of the GO gene set, GO ID, number of genes in the GO gene set, original p value and Benjamini-Hochberg corrected p value (by which the data are sorted).
Multifunctionality (MF) scores, p values and corrected p values are provided, but the corrected p value without MF correction is used for significance calling.

Name | ID | Number of Genes | p value | Corrected p value | MF p value | Corrected MF p value | MF
spinal cord development | GO:0021510 | 73 | 1.11E-017 | 5.28E-014 | 1.05E-007 | 1.67E-004 | 0.838
neuron fate commitment | GO:0048663 | 56 | 9.09E-016 | 2.16E-012 | 5.75E-008 | 2.73E-004 | 0.847
neuron migration | GO:0001764 | 94 | 2.72E-014 | 4.32E-011 | 1.30E-006 | 1.23E-003 | 0.93
outflow tract morphogenesis | GO:0003151 | 50 | 1.11E-012 | 1.32E-009 | 0.55269379 | 1 | 0.881
diencephalon development | GO:0021536 | 68 | 1.46E-012 | 1.39E-009 | 5.41E-006 | 3.68E-003 | 0.877
cell differentiation in spinal cord | GO:0021515 | 41 | 2.22E-012 | 1.76E-009 | 1.34E-005 | 7.09E-003 | 0.756
pallium development | GO:0021543 | 91 | 4.28E-012 | 2.91E-009 | 3.40E-006 | 2.69E-003 | 0.879
metanephros development | GO:0001656 | 80 | 9.01E-012 | 5.35E-009 | 0.08730415 | 1 | 0.886
forebrain generation of neurons | GO:0021872 | 51 | 1.83E-011 | 7.90E-009 | 1.19E-004 | 0.03155777 | 0.865
embryonic digestive tract development | GO:0048566 | 35 | 1.75E-011 | 8.34E-009 | 0.13851477 | 1 | 0.855
forebrain neuron differentiation | GO:0021879 | 40 | 1.73E-011 | 9.16E-009 | 6.43E-005 | 0.02183443 | 0.842
embryonic skeletal system morphogenesis | GO:0048704 | 84 | 2.86E-011 | 1.13E-008 | 2.80E-004 | 0.06048596 | 0.953
limbic system development | GO:0021761 | 71 | 3.25E-011 | 1.19E-008 | 6.64E-008 | 1.58E-004 | 0.866
ventral spinal cord development | GO:0021517 | 33 | 8.40E-011 | 2.85E-008 | 4.36E-004 | 0.07979735 | 0.758
embryonic digestive tract morphogenesis | GO:0048557 | 20 | 9.72E-011 | 3.08E-008 | 0.31962669 | 1 | 0.832
palate development | GO:0060021 | 69 | 1.33E-010 | 3.95E-008 | 5.03E-004 | 0.08539711 | 0.861
regulation of neural precursor cell proliferation | GO:2000177 | 47 | 3.66E-010 | 1.02E-007 | 0.01686011 | 0.95460318 | 0.86
forelimb morphogenesis | GO:0035136 | 37 | 6.10E-010 | 1.61E-007 | 0.01770803 | 0.95703859 | 0.745
cell fate specification | GO:0001708 | 61 | 6.48E-010 | 1.62E-007 | 3.36E-005 | 0.01451621 | 0.853
peripheral nervous system neuron differentiation | GO:0048934 | 12 | 1.22E-009 | 2.63E-007 | 3.46E-004 | 0.06863086 | 0.451
peripheral nervous system neuron development | GO:0048935 | 12 | 1.22E-009 | 2.63E-007 | 3.46E-004 | 0.06863086 | 0.451
positive regulation of neuron differentiation | GO:0045666 | 63 | 1.20E-009 | 2.72E-007 | 8.23E-004 | 0.11188962 | 0.837
cardiac chamber morphogenesis | GO:0003206 | 99 | 1.18E-009 | 2.81E-007 | 0.50484358 | 1 | 0.887
cardiac septum development | GO:0003279 | 57 | 1.42E-009 | 2.94E-007 | 0.64227964 | 1 | 0.872
embryonic forelimb morphogenesis | GO:0035115 | 29 | 1.93E-009 | 3.82E-007 | 5.33E-003 | 0.46981081 | 0.723
forebrain regionalization | GO:0021871 | 20 | 2.17E-009 | 4.13E-007 | 8.00E-005 | 0.02376578 | 0.697
neuron fate specification | GO:0048665 | 25 | 3.07E-009 | 5.62E-007 | 5.12E-005 | 0.02028098 | 0.709
cardiac ventricle development | GO:0003231 | 91 | 6.11E-009 | 1.00E-006 | 0.60352184 | 1 | 0.885
central nervous system neuron development | GO:0021954 | 55 | 5.89E-009 | 1.04E-006 | 5.24E-005 | 0.01917466 | 0.853
ureteric bud development | GO:0001657 | 91 | 6.11E-009 | 1.04E-006 | 0.26643212 | 1 | 0.885
midbrain development | GO:0030901 | 32 | 8.39E-009 | 1.33E-006 | 8.82E-007 | 1.05E-003 | 0.696
regulation of glial cell differentiation | GO:0045685 | 44 | 1.03E-008 | 1.52E-006 | 0.01190884 | 0.77586918 | 0.863
hindlimb morphogenesis | GO:0035137 | 38 | 1.00E-008 | 1.54E-006 | 0.02003856 | 1 | 0.81

Table D.2: Uniformly Unmethylated Resort GO Enrichment (Continued)

Name | ID | Number of Genes | p value | Corrected p value | MF p value | Corrected MF p value | MF
embryonic digit morphogenesis | GO:0042733 | 46 | 2.05E-008 | 2.95E-006 | 4.24E-003 | 0.42006388 | 0.818
dorsal/ventral pattern formation | GO:0009953 | 89 | 2.14E-008 | 3.00E-006 | 1.51E-004 | 0.03775389 | 0.972
cerebral cortex development | GO:0021987 | 60 | 2.54E-008 | 3.36E-006 | 5.36E-004 | 0.08791468 | 0.86
regulation of gliogenesis | GO:0014013 | 60 | 2.54E-008 | 3.46E-006 | 0.02061329 | 1 | 0.877
embryonic camera-type eye development | GO:0031076 | 35 | 3.03E-008 | 3.90E-006 | 0.04776796 | 1 | 0.846
mesenchymal cell development | GO:0014031 | 99 | 3.21E-008 | 4.01E-006 | 0.67746894 | 1 | 0.886
dorsal spinal cord development | GO:0021516 | 20 | 4.01E-008 | 4.89E-006 | 8.00E-005 | 0.02535017 | 0.517
spinal cord motor neuron differentiation | GO:0021522 | 25 | 4.42E-008 | 5.25E-006 | 2.42E-003 | 0.26750416 | 0.665
cerebral cortex neuron differentiation | GO:0021895 | 16 | 5.13E-008 | 5.95E-006 | 1.45E-005 | 6.87E-003 | 0.76
negative regulation of neuron apoptotic process | GO:0043524 | 94 | 5.95E-008 | 6.74E-006 | 0.01855004 | 0.99128066 | 0.882
negative regulation of neuron death | GO:1901215 | 94 | 5.95E-008 | 6.74E-006 | 0.01855004 | 0.99128066 | 0.882
cardiac septum morphogenesis | GO:0060411 | 43 | 6.22E-008 | 6.88E-006 | 0.7150241 | 1 | 0.867
embryonic camera-type eye morphogenesis | GO:0048596 | 26 | 7.26E-008 | 7.85E-006 | 0.05877479 | 1 | 0.828
camera-type eye morphogenesis | GO:0048593 | 90 | 1.35E-007 | 1.42E-005 | 0.0330986 | 1 | 0.878
ventricular septum development | GO:0003281 | 40 | 1.89E-007 | 1.96E-005 | 0.67687288 | 1 | 0.848
negative regulation of neurogenesis | GO:0050768 | 92 | 1.98E-007 | 2.00E-005 | 0.03771388 | 1 | 0.984
regulation of oligodendrocyte differentiation | GO:0048713 | 23 | 2.12E-007 | 2.10E-005 | 0.03981093 | 1 | 0.791
digestive tract morphogenesis | GO:0048546 | 54 | 2.25E-007 | 2.18E-005 | 0.6054949 | 1 | 0.883
pancreas development | GO:0031016 | 69 | 2.40E-007 | 2.29E-005 | 0.09958755 | 1 | 0.865
face morphogenesis | GO:0060325 | 29 | 2.77E-007 | 2.54E-005 | 0.08173651 | 1 | 0.788
embryonic hindlimb morphogenesis | GO:0035116 | 29 | 2.77E-007 | 2.59E-005 | 0.02311755 | 1 | 0.796
pituitary gland development | GO:0021983 | 42 | 3.60E-007 | 3.23E-005 | 0.03142116 | 1 | 0.856
ureteric bud morphogenesis | GO:0060675 | 56 | 3.79E-007 | 3.34E-005 | 0.40139621 | 1 | 0.872
hippocampus development | GO:0021766 | 49 | 3.88E-007 | 3.35E-005 | 8.12E-005 | 0.02270861 | 0.837
lung morphogenesis | GO:0060425 | 36 | 4.11E-007 | 3.49E-005 | 0.01557035 | 0.91422918 | 0.848
regulation of stem cell proliferation | GO:0072091 | 72 | 4.64E-007 | 3.87E-005 | 7.75E-003 | 0.58532528 | 0.881
head development | GO:0060322 | 50 | 5.10E-007 | 4.19E-005 | 0.065048 | 1 | 0.835
cardiac chamber formation | GO:0003207 | 11 | 5.38E-007 | 4.33E-005 | 0.47784594 | 1 | 0.735
face development | GO:0060324 | 37 | 5.76E-007 | 4.57E-005 | 0.05846942 | 1 | 0.809
embryonic eye morphogenesis | GO:0048048 | 31 | 6.11E-007 | 4.76E-005 | 0.09914553 | 1 | 0.836
stem cell proliferation | GO:0072089 | 52 | 8.64E-007 | 6.42E-005 | 0.07583075 | 1 | 0.856

Table D.3: Uniformly Unmethylated Resort GO Enrichment (Continued)

Name | ID | Number of Genes | p value | Corrected p value | MF p value | Corrected MF p value | MF
proximal/distal pattern formation | GO:0009954 | 32 | 8.82E-007 | 6.46E-005 | 7.44E-006 | 4.42E-003 | 0.736
metencephalon development | GO:0022037 | 75 | 8.63E-007 | 6.52E-005 | 1.09E-003 | 0.14358641 | 0.859
odontogenesis | GO:0042476 | 92 | 9.18E-007 | 6.52E-005 | 0.15717717 | 1 | 0.984
bone morphogenesis | GO:0060349 | 67 | 8.54E-007 | 6.55E-005 | 0.03714177 | 1 | 0.859
inner ear morphogenesis | GO:0042472 | 92 | 9.18E-007 | 6.62E-005 | 7.26E-004 | 0.10462652 | 0.965
cardiac right ventricle morphogenesis | GO:0003215 | 16 | 9.68E-007 | 6.77E-005 | 0.61143545 | 1 | 0.747
endocrine pancreas development | GO:0031018 | 46 | 1.16E-006 | 7.99E-005 | 0.01507255 | 0.89606301 | 0.825
head morphogenesis | GO:0060323 | 33 | 1.26E-006 | 8.29E-005 | 0.11811378 | 1 | 0.807
nephron development | GO:0072006 | 77 | 1.28E-006 | 8.34E-005 | 0.82565216 | 1 | 0.886
thymus development | GO:0048538 | 33 | 1.26E-006 | 8.41E-005 | 0.11811378 | 1 | 0.81
regulation of organ formation | GO:0003156 | 33 | 1.26E-006 | 8.53E-005 | 0.29321985 | 1 | 0.852
body morphogenesis | GO:0010171 | 40 | 1.48E-006 | 9.52E-005 | 0.07687415 | 1 | 0.817
embryonic heart tube development | GO:0035050 | 62 | 1.57E-006 | 9.82E-005 | 0.28226955 | 1 | 0.876
cardiac ventricle morphogenesis | GO:0003208 | 62 | 1.57E-006 | 9.95E-005 | 0.69805712 | 1 | 0.877
embryonic heart tube morphogenesis | GO:0003143 | 49 | 2.55E-006 | 1.53E-004 | 0.3088397 | 1 | 0.863
positive regulation of glial cell differentiation | GO:0045687 | 23 | 2.54E-006 | 1.55E-004 | 0.38321525 | 1 | 0.81
embryonic skeletal joint morphogenesis | GO:0060272 | 13 | 2.53E-006 | 1.56E-004 | 5.36E-004 | 0.08501935 | 0.663
positive regulation of neural precursor cell proliferation | GO:2000179 | 29 | 2.70E-006 | 1.59E-004 | 0.02311755 | 1 | 0.844
olfactory bulb development | GO:0021772 | 29 | 2.70E-006 | 1.61E-004 | 5.33E-003 | 0.47867516 | 0.806
regulation of embryonic development | GO:0045995 | 82 | 3.23E-006 | 1.87E-004 | 0.50993673 | 1 | 0.884
cardiac neural crest cell development involved in heart development | GO:0061308 | 6 | 3.51E-006 | 1.99E-004 | 1 | 1 | 0.678
noradrenergic neuron differentiation | GO:0003357 | 6 | 3.51E-006 | 2.01E-004 | 0.29839077 | 1 | 0.581
regulation of DNA binding | GO:0051101 | 66 | 3.65E-006 | 2.04E-004 | 0.73773033 | 1 | 0.869
olfactory lobe development | GO:0021988 | 30 | 3.85E-006 | 2.13E-004 | 6.35E-003 | 0.5119186 | 0.812
negative regulation of neuron differentiation | GO:0045665 | 51 | 4.15E-006 | 2.27E-004 | 0.02550863 | 1 | 0.816
cardiocyte differentiation | GO:0035051 | 84 | 4.56E-006 | 2.44E-004 | 0.71707901 | 1 | 0.886
heart looping | GO:0001947 | 44 | 4.52E-006 | 2.44E-004 | 0.2440901 | 1 | 0.859
negative regulation of neural precursor cell proliferation | GO:2000178 | 14 | 4.80E-006 | 2.48E-004 | 0.0425761 | 1 | 0.686
bone development | GO:0060348 | 93 | 4.70E-006 | 2.48E-004 | 0.16394622 | 1 | 0.987
aorta morphogenesis | GO:0035909 | 19 | 4.87E-006 | 2.49E-004 | 0.67458076 | 1 | 0.822
spinal cord association neuron differentiation | GO:0021527 | 14 | 4.80E-006 | 2.51E-004 | 7.95E-004 | 0.11120716 | 0.509

Table D.4: Uniformly Unmethylated Resort GO Enrichment (Continued)

Name | ID | Number of Genes | p value | Corrected p value | MF p value | Corrected MF p value | MF
metanephric nephron development | GO:0072210 | 31 | 5.39E-006 | 2.73E-004 | 0.53759448 | 1 | 0.869
cerebral cortex cell migration | GO:0021795 | 25 | 5.73E-006 | 2.84E-004 | 3.85E-004 | 0.07328463 | 0.835
heart valve development | GO:0003170 | 25 | 5.73E-006 | 2.87E-004 | 0.77177601 | 1 | 0.848
cardiac ventricle formation | GO:0003211 | 10 | 6.03E-006 | 2.92E-004 | 0.44606208 | 1 | 0.712
forebrain cell migration | GO:0021885 | 38 | 5.99E-006 | 2.94E-004 | 2.33E-004 | 0.05538064 | 0.863
mesonephros development | GO:0001823 | 20 | 7.71E-006 | 3.70E-004 | 0.69326256 | 1 | 0.836
positive regulation of ossification | GO:0045778 | 39 | 7.91E-006 | 3.76E-004 | 0.90031753 | 1 | 0.859
determination of heart left/right asymmetry | GO:0061371 | 47 | 9.54E-006 | 4.45E-004 | 0.28264822 | 1 | 0.864
autonomic nervous system development | GO:0048483 | 47 | 9.54E-006 | 4.49E-004 | 0.75993499 | 1 | 0.855
endoderm development | GO:0007492 | 55 | 1.02E-005 | 4.54E-004 | 0.09378184 | 1 | 0.855
cell-cell signaling involved in cell fate commitment | GO:0045168 | 33 | 1.01E-005 | 4.54E-004 | 0.0383238 | 1 | 0.839
nephron epithelium development | GO:0072009 | 40 | 1.04E-005 | 4.56E-004 | 0.4046518 | 1 | 0.872
neural tube patterning | GO:0021532 | 33 | 1.01E-005
4.59E-004 0.0383238 1 0.81positive regulation of gliogenesis GO:0014015 33 1.01E-005 4.63E-004 0.29321985 1 0.841neural crest cell differentiation GO:0014033 63 1.01E-005 4.67E-004 0.70838321 1 0.879cardiac neural crest cell differentiation involved in heart development GO:0061307 7 1.17E-005 5.01E-004 1 1 0.721aorta development GO:0035904 21 1.18E-005 5.02E-004 0.7108729 1 0.835hard palate development GO:0060022 7 1.17E-005 5.06E-004 0.05690589 1 0.36telencephalon regionalization GO:0021978 7 1.17E-005 5.11E-004 5.53E-003 0.46939426 0.38spinal cord motor neuron cell fate specification GO:0021520 11 1.26E-005 5.31E-004 0.02193044 1 0.536embryonic skeletal joint development GO:0072498 16 1.45E-005 6.03E-004 1.58E-003 0.18734492 0.703branching involved in ureteric bud morphogenesis GO:0001658 49 1.52E-005 6.27E-004 0.3088397 1 0.86embryonic cranial skeleton morphogenesis GO:0048701 35 1.81E-005 7.35E-004 0.13851477 1 0.829telencephalon cell migration GO:0022029 35 1.81E-005 7.42E-004 6.68E-004 0.09925519 0.863neural tube closure GO:0001843 67 2.14E-005 8.62E-004 0.03714177 1 0.84neural tube formation GO:0001841 85 2.26E-005 9.04E-004 0.11375899 1 0.955retinal ganglion cell axon guidance GO:0031290 17 2.34E-005 9.25E-004 2.51E-004 0.05677822 0.567forebrain neuron development GO:0021884 12 2.40E-005 9.43E-004 3.67E-003 0.38789957 0.597negative regulation of glial cell differentiation GO:0045686 23 2.57E-005 9.86E-004 8.69E-003 0.60798514 0.794lens morphogenesis in camera-type eye GO:0002089 23 2.57E-005 9.94E-004 0.03981093 1 0.691tube closure GO:0060606 68 2.55E-005 9.95E-004 0.04006723 1 0.84361D.2 Uniformly Unmethylated Resort DO Enrichment62Table D.5: Significantly over-represented (corrected p value ¡0.001) DO groups in the uniformly unmethylated resortassociated genes. Columns are: name of the DO gene set, DO ID, number of genes in the DO gene set, originalp value and Benjamini-Hochberg corrected p value (which the data is sorted by). 
Multifunctionality (MF) scoresp values and corrected p values are provided, but corrected p value without MF correction is used for significancecalling.Name IDNumberof Genes)p valueCorrectedp value)MF p valueCorrected MFp value)MFName ID Number of Genes p value Corrected p value MF p value Corrected MF p value MFintellectual disability DOID 1059 524 1.29E-009 1.98E-006 1.59E-003 1 0.924autistic disorder DOID 12849 299 2.63E-009 2.02E-006 7.42E-003 1 0.923pervasive developmental disorder DOID 0060040 805 1.44E-008 7.36E-006 0.01077664 1 0.893physical disorder DOID 0080015 264 5.06E-008 1.94E-005 2.60E-003 0.99503206 0.922autism spectrum disorder DOID 0060041 798 1.72E-007 4.39E-005 0.0266294 1 0.89263
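The Benjamini-Hochberg correction and the corrected p value cutoff of 0.001 named in the Table D.5 caption can be illustrated with a minimal Python sketch. This is not the software used to produce the tables; the function names `benjamini_hochberg` and `call_significant` and the toy inputs are hypothetical. Note that the correction depends on every gene set tested, so rerunning it on only the rows listed above would not reproduce the tabulated corrected p values.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p values, in the input order."""
    m = len(p_values)
    # Indices of the p values, ordered from smallest to largest p value.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down; the running minimum keeps the
    # adjusted values monotone in rank and capped at 1.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def call_significant(names, p_values, threshold=0.001):
    """Names of gene sets whose adjusted p value is below the threshold."""
    adjusted = benjamini_hochberg(p_values)
    return [name for name, q in zip(names, adjusted) if q < threshold]


if __name__ == "__main__":
    # Toy p values for three made-up gene sets (not thesis data).
    toy_names = ["set_a", "set_b", "set_c"]
    toy_p = [1e-6, 0.5, 2e-4]
    print(call_significant(toy_names, toy_p))
```

The threshold argument defaults to the 0.001 cutoff stated in the caption; MF-corrected p values are not modelled here because the caption states they were not used for significance calling.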

