UBC Theses and Dissertations


Estimating cell type proportions in human cord blood samples from DNAm arrays Dinh, Louie 2017


Estimating Cell Type Proportions in Human Cord Blood Samples from DNAm Arrays

by

Louie Dinh

B.Sc., The University of British Columbia, 2012

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF

Master of Science

in

THE FACULTY OF GRADUATE AND POSTDOCTORAL STUDIES
(Computer Science)

The University of British Columbia
(Vancouver)

October 2017

© Louie Dinh, 2017

Abstract

Epigenome-wide association studies are used to link patterns in the epigenome to human phenotypes and disease. These studies continue to increase in number, driven by improving technologies and decreasing costs. However, results from population-scale association studies are often difficult to interpret. One major challenge to interpretation is separating biologically relevant epigenetic changes from changes to the underlying cell type composition. This thesis focuses on computational methods for correcting cell type composition in epigenome-wide association studies measuring DNAm in blood. Specifically, we focus on a class of methods, called reference-based methods, that rely on measurements of DNAm from purified constituent cell types. Currently, reference-based correction methods perform poorly on human cord blood. This is unusual because adult blood, a closely related tissue, is a case study in successful computational correction. Several previous attempts at improving cord blood estimation were only partially successful. We demonstrate how reference-based estimation methods for cord blood can be improved. First, we validated that existing methods perform poorly on cord blood, especially in minor cell types. Then, we demonstrated how this low performance stems from missing cell type references, data normalization and violated assumptions in signature construction. Resolving these issues improved estimates in a validation set with experimentally generated ground truth.
Finally, we compared our reference-based estimates against reference-free techniques, an alternative class of computational correction methods. Going forward, this thesis provides a template for extending reference-based estimation to other heterogeneous tissues.

Lay Summary

All cells within the human body generally contain the same genetic material in the form of DNA. However, our bodies require cells to specialize and carry out the many different tasks related to maintaining a complex biological organism. One system that our body uses to facilitate these specializations is called DNA methylation. DNA methylation acts like a series of on-off switches that allow different cells to execute different parts of their shared genetic program. Some diseases leave clues about their underlying causes in the differing patterns of DNA methylation between healthy and diseased individuals. However, it is difficult to differentiate between patterns due to disease and patterns due to cellular specialization. There are computational methods for separating these two types of patterns, but they do not work well for umbilical cord blood samples. This thesis studies how to improve such methods.

Preface

This dissertation is an original intellectual product of the author, Louie Dinh. The author conducted all experiments and wrote the manuscript under the supervision of Dr. Sara Mostafavi and Dr. Raymond Ng.

Table of Contents

Abstract
Lay Summary
Preface
Table of Contents
List of Tables
List of Figures
Glossary
Acknowledgments
Dedication
1 Introduction
  1.1 Epigenome Wide Association Studies
    1.1.1 Epigenetics and DNAm
    1.1.2 Measuring DNAm with Microarrays
    1.1.3 Blood is a Complex Tissue
  1.2 Cell Type Heterogeneity in Association Studies
    1.2.1 Confounding Due to Cell Type Heterogeneity
    1.2.2 Computationally Correcting for Confounding
  1.3 Thesis Motivation
  1.4 Approach and Contribution
  1.5 Detailed Outline of Thesis
2 Related Works
  2.1 The Linear Mixture Model for Complex Tissues
  2.2 Reference-Based Methods
    2.2.1 Reference-Based Methods for Gene Expression
    2.2.2 Reference-Based Methods for DNAm
    2.2.3 An Adult Blood DNAm Reference
    2.2.4 Three Cord Blood DNAm References
  2.3 Reference Free Methods Applicable to DNAm
3 Approach and Results
  3.1 Description of Datasets
    3.1.1 Reference Cell Type Profiles
    3.1.2 Validation
  3.2 Evaluation Metrics
  3.3 Validation of Existing Estimation Methods
    3.3.1 Approach
    3.3.2 Accuracy of Adult Reference Dataset
    3.3.3 Effects of Using a Cord Specific Reference
    3.3.4 Comparison of Cord and Adult References
    3.3.5 Comparison of Three Cord Blood References
  3.4 Normalization
    3.4.1 Approach
    3.4.2 Within Array Normalization
    3.4.3 Between Array Normalization
    3.4.4 Cluster Analysis of Full Normalization Pipeline
  3.5 Signature Selection
    3.5.1 Approach
    3.5.2 Balancing of Highly and Lowly Methylated Probes
    3.5.3 Finding an Optimal Signature Size
    3.5.4 Treating T-cells as Indistinguishable
  3.6 Evaluation
    3.6.1 Approach
    3.6.2 Validating Estimation Accuracy in Cord Blood
    3.6.3 Changes to the Cell Type Signatures
    3.6.4 Comparison to Reference-Free Techniques
4 Conclusions
Bibliography

List of Tables

Table 3.1 Count of Purified Reference Cell Type Profiles
Table 3.2 Number of Differentially Methylated Positions Between Cell Types in Cord
Table 3.3 Summary of Evaluated DNAm Normalization Methods
Table 3.4 Jaccard Index of Signature Probes for Standard and Optimized Normalization
Table 3.5 Pearson Correlations of Signature Probe Ranks Between Standard and Optimized Pipeline
Table 3.6 sPCA Explained Variance at 500 Representative Probes
Table 3.7 sPCA Explained Variance Across All Probes

List of Figures

Figure 3.1 Overview of Evaluated Reference-Validation Pairings.
Figure 3.2 Estimation of Adult Validation Samples Using Adult Reference. Black line is x=y.
Figure 3.3 Estimation of Cord Validation Samples Using Adult Reference. Black line is x=y.
Figure 3.4 Comparison of Adult Reference Performance by: (A) Correlation, (B) MAD.
Figure 3.5 Comparison of Adult Reference Performance After Subsampling on Cord Validation Set: (A) Correlation, (B) MAD.
Figure 3.6 Estimation of Cord Validation Samples Using Cord Reference (UBC)
Figure 3.7 (A) Clustering Between Cord Cell Types on Signature Sites using Euclidean Distances (B) Sensitivity to Missing nRBC in Reference Profile
Figure 3.8 MDS Plot of Adult and Cord Reference Profiles
Figure 3.9 Proportion of differentially methylated positions (DMPs) between Cell Types
Figure 3.10 Number of Variable Sites in Reference Cell Type Profiles
Figure 3.11 Correlation (Estimated vs. Measured) by Cord Blood Reference Set
Figure 3.12 Comparison of Signature Probes Between Cord Reference Datasets
Figure 3.13 Estimation Performance by Alternative Cord Reference
Figure 3.14 SVD Analysis Raw Samples
Figure 3.15 SVD Analysis of Background Subtraction and Probe Type Normalization Methods
Figure 3.16 P-value Inflation When Comparing Cord Blood Samples After (A) Quantile Normalization (B) ComBat
Figure 3.17 Hierarchical Clustering of Samples After Each Normalization
Figure 3.18 Effect Size of DMRs By Cell Type
Figure 3.19 Estimation Performance by Signature Size. Solid is mean, dashed is median, dotted is 10th and 90th percentile.
Figure 3.20 Evaluating Effect of Treating T-cells as Indistinguishable.
Figure 3.21 Cell type proportion (CTP) Estimation with Cord Optimized Pipeline. Line of unity shown in black.
Figure 3.22 Estimation Performance By Reference And Normalization
Figure 3.23 Performance Improvements Step by Step.
Figure 3.24 Association of PCs to Estimated Cell Types
Figure 3.25 Effect of Normalizations on Signature Probe Selection
Figure 3.26 Changes to Rank Of Signature Probes Between Normalizations
Figure 3.27 Rank of Probes Unique to a Normalization
Figure 3.28 Variance Explained by sPC1 Versus Measured Gran Proportions
Figure 3.29 Variance Explained by sPC1 Versus All Measured CTP
Figure 3.30 Variance Explained by Top 6 sparse principal components (sPCs) Versus All Measured CTPs. Red is line of unity.
Figure 3.31 Difference to Variance Explained by Measured CTP
Glossary

BMIQ  beta mixture quantile normalization
CBC  complete blood count
CTP  cell type proportion
CTH  cell type heterogeneity
DNAm  DNA methylation
DMP  differentially methylated position
DMR  differentially methylated region
EH  expression heterogeneity
EWAS  epigenome-wide association study
FACS  fluorescence activated cell sorting
GWAS  genome wide association study
ICA  independent components analysis
ISVA  independent surrogate variable analysis
ILBG  Illumina background correction
LMM  linear mixed model
MAD  mean absolute deviation
MDS  multidimensional scaling
NOOB  normal-exponential using out-of-band probes
PC  principal component
PCA  principal components analysis
RUV  remove unwanted variation
SWAN  subset within array normalization
sPC  sparse principal component
SD  standard deviation
SNP  single nucleotide polymorphism
SVD  singular value decomposition
SVA  surrogate variable analysis

Acknowledgments

First, I’d like to thank my supervisors Sara Mostafavi and Raymond Ng for their endless patience during my graduate studies. Without their wisdom, this document would not exist.

Next, I would like to thank Meaghan Jones for being a wonderful collaborator. Also, thank you to Magda Price, Kobor lab, and Robinson lab for lively discussions on the proper way to analyze methylation data.

Thank you to all the friends that I made throughout. You kept me from becoming a hermit. My fellow Mostafavi lab members: Bernard Ng, Halldor Thorhallsson, Farnush Farhadi, Emma Graham, and Hamid Omid. My computer science cohort: Kimberly Dextras-Romagnino, Meghana Venkatswamy, Dilan Ustek, Antoine Ponsard, Jacob Chen, Yasha Pushak, Robbie Rolin, Neil Newman, Clement Fung, Alistair Wick, and Kuba Karpierz. My adoptive lab and graduate program: Phil Richmond, Emma Hitchcock, Shams Bhuiyan, Sam and Amelia Hinshaw, Oriol Fornés and Robin van der Lee.
Special thanks to Rachelle Farkas for chatting whenever I couldn’t be bothered to work.

I’d like to thank my oldest friends: Raymond Huang, Anthony Dong, Simon Tso, Derek Cho, Nelson Wang and Gilbert Leung.

Most of all, thank you to my family and to my partner. Many thanks to my parents and my sister for always being there. Finally, I’d like to thank my partner, Crystal Wong, for her help in this and all the things to come.

Dedication

To my parents

Chapter 1
Introduction

The question of how molecular changes manifest as observable phenotypes has been asked since the discovery of DNA as the biological mechanism for inheritance. With the advent of genome sequencing, researchers began probing for genetic changes, like single nucleotide polymorphisms (SNPs), that were enriched within populations with particular traits. This paradigm, called an association study, is increasingly popular as the cost of probing the genome continues to drop [32].

DNA microarray technology is one particular development that facilitated the popularity of association studies. While early microarrays were limited to reading nucleotide base pairs, their capabilities have since been extended to other genomic features. Among other things, microarrays can now be used to quantify gene expression profiles or epigenetic marks genome-wide at the population scale.

While cost-effective and convenient, microarrays are not without drawbacks. Multiple studies have identified key challenges to ensuring high quality microarray-based analyses [33, 35]. These challenges originate from factors like measurement error and shortcomings in statistical methodology [7, 16].

Researchers have overcome a number of these challenges, but one remaining issue is the confounding effects of cell type heterogeneity (CTH) in microarray association studies [16, 25].
The literature contains many algorithms to computationally correct for CTH, but the problem is not satisfactorily resolved in all cases. One particularly troubling case arises in the study of human infant cord blood DNA methylation (DNAm). Cord blood is unusual because it is closely related to adult blood, a tissue in which CTH in DNAm studies has been successfully resolved [22]. In adult blood samples, effects due to CTH are successfully mitigated through estimation of each sample’s cell type proportion (CTP). However, existing CTP estimation techniques suffer a severe degradation in performance when applied to cord blood [53]. Several previous attempts to close this adult-infant gap through better characterization of cord blood’s constituent cell types did not fully succeed [5, 12, 19]. In this thesis, we investigate the reasons behind this degradation in estimation performance and present a pipeline for more accurately estimating cell type proportions from cord blood DNAm measurements.

1.1 Epigenome Wide Association Studies

Association studies that link patterns in the epigenome to human phenotypes and diseases are called epigenome-wide association studies (EWAS). Often, these studies target DNAm, an epigenetic mark, due to its response to environmental factors, role in gene regulation and implications in development [6]. They measure the DNAm level of up to millions of sites across the genome to find loci that exhibit a relationship to disease status or phenotype.

EWAS must employ large cohorts to detect small effects in a large number of measurements. This drives researchers towards accessible biological tissues such as buccal tissue or blood. These tissues, called complex tissues, consist of multiple cell types with distinct methylation profiles [45].
This mixing of distinct cell types makes inferring associations between methylation level and phenotypes a challenge.

1.1.1 Epigenetics and DNAm

Scientific disciplines currently disagree on the definition of epigenetics [13]. Some use the term to describe changes to gene expression, while others refer explicitly to the inheritance of expression patterns. In this thesis, we follow the latter convention. Specifically, we use the term epigenetics to refer to mitotically heritable modifications to DNA that do not involve modifying the underlying base pair sequence [28, 46]. One such heritable epigenetic mark, and the focus of this thesis, is DNAm.

DNA methylation is the attachment of a methyl group to a DNA molecule. This process is crucial to development and involved in many biological processes like genomic imprinting, X-chromosome inactivation, and aging [28, 41]. In mammals, like humans, DNA methylation occurs almost exclusively at the DNA base Cytosine immediately followed by Guanine, denoted a CpG site. Functionally, high levels of CpG methylation in promoter regions have been shown to negatively correlate with gene expression in multiple species [21]. DNAm offers a mechanism by which somatic cells, all sharing the same DNA, can execute different genetic programs.

DNAm plays a large role in cell differentiation and aging [25, 28]. Patterns of DNAm are so closely tied to aging that they can be used to accurately predict biological age in humans [21]. In addition, DNAm is highly sensitive to environmental and psychosocial factors such as stress, car exhaust or neonatal exposure to maternal cigarette smoke [26, 34]. Thus, many studies analyze DNAm due to its key biological role, sensitivity to environmental factors, and heritability.

1.1.2 Measuring DNAm with Microarrays

To study DNAm’s role in gene regulation, aging and disease, we measure methylation levels across the genome. One platform for high throughput genome-wide measurement of DNAm is Illumina’s Infinium 450k.
The 450k is a microarray-based platform that interrogates the methylation status of 485,577 CpGs across the human genome [42]. Its accuracy, sample requirements and relatively low cost make it the platform of choice for many EWAS [33].

To measure methylation at CpG sites, the 450k uses probes designed to hybridize with DNA from a specific genomic locus [14]. Hybridization is followed by an elongation step that causes differential fluorescence in methylated versus unmethylated sites. The amount of methylation is inferred from the strength of fluorescence. Due to the design of the 450k and the inherent noise of physical experimentation, interpretation of DNAm data requires care [11, 33].

One complication arising from the 450k’s design is the distinction between Type 1 and Type 2 probes. To increase coverage of CpGs, the 450k includes two types of probes, here denoted Type 1 and Type 2. Type 1 probes use two different physical beads to measure DNAm at a CpG, one each for the methylated and unmethylated states. Type 2 probes use one physical bead that binds competitively to methylated and unmethylated DNA. Furthermore, the two probe types use different binding chemistries. This distinction is similar to the difference between two-colour and one-colour gene expression microarrays. As a result, Type 2 probes are less sensitive in the detection of extreme methylation values, and have greater variance between replicates [15].

Once measured, these raw DNAm fluorescence signals are transformed into either Beta values or M-values to facilitate analyses [8, 17]. Beta values can be intuitively interpreted as the proportion of DNA molecules methylated at a particular site, say s. Let F be the measured intensity of the fluorescence due to methylated molecules and let U be the measured intensity of fluorescence due to unmethylated molecules. Then

\[ \mathrm{Beta}_s = \frac{\max(F, 0)}{\max(F, 0) + \max(U, 0)} \]

Another representation used in DNAm analysis is the M-value.
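To make the two representations concrete, the following is a minimal sketch, assuming NumPy and made-up intensities. It computes Beta values from methylated and unmethylated intensities and then applies the base-2 logit that defines the M-value, which the text formalizes next. The function names `beta_value` and `m_value` are hypothetical and are not taken from the thesis or from any methylation package.

```python
import numpy as np


def beta_value(meth, unmeth):
    """Beta = max(F, 0) / (max(F, 0) + max(U, 0)), element-wise per probe.

    No offset is added to the denominator, so a probe whose two
    intensities are both at or below zero yields NaN.
    """
    f = np.maximum(np.asarray(meth, dtype=float), 0.0)
    u = np.maximum(np.asarray(unmeth, dtype=float), 0.0)
    return f / (f + u)


def m_value(beta):
    """M = log2(Beta / (1 - Beta)), the base-2 logit of the Beta value."""
    beta = np.asarray(beta, dtype=float)
    return np.log2(beta / (1.0 - beta))


# Illustrative (made-up) intensities for three probes.
betas = beta_value([300.0, 100.0, 50.0], [100.0, 100.0, 150.0])
ms = m_value(betas)
print(betas)
print(ms)
```

Beta values stay within [0, 1], whereas M-values are unbounded and symmetric around zero, where zero corresponds to a half-methylated site.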
Beta values can be easily transformed to M-values as follows:

\[ M_s = \log_2\left(\frac{\mathrm{Beta}_s}{1 - \mathrm{Beta}_s}\right) \]

This logit transformation results in better statistical properties such as meeting Gaussian assumptions, approximate homoscedasticity, and no longer being restricted to the interval between 0 and 1 [17]. Such properties are desirable when performing statistical procedures like t-tests. In this thesis, both Beta values and M-values are used.

1.1.3 Blood is a Complex Tissue

As mentioned previously, one of the most common target tissues for EWAS is whole blood. While accessible, the heterogeneous nature of blood complicates the interpretation of EWAS analyses. Human blood is a mixture of cell populations with distinct methylation profiles [45]. In this thesis, we focus on the 7 major cell types found in human infant cord blood: granulocytes (Gran), CD14+ monocytes (Mono), CD4+ T-cells (CD4T), CD8+ T-cells (CD8T), CD19+ B-cells (Bcell), CD56+ natural killer cells (NK) and nucleated red blood cells (nRBC). Previous findings show that these cell types can be differentially methylated at over 20% of measured CpGs [45]. This differential methylation, paired with variability in CTP, can make interpretations difficult, confound statistical associations and cause spurious discoveries. The next section discusses statistical confounding in detail.

Cord blood and adult blood, while similar, must be treated as distinct tissues. Adult blood contains only 6 of the 7 cord blood cell types mentioned; nucleated red blood cells are unique to cord blood [12]. In adults, red blood cells do not contain nuclei and therefore do not contribute to the methylation measurements. In cord blood, red blood cells are still in the process of extruding their nuclei and many still contain genetic material. Previous assays of nRBC DNAm revealed an unusual methylation profile [12]. Their methylome did not exhibit a strong bimodal distribution like most cell types, and instead nRBCs had many intermediately methylated CpGs.
In certain pregnancy complications, nRBCs can contribute up to 50% of the genetic material measured [2]. Studies on cord blood must account for this unusual cell type.

1.2 Cell Type Heterogeneity in Association Studies

Researchers have long recognized that varying cell type composition across samples can dramatically affect the interpretation of association studies [3]. DNAm studies are particularly susceptible to this type of confounding because DNAm is tissue-specific and highly variable [22, 34]. To compound the issue, CTPs in whole blood are not static over time. These proportions can change with environmental exposures, disease, and particularly age [25, 26]. Thus DNAm studies using blood samples from different ages must be particularly vigilant in correcting for inter-individual differences in CTP.

1.2.1 Confounding Due to Cell Type Heterogeneity

Systematic differences in cell type proportion have long been recognized as a source of DNAm variability [22, 35]. Left unaccounted for, CTP differences can lead to many false positive associations [25]. This problem is known as statistical confounding due to CTH. For convenience, we sometimes refer to the problem as just CTH.

CTH arises when comparing measurements from mixtures of cell types whose underlying CTPs differ between samples. A detailed mathematical description of the problem can be found in Section 2.1.

Experimental techniques for purifying cell populations can be used to resolve CTH. For example, fluorescence activated cell sorting (FACS) can be used to isolate pure cell populations before taking DNAm measurements. Directly comparing cells of the same type eradicates CTH. However, experimental approaches suffer from several limitations: the affected cell types are not known a priori, purification adds overhead and is labour intensive, and it cannot be performed post hoc.
These drawbacks make computational correction methods appealing.

1.2.2 Computationally Correcting for Confounding

Techniques for correcting CTH can be divided into two classes: reference-based CTP estimation and reference-free surrogate variables [38]. Reference-based methods seek to accurately estimate the proportions of constituent cell types in a sample. These methods require reference cell type profiles, that is, experimental measurements of DNAm from purified constituent cell types. Reference-free methods do not require reference cell type profiles. However, this saving of experimental labour comes at the cost of interpretability. Reference-free methods result in surrogate, also called latent, variables that are a function of cell proportions rather than direct estimates.

Once computed, reference-based estimates and reference-free surrogate variables are incorporated into association studies in the same way [35]. Associations between DNAm and phenotypes are typically inferred using linear models. Specifically, a linear model is fit to each methylated position with the phenotype as an explanatory variable. Correcting for CTH is done by expanding this set of explanatory variables to include either the CTP estimates or the surrogate variables.

1.3 Thesis Motivation

Computational methods that accurately estimate cell type proportions in DNAm microarrays are convenient, highly interpretable and can be performed post hoc on DNAm EWAS. Methods correcting for blood samples are of particular interest due to their prevalence in population scale studies [25]. For adult whole blood, accurate methods for estimation of CTP exist. However, attempts to develop an analogous method for human infant cord blood, a closely related tissue, have shown a persistent degradation in performance [19, 53].

Currently, the most accurate methods for estimating CTP in adult whole blood rely upon the availability of reference cell type profiles.
In 2013, Koestler et al. [30] used these reference-based methods to estimate CTPs for 94 adult samples with matched blood counts; correlations for monocytes and lymphocytes were 0.6 and 0.61 respectively. In 2016, Koestler et al. [31] used an improved method to estimate CTP for 6 adult samples with more detailed cell counts. They observed correlations over 0.99 for all cell types. However, methods based on adult reference profiles failed to produce accurate estimates in cord blood [53]. For cord blood, Yousefi et al. [53] observed correlations for monocytes and lymphocytes of -0.01 and -0.03 respectively.

Since methylation of cord blood cell types is known to be distinct from that of adult blood cell types, there have been several attempts at remedying this situation by developing cell type reference profiles specific to cord blood [5, 12, 19]. However, even with cord blood specific reference profiles, estimation accuracy in cord blood was still low compared to its adult whole blood analog. In their 2016 study, Gervin et al. [19] estimated cell counts for 195 cord blood samples using cord blood reference profiles. They observed CTP correlations between 0.51 and 0.57 for low abundance cell types like B-cells and monocytes. While better than with adult references, cord blood estimation performance has room for improvement. Furthermore, this raises the concern that existing estimation methodologies cannot be extended to new tissues simply by characterizing that new tissue.

An understanding of the culprits behind this loss of accuracy would benefit all cord blood based DNAm EWAS. Furthermore, pinpointing culprits will assist in developing a better CTP estimation method for cord blood DNAm.
Finally, as the application of EWAS broadens to new tissue types, this understanding will help extend reference-based CTP estimation methods to these new targets.

1.4 Approach and Contribution

The objective of this thesis is to investigate how to improve the low estimation accuracy of cell type proportions in DNAm array measurements from human infant cord blood. Specifically, we aimed to improve upon a reference-based method that accurately estimates cell type proportions in adult whole blood. Previous attempts at solving this problem by generating cord specific cell type reference profiles still yield poor estimation accuracy in low abundance cell types. We assessed each step of the estimation procedure, identified problematic steps, resolved each issue specifically for cord blood samples, and validated the improved estimation method.

First, we confirmed that the adult estimation procedure is unsuitable for infant cord blood. To do so, we estimated cell type proportions using the same adult-calibrated procedure for both adult and infant samples. Estimates were compared to experimentally measured proportions to corroborate previously reported results. Indeed, adult-calibrated procedures are unsuitable for use on cord blood samples.

Next, we explored how the same procedure, but with cord blood specific cell type profiles, improves estimation accuracy. We used three different sets of cord blood specific cell type profiles and observed consistently low estimation accuracy, especially for low abundance lymphoid cells. Therefore, we concluded that the low accuracy is partially caused by the procedure itself.

In order to improve estimation accuracy, we tailored the procedure to cord blood in a step-by-step fashion.
A detailed outline of this thesis follows.

1.5 Detailed Outline of Thesis

The remainder of this thesis formalizes the problem of estimating cell type proportions in complex tissues [Chapter 2], surveys existing methods used to correct for cell type heterogeneity in association studies [Chapter 2], diagnoses the issues of applying existing methods to cord blood [Chapter 3], proposes resolutions and presents validation results for a new cord blood estimation pipeline [Chapter 3], and discusses directions for future work [Chapter 4].

A detailed chapter breakdown of this thesis is as follows:

• Chapter 2 surveys the existing literature on the problem of confounding due to cell type heterogeneity in association studies. First, we present a linear model generating the mixed signal observed in complex tissues. We show how this model leads to the two classes of solutions for CTH: reference-based and reference-free. We give reference-based estimation in DNAm a more formal description and trace its origins to early applications in gene expression. Previous works on improving estimation of CTP in cord blood are also described. Finally, we touch upon reference-free techniques both specific to DNAm and generally applicable to microarray studies.
• Chapter 3 presents the datasets, approach and experimental results used to improve the accuracy of CTP estimation in cord blood. First, we describe the two validation sets of mixed tissues and four reference sets of cell type profiles. We also describe how performance of CTP estimation was measured. Then we report results on our validation datasets that confirm the previously reported low estimation accuracy of existing methods. Next, we present a series of diagnostics used to determine the issues behind this low estimation accuracy. We show how resolving these issues leads to improved estimation accuracy.
Finally, we compare our reference-based estimates of cell type proportions to PCA-based reference-free surrogate variables from the perspective of variance explained.

• Chapter 4 summarizes our results, identifies the limitations of reference-based estimation and discusses directions for future work.

Chapter 2

Related Works

In this chapter, we discuss prior works on correction of CTH in DNAm studies. Section 2.1 mathematically describes how measurements on a complex tissue can be modelled as a linear mixture. We show how the two classes of CTH correction techniques, reference-based and reference-free, arise naturally from this description. Section 2.2 summarizes the most successful reference-based methods for CTP estimation in blood. Section 2.3 summarizes existing reference-free techniques.

2.1 The Linear Mixture Model for Complex Tissues

In 2001, Venet et al. [52] presented the idea of computationally quantifying CTP directly from mixed microarray measurements. This was originally presented in the context of gene expression. Here, we summarize that framework in the context of DNAm.

Assume that DNAm measurements were made on a mixture of distinct cell types using a microarray. Then, for each DNAm locus, the total measurement is the sum of the signals from each cell type alone. We assume that the signal from each cell type is proportional to its relative abundance in the sample. This is called the linear mixing assumption, and it serves as a good model for the fluorescence intensities measured on microarrays [24].

Mathematically, we can represent the data generation process as follows:

• L: The number of measured mixed samples.

• M: A matrix of measurements from DNAm microarrays. One column per mixed sample and one row per probe.

• G: A matrix of reference cell type profiles. One column per cell type and one row per probe.

• C: A matrix of concentrations. One row per cell type and one column per mixed sample.

We assume that the mixed signal is generated only from the cell types represented in the columns of G:

$$M = GC$$

Since each column of C represents the concentrations of cell types in a sample, each entry is between 0 and 1, and each column must sum to 1. Each column of M is therefore a linear mixture of the columns of G, with weights given by the corresponding column of C.

In these microarray experiments, M is always observed. The problem of quantifying CTP splits into two cases, depending on whether G, or a noisy representation of G, is available. Note that if C were observed, we would already have cell type proportions for all mixed samples, which solves our original problem. When G is available, we are dealing with reference-based methods. If G is not available, then we must rely on reference-free methods.

2.2 Reference-Based Methods

In the reference-based context, we have observed G and seek only to find an approximation of C, denoted $\hat{C}$. This is done on a sample-by-sample basis. The algorithm proceeds in two steps:

1. Signature Selection: Select an appropriate subset of rows of G, corresponding to probes. The selected probes are called the cell type signature probes and the resulting submatrix of G is called the cell type signature.

2. Optimization: For each column of M, denoted $M_i$, solve the following for $1 \le i \le L$:

$$\min_{\hat{C}_i} \lVert M_i - G\hat{C}_i \rVert$$

such that each element of $\hat{C}_i$ lies in the interval [0,1] and the elements sum to less than or equal to 1. Here the norm denotes some appropriate measure of distance between two vectors, for example Euclidean distance.

All reference-based estimation techniques rely on this formulation but differ in the choice of signature selection heuristic, distance function, optimization procedure and enforcement of constraints on $\hat{C}_i$.

2.2.1 Reference-Based Methods for Gene Expression

One of the early reference-based methods for cell type estimation from gene expression microarrays was presented by Abbas et al. [3].
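The optimization step of Section 2.2 can be illustrated with a minimal Python sketch. It assumes Euclidean distance and solves the constrained problem by projected gradient descent, which is only one possible solver and is not claimed to be the one used by any method cited here. The function names, the toy signature matrix and the number of iterations are hypothetical choices made for illustration.

```python
# Sketch: estimate proportions for one mixed sample under M_i = G C_i.
# The solver (projected gradient descent) and all names are illustrative
# assumptions, not the implementation of any cited method.


def project_to_constraints(v):
    """Euclidean projection onto {c : c >= 0, sum(c) <= 1}."""
    clipped = [max(x, 0.0) for x in v]
    if sum(clipped) <= 1.0:
        return clipped
    # Clipping alone violates the sum constraint, so project onto the
    # probability simplex instead (sort-based algorithm).
    theta = 0.0
    running = 0.0
    for j, u in enumerate(sorted(v, reverse=True), start=1):
        running += u
        candidate = (running - 1.0) / j
        if u - candidate > 0:
            theta = candidate
    return [max(x - theta, 0.0) for x in v]


def estimate_proportions(G, m, steps=2000):
    """Minimise ||m - G c||^2 subject to c >= 0 and sum(c) <= 1."""
    n_probes, n_types = len(G), len(G[0])
    # Step size 1 / ||G||_F^2 is a safe bound on the inverse Lipschitz constant.
    step = 1.0 / sum(g * g for row in G for g in row)
    c = [1.0 / n_types] * n_types
    for _ in range(steps):
        residual = [sum(G[p][k] * c[k] for k in range(n_types)) - m[p]
                    for p in range(n_probes)]
        grad = [sum(G[p][k] * residual[p] for p in range(n_probes))
                for k in range(n_types)]
        c = project_to_constraints(
            [c[k] - step * grad[k] for k in range(n_types)])
    return c


# Hypothetical signature: 6 probes (rows) by 3 cell types (columns).
G = [[0.9, 0.1, 0.1], [0.1, 0.9, 0.1], [0.1, 0.1, 0.9],
     [0.8, 0.2, 0.2], [0.2, 0.8, 0.2], [0.2, 0.2, 0.8]]
true_c = [0.6, 0.3, 0.1]
# Linear mixing assumption: the mixed sample is G times the true proportions.
m = [sum(g * c for g, c in zip(row, true_c)) for row in G]
estimate = estimate_proportions(G, m)
```

On this noise-free toy mixture the estimate should recover the proportions used to build the sample. With real data the recovered proportions depend on how well the signature probes separate the cell types.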
The method of Abbas et al. quantified the proportion of immune cells in adult whole blood samples. In the framework laid out above, signature selection was done by minimizing the condition number of the resulting G matrix, optimization was done with Euclidean distances, and the estimates had no enforced constraints. Their optimization is exactly the least squares method used in linear regression. As a post-hoc step, to get positive proportions, the process is run iteratively, each time removing the most negative coefficient. Finally, the results are normalized to sum to 1.

The least squares method was subsequently extended to explicitly enforce the constraints as part of the optimization. One extension used non-negative least squares to enforce the lower bound of the constraint, so that estimated proportions are never negative [43]. Another extension used quadratic programming to ensure that the proportions were in the interval [0,1] and that each sample's proportions summed to one [20]. More recent methods move away from the least squares framework to more robust methods like Support Vector Regression [39].

2.2.2 Reference-Based Methods for DNAm

Correcting for CTH in DNAm studies, while conceptually similar, requires acknowledgement of some DNAm-specific realities. In 2012, Houseman et al. [22] proposed a method for estimating cell type proportions from DNAm microarrays. This method, once again, attempts to estimate the cell type proportions, $\hat{C}$, that best approximate M with $G\hat{C}$. Signature selection was done by ordering F-statistics for CpGs. F-statistics were computed by independently fitting a linear mixed-effects model to each probe in the reference data to identify differentially methylated positions (DMPs) between cell types. The method uses Euclidean distances and quadratic programming to enforce constraints.

Houseman's method was successfully applied to validation blood samples from a cohort of 94 healthy adult individuals [30].
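The iterative least squares post-processing described for the gene expression setting (refit after dropping the most negative coefficient, then rescale to sum to 1) can be sketched as follows. This is a minimal illustration under stated assumptions, not the published implementation: it solves the normal equations directly, assumes the selected columns of G are linearly independent, and uses hypothetical function names and a toy signature matrix.

```python
# Sketch of least squares with iterative removal of negative coefficients.
# All names and the toy data are illustrative assumptions.


def solve_linear_system(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    aug = [list(A[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[r][k] -= factor * aug[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def least_squares(G, m, cols):
    """Ordinary least squares of m on the chosen columns of G."""
    rows = range(len(G))
    A = [[sum(G[p][a] * G[p][b] for p in rows) for b in cols] for a in cols]
    rhs = [sum(G[p][a] * m[p] for p in rows) for a in cols]
    return solve_linear_system(A, rhs)


def iterative_least_squares(G, m):
    """Refit without the most negative cell type until none is negative."""
    active = list(range(len(G[0])))
    coef = least_squares(G, m, active)
    while len(active) > 1 and min(coef) < 0:
        del active[coef.index(min(coef))]
        coef = least_squares(G, m, active)
    coef = [max(c, 0.0) for c in coef]
    total = sum(coef)
    result = [0.0] * len(G[0])
    if total > 0:
        for idx, value in zip(active, coef):
            result[idx] = value / total  # rescale so proportions sum to 1
    return result


# Hypothetical signature: 6 probes (rows) by 3 cell types (columns).
G = [[0.9, 0.1, 0.1], [0.1, 0.9, 0.1], [0.1, 0.1, 0.9],
     [0.8, 0.2, 0.2], [0.2, 0.8, 0.2], [0.2, 0.2, 0.8]]


def mix(weights):
    return [sum(g * w for g, w in zip(row, weights)) for row in G]


clean = iterative_least_squares(G, mix([0.6, 0.3, 0.1]))
# A sample whose unconstrained fit gives the third cell type a negative weight.
pruned = iterative_least_squares(G, mix([0.7, 0.3, -0.05]))
```

The second call shows the pruning behaviour: the cell type with the negative coefficient is removed, and the remaining two are refit and rescaled.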
This validation study used Illumina's 27k technology, the predecessor to the 450k. Each sample was subjected to complete blood counts (CBCs) and assayed on the DNAm microarray. The DNAm data, along with the reference data described in subsection 2.2.3, were used to estimate CTP. Since CBCs can only resolve cell types to the Lymphocyte, Monocyte and Granulocyte level, detailed estimates were aggregated appropriately. Estimates were shown to have low root-mean-squared error (Lymphocytes: 5%, Monocytes: 6%) and medium-high correlations (Lymphocytes: 0.6, Monocytes: 0.61). Thus, computational estimation of CTP in DNAm was demonstrated to be reasonably accurate in adult whole blood.

In 2014, Jaffe and Irizarry [25] extended Houseman's method to the Illumina 450k. This thesis builds upon Jaffe's method, so we describe their approach carefully here. Their method proceeds as follows:

1. Remove Bad Probes: Remove probes on the 450k that are known to be problematic due to SNPs in the hybridizing sequence.

2. Signature Selection: To form the matrix G, 100 probes were selected to distinguish each cell type. For each cell type, probes that were differentially methylated compared to all other cell types were identified using two-group t-tests. Probes with p-values < $10^{-8}$ were ranked in order of effect size. The 100 most differentially methylated probes were selected, balanced between highly methylated (most positive difference in mean methylation) and lowly methylated (most negative difference in mean methylation).

3. Optimization: Estimation of $\hat{C}$ is done by solving

$$\min_{\hat{C}_i} \lVert M_i - G\hat{C}_i \rVert$$

2.2.3 An Adult Blood DNAm Reference

Currently, the most widely used adult whole blood reference dataset for DNAm was published in 2012 [45]. Since then, it has been cited over 400 times. The dataset contains DNAm profiles for 6 cell populations in adult whole blood: Gran, Mono, Bcell, NK, CD4T and CD8T. In their comparison, Reinius et al. [45] showed that, between cell types, 85% of human genes have at least one differentially methylated probe. They conclude that interpretation of whole blood methylation measurements should be done with great caution.

2.2.4 Three Cord Blood DNAm References

As DNAm studies began focusing on cord blood, it became apparent that the existing CTP estimation methods were not performing well in cord blood samples. In 2015, Yousefi et al. [53] demonstrated that Jaffe's method, paired with the adult reference data, had very poor prediction accuracy for blood samples from newborns. The suggested culprit was a mismatch between the adult-derived reference data and the infant-derived target samples.

To rectify this problem, three different cord blood reference datasets were generated [5, 12, 19]. To date, only the dataset from the University of Oslo has been validated for CTP estimation accuracy. In their study, Gervin et al. [19] validated their reference dataset on an independent cohort of 195 individuals. Their results showed that using a cord-specific reference dataset improves correlation between estimates and cell counts, but low-abundance cell types like Bcell and CD8T show only moderate correlations.

In the next chapter, we report the CTP estimation performance of these three references on a set of cord blood samples with matched cell counts. We recommend one reference based on performance and coverage of cell types. Then, we show how estimation performance can be further improved by modifying the reference-based procedure itself.

2.3 Reference-Free Methods Applicable to DNAm

Reference-free methods allow for correction of CTH when there is no available reference dataset. That is, these methods estimate both G and C, usually through matrix decomposition.
Instead of directly estimating cell proportions, reference-free methods return surrogate variables that are functions of the cell proportions. These surrogate variables, similar to CTP estimates, can be directly incorporated into downstream analyses to correct for CTH.

Reference-free methods must estimate both the profiles, G, and the proportions, C, simultaneously from a set of DNAm microarray samples. Estimating both G and C is hard because these microarrays measure many more loci than there are samples. Formally, this is a highly under-determined problem. Also, reference-free methods will capture other systematic sources of variation, like batch effects, in addition to variation from CTH. Together, this makes reference-free results difficult to interpret directly.

In this section, we briefly outline the reference-free methods applicable to DNAm studies. First, we cover methods originally developed for gene expression studies but applicable to DNAm studies: SVA, ISVA, and RUV [18, 36, 48]. Then we summarize methods developed in the context of DNAm EWAS studies: LMMEWasher, RefFreeEWAS, and ReFACTOR [23, 44, 54]. ReFACTOR is given a detailed treatment because it serves as the reference-free comparison in this thesis.

In 2007, Leek and Storey [36] described surrogate variable analysis (SVA), an SVD-based method for estimating unmodeled sources of variation. They called variation of expression due to these unmodeled sources expression heterogeneity (EH). Many of the following reference-free techniques build upon SVA. The idea is to capture shared sources of variation between different observations of gene expression or DNAm. Conceptually, SVA proceeds through three steps:

1. Remove the signal attributable to the main variables of interest to identify an orthogonal basis for EH.

2. Find a subset of measurements associated with each basis of EH.

3. Estimate the surrogate variables from the identified subsets using the original data.

These surrogate variables are then incorporated into downstream regressions to control for the effects of confounding.

In 2011, Teschendorff et al. [48] extended SVA from orthogonal surrogate variables to statistically independent surrogate variables. This is achieved by replacing singular value decomposition (SVD), which enforces orthogonality, with independent component analysis (ICA). They show that independent surrogate variable analysis (ISVA) is more effective in cases where confounding is uncorrelated with the primary variables of interest in a non-linear fashion.

Another method for capturing unmodeled variation was presented by Gagnon-Bartsch and Speed [18] in 2012. It relies upon prior knowledge of genomic loci unaffected by the primary variable. This method, called remove unwanted variation (RUV), restricts the estimation of surrogate variables to negative control loci known a priori to be unaffected. By restricting the loci under consideration, RUV mitigates the problem of overcorrecting for biological variation of interest.

In 2014, Houseman et al. [23] proposed a reference-free method called RefFreeEWAS. This method expanded upon SVA by including the estimated covariates of the unadjusted model for differential methylation in the decomposition. They show algebraically how this expanded matrix better models the linear mixing assumption. Their method outperformed SVA when technical errors are small and variability is dominated by CTH.

Also in 2014, Zou et al. [54] described a reference-free approach based on linear mixed models (LMMs) called EWASher. Originally, LMMs were used to control inflation of association study test statistics due to genetic relatedness among inbred strains of model organisms like mice (Kang et al. [29]). By explicitly estimating the genetic relatedness of individuals, the model can better account for correlated measurements.
To capture relatedness between samples, this approach computes pairwise methylome similarity between samples, and then includes this as the covariance component of the linear mixed model as a proxy for cell type composition. The differential expression model is then fit to check for inflation of test statistics. If there is inflation, the process is run iteratively with an increasing number of principal components until test statistic inflation is controlled.

In 2016, Rahmani et al. [44] described ReFACTOR, a reference-free method based on sparse principal component analysis (PCA). Intuitively, since cell type composition effects should be shared across many CpGs, ReFACTOR tries to find a subset of probes that are well represented by a low-dimensional approximation of the observed samples. The resulting sparse factors should represent large-scale effects like variation due to CTH. The ReFACTOR algorithm proceeds in three steps:

1. Find a rank-k approximation of the sample matrix O. Call this matrix $\tilde{O}$.

2. Look for the top d CpGs that are best approximated. If $O_i$ and $\tilde{O}_i$ represent the ith row of O and $\tilde{O}$ respectively, then find the d rows that have the smallest distance($O_i$, $\tilde{O}_i$).

3. Run PCA on the subset of d sites from (2), and return the scores for the top k principal components.

The resulting principal components are the sparse factors that should be a function of the CTP and can be used in downstream correction of CTH.

Since reference-free methods do not rely upon experimentally generated cell type profiles, they must make assumptions to bound the under-constrained solution space. These assumptions can differ quite substantially between algorithms, and an algorithm's performance usually depends greatly upon how well its assumptions correspond to biological reality. For example, ReFACTOR and EWASher assume that the top components of variation are caused by cell type composition. In situations where this assumption is unfounded, these methods can overfit and remove true biological signal [46].
In contrast, SVA-based methods explicitly model out variation associated with the phenotype of interest before decomposing the residuals into surrogate variables. Thus, SVA-based methods no longer assume that the largest components of variation are due to cell type. However, SVA-based methods rely upon having a well-specified model, which may not be available. In summary, reference-free methods are applicable to a wider range of tissues, but suffer from limitations like overfitting, unrealistic assumptions and model availability.

In the next chapter, we compare reference-based estimates to reference-free surrogate variables in cord blood. Specifically, we examine the amount of variance captured by CTP estimates versus surrogate variables. We show how ReFACTOR is able to accurately model variation attributable to abundant cell types like Gran, but tends to overcorrect when accounting for minor cell types.

Chapter 3

Approach and Results

This chapter details our approach to improving reference-based cord blood CTP estimation. We first describe our materials and evaluation criteria, followed by three results: confirmation of low estimation performance on cord blood samples, diagnosis and resolution of problems, and evaluation of the improved method.

For the first result, we treat the CTP estimation technique as a black box. We confirm the previously reported result that adult-ref is unsuitable for CTP estimation in cord blood samples. Then, we compare the estimation performance of three different cord blood references. We comment on how updating the reference to be cord blood specific still leaves room for improvements to estimation.

To diagnose the degraded performance, we critically examined the steps involved in CTP estimation. The estimation procedure proceeds by normalizing the data, constructing a signature for each cell type, and optimizing for CTP.
Normalization consists of removing unreliable probes, removing noise associated with per-sample measurement, and bringing samples to a common measurement scale for comparison. We explored the space of normalization techniques and provide recommendations. Next, constructing a signature requires filtering for probes that can discriminate between cell types. We examined the measure of discriminability and tuned the size of the cell type signatures. Finally, we evaluated our improved pipeline by comparing it against the unmodified pipeline with adult reference data, the unmodified pipeline with cord reference data, and the reference-free technique ReFACTOR.

3.1 Description of Datasets

This thesis used six DNAm datasets obtained from blood samples. The datasets consist of four reference datasets and two validation datasets. Reference datasets contained DNAm measured in purified cell populations, called reference cell type profiles, used for CTP estimation. Validation datasets contained DNAm measurements from mixed blood, with matching CTP quantified experimentally. All samples had DNAm measured on the Illumina 450k microarray. Three reference profiles were from human infant cord blood and one was from human adult whole blood. There was one validation set each for cord blood and adult blood. A summary of all reference datasets can be found in Table 3.1. Each dataset is described in detail below.

3.1.1 Reference Cell Type Profiles

Reference-based estimation techniques rely upon DNAm measurements from purified cell type populations. Since human blood consists of multiple cell types, work must first be done to isolate specific cell types. Reference datasets tend to be quite small, with sample sizes ranging from six to fifteen profiles per cell type. These sample sizes are often limited by the experimental cost and requirements of isolating cell populations.
To carry out cell population purification, experimenters ideally have access to fresh samples and highly specialized equipment. Our reference datasets relied on two isolation technologies: Fluorescence Activated Cell Sorting (FACS) and Magnetic Activated Cell Sorting (MACS). While different in implementation, both technologies output samples highly enriched for one cell type.

Adult Whole Blood Reference Cell Type Profiles

Our adult reference dataset, referred to as adult-ref, contained cell type profiles for six cell types: Gran, Mono, Bcell, CD8T, CD4T and NK. We downloaded adult-ref from Bioconductor with R (FlowSorted.Blood.450k). These cell type profiles were isolated from blood samples donated by six healthy adult males. For full details see Reinius et al. [45].

Table 3.1: Count of Purified Reference Cell Type Profiles

Dataset    Blood  Gran  Mono  CD4T  CD8T  NK  Bcell  nRBC
Adult      6      6     6     6     6     6   6      0
UBC-ref    7      7     7     7     6     7   7      7
Oslo-ref   11     11    11    11    11    11  11     0
JHU-ref    15     12    15    15    14    14  15     4

Infant Cord Blood Reference Cell Type Profiles

The three cord reference datasets, all publicly available, were generated by three different research groups. For convenience, we will refer to them by the name of their respective universities of origin. That is, we will refer to them as University of British Columbia reference (UBC-ref), Oslo reference (Oslo-ref), and Johns Hopkins University reference (JHU-ref). Mentions of the cord blood reference refer to UBC-ref unless otherwise specified.

UBC-ref was obtained from seven cord blood samples from elective caesarian deliveries at BC Women's Hospital. Each sample was fractionated into seven cell types using FACS. This resulted in seven samples each of Gran, Mono, Bcell, CD4T, NK and nRBC. Due to poor sample quality, there were only six CD8T profiles. In addition, the infant cord blood samples were also assayed for DNAm prior to fractionation. For full details, refer to de Goede et al. [12].

JHU-ref was obtained from fifteen cord blood samples from full-term, healthy vaginal births at Johns Hopkins Hospital. Fractionation was done using MACS and resulted in a variable number of profiles per cell type (see Table 3.1). For full details, refer to Bakulski et al. [5].

Oslo-ref was obtained from eleven cord blood samples from uncomplicated births at Oslo University Hospital. Fractionation was done using FACS. All samples were successfully fractionated, resulting in one profile per sample for each of Bcell, CD4T, CD8T, Gran, Mono and NK. Notably, nRBC was not isolated in this reference dataset. For full details, refer to Gervin et al. [19].

3.1.2 Validation

Two sets of samples, one each for cord and adult blood, were used to evaluate the estimation performance of reference-based methods. We will refer to them as cord-validation and adult-validation respectively. All samples were measured for DNAm using the Illumina 450k as mixed tissue. These DNAm measurements were then used in cell type proportion estimation. In addition, each sample had cell counts generated experimentally. Normalized cell counts are considered the ground truth against which estimation performance was evaluated.

Adult Whole Blood Validation

Adult-validation was obtained from GEO (GSE77797). This dataset contains six adult whole blood samples with matched cell counts generated by flow cytometry. See Koestler et al. [31] for full details.

Infant Cord Blood Validation

Cord-validation consists of 24 cord blood samples obtained from deliveries at BC Women's Hospital. All samples were from term, healthy, elective caesarian deliveries. A small aliquot of each validation whole cord blood sample was sent to a pathology lab for CBC. A second aliquot was prepared in the same way as the reference samples above, with the same markers and antibodies, with a single exception: mononuclear cell fractions were run on a FACS machine for cell counting instead of sorting. Final counts were a combination of CBC and FACS data.
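One way such a combination of CBC and FACS data can be computed is sketched below. It assumes the CBC supplies nRBC, monocyte, granulocyte and total lymphocyte counts, and that FACS supplies the relative breakdown of lymphocytes into Bcell, NK, CD4T and CD8T. The function name, dictionary keys and example numbers are hypothetical and are not taken from the validation data.

```python
# Sketch: build ground-truth proportions from CBC counts and a FACS
# lymphocyte breakdown. Names and numbers are illustrative assumptions.


def combine_counts(cbc_counts, facs_lymphocyte_breakdown):
    """Return per-cell-type proportions that sum to 1."""
    total = sum(cbc_counts.values())
    # Scale the CBC counts so that they total 1.
    proportions = {name: n / total for name, n in cbc_counts.items()}
    # Split the total lymphocyte proportion using the FACS breakdown.
    lymphocytes = proportions.pop("Lymphocytes")
    breakdown_total = sum(facs_lymphocyte_breakdown.values())
    for name, fraction in facs_lymphocyte_breakdown.items():
        proportions[name] = lymphocytes * fraction / breakdown_total
    return proportions


# Made-up example values, for illustration only.
cbc = {"nRBC": 5, "Mono": 10, "Gran": 50, "Lymphocytes": 35}
facs = {"Bcell": 0.2, "NK": 0.1, "CD4T": 0.5, "CD8T": 0.2}
truth = combine_counts(cbc, facs)
```

In this made-up example, lymphocytes make up 0.35 of the sample, so CD4T, at half of the lymphocytes, ends up at 0.175 of the whole sample.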
CBC data provided nRBC, monocyte, and granulocyte counts, as well as total lymphocytes. These counts were scaled to total 1, then the lymphocyte breakdown (relative proportions of Bcell, NK, CD4T, and CD8T from FACS data) was multiplied by the calculated total lymphocyte proportion to generate final lymphocyte numbers.

3.2 Evaluation Metrics

We used the validation datasets to evaluate the estimation performance of all methods. For each sample, we computationally estimated the CTP. These estimates were compared against the experimentally generated CTP, which we considered the ground truth. Performance was measured with Spearman correlation (Rho) and mean absolute deviation (MAD). Correlation is our primary measure of performance because downstream correction only requires accuracy up to a scaling factor. We use MAD as a secondary measure of performance. High correlation together with low MAD implies good estimation of CTP in both magnitude and ordering. Such estimates can be used for qualitative insights in place of experimentally quantified cell type proportions. In this thesis, estimation performance without any qualification refers to Spearman correlation.

[Figure 3.1: Overview of Evaluated Reference-Validation Pairings. Diagram labels: Adult-Ref, UBC-ref, Oslo-ref, JHU-ref, Adult-Validation, Cord-Validation; pairings numbered 1 to 3.]

3.3 Validation of Existing Estimation Methods

To confirm the previously reported loss of accuracy, we estimated cell type proportions for both adult-validation and cord-validation. In this section, all estimates come from applying a previously validated methodology, varying only the reference dataset provided [22, 25]. First, we used the adult reference set to estimate CTP in the adult validation set. Next we used the adult reference set to estimate CTP in the cord validation set. We then compared the estimation performance of the adult and cord reference sets.
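The two evaluation metrics of Section 3.2 can be sketched in Python as follows. This is a minimal illustration that assumes MAD means the mean absolute difference between estimated and measured proportions for one cell type. The function names and the example values are hypothetical.

```python
# Sketch: Spearman correlation (Rho) and mean absolute deviation (MAD)
# between estimated and measured proportions of a single cell type.
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman_rho(estimated, measured):
    """Spearman correlation is the Pearson correlation of the ranks."""
    return pearson(average_ranks(estimated), average_ranks(measured))


def mean_absolute_deviation(estimated, measured):
    return sum(abs(e - m) for e, m in zip(estimated, measured)) / len(measured)


# Made-up example: estimates are exactly half of the measured proportions.
measured = [0.10, 0.20, 0.30, 0.40]
estimated = [0.05, 0.10, 0.15, 0.20]
rho = spearman_rho(estimated, measured)
mad = mean_absolute_deviation(estimated, measured)
```

The example is chosen to show why the two metrics are complementary: estimates that are off by a constant scaling factor keep a perfect rank correlation, while the MAD still reports the error in magnitude.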
Finally, we compared the performance of all three publicly available cord blood references on our cord validation set. For a summary, see Figure 3.1.

3.3.1 Approach

We compared the performance of an existing reference-based CTP estimation pipeline on both our adult and cord validation samples. We applied the widely used method described by Houseman et al. [22] and implemented in R by Jaffe and Irizarry [25]. This method is available as the function estimateCellCounts in the minfi package. The method takes two parameters: a target set of DNAm measurements for which to estimate cell proportions, and a reference set of cell type profiles. In this section, we treated estimateCellCounts as a black box and only modified these two parameters.

The estimateCellCounts function consists of three steps: normalization, signature construction and optimization. First, quantile normalization [50] is applied to make the reference and target datasets comparable. Then, using the reference dataset, the algorithm selects a number of probes that are differentially methylated between cell types. Finally, an optimization procedure is run to find the linear mixture that best reconstructs the mixed signal in the target samples from the signature. See Section 2.2 for a more detailed treatment.

Measuring Estimation Performance Of Reference Datasets

For the adult reference, we measured CTP estimation performance on both the adult and cord validation sets. For both, we used an unmodified version of estimateCellCounts. Proportions for Gran, Mono, Bcell, CD4T, CD8T and NK were estimated for each validation sample. Estimates of nRBC were set to 0, because adult-ref does not contain a representation for this cell type. Estimates were compared against measured cell proportions using both Spearman correlation and MAD.

We measured the CTP estimation performance of the three cord blood reference datasets. We used an unmodified estimateCellCounts to estimate CTP for the cord validation set.
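The signature construction step mentioned above (selecting probes that are differentially methylated between cell types) can be illustrated with a simplified sketch. It ranks probes only by the difference in mean methylation between one cell type and all others, and keeps an equal number of the most positive and most negative probes. It deliberately omits the t-test p-value filter, so it is not the procedure implemented in minfi. The function name and the toy reference data are hypothetical.

```python
# Simplified sketch of balanced signature selection by effect size.
# Omits the p-value filter; names and data are illustrative assumptions.


def select_signature(profiles, labels, cell_type, n_per_direction):
    """profiles[p][s] is the beta value of probe p in reference sample s."""
    inside = [s for s, lab in enumerate(labels) if lab == cell_type]
    outside = [s for s, lab in enumerate(labels) if lab != cell_type]
    effects = []
    for probe, row in enumerate(profiles):
        mean_in = sum(row[s] for s in inside) / len(inside)
        mean_out = sum(row[s] for s in outside) / len(outside)
        effects.append((mean_in - mean_out, probe))
    effects.sort()
    # Most negative differences: lowly methylated in this cell type.
    hypo = [p for diff, p in effects[:n_per_direction] if diff < 0]
    # Most positive differences: highly methylated in this cell type.
    hyper = [p for diff, p in effects[::-1][:n_per_direction] if diff > 0]
    return sorted(set(hypo + hyper))


# Toy reference: 4 probes (rows) by 4 purified samples (columns).
labels = ["Gran", "Gran", "Mono", "Mono"]
profiles = [
    [0.9, 0.9, 0.1, 0.1],  # higher in Gran
    [0.1, 0.1, 0.9, 0.9],  # lower in Gran
    [0.5, 0.5, 0.5, 0.5],  # uninformative
    [0.6, 0.6, 0.4, 0.4],  # weakly higher in Gran
]
gran_signature = select_signature(profiles, labels, "Gran", 1)
```

With one probe per direction, the toy example keeps the strongly hypermethylated and the strongly hypomethylated probe and ignores the uninformative and weakly informative ones.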
Estimates for nRBC were made with UBC-ref and JHU-ref, both of which contained reference cell type profiles for that cell type. Estimation of nRBC was omitted when using Oslo-ref because it does not contain a representation for this cell type.

Comparing Reference Datasets

To compare reference datasets, we used dimensionality reduction, DMP calling and hierarchical clustering. For dimensionality reduction, we used multidimensional scaling (MDS) plots as implemented in the minfi package under the function mdsPlot [10]. This function performs MDS, projecting the top 1000 most variable probes of the reference dataset onto two dimensions before plotting. To call differentially methylated positions between two datasets, we used probe-wise two-group t-tests on M-values. P-values were adjusted with the Benjamini-Hochberg procedure, and the significance threshold was set to a 1% false discovery rate [7]. For hierarchical clustering, we used the default hclust procedure in R with Euclidean distances between beta values of the signature probes subsequently used for CTP estimation.

3.3.2 Accuracy of Adult Reference Dataset

We first confirm previous findings that CTP estimation with adult reference cell type profiles performs well with adult samples but not cord samples. To do so, we used adult-ref to estimate CTP for both our adult and cord validation sets.

Estimates for adult-validation, when compared to measured proportions, were highly correlated and had low MAD. Figure 3.2 shows that for adult samples, all Spearman correlation coefficients were either moderate or high: Gran (Rho=.94), Mono (Rho=.71), NK (Rho=.60), Bcell (Rho=.94), CD4T (Rho=.89) and CD8T (Rho=1.00). Furthermore, we observed that points fell close to the line of unity (black solid line) for all cell types except NK, indicating that estimates were close in magnitude to the measured proportions.
This is in line with the estimation performance on adult samples previously reported by Koestler et al. [31].

[Figure 3.2: Estimation of Adult Validation Samples Using Adult Reference. Black line is x=y. Panel title: Adult Reference with Adult Whole Blood Samples. Per-panel annotations: Gran Rho=0.94, MAD=0.01; Mono Rho=0.71, MAD=0.02; NK Rho=0.6, MAD=0.05; Bcell Rho=0.94, MAD=0.01; CD4T Rho=0.89, MAD=0.03; CD8T Rho=1, MAD=0.02.]

Estimation performance using the adult reference for cord-validation was notably degraded. Figure 3.3 shows that only CD4T retains high correlation (Rho=.79). Gran (Rho=.49), Mono (Rho=.52) and NK (Rho=.59) are now only moderately correlated. Bcell (Rho=.43) and CD8T (Rho=.3) are lowly correlated. Proportions for nRBCs were not estimated because the cell type is not present in adult whole blood and therefore not represented in the adult reference.

[Figure 3.3: Estimation of Cord Validation Samples Using Adult Reference. Black line is x=y. Panel title: Adult Reference with Cord Blood Samples. Per-panel annotations: Mono Rho=0.52, MAD=0.03; NK Rho=0.59, MAD=0.03; nRBC Rho=NA, MAD=0.03; Bcell Rho=0.43, MAD=0.06; CD4T Rho=0.79, MAD=0.04; CD8T Rho=0.3, MAD=0.07; Gran Rho=0.49, MAD=0.08.]

Adult reference cell type profiles performed much worse when used to estimate CTP in cord blood samples. Figure 3.4A shows that the correlation between measured and estimated proportions was worse for all cell types present in both samples. Similarly, Figure 3.4B shows that the MAD between measured and estimated CTPs is higher for all cell types except NK. This degradation was not as dramatic as that reported by Yousefi et al. [53]. Their results, showing all CTP correlations between 0.01 and 0.03, were surprising. Adult and cord blood are closely related tissues, so our results of partial degradation in estimation performance align more closely with expectations. Similar to our results, Gervin et al. [19] reported that switching from an adult to a cord reference leads to only moderate changes in CTP estimates.

Since correlations are not directly comparable across sample sizes, we measured adult reference performance on subsamples of cord-validation. To do so, we subsampled cord-validation, matching the number of adult validation samples, before estimating CTP using adult-ref. Figure 3.5 shows a similar pattern of adult reference performance degradation. For correlations (Figure 3.5A), Bcell, CD8T and Gran showed the largest degradation when comparing adult performance against the median or 75th percentile performance over cord subsamples. CD4T and Mono estimation performance fell at the top end of the interquartile range. Only NK estimates showed similar correlation performance, with the adult correlations landing on the median performance over cord subsamples. Comparisons of MAD performance (Figure 3.5B) also showed Bcell, CD8T and Gran to be significantly degraded.

[Figure 3.4: Comparison of Adult Reference Performance by: (A) Correlation, (B) MAD. Panel title: Deconvolution Performance Of Adult Reference. Axes: Cell Type against Correlation (A) and MAD (B). Legend: SampleType (Adult, Cord).]
TypeMADSampleTypeAdultCordDeconvolution Performance Of Adult Reference − Subsampled CordFigure 3.5: Comparison of Adult Reference Performance After Subsamplingon Cord Validation Set: (A) Correlation, (B) MAD.28MonoRho=0.57MAD=0.02NKRho=0.69MAD=0.04nRBCRho=0.72MAD=0.06BcellRho=0.57MAD=0.03CD4TRho=0.87MAD=0.04CD8TRho=0.47MAD=0.06GranRho=0.54MAD=0.160.03 0.06 0.09 0.05 0.1 0.15 0 0.02 0.05 0.08 0.10.02 0.05 0.08 0.1 0.2 0.3 0.4 0.5 0.02 0.04 0.06 0.3 0.4 0.5 0.6 ProportionEstimated ProportionCord Reference With Cord Blood SamplesFigure 3.6: Estimation of Cord Validation Samples Using Cord Reference(UBC)We conclude that adult-reference dataset is unsuitable for estimation of CTPin cord blood samples. A major assumption of the reference-based paradigm isthat the validation sample is a mixture of the reference cell types profiles. Thus, aparsimonious potential solution to this degraded performance is to use a set of cordblood specific reference cell type profiles.3.3.3 Effects of Using a Cord Specific ReferenceSwitching from the adult reference to a cord blood specific reference improvesCTP estimation performance. This effect is unlikely due to sample size changes,since the adult reference has six profiles per cell type and the cord reference hasseven. Figure 3.6 shows estimates for the cord validation set using UBC-ref. Allcell types showed improved correlations between estimated and measured propor-tions (Figure 3.22). Notably, correlations for nRBC, previously not estimated, is29now quite high (Rho=.72). This overall improvement was expected to come fromtwo sources: more accurate representation of DNAm in shared cell types and rep-resentation of the previously missing nRBCs.To isolate the effect of missing nRBC in the reference profile, we comparedestimation of the same cord validation set under two conditions: UBC-ref withnRBC and UBC-ref without nRBC. Figure 3.7B shows the percentage differencein estimated CTP between these two conditions. 
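The comparison of the two conditions can be sketched as follows. This is a minimal illustration and not the code used in this work: it assumes that "percentage difference" means the change relative to the estimate obtained with the nRBC profile present, and the function names, dictionary layout and numbers are hypothetical.

```python
from statistics import mean


def percent_difference(with_nrbc, without_nrbc):
    """Relative change (in percent) of each cell type estimate for one sample.

    Assumption: the estimate made WITH the nRBC profile is the baseline.
    Cell types missing from either condition (nRBC itself) are skipped.
    """
    diffs = {}
    for cell_type, baseline in with_nrbc.items():
        if cell_type not in without_nrbc or baseline == 0:
            continue  # no defined ratio for this cell type
        change = abs(without_nrbc[cell_type] - baseline)
        diffs[cell_type] = 100.0 * change / baseline
    return diffs


def summarize(per_sample_diffs):
    """Per cell type (max, mean) of the percentage differences over samples."""
    cell_types = {ct for d in per_sample_diffs for ct in d}
    summary = {}
    for ct in sorted(cell_types):
        values = [d[ct] for d in per_sample_diffs if ct in d]
        summary[ct] = (max(values), mean(values))
    return summary


# Made-up proportions for two samples, for illustration only.
sample_1 = percent_difference(
    {"Bcell": 0.10, "CD4T": 0.20, "nRBC": 0.05},
    {"Bcell": 0.12, "CD4T": 0.21},
)
sample_2 = percent_difference(
    {"Bcell": 0.10, "CD4T": 0.20, "nRBC": 0.10},
    {"Bcell": 0.14, "CD4T": 0.20},
)
print(summarize([sample_1, sample_2]))
```

A per cell type maximum and mean of this kind is what a histogram such as Figure 3.7B summarizes.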
In Figure 3.7B, the x-axis is the percentage difference in estimated CTP for the same sample when estimation is done with and without the nRBC reference profile. The y-axis is the number of validation samples with that percentage difference. We observed the sensitivity of cell type proportion estimation to be unevenly distributed across cell types. Dramatic variations in estimates for Bcell (max=50%, mean=20%), Mono (max=52%, mean=10%) and NK (max=62%, mean=21%) were observed when the nRBC profile was added. Other cell types, like CD4T (max=15%, mean=5%), CD8T (max=10%, mean=3%) and Gran (max=23%, mean=13%), were much less sensitive. To visualize how the relationship between cell types affects sensitivity to missing references, we constructed a dendrogram using Euclidean distances. Figure 3.7A shows that sensitivity to missing nRBC profiles is related to both relative cell type abundance and distance to nRBC. This suggests that the optimization constraints of our reference-based method assign DNAm signal from missing cell types to other closely related cell types.

3.3.4 Comparison of Cord and Adult References

To investigate the difference in estimation performance for cord blood samples, we compared the adult reference and cord reference cell type profiles. First, we examined analogous cell types between the adult and cord references. Then we looked at the number of pairwise differentially methylated regions (DMRs), which are crucial to our reference-based algorithm. Finally, we looked at how probe-level variability within cell types might play a role in estimation accuracy.

First, visualizing the reference cell type profiles showed analogous cell types to be distinct. In Figure 3.8, we projected the thousand most variable probes onto two dimensions using MDS.
We observed that the same cell types within cord and adult, while similar, are visually distinct. In particular, Bcells differ markedly between cord and adult, corroborating that cell type's loss in accuracy. Furthermore, we noticed that nRBCs are located between the major clusters, a result of their intermediate methylation status (see Section 1.1.3). This indicates that adult cell type profiles are an imperfect proxy for their cord blood equivalents, explaining why adult references performed poorly when predicting CTP for cord blood samples.

Figure 3.7: (A) Clustering Between Cord Cell Types on Signature Sites using Euclidean Distances (B) Sensitivity to Missing nRBC in Reference Profile
(Panel A: cluster dendrogram, hclust with complete linkage on dist(t(cord_profile)), y-axis Height. Panel B: Prediction Sensitivity To Removing nRBC Reference, one histogram per cell type, axes Percentage Difference and Count.)

Figure 3.8: MDS Plot of Adult and Cord Reference Profiles
(Panels: MDS, Adult Reference; MDS, Cord Reference.)

Figure 3.9: Proportion of DMPs between Cell Types
(Plot title: Cell Type Pairwise DMRs. Panels: Adult, Cord. Axes: CT1, CT2.)

Next, we test the hypothesis that prediction of CTP in adult samples is more accurate because their cell types are much more distinct. Here, we quantified "distinct" as the proportion of measured CpGs that are differentially methylated between cell types.
Figure 3.9 shows the proportion of DMPs between pairs of cell types within each reference set. In adult, the cell types cleanly separate into lymphoid (CD4T, CD8T, NK, Bcell) and myeloid (Gran, Mono) cell lineages. Furthermore, Table 3.2 shows that in cord blood CD4T and CD8T cells have relatively few differentially methylated sites. Since the reference-based algorithm sub-selects from these DMPs to create the cell type signatures (2.2) for CTP estimation, we expect a lack of candidate probes to have negative downstream effects.

Table 3.2: Number of Differentially Methylated Positions Between Cell Types in Cord

         Bcell    CD4T    CD8T    Gran    Mono      NK    nRBC
Bcell        -   44505   45930   62836   58946   48853  183586
CD4T     44505       -    7122   80726   77772   44693  189751
CD8T     45930    7122       -   85685   84161   43557  191093
Gran     62836   80726   85685       -   17400   58792  152116
Mono     58946   77772   84161   17400       -   55965  155700
NK       48853   44693   43557   58792   55965       -  163982
nRBC    183586  189751  191093  152116  155700  163982       -

Finally, we examined the variability of the probes because that determines the variability of the estimated proportions [24]. To do so, we counted the number of probes that exhibit high variability, defined as a Beta standard deviation greater than 0.05. Figure 3.10 shows that adult cell types have more within-population variable probes than cord across all shared cell types. Thus, we can eliminate variability as the source of low performance in cord sample estimation using cord references.

3.3.5 Comparison of Three Cord Blood References

Currently, there are three publicly available reference cell type profiles for human cord blood. Each reference has salient characteristics (Section 3.1) such as number of samples, method of purification, and cell types represented. In this section, we estimate CTP on the cord validation set with each reference to ensure that they all show similar patterns of low estimation accuracy.

Figure 3.11 shows that there are large performance differences between the different cord references for a few cell types.
Gran, Mono, Bcell and CD4T show similar correlations across all three references. For CD8T and NK, JHU has markedly worse correlations. Oslo-ref does not contain nRBC references and so does not make any predictions for that cell type. Overall, Oslo and UBC perform similarly across shared cell types, while JHU stands apart.

Figure 3.10: Number of Variable Sites in Reference Cell Type Profiles
(Plot title: Number of Variable Sites In Reference Profiles (Beta SDev. > .05). Axes: CellType, Count. Legend: DataSet, Adult or Cord.)

Given the large performance differences, we next investigate the similarity of the selected signature probes. Figure 3.12 shows that the overlap between signature probes selected for UBC-ref and JHU-ref ranges from 24/100 for CD8T to 66/100 for NK. This was surprising, because the least overlapping cell type signature and the most overlapping cell type signature both correspond to cell types with markedly lower correlations. Figure 3.12 shows a similar pattern of overlap between UBC-ref and Oslo-ref signature probes, with NK having the most overlap and CD8T the least. Since the proportions of overlap between JHU/UBC and Oslo/UBC are similar, we cannot attribute JHU-ref's large drop in correlation to changes in the selected signature probes.

Since the reference datasets' differences in performance are not associated with signature similarity, we examine the estimates for individual samples. Figure 3.13 shows that, for shared cell types, Oslo-ref CTP estimates are reasonably correlated with measured CTP, and generally similar to UBC-ref estimates in Figure 3.6. In contrast, JHU-ref shows large overestimation of CD8T proportions and mostly estimates of 0 for NK proportions.

Figure 3.11: Correlation (Estimated vs. Measured) by Cord Blood Reference Set
(Plot title: Estimation Performance By Cord Reference. One panel per cell type. Axes: Cord Reference Dataset (JHU, Oslo, UBC), Correlation.)

The similar estimation accuracy between UBC-ref and Oslo-ref was surprising given the difference in sample size: seven and eleven respectively. UBC-ref performed best, despite having the smallest sample size. This suggests that further increasing sample size would be of limited benefit. We are mindful of the fact that UBC-ref might be performing better on cord-validation because they were both measured in the same facilities. However, we believe this effect to be quite small because the reference and validation data were measured in two completely separate experiments.

Figure 3.12: Comparison of Signature Probes Between Cord Reference Datasets
(Panels: Cord Profile Comparison, JHU vs. UBC; Cord Profile Comparison, Oslo vs. UBC. One overlap diagram per cell type.)

Figure 3.13: Estimation Performance by Alternative Cord Reference
(Oslo Cord Blood Reference panels: Gran Rho=0.52, MAD=0.12; Mono Rho=0.56, MAD=0.05; NK Rho=0.59, MAD=0.03; Bcell Rho=0.45, MAD=0.04; CD4T Rho=0.79, MAD=0.04; CD8T Rho=0.35, MAD=0.05. JHU Cord Blood Reference panels: Mono Rho=0.56, MAD=0.04; NK Rho=0.18, MAD=0.04; nRBC Rho=0.73, MAD=0.03; Bcell Rho=0.57, MAD=0.04; CD4T Rho=0.72, MAD=0.06; CD8T Rho=0.15, MAD=0.1; Gran Rho=0.52, MAD=0.15. Axes: Proportion, Estimated Proportion.)

We proceeded with UBC-ref due to the results above. JHU-ref was eliminated because of poor prediction performance on NK and CD8T. Between the more similar Oslo-ref and UBC-ref, we chose UBC-ref because it included a reference profile for nRBC.
Previous results, Figure 3.7 and Figure 3.4, suggest that a missing cell type can highly skew estimates and lead to decreased correlation with measured proportions.

3.4 Normalization

In the last section, we evaluated the performance of reference datasets using an existing estimation pipeline. Here, we begin our examination of the estimation pipeline's components. The first step is to perform normalization on both the reference and validation datasets.

Preprocessing and normalization are essential to the analysis of 450k data. The intent is to remove sources of variation unrelated to biological phenomena. This variation originates from many sources, such as dye bias, background fluorescence, differing dynamic ranges between probe types and running samples on different arrays. Enumerations of these technical challenges and their removal can be found elsewhere [11, 14, 33]. In this section, we focus on optimizing the normalization procedure for accuracy of CTP estimation in cord blood samples.

We optimized the DNAm data normalization procedure for CTP using a series of intermediate diagnostics. Table 3.3 summarizes steps for normalizing DNAm microarray data. Since normalization of DNAm requires multiple steps, an exhaustive exploration leads to a combinatorial explosion of procedures. In addition, such a brute force approach would result in severe overfitting on our small validation set. Instead, we used SVD analysis, clustering and p-value inflation to independently improve each step of the normalization.

3.4.1 Approach

Normalization of DNAm proceeds in several steps. First, probes with unreliable measurements are removed. Then within-array normalization corrects for noise unique to each sample. Finally, between-array normalization is used to adjust the measurements so that different arrays are comparable.

Table 3.3: Summary of Evaluated DNAm Normalization Methods

Step                            Method    Summary
1. Remove Unreliable Probes     Nordlund  Filter out probes with a high detection p-value, located
                                          at a known SNP, or hybridizing with multiple genomic
                                          locations. [40]
2. Background Correction        ILBG      Background subtraction using the average value of the
                                          negative control probes. [4]
                                Noob      Background subtraction using out-of-band measurements,
                                          instead of just negative control probes. [51]
3. Probe Type Normalization     SWAN      Match Type 1 and Type 2 probes by CpG density
                                          stratification, quantile normalization, and
                                          interpolation. [1]
                                BMIQ      Match Type 1 and Type 2 probe distributions by matching
                                          two beta-mixtures, one for each probe type. [49]
4. Between-Array Normalization  Quantile  Match distributions between all arrays by averaging
                                          across quantiles. [9]
                                ComBat    Match distributions between all arrays and remove batch
                                          effects using location and scale adjustment and
                                          empirical Bayes. [27]

Baseline Normalization Procedure

Currently, there is no consensus on how to best normalize data for CTP. One common normalization procedure consists only of quantile normalization between the reference and validation datasets [50]. Specifically, the validation and reference samples are pooled together before quantile normalization. This makes measurements comparable across samples, but may remove signal that is desirable in CTP estimation. We will use this as our baseline normalization procedure.

Steps to Optimize

The first step in the normalization procedure, removing unreliable probes, has been well established [16, 42]. Therefore, we did not believe this step had a large potential for improving estimation performance. Instead, we applied a standard filtering protocol as described by Dedeurwaerder et al. [16]. First, we filter out probes that have a high detection p-value (> .01), have low bead coverage (< 3), are known to have a SNP at the CpG, are known to cross-hybridize with multiple genomic loci, or are found on a sex chromosome.
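The filtering criteria above can be expressed as a simple predicate over per-probe annotations. The sketch below is only illustrative: the record layout, field names and probe identifiers are hypothetical, and it assumes that detection p-values and bead counts have already been reduced to one worst-case value per probe across samples.

```python
from dataclasses import dataclass

SEX_CHROMOSOMES = {"X", "Y"}


@dataclass(frozen=True)
class ProbeAnnotation:
    """Hypothetical per-probe summary used only for this illustration."""
    probe_id: str
    max_detection_p: float   # worst (largest) detection p-value over samples
    min_bead_count: int      # lowest bead count over samples
    has_snp_at_cpg: bool
    cross_hybridizes: bool
    chromosome: str


def is_reliable(probe, p_cutoff=0.01, bead_cutoff=3):
    """Return True if the probe passes every filter listed in the text."""
    if probe.max_detection_p > p_cutoff:
        return False          # high detection p-value
    if probe.min_bead_count < bead_cutoff:
        return False          # low bead coverage
    if probe.has_snp_at_cpg or probe.cross_hybridizes:
        return False          # SNP at the CpG or cross-hybridizing
    if probe.chromosome in SEX_CHROMOSOMES:
        return False          # located on a sex chromosome
    return True


def filter_probes(probes, p_cutoff=0.01, bead_cutoff=3):
    """Keep only the probes that pass all filters."""
    return [p for p in probes if is_reliable(p, p_cutoff, bead_cutoff)]


probes = [
    ProbeAnnotation("probe_a", 0.001, 12, False, False, "1"),
    ProbeAnnotation("probe_b", 0.020, 12, False, False, "1"),   # fails p-value
    ProbeAnnotation("probe_c", 0.001, 2, False, False, "7"),    # fails beads
    ProbeAnnotation("probe_d", 0.001, 12, False, False, "X"),   # sex chromosome
]
print([p.probe_id for p in filter_probes(probes)])
```

Keeping each criterion as a separate check makes it straightforward to report how many probes each filter removes.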
To identify probes on sex chromosomes, containing a SNP or known to cross-hybridize, we used the list published by Nordlund et al. [40]; other similar lists exist [42]. After this step, probes should have reliable measurements of DNA methylation.

After removing unreliable probes, we dealt with noise and technical artifacts unique to each sample. Two factors have been identified as the main sources of technical effects at the sample level: variability in background fluorescence and differing dynamic ranges between Type 1 and Type 2 probes [16, 49]. For each source, we tested two methods shown to have performed well in several benchmarks [16]. For background subtraction, we tested Illumina background correction (ILBG) and normal-exponential using out-of-band probes (NOOB) [51]. For Type 1 and Type 2 probe balancing, we tested subset within array normalization (SWAN) and beta mixture quantile normalization (BMIQ). To assess the efficacy of each method, we used SVD analysis to test for association of major components of variation with control probes. The Illumina 450k has several control probes designed to measure technical effects, independent of the sample used. We measured association of signal strength at these control probes to the major components of variation using SVD analysis [47]. We expected association with technical factors to negatively impact CTP estimation and thus sought to minimize such associations.

Next, we explored between-array normalizations to make measurements between samples comparable. Measurements between arrays are not directly comparable because they differ by a variety of factors unrelated to the biological signal. Between-array normalizations seek to minimize these differences, leaving only biological signal. We tested two of the between-array normalization techniques recommended by Dedeurwaerder et al. [16]: Quantile Normalization and ComBat.
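To make the first of these techniques concrete, the following is a textbook sketch of quantile normalization, which replaces the k-th smallest value in every sample with the mean of the k-th smallest values across samples. It is not the implementation used in this work; the data layout (one list of beta values per sample) is an assumption, and ties are broken by probe order rather than averaged. In the baseline procedure, reference and validation samples would be passed in together as one pooled list.

```python
def quantile_normalize(samples):
    """Quantile normalize a list of samples (each a list of probe values).

    Returns new lists in which every sample shares the same set of values,
    assigned according to each probe's rank within its own sample.
    """
    if not samples:
        return []
    n_probes = len(samples[0])
    if any(len(s) != n_probes for s in samples):
        raise ValueError("all samples must measure the same probes")

    # orders[i][k] is the index of the k-th smallest probe in sample i.
    orders = [sorted(range(n_probes), key=s.__getitem__) for s in samples]

    # Mean of the k-th smallest value over all samples.
    rank_means = [
        sum(s[order[k]] for s, order in zip(samples, orders)) / len(samples)
        for k in range(n_probes)
    ]

    normalized = []
    for order in orders:
        out = [0.0] * n_probes
        for k, probe_index in enumerate(order):
            out[probe_index] = rank_means[k]
        normalized.append(out)
    return normalized


# Two made-up samples with three probes each.
print(quantile_normalize([[2.0, 4.0, 6.0], [1.0, 9.0, 5.0]]))
# -> [[1.5, 4.5, 7.5], [1.5, 7.5, 4.5]]
```

Because every sample ends up with an identical distribution, any genuine distributional difference between samples is removed along with the technical one, which is the concern raised about the baseline procedure.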
After applying each technique, we used p-value inflation between biologically similar samples to diagnose these technical effects. Specifically, we compared reference cord blood samples measured before purification with cord blood samples in the validation dataset. We used probe-by-probe two-group t-tests and plotted the p-values with the expectation of a uniform distribution. Any deviation from uniformity was taken to be indicative of batch effects. Once again, we expected strong batch effects to negatively impact downstream estimates of CTP.

Finally, after normalizations, we clustered all samples. We used hierarchical clustering with Euclidean distances computed over all probes' beta values. We expected a good normalization procedure to result in samples of the same cell type clustering together.

3.4.2 Within Array Normalization

For within array normalization, we used SVD analysis to assess the association between the principal components of the data and the control probe signals, which are designed to be independent of the measured sample, as a check for strong technical effects. We tested four different within-array procedures: BMIQ, NOOB, SWAN, and ILBG. Only one of NOOB or ILBG can be applied since they both perform background subtraction. Similarly, only one of SWAN or BMIQ can be applied because they both balance Type 1 and Type 2 probes (described in Section 1.1.2). We concluded that a combination of NOOB and BMIQ best mitigates within-array technical artefacts.

Control probes on the 450k are designed to test the efficiency of various steps involved in measuring methylation. These probes are not dependent on the sample, and provide array diagnostics. For example, each negative control probe contains a randomly permuted sequence without CpGs. Thus, fluorescence at these probes measures the system-wide background signal strength. Any association of control probe measurements with major components of variation is indicative of confounding between biological and technical signals.
Successful application of preprocessing methods should reduce the significance of associations.

Figure 3.14 shows the association of various control probes with the major components of variation in the raw sample data. We observed that the largest component of variation (PC1) is strongly associated with control probes. Specifically, linear modelling of the first principal component (PC1) against the control probes shows p-values < 10^-5 for many green channel control probes.

For balancing Type 1 and Type 2 probes, BMIQ outperformed SWAN. SWAN erased the association of PC1 to control probes observed in the raw data, but left PC2 strongly associated with control probes. Figure 3.15B shows that SWAN leaves the green channel control probes strongly associated with PC2 (p-value < 10^-5). In contrast, Figure 3.15A shows that BMIQ also erases PC1 associations and, in addition, reduces the strength of associations between PC2 and control probes. Thus, we use BMIQ for probe type balancing.

For background subtraction methods, we tested ILBG and NOOB. ILBG only uses the negative control probes to estimate the background signal. Due to the small number of probes used, this can lead to poor estimates of the background. In contrast, NOOB uses the out-of-band measurements made on Type 1 probes. These out-of-band measurements are possible because of the probe design. Both methylated and unmethylated CpGs cause fluorescence on only one colour channel, leaving the second channel unused. Thus, the second channel can be used to estimate non-specific fluorescence.

We found that NOOB performed better than ILBG. Compared to raw measurements, ILBG left weak associations between control probes and PC1 (p-value < .05). Also after ILBG, we observed some strong associations between PC2 and control probes (p-value < 10^-5). In contrast, Figure 3.15C shows that NOOB completely removed association of PC1 to control probes.
There was residual association with PC2 after NOOB, but the strength of the associations was much lower than with ILBG (p-value < 10^-1).

3.4.3 Between Array Normalization

After applying within-array normalizations, we evaluated between-array normalizations that bring the reference and validation sets onto a common scale. We compared the mixed cord blood samples in the reference set to the cord blood samples in the validation set. When these two sets were normalized together using the baseline quantile normalization procedure, we observed broad p-value inflation in probe-by-probe two-group t-tests. P-value inflation is unexpected because both reference and validation sets are biologically similar cord blood samples. Figure 3.16A shows the p-value distribution to be dominated by a skew towards the left of the histogram (small p-values), indicating a large number of probes with differing methylation distributions. Figure 3.16B shows the result of applying ComBat with each sample's cell type explicitly modelled. Using ComBat results in over-correction, indicated by the skew of the resulting p-value distribution towards the right of the histogram.

Figure 3.14: SVD Analysis Raw Samples
(Heatmap of association p-values between principal components PC-1 to PC-6 and sample position variables and control probes. Legend bins: p < 1x10^-10, p < 1x10^-5, p < 0.01, p < 0.05, p > 0.05.)

In the context of CTP estimation, over-correction is preferable to under-correction. When comparing samples of the same type between reference and validation, quantile normalization results in many probes that are differentially methylated. The reference dataset contains measurements from mixed cord blood samples prior to purification. These reference samples should have similar methylation profiles to the validation samples, which are also mixed cord blood.
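The p-value inflation diagnostic can be illustrated with a small sketch. It assumes that per-probe p-values from the two-group t-tests have already been computed, and it summarizes their distribution in two ways: a histogram, and the ratio of observed to expected small p-values under uniformity. The function names and the 0.05 cutoff are illustrative choices, not part of the method described here.

```python
def pvalue_histogram(pvalues, bins=20):
    """Count p-values in equal-width bins over [0, 1]."""
    counts = [0] * bins
    for p in pvalues:
        if not 0.0 <= p <= 1.0:
            raise ValueError("p-values must lie in [0, 1]")
        # p == 1.0 is placed in the last bin.
        counts[min(int(p * bins), bins - 1)] += 1
    return counts


def inflation_ratio(pvalues, alpha=0.05):
    """Observed fraction of p-values below alpha divided by alpha.

    Under a uniform distribution the ratio is close to 1. Values well above 1
    indicate an excess of small p-values (residual differences between the
    two groups); values well below 1 indicate a depletion of small p-values.
    """
    observed = sum(1 for p in pvalues if p < alpha) / len(pvalues)
    return observed / alpha


# An evenly spaced grid stands in for uniformly distributed p-values.
uniform_like = [(i + 0.5) / 100 for i in range(100)]
# Half of the probes strongly "significant": an inflated distribution.
inflated = [0.001] * 50 + [0.5] * 50

print(pvalue_histogram(uniform_like), inflation_ratio(uniform_like))
print(inflation_ratio(inflated))
```

A flat histogram and a ratio near 1 correspond to the expectation for biologically similar groups; a ratio far from 1 in either direction flags the kind of under-correction or over-correction discussed in this section.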
Detection of many differentially methylated sites is a severe violation of the linear mixing assumption (see Section 2.1). Thus, quantile normalization is unfit for our application. On the other hand, over-correction may not have large negative downstream consequences. CTP estimation first builds a signature of discriminating probes used in estimation. Broad over-correction is tolerable if there are probes that retain their between cell type discriminability. Therefore, we conclude that ComBat is better suited to CTP estimation.

Figure 3.15: SVD Analysis of Background Subtraction and Probe Type Normalization Methods
(Panels: (A) BMIQ, (B) SWAN, (C) NOOB, (D) ILBG. Each panel is a heatmap of association p-values between principal components PC-1 to PC-6 and sample position variables and control probes, with the same legend as Figure 3.14.)

Figure 3.16: P-value Inflation When Comparing Cord Blood Samples After (A) Quantile Normalization (B) ComBat
(Plot title: Cord Blood Samples, Reference versus Validation. Axes: P Value, Count.)

3.4.4 Cluster Analysis of Full Normalization Pipeline

We compared the baseline normalization against our final normalization pipeline using hierarchical clustering. From the results above, we settled on a final normalization consisting of removing bad probes, NOOB, BMIQ and ComBat. As described in 1.2, we expected samples of the same cell type to cluster together if technical noise was correctly removed. Figure 3.17 shows how the baseline normalization procedure results in imperfect separation of samples by cell type. Two particularly imperfect separations were Gran mixed with Mono and CD4T mixed with CD8T. In contrast, Figure 3.17 also shows the clustering of samples after applying the optimized normalization procedure. Here, all except two samples cluster cleanly with samples of the same type. The two samples are a swap between NK and cord blood. This shows that our optimized normalization procedure, while possibly removing some signal, reliably corrects for the data's technical effects.

3.5 Signature Selection

After normalizations, we examined the construction of cell type signatures. First, we looked at the heuristic of balancing the number of probes that are highly and lowly methylated. Then we used a cross-validation framework to find the optimal number of signature probes.
Finally, based on our previous pairwise cell type distance results (Figure 3.9), we investigated whether treating CD4T and CD8T as the same cell type improves estimation performance. Since the optimization procedure only considers probes in the signature, proper selection is crucial to accurate estimation.

3.5.1 Approach

Constructing Cell Type Signature

Signature construction consists of ranking probes based on their ability to discriminate between cell types, then deciding how many probes will be included for each cell type. This process relies exclusively on the reference dataset. For each cell type, we identified probes that discriminate between that cell type and all others using a probe-by-probe two-group t-test comparing the target cell type with all others. Probes with a p-value higher than 10^-8 were removed. The remaining probes are ranked in terms of the difference between the mean beta value in the target cell type and that in all other cell types. Finally, we selected the probes that are most discriminating, as measured by the difference in group mean beta values.

Figure 3.17: Hierarchical Clustering of Samples After Each Normalization
(Panels: Quantile Normalization; Optimized Normalization. Samples are labelled by CellType.)

Investigating the Balancing of Highly and Lowly Methylated Probes

The adult estimation methodology constructs each cell type signature from 100 probes balanced between highly and lowly methylated measurements. That is, probes are ranked in terms of group mean beta differences, and 50 probes each are selected from the top and bottom of the list. This corresponds to the top 50 probes where the target cell type is more methylated than the other cell types, and the top 50 probes where the target cell type is less methylated than the other cell types. The full signature appends all the individual cell type signatures together, so for 6 cell types the signature contains 600 probes.

We investigated this balancing heuristic by comparing probes ranked by difference in mean beta values, the measure of discriminability, in the cord and adult signatures. For both adult and cord, we constructed signatures for each cell type of size 100, balanced between high and low methylation as described above. Then we asked whether the magnitude of discrimination was different, and whether there were at least 50 discriminating probes in each category of highly and lowly methylated.

Choosing A Signature Size

We used a resampling approach to explore how signature size affects CTP estimation. To do so, we created a sequence of signatures, of varying sizes, from the reference samples and multiple validation sets by sampling from our 24 validation samples.

The same sequence of signatures was used for all measurements. We used the cord reference dataset and ranked probes in terms of discriminability as described above. For each cell type, we take the top N most discriminable probes, where N follows the sequence {1, 3, 5, ... , 49}. The resulting signature is a matrix with N rows and columns equal to the number of cell types. These 25 signatures were used for all subsequent measurements.

We created 40 validation sets to measure performance and variability of performance of different signature sizes. First, we randomly split the twenty-four validation samples into two smaller validation sets of twelve samples each. We do this 20 times for a total of 40 validation sets. We measured estimation performance on all 40 validation sets for each signature in the sequence.

Figure 3.18: Effect Size of DMRs By Cell Type
(Plot title: Signature Discriminability Against Other Cell Types. Columns: High Beta, Low Beta; one row per cell type. Axes: Rank, Absolute Beta Difference. Legend: Source, Adult or Cord.)

Figure 3.19: Estimation Performance by Signature Size. Solid is Mean, Dashed is median, Dotted is 10th and 90th percentile.
(Plot title: Performance By Signature Size. One panel per cell type. Axes: Probe Count, Rho.)

3.5.2 Balancing of Highly and Lowly Methylated Probes

The heuristic of balancing high and low beta probes, while effective in adults, is unsuitable for cord blood. For each cell type, signature probes are identified by ranking probes in terms of discriminability, measured by mean absolute beta difference between the target cell type and all other cell types. Figure 3.18 shows the discriminability of the top 50 probes for both high and low beta probes for each cell type. We observed that Bcells have similar levels of discriminability in cord and adult for both high and low beta probes. In adults, there were at least 50 probes that are capable of acting as signature probes for each cell type. However, this is not true for cord. For high beta probes, Mono does not have 50 high beta probes with an absolute beta difference greater than zero. Thus, the balancing heuristic for cord blood results in discarding highly discriminating probes in the Low Beta category.
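The difference between the two selection rules can be seen in a small sketch. It assumes that probes have already passed the t-test p-value filter and that each probe carries a signed difference in mean beta (target cell type minus all others); the function names, probe identifiers and values are hypothetical and chosen only to show the effect.

```python
def balanced_signature(mean_beta_diff, per_side):
    """Take the top `per_side` hypermethylated and hypomethylated probes."""
    high = sorted(
        (p for p, d in mean_beta_diff.items() if d > 0),
        key=mean_beta_diff.get,
        reverse=True,
    )
    low = sorted(
        (p for p, d in mean_beta_diff.items() if d < 0),
        key=mean_beta_diff.get,
    )
    return high[:per_side] + low[:per_side]


def top_absolute_signature(mean_beta_diff, size):
    """Take the `size` probes with the largest absolute difference."""
    ranked = sorted(
        mean_beta_diff,
        key=lambda p: abs(mean_beta_diff[p]),
        reverse=True,
    )
    return ranked[:size]


# A cell type with few, weak high-beta probes but strong low-beta probes.
diffs = {
    "probe_1": 0.05,
    "probe_2": -0.60,
    "probe_3": -0.50,
    "probe_4": -0.40,
    "probe_5": 0.02,
}

print(balanced_signature(diffs, per_side=2))
# -> ['probe_1', 'probe_5', 'probe_2', 'probe_3']
print(top_absolute_signature(diffs, size=4))
# -> ['probe_2', 'probe_3', 'probe_4', 'probe_1']
```

In this toy case the balanced rule keeps probe_5 (difference 0.02) and drops probe_4 (difference -0.40), while ranking by absolute difference keeps the stronger probe.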
This result suggests that we should use the top 50 discriminating probes, irrespective of their high or low methylation values.

3.5.3 Finding an Optimal Signature Size

We measured the estimation performance of a sequence of signatures of sizes {1, 3, 5, ..., 49} on 40 datasets subsampled from our validation samples. Figure 3.19 shows that both mean and median estimation performance peak and plateau at signature sizes between 10 and 15 probes. nRBC, Bcell, NK and CD8T, all of which have low abundance, tend to have low performance at smaller signature sizes, between 1 and 10 probes, relative to their peak. CD4T and Gran have relatively good performance even in the extreme case of a signature with 1 probe per cell type. Looking at the 10th and 90th percentiles of performance, variability of estimation performance decreases as N increases and stabilizes around a signature size of 20 probes. Thus, we conclude that in the current prediction pipeline, performance is robust to signature size above this level.

Figure 3.20: Evaluating Effect of Treating Tcells as Indistinguishable. (Plot title: Effect of Merging CD4T and CD8T; panels: Mono, NK, nRBC, Bcell, CD4/8T, Gran; x-axis: Count; y-axis: Rho; Study: 4/8-Merge, Normal.)

Treating T-cells as Indistinguishable

Our previous results on pairwise cell type differences (Figure 3.9) suggest that CD4T and CD8T may be difficult to distinguish using DNAm. So, we investigated whether treating CD4T and CD8T as the same cell type improves estimation performance. Figure 3.20 compares the mean performance over all subsampled validation sets at varying signature sizes. We observed that merging CD4T and CD8T cells markedly improved performance at signature sizes below 20. However, at 20 probes, estimation performance for nRBC and Bcell has not yet peaked. At 50 probes, the merged and separated predictions are very similar for all cell types.
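The resampling design behind the signature size experiments above can be sketched as follows. This is an illustrative Python outline under stated assumptions, not the thesis code: the sample IDs, the function names and the fixed random seed are hypothetical, and only the counts (24 samples, 20 random halvings, odd sizes from 1 to 49) come from the text.

```python
import random
from typing import List, Sequence, Tuple


def signature_sizes(max_size: int = 49) -> List[int]:
    """Odd signature sizes 1, 3, 5, ..., max_size."""
    return list(range(1, max_size + 1, 2))


def resampled_validation_sets(
    sample_ids: Sequence[str], n_splits: int = 20, seed: int = 0
) -> List[Tuple[str, ...]]:
    """Randomly halve the samples n_splits times and keep both halves."""
    rng = random.Random(seed)  # fixed seed is an illustrative choice
    half = len(sample_ids) // 2
    sets: List[Tuple[str, ...]] = []
    for _ in range(n_splits):
        shuffled = list(sample_ids)
        rng.shuffle(shuffled)
        sets.append(tuple(shuffled[:half]))
        sets.append(tuple(shuffled[half:]))
    return sets


sizes = signature_sizes()
samples = [f"S{i:02d}" for i in range(1, 25)]  # hypothetical sample IDs
validation_sets = resampled_validation_sets(samples)
print(len(sizes), len(validation_sets))
```

Each signature size would then be scored on every one of the resulting validation sets, which is what allows the mean, median and percentile bands of performance to be drawn per size.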
Thus, we proceed with a signature size of 50 and keep the T-cells separate.

3.6 Evaluation

Taking the previous results, we created a CTP estimation pipeline optimized for cord blood samples. We mitigated the risk of overfitting our normalization pipeline by using diagnostics unrelated to CTP estimates. While there is still some chance that we chose a pipeline effective only on our dataset, we believe this is unlikely. Several other benchmarks have suggested that NOOB, BMIQ and ComBat result in high reproducibility of DMP detection and minimize differences between technical replicates [16, 37].

In this section, we validated the performance of our estimation pipeline. First, we measured the pipeline's performance on a set of 24 cord blood samples with matched cell counts (Section 3.1). Then we explored how the new normalization procedure affected the construction of cell type signatures during estimation. Finally, we compared our reference-based estimates against reference-free techniques for their ability to explain variance in mixed samples.

3.6.1 Approach

Validating Estimation Accuracy Of A Cord Optimized Pipeline

We used our pipeline to estimate CTP in our 24 cord blood validation samples. First, we filtered out unreliable measurements from probes with SNPs, with cross-hybridization, or on a sex chromosome. Second, we performed within-array normalization on the reference and validation datasets separately. Within-array normalization consisted of background subtraction using NOOB and probe balancing with BMIQ. nRBC references were not subjected to BMIQ because they do not satisfy the expected bimodal distribution. Third, the reference and validation samples were made comparable using ComBat for between-array normalization. We used ComBat to correct for sample batch with the sample's cell type explicitly modelled.
Once normalization was done, we constructed cell type signatures from the reference by finding the 50 most discriminating probes irrespective of high or low methylation status. Finally, we passed the signature and validation samples to the optimization procedure to obtain the final CTP estimates.

After benchmarking our pipeline in totality, we broke down the estimation improvements by step. To do this, we applied the changes in a step-wise fashion. Since we do not expect changes to be additive, we instead measured cumulative performance. We applied filtering for reliable probes, background subtraction, probe type normalization, and ComBat in a cumulative fashion. The probe type normalization was applied twice, once to all cell types and once to all cell types except nRBC, since nRBC does not exhibit the expected bimodal distribution of beta values. For comparison, we applied the baseline normalization with the probe filtering and modified signature selection.

As a sanity check, we tested for association between major components of variation and our estimated CTPs. PCA was used to find the major directions of variation. The top 5 PCs were then tested for association with estimated proportions using a two-sided Spearman correlation test.

Changes To The Cell Type Signatures

Since normalization affects the resulting cell type signatures, we compared the signatures resulting from our normalization scheme and from the baseline. We constructed two signatures, one after each normalization procedure, by selecting the top 50 discriminating probes by absolute difference in group means.

The two signatures were compared visually with Venn diagrams and analytically with the Jaccard Index. Then, we looked for patterns in the relative rankings of probes. We did this by partitioning the probes into three groups: intersecting probes, probes unique to baseline normalization and probes unique to optimized normalization.
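The three-way partition and the Jaccard Index described above can be sketched with plain set operations. This is a minimal illustration, not the thesis code; the function names and the integer probe IDs are hypothetical. The toy sets are built so that two 50-probe signatures share 32 probes.

```python
from typing import Dict, Iterable, Set


def partition_signatures(
    baseline: Iterable[int], optimized: Iterable[int]
) -> Dict[str, Set[int]]:
    """Split two signatures into shared probes and probes unique to each."""
    a, b = set(baseline), set(optimized)
    return {"shared": a & b, "baseline_only": a - b, "optimized_only": b - a}


def jaccard_index(baseline: Iterable[int], optimized: Iterable[int]) -> float:
    """Size of the intersection divided by size of the union."""
    a, b = set(baseline), set(optimized)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical probe IDs: two signatures of 50 probes with 32 in common.
baseline_signature = range(0, 50)
optimized_signature = range(18, 68)

groups = partition_signatures(baseline_signature, optimized_signature)
similarity = jaccard_index(baseline_signature, optimized_signature)
print({name: len(members) for name, members in groups.items()})
print(round(similarity, 3))
```

With 32 shared probes and 18 unique probes on each side, the union holds 68 probes, so the index is 32/68. Because both signatures have the same size, the two "unique" groups are always equally large.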
For probes in the intersection, we compared their relative rankings, based on discriminability, within each signature. For probes unique to each signature, we examined where they appear in their respective relative rankings.

Comparison To Reference-Free Techniques

Reference-free techniques are an alternative class of methods used to correct for CTH in association studies like genome wide association studies (GWAS) and EWAS. We will collectively refer to these association studies under the more generic xWAS. See Chapter 2 for a summary of reference-free techniques relevant to DNAm. We compared reference-based estimation to ReFACTOR, a reference-free technique based on sparse PCA [44]. We chose ReFACTOR because it was developed and previously validated on DNA methylation data. Also, it doesn't require specification of case and control status, a common requirement for reference-free techniques.

We carried out this comparison from a variance explained perspective because it closely aligns with the task of correction for CTP in xWAS studies. We used the cord validation set with true CTP. To quantify true variance explained by CTP, we fit a linear model for each probe with the beta value as the response and the measured CTP as the explanatory variables. Once fit, the multiple R^2 is taken as the true proportion of variability due to CTP. In xWAS, CTP correction is achieved by including either the estimates or surrogate variables in a probe-by-probe regression against case/control status. Any significant variability that is associated with case/control status, when accounting for changes in the CTP, is considered an association. Since we have the "true" amount of variance attributable to CTP, any deviations when regressing against estimated CTP or ReFACTOR variables can be considered over- or under-correction.

3.6.2 Validating Estimation Accuracy in Cord Blood

We validated our pipeline's performance by estimating CTP for our cord validation samples.
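Agreement between estimated and measured proportions is summarized per cell type as Rho. As a minimal sketch, assuming Rho denotes Spearman's rank correlation (the text names Spearman explicitly only for the PC association test), it can be computed as the Pearson correlation of the ranks. The helper names and the toy proportions below are hypothetical and are not thesis data.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman_rho(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical measured and estimated proportions for one cell type.
measured = [0.10, 0.20, 0.30, 0.40, 0.50]
estimated = [0.12, 0.18, 0.35, 0.33, 0.55]
rho = spearman_rho(measured, estimated)
print(round(rho, 3))
```

A rank-based measure only rewards getting the ordering of samples right, so it would typically be read alongside an absolute error measure when judging whether estimates are also well calibrated.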
Figure 3.21 shows that all estimates, except CD8T, are moderately or highly correlated with measured CTP: Bcell (Rho=.73), CD4T (Rho=.84), NK (Rho=.66), nRBC (Rho=.68), Gran (Rho=.64), Mono (Rho=.54), CD8T (Rho=.42).

Next, we compared our pipeline's performance to existing estimation methods. Figure 3.22 shows the estimation performance of the standard pipeline with adult-ref, the standard pipeline with ubc-ref, and our optimized pipeline with ubc-ref. The adult reference is clearly unsuitable for estimation of cord blood CTP. It poorly estimates Bcell and CD8T cell proportions, and provides no estimates for nRBC. Compared to the standard method with cord blood reference, our method markedly improves correlation of estimates for Bcell (+0.15) and Gran (+0.09). Other cell types showed a slight decrease in correlation: CD4T (-.03), CD8T (-.04), Mono (-.03), NK (-.02), nRBC (-.04).

Figure 3.21: CTP Estimation with Cord Optimized Pipeline. Line of unity shown in black. (Plot title: Corrected Estimation Of Cell Type Proportions; one panel per cell type; y-axis: Estimated Proportion. Panel annotations: nRBC MAD=0.05, Rho=0.68; Gran MAD=0.09, Rho=0.64; Mono MAD=0.03, Rho=0.54; NK MAD=0.04, Rho=0.66; Bcell MAD=0.03, Rho=0.73; CD4T MAD=0.09, Rho=0.84; CD8T MAD=0.04, Rho=0.42.)

To understand the incremental gains of each modification, we applied the optimized pipeline step-by-step. Figure 3.23 shows that the optimizations have an uneven impact across cell types. Background subtraction using NOOB greatly improved estimation of Bcell and nRBC, but negatively impacted CD8T and Mono. Signature selection without balancing for high and low methylation improved Mono and nRBC, but slightly degraded CD8T. Inter-array normalization using ComBat greatly improved estimation of Bcell and Gran but mostly left other cell types alone.
BMIQ, which forces methylation measurements to be bimodal, strongly degrades estimation performance when applied to nRBC because their methylation profiles do not satisfy the method's bimodal assumption.

As a sanity check for our CTP estimates, independent of the measured cell counts, we tested for association with the principal components of the validation data. Figure 3.24 shows that most CTP estimates are strongly associated with PC1: Bcell (p-value < .01), CD4T (p-value ≤ .001), CD8T (p-value ≤ .001), Gran (p-value ≤ .001), and NK (p-value ≤ .001). Further, nRBC is strongly associated with PC2 (p-value ≤ .01) and PC4 (p-value ≤ .01). These results are reassuring since we expect CTP to account for a large fraction of the variation observed in EWAS. Furthermore, this suggests that our estimates could effectively correct for CTH.

Figure 3.22: Estimation Performance By Reference And Normalization. (Plot title: Performance By Reference And Normalization; x-axis: CellType; y-axis: Rho; Method: Adult Ref. & Standard Normalization, Cord Ref. & Optimized Normalization, Cord Ref. & Standard Normalization.)

Figure 3.23: Performance Improvements Step by Step. (One panel per cell type, by normalization step.)

Table 3.4: Jaccard Index of Signature Probes for Standard and Optimized Normalization

Cell Type    Jaccard Index
Gran         0.471
Mono         0.515
Bcell        0.429
CD4T         0.587
CD8T         0.493
nRBC         0.0870
NK           0.471

3.6.3 Changes to the Cell Type Signatures

To investigate how normalization modifies the estimation procedure, we compared the signature probes selected by the baseline and optimized normalization procedures. Figure 3.25 shows that most cell types share between 60% and 74% of their signature probes. nRBC, with 16% shared probes, is an outlier with an almost completely different signature. Using the Jaccard Index as a measure of similarity (Table 3.4) leads to the same conclusion. Once again, most cell types have moderate to high signature similarity, except nRBCs which have low signature
similarity. It was surprising that Bcell and Gran, the two cell types with large performance improvements, did not show larger than average changes to selected signature probes.

Ranks of the shared probes between baseline and optimized pipelines are well correlated (Table 3.5). However, Figure 3.26 shows that most shared probes experienced a change in rank. Contrary to our intuition, the shared probes were not enriched for higher discriminatory power (i.e., lower ranks). Instead, we observed instances, for example in Bcell, where probes that were ranked as poorly discriminating in the baseline procedure were promoted to be highly discriminating in our optimized procedure.

Figure 3.24: Association of PCs to Estimated Cell Types. (Top: Adjusted Variance by PC for the validation data. Bottom: nRBC, NK, Mono, Gran, CD8T, CD4T, Bcell, Sex and Slide against PC1 to PC5 (Adjusted Principal Component), shaded by P-value: <=0.001, <=0.01, <=0.05, >0.05.)

Figure 3.25: Effect of Normalizations on Signature Probe Selection. (Plot title: Signature Probe Comparison: Standard and Optimized Normalizations. Venn diagram counts per cell type:)

Cell Type    Unique to Optimized    Unique to Standard    Shared
Gran         18                     18                    32
Mono         16                     16                    34
Bcell        20                     20                    30
CD4T         13                     13                    37
CD8T         17                     17                    33
nRBC         42                     42                    8
NK           18                     18                    32

Table 3.5: Pearson Correlations of Signature Probe Ranks Between Standard and Optimized Pipeline

Cell Type    Correlation
Bcell        0.65
CD4T         0.45
CD8T         0.69
Gran         0.67
Mono         0.62
NK           0.72
nRBC         0.27

Figure 3.26: Changes to Rank Of Signature Probes Between Normalizations. (Plot title: Signature Probe Comparison; one panel per cell type; x-axis: Standard Normalization Rank; y-axis: Optimized Normalization Rank.)

The optimized pipeline promoted previously low ranked signature probes to improve estimation performance in Bcell and Gran. Looking at Figure 3.27, we observed that only 2 out of the top 20 discriminating probes for Bcell were unique to the optimized pipeline.
The other 18 most discriminating probes were promoted from their previously low ranking in the baseline signature. A similar pattern is observed in Gran, with the top 20 discriminating probes originating from previously lower ranked positions in the baseline signature. This suggests that the better normalization scheme is removing probes that appear to be discriminating, but are actually just strongly affected by technical artefacts. For example, all reference samples for one cell type could be highly correlated at one probe due to concurrent experimental processing. This probe would be selected during signature construction as being highly discriminatory. However, once the technical artefact is removed, the probe no longer conveys as much information about cell type identity.

For CTP estimation, measurement at the signature probes appears to be at least as important as their identity. There appears to be some redundancy of information across potential signature probes. This redundancy would manifest as high correlation of methylation in signature probes. Together, this analysis underlines the importance of value calibration when constructing signatures for CTP estimation.

3.6.4 Comparison to Reference-Free Techniques

In this section, we compare our reference-based estimation to ReFACTOR, a reference-free correction method, in terms of CTP correction efficacy. So far, we have discussed the performance of CTP estimation in terms of correlation to true measured proportions. However, the end goal of CTP estimation is usually to correct for confounding in genome wide association studies.
Here, we explore the correction of confounding due to CTP with variance explained as the measure of efficacy (see Section 1.1). Since we can regress against measured CTP to quantify the true amount of variation explained at each probe, we consider any deviations to be spurious.

Figure 3.27: Rank of Probes Unique to a Normalization. (Two plots: Probes Unique To Optimized Signature and Probes Unique To Standard Signature; one panel per cell type; x-axis: Rank; y-axis: Count.)

Table 3.6: sPCA Explained Variance at 500 Representative Probes

PC    Prop. Variance Explained    R^2 to Cell Type Proportion
1     0.88                        0.74
2     0.03                        0.80
3     0.02                        0.49
4     0.02                        0.56
5     0.01                        0.17
6     0.01                        0.08

First, we examine the sparse principal components (sPCs), generated by ReFACTOR, and how they correlate with our measured CTPs. The top 6 sparse PCs account for 97% (Table 3.6) of the variance at the 500 sites used to estimate them (2.3). Within the 6 PCs, the variance explained is heavily skewed towards the first PC at 88%. The first two are highly correlated (Pearson) with measured CTP at R=0.74 and R=0.80 respectively. The third and fourth PCs show moderate correlation at .49 and .56 respectively. The last two PCs show almost no correlation to measured CTP. This suggests that using the first four PCs would accurately correct for CTP, while using the last two would result in removal of other sources of variation.

To understand whether sPCs explain variance differentially across probes, we measure the amount of variance explained by the first six PCs at different beta standard deviation (SD) value cutoffs. Table 3.7 shows that the amount of variance explained is stable across beta SD cutoffs.
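The variance explained metric used in this comparison is the multiple R^2 of a per-probe linear model. The following is a minimal pure-Python sketch under stated assumptions, not the thesis code: the helper names and toy values are hypothetical, the covariates are assumed to be linearly independent of each other and of the intercept (cell type proportions that sum exactly to one would require dropping one cell type), and the SD filter is assumed to use the sample standard deviation.

```python
from statistics import stdev
from typing import Dict, List, Sequence


def solve(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [v - f * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def r_squared(y: Sequence[float], covariates: Sequence[Sequence[float]]) -> float:
    """Multiple R^2 of y regressed on the covariates plus an intercept."""
    x = [[1.0] + [c[i] for c in covariates] for i in range(len(y))]
    k = len(x[0])
    xtx = [[sum(r[i] * r[j] for r in x) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(x, y)) for i in range(k)]
    coef = solve(xtx, xty)  # normal equations
    fitted = [sum(c * v for c, v in zip(coef, r)) for r in x]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


# Hypothetical data: one covariate (e.g. a proportion or an sPC) and two probes.
covariate = [0.1, 0.2, 0.3, 0.4, 0.5]
probes: Dict[str, List[float]] = {
    "probe_a": [0.2 + 0.8 * c for c in covariate],  # fully driven by the covariate
    "probe_b": [0.50, 0.51, 0.50, 0.49, 0.50],      # nearly constant
}
kept = [name for name, betas in probes.items() if stdev(betas) > 0.04]
explained = {name: r_squared(betas, [covariate]) for name, betas in probes.items()}
print(kept, explained)
```

Running the same function once with measured proportions and once with surrogate variables gives two R^2 values per probe, and their difference is what indicates over- or under-correction.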
Since the results are so stable across all levels of variation, subsequent analyses are based only on probes with beta SD greater than 0.04.

Table 3.7: sPCA Explained Variance Across All Probes

β SD Cutoff    N Probes    Variance Explained By sPCs    Variance Explained By Cell Type
0.04           39187       0.51                          0.39
0.022          112744      0.52                          0.38
0.009          256467      0.53                          0.38
0.004          360530      0.52                          0.38
0.002          381286      0.52                          0.38

Since sPC1 correlates well with Gran CTP, we examined how well it acts as a surrogate for Gran CTP in explaining variation. Figure 3.28 shows that across all probes, variance explained by sPC1 is highly correlated (R=0.92) with variance explained by measured Gran CTP. This suggests that sPC1 would be a reliable proxy for Gran CTP during correction in association studies. However, cord blood contains many other cell types besides Gran, so we next examine how well sPC1 represents all cell types.

Figure 3.28: Variance Explained by sPC1 Versus Measured Gran Proportions. (Plot title: Correlation of Variance Explained by sPC1 and Gran (R=.92); x-axis: Var. Expl. by Gran Proportion; y-axis: Var. Expl. by sPC1.)

Regressing against sPC1 does not correctly account for variation due to all cell types. Figure 3.29 shows a large number of probes that have variation attributable to cell type, but not sPC1, seen as the large blob spreading across the x-axis.

Figure 3.29: Variance Explained by sPC1 Versus All Measured CTP. (Plot title: Correlation of Variance Explained by sPC1 and CTs (R=.75); x-axis: Var. Expl. by CT Proportions; y-axis: Var. Expl. by sPC1.)

Increasing the number of top sPCs used for correction from 1 to 6 results in evidence of overcorrection. Figure 3.30 (bottom) shows that sparse PCA has a dense cloud of points in the upper-left quadrant, where each point represents a higher amount of variance explained by sPCs than expected.
The amount of variance explained by predicted cell type, Figure 3.30 (top), also shows evidence of over-correction to a lesser extent, seen as a slight skew above the line of unity.

As shown in Figure 3.31, using only the top sPC results in under-correction and using the top six PCs results in overcorrection. Relative to sparse principal components, using estimated cell counts from reference-based techniques, while slightly over-correcting, is closest to our measured ground truth. Therefore, we recommend using reference-based techniques when available.

Figure 3.30: Variance Explained by Top 6 sPCs Versus All Measured CTPs. Red is line of unity. (Plot title: Var. Expl. By Estimated CTP and top 6 Sparse PCs; panels: PredictedCT (top), SparsePCA (bottom); x-axis: Var. Expl. by CT Proportion; y-axis: Var. Expl. by Method.)

Figure 3.31: Difference to Variance Explained by Measured CTP. (Plot title: Difference In Variance Explained; x-axis: Difference To Var. Expl. by Measured CT Props.; y-axis: Probe Count; Method: PrCT, sPC1, sPC1to6.)

Chapter 4

Conclusions

The number of EWAS studies on human infant cord blood continues to grow, but these results are difficult to interpret due to a lack of adequate CTH correction techniques. Reference-based CTP estimation techniques currently correct effectively for CTH in adult whole blood EWAS studies. However, these methods do not generalize to studies of cord blood, a proxy for prenatal conditions. In this thesis, we explore why current techniques are inadequate and propose a better CTP estimation technique for cord blood.

First, we validated that current reference-based estimation techniques perform poorly on cord blood samples. Treating the existing methodology as a black box, we compared estimates to experimentally generated ground truth CTP on a cord blood validation set. The most widely used reference-based method for adult whole blood does not generalize to infant cord blood. We observed a large difference in estimation performance between adult and cord samples.
One possible reason for this degradation is the distinct methylation patterns between adult and cord cell types in blood. Another possible cause is the presence of nRBCs, a cell type unique to cord blood, which is unaccounted for by the adult method.

Next, we compared how well the existing methodology performs using three different publicly available cord blood reference datasets. These datasets were generated specifically to improve estimation accuracy. Our results showed that the same few cell types are poorly estimated by all three references. However, updating the references does result in performance gains and so partially resolves the problem. We also showed that two out of the three references provide similar estimation performance, with the third under-performing. We recommended one particular reference based on validated performance and more comprehensive representation of cord blood cell types.

Then, we explored how estimates can be improved by opening up the methodology black box. Our analyses suggested several causes for the loss of accuracy: insufficient preprocessing of batch effects, improper data normalization, and poor representation of cell types. We resolved the identified issues and demonstrated improved CTP estimation in cord blood.

Finally, we compared our estimates to a reference-free method using variance explained. This metric mimics correction for CTH without requiring case-control status. We observed that reference-based techniques can circumvent over-correction, a drawback of reference-free techniques.

This thesis is a case study in extending reference-based CTP estimation techniques for DNAm to a novel tissue. We demonstrated how extending a reference-based technique requires more than just a new reference dataset. Proper extension required careful consideration of both normalization and optimization in order to achieve high performance.

Despite their high performance, reference-based techniques are not without limitations.
For example, one difficulty is the delineation of cell types. In our study, granulocytes could have been further subdivided into eosinophils, basophils and neutrophils. Another troublesome subdivision was the separation of T-cells into CD4T and CD8T. Another limitation of reference-based techniques is their scope of application. Reference-based techniques can only be used on tissues for which we can generate references.

In addition to the described results, we explored several other, ultimately unsuccessful, avenues. One promising avenue was to borrow information from the adult reference dataset for prediction in cord blood. This idea fits under the umbrella of transfer learning and we thought it could limit the search space for cell type markers. We didn't observe any performance improvements, possibly because adult and cord cell types have very distinct methylation profiles. Another avenue was to perform estimation in a hierarchical fashion. We attempted to estimate cell types in groups that are iteratively refined, mirroring the process of cell type differentiation. This approach resulted in compounding errors and poor fine grained estimates of CTP. Finally, we assessed the effects of reference sample size by pooling together the different cord blood references. This was also unhelpful, possibly due to noise added by the diversity of methodology and experimental designs. Even though these paths did not improve estimation accuracy, they were instructive and paved the way to our successful method.

In the future, we'd like to study hybrid approaches that combine experimental and blind source separation techniques. This would extend the applicability of reference-based techniques to tissues that currently defy granular experimental characterization.

Bibliography

[1] SWAN: Subset-quantile Within Array Normalization for Illumina Infinium HumanMethylation450 BeadChips. Genome Biology, 13(6):R44, 2012. ISSN 1465-6906. doi:10.1186/gb-2012-13-6-r44.
URL http://genomebiology.biomedcentral.com/articles/10.1186/gb-2012-13-6-r44. → pages 40

[2] B. S. Aali, R. Malekpour, F. Sedig, and A. Safa. Comparison of maternal and cord blood nucleated red blood cell count between pre-eclamptic and healthy women. Journal of Obstetrics and Gynaecology Research, 33(3):274–278, 2007. ISSN 13418076. doi:10.1111/j.1447-0756.2007.00523.x. → pages 5

[3] A. R. Abbas, K. Wolslegel, D. Seshasayee, Z. Modrusan, and H. F. Clark. Deconvolution of blood microarray data identifies cellular activation patterns in systemic lupus erythematosus. PLoS ONE, 4(7), 2009. ISSN 19326203. doi:10.1371/journal.pone.0006098. → pages 5, 12

[4] M. J. Aryee, A. E. Jaffe, H. Corrada-Bravo, C. Ladd-Acosta, A. P. Feinberg, K. D. Hansen, and R. A. Irizarry. Minfi: A flexible and comprehensive Bioconductor package for the analysis of Infinium DNA methylation microarrays. Bioinformatics, 30(10):1363–1369, 2014. ISSN 14602059. doi:10.1093/bioinformatics/btu049. → pages 40

[5] K. M. Bakulski, J. I. Feinberg, S. V. Andrews, J. Yang, S. Brown, S. McKenney, F. Witter, J. Walston, A. P. Feinberg, and M. D. Fallin. DNA methylation of cord blood cell types: Applications for mixed cell birth studies. Epigenetics, 2294(March):00–00, 2016. ISSN 1559-2294. doi:10.1080/15592294.2016.1161875. URL http://www.tandfonline.com/doi/full/10.1080/15592294.2016.1161875. → pages 2, 7, 14, 20

[6] S. Beck and V. K. Rakyan. The methylome: approaches for global DNA methylation profiling. TRENDS in Genetics, 24(5):231–237, 2008. ISSN 0168-9525. doi:10.1016/j.tig.2008.01.006. URL http://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=pubmed&id=18325624&retmode=ref&cmd=prlinks. → pages 2

[7] Y. Benjamini and Y. Hochberg. Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. Journal of the Royal Statistical Society.
Series B (Methodological), 57(1):289–300, 1995. ISSN 00359246. doi:10.2307/2346101. URL http://www.jstor.org/stable/2346101. → pages 1, 24

[8] M. Bibikova, Z. Lin, L. Zhou, E. Chudin, E. W. Garcia, B. Wu, D. Doucet, N. J. Thomas, Y. Wang, E. Vollmer, T. Goldmann, C. Seifart, W. Jiang, D. L. Barker, M. S. Chee, J. Floros, and J. B. Fan. High-throughput DNA methylation profiling using universal bead arrays. Genome Research, 16(3):383–393, Mar 2006. ISSN 10889051. doi:10.1101/gr.4410706. URL http://www.genome.org/cgi/doi/10.1101/gr.4410706. → pages 4

[9] B. M. Bolstad, R. A. Irizarry, M. Astrand, and T. P. Speed. A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics (Oxford, England), 19(2):185–93, 2003. ISSN 1367-4803. doi:10.1093/bioinformatics/19.2.185. URL http://www.ncbi.nlm.nih.gov/pubmed/12538238, http://www.stat.berkeley.edu/~bolstad/normalize/. → pages 40

[10] I. Borg and P. J. Groenen. Modern multidimensional scaling: Theory and applications. Springer Science & Business Media, 2005. → pages 24

[11] K. D. Siegmund. Statistical approaches for the analysis of DNA methylation microarray data. 129:585–95, 06 2011. → pages 3, 39

[12] O. M. de Goede, H. R. Razzaghian, E. M. Price, M. J. Jones, M. S. Kobor, W. P. Robinson, and P. M. Lavoie. Nucleated red blood cells impact DNA methylation and expression analyses of cord blood hematopoietic cells. Clinical epigenetics, 7(1):1–11, 2015. ISSN 1868-7075. doi:10.1186/s13148-015-0129-6. URL http://dx.doi.org/10.1186/s13148-015-0129-6. → pages 2, 5, 7, 14, 20

[13] C. Deans and K. A. Maggert. What do you mean, "Epigenetic"? Genetics, 199(4):887–896, 2015. ISSN 19432631. doi:10.1534/genetics.114.173492. → pages 2

[14] S. Dedeurwaerder, M. Defrance, E. Calonne, H. Denis, C. Sotiriou, and F. Fuks. Evaluation of the Infinium Methylation 450K technology. Epigenomics, 3(6):771–784, 2011.
ISSN 1750-1911. doi:10.2217/epi.11.105. → pages 3, 39

[15] S. Dedeurwaerder, M. Defrance, E. Calonne, H. Denis, C. Sotiriou, and F. Fuks. Evaluation of the Infinium Methylation 450K technology. Epigenomics, 3(6):771–784, 2011. ISSN 1750-1911. doi:10.2217/epi.11.105. → pages 4

[16] S. Dedeurwaerder, M. Defrance, M. Bizet, E. Calonne, G. Bontempi, and F. Fuks. A comprehensive overview of Infinium HumanMethylation450 data processing. Briefings in bioinformatics, 15(6):929–941, 2014. ISSN 14774054. doi:10.1093/bib/bbt054. → pages 1, 40, 41, 53

[17] P. Du, X. Zhang, C.-c. Huang, N. Jafari, W. A. Kibbe, L. Hou, and S. M. Lin. Comparison of Beta-value and M-value methods for quantifying methylation levels by microarray analysis. BMC bioinformatics, 11(1):587, 2010. ISSN 1471-2105. doi:10.1186/1471-2105-11-587. URL http://www.biomedcentral.com/1471-2105/11/587. → pages 4

[18] J. A. Gagnon-Bartsch and T. P. Speed. Using control genes to correct for unwanted variation in microarray data. Biostatistics, 13(3):539–552, 2012. ISSN 14654644. doi:10.1093/biostatistics/kxr034. → pages 15, 16

[19] K. Gervin, C. M. Page, H. C. D. Aass, M. A. Jansen, H. E. Fjeldstad, B. K. Andreassen, L. Duijts, J. B. van Meurs, M. C. van Zelm, V. W. Jaddoe, H. Nordeng, G. P. Knudsen, P. Magnus, W. Nystad, A. C. Staff, J. F. Felix, and R. Lyle. Cell type specific DNA methylation in cord blood: a 450K-reference data set and cell count-based validation of estimated cell type composition. Epigenetics, 11(9):00–00, 2016. ISSN 1559-2294. doi:10.1080/15592294.2016.1214782. URL http://www.tandfonline.com/doi/full/10.1080/15592294.2016.1214782. → pages 2, 7, 14, 20, 26

[20] T. Gong, N. Hartmann, I. S. Kohane, V. Brinkmann, F. Staedtler, M. Letzkus, S. Bongiovanni, and J. D. Szustakowski. Optimal Deconvolution of Transcriptional Profiling Data Using Quadratic Programming with Application to Complex Clinical Blood Samples. PLoS ONE, 6(11):e27156, 2011. ISSN 1932-6203. doi:10.1371/journal.pone.0027156.
URL http://dx.plos.org/10.1371/journal.pone.0027156. → pages 12

[21] S. Horvath. DNA methylation age of human tissues and cell types. Genome Biology, 14(10):R115, 2013. ISSN 1465-6906. doi:10.1186/gb-2013-14-10-r115. URL http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=4015143&tool=pmcentrez&rendertype=abstract. → pages 3

[22] E. A. Houseman, W. P. Accomando, D. C. Koestler, B. C. Christensen, C. J. Marsit, H. H. Nelson, J. K. Wiencke, and K. T. Kelsey. DNA methylation arrays as surrogate measures of cell mixture distribution. BMC Bioinformatics, 13(1):86, 2012. ISSN 1471-2105. doi:10.1186/1471-2105-13-86. → pages 2, 5, 12, 22, 23

[23] E. A. Houseman, J. Molitor, and C. J. Marsit. Reference-free cell mixture adjustments in analysis of DNA methylation data. Bioinformatics, 30(10):1431–1439, 2014. ISSN 1367-4803. doi:10.1093/bioinformatics/btu029. URL http://bioinformatics.oxfordjournals.org/cgi/doi/10.1093/bioinformatics/btu029. → pages 15, 16

[24] E. A. Houseman, K. T. Kelsey, J. K. Wiencke, and C. J. Marsit. Cell-composition effects in the analysis of DNA methylation array data: a mathematical perspective. BMC Bioinformatics, 16(1):1–16, 2015. ISSN 1471-2105. doi:10.1186/s12859-015-0527-y. URL http://www.biomedcentral.com/1471-2105/16/95. → pages 10, 34

[25] A. E. Jaffe and R. A. Irizarry. Accounting for cellular heterogeneity is critical in epigenome-wide association studies. Genome biology, 15(2):R31, 2014. ISSN 1465-6914. doi:10.1186/gb-2014-15-2-r31. URL http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=4053810&tool=pmcentrez&rendertype=abstract. → pages 1, 3, 5, 7, 13, 22, 23

[26] R. Jiang, M. J. Jones, F. Sava, M. S. Kobor, and C. Carlsten.
Short-term diesel exhaust inhalation in a controlled human crossover study is associated with changes in DNA methylation of circulating mononuclear cells in asthmatics. Particle and fibre toxicology, 11:71, 2014. ISSN 1743-8977. doi:10.1186/s12989-014-0071-3. URL http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=4268899&tool=pmcentrez&rendertype=abstract. → pages 3, 5
[27] W. E. Johnson, C. Li, and A. Rabinovic. Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics, 8(1):118–127, 2007. ISSN 1465-4644. doi:10.1093/biostatistics/kxj037. → pages 40
[28] M. J. Jones, S. J. Goodman, and M. S. Kobor. DNA methylation and healthy human aging. Aging Cell, 14(6):924–932, 2015. ISSN 1474-9726. doi:10.1111/acel.12349. → pages 2, 3
[29] H. M. Kang, N. A. Zaitlen, C. M. Wade, A. Kirby, D. Heckerman, M. J. Daly, and E. Eskin. Efficient control of population structure in model organism association mapping. Genetics, 178(3):1709–1723, 2008. ISSN 0016-6731. doi:10.1534/genetics.107.080101. URL http://www.genetics.org/content/178/3/1709. → pages 16
[30] D. C. Koestler, B. C. Christensen, M. R. Karagas, C. J. Marsit, S. M. Langevin, K. T. Kelsey, J. K. Wiencke, and E. A. Houseman. Blood-based profiles of DNA methylation predict the underlying distribution of cell types: A validation analysis. Epigenetics, 8(8):816–826, 2013. ISSN 1559-2294. doi:10.4161/epi.25430. → pages 7, 13
[31] D. C. Koestler, M. J. Jones, J. Usset, B. C. Christensen, R. A. Butler, M. S. Kobor, J. K. Wiencke, and K. T. Kelsey. Improving cell mixture deconvolution by identifying optimal DNA methylation libraries (IDOL). BMC bioinformatics, 17(1):120, 2016. ISSN 1471-2105. doi:10.1186/s12859-016-0943-7. URL http://www.ncbi.nlm.nih.gov/pubmed/26956433. → pages 7, 21, 24
[32] L. Kruglyak. The road to genome-wide association studies. Nature reviews. Genetics, 9(4):314–8, 2008. ISSN 1471-0064. doi:10.1038/nrg2316. URL http://www.ncbi.nlm.nih.gov/pubmed/18283274. → pages 1
[33] P.
W. Laird. Principles and challenges of genome-wide DNA methylation analysis. Nature Reviews Genetics, 11(3):191–203, 2010. ISSN 1471-0056. doi:10.1038/nrg2732. URL http://dx.doi.org/10.1038/nrg2732. → pages 1, 3, 39
[34] L. L. Lam, E. Emberly, H. B. Fraser, S. M. Neumann, E. Chen, G. E. Miller, and M. S. Kobor. Factors underlying variable DNA methylation in a human community cohort. Proceedings of the National Academy of Sciences, 109(Supplement 2):17253–17260, 2012. ISSN 0027-8424. doi:10.1073/pnas.1121249109. → pages 3, 5
[35] T. Lappalainen and J. M. Greally. Associating cellular epigenetic models with human phenotypes. Nature Publishing Group, 18(7):441–451, 2017. ISSN 1471-0056. doi:10.1038/nrg.2017.32. URL http://dx.doi.org/10.1038/nrg.2017.32. → pages 1, 5, 6
[36] J. T. Leek and J. D. Storey. Capturing heterogeneity in gene expression studies by surrogate variable analysis. PLoS genetics, 3(9):1724–35, 2007. ISSN 1553-7404. doi:10.1371/journal.pgen.0030161. URL http://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.0030161. → pages 15
[37] J. Liu and K. D. Siegmund. An evaluation of processing methods for HumanMethylation450 BeadChip data. BMC Genomics, 17(1):469, 2016. ISSN 1471-2164. doi:10.1186/s12864-016-2819-7. URL http://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-016-2819-7. → pages 53
[38] S. Mohammadi, N. Zuckerman, A. Goldsmith, and A. Grama. A Critical Survey of Deconvolution Methods for Separating Cell Types in Complex Tissues. Proceedings of the IEEE, 105(2):340–366, feb 2017. ISSN 0018-9219. doi:10.1109/JPROC.2016.2607121. URL http://arxiv.org/abs/1510.04583 http://dx.doi.org/10.1109/JPROC.2016.2607121 http://ieeexplore.ieee.org/document/7676285/. → pages 6
[39] A. M. Newman, C. L. Liu, M. R. Green, A. J. Gentles, W. Feng, Y. Xu, C. D. Hoang, M. Diehn, and A. A. Alizadeh. Robust enumeration of cell subsets from tissue expression profiles. Nature Methods, 12(MAY 2014):1–10, 2015. ISSN 1548-7091. doi:10.1038/nmeth.3337.
URL http://dx.doi.org/10.1038/nmeth.3337. → pages 12
[40] J. Nordlund, C. L. Bäcklin, P. Wahlberg, S. Busche, E. C. Berglund, M.-l. Eloranta, T. Flaegstad, E. Forestier, B.-M. Frost, A. Harila-Saari, M. Heyman, Ó. G. Jónsson, R. Larsson, J. Palle, L. Rönnblom, K. Schmiegelow, D. Sinnett, S. Söderhäll, T. Pastinen, M. G. Gustafsson, G. Lönnerholm, and A.-C. Syvänen. Genome-wide signatures of differential DNA methylation in pediatric acute lymphoblastic leukemia. Genome Biology, 14(9):r105, sep 2013. ISSN 1465-6906. doi:10.1186/gb-2013-14-9-r105. URL http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=4014804&tool=pmcentrez&rendertype=abstract http://www.ncbi.nlm.nih.gov/pubmed/24063430 http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=PMC4014804 http://genomebiology.biomedcentral.com/articles/10.1186/gb-2013-14-9-r105. → pages 40, 41
[41] J. Peters. The role of genomic imprinting in biology and disease: an expanding view. Nature Publishing Group, 15(8):517–530, 2014. ISSN 1471-0056. doi:10.1038/nrg3766. URL http://dx.doi.org/10.1038/nrg3766. → pages 3
[42] M. E. Price, A. M. Cotton, L. L. Lam, P. Farré, E. Emberly, C. J. Brown, W. P. Robinson, and M. S. Kobor. Additional annotation enhances potential for biologically-relevant analysis of the Illumina Infinium HumanMethylation450 BeadChip array. Epigenetics & chromatin, 6(1):4, 2013. ISSN 1756-8935. doi:10.1186/1756-8935-6-4. URL http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=3740789&tool=pmcentrez&rendertype=abstract. → pages 3, 40, 41
[43] W. Qiao, G. Quon, E. Csaszar, M. Yu, Q. Morris, and P. W. Zandstra. PERT: A Method for Expression Deconvolution of Human Blood Samples from Varied Microenvironmental and Developmental Conditions. PLoS Computational Biology, 8(12), 2012. ISSN 1553-734X. doi:10.1371/journal.pcbi.1002838. → pages 12
[44] E. Rahmani, N. Zaitlen, Y. Baran, C. Eng, D. Hu, J. Galanter, S. Oh, E. G. Burchard, E. Eskin, J. Zou, and E. Halperin.
Sparse PCA corrects for cell type heterogeneity in epigenome-wide association studies. Nature Methods, 13(5), 2016. ISSN 1548-7091. doi:10.1038/nmeth.3809. URL http://www.nature.com/doifinder/10.1038/nmeth.3809. → pages 15, 16, 55
[45] L. E. Reinius, N. Acevedo, M. Joerink, G. Pershagen, S.-E. Dahlén, D. Greco, C. Söderhäll, A. Scheynius, and J. Kere. Differential DNA Methylation in Purified Human Blood Cells: Implications for Cell Lineage and Studies on Disease Susceptibility. PLoS ONE, 7(7):e41361, 2012. ISSN 1932-6203. doi:10.1371/journal.pone.0041361. URL http://dx.plos.org/10.1371/journal.pone.0041361. → pages 2, 4, 5, 14, 19
[46] A. E. Teschendorff and S. C. Zheng. Cell-type deconvolution in epigenome-wide association studies: a review and recommendations. Epigenomics, 9:epi–2016–0153, 2017. ISSN 1750-1911. doi:10.2217/epi-2016-0153. URL http://www.futuremedicine.com/doi/10.2217/epi-2016-0153. → pages 2, 17
[47] A. E. Teschendorff, U. Menon, A. Gentry-Maharaj, S. J. Ramus, S. A. Gayther, S. Apostolidou, A. Jones, M. Lechner, S. Beck, I. J. Jacobs, and M. Widschwendter. An epigenetic signature in peripheral blood predicts active ovarian cancer. PLoS ONE, 4(12), 2009. ISSN 1932-6203. doi:10.1371/journal.pone.0008274. → pages 41
[48] A. E. Teschendorff, J. Zhuang, and M. Widschwendter. Independent surrogate variable analysis to deconvolve confounding factors in large-scale microarray profiling studies. Bioinformatics, 27(11):1496–1505, 2011. ISSN 1367-4803. doi:10.1093/bioinformatics/btr171. → pages 15, 16
[49] A. E. Teschendorff, F. Marabita, M. Lechner, T. Bartlett, J. Tegner, D. Gomez-Cabrero, and S. Beck. A beta-mixture quantile normalization method for correcting probe design bias in Illumina Infinium 450 k DNA methylation data. Bioinformatics, 29(2):189–196, 2013. ISSN 1367-4803. doi:10.1093/bioinformatics/bts680. → pages 40, 41
[50] N. Touleimat and J. Tost.
Complete pipeline for Infinium® Human Methylation 450K BeadChip data processing using subset quantile normalization for accurate DNA methylation estimation. Epigenomics, 4:325–341, 2012. → pages 23, 40
[51] T. J. Triche, D. J. Weisenberger, D. V. D. Berg, P. W. Laird, and K. D. Siegmund. Low-level processing of Illumina Infinium DNA Methylation BeadArrays. 41(7):1–11, 2013. doi:10.1093/nar/gkt090. → pages 40, 41
[52] D. Venet, F. Pecasse, C. Maenhaut, and H. Bersini. Separation of samples into their constituents using gene expression data. Bioinformatics, 17(Suppl 1):S279–S287, 2001. ISSN 1367-4803. doi:10.1093/bioinformatics/17.suppl_1.S279. URL http://bioinformatics.oxfordjournals.org/content/17/suppl_1/S279.abstract. → pages 10
[53] P. Yousefi, K. Huen, H. Quach, G. Motwani, A. Hubbard, B. Eskenazi, and N. Holland. Estimation of Blood Cellular Heterogeneity in Newborns and Children for Epigenome-Wide Association Studies. Environmental and molecular mutagenesis, 51(3):229–235, 2015. ISSN 1098-2280. doi:10.1002/em. → pages 2, 7, 14, 25
[54] J. Zou, C. Lippert, D. Heckerman, M. Aryee, and J. Listgarten. Epigenome-wide association studies without the need for cell-type composition. Nature Methods, 11(3):309–311, 2014. ISSN 1548-7091. doi:10.1038/nmeth.2815. URL http://www.nature.com/doifinder/10.1038/nmeth.2815. → pages 15, 16

