UBC Theses and Dissertations

UBC Theses Logo

UBC Theses and Dissertations

Molecular interpretation of genome-wide association studies using multiomics analysis Farhadi Hassan Kiadeh, Farnush 2018

Your browser doesn't seem to have a PDF viewer, please download the PDF to view this item.

Notice for Google Chrome users:
If you are having trouble viewing or searching the PDF with Google Chrome, please download it here instead.

Item Metadata


24-ubc_2018_september_farhadihassankiadeh_farnush.pdf [ 2.08MB ]
JSON: 24-1.0368592.json
JSON-LD: 24-1.0368592-ld.json
RDF/XML (Pretty): 24-1.0368592-rdf.xml
RDF/JSON: 24-1.0368592-rdf.json
Turtle: 24-1.0368592-turtle.txt
N-Triples: 24-1.0368592-rdf-ntriples.txt
Original Record: 24-1.0368592-source.json
Full Text

Full Text

Molecular Interpretation of Genome-wide AssociationStudies Using Multiomics AnalysisbyFarnush Farhadi Hassan KiadehB.Sc., Sharif University of Technology, 2015A THESIS SUBMITTED IN PARTIAL FULFILLMENTOF THE REQUIREMENTS FOR THE DEGREE OFMaster of ScienceinTHE FACULTY OF GRADUATE AND POSTDOCTORALSTUDIES(Bioinformatics)The University of British Columbia(Vancouver)June 2018c© Farnush Farhadi Hassan Kiadeh, 2018The following individuals certify that they have read, and recommend to the Fac-ulty of Graduate and Postdoctoral Studies for acceptance, the thesis entitled:Molecular Interpretation of Genome-wide Association Studies UsingMultiomics Analysissubmitted by Farnush Farhadi Hassan Kiadeh in partial fulfillment of the re-quirements for the degree of Master of Science in Bioinformatics.Examining Committee:Sara Mostafavi, Statistics and Medical GeneticsSupervisorWyeth Wasserman, Medical GeneticsSupervisory Committee MemberMichael Kobor, Medical GeneticsSupervisory Committee MemberInanc Birol, Medical GeneticsProgram ChairiiAbstractGenome-wide association studies have found thousands of single-nucleotide poly-morphisms associated with various human traits. Recently, a powerful statisticalapproach called MetaXcan has been proposed for interpreting genome-wide asso-ciations at the gene level. We extended MetaXcan to a multiomics application,using a brain cortex reference dataset that includes gene expression, DNA methy-lation, and histone acetylation data from approximately 400 individuals. Our ap-proach, Multi-MetaXcan, consists of three steps. In the first step, we use regular-ized regression to build models that predict gene expression and variation in epige-nomic modifications from single-nucleotide polymorphisms. We call these modelsgenotype-based imputation models. In the second step, we apply these models tomap genome-wide associations to gene-level and epigenomic-level associations.Finally, in the third step, our model summarizes all molecular-level associations atthe gene level by building epigenome-based imputation models that predict geneexpression levels from nearby epigenomic marks like CpG sites and transcrip-tionally active regions. In summary, Multi-MetaXcan identifies trait-associatedgenes whose expression levels are impacted by single-nucleotide polymorphismsand their influence on intermediate molecular traits such as DNA methylation andhistone acetylation. We applied Multi-MetaXcan to a major depressive disordergenome-wide association study. As the result, we discovered 12 genes, 25 tran-scriptionally active regions, and 163 CpG sites associated with major depressivedisorder corresponding to 74 genes in total. 26 of these genes fall within or closeto previously identified major depressive disorder-associated genomic regions. Im-portantly, the inclusion of epigenomic data resulted in an additional 62 genes thatwere not identified by gene expression imputation model alone.iiiLay SummaryDNA stores what individuals inherit from their parents. The information in theDNA varies from one individual to another. Some of the DNA differences causedifferences in individuals including eye color, height, or the risk of carrying a cer-tain disease. Technology advancement has enabled scientists to discover parts ofthe human DNA that are related to such differences. There are mathematical ap-proaches to understanding the impact of these DNA differences on the functioningof human cells. Such methods provide researchers with biological insights abouthow DNA differences cause a certain trait. This thesis is an endeavor to improvethe current methods with a focus on DNA differences specific to major depressivedisorder.ivPrefaceThis dissertation is an original intellectual product of the author, Farnush FarhadiHassan Kiadeh. The breakdown of contributions reported in the present manuscriptis as below:• Data preprocessing, quality control, and normalization as described fromSection 3.1.2 to Section 3.1.4 have been done by Drs. Sara Mostafavi andBernard Ng.• Implementation of the statistical frameworks as described from Section 3.2to Section 3.4 has been performed by Farnush Farhadi Hassan Kiadeh.• Results exploration and interpretation as described in Chapter 4 and Chap-ter 5 have been done by Farnush Farhadi Hassan Kiadeh.• The manuscript has been written by Farnush Farhadi Hassan Kiadeh underthe supervision of Drs. Sara Mostafavi and Bernard Ng.vTable of ContentsAbstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iiiLay Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ivPreface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vTable of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viList of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ixList of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xGlossary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiiiAcknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xvDedication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xvii1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11.1 Genome-wide Association Studies . . . . . . . . . . . . . . . . . 21.1.1 Approach . . . . . . . . . . . . . . . . . . . . . . . . . . 21.1.2 Challenges . . . . . . . . . . . . . . . . . . . . . . . . . 41.1.3 GWAS Meta-analysis . . . . . . . . . . . . . . . . . . . . 41.2 Molecular Traits . . . . . . . . . . . . . . . . . . . . . . . . . . . 41.2.1 Measuring Gene Expression with RNA Sequencing . . . . 51.2.2 Measuring DNA Methylation with Microarray . . . . . . 6vi1.2.3 Measuring Histone Acetylation with ChIP-seq . . . . . . 61.3 Major Depressive Disorder . . . . . . . . . . . . . . . . . . . . . 71.4 Thesis Motivation and Contribution . . . . . . . . . . . . . . . . 71.5 Detailed Outline of Thesis . . . . . . . . . . . . . . . . . . . . . 82 Related Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102.1 From GWAS to Gene-based Analysis . . . . . . . . . . . . . . . 102.2 Expression Quantitative Trait Loci Analysis . . . . . . . . . . . . 112.3 Sequence Kernel Association Test . . . . . . . . . . . . . . . . . 122.4 Versatile Gene-based Association Studies . . . . . . . . . . . . . 132.5 PrediXcan . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132.6 Transcriptome-wide Association Studies . . . . . . . . . . . . . . 152.7 Summary Data-based Mendelian Randomization . . . . . . . . . 162.8 MetaXcan . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172.9 Recent Findings of MDD . . . . . . . . . . . . . . . . . . . . . . 183 Proposed Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . 213.1 Description of Datasets and Preprocessing . . . . . . . . . . . . . 213.1.1 Genotype Data . . . . . . . . . . . . . . . . . . . . . . . 223.1.2 Gene Expression Data . . . . . . . . . . . . . . . . . . . 223.1.3 DNA Methylation Data . . . . . . . . . . . . . . . . . . . 233.1.4 Histone Acetylation Data . . . . . . . . . . . . . . . . . . 243.1.5 Additional QC Steps . . . . . . . . . . . . . . . . . . . . 253.1.6 MDD GWAS Meta-analysis Summary Statistics . . . . . 253.2 Genotype-based Imputation of Molecular Traits . . . . . . . . . . 263.3 Identifying GWAS-associated Molecular Traits . . . . . . . . . . 283.4 Multi-MetaXcan: Summarizing Molecular-level Associations atthe Gene Level . . . . . . . . . . . . . . . . . . . . . . . . . . . 293.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 314 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 324.1 Building Genotype-based Imputation Model . . . . . . . . . . . . 324.1.1 Imputing Gene Expression Levels from SNPs . . . . . . . 324.1.2 Imputing CpG Site Methylation Levels from SNPs . . . . 33vii4.1.3 Imputing TAR Acetylation Levels from SNPs . . . . . . . 334.2 Identifying MDD-associated Molecular Traits . . . . . . . . . . . 334.2.1 Identifying MDD-associated Genes . . . . . . . . . . . . 374.2.2 Identifying MDD-associated CpG Sites . . . . . . . . . . 394.2.3 Identifying MDD-associated TARs . . . . . . . . . . . . 394.3 Building Epigenome-based Imputation Models . . . . . . . . . . 394.3.1 Imputing Gene Expression Levels from CpG Sites . . . . 394.3.2 Imputing Gene Expression Levels from TARs . . . . . . . 424.4 Summarizing Epigenomic-level Associations at the Gene Level . 424.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 455 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54viiiList of TablesTable 3.1 Summary of MDD GWAS meta-analysis cohorts. . . . . . . . 26Table 4.1 Summary of transcriptome-driven MDD-associated genes (TDGs). 37Table 4.2 Summary of approaches that translate epigenomic-level associ-ations to the gene level. . . . . . . . . . . . . . . . . . . . . . 42Table 4.3 Summary of the MDD-associated genes proximal to the MDD-associated genomic regions reported by Wray et al. [86]. Wefound that seven of the TDGs and 22 of the Muti-MetaXcangenes are proximal (1Mb) to nine such genomic regions. . . . 48ixList of FiguresFigure 2.1 From GWAS to gene-based analysis. . . . . . . . . . . . . . . 11Figure 3.1 Overview of multiomics integration. A) This figure depictsthe identification of trait-associated molecular traits. First, webuild three genotype-based imputation models for genes, CpGsites, and TARs, separately. Second, we apply MetaXcan tosummarize SNP-level p-values from GWAS at the CpG-, TAR-, and gene-level p-values. B) This figure depicts the summa-rization of CpG-level and TAR-level associations at the genelevel. First, we build two separate epigenome-based imputa-tion models for genes using either CpG sites or TARs as input.Second, we apply Multi-MetaXcan to summarize CpG-leveland TAR-level p-values at the gene level. . . . . . . . . . . . 27Figure 4.1 Cross-validation R2 of genotype-based imputation models thatpredict gene expression levels from cis SNPs. A) This fig-ure shows a histogram of cross-validation R2 for genes withan available elastic net model (n = 7,044). B) This figureshows a histogram of cross-validation R2 for genes with cross-validation R2 > 0.1 (n = 1,081). . . . . . . . . . . . . . . . . 34xFigure 4.2 Cross-validation R2 of genotype-based imputation models thatpredict CpG site methylation levels from cis SNPs. A) Thisfigure shows a histogram of cross-validation R2 for CpG siteswith an available elastic net model (n = 207,765). B) Thisfigure shows a histogram of cross-validation R2 for CpG siteswith cross-validation R2 > 0.1 (n = 30,444). . . . . . . . . . . 35Figure 4.3 Cross-validation R2 of genotype-based imputation models thatpredict TAR acetylation levels from cis SNPs. A) This fig-ure shows a histogram of cross-validation R2 for TARs withan available elastic net model (n = 12,167). B) This figureshows a histogram of cross-validation R2 for TARs with cross-validation R2 > 0.1 (n = 698). . . . . . . . . . . . . . . . . . 36Figure 4.4 Gene-level associations with MDD. A) This figure shows ahistogram of gene-level association p-values. B) This figureshows a Manhattan plot of gene-level association p-values. . 38Figure 4.5 CpG-level associations with MDD. A) This figure shows ahistogram of CpG-level association p-values. B) This figureshows a Manhattan plot of CpG-level association p-values. . . 40Figure 4.6 TAR-level associations with MDD. A) This figure shows ahistogram of TAR-level association p-values. B) This figureshows a Manhattan plot of TAR-level association p-values. . . 41Figure 4.7 Cross-validation R2 of epigenome-based imputation models thatpredict gene expression levels from cis CpG sites. A) This fig-ure shows a histogram of cross-validation R2 for genes withan available elastic net model (n = 4,234). B) This figureshows a histogram of cross-validation R2 for genes with cross-validation R2 > 0.1 (n = 476). . . . . . . . . . . . . . . . . . 43Figure 4.8 Training R2 of epigenome-based imputation models that pre-dict gene expression levels from cis TARs. A) This figureshows a histogram of training R2 for all genes in the study (n= 7,051). B) This figure shows a histogram of training R2 forgenes with training R2 > 0.1 (n = 1,302). . . . . . . . . . . . 44xiFigure 4.9 Translation of CpG-level associations to gene-level associa-tions by Multi-MetaXcan. A) This figure shows a histogram ofgene-level association p-values. B) This figure shows a Man-hattan plot of gene-level association p-values. . . . . . . . . 46Figure 4.10 Translation of TAR-level associations to gene-level associa-tions by Multi-MetaXcan. A) This figure shows a histogram ofgene-level association p-values. B) This figure shows a Man-hattan plot of gene-level association p-values. . . . . . . . . 47Figure 4.11 Comparison of transcriptome-driven Z statistics to DNAm-drivenZ statistics. . . . . . . . . . . . . . . . . . . . . . . . . . . . 49xiiGlossaryDGN Depression Genes and NetworksDLPFC Dorsolateral Prefrontal CortexGWAS Genome-wide Association StudiesHGP Human Genome ProjectLASSO Least Absolute Shrinkage and Selection OperatorLD Linkage DisequilibriumMAF Minor Allele FrequencyMDD Major Depressive DisorderMR Mendelian RandomizationNGS Next-Generation SequencingOLS Ordinary Least SquarePC Principal ComponentPGC Psychiatric Genomics ConsortiumQC Quality ControlSBS Sequencing By SynthesisSKAT Sequence Kernel Association TestxiiiSMR Summary data-based Mendelian RandomizationSNP Single-nucleotide PolymorphismSNV Single-nucleotide VariantTSS Transcription Start SiteTWAS Transcriptome-wide Association StudiesVEGAS Versatile Gene-based Association StudiesWGBS Whole Genome Bisulfite SequencingxivAcknowledgmentsI would like to thank my supervisor, Sara Mostafavi for being an amazing supervi-sor during my graduate studies. Without her patience, care, support, and supervi-sory, I could have not been able to write this document. I am delighted that I hadthe opportunity of having her in my academic life and learning from her.Also, I would like to thank Bernard Ng for his great support through my project.I had the opportunity to ask him my questions, discuss my results with him, andconsult with him on various parts of my project.In addition, I would like to thank my committee members, Wyeth Wassermanand Michael Kobor, who advised me through my project. I am glad I had theopportunity to ask them for their valuable feedback on various parts of my project.I would like to extend my appreciation to all my friends through this journey.My fellow Mostafavi lab members: Louie Dinh, Halldor Thorhallsson, Emma Gra-ham, Hamid Omid, Sasha Miroslav, William Casazza, Michael Vermeulen, SinaJafarzadeh, and Nam Hee Kim. Special thanks to Emma for being a great friendand chatting with me whenever I felt overwhelmed. My adoptive lab and graduateprogram: Rachelle Farkas, Phil Richmond, Oriol Fornes, Jasleen Grewal, ShamsBhuiyan, Sam and Amelia Hinshaw, Santina Lin, and Hamid Mohammadi. Specialthanks to Hamid for his help and advice through my master’s program.My special thanks to my closest friends: Bita Nejat, Zeinab Sadeghipour,Zeynab Nosrati, Behrouz Sepehry, and Reza Babanezhad. These people made mylife more colorful and were always there for me whenever I needed a friend to talkto.I would like to thank Luisa Abuchaibe for being an amazing dance instructorand an inspiring woman. She taught me how to explore new dimensions in myself.xvI would like to thank Marjan Farahbod. When I first entered the program andhad no clue of how things work, I was fortunate to meet her and have her support.She helped me through my academic and life challenges. She helped me to becomea stronger woman.Here it comes to my oldest friends: Ghazale Asghari, Roya Bagheri, andFateme Abdi. They supported me through all the most challenging moments inmy life. I am glad I had the joy to share my happiness with them and they werekind enough to be there for me in the times of sadness and frustration.My greatest appreciation goes to my family and to my partner. Many thanksto my parents and my sister for being there for me. Thanks to my little brotherfor making me feel the happiest person in the world by his cheerful smile. Finally,many thanks to my partner, Amir Sepehri, for his endless help through all the thingsin my life.xviDedicationTo my parents, Farzad and Homa.xviiChapter 1IntroductionRecent technology advancements have led to the development of sequencing assaysthat enable genetic profiling of individuals [10, 82, 84]. Data generated throughsuch assays have revealed that humans share 99.9% of their DNA. However, thereare still millions of positions in the human genome that vary between individu-als. Some such sequence variants cause differences in observable human traits [1].Two principal questions that researchers have been trying to answer are a) whichsequence variants are related to human traits and b) what molecular mechanismslink sequence variants to individuals’ traits.For the case of common traits, researchers have attempted to answer these ques-tions by relating genotype data across a large number of individuals to commontraits through Genome-wide Association Studies (GWAS). During the last decade,thousands of Single-nucleotide Polymorphism (SNP)s have been found to be asso-ciated with multiple human traits [71, 83, 86].Despite the noteworthy findings of GWAS, it is challenging to connect trait-associated variants to the affected genes underlying the associations because themajority of SNPs found by GWAS fall in non-coding regions of the human genome [2].The availability of various types of large genomics datasets that profile gene ex-pression, epigenomic modifications, and protein expression provides an opportu-nity to decipher the mechanisms associated with various human traits. For exam-ple, several studies have tried to integrate GWAS findings with gene expressiondata in order to find molecular mechanisms underlying human traits [6, 7, 31, 34,152, 91].In this thesis, we build on an integrative gene-based approach developed by Bar-beira et al. [6, 7] to combine multiple genomics datasets such as gene expression,DNA methylation (DNAm), and histone acetylation (HA) data with Major Depres-sive Disorder (MDD) GWAS summary statistics to investigate molecular mecha-nisms underlying MDD. Specifically, we identify MDD-associated gene expres-sion levels, CpG site methylation levels, and transcriptionally active region (TAR)acetylation levels. Finally, we develop imputation models that link DNAm and HAdata to gene expression levels and translate CpG-level and TAR-level associationsto gene-level associations.1.1 Genome-wide Association StudiesWhen the Human Genome Project (HGP) was completed in 2003 [21, 82], re-searchers started investigating areas of the human genome that vary among in-dividuals. Accordingly, researchers found that the most common class of variationfound in the human genome is single base variants. Such variants are called Single-nucleotide Variant (SNV)s. Single-nucleotide Polymorphism (SNP)s are SNVs thatoccur among more than a certain percentage (usually 1%) of people within a pop-ulation [61]. Researchers further found that all humans collectively have approx-imately 15 million SNPs, of which some are population-specific [20]. During thelast decade, GWAS have identified thousands of SNPs that are associated with vari-ous human traits [4, 71, 83, 85, 86]. Most such SNPs fall in the non-coding regionsof the genome [2]. SNPs falling in the regulatory regions of the DNA may affect agene’s promoter activity and thus, change the expression level of that gene [73].1.1.1 ApproachSNPs typically have two alleles, i.e., there are mainly two possible nucleotidesthat may occur for a given SNP within a human population. The frequency of aSNP is defined by the frequency of the less frequent allele, which is called MinorAllele Frequency (MAF). For example, a SNP with the minor allele A and MAF0.1 implies that 10% of the population carry A in that SNP position and the restcarry the major allele [15].2To perform GWAS, 105 to 106 SNPs across human genome are genotyped us-ing a genotyping assay. Recent innovations in DNA microarrays have lowered thecost of genotyping significantly. To date, two cost-efficient and popular genotyp-ing technologies are Affymetrix and Illumina platforms [61, 69]. In addition tothe genotyped SNPs, imputation methods such as IMPUTE [36, 54] and BEA-GLE [12–14] can be used to predict the genotype of the unmeasured SNPs, whichare found in the human genome but are not genotyped through genotyping as-says [53]. Such imputation methods work based on leveraging the informationfrom regions of the genome that are in high Linkage Disequilibrium (LD).Genotype data are generally coded with 0, 1 and, 2 values. As a standard, theSNP genotype value represents the number of minor alleles at a given position.For example, a SNP with minor allele A and genotype value 2 implies that bothpaternal and maternal copies of the individual’s chromosome carry allele A at thatposition. The goal of GWAS is to find a set of SNPs that have significantly differentallele frequencies between individuals who possess the GWAS trait and those whodo not [18].The most commonly used method to perform GWAS on a trait characterizedby a binary variable is a case-control study. Individuals in the study populationcarrying a specific trait are labeled as cases and the rest are labeled as controls [15].One basic and commonly used approach to carry out a GWAS analysis in a case-control study is logistic regression [23]. Logistic regression reports the effect sizeof each SNP as odds ratios [18]. A significant high or low odds ratio indicates thatthe SNP’s reference allele frequency is significantly different between cases andcontrols, thereby implies that the SNP is significantly associated with the trait.GWAS results are available at either individual level or summary statistics. Inthe former, the actual genotype data of individuals are available whereas in thelatter, only the SNP association significance (i.e., effect sizes and p-values) is re-ported. Accessing protected individual-level data may be challenging due to dataaccess policies and computational and storage burdens [6]. In addition, preprocess-ing individual-level genotype data needs care since GWAS results are sensitive tothe Quality Control (QC) steps performed on genotype data.31.1.2 ChallengesAlthough GWAS have successfully discovered thousands of SNPs that are asso-ciated with multiple human traits [83, 86], there are some challenges involved inGWAS. GWAS require large sample sizes to have sufficient statistical power torobustly discover associations between SNPs and a trait. Two main reasons forrequiring large sample sizes are: 1) GWAS examines millions of SNPs simultane-ously and thus, suffer from a severe multiple-testing burden and 2) effect sizes ofSNPs are often small [46, 76, 77]. In addition to the computational burden, sincemany of the trait-associated SNPs are located in non-coding regions of the genome,molecular mechanisms underlying SNP-level associations remain unknown [2, 56].1.1.3 GWAS Meta-analysisGWAS meta-analysis combines multiple GWAS analyses to increase the statisticalpower to detect associations by leveraging larger sample sizes. One advantage ofmeta-analysis is that it is able to estimate SNP-level association significance usingonly summary statistics [16, 88]. Two common approaches to performing meta-analysis are fixed and random effects models [41, 79, 88]. In this thesis, we usedsummary statistics from GWAS meta-analysis of hundreds of thousands of MDDcases and controls performed by Wray et al. [86].1.2 Molecular TraitsThe gene is a functional unit of the genome that can be transcribed to form either anon-coding RNA or a messenger RNA. The messenger RNA can then be translatedto a protein [19]. Gene expression is a fundamental process for cellular differen-tiation. Regulation of gene expression plays a significant role in an organism’sdevelopment.DNAm, a process where a methyl group (−CH3) attaches to a DNA molecule,is an essential contributor to the regulation of gene expression [11]. DNAm hap-pens widely across the genome and plays a critical role in many biological phe-nomena like development, differentiation, aging [40], genomic imprinting [65],and cancer [11]. In mammals, DNAm usually occurs at a CpG site, a Cytosinenucleotide immediately followed by a Guanine [11]. If methyl groups are added in4a promoter region, they can result in suppressing the gene expression of the nearbygenes.Another key process involved in the regulation of the expression is the acetyla-tion of core histones. Acetylation is a process where an acetyl group (−CH3CO) isattached to lysine residues in the N-terminal tail of a histone protein inside thenucleosome. Deacetylation is the reverse process of acetylation. Histones arepositively charged whereas DNA molecules contain negatively charged phosphategroups. DNA thus strongly interacts with histones through electrostatic interac-tions. When the histone core is acetylated, it decreases the positive charge of thehistone and thus, lets chromatin transform into a more relaxed structure. This trans-formation provides RNA polymerases with increased access to the DNA, leadingto an increase in gene expression [43]. To highlight the regulatory impact of HA,histone deacetylases (HDACs) play an important role in biological processes suchas development [35], inflammatory diseases [8], cancer initiation, and cancer pro-gression [32].1.2.1 Measuring Gene Expression with RNA SequencingQuantifying gene expression levels has enabled researchers to understand the molec-ular processes involved in many biological phenomena like development, diseases,and cancer. Over the last decade, RNA sequencing (RNA-Seq), which uses Next-Generation Sequencing (NGS) [58], has been widely used to measure gene expres-sion levels. There are various NGS platforms i.e., SOLiD, Illumina, etc., whichdiffer mainly in the technical aspects of their sequencing reactions [33, 58].Illumina leverages the Sequencing By Synthesis (SBS) technology. In this tech-nology, after the library preparation and cluster amplification, fluorescently taggednucleotides are loaded to the sequencing chip. Within each cycle of sequencing,only one nucleotide is incorporated. The process of nucleotide addition yields afluorescent signal and the base call is identified by the signal’s wavelength and in-tensity. n sequencing cycles are required to determine the base calls of a fragmentwith n nucleotides, and thus generate a sequencing read of size n. SBS generatesmillions of sequencing reads, which are aligned back to the reference genome toquantify transcripts’ and genes’ expression levels [38, 55]. In this thesis, we used5gene expression levels from Dorsolateral Prefrontal Cortex (DLPFC) of 540 indi-viduals quantified by Illumina HiSeq [49].1.2.2 Measuring DNA Methylation with MicroarrayChanges in DNAm in many diseases have highlighted the need to develop technolo-gies that enable researchers to profile DNAm of individuals [27]. There are severalplatforms to measure the DNAm levels of the CpG sites across the genome. Il-lumina’s HumanMethylation450 BeadChip (450k) is a microarray-based platformthat measures methylation levels of more than 485,000 CpG sites across the hu-man genome [67]. The 450k array consists of probes that are designed to hybridizeto specific locations of the genome. As the result of hybridization, a fluorescencesignal is generated from fluorescently labeled nucleotide and is read by a scanner.The strength of the fluorescence signal is used to infer the amount of methylationat a particular CpG site [26, 67].The 450k array is a mixture of two types of assay, each with their own DNAbinding strategies to measure DNAm levels. Type I probes use two distinct physicalbeads to measure the amount of DNAm at a CpG Site. One bead is used to measurethe methylated state and the other is used to measure the unmethylated state at aCpG site. On the contrast, type II probes use a single physical bead to measurethe methylation level [67]. DNAm levels obtained from type II probes are lessaccurate and have a greater variance between replicates compared to type I probes.In general, the 450k array is the most cost-effective platform to profile DNAm [26].In this thesis, we used DNAm data from DLPFC of 740 individuals quantified usingthe Illumina 450k array [25].1.2.3 Measuring Histone Acetylation with ChIP-seqThere are different types of HA that activate gene expression. For example, H3K9ac(the acetylation of the amino acid located in the ninth position in the N-terminusof histone’s H3 protein) is found in actively transcribed promoters [42]. In or-der to capture TARs of the DNA in which the associated histones are acetylated,chromatin immunoprecipitation (ChIP) followed by NGS is used.ChIP-seq experiments generally detect DNA binding sites of specific proteins6through four main steps. First, the protein of interest (in our case the histone’sH3 protein) is crosslinked to DNA in vivo using formaldehyde. Second, chromatinis isolated and sheared by sonication into small fragments, ranging from 200 to600 bp. Third, an antibody specific to the protein of interest is used to bind to theprotein and isolate the complex by precipitation. Finally, crosslinks are reversedto purify DNA fragments. Fragments of DNA are then amplified, profiled throughNGS, and mapped back to the genome to identify DNA sites bound by the pro-tein [64]. Accordingly, acetylation levels of the TARs are generated. In this thesis,we used HA data from DLPFC of 714 individuals generated using ChIP-seq ofH3K9Ac [49].1.3 Major Depressive DisorderMDD is a common mental disorder that is characterized mainly by low mood,low self-esteem, low energy, loss of interest in previously enjoyable activities, andpain without a clear cause. MDD has a huge impact on both individuals’ andtheir families’ quality of life by influencing the individual’s daily activities likesleeping, eating, and working. MDD is often chronic or recurrent and is associatedwith high morbidity, financial costs, and risk of suicide [85]. MDD is influencedby many SNPs, each having a small effect size. Accordingly, a substantial numberof samples is required to identify SNPs associated with MDD [70, 86]. In thisthesis, we leveraged MDD GWAS summary statistics of hundreds of thousands ofindividuals reported by Wray et al. [86] and combined them with gene expression,DNAm, and HA data to interpret GWAS associations at multiple molecular levels.1.4 Thesis Motivation and ContributionAs described earlier in Section 1.1.2, GWAS suffer from a huge multiple-testingburden. Also, molecular-level mechanisms underlying GWAS associations are un-clear. Given such limitations, Neale and Sham [59] and many other researcherssuggested a move toward gene-based approaches to interpret findings of GWAS atthe gene level.In 2015, Gamazon et al. [31] proposed a new gene-based variant aggregationapproach, called PrediXcan. In this approach, imputation models are built on the7reference eQTL dataset where both genotype and gene expression data are avail-able. The imputation models predict a gene’s expression level from cis SNPs. CisSNPs are defined as those SNPs that are within the 1Mb window of the gene’sTranscription Start Site (TSS). The imputation models are then applied to a newGWAS dataset to predict gene expression from genotype data. Finally, predictedgene expression levels are directly tested for association with the GWAS trait.PrediXcan requires individual-level genotype data to find associations betweengene expression levels and the trait of interest. However, it is often challenging toobtain individual-level genotype data for large GWAS analysis. In order to remedythis and to apply this approach to GWAS summary statistics, Barbeira et al. [6, 7]extended PrediXcan to Summary-PrediXcan (also called MetaXcan). For a givengene, MetaXcan calculates the p-value of its association with the GWAS trait usingGWAS summary statistics.MetaXcan is developed to translate GWAS findings to the gene-level associa-tions. Here, we extended MetaXcan to a multiomics application, where we lever-aged HA and DNAm data, in addition to gene expression data. We applied ourapproach, Multi-MetaXcan, to MDD GWAS meta-analysis summary statistics [86]and identified genes, CpG sites, and TARs that are associated with MDD. Multi-MetaXcan enabled us to interpret functionality of MDD GWAS findings not onlyat the gene level but also at the DNAm and HA levels. In addition to the multiomicsinterpretation of GWAS associations, we investigated the effect of DNAm and HAmarks on gene expression regulation and extended our statistical framework to fur-ther summarize DNAm-level and HA-level associations at the gene level.1.5 Detailed Outline of ThesisThe rest of this thesis explicates the problem of performing a gene-based analysisand surveys existing gene-based methods [Chapter 2], formulates our multiomicsdata integration pipeline [Chapter 3], presents the results [Chapter 4], and discussesthe results and future directions [Chapter 5].A detailed chapter breakdown of this thesis is as follows:• Chapter 2 reviews existing literature on the problem of interpreting findingsof GWAS at the gene level. First, we explain the concept of a gene-based8analysis applicable to GWAS. Second, we review gene-based approacheswhich require individual-level GWAS data as well as methods that requireonly GWAS summary statistics. Finally, we review recent findings of MDDfrom GWAS and eQTL anlayses.• Chapter 3 provides a detailed description of the MDD GWAS summarystatistics, multiomics datasets, and Quality Control (QC) steps used in ouranalysis. It also formulates the extension of the MetaXcan approach to inte-grate other molecular traits like HA and DNAm with GWAS. First, we buildimputation models that predict gene expression levels, CpG site methylationlevels, and TAR acetylation levels from their nearby (1Mb) SNPs. Second,we propose a framework to interpret GWAS associations at the gene, CpG,and TAR levels. Finally, we propose a statistical approach, Multi-MetaXcan,to summarize CpG-level and TAR-level associations at the gene level.• Chapter 4 presents the results of multiple steps of our multiomics data in-tegration pipeline. First, we present the prediction results of the imputationmodels for each of the molecular traits i.e., gene expression, DNAm, andHA. Second, we present gene-level, CpG-level, and TAR-level associationsunderlying MDD GWAS associations. Finally, we provide gene-level asso-ciations as the result of Multi-MetaXcan method.• Chapter 5 discusses our results, advantages and limitations of Multi-MetaXcan,and directions for future work.9Chapter 2Related WorksIn this chapter, we survey the literature on the problem of interpreting functionalityof GWAS associations at the gene level. According to the current literature, re-searchers leveraged mainly gene expression data and genotype data to map GWASSNPs to genes for understanding their underlying molecular mechanisms. In Sec-tion 2.1, we explain the concept of gene-based analysis applicable to GWAS andstate its advantages compared to GWAS. In Section 2.2 to Section 2.8, we sum-marize the most widely used gene-based methods applicable to individual-levelGWAS as well as GWAS summary statistics. Finally, since we are interested inunderstanding the molecular mechanisms underlying MDD, we review recent find-ings of MDD from GWAS and eQTL studies in Section From GWAS to Gene-based AnalysisAs described in Section 1.1, in GWAS analysis, the associations between SNPsand the trait of interest are tested (Figure 2.1-b). In a gene-based analysis, a modelthat links one or a set of SNPs to a gene is derived. Evidence for the gene-traitassociation is then tested (Figure 2.1-a).There are multiple reasons why a gene-based analysis has been widely adoptedby the GWAS community:• Genes are biologically more interpretable than SNPs since we know moreabout the downstream biological impacts of regulated genes on disease traits.10Figure 2.1: From GWAS to gene-based analysis.• By reducing the number of tests from millions of SNPs to thousands ofgenes, the multiple-testing burden is often reduced by two orders of mag-nitude, yielding higher statistical power.• Most gene-based approaches aggregate the effects of multiple SNPs whichmay increase the statistical power compared to single-variant analyses.2.2 Expression Quantitative Trait Loci AnalysisExpression Quantitative Trait Loci (eQTL) analysis is one of the most widely usedmethods for mapping GWAS SNPs to genes. This analysis enables gene-levelinterpretation of GWAS SNPs for deciphering their molecular mechanisms under-lying the GWAS trait [22]. Performing an eQTL analysis requires the availabilityof the genotype and gene expression data from same individuals with large samplesizes. The goal of an eQTL analysis is to test for the associations between SNPsand gene expression levels. Such associations are found if individuals carryingdifferent genotypes of a SNP have differences in a gene’s expression level [91].112.3 Sequence Kernel Association TestSequence Kernel Association Test (SKAT) [87] is a kernel-based association testthat examines the joint effects of multiple SNPs on the trait. The key advantage ofSKAT is that the joint effects of SNPs is tested through a single test statistic ratherthan testing each SNPs individually, which increase the statistical power to detectthe association.Specifically, SKAT leverages a multiple regression model that regresses thetrait on a SNP set while accounting for confounding factors:Y = G+Cβ + ε (2.1)logP(Y = 1)1−P(Y = 1) = G+Cβ , (2.2)where• Y is the trait of interest.• G is the random effect of SNPs.• C captures confounding factors such as sex, age and, top Principal Compo-nent (PC)s of SNPs.• β is a vector of regression coefficients of the confounding factors.SKAT uses Equation 2.1 and Equation 2.2 if the trait of interest is continuous (likeheight) or binary (like disease status), respectively. Let’s assume that we have pSNPs within a gene, which we define as our SNP set of interest. SKAT assumesG ∼ N(µ = 0,σ2 = τ2KG). KG is a kernel matrix capturing similarities betweensubjects and τ2 is the variance component. Testing for the gene-level associationis equivalent to testing τ = 0. SKAT tests the null hypothesis (τ = 0) by fittingthe null model which includes only confounding factors and calculating p-valuesusing a score test [50]. SKAT requires the availability of the phenotype data andindividual-level genotype data from same individuals. Therefore, SKAT is notapplicable to GWAS summary statistics.122.4 Versatile Gene-based Association StudiesVersatile Gene-based Association Studies (VEGAS) developed by Liu et al. [51]builds on a similar idea as SKAT. VEGAS integrates the effects of either a fullset of SNPs or a subset of most significant trait-associated SNPs within a gene.VEGAS tests for the gene-trait association while accounting for LD between SNPsby using simulated samples from the multivariate normal distribution.VEGAS assigns GWAS SNPs to genes based on their positions. Gene bound-aries are defined as 50Kb upstream and downstream of 5′ and 3′ UTRs. For agene with n SNPs, VEGAS involves the following steps to test for the gene-levelassociation with the GWAS trait:• First, VEGAS generates a n-element random vector R = (r1,r2, ...,rn) thatfollows a multivariate normal distribution, R ∼Nn(µ = 0,σ2 = I), where Iis an identity matrix.• Second, it multiplies R by the Cholesky decomposition of SNPs LD scores,to generate a new vector Z = (z1,z2, ...,zn) such that Z∼Nn(µ = 0,σ2 =∑),where ∑ is a n×n matrix of SNPs LD scores.• Finally, it transforms Z into the vector Q = (q1,q2, ...,qn),qi = z2i of corre-lated chi-squared 1 degree of freedom (df) variables. The null distributionof the gene-trait association statistic is then generated by summing over alln elements of the vector Q. VEGAS calculates the observed association teststatistic tg by summing over the Z statistics of all the n SNPs for the gene. Zstatistic of a SNP in a GWAS analysis is defined as the SNP regression co-efficient divided by the standard error of the regression coefficient. VEGASthen assesses the significance of tg using the null distribution.VEGAS requires only SNPs p-values and thus, it is applicable to GWAS summarystatistics.2.5 PrediXcanPrediXcan developed by Gamazon et al. [31] is a new class of gene-based variantaggregation method. Unlike any other previous methods, PrediXcan models the13regulating effects of SNPs on gene expression and tests for the genetically regulatedcomponent of the gene expression and the trait.In this approach, a reference dataset where both genotype and gene expressiondata from hundreds to thousands of individuals are available is first used to buildan imputation model. The imputation model predicts (imputes) gene expressionlevel from cis SNPs (defined in PrediXcan as those SNPs that fall within the 1Mbwindow of the gene’s TSS). PrediXcan models a gene expression as follows:g = ∑l∈ModelgwlgXl + ε, (2.3)where• g is the gene expression level of gene g.• wlg is the regression coefficient of SNP l for gene g.• Xl is the genotype dosage of SNP l represented by 0, 1 or, 2.Gamazon et al. [31] learned gene expression imputation models for more than48 human tissues form GTEx project and Depression Genes and Networks (DGN) [9]and released such models in the PredictDB database. PrediXcan imputationmodels are then applied to predict the genetic component of the gene expressionlevels from a new set of individual-level genotype data. Thus, the imputed geneexpression levels are directly associated with the trait of interest.Gamazon et al. [31] compared three different approaches for estimating theweights of SNPs, wˆlg: penalized linear models like elastic net [92] and Least Ab-solute Shrinkage and Selection Operator (LASSO) [81] as well as top eQTL SNP.They concluded that elastic net (with α = 0.5) had the best performance com-pared to the other methods. Gamazon et al. [31] further showed that PrediXcanoutperforms SKAT (described in Section 2.3) [87] and VEGAS (described in Sec-tion 2.4) [51].PrediXcan has the advantage of identifying the direction of the gene-level as-sociation hence enables distinction between upregulation and downregulation ef-fect of a gene on a trait. Although genotype and gene expression data with largesample sizes may not be always available, PrediXcan requires such data only for14training the imputation models. Once the model is trained, gene expression datais not required for any later application of PrediXcan to individual-level GWASdatasets since gene expressions are estimated from genotype data. Requiring theindividual-level GWAS data is a limitation of PrediXcan which makes it inapplica-ble to GWAS summary statistics.2.6 Transcriptome-wide Association StudiesTranscriptome-wide Association Studies (TWAS) developed by Gusev et al. [34] isanother gene-based method that works on a similar idea as PrediXcan and aims toidentify genes whose genetically regulated expression levels are associated with theGWAS trait. Two main distinctions between PrediXcan and TWAS are underlyingimputation models to predict gene expression and the applicability of TWAS tosummary statistics.Gusev et al. [34] proposed two versions of TWAS: 1) individual TWAS whereGWAS data are available at the individual level and 2) summary-based TWASwhen GWAS data are available only at summary statistics. In the individual TWAS,gene expression levels are estimated by either the most significant cis eQTL SNPor linear imputation methods such as BLUP [72] and BSLMM [90]. Imputed ex-pression levels are then directly correlated with the GWAS trait. TWAS considersSNPs only within the 1Mb window of the gene’s TSS.In summary-based TWAS, the association between the trait and expressionlevel is directly estimated from GWAS summary statistics. Gusev et al. [34] showedthat for each gene, the expression-trait z statistic can be estimated as follows:zTWAS =WZ2√WΣSNP,SNPW t, (2.4)where• Z is the Z statistics of the SNPs from GWAS.• W represents a vector of SNP weights for the gene. If individual-level GWASdata are available, weights are estimated by imputation models that predictgene expression levels from genotype data. If GWAS data are available onlyat the summary statistics, weights in W are estimated as ΣSNP,geneΣ−1SNP,SNP,15where ΣSNP,gene is the covariance matrix between all SNPs genotype valuesand the gene expression level. ΣSNP,SNP is the covariance matrix of all SNPs,denoted by LD from a reference dataset like 1000 Genomes project.Both TWAS and PrediXcan suffer from some limitations. SNPs are unlikely tobe identified if they impact the GWAS trait by mechanisms other than influencingexpression of their nearby genes. Also, the number of genes that can be accuratelyimputed from genotype data are limited due to the quality of expression data, sam-ple size, and the proportion of variance in gene expression that can actually beexplained by SNPs.2.7 Summary Data-based Mendelian RandomizationSummary data-based Mendelian Randomization (SMR) developed by Zhu et al.[91] tests for the gene-trait association based on Mendelian Randomization (MR)analysis, where a genetic variant (like SNP) is tested for the effect of an exposurevariable (like gene expression level) on an outcome variable (like MDD). SMRcombines GWAS summary statistics with eQTL studies to assess for the gene-traitassociations.Unlike PrediXcan or TWAS, SMR performs a heterogeneity test to distinguishbetween pleiotropy (where gene expression level and trait are impacted by the sameSNP) and linkage (where gene expression level and trait are affected by separateSNPs which are in high LD), where the latter is considered as a false positive as-sociation. However, MR method is not able to distinguish between pleiotropy andcausality (where the impact of a SNP on the trait is mediated by gene expression).SMR approximates the gene-trait expression association score using χ2 teststatistics as [7]:TSMR =Z2eQT LZ2GWASZ2eQT L+Z2GWAS, (2.5)where• ZeQT L is the Z statistic of the SNP from an eQTL analysis.• ZGWAS is the Z statistic of the from a GWAS study.16TSMR follows a χ2 distributions only in two extreme cases where ZeQT L >>ZGWAS or ZeQT L << ZGWAS. In both extreme cases, TSMR is approximately equal toless significant Z statistic of GWAS or eQTL signals. For all other cases, the rightdistribution of TSMR should be simulated. SMR requires the availability of ZeQT Lfrom a tissue of interest. Given ZeQT L, SMR only needs GWAS summary statisticsto detect gene-level associations.2.8 MetaXcanMetaXcan developed by Barbeira et al. [6] is an extension of PrediXcan approach,which tests for the gene-trait associations using only GWAS summary statistics.Similar to PrediXcan and TWAS, the goal of MetaXcan is to identify genes whosegenetic components are associated with a GWAS trait. There are two equationsunderlying MetaXcan:Y = Xlβl + ε1 (2.6)Y = Tgγg+ ε2, (2.7)where• Y is the GWAS trait.• Xl is the dosage of SNP l, represented by 0, 1 or, 2.• βl is the GWAS regression coefficient of SNP l.• Tg is the expression level of gene g.• γg is the estimated TWAS regression coefficient of gene g which is used toassess the significance of the gene-level association.Equation 2.6 and Equation 2.7 model the GWAS trait as a linear function of in-dividual’s SNPs and gene expression levels, respectively. The goal is to estimatethe regression coefficient γg and its corresponding significance. Barbeira et al. [6]showed that the association statistic Zg =γgse(γg) between gene g and trait Y can beestimated as follows:17Zg ≈ ∑l∈Modelgwlgσˆlσˆgβˆlse(βˆl), (2.8)where• wlg is the weight of SNP l. Weights of the SNPs are estimated by PrediXcanimputation models.• σˆl is the estimated variance of SNP l. If individual-level GWAS data areavailable, σˆl is calculated from genotype data. Otherwise, σˆl can be esti-mated from reference sets like 1000 Genomes data.• σˆg is the estimated variance of the predicted expression level of gene g andis computed as 2√W ′gCov(G)Wg, where Wg is the vector of SNP weights forthe gene g. G is a matrix representing genotype values of SNPs, denoted by0, 1, or 2. Cov(G) is the covariance matrix of SNPs where Cov(G)i, j is thecovariance between SNP i and SNP j.• βˆl is the GWAS regression coefficient of SNP l.• se(βˆl) is the standard error of βˆl reported by GWAS.MetaXcan takes in GWAS summary statistics and weights of SNPs and yieldsgene-level association summaries including Z statistic and the corresponding p-value. MetaXcan has been shown to produce results as accurate as PrediXcan. Bar-beira et al. [7] compared MetaXcan with summary-based TWAS [34] and showedthat MetaXcan and summary-based TWAS derive equivalent mathematical expres-sion if the SNP weights are scaled.2.9 Recent Findings of MDDMDD is a complex human disease. It has the lifetime prevalence of ∼ 15% andis considered as the major cause of disability in the world [86]. MDD has theheritability, h2SNP, of 31%− 42% and specific types of MDD like early-onset andrecurrent MDD are more heritable [85]. In order to find SNPs associated withMDD, Psychiatric Genomics Consortium (PGC) performed its initial MDD mega-analysis [85] with 9,240 MDD cases and 9,519 controls and found no SNPs with18genome-wide significance. Levinson et al. [46] discussed multiple reasons whyMDD GWAS achieved no genome-wide significant SNPs with the available sam-ple sizes at the time. They reported that the h2SNP and polygenic score analysisshowcase that there are SNP signals in MDD GWAS data. However, SNP signalsare not distinguishable since they are hidden in the noise. Levinson et al. [46] con-cluded that the insufficient power to detect associations is probably caused by twomajor factors: MDD is a heterogeneous trait and current sample sizes are too small.MDD and schizophrenia share biological basis [86]. However, MDD is morefrequent than schizophrenia, which has less than 1% lifetime risk. Higher preva-lence of MDD among a human population implies that MDD cases and controlsare less distinguishable from each other compared to schizophrenia [85]. Also,although MDD is probably impacted by many SNPs [86], their effect sizes aresmaller compared to schizophrenia since MDD is less heritable than schizophrenia(h2SNP is 65% to 85%). Thus, greater sample sizes are required to detect genome-wide significant SNPs in MDD. In fact, compared to schizophrenia, cases threeto five times greater are required to identify the same number of associations inMDD [46].In 2015, Hyde et al. [37] achieved genome-wide significance for SNPs throughincreasing GWAS sample sizes. They performed meta-analysis on 75,607 MDDcases and 23,747 control individuals of European descent and identified 15 ge-nomic regions (also called loci) significantly associated with MDD, which include17 independent significant SNPs in total.In 2017, MDD working group of PGC [86] combined samples from multiplestudies including 23andMe [37] and carried out a meta-analysis of 130,664 MDDcases and 330,470 controls. This study which is the largest MDD GWAS analysisto date, identified 44 independent genomic regions associated with MDD whichcontain about 600 significant MDD-associated SNPs. Wray et al. [86] also foundthat 30 of these 44 loci are novel findings and 14 were reported as a significantMDD association in previous studies.Moreover, Wray et al. [86] showed that the lead SNP (SNP with smallest p-value within a region) of nine of these 44 loci are located within a gene and thereis no other gene within 200Kb of these lead SNPs. Genes containing those ninelead SNPs are involved in neuronal related processes such as neuronal develop-19ment and synaptic function. Wray et al. [86] also found that the two most sig-nificant MDD-associated SNPs are located in OLFM4 (olfactomedin 4) and nearNEGR1 (neuronal growth regulator 1). In addition, Wray et al. [86] leveragedTWAS method [34] to combine MDD GWAS summary statistics with Common-Mind gene expression data from brain cortex [30]. As the result, they identified 17MDD-associated genes.20Chapter 3Proposed ApproachThis chapter details our approach to perform a multiomics analysis to identifymolecular traits that are associated with the GWAS trait and to further translateall molecular-level findings to the gene level. In Section 3.1, we describe the dataacquisition of our multiomics datasets i.e., gene expression, DNAm, and HA dataas well as the QC analysis performed on each dataset. We also describe the MDDGWAS summary statistics. In Section 3.2, we develop genotype-based imputationmodels for each of the molecular traits in our study. In Section 3.3, we leveragegenotype-based imputation models followed by the statistical approach developedby Barbeira et al. [6] to interpret GWAS findings at multiple molecular levels. Fi-nally in Section 3.4, we present a statistical framework, Multi-MetaXca, whichenables us to summarize molecular-level associations at the gene level.3.1 Description of Datasets and PreprocessingWe used multiomics datasets such as genotype [24], gene expression [49], DNAm [25],and HA data [49] from ROSMAP study. To reduce the batch effect, a single per-son performed necessary preparations for all samples i.e., dissection of the frozentissues and isolating tissue content used to generate gene expression, DNAm, andHA data. In the following, we describe each of the datasets used in our analysis.213.1.1 Genotype DataIn ROS and MAP studies, two batches consisting of 1,709 and 384 individuals weregenotyped using Affymetrix GeneChip Array6.0 and Illumina OmniQuad Expressplatform, respectively. DNA of the subjects was extracted from whole blood, lym-phocytes, or frozen post-mortem brain tissue as described in De Jager et al. [24]. Toreduce the population heterogeneity, only self-declared non-Hispanic Caucasianswere genotyped.The same QC analysis was performed on two batches of genotyped data usingPLINK toolkit [68] as previously described by De Jager et al. [24]. As the result ofindividual-level QC steps, subjects with a) genotype success rate < 95%, b) discor-dant genotype-derived and reported gender, and c) excess inter/intra-heterozygositywere removed from the study. At the SNP-level QC analysis, SNPs with a) Hardy-Weinberg equilibrium P < 0.001, b) genotype call rate < 0.95, and c) misshap test< 1×10−9 were excluded. EIGENSTRAT [66] with default parameter setting wasused to remove population outliers.As the result of QC analysis, Affymetrix genotype data on 729,463 SNPs andIllumina genotype data on 624,668 SNPs were available for 1,709 and 384 indi-viduals, respectively. BEAGLE software [12] was used to impute the genotype ofmore than 35 million SNPs. SNPs with MAF > 0.01 and imputation INFO score> 0.3 were excluded resulting in genotype data on 7,321,515 SNPs in total. Weleveraged only genotyped SNPs in our analysis corresponding to the 448,867 SNPsin total.3.1.2 Gene Expression DataGene expression data were generated by extracting the RNA from DLPFC of 540individuals followed by RNA-seq using Nanodrop. A detailed process of quan-tifying gene expression levels and assessing its quality is described in Lim et al.[49]. Samples with RNA quantity < 5µg or RNA integrity score < 5 were ex-cluded from the analysis. RNA-seq library was prepared at the Broad Institute’sGenomics Platform using dUTP method with poly-A selection [45].Illumina HiSeq with 101-bp paired-end reads with the coverage of 150M readswas used to perform sequencing of first 12 samples which were used as a deep22coverage reference for further sequencing. The remaining samples were sequencedwith a coverage of 50M reads. Further processing of the RNA-seq data includeda) trimming beginning and ending low-quality bases and adaptor sequences fromreads, b) removing ribosomal RNA reads, c) aligning trimmed reads to the refer-ence genome by Bowtie [44], and d) quantifying genes’ and transcripts’ expres-sion levels using RSEM [47]. Expression levels were represented by fragmentsper kilobase per million mapped fragments (FPKMs) values. All FPKM valueswere log-transformed for further analysis. Depending on the steps of our analy-sis, we required individuals to have various data types available to be used in ouranalysis. For example, we analyzed 418 individuals in development of genotype-based imputation of gene expression levels (Section 4.1.1). For such individuals,both genotype and gene expression levels were available. We report the detailednumber of samples used in our analysis in the next chapter.The contribution of confounding factors to gene expression levels was quan-tified by computing the Spearman correlation between top ten PCs of the expres-sion levels and technical and biological confounding factors including sequencingdepth, RNA integrity score, postmortem interval, study index (refer to ROS vsMAP samples), genotyping PCs, age at death, and sex. Spearman correlation anal-ysis showed significant correlations between top expression PCs and confoundingfactors. We used COMBAT algorithm [39] to account for the batch effect. Also,we used linear regression to regress out such technical and biological confoundingfactors.Finally, we kept 13,484 highly expressed genes whose mean expressions weregreater than their 2 log2(FPKM) values. The gene expression data were then quan-tile normalized to account for sequencing depth across samples.3.1.3 DNA Methylation DataDNAm data were generated by performing a Whole Genome Bisulfite Sequencing(WGBS) from the DLPFC of 740 individuals using 450K Illumina array. A detailedprocedure of data generation and QC steps was described by De Jager et al. [25].Polymorphic CpG sites were removed from further analysis. BMIQ algorithm [80]was used as an initial normalization to correct for differences between type I and23type II probes and β values were reported.Technical and confounding factors such as batch effect, postmortem interval,age at death, sex, and estimate of the proportion of neurons present in each samplewere regressed out from DNAm data using SNM method [57]. Since the propor-tion of neural cells may vary among brain samples, the effect of generic neuronalproportions based on DNAm marks [25] was removed from DNAm data. In thenext chapter, we report the detailed number of DNAm samples used in differentsteps of our analysis.3.1.4 Histone Acetylation DataHA data were generated from the DLPFC of 714 individuals using H3K9Ac ChIP-seq. A detailed description of data acquisition and QC steps is available in Limet al. [49]. Purified DNA fragments resulting from ChIP experiment were ex-tracted and used for Illumina library preparation and then were sequenced with36-bp single-end reads using Illumina HiSeq. Reads were aligned back to the ref-erence genome using BWA algorithm [48]. MACS2 [89] with q cutoff of 0.001was used for peak detection in each sample. Further QC pipeline was performed toremove low quality samples. All samples in the analysis were required to have a)total unique reads≥ 15×106, b) non-redundant fraction≥ 0.3, c) cross-correlation≥ 0.03, d) fraction of reads in peaks ≥ 0.05, and e) peaks ≥ 6,000. 669 samplesremained as the result of QC steps.H3K9Ac domains were defined by calculating all genomic regions that weredetected as a peak in at least 100 (15%) of our 669 samples. The FPKM values tomeasure H3K9Ac were generated by counting the number of reads falling withineach of the H3K9Ac domains for each sample and dividing by the width of eachdomain in Kb and the total number of mapped reads. The log-transformed HAdata were quantile normalized to adjust for sequencing depth across samples. Inthe next chapter, we report the detailed number of HA samples that we leveragedin various steps of our analysis.243.1.5 Additional QC StepsIn addition to the data-specific QC steps we performed through Section 3.1.2 toSection 3.1.4, we filtered out the effect of hidden confounding factors (like celltype heterogeneity) from gene expression, DNAm, and HA data. Specifically, weperformed a PCA analysis and removed 10 first PCs from gene expression, DNAm,and HA data to regress out the variations in molecular traits arising from hiddenconfounding factors [60].3.1.6 MDD GWAS Meta-analysis Summary StatisticsWe used MDD GWAS summary statistics from Wray et al. [8] that were avail-able for 10,468,943 SNPs. Therefore, for each SNP, we had their association p-value, effect size, allele coding, and other GWAS metrics. A detailed descrip-tion of data acquisition and QC steps was previously described in [86]. Wrayet al. [86] defined an anchor cohort of GWAS mega-analysis of 29 samples ofEuropean-ancestry (Table 3.1). Cases in the anchor cohort were constrained tofulfill the globally accepted criteria for the diagnosis of MDD over an individual’slife. These criteria (DSM-IV, ICD-9, or ICD-10) [62, 63, 78] have been establishedusing structured diagnostic instruments coming from assessments done by trainedinterviewers, checklists administered by clinicians, or review of the individual’smedical records. Controls in 22 such samples were randomly selected and testedfor the absence of lifetime MDD.Wray et al. [86] leveraged 6 other European-ancestry cohorts, which they re-fer to as expanded cohorts (Table 3.1). These cohorts used alternative methodsfor assessing MDD. For example, 23andMe [37] used self-report of individuals inthe cohort whether they had ever received MDD treatment from a medical profes-sional and Generation Scotland [28, 74] used direct interviews. All controls werescreened for the absence of MDD. Wray et al. [86] combined MDD GWAS sam-ples from these cohorts (Table 3.1) and carried out a meta-analysis of 130, 664MDD cases and 330,470 controls to test for the associations between SNPs andMDD.25Table 3.1: Summary of MDD GWAS meta-analysis cohorts.Cohort MDD Cases ControlsAnchor Cohort: 16,823 25,632GWA Mega-analysis of29 samples ofEuropean-ancestryExpanded Cohorts: 113,841 304,83823andMe [37]GERA [5]DECODE [70]GenScot [28, 74]iPsychUK Biobank [3, 75]3.2 Genotype-based Imputation of Molecular TraitsIn the first step of our approach, we predict gene expression and other moleculartraits from genotype data. We follow the approach of Gamazon et al. [31] andleverage a linear model to predict molecular traits from cis SNPs. We model amolecular trait YX as:YX = ∑l∈cis−Xwl,X Gl + ε, (3.1)where• YX represents either gene expression level of a gene, methylation level of aCpG site, or acetylation level of a TAR.• wl,X is the weight of SNP l for molecular trait X .• Gl is the dosage of SNP l, represented by 0, 1, or, 2.• ε is the error term that accounts for the error in the linear model.We refer to above model as genotype-based imputation model. Genotype-basedimputation model estimates a continuous value per individual that represents thegenetically regulated component of the molecular trait X . Genetically regulatedcomponent captures the component of molecular traits regulated by cis SNPs. For26Figure 3.1: Overview of multiomics integration. A) This figure depicts theidentification of trait-associated molecular traits. First, we build threegenotype-based imputation models for genes, CpG sites, and TARs, sep-arately. Second, we apply MetaXcan to summarize SNP-level p-valuesfrom GWAS at the CpG-, TAR-, and gene-level p-values. B) This figuredepicts the summarization of CpG-level and TAR-level associations atthe gene level. First, we build two separate epigenome-based imputa-tion models for genes using either CpG sites or TARs as input. Second,we apply Multi-MetaXcan to summarize CpG-level and TAR-level p-values at the gene level.each molecular trait, we consider SNPs only within the 1Mb of the start site ofthat molecular trait. We train three separate genotype-based imputation models toimpute expression level of a gene, methylation level of a CpG site, and acetylationlevel of a TAR.We follow the procedure used in Gamazon et al. [31] and estimate a vector ofregression coefficients for a single molecular trait, WX , by elastic net [92] approachwith α = 0.5 as:WˆX = argminWX(||YX −GWX ||22+λ [0.5||WX ||22+0.5||WX ||1]), (3.2)where27• WX is a vector of regression coefficients wl,X .• G is a genotype matrix including the dosages of all SNPs for the moleculartrait X , represented by 0, 1, and 2 values.• ||WX ||22 is the quadratic part of the penalty defined as the L2-norm of theregression coefficients WX . ||WX ||1 is the L1-norm of WX .• λ is the penalty parameter.We learn the optimal λ by fivefold cross-validation using glmnet package in R [29].The elastic net model attempts to identify a small set of non-zero SNP weights suchthat the linear model in Equation 3.4 yields the minimum prediction error in cross-validation.3.3 Identifying GWAS-associated Molecular TraitsIn Section 2.8, we described how MetaXcan [6, 7] approach integrates genotype-based imputation models with GWAS summary statistics and thus, interprets GWASfindings at the gene level. Briefly, MetaXcan takes in SNPs’ weights from genotype-based imputation models and SNPs effect sizes from GWAS summary statistics.MetaXcan estimates the Z statistic of the association between gene expression andthe GWAS trait.As the second step of our pipeline, we follow the approach of Barbeira et al. [6,7] and apply the genotype-based imputation models to GWAS summary statistics.We then compute the Z statistic of the association between each molecular trait andthe GWAS trait as:ZX ≈ ∑l∈ModelXwl,XσˆlσˆXβˆlse(βˆl)(3.3)where• X represents either a gene, a CpG site, or a TAR.• wl,X is the weight of SNP l.• σˆl is the estimated variance of SNP l.28• σˆX is the variance of the imputed molecular trait X . It is computed as2√W ′XCov(G)WX , where WX is the vector of SNP weights for the moleculartrait X . G is a matrix representing genotype values of such SNPs, denotedby 0, 1, or 2. Cov(G) is the covariance matrix of SNPs and Cov(G)i, j is thecovariance between SNP i and SNP j.• βˆl is the GWAS regression coefficient of SNP l.• se(βˆl) is the standard error of βˆl reported by GWAS .Barbeira et al. [6, 7] used above equation to interpret GWAS findings at thegene level. We apply above model with weights that are derived from genotype-based imputation models for each CpG site, TAR, and gene. As the result of thisstep, we find gene-level, CpG-level, and TAR-level associations with the GWAStrait. We convert the association Z statistics to their corresponding p-values toassess the significance of the associations. P-values are adjusted by Bonferronicorrection to control for multiple testing. Since wl,x for a given molecular trait isestimated by genotype-based imputation models, Equation 3.3 detects associationsbetween the GWAS trait and the genetic component of molecular traits regulatedby cis SNPs.3.4 Multi-MetaXcan: Summarizing Molecular-levelAssociations at the Gene LevelAs the final step in our approach, we extend our multiomics pipeline to summa-rize TAR-level and CpG-level associations at the gene level. We explore two ap-proaches. The first approach which we refer to as the Closest Matching, finds theclosest genes to each of the GWAS-associated CpG sites or TARs.In the second approach called Multi-MetaXcan, we first impute gene expres-sion levels from cis epigenomic marks. Specifically, we model gene expressionlevel as:Yg = ∑j∈cis−gw j,gEl + ε, (3.4)where29• Yg is the gene expression level.• w j,g is the weight of jth epigenomic mark, CpG sites or TARs that are prox-imal to Yg.• El represents either the methylation level of CpG site j or acetylation levelof TAR j.• ε is the error term that accounts for the error in the linear model.We consider CpG sites or TARs in the 1Mb window of the gene’s TSS. We referto the above model as epigenome-based imputation model. The epigenome-basedimputation model estimates a continuous value per individual that represents theepigenetically regulated component of the gene expression. Epigenetically regu-lated component captures the component of the molecular traits regulated by cisepigenomic marks i.e., CpG sites or TARs.We train two separate epigenome-based imputation models to impute expres-sion level of a gene. The first model leverages cis CpG sites for a single gene toimpute genes expression level. Cis CpG sites are defined as those CpG sites thatfall within the 1Mb window of the gene’s TSS. Such epigenome-based imputationmodel is learned by solving an elastic net model with = 0.5 and cross-validationfor setting the lambda parameter. The second epigenome-based imputation modeluses cis TARs for a single gene to estimate gene’s expression level. Cis TARs aredefined as those TARs that fall within the 1Mb window of the gene’s TSS. Becausethe average number of TARs per gene is 17, we use Ordinary Least Square (OLS)model to estimate weights of TARs for each gene.As the second step in Multi-MetaXcan approach, we leverage our epigenome-based imputation models to translate CpG-level and TAR-level associations to thegene-level associations. Specifically, following the MetaXcan procedure, we com-pute gene-level p-values as follows:Zg ≈ ∑X∈ModelgwX ,gσˆXσˆgβˆXse(βˆX)= ∑X∈ModelgwX ,gσˆXσˆgZX , (3.5)where• X represents a CpG site or a TAR.30• wX ,g is the weight of epigenomic marks, CpG sites or TARs, for gene g asdescribed above.• σˆX is the estimated variance of the epigenomic mark.• σˆg is the variance of the imputed expression level of gene g. It is computedas 2√W ′XCov(M)WX , where WX is the vector of epigenomic mark weights forthe gene g. M is a matrix representing either methylation levels of the CpGsites or acetylation levels of the TARs which are used to predict expressionlevel of the gene g. Cov(M) is the covariance matrix and Cov(M)i, j is thecovariance between either 1) methylation level of the CpG site i and CpGsite j or 2) acetylation level of the TAR i and TAR j.• ZX is the Z statistic for either CpG-level association or TAR-level associationfound in the previous section.We convert the association Z statistics to their corresponding p-values to assessthe significance of the gene-level associations. P-values are adjusted by Bonferronicorrection to control for the multiple testing. Since wX ,g for a given gene is esti-mated by epigenome-based imputation models, above equation finds genes whoseepigenomic component (regulated by cis epigenomic marks) are associated withthe GWAS trait.3.5 SummaryIn Section 3.2, we develop genotype-based imputation models for each gene, CpGsite, and TAR. Genotype-based imputation models predict expression level of agene, methylation level of a CpG site, or acetylation level of a TAR from cis SNPs.In Section 3.3, we leverage SNPs’ weights from genotype-based imputation modelsand apply MetaXcan to interpret GWAS findings at gene, CpG, and TAR levels.Finally in Section 3.4, we link CpG sites and TARs to genes and present a statisticalframework, Multi-MetaXca, to summarize CpG-level and TAR-level associationsat the gene level.31Chapter 4ResultsIn this chapter, we summarize the results from various steps in our multiomicspipeline. In Section 4.1, we provide the imputation results of the genotype-basedimputation models. In Section 4.2, we present molecular-level associations un-derlying MDD GWAS findings. In Section 4.3, we provide the imputation resultsof the epigenome-based imputation models. Finally, in Section 4.4, we presentgene-level associations with MDD as the resultant of Multi-MetaXcan.4.1 Building Genotype-based Imputation ModelAs the first step of our pipeline, we leveraged multiomics data from ROSMAPand imputed gene expression levels for 13,484 genes, DNA methylation levels for420,103 CpG sites, and acetylation levels for 25,243 TARs. From Section 4.1.1 toSection 4.1.3, we summarize the prediction results of the genotype-based imputa-tion models.4.1.1 Imputing Gene Expression Levels from SNPsWe used 418 samples of individuals for whom both gene expression levels andgenotype data were available. On average, we observed that 307 SNPs are avail-able in the 1Mb window of the gene’s TSS. Our elastic net models select at least onenon-zero weight for 52% of the genes (n=7,044). Cross-validation R2 (the propor-tion of variance of the gene expression level explained by cis SNPs) has an average32of 0.05 among such genes that have an available elastic net model (Figure 4.1-A).15% of such genes (n = 1,081) have cross-validation R2 values > 0.1 for whichcross-validation R2 has an average of 0.24 (Figure 4.1-B). We found that on aver-age, the elastic net models select 12 SNPs for each gene.4.1.2 Imputing CpG Site Methylation Levels from SNPsWe used 403 samples of individuals for whom both DNAm and genotype data wereavailable. On average, we observed that 311 SNPs are available in the 1Mb windowof the CpG site. The elastic net models find at least one non-zero SNP weight for49% of the CpG sites (n = 207,765). Cross-validation R2 (the proportion of vari-ance of the methylation level of a CpG site explained by cis SNPs) has an averageof 0.05 among such CpG sites with an available elastic net model (Figure 4.2-A).15% of these CpG sites (n = 30,444) have cross-validation R2 values > 0.1. Cross-validation R2 has an average of 0.28 for such CpG sites (Figure 4.2-B). CpG siteswith the available elastic net model have on average 11 SNPs in their genotype-based imputation models.4.1.3 Imputing TAR Acetylation Levels from SNPsWe leveraged 385 samples of individuals for whom both HA and genotype datawere available. On average, there are 375 SNPs available in the 1Mb window ofthe acetylation site. The elastic net models identify at least one non-zero SNPweight for 48% of the TARs (n = 12,167). Cross-validation R2 (the proportion ofvariance of the TAR acetylation level explained by cis SNPs) has an average of0.02 among such TARs with an available elastic net model (Figure 4.3-A). 6% ofthese TARs (n = 698) have cross-validation R2 values > 0.1. Cross-validation R2has an average of 0.23 for such TARs (Figure 4.3-B). On average, the elastic netmodels select 10 SNPs for each TAR.4.2 Identifying MDD-associated Molecular TraitsAs the second step of our pipeline, we used the model described in Section 3.3 andintegrated multiomics data with GWAS summary statistics to identify moleculartraits associated with MDD. From Section 4.2.1 to Section 4.2.3, we present our33Figure 4.1: Cross-validation R2 of genotype-based imputation models thatpredict gene expression levels from cis SNPs. A) This figure shows ahistogram of cross-validation R2 for genes with an available elastic netmodel (n = 7,044). B) This figure shows a histogram of cross-validationR2 for genes with cross-validation R2 > 0.1 (n = 1,081).34Figure 4.2: Cross-validation R2 of genotype-based imputation models thatpredict CpG site methylation levels from cis SNPs. A) This figureshows a histogram of cross-validation R2 for CpG sites with an avail-able elastic net model (n = 207,765). B) This figure shows a histogramof cross-validation R2 for CpG sites with cross-validation R2 > 0.1 (n =30,444).35Figure 4.3: Cross-validation R2 of genotype-based imputation models thatpredict TAR acetylation levels from cis SNPs. A) This figure showsa histogram of cross-validation R2 for TARs with an available elasticnet model (n = 12,167). B) This figure shows a histogram of cross-validation R2 for TARs with cross-validation R2 > 0.1 (n = 698).36Table 4.1: Summary of transcriptome-driven MDD-associated genes(TDGs).Gene Chr. Association Aassociation SNPs in Proximal toZ Statistic P-value Genotype-based Loci inModel Wray et al. [86]NEGR1 1 8.85 8.8×10−19 20 YesZNF204P 6 −5.49 4.1×10−8 7 YesTMEM33 4 5.13 2.9×10−7 1 YesZNF184 6 −5.10 3.3×10−7 12 YesNOTCH4 6 5.09 3.4×10−7 25 YesOTUD6B 8 −4.77 1.8×10−6 7LY6G5C 6 −4.62 3.8×10−6 35FADS1 11 −4.60 4.2×10−6 7TMEM106B 7 −4.58 4.6×10−6 5 YesGAS5 1 −4.57 4.9×10−6 15COX19 7 4.52 6.0×10−6 1XRCC3 14 4.51 6.4×10−6 6 Yesmolecular-level findings underlying MDD GWAS associations. Associations weredetected between the genetic component of the molecular traits and MDD. Geneticcomponent of a molecular trait is the proportion of the variation in the moleculartrait regulated by cis SNPs.4.2.1 Identifying MDD-associated GenesWe integrated gene expression data with MDD GWAS summary statistics and iden-tified 12 genes whose imputed expression levels were significantly (Bonferronicorrected P < 0.05) associated with MDD (Figure 4.4, Table 4.1). We call thesegenes as transcriptome-driven genes (TDGs). NEGR1 (neural growth regulator1) is the most significant MDD-associated gene in our study. NEGR1 is highlyexpressed in the brain frontal cortex and influences axon extension and synapticplasticity in cortex, hypothalamus, and hippocampus. Also, NEGR1 modulatessynapse formation in hippocampus [86]. Wray et al. [86] found that the secondmost MDD-associated SNP (rs1432639, P = 4.6×10−15) in their study fall in thevicinity (< 65Kb) of NEGR1.37Figure 4.4: Gene-level associations with MDD. A) This figure shows a his-togram of gene-level association p-values. B) This figure shows a Man-hattan plot of gene-level association p-values.384.2.2 Identifying MDD-associated CpG SitesWe integrated DNAm data into MDD GWAS summary statistics and identified 163CpG sites whose imputed methylation levels were significantly (Bonferroni cor-rected P< 0.05) associated with MDD (Figure 4.5). Comparing MDD-associationsbetween CpG-level (Figure 4.5) and gene-level (Figure 4.4) Manhattan plots, weobserved that there are parts of the genome (like chromosome 2, 5, and 13), forwhich we are able to identify only CpG-level associations with MDD.4.2.3 Identifying MDD-associated TARsWe combined HA data with MDD GWAS associations and discovered 25 TARswhose imputed acetylation levels were significantly (Bonferroni corrected P <0.05) associated with MDD (Figure 4.6). We found that the closest TAR to NEGR1is the most significant TAR (P = 1×10−13) among MDD-associated TARs. Inter-estingly, we identified that the closet gene to the most significant MDD-associatedTAR is NEGR1.4.3 Building Epigenome-based Imputation ModelsAs the third step of our pipeline, we built two separate epigenome-based imputationmodels to impute epigenomic component of the genes expression levels. In Sec-tion 4.3.1 and Section 4.3.2, we summarize the prediction results of the epigenome-based imputation models.4.3.1 Imputing Gene Expression Levels from CpG SitesWe leveraged 411 samples of individuals for whom both DNAm and gene expres-sion data were available. On average, there are 307 CpG sites available in the 1Mbwindow of the gene’s TSS. The elastic net models that take methylation levels ofthe CpG sites as input find at least one non-zero CpG weight for 60% of the genes(n = 4,234) in the study. Cross-validation R2 (the proportion of variance of thegene expression level explained by cis CpG sites) has an average of 0.04 amongsuch genes that have an available elastic net model (Figure 4.7-A). 11% of thesegenes (n = 476) have cross-validation R2 values > 0.1 for which cross-validation39Figure 4.5: CpG-level associations with MDD. A) This figure shows a his-togram of CpG-level association p-values. B) This figure shows a Man-hattan plot of CpG-level association p-values.40Figure 4.6: TAR-level associations with MDD. A) This figure shows a his-togram of TAR-level association p-values. B) This figure shows a Man-hattan plot of TAR-level association p-values.41Table 4.2: Summary of approaches that translate epigenomic-level associa-tions to the gene level.Approach Significant Genes Overlapwith TDGsDNAm + Closest Matching 46 NEGR1ZNF184OTUD6BTMEM106BHA + Closest Matching 21 NEGR1DNAm + Multi-MetaXcan 43 ZNF184FADS1XRCC3HA + Multi-MetaXcan 34 NEGR1ZNF184R2 has an average of 0.22 (Figure 4.7-B). On average, the elastic net models select12 CpG sites for each gene.4.3.2 Imputing Gene Expression Levels from TARsWe used 392 samples of individuals for whom both gene expression and HA datawere available. On average, there are 17 TARs available in the 1Mb window ofthe gene’s TSS. The training R2 for OLS models that take acetylation levels of theTARs as input has an average of 0.06 across all genes in the study (n = 7,051)(Figure 4.8-A). 18% of the genes (n = 1,302) have training R2 > 0.1. Training R2has an average of 0.15 for such genes (Figure 4.8-B).4.4 Summarizing Epigenomic-level Associations at theGene LevelAs the last step of our pipeline, we found 46 genes proximal to the MDD-associatedCpG sites (Table 4.2). Four such genes are common with TDGs. Similarly, wefound 21 genes nearby to MDD-associated TARs. Only NEGR1 is common be-tween these genes and TDGs (Table 4.2).We used epigenome-based imputation models followed by Multi-MetaXcan to42Figure 4.7: Cross-validation R2 of epigenome-based imputation models thatpredict gene expression levels from cis CpG sites. A) This figure showsa histogram of cross-validation R2 for genes with an available elastic netmodel (n = 4,234). B) This figure shows a histogram of cross-validationR2 for genes with cross-validation R2 > 0.1 (n = 476).43Figure 4.8: Training R2 of epigenome-based imputation models that predictgene expression levels from cis TARs. A) This figure shows a histogramof training R2 for all genes in the study (n = 7,051). B) This figure showsa histogram of training R2 for genes with training R2 > 0.1 (n = 1,302).44translate CpG-level associations to the gene-level associations. We identified 43genes (called DNAm-driven genes) significantly (Bonferroni corrected P < 0.05)associated with MDD (Figure 4.9). Three of the DNAm-driven genes are foundamong TDGs (Table 4.2).Same analysis on HA data identified 34 genes (HA-driven genes) significantly(Bonferroni corrected P < 0.05) associated with MDD (Figure 4.10). Two suchgenes are common with TDG (Table 4.2). DNAm-driven genes and HA-drivengenes overlap for 11 genes. We define Multi-MetaXcan genes as the union ofDNAm-driven and HA-driven genes. Multi-MetaXcan finds the associations be-tween the epigenomic component of gene expression levels and MDD. Epigenomiccomponent of gene expression is the proportion of the variation in gene expressionregulated by cis epigenomic marks.We compared transcriptome-driven Z statistics to DNAm-driven Z statisticsof the gene-level associations (Figure 4.11). Transcriptome-driven Z statisticsare obtained by applying MetaXcan to gene expression data as described in Sec-tion 4.2.1. DNAm-driven Z statistics are estimated by translating CpG-level as-sociations to gene level through Multi-MetaXcan. We observed a significant cor-relation (Pearson’s r = 0.52, P < 2.2× 10−16) between transcriptome-driven andDNAm-driven Z statistics. However, we did not find a strong correlation betweenbetween transcriptome-driven and HA-driven Z statistics. HA-driven Z statisticsare estimated by translating TAR-level associations to gene-level associations.4.5 SummaryWe built genotype-based imputation models for each gene, CpG site, and TAR.We leveraged SNPs’ weights from these models and applied MetaXcan to translateSNP-level associations underlying MDD to molecular-level associations. As theresult, we identified 12 genes (TDGs) (Figure 4.4), 163 CpG sites (Figure 4.5),and 25 TARs (Figure 4.6) significantly (Bonferroni corrected P < 0.05) associatedwith MDD.Next, we linked CpG sites and TARs to genes by building epigenome-basedimputation models for each gene. Finally, we leveraged CpG sites’ and TARs’weights from these models and applied Multi-MetaXcan to summarize epigenomic-45Figure 4.9: Translation of CpG-level associations to gene-level associationsby Multi-MetaXcan. A) This figure shows a histogram of gene-levelassociation p-values. B) This figure shows a Manhattan plot of gene-level association p-values.46Figure 4.10: Translation of TAR-level associations to gene-level associationsby Multi-MetaXcan. A) This figure shows a histogram of gene-levelassociation p-values. B) This figure shows a Manhattan plot of gene-level association p-values.47Table 4.3: Summary of the MDD-associated genes proximal to the MDD-associated genomic regions reported by Wray et al. [86]. We found thatseven of the TDGs and 22 of the Muti-MetaXcan genes are proximal(1Mb) to nine such genomic regions.Chr. MDD-associated Proximal ProximalRegion (Mb) (1 Mb) (1 Mb)Reported by TDGs Multi-MetaXcanWray et al. [86] Genes1 8.390−8.895 RERE1 72.511−73.059 NEGR1 NEGR13 44.222−44.997 ZNF502TCAIMTMEM424 41.880−42.189 TMEM335 87.443−88.244 LINC004616 27.738−32.848 ZNF184 ZNF184ZNF204P AL022393.7NOTCH4 ZNF323TMEM106B MICAGTF2H4C6ORF47HSPA1LBAG6CLIC19 119.675−119.767 ASTN214 41.941−42.320 LRFN514 103.828−104.174 XRCC3 XRCC3KLC1TRMT61ACKBAPOPT148Figure 4.11: Comparison of transcriptome-driven Z statistics to DNAm-driven Z statistics.level associations at the gene level. As the result, we found 43 DNAm-driven(Figure 4.9) and 34 HA-driven (Figure 4.10) genes associated with MDD.49Chapter 5DiscussionMetaXcan framework enabled us to integrate multiomics datasets with MDD GWASsummary statistics to interpret the functionality of MDD-associated SNPs at mul-tiple molecular levels. We used Bonferroni correction to control for false positivesarising from multiple testing. We identified 12 MDD-associated genes (TDGs)(Figure 4.4). Combining DNAm and HA data with MDD GWAS summary statis-tics, we identified 163 MDD-associated CpG sites (Figure 4.5) and 25 MDD-associated TARs (Figure 4.6), respectively.As the final step of our multiomics integration, we linked epigenomic modifica-tions (DNAm and HA marks) to gene expression levels and summarized epigenomic-level (CpG-level and TAR-level) associations at the gene level. The inclusion ofepigenomic data through Multi-MetaXcan yielded 66 genes (Table 4.2), of which62 genes were not found among TDGs (genes which are identified by models thatuse only transcriptome data).In the most recent MDD GWAS study, Wray et al. [86] identified 44 genomicregions significantly associated with MDD. They also identified the lead SNP (SNPwith the lowest p-value) within each region and found that the two most significantlead SNPs fall within OLFM4 (olfactomedin 4) and nearby NEGR1 (neural growthregulator 1). OLFM4 is not present among our genes, which passed the QC stepsbut NEGR1 is the most significant (P= 8.8×10−19) MDD-associated gene amongTDGs (Table 4.1). Moreover, Wray et al. [86] reported 51 genes, which eitherincluded the lead SNPs or were in 200kb distance of the lead SNPs. 27 such50genes are available among our genes, of which 11 genes are found to be significant(Bonferroni P < 0.05) MDD-associated genes in our study.We found that seven genes of our 12 TDGs fall within or in the vicinity (1Mbwindow) of the MDD-associated genomic regions (Table 4.3). In total, we identi-fied 74 MDD-associated genes (Multi-MetaXcan = 62, TDGs = 12). 26 such genesfall within or in the vicinity (1Mb window) of the 44 MDD-associated genomicregions (Table 4.3).Wray et al. [86] found that in nine of the 44 MDD-associated genomic re-gions, the lead SNP is within a gene and there is no other gene within 200Kbdistance. Such genes are known to be involved in neuronal development, synapticfunction, and transmembrane adhesion complexes. LRFN5 (leucine rich repeat andfibronectin type III domain containing 5) is one of these nine genes and is highlyexpressed in the brain frontal and anterior cingulate cortex. This gene is involved insynapse formation and presynaptic differentiation [17]. Although LRFN5 includesa lead SNP, the regulatory impact of the lead SNP on the expression level of LRFN5is not clear. We did not find LRFN5 among the TDGs. However, we were able todetect this gene among both of the DNAm-driven and HA-driven genes. These ob-servations highlight the limitation of the typical gene-based methods [6, 7, 31, 34],where only the direct regulatory impact of SNPs on gene expression is modeled.We found that including additional molecular traits (CpG sites and TARs), andthus modeling indirect regulatory impact of SNPs on gene expression enabled usto detect LRFN5.Wray et al. [86] performed a TWAS analysis [34] to integrate CommonMind [30]gene expression data from brain cortex with MDD GWAS summary statistics totranslate MDD GWAS associations at the gene level. They found 17 MDD-associatedgenes (TWAS MDD genes), which included OLFM4 but did not contain eitherNEGR1 or LRFN5. 12 of the TWAS MDD genes are available among our genes,of which only two genes were common with Multi-MetaXcan genes and none ofthem overlapped with TDGs.We investigated the differences between our TDGs and TWAS MDD genesin Wray et al. [86]. We found that the underlying computational models usedby Wray et al. [86] are different from our genotype-based imputation models. Us-ing a different imputation model might result in selecting a different set of SNPs51for each gene. In addition, reference eQTL datasets are different between our studyand Wray et al. [86]. Wray et al. [86] leveraged the reference eQTL datasets fromCommonMind whereas we used ROSMAP genotype and gene expression data.Using imputed gene expression, DNAm, and HA data to interpret GWAS SNPsat the molecular levels and eventually gene level has potential advantages:• Gene is a more interpretable biological unit than a SNP.• Lower number of molecular traits (genes, CpG sites, or TARs) comparedto the number of GWAS SNPs reduces multiple-testing burden and thus,increases the statistical power.• Aggregating effects of multiple GWAS SNPs for a single molecular trait mayincrease the power to identify molecular-level associations.• In contrast to the typical gene-based approaches [6, 7, 31, 34], where the as-sociation between genetically regulated gene expression level and the GWAStrait is tested, Multi-MetaXcan tests for the association between both genet-ically and epigenetically regulated gene expression and GWAS trait. As wehave shown, accounting for the epigenetic component of gene expressioncan yield additional genes that are missed in a typical gene-based analysis.Such associations may reflect the indirect regulatory impact of SNPs on geneexpressions mediated through epigenomic marks.• No actual genotype, gene expression, DNAm, or HA data are required toapply Multi-MetaXcan to a new GWAS summary statistics.However, Multi-MetaXcan has some limitations:• GWAS association may not be interpreted at the molecular level if the GWASSNP affects the GWAS trait through mechanisms other than regulating theexpression level of a nearby gene, methylation level of a nearby CpG site, oracetylation level of a nearby TAR.• Inherited from MetaXcan, the GWAS SNPs need to cover the majority of thecis SNPs in the imputation model. Otherwise, the accuracy in estimating theassociation Z statistic between a molecular trait and the GWAS trait wouldbe reduced.52In the future, we will leverage imputed genotype data as well as datasets withlarger sample sizes in model developments. For a single gene, we will buildgenotype-based imputation models on thousands of cis SNPs rather than hundredsof cis input SNPs. Also, we will apply Muti-MetaXcan to other complex traits likeschizophrenia, autism, or bipolar disorder. This analysis will help us to evaluatewhether the detection gain with Multi-MetaXcan compared to the typical gene-based approaches [6, 7, 31, 34] is generalizable. Moreover, we will combine braincortex reference eQTL datasets from ROSMAP and CommonMind data to increasethe sample size. Finally, we hope to implement Multi-MetaXcan software andmake it accessible online. The software would take in GWAS summary statisticsand output the GWAS-associated genes, CpG sites, and TARs as well as Multi-MetaXcan genes.In conclusion, we proposed a multiomics approach, Multi-MetaXcan, whichincorporates genotype, gene expression, DNAm, and HA data to interpret the func-tionality of GWAS SNPs at multiple molecular levels. Multi-MetaXcan furtherpermits translation of CpG-level and TAR-level Z statistics to gene level. We ap-plied Multi-MetaXcan to MDD GWAS summary statistics and found 62 new genescompared to the TDGs. Our results thus highlight the detection power we gainedthrough multiomics integration.53Bibliography[1] F. W. Albert and L. Kruglyak. The role of regulatory variation in complextraits and disease. Nature Reviews Genetics, 16(4):197, 2015. → page 1[2] R. P. Alexander, G. Fang, J. Rozowsky, M. Snyder, and M. B. Gerstein.Annotating non-coding regions of the genome. Nature Reviews Genetics, 11(8):559, 2010. → pages 1, 2, 4[3] N. E. Allen, C. Sudlow, T. Peakman, R. Collins, et al. Uk biobank data:come and get it, 2014. → page 26[4] D. Altshuler, M. J. Daly, and E. S. Lander. Genetic mapping in humandisease. science, 322(5903):881–888, 2008. → page 2[5] Y. Banda, M. N. Kvale, T. J. Hoffmann, S. E. Hesselson, D. Ranatunga,H. Tang, C. Sabatti, L. A. Croen, B. P. Dispensa, M. Henderson, et al.Characterizing race/ethnicity and genetic ancestry for 100,000 subjects inthe genetic epidemiology research on adult health and aging (gera) cohort.Genetics, 200(4):1285–1295, 2015. → page 26[6] A. Barbeira, K. P. Shah, J. M. Torres, H. E. Wheeler, E. S. Torstenson,T. Edwards, T. Garcia, G. I. Bell, D. Nicolae, N. J. Cox, et al. Metaxcan:summary statistics based gene-level association method infers accuratepredixcan results. bioRxiv, page 045260, 2016. → pages1, 2, 3, 8, 17, 21, 28, 29, 51, 52, 53[7] A. Barbeira, S. P. Dickinson, J. M. Torres, E. S. Torstenson, J. Zheng, H. E.Wheeler, K. P. Shah, T. Edwards, D. Nicolae, N. J. Cox, et al. Integratingtissue specific mechanisms into gwas summary results. bioRxiv, page045260, 2017. → pages 1, 2, 8, 16, 18, 28, 29, 51, 52, 53[8] P. Barnes, I. Adcock, and K. Ito. Histone acetylation and deacetylation:importance in inflammatory lung diseases. European Respiratory Journal,25(3):552–563, 2005. → page 554[9] A. Battle, S. Mostafavi, X. Zhu, J. B. Potash, M. M. Weissman,C. McCormick, C. D. Haudenschild, K. B. Beckman, J. Shi, R. Mei, et al.Characterizing the genetic basis of transcriptome diversity throughrna-sequencing of 922 individuals. Genome research, 24(1):14–24, 2014. →page 14[10] D. R. Bentley, S. Balasubramanian, H. P. Swerdlow, G. P. Smith, J. Milton,C. G. Brown, K. P. Hall, D. J. Evers, C. L. Barnes, H. R. Bignell, et al.Accurate whole human genome sequencing using reversible terminatorchemistry. nature, 456(7218):53, 2008. → page 1[11] M. Bibikova, Z. Lin, L. Zhou, E. Chudin, E. W. Garcia, B. Wu, D. Doucet,N. J. Thomas, Y. Wang, E. Vollmer, et al. High-throughput dna methylationprofiling using universal bead arrays. Genome research, 16(3):383–393,2006. → page 4[12] B. L. Browning and S. R. Browning. A unified approach to genotypeimputation and haplotype-phase inference for large data sets of trios andunrelated individuals. The American Journal of Human Genetics, 84(2):210–223, 2009. → pages 3, 22[13] S. R. Browning. Multilocus association mapping using variable-lengthmarkov chains. The American Journal of Human Genetics, 78(6):903–913,2006.[14] S. R. Browning and B. L. Browning. Rapid and accurate haplotype phasingand missing-data inference for whole-genome association studies by use oflocalized haplotype clustering. The American Journal of Human Genetics,81(5):1084–1097, 2007. → page 3[15] W. S. Bush and J. H. Moore. Genome-wide association studies. PLoScomputational biology, 8(12):e1002822, 2012. → pages 2, 3[16] R. M. Cantor, K. Lange, and J. S. Sinsheimer. Prioritizing gwas results: areview of statistical methods and recommendations for their application. TheAmerican Journal of Human Genetics, 86(1):6–22, 2010. → page 4[17] Y. Choi, J. Nam, D. J. Whitcomb, Y. S. Song, D. Kim, S. Jeon, J. W. Um,S.-G. Lee, J. Woo, S.-K. Kwon, et al. Salm5 trans-synaptically interacts withlar-rptps in a splicing-dependent manner to regulate synapse development.Scientific reports, 6:26676, 2016. → page 5155[18] G. M. Clarke, C. A. Anderson, F. H. Pettersson, L. R. Cardon, A. P. Morris,and K. T. Zondervan. Basic statistical analysis in genetic case-controlstudies. Nature protocols, 6(2):121, 2011. → page 3[19] J. K. Conner, D. L. Hartl, et al. A primer of ecological genetics. SinauerAssociates Incorporated, 2004. → page 4[20] . G. P. Consortium et al. A map of human genome variation frompopulation-scale sequencing. Nature, 467(7319):1061, 2010. → page 2[21] I. H. G. S. Consortium et al. Initial sequencing and analysis of the humangenome. Nature, 409(6822):860, 2001. → page 2[22] W. Cookson, L. Liang, G. Abecasis, M. Moffatt, and M. Lathrop. Mappingcomplex disease traits with global gene expression. Nature ReviewsGenetics, 10(3):184, 2009. → page 11[23] D. R. Cox. The regression analysis of binary sequences. Journal of theRoyal Statistical Society. Series B (Methodological), pages 215–242, 1958.→ page 3[24] P. L. De Jager, J. M. Shulman, L. B. Chibnik, B. T. Keenan, T. Raj, R. S.Wilson, L. Yu, S. E. Leurgans, D. Tran, C. Aubin, et al. A genome-widescan for common variants affecting the rate of age-related cognitive decline.Neurobiology of aging, 33(5):1017–e1, 2012. → pages 21, 22[25] P. L. De Jager, G. Srivastava, K. Lunnon, J. Burgess, L. C. Schalkwyk,L. Yu, M. L. Eaton, B. T. Keenan, J. Ernst, C. McCabe, et al. Alzheimer’sdisease: early alterations in brain dna methylation at ank1, bin1, rhbdf2 andother loci. Nature neuroscience, 17(9):1156, 2014. → pages 6, 21, 23, 24[26] S. Dedeurwaerder, M. Defrance, E. Calonne, H. Denis, C. Sotiriou, andF. Fuks. Evaluation of the infinium methylation 450k technology.Epigenomics, 3(6):771–784, 2011. → page 6[27] A. P. Feinberg. Epigenomics reveals a functional genome anatomy and anew approach to common disease. Nature biotechnology, 28(10):1049,2010. → page 6[28] A. M. Fernandez-Pujals, M. J. Adams, P. Thomson, A. G. McKechanie,D. H. Blackwood, B. H. Smith, A. F. Dominiczak, A. D. Morris,K. Matthews, A. Campbell, et al. Epidemiology and heritability of majordepressive disorder, stratified by age of onset, sex, and illness course in56generation scotland: Scottish family health study (gs: Sfhs). PLoS One, 10(11):e0142197, 2015. → pages 25, 26[29] J. Friedman, T. Hastie, and R. Tibshirani. Regularization paths forgeneralized linear models via coordinate descent. Journal of statisticalsoftware, 33(1):1, 2010. → page 28[30] M. Fromer, P. Roussos, S. K. Sieberts, J. S. Johnson, D. H. Kavanagh, T. M.Perumal, D. M. Ruderfer, E. C. Oh, A. Topol, H. R. Shah, et al. Geneexpression elucidates functional impact of polygenic risk for schizophrenia.Nature neuroscience, 19(11):1442, 2016. → pages 20, 51[31] E. R. Gamazon, H. E. Wheeler, K. P. Shah, S. V. Mozaffari,K. Aquino-Michaels, R. J. Carroll, A. E. Eyler, J. C. Denny, D. L. Nicolae,N. J. Cox, et al. A gene-based association method for mapping traits usingreference transcriptome data. Nature genetics, 47(9):1091, 2015. → pages1, 7, 13, 14, 26, 27, 51, 52, 53[32] M. Glozak and E. Seto. Histone deacetylases and cancer. Oncogene, 26(37):5420, 2007. → page 5[33] S. Goodwin, J. D. McPherson, and W. R. McCombie. Coming of age: tenyears of next-generation sequencing technologies. Nature Reviews Genetics,17(6):333, 2016. → page 5[34] A. Gusev, A. Ko, H. Shi, G. Bhatia, W. Chung, B. W. Penninx, R. Jansen,E. J. De Geus, D. I. Boomsma, F. A. Wright, et al. Integrative approaches forlarge-scale transcriptome-wide association studies. Nature genetics, 48(3):245, 2016. → pages 1, 15, 18, 20, 51, 52, 53[35] M. Haberland, R. L. Montgomery, and E. N. Olson. The many roles ofhistone deacetylases in development and physiology: implications fordisease and therapy. Nature Reviews Genetics, 10(1):32, 2009. → page 5[36] B. N. Howie, P. Donnelly, and J. Marchini. A flexible and accurate genotypeimputation method for the next generation of genome-wide associationstudies. PLoS genetics, 5(6):e1000529, 2009. → page 3[37] C. L. Hyde, M. W. Nagle, C. Tian, X. Chen, S. A. Paciga, J. R. Wendland,J. Y. Tung, D. A. Hinds, R. H. Perlis, and A. R. Winslow. Identification of15 genetic loci associated with risk of major depression in individuals ofeuropean descent. Nature genetics, 48(9):1031, 2016. → pages 19, 25, 2657[38] I. Illumina. An introduction to next-generation sequencing technology.2015. → page 5[39] W. E. Johnson, C. Li, and A. Rabinovic. Adjusting batch effects inmicroarray expression data using empirical bayes methods. Biostatistics, 8(1):118–127, 2007. → page 23[40] M. J. Jones, S. J. Goodman, and M. S. Kobor. Dna methylation and healthyhuman aging. Aging cell, 14(6):924–932, 2015. → page 4[41] F. K. Kavvoura and J. P. Ioannidis. Methods for meta-analysis in geneticassociation studies: a review of their potential and pitfalls. Human genetics,123(1):1–14, 2008. → page 4[42] C. M. Koch, R. M. Andrews, P. Flicek, S. C. Dillon, U. Karao¨z, G. K.Clelland, S. Wilcox, D. M. Beare, J. C. Fowler, P. Couttet, et al. Thelandscape of histone modifications across 1% of the human genome in fivehuman cell lines. Genome research, 17(6):691–707, 2007. → page 6[43] M.-H. Kuo and C. D. Allis. Roles of histone acetyltransferases anddeacetylases in gene regulation. Bioessays, 20(8):615–626, 1998. → page 5[44] B. Langmead, C. Trapnell, M. Pop, and S. L. Salzberg. Ultrafast andmemory-efficient alignment of short dna sequences to the human genome.Genome biology, 10(3):R25, 2009. → page 23[45] J. Z. Levin, M. Yassour, X. Adiconis, C. Nusbaum, D. A. Thompson,N. Friedman, A. Gnirke, and A. Regev. Comprehensive comparativeanalysis of strand-specific rna sequencing methods. Nature methods, 7(9):709, 2010. → page 22[46] D. F. Levinson, S. Mostafavi, Y. Milaneschi, M. Rivera, S. Ripke, N. R.Wray, and P. F. Sullivan. Genetic studies of major depressive disorder: whyare there no genome-wide association study findings and what can we doabout it? Biological psychiatry, 76(7):510–512, 2014. → pages 4, 19[47] B. Li and C. N. Dewey. Rsem: accurate transcript quantification fromrna-seq data with or without a reference genome. BMC bioinformatics, 12(1):323, 2011. → page 23[48] H. Li and R. Durbin. Fast and accurate short read alignment withburrows–wheeler transform. Bioinformatics, 25(14):1754–1760, 2009. →page 2458[49] A. S. Lim, H.-U. Klein, L. Yu, L. B. Chibnik, S. Ali, J. Xu, D. A. Bennett,and P. L. De Jager. Diurnal and seasonal molecular rhythms in humanneocortex and their relation to alzheimers disease. Nature Communications,8:14931, 2017. → pages 6, 7, 21, 22, 24[50] D. Liu, X. Lin, and D. Ghosh. Semiparametric regression ofmultidimensional genetic pathway data: Least-squares kernel machines andlinear mixed models. Biometrics, 63(4):1079–1088, 2007. → page 12[51] J. Z. Liu, A. F. Mcrae, D. R. Nyholt, S. E. Medland, N. R. Wray, K. M.Brown, N. K. Hayward, G. W. Montgomery, P. M. Visscher, N. G. Martin,et al. A versatile gene-based test for genome-wide association studies. TheAmerican Journal of Human Genetics, 87(1):139–145, 2010. → pages13, 14[52] N. Mancuso, H. Shi, P. Goddard, G. Kichaev, A. Gusev, and B. Pasaniuc.Integrating gene expression with summary association statistics to identifygenes associated with 30 complex traits. The American Journal of HumanGenetics, 100(3):473–487, 2017. → page 2[53] J. Marchini and B. Howie. Genotype imputation for genome-wideassociation studies. Nature Reviews Genetics, 11(7):499, 2010. → page 3[54] J. Marchini, B. Howie, S. Myers, G. McVean, and P. Donnelly. A newmultipoint method for genome-wide association studies by imputation ofgenotypes. Nature genetics, 39(7):906, 2007. → page 3[55] E. R. Mardis. The impact of next-generation sequencing technology ongenetics. Trends in genetics, 24(3):133–141, 2008. → page 5[56] M. T. Maurano, R. Humbert, E. Rynes, R. E. Thurman, E. Haugen, H. Wang,A. P. Reynolds, R. Sandstrom, H. Qu, J. Brody, et al. Systematic localizationof common disease-associated variation in regulatory dna. Science, page1222794, 2012. → page 4[57] B. H. Mecham, P. S. Nelson, and J. D. Storey. Supervised normalization ofmicroarrays. Bioinformatics, 26(10):1308–1315, 2010. → page 24[58] M. L. Metzker. Sequencing technologiesthe next generation. Nature reviewsgenetics, 11(1):31, 2010. → page 5[59] B. M. Neale and P. C. Sham. The future of association studies: gene-basedanalysis and replication. The American Journal of Human Genetics, 75(3):353–362, 2004. → page 759[60] B. Ng, C. C. White, H.-U. Klein, S. K. Sieberts, C. McCabe, E. Patrick,J. Xu, L. Yu, C. Gaiteri, D. A. Bennett, et al. An xqtl map integrates thegenetic architecture of the human brain’s transcriptome and epigenome.Nature neuroscience, 20(10):1418, 2017. → page 25[61] K. Norrgard. Genetic variation and disease: Gwas. Nature Education, 1(1):87, 2008. → pages 2, 3[62] W. H. Organization. The ICD-10 classification of mental and behaviouraldisorders: clinical descriptions and diagnostic guidelines, volume 1. WorldHealth Organization, 1992. → page 25[63] W. H. Organization et al. International classification of diseases:[9th] ninthrevision, basic tabulation list with alphabetic index. 1978. → page 25[64] P. J. Park. Chip–seq: advantages and challenges of a maturing technology.Nature Reviews Genetics, 10(10):669, 2009. → page 7[65] J. Peters. The role of genomic imprinting in biology and disease: anexpanding view. Nature Reviews Genetics, 15(8):517, 2014. → page 4[66] A. L. Price, N. J. Patterson, R. M. Plenge, M. E. Weinblatt, N. A. Shadick,and D. Reich. Principal components analysis corrects for stratification ingenome-wide association studies. Nature genetics, 38(8):904, 2006. → page22[67] E. M. Price, A. M. Cotton, L. L. Lam, P. Farre´, E. Emberly, C. J. Brown,W. P. Robinson, and M. S. Kobor. Additional annotation enhances potentialfor biologically-relevant analysis of the illumina infiniumhumanmethylation450 beadchip array. Epigenetics & chromatin, 6(1):4,2013. → page 6[68] S. Purcell, B. Neale, K. Todd-Brown, L. Thomas, M. A. Ferreira, D. Bender,J. Maller, P. Sklar, P. I. De Bakker, M. J. Daly, et al. Plink: a tool set forwhole-genome association and population-based linkage analyses. TheAmerican Journal of Human Genetics, 81(3):559–575, 2007. → page 22[69] J. Ragoussis. Genotyping technologies for genetic research. Annual reviewof genomics and human genetics, 10:117–133, 2009. → page 3[70] S. Ripke, N. R. Wray, C. M. Lewis, S. P. Hamilton, M. M. Weissman,G. Breen, E. M. Byrne, D. H. Blackwood, D. I. Boomsma, S. Cichon, et al.A mega-analysis of genome-wide association studies for major depressivedisorder. Molecular psychiatry, 18(4):497, 2013. → pages 7, 2660[71] S. Ripke, B. M. Neale, A. Corvin, J. T. Walters, K.-H. Farh, P. A. Holmans,P. Lee, B. Bulik-Sullivan, D. A. Collier, H. Huang, et al. Biological insightsfrom 108 schizophrenia-associated genetic loci. Nature, 511(7510):421,2014. → pages 1, 2[72] G. K. Robinson. That blup is a good thing: the estimation of random effects.Statistical science, pages 15–32, 1991. → page 15[73] B. S. Shastry. Snps: impact on gene function and phenotype. SingleNucleotide Polymorphisms: Methods and Protocols, pages 3–22, 2009. →page 2[74] B. H. Smith, A. Campbell, P. Linksted, B. Fitzpatrick, C. Jackson, S. M.Kerr, I. J. Deary, D. J. MacIntyre, H. Campbell, M. McGilchrist, et al.Cohort profile: Generation scotland: Scottish family health study (gs: Sfhs).the study, its participants and their potential for genetic research on healthand illness. International journal of epidemiology, 42(3):689–700, 2012. →pages 25, 26[75] D. J. Smith, B. I. Nicholl, B. Cullen, D. Martin, Z. Ul-Haq, J. Evans, J. M.Gill, B. Roberts, J. Gallacher, D. Mackay, et al. Prevalence andcharacteristics of probable major depression and bipolar disorder within ukbiobank: cross-sectional study of 172,751 participants. PloS one, 8(11):e75362, 2013. → page 26[76] E. K. Speliotes, C. J. Willer, S. I. Berndt, K. L. Monda, G. Thorleifsson,A. U. Jackson, H. L. Allen, C. M. Lindgren, J. Luan, R. Ma¨gi, et al.Association analyses of 249,796 individuals reveal 18 new loci associatedwith body mass index. Nature genetics, 42(11):937, 2010. → page 4[77] C. C. Spencer, Z. Su, P. Donnelly, and J. Marchini. Designing genome-wideassociation studies: sample size, power, imputation, and the choice ofgenotyping chip. PLoS genetics, 5(5):e1000477, 2009. → page 4[78] R. L. Spitzer, K. Kroenke, J. B. Williams, P. H. Q. P. C. S. Group, et al.Validation and utility of a self-report version of prime-md: the phq primarycare study. Jama, 282(18):1737–1744, 1999. → page 25[79] A. J. Sutton, K. R. Abrams, D. R. Jones, D. R. Jones, T. A. Sheldon, andF. Song. Methods for meta-analysis in medical research. 2000. → page 4[80] A. E. Teschendorff, F. Marabita, M. Lechner, T. Bartlett, J. Tegner,D. Gomez-Cabrero, and S. Beck. A beta-mixture quantile normalization61method for correcting probe design bias in illumina infinium 450 k dnamethylation data. Bioinformatics, 29(2):189–196, 2012. → page 23[81] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal ofthe Royal Statistical Society. Series B (Methodological), pages 267–288,1996. → page 14[82] J. C. Venter, M. D. Adams, E. W. Myers, P. W. Li, R. J. Mural, G. G. Sutton,H. O. Smith, M. Yandell, C. A. Evans, R. A. Holt, et al. The sequence of thehuman genome. science, 291(5507):1304–1351, 2001. → pages 1, 2[83] D. Welter, J. MacArthur, J. Morales, T. Burdett, P. Hall, H. Junkins,A. Klemm, P. Flicek, T. Manolio, L. Hindorff, et al. The nhgri gwas catalog,a curated resource of snp-trait associations. Nucleic acids research, 42(D1):D1001–D1006, 2013. → pages 1, 2, 4[84] D. A. Wheeler, M. Srinivasan, M. Egholm, Y. Shen, L. Chen, A. McGuire,W. He, Y.-J. Chen, V. Makhijani, G. T. Roth, et al. The complete genome ofan individual by massively parallel dna sequencing. nature, 452(7189):872,2008. → page 1[85] N. Wray, M. Pergadia, D. Blackwood, B. Penninx, S. Gordon, D. Nyholt,S. Ripke, D. MacIntyre, K. McGhee, A. Maclean, et al. Genome-wideassociation study of major depressive disorder: new results, meta-analysis,and lessons learned. Molecular psychiatry, 17(1):36, 2012. → pages2, 7, 18, 19[86] N. R. Wray, P. F. Sullivan, et al. Genome-wide association analyses identify44 risk variants and refine the genetic architecture of major depression.bioRxiv, page 167577, 2017. → pagesix, 1, 2, 4, 7, 8, 18, 19, 20, 25, 37, 48, 50, 51, 52[87] M. C. Wu, S. Lee, T. Cai, Y. Li, M. Boehnke, and X. Lin. Rare-variantassociation testing for sequencing data with the sequence kernel associationtest. The American Journal of Human Genetics, 89(1):82–93, 2011. →pages 12, 14[88] E. Zeggini and J. P. Ioannidis. Meta-analysis in genome-wide associationstudies. 2009. → page 4[89] Y. Zhang, T. Liu, C. A. Meyer, J. Eeckhoute, D. S. Johnson, B. E. Bernstein,C. Nusbaum, R. M. Myers, M. Brown, W. Li, et al. Model-based analysis ofchip-seq (macs). Genome biology, 9(9):R137, 2008. → page 2462[90] X. Zhou, P. Carbonetto, and M. Stephens. Polygenic modeling withbayesian sparse linear mixed models. PLoS genetics, 9(2):e1003264, 2013.→ page 15[91] Z. Zhu, F. Zhang, H. Hu, A. Bakshi, M. R. Robinson, J. E. Powell, G. W.Montgomery, M. E. Goddard, N. R. Wray, P. M. Visscher, et al. Integrationof summary data from gwas and eqtl studies predicts complex trait genetargets. Nature genetics, 48(5):481, 2016. → pages 2, 11, 16[92] H. Zou and T. Hastie. Regularization and variable selection via the elasticnet. Journal of the Royal Statistical Society: Series B (StatisticalMethodology), 67(2):301–320, 2005. → pages 14, 2763


Citation Scheme:


Citations by CSL (citeproc-js)

Usage Statistics



Customize your widget with the following options, then copy and paste the code below into the HTML of your page to embed this item in your website.
                            <div id="ubcOpenCollectionsWidgetDisplay">
                            <script id="ubcOpenCollectionsWidget"
                            async >
IIIF logo Our image viewer uses the IIIF 2.0 standard. To load this item in other compatible viewers, use this url:


Related Items