UBC Theses and Dissertations


Integration of genomic and metabolomic data for the prioritization of rare disease variants Graham, Emma 2018


Integration of genomic and metabolomic data for the prioritization of rare disease variants

by

Emma Graham

BSc Molecular Biochemistry and Biophysics, Yale University, 2016

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF

Master of Science

in

THE FACULTY OF GRADUATE AND POSTDOCTORAL STUDIES
(Bioinformatics)

The University of British Columbia
(Vancouver)

November 2018

© Emma Graham, 2018

The following individuals certify that they have read, and recommend to the Faculty of Graduate and Postdoctoral Studies for acceptance, the thesis entitled:

Integration of genomic and metabolomic data for the prioritization of rare disease variants

submitted by Emma Graham in partial fulfillment of the requirements for the degree of Master of Science in Bioinformatics.

Examining Committee:

Sara Mostafavi, Statistics and Medical Genetics
Supervisor

Wyeth Wasserman, Medical Genetics
Supervisory Committee Member

Clara van Karnebeek, Pediatrics
Supervisory Committee Member

Martin Hirst, Microbiology and Immunology
Committee Chair

Abstract

Many inborn errors of metabolism (IEMs) are amenable to treatment; early diagnosis and treatment are therefore imperative. Despite recent advances, the genetic basis of many metabolic phenotypes remains unknown. For discovery purposes, Whole Exome Sequencing (WES) variant prioritization coupled with clinical and bioinformatics expertise is currently the primary method used to identify novel disease-causing variants; however, causation is often difficult to establish due to the number of plausible variants. Integrated analysis of untargeted metabolomics (UM) and WES or Whole Genome Sequencing (WGS) data is a promising systematic approach for prioritizing causal variants from a list of candidates. In this thesis, we present an automated network-based bioinformatics approach to the integration of WES with UM data from 13 neurometabolic patients with known IEMs and 25 controls.
We perform label propagation on the STRING network initialized using an integrated genomic and metabolomic score, and use the results to rank candidate genes in order of their likely relevance to the disease. Integrated genomic and metabolomic evidence prioritized the causative gene in the top 20th percentile of candidate genes for 61.5% (8 of 13) of patients, 75% of whom achieved a percentile prioritization score at least one standard deviation above a permuted percentile. Combining genomic and metabolomic evidence resulted in the prioritization of the causative gene in 30.7% more patients than was possible with genomic evidence alone. The results of this study indicate that, for diagnostic and gene discovery purposes, metabolomics can lend support to WES gene discovery methods. This is the first method that uses UM and WES data to rank candidate variants in order of their biological relevance. To improve this method, expansion of gene-metabolite annotations and of metabolomic feature-to-metabolite mapping methods is needed.

Lay Summary

Recent technological advances have made it possible to profile an individual’s genetic code and metabolic processes with increasing ease and precision. These improvements have allowed scientists to probe an individual’s biology, enabling the identification of the cause of previously undiagnosable rare diseases. Individually, personalized genetic and metabolic profiles have been used to successfully identify the cause of inborn errors of metabolism (IEMs), a group of rare pediatric diseases that affect metabolism. If left untreated, IEMs can result in severe damage to organs, including the brain; early treatment is therefore imperative. Each IEM patient’s disease is caused by a different genetic “mistake”; identifying the location of this “mistake”, or mutation, is challenging.
Metabolic data represent a snapshot of our body’s biological reactions, and can therefore help distinguish between the “quirks” in our genome (DNA changes that make each person unique) and the mutations that can have damaging consequences for our metabolism. In this thesis, a method that combines genetic and metabolic evidence to help physicians identify the location of the mutation causing a patient’s disease was developed. Its merits and shortcomings highlight how genomic and metabolic profiling technologies can be utilized and improved for use in the clinic.

Preface

The introduction of this thesis, with the exception of the section on label propagation, as well as some parts of the discussion, is largely reproduced from a review I wrote on the genomics and metabolomics field (Graham et al. [2018]). The WES variant filtering pipeline introduced was created and maintained by the Wasserman lab at BC Children’s Hospital Research Institute, specifically Maja Tarailo-Graovac, Allison Matthews, Jessica Lee and Phillip Richmond. Their contributions included all WES data generation, variant filtering and candidate variant list creation. The raw LC-MS metabolomic data was generated by the lab of Ron Wevers at Radboud University in Nijmegen, Netherlands, specifically Udo Engelke and Leo Kluijtmans. I completed all other analysis performed in this thesis.

Table of Contents

Abstract
Lay Summary
Preface
Table of Contents
List of Tables
List of Figures
Glossary
Acknowledgments
1 Introduction
  1.1 Introduction to this chapter
  1.2 Inborn Errors of Metabolism
  1.3 Whole exome sequencing for the identification of rare disease variants
    1.3.1 Canonical approach to identifying the causative gene
  1.4 Metabolomic profiling
    1.4.1 Generating Liquid Chromatography coupled Mass Spectrometry data
    1.4.2 Processing LC-MS data
    1.4.3 Normalizing LC-MS data
    1.4.4 Testing for significant features in IEM studies
    1.4.5 Annotating features: adducts, isotopes, and metabolites
    1.4.6 Identifying IEMs through untargeted metabolomic analysis
  1.5 Integrating genomic and metabolomic data
    1.5.1 Identification of metabolically active loci
    1.5.2 Gathering evidence for metabolic perturbation on a gene-by-gene basis
  1.6 Treatable Intellectual Disability Endeavour (TIDE) Exome Sequencing project
  1.7 Strategies for disease gene prioritization
    1.7.1 Gene Recommendation
    1.7.2 Label propagation
    1.7.3 Theoretical framework of label propagation
    1.7.4 Using label propagation algorithms for candidate gene prioritization
  1.8 Functional Linkage Networks and the STRING network
    1.8.1 Selection of the STRING network
  1.9 Aims of this thesis
2 Methods
  2.1 Datasets
  2.2 WES variant filtering pipeline
  2.3 Processing and normalization of untargeted LC-MS metabolomics
    2.3.1 LC-MS data generation
    2.3.2 LC-MS data normalization and filtering
  2.4 Integrative analysis of WES and LC-MS data
    2.4.1 Creation of combined WES and metabolomic score
  2.5 Label propagation
  2.6 Categorizing polarity of gene-associated metabolites
  2.7 Summary of method
3 Results
  3.1 Characterization of WES and LC-MS data
    3.1.1 WES variants
    3.1.2 Characterization of LC-MS metabolomics features
  3.2 Assessing enrichment for gene-associated metabolites
  3.3 Integration of genomic and metabolic data to prioritize causative genes
    3.3.1 Using label propagation to rank candidate genes
    3.3.2 Assessing the utility of metabolomic evidence
    3.3.3 Permutation test to generate null model of percentile rank
    3.3.4 Characterization of factors that affect gene prioritization
  3.4 Summary
4 Discussion
  4.1 Future Work
Bibliography
A Supporting Materials

List of Tables

Table 3.1 Summary of WES data. Percentage of each patient’s variants that fall into one of four modes of inheritance and the average number of candidate genes to which these variants map.

Table 3.2 Characterization of LC-MS features.

Table 3.3 Summary of the causative gene(s) identified for each patient through the WES variant filtering pipeline. The enrichment of causative gene-associated metabolites, function of each causative gene and polarity of gene-associated metabolites are provided.

Table 3.4 Metabolic enrichment profile. Number of genes enriched for in the set of patient-specific differentially abundant metabolites.

Table 3.5 Label propagation results. Prioritization results after LP with both the combined genomic and metabolomic initial label scores and the genomic-only initial label scores. The percentile rank of each causative gene in the list of candidate genes, the number of candidate genes, and the change in percentile rank of the causative gene after addition of metabolomic evidence to the initial label score (DeltaM) are provided, in addition to each gene’s final genomic-only and combined prioritization category (PC).

Table 3.6 Label propagation results with permuted initial labels. Percentile rank of the causative gene in each patient’s list of candidate genes, as well as the mean and standard deviation of the permuted (n=500) percentile for each causative gene.

List of Figures

Figure 1.1 WES rare variant analysis pipeline. This pipeline was used in Tarailo-Graovac et al. [2016]. Raw reads are aligned to the human genome. Variants are annotated using SnpEff as well as custom Perl and Python scripts. Variants that do not map to protein-coding regions, have MAF > 0.01 in several variant databases or that do not pass QC steps are removed. Variants that do not agree with multiple inheritance models and that would not agree with the observed phenotypic effect are also removed.

Figure 1.2 Sample LC-MS metabolomics analysis pipeline. Briefly, raw metabolomics data can be processed using freely available processing software (e.g. XCMS), annotated (e.g. CAMERA), normalized (e.g. through use of internal standards) and filtered. Differentially abundant metabolites can be isolated using univariate or multivariate tests. Biological interpretation such as pathway analysis can be performed using published metabolomic databases (e.g. HMDB, BioCyc, METLIN).

Figure 1.3 Automated and manual steps in untargeted metabolomics pre-processing pipeline. The algorithms listed are only examples of tools that could be used in each step.

Figure 2.1 Summary of overall method. A) Briefly, raw metabolomics data was processed using XCMS and CAMERA and subsequently normalized through linear baseline normalization. Differentially abundant metabolites were isolated based on z-score. Raw genomic sequencing reads for each patient were processed and SnpEff was used to identify a list of candidate genomic variants with MAF ≤ 0.01. B) Enrichment for each of the 5371 HMDB genes in the patient-specific set of DAMs was assessed using Fisher’s Exact Test, generating a metabolomic score. To generate the WES (i.e., genomic) score, each gene in the STRING network was assigned “1” if also in the patient-specific candidate gene list, and “0” if not. C) These scores were plotted on an x-y grid, and the inverse of the distance between (1,1) and the coordinate of each gene, i, was considered the combined genomic and metabolomic score. Label propagation was then performed with the combined initial label score.

Figure 3.1 DeltaM of causative vs non-causative genes.

Figure 3.2 Effect of polarity of gene-associated metabolites on percentile ranking. “None” indicates that the gene has no associated metabolites in HMDB.

Figure 3.3 Utility of metabolomic evidence for causative genes associated with metabolites of varying polarities. DeltaM of causative genes by polarity of gene-associated metabolites. “None” indicates that the gene did not have any associated metabolites in HMDB.

Glossary

DAM     Differentially abundant metabolite
GBA     Guilt by Association
IEM     Inborn Error of Metabolism
LC-MS   Liquid-Chromatography Mass Spectrometry
LP      Label Propagation
RWR     Random Walk with Restart
STRING  Search Tool for the Retrieval of Interacting Genes/Proteins
TIDEX   Treatable Intellectual Disability Endeavour - Exome
WES     Whole Exome Sequencing

Acknowledgments

I would like to thank my supervisor, Sara Mostafavi, without whom this document would be a collection of gibberish. I would also like to thank my committee members, Wyeth Wasserman and Clara van Karnebeek, for the insights and conversations.

Thanks to my family and friends for allowing me to ramble on about metabolites and the power of data integration, for sharing in my joy when things worked out, and for buying me beers when they didn’t.

Special thank you to all past and present members of the Mostafavi and Wasserman labs, especially: Bernard Ng, Farnush Farhadi, Mike Vermeulen, Will Casazza, Sasha Maslova, Sina Jafarzadeh, Peter West, Louie Dinh, Halldor Thorhallsson, Hamid Omid, Joanna Lubieniecki, Phil Richmond, Magda Price, Jessica Lee and Robin van der Lee.
And of course, my bioinformatics buddies: Eric Chu, Annie Cavalla, Allison Tai, Arjun Baghela and Nivi Thatra.

Most of all, I would like to thank my husband, Evan, for always being there for one more cup of coffee.

Chapter 1

Introduction

“To define a person’s medical essence, we have to look at more than just their sequence”
(Eric Topol, 2017)

1.1 Introduction to this chapter

In this chapter, we will review the acquisition, processing and diagnostic utility of genomic and metabolomic data and explore how these data types can be integrated to prioritize candidate genes and improve the diagnosis of rare monogenic metabolic diseases. Specifically, we will provide an overview of whole exome sequencing (WES) variant filtering and liquid chromatography coupled mass spectrometry (LC-MS) metabolomic processing and feature selection pipelines. We will then discuss previous approaches to the integration of genomic and metabolomic data for the generation of relevant biological insights and review candidate gene prioritization strategies, with a focus on those using network-based methods. Finally, we will introduce several relevant components of our integration method, namely functional linkage networks and the network-based method label propagation. All of this chapter, with the exception of Sections 1.6-1.8, is included in Graham et al. [2018], with a few minor modifications.

1.2 Inborn Errors of Metabolism

Inborn errors of metabolism (IEMs) are the largest group of genetic diseases amenable to causal therapy, and are caused by genetic variants that disrupt the function of enzymes or other proteins involved in cellular metabolism, leading to energy deficit and/or accumulation of toxins (Van Bokhoven [2011]).
Early detection, enabled by newborn metabolic screening programs and genetic profiling, is pivotal so that treatment can be initiated before the onset of irreversible progressive damage to the central nervous system, which in some cases can result in intellectual disability disorder (IDD) and damage to additional organ systems.

There are currently more than 100 treatable IEMs, but for many phenotypes the genetic basis remains to be discovered (Van Karnebeek and Stockler [2012]). Cases for which the causal gene was identified have in turn provided insights and opportunities for interventions targeting their downstream molecular or cellular abnormalities (Collins et al. [2010], Horvath et al. [2016], Karnebeek et al. [2016]). These efforts have been catalogued in the online resource IEMbase, which provides further information on the etiologies and treatment of over 500 IEM disorders (B. et al. [2014]).

WES is the primary tool for discovery of the genetic basis of IEMs, and thus for establishment of a genetic-based diagnosis that, in some cases, can lead to improved outcomes through targeted interventions. The promise of this approach was illustrated by a recent neurometabolic gene discovery study (Tarailo-Graovac et al. [2016]), in which deep phenotyping and WES achieved a diagnostic yield of 68% in patients with unexplained phenotypes, identified novel human disease genes and, most importantly, enabled targeted intervention for improved outcomes in 44% of patients. Overall, published studies applying WES coupled with variant prioritization in patients with unexplained phenotypes are successful in identifying the underlying cause in 16-68% of patients (Tarailo-Graovac et al. [2016]).

However, with our current limited understanding of variant pathogenicity and the biological impact of rare variants, variant prioritization algorithms that aim to completely automate the process of prioritization fail to identify the causal variant in a substantial number of patients.
Further, variants that are identified as plausible often have a low level of supporting evidence, and are thus not adequate to establish a genetic-based diagnosis. Using multiple types of personalized “-omic” data is a promising approach to address the evidence gap in support of an IEM diagnosis. The integration of metabolomics data with WES/WGS data to identify genes causing IEM is a prime example of this approach. For example, a diagnosis of maple syrup urine disease can be supported by 1) pathogenic variants in any of DBT, BCKDHB or BCKDHA, 2) high levels of amino acids such as allo-isoleucine, isoleucine, leucine and valine, and 3) branched-chain oxoacids (Strauss et al. [2013]). These biochemical biomarkers can be detected individually (targeted metabolomics), or as part of a broader characterization of the metabolome (untargeted metabolomics). Recently, the unbiased approach afforded by untargeted metabolomics has increased in popularity due to decreasing costs, lack of required parameter tuning, and opportunities for pathway analysis (Johnson et al. [2016]).

1.3 Whole exome sequencing for the identification of rare disease variants

1.3.1 Canonical approach to identifying the causative gene

Genomic sequencing-driven variant prioritization involves multiple filtering steps that incorporate prior knowledge about allele population frequency and predicted pathogenicity. Databases such as ExAC, dbSNP and gnomAD provide information about allele frequencies seen in the general population, and are used to filter out common and likely non-pathogenic variants in the patient (Exome Aggregate Consortium [2016], Smigielski et al. [2000], Lek et al. [2016]).
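
The allele-frequency filter just described can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the pipeline used in the cited studies: the variant records, the database keys (`exac`, `gnomad`) and the function names are hypothetical, and the 0.01 cutoff is borrowed from the MAF threshold quoted in the Figure 1.1 caption.

```python
from typing import Dict, List, Optional

# Assumed cutoff, mirroring the "MAF > 0.01 is removed" rule in the Figure 1.1 caption.
MAF_CUTOFF = 0.01


def max_population_maf(freqs: Dict[str, Optional[float]]) -> float:
    """Highest MAF reported by any database; 0.0 if the variant was never observed."""
    observed = [f for f in freqs.values() if f is not None]
    return max(observed, default=0.0)


def filter_rare_variants(variants: List[dict], cutoff: float = MAF_CUTOFF) -> List[dict]:
    """Keep variants that are rare (MAF <= cutoff) in every database consulted."""
    return [v for v in variants if max_population_maf(v["maf"]) <= cutoff]


# Made-up records; None means the variant is absent from that database.
variants = [
    {"id": "var1", "gene": "GENE_A", "maf": {"exac": 0.25, "gnomad": 0.23}},
    {"id": "var2", "gene": "GENE_B", "maf": {"exac": None, "gnomad": 0.0004}},
    {"id": "var3", "gene": "GENE_C", "maf": {"exac": None, "gnomad": None}},
]

rare = filter_rare_variants(variants)
print([v["id"] for v in rare])  # ['var2', 'var3']
```

Taking the maximum frequency across databases is a conservative choice in this sketch: a variant that is common in any one reference population is discarded.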
Once variants have been identified as pathogenic through use of in silico pathogenicity prediction tools (such as PolyPhen-2 and SIFT), genomic data from the individual’s parents are used to filter them according to Mendelian models of inheritance, treating the parents of the individual as “controls” (Ng and Henikoff [2003], Adzhubei et al. [2013]). This allows for both the isolation of pathogenic variants and the assignment of mode of inheritance. However, it should be noted that some studies have questioned whether genomic databases may in fact contain individuals with disease-associated genotypes but no clinical presentation of the underlying disease at the time of inclusion, as more than 2.8% of the ExAC population was found to carry likely pathogenic or pathogenic genotypes reported in ClinVar (Tarailo-Graovac et al. [2017]). Continued expansion of variant databases and variant filtering methods will play an important role in identifying pathogenic variants.

A sample WES variant filtering pipeline used in Tarailo-Graovac et al. [2016] is detailed in Figure 1.1. In the case of WES, in which around 20,000 to 50,000 variants are observed in protein coding regions per individual, standard filtering steps typically enable researchers to reduce the number of variants to 10 to 200 candidate variants depending on the WES study design (e.g., access to trio data and pedigree structure) (Yang et al. [2013], Belkadi et al. [2015]). For the challenging task of identifying the needle in the haystack, i.e., the single causative variant, clinical input and extensive discussion among physicians, genetic counselors and bioinformaticians is typically needed; for genes previously unreported to cause human disease, identification of other families with similar phenotypes and other variants in the same gene, as well as in vitro functional studies, are required as evidence for validation of etiology (Tarailo-Graovac et al. [2016]).

The reliance on clinical expertise throughout the variant filtering process results in long processing times, especially if trio data is not available. Further, causality is difficult to establish, especially for variants that are previously unreported in human disease, of poor sequencing quality or of unknown significance (Bertier et al. [2016]). Integrating multiple types of “-omic” data is a promising approach to help address this challenge.

1.4 Metabolomic profiling

Since IEMs result from a malfunction of protein-coding genes, many of which control the concentration of a variety of metabolites, biochemical tests of known IEM-related metabolites have long been performed for IEM diagnosis. The simultaneous assay of many IEM biomarkers through the use of untargeted metabolomics is an active research area. In this section, we provide an overview of metabolomic profiling methods and review existing approaches for the processing and analysis of untargeted LC-MS metabolomics data for IEM diagnosis and discovery. This includes four critical components: 1) generation of LC-MS data, 2) identification of units of analysis (“features”) and normalization, 3) identification of significant features, and 4) association of significant features with known metabolites. An overview of a hypothetical untargeted LC-MS pipeline is provided in Figure 1.2.

Figure 1.1: WES rare variant analysis pipeline. This pipeline was used in Tarailo-Graovac et al. [2016]. Raw reads are aligned to the human genome. Variants are annotated using SnpEff as well as custom Perl and Python scripts. Variants that do not map to protein-coding regions, have MAF > 0.01 in several variant databases or that do not pass QC steps are removed. Variants that do not agree with multiple inheritance models and that would not agree with the observed phenotypic effect are also removed.

[Figure 1.2 flowchart: raw data (mzData format); peak picking (feature table, ~10,000 to 20,000 features); adduct and isotope annotation (annotated feature table); normalization and transformation (normalized feature table); peak filtering (feature table with base peaks annotated in a metabolomic database, ~5,000 to 10,000 features); assessment of differential abundance (list of features); biological interpretation (e.g. pathway analysis).]

Figure 1.2: Sample LC-MS metabolomics analysis pipeline. Briefly, raw metabolomics data can be processed using freely available processing software (e.g. XCMS), annotated (e.g. CAMERA), normalized (e.g. through use of internal standards) and filtered. Differentially abundant metabolites can be isolated using univariate or multivariate tests. Biological interpretation such as pathway analysis can be performed using published metabolomic databases (e.g. HMDB, BioCyc, METLIN).

1.4.1 Generating Liquid Chromatography coupled Mass Spectrometry data

In general, metabolomics quantifies a subset of small molecules (metabolites) in a tissue or body fluid using either nuclear magnetic resonance (NMR) spectroscopy or MS (Johnson et al. [2016]). NMR spectroscopy quantifies solution-state molecular structures based on atom-centered nuclear interactions. NMR spectroscopy is inexpensive, capable of high throughput analysis and highly reproducible; however, it lacks sensitivity and is generally only able to quantify metabolites of medium to high abundance. For this reason, MS-based quantification has primarily been used in the context of IEM diagnosis and discovery.

In mass spectrometry (MS) based quantification, metabolites are first chromatographically separated and then measured in a semi-quantitative manner using high resolution mass spectrometers in detection modes that measure both positively and negatively charged ions produced through electrospray ionization (ESI). MS separation techniques include liquid chromatography, capillary electrophoresis, gas chromatography and ultra-performance liquid chromatography (Zhang et al. [2012]).
No single chromatographic separation protocol can quantify all metabolites in a sample. Therefore, to capture all metabolites, multiple chromatographic methods must be used. For example, reverse-phase LC quantifies non-polar to slightly polar molecules, while hydrophilic interaction LC detects strongly polar to slightly polar molecules (Bajad et al. [2006], Roberts et al. [2012]). This review will focus on liquid chromatography coupled MS (LC-MS), as it covers the largest range of metabolite polarity and is widely used. When coupled with LC, the most common means of separation are reverse phase liquid chromatography (RPLC) for separation of hydrophobic metabolites, and hydrophilic interaction chromatography (HILIC) for the separation of hydrophilic metabolites (Zhou and Yin [2016]). MS platforms commonly used for untargeted metabolomics studies include low resolution techniques such as triple quadrupole (QQQ) and quadrupole-ion trap (QIT), and high resolution techniques such as quadrupole-time of flight (Q-TOF), quadrupole Orbitrap (Q-Orbitrap) and Fourier transform ion cyclotron resonance mass spectrometry (FTICR-MS).

1.4.2 Processing LC-MS data

An overview of the manual and automatic components of the LC-MS pre-processing pipeline is provided in Figure 1.3. The first step is to convert an LC-MS-produced dataset for a single individual into a list of “features” (defined as the combination of mass-to-charge (m/z) ratio and retention time) and their intensities. A variety of software packages designed to process metabolomic data have been developed for this purpose, of which XCMS, MZmine 2 and MAVEN are among the most popular (Katajamaa et al. [2006], Tautenhahn et al. [2008], Melamud et al. [2010]). Each pipeline involves three steps: 1) “peak selection”, in which features are identified and quantified, 2) retention time alignment, whereby intensity profiles of consecutive samples are aligned to allow maximal feature overlap, and 3) adduct and isotope annotation.
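
To make the notion of a “feature” concrete, the sketch below represents a feature as an (m/z, retention time, intensity) record and checks whether two peaks from different samples plausibly correspond to the same feature. It is a deliberately simplified stand-in for the alignment step, not the algorithm implemented in XCMS, MZmine 2 or MAVEN; the class name, the function name and both tolerance values are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    mz: float         # mass-to-charge ratio
    rt: float         # retention time, in seconds
    intensity: float  # integrated peak intensity


def same_feature(a: Feature, b: Feature,
                 ppm_tol: float = 10.0, rt_tol: float = 15.0) -> bool:
    """True if two peaks agree in m/z (within ppm_tol parts per million)
    and in retention time (within rt_tol seconds). Tolerances are hypothetical."""
    ppm_diff = abs(a.mz - b.mz) / a.mz * 1e6
    return ppm_diff <= ppm_tol and abs(a.rt - b.rt) <= rt_tol


# Made-up peaks from two samples.
peak_sample1 = Feature(mz=180.0634, rt=312.0, intensity=5.2e5)
peak_sample2 = Feature(mz=180.0640, rt=318.5, intensity=4.7e5)
peak_other = Feature(mz=181.0710, rt=312.0, intensity=1.1e5)

print(same_feature(peak_sample1, peak_sample2))  # True
print(same_feature(peak_sample1, peak_other))    # False
```

Expressing the m/z tolerance in parts per million rather than as an absolute difference keeps the criterion comparable across the mass range, which is why it is used here.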
The most prominent difference between existing packages is their approach to assessing peak quality during the peak selection step. Both XCMS and MZmine 2 define low quality peaks according to a user-defined signal-to-noise ratio cutoff; in contrast, MAVEN uses a machine-learning (neural network) approach. Because an independent comparison of MAVEN, MZmine 2 and XCMS has not yet been completed, one recommendation is to analyze metabolomics data using several packages and remove peaks that are not robustly identified by multiple algorithms (Tautenhahn et al. [2008]). This is one of several methods that aim to minimize false positives, as it has been shown that up to 90% of features in a typical LC-MS experiment are non-biological noise or degenerate (Mahieu and Patti [2017]). Other methods include curating databases of confirmed features identified using different separation techniques, and removing features not profiled in the corresponding database (Mahieu and Patti [2017]). An additional approach is to confirm the presence of the feature in a technical replicate. In practice, it is difficult to identify the same metabolites across replicates, as retention times may differ; this check is therefore most often done in targeted metabolomics, in which only a small subset of features is quantified (Crews et al. [2009]). Another method for identifying robust features involves removing features that are not detected in a set of quality control (QC) samples consisting of either a set number of defined metabolites, or a combination of all tested samples (pooled sample) (Brodsky et al. [2010], Godzien et al. [2015]).

1.4.3 Normalizing LC-MS data

To be biologically informative, raw intensities need to be corrected for a) batch effects, b) missing values, and c) inter-sample variation.
This section describesstandard approaches used for such normalizations.As a first step, raw intensities of each feature produced from data processingpackages typically need to be corrected for systematic variation due to batch ef-fects. In metabolomics data, a common type of batch effect is “chemical drift”.8Figure 1.3: Automated and manual steps in untargeted metabolomicspre-processing pipeline. The algorithms listed are only examples oftools that could be used in each step.This drift–caused by changes in signal that occur as metabolites interact with eachother while waiting to be analyzed–can be corrected if QC samples are analyzedin between experimental samples (Vaikenborg et al. [2009], Shen et al. [2016]).While these corrections are not always performed, they have been shown to mini-mize inter-batch variation (Godzien et al. [2015]). Inter-batch variation and chemi-cal drift can be visualized using dimensionality reduction approaches such as PCAand t-SNE (De Livera et al. [2015]).Missing values can result from a variety of processes, and thus require a nu-anced approach. Specifically, a missing value, which is an intensity of zero orinfinity, can be created from a metabolite existing in one sample but 1) not existingin another, 2) existing at a concentration below an instrument’s power of detectionor 3) existing at a concentration above an instrument’s power of detection. Theproblem of missing values can best be improved by increasing the sensitivity ofdetection of the MS platform. Numerous strategies have been developed to reducemissing values through a group of analytic techniques called missing value impu-tation (MVI). The utility of these techniques has empirically been found to dependon whether univariate or multivariate techniques are used to detect differentiallyabundant features (Karpievitch et al. 
[2012]).

Subsequently, both sample-wise and feature-wise normalization methods that concurrently consider multiple samples are typically applied to adjust for technical and biological variation. Sample-wise normalization methods include quantile, linear baseline, total ion count (TIC) and LOESS normalization (Wu and Li [2015]). These methods adjust for technical factors that may have affected the entire sample. Feature-wise normalization methods involve constructing scaling factors for each feature, and include centering, scaling and transformations (Bolstad et al. [2003], van den Berg et al. [2006]). These approaches minimize the intensity differences between metabolites with low or high abundance, allowing relative perturbations of each metabolite to be compared. Usually, both sample-wise and feature-wise normalization methods are applied during pre-processing. However, because the type of normalization required is dependent on the separation technique and mass spectrometer used, no gold standard approach exists.

1.4.4 Testing for significant features in IEM studies

In a typical experimental design relevant to IEMs, one measures metabolomics data for a set of patients only or for a set of patients and some controls (i.e., case/control design). Because each case is likely unique (i.e., may represent a unique disease caused by a rare genetic variant), data is usually analyzed for one patient at a time and compared against a) controls or b) other patients. Both parametric (e.g., t-test, ANOVA) and non-parametric (e.g., Mann-Whitney U-test, Wilcoxon signed-rank and Kruskal-Wallis) tests can be used to identify differentially abundant features in a given patient sample. When pursuing parametric tests, which typically have more statistical power than non-parametric tests, care must be taken to transform data so that it is distributed according to the expectation of the test (e.g., Gaussian for t-test).
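As a rough illustration of transforming intensities before a parametric comparison, the sketch below log2-transforms raw intensities and standardizes a single patient against controls, feature by feature. The z-score used here is an assumption chosen for simplicity rather than one of the tests named above, and the function name, feature names and intensities are all hypothetical.

```python
import math
import statistics


def patient_z_scores(patient, controls):
    """Compare one patient to controls on log2-transformed intensities.

    patient:  {feature: raw intensity}
    controls: {feature: [raw intensities of the control samples]}
    Returns {feature: z-score of the patient relative to the controls}.
    """
    scores = {}
    for feature, value in patient.items():
        # Log transformation makes right-skewed intensities closer to Gaussian.
        control_logs = [math.log2(v) for v in controls[feature]]
        mu = statistics.mean(control_logs)
        sd = statistics.stdev(control_logs)  # sample standard deviation
        scores[feature] = (math.log2(value) - mu) / sd
    return scores


# Hypothetical intensities, for illustration only.
controls = {
    "feature_A": [1000.0, 2000.0, 4000.0],
    "feature_B": [500.0, 1000.0, 2000.0],
}
patient = {"feature_A": 2000.0, "feature_B": 16000.0}

z = patient_z_scores(patient, controls)
for feature, score in z.items():
    print(feature, round(score, 2))
```

With so few controls the standard deviation is poorly estimated, so a sketch like this is better read as a ranking device than as a formal test.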
Correction for multiple testing is usually not performed due to lack of statistical power. When studying the genetic causes of rare diseases, in contrast to studies of common disease, one seeks to identify “outlier” features, as they represent abnormal metabolites that may be pathogenic. Availability of biological and technical replicates is important in confirming that a given metabolite value is a “biological” outlier, rather than an artifact of technical variation.

In metabolomic studies, selection of “control” samples (or comparators) that are as similar as possible to the patient being studied is paramount to reducing noise. This is difficult due to the numerous factors that influence the metabolome (e.g., age, sex, ethnicity, food consumption and time of day). Selection of controls often depends on patient availability and the type of bio-fluid analyzed; finding suitable controls is much easier for analysis of urine samples, and much more difficult for plasma and CSF samples, reflecting the relative ease with which each sample type can be provided. Estimates of the genetic component of metabolite variance vary between approximately 10 and 60%, with the largest determining factor being the type of metabolite (Shah and Newgard [2015], Rhee et al. [2013]). Because of this moderate heritability, the trio structure has been suggested as a possible replacement for the classic case-control design, as it may enable the removal of metabolomic features attributable to non-disease-related heritable phenotypes. For example, parents would likely show an abnormal profile for metabolites related to a heterozygous variant, which would be magnified in the bi-allelic patient (Long et al. [2017]). However, due to inherent uncertainty in quantification, the significant impact of age, gender and diet, and the varying heritability of each metabolite, the human metabolome needs to be explored further before the trio structure can be robustly used in this manner.
Overall, as in any other -omics study of dynamic molecular traits, experimental designs that enable robust statistical adjustment for the effects of demographic and environmental factors are of key importance in identifying meaningful disease-associated metabolites. At the least, care should be taken to use metabolomic controls that share as many characteristics as possible with the population being studied.

1.4.5 Annotating features: adducts, isotopes, and metabolites

Once features have been identified, they can be annotated as an adduct or isotope of a particular metabolite. An adduct is an ionized metabolite that has become associated with another ion through electrospray ionization (ESI), most commonly H+, Na+, K+ and H2O. An isotope is a metabolite that is composed of elements that are not in their most abundant form. A metabolite’s most abundant isotopic form generally corresponds to its most abundant features. The most readily quantifiable adduct depends on the chromatographic separation performed (Keller et al. [2008]). Annotation of isotopes and adducts corresponding to a particular metabolite reduces the multiple testing burden by enabling the removal of features belonging to the same metabolite. Removal of redundant features is performed at the discretion of the researcher, as no standard filtering approach exists. In the Mzmine2 and MAVEN packages, adduct and isotope annotation is performed automatically, whereas processing with XCMS requires use of an external package such as CAMERA to make these annotations (Kuhl et al. [2012]).

The putative metabolite mass annotated in the peak-annotation step described earlier is used to map a specific feature to known metabolite(s). Databases that include mass, adduct, spectra and structure data are then used to match metabolite masses to known metabolites that fall within the specific mass accuracy of the mass spectrometer used.
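The mass-matching step described above can be sketched as follows. This is a minimal illustration, not the procedure of any particular package: the adduct list is restricted to three positive-mode adducts with approximate mass shifts, the 10 ppm tolerance is an assumed instrument accuracy, and the small metabolite table is a hypothetical stand-in for a real database.

```python
# Approximate mass shifts (Da) added to a neutral monoisotopic mass
# for three common positive-mode adducts (illustrative subset).
ADDUCT_SHIFTS = {
    "[M+H]+": 1.007276,
    "[M+Na]+": 22.989218,
    "[M+K]+": 38.963158,
}


def annotate(mz, metabolites, ppm=10.0):
    """Return (metabolite, adduct, error in ppm) for every database match.

    mz:          observed m/z of a singly charged feature
    metabolites: {name: neutral monoisotopic mass in Da}
    ppm:         assumed mass accuracy of the instrument
    """
    hits = []
    for adduct, shift in ADDUCT_SHIFTS.items():
        neutral = mz - shift  # putative neutral mass under this adduct
        for name, mass in metabolites.items():
            error_ppm = (neutral - mass) / mass * 1e6
            if abs(error_ppm) <= ppm:
                hits.append((name, adduct, round(error_ppm, 2)))
    return hits


# Hypothetical mini-database; glucose and fructose share the same formula.
metabolites = {
    "glucose": 180.06339,
    "fructose": 180.06339,
    "citric acid": 192.02700,
}

hits = annotate(203.05261, metabolites)
for hit in hits:
    print(hit)
```

Both hexoses match the same feature as sodium adducts, which shows why a mass match alone yields a putative rather than a confirmed identity.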
Several such databases exist, such as the Human Metabolome Database (HMDB), Recon2, BioCyc and METLIN (Petri and Schmidt-Dannert [2004], Smith et al. [2005], Wishart et al. [2007], Thiele et al. [2013]). The HMDB in particular contains information on endogenous, food-based and drug-related metabolites found in human urine, CSF and plasma. The human specificity of this database makes it particularly useful for mapping features identified through untargeted metabolomic methods, as the entire database can be utilized without a priori knowledge of each metabolite’s origin. At the time of writing, it contains over 114,100 metabolites annotated with structure and chemical properties, a portion of which are also associated with specific genes (n = 5701). Of these metabolites, 19.5% have been detected in a bio-fluid, and 81.5% are predicted or expected. Its sister database, the Small Molecule Pathway Database (SMPDB), annotates a portion of genes and metabolites to specific small molecule pathways. Together, the HMDB and SMPDB facilitate biological interpretation at the gene and pathway level. Limitations of these databases include the relatively small number of detected and quantified metabolites, as well as the relative paucity of genes annotated to both HMDB and SMPDB.

Identifying the “true” identity of a specific feature is challenging because each neutral mass can be annotated to multiple metabolites, so-called isobaric compounds (i.e., their chemical properties result in them having the same mass and retention time upon ionization). Narrowing down the identity of a given feature is currently an active area of research (Li et al. [2013], Pirhaji et al. [2016]). Public databases that contain metabolite masses and MS/MS spectra can assist in confirming metabolite identities, in cases where mass spectra are available (Wishart et al. [2007]).
Additionally, “internal standards”, or radiolabeled compounds that can be easily identified through isotopic analysis, can be used for targeted metabolomics as well as untargeted lipidomics (detection of all lipids in the metabolome), as they allow researchers to benchmark when certain ions elute over time (i.e., their retention time), knowledge which can then be used to guide the interpretation of unknown features (Sysi-Aho et al. [2007], Ejigu et al. [2013], Weindl et al. [2015]). Validation of the mapping between a feature and its assigned metabolite can be achieved by analyzing a purchased chemical standard through identical processing techniques, and comparing its m/z ratio, ion-source fragments and retention time to those of an experimentally derived feature.

1.4.6 Identifying IEMs through untargeted metabolomic analysis

The creation of processing tools and metabolomic databases has greatly facilitated the use of untargeted metabolomics in diagnosing IEMs. As opposed to the narrow m/z range of targeted methods, untargeted methods aim to characterize a metabolome without any pre-conceived limitations on the m/z range under study. Both univariate and multivariate tests have been used to identify biomarkers of IEMs through untargeted metabolomics (Wikoff et al. [2007], Dercksen et al. [2013], Venter et al. [2014], Najdekr et al. [2015], L. et al. [2016], Kennedy et al. [2017], Pappan et al. [2017]). Recently, untargeted metabolomics was able to identify 20 of 21 IEMs, demonstrating its utility as a replacement for traditional newborn dried blood spot screenings (Miller et al. [2015]). Another more recent study found that untargeted metabolomics enabled the diagnosis of 42 of 46 IEMs (Coene et al. [2018]).
Challenges with this method include separating noise (unrelated food and environmental influences) from disease signal, identifying isobaric compounds and attaining adequate quantification of polar and non-polar metabolites. Because of these challenges, untargeted metabolomics alone is unlikely to usurp traditional genomics-based methods that identify causative genes for novel IEMs (Tarailo-Graovac et al. [2016]). The benefits and drawbacks of integrating genetic and metabolomic data for the purpose of identifying both known and unknown IEMs are addressed in the subsequent section.

1.5 Integrating genomic and metabolomic data

Integration of genomic and metabolomic data has been performed for two primary purposes: to identify 1) metabolically active loci and 2) genes relevant to a metabolic disease phenotype. Most existing methods for combining genomic and metabolomic data conceptually follow from the former purpose.

1.5.1 Identification of metabolically active loci

Identification of metabolically active loci involves identifying variants that affect a metabolite’s abundance. Population-based studies have combined genotyping microarray data (Gieger et al. [2008], Hicks et al. [2009], Illig et al. [2010], Suhre et al. [2011], Demirkan et al. [2012], Tukiainen et al. [2012], Kettunen et al. [2012], Shin et al. [2014], Draisma et al. [2015], Rhee et al. [2016], Long et al. [2017]) or WES/WGS data (Guo et al. [2015], Yazdani et al. [2016], Yu et al. [2016]) with metabolomic data through quantitative trait loci (QTL) analysis, in order to identify metabolically active genetic loci and to characterize the impact of common genetic variation on metabolite abundance (also known as heritability). In so-called metabolite QTL (mQTL) analysis, linear regression is used to associate genetic variants with individual metabolite intensities (or metabolite ratios), and significant variants are deemed metabolically active loci.
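The regression at the core of an mQTL test can be sketched for a single variant and metabolite as below. This is an illustrative least-squares fit only: it assumes an additive genotype coding (0, 1 or 2 copies of the minor allele), omits the covariates and multiple-testing control that real studies use, and runs on made-up data. The function name is hypothetical.

```python
import math


def mqtl_fit(dosages, intensities):
    """Least-squares fit of metabolite intensity on allele dosage.

    Returns (slope, intercept, t_statistic) for the model
    intensity = intercept + slope * dosage.
    """
    n = len(dosages)
    mean_x = sum(dosages) / n
    mean_y = sum(intensities) / n
    sxx = sum((x - mean_x) ** 2 for x in dosages)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(dosages, intensities))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual variance with n - 2 degrees of freedom.
    sse = sum((y - (intercept + slope * x)) ** 2
              for x, y in zip(dosages, intensities))
    std_err = math.sqrt(sse / (n - 2) / sxx)
    return slope, intercept, slope / std_err


# Made-up log-scale intensities for six individuals.
dosages = [0, 0, 1, 1, 2, 2]
intensities = [10.0, 10.4, 11.1, 10.9, 12.0, 12.2]

slope, intercept, t_stat = mqtl_fit(dosages, intensities)
print(round(slope, 3), round(intercept, 3), round(t_stat, 2))
```

The t statistic would be compared against a t distribution with n - 2 degrees of freedom to obtain a p-value; that step is left out here to keep the sketch dependency-free.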
To reduce spurious associations, most studies analyze between 2000 and 8000 subjects, and restrict their analysis to common SNVs (e.g., MAF ≥ 5%) or to pairs of variants and metabolic loci that are located near each other (i.e., “cis” association analysis) (Tukiainen et al. [2012]). mQTL analyses have identified disease biomarkers by associating variants in known disease-causing genes with metabolites (Yazdani et al. [2016], Gauba et al. [2012]). In these previous studies, loci strongly associated with metabolite intensities or ratios, termed “metabotypes”, have predominantly been found to map close to or in genes associated with enzymes, transporters and regulators of metabolism, facilitating biological interpretation (Shin et al. [2014]).

Consistent with transcriptomic-based QTL studies (e.g., eQTL studies), it has been reported that, on average, genetic variation is a stronger predictor of metabolite variance across individuals than demographic and symptom-based clinical covariates (Rhee et al. [2013], Shin et al. [2014]). Heritability estimates have varied across classes of metabolites; Shin et al. found the heritability of amino acids (e.g., carnosine, h² = 0.86, P = 6.8 × 10⁻⁴) to be higher than that of lipids (e.g., lysophosphatidylcholine, h² = 0.46, P = 2.0 × 10⁻⁷), and that of essential amino acids (mean h² = 0.29) to be lower than that of non-essential amino acids (mean h² = 0.53), suggesting that some metabolites are more influenced by genomic variation than others, as one might expect (Shin et al. [2014]).

Rare SNVs (0.5% ≤ Minor Allele Frequency (MAF) ≤ 5%) have been found to have a larger effect size than common SNVs (MAF > 5%) (Long et al. [2017]). However, because association analysis (e.g., mQTL analysis) is statistically challenging and underpowered in the rare variant setting, the effects of rare variants have primarily been studied on a case-by-case basis. Long et al.
identified the effect of seventeen rare coding variants (SnpEff annotations such as “stop”, “missense” or “frame”) by first manually identifying an outlier metabolite that, based on known biochemical reactions, could be affected by the variant, and then confirming the presence of this putative rare variant and outlier metabolite combination in at least one other sample (Long et al. [2017]). Guo et al. examined the effect of rare coding variants by assessing the overlap of genes in perturbed metabolic pathways (i.e., biochemical pathways with at least one outlier metabolite) with rare exonic variants (Guo et al. [2015]). These studies indeed show that rare variants can have a large effect on metabolic variation; however, the small number of rare variant-metabolite relationships identified so far suggests that clarifying their role in a systematic manner will likely require a more nuanced approach.

1.5.2 Gathering evidence for metabolic perturbation on a gene-by-gene basis

Studies aiming to explore rare disease using smaller sample sizes have used metabolomics in conjunction with curated biochemical knowledge to derive disease-specific biological insights. In a “pathway based approach”, genes in enriched metabolic pathways were found to harbor variants that explained the patient’s biochemical phenotype (Guo et al. [2015]). Several studies also reported how untargeted metabolomics could be used to quantify gene-associated metabolites to provide evidence that a variant is disease-causing (Gauba et al. [2012], L. et al. [2016], Pappan et al. [2017]). An example of this approach is a study that used untargeted metabolomics to demonstrate that a bi-allelic variant in N-acetylneuraminic acid synthase (NANS) in patients with infantile-onset severe developmental delay and skeletal dysplasia was reflected in high levels of N-acetylneuraminic acid (Karnebeek et al. [2016]).
Confirmation of high levels of this enzymatic substrate of NANS suggested that the clinical phenotype was likely caused by an enzymatic deficiency in NANS. Normalization of skeletal dysplasia in a zebrafish model with knocked-out nansa and nansb (zebrafish orthologs of human NANS) occurred after supplementation with sialic acid, shedding light on a possible treatment. These findings support the idea that an integrated approach involving both genomics (i.e., microarrays/WES/WGS) and metabolomics can be used to facilitate variant filtering and improve diagnosis and IEM discovery. As of yet, genome-wide integration has mainly been performed for exploratory purposes.

Integrating genomic and metabolomic data in a systematic manner is challenging because there is not a one-to-one mapping of genes to metabolites. Rather, some genes, such as enzymes involved in beta oxidation, will have many annotated metabolites, and other enzymatic genes will uniquely associate with only a handful of relevant metabolites. This dynamic makes it difficult to directly integrate genomic and metabolomic data at the variant and feature level, respectively.

1.6 Treatable Intellectual Disability Endeavour (TIDE) Exome Sequencing project

The TIDE study was conducted at BC Children’s Hospital from 2015 until 2017 (UBC IRB approval H12-00067). The project aimed to identify the genetic cause of intellectual disability in patients with a neurometabolic phenotype. Patient inclusion criteria consisted of 1) a confirmed or potential neurodevelopmental disorder and 2) a metabolic phenotype, which could be a pattern of abnormal metabolites in urine, blood and CSF, abnormal results on biochemical functional studies or abnormalities in clinical history. Details of the WES bioinformatic pipeline and variant interpretation protocol have previously been published in Tarailo-Graovac et al. [2016] as well as in separate case reports (Collins et al. [2010], Horvath et al. [2016], Karnebeek et al.
[2016]).

1.7 Strategies for disease gene prioritization

The ultimate goal of the TIDEX project was to identify the causative gene for each patient using a combination of WES and clinical expertise. Here, we do this through integrated use of personalized genomic and metabolomic data, with two assumptions:

1. Genomic data is static and metabolomic data is dynamic
   • A genetic mutation only perturbs the function of the gene in which it resides
   • In contrast, metabolites that differ between a single patient and all controls can reflect perturbations of the causative gene in addition to all genes with which it interacts
2. Genomic and metabolomic data are incomplete: each imperfectly captures one axis of an individual’s biological processes

Given these assumptions, utilizing networks that depict functional and regulatory interactions between genes could help prioritize the causative variant by identifying subnetworks of disease-relevant regulation that could include the causative gene and its interaction partners. Networks could also mitigate the incompleteness of both genetic and metabolomic data by implicating genes closely connected to genes supported by patient-specific evidence.

The overall goal of this approach is to leverage interaction networks to identify genes that are likely related to an individual’s disease given experimental evidence. Conceptually, this approach is similar to that taken by a class of algorithms known as “gene recommendation” algorithms, which identify new disease genes based on their relatedness to a set of “seed” genes already implicated in the disease. In our case, however, the overall goal is not to identify new disease genes, but to identify the relevance of each patient’s “seed” gene to their disease based on the physical evidence supporting that gene and its proximity to other “seed” genes in an interaction network.
Here, we will review the gene recommendation literature and discuss label propagation, a network-based computational algorithm that we use to perform gene recommendation in this thesis.

1.7.1 Gene Recommendation

Initial approaches performed gene recommendation based on similarity of genomic features. These features included those related to sequence (Marcotte et al. [1999], Adie et al. [2005], Aerts et al. [2006]), expression patterns (Marcotte et al. [1999], van Driel et al. [2003], Aerts et al. [2006], De Bie et al. [2007], Oti et al. [2008], Ala et al. [2008]), functional annotations (Freudenberg and Propping [2002], Perez-Iratxeta et al. [2002], Turner et al. [2003], Aerts et al. [2006], Li and Patra [2010a]), physical interactions (Aerts et al. [2006], Oti and Brunner [2007], Fraser and Plotkin [2007]) and literature descriptions (van Driel et al. [2003], Aerts et al. [2006], Li et al. [2006], Gaulton et al. [2007]). These initial approaches were moderately successful in prioritizing genes based on heterogeneous lines of evidence.

With the growth of ontologies defining gene function and streamlined access to large databases, integration of multiple types of evidence into functional linkage networks (FLNs) became more feasible. FLNs are the backbone of many network-based algorithms, as they model the interactions between genes through the incorporation of data from studies of coexpression, genetic interaction, protein interaction, coinheritance, colocalization, or shared domain composition. Many authors have noted the improvement in gene function prediction and prioritization when FLNs are used in this manner (Myers et al. [2005], Mostafavi et al. [2008], Tsuda et al. [2005], Deng et al. [2004], Mostafavi and Morris [2010], Lanckriet et al. [2004], Peña-Castillo et al. [2008], Pavlidis et al. [2002], Lancour et al. [2018]).
Proximal genes in an FLN have a higher likelihood of performing similar biological functions and of belonging to similar disease pathways, making FLNs ideal resources for gene recommendation/prioritization algorithms (Sharan et al. [2007], Oti and Brunner [2007]).

The first network-based algorithms to utilize FLNs were successful, encouraging more instances of their use (Weston et al. [2004], Xu and Li [2006]). Franke et al. integrated genomic features and coexpression data into a Bayesian network and used the distance between known disease genes and candidate genes for prioritization (Franke et al. [2006]). Information about disease-gene relationships was also used to prioritize candidate genes (Lage et al. [2007], Wu et al. [2008]). Lage et al. scored each candidate in a PPI network based on its direct neighbors’ associations with diseases similar to the disease of interest. A more recent approach, termed CIPHER, assigned each candidate gene a score based on the correlation between a query phenotype’s similarity with every other phenotype in a human phenotype network and a candidate gene’s topological similarity to genes annotated to that phenotype in a protein-protein interaction network (Wu et al. [2008]). These efforts as well as others (e.g., bioPixie, STRING and HumanNet) demonstrated the utility of integrating heterogeneous sources of evidence in a network structure to prioritize relevant genes from a list of candidates (Myers et al. [2005], Peña-Castillo et al. [2008], von Mering et al. [2007], Lee et al. [2011]).

Several methods built highly specific networks to answer particular research questions (Itan et al. [2013], Greene et al. [2015]). Greene et al. enabled tissue-specific disease gene prioritization through use of a classifier on tissue-specific PPI networks. In another approach, Itan et al.
built a condensed network of distances between a seed gene and candidate genes to enable disease gene prioritization.

Over time, computational disease gene prioritization algorithms became more efficient, allowing for deployment on web-based servers (Mostafavi et al. [2008], Linghu et al. [2009], Li and Patra [2010a,b], Itan et al. [2013]). GeneMania used ridge regression to integrate multiple heterogeneous networks and performed label propagation (discussed below) on the resulting network to identify genes of similar function to a query functional annotation. Due to its efficiency, label propagation (LP), as well as the related guilt by association (GBA) and random walk with restart (RWR), have been commonly used for function prediction and disease gene prioritization (Li and Patra [2010b,a], Lee and Lee [2018], Lancour et al. [2018], Qian et al. [2014]).

In general, gene function prediction and disease-gene prioritization involve three components:

• Functional linkage network (FLN) that depicts the relationship between genes using various types of evidence (Section 1.8)
• Genes related to the process/disease of interest (i.e., “seed” genes)
• Network-based algorithm

“Seed” genes generally have 1) similar function to genes we wish to identify or 2) relevance to a disease of interest. They can be given a binary or continuous initial label score reflecting their relevance to the function/disease being studied.

1.7.2 Label propagation

Intuitively, LP can be thought of as an iterative procedure in which seed genes (nodes) propagate their initial label scores to their first degree neighbors, and then to their second and third, etc. Upon convergence of LP, nodes that are closer to seed genes have a higher score than nodes farther away. Nodes with a higher final score can be thought of as more relevant to a function/disease associated with seed genes. In this section, a general theoretical framework for Gaussian field label propagation will be provided.
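The iterative intuition above can be sketched on a toy network. The code below is an illustrative implementation, not the one used in this thesis: it assumes edge weights are normalized by the square root of the product of the two node degrees, that each update mixes neighbor scores with the initial labels through a parameter `lam` between 0 and 1, and that a fixed number of iterations is enough for convergence. The four-node path graph and the function name are hypothetical.

```python
import math


def label_propagation(W, y, lam=0.5, n_iter=100):
    """Propagate initial label scores y over a symmetric weighted network W.

    W:   n x n list of lists with non-negative, symmetric edge weights
    y:   initial label scores in [0, 1], one per node
    lam: weight given to neighbor scores relative to the initial labels
    """
    n = len(W)
    degree = [sum(row) for row in W]
    # Symmetric normalization: S_ij = W_ij / sqrt(d_i * d_j).
    S = [[W[i][j] / math.sqrt(degree[i] * degree[j]) if W[i][j] else 0.0
          for j in range(n)] for i in range(n)]
    f = list(y)
    for _ in range(n_iter):
        # Each node takes a weighted sum of its neighbors' current scores,
        # then is pulled back toward its own initial label.
        f = [lam * sum(S[i][j] * f[j] for j in range(n)) + (1 - lam) * y[i]
             for i in range(n)]
    return f


# Toy path graph 0 - 1 - 2 - 3 with node 0 as the only seed gene.
W = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
y = [1.0, 0.0, 0.0, 0.0]

scores = label_propagation(W, y)
print([round(s, 4) for s in scores])
```

On this path graph the scores fall off with distance from the seed, matching the behavior described above. On less regular networks, degree normalization can reorder nodes that sit at the same distance from a seed.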
Previous implementations of label propagation and its related family of algorithms for the purpose of gene recommendation will then be reviewed.

1.7.3 Theoretical framework of label propagation

We assume that we are given a network, represented as a symmetric matrix, $W$. If there are $n$ nodes (genes) in a network, then $W$ is an $n \times n$ matrix. The weighted edge between nodes $i$ and $j$ is given the value $w_{ij}$, with $w_{ij} = w_{ji} \geq 0$. In an FLN, $w_{ij}$ can be binary (0 or 1) or continuous (weighted by the $-\log(p\text{-value})$ of the strength of evidence associating nodes (genes) $i$ and $j$). The node labels can be represented as $y \in [0,1]^n$, with larger scores representing increased importance in the network.

The basic assumption of LP is that the score of node $i$ at iteration $r$, $f_i^{(r)}$, can be written as a weighted combination of its neighbors’ scores so that

$$f_i^{(r)} = \lambda \sum_{j=1}^{n} w_{ij} f_j^{(r-1)} + (1-\lambda) y_i$$

where $\lambda$ is a constant such that $0 < \lambda < 1$. In this formulation, nodes with high weighted degree with positive neighbors will have a relatively high impact on their neighbors’ scores. To reduce the influence of these nodes, it is standard practice to normalize the matrix $W$. We choose to symmetrically normalize the matrix so that

$$S = D^{-1/2} W D^{-1/2}$$

or equivalently

$$f_i^{(r)} = \lambda \sum_{j=1}^{n} \frac{w_{ij}}{\sqrt{d_i d_j}} f_j^{(r-1)} + (1-\lambda) y_i$$

where $D$ is a diagonal matrix with diagonal elements $d_i$ equal to the degree of node $i$.

Using the symmetrically normalized matrix, $S$, the final score vector can be written as

$$f = (1-\lambda)(I - \lambda S)^{-1} y$$

Given a symmetrically normalized network, the final score is dependent on only the $\lambda$ parameter, which controls the relative influence of neighboring nodes.

1.7.4 Using label propagation algorithms for candidate gene prioritization

Rather than directly using the final scores of each node after convergence of a propagation algorithm to prioritize genes, several groups focused on combining the output of propagation algorithms with GWAS results to prioritize/recommend genes. Lee et al.
integrated each node’s score after GBA with its GWAS log odds to form a posterior score, which they then used to prioritize candidate genes (Lee et al. [2011]). Qian et al. applied LP to the same data sets to determine whether formally adjusting for node degree would uncover more disease-relevant genes, but found little overlap with Lee et al. (Qian et al. [2014]). Each group performed their prioritization on different FLNs, suggesting that both choice of FLN and selection of local vs global network propagation methods can have a large impact on gene prioritization. In a related approach, Lancour et al. combined LP final scores with gene-level p-values from an Alzheimer’s GWAS to identify relevant Alzheimer’s genes (Lancour et al. [2018]). Ranking each gene by this score was found to increase the replication rate.

1.8 Functional Linkage Networks and the STRING network

The goal of integrating genomic and metabolomic data is to prioritize a list of candidate variants (and by association, genes) in order of their relevance to a patient’s disease. As discussed above, using network-based methods on FLNs has been particularly successful at predicting disease genes, as FLNs provide a framework through which prior knowledge about the relationship between genes and their association with disease can be introduced. In this section, we will introduce the FLN used in this thesis.

1.8.1 Selection of the STRING network

A recent performance comparison of 21 FLNs was conducted by testing their ability to recover half of a gene set when given the other half as seed genes (Huang et al. [2018]). Gene sets were constructed from Gene Ontology databases in addition to disease-specific GWAS and high-throughput gene expression assays. After performing random walk with restart, a form of label propagation, STRING was found to have the highest percentage recovery in both curated and gene-expression based networks.
STRING’s network also performed the best when evidence from MEDLINE abstract co-citation was removed, an indication that STRING exhibits the least amount of literature bias.

The STRING network is constructed from a combination of existing databases, including MINT (Licata et al. [2012]), HPRD (Keshava Prasad et al. [2009]), BIND (Alfarano et al. [2005]), DIP (Salwinski [2004]), BioGRID (Chatr-Aryamontri et al. [2017]), KEGG (Kanehisa et al. [2017]), Reactome (Fabregat et al. [2018]), IntAct (Kerrien et al. [2007]), EcoCyc (Keseler et al. [2013]), NCI-Nature Pathway Interaction Database and Gene Ontology (GO) protein complexes (Jensen et al. [2009]). Evidence of shared functions between genes is augmented with three additional sources of computational data:

• Evidence from evolution
  – Physical proximity of two genes (i.e., intergenic distance) in prokaryotic genomes
  – Presence of gene fusions
  – Phylogenetic similarity between corresponding gene families. Genes that originated from a recent ancestor would be likely to share a function
• Evidence from high-throughput datasets
  – Similarity of transcriptional regulation profiles (i.e., co-expression)
• Evidence from literature
  – Co-occurrence of both gene names in abstracts from SGD (Cherry et al. [1998]), OMIM (Amberger and Hamosh [2017]), The Interactive Fly (Yan et al. [2010]), and all abstracts from PubMed

In addition to these sources of evidence, knowledge of protein-protein interactions is also transferred between organisms based on their orthological hierarchy. Phylogenetically more similar organisms undergo stronger transfer. This transfer of information disproportionately benefits poorly characterized organisms.

1.9 Aims of this thesis

Candidate rare variants potentially causative for neurometabolic disease are commonly identified through the use of WES bioinformatics pipelines. Currently, the only method for identifying the causative gene from this list of candidate genes is through manual curation by clinical experts.
The primary aim of this thesis is to develop a computational method for prioritizing the causative gene in a single IEM patient using evidence from personalized LC-MS and WES data. The sub-aims of this thesis are to:

• Characterize LC-MS and WES data for IEM patients with neurometabolic disorders.
• Investigate whether LC-MS data can be used to prioritize the causative gene in a list of candidate genes. Detecting enrichment for causative gene-associated metabolites in a patient’s set of differentially abundant metabolites may provide evidence for the metabolic impact of certain mutations.
• Determine whether performing label propagation on a functional linkage network initialized with a combined genetic and metabolomic score can prioritize a patient’s causative gene. As discussed in Section 1.7, assuming that each patient’s causative gene impacts metabolism in some way, perturbation of metabolic function is more likely to occur in genes located proximally to the causal gene. Therefore, adding metabolomic evidence to an FLN could add signal to the causative gene’s local neighborhood and help prioritize the causative gene.

Chapter 2
Methods

The goal of this integrative analysis was to use both WES and LC-MS data to prioritize the gene causative for each patient’s IEM. Analyzing data from patients with known causative genes enabled us to assess whether and to what extent metabolomics could assist the prioritization of the causative gene from a list of 10-200 candidate variants. An outline of our method is provided in Figure 2.1.

2.1 Datasets

The 13 IEM patients in this study were genetically diagnosed through the TIDEX gene discovery project (UBC IRB approval H12-00067). Clinical and genomic variant information for each patient is available in the Appendix. The causative variants in four of the IEM patients were previously reported (Tarailo-Graovac et al. [2016]). Three of these patients were also profiled in separate case reports (Collins et al. [2010], Horvath et al.
[2016], Karnebeek et al. [2016]). The IEM patients analyzed in this study met the patient inclusion criteria outlined in Section 1.6 and were found to have a validated or putatively causative variant through the WES variant prioritization pipeline used in this thesis and clinical expertise. An additional inclusion criterion was that the causative gene was included in the STRING database, as our method cannot prioritize a gene that is absent from the STRING network. The WES and LC-MS methods used are detailed in the following sections.

Figure 2.1: Summary of overall method. A) Briefly, raw metabolomics data was processed using XCMS and CAMERA and subsequently normalized through linear baseline normalization. Differentially abundant metabolites were isolated based on z-score. Raw genomic sequencing reads for each patient were processed and SnpEff was used to identify a list of candidate genomic variants with MAF ≤ 0.01. B) Enrichment for each of the 5371 HMDB genes in the patient-specific set of DAMs was assessed using Fisher's Exact Test, generating a metabolomic score. To generate the WES (i.e. genomic) score, each gene in the STRING network was assigned "1" if also in the patient-specific candidate gene list, and "0" if not. C) These scores were plotted on an x-y grid, and the inverse of the distance between (1,1) and the coordinate of each gene, i, was considered the combined genomic and metabolomic score. Label propagation was then performed with the combined initial label score.

2.2 WES variant filtering pipeline

The WES data used in this study was from patients and, in some cases, their family members. WES data was generated using the Agilent SureSelect capture kit and the Illumina HiSeq 2000 sequencer. The obtained WES data was analyzed using a modified version of the semi-automated bioinformatics pipeline used in Tarailo-Graovac et al. [2016].
Figure 1.1 provides a visual overview of the pipeline. Briefly, the pipeline included 1) aligning the sequencing reads to the hg19 human reference genome and 2) annotating variants based on their predicted impact on protein function, predicted pathogenicity (CADD score), match to clinician-provided phenotype descriptions, and minor allele frequency (MAF) in dbSNP, NHLBI Exome Sequencing Project Exome Variant Server (EVS), and Exome Aggregation Consortium (ExAC) (Kircher et al. [2014], Smigielski et al. [2000], Exome Aggregate Consortium [2016], Lek et al. [2016]). Variants with an MAF > 0.01 in dbSNP, EVS, or ExAC were removed, so that only rare variants (MAF ≤ 0.01) were retained. Using genetic information from each trio (mother, father, index), the variants were screened for agreement with multiple inheritance patterns in order to generate a stringent list of potentially pathogenic variants for each patient. Recessive mutations were defined as variants in which one allele was present in both parents and two alleles were present in the index. De novo variants were defined as variants only occurring in the index. These mutations were identified separately in the autosomal and sex chromosomes. If trio information was not available, all variants were classified as de novo. The genes to which each variant mapped were referred to as "candidate genes" (Figure 2.1, section A).

2.3 Processing and normalization of untargeted LC-MS metabolomics

In this section we describe the generation and processing of LC-MS data into a list of differentially abundant features.

2.3.1 LC-MS data generation

High-resolution untargeted metabolomics analysis of CSF and plasma was performed using UHPLC-QTOF mass spectrometry. Due to sample availability, plasma was analyzed for five of the IEM patients and 10 of the controls, and CSF was analyzed for eight of the IEM patients and 15 of the controls. Only samples profiled in the same bio-fluid were compared.
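Returning briefly to the variant filtering rules of Section 2.2, the MAF cutoff and the trio-based inheritance definitions can be illustrated with a minimal sketch. It is not the pipeline used in this thesis: the `Variant` record, its field names and the helper functions are hypothetical, the highest MAF across dbSNP, EVS and ExAC is assumed to be pre-computed, and the separate handling of sex chromosomes is ignored.

```python
from dataclasses import dataclass
from typing import Optional

MAF_THRESHOLD = 0.01  # rare-variant cutoff stated in Section 2.2


@dataclass
class Variant:
    """Hypothetical record for one variant observed in the index patient."""
    gene: str
    max_maf: float                        # assumed: highest MAF across dbSNP, EVS, ExAC
    index_alleles: int                    # alternate-allele count in the index (0, 1 or 2)
    mother_alleles: Optional[int] = None  # None when trio data is unavailable
    father_alleles: Optional[int] = None


def classify(variant: Variant) -> Optional[str]:
    """Return 'recessive', 'de novo', or None if the variant is filtered out."""
    if variant.max_maf > MAF_THRESHOLD or variant.index_alleles == 0:
        return None
    if variant.mother_alleles is None or variant.father_alleles is None:
        return "de novo"  # no trio information: treated as de novo, as in the text
    if (variant.index_alleles == 2 and variant.mother_alleles == 1
            and variant.father_alleles == 1):
        return "recessive"  # one allele in each parent, two in the index
    if variant.mother_alleles == 0 and variant.father_alleles == 0:
        return "de novo"  # occurs only in the index
    return None


def candidate_genes(variants):
    """Genes hit by at least one retained variant, sorted for readability."""
    return sorted({v.gene for v in variants if classify(v) is not None})
```

Under these assumptions, a variant that is heterozygous in only one parent fits neither definition and is dropped, which mirrors the stringent intent of the filtering step described above.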
CSF and plasma samples were deproteinized in methanol:ethanol solution (50:50; 100 microlitres of each sample plus 400 microlitres of methanol:ethanol solution). Samples were profiled in duplicate; however, only one of each duplicate pair was analyzed in this study. A 2-microlitre sample was applied to an Acquity HSS T3 reverse-phase column (100 × 2.1 mm; 100 angstroms, 1.8 micrometers), and an Agilent 6540 UHD accurate mass UHPLC-QTOF mass spectrometer with acquisition in positive and negative modes was used. The buffers in positive mode consisted of buffer A (0.1% formic acid in water) and buffer B (0.1% formic acid in water:methanol solution (1:99)); in negative mode, the buffers consisted of buffer A (10 mM acetic acid) and buffer B (10 mM acetic acid in water:methanol solution (1:99)) (Coene et al. [2018]).

2.3.2 LC-MS data normalization and filtering

Once MS data was generated, the centWave and obiwarp methods from the XCMS package were used for peak detection and retention time correction, respectively, for both positive and negative electrospray ionization (ESI) detection modes (Tautenhahn et al. [2008]). Data-driven parameters were optimized separately for plasma and CSF samples using the IPO package (Libiseller et al. [2015]). CAMERA was used to annotate adducts and isotopes (Kuhl et al. [2012]). Linear baseline normalization was applied to each feature (Bolstad et al. [2003]). In linear baseline normalization, a baseline intensity profile is created from the median intensity of all features across all samples (hereafter referred to as the "baseline"), and all runs are assumed to be scalar multiples of the baseline intensity profile. For each metabolite i in sample j:

$$y'_{ij} = \beta_j \, y_{ij}$$

where $y'_{ij}$ is the log normalized abundance of a particular feature and $y_{ij}$ is the log transformed unnormalized abundance.
$\beta_j$ is the per-sample scaling factor, defined as the mean intensity of the baseline over the mean intensity of sample j:

$$\beta_j = \frac{\bar{y}_{\text{baseline}}}{\bar{y}_j}$$

Two filtering criteria were applied before analysis: removal of 1) features not annotated to any known metabolites in the HMDB and 2) features annotated as non-base isotopes (Wishart et al. [2007]). Z-scores based on the mean and standard deviation of a given metabolite across the IEM patient and controls were computed. Features for which the IEM patient had an absolute z-score greater than 2 (2 SD away from the mean) were isolated and called "differentially abundant metabolites" (DAMs). All DAMs found through both positive and negative mode analyses were annotated with compound identities within 15 ppm of the compound mass using HMDB. Results from both positive and negative modes were combined for subsequent enrichment tests (Figure 2.1, section A).

2.4 Integrative analysis of WES and LC-MS data

Through the steps described above, an LC-MS metabolomics pipeline identified differentially abundant metabolites and a WES pipeline identified a set of candidate variants/genes (Figure 2.1, section A). The primary goal of subsequent analysis was to determine whether the output of these respective pipelines could be combined to prioritize the causative variant. To do this, as illustrated in sections B and C of Figure 2.1, a combined metabolomic and genomic per-gene score was generated and propagated across the STRING network using label propagation. The final score of each candidate gene was used to rank candidate variants in order of their biological relevance to the network. Genes with higher relevance to the disease were hypothesized to have a higher final score.
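As an illustration of the normalization and z-score steps of Section 2.3.2, the following is a minimal sketch using only the Python standard library. It is not the code used in this thesis: the function names and the dictionary-based data layout are assumptions, and the HMDB and isotope filters are omitted.

```python
from statistics import mean, median, stdev


def linear_baseline_normalize(samples):
    """Scale each sample so that its mean matches the mean of the baseline.

    samples: dict mapping sample id -> list of log abundances,
    one entry per feature, in the same feature order for every sample.
    """
    n_features = len(next(iter(samples.values())))
    # Baseline profile: per-feature median across all samples.
    baseline = [median(s[i] for s in samples.values()) for i in range(n_features)]
    baseline_mean = mean(baseline)
    normalized = {}
    for sample_id, values in samples.items():
        beta = baseline_mean / mean(values)  # per-sample scaling factor
        normalized[sample_id] = [beta * v for v in values]
    return normalized


def differentially_abundant(normalized, patient_id, threshold=2.0):
    """Indices of features whose patient z-score exceeds the threshold in absolute value.

    The mean and standard deviation are taken across the patient and all controls.
    """
    hits = []
    for i in range(len(normalized[patient_id])):
        column = [values[i] for values in normalized.values()]
        sd = stdev(column)
        if sd == 0:
            continue  # feature does not vary, so no z-score can be computed
        z = (normalized[patient_id][i] - mean(column)) / sd
        if abs(z) > threshold:
            hits.append(i)
    return hits
```

One consequence of including the patient in the mean and standard deviation is that the largest attainable absolute z-score with n samples is (n − 1)/√n, so at least six samples in total are required before any feature can exceed the threshold of 2.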
Our methods for devising a combined WES and metabolomic score and for performing label propagation are outlined below.

2.4.1 Creation of combined WES and metabolomic score

WES score

To generate the WES (i.e. genomic) score, each gene in the STRING network was assigned a "1" if also in the patient-specific candidate gene list, and a "0" if not.

Metabolomic score

An enrichment test was performed to determine whether metabolites known to be associated with candidate genes were overrepresented in the patient-specific set of differentially abundant metabolites (Figure 2.1, section B). Curated sets of metabolites associated with each putative gene were parsed from files available from the HMDB web portal (hmdb.ca, Jan 1st 2017) (Wishart et al. [2007]). Enrichment was calculated using Fisher's Exact Test. P-values were adjusted for multiple testing using the Benjamini-Hochberg procedure, and reported as False Discovery Rate (FDR). Duplicate HMDB compound IDs were not removed, and it should be noted that significantly different results were obtained when they were.

The metabolomic score was computed as follows:

$$f_{met} = -\log(p)$$

where p is the unadjusted enrichment p-value. The metabolomic score for each gene was scaled to fall between 0 and 1.

Creation of combined score

The WES and metabolomic scores were combined by defining each gene as a point on an (x,y) plane, where (x,y) = (WES score, metabolomic score), and calculating the inverse of the Euclidean distance between (1,1) and the coordinates of each gene.

The final combined score, f, for each gene, i, can be written as:

$$f_i = \frac{1}{\sqrt{(1-G)^2 + (1-M)^2}}$$

where G is the per-gene WES score (binary) and M is the scaled $f_{met}$ score defined above. The inverse was taken because points closer to (1,1) should have higher scores, as their significance is supported by both genomic and metabolomic evidence.

2.5 Label propagation

Label propagation was performed as stipulated by Zhou et al. (Zhou et al.
[2003]). The per-gene score, $f_i$, of each node at iteration $r$ was determined by

$$f_i^{(r)} = \lambda \sum_{j=1}^{n} w_{ij} \, f_j^{(r-1)} + (1-\lambda) \, y_i$$

where j is a connected node, λ is a parameter between 0 and 1 that controls the degree of propagation between a node and its neighbors, $w_{ij}$ is the symmetrically normalized edge weight between node i and node j, and $y_i$ is the label of node i.

Initial label values, y, were continuous between 0 and 1. Label propagation was run for 30 iterations, although LP algorithms have been demonstrated to converge in fewer than 20 (Weston et al. [2004]). λ was set at 0.99, as this is the parameter used in Zhou et al. [2003], and was not optimized for our data set due to limited sample size. However, the relative ranking of candidate genes was robust to small changes in this parameter. The final scores of each of the candidate genes were ranked to generate a prioritized candidate gene list.

2.6 Categorizing polarity of gene-associated metabolites

The utility of LC-MS metabolomic analysis in prioritizing the causative gene is dependent on whether metabolites associated with a particular gene are quantifiable by the LC-MS system. Reverse phase LC achieves the greatest chromatographic separation of semi-polar to non-polar metabolites; therefore, these metabolites will be more readily detectable by the MS quantification system. Given these limitations, LC-MS analysis cannot accurately provide evidence for genes with highly polar metabolites.

Having a qualitative understanding of the polarity of metabolites associated with the causative gene is important for predicting whether or not LC-MS analysis can support its candidacy as the causative gene. The metabolites associated with each causative gene were evaluated for polarity.
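The update rule of Section 2.5 can be written compactly in code. The following is a minimal sketch, not the implementation used in this thesis: the dictionary-based graph representation and the function names are assumptions, the edge weights are symmetrically normalized as $w_{ij} = a_{ij} / \sqrt{d_i d_j}$ (with $d_i$ the summed raw edge weight of node i), and the scores are assumed to be initialized to the labels.

```python
import math


def symmetric_normalize(adjacency):
    """Divide each raw edge weight by sqrt(degree_i * degree_j).

    adjacency: dict mapping node -> dict of neighbour -> raw weight (symmetric).
    """
    degree = {u: sum(nbrs.values()) for u, nbrs in adjacency.items()}
    return {
        u: {v: w / math.sqrt(degree[u] * degree[v]) for v, w in nbrs.items()}
        for u, nbrs in adjacency.items()
    }


def label_propagation(adjacency, labels, lam=0.99, iterations=30):
    """Iterate f_i <- lam * sum_j w_ij * f_j + (1 - lam) * y_i.

    labels: dict mapping node -> initial label; unlabeled nodes default to 0.
    """
    weights = symmetric_normalize(adjacency)
    scores = {u: labels.get(u, 0.0) for u in adjacency}  # assumed start: f(0) = y
    for _ in range(iterations):
        scores = {
            u: lam * sum(w * scores[v] for v, w in weights[u].items())
            + (1 - lam) * labels.get(u, 0.0)
            for u in adjacency
        }
    return scores
```

For example, on a three-node path a–b–c with only node a labeled, `label_propagation(graph, {"a": 1.0}, lam=0.5, iterations=50)` yields scores that decrease with distance from a, which is the behavior the method relies on to reward genes close to labeled nodes.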
Briefly, if the majority of annotated metabolites (≥ 50%) contained a phosphate group, or were ions, the metabolites were categorized as "very polar"; if the molecules contained saturated or unsaturated hydrocarbons ten or more carbons in length, the metabolites were categorized as "non-polar". Genes whose metabolites did not fit into either of these categories were categorized as "semi-polar".

2.7 Summary of method

Differentially abundant metabolites were identified through an untargeted LC-MS processing pipeline. A conservative list of pathogenic candidate variants was selected through a WES processing pipeline. A per-gene metabolomic and genomic score was devised for each gene in the STRING network. Enrichment for HMDB gene-associated metabolites in the patient-specific set of differentially abundant metabolites was determined through a Fisher's Exact Test, and the resulting p-value was used as the basis for the per-gene metabolomic score. The per-gene genomic score was binary, with "1" reflecting the presence of that gene in the patient's candidate gene list. The combined genomic and metabolomic score was generated by first plotting the metabolomic and genomic scores as points on an (x, y) grid, with (x, y) = (Genomic score, Metabolomic score), and then taking the inverse of the Euclidean distance between (1,1) and the gene's coordinates. Label propagation was performed on the STRING network, with the per-gene combined scores as prior labels. The resulting final scores of each node/gene were then used to rank each candidate gene.

Chapter 3
Results

3.1 Characterization of WES and LC-MS data

In this chapter, we will first characterize the TIDE WES and LC-MS data used in this study, then detail the results of two approaches that use LC-MS data to support the prioritization of WES-identified causative genes.
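For readers who prefer code to prose, the scoring steps summarized in Section 2.7 can be sketched as follows. This is an illustration under stated assumptions rather than the thesis implementation: the enrichment test is taken to be one-sided (over-representation), the background set of metabolites is supplied by the caller, the metabolomic score is scaled by dividing by its maximum, and the distance to (1,1) is floored at a small epsilon to avoid division by zero. None of these choices is specified in the text, and all function names are hypothetical.

```python
import math


def enrichment_p_value(gene_metabolites, dams, background):
    """One-sided Fisher's exact test, computed as a hypergeometric upper tail.

    All three arguments are sets of metabolite identifiers.
    """
    n_total = len(background)
    n_gene = len(gene_metabolites & background)
    n_dam = len(dams & background)
    overlap = len(gene_metabolites & dams & background)
    tail = sum(
        math.comb(n_gene, i) * math.comb(n_total - n_gene, n_dam - i)
        for i in range(overlap, min(n_gene, n_dam) + 1)
    )
    return tail / math.comb(n_total, n_dam)


def metabolomic_scores(p_values):
    """Compute -log(p) per gene, then scale to [0, 1] (assumed: divide by the maximum)."""
    raw = {gene: -math.log(p) for gene, p in p_values.items()}
    top = max(raw.values())
    return {gene: (s / top if top > 0 else 0.0) for gene, s in raw.items()}


def combined_score(wes_score, met_score, eps=1e-6):
    """Inverse Euclidean distance between (wes_score, met_score) and (1, 1).

    eps is an assumed floor on the distance; it only matters for a gene at exactly (1, 1).
    """
    distance = math.hypot(1 - wes_score, 1 - met_score)
    return 1.0 / max(distance, eps)
```

With this definition, a gene with no evidence of either kind scores 1/√2 ≈ 0.71, a candidate gene with no metabolomic support scores 1, and a candidate gene with a scaled metabolomic score of 0.5 scores 2. Because the raw combined score is not bounded by 1, a rescaling step is presumably needed before it is used as a label between 0 and 1.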
The first method uses metabolic enrichment directly, and the second method propagates a combined genomic and metabolomic score through the STRING network using the label propagation algorithm. Finally, we will characterize the factors that influence the success of these methods.

3.1.1 WES variants

Candidate WES variants in each patient mapped to an average of 108 genes (Table 3.1). Of these variants, 36.8% were autosomal recessive, 49.6% were autosomal de novo, 1.7% were recessive on sex chromosomes, and 2.4% were de novo on sex chromosomes. Definitions of these modes of inheritance can be found in Section 2.2. Only 29.9% of candidate genes were profiled in the HMDB, highlighting the challenges posed by our limited understanding of the metabolic involvement of all genes.

Table 3.1: Summary of WES data. Percentage of each patient's variants that fall into one of four modes of inheritance and the average number of candidate genes to which these variants map.

| Variant category | Number |
|---|---|
| Number of candidate genes | 109.6 ± 86.8 |
| % recessive variants | 36.8 ± 23.4 |
| % recessive variants, Chr X or Y | 1.7 ± 3.2 |
| % de novo variants | 49.6 ± 21.0 |
| % de novo variants, Chr X or Y | 2.4 ± 1.7 |
| Number of genes annotated in HMDB (% of total implicated genes) | 29 (29.9) |

3.1.2 Characterization of LC-MS metabolomics features

LC-MS data from a single IEM patient was compared with those of controls in order to identify disease-relevant metabolites. CSF samples were available for eight individuals and 15 controls, and plasma samples were available for five individuals and 10 controls. A summary of the number of features identified in each bio-fluid is included in Table 3.2. On average, more features were detected in the ESI+ mode than in the ESI- mode. Out of all detected features, an average of 21.3% in the ESI+ mode and 24.7% in the ESI- mode mapped to a known metabolite in the HMDB.
This low rate of mappability highlights a common shortcoming of existing analysis: a large number of potentially differentially abundant features are essentially discarded from further investigation. When each IEM patient was individually compared to controls, the number of features with absolute z-score greater than 2 in the IEM patient was on average larger when measured in CSF than in plasma (mean = 662 vs 128 in ESI+, 390 vs 82 in ESI-). This could be due to greater biological variability in CSF metabolites, or due to the reduced chemical stability of metabolites in CSF.

Table 3.2: Characterization of LC-MS features.

| Summary (avg. number) | Plasma (n = 7) | CSF (n = 8) |
|---|---|---|
| Features in ESI+ mode | 23405 ± 1343 | 23405 ± 2503 |
| Features in ESI- mode | 15720 ± 1377 | 16232 ± 2032 |
| DAMs in ESI+ mode | 128 ± 121 | 662 ± 268 |
| DAMs in ESI- mode | 82 ± 89 | 390 ± 125 |
| Features that map to HMDB compounds in ESI+ | 5000 ± 471 | 11227 ± 1111 |
| Compound assignments for each feature in ESI+ | 4 ± 6 | 6 ± 8 |
| Features that map to HMDB compounds in ESI- | 3884 ± 426 | 7012 ± 799 |
| Compound assignments for each feature in ESI- | 4 ± 7 | 4 ± 7 |
| Compounds assigned to DAMs in ESI+ | 21795 ± 1979 | 31488 ± 839 |
| Compounds assigned to DAMs in ESI- | 10030 ± 654 | 12044 ± 199 |

3.2 Assessing enrichment for gene-associated metabolites

Enrichment of differentially abundant metabolites (DAMs) for metabolites associated with Human Metabolome Database (HMDB) genes was assessed using a Fisher's Exact Test. The total number of genes enriched for in each patient, and whether causative gene-associated metabolites were enriched, are provided in Table 3.3 and Table 3.4. Each patient's DAMs were enriched for between 23 and 261 genes. The high degree of variability in the number of enriched genes reflects the heterogeneity of the underlying metabolic processes in this patient population. Two patients had multiple causative genes; for the purpose of analyses, these genes were considered independently, so that 15 genes were considered causative for 13 patients.
For four of 15 causative genes, detection of enrichment in the causative gene was not possible because metabolite annotations were not available in the HMDB. Enrichment was detected in two of the remaining 11 genes. The LC-MS detection method used is biased towards the detection of semi-polar to non-polar metabolites; to investigate whether polarity of causative gene-associated metabolites affected their enrichment, the polarity of metabolites associated with each causative gene was determined. Enrichment was detected in two of six causative genes associated with semi-polar and non-polar metabolites, but in no genes associated with very polar metabolites.

3.3 Integration of genomic and metabolic data to prioritize causative genes

To determine whether metabolomic information could help prioritize the causative gene, each patient's enriched gene profile was combined with their candidate variants to generate a combined genetic and metabolomic score. This score was then used as the initial label for label propagation on the STRING network, as outlined in Section 2.4.

Only a small percentage of nodes/genes was given an initial label. The initial label for each node/gene could be determined by genomic evidence, metabolomic evidence, or a combination thereof, depending on the availability of genomic/metabolomic evidence for that node/gene. Overall, 0.46% ± 0.45% of nodes were assigned a score based on both genetic and metabolomic evidence, 0.47% ± 0.40% of nodes were assigned a score based on only genetic evidence, 17.2% ± 10.6% of nodes were assigned a score based on only metabolomic evidence, and 82.2% ± 10.5% of nodes were not assigned an initial label.

Table 3.3: Summary of the causative gene(s) identified for each patient through the WES variant filtering pipeline.
The enrichment of causative gene-associated metabolites, function of each causative gene and polarity of gene-associated metabolites are provided.

| Patient number | Causative gene | Causative gene-associated metabolites enriched | Function of gene | Polarity of associated molecules |
|---|---|---|---|---|
| 1 | CPT1A | Yes | Transporter of fatty acids across the mitochondrial inner membrane | Non-polar |
| 2 | NANS | Yes | Generates phosphorylated forms of N-acetylneuraminic acid | Semi-polar |
| 3 | SCN2A | No | Functions in the generation and propagation of action potentials in neurons and muscle | Very polar |
| 4 | DYRK1A | No | Catalyzes its autophosphorylation on serine/threonine and tyrosine residues | Very polar |
| 5 | CACNA1D | No | Mediates the entry of calcium ions into cells and is involved in other calcium-dependent processes | None |
| 6 | CNKSR2 | Not in HMDB | Encodes a multidomain protein that serves as a scaffold protein to mediate the mitogen-activated protein kinase pathways downstream from RAS | None |
| 7 | ECI1 | No | Involved in beta-oxidation of fatty acids through transformation of enoyl-CoA esters | Semi-polar |
| 8 | IDS | No | Required for the lysosomal degradation of heparan sulfate and dermatan sulfate | Semi-polar |
| 8 | HAL | No | Breaks down histidine to urocanic acid, which is further broken down in the liver to glutamic acid | Semi-polar |
| 9 | CHRNA1 | Not in HMDB | Plays a role in acetylcholine binding as a membrane protein | None |
| 9 | DHFR | No | Catalyzes an essential reaction for de novo glycine and purine synthesis, and for DNA precursor synthesis | Semi-polar |
| 10 | ATP8A2 | No | Transports aminophospholipids from the outer to the inner leaflet of various membranes and maintains asymmetric distribution of phospholipids, primarily in secretory vesicles | Very polar |
| 11 | MYO5B | Not in HMDB | May be involved in vesicular trafficking | None |
| 12 | KCNQ2 | No | Part of ion channel complex | Very polar |
| 13 | VGLL4 | Not in HMDB | Regulates alpha1-adrenergic activation of gene expression in cardiac myocytes | None |

Table 3.4: Metabolic enrichment profile.
Number of genes enriched for in the set of patient-specific differentially abundant metabolites.

| Patient number | Number of enriched genes |
|---|---|
| 1 | 86 |
| 2 | 88 |
| 3 | 28 |
| 4 | 45 |
| 5 | 84 |
| 6 | 197 |
| 7 | 91 |
| 8 | 107 |
| 9 | 23 |
| 10 | 55 |
| 11 | 143 |
| 12 | 48 |
| 13 | 261 |

3.3.1 Using label propagation to rank candidate genes

The final score of each node/gene was found using the label propagation algorithm. To simplify the results, the percentile rank of each causative gene was sorted into one of two prioritization groups: high evidence (rank in the 80th to 100th percentile) and low evidence (rank below the 80th percentile). The percentile rank of each causative gene within each patient's candidate gene list is provided in Table 3.5. Overall, after LP with initial node labels defined by the combined genomic and metabolomic score, 8 of 15 (53.3%) causative genes were ranked in the "high" prioritization category. At the patient level, at least one causative gene was found in the "high" prioritization category for 8 of 13 (61.5%) patients, one of which was also prioritized using metabolomic enrichment alone (CPT1A).

Table 3.5: Label propagation results. Prioritization results after LP with both the combined genomic and metabolomic initial label scores and the genomic-only initial label scores.
The percentile rank of each causative gene in the list of candidate genes, the number of candidate genes, and the change in percentile rank of the causative gene after addition of metabolomic evidence to the initial label score (DeltaM) are provided, in addition to each gene's final genomic-only and combined prioritization category (PC).

| Patient number | Causative gene | Rank with combined score | Rank with genomic score | Percentile with combined score | Percentile with genomic score | Percentile change (DeltaM) | PC with combined score | PC with genomic score |
|---|---|---|---|---|---|---|---|---|
| 1 | CPT1A | 1/29 | 12/29 | 100 | 60 | 40 | High | Low |
| 2 | NANS | 56/151 | 91/151 | 63 | 40 | 23 | Low | Low |
| 3 | SCN2A | 5/45 | 12/45 | 91 | 74 | 17 | High | Low |
| 4 | DYRK1A | 6/68 | 6/68 | 94 | 94 | 0 | High | High |
| 5 | CACNA1D | 2/8 | 2/8 | 88 | 88 | 0 | High | High |
| 6 | CNKSR2 | 31/53 | 30/53 | 42 | 44 | -2 | Low | Low |
| 7 | ECI1 | 30/271 | 63/271 | 89 | 77 | 12 | High | Low |
| 8 | HAL | 9/50 | 13/50 | 83 | 78 | 5 | High | Low |
| 8 | IDS | 19/50 | 20/50 | 63 | 61 | 2 | Low | Low |
| 9 | CHRNA1 | 111/213 | 100/213 | 48 | 53 | -5 | Low | Low |
| 9 | DHFR | 5/213 | 9/213 | 98 | 96 | 2 | High | High |
| 10 | ATP8A2 | 92/167 | 107/167 | 45 | 36 | 9 | Low | Low |
| 11 | MYO5B | 11/81 | 11/81 | 87 | 87 | 0 | High | High |
| 12 | MAST1 | 132/234 | 173/234 | 44 | 26 | 18 | Low | Low |
| 12 | KCNQ2 | 61/234 | 89/234 | 74 | 62 | 12 | Low | Low |
| 13 | VGLL4 | 50/55 | 51/55 | 10 | 8 | 2 | Low | Low |

3.3.2 Assessing the utility of metabolomic evidence

In order to determine whether the addition of metabolomic evidence was useful in the prioritization of the causative gene, the percentile ranking of the causative gene after LP with the combined initial score was compared to the percentile ranking after LP with the genomic-only score. The difference between the percentile ranking with the combined score and the percentile ranking with the genomics-only score will hereafter be referred to as "DeltaM", where

$$\text{DeltaM} = P_C - P_G$$

and $P_C$ is the percentile ranking with the combined score and $P_G$ is the percentile ranking with the genomics-only score.
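As a small worked example of the DeltaM definition and the 80th-percentile cutoff of Section 3.3.1, the sketch below recomputes both quantities for four genes whose percentiles are copied from Table 3.5. The function names are ours, and treating a percentile of exactly 80 as "High" is an assumption based on the stated range of 80th to 100th percentile.

```python
HIGH_CUTOFF = 80  # percentile at or above which a gene is in the "high" evidence group


def prioritization_category(percentile):
    """Map a percentile rank to the two-level prioritization category."""
    return "High" if percentile >= HIGH_CUTOFF else "Low"


def delta_m(combined_percentile, genomic_percentile):
    """Change in percentile rank after adding metabolomic evidence."""
    return combined_percentile - genomic_percentile


# (combined percentile, genomic-only percentile), copied from Table 3.5
examples = {
    "CPT1A": (100, 60),
    "SCN2A": (91, 74),
    "DYRK1A": (94, 94),
    "CHRNA1": (48, 53),
}

for gene, (p_combined, p_genomic) in examples.items():
    print(
        gene,
        delta_m(p_combined, p_genomic),
        prioritization_category(p_genomic),
        "->",
        prioritization_category(p_combined),
    )
```

For CPT1A this gives a DeltaM of 40 and a change from "Low" to "High", whereas for CHRNA1 it gives a DeltaM of -5 with no change in category, in agreement with the corresponding rows of Table 3.5.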
A higher DeltaM signifies an increase in percentile ranking upon the addition of metabolomic evidence.

With the addition of metabolomic evidence to the initial score, the prioritization category changed from "low" to "high" for 4 of 13 patients (Table 3.5) and stayed the same for the remainder. In contrast, all genes without associated metabolites remained in the same prioritization category. Causative genes benefited more from the addition of metabolomic evidence than non-causative genes (mean DeltaM for causative genes = 8.2%, mean DeltaM for non-causative genes = -0.11%, p < 0.05, Figure 3.1). In addition, DeltaM was positive for all causative genes, but not all non-causative genes, providing further evidence that metabolomics preferentially benefits the prioritization of causative genes.

Figure 3.1: DeltaM of causative vs non-causative genes.

3.3.3 Permutation test to generate null model of percentile rank

In order to put these rankings into context, we performed a permutation test to generate a null distribution of percentile rankings of each candidate gene. To generate this null distribution, the combined genomic and metabolomic labels were shuffled, and LP was performed. The average percentile ranking of the causative gene as well as its standard deviation across all permutations was calculated, and compared to the percentile ranking observed in real data (Table 3.6). Overall, 10 of 15 causative genes received a percentile ranking more than one standard deviation above the mean gene-specific permuted percentile. Six of these genes were part of the 8 (75%) in the "high" prioritization category.
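The permutation procedure described above can be sketched as follows. This is an outline under stated assumptions, not the code used in this thesis: `propagate` stands for any function that maps a dictionary of initial labels to a dictionary of final scores (for instance a label propagation routine), the percentile rank definition is one plausible choice that the text does not spell out, and all function names are hypothetical.

```python
import random
from statistics import mean, stdev


def percentile_rank(scores, gene, candidates):
    """Percentage of the other candidate genes that score no higher than `gene`."""
    others = [c for c in candidates if c != gene]
    if not others:
        return 100.0
    return 100.0 * sum(scores[c] <= scores[gene] for c in others) / len(others)


def permuted_percentiles(labels, gene, candidates, propagate,
                         n_permutations=500, seed=0):
    """Mean and standard deviation of the gene's percentile under shuffled labels."""
    rng = random.Random(seed)
    nodes = list(labels)
    values = [labels[n] for n in nodes]
    results = []
    for _ in range(n_permutations):
        rng.shuffle(values)  # reassign the initial labels to nodes at random
        shuffled = dict(zip(nodes, values))
        results.append(percentile_rank(propagate(shuffled), gene, candidates))
    return mean(results), stdev(results)


def exceeds_null(observed, null_mean, null_sd):
    """True if the observed percentile is more than one SD above the permuted mean."""
    return observed > null_mean + null_sd
```

Applied to the values reported in Table 3.6, `exceeds_null(100, 61, 7.6)` is true for CPT1A, while `exceeds_null(94, 94.2, 0.7)` is false for DYRK1A, whose high rank is therefore not distinguishable from what shuffled labels produce.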
Although more patients are needed to confirm this finding, these results suggest that genes with a high percentile rank may be less likely to be false positives.

Table 3.6: Label propagation results with permuted initial labels. Percentile rank of the causative gene in each patient's list of candidate genes, as well as the mean and standard deviation of the permuted (n=500) percentile for each causative gene.

| Patient number | Causative gene | Betweenness centrality | Percentile ranking of causative gene with combined score | Permuted percentile ranking with combined score | Standard deviation of permuted percentile ranking with combined score |
|---|---|---|---|---|---|
| 1 | CPT1A | 3823 | 100 | 61 | 7.6 |
| 2 | NANS | 3882 | 63 | 36.9 | 5.9 |
| 3 | SCN2A | 17960 | 91 | 78.3 | 6.4 |
| 4 | DYRK1A | 56248 | 94 | 94.2 | 0.7 |
| 5 | CACNA1D | 25905 | 88 | 75 | 8.2 |
| 6 | CNKSR2 | 4086 | 42 | 45 | 7.3 |
| 7 | ECI1 | 10523 | 89 | 78 | 3.8 |
| 8 | HAL | 2865 | 83 | 72.1 | 6.6 |
| 8 | IDS | 6455 | 63 | 57 | 7.9 |
| 9 | CHRNA1 | 2759 | 48 | 53 | 5.1 |
| 9 | DHFR | 32071 | 98 | 95 | 1.3 |
| 10 | ATP8A2 | 3370 | 45 | 35 | 4.8 |
| 11 | MYO5B | 38544 | 87 | 87.8 | 1.8 |
| 12 | KCNQ2 | 7820 | 74 | 61.3 | 5.3 |
| 13 | VGLL4 | 184 | 10 | 12.1 | 7.3 |

3.3.4 Characterization of factors that affect gene prioritization

We next examined the influence of four characteristics of the causative gene on its percentile ranking: its centrality, the types of interactions occurring between it and its first degree neighbors, the number of its first degree neighbors profiled in the HMDB and the polarity of its associated metabolites. The significance of these factors in influencing the final prioritization percentile was evaluated through a linear model, with each of these factors (except polarity of associated metabolites, due to uneven categorical representation) as independent variables and the percentile ranking as the dependent variable. The lack of association between centrality and prioritization in our data (p > 0.05) suggests that symmetric normalization effectively adjusts for network topology; however, prior literature has shown that the centrality of a node empirically influences its prioritization, and further research is therefore needed to confirm this trend (Lee et al. [2011], Qian et al. [2014]). The STRING database sorts functional interactions (i.e. edges) between proteins into several categories: "activation", "binding", "catalysis", "expression", "inhibition", "post-translational modifications" and "reactions". The type and strength of the functional interactions between a causative gene and its first degree neighbors were not associated with its percentile ranking (p > 0.05). Similarly, the number of a causative gene's first degree neighbors profiled in HMDB had no effect on its percentile ranking, suggesting that the metabolic activity of a gene's neighborhood does not affect its prioritization (p > 0.05). In general, increasing polarity of gene-associated metabolites negatively affected a gene's percentile ranking, although no formal statistical test was performed (Figure 3.2).

To determine whether the above factors influenced the utility of metabolomic evidence, a linear model was used to test the association between these factors and DeltaM.
Centrality and the number of HMDB-associated genes in the causative gene's first-degree neighborhood were not significantly associated with DeltaM. However, causative genes connected to other genes through the "activation" relationship were more likely to benefit from metabolomic evidence (p < 0.05). Again, the polarity of causative gene-associated metabolites was found to mildly influence the utility of metabolomic evidence; causative genes with non-polar metabolites exhibited a higher DeltaM than those with semi-polar or polar metabolites (Figure 3.3). Combined, these results suggest that genes associated with non-polar metabolites or those involved in acting as/associating with enzyme activators may benefit preferentially from the addition of metabolomic evidence.

Figure 3.2: Effect of polarity of gene-associated metabolites on percentile ranking. "None" indicates that the gene has no associated metabolites in HMDB.

3.4 Summary

In this thesis, we assembled an untargeted LC-MS metabolomic analysis pipeline capable of taking in raw LC-MS data as input and returning a list of differentially abundant metabolites (DAMs). Through characterization of processed LC-MS data, we found that metabolomics suffers from low feature to metabolite mappability. On average, only approximately one fifth of LC-MS features mapped to known metabolites in the HMDB. We assessed enrichment in this list of DAMs for gene-associated metabolites and combined this with candidate variant lists from a WES variant filtering pipeline to create a combined per-node score that was propagated through an FLN using a label propagation algorithm. The final propagated score of each node was used to rank each candidate gene in order of its relevance to each patient's disease.
Integrated genomic and metabolomic evidence was able to prioritize the causative gene in the top 20% of candidate genes for 61.5% (8 of 13) of patients, 75% of which achieved a percentile prioritization score at least one standard deviation above a permuted percentile. Combining genomic and metabolomic evidence resulted in the prioritization of the causative gene in 30.7% more patients than was possible with genomics evidence alone, and on average improved the percentile rank of the causative gene by 7.9%. Metabolomic evidence primarily benefited the prioritization of causative genes, although non-causative genes also saw an increase in DeltaM. Metabolomic evidence was particularly helpful for prioritizing genes with non-polar metabolites, e.g. CPT1A, with an increase of 40 percentile points upon addition of metabolomic evidence. In addition, causative genes involved in activating roles with their first degree neighbors may preferentially benefit from the addition of metabolomic evidence.

Figure 3.3: Utility of metabolomic evidence for causative genes associated with metabolites of varying polarities. DeltaM of causative genes by polarity of gene-associated metabolites. "None" indicates that the gene did not have any associated metabolites in HMDB.

Chapter 4
Discussion

To the author's knowledge, the method presented in this thesis is the first to combine genomic and LC-MS data for the purpose of disease gene prioritization, although others have approached the broader problem of genomic/metabolomic data integration (Krumsiek et al. [2011], Li et al. [2013], Pirhaji et al. [2016]). Our network-based method enabled the prioritization of the causative gene in approximately 60% of the patients profiled. Polarity had an obvious impact on prioritization, as genes with non-polar metabolites experience a greater boost in prioritization when LC-MS data is used.
The combination of multiple chromatography techniques would address this problem by enabling metabolites of diverse polarities to be profiled. Additionally, genes acting as or associating with enzyme activators benefited more from the addition of metabolomic evidence. Enzyme activators are often associated with metabolic pathways and processes (e.g. hexokinase-I and glucokinase), suggesting that a gene that is directly implicated in metabolism will be more likely to benefit from metabolomic evidence.

Several areas of our method could be further refined. The binary genomic score failed to reflect relevant characteristics of the variant, namely its frequency, pathogenicity and relevance to disease. Inclusion of such variables in the generation of label biases would enable a more data-driven approach to prioritization. In addition, integrating the nature of the genomic controls used (i.e. singleton or trio) into the genomic score may provide additional context for the strength of genomic evidence.

The success of the method put forward in this thesis is limited by the precision of LC-MS metabolomic data, as we are unable to conclusively identify each feature. Challenges to integrating metabolomic data into the variant prioritization process can broadly be divided into those concerning the technical aspects of metabolite quantification and identification, and those concerning the biological interpretation of results. On the technical side, it is currently impossible to know the number of unique metabolites in the typical plasma/CSF/urine metabolome, as no LC-MS protocol is capable of identifying all metabolites. This means that for experiments aiming to capture an unbiased snapshot of the metabolome, a combination of chromatography techniques must be used. Comparisons across platforms are difficult to make, as little is known about how results from different analytic techniques can be compared, although some efforts have been made (Büscher et al. [2009], Yet et al.
[2016], Leuthold et al. [2017]). Further, only approximately 65% of metabolites are quantifiable in all three body fluids (plasma, urine and CSF), indicating that care must also be taken to select the most disease-relevant biofluid (Kennedy et al. [2017]). Additionally, the choice of pre-processing algorithms may have a large effect on feature detection and adduct annotation, which makes analyses difficult to reproduce. On the biological interpretation side, there is a lack of established methods for mapping genomic perturbations to the metabolites they impact downstream, directly or indirectly, in the rare disease context. mQTL studies are underpowered, particularly for associations driven by rare variants, making it difficult for them to identify novel gene-metabolite associations. Incomplete annotation of gene-metabolite associations in databases such as the HMDB limits our ability to use these data for patient diagnostics. The lack of methods for assessing overall characteristics of metabolites, such as polarity, challenges large-scale omic studies, as it limits their ability to account for these variables in a quantitative and reliable manner.

The challenges facing the use of metabolomics in rare metabolic disease diagnostics are best illustrated through the exploration of four cases analyzed in this study and by Tarailo-Graovac et al., with known IEM-causing variants in CPT1A, NANS, DYRK1A and SCN2A, respectively. CPT1A and NANS are enzymes that catalyze highly specific reactions, and do not share many metabolites with other genes. In contrast, SCN2A, a transmembrane sodium ion transporter, interacts with the common metabolites ATP, sodium and water, and DYRK1A, a phosphotransferase, interacts with ATP and ADP. Metabolites associated with SCN2A and DYRK1A would be less likely to be identified as differentially abundant, as ATP and ADP are used in multiple metabolic pathways and are under strong homeostatic control.
This observation is echoed by Nicholson et al., who note that unlike in eQTL studies, there is not a one-to-one mapping between a metabolite and a gene (Nicholson et al. [2011]). Because more statistical tests are performed in mQTL studies, effect sizes must be larger to reach statistical significance. This suggests that even when a robust snapshot of the metabolome is procured using multiple chromatography methods, metabolomics may only be useful in confirming perturbations in genes that interact with metabolites under weak homeostatic control, as they are likely to have larger effect sizes. Metabolomics therefore might not be of use in the prioritization of SCN2A and DYRK1A. The finding that adding metabolomic evidence to the initial label bias does not assist the prioritization of these genes through our method supports this claim. Further work is needed to evaluate the impact of homeostatic control on gene prioritization using metabolomic data.

In this work, the identity of each differentially abundant feature could only be narrowed down to an average of 4 metabolites in ESI+ and ESI- modes, reflecting the high degree of uncertainty associated with the identity of a differentially abundant metabolite. One solution would be to restrict metabolite detection to a list of approximately 300 metabolites known to be detectable by this LC-MS system, as has been done previously (Coene et al. [2018]). However, this would limit the ability of this approach to characterize the metabolome in an unbiased manner. Using network structures to refine the true feature-to-metabolite mapping has been proposed as a viable approach to reduce uncertainty associated with metabolite identification, and should be explored further (Pirhaji et al. [2016]).

Given the increased technical reliability of WES as compared to untargeted LC-MS, methods that could dynamically weight either source of evidence based on its technical reliability deserve further exploration.
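One simple way to let identification confidence influence the metabolomic evidence is sketched below. Each differentially abundant feature spreads a total weight of 1 uniformly over its candidate metabolites, so a feature matched to a single metabolite contributes more per candidate than a feature matched to four. The uniform weighting, the function name and the toy feature, metabolite and gene identifiers are all assumptions for illustration; this is not the enrichment score used in this thesis.

```python
from collections import defaultdict


def weighted_gene_evidence(feature_candidates, metabolite_genes):
    """Sum, per gene, the identification-weighted support from differentially abundant features.

    feature_candidates: feature id -> list of candidate metabolite ids for that feature.
    metabolite_genes: metabolite id -> list of genes annotated to that metabolite.
    """
    evidence = defaultdict(float)
    for candidates in feature_candidates.values():
        if not candidates:
            continue
        # Uniform prior over candidates: an ambiguous feature counts less per metabolite.
        weight = 1.0 / len(candidates)
        for metabolite in candidates:
            for gene in metabolite_genes.get(metabolite, ()):
                evidence[gene] += weight
    return dict(evidence)


# Hypothetical input: f1 is unambiguous, f2 matches four candidate metabolites.
feature_candidates = {"f1": ["M1"], "f2": ["M2", "M3", "M4", "M5"]}
metabolite_genes = {"M1": ["GENE_A"], "M2": ["GENE_B"], "M3": ["GENE_B", "GENE_C"]}

evidence = weighted_gene_evidence(feature_candidates, metabolite_genes)
print(evidence)
```

Under this weighting a gene's score can be read as the summed share of each feature's candidate identities that point to that gene, which keeps a single ambiguous feature from dominating the evidence.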
Ideally, high confidence metabolite identifications would be weighted more heavily than low confidence metabolite identifications, thereby mitigating the effects of noise. Additionally, mapping all metabolomic features to the gene level in order to perform label propagation (LP) on a homogeneous network may have the effect of oversimplifying the interactions between genes and metabolites. For example, some metabolites are known to be associated with a particular gene with a higher confidence than others. By utilizing an enrichment score, our method effectively considers all metabolites equally, when in reality some metabolites are more robustly associated. Applying LP to a heterogeneous network, which would include edges between genes and between genes and metabolites, may allow propagation to occur while allowing for weighting of specific gene-metabolite associations (Lotfi Shahreza et al. [2017]). Further quantification of the strength of gene-metabolite associations is needed before this weighting can occur in a robust manner.

4.1 Future Work

For genomics and LC-MS-based metabolomics to be successfully integrated into the clinical diagnosis of IEMs, two major technical areas of improvement must be addressed. First, existing feature detection and adduct/isotope annotation methods must be refined and benchmarked for use in clinical metabolomics. Several publicly available databases with known chemical compositions have been generated for this purpose (Kenar et al. [2014]). Second, explorations of gene-metabolite associations through mQTL studies are needed to expand gene-metabolite annotations.

On the biological interpretation side, understanding the degree to which a particular metabolite is regulated (i.e. by which genes) would help identify metabolites that are under strong homeostatic control, and by association, genes that may not benefit from metabolomic-guided prioritization.
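To make the idea of propagating evidence over a network containing both gene and metabolite nodes more concrete, the sketch below runs a commonly used normalized label propagation update on a small toy graph. The update rule, the parameter values, the function name and the four-node network are assumptions chosen for illustration and may differ from the formulation and network used in this thesis.

```python
import math


def propagate_labels(weights, initial_labels, alpha=0.8, n_iter=200):
    """Iterate F <- alpha * S * F + (1 - alpha) * Y on a weighted, undirected graph.

    weights: symmetric matrix (list of lists) of non-negative edge weights.
    initial_labels: starting evidence Y for each node (the label bias).
    alpha: share of each node's score taken from its neighbors at every step.
    """
    n = len(initial_labels)
    degree = [sum(row) for row in weights]
    # Symmetric normalization S = D^(-1/2) W D^(-1/2); absent edges stay at zero.
    norm = [
        [weights[i][j] / math.sqrt(degree[i] * degree[j]) if weights[i][j] else 0.0
         for j in range(n)]
        for i in range(n)
    ]
    scores = list(initial_labels)
    for _ in range(n_iter):
        scores = [
            alpha * sum(norm[i][j] * scores[j] for j in range(n))
            + (1 - alpha) * initial_labels[i]
            for i in range(n)
        ]
    return scores


# Toy heterogeneous network (hypothetical): one metabolite node linked to a chain of genes.
nodes = ["MET_1", "GENE_A", "GENE_B", "GENE_C"]
weights = [
    [0, 1, 0, 0],  # MET_1 - GENE_A (gene-metabolite edge)
    [1, 0, 1, 0],  # GENE_A - GENE_B (gene-gene edge)
    [0, 1, 0, 1],  # GENE_B - GENE_C (gene-gene edge)
    [0, 0, 1, 0],
]
initial_labels = [1.0, 0.0, 0.0, 0.0]  # evidence starts on the metabolite only

scores = dict(zip(nodes, propagate_labels(weights, initial_labels)))
print(scores)
```

In this toy example the scores decay with distance from the labeled metabolite. Non-unit edge weights could in principle encode the confidence of individual gene-metabolite associations, which is the kind of weighting the text suggests needs further quantification.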
In addition, given that label propagation allows each node to be influenced by its neighbors, the local neighborhood surrounding a causative gene may be important in determining whether or not prioritization through label propagation will be effective. Preliminary characterization of the local neighborhood surrounding each causative gene in this thesis suggests that prioritization is not influenced by just one factor, but rather by a multitude of factors working in concert. Further investigation into network-based and metabolic factors that affect prioritization may be informative for future methods.

Bibliography

Emma Graham, Jessica Lee, Magda Price, Maja Tarailo-Graovac, Allison Matthews, Udo Engelke, Jeffrey Tang, Leo A.J. Kluijtmans, Ron A. Wevers, Wyeth W. Wasserman, Clara D.M. van Karnebeek, and Sara Mostafavi. Integration of genomics and metabolomics for prioritization of rare disease variants: a 2018 literature review, 2018. ISSN 15732665. → pages vi, 1

Maja Tarailo-Graovac, Casper Shyr, Colin J. Ross, Gabriella A. Horvath, Ramona Salvarinova, Xin C. Ye, Lin-Hua Zhang, Amit P. Bhavsar, Jessica J.Y. Lee, Britt I. Drögemöller, Mena Abdelsayed, Majid Alfadhel, Linlea Armstrong, Matthias R. Baumgartner, Patricie Burda, Mary B. Connolly, Jessie Cameron, Michelle Demos, Tammie Dewan, Janis Dionne, A. Mark Evans, Jan M. Friedman, Ian Garber, Suzanne Lewis, Jiqiang Ling, Rupasri Mandal, Andre Mattman, Margaret McKinnon, Aspasia Michoulas, Daniel Metzger, Oluseye A. Ogunbayo, Bojana Rakic, Jacob Rozmus, Peter Ruben, Bryan Sayson, Saikat Santra, Kirk R. Schultz, Kathryn Selby, Paul Shekel, Sandra Sirrs, Cristina Skrypnyk, Andrea Superti-Furga, Stuart E. Turvey, Margot I. Van Allen, David Wishart, Jiang Wu, John Wu, Dimitrios Zafeiriou, Leo Kluijtmans, Ron A. Wevers, Patrice Eydoux, Anna M. Lehman, Hilary Vallance, Sylvia Stockler-Ipsiroglu, Graham Sinclair, Wyeth W. Wasserman, and Clara D. van Karnebeek. Exome Sequencing and the Management of Neurometabolic Disorders.
New England Journal of Medicine, page NEJMoa1515792, 2016. ISSN 0028-4793. doi:10.1056/NEJMoa1515792. URL http://www.nejm.org/doi/10.1056/NEJMoa1515792. → pages xii, 2, 4, 5, 13, 16, 25, 27, 49

Hans Van Bokhoven. Genetic and Epigenetic Networks in Intellectual Disabilities. Annu. Rev. Genet, 45:81–104, 2011. ISSN 1545-2948. doi:10.1146/annurev-genet-110410-132512. → page 2

Clara D M Van Karnebeek and Sylvia Stockler. Treatable inborn errors of metabolism causing intellectual disability: A systematic literature review, 2012. ISSN 10967192. → page 2

Sorcha A. Collins, Graham Sinclair, Sarah McIntosh, Fiona Bamforth, Robert Thompson, Isaac Sobol, Geraldine Osborne, Andre Corriveau, Maria Santos, Brendan Hanley, Cheryl R. Greenberg, Hilary Vallance, and Laura Arbour. Carnitine palmitoyltransferase 1A (CPT1A) P479L prevalence in live newborns in Yukon, Northwest Territories, and Nunavut. Molecular Genetics and Metabolism, 101(2-3):200–204, 2010. ISSN 10967192. doi:10.1016/j.ymgme.2010.07.013. → pages 2, 16, 25

Gabriella A. Horvath, Michelle Demos, Casper Shyr, Allison Matthews, Linhua Zhang, Simone Race, Sylvia Stockler-Ipsiroglu, Margot I. Van Allen, Ogan Mancarci, Lilah Toker, Paul Pavlidis, Colin J. Ross, Wyeth W. Wasserman, Natalie Trump, Simon Heales, Simon Pope, J. Helen Cross, and Clara D.M. van Karnebeek. Secondary neurotransmitter deficiencies in epilepsy caused by voltage-gated sodium channelopathies: A potential treatment target? Molecular Genetics and Metabolism, 117(1):42–48, 2016. ISSN 10967206. doi:10.1016/j.ymgme.2015.11.008.
→ pages 2, 16, 25

Clara D M Van Karnebeek, Luisa Bonafé, Xiao-yan Wen, Maja Tarailo-graovac, Sara Balzano, Beryl Royer-bertrand, Angel Ashikov, Livia Garavelli, Isabella Mammi, Licia Turolla, Catherine Breen, Dian Donnai, Valerie Cormier, Delphine Heron, Gen Nishimura, Shinichi Uchikawa, Belinda Campos-xavier, Antonio Rossi, Thierry Hennet, Koroboshka Brand-arzamendi, Jacob Rozmus, Keith Harshman, Brian J Stevenson, Enrico Girardi, Giulio Superti-furga, Tammie Dewan, Alissa Collingridge, Jessie Halparin, Colin J Ross, Margot I Van Allen, Andrea Rossi, Udo F Engelke, and Leo A J Kluijtmans. NANS-mediated synthesis of sialic acid is required for brain and skeletal development. Nature Publishing Group, 48(7):777–784, 2016. ISSN 1061-4036. doi:10.1038/ng.3578. URL http://dx.doi.org/10.1038/ng.3578. → pages 2, 16, 25

Nenad B., Michael G.K., Marinus D., and Carlo D.-V. IEMBASE, a knowledgebase of inborn errors of metabolism. Molecular Genetics and Metabolism, 111(3):296, 2014. ISSN 1096-7192. URL http://www.embase.com/search/results?subaction=viewrecord&from=export&id=L71804976. → page 2

Kevin A Strauss, Erik G Puffenberger, and D Holmes Morton. Maple Syrup Urine Disease. In GeneReviews, volume 28, pages 93–97. 2013. doi:NBK1319[bookaccession]. URL http://www.ncbi.nlm.nih.gov/books/NBK1319/. → page 3

Caroline H. Johnson, Julijana Ivanisevic, and Gary Siuzdak. Metabolomics: beyond biomarkers and towards mechanisms. Nature Reviews Molecular Cell Biology, 17(7):451–459, 2016. ISSN 1471-0072. doi:10.1038/nrm.2016.25. URL http://www.nature.com/doifinder/10.1038/nrm.2016.25. → pages 3, 6

Exome Aggregate Consortium. ExAC Browser, 2016. URL http://exac.broadinstitute.org/variant/9-139413097-T-G. → pages 3, 27

E M Smigielski, K Sirotkin, M Ward, and S T Sherry. dbSNP: a database of single nucleotide polymorphisms. Nucleic acids research, 28(1):352–355, 2000. ISSN 0305-1048. doi:10.1093/nar/28.1.352.
→ pages 3, 27

Monkol Lek, Konrad J Karczewski, Kaitlin E Samocha, Eric Banks, Timothy Fennell, Anne H O, James S Ware, Andrew J Hill, Beryl B Cummings, Daniel P Birnbaum, Jack A Kosmicki, Laramie Duncan, Fengmei Zhao, James Zou, Emma Pierce-Hoffman, David N Cooper, Jackie Goldstein, Namrata Gupta, Daniel Howrigan, Adam Kiezun, D Stenson, Christine Stevens, Grace Tiao, Maria T Tusie-Luna, Ben Weisburd, G Wilson, Mark J Daly, and Daniel G MacArthur. Analysis of protein-coding genetic variation in 60,706 humans. bioRxiv, 536(7616):030338, 2016. ISSN 0028-0836. doi:10.1101/030338. URL http://biorxiv.org/lookup/doi/10.1101/030338. → pages 3, 27

Pauline C. Ng and Steven Henikoff. SIFT: Predicting amino acid changes that affect protein function. Nucleic Acids Research, 31(13):3812–3814, 2003. ISSN 03051048. doi:10.1093/nar/gkg509. → page 3

Ivan Adzhubei, Daniel M. Jordan, and Shamil R. Sunyaev. Predicting functional effect of human missense mutations using PolyPhen-2. Current Protocols in Human Genetics, (SUPPL.76), 2013. ISSN 19348266. doi:10.1002/0471142905.hg0720s76. → page 3

Maja Tarailo-Graovac, Wyeth W. Wasserman, and Clara D. M. Van Karnebeek. Impact of next-generation sequencing on diagnosis and management of neurometabolic disorders: current advances and future perspectives. Expert Review of Molecular Diagnostics, 17(4):307–309, 2017. ISSN 1473-7159. doi:10.1080/14737159.2017.1293527. URL https://www.tandfonline.com/doi/full/10.1080/14737159.2017.1293527. → page 4

Yaping Yang, Donna M. Muzny, Jeffrey G. Reid, Matthew N. Bainbridge, Alecia Willis, Patricia A. Ward, Alicia Braxton, Joke Beuten, Fan Xia, Zhiyv Niu, Matthew Hardison, Richard Person, Mir Reza Bekheirnia, Magalie S. Leduc, Amelia Kirby, Peter Pham, Jennifer Scull, Min Wang, Yan Ding, Sharon E. Plon, James R. Lupski, Arthur L. Beaudet, Richard A. Gibbs, and Christine M. Eng. Clinical Whole-Exome Sequencing for the Diagnosis of Mendelian Disorders. New England Journal of Medicine, 369(16):1502–1511, 2013. ISSN 0028-4793.
doi:10.1056/NEJMoa1306555. URL http://www.nejm.org/doi/abs/10.1056/NEJMoa1306555. → page 4

Aziz Belkadi, Alexandre Bolze, Yuval Itan, Aurélie Cobat, Quentin B. Vincent, Alexander Antipenko, Lei Shang, Bertrand Boisson, Jean-Laurent Casanova, and Laurent Abel. Whole-genome sequencing is more powerful than whole-exome sequencing for detecting exome variants. Proceedings of the National Academy of Sciences, 112(17):5473–5478, 2015. ISSN 0027-8424. doi:10.1073/pnas.1418631112. URL http://www.pnas.org/lookup/doi/10.1073/pnas.1418631112. → page 4

Gabrielle Bertier, Yann Joly, and Martin Hétu. Unsolved challenges of clinical whole-exome sequencing: A systematic literature review of end-users’ views. Accepted 07/28/2016. BMC medical genomics, 9(1):doi:10.1186/s12920–016–0213–6, 2016. ISSN 1755-8794. doi:10.1186/s12920-016-0213-6. URL http://dx.doi.org/10.1186/s12920-016-0213-6. → page 4

Aihua Zhang, Hui Sun, Ping Wang, Ying Han, and Xijun Wang. Modern analytical techniques in metabolomics analysis. The Analyst, 137(2):293–300, 2012. ISSN 0003-2654. doi:10.1039/C1AN15605E. URL http://xlink.rsc.org/?DOI=C1AN15605E. → page 7

Sunil U. Bajad, Wenyun Lu, Elizabeth H. Kimball, Jie Yuan, Celeste Peterson, and Joshua D. Rabinowitz. Separation and quantitation of water soluble cellular metabolites by hydrophilic interaction chromatography-tandem mass spectrometry. Journal of Chromatography A, 1125(1):76–88, 2006. ISSN 00219673. doi:10.1016/j.chroma.2006.05.019. → page 7

Lee D. Roberts, Amanda L. Souza, Robert E. Gerszten, and Clary B. Clish. Targeted metabolomics. Current Protocols in Molecular Biology, 1(SUPPL.98), 2012. ISSN 19343639. doi:10.1002/0471142727.mb3002s98. → page 7

Juntuo Zhou and Yuxin Yin. Strategies for large-scale targeted metabolomics quantification by liquid chromatography-mass spectrometry. The Analyst, 141(23):6362–6373, 2016. ISSN 0003-2654. doi:10.1039/C6AN01753C. URL http://xlink.rsc.org/?DOI=C6AN01753C. → page 7

Mikko Katajamaa, Jarkko Miettinen, and Matej Oresic.
MZmine: toolbox for processing and visualization of mass spectrometry based molecular profile data. Bioinformatics (Oxford, England), 22(5):634–6, 2006. ISSN 1367-4803. doi:10.1093/bioinformatics/btk039. URL http://www.ncbi.nlm.nih.gov/pubmed/16403790. → page 7

R Tautenhahn, C Bottcher, and S Neumann. Highly sensitive feature detection for high resolution LC/MS. BMC Bioinformatics, 9:16, 2008. ISSN 1471-2105. doi:10.1186/1471-2105-9-504. → pages 7, 8, 28

Eugene Melamud, Livia Vastag, and Joshua D Rabinowitz. Metabolomic analysis and visualization engine for LC-MS data. Analytical chemistry, 82(23):9818–9826, 2010. ISSN 1520-6882. doi:10.1021/ac1021166. → page 7

Nathaniel G. Mahieu and Gary J. Patti. Systems-Level Annotation of a Metabolomics Data Set Reduces 25 000 Features to Fewer than 1000 Unique Metabolites. Analytical Chemistry, 89(19):10397–10406, 2017. ISSN 15206882. doi:10.1021/acs.analchem.7b02380. → page 8

Bridgit Crews, William R. Wikoff, Gary J. Patti, Hin-Koon Woo, Ewa Kalisiak, Johanna Heideker, and Gary Siuzdak. Variability Analysis of Human Plasma and Cerebral Spinal Fluid Reveals Statistical Significance of Changes in Mass Spectrometry-Based Metabolomics Data. Analytical Chemistry, 81(20):8538–8544, 2009. ISSN 0003-2700. doi:10.1021/ac9014947. URL http://pubs.acs.org/doi/abs/10.1021/ac9014947. → page 8

Leonid Brodsky, Arieh Moussaieff, Nir Shahaf, Asaph Aharoni, and Ilana Rogachev. Evaluation of peak picking quality in LC-MS metabolomics data. Analytical Chemistry, 82(22):9177–9187, 2010. ISSN 00032700. doi:10.1021/ac101216e. → page 8

Joanna Godzien, Vanesa Alonso-Herranz, Coral Barbas, and Emily Grace Armitage. Controlling the quality of metabolomics data: new strategies to get the best out of the QC sample. Metabolomics, 11(3):518–528, 2015. ISSN 15733890. doi:10.1007/s11306-014-0712-4. → pages 8, 9

Dirk Vaikenborg, Grégoire Thomas, Luc Krois, Koen Kas, and Tomasz Burzykowski.
A strategy for the prior processing of high-resolution mass spectral data obtained from high-dimensional Combined fractional diagonal chromatography. Journal of Mass Spectrometry, 44(4):516–529, 2009. ISSN 10765174. doi:10.1002/jms.1527. → page 9

Xiaotao Shen, Xiaoyun Gong, Yuping Cai, Yuan Guo, Jia Tu, Hao Li, Tao Zhang, Jialin Wang, Fuzhong Xue, and Zheng Jiang Zhu. Normalization and integration of large-scale metabolomics data using support vector regression. Metabolomics, 12(5):1–12, 2016. ISSN 15733890. doi:10.1007/s11306-016-1026-5. → page 9

Alysha M. De Livera, Marko Sysi-Aho, Laurent Jacob, Johann A. Gagnon-Bartsch, Sandra Castillo, Julie A. Simpson, and Terence P. Speed. Statistical Methods for Handling Unwanted Variation in Metabolomics Data. Analytical Chemistry, 87(7):3606–3615, 2015. ISSN 15206882. doi:10.1021/ac502439y. → page 9

Yuliya V Karpievitch, Alan R Dabney, and Richard D Smith. Normalization and missing value imputation for label-free LC-MS analysis. BMC Bioinformatics, 13(Suppl 16):S5, 2012. ISSN 1471-2105. doi:10.1186/1471-2105-13-S16-S5. URL http://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-13-S16-S5. → page 9

Yiman Wu and Liang Li. Sample normalization methods in quantitative metabolomics, 2015. ISSN 18733778. → page 10

B M Bolstad, R A Irizarry, M Astrand, and T P Speed. A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. BIOINFORMATICS, 19(2):185–193, 2003. ISSN 1367-4803. doi:10.1093/bioinformatics/19.2.185. URL http://www.stat.berkeley.edu/bolstad/normalize/. → pages 10, 28

Robert a van den Berg, Huub C J Hoefsloot, Johan a Westerhuis, Age K Smilde, and Mariët J van der Werf. Centering, scaling, and transformations: improving the biological information content of metabolomics data. BMC genomics, 7:142, 2006. ISSN 1471-2164. doi:10.1186/1471-2164-7-142. URL http://www.ncbi.nlm.nih.gov/pubmed/16762068. → page 10

Svati H. Shah and Christopher B. Newgard.
Integrated Metabolomics and Genomics: Systems Approaches to Biomarkers and Mechanisms of Cardiovascular Disease. Circulation: Cardiovascular Genetics, 8(2):410–419, 2015. ISSN 19423268. doi:10.1161/CIRCGENETICS.114.000223. → page 11

Eugene P Rhee, Jennifer E Ho, Ming-Huei Chen, Dongxiao Shen, Susan Cheng, Martin G Larson, Anahita Ghorbani, Xu Shi, Iiro T Helenius, Christopher J O’Donnell, Amanda L Souza, Amy Deik, Kerry A Pierce, Kevin Bullock, Geoffrey A Walford, Ramachandran S Vasan, Jose C Florez, Clary Clish, J.-R. Joanna Yeh, Thomas J Wang, and Robert E Gerszten. A Genome-wide Association Study of the Human Metabolome in a Community-Based Cohort. Cell Metabolism, 18(1):130–143, 2013. ISSN 15504131. doi:10.1016/j.cmet.2013.06.013. → pages 11, 14

Tao Long, Michael Hicks, Hung-Chun Yu, William H Biggs, Ewen F Kirkness, Cristina Menni, Jonas Zierer, Kerrin S Small, Massimo Mangino, Helen Messier, Suzanne Brewerton, Yaron Turpaz, Brad A Perkins, Anne M Evans, Luke A D Miller, Lining Guo, C Thomas Caskey, Nicholas J Schork, Chad Garner, Tim D Spector, J Craig Venter, and Amalio Telenti. Whole-genome sequencing identifies common-to-rare variants associated with human blood metabolites. Nature genetics, 49(4):568–578, 2017. ISSN 1546-1718. doi:10.1038/ng.3809. URL http://www.nature.com/doifinder/10.1038/ng.3809 http://www.ncbi.nlm.nih.gov/pubmed/28263315. → pages 11, 14, 15

Bernd O. Keller, Jie Sui, Alex B. Young, and Randy M. Whittal. Interferences and contaminants encountered in modern mass spectrometry, 2008. ISSN 00032670. → page 11

Carsten Kuhl, Ralf Tautenhahn, Christoph Böttcher, Tony R. Larson, and Steffen Neumann. CAMERA: An integrated strategy for compound spectra extraction and annotation of liquid chromatography/mass spectrometry data sets. Analytical Chemistry, 84(1):283–289, 2012. ISSN 00032700. doi:10.1021/ac202450g. → pages 12, 28

Ralf Petri and Claudia Schmidt-Dannert. BioCyc—Genome and Metabolism. Angew. Chem. Int. Ed., 43(15):1908, 2004.
ISSN 1433-7851. doi:10.1002/anie.200483067. URL http://doi.wiley.com/10.1002/anie.200483067. → page 12

A Smith, Grace. O ’maille, Elizabeth. J. Want, Chuan. Qin, Sunia. A. Trauger, Theodore. R. Brandon, Darlene. E. Custodio, Ruben. Abagyan, and Gary. Siuzdak. METLIN A Metabolite Mass Spectral Database. Proceedings of the 9Th International Congress of Therapeutic Drug Monitoring & Clinical Toxicology, 27(6):747–751, 2005. ISSN 0163-4356. doi:10.1097/01.ftd.0000179845.53213.39. → page 12

David S. Wishart, Dan Tzur, Craig Knox, Roman Eisner, An Chi Guo, Nelson Young, Dean Cheng, Kevin Jewell, David Arndt, Summit Sawhney, Chris Fung, Lisa Nikolai, Mike Lewis, Marie Aude Coutouly, Ian Forsythe, Peter Tang, Savita Shrivastava, Kevin Jeroncic, Paul Stothard, Godwin Amegbey, David Block, David D. Hau, James Wagner, Jessica Miniaci, Melisa Clements, Mulu Gebremedhin, Natalie Guo, Ying Zhang, Gavin E. Duggan, Glen D. MacInnis, Alim M. Weljie, Reza Dowlatabadi, Fiona Bamforth, Derrick Clive, Russ Greiner, Liang Li, Tom Marrie, Brian D. Sykes, Hans J. Vogel, and Lori Querengesser. HMDB: The human metabolome database. Nucleic Acids Research, 35(SUPPL. 1), 2007. ISSN 03051048. doi:10.1093/nar/gkl923. → pages 12, 29, 30

Ines Thiele, Almut Heinken, and Ronan M T Fleming. A systems biology approach to studying the role of microbes in human health. Current Opinion in Biotechnology, 24(1):4–12, 2013. ISSN 09581669. doi:10.1016/j.copbio.2012.10.001. URL http://dx.doi.org/10.1016/j.copbio.2012.10.001. → page 12

S Li, Y Park, S Duraisingham, F H Strobel, N Khan, Q A Soltow, D P Jones, and B Pulendran. Predicting network activity from high throughput metabolomics. PLoS Comput Biol, 9(7):e1003123, 2013. ISSN 1553-7358. doi:10.1371/journal.pcbi.1003123. URL http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3701697/pdf/pcbi.1003123.pdf. → pages 12, 48

Leila Pirhaji, Pamela Milani, Mathias Leidl, Timothy Curran, Julian Avila-pacheco, Clary B Clish, Forest M White, Alan Saghatelian, and Ernest Fraenkel.
Revealing disease-associated pathways by network integration of untargeted metabolomics. Nature biotechnology, (October 2015), 2016. doi:10.1038/nmeth.3940. → pages 12, 48, 50

M Sysi-Aho, M Katajamaa, L Yetukuri, and M Oresic. Normalization method for metabolomics data using optimal selection of multiple internal standards. BMC Bioinformatics, 8:93, 2007. ISSN 1471-2105. doi:10.1186/1471-2105-8-93. URL http://www.ncbi.nlm.nih.gov/pubmed/17362505. → page 13

Bedilu Alamirie Ejigu, Dirk Valkenborg, Geert Baggerman, Manu Vanaerschot, Erwin Witters, Jean-Claude Dujardin, Tomasz Burzykowski, and Maya Berg. Evaluation of normalization methods to pave the way towards large-scale LC-MS-based metabolomics profiling experiments. Omics : a journal of integrative biology, 17(9):473–85, 2013. ISSN 1557-8100. doi:10.1089/omi.2013.0010. URL http://www.ncbi.nlm.nih.gov/pubmed/23808607. → page 13

Daniel Weindl, André Wegner, Christian Jäger, and Karsten Hiller. Isotopologue ratio normalization for non-targeted metabolomics. Journal of Chromatography A, 1389:112–119, 2015. ISSN 18733778. doi:10.1016/j.chroma.2015.02.025. → page 13

William R. Wikoff, Jon A. Gangoiti, Bruce A. Barshop, and Gary Siuzdak. Metabolomics identifies perturbations in human disorders of propionate metabolism. Clinical Chemistry, 53(12):2169–2176, 2007. ISSN 00099147. doi:10.1373/clinchem.2007.089011. → page 13

Marli Dercksen, Gerhard Koekemoer, Marinus Duran, Ronald J A Wanders, Lodewyk J. Mienie, and Carolus J. Reinecke. Organic acid profile of isovaleric acidemia: A comprehensive metabolomics approach. Metabolomics, 9(4):765–777, 2013. ISSN 15733882. doi:10.1007/s11306-013-0501-5. → page 13

Leonie Venter, Zander Lindeque, Peet Jansen van Rensburg, Francois van der Westhuizen, Izelle Smuts, and Roan Louw. Untargeted urine metabolomics reveals a biosignature for muscle respiratory chain deficiencies. Metabolomics, 11(1):111–121, 2014. ISSN 15733890. doi:10.1007/s11306-014-0675-5.
→ page 13

Lukáš Najdekr, Alžběta Gardlo, Lucie Mádrová, David Friedecký, Hana Janečková, Elon S. Correa, Royston Goodacre, and Tomáš Adam. Oxidized phosphatidylcholines suggest oxidative stress in patients with medium-chain acyl-CoA dehydrogenase deficiency. Talanta, 139:62–66, 2015. ISSN 00399140. doi:10.1016/j.talanta.2015.02.041. → page 13

Abela L., Steindl K., Simmons L., Joset P., Papuc M., Mathis D., Schmitt B., Wohlrab G., Klein A., Asadollahi R., Crowther L., Sass O., Hersberger M., and Rauch A. A combined metabolic-genetic approach to early-onset epileptic encephalopathies: Results from a Swiss study cohort, 2016. URL http://ovidsp.ovid.com/ovidweb.cgi?T=JS&PAGE=reference&D=emex&NEWS=N&AN=615322759. → pages 13, 15

Adam D. Kennedy, Kirk L. Pappan, Taraka R. Donti, Anne M. Evans, Jacob E. Wulff, Luke A.D. Miller, V. Reid Sutton, Qin Sun, Marcus J. Miller, and Sarah H. Elsea. Elucidation of the complex metabolic profile of cerebrospinal fluid using an untargeted biochemical profiling assay. Molecular Genetics and Metabolism, 121(2):83–90, 2017. ISSN 10967206. doi:10.1016/j.ymgme.2017.04.005. → pages 13, 49

Kirk L. Pappan, Adam D. Kennedy, Pilar Magoulas, Neil A. Hanchard, Qin Sun, and Sarah H. Elsea. Clinical Metabolomics to Segregate Aromatic Amino Acid Decarboxylase Deficiency From Drug-Induced Metabolite Elevations. Pediatric Neurology, 2017. ISSN 08878994. doi:10.1016/j.pediatrneurol.2017.06.014. URL http://linkinghub.elsevier.com/retrieve/pii/S0887899417304836. → pages 13, 15

Marcus J. Miller, Adam D. Kennedy, Andrea D. Eckhart, Lindsay C. Burrage, Jacob E. Wulff, Luke A D Miller, Michael V. Milburn, John A. Ryals, Arthur L. Beaudet, Qin Sun, V. Reid Sutton, and Sarah H. Elsea. Untargeted metabolomic analysis for the clinical screening of inborn errors of metabolism. Journal of Inherited Metabolic Disease, 38(6):1029–1039, 2015. ISSN 15732665. doi:10.1007/s10545-015-9843-7. → page 13

Karlien L.M. Coene, Leo A.J. Kluijtmans, Ed van der Heeft, Udo F.H.
Engelke, Siebolt de Boer, Brechtje Hoegen, Hanneke J.T. Kwast, Maartje van de Vorst, Marleen C.D.G. Huigen, Irene M.L.W. Keularts, Michiel F. Schreuder, Clara D.M. van Karnebeek, Saskia B. Wortmann, Maaike C. de Vries, Mirian C.H. Janssen, Christian Gilissen, Jasper Engel, and Ron A. Wevers. Next-generation metabolic screening: targeted and untargeted metabolomics for the diagnosis of inborn errors of metabolism in individual patients. Journal of Inherited Metabolic Disease, 2018. ISSN 15732665. doi:10.1007/s10545-017-0131-6. → pages 13, 28, 50

Christian Gieger, Ludwig Geistlinger, Elisabeth Altmaier, Martin Hrabé De Angelis, Florian Kronenberg, Thomas Meitinger, Hans Werner Mewes, H. Erich Wichmann, Klaus M. Weinberger, Jerzy Adamski, Thomas Illig, and Karsten Suhre. Genetics meets metabolomics: A genome-wide association study of metabolite profiles in human serum. PLoS Genetics, 4(11), 2008. ISSN 15537390. doi:10.1371/journal.pgen.1000282. → page 14

Andrew A. Hicks, Peter P. Pramstaller, Åsa Johansson, Veronique Vitart, Igor Rudan, Peter Ugocsai, Yurii Aulchenko, Christopher S. Franklin, Gerhard Liebisch, Jeanette Erdmann, Inger Jonasson, Irina V. Zorkoltseva, Cristian Pattaro, Caroline Hayward, Aaron Isaacs, Christian Hengstenberg, Susan Campbell, Carsten Gnewuch, A. Cecile J.W. Janssens, Anatoly V. Kirichenko, Inke R. König, Fabio Marroni, Ozren Polasek, Ayse Demirkan, Ivana Kolcic, Christine Schwienbacher, Wilmar Igl, Zrinka Biloglav, Jacqueline C.M. Witteman, Irene Pichler, Ghazal Zaboli, Tatiana I. Axenovich, Annette Peters, Stefan Schreiber, H. Erich Wichmann, Heribert Schunkert, Nick Hastie, Ben A. Oostra, Sarah H. Wild, Thomas Meitinger, Ulf Gyllensten, Cornelia M. Van Duijn, James F. Wilson, Alan Wright, Gerd Schmitz, and Harry Campbell. Genetic determinants of circulating sphingolipid concentrations in European populations. PLoS Genetics, 5(10), 2009. ISSN 15537390. doi:10.1371/journal.pgen.1000672.
→ page 14

Thomas Illig, Christian Gieger, Guangju Zhai, Werner Römisch-Margl, Rui Wang-Sattler, Cornelia Prehn, Elisabeth Altmaier, Gabi Kastenmüller, Bernet S Kato, Hans-Werner Mewes, Thomas Meitinger, Martin Hrabé de Angelis, Florian Kronenberg, Nicole Soranzo, H Erich Wichmann, Tim D Spector, Jerzy Adamski, and Karsten Suhre. A genome-wide perspective of genetic variation in human metabolism. Nat Genet, 42(2):137–141, 2010. ISSN 1546-1718. doi:ng.507[pii]10.1038/ng.507. URL http://www.pubmed.org/20037589 http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3773904/. → page 14

Karsten Suhre, Henri Wallaschofski, Johannes Raffler, Nele Friedrich, Robin Haring, Kathrin Michael, Christina Wasner, Alexander Krebs, Florian Kronenberg, David Chang, Christa Meisinger, H. Erich Wichmann, Wolfgang Hoffmann, Henry Völzke, Uwe Völker, Alexander Teumer, Reiner Biffar, Thomas Kocher, Stephan B. Felix, Thomas Illig, Heyo K. Kroemer, Christian Gieger, Werner Römisch-Margl, and Matthias Nauck. A genome-wide association study of metabolic traits in human urine. Nature Genetics, 43(6):565–569, 2011. ISSN 10614036. doi:10.1038/ng.837. → page 14

Aye Demirkan, Cornelia M. van Duijn, Peter Ugocsai, Aaron Isaacs, Peter P. Pramstaller, Gerhard Liebisch, James F. Wilson, Åsa Johansson, Igor Rudan, Yurii S. Aulchenko, Anatoly V. Kirichenko, A. Cecile J.W. Janssens, Ritsert C. Jansen, Carsten Gnewuch, Francisco S. Domingues, Cristian Pattaro, Sarah H. Wild, Inger Jonasson, Ozren Polasek, Irina V. Zorkoltseva, Albert Hofman, Lennart C. Karssen, Maksim Struchalin, James Floyd, Wilmar Igl, Zrinka Biloglav, Linda Broer, Arne Pfeufer, Irene Pichler, Susan Campbell, Ghazal Zaboli, Ivana Kolcic, Fernando Rivadeneira, Jennifer Huffman, Nicholas D. Hastie, Andre Uitterlinden, Lude Franke, Christopher S. Franklin, Veronique Vitart, Christopher P. Nelson, Michael Preuss, Joshua C. Bis, Christopher J. O’Donnell, Nora Franceschini, Jacqueline C.M. Witteman, Tatiana Axenovich, Ben A. Oostra, Thomas Meitinger, Andrew A.
Hicks, Caroline Hayward, Alan F. Wright, Ulf Gyllensten, Harry Campbell, and Gerd Schmitz. Genome-wide association study identifies novel loci associated with circulating phospho- and sphingolipid concentrations. PLoS Genetics, 8(2), 2012. ISSN 15537390. doi:10.1371/journal.pgen.1002490. → page 14

Taru Tukiainen, Johannes Kettunen, Antti J. Kangas, Leo Pekka Lyytikäinen, Pasi Soininen, Antti Pekka Sarin, Emmi Tikkanen, Paul F. O’reilly, Markku J. Savolainen, Kimmo Kaski, Anneli Pouta, Antti Jula, Terho Lehtimäki, Mika Kähönen, Jorma Viikari, Marja Riitta Taskinen, Matti Jauhiainen, Johan G. Eriksson, Olli Raitakari, Veikko Salomaa, Marjo Riitta Järvelin, Markus Perola, Aarno Palotie, Mika Ala-korpela, and Samuli Ripatti. Detailed metabolic and genetic characterization reveals new associations for 30 known lipid loci. Human Molecular Genetics, 21(6):1444–1455, 2012. ISSN 09646906. doi:10.1093/hmg/ddr581. → page 14

Johannes Kettunen, Taru Tukiainen, Antti-Pekka Sarin, Alfredo Ortega-Alonso, Emmi Tikkanen, Leo-Pekka Lyytikäinen, Antti J Kangas, Pasi Soininen, Peter Würtz, Kaisa Silander, Danielle M Dick, Richard J Rose, Markku J Savolainen, Jorma Viikari, Mika Kähönen, Terho Lehtimäki, Kirsi H Pietiläinen, Michael Inouye, Mark I McCarthy, Antti Jula, Johan Eriksson, Olli T Raitakari, Veikko Salomaa, Jaakko Kaprio, Marjo-Riitta Järvelin, Leena Peltonen, Markus Perola, Nelson B Freimer, Mika Ala-Korpela, Aarno Palotie, and Samuli Ripatti. Genome-wide association study identifies multiple loci influencing human serum metabolite levels. Nature genetics, 44(3):269–76, 2012. ISSN 1546-1718. doi:10.1038/ng.1073.
URLhttp://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=3605033{&}tool=pmcentrez{&}rendertype=abstract{%}5Cnhttp://dx.doi.org/10.1038/ng.1073.→ page 14So-Youn Shin, Eric B Fauman, Ann-Kristin Petersen, Jan Krumsiek, Rita Santos,Jie Huang, Matthias Arnold, Idil Erte, Vincenzo Forgetta, Tsun-Po Yang,Klaudia Walter, Cristina Menni, Lu Chen, Louella Vasquez, Ana M Valdes,Craig L Hyde, Vicky Wang, Daniel Ziemek, Phoebe Roberts, Li Xi, ElinGrundberg, Melanie Waldenberger, J Brent Richards, Robert P Mohney,Michael V Milburn, Sally L John, Jeff Trimmer, Fabian J Theis, John POverington, Karsten Suhre, M Julia Brosnan, Christian Gieger, GabiKastenm?ller, Tim D Spector, and Nicole Soranzo. An atlas of geneticinfluences on human blood metabolites. Nature Genetics, 46(6):543–550, 2014.ISSN 1061-4036. doi:10.1038/ng.2982. URLhttp://www.nature.com/ng/journal/v46/n6/full/ng.2982.html{%}5Cnhttp://www.nature.com/doifinder/10.1038/ng.2982. → pages 14, 15Harmen H. M. Draisma, Rene´ Pool, Michael Kobl, Rick Jansen, Ann-KristinPetersen, Anika A. M. Vaarhorst, Idil Yet, Toomas Haller, Aye Demirkan, To˜nuEsko, Gu Zhu, Stefan Bo¨hringer, Marian Beekman, Jan Bert van Klinken,Werner Ro¨misch-Margl, Cornelia Prehn, Jerzy Adamski, Anton J. M. de Craen,63Elisabeth M. van Leeuwen, Najaf Amin, Harish Dharuri, Harm-Jan Westra,Lude Franke, Eco J. C. de Geus, Jouke Jan Hottenga, Gonneke Willemsen,Anjali K. Henders, Grant W. Montgomery, Dale R. Nyholt, John B. Whitfield,Brenda W. Penninx, Tim D. Spector, Andres Metspalu, P. Eline Slagboom,Ko Willems van Dijk, Peter A. C. t Hoen, Konstantin Strauch, Nicholas G.Martin, Gert-Jan B. van Ommen, Thomas Illig, Jordana T. Bell, MassimoMangino, Karsten Suhre, Mark I. McCarthy, Christian Gieger, Aaron Isaacs,Cornelia M. van Duijn, and Dorret I. Boomsma. Genome-wide associationstudy identifies novel genetic variants contributing to variation in bloodmetabolite levels. Nature Communications, 6:7208, 2015. ISSN 2041-1723.doi:10.1038/ncomms8208. 
URLhttp://www.nature.com/doifinder/10.1038/ncomms8208. → page 14Eugene P. Rhee, Qiong Yang, Bing Yu, Xuan Liu, Susan Cheng, Amy Deik,Kerry A. Pierce, Kevin Bullock, Jennifer E. Ho, Daniel Levy, Jose C. Florez,Sek Kathiresan, Martin G. Larson, Ramachandran S. Vasan, Clary B. Clish,Thomas J. Wang, Eric Boerwinkle, Christopher J. O’Donnell, and Robert E.Gerszten. An exome array study of the plasma metabolome. NatureCommunications, 7:12360, 2016. ISSN 2041-1723.doi:10.1038/ncomms12360. URLhttp://www.nature.com/doifinder/10.1038/ncomms12360. → page 14Lining Guo, Michael V Milburn, John A Ryals, Shaun C Lonergan, Matthew WMitchell, Jacob E Wulff, Danny C Alexander, Anne M Evans, BrandiBridgewater, Luke Miller, Manuel L. Gonzalez-Garay, and C Thomas Caskey.Plasma metabolomic profiles enhance precision medicine for volunteers ofnormal health. Proceedings of the National Academy of Sciences, 112(35):E4901–E4910, 2015. ISSN 0027-8424. doi:10.1073/pnas.1508425112. URLhttp://www.pnas.org/lookup/doi/10.1073/pnas.1508425112. → pages 14, 15Akram Yazdani, Azam Yazdani, Xiaoming Liu, and Eric Boerwinkle.Identification of Rare Variants in Metabolites of the Carnitine Pathway byWhole Genome Sequencing Analysis. Genetic Epidemiology, 40(6):486–491,2016. ISSN 10982272. doi:10.1002/gepi.21980. → page 14B. Yu, A. H. Li, G. A. Metcalf, D. M. Muzny, A. C. Morrison, S. White, T. H.Mosley, R. A. Gibbs, and E. Boerwinkle. Loss-of-function variants influencethe human serum metabolome. Science Advances, 2(8):e1600800–e1600800,2016. ISSN 2375-2548. doi:10.1126/sciadv.1600800. URLhttp://advances.sciencemag.org/cgi/doi/10.1126/sciadv.1600800. → page 1464R Gauba, T G Natarajan, L Song, K Bhuvaneshwar, S Madhavan, and Y Gusev.Metabolomic and exome sequence analysis reveal novel molecular signaturesassociated with colorectal cancer relapse. BMC Proceedings, Conference:Beyond the Genome 2012 Boston, MA United States. C, 2012. 
URLhttp://ovidsp.ovid.com/ovidweb.cgi?T=JS{&}CSC=Y{&}NEWS=N{&}PAGE=fulltext{&}D=emed12{&}AN=71478275{%}5Cnhttp://imp-primo.hosted.exlibrisgroup.com/openurl/44IMP/44IMP{ }services{ }page?sid=OVID{&}isbn={&}issn=1753-6561{&}volume=6{&}issue={&}date=2012{&}title=BMC+Proceedings{&}atitle=Met. → pages 14, 15Edward M. Marcotte, Matteo Pellegrini, Michael J. Thompson, Todd O. Yeates,and David Eisenberg. A combined algorithm for genome-wide prediction ofprotein function. Nature, 1999. ISSN 00280836. doi:10.1038/47048. → page18Euan A. Adie, Richard R. Adams, Kathryn L. Evans, David J. Porteous, andBen S. Pickard. Speeding disease gene discovery by sequence based candidateprioritization. BMC Bioinformatics, 2005. ISSN 14712105.doi:10.1186/1471-2105-6-55. → page 18Stein Aerts, Diether Lambrechts, Sunit Maity, Peter Van Loo, Bert Coessens,Frederik De Smet, Leon Charles Tranchevent, Bart De Moor, Peter Marynen,Bassem Hassan, Peter Carmeliet, and Yves Moreau. Gene prioritizationthrough genomic data fusion. Nature Biotechnology, 2006. ISSN 10870156.doi:10.1038/nbt1203. → page 18Marc a van Driel, Koen Cuelenaere, Patrick P C W Kemmeren, Jack a MLeunissen, and Han G Brunner. A new web-based data mining tool for theidentification of candidate genes for human genetic disorders. Europeanjournal of human genetics : EJHG, 2003. ISSN 1018-4813.doi:10.1038/sj.ejhg.5200918. → page 18Tijl De Bie, Le´on Charles Tranchevent, Liesbeth M.M. van Oeffelen, and YvesMoreau. Kernel-based data fusion for gene prioritization. In Bioinformatics,2007. ISBN 1367-4811 (Electronic)\r1367-4803 (Linking).doi:10.1093/bioinformatics/btm187. → page 18Martin Oti, Martijn A. Huynen, and Han G. Brunner. Phenome connections,2008. ISSN 01689525. → page 18Ugo Ala, Rosario Michael Piro, Elena Grassi, Christian Damasco, LorenzoSilengo, Martin Oti, Paolo Provero, and Ferdinando Di Cunto. Prediction of65human disease genes by human-mouse conserved coexpression analysis. PLoSComputational Biology, 2008. 
ISSN 1553734X.doi:10.1371/journal.pcbi.1000043. → page 18J. Freudenberg and P. Propping. A similarity-based method for genome-wideprediction of disease-relevant human genes. In Bioinformatics, 2002. ISBN1367-4803 (Print). doi:10.1093/bioinformatics/18.suppl 2.S110. → page 18Carolina Perez-Iratxeta, Peer Bork, and Miguel A. Andrade. Association of genesto genetically inherited diseases using data mining. Nature Genetics, 2002.ISSN 10614036. doi:10.1038/ng895. → page 18Frances S Turner, Daniel R Clutterbuck, and Colin A M Semple. POCUS: mininggenomic sequence annotation to predict disease genes. Genome biology, 2003.ISSN 1474-760X. doi:10.1186/gb-2003-4-11-r75. → page 18Yongjin Li and Jagdish C. Patra. Genome-wide inferring gene-phenotyperelationship by walking on the heterogeneous network. Bioinformatics, 2010a.ISSN 13674803. doi:10.1093/bioinformatics/btq108. → pages 18, 19M. Oti and H. G. Brunner. The modular nature of genetic diseases, 2007. ISSN00099163. → page 18Hunter B. Fraser and Joshua B. Plotkin. Using protein complexes to predictphenotypic effects of gene mutation. Genome Biology, 2007. ISSN 14747596.doi:10.1186/gb-2007-8-11-r252. → page 18Shao Li, Lijiang Wu, and Zhongqi Zhang. Constructing biological networksthrough combined literature mining and microarray analysis: A LMMAapproach. Bioinformatics, 2006. ISSN 13674803.doi:10.1093/bioinformatics/btl363. → page 18Kyle J. Gaulton, Karen L. Mohlke, and Todd J. Vision. A computational system toselect candidate genes for complex human traits. Bioinformatics, 2007. ISSN13674803. doi:10.1093/bioinformatics/btm001. → page 18Chad L. Myers, Drew Robson, Adam Wible, Matthew A. Hibbs, Camelia Chiriac,Chandra L. Theesfeld, Kara Dolinski, and Olga G. Troyanskaya. Discovery ofbiological networks from diverse functional genomic data. Genome Biology,2005. ISSN 1474760X. doi:10.1186/gb-2005-6-13-r114. → pages 18, 19Sara Mostafavi, Debajyoti Ray, David Warde-Farley, Chris Grouios, and QuaidMorris. 
GeneMANIA: A real-time multiple association network integration66algorithm for predicting gene function. Genome Biology, 9(SUPPL. 1):1–15,2008. ISSN 14747596. doi:10.1186/gb-2008-9-s1-s4. → pages 18, 19Koji Tsuda, HyunJung Shin, and Bernhard Scho¨lkopf. Fast protein classificationwith multiple networks. Bioinformatics (Oxford, England), 2005. ISSN1367-4811. doi:10.1093/bioinformatics/bti1110. → page 18Minghua Deng, Ting Chen, and Fengzhu Sun. An Integrated Probabilistic Modelfor Functional Prediction of Proteins. Journal of Computational Biology, 2004.ISSN 1066-5277. doi:10.1089/1066527041410346. → page 18Sara Mostafavi and Quaid Morris. Fast integration of heterogeneous data sourcesfor predicting gene function with limited annotation. Bioinformatics, 2010.ISSN 13674803. doi:10.1093/bioinformatics/btq262. → page 18Gert R G Lanckriet, Tijl De Bie, Nello Cristianini, Michael I. Jordan, andWilliam Stafford Noble. A statistical framework for genomic data fusion.Bioinformatics, 2004. ISSN 13674803. doi:10.1093/bioinformatics/bth294. →page 18Lourdes Pen˜a-Castillo, Murat Tasan, Chad L. Myers, Hyunju Lee, Trupti Joshi,Chao Zhang, Yuanfang Guan, Michele Leone, Andrea Pagnani, Wan Kyu Kim,Chase Krumpelman, Weidong Tian, Guillaume Obozinski, Yanjun Qi, SaraMostafavi, Guan Ning Lin, Gabriel F. Berriz, Francis D. Gibbons, GertLanckriet, Jian Qiu, Charles Grant, Zafer Barutcuoglu, David P. Hill, DavidWarde-Farley, Chris Grouios, Debajyoti Ray, Judith A. Blake, Minghua Deng,Michael I. Jordan, William S. Noble, Quaid Morris, Judith Klein-Seetharaman,Ziv Bar-Joseph, Ting Chen, Fengzhu Sun, Olga G. Troyanskaya, Edward M.Marcotte, Dong Xu, Timothy R. Hughes, and Frederick P. Roth. A criticalassessment of Mus musculus gene function prediction using integrated genomicevidence. Genome Biology, 2008. ISSN 14747596.doi:10.1186/gb-2008-9-s1-s2. → pages 18, 19Paul Pavlidis, Jason Weston, Jinsong Cai, and William Stafford Noble. 
LearningGene Functional Classi cations from Multiple Data Types. JOURNAL OFCOMPUTATIONAL BIOLOGY, 2002. → page 18Daniel Lancour, Adam Naj, Richard Mayeux, Jonathan L. Haines, Margaret A.Pericak-Vance, Gerard C. Schellenberg, Mark Crovella, Lindsay A. Farrer, andSimon Kasif. One for all and all for One: Improving replication of geneticstudies through network diffusion. PLoS Genetics, 2018. ISSN 15537404.doi:10.1371/journal.pgen.1007306. → pages 18, 19, 2167R Sharan, I Ulitsky, and R Shamir. Network-based prediction of protein function.Molecular systems biology, 2007. ISSN 1744-4292. doi:10.1038/msb4100129.→ page 18J. Weston, A. Elisseeff, D. Zhou, C. S. Leslie, and W. S. Noble. Protein ranking:From local to global structure in the protein similarity network. Proceedings ofthe National Academy of Sciences, 2004. ISSN 0027-8424.doi:10.1073/pnas.0308067101. → pages 18, 31Jianzhen Xu and Yongjin Li. Discovering disease-genes by topological features inhuman protein-protein interaction network. Bioinformatics (Oxford, England),2006. ISSN 1367-4811. doi:10.1093/bioinformatics/btl467. → page 18Lude Franke, Harm van Bakel, Like Fokkens, Edwin D. de Jong, MichaelEgmont-Petersen, and Cisca Wijmenga. Reconstruction of a Functional HumanGene Network, with an Application for Prioritizing Positional CandidateGenes. The American Journal of Human Genetics, 2006. ISSN 00029297.doi:10.1086/504300. → page 18Kasper Lage, E. Olof Karlberg, Zenia M. Størling, Pa´ll I´ O´lason, Anders G.Pedersen, Olga Rigina, Anders M. Hinsby, Zeynep Tu¨mer, Flemming Pociot,Niels Tommerup, Yves Moreau, and Søren Brunak. A humanphenome-interactome network of protein complexes implicated in geneticdisorders. Nature Biotechnology, 2007. ISSN 10870156. doi:10.1038/nbt1295.→ page 18Xuebing Wu, Rui Jiang, Michael Q. Zhang, and Shao Li. Network-based globalinference of human disease genes. Molecular Systems Biology, 2008. ISSN17444292. doi:10.1038/msb.2008.27. 
→ pages 18, 19Christian von Mering, Lars J Jensen, Michael Kuhn, Samuel Chaffron, TobiasDoerks, Beate Kru¨ger, Berend Snel, and Peer Bork. STRING 7–recentdevelopments in the integration and prediction of protein interactions. Nucleicacids research, 2007. ISSN 1362-4962. doi:10.1093/nar/gkl825. → page 19Insuk Lee, U. Martin Blom, Peggy I. Wang, Jung Eun Shim, and Edward M.Marcotte. Prioritizing candidate disease genes by network-based boosting ofgenome-wide association data. Genome Research, 2011. ISSN 10889051.doi:10.1101/gr.118992.110. → pages 19, 21, 44Y. Itan, S.-Y. Zhang, G. Vogt, A. Abhyankar, M. Herman, P. Nitschke, D. Fried,L. Quintana-Murci, L. Abel, and J.-L. Casanova. The human gene connectomeas a map of short cuts for morbid allele discovery. Proceedings of the National68Academy of Sciences, 2013. ISSN 0027-8424. doi:10.1073/pnas.1218167110.→ page 19Casey S. Greene, Arjun Krishnan, Aaron K. Wong, Emanuela Ricciotti, Rene A.Zelaya, Daniel S. Himmelstein, Ran Zhang, Boris M. Hartmann, ElenaZaslavsky, Stuart C. Sealfon, Daniel I. Chasman, Garret A. Fitzgerald, KaraDolinski, Tilo Grosser, and Olga G. Troyanskaya. Understanding multicellularfunction and disease with human tissue-specific networks. Nature Genetics,2015. ISSN 15461718. doi:10.1038/ng.3259. → page 19Bolan Linghu, Evan S. Snitkin, Zhenjun Hu, Yu Xia, and Charles DeLisi.Genome-wide prioritization of disease genes and identification ofdisease-disease associations from an integrated human functional linkagenetwork. Genome Biology, 2009. ISSN 14747596.doi:10.1186/gb-2009-10-9-r91. → page 19Yongjin Li and Jagdish C. Patra. Integration of multiple data sources to prioritizecandidate genes using discounted rating system. BMC Bioinformatics, 2010b.ISSN 14712105. doi:10.1186/1471-2105-11-S1-S20. → page 19Tak Lee and Insuk Lee. AraGWAB: Network-based boosting of genome-wideassociation studies in Arabidopsis thaliana. Scientific Reports, 2018. ISSN20452322. doi:10.1038/s41598-018-21301-4. 
→ page 19Yu Qian, Søren Besenbacher, Thomas Mailund, and Mikkel Heide Schierup.Identifying disease associated genes by network propagation. BMC SystemsBiology, 2014. ISSN 17520509. doi:10.1186/1752-0509-8-S1-S6. → pages19, 21, 44Justin K. Huang, Daniel E. Carlin, Michael Ku Yu, Wei Zhang, Jason F.Kreisberg, Pablo Tamayo, and Trey Ideker. Systematic Evaluation of MolecularNetworks for Discovery of Disease Genes. Cell Systems, 2018. ISSN24054720. doi:10.1016/j.cels.2018.03.001. → page 22Luana Licata, Leonardo Briganti, Daniele Peluso, Livia Perfetto, MartaIannuccelli, Eugenia Galeota, Francesca Sacco, Anita Palma, Aurelio PioNardozza, Elena Santonico, Luisa Castagnoli, and Gianni Cesareni. MINT, themolecular interaction database: 2012 Update. Nucleic Acids Research, 2012.ISSN 03051048. doi:10.1093/nar/gkr930. → page 22T S Keshava Prasad, Renu Goel, Kumaran Kandasamy, ShivakumarKeerthikumar, Sameer Kumar, Suresh Mathivanan, Deepthi Telikicherla,Rajesh Raju, Beema Shafreen, Abhilash Venugopal, Lavanya Balakrishnan,69Arivusudar Marimuthu, Sutopa Banerjee, Devi S Somanathan, Aimy Sebastian,Sandhya Rani, Somak Ray, C J Harrys Kishore, Sashi Kanth, Mukhtar Ahmed,Manoj K Kashyap, Riaz Mohmood, Y L Ramachandra, V Krishna, B AbdulRahiman, Sujatha Mohan, Prathibha Ranganathan, Subhashri Ramabadran,Raghothama Chaerkady, and Akhilesh Pandey. Human Protein ReferenceDatabase–2009 update. Nucleic acids research, 2009. ISSN 1362-4962.doi:10.1093/nar/gkn892. → page 22C. Alfarano, C. E. Andrade, K. Anthony, N. Bahroos, M. Bajec, K. Bantoft,D. Betel, B. Bobechko, K. Boutilier, E. Burgess, K. Buzadzija, R. Cavero,C. D’Abreo, I. Donaldson, D. Dorairajoo, M. J. Dumontier, M. R. Dumontier,V. Earles, R. Farrall, H. Feldman, E. Garderman, Y. Gong, R. Gonzaga,V. Grytsan, E. Gryz, V. Gu, E. Haldorsen, A. Halupa, R. Haw, A. Hrvojic,L. Hurrell, R. Isserlin, F. Jack, F. Juma, A. Khan, T. Kon, S. Konopinsky, V. Le,E. Lee, S. Ling, M. Magidin, J. Moniakis, J. Montojo, S. Moore, B. Muskat,I. Ng, J. 
P. Paraiso, B. Parker, G. Pintilie, R. Pirone, J. J. Salama, S. Sgro,T. Shan, Y. Shu, J. Siew, D. Skinner, K. Snyder, R. Stasiuk, D. Strumpf,B. Tuekam, S. Tao, Z. Wang, M. White, R. Willis, C. Wolting, S. Wong,A. Wrong, C. Xin, R. Yao, B. Yates, S. Zhang, K. Zheng, T. Pawson, B. F.F.Ouellette, and C. W.V. Hogue. The Biomolecular Interaction NetworkDatabase and related tools 2005 update. Nucleic Acids Research, 2005. ISSN03051048. doi:10.1093/nar/gki051. → page 22L. Salwinski. The Database of Interacting Proteins: 2004 update. Nucleic AcidsResearch, 2004. ISSN 1362-4962. doi:10.1093/nar/gkh086. → page 22Andrew Chatr-Aryamontri, Rose Oughtred, Lorrie Boucher, Jennifer Rust,Christie Chang, Nadine K. Kolas, Lara O’Donnell, Sara Oster, ChandraTheesfeld, Adnane Sellam, Chris Stark, Bobby Joe Breitkreutz, Kara Dolinski,and Mike Tyers. The BioGRID interaction database: 2017 update. NucleicAcids Research, 2017. ISSN 13624962. doi:10.1093/nar/gkw1102. → page 22Minoru Kanehisa, Miho Furumichi, Mao Tanabe, Yoko Sato, and KanaeMorishima. KEGG: New perspectives on genomes, pathways, diseases anddrugs. Nucleic Acids Research, 2017. ISSN 13624962.doi:10.1093/nar/gkw1092. → page 22Antonio Fabregat, Steven Jupe, Lisa Matthews, Konstantinos Sidiropoulos, MarcGillespie, Phani Garapati, Robin Haw, Bijay Jassal, Florian Korninger, BruceMay, Marija Milacic, Corina Duenas Roca, Karen Rothfels, Cristoffer Sevilla,Veronica Shamovsky, Solomon Shorser, Thawfeek Varusai, Guilherme Viteri,Joel Weiser, Guanming Wu, Lincoln Stein, Henning Hermjakob, and Peter70D’Eustachio. The Reactome Pathway Knowledgebase. Nucleic AcidsResearch, 2018. ISSN 13624962. doi:10.1093/nar/gkx1132. → page 22S. Kerrien, Y. Faruque, B. Aranda, I. Bancarz, A. Bridge, C. Derow, E. Dimmer,M. Feuermann, A. Friedrichsen, R. Huntley, C. Kohler, J. Khadake, C. Leroy,A. Liban, C. Lieftink, L. Montecchi-Palazzi, S. Orchard, J. Risse, K. Robbe,B. Roechert, D. Thorneycroft, Y. Zhang, R. Apweiler, and H. 
Hermjakob.IntActopen source resource for molecular interaction data. Nucleic AcidsResearch, 2007. ISSN 03051048. doi:10.1093/nar/gkl958. → page 22Ingrid M. Keseler, Amanda Mackie, Martin Peralta-Gil, Alberto Santos-Zavaleta,Socorro Gama-Castro, Ce´sar Bonavides-Martı´nez, Carol Fulcher, Araceli M.Huerta, Anamika Kothari, Markus Krummenacker, Mario Latendresse, LuisMun˜iz-Rascado, Quang Ong, Suzanne Paley, Imke Schro¨der, Alexander G.Shearer, Pallavi Subhraveti, Mike Travers, Deepika Weerasinghe, Verena Weiss,Julio Collado-Vides, Robert P. Gunsalus, Ian Paulsen, and Peter D. Karp.EcoCyc: Fusing model organism databases with systems biology. NucleicAcids Research, 2013. ISSN 03051048. doi:10.1093/nar/gks1027. → page 22L J Jensen, M Kuhn, M Stark, S Chaffron, C Creevey, J Muller, T Doerks,P Julien, A Roth, M Simonovic, P Bork, and C von Mering. STRING 8–aglobal view on proteins and their functional interactions in 630 organisms.Nucleic Acids Res, 2009. ISSN 03051048. doi:10.1093/nar/gkn760. → page 22J. Michael Cherry, Caroline Adler, Catherine Ball, Stephen A. Chervitz, Selina S.Dwight, Erich T. Hester, Yankai Jia, Gail Juvik, Taiyun Roe, Mark Schroeder,Shuai Weng, and David Botstein. SGD: Saccharomyces genome database.Nucleic Acids Research, 1998. ISSN 03051048. doi:10.1093/nar/26.1.73. →page 23Joanna S. Amberger and Ada Hamosh. Searching online mendelian inheritance inman (OMIM): A knowledgebase of human genes and genetic phenotypes.Current Protocols in Bioinformatics, 2017. ISSN 1934340X.doi:10.1002/cpbi.27. → page 23Han Yan, Kavitha Venkatesan, John E. Beaver, Niels Klitgord, Muhammed A.Yildirim, Tong Hao, David E. Hill, Michael E. Cusick, Norbert Perrimon,Frederick P. Roth, and Marc Vidal. A genome-wide gene function predictionresource for Drosophila melanogaster. PLoS ONE, 2010. ISSN 19326203.doi:10.1371/journal.pone.0012139. → page 23Martin Kircher, Daniela M. Witten, Preti Jain, Brian J. O’roak, Gregory M.Cooper, and Jay Shendure. 
A general framework for estimating the relative71pathogenicity of human genetic variants. Nature Genetics, 2014. ISSN15461718. doi:10.1038/ng.2892. → page 27Gunnar Libiseller, Michaela Dvorzak, Ulrike Kleb, Edgar Gander, TobiasEisenberg, Frank Madeo, Steffen Neumann, Gert Trausinger, Frank Sinner,Thomas Pieber, and Christoph Magnes. IPO: a tool for automated optimizationof XCMS parameters. BMC Bioinformatics, 16(1):118, 2015. ISSN1471-2105. doi:10.1186/s12859-015-0562-8. URLhttp://www.biomedcentral.com/1471-2105/16/118. → page 28Dengyong Zhou, Olivier Bousquet, Thomas Navin Lal, Jason Weston, andBernhard Sch. Learning with Local and Global Consistency. Advances inNeural Information Processing Systems (NIPS), 2003. ISSN 00664162. doi:c.→ page 31Jan Krumsiek, Karsten Suhre, Thomas Illig, Jerzy Adamski, and Fabian J Theis.Gaussian graphical modeling reconstructs pathway reactions fromhigh-throughput metabolomics data. BMC systems biology, 5(1):21, 2011.ISSN 1752-0509. doi:10.1186/1752-0509-5-21. URLhttp://www.biomedcentral.com/1752-0509/5/21. → page 48Jo¨rg Martin Bu¨scher, Dominika Czernik, Jennifer Christina Ewald, Uwe Sauer,and Nicola Zamboni. Cross-platform comparison of methods for quantitativemetabolomics of primary metabolism. Analytical Chemistry, 81(6):2135–2143,2009. ISSN 00032700. doi:10.1021/ac8022857. → page 49Idil Yet, Cristina Menni, So Youn Shin, Massimo Mangino, Nicole Soranzo, JerzyAdamski, Karsten Suhre, Tim D. Spector, Gabi Kastenmu¨ller, and Jordana T.Bell. Genetic influences on metabolite levels: A comparison acrossmetabolomic platforms. PLoS ONE, 11(4), 2016. ISSN 19326203.doi:10.1371/journal.pone.0153672. → page 49Patrick Leuthold, Elke Schaeffeler, Stefan Winter, Florian Bu¨ttner, Ute Hofmann,Thomas E. Mu¨rdter, Steffen Rausch, Denise Sonntag, Judith Wahrheit, FalkoFend, Jo¨rg Hennenlotter, Jens Bedke, Matthias Schwab, and Mathias Haag.Comprehensive Metabolomic and Lipidomic Profiling of Human KidneyTissue: A Platform Comparison. 
Journal of Proteome Research, 16(2):933–944, 2017. ISSN 15353907. doi:10.1021/acs.jproteome.6b00875. →page 49George Nicholson, Mattias Rantalainen, Anthony D. Maher, Jia V. Li, DanielMalmodin, Kourosh R. Ahmadi, Johan H. Faber, Ingileif B. Hallgrı´msdo´ttir,72Amy Barrett, Henrik Toft, Maria Krestyaninova, Juris Viksna, Sudeshna GuhaNeogi, Marc Emmanuel Dumas, Ugis Sarkans, Bernard W. Silverman, PeterDonnelly, Jeremy K. Nicholson, Maxine Allen, Krina T. Zondervan, John C.Lindon, Tim D. Spector, Mark I. McCarthy, Elaine Holmes, Dorrit Baunsgaard,and Chris C. Holmes. Human metabolic profiles are stably controlled bygenetic and environmental variation. Molecular Systems Biology, 7, 2011.ISSN 17444292. doi:10.1038/msb.2011.57. → page 50Maryam Lotfi Shahreza, Nasser Ghadiri, Seyed Rasoul Mousavi, Jaleh Varshosaz,and James R. Green. Heter-LP: A heterogeneous label propagation algorithmand its application in drug repositioning. Journal of Biomedical Informatics,2017. ISSN 15320464. doi:10.1016/j.jbi.2017.03.006. → page 51Erhan Kenar, Holger Franken, Sara Forcisi, Kilian Wo¨rmann, Hans-UlrichHa¨ring, Rainer Lehmann, Philippe Schmitt-Kopplin, Andreas Zell, and OliverKohlbacher. Automated label-free quantification of metabolites from liquidchromatography-mass spectrometry data. Molecular & Cellular Proteomics, 13(1):348–359, 2014. ISSN 1535-9484. doi:10.1074/mcp.M113.031278. URLhttp://www.mcponline.org/content/13/1/348.full{%}5Cnpapers3://publication/doi/10.1074/mcp.M113.031278. 
Appendix A
Supporting Materials

See the table below for variant and clinical information for each patient included in this study.

| Patient number | Causative gene | Mode of inheritance | Publication ID and variant frequency (if available) | Variant information | Clinical description |
|---|---|---|---|---|---|
| 1 | CPT1A | Homozygous recessive, missense | PMID: 20696606, ClinVar: 65644, gnomAD: 3.246 × 10^-5 | CPT1A:g.68548130G>A (p.Pro479Leu) | Mild hypoglycemia on fasting, osteogenesis imperfecta-like phenotype (unexplained), short stature, congenital anomalies and dysmorphisms explained by methotrexate exposure during pregnancy |
| 2 | NANS | Compound heterozygous, missense | PMID: 27276562, PMID: 27213289, ClinVar: 235191, gnomAD: 0 | NANS:g.100843203C>T (p.Arg237Cys), g.100840588T>C (p.Tyr188His), Transcript: ENST00000210444 | Skeletal dysplasia, short stature and rhizomelia, neurodevelopmental arrest, progressive epileptic encephalopathy, dysmorphisms, congenital brain abnormalities and white matter lesions |
| 3 | SCN2A | de novo, splice donor variant | PMID: 27276562 and PMID: 26647175, ClinVar: NA, gnomAD: 0 | SCN2A:g.166188079+1G>A, Transcript: ENST00000283256 | Global developmental delay, seizures, ataxia, microcephaly, autism, abnormal CSF mono-amine neurometabolite profile |
| 4 | DYRK1A | de novo, missense | PMID: 27276562, ClinVar: NA, gnomAD: 0 | DYRK1A:g.38865404C>T (p.Ser346Phe), Transcript: ENST00000398960 | Neurodevelopmental delay, intractable epilepsy, absence seizures, microcephaly, mild dysmorphisms, hypoglycorrhachia |
| 5 | KIF5C | de novo, missense | ClinVar: NA, gnomAD: NA | KIF5C:g.149818513G>A (p.Val101Met, p.Val333Met, p.Val238Met, p.Val50Met), Transcript: ENST00000435030 | Seizures, behavioral and psychiatric abnormalities, aggression, low CSF MTHF (folate), mild cerebral atrophy, mild ataxia, mild dysmorphism |
| 6 | CNKSR2 | homozygous from one heterozygous parent, missense | ClinVar: NA, gnomAD: NA | CNKSR2:g.21581499C>T (p.Phe464Ser), Transcript: ENST00000543067 | Autonomic crises in infancy with hypertension, tachycardia, bladder retention, bowel dysmotility, frequent infections/sepsis responding to choline therapy, low acetylcholine levels, progressive cholinergic failure, Alzheimer's-type memory loss (on treatment with Donepezil), central apneas and on BiPAP at night |
| 7 | ECI1 | compound heterozygous, missense | PMID: 7586637, ClinVar: NA, gnomAD: 0.0071 | ECI1: g.2296927G>A (p.Thr17Met) and g.2290104 (p.Thr262Met), Transcript: ENST00000566379 | Spasticity and dystonia, microcephaly and cataracts, elevated methylmalonic acid, elevated malonic acid, enlargement of the ventricles, MRI signal changes in the basal ganglia and cerebellar atrophy |
| 8 | IDS and HAL | IDS: hemizygous, missense; HAL: compound heterozygous, splice donor | IDS: ClinVar: NA, gnomAD: 0; HAL: ClinVar: NA, gnomAD: 0.000599 | IDS: g.148571971G>A (p.Arg294Trp, p.Arg83Trp), Transcript: ENST00000340855; HAL: g.96371767A>G (p.Trp537Arg, p.Trp329Arg, p.Trp68Arg), ENST00000261208 and g.96374333C>T, Transcript: ENST00000261208 | Early onset global developmental delay, short stature, dysmorphisms, coarse facial features, severe behavioral disturbances, elevated keratan-sulphate, developmental regression, elevated glycosaminoglycans |
| 9 | CHRNA1 and DHFR | CHRNA1: de novo, missense; DHFR: de novo, missense | CHRNA1: ClinVar: NA, gnomAD: NA; DHFR: ClinVar: NA, gnomAD: NA | CHRNA1:g.175619063C>T (p.Ala167Thr/p.Ala142Thr), Transcript: ENST00000261007; DHFR:g.79950270C>G (p.Gln13His), Transcript: ENST00000439211 | Progressive global developmental delay and loss of skills, microcephaly, congenital hypotonia and wheelchair bound, dysmorphic features, severe feeding difficulties, growth retardation, demyelination on brain MRI scan, elevated lactates |
| 10 | ATP8A2 | Homozygous recessive, missense | NA | ATP8A2:g.26402265G>A (p.Ala897Thr), Transcript: ENST00000381655 | Hypotonia, ataxia since age 18 months |
| 11 | MYO5B | Compound heterozygous, missense | ClinVar: 0.00240, gnomAD: 0.001643 | MYO5B:g.47506839G>A (p.Arg344His), g.47566678G>C (p.Glu49Gln) | Intellectual disability, hyperkinetic movement disorder, sensorineural hearing loss, myopathy, malabsorption, failure to thrive, elevated urine threonine, serine and lysine, plasma amino acids suggesting lactic acidemia, elevated lactate |
| 12 | MAST1 and KCNQ2 | MAST1: de novo, missense; KCNQ2: de novo, in-frame deletion | ClinVar: NA; gnomAD: NA | MAST1:g.12975745G>A (p.Asp497Asn), Transcript: ENST00000251472; KCNQ2: g.62038511delGAG (p.Phe701del, p.Phe683del, p.Phe670del, p.Phe673del, p.Phe709del), Transcript: ENST00000354587 | Global developmental delay, microcephaly, hypotonia, failure to thrive, epilepsy, and delayed myelination |
| 13 | VGLL4 | homozygous recessive, missense | ClinVar: NA, gnomAD: 0.001073 | VGLL4: g.11600101G>T (p.Arg268Ser, p.Arg184Ser, p.Arg209Ser, p.Arg273Ser, p.Arg274Ser, p.Arg188Ser), Transcript: ENST00000430365 | Progressive dystonia, spasticity, query seizures, abnormalities of the neurotransmitters, intellectual disability, cerebral atrophy, low CSF HVA, 5HIAA, MTHF |

