UBC Theses and Dissertations

Text based methods for variant prioritization Gottlieb, Michael 2017

Text Based Methods for Variant Prioritization

by

Michael Gottlieb

B.A. Asian Studies & Psychology, The University of British Columbia, 2009

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF

Master of Science

in

THE FACULTY OF GRADUATE AND POSTDOCTORAL STUDIES
(Bioinformatics)

The University of British Columbia
(Vancouver)

January 2017

© Michael Gottlieb, 2017

Abstract

Despite improvements in sequencing technologies, DNA sequence variant interpretation for rare genetic diseases remains challenging. In a typical workflow for the Treatable Intellectual Disability Endeavor in B.C. (TIDE BC), a geneticist examines variant calls to establish a set of candidate variants that explain a patient’s phenotype. Even with a sophisticated computation pipeline for variant prioritization, they may need to consider hundreds of variants. This typically involves literature searches on individual variants to determine how well they explain the reported phenotype, which is a time-consuming process. In this work, text analysis based variant prioritization methods are developed and assessed for their capacity to distinguish causal variants within exome analysis results for a reference set of individuals with metabolic disorders.

Preface

Contributions

The division of TIDE cases into training and test sets was proposed by Drs. Wyeth Wasserman and Maja Tarailo-Graovac. The cases in each set were selected by Drs. Maja Tarailo-Graovac and Allison Matthews.

The idea to do a simulated test of Synverita’s capabilities for variant prioritization was conceived by Dr. Steven Jones.

The original idea to use Synverita for variant prioritization was my own. The methods described in this work were developed, implemented and tested by me with advice from Emily Hindalong.

Publications

There are no publications based on this work at this time.

Table of Contents

Abstract
Preface
Table of Contents
List of Tables
List of Figures
Acknowledgments
1 Introduction
  1.1 Background
  1.2 Automated Variant Prioritization
    1.2.1 Properties of Specific Sequence Alterations
    1.2.2 Features of Genes
    1.2.3 HPO Leveraging
    1.2.4 Text Mining
  1.3 Toward Improved Text-Based Variant Prioritization
2 Methods
  2.1 Term Mapping
  2.2 Gene-Phenotype Association Scoring
  2.3 Raw Data Association Strength
  2.4 Predicted Association Strength
  2.5 Weighting Methods
  2.6 Score Aggregation
  2.7 Score Ranking
  2.8 Comparison to Other Methods
    2.8.1 RVIS Scoring
    2.8.2 CADD Scoring
    2.8.3 FindZebra Scoring
    2.8.4 Score Comparison
    2.8.5 Ensembling
3 Results
  3.1 Simulation
  3.2 Application of text ranking to clinical case examples
    3.2.1 Training Set
    3.2.2 Test Set
4 Discussion
  4.1 Limitations
  4.2 Future Work
  4.3 Conclusion
Bibliography
A Tables

List of Tables

Table 3.1 Unweighted ranked scoring results.
Table 3.2 Spearman correlation of ranked scores of best candidate variants across 5 metrics.
Table A.1 Ranked scoring results for best candidate variants for all weighting methods. Correlation is Spearman’s ρ of all ranked scores of all variants in a case and their ranked number of unique terms that they associate with in Synverita’s raw data. Maximum is 1 in all cases.

List of Figures

Figure 1.1 Overview workflow for establishing best candidate variants used by TIDE BC.
Figure 1.2 Proposed modification to existing workflow. Text based variant prioritization produces a ranked candidate variant list that aims to reduce the number of passes through the two cycles.
Figure 3.1 Histogram of disease profile size as represented by number of CUIDs. Median of 9 CUIDs per disease profile.
Figure 3.2 Histogram of candidate gene profile size as represented by number of CUIDs with non-zero $\mathrm{avg}_j$ scores. Median of 1981 CUIDs per disease profile.
Figure 3.3 Distribution of disease phenotype association rank scores of candidate genes in their candidate gene profile with and without $w_p$ weighting. Performance is almost identical between the two approaches.
Figure 3.4 Ranked scores of all variant genes in each case in training set using $\mathrm{best}_j$ aggregation. "Variant higher" and "variant lower" are variants ranked higher or lower than their case’s lowest ranked best candidate variant.
Figure 3.5 Distribution of ranked scores for raw data aggregation methods across all 37 individuals in the training set.
Figure 3.6 Distribution of ranked scores for prediction aggregation methods across all 37 individuals in the training set.
Figure 3.7 Ranked scores of all variant genes in each case in training set using $\mathrm{rep}_p$ aggregation.
Figure 3.8 Ranked scores of each of the prediction aggregation methods compared to their respective raw data aggregation methods. Points above the black line rank higher using raw data. Those below the black line rank higher using prediction.
Figure 3.9 Distribution of ranked scores of best candidate variants in training set using FindZebra, CADD, RVIS, $\mathrm{rep}_j$ and $\mathrm{rep}_p$.
Figure 3.10 Estimated sensitivity and specificity of 31 classifiers trained on all combinations of 5 metrics.
Figure 3.11 Top 5 classifiers based on estimated sensitivity. Shapes denote the metrics used. The symbols used to denote which metrics a classifier used are c for CADD, z for FindZebra, j for $\mathrm{rep}_j$ and p for $\mathrm{rep}_p$.
Figure 3.12 Top 5 classifiers based on estimated specificity. Shapes denote the metrics used. The symbols used to denote which metrics a classifier used are c for CADD, z for FindZebra, j for $\mathrm{rep}_j$ and p for $\mathrm{rep}_p$.
Figure 3.13 Distributions of ranked scores for test set using 9 scoring methods.
Figure 3.14 Sensitivity and specificity of all classifications of test set variants.
Figure 3.15 Top 6 classifiers based on test set classification sensitivity. Shapes denote the metrics used. The symbols used to denote which metrics a classifier used are c for CADD, z for FindZebra, j for $\mathrm{rep}_j$ and p for $\mathrm{rep}_p$.
Figure 3.16 Top 6 classifiers based on test set classification specificity. Shapes denote the metrics used. The symbols used to denote which metrics a classifier used are c for CADD, z for FindZebra, j for $\mathrm{rep}_j$ and p for $\mathrm{rep}_p$.
Figure 3.17 Distribution of random forest ranked probability across all 17 individuals in the test set. "Variant higher" are variants ranked higher than their case’s lowest ranked best candidate variant. "Variant lower" are variants ranked lower than their case’s lowest ranked best candidate variant.
Figure 3.18 Distribution of best candidate variant ranking of four best individual methods and best random forest classifier, $RF_{r,c,p,j}$.

Acknowledgments

I would like to thank Drs. Wyeth Wasserman and Steven Jones for supporting me in pursuing this project. They provided me the resources of the CMMT and GSC, enabling the work to move rapidly. I thank Dr. Anna Lehman for serving on my committee, providing insightful feedback and helping me ground this work with a clinical perspective.

My thanks to Dr. Allison Matthews for providing me with the whole genome and exome data I needed and feedback on my progress, and to Jake Lever for granting me access and support in using Synverita and updating Synverita for the needs of variant prioritization.

Special thanks go to Dr. Maja Tarailo-Graovac, whose guidance and support ensured that I stayed on track in all aspects of this work. I owe my success to her patience and incredible ability to make sense of medical genomics.

I would like to thank my wife, Emily Hindalong, for encouraging me and refining my work through many discussions.

I acknowledge the Canadian Institutes of Health Research for funding my studies.

Chapter 1

Introduction

1.1 Background

Rare Mendelian diseases are caused in most cases by DNA sequence variations within one or two alleles of a gene in a patient. Despite individual disease rarity, it is estimated that 8% of people worldwide have a rare disease [22] [6]. The altered genes for about 50% of rare diseases have been discovered, but few have known treatments [40] [11].

Within this thesis, the research focuses upon a subset of rare diseases characterized by deficits in both intellectual development and metabolite processing. Intellectual developmental disorders (IDD) are a group of disorders with heterogeneous etiologies, including Mendelian diseases, that occur in about 2-3% of children worldwide [13] [38]. They are marked by lifelong disturbances in multiple cognitive domains that first manifest early in life [34]. IDD are amongst the costliest disorders due to the degree of impairment and lifelong duration [27]. Inborn errors of metabolism (IEM) are a category of Mendelian diseases in which a single protein’s functioning is disrupted in a metabolic pathway. IEMs can be difficult to diagnose, as they may present in patients as diverse phenotypes, for instance as immunodeficiency [10] or IDD [33]. Individuals with the same IEM may present different disease phenotypes due to other genetic differences and interactions with other metabolic pathways [4]. Conversely, IEMs caused by distinct genetic mechanisms may present with the same phenotype.
For example, at present over 90 IEMs are known to cause IDD [44].

IEMs are among the Mendelian diseases most likely to be treatable, if not the most likely [3]. IDDs due to IEM are the largest group of treatable IDDs with genetic causes [43] [44]. The prototypic example of IDD due to IEM is phenylketonuria, which occurs when mutations in the PAH gene disrupt the metabolism of phenylalanine. This in turn allows dietary phenylalanine to accumulate, eventually reaching toxic levels and causing IDD. As phenylketonuria is well known, it is often screened for, and individuals with phenylketonuria are able to develop normally by adhering to a low phenylalanine diet [2].

For an individual with an IEM that can lead to IDD, timely discovery of the causal variant(s) is needed for effective treatment and management [43][41]. However, for many IEMs, causal genes are either unknown or the observed phenotypes could arise from defects in any of multiple genes. Due to the inexact relationship between disease phenotype and genotype of an individual, single gene tests can be inconclusive even when the disease is caused by a known disease gene variant [38]. Early diagnosis of treatable conditions can be important to avoid tissue damage from toxic compounds.

Causal variant discovery of rare Mendelian diseases has been revolutionized by next generation sequencing (NGS). Analysis of whole genome sequencing (WGS) and targeted sequencing of protein coding regions, known as whole exome sequencing (WES), has revealed the causal variants for many rare diseases [11] [7]. WGS and WES provide a solution to the problem of targeted testing as they check all genes for mutations at a cost that is approaching that of single gene tests [38] [29].

The broad application of WES and WGS to metabolic disorders has been shown to have a high diagnostic success rate [38] [39]. In the Treatable Intellectual Disability Endeavor in B.C. (TIDE BC), the DNA sequences from a patient are analyzed using a semi-automated bioinformatics pipeline [41] to establish a list of variations that may explain the patient’s symptoms. Based on availability, genomic data from close relatives may be included, most commonly as a trio (mother-father-child). Inclusion of close relatives allows for variations to be deprioritized if present in healthy individuals (for dominant models), or for the inheritance patterns of recessive candidates to be confirmed. The results are then reviewed by an interdisciplinary team, leveraging their respective expertise, to establish the top causal candidate(s). The experts consider factors such as variant inheritance pattern, the relationship between the gene function and patient phenotype (and known phenotypes of individuals with disruptions to the gene) and pathogenicity estimates [41].

This workflow, outlined in Figure 1.1, is executed in two cycles. After receiving the output of automated analysis, the bioinformatics team semi-manually reviews each variant and relevant literature to form a hypothesis that explains the patient’s phenotype. The number of variants per patient can range from dozens in a typical WES analysis of a trio to a few hundred in the event of a WES analysis restricted to the patient. As manual analysis is time consuming, the number of variants to consider and the order in which they are considered can result in diagnosis and treatment delays for individuals with treatable IDD.

Figure 1.1: Overview workflow for establishing best candidate variants used by TIDE BC.

1.2 Automated Variant Prioritization

Automated variant prioritization is a set of techniques that aim to establish a ranking amongst a set of variants. The goal of variant prioritization in this case is to rank variants that are more likely to explain a patient’s phenotype higher than those that are not. Such approaches are expected to accelerate diagnosis.
There have been numerous prioritization procedures introduced in the literature, which can be roughly categorized into four broad (and occasionally overlapping) categories: features of genes; properties of the specific sequence altered; Human Phenotype Ontology (HPO) leveraging; or text mining.

1.2.1 Properties of Specific Sequence Alterations

Approaches that prioritize variants based on properties of the specific sequence alteration do so without taking phenotype into consideration. This includes features such as the type of variation and its predicted or known effect on its transcript, as in the SNP Effect Predictor [26]. This category also includes prioritization based on how frequently specific variants appear in the general population from sources such as ExAC [25]. Given an observed disease phenotype and a list of variants, these methods will predict how likely a variant is to be disease-related or deleterious. However, these methods will not provide any measure of how well a variant fits the observed disease phenotype.

Combined Annotation-Dependent Depletion (CADD): CADD [21] is a generalized approach for scoring single-nucleotide variant and insertion-deletion events. It combines multiple measures and annotations in order to produce a single score that estimates the likelihood of a specific variant being deleterious. This alleviates the knowledge-biases that individual metrics of pathogenicity suffer from. CADD scores have been pre-calculated for all possible single-nucleotide variants in the hg19 reference, which enables quick ranking of variants with little computational overhead.

1.2.2 Features of Genes

Approaches that prioritize variants using features of genes also do not take phenotype into consideration. These approaches include features such as selective pressure or degree of functional diversity as in RVIS [30].
They share the same limitations as the preceding category as they do not estimate how likely a variant gene is to be involved in a particular disease phenotype.

Residual Variation Intolerance Score (RVIS): RVIS [30] is a gene level approach for scoring variants. Unlike CADD, which can produce multiple scores for different variants in one gene, RVIS produces one score per gene. RVIS aims to measure the degree to which a gene can tolerate functional variation. This is estimated by regressing on the ratio of commonly observed missense and stop-gained mutations to the total number of observed variants in the gene. Genes that have fewer commonly observed missense and stop-gained mutations than expected are considered less tolerant of variation and have lower RVIS scores. That is, non-synonymous variants in a gene with a low RVIS are more likely to be deleterious.

FLAGS: Unlike RVIS and CADD, FLAGS [35] focuses on identifying features of genes in which non-synonymous mutations are less likely to be pathogenic in rare diseases. FLAGS is a ranking system that deprioritizes genes that have more associated publications, longer coding sequences, more paralogs and less selective pressure as defined by the dN/dS ratio [45]. Rather than filtering genes out, FLAGS identifies genes that should be interpreted with caution when examining their variants for pathogenicity.

Work in variant prioritization is ongoing and extends well beyond the projects mentioned above. For example, recent work has investigated methods for prioritization based on biological networks [28], but these have met with limited success, likely due to gaps in our current understanding of biological networks [31].

1.2.3 HPO Leveraging

The Human Phenotype Ontology (HPO) [24] is a hierarchically structured set of human phenotypic abnormalities.
Its structure supports a number of computational approaches to phenotype comparisons for rare disease diagnostics [23][18][20][12][19]. Computational approaches to phenotype comparisons have also been used for variant prioritization in conjunction with other methods. An in-depth review is presented in Smedley et al. [37].

Exomiser: Exomiser [36] is a variant prioritization system that combines several methods of phenotype comparison with a variety of gene/variant features and measures. Given a patient phenotype and variants, it ranks the variants based on how well they match established disease phenotypes for that gene and related genes. Unlike other HPO-leveraging variant prioritization systems, Exomiser also considers established disease phenotypes of model organisms. Exomiser is available as a standalone tool with modest requirements and requires VCF files and phenotypes described using the HPO [9].

1.2.4 Text Mining

Text mining approaches use automatic analysis of relevant literature in order to rank links between disease phenotypes and genes. The two examples below function like a search engine over primary literature and curated sources. Both use the Unified Medical Language System (UMLS) [8] for biomedical term identification. The UMLS provides a consolidated identification system that spans biomedical concepts across many biomedical vocabularies.

FindZebra: FindZebra [14] is a web-based tool focused on the rare disease domain. FindZebra uses a small corpus that only includes online sources relevant to the rare disease domain, such as Online Mendelian Inheritance in Man and the Genetic and Rare Diseases Information Center. Using a Google-like user interface, FindZebra accepts free text biomedical queries.
It then ranks the documents in its corpus based on relevance to the query and clusters them based on UMLS concepts. It also provides the option to rank UMLS concepts related to the query, which can help researchers prioritize genes for variant analysis.

FindZebra’s small corpus gives it good specificity as it mines from highly curated sites. This means that it does not include low quality associations that might be available in a general web search. However, this negatively affects its sensitivity because curated sources can be slow to update and less complete. Nonetheless, it provides a simple application programming interface (API) that allows it to be used in the context of automated analysis pipelines.

Beegle: Beegle [15] is a web-based tool built on top of the Endeavour gene prioritization tool [1][42]. Like FindZebra, Beegle uses a Google-like search interface and accepts free text user queries in the biomedical domain. Beegle then finds a list of associated MEDLINE abstracts via PubMed, including 10,000 PubMed IDs. It uses these articles to construct a UMLS profile from approximately 67,000 UMLS concepts. Beegle calculates similarity between the query and genes by counting abstract level co-occurrences. It generates a strength of association by comparing the number of co-occurrences to the total number of abstracts in which each term occurs. It also compares the number of concepts associated with each term as a second metric of similarity. The result of this stage is a list of genes known to be associated with a given query term. The user selects some number of these known associated genes and selects a number of candidate genes that they are interested in. This allows Beegle to produce a list of genes it predicts to be associated with the query.

Beegle’s larger corpus and association predictions offer greater sensitivity than FindZebra at the cost of specificity. However, Beegle’s computations are performed on-the-fly, which can result in long delays.
Furthermore, it does not provide an API, which makes it difficult to use for automated analysis pipelines.

1.3 Toward Improved Text-Based Variant Prioritization

Individual cases that are referred for TIDE BC’s WES or WGS bioinformatics analysis have inconclusive findings from single gene and panel tests. For a given case, automated variant analysis may result in dozens of variant genes to consider with respect to a clinician provided free text deep phenotype. Given the many variant genes to consider and the rarity of the disease phenotype, successful variant prioritization in TIDE BC’s bioinformatics analysis workflow should require minimal running time and user interaction and be sufficiently sensitive to link a variant gene to the observed disease phenotype.

RVIS and CADD are currently incorporated into TIDE BC’s variant analysis pipeline. These scores require no user input to generate and are produced quickly. However, they do not provide any indication as to how well a variant or variant gene fits with the observed disease phenotype. Exomiser, Beegle, and FindZebra all score how well variant genes fit a disease phenotype. Exomiser, especially in its leveraging of model organisms, promises more sensitive performance. Its requirement of HPO terms poses a challenge, as referring clinicians provide patient phenotype using free text. Beegle and FindZebra can both score phenotype gene associations and accept free text phenotypes. Beegle’s large corpus, vocabulary and predictive capacity appear to be a good fit for this application, but its lack of back-end access and long on-the-fly computation prevent it from being incorporated into the current pipeline. FindZebra’s simple API allows it to be integrated into the current pipeline, but its small corpus size may result in reduced sensitivity.

A candidate solution to this problem is a current biomedical text mining project at Canada’s Michael Smith Genome Sciences Centre, Synverita.
Synverita is under development to predict future biomedical discoveries. It does this by building a matrix of sentence level co-occurrences between UMLS concepts across the entirety of PubMed’s Open Access Subset. This includes co-occurrences of almost 300,000 concepts across 1.2 million full text articles and over 24 million abstracts. Synverita produces predictions using singular value decomposition [16] on its mined matrix of sentence level concept co-occurrences.

Synverita may provide a highly sensitive measure of how well a variant gene fits a disease phenotype, given its use of predictions and large corpus. Its large vocabulary provides a means to map clinician free text phenotype to UMLS concepts. As Synverita’s text mining results are precomputed and it allows for backend access, it can produce scores between phenotype UMLS concepts and variant genes efficiently.

The research presented here aims to augment variant prioritization using Synverita. Its raw data and predictions are used to prioritize variants for pediatric patients with IDD due to IEM. This targets the workflow outlined in Figure 1.1. Figure 1.2 shows the updated workflow that includes text mining based variant prioritization. The performance at variant prioritization is compared with FindZebra, RVIS and CADD as well as with an ensemble of all methods.

Figure 1.2: Proposed modification to existing workflow. Text based variant prioritization produces a ranked candidate variant list that aims to reduce the number of passes through the two cycles.

Chapter 2

Methods

Given a set of phenotype terms describing a patient, we seek to return a ranking of genes containing rare variations where genes ranked highly are more likely to be causal for the observed phenotype (or a subset of the phenotype terms). The returned ranking is based on a strength of association score between each gene and the entire set of phenotype terms.
Both phenotype terms and gene symbols are represented using UMLS concept unique identifiers (CUIDs), which are described below.

In broad strokes, strength of association scores are generated between a set of gene symbols $G = \{g_1, g_2, ..., g_m\}$ and a set of disease phenotype terms $D = \{P_1, P_2, ..., P_r\}$ in the following steps:

1. Map each $g \in G$ to a CUID.
2. Map each $P \in D$ to a set of all relevant CUIDs, $P = \{p_1, p_2, ..., p_n\}$.
3. Calculate an association score, $s_{g,P}$, between each $g \in G$ and each $P \in D$ as follows:
   (a) Calculate an association score, $s_{g,p_i}$, between $g$ and each $p_i \in P$.
   (b) Aggregate the scores from (a), weighting each $s_{g,p_i}$ according to the weighting scheme.
4. Calculate an association score, $s_{g,D}$, between each gene $g \in G$ and $D$ by summing over the scores from (3).
5. Rank all genes in $G$ according to $s_{g,D}$.

It is possible to mix-and-match strategies for phenotype-to-CUIDs mapping, association scoring, and term weighting. These are described in detail below.

2.1 Term Mapping

UMLS CUIDs provide standard labels for synonymous noun phrases that are found in the biomedical literature as well as a relational structure based on semantic categories. An example of this is the CUID C0085997, which is defined as

C0085997 T048 child development dis specific|child development disorders, specific|developmental delay dis|developmental delay disorder|developmental delay disorders

In this example, the CUID refers to five different pipe-delimited synonyms that belong to category T048, which contains concepts related to mental or behavioural dysfunction.

Gene symbols are each mapped to a single UMLS CUID in the genes category in two passes through Synverita’s wordlist. In the first pass, each gene symbol is matched to the CUID that contains the gene symbol followed by the string ’ gene’ in its synonyms list. If no matches are found in the first pass for a gene symbol, the match is attempted again without the ’ gene’ string.
This is done to resolve cases in which a gene symbol maps to multiple CUIDs representing homologs or alleles.

Phenotype terms do not always correspond one-to-one to entries in Synverita’s wordlist. In the event that no exact match for a phenotype term is found in Synverita’s wordlist, a semi-automated approach is used to identify all potentially relevant terms. An example of this is the oft-reported “global developmental delay,” which is not within Synverita’s wordlist. In this case, “developmental delay” is used instead, and all CUIDs with “developmental delay” in their synonym list are used. Management of multiple CUIDs per phenotype term is reflected in the scoring methods.

2.2 Gene-Phenotype Association Scoring

Synverita provides two broad approaches to generating scores of association between two CUIDs. The first is via its raw data matrix. Synverita’s raw data matrix captures the number of sentences in the biomedical literature in which both CUIDs of a pair appear. This matrix is large and sparse. That is, it contains many 0 values for pairs of CUIDs. Synverita’s raw data provides a quantitative measure of how strongly two CUIDs associate based on the current state of the literature. However, this scoring makes no assumptions with respect to the directionality of association: a sentence counts whether it affirms or negates a link. For example, the following two hypothetical sentences would both be considered an association between Gene1 and SymptomA:

“Mutations in Gene1 have been found to cause SymptomA.”
“Mutations in Gene1 do not cause SymptomA.”

The second approach uses Synverita’s predictions to score associations between pairs of CUIDs. Performing singular value decomposition on Synverita’s raw data matrix infers many of the missing values based on the values of similar terms.
The idea is that if there is a strong association between Gene1 and SymptomA and between Gene1 and Gene2, Synverita would predict an association between Gene2 and SymptomA.

2.3 Raw Data Association Strength

The strength of association between two CUIDs, $t_1$ and $t_2$, in Synverita’s raw data is calculated by using the Jaccard similarity coefficient, $js$. The value of $js$ is calculated using the number of sentences in which $t_1$ and $t_2$ appear together ($S_{t_1 \cap t_2}$) and the total number of sentences that each appears in ($S_{t_1}$ and $S_{t_2}$). This is similar to how association strength is calculated in Beegle [15] but is based on sentence-level co-occurrence rather than abstract-level co-occurrence. The value of $js$ can range from 0 when $t_1$ and $t_2$ never occur in the same sentence to 1 when they always occur together. $js$ is defined as

\[ js(t_1, t_2) = \frac{S_{t_1 \cap t_2}}{S_{t_1} + S_{t_2} - S_{t_1 \cap t_2}}. \]

2.4 Predicted Association Strength

Predicted association strength between two terms, $ps$, is calculated using the results of singular value decomposition of the binarized form of Synverita’s data matrix. That is, the non-zero values in Synverita’s data matrix are replaced with 1 to create a binary matrix that contains values of only 0 and 1. Singular value decomposition is performed using the graphlab implementation with best parameters found in Lever et al. (2016).

2.5 Weighting Methods

Four different weighting methods are used for both raw data and predicted association calculations between gene CUID $g$ and phenotype CUID $p$. All four weighting methods are based on the total number of unique terms that $g$ and $p$ co-occur with, $A_g$ and $A_p$ respectively. The four weighting methods, $w_g$, $w_p$, $w_{g+p}$ and $w_{g*p}$, are defined as follows:

\[ w_g = \log_{10}(A_g + 1) \]
\[ w_p = \log_{10}(A_p + 1) \]
\[ w_{g+p} = \log_{10}(A_g + A_p + 1) \]
\[ w_{g*p} = \log_{10}(A_g \cdot A_p + 1) \]

Weighting is applied to both predicted and raw data association scores by division.
Given the generalized association score $s \in \{j_s, p_s\}$ and weight $w \in \{w_g, w_p, w_{g+p}, w_{g*p}\}$, the generalized weighted score, $w_s$, is defined as

\[ w_s = \frac{s}{w}. \]

2.6 Score Aggregation

Four different methods are used for scoring association strength between a gene term $g$ and a phenotype term $P$ with CUIDs $\{p_1, p_2, \ldots, p_n\}$. Average scoring between $g$ and $P$ averages $s(g, p_i)$ across all $p_i$ in $P$ and is defined as

\[ avg_s(g, P) = \frac{\sum_{i=1}^{n} s(g, p_i)}{n}. \]

Forgiving scoring averages only those values of $s(g, p_i)$ that are non-zero and is defined as

\[ for_s(g, P) = \frac{\sum_{i=1}^{n} s(g, p_i)}{\max\left(1, \sum_{i=1}^{n} \begin{cases} 1 & \text{if } s(g, p_i) > 0 \\ 0 & \text{otherwise} \end{cases}\right)}. \]

Best scoring selects the highest single value of $s(g, p_i)$ and is defined as

\[ best_s(g, P) = \max_{1 \le i \le n} s(g, p_i). \]

The fourth scoring method, representative scoring, requires the definition of a representative CUID of phenotype term $P$, $p_r$. The representative CUID of a phenotype term is the CUID that is most closely associated with the other phenotype terms in its case, based on Synverita’s raw data. For case $C$ with phenotype terms $\{P_1, \ldots, P_m\}$, the representative CUID of $P_j$, $p_{r_j}$, is defined as

\[ p_{r_j} = \operatorname*{argmax}_{p_{ij} \in P_j} \sum_{k=1}^{m} \begin{cases} 0 & \text{if } k = j \\ avg_j(p_{ij}, P_k) & \text{otherwise} \end{cases}. \]

With a phenotype term’s representative CUID defined, the representative scoring for gene $g$ and phenotype term $P$ with representative CUID $p_r$ is defined as

\[ rep_s(g, P) = s(g, p_r). \]

Synverita’s predictions are never exactly zero, which means that forgiving scoring degenerates to average scoring for predictions. This results in a total of seven methods for scoring the strength of association between $g$ and $P$: four scoring methods for raw data, $\{avg_j, for_j, best_j, rep_j\}$, and three scoring methods for predictions, $\{avg_p, best_p, rep_p\}$. Given case $C$, the total strength of association between gene $g$ and $C$ is calculated for scoring method $meth_s \in \{avg_j, for_j, best_j, rep_j, avg_p, best_p, rep_p\}$ as

\[ \sum_{j=1}^{m} meth_s(g, P_j). \]

2.7 Score Ranking

For each set of gene symbols and phenotype terms, scores were ranked for each gene.
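The average, forgiving and best aggregation methods of Section 2.6, together with the per-case sum over phenotype terms, can be sketched as follows. This is an illustrative Python sketch under the assumption that each phenotype term maps to at least one CUID; the function names and example scores are hypothetical. Representative scoring is left out because it depends on case-level raw data.

```python
from typing import Callable, Sequence


def avg_score(scores: Sequence[float]) -> float:
    """Average scoring: mean of s(g, p_i) over all CUIDs of a phenotype term."""
    return sum(scores) / len(scores)


def forgiving_score(scores: Sequence[float]) -> float:
    """Forgiving scoring: mean over the non-zero scores only.

    The denominator is max(1, number of positive scores), so a term with
    no co-occurrences scores 0 instead of dividing by zero.
    """
    positive = sum(1 for s in scores if s > 0)
    return sum(scores) / max(1, positive)


def best_score(scores: Sequence[float]) -> float:
    """Best scoring: the single highest s(g, p_i)."""
    return max(scores)


def case_score(per_term_scores: Sequence[Sequence[float]],
               method: Callable[[Sequence[float]], float]) -> float:
    """Total association between a gene and a case: the chosen aggregation
    method applied to each phenotype term, summed over all terms."""
    return sum(method(scores) for scores in per_term_scores)


# Hypothetical scores for one gene against a case with two phenotype terms:
# the first term maps to three CUIDs, the second to two.
case = [[0.0, 0.5, 0.25], [0.0, 0.0]]
for method in (avg_score, forgiving_score, best_score):
    print(method.__name__, case_score(case, method))
```

In this example the forgiving total is higher than the average total because the zero-valued CUID of the first term is ignored, which is the intended effect when a phenotype term maps to many CUIDs that never co-occur with the gene.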
For a given gene $g_x$ with score $s_x$, in a case with genes $G = \{g_1, \ldots, g_m\}$ and vector of corresponding scores $S = \{s_1, \ldots, s_m\}$, the ranked score $rs_x$ of $g_x$ is the proportion of the other scores in $S$ that $s_x$ exceeds. This is calculated as

\[ rs_x = \frac{\sum_{i=1}^{m} \begin{cases} 1 & \text{if } s_x > s_i \\ 0 & \text{otherwise} \end{cases}}{m - 1}. \]

2.8 Comparison to Other Methods

Three established methods for variant prioritization were compared to a subset of the methods introduced above. Two of these, RVIS and CADD, are currently in use in the standard TIDE BC workflow. FindZebra is not currently used in the standard TIDE BC workflow, but it can be used to score variants with respect to a rare disease phenotype. The following sections describe how these scores were obtained.

2.8.1 RVIS Scoring

The pipeline described in Tarailo-Graovac et al. [41] is used to generate RVIS version 2 scores. If a gene does not have an RVIS value, 1 is used instead, which corresponds to the lowest possible RVIS priority. The value is then subtracted from 1 to produce a non-inverted score, so that higher scores indicate higher priority.

2.8.2 CADD Scoring

The pipeline described in Tarailo-Graovac et al. [41] is used to generate CADD scores. If a gene has multiple CADD scores, the maximum value across all of its variants in a given case is used. If a gene has no CADD scores, 0 is used instead, which is the lowest possible CADD score.

2.8.3 FindZebra Scoring

FindZebra scores are calculated by initializing every gene in a case to a score of 0. FindZebra’s API is then queried with each of a case’s phenotype terms. Each time a gene appears in a query’s associated gene list, the association score is added to the gene’s score.

2.8.4 Score Comparison

The above three methods are used to create ranked scores as described in Section 2.7. These ranked scores are then compared with the best performing raw data and prediction methods.

2.8.5 Ensembling

Methods are ensembled by training random forests on multiple methods using the caret R package [17].
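The ranked score of Section 2.7 and the gene-level scoring rules of Sections 2.8.1 to 2.8.3 can be sketched as follows. This is a hedged Python sketch rather than the pipeline’s code: all function names and example values are hypothetical, RVIS values are assumed to lie between 0 and 1, the single-gene case is an assumption, and the FindZebra helper works on already-collected query results instead of calling the FindZebra API.

```python
from typing import Dict, List, Optional, Sequence


def ranked_scores(scores: Sequence[float]) -> List[float]:
    """Ranked score per Section 2.7: for each score, the fraction of the
    other m - 1 scores that it strictly exceeds. Ties are not exceeded."""
    m = len(scores)
    if m < 2:
        # Assumption: a single-gene case is not covered in the text.
        return [0.0] * m
    return [sum(1 for other in scores if s > other) / (m - 1) for s in scores]


def rvis_score(rvis: Optional[float]) -> float:
    """1 - RVIS, with a missing RVIS value replaced by 1 (Section 2.8.1)."""
    return 1.0 - (1.0 if rvis is None else rvis)


def cadd_score(variant_cadd: Sequence[float]) -> float:
    """Maximum CADD across a gene's variants in a case, or 0 when the gene
    has no CADD scores (Section 2.8.2)."""
    return max(variant_cadd, default=0.0)


def findzebra_scores(genes: Sequence[str],
                     query_results: Sequence[Dict[str, float]]) -> Dict[str, float]:
    """Sum association scores per gene across phenotype-term queries
    (Section 2.8.3). Each element of query_results is a hypothetical
    {gene: association score} mapping standing in for one API response."""
    totals = {gene: 0.0 for gene in genes}
    for result in query_results:
        for gene, score in result.items():
            if gene in totals:
                totals[gene] += score
    return totals


# Hypothetical case with four variant genes and their association scores.
genes = ["GENE_A", "GENE_B", "GENE_C", "GENE_D"]
raw = [0.02, 0.10, 0.0, 0.02]
print(dict(zip(genes, ranked_scores(raw))))
```

Because the comparison is strict, a gene with no co-occurrences in a case where other genes also score 0 receives a ranked score of 0, which is how the 0 rankings reported in the results arise.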
Trained random forests are tuned using 20 repeats of 10-fold cross-validation with post-sampling downsampling to control for class imbalance. Performance is then optimized on the receiver operating characteristic (ROC) metric.

Chapter 3
Results

This chapter presents the details of the tests performed using the preceding methods and their results.

3.1 Simulation

As Synverita is a general purpose biomedical text mining tool (as opposed to a rare disease focused resource), the first test was to compare the contents of its raw data matrix to FindZebra. FindZebra is a rare disease specific text mining project, so it provides a benchmark for Synverita’s domain applicability. Because FindZebra is able to produce relevant disease CUIDs given a gene symbol, term selection and scoring methods for multiple CUIDs per phenotype term were not used. Similarly, prediction scoring was not used, as FindZebra does not produce predictive scores.

A set of 922 known Mendelian disease gene symbols from Tarailo-Graovac et al. (submitted) was queried using FindZebra’s API (on September 12, 2016). The CUIDs of the diseases returned for each of these genes were collected in order to create a disease profile for each gene, resulting in 922 disease profiles. The 922 disease profiles contained between zero and ten CUIDs with a median of nine CUIDs (Figure 3.1).
Five disease profiles contained no CUIDs and were excluded from further analysis.

Synverita’s raw association matrix was used to create a profile of candidate genes for each disease profile by selecting CUIDs from the UMLS genes semantic category, $G$, that had a non-zero $avg_j$ association score with that disease profile. Given a gene symbol $g_x$ and its corresponding disease profile $d_x$, its corresponding candidate gene profile $gp_x$ is defined as

\[ gp_x = \{ g \in G \mid avg_j(g, d_x) > 0 \lor g = g_x \}. \]

The 917 disease profiles that contained at least one CUID produced 909 candidate gene profiles with at least one CUID based on Synverita’s raw data matrix. The candidate gene profiles ranged from a minimum of one CUID to a maximum of 13480 CUIDs with a median of 1981 CUIDs (Figure 3.2). The distribution of gene profile size is presented in Figure 4.

For each gene in a candidate gene profile, the $avg_j$ association score between it and its corresponding disease profile was calculated with and without $w_p$ weighting. This produced two vectors of association scores between each disease profile and each gene in its candidate gene profile. These vectors were then ranked as outlined in Section 2.7. Weighted ranked scores performed similarly to ranked scores without weighting (Figure 3.3). Weighting produced a median ranked score of 0.9972, whereas the median ranked score without weighting was 0.9970. No meaningful correlation was found between either set of ranked scores and the size of the gene profile or disease profile. Both methods ranked the target gene in the top 10% of its gene profile over 90% of the time. These results suggest that Synverita’s raw data matrix is consistent with FindZebra with respect to associating rare disease genes with their established set of disorders.

Figure 3.1: Histogram of disease profile size as represented by number of CUIDs. Median of 9 CUIDs per disease profile.

Figure 3.2: Histogram of candidate gene profile size as represented by number of CUIDs with non-zero $avg_j$ scores.
Median of 1981 CUIDs per candidate gene profile.

Figure 3.3: Distribution of disease phenotype association rank scores of candidate genes in their candidate gene profile with and without $w_p$ weighting. Performance is almost identical between the two approaches.

3.2 Application of text ranking to clinical case examples

Two groups of TIDE BC cases were run through the current analysis pipeline. Both groups had established best candidate variants and diagnoses. The first set, the training set, was used to develop Synverita based variant prioritization methods. The second set, the test set, was subsequently analyzed using the methods developed on the training set in order to establish the generalizability of these methods. For both groups, pipeline results and clinician-reported phenotypes were used to produce raw data and predicted association scores, as well as CADD, RVIS and FindZebra scores as outlined above. All association, weighting and ensembling methods were used on the training set. The best performing methods were then used on the test and discovery sets.

3.2.1 Training Set

Data

The training set was composed of candidate variant lists and clinical reports from 41 individuals from 38 TIDE families with established best candidate variants recently analysed in Tarailo-Graovac et al. [41]. Three sibling pairs shared common best candidate variants. Five of the individuals had digenic etiology while the remaining 36 had monogenic etiology. The automatically generated candidate variant lists of four individuals, one with digenic etiology, did not contain the established candidate variants (indicating that manual steps such as reducing expected allele frequency had been performed to determine the candidates) and were excluded from further analysis.
A further two individuals with digenic etiology were missing one of the two established best candidate variants (again due to customized minor-allele frequency settings), which resulted in these variants being excluded.

The remaining 37 individuals had between 30 and 352 variant genes in their candidate variant lists with a median of 49 variant genes. The total number of variants ranged between 38 and 440 with a median of 62. Clinician reports for these patients contained a median of 12 phenotype terms and ranged between 5 and 35 terms. Clinician reported phenotype terms mapped to a median of 343 CUIDs per case with a minimum of 42 and a maximum of 2303.

Unweighted Ranking

Ranked scores for each score aggregation method outlined in Section 2.6 were first computed without weighting in order to establish a performance baseline. The success of a method was evaluated by its median and minimum ranking of a best candidate variant across all cases.

Raw Data. All four raw data aggregation methods shared the same minimum and maximum performance. Best scoring produced the highest median result, ranking the established best candidate variant gene ahead of over 92% of the other variant genes in half of the training cases (Figure 3.4). All methods produced rankings of 0 for several best candidate variants. This occurs when there are no co-occurrences between a gene and any of the phenotype CUIDs in Synverita’s data matrix. Representative aggregation, $rep_j$, is most prone to rank a best candidate variant 0, as it only uses one CUID per phenotype term (Figure 3.5). However, $rep_j$ also has the largest number of cases in which the best candidate is ranked ahead of all other variants. Comparison of best candidate variant rankings to rankings of all other variants produced significant Kolmogorov-Smirnov test p-values (as implemented in R [32]) for all four methods (Table 3.1).

Figure 3.4: Ranked scores of all variant genes in each case in training set using $best_j$ aggregation.
Variant higher and variant lower are variants ranked higher or lower than their case’s lowest ranked best candidate variant.

Figure 3.5: Distribution of ranked scores for raw data aggregation methods across all 37 individuals in the training set.

Prediction. The prediction based aggregation methods had worse median performance and higher minimum performance than their respective raw data counterparts (Table 3.1). Representative aggregation, $rep_p$, had the best median performance (Figure 3.7). Notably, representative scoring always places a case’s best candidate variant ahead of at least 36.7% of the other variants in the case. This indicates that inclusion of the approach in a pipeline could reduce the burden on expert reviewers. Comparison of best candidate variant rankings to rankings of all other variants produced significant Kolmogorov-Smirnov test p-values for all three methods (Table 3.1).

Method       Median   Mean   Min    KS p-value
Raw Data
  avg_j      0.89     0.80   0.00   6.24e−09
  best_j     0.92     0.76   0.00   7.85e−07
  for_j      0.88     0.77   0.00   1.98e−08
  rep_j      0.87     0.72   0.00   3.87e−10
Prediction
  avg_p      0.80     0.74   0.03   1.24e−05
  best_p     0.78     0.74   0.29   3.37e−06
  rep_p      0.83     0.77   0.37   3.25e−06

Table 3.1: Unweighted ranked scoring results.

Figure 3.6: Distribution of ranked scores for prediction aggregation methods across all 37 individuals in the training set.

Figure 3.7: Ranked scores of all variant genes in each case in training set using $rep_p$ aggregation.

Unweighted Raw Data vs. Prediction Ranking. The rankings of each raw data aggregation method were compared to those of its corresponding prediction method. Raw data methods using best and average aggregation ranked the best candidate variant higher than the corresponding prediction methods more often than not (Figure 3.8).
Representative selection was divided more evenly, with the raw data ranking the best candidate higher 49% of the time, predictions ranking the best candidate variant higher 41% of the time, and the two producing equal rankings 10% of the time.

Figure 3.8: Ranked scores of each of the prediction aggregation methods compared to their respective raw data aggregation methods. Points above the black line rank higher using raw data. Those below the black line rank higher using prediction.

Weighted Ranking

In order to quantify the effects of each weighting method on each aggregation method, all combinations of each aggregation method and each weighting method were used to rank candidate variants in the training set. Additionally, Spearman’s ρ was calculated between the ranked score of each variant gene $g$ and the number of unique terms it co-occurs with in Synverita’s raw data matrix ($A_g$ in the previous chapter). That is, the correlation was calculated for all genes, not just best candidate variants. This was done in order to determine whether weighting is able to moderate the effects of $A_g$ on ranked score.

Of all of the weighting methods, $w_{g*p}$ is notable as it produces the highest median performance of 0.95 with $rep_j$ and the lowest Spearman’s ρ value of 0.566 (see Table A.1 for full results). Similarly, it produces the highest median performance of all prediction methods with $rep_p$ and the lowest correlation. Based on the training data, $w_{g*p}$ produces the best median ranking of best candidate variants, and the ranked scores it produces are least dependent on how well studied a gene is. Accordingly, $w_{g*p}$ weighted $rep_j$ and $rep_p$ are used in all following association calculations.

Comparison To Other Methods

The rankings of the best performing prediction and raw data aggregation methods, $rep_p$ and $rep_j$, were compared to rankings produced by CADD, RVIS and FindZebra. CADD and $rep_j$ each had the best performance for 17 variant genes, RVIS for nine genes, $rep_p$ for seven genes and FindZebra for seven genes.
Compared to these three existing methods, Synverita based methods perform well. $rep_p$ had the highest minimum performance and $rep_j$ tied CADD in the number of cases in which it performed best.

Rankings for each of these methods show low correlation with each other (see Table 3.2). The highest correlation is between the two Synverita based methods, $rep_j$ and $rep_p$, with a value of 0.55. In all, the degree of correlation is limited, which supports the idea that they are measuring different underlying properties.

Method       RVIS     FindZebra   rep_p    rep_j
CADD         0.010    0.172       −0.191   −0.190
RVIS         1.000    −0.013      0.250    0.122
FindZebra    −0.013   1.000       0.118    0.156
rep_p        0.250    0.118       1.000    0.546

Table 3.2: Spearman correlation of ranked scores of best candidate variants across 5 metrics.

Figure 3.9: Distribution of ranked scores of best candidate variants in training set using FindZebra, CADD, RVIS, $rep_j$ and $rep_p$.

Ensembling

In order to test the synergism of these methods, thirty-one random forest classifiers were trained, one for each possible non-empty combination of the above five metrics ($2^5 - 1 = 31$). Performance for each classifier was estimated using 10-fold cross-validation as described in Section 2.8.5. Unsurprisingly, estimated sensitivity increased with the number of metrics trained on (see Figure 3.10). The most sensitive classifier was trained using CADD, $rep_j$ and $rep_p$, followed by the classifier trained on all metrics. All of the most sensitive classifiers used CADD and $rep_j$ (see Figure 3.11). The most specific classifiers incorporated either $rep_j$ or FindZebra (see Figure 3.12). Inclusion of FindZebra tended to result in worse sensitivity, and FindZebra was estimated to be the least important metric in the random forest cross-validation.

Figure 3.10: Estimated sensitivity and specificity of 31 classifiers trained on all combinations of 5 metrics.

Figure 3.11: Top 5 classifiers based on estimated sensitivity. Shapes denote the metrics used.
The symbols used to denote which metrics a classifier used are c for CADD, r for RVIS, z for FindZebra, j for $rep_j$ and p for $rep_p$.

Figure 3.12: Top 5 classifiers based on estimated specificity. Shapes denote the metrics used. The symbols used to denote which metrics a classifier used are c for CADD, r for RVIS, z for FindZebra, j for $rep_j$ and p for $rep_p$.

3.2.2 Test Set

The test set was composed of candidate variant lists and clinical reports from 20 individuals from 20 TIDE families. Like the individuals in the training set, the individuals in the test set each had one or more established best candidate variants. Twelve individuals had monogenic etiology, seven had digenic etiology and one individual had trigenic etiology. Three individuals were not included in this analysis as their candidate variant lists did not include any of their established best candidate variants (again due to customized minor-allele frequency settings). Two of these had monogenic etiology and one had digenic etiology. A fourth individual’s candidate variant list was missing one of its two established best candidate variants. This left a total of 22 established best candidate variants across 17 individuals.

The remaining 17 individuals had between 31 and 482 variant genes in their candidate variant lists with a median of 70 variant genes. The total number of variants ranged between 38 and 697 with a median of 89. Clinician reports contained a median of 13 phenotype terms and ranged between 2 and 26 terms. Clinician reported phenotype terms mapped to a median of 518 CUIDs per case with a minimum of 38 and a maximum of 2125.

Method Comparison

In order to test the generalizability of Synverita based variant prioritization, ranks were produced using all 7 Synverita aggregation methods with $w_{g*p}$ weighting. Consistent with the training set, the highest median ranking was achieved by $rep_j$ for the raw data approaches.
For the prediction methods, $avg_p$ produced a median ranking slightly higher than $rep_p$ but with lower minimum and average performance. Of the existing methods, CADD performed the best and FindZebra had the lowest median performance. Interestingly, CADD ranking has the highest minimum of any metric for the test set. This may be due to the fact that cases in the training set were analysed before CADD scores became part of the TIDE BC bioinformatics workflow, whereas the test set cases are more recent. These findings are mostly consistent with the results of the training set (see Figure 3.13).

Figure 3.13: Distributions of ranked scores for test set using 9 scoring methods.

Ensembling

To test the generalizability of the classifiers trained on combinations of metrics in the training set, all test set cases were classified using the 31 random forest classifiers trained in Section 3.2.1. The sensitivity of classification did not correlate with the number of metrics as strongly as estimated for the training set (see Figure 3.14). Six classifiers achieved equal sensitivity, all of which incorporated $rep_p$ (see Figure 3.15). The classifier with the highest specificity of these six, $RF_{r,c,p,j}$, incorporated all metrics but FindZebra. $RF_{r,c,p,j}$ was also fifth best with respect to specificity (see Figure 3.16). $RF_{r,c,p,j}$ ranked probability had better median and mean performance than any individual metric (see Figure 3.18). $RF_{r,c,p,j}$ ranked best candidate variants in the top 20% of their case 77% of the time (see Figure 3.17). These results suggest that prioritizing variants using Synverita’s predictions and raw data is generalizable and that its inclusion in variant prioritization is more effective than RVIS and CADD alone.

Figure 3.14: Sensitivity and specificity of all classification of test set variants.

Figure 3.15: Top 6 classifiers based on test set classification sensitivity. Shapes denote the metrics used.
The symbols used to denote which metrics a classifier used are c for CADD, r for RVIS, z for FindZebra, j for $rep_j$ and p for $rep_p$.

Figure 3.16: Top 6 classifiers based on test set classification specificity. Shapes denote the metrics used. The symbols used to denote which metrics a classifier used are c for CADD, r for RVIS, z for FindZebra, j for $rep_j$ and p for $rep_p$.

Figure 3.17: Distribution of random forest ranked probability across all 17 individuals in the test set. Variant higher are variants ranked higher than their case’s lowest ranked best candidate variant. Variant lower are variants ranked lower than their case’s lowest ranked best candidate variant.

Figure 3.18: Distribution of best candidate variant ranking of four best individual methods and best random forest classifier, $RF_{r,c,p,j}$.

Chapter 4
Discussion

Despite continued advances in next generation sequencing, determining which variants are most likely to cause a particular rare disease phenotype remains a challenge. This is due to a variety of reasons, including the inexact relationship between genotype and phenotype and the number of rare diseases with unknown genetic bases. In this study I implemented a text-based approach for genetic variant prioritization, using the Synverita system, and assessed the utility of the approach to identify causal genetic alterations for patients with rare metabolic disorders. The approach performed well relative to widely used ranking methods.

Tools such as RVIS and CADD provide useful measures of potential gene and variant deleteriousness but fail to capture the link between variant genes and prominent clinical features. The above-discussed tools that focus on the link between phenotype and disease each have shortcomings. Beegle and Exomiser have high thresholds for use. Beegle has no backend and performs all text mining on the fly, which imposes a large time penalty. Exomiser imposes a less significant threshold by requiring a patient phenotype to be described in HPO terms.
While the HPO continues to gain ground, it is not universally used. This means that in some cases, such as at TIDE BC, Exomiser requires changing existing clinician workflows. Finally, FindZebra provides an exceptionally easy-to-use API and user interface but lacks the sensitivity to prioritize variant genes with respect to disease phenotype in many TIDE BC cases.

Synverita is useful for prioritizing variants in cases of IDD caused by IEM. Ranking variant genes based on strength of association to disease phenotype is effective using both Synverita’s raw data and predictions. The two approaches contribute different strengths: raw data ranking is more likely to place a best candidate variant in the top 10% of all variants in a case than any other tested method. Prediction methods provide a floor for performance, with $rep_p$ never ranking an established variant lower than 31% in any case tested in both the training and test sets. Synverita’s large vocabulary and corpus and use of predictions provide enhanced sensitivity when compared with FindZebra, granting it potential to enhance diagnostics for rare genetic diseases of unknown etiology.

The strengths of Synverita are complementary to other measures of gene and variant deleteriousness currently used in TIDE BC’s bioinformatics analysis pipelines. RVIS, CADD, $rep_j$ and $rep_p$ scores do not show strong correlation with one another, ostensibly measuring different features that correlate with the probability that a variant gene is implicated in disease. The combination of these methods is generalizable. The classifiers trained on the training set and tested on the test set ranked established candidate variant genes ahead of 90% of other variant genes in over half of all cases in the test set.

4.1 Limitations

Synverita’s contribution to variant prioritization is tempered by three key limitations. First, the version of Synverita used in this work does not include the CUIDs of the terms in the HPO.
These were excluded as described in Lever et al. (submitted) to reduce run time, as these terms frequently appear in Synverita’s corpus. The lack of HPO terms in Synverita’s corpus prevents us from leveraging HPO’s structure. Second, variant prioritization using Synverita currently uses a semi-automated approach to map clinician reported phenotype terms to UMLS CUIDs in Synverita’s vocabulary. This semi-automated approach is a barrier to use and may introduce noise due to unpredictability of user (mis)behaviour. Finally, Synverita does not yet have an update schedule or public access protocol, although work is currently underway to establish both.

4.2 Future Work

How terms are mapped to CUIDs is a key aspect of variant prioritization using Synverita. The comparison of different aggregation methods suggests that term mapping may significantly affect how well these methods work. Further work will investigate this. Future versions of Synverita will include all terms in the UMLS metathesaurus. This will allow for HPO-based term selection. Additionally, it will allow us to use existing tools to map phenotype terms to CUIDs, such as MetaMap [5]. These expansions will be compared to and combined with Exomiser.

Collaboration with examining physicians will continue along multiple avenues. Performance may be improved by allowing clinicians to directly select the CUIDs and to provide weights for each phenotype term that describes a case. Ongoing collaboration may also see the development of a visual case explorer that builds upon visual phenotype comparison tools such as PhenoBlocks [19]. Perhaps most importantly, future work will use Synverita and ensemble methods on cases with no established best candidate variant.

4.3 Conclusion

Phenotype based variant prioritization using Synverita’s raw data and predictions is effective for cases of IDD caused by IEM. It is complementary to measures of potential gene and variant deleteriousness.
Classifiers trained using RVIS and CADD produce generalizable results. With modest modifications to Synverita, these methods can feasibly be incorporated into TIDE BC’s pipeline and augment variant prioritization for cases of IDD caused by IEM.

Bibliography

[1] S. Aerts, D. Lambrechts, S. Maity, P. Van Loo, B. Coessens, F. De Smet, L.-C. Tranchevent, B. De Moor, P. Marynen, B. Hassan, P. Carmeliet, and Y. Moreau. Gene prioritization through genomic data fusion. Nature biotechnology, 24(5):537–544, 2006. ISSN 1087-0156. doi:10.1038/nbt1203.

[2] N. Al Hafid and J. Christodoulou. Phenylketonuria: a review of current and future treatments. Translational pediatrics, 4(4):304–17, 2015. ISSN 2224-4344. doi:10.3978/j.issn.2224-4336.2015.10.07. URL http://www.ncbi.nlm.nih.gov/pubmed/26835392 http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=PMC4728993.

[3] M. Alfadhel, K. Al-Thihli, H. Moubayed, W. Eyaid, and M. Al-Jeraisy. Drug treatment of inborn errors of metabolism: a systematic review. Archives of disease in childhood, 98(6):454–61, 2013. ISSN 1468-2044. doi:10.1136/archdischild-2012-303131. URL http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=3693126&tool=pmcentrez&rendertype=abstract.

[4] C. A. Argmann, S. M. Houten, J. Zhu, and E. E. Schadt. A Next Generation Multiscale View of Inborn Errors of Metabolism. Cell Metabolism, 23(1):13–26, 2016. ISSN 19327420. doi:10.1016/j.cmet.2015.11.012. URL http://dx.doi.org/10.1016/j.cmet.2015.11.012.

[5] A. R. Aronson and F.-M. Lang. An overview of MetaMap: historical perspective and recent advances. Journal of the American Medical Informatics Association : JAMIA, 17(3):229–36, 2010. ISSN 1527-974X. doi:10.1136/jamia.2009.002733. URL http://www.ncbi.nlm.nih.gov/pubmed/20442139 http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=PMC2995713.

[6] P. A. Baird, T. W. Anderson, H. B. Newcombe, and R. B. Lowry.
Genetic disorders in children and young adults: a population study. American journal of human genetics, 42(5):677–693, 1988. ISSN 0002-9297.

[7] C. L. Beaulieu, J. Majewski, J. Schwartzentruber, M. E. Samuels, B. A. Fernandez, F. P. Bernier, M. Brudno, B. Knoppers, J. Marcadier, D. Dyment, S. Adam, D. E. Bulman, S. J. M. Jones, D. Avard, M. T. Nguyen, F. Rousseau, C. Marshall, R. F. Wintle, Y. Shen, S. W. Scherer, J. M. Friedman, J. L. Michaud, and K. M. Boycott. FORGE Canada consortium: Outcomes of a 2-year national rare-disease gene-discovery project. American Journal of Human Genetics, 94(6):809–817, 2014. ISSN 15376605. doi:10.1016/j.ajhg.2014.05.003.

[8] O. Bodenreider. The Unified Medical Language System (UMLS): integrating biomedical terminology. Nucleic Acids Research, 32(90001):267D–270, 2004. ISSN 1362-4962. doi:10.1093/nar/gkh061. URL http://nar.oxfordjournals.org/lookup/doi/10.1093/nar/gkh061.

[9] W. P. Bone, N. L. Washington, O. J. Buske, D. R. Adams, J. Davis, D. Draper, E. D. Flynn, M. Girdea, R. Godfrey, G. Golas, C. Groden, J. Jacobsen, S. Köhler, E. M. J. Lee, A. E. Links, T. C. Markello, C. J. Mungall, M. Nehrebecky, P. N. Robinson, M. Sincan, A. G. Soldatos, C. J. Tifft, C. Toro, H. Trang, E. Valkanas, N. Vasilevsky, C. Wahl, L. A. Wolfe, C. F. Boerkoel, M. Brudno, M. A. Haendel, W. A. Gahl, and D. Smedley. Computational evaluation of exome sequence data using human and model organism phenotypes improves diagnostic efficiency. Genetics in Medicine, 18(6):608–617, 2016. ISSN 1098-3600. doi:10.1038/gim.2015.137. URL http://www.nature.com/doifinder/10.1038/gim.2015.137.

[10] A. Borzutzky, B. Crompton, A. K. Bergmann, S. Giliani, S. Baxi, M. Martin, E. J. Neufeld, and L. D. Notarangelo. Reversible severe combined immunodeficiency phenotype secondary to a mutation of the proton-coupled folate transporter. Clinical Immunology, 133(3):287–294, 2009. doi:10.1016/j.clim.2009.08.006.

[11] K. M. Boycott, M. R. Vanstone, D. E. Bulman, and A. E. MacKenzie. Rare-disease genetics in the era of next-generation sequencing: discovery to translation. Nature reviews. Genetics, 14(10):681–91, 2013. ISSN 1471-0064. doi:10.1038/nrg3555. URL http://www.nature.com/doifinder/10.1038/nrg3555 http://www.ncbi.nlm.nih.gov/pubmed/23999272.

[12] O. J. Buske, M. Girdea, S. Dumitriu, B. Gallinger, T. Hartley, H. Trang, A. Misyura, T. Friedman, C. Beaulieu, W. P. Bone, A. E. Links, N. L. Washington, M. A. Haendel, P. N. Robinson, C. F. Boerkoel, D. Adams, W. A. Gahl, K. M. Boycott, and M. Brudno. PhenomeCentral: A Portal for Phenotypic and Genotypic Matchmaking of Patients with Rare Genetic Diseases. Human Mutation, 36(10):931–940, 2015. ISSN 10981004. doi:10.1002/humu.22851.

[13] L. S. Carulla, G. M. Reed, L. M. Vaez-Azizi, S.-A. Cooper, R. M. Leal, M. Bertelli, C. Adnams, S. Cooray, S. Deb, L. A. Dirani, S. C. Girimaji, G. Katz, H. Kwok, R. Luckasson, R. Simeonsson, C. Walsh, K. Munir, and S. Saxena. Intellectual developmental disorders: towards a new name, definition and framework for "mental retardation/intellectual disability" in ICD-11. World Psychiatry, 10(3):175–180, 2011. ISSN 17238617. doi:10.1002/j.2051-5545.2011.tb00045.x. URL http://doi.wiley.com/10.1002/j.2051-5545.2011.tb00045.x.

[14] R. Dragusin, P. Petcu, C. Lioma, B. Larsen, H. L. Jørgensen, I. J. Cox, L. K. Hansen, P. Ingwersen, and O. Winther. FindZebra: A search engine for rare diseases. International Journal of Medical Informatics, 82(6):528–538, 2013. ISSN 13865056. doi:10.1016/j.ijmedinf.2013.01.005. URL http://dx.doi.org/10.1016/j.ijmedinf.2013.01.005.

[15] S. ElShal, L.-C. Tranchevent, A. Sifrim, A. Ardeshirdavani, J. Davis, and Y. Moreau. Beegle: from literature mining to disease-gene discovery. Nucleic acids research, 44(2):e18, 2016. ISSN 1362-4962. doi:10.1093/nar/gkv905. URL http://nar.oxfordjournals.org/content/44/2/e18.short?rss=1.

[16] J. Ford, F. Makedon, and J. Pearlman. Using Singular Value Decomposition Approximation for Collaborative Filtering. Seventh IEEE International Conference on E-Commerce Technology (CEC’05), pages 257–264, 2005. doi:10.1109/ICECT.2005.102. URL http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=1524053.

[17] M. Kuhn, with contributions from J. Wing, S. Weston, A. Williams, C. Keefer, A. Engelhardt, T. Cooper, Z. Mayer, B. Kenkel, the R Core Team, M. Benesty, R. Lescarbeau, A. Ziem, L. Scrucca, Y. Tang, C. Candan, and T. Hunt. caret: Classification and Regression Training, 2016. URL https://CRAN.R-project.org/package=caret. R package version 6.0-72.

[18] M. Girdea, S. Dumitriu, M. Fiume, S. Bowdin, K. M. Boycott, S. Chénier, D. Chitayat, H. Faghfoury, M. S. Meyn, P. N. Ray, J. So, D. J. Stavropoulos, and M. Brudno. PhenoTips: Patient phenotyping software for clinical and research use. Human Mutation, 34(8):1057–1065, 2013. ISSN 10597794. doi:10.1002/humu.22347.

[19] M. Glueck, P. Hamilton, F. Chevalier, S. Breslav, A. Khan, D. Wigdor, and M. Brudno. PhenoBlocks: Phenotype Comparison Visualizations. IEEE Transactions on Visualization and Computer Graphics, 22(1):101–110, 2016. ISSN 10772626. doi:10.1109/TVCG.2015.2467733.

[20] M. M. Gottlieb, D. J. Arenillas, S. Maithripala, Z. D. Maurer, M. Tarailo-Graovac, L. Armstrong, M. Patel, C. van Karnebeek, and W. W. Wasserman. GeneYenta: A phenotype-based rare disease case matching tool based on online dating algorithms for the acceleration of exome interpretation. Human Mutation, 36(4):432–438, 2015. ISSN 10981004. doi:10.1002/humu.22772.

[21] M. Kircher, D. M. Witten, P. Jain, B. J. O’Roak, G. M. Cooper, and J. Shendure. A general framework for estimating the relative pathogenicity of human genetic variants. Nature genetics, 46(3):310–315, 2014. ISSN 1546-1718. doi:10.1038/ng.2892. URL http://www.ncbi.nlm.nih.gov/pubmed/24487276.

[22] A. Knight and T. Senior. The common problem of rare disease in general practice. Medical Journal of Australia, 185(2):2–3, 2006. URL https://www.mja.com.au/system/files/issues/185_02_170706/kni10328_fm.pdf.

[23] S. Köhler, M. H. Schulz, P. Krawitz, S. Bauer, S. Dölken, C. E. Ott, C. Mundlos, D. Horn, S. Mundlos, and P. N. Robinson. Clinical Diagnostics in Human Genetics with Semantic Similarity Searches in Ontologies. American Journal of Human Genetics, 85(4):457–464, 2009. ISSN 00029297. doi:10.1016/j.ajhg.2009.09.003.

[24] S. Köhler, S. C. Doelken, C. J. Mungall, S. Bauer, H. V. Firth, I. Bailleul-Forestier, G. C. M. Black, D. L. Brown, M. Brudno, J. Campbell, D. R. Fitzpatrick, J. T. Eppig, A. P. Jackson, K. Freson, M. Girdea, I. Helbig, J. A. Hurst, J. Jähn, L. G. Jackson, A. M. Kelly, D. H. Ledbetter, S. Mansour, C. L. Martin, C. Moss, A. Mumford, W. H. Ouwehand, S. M. Park, E. R. Riggs, R. H. Scott, S. Sisodiya, S. V. Vooren, R. J. Wapner, A. O. M. Wilkie, C. F. Wright, A. T. Vulto-Van Silfhout, N. D. Leeuw, B. B. A. De Vries, N. L. Washington, C. L. Smith, M. Westerfield, P. Schofield, B. J. Ruef, G. V. Gkoutos, M. Haendel, D. Smedley, S. E. Lewis, and P. N. Robinson. The Human Phenotype Ontology project: Linking molecular biology and disease through phenotype data. Nucleic Acids Research, 42(D1):1–9, 2014. ISSN 03051048. doi:10.1093/nar/gkt1026.

[25] M. Lek, K. J. Karczewski, E. V. Minikel, K. E. Samocha, E. Banks, T. Fennell, A. H. O’Donnell-Luria, J. S. Ware, A. J. Hill, B. B. Cummings, T. Tukiainen, D. P. Birnbaum, J. A. Kosmicki, L. E. Duncan, K. Estrada, F. Zhao, J. Zou, E. Pierce-Hoffman, J. Berghout, D. N. Cooper, N. Deflaux, M. DePristo, R. Do, J. Flannick, M. Fromer, L. Gauthier, J. Goldstein, N. Gupta, D. Howrigan, A. Kiezun, M. I. Kurki, A. L. Moonshine, P. Natarajan, L. Orozco, G. M. Peloso, R. Poplin, M. A. Rivas, V. Ruano-Rubio, S. A. Rose, D. M. Ruderfer, K. Shakir, P. D. Stenson, C. Stevens, B. P. Thomas, G. Tiao, M. T. Tusie-Luna, B. Weisburd, H.-H. Won, D. Yu, D. M. Altshuler, D. Ardissino, M.
Boehnke, J. Danesh,S. Donnelly, R. Elosua, J. C. Florez, S. B. Gabriel, G. Getz, S. J. Glatt, C. M.Hultman, S. Kathiresan, M. Laakso, S. McCarroll, M. I. McCarthy,D. McGovern, R. McPherson, B. M. Neale, A. Palotie, S. M. Purcell,D. Saleheen, J. M. Scharf, P. Sklar, P. F. Sullivan, J. Tuomilehto, M. T.Tsuang, H. C. Watkins, J. G. Wilson, M. J. Daly, and D. G. MacArthur.Analysis of protein-coding genetic variation in 60,706 humans. Nature, 536(7616):285–291, 2016. ISSN 0028-0836. doi:10.1038/nature19057. URLhttp://www.nature.com/doifinder/10.1038/nature19057. → pages 4[26] W. McLaren, B. Pritchard, D. Rios, Y. Chen, P. Flicek, and F. Cunningham.Deriving the consequences of genomic variants with the Ensembl API andSNP Effect Predictor. Bioinformatics, 26(16):2069–2070, 2010. ISSN13674803. doi:10.1093/bioinformatics/btq330. → pages 4[27] W. J. Meerding, L. Bonneux, J. J. Polder, M. A. Koopmanschap, and P. J.van der Maas. Demographic and epidemiological determinants of healthcarecosts in Netherlands: cost of illness study. BMJ (Clinical research ed.), 317(7151):111–115, 1998. ISSN 0959-8138 (Print).doi:10.1136/bmj.317.7151.111. → pages 1[28] J. Menche, A. Sharma, M. Kitsak, S. D. Ghiassian, M. Vidal, J. Loscalzo,and A.-L. Baraba´si. Disease networks. Uncovering disease-disease52relationships through the incomplete interactome. Science, 347(6224):1257601, 2015. ISSN 1095-9203. doi:10.1126/science.1116608. URLhttp://www.sciencemag.org/cgi/doi/10.1126/science.1257601$\delimiter”026E30F$npapers3://publication/doi/10.1126/science.1257601.→ pages 5[29] N. A. Miller, E. G. Farrow, M. Gibson, L. K. Willig, G. Twist, B. Yoo,T. Marrs, S. Corder, L. Krivohlavek, A. Walter, J. E. Petrikin, C. J. Saunders,I. Thiffault, S. E. Soden, L. D. Smith, D. L. Dinwiddie, S. Herd, J. A. Cakici,S. Catreux, M. Ruehle, and S. F. Kingsmore. A 26-hour system of highlysensitive whole genome sequencing for emergency management of geneticdiseases. Genome Medicine, 2015. 
ISSN 1756-994X.doi:10.1186/s13073-015-0221-8. URLhttp://dx.doi.org/10.1186/s13073-015-0221-8. → pages 2[30] S. Petrovski, Q. Wang, E. L. Heinzen, A. S. Allen, and D. B. Goldstein.Genic Intolerance to Functional Variation and the Interpretation of PersonalGenomes. PLoS Genetics, 9(8), 2013. ISSN 15537390.doi:10.1371/journal.pgen.1003709. → pages 4[31] J. Pin˜ero, A. Berenstein, A. Gonzalez-Perez, A. Chernomoretz, and L. I.Furlong. Uncovering disease mechanisms through network biology in theera of Next Generation Sequencing. Scientific reports, 6(October 2015):24570, 2016. ISSN 2045-2322. doi:10.1038/srep24570. URLhttp://www.ncbi.nlm.nih.gov/pubmed/27080396$\delimiter”026E30F$nhttp://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=PMC4832203. →pages 5[32] R Core Team. R: A Language and Environment for Statistical Computing. RFoundation for Statistical Computing, Vienna, Austria, 2016. URLhttps://www.R-project.org/. → pages 22[33] C. R. Scriver and P. J. Waters. Monogenic traits are not simple: Lessonsfrom phenylketonuria. Trends in Genetics, 15(7):267–272, 1999. ISSN01689525. doi:10.1016/S0168-9525(99)01761-8. → pages 1[34] M. Shevell. Global Developmental Delay and Mental Retardation orIntellectual Disability: Conceptualization, Evaluation, and Etiology.Pediatric Clinics of North America, 55(5):1071–1084, 2008. ISSN00313955. doi:10.1016/j.pcl.2008.07.010. → pages 1[35] C. Shyr, M. Tarailo-Graovac, M. Gottlieb, J. J. Y. Lee, C. van Karnebeek,and W. W. Wasserman. FLAGS, frequently mutated genes in public exomes.53BMC medical genomics, 7:64, 2014. ISSN 1755-8794.doi:10.1186/s12920-014-0064-y. URLhttp://www.biomedcentral.com/1755-8794/7/64. → pages 5[36] D. Smedley and P. N. Robinson. Phenotype-driven strategies for exomeprioritization of human Mendelian disease genes. Genome Medicine, 7(1):81, 2015. ISSN 1756-994X. doi:10.1186/s13073-015-0199-2. URLhttp://genomemedicine.com/content/7/1/81. → pages 5[37] D. Smedley, J. O. B. Jacobsen, M. Ja¨ger, S. Ko¨hler, M. 
Holtgrewe,M. Schubach, E. Siragusa, T. Zemojtel, O. J. Buske, N. L. Washington, W. P.Bone, M. a. Haendel, and P. N. Robinson. Next-generation diagnostics anddisease-gene discovery with the Exomiser. Nature protocols, 10(12):2004–2015, 2015. ISSN 1750-2799. doi:10.1038/nprot.2015.124. URLhttp://www.ncbi.nlm.nih.gov/pubmed/26562621. → pages 5[38] S. E. Soden, C. J. Saunders, L. K. Willig, E. G. Farrow, L. D. Smith, J. E.Petrikin, J.-b. Lepichon, N. A. Miller, I. Thiffault, D. L. Dinwiddie,G. Twist, A. Noll, A. Bryce, L. Zellmer, A. M. Atherton, A. T. Abdelmoity,N. Safina, S. S. Nyp, B. Zuccarelli, I. A. Larson, A. Modrcin, S. Herd,M. Creed, Z. Ye, X. Yuan, R. A. Brodsky, and F. Stephen. Effectiveness ofexome and genome sequencing guided by acuity of illness for diagnosis ofneurodevelopmental disorders. 6(265), 2015.doi:10.1126/scitranslmed.3010076.Effectiveness. → pages 1, 2[39] H. Stranneheim, M. Engvall, K. Naess, N. Lesko, P. Larsson, M. Dahlberg,R. Andeer, A. Wredenberg, C. Freyer, M. Barbaro, H. Bruhn, T. Emahazion,M. Magnusson, R. Wibom, R. H. Zetterstro¨m, V. Wirta, U. von Do¨beln, andA. Wedell. Rapid pulsed whole genome sequencing for comprehensive acutediagnostics of inborn errors of metabolism. BMC Genomics, 15(1):1090,2014. ISSN 1471-2164. doi:10.1186/1471-2164-15-1090. URLhttp://www.biomedcentral.com/1471-2164/15/1090. → pages 2[40] D. C. Swinney. Challenges and Hurdles to Business as Usual in DrugDevelopment for Treatment of Rare Diseases. Clinical Pharmacology &Therapeutics, 100(4):339–341, 2016. ISSN 00099236. doi:10.1002/cpt.422.URL http://doi.wiley.com/10.1002/cpt.422. → pages 1[41] M. Tarailo-Graovac, C. Shyr, C. J. Ross, G. A. Horvath, R. Salvarinova,X. C. Ye, L.-H. Zhang, A. P. Bhavsar, J. J. Lee, B. I. Dro¨gemo¨ller,M. Abdelsayed, M. Alfadhel, L. Armstrong, M. R. Baumgartner, P. Burda,M. B. Connolly, J. Cameron, M. Demos, T. Dewan, J. Dionne, A. M. Evans,54J. M. Friedman, I. Garber, S. Lewis, J. Ling, R. Mandal, A. Mattman,M. McKinnon, A. 
Michoulas, D. Metzger, O. A. Ogunbayo, B. Rakic,J. Rozmus, P. Ruben, B. Sayson, S. Santra, K. R. Schultz, K. Selby,P. Shekel, S. Sirrs, C. Skrypnyk, A. Superti-Furga, S. E. Turvey, M. I. VanAllen, D. Wishart, J. Wu, J. Wu, D. Zafeiriou, L. Kluijtmans, R. A. Wevers,P. Eydoux, A. M. Lehman, H. Vallance, S. Stockler-Ipsiroglu, G. Sinclair,W. W. Wasserman, and C. D. van Karnebeek. Exome Sequencing and theManagement of Neurometabolic Disorders. New England Journal ofMedicine, page NEJMoa1515792, 2016. ISSN 0028-4793.doi:10.1056/NEJMoa1515792. URLhttp://www.nejm.org/doi/10.1056/NEJMoa1515792. → pages 2, 3, 14, 21[42] L. C. Tranchevent, R. Barriot, S. Yu, S. Van Vooren, P. Van Loo,B. Coessens, B. De Moor, S. Aerts, and Y. Moreau. ENDEAVOUR update:a web resource for gene prioritization in multiple species. Nucleic acidsresearch, 36(Web Server issue):377–384, 2008. ISSN 13624962.doi:10.1093/nar/gkn325. → pages 6[43] C. D. M. Van Karnebeek and S. Stockler. Treatable inborn errors ofmetabolism causing intellectual disability: A systematic literature review.Molecular Genetics and Metabolism, 105(3):368–381, 2012. ISSN10967192. doi:10.1016/j.ymgme.2011.11.191. URLhttp://dx.doi.org/10.1016/j.ymgme.2011.11.191. → pages 2[44] C. D. M. Van Karnebeek, M. Shevell, J. Zschocke, J. B. Moeschler, andS. Stockler. The metabolic evaluation of the child with an intellectualdevelopmental disorder: Diagnostic algorithm for identification of treatablecauses and new digital resource. Molecular Genetics and Metabolism, 111(4):428–438, 2014. ISSN 10967206. doi:10.1016/j.ymgme.2014.01.011.URL http://dx.doi.org/10.1016/j.ymgme.2014.01.011. → pages 2[45] Z. Yang and J. R. Bielawski. Statistical methods for detecting molecularadaptation. Trends in Ecology and Evolution, 15(12):496–503, 2000. ISSN01695347. doi:10.1016/S0169-5347(00)01994-7. 
Appendix A

Tables

Weighting  Method  Min   1st Quartile  Median  Mean  3rd Quartile  Correlation
None       avg_j   0.00  0.73          0.89    0.79  0.97          0.738
None       best_j  0.00  0.63          0.92    0.76  0.97          0.649
None       for_j   0.00  0.65          0.88    0.77  0.97          0.652
None       rep_j   0.00  0.69          0.87    0.72  0.98          0.585
None       avg_p   0.03  0.62          0.80    0.74  0.92          0.810
None       best_p  0.27  0.59          0.78    0.74  0.90          0.939
None       rep_p   0.37  0.65          0.83    0.77  0.92          0.785
w_g        avg_j   0.00  0.72          0.90    0.80  0.98          0.731
w_g        best_j  0.00  0.62          0.90    0.76  0.97          0.633
w_g        for_j   0.00  0.64          0.86    0.77  0.97          0.636
w_g        rep_j   0.00  0.71          0.95    0.72  0.98          0.575
w_g        avg_p   0.05  0.63          0.80    0.74  0.93          0.793
w_g        best_p  0.27  0.59          0.76    0.74  0.92          0.931
w_g        rep_p   0.29  0.66          0.81    0.76  0.92          0.774
w_p        avg_j   0.00  0.73          0.90    0.80  0.98          0.736
w_p        best_j  0.00  0.63          0.92    0.76  0.97          0.643
w_p        for_j   0.00  0.62          0.86    0.76  0.97          0.644
w_p        rep_j   0.00  0.71          0.87    0.72  0.99          0.584
w_p        avg_p   0.03  0.63          0.80    0.74  0.92          0.791
w_p        best_p  0.27  0.59          0.78    0.74  0.90          0.948
w_p        rep_p   0.29  0.65          0.80    0.76  0.92          0.792
w_g+p      avg_j   0.00  0.73          0.90    0.80  0.98          0.736
w_g+p      best_j  0.00  0.63          0.92    0.76  0.97          0.640
w_g+p      for_j   0.00  0.62          0.86    0.76  0.97          0.643
w_g+p      rep_j   0.00  0.71          0.87    0.72  0.98          0.580
w_g+p      avg_p   0.03  0.63          0.80    0.74  0.92          0.795
w_g+p      best_p  0.27  0.59          0.78    0.74  0.90          0.946
w_g+p      rep_p   0.38  0.65          0.81    0.77  0.92          0.784
w_g*p      avg_j   0.00  0.72          0.92    0.80  0.98          0.728
w_g*p      best_j  0.00  0.63          0.89    0.76  0.97          0.628
w_g*p      for_j   0.00  0.63          0.87    0.76  0.97          0.629
w_g*p      rep_j   0.00  0.71          0.95    0.73  0.99          0.566
w_g*p      avg_p   0.05  0.64          0.80    0.74  0.92          0.772
w_g*p      best_p  0.27  0.59          0.76    0.74  0.92          0.931
w_g*p      rep_p   0.37  0.64          0.84    0.78  0.92          0.758

Table A.1: Ranked scoring results for best candidate variants for all weighting methods. Correlation is Spearman's ρ between the ranked scores of all variants in a case and the ranked number of unique terms that each variant associates with in Synverita's raw data. The maximum is 1 in all cases.
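The Correlation column in Table A.1 is a Spearman rank correlation between per-variant scores and per-variant unique term counts. The following is a minimal sketch of that statistic in plain Python, not the code used in this work. The function names and the example data (`scores`, `term_counts`) are hypothetical and chosen only for illustration. Tied values receive the average of their rank positions, which is the usual convention for Spearman's ρ.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1  # positions are 0-based, ranks 1-based
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (sd_x * sd_y)


def spearman_rho(x, y):
    """Spearman's rho: the Pearson correlation of the rank-transformed data."""
    if len(x) != len(y):
        raise ValueError("sequences must have the same length")
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical case with five variants: a score per variant and the number
# of unique terms each variant is associated with (illustrative values only).
scores = [0.91, 0.40, 0.75, 0.10, 0.62]
term_counts = [120, 35, 80, 5, 90]
rho = spearman_rho(scores, term_counts)
print(round(rho, 3))
```

A value of ρ near 1 for a case would mean that variants with more associated terms tend to receive higher ranked scores, which is the relationship the Correlation column summarizes.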

