UBC Faculty Research and Publications

FLAGS, frequently mutated genes in public exomes Shyr, Casper; Tarailo-Graovac, Maja; Gottlieb, Michael; Lee, Jessica J; van Karnebeek, Clara; Wasserman, Wyeth W Dec 3, 2014

RESEARCH ARTICLE Open Access

FLAGS, frequently mutated genes in public exomes

Casper Shyr1,3,4†, Maja Tarailo-Graovac1,2,3†, Michael Gottlieb1, Jessica JY Lee1,5, Clara van Karnebeek3,6,7 and Wyeth W Wasserman1,2,3*

Shyr et al. BMC Medical Genomics 2014, 7:64
http://www.biomedcentral.com/1755-8794/7/64

* Correspondence: wyeth@cmmt.ubc.ca
†Equal contributors
1Centre for Molecular Medicine and Therapeutics, Child and Family Research Institute, Vancouver, BC, Canada
2Department of Medical Genetics, University of British Columbia, Vancouver, BC, Canada
Full list of author information is available at the end of the article

© 2014 Shyr et al.; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Abstract

Background: Dramatic improvements in DNA-sequencing technologies and computational analyses have led to wide use of whole exome sequencing (WES) to identify the genetic basis of Mendelian disorders. More than 180 novel rare-disease-causing genes with Mendelian inheritance patterns have been discovered through sequencing the exomes of just a few unrelated individuals or family members. As rare/novel genetic variants continue to be uncovered, there is a major challenge in distinguishing true pathogenic variants from rare benign mutations.

Methods: We used publicly available exome cohorts, together with the dbSNP database, to derive a list of genes (n = 100) that most frequently exhibit rare (<1%) non-synonymous/splice-site variants in general populations. We termed these genes FLAGS for FrequentLy mutAted GeneS and analyzed their properties.

Results: Analysis of FLAGS revealed that these genes have significantly longer protein coding sequences, a greater number of paralogs and display less evolutionarily selective pressure than expected. FLAGS are more frequently reported in PubMed clinical literature and more frequently associated with diseased phenotypes compared to the set of human protein-coding genes. We demonstrated an overlap between FLAGS and the rare-disease causing genes recently discovered through WES studies (n = 10) and the need for replication studies and rigorous statistical and biological analyses when associating FLAGS to rare disease. Finally, we showed how FLAGS are applied in disease-causing variant prioritization approach on exome data from a family affected by an unknown rare genetic disorder.

Conclusions: We showed that some genes are frequently affected by rare, likely functional variants in general population, and are frequently observed in WES studies analyzing diverse rare phenotypes. We found that the rate at which genes accumulate rare mutations is beneficial information for prioritizing candidates. We provided a ranking system based on the mutation accumulation rates for prioritizing exome-captured human genes, and propose that clinical reports associating any disease/phenotype to FLAGS be evaluated with extra caution.

Background

Uncovering the genetic basis of human disease improves care for affected patients and their families by providing diagnosis, refining genetic counseling, informing clinical management (incl. decision making on appropriate preventive measures and available treatments), and ultimately facilitation of unrelated affected families as well as identification of novel targets for treatment [1-3]. Rare Mendelian diseases are caused by altered function of single genes and individually have a low prevalence (fewer than 200,000 people in the United States, or fewer than 1 in 2,000 people in Europe) [4] but collectively these affect millions of individuals worldwide [5-7]. The current best estimate on the number of rare genetic disorders is between 6,000 and 7,000 [7] based on the catalogue Online Mendelian Inheritance in Man (OMIM) [8], and a comprehensive reference portal for rare diseases (Orphanet) [9]; however, taking into consideration that the human phenome is far from fully characterized [10] together with higher estimates on rare-disease-causing genes based on human mutation rate and the number of essential genes [11], the number of rare genetic disorders is likely higher.

Next-generation sequencing (NGS) high-throughput technologies have revolutionized the discovery of gene defects causing rare human diseases by detecting genetic variations at base-pair resolution within an individual [12-14]. NGS is widely used to sequence either a portion of the human genome (~1%) by capturing the protein-coding sequences (known as whole exome sequencing, WES), or to sequence the entire human genome (known as whole genome sequencing, WGS). In particular, WES technology has been widely used to identify genetic basis of Mendelian disorders by sequencing the exomes of just a few unrelated individuals or family members, and has led to discovery of more than 180 novel rare-disease-causing genes with Mendelian inheritance patterns, according to the review published in November 2013 [7,15] (the number continues to increase with some rapidity). Considering the estimates that genetic basis has been determined for ~3,500 of the rare diseases [7], there remain thousands of rare-disease-causing genes to be uncovered.

With the increasing rate of the discovery of rare genetic variants, WES has the potential to identify the majority of the remaining rare-disease-causing genes in the near future. A major challenge in identification of the true pathogenic variants lies in the differentiation between a large number of non-pathogenic functional variants and disease-causing sequence variants in a studied family (in this study, the term “functional variant” is restricted to missense/nonsense and splice site variants). Current WES analyses of rare genetic disorders use similar approaches [16] to filter the observed variants to enrich for potential causal genes. Specifically, after the reads are mapped, and variants are called and annotated, the variants are compared against internal exome databases as well as public databases, such as dbSNP [17], Exome Variant Server (EVS), 1000 Genomes Project [18], and HapMap project [19,20] to exclude variants that are likely to arise from technological causes and variants that are common (e.g. variants observed in more than 1%) in a population. The variants are further prioritized based on their predicted effect on protein function [21,22], where silent and non-coding variants (except for splice-site affecting variants) are typically excluded or ranked lower. The still extensive lists of candidate disease-causing variants can be further refined based on the family history and a hypothesized model of inheritance [7,15]. However, it is well-established that a significant proportion of coding variants in each individual represent rare variants (absent from dbSNP or observed with frequency of ≤1%) [17,20], and that genomes of healthy individuals contain an average of ~100 loss-of-function variants [23]. The analyst must further consider the possibility that non-coding variations (e.g. regulatory alterations) could be involved, thus the filtered results may not contain the causal gene. Thus, for many rare disorders, it is still challenging to separate the real disease-causing variant from the prioritized set of rare, likely functional variants that are not accountable for the investigated phenotype.

There are broadly used tools such as SIFT [21] and PolyPhen-2 [24] that provide an interpretation of mutation impacts. Many of these tools focus on the individual variants. In the variant-focused studies, it has been noted that variants tend to arise more frequently in long genes (e.g. TTN and MUC16). In considering that researchers often focus their interpretation of exome data on the genic level initially, it might be advantageous to have methods and ranking systems that integrate the individual variants at the genic level more systematically to inform variant prioritization. While there are long-standing methods for ranking a set of genes based on their annotations [25], there has been limited work on rankings based on sequencing properties. One ranking system based on the genic level is RVIS [26]. RVIS generates a score based on the frequencies of observed common coding variants compared to the total number of observed variants in the same gene.

To further help in identification of disease-causing variants from families affected by rare Mendelian disorders, we expanded the current, common prioritization parameters that focus mainly on frequency at which variants themselves are seen in normal population, to include the frequency at which genes are found to be affected by rare, likely functional variants. Using rare variations from dbSNP and EVS, we introduced the concept of FLAGS (FrequentLy mutAted GeneS). We showed that these genes possess characteristics that make them less likely to be critical for disease development, but are more likely to be assigned causality for diseases than expected for protein-coding genes in general. We further demonstrated FLAGS’ utility via a case study as well as literature review, and application in our in-house database. Finally, we provided a ranking system from FLAGS to assist in the prioritization of genes from exome/whole-genome clinical studies.

Methods

Terminologies used in this study

In this study, the term “functional variants” refers to variants that are missense, nonsense or fall within a splice site window (see below for specifics). The length of a gene is defined to be the longest open reading frame (ORF) of the gene, thus excluding promoters, untranslated regions and introns. All genes are referred to by their HGNC (HUGO Gene Nomenclature Committee) [27] official gene symbol.

Datasets

In the following sections, we provide detailed descriptions of how the datasets were obtained or generated. Table 1 lists the size and descriptive nature of the datasets used in this study. Each gene list referred to in this report can be found in Additional file 1: Table S1.

Table 1 Description of the datasets used in this study
Name of dataset | Size | Description
FLAGS | 100 | The top 100 of FrequentLy mutAted GeneS with rare (<1% allelic frequency) functional variants from dbSNPv138 and ESP6500
OMIM | 3099 | The list of protein-coding genes associated with human diseases from Online Mendelian Inheritance in Man [8]
HGMD | 2691 | The list of protein-coding genes with damaging mutations (<1% allelic frequency) from Human Gene Mutation Database [28]
WES | 300 | Downloaded from Boycott et al. (2013) [7]: a list of novel genes implicated in human disorders based on whole exome sequencing studies, or novel/known pathogenic mutations discovered by whole-exome sequencing
Background | 18580 | The entire set of human protein-coding genes that have complete start and end translation annotations with a specified dN/dS ratio

a. FrequentLy mutAted GeneS (FLAGS)

Variations from EVS hosted on the NHLBI Exome Sequencing Project (ESP6500) were downloaded in February 2014. The criteria used to generate the variations are available online (http://evs.gs.washington.edu/EVS/). Variations from dbSNPv138 [17] were downloaded from the NCBI website (version date 20130806). Genomic annotations were assigned to each variation using SNPeff v3.5g [29] with the parameter –SpliceSiteSize 7 and human genome version GRCh37.75. Variants were filtered for allelic frequency <1% according to dbSNP’s overall frequency and EVS’s combined population frequency. Where a discrepancy in the reported frequency arose between the two resources, we took the higher frequency. Variants were further filtered for “functional” coding mutations that result in a change in the amino acid sequence (i.e. missense/nonsense), or mutations that reside within a putative splice site junction (with a window size of 7, as supplied in the parameter for SNPeff). The remaining mutations were excluded if they were observed more than 10 times within our in-house database consisting of 150 exomes and 13 whole genomes (a list of filtered out variants is provided in Additional file 2: Table S6 as VCF). This last step was included because we noticed it is common to see polymorphic mutations from dbSNPv138 without an allelic frequency attached; filtering against an in-house pipeline allowed us to remove polymorphic variants that do not have an annotated frequency. Among these remaining mutations, for each gene, we counted the number of mutations observed per gene. Only protein-coding genes with a fully annotated translation start and end, and a valid dN/dS ratio are included for consideration (see Methodology section “Gene length and dN/dS ratio”). From this ranked list, we selected the top 100 genes (0.5% of the 19818 genes overlapping between dbSNP and EVS) with the most observed mutations as a focus for this study. This set will be referred to throughout the manuscript as “FLAGS”. The entire ranked list is available in Additional file 3: Table S4.

b. Disease genes datasets

To obtain a list of reliable disease-associated genes, we drew from multiple resources. The first list of disease-associated genes was downloaded from OMIM website in March 2014 using the provided file “morbidmap”. This list will be referred to throughout the manuscript as “OMIM genes”. A second list contains pathogenic variations downloaded from the HGMD professional version (file date 20130927) [28]. To focus on likely high-penetrance pathogenic alleles, we filtered the variations in this file by the same frequency criteria as we performed for obtaining FLAGS (see Methodology section “FrequentLy mutated GeneS”), and limited to only the mutations annotated as “DM” (damaging mutations). The affected genes from those remaining variations are compiled, and will be referred to throughout this manuscript as “HGMD genes”. A third disease set was downloaded from the Supplemental file published by Boycott et al. (2013) [7], which provided a compiled list of novel genes and/or novel phenotypes associated with known disease-genes discovered through exome sequencing. For all three disease-associated-gene lists, we mapped the gene symbols to their official HGNC gene symbol (and discarded the ones that could not be mapped), retained only protein-coding genes with a fully annotated translation start and end, and a valid dN/dS ratio. OMIM and HGMD (Human Gene Mutation database) overlap with the top 100 FLAGS by 42 and 37 genes respectively (Additional file 4: Table S2A, S2B).

c. Background dataset

The complete list of human-coding genes was downloaded from Ensembl [30] Biomart in March 2014 using version Ensembl Genes 75 with genome version GRCh37.p13. Protein-coding genes without HGNC gene symbol, a proper translation start and translation end annotation according to this genome version were discarded. Genes without a valid dN/dS ratio were removed (i.e. without any observed synonymous polymorphisms according to dbSNPv138 and EVS). This last step was done for two reasons: 1) to ensure there is no bias when evaluating dN/dS ratio in our results, 2) to ensure the genes selected in this study have been covered in NGS studies, since any gene without at least one observed synonymous mutation is presumably not sufficiently captured in either exome or whole-genome studies. The Background set overlaps FLAGS completely.

The comparison analyses in the Results section are done without removing the overlap between the gene datasets.

Gene length and dN/dS ratio

We calculated the selection pressures acting on genes by comparing non-synonymous substitution per non-synonymous site (dN) to the synonymous substitutions per synonymous site (dS). This ratio of the number of non-synonymous substitutions per non-synonymous site to the number of synonymous substitutions per synonymous site (dN/dS) was calculated using the formula [31]:

$$\frac{dN}{dS} = \frac{\#\,\text{of observed non-synonymous substitutions} \;/\; \#\,\text{of possible non-synonymous sites}}{\#\,\text{of observed synonymous substitutions} \;/\; \#\,\text{of possible synonymous substitutions}}$$

The number of possible synonymous and non-synonymous mutations was derived by examining the longest annotated coding transcript per gene (transcript length based upon Ensembl Biomart described above). Only transcripts with annotated start and end positions were considered. The number of observed synonymous and non-synonymous mutations was calculated from the same dbSNPv138 and EVS datasets as described above. We verified that our methodology provides dN/dS ratios comparable to the ratios reported previously [31] (Additional file 5: Table S5). Gene length was derived by converting the same transcript that was used to calculate the dN/dS ratio into amino acid sequences. In this study, the term “gene length” is defined to be the ORF of the gene, thus excluding promoters, untranslated regions and introns.

Paralogs

The paralogous relationships for human genes were derived from the Ensembl Comparative Genomics API using version Ensembl Genes 75, GRCh37.p13. A custom Perl script was written to extract the paralogs for every gene.

Gene-to-disease phenotypic terms

We used MeSHOP software [32] to identify over-represented disease terms associated with each gene. MeSHOP returns a list of MeSH (Medical Subject Heading) terms for each gene with a p-value for each term. Each p-value was calculated by an over-representation (compared to control) of the MeSH terms assigned to the set of articles within PubMed that are associated with the gene (based on relationships defined in gene2pubmed; articles considered include up to March 2013). From this output, for each gene, the non-disease related MeSH terms were filtered out, and the remaining MeSH terms were selected for significance (using the Bonferroni correction and a significance threshold of 0.05).
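The FLAGS derivation described in part a of the Datasets section (frequency filter, functional-effect filter, in-house recurrence filter, then a per-gene count and ranking) can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the dictionary keys (`gene`, `effect`, `dbsnp_freq`, `evs_freq`, `inhouse_count`), the effect labels and the function names are hypothetical, and the sketch assumes that a variant with no annotated frequency passes the frequency filter and is left to the in-house filter.

```python
from collections import Counter

# Hypothetical effect labels standing in for the annotation tool's output.
FUNCTIONAL_EFFECTS = {"missense", "nonsense", "splice_site"}


def passes_filters(variant, max_freq=0.01, max_inhouse_count=10):
    """Return True if a variant is rare, 'functional' and not recurrent in-house."""
    freqs = [f for f in (variant.get("dbsnp_freq"), variant.get("evs_freq"))
             if f is not None]
    # Where the two resources disagree, the higher frequency is used.
    # Assumption: variants without any annotated frequency are kept here.
    if freqs and max(freqs) >= max_freq:
        return False
    if variant["effect"] not in FUNCTIONAL_EFFECTS:
        return False
    # Exclude variants observed more than max_inhouse_count times in-house.
    return variant.get("inhouse_count", 0) <= max_inhouse_count


def rank_genes(variants, top_n=100):
    """Count surviving variants per gene and return the top_n genes."""
    counts = Counter(v["gene"] for v in variants if passes_filters(v))
    return counts.most_common(top_n)


# Toy input; the values are illustrative only, not real data.
example_variants = [
    {"gene": "TTN", "effect": "missense", "dbsnp_freq": 0.001, "evs_freq": 0.002, "inhouse_count": 1},
    {"gene": "TTN", "effect": "nonsense", "dbsnp_freq": None, "evs_freq": 0.0005, "inhouse_count": 0},
    {"gene": "TTN", "effect": "synonymous", "dbsnp_freq": 0.001, "evs_freq": None, "inhouse_count": 0},
    {"gene": "MUC16", "effect": "missense", "dbsnp_freq": 0.004, "evs_freq": 0.02, "inhouse_count": 0},
    {"gene": "MUC16", "effect": "splice_site", "dbsnp_freq": None, "evs_freq": None, "inhouse_count": 3},
    {"gene": "DST", "effect": "missense", "dbsnp_freq": None, "evs_freq": None, "inhouse_count": 25},
]
ranked = rank_genes(example_variants)
```

In this toy input the synonymous TTN variant, the MUC16 variant whose higher reported frequency is 2%, and the DST variant seen 25 times in-house are all dropped before counting. The additional requirement that a gene has a valid dN/dS ratio is not modelled here.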
To derive gene-to-disease relationships with an independent source, we extracted phenotypic diseased terms per gene from Human Phenotype Ontology website [33] by downloading the file “genes_to_diseases.txt” (version April 2014).

Publication record analysis

For our publication analysis on the relationship between a gene and its frequency of citation(s) within biomedical literature, we used Gene Reference into Function (GeneRIF), a manually curated list of experimentally validated gene functions available as part of NCBI’s EntrezGene database. Each entry in GeneRIF contains a short description of a gene function and a PubMed identifier for the publication documenting the evidence of the described function. Therefore, we were able to count the number of papers published on a gene’s functionality by counting the number of PubMed records associated to the gene. The following are the detailed steps of our publication calculation. First, two flat files necessary for our analysis were downloaded via FTP from NCBI Gene on April 2014: GeneRIF (available at ftp://ftp.ncbi.nih.gov/gene/GeneRIF/generifs_basic.gz) and EntrezGene entries for human (ftp://ftp.ncbi.nih.gov/gene/DATA/Homo_sapiens.gene_info.gz). Second, because GeneRIF refers to each gene by its EntrezGene ID, we mapped the gene symbol of all genes on our lists (FLAGS, OMIM, HGMD, Background) to EntrezGene ID using EntrezGene entries downloaded in the previous step. Third, for each gene of interest, we counted the number of PubMed IDs (PMIDs) associated with its EntrezGene ID in GeneRIF. Because GeneRIF does not guarantee one-to-one relationship between a GeneRIF entry and a PMID (http://www.ncbi.nlm.nih.gov/books/NBK3840/#genefaq.Why_does_the_number_of_GeneRIFs), we filtered out duplicates in the list of PMIDs linked to a gene.
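The mapping, counting and de-duplication steps described up to this point can be sketched in Python as follows. The sketch assumes a tab-separated layout in which the gene_info file carries taxonomy ID, EntrezGene ID and symbol in its first three columns and the GeneRIF file carries taxonomy ID, EntrezGene ID and a comma-separated PMID list in its first three columns; the function names, gene symbols, gene IDs and PMIDs below are hypothetical placeholders.

```python
from collections import defaultdict


def parse_gene_info(lines):
    """Map gene symbol -> EntrezGene ID (assumed columns: tax_id, GeneID, Symbol)."""
    mapping = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if line.startswith("#") or len(fields) < 3:
            continue
        mapping[fields[2]] = fields[1]
    return mapping


def collect_unique_pmids(lines, tax_id="9606"):
    """Map EntrezGene ID -> set of PMIDs; the set removes duplicate PMIDs."""
    pmids = defaultdict(set)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if line.startswith("#") or len(fields) < 3 or fields[0] != tax_id:
            continue
        pmids[fields[1]].update(p for p in fields[2].split(",") if p)
    return pmids


def publications_per_gene(symbols, gene_ids, pmids):
    """Number of distinct PMIDs per gene symbol (0 if unmapped or uncited)."""
    return {s: len(pmids.get(gene_ids.get(s), set())) for s in symbols}


# Toy in-memory stand-ins for the two downloaded flat files.
gene_info_lines = [
    "#tax_id\tGeneID\tSymbol",
    "9606\t1001\tGENE_A",
    "9606\t1002\tGENE_B",
    "9606\t1003\tGENE_C",
]
generif_lines = [
    "#Tax ID\tGene ID\tPubMed ID (PMID) list\tlast update timestamp\tGeneRIF text",
    "9606\t1001\t111\t2014-01-01 00:00\tfirst statement",
    "9606\t1001\t111\t2014-01-02 00:00\tsecond statement, same paper",
    "9606\t1001\t222\t2014-01-03 00:00\tanother paper",
    "9606\t1002\t333\t2014-01-04 00:00\tone paper",
    "10090\t1002\t444\t2014-01-05 00:00\tnon-human entry, skipped",
]
counts = publications_per_gene(
    ["GENE_A", "GENE_B", "GENE_C"],
    parse_gene_info(gene_info_lines),
    collect_unique_pmids(generif_lines),
)
```

Storing PMIDs in a set per gene is what implements the duplicate filter; the real files would be read from the gzip archives instead of the in-memory lists used here.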
Last, to filter the PMIDs by their publication date, we collected the publication date of each PMID via queries into PubMed using the ESummary query provided within the Entrez Programming Utilities (E-utilities).

Statistical analyses

Unless stated otherwise, all statistical analyses and plots were carried out in R [34] version 2.15.3. Non-parametric Mann–Whitney U one-tailed test was executed by wilcox.test function with parameter exact = TRUE. Violin plots were generated with Vioplot package. The input files to the analyses are available in Additional file 6: Table S9A and 9B.

Mutation Detection using WES – a case study

A 3-year old female patient, born as an only child to non-consanguineous parents of Turkish descent after an uncomplicated pregnancy and delivery, who presented with profound early-onset developmental delay, microcephaly, seizures, dysmorphic features, myopia, bone marrow dysplasia with lymphopenia, neutropenia, aplastic anemia and combined immunodeficiency (B and T cell), was enrolled into the TIDEX gene discovery project, approved by the Ethics Board of the Faculty of Medicine of the University of British Columbia (H12-00067).

Extensive clinical investigations were performed according to the TIDE diagnostic protocol [35] to determine the etiology of the patient’s condition. These included: chromosome micro array analysis for copy number variants (CNVs) (Affymetrix Genome-Wide Human SNP Array 6.0); telomere length analysis; CT and MRI scans and comprehensive metabolic testing.

Genomic DNA was isolated from the peripheral blood of the patient as well as parents using standard techniques. Whole exome sequencing was performed for the index patient and her unaffected parents using the Ion AmpliSeq™ Exome Kit and Ion Proton™ System from Life Technologies (Next Generation Sequencing Services, UBC, Vancouver, Canada) at 120X coverage. An in-house designed bioinformatics pipeline (Additional file 7: Text S3) was used to align the reads to the human reference genome version hg19 and to identify and assess rare variants for their potential to disrupt protein function. The candidate variants were further confirmed using Sanger re-sequencing in all the family members. Primer sequences and PCR conditions are available on request. Deleteriousness of the candidate variants was assessed using Combined Annotation–Dependent Depletion (CADD) scores [36].

Results

FLAGS: genes frequently affected by rare, likely-functional variants in public exomes

It has been previously reported that TTN and MUC16 appear in multiple exome analyses due to their length [37-41]; researchers are aware of these genes and are cautious when encountering rare likely functional (missense, nonsense, splice site) variants in WES analyses [37-41]. In a study of 53 independent families suffering from distinct rare inborn errors of metabolism (comprising 150 whole exomes and 13 whole genomes; http://www.tidebc.org; Additional file 8: Text S4 and Additional file 9: Table S7), we confirmed that rare/novel, likely functional variants affecting TTN and MUC16 repeatedly passed all the prioritization steps of our pipeline and appeared in ~5% of our candidate disease-gene lists. However, other genes were repeatedly observed in multiple families affected with different phenotypes (e.g. DST). This motivated us to compile a set of FLAGS (FrequentLy mutAted GeneS) to understand their properties and facilitate better interpretation of phenotypes associated with these variants. The FLAGS list was generated by ranking genes based on number of rare (<1%) functional variants affecting these genes in general populations (NHLBI Exome Sequencing Project (ESP6500) and dbSNPv138). As expected, TTN and MUC16 are the top two genes based on the number of rare functional variants; however, other genes that were frequently affected by rare, likely functional variants in multiple TIDE families with unrelated phenotypes were also observed to be frequently mutated in general population (Additional file 10: Table S8). To explore the properties of these frequently mutated genes, we focused our analysis on the top 100 from this ranked list, which we hereafter refer to as FLAGS (Figure 1).

Figure 1 The word cloud of FLAGS. A text file was created using a custom Perl script to reflect the frequency of mutation per gene in FLAGS. The Tagxedo (http://www.tagxedo.com/) was then used to generate the word cloud. The size of the words reflects how frequently they are found to bear rare, likely functional variants in the general population. As expected TTN and MUC16 are the top two genes.

FLAGS tend to have longer ORFs

In this study, the assignment of gene length refers to the longest open reading frame. Genes with longer ORFs are expected to have more mutations than shorter genes. To confirm this, we determined the distribution of gene lengths based on the longest annotated open reading frame for each gene. FLAGS have an average length of 4653 ± 3605 aa (amino acids). The high variance is due to two genes (TTN and MUC16) having extremely long lengths (35992 and 14508 aa respectively) compared to the rest of the protein coding genes. Excluding the 2 outlying genes, the remaining FLAGS genes (n = 98) have an average ORF length of 4233 ± 1399 aa. Figure 2a shows the distribution of ORF lengths across different evaluated datasets (with outliers removed to show the distribution more clearly).
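The gene-set comparisons reported in this and the following sections rely on the one-tailed Mann–Whitney U test named under Statistical analyses. As an illustration of what that test computes, the sketch below derives the U statistic from average ranks and a one-tailed p-value from a normal approximation with continuity correction. It is a standard-library stand-in, not the exact test the authors ran with wilcox.test (exact = TRUE), it applies no tie correction to the variance, and the function names and input values are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def average_ranks(values):
    """1-based ranks of values, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def mann_whitney_greater(x, y):
    """U statistic for x and approximate p-value for 'x tends to be greater than y'."""
    n1, n2 = len(x), len(y)
    ranks = average_ranks(list(x) + list(y))
    u = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    sd_u = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)  # no tie correction (assumption)
    z = (u - mean_u - 0.5) / sd_u              # continuity correction
    return u, 1 - NormalDist().cdf(z)


# Toy ORF lengths (aa) for two small gene sets; illustrative values only.
u_stat, p_greater = mann_whitney_greater([10, 12, 14], [1, 2, 3])
```

For samples as small as this toy input the approximation is crude; an exact test, as used in the study, is the appropriate choice for small samples.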
The entire FLAGS have overall much higher ORF length than HGMD, OMIM and Background (HGMD, OMIM comparisons each yield a p-value <2.2e−16, Background comparison yields a p-value of 0.00027). This is aligned with our expectation that FLAGS are frequently mutated from exome analysis because they correspond to genes with long coding regions.

FLAGS tend to have paralogs

The presence of paralogs may increase tolerance for otherwise phenotype-inducing functional variations due to functional compensation [42,43]. We calculated the number of paralogs per gene reported by the Ensembl Compara database [30], and compared this property between different gene sets. FLAGS overall have an average of 4 paralogs per gene. Figure 2b shows the distribution of the number of paralogs across the different gene sets. Aligned with our expectation, FLAGS have more paralogs than genes from OMIM, HGMD and Background (OMIM p-value = 7.2e−05, HGMD p-value = 7.4e−05, Background p-value = 8.1e−09). While the existence of paralogs may cause read mapping challenges that lead to an increased frequency of false variant predictions, most of these technical errors will be eliminated by a filter for variant frequency, as they will arise recurrently.

Figure 2 Properties of FLAGS. (a) Violin distribution of open reading frame lengths across the evaluated gene sets. Y-axis shows the length defined in terms of amino acids for the longest annotated transcript per gene. Outliers are excluded from the plot. (b) Distribution of number of paralogs per gene across the evaluated gene sets. Y-axis shows the violin distribution of paralogs based on Ensembl Compara database. Outliers are excluded from the plot. (c) Cumulative distribution of dN/dS ratio across the evaluated gene sets. X-axis is limited from 0 to 2, and Y-axis plots the corresponding probability according to the cumulative distribution function.

FLAGS tend to have higher dN/dS ratios

Genes which exhibit many functional genetic variations (missense/nonsense/splice site) may have a higher tolerance for variations and thus a reduced likelihood of phenotypes subject to negative selection. For each gene, we calculated the dN/dS ratio as a proxy indicator of the amount of selective pressure acting on protein-coding genes. FLAGS have an average dN/dS ratio of 0.65 ± 0.18. Overall these genes have significantly higher ratio compared to genes from HGMD, OMIM, and Background (each individual comparison yields a p-value <0.005). Figure 2c shows the relative densities from cumulative distribution functions for each gene set. The trend indicates that frequently mutated genes have higher dN/dS ratio on average than expected.

Variants detected in FLAGS tend to be predicted as less deleterious

We explored the possibility that the FLAGS genes are affected by less deleterious rare variants compared to other genes. If the variants in FLAGS are less likely to be involved in diseases, then we would expect the variants to have lower predicted damage scores. To calculate this, we used the Phred-scaled Combined Annotation Dependent Depletion (CADD) score developed by Kircher et al. (2014) to rank the deleteriousness of each single nucleotide variant [36]. The method objectively integrates diverse annotations into a single measurement for each variant by training upon ~15 million genetic variants separating humans from chimpanzees against a simulated set of variants not exposed to selection. This method was chosen over other variant prediction tools because of its superior performance [36] and its ability to quantify the severity of a variant by a ranking system.
This ranking system compares the candidate variant against other possible variants in the genome and assigns it a score based on this comparison; other variant prediction tools do not take into account other possible mutations in the genome [44]. Also, the CADD method includes ranking of nonsense and splice site variants, while other tools only handle missense [36]. For each gene, we calculated the proportion of variants with CADD Phred-scaled score <10, between 10 and 20, and above 20. We found that FLAGS are more enriched for variants with low scores, compared to OMIM and HGMD (Figure 3a; p-values = 2.6e−11, 2.9e−12 respectively). Likewise, OMIM and HGMD are more enriched for variants with high impact score (>20) than FLAGS (Figure 3b; p-values = 2.4e−09, and 1.2e−10 respectively). These results are aligned with our expectation. We additionally analyzed the genic tolerance of FLAGS to functional genetic variants, using residual variation intolerance score (RVIS) published by Petrovski et al. (2013) [26] and observed trends in the same direction (Additional file 11: Text S2).

FLAGS tend to be reported in PubMed and associated with disease phenotypes

We sought to determine if there is a publication bias for pathogenic mutations in the frequently mutated genes. For each gene, we calculated the number of publications related to human diseases and biological functions using GeneRIF annotations (Figure 4). FLAGS have an average of 51 articles per gene, which is lower than for genes from HGMD and OMIM (OMIM p-value = 0.00087, HGMD p-value = 0.0035). However, FLAGS have more publications than the Background set (p-value = 6.3e−12).

We next considered if the frequently mutated genes are associated with greater diversity of disease phenotypes compared to disease-associated genes.
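The per-gene CADD proportions used in the deleteriousness analysis above (scores below 10, between 10 and 20, and above 20) can be sketched as follows. The function name and the example scores are hypothetical, and the sketch assumes that scores of exactly 10 or 20 fall into the middle bin, which the text does not specify.

```python
def cadd_bin_proportions(phred_scores):
    """Proportion of one gene's variants in each Phred-scaled CADD score bin."""
    total = len(phred_scores)
    if total == 0:
        raise ValueError("gene has no scored variants")
    low = sum(1 for s in phred_scores if s < 10)
    high = sum(1 for s in phred_scores if s > 20)
    middle = total - low - high  # assumption: 10 <= score <= 20
    return {"<10": low / total, "10-20": middle / total, ">20": high / total}


# Toy scores for a single hypothetical gene.
proportions = cadd_bin_proportions([2.0, 8.5, 12.0, 25.0])
```

Each gene then contributes one proportion per bin, and those per-gene values are what a between-set comparison such as the one in Figure 3 would be run on.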
Our expectation is that if the frequently seen genes are arising as candidates in more studies, and are less likely to be truly pathogenic, then they could be associated to a wider range of phenotypes in the literature (we recognize the association could also be due to pleiotropy [45], see Limitations). To analyze if FLAGS have been frequently correlated to human diseases, we used two different computational resources (MeSHOP [32], HPO [33]) to extract known significant relationship(s) between genes and human disease phenotypes based on published scientific articles. Figures 5a and b show the distribution of the number of disease terms from HPO and MeSHOP per gene within gene sets. From MeSHOP results, we see that FLAGS have slightly fewer MeSH diseased terms per gene than genes from OMIM (mean 8.1 vs. 10.2; p-value = 0.013), and significantly fewer terms per gene than HGMD genes (mean 8.1 vs. 9.5; p-value = 2.3e−12). FLAGS have more MeSH terms than Background genes (mean 8.1 vs. 3.1; p-value = 1.3e−15). These observations are consistent with the results based on HPO annotations, where we again see that while FLAGS have fewer disease phenotypic terms than genes from OMIM and HGMD (mean 2.1 vs. 3.7 and 3.8 respectively; p-values <0.0001), FLAGS exhibit more terms than the Background (mean 2.1 vs. 0.6; p-value = 3.7e−14). To adjust for the potential bias that genes with more articles are likely to have more MeSH and HPO terms attached, we repeated the analysis by normalizing the MeSH and HPO terms to the number of publications in GeneRIF. The normalized observations are consistent with the results if no normalization was applied (Additional file 12: Text S5).

FLAGS recently implicated in rare-Mendelian disorders

We sought to determine which FLAGS have been reported with pathogenic mutations in NGS clinical studies. Boycott et al.
(2013) provided a compilation of 178 novel genes discovered to be disease-associated through exome sequencing [7], of which three overlapped with FLAGS (KMT2D/MLL2, HERC2, and DST). To explore the properties of those three genes, we analyzed the ratio between the number of rare variants and gene length, and looked for putative essential protein domains by assessing the distribution of rare variants across the gene. We found that among the FLAGS, KMT2D and HERC2 have the lowest ratios of number of rare variants to gene length, while DST is one of the three genes in the FLAGS set with a significantly non-uniform distribution of rare variants across the gene (p-value = 1.2e−04; the other two are EPPK1 and HRNR; see Additional file 13: Text S1 for more details on methodology and rationale). If we were to expand this list of 178 novel rare-disease genes from Boycott et al. (2013) to include the exome studies reporting on already-known disease-associated genes with known/novel pathogenic mutations, then this expanded set (n = 300) overlapped FLAGS by an additional 7 genes (TTN, RYR1, PKHD1, RP1L1, ASPM, SACS, ABCA4). In the discussion we provide our thoughts and literature analysis on why these genes have been reported as disease-associated despite being among the genes that most frequently harbor rare functional variants.

Figure 3 FLAGS genes are affected by rare variants predicted to be less deleterious than the variants affecting known disease-genes. (a) A boxplot distribution of proportion of variants with CADD score <10. The Y-axis plots the proportion of variants within each gene set having a Phred-scaled CADD score of <10. The proportion was calculated per individual gene. (b) A boxplot distribution of proportion of variants with CADD score >20. The Y-axis plots the proportion of variants within each gene set having a Phred-scaled CADD score of >20. The proportion was calculated per individual gene.

Figure 4 Cumulative distribution of the number of publications per gene across the evaluated gene sets (FLAGS, HGMD, OMIM, Background). X-axis plots the number of publications from GeneRIF per gene, and Y-axis plots the corresponding probability according to the cumulative distribution function.

Figure 5 FLAGS tend to be associated with disease phenotypes. (a) Violin distribution of number of HPO disease terms across the evaluated gene sets. Y-axis is the violin distribution showing the number of HPO terms per gene. Outliers are excluded from the plot. (b) Violin distribution of number of MeSH disease terms from program MeSHOP across the evaluated gene sets. Y-axis is the violin distribution showing the number of MeSH terms per gene. Outliers are excluded from the plot.

Applying FLAGS to prioritize candidate variants

Case study

To demonstrate a disease-causing variant prioritization approach using FLAGS and whole exome sequencing data, we selected one family from our TIDE cohort affected by an unknown rare genetic disorder. Through WES performed for the index and her unaffected parents (Methodology - Mutation Detection using WES – a case study), rare variants were identified and assessed for their potential to disrupt protein function. Only those variants predicted to be functional (missense, nonsense and frameshift changes, as well as in-frame deletions and splice-site effects) were subsequently screened under a series of inheritance models. In total, we identified six rare “functional” homozygous, and eight rare “functional” compound heterozygous candidates. Of those, only two genes affected by missense variants were considered functional candidates:

(1) VPS13B gene (MIM 607817) had been found to bear homozygous or compound heterozygous mutations in patients with Cohen syndrome (MIM 216550). Cohen syndrome is characterized by developmental delay/intellectual disability, facial dysmorphism, microcephaly, neutropenia, and weak muscle tone (hypotonia). The features of Cohen syndrome vary widely in presence and severity among affected individuals. Additional features, perhaps patient-specific, appear in the reports; myopia and small hands and feet are observed in our patient. In our WES analysis, we identified two rare variants affecting this gene in the index, suggesting compound heterozygous inheritance. Neither variant was found in our set of more than 160 in-house exomes; one of the variants was predicted to be deleterious by CADD [36] with a score higher than 20, while the second variant was given a score of less than 5. Sanger re-sequencing confirmed that the mother is a carrier of one variant, the father is a carrier of the second variant, and the index is compound heterozygous, making the VPS13B gene a candidate disease-gene in this family.

(2) SENP1 gene (MIM 612157) product is one of the desumoylating enzymes [46], which is important for proper development and survival in mice. SENP1 was found to regulate expression of GATA1 in mice and subsequent erythropoiesis [47]. Furthermore, SENP1 was found to be essential for the development of early T and B cells through regulation of STAT5 activation [48]. To date, germline mutations in SENP1 had not been described in any human diseases. Our WES analysis identified a rare missense homozygous variant in the index. The variant was not found in our set of more than 160 in-house exomes and was predicted to be the most deleterious of all homozygous variants using the CADD scores [36]. Sanger re-sequencing of the genomic DNA confirmed that the index is homozygous for the variant, while both parents are carriers.

To further prioritize between these two genes, we consider a FLAGS-based approach.
The VPS13B gene is one of the FLAGS (top 100, rank 67) and is frequently seen to be affected by rare, likely functional variants in the general population. On the other hand, SENP1 is rarely affected by functional variants in the general population (rank 11,947). In addition, VPS13B is frequently seen in the TIDE cohort of patients: 22 of 160 individuals have rare, likely functional alleles in the VPS13B gene that pass our prioritization filters. In contrast, the family reported here is the only family from the TIDEX cohort of patients with a rare, likely functional variant affecting SENP1. In none of the other 160 exomes did the variants in SENP1 pass our prioritization filters for rare, likely functional variants. Together with the fact that VPS13B does not fit well with her severe hematologic findings and bone marrow dysplasia, FLAGS helped us select SENP1 as the candidate gene for our experimental validation studies. The case report will be published separately. We further applied prioritization of FLAGS on an in-house WES/WGS database and illustrated how trio-based exome families have Mendelian recessive and dominant candidates overlapping with the FLAGS. The FLAGS ranking can be fed into the candidate identification process to highlight genes that should be considered as high-risk candidates for false positives [Additional file 14].

Discussion

WES/WGS studies can identify hundreds to thousands of rare protein-coding mutations per individual. Genes vary in their frequency of appearance; genes that are more likely to harbor rare coding variants by chance are less likely to be involved in human diseases, especially in the context of rare Mendelian disorders. Previous studies have reported that TTN and MUC16, the two longest genes in the human genome, should be interpreted with care due to their long lengths [37-41]. In this study, we compiled a list of frequently mutated genes (FLAGS) based upon analysis of rare coding mutations from dbSNP and Exome Variant Server ESP6500.
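As a rough illustration of how a frequency-based gene ranking of this kind might be compiled, the sketch below counts rare, likely functional variants per gene and ranks genes from most to least frequently affected. It is a minimal sketch, not the authors' pipeline: the input tuple format, the function name, the alphabetical tie-breaking, and the use of a plain per-gene count are assumptions made for the example; only the <1% rarity threshold comes from the paper's abstract.

```python
from collections import Counter


def rank_genes_by_rare_variants(variants, max_af=0.01):
    """Rank genes by their number of rare, likely functional variants.

    `variants` is an iterable of (gene, allele_frequency, is_functional)
    tuples (hypothetical format). Rank 1 is the most frequently affected
    gene; ties are broken alphabetically (an arbitrary choice).
    """
    counts = Counter(
        gene
        for gene, allele_frequency, is_functional in variants
        if is_functional and allele_frequency < max_af
    )
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return {gene: rank for rank, (gene, _) in enumerate(ordered, start=1)}


# Toy input with made-up gene names and allele frequencies.
toy_variants = [
    ("GENE_A", 0.001, True),
    ("GENE_A", 0.004, True),
    ("GENE_A", 0.200, True),   # too common, ignored
    ("GENE_B", 0.002, True),
    ("GENE_C", 0.003, False),  # not predicted functional, ignored
    ("GENE_D", 0.002, True),
]
ranking = rank_genes_by_rare_variants(toy_variants)
```

Genes with no qualifying variant receive no rank in this sketch, so a lookup for them should be treated as "not ranked" rather than as rank zero.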
We compared the biological properties of FLAGS against genes from disease databases (HGMD, OMIM) that represent the most reliable curated resources currently available for disease-associated genes. We further demonstrated the clinical utility of FLAGS as a gene prioritization tool. The discussion will illustrate additional clinical benefits of FLAGS, and conclude with ideas for future directions and project limitations.

FLAGS are less likely to be disease-associated

Consistent with our expectations, FLAGS have significantly longer coding lengths, higher average dN/dS ratios, and more paralogs than genes from OMIM and HGMD. Paralogs have been cited as capable of partially compensating for the loss of gene function [42,43], so the greater frequency of paralogs could mean that mutations are less likely to have a critical impact on phenotype. In the examination of the research literature for FLAGS, we observed fewer disease annotations compared to disease genes, but elevated rates compared to background genes, suggesting that FLAGS have been associated with human disease more frequently than the rest of the protein-coding genes.

Clinical utilization of FLAGS for prioritization

Prioritizing candidates in rare disease studies is important; as it takes substantial expert time to review each gene [49], getting better specificity without loss of sensitivity has real value. We demonstrated the utility of FLAGS as a prioritization tool by overlapping FLAGS against candidates from clinical exomes in TIDE, without loss of ultimately identified causal genes. We further illustrated with a single clinical case how, when multiple equally attractive candidates are under consideration, FLAGS provide a way for clinicians and researchers to decide which gene to focus on first.

Cautionary indicator

While we are not claiming every gene in FLAGS is non-pathogenic, we do wish to make it clear that greater biological evidence is required when interpreting the functional impacts of rare variants in frequently mutated genes.
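The use of the ranking as a tie-breaker and cautionary indicator can be sketched in a few lines. The ranks for VPS13B (67) and SENP1 (11,947) are the ones reported in the case study; everything else here (the function name, the tuple output, placing unranked genes last) is a hypothetical design for illustration and not the authors' implementation. In the actual case study the decision also rested on phenotype fit, which this sketch does not model.

```python
def prioritize_candidates(candidates, flags_rank, top_n=100):
    """Order candidate genes so the least frequently mutated come first.

    Returns (gene, rank, caution) tuples, where `caution` is True for genes
    inside the top `top_n` of the FLAGS ranking. Genes without a known rank
    are placed last with rank None (an assumption of this sketch).
    """
    annotated = []
    for gene in candidates:
        rank = flags_rank.get(gene)
        caution = rank is not None and rank <= top_n
        annotated.append((gene, rank, caution))
    # Known ranks first, larger rank number (rarely mutated) before smaller.
    annotated.sort(key=lambda row: (row[1] is None, -(row[1] or 0)))
    return annotated


# Ranks taken from the case study in the text.
flags_rank = {"VPS13B": 67, "SENP1": 11947}
ordered = prioritize_candidates(["VPS13B", "SENP1"], flags_rank)
```

With these inputs SENP1 is listed first and VPS13B carries the caution flag, which mirrors the direction of the decision described in the case study.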
Among the 300 genes with putative pathogenic mutations identified via exome sequencing compiled by Boycott et al. (2013) [7], ten genes intersected with FLAGS. We evaluated the gene-level and variant-level evidence for causality based upon the guideline for investigation of causality published by MacArthur et al. (2014) [23]. We found that many results are derived from single-gene sequencing, rather than from the less biased exome or whole-genome approach [50-52]. In addition, many studies reported the mutations as pathogenic simply due to segregation pattern within the family, rare allelic frequency and bioinformatics impact predictions [41,53-55], thus lacking experimental validation at both the variant and gene levels. The screen for rare alleles is further complicated when some of the studies look at minor ethnic populations that are not well represented in the population databases [52,54,55]. The evidence behind missense variants is especially doubtful when many missense variants are predicted by CADD [36] to be benign with a lower impact rank than the rare mutations observed from dbSNP and ESP6500. Altogether, these observations could explain why these genes harbor frequent rare functional variations despite being reported in diseases. To avoid false-positive reports of causality, especially for FLAGS, it will be very important for reports to follow the recently published guidelines [56] when assigning pathogenicity to newly identified variants as well as to additional variants identified in genes previously linked to a particular disease. An example of a good paper would be one where the variant is identified in a genome-wide screening approach with statistical methods applied to compare the distribution of variants in patients against a large matched control cohort, where the evidence is assessed at both the candidate gene and candidate variant levels, and where the authors recognize the importance of combining both computational comparative approaches and experimental assays for validating the impact of the variant.

Going beyond the top 100 and what the future entails

Genes with frequent rare variants need to be appropriately ranked in order to reduce false associations and streamline clinical analysis. Our current results are limited to the top 100 frequently mutated genes. While it may be insightful to study the characteristics of the genes at the other end of the spectrum (the bottom 100 or alternatively sets of genes with low mutation rates and gene-focused publications to exclude genes with poor coverage in exome capture kits), we perceive the greatest long-term utility to be in the incorporation of the complete set of rankings into the exome interpretation process. To make our prioritization ranking accessible to the broad research community, we provide the FLAGS ranking for the genes represented in both dbSNP and EVS.

The novelty that we bring forth is a ranking that utilizes public control exomes/genomes, which clinicians can readily apply to their clinical cases. As discussed above, the ranking is correlated with gene length, evolutionary constraint, and paralogous gene counts.

The high accumulation rate of mutations can be interpreted partially as genes being under less selective constraint. A utility of the FLAGS ranking is that it provides, albeit indirectly, a gene-level indication of the selective constraint upon a gene, while most existing metrics such as phastCons [57] or PhyloP [58] provide a position-specific value. While the FLAGS ranking is not a substitute for the more direct measures, the genic level information complements them.

Current prioritization tools lack the ability to evaluate at both genic and variant level simultaneously. Ultimately, a scoring mechanism integrating biological and technological features at both the genic and variant level should be developed. A future direction is to improve upon methodologies like RVIS [26] and expand beyond the rate of mutation by employing statistical machine learning techniques to incorporate the genic and allelic features highlighted in this study and previous works, and to summarize them into a single computational score. Such a new quantitative measurement should improve the ranking of pathogenicity for each gene, and highlight questionable candidates to accelerate the clinical translation of genomic research findings. The mechanism itself (e.g. the weights of features) would also shed light on the exact nature of the causes of excess mutation rates and facilitate better biological understanding.

In the long-term, the accumulation of more exomes and whole genomes will provide an increasingly rich body of data for the generation of FLAGS rankings.

Limitations

In the study we relied upon manually-curated GeneRIFs to extract the publications for each gene. One could argue for more sophisticated PubMed queries in combination with semantic rules to increase the sensitivity for assigning human-disease related publications [59,60]. We also recognize that neither MeSHOP nor HPO captures gene-to-disease terms perfectly. A possible direction is to explore other gene-disease databases such as HuGE Navigator [61]. We further acknowledge that the interpretation of MeSHOP and HPO could be influenced by pleiotropic genes. Similarly, we used Ensembl for extracting the paralogous relationships for each gene, but there are other available extraction algorithms and databases for inferring paralogy [62-64]. Additionally, our present study is restricted to genes with both an HGNC symbol and a fully annotated translation start and end. We recognize that not all protein-coding genes fit these criteria, and we are excluding non-coding genes (as well as 5′ and 3′ UTRs of coding genes) from this analysis.

Conclusion

While studies of most complex disorders can generally confirm the strength of their findings by comparing against a matched background cohort, the nature of studying rare monogenic disorders means that there is often insufficient sample size to conduct a rigorous statistical analysis on the strength of the finding. In this study, we extracted a list of frequently mutated genes based on rare variants from dbSNP and Exome Variant Server. Our results revealed the biological properties of these genes that could explain why they are frequently mutated, and why extra discretion in statistical and biological interpretation needs to be taken when trying to relate these genes to clinical phenotypes. We propose that the ranking of how frequently a gene is mutated in next-generation sequencing studies is useful for the prioritization of candidate genes.

Consent

Written informed consent was obtained from the patient’s guardian/parent/next of kin for the publication of this report.

Additional files

Additional file 1: Table S1. This table lists the five datasets used in this study, and the genes that made up each dataset. The first row in the table shows the names of the datasets referred to throughout the manuscript, and each column contains the list of genes, referred to by their official gene symbol.

Additional file 2: Table S6.
A list of variants, in variant call format (VCF), showing the mutations that were observed more than 10 times in our in-house database consisting of 150 exomes and 13 whole genomes, after they were filtered by allelic frequencies according to the annotations from dbSNP and Exome Variant Server (refer to methodology section for more details).

Additional file 3: Table S4. The entire ranked list of FLAGS, with the most frequently mutated genes at the top.

Additional file 4: Table S2. There are two lists in this table. The first is a list of genes that overlapped between FLAGS vs. OMIM (Table S2A), and the second is a list of genes that overlapped between FLAGS vs. HGMD (Table S2B).

Additional file 5: Table S5. A table showing a comparison of dN/dS ratio between the values we reported with our calculation (see manuscript for methodology), versus a previously published result of a gene set (refer to reference [31] in the manuscript). The results between the two methodologies were highly consistent.

Additional file 6: Table S9. This table shows the two input files that were fed into R for statistical analyses. In the table, the attributes for each gene used in the analysis were described (dN/dS ratio, gene length, # of MeSH terms, # of HPO terms, # of paralogs for table S9A, and # of pathogenic mutations from HGMD for table S9B).

Additional file 7: Text S3. This section contains a concise description of our in-house bioinformatics pipeline for processing exome and whole-genome datasets.

Additional file 8: Text S4. This section provides a description of the TIDE-BC project.

Additional file 9: Table S7. A summary of the families studied in TIDEX project, and the number of candidate variants remaining after filtering against genetic and allelic frequency thresholds. The results are broken down by family structure and the types of genetic model applied.
Refer to www.tidebc.org and additional text S4 for more information on the TIDEX project, and additional text S3 for how the variants were called and filtered.

Additional file 10: Table S8. A table of number of exomes from TIDEX project that lists out the number of rare functional variants for each protein-coding gene captured in the exome capture kits. Please refer to methodology section for how ‘rare’ and ‘functional’ descriptors were defined.

Additional file 11: Text S2. A comparison of our FLAGS gene ranking system against another method, residual variation intolerance score (RVIS) that also built upon public genomic datasets and ranked importance of each gene’s association to human diseases. Agreements and disagreements between the two methods are discussed.

Additional file 12: Text S5. This section describes an analysis looking at the distribution of number of MeSH and HPO terms per gene, after normalizing by the number of biological functionally-related literature published for that gene, as reported in GeneRIF.

Additional file 13: Text S1. This section describes an analysis looking at the uniformity of distribution for rare functional variants across genes. The hypothesis was that genes of less significance to monogenic human diseases would display more uniformity in the occurrences of benign coding mutations across the protein sequence, whereas genes that are more linked to causing penetrating diseases would harbor regions that are more devoid of mutations due to conservation of important protein domains.

Additional file 14: The PDF outlining the supplementary information for this manuscript.

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

CS, MTG, MG and JJYL generated the data. CS, MTG, MG and JJYL analyzed the data. MTG, CvK and WWW designed the study. MTG and CS wrote the manuscript.
All authors read, edited and approved the final manuscript.

Acknowledgements

We are indebted to the patient and her family for participation in this study; Drs. J. Wu, J. Rozmus, S. Vercauteren, K. Hildebrand, T. Dewan and A. Garcera for clinical evaluation and management of the patient; Mrs. X. Han for Sanger sequencing; Mr. B. Sayson for data management; Mrs. M. Higginson for DNA extraction, sample handling and technical data; Dr. C. Vilarino-Guell for timely whole exome sequencing; Dr. W. Cheung for MeSHOP support; Mr. D. Arenillas and Mr. M. Hatas for systems support, and Dora Pak for research management support. This work was supported by funding from the B.C. Children’s Hospital Foundation as “1st Collaborative Area of Innovation” (www.tidebc.org); Genome BC (SOF-195 grant); Genome BC and Genome Canada grants 174CDE (ABC4DE Project); and the Canadian Institutes of Health Research #301221 grant. CS is funded by CIHR-CGSD, JJYL is funded by NSERC-CREATE, and MG is funded by CIHR-Computational Biology Undergraduate Summer Student Health Research.

Author details

1Centre for Molecular Medicine and Therapeutics, Child and Family Research Institute, Vancouver, BC, Canada. 2Department of Medical Genetics, University of British Columbia, Vancouver, BC, Canada. 3Treatable Intellectual Disability Endeavour in British Columbia, Vancouver, Canada. 4Bioinformatics Graduate Program, University of British Columbia, Vancouver BC, Canada. 5Genome Science and Technology Graduate Program, University of British Columbia, Vancouver, BC, Canada. 6Division of Biochemical Diseases, BC Children’s Hospital, Vancouver, BC, Canada. 7Department of Pediatrics, University of British Columbia, Vancouver, BC, Canada.

Received: 16 June 2014 Accepted: 24 October 2014

References

1. Green ED, Guyer MS, National Human Genome Research Institute: Charting a course for genomic medicine from base pairs to bedside. Nature 2011, 470:204–213.
2.
McCarthy MI, Abecasis GR, Cardon LR, Goldstein DB, Little J, Ioannidis JPA, Hirschhorn JN: Genome-wide association studies for complex traits: consensus, uncertainty and challenges. Nat Rev Genet 2008, 9:356–369.
3. Van Karnebeek CD, Sly WS, Ross CJ, Salvarinova R, Yaplito-Lee J, Santra S, Shyr C, Horvath GA, Eydoux P, Lehman AM, Bernard V, Newlove T, Ukpeh H, Chakrapani A, Preece MA, Ball S, Pitt J, Vallance HD, Coulter-Mackie M, Nguyen H, Zhang L-H, Bhavsar AP, Sinclair G, Waheed A, Wasserman WW, Stockler-Ipsiroglu S: Mitochondrial carbonic anhydrase VA deficiency resulting from CA5A alterations presents with hyperammonemia in early childhood. Am J Hum Genet 2014, 94:453–461.
4. Montserrat Moliner A, Waligóra J: The European union policy in the field of rare diseases. Public Health Genomics 2013, 16:268–277.
5. Carter CO: Monogenic disorders. J Med Genet 1977, 14:316–320.
6. Baird PA, Anderson TW, Newcombe HB, Lowry RB: Genetic disorders in children and young adults: a population study. Am J Hum Genet 1988, 42:677–693.
7. Boycott KM, Vanstone MR, Bulman DE, MacKenzie AE: Rare-disease genetics in the era of next-generation sequencing: discovery to translation. Nat Rev Genet 2013, 14:681–691.
8. McKusick VA: Mendelian Inheritance in Man and its online version, OMIM. Am J Hum Genet 2007, 80:588–604.
9. Aymé S, Urbero B, Oziel D, Lecouturier E, Biscarat AC: Information on rare diseases: the Orphanet project. Rev Médecine Interne Fondée Par Société Natl Francaise Médecine Interne 1998, 19(Suppl 3):376S–377S.
10. Samuels ME: Saturation of the human phenome. Curr Genomics 2010, 11:482–499.
11. Cooper DN, Chen J-M, Ball EV, Howells K, Mort M, Phillips AD, Chuzhanova N, Krawczak M, Kehrer-Sawatzki H, Stenson PD: Genes, mutations, and human inherited disease at the dawn of the age of personalized genomics. Hum Mutat 2010, 31:631–655.
12. Bamshad MJ, Ng SB, Bigham AW, Tabor HK, Emond MJ, Nickerson DA, Shendure J: Exome sequencing as a tool for Mendelian disease gene discovery.
Nat Rev Genet 2011, 12:745–755.
13. Gilissen C, Hoischen A, Brunner HG, Veltman JA: Unlocking Mendelian disease using exome sequencing. Genome Biol 2011, 12:228.
14. Ku C-S, Naidoo N, Pawitan Y: Revisiting Mendelian disorders through exome sequencing. Hum Genet 2011, 129:351–370.
15. Rabbani B, Mahdieh N, Hosomichi K, Nakaoka H, Inoue I: Next-generation sequencing: impact of exome sequencing in characterizing Mendelian disorders. J Hum Genet 2012, 57:621–632.
16. Pabinger S, Dander A, Fischer M, Snajder R, Sperk M, Efremova M, Krabichler B, Speicher MR, Zschocke J, Trajanoski Z: A survey of tools for variant analysis of next-generation genome sequencing data. Brief Bioinform 2014, 15:256–278.
17. Sherry ST, Ward MH, Kholodov M, Baker J, Phan L, Smigielski EM, Sirotkin K: dbSNP: the NCBI database of genetic variation. Nucleic Acids Res 2001, 29:308–311.
18. 1000 Genomes Project Consortium, Abecasis GR, Altshuler D, Auton A, Brooks LD, Durbin RM, Gibbs RA, Hurles ME, McVean GA: A map of human genome variation from population-scale sequencing. Nature 2010, 467:1061–1073.
19. International HapMap Consortium: A haplotype map of the human genome. Nature 2005, 437:1299–1320.
20. International HapMap Consortium, Frazer KA, Ballinger DG, Cox DR, Hinds DA, Stuve LL, Gibbs RA, Belmont JW, Boudreau A, Hardenbol P, Leal SM, Pasternak S, Wheeler DA, Willis TD, Yu F, Yang H, Zeng C, Gao Y, Hu H, Hu W, Li C, Lin W, Liu S, Pan H, Tang X, Wang J, Wang W, Yu J, Zhang B, Zhang Q, et al: A second generation human haplotype map of over 3.1 million SNPs. Nature 2007, 449:851–861.
21. Kumar P, Henikoff S, Ng PC: Predicting the effects of coding non-synonymous variants on protein function using the SIFT algorithm. Nat Protoc 2009, 4:1073–1081.
22. Adzhubei IA, Schmidt S, Peshkin L, Ramensky VE, Gerasimova A, Bork P, Kondrashov AS, Sunyaev SR: A method and server for predicting damaging missense mutations. Nat Methods 2010, 7:248–249.
23.
MacArthur DG, Balasubramanian S, Frankish A, Huang N, Morris J, Walter K, Jostins L, Habegger L, Pickrell JK, Montgomery SB, Albers CA, Zhang ZD, Conrad DF, Lunter G, Zheng H, Ayub Q, DePristo MA, Banks E, Hu M, Handsaker RE, Rosenfeld JA, Fromer M, Jin M, Mu XJ, Khurana E, Ye K, Kay M, Saunders GI, Suner M-M, Hunt T, et al: A systematic survey of loss-of-function variants in human protein-coding genes. Science 2012, 335:823–828.
24. Adzhubei I, Jordan DM, Sunyaev SR: Predicting functional effect of human missense mutations using PolyPhen-2. In Curr Protoc Hum Genet Editor Board Jonathan Haines Al; 2013. Chapter 7:Unit7.20.
25. Gill N, Singh S, Aseri TC: Computational disease gene prioritization: an appraisal. J Comput Biol J Comput Mol Cell Biol 2014, 21:456–465.
26. Petrovski S, Wang Q, Heinzen EL, Allen AS, Goldstein DB: Genic intolerance to functional variation and the interpretation of personal genomes. PLoS Genet 2013, 9:e1003709.
27. Gray KA, Daugherty LC, Gordon SM, Seal RL, Wright MW, Bruford EA: Genenames.org: the HGNC resources in 2013. Nucleic Acids Res 2013, 41(Database issue):D545–D552.
28. Stenson PD, Mort M, Ball EV, Shaw K, Phillips A, Cooper DN: The Human Gene Mutation Database: building a comprehensive mutation repository for clinical and molecular genetics, diagnostic testing and personalized genomic medicine. Hum Genet 2014, 133:1–9.
29. Cingolani P, Platts A, Wang LL, Coon M, Nguyen T, Wang L, Land SJ, Lu X, Ruden DM: A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3. Fly (Austin) 2012, 6:80–92.
30. Flicek P, Amode MR, Barrell D, Beal K, Billis K, Brent S, Carvalho-Silva D, Clapham P, Coates G, Fitzgerald S, Gil L, Girón CG, Gordon L, Hourlier T, Hunt S, Johnson N, Juettemann T, Kähäri AK, Keenan S, Kulesha E, Martin FJ, Maurel T, McLaren WM, Murphy DN, Nag R, Overduin B, Pignatelli M, Pritchard B, Pritchard E, Riat HS, et al: Ensembl 2014.
Nucleic Acids Res 2014, 42(Database issue):D749–D755.
31. Piton A, Redin C, Mandel J-L: XLID-causing mutations and associated genes challenged in light of data from large-scale human exome sequencing. Am J Hum Genet 2013, 93:368–383.
32. Cheung WA, Ouellette BFF, Wasserman WW: Compensating for literature annotation bias when predicting novel drug-disease relationships through Medical Subject Heading Over-representation Profile (MeSHOP) similarity. BMC Med Genomics 2013, 6(Suppl 2):S3.
33. Köhler S, Doelken SC, Mungall CJ, Bauer S, Firth HV, Bailleul-Forestier I, Black GCM, Brown DL, Brudno M, Campbell J, FitzPatrick DR, Eppig JT, Jackson AP, Freson K, Girdea M, Helbig I, Hurst JA, Jähn J, Jackson LG, Kelly AM, Ledbetter DH, Mansour S, Martin CL, Moss C, Mumford A, Ouwehand WH, Park S-M, Riggs ER, Scott RH, Sisodiya S, et al: The Human Phenotype Ontology project: linking molecular biology and disease through phenotype data. Nucleic Acids Res 2014, 42(Database issue):D966–D974.
34. R Development Core Team: R: A Language and Environment for Statistical Computing: R Foundation for Statistical Computing; 2008.
35. Van Karnebeek CDM, Shevell M, Zschocke J, Moeschler JB, Stockler S: The metabolic evaluation of the child with an intellectual developmental disorder: diagnostic algorithm for identification of treatable causes and new digital resource. Mol Genet Metab 2014, 111:428–438.
36. Kircher M, Witten DM, Jain P, O’Roak BJ, Cooper GM, Shendure J: A general framework for estimating the relative pathogenicity of human genetic variants. Nat Genet 2014, 46:310–315.
37.
Sanders SJ, Murtha MT, Gupta AR, Murdoch JD, Raubeson MJ, Willsey AJ, Ercan-Sencicek AG, DiLullo NM, Parikshak NN, Stein JL, Walker MF, Ober GT, Teran NA, Song Y, El-Fishawy P, Murtha RC, Choi M, Overton JD, Bjornson RD, Carriero NJ, Meyer KA, Bilguvar K, Mane SM, Sestan N, Lifton RP, Günel M, Roeder K, Geschwind DH, Devlin B, State MW: De novo mutations revealed by whole-exome sequencing are strongly associated with autism. Nature 2012, 485:237–241.
38. Neale BM, Kou Y, Liu L, Ma’ayan A, Samocha KE, Sabo A, Lin C-F, Stevens C, Wang L-S, Makarov V, Polak P, Yoon S, Maguire J, Crawford EL, Campbell NG, Geller ET, Valladares O, Schafer C, Liu H, Zhao T, Cai G, Lihm J, Dannenfelser R, Jabado O, Peralta Z, Nagaswamy U, Muzny D, Reid JG, Newsham I, Wu Y, et al: Patterns and rates of exonic de novo mutations in autism spectrum disorders. Nature 2012, 485:242–245.
39. O’Roak BJ, Vives L, Girirajan S, Karakoc E, Krumm N, Coe BP, Levy R, Ko A, Lee C, Smith JD, Turner EH, Stanaway IB, Vernot B, Malig M, Baker C, Reilly B, Akey JM, Borenstein E, Rieder MJ, Nickerson DA, Bernier R, Shendure J, Eichler EE: Sporadic autism exomes reveal a highly interconnected protein network of de novo mutations. Nature 2012, 485:246–250.
40. Iossifov I, Ronemus M, Levy D, Wang Z, Hakker I, Rosenbaum J, Yamrom B, Lee Y-H, Narzisi G, Leotta A, Kendall J, Grabowska E, Ma B, Marks S, Rodgers L, Stepansky A, Troge J, Andrews P, Bekritsky M, Pradhan K, Ghiban E, Kramer M, Parla J, Demeter R, Fulton LL, Fulton RS, Magrini VJ, Ye K, Darnell JC, Darnell RB, et al: De novo gene disruptions in children on the autistic spectrum. Neuron 2012, 74:285–299.

doi:10.1186/s12920-014-0064-y
Cite this article as: Shyr et al.: FLAGS, frequently mutated genes in public exomes. BMC Medical Genomics 2014, 7:64.

41.
Ng SB, Bigham AW, Buckingham KJ, Hannibal MC, McMillin MJ, GildersleeveHI, Beck AE, Tabor HK, Cooper GM, Mefford HC, Lee C, Turner EH, Smith JD,Rieder MJ, Yoshiura K-I, Matsumoto N, Ohta T, Niikawa N, Nickerson DA,Bamshad MJ, Shendure J: Exome sequencing identifies MLL2 mutationsas a cause of Kabuki syndrome. Nat Genet 2010, 42:790–793.42. Chen W-H, Zhao X-M, van Noort V, Bork P: Human monogenic diseasegenes have frequently functionally redundant paralogs. PLoS Comput Biol2013, 9:e1003073.43. Diss G, Ascencio D, Deluna A, Landry CR: Molecular mechanisms ofparalogous compensation and the robustness of cellular networks.J Exp Zoolog B Mol Dev Evol 2014, 322:488–499.44. Castellana S, Mazza T: Congruency in the prediction of pathogenic missensemutations: state-of-the-art web-based tools. Brief Bioinform 2013, 14:448–459.45. Stearns FW: One hundred years of pleiotropy: a retrospective. Genetics 2010,186:767–773.46. Yamaguchi T, Sharma P, Athanasiou M, Kumar A, Yamada S, Kuehn MR:Mutation of SENP1/SuPr-2 reveals an essential role for desumoylation inmouse development. Mol Cell Biol 2005, 25:5171–5182.47. Yu L, Ji W, Zhang H, Renda MJ, He Y, Lin S, Cheng E, Chen H, Krause DS,Min W: SENP1-mediated GATA1 deSUMOylation is critical for definitiveerythropoiesis. J Exp Med 2010, 207:1183–1195.48. Van Nguyen T, Angkasekwinai P, Dou H, Lin F-M, Lu L-S, Cheng J, Chin YE,Dong C, Yeh ETH: SUMO-specific protease 1 is critical for early lymphoiddevelopment through regulation of STAT5 activation. Mol Cell 2012,45:210–221.49. Moreau Y, Tranchevent L-C: Computational tools for prioritizing candidategenes: boosting disease gene discovery. Nat Rev Genet 2012, 13:523–536.50. 
Micale L, Augello B, Maffeo C, Selicorni A, Zucchetti F, Fusco C, De Nittis P,Pellico MT, Mandriani B, Fischetto R, Boccone L, Silengo M, Biamino E, PerriaC, Sotgiu S, Serra G, Lapi E, Neri M, Ferlini A, Cavaliere ML, Chiurazzi P,Monica MD, Scarano G, Faravelli F, Ferrari P, Mazzanti L, Pilotta A, PatricelliMG, Bedeschi MF, Benedicenti F, et al: Molecular Analysis, PathogenicMechanisms, and Readthrough Therapy on a Large Cohort of KabukiSyndrome Patients. Hum Mutat 2014, 35:841–850.51. Schulz Y, Freese L, Mänz J, Zoll B, Völter C, Brockmann K, Bögershausen N,Becker J, Wollnik B, Pauli S: CHARGE and Kabuki syndromes: a phenotypicand molecular link. Hum Mol Genet 2014, 23:4396–4405.52. Harlalka GV, Baple EL, Cross H, Kühnle S, Cubillos-Rojas M, Matentzoglu K,Patton MA, Wagner K, Coblentz R, Ford DL, Mackay DJG, Chioza BA, ScheffnerM, Rosa JL, Crosby AH: Mutation of HERC2 causes developmental delay withAngelman-like features. J Med Genet 2013, 50:65–73.53. Cheon CK, Sohn YB, Ko JM, Lee YJ, Song JS, Moon JW, Yang BK, Ha IS, BaeEJ, Jin H-S, Jeong S-Y: Identification of KMT2D and KDM6A mutations byexome sequencing in Korean patients with Kabuki syndrome. J HumGenet 2014, 59:321–325.54. Puffenberger EG, Jinks RN, Wang H, Xin B, Fiorentini C, Sherman EA,Degrazio D, Shaw C, Sougnez C, Cibulskis K, Gabriel S, Kelley RI, Morton DH,Strauss KA: A homozygous missense mutation in HERC2 associated withglobal developmental delay and autism spectrum disorder. Hum Mutat2012, 33:1639–1646.55. Böhm J, Leshinsky-Silver E, Vassilopoulos S, Le Gras S, Lerman-Sagie T, GinzbergM, Jost B, Lev D, Laporte J: Samaritan myopathy, an ultimately benigncongenital myopathy, is caused by a RYR1 mutation. Acta Neuropathol (Berl)2012, 124:575–581.56. 
MacArthur DG, Manolio TA, Dimmock DP, Rehm HL, Shendure J, AbecasisGR, Adams DR, Altman RB, Antonarakis SE, Ashley EA, Barrett JC, BieseckerLG, Conrad DF, Cooper GM, Cox NJ, Daly MJ, Gerstein MB, Goldstein DB,Hirschhorn JN, Leal SM, Pennacchio LA, Stamatoyannopoulos JA, Sunyaev SR,Valle D, Voight BF, Winckler W, Gunter C: Guidelines for investigatingcausality of sequence variants in human disease. Nature 2014, 508:469–476.57. Siepel A, Bejerano G, Pedersen JS, Hinrichs AS, Hou M, Rosenbloom K,Clawson H, Spieth J, Hillier LW, Richards S, Weinstock GM, Wilson RK, Gibbs RA,Kent WJ, Miller W, Haussler D: Evolutionarily conserved elements in vertebrate,insect, worm, and yeast genomes. Genome Res 2005, 15:1034–1050.58. Pollard KS, Hubisz MJ, Rosenbloom KR, Siepel A: Detection of nonneutralsubstitution rates on mammalian phylogenies. Genome Res 2010, 20:110–121.59. Jung J-Y, DeLuca TF, Nelson TH, Wall DP: A literature search tool for intelligentextraction of disease-associated genes. J Am Med Inform Assoc JAMIA 2014,21:399–405.Submit your next manuscript to BioMed Centraland take full advantage of: • Convenient online submission• Thorough peer review• No space constraints or color figure charges• Immediate publication on acceptance• Inclusion in PubMed, CAS, Scopus and Google Scholar• Research which is freely available for redistribution60. Xu R, Li L, Wang Q: Towards building a disease-phenotype knowledgebase: extracting disease-manifestation relationship from literature.Bioinforma Oxf Engl 2013, 29:2186–2194.61. Yu W, Gwinn M, Clyne M, Yesupriya A, Khoury MJ: A navigator for humangenome epidemiology. Nat Genet 2008, 40:124–125.62. Kocot KM, Citarella MR, Moroz LL, Halanych KM: PhyloTreePruner: APhylogenetic Tree-Based Approach for Selection of OrthologousSequences for Phylogenomics. Evol Bioinforma Online 2013, 9:429–435.63. Altenhoff AM, Dessimoz C: Inferring orthology and paralogy. Methods MolBiol Clifton NJ 2012, 855:259–279.64. 
Pryszcz LP, Huerta-Cepas J, Gabaldón T: MetaPhOrs: orthology and paralogypredictions from multiple phylogenetic evidence using a consistency-based confidence score. Nucleic Acids Res 2011, 39:e32.Submit your manuscript at www.biomedcentral.com/submit

