UBC Theses and Dissertations

UBC Theses Logo

UBC Theses and Dissertations

A bioinformatic workflow for analyzing whole genomes in rare Mendelian disease Couse, Madeline Hazel 2017

Your browser doesn't seem to have a PDF viewer, please download the PDF to view this item.

Item Metadata


24-ubc_2017_may_couse_madeline.pdf [ 1.99MB ]
JSON: 24-1.0345611.json
JSON-LD: 24-1.0345611-ld.json
RDF/XML (Pretty): 24-1.0345611-rdf.xml
RDF/JSON: 24-1.0345611-rdf.json
Turtle: 24-1.0345611-turtle.txt
N-Triples: 24-1.0345611-rdf-ntriples.txt
Original Record: 24-1.0345611-source.json
Full Text

Full Text

A BIOINFORMATIC WORKFLOWFOR ANALYZING WHOLEGENOMES IN RARE MENDELIANDISEASEbyMadeline Hazel CouseB.Sc., The University of Waterloo, 2013A THESIS SUBMITTED IN PARTIAL FULFILLMENT OFTHE REQUIREMENTS FOR THE DEGREE OFMASTER OF SCIENCEinTHE FACULTY OF GRADUATE AND POSTDOCTORAL STUDIES(Genome Science and Technology)THE UNIVERSITY OF BRITISH COLUMBIA(Vancouver)April 2017c© Madeline Hazel Couse 2017AbstractThe vast majority of the human genome (˜98%) is non-coding. A symphonyof non-coding sequences resides in the genome, interacting with genes andthe environment to tune gene expression. Functional non-coding sequencesinclude enhancers, silencers, promoters, non-coding RNA and insulators.Variation in these non-coding sequences can cause disease, yet clinical se-quencing in patients with rare Mendelian disease currently focuses mostlyon variants in the ˜2% of the genome that codes for protein. Indeed, vari-ants in protein-coding genes that can explain a phenotype are identifiedin less than half of patients with suspected genetic disease by whole ex-ome sequencing (WES). With the dramatic reduction in the cost of wholegenome sequencing (WGS), development of algorithms to detect variantslonger than 50 bp (structural variants, SVs), and improved annotation ofthe non-coding genome, it is now possible to interrogate the entire spectrumof genetic variation to identify a pathogenic mutation.A comprehensive pipeline is needed to analyze non-coding variation andstructural variation from WGS. In this thesis, I developed and benchmarkeda bioinformatics workflow to detect pathogenic non-coding SNVs/indels andpathogenic SVs, and applied this workflow to unsolved patients with rareMendelian disorders. The pipeline detected ˜80-90% of deletions, ˜90% ofduplications, ˜65% inversions, and ˜50% of insertions in a simulated genomeand the NA12878 genome. The pipeline captured the majority of knownpathogenic non-coding single nucleotide variant (SNVs) and insertion dele-tions (indels), and effectively prioritized a spiked-in known pathogenic non-coding SNV. Several interesting candidate variants were detected in patients,but none could be convincingly implicated as pathogenic.The bioinformatic workflow described in this thesis is complementary tosequencing pipelines that analyze only protein-coding variants from wholegenomes. Application of this workflow to larger cohorts of patients with rareMendelian diseases should identify pathogenic non-coding variants and SVsto increase diagnostic yield of clinical sequencing studies, assist managementof genetic diseases, and contribute knowledge of novel pathogenic variantsto the scientific community.iiPrefaceThis thesis comprises unpublished work performed by the author. All anal-yses of non-coding variation and structural variation described in this paperwere performed by the author. Patients with hereditary sensory and au-tonomic neuropathy were recruited by Gabriella Horvath, who also helpedin interpretation of variants. Analysis to rule out pathogenic exonic vari-ants was performed by Farah Zahir, past member of the Friedman lab, as-sisted by Clara Van Karnebeek, Casper Shyr, and Maja Tarailo-Graovacof the Treatable Intellectual Disability Endeavour (TIDE) BC team. Pa-tients with Aicardi syndrome were recruited by Cristina Dias. Sequencing,as well as analysis and interpretation of exonic variants were the collabora-tive effort of Jan Friedman, Cristina Dias, Steven Jones, Farah Zahir, andYaoqing Shen. Allison Matthews, Jill Mwenifumbo, and Phillip Richmondfrom Wyeth Wasserman's lab provided insight into variant interpretationand structural variant analysis. Shaun Jackman assisted with interpreta-tion of the DNMT1 tandem duplication.This research was covered by the UBC Research Ethics Boards underthe project ”Genetic Alterations in Rare Diseases”, certificate number H09-01228.Figure 1.1 was adapted from Expert Review in Molecular Medicine, 17,Philip Cowie, Elizabeth A. Hay, Alasdair MacKenzie, The Noncoding Hu-man Genome and the Future of Personalised Medicine, p.4, Copyright (2015),with permission from Cambridge University Press. Figure 1.3 was repro-duced from Cell, 161:5, Dario G. Lupianez, Katerina Kraft, Verena Heinrich,Peter Krawitz, Francesco Brancati, Eva Klopocki, Denise Horn, Hlya Kay-serili, John M. Opitz, Renata Laxova, Fernando Santos-Simarro, BrigitteGilbert-Dussardier, Lars Wittler, Marina Borschiwer, Stefan A. Haas et al.,Disruptions of Topological Chromatin Domains Cause Pathogenic Rewiringof Gene-Enhancer Interactions, p.1014, Copyright (2015), with permissionfrom Elsevier. Figure 1.4 was adapted from Nature Methods, 17, MonyaBaker, Structural variation: the Genome’s Hidden Architecture, p.133, Copy-iiiPrefaceright (2012), with permission from Nature Publishing Group. Figure 1.5 wasadapted from Frontiers in Bioengineering and Biotechnology, 3, LorenzoTattini, Romina D’Aurizio, Alberto Magi, The Noncoding Human Genomeand the Future of Personalised Medicine, p.2, Copyright (2015), with per-mission from the authors under the Creative Commons Attribution License.ivTable of ContentsAbstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iiPreface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iiiTable of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . vList of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ixList of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xList of Abbreviations . . . . . . . . . . . . . . . . . . . . . . . . . xiAcknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . xiii1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11.1 The Human Genome . . . . . . . . . . . . . . . . . . . . . . 11.2 Genome Regulation and Dysregulation . . . . . . . . . . . . 11.2.1 Promoters . . . . . . . . . . . . . . . . . . . . . . . . 21.2.2 Enhancers and Silencers . . . . . . . . . . . . . . . . 31.2.3 Topologically Associated Domains . . . . . . . . . . . 31.2.4 Non-coding RNAs . . . . . . . . . . . . . . . . . . . . 61.3 Next Generation Sequencing and Clinical Studies . . . . . . 61.3.1 Exome Sequencing . . . . . . . . . . . . . . . . . . . . 71.3.2 Whole Genome Sequencing . . . . . . . . . . . . . . . 71.4 Prioritizing Non-coding SNVs and Indels . . . . . . . . . . . 81.4.1 Combined Annotation Dependent Depletion (CADD)score . . . . . . . . . . . . . . . . . . . . . . . . . . . 91.4.2 Functional Analysis Through Hidden Markov Model(FATHMM) score . . . . . . . . . . . . . . . . . . . . 91.5 Structural Variation . . . . . . . . . . . . . . . . . . . . . . . 91.5.1 SV classes and detection from NGS . . . . . . . . . . 91.5.2 Algorithms for identifying SVs . . . . . . . . . . . . . 10vTable of Contents1.6 Rare Disease Cohorts . . . . . . . . . . . . . . . . . . . . . . 101.6.1 Hereditary Sensory and Autonomic Neuropathy (HSAN)101.6.2 Aicardi Syndrome . . . . . . . . . . . . . . . . . . . . 121.7 Thesis Rationale and Objective . . . . . . . . . . . . . . . . 122 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132.1 Sequencing and Alignment . . . . . . . . . . . . . . . . . . . 132.1.1 Whole Genome Sequencing . . . . . . . . . . . . . . . 132.1.2 Alignment and SNV and Indel calling . . . . . . . . . 132.2 Structural Variant Calling and Benchmarking . . . . . . . . 152.2.1 MetaSV Consensus SV Caller . . . . . . . . . . . . . 152.2.2 VarSim Paired-End Read and SV Simulation . . . . . 152.2.3 SV Benchmarking with Biological Data: NA12878 WGS162.3 Annotations . . . . . . . . . . . . . . . . . . . . . . . . . . . 172.3.1 Gene Lists . . . . . . . . . . . . . . . . . . . . . . . . 172.3.2 Identification of Regulatory Sequences in the HumanGenome . . . . . . . . . . . . . . . . . . . . . . . . . 172.3.3 Association of Regulatory Sequences to Known Genes 172.4 Benchmarking Regulatory SNV and Indel Detection . . . . . 182.5 Filtering and Annotation of SNVs and Indels . . . . . . . . . 182.6 Filtering and Annotating SVs . . . . . . . . . . . . . . . . . 192.6.1 VCF to BED format . . . . . . . . . . . . . . . . . . 192.6.2 Comparison to SV Control Databases . . . . . . . . . 192.6.3 Comparison to DGV . . . . . . . . . . . . . . . . . . 202.6.4 Comparsion to 1000G . . . . . . . . . . . . . . . . . . 202.6.5 Translocations . . . . . . . . . . . . . . . . . . . . . . 202.7 Identifying Genic and Regulatory SVs . . . . . . . . . . . . . 212.8 HSAN Modifications to Workflow . . . . . . . . . . . . . . . 212.9 Aicardi Syndrome Modifications to Workflow . . . . . . . . . 212.9.1 Counting Variants in Common . . . . . . . . . . . . . 213 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 243.1 Benchmarking Against Simulated Data . . . . . . . . . . . . 243.1.1 Comparison of SV callers . . . . . . . . . . . . . . . . 243.1.2 LUMPY and CNVnator Versus All Other Callers . . 253.2 Benchmarking Against Biological Data: WGS from NA12878 303.2.1 Deletion Detection . . . . . . . . . . . . . . . . . . . . 303.2.2 Insertion Detection . . . . . . . . . . . . . . . . . . . 31viTable of Contents3.3 Genomiser Non-Coding Mendelian Variants . . . . . . . . . . 323.3.1 Detection of pathogenic non-coding variants . . . . . 323.3.2 CADD and FATHMM Scores for Non-Coding Variants 333.3.3 Spike-in of a Pathogenic SNV . . . . . . . . . . . . . 353.4 HSAN Analysis . . . . . . . . . . . . . . . . . . . . . . . . . 363.4.1 HSAN SNVs and Indels . . . . . . . . . . . . . . . . . 363.4.2 HSAN SV Analysis . . . . . . . . . . . . . . . . . . . 393.5 Aicardi Syndrome Analysis . . . . . . . . . . . . . . . . . . . 433.5.1 Aicardi Syndrome SNV and Indel Analysis . . . . . . 433.5.2 Aicardi Syndrome SV Analysis . . . . . . . . . . . . . 433.5.3 Aicardi syndrome de novo variant analysis . . . . . . 454 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 474.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 474.2 Summary of Findings . . . . . . . . . . . . . . . . . . . . . . 474.2.1 VarSim Benchmarking Results and Limitations . . . . 474.2.2 NA12878 Benchmarking Results and Limitations . . . 484.2.3 Limitations to SV Calling from SRS . . . . . . . . . . 494.3 Genomiser Non-Coding Mendelian Variants . . . . . . . . . . 504.3.1 Detection of Pathogenic Non-Coding Variants . . . . 504.3.2 FATHMM and CADD Scores of Pathogenic Non-CodingVariants . . . . . . . . . . . . . . . . . . . . . . . . . 514.3.3 Spike-in of a Pathogenic Non-Coding SNV . . . . . . 514.4 HSAN Analysis . . . . . . . . . . . . . . . . . . . . . . . . . 524.4.1 HSAN SNV and Indel Analysis . . . . . . . . . . . . 524.4.2 HSAN SV Analysis . . . . . . . . . . . . . . . . . . . 524.4.3 HSAN Analysis Limitations . . . . . . . . . . . . . . 544.5 Aicardi Syndrome Analysis . . . . . . . . . . . . . . . . . . . 544.5.1 Aicardi Syndrome SNV/Indel and SV Analysis . . . . 544.5.2 Aicardi Syndrome Limitations . . . . . . . . . . . . . 554.6 Conclusions and Future Directions . . . . . . . . . . . . . . . 554.6.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . 554.6.2 Comparison to a study analyzing WGS from patientswith a heterogeneous disease . . . . . . . . . . . . . . 564.6.3 Future directions . . . . . . . . . . . . . . . . . . . . 574.6.4 Conclusions . . . . . . . . . . . . . . . . . . . . . . . 58viiTable of ContentsAppendicesA Python script . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60viiiList of Tables2.1 Software and parameters used . . . . . . . . . . . . . . . . . . 233.1 Deletion detection sensitivity for 30X 100 bp pair-end simu-lation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 273.2 Duplication detection sensitivity for 30X 100 bp pair-end sim-ulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 283.3 Inversion detection sensitivity for 30X 100 bp pair-end simu-lation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 293.4 SV detection in NA12878 genome . . . . . . . . . . . . . . . . 313.5 Regulatory variants associated with familial hypercholesterolemiagenes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 363.6 HSANpatient phenotypes . . . . . . . . . . . . . . . . . . . . 37ixList of Figures1.1 Flow of information in the cell . . . . . . . . . . . . . . . . . 21.2 Topological associated domains . . . . . . . . . . . . . . . . . 41.3 Disruptions to TAD boundaries cause rare limb malformations 51.4 Structural variation . . . . . . . . . . . . . . . . . . . . . . . 111.5 Structural variant detection . . . . . . . . . . . . . . . . . . . 112.1 Bioinformatic workflow . . . . . . . . . . . . . . . . . . . . . . 143.1 Deletion detection sensitivity and F1 score . . . . . . . . . . . 263.2 Deletion detection sensitivity with LUMPY and CNVnator . 273.3 Duplication detection sensitivity with LUMPY and CNVnator 283.4 Inversion detection sensitivity with LUMPY and CNVnator . 293.5 Size of deletions detected in NA12878 . . . . . . . . . . . . . 323.6 Detection of known pathogenic non-coding SNVs and indels . 333.7 FATHMM scores of known pathogenic non-coding SNVs andindels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 343.8 CADD scores of known pathogenic non-coding SNVs and indels 353.9 Regulatory variants in patients with HSAN . . . . . . . . . . 383.10 A regulatory variant in PMP22 in a patient with HSAN . . . 393.11 Summary of SVs in patients with HSAN . . . . . . . . . . . . 403.12 Tandem duplication impacting DNMT1 in three siblings withHSAN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 423.13 Regulatory variants in patients with HSAN . . . . . . . . . . 443.14 Summary of SVs in patients with Aicardi syndrome . . . . . 45A.1 Python script for extracting TAD boundaries flanking candi-date genes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59xList of AbbreviationsCRE cis-regulatory elementWGS whole genome sequence/sequencingTFBS transcription factor binding siteTSS transcription start siteTAD topologically associated domainBCA balanced chromosomal abnormalitylncRNA long non-coding RNAsnoRNA small nucleolar RNAmiRNA microRNAsiRNA short interfering RNAWES whole exome sequence/sequencingHGMD Human Gene Mutation DatabaseCNV copy number variantSV structural variantCADD combined annotation dependent depletionSNV single nucleotide variantIndel insertion deletionFATHMM functional analysis through hidden Markov modelCMA chromosomal microarray analysisPE paired-endRD read-depthRP read-pairSR split readAS de novo assemblyHSAN hereditary sensory and autonomic neuropathyDGV Database of Genomic VariantsGIAB Genome in a Bottle1000G 1000 Genomes ProjectPPV positive predictive valueTP true positiveFN false negativexiList of AbbreviationsFP false positiveDECIPHER Database of Genomic Variation andPhenotype in HumansSRS short-read sequencingLRS long-read sequencingPacBio Pacific BiosciencesSMRT single molecule real-timexiiAcknowledgementsFor funding, I would like to thank NSERC for the Canada Graduate Schol-arship Masters Program award, and my supervisor, Jan Friedman.Many thanks to Jan Friedman for his uncompromising intellect, kindsupport, and sage advice. I feel very fortunate to have had him as a su-pervisor. Thank you to the past and present members of the Friedman lab.Thank you to my committee members, Wyeth Wasserman and Inanc Birol,for taking the time to advise me on my thesis.Thank you to Green College for providing me with a warm and open-hearted community for my first two years in Vancouver. Thank you to myever-supportive parents and siblings. Big thanks to my twin sister, MargotCouse, for the countless hours spent mumbling and grumbling with me overthe phone. Thanks to my lovely friends in Vancouver and elsewhere. Andfinally, thank you to Lucian Go, for putting up with me while I wrote thisthesis.xiiiChapter 1Introduction1.1 The Human GenomeThe first draft of the human genome was published over fifteen years ago[29]. The announcement was greeted with excitement and high expecta-tions. With the sequence of the human genome, the blueprint for humanlife, would scientists lay bare the mysteries encoded in our DNA? The hu-man genome was seemingly simpler and yet more complex than previouslyimagined. The genome encodes a mere ˜21,000 protein coding genes, lessthan one quarter the number that had been predicted. On the other handonly ˜2% of the genomes ˜3 billion bases comprise protein-coding genes.Clearly, there was more to the genome than the simplistic and determinis-tic genome-as-blueprint concept. Indeed, scientists uncovered a symphonyof cis-regulatory elements (CRE) residing in the non-coding genome: pro-moters, enhancers, silencers, insulators, and various classes of non-codingRNAs. These elements respond to dynamic environmental cues to tunegene expression through space and time, guided by the genomes intricatespatial architecture. As our understanding of the non-coding genome andthe overall genomic structure has deepened, so too has our knowledge oftheir contribution to human disease.Though we have begun to unravel the role of the non-coding genome inshaping phenotypes, this knowledge has not yet been harnessed in clinicalsequencing pipelines for diagnosing disease. These pipelines focus on the 2%of DNA in the protein-coding regions of the genome. This thesis will attemptto integrate structural and CRE analysis into a whole genome sequence(WGS) analysis pipeline for prioritizing regulatory and structural variantsin rare Mendelian disease.1.2 Genome Regulation and DysregulationGene expression regulation begins at the cell surface, where ligand-receptorinteractions initiate signal transduction cascades that eventually activate11.2. Genome Regulation and Dysregulationtranscription factors [35]. Such activated transcription factors bind CREs,which then interact locally or distally with RNApolII at a gene promoterto induce or repress gene expression (fig.1.1). Above the linear sequence ofDNA, the 3D topology of the genome directs and constrains interactions ofCREs with their cognate promoters to orchestrate gene expression.Figure 1.1: The flow of information in a cell, greatly sim-plified. Modified from Cowie et al. 2015, with permissionfrom Cambridge University Press. [11]Variants in CREs may confer disease risk by altering transcription fac-tor binding sites (TFBS), disrupting regulatory domains, or influencing sus-ceptibility of a sequence to epigenetic modification [38]. Many examples ofpathogenic variants residing in or disrupting regulatory sequences have beenidentified in recent years. Indeed, a recent paper compiled 453 different non-coding variants associated with Mendelian diseases [52]. CRE structure andfunction, as well as examples of diseases resulting from their disruption, aredescribed below.1.2.1 PromotersA promoter is the sequence upstream of a gene transcription start site (TSS)that is bound by the transcriptional apparatus to initiate gene expression.A core promoter refers to the ˜50 bases that bind the transcription pre-initiation complex, which includes RNA polymerase II and basal transcrip-tion factors [11]. Meanwhile, the non-core promoter may be kilobases inlength and encompass TFBS that are necessary for tissue-specific gene ex-pression. Mutations in promoter sequences can introduce or remove TFBS21.2. Genome Regulation and Dysregulationto change levels of gene expression.For example, a constitutional mutation in the promoter of telomerasereverse transcriptase (TERT), which encodes a catalytic subunit of telom-erase, was found to segregate with the disease in a family with hereditarymelanoma [23]. TERT is upregulated in 90% of cancers and is associatedwith immortality in cancer cells. The germline mutation was located 57 bpupstream of the transcription initiation site. Mutations occurring somati-cally 124 and 146 bp downstream of the transcription initation site were alsofound in tumors of several unrelated patients. These germline and somaticmutations were found to create a CCGGAA/T binding motif for Ets/TCFtranscription factors that resulted in increased TERT expression.1.2.2 Enhancers and SilencersEnhancers and silencers are short (50-1500 bp) CREs that are bound bytranscription factors to upregulate or downregulate promoter activity, re-spectively. Their activity is orientation and distance-independent [36]. In-deed, enhancers may target promoters at a distance as far as one megabase;this was observed fortuitously in a mutant mouse, sasquatch, generated byrandom integration of a transgene into an intron of LMBR1, 1 Mb awayfrom the Sonic hedgehog gene (SHH) [31]. The transgene disrupted a long-range enhancer that regulates developmental SHH expression and causespreaxial polydactyly (PPD). Dysregulated SHH is ectopically expressed inthe anterior margin of the mouse limb bud, causing the limb to developpreaxial digits. Point mutations in this long-range enhancer in both miceand humans also cause PPD.1.2.3 Topologically Associated DomainsOur DNA is bound by proteins, including histones and transcription factors,as well as non-coding RNAs. This assemblage of molecules, termed chro-matin, is folded like origami into the nucleus. It is thought that interactionsbetween distal CREs and genes are mediated by folding of the interveningchromatin in regulatory domains termed topologically associated domains(TADs) (fig. 1.2). TADs are discrete genomic regions about ˜1 Mb in sizethat interact with themselves at a higher frequency than with the rest ofthe genome. TADs are bordered by regions with low interaction frequency,called TAD boundaries [37]. The TAD boundaries are associated with in-sulator binding factor CTCF, housekeeping and tRNA genes, and SINEelements[13]. Dixon et al performed Hi-C experiments in mouse embryonic31.2. Genome Regulation and Dysregulationstem (ES) cells, human ES cells, and human IMR90 cells and showed thatTAD boundaries are largely conserved between mice and humans (75.9% ofmouse boundaries are boundaries in humans, compared to 29.0% expectedby chance), and, further, TAD boundaries are largely invariant across celltypes.Figure 1.2: DNA is partitioned into TADs. TAD bound-aries prevent regulatory elements in one TAD from interact-ing with genes in neighbouring TADs.Disruptions to TADs can produce disease. In several families with rarelimb malformations, genomic rearrangements were observed at the WNT6/IHH/EPHA4/PAX3 locus [32]. All of these rearrangements disrupted aTAD boundary, either telomeric or centromeric to the EPHA4 TAD (fig.1.3). Disruption of either boundary caused ectopic interactions of a 150 kbregion within the EPHA4 TAD with neighboring TADs. This region wasshown to contain a cluster of enhancers driving limb expression of EPHA4 ;in mutants, this region targeted PAX3, WNT6, and IHH, resulting in ectopicexpression and limb malformations in mice and humans.Translocations and inversions are referred to as balanced chromosomalabnormalities (BCAs) if they do not cause any net gain or loss of genetic ma-terial. TAD disruptions were recently shown to cause long-range genetic reg-ulatory changes in 7% of a developmental anomaly cohort with BCAs studiedby whole genome sequencing [47]. Interestingly, BCA breakpoints in eightpatients impacted a TAD encompassing one particular gene, MEF2C, whichlies in the critical region of the 5q14.3q15 microdeletion syndrome. Thebreakpoints in these patients were all at difference loci within the TAD, and41.2. Genome Regulation and DysregulationFigure 1.3: Different Limb Pathogenic Structural Vari-ations in Human and Mouse Map to the EPHA4 TAD(A) Hi-C profile around the EPHA4 locus in human ESCs(Dixon et al., 2012). Dashed lines indicate the EPHA4TAD and boundaries. Cen, centromeric; tel, telomeric.(BD)Schematic of structural variants (left) and associated phe-notypes (right). (B) Brachydactyly-associated deletions infamilies B1, B2, and B3. Note thumb and index finger short-ening with partial webbing in a child (B1 patient) and adult(B2 patient). (C) F-syndrome-associated inversion in familyF1 and duplication in family F2. Note similar phenotypesof index/thumb syndactyly. (D) Polydactyly-associated du-plication (P1) and deletion in the doublefoot (Dbf) mousemutant. The radiograph of the patients hand and the skele-tal preparation of the Dbf/+ mouse show similar seven-digitpolydactyly. Reproduced from Figure 1 of Lupiez et al., 2015,with permission from Elsevier. [32]51.3. Next Generation Sequencing and Clinical Studiesnone affected the adjacent TAD boundaries, but all still changed MEF2Cexpression and likely caused the developmental anomalies in these patients.1.2.4 Non-coding RNAsMuch of the non-coding genome is transcribed, and the role of these non-coding transcripts in regulating gene expression is well appreciated [17] .Regulatory RNAs can be divided into long non-coding RNAs (lncRNA)and small non-coding RNAs including small nucleolar RNA (snoRNA),microRNA (miRNA), and short interfering RNA (siRNA). snoRNAs areinvolved in chemically modifying other RNA, such as transfer RNA andribosomal RNA, while siRNAs and miRNAs repress gene expression aftertranscription [6]. Small RNAs are associated with many regulatory roles,notably in brain development. Indeed, small RNA dysregulation is associ-ated with neurodevelopmental and neurodegenerative disorders, and withbrain cancer [6]. For instance, microdeletions at 15q11.2 resulting in lossof the paternal copy of SNORD116 snoRNAs cause Prader-Willi syndrome[15]. Heterozygous mutations in the seed region of the miRNA MIR96 causenonsyndromic progressive hearing loss [40].lncRNAs are regulatory elements that are gene-like in structure, withpromoters, introns and exons. lncRNA expression is highly tissue-specific;these RNAs are implicated in the regulation of cell maintenance and fate,particularly in the brain [6]. De novo translocations disrupting LINC00299are implicated in neurodevelopmental disability [55].1.3 Next Generation Sequencing and ClinicalStudiesThe estimated cost to sequence the first draft of the human genome viaSanger sequencing was $300 million(https://www.genome.gov/sequencingcosts/). Since the introduction ofnext-generation sequencing a decade ago, that figure has dropped dramat-ically, outpacing Moores law for computing costs. Remarkably, the cost ofsequencing the genome using Illuminas HiSeq X Ten machines (after the sys-tem cost) has broken the US$1000 goal set by the National Human GenomeResearch Institute; this represents a 10,000 fold price reduction comparedto 2004 [57]. With the human genome sequence known and increasinglyaffordable next-generation sequencing technology available, the number ofMendelian disease genes that have been identified increased from 100 to 300061.3. Next Generation Sequencing and Clinical Studieswithin a decade [29]. Simultaneously, the number of sequenced genomes hasexploded. Large-scale sequencing projects like the 1000 Genomes Project(The 1000 Genomes Project Consortium, 2015), and the ESP (NHLBI GOExome Sequencing Project, URL: http://evs.gs.washington.edu/EVS/)have made hundreds of thousands of exomes and whole genomes publiclyavailable, benefitting clinical sequencing projects greatly with databases ofcontrol exomes/genomes from many populations. Although single-gene test-ing remains an appropriate diagnostic approach for non-heterogeneous dis-orders with distinct phenotypes, whole exome sequencing (WES) and WGSare the approaches of choice for investigating heterogeneous known or sus-pected Mendelian disorders.1.3.1 Exome SequencingWES is a sequencing method that targets the protein-coding regions of thegenome. It is the current method of choice for clinical sequencing studiesand has been the most commonly used tool for Mendelian disease genediscovery[61]. WES interrogates less than 2% of the genome, yet it covers˜85% of known disease-causing variants [57]. However, this proportion issubject to ascertainment bias, as the search for disease-causing variants haslargely been limited to exonic analysis. WES studies on children with birthdefects or neurodevelopmental disorders have usually reported diagnosticrates of ˜25-28% [59] but up to 50% with more stringent patient selectioncriteria [44].There are several limitations to the use of WES for identifying geneticvariation. First, WES captures exons in a non-uniform manner, resultingin insufficient coverage of some genes. A recent study [14] with a TruSeqcapture kit demonstrated that an average of 10% of exons had a minimumcoverage of fewer than 10 reads, despite an overall mean coverage of over80X. Of these exons, a quarter resided in genes harboring known or likelydisease-causing Human Gene Mutation Database (HGMD) variants. In ad-dition, WES lacks reliability and sensitivity in detecting small copy numbervariants (<100kb), large indels (<50bp), and other complex structural vari-ants (SVs) [53]. Finally, WES does not capture the vast majority of thenon-coding genome.1.3.2 Whole Genome SequencingDue in large part to its higher cost and challenges in data interpretation,WGS has been performed far less frequently than WES for clinical diagno-71.4. Prioritizing Non-coding SNVs and Indelssis. However, WGS covers both the coding and non-coding genome moreuniformly than WES, and, further, can detect SVs across a wide range ofsizes with single base resolution.The advantages of WGS over WES have been demonstrated in a num-ber of studies. In a cohort of 50 patients with severe intellectual disabilitywho remained undiagnosed after microarray analysis and exome sequencing,Gilissen et al[20] detected 8 de novo copy number variants (CNVs), whichplay an important role in neuropsychiatric disorders. Carss et al[7] usedWES and WGS to identify rare pathogenic variants in a phenotypically andgenetically heterogeneous cohort of 722 individuals with inherited retinaldisease. Forty five of the individuals unsolved by WES underwent WGS,and pathogenic variants not detectable by WES were identified in a furthersix individuals. For 3 of these patients, this was due to coverage in WGS ata variant location that was absent from the bait in the exome capture kit.Two other individuals had a large deletion and one individual had a largeindel not called by WES. In 605 patients who underwent WGS, a total of 33SVs (31 deletions and 2 tandem duplications) were identified with precisebreakpoint resolution, which would not have been possible with WES. Fi-nally, the authors demonstrated the superior uniformity of coverage in GCrich regions afforded by WGS in comparison to WES, with the identificationof compound heterozygous mutations in one individual in the first exon ofGUCY2D, with a 76% GC content. This exon was not covered in the WEScapture kit.1.4 Prioritizing Non-coding SNVs and IndelsThe whole genome sequence of any individual varies from the reference hu-man genome at millions of sites. Even after filtering variants that occur atpolymorphic frequencies in normal populations from WGS of patients withrare Mendelian diseases, the variant list can contain hundreds of thousandsof non-coding and coding variants. Filtering for variants in regulatory re-gions is one strategy to reduce the search space in order to identify the oneor two pathogenic non-coding variants responsible for a Mendelian disease.Another is to use computational predictions of functionality based on mod-els derived from diverse sets of genome annotations and known pathogenicvariants. These scores can help to prioritize genetic variants by providinglikelihoods that a variant is deleterious or functional. Two scores that ap-ply to both coding and non-coding variants are the CADD score and theFATHMM score.81.5. Structural Variation1.4.1 Combined Annotation Dependent Depletion (CADD)scoreThe CADD score integrates diverse genome annotations of fixed or nearlyfixed alleles with simulated alleles to score the deleteriousness of any possi-ble single nucleotide variant (SNV) or small insertion-deletion (indel) [28].Deleteriousness, corresponding to a reduction in organismal fitness, corre-lates with molecular functionality and pathogenicity. The CADD score isderived from 63 distinct annotations including conservation metrics such asGERP and phastCons, regions of DNAse hypersensitivity, transcription fac-tor binding, and protein-level scores such as SIFT and PolyPhen. All 8.6billion possible SNVs in the genome are given a CADD score. The CADDscores are phred-scaled from 1 to 99, where variants in the highest 10% ofall scores are assigned scores of 10 or greater, variants in the highest 1% arescored 20 or greater, variants in the highest 0.1% 30 or greater, and so on.1.4.2 Functional Analysis Through Hidden Markov Model(FATHMM) scoreFATHMM is a machine learning approach integrating 46 sequence conserva-tion, histone modification, transcription factor binding site, and open chro-matin annotations to assess the functional consequences of non-coding andcoding variants [50]. Unlike CADD, FATHMM uses an algorithm to weightdifferent annotations according to relevance, and according to the paper,outperforms CADD in predicting functional consequences of non-codingvariants. FATHMM scores for all possible SNVs are available, and rangefrom 0 to 1. Scores of greater than 0.5 indicate that a variant is likely to befunctional.1.5 Structural Variation1.5.1 SV classes and detection from NGSSVs are balanced or unbalanced genomic rearrangements that affect morethan 50 bp of genomic sequence. Although SVs are estimated to impactmore than 1% of each human genome, versus 0.1% for SNPs [45], SVs areunder-ascertained in clinical sequencing studies, mainly due to the short-comings of SV calling with WES. Pathogenic SVs are typically identifiedclinically by karyotyping or chromosomal microarray analysis (CMA); how-ever, WGS affords a finer resolution and broader scope of SV identification.91.6. Rare Disease CohortsSVs that can be detected from WGS are insertions, deletions, duplications,inversions, and inter/intrachromosomal translocations (fig. 1.4). Althoughthe full spectrum of SVs can, in principle, be identified from WGS, SV de-tection in practice is limited by the length of the reads; WGS read length istypically 50-400 bp, shorter than most SVs. SV detection is further limitedin repetitive regions of the genome, which are known to be variable in struc-ture but to which short reads cannot be uniquely mapped [41]. As such,a number of strategies have been developed to identify SV signatures fromshort-read paired-end (PE) sequencing.1.5.2 Algorithms for identifying SVsThere are four different approaches designed to identify SVs from PE NGS:read-depth (RD), read-pair (RP), split-read (SR), and de novo assembly(AS) (fig. 1.5). RD methods are based on deviation of sequencing depthfrom the local genomic average and are used to detect copy number variants(CNVs). RP, SR, and AS methods use sequence signatures from PE readsto identify SVs. PE sequencing produces reads sequenced from both endsof DNA fragments, termed read pairs. On average, the length betweenread pairs will be consistent (e.g. 350 bp). RP methods are based on readpairs with insert-sizes or orientations that are inconsistent with expectedvalues. A read that aligns to two separate locations in the reference genomeand whose mate maps uniquely is termed a SR. SRs allow single base-pairresolution of breakpoints. AS methods order reads and merge them intolarger fragments, called contigs, to reassemble the original sequence withoutthe use of a reference. A method that incorporates all methods should beable to detect the broadest spectrum of SVs with high sensitivity.1.6 Rare Disease Cohorts1.6.1 Hereditary Sensory and Autonomic Neuropathy(HSAN)HSAN is a group of clinically and genetically heterogeneous disorders ofthe peripheral nervous system. Our cohort is comprised of eight patientsin six families with early-onset HSAN whose symptoms include loss of painsensation, developmental delay, and gastrointestinal complications. WGSwas performed on these patients, and my colleagues performed an analysisof rare exonic SNVs and indels. A genetic diagnosis was only reached in twopatients.101.6. Rare Disease CohortsFigure 1.4: SV classes that are detectable by WGS. Modi-fied from Baker, 2012, with permission from Nature Publish-ing Group. [5]Figure 1.5: PE read signatures for SVs from read count(RC), read-pair (RP), split-read (SR), and de novo assembly(AS) methods. Modified from Tattini et al 2015, with per-mission under the Creative Commons Attribution License[56].111.7. Thesis Rationale and Objective1.6.2 Aicardi SyndromeAicardi syndrome is an extremely rare neurodevelopmental disorder charac-terized by chorioretinal lacunae, agenesis of the corpus callosum, and infan-tile spasms. The disorder has been described almost exclusively in femalesor in boys with Klinefelter syndrome (XXY), with a risk to siblings of lessthan 1% [2]. It is, therefore, hypothesized to be caused by a dominant, malelethal de novo mutation in a gene on the X chromosome. Chromosome mi-croarray analysis and exome sequencing of patients with Aicardi syndromehave not revealed the genetic cause for the disorder. Our Aicardi syndromecohort is comprised of 9 patients recruited from across Canada and theUnited States. Analysis of protein-coding variations in these patients hasnot revealed a candidate gene.1.7 Thesis Rationale and ObjectiveWES can be used to investigate SNVs and small (<50 bp) indels in protein-coding genes. WES is currently the technology of choice for clinical sequenc-ing studies, but SNVs and indels in protein-coding genes that can explaina phenotype are identified in less than 50% of patients with suspected ge-netic disease by WES. There is growing interest in using WGS for clinicaldiagnosis now that WGS is more affordable and functional annotation of thegenome is improving. In addition, numerous methods have been developedfor calling SVs from WGS data.I hypothesize that SNVs and indels that disrupt regulatory sequencesand SVs in coding or non-coding sequences are a cause of genetic dis-ease in patients who remain undiagnosed after CMA and WES (or codingSNV/indel analysis of WGS). The objectives of this thesis are to 1) developand benchmark a bioinformatics workflow for detection of pathogenic non-coding SNVs/indels and pathogenic SVs, and 2) to use this workflow toanalyze the WGS of unsolved patients recruited from in-house HSAN andAicardi syndrome studies to test my hypothesis.12Chapter 2MethodsFigure 2.1 illustrates the bioinformatics workflow created to identify regu-latory variants and SVs from WGS. The steps taken in the pipeline and thebenchmarking performed to validate the tools are described below. Softwareversions and parameters are listed in table Sequencing and Alignment2.1.1 Whole Genome SequencingHSAN samples were sequenced following the manufacturers recommenda-tions on Illumina HiSeq2000 machines in Corey Nislows lab at the Uni-versity of British Columbia, Vancouver, B.C., Canada. Aicardi sampleswere sequenced following the manufacturers recommendations on IlluminaHiSeq2000 machines at the Michael Smith Genome Sciences Centre, Vancou-ver, B.C., Canada. Chastity-failed reads, or reads with a high frequency oflow quality bases at the beginning of the read, were excluded from furtheranalyses. HSAN samples were sequenced to 30X with 100 bp paired-endreads. WGS was performed on samples from nine Aicardi syndrome pa-tients to 30X with 125 bp paired-end reads. Of these nine patients, twowere additionally sequenced as trios; affected tissue from these probandswas sequenced to 100X with 100 bp paired-end reads and their parents weresequenced to 30X with 100 bp paired-end reads. Many reads from one trioprobands affected tissue, however, were dropped due to low quality, leadingto a coverage of 30X.2.1.2 Alignment and SNV and Indel callingFastq files were aligned with SpeedSeq, a wrapper with a modular architec-ture for performing rapid, parallelized whole-genome alignment and variantcalling. Default parameters were used. SpeedSeq uses BWA mem to alignraw reads to a reference genome, SAMBLASTER to rapidly mark duplicatereads, Sambamba to sort reads and covert SAM to BAM, and FreeBayes to132.1. Sequencing and AlignmentCall variantsFASTQ filesSNVs /indelsSVsRare SNVs and indelsRare SVsRare RegulatoryAnd genic SVs Candidate SVsSV calling: LUMPY, CNVnator, MetaSVFilter out variants present in DGV, 1000G, controlsFilter out variants with >1% frequency, controlsIntersect with regulatory sequences associated to genes Manual curation to identify false positivesConsideration of inheritance mode, pathogenicity score, TFBS, genotype phenotype correlationRare RegulatorySNVs/indelsSNV/indel calling : FreebayesIntersect with genesBAM fileAlignment: BWA memCandidate SNVs/indelsAlign readsFilter out common variantsIdentify regulatory variants Prioritize variants Figure 2.1: Bioinformatic workflow for transforming rawsequence reads to rare regulatory variants. The workflow forprocessing SVs from a BAM file is illustrated in red on theleft branch, while the purple branch on the right illustratesthe workflow for processing SNVs and indels from a BAMfile.142.2. Structural Variant Calling and Benchmarkingcall SNVs/indels. SpeedSeq also extracts split reads and discordant readspairs for downstream SV calling with LUMPY. HSAN samples were alignedto the human genome build 19, or hg19. Aicardi samples were aligned tothe human genome build 38, or hg38, to take advantage of its improvementsin the X chromosome assembly.2.2 Structural Variant Calling and Benchmarking2.2.1 MetaSV Consensus SV CallerMetaSV is a consensus SV caller that was used to incorporate SV calls frommultiple tools to provide a high-sensitivity SV set [42]. This is useful as thereis no single SV calling tool that optimally detects all SVs across a range ofsizes. MetaSV provides support for output of CNVnator, an RD approach[1], Breakdancer, an RP approach [8], Pindel, an SR and RP approach [60],LUMPY, an SR and RP approach [30], and Manta, an SR and RP approach[9]. These tools were selected to benchmark SV calling. MetaSV performsintra- and inter-tool merging for SVs with significant overlap, and thenperforms local assembly at SV breakpoints as an additional line of evidenceand to refine breakpoints. Finally, SVs are genotyped and annotated. SVsdetected by multiple tools (or multiple lines of evidence in one tool, e.g.SR and RP or SR and RD) are considered to be high-confidence, or PASS.SVs detected by only one tool or one line of evidence, e.g. just RD, are low-confidence, or LOWQUAL. MetaSV is also augmented with a soft-clip-basedmethod for detecting insertions. A soft-clipped read refers to the SR andits uniquely mapped mate. Candidate insertion intervals are generated fromsoft-clipped reads, which are then assembled to generate insertion locations.2.2.2 VarSim Paired-End Read and SV SimulationTo validate the sensitivity and specificity of the SV calling approach, Var-Sim was used to simulate paired-end reads and SVs from a reference genomeand validate the results of alignment and variant calling [43]. VarSim isable to simulate deletions, insertions, tandem duplications, and inversions.Translocation simulations are not currently available but are planned for afuture version. VarSim inserts variants, e.g., previously reported SVs fromthe Database of Genomic Variants (DGV), into a user-specified referencegenome. DGV is a curated catalogue of SVs from control individuals ob-tained using microarrays and NGS [34]. VarSim then uses ART to simulatereads in FASTQ format for secondary analysis. ART is a set of tools that152.2. Structural Variant Calling and Benchmarkinguses error models or quality profiles from real sequencing data to simulatesynthetic reads [24]. After alignment and variant-calling through the user-defined pipeline, VarSim compares the called variants to the SV truth setused as input for simulating reads, breaking down sensitivity and precisionby SV type and size.Using VarSim, 30X NGS 2x100 bp paired-end reads were generated fromhg19 using variants from DGV. Reads were then processed using SpeedSeq,and SVs called with Pindel, CNVnator, Manta, Breakdancer, Lumpy, andMetaSV.2.2.3 SV Benchmarking with Biological Data: NA12878WGSThe CEU NA12878 sample has been analyzed extensively by the Genome ina Bottle (GIAB) Consortium (http://jimb.stanford.edu/giab) in orderto characterize its high-confidence SNPs, indels, and homozygous referenceregions. Preliminary benchmark deletions and insertions have also beencalled [46]. Deletions were downloaded from ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/technical/svclassify_Manuscript/Supplementary_Information/Personalis_1000_Genomes_deduplicated_deletions.bedand insertions were downloaded from ftp://ftptrace.ncbi.nih.nlm.gov/giab/ftp/technical/svclassify_Manuscript/Supplementary_Information/Spiral_Genetics_insertions.bed. The deletions were called by the Per-sonalis approach (http://www.personalis.com/assets/files/posters/ashg2013/An_integrated_approach_for_accurate_calling.pdf) , whichis a consensus method that uses CNVnator, BreakDancer, Pindel, and Break-Seq. This set of deletions was further refined by pedigree analysis of 16family members and PCR validation [46]. The deletion call-set also includesdeletions called by the 1000 Genomes Project pilot phases, which were val-idated by assembly or other independent technologies such as CMA. Inser-tions were called using Spiral Genetics Anchored Assembly.The NA12878 genome sequenced to 50X by the Platinum Genomesproject (https://www.illumina.com/platinumgenomes.html) was downloadedfrom the BaseSpace Sequence Hub. FASTQ files were aligned using Speed-Seq and SVs were called using LUMPY, CNVnator, and metaSV. Deletionsand insertions were extracted from the final SV call set and converted toBED format using a custom bash script. Using intersectBED, deletions wereconsidered true positives if they shared a reciprocal overlap of 50% with abenchmark deletion. In other words, deletion A must overlap deletion Bby at least a fraction of 1/2, and deletion B must overlap deletion A by162.3. Annotationsat least a fraction of 1/2. Using windowBED, insertions were consideredtrue positives if a benchmark insertion resided within 10 bp of the insertionpoint.2.3 Annotations2.3.1 Gene ListsFor the HSAN study, candidate genes were defined from the list of 50genes in Gene Dxs hereditary neuropathy panel (https://www.genedx.com/test-catalog/available-tests/hereditary-neuropathy-panel/). Forthe Aicardi syndrome study, all RefSeq genes (1087 with unique HUGO GeneNomenclature Committee names) on the X chromosome were analyzed.2.3.2 Identification of Regulatory Sequences in the HumanGenomePublicly available databases were used to identify human regulatory se-quences. The FANTOM5 genome-wide, tissue- and cell-specific atlas ofenhancers was downloaded from http://enhancer.binf.ku.dk/presets/enhancer_tss_associations.bed. Vista enhancers, which have been testedfor biological activity in transgenic mice, were downloaded from UCSC(http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/).miRNA, snoRNA, and miRNA binding sites were also downloaded fromUCSC .The 2500 bp sequence upstream of each RefSeq gene (used as aproxy for promoters), and RefSeq 3 and 5 untranslated regions (UTRs) ofgenes were downloaded using the UCSC table browser (https://genome.ucsc.edu/cgi-bin/hgTables). Genomic regions deemed ultrasensitive tovariation based on conservation and presence of transcription factor bind-ing sites were downloaded from the supplementary information of Khu-rana et al [27]. TADs identified in human embryonic stem cells by HiC(combined replicates) were downloaded from the Ren lab Hi-C website athttp://chromosome.sdsc.edu/mouse/hi-c/download.html.2.3.3 Association of Regulatory Sequences to Known GenesUTRs and promoters regulate the genes in which they reside or to whichthey are proximal. Distal elements, however, do not always target the closestgene. Andersson et al. [3] performed pairwise correlations between RNAexpression from FANTOM5 enhancer elements and transcription start sites172.4. Benchmarking Regulatory SNV and Indel Detection(TSSs) to associate enhancer function with genes. Vista enhancers, Ultra-sensitive regions, and non-coding RNAs were associated to genes lying withinthe same TAD using intersectBED; if a regulatory region lies within thesame TAD as a gene, there is evidence to suggest the two interact [37].TAD boundaries flanking candidate gene-containing TADs were extractedusing a custom python script that defines the downstream and upstreamTAD boundaries for that gene (see Appendix).2.4 Benchmarking Regulatory SNV and IndelDetectionTo determine the sensitivity of the regulatory region annotations in identi-fying true pathogenic non-coding variants, the list of pathogenic non-codingvariants identified by Smedley et al [52] was intersected with the relevantregions on hg19. Using intersectBed, 3UTR variants were intersected withRefSeq 3UTRs, 5UTR variants with RefSeq 5UTRs, promoter variants withthe regions 2500 bp upstream of RefSeq genes, RNA genes with RefSeqgenes, miRNA variants with the UCSC wgRNA track (snoRNAs and miR-NAs), and enhancer variants with FANTOM5, Vista enhancers, and Khu-rana ultra-sensitive regions.To compare the CADD and FATHMM scores for pathogenic non-codingvariants versus a random set of genomic variants, FATHMM and CADDscores were computed for all pathogenic non-coding SNVs (343) and a setof 343 randomly sampled rare variants from one HSAN patient. Pathogenicnon-coding indels (110) were excluded, as FATHMM does not support scoresfor indels. To simulate the efficacy of the SNV/indel annotation and prior-itization pipeline for a heterogeneous Mendelian disorder, a variant in thepromoter region of LDLR was selected and inserted into the vcf file of oneof the HSAN patients. This particular variant (hg19 chr 19:11200073C¿T),present in the variant list compiled by Smedley et al, was chosen because itis associated with familial hypercholesterolemia (MIM 143890), a heteroge-neous phenotype associated with variants in APOA2, ITIH4, GHR, GSBS,EPHX2, and LDLR.2.5 Filtering and Annotation of SNVs and Indels1000G, dbsnp147, esp6500, ExAC, the Haplotype Reference Consortium,and Kaviar databases, as well as refGene annotations, segmental duplica-tions, and functional scores were downloaded from ANNOVAR using the an-182.6. Filtering and Annotating SVsnotate variation.pl script (e.g., annotate variation.pl -buildver hg19 -downdb-webfrom annovar exac03 humandb/). Conserved TFBS and ENCODETFBS were downloaded from UCSC using the same script (e.g., anno-tate variation.pl -buildver hg19 -downdb tfbsConsSites humandb/). UsingANNOVARvariants reduction.pl script, variants present at a frequency ofgreater than 1% in 1000G, dbsnp147, esp6500, ExAC, the Haplotype Ref-erence Consortium, and Kaviar were filtered out. Using the ANNOVARtable annovar.pl script, rare variants were annotated with refGene anno-tations, segmental duplications, functional scores (CADD and FATHMM-MKL), conserved TFBS as annotated by TRANSFAC, ENCODE TFBS,regulatory regions associated with genes, and TADS containing candidategenes. Upon manual inspection of variants in IGV and UCSC, variants thatwere found to have rs numbers were excluded.Indel annotation can vary by software, and therefore coordinates betweenknown indels and variants may differ. This can result in common indels notbeing filtered out. Indels that appeared to be technical artefacts based ontheir visual presence (or the presence of similar indels) in samples fromother cohorts upon inspection in IGV, were also excluded. After manualcuration, FATHMM and CADD scores were used to prioritize SNVs; variantswith a CADD score greater than 15 were flagged, as were variants witha FATHMM non-coding score of greater than 0.5, as these scores abovethese thresholds are considered to indicate functional variants. Genotype-phenotype correlation, inheritance mode, and presence of TFBS were thenused to further narrow down candidate pathogenic variants.2.6 Filtering and Annotating SVs2.6.1 VCF to BED formatIn order to compare SV intersections with common SVs and with regulatoryregions or genes, SV vcf files were converted to BED format using a customshell script where the SVLEN is extracted and added to the start coordinateto find the end coordinate of the SV. For insertions, the end coordinate wasequal to the start coordinate plus one, regardless of the SVLEN.2.6.2 Comparison to SV Control DatabasesMethods for SV comparison to DGV and 1000G were modified from Hehir-Kwa et al 2016 [22]. A reciprocal overlap was used to match deletions,192.6. Filtering and Annotating SVsduplications, and inversions, rather than comparison of SV length and cen-ter.2.6.3 Comparison to DGVFor HSAN, the SV calls were compared to the 2016-05-15 hg19 release ofthe database of genomic variants (DGV) (http://dgv.tcag.ca/dgv/docs/GRCh37_hg19_variants_2016-05-15.txt). For Aicardi syndrome, the SVcalls were compared to the 2016-08-31 hg38 release of DGV (http://dgv.tcag.ca/dgv/docs/GRCh38_hg38_variants_2016-08-31). Deletions werecompared to DGV entries where varianttype was equal to CNV and vari-antsubtype was deletion or loss. Duplications were compared to DGV en-tries where varianttype was equal to CNV and variantsubtype was duplica-tion, gain, or tandem duplication. For inversions, entries with varianttypeOTHER and variantsubtype inversion were used. Insertions were comparedto entries where varianttype was equal to CNV and variantsubtype was in-sertion or novel sequence insertion or mobile element insertion. For SVsother than insertions, variants with a reciprocal overlap of at least 80% witha DGV entry were filtered out using bedtools subtractBed. Insertions werefiltered out with windowBed if a DGV variant resided within 500 bp fromthe insertion point.2.6.4 Comparsion to 1000GThe SV calls for HSAN were compared to the SV release of phase 3 of the1000 Genomes Project for hg19(ftp://ftptrace.ncbi.nih.gov/1000genomes/ftp/phase3/intergrated_sv_map/ALL.wgs.integrated_sv_map_v2.20130502.svs.genotypes.vcf.gz). Annotations for 1000G SVs aligned to hg38 are not yet available. Dele-tions were compared to 1000G events where SVTYPE matched DEL, CNV,DEL ALU, DEL HERV, DEL LINE1, or DEL SVA. Duplications were com-pared to SVTYPES matching DUP or CNV. Inversions were compared torecords with SVTYPE=INV. Insertions were compared to SVTYPES ALU,LINE1, or SVA. As described above for DGV, matching was done based onoverlap of the patient SV, or within a 500 bp window for insertions.2.6.5 TranslocationsUnfortunately, MetaSV does not handle the SVs annotated as BND (break-point) by Lumpy, which represent translocations and insertions. As such,202.7. Identifying Genic and Regulatory SVsthese annotations are filtered out of the final MetaSV VCF output. In-stead, a custom python script was used to extract entries annotated asBND by Lumpy. 95% confidence intervals surrounding the breakends forIMPRECISE variants were used to determine bed start and end coordi-nates. Translocation bed files were then concatenated to the main SV bedfile and sorted using BEDtools sort.2.7 Identifying Genic and Regulatory SVsRare SVs were intersected with genes and regulatory elements using inter-sectBED.2.8 HSAN Modifications to WorkflowVariants present in the HSAN patients for whom a genetic diagnosis hadbeen made by exome sequencing, HSAN3 and HSAN4, were subtracted fromthe other HSAN patients. For SNVs and indels, this was done using theAnnovar annotate variation.pl script using the variant lists from HSAN3and HSAN4 as filters. Similarly, HSAN3 and HSAN4 SVs were filtered outusing subtractBED with a reciprocal overlap of 0.8, comparing like SV types.2.9 Aicardi Syndrome Modifications to WorkflowWe hypothesized that the mutation causing Aicardi syndrome was a de novomutation on the X chromosome. As such, in trio probands, putative denovo SNVs/indels were identified by filtering out parent SNV/indels usingAnnovars annotate variation.pl script. Similarly, parent SVs were filteredout using subtractBed with a reciprocal overlap of 0.8, comparing like SVtypes. To filter out technical artefacts, parent variants were also filtered outfrom all 9 Aicardi genomes.2.9.1 Counting Variants in CommonRegulatory regions, as described in Methods 2.3.2, were selected for thoseresiding on the X chromosome using Unix command grep chrX. A custompython script was used to count the number of patients in which chromo-some X regulatory regions were mutated by SNVs/indels in the nine Aicardi212.9. Aicardi Syndrome Modifications to Workflowgenomes and between the two trio probands for de novo SNVs/indels. Reg-ulatory regions with variants in 9, 8, 7, 6, 5, or 4 patients were manuallyinspected in IGV and the UCSC genome browser.222.9. Aicardi Syndrome Modifications to WorkflowTool VersionCommand-line arguments(if not specified, default was used)VarSim1Art0.6.303.11.14-read length 100 vc num snp 3000000vc num ins 100000 -vc num del 100000vc num mnp 5000 vc num complex 5000sv num ins 2000 sv num del 2000sv num dup 200 sv num inv 100sv percent novel 0.01 vc percent novel 0.01mean fragment size 350sd fragment size 50 vc min length lim 0vc max length lim 49 sv min length lim 50sv max length lim 1000000nlanes 5 totalcoverage 30SpeedSeq2BWASamblasterSambambaFreeBayes0.100.7.15-r11400. 0.2.5b8 -T 10 -w 5CNVnator4 0.3.2bin size: 100-his 100 -stat 100 -partition 100 -calling 100Breakdancer5 1.3.6 -m 1000000000 -r 2Manta6 0.29.6 -m local -j 10LUMPY7 0.2.13 -MetaSV8 0.5.4-min support ins 2-max ins intervals 500000ANNOVAR9 2016Feb01variants reduction.pl:aaf threshold 0.01table annovar.pl: -protocolrefGene,cadd*,fathmm*,genomicSuperDups,tfbsConsSites,wgEncodeRegTfbsClustered,wgRna,targetScanS[plus regulatory region bed files specific to cohort]Not available for hg38BEDtools10 2.24.0 -Table 2.1: Software and parameters used1 https://github.com/bioinform/varsim2 https://github.com/hall-lab/speedseq3 https://github.com/genome/pindel4 https://github.com/abyzovlab/CNVnator5 https://github.com/genome/breakdancer6 https://github.com/Illumina/manta7 https://github.com/arq5x/lumpy-sv8 https://github.com/bioinform/metasv9 http://annovar.openbioinformatics.org/en/latest/10 http://quinlanlab.org/tutorials/bedtools/bedtools.html23Chapter 3ResultsA bioinformatic workflow was constructed to analyze WGS from patientswith rare Mendelian diseases. Several components of this workflow, in-cluding SV detection, regulatory variant detection, and pathogenic regula-tory variant prioritization, were tested to benchmark the performance of thepipeline. Finally, the pipeline was tested on a cohort of patients with HSANand a cohort of patients of Aicardi syndrome and identified two candidatevariants of interest. The results are described in detail below.3.1 Benchmarking Against Simulated Data3.1.1 Comparison of SV callersFirst, SV calls from metaSV based on input from Pindel, CNVnator, Break-dancer, Manta, and LUMPY were compared to the set of calls for each toolindividually (fig. 3.1, deletions). metaSV calls were divided into metaSV alland metaSV PASS. metaSV all represents the union of SVs called using anynumber of SV tools or approaches, in other words, calls made by all five SVtools and integrated by metaSV. SVs called by only one approach, for exam-ple just by RD, are considered to be low-confidence. SVs called using two ormore tools/approaches are considered to be high-confidence. metaSV PASSrepresents only high-confidence calls. Sensitivity is measured by the numberof true positive SV calls divided by the total number of true SVs (eqn. 3.1).Precision, or positive predictive value (PPV), is measured by the proportionof true positive SV divided by all SVs called (eqn 3.2). The F1 score is theharmonic mean of PPV and sensitivity (eqn. 3.3).Sensitivity = TP/(TP + FN) (3.1)PPV = TP/(TP + FP ) (3.2)F1 = 2TP/(2TP + FP + FN) (3.3)243.1. Benchmarking Against Simulated DataWhere TP = true positive, FN= false negative, FP=false positiveAs expected, none of the individual tools was as sensitive as the set ofmetaSV calls together (metaSV all). Surprisingly, LUMPY achieved F1scores similar to the metaSV PASS calls across all deletion sizes and types,and similar sensitivity to metaSV all calls (Fig. 3.1). LUMPY was thereforewas selected for more in-depth analysis of SV data.3.1.2 LUMPY and CNVnator Versus All Other CallersFigure 3.1 shows that LUMPY detected deletions as accurately as the con-sensus of 5 SV callers. However, because we are searching for rare disease-causing SVs of all types, it is important to consider the sensitivity of themethods in more depth. metaSV all achieved a higher sensitivity thanLUMPY for deletions (fig. 3.1). Some SVs, such as large deletions andlarge duplications, are missed by SR and RP methods but are identifiableby a RD approach.I predicted that LUMPY and CNVnator together, merged by metaSV,would achieve a sensitivity equaling metaSV all because LUMPY and CN-Vnator together combine SR, RP, and RD methods. While LUMPYs sen-sitivity for deletions was nearly equivalent to metaSV all, it dropped no-ticeably for deletions greater than 500,000 bp (fig 3.1), while CNVnatorssensitivity increased, indicating that SR and RP methods are not as sensi-tive as an RD method in detecting larger deletions. Indeed, the sensitivityof LUMPY and CNVnator together for deletions was 79% vs. 80% formetaSV all (Fig. 3.2, Table 3.1). For duplications, LUMPY and CNVnatortogether were equivalent to metaSV all at 89.4% sensitivity (Fig. 3.3, Table3.2), and LUMPY and CNVnator together had slightly lower sensitivity forinversions at 63.5% vs. 66.7% for metaSV all (Fig. 3.4, Table 3.3). CN-Vnator does not detect inversions, so this reduction must be due to somedecrease in sensitivity introduced in either the merging process or local as-sembly step when LUMPY and CNVnator are input to metaSV. Althoughthe addition of CNVnator calls decreased overall SV detection specificityfor deletions and duplications, reflected by a decrease in the F1 score, it isimportant for capturing SVs that go undetected by LUMPY.253.1. Benchmarking Against Simulated DataFigure 3.1: Deletion detection a) sensitivity and b) F1-score of 5 SV callers and metaSV consensus calls on 30X100 bp paired-end simulated reads. metaSV all representsSVs called using only one tool or approach (LOWQUAL),as well as SVs called using two or more tools/approaches.metaSV PASS represents only SVs called using two or moretools/approaches. The lines on the rectangle on the left of thegraph indicate the sensitivity or F1 score of each approachfor detecting all sizes of SVs. Inf=infinity.263.1. Benchmarking Against Simulated DataFigure 3.2: Deletion detection sensitivity of metaSV (com-bining 5 SV callers), LUMPY alone, CNVnator alone, andLUMPY and CNVnator together, with 30X 100 bp paired-end simulation. metaSV all represents SVs called using onlyone tool or approach (LOWQUAL), as well as SVs calledusing two or more tools/approaches. The lines on the rect-angle on the left of the graph indicate the sensitivity of eachapproach for detecting all sizes of SVs. Inf=infinity. Notethat the lines for metaSV all (blue) and LUMPY CNVnatoroverlap.Tool CalledTruePositivesF1 SensitivityMetaSV all 2319 1418 69.3 80.0LUMPY 1430 1359 84.9 76.7CNVnator 1635 808 47.4 45.6LUMPY CNVnator 2193 1400 70.6 79.0Table 3.1: Deletion detection sensitivity for 30X 100 bppair-end simulation273.1. Benchmarking Against Simulated DataFigure 3.3: Duplication detection sensitivity of metaSV(combining 5 SV callers), LUMPY alone, CNVnator alone,and LUMPY and CNVnator together, with 30X 100 bppaired-end simulation. metaSV all represents SVs calledusing only one tool or approach (LOWQUAL), as well asSVs called using two or more tools/approaches. The lineson the rectangle on the left of the graph indicate the sen-sitivity of each approach for detecting all sizes of SVs.Inf=infinity. Note that the lines for metaSV all (blue) andLUMPY CNVnator overlap.Tool CalledTruePositivesF1 SensitivityMetaSV all 762 161 34.2 89.4LUMPY 221 142 70.8 78.9CNVnator 702 138 31.3 76.7LUMPY CNVnator 592 161 41.7 89.4Table 3.2: Duplication detection sensitivity for 30X 100 bppair-end simulation283.1. Benchmarking Against Simulated DataFigure 3.4: Inversion detection sensitivity of metaSV (com-bining 5 SV callers), LUMPY alone, CNVnator alone, andLUMPY and CNVnator together, with 30X 100 bp paired-end simulation. metaSV all represents SVs called using onlyone tool or approach (LOWQUAL), as well as SVs calledusing two or more tools/approaches. The lines on the rect-angle on the left of the graph indicate the sensitivity of eachapproach for detecting all sizes of SVs. Inf=infinity.Tool CalledTruePositivesF1 SensitivityMetaSV all 367 338 77.4 66.7LUMPY 359 356 82.2 70.2CNVnator 0 0 0 0LUMPY CNVnator 328 322 77.1 63.5Table 3.3: Inversion detection sensitivity for 30X 100 bppair-end simulationFinally, Breakdancer, Pindel, and Manta are capable of detecting inser-tions but captured a mere 0.5% of simulated insertions, calling 225 inser-tions with only 9 true positives. Here, metaSVs insertion calling algorithmachieved a sensitivity of 10.3%, which, while low, is a 20-fold increase overthe other tools. Similarly, the F1-score is ˜15 times higher for the metaSValgorithm than for metaSV using the 5 SV callers. Further investigationrevealed that VarSim was incorrectly processing insertion variants from thetest set. Intersecting the test set of insertions generated by metaSVs inser-tion boosting algorithm with the truth set of insertions using windowBED293.2. Benchmarking Against Biological Data: WGS from NA12878revealed 468 true positive insertions, as opposed to the 188 as judged byVarSim. Further, intersection of LUMPY SVs where SVTYPE=BND de-tected an additional 596 true positives, 163 of which overlap with metaSVINS calls. LUMPY marks large intra/inter-chromosomal insertions as BND(it cannot detect large novel insertions, however, as reads will be unmapped,and does not support small insertion detection). In total, the pipeline de-tected 1029 out of 1829 insertion breakpoints. The sensitivity for insertiondetection was therefore about 49%. When using a window of 10 bp aroundinsertion breakpoints, the sensitivity was 56%.Based on the results of benchmarking on simulated data, the SV callingpipeline was chosen to include LUMPY and CNVnator as input to MetaSV,with the insertion-boosting algorithm set to true. Interestingly, LUMPYperformed substantionally better than Pindel in detecting deletions, witha sensitivity of 76.7% versus 8.80%. On the other hand, the authors ofthe metaSV paper reported sensitivities of 84.6% and 92.8% for LUMPYand Pindel respectively[42]. This is perhaps due to suboptimal parametersettings in this thesis, as default settings were used, where as the authorsof metaSV used non-default settings. Further, the authors reported SVdetection sensitivity for a genome with 50X coverage, rather than the 30Xcoverage used in this thesis.3.2 Benchmarking Against Biological Data:WGS from NA128783.2.1 Deletion DetectionAfter performing benchmarking on simulated data to determine the mostsensitive combination of tools for calling SVs, SV analysis was performed onthe NA12878 50X genome from Platinum Genomes using LUMPY, CNVna-tor, and metaSV. Deletions and insertions called using this approach werecompared to the svclassify deletion and insertion truth sets, respectively.2,394 true positive deletions were called from a total of 2,676 truth set dele-tions (Table 3.4). The deletions called ranged in size from 50 to 139,619bp, with 75% of deletions under 1000 bp in length, and 66% under 500 bp.Deletion detection sensitivity was 90% for all deletions called and 88% forhigh-confidence deletions, with a near doubling in F1 score from 0.43 forall deletions to 0.76 for high-confidence deletions. The average size of truepositive deletions was almost three times the size of false negative deletions(1,451 bp vs. 533 bp). The modal size of the false negative deletions was303.2. Benchmarking Against Biological Data: WGS from NA1287853 bp, while the modal size of the true positive deletions was 314 bp. Thedistribution of deletion sizes (fig 3.5) displays peaks at 300 bp and ˜6000bp, consistent with Alu elements and L1/LINES, respectively.3.2.2 Insertion DetectionThe 28 true positive insertions ranged in size from 12 to 242 bp. Insertiondetection sensitivity was 47% for all insertions called, and 47% for high-confidence insertions, with an increase in the F1-score from 0.03 for allinsertions to 0.05 for high-confidence insertions (Table 3.4). The averagesize of true positive insertions was 74 bp, while the average size of falsenegative insertions was 100 bp.SV Type Called True Positives Sensitivity F1Deletion 8381 2394 89.5 43.3High-confidencedeletion3510 2342 87.5 75.7Insertion 2258 32 47.0 3.0High-confidenceinsertion1286 32 47.0 5.0Table 3.4: Deletion and insertion sensitivity and F1 scoresfor SV calling on the 50X NA12878 genome.313.3. Genomiser Non-Coding Mendelian VariantsFigure 3.5: Size distribution of smaller deletions detectedin NA12878. The number of deletions detected is plottedagainst the size of the deletion. The x-axis maximum wasset to 7,000 to highlight the peaks at 300 bp and 6,000 bp.Maximum deletion size was 139,619 bp.3.3 Genomiser Non-Coding Mendelian Variants3.3.1 Detection of pathogenic non-coding variantsTo determine if the non-coding pathogenic SNVs and indels compiled bySmedley et al. (2016) could be detected by the analysis pipeline, these vari-ants were intersected with their respective regulatory regions. This filteringprocess identified all miRNA and RNA gene pathogenic variants, 97% of5UTR variants, 93% of 3UTR variants, 87% of promoter variants, and 12%of enhancer variants (fig. 3.6).323.3. Genomiser Non-Coding Mendelian VariantsFigure 3.6: Percentage of Genomiser pathogenic non-coding variants compiled by Smedley et al. (2016) identifiedby the corresponding regulatory region annotations.3.3.2 CADD and FATHMM Scores for Non-CodingVariantsNext, the CADD and FATHMM scores for 343 pathogenic non-coding SNVsfrom the 453 SNVs and indels from the Smedley list were compared to a setof 343 randomly sampled rare variants. The 110 pathogenic indels were notexamined, as FATHMM does not support scores for indels. For pathogenicnon-coding SNVs, the median FATHMM score was 0.94, while for randomrare variants it was 0.1 (fig. 3.7). Pathogenic non-coding SNVs had amedian CADD score of 13, while the median CADD score for random rarevariants was 2 (fig. 3.8). As expected, both FATHMM and CADD scoredistributions for pathogenic non-coding variants and random rare variantswere very significantly different (two-sample Kolmogorov-Smirnov test, p-value <2.2x10-16)333.3. Genomiser Non-Coding Mendelian VariantsFigure 3.7: FATHMM non-coding score distributionsfor pathogenic non-coding SNVs compiled by Smedley etal. (2016) (turquoise) and random rare variants (pink).FATHMM scores greater than 0.50 are considered to indi-cate a likely functional variant.343.3. Genomiser Non-Coding Mendelian VariantsFigure 3.8: CADD score distributions for pathogenic non-coding SNVs compiled by Smedley et al. (2016) (turquoise)and random rare variants (pink). CADD scores greater than15 are considered likely to indicate a deleterious variant.3.3.3 Spike-in of a Pathogenic SNVIn order to determine the efficacy of the SNV/indel calling pipeline in pri-oritizing pathogenic variants, a variant in the promoter region of LDLRassociated with familial hypercholesterolemia was selected and inserted intothe WGS vcf file of one of the HSAN patients. After filtering for rare vari-ants and removing variants present in the two diagnosed HSAN patients,76,761 SNVs and indels remained in the test vcf file. Of these, only two(including the spike-in) impacted regulatory regions associated with famil-ial hypercholesterolemia (table 3.5). Manual inspection of the variants inIGV and the UCSC genome browser revealed that the APOA2 variant, an-notated as a complex SNV, was composed of two common SNPs (rs3829793and rs149905240). Thus, the true pathogenic variant in the LDLR pro-moter was the only remaining candidate variant. Interestingly, this variantwas annotated as falling in the 5UTR and not the promoter, where it ac-tually lies. The variant possesses pathogenic FATHMM and CADD scoresand falls within a conserved V$SREBP1 02 motif bound by sterol regulatoryelement binding transcription factor 1. Consistency checks for correct in-353.4. HSAN AnalysisLocus(Hg19) Ref AltRegulatoryregionFATHMM CADDTFBSConssites1:161194396 GTGAC CTGAGAPOA2promoter. . .19:11200073 C T LDLR 5'UTR 0.9992 15.16 SREBP1 02Table 3.5: Regulatory variants associated with familial hy-percholesterolemia genes identified in a patient genome withvariant chr 19:11200073C<T spiked in.heritance mode (autosomal dominant, in this case) and genotype-phenotypecorrelation would prioritize this variant as a candidate pathogenic mutation.3.4 HSAN AnalysisOur lab performed WGS on eight patients with childhood-onset HSAN. Cod-ing sequence analysis of these patients revealed pathogenic exonic variantsin two patients 1. The phenotypes for the undiagnosed HSAN patients arelisted in table HSAN SNVs and IndelsPrior to filtering, the genome of each HSAN patient differed from the refer-ence sequence by an average of ˜4,301,000 SNVs and indels. Filtering outvariants from the two solved patients as well as common variants reducedthis to ˜73,000 SNVs/indels (range 64,027-81,973). On average, 7 regula-tory variants associated with hereditary neuropathy genes were found perpatient prior to manual inspection in IGV, with the majority found in genepromoters, the longest of the regulatory sequences (fig 3.9).1After the analysis for this thesis was completed, my colleagues discovered a pathogenicexonic variant in an additional HSAN patient, HSAN2, upon re-analysis of coding sequencevariants using an improved analysis pipeline.363.4. HSAN AnalysisSymptomHSAN1-c1,HSAN1-c2,HSAN1-c3HSAN2 HSAN5 HSAN6Positiveintradermalhistamine skintest for HSANInsensitivityto painDevelopmentaldelays-Recurrentvomiting- -Other AnhidrosisMild whitematter diseaseChronic lungdisease fromaspirationSeverebehaviouralproblemsTable 3.6: HSAN patient phenotypes. A checkmark in-dicates the presence of a symptom or test result. A minussymbol indicates the absence of a symptom or test result.373.4. HSAN AnalysisFigure 3.9: Average number of regulatory variants asso-ciated with hereditary neuropathy genes in patients withHSAN (n=6). Bold lines indicate the median number ofvariants in a particular regulatory region. The upper andlower hinges correspond to the 25th and 75th percentiles.Upper and lower whiskers extend the hinge to the high-est value within 1.5xinterquartile range (IQR) and fromthe hinge to the lowest value within 1.5xIQR, respectively.The circles represent outliers, with values greater or lessthan the limits of the upper and lower whiskers, respec-tively. The regulatory variants are categorized by the reg-ulatory region into which they fall. wgRNA=miRNA andsnoRNA, VISTA=Vista enhancer, Khurana= Khurana etal (2015) ultra-sensitive region, FANTOM5=FANTOM en-hancer, miRNAbas = miRNA binding site, 5UTR=5'UTR,3UTR=3'UTR, Promoter =2500 bp upstream of transcrip-tion start site.Only one regulatory SNV in one patient was scored as potentially pathogenicand fit the predicted inheritance pattern for the proband (de novo muta-tion). A heterozygous variant at a conserved locus (chr17:15165831, hg19) ofthe peripheral myelin protein 22 (PMP22) 5'UTR, 58 bases downstream ofthe transcription start site (NM 153321) was found in this patient (fig.3.10).383.4. HSAN AnalysisThe variant has pathogenic FATHMM non-coding (0.92) and CADD scores(15.3). Further, it lies in HA-E2F1, GR, CTCF, Pol2, GATA-1, and E2F6 (H-50) TFBS binding sites, as annotated by ENCODE. PMP22 is a hereditaryneuropathy gene, the expression of which is critical for normal peripheralnerve myelination [39]. Duplications in PMP22 cause Charcot-Marie-Toothdisease type 1A (CMT1A, a hereditary motor and sensory neuropathy),deletions cause Hereditary Neuropathy with Liability to Pressure Palsies(HNPP), and point mutations can cause either [39]. This particular patientis a boy with mild white matter disease, learning disabilities, and dimin-ished pain sensation. Although the phenotype of this patient is not typicalof either classical CMT1A or HNPP, PMP22 can cause a broad range of phe-notypes and is dosage-sensitive. We performed Sanger sequencing on thispatient and both of his parents and found that the variant was inheritedfrom the unaffected mother. We, therefore, concluded that it is probablynot responsible for his severe phenotype.Figure 3.10: IGV screen shot of a heterozygous variant ata conserved locus of the PMP22 5UTR. The alternate allele(T) is supported by 5 reads, while the reference allele (G) issupported by 9 reads.3.4.2 HSAN SV AnalysisOn average, 7,627 SVs were detected in each HSAN patient. The majoritywere deletions (55%), followed by duplications (23%), insertions (21%) andinversions (1%) (fig. 3.11). 51% of these SVs were found in DGV or 1000G.Of the SVs found in DGV or 1000G, 74% were high-confidence SVs. Of theSVs not found in DGV or 1000G, 26% were high-confidence SVs. After sub-393.4. HSAN Analysistraction of SVs in the two diagnosed patients from the other HSAN patientsand intersection with hereditary neuropathy regulatory regions and genes,˜1 high-confidence SV and ˜6 low-confidence SVs were found per patient, onaverage. As mentioned in the Methods, metaSV does not handle LUMPYSVs of type BND, representing translocations and insertions, so these wereconsidered separately. 2,745 BNDs on average were found per patient. Aftersubtraction of BNDs from the two diagnosed patients, on average ˜3 BNDswere found to intersect with hereditary neuropathy regulatory regions andgenes.Figure 3.11: Comparison of HSAN SVs to DGV and 1000G(n=8). The SV numbers are averages. Presence in DGVor 1000G was based on whether the SV shared at least 8%reciprocal overlap with a DGV or 1000G variant of the sametype. SVs are further subdivided by quality, indicated by thetransparency of the bars. DEL=deletion, DUP=duplication,INS=insertion, INV=inversion. LOWQUAL=low confidenceSV called by only one approach. PASS= high-confidence SVcalled by two or more approaches.A rare, heterozygous 6,666 bp SV duplicating the last two exons of DNAmethyl transferase 1 (DNMT1) was found in three affected siblings fromone family (fig. 3.12). The duplication (chr19: 10238761- 10245460, hg19)403.4. HSAN Analysiswas called using RD, SR, and RP methods in one sibling, and RD alone inthe other two. The breakpoints of this SV overlap AluSz SINE elements,and input of breakpoint split reads to BLAT revealed 88 bp of homologyat the junction. This duplication is not present in DGV or 1000G. Theregion is overlapped by deletions greater than 100,000 bp in DGV, but notby duplications. In the Database of Genomic Variation and Phenotype inHumans using Ensembl Resources (DECIPHER), the region is overlappedby a number of duplications ranging in size from 3 million to 6 million bp.Analysis of this duplication in IGV indicates that it is a tandem du-plication that does not disrupt the DNMT1 coding sequence. The readsat the boundaries of the duplication show discordant orientation; the readpairs map in an outward orientation rather than inward, which is diagnos-tic of a tandem duplication (green reads, fig.3.12.a.) (http://software.broadinstitute.org/software/igv/interpreting_pair_orientations).The duplication is not inverted; an inverted duplication would be markedby overlapping read pairs (blue and teal) at the SV boundaries. The du-plicated sequence includes DNMT1 exons 40 and 41 and is adjacent to theDNMT1 gene, thus creating a pair of exons orphaned from DNMT1 by theintervening duplicated non-coding sequence. A schematic of the duplica-tion is shown in figure 3.12b. The duplication extends ˜1 kb into the TADboundary downstream of the gene (purple bar, fig. 3.12a). While the RD,SR, and RP signals of this SV are diagnostic of a high-confidence tandemduplication, it should be noted that it has not been confirmed by Sangersequencing. Interpretation of the pathogenicity of this variant will be con-sidered in detail in the Discussion (4.4.2).413.4. HSAN Analysis(a) Figure 3.12a(b) Figure 3.12bFigure 3.12: IGV screencap of a heterozygous 6,666 bp tan-dem duplication spanning the last two exons of DNMT1 inthree siblings. a) Reads at this locus in IGV for each siblingare shown. The red bar near the top of the figure representsthe span of the tandem duplication. The dashed vertical lineindicates the midpoint of the duplication. Reads are coloredby pair orientation: green reads indicate reads whose mateis mapped in the opposite orientation to that expected. Blueand teal reads indicate read pairs whose mates overlap. Theduplication is evidenced by the increased RD relative to theadjacent regions, and the green discordant read pairs at theboundaries of the duplication. The blue bar at the bottomof the figure represents a TAD. The TAD boundary is illus-trated in purple. b) Linear depiction of the DNMT1 tandemduplication (not to scale). DNMT1 is illustrated in blue.The sequence in the reference genome that is duplicated inthe siblings is depicted by diagonal pink lines.423.5. Aicardi Syndrome Analysis3.5 Aicardi Syndrome Analysis3.5.1 Aicardi Syndrome SNV and Indel AnalysisPrior to filtering, each Aicardi syndrome patient had an average of 4,525,000SNVs and indels in comparison to the reference sequence. Filtering outvariants from parents as well as common variants reduced this to 102,000SNVs/indels per patient on average (range 75,846-126,700). Interestingly,the number of de novo variants identified in the two trio probands differedsubstantially: 170,977 in one patient (6,948 on the X chromosome) and 3,545(135 on the X chromosome) in the other. This is likely due to the fact thehigh coverage in the former patient (100X) identifies many inherited SNVsand indels that are not picked up in the parents due to insufficient coverage(30X). A mean of 91 regulatory variants was found on the X chromosomein each of the Aicardi patients, with the majority falling in promoters (fig3.13).3.5.2 Aicardi Syndrome SV AnalysisOn average, 800 X chromosome SVs were detected in each patient withAicardi syndrome (fig. 3.1.4). 42% of these SVs were found in DGV. Of theSVs found in DGV, 44% were high-confidence SVs. Of the SVs not foundin DG V, 6% were high-confidence SVs. After subtraction of SVs from theparent genomes and intersection with X chromosome regulatory regions andgenes, 269 SVs were found per patient, on average. 428 X chromosome BNDson average were found per patient. After subtraction of BNDs present in thefour parent genomes, on average ˜80 X chromosome BNDs were found tointersect with X chromosome regulatory regions and genes. The number ofinversions in one patient genome was dramatically higher than the numberof inversions in all other genomes (2,112 X chromosome inversions vs. anaverage of 54 standard deviation=60, in the other patients). Further, thenumber of insertions in this patient was zero. This patient was thereforeexcluded from SV analyses. The high number of inversions was reflected bya high proportion of blue and teal reads (indicating overlapping mate pairs)evident in IGV throughout the entire genome of this patient (indicatingoverlapping mate pairs). This was also evident in the genome of this patientaligned to hg19 from FASTQ by my colleagues, using a different pipeline.It is unclear what the origin of these discordant read pairs is. It may be atechnical artifact originating from the sequencing in this patient.433.5. Aicardi Syndrome AnalysisFigure 3.13: Average number of regulatory variants asso-ciated with X chromosome genes in patients with Aicardisyndrome (n=9). Bold lines indicate the median numberof variants in a particular regulatory region. The upperand lower hinges correspond to the 25th and 75th per-centiles. Upper and lower whiskers extend the hinge tothe highest value within 1.5xinterquartile range (IQR) andfrom the hinge to the lowest value within 1.5xIQR, respec-tively. The circles represent outliers, with values greater orless than the limits of the upper and lower whiskers, respec-tively. The regulatory variants are categorized by the reg-ulatory region into which they fall. wgRNA=miRNA andsnoRNA, VISTA=Vista enhancer, Khurana= Khurana etal (2015) ultra-sensitive region, FANTOM5=FANTOM en-hancer, miRNAbas = miRNA binding site, 5UTR=5'UTR,3UTR=3'UTR, Promoter =2500 bp upstream of transcrip-tion start site.443.5. Aicardi Syndrome AnalysisFigure 3.14: Comparison of Aicardi X chromosome SVsto DGV (n=9). The SV numbers are averages. Presencein DGV was based on whether the SV shared at least 80%reciprocal overlap with a DGV variant of the same type.SVs are further subdivided by quality, indicated by thetransparency of the bars. DEL=deletion, DUP=duplication,INS=insertion, INV=inversion. LOWQUAL=low confidenceSV called by only one approach. PASS= high-confidence SVcalled by two or more approaches.3.5.3 Aicardi syndrome de novo variant analysisDe novo regulatory SNVs, indels, and SVs on the X chromosome were iden-tified in the two trio probands. While there were several regulatory regionsharboring putative de novo SNVs or indels in the trio probands, inspectionof these variants in IGV indicated they were false positives. Thus, no can-didate de novo regulatory SNVs or indels were identified that overlapped inthe two trio probands or with any of the other 7 Aicardi syndrome genomes.MID1, IL1RAPL2, and TFDP3 were found to be overlapped by rare denovo putative SVs in the two probands. All are putative de novo low qual-ity duplications called by CNVnator. Manual inspection of these variants inIGV did not support their existence; the RD distributions in the probandslooked very similar to that of the parents, despite these duplications not453.5. Aicardi Syndrome Analysisbeing called in the parents by CNVnator. The putative pathogenicity ofthese variants is discussed further in the Discussion (4.5.1).46Chapter 4Discussion4.1 OverviewThe aim of this thesis was to identify non-coding variants and SVs from WGSin patients with rare Mendelian diseases. More specifically, the objectiveswere to 1) develop and benchmark a bioinformatics workflow for detectionof pathogenic non-coding SNVs/indels and pathogenic non-coding or codingSVs, and 2) to use this workflow to analyze WGS data of unsolved patientsrecruited from in-house HSAN and Aicardi syndrome studies to identifycandidate pathogenic variants.A bioinformatic workflow was constructed to identify putative functionalregulatory variants from raw sequence data. SV calling was benchmarkedagainst SV truth sets from a simulated genome and the NA12878 genome. Acompilation of CRE annotations was selected to filter for functional variantsand permit comparison to known pathogenic non-coding variants. Finally,the workflow was applied to HSAN and Aicardi syndrome patient genomes.The workflow successfully detected and prioritized rare regulatory variantsand SVs. Several interesting candidate variants were detected, but nonecould be convincingly implicated as pathogenic in these patients.4.2 Summary of Findings4.2.1 VarSim Benchmarking Results and LimitationsFive SV callers (Pindel, CNVnator, Breakdancer, Manta, and LUMPY) anda consensus SV caller (metaSV) were used to call variants on a simulatedgenome containing known deletions, duplications, inversions, and insertions.LUMPYs F1 score for deletions, duplications, and inversions was compara-ble to the consensus set of high-confidence calls generated by metaSV fromthe input of all five callers. LUMPY, a SR and RP approach, and CNVna-tor, a RD approach, together with metaSV, a consensus caller that performslocal assembly of candidate regions, had a sensitivity equivalent to the com-bination of LUMPY, CNVnator, Pindel, Breakdancer, Manta, and metaSV474.2. Summary of Findingstogether. Although insertion detection appeared to be poor, this seemslargely to reflect errors in varSims comparison method between truth andtest SVs of this type. Altogether, LUMPY, CNVnator, and metaSV had adeletion detection sensitivity of 80%, a tandem duplication detection sen-sitivity of 89%, an inversion detection sensitivity of 64%, and an insertiondetection sensitivity of 49%.Simulated genomes, however, do not accurately reproduce artefacts suchas chimeric molecules and reads from poorly assembled genomic regions thatcan confound SV calling in biological genomes. The SV calling approachwas therefore evaluated on the NA12878 genome, an extremely well-studiedgenome in which SVs have been characterized.4.2.2 NA12878 Benchmarking Results and LimitationsSVs were called from the NA12878 genome by analyzing a 50X coveragedata set from Illumina Platinum Genomes. SVs were called using LUMPY,CNVnator, and metaSV. Based on comparison to truth SV calls from svclas-sify, the estimated deletion detection sensitivity was 90% with an F1 score of0.43. The modal size of false negative deletions was 53 bp, while the modalsize of true positive deletions was 314 bp, indicating that the SV pipelineis limited in its ability to call small deletions. Insertion detection sensitiv-ity was 47% for all insertions called and also for high-confidence insertionsalone, indicating that only high-confidence insertions were true positives.These results were consistent with the results from benchmarking againstsimulated data. The improvement in deletion detection sensitivity is mostlikely attributable to the fact that the NA12878 genome was sequenced to50X, while the simulated genome had a coverage of 30X.Analyses of these gold standard SVs is subject to ascertainment bias.The SVs in the deletion truth set were called from PE short-read sequencing(SRS) data using SR, RD, and RP methods. Indeed, CNVnator was oneof the tools used to call deletions in the Personalis and 1000G set, so thistruth set is biased towards CNVnator, which was one of the tools used inthis thesis. Insertions in the truth set were called using an AS approachfrom SRS data, which are limited in detecting long insertions. Indeed, themaximum insertion size in the truth set was only 353 bp. It is likely thatlarger insertions do exist in NA12878 and that the SV truth sets themselveswere biased towards SVs that are detectable by the approaches I used in myresearch. Therefore, the sensitivity and F1 scores obtained in this study areprobably overestimates of the true values. The limitations to SV detectionfrom SRS are discussed in more detail below.484.2. Summary of Findings4.2.3 Limitations to SV Calling from SRSThe set of SVs detectable with SRS approaches is limited as reads are gener-ally shorter than most SVs (50-400 bp). Indeed, the human genome consistsof 50-69% repetitive sequences, patterns of DNA sequence that occur inmultiple copies of the genome[12] and 5% of the genome cannot be uniquelymapped with 100 bp read length [21]. This repetitive sequence is composedof transposable elements, low complexity regions, and pseuodogenes. Theseregions present a challenge for SRS reference genome-based alignment, whichis the method of choice in clinical sequencing studies due to its cost effec-tiveness and low per-base error rate .Long-read sequencing (LRS) can overcome the SRS alignment challengesby spanning SVs and repetitive regions by several kilobases or more. Indeed,in one study characterizing SVs in a personal genome using a combinationof SRS and Pacific Biosciences(PacBio) LRS, SRS approaches to SV callingidentified only 57% of SVs identified using the long-read approaches [16].PacBio single molecule real-time (SMRT) sequencing offers read lengths of10 kb on average, with some reads close to 100 kb [10]. PacBio reads canbe used to assemble whole genomes de novo or scaffold assembly from SRS,thereby improving completeness and considerably improving SV detection.Recently, PacBio LRS data derived from two functionally haploid genomeswas analyzed to identify SVs [25]. The haploid genomes were obtained fromhydatidiform moles. Hydatidiform moles are abnormalities of human preg-nancy that form from fertilization by sperm of an enucleated egg or by lossof maternal chromosomes post-fertilization [26]. Some hydatidiform molesare diploid due to subsequent duplication without cytokinesis of the fertiliz-ing sperm; functionally, they can be considered haploid as they lack allelicvariation, as is true based on the analyses these authors performed. ˜20,500SVs were identified from each genome (˜13,000 insertions, ˜7,500 deletions,47 inversions). Half of the inserted or deleted sequences consisted of tan-dem repeats or complex arrays of different repeat classes. The authors alsocreated a pseudo-diploid genome by down-sampling the genomes of the twoindividuals and combining them. Interestingly, the study found that thesensitivity of SV detection from this pseudo-diploid genome was less thanhalf that in either haploid genome due to difficulties in detecting heterozy-gous SVs, regardless of coverage. 83% of SVs reported in the study had notbeen described in previous SV studies, including 1000G. Of particular rele-vance to my study, analysis of SRS data from the two haploid genomes usingLUMPY and WHAM, SR and RP methods, respectively, identified only 10%of variants identified by LRS technology. Clearly, SV detection from SRS494.3. Genomiser Non-Coding Mendelian Variantsis seriously limited by short read length, especially in repetitive regions ofthe genome. Thus, benchmarking with GIAB gold standard deletion andinsertion SV call-sets is biased towards SVs detectable by SRS.Unfortunately, GIAB gold standard sets of duplications, inversions, andtranslocations are not available for NA12878. This is likely due to the com-putational difficulties in detecting these SV types and the difficulty in ob-taining orthogonal validation. As a consequence, many SV tools are onlybenchmarked against deletions and/or insertions. LUMPY, for example, isbenchmarked against deletions, duplications, inversions and translocationsfrom simulated data but only against deletions from NA12878 [30]. Indeed,that LUMPY was trained on the NA12878 deletions biases its performancein the analysis performed in this thesis.4.3 Genomiser Non-Coding Mendelian Variants4.3.1 Detection of Pathogenic Non-Coding VariantsAs expected, detection of pathogenic non-coding SNVs and indels was highfor non-coding genic or proximal regulatory non-coding variants, with detec-tion rates ranging from 87-100%. Variants missed by RefSeq UTR and pro-moter annotations would be picked up by Ensembl gene predictions, whichare more comprehensive. This was apparent by looking at the Ensembl geneprediction track in the UCSC genome browser.Pathogenic enhancer variant detection by intersection with FANTOM5enhancers, Vista enhancers, and ultra-sensitive regions was poor, at 12%.Enhancers have been predicted using a range of methods, including en-hancer RNA expression, EP300 binding sites, RNA polymerase II bindingsites, DNase I hypersensitivity sites, and histone modification patterns, butthere is little consistency in enhancer predictions based on different tech-nologies [19]. A consensus set of enhancers integrating annotations shouldbe more comprehensive than enhancers predicted using any one technology,like the FANTOM5 or VISTA sets. Until such a comprehensive truth setof enhancers exists, detection of enhancer variants will be limited in sensi-tivity, as reflected in the benchmarking performed here. Massively parallelfunctional assays of enhancers will also contribute knowledge of enhancerswith biological function [4].504.3. Genomiser Non-Coding Mendelian Variants4.3.2 FATHMM and CADD Scores of PathogenicNon-Coding VariantsWith respect to categorization of non-coding variants as pathogenic or not,a FATHMM score cutoff of 0.5 categorizes 80% of true positive variantsas pathogenic while scoring 5% of random rare variants as false positives.A CADD score cutoff of 15 categorizes 35% of true positive variants aspathogenic while scoring 1% of random rare variants as false positives. Tocategorize 80% of true positive variants as pathogenic, a CADD score of8 would have to be used; this would identify 9% of random rare variantsas false positives. Utilizing the FATHMM non-coding score identifies fewerfalse positives than the CADD score with an equivalent rate of true posi-tives, making it a more reliable indicator of the pathogenicity of non-codingvariants.4.3.3 Spike-in of a Pathogenic Non-Coding SNVOne of the difficulties in analyzing non-coding variants is their sheer num-ber. Even after filtering for rare variants, an average of 64027-126700 SNVsand indels remained in each of our HSAN and Aicardi syndrome patients.The efficacy in reducing this number by filtering for non-coding variants incandidate gene-associated regulatory regions was demonstrated by spiking aknown pathogenic variant in the LDLR promoter into a patient variant file.Filtering for rare regulatory variants associated with familial hypercholes-terolemia genes detected only the true pathogenic variant and one additionalvariant that was excluded after manual inspection. The spiked-in variantwas therefore successfully prioritized as the best candidate pathogenic vari-ant. That this was the only true rare regulatory variant associated withfamilial hypercholesterolemia in this genome indicates that the pipeline de-scribed in this thesis identifies a limited number of rare regulatory variantsassociated with each gene. Indeed, this was seen in the HSAN genomes,where an average of 7 rare regulatory variants associated with a list of 50candidate genes were identified per patient. In Aicardi syndrome patients,an average of 91 rare regulatory variants associated with 1087 genes on theX chromosomes were identified. The number of rare regulatory variantsidentified is, of course, limited by the annotations used in the pipeline. Thetrue number of such variants is likely larger.514.4. HSAN Analysis4.4 HSAN Analysis4.4.1 HSAN SNV and Indel AnalysisA heterozygous SNV identified in the PMP22 5UTR of one HSAN patientwas a good candidate based on FATHMM and CADD scores and its presencewithin TFBSs. Indeed, a previous study explored the potential contributionof variants in the highly conserved PMP225 region to gene expression, con-cluding that rare variation in this region may alter PMP22 dosage and con-tribute to the clinical variability of CMT1A and HNPP [51]. The genotype-phenotype correlation for this patient was poor, but regulatory variantsare not necessarily expected to recapitulate phenotypes caused by codingvariants. For instance, coding variants in PTF1A cause syndromic pancre-atic agenesis with neurological symptoms, while PTF1A enhancer mutationscause an isolated pancreatic anomaly [49]. However, Sanger sequencing ofthe proband and parents revealed that the variant was inherited from theunaffected mother and was therefore likely to be benign. This emphasizesthe importance of a trio study design in sequencing studies, which can re-duce the number of candidates by factoring in inheritance mode when theparents phenotypes are known. Unfortunately, funds for WGS were limitedto sequencing only the probands in this cohort.4.4.2 HSAN SV AnalysisAn average of 7,627 SVs was detected in each of the WGS data sets fromHSAN patients. Only 50% of these were high-confidence calls. Of the SVsfound in DGV or 1000G (51%), 76% were high-confidence calls, while ofthe SVs not found in DGV or 1000G, only 26% were high confidence calls.This indicates that the set of rare SVs has a higher false positive rate thanthose that intersect with common SVs. The frequency of rare structuralvariation identified here is therefore inflated by false positive low-confidencecalls. With this in mind, after subtracting SVs present in the two diagnosedHSAN patients and intersecting variants with regulatory regions, an averageof only 10 putative rare genic and regulatory SVs were identified in eachpatient. Manual inspection and genotype-phenotype correlation narrowedthis down to one candidate SV in the three affected siblings of one family.A rare, heterozygous 6,666 bp tandem duplication affecting the last twoexons of DNMT1 was identified in these three siblings. Autosomal domi-nant SNVs in the targeting sequence domain of DNMT1 cause HSAN1E.While these siblings all have mild HSAN phenotypes, with some insensitiv-ity to pain and mild anhidrosis, HSAN1E is characterized by hearing loss,524.4. HSAN Analysisdementia, and sensory loss. It is typically adult-onset. The possibility thatthese siblings may go on to develop further symptoms, such as hearing loss,in the future cannot be excluded.Interestingly, a recent case report describes a DNMT1 SNV in a patientwith childhood-onset HSAN1E and some phenotypic similarities to thesesiblings: intermittent shooting pain in feet in childhood, repeated infections,and insensitivity to pain [18]. The patient developed deafness in adulthood.In spite of these parallels, the phenotype of the siblings we studied is nottypical of HSAN1E, as confirmed by the clinical expert who phenotypedthem.The molecular pathology of HSAN1E is likely due to reductions in DNMT1enzymatic activity resulting from mis-folding of the DNMT1 protein [54].The duplication I found, on the other hand, is not predicted to disruptthe protein. However, the orphaned exons may be translated, if alternativesplicing occurs between the full gene and the orphaned exons, which couldbe could be further elucidated with RNA-seq. If the gene is mistranslated,protein folding could be impaired. The SV does extend ˜1kb into the ad-jacent TAD boundary, which is 240 kb in length, however disease-causingTAD disruptions described in the literature fully delete, invert, or dupli-cate a TAD boundary. It is therefore unlikely that this small duplicationinterferes with DNMT1 regulation.That the phenotypes of the siblings are not a close match to documentedHSAN1E cases and that neither the gene nor gene regulation are confidentlypredicted to be affected makes it difficult to assess the pathogenicity ofthis SV. Similarly, the mode of inheritance of HSAN in these patients isunknown. One might expect an autosomal recessive mode of inheritancegiven that all three siblings are affected, but it is possible that the mode ofinheritance is autosomal dominant with one parent affected. These siblingsare adopted, and little information about the biological parents is available,complicating the interpretation of this SV. The duplication of DNMT1 Ifound must, therefore, be classified as a variant of uncertain significance(VUS) until new clinical evidence arises to support the likelihood that the SVis either benign or pathogenic. For instance, if symptoms typical of HSAN1Edevelop in these three siblings in adulthood, this would provide evidence forthe pathogenicity of the duplication. The presence of this duplication insimilarly affected patients would also support pathogenicity. Discovery ofthis SV in unaffected individuals in population databases would support thehypothesis that it is benign.534.5. Aicardi Syndrome Analysis4.4.3 HSAN Analysis LimitationsDetection of regulatory variants associated with HSAN is limited by both thecandidate gene list and the regulatory region annotations. This pipeline willnot detect variants in novel HSAN disease genes, nor will it detect variantsin as-yet unannotated regulatory regions. Furthermore, lower-than-expectedcoverage (<30X) was obtained in two of the HSAN patients included in thisstudy, limiting variant detection. It is, therefore, possible that undiagnosedpatients possess variants in genes not on the candidate gene list or in reg-ulatory regions that are not currently annotated. It is also possible thatpathogenic variants are present in these data sets that were not called dueto insufficient coverage. Lastly, the possibility that these patients are phe-nocopies, in other words that the origin of their disease is environmentaland not genetic, cannot be excluded.To search for regulatory variants in additional genes related to HSAN,a tool like Phenolyzer could be used to compile a list of genes related toknown HSAN genes by protein-protein interactions, sharing a gene family orbiological pathway, or transcriptionally regulating another gene [58]. If morefunds become available, patients with low coverage should be re-sequenced,and if possible, all of the parents should be sequenced by WGS.4.5 Aicardi Syndrome Analysis4.5.1 Aicardi Syndrome SNV/Indel and SV AnalysisNon-coding and SV analysis of Aicardi syndrome genomes failed to reveala candidate de novo X chromosome variant in these patients. CNVnatorcalled de novo duplications of IL1RAPL2, MID1, and TFDP3 in both trioprobands. IL1RAPL2 is about 1 mbp and is an orphan interleukin receptorthat was identified in fetal brain tissue [48]. MID1 is about 400 kb and is amicrotubule-associated protein; MID1 mutations cause Opitz syndrome, amidline malformation syndrome (OMIM:300000). TFDP3 is less than 2000bp in length and is a transcription factor that suppresses E2F1-inducedapoptosis-dependent P53. It is ubiquitously expressed in human tissues,including brain [33]. Interestingly, the two probands have putative dupli-cations that overlap a small region (2,600 bp) that encompasses TFDP3.Given thatTFDP3 is only 1,679 bp long, it is far less likely for the twoprobands to both have SVs in this gene by chance than it is in MID1 orIL1RAPL2. However, all these SVs are low-confidence, as they have beenonly called by CNVnator, which has an F1 score for duplications of only544.6. Conclusions and Future Directions47.4% as measured by VarSim. Further, visual inspection in IGV did notsupport the presence of these duplications, as the read depth distributionsof the probands appeared very similar to that of their parents. It is there-fore very likely that these are false positives. Furthermore, these SVs arenot found in any of the other probands, making it unlikely that they causeAicardi syndrome, if it is indeed a genetically homogenous condition.4.5.2 Aicardi Syndrome LimitationsThere are several reasons that a causative mutation might not have beenidentified in the Aicardi syndrome patients. First, it is possible that acausative variant is located in a genomic region that is invisible to mostalignment methods, such as a segmental duplication, tandem repeat or otherpoorly mapped genomic region. Second, it is possible that the causativemutation is an SV that cannot be detected from SRS data due to its lengthor repetitive content. Third, the variant may be in a regulatory region thatis not annotated, or insufficiently annotated, within the pipeline. Fourth, itis possible that the causative mutation is somatic, and the read frequencyof the variant allele was too low to be detectable in the samples studied.Indeed, some clinical features of Aicardi syndrome are patchy, which can beindicative of mosaicism. To test this hypothesis, we could sequence affectedtissue from additional patients at high coverage, e.g. 100X, and comparethis to blood WGS from the same patients. Lastly, the chorioretinal lacunae,or holes in the retina, and agenesis of the corpus callosum suggest that cellscarrying the causative variant die. In this case, a mosaic mutation wouldnot be detectable by sequencing DNA from viable tissue.4.6 Conclusions and Future Directions4.6.1 SummaryThis thesis benchmarked and tested a bioinformatic workflow for identify-ing pathogenic regulatory variants and SVs from WGS in rare Mendeliandisease. While typical sequencing pipelines analyze SNVs and indels in theexonic regions of the genome, this workflow extends WGS analysis to coverthe full spectrum of genetic variation. The SV calling pipeline, validatedagainst a simulated genome, detected 80% of deletions, 89% of tandem du-plications, 64% of inversions, and ˜50% of insertions. On experimental data,the pipeline detected 90% of deletions and 47% of insertions. The SV truthsets used for benchmarking likely only represent about 10% of structural554.6. Conclusions and Future Directionsvariation in the genome due to difficulties in calling SVs in repetitive andGC rich genomic regions from SRS data. Nevertheless, use of the analysispipeline I developed will detect many pathogenic SVs and increase diagnos-tic yield from clinical sequencing studies.Extending the analysis to non-coding regulatory regions identified only0.14 variants per candidate gene in the HSAN study and 0.08 variants per X-chromosome gene in the Aicardi syndrome study. This number of variantsis manageable for a bioinformatician to analyze manually. In a heteroge-neous disorder like intellectual disability, which is associated with over 1000different genes, the pipeline would identify more than 100 regulatory vari-ants. In this case, a FATHMM threshold of 0.5 would filter out 95% ofrare SNVs, with a sensitivity of ˜80% for detecting pathogenic SNVs. Evena FATHMM threshold of 0.2, with a sensitivity of 90%, would only filterout 84% of rare SNVs. Manual inspection of indels, on the other hand, re-vealed many to be sequencing or algorithmic artefacts. A large database ofin-house whole genome sequences, all processed through the same pipeline,would filter many of these technical artifacts out.4.6.2 Comparison to a study analyzing WGS from patientswith a heterogeneous diseaseGiven that only six HSAN patient genomes were studied and that, givenits extreme rarity and phenotypic homogeneity, Aicardi syndrome is likelycaused by pathogenic variants of just one gene [2], it is unsurprising thatno pathogenic non-coding variants or SVs were discovered. Relevant to thisthesis is a WGS study of 722 individuals with inherited retinal disease, aheterogeneous disorder, 537 pathogenic alleles were identified in 404 individ-uals [7]. This equated to a diagnostic rate of 56%. Of the pathogenic alleles,only 31 were deletions, 2 were tandem duplications, and 3 were SNVs in reg-ulatory regions. In other words, 4% of individuals in the cohort possessedpathogenic deletions, 0.3% possessed pathogenic tandem duplications, and0.4% possessed synonymous or regulatory region SNVs/indels. If we assumethat the prevalence of pathogenic SVs and non-coding SNVs in inheritedretinal disease is generalizable to other heterogeneous disorders, SRS wouldreveal one patient in 25 with a pathogenic deletion, 1 patient in 333 witha pathogenic tandem duplication, and one patient in 240 with a pathogenicnon-coding mutation. This study of retinal disease only looked at non-coding mutations in introns of candidate genes and SVs disrupting exonsof candidate genes, and so the rate of discovery of these mutations mightbe increased by expanding the search space. As with the limitations to the564.6. Conclusions and Future DirectionsHSAN and Aicardi syndrome studies, the unsolved portion of this cohortmay have pathogenic variants in repetitive regions, regions of poor cover-age, or genes not on the list of inherited retinal-disease associated genes.Further, the inheritance may be oligogenic or influenced by environmentalfactors. In spite of these limitations to whole genome SRS analysis and acandidate-gene based approach, this study demonstrated the value of WGSin identifying pathogenic non-coding variants and SVs.4.6.3 Future directionsThis thesis described the limitations to SV identification from SRS. LRSis expensive, but there is an affordable alternative: 10X GemCode Tech-nology is a library preparation and analysis method that leverages dropletmicrofluidics and molecular barcoding to construct 40-200 kb linked reads(pseudo-long reads) from short reads. Short reads from many long DNAmolecules are each tagged with molecular barcodes unique to the long frag-ment of origin, giving the ability to link distant segments into a singlecontig (https://www.10xgenomics.com/). Like PacBio SMRT sequencing,pseudo-long read sequencing allows assembly of repetitive regions but ismore affordable. Given the added cost of these technologies, a practical ap-proach for clinical diagnosis might be to perform LRS only after all otheranalyses are exhausted. We have obtained funding to perform 10X Gem-Code library preparation and Illumina sequencing on an Aicardi syndromepatient. This may reveal variants undetected in Aicardi syndrome patientsby SRS.Several months after designing the pipeline described in this thesis, a pa-per was published describing a tool, Genomiser, for identifying pathogenicSNVs and indels in Mendelian disease [52]). The paper presents a regu-latory Mendelian mutation (REMM) score for prioritizing variants. Thescore is based on machine learning from a set of 453 known pathogenicnon-coding variants and is claimed to be superior to FATHMM and CADDscores in prioritizing pathogenic non-coding variants. Genomiser harnessesTAD boundaries and FANTOM5 enhancers in prioritizing variants, as wellas patient phenotypes. However, Genomiser does not handle SVs, as it islimited to input from vcf files containing SNVs and indels. In the future,it may be enlightening to test Genomiser on our unsolved genomes. Inaddition, a novel reference-free k-mer based algorithm, RUFUS, is in de-velopment for de novo mutation detection from trios and quartets (https://github.com/jandrewrfarrell/RUFUS). We are in the process of testingthis software on Aicardi syndrome trios.574.6.4 ConclusionsThe bioinformatic workflow described in this paper is complementary tosequencing pipelines that analyze only protein-coding variants from wholegenomes. Benchmarking against simulated and real whole genome data, aswell as known pathogenic SNVs and indels, validated its utility in detectingvariants across the entire spectrum of genetic variation. Application of thisworkflow to larger cohorts of patients with rare Mendelian diseases shouldidentify pathogenic non-coding variants and SVs, increasing diagnostic yieldof clinical sequencing studies, assisting management of genetic diseases, andcontributing knowledge of novel pathogenic variants to the scientific com-munity.58Appendix APython scriptFigure A.1: This script takes a BED file of TADs fromcombined replicates of human embryonic stem cells [13], anda BED file with gene coordinates and names, and outputs theTAD boundaries flanking the TAD in which a gene resides.This script does not yet take into account genes that resideinside TAD boundaries.59Bibliography[1] A. Abyzov, A. E. Urban, M. Snyder, and M. Gerstein. CNVnator:An approach to discover, genotype, and characterize typical and atyp-ical CNVs from family and population genome sequencing. GenomeResearch, 21(6):974–984, June 2011.[2] J. Aicardi. Aicardi syndrome. Brain and Development, 27(3):164–171,Apr. 2005.[3] R. Andersson, C. Gebhard, I. Miguel-Escalada, I. Hoof, J. Bornholdt,M. Boyd, Y. Chen, X. Zhao, C. Schmidl, T. Suzuki, E. Ntini, E. Arner,E. Valen, K. Li, L. Schwarzfischer, D. Glatz, J. Raithel, B. Lilje,N. Rapin, F. O. Bagger, M. Jrgensen, P. R. Andersen, N. Bertin,O. Rackham, A. M. Burroughs, J. K. Baillie, Y. Ishizu, Y. Shimizu,E. Furuhata, S. Maeda, Y. Negishi, C. J. Mungall, T. F. Meehan,T. Lassmann, M. Itoh, H. Kawaji, N. Kondo, J. Kawai, A. Lennarts-son, C. O. Daub, P. Heutink, D. A. Hume, T. H. Jensen, H. Suzuki,Y. Hayashizaki, F. Mller, T. F. Consortium, A. R. R. Forrest, P. Carn-inci, M. Rehli, and A. Sandelin. An atlas of active enhancers acrosshuman cell types and tissues. Nature, 507(7493):455–461, Mar. 2014.[4] C. C. Babbitt, M. Markstein, and J. M. Gray. Recent advances infunctional assays of transcriptional enhancers. Genomics, 106(3):137–139, Sept. 2015.[5] M. Baker. Structural variation: the genome’s hidden architecture. Na-ture Methods, 9(2):133–137, Feb. 2012.[6] G. Barry. Integrating the roles of long and small non-coding RNA inbrain function and disease. Molecular Psychiatry, 19(4):410–416, Apr.2014.[7] K. J. Carss, G. Arno, M. Erwood, J. Stephens, A. Sanchis-Juan,S. Hull, K. Megy, D. Grozeva, E. Dewhurst, S. Malka, V. Plagnol,C. Penkett, K. Stirrups, R. Rizzo, G. Wright, D. Josifova, M. Bitner-Glindzicz, R. H. Scott, E. Clement, L. Allen, R. Armstrong, A. F.60BibliographyBrady, J. Carmichael, M. Chitre, R. H. H. Henderson, J. Hurst,R. E. MacLaren, E. Murphy, J. Paterson, E. Rosser, D. A. Thomp-son, E. Wakeling, W. H. Ouwehand, M. Michaelides, A. T. Moore,T. Aitman, H. Alachkar, S. Ali, L. Allen, D. Allsup, G. Ambe-gaonkar, J. Anderson, R. Antrobus, R. Armstrong, G. Arno, G. Aru-mugakani, S. Ashford, W. Astle, A. Attwood, S. Austin, C. Bacchelli,T. Bakchoul, T. K. Bariana, H. Baxendale, D. Bennett, C. Bethune,S. Bibi, M. Bitner-Glindzicz, M. Bleda, H. Boggard, P. Bolton-Maggs, C. Booth, J. R. Bradley, A. Brady, M. Brown, M. Browning,C. Bryson, S. Burns, P. Calleja, N. Canham, J. Carmichael, K. Carss,M. Caulfield, E. Chalmers, A. Chandra, P. Chinnery, M. Chitre,C. Church, E. Clement, N. Clements-Brod, V. Clowes, G. Coghlan,P. Collins, N. Cooper, A. Creaser-Myers, R. DaCosta, L. Daugh-erty, S. Davies, J. Davis, M. De Vries, P. Deegan, S. V. V. Deevi,C. Deshpande, L. Devlin, E. Dewhurst, R. Doffinger, N. Dormand,E. Drewe, D. Edgar, W. Egner, W. N. Erber, M. Erwood, T. Evering-ton, R. Favier, H. Firth, D. Fletcher, F. Flinter, J. C. Fox, A. Frary,K. Freson, B. Furie, A. Furnell, D. Gale, A. Gardham, M. Gat-tens, N. Ghali, P. K. Ghataorhe, R. Ghurye, S. Gibbs, K. Gilmour,P. Gissen, S. Goddard, K. Gomez, P. Gordins, S. Grf, D. Greene,A. Greenhalgh, A. Greinacher, S. Grigoriadou, D. Grozeva, S. Hackett,C. Hadinnapola, R. Hague, M. Haimel, C. Halmagyi, T. Hammerton,D. Hart, G. Hayman, J. W. M. Heemskerk, R. Henderson, A. Hensiek,Y. Henskens, A. Herwadkar, S. Holden, M. Holder, S. Holder, F. Hu,A. Huissoon, M. Humbert, J. Hurst, R. James, S. Jolles, D. Josifova,R. Kazmi, D. Keeling, P. Kelleher, A. M. Kelly, F. Kennedy, D. Kiely,N. Kingston, A. Koziell, D. Krishnakumar, T. W. Kuijpers, D. Ku-mararatne, M. Kurian, M. A. Laffan, M. P. Lambert, H. L. Allen,A. Lawrie, S. Lear, M. Lees, C. Lentaigne, R. Liesner, R. Linger,H. Longhurst, L. Lorenzo, R. Machado, R. Mackenzie, R. MacLaren,E. Maher, J. Maimaris, S. Mangles, A. Manson, R. Mapeta, H. S.Markus, J. Martin, L. Masati, M. Mathias, V. Matser, A. Maw, E. Mc-Dermott, C. McJannet, S. Meacham, S. Meehan, K. Megy, S. Mehta,M. Michaelides, C. M. Millar, S. Moledina, A. Moore, N. Morrell,A. Mumford, S. Murng, E. Murphy, S. Nejentsev, S. Noorani, P. Nur-den, E. Oksenhendler, W. H. Ouwehand, S. Papadia, S.-M. Park,A. Parker, J. Pasi, C. Patch, J. Paterson, J. Payne, A. Peacock,K. Peerlinck, C. J. Penkett, J. Pepke-Zaba, D. J. Perry, V. Pol-lock, G. Polwarth, M. Ponsford, W. Qasim, I. Quinti, S. Rankin,J. Rankin, F. L. Raymond, K. Rehnstrom, E. Reid, C. J. Rhodes,61BibliographyM. Richards, S. Richardson, A. Richter, I. Roberts, M. Rondina,E. Rosser, C. Roughley, K. Rue-Albrecht, C. Samarghitean, A. Sanchis-Juan, R. Sandford, S. Santra, R. Sargur, S. Savic, S. Schulman,H. Schulze, R. Scott, M. Scully, S. Seneviratne, C. Sewell, O. Shamar-dina, D. Shipley, I. Simeoni, S. Sivapalaratnam, K. Smith, A. Sohal,L. Southgate, S. Staines, E. Staples, H. Stauss, P. Stein, J. Stephens,K. Stirrups, S. Stock, J. Suntharalingam, R. C. Tait, K. Talks, Y. Tan,J. Thachil, J. Thaventhiran, E. Thomas, M. Thomas, D. Thompson,A. Thrasher, M. Tischkowitz, C. Titterton, C.-H. Toh, M. Toshner,C. Treacy, R. Trembath, S. Tuna, W. Turek, E. Turro, C. Van Geet,M. Veltman, J. Vogt, J. von Ziegenweldt, A. V. Noordegraaf, E. Wake-ling, I. Wanjiku, T. Q. Warner, E. Wassmer, H. Watkins, A. Web-ster, S. Welch, S. Westbury, J. Wharton, D. Whitehorn, M. Wilkins,L. Willcocks, C. Williamson, G. Woods, J. Wort, N. Yeatman, P. Yong,T. Young, P. Yu, A. Webster, and F. L. Raymond. ComprehensiveRare Variant Analysis via Whole-Genome Sequencing to Determine theMolecular Pathology of Inherited Retinal Disease. The American Jour-nal of Human Genetics, 100(1):75–90, Jan. 2017.[8] K. Chen, J. W. Wallis, M. D. McLellan, D. E. Larson, J. M. Kalicki,C. S. Pohl, S. D. McGrath, M. C. Wendl, Q. Zhang, D. P. Locke, X. Shi,R. S. Fulton, T. J. Ley, R. K. Wilson, L. Ding, and E. R. Mardis.BreakDancer: an algorithm for high-resolution mapping of genomicstructural variation. Nature Methods, 6(9):677–681, Sept. 2009.[9] X. Chen, O. Schulz-Trieglaff, R. Shaw, B. Barnes, F. Schlesinger,M. Kllberg, A. J. Cox, S. Kruglyak, and C. T. Saunders. Manta: rapiddetection of structural variants and indels for germline and cancer se-quencing applications. Bioinformatics, 32(8):1220–1222, Apr. 2016.[10] C.-S. Chin, P. Peluso, F. J. Sedlazeck, M. Nattestad, G. T. Concepcion,A. Clum, C. Dunn, R. O’Malley, R. Figueroa-Balderas, A. Morales-Cruz, G. R. Cramer, M. Delledonne, C. Luo, J. R. Ecker, D. Cantu,D. R. Rank, and M. C. Schatz. Phased diploid genome assembly withsingle-molecule real-time sequencing. Nature Methods, 13(12):1050–1054, Dec. 2016.[11] P. Cowie, E. A. Hay, and A. MacKenzie. The noncoding human genomeand the future of personalised medicine. Expert Reviews in MolecularMedicine, 17, 2015.62Bibliography[12] A. P. J. de Koning, W. Gu, T. A. Castoe, M. A. Batzer, and D. D.Pollock. Repetitive Elements May Comprise Over Two-Thirds of theHuman Genome. PLoS Genetics, 7(12), Dec. 2011.[13] J. R. Dixon, S. Selvaraj, F. Yue, A. Kim, Y. Li, Y. Shen, M. Hu, J. S.Liu, and B. Ren. Topological domains in mammalian genomes identifiedby analysis of chromatin interactions. Nature, 485(7398):376–380, May2012.[14] C. Du, B. N. Pusey, C. J. Adams, C. C. Lau, W. P. Bone, W. A.Gahl, T. C. Markello, and D. R. Adams. Explorations to improve thecompleteness of exome sequencing. BMC Medical Genomics, 9:56, 2016.[15] A. L. Duker, B. C. Ballif, E. V. Bawle, R. E. Person, S. Mahadevan,S. Alliman, R. Thompson, R. Traylor, B. A. Bejjani, L. G. Shaffer, J. A.Rosenfeld, A. N. Lamb, and T. Sahoo. Paternally inherited microdele-tion at 15q11.2 confirms a significant role for the SNORD116 C/D boxsnoRNA cluster in PraderWilli syndrome. European Journal of HumanGenetics, 18(11):1196–1201, Nov. 2010.[16] A. C. English, W. J. Salerno, O. A. Hampton, C. Gonzaga-Jauregui,S. Ambreth, D. I. Ritter, C. R. Beck, C. F. Davis, M. Dahdouli, S. Ma,A. Carroll, N. Veeraraghavan, J. Bruestle, B. Drees, A. Hastie, E. T.Lam, S. White, P. Mishra, M. Wang, Y. Han, F. Zhang, P. Stankiewicz,D. A. Wheeler, J. G. Reid, D. M. Muzny, J. Rogers, A. Sabo, K. C.Worley, J. R. Lupski, E. Boerwinkle, and R. A. Gibbs. Assessing struc-tural variation in a personal genometowards a human reference diploidgenome. BMC Genomics, 16(1), Apr. 2015.[17] M. Esteller. Non-coding RNAs in human disease. Nature Reviews Ge-netics, 12(12):861–874, Dec. 2011.[18] R. Fox, J. Ealing, H. Murphy, D. P. Gow, and D. Gosal. A novelDNMT1 mutation associated with early onset hereditary sensory andautonomic neuropathy, cataplexy, cerebellar atrophy, scleroderma, en-docrinopathy, and common variable immune deficiency. Journal of theperipheral nervous system: JPNS, 21(3):150–153, Sept. 2016.[19] T. Gao, B. He, S. Liu, H. Zhu, K. Tan, and J. Qian. EnhancerAtlas: aresource for enhancer annotation and analysis in 105 human cell/tissuetypes. Bioinformatics, page btw495, Aug. 2016.63Bibliography[20] C. Gilissen, J. Y. Hehir-Kwa, D. T. Thung, M. van de Vorst, B. W. M.van Bon, M. H. Willemsen, M. Kwint, I. M. Janssen, A. Hoischen,A. Schenck, R. Leach, R. Klein, R. Tearle, T. Bo, R. Pfundt, H. G.Yntema, B. B. A. de Vries, T. Kleefstra, H. G. Brunner, L. E. L. M.Vissers, and J. A. Veltman. Genome sequencing identifies major causesof severe intellectual disability. Nature, 511(7509):344–347, July 2014.[21] R. L. Goldfeder, J. R. Priest, J. M. Zook, M. E. Grove, D. Waggott,M. T. Wheeler, M. Salit, and E. A. Ashley. Medical implications oftechnical accuracy in genome sequencing. Genome Medicine, 8:24, 2016.[22] J. Hehir-Kwa, T. Marschall, W. P. Kloosterman, L. C. Francioli, J. A.Baaijens, L. Dijkstra, A. Abdellaoui, V. Koval, D. T. Thung, R. Warde-naar, B. Coe, P. Deelen, J. d. Ligt, E.-W. Lameijer, F. v. Dijk, F. Hor-mozdiari, E. E. Eichler, P. d. Bakker, M. Swertz, C. Wijmenga, G.-J. v.Ommen, E. Slagboom, D. Boomsma, G. o. t. Netherlands, A. Schoen-huth, K. Ye, and V. Guryev. A high-quality reference panel reveals thecomplexity and distribution of structural genome changes in a humanpopulation. bioRxiv, page 036897, Jan. 2016.[23] B. Heidenreich, P. S. Rachakonda, K. Hemminki, and R. Kumar. TERTpromoter mutations in cancer development. Current Opinion in Genet-ics & Development, 24:30–37, Feb. 2014.[24] W. Huang, L. Li, J. R. Myers, and G. T. Marth. ART: a next-generationsequencing read simulator. Bioinformatics, 28(4):593–594, Feb. 2012.[25] J. Huddleston, M. J. Chaisson, K. M. Steinberg, W. Warren,K. Hoekzema, D. S. Gordon, T. A. Graves-Lindsay, K. M. Munson,Z. N. Kronenberg, L. Vives, P. Peluso, M. Boitano, C.-S. Chin, J. Ko-rlach, R. K. Wilson, and E. E. Eichler. Discovery and genotypingof structural variation from long-read haploid genome sequence data.Genome Research, page gr.214007.116, Nov. 2016.[26] P. A. Jacobs, C. M. Wilson, J. A. Sprenkle, N. B. Rosenshein, and B. R.Migeon. Mechanism of origin of complete hydatidiform moles. Nature,286(5774):714–716, Aug. 1980.[27] E. Khurana, Y. Fu, V. Colonna, X. J. Mu, H. M. Kang, T. Lap-palainen, A. Sboner, L. Lochovsky, J. Chen, A. Harmanci, J. Das,A. Abyzov, S. Balasubramanian, K. Beal, D. Chakravarty, D. Chal-lis, Y. Chen, D. Clarke, L. Clarke, F. Cunningham, U. S. Evani,64BibliographyP. Flicek, R. Fragoza, E. Garrison, R. Gibbs, Z. H. Gm, J. Herrero,N. Kitabayashi, Y. Kong, K. Lage, V. Liluashvili, S. M. Lipkin, D. G.MacArthur, G. Marth, D. Muzny, T. H. Pers, G. R. S. Ritchie, J. A.Rosenfeld, C. Sisu, X. Wei, M. Wilson, Y. Xue, F. Yu, . G. P. Con-sortium, E. T. Dermitzakis, H. Yu, M. A. Rubin, C. Tyler-Smith, andM. Gerstein. Integrative Annotation of Variants from 1092 Humans:Application to Cancer Genomics. Science, 342(6154):1235587, Oct.2013.[28] M. Kircher, D. M. Witten, P. Jain, B. J. O’Roak, G. M. Cooper,and J. Shendure. A general framework for estimating the relativepathogenicity of human genetic variants. Nature Genetics, 46(3):310–315, Mar. 2014.[29] E. S. Lander. Initial impact of the sequencing of the human genome.Nature, 470(7333):187–197, Feb. 2011.[30] R. M. Layer, C. Chiang, A. R. Quinlan, and I. M. Hall. LUMPY: aprobabilistic framework for structural variant discovery. Genome Biol-ogy, 15:R84, 2014.[31] L. A. Lettice, S. J. H. Heaney, L. A. Purdie, L. Li, P. d. Beer, B. A.Oostra, D. Goode, G. Elgar, R. E. Hill, and E. d. Graaff. A long-rangeShh enhancer regulates expression in the developing limb and fin andis associated with preaxial polydactyly. Human Molecular Genetics,12(14):1725–1735, July 2003.[32] D. Lupiez, K. Kraft, V. Heinrich, P. Krawitz, F. Brancati, E. Klopocki,D. Horn, H. Kayserili, J. Opitz, R. Laxova, F. Santos-Simarro,B. Gilbert-Dussardier, L. Wittler, M. Borschiwer, S. Haas, M. Oster-walder, M. Franke, B. Timmermann, J. Hecht, M. Spielmann, A. Visel,and S. Mundlos. Disruptions of Topological Chromatin Domains CausePathogenic Rewiring of Gene-Enhancer Interactions. Cell, 161(5):1012–1025, May 2015.[33] Y. Ma, Y. Xin, R. Li, Z. Wang, Q. Yue, F. Xiao, and X. Hao. TFDP3was expressed in coordination with E2f1 to inhibit E2f1-mediated apop-tosis in prostate cancer. Gene, 537(2):253–259, Mar. 2014.[34] J. R. MacDonald, R. Ziman, R. K. C. Yuen, L. Feuk, and S. W. Scherer.The Database of Genomic Variants: a curated collection of structuralvariation in the human genome. Nucleic Acids Research, 42(Databaseissue):D986–D992, Jan. 2014.65Bibliography[35] A. MacKenzie, B. Hing, and S. Davidson. Exploring the effects ofpolymorphisms on cis-regulatory signal transduction response. Trendsin Molecular Medicine, 19(2):99–107, Feb. 2013.[36] G. A. Maston, S. K. Evans, and M. R. Green. Transcriptional regula-tory elements in the human genome. Annual Review of Genomics andHuman Genetics, 7:29–59, 2006.[37] N. Matharu and N. Ahituv. Minor Loops in Major Folds: EnhancerPro-moter Looping, Chromatin Restructuring, and Their Association withTranscriptional Regulation and Disease. PLoS Genet, 11(12):e1005640,Dec. 2015.[38] A. Mathelier, W. Shi, and W. W. Wasserman. Identification of alteredcis-regulatory elements in human disease. Trends in Genetics, 31(2):67–76, Feb. 2015.[39] M. McGrath. Charcot-Marie-Tooth 1a: A narrative review with clinicaland anatomical perspectives. Clinical Anatomy, 29(5):547–554, July2016.[40] . Menca, S. Modamio-Hybjr, N. Redshaw, M. Morn, F. Mayo-Merino,L. Olavarrieta, L. A. Aguirre, I. del Castillo, K. P. Steel, T. Dalmay,F. Moreno, and M. . Moreno-Pelayo. Mutations in the seed region ofhuman miR-96 are responsible for nonsyndromic progressive hearingloss. Nature Genetics, 41(5):609–613, May 2009.[41] R. E. Mills, K. Walter, C. Stewart, R. E. Handsaker, K. Chen, C. Alkan,A. Abyzov, S. C. Yoon, K. Ye, R. K. Cheetham, A. Chinwalla, D. F.Conrad, Y. Fu, F. Grubert, I. Hajirasouliha, F. Hormozdiari, L. M.Iakoucheva, Z. Iqbal, S. Kang, J. M. Kidd, M. K. Konkel, J. Korn,E. Khurana, D. Kura, H. Y. K. Lam, J. Leng, R. Li, Y. Li, C.-Y. Lin,R. Luo, X. J. Mu, J. Nemesh, H. E. Peckham, T. Rausch, A. Scally,X. Shi, M. P. Stromberg, A. M. Sttz, A. E. Urban, J. A. Walker, J. Wu,Y. Zhang, Z. D. Zhang, M. A. Batzer, L. Ding, G. T. Marth, G. McVean,J. Sebat, M. Snyder, J. Wang, K. Ye, E. E. Eichler, M. B. Gerstein,M. E. Hurles, C. Lee, S. A. McCarroll, and J. O. Korbel. Mappingcopy number variation by population scale genome sequencing. Nature,470(7332):59–65, Feb. 2011.[42] M. Mohiyuddin, J. C. Mu, J. Li, N. B. Asadi, M. B. Gerstein, A. Aby-zov, W. H. Wong, and H. Y. K. Lam. MetaSV: an accurate and inte-66Bibliographygrative structural-variant caller for next generation sequencing. Bioin-formatics, 31(16):2741–2744, Aug. 2015.[43] J. C. Mu, M. Mohiyuddin, J. Li, N. B. Asadi, M. B. Gerstein, A. Aby-zov, W. H. Wong, and H. Y. K. Lam. VarSim: a high-fidelity simulationand validation framework for high-throughput genome sequencing withcancer applications. Bioinformatics, 31(9):1469–1471, May 2015.[44] A. C. Need, V. Shashi, Y. Hitomi, K. Schoch, K. V. Shianna, M. T.McDonald, M. H. Meisler, and D. B. Goldstein. Clinical application ofexome sequencing in undiagnosed genetic conditions. Journal of MedicalGenetics, pages jmedgenet–2012–100819, May 2012.[45] A. W. Pang, J. R. MacDonald, D. Pinto, J. Wei, M. A. Rafiq, D. F.Conrad, H. Park, M. E. Hurles, C. Lee, J. C. Venter, E. F. Kirkness,S. Levy, L. Feuk, and S. W. Scherer. Towards a comprehensive struc-tural variation map of an individual human genome. Genome Biology,11:R52, 2010.[46] H. Parikh, M. Mohiyuddin, H. Y. K. Lam, H. Iyer, D. Chen, M. Pratt,G. Bartha, N. Spies, W. Losert, J. M. Zook, and M. Salit. svclas-sify: a method to establish benchmark structural variant calls. BMCGenomics, 17:64, 2016.[47] C. Redin, H. Brand, R. L. Collins, T. Kammin, E. Mitchell, J. C.Hodge, C. Hanscom, V. Pillalamarri, C. M. Seabra, M.-A. Abbott,O. A. Abdul-Rahman, E. Aberg, R. Adley, S. L. Alcaraz-Estrada, F. S.Alkuraya, Y. An, M.-A. Anderson, C. Antolik, K. Anyane-Yeboa, J. F.Atkin, T. Bartell, J. A. Bernstein, E. Beyer, I. Blumenthal, E. M. H. F.Bongers, E. H. Brilstra, C. W. Brown, H. T. Brggenwirth, B. Calle-waert, C. Chiang, K. Corning, H. Cox, E. Cuppen, B. B. Currall,T. Cushing, D. David, M. A. Deardorff, A. Dheedene, M. D’Hooghe,B. B. A. de Vries, D. L. Earl, H. L. Ferguson, H. Fisher, D. R. Fitz-Patrick, P. Gerrol, D. Giachino, J. T. Glessner, T. Gliem, M. Grady,B. H. Graham, C. Griffis, K. W. Gripp, A. L. Gropman, A. Hanson-Kahn, D. J. Harris, M. A. Hayden, R. Hill, R. Hochstenbach, J. D.Hoffman, R. J. Hopkin, M. W. Hubshman, A. M. Innes, M. Irons,M. Irving, J. C. Jacobsen, S. Janssens, T. Jewett, J. P. Johnson, M. C.Jongmans, S. G. Kahler, D. A. Koolen, J. Korzelius, P. M. Kroisel,Y. Lacassie, W. Lawless, E. Lemyre, K. Leppig, A. V. Levin, H. Li,H. Li, E. C. Liao, C. Lim, E. J. Lose, D. Lucente, M. J. Macera, P. Man-avalan, G. Mandrile, C. L. Marcelis, L. Margolin, T. Mason, D. Masser-67BibliographyFrye, M. W. McClellan, C. J. Z. Mendoza, B. Menten, S. Middelkamp,L. R. Mikami, E. Moe, S. Mohammed, T. Mononen, M. E. Morten-son, G. Moya, A. W. Nieuwint, Z. Ordulu, S. Parkash, S. P. Pauker,S. Pereira, D. Perrin, K. Phelan, R. E. P. Aguilar, P. J. Poddighe,G. Pregno, S. Raskin, L. Reis, W. Rhead, D. Rita, I. Renkens, F. Roe-lens, J. Ruliera, P. Rump, S. L. P. Schilit, R. Shaheen, R. Sparkes,E. Spiegel, B. Stevens, M. R. Stone, J. Tagoe, J. V. Thakuria, B. W.van Bon, J. van de Kamp, I. van Der Burgt, T. van Essen, C. M. vanRavenswaaij-Arts, M. J. van Roosmalen, S. Vergult, C. M. L. Volker-Touw, D. P. Warburton, M. J. Waterman, S. Wiley, A. Wilson, M. d. l.C. A. Yerena-de Vega, R. T. Zori, B. Levy, H. G. Brunner, N. de Leeuw,W. P. Kloosterman, E. C. Thorland, C. C. Morton, J. F. Gusella, andM. E. Talkowski. The genomic landscape of balanced cytogenetic abnor-malities associated with human congenital anomalies. Nature Genetics,49(1):36–45, Jan. 2017.[48] T. R. Sana, R. Debets, J. C. Timans, J. F. Bazan, and R. A. Kastelein.Computational identification, cloning, and characterization of IL-1r9, anovel interleukin-1 receptor-like gene encoded over an unusually largeinterval of human chromosome Xq22.2-q22.3. Genomics, 69(2):252–262,Oct. 2000.[49] G. S. Sellick, K. T. Barker, I. Stolte-Dijkstra, C. Fleischmann, R. J.Coleman, C. Garrett, A. L. Gloyn, E. L. Edghill, A. T. Hattersley, P. K.Wellauer, G. Goodwin, and R. S. Houlston. Mutations in PTF1a causepancreatic and cerebellar agenesis. Nature Genetics, 36(12):1301–1305,Dec. 2004.[50] H. A. Shihab, M. F. Rogers, J. Gough, M. Mort, D. N. Cooper, I. N. M.Day, T. R. Gaunt, and C. Campbell. An integrative approach to pre-dicting the functional effects of non-coding and coding sequence varia-tion. Bioinformatics, page btv009, Jan. 2015.[51] E. Sinkiewicz-Darol, D. Kabziska, I. Moszyska, and A. Kochaski. The 5’regulatory sequence of the PMP22 in the patients with Charcot-Marie-Tooth disease. Acta Biochimica Polonica, 57(3):373–377, 2010.[52] D. Smedley, M. Schubach, J. O. B. Jacobsen, S. Khler, T. Zemojtel,M. Spielmann, M. Jger, H. Hochheiser, N. L. Washington, J. A. Mc-Murry, M. A. Haendel, C. J. Mungall, S. E. Lewis, T. Groza, G. Valen-tini, and P. N. Robinson. A Whole-Genome Analysis Framework for68BibliographyEffective Identification of Pathogenic Regulatory Variants in MendelianDisease. The American Journal of Human Genetics, 0(0), Aug. 2016.[53] D. J. Stavropoulos, D. Merico, R. Jobling, S. Bowdin, N. Monfared,B. Thiruvahindrapuram, T. Nalpathamkalam, G. Pellecchia, R. K. C.Yuen, M. J. Szego, R. Z. Hayeems, R. Z. Shaul, M. Brudno, M. Girdea,B. Frey, B. Alipanahi, S. Ahmed, R. Babul-Hirji, R. B. Porras, M. T.Carter, L. Chad, A. Chaudhry, D. Chitayat, S. J. Doust, C. Cytryn-baum, L. Dupuis, R. Ejaz, L. Fishman, A. Guerin, B. Hashemi,M. Helal, S. Hewson, M. Inbar-Feigenberg, P. Kannu, N. Karp, R. H.Kim, J. Kronick, E. Liston, H. MacDonald, S. Mercimek-Mahmutoglu,R. Mendoza-Londono, E. Nasr, G. Nimmo, N. Parkinson, N. Quercia,J. Raiman, M. Roifman, A. Schulze, A. Shugar, C. Shuman, P. Sinajon,K. Siriwardena, R. Weksberg, G. Yoon, C. Carew, R. Erickson, R. A.Leach, R. Klein, P. N. Ray, M. S. Meyn, S. W. Scherer, R. D. Cohn, andC. R. Marshall. Whole-genome sequencing expands diagnostic utilityand improves clinical management in paediatric medicine. npj GenomicMedicine, 1:15012, Jan. 2016.[54] Z. Sun, Y. Wu, T. Ordog, S. Baheti, J. Nie, X. Duan, K. Hojo, J.-P.Kocher, P. J. Dyck, and C. J. Klein. Aberrant signature methylomeby DNMT1 hot spot mutation in hereditary sensory and autonomicneuropathy 1e. Epigenetics, 9(8):1184–1193, Aug. 2014.[55] M. Talkowski, G. Maussion, L. Crapper, J. Rosenfeld, I. Blumenthal,C. Hanscom, C. Chiang, A. Lindgren, S. Pereira, D. Ruderfer, A. Diallo,J. Lopez, G. Turecki, E. Chen, C. Gigek, D. Harris, V. Lip, Y. An,M. Biagioli, M. MacDonald, M. Lin, S. Haggarty, P. Sklar, S. Purcell,M. Kellis, S. Schwartz, L. Shaffer, M. Natowicz, Y. Shen, C. Morton,J. Gusella, and C. Ernst. Disruption of a Large Intergenic NoncodingRNA in Subjects with Neurodevelopmental Disabilities. The AmericanJournal of Human Genetics, 91(6):1128–1134, Dec. 2012.[56] L. Tattini, R. DAurizio, and A. Magi. Detection of Genomic Struc-tural Variants from Next-Generation Sequencing Data. Frontiers inBioengineering and Biotechnology, 3, June 2015.[57] E. L. van Dijk, H. Auger, Y. Jaszczyszyn, and C. Thermes. Ten years ofnext-generation sequencing technology. Trends in Genetics, 30(9):418–426, Sept. 2014.69Bibliography[58] H. Yang, P. N. Robinson, and K. Wang. Phenolyzer: phenotype-basedprioritization of candidate genes for human diseases. Nature Methods,12(9):841–843, Sept. 2015.[59] Y. Yang, D. M. Muzny, J. G. Reid, M. N. Bainbridge, A. Willis, P. A.Ward, A. Braxton, J. Beuten, F. Xia, Z. Niu, M. Hardison, R. Person,M. R. Bekheirnia, M. S. Leduc, A. Kirby, P. Pham, J. Scull, M. Wang,Y. Ding, S. E. Plon, J. R. Lupski, A. L. Beaudet, R. A. Gibbs, and C. M.Eng. Clinical Whole-Exome Sequencing for the Diagnosis of MendelianDisorders. New England Journal of Medicine, 369(16):1502–1511, Oct.2013.[60] K. Ye, M. H. Schulz, Q. Long, R. Apweiler, and Z. Ning. Pindel: apattern growth approach to detect break points of large deletions andmedium sized insertions from paired-end short reads. Bioinformatics,25(21):2865–2871, Nov. 2009.[61] X. Zhang. Exome sequencing greatly expedites the progressive researchof Mendelian diseases. Frontiers of Medicine, 8(1):42–57, Mar. 2014.70


Citation Scheme:


Citations by CSL (citeproc-js)

Usage Statistics



Customize your widget with the following options, then copy and paste the code below into the HTML of your page to embed this item in your website.
                            <div id="ubcOpenCollectionsWidgetDisplay">
                            <script id="ubcOpenCollectionsWidget"
                            async >
IIIF logo Our image viewer uses the IIIF 2.0 standard. To load this item in other compatible viewers, use this url:


Related Items