UBC Theses and Dissertations


Mapping complex genomic translocations using Strand-seq Yuen, Michael Wai-Keong 2017


©Michael Wai-Keong Yuen

MAPPING COMPLEX GENOMIC TRANSLOCATIONS USING STRAND-SEQ

by

Michael Wai-Keong Yuen

B.Sc., University of Toronto, 2014 (Honors)

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE in The Faculty of Graduate and Postdoctoral Studies (Medical Genetics)

THE UNIVERSITY OF BRITISH COLUMBIA (Vancouver)

December 2017

Abstract

Template strand sequencing (Strand-seq) is a single cell sequencing approach which maintains the 5’ -> 3’ directionality of sequence reads. I hypothesized that the directional information preserved can be used to map complex translocation events. Translocations often disrupt gene expression by reshuffling regulatory elements or by forming novel fusion transcripts. Yet detection is often difficult, confounded by the complexity of the Structural Variations (SVs).

I chose a cell line derived from a patient with pediatric Acute Lymphoblastic Leukemia (iALL) with a known complex karyotype. My aim was to explore the ability of Strand-seq to identify breakpoints, link translocation partners and resolve the configuration of SVs, comparing low coverage Strand-seq data against high coverage Whole Genome Sequence (WGS) data as reference.

The iALL cells selected for my study harbor complex translocations involving 4 chromosomes with 5 breakpoint positions that were previously validated by Fluorescence In Situ Hybridization (FISH). BreakpointR, a novel pipeline for Strand-seq analysis, identified 18 breakpoints, 5 of which were isolated for further analysis. These 5 breakpoints were identified with a resolution of 5-60kb and overlapped with the genomic positions of breakpoints identified by WGS analysis, validating the accuracy of BreakpointR. Despite the lower sequencing coverage of Strand-seq, 18 breakpoints were detected against WGS’s 119 breakpoints.
Next, I developed a workflow to link translocation partners involved in the breakpoints, successfully linking 4 of 5 translocation partners; the final fragment remained unresolved due to a lack of reads within the genomic interval, a limitation of low sequence coverage. By comparison, WGS also linked 4 of 5 translocation partners, but its limitations in mapping across repetitive regions left a different fragment unresolved.

After translocation partner matching, the single cell resolution of Strand-seq allowed us to further interrogate the expected breakpoints in each cell. Strand-seq analysis identified an inversion of the 100kb fragment in chromosome 11, validated with Sanger sequencing, representing an additional layer of complexity not identified by the other approaches. I conclude that the application of Strand-seq should be further explored in the area of SV mapping, as it has proven useful in addressing the inherent difficulties of mapping complex SVs across repetitive regions.

Lay Summary

My project uses a novel Strand-based sequencing (Strand-seq) approach to explore new ways of mapping DNA rearrangements (translocations) in single cells. Using dedicated, new software, we were able to identify all 5 previously reported breakpoints from a cell line with a complex genome configuration and we could correctly link 4 of the 5 translocation partners. Whole Genome Sequencing performed similarly but required population filters and multiple cross validations. Strand-seq was also able to identify an additional layer of rearrangement complexity not previously characterized by the other methods. I propose that the application of Strand-seq should be further explored in the study of genome rearrangement, as our work has demonstrated its usefulness for mapping complex and inherently difficult SVs.
Preface

Chapter 1 presents a comprehensive review of relevant literature regarding the mapping of complex rearrangements and provides background for my research objectives. The overarching objective of my project was to compare the utility of Template Strand Sequencing (Strand-seq) with Whole Genome Sequencing for mapping complex translocations. In this section I have also outlined the principles of Strand-seq, a single cell sequencing approach developed in part by Dr. Ester Falconer, a former post-doctoral fellow in the laboratory, who also conceived of this project. I wrote all aspects of this chapter.

Chapter 2 details all the materials and methods which were employed throughout the course of this project. The iALL826A cell line was provided by our collaborator, Dr. Mark Cruickshank. Single cell isolation was done at the Flow Cytometry Core Facility at the Terry Fox Laboratory. I cultured and treated the iALL826A cell line and prepared the cells for sorting, which was overseen by the technical staff David Ko and Wen Bo Xu. I also carried out all Strand-seq library preparation, using the protocol detailed in Nat. Methods (Sanders et al. 2017). Libraries were sequenced using the MiSeq platform. Dr. Mark Hills, a former post-doctoral fellow in the laboratory, helped with the library alignment and used the Bioinformatic Analysis of Inherited Templates (BAIT) package for preliminary analysis. Dr. Victor Guryev, a Professor at the University of Groningen and Principal Investigator at the European Research Institute for the Biology of Aging (ERIBA), ran the WGS analysis using the DELLY and 123SV pipelines, as well as population filters.

Chapter 3 describes the results and processes of all the wet and dry experiments that I carried out to develop and validate the workflow for mapping translocation partners. With the guidance of Dr. Mark Hills, Dr. Ashley Sanders and Dr.
David Porubsky, I successfully developed a workflow to map translocation partners which uses pattern matching of strand directionality to resolve translocation events. I carried out all data analysis and figure preparation for this section.

Chapter 4 provides a summary of findings along with a general discussion of the work. I wrote all aspects of this chapter.

Table of Contents

Abstract
Lay Summary
Preface
Table of Contents
List of Tables
List of Figures
List of Abbreviations
Acknowledgements
Chapter 1 | Introduction
1.1 Structural variants
1.2 Translocations
1.3 History of translocation mapping
1.4 Strand-seq
1.5 Thesis objectives
Chapter 2 | Methods & Materials
2.1 iALL826A cell line
2.2 Single cell isolation
2.3 Library preparation and Sequencing
2.4 BreakpointR pipeline analysis
2.5 Whole Genome Sequencing analysis: DELLY & 123SV
Chapter 3 | Results
3.1 BreakpointR analysis of iALL826A Strand-seq data
3.2 Principles of translocation mapping
3.3 Unexpected complexities identified by Strand-seq
3.4 Comparing the accuracy of translocation mapping with WGS analysis
3.5 Revised translocation model
Chapter 4 | Discussion
4.1 Summary overview & findings
4.2 BreakpointR
4.3 Challenges of translocation mapping (SS & WGS)
4.4 Impact
Tables
Figures
References
Appendix 1: Gel image validating inversion
Appendix 2: Gel image showing no extraneous translocations

List of Tables

Table 1: BreakpointR breakpoints
Table 2: Breakpoints added by manual curation
Table 3: Strand-seq translocation partner links
Table 4: WGS DELLY & 123SV translocation partner links, cross-validated
Table 5: Sequencing primers used to verify inversion
Table 6: Sequencing primers used to test additional hits found in WGS
Table 7: DELLY translocation partner links (Population filtered + PASS)
Table 8: 123SV translocation partner links (Population filtered + PASS)
Table 9: DELLY translocation partner links (PASS)

List of Figures

Figure 1: Types of structural variants
Figure 2: Different NGS approaches to mapping translocations
Figure 3: Principles of Template Strand Sequencing
Figure 4: BAIT ideogram representation
Figure 5: The difference between an SCE event and a stable rearrangement visualized by BAIT’s ideogram profiles
Figure 6: Template Strand Sequencing and translocations
Figure 7: Complex karyotype of the iALL826A cell line
Figure 8: UV profile of nuclei with or without BrdU incorporated
Figure 9: Principles of BreakpointR analysis
Figure 10: Breakpoint mapping by BreakpointR
Figure 11: Principles of mapping translocation partners (Part 1)
Figure 12: Principles of mapping translocation partners (Part 2)
Figure 13: Results of translocation partner matching using Strand-seq data
Figure 14: Single cell resolution of Strand-seq identified additional layers of complexity
Figure 15: PCR validation of predicted inversion
Figure 16: Circos plots showing translocation links from each NGS approach
Figure 17: Revised model for the mechanism of translocation formation

List of Abbreviations

aCGH – Array Comparative Genomic Hybridization
AML – Acute Myeloid Leukemia
bp – base pair
BrdU – 5-Bromo-2’-deoxyuridine
CML – Chronic Myeloid Leukemia
CNV – Copy Number Variation
DNA – deoxyribonucleic acid
DSB – Double Strand Break
FISH – Fluorescence In Situ Hybridization
FoSTeS – Fork Stalling and Template Switching
kb – Kilobase
Mb – Megabase
MMBIR – Microhomology-Mediated Break-Induced Replication
NAHR – Non-Allelic Homologous Recombination
NGS – Next-Generation Sequencing
NHEJ – Non-Homologous End-Joining
PCR – Polymerase Chain Reaction
SCE – Sister Chromatid Exchange
SNP – Single Nucleotide Polymorphism
SV – Structural Variation
TKI – Tyrosine Kinase Inhibitors

Acknowledgements

I would like to extend my sincerest gratitude to the many people who have helped me along my journey. To my Supervisor, Dr. Peter Lansdorp, whose daring approach to research helped inspire me to take on bigger challenges. To my mentors, Dr. Ester Falconer, Dr. Mark Hills and Dr. Ashley Sanders, for providing me with excellent guidance and support throughout this project. To my lab member, Dr. Geraldine Aubert, who advised me on various matters of scientific discussion and mundane personal problems, and kept me focused on my project. To Dr. David Porubsky, a former graduate student in the Groningen Lansdorp lab, who, together with Ashley Sanders, built the BreakpointR software for analysis of chromosome breakpoints using Strand-seq data, and who taught me how to play table soccer when I visited Groningen. To Dr. Victor Guryev, a professor at ERIBA whose door was always open and who was always excited to discuss science and politics. To my Co-Supervisor, Dr. Peter Stirling, who gave me direction and emotional support when I most needed it. To my committee members, Dr. Aly Karsan and Dr.
Sohrab Shah, for their time and patience, and for invaluable advice which helped me understand and emphasize the strengths of my project while addressing its weaknesses. To the Stirling lab and the rest of the Terry Fox Laboratory community, for taking the time to listen to my thoughts and ideas, which helped me grow as a scientist. Finally, to my friends and family. Thank you all.

Chapter 1 | Introduction

1.1 Structural variants

Structural variants (SVs) in the genome can be operationally defined as genomic variations that involve DNA segments of >50 base pairs (bp) [1]. The size of structural variants can range from 50bp to entire chromosomes and can encapsulate genes and their regulatory sequences [2]. The most notable example of an SV is chromosome aneuploidy, in which the duplication or loss of whole chromosomes alters the DNA content of a cell [3]. SVs can be broadly classified as balanced or unbalanced (Figure 1), depending on how the SV of interest affects DNA content. Balanced rearrangements do not result in a net gain or loss of DNA, while unbalanced rearrangements, like aneuploidy, change the expected amount of DNA. These two categories of rearrangements can be further sub-divided into copy-number variants such as duplications and deletions (unbalanced), or inversions and translocations (typically balanced) [4]. Altogether, SVs account for roughly 1-2% of all human genomic variation, compared to ~0.1% accounted for by SNPs [2, 5, 6].

A break along the DNA phosphodiester backbone is a necessary first step in the formation of SVs, and the breakpoint site is a characteristic feature [7]. The sequences of DNA bases surrounding the breakpoint site can give a clue as to the mechanism of SV formation, including errors in DNA replication and errors in DNA repair by recombination.
Small regions of homology surrounding the breakpoint suggest a replication-based mechanism such as Fork Stalling and Template Switching (FoSTeS) or Microhomology-Mediated Break-Induced Replication (MMBIR). Long regions of homology across the breakpoint point towards Non-Allelic Homologous Recombination (NAHR) repair, while no homology whatsoever suggests Non-Homologous End-Joining (NHEJ) repair mechanisms [8, 9]. Due to these properties, SVs typically occur around regions of repetitive DNA sequence, a feature that compounds the challenges of accurate detection.

The pathological consequences of SVs remain active areas of research [2, 9, 10]. CNVs can lead to disease if the genomic imbalance results in loss or gain of functional gene copies. Prader-Willi Syndrome and Angelman Syndrome are two examples of neurological disorders caused by deletions in the q-arm of chromosome 15 [11], while Down Syndrome is caused by a trisomy of chromosome 21 [12]. Balanced SVs such as inversions or translocations can cause disease if the breakpoints disrupt genes or gene regulators, or if rearrangements reshuffle regulatory elements of critical genomic regions. Groschel et al. showed that an inversion on chromosome 3 can cause leukemia. This inversion results in concomitant deregulation of EVI1 and GATA2 in hematopoietic stem cell progenitor cells by reallocation of the enhancer region that causes normal cells to differentiate [13]. More importantly, these types of rearrangement can fuse two or more gene segments together and produce a fusion gene with novel functions. The Philadelphia chromosome is the result of a translocation between chromosome 9 and chromosome 22 which produces the BCR-ABL fusion gene, encoding a constitutively active tyrosine kinase [14] which is known to drive abnormal cell proliferation in Chronic Myeloid Leukemia (CML).
This demonstrates the significance of structural rearrangements for regulatory elements [15] and how they can alter gene function.

Despite their pathological significance, the technical limitations of SV detection have impeded research into SVs, and the functional consequences of rearranging large swaths of DNA and regulatory elements remain poorly understood [16].

1.2 Translocations

For the remainder of this thesis I will focus on translocations, a typically balanced form of SV which can have significant pathological impact [17]. Translocations arise when two simultaneous double-stranded DNA breaks (DSBs) occur on separate chromosomes but are not properly repaired. If resolution of the DSBs results in an exchange of genomic material without any loss of DNA content, it is classified as a reciprocal translocation event [18]. There are endogenous and exogenous sources of DSBs. Endogenous examples include V(D)J recombination [19], collapse of a stalled replication fork [20], or topoisomerase activity [21]. Exogenous examples include all forms of DNA damaging agents such as ionizing radiation [22], cytotoxic drugs [23], or smoking [24]. These DSBs can be resolved by NAHR, NHEJ, or MMBIR depending on the sequence homology surrounding each breakpoint site [18, 25-27].

The Philadelphia chromosome is formed as a result of a translocation event between the q-arms of chromosomes 9 and 22, t(9;22)(q34;q11), generating the BCR-ABL fusion gene [28]. Its discovery in 1959 and its sufficiency for initiating leukemogenesis highlight the pathological significance of translocation events. Translocations can result in gene disruption if the breakpoint falls within exonic gene regions but can also cause dysregulation of genes by rearranging regulatory elements. Most importantly, translocations are also capable of joining two separate gene segments, creating fusion genes with novel function [29].
BCR-ABL is a prime example, which combines the tyrosine kinase activity of the ABL domain with the regulation and targeting domains from the BCR gene. The new fusion gene can turn “on” multiple signalling pathways across the cell, driving the progression of Chronic Myeloid Leukemia (CML). For these reasons, the Philadelphia chromosome is used as a diagnostic biomarker for leukemia subtypes. The unique nature of a translocation also makes it attractive as a therapeutic target [30, 31]. Imatinib, more commonly known as Gleevec, is a chemotherapeutic agent designed to block the kinase activity of the BCR-ABL fusion protein, and has been shown to be highly effective as a first line response in the treatment of CML. There are many other examples of fusion oncogenes and translocations in cancer, including in lymphomas, sarcomas, and cancers with a high degree of genomic instability.

1.3 History of translocation mapping

The techniques used to identify SVs (such as translocations) are broadly split into microscopic and sub-microscopic methods. Microscopic methods involve either chromosome banding or Fluorescence In Situ Hybridization (FISH) [32], while sub-microscopic methods encompass Array-based Comparative Genomic Hybridization (aCGH) [33] and Next-Generation Sequencing (NGS) approaches [34].

The correct number of 46 chromosomes in humans was not accepted until 1956, during the early days of cytogenetics [35]. It was not until the late 1960s that the chromosome banding method was established, and it remains a staple in the clinic to this day. Chromosome banding, or Giemsa banding, is a technique that uses chemical stains to visualize the unique band patterns on chromosomes [36], allowing translocations to be visualized as changes in chromosome band patterns. This technique is still used in the clinic because it is capable of physically visualizing the genome within single cells. Its results are reliable and highly reproducible.
The single cell resolution also allows cytogeneticists to make inferences about the relation between structural rearrangements and cellular phenotype. However, the major drawback of chromosome banding is the limited genomic resolution that can be achieved. This resolution is limited by the experimental protocol, the optical capabilities of the microscope and the quality of the experimental preparation. Typically, only SVs larger than 3Mb are detected by chromosome banding. The method is also not amenable to multiplexing, making it an unattractive option for further development.

The field of molecular cytogenetics underwent several cycles of development before FISH was established in 1986 [37]. FISH uses fluorescently labelled oligonucleotide probes to target specific DNA sequences, visualizing the genome by laser excitation of the bound fluorophore. Translocations are visualized as the movement of colored dots between translocation partners. Due to its flexibility, ease of use and continued reliability, development in FISH technology has yielded a wide variety of applications [38, 39], such as Spectral Karyotyping and Quantitative FISH, alongside many others which will not be detailed in this thesis. FISH has become a powerful tool, establishing itself as the gold standard in clinical diagnostics because of its ability to accurately and reliably visualize chromosomes to a resolution of ~250kb while maintaining single cell resolution. However, it remains a tedious protocol which is difficult to multiplex and is typically used for targeted interrogation, a concession made in exchange for higher resolution. Biased local interrogation of the genome risks missing functionally important regions which might be important for guiding treatment options in patients.
Moving from microscopic techniques to sub-microscopic techniques, two-color FISH eventually gave way to metaphase-based Comparative Genomic Hybridization (CGH) and array-based CGH (aCGH) chips [33]. aCGH applies mainly to the study of CNVs, comparing the co-hybridization efficiency of a test and a reference genome (differentially labelled) against the oligonucleotide array in order to identify areas in the test genome which differ from the reference [40]. When the DNA content remains unchanged, the hybridization efficiency of the reference and sample genomes will be equal, hiding any possible rearrangement. aCGH is therefore not suitable for studies of balanced rearrangements (e.g. translocations) and will not be further considered in this thesis.

Next-Generation Sequencing (NGS) approaches differ widely but are all based upon massively parallel short-read sequencing technology [41-44]. Detection of SVs (e.g. translocations) from sequencing data relies upon three main approaches, which I will discuss below (Figure 2).

Paired end sequencing uses DNA fragments of predetermined insert size (~450 bases) and sequences each fragment from both ends to generate read-pairs (~150bp) separated by the insert length. Analysis across the breakpoint takes into consideration both the location and relative orientation of the reads in each pair to identify discordant read-pairs (pairs whose separation does not match the expected insert size or whose reads map to discordant positions). High quality discordant read-pairs are then pooled for analysis when reconstructing the genomic landscape of the local environment [45, 46]. Since the analysis only uses discordant read-pairs, high coverage is necessary to obtain enough read-pairs to rise above the threshold of background noise. Repetitive regions cause a decrease in read quality and produce fewer mappable reads overall.
Within repetitive regions, it can be difficult to obtain enough high quality discordant reads to map certain SVs.

Split-read sequencing employs either single or paired reads for analysis. Discordant reads are identified as those for which only a portion of the read maps to the reference. This suggests that the read spans a breakpoint site, causing the inconsistency. Since split-reads cover the breakpoint interval, they are very important for identifying breakpoints at base-pair resolution [47, 48]. High coverage is likewise necessary for split-read mapping, and is difficult to obtain in the highly repetitive regions to which SVs tend to localize.

De novo assembly attempts to iteratively merge reads of maximal overlap to build increasingly larger contigs and scaffolds [49]. Assemblers favor longer reads, which facilitate the merging of reads. However, this approach has not fully matured and still suffers from uneven sequence coverage, sequencing errors, and the complexities of the genome. Different assembly algorithms have been developed to address these challenges but, as an approach, de novo assembly still cannot circumvent the challenge of repetitive regions [50]. Long regions of repeats disrupt contig building and act as boundary elements which must be resolved in a different manner.

1.4 Strand-seq

Template strand sequencing (Strand-seq) is a single cell sequencing approach that selectively sequences the template strands of parental cells inherited by daughter cells [51]. Directionality of reads is maintained using custom oligo-adaptors amenable to Illumina sequencing [52]. Reads are aligned to either the Watson (-) or Crick (+) strand of the reference genome and represented as ideograms to reflect the directionality of template strands inherited after independent assortment of sister chromatids [53]. For each chromosome during independent assortment, each cell can inherit one of four possible combinations of parental homologs (Figure 3).
A cell can inherit Watson template strands from both parental homologs, the Watson-Watson (WW) strand-state. Both parental homologs can also be present as Crick template strands, the Crick-Crick (CC) strand-state. Finally, the parental homologs can be present as strands of opposite directionality, the Watson-Crick (WC) strand-state, which can arise in two ways depending on which homolog contributes the Watson strand. With 23 chromosome pairs and four combinations per pair, there are 4^23 ≈ 7 × 10^13 possible combinations of independent assortment. Results from BAIT analysis of a single cell are plotted as an ideogram displaying the inheritance pattern of parental DNA template strands for each chromosome (Figure 4). In order to construct these ideograms, BAIT divides every chromosome of the reference genome into discrete bins. For my studies these bins were 200kb in size [53]. The number of reads aligned within each bin is calculated separately for each strand (Watson and/or Crick), and this number is represented by the length of an individual horizontal bar, orange for Watson and blue for Crick. The bins are stacked to produce the final ideogram. The spiky nature of ideograms reflects the variable number of reads per bin, which in turn reflects variability in the quality of single cell Strand-seq libraries. Apart from sub-saturation sequencing [43], areas of the genome that are not nucleosomal, G/C bias [54-56] and the variable BrdU incorporation in nascent DNA all contribute to this variability.

A break in the phosphodiester backbone of the DNA which results in changes to the genomic architecture can be visualized on the ideograms as a switch in strand-state: from WW to WC, CC to WC, WC to WW, or WC to CC. Normal cells exhibit a baseline probability of Sister Chromatid Exchange (SCE) events, which are visualized on the ideograms as switches in strand-state. Stable rearrangements are also visualized as switches in strand-state on an ideogram.
The defining feature of stable rearrangements is the recurrent nature of the strand-state switches across multiple cells, whereas random SCE events do not produce recurrent strand-state switches (Figure 5). Stable rearrangements observed by Strand-seq are therefore expected to show a higher than normal probability of strand-state switches at the same region of the DNA, compared to random SCE events. The ability of Strand-seq to extract the directionality of inherited DNA template strands allows studies of SVs in single cells. Previous studies have applied Strand-seq to the mapping of human inversions [57], building whole chromosome haplotypes [58], and de novo genome assembly [59].

In the case of translocations (Figure 6), a strand-state switch can be visualized because the translocated chromosome fragment follows the inheritance pattern of its host chromosome instead of its origin chromosome. Alignment to the reference reveals the difference in inheritance patterns between the fragment and its origin chromosome as a strand-state switch. This difference in inheritance patterns occurs in only 50% of independent assortment combinations. Figure 6 shows an example of each scenario, where one combination results in a strand-state switch while the other does not.

1.5 Thesis objectives

The overall goal of this thesis is to assess the ability of Strand-seq to study and map translocation partners, using Whole Genome Sequencing (WGS) as the benchmark. Strand-seq is still a relatively new technique and its accessibility is directly related to the repertoire of bioinformatic pipelines capable of exploiting the directionality of template reads for research. The Lansdorp lab aims to flesh out the possibilities of mapping structural variation in single cells and open up Strand-seq to the wider SV community.
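
As background for the aims that follow, the 50% figure quoted in Section 1.4 can be checked with a minimal Python sketch that enumerates template-strand inheritance for a heterozygous translocation. The three-strand model and the names `strand_state` and `enumerate_switches` are illustrative assumptions, not part of BAIT or BreakpointR. The sketch assumes that the fragment keeps its orientation on the host chromosome and that no SCE occurs.

```python
from itertools import product

STRANDS = ("W", "C")  # Watson and Crick template strands


def strand_state(a, b):
    """Combine two template strands into a strand-state label (WW, WC or CC)."""
    return "".join(sorted((a, b), reverse=True))


def enumerate_switches():
    """Enumerate inheritance combinations for a heterozygous translocation.

    intact  : template strand of the unrearranged origin homolog
    derived : template strand of the origin homolog that lost the fragment
    host    : template strand of the chromosome now carrying the fragment
    """
    outcomes = []
    for intact, derived, host in product(STRANDS, repeat=3):
        proximal = strand_state(intact, derived)  # origin chromosome before the break
        distal = strand_state(intact, host)       # fragment follows its host (assumption)
        outcomes.append((intact, derived, host, proximal, distal, proximal != distal))
    return outcomes


outcomes = enumerate_switches()
switch_fraction = sum(o[-1] for o in outcomes) / len(outcomes)
combinations_23_pairs = 4 ** 23  # four homolog combinations per chromosome pair

print(f"combinations for 23 chromosome pairs: {combinations_23_pairs:.2e}")
print(f"fraction of combinations with a visible switch: {switch_fraction}")
```

Under these assumptions a switch is visible whenever the host strand differs from the strand of the homolog that lost the fragment, which is half of the enumerated combinations.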
Aim 1: Investigate breakpoint resolutions achieved by the BreakpointR pipeline

Strand-seq harnesses the directionality (strand-state) of the inherited template strands to identify regions where DNA breaks have occurred in single cells. To efficiently identify these regions from sequencing data, BreakpointR was developed by D. Porubsky and A. Sanders. It is a pipeline that maps all strand-state switch events on each chromosome within every single cell. As a necessary first step to translocation mapping, I need to accurately identify all the breakpoints within my karyotype of interest and identify regions of recurrent strand-state switches. To address this, I will apply BreakpointR to my data set and compare the accuracy and resolution of breakpoints mapped by Strand-seq with WGS. This serves to verify the ability of BreakpointR and to identify any potential shortcomings in the pipeline.

Aim 2: Develop workflow for mapping translocation partners

Knowing the position of recurrent DNA breaks only suggests the presence of a stable rearrangement, and is not informative about the type of stable rearrangement. For positive identification of translocation events, it is essential that the partners involved in the translocation event can be mapped relative to each other. Strand-seq generates a unique data form which is not amenable to current pipelines for translocation partner mapping. Moreover, low coverage libraries cannot benefit from traditional paired-end or split-read analysis for the study of SVs. For this purpose, I have developed a workflow to map translocation partners using Strand-seq data, which centres upon pattern matching of strand-states.
The principal assumption for this approach is that the strand-state of a TRUE translocated fragment will always match its partner’s inheritance pattern in exactly the same configuration, regardless of the combination of independent assortment inherited by each individual cell.

Aim 3: Compare the resolution and accuracy of translocation mapping with results from WGS

Finally, to verify the validity of the workflow developed, the karyotype of interest was also characterized with WGS and analyzed by the DELLY pipeline. Translocation mapping results from both Strand-seq and WGS were compared to identify their individual strengths and weaknesses, along with areas where they might complement each other. A complex karyotype was deliberately chosen for this study to test the limits of SV detection using each approach. Cross-validation was done using a custom pipeline (123SV) for WGS SV analysis. I hypothesized that both BreakpointR and the translocation pipeline developed in Aim 2 will perform similarly to WGS analyses in terms of accuracy of partner matching and resolution of breakpoints identified. Further, I hypothesized that the single cell resolution of Strand-seq will reveal intricacies which would be difficult to interpret from bulk WGS data.

Chapter 2 | Methods & Materials

2.1 iALL826A cell line

The cell line used throughout this thesis is the iALL826A cell line. It harbors a stable, but complex karyotype (Figure 7):

46,XX,der(2)(2pter->2q37::11q23->q23::13q32->qter),der(11)(11pter->q23::19p13.3->pter),der(13)(13pter->13q32::11q23->qter),der(19)(2qter->q37::19p13.3->qter)

It was derived from a pediatric Acute Lymphoblastic Leukemia (ALL) patient from Australia and characterized by Henderson et al. [60] using chromosome banding and break-apart MLL FISH probes.
The authors proposed an initial reciprocal translocation involving chromosome 11 and chromosome 13, after which the derivative chromosome 11 (carrying the chromosome 13 fragment 13q32->qter) underwent another round of translocations with chromosome 2 and chromosome 19 to generate the final karyotype configuration.

The cells were cultured in RPMI1640, 20% FCS, 1% Na-Pyruvate, 1% NEAA, 1% L-glutamine, 1% Penicillin-Streptomycin. They were seeded at 1.5*10^6 cells/mL, with a doubling time of ~10 days. These are suspension cells which settle to the bottom. Culture media was replenished once every two days in 50:50 ratios.

2.2 Single cell isolation

Once cells were in log phase, they were treated with 20uM of BrdU, a thymidine analog, for 72 hours. BrdU treatment labels the nascent strand during replication. After 72 hours, daughter cells were treated drop-wise with Nuclei Isolation Buffer (NIB) while shaking. NP-40 was then added, and the cells were incubated at 4 degrees in the dark for 15 minutes to extract nuclei. The nuclei were filtered using a cell strainer to remove unwanted cellular debris.

NIB recipe (6mL): 5124uL ddH2O; 600uL 1M Tris; 184.8uL 5M NaCl; 60uL 100mM CaCl2; 30uL 100mM MgCl2; 6uL 10mg/mL Hoechst 33258; 1.2uL 100x BSA.

Filtered nuclei were sorted on the BD FACSAria III by Fluorescent Activated Cell Sorting (FACS). A 480nm UV laser was used to gate cell populations which had undergone one round of replication under BrdU treatment. BrdU is known to quench the UV signal of Hoechst 33258; therefore cells which incorporated BrdU into half of their DNA helix (only in the nascent strand after replication) show a UV signal roughly half that of undivided cells (Figure 8).

Freezing Buffer (1mL): 500uL PBS, 425uL 2x ProFreeze, 75uL DMSO

Cells with half-incorporated BrdU were sorted into 96-well plates containing 5uL Freezing Buffer and stored at -80 degrees.
2.3 Library preparation and Sequencing

During library preparation, plates were quick-thawed in a 37 degrees Celsius water bath for 15 seconds. They were treated with the 7-step Strand-seq library preparation protocol detailed in Sanders et al. [52]. All steps were performed on the Agilent Bravo Automated Liquid Handling Platform.

1) Single nuclei were treated with MNase digest for 8 minutes to fragment the DNA into mononucleosomes with an expected fragment size of ~150bp, or dinucleosomes with an expected fragment size of ~300bp. The MNase reaction was quenched using 100mM of EDTA. Samples were cleaned using Ampure XP beads.
2) The ends of fragmented DNA were filled in (made blunt) by End Repair, incubated for 30 minutes at room temperature. Samples were cleaned using Ampure XP beads.
3) A single Adenosine base was added to the 3’ end of the blunt fragments by the A-tailing protocol, incubated for 30 minutes at 37 degrees. Samples were cleaned using Ampure XP beads.
4) Adaptors of ~30 bases were added to either end of the fragment during the Adaptor Ligation protocol, incubated for 15 minutes at room temperature. Mononucleosomes should have an expected fragment size of ~210bp, and dinucleosomes an expected fragment size of ~360bp. Samples were cleaned using Ampure XP beads.
5) The nascent strand of the DNA double helix was degraded using Hoechst and UV treatment: 15 minutes at room temperature in the dark after adding Hoechst 33258, followed by 15 minutes uncovered under 365nm bulbs. Ideally, only the template strand remains for sequencing, leaving no possibility for background noise. In practice, the uneven and incomplete incorporation of BrdU leaves fragments of nascent strand behind, which produce some background noise in the ideograms.
6) The remaining templates were put through PCR amplification to attach a unique hexamer barcode to each template fragment to produce the final sequence reads.
Mononucleosomes should have an expected fragment size of ~270bp, and dinucleosomes an expected fragment size of ~420bp. Samples were cleaned using Ampure XP beads.
7) Samples underwent size selection and were assayed on a high sensitivity DNA chip (Agilent) before paired end (76nt) sequencing on the Illumina MiSeq platform.

Sequencing on the Illumina MiSeq platform achieved a cluster density of 852k/mm^2 with 97% passing filter. In total, 3.2GB of sequencing data was produced, giving an average of 243143.6 reads per cell and an average coverage of 0.00578X per cell.

2.4 BreakpointR pipeline analysis

FASTQ files were aligned to the Hg19 human reference genome using the BWA-MEM alignment algorithm [61]. BAM files were then sorted and duplicate reads were removed. Only reads with mapq >= 10 were analyzed with BreakpointR [73]. With the user-defined parameters set to a 1Mb sliding window and a 10-read minimum threshold for genotyping, BreakpointR annotated the breakpoint sites of each chromosome in every single cell (Figure 9).

To find the areas with a recurrent breakpoint, BreakpointR uses the GenomicRanges R-package to disjoin all breakpoints (Figure 9b) at their start and end positions and compiles the disjoined intervals across multiple libraries to build a histogram of disjoint overlaps. The resulting histogram represents the probability density function of breakpoint regions within every cell. I decided upon a threshold of 1.6 standard deviations (SD) above the mean to annotate recurrent breakpoints. Because ~50% of cells are expected to show a recurrent breakpoint at a stable rearrangement, a threshold of < 1SD above the mean would have been sufficient to identify one. However, this would result in very large breakpoint regions being called. In an effort to narrow the breakpoint intervals called, I tried various thresholds and chose 1.6xSD as the optimal value.
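
The disjoin-and-threshold idea described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the BreakpointR or GenomicRanges implementation: the function names are hypothetical, the intervals are made-up coordinates on one chromosome, and the mean and SD are taken over the per-segment overlap counts, which may differ from how the thesis pipeline computes them.

```python
from statistics import mean, pstdev


def disjoin_counts(intervals):
    """Split overlapping (start, end) intervals at every boundary and
    count how many input intervals cover each resulting segment."""
    bounds = sorted({p for iv in intervals for p in iv})
    segments = []
    for left, right in zip(bounds, bounds[1:]):
        depth = sum(1 for s, e in intervals if s <= left and right <= e)
        if depth:
            segments.append((left, right, depth))
    return segments


def recurrent_regions(intervals, n_sd=1.6):
    """Keep segments whose overlap count exceeds mean + n_sd * SD and
    merge adjacent kept segments into regions."""
    segments = disjoin_counts(intervals)
    depths = [d for _, _, d in segments]
    cutoff = mean(depths) + n_sd * pstdev(depths)
    merged = []
    for left, right, depth in segments:
        if depth <= cutoff:
            continue
        if merged and merged[-1][1] == left:
            merged[-1] = (merged[-1][0], right)  # extend the previous region
        else:
            merged.append((left, right))
    return merged


# Hypothetical breakpoint intervals, one per cell (toy coordinates).
cells = [(100, 400), (150, 300), (200, 350), (220, 260), (210, 280),
         (900, 1200), (1500, 1600)]

print(recurrent_regions(cells))            # stricter threshold, narrower region
print(recurrent_regions(cells, n_sd=0.5))  # looser threshold, wider region
```

On this toy input the 1.6 SD threshold keeps only the segment covered by the most cells, while the lower threshold returns a wider region around it, which mirrors the trade-off described in the text.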
Higher thresholds produced high specificity (accurately excluding all non-breakpoints) but low sensitivity (missing some true breakpoints), while lower thresholds produced low specificity (including some non-breakpoints) but high sensitivity (accurately capturing all true breakpoints).

The results identified 28 recurrent breakpoints localized to 11 chromosomes across 108 single cells. This breakpoint data was manually curated to remove centromeric regions and the 1kb up- and downstream of the centromeres. The disjoint histogram was smoothed by taking the midpoint between the extremes of each interval and joining the midpoints of all intervals to generate a spline function, giving a smoothed version of the histogram. This was done to remove artifacts borne from choosing a lowered threshold (1.6SD) during the disjoin processing phase. The final result of BreakpointR was 18 high confidence recurrent breakpoints localized to 11 chromosomes, which were applied to the translocation partner mapping workflow.

2.5 Whole Genome Sequencing analysis: DELLY & 123SV

9,000,000 iALL826A cells were cultured without Bromodeoxyuridine (BrdU) treatment for WGS. DNA was extracted using the Phenol-Chloroform protocol. 250ng/uL of DNA was sent in 40uL, for a total of 10ug of DNA, to Macrogen Inc. for Whole Genome Sequencing on the HiSeqX platform.

Macrogen Inc. used Illumina TruSeq DNA sample preparation to obtain a library with ~300-400bp average insert size. 151bp paired-end sequencing was performed on the HiSeqX platform. 75GB of sequencing data was produced, giving 824,572,370 reads with ~38x mappable coverage.

FASTQ files were aligned to the Hg19 human reference genome using the BWA-MEM alignment algorithm [61] and BAM files were fed into the DELLY pipeline [46], which uses discordant read-pairs and an integrated paired-end, split-read mapping to identify high confidence SVs.
An in-house custom pipeline developed by Victor Guryev at the European Research Institute for the Biology of Aging (ERIBA), 123SV [62], was also used to cross-validate the WGS data. 123SV uses discordant read-pairs and an integrated paired-end, split-read mapping to identify high confidence SVs.

Chapter 3 | Results

3.1 BreakpointR analysis of iALL826A Strand-seq data

Considering the four chromosomes known to be involved in the complex translocation (Figure 10), chromosome 2 shows a recurrent strand-state switch in 57/108 cells (~53%), chromosome 11 in 59/108 cells (~55%), chromosome 13 in 51/108 cells (~47%), and chromosome 19 in 56/108 cells (~52%). Each of these chromosomes seems to be involved in a stable rearrangement, but knowing that a chromosome is involved in a rearrangement is not sufficient. Further studies regarding the type or consequence of the rearrangement require mapping of each breakpoint interval.

The function of BreakpointR is to annotate the breakpoint intervals of each chromosome, across every single cell (Figure 9). Breakpoint annotation is a critical first step in translocation mapping because it establishes the boundary elements of the genome, separating the genome into distinct strand-states. Further processing of the data disjoins each break interval and compiles them into a histogram to identify the maximum region of overlap across cells, annotating the recurrent breakpoint. 18 high confidence recurrent breakpoints were identified by BreakpointR (Table 1). An additional four breakpoints were added manually (Table 2). These four breakpoints were identified during visual inspection of breakpoints on the UCSC genome browser. An explanation for why these breaks were missed by the BreakpointR pipeline can be found in the Discussion section 4.2.
The 18 breakpoint intervals identified by BreakpointR range from 5 - 863kb, with a median interval of 98kb.

For further analysis, we selected 5 breakpoints (BP2, BP13, BP11A, BP11B, BP19) corresponding to the 5 breakpoints previously validated by FISH break-apart probe, one of which was also validated by Sanger sequencing (BP19). Comparison of the recurrent breakpoint intervals identified by BreakpointR with the base resolution provided by WGS analysis showed that BreakpointR could encapsulate the base resolution in 4 of the 5 selected breakpoints (Figure 10). BP11B deviated from the base resolution by ~650bp. The reason for this deviation is based upon a fundamental principle of Strand-seq and is explained in detail under Chapter 4’s Summary Overview. Briefly, the more complex a karyotype, the more cells are needed to recreate the karyotype. In this case, we did not have enough cells to accurately place this particular breakpoint interval. The proper number of cells can be calculated using a power calculation under binomial sampling [63-70].

An unusual finding when comparing the accuracy of Strand-seq against WGS data was a discrepancy on chromosome 19. The interval for BP19 spans ~33kb and encapsulated the base position identified by WGS. However, the recurrent breakpoint interval, along with the base position identified by WGS, deviated from the breakpoint validated by the previous authors. The position validated by Sanger sequencing was identified as exon 2 of the MLLT1 gene, ~17kb upstream from the base position identified by WGS. This is a crucial breakpoint, as it results in a fusion transcript with the KMT2A gene on chromosome 11. The reason for this discrepancy is not yet known. Attempts to re-sequence the breakpoint region have remained unsuccessful due to the highly repetitive nature of the sequence. For now, it seems to represent an inherent weakness of NGS technologies, both for Strand-seq and WGS.
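
The binomial power calculation mentioned earlier in this section can be sketched as follows. This is a minimal illustration, not the calculation used in the thesis or in the cited references: the function names, the target of 30 informative cells and the 95% power level are arbitrary assumptions, and each cell is assumed to be informative independently with a fixed probability (0.5 for a single breakpoint, halved to 0.25 when two breakpoints share a chromosome).

```python
from math import comb


def prob_at_least(n_cells, p_informative, k_min):
    """P(X >= k_min) for X ~ Binomial(n_cells, p_informative)."""
    return sum(
        comb(n_cells, k) * p_informative**k * (1 - p_informative) ** (n_cells - k)
        for k in range(k_min, n_cells + 1)
    )


def cells_needed(p_informative, k_min, power=0.95, max_cells=1000):
    """Smallest number of cells giving at least k_min informative cells
    with probability >= power."""
    for n in range(k_min, max_cells + 1):
        if prob_at_least(n, p_informative, k_min) >= power:
            return n
    raise ValueError("target not reached within max_cells")


# Hypothetical target: 30 informative cells per breakpoint.
single_breakpoint = cells_needed(0.5, 30)
two_breakpoints = cells_needed(0.25, 30)  # assumed halving of informative cells
print(single_breakpoint, two_breakpoints)
```

Under these assumptions, halving the per-cell probability of an informative switch roughly doubles the number of cells required, which is the intuition behind needing more cells for more complex karyotypes.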
3.2 Principles of translocation mapping

After identifying 18 high confidence breakpoints with BreakpointR, the next step was to identify which of these breaks were involved in a translocation event, to recreate the karyotype established by FISH. I developed a workflow that uses the high confidence recurrent breakpoints identified by BreakpointR and the ability of contiBAIT [59] (a custom pipeline developed for contig assembly using Strand-seq) to parse chromosomes by breakpoint intervals in order to map translocation partners. The workflow uses only the chromosomes carrying a recurrent breakpoint as input data and can be broken down into 4 distinct stages: (1) Within each single cell, identify and annotate the strand inheritance patterns for the chromosomes of interest; (2) Within each single cell, compare the differences between the expected inheritance pattern and the observed inheritance pattern from ideograms, annotating the strand-states of the derivative fragments; (3) Within each single cell, match the deviant fragment strand-states with the chromosomes’ inheritance pattern, annotating the translocation configuration (Normal or Inverted) to which the deviant fragment matches; (4) Across multiple cells, compare the translocation configurations to identify the proper translocation partner. This workflow assumes that the strand-state of a TRUE translocated fragment will always match its partner’s strand inheritance pattern in exactly the same configuration, regardless of the combination of independent assortment that each chromosome experiences within each cell.

I will begin by outlining the process using a three chromosome model, where a heterozygous translocation event involves chromosomes 4 and 11, but not chromosome 16 (Figure 11). For heterozygous translocations, it is not possible to immediately know the inheritance pattern of the deviant fragment from the ideogram alone.
The first step is to identify the strand inheritance pattern for each chromosome with a recurrent breakpoint. Based on the centromere positions of each chromosome and the strand-states 5Mb up- and downstream of the centromere, we can determine that both Watson homologs of chromosome 4 were inherited, both Crick homologs of chromosome 11 were inherited, and one Watson and one Crick homolog of chromosome 16 were inherited. The second step compares the expected ideogram of each chromosome with the observed ideogram, and annotates the strand-states of the deviant fragments (DF). DF4 was inherited as Crick instead of the expected Watson. DF11 was inherited as Watson instead of the expected Crick. DF16 was inherited as Crick instead of the expected Watson.

The third step (Figure 12) is to determine how the DF’s strand-state (in this example I will use chromosome 4 as the chromosome of interest) matches the inheritance pattern of the other two chromosomes. DF4 matches chromosome 11’s strand inheritance pattern in the Normal configuration (the fragment did not have to flip orientation; Crick remained as Crick), but matches chromosome 16 in the Inverted configuration (the fragment had to flip orientation; Crick to Watson). This process is repeated across every single cell to generate a table of fragment matches for chromosome 4. Within each single cell, the process is also repeated for every chromosome of interest. The final step is to identify the translocation partner for chromosome 4 based on the consistency of the translocation configuration that DF4 matches. Comparing across all single cells shows that the inheritance pattern of DF4 consistently matches the inheritance pattern of chromosome 11 in the Normal configuration, but the matching configuration is inconsistent for chromosome 16. Regardless of the combination of independent assortment of chromosomes, DF4 always matches its true translocation partner in the same configuration.
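
The matching logic of steps 3 and 4 can be sketched in Python as below. This is a simplified illustration rather than the R-script used in this thesis: the function names, the dictionary layout and the three toy cells are hypothetical. As a further simplifying assumption, a candidate partner inherited as WC is treated as uninformative and skipped, whereas the worked example above assigns a configuration to such cells from the ideogram.

```python
def configuration(fragment_strand, partner_state):
    """Return 'Normal' or 'Inverted' for a deviant fragment against a
    candidate partner, or None when the partner is WC (skipped here)."""
    if partner_state == "WC":
        return None
    return "Normal" if partner_state[0] == fragment_strand else "Inverted"


def consistent_partners(cells, source):
    """Return candidate partners whose inheritance pattern matches the
    deviant fragment of `source` in a single configuration across cells."""
    seen = {}
    for cell in cells:
        fragment = cell["fragments"][source]
        for chrom, state in cell["states"].items():
            if chrom == source:
                continue
            config = configuration(fragment, state)
            if config is not None:
                seen.setdefault(chrom, set()).add(config)
    return {chrom: next(iter(cfgs)) for chrom, cfgs in seen.items() if len(cfgs) == 1}


# Toy data: strand-state per chromosome and deviant fragment strand per cell.
cells = [
    {"states": {"chr4": "WW", "chr11": "CC", "chr16": "WW"}, "fragments": {"chr4": "C"}},
    {"states": {"chr4": "CC", "chr11": "WW", "chr16": "WW"}, "fragments": {"chr4": "W"}},
    {"states": {"chr4": "WW", "chr11": "WW", "chr16": "CC"}, "fragments": {"chr4": "W"}},
]

print(consistent_partners(cells, "chr4"))
```

With only a few cells a non-partner can appear consistent by chance, so in practice the comparison needs enough informative cells for an inconsistent configuration to show up.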
In the three-chromosome example, we can conclude that the deviant fragment of chromosome 4 was translocated to chromosome 11, and therefore DF4 follows the inheritance pattern of chromosome 11. The same analysis using chromosome 11 as the chromosome of interest shows that the inheritance pattern of DF11 consistently matches the inheritance pattern of chromosome 4 in the Normal configuration, but matches chromosome 16 inconsistently. Therefore, we can conclude that DF11 translocated to chromosome 4. Using chromosome 16 as the chromosome of interest shows that the inheritance pattern of DF16 does not match any chromosome in a consistent configuration. Therefore, we conclude that chromosome 16 is involved in some type of stable rearrangement, but not in a translocation event. In this way, we can map translocation partners, distinguishing the TRUE translocation partners from the background.

For homozygous reciprocal translocations, translocation partner matching is much simpler because it is easier to identify the strand-state of the deviant fragments from the ideograms. The subsequent steps of translocation partner mapping are the same as for heterozygous translocations. However, homozygous translocations are extremely uncommon, with heterozygous translocations being the dominant type.

This procedure was written as an R-script to test the complex karyotype and could match 4 of 5 translocation partners (Figure 13). It accurately identified that chromosome 13 carries a translocated fragment of chromosome 11 in the Normal configuration, chromosome 11 carries a translocated fragment of chromosome 19 in the Inverted configuration, chromosome 19 carries a translocated fragment of chromosome 2 in the Inverted configuration, and chromosome 2 carries a translocated fragment of chromosome 13 in the Normal configuration. However, the pipeline was not able to map the 100kb deviant fragment (DF100k) from chromosome 11.
This represents a limitation of using low-coverage sequencing in this project. The 100kb fragment did not have enough reads for genotyping. Thus, without an assigned strand-state, it could not be matched to any of the inheritance patterns from other chromosomes.

3.3 Unexpected complexities identified by Strand-seq

By pairing the single cell resolution of Strand-seq with the complete karyotype established by FISH, it is possible to reconstruct the observed chromosome ideograms in each cell to study multiple breakpoint sites along a chromosome. Based on the inheritance pattern of each chromosome along with the known translocation partners, I reconstructed the ideogram for chromosome 11 and tried to predict which breakpoint position (BP11A or BP11B) each cell would show on chromosome 11. Of the 108 cells sequenced, 59 showed a strand-state switch, but only 31 influenced the final resolution between the two breakpoints. The other 28 cells were “unhelpful” because they either had breakpoint intervals spanning both breakpoint sites or were scattered around the breakpoint region without influencing the resolution of breakpoints. The 31 informative cells resolved the breakpoint position into two separate regions, BP11A or BP11B, because chromosome 11 exists on three separate chromosomes. The unique combinations of strand inheritance from chromosomes 2, 11, and 13 resulted in some cells with a breakpoint in BP11A, while others had a breakpoint in BP11B. Figure 14 shows how chromosome 11’s existence on three chromosomes can cause two different breakpoints. In my predictions, I followed the karyotype model proposed by Henderson et al. [60], where DF100k translocated to chromosome 2 from the derivative chromosome 11, maintaining its original directionality. In Figure 14, I have reconstructed the ideograms for chromosome 11 of Cell#586 and Cell#555.
Cell#586 is predicted to have a breakpoint at BP11B, due to the strand-state of DF100k inherited from chromosome 2, while Cell#555 is predicted to have a breakpoint at BP11A, also due to the strand-state of DF100k inherited from chromosome 2. However, Cell#586 actually had the breakpoint of chromosome 11 at BP11A, while Cell#555 had the breakpoint of chromosome 11 at BP11B, the opposite of what was expected. In my attempts to predict the expected breakpoint position for these 31 cells, I was only able to correctly predict 1 of 31 libraries. 30 of the 31 cells demonstrated an unexpected flip of the actual breakpoint region, suggesting that the 100kb fragment of chromosome 11 had undergone an inversion event before being translocated to chromosome 2.

To test this hypothesis, I designed primers (Table 5) flanking both breakpoint regions, BP11A and BP11B, and tried to capture the PCR product across each breakpoint (Figure 15). Primers were validated beforehand to show that they would only produce a single PCR product when using chromosome matched primers, and PCR testing accommodated both scenarios, with and without an inversion present. In the absence of an inversion, translocation-primer-pairs did not produce any distinct product. However, when the translocation-primer-pairs accounted for the presence of an inversion, the paired primers captured a single distinct band (Appendix 1), one each at BP11A and BP11B.

Purified PCR products were isolated and sent for Sanger sequencing. BLAST results from the sequenced purified PCR products confirmed that the PCR product spanned each breakpoint. The PCR product from BP11A aligned to chromosome 2-chromosome 11B, while the PCR product from BP11B aligned to chromosome 11A-chromosome 13, both products supporting the inversion of DF100k.
3.4 Comparing the accuracy of translocation mapping with WGS analysis

WGS data was used as a benchmark to see how Strand-seq analysis compares to current NGS, in terms of accuracy of breakpoint mapping and translocation mapping. The accuracy of breakpoint mapping by BreakpointR was detailed above in section 3.1. In this section, I will compare the accuracy of translocation partners mapped using WGS and Strand-seq.

DELLY’s raw translocation output identified 1270 translocation links within our complex karyotype. Assuming no access to population filters, we were able to narrow the number of links to 119 quality-pass links (Figure 16b, Table 9). Based on the origins of the patient (Australia), whose bone marrow was used to derive the iALL826A cell line, I assumed that she had predominantly Caucasian ancestry. We attempted population filtering to remove benign links using a Dutch normal population (n=11), managing to narrow down the 119 links to 11 high quality translocation links within our complex karyotype (Figure 16c, Table 7). Among the 11 translocation links, WGS identified only 4 of 5 verified translocation partners and was unable to identify the translocation between chromosome 11 and chromosome 19.

123SV identified 17 quality-pass links, which could be resolved to 10 high confidence links (Figure 16d, Table 8) after filtering using the Dutch normal population (n=11). 123SV was also unable to identify the translocation event between chromosome 11 and chromosome 19. Based on our attempts to sequence this region, we propose that the inability of these two pipelines to identify the chromosome 11-19 translocation is due to the approach; using discordant read-pairs and split-read analysis requires a sufficiently large pool of high quality discordant read-pairs or split-reads to map a translocation.
However, because the region is highly repetitive, not enough reads map with a mapping quality above 10, the minimum threshold needed to map the translocation. The highly repetitive nature of this region represents a weakness in relying solely on discordant read-pairs and split-reads for SV analysis. Cross-validation between DELLY and 123SV further narrowed the number of quality pass links to six possible translocation events (Table 4), four verified by FISH and another two not identified by Strand-seq or FISH. These two translocations (a chromosome 9-12 translocation and a chromosome 8-11 translocation) identified by WGS analysis were tested using PCR to try to capture the unseen translocations. Primer pairs (Table 6) were designed around the proposed breakpoint sites and all combinations of Forward-Forward, Forward-Reverse, Reverse-Forward, and Reverse-Reverse primer pairs were tested. However, none of the combinations tested showed a band indicative of a translocation event, despite validating the primer-pairs beforehand to show that they could only produce a single PCR product when using chromosome matched primers (Appendix 2). We concluded that despite the population filtering, quality checks, and cross-validation, the extraneous translocation links identified by WGS were not actual translocation events, but rather technical artifacts of WGS.

By comparison, Strand-seq identified only four links (Figure 16a, Table 3), without the need for population filtering or cross-pipeline validation. Three of the links were verified by FISH and the blue link represents the link between chromosome 2 and chromosome 13. This link does not represent the traditional link by proxy of a paired-end/split-read analysis. However, it remains an accurate link, able to show that the derivative fragment of chromosome 13 was translocated to chromosome 2.
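
The population filtering and cross-validation steps applied to the WGS links in this section can be sketched as below. This is only an illustration under stated assumptions: the thesis does not specify how links from DELLY and 123SV were matched, so the 1kb tolerance, the tuple layout (chromosome, position, chromosome, position), the function names and all coordinates are hypothetical.

```python
def same_link(a, b, tol):
    """Two links agree if they join the same chromosomes and both
    breakpoint positions lie within tol base pairs of each other."""
    return (a[0] == b[0] and a[2] == b[2]
            and abs(a[1] - b[1]) <= tol and abs(a[3] - b[3]) <= tol)


def filter_and_cross_validate(caller_a, caller_b, normal_panel, tol=1000):
    """Drop links seen in a normal panel, then keep links from caller_a
    that are also reported by caller_b."""
    def is_novel(link):
        return not any(same_link(link, n, tol) for n in normal_panel)

    kept_a = [link for link in caller_a if is_novel(link)]
    kept_b = [link for link in caller_b if is_novel(link)]
    return [a for a in kept_a if any(same_link(a, b, tol) for b in kept_b)]


# Toy links with made-up chromosome labels and coordinates.
caller_a = [("chrA", 10_000, "chrB", 50_000),
            ("chrA", 90_000, "chrC", 5_000),
            ("chrB", 1_000, "chrC", 2_000)]
caller_b = [("chrA", 10_400, "chrB", 49_800),
            ("chrB", 1_200, "chrC", 2_100),
            ("chrA", 70_000, "chrD", 1_000)]
normal_panel = [("chrB", 900, "chrC", 2_050)]

shared = filter_and_cross_validate(caller_a, caller_b, normal_panel)
print(shared)
```

In this toy case one link is removed because it also appears in the normal panel, and only the link reported by both callers survives.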
Regarding the inversion of DF100k identified by Strand-seq, of the six high confidence translocation links identified by WGS, two were links between chromosome 11 and chromosome 13 (Table 4, rows 3 & 4). Chromosome 13 has the same breakpoint in both links, but chromosome 11 has two breakpoints 100kb apart. These links represent the identified inversion as it appears in WGS data. However, this form of representation is also suggestive of an insertion or deletion signature, and I was not sufficiently experienced to recognize it as an inversion. Moreover, the inversion analysis from DELLY or 123SV restricts itself to inversions within the host chromosome, and is thus unhelpful for studying inversions of chromosome fragments outside the host chromosome.

3.5 Revised translocation model

Based on the inversion of chromosome 11’s 100kb fragment, the original model should be revised, adding an additional step to account for the newly discovered inversion (Figure 17). The FISH model postulated an initial reciprocal translocation between chromosome 11 and chromosome 13. After this first step, the derivative chromosome 11, carrying DF13, underwent a secondary round of translocations with chromosome 2 and chromosome 19 to reach the final karyotype configuration. This model is inaccurate in light of the inversion, which requires an additional step somewhere in between.

Rather than a 3-step model, I propose a 1-step model where all four chromosomes experienced simultaneous breaks which resulted in three fragments of chromosome 11, and two fragments each of chromosome 2, chromosome 13, and chromosome 19. These chromosome fragments then rejoined to give the final configuration, much like a chromothripsis event. I propose a 1-step model because it is mechanistically simpler, but also because the model carries 3 of the 6 genomic signatures for chromothripsis laid out in Korbel et al. [71].
Chapter 4 | Discussion

4.1 Summary overview & findings

The goal of this project was to investigate whether Strand-seq is capable of mapping translocation partners in single cells. Strand-seq selectively sequences the inherited parental DNA template strands in daughter cells (Figure 3), and reads aligned to the reference genome preserve the directional information of each read. In this manner, the strand inheritance pattern of each chromosome can be visualized as an ideogram (Figure 4), clearly showing SCE events or DNA DSBs as strand-state switch events. My project can be broken down into three main aims: (1) assess the effectiveness of BreakpointR as an automated tool to map breakpoints from Strand-seq data; (2) develop a workflow capable of mapping translocation partners from Strand-seq data; (3) compare the accuracy of Strand-seq’s breakpoint mapping and translocation partner mapping using WGS data as the benchmark. A cell line with a complex karyotype (Figure 7) was chosen and sequenced using MiSeq for Strand-seq and HiSeqX for WGS.

BreakpointR was run with user-defined parameters of a 1Mb window size and a minimum of 10 reads for genotyping. BreakpointR identified 28 breakpoints, which were reduced to 18 (Table 1) after masking centromeric regions and smoothing. I chose five breakpoints which were validated by FISH [60], and compared results from BreakpointR with WGS. The breakpoint intervals of these five breakpoints ranged from 5 - 60kb (Table 1), with 4 of the 5 managing to encapsulate the base positions identified by WGS. The remaining breakpoint on chromosome 11 deviated by ~650bp from the WGS position.

This deviation is rooted in the fact that chromosome 11 contained multiple breakpoints. The difficulty in mapping multiple breakpoints on a single chromosome using Strand-seq is that Strand-seq distinguishes only three different strand-states: WW, WC, or CC.
Practically, this limits the number of separate strand-states which can be represented on a single chromosome. In an ideal scenario each breakpoint can still be distinguished as a separate strand-state. While still possible, this is extremely unlikely based on the randomness of strand inheritance patterns. Thus, when evaluating breakpoints it is much more likely that chromosomes with multiple breakpoints will present with only a single strand-state switch event in any given cell, located at different positions across cells. The functional consequence of identifying recurrent breakpoints is that these chromosomes will have fewer libraries per breakpoint which can be used for evaluating breakpoint resolution (Figure 10d). In my case, chromosome 11 had two breakpoints and consequently had half as many libraries per breakpoint, resulting in lower breakpoint precision. Moreover, analysis was further limited by the low coverage and the proximity of the breakpoints (100kb): the sequencing depth of our Strand-seq libraries was too low for the 100kb fragment to be genotyped, thereby reducing it to the level of background. In all instances where we could expect to see two breakpoints differentiating three separate strand-states, we only had two breakpoints, resulting in even fewer libraries which could be used to identify the breakpoint. As such, the deviation from the WGS data represents a situation where not enough cells were sequenced. Therefore, the higher the expected complexity of the karyotype, the more cells will need to be sequenced to generate an accurate breakpoint interval.

The single cell resolution of Strand-seq allowed us to reconstruct the ideograms of each chromosome (Figure 14) and predict the breakpoint positions based upon the unique combination of inherited template strands in each cell. Predictions of the breakpoints on chromosome 11 identified that the 100kb translocated fragment was inverted before settling into chromosome 2.
This inversion was identified because the single cell resolution of Strand-seq allows us to study how combinations of independent assortment affect breakpoint position. The inversion fragment, although inconclusively represented in the WGS (Table 4) and FISH data, was validated using PCR and Sanger sequencing (Figure 15).

The translocation workflow for Strand-seq was written as an R-script and correctly mapped 4 of 5 verified translocation partners in the proper configuration (Figure 13). However, it was unable to match the 100kb fragment of chromosome 11. WGS analysis performed similarly, identifying six high confidence translocation links (Table 4), of which four belonged to the translocation partners verified by FISH, but it missed the verified link between chromosome 11 and chromosome 19. WGS analysis required effective population filters and multiple-pipeline cross-validation to narrow the number of quality-pass hits from 119 to 6, two of which were shown to be technical artifacts of WGS.

The inability of Strand-seq to map the last fragment is a limitation of low coverage sequencing which can be improved by better DNA yield and higher coverage sequencing. However, since WGS relies upon high quality reads surrounding the local environment of the breakpoint site, the repetitive regions surrounding the chromosome 11-19 breakpoint expose an inherent weakness of the paired-end and split-read mapping approach. In this situation, Strand-seq is advantageous because it can ignore the repetitiveness of the local environment and map structural variations based solely upon the switch in strand-states.

In conclusion, Strand-seq performed in a manner which was comparable to WGS in terms of mapping breakpoints and translocation partners. BreakpointR could not achieve base resolution, due to the limitations of low coverage sequencing. Instead it sacrificed base resolution for the ability to interrogate SVs at the level of single cells.
Despite the low sequencing coverage, ~30kb breakpoint resolution is still a workable resolution which can be further investigated to answer other biological questions. Moreover, Strand-seq could identify complexities in DNA structure which were otherwise undetectable, due to its single cell resolution. The translocation workflow mapped 4 out of 5 translocation partners, similar to WGS, but did not require the hassle of population filters and cross-pipeline validation. Strand-seq’s ability to visualize SVs at the level of single cells, circumvent repetitive sequences surrounding the breakpoint, better visualize complexities in SV configuration, and remove the need for population filters and pipeline cross-validation makes it an extremely powerful complementary tool for genomics.

The only caveat to the application of Strand-seq is its requirement for cells of interest to be amenable to culture under treatment with BrdU. Applying Strand-seq to terminally differentiated cells will require an external stimulus to induce growth, which may change the native cell state of interest. Other types of cells such as solid tumors are also typically resistant to culture and may require external stimulation. These are not ideal situations, but should be considered when applying Strand-seq to research questions. Regarding the mutagenic properties of BrdU, members of the Lansdorp lab have shown [72] that increasing the dosage of BrdU does not increase the number of SCE events in cell lines. Since SCEs can be used as a proxy for DNA damage, we conclude that BrdU treatment does not adversely influence the genomic landscape. However, whether BrdU can cause SNP changes which affect the fragility of the local genomic structure remains to be seen.

The only breakpoint which was validated by the previous authors showed a fusion gene between Exon 10 of the KMT2A gene and Exon 2 of the MLLT1 gene. Both WGS and Sanger sequencing were unable to map to that location.
However, both Strand-seq and WGS approaches mapped to a breakpoint 17kb downstream from the validated position. Efforts to re-sequence the region using Sanger sequencing have been difficult as the region is extremely repetitive. Perhaps this represents an inherent weakness of NGS approaches, but for now I have been unsuccessful in re-sequencing the region.

4.2 BreakpointR

BreakpointR is currently the most efficient method capable of mapping breakpoints from Strand-seq data. Recurrent breakpoints are identified based on a user-defined metric. In this project, I chose 1.6 standard deviations above the mean to annotate the region of maximal overlap. This threshold resulted in four missed breakpoints, two each from chromosome 8 and chromosome 17 (Table 2). These were added during post-processing, identified manually by visual inspection of breakpoints on the UCSC Genome Browser. These breakpoints were missed because of the relatively large breakpoint intervals which were identified. Large breakpoint intervals lead to higher than normal calculations of the mean-overlap and SD-overlap, resulting in an abnormally high threshold which masks the recurrent breakpoints. Strand-seq, like most other NGS approaches, suffers when faced with highly repetitive regions. These gap regions result in poor alignment and low-quality reads which prevent BreakpointR from resolving the breakpoint to a fine interval, instead producing long breakpoint intervals. When compiling breakpoints, it is paramount to take these gap regions into consideration lest any recurrent breakpoints be missed due to this weakness in breakpoint compilation. In general, finer breakpoint resolution can be obtained by increasing the thresholds of BreakpointR at the expense of masking some breakpoints, while a wider search scope can be obtained by lowering the threshold at the expense of breakpoint resolution.
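The thresholding idea described above can be sketched in a few lines. This is not BreakpointR's code: the function names are hypothetical, the per-cell intervals are invented for the example, and the statistics are computed as an unweighted mean and population standard deviation over the disjoint segments, which is an assumption (how BreakpointR weights segments is not stated here).

```python
from statistics import mean, pstdev


def disjoin(intervals):
    """Split overlapping (start, end) intervals into disjoint segments.

    Returns (start, end, depth) tuples, where depth is the number of
    input intervals covering that segment. Uncovered gaps are dropped.
    """
    points = sorted({p for interval in intervals for p in interval})
    segments = []
    for left, right in zip(points, points[1:]):
        depth = sum(1 for start, end in intervals if start <= left and right <= end)
        if depth:
            segments.append((left, right, depth))
    return segments


def recurrent_segments(intervals, n_sd=1.6):
    """Keep segments whose depth exceeds mean + n_sd * SD of all segment depths."""
    segments = disjoin(intervals)
    depths = [depth for _, _, depth in segments]
    cutoff = mean(depths) + n_sd * pstdev(depths)
    return [segment for segment in segments if segment[2] > cutoff]


# Hypothetical breakpoint intervals from five cells on one chromosome.
cells = [(100, 200), (120, 180), (130, 170), (140, 160), (400, 500)]
print(recurrent_segments(cells))
```

In this toy input, four cells share the 140-160 segment, which is the only one to clear the cutoff. Raising `n_sd` narrows the reported interval and lowering it widens the search, which mirrors the trade-off described in this section.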
4.3 Challenges of translocation mapping (SS & WGS)

Translocation mapping based on Strand-seq data accurately identified 4 out of 5 translocation partners. However, it was unable to place DF100k between chromosome 2 and chromosome 13. This was predominantly a weakness of the sequencing platform used rather than of the principles of translocation mapping. Due to the low coverage data for each single cell, there were not enough reads within that 100kb region to genotype its strand-state. BreakpointR required a minimum of 10 reads for genotyping a strand-state. Better coverage using deeper sequencing or higher DNA yield will allow BreakpointR to properly genotype each region, thereby allowing complete resolution of the complex karyotype.

SV mapping based on WGS also faced limitations. Final results identified 4 out of 5 translocation partners, but were unable to place the link between chromosome 11 and chromosome 19. This is predominantly a weakness of the WGS analysis approach. Since most SV analyses rely upon paired-end or split-read mapping [46], high quality discordant read pairs or split reads are required to cover the breakpoint interval to identify high quality links. If a breakpoint is flanked by repetitive sequences, not enough high-quality reads will be able to map to that region, masking the translocation. To circumvent the issues of mapping onto repetitive sequence, different approaches such as de novo assembly reconstruction or long read sequencing can be employed. However, even though de novo assembly does not rely on the reference mapping of paired or split reads, it will still face difficulties against highly repetitive regions [50], creating boundary regions where contigs will not be able to extend (Figure 2). The best solution would be to use long-range sequencing such as PacBio or Nanopore technology [73, 74], but until these technologies mature, there is no fully satisfactory way to resolve highly repetitive regions.
Strand-seq is advantageous because it can ignore the repetitiveness of the local environment and map structural variations based solely upon the switch in strand-states. This has proven to be sufficient information to map inversions and to phase whole chromosome haplotypes. I propose that it is also sufficient for mapping translocation partners, presenting a means to circumvent repetitive regions when handling SVs.

4.4 Impact

SVs are an aspect of genomic variation which is notoriously difficult to study. The approach of using short paired-reads or short split-reads gives a window into the local environment surrounding SV breakpoints and can be used as a proxy to reconstruct the SV configuration; however, this is not the ideal approach. Long reads provide a solution, but the technology has yet to fully mature and expand to provide single cell resolution [75, 76], and long-read sequencing requires a very high concentration of input DNA [73, 74]. Strand-seq provides a unique avenue by which SVs can be interrogated, supplementing traditional short paired-reads and short split-reads with directional information. The addition of the directional information allows Strand-seq to map whole chromosome segments without the need to consider the sequences surrounding the local environment of the breakpoint, circumventing the primary constraint hampering short read sequencing of repetitive regions: the necessity of high coverage.

There are still weaknesses to using Strand-seq. It is unable to provide base resolution, an important aspect for studying the sequence clues embedded within the local environment of breakpoints, but it can be complemented with current NGS (WGS) approaches to study the mechanisms of SV formation. Moreover, cells earmarked for Strand-seq must be amenable to culture in the presence of BrdU before they can be sorted into single cells for library preparation. This caveat automatically discounts a lot of clinical applications.
However, the field of SV studies is still extremely young; advances in SV detection, especially at the level of single cells, can still answer a lot of basic questions about SV formation and their pathological significance.

Previous works have shown that Strand-seq is capable of mapping balanced inversions or building whole chromosome haplotypes [57-59]. In the same vein of SV detection, I hope to have demonstrated that Strand-seq is also capable of mapping translocations in single cells, even for relatively complex karyotypes. While I cannot recommend that Strand-seq completely replace current NGS approaches, Strand-seq can circumvent a lot of the difficulties currently faced by NGS approaches regarding repetitive sequences. Strand-seq is also capable of providing the resolution of single cell analysis, allowing investigators to causally link the genotypic presentation of SVs to the phenotypic presentation of the cell. In the interest of studying rare populations, Strand-seq is superior as it does not need to rely on population filters, and if employed in conjunction with WGS, will be able to curate the number of “quality-pass” links without the use of such filters or pipeline cross-validations.

While not perfect, Strand-seq represents a very powerful tool which can complement current NGS approaches in the study of SVs, allowing investigators an unparalleled view into SVs, their functional consequence, and the mechanisms controlling their development.

fin.
Tables:

Table 1: BreakpointR breakpoints

Chromosome  chromStart  chromEnd  Width
chr1  148009768  148185868  176100
chr1  148947781  149040111  92330
chr2  89548868  89832595  283727
chr2  232561624  232594080  32456
chr7  72399647  72699328  299681
chr7  74488557  75049123  560566
chr9  39775457  40639367  863910
chr9  70648936  71012527  363591
chr10  48945942  49395393  449451
chr11  118352526  118383891  31365
chr11  118454654  118459401  4747
chr13  101146277  101204167  57890
chr15  23501252  23604525  103273
chr16  33954501  33968085  13584
chr16  34022382  34183551  161169
chr16  46499964  46523466  23502
chr19  6274544  6307515  32971
chr21  37356482  37404085  47603

Table 2: Breakpoints added by manual curation

Chromosome  chromStart  chromEnd  Width
chr8  7153608  8065175  911567
chr8  12237040  12421529  184489
chr17  43475662  43663237  187575
chr17  44444818  44852715  407897

Table 3: Strand-seq translocation partner links

Chromosome.1  chromStart.1  chromEnd.1  Chromosome.2  chromStart.2  chromEnd.2
chr2  232519151  232611075  chr13  101143206  101219014
chr11  118352526  118382910  chr19  6260328  6330742
chr13  101143206  101219014  chr11  118454654  118459401
chr19  6260328  6330742  chr2  232519151  232611075

Table 4: WGS DELLY & 123SV translocation partner links, cross-validated

Chromosome.1  chromStart.1  chromEnd.1  Chromosome.2  chromStart.2  chromEnd.2
chr11  55052627  55053288  chr8  66514693  66514864
chr11  118459586  118460020  chr2  232579800  232580190
chr11  118359059  118359406  chr13  101167343  101167821
chr11  118460023  118460445  chr13  101166833  101167207
chr12  76237618  76237921  chr9  75288412  75288682
chr19  6288109  6288391  chr2  232580057  232580310

Table 5: Sequencing primers used to verify inversion

Primer  Sequence (5' -> 3')
11AF2  TGGTAAAGAAAATCCACGTCGG
19F  AGATCTGCTCTTTCTGTCCCT
11AR  TGATCCGCCCACCATATACTTT
13R  TCTTTCTGCTCCACTCCCAG

Table 6: Sequencing primers used to test additional hits found in WGS
Primer  Sequence (5' -> 3')
8F*  CCTCATAGTGGTCTAGGGTTCAT
8R*  ACTCTTAAGCCTGTTTGTACACA
9F*  GCAACAACAGCAGTATTTAGAGG
9R*  TAAAGCTGCTAGTGGTTGGC
11F*  AGACTTGATAAAAGGCACGGC
11R*  GGTGTGAGTAGTGCCTAGGT
12F*  GGGTTGGCTGGATGAGACTA
12R*  GTCTTCGACGCTGAGACAAA

Table 7: DELLY translocation partner links (Population filtered + PASS)

Chromosome.1  chromStart.1  chromEnd.1  Chromosome.2  chromStart.2  chromEnd.2
chr11  55052972  55052972  chr8  66514865  66514865
chr11  55053004  55053004  chr8  66514693  66514693
chr11  118460021  118460021  chr2  232580191  232580191
chr12  76237922  76237922  chr9  75288412  75288412
chr13  101167208  101167208  chr11  118460023  118460023
chr13  101167345  101167345  chr11  118359059  118359059
chr19  6288109  6288109  chr2  232580057  232580057
chr19  36066674  36066674  chr5  71146742  71146742
chr19  36066674  36066674  chr2  230045631  230045631
chr19  47336727  47336727  chr2  202146368  202146368
chrX  140202675  140202675  chr6  15666629  15666629

Table 8: 123SV translocation partner links (Population filtered + PASS)

Chromosome.1  chromStart.1  chromEnd.1  Chromosome.2  chromStart.2  chromEnd.2
chr11  118359059  118359406  chr13  101167343  101167821
chr11  118459586  118460020  chr2  232579800  232580190
chr11  55052627  55053288  chr8  66514693  66514864
chr14  22749185  22749625  chr14  22918118  22918563
chr12  76237618  76237921  chr9  75288412  75288682
chr11  118460023  118460445  chr13  101166833  101167207
chr19  6288109  6288391  chr2  232580057  232580310
chr18  23994838  23994987  chr7  109483923  109484072
chr10  73751705  73752012  chr4  84105452  84105640
chr21  36548698  36548847  chr22  35867777  35867926

Table 9: DELLY translocation partner links (PASS)

Chromosome.1  chromStart.1  chromEnd.1  Chromosome.2  chromStart.2  chromEnd.2
chr2  102912422  102912422  chr1  72361232  72361232
chr3  89509862  89509862  chr1  101978992  101978992
chr4  66413931  66413931  chr2  42052611  42052611
chr4  104214671  104214671  chr1  81404768  81404768
chr5  64467980  64467980  chr2  68914956  68914956
chr5  71146742  71146742  chr4  70296578  70296578
chr5  71146742  71146742  chr2  230045488  230045488
chr6  57575918  57575918  chr5  21573437  21573437
chr6  58776282  58776282  chr1  121485419  121485419
chr6  58777092  58777092  chr1  121484836  121484836
chr7  8886705  8886705  chr5  24370523  24370523
chr7  46904659  46904659  chr4  95518740  95518740
chr7  46904689  46904689  chr4  95518558  95518558
chr7  61968600  61968600  chr6  58779257  58779257
chr7  61968621  61968621  chr1  121485303  121485303
chr7  61969549  61969549  chr1  121484746  121484746
chr7  61969551  61969551  chr1  121485402  121485402
chr7  61969644  61969644  chr6  58777521  58777521
chr7  61970217  61970217  chr6  58777033  58777033
chr7  61970563  61970563  chr6  58776842  58776842
chr7  61970604  61970604  chr1  121484693  121484693
chr8  70602253  70602253  chr1  91853149  91853149
chr8  128533834  128533834  chr3  111274101  111274101
chr9  79186730  79186730  chr1  91852783  91852783
chr10  60902409  60902409  chr7  81789148  81789148
chr10  60902929  60902929  chr7  81789164  81789164
chr11  38812658  38812658  chr8  52731478  52731478
chr11  38812670  38812670  chr8  52730143  52730143
chr11  55053004  55053004  chr8  66514693  66514693
chr11  55053015  55053015  chr8  66514864  66514864
chr11  118460020  118460020  chr2  232580220  232580220
chr12  2858859  2858859  chr9  80931888  80931888
chr12  2858881  2858881  chr9  80932489  80932489
chr12  20704358  20704358  chr2  133012635  133012635
chr12  76237922  76237922  chr9  75288412  75288412
chr12  108203259  108203259  chr7  111053752  111053752
chr12  108203265  108203265  chr7  111053153  111053153
chr12  127650638  127650638  chr1  91852783  91852783
chr12  133066793  133066793  chr6  29814678  29814678
chr12  133088527  133088527  chr6  29814695  29814695
chr13  20371574  20371574  chr10  127574453  127574453
chr13  21727773  21727773  chr11  108585766  108585766
chr13  21750661  21750661  chr11  108585748  108585748
chr13  48856830  48856830  chr12  34017352  34017352
chr13  61461083  61461083  chr5  21899850  21899850
chr13  74313862  74313862  chr8  15289367  15289367
chr13  74314056  74314056  chr8  15289364  15289364
chr13  101167207  101167207  chr11  118460023  118460023
chr13  101167345  101167345  chr11  118359059  118359059
chr14  52667768  52667768  chr4  170280994  170280994
chr14  65447291  65447291  chr2  194545058  194545058
chr14  81786774  81786774  chr11  61841815  61841815
chr14  93712486  93712486  chr1  9121449  9121449
chr15  22486810  22486810  chr14  106484186  106484186
chr15  39994608  39994608  chr12  56990096  56990096
chr15  40854189  40854189  chr6  116774692  116774692
chr15  40854194  40854194  chr7  26252979  26252979
chr15  40854212  40854212  chr7  26245987  26245987
chr16  33428528  33428528  chr6  382461  382461
chr16  33963445  33963445  chr12  20704358  20704358
chr16  46387488  46387488  chr10  42597211  42597211
chr16  46388994  46388994  chr10  42388479  42388479
chr16  46391410  46391410  chr10  42597211  42597211
chr16  46391553  46391553  chr10  42387177  42387177
chr16  46392228  46392228  chr3  196625615  196625615
chr16  46396112  46396112  chr10  42384920  42384920
chr16  46398815  46398815  chr10  42388127  42388127
chr16  46399487  46399487  chr10  42599560  42599560
chr16  46404484  46404484  chr10  42596870  42596870
chr16  46405099  46405099  chr10  42597884  42597884
chr16  46407066  46407066  chr10  42599918  42599918
chr16  46407119  46407119  chr10  42597198  42597198
chr16  46407509  46407509  chr10  42388473  42388473
chr16  46425082  46425082  chr10  42385151  42385151
chr16  46432527  46432527  chr10  42385754  42385754
chr16  46433257  46433257  chr10  42385146  42385146
chr17  4959188  4959188  chr6  31823085  31823085
chr17  7167890  7167890  chr8  30145623  30145623
chr17  7167959  7167959  chr8  30145402  30145402
chr17  22253075  22253075  chr6  58778920  58778920
chr17  22253296  22253296  chr1  121485042  121485042
chr17  31149594  31149594  chr9  79186946  79186946
chr17  32063850  32063850  chr1  88972810  88972810
chr17  33478113  33478113  chr11  85195011  85195011
chr18  57070971  57070971  chr8  134972200  134972200
chr18  57070979  57070979  chr12  63569173  63569173
chr18  57071253  57071253  chr12  63569178  63569178
chr19  6288109  6288109  chr2  232580057  232580057
chr19  11422016  11422016  chr1  38241383  38241383
chr19  18835608  18835608  chr8  96265722  96265722
chr19  24027959  24027959  chr1  28515674  28515674
chr19  24033178  24033178  chr1  168025732  168025732
chr19  27731957  27731957  chr1  121485309  121485309
chr19  27738409  27738409  chr1  121485358  121485358
chr19  47336727  47336727  chr2  202146368  202146368
chr20  26190297  26190297  chr1  156186657  156186657
chr20  26220632  26220632  chr19  33444266  33444266
chr20  29653769  29653769  chr4  190201038  190201038
chr20  52021436  52021436  chr6  88217549  88217549
chr20  52021643  52021643  chr6  88217546  88217546
chr21  9825438  9825438  chr20  26188798  26188798
chr21  9827531  9827531  chr16  33963030  33963030
chr21  11020086  11020086  chr2  96515679  96515679
chr22  16347208  16347208  chr19  19632141  19632141
chr22  16347241  16347241  chr19  19632427  19632427
chr22  16959719  16959719  chr2  91925952  91925952
chr22  32575912  32575912  chr9  101477598  101477598
chr22  32927887  32927887  chr6  24684003  24684003
chr22  32928565  32928565  chr6  24683982  24683982
chr22  45991806  45991806  chr21  11022629  11022629
chrX  11953197  11953197  chr7  17094701  17094701
chrX  32001099  32001099  chr3  49218378  49218378
chrX  66982772  66982772  chr5  159349717  159349717
chrX  90335662  90335662  chr10  101854838  101854838
chrX  108297832  108297832  chr1  91853149  91853149
chrY  13469194  13469194  chr18  108495  108495
chrY  59027502  59027502  chr21  11054628  11054628

Figures:

Figure 1: Types of structural variants

There are many configurations that make each SV unique, but they can be broadly split into balanced or unbalanced SVs. Unbalanced SVs can be further subdivided into duplications or deletions of genomic regions. Balanced SVs typically encompass inversions and translocations of genomic regions. In the above figure, Segments G & H have translocated to the reference genome, while Segments D & E have been translocated elsewhere.
Figure 2: Different NGS approaches to mapping translocations

Paired-end reads are sequenced from the sample genome as shown above, where discordant read pairs span the DNA breakpoint with an expected insert size. However, when aligned to the reference genome, the read pairs realign to their original chromosomes, drastically changing the size of the inserts and resulting in discordant mapping. Split-reads are reads which span the breakpoint in the sample genome. However, after realignment to the reference, only a portion of the read maps onto the reference. Identifying the position of the remaining portion will identify the translocation partner. Assembly-based analyses are very effective in mapping SVs, because they do not rely upon a reference genome. However, all techniques are weak against repetitive regions.

Figure 3: Principles of Template Strand Sequencing

(a) Both parental homologs of Chromosome A, showing the Watson (orange, -) and Crick (blue, +) strands. (b) Culturing cells of interest in the presence of BrdU marks the nascent strand (grey) during replication. (c) Independent assortment of chromosomes into daughter cells demonstrates the various possible combinations of template strand inheritance. Each daughter cell inherits 1 of 4 possible combinations. After independent assortment, if both parental homologs are Watson strands, they are visualized in BAIT as an ideogram with the Watson-Watson strand-state. If both parental homologs are Crick strands, they are visualized in BAIT as an ideogram with the Crick-Crick strand-state. If parental homologs are inherited as alternating strands, then they are visualized in BAIT as an ideogram with the Watson-Crick strand-state.

Figure 4: BAIT ideogram representation

Ideogram profile of a single cell representing the inherited template strand states of each chromosome. Strand-state switch events are identified by the black arrow beside each chromosomal ideogram.
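The four inheritance combinations described for Figure 3(c) can be enumerated directly. The short sketch below is only an illustration of that counting argument; the function name `strand_state` is hypothetical, and the code assumes each homolog independently passes on either its Watson (W) or Crick (C) template strand.

```python
from collections import Counter
from itertools import product


def strand_state(homolog_a, homolog_b):
    """Collapse the template strands of two homologs into WW, WC or CC."""
    return "".join(sorted((homolog_a, homolog_b), reverse=True))


# Each homolog contributes either its Watson or its Crick template strand.
counts = Counter(strand_state(a, b) for a, b in product("WC", repeat=2))
print(counts)
```

Two of the four combinations collapse into the same Watson-Crick state, so under this assumption a chromosome appears as WC in half of the daughter cells and as WW or CC in a quarter each.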
Figure 5: The difference between an SCE event and a stable rearrangement visualized by BAIT’s ideogram profiles

Chromosome 11 has a stable rearrangement because of the recurrent nature of the strand-state switch event at the same genomic position. Chromosome 12 does not have a stable rearrangement as the strand-state switches are not recurrent; rather they follow a random distribution which is characteristic of random SCE events.

Figure 6: Template Strand Sequencing and translocations

Chromosome 11 is shown in black outline, while Chromosome 13 is shown in red outline. There exist 16 possible combinations of independent assortment when considering 2 chromosomes. Two examples are shown here, one showing how the translocation is visualised as a strand-state switch event, the other showing that even when a translocation is present, no strand-state switch may be seen. In principle, 50% of independent assortment combinations containing a translocation will not be visualized as a strand-state switch event.

Figure 7: Complex karyotype of the iALL826A cell line

Adapted from Henderson et al. 2008. The authors managed to verify an in-frame translocation between Exon 10 of the KMT2A gene in chromosome 11 and Exon 2 of the MLLT1 gene in chromosome 19, shown in the red box. They proposed a 2-step translocation mechanism. First there was a reciprocal translocation between chromosome 11 and chromosome 13. Next, the derivative chromosome 11 underwent a second round of translocations with chromosome 2 and chromosome 19 to generate the final karyotype configuration shown above.

Figure 8: UV profile of nuclei with or without BrdU incorporated

iALL826A cells were grown for 72 hours under treatment of 20uM BrdU (top panel), and without treatment (bottom panel). Nuclei were isolated using the NIB protocol and sorted on the BD Aria III FACS sorter, using a 480nm UV laser. Untreated nuclei displayed a UV signal at ~145 units.
Treated cells that successfully underwent one cell cycle with BrdU would have BrdU incorporated into the nascent strand, halving the UV signal to ~68 units. We sorted at gate P5, successfully sorting 5x 96-well plates for Strand-seq library preparation.

Figure 9: Principles of BreakpointR analysis

(a) A user-defined sliding window moves along an ideogram, adjusting itself to encapsulate an equal number of reads in each frame of the window, and calculates the change in the absolute number of Watson reads from the left frame to the right frame. The change is graphed, the top and bottom 10% are trimmed, and 3 standard deviations above the remaining mean is annotated as the breakpoint interval (black bar). (b) This process is repeated across all chromosomes, in each single cell. The breakpoints across multiple cells are compiled and disjointed (purple bar) by using the start and end positions of breakpoints to disjoint the breakpoint intervals. The disjointed genomic segments are compiled into a histogram. The threshold is a user-defined parameter to isolate the recurrent breakpoint interval, with a minimum of 50%. 1.6 standard deviations above the mean was used in this project.

Figure 10: Breakpoint mapping by BreakpointR

UCSC Genome Browser display showing the breakpoint intervals (black bars) of each single cell, along with the compiled disjointed histogram (purple). The red bar indicates the recurrent breakpoint interval identified by BreakpointR, encapsulating the blue dot which represents the WGS base resolution of the breakpoint. (a) Chromosome 13. (b) Chromosome 2. (c) Chromosome 19, the red star indicates the base resolution verified by the previous authors (Henderson et al. 2008). (d) Chromosome 11, the red star indicates the base resolution verified by the previous authors (Henderson et al. 2008).
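Whether a recurrent BreakpointR interval encapsulates the WGS position, as drawn in Figure 10, can be checked numerically. The sketch below uses the five FISH-validated intervals from Table 1 and one DELLY coordinate per breakpoint from Table 7. Picking a single representative DELLY row per breakpoint is my own simplification, and the labels `chr11_A` and `chr11_B` and the helper `distance_to_interval` are hypothetical names.

```python
def distance_to_interval(pos, start, end):
    """Return 0 if pos lies inside [start, end], else the gap in bp to the nearest edge."""
    if pos < start:
        return start - pos
    if pos > end:
        return pos - end
    return 0


# BreakpointR intervals (start, end) for the five validated breakpoints (Table 1).
intervals = {
    "chr2": (232561624, 232594080),
    "chr11_A": (118352526, 118383891),
    "chr11_B": (118454654, 118459401),
    "chr13": (101146277, 101204167),
    "chr19": (6274544, 6307515),
}

# One DELLY breakpoint coordinate per site (Table 7); the row chosen is an assumption.
wgs = {
    "chr2": 232580057,
    "chr11_A": 118359059,
    "chr11_B": 118460023,
    "chr13": 101167345,
    "chr19": 6288109,
}

gaps = {name: distance_to_interval(wgs[name], *intervals[name]) for name in intervals}
for name, gap in gaps.items():
    print(name, "inside interval" if gap == 0 else f"outside by {gap}bp")
```

With these coordinates, four of the five intervals contain the WGS position, and the second chromosome 11 interval falls short by 622bp, in line with the deviation of roughly 650bp reported in the text.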
Figure 11: Principles of mapping translocation partners (Part 1)

Step 1: Identify the combination of inherited template strands and extrapolate the expected chromosomal ideogram. This is done by knowing the position of centromeres, and considering the reads surrounding that region. This assumes that the reads surrounding the centromeric region should reflect the original inheritance pattern, unaffected by SVs. Step 2: Annotate the strand-state of the deviant chromosome fragment by comparing the expected ideogram to the observed ideogram. The deviant strand-state on the observed ideogram is the strand-state of the deviant chromosome fragment.

Figure 12: Principles of mapping translocation partners (Part 2)

Step 3: Determine how the deviant fragment for the chromosome of interest (chromosome 4) matches with the expected inheritance pattern of the remaining chromosomes. Matches can occur in normal configuration when fragments are translocated but maintain 5’ -> 3’ directionality. Inverted matches occur when fragments are translocated but switch directionality during the translocation. Step 4: Check the consistency of fragment configuration matching across multiple cells. The assumption is that TRUE translocation partners will always match in the same configuration regardless of the combination of inherited template strands. In this case, the fragment from chromosome 4 consistently matches the inheritance pattern of chromosome 11’s template strands in the Normal configuration, indicating that the fragment of chromosome 4 was translocated to chromosome 11 in Normal configuration. Repeating steps 1-4 for chromosome 11 will show that the fragment of chromosome 11 was translocated in Normal configuration to chromosome 4.

Figure 13: Results of translocation partner matching using Strand-seq data

The workflow for matching translocation partners was capable of correctly matching 4 of 5 translocated fragments.
A fragment of chromosome 13 was translocated to chromosome 2 in Normal configuration. A fragment chromosome 2 was translocated to chromosome 19 in the Inverted configuration. A fragment of chromosome 11 was translocated to chromosome 13 in the Normal configuration. A fragment of chromosome 19 was translocated to chromosome 11 in the Inverted configuration. However, a 100kb fragment of chromosome 11 was unmapped. This is a limitation of low coverage sequencing used in Strand-seq. The 100kb fragment did not have enough reads for genotyping. Because it could not be assigned a strand-state, it could not be used to match the expected inheritance pattern of the other chromosomes.   61  Figure 14: Single cell resolution of Strand-seq identified additional layers of complexity   By knowing the complete karyotype, I reconstructed the chromosome ideograms of each single cell to study the breakpoint positions on chromosome 11. The two breakpoints (A and B) of chromosome 11 can be predicted from the reconstructed ideograms. However, in 30/31 cells, the predicted breakpoint sites did not match the ideogram profiles. Due to the high incidence of mismatch, an inversion of the 100kb fragment was predicted by Strand-seq.   62  Figure 15: PCR validation of predicted inversion    This cartoon diagram shows the primers which were used to validate the inversion of the 100kb fragment of chromosome 11. Sequencing primers were designed surrounding the breakpoint sites to capture the breakpoint and the surrounding chromosome. The PCR products were sent for Sanger sequencing to verify the chromosome origins surrounding the breakpoints. Assuming no inversion, I was unable to capture a PCR product which could be sent for sequencing. However, assuming the presence of an inversion, I could capture PCR products which after sequencing and realignment verified the inversion.                  
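Steps 3 and 4 of the partner-matching workflow (Figures 11 and 12) can be illustrated with a short Python sketch. It is a hedged illustration rather than the workflow actually used in this project: the function name, the per-cell input format, and the treatment of WC cells as matching both configurations are assumptions introduced here.

```python
# Strand-state of a fragment after inversion (assumption of this sketch:
# WW and CC swap, while WC is unchanged and therefore uninformative
# about orientation).
COMPLEMENT = {"WW": "CC", "CC": "WW", "WC": "WC"}


def consistent_partners(cells, source_chrom):
    """Return (chromosome, configuration) pairs that match in every cell.

    cells: list of dicts, one per single cell, each with
        'fragment': strand-state of the deviant fragment ('WW', 'WC', 'CC')
        'expected': dict mapping chromosome name -> expected strand-state
    source_chrom: chromosome the fragment originates from (skipped).
    """
    candidates = None
    for cell in cells:
        frag = cell["fragment"]
        matches = set()
        for chrom, state in cell["expected"].items():
            if chrom == source_chrom:
                continue
            if state == frag:
                matches.add((chrom, "Normal"))
            if state == COMPLEMENT[frag]:
                matches.add((chrom, "Inverted"))
        # Step 4: keep only candidates that match consistently across cells.
        candidates = matches if candidates is None else candidates & matches
    return candidates or set()


# Hypothetical strand-states for three cells (illustrative values only).
cells = [
    {"fragment": "WW",
     "expected": {"chr4": "CC", "chr11": "WW", "chr2": "CC", "chr13": "WC"}},
    {"fragment": "CC",
     "expected": {"chr4": "WW", "chr11": "CC", "chr2": "CC", "chr13": "WW"}},
    {"fragment": "WC",
     "expected": {"chr4": "WW", "chr11": "WC", "chr2": "WW", "chr13": "CC"}},
]
print(consistent_partners(cells, "chr4"))
```

With these made-up states, chromosome 2 and chromosome 13 each match in some cells but not in the same configuration throughout, so only chromosome 11 in the Normal configuration survives the intersection. This mirrors why consistency across many cells is needed to separate true partners from chance matches.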
Figure 16: Circos plots showing translocation links from each NGS approach

Red links were verified by FISH, green links passed quality controls, and blue links are accurate partner mappings that do not reflect conventional links. (a) Strand-seq translocation mapping. Correctly mapped 4 of 5 translocation partners. The blue link represents the mapping of the fragment of chromosome 13 to chromosome 2, but does not reflect the 100kb fragment of chromosome 11. (b) WGS analysis using the DELLY pipeline, filtered for quality-pass links only (119 links). Correctly mapped 4 of 5 translocation partners, missing the in-frame fusion between chromosome 11 and chromosome 19. (c) WGS analysis using the DELLY pipeline, filtered for population variance and quality-pass links (11 links). Correctly mapped 4 of 5 translocation partners, missing the in-frame fusion between chromosome 11 and chromosome 19. (d) WGS analysis using a custom 123SV pipeline developed at ERIBA. Links were filtered for population variance and quality-pass links (10 links). Correctly mapped 4 of 5 translocation partners, missing the in-frame fusion between chromosome 11 and chromosome 19.

Figure 17: Revised model for the mechanism of translocation formation

Taking into account the inversion of the 100kb fragment from chromosome 11, the 2-step model, if true, would require an additional step to account for the inversion. I proposed a 1-step model whereby all chromosomes experienced cataclysmic shattering before rejoining into the complex karyotype observed.

References:

1. Sudmant, P.H., et al., An integrated map of structural variation in 2,504 human genomes. Nature, 2015. 526(7571): p. 75-81.
2. Weischenfeldt, J., et al., Phenotypic impact of genomic structural variation: insights from and for human disease. Nat Rev Genet, 2013. 14(2): p. 125-38.
3. Gordon, D.J., B. Resio, and D. Pellman, Causes and consequences of aneuploidy in cancer. Nat Rev Genet, 2012. 13(3): p. 189-203.
4. Feuk, L., A.R. Carson, and S.W. Scherer, Structural variation in the human genome. Nat Rev Genet, 2006. 7(2): p. 85-97.
5. Conrad, D.F., et al., Origins and functional impact of copy number variation in the human genome. Nature, 2010. 464(7289): p. 704-12.
6. Pang, A.W., et al., Towards a comprehensive structural variation map of an individual human genome. Genome Biol, 2010. 11(5): p. R52.
7. Weckselblatt, B. and M.K. Rudd, Human Structural Variation: Mechanisms of Chromosome Rearrangements. Trends Genet, 2015. 31(10): p. 587-99.
8. Zhang, Y., et al., Child development and structural variation in the human genome. Child Dev, 2013. 84(1): p. 34-48.
9. Lupski, J.R., Structural variation mutagenesis of the human genome: Impact on disease and evolution. Environ Mol Mutagen, 2015. 56(5): p. 419-36.
10. Weckselblatt, B., K.E. Hermetz, and M.K. Rudd, Unbalanced translocations arise from diverse mutational mechanisms including chromothripsis. Genome Res, 2015. 25(7): p. 937-47.
11. Makoff, A.J. and R.H. Flomen, Detailed analysis of 15q11-q14 sequence corrects errors and gaps in the public access sequence to fully reveal large segmental duplications at breakpoints for Prader-Willi, Angelman, and inv dup(15) syndromes. Genome Biol, 2007. 8(6): p. R114.
12. Ait Yahya-Graison, E., et al., Classification of human chromosome 21 gene-expression variations in Down syndrome: impact on disease phenotypes. Am J Hum Genet, 2007. 81(3): p. 475-91.
13. Groschel, S., et al., A single oncogenic enhancer rearrangement causes concomitant EVI1 and GATA2 deregulation in leukemia. Cell, 2014. 157(2): p. 369-381.
14. Kurzrock, R., J.U. Gutterman, and M. Talpaz, The molecular genetics of Philadelphia chromosome-positive leukemias. N Engl J Med, 1988. 319(15): p. 990-8.
15. Lupianez, D.G., et al., Disruptions of topological chromatin domains cause pathogenic rewiring of gene-enhancer interactions. Cell, 2015. 161(5): p. 1012-1025.
16. Alkan, C., B.P. Coe, and E.E. Eichler, Genome structural variation discovery and genotyping. Nat Rev Genet, 2011. 12(5): p. 363-76.
17. Greaves, M.F. and J. Wiemels, Origins of chromosome translocations in childhood leukaemia. Nat Rev Cancer, 2003. 3(9): p. 639-49.
18. Nickoloff, J.A., et al., Mechanisms of leukemia translocations. Curr Opin Hematol, 2008. 15(4): p. 338-45.
19. Lieber, M.R., K. Yu, and S.C. Raghavan, Roles of nonhomologous DNA end joining, V(D)J recombination, and class switch recombination in chromosomal translocations. DNA Repair (Amst), 2006. 5(9-10): p. 1234-45.
20. Kaushik Tiwari, M., et al., Triplex structures induce DNA double strand breaks via replication fork collapse in NER deficient cells. Nucleic Acids Res, 2016. 44(16): p. 7742-54.
21. Felix, C.A., Secondary leukemias induced by topoisomerase-targeted drugs. Biochim Biophys Acta, 1998. 1400(1-3): p. 233-55.
22. Vignard, J., G. Mirey, and B. Salles, Ionizing-radiation induced DNA double-strand breaks: a direct and indirect lighting up. Radiother Oncol, 2013. 108(3): p. 362-9.
23. Rezaee, M., L. Sanche, and D.J. Hunting, Cisplatin enhances the formation of DNA single- and double-strand breaks by hydrated electrons and hydroxyl radicals. Radiat Res, 2013. 179(3): p. 323-31.
24. Albino, A.P., et al., Induction of DNA double-strand breaks in A549 and normal human pulmonary epithelial cells by cigarette smoke is mediated by free radicals. Int J Oncol, 2006. 28(6): p. 1491-505.
25. Raghavan, S.C. and M.R. Lieber, DNA structures at chromosomal translocation sites. Bioessays, 2006. 28(5): p. 480-94.
26. Kanaar, R., J.H. Hoeijmakers, and D.C. van Gent, Molecular mechanisms of DNA double strand break repair. Trends Cell Biol, 1998. 8(12): p. 483-9.
27. Khanna, K.K. and S.P. Jackson, DNA double-strand breaks: signaling, repair and the cancer connection. Nat Genet, 2001. 27(3): p. 247-54.
28. Ren, R., Mechanisms of BCR-ABL in the pathogenesis of chronic myelogenous leukaemia. Nat Rev Cancer, 2005. 5(3): p. 172-83.
29. Edwards, P.A., Fusion genes and chromosome translocations in the common epithelial cancers. J Pathol, 2010. 220(2): p. 244-54.
30. Tan, P.H., et al., Renal tumors: diagnostic and prognostic biomarkers. Am J Surg Pathol, 2013. 37(10): p. 1518-31.
31. Nambiar, M., V. Kari, and S.C. Raghavan, Chromosomal translocations in cancer. Biochim Biophys Acta, 2008. 1786(2): p. 139-52.
32. Wan, T.S., Cancer cytogenetics: methodology revisited. Ann Lab Med, 2014. 34(6): p. 413-25.
33. Speicher, M.R. and N.P. Carter, The new cytogenetics: blurring the boundaries with molecular biology. Nat Rev Genet, 2005. 6(10): p. 782-92.
34. Tattini, L., R. D'Aurizio, and A. Magi, Detection of Genomic Structural Variants from Next-Generation Sequencing Data. Front Bioeng Biotechnol, 2015. 3: p. 92.
35. Ford, C.E. and J.L. Hamerton, The chromosomes of man. Nature, 1956. 178(4541): p. 1020-3.
36. Caspersson, T., L. Zech, and C. Johansson, Analysis of human metaphase chromosome set by aid of DNA-binding fluorescent agents. Exp Cell Res, 1970. 62(2): p. 490-2.
37. Levsky, J.M. and R.H. Singer, Fluorescence in situ hybridization: past, present and future. J Cell Sci, 2003. 116(Pt 14): p. 2833-8.
38. Schrock, E., et al., Multicolor spectral karyotyping of human chromosomes. Science, 1996. 273(5274): p. 494-7.
39. Hande, M.P., et al., Telomere length dynamics and chromosomal instability in cells derived from telomerase null mice. J Cell Biol, 1999. 144(4): p. 589-601.
40. Kallioniemi, A., et al., Comparative genomic hybridization for molecular cytogenetic analysis of solid tumors. Science, 1992. 258(5083): p. 818-21.
41. Abel, H.J. and E.J. Duncavage, Detection of structural DNA variation from next generation sequencing data: a review of informatic approaches. Cancer Genet, 2013. 206(12): p. 432-40.
42. Reuter, J.A., D.V. Spacek, and M.P. Snyder, High-throughput sequencing technologies. Mol Cell, 2015. 58(4): p. 586-97.
43. Sims, D., et al., Sequencing depth and coverage: key considerations in genomic analyses. Nat Rev Genet, 2014. 15(2): p. 121-32.
44. Nakagawa, H., et al., Cancer whole-genome sequencing: present and future. Oncogene, 2015. 34(49): p. 5943-50.
45. Korbel, J.O., et al., Paired-end mapping reveals extensive structural variation in the human genome. Science, 2007. 318(5849): p. 420-6.
46. Rausch, T., et al., DELLY: structural variant discovery by integrated paired-end and split-read analysis. Bioinformatics, 2012. 28(18): p. i333-i339.
47. Mills, R.E., et al., An initial map of insertion and deletion (INDEL) variation in the human genome. Genome Res, 2006. 16(9): p. 1182-90.
48. Schroder, J., et al., Socrates: identification of genomic rearrangements in tumour genomes by re-aligning soft clipped reads. Bioinformatics, 2014. 30(8): p. 1064-1072.
49. Myers, E.W., et al., A whole-genome assembly of Drosophila. Science, 2000. 287(5461): p. 2196-204.
50. Chaisson, M.J., R.K. Wilson, and E.E. Eichler, Genetic variation and the de novo assembly of human genomes. Nat Rev Genet, 2015. 16(11): p. 627-40.
51. Falconer, E., et al., DNA template strand sequencing of single-cells maps genomic rearrangements at high resolution. Nat Methods, 2012. 9(11): p. 1107-12.
52. Sanders, A.D., et al., Single-cell template strand sequencing by Strand-seq enables the characterization of individual homologs. Nat Protoc, 2017. 12(6): p. 1151-1176.
53. Hills, M., et al., BAIT: Organizing genomes and mapping rearrangements in single cells. Genome Med, 2013. 5(9): p. 82.
54. Aird, D., et al., Analyzing and minimizing PCR amplification bias in Illumina sequencing libraries. Genome Biol, 2011. 12(2): p. R18.
55. Benjamini, Y. and T.P. Speed, Summarizing and correcting the GC content bias in high-throughput sequencing. Nucleic Acids Res, 2012. 40(10): p. e72.
56. Ross, M.G., et al., Characterizing and measuring bias in sequence data. Genome Biol, 2013. 14(5): p. R51.
57. Sanders, A.D., et al., Characterizing polymorphic inversions in human genomes by single-cell sequencing. Genome Res, 2016. 26(11): p. 1575-1587.
58. Porubsky, D., et al., Direct chromosome-length haplotyping by single-cell sequencing. Genome Res, 2016. 26(11): p. 1565-1574.
59. O'Neill, K., et al., Assembling draft genomes using contiBAIT. Bioinformatics, 2017. 33(17): p. 2737-2739.
60. Henderson, M.J., et al., A xenograft model of infant leukaemia reveals a complex MLL translocation. Br J Haematol, 2008. 140(6): p. 716-9.
61. Li, H. and R. Durbin, Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics, 2009. 25(14): p. 1754-60.
62. Kloosterman, W.P., et al., Chromothripsis as a mechanism driving complex de novo structural rearrangements in the germline. Hum Mol Genet, 2011. 20(10): p. 1916-24.
63. Gawad, C., W. Koh, and S.R. Quake, Single-cell genome sequencing: current state of the science. Nat Rev Genet, 2016. 17(3): p. 175-88.
64. Stegle, O., S.A. Teichmann, and J.C. Marioni, Computational and analytical challenges in single-cell transcriptomics. Nat Rev Genet, 2015. 16(3): p. 133-45.
65. Huang, L., et al., Single-Cell Whole-Genome Amplification and Sequencing: Methodology and Applications. Annu Rev Genomics Hum Genet, 2015. 16: p. 79-102.
66. Shapiro, E., T. Biezuner, and S. Linnarsson, Single-cell sequencing-based technologies will revolutionize whole-organism science. Nat Rev Genet, 2013. 14(9): p. 618-30.
67. Baslan, T. and J. Hicks, Single cell sequencing approaches for complex biological systems. Curr Opin Genet Dev, 2014. 26: p. 59-65.
68. Wang, Y. and N.E. Navin, Advances and applications of single-cell sequencing technologies. Mol Cell, 2015. 58(4): p. 598-609.
69. Grun, D. and A. van Oudenaarden, Design and Analysis of Single-Cell Sequencing Experiments. Cell, 2015. 163(4): p. 799-810.
70. Liang, J., W. Cai, and Z. Sun, Single-cell sequencing technologies: current and future. J Genet Genomics, 2014. 41(10): p. 513-28.
71. Korbel, J.O. and P.J. Campbell, Criteria for inference of chromothripsis in cancer genomes. Cell, 2013. 152(6): p. 1226-36.
72. van Wietmarschen, N. and P.M. Lansdorp, Bromodeoxyuridine does not contribute to sister chromatid exchange events in normal or Bloom syndrome cells. Nucleic Acids Res, 2016. 44(14): p. 6787-93.
73. Lu, H., F. Giordano, and Z. Ning, Oxford Nanopore MinION Sequencing and Genome Assembly. Genomics Proteomics Bioinformatics, 2016. 14(5): p. 265-279.
74. Rhoads, A. and K.F. Au, PacBio Sequencing and Its Applications. Genomics Proteomics Bioinformatics, 2015. 13(5): p. 278-89.
75. Merker, J.D., et al., Long-read genome sequencing identifies causal structural variation in a Mendelian disease. Genet Med, 2017.
76. Liu, Q., et al., Interrogating the "unsequenceable" genomic trinucleotide repeat disorders by long-read sequencing. Genome Med, 2017. 9(1): p. 65.

Appendix 1: Gel image validating inversion

Lane 1 is the 1kb+ ladder. The position of the 2kb band is marked. Lanes 2-5 are the validated chromosome-matched primer pairs, BP2, BP11A, BP11B, BP13. Lanes 6-7 are the translocation-matched primer pairs, chr2F-chr11AR and chr11BF-chr13R. No discernible bands were found. Lanes 8-9 are the translocation-matched primer pairs designed taking the inversion of DF100k into consideration, chr2F-chr11BF and chr11AR-chr13R. Sanger sequencing and alignment of the PCR products verified the presence of an inversion of DF100k.

Appendix 2: Gel image showing no extraneous translocations

Lane 1 is the 1kb+ ladder. The position of the 2kb band is marked. Lanes 2-5 are the translocation-matched primer pairs 11F-8R, 11R-8F, 11F-8F, 11R-8R. No discernible bands were found. Lanes 6-7 are the validated chromosome-matched primer pairs, chr11, chr8. Lanes 8-11 are the translocation-matched primer pairs, 12F-9R, 12R-9F, 12F-9F, 12R-9R. No discernible bands were found. Lanes 12-13 are the validated chromosome-matched primer pairs, chr12, chr9.
