UBC Theses and Dissertations

Mapping structural rearrangements in single cells by template strand sequencing to explore inversions… Sanders, Ashley Diane 2015

MAPPING STRUCTURAL REARRANGEMENTS IN SINGLE CELLS BY TEMPLATE STRAND SEQUENCING TO EXPLORE INVERSIONS IN THE HUMAN GENOME

by

ASHLEY DIANE SANDERS

B.Sc., the University of Toronto, 2009

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF

DOCTOR OF PHILOSOPHY

in

The Faculty of Graduate and Postdoctoral Studies

(Cell and Developmental Biology)

THE UNIVERSITY OF BRITISH COLUMBIA

(Vancouver)

December 2015

© Ashley Diane Sanders, 2015

Abstract

Studies of genome heterogeneity and plasticity aim to resolve how genomic features underlie phenotypes and disease susceptibilities. Identifying genomic features that differ between individuals and cells can help uncover the functional variants that drive specific biological outcomes. For this, single cell studies are paramount, as characterizing the contribution of rare but functional cellular subpopulations is important for disease prognosis, management and progression. Until now, these studies have been challenged by our inability to map structural variants accurately and comprehensively. To overcome this, I employed the template strand sequencing method, Strand-seq, to preserve the organization and structure of individual homologues and visualize structural rearrangements in single cells. Using Strand-seq, I monitored homologue states in human genomes to quantify the degree of somatic rearrangements, and distinguished these from recurrent structural variants, such as inherited inversions. In so doing, I created an innovative tool to rapidly discover, map, and genotype structural polymorphisms with unprecedented resolution. Next, to facilitate systematic analyses of Strand-seq data, I developed novel bioinformatic software that locates putative genomic rearrangements in single cells and identifies recurrent rearrangements across multiple cells. This provides an essential instrument for unbiased and non-targeted structural variant discovery in a high-throughput approach, helping to scale Strand-seq for population-based studies. Applying these tools, I explored the distribution and frequency of structural variation in a heterogeneous cell population to discover and genotype over 100 inversions in the human genome. I found that significant structural heterogeneity resides in definable polymorphic domains and within complex and repetitive regions of our genome. Finally, I extended my strategy to comprehensively map the complete set of inversions in an individual’s genome and define their unique invertome. Comparing two invertomes, I found that sets of inversions can be combined to make predictions about the ancestry and health of an individual, and I characterized the architectural features of inversion breakpoints with base-pair resolution. Taken together, I describe a powerful new framework to study structural rearrangements and genomic heterogeneity in single cell samples, whether from individuals for population studies, or tissues for biomarker discovery.

Preface

The introduction provided in Chapter 1 represents a comprehensive review of relevant literature and provides the background for the research objectives. I conceived and wrote all aspects of this Chapter, and generated the comprehensive timeline of landmark findings in the human inversion field.

Parts of Chapter 2 describing the concept of Strand-seq as a single cell sequencing method and a tool for mapping somatic genomic rearrangements were published in: Falconer E, Hills M, Naumann U, Poon SS, Chavez EA, Sanders AD, Zhao Y, Hirst M, Lansdorp PM. DNA template strand sequencing of single-cells maps genomic rearrangements at high resolution. Nature Methods. 2012 Nov; 9(11):1107-1112. I manually prepared libraries for analysis in this manuscript using the protocol developed by Dr. Ester Falconer, a postdoctoral fellow in the laboratory. The rest of the Chapter describes genomic rearrangements in human cells and is original work recently submitted for publication. I personally generated the libraries, performed all experiments, analyzed the data and produced all figures. Dr. Mark Hills, a postdoctoral fellow in the laboratory, helped with library alignment using the bioinformatic analysis of inherited templates (BAIT) package.

Chapter 3 describes a custom-built bioinformatic analysis tool for locating genomic rearrangements in Strand-seq data, called Invert.R. This tool was designed and scripted in collaboration with Dr. Mark Hills. I personally performed the subsequent refinement and validation of the software, and extended it for population-scale studies. The bioinformatic software will be made open-source, and will be integrated into a comprehensive analysis suite tailored for Strand-seq data in collaboration with David Porubsky, Aaron Taudt and Dr. Mark Hills. The working title of this suite is ‘Strand-Seek.R’, which will integrate breakpoint calling, structural variant mapping, copy number variant calling, and haplotyping tools, and is planned for release in 2016.

Chapters 4 and 5 describe the application of these tools for characterizing inversion profiles in populations and people. I designed the project, conceived experiments, and executed all experiments in these Chapters, with the guidance of my supervisor Dr. Peter Lansdorp, and mentors Dr. Ester Falconer and Dr. Mark Hills. Collaborators David Knapp and Dr. Connie Eaves provided cord blood cell samples, acquired through the Stem Cell Assay Laboratory at the Terry Fox Laboratory (TFL). David Knapp, Wenbo Xu and David Ko from the Flow Core Facility at the TFL helped with cell sorting. Strand-seq libraries were prepared using the newly scaled and automated protocol refined by Dr. Ester Falconer.
These were sequenced at the Genome Sciences Center (Vancouver), and sequencing reads were aligned using BAIT, with the help of Dr. Mark Hills. I performed the rest of the wet and dry experiments, including cell culture, cell isolation, library preparation, data collection, data analysis and figure preparation. Exceptions are the dot plot graphs shown in Appendix C, which were prepared by Dr. Mark Hills. A version of this work has been submitted for publication, which I personally wrote and expanded in this dissertation.

Chapter 6 provides a summary of the findings reported in this dissertation, along with a general discussion of the major implications of this work. Chapter 7 details all the materials and methods required for producing and analyzing the data presented in this dissertation.

Table of contents

Abstract
Preface
Table of contents
List of tables
List of figures
List of abbreviations
Glossary
Acknowledgements
Dedication
Chapter 1 | General introduction
  1.1 Heterogeneity in human genomes
    1.1.1 Associating genotypes with phenotypes
    1.1.2 Clinical relevance for studying heterogeneity in cells and individuals
  1.2 Detecting structural variants by DNA sequencing
    1.2.1 Sequence read alignment
    1.2.2 Paired read signals
    1.2.3 Read depth signals
    1.2.4 Split read signals
    1.2.5 Integrating signals
  1.3 Inversions are an elusive structural variant
    1.3.1 Inversions change the organization of a chromosome
    1.3.2 Known inversions in human genomes
    1.3.3 Inversions linked to human diseases
  1.4 Techniques to study inversions
    1.4.1 Traditional cytogenetics
    1.4.2 Molecular cytogenetics
    1.4.3 Computation cytogenetics
    1.4.4 Detection by sequencing
  1.5 Aims and scope of dissertation
Chapter 2 | A high-resolution and single cell approach to characterize genomic rearrangements using Strand-seq
  2.1 Introduction
  2.2 Results
    2.2.1 Visualizing the orientation of homologues by sequencing template strands of single cells
    2.2.2 Localizing genomic rearrangements in single cells based on template strand orientation
    2.2.3 Localizing inherited genomic rearrangements in single cells based on recurrent template strand changes
    2.2.4 Reliable inversion genotyping of genomic variants in single cells
    2.2.5 High-resolution breakpoint mapping of genomic rearrangements in single cells
  2.3 Discussion
  2.4 Conclusion
Chapter 3 | A bioinformatic tool to systematically characterize structural rearrangements in Strand-seq libraries
  3.1 Introduction
  3.2 Results
    3.2.1 Tracking template strand states using local W/C ratios
    3.2.2 Characterizing putative inversions in single cells based on W/C ratios
    3.2.3 Finding recurrent inversions and refining breakpoints across multiple cells
  3.3 Discussion
  3.4 Conclusion
Chapter 4 | Characterizing the distribution and frequency of polymorphic rearrangements in a mixed population of single cells
  4.1 Introduction
  4.2 Results
    4.2.1 Exploring genomic variation in a heterogeneous human population by single cell analyses
    4.2.2 Annotating the human reference assembly – locating potential minor alleles and misorients
    4.2.3 Annotating the human reference assembly – predicting under-represented repetitive elements
    4.2.4 Non-targeted characterization of polymorphic inversions in a normal human population
  4.3 Discussion
  4.4 Conclusion
Chapter 5 | Defining the complete inversion profile of an individual to study their unique invertome
  5.1 Introduction
  5.2 Results
    5.2.1 Characterizing the set of inversions present in an individual’s genome to define their invertome
    5.2.2 A side-by-side comparison of two invertomes reveals the unique distribution of inversions in a human genome
    5.2.3 Exploring the genomic architectural features surrounding inversion breakpoints
  5.3 Discussion
  5.4 Conclusion
Chapter 6 | General discussion and conclusion
  6.1 Overview of findings
  6.2 Emergent themes, implications and limitations
  6.3 Proposed future directions
    6.3.1 Exploring mechanistic details of genomic rearrangements
    6.3.2 Potential applications for personalized medicine
  6.4 Concluding remarks
Chapter 7 | Materials and methods
  7.1 Primary human cell sources
  7.2 Building Strand-seq libraries
    7.2.1 Cell selection and culture
    7.2.2 Capturing cells after one cell division
    7.2.3 Strand-seq library construction
    7.2.4 Size-selection and sequencing
  7.3 Analyzing Strand-seq libraries
    7.3.1 Localizing inversions in single cells
    7.3.2 Analyzing inversions across multiple cells
References
Appendix A | Library statistics
Appendix B | Chromosome-level Circos plots
Appendix C | Base pair-level dot plots

List of tables

Table 1-1 | Sequencing signatures of structural variants
Table 4-1 | Potential misorients or minor alleles in reference genome
Table 4-2 | Always Watson Crick (AWC) regions in Strand-seq data
Table 4-3 | Polymorphic inversions in the pooled cord blood population
Table 5-1 | Informative chromosomes used to build invertomes
Table 5-2 | Invertome of the adult male
Table 5-3 | Invertome of the newborn female
Table 7-1 | Antibodies used to label mitotic human hematopoietic cells

List of figures

Figure 1-1 | Common types of structural variation found in the human genome
Figure 1-2 | Sequence signature of inversions
Figure 1-3 | Example of a genomic inversion
Figure 1-4 | Timeline of major historical events in human inversion studies
Figure 1-5 | Inversions reported in the human genome
Figure 1-6 | Common cytogenetic methods used to study polymorphic inversions
Figure 2-1 | Strand-seq sequences template strands inherited by daughter cells
Figure 2-2 | Division kinetics of single-sorted hematopoietic cells
Figure 2-3 | Strand-seq libraries of a pair of sister cells
Figure 2-4 | Strand-seq examples of paired sister cells
Figure 2-5 | Diagram of a sister chromatid exchange event in Strand-seq library
Figure 2-6 | Sister chromatid exchange events mapped in the human genome
Figure 2-7 | Locating and genotyping inversions in Strand-seq libraries
Figure 2-8 | Polymorphic inversion located on the p-arm of chr8
Figure 2-9 | Manually refining inversion breakpoints in a sister cell pair
Figure 2-10 | Manually refining inversion breakpoints of homozygous inversions
Figure 3-1 | Invert.R – a custom bioinformatic approach to tracking template strands
Figure 3-2 | A read-based sliding window binning strategy to analyze Strand-seq data
Figure 3-3 | Calculating W/C ratios of Strand-seq data at known locations of template strand state changes
Figure 3-4 | Using Invert.R to predict inversions based on W/C ratios and genotypes based on ∆ W/C values
Figure 3-5 | Identifying a homozygous and heterozygous inversion in a single cell using Invert.R
Figure 3-6 | BAIT ideogram of HsSs_0123
Figure 3-7 | Refining inversion breakpoints across multiple single cells using Invert.R
Figure 3-8 | Mapping a heterozygous and homozygous inversion in multiple single cells using Invert.R
Figure 3-9 | Refining inversion breakpoints by consensus
Figure 3-10 | Mapping template strand state changes in whole chromosomes using Invert.R
Figure 3-11 | Locating recurrent inversions in a non-targeted approach using Invert.R
Figure 4-1 | Read densities of Strand-seq libraries in the pooled cord blood (CB) dataset
Figure 4-2 | Mapping genomic rearrangements in a mixed cell sample
Figure 4-3 | Recurrent genomic rearrangements in a mixed donor cell population, as predicted by Invert.R
Figure 4-4 | Complex regions of interest containing multiple structural variants
Figure 4-5 | Example of an AWC region, potential misorients or minor allele region, and polymorphic inversion from chr10
Figure 4-6 | Heterozygous and homozygous frequency of the genotyped ROIs
Figure 4-7 | Misoriented regions and minor alleles in the human reference assembly
Figure 4-8 | Always Watson Crick (AWC) regions mark underrepresented repetitive elements in the human reference assembly
Figure 4-9 | Size and genomic distributions of polymorphic inversions identified in the mixed population
Figure 4-10 | Allelic frequencies of polymorphic inversions identified in the mixed population
Figure 4-11 | Polymorphic domains containing clusters of inversions identified in a mixed population
Figure 4-12 | Cluster analysis of the inversion profiles for individual cells in the mixed population
Figure 5-1 | Overall read densities of Strand-seq libraries used to build invertomes
Figure 5-2 | Strategy for generating a high-density directional composite file
Figure 5-3 | Putative inversions in an adult male, as predicted by Invert.R
Figure 5-4 | The invertome of an adult male
Figure 5-5 | Increased read densities of directional composite files
Figure 5-6 | Putative inversions in a newborn female, as predicted by Invert.R
Figure 5-7 | The invertome of a newborn female
Figure 5-8 | Chromosomal resolution of the genomic features surrounding inversions
Figure 5-9 | Correlation between palindromic segmental duplications and inversions
Figure 5-10 | Distribution of all structural features identified by Strand-seq
Figure 5-11 | Base pair resolution of the genomic features surrounding inversions
Figure 6-1 | Rapid high-throughput population studies of structural variants
Figure 6-2 | Estimated cost for constructing and sequencing Strand-seq libraries
Figure 6-3 | Summary of potential applications of Strand-seq for structural variant detection
Figure 7-1 | Gating strategy used to enrich mitotic human hematopoietic cells
Figure 7-2 | Cell cycle kinetics of hematopoietic cells in increasing concentrations of BrdU
Figure 7-3 | BrdU quenches Hoechst fluorescence in hemi-substituted human hematopoietic cells
Figure 7-4 | Pooling and size-selection strategy of Strand-seq libraries for sequencing
Figure 7-5 | Determining the best fit genotype by Fisher exact test and Chi square test
Figure 7-6 | Strategy for refining files using BEDtools

List of abbreviations

AML  Acute myeloblastic leukemia
BM  Bone marrow
bp  Base pair
BrdU  5-Bromo-2’-deoxyuridine
C  Crick (template strand orientation)
CC  Crick-Crick (homologue inheritance)
CB  Cord blood
Chr  Chromosome
DGV  Database of Genomic Variants
DNA  Deoxyribonucleic acid
FACS  Fluorescence-activated cell sorting
FISH  Fluorescence in situ hybridization
GWAS  Genome-wide association studies
HWE  Hardy-Weinberg equilibrium
kb  Kilobase(s)
LD  Linkage disequilibrium
M  Maternal (homologue)
MAF  Minor allele frequency
Mb  Megabase(s)
NGS  Next-generation sequencing
NSCLC  Non-small cell lung cancer
P  Paternal (homologue)
PAR  Pseudo-autosomal region
PCR  Polymerase chain reaction
q  Mapping quality (of sequence reads)
ROI  Region of interest
SCE  Sister chromatid exchange
SNP  Single nucleotide polymorphism
SV  Structural variant
W  Watson (template strand orientation)
WC  Watson-Crick (homologue inheritance)
WW  Watson-Watson (homologue inheritance)
UCSC  University of California, Santa Cruz
UV  Ultraviolet

Glossary

Always Watson Crick (AWC) region – a fragment of DNA in the reference genome assembly that appears Watson-Crick in > 80% of Strand-seq libraries. These likely represent under-represented sequence fragments in the reference genome.

Breakpoint – the genomic coordinate of a variant allele that marks the start (5′) or end (3′) of the locus.

Crick – a Strand-seq read that aligns to the forward (‘+’) strand of the reference genome, colored blue (specifically, ‘turquoise4’ in R; or RGB: 103, 139, 139) in all visual representations of Strand-seq data.

Crick-Crick (CC) – a template strand state that consists predominately (> 85%) of forward (‘+’) reads.

Fisher exact test – a statistical test used to determine if there are non-random associations between observed and expected quantities, ideal for small sample sizes.
Used to test genotypes in Strand-seq data  Genomic disorder – a human disease arising from a genomic rearrangement  Genomic rearrangement – a segment of a chromosome that is in a different orientation or location from the reference genome assembly, including sister chromatid exchange events, inversions and translocations.   Heterozygous – a variant allele (can be a single base or segment of DNA) that alters a single homologue of a chromosome in a diploid organism  Homozygous – a variant allele (can be a single base or segment of DNA) that alters a both homologues of a chromosome in a diploid organism  Inversion – a reversal in the orientation of a segment of DNA within a chromosome, which appears as a recurrent rearrangement in Strand-seq data.  Invertome – the complete set of inversions present in an individual’s genome  Invert.R – a custom-built bioinformatic program tailored to identify recurrent genomic rearrangements in Strand-seq data   Minor Allele – a fragment of DNA in the reference genome assembly that does not reflect the orientation of the majority of a population, appears as a homozygous inversion in > 80% of Strand-seq libraries    xv Misorient – a fragment of DNA in the reference genome assembly that is in the incorrect orientation, which appears as a homozygous inversion in 100% of Strand-seq libraries  Non-palindromic duplication – segmental duplications that are in the same (direct) orientation to each other  Non-recurrent rearrangement – a unique genomic event that may share a common region-of-overlap in multiple genomes but shows different sizes and distinct breakpoints  Nascent strand – the new DNA strand that is synthesized during semi-conservative replication. 
This is the strand that is ablated during Strand-seq library preparation  Palindromic duplication – segmental duplications that are in the opposite (inverted) orientation to each other  Recurrent rearrangement – an overlapping genomic event present in multiple genomes that shows a common size and fixed breakpoint.   Segmental Duplication – a segment of DNA at least 1000 base pair in length that shows > 90% similarity with another segment  Sister cells – the two daughter cells generated from a single mitosis. These related cells show reciprocal template strand patterns  Sister chromatid exchange – a somatic rearrangement between sister chromatids that results from a double-stranded DNA break getting repaired by a cross-over homologous recombination   Strand-seq – a single cell sequencing approach that preserves the direction of template strands to visualize the structure of individual homologues in parental chromosomes   Structural variant – a genomic event that is > 1 base pair in size and can involve a gain (insert), loss (deletion) or rearrangement (inversion or translocation) of DNA  Template strand – the DNA strand used to guide DNA synthesis during semi-conservative replication. This is the strand that is sequenced in Strand-seq libraries  Template strand inheritance – the combination of template strands, each representing a distinct parental homologue, inherited by a daughter cell at cell division. Can be WW, WC or CC in a Strand-seq library  Template strand state – the relative proportion of Watson and Crick reads in a chromosome (or segment of a chromosome) in a Strand-seq library Can be WW, WC or CC in a Strand-seq library    xvi Template strand state change –a segment of a chromosome that exhibits a change in template strand state (e.g. 
changing from WW to WC or to CC), which represents a genomic rearrangement
Variant signature – the alignment and features used to identify structural variants from sequencing data, which can appear as a change in read depth, split read or paired read mapping
Variant allele – a genomic locus (can be a single base or segment of DNA) that is distinct from the reference genome assembly
Watson – a Strand-seq read that aligns to the reverse (‘-’) strand of the reference genome, colored orange (specifically, ‘sandybrown’ in R; or RGB: 243, 165, 97) in all visual representations of Strand-seq data
Watson-Crick (WC) – a template strand state that consists of approximately equal proportions of reverse (‘-’) and forward (‘+’) reads.
Watson-Watson (WW) – a template strand state that consists predominately (> 85%) of reverse (‘-’) Strand-seq reads.
Wildtype – a locus (can be a single base or segment of DNA) that is the same as the reference genome assembly

Acknowledgements

“It wasn’t easy for me; I was born a poor black child” – Martin

I have had a truly unique experience as a graduate student. I changed laboratories and fields part way into my degree, I haven’t lived on the same continent as my supervisor for the last four years, I have been the only student in a lab of senior post-docs, and I have focused on at least four independent projects during my Ph.D., gaining valuable skills and experience from each. Let’s just say, the path I traveled along my graduate career was almost as twisted and varied as the path I traveled to it. It wasn’t an easy road for me. But here I am – a skilled, educated, and confident young woman with real opportunity. The importance of opportunity must never be undervalued. In finding myself facing the end of this degree, I feel proud. I feel I have beaten the odds. I feel I have accomplished. I have surmounted seemingly impossible obstacles and I have reached the upper limits of an education. And I have not done so alone.
There are innumerable people who helped me along this incredible journey, and I thank them all.

In keeping with tradition, first and foremost I thank my supervisor, Peter Lansdorp. I see Peter as a driver of discovery and technology who isn’t afraid to take risks and tackle provocative and controversial questions, and I am thankful for the chance to learn from him. He has taught me to stand up for my beliefs and my science, and to have the courage to pioneer my own research. He gave me freedoms to pursue my personal curiosities, develop my own projects and try different things (even when he doubted they would work), fully allowing me to be the proverbial “master of my own destiny”. I feel very fortunate for having the opportunity to drive my own project and see it through from conception to publication. I have had a rare and exciting Ph.D. and much of this was thanks to Peter’s relaxed teaching style, flexibility, trust and support. In Peter’s lab, I was surrounded by capable and skilled mentors and at the forefront of cutting-edge science that is shaping new territories. It has been exhilarating, challenging and at times even terrifying. Overall, I feel I have thrived in this environment, and I am truly very pleased with my graduate experience as a whole. I thank Peter for an incomparable learning experience.

Now, I can thank my mentor, Ester Falconer, who has been a boundless source of encouragement, guidance and love throughout my time in the Lansdorp laboratory. The last four and a half years have been a period of tremendous scientific and personal growth for me, and this would not have been possible without her. Ester has provided support and direction along every step of my Ph.D. and has given me more than I could have asked or hoped for from a colleague. She has proven herself an amazing teacher full of sound advice, reason and wisdom, and someone who constantly challenged, motivated and inspired me.
I would also like to add that she was not opposed to jocularity. It has been a great honor to learn from this skilled scientist, and an even greater pleasure to call her my friend. For Ester, I am most thankful.

Plenty of thanks are also extended to the other members of the Lansdorp laboratory, both past and present, far and near. I believe our unique combination of expertise and personalities made the lab a hotbed of collaboration and exchange. In particular, I cannot imagine the Lansdorp lab without Uli, Geraldine and Mark, who each provided enthusiasm, guidance and great discussions throughout my degree. A special mention of Mark Hills is very well deserved, as he was amazingly patient in helping me develop programming skills and during our long debates about inversion mapping. Another very special mention goes to the esteemed members of ‘Team Success’. Collectively, the Lansdorp group created a nourishing environment for my development and I am very fortunate to have been surrounded by such a strong team.

Finally, I will thank my closest friends and family who have traveled this road alongside me. From this loving list, I will first mention Maria Acevedo, because I know how happy that will make her, and because she is such a valued friend who has helped me overcome innumerable challenges and crises (both experimental and personal) throughout my Ph.D. Thank you for co-piloting my graduate experience. You’re next lady! I have so much appreciation for Corrine and Ben Verwey who directly helped sponsor this dissertation in the form of groceries, intermittent doses of sunshine and hugs from Hazel. Thank you for successfully helping me avoid scurvy! Corinne, you are an inspiration. I also send my eternal love and gratitude to Jennifer Taves – a true bestie since the day we met, and someone who has been by my side and supporting me throughout my entire educational experience. Distance only makes the heart grow fonder.
I also send love and thanks to her entire family, most especially mom and pop, who have brought me into their amazing circle and have believed in me even when I did not. I am ready to sign the papers! I cannot forget to send little hearts to Dudu, who was present and so patient in that last month of dissertation writing. Thank you for standing by me in one of the hardest challenges I have endured. I can now thank my biological family, who has taught me I can overcome any obstacle in the pursuit of my dreams. Thank you to my sisters Caitlin and Jasmin who directly encouraged me throughout my Ph.D., and Laurel, Lindsay and Andrea, who indirectly inspired me to aspire. I love you all, and you are each in my heart. And finally, I have to thank the Transitional Year Programme for making education accessible, and everyone from this amazing program that supported me in my pursuit of science and a higher education – I literally would not be here without you.

Dedication

to all my homies #micdrop

Chapter 1 | General introduction

"Nothing in biology makes sense except in the light of evolution" - Dobzhansky

1.1 Heterogeneity in human genomes

Why is my sister short, if I am tall? Why did Aunt Jan get cancer, when Aunt Sue did not? And why did grandma live to be 83, but grandpa passed at 67? These are not just tough questions for a twelve-year-old brain; they are long-standing in biology and remain unanswered, even today. Ever since Mendel played with his pea plants and discovered inheritable ‘units’, we have searched our genetic code for answers. Studies of genomic heterogeneity have helped us understand the relationship between specific genomic features and different phenotypic and disease traits. And while it is well understood that the diverse phenotypes observed in our world are largely driven by heterogeneity in our genomes, many of the details remain unresolved.
Genomic heterogeneity exists in many forms, and includes everything from small changes in a DNA sequence (e.g. single nucleotide polymorphisms (SNPs)) to large structural variants (SVs) involving copy number changes and chromosomal rearrangements [1] (Fig. 1-1). Millions of variants have been discovered so far, and their role in biology is an active area of research [2, 3]. Studies of human genomes aim to resolve how specific polymorphisms underlie different phenotypes and disease susceptibilities, and have revealed highly complex interactions between our genetic code and our biology [2]. For instance, genomic studies have established that i) variants affecting different genes can lead to the same disease [4], ii) distinct variants in the same gene can lead to different diseases [5], iii) pathological variants often do not show complete penetrance [6], and iv) multiple variants can collectively contribute to disease phenotypes in a cumulative way [7]. And of course, we cannot forget the confounding influences of gene-gene and gene-environment interactions that contribute to the complexity in traits [2]. This convoluted interplay between genomes and biology is why we still don’t understand inheritable traits like height, cancer-susceptibility, and longevity.

Figure 1-1 | Common types of structural variation found in the human genome
Structural variation results in a change in the structure of a chromosome (grey horizontal line, with genes indicated by colored boxes) and appears when a test genome is compared to a reference genome (boxed chromosome). Copy number variants result in a change in DNA content, and include deletions that cause a loss of DNA, and duplications that cause a gain in DNA. Copy-neutral variants do not change DNA content, but instead the organization of it.
These include inversions that reverse the orientation of a segment of DNA, and translocations that result in the rearrangement of DNA onto a new chromosome (shown by the different color tones of genes)
[Figure 1-1 labels: reference, deletion, duplication, inversion, translocation; groupings: copy number variant, copy-neutral variant]

1.1.1 Associating genotypes with phenotypes

Towards the seemingly simple goal of connecting genotypes with phenotypes, association studies are commonly used to understand inherited traits. These identify the genetic features enriched in specific demographics (such as people with a shared disease) to find correlations to variable loci [2]. One method is through linkage analysis, where families affected with a disease are genotyped for genetic markers and the segregation of these markers is tracked through generations to identify those that are present in afflicted members [8]. To further pinpoint causative mutations, linkage disequilibrium (LD) analyses expand correlations to populations and look for the co-inheritance of ‘linked’ markers based on recombination rates to locate the disease allele in affected individuals [9]. Linkage analyses have proven useful for identifying rare mutations in Mendelian-type diseases, such as Huntington disease and cystic fibrosis [6, 8]. Conversely, with increasing numbers of SNPs and more people genotyped, genome-wide association studies (GWAS) are more frequently employed to locate common variants (i.e. > 5% allelic frequency) in common diseases [2]. Here, SNP frequencies are quantified in stratified populations to find variants enriched in a cohort of patients compared to healthy controls [2, 10, 11]. To obtain significant enrichment a large sample size or large effect is required, and sufficient data must be collected to make these associations [6].
Overall, the major challenges in association studies include distinguishing meaningful variants from those that have no functional significance, and properly defining test populations [2, 6]. Nevertheless, these studies make clear that small nucleotide changes in our DNA sequence underlie large systematic changes in our biology. However, many diseases cannot be attributed to single SNPs, and uncovering these more complex associations represents a current challenge to the field.

While SNP analyses ruled genetic association studies of the past, we are gaining a new appreciation for the role structural variation plays in our genomes [1, 12]. In fact, it has been shown that more heterogeneity resides in the structure of our DNA than in the actual sequence, where an in-depth analysis of a human genome revealed 15-fold more variation represented in SVs (1.5%) than SNPs (0.1%) [13]. Structural variants encompass everything larger than a single nucleotide, and include copy number variants (CNVs), such as deletions and duplications, along with copy-neutral rearrangements such as inversions and translocations [14] (Fig. 1-1). Understanding how these structures impact our biology and health is an ongoing area of research. One of the best known cases linking genome structure to human disease came when a chromosomal duplication was linked to an inherited pathology (Charcot-Marie-Tooth disease) [15]. This report directly illustrated that phenotypes can arise from gene dosage changes, not just sequence changes, restructuring our concept of traditional Mendelian genetics [16]. Since then, we have redefined our understanding of human disease and classified genomic disorders as those pathologies arising from copy number changes and genomic rearrangements [17].

Today, several genomic disorders have been described that involve both recurrent and nonrecurrent structural variants [1].
Recurrent variants appear at the same genomic location in multiple individuals, and share a common size and fixed breakpoints [18]. These are distinguished from nonrecurrent variants that represent unique genomic events, which may share a common overlapping region in different individuals, but have distinct breakpoints and are of differing sizes [18]. Clearly, a critical step in categorizing a genomic variant is accurately defining the boundaries of the event and localizing the breakpoints. This is also important for understanding and predicting the biological outcomes of the structural change. For instance, many genomic rearrangements have been associated with complex diseases like autism, epilepsy, Parkinson’s disease, and Alzheimer’s disease, and these pathological outcomes are often linked to structural changes in genes, be it by altering gene dosage or creating gene fusions [1]. However, breakpoints do not always fall within coding regions or impact gene numbers, and precisely how these rearrangements disrupt the genome is still unresolved [17].   Although we do not always understand how, it is clear that beyond our primary DNA sequence, the architecture of our genomes plays critical roles in determining phenotypes. To explore these roles further, we must uncover and annotate the complete repertoire of genomic variants in both normal and unhealthy humans. This will allow us to build a ‘pan-genome’ database of variation that will serve as an important resource to study the functional consequences of all types of variants in human populations. With a sufficiently comprehensive database, GWAS-style association studies can be performed for SVs in order to link genomic features to phenotypic outcomes. Once these associations are made, biological mechanisms can be further explored.   
1.1.2 Clinical relevance for studying heterogeneity in cells and individuals

The ultimate goal of genetic association studies is to find genetic markers that allow us to anticipate disease risks and identify druggable biological pathways for targeted prevention and treatment regimens [8]. Association studies discover candidate biomarkers that can be experimentally tested to determine the functional outcome of the variant, how it relates to the phenotype and eventually validate it for use in the clinic [19]. Biomarkers facilitate clinical decision-making, where diagnostic biomarkers help determine the specific pathology of a patient, prognostic ones help predict disease progression, and predictive ones predict patient response to a specific treatment (namely, drug metabolism, efficacy and adverse effects) [19, 20]. By characterizing biomarkers in patients we can better inform medical decisions; ideals that have given rise to an era of personalized medicine with goals of individualizing healthcare and administering medicine based on each person’s unique genetic background [21]. The premise of personalized medicine is to enable better disease stratification and prognosis in order to tailor therapeutic regimens to individual patients [21]. To realize this goal, we need simple and reliable tools to rapidly screen genomic variants and evaluate key biomarkers in populations [20, 21].

Finally, genomic heterogeneity is greater than just differences between individuals; it also encompasses differences in the cells and tissues within an individual. This is because non-inherited variants arise during development and later in life due to de novo mutations and rearrangements that generate ‘somatic variation’ in our bodies [22, 23]. This is best studied in the context of cancer, where new mutations and structural rearrangements can occur leading to substantial heterogeneity in a pool of cells, known as cellular mosaicism [22, 24].
This begins when somatic mutations accumulate in critical pathways that cause a cell to undergo genome instability, where it fails to accurately repair DNA breaks, replicate its genome, and/or correctly distribute sister chromatids during mitosis [25, 26]. This instability manifests as an increased mutational load and as gross genomic rearrangements in daughter cells, which can confer selective growth advantages to subsets of cells that drive their clonal expansions and eventual tumorigenesis [27, 28].

In order to unravel the complex composition of heterogeneous cell populations, single cell studies are paramount to identify rare subsets and explore their functional contribution to the phenotype [22, 29]. For instance, by characterizing normal and tumor cells we can study the genomic variants present in different subpopulations and learn how they relate to specific disease phenotypes, cancer metastasis, and treatment responders [22, 26, 27]. These types of studies are an extension of traditional association studies using populations of patients, where variant frequencies are quantified in cell populations to determine which are enriched in phenotypically distinct subsets (for instance, before and after treatment with a drug). For this we need appropriate tools that reveal the genetic composition of mixed populations of single cells to determine the cells that contain pathogenic variants and identify subpopulations important for disease prognosis, management and progression [18, 30, 31]. Taken together, it is clear that before we can fully understand how genomic variants alter biology, susceptibility and progression of different pathologies, we need sensitive tools to study and annotate all types of variation in individuals and individual cells.
1.2 Detecting structural variants by DNA sequencing

Our genomes are structurally complex and a critical step in understanding how this complexity impacts our biology and pathology is to locate structural variation in populations of interest [1, 32]. Today, high-throughput DNA sequencing is without a doubt the primary approach to studying variation in human genomes. Locating small variants such as SNPs involves identifying single base pair changes in DNA; however, locating larger structural changes to predict the locations of putative SVs is more complicated. This section will discuss common approaches used to detect an SV from sequencing data, with a specific focus on inversions (Fig. 1-2).

Figure 1-2 | Sequence signature of inversions
Different alignment signatures arise when the sequencing reads (purple and cyan bars) from a test genome that contains an inversion (arrow) are aligned to a reference genome (grey). Reads derived from the inversion are shown in cyan. Location of inversion in reference is boxed. Horizontal dotted lines represent inversion breakpoints.

1.2.1 Sequence read alignment

Once sequence reads are generated for a genome, the first step in SV detection begins with read mapping. This aims to find the best alignment or set of alignments between the read sequences of a test sample and reference genome [14]. Alignment software is tailored to the type of sequencing reads being mapped, which vary in length and accuracy [33]. Algorithms that take in long (> 400 bp) sequence fragments look for high-quality and unique read placements, whereas smaller fragments (30-200 bp) generated by next-generation sequencing (NGS) technologies may not always map to a single location in the reference [14, 33, 34]. Also, longer reads are typically generated at the expense of precision, and there are large disparities in sequencing error rates that must be mitigated when mapping (e.g.
compare the ~ 0.2% substitution rate of Illumina HiSeq sequencing to > 12% of current Pacific Biosciences single molecule real time (SMRT) technologies) [33, 35]. Ultimately, alignments aim to balance the degree of sequence agreement, number of allowed mismatches, and size of gaps introduced to find the best fit alignment between the test and reference genomes [14, 34]. Once reads are mapped, SV detection software searches for discordant alignments and variant signatures, which can be loosely grouped into i) paired read signals, ii) read depth signals, and iii) split read signals (Table 1-1).

[Figure 1-2 panels: a) paired read, b) read depth, c) split read]

1.2.2 Paired read signals

Seminal works in SV discovery centered on paired read signals and defined the variant signatures from discordant paired-end mappings [36-38]. These early studies used low-throughput Sanger technologies to sequence the ends of fosmid clones derived from the test genome, and considered where each end mapped in the reference assembly, establishing the foundational principles to distinguish SVs from sequencing data [14]. By considering the distance and location of aligned reads, signatures were found to locate: deletions (fragment size shorter than mapping distance), insertions (fragment size larger than mapping distance), and balanced rearrangements (discordant location and/or orientation of paired-ends) [36-38]. In paired read signatures, inversions appear when the fragment spans the breakpoint and read pairs align to the reference genome in the ‘incorrect’ orientation [32, 39] (Fig. 1-2a).
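The paired read rules described above can be sketched in a few lines of Python. This is a minimal illustration only, not the logic of any published SV caller. The `ReadPair` record, the `classify_pair` function, the expected fragment size and the tolerance are all hypothetical, and the sketch assumes that concordant pairs map to opposite strands, so that two ends mapping to the same strand suggest an inversion.

```python
from dataclasses import dataclass


@dataclass
class ReadPair:
    """Mapped positions of the two ends of one fragment (hypothetical record)."""
    chrom1: str
    pos1: int      # reference coordinate where the first end maps
    strand1: str   # '+' or '-'
    chrom2: str
    pos2: int      # reference coordinate where the second end maps
    strand2: str   # '+' or '-'


def classify_pair(pair, expected_size, tolerance):
    """Return the variant signature suggested by a single read pair."""
    if pair.chrom1 != pair.chrom2:
        return "translocation"   # ends map to different chromosomes
    if pair.strand1 == pair.strand2:
        return "inversion"       # ends map in the 'incorrect' orientation
    mapped_distance = abs(pair.pos2 - pair.pos1)
    if mapped_distance > expected_size + tolerance:
        return "deletion"        # fragment shorter than mapping distance
    if mapped_distance < expected_size - tolerance:
        return "insertion"       # fragment larger than mapping distance
    return "concordant"


# Illustrative pairs from a library with ~40 kb fragments (assumed values).
pairs = [
    ReadPair("chr1", 100_000, "+", "chr1", 140_000, "-"),
    ReadPair("chr1", 100_000, "+", "chr1", 160_000, "-"),
    ReadPair("chr1", 100_000, "+", "chr1", 120_000, "-"),
    ReadPair("chr1", 100_000, "+", "chr1", 140_000, "+"),
    ReadPair("chr1", 100_000, "+", "chr2", 140_000, "-"),
]
labels = [classify_pair(p, expected_size=40_000, tolerance=3_000) for p in pairs]
```

A real caller would require several supporting pairs per event and would model the fragment length distribution rather than use a fixed tolerance.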
Paired read methods are highly dependent on the fragment length distribution, which must be narrowly defined for higher confidence variant calls [14, 40]. This also impacts the size of SVs that can be detected, as only those variants that statistically alter mapped fragment length from expected length can be reported [32].

1.2.3 Read depth signals

Read depth approaches were made possible by technological advances in sequencing methods. The first high-coverage (20-40x) complete genomes were sequenced in 2008 [41-43], which launched high-throughput NGS platforms. Equipped to rapidly generate large amounts of short reads, whole genome sequencing allowed questions of coverage to be explored. Built from similar principles as array technologies [44, 45], these methods identify localized deviations in read densities that show statistical departures from those expected genome-wide [14]. For this, the reference is segmented into bins (either short non-overlapping windows [45], or overlapping sliding windows [44]) and reads are counted in each bin to identify intervals with statistically significant gains or losses in the test genome by implementing sophisticated mathematical models, such as circular binary segmentation or hidden Markov models [14]. While these methods are more broadly applicable because they accept both single and paired-read data, they are biased toward longer CNVs that result in larger (more significant) deviations in coverage. Being poorly-suited to copy-neutral variants, inversions are very difficult to distinguish using read depth signatures, except to say that reads spanning breakpoints will not align to the reference resulting in a local reduction in read density flanking the inversion (which overlaps with a deletion signature) [14] (Fig. 1-2b). Moreover, sequencing biases need to be corrected for to make accurate predictions [45], and the poor mappability of reads to repetitive elements compromises SV calls in these regions of the genome [14].
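The binning step of a read depth analysis can be illustrated with a short Python sketch. It assumes non-overlapping windows and 0-based read start positions, and it flags bins with a plain z-score as a simple stand-in for the statistical models named above. It does not implement circular binary segmentation or a hidden Markov model. The function names, bin size and cutoff are hypothetical.

```python
from statistics import mean, pstdev


def bin_counts(read_starts, chrom_length, bin_size):
    """Count reads in non-overlapping bins along one chromosome."""
    n_bins = (chrom_length + bin_size - 1) // bin_size
    counts = [0] * n_bins
    for pos in read_starts:            # 0-based start coordinate of each read
        counts[pos // bin_size] += 1
    return counts


def flag_bins(counts, z_cutoff=3.0):
    """Flag bins whose read count departs strongly from the chromosome-wide mean."""
    mu = mean(counts)
    sigma = pstdev(counts)
    calls = []
    for index, count in enumerate(counts):
        z = 0.0 if sigma == 0 else (count - mu) / sigma
        if z >= z_cutoff:
            calls.append((index, "gain"))
        elif z <= -z_cutoff:
            calls.append((index, "loss"))
    return calls


# Toy chromosome: 20 bins of 100 bp, 10 reads per bin, no reads in bin 5.
read_starts = [b * 100 + k for b in range(20) if b != 5 for k in range(10)]
counts = bin_counts(read_starts, chrom_length=2000, bin_size=100)
calls = flag_bins(counts)
```

In this toy example only the empty bin is flagged as a loss. On real data the counts would first need correcting for the sequencing and mappability biases described above.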
1.2.4 Split read signals

Getting closer to variants than read depth or paired end signals, split read signatures provide unambiguous resolution of SV location. These methods identify sequence-fragmented reads where only one portion of the read or read pair maps to the reference, suggesting the fragment crosses a breakpoint that introduces a mapping inconsistency for the rest of the read [46-49]. Split read methods first identify the fragment end that shows high-quality unambiguous mapping, which can be used to anchor the read to a genomic location in the reference [46, 47]. Once anchored, the optimal alignment for the rest of the split read is found by trying to map it to a different genomic position, which is used to predict the variant breakpoint in the test genome [46, 48, 49]. Split read signatures of inversions occur when sequencing reads are fragmented by the inversion breakpoint, resulting in a change in the location and orientation in the alignment of each fragment [32] (Fig. 1-2c). While being able to predict precise breakpoint locations of SVs, these methods are limited to detecting relatively narrow breakpoints that are spanned by the fragment [46], require very deep sequencing and rely on at least one end of the fragment mapping with high-quality, which is not possible in highly repetitive regions [49, 50].

Table 1-1 | Sequencing signatures of structural variants
General overview of the approaches commonly used to identify structural variants (SVs) from sequence data, and their respective caveats.

1.2.5 Integrating signals

While many SV detection methods developed so far rely on a single signature to make variant predictions, no one signal is comprehensive and sufficient to detect all types of variants [14, 51] (Table 1-1). This was punctuated by the poor overlap in SV calls made when the same dataset was analyzed using different signatures, where as few as 20% of the variants detected in one approach were reproduced by another [32, 51].
Table 1-1 contents (signature, then caveats, for each approach):
Paired read – signature: discordant distance and/or orientation of paired reads mapping to the reference, indicating deletions, inversions, translocations and tandem duplications. Caveats: confounded by mis-mapping of reads, leading to high false positive rates, and cannot detect elements longer than the insert fragment length.
Read depth – signature: unequal distribution of reads mapping to specific locations, indicating changes in copy number variants like deletions and insertions, not breakpoints of copy-neutral variants. Caveats: biased by sequence composition, which can alter efficiency of PCR amplification and sequencing, along with mappability of reads uniquely to the reference.
Split read – signature: a partial alignment of single reads to two locations and/or orientations in the reference, indicating SV breakpoints and unambiguously resolving SV location. Caveats: relies on at least one end of the fragment being anchored, which is not possible in highly repetitive regions.

More recent methods take an integrated approach that combines multiple signatures to increase the sensitivity and specificity of variant calls in a sample [14, 52]. For instance, CNVer was the first algorithm to combine read depth and read pair signals [53], and DELLY combines paired end with split read signatures [50]. Methods integrating sequence assembly algorithms include CREST, which uses split reads to seed local assemblies [41] and PopIns, which combines three approaches by assembling unaligned reads into contigs and then mapping these contigs to the reference using paired end and split read signatures [54]. Finally, the most recent strategies integrate data from multiple library preparations or sequencing technologies to predict SVs [55]. These examples illustrate emerging trends of i) combining multiple data types to make SV calls, and ii) moving away from aligning reads to reference genomes and toward de novo assemblies of test genomes.
By integrating different signatures these approaches are able to circumvent the disadvantages associated with any one signal or technology, allowing for better quality variant predictions [14]. And by assembling an individual’s genome from scratch, alignment biases and challenges associated with mapping reads to complex regions of the genome can be overcome. These new methods reflect a broader and more dynamic understanding of our genome emerging from a greater appreciation for its structural nuances and fluid character.

Using these techniques a vast amount of variation has already been described in the human genome. However, we still face major challenges in resolving this heterogeneity and translating it to a clinical setting. For instance, current methods are unreliable and prone to false calls, and thus secondary validation tools are required (such as polymerase chain reaction (PCR) or fluorescence in situ hybridization (FISH), discussed below in Section 1.4.4) to confirm SV predictions [1, 37, 39, 52, 56]. The presence of large repetitive blocks of DNA at variant breakpoints greatly interferes with unambiguous read mapping, which confounds prediction software [14, 32, 39]. Moreover, since SV detection methods rely on aberrant read alignment, multiple reads must be present at the event in order to predict the signature, and once a variant is predicted high-density SNPs are required to genotype the locus [36, 37, 52]. To obtain adequate coverage for this, large amounts of input DNA are required, preventing single cell analyses and increasing costs [50, 52]. Finally, rearrangements that do not change DNA content present a special challenge to SV detection [14, 39], highlighted by the fact that most studies report only a handful of high-quality inversions compared to thousands of CNV calls [3, 36, 37, 57]. Overall, current sequencing methods that locate SVs are poorly suited to map copy-neutral structural rearrangements in repetitive elements.
Therefore, we currently do not have adequate tools to map the complete spectrum of structural variation in our genomes in order to study how these polymorphisms impact our biology.

1.3 Inversions are an elusive structural variant

Current tools used to detect SVs in human genomes are especially well suited to identifying copy number changes in our DNA. Consequently, a large amount of research connecting genomic variants to human biology has centered on CNVs [1]. However, functionally significant variants do not always entail increases or decreases in DNA content [39]. This section will address an underrepresented SV that is copy-neutral and arises between repetitive DNA elements, the elusive inversion.

1.3.1 Inversions change the organization of a chromosome

Inversions are variants that alter the organization of our genomes. Rather than impacting DNA content, this SV subtype is an intrachromosomal rearrangement that reorients a segment of DNA between two breakpoints (Fig. 1-3) [39]. The earliest published record of an inversion came from a Drosophila melanogaster study performed by Alfred H. Sturtevant, who trained under Thomas H. Morgan and built the first genetic maps [58-60] (Fig. 1-4). Sturtevant found that the relative positions of genes in a fly’s polytene chromosomes (see footnote 1) were sometimes reversed [59], and that looped structures formed when one of the synapsed chromosomes contained this reversal (i.e. at heterozygous positions) [61]. These findings spurred a new structural genomics field. Working collaboratively with Theodosius Dobzhansky among others, Sturtevant went on to show that i) inversions are adaptive traits under (environmental) selective pressure, ii) that phenotypic characteristics such as size, fecundity and longevity are associated with polymorphic inversions, and iii) that heterozygous inversions suppress recombination events during meiosis [61-65].
These seminal works made clear the evolutionary and biological importance of inversions in flies; however, it has proven difficult to translate these findings to humans, due to our greater complexity.

¹ Over-sized chromosomes that form when multiple rounds of replication proceed without cell division, producing many hundreds of sister chromatids that remain synapsed together.

Figure 1-3 | Example of a genomic inversion. Schematic representation of an inversion arising from a recombination event between two segmental duplications (arrows) present on the same chromosome (grey line). In order for the recombination event to occur an intermediate looping structure is predicted to form (boxed region), and a crossover event generates an inversion of the interstitial DNA (purple), as indicated by reversal of alleles ‘A’ and ‘B’.

In repositioning the DNA within a chromosome, inversions impact biology by changing the order and position of genes at the locus, which can disrupt coding or regulatory regions depending on breakpoint locations [39, 66]. One of the longest-known cases of this in human biology is the pericentric inversion on chr16 associated with acute myeloblastic leukemia (AML) (Fig. 1-4) [67]. This inversion disrupts the myosin heavy chain 11 (MYH11) gene at 16p13 and the core binding factor beta (CBFβ) gene at 16q22 to create a fusion gene that disrupts hematopoietic cell differentiation [68]. But beyond just changing the primary linear sequence of DNA, inversions can change the tertiary three-dimensional structure of DNA. Strikingly, this unique feature of inversions is often not discussed in the context of higher organisms, even in light of the chromosome territories and functional domains present in nuclei [69-71]. However, as beautifully illustrated in the original works by Sturtevant and Dobzhansky [61] (Fig.
1-4), the spatial organization of genomes is altered when inversion loops form at heterozygous loci, changing the configuration of how homologous chromosomes synapse. Crossover events at inversion loops can generate unbalanced products, and therefore inversions suppress recombination and limit the exchange of genetic information between chromosomes [63]. This feature also means inversions create non-recombining blocks that protect loci by effectively ‘capturing’ the alleles (adaptive or deleterious) at the inverted region to impose on the fitness level of carriers [63]. These functional attributes of inversions emphasize how understanding the arrangement of genes and structural organization of a chromosome is critical to understanding how genomes impact phenotypes.

1.3.2 Known inversions in human genomes

After nearly a century of work, it is clear that inversions represent a significant source of genetic variation in our genomes and play important roles in directing our biology [63]. Comparative DNA sequence analyses revealed that up to 1,500 inversions distinguish humans from chimpanzees [72], highlighting that chromosomal rearrangements rather than sequence mutations drove our speciation. It was the block to recombination and limited genetic exchange at inverted loci that likely facilitated our divergence [63]. Inversions don’t merely distinguish us from other primates; they also distinguish us from each other. Large-scale pilot sequencing studies of normal humans revealed far more inversions scattered throughout our genomes than originally suspected using traditional cytogenetics [39]. For instance, Tuzun et al. (2005) found 56 inversions in a single genome that was mapped using Sanger-based paired end sequencing [38], Korbel et al. (2007) identified 72 and 50 inversions in two genomes sequenced using 454 technology [37], and Kidd et al.
(2008) found between 48 and 98 inversions per genome in the eight individuals they comprehensively analyzed [36] (Fig. 1-4). Currently, the Database of Genomic Variants (DGV), a public database that aims to curate all human SVs [3], reports 1601² inversions found in 62 studies. The sizes of currently known human inversions range from 65 bp to 9.7 Mb, with the majority falling between 10 and 100 kb (Fig. 1-5). It should be noted that these numbers represent predictions from variant calling algorithms, and many of the reported SVs have yet to be validated [73]. Nevertheless, the preponderance of inversions in human genomes suggests they play an important role in our biology, but to better understand this role more targeted inversion studies are required.

² This number was obtained on July 23, 2015 by querying the DGV for the subtype ‘inversion’ in the ‘merged’ dataset (i.e. sample-level calls that share > 70% reciprocal overlap by length and position are merged into a single call) mapped to the assembly ‘GRCh37/hg19’; DGV last updated October 2014.

Figure 1-4 | Timeline of major historical events in human inversion studies. Historical discoveries and technological advances that contributed to our understanding of human inversions are shown in each branch with the first author (italics) and year of finding listed for each major publication. The Pubmed identification number for each article is provided (square brackets).

Despite recent findings that inversions are common in the human genome, so far only two have been fully characterized at the population level [66]. These include a large ~ 4.5 Mb inversion on chr8p23.1 and a ~ 900 kb inversion on chr17q21.31 (Fig. 1-5, asterisks). The chr8p23 variant is the largest polymorphic inversion in the human genome, encompassing over 50 genes [74].
A population-wide analysis of SNP distributions showed this inversion exhibits a clinal distribution that correlates with geographic distance from Ethiopia, with frequencies ranging from 53% (Yoruba) to 1.3% (America) [75]. The inversion alters the expression of several genes within (but not disrupted by) the rearrangement, including BLK (B lymphocyte kinase), and confers a reduced risk of autoimmune diseases (such as rheumatoid arthritis) to carriers [75, 76]. While the chr8p23 inversion is visible using basic cytogenetic approaches (discussed below in Section 1.4.1), the size and complexity of the smaller chr17q21 inversion makes it very difficult to localize [56]. Applying BAC-based sequence assembly to explore the distribution of the inversion in primate and human populations revealed a convoluted evolutionary history that suggested the locus represents a recurrent inversion that’s changed orientation several times in primate history [77]. Results show the inverted orientation (i.e. reversed to the reference genome assembly, referred to as ‘H2’) is the ancestral state, and the common ‘uninverted’ orientation (i.e. that listed in the reference genome assembly, and called ‘H1’) emerged early in human speciation, but later re-inverted to the H2 variant in subpopulations, after humans migrated from Africa [77, 78]. This toggling is attributed to the large blocks of segmental duplications that flank the locus [56, 77, 79]. The inverted H2 allele occurs in ~20% of Europeans and is associated with increased (by ~ 3%) fertility rates in female carriers. Conversely, the ‘uninverted’ H1 allele is associated with neuropathologies, including Alzheimer’s Disease and Parkinson’s Disease, which are linked to the microtubule-associated protein tau (MAPT) gene within the variant [78, 80]. 
Both these inversions are under selection and impact fitness, stressing the need for more large-scale investigations into the distribution and phenotypic consequences of inversions in human populations.

Figure 1-5 | Inversions reported in the human genome. Human inversions are curated in the Database of Genomic Variants (DGV; last updated October 2014), and were extracted from this database by querying for variant subtype ‘inversion’ in the ‘M’ (merged) call set and from the ‘GRCh37/hg19’ assembly. The size distribution plot illustrates that most of the known inversions are between 10,000-100,000 bases. The genomic distribution plot illustrates that inversions are dispersed throughout the human genome. Asterisks mark inversions mentioned in the text.

1.3.3 Inversions linked to human diseases

While not reaching the population scale, several inversions have been studied in the context of human disease. For instance, inversions have been directly associated with an increased risk of hemophilia A [81], Hunter syndrome [82], and muscular dystrophy [83], while other inversions are indirectly associated with diseases that arise in offspring of carriers when additional rearrangements occur in germline cells [39]. For instance, an inversion on chr7q11 is prone to undergo a ~1.5 Mb deletion that causes Williams–Beuren syndrome, a multisystem disorder with neurodevelopmental symptoms [84, 85]. Complex and overlapping inversions are also present on chr15q13, with breakpoints falling between three repetitive regions that show a high degree of sequence similarity [86].
These inversions are associated with three distinct deletions that are found to occur between the same breakpoints and that are associated with severe neurological and developmental pathologies, including epilepsy, schizophrenia, or autism spectrum disorder, collectively showing an 80% disease penetrance [86, 87]. Finally, the inversion at the chr17q21 locus (described above in Section 1.3.2) is also associated with a deletion syndrome (called Koolen-de Vries) that causes complex developmental delays [39, 88]. This deletion occurs at the same location as the inversion and on the same chromosome inherited from the parent that carried the inversion [88, 89]. These studies illustrate that rearrangements at inversion breakpoints can excise the variant, leading to a deletion of part or all of the inverted locus. These reports also illustrate how complex diseases that cannot be explained by simple nucleotide changes can be driven by complex genomic architectural events.

Inversion profiles can also serve as important clinical biomarkers [21]. For instance, the large chr16 inversion (between chr16p13.1 and chr16q22) found in AML patients is associated with a better clinical outcome than a smaller AML-associated inversion on chr3 (chr3q21 to chr3q26.2) [90]. A small inversion on chr2 (at chr2p23) is associated with non-small cell lung cancer (NSCLC), and can create an EML4 (Echinoderm microtubule-associated protein-like 4) and ALK (Anaplastic lymphoma kinase) gene fusion product that underlies the disease phenotype [91].
It was recently shown that patients with this fusion product are responsive (at a rate of 50-60%) to the small-molecule kinase inhibitor, Crizotinib [92], and therefore diagnostic testing for the inversion is recommended for many NSCLC patients to help guide treatment plans [21, 91]. These studies clearly highlight the clinical relevance of knowing the inversion profile of patients, and point to immediate applications in personalized medicine. However, current approaches to characterizing inversions are highly targeted and look for specific variants using crude cytogenetic approaches. In so doing, they are unable to characterize the complete inversion profile of patients, to consider submicroscopic inversions, or to test how sets of different inversions may collectively contribute to a pathology or treatment outcome.

Taken together, these studies make clear that inversions play important roles in our biology and health. However, because inversions are difficult to map, our understanding of how they are formed and transmitted is quite limited, and many questions remain. This is especially true in humans, where so few inversions have been studied comprehensively [66]. For instance, we do not yet know the frequency and genomic distribution of inversions in humans, or whether specific combinations of inversions are implicated in distinct phenotypes [66, 73]. What are the most common and rare inversions in different populations? What is the functional consequence of specific inversions, or sets of inversions, within our genomes? It remains to be seen what the evolutionary significance of these rearrangements is in different populations, and how complex variants are shuffled and inherited through generations [93]. Do inversions confer adaptive traits that are selected for in specific regions? Do they protect alleles that increase or decrease the fitness of carriers?
Are breakpoints the only means for gene disruption, or can chromatin changes also occur at inversions? What’s more, the specific details surrounding their origin are unclear [39, 66]. What molecular players are responsible for their formation? Is there a relationship between repeat length and inversion size? Finally, the rates of de novo inversions and how they propagate in somatic cells are completely unexplored. How frequently does the recombination machinery result in inversions? How much do inversions contribute to genomic mosaicism within an individual? And do rearrangement frequencies differ in cell types, disease phenotypes or aged populations? Clearly, human inversions represent an exciting area of research with more unknowns than knowns; however, directly addressing these questions is currently hindered by technical limitations in our ability to study these elusive variants.

1.4 Techniques to study inversions

Inversions represent a significant source of variation in our genomes that plays important roles in our speciation, biology and health. Exploring these associations has been greatly hampered by technical limitations in mapping inversions. This section describes the different techniques used to study this specific type of SV.

1.4.1 Traditional cytogenetics

The earliest explorations of inversions began in giant salivary chromosomes of Drosophila, where the segmented morphology of DNA molecules was evident, allowing shifts or reversals in segments to be distinguished [61, 94]. These efforts were greatly facilitated by advances in staining methods (initially using acetocarmine [94, 95], and later using Giemsa [96, 97]) that made banding patterns of chromosomes visible by microscopy, a critical step that allowed visualization of the first inversions in human chromosomes [98] (Fig. 1-4). Basic cytogenetic techniques were developed during the initial studies of human inversions, where changes in banding patterns of karyotypes (Fig.
1-6a) were used to locate large pericentric inversions on chromosomes 1, 2, 3, 5, 9, 10 and 16 [39, 99]. These megabase (> 3 Mb) rearrangements often involved heterochromatic regions and were not associated with strong phenotypes, but were found by forward genetic approaches when pathologies arose in the offspring of a carrier [39]. The poor outcome in affected children was attributed to abnormal meiotic products arising from unequal crossovers at the inverted segment [98-101]. Relying on metaphase spreads, the studies were limited to mitotically active cells, of which dozens (> 20) had to be meticulously analyzed to roughly karyotype the individual as ‘normal’ or ‘abnormal’ [100, 101]. Looking back at these original reports, the degree of skill and patience required to decipher the obscure banding patterns in these chromosomes is evident (Fig. 1-4), making clear how this approach is low resolution and not amenable to large-scale studies.

Figure 1-6 | Common cytogenetic methods used to study polymorphic inversions. a) Traditional cytogenetic techniques use staining methods to visualize banding patterns of metaphase chromosomes, and inversions (arrow) appear as a reversal in chromosome bands. Only inversions that disrupt the banding pattern are visible. b) Fluorescence in situ hybridization (FISH) technologies use fluorescent DNA-binding probes to visualize binding locations on chromatin fibers, metaphase chromosomes or interphase cells. Probes are designed to flank inversion breakpoints, and the relative distance between the probes genotypes the locus. An uninverted allele results in an overlapping signal (yellow) whereas an inversion leads to the physical separation of the probes, allowing both signals (red and green) to be visualized independently. c) Optical mapping fragments DNA molecules by restriction enzyme digestion and then the length of each fragment is visualized by microscopy.
By comparing the size and order of each fragment to a reference genome (dotted lines) the organization of the chromosome is seen. An inversion results in the reversal of the restriction fragment pattern, also known as the ‘barcode’. Panel annotations: a) basic cytogenetics: single cell resolution, poor genomic resolution, low throughput, genotyping capabilities, non-targeted; b) molecular cytogenetics: single cell resolution, fair genomic resolution, fair throughput, genotyping capabilities, targeted; c) computational cytogenetics: loss of single cell resolution, good genomic resolution, fair throughput, no genotyping capabilities, non-targeted.

1.4.2 Molecular cytogenetics

Crude G-banding techniques have been superseded by molecular cytogenetics, namely by FISH technologies [21]. This technique relies on complementary base pairing between DNA molecules to hybridize directional fluorescent probes to target sites on chromosomes [69]. The basic steps of FISH are to: i) affix naked chromatin strands, metaphase chromosomes or interphase cells to a slide, ii) denature DNA to make it single stranded and available to probe binding, iii) hybridize target-specific probes that are conjugated to a fluorochrome to the sample, and iv) evaluate probe signals on a fluorescence microscope [102]. Inversions are visualized in FISH studies by constructing probes that target DNA sites at the predicted breakpoints and measuring the distance between probe signals (Fig. 1-6b) [21, 56, 86]. The first application in the human inversion field was to map the large AML-associated rearrangement on chr16 [103] (Fig. 1-4), which illustrated how breakpoints can be more precisely localized using this technique. The exact genomic resolution of inversion mapping depends on the choice of target and probe, where the limit of detection is typically ~ 2 Mb for metaphase spreads, ~ 50 kb for interphase nuclei and down to ~ 5 kb for chromatin strands [102].
Since multiple cells can be screened on fluorescent imaging platforms, this technique improves the throughput and ease of making inversion calls, while preserving single cell resolution and genotyping capabilities. However, the probe must target the inversion breakpoints, and therefore a priori knowledge of the rearrangement is required [39, 56].

1.4.3 Computational cytogenetics

The final major advancement in cytogenetic methods was computational. Based on a technique borrowed from the Saccharomyces cerevisiae field, optical mapping creates high-resolution and ordered restriction maps of single DNA molecules to visualize the long-range structural organization of chromosomes [104]. This method involves: i) linearizing DNA molecules on a glass slide, ii) digesting DNA in situ with restriction enzymes (creating nicks in the molecule), iii) staining DNA fragments with fluorescent dyes for imaging, iv) reconstructing the organization of the DNA molecule based on fragment sizes (using intensive computational algorithms), and v) comparing the sample to an in silico reference restriction map to identify structural differences [105]. In optical maps, inversions appear as a reversal in the expected restriction fragment ‘barcode’ seen in the reference (Fig. 1-6c), thereby offering better genomic resolution than traditional cytogenetic approaches. While predominately applied to plant and bacterial genomics, and for refining reference assemblies [106], human inversions ranging in size from thousands to millions of bases have been successfully located using this technique [107] (Fig. 1-4). However, the location and size of the inversions that can be discovered depend on the fragmentation step, as inversions contained entirely within a single restriction fragment are not visible. Also, the method requires millions of linearized DNA molecules to assemble high-coverage optical maps for variant calling [107], and is not amenable to high-throughput single cell or population studies.
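The notion that an inversion appears as a reversed stretch of the restriction fragment ‘barcode’ can be illustrated with a small Python sketch. This is a toy illustration only and not the algorithm of any optical mapping software: it assumes that the reference and sample maps list the same fragments (sizes in kb), that both inversion breakpoints coincide with restriction sites, and that fragment sizes agree within a 10% relative tolerance. The function names and the tolerance value are hypothetical choices made for this example.

```python
from typing import List, Optional, Tuple


def sizes_match(a: float, b: float, tol: float = 0.1) -> bool:
    """True if two fragment sizes agree within a relative tolerance (assumed 10%)."""
    return abs(a - b) <= tol * max(a, b)


def find_inverted_block(
    reference: List[float], sample: List[float], tol: float = 0.1
) -> Optional[Tuple[int, int]]:
    """Return (start, end) fragment indices of a single inverted block, or None.

    Toy model: the block between the first and last mismatching fragments
    is called an inversion if reversing it restores the reference order.
    """
    if len(reference) != len(sample):
        return None  # only copy-neutral maps with equal fragment counts are handled

    # Positions where the sample barcode departs from the reference barcode.
    mismatches = [
        i for i, (r, s) in enumerate(zip(reference, sample))
        if not sizes_match(r, s, tol)
    ]
    if not mismatches:
        return None  # identical barcodes: no visible rearrangement

    start, end = mismatches[0], mismatches[-1]
    ref_block = reference[start:end + 1]
    reversed_sample_block = sample[start:end + 1][::-1]

    # An inversion is supported only if the reversed block matches the reference.
    if all(sizes_match(r, s, tol) for r, s in zip(ref_block, reversed_sample_block)):
        return start, end
    return None


if __name__ == "__main__":
    reference_map = [12.0, 40.0, 7.0, 25.0, 18.0, 60.0]
    sample_map = [12.0, 25.0, 7.0, 40.0, 18.0, 60.0]  # fragments 1..3 reversed
    print(find_inverted_block(reference_map, sample_map))
```

In this model an inversion contained entirely within one fragment leaves the barcode unchanged, so the function returns None, which mirrors the limitation of the fragmentation step described above.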
1.4.4 Detection by sequencing

While these cytogenetic techniques generated the first data on inversions in the human genome, real movement in the field came when sequencing technologies were applied to mapping SVs [39] (Fig. 1-4). This completely transformed our ability to discover inversions at the submicroscopic level. As described comprehensively in the preceding section (Section 1.2), sequencing methods used to map inversions search for specific read alignment signatures. The first sequence-based studies quickly revealed that inversions are far more abundant in our genomes than previously thought using cytogenetics [13, 36-38]. However, the repetitive nature of the human genome makes these approaches prone to high false-positive rates [39, 56, 66], and it has recently been suggested that only 85 of the > 1500 inversions in the DGV can be validated based on existing evidence supported by at least two studies [73]. Consequently, if an inversion is predicted, secondary approaches (such as PCR or FISH) are used to validate the call and genotype the inversion [13, 36-38, 52]. As a result, these studies have proven cumbersome and not easily scalable to high-throughput, population-wide investigations or translatable to clinical settings [56, 66, 73]. Moreover, standard sequencing techniques require milligrams of input DNA, and lose the single cell resolution of cytogenetic approaches [102]. Taken together, there is no current technology that can map inversions in an unbiased, high-resolution and high-throughput approach, while simultaneously preserving the genomic heterogeneity of single cell populations.

1.5 Aims and scope of dissertation

The overall aim of this dissertation is to establish better methods to study structural variation in human genomes, with a specific focus on inversions, since they have proven a unique challenge in genomic studies.
To reach this goal, I set out to develop a new high-resolution and high-throughput tool for discovering, mapping and genotyping structural rearrangements in genomes. To meet present demands of the field, I wanted this method to be rapid, non-targeted, scalable and, most of all, single cell based. I also wanted to directly apply this method to explore inversions and better define how these specific variants contribute to genomic heterogeneity within and between individuals. My specific research aims can be summarized as follows:

Research Aim 1: Develop a single cell and high-resolution method to study structural variation. Our current ability to study genomic heterogeneity is gravely impeded by the limitations we face in trying to reliably visualize copy-neutral variants such as inversions. Current tools fail to offer high-resolution views of the genome while simultaneously preserving the cellular heterogeneity of the population. To address this, I aimed to develop a new sequencing-based approach that was tailored to detecting structural variation, and I aimed to establish this in a single cell system. I hypothesized that by sequencing template strands in single cells (via Strand-seq) I could visualize genomic rearrangements in single cells with high resolution.

Research Aim 2: Design a bioinformatic pipeline that scales this method, making it amenable to high-throughput studies. Sequencing-based technologies require appropriate software to analyze, interpret and represent the data. Strand-seq generates a unique data form that is both single cell and directional, and requires tailor-made bioinformatic pipelines. Moreover, single cell approaches need to be scaled in order to study sufficient cell numbers to yield biologically meaningful and statistically robust results.
To manage our sequencing datasets, I set out to learn computational biology and develop a new bioinformatic pipeline that could interpret Strand-seq data, identify structural rearrangements and output usable files. I hypothesized that an unbiased bioinformatic pipeline could be built to automate the localization and characterization of genomic rearrangements in Strand-seq libraries.

Research Aim 3: Perform a population study to characterize the distribution and frequency of inversions in the human genome. Due to technical limitations in our ability to visualize copy-neutral rearrangements, inversions remain poorly defined in human genomes. We have yet to perform comprehensive population-scale studies to discover the number and spectrum of inversions present in our genomes. To address this, I intended to apply the new system I developed for studying structural variation to a high-throughput, population-scale study. Once equipped with the proper tools, I aimed to explore the structural complexity of the human genome by mapping inversions in a collection of single cells. I hypothesized that mapping genomic rearrangements in a mixed population of cells would reveal differences in the frequency, distribution and location of inversions within the human genome.

Research Aim 4: Characterize all inversions in a genome to compare the genomic distribution and shared architectural features. To study how inversions inform phenotypes, it is necessary to define the entire set of inversions within a genome and test how they impact specific biological processes. Although specific inversions have been implicated in our biology, it is unclear how much genomic heterogeneity exists in inversions, how inversion profiles differ between people and whether inversions act cooperatively to confer phenotypes. Currently, we are unable to study these questions because we are not able to build comprehensive maps of all inversions within a single genome.
I hypothesized that sequencing multiple single cells from the same donor would reveal all the inversions present in their individual genome (and define their ‘invertome’), allowing me to interrogate the genomic heterogeneity that exists between invertomes.

Chapter 2³ | A high-resolution and single cell approach to characterize genomic rearrangements using Strand-seq

“The capacity to blunder slightly is the real marvel of DNA” – Thomas

Chapter synopsis: Using the novel single cell sequencing technique developed in our lab, Strand-seq, I explored genomic rearrangements in human hematopoietic cells. I used sister cell pairs to confirm that template strands reflect chromosomal homologues, and illustrated the reciprocal relationship between chromosome segregation in daughter cells. I reported the distribution and frequency of sister chromatid exchange events in human hematopoietic cells and described how somatic rearrangements are distinguished from inherited structural variants in a population of Strand-seq libraries. Finally, I demonstrated how to locate and genotype stable rearrangements in single cells. I manually mapped two inversions and illustrated how to rapidly and reliably locate the breakpoints of inversions with high resolution. I demonstrated how Strand-seq is a new tool to quickly locate structural variants in a non-targeted approach, and carefully predict breakpoints of genomic rearrangements in single cells.

³ Parts of this Chapter have been published in: Falconer E, Hills M, Naumann U, Poon SS, Chavez EA, Sanders AD, Zhao Y, Hirst M, Lansdorp PM. DNA template strand sequencing of single-cells maps genomic rearrangements at high resolution. Nature Methods. 2012 Nov; 9(11): 1107-1112.
2.1 Introduction

To explore genomic heterogeneity in the human genome, and investigate how different polymorphisms underlie specific phenotypes, we must be able to map all types of variants reliably and precisely [1]. Other than sequence variants, structural polymorphisms such as copy number variants (including insertions, deletions and duplications), and rearrangements such as translocations and inversions play major roles in human biology and health [1, 12, 32]. Mapping genomic variants in both healthy and unhealthy populations is critical for identifying disease-associated alleles. Genomic diseases often arise from an instability event, where (germ or somatic) cells fail to accurately repair DNA breaks, replicate their genomes, and/or distribute sister chromatids during cell division [25, 26]. In the context of cancer, instability can manifest as genomic rearrangements in daughter cells that confer selective growth advantages, drive clonal expansions and lead to tumorigenesis [27, 28]. In order to unravel these processes, single cell studies are paramount, as it becomes increasingly clear that the contribution of rare but functional cellular subpopulations is important for disease prognosis, management and progression [22, 29]. By characterizing normal and tumor cells we can study the genomic variants present in different subpopulations to learn how they underlie specific phenotypes, disease susceptibilities, cancer metastasis, and treatment responses [22, 26, 27]. To study these biological outcomes it is necessary to accurately map breakpoints of genomic rearrangements in single cells and reveal which genes are impacted, whether gene dosages are altered, whether coding or regulatory networks are disrupted, and whether new fusion genes are generated [18, 30, 31]. For this, reliable single cell methods are required to characterize all types of rearrangements, including genomic inversions [1, 32].
To date, no reported technique allows inversions to be discovered and mapped at high-throughput and high-resolution, while simultaneously showing the genome-wide structural heterogeneity of single cells. Techniques such as karyotyping and fluorescence in situ hybridization (FISH) allow one to reliably visualize rearrangements at the single cell level; however, their low resolution and low throughput limit their application to mapping large (> 2 Mb) rearrangements in only a few cells or individuals at a time [39, 56, 107, 108]. Next-generation sequencing (NGS) technologies enable discovery of submicroscopic events based on mapping of sequencing reads to the reference genome [13, 36-38]. While improving throughput and genomic resolution, this approach is prone to false-positives and a secondary technique (such as PCR) is typically required to validate or genotype the rearrangement [56, 72-74]. Moreover, the requirement of large amounts of DNA for standard NGS approaches prevents the single cell resolution essential for investigating heterogeneous cell samples, such as tumors.

In this Chapter, I describe how the novel sequencing technology, Strand-seq, can be used to visualize, map and genotype genomic rearrangements in single cells with high resolution. This technological advance couples the genomic resolution afforded by sequencing technologies with the cellular resolution and reliability of traditional cytogenetic approaches. By generating Strand-seq libraries from pairs of sister cells I confirm that strand orientation reflects the inheritance of parental homologues. I then map reciprocal changes in strand orientation between the pairs and highlight how this reflects somatic and hereditary genomic rearrangements. Finally, I manually locate and genotype inversions with high resolution, refining breakpoints in pairs of sister cells.
I demonstrate the utility of Strand-seq for localizing genomic rearrangements in single cells in an unbiased and non-targeted fashion, and illustrate how to refine breakpoints with unprecedented genomic and cellular resolution.

2.2 Results

2.2.1 Visualizing the orientation of homologues by sequencing template strands of single cells

Strand-seq is a single cell sequencing technique developed in our laboratory that identifies parental DNA template strands inherited by daughter cells after mitosis [109]. This method takes advantage of the directionality of single-stranded DNA molecules, which can be distinguished as either Crick (C; forward, or plus strand of the human reference assembly) or Watson (W; reverse, or minus strand) based on their 5′–3′ orientation (Fig. 2-1, i). The thymidine analogue 5-Bromo-2′-deoxyuridine (BrdU) is incorporated during DNA replication (Fig. 2-1, ii), and following mitosis, the BrdU-positive DNA strand is selectively ablated during genomic library construction, ensuring only the BrdU-negative template strand is sequenced for each chromosome, in each single cell. After library construction and sequencing, resulting sequence reads are aligned to either the minus or the plus strand of the reference genome using the software package BAIT [110], and the template DNA strands inherited by the cell are determined for every chromosome. In a diploid cell, each parental homologue (M, maternal; P, paternal) is composed of a W (orange) and C (blue) strand, and following cell division the template strands for any given chromosome segregate into daughter cells as either WW, CC, or a mixture of the two (WC) (Fig. 2-1, iii). By sequencing only template DNA strands, the inheritance of parental homologues can be studied in daughter cells. The Strand-seq protocol was scaled and automated for library preparation on an Agilent Bravo liquid handling platform (by Dr. Ester Falconer; for details, see Methods Chapter 7, Section 7.2.3). This allowed hundreds (presently, we have 576 unique hexamer barcodes) of single cell libraries to be multiplexed and pooled for sequencing in a single experiment.

Figure 2-1 | Strand-seq sequences template strands inherited by daughter cells
i) Each parental homologue (M, maternal; P, paternal) is composed of a Watson (W, minus, orange) and Crick (C, plus, blue) strand. ii) During DNA replication each strand serves as a template for DNA synthesis. When cells are cultured in the presence of 5-Bromo-2′-deoxyuridine (BrdU) this generates sister chromatids composed of the original DNA strand (solid line) and a nascent BrdU-substituted strand (dotted line). Following mitosis, these homologues segregate into daughter cells, which are harvested and subjected to Strand-seq library construction to selectively remove the newly formed, BrdU-positive strands. This generates short sequencing reads from the template strands only. iii) When aligned to the reference genome and represented on chromosomal ideograms (using BAIT software), the orientation of each inherited homologue is seen as either W or C. For any given chromosome, daughter cells can inherit the maternal and paternal template strands as either WW and CC, or WC and CW.
[Figure 2-1 panel labels: i) chromosome homologues; ii) sister chromatids (1. DNA synthesis, 2. incorporation of BrdU); iii) sequenced template strands in daughter cells (1. DNA segregation and mitosis, 2. isolation of single cells, 3. library construction and removal of BrdU, 4. BAIT alignment). W, Watson; C, Crick.]

Figure 2-2 | Division kinetics of single-sorted hematopoietic cells
a) Single hematopoietic cells were sorted into single wells, and monitored daily to track cell divisions over a period of seven days.
Cells were derived from either human bone marrow (BM) or cord blood (CB) samples, FACS-selected (for a different experiment) on surface markers that enrich for primitive stem and progenitor cells, and cultured in supportive conditions supplemented with 5-Bromo-2′-deoxyuridine (BrdU). At 24 hours post-sort, each well was scored to confirm a single cell was successfully captured. Every day thereafter wells were inspected and a cell division was scored when two (or more) cells were found. b) After the seven-day period the cloning efficiency was calculated as the number of cells that divided at least once, divided by the total number of wells with cells. c) The number of days between sort and first division was recorded for every cell, and the percentage of cells that divided on each day (out of the total divided cells) was calculated to determine the time to first division.

To visualize patterns of homologue inheritance, primary human hematopoietic samples derived from either neonatal cord blood (CB) or adult bone marrow (BM) were cultured to harvest cells for library preparation. As described in Methods (Chapter 7, Section 7.2), hematopoietic cells were selected and sorted as single cells in complete media supplemented with 5 µM of BrdU, and cell divisions were tracked daily over a period of seven days (Fig. 2-2a). Cells showed a high overall cloning efficiency of 87% (CB) or 81% (BM), confirming that growth conditions containing 5 µM BrdU supported their mitoses (Fig. 2-2b).
I monitored the time to first division, and found the majority of cells divided between 3-5 days in culture, with BM cells being slightly delayed relative to CB cells (Fig. 2-2c). This reflects the faster proliferative potential of neonatal cells, and suggests the optimal time for capturing human hematopoietic cells after one cell division is approximately four days after sorting. Sister cell pairs arising after a single mitosis were harvested by micromanipulation (Fig. 2-3, i) and Strand-seq libraries were independently prepared for each sister and sequenced on an Illumina HiSeq 2000 DNA sequence analyzer, at the Genome Science Centre (Vancouver). In total, I obtained 294 Strand-seq libraries (footnote 4) with an average of ~146,000 high-quality (mapping quality (q) > 10) and unique reads mapping to the reference genome assembly (GRCh37/hg19), yielding an average genomic coverage of 0.01x (77.9 reads/Mb) per cell (Appendix A). While this overall coverage was substantially lower than that of conventional NGS applications (which typically aim for at least 30x coverage [33, 111]), it was sufficient to analyze template strand inheritance in each cell and explore rearrangements in paired sister cells.

[Figure 2-2 panel labels: a) FACS sort of single cells (CD34+CD38-CD49f+Thy1+CD45RA-Lin-), cultured with SCF (100 ng/mL), TPO (50 ng/mL), Flt-3L (50 ng/mL) and BrdU (5 µM), with time to first division tracked over days 1-7; b) cloning efficiency (% of total cells that divided ≥1x; CB, BM); c) time to first division (% of total divided cells against number of days in vitro; CB, BM).]

I successfully captured 100 pairs of sister cells (an 81% recovery rate), and homologue inheritance was independently considered for each pair. All paired sister cells revealed the reciprocal relationship between parental homologues. These inheritance patterns were evident in even a single pair of sister cells (Fig. 2-3, ii).
For instance, when one daughter cell (arbitrarily labeled ‘A’) inherited two W template strands for a chromosome (e.g. chromosome 1, 9, 13, 14, 19 and 20), its sister cell (‘B’) inherited two C template strands for the same chromosomes (Fig. 2-3, ii). In this same pair, chromosomes 3, 6, 15, 16, 17, and 21 appeared WC (or CW) for both sister cells (Fig. 2-3, ii). This mirrored inheritance of template strands was observed in 100% of the paired sister cells analyzed (for other examples of paired sister cells see Fig. 2-4). This illustrates that each template strand represents a sister chromatid and the template strand direction reflects the structural organization of the homologue. These results confirm that Strand-seq allows the segregation of parental template strands to be tracked in single cells, which can be used to visualize each homologue inherited in the daughter cell.

Footnote 4: this total includes 45 unpaired Strand-seq libraries (HsSs_0001 - HsSs_0045) from the male BM donor that were bulk cultured in erythropoietin conditions (see Methods).

Figure 2-3 | Strand-seq libraries of a pair of sister cells
i) To capture sister cells the mitoses of single-sorted cells were tracked and daughter cells arising from a single division were harvested by splitting the well contents into four new wells of a 96-well microcrystallization plate. These wells were then searched to locate the sister cells, which were individually put through the Strand-seq library construction, sequencing and alignment protocols (see Methods for details). ii) BAIT ideograms of two Strand-seq libraries representing a pair of sister cells, aligned to the human reference genome (GRCh37/hg19). Each chromosome is shown, with Watson (W) reads in orange, and Crick (C) reads in blue. W and C read densities (reads/megabase) are listed below each chromosome. iii) An enlarged view of chromosome 7 (chr7) and chr8 from this pair of sister cells.
Genomic rearrangements were manually located based on changes in template strand orientation and are indicated by arrowheads, with sister chromatid exchanges (SCEs) in red and putative inversions in black. The asterisk marks a heterozygous inversion mentioned in the text of Section 2.2.3.
[Figure 2-3 panel labels: i) FACS sort of single cells; split sister cells; 1. Strand-seq library construction, 2. HiSeq sequencing, 3. BAIT read alignment. ii) Sister Cell A and Sister Cell B (libraries HsSs_0071 and HsSs_0077), chromosomes 1-22, X and Y. iii) Chromosome 7 and chromosome 8 in cells A and B.]

Figure 2-4 | Strand-seq examples of paired sister cells
Ten BAIT ideograms of Strand-seq libraries prepared from a male donor bone marrow sample, representing five distinct pairs of sister cells (connected by a horizontal dotted line, and arbitrarily numbered Pair #1-5). Each sister cell pair is derived from a single mother cell that underwent an in vitro mitosis (see Fig. 2-3 for details). Chromosomes are shown with Watson (W) reads in orange, and Crick (C) reads in blue, after alignment to the human reference assembly (GRCh37/hg19) using BAIT. Genomic rearrangements were manually predicted based on changes in template strand orientation and are indicated by arrowheads, with sister chromatid exchanges (SCEs) in red and putative inversions in black.
[Figure 2-4 panel labels: Pair #1, HsSs_0050/HsSs_0061; Pair #2, HsSs_0103/HsSs_0106; Pair #3, HsSs_0078/HsSs_0047; Pair #4, HsSs_0101/HsSs_0097; Pair #5, HsSs_0094/HsSs_0057.]

2.2.2 Localizing genomic rearrangements in single cells based on template strand orientation

By sequencing only template DNA strands, each homologue in a single Strand-seq cell became visible based on read orientation. We predicted any changes in template strand orientation of a chromosome would correspond to changes in the sister chromatids of the homologue, and could reflect genomic rearrangements (Fig. 2-3, and Fig. 2-4, arrowheads). For instance, sister chromatid exchange (SCE) events are somatic rearrangements that arise during mitosis when a double-strand break is repaired by homologous recombination, and their accumulation in cells is an early indicator of genomic instability [25, 112]. Cross-over events resulting from this repair cause template strands of the sister chromatids from that homologue to mix, which can be seen as a characteristic switch in Strand-seq ideograms (Fig. 2-5). For instance, in the paired sisters shown in Fig. 2-3, sister cell A inherited one full C homologue for chr7, and a W homologue that switched to C on the p-arm (Fig. 2-3, iii, red arrowhead). Perfectly mirrored in sister cell B, this change in template strand identity represented a stereotypical SCE [109]. In this pair of cells I found 4 SCEs: two on chr5 and one each on chr7 and chr12 (Fig. 2-3, ii, red arrowheads). Similarly, I located 2-9 SCEs in the other paired cells, with every SCE mirrored in its sister (Fig. 2-4, red arrowheads), confirming the exchange occurred during the preceding cell division.
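The manual reading of ideograms described above can be approximated computationally. The following is a minimal sketch, not the procedure used in this chapter (where rearrangements were located manually). It assumes Watson and Crick read counts have already been tallied in fixed-size bins along one chromosome, classifies each bin as WW, CC or WC using a hypothetical purity threshold of 0.8, reports the bins where the strand state changes, and checks whether the changes seen in one sister cell are mirrored in the other. The function names, the threshold and the example counts are illustrative assumptions.

```python
from typing import List, Tuple

# Strand state expected in the sister cell for each state in this cell.
MIRROR = {"WW": "CC", "CC": "WW", "WC": "WC"}


def bin_state(w: int, c: int, purity: float = 0.8) -> str:
    """Classify one bin from its Watson (w) and Crick (c) read counts."""
    total = w + c
    if total == 0:
        return "NA"  # no reads, no call
    frac_w = w / total
    if frac_w >= purity:
        return "WW"
    if frac_w <= 1.0 - purity:
        return "CC"
    return "WC"


def state_switches(bins: List[Tuple[int, int]]) -> List[Tuple[int, str, str]]:
    """Return (bin index, previous state, new state) for each state change."""
    switches = []
    prev = None
    for i, (w, c) in enumerate(bins):
        state = bin_state(w, c)
        if state == "NA":
            continue  # skip empty bins rather than calling a switch
        if prev is not None and state != prev:
            switches.append((i, prev, state))
        prev = state
    return switches


def is_mirrored(switches_a, switches_b) -> bool:
    """True if every switch in cell A appears with mirrored states in cell B."""
    expected = [(i, MIRROR[p], MIRROR[n]) for i, p, n in switches_a]
    return expected == switches_b


# Hypothetical (Watson, Crick) counts per bin for one chromosome.
sister_a = [(0, 20), (1, 19), (10, 11), (9, 10)]   # CC, then WC
sister_b = [(20, 0), (19, 1), (11, 10), (10, 9)]   # WW, then WC

switches_a = state_switches(sister_a)
switches_b = state_switches(sister_b)
print(switches_a, switches_b, is_mirrored(switches_a, switches_b))
```

A single mirrored switch between a pure state and WC, as in this toy example, is the pattern described for an SCE; at real Strand-seq coverage the bin size and threshold would need tuning against background reads.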
Until now, SCEs could only be tracked cytogenetically by preparing metaphase spreads of cells cultured for two cell divisions (as two rounds of BrdU incorporation were required) [112]. As we previously illustrated using murine embryonic stem cells [109], Strand-seq allows us to visualize SCEs after a single cell division and map them with orders of magnitude greater resolution than traditional approaches.

Figure 2-5 | Diagram of a sister chromatid exchange event in a Strand-seq library
Sister chromatid exchanges (SCEs) are somatic recombination events that occur when a double-strand DNA break (red asterisks) in a chromosomal homologue is repaired by homologous recombination (boxed region). The repair can result in a non-crossover or crossover event, and only the latter is shown for simplicity. In Strand-seq libraries (lower panel), SCEs appear as a change in template strand state along the affected chromosome. Watson is shown in orange, Crick is shown in blue. Nascent BrdU+ strands are shown as dotted lines.
[Figure 2-5 panel labels: double strand break; double strand break repair by homologous recombination can cause a sister chromatid exchange; template strand segregation options: CC - WC, WC - WW, WW - WC, WC - CC.]

To explore the frequency of SCEs in normal human genomes, I manually counted the number of SCEs in 215 hematopoietic cell Strand-seq libraries (footnote 5) (Fig. 2-6). In total, I identified 775 SCEs, and observed anywhere from 0 – 13 SCEs per cell, with the majority having between 3-5 SCEs (giving an overall average of 3.6 per cell) (Fig. 2-6, i). These frequencies are in agreement with previous reports using cytogenetic approaches to map SCEs in human hematopoietic cells [113-115]; however, I was able to map each SCE event at much higher genomic resolution. When I compared BM to CB-derived libraries, I did not see any significant difference in SCE numbers (Fig. 2-6, i).
This suggests donor age did not impact SCE frequencies in our sample; however, it is possible age-related differences would arise under varied growth conditions (e.g. under genotoxic stress) or in controlled longitudinal studies [30, 116]. To test for a chromosomal bias, I looked at the distribution of SCEs across all chromosomes and found an overall frequency of 0.061 SCEs per Mb of sequence (Fig. 2-6, ii). I saw no significant differences between SCE frequencies per chromosome, and SCE number correlated directly with chromosome length (Fig. 2-6, iii). This indicates all chromosomes were equally vulnerable to DNA breaks repaired by SCE. Supporting this, I looked at the genomic distribution of SCEs and found them randomly distributed along the length of all chromosomes, with no significant clustering or deserts (Fig. 2-6, iv). In line with our previous report of the murine genome, I did not observe SCEs recurring at the same genomic location in any two cells, and there was no evidence of localized SCE hotspots in these human cells (n = 215). This is the first report of SCE frequency and distribution in a normal human cell population, as determined by Strand-seq, which can serve as a high-resolution reference to look for changes in SCEs in different cell types and model systems.

Footnote 5: I selected only one cell from each pair of sister cells for this analysis, since paired sisters have identical SCE profiles (as discussed).

Figure 2-6 | Sister chromatid exchange events mapped in the human genome
i) Total number of sister chromatid exchanges (SCEs) counted in each single Strand-seq library generated from human hematopoietic cells, plotted based on the culture conditions (described in Methods Chapter 7; Section 7.21). Libraries were made from blood cells derived from bone marrow (BM) or cord blood (CB) cultured in 5 µM BrdU and sequenced on a HiSeq platform (100 base pair, paired-end reads).
Dotted line represents the overall SCE average per cell (3.605). ii) Average number of SCEs mapped to each chromosome across all Strand-seq libraries analyzed (n = 215), normalized to chromosome size, in megabases (Mb). SCE number for chrX was corrected (white bar), based on the number of male cells (58%). Dotted line represents the average frequency found across all chromosomes (0.061/Mb). iii) SCE frequency is equal to the total number of SCEs detected on each chromosome divided by the number of chromosomes counted (n = 45-70 Strand-seq libraries/experiment). Chromosome sizes are plotted on the X-axis, with chromosome numbers adjacent to data points. iv) To illustrate the genomic distribution of SCEs we binned them into non-overlapping 200 kilobase regions and mapped them to chromosome ideograms. Locations of known inversions identified in later chapters (Chapters 4 and 5) were masked in the plot to only illustrate somatic rearrangements. The length of the horizontal line plotted on the ideogram represents the number of SCEs found in each bin across all libraries.

2.2.3 Localizing inherited genomic rearrangements in single cells based on recurrent template strand changes

I have shown how locating changes in template strands can be used to identify somatic rearrangements in a single cell. In addition to SCEs, we suspected stable genomic rearrangements would also be evident in Strand-seq libraries. With respect to the reference assembly, an inversion causes a localized reorientation in the Watson-Crick state along each DNA strand of a chromosome (Fig. 2-7a).
[Figure 2-6 panel and axis labels: i) number of SCEs in single cells (SCE/cell; BM HSC, BM Epo, CB single, CB pool); ii) SCE number normalized to chromosome size (average SCEs/Mb per chromosome); iii) correlation between SCE frequency and chromosome size (Mb); iv) SCE events plot (200,000 bp window; quality filter = 10).]

Figure 2-7 | Locating and genotyping inversions in Strand-seq libraries
a) Inversions occur when a segment of DNA is reversed in orientation within a chromosome. When compared to the reference genome, this causes a localized reorientation in the Watson (W) - Crick (C) state along each single DNA strand of a chromosome. By sequencing only template strands, inversions are visualized as genomic regions where sequence reads of the inverted DNA segment map to the complementary DNA strand, with respect to the surrounding sequence. Asterisks denote breakpoints. b) In Strand-seq libraries, inversions appear as segmental changes in template strand orientation, and the number of homologues harboring an inversion at the locus can be discerned by the amount of change seen in the template strands. For instance, heterozygous inversions appear as a ‘partial’ change in strand orientation, where a WW chromosome switches to WC along the inversion, and a WC chromosome switches to WW (or CC) along the inversion. Every template strand segregation pattern displays the inversion if it is heterozygous. c) On the other hand, homozygous inversions result in a ‘complete’ change in template strand orientation, where a WW chromosome switches to CC along the inversion and a WC chromosome switches to CW. Consequently, chromosomes inherited as WC mask homozygous inversions, as they appear to have the same template strand orientation as wildtype WC chromosomes. For homozygous inversions, only WW and CC chromosomes are informative.
[Figure 2-7 panel labels: a) ‘wildtype’ and inverted chromosome; sister chromatids after replication. b) Heterozygous inversion, template strand segregation options (WW, CC, WC). c) Homozygous inversion, template strand segregation options (WW, CC, WC).]

Therefore, in Strand-seq libraries we predicted an inversion would appear as a genomic locus where sequence reads of the inverted DNA segment mapped to the complementary DNA strand, with respect to the surrounding sequence, causing a segmental change in strand state. To test this hypothesis, I considered the large (~ 4.5 Mb) well-known human inversion on chr8p23 [74, 75, 117] and tested whether I could identify this genomic rearrangement in the dataset. In the sister pair of Fig. 2-3 (HsSs_0071/HsSs_0077), I identified a segmental change in strand orientation localized to the tip of chr8 that coincided with the expected location of the chr8p23 inversion. For instance, sister cell A exhibited a localized and segmental switch in template strands from CC to WW and back to CC, and the opposite switch was reflected in sister cell B (Fig. 2-3, iii). I observed this rearrangement in almost every cell (footnote 6) from this individual (Fig. 2-8a), suggesting it was stable in the donor’s genome and either inherited from the germline or arose early in development. When I looked for this inversion in cells derived from another donor I did not see the rearrangement (Fig. 2-8b), indicating this is a polymorphism and not a human genome reference assembly error [109] (for an in-depth discussion of this topic see Chapter 4, Section 4.2.2).
These results strongly support the idea that locating segmental changes in template strand orientation enables inversions to be discovered in single Strand-seq libraries.

To locate other potential inversions in our paired cells, I flagged segmental changes in strand orientation that recurred at the same genomic location in multiple cells (Fig. 2-3, and Fig. 2-4, black arrowheads). For instance, in addition to the large chr8 inversion, I observed smaller putative inversions recurring on chr7, chr10 and chr16 (Fig. 2-3, and Fig. 2-4, black arrowheads). Notably, these events were distinct from multiple SCEs on a chromosome, which were found exclusively in a single pair and absent from other cells from that donor, such as the two SCEs located on the p-arm of chr5 in the sister pair HsSs_0071/HsSs_0077 (Fig. 2-3, ii, and Fig. 2-4). Although difficult to visualize at the resolution offered by BAIT ideograms, I identified between 16-28 putative inversions in each pair of sister cells, with the number dependent on: i) the coverage of the libraries (which impacts the number of reads representing the inversion), ii) the level of spurious background reads (which makes it difficult to locate meaningful changes in strand orientation in BAIT ideograms), and iii) the chromosomes inherited as WC (as it is impossible to distinguish between wildtype and homozygous inversions in WC chromosomes; see Fig. 2-7c, and Section 2.2.4 below). Collectively, these results demonstrate that Strand-seq is a non-targeted tool to localize genomic rearrangements in single cells by locating changes in template strand orientation.

Footnote 6: As described below, the inversion is homozygous and only visible in a Strand-seq library when chr8 was inherited in a WW or CC configuration.
Figure 2-8 | Polymorphic inversion located on the p-arm of chr8
i) The segmental change in strand orientation (outlined by red dotted line) is seen at the same genomic location (p-arm of chromosome 8) in multiple cells from the same individual, marking it as an inherited inversion. The inversion is seen in all cells from a male donor provided chr8 is inherited as either WW or CC. It is not evident in WC chromosomes. ii) The inversion is polymorphic and not evident in cells derived from a different donor.
[Figure 2-8 panel labels: i) male donor cells; ii) female donor cells.]

2.2.4 Reliable inversion genotyping of genomic variants in single cells

In contrast to SCEs, which were randomly distributed in the genome, inversions were visualized as localized changes in strand orientation that recurred across multiple libraries. I next explored whether the genotype of inversions could be revealed based on whether one or both template strands exhibited the localized strand re-orientation. At any given locus, a pair of homologues is designated as ‘wildtype’ with respect to the reference genome if it contains no inversion on either homologue, heterozygous if it contains a single inversion on one homologue, or homozygous if it contains two inversions, one on each homologue. We hypothesized that in Strand-seq libraries the number of homologues harboring an inversion at the locus could be discerned by the magnitude of change seen in the template strands (Fig. 2-7). For instance, heterozygous inversions would appear as a ‘partial’ change in local strand state, where a WW or CC chromosome switches to WC, and a WC chromosome switches to a WW or CC state, along the inversion (Fig. 2-7b). On the other hand, homozygous inversions would result in a ‘complete’ change in template strand state, where a WW chromosome switches to CC along the inversion (or vice versa), and a WC chromosome switches to CW (Fig. 2-7c).
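The genotype patterns hypothesized above can be summarized as a small lookup. This is a minimal sketch under stated assumptions rather than the procedure used in this chapter, where genotypes were assigned by eye: it assumes the strand state of the flanking chromosome and of the candidate locus have already been called as WW, CC or WC, and it treats WC and CW as indistinguishable because the reads are not phased. The function name and return labels are hypothetical.

```python
PURE = {"WW", "CC"}
VALID = PURE | {"WC"}


def genotype_inversion(flanking: str, locus: str) -> str:
    """Infer an inversion genotype from flanking and locus strand states."""
    if flanking not in VALID or locus not in VALID:
        raise ValueError("strand states must be 'WW', 'CC' or 'WC'")
    if flanking in PURE:
        if locus == flanking:
            return "wildtype"       # no change in strand state
        if locus == "WC":
            return "heterozygous"   # partial change: one homologue flipped
        return "homozygous"         # complete change: WW <-> CC
    # Flanking chromosome was inherited as WC.
    if locus in PURE:
        return "heterozygous"       # partial change: WC -> WW or CC
    # WC -> CW looks identical to WC -> WC without phasing, so a
    # homozygous inversion cannot be told apart from wildtype here.
    return "uninformative"


for flank, loc in [("CC", "WW"), ("WW", "WC"), ("WC", "CC"), ("WC", "WC")]:
    print(flank, "->", loc, ":", genotype_inversion(flank, loc))
```

The "uninformative" outcome for WC chromosomes is the reason such chromosomes would need to be excluded before estimating how often an inversion is homozygous.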
Consequently, homozygous inversions would be masked in WC chromosomes, as they would appear to have the same template strand orientation as wildtype WC chromosomes (Fig. 2-1, iii).   These genotype patterns were evident in the BAIT ideograms of our paired sister cells. For instance, the large genomic rearrangement evident on the p-arm of chr8 showed a complete change in template strand state because both C homologues in sister cell A underwent a localized switch to W, and both W homologues in sister cell B changed to C, at the inverted locus (Fig. 2-3, iii, chr8 arrowhead). A similar complete template strand state change was evident at this genomic locus in multiple cells from the donor (e.g. see Pair #3 (HsSs_0078/HsSs_0047) & Pair #4 (HsSs_0101/HsSs_0097) in Fig. 2-4, and Fig. 2-7). This recurrent and localized strand state change supports the event represents a homozygous inversion that is present on both homologues and therefore is evidenced by both template strands. Also, the event was not visible when chromosomes were inherited as WC (Fig. 2-7a, lower panel). This illustrates that homozygous inversions are indeed masked in WC chromosomes, suggesting these should be excluded when studying frequencies of inversions, as underrepresented homozygous variant calls will confound allelic frequency calculations. I also located a possible inversion on the q-arm of chr7 at a locus that coincided with a previously reported disease-linked inversion [84, 85, 118], however the size of the event was less evident in the BAIT ideogram. More apparent in sister cell B, I observed a W homologue switching to C at the inversion, and the other W homologue remained W at this locus (Fig. 2-3, iii, chr7 asterisk), suggesting the inversion was present on only one homologue and was heterozygous. This same inversion was seen in other cells derived from this donor (e.g. see Pair #4 (HsSs_0101/HsSs_0097) & Pair #5 (HsSs_0094/HsSs_0057) in Fig. 
2-4), and in all cases only a single homologue appeared to change in strand orientation. I can assume the same homologue always harbors the inversion (i.e. either the maternal or paternal), and thus I can distinguish the two parental homologues based solely on this heterozygous inversion. Collectively, these results illustrate how inversions are genotyped in Strand-seq libraries, that homologues can be distinguished based on heterozygous inversions, and that WC chromosomes mask homozygous inversions. However, at the genomic resolution of BAIT ideograms it is difficult to visualize this event, and a high-resolution approach to visualizing sequencing reads is required.

2.2.5 High-resolution breakpoint mapping of genomic rearrangements in single cells

While inversions were identified and genotyped in the BAIT ideograms of Strand-seq libraries, it was not possible to accurately map the location of inversion breakpoints at this level of resolution. To visualize Strand-seq libraries at higher resolution, sequencing reads were BED-formatted and uploaded as custom annotation tracks onto the UCSC Genome Browser (GRCh37/hg19 assembly) [119] (Fig. 2-9). This allowed me to navigate through the genome and zoom into regions of interest (ROIs) to visualize reads at specific rearrangements and putative inversions (Fig. 2-9, ii). In the UCSC Genome Browser view, every read appears as a line, the color denotes the orientation (C, blue; W, orange) and the precise genomic coordinates of each read aligned to the reference are provided (Fig. 2-9, iii). In this way, changes in the structural organization of homologues can be visualized at high resolution.

In order to predict the biological outcomes of genomic rearrangements it is necessary to accurately define the boundaries of the event and localize breakpoints [18].
Genomic rearrangements appear as changes in template strand states, and therefore we predicted that by narrowing down regions where strand orientations change we could define the genomic range where the breakpoints of the rearrangement were likely to reside. Accordingly, a breakpoint range of an inversion was defined as the interval between the first read within the inverted region (i.e. in the opposite orientation to the chromosome) and the last read outside of the inversion (i.e. in the same orientation as the chromosome) (Fig. 2-9a-c, iii). Thus, for any given inversion, a 5′ and 3′ breakpoint range could be described, where the actual inversion breakpoints are predicted to fall within these breakpoint ranges, and the inverted locus between them.

Figure 2-9 | Manually refining inversion breakpoints in a sister cell pair. UCSC Genome Browser view of Strand-seq data from a sister cell pair of the heterozygous inversion on chr7. The BED-formatted libraries were uploaded as custom tracks to refine inversion breakpoints by locating the reads flanking the inverted region in a) sister cell ‘A’ (HsSs_0071), b) sister cell ‘B’ (HsSs_0077), and c) a merged composite library of the two sisters (see Section 2.2.3 for details). i) The whole chromosome ‘packed’ view of aligned Crick (C, blue) and Watson (W, orange) reads shows the location of the ii) zoomed inset of the inversion. In this ‘squished’ view, each aligned read is denoted as an individual blue (C) or orange (W) line. The manually mapped location of the inversion is shown above (red bars). iii) A representation of the reads used to refine the 5′ (left) and 3′ (right) breakpoint ranges. The genomic positions of these reads were used to predict the breakpoint ranges, with the size of each shown in base pairs (bp, black arrows), and the genomic coordinates listed. d) Browser view of the Refseq genes present near the 5′ (top) and 3′ (lower) breakpoint ranges (red arrows) refined in the merged library. Note that the 5′ read in the 5′ breakpoint (purple) falls directly within an exon of TRIM50, a ubiquitin ligase.
Figure 2-9 panel annotations:
a) Sister Cell A: 5′ range 72,706,295 - 72,728,944 (22,649 bp); 3′ range 74,358,304 - 75,012,977 (654,673 bp); putative inversion 72,728,944 - 74,358,304.
b) Sister Cell B: 5′ range 72,727,194 - 72,749,794 (22,600 bp); 3′ range 74,158,034 - 74,236,795 (78,761 bp); putative inversion 72,749,794 - 74,158,034.
c) Sister Cells Merged: 5′ range 72,727,194 - 72,728,944 (1,750 bp); 3′ range 74,998,891 - 75,012,977 (14,086 bp); putative inversion 72,728,944 - 74,998,891.
d) Genes in 5′ range; genes in 3′ range.

To refine inversion breakpoints in the sister pair HsSs_0071/HsSs_0077 (from Fig. 2-3), I loaded the sequencing data onto the UCSC Genome Browser, filtered reads for a minimum quality score of 10 and navigated to the heterozygous inversion on chr7 (Fig. 2-9). I located the breakpoint ranges for each sister based on the above criteria (Fig. 2-9a-b). The inversion located in sister cell A (72,728,944 - 74,358,304; Fig. 2-9a) overlapped with sister cell B (72,749,794 - 74,158,034; Fig. 2-9b), but did not match exactly due to differences in read densities near the breakpoints. The higher read depth of sister cell B (compare 63.4 reads/Mb of sister cell B versus 31.37 reads/Mb of sister cell A) allowed me to define tighter breakpoint ranges, especially for the 3′ breakpoint, which was narrowed to a 78 kb region (Fig. 2-9a-b). To further refine this inversion, I generated a merged composite file for the pair of sisters by reverse-complementing the sequencing reads of sister cell B and merging them with the reads of sister cell A (Fig. 2-9c).
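The merging step described above can be sketched in a few lines. This is only an illustrative sketch, not the procedure actually used for the thesis data: it assumes that each aligned read is reduced to a BED-like tuple of (chromosome, start, end, strand), and that reverse-complementing an aligned read amounts to flipping its strand label while keeping its coordinates. The names `flip_strand` and `merge_sisters` and the example reads are hypothetical.

```python
from typing import List, Tuple

# A BED-like read record: (chromosome, start, end, strand), strand is "+" or "-".
Read = Tuple[str, int, int, str]


def flip_strand(read: Read) -> Read:
    """Return the read with its strand label inverted (coordinates unchanged)."""
    chrom, start, end, strand = read
    return (chrom, start, end, "-" if strand == "+" else "+")


def merge_sisters(sister_a: List[Read], sister_b: List[Read]) -> List[Read]:
    """Flip every read of sister B, pool it with sister A, and sort by position.

    Sister cells carry complementary template strands, so flipping one sister
    lets both libraries be read in the same orientation (assumption of this sketch).
    """
    merged = list(sister_a) + [flip_strand(r) for r in sister_b]
    return sorted(merged, key=lambda r: (r[0], r[1], r[2]))


# Hypothetical example reads (not taken from the thesis data).
example_a = [("chr7", 100, 200, "-"), ("chr7", 500, 600, "+")]
example_b = [("chr7", 300, 400, "+")]
example_merged = merge_sisters(example_a, example_b)
```

Pooling the two libraries in this way roughly adds their read densities, which is why a composite can narrow a breakpoint range that neither sister resolves alone.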
The increased read density of this composite file allowed me to narrow the 5′ breakpoint range to a 1.7 kb region, and the 3′ range to 14 kb, placing the inversion within a 2.27 Mb region at 72,728,944 - 74,998,891. This illustrates how rearrangements can be very precisely mapped in just two single Strand-seq libraries, and that having paired sister cells can augment data for high-resolution breakpoint mapping.

With a precise genomic location mapped for this inversion, I explored the genes in the region. I identified 96 (44 non-redundant) Refseq genes in the putative inversion, all of which have a different physical location in one homologue of this individual than in the other. Several genes were associated with Williams-Beuren syndrome (such as BCL7B and CLIP2), which is a multisystem developmental disorder that arises when a deletion occurs at the location of this inversion [84, 85]. Other genes had roles in the cytoskeletal structure, including regulating actin polymerization (LIMK1) or elastin assembly (ELN). Finally, I mapped the 5′ breakpoint directly within the coding region of TRIM50 (Fig. 2-9, d), suggesting the transcription of this E3 ubiquitin-protein ligase is disrupted in at least one allele of this donor. This reveals how, by finely mapping inversion breakpoints, the genes within or disrupted by the rearrangement can be examined in order to study the biological consequences of the event in specific cells or individuals.

Figure 2-10 | Manually refining inversion breakpoints of homozygous inversions. UCSC Genome Browser view of a sister cell pair for the homozygous inversion on chr8. Reads were BED-formatted and uploaded as described in Fig. 2-9, for a) sister cell ‘A’ (HsSs_0071) and b) sister cell ‘B’ (HsSs_0077). i) Whole chromosome ‘packed’ view of aligned Crick (C; blue) and Watson (W; orange) reads. ii) Zoom inset of the region, with the manually mapped inversion marked (red bars).
iii) Genome browser details for the first read located within the inversion and used to predict the inverted region. The inversion is flanked by WC regions (red arrows), which hamper the ability to accurately define the breakpoint ranges. c) Two merged composites of all the libraries from this donor that had chr8 inherited as Crick-Crick (top; n=20) or Watson-Watson (bottom; n=19). The WC regions flanking this inversion overlap with blocks of segmental duplications, as shown below. ii) The breakpoint ranges were mapped to include the flanking WC regions (outer pink bars), with the intervening region defined as the inversion (blue bar). iii) Genomic coordinates of the inversion as mapped in the merged composite file.
Figure 2-10 panel annotations:
a) Sister Cell A: putative inversion 8,013,944 - 11,974,184. Reads shown: HS15_49:6:1114:12272:53096/1 (score 22; chr8:8013845-8013944; band 8p23.1; genomic size 100; strand -) and HS15_49:6:2212:2082:90354/2 (score 23; chr8:11974184-11974283; band 8p23.1; genomic size 100; strand -).
b) Sister Cell B: putative inversion 8,195,707 - 11,790,431. Reads shown: HS15_49:6:2209:14390:30446/2 (score 60; chr8:8195608-8195707; band 8p23.1; genomic size 100; strand +) and HS15_49:6:1203:18647:55963/1 (score 60; chr8:11790431-11790530; band 8p23.1; genomic size 100; strand +).
c) Merged composites (n = 20 and n = 19): 5′ range 7,261,256 - 8,034,631; putative inversion 8,034,631 - 12,038,532; 3′ range 12,038,532 - 12,454,792.

The breakpoints for the homozygous inversion on chr8 were more difficult to refine (Fig. 2-10). Although the rearrangement was apparent in the UCSC genome browser view (Fig. 2-10a-b, ii), and I easily located the first read in the inversion (Fig. 2-10a-b, iii), the region was flanked by segments of reads that aligned to both W and C strands, complicating breakpoint mapping (Fig. 2-10a-b, red arrows). I observed these flanking WC regions in other cells from the individual, highlighted when I merged together the sequence data for cells that inherited chr8 as WW or CC into two large, high read depth files (Fig. 2-10c, i). The WC regions that flanked the inversion had lower read densities and contained large blocks of segmental duplications (Fig. 2-10c, ii). The repetitive nature of the genome at these locations makes it difficult to map sequencing reads uniquely, and this can cause the low read depths and loss of directionality. Moreover, the rearrangement breakpoints likely occur within these repetitive regions, which are known to be susceptible to non-allelic homologous recombination [18, 31, 120]. Therefore I mapped the breakpoint ranges to encompass the WC regions, and the putative inversion to reside between them, at 8,034,631 - 12,038,532 (Fig. 2-10c, iii). Notably, entries listed in the DGV map the breakpoints of this inversion to multiple different locations, with the majority of the records falling within this region. This illustrates how chromosomal architecture flanking genomic rearrangements impacts our ability to precisely map breakpoints. However, the direct visualization of the event still allows us to predict the structural variant with high confidence. Overall, these results show how Strand-seq can be leveraged as a reliable new tool to rapidly characterize structural features in single cell genomes, and to map breakpoints with high resolution in order to explore somatic and germline rearrangements in populations of cells.
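The breakpoint-range definition used in this section (the last read outside the inversion and the first read inside it) can be illustrated with a short sketch. It is a simplified illustration under stated assumptions rather than the manual procedure itself: reads are given as (start position, state) pairs sorted by position, every read in the opposite orientation to the chromosome is assumed to belong to a single inversion (no background reads), and read start positions stand in for whichever read edge is used in practice. The function names and the toy reads are hypothetical; the two ranges passed to `range_size` at the end are the merged-library coordinates listed for Fig. 2-9c.

```python
from typing import List, Optional, Tuple

Range = Tuple[int, int]


def breakpoint_ranges(
    reads: List[Tuple[int, str]], chrom_state: str
) -> Optional[Tuple[Range, Range]]:
    """Return (5' range, 3' range) for one inversion, or None if undefined.

    reads: (start, state) pairs sorted by start; state is "W" or "C".
    chrom_state: the predominant state of the chromosome ("W" or "C").
    """
    opposite = [i for i, (_, state) in enumerate(reads) if state != chrom_state]
    if not opposite:
        return None  # no read in the opposite orientation
    first, last = opposite[0], opposite[-1]
    if first == 0 or last == len(reads) - 1:
        return None  # no flanking read outside the inversion on one side
    five_prime = (reads[first - 1][0], reads[first][0])
    three_prime = (reads[last][0], reads[last + 1][0])
    return five_prime, three_prime


def range_size(r: Range) -> int:
    """Size of a breakpoint range in base pairs."""
    return r[1] - r[0]


# Toy reads on a W chromosome with a heterozygous-looking inverted segment.
toy_reads = [(100, "W"), (250, "W"), (400, "C"), (520, "C"),
             (610, "W"), (700, "C"), (900, "W"), (1000, "W")]
toy_five, toy_three = breakpoint_ranges(toy_reads, "W")

# Sizes of the merged-library ranges listed for Fig. 2-9c.
merged_five_size = range_size((72_727_194, 72_728_944))
merged_three_size = range_size((74_998_891, 75_012_977))
```

The sketch makes the dependence on read density explicit: each range is bounded by two neighbouring reads, so sparser coverage near a breakpoint directly widens the range.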
2.3 Discussion

Here I present a novel single cell sequencing strategy that maps genomic rearrangements by tracking homologues of each chromosome. As reflected in the pairs of sister cells, I show that changes in homologue orientation reveal somatic rearrangements, such as SCEs, and recurrent rearrangements, such as inversions. By using strand orientation to genotype inversions and read locations to finely map breakpoints I was able to rapidly characterize this copy-neutral structural variant with high precision. This allows the genes and the genomic architecture of the rearrangements to be explored with new detail. My approach eliminates the need to impute rearrangements using complicated algorithms that search misaligned reads in complex and repetitive regions to predict inversions with low confidence.

While the genomic architecture flanking inversions impacted our ability to precisely map breakpoints, the opportunity to visualize the event offered a clear improvement over conventional SV discovery methods. The reliability of directly seeing the rearrangement is a major advantage over other sequencing approaches, and bypasses the requirement for secondary validation techniques. However, the lower read depths of our libraries (which are exacerbated in complex repetitive DNA regions, where inversion breakpoints often localize [66]) can make it difficult to map precise breakpoints from a single Strand-seq library. To overcome this, merging reads from multiple related libraries can help increase read densities at inversion breakpoints to more accurately locate them. This highlights how the genomic architecture flanking inversions impacts our ability to precisely map breakpoints, while simultaneously providing insights into their putative mechanism of generation.
Collectively, I believe Strand-seq offers a new and powerful method to quantify and map rearrangements after a single mitosis, and a high-throughput screening tool to study processes that govern genome dynamics.

One way to explore genome dynamics is by mapping SCEs, which are an early indicator of genome instability. While they are generally thought to be error-free, unequal crossovers at SCEs can lead to CNVs, loss of heterozygosity and aneuploidy [112, 116]. By exploring somatic rearrangements in our normal human population, I report a baseline of SCE frequencies and distributions in a healthy sample. This information can be used to look for elevated levels of SCEs and genomic hotspots in disease models, such as in Bloom Syndrome and cancer patients [113, 115, 121, 122]. Perturbations in molecules that maintain genome fidelity, such as the RecQ helicase Bloom, are predicted to increase SCE frequencies, and Strand-seq is a reliable tool to measure this. Moreover, mapping SCEs in Strand-seq libraries from different cell populations may reveal fragile sites or homologous recombination hotspots that are cell type specific [30]. Collectively, these results illustrate how to study genomic rearrangements in model systems, and highlight the use of Strand-seq to test for genomic instability.

Germline genomic rearrangements can also be characterized in Strand-seq libraries. By locating recurrent changes in template strand states, several types of stable rearrangements can be discerned including translocations and inversions. I demonstrate this by focusing on inversions and manually characterizing two recurrent events in our paired sister cells. Inversions can be discerned from sporadic events and rapidly genotyped based on the pattern (i.e. the distribution, recurrence, and magnitude) of template strand change observed.
Confirming the reliability of this method, the coordinates for each inversion closely coincide with previous reports [36, 56, 73], illustrating that structural variants can be accurately mapped in even a single pair of cells. As evidenced in the homozygous inversion, the resolution of breakpoint mapping was limited by flanking repetitive sequence elements that appear as low read-depth WC regions. It was not possible to distinguish these regions in heterozygous inversions that already appear WC; however, it is likely that heterozygous inversions are also flanked by such segmental duplications. By narrowing the location of inversion breakpoints in Strand-seq libraries, one can design PCR primers to map the rearrangement down to the base pair level. This also offers the opportunity to confirm any genes that may be disrupted by the inversion in order to understand the phenotypic consequences of the rearrangement.

While it is the recurrence of the rearrangement that distinguishes inversions from sporadic SCEs or spurious background reads, it is not possible to determine when in time the inversion occurred. For instance, the rearrangement may be a historical event that propagated in the population or a de novo event that occurred in the parent’s gametes or early in the individual’s development [123]. While the two inversions characterized here are likely historical events (as they were both previously described [36, 56, 73]), to distinguish when the event occurred within a lineage it is necessary to generate multiple Strand-seq libraries from defined sources (e.g. a family pedigree or cellular lineage, such as from cancer tissue). This may involve studying different tissues from the same individual, or different individuals from specific demographics. This would allow one to explore inversion frequencies in defined populations in order to ascertain when the event occurred and the mechanistic details of how it arose.
Similar analyses would have to be performed for inversions identified within tumor cell samples in order to determine the extent of mosaicism, and whether they arose during carcinogenesis or were inherited through the germline [27, 28]. In this way Strand-seq offers an opportunity to investigate the genomic heterogeneity of populations, and explore the biological relevance of specific inversions.

2.4 Conclusion

Strand-seq is a reliable, non-targeted tool to localize genomic rearrangements in single cells. By sequencing only template strands, parental homologues can be tracked and rearrangements can be reliably mapped simply by locating changes in strand orientation. In capturing 100 paired sister cells, I have created a dataset which illustrates the concept of Strand-seq, the inheritance of parental homologues and the reciprocal nature of genomic rearrangements in paired daughter cells. I clearly show how somatic and de novo rearrangements can be observed in Strand-seq libraries by localizing single or rare switches in template strands, and how hereditary events can be distinguished as rearrangements that map to the same genomic location in every cell of an individual. By taking advantage of the sensitivity of sequencing, and the directionality of template strands, Strand-seq offers a much-needed tool for localizing inversions in single cells with high resolution. Our ability to visualize changes in strand orientation and map genomic rearrangements in single chromosomes is a major advance in genomics, and supports the utility of single cell approaches to study complex populations. However, it is not reasonable to manually interrogate the hundreds of single cells required for a comprehensive analysis, and for high-throughput studies the method must be automated.
Chapter 3 | A bioinformatic tool to systematically characterize structural rearrangements in Strand-seq libraries (footnote 7)

“If you can’t explain it simply, you don’t understand it well enough” – Einstein

Chapter synopsis: To help realize the power of Strand-seq, I developed a bioinformatic pipeline (called Invert.R) that accurately tracks template strand states in Strand-seq libraries using a read-based binning strategy. I describe in detail how this software applies a sliding-window approach to capture and compare local template strand states based on read ratios in order to predict locations of genomic rearrangements. Invert.R was developed with the end-user in mind, and outputs a variety of usable files including descriptive histograms and summary tables that synthesize data for further analysis and allow comparisons to be made between cells. I tested and validated the software on Strand-seq libraries to determine its accuracy and precision in mapping variants in real sequencing data. By displaying inversions in single cells, refining inversion breakpoints across multiple cells, and discovering unknown inversions in an unbiased and non-targeted approach, bioinformatic tools like Invert.R are essential to launch Strand-seq as a high-throughput experimental method to explore structural variation in distinct cell types and populations.

Footnote 7: The bioinformatic software described in this Chapter will be made open-source through SourceForge (http://sourceforge.net/home.html; September 2015), and will be integrated into a comprehensive analysis suite tailored for Strand-seq data (Strand-seek.R), planned for release in 2016.

3.1 Introduction

Identifying genomic features that differ between individuals and cells can help uncover the functional variants that drive specific biological outcomes. Today, the fundamental approach to studying structural variants in human genomes is by DNA sequencing [14].
Current methods used to locate structural polymorphisms from sequencing data require complicated algorithms that impute variants based on incongruous mapping signatures [32, 37, 38, 55]. The specific pattern of discordant alignment to a reference genome serves as the variant signature, which can be found based on distinct paired read signals [124-126], read depth signals [44, 45], or split read signals [46-49, 127]. Highly sophisticated algorithms have been tailored to tease out subtle differences in order to predict the location of copy number changes and rearrangements in sampled genomes [50, 53-55, 126]. However, no one method is truly comprehensive and all are prone to false calls, requiring secondary validation tools (such as FISH or PCR) to confirm predictions [1, 37, 39, 52, 56].

Balanced rearrangements such as inversions present a special challenge to SV detection [14, 39]. This is highlighted by the disparity between high-quality inversion calls and other variant calls that alter DNA content, such as insertions and deletions [3, 36, 37, 57]. Conventional SV detection tools can predict inversions if sequencing reads or paired read mates span the inversion breakpoints [13, 36-38, 50, 126]. Since this results in aberrant read alignment, multiple reads must be present at the breakpoint in order to confirm the signature, and therefore large amounts of input DNA and sequence reads are required to obtain adequate coverage [50, 52]. Moreover, once the inversion is predicted, high-density single nucleotide variants are required to genotype the allele [36, 37, 52], which means single cell studies are not possible. Finally, the presence of large repetitive blocks of DNA at inversion breakpoints greatly interferes with unambiguous read mapping, which further challenges prediction software [14, 32, 39].
Overall, current sequencing methods to locate SVs are not well suited to detect inversions, which is largely due to repetitive DNA features flanking this copy-neutral structural rearrangement.

In the previous Chapter (Chapter 2), I introduced Strand-seq and illustrated how this sequencing advance is uniquely tailored to mapping structural rearrangements in single cells. The power of this novel method simply lies within the ability to: i) track read directionality in an individual homologue of a cell, and ii) visualize recurrent patterns of template strands across homologues of multiple cells. While the technology directly addresses a current need in the structural genomics field, it is not practical to physically load individual libraries onto the UCSC genome browser, manually locate every template strand switch in each single cell, and visually interrogate loci across multiple cells to locate recurrent events. Not only is this expensive, in terms of time and labor, but it also introduces a dangerous level of subjectivity into the analysis, which jeopardizes reproducibility. Therefore, to realize the potential of Strand-seq, the technology has to be scaled, not only in terms of data production (which was accomplished by adapting the protocol for construction on a robotic liquid handler, as described in Methods Chapter 7, Section 7.2.3), but also in terms of data analysis.

As our technology uniquely centers on strand directionality, current bioinformatic pipelines are not suited to analyze Strand-seq data and identify genomic rearrangements. In this Chapter, I describe how I overcame this obstacle by building an R-based [128] bioinformatic package, called Invert.R. This customized script considers the directionality of reads in individual Strand-seq libraries in order to locate template strand state changes.
I validated this pipeline using human libraries and illustrate how to rapidly locate putative inversions, map breakpoint locations and predict genotypes, all in a single cell. To add power to the calls made in any individual library, I expanded the algorithm to assemble data from multiple cells in order to look for patterns and recurrent events on a population-scale. I used this tool to refine the breakpoints of two well-known inversions, illustrating the power and accuracy of Strand-seq. The bioinformatic software I developed can be used as a non-targeted method to locate genomic rearrangements in whole chromosomes, help extract the genomic complexity of an individual cell and assay the diversity of a population of cells.

3.2 Results

3.2.1 Tracking template strand states using local W/C ratios

In order to systematically study hundreds of Strand-seq libraries, I set out to build appropriate bioinformatic tools that could handle Strand-seq data. To do this, I collaborated with Dr. Mark Hills, a postdoctoral fellow in the laboratory, to develop the custom analysis script, Invert.R (illustrated in Fig. 3-1). This algorithm was tailored to track template strand states in Strand-seq libraries in order to localize and genotype putative inversions based on read alignment. Strand-seq centers on strand directionality, and thus our principal concern is whether a read aligns to the forward (‘+’; Crick, C) or reverse (‘-’; Watson, W) strand of the reference genome. We have previously shown how to determine template strand states by calculating the number of W and C reads within a defined genomic region, or a bin, and how to visualize this on chromosome ideograms using the open-source software BAIT [110]. While this software is very good at assigning templates to chromosomes (e.g. see Fig. 2-4, or Fig. 3-6), it is not optimized to locate changes in strand states that arise from genomic rearrangements.
This is mainly because the binning strategy of BAIT is based on an arbitrary and fixed genomic size (e.g. 200 kb) that introduces artificial breaks into the genome, which can mask small events or events that fall between the breaks. Therefore, to develop Invert.R we created a new read-based, sliding window strategy, where the bins were dynamically defined and designed to track along every individual read in the library, aiming to capture all rearrangement events with better resolution (Fig. 3-1, ii, and Fig. 3-2).

Figure 3-1 | Invert.R – a custom bioinformatic approach to tracking template strands. Schematic representation summarizing each step of Invert.R. See right-hand panel text for description of each step. The schematic follows a homozygous and a heterozygous inversion through the steps: i) Strand-seq library; ii) calculate W/C ratios; iii) visualize as a histogram; iv) predict breakpoints; v) calculate ∆W/C ratio. Histograms plot the W/C ratio (0 to 1) against chromosome location.
Panel text:
i) A representative chromosome of a single Strand-seq library, with Crick (C) reads in blue, and Watson (W) reads in orange (example shown: 52 C, 86 W*). Aligned libraries can be input as either BAM or BED file formats. Invert.R splits input files by chromosome, and filters them to ensure a minimum read density is met (e.g. 20 reads/Mb). It then tests the predominant read state (asterisk), based on the overall proportion of W and C reads in the chromosome. This also allows it to determine if the chromosome is WW or CC.
ii) The bin (b) is a user-defined variable that defines the window size, based on read number, which is used to calculate a local ‘W/C ratio’ by comparing the number of W and C reads in the region (shaded box): W/C ratio = predominant read number / bin size (e.g. Watson / (Watson + Crick)). The W/C ratio is assigned to the first base pair position of the first read (e.g. r1) in the window. The window then slides forward one read (e.g. r1 to r2, with b2 dynamically resizing), and then sequentially along every read (r3...), to repeat the calculation and assign a W/C ratio to each read. This generates a vector of W/C ratios, each associated with a genomic coordinate representing virtually every sequenced position in the library.
iii) Invert.R outputs a histogram by plotting W/C ratio values on the y-axis with the corresponding genomic coordinates on the x-axis. This allows template strand states to be visualized, where a pure W or C region gives a W/C ratio of 1.0 or 0.0, and a mixed WC region gives a ratio near 0.5. Changes in template strand states are found by locating regions where W/C ratios change (i.e. dip below 1), suggestive of a genomic rearrangement. The W/C ratios of each library are output as a .bedgraph file, which can be visualized on the UCSC Genome Browser.
iv) A threshold (Th) is applied that is 20% below the basal level of background in the library. Locations where W/C ratios fall below the threshold (red arrows) are flagged as potential template strand changes. To predict inversions, Invert.R identifies regions where W/C ratios dip below (5′ breakpoint) and return back above (3′ breakpoint) the threshold. The 5′ breakpoint represents the first read below Th, where the preceding 10 reads (outside the inversion) all correspond to the predominant state (10 reads, 100% W in the example), and the succeeding 20 reads (inside the inversion) contain > 25% in the opposite state (20 reads, 25% C in the example). The 3′ breakpoint is similarly defined, and the locus is flagged as a region of interest (ROI).
v) To filter ROIs and reduce false-positive calls Invert.R tests that a minimum number of reads are present in the region and remain below the Th. It also calculates a ∆W/C value for each ROI by subtracting the average W/C ratio for the ROI from the basal W/C ratio of all reads above the threshold. A ∆W/C ratio near 0.5 is expected at ROIs that switch from a pure WW or CC to a mixed WC state (which is indicative of a heterozygous inversion), and a ∆W/C ratio near 1.0 is expected for complete template strand switches (and is indicative of a homozygous inversion). In the example shown, the basal average W/C ratio is 0.97; the homozygous inversion has an average W/C ratio of 0.1 (∆W/C = 0.97 - 0.1 = 0.87, i.e. ~1.0) and the heterozygous inversion has an average W/C ratio of 0.48 (∆W/C = 0.97 - 0.48 = 0.49, i.e. ~0.5). Including only ROIs with a ∆W/C ratio > 0.3, Invert.R outputs a list of putative inversions for every single cell library.

To do this, we programmed Invert.R to begin at the first aligned read in a library, or chromosomal region of interest, and survey a user-defined (bin) number of reads forward (e.g. 20 reads) to define the window (representing a genomic region), and count the number of W and C reads within this region (Fig. 3-2a, i dotted box). After assigning a value (W/C ratio, described below) to that read, the program stepped forward to the next aligned read, resized the window to capture the specific bin number of reads (e.g. 20 reads) (Fig. 3-2a, ii), and repeated the calculation to assign a value to that read. It then stepped forward again (Fig.
3-2a, iii), moving sequentially through every read until the end of the library was reached, and assigning values to each read as it went. This dynamic binning strategy meant there was a direct relationship between window boundaries and mapped reads, avoiding unnatural breaks. Also, the genomic size (number of bases covered) of each window was regionally defined (Fig. 3-2a, red text) and inversely correlated to local read depths (reads/Mb)8 (Fig. 3-2b). Finally, the sliding nature of the bins resulted in overlapping windows that helped smooth the data and ensured every read was represented in the analysis.

8 For instance, a bin = 20 will span 1,000,000 bases in a region containing 20 reads/Mb, whereas in a deeper region containing 100 reads/Mb the same bin will span 200,000 bases. This is relevant because we see regional changes in read density along chromosomes in Strand-seq libraries.

Figure 3-2 | A read-based sliding window binning strategy to analyze Strand-seq data
Invert.R implements a new binning strategy that is based on the sequencing reads of a Strand-seq library, which align to either the ‘+’ strand (Crick (C), in blue) or ‘-’ strand (Watson (W), in orange) of the reference assembly. a) To illustrate this strategy, we will follow the reads labeled ‘x’, ‘y’ and ‘z’, using a bin of 20 reads (bin = 20). i) Starting from read ‘x’, the bin extends forward 19 reads (encompassing 20 in total) to calculate the proportion of W and C reads in the region and determine the W/C ratio, as shown (black text). The W/C ratio value is assigned to the first base pair position of read ‘x’, and then the bin slides forward a single read, to read ‘y’. ii) From read ‘y’ the bin again extends forward 19 reads (to encompass 20 reads in total) and calculates a new W/C ratio based on the reads in this new window. After assigning this to the first position of read ‘y’, the bin slides forward one more read to iii) read ‘z’, and repeats the calculation. In this way, a value is assigned to virtually every read in the library. The sliding window approach results in overlap between adjacent bins, allowing for a smooth distribution of data. Also, notice how the genomic size of the bin (red text) is dynamically resized and dependent on the distance between the reads in the region. b) There is an inverse relationship between the genomic range of the bin and the read depth of the library.

With an advanced binning strategy in place, we next developed a new method to interpret read directionality that was tailored to track changes in template strand states. A template strand state can be interpreted as the proportion of reads aligning to the ‘+’ strand (i.e. C reads) and ‘-’ strand (i.e. W reads) of the reference assembly, and a change in template strand state (indicative of a rearrangement) is technically a change in these proportions. Therefore, we theorized that by tracking the ratio of W and C reads along a chromosome we could bioinformatically localize genomic rearrangements. For this, Invert.R first found the ‘predominant’ read direction of the entire chromosome by calculating the total number of W and C reads, and assigning the chromosome state based on which was higher (Fig. 3-1, iii). Then, W/C ratios were calculated for each bin, defined as the number of predominant reads divided by the total reads in the bin9 (Fig. 3-2, black text), and this ratio was assigned to the first base pair position of the first read in the bin. This generated W/C ratios between 1.0 – 0.0, where a W/C ratio of 1.0 or 0.0 signified all reads within the bin were the same (either 100% in the predominant state or 100% in the opposite state, respectively), and a W/C ratio of 0.5 signified equal numbers of W and C reads were present.
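The read-based sliding window described above can be illustrated with a short sketch. The Python below is not the Invert.R source (which was written in R); the function name `wc_ratios`, the `(position, strand)` input format, and the choice to skip the final reads that cannot fill a complete bin are all assumptions made for illustration.

```python
from typing import List, Tuple

Read = Tuple[int, str]  # (start position, strand), strand is "W" or "C"


def wc_ratios(reads: List[Read], bin_size: int = 20) -> List[Tuple[int, float, int]]:
    """Slide a bin of `bin_size` reads along position-sorted reads.

    Returns one (position, W/C ratio, genomic span) tuple per bin, where the
    ratio is the fraction of reads in the bin that match the predominant
    strand of the whole chromosome, and the position is the start of the
    first read in the bin.
    """
    n_watson = sum(1 for _, strand in reads if strand == "W")
    n_crick = len(reads) - n_watson
    # Predominant state of the chromosome anchors the ratio at 1.0.
    predominant = "W" if n_watson >= n_crick else "C"

    results = []
    # Assumption: trailing reads that cannot fill a full bin are skipped.
    for i in range(len(reads) - bin_size + 1):
        window = reads[i:i + bin_size]
        n_pred = sum(1 for _, strand in window if strand == predominant)
        span = window[-1][0] - window[0][0]  # bases covered by this bin
        results.append((window[0][0], n_pred / bin_size, span))
    return results
```

Because the bin is defined by a read count rather than a fixed number of bases, the returned span shrinks in deeply covered regions and grows in sparse ones, which is the inverse relationship shown in Fig. 3-2b.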
In this way, W/C ratios revealed the local template strand state of the reads in the bin, and by calculating this for every bin a local strand state was assigned to virtually every read in the library (Fig. 3-1, ii).

9 If the predominant read state was W, local W/C ratios for that chromosome were calculated as number of W reads divided by total (Watson and Crick) reads, whereas if this state was C then they were calculated as number of C reads divided by total reads. This anchored the predominant chromosomal state to 1.0 on the histogram.

[Figure 3-2 panel text: a) W/C ratios computed with bin = 20 for reads x, y and z (e.g. read x: Watson = 14, Crick = 6, W/C ratio = 14/20 = 0.7; read y: W/C ratio = 13/20 = 0.65; read z: W/C ratio = 12/20 = 0.6), with bins spanning 94 kb, 107 kb and 99 kb. b) Genomic size of a bin = bin size × bases / reads; for bin = 20 in a 2 Mb region: ~1,000,000 bases at 20 reads/Mb (Lib1), ~400,000 at 50 reads/Mb (Lib2) and ~200,000 at 100 reads/Mb (Lib3).]

By assigning a W/C ratio to every read, we aimed to identify template strand state changes within each chromosome. Since the predominant state was used as the dividend, any deviation away from the chromosome state (which is the circumstance of a genomic rearrangement) resulted in a decrease in W/C ratio values toward 0. Consequently, changes in W/C ratios corresponded to changes in template strand states: a W/C ratio of 1.0 meant all the reads in the bin were in the same orientation as the rest of the chromosome (i.e. there was no change in template strand state), a W/C ratio of 0.5 occurred if half the reads in the bin were in the opposite orientation to the rest of the chromosome (suggesting a single template strand state changed), and a W/C ratio of 0.0 occurred if all the reads in the bin were in the opposite orientation to the rest of the chromosome (indicating both template strands were changed).
Therefore, tracking W/C ratio values along a chromosome revealed locations where template strand states changed (Fig. 3-1, iii). Combining our binning strategy with local W/C ratios, we generated a new tool to explore template strand states in Strand-seq libraries.

To test our new strategy on Strand-seq data, and confirm whether we could track local template strand states as expected, I applied Invert.R to analyze a single cell library (HsSs_0123) generated from a male bone marrow donor that harbored inversions previously identified and characterized in Chapter 2 (Section 2.2.3). Invert.R calculated W/C ratios for chr7 (which was WW for this library) and chr8 (CC) along the regions containing the inversions (chr8p23: 5,000,000 – 15,000,000, and chr7q11: 65,000,000 – 80,000,000) and plotted the results as histograms (Fig. 3-3). The line in the histogram represents the local W/C ratio calculated for each bin and assigned to the read at that genomic location. Based on the total proportion of W and C reads in each chromosome (Fig. 3-3, i), Invert.R accurately identified the predominant read state of each chromosome, and both histograms were anchored to a W/C ratio value of 1.0 (Fig. 3-3, ii). The W/C ratios of each chromosome showed a clear decrease in values and dipped below 1.0 at the regions where the inversion was located, correlating with the change in template strand states evident in the UCSC genome browser view of the Strand-seq reads (Fig. 3-3, iii). Testing different bin sizes showed a tradeoff between sensitivity and noise; sharper peaks were evident when a bin of 10 reads was used (Fig. 3-3, ii, yellow), as compared to a bin of 25 (Fig. 3-3, ii, orange) or 50 reads (Fig. 3-3, ii, red). This highlights how the resolution of tracking local template strand states was dependent on the bin size used to calculate the W/C ratios. This preliminary test revealed template strand states could be accurately tracked based on local W/C ratios, suggesting a bioinformatic approach could identify strand state changes in order to systematically study genomic rearrangements in Strand-seq libraries.

Figure 3-3 | Calculating W/C ratios of Strand-seq data at known locations of template strand state changes
Invert.R was used to calculate W/C ratios for the Strand-seq library HsSs_0123, at the inversions identified on a) chromosome 7 (chr7) and b) chr8, which were previously characterized in Chapter 2, Section 2.2.5. i) The total read counts for the whole chromosome were used to assign the predominant read state and anchor the data (by setting the baseline as 1.0), as described in text (Section 3.2.1). ii) Overlaid histograms of the Invert.R-calculated W/C ratios determined using a bin size of either 10 (yellow), 25 (orange) or 50 (red) reads. Each line in the histogram represents the local W/C ratio of that bin, assigned to that genomic location. Changes in the W/C ratio are visualized as peaks and valleys in the histogram, which occur at locations where template strand states change. The extent and resolution of W/C ratio changes depend on the bin size. iii) BED-formatted Strand-seq reads were loaded onto the UCSC genome browser to visualize sequencing reads at the location, with Crick reads in blue, and Watson in orange.

[Figure 3-3 panel labels: a) HsSs_0123, chr7: 68,000,000 - 80,000,000; Crick reads = 399, Watson reads = 18354, reads/megabase = 117.8. b) HsSs_0123, chr8: 5,000,000 - 15,000,000; Crick reads = 16405, Watson reads = 899, reads/megabase = 118.2. Histogram y-axis: WC Ratio; legend: bin = 10, 25, 50.]

3.2.2 Characterizing putative inversions in single cells based on W/C ratios

With a new bioinformatic approach to assign local template strand states and compare them to chromosomal states, we then programmed Invert.R to locate putative inversions.
As inversions appear as segmental changes in template strand states along a chromosome, we predicted we could bioinformatically locate them as segments where W/C ratio values dipped toward 0.0 (Fig. 3-1, iii). For this we had Invert.R identify locations where W/C ratio values fell below and then returned above a dynamic threshold limit that was automatically calculated based on the number of spurious reads (i.e. noise) in that specific chromosome (Fig. 3-1, iv). The threshold was determined by calculating the average W/C ratio of all values above a user-defined baseline threshold (e.g. 0.8), which measured the basal level of background in the library but omitted W/C ratios of reads present in inversions that would skew the background calculation. With the basal background level calculated (between 1.0 and baseline input), Invert.R subtracted 0.2 from this value to assign the calling threshold limit; this ensured only regions with W/C ratios falling more than 0.2 (approximately 20%) below the background level were flagged by Invert.R. If a W/C ratio fell below the calling threshold limit, Invert.R would locate the next W/C ratio that was above the threshold limit and flag the intervening segment as a region of interest (ROI) for breakpoint mapping (Fig. 3-4, red bars).

Once an ROI was flagged for breakpoint mapping, Invert.R predicted the upstream (5′) and downstream (3′) breakpoints of the event by locating the nearest 5′ and 3′ flanking reads that were above the threshold (Fig. 3-1, iv). Using a modification of the SCE locator in BAIT [110], Invert.R marked the outermost limits of the ROI by walking step-wise away from the first W/C ratio below the threshold until it located the nearest read that fulfilled two criteria: 10 neighboring reads outside of the inversion were in the direction of the un-inverted chromosome (i.e.
in line with the predominant chromosome strand state, and together had a W/C ratio of 1.0), and at least 25% of the 20 neighboring reads within the inversion were in the other direction (i.e. not in line with the predominant state, and together yielding a W/C ratio less than 0.76) (Fig. 3-1, iv). For instance, to call the 5′ breakpoint in a CC chromosome, Invert.R identified the first 5′ read that had a W/C ratio below the threshold, and then calculated a W/C ratio for the preceding 10 reads, moving away from the inversion until the ratio was 100% C. It then checked that the ratio of the succeeding 20 reads in the 3′ direction was at least 25% W. If the test failed, it moved to the next 5′ read, and repeated the test until both conditions were met. The start location of the first read meeting both criteria was assigned as the 5′ breakpoint for the putative inversion, which defined the outermost 5′ site where the strand orientation changed. The 3′ breakpoint was identified using the same strategy. Based on these criteria, adjacent ROIs may end up overlapping, in which case they were merged together into a single event. By performing these tests for any W/C ratio that fell below the threshold, Invert.R predicted the location and breakpoints of putative inversions in a non-targeted approach (Fig. 3-4, vertical dotted lines).

With the breakpoints of ROIs predicted, two tests were added to Invert.R to help reduce the chance of false-positive calls. First, it confirmed that a user-defined number of reads were present in the ROI and below the threshold level (e.g. 20 consecutive reads). This helped ensure a sufficient number of reads evidenced the putative inversion. Next, the extent of change in strand orientation was calculated, which can be used to predict the genotype of the ROI.
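The calling threshold, ROI flagging and 5′ breakpoint criteria described above can be sketched as follows. This is a Python illustration only, not the Invert.R code: the function names, the use of list indices in place of genomic coordinates, and the decision to count the candidate read as the first of the 20 "inside" reads are assumptions. The source states the inside criterion both as "> 25%" and "at least 25%"; the sketch uses the latter.

```python
from typing import List, Sequence, Tuple


def calling_threshold(ratios: Sequence[float], baseline: float = 0.8,
                      offset: float = 0.2) -> float:
    """Mean of all W/C ratios above `baseline`, minus a fixed offset."""
    background = [r for r in ratios if r > baseline]
    if not background:
        raise ValueError("no W/C ratios above the baseline")
    return sum(background) / len(background) - offset


def flag_rois(ratios: Sequence[float], threshold: float,
              min_reads: int = 20) -> List[Tuple[int, int]]:
    """Half-open index runs that dip below and then return above threshold.

    Runs shorter than `min_reads` are dropped; a run that never returns
    above the threshold before the data ends is not reported.
    """
    rois, start = [], None
    for i, ratio in enumerate(ratios):
        if ratio < threshold and start is None:
            start = i
        elif ratio >= threshold and start is not None:
            if i - start >= min_reads:
                rois.append((start, i))
            start = None
    return rois


def is_five_prime_breakpoint(strands: Sequence[str], i: int, predominant: str,
                             n_outside: int = 10, n_inside: int = 20,
                             min_frac: float = 0.25) -> bool:
    """Test the two breakpoint criteria at read index `i`."""
    if i < n_outside or i + n_inside > len(strands):
        return False
    outside = strands[i - n_outside:i]   # must all match the predominant state
    inside = strands[i:i + n_inside]     # assumption: includes read i itself
    n_opposite = sum(1 for s in inside if s != predominant)
    return (all(s == predominant for s in outside)
            and n_opposite / n_inside >= min_frac)
```

With a baseline of 0.8 and a clean library whose background ratios average about 0.99, this gives a calling threshold near 0.79, the value reported in the ROI tables of Fig. 3-5d.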
Here, Invert.R was programmed to calculate the change in W/C ratio (Δ W/C ratio) for the ROI by finding the average W/C ratio of all reads falling between the predicted breakpoints and subtracting this from the average W/C ratio of all reads falling outside the predicted breakpoints (Fig. 3-1, v). In this way, Δ W/C ratio takes into consideration the level of noise in the library to calculate the magnitude of change at the locus. If both homologues contain an inversion (i.e. a homozygous inversion) then strand orientation will completely switch and a Δ W/C ratio of ~ 1.0 is expected. Alternatively, if only one homologue contains an inversion (i.e. a heterozygous inversion) then a partial switch will be seen and a Δ W/C ratio of ~ 0.5 is expected.

Figure 3-4 | Using Invert.R to predict inversions based on W/C ratios and genotypes based on ∆ W/C values
Histograms of the Invert.R-calculated W/C ratios for HsSs_0123, which has a) a heterozygous inversion on chromosome 7 (chr7), and b) a homozygous inversion on chr8 (also shown in Fig. 3-3). W/C ratios were calculated using a bin size of either i) 10 (yellow), ii) 25 (orange) or iii) 50 (red) reads (all other settings were consistent). From these W/C ratios, Invert.R flagged regions of interest (ROIs; red bars) as those regions with values that dipped below and then came back above the calling threshold limit (horizontal dotted line). The smaller bin size (e.g. bin = 10) resulted in multiple ROI predictions because W/C ratios dipped below the threshold multiple times, illustrating how bin size directly impacts the resolution of Invert.R predictions. In all cases ∆ W/C values reflected the expected genotype with ∆ W/Cs between 0.394 – 0.527 for the heterozygous inversion, and 0.949 – 0.978 for the homozygous inversion.

Invert.R only called a putative inversion if the Δ W/C of the ROI was ≥ 0.3, which ensured there was a sufficient level of change in template strand states observed (Fig. 3-4).
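As a worked illustration of the Δ W/C calculation and the 0.3 cut-off, the following Python sketch computes the difference between the mean W/C ratio outside and inside an ROI. It is not taken from Invert.R; the function names are hypothetical, and the 0.75 boundary used to separate heterozygous from homozygous calls is an assumption (the midpoint between the expected values of 0.5 and 1.0), since the text does not state a genotype cut-off.

```python
from statistics import mean
from typing import List, Optional


def delta_wc(ratios: List[float], start: int, end: int) -> float:
    """Mean W/C ratio outside the ROI minus mean W/C ratio inside it.

    `start` and `end` are half-open indices of the ROI within `ratios`.
    """
    inside = ratios[start:end]
    outside = ratios[:start] + ratios[end:]
    return mean(outside) - mean(inside)


def classify_roi(delta: float, min_delta: float = 0.3,
                 hom_cutoff: float = 0.75) -> Optional[str]:
    """Return a genotype label, or None if the ROI fails the Δ W/C filter.

    `hom_cutoff` is an illustrative assumption, not a value from the text.
    """
    if delta < min_delta:
        return None
    return "homozygous" if delta >= hom_cutoff else "heterozygous"
```

Subtracting from the observed background rather than from 1.0 is what lets the measure absorb library noise: a library with a background of 0.98 and an ROI averaging 0.02 yields a Δ W/C of 0.96 rather than 0.98.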
For each input library, Invert.R outputs: i) a histogram .png of each chromosome for visualization, ii) a collapsed .bedgraph file of the W/C ratios for uploading onto the UCSC Genome Browser, and iii) an ROI table of all putative inversion calls with relevant metrics, for further interrogation. Taken together, Invert.R represents a bioinformatic approach to locate segmental changes in template strand states of single cells in order to map putative inversions for further analysis.

[Figure 3-4 panel data (HsSs_0123; ROI coordinates with ∆WC for each bin size):
a) chr7: 68,000,000 - 80,000,000
   bin = 10: 72391091 - 72805582 (∆WC = 0.498); 72999337 - 73286681 (∆WC = 0.527); 73307380 - 74325087 (∆WC = 0.459)
   bin = 25: 72194873 - 75058638 (∆WC = 0.421)
   bin = 50: 72194873 - 74325087 (∆WC = 0.394)
b) chr8: 5,000,000 - 15,000,000
   bin = 10: 7219983 - 11988761 (∆WC = 0.978); 12052948 - 12409738 (∆WC = 0.665)
   bin = 25: 7219983 - 12409738 (∆WC = 0.949)
   bin = 50: 7219983 - 12409738 (∆WC = 0.949)]

To further validate Invert.R and calculate the resolution of breakpoint mapping for heterozygous and homozygous variants, I tested the previously characterized variants on chr7q11 and chr8p23 of HsSs_0123 (described in Section 3.2.1) (Fig. 3-5). To visualize the results, I produced histograms of the W/C ratio values calculated for the ~ 5 Mb homozygous inversion on chr8p23 and the ~ 2 Mb heterozygous inversion on chr7q11 (Fig. 3-5a). To better integrate the results, I layered Strand-seq reads (color-coded as C in blue, and W in orange) and reference sequence gaps (grey) above the plot, along with Invert.R-predicted inversions depicted below (red).
For the single cell shown, Invert.R accurately located the homozygous chr8 inversion and the heterozygous chr7 inversion (using a bin of 25 reads and a baseline threshold of 0.8) as a segmental dip in W/C ratio values. The Invert.R-predicted breakpoints (Fig. 3-5a, dotted horizontal lines) closely corresponded to the manual predictions made in Chapter 2 (Section 2.2.3), along with published reports of each inversion [56, 85, 129]. For example, the chr7q11 breakpoints found by Invert.R (72,194,873 – 75,058,638) (Fig. 3-5a) were comparable to the manual breakpoints of 72,727,194 – 75,012,977 previously shown in Chapter 2 (Fig. 2-8c, ii), illustrating the precision of breakpoint mapping using this algorithm. The Δ W/C ratios predicted by Invert.R were 0.95 for the homozygous chr8 inversion and 0.42 for the heterozygous chr7 inversion (Fig. 3-5a). Therefore, Δ W/C ratios represented the relative magnitude of change in template strand states that distinguish homozygous and heterozygous alleles (illustrated in Fig. 2-7). Note these Δ W/C ratios were near the expected values of 1.0 and 0.5 for a homozygous and heterozygous variant (respectively), but were smaller due to genomic segments with low read depths or reads mapping to both template strands at the inversion breakpoints (Fig. 3-5a, asterisks). These results serve as proof-of-principle that both heterozygous and homozygous inversions can be accurately localized and mapped in a single cell by Invert.R analysis.

Figure 3-5 | Identifying a homozygous and heterozygous inversion in a single cell using Invert.R
The Strand-seq library (HsSs_0123) of a male donor previously shown (see Fig. 3-4) to harbor a homozygous (chr8p23; left panel) and heterozygous (chr7q11; right panel) inversion was analyzed using Invert.R (bin = 25 reads, baseline threshold = 0.8, minReads = 20).
a) Zoom inset (banded ideogram, red box) and Invert.R histograms of W/C ratios (reads (Crick, blue; Watson, orange) and gaps (gray) shown above) with predicted breakpoints (dotted lines) and corresponding Δ W/Cs of each ROI (red bars) listed. Asterisks denote regions with low read depth that flank inversion breakpoints. b) Invert.R histograms of the whole chromosome, with additional ROIs discovered in a non-targeted approach. Arrowhead marks the known inversion shown above. c) UCSC genome browser view of BED-formatted sequencing reads confirms the small Invert.R-predicted inversion (zoom) found on chr8. d) ROI table output of Invert.R analysis for each whole chromosome.   Next, to confirm whether Invert.R can identify inversions in a non-targeted approach, the entirety of chr7 and chr8 for this cell was analyzed. The Invert.R histograms illustrated how W/C ratios tracked template strands across the whole of the chromosome, and dipped below thresholds at locations where template strand states changed (Fig. 3-5b). Looking at the ROI predictions (Fig. 3-5b, red bars), Invert.R not only pulled out the previously identified inversions on 7q11 and 8p23 (Fig. 3-5b arrowheads), but also flagged additional inversions on each chromosome, locating one more on chr8, and four more on chr7 (Fig. 3-5d). Uploading sequencing reads onto the UCSC genome browser (Fig. 3-5c) illustrated that Invert.R histograms fairly represented the data. 
This also allowed us to visualize the small (~160 kb) inversion Invert.R located on the p-arm tip of chr8 at high resolution and confirm that Invert.R accurately mapped the breakpoints with only 41 reads representing the event in this library (Fig. 3-5c, zoom), highlighting the precision of the program. Importantly, this ROI was not obvious from the BED file at this resolution on the UCSC genome browser (Fig. 3-5c, whole chromosome), and therefore it would have been easily missed in a manual analysis of this library. Moreover, while a BAIT analysis of this library identified the large ~5 Mb inversion on chr8p23, it failed to locate the small inversion on chr8 and all of the intermediate-sized inversions on chr7 (Fig. 3-6), underscoring the increased sensitivity of Invert.R over other currently available bioinformatic approaches. Taken together, these results illustrate that Invert.R is able to accurately locate template strand state changes in single Strand-seq libraries to predict putative inversions in an unbiased and non-targeted approach.

[Figure 3-5 panel labels: a) Chromosome 8 (chr8 p23.2-p21.3), HsSs_0123, ROI 7219983 - 12409738, Δ W/C = 0.95; Chromosome 7 (chr7 q11.22-q21.11), HsSs_0123, ROI 72194873 - 75058638, Δ W/C = 0.42. b) Whole chromosome 8 (118.1 reads/Mb) and whole chromosome 7 (117.8 reads/Mb). Histogram y-axis: WC Ratio; x-axis: position (Mb).]

Figure 3-5d | ROI tables

Chromosome 8 ROI list
chr    callingTh   ROIstart    ROIend      ROIsize   Δ WC    roiReads
chr8   0.79        2199277     2360636     161359    0.683   41
chr8   0.79        7219983     12409738    5189755   0.951   753

Chromosome 7 ROI list
chr    callingTh   ROIstart    ROIend      ROIsize   Δ WC    roiReads
chr7   0.79        58050126    61904271    3854145   0.583   51
chr7   0.79        62839053    63042766    203713    0.491   22
chr7   0.79        64562744    64917777    355033    0.572   33
chr7   0.79        72194873    75058638    2863765   0.422   441
chr7   0.79        142044240   142442793   398553    0.991   42
This systematic approach to analyzing Strand-seq data offers a substantial advance, as we can now bioinformatically predict inversions in single cells and explore populations of cells to look for patterns of template strand states.

Figure 3-6 | BAIT ideogram of HsSs_0123
BAIT output of the HsSs_0123 Strand-seq library shows the template strand inheritance patterns of all chromosomes (chr). Watson (W) reads are shown in orange, and Crick (C) reads are in blue. Arrowheads flag template strand switch events identified by BAIT. Notice that BAIT did not identify any events on chr7, whereas Invert.R identified 5 putative inversions in this same library (see Fig. 3-5).

[Figure 3-6 ideogram summary: index = HsSs_0123; quality filter = 10; normalized counts/Mb = 123.95; SCE events = 4; switch events = 2; background = 1.99%; aneuploid chromosomes = 5; chr7 is WW and chr8 is CC in this library.]
3.2.3 Finding recurrent inversions and refining breakpoints across multiple cells

By developing new analysis software, I have created a rapid and unbiased tool to analyze Strand-seq data and locate template strand state changes based on local W/C ratios. This tool was tailored to discover putative variants in single cells, and visualize the results in descriptive histograms or at high-resolution on the UCSC genome browser. However, as described in Chapter 2 (Section 2.2.3), in order to confidently distinguish inherited genomic rearrangements (such as inversions) from sporadic recombination events (like sister chromatid exchanges (SCEs)), the variant must recur in multiple cells. Therefore, the final step was to extend this method to integrate call sets from different cells in order to look for recurrent patterns of template strand states. First, I tested the reliability of Invert.R to locate recurrent genomic rearrangements. Ten additional libraries for chr8 (Fig. 3-7, left panel) and chr7 (Fig. 3-7, right panel) were randomly selected from the same male donor dataset based on a WW or CC inheritance pattern10 and a read depth > 20 reads/Mb. These libraries were analyzed with Invert.R using the same criteria described for HsSs_0123, and histograms of the homozygous (chr8p23) and heterozygous inversions (chr7q11) were generated for each cell (Fig. 3-8). By comparing the inversion calls made between these related cells, the precision of Invert.R could be tested.

10 Recall that homozygous inversions are not evident in WC chromosomes since they exhibit the same template strand states as wildtype reference (illustrated in Fig. 2-6). A predominant read state > 75% W or > 75% C was used to filter chromosomes.
Figure 3-7 | Refining inversion breakpoints across multiple single cells using Invert.R
Ten single Strand-seq libraries from a male donor with a recurrent homozygous inversion on chr8p23 (left panel) and a recurrent heterozygous inversion on chr7q11 (right panel). a) Zoom insets (banded ideogram, red box) and UCSC genome browser views of ten cells, each represented in a single horizontal line with BED-formatted Watson reads in orange, and Crick reads in blue. Asterisks denote regions with low read depth that often flank inversions. b) Summarized ROI file and c) overlaid histograms generated from an Invert.R-analysis of these libraries (see Fig. 3-8 for individual histograms). Sequence gaps (grey bars above histogram) and a heat map of the overlapping inversion predictions (red bars below histograms) are included. The minimal inverted region (inverted segment predicted in 80% of cells, grey bar below histogram) and flanking breakpoint ranges (inverted segment predicted in 20% of cells, black bars below histogram) were calculated from all ten cells. d) Simultaneous view of inversions mapped by Invert.R (black), in relation to segmental duplications (SegDups), and previously-reported inversions in the Database of Genomic Variants (DGV, purple) and the Human Polymorphic Inversion Database (InvFest, blue). For Invert.R and InvFest, the minimal inverted region is represented as the lower bar in the track, with the maximal inverted regions (outer-most breakpoint ranges) represented as the upper bar.

Immediately apparent was the high degree of concordance in W/C ratios calculated for these ten cells. In all cases, W/C ratios dipped below the calling threshold limit along the inverted locus, and the ROIs mapped to similar genomic locations (Fig. 3-8). Notably, the read depths of the libraries ranged from 32 – 291 reads/Mb, illustrating the robustness of Invert.R to locate inversions in Strand-seq libraries of varying coverage.
[Figure 3-7 panel data: breakpoint ranges shown in c) are 72380014 - 72659960 and 73989814 - 75007165 for chr7 (chr7 q11.22-q21.11), and 7183914 - 7404466 and 11880370 - 12489771 for chr8 (chr8 p23.2-p21.3); n = 10 cells per panel. Tracks in d): DGV, InvFest, SegDups, InvertR.]

Figure 3-7b | Summarized ROI file

chr7 Inv   ROIstart   ROIend     ROIsize   Δ W/C   ROIdepth
min        72147527   73989814   1695387   0.31    37.33
max        72720170   75083310   2597576   0.62    291.12
Average    72462038   74594703   2132665   0.45    101
stdev      198348     454633     334631    0.10    70

chr8 Inv   ROIstart   ROIend     ROIsize   Δ W/C   ROIdepth
min        6979754    11880370   4660927   0.88    31.97
max        7540372    12555388   5466267   0.95    216.92
Average    7252791    12412071   5159280   0.93    105.5
stdev      176590     197670     231126    0.02    67

Figure 3-8 | Mapping a heterozygous and homozygous inversion in multiple single cells using Invert.R
Ten single Strand-seq libraries from a male donor with a homozygous inversion on chr8p23 (upper panel) and a heterozygous inversion on chr7q11 (lower panel). Zoom insets (banded ideogram, red box) of the W/C ratio values calculated by Invert.R (bin = 25 reads) are shown as histograms for each cell. Watson (orange) and Crick (blue) Strand-seq reads are shown above each histogram, along with sequence gaps in the reference assembly (grey). The putative inversion predictions made by Invert.R are shown below (red bars), with the ∆ W/C ratio of each inversion shown. When multiple inversions were predicted within the region the average ∆ W/C was provided. The depth (reads/megabase (Mb)) was calculated for the region shown and listed for each library.

[Figure 3-8 panel labels: Chromosome 8 (chr8 p23.2-p21.3) libraries and depths (reads/Mb): HsSs_0005 65.0, HsSs_0013 68.8, HsSs_0014 27.8, HsSs_0015 58.4, HsSs_0017 75.3, HsSs_0028 207.0, HsSs_0032 156.4, HsSs_0035 122.2, HsSs_0037 203.1, HsSs_0042 76.8; Δ W/C values shown range from 0.87 to 0.96. Chromosome 7 (chr7 q11.22-q21.11) libraries and depths (reads/Mb): HsSs_0004 52.0, HsSs_0023 93.6, HsSs_0027 237.1, HsSs_0032 123.1, HsSs_0033 114.2, HsSs_0034 95.4, HsSs_0036 89.1, HsSs_0038 137.5, HsSs_0041 99.5, HsSs_0043 105.3; Δ W/C values shown range from 0.31 to 0.62.]

The homozygous inversion on chr8 was very precisely mapped in each cell. Δ W/C ratios ranged between 0.87 - 0.96, breakpoints mapped within a ~600 kb range (with 5′ breakpoints falling between 6,979,754 – 7,540,372, and 3′ breakpoints between 11,880,370 – 12,555,388), and the average inversion size was 5.1 Mb (Fig. 3-7b). These metrics are in line with previous reports of the inversion [36, 56, 73]. The variability seen between the individual cells was due to changes in read densities across the locus, particularly near the inversion breakpoints (Fig. 3-7a, asterisks), where large blocks of segmental duplications interfere with unambiguous read alignment (discussed in more detail previously (Fig. 2-9c, ii)) [130]. The heterozygous inversion on chr7 was more complex.
This locus showed a broader range of Δ W/C ratios between 0.31 - 0.62, and in some cases multiple ROIs were predicted because W/C ratios crossed the threshold more than once along the inverted segment (e.g. HsSs_0027 and HsSs_0041) (Fig. 3-8). Looking at all ten cells simultaneously showed these were overlapping events that likely represented a single inversion, and collectively showed an average 5′ breakpoint of 72,462,038 and a 3′ breakpoint of 74,594,703 (Fig. 3-7b). This illustrates the value of considering multiple cells when characterizing inversions by Strand-seq, as each independent single cell call substantiates the other cells and strengthens the confidence in mapping breakpoints.

Figure 3-9 | Refining inversion breakpoints by consensus
To refine inversion breakpoints Invert.R finds the consensus between multiple cells. i) Invert.R first calculates the degree of overlap (coverage) between regions of interest (red bars) predicted for individual libraries (Lib). The user-defined variable ‘minLibs’ will filter results by a minimum number of libraries sharing the event (e.g. minLibs = 2 means a minimum coverage of 2 is required for consideration). ii) With the coverage calculated, the minimal inverted region (inner gray bar) defines the genomic location where at least 80% of the cells map the inversion, and the breakpoint ranges are the regions where at least 20% of the cells predicted the inversion (outer black bars).

With multiple single cells analyzed, calculating the level of agreement between inversion predictions refines inversion breakpoints. Working from the assumption that overlapping ROIs of individual cells represent a single variant allele, we programmed Invert.R to compile data across single cells and find the consensus between breakpoint predictions. For this, we used the ROI file for each individual library and found the degree of overlap (i.e. coverage) between them using the genomeCoverageBed function of BEDtools (v2.17.0) [131]. This calculated the number of libraries with an ROI called at each genomic location represented in the data set (illustrated in Fig. 3-9). We then found the cumulative base pair coverage across all the libraries containing the ROI to determine the frequency that the putative inversion was called at each location. This was visualized by overlaying histograms from multiple Strand-seq libraries into a single plot, with the proportion of overlap between each individual ROI graphically depicted as a heat map below (Fig. 3-7c). Finally, to refine the breakpoints we defined the minimal inverted region as the overlap present in at least 80% of the cells, and the maximum inverted region (which defines the outer limits of the inversion) as the overlap present in at least 20% of the cells (Fig. 3-7, ii, and Fig. 3-9, black bars). This approach localized the minimum chr8 inverted region to 7,404,446 – 11,880,370 (spanning 4.47 Mb), and the minimum chr7 inverted region to 72,659,960 – 73,989,814 (spanning 1.32 Mb) (Fig. 3-7c).
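The consensus step can be sketched in a few lines. The thesis used the genomeCoverageBed function of BEDtools for the coverage calculation; the Python below swaps in a simple sweep over interval endpoints so that the example is self-contained. The function name `consensus_region`, the integer-percent parameter, and the choice to report the outermost span that meets the cut-off are assumptions made for illustration (the last follows the stated assumption that overlapping ROIs represent a single variant).

```python
from typing import List, Optional, Tuple

Interval = Tuple[int, int]  # half-open (start, end) of one ROI call


def consensus_region(rois: List[Interval], n_libs: int,
                     min_pct: int) -> Optional[Interval]:
    """Outermost span where at least `min_pct` percent of libraries overlap.

    `rois` holds one ROI per library for the same putative inversion.
    Returns None if the required coverage is never reached.
    """
    # Smallest library count that satisfies the percentage (integer ceiling).
    need = max(1, (min_pct * n_libs + 99) // 100)
    # +1 at every ROI start, -1 at every ROI end, swept in genomic order.
    events = sorted([(s, 1) for s, _ in rois] + [(e, -1) for _, e in rois])

    coverage = 0
    start = end = None
    for pos, step in events:
        previous = coverage
        coverage += step
        if previous < need <= coverage and start is None:
            start = pos   # first position reaching the required coverage
        if coverage < need <= previous:
            end = pos     # last position dropping below it
    return None if start is None else (start, end)
```

Calling the function once with `min_pct=80` and once with `min_pct=20` gives the minimal and maximal inverted regions; the 5′ breakpoint range is then the interval between the two start coordinates, and the 3′ breakpoint range is the interval between the two end coordinates.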
The precise breakpoints are predicted to reside within the breakpoint ranges (i.e. between the minimum and maximum inverted regions), which for these two inversions had a resolution of 220.5 kb – 1.02 Mb, and overlapped with large blocks of segmental duplications (Fig. 3-7d). This resolution coincided very closely with previous reports listed in public databases of human inversions, such as the Database of Genomic Variants (DGV) [3] (Fig. 3-7d, purple), and the Human Polymorphic Inversion Database (InvFest) [73] (Fig. 3-7d, blue), highlighting how inversion breakpoints can be finely mapped across multiple Strand-seq libraries using Invert.R.

Figure 3-10 | Mapping template strand state changes in whole chromosomes using Invert.R
Ten single Strand-seq libraries from a male donor showing chr8 (upper panel) and chr7 (lower panel) with the W/C ratio values calculated by Invert.R (bin = 25 reads) shown as histograms for each. Watson (orange) and Crick (blue) Strand-seq reads are shown above each histogram, along with sequence gaps in the reference assembly (grey). The putative inversion predictions made by Invert.R are shown below (red bars), with the breakpoints marked (horizontal dotted red lines). The depth (reads/megabase (Mb)) was calculated for each chromosome and is listed on the left. Arrowheads mark false calls made in individual cells that are filtered out by looking across multiple cells.

[Figure 3-10 panels: Chromosome 8 (whole) and Chromosome 7 (whole); W/C ratio (0 to 1) plotted against chromosome position (0–150 Mb). Libraries and depths (reads/Mb), chr8 panel: HsSs_0005 (65.0), HsSs_0013 (68.8), HsSs_0014 (29.1), HsSs_0015 (58.4), HsSs_0017 (65.6), HsSs_0028 (193.3), HsSs_0032 (162.0), HsSs_0035 (109.5), HsSs_0037 (146.4), HsSs_0042 (124.6); chr7 panel: HsSs_0004 (57.2), HsSs_0023 (105.0), HsSs_0027 (246.3), HsSs_0032 (166.4), HsSs_0033 (138.1), HsSs_0034 (112.2), HsSs_0036 (106.1), HsSs_0038 (146.7), HsSs_0041 (121.2), HsSs_0043 (132.9).]

Finally, to again test the program in a non-targeted system, I analyzed the full chr7 and chr8 for these ten libraries (Fig. 3-10, and Fig. 3-11, i). Once again Invert.R accurately located the known inversions on 8p23 and 7q11, along with additional ones in each chromosome (Fig. 3-10, red bars). The overlaid histograms illustrate high concordance in the W/C ratios across all libraries (Fig. 3-11, ii). When breakpoints were refined across all cells, Invert.R localized two recurrent rearrangements on chr8 and five recurrent ones on chr7 (Fig. 3-11, tables), suggesting these are inherited inversions. In some cases the inversion frequency was less than 100% because some libraries did not have enough reads at the ROI for inclusion. In other cases, additional ROIs were flagged by Invert.R in a single library but were not recurrent in the population, such as two ROIs in HsSs_0004 (Fig. 3-10, arrowheads). These non-recurrent ROIs may represent somatic rearrangements, such as multiple SCEs or de novo inversions, or they may represent false-positive calls. 
Since this library was low quality with poor read coverage, it is likely they are the result of the latter; however, this point highlights how: i) the recurrence of an ROI in multiple cells strengthens the confidence of the call in each individual cell, and ii) a de novo rearrangement can appear as a rare event in a population of cells, and Invert.R provides a platform for exploring somatic mutation rates. Collectively these results illustrate that Invert.R is a non-targeted tool that accurately localizes template strand state changes in Strand-seq libraries to predict putative inversions, and compares these predictions across multiple libraries in order to explore stable genomic rearrangements.

Figure 3-11 | Locating recurrent inversions in a non-targeted approach using Invert.R
Ten single Strand-seq libraries from a male donor with chr8 (left panel) and chr7 (right panel) shown. i) UCSC genome browser views of ten cells, each represented in a single horizontal line with BED-formatted Watson reads in orange and Crick reads in blue. ii) Overlaid histograms generated from an Invert.R analysis of these libraries (see Fig. 3-10 for individual histograms). Sequence gaps (grey bars above histogram) and a heat map of the overlapping inversion predictions (red bars below histograms) are included. iii) Summary table of the refined recurrent ROIs located by Invert.R. The breakpoint ranges are shown, with the minimal inverted region as the overlapping segment predicted in 80% of cells (as described in Fig. 3-9).
[Figure 3-11 panels: Chromosome 8 (whole), n = 10 cells, 107.4 average reads/Mb; Chromosome 7 (whole), n = 10 cells, 132.7 average reads/Mb. W/C ratio (0 to 1) plotted against chromosome position (0–150 Mb).]

Figure 3-11, panel iii (summary tables):

chr    5′ range start  5′ range end  3′ range start  3′ range end  Frequency  Average WC  Minimal inverted region size (bp)
chr7   54316772        54316772      54387439        54387439      0.3        0.70        70667
chr7   57706959        57706959      61083870        61967383      0.9        0.47        3376911
chr7   64566186        64645175      64660630        65060633      1          0.60        15455
chr7   72380014        72659960      73989814        75007165      1          0.44        1329854
chr7   142031881       142046772     142334720       142365968     1          0.80        287948
chr8   2193914         2193914       2321586         2321586       0.6        0.56        127672
chr8   7183914         7404466       11880370        12489771      1          0.93        4475904

3.3 Discussion

Here I describe a new software package that can be used to discover, map and characterize structural rearrangements in single cells. The ability to bioinformatically assess local template strand states and map template strand state changes in an unbiased and systematic fashion offers a major advance to Strand-seq studies. While being theoretically similar to read depth approaches [44, 45], the elegance of Invert.R is in the read-based binning strategy, which ensures every read in low-coverage Strand-seq libraries is incorporated into the analysis. We also designed a simple approach to quantifying read directionality that rapidly assigns local W/C ratios to overlapping bins. Invert.R allows us to rapidly locate putative inversions in single cells, and then integrate these data across multiple cells to identify patterns in populations. As a biologist, I appreciate the needs of the end-user (who may not have bioinformatic skills) and therefore Invert.R outputs a variety of user-friendly summary files and visual representations of the data (e.g. 
histograms) that facilitate down-stream analyses and interpretation of results. Fundamentally this makes Invert.R more accessible than computationally heavy algorithms that are hard to translate and generate imputed data calls that are difficult to envisage.

It is evident that bin size matters [44, 45]. While other sliding-window approaches are based on the reference genome and count bases [45], Invert.R uses a read-based strategy that counts aligned, high-quality fragments in the input library. This makes Invert.R more dynamically tailored to low-coverage Strand-seq libraries, overcoming mappability issues and sequencing biases. I showed that the sensitivity of Invert.R is inversely related to the size of the bin used to calculate local W/C ratios, with smaller bins detecting smaller events. However, a smaller bin size increases noise, which can lead to false-positive calls. To balance these competing factors, a small bin (e.g. bin = 25) can be used for an initial first-pass analysis to flag even small ROIs and mark them for further confirmation. This initial call set can then be retested to validate whether there is a meaningful change in template strand state by filtering for high-confidence calls. This is a common strategy used by many sequencing methods for SV detection [14, 52, 124]. In a single cell, confidence comes from assessing the total number of reads that represent the ROI, and the proportion that are W versus C reads. This is in stark contrast to split-pair strategies, where only those few reads that cover the breakpoint can be used to make variant calls [46, 49]. In Strand-seq data, homozygous inversions are by nature easier to detect, as they show a higher magnitude of change and are thus more distinct from the baseline chromosomal state compared to heterozygous alleles. In a population of cells, confidence comes from the frequency of the event when looking across multiple cells. 
Each time an ROI is found independently in a single cell the confidence for that call strengthens for the whole population, and ultimately the power of any given call comes down to frequency. Future improvements to the program would be to implement a mathematical scoring system that considers these multiple parameters, and more robust statistical modeling to improve confidence assessments. Nevertheless, the reliability of variant calls will always come down to the number of cells analyzed, further emphasizing the need for reliable high-throughput tools like Invert.R.

The resolution of breakpoint mapping is limited by the depth of the library and by flanking segmental duplications, which show reduced mappability [32]. Split-read approaches to identify SVs in NGS data can unambiguously locate inversions and have the potential to map breakpoints with high precision [14, 46]. It may be possible to combine Invert.R with pre-existing algorithms (e.g. Pindel [46], CREST [49], or DELLY [50]) to identify split-read signatures at Invert.R-predicted ROIs and adopt an integrated approach to SV discovery. However, the < 1x coverage of Strand-seq data means at best a single read per library will span the breakpoint, and reads from additional libraries will be required to confirm the event. On the other hand, when inversions fall between highly repetitive and gene-poor elements it may not be especially meaningful to map breakpoints down to a single base pair position. And while higher genomic resolution can be obtained with deeper sequencing efforts, call accuracy is still in question. Therefore, the advance in visually assessing template strand states in individual cells cannot be overstated. The ability to assess the state of each homologue independently allows us to confidently call genomic rearrangements, immediately assign genotypes and test associations with phenotypes, even if few reads map to the rearrangement breakpoints.
This tool offers a new opportunity to explore structural variation in different cell types and populations, greatly extending the breadth of many Strand-seq projects, both on-going and planned. For instance, Invert.R is already being applied to a cancer project (spearheaded by Dr. Ester Falconer) where we are locating structural rearrangements in karyotypically normal AML (acute myeloid leukemia) patients and testing for rare cell subpopulations in bulk tumor cultures. The program is also being applied to a haplotyping project (led by David Porubsky) to confirm whether the localized haplotype switching observed in his data is due to the presence of inversions. Finally, I am applying Invert.R to an exciting new pan-genomics project (as part of the Human Genome Structural Variation Consortium) that aims to map structural variation in family trios from different geographical populations, in order to help characterize the extent of variation in normal human genomes. Collectively, these projects illustrate the immediate utility of this bioinformatic software, and I expect additional projects will continue to benefit from it and from the related analysis tools that are currently being developed using similar principles.

3.4 Conclusion

By preserving the structure of each homologue, Strand-seq offers an exciting new opportunity to map variants in single cells; however, to realize this potential the system had to be scaled. To address this, we developed Invert.R, a custom R-based software package that systematically assesses local strand states of Strand-seq libraries to characterize putative inversions in single cells, and then compiles data to find patterns across multiple cells. Here, I have shown that Invert.R is a reliable and systematic approach to discover inversions in a non-targeted and unbiased fashion. This software will be made open source, in order to facilitate current and future Strand-seq projects. 
Now that we have scaled the production and analysis of Strand-seq, we have a high-throughput and high-resolution system to study inversions in different populations.

Chapter 4 | Characterizing the distribution and frequency of polymorphic rearrangements in a mixed population of single cells

“There’s no one else in the world quite like you” – Mom

Chapter synopsis: To study structural polymorphisms in the human genome I performed a non-targeted and unbiased analysis of a mixed donor population. By applying a bioinformatic pipeline to Strand-seq data, I mapped genomic variants with high resolution and throughput in a normal population. This can serve as a framework for studying genomic heterogeneity in other model systems and populations. By exploring genotype frequencies, I characterized complex areas of the genome and identified several sequence misorients and minor alleles, along with new repetitive sequence elements not yet included in the reference assembly. By locating polymorphic inversions on a chromosome-by-chromosome basis, I built a comprehensive genomic map of heterogeneous loci from the largest population described. To our knowledge, this is one of the most exhaustive explorations of human inversions performed to date, and will provide a valuable resource for studying other defined populations of individuals or cells.

4.1 Introduction

Identifying features that differ between genomes can help resolve the functional variants that underlie disease prognosis, management and progression [22, 29]. Characterizing genomic variants in populations of individuals has implications for studying inherited diseases, whereas genomic variants differing between cells can help us understand somatic diseases, like cancer [23]. To connect genotypes with phenotypes, genetic association studies are commonly used, which quantify allele frequencies to test whether specific genomic variants are enriched in defined populations [10, 11]. 
With the increased accessibility of NGS technologies, many researchers turn to whole genome and exome sequencing to explore these connections [16, 19]. The comparison of genomes to a reference is how genomic variants are discovered in most sequence-based approaches [14, 33]. Consequently, for these sequencing efforts to be successful it is critical we have a reliable and accurate reference assembly that represents the complexity and diversity of the human genome [57]. While conventional SV detection tools are very good at identifying variants that alter DNA sequence or content, our ability to identify balanced rearrangements that alter the structure of DNA is far more limited [13, 36-38, 120]. However, it is becoming increasingly clear there is significant structural diversity in human genomes, and these polymorphisms play important roles in our biology and health [1, 12, 13, 32, 36-38].

Polymorphic inversions are a common feature of the human genome [13] that have been implicated in our speciation [72, 77], population diversification [74, 132], and complex diseases, including schizophrenia and cancer [56, 75, 82, 86, 88, 118, 133, 134]. Inversions not only alter the orientation and order of genes in a region, but they can also disrupt gene function when breakpoints fall within coding or regulatory regions [1, 63]. However, technical difficulties in the ability to reliably map inversions have impeded investigations of their role in human biology [32, 73]. Consequently, human inversions have been largely underrepresented in the SV field and, lacking comprehensive studies, the phenotypic consequences and clinical relevance of most inversions remain undefined [32, 39, 66, 73]. 
To date, only two have been characterized at the population level (chr8p23 and chr17q21), and both have been found to be under selection and to impact fitness of carriers [75, 77, 78, 132], stressing the need for more large-scale investigations into the distribution and frequency of inversions in human populations. In order to fully understand the relationship between genomes and phenotypes it is critical that we develop techniques that can map all types of SVs, including inversions.

I have already shown Strand-seq is a robust method to accurately localize and genotype inversions in single cells with high resolution (Chapter 2). I also illustrated how the development and validation of appropriate bioinformatic pipelines make this method high-throughput and amenable to large-scale studies (Chapter 3). Equipped with these tools, I set out to explore genomic variation in a sample population of human cells. In this Chapter, I discuss the development of a new framework for mapping structural rearrangements in a mixed population in order to characterize polymorphisms in a non-targeted and high-throughput approach. By locating and genotyping variants present on every chromosome in this population, I identify 111 polymorphic inversions, along with several sequence orientation errors, minor alleles and under-represented repetitive elements in the reference assembly. Together, these data markedly improve the reference assembly annotation for future genomic variant studies. Considering the global distribution and frequency of inversions, I generate a comprehensive map of common polymorphic domains in the human genome. Collectively, I describe a novel framework for studying the heterogeneity of a population of cells, which offers a major advance over conventional approaches. I found that significant structural heterogeneity exists in normal populations of human cells and that inversion profiles are highly individualized. 
This comprehensive study of human inversions greatly advances our understanding of the structural complexity of our genome.

4.2 Results

4.2.1 Exploring genomic variation in a heterogeneous human population by single cell analyses

With high-resolution tools to visualize genomic rearrangements in a single cell, and a scalable bioinformatic strategy to analyze hundreds of cells simultaneously, I set out to explore the genomic heterogeneity in a sample human population. While it is well established that the human genome is highly heterogeneous between individuals, whether this heterogeneity could be uncovered by single cell studies was unclear. To test this, I generated Strand-seq libraries from a cord blood (CB) pool of 352 newborn donors. As described in Chapter 2 (Section 2.2.1), single cells were cultured in single wells in the presence of BrdU (5 µM), and daughter cells arising from a single division were harvested by micromanipulation (see Fig. 2-3 for experimental set-up). Following Strand-seq library preparation and paired-end (100 bp) sequencing, libraries were selected for further analysis based on high read depths (> 20 reads/Mb) and low backgrounds (< 5%). This ensured a high-quality dataset with clear template strand inheritance patterns, so strand states of individual chromosomes could be reliably mapped. In this dataset, I captured 25 single cells and 22 paired sister cells. Having paired sister cells allowed us to increase read depths for this subset of libraries. As illustrated previously (Fig. 2-8 in Section 2.2.3), I generated directional composite files for each pair by reverse-complementing the reads in sister cell ‘B’ and combining them with the reads from sister cell ‘A’, which increased the read densities of these libraries by 1.9-fold, on average (Fig. 4-1). In total, this yielded 47 Strand-seq libraries with read depths ranging from 39 to 963 reads/Mb, and with both sexes equally represented (23 male and 24 female cells) (Appendix A). Each library in the dataset signified a different individual human genome, and collectively they represent a normal human population of mixed cells primed to study genomic heterogeneity.

Figure 4-1 | Read densities of Strand-seq libraries in the pooled cord blood (CB) dataset
Single and sister cell pairs were captured after single mitoses as described in Fig. 2-3. After library construction, sequencing and alignment protocols the sister pairs were merged by reverse-complementing the reads in sister cell ‘B’ and combining them with the reads in sister cell ‘A’. Total reads per megabase (Mb) for each library are plotted for the single cells (diamonds) and merged sister cell pairs (squares). On average, there was a 1.86-fold increase in the read depths of the merged sister pairs.

[Figure 4-1 plot: "Read densities of pooled CB dataset"; reads/Mb (0–1200) plotted for each Strand-seq library; series: single cells (n=25) and merged sister pairs (n=22).]

To explore the genomes of this heterogeneous cell population, I mapped changes in template strand orientation independently for each library. Recall that template strand orientation reflects the directionality of each homologue in a cell with respect to the reference assembly, and changes in strand orientation reflect rearrangements in a homologue. To locate changes in template strand orientation I calculated W/C ratios for each library with Invert.R, using a small bin size of 25 reads to capture small events (as detailed in Chapter 3, Section 3.2.1). I visualized the strand states of each cell as histograms (two example cells are shown in Fig. 4-2). Each line in the histogram represents the W/C ratio of a single cell at a genomic location. The histograms greatly facilitated locating potential genomic rearrangements in each cell, as a change in the W/C ratio represents a change in strand orientation at that location. 
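The two data-handling steps in this passage, merging sister cells and computing W/C ratios in read-based bins, can be sketched as follows. This is a hedged Python illustration rather than the Invert.R code (which is written in R). It assumes each read is reduced to an alignment position and a strand label ("W" or "C"), that the ratio reported per bin is the Watson fraction W / (W + C), and that overlapping bins advance by a fixed number of reads. The function names, the `step` parameter and the 0.5 cutoff are hypothetical.

```python
from typing import List, Tuple

Read = Tuple[int, str]  # (alignment position, strand), strand is "W" or "C"


def merge_sisters(cell_a: List[Read], cell_b: List[Read]) -> List[Read]:
    """Combine sister cells after flipping the strand of every read in cell B.

    Reverse-complementing a read swaps its Watson/Crick assignment, so the
    sister cell's reads become directly comparable to those of cell A.
    """
    flip = {"W": "C", "C": "W"}
    return sorted(cell_a + [(pos, flip[strand]) for pos, strand in cell_b])


def read_based_wc(reads: List[Read], bin_size: int = 25,
                  step: int = 5) -> List[Tuple[int, int, float]]:
    """Slide a window of `bin_size` consecutive reads along one chromosome.

    Returns (first_position, last_position, watson_fraction) per window.
    Bins hold a fixed number of reads rather than a fixed number of bases,
    so sparse regions yield wide bins and dense regions yield narrow ones.
    """
    ordered = sorted(reads)
    bins = []
    for i in range(0, len(ordered) - bin_size + 1, step):
        window = ordered[i:i + bin_size]
        watson = sum(1 for _, strand in window if strand == "W")
        bins.append((window[0][0], window[-1][0], watson / bin_size))
    return bins


# Toy chromosome (not thesis data): Watson reads with a Crick-only segment.
toy = ([(i * 1000, "W") for i in range(40)]
       + [(i * 1000, "C") for i in range(40, 60)]
       + [(i * 1000, "W") for i in range(60, 100)])
toy_bins = read_based_wc(toy, bin_size=10, step=10)
# Hypothetical cutoff: flag bins where fewer than half the reads are Watson.
flagged = [(start, end) for start, end, frac in toy_bins if frac < 0.5]
```

On the toy input the two bins spanning positions 40,000 to 59,000 are flagged, mimicking a segmental switch in strand orientation on an otherwise Watson chromosome. A smaller `bin_size` would localize the switch more finely at the cost of noisier per-bin fractions.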
In the two example cells, Invert.R identified 80 and 27 ROIs where W/C ratios dipped below the threshold. Some of these were somatic SCEs (Fig. 4-2, red arrowheads), whereas others exhibited segmental changes indicative of stable rearrangements such as inversions (Fig. 4-2, black arrowheads). The complex template strand patterns observed for these libraries highlight the diverse genomic features and recombination events evident in each individual cell, suggesting that the heterogeneity of a population can be captured in a single cell experiment.

Figure 4-2 | Mapping genomic rearrangements in a mixed cell sample
Histograms of W/C ratios generated by Invert.R (bin = 25) for a single cell from a newborn female (upper panel) and newborn male (lower panel). Each Strand-seq library was derived from a cord blood pool of 352 donors. Sequencing reads are shown as lines above each histogram with Crick in blue and Watson in orange. All chromosomes were included (including WC chromosomes) by setting the WC-cutoff to 0. The line in the histogram represents the W/C ratio at the given genomic location, and a change in W/C ratio represents a change in strand orientation and is indicative of a genomic rearrangement. Reference assembly sequence gaps are shown as gray bars above each histogram. Some regions of interest (ROIs) located by Invert.R were marked with red (SCE) or black (putative inversion) arrowheads to illustrate the different types of genomic rearrangements evident in these single cells.

To distinguish between somatic and stable genomic rearrangements the frequency of the event must be considered [1]. In Strand-seq libraries, inversions appear as recurrent, segmental changes in template strand orientation (as outlined in Chapter 2). I hypothesized I could localize inversions in our population by locating genomic regions where W/C ratios changed in multiple cells. To explore this I analyzed each cell, chromosome-by-chromosome. 
Since WC regions mask homozygous inversions, I filtered chromosomes inherited as WC or harboring SCEs using Invert.R (as described in Methods Chapter 7, Section 7.3.1). I overlaid the resulting histograms of W/C ratios for each chromosome to visualize template strand states of all cells simultaneously (Fig. 4-3). I flagged regions where strand orientation changed at the same genomic location in at least two cells as ROIs that contain a putative inversion (see Methods Chapter 7, Section 7.3.2). The ROIs were represented as red heat maps below the overlaid histograms, with the intensity of the heat map reflecting the number of cells with a strand state change at the ROI (Fig. 4-3, red bars). The recurrence of these events in our population distinguished them from sporadic SCEs or background reads (as described in Chapter 2, Section 2.2.2), suggesting they mark stable genomic variants.

[Figure 4-2 panels: HsSs_0291m (female; 80 ROIs) and HsSs_0262 (male; 27 ROIs); W/C ratio (0 to 1) plotted against position (Mb) for chr1 to chr22, chrX and chrY.]

Figure 4-3 | Recurrent genomic rearrangements in a mixed donor cell population, as predicted by Invert.R
Overlaid histograms of W/C ratios, as generated by Invert.R (bin = 25), for each chromosome of the pooled donor cord blood cells, after selecting only WW or CC chromosomes. Numbers of cells (n) analyzed and average reads/megabase (Mb) are indicated. Each line in the histogram represents the W/C ratio at a genomic location in a single cell. A change in the W/C ratios along a chromosome represents a change in strand orientation and is indicative of a genomic rearrangement. Locations where W/C ratios dip in at least two cells appear in the red heat maps below each histogram, where the intensity of the red reflects the number of cells with a predicted inversion within the region. These recurrent rearrangements are suggestive of a putative inversion. Reference assembly sequence gaps are shown as grey bars above each histogram. Arrowheads mark intricate strand state switches mentioned in the text (Section 4.2.1).

[Figure 4-3 panels: W/C ratio (0 to 1) plotted against position (Mb) per chromosome. Cells analyzed (n) and average reads/Mb: chr1 (n=15) 173; chr2 (n=21) 239.7; chr3 (n=25) 214.4; chr4 (n=24) 181.5; chr5 (n=22) 207.9; chr6 (n=17) 245.3; chr7 (n=19) 189.7; chr8 (n=20) 200.7; chr9 (n=26) 170.6; chr10 (n=21) 238.4; chr11 (n=26) 210.3; chr12 (n=23) 188.5; chr13 (n=24) 192.2; chr14 (n=21) 165.6; chr15 (n=23) 178.2; chr16 (n=22) 212.2; chr17 (n=32) 189.1; chr18 (n=21) 198.6; chr19 (n=26) 179.7; chr20 (n=29) 238.6; chr21 (n=28) 158.3; chr22 (n=27) 141.9; chrX (n=30) 144.9; chrY (n=27) 40.8.]

Locating recurrent template strand switches identified ROIs in the human genome where structural variation can exist. Some ROIs flagged by Invert.R encompassed large complex domains, such as near the centromeres of chr1, chr9 and chr21 (Fig. 4-3, arrowheads). 
Enlarged histograms were generated for the region spanning these ROIs to illustrate they contained blocks of convoluted strand state switching and included several reference assembly gaps (Fig. 4-4, i), indicating they consist of complex DNA elements that are difficult to sequence and assemble [33]. I reasoned these large complex ROIs encompassed several distinct variants that could not be completely resolved bioinformatically. This became evident when I uploaded BED-formatted libraries onto the UCSC genome browser and observed multiple variable positions within individual libraries (Fig. 4-4, ii). To better resolve these regions, I manually redefined ROI start and end positions based on changes in read densities, template strand states and reference assembly breaks (see footnote 11; illustrated in Fig. 4-4 and Fig. 4-5). This generated a list of 209 ROIs that marked recurrent changes in template strand states in our population, and represented putative loci of stable genomic rearrangements. Together, this demonstrates how genomic heterogeneity can be explored in a mixed population of cells using our approach, and highlights how genomic variants can be uncovered in the human genome on a single cell basis.

Footnote 11: We cannot predict template strand states within gaps because we do not know the sequence or the actual size of these regions. As a result, when an ROI spans a gap we do not know whether it represents a single event or multiple events with hidden breakpoints falling within the gap. To account for this, we trimmed all ROIs according to reference assembly gaps.

Figure 4-4 | Complex regions of interest containing multiple structural variants
i) Zoomed histograms (red box) of three complex regions of interest (ROIs) identified by Invert.R on a) chr1, b) chr9, and c) chr21. In each histogram, the genomic coordinates of the ROI (purple dotted lines) and reference assembly gaps (gold; centromere in dark gold) are marked. 
Red boxes mark ii) the zoomed UCSC Genome Browser (GRCh37/hg19) view of ten representative Strand-seq libraries showing the single large ROI predicted by Invert.R (uppermost, green bars), reference assembly gaps (black bars), and complex segmental duplications (SegDups, lower track). It is clear that several genomic variants are present in these ROIs, as evidenced by the localized changes in strand states and read densities in individual libraries. These complex ROIs were manually refined based on changes in read densities, strand state switches and sequence gaps (dark red bars). ROIno.1.13 is mentioned in the text.

4.2.2 Annotating the human reference assembly – locating potential minor alleles and misorients

Figure 4-5 | Example of an AWC region, potential misorient or minor allele region, and polymorphic inversion from chr10 Zoomed inset (red box) of a UCSC Genome Browser (GRCh37/hg19) view of 12 representative Strand-seq libraries from the pooled cord blood population, covering chr10 q11.21-q11.23. Shown are four refined regions of interest (ROIs), identified from three putative inversions predicted by Invert.R analysis (upper, green bars). Based on the genotype frequencies calculated for each ROI (see Fig. 4-6), there is an Always Watson Crick (AWC) region (≥ 80% heterozygous cells, blue), two potential misorients or minor alleles (≥ 80% homozygous cells, red), and a polymorphic inversion (≥ 2 cells with different genotypes, purple) within this 4 megabase (Mb) domain on chr10. The domain has several reference sequence gaps (uppermost, black bars) and segmental duplications (SegDups) flanking the ROIs, and Database of Genomic Variants (DGV) inversions overlapping with the ROIs (lower purple bars). We suspect that ROIs classified as AWCs point to repetitive sequences in the human genome that are currently underrepresented in the reference assembly. ROIs classified as misorients or minor alleles point to regions in the human reference genome where the assembled sequence is not representative of the vast majority of individuals seen in our population.

By identifying recurrent changes in template strand orientation in a mixed population of cells I located hundreds of putative stable rearrangements in the human genome. While many of these events are predicted to be polymorphisms present in our sample, it is possible that rather than genomic variants, some represent assembly artifacts such as reference assembly orientation errors, as we previously found for the murine genome [109]. For instance, the q-arm of chr10 is representative of the disparity in the composition of different ROIs, where I localized four regions that exhibited very distinct strand state frequencies in our population (Fig. 4-5). To better characterize each ROI, every cell in our dataset was genotyped across these regions. For this, I used Invert.R to count the number of W and C reads in the region, where at least ten reads were required for inclusion. 
I then performed three independent Fisher’s Exact tests (for a wildtype, heterozygous, and homozygous state) at each ROI to statistically fit the likeliest genotype, based on how significantly the observed ratio of W and C reads at the ROI differed from the expected ratios [10, 135] (described in detail in Methods Chapter 7, Section 7.3.2). Allowing for a 2% level of background in the library, an expected wildtype state has no change in template strands (i.e. has the same template pattern as the rest of the chromosome), an expected heterozygous state has equal W and C reads (since I only considered WW and CC chromosomes), and an expected homozygous state has a complete change in template strands (i.e. has the opposite template pattern to the rest of the chromosome). The genotype whose test yielded the highest p-value (i.e. whose expected ratio differed least from the observed reads) was designated the best-fit genotype for each ROI in each cell. I then considered the genotype of each ROI across the population of cells by calculating the frequency of heterozygosity and homozygosity (Fig. 4-6), in order to better characterize the putative genomic variants.

As we previously confirmed in the murine genome [109], if the orientation of a contig is incorrectly assigned in the reference genome it will appear as a localized and complete template strand switch (i.e. like a homozygous inversion) in all Strand-seq libraries. Therefore, we predicted regions with a high homozygous frequency would be either misoriented segments or minor alleles present in the human reference assembly. For instance, I found two ROIs on chr10q11 had a high homozygous frequency: ROIno.10.9 and ROIno.10.10 (Fig. 4-5). Notice how two cells (HsSs_0266 and HsSs_0265) harbored a heterozygous inversion at the ROIno.10.9 locus, while the remaining cells had a homozygous inversion. This suggests the ROI marks an inversion in which the minor allele state is represented in the reference assembly, not the common orientation found in our population. 
Conversely, ROIno.10.10 contained a homozygous inversion in 100% of the cells. While it is probable this ROI marks a misoriented fragment in the reference assembly, it cannot be excluded that this is a variable locus and that our dataset was too small to observe the minor allele represented in the assembly. Strikingly, these two ROIs almost exactly match¹² fragments recently resolved as assembly misorientations using long read technologies and targeted deep sequencing of BAC clones [57] (E. E. Eichler, personal communication). It is worth noting that in their study ROIno.10.9 was classified as a reference misorientation, likely because they did not have access to multiple genomes that revealed the rare variant at this locus. This illustrates how the frequency of homozygous ROIs in a dataset can locate regions where the reference assembly does not reflect the common orientation within the population, and that the power of these predictions increases with additional libraries that increase the number of cells supporting the event.

Figure 4-6 | Heterozygous and homozygous frequency of the genotyped ROIs All cells in the pooled cord blood population were genotyped at 209 regions of interest (ROIs) to assess the frequency of heterozygosity (blue diamonds) and homozygosity (red squares), calculated as the proportion of genotyped cells showing either a heterozygous or homozygous state at each ROI. ROIs with a heterozygous frequency above 0.8 (dotted line) were defined as Always Watson Crick (AWC) regions (n = 46), and those with a homozygous frequency above 0.8 were classified as potential misorients or minor alleles (n = 24) in the human genome reference assembly. See Table 4-2 and Table 4-1 (respectively) for genomic coordinates.
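The classification rules stated in the text and in the Figure 4-6 caption can be illustrated with a minimal Python sketch. This is not the Invert.R implementation: the function name `classify_roi`, the genotype labels ("wt", "het", "hom") and the constant names are hypothetical, chosen only for illustration. The thresholds (a frequency of at least 0.8, a minimum of ten genotyped cells, and at least two non-wildtype cells for a polymorphism) are the ones given in this chapter.

```python
from collections import Counter

# Thresholds taken from the text; the names themselves are illustrative.
FREQ_CUTOFF = 0.8        # heterozygous or homozygous frequency cutoff
MIN_CELLS = 10           # minimum number of genotyped cells for inclusion
MIN_VARIANT_CELLS = 2    # non-wildtype cells needed to call a polymorphism


def classify_roi(genotypes):
    """Classify one ROI from its per-cell genotype calls.

    `genotypes` holds one label per genotyped cell: "wt", "het" or "hom".
    Cells with too few reads to genotype are assumed to be left out.
    """
    counts = Counter(genotypes)
    n_cells = len(genotypes)
    if n_cells == 0:
        return "unclassified"

    het_freq = counts["het"] / n_cells
    hom_freq = counts["hom"] / n_cells

    # AWC regions and misorients are excluded before polymorphisms are called.
    if n_cells >= MIN_CELLS and het_freq >= FREQ_CUTOFF:
        return "AWC"
    if n_cells >= MIN_CELLS and hom_freq >= FREQ_CUTOFF:
        return "misorient_or_minor_allele"
    if counts["het"] + counts["hom"] >= MIN_VARIANT_CELLS:
        return "polymorphic_inversion"
    return "unclassified"


if __name__ == "__main__":
    # 19 homozygous and 2 heterozygous cells, as described for ROIno.10.9.
    print(classify_roi(["hom"] * 19 + ["het"] * 2))
```

Checking the two frequency cutoffs before the polymorphism rule mirrors the order used in the text, where AWCs and potential misorients are removed from consideration before polymorphic inversions are called.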
¹² Based on data kindly provided by Eichler’s group, ROIno.10.9 was mapped within two bases, and ROIno.10.10 was within 2,667 bases, of regions identified as misoriented using their independent approaches [57].

Figure 4-7 | Misoriented regions and minor alleles in the human reference assembly a) Size distribution of ROIs with a > 80% homozygosity frequency (n = 24) and classified as misorients or minor alleles in the reference assembly (GRCh37/hg19). UCSC Genome Browser view of misorients and minor alleles (blue bars) found on b) chr16p11, and c) chr1q21, with the assembly contigs (red bars), sequence gaps (black bars), Refseq genes (blue or red elements) and inversions listed in the Database of Genomic Variants (DGV; purple bars) shown. Outlined regions mark the ROIs mentioned in the text. Refseq genes on chr1q21 (c, red) highlight neuroblastoma breakpoint family member (NBPF) paralogs, which map to several locations within this genomic region.
To classify the minor alleles and misorients in the human reference I identified ROIs that were homozygous in ≥ 80% of the cells in our population, with a minimum of 10 cells required for inclusion (Fig. 4-6, red squares). I found 24 ROIs that ranged in size from 18.9 kb to 1.7 Mb and collectively comprised 8.4 Mb (0.27%) of the human genome (Fig. 4-7a, and Table 4-1). This included ROIno.16.22, which appeared misoriented in 82% of the cells and was flanked by two fragments (ROIno.16.21 and ROIno.16.23) that were misoriented in 100% of the cells (Fig. 4-7b). These data suggest that the entire contig (GL000125.1) is misoriented, with the region represented by ROIno.16.22 constituting a rare inversion that falls within this large misorientation. The location of ROIno.16.22 matches several previous reports of inversions listed in the DGV (Fig. 4-7b, purple bars) that should be re-evaluated given the likely misorientation of the surrounding contig. This emphasizes how correct annotation of the reference assembly is critical for mapping genomic variants accurately. The largest misoriented fragment I found (ROIno.1.13) fell within contig GL00014.1 on chr1q21, overlapped 19 inversions in the DGV, and encompassed 27 unique genes (Fig. 4-7c, also see Fig. 4-4a, blue bar). These genes are in the reverse configuration in 93% of the population I sampled, with those nearest the breakpoints almost 2 Mb from their expected location in the reference assembly. This included several paralogs in a tumor-suppressor gene family associated with neuroblastoma [136-138] (Fig. 4-7c, red Refseq genes). 
Finally, while the breakpoints of ROIno.1.13 abut sequence assembly gaps, 33% of the misoriented ROIs I identified were within contiguous sequence of supposedly ‘known’ orientation (including ROIno.16.22 of Fig. 4-7b). This contrasts with our findings in the murine genome, where misoriented fragments were flanked by unbridged gaps and were of unconfirmed orientation [109]. Collectively, these results highlight how Strand-seq datasets can help better resolve reference assemblies for more accurate gene mapping and variant discovery.

Table 4-1 | Potential misorients or minor alleles in reference genome have a homozygous frequency > 80% Genomic coordinates and metrics associated with each region of interest (ROI) in this subcategory of structural variant found for the pooled cord blood population. Chromosome (chr); Database of Genomic Variants (DGV).

Name | Chr | Start | End | Size (bp) | Passing cells | Average reads/Mb | Homozygous cells (%) | Number of DGV hits | Flanking reference gaps (5' - 3')
ROIno.1.7 | chr1 | 120747156 | 120936695 | 189,539 | 12 | 121.3 | 100% | 0 | bridged-unbridged
ROIno.1.11 | chr1 | 145368224 | 145833119 | 464,895 | 15 | 193.6 | 100% | 6 | none-bridged
ROIno.1.13 | chr1 | 146303299 | 148026039 | 1,722,740 | 15 | 133.5 | 93% | 19 | bridged-unbridged
ROIno.1.16 | chr1 | 206072708 | 206332221 | 259,513 | 13 | 161.8 | 100% | 0 | unbridged-unbridged
ROIno.6.5 | chr6 | 61880166 | 62128590 | 248,424 | 17 | 241.5 | 100% | 0 | unbridged-bridged
ROIno.6.6 | chr6 | 157609467 | 157641301 | 31,834 | 11 | 471.2 | 100% | 0 | bridged-bridged
ROIno.7.13 | chr7 | 142098195 | 142276198 | 178,003 | 18 | 280.9 | 100% | 0 | bridged-bridged
ROIno.9.10 | chr9 | 43996569 | 44676646 | 680,077 | 25 | 51.5 | 88% | 7 | bridged-bridged
ROIno.9.20 | chr9 | 66242215 | 66404656 | 162,441 | 13 | 104.7 | 85% | 0 | bridged-bridged
ROIno.10.3 | chr10 | 42409938 | 42527020 | 117,082 | 20 | 213.5 | 100% | 0 | none-bridged
ROIno.10.4 | chr10 | 42527234 | 42546688 | 19,454 | 21 | 5808.6 | 95% | 0 | none-bridged
ROIno.10.9 | chr10 | 48105710 | 49095536 | 989,826 | 21 | 137.4 | 90% | 8 | unbridged-unbridged
ROIno.10.10 | chr10 | 51448847 | 51666976 | 218,129 | 20 | 137.5 | 100% | 6 | bridged-none
ROIno.11.6 | chr11 | 51090854 | 51594205 | 503,351 | 25 | 298.0 | 100% | 0 | unbridged-unbridged
ROIno.12.2 | chr12 | 17921947 | 18010437 | 88,490 | 14 | 192.1 | 100% | 32 | none-none
ROIno.15.6 | chr15 | 22646194 | 23514853 | 868,659 | 23 | 140.4 | 96% | 10 | unbridged-unbridged
ROIno.16.21 | chr16 | 34173153 | 34393687 | 289,982 | 22 | 234.5 | 100% | 8 | unbridged-none
ROIno.16.22 | chr16 | 34394687 | 34764628 | 292,901 | 22 | 259.5 | 82% | 8 | none-none
ROIno.16.23 | chr16 | 34765628 | 35285801 | 530,298 | 22 | 237.6 | 100% | 2 | none-bridged
ROIno.16.24 | chr16 | 46416741 | 46435602 | 18,861 | 22 | 10603.9 | 100% | 83 | none-none
ROIno.20.3 | chr20 | 29419570 | 29580261 | 160,691 | 28 | 329.8 | 93% | 0 | unbridged-none
ROIno.22.2 | chr22 | 16448948 | 16697850 | 248,902 | 24 | 112.5 | 88% | 2 | none-unbridged
ROIno.Y.14 | chrY | 22224491 | 22246066 | 21,575 | 4 | 556.2 | 100% | 0 | none-none
ROIno.Y.24 | chrY | 58819362 | 58917656 | 98,294 | 22 | 925.8 | 100% | 0 | unbridged-unbridged

4.2.3 Annotating the human reference assembly – predicting under-represented repetitive elements

While ROIs with a high homozygous frequency pointed to potential orientation errors in the reference assembly, I also observed regions with high heterozygous frequencies (Fig. 4-6, blue diamonds). These ROIs contained an even ratio of W and C reads that best fit our definition of a heterozygous genotype in every cell analyzed (since only WW and CC chromosomes were considered). The maximal proportion of heterozygous genotypes we anticipated at a biallelic locus was 0.50 (indicating uninverted and inverted allele frequencies are equally represented in the population), and therefore these regions showed a much higher heterozygous frequency than expected. We hypothesized they marked under-represented repetitive sequences in the human reference assembly that physically occurred at multiple genomic locations, but were represented in the reference at a single locus. 
If these sequences are present on multiple chromosomes, they are expected to have template strand inheritance patterns that match the chromosomes they reside on, and will thus frequently appear WC, since each chromosome harboring the sequence will have independent segregation patterns. For instance, the pseudo-autosomal regions (PARs) are present on both sex chromosomes but only represented on chrX in the human reference assembly (GRCh37/hg19). Consequently, all sequencing reads originating from both chrX and chrY PARs align only to chrX. In Strand-seq libraries, the large (~1.15 Mb) PAR1 region on the p-arm tip of chrX (chrXp22) appeared WC in 100% of male cells where the sex chromosomes were differentially inherited (for instance, see the sister cell pairs of Fig. 2-4: Pair #2 (HsSs_0103/HsSs_0106), and Pair #3 (HsSs_0078/HsSs_0047)). Conversely, when the sex chromosomes were inherited in the same orientation, chrXp22 did not appear WC and instead an increased read depth was observed (e.g. see Pair #1 (HsSs_0050/HsSs_0061) in Fig. 2-4). Therefore, we predicted other repetitive sequences in the human genome that are currently underrepresented in the reference assembly would frequently appear WC in Strand-seq libraries.

To identify new repetitive elements in the human reference assembly, I located ROIs with a high heterozygous frequency (Fig. 4-6, blue diamonds). I found 46 regions that were WC in ≥ 80% of cells, where a minimum of ten cells was required for inclusion, and denoted them ‘Always WC (AWC)’ regions. This included a > 250 kb region on chr10q11 (ROIno.10.7) that was WC in 100% of the cells I genotyped (Fig. 4-5). The AWC regions ranged from 4,105 bp to 1.5 Mb in size, and together comprised 15.3 Mb (0.5%) of the human genome (Table 4-2). Although 67.4% (31) overlapped with DGV inversions, AWCs are unlikely to be polymorphisms because we would expect higher homozygous frequencies if they represented common variants. 
Indeed, none of the AWCs were in Hardy-Weinberg equilibrium (HWE) (Table 4-2), the state expected of stable neutral alleles in a population [10, 11, 135]. Rather than genomic SVs, AWCs are more likely sequence elements represented multiple times in the genome.

Figure 4-8 | Always Watson Crick (AWC) regions mark underrepresented repetitive elements in the human reference assembly a) Lastz dot plots of pair-wise self-alignments of chr8 (chr8:6,000,000-13,000,000) show the repetitive composition of AWCs (blue bars). i) Two AWCs flanking the chr8p23 inversion coincide with known segmental duplications (SegDups; depicted on x-axis). ii) Zoomed inset (dotted red box in i; chr8:6,750,000-8,250,000) of ROIno.8.2 illustrating the palindromic and repetitive nature of DNA at this ROI. b) Genome Browser view (chr10:46,750,000-47,350,000) of an AWC not annotated as a repetitive element. Two Strand-seq libraries are shown to illustrate high read densities of this AWC (ROIno.10.7), which is flanked by identical segmental duplications (orange signifies greater than 99% similarity) and overlaps several inversions in the Database of Genomic Variants (DGV; purple bars). c) Representative libraries (chr19:27,730,000-27,800,000) show how the centromere AWC at ROIno.19.2 contains a disproportionately high number of reads, and does not coincide with known segmental duplications or inversions.

In some cases AWC regions coincided with blocks of previously annotated segmental duplications, which are arbitrarily defined as genomic repeats with > 90% identity and > 1 kb in length [139]. 
This included ROIno.8.2 and ROIno.8.4, which had heterozygous frequencies of 95% and 85%, respectively (Table 4-2), and flanked the chr8p23 inversion described in Chapter 2 (see Fig. 2-9c). The sequencing reads mapped at these elements must contain unique nucleotides that allow them to pass mapping quality filter criteria (q > 10). Self-alignment of the DNA sequence revealed the degree and orientation of sequence similarity between the AWCs (Fig. 4-8). On this scale I saw that ROIno.8.2 was a repeated palindrome of about 500 kb, which was partially duplicated in ROIno.8.4 (i.e. on the other side of the inversion, at ~12 Mb) (Fig. 4-8a, i). The inverted orientation of the duplications explains why the region contained reads in both directions and was called WC. A zoomed view of ROIno.8.2 showed four copies of a minisatellite of variable sizes within the palindrome (Fig. 4-8a, ii), highlighting the complex architecture of AWCs.

In contrast, other AWCs did not overlap known segmental duplications. For instance, ROIno.10.7 on chr10q11 was between blocks of near-identical segmental duplications containing few aligned reads, whereas the read density at the AWC itself was quite high (Fig. 4-8b). In fact, the overall average read density at AWCs was ~5-fold greater than the average across Strand-seq libraries (1,700 versus 300 reads/Mb), with the highest density found at centromeric ROIno.19.2, averaging over 32,000 reads/Mb (Fig. 4-8c). This strongly supports the hypothesis that these sequences are present elsewhere in the genome (likely several times, given the heterozygous frequency and read depths) and that reads originating from other chromosomes converge at the AWC loci. To further test this, I aligned reads mapping to AWCs to short tandem repeat (STR) sequences recently described and patched to the human reference assembly (GRCh38/hg38) [57], and found 41 AWCs (89%) contained reads mapping to at least one STR. 
For instance, the AWC at ROIno.10.7 mapped to 14 STRs on 13 different chromosomes, explaining why this region always appeared WC in our cells. Collectively, these results highlight how, by using Strand-seq to explore genotype frequencies, I can annotate genomes, correct potential assembly errors and locate novel repetitive elements more accurately than with conventional techniques.

Table 4-2 | Always Watson Crick (AWC) regions have a heterozygous frequency > 80% Genomic coordinates and metrics associated with each region of interest (ROI) in this subcategory of structural variant found for the pooled cord blood population. Chromosome (chr); Megabase (Mb); Database of Genomic Variants (DGV); Hardy-Weinberg equilibrium (HWE).

Name | Chr | Start | End | Size (bp) | Passing cells | Average reads/Mb | Heterozygous cells (%) | Number of DGV hits | ROI in HWE | HWE pValue
ROIno.1.2 | chr1 | 12977429 | 13711252 | 733,823 | 11 | 27.3 | 91% | 11 | not_hwe | 0.018134
ROIno.1.3 | chr1 | 16834068 | 17275757 | 441,689 | 13 | 332.8 | 100% | 33 | not_hwe | 0.000788
ROIno.1.6 | chr1 | 120533454 | 120693914 | 160,460 | 14 | 230.6 | 86% | 0 | not_hwe | 0.021827
ROIno.1.8 | chr1 | 142535435 | 143544525 | 1,009,090 | 15 | 202.2 | 100% | 16 | not_hwe | 0.000211
ROIno.1.10 | chr1 | 143871003 | 145368224 | 1,497,221 | 15 | 185.0 | 100% | 8 | not_hwe | 0.000211
ROIno.1.14 | chr1 | 148511359 | 149039849 | 528,490 | 15 | 132.5 | 100% | 4 | not_hwe | 0.000211
ROIno.2.2 | chr2 | 89595269 | 89892888 | 297,619 | 21 | 184.8 | 100% | 4 | not_hwe | 3.90E-06
ROIno.2.4 | chr2 | 90413567 | 90531134 | 117,567 | 21 | 204.1 | 90% | 0 | not_hwe | 0.000427
ROIno.2.5 | chr2 | 91595104 | 91989505 | 394,401 | 21 | 479.2 | 100% | 2 | not_hwe | 3.90E-06
ROIno.4.2 | chr4 | 49215572 | 49326591 | 111,019 | 23 | 351.3 | 87% | 13 | not_hwe | 0.000384
ROIno.4.3 | chr4 | 49488942 | 49660117 | 171,175 | 24 | 368.0 | 83% | 11 | not_hwe | 0.002739
ROIno.5.3 | chr5 | 175343108 | 175474040 | 130,932 | 19 | 160.4 | 89% | 6 | not_hwe | 0.000837
ROIno.6.2 | chr6 | 26680497 | 26712303 | 31,806 | 12 | 534.5 | 92% | 23 | not_hwe | 0.014288
ROIno.7.8 | chr7 | 61054332 | 61917157 | 862,825 | 19 | 118.2 | 100% | 2 | not_hwe | 1.48E-05
ROIno.8.1 | chr8 | 2187122 | 2330497 | 143,375 | 19 | 355.7 | 95% | 28 | not_hwe | 0.000151
ROIno.8.2 | chr8 | 7261255 | 8034631 | 773,376 | 19 | 63.4 | 95% | 18 | not_hwe | 0.000151
ROIno.8.4 | chr8 | 12038531 | 12454792 | 416,261 | 20 | 163.4 | 85% | 17 | not_hwe | 0.001718
ROIno.9.7 | chr9 | 41415793 | 42613955 | 1,198,162 | 25 | 24.2 | 88% | 3 | not_hwe | 0.000255
ROIno.9.13 | chr9 | 45350203 | 45815521 | 465,318 | 25 | 55.9 | 88% | 0 | not_hwe | 0.000255
ROIno.9.16 | chr9 | 46561039 | 47060133 | 499,094 | 15 | 34.1 | 93% | 0 | not_hwe | 0.001734
ROIno.9.21 | chr9 | 66454656 | 66614195 | 159,539 | 26 | 388.6 | 92% | 25 | not_hwe | 2.92E-05
ROIno.9.24 | chr9 | 67207834 | 67366296 | 158,462 | 23 | 189.3 | 91% | 6 | not_hwe | 0.000134
ROIno.9.26 | chr9 | 68137998 | 68514181 | 376,183 | 26 | 300.4 | 96% | 10 | not_hwe | 2.39E-06
ROIno.10.2 | chr10 | 42354936 | 42409824 | 54,888 | 21 | 15,431.4 | 100% | 0 | not_hwe | 3.90E-06
ROIno.10.7 | chr10 | 46908215 | 47163314 | 255,099 | 21 | 517.4 | 100% | 28 | not_hwe | 3.90E-06
ROIno.14.2 | chr14 | 19359522 | 19736988 | 377,466 | 21 | 100.7 | 81% | 8 | not_hwe | 0.010133
ROIno.15.2 | chr15 | 20389552 | 20894634 | 505,082 | 23 | 314.8 | 91% | 21 | not_hwe | 8.07E-05
ROIno.15.3 | chr15 | 20935075 | 21398820 | 463,745 | 23 | 97.0 | 87% | 20 | not_hwe | 0.000764
ROIno.15.4 | chr15 | 21885000 | 22212115 | 327,115 | 23 | 168.1 | 83% | 21 | not_hwe | 0.003098
ROIno.15.7 | chr15 | 28585471 | 28836717 | 251,246 | 15 | 75.6 | 93% | 0 | not_hwe | 0.001734
ROIno.16.3 | chr16 | 15016162 | 15124107 | 107,945 | 22 | 259.4 | 82% | 27 | not_hwe | 0.00749
ROIno.16.12 | chr16 | 22545103 | 22712167 | 167,064 | 21 | 341.2 | 86% | 28 | not_hwe | 0.001166
ROIno.16.16 | chr16 | 32298500 | 32741076 | 442,576 | 22 | 277.9 | 82% | 2 | not_hwe | 0.00749
ROIno.16.20 | chr16 | 33942810 | 34023150 | 80,340 | 21 | 784.2 | 95% | 13 | not_hwe | 4.35E-05
ROIno.17.6 | chr17 | 21506686 | 21566608 | 59,922 | 31 | 484.0 | 100% | 0 | not_hwe | 4.61E-09
ROIno.17.11 | chr17 | 36349527 | 36407788 | 58,261 | 28 | 326.1 | 86% | 0 | not_hwe | 0.000394
ROIno.18.1 | chr18 | 105696 | 113632 | 7,936 | 20 | 5,040.3 | 100% | 0 | not_hwe | 8.95E-06
ROIno.18.2 | chr18 | 18516506 | 18520611 | 4,105 | 20 | 10,718.6 | 95% | 91 | not_hwe | 0.000108
ROIno.19.2 | chr19 | 27731783 | 27740734 | 8,951 | 26 | 32,286.9 | 92% | 0 | not_hwe | 2.04E-05
ROIno.20.4 | chr20 | 29580277 | 29653908 | 73,631 | 27 | 665.5 | 85% | 0 | not_hwe | 0.000392
ROIno.21.1 | chr21 | 9411194 | 10215977 | 804,783 | 28 | 147.9 | 100% | 0 | not_hwe | 4.03E-08
ROIno.21.3 | chr21 | 10697896 | 10770371 | 72,475 | 28 | 1,586.8 | 100% | 6 | not_hwe | 4.03E-08
ROIno.21.5 | chr21 | 11012921 | 11188129 | 175,208 | 28 | 810.5 | 96% | 2 | not_hwe | 6.61E-07
ROIno.X.19 | chrX | 61726368 | 61734076 | 7,708 | 20 | 2,465.0 | 80% | 0 | not_hwe | 0.005439
ROIno.Y.9 | chrY | 13258363 | 13573421 | 315,058 | 23 | 412.6 | 96% | 0 | not_hwe | 1.24E-05
ROIno.Y.25 | chrY | 58967657 | 59034049 | 66,392 | 23 | 1,506.2 | 91% | 0 | not_hwe | 8.07E-05

4.2.4 Non-targeted characterization of polymorphic inversions in a normal human population

By exploring genotype frequencies of the ROIs in our population I have classified the potential misorients, minor alleles and underrepresented repetitive elements in the reference assembly. With these regions annotated, I was then able to accurately define structural polymorphisms in the human genome. Polymorphic variants are predicted to be heterogeneous loci that show a range of alleles in our population. For instance, a polymorphic inversion would cause a localized change in template strands in multiple cells, with a distribution of genotypes expected to be in HWE, and a minor allele frequency (MAF) of at least 0.05 [19]. Accordingly, to locate structural polymorphisms in our population I identified all the ROIs where at least two cells had a genotype other than wildtype, after excluding the AWCs and potential misorients from consideration. 
Figure 4-9 | Size and genomic distributions of polymorphic inversions identified in the mixed population a) Cumulative size distribution of polymorphic inversions (n = 111), defined as heterogeneous ROIs with a putative inversion in at least 2 cells and an allelic frequency > 0.05. The cumulative frequency of inversion sizes in base pairs (bp) is shown, divided into new inversions (blue circles) and inversions overlapping with Database of Genomic Variants (DGV) entries (purple squares). The median inversion size (dotted line; 174,765 bp) is well below the 2 Mb detection limit of traditional cytogenetic techniques (grey shading). b) Genomic distribution of the total number of inversions found in the pooled cord blood population (n = 47 cells). The proportion of inversions present on each chromosome is shown, normalized to chromosome size.

Using this non-targeted approach, I found 111 polymorphic inversions that together comprised 34.9 Mb (1.13%) of the genome, ranging in size from 16.5 kb to 3.9 Mb (Table 4-3). Notably, 98% of the polymorphisms I identified were below 2 Mb in size (Fig. 4-9a, grey box), which marks the limit of detection for traditional cytogenetic techniques [39, 108]. Moreover, the inversions found here that are missing from the DGV span the entire spectrum of inversion sizes, suggesting that previous methods were less robust at locating inversions, including the larger variants that can be found using FISH technologies. This included a ~700 kb inversion at chr17q21 (ROIno.17.16) that coincided¹³ with a polymorphism known to predispose children of heterozygous carriers to a deletion syndrome associated with pronounced developmental delays [77, 78, 88, 132]. 
While the size and complexity of this genomic locus have made this inversion difficult to localize in the past [56, 79], I rapidly genotyped multiple individuals and identified nine heterozygotes in the population (of 32 informative cells), demonstrating how this approach could be applied for diagnostic testing. Additionally, I found 71 (64%) polymorphisms overlapped with inversions listed in the DGV, whereas 40 (36%) were novel variants not yet reported (Fig. 4-9a, and Table 4-3). Taken together, this shows how polymorphic SVs can be located in a mixed population of cells.

¹³ Using the UCSC Genome Browser LiftOver tool, the reported coordinates of chr17: 41046729-41470954, mapped on the NCBI35/hg17 assembly [88], lifted to chr17: 43690946-44115107 of the GRCh37/hg19 build.

Figure 4-10 | Allelic frequencies of polymorphic inversions identified in the mixed population Bar graph of allelic frequencies for the polymorphic inversions (n = 111) identified in the pooled donor population. The frequency of alleles in a wildtype versus inverted state was calculated based on the genotypes found for each cell. The height of each bar represents the frequency of the inverted allele, with the proportion contributed by cells in a heterozygous state (grey) versus homozygous state (black) shown. Inversions with an allelic frequency > 0.5 (dotted red line) represent alleles commonly inverted in the sampled population. Asterisks mark the inversions not in Hardy-Weinberg equilibrium. For additional information on the ROIs, including genomic coordinates, see Table 4-3.

By looking chromosome-by-chromosome for heterogeneous ROIs in a mixed donor cell sample, I generated a global dataset of polymorphic inversions present in this subset of human genomes. With this comprehensive list in hand I was able to explore the distribution and frequency of inversions in a normal population. Some of the inverted loci were highly polymorphic in the dataset, and allelic frequencies ranged from 0.05 to 0.89 (Table 4-3, and Fig. 4-10). I found 24 inversions had frequencies > 0.5 (Fig. 4-10, dotted line), which implies that the reference assembly alleles do not represent the common variant found in our population. The vast majority (87%) of the polymorphisms were in HWE (Table 4-3), suggesting they are stable variants not under selective pressures in the population [11, 135]. The ROIs not in HWE typically had a higher than expected proportion of heterozygous cells (Fig. 4-10, asterisks). Previous studies suggest that heterozygous inversions are resistant to meiotic recombination, as this results in unbalanced gametes [62, 63], and therefore it is possible genes in these regions are protected from recombining. Alternatively, these may represent false positives in our dataset [10]. Almost half (46%) of the ROIs not in HWE were adjacent to centromeres or telomeres (e.g. ROIno.4.1 and ROIno.17.7), which are highly repetitive genomic regions that are difficult to reliably sequence and genotype [139]. Others encompassed genes with disease associations, including ROIno.17.4 that contained the kinase MAP2K [140], and ROIno.16.17 that contained a p53 target gene TP53TG3 [141]. This may suggest these loci are under selective pressure in a subset of our cells [142]; however, more defined populations are required to make definitive associations. Together, these data highlight how polymorphic inversions in normal populations can serve as a tool to study changes in the frequency and distribution of rearrangements in defined demographics or in the context of disease.
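To make the allele frequency and HWE reasoning above concrete, the following Python sketch computes the inverted allele frequency from genotype counts and tests the counts against Hardy-Weinberg expectations. It is an illustration only. The test actually used in this thesis is described in the Methods and is not reproduced here; the chi-square goodness-of-fit test with one degree of freedom is an assumption made for this example, and the function names are hypothetical. The sketch also assumes a diploid autosomal locus.

```python
import math


def inverted_allele_frequency(n_wt, n_het, n_hom):
    """Frequency of the inverted allele from per-cell genotype counts.

    Each heterozygous cell carries one inverted allele and each
    homozygous cell carries two, out of 2 alleles per genotyped cell.
    """
    n_cells = n_wt + n_het + n_hom
    if n_cells == 0:
        raise ValueError("at least one genotyped cell is required")
    return (n_het + 2 * n_hom) / (2 * n_cells)


def hwe_chi_square(n_wt, n_het, n_hom):
    """Chi-square test of genotype counts against HWE proportions.

    Assumption for this sketch: a 1-degree-of-freedom chi-square test,
    which is not necessarily the test used in the thesis.
    Returns (chi-square statistic, p-value).
    """
    n_cells = n_wt + n_het + n_hom
    q = inverted_allele_frequency(n_wt, n_het, n_hom)  # inverted allele
    p = 1.0 - q                                        # reference allele
    observed = (n_wt, n_het, n_hom)
    expected = (n_cells * p * p, 2 * n_cells * p * q, n_cells * q * q)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of a chi-square distribution with 1 df.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


if __name__ == "__main__":
    # Hypothetical counts: 20 genotyped cells, all heterozygous.
    chi2, p_value = hwe_chi_square(0, 20, 0)
    print(f"chi2 = {chi2:.1f}, p = {p_value:.2e}")
```

A locus where every cell is heterozygous has an inverted allele frequency of 0.5 but contains no homozygous cells, so it departs strongly from HWE. This is the excess of heterozygotes described above for the ROIs not in HWE and for the AWC regions.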
[Figure 4-10 axis labels: x-axis, regions of interest (ROIs); y-axis, inversion frequencies (0.0 to 1.0); legend, heterozygous proportion and homozygous proportion.]

Figure 4-11 | Polymorphic domains containing clusters of inversions identified in a mixed population Polymorphic domains (red box) mapped to a) chr7 and b) chr16. i) Detail of the domains shown in the UCSC Genome Browser ‘packed’ view for ten representative Strand-seq libraries, along with tracks for sequence gaps (black), segmental duplications (SegDups) and inversions identified in the Database of Genomic Variants (DGV; purple). Corresponding overlaid Invert.R histograms of W/C ratios and inversion frequency heat maps (red bars) are shown in the lower panel. The polymorphic inversions (grey boxes) and corresponding ROI identifiers are also shown. ii) Clustered heat maps of the genotyped inversions (x-axis) identified in each cell (y-axis). Inversions are depicted as pale green (wildtype), medium green (heterozygous) or dark green (homozygous). In some cases, too few reads were present in the region to genotype the cell (grey). Asterisks and arrowheads highlight ROIs mentioned in text.

[Figure 4-11 panel labels: a) Chromosome 7 (chr7 p12.1-q11.23); b) Chromosome 16 (chr16 p13.13-q11.1); tracks: SegDups, DGV, Gaps, WC ratio; heat map legend: homozygous, heterozygous, wildtype, unknown.]

To test for evidence of chromosomal bias, I considered the genomic spread of inversions in our population. This revealed a non-random distribution, where over half (51.4%) of the polymorphisms were present on just five autosomes (chromosomes 7, 9, 15, 16 and 17), whereas six autosomes (chromosomes 3, 8, 10, 12, 19, and 21) together contained only 9% of the inversions (Fig. 4-9b). I did not observe any inversions on chr13 or chr18 in our population.
This polarization suggests chromosome-specific features influence susceptibility to undergo genomic rearrangements. On the chromosomes harboring high proportions of inversions, I observed inversion clusters forming blocks of highly polymorphic domains. For instance, a ~20 Mb domain surrounding the chr7 centromere (p12.1-q11.23) harbored seven distinct polymorphic inversions (Fig. 4-11a, i). To visualize the inheritance patterns of these inversions, they were hierarchically clustered based on genotype, and no obvious correlations with genomic distance were found (Fig. 4-11a, ii). For example, inversions physically close to each other (e.g. ROIno.7.10 and ROIno.7.11) did not cluster together (Fig. 4-11a, ii, arrowheads), whereas two inversions (ROIno.7.5 and ROIno.7.12) separated by ~15 Mb (containing three other inversions) showed similar inheritance and clustered closely (Fig. 4-11a, ii, asterisks). This suggests ROIno.7.5 and ROIno.7.12 are linked together on a haplotype block, and highlights the potential to study the evolutionary history of inversions in populations. I also identified 13 distinct polymorphisms in a ~20 Mb region on the p-arm of chr16 (Fig. 4-11b, i). The inversions here clustered into distinct blocks based on frequency, where one block contained very rare inversions (such as ROIno.16.6 and ROIno.16.18) and another block contained highly prevalent inversions, for which the wildtype was the minor allele with MAF < 0.2 (including ROIno.16.9 and ROIno.16.11) (Fig. 4-11b, ii, arrowheads). Taken together, these analyses demonstrate that inversions cluster in the human genome into polymorphic domains and that allelic states can be used to discern relationships between inversions in a heterogeneous sample.
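The genotype-based clustering described above starts from a pairwise comparison of inversions across cells. The following is a minimal sketch of that first step, assuming genotypes are coded 0 (wildtype), 1 (heterozygous), 2 (homozygous) and None (too few reads to genotype). The distance measure (mean absolute genotype difference over cells genotyped for both inversions), the function names and the example genotypes are all hypothetical; the metric actually used for Fig. 4-11 is not specified in this section.

```python
def genotype_distance(a, b):
    """Mean absolute genotype difference over cells genotyped for both inversions.

    Returns None when no cell was genotyped for both.
    """
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        return None
    return sum(abs(x - y) for x, y in shared) / len(shared)

def dissimilarity_matrix(profiles):
    """Pairwise distances between all inversions, keyed by (name_i, name_j)."""
    names = sorted(profiles)
    return {(i, j): genotype_distance(profiles[i], profiles[j])
            for i in names for j in names}

# Hypothetical genotypes for three inversions across six cells.
profiles = {
    "inv_A": [0, 1, 2, 0, None, 1],
    "inv_B": [0, 1, 2, 0, 1, 1],
    "inv_C": [2, 0, 0, 1, 0, None],
}
matrix = dissimilarity_matrix(profiles)
```

Here inv_A and inv_B share the same genotype in every cell genotyped for both, so their distance is zero and they would cluster together regardless of genomic position. A matrix of this kind could then be passed to a hierarchical clustering routine such as `scipy.cluster.hierarchy.linkage`.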
Figure 4-12 | Cluster analysis of the inversion profiles for individual cells in the mixed population Clustered heat maps of inversion profiles characterized for each cell in our population were generated by creating a pairwise dissimilarity matrix of the genotyped inversions for the chromosomes shown, and clustered using a hierarchical model. The heat maps show how each cell is identical to itself (diagonal line of black pixels). Cells that show similar inversion profiles cluster together in deep blue, whereas cells that are highly dissimilar are in light blue clusters. Only a subset of the population is shown for each chromosome, depending on which cells inherited the chromosomes as WW or CC. Relationships between individual cells in a heterogeneous sample can be visualized by the set of inversions present in each Strand-seq library.

By locating recurrent genomic rearrangements in our mixed population of cells, I have identified 111 polymorphic inversions with distinct frequencies and genomic distributions. Recall that each cell in our population represented a unique genome derived from a pool of 352 CB donors.
Assuming equal donor cell contributions, it is likely the majority of cells in our population represent unique individuals14. To consider the relationship between the different cells in our sample, I clustered the cells based on the genotyped inversions for each chromosome (Fig. 4-12). The heat maps showed related cells were grouped together based on similar inversion profiles, and that the number of inversions present on the chromosome helped better discriminate subsets of cells. By comparing inversion profiles, the relationships between cells were resolved within our sample. For instance, I saw HsSs_0278 and HsSs_0283 grouped together in multiple chromosomes (Fig. 4-12, asterisks) and distinct from HsSs_0264 and HsSs_0257 (Fig. 4-12, arrowheads). Taken together, this illustrates how the set of inversions mapped in single Strand-seq libraries can be used to visualize the relatedness between individual cells in a heterogeneous sample, which has important implications for studying different human populations or tumor samples.

[Figure 4-12 panel labels: chr6, chr7, chr9, chr10, chr11, chr16; each panel includes a color key and histogram (value versus count); rows and columns are individual Strand-seq libraries.]

14 While there is a significant (p = 0.04) likelihood that two libraries in our sample of 47 cells were derived from the same donor, the remaining 45 were likely from different donors, and thus 46 (98%) are expected to represent a unique individual from the pool.
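The donor arithmetic behind the equal-contribution assumption can be sketched as a classic "birthday problem". The snippet below computes the probability that 47 cells drawn independently and uniformly from 352 donors all come from different donors. Reading the footnote's p = 0.04 as this quantity is an assumption made here for illustration, as are the function and variable names.

```python
def prob_all_distinct(n_cells, n_donors):
    """Probability that n_cells drawn uniformly at random (with replacement)
    from n_donors equally represented donors all come from different donors."""
    prob = 1.0
    for i in range(n_cells):
        prob *= (n_donors - i) / n_donors
    return prob

# 47 libraries sampled from a pool of 352 donors (values from the text).
p_unique = prob_all_distinct(47, 352)
```

Under these assumptions `p_unique` evaluates to about 0.04, so at least one pair of libraries sharing a donor is likely, while most libraries are still expected to represent distinct individuals.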
Table 4-3 | Polymorphic inversions in the pooled cord blood population Genomic coordinates and metrics associated with each region of interest (ROI) in this subcategory of structural variant found for the pooled cord blood population. ROIs not in Hardy-Weinberg equilibrium (HWE) are shown in bold (not_hwe in the ROI in HWE column). Chromosome (chr); Megabase (Mb); Wildtype frequency (wtFreq); inversion frequency (invFreq); Database of Genomic Variants (DGV).

Name  Chr  Start  End  Size  Passing cells  Average reads/Mb  wtFreq  invFreq  Number of DGV hits  ROI in HWE  HWE pValue
ROIno.1.1  chr1  130847  715919  585,072  8  32.5  0.75  0.25  0  HWE  1
ROIno.1.5  chr1  108854248  108922047  67,799  7  250.7  0.43  0.57  5  HWE  1
ROIno.1.15  chr1  149267846  149316798  48,952  12  388.1  0.75  0.25  3  HWE  0.52925793
ROIno.2.3  chr2  89908995  90268656  359,661  21  194.6  0.95  0.05  4  HWE  1
ROIno.2.6  chr2  92025421  92202999  177,578  20  253.4  0.75  0.25  0  HWE  0.27716573
ROIno.2.7  chr2  92270839  92319722  48,883  21  4521.0  0.69  0.31  0  HWE  0.11540179
ROIno.2.8  chr2  96130562  96263190  132,628  17  203.6  0.76  0.24  7  HWE  0.52012524
ROIno.2.13  chr2  110853234  111063527  210,293  21  175.9  0.14  0.86  0  HWE  1
ROIno.3.2  chr3  195389162  195724370  335,208  24  241.6  0.75  0.25  11  HWE  1
ROIno.4.1  chr4  10001  69091  59,090  24  643.1  0.6  0.4  0  not_hwe  0.0022196
ROIno.4.4  chr4  190539911  190683764  143,853  24  660.4  0.6  0.4  16  not_hwe  0.0022196
ROIno.4.5  chr4  191017623  191044276  26,653  13  675.3  0.62  0.38  0  HWE  0.56580966
ROIno.5.1  chr5  21464002  21590594  126,592  20  292.3  0.92  0.08  2  HWE  1
ROIno.5.2  chr5  68916364  70645568  1,729,204  20  13.3  0.62  0.38  8  HWE  0.163034
ROIno.5.4  chr5  177151585  177334643  183,058  16  109.3  0.75  0.25  6  HWE  0.51274255
ROIno.6.1  chr6  273491  381214  107,723  17  538.4  0.71  0.29  0  HWE  0.24063775
ROIno.6.4  chr6  57380956  57609346  228,390  17  586.7  0.85  0.15  2  HWE  1
ROIno.7.2  chr7  6021253  6778997  757,744  19  186.1  0.84  0.16  5  HWE  1
ROIno.7.3  chr7  6778997  6865007  86,010  7  174.4  0.64  0.36  3  HWE  0.44055944
ROIno.7.4  chr7  54302450  54376389  73,939  12  216.4  0.42  0.58  18  HWE  1
ROIno.7.5  chr7  56856774  57122691  265,917  18  112.8  0.86  0.14  0  HWE  1
ROIno.7.6  chr7  57695766  57898822  203,056  18  226.5  0.67  0.33  1  HWE  0.5972668
ROIno.7.9  chr7  62825835  63159414  333,579  18  89.9  0.72  0.28  2  HWE  0.25080778
ROIno.7.10  chr7  64578759  65012798  434,039  19  152.1  0.66  0.34  3  HWE  0.60694208
ROIno.7.11  chr7  65012903  65113002  100,099  9  139.9  0.78  0.22  6  HWE  0.34117647
ROIno.7.12  chr7  72641816  74896848  2,255,032  19  133.9  0.89  0.11  3  HWE  1
ROIno.7.14  chr7  143421498  143577612  156,114  13  115.3  0.58  0.42  17  HWE  0.27505721
ROIno.7.17  chr7  152079422  152113324  33,902  13  442.5  0.69  0.31  0  HWE  1
ROIno.8.3  chr8  8055789  11980649  3,924,860  20  226.0  0.78  0.22  18  HWE  1
ROIno.9.3  chr9  40283029  40425834  142,805  5  91.0  0.5  0.5  0  HWE  0.36507937
ROIno.9.4  chr9  40475834  40940341  464,507  26  66.7  0.25  0.75  0  HWE  1
ROIno.9.8  chr9  42663955  43213698  549,743  18  34.6  0.72  0.28  0  HWE  0.55929869
ROIno.9.9  chr9  43313698  43946569  632,871  24  71.1  0.38  0.62  0  HWE  0.39162754
ROIno.9.11  chr9  44726646  44908293  181,647  25  192.7  0.92  0.08  2  HWE  1
ROIno.9.12  chr9  44958293  45250203  291,910  8  54.8  0.5  0.5  0  HWE  0.47785548
ROIno.9.14  chr9  45865521  46216430  350,909  7  37.0  0.57  0.43  0  HWE  1
ROIno.9.15  chr9  46266430  46461039  194,609  2  56.5  0.25  0.75  0  HWE  1
ROIno.9.17  chr9  47160133  47317679  157,546  18  126.9  0.44  0.56  0  HWE  0.33473263
ROIno.9.18  chr9  65467679  65918360  450,681  9  35.5  0.67  0.33  0  HWE  0.45701357
ROIno.9.22  chr9  66664195  66863343  199,148  26  241.0  0.23  0.77  5  HWE  0.28181471
ROIno.9.25  chr9  67516296  67987998  471,702  19  40.3  0.68  0.32  6  HWE  0.60694208
ROIno.9.27  chr9  68664181  68838946  174,765  23  125.9  0.11  0.89  1  HWE  1
ROIno.9.28  chr9  68988946  69278385  289,439  26  117.5  0.38  0.62  4  not_hwe  0.0376061
ROIno.9.29  chr9  69328385  70010542  682,157  26  90.9  0.37  0.63  4  not_hwe  0.0078568
ROIno.9.32  chr9  70556535  70735468  178,933  18  72.7  0.19  0.81  3  HWE  0.51202346
ROIno.10.8  chr10  47337994  47429049  91,055  7  164.7  0.71  0.29  7  HWE  1
ROIno.10.11  chr10  81278497  82025007  746,510  21  213.0  0.88  0.12  0  HWE  1
ROIno.11.3  chr11  48340762  48386751  45,989  19  391.4  0.61  0.39  0  not_hwe  0.0123
ROIno.11.5  chr11  50093456  50307124  213,668  26  215.3  0.67  0.33  2  HWE  0.67193657
ROIno.11.7  chr11  89556752  89804254  247,502  17  68.7  0.65  0.35  9  HWE  1
ROIno.12.1  chr12  60001  95740  35,739  6  419.7  0.67  0.33  0  HWE  1
ROIno.12.5  chr12  131769544  132186466  416,922  21  263.8  0.9  0.1  4  HWE  1
ROIno.14.1  chr14  19000001  19129124  129,123  21  387.2  0.74  0.26  0  HWE  0.26170073
ROIno.14.3  chr14  19747162  19829034  81,872  19  256.5  0.42  0.58  10  HWE  1
ROIno.14.4  chr14  19837085  20419185  582,100  21  151.2  0.67  0.33  5  not_hwe  0.0476843
ROIno.15.1  chr15  20000000  20386762  386,762  23  168.1  0.72  0.28  3  HWE  0.12666472
ROIno.15.5  chr15  22262114  22596193  334,079  23  335.3  0.61  0.39  20  HWE  0.37869567
ROIno.15.8  chr15  30439999  30862015  422,016  19  54.5  0.71  0.29  1  HWE  0.25318995
ROIno.15.9  chr15  30895805  32459491  1,563,686  23  218.7  0.93  0.07  2  HWE  1
ROIno.15.10  chr15  32491843  32746347  254,504  14  74.7  0.75  0.25  2  HWE  0.51304348
ROIno.15.15  chr15  84903485  84959998  56,513  13  247.7  0.54  0.46  6  HWE  1
ROIno.15.16  chr15  102489012  102521392  32,380  6  370.6  0.75  0.25  0  HWE  0.27272727
ROIno.16.2  chr16  14893748  15013665  119,917  13  175.1  0.19  0.81  27  HWE  1
ROIno.16.4  chr16  15125684  15426343  300,659  21  146.3  0.21  0.79  4  HWE  1
ROIno.16.5  chr16  16389378  16582999  193,621  13  72.3  0.81  0.19  0  HWE  1
ROIno.16.6  chr16  16641828  18219218  1,577,390  22  233.3  0.95  0.05  3  HWE  1
ROIno.16.7  chr16  18213672  18743382  529,710  22  90.6  0.89  0.11  3  HWE  1
ROIno.16.8  chr16  21517911  21603924  86,013  20  418.5  0.55  0.45  31  HWE  0.16947542
ROIno.16.9  chr16  21607232  21749719  142,487  22  217.6  0.14  0.86  31  HWE  0.3235307
ROIno.16.11  chr16  21801729  22497134  695,405  22  172.6  0.18  0.82  31  HWE  0.53811484
ROIno.16.13  chr16  28424774  28788943  364,169  22  101.6  0.61  0.39  0  HWE  0.65268702
ROIno.16.14  chr16  32021206  32119904  98,698  14  192.5  0.68  0.32  0  HWE  0.22086957
ROIno.16.15  chr16  32129304  32296150  166,846  20  143.8  0.85  0.15  2  HWE  1
ROIno.16.17  chr16  32751076  33259632  508,556  22  110.1  0.43  0.57  4  not_hwe  0.025126
ROIno.16.18  chr16  33293819  33639852  346,033  22  393.0  0.93  0.07  2  HWE  1
ROIno.16.19  chr16  33687373  33786667  99,294  15  211.5  0.93  0.07  0  HWE  1
ROIno.16.25  chr16  70155977  70211425  55,448  14  360.7  0.43  0.57  5  HWE  0.26716023
ROIno.16.26  chr16  75240199  75256700  16,501  4  848.4  0.5  0.5  40  HWE  1
ROIno.17.1  chr17  16658164  16749299  91,135  23  230.4  0.83  0.17  8  HWE  1
ROIno.17.2  chr17  18312480  18406490  94,010  19  170.2  0.82  0.18  11  HWE  1
ROIno.17.4  chr17  21207696  21254690  46,994  31  659.7  0.65  0.35  0  not_hwe  0.0040192
ROIno.17.5  chr17  21303046  21352310  49,264  32  629.3  0.64  0.36  0  not_hwe  0.001815
ROIno.17.7  chr17  25263007  25336080  73,073  31  492.7  0.66  0.34  0  not_hwe  0.0055264
ROIno.17.10  chr17  36273736  36342794  69,058  11  188.2  0.73  0.27  1  HWE  1
ROIno.17.16  chr17  43661775  44372665  710,890  32  206.8  0.77  0.23  2  HWE  0.31319244
ROIno.19.1  chr19  24513652  24593536  79,884  24  413.1  0.69  0.31  0  HWE  0.05447036
ROIno.19.3  chr19  27830637  27880434  49,797  24  562.3  0.92  0.08  0  HWE  1
ROIno.20.1  chr20  25823432  25992339  168,907  28  290.1  0.7  0.3  7  HWE  1
ROIno.20.2  chr20  26050000  26319569  269,569  29  252.3  0.93  0.07  4  HWE  1
ROIno.21.2  chr21  10365977  10647896  281,919  28  319.2  0.73  0.27  0  HWE  0.13809485
ROIno.21.4  chr21  10772088  11001775  229,687  28  613.9  0.64  0.36  4  not_hwe  0.0044612
ROIno.21.6  chr21  14338130  14456062  117,932  27  364.6  0.74  0.26  0  HWE  0.63798588
ROIno.21.7  chr21  15355027  15439142  84,115  25  249.7  0.94  0.06  2  HWE  1
ROIno.22.1  chr22  16050001  16434781  384,780  26  114.4  0.38  0.62  2  not_hwe  0.0025367
ROIno.22.3  chr22  16870823  17053135  182,312  26  170.0  0.9  0.1  0  HWE  1
ROIno.22.4  chr22  18713650  18872558  158,908  20  169.9  0.55  0.45  3  HWE  0.16947542
ROIno.22.5  chr22  20361245  20509431  148,186  4  74.2  0.62  0.38  3  HWE  0.42857143
ROIno.22.7  chr22  21463843  21775876  312,033  19  80.1  0.66  0.34  18  not_hwe  0.0474301
ROIno.22.8  chr22  21793009  21812131  19,122  3  784.4  0.5  0.5  10  HWE  1
ROIno.X.4  chrX  36626542  37098256  471,714  30  154.8  0.94  0.06  0  not test  n/a
ROIno.X.12  chrX  49019926  49120478  100,552  23  238.7  0.94  0.06  8  not test  n/a
ROIno.X.15  chrX  51815016  51920116  105,100  17  190.3  0.63  0.37  0  not test  n/a
ROIno.X.22  chrX  62472719  62509182  36,463  9  438.8  1  0  2  not test  n/a
ROIno.X.30  chrX  103242611  103305081  62,470  17  256.1  0.17  0.83  28  not test  n/a
ROIno.X.32  chrX  119221019  119284472  63,453  13  267.9  0.71  0.29  8  not test  n/a
ROIno.X.36  chrX  134291578  134348132  56,554  13  300.6  0.52  0.48  6  not test  n/a
ROIno.X.40  chrX  140202636  140562362  359,726  29  136.2  0.88  0.12  3  not test  n/a
ROIno.X.42  chrX  148764498  148802474  37,976  9  395.0  0.25  0.75  0  not test  n/a
ROIno.X.48  chrX  152415970  152516218  100,248  14  179.6  0.67  0.33  14  not test  n/a
ROIno.Y.16  chrY  23729128  23901428  172,300  21  168.3  0.9  0.1  0  not test  n/a
ROIno.Y.19  chrY  24355696  24525420  169,724  17  153.2  0.88  0.12  0  not test  n/a

* manually placed by

4.3 Discussion

Here I describe a new framework to explore the structural rearrangements in a mixed population of single human cells. I illustrate how bioinformatic approaches can be applied to rapidly flag regions that contain recurrent template strand state changes in a dataset of Strand-seq libraries, and how to further characterize these elements based on genotypes and frequencies. Using this framework, I located potential reference assembly errors and new underrepresented repetitive elements in the human genome. I then characterized a rich genomic map of polymorphic inversions, showcasing the extent of structural heterogeneity in a normal population of human cells. This demonstrates a robust, non-targeted approach to exploring the distribution and frequency of an important class of rearrangement that has previously been difficult to visualize, and we expect it will greatly contribute to the structural variation field.

For accurate genomic studies, it is crucial to have accurate reference assemblies. We have previously shown that even highly-sequenced assemblies contain misoriented regions that can be located and corrected by Strand-seq [109, 110]. In this study, I identified 46 AWC regions that point to repetitive sequences not yet sequenced in the reference assembly, along with 24 minor alleles or misorientations that mark sequences inaccurately oriented in the assembly. I found half of these fragments overlapped inversions listed in the DGV, which highlights that errors in genome assemblies can appear as SVs using conventional techniques, but are more accurately annotated using our approach. A third of the misorients were found within contigs of contiguous sequence, reflecting how the complex and repetitive composition of the human genome poses great challenges in orienting sequence and annotating meaningful variants in normal and diseased populations.
Indeed, while 15 regions were misoriented in every cell analyzed, it is possible the genomes used to build the reference assembly harbored a rare allele at the location, justifying the current orientation in the assembly. Nevertheless, it is important to identify and annotate these potential assembly errors for disease association studies. For instance, the largest misoriented fragment (almost 2 Mb) encompasses several NBPF genes associated with neuroblastoma [136-138], potentially having implications on which NBPF paralogs are associated with the disease, and therefore which to use as cancer biomarkers. The recent advancement of long-read single molecule sequencing [57] can help further refine the sequence and orientation at these complex genomic regions. However, long-read technology is unlikely to replace the rapid, high-throughput discovery of genomic variants that Strand-seq now enables. Moving forward, it is most likely that integrative technologies will be required to fully resolve the complexity of the human genome.

By investigating the heterogeneous loci in our pooled donor population, I mapped and genotyped 111 polymorphisms in 47 individuals simultaneously. This strategy is higher-throughput and more comprehensive than traditional targeted inversion studies [56, 77, 143]. Indeed, other sequencing efforts have predicted over 3000 inversions, of which only 85 have yet been validated [73]. Here I offer supporting evidence for an additional 71 inversions listed in the DGV. By mapping events on the chromosomal level, I was able to build a comprehensive genomic map of variant loci, and identify genomic regions that contain clusters of polymorphic inversions, including two 20 Mb domains on chr7 (p12.2-q21.11) and chr16 (p13.2-q11.2). These domains correspond to genomic locations predicted to recombine [144], and may represent hotspots for structural variation in the human genome. Strikingly, I did not observe any inversions on chr13 or chr18.
Previous studies have similarly reported few or no inversions mapping to these chromosomes [38, 145], which suggests that genomic rearrangements may be suppressed.

Finally, I clustered individual cells by their unique set of inversions, demonstrating a powerful single cell method to study inversion haplotypes. Our cluster analyses revealed i) the historical relationship of inversions that clustered based on definable haplotypes, and ii) the relatedness of individual cells that clustered based on shared inversion profiles. It is important to note that, because WC chromosomes mask homozygous inversions, the inversions found in each individual cell do not represent the complete set of inversions for that individual. Nevertheless, this illustrates how rearrangement profiles can be used to predict relationships between individual cells, for instance in heterogeneous tumor samples to follow disease evolution and progression. It is now possible to take a sample of single cells from specific demographics or tissue types and rapidly characterize the number and genotype of inversions in the sample to compare how inversion profiles differ between them.

4.4 Conclusion

In this Chapter, I demonstrate how generating multiple single cell libraries from a mixed population of cells offers a new opportunity to study inversions in human populations. Until now, technical and computational limitations have forced others to target single inversions for population studies (e.g. inversions on chr8 [74, 75] and chr17 [77, 78]), or laboriously map inversions in small, targeted datasets [13, 36, 37, 56]. Consequently, our list of polymorphisms represents one of the most comprehensive records of inversions generated for a normal human population. This has allowed us to record the allelic frequencies and genomic distributions of polymorphic inversions with greater detail than previously possible.
I predict these results will prove an invaluable resource for future studies looking at polymorphisms in other defined populations.

Chapter 5 | Defining the complete inversion profile of an individual to study their unique invertome

“The whole is greater than the sum of its parts” – Aristotle

Chapter synopsis: I performed a comprehensive analysis of the entire complement of inversions found for two individuals to study their unique invertomes. By merging the strand information of multiple cells together I generated a high-density directional composite file that allowed the structural variation of every chromosome to be explored at unprecedented resolution. I used this composite file to finely map and characterize the entire set of inversions in a genome, and for the first time generated an invertome of an individual. By building two invertomes I undertook an in-depth investigation of how inversion profiles differ between individuals, and described the chromosomal architecture of inversions. Finally, merging the invertome data with the population datasets I produced a non-redundant list of inversions to study the sequence composition of breakpoints. In so doing I describe a novel approach for mapping invertomes, which I predict will be a valuable tool in personalized medicine.

5.1 Introduction

The ultimate goal of most genetic studies is to understand how our genomes inform our biology, so we can anticipate disease risks and plan targeted prevention and treatment strategies [8]. By characterizing the genetic backgrounds of patients, personalized medicine aims to predict disease progression, severity and treatment response to provide highly individualized healthcare [19]. To realize this goal, we need validated diagnostic, prognostic and predictive biomarkers that can be used to facilitate clinical decision-making at every stage of patient care [20].
To build a meaningful repertoire of biomarkers, genetic association studies can be employed that discover candidate risk factors for experimental testing and validation [2, 19, 20]. But before we can fully understand how variants alter biology, we need to study and annotate all types of variation in human genomes. For this, we need simple and reliable tools to rapidly screen human populations for polymorphic loci and build sufficient databases of annotated variation [21]. While previous work has successfully built comprehensive databases of copy number changes, cataloging copy-neutral rearrangements, such as inversions, is far less developed [3, 39, 73].   Inversion profiles can serve as important biomarkers for personalized medicine. For instance, specific inversions have been directly associated with an increased risk of hemophilia A [81], Hunter syndrome [82], and muscular dystrophy [83], and can be used as diagnostic biomarkers. Other inversions can help predict clinical outcomes and serve as prognostic biomarkers (e.g. inv(16)(p13.1q22) shows more favorable outcomes in acute myeloid leukemia (AML) patients compared to inv(3)(q21q26.2) [90]), and can inform predictive biomarkers (e.g. inv(2)(p23) can guide treatment plans in non-small cell lung cancer (NSCLC) patients [91]). These examples clearly highlight the clinical relevance of knowing the inversion profile of patients, and highlight direct applications to personalized medicine. Currently, the standard approach to studying inversions is to look at them in isolation and consider a single polymorphism, often by FISH [21], in a single population of interest, without considering the other variants present in each person [56, 75, 77, 78, 84]. 
Relying on highly targeted and crude cytogenetic approaches means these studies are unable to characterize the complete inversion profile of patients, consider submicroscopic inversions, or test how sets of different inversions may collectively contribute to pathology or treatment outcome.

Using directionality to visualize changes in homologue orientation offers a new opportunity to study structural rearrangements in human genomes. In Chapter 4, I demonstrated how this is used to explore structural variation in a heterogeneous sample of cells to study the distribution and frequency of variant alleles in a population. Here, I will illustrate how to map inversions genome-wide by sequencing and analyzing multiple cells from the same individual, in order to build a comprehensive inversion profile (i.e. an ‘invertome’) for that person. By describing the invertomes of two individuals, I show that inversion profiles are highly unique and can be used to predict ancestry and disease risk, facilitating a personalized approach to healthcare. I also explore the genomic architecture of inversion breakpoints, at the chromosomal and base pair level, to better characterize shared features of intrachromosomal rearrangements. This new method for annotating rearrangements in individual genomes offers an opportunity to test sets of inversions and explore how they act together to impact our biology.

5.2 Results

5.2.1 Characterizing the set of inversions present in an individual’s genome to define their invertome

I have shown how analyzing multiple Strand-seq libraries from a mixed donor sample can be applied to investigate the distribution and frequency of inversions in any given population (Chapter 4). However, WC chromosomes mask homozygous inversions, and therefore the variants found in any individual cell do not represent the complete profile for that genome.
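The consequence of WC masking for a single cell can be put in rough numbers. The sketch below assumes random template strand segregation, so that a given autosome is inherited as WW or CC (and is therefore informative) with probability 0.5 in each cell, independently across cells and chromosomes; it ignores chromosomes later excluded for SCEs. The function names and these simplifications are assumptions made for illustration only.

```python
def prob_chromosome_covered(n_cells, p_informative=0.5):
    """Probability that at least one of n_cells inherits a given chromosome
    in an informative (WW or CC) state."""
    return 1.0 - (1.0 - p_informative) ** n_cells

def prob_all_covered(n_cells, n_chromosomes=22, p_informative=0.5):
    """Probability that every one of n_chromosomes is informative in at least
    one cell, treating chromosomes as independent."""
    return prob_chromosome_covered(n_cells, p_informative) ** n_chromosomes

single_cell = prob_chromosome_covered(1)   # one library: half the autosomes
ten_cells = prob_all_covered(10)           # ten libraries from one individual
```

With one cell, each autosome has only an even chance of being informative, whereas a modest number of libraries from the same individual makes complete coverage of the autosomes very likely under these assumptions.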
To map the entire set of inversions present in an individual, I reasoned it would be necessary to analyze multiple Strand-seq libraries from that person. This would allow me to extract variant calls from the informative chromosomes and compile them into a genome-wide inversion map. To do this, I generated 140 Strand-seq libraries from the BM of a single adult male (see Methods Chapter 7, Section 7.2 for details), sequenced them on an Illumina HiSeq platform, and aligned the reads to the GRCh37/hg19 reference assembly (libraries HsSs_0001-HsSs_0140 in Appendix A). After filtering for read mapping quality (q > 10) and duplicates, the final densities of each library ranged from 4 - 300 reads/Mb, with a median of 53 reads/Mb (Fig. 5-1a). With this genomic coverage, template strand inheritance patterns were clear. Each library was examined independently to identify the chromosomes inherited as WW or CC and without SCEs (as described in Methods Chapter 7, Section 7.3.1), which were selected for further analysis (Table 5-1, male). For each autosome, between 39 and 77 Strand-seq libraries met the selection criteria as informative. I observed a total average ratio of 0.20 WW and 0.21 CC chromosomes, which is near the expected random segregation pattern for a diploid (i.e. 0.25 WW, 0.5 WC, and 0.25 CC), with the difference arising from filtered chromosomes containing SCEs. With these chromosomes selected, I set out to map the inversion profile for this individual.

Table 5-1 | Informative chromosomes used to build invertomes
Total number of chromosomes (chr) inherited as either WW (Watson-Watson) or CC (Crick-Crick) from the male and female datasets and used to build the respective invertomes. The cumulative total was used to calculate the proportion of informative chromosomes from each dataset. Asterisks mark monoploid chromosomes that can only be inherited as either W or C.
       male                           female
chr    CC     WW     total   prop.    CC     WW     total   prop.
1      32     27     59      0.42     16     20     36      0.34
2      26     36     62      0.44     21     19     40      0.38
3      24     21     45      0.32     25     27     52      0.49
4      21     26     47      0.34     21     26     47      0.44
5      30     25     55      0.39     23     22     45      0.42
6      23     29     52      0.37     24     21     45      0.42
7      23     22     45      0.32     25     22     47      0.44
8      19     20     39      0.28     28     25     53      0.50
9      23     32     55      0.39     21     17     38      0.36
10     24     34     58      0.41     28     34     62      0.58
11     32     32     64      0.46     22     19     41      0.39
12     23     25     48      0.34     26     30     56      0.53
13     27     31     58      0.41     16     25     41      0.39
14     30     33     63      0.45     20     19     39      0.37
15     30     29     59      0.42     26     30     56      0.53
16     27     33     60      0.43     29     30     59      0.56
17     33     38     71      0.51     27     23     50      0.47
18     27     27     54      0.39     25     24     49      0.46
19     39     38     77      0.55     20     28     48      0.45
20     34     27     61      0.44     14     16     30      0.28
21     29     32     61      0.44     26     23     49      0.46
22     34     33     67      0.48     29     27     56      0.53
X      60*    64*    124*    0.89     19     22     41      0.39
Y      70*    67*    137*    0.98     -      -      na      na

Figure 5-1 | Overall read densities of Strand-seq libraries used to build invertomes
Plotted are read densities of the individual Strand-seq libraries derived from the a) adult male bone marrow sample (n = 140), and b) female cord blood sample (n = 120). Total reads/megabase (Mb) were calculated for each library by aligning unique reads to the reference genome (GRCh37/hg19) and filtering them for a mapping quality greater than ten (q > 10).

To map inherited rearrangements in this genome, I generated high-density directional composite files of each chromosome. For this, all the reads from the informative chromosomes of each library (i.e. chromosomes that were not WC and did not harbor an SCE) were merged together into a WW- or CC-file. Then, the reads from the WW-file were reverse complemented by flipping all ‘+’ reads to ‘-’ reads, and all ‘-’ reads to ‘+’ reads. This reverse complemented file was then merged with the CC-file to generate a single large composite file. As illustrated for chr8 (Fig. 
5-2), this strategy increases read depth and preserves template strand directionality of the data, yielding a single file of each chromosome. Indeed, read depths increased by 37.7 to 66.6-fold for each chromosome, when compared to the average depths of the unmerged libraries (Fig. 5-5, blue triangles). Importantly, the composite file assumes all cells derived from the dataset have the same inversion profile, and thus represents the consensus of all the SVs found in these cells. It effectively removes any low-frequency events and eliminates possible inter-cell heterogeneity. However, by merging the data of multiple libraries into a single file, any noise attributed to spurious background reads in the individual libraries is reduced, and the number of reads supporting the consensus variants is simultaneously increased. Therefore, this strategy generates an ideal dataset to study and locate stable, inherited inversions.

Figure 5-2 | Strategy for generating a high-density directional composite file
Composite files from all the informative single cell Strand-seq libraries were generated for each individual chromosome, as illustrated here for chr8 from the male donor. Chromosomes inherited as either WW or CC were selected and all reads were agglomerated into a large WW- or CC-file for the chromosome. 
Reads in the WW-file were reverse complemented by changing all ‘+’ reads to ‘-’ reads, and all ‘-’ reads to ‘+’ reads. This reverse-complemented file was then merged with the CC-file to generate a large composite file with preserved strand directionality. Reads were re-colored (light and dark blue) to reflect how they no longer represent W or C template strands. The average read depth of the single Strand-seq libraries for chr8 was 71.5 reads/megabase (Mb), whereas the final read depth of the chr8 composite file was 2,701 reads/Mb, a 38-fold increase. Watson (W); Crick (C).

With a single composite file for each chromosome generated, I performed a genome-wide analysis of the inversions present within the male. I used Invert.R (described in detail in Chapter 3) to calculate the W/C ratios of the composite files. Here I applied a more stringent mapping quality score (q > 20) and a larger bin size (bin = 250 reads) to account for the higher read depths of each file and to reduce the number of false-positive calls. Invert.R identified 132 ROIs where W/C ratios dipped below the threshold due to a change in template strand state (Fig. 5-3). The histograms of each composite file illustrate the genomic locations where strand directionality is suggestive of a genomic rearrangement, with the Invert.R-predicted locations of each ROI flagged (Fig. 5-3, red bars). The improved read depths of the composite files allowed Invert.R to better resolve intricate strand state changes present within complex regions of the genome (e.g. compare the peri-centromeric region of chr9 in Fig. 4-3 and Fig. 5-3, arrowheads). While many of the ROIs likely represented an inherited inversion, I set out to curate these data in order to exclude false-positive calls. Recall, I previously located 46 loci in our population study that always appear WC in Strand-seq libraries and are predicted to be underrepresented repetitive elements in the reference assembly (see Chapter 4, Section 4.2.3 and Fig. 4-8). Also, reads falling within PARs can represent false-positive calls in male genomes (as discussed in Chapter 4, also see p-arm tip of chrX in Fig. 5-3). Finally, we cannot predict the template strand state of genomic regions falling within reference assembly gaps. Therefore, I refined the ROIs by removing genomic ranges overlapping AWC regions, PARs, and reference assembly gaps (see Methods for details). Note, loci flagged as minor alleles or misorients (see Chapter 4, Section 4.2.2) were not subtracted, as these may represent polymorphic alleles in the human genome.

Figure 5-3 | Putative inversions in an adult male, as predicted by Invert.R
Histograms of W/C ratios generated by Invert.R (bin = 250), for the merged composite files of an adult male donor. The composite files were generated from 140 Strand-seq libraries derived from a male bone marrow sample. Sequencing reads are shown as lines above each histogram with the direction indicated by color (forward, ‘+’ in blue; reverse, ‘-’ in orange). Note some reads in the composite file were reverse complemented. The line in the histogram represents the W/C ratio at the given genomic location, and a change in W/C ratio represents a change in strand orientation, which is indicative of a stable and inherited genomic rearrangement. Locations where W/C ratios dip below a threshold (horizontal gray line) are highlighted by the red bars (n = 132) below each histogram, with the Invert.R predicted breakpoints shown with a dotted red line. Reference assembly sequence gaps are shown as gray bars above each histogram. Arrowheads mark intricate strand state switches mentioned in text (Section 5.2.1).

Upon refining the ROI file generated by Invert.R, there were 245 loci that may contain an inherited inversion. Since the original list was fragmented by reference assembly gaps, this number was increased from the original file, especially at poorly assembled regions of the reference (e.g. 
near the centromeres of chr1 and chr9, which are highly gapped and evident in Fig. 4-4). To verify that these loci represented true rearrangements in the male genome, I genotyped each region by statistically testing whether it contained ratios of ‘+’ and ‘-’ reads that supported either a heterozygous or homozygous allele, using three Fisher Exact tests that determined the best-fit genotype (as described in Chapter 4, Section 4.2.2). Since the composite files were merged from multiple single libraries, a higher degree of background (10%) was permitted for ‘pure’ WW or CC calls, and an increased minimum number of reads (minReads = 100) was required to make a call. The high read density of the composite file also allowed for better breakpoint mapping with improved confidence, as the predicted location represents a consensus of all the merged cells.

Figure 5-4 | The invertome of an adult male
The genomic distribution of the complete set of inversions identified in a male bone marrow sample. The name of each inversion is listed to the right, and the associated coordinates can be found in Table 5-2. Inset shows the cumulative frequency of the size range of inversions (in base pairs; bp). New inversions that are not listed in the Database of Genomic Variants (DGV) are represented by squares, whereas those overlapping with known inversions listed in the DGV are represented by circles. The inversions show an even distribution of sizes, with the median size (red dotted line) indicated. The grey box marks 2 megabases (Mb), which is the limit of detection using traditional cytogenetics approaches. Also note how the new inversions missing from the DGV represent the entire range of inversion sizes, indicating that previous methods are not as robust as Strand-seq for mapping even large variants.

This identified 86 high-confidence loci that exhibited a localized reorientation of strand states indicative of an inversion; 48 (55.8%) were heterozygous and 38 (44.2%) were homozygous (Table 5-2). This inversion number coincides well with previous reports of inversions using alternative SV detection methods [36-38]. The inversions I identified totaled 34.4 Mb (1.11%) of the genome (Fig. 5-4), and showed a continuous size distribution that ranged from 1,750 bp to 4.0 Mb, with a median size of 197 kb (Fig. 5-4, inset), which corroborates the findings from the population study (see Chapter 4, Section 4.2.4). The majority (97%) were smaller than 2 Mb (Fig. 5-4, inset grey box), which marks the limit of detection for cytogenetic techniques commonly used to study inversions [39, 108]. Thirty-eight did not overlap with inversions listed in the DGV (Table 5-2, and Fig. 5-4, inset), suggesting they represent new variants not previously reported. Five loci showed exceptionally high read densities and were heterozygous (Table 5-2, bold), possibly marking new AWC regions missed in the population study. All loci corresponding to minor alleles and misorients identified in the population study were homozygous inverted in this individual (see Chapter 4; Section 4.2.2), supporting the hypothesis that these represent reference assembly errors. Taken together, this illustrates a novel approach of merging multiple Strand-seq libraries together to locate all variants in an individual genome. Using directional composite files, we can map the complete set of inversions in a genome and define its specific invertome. Now that we can rapidly build invertomes, we can begin to study how combinations of inversions differ between individuals, and how they act together to inform phenotypes.

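The composite-file strategy described in this section can be illustrated with a minimal Python sketch. It is not the software used in this work: the `Read` record, the `build_composite` function, and the `(state, has_sce, reads)` tuple layout are hypothetical names introduced only for illustration, and the sketch assumes each library has already been assigned a strand state and an SCE status for the chromosome. The toy data further assume that reference-oriented reads from a CC chromosome map to ‘+’. Chromosomes inherited as WC or containing an SCE are skipped, reads from WW chromosomes have their strand flipped, and the result is pooled with the CC reads.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class Read:
    """Hypothetical minimal read record: position and mapped strand."""
    pos: int
    strand: str  # '+' or '-'


def flip_strand(read: Read) -> Read:
    """Swap '+' and '-' so a WW read has the same directionality as a CC read."""
    return Read(read.pos, "-" if read.strand == "+" else "+")


def build_composite(
    chrom_libraries: Iterable[Tuple[str, bool, List[Read]]]
) -> List[Read]:
    """Merge one chromosome across libraries into a directional composite.

    Each entry is (state, has_sce, reads), where state is 'WW', 'CC' or 'WC'.
    """
    composite: List[Read] = []
    for state, has_sce, reads in chrom_libraries:
        if has_sce or state not in ("WW", "CC"):
            continue  # uninformative: WC inheritance or an SCE on this chromosome
        if state == "WW":
            reads = [flip_strand(r) for r in reads]
        composite.extend(reads)
    return sorted(composite, key=lambda r: r.pos)


# Toy data (assumption: reference-oriented CC reads map to '+').
# Positions 200 and 250 stand in for reads inside a homozygous inversion.
libraries = [
    ("CC", False, [Read(100, "+"), Read(200, "-")]),
    ("WW", False, [Read(150, "-"), Read(250, "+")]),
    ("WC", False, [Read(120, "+"), Read(130, "-")]),  # skipped: WC
    ("CC", True, [Read(110, "+")]),                   # skipped: SCE
]
composite = build_composite(libraries)
strands = [(r.pos, r.strand) for r in composite]
```

In this toy example the reference-oriented reads from both informative libraries end up on the same strand of the composite, while the reads inside the simulated inversion end up on the opposite strand, which is the signal a W/C ratio scan would then pick up.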
Table 5-2 | Inversions found in the adult male
Genomic coordinates and metrics associated with each inversion found in the composite files of 140 merged bone marrow Strand-seq libraries. Chromosome (chr); Heterozygous (HET); Homozygous (HOM); Megabase (Mb); Region of interest (ROI); Database of Genomic Variants (DGV). Genotype is the best-fit genotype; total reads and reads/Mb are measured at the ROI.

Name       Chr    Start      End        Size (bp)  Genotype  Total reads  Reads/Mb   DGV hits
mBM.1.1    chr1   10000      177417     167,417    HET       106          633.15     0
mBM.1.2    chr1   85980147   86000284   20,137     HET       182          9038.09    0
mBM.1.3    chr1   120747156  120936695  189,539    HOM       376          1983.76    0
mBM.1.4    chr1   121478637  121485434  6,797      HET       1480         217743.12  0
mBM.1.5    chr1   143644525  143771002  126,477    HOM       117          925.07     0
mBM.1.6    chr1   145368224  145833118  464,894    HOM       1733         3727.73    6
mBM.1.7    chr1   146303299  148026038  1,722,739  HOM       4955         2876.23    19
mBM.1.8    chr1   206072707  206332221  259,514    HOM       1044         4022.90    0
mBM.2.1    chr2   92318880   92320630   1,750      HET       359          205142.86  0
mBM.2.2    chr2   96075217   96247952   172,735    HET       469          2715.14    7
mBM.2.3    chr2   110491157  111124224  633,067    HOM       803          1268.43    4
mBM.2.4    chr2   131217800  131405182  187,382    HET       268          1430.23    3
mBM.3.1    chr3   195269676  195678765  409,089    HET       1499         3664.24    11
mBM.4.1    chr4   10050      68023      57,973     HET       608          10487.64   0
mBM.4.2    chr4   49070357   49162130   91,773     HET       2404         26195.07   0
mBM.5.1    chr5   10214      12031      1,817      HET       144          79251.51   0
mBM.5.2    chr5   68844759   69662029   817,270    HET       130          159.07     2
mBM.6.1    chr6   220062     371646     151,584    HET       1457         9611.83    0
mBM.6.2    chr6   61880166   62128589   248,423    HOM       524          2109.31    0
mBM.6.3    chr6   157609467  157641300  31,833     HOM       185          5811.58    0
mBM.7.1    chr7   54301647   54386481   84,834     HET       227          2675.81    18
mBM.7.2    chr7   62826571   63041301   214,730    HET       384          1788.29    2
mBM.7.3    chr7   64335398   65030838   695,440    HET       1705         2451.69    6
mBM.7.4    chr7   72014919   74715724   2,700,805  HET       7362         2725.85    3
mBM.7.5    chr7   142098195  142276197  178,002    HOM       914          5134.77    0
mBM.7.6    chr7   143397897  143572112  174,215    HET       246          1412.05    17
mBM.7.7    chr7   143894270  144037723  143,453    HET       190          1324.48    28
mBM.8.1    chr8   8034631    12038531   4,003,900  HOM       13738        3431.15    18
mBM.9.1    chr9   40475834   40940341   464,507    HOM       472          1016.13    0
mBM.9.2    chr9   42663955   43213698   549,743    HET       290          527.52     0
mBM.9.3    chr9   43313698   43946569   632,871    HET       923          1458.43    0
mBM.9.4    chr9   43996569   44432610   436,041    HOM       661          1515.91    7
mBM.9.5    chr9   44958293   45250203   291,910    HOM       205          702.27     0
mBM.9.6    chr9   45865521   46216430   350,909    HET       140          398.96     0
mBM.9.7    chr9   47160133   47317679   157,546    HET       253          1605.88    0
mBM.9.8    chr9   65467679   65918360   450,681    HET       107          237.42     0
mBM.9.9    chr9   65968360   66192215   223,855    HET       117          522.66     0
mBM.9.10   chr9   66242215   66404656   162,441    HOM       188          1157.34    0
mBM.9.11   chr9   66664195   66863343   199,148    HET       745          3740.94    5
mBM.9.12   chr9   66913343   67107834   194,491    HET       110          565.58     6
mBM.9.13   chr9   68664181   68838946   174,765    HOM       316          1808.14    1
mBM.9.14   chr9   68988946   69278385   289,439    HET       639          2207.72    4
mBM.9.15   chr9   69328385   70010542   682,157    HET       993          1455.68    4
mBM.9.16   chr9   70556535   70735468   178,933    HOM       220          1229.51    3
mBM.10.1   chr10  42409824   42546687   136,863    HOM       679          4961.17    0
mBM.10.2   chr10  42596687   42600289   3,602      HET       999          277345.92  0
mBM.10.3   chr10  48105707   49095536   989,829    HOM       2352         2376.17    8
mBM.10.4   chr10  51448845   51725945   277,100    HOM       420          1515.70    6
mBM.11.1   chr11  1915109    1937247    22,138     HOM       110          4968.83    32
mBM.11.2   chr11  50072272   50295697   223,425    HET       717          3209.13    2
mBM.11.3   chr11  51090853   51594205   503,352    HOM       1522         3023.73    0
mBM.12.1   chr12  60114      95739      35,625     HET       174          4884.21    0
mBM.12.2   chr12  17922515   18013878   91,363     HOM       168          1838.82    32
mBM.14.1   chr14  19005780   19109507   103,727    HET       549          5292.74    0
mBM.14.2   chr14  19736988   20409825   672,837    HET       1937         2878.85    10
mBM.15.1   chr15  22262114   22425344   163,230    HET       588          3602.28    20
mBM.15.2   chr15  22646193   23469603   823,410    HOM       1840         2234.61    10
mBM.15.3   chr15  84778218   84984473   206,255    HET       599          2904.17    6
mBM.16.1   chr16  14703644   15016162   312,518    HET       507          1622.31    27
mBM.16.2   chr16  15124107   15457608   333,501    HOM       630          1889.05    4
mBM.16.3   chr16  21370868   22545103   1,174,235  HOM       3223         2744.77    31
mBM.16.4   chr16  28269860   28690949   421,089    HET       1114         2645.52    0
mBM.16.5   chr16  32741076   33046695   305,619    HET       732          2395.14    4
mBM.16.6   chr16  34173150   35285801   1,112,651  HOM       3546         3186.98    8
mBM.16.7   chr16  46406558   46439463   32,905     HET       2456         74639.11   87
mBM.16.8   chr16  75223915   75258018   34,103     HET       162          4750.32    40
mBM.17.1   chr17  18312355   18360070   47,715     HET       113          2368.23    11
mBM.17.2   chr17  21196464   21250207   53,743     HET       1043         19407.18   0
mBM.17.3   chr17  25263006   25319012   56,006     HET       586          10463.16   0
mBM.17.4   chr17  36255435   36349527   94,092     HET       247          2625.09    1
mBM.20.1   chr20  29419569   29580277   160,708    HOM       976          6073.13    0
mBM.21.1   chr21  10770371   10841257   70,886     HET       1035         14600.91   0
mBM.21.2   chr21  14338129   14395270   57,141     HET       323          5652.68    0
mBM.22.1   chr22  16050603   16697850   647,247    HOM       1413         2183.09    2
mBM.X.1    chrX   62281406   62510549   229,143    HET       656          2862.84    2
mBM.X.2    chrX   149566495  149588947  22,452     HOM       118          5255.66    27
mBM.X.3    chrX   151846672  151932779  86,107     HOM       170          1974.29    0
mBM.X.4    chrX   152327099  152554125  227,026    HET       537          2365.37    14
mBM.Y.1    chrY   5678267    8914955    3,236,688  HOM       15410        4761.04    3
mBM.Y.2    chrY   8964955    9241322    276,367    HOM       1147         4150.28    3
mBM.Y.3    chrY   9291322    9755602    464,280    HOM       894          1925.56    3
mBM.Y.4    chrY   20193885   21031319   837,434    HOM       292          348.68     2
ROIno.Y.5  chrY   22224633   22245544   20,911     HOM       205          9803.45    0
ROIno.Y.6  chrY   23183236   23214596   31,360     HOM       167          5325.26    5
ROIno.Y.7  chrY   23729130   24010452   281,322    HOM       1284         4564.16    0
ROIno.Y.8  chrY   24355698   24525420   169,722    HOM       688          4053.69    0

Figure 5-5 | Increased read densities of directional composite files
The fold increases in reads/megabase (Mb) were calculated for each composite file of the respective chromosome, as compared to the average reads/Mb seen in the single Strand-seq cells for the corresponding chromosomes. The male donor (blue triangles) and female donor (red squares) are shown, with the overall average increase indicated (dotted lines). The number of libraries merged together to generate the composite file is indicated in the table below, and the average depth of all composite files combined is listed on the right.

5.2.2 A side-by-side comparison of two invertomes reveals the unique distribution of inversions in a human genome

By locating changes in high-density directional composite files, I have compiled a comprehensive list of inversions present within a single genome, and for the first time defined their invertome. This revealed that over 1% of the male genome was inverted with respect to the reference assembly, illustrating that extensive human variation exists in inversion profiles alone. To explore this further, I next tested how invertomes differ between individuals. I generated another 106 Strand-seq libraries from the CB of a newborn female (libraries HsSs_0141-HsSs_0246 in Appendix A) and repeated the above analysis. I selected informative chromosomes to generate directional composite files (Table 5-1, female), and ran Invert.R to locate strand state changes (Fig. 5-6), as described in Section 5.2.1. Invert.R predicted 68 ROIs, which I further refined by subtracting AWC regions and assembly gaps; this donor was female, and thus the PARs were not subtracted. Since fewer libraries with lower read depths were merged (Fig. 5-1b), the overall density of the female composite files was approximately half that of the male (Fig. 5-5, red text). To account for this, the refined ROIs were genotyped using a smaller minimum number of reads (minReads = 50) for inclusion in the invertome. This located 32 (53.3%) heterozygous and 28 (46.7%) homozygous alleles representing 60 distinct inversions in the female (Table 5-3), which collectively comprised 23.3 Mb (0.77%) of her genome (Fig. 5-7). The size distribution ranged from 740 bp to 2.15 Mb, with a median of 245 kb, and 24 (40%) did not overlap with inversions listed in the DGV (Fig. 5-7, inset). By rapidly building this second invertome from another donor, I was equipped to perform an in-depth comparison of inversion profiles between these two individuals.

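The ROI refinement step applied to both donors (subtracting AWC regions, assembly gaps and, for the male, PARs) amounts to interval subtraction. The following is a minimal Python sketch under stated assumptions: the function name `subtract_intervals`, the half-open `(start, end)` convention, and all coordinates are hypothetical and chosen for illustration; it does not reproduce the procedure described in the Methods.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open (start, end) on a single chromosome


def subtract_intervals(rois: List[Interval], masks: List[Interval]) -> List[Interval]:
    """Remove every masked range (e.g. gaps, AWC regions) from each ROI.

    An ROI that spans a mask is split into the pieces on either side of it,
    so the refined list can contain more entries than the input list.
    """
    masks = sorted(masks)
    refined: List[Interval] = []
    for start, end in sorted(rois):
        cursor = start  # left edge of the part of the ROI not yet emitted
        for m_start, m_end in masks:
            if m_end <= cursor:
                continue  # mask lies entirely before the remaining ROI
            if m_start >= end:
                break  # mask (and all later masks) lie after the ROI
            if m_start > cursor:
                refined.append((cursor, m_start))  # keep the unmasked piece
            cursor = max(cursor, m_end)
            if cursor >= end:
                break  # the rest of the ROI is masked
        if cursor < end:
            refined.append((cursor, end))
    return refined


# Hypothetical coordinates: one ROI interrupted by two masked ranges.
rois = [(100, 500), (900, 1000)]
masks = [(200, 250), (400, 450), (850, 1100)]
refined = subtract_intervals(rois, masks)
```

Here a single ROI that spans two masked ranges is returned as three fragments, and an ROI lying entirely within a masked range is dropped. This splitting behaviour is why a refined ROI list can be longer than the list it was derived from, as was seen for the male donor (132 ROIs before refinement, 245 loci after).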
Figure 5-6 | Putative inversions in a newborn female, as predicted by Invert.R
Histograms of W/C ratios generated by Invert.R (bin = 250), for the merged composite files of a newborn female donor. The composite files were generated from 106 Strand-seq libraries derived from a female cord blood sample. As in Fig. 5-3, sequencing reads are shown above each histogram with the direction indicated by color. The line in the histogram represents the W/C ratio, and a change in W/C ratio is indicative of an inherited genomic rearrangement. Locations where W/C ratios dip below a threshold (horizontal gray line) are highlighted (red bars; n = 68) and predicted breakpoints are shown (dotted red lines). Above each histogram, gray bars mark reference assembly sequence gaps.

With two complete invertomes assembled, we can now compare the structural composition of these genomes in exquisite detail. To facilitate this, I created descriptive Circos diagrams [146] for each chromosome (Fig. 5-8, and Appendix B). In these diagrams, I included the Invert.R histograms of the W/C ratios for the adult male (blue) and newborn female (pink) composite files, and highlighted all genotyped inversions as light green (heterozygous) or dark green (homozygous). 
Another track highlighted my findings from the population study and marked the polymorphic variants (orange), reference misorients/minor alleles (red), AWC regions (blue), and the sequence gaps in the reference assembly (dark grey). Inversions listed in the DGV were layered on the outside of the ideograms (fuchsia bars), along with RefSeq genes (outer-most grey bars). Finally, I included intra-chromosomal segmental duplications as links, which were subdivided as either non-palindromic (i.e. directly oriented repeats, in grey) or palindromic (i.e. inverted with respect to each other, in purple). These Circos plots were compiled into an interactive ‘clickable’ .pdf file (Appendix B) that facilitated a side-by-side comparison of the two invertomes.

Figure 5-7 | The invertome of a newborn female
The genomic distribution of the complete set of inversions identified in a female cord blood sample. The name of each inversion is listed to the right, and associated coordinates can be found in Table 5-3. As in Fig. 5-4, inset shows the cumulative frequency of the size range of inversions (in base pairs; bp), with inversions not listed in the Database of Genomic Variants (DGV) represented by squares, and known inversions in the DGV shown as circles. The median inversion size is indicated (red dotted line), and the grey box marks the 2 megabase (Mb) limit of detection of traditional cytogenetics.

The chromosomal Circos plots allowed me to visualize both invertomes simultaneously, and study the genotype of loci in the context of the population data, previously reported variants, gene densities and duplicated elements annotated in the human genome. 
For instance, on chr16p11 there was a heterozygous allele in the female and a homozygous allele in the male corresponding to ROIno.16.22 from the population study (discussed in Section 4.2.2), which further supported that this locus represents a rare variant in a misoriented fragment (Fig. 5-8, chr16-detail, asterisk). Just upstream, there was a ~1 Mb heterozygous inversion in the female (fCB.16.5) that partly overlapped with a ~300 kb heterozygous inversion in the male (mBM.16.5) (Fig. 5-8, chr16-detail, arrowhead). The overlap raises questions around the evolutionary history of this region and whether these events represent a single variant with different breakpoints, or two distinct variants (meaning the female may have two inversions here, while the male has one). Overlapping and distinct inversions have been previously reported (for instance, at 15q13.3 [86]), highlighting that complex rearrangements can recur at certain breakpoints. Confident breakpoint mapping at this locus was complicated by the presence of an adjacent AWC region (which itself appears heterozygous) (Fig. 5-8, chr16-detail, blue bar). We also identified structural rearrangements at the centromere of chr9, including three heterozygous inversions located on the p-arm (chr9p13-p11) in the male invertome that did not overlap with any inversions in the female invertome, nor with those listed in the DGV (Fig. 5-8, chr9-detail, arrowheads). While these inversions were within blocks of highly repetitive DNA (Fig. 5-8, internal links), which notoriously confound variant mapping, we see clear structural differences that distinguish the two invertomes, supporting that these are bona fide polymorphisms. 
Building additional invertomes will help better characterize the variants here and reconstruct the recombination events in these complex regions of the genome.

Figure 5-8 | Chromosomal resolution of the genomic features surrounding inversions
Circos plots of select chromosomes with Invert.R histograms (black lines) for the adult male (blue background ring) and newborn female (pink background ring) invertomes, with mapped inversions genotyped as either heterozygous (light green) or homozygous (dark green). Palindromic intra-chromosomal segmental duplications (purple lines) correlate with the inversion load of each chromosome. For instance, chr9 contains 16 and 10 inversions in the male and female, respectively, and harbors many palindromic segmental duplications (purple links). The boundaries of the palindromic duplications correlate with inversion breakpoints (chr9, detail). Note the structural differences of inversions between the two donors within this repetitive and complex region of the genome. Non-palindromic segmental duplications (grey links) are common on chr19, which contains a single inversion in the female genome. This highlights how the frequency and boundaries of inversions correlate with palindromic segmental duplications in each chromosome. Arrowheads mark loci mentioned in text. See Appendix B for interactive Circos plots of all chromosomes.

On chrYp11 a large 3.2 Mb inversion mapped to the male’s invertome (mBM.Y.1). This inversion coincides with one previously reported using single molecule haplotyping, which predicted the inversion at 0.50 allele frequency in twelve males analyzed [143]. However, we did not observe this inversion in any of the 23 males included in our population study, and believe this inversion is much rarer than reported (having a frequency of 0.04). In a different study, the inversion was found in males of Nigerian descent [36]. Since our male donor also has a 4 Mb homozygous inversion on chr8p23 (mBM.8.1) that is enriched in African populations [74, 75], these inversions together allude to a Yoruban ancestry. Strikingly, these two inversions represent 20% of his entire invertome by length, and significantly contribute to the structural composition of his genome. When Strand-seq is applied to mapping inversion profiles in more defined demographics we will be able to test ancestry prediction based on invertomes. 
One interesting application of this would be to build geographical maps of human inversions that show frequencies and co-inheritance patterns, which could then be used to study historical migration patterns and phenotypic associations of inversion combinations. These data can be paired with Y-chromosome, mitochondrial or SNP studies of different populations to augment haplogroup classifications.

In addition to hinting at an individual’s ancestry, combinations of inversions can also be used to predict disease susceptibility. For instance, on chr15q13, the female invertome contained a large (~2 Mb) heterozygous inversion (fCB.15.4) that lay in a gene-rich region (containing 25 genes), was also found in the population (ROIno.15.9), and matched a known variant in the DGV [36, 56] (Fig. 5-8, chr15-detail, arrowhead). Recurrent deletions at this locus are known to cause complex neurological disorders (including autism, epilepsy and schizophrenia), with ~80% penetrance [86, 87]. The predicted 3′ breakpoint confirmed that this inversion disrupted the neurotransmitter receptor gene CHRNA7^15 (the breakpoint was located at chr15: 32,538,672, and fell directly between exons 9 and 10), which is the gene implicated in the disease phenotype [86, 147]. Consequently, this female likely has only one functional copy of the CHRNA7 gene, and any de novo mutation arising in the other copy would render her null for this critical synaptic channel. As the invertome was built from a neonatal cord blood sample, it may be prudent to test for somatic mutations in the other CHRNA7 allele later in her life. Fortunately, the female did not also have an inversion at chr17q21, which is similarly associated with a deletion syndrome that causes neurodevelopmental defects (as discussed in Section 4.2.4). This highlights how multiple loci can be screened to explore the inheritance patterns of sets of inversions simultaneously.
By building more invertomes we can explore how these variants act together to drive specific phenotypes or disease susceptibilities. In this way we predict invertomes will have important implications for personalized medicine.

^15 An acetylcholine receptor, which is a ligand-gated ion channel that mediates fast signal transmission at synapses.

Table 5-3 | Inversions found in the newborn female. Genomic coordinates and metrics associated with each inversion found in the composite files of 106 merged cord blood Strand-seq libraries. Chromosome (chr); Heterozygous (HET); Homozygous (HOM); Megabase (Mb); Region of interest (ROI); Database of Genomic Variants (DGV). Genotype is the best fit; total reads and reads/Mb are measured at the ROI; DGV hits is the number of DGV hits.

Name      Chr    Start      End        Size (bp)  Genotype  Total reads  Reads/Mb   DGV hits
fCB.1.1   chr1   120747156  120936695  189,539    HOM       91           480.11     0
fCB.1.2   chr1   121484694  121485434  740        HET       624          843243.24  0
fCB.1.3   chr1   145368224  145833118  464,894    HOM       516          1109.93    6
fCB.1.4   chr1   146303299  148026038  1,722,739  HOM       1406         816.14     19
fCB.1.5   chr1   206072707  206332221  259,514    HOM       267          1028.85    0
fCB.2.1   chr2   91989505   92312835   323,330    HET       428          1323.72    0
fCB.3.1   chr3   195213242  195659023  445,781    HET       1143         2564.04    11
fCB.4.1   chr4   12383      61273      48,890     HET       245          5011.25    0
fCB.4.2   chr4   49082280   49215572   133,292    HET       1480         11103.44   0
fCB.5.1   chr5   68859472   70188682   1,329,210  HET       94           70.72      8
fCB.6.1   chr6   257966     353162     95,196     HET       394          4138.83    0
fCB.6.2   chr6   61880166   62128589   248,423    HOM       227          913.76     0
fCB.6.3   chr6   157609467  157641300  31,833     HOM       74           2324.63    0
fCB.7.1   chr7   5882689    6849828    967,139    HET       1444         1493.06    5
fCB.7.2   chr7   57320885   58054331   733,446    HET       1010         1377.06    1
fCB.7.3   chr7   62705852   62916549   210,697    HET       64           303.75     2
fCB.7.4   chr7   142098195  142276197  178,002    HOM       510          2865.14    0
fCB.9.1   chr9   40475834   40940341   464,507    HOM       131          282.02     0
fCB.9.2   chr9   43996569   44436285   439,716    HOM       185          420.73     7
fCB.9.3   chr9   44958293   45250203   291,910    HOM       52           178.14     0
fCB.9.4   chr9   47160133   47317679   157,546    HOM       54           342.76     0
fCB.9.5   chr9   66242215   66404656   162,441    HOM       57           350.90     0
fCB.9.6   chr9   66664195   66863343   199,148    HET       167          838.57     5
fCB.9.7   chr9   68664181   68838946   174,765    HOM       112          640.86     1
fCB.9.8   chr9   68988946   69278385   289,439    HET       181          625.35     4
fCB.9.9   chr9   69328385   70010542   682,157    HET       274          401.67     4
fCB.9.10  chr9   70556535   70735468   178,933    HOM       80           447.09     3
fCB.10.1  chr10  42409824   42546687   136,863    HOM       400          2922.63    0
fCB.10.2  chr10  42596687   42611203   14,516     HET       963          66340.59   0
fCB.10.3  chr10  47163314   47419631   256,317    HET       84           327.72     7
fCB.10.4  chr10  48105707   49095536   989,829    HOM       1444         1458.84    8
fCB.10.5  chr10  51448845   51729988   281,143    HOM       286          1017.28    6
fCB.11.1  chr11  50057768   50242891   185,123    HET       135          729.24     2
fCB.11.2  chr11  51090853   51594205   503,352    HOM       369          733.09     0
fCB.12.1  chr12  60392      95739      35,347     HET       140          3960.73    0
fCB.14.1  chr14  19736988   19889441   152,453    HET       213          1397.15    10
fCB.15.1  chr15  20006624   20389552   382,928    HET       476          1243.05    3
fCB.15.2  chr15  22262114   22596193   334,079    HOM       1057         3163.92    20
fCB.15.3  chr15  22646193   23514853   868,660    HOM       961          1106.30    10
fCB.15.4  chr15  30392064   32538672   2,146,608  HET       3328         1550.35    2
fCB.15.5  chr15  84791070   84937863   146,793    HET       218          1485.08    6
fCB.16.1  chr16  14882873   15016162   133,289    HOM       154          1155.38    27
fCB.16.2  chr16  15124107   15469838   345,731    HOM       339          980.53     4
fCB.16.3  chr16  21269506   22545103   1,275,597  HOM       1574         1233.93    31
fCB.16.4  chr16  28483966   28654270   170,304    HET       280          1644.12    0
fCB.16.5  chr16  32741076   33769975   1,028,899  HET       1784         1733.89    4
fCB.16.6  chr16  34173150   34463135   289,985    HOM       528          1820.78    8
fCB.16.7  chr16  34465450   34755351   289,901    HET       743          2562.94    8
fCB.16.8  chr16  34758503   35285801   527,298    HOM       845          1602.51    0
fCB.16.9  chr16  46407896   46435776   27,880     HOM       1229         44081.78   87
fCB.17.1  chr17  21200713   21252727   52,014     HET       404          7767.14    0
fCB.19.1  chr19  24507239   24631782   124,543    HET       153          1228.49    0
fCB.20.1  chr20  29419569   29580277   160,708    HOM       179          1113.82    0
fCB.21.1  chr21  10770371   11012921   242,550    HET       645          2659.25    4
fCB.22.1  chr22  16052016   16697850   645,834    HOM       550          851.61     2
fCB.22.2  chr22  18646713   18804051   157,338    HET       122          775.40     1
fCB.22.3  chr22  20291683   20509431   217,748    HET       151          693.46     3
fCB.X.1   chrX   140158291  140502975  344,684    HET       361          1047.34    3
fCB.X.2   chrX   148650585  148826773  176,188    HET       217          1231.64    0
fCB.X.3   chrX   152327099  152559647  232,548    HET       188          808.44     14

5.2.3 Exploring the genomic architectural features surrounding inversion breakpoints

I have shown how to rapidly build invertomes to discover SVs genome-wide and in a non-targeted fashion. I immediately applied this method to build two invertomes and compared the unique inversion profiles of these unrelated^16 individuals. Having compiled comprehensive lists of variants, I am now equipped to explore the architectural features of conserved inversions in the human genome. In the Circos plots I noticed that inversion breakpoints mapped to the same loci as clusters of segmental duplications (Appendix B). As segmental duplications are thought to play a role in genomic rearrangements [86, 139, 144, 148], I investigated the relationship between inversions and repetitive DNA in greater detail. For this, segmental duplications were extracted from the UCSC Genome Browser by downloading the ‘segmental dups’ track from the ‘repeats’ group on the Table browser (GRCh37/hg19). Intrachromosomal duplications (n = 18,859) were identified if the paired sequence fell on the same chromosome, and palindromic repeats were identified when the paired sequence mapped to the complementary strand (Table 5-3). Splitting these repeats by chromosome illustrates that there are clear biases in the repetitive nature of each (Fig. 5-9a), and substantiates previous observations [130].
I compared the percent of bases inverted to the proportion of segmental duplications per chromosome, and found a positive correlation between the two (r² = 0.70, p < 0.001), which was stronger for palindromic (r² = 0.78) than for non-palindromic (r² = 0.66) segmental duplications. This correlation was most evident on chr7, chr9, and chr16, where the polymorphic domains identified in the population study coincide with highly repetitive palindromic blocks of DNA (Fig. 5-8, purple links). These were distinct from the non-repetitive chromosomes, such as chr13 and chr18, on which I did not identify any inversions. Also, chr19 had a sole inversion and was enriched for non-palindromic segmental duplications but had few palindromic ones. These results highlight how the repetitive nature of a chromosome is related to its structural composition, and suggest that duplications impact the propensity of a chromosome to undergo internal rearrangements. These findings support a non-allelic homologous recombination (NAHR) mechanism for inversions [1].

^16 While the identity of the donors was kept confidential, the samples were acquired in different cities and sourced from separate cell banks.

Figure 5-9 | Correlation between palindromic segmental duplications and inversions. a) The percent of bases of each chromosome that are segmental duplications was determined by pulling the repeats from the UCSC Genome Browser, splitting them into palindromic (inverted orientation; orange) and non-palindromic (direct orientation; blue), and then calculating the repetitive bases compared to total chromosome bases. b) A linear regression of the percent of inverted bases per chromosome (for the non-redundant events found in all three datasets, i.e. the pooled cord blood, male invertome, and female invertome) as compared to the percent of palindromic segmental duplications.
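The classification and per-chromosome comparison described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code used in this work: the record fields (`chrom`, `start`, `end`, `other_chrom`, `strand`) are hypothetical names for a segmental duplication table, a `strand` of "-" is assumed to mean that the paired copy lies on the complementary strand, and the toy records are invented for the example rather than taken from the UCSC track or from the results reported here.

```python
from collections import defaultdict


def classify(dup):
    """Label one duplication pair as 'inter', 'palindromic' or 'non-palindromic'."""
    if dup["chrom"] != dup["other_chrom"]:
        return "inter"  # paired copy lies on a different chromosome
    # Assumption: strand "-" means the paired copy is in inverted orientation.
    return "palindromic" if dup["strand"] == "-" else "non-palindromic"


def covered_bases(intervals):
    """Count bases covered by half-open (start, end) intervals, merging overlaps."""
    total, cur_start, cur_end = 0, None, None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end:
            if cur_end is not None:
                total += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start
    return total


def percent_by_class(dups, chrom_sizes):
    """Percent of each chromosome covered by each class of intrachromosomal duplication."""
    by_key = defaultdict(list)
    for d in dups:
        kind = classify(d)
        if kind != "inter":
            by_key[(d["chrom"], kind)].append((d["start"], d["end"]))
    return {key: 100.0 * covered_bases(iv) / chrom_sizes[key[0]]
            for key, iv in by_key.items()}


def r_squared(xs, ys):
    """Squared Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return (sxy * sxy) / (sxx * syy)


# Toy records (hypothetical, for illustration only).
TOY_DUPS = [
    {"chrom": "chrA", "start": 100, "end": 300, "other_chrom": "chrA", "strand": "-"},
    {"chrom": "chrA", "start": 250, "end": 400, "other_chrom": "chrA", "strand": "-"},
    {"chrom": "chrA", "start": 600, "end": 700, "other_chrom": "chrA", "strand": "+"},
    {"chrom": "chrA", "start": 800, "end": 900, "other_chrom": "chrB", "strand": "-"},
]
TOY_SIZES = {"chrA": 1000, "chrB": 1000}

if __name__ == "__main__":
    print(percent_by_class(TOY_DUPS, TOY_SIZES))
```

With real data, the per-chromosome percentages for one duplication class would be passed to `r_squared` together with the percent of inverted bases per chromosome. Overlapping duplications are merged before counting so that a base covered by several repeats is only counted once; whether the original analysis did the same is not stated in the text.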
To explore the architecture of inversion breakpoints in greater detail, I analyzed the sequence composition and surrounding genomic regions of the 257 inversions (137 unique) identified in all our datasets (Fig. 5-10), with the help of Dr. Mark Hills. For this we performed a pairwise self-alignment for each inversion, plus 200 kb of sequence upstream and downstream, and generated a dot plot of every locus (see Methods Chapter 7, Section 7.3.2, and Appendix C). In these dot plots, the horizontal line represented the self-aligned sequence, and any additional lines and dots represented repetitive elements, with palindromic segmental duplications appearing as perpendicular lines and non-palindromic segmental duplications appearing as parallel lines. Reference assembly gaps were depicted on the diagonal axis as black bars, and the inversions of the male invertome (mBM; blue), female invertome (fCB; pink) and pooled donor population (ROIno; green) were highlighted as overlays. Sequence coordinates that were self-aligned are listed above each plot, and inversions listed in the Database of Genomic Variants (DGV) are plotted on the x- and y-axes (purple). Dot plots were compiled into a single .pdf file, allowing for an in-depth exploration of the genomic features of each inversion (Appendix C).

[Figure 5-9 panels: a) intrachromosomal segmental duplications, percent (%) of chromosome, by chromosome; b) inversions versus palindromic segmental duplications (palindromes), r² = 0.7845, with labelled points chr7; chr 9, 15, 16 & Y; chr 12, 13, 18 & 19.]

Figure 5-10 | Distribution of all structural features identified by Strand-seq. A comprehensive genomic map of the structural features and inversion profiles described in this work. The complete set of inversions present in the male (blue circles, right-hand side) and female (pink triangles, left-hand side) donors are represented. Each inversion (plotted using the online tool Idiographica (v2.2) [147]) was genotyped as either heterozygous (empty symbols) or homozygous (filled symbols). The location of Always Watson Crick regions (yellow) and misorients or minor alleles (orange) mapped in the pooled donor population are also depicted. All polymorphic variants identified in the pooled donor data set are depicted in purple.

[Figure 5-10 key: chromosomes 1–22, X and Y; male invertome (heterozygous, homozygous); female invertome (heterozygous, homozygous); reference features: always Watson Crick (AWC), misorient/minor allele, polymorphic variant.]

These self-alignment dot plots allow a base-level analysis of the shared genomic features of inversion breakpoints and illustrate their relationship to reference assembly gaps, segmental duplications (including those not listed in the UCSC Genome Browser track^17), and other repetitive features. The overlays of each dataset highlighted how several inversions overlapped between the different datasets, but breakpoints did not always exactly match (Fig. 5-11, and Appendix C). This suggests multiple non-recurrent rearrangements occurred near the same locus [18]. Several inversions were in highly complex and repetitive regions (Fig. 5-11, i-iii). When I looked at the flanking sequences, I found that 110 (43%) inversions were bordered by reference assembly gaps (Fig. 5-11, v). This prevented a direct consideration of the sequences flanking these variants. However, reference assembly gaps typically represent highly repetitive elements that are difficult to sequence and assemble faithfully [33], suggesting these inversions are flanked by duplicated DNA. Of the remaining 147 inversions, 69 (47%) were flanked by palindromic segmental duplications (Fig. 5-11, iv), and 17 (12%) were flanked by non-palindromic segmental duplications (Fig. 5-11, iii).

^17 UCSC Genome Browser defines a segmental duplication as a stretch of DNA > 1000 bases long that shares > 90% sequence identity with at least one other position in the genome.
It is currently accepted that recombination between direct repeats results in a deletion [17, 149]; finding inversions flanked by direct repeats suggests there may be more flexibility in the system than previously believed. 58 (39%) inversions were not flanked by any repetitive elements and lay in a contiguous stretch of unique sequence (Fig. 5-11, vi), suggesting these variants either i) arose from a different, repeat-independent mechanism, or ii) are historical events that occurred far enough in our past for the flanking repeats to diverge. In order to distinguish these scenarios, larger population studies are required to report the distribution of these inversions and better predict the historical age of each variant. These results demonstrate that a repetitive genomic architecture is not always a feature of inversion breakpoints and that non-recurrent overlapping rearrangements are common in the human genome.

Figure 5-11 | Base pair resolution of the genomic features surrounding inversions. Self-alignment lastz dot plots illustrate the overlap between inversion breakpoints from different datasets, with the male invertome (mBM; blue), female invertome (fCB; pink) and pooled donor population (ROIno; green) highlighted. Each plot represents the genomic architecture of a different class of inversion, which can be: i) within highly complex and repetitive regions, ii) within a repetitive sequence that is palindromic to itself, iii) flanked by non-palindromic segmental duplications, iv) flanked by palindromic segmental duplications, v) flanked by reference assembly gaps (depicted on the diagonal axis as black bars), or vi) in non-repetitive and contiguous sequence. Sequence coordinates that were self-aligned (padded inversion) are listed above each plot, with inversions listed in the Database of Genomic Variants (DGV) plotted on the x- and y-axes (purple bars). Reference assembly gaps are shown as black bars along the diagonal axis, and a genomic size scale bar is shown in the top left. Refer to Supplementary File 2 for dot plots of all inversions in more detail.

[Figure 5-11 plots (both axes: Location (bp)): mBM.9.1, fCB.9.1, ROIno.9.3, ROIno.9.4 at chr9:40275835−41140341 (scale bar 200 kb); mBM.16.5, fCB.16.5, ROIno.16.17, ROIno.16.18, ROIno.16.19 at chr16:32541077−33986667 (300 kb); mBM.5.2, fCB.5.1, ROIno.5.2 at chr5:68644760−70845568 (400 kb); ROIno.11.7 at chr11:89356753−90004254 (100 kb); mBM.6.1, fCB.6.1, ROIno.6.1 at chr6:20063−581214 (100 kb); mBM.1.1, ROIno.1.1 at chr1:1−915919 (200 kb).]

5.3 Discussion

By mapping structural rearrangements in high-density directional composite files, I have developed a novel method for locating inherited, germline inversions with high confidence. This is the first time inversions have been mapped genome-wide using directionality to reliably visualize changes in homologue orientation, offering a new opportunity to study structural rearrangements in human genomes. This non-targeted approach is based on observable characteristics of template strands that allow rearrangements to be located and simultaneously genotyped on a genome-wide scale, offering a major advance to the field. Until now, inversions have been studied in isolation using targeted studies to understand their association with populations and phenotypic consequences.
However, studies of complex disorders (such as diabetes, heart disease, autoimmune diseases, and psychiatric traits) have confirmed that genetic variants act cumulatively to contribute to disease risk, with each variant having only a subtle effect [10]. Therefore, our new framework to map all the inversions in an individual genome offers an important contribution that will allow us to understand how sets of inversions culminate in each of us.  In both invertomes I found a significant portion (0.8 – 1.1%) of the human genome was structurally different from the reference assembly, which is 3 – 4x higher than previously thought [13], showing that extensive genetic heterogeneity exists in inversions alone. We found more inversions in the male adult invertome (n = 86) than in the female newborn (n = 60), which suggests the female is more structurally similar to the genomes used to generate the human reference assembly (GRCh37/hg19). The total inversion numbers we report are in line with previous genome-wide analyses of SVs using paired end signatures that found between 48-98 inversions per genome [36-38]. I was unable to compare genotype distributions however, since these prior studies were unable to genotype their loci, stressing another major advantage of our approach. While the resolution of breakpoint mapping is limited by flanking repetitive sequence elements (which appear as WC or as low read count regions) we are able to more confidently predict locations using composite files, as they have high read coverage and represent the consensus of multiple merged cells. I found that several inversions overlapped but did not match between the invertomes and the pooled donor sample, suggesting they are independent non-recurrent events. If regions of our genome are prone to undergoing multiple historical rearrangements our chromosomes must be far more structurally fluid than we ever imagined. 
It is possible these variants represent new dynamic loci that are prone to rearranging in our genomes, as described for the toggling inversion on chr17q21 [77]. The biological significance of this finding remains to be resolved.

Now that it is possible to generate a complete invertome of an individual, the combined presence of specific inversions can be used to predict ancestry and disease susceptibilities. This offers a new opportunity to test how various inversions operate simultaneously or cooperatively in different populations. To date, studies have focused on single inversions in isolation [56, 75, 77, 132], and have neglected to consider (due to technical constraints) whether groups or pairs of inversions act together (antagonistically or synergistically) to impact phenotypes. Building invertomes can serve as a tool to investigate the combination of inversions present in a single individual, and to compare inversion sets between individuals. This can help us better understand the biological consequences of inversions in our genomes. For instance, if we find two inversions are linked in a population, that may suggest the genes located in each inversion act together to confer an adaptive phenotype to that group, since they would both be independently protected from recombining [63]. Taken together, this highlights how an individual’s invertome can provide important insight into their personalized genome, which has important implications for personalized medicine.

5.4 Conclusion

Here I show how to rapidly construct an invertome and describe a novel method for exploring inversions in individuals. I performed a comprehensive comparison of two invertomes to consider how sets of inversions can be used in combination to reflect an individual’s ancestry and query their disease risks. Until now, inversions have been studied in isolation using targeted studies to understand their association with populations and phenotypes.
However, many studies of complex disorders (such as diabetes, heart disease, autoimmune diseases, and psychiatric traits) have confirmed that genetic variants act cumulatively to control disease risk, with each variant having only a subtle effect [10]. Therefore, our new framework to map all the inversions in an individual genome offers an important contribution. Ultimately, as we build more invertomes and perform population studies we will better understand the mechanistic details, evolutionary importance and phenotypic consequences of inversions in our genomes. The methods presented here will greatly facilitate this venture.

Chapter 6 | General discussion and conclusion

“Biology has at least 50 more interesting years” – Watson

6.1 Overview of findings

The goal of this dissertation was to establish a high-resolution single cell system to study structural variation, and to explore the spectrum of genomic inversions in the human genome. I focused on inversions because this balanced rearrangement has proven challenging to map in the past, and is present in normal human populations, as well as patients with malignancies and complex genomic disorders. I aimed to develop and validate novel frameworks for characterizing inversion profiles in populations of cells or people, and provide required tools to study how chromosome structure impacts our biology and health.

In Chapter 2, I achieved Research Aim 1 and described a reliable new method to manually visualize, map and genotype inversions at unprecedented resolution in single cells. I illustrated how Strand-seq is used to discern the orientation of homologues within a cell, and map genomic rearrangements in each homologue by locating changes in strand orientation. I provided the highest-resolution study of SCEs in normal human blood cells to date, and reported the frequency and distribution of these somatic rearrangements in a sample of 215 cells.
I showed that in Strand-seq libraries, inherited inversions appear as changes in strand orientation that recur at the same genomic location, and on the same homologue, in every cell of the individual. For the first time, I detailed how to manually map inversions using Strand-seq, and narrowed the breakpoint of an inversion on chr7p11 to a 1.7 kb window that disrupted a disease gene (TRIM50). This Chapter laid out the foundation of inversion analysis by Strand-seq; however, as it was not feasible to manually perform these analyses in larger populations, a bioinformatic approach was sought.

I directly addressed this in Chapter 3, and developed novel bioinformatic software, called Invert.R, which was tailored to locate putative inversions in Strand-seq data. This program uses an advanced read-based sliding window binning strategy to count ratios of directional reads and assign local template strand states to each chromosome of a Strand-seq library. I illustrated how Invert.R was designed as a non-targeted tool to rapidly discover and characterize putative inversions in single cells by locating genomic regions where template strand states change. By comparing inversion call sets between libraries, I described how Invert.R also integrates data across multiple cells to characterize recurrent variants in a population. Overall, I validated Invert.R as an unbiased approach to identify inversions in single cells, refine inversion breakpoints across multiple cells, and discover unknown inversions in whole chromosomes. In this Chapter, I accomplished Research Aim 2 and developed an essential bioinformatic package that offers a major advance in Strand-seq analyses, helping transform our sequencing technology into a high-throughput experimental tool.

Employing this high-throughput method in Chapter 4, I performed a population study and characterized the inversion profiles of 47 distinct single cell genomes.
Addressing Research Aim 3, I explored the distribution and frequency of inversions on a genome-wide scale and identified 111 inversions that clustered in polymorphic domains. I found that inversion profiles are highly unique between human genomes and showed how this can be used to explore the relationship between cells in a mixed population. Finally, I generated a valuable resource of normal human inversions that can be used to compare changes that may arise in other populations or in the context of disease. To the best of my knowledge, the data in this Chapter represent the most exhaustive inversion study performed to date, and provide a framework for more comprehensive population-based studies in defined demographics and disease models.

Finally, in Chapter 5 I achieved Research Aim 4, and showcased how to map the complete set of inversions in a genome to define an individual’s invertome. I showed how to merge multiple Strand-seq single cells derived from the same donor to generate composite files and characterize high-resolution and precise invertomes for any individual. These comprehensive maps describe the spectrum of genomic rearrangements within each genome with greater detail and precision than ever before achieved. I then described how combinations of inversions can work together to make predictions of individual phenotypes and disease susceptibilities. By building two invertomes I explored the extent of heterogeneity that is present in inversion profiles and characterized the genomic architecture of breakpoints at the chromosomal and base pair level to describe shared architectural features. This Chapter provided a novel strategy for defining inversion profiles of patients, and has direct implications for personalized medicine and studying rare diseases.

6.2 Emergent themes, implications and limitations

The power of Strand-seq is counterintuitive: we sacrifice sequence depth for insight into genome structure.
Instead of trying to sequence every nucleotide in a sample, we aim to sequence, at best, half the DNA within a cell. This characteristic sets Strand-seq apart from all other sequencing technologies [22, 33, 35, 150]. By sequencing only one strand of a chromosome, we are able to visualize the organization of each homologue and map genomic rearrangements with higher resolution and throughput than conventional approaches. In so doing, we circumvent the need for complete and deep genomic coverage [13, 36-38, 51], we are not limited by fragment length or repetitive regions in the genome [46, 49, 50, 151], and we do not require complicated predictive algorithms to impute variants [53, 55, 124, 126]. Therefore, in the realm of genomic rearrangements, I believe Strand-seq surpasses all other currently available methods and offers an invaluable tool to genomic studies. Although Strand-seq may never achieve the high base pair resolution of deep sequencing technologies, it is tailored to visualize higher-order genomic rearrangements in a rapid and non-targeted fashion. And while emergent long read sequencing technologies have advanced and will continue to advance genomic investigations [57, 152], they cannot currently match the comprehensive size spectrum, the reliability of directly visualizing rearrangements, and the throughput that Strand-seq offers. For these reasons, our approach fills an important niche other techniques cannot, and I am confident it will be readily adopted by many fields.

This is the first time inversions have been mapped and genotyped genome-wide using strand directionality to visualize changes in homologue orientation. My non-targeted approach is based on observable characteristics of template strands. This offers a substantial advantage as it eliminates the need to impute structural rearrangements by searching for split reads or mates in complex and repetitive genomic regions to predict variants with low confidence [1, 14, 32, 73].
Indeed, without previous bioinformatic training, I developed a simple pipeline that predicts rearrangements by merely tracking the number of forward and reverse-aligned reads in a sliding window. In this way, Invert.R predicts changes in template strand orientation without a priori knowledge of the structural composition of the sample, and facilitates rapid and unbiased rearrangement discovery in any Strand-seq library. The power and confidence associated with any given variant call is directly dependent on i) the number of reads supporting the change in strand orientation in the library, ii) the magnitude of change in template strand state (e.g. homozygous inversions are more readily apparent than heterozygous inversions), and iii) the number of independent libraries corroborating the event. In the future, these metrics can be incorporated to assign a confidence score to each call made by Invert.R. Additionally, more integrated approaches can advance our analysis and increase the amount of information we obtain from a library. For example, incorporating split read signatures into our bioinformatic software can more precisely map breakpoint locations in each cell. As more researchers embrace Strand-seq as a powerful tool for mapping structural rearrangements, a more complete catalogue of inversions and other genomic rearrangements will be generated for the human genome, which can be used to cross-reference and support single cell variant calls. Ultimately, this can rapidly advance current efforts to annotate and describe the wealth of genomic variation present in the human genome.

Figure 6-1 | Rapid high-throughput population studies of structural variants. With a new high-resolution and high-throughput approach to map structural rearrangements in single cells, non-targeted and large-scale studies can rapidly be performed. For this, a single cell can be collected from different individuals of a defined population of interest, and these cells pooled together in a single Strand-seq experiment. The genomic distribution and allelic frequency of structural variants, such as inversions, can be immediately mapped and quantified in each person simultaneously, in order to compare the rearrangement profiles between individuals and populations.

[Figure 6-1 panels: genomic distribution across chromosomes 1–22, X and Y; allelic frequency (0 to 1.0) for labels A to E.]

Previously, the absence of a non-targeted, single cell and high-throughput method to genotype inversions at high resolution has precluded in-depth investigations into the frequencies, characteristics and phenotypic consequences of inversion polymorphisms in different human populations and cell types. I present a new framework for mapping inversions in heterogeneous populations, be it individuals or cells. This provides a new opportunity to study inversions on a population-wide scale and investigate the role inversions play in disease susceptibility and genome evolution. For instance, Strand-seq analyses can be performed on different stratified populations to rapidly compare inversion profiles between people from different races, geographic regions or with complex pathologies. By generating libraries from a mixed cell pool comprised of donors from a defined demographic of interest, a comprehensive study of the distribution and frequency of variants can be rapidly obtained for a population (similar to the analysis described in Chapter 4) (Fig. 6-1). Having 576 indexing barcodes available means we can pool hundreds of cells on a single lane of sequencing and rapidly establish inversion frequencies for multiple populations simultaneously. Then, by comparing the frequency and distribution of inversions between these stratified populations, the sets of inversions enriched in each can be revealed.
This may elucidate evolutionary histories or relationships between these populations and help uncover trait-associated variants. For instance, in the context of disease, we could simultaneously analyze cell populations derived from different patients with a specific cancer subtype to test for global differences in inversion profiles that may be implicated in unique disease presentations or drug responses, and to reveal new biomarkers of tumorigenicity. While the current sequence depth of Strand-seq data hinders precise rearrangement mapping (allowing breakpoint ranges to be mapped instead), we would be able to rapidly screen a population and, once a call set was generated, integrate alternative technologies (such as targeted arrays or deep and long read sequencing) to more precisely locate breakpoints, assess gene expression changes and directly study the biological outcomes of these variants in specific populations and patients. Finally, a major limitation in these studies will be the requirement for mitotic cells that will divide in the presence of BrdU. Nevertheless, I predict the work presented here will greatly facilitate future studies of inversions and other genomic rearrangements to help finally resolve how these structural features are implicated in phenotypes and disease states.

6.3 Proposed future directions

6.3.1 Exploring mechanistic details of genomic rearrangements

Now that we can see inversions, we can study them. Being able to map inversions with this kind of resolution (both genomic and cellular) offers an exciting opportunity to explore many biological questions. Having a fully-automated, single cell approach makes it possible to perform screens to explore how inversions are formed in different systems.
For instance, although it is currently theorized that NAHR generates inversions via an intermediate looping structure (as described in the Introduction, Chapter 1, Section 1.3.1) [17, 149], the actual details and molecular players of this recombination event are unclear. Using our methods we can directly test the mechanistic steps of how inversions arise in our genomes. For instance, by depleting different candidate molecules, such as those involved in DNA replication, DNA repair or homologous recombination, we can screen for differences in inversion numbers arising in recombination models to identify the molecules that promote and suppress inversions. Additionally, we can more closely interrogate the genomic architecture surrounding inversions to reveal the relationship between repetitive elements, chromatin states and genomic rearrangements. For instance, it is currently held that directly oriented segmental duplications (i.e. non-palindromic) result in deletions, whereas inverted (i.e. palindromic) segmental duplications drive inversions [17, 149]. To directly test the mutual exclusivity of these events we can screen recombinant lines to empirically determine the frequency with which inversions and deletions arise between different types of segmental duplications. This can also test whether the length and degree of sequence similarity of the repetitive elements are important factors in inversions. By better understanding how inversions are formed we can consider their evolutionary and biological importance.

Additionally, an often-neglected feature in the literature is how inversions impact the spatial organization of our genomes. For instance, seminal studies in the fly illustrated how looped structures form between synapsed chromosomes that contain inversions [61]; however, this is rarely discussed in the context of humans.
I believe this represents a currently unexplored area of research with exciting implications for studying the potential biological consequences of inversions. For instance, Hi-C experiments have elegantly shown our genomes are composed of megabase-sized self-interacting ‘topological domains’ that represent the higher order chromatin structure of the nucleus, and which are transcriptionally active [71]. Since inversions can cause looping structures in chromosomes, they may alter the tertiary organization of genomes and thereby impact the functionality of topological domains. As this should be made visible by Hi-C studies, it would be interesting to visualize any disrupted and altered interactions at inverted loci and explore this as a possible mechanism of how inversions impact our biology. By first using Strand-seq to identify inversions in the sample, and then performing a focused analysis of Hi-C data to visualize the organization of candidate domains, we can map how inversions alter the topological organization of nuclei. This is just one example of how we can integrate multiple complementary tools to explore more complex biological phenomena associated with the structure of our genomes, and I suspect it is this type of integrative approach that will really drive this field forward in the coming years.

Finally, this system provides the tools to study de novo and somatic genomic rearrangements in order to better understand how inversions arise in our genomes. For instance, Strand-seq libraries can be generated from successive generations in a family pedigree to identify inversions present in children that arise anew and are not detected in parental genomes. These represent inversions that are de novo and generated in germline cells or very early in development [23, 52]. To explore inversions that arise later in development, monozygotic twin studies can be performed to identify rearrangements that are present in one twin and not the other, and which therefore arose post-fertilization.
And to explore somatic mutations, inversion profiles can be generated from different tissues of an individual, or at different times in their lifespan, to locate new genomic rearrangements and test the extent of cellular mosaicism in the individual [29, 153]. These studies will allow us to uncover how new genomic rearrangements arise in our genome and are propagated in our tissues, in order to explore the functional consequences of both normal and pathogenic variants and better resolve how genome structure informs biology and health.

6.3.2 Potential applications for personalized medicine

The ability to finally characterize an invertome offers a new and exciting opportunity to test whether and how various inversions operate simultaneously or cooperatively. To date, studies have focused on single inversions in isolation, and have neglected to consider (due to technical constraints) whether groups or pairs of inversions act together and synergistically to impact individuals and populations. Now, more comprehensive analyses of multiple genomes can be performed simultaneously to investigate the phenotypic consequences of sets of inversions. For instance, I have shown evidence of how Strand-seq can be used to rapidly screen genomes for multiple inversions associated with developmental delay. As we learn more about the phenotypic implications of different inversions in our genomes, the utility of Strand-seq as an early diagnostic test will be revealed, where we could: i) characterize inversion profiles to inform genetic counseling, and ii) perform early postnatal screens to establish anticipatory care. However, to achieve this, generating and characterizing individual invertomes will have to become standardized for clinical settings.
Figure 6-2 | Estimated cost for constructing and sequencing Strand-seq libraries
a) Minimum number of cells required for constructing an invertome based on the expected (50%) and observed (42%) frequency for a chromosome being informative (i.e. inherited as either WW or CC). b) Probability table of (a) that any given chromosome will be represented at least once as WW or CC based on total cells analyzed. c) Cost breakdown for library construction (including enzymes, reagents and plastics) and sequencing on a HiSeq platform using a 100PE (100 base pair paired-end) flow cell. The invertome cost is based on generating 8 invertomes per plate. All values are shown in US dollars.
[Panel a plot labels: probability (0.4 to 1.0) against number of cells (5 to 20); curves labelled p=0.5 (expected) and p=0.58 (observed).]

Panel b:
cells    50%      42%
1        0.5      0.42
2        0.75     0.664
3        0.875    0.805
4        0.938    0.887
5        0.969    0.934
6        0.984    0.962
7        0.992    0.978
8        0.996    0.987
9        0.998    0.993
10       0.999    0.996
11       1        0.998
12       1        0.999
13       1        0.999

Panel c:
Cost per:             Plate    Invertome    Cell
libraries             96       12           1
construction          1453     181.63       15.14
sequencing (100PE)    3512     439.00       36.58
TOTAL (USD)           5061     632.63       52.72

In this study, we compiled data from over 100 libraries for each person to create a high-resolution map of their unique invertome. In theory every diploid chromosome exhibits a 1:2:1 segregation pattern for WW:WC:CC, and thus ½ of the chromosomes of any given cell will be informative for an inversion analysis. Based on this, we expect to have every chromosome represented as either WW or CC within 7 cells (p = 0.992), assuming random and independent segregation (Fig. 6-2a). However, in reality, we found a 0.42 probability that a chromosome would be informative (due to the presence of SCEs, discussed in Chapter 5, Section 5.2.1), and thus to account for this dropout I would suggest increasing this number to 12 cells (p = 0.999) (Fig. 6-2b).
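The probabilities quoted above (and tabulated in Fig. 6-2b) follow from treating each cell as an independent trial in which a given chromosome is informative (WW or CC) with probability p. The short sketch below reproduces them under that independence assumption; the function name is illustrative and is not part of any software described in this dissertation.

```python
def prob_informative(p: float, n_cells: int) -> float:
    """Probability that a given chromosome is WW or CC in at least one of
    n_cells cells, assuming independent segregation in every cell."""
    return 1.0 - (1.0 - p) ** n_cells


# Expected (p = 0.50) and observed (p = 0.42) per-cell probabilities.
for p in (0.50, 0.42):
    for n in (1, 2, 7, 12):
        print(f"p={p:.2f}  cells={n:2d}  probability={prob_informative(p, n):.3f}")
```

With p = 0.50 the probability reaches 0.992 at 7 cells, and with p = 0.42 it reaches 0.999 at 12 cells, matching the values stated in the text.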
Based on this, a single invertome can be generated in a dozen cells and 8 distinct (albeit relatively low-resolution) invertomes could be simultaneously generated on a single 96-well plate. If all were pooled for paired-end (100 bp) sequencing on a HiSeq platform and on a single flow cell, a complete invertome can be produced for $636 USD (see footnote 18) (presently equivalent to ~$854 CAD and €565 EUR) (Fig. 6-2c). I expect this cost will come down in the next few years, as sequencing represents 70% of the price tag, and technologies are rapidly becoming less cost prohibitive. Even now, this is well below a $1,000 USD genome and provides crucial structural information otherwise lost in conventional sequencing approaches. I believe this makes our strategy a realistic tool for use in clinical settings.

In the future, I envision this technology can be adopted by clinical labs worldwide. While beyond the scope of this dissertation, other projects in our laboratory are underway that illustrate how Strand-seq is a reliable tool for building complete haplotype scaffolds (spearheaded by David Porubsky) and locating copy number variants and aneuploidy in single cell genomes (by Aaron Taudt). With appropriate validation and standardization, we anticipate it will one day be a working protocol to Strand-seq a patient sample in order to rapidly haplotype their genome, quantify levels of genome instability, identify any translocations and copy number variants, and fully characterize their invertomes in order to inform diagnostic and prognostic predictions. However, to realize this vision it is critical we develop appropriate bioinformatic tools that can disseminate this information to stakeholders in meaningful ways, and straightforward laboratory protocols that can be universally adopted. Current efforts are underway to generate a comprehensive analysis suite that integrates all types of data from a Strand-seq sample.
Ultimately, I can see this being offered as a service, where medical practitioners input patient samples, and receive a list of phased SVs and SNPs that can be considered for clinical decision-making, truly making medicine more personal.

Footnote 18: Represents technical cost only, and does not include culture, sorting, man hours, analysis hours or data curation and interpretation.

Figure 6-3 | Summary of potential applications of Strand-seq for structural variant detection
Using the experimental framework and tools described in this dissertation, Strand-seq can be applied to study: a) human populations, b) cellular heterogeneity, or c) individual invertomes. a) Population-based studies may involve mapping inversion profiles from pooled donor samples derived from geographically distinct regions. This can be used to build haplogroup maps of the sets of structural rearrangements enriched in distinct populations, and reveal candidate alleles that may be implicated in specific biologies. This can readily be translated into clinical studies, where the populations are patient cohorts, in order to identify disease-associated variants. b) To study cellular heterogeneity, cell samples can be taken to map the structural profile of different tissues (for example, the stomach, kidney, blood, and/or brain) in order to assess somatic mosaicism within an individual. This can also be applied to tumor samples where the clonal evolution of cells within a tumor can be assessed by looking at the set and frequency of variants within the cancer tissue, as compared to the non-cancerous tissue. Evaluating changes in variant frequencies from samples taken before and after treatment can identify subclones that respond to the treatment and better assess disease progression. c) By building comprehensive invertomes, complete sets of inversions can be considered for individuals, which can be directly applied to personalized medicine. Invertomes can help identify any structural abnormalities that are present in the genomes of patients with rare and complex disease that are difficult to diagnose and stratify. By having more comprehensive views of all the structural polymorphisms present in these patients we can better understand what genomic features may be driving their phenotype.
[Figure panel titles: Population Studies, Cellular Heterogeneity, Personalized Medicine; centre label: Strand-seq.]

6.4 Concluding remarks

The work presented in this dissertation describes a powerful new framework to study structural variation and genomic heterogeneity from single cell samples, whether from individuals for population studies, or tissues for biomarker discovery. Although past studies have illustrated that the structure of our genome plays important roles in our biology, they have been biased towards assessing copy number variants due to technical limitations. The absence of appropriate tools precluded in-depth investigations into the frequencies, characteristics and phenotypic consequences of inversion polymorphisms in different human populations and cell types. Now, we have a novel sequencing and analysis strategy that preserves the structure of individual homologues in order to study genome organization in single cells and map copy-neutral events. We also have an established framework for characterizing rearrangement profiles in hundreds of single cells simultaneously, and defining individual invertomes in a cost-effective and timely manner.
These strategies greatly advance the resolution and scope of current genomic studies, and have far-reaching implications for population studies, for exploring cellular heterogeneity and for integrated personalized medicine (summarized in Fig. 6-3). I predict this work will have immediate impact and broad appeal to many scientific communities, including those interested in medical genetics, population genetics, genomic instability, somatic mosaicism, and cancer evolution. With this advance, we can now explore inversions in different populations to better understand how they too impact our biology, and I have no doubt that the most interesting applications of this technology are to emerge in the near future.

Chapter 7 | Materials and methods

“I was taught that the way of progress is neither swift nor easy” – Curie

7.1 Primary human cell sources

All experiments were performed with fresh or previously frozen primary human hematopoietic cells, derived from the following sources:

i) A single 27-year-old male cadaveric bone marrow sample from the Lansdorp Laboratory biological stocks, cryopreserved and acquired from the Northwest Tissue Centre (Seattle, U.S.A.), sample identification NTR00165 (in-house: Cad11), received on May 30, 1994. Finished Homo sapiens Strand-seq (HsSs) libraries correspond to HsSs_0001-HsSs_0140. I personally performed all steps in the preparation of these cells.

ii) A single female cord blood sample of a newborn child born on November 29, 2012 and acquired fresh (i.e. never frozen) from the Stem Cell Assay Laboratory, Vancouver, Canada, sample identification C9053. Finished libraries correspond to HsSs_0141-HsSs_0246. David Knapp, a graduate student in the laboratory of Dr. Connie Eaves, received and prepared this sample for FACS (fluorescence-activated cell sorting), and I personally performed all steps thereafter.
iii) A pooled cord blood sample of 353 donors, cryopreserved and acquired from the Stem Cell Assay Laboratory on July 12, 2012; sample identification CB7. Finished libraries correspond to HsSs_0247-HsSs_0302. This sample was prepared for FACS by David Knapp from the laboratory of Dr. Connie Eaves, and I personally performed all the steps required for the library construction and analysis of single sorted cells thereafter.

7.2 Building Strand-seq libraries

7.2.1 Cell selection and culture

To generate Strand-seq libraries for this study, primary human cells were cultured for one cell division in the presence of 5-bromo-2'-deoxyuridine (BrdU). For this, mitotic hematopoietic cell populations were selected and grown in conditions that support their proliferation, as described below. Note that some discrepancies in tissue culture protocols arose because the libraries were originally prepared for alternative experiments. However, I do not expect culture conditions to alter template strand states or the genomic rearrangement profiles described in this study.

To prepare single cells for selection and library construction, cryopreserved samples maintained at -135 °C were thawed at 37 °C and transferred into Iscove's Modified Dulbecco's Medium (IMDM) with 25 mM Hepes (Stem Cell Technologies, Vancouver, Canada; #36150) enriched with 2% HyClone Cosmic Calf Serum (FCS; Thermo Fisher Scientific, Utah, U.S.A.; #SH30087) and 100 µg/mL DNase (Sigma-Aldrich, Missouri, U.S.A.; #D4513). The fresh cord blood sample was immediately processed upon receipt to isolate the mononuclear cell (MNC) fraction using Ficoll-Paque PLUS (Stem Cell Technologies #07957) density gradient centrifugation (2.3K rpm for 20 min). The MNC fraction was kept at 4 °C in 20% human serum (IMDM) overnight (since the child was born late in the day, and the FACS facility was already closed).
All samples were lineage-depleted using the EasySep Human Progenitor Cell Enrichment kit (Stem Cell Technologies #09650), which enriches for undifferentiated cells by removing cells expressing cell surface antigens of differentiated human blood cells (i.e. CD2, CD3, CD11b, CD11c, CD14, CD16, CD19, CD24, CD56, CD66b, and/or glycophorin A). Single cells were then FACS-sorted into single wells of a 96-well tissue culture plate, in order to monitor cell divisions and capture paired sister cells. For sorting, cells were resuspended in staining buffer (Hank’s Balanced Salt Solution (HBSS) with 10 mM Hepes (Stem Cell Technologies #37150), 2% FCS, and 0.005% sodium azide (Sigma-Aldrich #S2002)) containing a cocktail of the following fluorescent-conjugated, anti-human, monoclonal antibodies: CD34-Cy5, CD45RA-PE-Cy7, CD38-PE, CD90-PerCP-Cy5.5, and CD49f-FITC (described in Table 7-1). Cells were stained in the dark for one hour on ice, washed twice in HBSS and resuspended in 0.5 mL FACS buffer (HBSS, 2% FCS, and 2 µg/mL Propidium Iodide (PI; Sigma-Aldrich #P4170)).

Table 7-1 | Antibodies used to label mitotic human hematopoietic cells. All antibodies were sourced from Affymetrix eBioscience (California, U.S.A.) and have the catalog number listed, except CD34 and CD71 (*Lansdorp stock), which were produced and conjugated in-house by PM Lansdorp (January 1996 and May 1993, respectively). Concentration (conc); dilution (dil); phycoerythrin (PE); cyanine (Cy); microgram (µg); millilitre (mL); milligram (mg).

Viable (PI-) cells were FACS-sorted to enrich for primitive and mitotic hematopoietic stem and progenitor cells using the CD34+ CD38- CD45RA- CD90+ CD49f+ immunophenotype [154] (Fig. 7-1). Sorting was performed on a BD Influx II cell sorter (BD Biosciences, Seattle, USA) equipped with 405 nm, 488 nm and 640 nm lasers.
Each cell was deposited into 50 µl of complete tissue culture medium, which supports the proliferation of this primitive cell population [154], and consisted of serum-free medium (Stemspan, Stem Cell Technologies #09650) supplemented with human-recombinant (rh) growth factors: rh-SCF (100 ng/mL; Stem Cell Technologies #02630), rh-Flt-3L (100 ng/mL) and rh-TPO (50 ng/mL). All growth factors were sourced from Stem Cell Technologies. The cell culture medium was also spiked with 5 µM of BrdU. This dose was experimentally tested and found to minimize growth delays of the cell population (Fig. 7-2) while being sufficient for Strand-seq library preparation, which requires at least one BrdU molecule to be taken up every 146 base pairs (bp) of DNA (the approximate length of DNA wrapped around a mononucleosome). These plates were monitored by microscopic inspection, and cells that divided once were identified to manually capture the cell pair, as described below (Section 7.2.2).

Table 7-1 (antibody details):
Antibody    Fluorophore    Clone       Stock conc    Working dil    Source
CD34        Cy5            8G12        250 µg/mL     1:100          *Lansdorp stock
CD45RA      PE-Cy7         HI100       0.5 µg/mL     1:100          25-0458-41
CD38        PE             HB7         0.1 µg/mL     1:100          12-0388-41
CD90        PerCP-Cy5.5    eBio5E10    0.5 µg/mL     1:50           45-0909-41
CD49f       FITC           eBioGoH3    0.5 mg/mL     1:75           11-0495-80
CD71        FITC           OKT9        500 µg/mL     1:75           *Lansdorp stock

Figure 7-1 | Gating strategy used to enrich mitotic human hematopoietic cells
Representative FACS plots illustrating the selection gates (pink boxes) used to enrich for (Lineage- CD34+ CD38- CD45RA- CD90+ CD49f+) human stem and progenitor cells. All gates were set up using unstained and single-stained controls. Fluorescence-minus-one (FMO) controls were used for the CD90 and CD49f gates. Each plot (moving from left-right, top-bottom) was gated on the preceding, with the percent of cells visible in each plot and falling within the gate listed.
[Plot axis labels: FSC, SSC, DAPI, Trigger Pulse Width, CD34, CD38, CD90, CD45RA, CD49f.]

A subset of the male bone marrow cells (corresponding to libraries HsSs_0001-HsSs_0045) was cultured in conditions that supported the growth of erythrocytes [154, 155]. Upon thawing, the MNC fraction was isolated using Ficoll-Paque PLUS (2.3K rpm for 25 min) and lineage-depleted using the Progenitor Cell Enrichment kit. Enriched cells were then bulk cultured at a plating density of 1 × 10^6 cells/mL in serum-free medium supplemented with rh-SCF (100 ng/mL), rh-Flt-3L (100 ng/mL), rh-TPO (50 ng/mL), Erythropoietin (rh-EPO, 3 U/mL) and Granulocyte-macrophage colony-stimulating factor (rh-GM-CSF, 20 ng/mL) [155]. After four days in culture, erythrocyte precursors were sorted by gating on CD34+ CD45RA- and CD71- cells [155]. The same FACS staining protocol described above was used, with the following antibody cocktail: CD34-Cy5, CD45RA-PE-Cy7, and CD71-FITC (Table 7-1). After sorting the erythrocyte precursors, cells were lysed and the single nuclei that went through a single cell division were sorted based on Hoechst-quenching, as described in Section 7.2.2.

Figure 7-2 | Cell cycle kinetics of hematopoietic cells in increasing concentrations of BrdU
Histograms of human hematopoietic cells cultured in the presence or absence of BrdU, and stained with a CFSE analog (violet cell tracker; fluorescence reduces by half with each cell division) to track the number of cell divisions in culture over time (numbered arrows above plots).
Colcemid (arrests mitotic cells; light grey) was used to mark the undivided population. a) At 3 days in vitro (DIV), most cells have undergone a single division in the negative control (0 µM BrdU) and at the low (5 µM) BrdU dose, whereas cells in the high (40 µM) BrdU concentration have not divided. b) At 5 DIV most cells have undergone two divisions in the 0 BrdU control, whereas the cell cycle is delayed in the BrdU-treated cells. c) The delay is exacerbated over time, and at 7 DIV the 0 BrdU control has undergone 3 rounds of division, the 5 µM BrdU group has undergone 2 rounds, and the 40 µM BrdU group has undergone a single cell division.
[Plot labels: time; 0 µM, 5 µM and 40 µM BrdU; colcemid; division counts 1x, 2x, 3x; panels a, b, c.]

7.2.2 Capturing cells after one cell division

Daughter cells arising from a single cell division in the presence of BrdU contain chromosomes composed of a BrdU- template strand and BrdU+ nascent strand (i.e. hemi-substituted DNA), which is a requirement for successful Strand-seq libraries. The cells arising after a single cell division were isolated from culture either by manual micromanipulation (to capture pairs of sister cells) or by sorting nuclei based on Hoechst-quenching from bulk cultures. Each approach is described in detail below.

From the FACS-sorted single cell cultures, sister cell pairs arising from the same mother cell that underwent a single cell division were captured by micromanipulation. To accomplish this, plates were microscopically monitored every day over a period of seven days to locate and quantify the number of cells in each well. On the first day I confirmed that only a single cell was sorted into the well. Subsequent daily monitoring allowed me to track when that cell completed a single mitosis and capture the newborn daughter cells before they underwent a second division. Once two cells were found in the well, the pair of sister cells was manually harvested.
This required gently transferring the contents of the well into four new wells of a 96-well plate (Microbatch Protein Crystallization plate, VWR International, Pennsylvania, U.S.A. #82050-972) and then searching these wells to localize the two daughter cells [156]. The small surface area of the wells in this 96-well plate facilitated the search for each cell. The success rate was improved to 80% by: i) ‘pre-wetting’ the pipette tip used to transfer well contents (by first triturating HBSS containing 10% EDTA), ii) very gently mixing well contents by pipetting five times before aspirating the entire volume, iii) during the transfer, allowing a drop to form on the pipette tip and gently placing this into the new well (thus avoiding any bubbles), and iv) letting the transferred cells settle to the bottom of the well over approximately 20 min, before searching the new plate. Once the two sister cells were located in two separate wells, the contents of these wells were transferred to nuclei isolation buffer (NucleiEZ lysis buffer, Sigma-Aldrich #N3408) for Strand-seq library preparation.

From the bulk-cultured cells, daughter cells generated after one cell division were captured by sorting single nuclei based on BrdU-quenched Hoechst fluorescence [157, 158]. BrdU is a thymidine analogue that is taken up into newly synthesized DNA at sites of T, and Hoechst 33258 is a minor groove-binding fluorescent dye that preferentially binds to double-stranded DNA at sites of A-T. When incorporated into DNA, the bromine group of BrdU is thought to deform the minor groove, thereby interfering with Hoechst binding and fluorescence [159]. Consequently, a cell that has hemi-substituted DNA (i.e. one strand of the DNA molecule contains BrdU) exhibits exactly half the fluorescence of a cell that did not incorporate BrdU into its DNA [157, 158] (Fig. 7-3). We took advantage of this biochemistry to capture cells after one cell division. For this purpose, bulk-cultured cells were harvested and lysed to isolate nuclei for staining by resuspending the cell pellet (5 × 10^5 cells/mL) in a staining buffer (100 mM Tris-HCl (pH 7.4), 154 mM NaCl, 1 mM CaCl2, 0.5 mM MgCl2, 0.2% BSA and 10 µg/mL Hoechst 33258) containing the detergent Nonidet-P40 (0.6% v/v; US Biological). Cells were lysed on ice for 15 min in the dark, and nuclei were sorted using a BD Influx cell sorter (BD/Cytopeia) equipped with two tunable Coherent I305C argon lasers and a Cobolt Jive 50 561-nm diode laser. A BrdU- control was included to identify the undivided population and set up the system. The cell population that underwent one cell division was selected as the peak showing ½ Hoechst-fluorescence compared to the undivided control (Fig. 7-3, blue arrowhead), and nuclei were deposited directly into NucleiEZ lysis buffer for Strand-seq library preparation.

Figure 7-3 | BrdU quenches Hoechst fluorescence in hemi-substituted human hematopoietic cells
FACS analysis of fluorescence levels of human hematopoietic cells cultured in the presence or absence of BrdU for 3 days, and stained with Hoechst 33258 (as described in Section 2.2.1). The 0 BrdU control marks the level of fluorescence expected from an undivided population (red arrowhead), and cells that underwent a single division are expected to show 1/2 this level of fluorescence (blue arrowhead). a) Plots illustrate that cells cultured in 5 µM BrdU display a shift in Hoechst fluorescence (recall, Fig. 7-2 showed these cells divided at this time point), whereas cells cultured in higher doses do not show a divided population, based on Hoechst fluorescence. b) Overlaid histogram of density plots shows that Hoechst fluorescence is cut in exactly half in the divided cell fraction (red), as compared to the undivided population (grey).
[Figure 7-3 plot labels: FSC against Hoechst for 0, 5, 40 and 100 µM BrdU; Hoechst levels marked at 22K and 11K; panels a, b.]

7.2.3 Strand-seq library construction

Isolating daughter cells after one cell division that contain hemi-substituted genomic DNA is the principal requirement for successful Strand-seq, a method developed in-house by Dr. Ester Falconer and described in [109]. Briefly, this modified paired-end (PE) protocol uses a photolytic cleavage event to nick single stranded DNA at sites of BrdU incorporation. The nicking interferes with PCR amplification of BrdU+ (nascent) strands, allowing us to build directional libraries consisting only of the BrdU- (template) strands. Since our Nature Methods (2012) publication [109], the protocol has been scaled and validated for construction on a robotic liquid handler by Dr. Ester Falconer, allowing us to process 96 samples simultaneously. The reduced reaction volumes and modifications introduced into the protocol are described here.

The reaction enzymes used in this protocol were acquired from New England Biolabs (NEB). All steps were performed on the Agilent Bravo Automated Liquid Handling Platform (Agilent, California, U.S.A.; #16050-102) equipped with a 96-ST pipette head. Each enzymatic step was purified using solid-phase reversible immobilization paramagnetic beads (Agencourt AMPure XP, Beckman-Coulter, California, U.S.A.; #A63880) at 1.8× vol. for pre-adaptor ligation reactions, and 0.8× vol. for post-adaptor ligation reactions. Reagent volumes are given per reaction well.

Immediately following isolation, the daughter cells or their harvested nuclei were transferred into 5 µl NucleiEZ buffer in 96-well PCR plates, and spun in a 4 °C pre-chilled centrifuge at 500g for 5 min.
This pelleted the nuclei, and the following library steps were performed:

i) MNase digestion: To fragment genomic DNA, a micrococcal nuclease (MNase) reaction mix (1.5 µL 10x MNase Buffer, 0.11 µL 200 mM DTT, 1.51 µL 50% PEG 6000, 0.5 kunitz units (U) MNase (NEB #M0247S), and 6.82 µL ddH2O) was added to each well and incubated at room temperature for 8 min. The final reaction vol. was 15 µL. Reactions were stopped by adding 1.66 µL EDTA (10 mM final concentration). Samples were bead purified and eluted with 10 µL EB buffer (Qiagen, Limburg, Netherlands; #19086) into a 384-well hard-shell plate (Bio-rad, California, U.S.A.; #HSP3801).

ii) End repair: To fill in the sticky end cuts of MNase-digested chromatin, samples were incubated in an end repair reaction mix (15.63 µL 10x Phosphorylation Buffer, 6.25 µL 10 mM dNTP mix, 0.75 U T4 DNA polymerase (NEB #M0203L), 0.25 U Klenow (NEB #M0210L), 2.5 U T4 PNK (NEB #M0201L), and 0.20 µL ddH2O) at room temperature for 30 min. The final reaction vol. was 12.5 µL. Bead purification and elution in 8.5 µL EB buffer followed.

iii) A-tailing: To attach an adenine to the 3´ end of the blunt DNA fragment, samples were incubated in an A-tailing reaction mix (1 µL NEB 2, 0.25 µL 10 mM dATP, and 1.25 U exo- Klenow (NEB #M0212L)) for 30 min at 37 °C. The final reaction vol. was 10 µL. Temperature was maintained in a Veriti 384-well Thermal Cycler (Thermo Fisher Scientific #4388444). This was followed by bead purification and elution in 7.3 µL EB buffer.

iv) Adaptor ligation: Standard Illumina forked adaptors (designed with a single 3´ thymidine overhang) were ligated onto the A-tailed fragments by incubating samples for 18 minutes at room temperature in the ligation reaction mix (10 µL 2x Quick ligase buffer, 0.67 µL PE adaptors, 0.2 µL EB, and 9 U Quick T4 ligase (NEB #M2200L)). Final reaction vol. was 20 µL.
The bead-purified reaction was eluted in 9.5 µL EB buffer.

v) H/UV treatment: To nick the BrdU-substituted DNA strands, samples were treated with Hoechst 33258 (10 ng/µL final concentration) for 15 min at room temperature while protected from light. This was followed by ultraviolet radiation (UV) treatment (2.7 × 10³ J/m²) in a UVC 500 crosslinker (Amersham Biosciences; GE #80-6222-31) for 15 min, while uncovered. This was followed directly by PCR of the intact DNA strand.

vi) Amplification: PCR amplification of the directional libraries was performed using an indexed PCR primer mix (12.5 µL Phusion HF master mix (NEB #M0531S), 1 µL PE 1.0 primer (Illumina) and 1 µL custom multiplexing PCR primer; each well received a unique hexamer barcode). The final reaction vol. was 25 µL and the samples were run for 15 cycles of PCR in the Veriti 384-well Thermal Cycler. Completed libraries were bead purified and eluted in 10 µL EB buffer. Libraries were pooled for size-selection and sequencing.

7.2.4 Size-selection and sequencing

The completed Strand-seq libraries of a 96-well plate were pooled together for two rounds of size selection. This ensured that a narrow size range of DNA was loaded onto the flowcell used for sequencing. To isolate the ~280 bp product, corresponding to the size of a mononucleosome (~146 bp) with two PE adaptors (each at 68 bp) attached, the samples were first run on a 2% E-Gel EX agarose gel (Invitrogen, U.S.A.; #G4020-02) and the 220- to 320-bp DNA band was excised (Fig. 7-4a). Size-selected DNA was extracted from the agarose using a Zymoclean Gel DNA recovery kit (Zymo Research, California, U.S.A.; #D4001) and eluted in 60 µl EB buffer. To reduce contamination from adaptor dimers (the major product seen at ~125 bp (Fig. 7-4, black arrows)), the size-selection was repeated using a 1% E-Gel EX agarose gel (Invitrogen #G4020-01) (Fig. 7-4b) and eluted in 13 µl of EB.
The final size distribution was confirmed by running 1 µl on an Agilent High Sensitivity chip (Agilent, #5067-4626) (Fig. 7-4c), and the final concentration was determined by running 2 µl on a Quant-iT dsDNA HS assay kit and Qubit fluorometer (Invitrogen #Q-33120). The final molarity (nM) was calculated using the average size distribution of the library (found by defining a region in the Agilent result; Fig. 7-4c, lower table), the Qubit concentration result, and the average molecular weight (660 g/mol) of each base pair in dsDNA, as shown in the example (Fig. 7-4d).

Figure 7-4 | Pooling and size-selection strategy of Strand-seq libraries for sequencing. To perform two consecutive rounds of size selection, pooled libraries were run on a) a 2% agarose gel, and then b) a 1% agarose gel, and the 270 bp fragment (dotted box) was excised. This purifies the desired product from contaminating adaptor dimers (black arrow). c) The Agilent profile of the product illustrates the final size range of the pooled library. By defining a region (blue arrows), the average size range (asterisks) can be used to accurately calculate the molarity. d) Example calculation of final molarity. The Qubit concentration (1.75 ng/µL) and Agilent peak size (282 bp) used in this calculation correspond to the overall average found for all libraries sequenced in this study. The molecular weight (660 g/mol) used is an estimate of the average weight expected for each base pair in a double-stranded DNA molecule.

To sequence the libraries, 10 nM of each was sent to the Michael Smith Genome Sciences Centre (Vancouver, Canada), where clusters were generated on the cBot (HiSeq 2000), and paired-end 100-nucleotide reads were generated using v4 sequencing reagents on the HiSeq 2000 (SBSxx) platform, following the manufacturer’s instructions. To sequence the hexamer barcode (index), a third 7-bp read was performed using a custom sequencing primer [109]. Image analysis, base-calling and error calibration were performed using Illumina’s genome-analysis pipeline. The indexed paired-end .fastq files were aligned to the human reference assembly (GRCh37/hg19, released Feb 2009) using BWA. Custom scripts were used to split the resulting BAM files by index and to add the chastity flag. The resulting individual Strand-seq files were then analyzed using the open source software ‘Bioinformatic Analysis of Inherited Templates’ (BAIT), developed in-house by Dr. Mark Hills and described in [110]. Briefly, BAIT parsed the data to remove duplicates, apply a quality threshold and discern read directionality in order to classify each read as either Watson (W) or Crick (C). W reads are characterized by the first paired-end tag (PET) mapping to the ‘+’ strand and the second PET mapping to the ‘−’ strand, whereas C reads are those in which the first PET mapped to the ‘−’ strand and the second to the ‘+’ strand. BAIT plots these data as histograms (consisting of 200 kb bins) on ideograms of human chromosomes, performs a number of computational analyses to assign strand inheritance states, library read depth (normalized counts/megabase), sister chromatid exchange events and aneuploidy, and converts the data into BED files (see Fig. 3-6 BAIT ideogram for an example). These BED files were submitted to the UCSC Genome Browser and custom analysis software (e.g. Invert.R, described in Chapter 3, and below) to explore template strand changes indicative of structural genomic rearrangements.
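The read classification and binning attributed to BAIT above can be illustrated with a minimal Python sketch. This is not BAIT's code: the function names, the tuple layout used for a read pair, and the decision to skip pairs with any other strand combination are assumptions made for illustration. Only the W/C rule and the 200 kb bin size are taken from the text.

```python
from collections import defaultdict

BIN_SIZE = 200_000  # 200 kb bins, as used for the BAIT histograms


def classify_pair(first_strand, second_strand):
    """Return 'W', 'C' or None for a read pair, following the rule in the text.

    W: first PET on '+', second PET on '-'.
    C: first PET on '-', second PET on '+'.
    Any other combination is left unclassified (an assumption of this sketch).
    """
    if (first_strand, second_strand) == ("+", "-"):
        return "W"
    if (first_strand, second_strand) == ("-", "+"):
        return "C"
    return None


def bin_counts(pairs, bin_size=BIN_SIZE):
    """Count W and C read pairs per (chromosome, bin index).

    `pairs` is an iterable of (chrom, start, first_strand, second_strand).
    """
    counts = defaultdict(lambda: {"W": 0, "C": 0})
    for chrom, start, first_strand, second_strand in pairs:
        state = classify_pair(first_strand, second_strand)
        if state is not None:
            counts[(chrom, start // bin_size)][state] += 1
    return dict(counts)


# Hypothetical read pairs, for illustration only.
pairs = [
    ("chr1", 10_500, "+", "-"),
    ("chr1", 150_000, "-", "+"),
    ("chr1", 250_000, "+", "-"),
    ("chr1", 260_000, "+", "+"),  # neither W nor C under the rule above; skipped
]
counts = bin_counts(pairs)
```

A bin with mostly W pairs, mostly C pairs, or a roughly even mix would then suggest a WW, CC or WC template state for that stretch of the chromosome.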
[Figure 7-4, panel content. a) 2% agarose gel and b) 1% agarose gel, each with a 50 bp ladder; labels mark the 270 bp product and the adaptor dimers. c) Agilent profile with region table: From 115 bp; To 1,989 bp; Corr. Area 517.8; % of Total 94; Average Size 400 bp; Size distribution in CV 64.7%; Conc. 3,406.77 pg/µl; Molarity 17,711.9 pmol/l. d) Qubit concentration = 1.75 ng/µL; Agilent peak size = 282 bp;
$$\text{Molarity} = \frac{\text{DNA concentration}}{\text{DNA length} \times \text{molecular weight}} = \frac{1.75\ \text{ng/µL} \times 1{,}000{,}000}{282\ \text{bp} \times 660\ \text{g/mol}} = 9.40\ \text{nM}$$]

7.3 Analyzing Strand-seq libraries

With aligned Strand-seq libraries we can explore the structure of individual homologues. First, each single library was analyzed using Invert.R to identify all putative inversions present in that cell. This reflects the structural organization of a single genome. Once the variation in each library was characterized, the population was analyzed to identify patterns that recurred across multiple cells. This reflects the structural variation across genomes. These analyses are described in detail here.

7.3.1 Localizing inversions in single cells

To manually visualize strand inheritance states of chromosomes, Strand-seq libraries were converted into BED files (using a modification of BEDtools [131] bamToBed, implemented through BAIT [110]) and uploaded onto the UCSC Genome Browser (http://genome.ucsc.edu/) as custom annotation tracks on the GRCh37/hg19 assembly (Feb. 2009) [119]. Duplicate reads were removed and the remaining reads were filtered with a minimum mapping quality score of q > 10. The Genome Browser navigation feature facilitates close examination of the strand state at putative inversion sites and across multiple libraries simultaneously. Putative inversions were manually located and characterized using this navigation tool. The breakpoints of genomic rearrangements were manually located to the first base pair position of the first read present in a template strand state change.
Rearrangements that involved a single homologue switching completely to the alternative template were flagged as somatic sister chromatid exchange events. Rearrangements that resulted in a localized change in template strands and returned to the original state were flagged as putative inversions. To distinguish between these two types of rearrangements and identify inherited variants, the event was localized in more than two cells of the sample.

To bioinformatically assess inversions in Strand-seq libraries, the custom R-based bioinformatic software Invert.R was developed (as described in detail in Chapter 3). Briefly, putative inversions were first identified in single cell libraries. Invert.R was given aligned (GRCh37/hg19) sequence BAM files. Invert.R parsed each chromosome and filtered it for depth, where a minimum read depth of 20 reads/Mb (minDepth = 20) was required for inclusion, and for chromosome state. For this, I modified Invert.R to tabulate the total number of C and W reads and assess whether the chromosome was predominantly WW, WC or CC by borrowing BAIT’s [110] strategy of the ‘WCcall’. WCcall was calculated as the total W reads, less the total C reads, divided by all the reads in the chromosome, i.e. WCcall = (W − C) / (W + C). A WCcall of 1.0 corresponded to a WW chromosome, a WCcall of -1.0 corresponded to a CC chromosome and a WCcall of 0 corresponded to a WC chromosome. To select only high quality libraries that were at least 85% WW or CC, the chromosome needed a WCcall above 0.75 or below -0.75 (WCcall = 0.75). Invert.R also removed duplicate reads (dups = T) and those with a mapping quality below 10 (q = 10). The baseline threshold was set to 0.8 (baseline = 0.8), a bin size of 25 reads (checkNum = 25) was used, and a minimum of two libraries had to share an ROI for inclusion into the dataset (minLibs = 2).
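The chromosome-level filter described above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the Invert.R or BAIT implementation: the function names and the idea of passing the chromosome length explicitly are hypothetical, while the WCcall definition and the default thresholds (20 reads/Mb, ±0.75) are those given in the text.

```python
def wc_call(w_reads, c_reads):
    """WCcall = (W - C) / (W + C): 1.0 for WW, -1.0 for CC, 0 for WC."""
    total = w_reads + c_reads
    if total == 0:
        raise ValueError("no reads on this chromosome")
    return (w_reads - c_reads) / total


def chromosome_state(w_reads, c_reads, chrom_length_bp, min_depth=20, threshold=0.75):
    """Return 'WW' or 'CC' if the chromosome passes both filters, else None.

    min_depth is in reads per megabase; threshold is the absolute WCcall
    that a chromosome must exceed to be kept.
    """
    depth = (w_reads + c_reads) / (chrom_length_bp / 1_000_000)
    if depth < min_depth:
        return None  # too few reads/Mb for reliable calls
    call = wc_call(w_reads, c_reads)
    if call > threshold:
        return "WW"
    if call < -threshold:
        return "CC"
    return None  # WC or ambiguous chromosomes are excluded


# Hypothetical 50 Mb chromosome with 1800 W and 200 C reads:
# depth is 40 reads/Mb and WCcall is 0.8, so it is kept as WW.
state = chromosome_state(1800, 200, 50_000_000)
```

Keeping the depth filter ahead of the state call means a sparsely covered chromosome is never classified, however skewed its few reads are.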
To analyze the directional composite files generated from multiple single cells of the same donor (described in Chapter 5, Section 5.2.1), a more stringent mapping quality (q = 20) and a larger bin (checkNum = 250) were applied to account for the higher read depth of each file. The composite was a consensus of all the variants in the merged cells, and therefore minLibs was set to 1. The putative inversions listed in the first-pass ROI files output by Invert.R were validated and high-quality calls were selected for further analysis, as described below.

7.3.2 Analyzing inversions across multiple cells

From the pooled donor population dataset, Invert.R-identified ROIs were confirmed by visualizing Strand-seq libraries on the UCSC Genome Browser and, if required, they were refined by redefining start and stop coordinates based on read depths (e.g. chr10 and chr16) or gaps in the reference assembly (e.g. chr9). In male cells, ROIs falling within PAR1 or PAR2 were removed. The refined ROIs were precisely genotyped by adapting Invert.R to implement a statistical test of all reads in the ROI and finding the ‘best fit’ genotype. First, the number of W and C reads in the library was counted at the ROI, and if there was a user-defined minimum number of twenty reads (minReads = 20) present in the region, three Fisher’s exact tests were independently performed (one each for a wildtype, heterozygous, and homozygous state, using R’s built-in fisher.test function). Note that the Fisher test was selected over the Chi square test because it is better suited for small read numbers (analogous to sample size) [10, 135], and gave a better distribution of genotypes (Fig. 7-5). For tests of wildtype and homozygous states, a level of background (bg = 0.02) was introduced when calculating the expected ratio of W and C reads for these genotypes (Fig. 7-5). For example, at an ROI of a WW chromosome with 100 reads, if bg is set to 0.02 (i.e. 2% background) the expected numbers of W and C reads are: 98 W and 2 C for a wildtype state, 50 W and 50 C for a heterozygous state, and 2 W and 98 C for a homozygous state. The Fisher’s exact test asked whether the observed ratio of W and C reads at the ROI was significantly different from these expected ratios, and therefore the genotype whose test returned the highest p-value was designated the best fit genotype (Fig. 7-5). Significance was assigned to the genotype if the best-fit p-value was above 0.05 and significantly different from the other two states.

Figure 7-5 | Determining the best fit genotype by Fisher exact test and Chi square test. a) The genotype of a locus is determined by the relative proportion of Crick (C, blue) and Watson (W, orange) reads in a Strand-seq library. To determine the best statistical test for genotyping, i) a matrix of C reads (from 0-1000) and W reads (from 1000-0) was generated. For every read combination in this matrix, three Fisher Exact tests and three Chi Square tests were performed for a CC, WC or WW state. ii) In both tests a 2% level of background was built into the ‘pure’ calls (see text). Both tests ask if the observed data is significantly different from the expected results, and a low p-value rejects this null hypothesis (i.e. if a CC test yields a low p-value that means the locus is significantly not CC). The maximum (max) p-value from the b) Fisher test and c) Chi square test were plotted (black line) with the best fit genotype (i.e. test generating the max p-value) indicated by the underlaid color (WW in orange, WC in grey, and CC in blue). If the p-value was above 0.5, it was considered significant (red points). If each genotype is fairly represented by the test we expect to see a 1:2:1 distribution with 25% WW, 50% WC, 25% CC. The Fisher test showed a more even distribution compared to the Chi square test, which was very strict in calling a WW or CC genotype (red text below plot). While the test returns the best-fit based on p-values, it must be recognized that a genotype near transitions (i.e. where WW switches to WC, and WC to CC) is difficult to reliably call with high-confidence.

[Figure 7-5, panel content. a-i) Observed data: Crick reads = total ‘+’ reads; Watson reads = total ‘-’ reads; Total reads = Crick reads + Watson reads. a-ii) Expected data (with 2% background): CC test, Crick = Total reads × 0.98 and Watson = Total reads × 0.02; WC test, Crick = Total reads × 0.5 and Watson = Total reads × 0.5; WW test, Crick = Total reads × 0.02 and Watson = Total reads × 0.98. Best fit genotype = maximum p-value derived from all three tests. b, c) Fisher Exact Test and Chi Square Test plots of max p-value (0.0 to 1.0) against read counts (0 to 1000 and 1000 to 0); percentages printed below the plots: 12.5% / 75% / 12.5% and 21% / 58% / 21%.]

To calculate allelic frequencies in the population of cells, the proportions of genotyped cells with a wildtype, heterozygous or homozygous state were tabulated for each ROI. At diploid alleles (i.e. those on autosomes and female chrX), frequencies were calculated as $p^2 + 2pq + q^2 = 1$ [10, 135]. Therefore the wildtype allele frequency was found as wtFreq = [2(wildtype cells) + heterozygous cells] / [2(total cells)], and the inverted allele frequency was invFreq = [2(homozygous cells) + heterozygous cells] / [2(total cells)]. At monoploid alleles (i.e. those on the sex chromosomes of males) the frequency was calculated as $p + q = 1$. Therefore the wildtype allele frequency was wtFreq = wildtype cells / total cells, and the inverted allele frequency was invFreq = homozygous cells / total cells. For ROIs present on chrX, the frequencies of the males and females were combined as $p = \tfrac{2}{3}p_{female} + \tfrac{1}{3}p_{male}$.

Autosomal ROIs were tested for Hardy-Weinberg equilibrium (HWE) using the HWExact test (HardyWeinberg package (v1.5.4) [160]), and found to be in HWE when p > 0.05 [10, 11]. ROIs were classified by counting the frequency of cells with a heterozygous or homozygous state. If a minimum of ten cells showed a heterozygous frequency ≥ 80% the region was defined as AWC (Always WC), whereas if they had a homozygous frequency ≥ 80% it was defined as a potential misorient or minor allele. If there were fewer than ten cells at an ROI with ≥ 80% homozygous or heterozygous frequency, it was not classified.
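The best-fit genotyping and allele-frequency calculations described above can be sketched in Python. This is a hedged illustration, not the Invert.R code: the 2×2 table layout (observed W/C counts against rounded expected W/C counts), the rounding of expected counts, and all function names are assumptions. `fisher_two_sided` is a pure-Python stand-in for the two-sided p-value that R's `fisher.test` reports for a 2×2 table. Strand-state labels (WW, WC, CC) are returned; which of them is "wildtype" depends on the template state of the chromosome, as in the text.

```python
from math import comb


def fisher_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of seeing x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum every table that is no more likely than the observed one.
    total = sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-7))
    return min(1.0, total)


def best_fit_genotype(w, c, bg=0.02, min_reads=20):
    """Return (state, p_value) for the expected state with the highest p-value."""
    total = w + c
    if total < min_reads:
        return None  # too few reads in the ROI to genotype
    expected_w_fraction = {"WW": 1 - bg, "WC": 0.5, "CC": bg}
    p_values = {}
    for state, fraction in expected_w_fraction.items():
        exp_w = round(total * fraction)  # rounding is an assumption of this sketch
        p_values[state] = fisher_two_sided(w, c, exp_w, total - exp_w)
    best = max(p_values, key=p_values.get)
    return best, p_values[best]


def diploid_frequencies(n_wildtype, n_het, n_hom):
    """(wtFreq, invFreq) at a diploid locus, counting two alleles per cell."""
    alleles = 2 * (n_wildtype + n_het + n_hom)
    return (2 * n_wildtype + n_het) / alleles, (2 * n_hom + n_het) / alleles


def combined_chrx_frequency(p_female, p_male):
    """Combine female and male chrX allele frequencies as 2/3 and 1/3."""
    return (2 / 3) * p_female + (1 / 3) * p_male


# Worked example from the text: 100 reads at an ROI, 98 W and 2 C.
state, p_value = best_fit_genotype(98, 2)
wt_freq, inv_freq = diploid_frequencies(6, 3, 1)  # hypothetical cell counts
```

The additional requirement that the best fit be "significantly different from the other two states" is not modelled here, since the text does not specify how that comparison was made.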
Polymorphisms were identified as ROIs where at least two cells showed different allelic states. To generate clustered heat maps of the polymorphisms (using the heatmap.2 function of gplots (v2.14.2) [161]), ROIs were subdivided based on chromosome, and a distance matrix of genotyped cells was calculated by the Manhattan method (dist function) and hierarchically clustered by Ward’s method (hclust function). To generate cell-by-cell heat maps of all ROIs, cells were clustered by the daisy pairwise dissimilarity method in cluster (v1.15.3) [162].

For single donor invertomes, ROIs identified by Invert.R were refined by removing regions overlapping the AWC regions and sequence gaps in the reference assembly, using the BEDtools [131] genomeCoverageBed function. For the male invertome, ROIs on chrY were manually refined, and any falling within the PARs were removed. To genotype the refined ROIs in Invert.R, background (bg) was set to 0.1 for both single donors, and minReads was set to 100 for the male and 50 for the female, to account for the different read densities in the composite files (the final average reads/Mb was 3311 for the male, versus 1444 for the female). To generate Circos (v0.76) [146] plots for each chromosome, data were formatted to include the Invert.R histograms of the W/C ratios (male in blue, female in pink) and genotyped invertomes for each individual (heterozygous inversions in light green, and homozygous inversions in dark green), the classified ROIs identified in the pooled donor population (inner ring; AWCs in blue, misorients or minor alleles in red, and polymorphic inversions in orange), all inversions listed in the DGV (outer fuchsia bars), and RefSeq genes listed in the UCSC Genome Browser (outer-most grey bars). Intra-chromosomal segmental duplications were added as links, subdivided as palindromic (dark purple) or non-palindromic (grey).
One plot per chromosome was generated and compiled into a single .pdf (Appendix B).

To compare inversion predictions between different datasets we used a two-step approach by implementing the findOverlaps tool of GenomicRanges (v2.14) [163], with the minimum overlap set to 1 kb (illustrated in Fig. 7-6). To interrogate segmental duplications, the ‘Segmental Dups’ track was downloaded from the ‘Repeats’ group of the UCSC Genome Table Browser [164]. The track was filtered for intra-chromosomal entries (i.e. the duplicated region fell on the same chromosome) and then subdivided into non-palindromic (duplicated region in same orientation) and palindromic (duplicated region in inverted orientation) segmental duplications. The Table Browser was also used to extract the ‘RefSeq Genes’ track from the ‘Genes and Gene Predictions’ group. Inversions reported in the DGV on the GRCh37/hg19 genome assembly were downloaded from the DGV database [3]. Inversions listed in the InvFEST [73] database are only aligned to NCBI36/hg18 and therefore were lifted to GRCh37/hg19 using the UCSC liftOver tool.

Figure 7-6 | Strategy for refining files using BEDtools. To refine a file and remove any entries that overlap with genomic positions listed in another file, BEDtools can be used in a two-step process. Step 1: First the input file, such as an ROI file generated by Invert.R, is compared against a subtraction file, which may contain reference assembly gaps or Always Watson Crick (AWC) regions. Using the genomeCoverageBed function, a new file containing every genomic range in the two files is generated, where each range is given a coverage value of either 1 (present in only one file) or 2 (present in both files). Pulling out the ranges with a value of 2 will generate an overlap file. Step 2: The overlap file is compared with the original input file, again using genomeCoverageBed.
By pulling out the ranges that have a value of 1, a refined file is created that redefines the genomic ranges and identifies those that are present only in the input file and not the subtraction file.

[Figure 7-6, panel content. Step 1: Find overlaps: compare the input file with the subtraction file using BEDtools genomeCoverageBed, and extract ranges scored as 2 into an overlap file. Step 2: Remove overlaps: compare the input file with the overlap file using BEDtools genomeCoverageBed, and extract ranges scored as 1 into the refined file.]

To assess the genomic architecture of inversions at the base pair level and visualize the surrounding segmental duplications not present in the UCSC track, the entire inversion was self-aligned. For this, the inversion plus 200 kb of sequence upstream and downstream was pulled from the UCSC Genome Browser (GRCh37/hg19) by querying the ‘mapping and sequence’ group and ‘assembly’ track from the Table Browser. These sequences were self-aligned in a pairwise fashion using lastz (step = 20, seed match = 12, exact = 20, identity = 90 using the gapped, no chain and no transition options). Lastz output files were then used to generate dot plots in R [165]. We plotted the ROI locations (female invertome in pink, male invertome in blue and pooled donor polymorphisms in green) along with additional tracks (reference sequence gaps in black and DGV inversions in fuchsia) as overlays onto these dot plots, which were compiled into a single .pdf (Appendix C).

References

1. Stankiewicz, P. and J.R. Lupski, Structural variation in the human genome and its role in disease. Annu Rev Med, 2010. 61: p. 437-55. 2. McClellan, J. and M.C. King, Genetic heterogeneity in human disease. Cell, 2010. 141(2): p. 210-7. 3. MacDonald, J.R., et al., The Database of Genomic Variants: a curated collection of structural variation in the human genome. Nucleic Acids Res, 2014. 42(Database issue): p. D986-92.
4. Walsh, T. and M.C. King, Ten genes for inherited breast cancer. Cancer Cell, 2007. 11(2): p. 103-5. 5. Santoro, M., et al., Different mutations of the RET gene cause different human tumoral diseases. Biochimie, 1999. 81(4): p. 397-402. 6. Ott, J., J. Wang, and S.M. Leal, Genetic linkage analysis in the age of whole-genome sequencing. Nat Rev Genet, 2015. 16(5): p. 275-84. 7. Gleave, M., Re: Cumulative association of five genetic variants with prostate cancer. Eur Urol, 2008. 54(2): p. 460-1. 8. Bush, W.S. and J.H. Moore, Chapter 11: Genome-wide association studies. PLoS Comput Biol, 2012. 8(12): p. e1002822. 9. Slatkin, M., Linkage disequilibrium--understanding the evolutionary past and mapping the medical future. Nat Rev Genet, 2008. 9(6): p. 477-85. 10. Lewis, C.M. and J. Knight, Introduction to genetic association studies. Cold Spring Harb Protoc, 2012. 2012(3): p. 297-306. 11. Relethford, J.H. and H.R. M., Population Genetics of Modern Human Evolution. Encylopedia of Life Sciences. 2001: Macmillan Publishers Ltd. Nature Publishing Group. 12. Conrad, D.F., et al., Origins and functional impact of copy number variation in the human genome. Nature, 2010. 464(7289): p. 704-12. 13. Pang, A.W., et al., Towards a comprehensive structural variation map of an individual human genome. Genome Biol, 2010. 11(5): p. R52. 14. Sindi, S.S. and B.J. Raphael, Identification of structural variation, in Genome Analysis: Current Procedures and Applications, M.S. Poptsova, Editor. 2014, Caister Academic Press: online. 15. Lupski, J.R., et al., DNA duplication associated with Charcot-Marie-Tooth disease type 1A. Cell, 1991. 66(2): p. 219-32. 16. Baker, M., Structural variation: the genome's hidden architecture. Nat Methods, 2012. 9(2): p. 133-7. 17. Lupski, J.R., Genomic disorders: structural features of the genome can lead to DNA rearrangements and human disease traits. Trends Genet, 1998. 14(10): p. 417-22. 18. Gu, W., F. Zhang, and J.R. 
Lupski, Mechanisms for human genomic rearrangements. Pathogenetics, 2008. 1(1): p. 4. 19. Palotie, A., E. Widen, and S. Ripatti, From genetic discovery to future personalized health research. N Biotechnol, 2013. 30(3): p. 291-5. 20. Ziegler, A., et al., Personalized medicine using DNA biomarkers: a review. Hum Genet, 2012. 131(10): p. 1627-38. 21. Hu, L., et al., Fluorescence in situ hybridization (FISH): an increasingly demanded tool for biomarker research and personalized medicine. Biomark Res, 2014. 2(1): p. 3. 22. Macaulay, I.C. and T. Voet, Single cell genomics: advances and future perspectives. PLoS Genet, 2014. 10(1): p. e1004126. 23. Lupski, J.R., Genomic rearrangements and sporadic disease. Nat Genet, 2007. 39(7 Suppl): p. S43-7. 24. Voet, T., et al., Single-cell paired-end genome sequencing reveals structural variation per cell cycle. Nucleic Acids Res, 2013. 41(12): p. 6119-38. 25. Aguilera, A. and B. Gomez-Gonzalez, Genome instability: a mechanistic view of its causes and consequences. Nat Rev Genet, 2008. 9(3): p. 204-17. 26. Negrini, S., V.G. Gorgoulis, and T.D. Halazonetis, Genomic instability--an evolving hallmark of cancer. Nat Rev Mol Cell Biol, 2010. 11(3): p. 220-8. 27. Caldas, C., Cancer sequencing unravels clonal evolution. Nat Biotechnol, 2012. 30(5): p. 408-10. 28. Greaves, M. and C.C. Maley, Clonal evolution in cancer. Nature, 2012. 481(7381): p. 306-13. 29. Biesecker, L.G. and N.B. Spinner, A genomic view of mosaicism and human disease. Nat Rev Genet, 2013. 14(5): p. 307-20. 30. Savelyeva, L. and L.M. Brueckner, Molecular characterization of common fragile sites as a strategy to discover cancer susceptibility genes. Cell Mol Life Sci, 2014. 71(23): p. 4561-75. 31. Liu, P., et al., Mechanisms for recurrent and complex human genomic rearrangements. Curr Opin Genet Dev, 2012. 22(3): p. 211-20. 32. Alkan, C., B.P. Coe, and E.E. Eichler, Genome structural variation discovery and genotyping. Nat Rev Genet, 2011. 12(5): p. 363-76. 33.
Schatz, M.C., A.L. Delcher, and S.L. Salzberg, Assembly of large genomes using second-generation sequencing. Genome Res, 2010. 20(9): p. 1165-73. 34. Li, H. and N. Homer, A survey of sequence alignment algorithms for next-generation sequencing. Brief Bioinform, 2010. 11(5): p. 473-83. 35. Quail, M.A., et al., A tale of three next generation sequencing platforms: comparison of Ion Torrent, Pacific Biosciences and Illumina MiSeq sequencers. BMC Genomics, 2012. 13: p. 341. 36. Kidd, J.M., et al., Mapping and sequencing of structural variation from eight human genomes. Nature, 2008. 453(7191): p. 56-64. 37. Korbel, J.O., et al., Paired-end mapping reveals extensive structural variation in the human genome. Science, 2007. 318(5849): p. 420-6. 38. Tuzun, E., et al., Fine-scale structural variation of the human genome. Nat Genet, 2005. 37(7): p. 727-32. 39. Feuk, L., Inversion variants in the human genome: role in disease and genome architecture. Genome Med, 2010. 2(2): p. 11. 40. Raphael, B.J., Chapter 6: Structural variation and medical genomics. PLoS Comput Biol, 2012. 8(12): p. e1002821. 41. Bentley, D.R., et al., Accurate whole human genome sequencing using reversible terminator chemistry. Nature, 2008. 456(7218): p. 53-9. 42. Wheeler, D.A., et al., The complete genome of an individual by massively parallel DNA sequencing. Nature, 2008. 452(7189): p. 872-6. 43. Wang, J., et al., The diploid genome sequence of an Asian individual. Nature, 2008. 456(7218): p. 60-5. 44. Xie, C. and M.T. Tammi, CNV-seq, a new method to detect copy number variation using high-throughput sequencing. BMC Bioinformatics, 2009. 10: p. 80. 45. Yoon, S., et al., Sensitive and accurate detection of copy number variants using read depth of coverage. Genome Res, 2009. 19(9): p. 1586-92. 46. Ye, K., et al., Pindel: a pattern growth approach to detect break points of large deletions and medium sized insertions from paired-end short reads. Bioinformatics, 2009. 25(21): p. 2865-71. 47.
Karakoc, E., et al., Detection of structural variants and indels within exome data. Nat Methods, 2012. 9(2): p. 176-8. 48. Emde, A.K., et al., Detecting genomic indel variants with exact breakpoints in single- and paired-end sequencing data using SplazerS. Bioinformatics, 2012. 28(5): p. 619-27. 49. Wang, J., et al., CREST maps somatic structural variation in cancer genomes with base-pair resolution. Nat Methods, 2011. 8(8): p. 652-4. 50. Rausch, T., et al., DELLY: structural variant discovery by integrated paired-end and split-read analysis. Bioinformatics, 2012. 28(18): p. i333-i339. 51. Mills, R.E., et al., Mapping copy number variation by population-scale genome sequencing. Nature, 2011. 470(7332): p. 59-65. 52. Kloosterman, W.P., et al., Characteristics of de novo structural changes in the human genome. Genome Res, 2015. 25(6): p. 792-801. 53. Medvedev, P., et al., Detecting copy number variation with mated short reads. Genome Res, 2010. 20(11): p. 1613-22. 54. Kehr, B., P. Melsted, and B.V. Halldorsson, PopIns: population-scale detection of novel sequence insertions. Bioinformatics, 2015. 55. Ritz, A., et al., Characterization of structural variants with single molecule and hybrid sequencing approaches. Bioinformatics, 2014. 30(24): p. 3458-66. 56. Antonacci, F., et al., Characterization of six human disease-associated inversion polymorphisms. Hum Mol Genet, 2009. 18(14): p. 2555-66. 57. Chaisson, M.J., et al., Resolving the complexity of the human genome using single-molecule sequencing. Nature, 2014. 58. Sturtevant, A.H., Genetic Factors Affecting the Strength of Linkage in Drosophila. Proc Natl Acad Sci U S A, 1917. 3(9): p. 555-8. 59. Sturtevant, A.H., A Case of Rearrangement of Genes in Drosophila. Proc Natl Acad Sci U S A, 1921. 7(8): p. 235-7. 60. Sturtevant, A.H., Reminiscences of T. H. Morgan. Genetics, 2001. 159(1): p. 1-5. 61. Dobzhansky, T. and A.H. Sturtevant, Inversions in the Chromosomes of Drosophila Pseudoobscura. Genetics, 1938. 23(1): p. 
28-64. 62. Hoffmann, A.A. and L.H. Rieseberg, Revisiting the Impact of Inversions in Evolution: From Population Genetic Markers to Drivers of Adaptive Shifts and Speciation? Annu Rev Ecol Evol Syst, 2008. 39: p. 21-42. 63. Kirkpatrick, M., How and why chromosome inversions evolve. PLoS Biol, 2010. 8(9). 64. Sturtevant, A.H. and G.W. Beadle, The Relations of Inversions in the X Chromosome of Drosophila Melanogaster to Crossing over and Disjunction. Genetics, 1936. 21(5): p. 554-604. 65. Sturtevant, A.H. and T. Dobzhansky, Inversions in the Third Chromosome of Wild Races of Drosophila Pseudoobscura, and Their Use in the Study of the History of the Species. Proc Natl Acad Sci U S A, 1936. 22(7): p. 448-50. 66. Alves, J.M., et al., On the structural plasticity of the human genome: chromosomal inversions revisited. Curr Genomics, 2012. 13(8): p. 623-32. 67. Le Beau, M.M., et al., Association of an inversion of chromosome 16 with abnormal marrow eosinophils in acute myelomonocytic leukemia. A unique cytogenetic-clinicopathological association. N Engl J Med, 1983. 309(11): p. 630-6. 68. Liu, P., et al., Fusion between transcription factor CBF beta/PEBP2 beta and a myosin heavy chain in acute myeloid leukemia. Science, 1993. 261(5124): p. 1041-4. 69. Pinkel, D., et al., Fluorescence in situ hybridization with human chromosome-specific libraries: detection of trisomy 21 and translocations of chromosome 4. Proc Natl Acad Sci U S A, 1988. 85(23): p. 9138-42. 70. Cremer, T. and M. Cremer, Chromosome territories. Cold Spring Harb Perspect Biol, 2010. 2(3): p. a003889. 71. Dixon, J.R., et al., Topological domains in mammalian genomes identified by analysis of chromatin interactions. Nature, 2012. 485(7398): p. 376-80. 72. Feuk, L., et al., Discovery of human inversion polymorphisms by comparative analysis of human and chimpanzee DNA sequence assemblies. PLoS Genet, 2005. 1(4): p. e56. 73.
Martinez-Fundichely, A., et al., InvFEST, a database integrating information of polymorphic inversions in the human genome. Nucleic Acids Res, 2014. 42(Database issue): p. D1027-32. 74. Alves, J.M., et al., The 8p23 inversion polymorphism determines local recombination heterogeneity across human populations. Genome Biol Evol, 2014. 6(4): p. 921-30. 75. Salm, M.P., et al., The origin, global distribution, and functional impact of the human 8p23 inversion polymorphism. Genome Res, 2012. 22(6): p. 1144-53. 76. Bosch, N., et al., Nucleotide, cytogenetic and expression impact of the human chromosome 8p23.1 inversion polymorphism. PLoS One, 2009. 4(12): p. e8269. 77. Zody, M.C., et al., Evolutionary toggling of the MAPT 17q21.31 inversion region. Nat Genet, 2008. 40(9): p. 1076-83. 78. Donnelly, M.P., et al., The distribution and most recent common ancestor of the 17q21 inversion in humans. Am J Hum Genet, 2010. 86(2): p. 161-71. 79. Cardone, M.F., et al., Hominoid chromosomal rearrangements on 17q map to complex regions of segmental duplication. Genome Biol, 2008. 9(2): p. R28. 80. Baker, M., et al., Association of an extended haplotype in the tau gene with progressive supranuclear palsy. Hum Mol Genet, 1999. 8(4): p. 711-5.   169 81. Lakich, D., et al., Inversions disrupting the factor VIII gene are a common cause of severe haemophilia A. Nat Genet, 1993. 5(3): p. 236-41. 82. Bondeson, M.L., et al., Inversion of the IDS gene resulting from recombination with IDS-related sequences is a common cause of the Hunter syndrome. Hum Mol Genet, 1995. 4(4): p. 615-21. 83. Small, K., J. Iber, and S.T. Warren, Emerin deletion reveals a common X-chromosome inversion mediated by inverted repeats. Nat Genet, 1997. 16(1): p. 96-9. 84. Hobart, H.H., et al., Inversion of the Williams syndrome region is a common polymorphism found more frequently in parents of children with Williams syndrome. Am J Med Genet C Semin Med Genet, 2010. 154C(2): p. 220-8. 85. 
Osborne, L.R., et al., A 1.5 million-base pair inversion polymorphism in families with Williams-Beuren syndrome. Nat Genet, 2001. 29(3): p. 321-5. 86. Sharp, A.J., et al., A recurrent 15q13.3 microdeletion syndrome associated with mental retardation and seizures. Nat Genet, 2008. 40(3): p. 322-8. 87. Lowther, C., et al., Delineating the 15q13.3 microdeletion phenotype: a case series and comprehensive review of the literature. Genet Med, 2015. 17(2): p. 149-57. 88. Koolen, D.A., et al., A new chromosome 17q21.31 microdeletion syndrome associated with a common inversion polymorphism. Nat Genet, 2006. 38(9): p. 999-1001. 89. Koolen, D.A., et al., Clinical and molecular delineation of the 17q21.31 microdeletion syndrome. J Med Genet, 2008. 45(11): p. 710-20. 90. Costa, D., et al., Refining the diagnosis and prognostic categorization of acute myeloid leukemia patients with an integrated use of cytogenetic and molecular studies. Acta Haematol, 2013. 129(2): p. 65-71. 91. Savic, S. and L. Bubendorf, Role of fluorescence in situ hybridization in lung cancer cytology. Acta Cytol, 2012. 56(6): p. 611-21. 92. Casaluce, F., et al., ALK inhibitors: a new targeted therapy in the treatment of advanced NSCLC. Target Oncol, 2013. 8(1): p. 55-67. 93. rk 94. Painter, T.S., A New Method for the Study of Chromosome Rearrangements and the Plotting of Chromosome Maps. Science, 1933. 78(2034): p. 585-6. 95. Painter, T.S., The Morphology of the X Chromosome in Salivary Glands of Drosophila Melanogaster and a New Type of Chromosome Map for This Element. Genetics, 1934. 19(5): p. 448-69. 96. Pardue, M.L. and J.G. Gall, Chromosomal localization of mouse satellite DNA. Science, 1970. 168(3937): p. 1356-8. 97. Arrighi, F.E. and T.C. Hsu, Localization of heterochromatin in human chromosomes. Cytogenetics, 1971. 10(2): p. 81-6. 98. Carr, D.H., Chromosomal anomalies with special reference to Klinefelter's syndrome. Trans Am Assoc Genitourin Surg, 1962. 54: p. 9-14. 99. 
Jacobs, P.A., et al., Pericentric inversion of a group C autosome: a study of three families. Ann Hum Genet, 1968. 31(3): p. 219-30.   170 100. Chandra, H.S. and D.A. Hungerford, An Aberrant Autosome (13-15) in a Human Female and Her Father, Both Apparently Normal. Cytogenetics, 1963. 2: p. 34-41. 101. Pettenati, M.J., et al., Paracentric inversions in humans: a review of 446 paracentric inversions with presentation of 120 new cases. Am J Med Genet, 1995. 55(2): p. 171-87. 102. Speicher, M.R. and N.P. Carter, The new cytogenetics: blurring the boundaries with molecular biology. Nat Rev Genet, 2005. 6(10): p. 782-92. 103. Dauwerse, J.G., et al., Rapid detection of chromosome 16 inversion in acute nonlymphocytic leukemia, subtype M4: regional localization of the breakpoint in 16p. Cytogenet Cell Genet, 1990. 53(2-3): p. 126-8. 104. Schwartz, D.C., et al., Ordered restriction maps of Saccharomyces cerevisiae chromosomes constructed by optical mapping. Science, 1993. 262(5130): p. 110-4. 105. Neely, R.K., J. Deen, and J. Hofkens, Optical mapping of DNA: single-molecule-based methods for mapping genomes. Biopolymers, 2011. 95(5): p. 298-311. 106. Howe, K. and J.M. Wood, Using optical mapping data for the improvement of vertebrate genome assemblies. Gigascience, 2015. 4: p. 10. 107. Teague, B., et al., High-resolution human genome structure by single-molecule analysis. Proc Natl Acad Sci U S A, 2010. 107(24): p. 10848-53. 108. Youings, S., et al., A study of reciprocal translocations and inversions detected by light microscopy with special reference to origin, segregation, and recurrent abnormalities. Am J Med Genet A, 2004. 126A(1): p. 46-60. 109. Falconer, E., et al., DNA template strand sequencing of single-cells maps genomic rearrangements at high resolution. Nat Methods, 2012. 9(11): p. 1107-12. 110. Hills, M., et al., BAIT: Organizing genomes and mapping rearrangements in single cells. Genome Med, 2013. 5(9): p. 82. 111. 
Illumina, An introduction to Next-generation sequencing technology. 2015. 112. Wilson, D.M., 3rd and L.H. Thompson, Molecular mechanisms of sister-chromatid exchange. Mutat Res, 2007. 616(1-2): p. 11-23. 113. Becher, R., et al., Spontaneous sister chromatid exchange in normal bone marrow and Ph-positive chronic myelocytic leukemia. Cancer Res, 1988. 48(3): p. 745-50. 114. Ozturk, S., et al., Sister chromatid exchange frequency in B-cells stimulated by TPA in chronic lymphocytic leukemia. Cancer Genet Cytogenet, 2000. 123(1): p. 49-51. 115. Turkez, H., K. Celik, and B. Togar, Effects of copaene, a tricyclic sesquiterpene, on human lymphocytes cells in vitro. Cytotechnology, 2014. 66(4): p. 597-603. 116. Bishop, A.J. and R.H. Schiestl, Role of homologous recombination in carcinogenesis. Exp Mol Pathol, 2003. 74(2): p. 94-105. 117. Hollox, E.J., et al., Defensins and the dynamic genome: what we can learn from structural variation at human chromosome band 8p23.1. Genome Res, 2008. 18(11): p. 1686-97. 118. Tam, E., et al., The common inversion of the Williams-Beuren syndrome region at 7q11.23 does not cause clinical symptoms. Am J Med Genet A, 2008. 146A(14): p. 1797-806.   171 119. Kent, W.J., et al., The human genome browser at UCSC. Genome Res, 2002. 12(6): p. 996-1006. 120. Dittwald, P., et al., NAHR-mediated copy-number variants in a clinical population: mechanistic insights into both genomic disorders and Mendelizing traits. Genome Res, 2013. 23(9): p. 1395-409. 121. Ben Salah, G., et al., A novel frameshift mutation in BLM gene associated with high sister chromatid exchanges (SCE) in heterozygous family members. Mol Biol Rep, 2014. 41(11): p. 7373-80. 122. El Ghamrasni, S., et al., Cooperation of Blm and Mus81 in development, fertility, genomic integrity and cancer suppression. Oncogene, 2015. 34(14): p. 1780-9. 123. van Heesch, S., et al., Genomic and functional overlap between somatic and germline chromosomal rearrangements. Cell Rep, 2014. 9(6): p. 2001-10. 
124. Hormozdiari, F., et al., Combinatorial algorithms for structural variation detection in high-throughput sequenced genomes. Genome Res, 2009. 19(7): p. 1270-8. 125. Korbel, J.O., et al., PEMer: a computational framework with simulation-based error models for inferring genomic structural variants from massive paired-end sequencing data. Genome Biol, 2009. 10(2): p. R23. 126. Chen, K., et al., BreakDancer: an algorithm for high-resolution mapping of genomic structural variation. Nat Methods, 2009. 6(9): p. 677-81. 127. Wu, T.D. and S. Nacu, Fast and SNP-tolerant detection of complex variants and splicing in short reads. Bioinformatics, 2010. 26(7): p. 873-81. 128. R Core Team, R: A language and environment for statistical computing, 2013, R Foundation for Statistical Computing: Vienna, Austria. 129. Perez Jurado, L.A., et al., A duplicated gene in the breakpoint regions of the 7q11.23 Williams-Beuren syndrome deletion encodes the initiator binding protein TFII-I and BAP-135, a phosphorylation target of BTK. Hum Mol Genet, 1998. 7(3): p. 325-34. 130. Bailey, J.A., et al., Recent segmental duplications in the human genome. Science, 2002. 297(5583): p. 1003-7. 131. Quinlan, A.R. and I.M. Hall, BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics, 2010. 26(6): p. 841-2. 132. Stefansson, H., et al., A common inversion under selection in Europeans. Nat Genet, 2005. 37(2): p. 129-37. 133. Antonarakis, S.E., et al., Factor VIII gene inversions in severe hemophilia A: results of an international consortium study. Blood, 1995. 86(6): p. 2206-12. 134. Shaw-Smith, C., et al., Microdeletion encompassing MAPT at chromosome 17q21.3 is associated with developmental delay and learning disability. Nat Genet, 2006. 38(9): p. 1032-7. 135. Salanti, G., et al., Hardy-Weinberg equilibrium in genetic association studies: an empirical evaluation of reporting, deviations, and power. Eur J Hum Genet, 2005. 13(7): p. 840-8. 136. 
Andries, V., et al., NBPF1, a tumor suppressor candidate in neuroblastoma, exerts growth inhibitory effects by inducing a G1 cell cycle arrest. BMC Cancer, 2015. 15(1): p. 391.   172 137. Dumas, L.J., et al., DUF1220-domain copy number implicated in human brain-size pathology and evolution. Am J Hum Genet, 2012. 91(3): p. 444-54. 138. Vandepoele, K., et al., A novel gene family NBPF: intricate structure generated by gene duplications during primate evolution. Mol Biol Evol, 2005. 22(11): p. 2265-74. 139. Samonte, R.V. and E.E. Eichler, Segmental duplications and the evolution of the primate genome. Nat Rev Genet, 2002. 3(1): p. 65-72. 140. Kim, E.K. and E.J. Choi, Pathological roles of MAPK signaling pathways in human diseases. Biochim Biophys Acta, 2010. 1802(4): p. 396-405. 141. Ng, C.C., et al., Isolation and characterization of a novel TP53-inducible gene, TP53TG3. Genes Chromosomes Cancer, 1999. 26(4): p. 329-35. 142. Wang, J. and S. Shete, Testing departure from Hardy-Weinberg proportions. Methods Mol Biol, 2012. 850: p. 77-102. 143. Turner, D.J., et al., Assaying chromosomal inversions by single-molecule haplotyping. Nat Methods, 2006. 3(6): p. 439-45. 144. Bailey, J.A., et al., Hotspots of mammalian chromosomal evolution. Genome Biol, 2004. 5(4): p. R23. 145. Kidd, J.M., et al., A human genome structural variation sequencing resource reveals insights into mutational mechanisms. Cell, 2010. 143(5): p. 837-47. 146. Krzywinski, M., et al., Circos: an information aesthetic for comparative genomics. Genome Res, 2009. 19(9): p. 1639-45. 147. Hoppman-Chaney, N., et al., Identification of single gene deletions at 15q13.3: further evidence that CHRNA7 causes the 15q13.3 microdeletion syndrome phenotype. Clin Genet, 2013. 83(4): p. 345-51. 148. Emanuel, B.S. and T.H. Shaikh, Segmental duplications: an 'expanding' role in genomic instability and disease. Nat Rev Genet, 2001. 2(10): p. 791-800. 149. Parks, M.M., C.E. Lawrence, and B.J. 
Raphael, Detecting non-allelic homologous recombination from high-throughput sequencing data. Genome Biol, 2015. 16: p. 72. 150. Eid, J., et al., Real-time DNA sequencing from single polymerase molecules. Science, 2009. 323(5910): p. 133-8. 151. Chaisson, M.J. and G. Tesler, Mapping single molecule sequencing reads using basic local alignment with successive refinement (BLASR): application and theory. BMC Bioinformatics, 2012. 13: p. 238. 152. Huddleston, J., et al., Reconstructing complex regions of genomes using long-read sequencing technology. Genome Res, 2014. 24(4): p. 688-96. 153. Lupski, J.R., Genetics. Genome mosaicism--one human, multiple genomes. Science, 2013. 341(6144): p. 358-9. 154. Notta, F., et al., Isolation of single human hematopoietic stem cells capable of long-term multilineage engraftment. Science, 2011. 333(6039): p. 218-21. 155. Mayani, H., W. Dragowska, and P.M. Lansdorp, Cytokine-induced selective expansion and maturation of erythroid versus myeloid progenitors from purified cord blood precursor cells. Blood, 1993. 81(12): p. 3252-8. 156. Brummendorf, T.H., et al., Asymmetric cell divisions sustain long-term hematopoiesis from single-sorted human fetal liver cells. J Exp Med, 1998. 188(6): p. 1117-24.   173 157. Latt, S.A., Y.S. George, and J.W. Gray, Flow cytometric analysis of bromodeoxyuridine-substituted cells stained with 33258 Hoechst. J Histochem Cytochem, 1977. 25(7): p. 927-34. 158. Kubbies, M. and P.S. Rabinovitch, Flow cytometric analysis of factors which influence the BrdUrd-Hoechst quenching effect in cultivated human fibroblasts and lymphocytes. Cytometry, 1983. 3(4): p. 276-81. 159. Breusegem, S.Y., R.M. Clegg, and F.G. Loontiens, Base-sequence specificity of Hoechst 33258 and DAPI binding to five (A/T)4 DNA sites with kinetic evidence for more than one high-affinity Hoechst 33258-AATT complex. J Mol Biol, 2002. 315(5): p. 1049-61. 160. Graffelman , J., HardyWeinberg: Graphical tests for Hardy-Weinberg equilibrium. 
R package, 2014. v1.5.4.
161. Warnes, G.R., et al., gplots: Various R programming tools for plotting data. R package, 2014. v2.14.2.
162. Maechler, M., Rousseeuw, P., Struyf, A., Hubert, M., Hornik, K., cluster: Cluster Analysis Basics and Extensions. R package, 2014. v1.15.3.
163. Lawrence, M., et al., Software for computing and annotating genomic ranges. PLoS Comput Biol, 2013. 9(8): p. e1003118.
164. Karolchik, D., et al., The UCSC Genome Browser database: 2014 update. Nucleic Acids Res, 2014. 42(Database issue): p. D764-70.
165. R Core Team, R: A language and environment for statistical computing, 2013, R Foundation for Statistical Computing: Vienna, Austria.

Appendix A | Library statistics

Dataset  Cell ID  Sex  Total reads  Reads >q10  Unique reads (>q10)  Reads/Mb* (unique >q10)  Sequence coverage (x)§
male_BM  HsSs_0001  m  3,740,836  154,547  129,932  41.97  0.0063
male_BM  HsSs_0002  m  5,409,986  85,369  76,724  24.78  0.0037
male_BM  HsSs_0003  m  3,560,862  76,525  62,156  20.08  0.0030
male_BM  HsSs_0004  m  2,933,044  189,157  139,043  44.92  0.0067
male_BM  HsSs_0005  m  3,124,980  266,470  202,089  65.28  0.0098
male_BM  HsSs_0006  m  2,708,160  379,636  252,860  81.68  0.0123
male_BM  HsSs_0007  m  3,561,872  113,430  94,661  30.58  0.0046
male_BM  HsSs_0008  m  3,670,846  276,873  205,831  66.49  0.0100
male_BM  HsSs_0009  m  3,186,156  276,148  206,589  66.73  0.0100
male_BM  HsSs_0010  m  2,950,676  119,867  95,997  31.01  0.0047
male_BM  HsSs_0011  m  1,655,212  262,435  190,229  61.45  0.0092
male_BM  HsSs_0012  m  5,079,980  688,614  546,484  176.53  0.0265
male_BM  HsSs_0013  m  2,902,610  245,509  191,830  61.97  0.0093
male_BM  HsSs_0014  m  2,350,010  94,941  80,488  26.00  0.0039
male_BM  HsSs_0015  m  2,660,220  347,009  251,521  81.25  0.0122
male_BM  HsSs_0016  m  2,425,916  255,197  192,917  62.32  0.0093
male_BM  HsSs_0017  m  3,289,708  288,495  212,844  68.76  0.0103
male_BM  HsSs_0018  m  4,160,476  116,520  97,616  31.53  0.0047
male_BM  HsSs_0019  m  3,368,034  141,340  119,337  38.55  0.0058
male_BM  HsSs_0020  m  2,222,734  180,272  137,207  44.32  0.0066
male_BM  HsSs_0021  m  2,160,058  187,688  150,023  48.46  0.0073
male_BM  HsSs_0022  m  4,367,118  192,091  156,996  50.71  0.0076
male_BM  HsSs_0023  m  2,890,530  413,362  313,461  101.26  0.0000
male_BM  HsSs_0024  m  3,410,768  174,093  131,732  42.55  0.0064
male_BM  HsSs_0025  m  2,309,298  480,488  353,989  114.35  0.0172
male_BM  HsSs_0026  m  1,497,692  188,257  152,487  49.26  0.0074
male_BM  HsSs_0027  m  5,884,266  1,166,261  828,584  267.66  0.0401
male_BM  HsSs_0028  m  3,947,988  804,257  590,771  190.84  0.0286
male_BM  HsSs_0029  m  2,764,758  276,284  229,390  74.10  0.0111
male_BM  HsSs_0030  m  4,411,270  1,819,247  1,059,122  342.13  0.0513
male_BM  HsSs_0031  m  2,894,368  394,969  294,778  95.22  0.0143
male_BM  HsSs_0032  m  4,487,454  819,699  576,606  186.26  0.0279
male_BM  HsSs_0033  m  3,929,592  572,199  442,487  142.94  0.0214
male_BM  HsSs_0034  m  1,809,842  464,886  371,292  119.94  0.0180
male_BM  HsSs_0035  m  3,040,694  487,797  345,227  111.52  0.0167
male_BM  HsSs_0036  m  3,124,204  566,402  419,762  135.60  0.0203
male_BM  HsSs_0037  m  2,406,918  689,927  490,500  158.45  0.0238
male_BM  HsSs_0038  m  2,486,104  609,709  423,637  136.85  0.0205
male_BM  HsSs_0039  m  3,911,776  470,825  360,718  116.52  0.0175
male_BM  HsSs_0040  m  4,155,390  1,074,935  727,574  235.03  0.0353
male_BM  HsSs_0041  m  2,660,304  613,452  405,779  131.08  0.0197
male_BM  HsSs_0042  m  2,239,808  753,628  479,417  154.87  0.0232
male_BM  HsSs_0043  m  2,128,534  566,711  440,963  142.44  0.0214
male_BM  HsSs_0044  m  2,539,392  176,919  141,256  45.63  0.0068
male_BM  HsSs_0045  m  2,664,684  353,955  276,606  89.35  0.0134
male_BM  HsSs_0046  m  3,357,334  321,363  230,408  74.43  0.0112
male_BM  HsSs_0047  m  2,006,116  204,525  135,195  43.67  0.0066
male_BM  HsSs_0048  m  2,207,934  73,554  59,371  19.18  0.0029
male_BM  HsSs_0049  m  2,344,736  560,994  368,401  119.00  0.0179
male_BM  HsSs_0050  m  1,781,986  291,142  185,685  59.98  0.0090
male_BM  HsSs_0051  m  2,594,300  341,323  203,196  65.64  0.0098
male_BM  HsSs_0052  m  2,882,750  665,016  397,865  128.52  0.0193
male_BM  HsSs_0053  m  2,033,608  192,801  117,888  38.08  0.0057
male_BM  HsSs_0054  m  801,416  13,654  11,654  3.76  0.0006
male_BM  HsSs_0055  m  1,885,894  131,183  99,583  32.17  0.0048
male_BM  HsSs_0056  m  1,154,680  352,218  205,960  66.53  0.0100
male_BM  HsSs_0057  m  1,633,140  350,574  209,403  67.64  0.0101
male_BM  HsSs_0058  m  1,695,202  71,187  50,349  16.26  0.0024
male_BM  HsSs_0059  m  1,554,192  213,952  121,779  39.34  0.0059
male_BM  HsSs_0060  m  1,585,698  134,865  73,378  23.70  0.0036
male_BM  HsSs_0061  m  1,872,184  219,626  146,912  47.46  0.0071
male_BM  HsSs_0062  m  2,012,922  52,131  38,318  12.38  0.0019
male_BM  HsSs_0063  m  2,287,680  634,426  384,142  124.09  0.0186
male_BM  HsSs_0064  m  1,513,502  352,881  234,621  75.79  0.0114
male_BM  HsSs_0065  m  1,321,442  303,166  199,099  64.32  0.0096
male_BM  HsSs_0066  m  1,675,920  89,802  64,387  20.80  0.0031
male_BM  HsSs_0067  m  1,290,810  590,561  320,681  103.59  0.0155
male_BM  HsSs_0068  m  1,671,504  507,320  283,549  91.60  0.0137
male_BM  HsSs_0069  m  2,493,198  47,359  38,622  12.48  0.0019
male_BM  HsSs_0070  m  1,651,614  172,662  123,618  39.93  0.0060
male_BM  HsSs_0071  m  2,099,508  279,647  196,505  63.48  0.0095
male_BM  HsSs_0072  m  2,643,298  809,220  480,287  155.15  0.0233
male_BM  HsSs_0073  m  1,966,364  558,322  319,720  103.28  0.0155
male_BM  HsSs_0074  m  2,869,082  116,974  92,548  29.90  0.0045
male_BM  HsSs_0075  m  2,004,360  276,239  203,382  65.70  0.0099
male_BM  HsSs_0076  m  1,622,332  69,701  51,353  16.59  0.0025
male_BM  HsSs_0077  m  1,579,710  172,693  128,077  41.37  0.0062
male_BM  HsSs_0078  m  1,526,528  314,800  199,016  64.29  0.0096
male_BM  HsSs_0079  m  1,248,356  489,583  212,902  68.77  0.0103
male_BM  HsSs_0080  m  936,648  208,038  131,409  42.45  0.0064
male_BM  HsSs_0081  m  1,950,184  436,598  281,468  90.92  0.0136
male_BM  HsSs_0082  m  738,264  156,711  98,384  31.78  0.0048
male_BM  HsSs_0083  m  774,082  17,610  11,678  3.77  0.0006
male_BM  HsSs_0084  m  2,370,238  182,742  147,197  47.55  0.0071
male_BM  HsSs_0085  m  1,470,004  414,924  225,332  72.79  0.0109
male_BM  HsSs_0086  m  1,612,034  284,220  189,018  61.06  0.0092
male_BM  HsSs_0087  m  1,304,970  140,625  80,609  26.04  0.0039
male_BM  HsSs_0088  m  1,372,858  252,436  132,828  42.91  0.0064
male_BM  HsSs_0089  m  1,858,568  153,387  105,572  34.10  0.0051
male_BM  HsSs_0090  m  1,448,230  253,050  178,876  57.78  0.0087
male_BM  HsSs_0091  m  2,100,952  365,051  272,892  88.15  0.0132
male_BM  HsSs_0092  m  1,044,842  201,750  111,104  35.89  0.0054
male_BM  HsSs_0093  m  1,305,564  455,311  220,320  71.17  0.0107
male_BM  HsSs_0094  m  795,248  174,275  96,426  31.15  0.0047
male_BM  HsSs_0095  m  899,464  85,629  59,525  19.23  0.0029
male_BM  HsSs_0096  m  2,563,508  316,259  207,186  66.93  0.0100
male_BM  HsSs_0097  m  1,415,860  78,711  61,472  19.86  0.0030
male_BM  HsSs_0098  m  1,389,660  240,741  105,437  34.06  0.0051
male_BM  HsSs_0099  m  1,951,420  195,716  151,862  49.06  0.0074
male_BM  HsSs_0100  m  1,863,568  51,005  43,058  13.91  0.0021
male_BM  HsSs_0101  m  931,290  101,514  77,476  25.03  0.0038
male_BM  HsSs_0102  m  2,344,218  220,891  171,064  55.26  0.0083
male_BM  HsSs_0103  m  1,749,360  198,122  129,905  41.96  0.0063
male_BM  HsSs_0104  m  1,463,904  158,875  101,624  32.83  0.0049
male_BM  HsSs_0105  m  660,178  56,398  46,241  14.94  0.0022
male_BM  HsSs_0106  m  1,064,148  45,910  34,541  11.16  0.0017
male_BM  HsSs_0107  m  3,120,062  1,055,775  664,248  214.57  0.0322
male_BM  HsSs_0108  m  2,129,084  80,761  63,534  20.52  0.0031
male_BM  HsSs_0109  m  1,505,536  462,125  302,442  97.70  0.0147
male_BM  HsSs_0110  m  786,500  137,624  82,914  26.78  0.0040
male_BM  HsSs_0111  m  1,555,528  139,426  98,601  31.85  0.0048
male_BM  HsSs_0112  m  1,483,030  383,239  244,835  79.09  0.0119
male_BM  HsSs_0113  m  1,334,868  199,902  112,337  36.29  0.0054
male_BM  HsSs_0114  m  1,837,932  28,621  23,756  7.67  0.0012
male_BM  HsSs_0115  m  2,348,464  53,458  41,901  13.54  0.0020
male_BM  HsSs_0116  m  1,241,836  200,651  109,960  35.52  0.0053
male_BM  HsSs_0117  m  1,390,900  181,964  109,601  35.40  0.0053
male_BM  HsSs_0118  m  1,151,874  145,116  103,758  33.52  0.0050
male_BM  HsSs_0119  m  1,398,818  267,738  136,970  44.25  0.0066
male_BM  HsSs_0120  m  3,073,384  113,934  91,345  29.51  0.0044
male_BM  HsSs_0121  m  1,153,118  624,789  347,561  112.27  0.0168
male_BM  HsSs_0122  m  1,905,816  219,966  170,345  55.03  0.0083
male_BM  HsSs_0123  m  2,893,982  541,144  367,848  118.83  0.0178
male_BM  HsSs_0124  m  2,290,076  302,914  200,921  64.90  0.0097
male_BM  HsSs_0125  m  3,019,926  161,625  116,965  37.78  0.0057
male_BM  HsSs_0126  m  2,148,420  118,565  77,887  25.16  0.0038
male_BM  HsSs_0127  m  3,179,258  199,102  142,334  45.98  0.0069
male_BM  HsSs_0128  m  1,964,726  120,334  94,385  30.49  0.0046
male_BM  HsSs_0129  m  1,540,338  142,681  88,092  28.46  0.0043
male_BM  HsSs_0130  m  1,371,144  310,226  190,666  61.59  0.0092
male_BM  HsSs_0131  m  2,573,754  185,388  145,275  46.93  0.0070
male_BM  HsSs_0132  m  2,211,436  161,438  116,207  37.54  0.0056
male_BM  HsSs_0133  m  1,999,980  428,921  261,129  84.35  0.0127
male_BM  HsSs_0134  m  2,695,826  130,549  97,984  31.65  0.0047
male_BM  HsSs_0135  m  1,745,220  732,751  332,005  107.25  0.0161
male_BM  HsSs_0136  m  860,822  104,493  73,052  23.60  0.0035
male_BM  HsSs_0137  m  1,616,464  341,641  200,242  64.68  0.0097
male_BM  HsSs_0138  m  2,447,212  1,053,568  389,061  125.68  0.0189
male_BM  HsSs_0139  m  1,726,664  650,667  322,892  104.30  0.0156
male_BM  HsSs_0140  m  3,457,344  237,546  188,473  60.88  0.0091
female_CB  HsSs_0141  f  1,987,230  253,037  217,201  71.53  0.0107
female_CB  HsSs_0142  f  1,670,842  299,884  251,084  82.69  0.0124
female_CB  HsSs_0143  f  1,718,470  457,051  330,209  108.75  0.0163
female_CB  HsSs_0144  f  2,288,764  254,116  203,747  67.10  0.0101
female_CB  HsSs_0145  f  1,372,984  94,154  76,341  25.14  0.0038
female_CB  HsSs_0146  f  2,651,970  163,787  140,932  46.42  0.0070
female_CB  HsSs_0147  f  2,142,594  133,020  106,028  34.92  0.0052
female_CB  HsSs_0148  f  2,101,644  186,918  150,577  49.59  0.0074
female_CB  HsSs_0149  f  1,993,724  238,059  209,155  68.88  0.0103
female_CB  HsSs_0150  f  1,297,246  56,981  51,584  16.99  0.0025
female_CB  HsSs_0151  f  1,441,408  349,349  235,512  77.57  0.0116
female_CB  HsSs_0152  f  1,253,792  101,381  87,839  28.93  0.0043
female_CB  HsSs_0153  f  1,471,770  144,138  115,367  38.00  0.0057
female_CB  HsSs_0154  f  1,027,472  82,966  74,043  24.39  0.0037
female_CB  HsSs_0155  f  1,393,958  62,918  56,734  18.69  0.0028
female_CB  HsSs_0156  f  2,003,774  313,942  220,700  72.69  0.0109
female_CB  HsSs_0157  f  1,034,398  170,965  142,286  46.86  0.0070
female_CB  HsSs_0158  f  748,954  73,311  61,717  20.33  0.0030
female_CB  HsSs_0159  f  2,055,472  326,627  251,583  82.86  0.0124
female_CB  HsSs_0160  f  3,052,052  233,649  199,181  65.60  0.0098
female_CB  HsSs_0161  f  1,634,608  217,740  173,788  57.24  0.0086
female_CB  HsSs_0162  f  2,175,388  124,640  105,033  34.59  0.0052
female_CB  HsSs_0163  f  1,909,952  349,934  280,927  92.52  0.0139
female_CB  HsSs_0164  f  1,116,876  113,502  98,225  32.35  0.0049
female_CB  HsSs_0165  f  1,786,002  67,306  58,631  19.31  0.0029
female_CB  HsSs_0166  f  1,431,820  149,225  124,764  41.09  0.0062
female_CB  HsSs_0167  f  1,462,836  52,412  45,800  15.08  0.0023
female_CB  HsSs_0168  f  1,722,854  160,926  103,886  34.21  0.0051
female_CB  HsSs_0169  f  1,690,950  16,652  16,183  5.33  0.0008
female_CB  HsSs_0170  f  2,164,752  29,170  26,329  8.67  0.0013
female_CB  HsSs_0171  f  1,501,062  123,013  105,643  34.79  0.0052
female_CB  HsSs_0172  f  1,266,740  90,134  76,529  25.20  0.0038
female_CB  HsSs_0173  f  2,051,076  61,664  54,417  17.92  0.0027
female_CB  HsSs_0174  f  1,627,032  84,152  76,429  25.17  0.0038
female_CB  HsSs_0175  f  2,161,126  231,177  186,574  61.45  0.0092
female_CB  HsSs_0176  f  2,085,476  245,891  196,143  64.60  0.0097
female_CB  HsSs_0177  f  1,939,736  181,813  146,876  48.37  0.0073
female_CB  HsSs_0178  f  1,207,264  97,825  82,057  27.03  0.0041
female_CB  HsSs_0179  f  2,434,624  131,782  109,604  36.10  0.0054
female_CB  HsSs_0180  f  1,769,942  56,330  51,932  17.10  0.0026
female_CB  HsSs_0181  f  1,833,910  131,449  108,262  35.66  0.0053
female_CB  HsSs_0182  f  1,673,680  130,931  113,552  37.40  0.0056
female_CB  HsSs_0183  f  1,092,600  40,388  35,806  11.79  0.0018
female_CB  HsSs_0184  f  1,350,230  94,929  84,490  27.83  0.0042
female_CB  HsSs_0185  f  1,763,164  189,893  167,764  55.25  0.0083
female_CB  HsSs_0186  f  2,022,358  86,343  73,328  24.15  0.0036
female_CB  HsSs_0187  f  1,459,414  166,021  132,331  43.58  0.0065
female_CB  HsSs_0188  f  2,119,600  70,820  62,428  20.56  0.0031
female_CB  HsSs_0189  f  3,115,954  165,518  145,766  48.01  0.0072
female_CB  HsSs_0190  f  2,263,108  254,402  194,426  64.03  0.0096
female_CB  HsSs_0191  f  1,795,746  110,268  88,469  29.14  0.0044
female_CB  HsSs_0192  f  1,887,716  47,261  42,869  14.12  0.0021
female_CB  HsSs_0193  f  1,714,274  67,824  58,355  19.22  0.0029
female_CB  HsSs_0194  f  615,944  212,459  144,651  47.64  0.0071
female_CB  HsSs_0195  f  1,132,316  131,041  115,466  38.03  0.0057
female_CB  HsSs_0196  f  1,150,738  238,304  205,038  67.53  0.0101
female_CB  HsSs_0197  f  1,718,060  331,692  262,827  86.56  0.0130
female_CB  HsSs_0198  f  1,021,322  111,481  95,163  31.34  0.0047
female_CB  HsSs_0199  f  1,137,656  281,571  242,900  80.00  0.0120
female_CB  HsSs_0200  f  982,406  250,781  197,818  65.15  0.0098
female_CB  HsSs_0201  f  984,104  118,787  103,714  34.16  0.0051
female_CB  HsSs_0202  f  1,202,954  76,976  71,507  23.55  0.0035
female_CB  HsSs_0203  f  1,219,570  112,976  103,927  34.23  0.0051
female_CB  HsSs_0204  f  1,040,968  62,853  54,935  18.09  0.0027
female_CB  HsSs_0205  f  902,922  32,640  30,030  9.89  0.0015
female_CB  HsSs_0206  f  1,186,628  67,462  60,681  19.99  0.0030
female_CB  HsSs_0207  f  1,285,614  164,765  145,656  47.97  0.0072
female_CB  HsSs_0208  f  1,056,512  127,275  114,495  37.71  0.0057
female_CB  HsSs_0209  f  1,227,064  152,846  135,589  44.66  0.0067
female_CB  HsSs_0210  f  666,886  81,018  75,450  24.85  0.0037
female_CB  HsSs_0211  f  1,316,348  17,871  16,903  5.57  0.0008
female_CB  HsSs_0212  f  564,356  65,609  56,903  18.74  0.0028
female_CB  HsSs_0213  f  1,061,748  111,203  100,381  33.06  0.0050
female_CB  HsSs_0214  f  517,576  33,794  32,060  10.56  0.0016
female_CB  HsSs_0215  f  731,258  62,104  55,633  18.32  0.0027
female_CB  HsSs_0216  f  888,282  5,126  4,951  1.63  0.0002
female_CB  HsSs_0217  f  1,386,862  96,002  86,725  28.56  0.0043
female_CB  HsSs_0218  f  1,260,012  82,678  76,999  25.36  0.0038
female_CB  HsSs_0219  f  903,848  108,178  101,164  33.32  0.0050
female_CB  HsSs_0220  f  1,244,540  115,408  103,993  34.25  0.0051
female_CB  HsSs_0221  f  990,864  51,810  48,461  15.96  0.0024
female_CB  HsSs_0222  f  794,280  31,429  30,445  10.03  0.0015
female_CB  HsSs_0223  f  1,322,084  9,925  9,632  3.17  0.0005
female_CB  HsSs_0224  f  1,381,866  143,308  129,552  42.67  0.0064
female_CB  HsSs_0225  f  1,130,606  121,129  106,364  35.03  0.0053
female_CB  HsSs_0226  f  810,288  28,915  27,215  8.96  0.0013
female_CB  HsSs_0227  f  1,001,874  428,263  332,973  109.66  0.0164
female_CB  HsSs_0228  f  1,300,760  29,337  25,099  8.27  0.0012
female_CB  HsSs_0229  f  863,086  92,043  81,934  26.98  0.0040
female_CB  HsSs_0230  f  1,095,052  68,910  61,115  20.13  0.0030
female_CB  HsSs_0231  f  1,024,464  88,210  79,052  26.04  0.0039
female_CB  HsSs_0232  f  1,493,084  53,621  48,801  16.07  0.0024
female_CB  HsSs_0233  f  824,990  59,140  55,507  18.28  0.0027
female_CB  HsSs_0234  f  1,038,532  82,467  75,708  24.93  0.0037
female_CB  HsSs_0235  f  707,202  97,798  91,095  30.00  0.0045
female_CB  HsSs_0236  f  808,802  149,709  136,158  44.84  0.0067
female_CB  HsSs_0237  f  1,615,066  32,917  30,345  9.99  0.0015
female_CB  HsSs_0238  f  1,454,680  231,733  207,327  68.28  0.0102
female_CB  HsSs_0239  f  729,016  17,817  17,232  5.68  0.0009
female_CB  HsSs_0240  f  859,584  46,974  44,133  14.54  0.0022
female_CB  HsSs_0241  f  921,746  70,774  62,616  20.62  0.0031
female_CB  HsSs_0242  f  1,264,636  111,778  101,649  33.48  0.0050
female_CB  HsSs_0243  f  765,922  22,007  21,063  6.94  0.0010
female_CB  HsSs_0244  f  724,736  5,620  5,631  1.85  0.0003
female_CB  HsSs_0245  f  1,242,094  18,800  18,362  6.05  0.0009
female_CB  HsSs_0246  f  991,278  58,557  54,540  17.96  0.0027
pooled_CB  HsSs_0247  f  1,743,272  448,314  373,216  122.92  0.0184
pooled_CB  HsSs_0248  f  4,942,750  1,162,364  875,746  288.43  0.0433
pooled_CB  HsSs_0249  f  4,991,944  1,149,120  781,655  257.44  0.0386
pooled_CB  HsSs_0250  m  845,506  130,911  118,010  38.12  0.0057
pooled_CB  HsSs_0251  f  2,153,010  688,967  489,743  161.30  0.0242
pooled_CB  HsSs_0252  m  2,640,122  648,392  532,574  172.04  0.0258
pooled_CB  HsSs_0253  m  1,353,738  364,878  298,077  96.29  0.0144
pooled_CB  HsSs_0254  f  2,397,742  835,375  647,768  213.34  0.0320
pooled_CB  HsSs_0255  f  1,662,204  482,011  396,162  130.48  0.0196
pooled_CB  HsSs_0256  m  1,028,442  530,056  442,803  143.04  0.0215
pooled_CB  HsSs_0257  m  1,487,884  531,951  436,322  140.95  0.0211
pooled_CB  HsSs_0258  m  2,255,468  639,224  521,466  168.45  0.0253
pooled_CB  HsSs_0259  f  2,677,320  727,877  547,842  180.43  0.0271
pooled_CB  HsSs_0260  f  1,935,190  734,807  517,781  170.53  0.0256
pooled_CB  HsSs_0261  f  2,356,364  786,741  624,251  205.60  0.0308
pooled_CB  HsSs_0262  m  1,494,998  278,660  225,904  72.97  0.0109
pooled_CB  HsSs_0263  m  3,210,664  1,372,216  901,303  291.15  0.0437
pooled_CB  HsSs_0264  f  2,611,260  815,618  650,600  214.27  0.0321
pooled_CB  HsSs_0265  m  2,087,210  501,337  393,872  127.23  0.0191
pooled_CB  HsSs_0266  m  2,143,268  710,452  576,580  186.25  0.0279
pooled_CB  HsSs_0267  f  1,618,490  758,244  566,783  186.67  0.0280
pooled_CB  HsSs_0268  m  1,516,724  545,379  408,841  132.07  0.0198
pooled_CB  HsSs_0269  m  4,208,380  1,729,956  1,095,253  353.80  0.0531
pooled_CB  HsSs_0270  f  1,825,376  808,749  610,627  201.11  0.0302
pooled_CB  HsSs_0271  f  2,173,402  548,287  437,897  144.22  0.0216
pooled_CB  HsSs_0272  f  2,990,772  660,922  518,915  170.90  0.0256
pooled_CB  HsSs_0273  f  1,916,248  700,527  515,190  169.68  0.0255
pooled_CB  HsSs_0274  f  1,073,544  457,929  387,113  127.49  0.0191
pooled_CB  HsSs_0275  f  4,239,852  1,378,164  981,233  323.17  0.0485
pooled_CB  HsSs_0276  m  1,957,900  267,503  217,184  70.16  0.0105
pooled_CB  HsSs_0277  m  1,773,698  862,447  647,130  209.04  0.0314
pooled_CB  HsSs_0278  m  2,042,278  849,345  578,084  186.74  0.0280
pooled_CB  HsSs_0279  m  2,572,780  699,773  551,559  178.17  0.0267
pooled_CB  HsSs_0280  f  2,588,624  705,302  561,166  184.82  0.0277
pooled_CB  HsSs_0281  m  3,462,772  1,008,910  710,671  229.57  0.0344
pooled_CB  HsSs_0282  m  9,017,926  5,058,575  2,371,922  766.20  0.1149
pooled_CB  HsSs_0283  f  2,955,018  1,266,666  872,720  287.43  0.0431
pooled_CB  HsSs_0284  f  2,662,608  1,019,800  785,062  258.56  0.0388
pooled_CB  HsSs_0285  m  4,339,922  1,078,919  771,044  249.07  0.0374
pooled_CB  HsSs_0286  m  4,236,854  2,464,816  1,588,165  513.03  0.0770
pooled_CB  HsSs_0287  m  2,126,714  564,576  384,932  124.34  0.0187
pooled_CB  HsSs_0288  f  2,319,906  998,454  785,755  258.79  0.0388
pooled_CB  HsSs_0289  m  1,651,202  831,489  659,970  213.19  0.0320
pooled_CB  HsSs_0290  f  2,199,214  417,508  364,683  120.11  0.0180
pooled_CB  HsSs_0291  f  3,647,856  1,665,332  1,082,191  356.42  0.0535
pooled_CB  HsSs_0301  f  1,803,898  387,378  322,729  106.29  0.0159
pooled_CB  HsSs_0302  m  1,979,018  312,484  247,329  79.89  0.0120
Average:  1,996,300  340,093  239,257  77.85  0.0116

* In calculating Reads/Mb, the male genome size included chr1:chrY (3,095,677,412 bp) and the female genome size excluded chrY (3,036,303,846 bp).
§ Coverage calculation: number of unique aligned reads (>q10) x 150 (length of fragment) / 3,095,677,412 (male) or 3,036,303,846 (female).
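The two footnote formulas for the library statistics table can be checked with a short script. The following is a minimal sketch, not the code used in the thesis: the function name `library_stats` and the single-letter sex codes are illustrative assumptions, while the genome sizes and the 150 bp fragment length are taken from the footnotes.

```python
# Genome sizes quoted in the table footnotes (bp).
MALE_GENOME_BP = 3_095_677_412    # chr1 to chrY
FEMALE_GENOME_BP = 3_036_303_846  # chrY excluded
FRAGMENT_LENGTH_BP = 150          # fragment length from the coverage footnote


def library_stats(unique_reads: int, sex: str) -> tuple[float, float]:
    """Return (reads per Mb, sequence coverage) for one library.

    `library_stats` is a hypothetical helper written for illustration;
    `sex` is "m" or "f", as in the table's sex column.
    """
    if sex == "m":
        genome_bp = MALE_GENOME_BP
    elif sex == "f":
        genome_bp = FEMALE_GENOME_BP
    else:
        raise ValueError(f"unknown sex code: {sex!r}")
    reads_per_mb = unique_reads / (genome_bp / 1e6)
    coverage = unique_reads * FRAGMENT_LENGTH_BP / genome_bp
    return reads_per_mb, coverage


# Two rows from the table: unique reads (>q10) for HsSs_0001 and HsSs_0141.
for cell_id, sex, unique_reads in [
    ("HsSs_0001", "m", 129_932),
    ("HsSs_0141", "f", 217_201),
]:
    reads_per_mb, coverage = library_stats(unique_reads, sex)
    print(f"{cell_id}: {reads_per_mb:.2f} reads/Mb, {coverage:.4f}x coverage")
```

For these two libraries the script reproduces the tabulated values (41.97 reads/Mb and 0.0063x for HsSs_0001; 71.53 reads/Mb and 0.0107x for HsSs_0141).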
Appendix B | Chromosome-level Circos plots

Key: Genome-wide comparison of inversion features
Database of Genomic Variants (DGV) inversions
Chromosome location (Mb)
UCSC Genome Browser genes
Chromosome banding
Segmental Duplications: Palindromic, Non-palindromic
Composite
Male invertome: Inversions (homozygous, heterozygous)
Composite W/C ratio histogram
Female invertome: Inversions (homozygous, heterozygous)
AWCs
Misorients/minor alleles
Polymorphisms
Gaps
W/C ratio histogram

[Figures: Circos plots for chr1 to chr22, chrX and chrY, one chromosome per page]

Appendix C | Base pair-level dot plots

Key: Genomic architecture of mapped inversions (example panel: Dotplot of mBM.4.2, fCB.4.2 on chr4; sequence compared to itself)
Panel title: inversions shown on plot (Dataset.chrNo.invNo) and chromosome
Panel label: chromosome:start coordinates-end coordinates.Rd, with scale bar
Axes: Genomic Location (bp), start to end, on both axes
Datasets included (Colour Key of Overlaps): male bone marrow (mBM), adult male invertome; female cord blood (fCB), newborn female invertome; pooled cord blood (ROIno), population polymorphisms
Genomic Features: Database of Genomic Variants (DGV) inversions; Reference assembly gaps; Palindromic repeats; Non-palindromic repeats; Repetitive elements

[Figures: dot plots listed below, one per page, given as panel title (region; scale bar)]

Dotplot of mBM.1.1, ROIno.1.1 on chr1 (chr1:1-915919; 200 kb)
Dotplot of mBM.1.2 on chr1 (chr1:85780148-86200284; 80 kb)
Dotplot of ROIno.1.5 on chr1 (chr1:108654249-109122047; 90 kb)
Dotplot of mBM.1.3, fCB.1.1 on chr1 (chr1:120547157-121136695; 100 kb)
Dotplot of mBM.1.4, fCB.1.2 on chr1 (chr1:121278638-121685434; 80 kb)
Dotplot of mBM.1.5 on chr1 (chr1:143444526-143971002; 100 kb)
Dotplot of mBM.1.6, fCB.1.3 on chr1 (chr1:145168225-146033118; 200 kb)
Dotplot of mBM.1.7, fCB.1.4 on chr1 (chr1:146103300-148226038; 400 kb)
Dotplot of ROIno.1.15 on chr1 (chr1:149067847-149516798; 90 kb)
Dotplot of mBM.1.8, fCB.1.5 on chr1 (chr1:205872708-206532221; 100 kb)
Dotplot of ROIno.2.3 on chr2 (chr2:89708996-90468656; 200 kb)
Dotplot of mBM.2.1, fCB.2.1, ROIno.2.6, ROIno.2.7 on chr2 (chr2:91789506-92520630; 100 kb)
Dotplot of mBM.2.2, ROIno.2.8 on chr2 (chr2:95875218-96463190; 100 kb)
Dotplot of mBM.2.3, ROIno.2.13 on chr2 (chr2:110291158-111324224; 200 kb)
Dotplot of mBM.2.4 on chr2 (chr2:131017801-131605182; 100 kb)
Dotplot of mBM.3.1, fCB.3.1, ROIno.3.2 on chr3 (chr3:195013243-195924370; 200 kb)
Dotplot of mBM.4.1, fCB.4.1, ROIno.4.1 on chr4 (chr4:1-269091; 50 kb)
Dotplot of mBM.4.2, fCB.4.2 on chr4 (chr4:48870358-49415572; 100 kb)
Dotplot of ROIno.4.4 on chr4 (chr4:190339912-190883764; 100 kb)
Dotplot of ROIno.4.5 on chr4 (chr4:190817624-191154276; 70 kb)
Dotplot of mBM.5.1 on chr5 (chr5:1-212031; 40 kb)
Dotplot of ROIno.5.1 on chr5 (chr5:21264003-21790594; 100 kb)
Dotplot of mBM.5.2, fCB.5.1, ROIno.5.2 on chr5 (chr5:68644760-70845568; 400 kb)
Dotplot of ROIno.5.4 on chr5 (chr5:176951586-177534643; 100 kb)
Dotplot of mBM.6.1, fCB.6.1, ROIno.6.1 on chr6 (chr6:20063-581214; 100 kb)
Dotplot of ROIno.6.4 on chr6 (chr6:57180957-57809346; 100 kb)
Dotplot of mBM.6.2, fCB.6.2 on chr6 (chr6:61680167-62328589; 100 kb)
Dotplot of mBM.6.3, fCB.6.3 on chr6 (chr6:157409468-157841300; 90 kb)
Dotplot of fCB.7.1, ROIno.7.2, ROIno.7.3 on chr7 (chr7:5682690-7065007; 300 kb)
Dotplot of mBM.7.1, ROIno.7.4 on chr7 (chr7:54101648-54586481; 100 kb)
Dotplot of ROIno.7.5 on chr7 (chr7:56656775-57322691; 100 kb)
Dotplot of fCB.7.2, ROIno.7.6 on chr7 (chr7:57120886-58254331; 200 kb)
Dotplot of mBM.7.2, fCB.7.3, ROIno.7.9 on chr7 (chr7:62505853-63359414; 200 kb)
Dotplot of mBM.7.3, ROIno.7.10, ROIno.7.11 on chr7 (chr7:64135399-65313002; 200 kb)
Dotplot of mBM.7.4, ROIno.7.12 on chr7 (chr7:71614920-75296848; 700 kb)
Dotplot of mBM.7.5, fCB.7.4 on chr7 (chr7:141898196-142476197; 100 kb)
Dotplot of mBM.7.6, ROIno.7.14 on chr7 (chr7:143197898-143777612; 100 kb)
Dotplot of mBM.7.7 on chr7 (chr7:143694271-144237723; 100 kb)
Dotplot of ROIno.7.17 on chr7 (chr7:151879423-152313324; 90 kb)
Dotplot of mBM.8.1, ROIno.8.3 on chr8 (chr8:7834632-12238531; 900 kb)
Dotplot of ROIno.9.3 on chr9 (chr9:40083030-40625834; 100 kb)
Dotplot of mBM.9.1, fCB.9.1, ROIno.9.3, ROIno.9.4 on chr9 (chr9:40275835-41140341; 200 kb)
Dotplot of mBM.9.2, ROIno.9.8 on chr9 (chr9:42463956-43413698; 200 kb)
Dotplot of mBM.9.3, ROIno.9.9 on chr9 (chr9:43113699-44146569; 200 kb)
Dotplot of mBM.9.4, fCB.9.2 on chr9 (chr9:43796570-44636285; 200 kb)
Dotplot of ROIno.9.11 on chr9 (chr9:44526647-45108293; 100 kb)
Dotplot of mBM.9.5, fCB.9.3, ROIno.9.12 on chr9 (chr9:44758294-45450203; 100 kb)
Dotplot of mBM.9.6, ROIno.9.14 on chr9 (chr9:45665522-46416430; 200 kb)
Dotplot of ROIno.9.15 on chr9 (chr9:46066431-46661039; 100 kb)
Dotplot of mBM.9.7, fCB.9.4, ROIno.9.17 on chr9 (chr9:46960134-47517679; 100 kb)
Dotplot of mBM.9.8, ROIno.9.18 on chr9 (chr9:65267680-66118360; 200 kb)
Dotplot of mBM.9.9 on chr9 (chr9:65768361-66392215; 100 kb)
Dotplot of mBM.9.10, fCB.9.5 on chr9 (chr9:66042216-66604656; 100 kb)
Dotplot of mBM.9.11, fCB.9.6, ROIno.9.22 on chr9 (chr9:66464196-67063343; 100 kb)
Dotplot of mBM.9.12 on chr9 (chr9:66713344-67307834; 100 kb)
Dotplot of ROIno.9.25 on chr9 (chr9:67316297-68187998; 200 kb)
Dotplot of mBM.9.13, fCB.9.7, ROIno.9.27 on chr9 (chr9:68464182-69038946; 100 kb)
Dotplot of mBM.9.14, fCB.9.8, ROIno.9.28 on chr9 (chr9:68788947-69478385; 100 kb)
Dotplot of mBM.9.15, fCB.9.9, ROIno.9.29 on chr9 (chr9:69128386-70210542; 200 kb)
Dotplot of mBM.9.16, fCB.9.10, ROIno.9.32 on chr9 (chr9:70356536-70935468; 100 kb)
Dotplot of mBM.10.1, mBM.10.2, fCB.10.1, fCB.10.2 on chr10 (chr10:42209825-42746687; 100 kb)
Dotplot of mBM.10.1, mBM.10.2, fCB.10.1, fCB.10.2 on chr10 (chr10:42396688-42811203; 80 kb)
Dotplot of fCB.10.3, ROIno.10.8 on chr10 (chr10:46963315-47629049; 100 kb)
Dotplot of mBM.10.3, fCB.10.4 on chr10 (chr10:47905708-49295536; 300 kb)
Dotplot of mBM.10.4, fCB.10.5 on chr10 (chr10:51248846-51929988; 100 kb)
Dotplot of ROIno.10.11 on chr10 (chr10:81078498-82225007; 200 kb)
Dotplot of mBM.11.1 on chr11 (chr11:1715110-2137247; 80 kb)
Dotplot of ROIno.11.3 on chr11 (chr11:48140763-48586751; 90 kb)
Dotplot of mBM.11.2, fCB.11.1, ROIno.11.5 on chr11 (chr11:49857769-50507124; 100 kb)
Dotplot of mBM.11.3, fCB.11.2 on chr11 (chr11:50890854-51794205; 200 kb)
Dotplot of ROIno.11.7 on chr11 (chr11:89356753-90004254; 100 kb)
Dotplot of mBM.12.1, fCB.12.1, ROIno.12.1 on chr12 (chr12:1-295740; 60 kb)
Dotplot of mBM.12.2 on chr12 (chr12:17722516-18213878; 100 kb)
Dotplot of ROIno.12.5 on chr12 (chr12:131569545-132386466; 200 kb)
Dotplot of mBM.14.1, ROIno.14.1 on chr14 (chr14:18800002-19329124; 100 kb)
Dotplot of mBM.14.2, fCB.14.1, ROIno.14.3, ROIno.14.4 on chr14 (chr14:19536989-20619185; 200 kb)
Dotplot of fCB.15.1, ROIno.15.1 on chr15 (chr15:19800001-20589552; 200 kb)
Dotplot of mBM.15.1, fCB.15.2, ROIno.15.5 on chr15 (chr15:22062115-22796193; 100 kb)
Dotplot of mBM.15.2, fCB.15.3 on chr15 (chr15:22446194-23714853; 300 kb)
Dotplot of fCB.15.4, ROIno.15.8, ROIno.15.9, ROIno.15.10 on chr15 (chr15:30192065-32946347; 600 kb)
Dotplot of mBM.15.3, fCB.15.5, ROIno.15.15 on chr15 (chr15:84578219-85184473; 100 kb)
Dotplot of ROIno.15.16 on chr15 (chr15:102289013-102531392; 50 kb)
Dotplot of mBM.16.1, fCB.16.1, ROIno.16.2 on chr16 (chr16:14503645-15216162; 100 kb)
Dotplot of mBM.16.2, fCB.16.2, ROIno.16.4 on chr16 (chr16:14924108-15669838; 100 kb)
Dotplot of ROIno.16.5 on chr16 (chr16:16189379-16782999; 100 kb)
Dotplot of ROIno.16.6, ROIno.16.7 on chr16 (chr16:16441829-18943382; 500 kb)
Dotplot of mBM.16.3, fCB.16.3, ROIno.16.8, ROIno.16.9, ROIno.16.11 on chr16 (chr16:21069507-22745103; 300 kb)
Dotplot of mBM.16.4, fCB.16.4, ROIno.16.13 on chr16 (chr16:28069861-28988943; 200 kb)
Dotplot of ROIno.16.14, ROIno.16.15 on chr16 (chr16:31821207-32319904; 100 kb)
Dotplot of ROIno.16.14, ROIno.16.15 on chr16 (chr16:31929305-32496150; 100 kb)
Dotplot of mBM.16.5, fCB.16.5, ROIno.16.17, ROIno.16.18, ROIno.16.19 on chr16 (chr16:32541077-33986667; 300 kb)
Dotplot of mBM.16.6, fCB.16.6, fCB.16.7, fCB.16.8 on chr16 (chr16:33973151-35485801; 300 kb)
Dotplot of mBM.16.7, fCB.16.9 on chr16 (chr16:46206559-46639463; 90 kb)
Dotplot of ROIno.16.25 on chr16 (chr16:69955978-70411425; 90 kb)
Dotplot of mBM.16.8, ROIno.16.26 on chr16 (chr16:75023916-75458018; 90 kb)
Dotplot of ROIno.17.1 on chr17 (chr17:16458165-16949299; 100 kb)
Dotplot of mBM.17.1, ROIno.17.2 on chr17 (chr17:18112356-18606490; 100 kb)
Dotplot of mBM.17.2, fCB.17.1, ROIno.17.4, ROIno.17.5 on chr17 (chr17:20996465-21454690; 90 kb)
Dotplot of mBM.17.2, fCB.17.1, ROIno.17.4, ROIno.17.5 on chr17 (chr17:21103047-21552310; 90 kb)
Dotplot of mBM.17.3, ROIno.17.7 on chr17 (chr17:25063007-25536080; 90 kb)
Dotplot of mBM.17.4, ROIno.17.10 on chr17 (chr17:36055436-36549527; 100 kb)
Dotplot of ROIno.17.16 on chr17 (chr17:43461776-44572665; 200 kb)
Dotplot of fCB.19.1, ROIno.19.1 on chr19 (chr19:24307240-24831782; 100 kb)
Dotplot of ROIno.19.3 on chr19 (chr19:27630638-28080434; 90 kb)
Dotplot of ROIno.20.1 on chr20 (chr20:25623433-26192339; 100 kb)
Dotplot of ROIno.20.2 on chr20 (chr20:25850001-26519569; 100 kb)
Dotplot of mBM.20.1, fCB.20.1 on chr20 (chr20:29219570-29780277; 100 kb)
Dotplot of mBM.21.1, ROIno.21.2 on chr21 (chr21:10165978-10847896; 100 kb)
Dotplot of mBM.21.1, fCB.21.1, ROIno.21.4 on chr21 (chr21:10570372-11212921; 100 kb)
Dotplot of mBM.21.2, ROIno.21.6 on chr21 (chr21:14138130-14656062; 100 kb)
Dotplot of ROIno.21.7 on chr21 (chr21:15155028-15639142; 100 kb)
Dotplot of mBM.22.1, fCB.22.1, ROIno.22.1 on chr22 (chr22:15850002-16897850; 200 kb)
Dotplot of ROIno.22.3 on chr22 (chr22:16670824-17253135; 100 kb)
Dotplot of fCB.22.2, ROIno.22.4 on chr22 (chr22:18446714-19072558; 100 kb)
Dotplot of fCB.22.3, ROIno.22.5 on chr22 (chr22:20091684-20709431; 100 kb)
Dotplot of ROIno.22.7, ROIno.22.8 on chr22 (chr22:21263844-21975876; 100 kb)
Dotplot of ROIno.22.8 on chr22 (chr22:21593010-22012131; 80 kb)
Dotplot of ROIno.X.4 on chrX (chrX:36426543-37298256; 200 kb)
Dotplot of ROIno.X.12 on chrX (chrX:48819927-49320478; 100 kb)
Dotplot of ROIno.X.15 on chrX (chrX:51615017-52120116; 100 kb)
Dotplot of mBM.X.1, ROIno.X.22 on chrX (chrX:62081407-62710549; 100 kb)
Dotplot of ROIno.X.30 on chrX (chrX:103042612-103505081; 90 kb)
Dotplot of ROIno.X.32 on chrX (chrX:119021020-119484472; 90 kb)
Dotplot of ROIno.X.36 on chrX (chrX:134091579-134548132; 90 kb)
Dotplot of fCB.X.1, ROIno.X.40 on chrX (chrX:139958292-140762362; 200 kb)
Dotplot of fCB.X.2, ROIno.X.42 on chrX (chrX:148450586-149026773; 100 kb)
Dotplot of mBM.X.2 on chrX (chrX:149366496-149788947; 80 kb)
Dotplot of mBM.X.3 on chrX (chrX:151646673-152132779; 100 kb)
Dotplot of mBM.X.4, fCB.X.3, ROIno.X.48 on chrX (chrX:152127100-152759647; 100 kb)
Dotplot of mBM.Y.1 on chrY (chrY:5478268-9114955; 700 kb)
Dotplot of mBM.Y.2 on chrY (chrY:8764956-9441322; 100 kb)
Dotplot of mBM.Y.3 on chrY (chrY:9091323-9955602; 200 kb)
Dotplot of mBM.Y.4 on chrY (chrY:19993886-21231319; 200 kb)
Dotplot of mBM.Y.5 on chrY (chrY:22024634-22445544; 80 kb)
Dotplot of mBM.Y.6 on chrY (chrY:22983237-23414596; 90 kb)
Dotplot of mBM.Y.7, ROIno.Y.16 on chrY (chrY:23529129-24210452; 100 kb)
Dotplot of mBM.Y.8, ROIno.Y.19 on chrY (chrY:24155697-24725420; 100 kb)
