UBC Theses and Dissertations

UBC Theses Logo

UBC Theses and Dissertations

Automated analysis of single cell leukemia data O'Neill, Kieran 2014

Your browser doesn't seem to have a PDF viewer, please download the PDF to view this item.

Item Metadata


24-ubc_2014_november_oneill_kieran.pdf [ 13.31MB ]
JSON: 24-1.0135595.json
JSON-LD: 24-1.0135595-ld.json
RDF/XML (Pretty): 24-1.0135595-rdf.xml
RDF/JSON: 24-1.0135595-rdf.json
Turtle: 24-1.0135595-turtle.txt
N-Triples: 24-1.0135595-rdf-ntriples.txt
Original Record: 24-1.0135595-source.json
Full Text

Full Text

Automated Analysis of Single CellLeukemia DatabyKieran O’NeillB.Sc. University of Natal, 2002B.Sc. (Hons) University of KwaZulu-Natal, 2004M.Sc. University of KwaZulu-Natal, 2008A THESIS SUBMITTED IN PARTIAL FULFILLMENT OFTHE REQUIREMENTS FOR THE DEGREE OFDOCTOR OF PHILOSOPHYinThe Faculty of Graduate and Postdoctoral Studies(Bioinformatics)THE UNIVERSITY OF BRITISH COLUMBIA(Vancouver)October 2014c© Kieran O’Neill 2014AbstractAcute myeloid leukemia (AML) is a high grade malignancy of non-lymphoid cells ofthe hematopoietic system. AML is a heterogeneous disease, and numerous attemptshave been made to risk-stratify AML so that appropriate treatment can be offered.Single cell analysis methods could provide insights into the biology of AML leading torisk-stratified and functionally tailored treatments and hence improved outcomes. Recentadvances in flow cytometry allow the simultaneous measurement of up to 17 antibodymarkers per cell for up to millions of cells, and it is performed routinely during AMLclinical workup. However, despite vast amounts of flow cytometry data being gathered,comprehensive, objective and automated studies of this data have not been undertaken.Another method, strand-seq, elucidates template strand inheritance in single cells, witha range of potential applications, none of which had been automated when this thesiswork commenced. I have developed bioinformatic methods enabling research into AMLusing both these types of data.I present flowBin, a method for faithfully recombining multitube flow cytometrydata. I present flowType-DP, a new version of flowType, able to process flow cytometryand other single cell data having more than 12 markers (including flowBin output). Idemonstrate the application of flowBin to AML data, for digitally isolating abnormalcells, and classifying AML patients. I also use flowBin in conjunction with flowType tofind cell types associated with clinically relevant gene mutations in AML.I present BAIT, a software package for accurately detecting sister chromatid exchangesin strand-seq data. I present functionality to place unbridged contigs in late-buildgenomes into their correct location, and have, with collaborators, published the correctedlocations of more than half the unplaced contigs in the current build of the mousegenome. I present contiBAIT, a software package for assembling early-build genomeswhich consist entirely of unanchored, unbridged contigs. ContiBAIT has the potentialto dramatically improve the quality of many model organism genomes at low cost.These developments enable rapid, automated, objective and reproducible deepprofiling of AML flow cytometry data, subclonal cell analysis of AML cytogenetics, andimprovements to model organisms used in AML research.iiPrefaceAttribution of Work and Published MaterialsChapter 2 (Background)In chapter 2, the section* on flow cytometry bioinformatics incorporates material from areview article which I co-authored for PLoS Computational Biology [1]. I wrote roughlyhalf of the original article, while much of the remainder was written by my co-firstauthor, Nima Aghaeepour. Josef Sˆpidlen provided one section*, while Ryan Brinkmanprovided comments and oversight. The remainder of the chapter is my own, originalwork.The full citation is:K. O’Neill, N. Aghaeepour, J. Sˇpidlen, and R. Brinkman, “Flow Cytometry Bioinfor-matics,” PLoS Comput Biol, vol. 9, p. e1003365, Dec. 2013Chapter 3 (Enhanced FlowType)Chapter 3 details substantial improvements made especially to the flowType and also tothe RchyOptimyx Bioconductor packages. This work was published in Bioinformatics asan application note, [2] and the chapter incorporates text from that article. FlowTypewas originally created by Nima Aghaeepour [3]. RchyOptimyx was originally created byNima and Adrin Jalali, with some input from myself [4]. Adrin and I jointly developedthe core algorithm of the new flowType, and collaborated to implement it in C++. Iincorporated the C++ code into the flowType package, performed the performanceassessment work (besides Nima’s contribution), and wrote the article. Adrin madethe minor changes necessary in RchyOptimyx. Nima did the initial run of the timeperformance assessment of the new flowType, as well as updating documentation for thetwo packages. The remaining authors contributed to the conception and writing.The full citation is:K. O’Neill, A. Jalali, N. Aghaeepour, H. Hoos, and R. R. Brinkman, “Enhanced flowType-/RchyOptimyx: A Bioconductor pipeline for discovery in high-dimensional cytometrydata,” Bioinformatics, vol. 30, pp. 1329–30, Jan. 2014Chapters 4 and 5 (FlowBin)Although not yet published, the contents of Chapters 4 and 5 describing the design,validation and applications of flowBin, are entirely my own work, with the followingexceptions: Donna Hogge provided half of the NPM1 genotype data, while Aly KarsaniiiAttribution of Work and Published Materialsprovided the other half. The whole-genome AML mutation data was provided by LindaChang, Gerben Duns, Jeremy Parker and Aly Karsan. Bakul Dalal provided the flowcytometry data, as well as scientific guidance. Nima Aghaeepour and Ryan Brinkmanprovided scientific guidance.The second section* in Chapter 5 describes work I undertook as an entry to theFlow Cytometry: Critical Assessment of Population Identification Methods (FlowCAP)competition, which was ultimately published in 2013.[5] I have incorporated some textwhich I authored from the supplementary material of that paper; Nature PublishingGroup does not assert any copyright or licensing on supplementary material and so Ihave full rights to re-use it here without explicit permission. The remainder of the workdescribed in that section* was undertaken by myself, with scientific guidance from NimaAghaeepour, Habil Zare and Ryan Brinkman.The full citation is:N. Aghaeepour, G. Finak, T. F. Consortium, T. D. Consortium, H. Hoos, T. R. Mosmann,R. Brinkman, R. Gottardo, and R. H. Scheuermann, “Critical assessment of automatedflow cytometry data analysis techniques,” Nature Methods, vol. 10, pp. 445–445, May2013Chapters 6 (BAIT) and 7 (ContiBAIT)Chapter 6 and a small part of Chapter 7 incorporate text from an article published inGenome Medicine on which I was second author, and which represented a collaborativework primarily between myself and Mark Hills, the first author. [6] Mark developed muchof the framework code for the BAIT package we described in that article, and wrote mostof the article. Although I contributed extensively to the design of the evaluation workpresented in the results section* of that paper, the work itself was performed by Mark.However, the more technical aspects of BAIT were mostly my work, including the core ofthe sister chromatid exchange prediction algorithm, the algorithm for placing unbridgedcontigs in late-build genomes, and the algorithm for clustering contigs into putativechromosomal linkage groups in early-build genomes. I also contributed extensively tothe writing of the methodology section* of the article. The remaining authors – EsterFalconer, Ryan Brinkman and Peter Lansdorp, provided guidance for the project andwriting. As such, I have incorporated text and figures from that article pertaining tothose parts of the methodology to which I was the main contributor, as well as thoseparts of the results section*s, where my work was validated.The parts of Chapter 7 not taken from the BAIT paper are my own work, with theexception that Mark Hills performed the initial alignments of read data, and generatedthe . The remaining algorithmic design, implementation and validation was all my ownwork, with substantial planning input from Mark.The full citation is:M. Hills, K. O’Neill, E. Falconer, R. Brinkman, and P. M. Lansdorp, “BAIT: Organizinggenomes and mapping rearrangements in single cells,” Genome Medicine, vol. 5, p. 82,Sept. 2013ivEthics approvalEthics approvalAs portions of this work were performed on retrospective human clinical diagnosticdata, approval was obtained from the University of British Columbia / British ColumbiaCancer Agency Research Ethics Board (certificate number H08-00667).vTable of ContentsAbstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iiPreface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iiiTable of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viList of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xList of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiList of Abbreviations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiiiAcknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xvDedication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xvii1 Introduction and Problem Statement . . . . . . . . . . . . . . . . . . . . . . . 11.1 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21.2 Solutions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42.1 Flow Cytometry Bioinformatics . . . . . . . . . . . . . . . . . . . . . . . . . . 42.1.1 Flow Cytometry Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42.1.2 Steps in Computational Flow Cytometry Data Analysis . . . . . . 62.1.3 Data Pre-processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72.1.4 Identifying Cell Populations . . . . . . . . . . . . . . . . . . . . . . . . 92.1.5 Diagnosis and Discovery . . . . . . . . . . . . . . . . . . . . . . . . . . 152.2 Acute Myeloid Leukemia . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 182.2.1 Leukemia Subtypes and Heterogeneity . . . . . . . . . . . . . . . . . 192.2.2 Flow Cytometry Immunophenotyping . . . . . . . . . . . . . . . . . . 212.2.3 Multitube Flow Cytometry Data . . . . . . . . . . . . . . . . . . . . . 222.2.4 Bioinformatic Approaches for Combining Multitube Data . . . . . 262.2.5 Surveying the Immunophenotypic Landscape of Leukemia . . . . 272.3 Strand-seq . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 282.3.1 Detection of Sister Chromatid Exchange Events . . . . . . . . . . . 282.3.2 Applications of Strand-seq to Leukemia . . . . . . . . . . . . . . . . 302.3.3 Automated Analysis of Strand-seq Data . . . . . . . . . . . . . . . . 31viTable of Contents3 Enhanced FlowType for High-Dimensional Single Cell Data . . . . . . 323.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 323.2 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 333.2.1 Dynamic Programming Implementation . . . . . . . . . . . . . . . . 333.2.2 Breadth-First Strategy . . . . . . . . . . . . . . . . . . . . . . . . . . . 343.2.3 Partitioning into Multiple Expression Levels . . . . . . . . . . . . . 353.3 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 363.3.1 Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 363.3.2 Memory Limits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 363.3.3 Example Case: Multiple Expression Levels . . . . . . . . . . . . . . 373.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 384 FlowBin: Deep Profiling of Multitube Flow Cytometry Data . . . . . 394.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 394.2 Design and Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 414.2.1 Population Marker Normalization . . . . . . . . . . . . . . . . . . . . 424.2.2 Binning of Population Markers . . . . . . . . . . . . . . . . . . . . . . 424.2.3 Bin Matching Across Tubes . . . . . . . . . . . . . . . . . . . . . . . . 434.2.4 Expression Measurement . . . . . . . . . . . . . . . . . . . . . . . . . . 434.2.5 Downstream Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 434.3 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 444.3.1 Validation of Quantile Normalization . . . . . . . . . . . . . . . . . . 444.3.2 Comparison of Binning Methods . . . . . . . . . . . . . . . . . . . . . 444.3.3 Comparison to Per-cell Nearest-neighbours . . . . . . . . . . . . . . 444.3.4 Validation on Simulated Multitube Data from Polychromatic FlowCytometry . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 454.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 454.4.1 Validation of Quantile Normalization . . . . . . . . . . . . . . . . . . 454.4.2 Comparison of Binning Methods . . . . . . . . . . . . . . . . . . . . . 464.4.3 Comparison to Per-cell Nearest-neighbours . . . . . . . . . . . . . . 474.4.4 Validation on Simulated Multitube Data from Polychromatic FlowCytometry . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 484.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 514.5.1 Validation of Quantile Normalization . . . . . . . . . . . . . . . . . . 514.5.2 Comparison of Binning Methods . . . . . . . . . . . . . . . . . . . . . 514.5.3 Comparison to Nearest Neighbours . . . . . . . . . . . . . . . . . . . 514.5.4 Validation on Simulated Multitube Data from Polychromatic FlowCytometry . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 524.5.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 535 Applications of FlowBin to AML . . . . . . . . . . . . . . . . . . . . . . . . . . 555.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 555.1.1 Separation of AML and Normal Cells . . . . . . . . . . . . . . . . . . 555.1.2 FlowCAP2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 555.1.3 FlowBin-FlowType to Find NPM1 -associated Cell Types . . . . . 565.1.4 Surveying the Immunophenotypic and Genomic Landscape of AML 56viiTable of Contents5.2 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 575.2.1 Separation of AML and Normal Cells . . . . . . . . . . . . . . . . . . 575.2.2 FlowCAP2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 575.2.3 FlowBin-FlowType to Find NPM1 -associated Cell Types . . . . . 585.2.4 Surveying the Immunophenotypic and Genomic Landscape of AML 605.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 615.3.1 Separation of AML and Normal Cells . . . . . . . . . . . . . . . . . . 615.3.2 FlowCAP2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 625.3.3 FlowBin-FlowType to Find NPM1 -associated Cell Types . . . . . 625.3.4 Surveying the Immunophenotypic and Genomic Landscape of AML 625.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 665.4.1 Separation of AML and Normal Cells . . . . . . . . . . . . . . . . . . 665.4.2 Identifying AML Patients (FlowCAP 2) . . . . . . . . . . . . . . . . 685.4.3 FlowBin-FlowType to Find NPM1 -associated Cell Types . . . . . 685.4.4 Surveying the Immunophenotypic and Genomic Landscape of AML 696 BAIT: Automated Strand-Seq Analysis . . . . . . . . . . . . . . . . . . . . . 706.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 706.2 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 716.2.1 Automatically Detecting Sister Chromatid Exchanges . . . . . . . 716.2.2 Placing Contigs in Late-build Genomes Using Strand Inheritanceand Sister Chromatid Exchange Information . . . . . . . . . . . . . 726.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 776.3.1 Accurate Localization and Mapping of SCEs . . . . . . . . . . . . . 776.3.2 Placing Contigs in Late-build Genomes Using Strand Inheritanceand Sister Chromatid Exchange Information . . . . . . . . . . . . . 796.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 837 ContiBAIT: Assembling Genomes Using strand-seq . . . . . . . . . . . . 847.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 847.2 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 857.2.1 Preprocessing and Calling Template Strand State . . . . . . . . . . 857.2.2 Anchoring Unbridged Contigs . . . . . . . . . . . . . . . . . . . . . . . 887.2.3 Ordering of Unbridged Contigs within Chromosomes . . . . . . . . 927.3 Results and Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 937.3.1 Contig Anchoring . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 937.3.2 Validation of Contig Ordering . . . . . . . . . . . . . . . . . . . . . . . 987.4 Package . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1007.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1008 Conclusions and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1028.1 Summary of Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1028.2 Reiteration of Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . 1038.3 Future Directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1068.3.1 FlowType-DP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1068.3.2 FlowBin . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106viiiTable of Contents8.3.3 Applications of FlowBin to AML . . . . . . . . . . . . . . . . . . . . . 1078.3.4 BAIT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1088.3.5 contiBAIT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110ixList of Tables2.1 Clinical features and risk stratification of AML . . . . . . . . . . . . . . . . . 192.2 Recurrent non-karyotypic genetic aberrations in AML . . . . . . . . . . . . 202.3 Markers commonly assayed in leukemia diagnosis . . . . . . . . . . . . . . . 234.1 Markers in simulated multitube data. . . . . . . . . . . . . . . . . . . . . . . . 465.1 Mutations and mutation groups correlated with immunophenotype clusters. 666.1 Identification of all misoriented fragments in GRCm38/mm10. . . . . . . . 808.1 Summary of algorithms developed . . . . . . . . . . . . . . . . . . . . . . . . . 105xList of Figures2.1 Schematic diagram of a flow cytometer . . . . . . . . . . . . . . . . . . . . . . 52.2 An example pipeline for analysis of FCM data and some of the Biocon-ductor packages relevant to each step. . . . . . . . . . . . . . . . . . . . . . . . 62.3 Comparison of consensus of manual gates and automated gates . . . . . . 102.4 Cell populations in a high-dimensional mass-cytometry dataset manuallygated after dimension reduction using 2D layout for a minimum spanningtree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122.5 An example of frequency difference gating . . . . . . . . . . . . . . . . . . . . 142.6 Overview of the flowType/RchyOptimyx pipeline for identification ofcorrelates of protection against HIV . . . . . . . . . . . . . . . . . . . . . . . . 162.7 Hematopoeitic markers detected by flow cytometry for leukemia diagnosis. 222.8 Multitube flow cytometry data. . . . . . . . . . . . . . . . . . . . . . . . . . . 242.9 Manual analysis of multitube flow data for leukemia immunophenotyping. 252.10 Gating leukemic blasts in CD45/SSC: ideal vs reality. . . . . . . . . . . . . 262.11 Overview of strand-seq protocol. . . . . . . . . . . . . . . . . . . . . . . . . . . 293.1 Run time comparison of flowType-DP to flowType-BF . . . . . . . . . . . . 363.2 Possible thresholds for marker combinations using flowType-DP . . . . . . 373.3 Three/four-partition flowType-generated, RchyOptimyx-visualized celltype hierarchy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 384.1 Overview of the flowBin pipeline, applied to one multitube sample. . . . 414.2 One, two and three-dimensional representations of quantile normalizationof population markers. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 474.3 The two options for binning within flowBin: k-means and flowFP, asapplied to a 7-tube sample. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 484.4 Comparison between nearest-neighbours merging and flowBin for twotubes computationally sampled from a real data set. . . . . . . . . . . . . . 494.5 Performance on flowBin in reproducing a high-colour flow cytometryanalysis on simulated flow cytometry data. . . . . . . . . . . . . . . . . . . . 505.1 Schema for a voting classifier using flowBin output. . . . . . . . . . . . . . 575.2 Pipeline used to determine NPM1 -associated immunophenotypes in AML. 585.3 An example of RchyOptimyx analysis of one cluster of cell types. . . . . 595.4 nu-SVM separation of normal and abnormal cell populations in AMLsamples. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 615.5 Schema for a voting classifier for flowBin output incorporating balancedbagging. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63xiList of Figures5.6 Overview of cell types which showed significant differences in abundancebetween NPM1-mt and NPM1-wt. . . . . . . . . . . . . . . . . . . . . . . . . . 645.7 Selected classes of cell types showing significant differences in abundancebetween NPM1-mt and NPM1-wt. . . . . . . . . . . . . . . . . . . . . . . . . . 655.8 Top phenotypes and the results of NMF clustering. . . . . . . . . . . . . . . 676.1 Automated identification of sister chromatid exchange (SCE) from strand-seq data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 736.2 Optimization of the sister chromatid exchange (SCE) interval detectorfunction. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 746.3 BAIT localizes unplaced scaffolds in late-version assemblies. . . . . . . . . 766.4 Accuracy of automated sister chromatid exchange (SCE) detection byBioinformatic Analysis of Inherited Templates (BAIT). . . . . . . . . . . . . 786.5 Validation of using strand-seq to map unplaced scaffolds to built genomes. 827.1 True ordering of semi-synthetic mouse data for chromosome 3. . . . . . . 867.2 Overall ContiBAIT Linkage Group Assignment Pipeline . . . . . . . . . . 877.3 Flow diagram illustrating three-star Chinese restaurant clustering . . . . 897.4 Using SCEs to order contigs within a chromosome. . . . . . . . . . . . . . . 947.5 Clustering contigs into linkage groups for early-assembly genomes. . . . . 957.6 Contig locations in Bioinformatic Analysis of Inherited Templates (BAIT)-compiled linkage groups. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 967.7 Evaluation of contiBAIT chromosome assignment with different-sizedcluster ensembles and different-sized contigs. . . . . . . . . . . . . . . . . . . 987.8 A chromosome assignment result after reorientation and merging. . . . . 997.9 Testing of ordering of contigs on synthetic data . . . . . . . . . . . . . . . . 100xiiList of AbbreviationsCEBPA CCAAT/enhancer binding protein-alpha. xii,19FLT3 Fms-like tyrosine kinase 3. xii, 21NPM1 nucleophosmin. xii, 19, 21, 62, 66AML acute myeloid leukemia. xii, 1, 2, 18, 19, 21,27, 55–59, 61, 66, 68, 103BAIT bioinformatic analysis of inherited templates.xii, 71, 83BrdU bromodeoxyuridine. xii, 28FISH fluorescence in-situ hybridisation. xii, 31FlowCAP Flow Cytometry: Critical Assessment of Pop-ulation Identification Methods. iv, xii, 13flowType-BF flowType brute force. xii, 33flowType-DP flowType dynamic programming. xii, 33FSC forward scattered light. xiiISH immortal strand hypothesis. xii, 28k-NN k-nearest neighbours. xii, 26, 27, 40MDS myelodysplastic syndrome. xiiMFI median fluorescent intensity. xii, 43, 52NK normal karyotype. xii, 18PCR polymerase chain reaction. xii, 28qPCR real-time polymerase chain reaction. xii, 5xiiiList of AbbreviationsSCE sister chromatid exchange. xii, 30, 70, 71SSC side scattered light. xii, 22SSH silent-sister hypothesis. xii, 28SVM support vector machine. xii, 57, 66TCGA the cancer genome atlas. xii, 21xivAcknowledgmentsThe (United States) National Institutes of Health and the Terry Fox Research Initiativegenerously funded this research.My supervisor Ryan Brinkman has kept me funded and supported, provided anexample in the ways of academic politics, guided me scientifically, and provided extensiveadvice in the ways of academic writing and presentation. My thesis committee havebeen a source of sage advice, as well as fair and honest criticism. Paul Pavlidis, JennyBryan, Bakul Dalal and Peter Lansdorp, thank-you.Many administrators have been an enormous support in my thesis work, regularlygoing above and beyond the call of duty for my sake: Sharon Ruschkowski, Alice Chau,Kristie Baas, Maryam Rahnama, Christine Kelly, Shera Paterson and Cynthia Wong,thank-you all. Most especially, thank-you to Sharon Ruschkowski for swinging me thetechnical advisor gig on Rise of the Planet of the Apes. Although, in the end, theygutted most of the science and turned it into an action film, the experience was amazingand the money much-needed at the time.The Brinkman lab provided a collegial, stimulating and fun environment in whichthis thesis developed. Nima Aghaeepour, Adrin Jalali, Habil Zare, Ali Bashashati, JosefSˇpidlen, Me´lanie Courtot, Adria´n Corte´s, Parisa Shooshtari, Alireza Khodabakhshi,Radina Droumeva, Mehrnoush Malekesmaeili, Justin Meskas, Marjan Farahbod, andanyone I may have left out, thank-you.Many collaborators helped with the work that became this thesis, all of whom havebeen a pleasure to work with: Bakul Dalal, Donna Hogge, Peter Landsdorp, Holger Hoos,Ester Falconer, Aly Karsan, Mark Kirkpatrick, Jeremy Parker, Linda Chang, GerbenDuns, and most especially Adrin Jalali, Nima Aghaeepour and Mark Hills, thank-youall.The late-PhD Bioinformatics tea house/lunch meetup group has been an invaluablesource of friendship, feedback and scientific engagement. Nima Aghaeepour, Adrin Jalali,Rodrigo Goya, Gavin Ha and Andrew McPherson, thank-you and good luck in yourrespective careers. The rest of the bioinformatics students at UBC and SFU, as well asthe students and post-docs at the Terry Fox Laboratory, over the years of hanging outat various socials and conferences, have been both excellent and scientifically stimulatingcompany.All of my fellow organisers, volunteers and coordinators in GraSPoDS, the AMSBike Co-op, the organising committee of the Canadian Students Computational BiologyConference 2009, and VanBUG have helped me find a sense of accomplishment beyondthat of my scientific work, and gain a wealth of useful skills a PhD alone would neverhave given me.John Michael Greer has inspired me in many ways, but most importantly for thisthesis, to drastically improve my writing skills.xvAcknowledgmentsMy family have been enormously understanding and supportive of my decision tomove literally to the diametric opposite side of the globe to complete this degree, andhave all contrived to come and visit despite the distance. Mom, Dad, Brendan andRowan, thank-you.And lastly, but most importantly, my wife Monia, even in the face of the emotionaltoll of a PhD-bound husband, and the trials and tribulations of moving to Canada, hasbeen steadfast in her love and support. I am quite certain that I owe her, among manyother things, my sanity during this process.xviDedicationTo Monia, for five years of patience and support.I love you.xviiChapter 1Introduction and ProblemStatementAcute myeloid leukemia (AML) is a disease with very poor prognosis. Focused researchover the past decades has produced significant advances in both prognostic predictionand the understanding of the biology of AML, but little improvement in patient outcomes.New tools and methods are needed to further understand this disease, and from thatunderstanding to develop new and more effective treatments tailored to the risk andbiology of individual patients’ diseases. In particular, single cell techniques for inves-tigating the clonal and somatic structure of the disease have the potential to increaseunderstanding of the functions of driver mutations, the acquisition of drug resistance,frequency of relapse, relapse-free survival and overall survival. However, appropriatebioinformatic tools are needed to process the large amounts of data these techniquesgenerate.One potential source of new insight into AML lies in the untapped potential of flowcytometry data. Flow cytometry provides single cell immunophenotypic information forthousands to millions of cells from a single sample [7, 8]. It is performed routinely as partof leukemia clinical diagnosis. However, the analysis used by pathologists is typicallymanual and low-throughput, focusing only on known patterns. Nevertheless, vastamounts of data are stored in the process, which could be used for deeper retrospectiveanalysis. Chapter 2 provides more detail about flow cytometry as it is applied toleukemia, and about existing bioinformatics methods which can be used for analyzingflow cytometry data.11.1. Problem StatementAnother single cell technology with applications to AML research is strand-seq,which allows for determination of single cell template strand inheritance patterns. Thisinformation can be used for a range of purposes, from detected genomic rearrangementsto assembling genomes. More detail about the strand-seq method and the need forautomated analysis tools to complement it may be found in Chapter 2.1.1 Problem StatementThe problem which this thesis attempted to solve can be summarized:Although the high-throughput methods of flow cytometry and strand-seq have producedlarge quantities of data that may facilitate life-saving research into AML treatment, thebioinformatics tools to analyze this data have been limited. Prior to this thesis work,tools for recombining the multitube flow cytometry data typical in AML diagnosis werefew and produced imputation errors, while existing downstream methods were poorlysuited for the high dimensionality of the resulting data. Tools for automatically analysingstrand-seq were non-existent.1.2 SolutionsChapter 4 describes a pipeline I developed which combines multi-tube flow cytometrydata in a way which minimizes spurious marker combinations, and enables data miningon retrospective multitube flow cytometry data. Chapter 3 describes enhancementsI made, in collaboration with Nima Aghaeepour and Adrin Jalali, to the flowTypealgorithm which enable it to be used on high-dimensional data, such as that produced byflowBin. Chapter 5 details analyses I have performed using flowBin and in some casesflowType, on AML flow cytometry data, and the biological results those have produced.Chapter 6 describes contributions which I made to BAIT, a software package devel-oped by myself and Mark Hills for automated analysis of strand-Seq data, especiallyfor the purposes of detecting sister chromatic exchanges and placing unmapped contigs21.2. Solutionsin late-build genomes. Chapter 7 details contiBAIT, a subsequent package which usesstrand-seq to assemble early-build genomes from unbridged contigs.Lastly, Chapter 8 summarizes the contributions I have made, and suggests futurework.3Chapter 2Background2.1 Flow Cytometry Bioinformatics2.1.1 Flow Cytometry DataFlow cytometers operate by hydrodynamically focusing suspended cells so that theyseparate from each other within a fluid stream. The stream is interrogated by one or morelasers, and the resulting fluorescent and scattered light is detected by photomultipliers.By using optical filters, particular fluorophores on or within the cells can be quantifiedby peaks in their emission spectra. These may be endogenous fluorophores such aschlorophyll or transgenic green fluorescent protein, or they may be fluorophores covalentlybonded to detection molecules such as antibodies for detecting proteins, or hybridizationprobes for detecting DNA or RNA.The ability to quantify these has led to flow cytometry being used in a wide range ofapplications, including but not limited to:• Monitoring of CD4 count in HIV[9]• Diagnosis of various cancers[10] [11]• Analysis of aquatic microbiomes [12]• Sperm sorting [13]• Measuring telomere length[14]42.1. Flow Cytometry BioinformaticsLaserPMTPMTPMTPMTADCSSCFL-2FL-1FL-3FSCFigure 2.1: Schematic diagram of a flow cytometer, showing focusing of the fluid sheath, laser,optics (in simplified form, omitting focusing), photomultiplier tubes (PMTs), analogue-to-digitalconverter, and analysis workstation.Until the early 2000s, flow cytometry could only measure a few fluorescent markers at atime. Through the late 1990s into the mid-2000s, however, rapid development of newfluorophores resulted in modern instruments capable of quantifying up to 18 markers percell [15]. More recently, the new technology of mass cytometry replaces fluorophores withrare earth elements detected by time of flight mass spectrometry, achieving the abilityto measure the expression of 34 or more markers [16]. At the same time, microfluidicreal-time polymerase chain reaction (qPCR) methods are providing a flow cytometry-likemethod of quantifying 48 or more RNA molecules per cell [17]. The rapid increase in thedimensionality of flow cytometry data, coupled with the development of high-throughputrobotic platforms capable of assaying hundreds to thousands of samples automaticallyhas created a need for improved computational analysis methods [15].52.1. Flow Cytometry BioinformaticsFigure 2.2: An example pipeline for analysis of FCM data and some of the Bioconductorpackages relevant to each step.2.1.2 Steps in Computational Flow Cytometry Data AnalysisThe process of moving from primary FCM data to disease diagnosis and biomarkerdiscovery involves four major steps:1. Data pre-processing (including compensation, transformation and normalization)2. Cell population identification (a.k.a. gating)3. Cell population matching for cross sample comparison4. Relating cell populations to external variables (diagnosis and discovery)62.1. Flow Cytometry Bioinformatics2.1.3 Data Pre-processingPrior to analysis, flow cytometry data must typically undergo pre-processing to removeartifacts and poor quality data, and to be transformed onto an optimal scale foridentifying cell populations of interest. Below are various steps in a typical flow cytometrypreprocessing pipeline.CompensationWhen more than one fluorochrome is used with the same laser, their emission spectrafrequently overlap. Each particular fluorochrome is typically measured using a bandpassoptical filter set to a narrow band at or near the fluorochrome’s emission intensity peak.The result is that the reading for any given fluorochrome is actually the sum of thatfluorochrome’s peak emission intensity, and the intensity of all other fluorochromes’spectra where they overlap with that frequency band. This overlap is termed spillover,and the process of removing spillover from flow cytometry data is called compensation[18].Compensation is typically accomplished by running a series of representative sampleseach stained for only one fluorochrome, to give measurements of the contribution of eachfluorochrome to each channel [18]. The total signal to remove from each channel can becomputed by solving a system of linear equations based on this data to produce a spillovermatrix, which when inverted and multiplied with the raw data from the cytometerproduces the compensated data [18, 19]. The processes of computing the spillovermatrix, or applying a precomputed spillover matrix to compensate flow cytometry data,are standard features of flow cytometry software [20].TransformationCell populations detected by flow cytometry are often described as having approximatelylog-normal expression [21]. As such, they have traditionally been transformed to alogarithmic scale. In early cytometers, this was often accomplished even before data72.1. Flow Cytometry Bioinformaticsacquisition by use of a log amplifier. On modern instruments, data is usually stored inlinear form, and transformed digitally prior to analysis.However, compensated flow cytometry data frequently contains negative values dueto compensation, and cell populations do occur which have low means and normaldistributions [22]. Logarithmic transformations cannot properly handle negative values,and poorly display normally distributed cell types [22, 23]. Alternative transformationswhich address this issue include the log-linear hybrid transformations Logicle[24] andHyperlog [25], as well as the hyperbolic arcsine and the Box-Cox [26].A comparison of commonly used transformations concluded that the biexponentialand Box-Cox transformations, when optimally parameterized, provided the clearestvisualization and least variance of cell populations across samples [23]. However, alater comparison of the flowTrans package used in that comparison indicated that itdid not parameterize the Logicle transformation in a manner consistent with otherimplementations, potentially calling those results into question [27]. Based on theseresults, either the Box-Cox transformation, or Logicle, when correctly paramaterised,are likely to be best choice for transformation.Quality ControlParticularly in larger flow cytometry data sets, quality control and assessment areimportant tasks. However, neither the software provided by instrument manufacturersnor most commonly used analysis suites provide such functionality [28]. One solution isto visualize summary statistics, such as the empirical distribution functions of singledimensions of technical or biological replicates to ensure they are the similar [28]. Formore rigor, the Kolmogorov–Smirnov test can be used to determine if individual samplesdeviate from the norm [28]. The Grubbs’ test for outliers may be used to detect samplesdeviating from the group.A method for quality control in higher-dimensional space is to use probability binningwith bins fit to the whole data set pooled together [29]. Then the standard deviation of82.1. Flow Cytometry Bioinformaticsthe number of cells falling in the bins within each sample can be taken as a measure ofmultidimensional similarity, with samples that are closer to the norm having a smallerstandard deviation [29]. With this method, higher standard deviation can indicateoutliers, although this is a relative measure as the absolute value depends partly on thenumber of bins.With all of these methods, the cross-sample variation is being measured. However,this is the combination of technical variations introduced by the instruments and handling,and actual biological information that is desired to be measured. Disambiguating thetechnical and the biological contributions to between-sample variation can be a difficultto impossible task [30].NormalizationParticularly in multi-centre studies, technical variation can make biologically equivalentpopulations of cells difficult to match across samples. Normalization methods to removetechnical variance, frequently derived from image registration techniques, are thus acritical step in many flow cytometry analyses. Single-marker normalization can beperformed using landmark registration, in which peaks in a kernel density estimate ofeach sample are identified and aligned across samples [30].2.1.4 Identifying Cell PopulationsThe complexity of raw flow cytometry data (dozens of measurements for thousands tomillions of cells) makes answering questions directly using statistical tests or supervisedlearning difficult. Thus, a critical step in the analysis of flow cytometric data is toreduce this complexity to something more tractable while establishing common featuresacross samples. This usually involves identifying multidimensional regions that containfunctionally and phenotypically homogeneous groups of cells [32]. There are a variety ofmethods by which this can be achieved, detailed below.92.1. Flow Cytometry BioinformaticsFigure 2.3: Comparison of consensus of eight independent manual gates (polygons) andautomated gates (colored dots). The consensus of the manual gates and the algorithms wereproduced using the CLUE package [31] Figure c©Nature Publishing Group 2013 [5], re-usedunder the terms of the CC-BY 3.0 license..102.1. Flow Cytometry BioinformaticsGatingThe data generated by flow-cytometers can be plotted in one or two dimensions toproduce a histogram or scatter plot. The regions on these plots can be sequentiallyseparated, based on fluorescence intensity, by creating a series of subset extractions,termed ”gates”. In datasets with a low number of dimensions and limited cross-sampletechnical and biological variability (e.g., clinical laboratories), manual analysis of specificcell populations can produce effective and reproducible results. However, exploratoryanalysis of a large number of cell populations in a high-dimensional dataset is not feasible[33]. In addition, manual analysis in less controlled settings (e.g., cross-laboratorystudies) can increase the overall error rate of the study [34, 35]. However, despitethe considerable advances in computational analysis, manual gating remains the mainsolution for the identification of cell populations.Gating Guided by Dimensionality ReductionThe number of scatter plots that need to be investigated increases with the square of thethe number of markers measured (or faster since some markers need to be investigatedseveral times for each group of cells to resolve high-dimensional differences between celltypes that appear to be similar in most markers) [37] To address this issue, principalcomponent analysis has been used to summarize the high-dimensional datasets usinga combination of markers that maximizes the variance of all data points.[38] However,PCA is a linear method and is not able to preserve complex and non-linear relationships.More recently, two dimensional minimum spanning tree layouts have been used toguide the manual gating process. Density-based down-sampling and clustering was usedto better represent rare populations and control the time and memory complexity ofthe minimum spanning tree construction process.[39]. More sophisticated dimensionreduction algorithms are yet to be investigated.112.1. Flow Cytometry BioinformaticsNaive CD8+ TMemory CD8+ TNaive CD4+ TMemory CD4+ TNKPre-B IIPlasma cellPre-B ICD38mid CD3- plateletCD38- CD3- plateletCD38mid CD3mid plateletErythrocyteErythroblastMPPPro-BHSCMonocytePlasmacytoid DCPre-DCGMPMyelocytePromyelocytePro-monocyteCMPMonoblast NKTIL-3Rα+ mature BMature BImmature B115−CD45 1.31.3−Figure 2.4: Cell populations in a high-dimensional mass-cytometry dataset manually gatedafter dimension reduction using 2D layout for a minimum spanning tree. Figure reproducedfrom the data provided in [36]122.1. Flow Cytometry BioinformaticsAutomated GatingDeveloping computational tools for identification of cell populations has been an area ofactive research only since 2008. Many individual clustering approaches have recentlybeen developed, including model-based algorithms (e.g., flowClust [26] and FLAME[40]), density based algorithms (e.g. FLOCK [41] and SWIFT, graph-based approaches(e.g. SamSPECTRAL [42]) and most recently, hybrids of several approaches (flowMeans[43] and flowPeaks [44]). These algorithms are different in terms of memory and timecomplexity, their software requirements, their ability to automatically determine therequired number of cell populations, and their sensitivity and specificity. The FlowCAPproject, with active participation from most academic groups with research efforts in thearea, is providing a way to objectively cross-compare state-of-the-art automated analysisapproaches [5].Probability Binning MethodsWhere most automated gating methods attempt to identify coherent cell populations,probability binning is primarily a method for surveying the distribution of cells within asample, which can also be used for gating. In probability binning, flow cytometry datais split into quantiles on a univariate basis [45]. The locations of the quantiles can thenbe used to test for differences between samples (in the variables not being split) usingthe chi-squared test [45].This was later extended into multiple dimensions in the form of frequency differencegating, a binary space partitioning technique where data is iteratively partitioned alongthe median [46]. These partitions (or bins) are fit to a control sample. Then theproportion of cells falling within each bin in test samples can be compared to the controlsample by the chi squared test.Finally, cytometric fingerprinting uses a variant of frequency difference gating to setbins and measure for a series of samples how many cells fall within each bin [29]. Thesebins can be used as gates and used for subsequent analysis similarly to automated gating132.1. Flow Cytometry BioinformaticsFigure 2.5: An example of frequency difference gating, created using the flowFP Bioconductorpackage. The dots represent individual events in an FCS file. The rectangles represent the bins.142.1. Flow Cytometry Bioinformaticsmethods.Combinatorial GatingHigh-dimensional clustering algorithms can perform well for identifying larger cellpopulations, but tend to perform more poorly at identifying rare cell types [5].Thefull complexity of high-dimensional data can be large – where there are M dimensionsand each is divided into positive and negative, there may be as many as 2M disjointpopulations. Methods for deciding on the number of clusters tend to predict muchsmaller numbers, resulting tin the merging of many of the smaller clusters together.Matching these small cell populations across multiple samples is even more challenging,and has few solutions [47].An alternative to high-dimensional clustering is to identify cell populations usingone marker at a time and then combine them to produce higher dimensional clusters.This functionality was first implemented in the commercial software, FlowJo [48]. TheflowType algorithm builds on this framework by allowing the exclusion of the markers [3]This enables the development of statistical tools (e.g., RchyOptimyx) that can investigatethe importance of each marker and exclude high-dimensional redundancies. [4].2.1.5 Diagnosis and DiscoveryAfter identification of the cell population of interest, cross-sample analysis can identifyphenotypical or functional variations that are correlated with an external variable (e.g.,a clinical outcome). These studies can be partitioned into two main groups of Diagnosisand Discovery.DiagnosisIn these studies, the goal usually is to diagnose a disease (or a sub-class of a disease) usingvariations in one or more cell populations. For example, one can use multidimensionalclustering to identify a set of clusters, match them across all samples, and then use152.1. Flow Cytometry BioinformaticsFigure 2.6: Overview of the flowType/RchyOptimyx pipeline for identification of correlates ofprotection against HIV: First, tens of thousands of cell populations are identified by combiningone dimensional partitions (panel one). The cell populations are then analyzed using a statisticaltest (and bonferroni’s method for multiple testing correction) to identify those correlated with thesurvival information. The third panel shows a complete gating hierarchy describing all possiblestrategies for gating that cell population. This graph can be mined to identify the “best” gatingstrategy (i.e., the one in which the most important markers appear earlier). These hierarchiesfor all selected phenotypes are demonstrated in panel 4. In panel 5, these hierarchies are mergedinto a single graph that summarized the entire dataset and demonstrates the trade-off betweenthe number of markers involved in each phenotype and the significance of the correlation withthe clinical outcome (e.g., as measured by the Kaplan–Meier estimator in panel 6). Figurereproduced in part from [3] and [4], both of which are in the public domain.162.1. Flow Cytometry Bioinformaticssupervised learning to construct a classifier for prediction of the classes of interest(e.g., this approach can be used to improve the accuracy of the classification of specificlymphoma subtypes [49]). Alternatively, all the cells from the entire cohort can bepooled into a single multidimensional space for clustering before classification.[50] Thisapproach is particularly suitable for datasets with a high amount of biological variation(in which cross-sample matching is challenging) but requires technical variations to becarefully controlled [51].DiscoveryIn a discovery setting, the goal is to identify and describe cell populations correlatedwith an external variable. Similar to the diagnosis use-case, cluster matching in high-dimensional space can be used for exploratory analysis. However, this approach producescell populations which can be difficult to visualise due to their high dimensionality,and also difficult to characteristically immunophenotype (for example if their shapeincludes twists and folds in higher-dimensional space).[50, 52] Finally, combinatorialgating approaches have been particularly successful in exploratory analysis of FCM data.Simplified Presentation of Incredibly Complex Evaluations (SPICE) is a software packagethat can use the gating functionality of FlowJo to statistically evaluate a wide rangeof different cell populations and visualize those that are correlated with the externaloutcome. flowType and RchyOptimyx (as discussed above) expand this techniqueby adding the ability of exploring the impact of independent markers on the overallcorrelation with the external outcome. This enables the removal of unnecessary markersand provides a simple visualization of all identified cell types. In a recent analysisof a large (n=466) cohort of HIV+ patients, this pipeline identified three correlatesof protection against HIV, only one of which had been previously identified throughextensive manual analysis of the same dataset [5].172.2. Acute Myeloid Leukemia2.2 Acute Myeloid LeukemiaAML is a cancer of white blood cells involving the blood and bone marrow [53]. It isa highly heterogeneous disease, with several dozen recognised subtypes [54]. AML isrelatively rare in the general population, with a prevalence of 3.8 cases per 100,000.However, it is more common among adults over 65 (17.9 per 100,000) [55], and has avery poor outlook, with median survival ranging from 7 – 12 months [55]. Standardtherapies, including chemotherapy and allogeneic bone marrow transplantation, areharsh and often poorly tolerated due to the advanced age of AML sufferers.Over the past decade, risk stratification has greatly improved, enabling greatersupport of higher-risk patients and more informed patient decision making regardingtrade-offs between therapy risks and benefits. Much of this improvement has resultedfrom the recognition, in the late 1990s, that several karyotypic abnormalities are recurrentin AML, which confer either favourable or dismal prognoses [56]. Using cytogeneticrisk factor, patients are now typically divided into three risk categories – favourable(5-year OS: 45%–85%), intermediate (5-year OS: 20%–40%) and adverse (5-year OS:5% – 20%) [57]. Initially, these categories were based solely on cytogenetics, with allnormal-karyotype patients being placed in the intermediate risk category. However,since the mid-2000s, molecular genetic markers have been added, helping to place someNormal karyotype (NK) patients in the favourable category [57, 58].These refinements in risk stratification have slightly improved treatment outcomes,mainly through more targeted supportive care [57]. However, it is clear that stratificationcould be further improved. Furthermore, therapies could be enhanced, both in termsof their specificity for subtypes of AML and in their general ability to cure the disease[57]. Deep exploration of the biology of AML through high-throughput screens andaccompanying bioinformatic data mining is underway towards this end.182.2. Acute Myeloid LeukemiaTable 2.1: Clinical features and risk stratification of AML. Reproduced from [57]. c©2013Elsevier, re-used by permission.2.2.1 Leukemia Subtypes and HeterogeneityThe first systematic classification of leukemias was the French-American-British (FAB)system [59], which classed acute leukemias based on their cellular morphology. TheFAB system identified three classes of acute lymphoblastic leukemia and seven classes ofAML [59]. Over the next two decades, immunophenotyping of surface and intracellularmarkers by immunohistochemistry and flow cytometry became a routine addition tomorphological studies [60, 61]. Immunophenotyping could distinguish between AML andALL in cases without morphological differentiation [60], with up to 98% accuracy [61]and could also provide insight into the process of hematopoiesis, which leukemias arebelieved to be a perturbation of [60, 61]. Purely immunological classifications were thenproposed [62], and with the inclusion of cytogenetic information, the third World HealthOrganization (WHO) Classification of Tumours of Haematopoietic and Lymphoid tissues[63] was created. This was followed in 2008 by the fourth WHO classification [54], whichadditionally incorporated molecular genetic markers.The two molecular genetic markers included in the WHO 2008 are mutations in thenucleophosmin (NPM1 ) and CCAAT/enhancer binding protein-alpha (CEBPA) genes,192.2. Acute Myeloid LeukemiaTable 2.2: Recurrent non-karyotypic genetic aberrations in AML, including unmutated butoverexpressed genes and miRNAs. Reproduced from [57]. c©2013 Elsevier, re-used by permission.202.2. Acute Myeloid Leukemiaboth of which confer favourable prognosis [54, 64–68]. However, additional markers areknown, including internal tandem duplications in the Fms-like tyrosine kinase 3 (FLT3 )gene, which are associated with poor prognosis [69]. Later studies have tended to showthat NPM1 alone is insufficient to predict a significant difference in prognosis, and that itis only the combination of FLT3 -ITD with wild-type NPM1 which has prognostic value[58]. Furthermore, the individual genetic markers found in these studies have typicallybeen relatively frequently-occurring (see Table 2.2), yet have not fully accounted for thediversity of AML.This has led more recently to high-throughput sequencing studies. The first AMLgenome was published in 2008 [70], and the first multi-patient sequencing effort waspublished a year later [71]. Since then, the cancer genome atlas (TCGA) sequencingeffort has published a landscape paper, containing the genomes, transcriptomes, miRNAand DNA-methylation for a cohort of 200 AML patients [72]. Some of the results ofthese and other recent efforts are summarised in Table Flow Cytometry ImmunophenotypingAML is routinely diagnosed by the use of flow cytometry to determine the presence orabsence of specific antigens present on single cells (their immunophenotype) [73]. Today,flow cytometry immunophenotyping is a critical step in the process of clinical decisionmaking about leukemias [54, 73, 74]. It can be used to identify the lineage and maturityof cells, detect abnormal cells, document the phenotype of abnormal cell populations,diagnose a specific disease entity (or suggest further testing to do so), and provideinformation that may be of prognostic value [73] Although its use is somewhat restrictedby cost [73], it is employed widely in most cases where other factors suggest a diagnosisof leukemia.[74] The ability of the technique to sensitively characterise leukemias may,in conjunction with omics methods, lead to personalized treatment regimens.[75] Table2.3. lists some of the more important surface and intracellular proteins detected by flowcytometry, along with their specific utility. Figure 2.7 illustrates how commonly used212.2. Acute Myeloid LeukemiaHSCCMPMegakaryocyteThrombocytesErythrocyte Mast cell MyeloblastGranulocytes MonocyteMacrophageCLPB lymphocyteT lymphocytePlasma cellNK cellCD34, CD117CD13, CD33CD5, CD10, CD19, CD20CD2, CD3, CD4, CD7, CD8HLA-DRCD56CD61 CD14CD64Figure 2.7: Hematopoeitic markers detected by flow cytometry for leukemia diagnosis. Figureadapted from work c©Mikael Ha¨ggstro¨m, re-used under the terms of the CC-BY-SA 3.0 license.immunophenotypic markers correspond to normal hematopoeisis.2.2.3 Multitube Flow Cytometry DataThe number of markers typically assayed in leukemia diagnosis exceeds the number ofmarkers which commonly in-use instruments can measure. For example, one standardpanel recommends assaying 32 markers on 8-colour instruments [76]. To achive this,standard practice is to aliquot samples into multiple tubes, with common markers ineach [74, 76]. This is illustrated in Fig 2.8. In leukemia diagnosis, CD45 is typicallyincluded in each tube and used to identify leukemic blast cells manually in a scatterplotof CD45 vs side scattered light (SSC) [73, 77–79]. An example of the manual gating outof blast cells in the CD45 and SSC dimensions is illustrated in Fig 2.9.In most cases leukemic blasts fall within the same region of the scatter plot (SSClowCD45dim),as shown in Fig 2.10a. However, there are many exceptions, as shown in Fig 2.10b.222.2. Acute Myeloid LeukemiaTable 2.3: Markers commonly assayed in leukemia diagnosis significance, after [73]. All aremeasured on the cell surface unless otherwise indicated by the prefix “intra” for parametersmeasured intracellularly.Parameter Normal staining Clinical utilityCD2 T and NK cells Indicator of T or NK lineage, butsometimes expressed in AMLCD3 Mature T cells Indicates T lineage, but may be aber-rantly lostCD4 T cell subset, monocytes Classification of mature T neoplasmsCD7 T and NK cells Indicates T lineage, but may be aber-rantly lostCD10 Immature T/B cells, subset of ma-ture T/B, neutrophilsFrequently present in ALLCD13 Myeloid and monocytic cells May be aberrantly expressed in B-ALL, AML, MDSCD14 Monocytes Indicates monocytic differentiation,but not a sensitive marker of imma-ture monocytesCD19 All B cells Indicates B lineage. Usually absentin plasma cell neoplasms.CD20 Mature B cells, some T cells Supports B cell lineageCD33 Myeloid and monocytic cells May be aberrantly expressed in B-ALL, AML, MDSCD34 B, T precursors and myeloblasts (notmature cells)Indicates leukemic blastsCD56 NK and NK-like T cells Indicator of NK differentiation, butmay be present in AMLCD61 Megakaryocytes and platelets Megakaryocyte differentiationCD64 Monocytes and intermediate neu-trophilic precursorsIdentification of monocytic differen-tiation, but may be aberrantly ex-pressed in AMLCD117 Immature neutrophils and mast cells Identification of myeloblasts, butmay be expressed by more maturecellsHLA-DRintraCD3 All mature and immature T cells Indicator of T or NK lineageintraCD22 Cytoplasmic in early B cells, surfacein matureIndicates B lineage; intensity oftendiffers between subtypes of B neo-plasmsintraCD79aintraTdT B and T precursors Indicates ALL, but also found insome AMLintraMPO Neutrophilic and monocytic cells Indicator of myeloid differentiation232.2. Acute Myeloid LeukemiaBonemarrowaspirateSSCCD45 Isotype controlsCell surface markers.Intracellular markersCD45CD45SSCSSC.....aliquotsflow cytometry dataPopulation markerchannel Immunophenotyping marker channelsFigure 2.8: Multitube flow cytometry data on a four-colour flow cytometer. Thisfigure shows how a sample is aliquoted into multiple tubes, each containing CD45 and acombination of three markers unique to each tube. In this example, the first tube containsnegative controls, while subsequent tubes contain markers of interest.242.2. Acute Myeloid LeukemiaHSCMPegPakrMPCykoroPctPThmbsEE PrlMtPoMrPkSrcGyScnMouMtuMPrlnMolcyp gHSCMPLPBktpPoSCoMNSMtrKgPakrMPCykoroPctPThmbsEE PrlMtPyccDPkrPM34nMoo,ctPnMykr,1MPrcPkSrcGyScnMouMtuMgFigure 2.9: Manual analsysis of multitube flow data for leukemia immunophenotyp-ing. Blast cells are identified by the pathologist based on their CD45 (typically dim) and sidescatter (typically low). The first tube is usually stained with negative control markers in thechannels besides CD45, and this staining is used to set a threshold distinguishing positive fromnegative marker expression. In the subsequent tubes, this threshold is used to compute the pro-portion of cells which express each marker. Figure from Dr. Bakul Dalal, VGH Hematopathology.Used by permission.252.2. Acute Myeloid LeukemiaFigure 2.10: Gating leukemic blasts in CD45/SSC: ideal vs reality. For manypatients, identifying blasts is relatively simple, and a generic gating net can be used (a). However,exceptions such as those shown in (b) are common. Gating blasts in these cases requires moreelaborate analysis and deeper knowledge, and would be challenging to automate. Gating netfrom Dr. Bakul Dalal, VGH Hematopathology. Used by permission.While automating blast identification and immunophenotyping would be desirable, themovement of both the blasts and other cell populations in some cases would confoundattempts to do so. A more realistic approach to automated analysis of leukemia datawould be to combine tubes from each patient together without knowledge of the celltypes within, and then compare that data across patients.2.2.4 Bioinformatic Approaches for Combining Multitube DataTo exploit multitube flow cytometry data computationally, a method has been developedfor combining multiple aliquots based upon the location of cells in the common parameters.This uses k-nearest neighbours (k-NN) to match cells closest to each other in the common262.2. Acute Myeloid Leukemiaparameters, and combine their expression in other dimensions [80]. However, as othershave shown [81], and I show later in this thesis, populations defined in terms of populationmarkers are frequently made up of a mixture of cell types, and k-NN consequently tendsto produce spurious combinations of markers. This makes the merged output fromk-NN poorly suited for applying the deep profiling techniques, such as flowType andSPADE, as it tends to skew the counts of cell types. One proposed solution to thisis to constrain the nearest-neighbours mapping with clustering incorporating domainknowledge [81]. However, this latter method requires that all cell types expected to bepresent be pre-specified in order to parameterise the clustering step. So, although wellsuited to diagnostic pipelines where the goal is to quantify known cell types, it is poorlysuited for discovery of new cell types. There remains a need for a multitube combinationmethod for FCM data that produces conservative, non-imputed data suitable for deepprofiling.2.2.5 Surveying the Immunophenotypic Landscape of LeukemiaDespite several broad and comprehensive studies of AML in terms of its genome, transcrip-tome, miRNA, and methylation status, only one study has performed a comprehensivebioinformatic analysis of AML immunophenotypes, and no studies have incorporatedthis with other data. That single study examining cell subtypes in AML was carried outon a small cohort (ten normal and four leukemic), and only for one single four-colourflow cytometry panel (containing CD13, CD33, CD45 and CD34) [82]. The existence ofa method for accurately analysing multitube flow cytometry data (as is often availableboth retrospectively and prospectively in AML diagnosis) would enable such surveyingof the immunophenotypic landscape of AML. In Chapter 4, I present flowBin, a methodI developed for combining multitube flow cytometry data, and in Chapter 5 I detailseveral analyses I performed using flowBin on retrospective multitube flow cytometrydata from AML patients.272.3. Strand-seq2.3 Strand-seqStrand-seq is a method for directional, low-coverage sequencing of DNA template strandsin single cells [83]. In strand-seq, cells are cultured for one cycle of cell division in thepresence of bromodeoxyuridine (BrdU). BrdU is a synthetic nucleotide which subsitutesfor thymidine in the newly synthesised DNA strands of the daughter cells. Single cellsare then sorted into separate plate wells, where their DNA is extracted and fragmentedby micrococcal nuclease digestion. Forked adapters are ligated to each double strandedfragment, enabling polymerase chain reaction (PCR) replication from either direction.The adapters used for each cell contain a unique 6-nucleotide index sequence, allowingmultiple cells to be sequenced together and later deconvoluted. Prior to PCR, theBrdU-incorporating strand is nicked at BrdU sites by exposure to ultraviolet light andHoechst 33258. During PCR, only the unfragmented template strand is amplified forsubsequent Illumina sequencing. The directionality of the short reads thus correspondsto the template strand inherited by the daughter cell.The resulting sequence and directionality data can then be aligned back to thereference genome for the organism. The read counts in each direction can be visualisedas a histogram along each chromosome (Fig 2.11d). For autosomes, since there are twophysical chromosomes per reference sequence, the template strand state can only beinferred for both chromosomes at once, as being one of: WW, WC or CC. Visuallyinspecting the read count histograms can indicate this state (see Fig 2.11d).2.3.1 Detection of Sister Chromatid Exchange EventsStrand-seq was developed to test the silent-sister hypothesis (SSH) [85] and immortalstrand hypothesis (ISH) [86], which propose that template strand segregation duringmitosis may be asymmetrical in some cell types, for functional reasons. SSH proposedthat this might be a determinant of stem cell fate, while ISH proposed that this mightbe for the purposes of maintaining a non-copied (and higher integrity) template strandin stem cells. As such, the goal of strand-seq was to test whether strand inheritance282.3. Strand-seqFigure 2.11: Overview of strand-seq protocol. Figure reproduced from [84]( c©2013Elsevier), parts of which were reproduced from [83] ( c©2012 Nature Publishing Group). Re-usedby permission of both copyright holders.292.3. Strand-seqduring cell division was random or segregated [84].While strand-seq has yet to confirm either hypothesis, early results indicated that ithad other applications, most notably sensitive detection of sister chromatid exchanges(SCEs) [83]. SCEs are exchanges of genetic material between identical sister chromatids.They occur as a result of double stranded breaks being repaired during homologousrecombination [87]. SCEs occur spontaneously at an average rate of approximately 10per mitosis in normal human cells [87]. Elevated rates of SCE is an indicator of genomicinstability, and is an important diagnostic test for Bloom’s Syndrome [88].SCEs show up in strand-seq data as marked discontinuities in the strand inheritancestate for a particular chromosome pair, allowing them to be detected at a resolutionapproaching the read depth of the strand-seq (Fig 2.11d) [83]. Detecting SCEs in thisway can reach a resolution of the order of kilobases, several orders of magnitude finerthan previous techniques [83].2.3.2 Applications of Strand-seq to LeukemiaGenomic instability is heavily interrelated with cancer [89], and has been shown to bea useful prognostic marker in some cancers [90–92]. However, detection of genomicinstability depends on either gene expression signatures [92] or breakpoint detection[90, 91]. Gene expression signatures are an indirect measurement, while breakpoints arelate indicators, occurring only after DNA repair mechanisms have been overwhelmed. Bycontrast, rates of sister chromatid exchange are a direct measure of genomic instability,before misrepaired breaks occur. As such, strand-seq, as a low-cost, direct measure ofgenomic instability, could be useful for the diagnosis and staging of cancers generally,and leukemia more specifically.Furthermore, although it has been established for more than two decades that tu-mours are primarily monoclonal in origin [93], interest has arisen in recent years inunravelling the subsequent clonal diversification and evolution [94, 95]. Understandingcancer clonal evolution is likely to shed light on clinically catastrophic clonally-driven302.3. Strand-seqprocesses like metastasis and chemotherapy resistance, and lead to better therapiesto overcome these [96]. One facet of this evolutionary process is the acquisition andloss of genomic rearrangements. These are currently detected by fluorescence in-situhybridisation (FISH), using cDNA probes specific for known rearrangements [97]. WhileFISH is able to detect rearrangements in single cells, results require visual inspection ofsample micrographs by expert pathologists, and are thus labour-intensive and poten-tially subjective. Strand-seq has the potential to enable the rapid and non-subjectiveidentification of genomic rearrangements, including those not yet characterised, on asingle-cell basis [83]. This could be a powerful, low-cost tool for elucidating the clonalstructure of cancers generally, and leukemias specifically.2.3.3 Automated Analysis of Strand-seq DataIn the original strand-seq study, all analysis was performed manually by visual inspectionof read count histograms [83]. This was a limiting factor in the usefulness of thetechnology Chapter 6 described my efforts, in collaboration with Mark Hills, to developsoftware tools to automate these processes, as well as later work we undertook to developtools to use strand-seq for assembly of early build genomes.31Chapter 3Enhanced FlowType forHigh-Dimensional Single CellData3.1 IntroductionFlow cytometry has undergone a “chromatic explosion” over the past decade and cannow measure 17 markers at once for each of hundreds of thousands of individual cells[15]. Since then, mass cytometry has enabled measurement of 30–45 markers per cell[98], while single-cell multiplexed RT-qPCR can measure 50–96 mRNAs per cell [17].The growth in high-throughput single-cell data continues to outpace development ofcorresponding bioinformatics techniques [15]. To answer this challenge, flowType [3] andRchyOptimyx [4] were developed. FlowType uses partitioning of cells, either manuallyor by clustering, into positive or negative for each marker to enumerate all cell typesin a sample. RchyOptimyx measures the importance of these cell types by correlatingtheir abundance to external outcomes, such as disease state or patient survival, anddistills the identified phenotypes to their simplest possible form. These packages havebeen used to identify several novel cell populations correlated with HIV outcome [3].More recently, this pipeline has been used to evaluate standardised immunological panels[99], to optimise lymphoma diagnosis [100], and to analyse a range of other clinical data(unpublished).However, the higher dimensionality of data produced by mass cytometry generates323.2. Methodologyup to 345 ≈ 1021 possible cell types, with an even greater number (up to 396 ≈ 1045)for single-cell qPCR; these magnitudes are beyond the capabilities of flowType andRchyOptimyx. Furthermore, flowType and RchyOptimyx have thus far only treatedcells as being either positive or negative for a marker. In practice, many biomarkerscan have a range of expression levels such as “dim” and “bright”. In this chapter,I detail architectural improvements I and my co-authors undertook to flowType andRchyOptimyx to overcome these limitations.3.2 MethodologyOur primary challenge was to enable flowType to generate a number of cell typestractable on most common workstations as well as nodes in compute clusters (e.g., thosewith 4–12GB of RAM). To this end, we developed a new version of flowType, which Ihereafter denote as flowType dynamic programming (flowType-DP), to distinguish itfrom the old version, which I denote as flowType brute force (flowType-BF).3.2.1 Dynamic Programming ImplementationTo improve computation time, flowType-DP uses a dynamic programming approach,which exploits the fact that cell types can be arranged into a hierarchy, and membershipof any given cell type over n markers is equal to the intersection of one of its parent types(over n− 1 markers) with a single-marker cell type. FlowType-DP first enumerates all celltypes involving only 1 marker by simple partitioning and then iterates over 2, ..., k markers,computing all cell types for each level n by set intersections between corresponding celltypes in levels n− 1 and 1.For example, membership of the cell type CD45++CD117+CD34− is computed as333.2. Methodologyfollows:{CD45++CD117+CD34−}={CD45++CD117+} ∩ {CD34−}={CD45++} ∩ {CD117+} ∩ {CD34−}For added speed, the core flowType-DP functionality was implemented in C++. Tooptimise the calculation of set intersection, the set of cells belonging to each cell type wasrepresented using a bit string, with one bit per cell. The Boost library dynamic bitsetclass was used to store these bit strings. This allowed set intersection to be computedusing a bitwise AND operator, which translates to a single machine instruction.3.2.2 Breadth-First StrategyFlowType-BF completely enumerates all cell types over all [1, ...,m] markers. As noted inthe introduction, this can easily result in an intractable number of cell types. Furthermore,it is common with flowType that the most significant cell types are made up of far fewermarkers than the full number available. For example, in the original flowType paper,despite 11 markers having been used, the cell types most significantly correlated withsurvival were defined over 2, 3 and 7 markers, respectively. FlowType-DP thus enablesa user to use a breadth-first strategy of enumerating all cell types defined over a subsetof k ≤ m markers only.FlowType-DP provides a memory use estimation function, to assist users in findinga k that fits within the limits of their hardware. The number of cell types N for a givencutoff k can be computed using a dynamic programming algorithm. Where Ni denotes343.2. Methodologythe number of cell types involving i markers, in a dataset with m total markers:N1 = m (3.1)N2 = N1 ×N1 (3.2)Ni = Ni−1 ×Ni−2 (3.3)For a given k, flowType recursively computes N1···k, and then computes N and the totalmemory required MP (in bytes):N =i=1∑kNi (3.4)MP = m×N (3.5)To count the memory needed to store the bit string representation (MB), max(Ni) isfound, and multiplied by the size of one bit string, as follows (where n is the number ofcells):MB =max(Ni)× n8(3.6)The final memory estimate in bytes is taken as the sum of these two numbers(MP +MB).3.2.3 Partitioning into Multiple Expression LevelsTo allow partitioning into levels other than positive and negative, a string representationis used for cell types. The string has one integer character for every marker, denotingthe partition, or zero if the marker is not used. Values 1, ..., n denote partitions 1 to n.For example, if the set of markers were {CD3, CD45, CD13, CD117, CD34} the cell typeCD45++CD117+CD34− would be represented by 03021. RchyOptimyx uses a dynamicprograming algorithm for efficiently constructing k-shortest paths [101]. RchyOptimyx’353.3. EvaluationFigure 3.1: Run time comparison of flowType-DP to flowType-BF in terms of numberof cells (a) and number of markers (b). Figure is my own work,[2] c©2014 Oxford UniversityPress, used by permission.graph construction component was modified to be able to handle more than one partitionper marker.3.3 Evaluation3.3.1 PerformanceFlowType-DP was evaluated against flowType-BF on a 10-marker dataset available fromFlow Repository (ID FR-FCM-ZZZK) [3]. FlowType-DP showed a substantial speedupover flowType-BF, which increases exponentially with the number of cells and markers.For example, at 106 cells and 10 markers, flowType-DP is 14 times faster (see Fig. 3.1).Comparison on larger datasets was not possible, due to the limitations of flowType-BF.3.3.2 Memory LimitsI also computed the limits for k on a hypothetical machine with 12GB of RAM for samplesrepresentative of mass cytometry (Fig. 3.2a) and polychromatic flow cytometry (Fig.3.2b), both of which would be intractable for flowType-BF. FlowType and RchyOptimyxare now able, within the memory of a common workstation (12GB), to analyze 34-marker363.3. Evaluationdata.3.3.3 Example Case: Multiple Expression LevelsFinally, to demonstrate the importance of several partitions per marker, I applied flow-Type and RchyOptimyx to an acute myeloid leukemia sample from Flow Repository(ID FR-FCM-ZZYA) (Fig. 3.3). CD34 is a stem-cell marker typically expressed onAML blast cells. These blasts are also known to have dimly positive CD45 expressionand low SSC [102]. By partitioning CD45 and SSC into four and three partitions, andnaively running flowType and RchyOptimyx to search for CD34-enriched cell types, Iwas able to find that the SSClowCD45dim cell type had a high proportion of CD34+ cells,as expected. This would not have been possible with only two partitions for each ofCD45 and SSC.Figure 3.2: Possible thresholds for marker combinations using flowType-DP fortypical mass cytometry data (a) and polychromatic flow cytometry data (b). Figure is my ownwork,[2] c©2014 Oxford University Press, used by permission.373.4. DiscussionFigure 3.3: Three/four-partition flowType-generated, RchyOptimyx-visualized celltype hierarchy on a bone marrow sample from a patient with AML. Cell population iden-tification strategy used for SSC and CD45, with the CD34-enriched subset highlighted (a).RchyOptimyx analysis showing CD34 enrichment (b). Figure is my own work,[2] c©2014 OxfordUniversity Press, used by permission.3.4 DiscussionIn this chapter I have presented a new, enhanced version of flowType, capable ofprocessing very high dimensional data available from current methods. This newflowType is also substantially faster than the previous version due to being implementedin C++ and optimised using a dynamic programming algorithm. Most importantly,flowType-DP allowed me to use flowType in conjunction with flowBin, the method Idescribe in Chapter 4. In Chapter 5, I detail two studies in which I applied flowBin andflowType-DP, where flowType-BF would not have been feasible.One minor disadvantage compared to the flowType-BF is a slight increase in memoryusage by the dynamic programming algorithm for storage of the phenotypes. However,this is only an issue on lower-dimensional data, since flowType-BF is incapable ofhandling higher-dimensional data at all. Another potential but necessary disadvantageof flowType-DP is that it no longer finds all possible cell types on larger data sets (sincethis would be intractable).38Chapter 4FlowBin: Deep Profiling ofMultitube Flow Cytometry Data4.1 IntroductionOften in flow cytometry immunophenotyping, the number of proteins needing to beassayed exceeds the number that the cytometer available can measure in a single run.Furthermore, it is often desirable to use negative controls (either unstained or isotype)to counteract technical variation across samples [34]. As a solution to this problem,standard practice is to aliquot a sample into multiple tubes, each of which is run witha disjoint subset of the total panel of proteins. Typically, some of the channels ineach tube are taken up by stains which are kept the same across tubes, in order toaid in identifying cell types of interest. This process is common for modern clinicaldiagnostic flow cytometry data, especially when immunophenotyping leukemias wherethe standard method is to include the pan-leukocyte marker (CD45) in each tube, anduse this in combination with right-angle scattered light (SSC) to identify leukemic blastsin each tube separately [79]. For example, the current standard for leukemia diagnosisestablished by the EuroFlow consortium recommends an eight-colour multitube panel ofoverlapping reagents [76].Without some means of combining tubes together, existing techniques for deepprofiling can only be applied serially to each tube in a multitube flow cytometry assay,which results in a substantial loss in depth. This can be illustrated with an example:consider an assay with six tubes containing six markers each, with two of those markers394.1. Introductionoverlapping (being present) in every tube. The complete number of distinct markerswill be 2 + 4× 6 = 26. When examining a binary division of each marker into positive andnegative expression, the total number of possible cell types present is 326 ≈ 2.5× 1012 [3].However, working one tube at a time, only 36 cell types can be elucidated in each tube,for a total of 36×6 = 4374. For this example, serial analysis can only explore approximatelyone hundred millionth of the complexity of the phenotype space.Per-cell k-NN merging of tubes attempts to address this. This method is founded onmaking the assumption that a cell in one tube is identical to its nearest neighbour inanother tube in terms of the common population markers [80]. The expression vectorsof all the nearest neighbours across tubes are merged, creating a single, high-colourmatrix of cellular expression. k-NN merging has proven effective as part of classificationpipelines [38, 76, 103].However, as others have shown [81], and I show later in this chapter, populationsdefined in terms of population markers are frequently made up of a mixture of cell types,and k-NN consequently tends to produce spurious combinations of markers. This makesthe merged output from k-NN poorly suited for applying the deep profiling techniques,such as flowType and SPADE, as it tends to skew the counts of cell types. Oneproposed solution to this is to constrain the nearest-neighbours mapping with clusteringincorporating domain knowledge [81]. However, this latter method requires that all celltypes expected to be present be pre-specified in order to parameterize the clustering step.So, although well suited to diagnostic pipelines where the goal is to quantify knowncell types, it is poorly suited for discovery of new cell types. There remains a need fora multitube combination method for flow cytometry data that produces conservative,non-imputed data suitable for deep profiling. In this chapter, I describe flowBin, a free,open source R/Bioconductor package that I developed to fulfil this need.404.2. Design and Implementation4.2 Design and ImplementationFlowBin is designed to accept multiple FCM assays from the same multitube assay andcombine these into a complete matrix of measurements for all the markers. To this end,flowBin consists of four stages: 1) normalization, 2) binning, 3) bin matching acrosstubes, and 4) expression measurement. These are illustrated in Figure 4.1, and in detailhereafter.Figure 4.1: Overview of the flowBin pipeline, applied to one multitube sample. 1)Flow cytometry data from individual aliquot tubes is quantile normalized in terms of thecommon population markers present in every tube. 2) The tubes are then binned in terms ofthese population markers, using either K-means or flowFP. 3) The bins from the first tube aremapped to the other tubes (by nearest-neighbour mapping for K-means bins, or directly forflowFP bins). 4) The expression of each bin in terms of each phenotyping marker (those markersdiffering across tubes) is measured. This may be done by taking median fluorescent intensity,normalized median fluorescent intensity, or proportion of cells exceeding the 98th percentile of anegative control. The final result is a high-dimensional matrix containing expression levels foreach bin in terms of each unique marker.414.2. Design and Implementation4.2.1 Population Marker NormalizationA consideration in combining multitube flow cytometry is that variations betweenstaining patterns across the aliquots need to be minimized [80]. In opposition to thisare a host of sources of technical variation, ranging from slight differences in samplehandling and preparation to instrument drift between runs. Although great pains aretaken by operators to reduce these, small variations may still exist. To counteract this, Iincluded a feature in flowBin to quantile normalize population markers.Since tubes contain physical samples drawn from a common population, their truedistributions in terms of population markers are expected to be identical, and anydeviations to represent technical variation. Quantile normalization transforms twosamples so that they have identical distributions, and has been used extensively in geneexpression analysis [104]. FlowBin uses quantile normalization to bring similar cells intogood registration, using the quantile normalization implementation from the LimmaBioconductor package [105]. The Limma implementation is capable of normalising inthe presence of missing values, and hence can normalize data where the number of cellsper tube varies.4.2.2 Binning of Population MarkersIn order to bin cells in terms of the overlapping markers present in all tubes, flowBinprovides two methods: K-means clustering and probability binning. K-means clusteringwith a high value for K and nearest-neighbour joining has been used successfully in thepast for identifying cell populations in flow cytometry data [43]. Probability binning isa binary space partitioning method for flow cytometry data in which each partitioningstep maintains equal probability density within both partitions created [45]. Probabilitybinning has been developed for use as a “micro-gating” algorithm in the form of theflowFP Bioconductor package [29]. Either method is typically used to partition the cellsin a sample into bins containing enough cells to be able to extract average expressionvalues; in our examples we chose this to be around 200 cells per bin.424.2. Design and Implementation4.2.3 Bin Matching Across TubesTo enable expression to be combined accurately across tubes, the bins must be spatiallyas close to identical as possible in each tube. This is easier for flowFP, since the partitionshave linear edges, which can easily be mapped directly to individual tubes. K-means,by contrast, produces approximately spherical clusters, with more difficult to describeboundaries. Rather than attempt to extract the boundaries of K-means clusters, flowBindraws on the idea of nearest-neighbour mapping, except that bin labels are mapped,rather than cellular identity. FlowBin K-means clusters the first tube in the set, andthen for each subsequent tube, labels each cell according to the label of the closest cellin the first tube. Once all the cells in each tube have been assigned to cross-tube bins,flowBin moves on to calculating the expression of each bin in terms of the tube-specificexpression markers.4.2.4 Expression MeasurementFlowBin provides two methods for determining expression in each bin, modelled oncommon practice by flow cytometry analysts. Firstly, normalized median fluorescentintensity (MFI) can be computed (by subtracting the untransformed MFI in terms ofnegative control from that of the expression marker). Secondly, a threshold may be setat the 98th percentile of the negative control, and the proportion of cells exceeding thatthreshold in the expression marker channel reported, as is common practice in manystudies [106–108].4.2.5 Downstream AnalysisThe final output of flowBin is a high-dimensional matrix of expression values for each bin(Fig. 4.1). This can provide a useful overview of the makeup of a sample, for exampleby plotting a heatmap of bin expression values. However, far greater utility comes inthe downstream analysis methods that this enables. FlowBin output can be treated asthough it were FCM data with a relatively low number of events but a high number of434.3. Methodsmarkers. Then, methods for deep profiling, such as flowType and RchyOptimyx, can beapplied.4.3 Methods4.3.1 Validation of Quantile NormalizationTo validate flowBin’s quantile normalization, I used an acute myeloid leukemia (AML)data set (Flow Repository:FR-FCM-ZZYA) [5]. This data set consists of 359 samples,with 8 tubes each, with 6 markers assayed in each tube. Every tube had an assay forCD45 as well as forward- and side-scattered light. flowBin’s normalization was evaluatedby applying it to each of these markers.4.3.2 Comparison of Binning MethodsTo aid users in choosing between the two binning methods flowBin provides, I comparedthem on a representative sample from the same AML data set as for the quantilenormalization (Flow Repository:FR-FCM-ZZY). Figure 4.3 shows both the general formof the bins, and the distribution of cells across bins for comparison.4.3.3 Comparison to Per-cell Nearest-neighboursTo compare flowBin to the per-cell nearest-neighbour merging of Pedreira et al., I createda small, synthetic example using real data containing peripheral blood mononuclear cellsstained for CD3, CD4 and CD8. For the source data, I used data from the United StatesMilitary HIV Natural History Study (FlowRepository:FR-FCM-ZZZK). I first removeddoublets, debris, dead cells and monocytes, as per [3], leaving only CD14− live cells. Ithen created two artificial tubes by randomly sampling two sets of 5,000 cells from theoriginal sample.444.4. Results4.3.4 Validation on Simulated Multitube Data from PolychromaticFlow CytometryTo assess the abilities and limitations of flowBin, I again took data from the United StatesMilitary HIV Natural History Study (FlowRepository:FR-FCM-ZZZK). I preprocessedthe data as per [3], screening out debris, doublets and non-viable cells, then finally gatingfor CD3+ cells (T cells). Patients with fewer than 3,000 events remaining were removed,leaving 426 patients, with 12 fluorescent and two scatter channels.To create simulated tubes, I chose CD3, CD4 and CD8 to use as common markers,then divided the remaining nine among three tubes. I divided the events for each patientrandomly into three, and discarded all the markers for each that were not to be includedin that tube. A summary of all the markers present in each tube is shown in Table 4.1.I then ran flowBin on each patient’s three tubes, using FSC, SSC, CD3, CD4 and CD8as binning markers, with 128 bins and flowFP as the binning method. I ran flowType onthe flowBin output (excluding CD3), and carried out survival analysis (Cox-PH and thelogrank test) on the the flowType data as per [3]. I also ran flowType and the subsequentsurvival analysis on the original, full-colour flow cytometry data, again as per [3].I compared the cell counts of individual cell types between the true counts from theflowType run on the original high-colour data, and the flowType run on the flowBindata, in terms of their Pearson correlation. I also compared the P-values of the logranktest for each cell type.4.4 Results4.4.1 Validation of Quantile NormalizationFigure 4.2 shows the results of applying quantile normalization to a single sample, in one,two and three dimensions. In a single dimension, all tubes are made to have identicalcumulative distribution functions. This is as expected for quantile normalization. Toensure that the quantile normalization was improving the higher-dimensional registration454.4. ResultsTable 4.1: Markers in simulated multitube data. The data was split into three tubes, eachcontaining CD3, CD4 and CD8 in addition to FSC and SSC. The remaining nine markers weredistributed across the tubes, three per tube.Marker Type Tube 1 Tube 2 Tube 3Common (scatter) FSC FSC FSCCommon (scatter) SSC SSC SSCCommon (fluorescent) CD3 CD3 CD3Common (fluorescent) CD4 CD4 CD4Common (fluorescent) CD8 CD8 CD8Phenotyping (fluorescent) KI67 CD57 CD27Phenotyping (fluorescent) CD28 CCR5 CCR7Phenotyping (fluorescent) CD45RO CD19 CD127of cells across tubes, and not introducing artefacts, two-dimensional scatter plots werequalitatively assessed, as shown in Figure 4.2. But to achieve a more objective, n-dimensional assessment of registration, I used cytometric fingerprinting [29]. Thisinvolved probability binning on all tubes pooled together, then measuring the degree towhich the bin density in individual tubes varies from that of the pooled sample. In ourexample case, there was substantial deviation across tubes before normalization, whichwas almost completely removed after normalization.To measure this objectively over the whole data set, I applied flowFP to measurethe standard deviation before and after normalization for each tube of all 359 samples.Of a total 2,513 tubes, 2,207 (88%) showed improvement, 39 (1.6%) showed no change,and 267 (11%) showed a wider (worse) standard deviation following normalization.4.4.2 Comparison of Binning MethodsIn terms of form, K-means produces spheroid bins that are more likely to follow thecontours of the underlying data. FlowFP, by contrast, produces bins with strictlyhorizontal-vertical borders, which are grid-like and less likely to follow the underlyingcontours of the data.In terms of cluster density, flowFP (by the nature of its algorithm) produces bins464.4. ResultsFigure 4.2: One, two and three-dimensional representations of quantile normaliza-tion of population markers. Empirical cumulative density function (ECDF) plots are shownfor all tubes and for forward scatter (FS), the most variant marker. Following normalization, theECDF for all tubes is identical, as is expected from quantile normalization. Two-dimensionalscatter plots for representative tubes show visually the improvement in two-dimensional registra-tion. Lastly, flowFP plots show the improvement in three-dimensional registration, measured bythe standard deviation of the number of cells falling within each bin, after bins have been fittedto the consensus of all tubes.with very even densities (SD of means = 0.07). By contrast, K-means produces a widerange of bin densities (SD of means = 255), with the most dense bins containing asmuch as 20-fold as many cells as the least dense.4.4.3 Comparison to Per-cell Nearest-neighboursThe results are presented in Figure 4.4, alongside the original data they were drawnfrom. The nearest-neighbours approach produced an artefactual CD4+CD8+ population,which is absent from the original data, and occurs only rarely in nature [109]. FlowBin,by contrast, creates a spectrum of values in a hyperbolic curve between the two “true”populations.474.4. ResultsFigure 4.3: The two options for binning within flowBin: k-means and flowFP, asapplied to a 7-tube sample. a. and b. show comparisons between the bin labels themselves.K-means creates roughly spherical bins, which conform around the location of cell populations.FlowFP creates grid-like bins,which may not conform to the true underlying shape of cellpopulations. c. shows the number of cells per bin across all tubes, for every bin. flowFP hasapproximately the same mean distribution of bin density across tubes as K-means (mean SD:24.6 vs 28.5). However, flowFP has a much closer to constant number of cells per bin across bins(SD of means: 0.07 vs 255).4.4.4 Validation on Simulated Multitube Data from PolychromaticFlow CytometryRunning flowType with 11 markers and two partitions per marker gives a total of 311 =177, 147 possible cell types. I excluded those with 0% abundance across all patients, leaving119, 479. Examining the three characteristic survival-associated cell types found in [3],more abundant cell types (especially KI-67+CD127−) appear to have better correlation,while rarer cell types (especially CD45RO+CD8+CCR5−CD27+CCR7−CD127−) havemuch poorer correlations (see Figure 4.5a). Importantly, the flowBin results for KI-67+CD127− show a strong correlation with the true data, despite KI-67 and CD127being in separate tubes. Looking across all cell types displays this same pattern (see484.4. ResultsFigure 4.4: Comparison between nearest-neighbours merging and flowBin for twotubes computationally sampled from a real data set. a. Raw data (compensated,transformed and filtered for debris), gated for CD3+ cells, and showing the true CD4 andCD8 distribution. b. The two sampled tubes, one containing CD4 and the other CD8. TheCD4+ population has slightly higher average CD3 than the CD8+, but both have substantiallyoverlapping CD3 distributions. c. Results of merging by nearest neighbours and by flowBin,including proportion of resulting “cells” falling within each quadrant. The nearest-neighboursmerging created a substantial CD4+CD8+ population not present in the original sample. Bothnearest neighbours and flowBin slightly overestimate the CD4−CD8− population. FlowBinis more accurate at reproducing the CD4+CD8− and CD4−CD8+ populations than nearestneighbours.Figure 4.5b). Although some low-abundance cell types show strong correlations, it islikely that this was by chance, due to their having very low values in all patients. Sincethe flowBin results for the majority of cell types with a median abundance of 10% ormore had a strong Pearson correlation with the true data, I chose to only do furtheranalysis on those, leaving 1,896 cell types.Comparing P-values, these showed a relatively good correlation (R2 = 0.65), with theP-values resulting from flowBin being slightly higher than the true P-values (Figure4.5c). Following Bonferroni correction, the cell types that were called as significant werematched between the true high-colour analysis and flowBin. Of the 592 cell types thatwould be called significant in the true data, only 58 were called significant by flowBin(9.8%). However, those 58 represent the majority of the 66 cell types that flowBin called494.4. Resultssignificant (88%).Figure 4.5: Performance on flowBin in reproducing a high-colour flow cytometryanalysis on simulated flow cytometry data. a. Comparison of counts of selectedphenotypes between actual data and simulated multitube data recombined by flowBin, withlinear regression fit lines. Ki-67+ was selected as being representative of a phenotype with thefull range of abundance across patients. The remaining three phenotypes are the representativephenotypes of the three classes found in the original study [5]. More abundant phenotypesshow a good, though imperfect fit. Less abundant phenotypes show a poorer fit. b. Pearsoncorrelations for all phenotypes between actual values and flowBin-recombined values, vs medianabundance of the phenotype. Below an abundance of approximately 0.1 (10% of all cells inthe sample), the correlation becomes decoupled from abundance. c. Comparison of P-valuesbetween actual and flowBin-recombined data, for only those phenotypes with greater than 10%abundance. The P-values show a good correlation (R2 = 0.65). d. Cell types called as beingsignificant on flowBin-recombined simulated multitube data (blue) and the raw data (purple).Just less than 10% of the phenotypes that were called as significant in the actual data werealso called as such in the flowBin-processed data (high type II error). Eight phenotypes wereinappropriately called as significant (very low type I error).504.5. Discussion4.5 Discussion4.5.1 Validation of Quantile NormalizationThe results of the comparison between non-normalized and normalized data suggest thatquantile normalization can improve the registration of cells in terms of their populationmarkers. This suggests that it should be considered before applying tube combinationmethods, including flowBin.4.5.2 Comparison of Binning MethodsFor binning, flowFP gives much less numerically dispersed results than K-means, whichis essential for accurate flowType counts (Fig. 4.3). FlowFP binning may thus be abetter choice for downstream applications that depend on accurate cell counts. Forexample, if flowFP is used for binning, the assumption can be made that each data pointin the flowBin results has the same number of cells contained within it. If flowTypeis then applied to that data, the counts of cell types which flowType produces can beconsidered to be relatively accurate representations of the true counts.K-means, by contrast, gives better fitted bins, but with greater variation. K-meansbinning thus may be a more attractive choice if later back-gating of interesting populationsis desired. K-means may also be the only feasible choice in cases where there are verymany population markers, as flowFP’s binary space partitioning runs into combinatorialdifficulties.4.5.3 Comparison to Nearest NeighboursThe comparison (Fig. 4.4) showed that flowBin produced a distribution across quadrantsand overall that more closely approximated the true distribution, while nearest-neighboursproduced an artificial CD4+CD8+ population. The reasons for this spurious populationare most likely as follows: Both CD4+ (helper) and CD8+ (cytotoxic) T cells express theT cell receptor, CD3. In other words, the CD3+ population contains a mixture of the two,randomly distributed within it. Consequently, when the nearest-neighbours approach514.5. Discussionmatches a CD3+ cell in the first tube to the cell with the nearest CD3 expression inthe second, it is in effect sampling randomly from a mixed population. Since the ratioof CD4+ to CD8+ cells is roughly equal, about 50% of the time a helper T cell willbe correctly matched with a helper T cell, and a cytotoxic with a cytotoxic. But forthe other 50%, a helper T cell will be matched with a cytotoxic T cell, resulting inthe combined cell being reported as either CD4-CD8- or CD4+CD8+, producing thespurious populations.By contrast, flowBin samples groups of cells, and reports their average expression, inthis case MFI. Thus most bins will contain a mixture of CD4+ and CD8+ cells, withthe ratio varying in a Gaussian manner around the mean. Since it is extremely unlikelythat flowBin will sample 200 cells of opposing types from each tube in the same bin,flowBin is much more robust against producing these kinds of spurious results.This suggests that flowBin is a better choice for situations where it is desirableto recover the underlying cell types accurately, such as cell type discovery. Nearest-neighbours merging, due to its tendency to produce spurious combinations of markerexpression, is poorly suited to techniques which require precise counts of particular celltypes, such as flowType, or manual analysis. However, nearest-neighbours has provenextremely effective when used as part of a classification pipeline [38, 76], whereas flowBinloses some information as a result of averaging. As such, the two methods can becomplementary to each other.4.5.4 Validation on Simulated Multitube Data from PolychromaticFlow CytometryWhen compared with simulated multitube data in a complete analysis pipeline withflowType, flowBin reproduced the true underlying trends in the data, but with loweredstatistical power and lowered sensitivity. The individual correlations (Fig 4.5a and b)suggest that cell type counts based on flowBin recombination of tubes reproduces thetrue counts for most cell types of greater than 10% abundance. This suggests that while524.5. DiscussionflowBin may not be suitable for analysis efforts examining rare cell populations (such asminimal residual disease in leukemia or T cell subsets), it is a useful tool for examiningmore abundant cell types (for example the gross heterogeneity among tumour types).For reproducing the results over the entire pipeline, flowBin performed well once celltypes of less than 10% average abundance were filtered out. P-values of the survivalanalysis were close to those of the pipeline run on the true data, but slightly raised. Thisresulted in increased type II error, but minimal type I error. This suggests that whileflowBin introduces some noise, the final effect is only to lower the statistical sensitivityof analyses performed on flowBin expression data, but not to produce false positives.Interestingly, KI-67+ showed a slightly sigmoidal distribution in Fig 4.5a. The likelyreason for this is because flowBin produces averages, while flowType uses a threshold(see Fig 4.4c). Thus, bins with fewer positive cells will tend to have an average (flowBin)expression which falls below the flowType threshold, This would also depend on thedistribution of the marker across cells in the common markers. A marker that wascompletely uniformly distributed in every patient would tend to have a curve approachinga step function. In that situation, for a sample with 49% expression of that marker, theexpression of every bin would fall just short of the flowType threshold and the samplewould erroneously appear to have 0% expression.It may be an avenue for future investigation to attempt to mitigate this, for exampleby applying a logit transformation to flowBin expression data before passing it toflowType. However, fitting logit transforms would be challenging for most cell types, astheir distributions would be a mixture of the sigmoidal distributions of their componentmarkers. Indeed, for KI-67+CD127− in Fig 4.5a, the sigmoidal curve is partially cancelledout, most likely by such mixing.4.5.5 ConclusionsFlowBin is a complete pipeline for combining multitube flow cytometry data via markersshared across tubes. Quantile normalization of those markers to reduce technical variation534.5. Discussionis included. Of the two forms of binning included, flowFP is most suitable for laterflowType analysis, while K-means fits the contours of the data better. Compared withnearest-neighbours merging of tubes, flowBin produces cleaner data, with far fewer falsedouble-positive marker combinations. FlowBin with flowType can reproduce true datafor more abundant cell types, and this data is suitable for statistical testing, albeit withlowered statistical power (increased type II error).54Chapter 5Applications of FlowBin to AML5.1 IntroductionAs mentioned in Chapter 2, multitube flow cytometry data is common in leukemiadiagnosis. FlowBin is thus ideally placed to be used for high-throughput deep profilingof AML data. In this chapter, I detail several AML datasets to which I applied flowBin.5.1.1 Separation of AML and Normal CellsIt is frequently desirable to isolate dysplased cells from healthy tissue in order tocharacterize the dysplasia; I demonstrate here how flowBin can be used to achieve this.Once again I used the multitube AML data set from FlowCap-II (FR-FCM-ZZYA). Thisdata set contains FCS files for 359 patients (normal=316, AML=43). Theoretically, thesamples from the AML patients should contain a mixture of normal and leukemic blastcells, while the healthy patient samples should only contain normal cells. The problemof separating abnormal from normal cells is thus one of novelty detection, for whichtechniques, such as single-class support vector machines, are available [110].5.1.2 FlowCAP2Flow Cytometry: Critical Assessment of Population Identification Methods (FlowCAP)is a competition in which automated flow cytometry analysis methods are compared. Thesecond FlowCAP challenge compared classification pipelines, and one data set includedmultitube data for AML.[5] I entered a pipeline using flowBin for feature extractionalong with a voting classifier method.555.1. Introduction5.1.3 flowBin with flowType and RchyOptimyx to find cell types inAML correlated with NPM1 mutationTo further demonstrate the utility of flowBin, I present its application to a novel dataset. The data in questions consists of 129 de novo AML cases. Each of these cases hadmultitube flow cytometry data available from the time of diagnosis. In addition, eachhad been genotyped for clinically relevant frameshift mutations in the twelfth exon ofthe NPM1 gene. These mutations (hereafter referred to as NPM1-mt) indicate a goodprognosis, and have a marked correlation with the absence of CD34 on the AML blastcells [64, 65, 67, 111] and, in certain cases, HLA-DR [112].While high-level, single marker studies have been performed to find other recurrentimmunophenotypic characteristics of NPM1-mt AML [112, 113], deep profiling of thecondition’s immunophenotypic landscape has yet to be undertaken. I performed such ananalysis using flowBin and flowType, with the hypothesis that within the immunophe-notypic landscape of AML, there may be additional cell types that are more stronglycorrelated with NPM1 mutation than CD34+/- alone.5.1.4 Surveying the Immunophenotypic and Genomic Landscape ofAMLAs detailed in Chapter 2, AML genomes, transcriptomes and various other sequence-baseddata has been published, but a comprehensive bioinformatic survey of the immunophe-notypic profiles of AML based on flow cytometry had not until the work I described inthe previous section. Furthermore, a study integrating the full depth of flow cytometrydata with sequence-based data has not been undertaken at all. In this section I describesuch work.565.2. MethodsTraining datapop 1) 0.48 0.44 0.78 ...pop 2) 0.67 0.45 0.34 ...pop 3) 0.74 0.89 0.12 ......Training classespop 1) AMLpop 2) AMLpop 3) AML...Classifier TrainingAlgorithm TrainedClassifierpop 1) 0.35 0.46 0.67 ...pop 2) 0.21 0.56 0.49 ...pop 3) 0.78 0.41 0.89 ......patient 1patient 2pop 1) healthypop 2) healthypop 3) healthy...patient 1patient 2Predicted classespop 1) AMLpop 2) AMLpop 3) healthy...TrainedClassifierTest data (all bins from one patient)pop 1) 0.48 0.44 0.78 ...pop 2) 0.67 0.45 0.34 ...pop 3) 0.74 0.89 0.12 ......patient 1Predicted classPatient 1) AMLTakevotea. b.Figure 5.1: Schema for a voting classifier using flowBin output. a. Training. Everybin from every patient is treated as an individual measurement, labelled with the class of thepatient. A classifier is then trained on the entire dataset at once (all bins from all patients). b.Prediction To predict the class of a new patient, a prediction is made by the trained classifierfor every one of the bins from the patient. The majority vote from these predictions is thentaken as the overall prediction for the patient.5.2 Methods5.2.1 Separation of AML and Normal CellsI applied flowBin to both the normal and AML patients using K-means binning. I pooledall bins from the normal samples, and trained a single-class support vector machine(SVM) on these using the kernlab CRAN package [114]. I then applied the trained SVMto the bins from the AML patients to predict which were normal and which dysplased.5.2.2 FlowCAP2Binning within patients raised the problem of linking features across patients for classifi-cation. To solve this, I took each bin from each sample as a separate training instance,labelled with the sample label, and then trained an SVM classifier. For class prediction,I took the majority vote of the predicted labels for a given sample’s bins. Classificationwith parameter optimization and three-fold cross-validation was implemented using theksvm R package, but could in theory be made to work with any modern classificationmethod. This method is illustrated in Fig 5.1.575.2. MethodsBone MarrowAliquot, stain,run flow cytometryFCS DataTube 1Tube 2MarkersScatter ...129 patients 7-10 tubes/patient 20,000 cells,3-4 markers/tubeCombine tubesusing flowBinFSCSSCCD45HLA-DRCD13CD34CD20CD19CD10CD61CD56CD33CD64CD117CD14CD7CD2CD4CD3FSC SSC CD45 HLA-DR CD13 CD34 CD20 CD19 CD10 CD61 CD56 CD33 CD64 CD117 CD14 CD7 CD2 CD4 CD3128 clusters/patient17 markers eachCell Clusters801cell typesassociatedwith NPM-1FSCSSCCD45HLA-DRCD13CD34CD20CD19CD10CD61CD56CD33CD64CD117CD14CD7CD2CD4CD3FSC SSC CD45 HLA-DR CD13 CD34 CD20 CD19 CD10 CD61 CD56 CD33 CD64 CD117 CD14 CD7 CD2 CD4 CD3128 clusters/patient17 markers eachCell ClustersType clustersusing flowType(1:6-combinations) Wilcoxon rank sumvs patient NPM1with Holm correction616,285cell typesper patientCell Type Proportions+ - +-Figure 5.2: Pipeline used to determine NPM1 -associated immunophenotypes inAML. Steps taken are denoted by arrows, while the data consumed/produced is indicated inboxes. Flow cytometry was performed in the clinic historically; all other steps were computational.The end result was a list of 801 cell types which showed a significant difference in abundancebetween NPM1 mutated and wild-type patients.5.2.3 FlowBin with FlowType and RchyOptimyx to Find Cell Typesin AML Correlated with NPM1 MutationI first used flowBin to combine tubes for each sample and measure the expression of themapped bins. I then used flowType to delineate and count all cell types present, definedover all combinations of up to six markers. I filtered out all cell types not present in anypatient, leaving 616,285. I then tested for differences in abundance of each of these celltypes between NPM1-mt and wild-type patients using the Mann-Whitney U test withBonferroni-Holm correction for multiple testing. These steps are illustrated in Fig 5.2.I then performed exploratory analysis using RchyOptimyx to visualize the hierarchies585.2. Methodswithin the 801 significant cell types. I found that adding CD19-, CD20- and CD10- hadlittle to no effect on the P-value of a cell type (see Fig 5.3). This is likely due to these(B-lymphoid) markers being extremely rare in AML [54], and hence not being expressedon the important cell types at all, so that a negative gate for any of them did not changewhich cells had that cell type. I consequently excluded all cell types involving CD19-,CD20- or CD10-, bringing the total cell types to 272.All CellsCD34+CD34+CD61−CD34+CD61−CD14−CD34+CD10−CD61−CD14−CD34+CD20−CD10−CD61−CD14−CD34+CD20−CD10−CD61−CD14−CD3+CD34+CD20−CD61−CD14−CD34+CD61−CD14−CD10− CD20−CD20−CD3+1 2 3 4 5 6 7 8 9-log10(P-value)Figure 5.3: An example of RchyOptimyx analysis of one cluster of cell types. As801 cell types are too many to visualize meaningfully with RchyOptimyx, I clustered the celltypes and visualized each in turn. In this example, the addition of CD10- or CD20- make littledifference to the P-value of the cell type CD34+CD61−CD14−. As this was a general trend andin line with reported AML biology, I chose to exclude cell types defined over these markers fromfurther analysis.595.2. Methods5.2.4 Surveying the Immunophenotypic and Genomic Landscape ofAMLList mode data from clinical flow cytometry performed at diagnosis of 93 patients wasobtained retrospectively. This data contained a common panel of 16 markers spannedacross multiple tubes, with each tube containing CD45 to delineate leukemic and whiteblood cell populations. I combined tubes using flowBin, with the flowFP method forbinning, 128 bins, and percentage positive as the measure of per-bin expression. I thenused flowType [2, 3] to extract and count all phenotypes defined over combinations ofup to four markers, with three levels of expression (negative: < 20% markers expression;partially positive: between 20% and 80% marker expression; fully positive: > 80% markerexpression). Of the 58,742 phenotypes, I excluded those with a mean cell count of < 5%of all cells and a standard deviation across cell counts of < 10% of all cells (based onexamination of the distribution of these values), leaving 4,938.I then clustered the remaining cell types using non-negative matrix factorisation, asimplemented in the NMF bioconductor package [115], in a manner similar that usedfor mRNA and miRNA in [72]. I used the default Brunet algorithm, with 50 and 200iterations respectively for the rank survey and clustering runs. The number of clusterswas chosen based on cophenetic score and silhouette width. I selected the top ten celltypes from each cluster, and plotted the cluster results on a heatmap of these cell typesagainst genotype and disease subtype.To determine which mutations were significantly correlated with immunophenotypicclusters, I used a two-tailed Fisher’s exact test between cluster membership and presenceof the mutation, followed by the Bonferonni-Holm correction.605.3. Results5.3 Results5.3.1 Separation of AML and Normal CellsHeatmaps of the two predicted classes of bins are shown in Figure 5.4. The bins predictedto be normal fall mainly into well-clustered populations showing expression patternstypical of healthy myeloid (monocytes, promonocytes and granulocytes), lymphoid anderythroid cells respectively. By contrast, the bins predicted to be abnormal show manymore clusters, and very few bins which fit normal expression patterns. Most of the binsshow patterns of expression more typical of AML, such as CD34 and CD117 expression,and co-expression of the CD4 and CD7 with myeloid markers.Figure 5.4: nu-SVM separation of normal and abnormal cell populations in AMLsamples. a. Heatmap of all populations within the AML samples that were predicted to benormal. Most can readily be identified as having the properties of common blood and bonemarrow cell populations: myeloid cells expressing CD16 and/or CD64, lymphoid cells (dominatedby CD3-expressing T-lymphocytes/precursors, and erythroid cells not expressing any of themarkers in the panel, including CD45. b. Heatmap of all populations predicted to be abnormal.In contrast to the cells predicted to be normal, many of these express CD34 and CD117, primitivemarkers typical of stem cells and of AML.615.3. Results5.3.2 FlowCAP2This pipeline performed poorly in comparison with other algorithms, with an F-measureof 0.46.[5] Although the competition included an opportunity for participants to improvetheir scores, an error in the score-calculating code initially showed my pipeline as havinga much higher F-measure during that phase, and so I did not have a chance to improvethe pipeline until after publication. In dissecting this performance, it emerged that theclassifier was vulnerable to class imbalance (only 43 of the 359 patients had AML). Toaddress this issue, I added bagging with down-sampling to the classifier [116]. Thisimproved classifier, illustrated in Fig 5.5, produced an F-measure of 0.96 when evaluatedusing the same methodology as was used in FlowCAP.5.3.3 FlowBin with FlowType and RchyOptimyx to Find Cell Typesin AML Correlated with NPM1 MutationThe final set of 272 NPM1 -associated cell types, after removing those defined overuninvolved markers, are illustrated on a heatmap showing their abundance in Figure 5.6.The cell types clustered into two main groups, roughly corresponding to the two patientclasses (NPM1-wt and NPM1-mt). Relative abundance of cell types with strongerP-values, chosen by exploration using RchyOptimyx, are shown in Figure Surveying the Immunophenotypic and Genomic Landscape ofAMLThe results of the NMF clustering are summarized in Fig 5.8. The only mutationsshowing significant correlations with the immunophenotypic clusters were those in NPM1(P=0.0006) and its closely-associated mutations in DNMT3A (P=0.0014) [117].However, certain combinations of other class II mutations (those which block dif-ferentiation [118, 119]) clearly account for some clusters. In clusters 4 and 8, mostpatients have NPM1 mutations, but those which do not have t(9;11) karyotype ormutations in TET2. In cluster 3, the two patients which do not have NPM1 mutations625.3. ResultsTraining data1) 0.35  0.46 0.67 ...2) 0.21  0.56 0.49 ...3) 0.78  0.41 0.89 ......Training classes1) healthy2) AML3) healthy... Classifier TrainingAlgorithmTrainedClassifier 1Subsamplesample 1 classes2) healthy3) AML5) healthy...sample 1 data2) 0.21  0.56 0.49 ...3) 0.78  0.41 0.89 ...5) 0.33  0.43 0.47 ......sample 2 classes1) healthy3) healthy8) AML...sample 2 data1) 0.35  0.46 0.67 ...3) 0.78  0.41 0.89 ...8) 0.12  0.31 0.71 .........Classifier TrainingAlgorithmTrainedClassifier 2Test data1) 0.22  0.26 0.65 ...2) 0.34  0.24 0.45 ...3) 0.67  0.34 0.46 ......Predicted classes1) AML2) healthy3) healthy...TrainedClassifierPredicted classes1) AML2) AML3) AML...TrainedClassifierPredicted classes1) AML2) AML3) healthy...TrainedClassifier Final classes1) AML2) AML3) healthy...TakevotePatient classPatient 1) AMLTakevotea.b.Figure 5.5: Schema for a voting classifier for flowBin output incorporating balancedbagging. a. Training. This is similar to the base classifier shown in Fig 5.1, except thatmultiple classifiers are trained, each on a bootstrap subsample of patients. Each bootstrapsample is set to contain equal numbers of patients from each class. b. Prediction To predictthe class of a new patient, predictions for each bin from that patient are made by each of thetrained classifiers. Final per-bin predictions are taken by majority vote of those predictions.Then, the prediction for the patient is made based on a majority vote of the per-bin predictions.635.3. ResultsFigure 5.6: Overview of cell types which showed significant differences in abundancebetween NPM1-mt and NPM1-wt. The cell types cluster into two main groups, roughlycorresponding to the two patient classes (NPM1-wt and NPM1-mt).645.3. Resultsqqqqqq0 wt mtProportion of all cellswt mt wt mtCD34-CD13+P=0.00221 CD34-CD33+P=0.002251CD34-P>0.05 CD13+CD34−CD33+P=0.00138wt mt0Proportion of all cells 1CD34-CD2-P=0.0265 CD34-CD2-CD4+P=0.000149 CD13+CD34-CD2-CD4+P=1.94e-06wt mt wt mt wt mt0Proportion of all cells 1wt mt wt mtwt mtCD34+CD61-CD14-P=0.015 CD34+CD61-CD14-CD2+P=0.000114 CD34+CD61-CD2+CD4-P=0.005480Proportion of all cells 1HLA+CD34+CD33-CD64-P=0.0235 HLA+CD34+CD4-CD64-P=0.0119wt mt wt mta.b.c.d.Figure 5.7: Selected classes of cell types showing significant differences in abundancebetween NPM1-mt and NPM1-wt. P-values are given after Holm correction. a. Gating forthe presence of myeloid lineage markers CD13 and CD33 within the CD34- compartment yieldsmuch stronger differences in abundance between NPM1-wt and NPM1-mt than CD34- alone. b.Gating for CD2- within the CD34- compartment yields a slightly better separation than CD34-alone, but gating down further to CD4- and CD13+ is a cell type that, while present in mostNPM1-mt, is absent or below 20% abundance in nearly all NPM1-wt. c. Gating for CD61- andCD14- within the CD34+ compartment leads to a cell type which is common in NPM1-wt butalmost entirely absent in NPM1-mt. d. Gating for HLA-DR+ and CD64- within the CD34+compartment leads to a cell type that occurs in a subset of NPM1-wt but is entirely absent inNPM1-mt.655.4. DiscussionTable 5.1: Mutations and mutation groups correlated with immunophenotype clus-ters by Fisher’s exact test. Note that the cell types column only shows the cell type moststrongly associated with the cluster(s) in question, and not the entire spectrum of cell typespresent or absent. For the full spectrum of cell types per cluster, see Figure 5.8Mutation(s) Cluster(s) P Cell TypesNPM1 3, 4, 8 0.0006 CD34−CD33hiNPM1 |TET2 |t(9;11)|PML/RARA 3, 4, 5, 8 8.7× 10−10 CD34−CD33hiDNMT3A 4 0.0014 CD33hiCD64hiRUNXIT1/RUNX |MYH11/CBFB 6 0.0001 CD34hiCD117hihave the PML/RARA fusion gene. Cluster 6 contains mainly patients with either theRUNXIT1/RUNX fusion gene, or the MYH11/CBFB fusion, and the unadjusted P-valuefor these two genes combined is 0.0001, as compared to 0.04 for RUNXIT1/RUNX alone.This cluster is characterized by having a high proportion of cells expressing CD34,CD117 and HLA-DR. When TET2, t(9;11) and PML/RARA are added to NPM1 , the(unadjusted) P-value of the Fisher’s exact test is 8.7× 10−10, as opposed to an unadjustedP-value of 3× 10−5 for NPM1 alone. Although TET2 mutations are not confirmed to beclass II mutations in AML, deletion of Tet2 in mice has shown to result in increasedhematopoeitic stem cell proliferation, suggesting that they may be [120]. These clustersare defined by an absence of CD34 expression and a high proportion of cells expressingCD13.5.4 Discussion5.4.1 Separation of AML and Normal CellsI have shown how flowBin in combination with one-class SVM can be useful for separatingnormal from aberrant cells where a good training set of normal cells is available. Thiscan have applications for later AML studies where the ratio of normal cells to leukemicis a confounding factor.665.4. DiscussionClass IIClass IChromatinModificationSplicingCohesinTumourSupressorMethylationHydroxy-MethylationKinase/PPTPrimary AMLAML-MDSTherapy AMLTherapy MDSSecondary AMLMDSDisease KeyProportion of all cellsPhenotype Key123456789Figure 5.8: Top phenotypes and the results of NMF clustering. Phenotypes areindicated on the y axis of the heatmap; ‘-’ indicates cell populations negative for the marker, ‘+’indicates partially positive, and ‘++’ indicates fully positive. Patients, clustered by nonnegativematrix factorization, are indicated on the Y axis. Greyed out genotypes are those for whichsequencing failed.675.4. Discussion5.4.2 Identifying AML Patients (FlowCAP 2)I have shown that, despite poor performance in the initial FlowCAP2 competition, avoting classifier used in conjunction with flowBin can separate AML from normal cells,when balanced bagging is added. On the AML data set from FlowCAP, balanced baggingimproved performance substantially, with the F-measure increasing from 0.46 to 0.96, anumber roughly in the middle of the field compared with the other pipelines entered.This could be improved further by incorporating other machine learning best practices,such as feature selection. However, there are two levels of recursion already: the votingclassifier and the bagging. Adding another layer of recursion would likely increase thecomputational complexity prohibitively. Instead, it would be better to recommend thatflowBin be used in conjunction with the best-performing techniques from FlowCAP,most notably flowType and SPADE. [5]5.4.3 FlowBin with FlowType and RchyOptimyx to Find Cell Typesin AML Correlated with NPM1 MutationThe overall pattern of association between CD34 expression and NPM1 mutation shownin Figure 5.6 fits with previous reports.[65, 66, 111] However, flowBin with flowTypewas able to find a multitude of immunophenotypes within those classes (CD34+ andCD34−).Referring to the groups examined in more detail in Figure 5.7, the first group, CD34-CD13+CD33+, fits with observations that blasts often express CD13 and CD33 in NPM1-mt AML [54]. The second and third groups, involving CD34-CD2- and CD34+CD2+,are both cell types which have been reported before in Acute Promyelocytic Leukemia[121], but not associated with NPM1. For the cell types in the second and third groups,CD4 has been recently reported to be associated with NPM1-mt, along with t(9;11) andmonocytoid AML [76]. In the third group, CD61, a marker of megakaryoblasts, may fitwith the observation of dysmegakaryopoeisis in NPM1-mt AML [122].In conclusion, I have found a series of cell types associated with NPM1 mutation685.4. Discussionin AML. Although many of these cell types fit with previously reported trends, mostrepresent new, previously unreported cell types associated with NPM1 mutation. Impor-tantly, I have also demonstrated that flowBin can be used to produce high-dimensionaldata from low-dimensional multitube flow, suitable for downstream analysis in toolssuch as flowType and RchyOptimyx the identify significant cell populations.5.4.4 Surveying the Immunophenotypic and Genomic Landscape ofAMLI have shown how flowBin and flowType can be used, with NMF clustering, to findAML patients with similar sets of cellular immunophenotypes. I have further shown howthis can be used as an alternative method to find genes (or combinations of genes) withcharacteristic immunophenotypes. Interestingly, many of the mutation types showingdistinctive immunophenotypes are those with more favourable outcomes, while theknown mutations with less favourable outcomes tended to be less associated with asingle immunophenotype.69Chapter 6BAIT: Automated Analysis ofstrand-seq Data for Detection ofSister Chromatid Exchanges andGenome Finishing6.1 BackgroundAs described in Chapter 2, strand-seq is a powerful technique which enables the detectionof SCEs. However, the method as published in [83] involved visualising binned readcounts on chromosome ideograms, then manually identifying change points betweenWW, WC and CC states. Like most visually based manual analysis, this is both time-consuming and prone to bias. Automating this analysis was thus a desirable goal, andin this chapter I describe a method for doing so, which I developed in collaboration withMark Hills.Furthermore, once SCEs can be systematically identified in large numbers of cellsfrom the same organism or species, they can be used for a range of purposes. One suchis finding and correcting misoriented contigs flanked by gaps within late-build genomes[83]. This can be extended to placing unbridged contigs in a late-build genome by theirstrand similarity to chromosomal regions across libraries. While this could be performedby manual inspection, that would again be tedious and potentially subjective, and anautomated method was desirable. Later in this chapter, I describe a method developed706.2. Methodologyby Mark and myself to place unbridged contigs based on strand similarity. This methoddepends upon the accurate detection of SCEs. Both methods have been incorporatedinto a package called bioinformatic analysis of inherited templates (BAIT), which isavailable on Sourceforge.6.2 Methodology6.2.1 Automatically Detecting Sister Chromatid ExchangesThe sister chromatid exchange detection in BAIT accepts the binned read depths asproduced for the visualisation. Although the overall read depth is unchanged across anSCE, the proportion of directional reads will change from two copies in the homozygousstate to one in the heterozygous state (see Figure 2.11 in Chapter 2). BAIT exploitsthe similarity of the change in template copy number to copy number variation (CNV)analysis in order to locate and characterize all SCE events.To create a single track of data for CNV, and to normalise for read depth differences,the ratio of Watson and Crick reads within each bin is computed, using [(W −C)/(W +C)].This gives a value of 1 when all reads map to the Watson strand (WW strand inheritance),-1 when all reads map to the Crick strand (CC), and 0 for an equal number of both(WC) (Fig 6.1a). A change in this ratio along the length of a chromosome corresponds tothe location of an SCE event (Fig 6.1a), which is first localized to neighboring bins. Forexample, using the default bin size of 200 kb, a switch from a CC template-strand statein one bin (ratio = -1) to a WC template-strand state in a neighboring bin (ratio = 0)indicates that an SCE event occurred somewhere within the 400 kb interval encompassingthose two bins (Fig 6.1a).A bin size of 200 kb was selected to be large enough that, even in low read depthstrand-seq data, such as that from an Illumina MiSeq, the read depth over that intervalshould, in our experience, be adequate to localise an SCE within the interval. BAITfirst makes gross event calls by utilizing the circular binary segmentation algorithm [123]716.2. Methodologyimplemented in the CNV Bioconductor package DNAcopy [124] to locate the SCE eventto the two-bin interval. It then recalculates the template-strand ratio by segmentingthis interval into five new bins (80 kb each, using default bin size), further narrowingthe location of the SCE interval . BAIT applies this binning-based DNA copy detectionmethod iteratively, decreasing the bin size by a factor of five each time (Fig 6.1b),until the read density is no longer sufficient to make accurate calls (determined fromexperience to be when an interval has less than 50 reads, or when DNAcopy can nolonger predict a single event (Fig 6.1c). In order to identify SCE events on the boundaryof bins, BAIT pads each interval with one-half of the interval length in each direction(Fig 6.1b, c; red arrows).BAIT then refines the gross interval by incorporating a simple walker algorithmthat analyzes reads starting from the homozygous state, and reports the first read onthe opposite template that represents a switch to a heterozygous state (Fig 6.1, redbox). From this refined interval, the walker checks that the 10 preceding reads mapto the homozygous state, and that at least four of the 20 following reads map to theopposite template state (Fig 6.1c). If these criteria are not met, as may be the casewhere the background is high, BAIT continues to analyze across the interval until theyare met. These checks improved the localization of SCE events (see Fig 6.2), and varyingthese thresholds did little to change the data. Through this two-step process, BAITautomatically detects and localizes SCEs with a high degree of confidence, plots themon ideograms, and creates a UCSC-formatted BED file of all SCE event intervals.6.2.2 Placing Contigs in Late-build Genomes Using StrandInheritance and Sister Chromatid Exchange InformationUsing scaffold-level and chromosome-level assemblies to generate functional referenceassemblies is valuable, but it is important to note that ‘completed’ assemblies alsocontain a large number of contigs that remain unmapped. Assigning locations for theseorphan scaffolds in a chromosome context is a high-priority endeavor for sequencing726.2. MethodologyFigure 6.1: Automated identification of sister chromatid exchange (SCE) fromstrand-seq data. (a) Gross directional mapping data are thresholded to remove bins withunexpectedly high or low read numbers, and analyzed using DNAcopy. Inherited templatenumbers are converted to a value between 1 and -1 for DNAcopy to make only one of threecalls: WW, WC, or CC. DNAcopy defines an interval across two bins, so with a bin size set to200 kb, the SCE event will be located to within 400 kb. (b) Localization is then iterated bysubdividing the identified region into bins one-fifth of the original size (80 kb on first iteration),and re-running DNAcopy. A single bin size is used as padding to aid detection of SCE events atbin boundaries. The iterations of re-running DNAcopy continue until less than 50 reads remainwithin the interval. (c) A second algorithm identifies the first read to map in a different direction(W read at chr13:19,203,283), then performs a check that the 10 preceding reads are all in theexpected direction (10 C reads), and at least 20% of succeeding reads are in the other direction.The interval is refined to a distance between two reads. Abbreviations: C, Crick; W, Watson.736.2. MethodologyFigure 6.2: Optimization of the sister chromatid exchange (SCE) interval detectorfunction. After DNAcopy has identified the smallest region in which SCE events have occurred,Bioinformatic Analysis of Inherited Templates (BAIT) executes a function that scans the regionfrom the homozygous template direction until it identifies the first read mapping to the oppositestrand. These “unfiltered calls” were similar to the manual calls, but were subject to low levelbackground reads interfering with accurate localization (red line). To circumvent this, we addedto the function that checks the 10 preceding reads to ensure they are all the same state, whichyielded more accurate calls (blue line). Finally, we added a further check to ensure that thesucceeding 20 reads (which are supposed to be Watson and Crick (WC)) mapped to the oppositestrand at least 20% of the time (green line). The dashed lines indicate the median intervaldifference for each level of stringency.746.2. Methodologycenters, and there are very few techniques that are available for this task [125]. However,provided that the orphan scaffold has sufficient read coverage, strand-seq can be used todetermine the strand-inheritance pattern, which will be the same as the chromosome onwhich it is present. For example, an orphan scaffold inheriting WC template strandsmust locate to a WC chromosome in that particular library. If an orphan scaffold inheritsWW template strands, it will locate to a WW chromosome if both sequences are inthe same orientation, or to a CC chromosome if it is misoriented with respect to thechromosome. On average, using just a single library, half of the chromosomes can beexcluded as possible locations for these orphan scaffolds (Fig. 6.3).By comparing these locations across a batch of libraries, BAIT localizes these scaffoldsto particular chromosomes. For each orphan scaffold with sufficient reads, BAIT assignsa template state, compares this against the template state of each chromosome withina particular library, and then iterates this process to compute the concordance acrossall libraries. Concordance is never 100% in practice, owing to libraries with highbackground, orphan scaffolds with too few reads to accurately call strands, SCE eventswithin gaps between the scaffolds, and the 5 to 10% error rate of BAIT in SCE detection.Nevertheless, BAIT is still able to achieve high-quality predictions of scaffold location bytaking the highest-concordance chromosome. Chromosomes are further split based onSCE locations, allowing for localization of orphan scaffolds to particular chromosomalregions (Fig 6.3). Because orphan scaffolds are likely to be located within gap regionsrather than within contiguous sequence, BAIT can use a provided BED-format gap fileto cross-reference all mapped orphan scaffold locations to gaps within the same interval.BAIT outputs in a BED file both the best predicted region for each fragment and anycandidate gaps within that region.756.2. MethodologyFigure 6.3: BAIT localizes unplaced scaffolds in late-version assemblies. Orphanscaffolds can be correctly oriented and localized relative to the rest of the genome by comparingtemplate-strand inheritance. The orientation of an orphan scaffold is arbitrary, because it isnot anchored to the rest of the genome, so it can be correctly oriented with respect its locatedchromosome, or misoriented. (a) For a single library where the unplaced scaffold GL456239.1 isWW, BAIT maps its potential location (shown in red) to both WW genomic regions (correctlyoriented), and CC genomic regions (misoriented). If only one library is analyzed, all locationsmap with 100% concordance. Note that a WW scaffold will not locate to a WC chromosome,so chr8, chr14, chr16, chr18, and chr19 are 0% concordant. (b) BAIT iterates over a secondlibrary where GL456239.1 is CC. The results of the two libraries combined reduce the number ofpotential mapping locations from 17 to only 3 that map with 100% concordance. Because chr8,chr14, and chr16 are WC in this library also, these chromosomes map with 0% concordance. (c)BAIT iterates over a third library where GL456239.1 is WC, and thus maps to all chromosomesthat are WC. The result of the three combined libraries reduces the number of potential mappinglocations to 2: the centromeric tips of chr1 and chr4. (d) The combined results after iterationof all 62 libraries refine the location of GL456239.1 to the first 10 Mb of chr1 in the reverseorientation (with a concordance of 91%). The fragment was further refined to an unbridged gapoccupying the first 3 Mb of chr1. Abbreviations: C, Crick; chr, chromosome; W, Watson.766.3. Results6.3 Results6.3.1 Accurate Localization and Mapping of SCEsTo assess the ability to computationally identify SCE events, BAIT predictions werecompared with 528 SCE events from 62 murine embryonic stem cell strand-seq librariesthat had previously been identified manually [83]. Manual processing of SCE eventsinvolved uploading BED-formatted strand-seq data into the UCSC genome browser [126],and identifying the interval at which the templates switch.To allow the user to choose between favouring false positives or false negatives, BAITSCE calling incorporates a filtering step, which excludes any bins that deviate morethan a set multiple of standard deviations from the average red depth. By comparingthe BAIT SCE calling to the manually processed SCEs, the optimal threshold for thesedata was determined. This was to exclude bins with read counts of ±0.2 standarddeviations from the mean, which gave a sensitivity of 0.93 (10.9% false positives), anda specificity of 0.89 (7.2% false negatives) (Fig 6.4a). When only those libraries witha low background metric (< 5%) were included, the specificity improved to 0.94, whilethe sensitivity remained almost the same at 0.92 (Fig 6.4b). Of the false-negative calls,72.9% were SCEs within 5 Mb of the start or end of the chromosome, indicating thatterminal regions of chromosomes are under-represented by BAIT’s SCE localization. Inaddition, three of the SCE events predicted by BAIT but absent in the manual analysiswere determined to be correct upon further analysis. One event was less than 2 Mb fromthe distal telomere of chromosome 1, while the remaining two events were 5 Mb fromeach other on chromosome 13. These SCE events were difficult to detect by eye from aBAIT ideogram output of strand-seq data. Furthermore, because BAIT identifies SCElocations directly on ideograms with an arrowhead, both false-positive and false-negativeSCEs can be rapidly scanned and validated from the ideogram output files.Of the correctly identified SCE events, a comparison of the location of the SCEinterval between automated and manual calls showed a median difference of just 34bp (see Fig 6.2 ). Almost two-thirds (65.8%) of the predictions were within 100 bp776.3. ResultsFigure 6.4: Accuracy of automated sister chromatid exchange (SCE) detection byBioinformatic Analysis of Inherited Templates (BAIT). (a) By comparing the numberof SCE events identified by BAIT to those determined manually, we calculated the percentageof computational calls that were incorrect (false positives) or not detected (false negatives).Filtering the data by only including bins that deviated minimally from the mean changed theresults, with highly conservative filtering increasing the level of false negatives, and very broadfiltering increasing the level of false positives. (b) The frequency of (left) false positives and(right) false negatives with respect to library background. Cleaner, high-quality libraries with< 1% of reads mapping incorrectly had a lower false-positive rate than libraries with mediumbackground (< 5% incorrectly mapped reads), and an even lower rate than libraries with highbackground (< 10% incorrectly mapped reads). Error bars are ± standard deviation.786.3. Resultsof the manual calls, with 74.7% of predictions within 10 kb. A summary of SCEdistribution across all libraries was plotted, together with a histogram reporting thedistance between events, helping to identify significant clustering of SCEs (see Additionalfile 2: Supplemental Data File 1). The accurate identification of SCEs is also importantfor the functions of BAIT which assemble and refine reference genomes (see sectionsbelow).6.3.2 Placing Contigs in Late-build Genomes Using StrandInheritance and Sister Chromatid Exchange InformationPrevious work using strand-seq has shown that over 20 Mb of the MGSCv37/mm9 Musmusculus reference assembly is misoriented, involving 17 regions flanked by unbridgedgaps [83]. In the more recent GRCm38/mm10 build of the genome, 35% (7,079.49 kb) ofthese identified misorientations were subsequently corrected, validating strand-seq withother approaches to correct orientation issues. In order to identify misorientations in thenewest GRCm38/mm10 assembly, these analyses were repeated using the automatedfunction of BAIT, identifying a total of 15 misoriented regions and five autosomalmisorientations, with the remaining ten located to the X chromosome (see AdditionalFile Table S1). Because the X chromosome only exists as one copy (monosomy) in themale embryonic stem cells (ESCs) of our dataset, misorientations appear indistinguishablefrom SCEs, and were identified by the intersection of events occurring over the sameregion across all libraries (see Additional file 2: Supplemental Data File 1). In this way,using just a single lane of sequencing, the majority of contigs (those larger than 10 kbwith minimal segmental duplications) were oriented with respect to flanking contigs.Thus, using strand-seq and BAIT with relatively low-coverage sequencing, the relativeorientation of all reference contigs can be determined, effectively bridging all gaps in anassembly.796.3. ResultsTable 6.1: Identification of all misoriented fragments in GRCm38/mm10. The genomic regionsthat were incorrectly oriented in the latest assembly version of the mouse genome were calculatedby Bioinformatic Analysis of Inherited Templates (BAIT). The location and lengths of theseregions, which should all be present in the reverse complement in the reference assembly, areshown. Misorientations were identified in every informative library.Fragment Location SizeJH584273.1 chr4:130,516,309-130,579,475 63167JH584273.1 chr4:146,708,412-146,752,334 43922GL456141.2 chr8:20,257,550-20,443,136 185586GL456225.2 chr9:124,248,836-124,476,930 228094GL456168.2 chr14:3,000,000-19,419,705 16419705GL456186.2 chrX:3,182,394-3,556,356 373962GL456186.2 chrX:4,742,210-4,960,182 217972GL456190.1 chrX:27,205,755-27,499,671 293916GL456192.2 chrX:27,549,670-29,212,945 1663275GL456194.2 chrX:30,544,569-31,013,339 468770GL456195.2 chrX:31,919,956-32,327,544 407588GL456195.2 chrX:34,129,361-34,339,520 210159GL456202.2 chrX:125,068,866-125,141,139 72273JH584278.1 chrX:170,672,643-170,678,055 5412GL456208.2 chrY:3,289,416-3,429,742 140326Total 20785352806.3. ResultsTo validate the ability of BAIT to map scaffolds that have yet to be localized to regionson reference assemblies, we used it to predict the localization of all orphan scaffoldsin an earlier assembly of the mouse reference (MGSCv37/mm9), and compared thosepredictions with the actual known locations in the current assembly (GRCm38/mm10).MGSCv37/mm9 has 60 useable orphan scaffolds that can be lifted to a single specificcoordinate on GRCm38/mm10 [127]. Of these, 57 were located by BAIT to an intervalcoincident with the correct location on GRCm38/mm10 (Fig efamalgaBAITRes). Fromthe three fragments that could not be correctly placed, two had fewer than ten librarieswith sufficient read counts to analyze, and the remaining fragment mapped with alow concordance (57.1%). These data suggest reasonable thresholds for BAIT to maporphan scaffolds: more than ten libraries and greater than 60% concordance. Moreimportantly, they confirm that using data from the same single lane of sequencing asused for contig orientation, BAIT and strand-seq can correctly map a large proportionof orphan scaffolds in a late assembly version.There remain 44 orphan scaffolds in GRCm38/mm10, accounting for 5,334,105 bp,and containing 41 known genes. Of these, 23 contained sufficient reads to analyze,and we were able to subsequently place all of them to their matching chromosomes towithin narrow intervals (Table 6.1; see Additional file 7: Supplemental Data File 2).By intersecting these locations to gaps in the contiguous genome build, BAIT furtherrefined the scaffold locations (Table 6.1). Fragments were assumed to locate withineither unbridged gaps or to bridged gaps in which gap size exceeded the fragment size,Analyzing 62 mouse libraries, 54.5% of these orphan scaffolds could be mapped to aparticular chromosome, of which 54.2% could be mapped to a single contig gap (Table6.1). BAIT also correctly oriented these fragments with respect to the chromosome towhich they were mapped. BAIT includes a utility to create a new FASTA referencegenome by reverse complementing misoriented regions and incorporating orphan scaffoldsthat map to a defined gap.816.3. ResultsFigure 6.5: Validation of using strand-seq to map unplaced scaffolds to builtgenomes To confirm that Bioinformatic Analysis of Inherited Templates (BAIT) can suc-cessfully locate orphan scaffolds, the reads were aligned to MGSCv37/mm9, which has 202orphan scaffolds, of which 60 can be mapped to a specific location in GRCm38/mm10. We usedBAIT to locate these scaffolds in MGSCv37/mm9, and then cross-referenced these locations tothe actual location in the GRCm38/mm10 assembly version. BAIT correctly located all regionsin which there were more than ten libraries to analyze, and where the percentage concordancewas above 68%. Green points indicate correctly mapped fragments, and red points indicateincorrectly mapped fragments. Dashed lines show the minimum number of libraries and minimalconcordance needed to make confident calls.826.4. Discussion6.4 DiscussionIn this chapter, I have described my contributions to BAIT, a software toolkit forautomated analysis of strand-seq data. BAIT facilitates SCE analyses by rapidlycounting and locating events, presenting a pipeline that can be incorporated into high-throughput strategies. BAIT accurately refines the interval between reads in whichthe template switch occurs, allowing regions with a high propensity to undergo SCEto be identified (for example, fragile sites [128] or sites of recurrent DNA damage).Accurate interval identification is also important in looking for genomic rearrangementssuch as translocations, and BAIT is able to detect these and assign a frequency ofthe rearrangement within the pool of libraries, requiring a far lower read depth thanconventional split-pair read sequencing. [129] A caveat to these analyses is that SCEsand genomic rearrangements are more difficult to detect on chromosomes that havemore than two copies within a cell, potentially limiting its use in highly polyploid cancercells. Taken together, these results show that BAIT is very accurate and efficient atpredicting SCE intervals, and will be indispensable for future high-throughput analysisof strand-seq data.BAIT is also capable of mapping orphan contigs in late-build genomes with a highlevel of accuracy. Furthermore, this accuracy can be increased to potentially 100%when disconcordant contigs (likely caused by either poor mappability or the presence ofmisorientations within the contig) and low-coverage contigs (short or those with poormappability) are filtered out. For established and well-studied genomes, finishing buildsby additional sequencing yields diminishing returns, and novel, targeted and highlysequence-efficient methodologies such as strand-seq and BAIT can play a crucial rolein completing these genomes [130]. Indeed, I, along with Mark, have presented likelylocations for 23 of the 44 remaining orphaned contigs in the current build of the mousegenome (GRCm38/mm10).This work led on to Chapter 7, in which I describe methods for de novo assembly ofearly-build genomes made up entirely of unbridged contigs.83Chapter 7ContiBAIT: Assembling GenomesUsing strand-seq7.1 BackgroundEarly-build genome assemblies assembled by second-generation shotgun sequencingconsist of many non-overlapping contigs, which are unanchored and unordered. However,performing strand-seq on cells derived from organisms with early assemblies yieldsdirectional strand information for each contig. Contigs residing on the same chromosomeinherit the same templates across many libraries, while those from different chromosomesinherit template strands independently. Not accounting for sister chromatid exchanges,the probability that two contigs C1 and C2 from different chromosomes will have thesame template state will be:P(C1 = C2) = P(C1 = WC;C2 = WC) + P(C1 = WW;C2 = WW) + P(C1 = CC;C2 = CC) (7.1)= P(C1 = WC).P(C2 = WC) + P(C1 = WW).P(C2 = WW) + P(C1 = CC).P(C2 = CC) (7.2)= 0.5× 0.5 + 0.25× 0.25 + 0.25× 0.25 (7.3)= 0.375 (7.4)Conversely, contigs from the same chromosome will inherit the same template strandsacross all libraries (although this will be confounded slightly by sister chromatid ex-changes). This similarity to genetic linkage [131] makes concordance of strand inheritancea good metric for clustering contigs into putative chromosomes or “linkage groups”.847.2. MethodologySome early-build genome assemblies have anchored contigs to chromosomes, but havenot placed those contigs in order within each chromosome, due to the lack of overlappinginformation for assembly. Furthermore, after anchoring contigs to chromosomes usingstrand-seq, it would be desirable to also order them. In this case, sister chromatidexchanges can be used. For anchoring contigs, SCEs are an undesirable source of noise,causing deviation from the ideal case in which every contig from the same chromosomehas the same strand template inheritance pattern in every library. But in every libraryin which an SCE has occurred, this provides ordering information for the chromosomeit occurred in. Contigs upstream of the SCE will have the same strand inheritance aseach other in that library, and the same for those downstream. Similarly to how meioticrecombination is used to create a genetic linkage map between loci [132], SCE eventsalong the chromosome can be used to determine a genetic distance between contigs onthe same chromosome, allowing them to be arranged and ordered. This is illustrated inFig 7.4a.In this chapter, I present contiBAIT, an R package for using strand-seq to bothanchor and order unbridged contigs in early-build genomes.7.2 Methodology7.2.1 Preprocessing and Calling Template Strand StateBefore attempting anchoring and ordering, ContiBAIT performs filtering of libraries andcontigs. ContiBAIT excludes libraries where every contig has inherited WC templates,as these suggest a failure of the strand-seq protocol. Contigs that have inherited WCtemplates in all libraries are also excluded, as these likely contain degenerate sequence.857.2. MethodologyFigure 7.1: True ordering of semisynthetic mouse data for chromosome 3. Thisplot was generated using 1 Mb artificial contigs created based on the mm10 genomic assembly.The contigs for a single linkage group, consisting mainly of chromosome 3, were used. Thisillustrates some of the difficulties with the data – while some libraries are called cleanly, andshow a consistent state either throughout or until there is an SCE, others are extremely noisy,with apparent state changes every few contigs.867.2. MethodologyStrand state matrix(WC set to NA)ClusterMultiple cluster labelsTake ensemble ConsensusCluster LabelsReorient clustersReoriented strand state matrixMerge similarclustersFinal linkage group callsLib1Contig 1Contig 2Contig 3Contig 4Lib2 Lib3 Lib4 Lib5 Lib6 Lib7 ......NACC WWNACC WWWWNA NANA CCCC CC NANA CCCCNANANAWWCC WWWW NAWW NACCLib1Contig 1Contig 2Contig 3Contig 4Lib2 Lib3 Lib4 Lib5 Lib6 Lib7 ......WCCC WWNACC WWWWWC WCWC CCCC CC WCWC CCCCNANANAWWCC WWWW NAWW NACCFigure 7.2: Overall ContiBAIT Linkage Group Assignment Pipeline. A matrix ofstrand states for each contig in each library (cell sequenced) is taken as input. For clustering,only the WW and CC states are used; WC are set to NA. Several clusterings runs are performed,and combined to form a consensus clustering. The clusters most dissimilar to other clusters arereoriented until this does not improve the global cluster score; the strand matrix is then updatedwith the reoriented strand states. Clusters that are similar by the reoriented strand matrix,including the WC calls, are merged, producing the final call for linkage group membership.877.2. Methodology7.2.2 Anchoring Unbridged Contigs into Chromosomes de novo inEarly-build GenomesUsing the template strand calls matrix from the previous step, contiBAIT assemblescontigs into groups with similar strand inheritance patterns (linkage groups), represent-ing putative chromosomes. The complete contiBAIT pipeline for contig anchoring isoverviewed in Fig 7.2. In the first step, only CC and WW states are used to clusterthe contigs; this is to keep contigs from the same chromosome but which are in reverseorientation separate. Due to the non-deterministic nature of the clustering algorithmused, an ensemble of multiple clusterings is used to determine the best cluster labels.Whole clusters are then reoriented until the global cluster distance no longer improves,and the strand matrix is updated with the reoriented strand template states. Thiscomplete matrix is used to determine distances between clusters, and similar clustersare merged into the final linkage groups.Three-Star Chinese Restaurant ClusteringA major challenge which arose in [6] was the great number of contigs which had tobe clustered into putative chromosomes. In the mm2 data set, the number of contigswhich had sufficient strand-seq data to cluster was approximately 50,000. This madedistance-based clustering algorithms infeasible, as this would require a distance matrixcontaining 2.5× 109, stretching the limits of all but highly specialised hardware to storein RAM.A method does exist for memory-efficient clustering of categorical data – k-modes[133]. However, this algorithm in its original form does not provide for a method of modelselection (choosing the number of clusters k). Furthermore, k-modes is not designed tohandle missing data, which is a necessary feature of strand state calls. I thus concludedthat developing a novel clustering algorithm would be the best solution to this problem.To this end, I created an algorithm very loosely modelled on a Chinese restaurant process.A Chinese restaurant process (CRP) posits a restaurant with infinitely many tables,887.2. MethodologyFigure 7.3: Flow diagram illustrating three-star Chinese restaurant clustering. Foreach contig, the similarity in strand state between the contig and the existing clusters (if any) iscomputed. If this similarity exceeds a pre-defined threshold for any of the existing clusters, thecontig is assigned to the most similar cluster. If no clusters are similar enough, a new cluster iscreated. For the purposes of computing similarity, a consensus strand state is computed for eachcluster by a simple vote of the strand states of its members.each having infinitely many chairs. As each new customer is seated, they choose, withuniform probability, to sit at an already-occupied table, or to sit at a new table. Indeed,a model-based clustering method for numerical gene expression data has been developedfrom CRPs [134].My clustering method was considerably simpler. In each iteration, a new contigis considered. The contig may either be assigned to a new cluster, or join an existingcluster. To decide this, each existing cluster’s per-library strand state is computed asthe majority vote of its member contigs for that library. Then, the distance between thenew contig and all the existing cluster is computed. If the contig is sufficiently similarto an existing cluster (as determined by a pre-set threshold similarity value), then it isadded to that cluster. If not, then it forms a new cluster.897.2. MethodologyIn a deliberate effort to stretch a bad metaphor to the breaking point, I have namedthis method “Three-Star Chinese Restaurant Clustering”. In Vancouver, the city wherethis work was conducted, there are many Chinese restaurants, serving a great range ofChinese cuisines. As a one to five star rating system is common for restaurant reviews,I have selected “three-star” to refer to an “average” restaurant, as an allusion to theaveraging of existing cluster values when choosing cluster assignment for a new datapoint.High-Speed Bitstring Implementation of Contig Similarity CalculationIn order to enhance the performance of TSCRC for this particular task, I optimised thedistance computation for speed. Since the strand state vector for each contig is definedover only two values – WW and CC, it could be encoded as a bit string (V 1S and V 2S),using the BitString class from the boost library [135]. The presence of missing values(NAs) in each vector is encoded as a separate bitstring (V 1NA and V 2NA). Then, usingbitstring arithmetic, the similarity between two strand state vectors, in the presence ofmissing values, can be computed:common bits C =∼ (V 1NA|V 2NA) (7.5)equal bits E =∼ (V 1S ⊕ V 2S) (7.6)non-NA overlap O = E&C (7.7)similarity S(∈ R) = bitcount(O)/bitcount(C) (7.8)Bitwise AND, OR, XOR and NOT are part of the core Intel x86 instruction set.Counting of set bits in a bitstring is implemented as the machine-level instructionPOPCNT since the SSE4 extension to the basic Intel instruction set [136]. As such, thisimplementation can be translated by the C++ compiler down to high-speed machine-levelinstructions on most modern computers.907.2. MethodologyOrdering Clustering by QualityOne problem with TSCRC is its sensitivity to the order in which contigs are presentedto it. Lower-information contigs are more likely to associate incorrectly, and if theseform the seeds of clusters at an early stage in the clustering, they could result in clustersmade up erroneously of contigs from separate chromosomes. Indeed, this was the casein the initial validation work shown in Fig 7.5. A solution to this problem is to feed incontigs in order of quality. For this purpose, I used the proportion of non-NA values foreach contig, and sorted the contigs in descending order.Ensemble ClusteringTo further overcome TSCRC’s sensitivity to ordering of its input, contiBAIT performsseveral clustering runs and combines these using ensemble clustering. In order to keep theadvantages of best-first ordering while still randomising the clusterings to be combinedin the ensemble, contiBAIT uses quality-weighted sampling of contigs to select the orderin which they are passed to TSCRC for each clustering run. A consensus clustering isthen produced using the CLUE R package with the default fixed-point algorithm [31].Reorientation and Merging of Linkage GroupsTo distinguish contig orientation, contiBAIT generates an initial contig dissimilaritymatrix using only chromosomes that have inherited homozygous WW and CC templates(but excluding WC), in such a way that misoriented linkage groups derived from thesame chromosome are highly dissimilar (Fig 7.5a, left panel). BAIT then uses a greedyalgorithm to reorient the misoriented linkage groups: iteratively inverting the mostdissimilar, and recomputing the distance matrix until a reorientation causes no increasein the summed concordance of all groups (Fig 7.5a, right panel; see Fig 7.6). Linkagegroups with high similarity are merged in the recomputed data, and contiBAIT hasfunctionality to visualize this as a distance-matrix heat plot of linkage group concordance(Fig 7.5a, right panel; see Fig 7.6).917.2. MethodologyThe final output of contiBAIT’s anchoring process is a linkage group label for eachcontig, corresponding to its chromosome membership. As a side effect of the reorientationprocess, contiBAIT also produces orientation information for all contigs.7.2.3 Ordering of Unbridged Contigs within ChromosomesTo determine the order of contigs within chromosomes, contiBAIT uses template-strandinheritance and SCE localization to build an inter-contig distance matrix for each linkagegroup. This is then used for ordering by formulating the ordering problem as a travellingsalesman problem.If this distance matrix is viewed as a complete graph, then the goal is to find apath that visits every vertex (a Hamiltonian path), while minimising the weight of theedges included (the lowest-weight Hamiltonian path). However, given the far greateravailability of techniques for finding the lowest-weight Hamiltonian Cycle (the travellingsalesman problem), it is desirable to reformulate the problem as such. This can beachieved by adding a dummy vertex with edges of weight zero to all other vertices. Fromthere, existing methods for solving the travelling salesman problem may be used to finda (hopefully optimal) low-weight path through the graph and infer the relative order ofcontigs within a linkage group. This process is illustrated on a toy example in Fig 7.4.Before computing this distance matrix, additional pre-filtering is carried out. Toreduce noise, contiBAIT re-examines the ratios R of W to C reads used to call thestrand state for each contig in each library. These calls are filtered out (set to NA) when0.2 < abs(R) < 0.8. Since no ordering information can be found for contigs which are soclose that their strand state is identical across all libraries (see Figure 7.4), these casesare detected and combined into meta-contigs before ordering.Following filtering, contiBAIT computes distances between contigs using the daisyfunction from the cluster package in R [137]. It then passes these distances to the TSPpackage to be solved. Unfortunately Concorde [138], widely considered to be the bestTSP solver, cannot be included with the TSP package due to licensing issues. So, at927.3. Results and Validationpresent, only the 2-opt method [139] is provided.7.3 Results and Validation7.3.1 Contig AnchoringTo test the ability of BAIT to build genomes de novo, the read libraries were realigned tothe first build of the mouse genome (MGSCv3). Of the 224,713 contigs in this assemblyversion, the 77,258 that were over 10 kb were included, representing 2,006 Mb of DNA(81.0% of total assembly). After remerging and reorienting similar clusters, BAITassigned 54,832 contigs, representing 1,742 Mb (64.9%) of the assembly, into 20 primaryLGs (Fig 7.5a). Allosomes in these male-derived ESCs are effectively monosome, andso contigs derived from the sex chromosomes can be separately identified, as they onlyinherit a single W or C template strand, never both. After cross-referencing the locationsof MGSCv3 contigs to GRCm38/mm10 coordinates, the majority of LGs clustered toonly one chromosome (see Fig 7.6), and the majority of chromosomes consisted of onlyone linkage group (Fig 7.5b). When more than one chromosome was attributed to thesame linkage group, these groups could be split into two subclusters (see Figure 7.6).Similar results were seen when we simulated an early-stage reference by splitting theGRCm38/mm10 genome into a scaffold of the 403 chromosomal Giemsa bands (basedon coordinates from the UCSC genome browser [127]), and realigned our libraries to thisnew reference version (see Additional file 5: Figure S4). Using disrupted concordancefrom SCEs as a genetic distance indicator, it was further possible to infer the relativeorders of the contigs present in each linkage group.The accuracy of ordering fragments is dependent on the frequency of SCEs, thenumber of libraries used in the analysis, and the level of library background (high-background libraries are more likely to have incorrect template calls). If the templatestrands of contigs are identical in all libraries (because no SCE events have occurredbetween them) their relative order remains unknown.937.3. Results and ValidationLocation on chromosomeSCESCEWCCCTemplateStateLibrary1234Contig 1 Contig 2 Contig 3 Contig 4CCWC CC CCCC CC CC CCWW WW WW WCWC WC WC WC12341 2 3 4LibraryContig12341 2 3 4ContigContig10.750.750.511110.75 0.75C1 C2C3C4D0.50.250.25000000.250.25 D, C1, C2, C3, C4, DD, C1, C3, C2, C4, DTotal Distance=0.5Optimal Solutionsa.b. c.d.Figure 7.4: Using SCEs to order contigs within a chromosome. a. A very simplifiedand idealised example, showing only four contigs and four libraries. b. The template state matrixfor each contig in each library. c. The concordance matrix between contigs, corresponding tothe template matrix in b. d. The concordance matrix formulated as a distance graph, with anadded dummy node D, for solving as a travelling salesman problem. Two shortest HamiltonianCycles are possible, and once the dummy node is removed, one of these corresponds to thecorrect ordering of the contigs along the chromosome. Note that, since C2 and C3 have identicaltemplate state in all libraries, it is impossible to distinguish their ordering.947.3. Results and ValidationFigure 7.5: Clustering contigs into linkage groups for early-assembly genomes. Us-ing template strand directionality as a unique signature, all contigs in the early mouse assemblyMGSCv3 were compared with each other across all 62 strand-seq libraries. All contigs withsimilar (> 85%) template inheritance patterns were stratified into linkage groups (LGs). (a)Heat plots of all BAIT-called LGs show limited similarity between groups. Through analysisof homozygous template states only (WW and CC, left panel) 57,581 contigs cluster into 33LGs, with the association between linkage groups appearing as yellow points if groups are inthe same orientation, or blue points if the groups are in opposite orientations. The LGs arethen reanalyzed after merging and reorientation of associated clusters, resulting in only 20linkage groups consisting of 54,832 contigs. (b) Histogram of the number of fragments withina linkage group that map to a particular chromosome. The LG with the largest number ofcontigs are shown at the bottom in dark gray, with groups that contain the next largest numbersof contigs shown in progressively lighter grays. Most LGs contain contigs that belong to thesame chromosome (see Fig 7.6), and in general, most chromosomes are represented by oneor two linkage groups. Note: contigs derived from sex chromosomes in male libraries can bedistinguished as they are haploid, and are not computed as an initial heat plot. Any contigsderived from haploid chromosomes are separated and clustered independently. Almost all contigsclustered into this linkage group mapped to the X chromosome (right histogram). Abbreviations:C, Crick; W, Watson. 957.3. Results and ValidationFigure 7.6: Contig locations in Bioinformatic Analysis of Inherited Templates(BAIT)-compiled linkage groups. strand-seq data was aligned to the MGSCv3 assembly,and was stratified solely on strand inheritance patterns. The resulting data were compareddirectly to the known locations of the MGSCv3 contigs in the current GRCm38/mm10 assembly.(a) Linkage groups determined by BAIT predominantly contain contigs derived from a singlechromosome. The linkage groups (denoted as LG# under each histogram) contain differentnumbers of clustered contigs, with the total length of each linkage group shown (y-axis) buttend to map to only one chromosome. (b) Of the linkage groups that map to more than onechromosome, a heatmap plot shows that the linkage group should be subdivided. Linkage group 1(green highlight) contains contigs from chromosome 1 (chr1), chr15, and chr7, but generates threedistinct clusters (left panel) where each cluster contains contigs derived from one chromosome(colours beneath dendrogram). An example of a linkage group with contigs mapping to a singlelocus (LG11, green highlight) shows that the majority of contigs within this group cluster tightlytogether (right panel).967.3. Results and ValidationEnsemble ClusteringTo evaluate ensemble clustering, I ran TSCRC 100 times on the early mouse genome(mm2) data used for evaluation in [6]. I ran reorientation and merging on each of these,and took those results as the non-ensemble baseline. I then took 1, 000 bootstrap samplesfor each of size {2, 4, 6, 8, 10, 12, 14, 16}, and reoriented and merged each of these. To evaluatethe accuracy of each clustering, I computed the F-measure as per [5], specifically:F (C,K) =∑ci∈CciNmaxkj∈K{F (ci, kj)} (7.9)Where C represents the true chromosome assignments of contigs (lifted over to mm10),and K represents the cluster assignments produced by contiBAIT.The results are shown in Figure 7.7a. Ensembles of size 4–16 showed improvedF-measures over single clusterings, with a peak at 14, and most of the gains achieved by8 clusters in the ensemble.Effect of Contig SizeUsing the current build of the mouse genome (mm10), I divided chromosomes intoequal-sized sections of varying sizes (1 Mb, 500 Kb, 250 Kb, 100 Kb, 50 Kb). I then rancontiBAIT 100 times (without ensemble) on each of these to assemble them into theiroriginal chromosomes, and computed the F-measure for each as per Equation 7.9.The results are shown in Figure 7.7b. Contig size, at least over this range, had minimaleffect on F-measure. Although there was an increase in F-measure with decreasing contigsize down to 100kb, on examination of the clusterings themselves, this was due to therebeing only 1-2 contigs misplaced in each result. Those 1-2 misplaced contigs representeda greater proportion of the total chromosome size in the data with larger contigs, hencethe slightly lowered F-measure.977.3. Results and ValidationFigure 7.7: Evaluation of contiBAIT chromosome assignment with different-sizedcluster ensembles and different-sized contigs. a. F-measure vs size of cluster ensemblefor mm2 data. Without ensemble clustering, median F-measure was 0.86. Ensembles of size 2actually decreased F-measure (median=0.77), while those of size 4 and larger show an overall(but not significant) increase, peaking at ensembles of size 12–14 (median=0.91). b. Effect ofcontig size on accuracy of chromosome assignment for uniform-sized synthetic contigs from mm10data. F-measure was relatively uniform (within the range of 0.94 to 1), however there was aslight trend of increasing F-measure as contig size decreased from 1 Mb down to 100 Kb. Thistrend reversed at 50kb, with a slight decrease in F-measure from 100 Kb.Overall PerformanceContiBAIT was run on the same early-build genome data (mm2) described earlier inFig 7.5 and in [6], but with TSCRC and ensemble clustering. The results are shown inFig 7.8, and include no major incorrect merging of contigs from different chromosomes,and most chromosomes almost entirely accounted for by a single linkage group.7.3.2 Validation of Contig OrderingTo validate contig ordering, I used the 1 Mb, 500 Kb, 250 Kb simulated contigs asdescribed in the earlier section for anchoring. I took the contigs assigned to each linkagegroup as predicted by contiBAIT anchoring, and applied contiBAIT to attempt tocorrectly order them. I then compared the positions of the contigs in order to their truestart positions, both visually and by using Spearman’s Rho correlation. These resultsare summarised in Fig 7.9. Visually, contiBAIT was able to recapture the overall trendof contig location, although frequently there were discontinuities and opposite-ordered987.3. Results and ValidationFigure 7.8: A chromosome assignment result after reorientation and merging. Thisplot was generated on mm2 data, following clustering with an ensemble of size 8, reorienting ofmisoriented linkage groups, and merging. This is a marked improvement in comparison with thepreviously published performance shown in Figure 7.5B. No linkage group contains substantialnumbers of contigs from any but the primary chromosome, so that, although chromosomes 2 and14 ended up split over two main linkage groups each, no chimeric chromosomes were created.997.4. PackageFigure 7.9: Testing of ordering of contigs on synthetic data Predicted ordering isplotted against true start positions of contigs. a–c. For each of the data sets tested, the linkagegroups with the best and worst Spearman correlation are shown. d. Distribution of Spearmancorrelation for each of the data sets. Both the maximum and the mean decrease with smallercontig sizes.regions.7.4 PackageUnlike BAIT, ContiBAIT has been developed as a fully-contained R package, capable ofaccepting BAM and BED files and performing all analysis described in this chapter. Itis intended for it to submitted to Bioconductor to be made freely available.7.5 DiscussionI have demonstrated that contiBAIT can anchor contigs into common linkage groupsbased on shared template inheritance, and can order some contigs within linkage groupsusing sister chromatid exchanges.For contig anchoring, I have presented a complete pipeline for clustering and reorient-ing contigs. As part of that pipeline, I have developed and optimised three-star Chinese1007.5. Discussionrestaurant clustering, a novel, memory-efficient clustering algorithm for symmetric binaryvariables. Although this method is somewhat susceptible to errors due to starting bias,I have shown that combining multiple runs in an ensemble of six or more clusteringsimproves its performance. On an actual early-assembly genome, assigning of contigs tochromosomes performed well (mean F-measure 0.91). On simulated genomes, createdby cutting the current (mm10) build of the Mus musculis genome into uniformly sized“contigs”, F-measure consistently exceeded 0.95 even for contigs as small as 50kb.For contig ordering, I have presented a pipeline for assigning the orders of contigsrelative to each other using a travelling salesman problem formulation. This was able tocapture the overall ordering of contigs, but with a great deal of local noise, as well asregions ordered opposite to each other. While this may be sufficient for the purposesof early-build genomes (and certainly provides more information than is available forunbridged contigs), this is far from ideal. It is possible that the use of the Concorde TSPsolver [138] may improve the ordering. Furthermore, more stringent filtering of bothlibraries and contigs could reduce the confounding noise, at the expense of fewer contigsordered. It would also be possible to first order a small subset of higher-quality contigs,then use BAIT to detect misorientations and to place some of the remaining contigs [6].Preliminary sequencing efforts in lesser-studied organisms suffer from fewer resourcesspent on deep sequencing and subsequent curating and refining of the reference genomeassemblies. With several ambitious sequencing projects in development [140], there is anincreasing need for rapid and cost-effective construction of accurate and useful referencegenomes. Arranging contigs to facilitate building chromosome-level and genome-levelhierarchies represents an attractive advance toward this goal, especially in conjunctionwith existing technologies, and it is likely that strand-seq and contiBAIT will be widelyadopted in standard genome assembly pipelines [130]. ContiBAIT is positioned to joinalternate, low-cost methods for anchoring contigs in draft-quality genomes, such asoptical mapping,[141] chromatin interactions [142] and POPSEQ [143] as a vital tool forlow-budget genome sequencing efforts.101Chapter 8Conclusions and Future Work8.1 Summary of ContributionsFirstly, I have developed an enhanced version of flowType, enabling it to be used onnewer, higher dimensional data. I have developed flowBin, a new, imputation-free tool forcombining multitube flow cytometry data. Both of these steps were necessary to enabledeep, automated profiling of retrospective flow cytometry data from AML diagnoses.FlowBin was needed to transform the data into a form that could be analysed whileaccounting for the true correlations among the markers in different tubes. FlowType-DPwas necessary to analyse the very high-dimensional data from flowBin, which was notpossible with flowType-BF. I have shown how, taken together, flowBin and flowType-DPform a powerful platform for high-throughput analysis of multitube flow cytometry data.Multitube flow cytometry data is commonly used for leukemia diagnosis, and I haveapplied flowBin and flowType-DP to three retrospective acute myeloid leukemia datasets. I have demonstrated that flowBin can be used to distinguish normal from leukemiccell types, to find cell types associated with a particular genotype, and to survey theimmunophenotypic landscape of a patient cohort. In the process, I have presenteda more detailed survey of the immunophenotypes associated with NPM1 mutationsin AML, shown the neoplastic immunophenotypic landscape of AML, and associatedimmunophenotype with a comprehensive survey of AML genetic lesions.Secondly, I have I have contributed significantly to developing BAIT, a tool fordetecting sister chromatid exchanges in strand-seq data. This lays the foundations fordetecting translocations and other genomic rearrangements at the single cell level, which1028.2. Reiteration of Problem Statementmay have profound impacts for researching drug resistance and relapse in AML. BAITalso enables the placing of unbridged contigs in late-build genomes, and I present theplacing of half the remaining unplaced contigs in the current build of the mouse genome.Mice are an important model species for AML research,[144] and improvements to thequality of the mouse genome will aid their usage for this purpose.Following from BAIT, I took the lead in developing contiBAIT, an R package for denovo assembling genomes with unbridged gaps using strand-seq. There are numerousmodel organisms with genomes consisting only of large numbers of unbridged, unanchoredcontigs. Many of these organisms are likely to be of use in cancer research. For example,the naked mole rat (Heterocephalus glaber), has unusually high cancer resistance,[145]but a genome currently only assembled into 4,230 unplaced contigs. ContiBAIT, withstrand-seq, can be used to cheaply assemble these genomes into chromosomes, and tocreate preliminary ordering of those contigs along the chromosomes, for relatively littlesequencing cost.8.2 Reiteration of Problem StatementAlthough the high-throughput methods of flow cytometry and strand-seq have producedlarge quantities of data that may facilitate life-saving research into AML treatment, thebioinformatics tools to analyze this data have been limited. Prior to this thesis work,tools for recombining the multitube flow cytometry data typical in AML diagnosis werefew and produced imputation errors, while existing downstream methods were poorlysuited for the high dimensionality of the resulting data. Tools for automatically analysingstrand-seq were non-existent.By developing flowBin and improving flowType, I have created tools to be able toaccurately recombine multitube AML flow cytometry data, and further characterise thatdata in spite of its high-dimensional nature. By developing BAIT and contiBAIT, Ihave created tools to analyse strand-seq data, with applications in AML of measuringgenomic instability, detecting subclonal genomic rearrangements, and improving the1038.2. Reiteration of Problem Statementassemblies of cancer model organism genomes. These tools will facilitate much-neededresearch using these data towards understanding AML biology and hence improvingtreatment outcomes.1048.2.ReiterationofProblemStatementAlgorithm Input Output Advantages LimitationsflowType-DP High-dimensional single-cell data, partition thresh-olds for each dimensionCounts of all cell types de-fined across all combina-tions of dimensions, downto a pre-specified numberof combinationsEnables very high-dimensional data tobe deeply phenotypedDoes not comprehensivelyexamine all combinations(as this would be infeasible)flowBin Multitube flow cytometrydataHigh-dimensional datawith per-cell-cluster ex-pression values, resemblinghigh-dimensional flowcytometry dataEnables combining of mul-titube flow cytometry datafor use in analyses accept-ing flow cytometry data;fewer false combinationsLow sensitivityBAIT: SCE prediction Read locations from astrand-seq libraryPredicted locations of sisterchromatid exchange eventsEnables automated predic-tion of SCEsNot 100% accurate, espe-cially in noisy data.AmalgaBAIT Multiple strand-seq li-braries from the sameorganism, a referencegenome containing someunbridged, orphan contigsPrediction of likelihoodsacross the genome thateach unplaced contig mapsto that regionEnables relatively low-costplacement of unbridgedcontigsMay not work for low-complexity or misassem-bled contigs (but these areeasily filtered out)contiBAIT Multiple strand-seq li-braries from the sameorganism, a referencegenome made entirely ofunbridged, orphan contigsPrediction of chromosomalmembership of contigs andpositional order of contigswithin chromosomesAssists in draft assemblyof poor-coverage referencegenomesVulnerable to noise (espe-cially predictions of posi-tional order).Table 8.1: Summary of algorithms developed, listing inputs, outputs, advantages and disadvantages.1058.3. Future Directions8.3 Future Directions8.3.1 FlowType-DPA possible further refinement to FlowType-DP might include heuristic search of the celltype space. P-values, area under the curve, or other outcome metrics could be used asoptimality criteria. This could enable much deeper exploration of the cell type spacethan flowType-DP’s breadth-first strategy. However, such a search method may besusceptible to convergence at local optima (and thus miss phenotypes which a morecomprehensive strategy would have found). Also, since the search is biased towardsphenotypes with higher P-values or better classification scores, the search could besusceptible to overfitting, and in a statistical context, care would need to be taken toensure that an appropriate multiple test correction was developed and applied.Another possible refinement to flowType analysis would be the development of anappropriate multiple test correction for the comprehensive case. Existing methods, e.g.Bonferonni-Holm[146] and Benjamini-Hochbergi[147], assume independence betweenvariables. However, the phenotypes produced by flowType are related hierarchically,and tend to be highly interdependent. Indeed, in some cases, adding a particular gateto a phenotype may have no effect whatsoever (for example, CD34+ cells are rarelyalso CD20+), so that some phenotypes are completely redundant. A multiple testcorrection which accounts for this feature of flowType data could substantially improvethe statistical power of analyses incorporating flowType.8.3.2 FlowBinFuture enhancements to the flowBin could include support for user-specified binningusing the gating data structures within the flowCore package.[20] This would by extensionsupport importing bin definitions from other software saved in Gating-ML format.[148]Another useful addition would be the option to support callback functions for computingthe expression of bins. This would enable users to implement alternative methodsfor thresholding based on negative controls, such as Overton’s cumulative subtraction1068.3. Future Directionsof histograms[149], Bagwell’s super enhanced Dmax subtraction[150], Lampariello’sratio analysis of empirical cumulative density functions[151], or the more recent use ofF-measure [152].8.3.3 Applications of FlowBin to AMLIn AML, blast count is typically quite high (at least 20% of bone marrow), but variessubstantially and may be as high as 90%. This acts as a source of noise in flowBin-flowType analyses. The detection of abnormal cell types (and filtering out of normal)could be used to reduce that noise. However, in the other AML data sets I analysed,there were no normal patients available to use for this purpose. So a future directionwould be to determine whether the FlowCAP2 AML data set (which is publicly availablein FlowRepository) could be used as a training set for other data sets.For the analysis associating NPM1 with phenotypes, the next step was to applythe same methods to a more comprehensive genotyping effort, which I did. For thatgenotyping effort, overall survival information is available, and it could be useful both forfuture prognostic purposes and to understanding the biology of AML to perform survivalanalysis. Both the NMF clusters, individual cell types, and combinations of these withrecurrent gene mutations could be examined for survival differences and/or correlation.Another future analysis to be performed on that data set would be a comparison ofthe flow cytometry NMF clusters with NMF clustering of the same patients in termsof both their mRNA and miRNA expression (data for which is available). This couldhelp to elucidate more of the biology underlying the genotype-phenotype correlationsI observed. Lastly, in recent years software has become available for inferring clonalprevalence of genotypes, such as PyClone.[153] The clonal prevalence could then becompared with immunophenotype prevalences to attempt to infer the relationshipsbetween immunophenotypic variation within AML and genotypic variations at thesingle-cell level.1078.3. Future Directions8.3.4 BAITThe ability to detect sister chromatid exchanges (and similar discontinuities in strandinheritance) has many potential applications. One of the most important applications toAML diagnosis is the ability to detect genomic rearrangements, which appear as SCEs.[6]Since strand-seq is single cell, it could be possible to use it as a cheap, sensitive andautomated method for detecting subclonal translocations and inversions, an importantnext step for AML research. There will be some challenges to this, such as the problemthat some rearrangements would be masked in some cells by chance. Once per-cellpredictions of rearrangements have been made, they could be used similarly to FISH toinfer the clonal phylogeny of the tumour [154].8.3.5 contiBAITMy collaborator on BAIT and contiBAIT, Mark Hills, has sequenced strand-seq librariesfrom several organisms with early-build genomes, including zebrafish (Danio rerio),pig (Sus scrofa), ferret (Mustela putorius), Tasmanian devil (Sarcophilus harrisii) withaccompanying facial tumour, and platypus (Ornithorhynchus anatinus). Mark is inthe process of applying contiBAIT to this data, in order to improve the quality of thegenomes for these organisms. In the future, strand-seq and contiBAIT will likely becomeone of a suite of low-cost tools complimenting and enhancing the results of whole-genomeshotgun sequencing efforts.While contiBAIT’s current anchoring approach performs extremely well, the orderingcomponent could be improved. A procedure of filtering more stringently and onlyattempting to order the higher-quality contigs could help. It may also be possible toestimate the background noise of the individual libraries, as BAIT does[6]; although thisrequires complete chromosomes, it could be adapted to work with larger contigs.The ordering algorithm itself could also stand to be improved. Applying the Con-corde framework[138] may produce more optimal travelling salesman problem solutions.Furthermore, the shortest Hamiltonian path / travelling salesman problem formulation1088.3. Future Directionsitself has some drawbacks, in that it does not account for some biological information,such as impossible strand state transitions, and detectable SCEs within contigs. Ametaheuristic approach (such as a genetic algorithm) could be developed instead, with acustomised objective function accounting for these factors. Although genetic algorithms(GAs) in their original formulation are notoriously poor at solving ordering problems dueto becoming stuck in local sub-optima,[155] a formulation incorporating local heuristicsearch in place of the mutation operator has been used to solve TSPs quickly and withsolutions equally optimal to benchmarks.[155] Incorporating this strategy with a moreintelligent objective function could dramatically improve the ability of contiBAIT tocorrectly order contigs.109Bibliography[1] K. O’Neill, N. Aghaeepour, J. Sˇpidlen, and R. Brinkman, “Flow CytometryBioinformatics,” PLoS Comput Biol, vol. 9, p. e1003365, Dec. 2013.[2] K. O’Neill, A. Jalali, N. Aghaeepour, H. Hoos, and R. R. Brinkman, “EnhancedflowType/RchyOptimyx: A Bioconductor pipeline for discovery in high-dimensionalcytometry data,” Bioinformatics, vol. 30, pp. 1329–30, Jan. 2014.[3] N. Aghaeepour, P. K. Chattopadhyay, A. Ganesan, K. O’Neill, H. Zare, A. Jalali,H. H. Hoos, M. Roederer, and R. R. Brinkman, “Early immunologic correlatesof HIV protection can be identified from computational analysis of complexmultivariate T-cell flow cytometry assays,” Bioinformatics, vol. 28, no. 7, 2012.[4] N. Aghaeepour, A. Jalali, K. O’Neill, P. K. Chattopadhyay, M. Roederer, H. H.Hoos, and R. R. Brinkman, “RchyOptimyx: Cellular hierarchy optimization forflow cytometry,” Cytometry Part A, vol. 81, no. 12, pp. n/a–n/a, 2012.[5] N. Aghaeepour, G. Finak, T. F. Consortium, T. D. Consortium, H. Hoos, T. R.Mosmann, R. Brinkman, R. Gottardo, and R. H. Scheuermann, “Critical assessmentof automated flow cytometry data analysis techniques,” Nature Methods, vol. 10,pp. 445–445, May 2013.[6] M. Hills, K. O’Neill, E. Falconer, R. Brinkman, and P. M. Lansdorp, “BAIT: Or-ganizing genomes and mapping rearrangements in single cells,” Genome Medicine,vol. 5, p. 82, Sept. 2013.[7] M. Roederer, “How many events is enough? Are you positive?,” Cytometry PartA, vol. 73A, pp. 384–385, May 2008.[8] B. D. Hedley and M. Keeney, “Technical issues: flow cytometry and rare eventanalysis,” International Journal of Laboratory Hematology, vol. 35, pp. 344–350,June 2013.[9] B. Brando, D. Barnett, G. Janossy, F. Mandy, B. Autran, G. Rothe, B. Scarpati,G. D’Avanzo, J.-L. D’Hautcourt, R. Lenkei, G. Schmitz, A. Kunkl, R. Chianese,S. Papa, and J. W. Gratama, “Cytofluorometric methods for assessing absolutenumbers of cell subsets in blood,” Cytometry, vol. 42, no. 6, pp. 327–346, 2000.110Bibliography[10] C. S. Ferreira-Facio, C. Milito, V. Botafogo, M. Fontana, L. S. Thiago, E. Oliveira,A. S. da Rocha-Filho, F. Werneck, D. N. Forny, S. Dekermacher, A. P. de Azambuja,S. E. Ferman, P. A. S. de Faria, M. G. P. Land, A. Orfao, and E. S. Costa,“Contribution of Multiparameter Flow Cytometry Immunophenotyping to theDiagnostic Screening and Classification of Pediatric Cancer,” PLoS ONE, vol. 8,p. e55534, Mar. 2013.[11] D. Wu, B. L. Wood, and J. R. Fromm, “Flow Cytometry for Non-Hodgkin andClassical Hodgkin Lymphoma,” Methods in Molecular Biology, vol. 971, pp. 27–47,Jan. 2013.[12] Y. Wang, F. Hammes, K. De Roy, W. Verstraete, and N. Boon, “Past, presentand future applications of flow cytometry in aquatic microbiology,” Trends inBiotechnology, vol. 28, pp. 416–424, Aug. 2010.[13] L. A. Johnson, J. P. Flook, M. V. Look, and D. Pinkel, “Flow sorting of X and Ychromosome-bearing spermatozoa into two populations,” Gamete research, vol. 16,pp. 1–9, Jan. 1987.[14] G. M. Baerlocher, I. Vulto, G. de Jong, and P. M. Lansdorp, “Flow cytometry andFISH to measure the average length of telomeres (flow FISH),” Nature Protocols,vol. 1, pp. 2365–2376, Dec. 2006.[15] P. K. Chattopadhyay, C. M. Hogerkorp, and M. Roederer, “A chromatic explosion:the development and future of multiparameter flow cytometry.,” Immunology,vol. 125, no. 4, p. 441, 2008.[16] G. K. Behbehani, S. C. Bendall, M. R. Clutter, W. J. Fantl, and G. P. Nolan, “SingleCell Mass Cytometry Adapted to Measurements of the Cell Cycle,” Cytometry.Part A : the journal of the International Society for Analytical Cytology, vol. 81,pp. 552–566, July 2012.[17] A. K. White, M. VanInsberghe, O. I. Petriv, M. Hamidi, D. Sikorski, M. A. Marra,J. Piret, S. Aparicio, and C. L. Hansen, “High-throughput microfluidic single-cellRT-qPCR,” Proceedings of the National Academy of Sciences, vol. 108, Aug. 2011.[18] M. Roederer, “Compensation in Flow Cytometry,” in Current Protocols in Cytom-etry, John Wiley and Sons, Inc., 2001.[19] C. B. Bagwell and E. G. Adams, “Fluorescence Spectral Overlap Compensation forAny Number of Flow Cytometry Parameters,” Annals of the New York Academyof Sciences, vol. 677, no. 1, pp. 167–184, 1993.[20] F. Hahne, N. LeMeur, R. R. Brinkman, B. Ellis, P. Haaland, D. Sarkar, J. Spi-dlen, E. Strain, and R. Gentleman, “flowCore: a Bioconductor package for highthroughput flow cytometry,” BMC Bioinformatics, vol. 10, p. 106, Apr. 2009.[21] H. M. Shapiro, Practical flow cytometry. Wiley-Liss New York, 2003.111Bibliography[22] D. R. Parks, M. Roederer, and W. A. Moore, “A new “Logicle” display methodavoids deceptive effects of logarithmic scaling for low signals and compensateddata,” Cytometry Part A, vol. 69A, no. 6, pp. 541–551, 2006.[23] G. Finak, J.-M. Perez, A. Weng, and R. Gottardo, “Optimizing transformations forautomated, high throughput analysis of flow cytometry data,” BMC Bioinformatics,vol. 11, p. 546, Nov. 2010.[24] W. A. Moore and D. R. Parks, “Update for the logicle data scale includingoperational code implementations,” Cytometry Part A, vol. 81A, no. 4, pp. 273–277, 2012.[25] C. B. Bagwell, “Hyperlog—A flexible log-like transform for negative, zero, andpositive valued data,” Cytometry Part A, vol. 64A, no. 1, pp. 34–42, 2005.[26] K. Lo, R. R. Brinkman, and R. Gottardo, “Automated gating of flow cytometrydata via robust model-based clustering.,” Cytometry.Part A : the journal of theInternational Society for Analytical Cytology, vol. 73, pp. 321–332, Apr. 2008.[27] Y. Qian, Y. Liu, J. Campbell, E. Thomson, Y. M. Kong, and R. H. Scheuermann,“FCSTrans: An open source software system for FCS file conversion and datatransformation,” Cytometry Part A, vol. 81A, no. 5, pp. 353–356, 2012.[28] N. Le Meur, A. Rossini, M. Gasparetto, C. Smith, R. R. Brinkman, and R. Gentle-man, “Data quality assessment of ungated flow cytometry data in high throughputexperiments,” Cytometry Part A, vol. 71, no. 6, pp. 393–403, 2007.[29] W. T. Rogers, A. R. Moser, H. A. Holyst, A. Bantly, E. R. Mohler III, G. Scangas,and J. S. Moore, “Cytometric fingerprinting: Quantitative characterization ofmultivariate distributions,” Cytometry Part A, vol. 73, no. 5, pp. 430–441, 2008.[30] F. Hahne, A. H. Khodabakhshi, A. Bashashati, C. J. Wong, R. D. Gascoyne,A. P. Weng, V. Seyfert-Margolis, K. Bourcier, A. Asare, T. Lumley, and Others,“Per-channel basis normalization methods for flow cytometry data,” CytometryPart A, vol. 77, no. 2, pp. 121–131, 2010.[31] K. Hornik, “A CLUE for CLUster Ensembles,” Journal of Statistical Software,vol. 14, no. 12, pp. 1–25, 2005.[32] E. Lugli, M. Roederer, and A. Cossarizza, “Data analysis in flow cytometry: thefuture just started,” Cytometry. Part A: the journal of the International Societyfor Analytical Cytology, vol. 77, pp. 705–713, July 2010.[33] S. C. Bendall and G. P. Nolan, “From single cells to deep phenotypes in cancer,”Nature Biotechnology, vol. 30, no. 7, pp. 639–647, 2012.[34] H. T. Maecker and J. Trotter, “Flow cytometry controls, instrument setup, andthe determination of positivity,” Cytometry Part A, vol. 69A, no. 9, pp. 1037–1042,2006.112Bibliography[35] L. K. McNeil, L. Price, C. M. Britten, M. Jaimes, H. Maecker, K. Odunsi, J. Mat-suzaki, J. S. Staats, J. Thorpe, J. Yuan, and S. Janetzki, “A harmonized approachto intracellular cytokine staining gating: Results from an international multicon-sortia proficiency panel conducted by the Cancer Immunotherapy Consortium(CIC/CRI),” Cytometry Part A, vol. 83A, pp. 728–738, Aug. 2013.[36] S. Bendall, E. Simonds, and P. Qiu, “Single-cell mass cytometry of differentialimmune and drug responses across a human hematopoietic continuum,” Science. . . , vol. 332, no. 6030, p. 687, 2011.[37] P. F. Virgo and G. J. Gibbs, “Flow cytometry in clinical pathology,” Annals ofClinical Biochemistry, vol. 49, pp. 17–28, Jan. 2012.[38] E. S. da Costa, C. E. Pedreira, S. Barrena, Q. Lecrevisse, J. Flores, S. Quijano,J. Almeida, M. del Carmen Garc´ıa-Macias, S. Bottcher, J. J. M. Van Dongen, andOthers, “Automated pattern-guided principal component analysis vs expert-basedimmunophenotypic classification of B-cell chronic lymphoproliferative disorders: astep forward in the standardization of clinical immunophenotyping,” Leukemia,2010.[39] P. Qiu, E. F. Simonds, S. C. Bendall, K. D. Gibbs Jr, R. V. Bruggner, M. D.Linderman, K. Sachs, G. P. Nolan, and S. K. Plevritis, “Extracting a cellular hier-archy from high-dimensional cytometry data with SPADE,” Nature biotechnology,2011.[40] S. Pyne, X. Hu, K. Wang, E. Rossin, T.-I. Lin, L. M. Maier, C. Baecher-Allan,G. J. McLachlan, P. Tamayo, D. A. Hafler, P. L. D. Jager, and J. P. Mesirov,“Automated high-dimensional flow cytometric data analysis,” Proc Natl Acad SciU S A, 2009.[41] Y. Qian, C. Wei, F. Eun-Hyung Lee, J. Campbell, J. Halliley, J. A. Lee, J. Cai,Y. M. Kong, E. Sadat, E. Thomson, P. Dunn, A. C. Seegmiller, N. J. Karandikar,C. M. Tipton, T. Mosmann, I. n. Sanz, and R. H. Scheuermann, “Elucidation ofseventeen human peripheral blood B-cell subsets and quantification of the tetanusresponse using a density-based method for the automated identification of cellpopulations in multidimensional flow cytometry data,” Cytometry Part B: ClinicalCytometry, vol. 78B, no. S1, pp. S69–S82, 2010.[42] H. Zare, P. Shooshtari, A. Gupta, and R. R. Brinkman, “Data reduction for spectralclustering to analyze high throughput flow cytometry data,” BMC bioinformatics,vol. 11, no. 1, p. 403, 2010.[43] N. Aghaeepour, R. Nikolic, H. H. Hoos, and R. R. Brinkman, “Rapid Cell Popu-lation Identification in Flow Cytometry Data,” submitted to Cytometry Part A,2010.[44] Y. Ge and S. C. Sealfon, “flowPeaks: a fast unsupervised clustering for flowcytometry data via K-means and density peak finding,” Bioinformatics, vol. 28,pp. 2052–2058, Aug. 2012.113Bibliography[45] M. Roederer, W. Moore, A. Treister, R. R. Hardy, and L. A. Herzenberg, “Prob-ability binning comparison: a metric for quantitating multivariate distributiondifferences,” Cytometry Part A, vol. 45, no. 1, pp. 47–55, 2001.[46] M. Roederer and R. R. Hardy, “Frequency difference gating: A multivariate methodfor identifying subsets that differ between samples,” Cytometry, vol. 45, no. 1,pp. 56–64, 2001.[47] A. Cron, C. Gouttefangeas, J. Frelinger, L. Lin, S. K. Singh, C. M. Britten, M. J. P.Welters, S. H. van der Burg, M. West, and C. Chan, “Hierarchical Modeling forRare Event Detection and Cell Subset Alignment across Flow Cytometry Samples,”PLoS Comput Biol, vol. 9, p. e1003130, July 2013.[48] “FlowJo - vX Manual - Boolean Gates,” 2013.[49] H. Zare, A. Bashashati, R. Kridel, N. Aghaeepour, G. Haffari, J. M. Connors, R. D.Gascoyne, A. Gupta, R. R. Brinkman, and A. P. Weng, “Automated Analysis ofMultidimensional Flow Cytometry Data Improves Diagnostic Accuracy BetweenMantle Cell Lymphoma and Small Lymphocytic Lymphoma,” American Journalof Clinical Pathology, vol. 137, pp. 75–85, Jan. 2012.[50] P. Qiu, “Inferring Phenotypic Properties from Single-Cell Characteristics,” PLoSONE, vol. 7, p. e37038, May 2012.[51] B. Bodenmiller, E. R. Zunder, R. Finck, T. J. Chen, E. S. Savig, R. V. Bruggner,E. F. Simonds, S. C. Bendall, K. Sachs, P. O. Krutzik, and G. P. Nolan, “Mul-tiplexed mass cytometry profiling of cellular states perturbed by small-moleculeregulators,” Nature Biotechnology, vol. 30, pp. 858–867, Sept. 2012.[52] A. Bashashati, N. A. Johnson, A. H. Khodabakhshi, M. D. Whiteside, H. Zare,D. W. Scott, K. Lo, R. Gottardo, F. S. Brinkman, J. M. Connors, G. W. Slack,R. D. Gascoyne, A. P. Weng, and R. R. Brinkman, “B Cells With High Side ScatterParameter by Flow Cytometry Correlate With Inferior Survival in Diffuse LargeB-Cell Lymphoma,” American journal of clinical pathology, vol. 137, pp. 805–814,May 2012.[53] E. J. Freireich and M. J. Keating, The Merck manual of diagnosis and therapy,ch. 142: Leuke. Merck Research Laboratories Whitehouse Station (NJ), 2006.[54] S. H. Swerdlow, E. Campo, N. L. Harris, E. S. Jaffe, S. A. Pileri, H. Stein, J. Thiele,and J. W. Vardiman, eds., WHO classification of tumours of haematopoietic andlymphoid tissues. International Agency for Research on Cancer, 2008.[55] E. Estey and H. Do¨hner, “Acute myeloid leukaemia,” The Lancet, vol. 368, pp. 1894–1907, Dec. 2006.[56] D. Grimwade, H. Walker, G. Harrison, F. Oliver, S. Chatters, C. J. Harrison,K. Wheatley, A. K. Burnett, and A. H. Goldstone, “The predictive value ofhierarchical cytogenetic classification in older adults with acute myeloid leukemia(AML): analysis of 1065 patients entered into the United Kingdom Medical ResearchCouncil AML11 trial,” Blood, vol. 98, no. 5, p. 1312, 2001.114Bibliography[57] B. B. Zeisig, A. G. Kulasekararaj, G. J. Mufti, and C. W. Eric So, “SnapShot:Acute Myeloid Leukemia,” Cancer Cell, vol. 22, pp. 698–698.e1, Nov. 2012.[58] R. F. Schlenk, K. Do¨hner, J. Krauter, S. Fro¨hling, A. Corbacioglu, L. Bullinger,M. Habdank, D. Spa¨th, M. Morgan, A. Benner, B. Schlegelberger, G. Heil,A. Ganser, and H. Do¨hner, “Mutations and Treatment Outcome in CytogeneticallyNormal Acute Myeloid Leukemia,” New England Journal of Medicine, vol. 358,no. 18, pp. 1909–1918, 2008.[59] J. M. Bennett, D. Catovsky, M. T. Daniel, G. Flandrin, D. A. Galton, H. R.Gralnick, and C. Sultan, “Proposals for the classification of the acute leukaemias.French-American-British (FAB) co-operative group.,” Br J Haematol, vol. 33,pp. 451–458, Aug. 1976.[60] K. A. Foon, R. F. Todd, and Others, “Immunologic classification of leukemia andlymphoma,” Blood, vol. 68, no. 1, p. 1, 1986.[61] C. D. Jennings and K. A. Foon, “Recent advances in flow cytometry: applicationto the diagnosis of hematologic malignancy,” Blood, vol. 90, no. 8, p. 2863, 1997.[62] M. C. Bene, G. Castoldi, W. Knapp, W. D. Ludwig, E. Matutes, A. Orfao, andV. VAN’T, “Proposals for the immunological classification of acute leukemias,”Leukemia, vol. 9, no. 10, pp. 1783–1786, 1995.[63] E. S. Jaffe, N. L. Harris, H. Stein, and J. W. Vardiman, eds., WHO classification oftumours of haematopoietic and lymphoid tissues. International Agency for Researchon Cancer, 2001.[64] S. Grisendi and P. P. Pandolfi, “NPM mutations in acute myelogenous leukemia,”New England Journal of Medicine, vol. 352, no. 3, p. 291, 2005.[65] R. G. W. Verhaak, C. S. Goudswaard, W. van Putten, M. A. Bijl, M. A. Sanders,W. Hugens, A. G. Uitterlinden, C. A. J. Erpelinck, R. Delwel, B. Lowenberg,and Others, “Mutations in nucleophosmin (NPM1) in acute myeloid leukemia(AML): association with other gene abnormalities and previously established geneexpression signatures and their favorable prognostic significance,” Blood, vol. 106,no. 12, p. 3747, 2005.[66] K. Dohner, R. F. Schlenk, M. Habdank, C. Scholl, F. G. Rucker, A. Corbacioglu,L. Bullinger, S. Frohling, and H. Dohner, “Mutant nucleophosmin (NPM1) predictsfavorable prognosis in younger adults with acute myeloid leukemia and normalcytogenetics: interaction with other gene mutations,” Blood, vol. 106, no. 12,p. 3740, 2005.[67] S. Schnittger, C. Schoch, W. Kern, C. Mecucci, C. Tschulik, M. F. Martelli,T. Haferlach, W. Hiddemann, and B. Falini, “Nucleophosmin gene mutations arepredictors of favorable prognosis in acute myelogenous leukemia with a normalkaryotype,” Blood, vol. 106, no. 12, p. 3733, 2005.115Bibliography[68] P. A. Ho, T. A. Alonzo, R. B. Gerbing, J. Pollard, D. L. Stirewalt, C. Hurwitz,N. A. Heerema, B. Hirsch, S. C. Raimondi, B. Lange, J. L. Franklin, J. P. Radich,and S. Meshinchi, “Prevalence and prognostic implications of CEBPA mutationsin pediatric acute myeloid leukemia (AML): a report from the Children’s OncologyGroup,” Blood, vol. 113, pp. 6558–6566, June 2009.[69] P. D. Kottaridis, R. E. Gale, M. E. Frew, G. Harrison, S. E. Langabeer, A. A.Belton, H. Walker, K. Wheatley, D. T. Bowen, A. K. Burnett, and Others, “Thepresence of a FLT3 internal tandem duplication in patients with acute myeloidleukemia (AML) adds important prognostic information to cytogenetic risk groupand response to the first cycle of chemotherapy: analysis of 854 patients from theUnited King,” Blood, vol. 98, no. 6, p. 1752, 2001.[70] T. J. Ley, E. R. Mardis, L. Ding, B. Fulton, M. D. McLellan, K. Chen, D. Dooling,B. H. Dunford-Shore, S. McGrath, M. Hickenbotham, L. Cook, R. Abbott, D. E.Larson, D. C. Koboldt, C. Pohl, S. Smith, A. Hawkins, S. Abbott, D. Locke,L. W. Hillier, T. Miner, L. Fulton, V. Magrini, T. Wylie, J. Glasscock, J. Conyers,N. Sander, X. Shi, J. R. Osborne, P. Minx, D. Gordon, A. Chinwalla, Y. Zhao, R. E.Ries, J. E. Payton, P. Westervelt, M. H. Tomasson, M. Watson, J. Baty, J. Ivanovich,S. Heath, W. D. Shannon, R. Nagarajan, M. J. Walter, D. C. Link, T. A. Graubert,J. F. DiPersio, and R. K. Wilson, “DNA sequencing of a cytogenetically normalacute myeloid leukemia genome,” Nature, vol. 456, pp. 66–72, Nov. 2008.[71] E. R. Mardis, L. Ding, D. J. Dooling, D. E. Larson, M. D. McLellan, K. Chen,D. C. Koboldt, R. S. Fulton, K. D. Delehaunty, S. D. McGrath, L. A. Fulton, D. P.Locke, V. J. Magrini, R. M. Abbott, T. L. Vickery, J. S. Reed, J. S. Robinson,T. Wylie, S. M. Smith, L. Carmichael, J. M. Eldred, C. C. Harris, J. Walker,J. B. Peck, F. Du, A. F. Dukes, G. E. Sanderson, A. M. Brummett, E. Clark,J. F. McMichael, R. J. Meyer, J. K. Schindler, C. S. Pohl, J. W. Wallis, X. Shi,L. Lin, H. Schmidt, Y. Tang, C. Haipek, M. E. Wiechert, J. V. Ivy, J. Kalicki,G. Elliott, R. E. Ries, J. E. Payton, P. Westervelt, M. H. Tomasson, M. A. Watson,J. Baty, S. Heath, W. D. Shannon, R. Nagarajan, D. C. Link, M. J. Walter, T. A.Graubert, J. F. DiPersio, R. K. Wilson, and T. J. Ley, “Recurring MutationsFound by Sequencing an Acute Myeloid Leukemia Genome,” New England Journalof Medicine, vol. 361, no. 11, pp. 1058–1066, 2009.[72] T. Ley, C. Miller, L. Ding, B. Raphael, A. Mungall, A. Robertson, K. Hoadley,T. Triche, P. Laird, J. Baty, and Others, “Genomic and Epigenomic Landscapesof Adult De Novo Acute Myeloid Leukemia,” New England Journal of Medicine,vol. 368, no. 22, pp. 2059–2074, 2013.[73] F. E. Craig and K. A. Foon, “Flow cytometric immunophenotyping for hematologicneoplasms,” Blood, vol. 111, no. 8, p. 3941, 2008.[74] B. L. Wood, M. Arroz, D. Barnett, J. DiGiuseppe, B. Greig, S. J. Kussick,T. Oldaker, M. Shenkin, E. Stone, and P. Wallace, “2006 Bethesda InternationalConsensus recommendations on the immunophenotypic analysis of hematolymphoidneoplasia by flow cytometry: Optimal reagents and reporting for the flow cytometric116Bibliographydiagnosis of hematopoietic neoplasia,” Cytometry Part B: Clinical Cytometry,vol. 72, no. S1, pp. S14—-S22, 2007.[75] B. H. Davis, J. T. Holden, M. C. Bene, M. J. Borowitz, R. C. Braylan, D. Cornfield,W. Gorczyca, R. Lee, R. Maiese, A. Orfao, and Others, “2006 Bethesda Inter-national Consensus recommendations on the flow cytometric immunophenotypicanalysis of hematolymphoid neoplasia: Medical indications,” Cytometry Part B:Clinical Cytometry, vol. 72, no. S1, pp. S5—-S13, 2007.[76] J. J. M. van Dongen, L. Lhermitte, S. Bo¨ttcher, J. Almeida, V. H. J. van der Velden,J. Flores-Montero, A. Rawstron, V. Asnafi, Q. Le´crevisse, P. Lucio, E. Mejstrikova,T. Szczepan´ski, T. Kalina, R. de Tute, M. Bru¨ggemann, L. Sedek, M. Cullen,A. W. Langerak, A. Mendonc¸a, E. Macintyre, M. Martin-Ayuso, O. Hrusak, M. B.Vidriales, and A. Orfao, “EuroFlow antibody panels for standardized n-dimensionalflow cytometric immunophenotyping of normal, reactive and malignant leukocytes,”Leukemia, vol. 26, no. 9, pp. 1908–1975, 2012.[77] M. J. Borowitz, K. L. Guenther, K. E. Shults, and G. T. Stelzer, “Immunopheno-typing of acute leukemia by flow cytometric analysis: use of CD45 and right-anglelight scatter to gate on leukemic blasts in three-color analysis,” American journalof clinical pathology, vol. 100, no. 5, pp. 534–540, 1993.[78] R. O. Rainer, L. Hodges, and G. T. Seltzer, “CD 45 gating correlates with bonemarrow differential,” Cytometry, vol. 22, no. 2, pp. 139–145, 1995.[79] F. Lacombe, F. Durrieu, A. Briais, P. Dumain, F. Belloc, E. Bascans, J. Reiffers,M. R. Boisseau, and P. Bernard, “Flow cytometry CD45 gating for immunophe-notyping of acute myeloid leukemia,” Leukemia, vol. 11, no. 11, pp. 1878–1886,1997.[80] C. E. Pedreira, E. S. Costa, S. Barrena, Q. Lecrevisse, J. Almeida, J. J. M. vanDongen, A. Orfao, and Others, “Generation of flow cytometry data files witha potentially infinite number of dimensions,” Cytometry Part A, vol. 73, no. 9,pp. 834–846, 2008.[81] G. Lee, W. Finn, and C. Scott, “Statistical file matching of flow cytometry data,”Journal of Biomedical Informatics, 2011.[82] E. Zamir, B. Geiger, N. Cohen, Z. Kam, and B. Z. Katz, “Resolving and classifyinghaematopoietic bone-marrow cell populations by multi-dimensional analysis offlow-cytometry data,” British journal of haematology, vol. 129, no. 3, pp. 420–431,2005.[83] E. Falconer, M. Hills, U. Naumann, S. S. S. Poon, E. A. Chavez, A. D. Sanders,Y. Zhao, M. Hirst, and P. M. Lansdorp, “DNA template strand sequencing ofsingle-cells maps genomic rearrangements at high resolution,” Nature Methods,vol. 9, pp. 1107–1112, Nov. 2012.117Bibliography[84] E. Falconer and P. M. Lansdorp, “Strand-seq: A unifying tool for studies ofchromosome segregation,” Seminars in Cell and Developmental Biology, vol. 24,no. 8-9, pp. 643–652, 2013.[85] P. M. Lansdorp, “Immortal Strands? Give Me a Break,” Cell, vol. 129, pp. 1244–1247, June 2007.[86] J. Cairns, “Mutation selection and the natural history of cancer,” Nature, vol. 255,pp. 197–200, May 1975.[87] E. Sonoda, M. S. Sasaki, C. Morrison, Y. Yamaguchi-Iwai, M. Takata, andS. Takeda, “Sister Chromatid Exchanges Are Mediated by Homologous Recombi-nation in Vertebrate Cells,” Molecular and Cellular Biology, vol. 19, pp. 5166–5169,July 1999.[88] R. G. Langlois, W. L. Bigbee, R. H. Jensen, and J. German, “Evidence for increasedin vivo mutation and somatic recombination in Bloom’s syndrome.,” Proceedingsof the National Academy of Sciences of the United States of America, vol. 86,pp. 670–674, Jan. 1989.[89] R. A. Burrell, N. McGranahan, J. Bartek, and C. Swanton, “The causes andconsequences of genetic heterogeneity in cancer evolution,” Nature, vol. 501,pp. 338–345, Sept. 2013.[90] M. E. Pretorius, H. k. Waehre, V. M. Abeler, B. Davidson, L. Vlatkovic, R. A.Lothe, K.-E. Giercksky, and H. v. E. Danielsen, “Large scale genomic instabilityas an additive prognostic marker in early prostate cancer,” Cellular oncology: theofficial journal of the International Society for Cellular Oncology, vol. 31, no. 4,pp. 251–259, 2009.[91] A. Vincent-Salomon, V. Benhamo, E. Gravier, G. Rigaill, N. Gruel, S. Robin,Y. de Rycke, O. Mariani, G. Pierron, D. Gentien, F. Reyal, P. Cottu, A. Fourquet,R. Rouzier, X. Sastre-Garau, and O. Delattre, “Genomic Instability: A StrongerPrognostic Marker Than Proliferation for Early Stage Luminal Breast Carcinomas,”PLoS ONE, vol. 8, p. e76496, Oct. 2013.[92] R. K. R. Mettu, Y.-W. Wan, J. K. Habermann, T. Ried, and N. L. Guo, “A12-Gene Genomic Instability Signature Predicts Clinical Outcomes in MultipleCancer Types,” The International journal of biological markers, vol. 25, no. 4,pp. 219–228, 2010.[93] J. S. Wainscoat and M. F. Fey, “Assessment of clonality in human tumors: areview,” Cancer research, vol. 50, pp. 1355–1360, Mar. 1990.[94] J. Brewin, G. Horn, and T. Chevassut, “Genomic Landscapes and Clonality of DeNovo AML,” New England Journal of Medicine, vol. 369, no. 15, pp. 1472–1473,2013.118Bibliography[95] L. Ding, T. J. Ley, D. E. Larson, C. A. Miller, D. C. Koboldt, J. S. Welch, J. K.Ritchey, M. A. Young, T. Lamprecht, M. D. McLellan, J. F. McMichael, J. W.Wallis, C. Lu, D. Shen, C. C. Harris, D. J. Dooling, R. S. Fulton, L. L. Fulton,K. Chen, H. Schmidt, J. Kalicki-Veizer, V. J. Magrini, L. Cook, S. D. McGrath,T. L. Vickery, M. C. Wendl, S. Heath, M. A. Watson, D. C. Link, M. H. Tomasson,W. D. Shannon, J. E. Payton, S. Kulkarni, P. Westervelt, M. J. Walter, T. A.Graubert, E. R. Mardis, R. K. Wilson, and J. F. DiPersio, “Clonal evolution inrelapsed acute myeloid leukaemia revealed by whole-genome sequencing,” Nature,vol. 481, pp. 506–510, Jan. 2012.[96] M. Greaves and C. C. Maley, “Clonal Evolution in Cancer,” Nature, vol. 481,pp. 306–313, Jan. 2012.[97] S. Fro¨hling, S. Skelin, C. Liebisch, C. Scholl, R. F. Schlenk, H. Do¨hner, andK. Do¨hner, “Comparison of Cytogenetic and Molecular Cytogenetic Detectionof Chromosome Abnormalities in 240 Consecutive Adult Patients With AcuteMyeloid Leukemia,” Journal of Clinical Oncology, vol. 20, pp. 2480–2485, May2002.[98] S. C. Bendall, G. P. Nolan, M. Roederer, and P. K. Chattopadhyay, “A deepprofiler’s guide to cytometry,” Trends in Immunology, vol. 33, pp. 323–332, July2012.[99] F. Villanova, P. D. Meglio, M. Inokumad, N. Aghaeepour, E. Perucha, J. Mollon,L. Nomura, M. Hernandez-Fuentes, A. Copeh, T. Prevosti, S. Heck, V. Maino,G. Lord, R. R. Brinkman, and and Frank O. Nestle, “Integration of Lyoplate basedFlow Cytometry and Computational Analysis for Standardized ImmunologicalBiomarker Discovery,” PLoS ONE, vol. 8, no. 7, 2013.[100] F. Craig, R. Brinkman, E. Ten, and N. Aghaeepour, “Computational AnalysisOptimizes the Flow Cytometric Evaluation for Lymphoma,” Cytometry Part B -Clinical Cytometry, vol. Digital pr, 2013.[101] D. Eppstein, “Finding the k shortest paths,” SIAM Journal on computing, vol. 28,no. 2, pp. 652–673, 1998.[102] J. P. Vial and F. Lacombe, “Immunophenotyping of acute leukemia: Utility ofCD45 for blast cell identification,” Methods in cell biology, vol. 64, pp. 343–358,2001.[103] C. E. Pedreira, E. S. Costa, M. E. Arroyo, J. Almeida, and A. Orfao, “A multidi-mensional classification approach for the automated analysis of flow cytometrydata,” IEEE Transactions on Biomedical Engineering, vol. 55, no. 3, pp. 1155–1162,2008.[104] B. M. Bolstad, R. A. Irizarry, M. Astrand, and T. P. Speed, “A comparisonof normalization methods for high density oligonucleotide array data based onvariance and bias,” Bioinformatics, vol. 19, no. 2, p. 185, 2003.119Bibliography[105] G. K. Smyth, “limma: Linear Models for Microarray Data,” in Bioinformaticsand Computational Biology Solutions Using R and Bioconductor (R. Gentleman,V. J. Carey, W. Huber, R. A. Irizarry, and S. Dudoit, eds.), Statistics for Biologyand Health, pp. 397–420, Springer New York, Jan. 2005.[106] E. M. A. Hensor, L. Hunt, R. Parmar, A. Burska, P. Emery, and F. Ponchel, “A1.33Predicting the evolution of inflammatory arthritis in ACPA-positive individuals:can T-cell subsets help?,” Annals of the Rheumatic Diseases, vol. 73, pp. A14–A14,Mar. 2014.[107] N. T. Colburn, K. J. M. Zaal, F. Wang, and R. S. Tuan, “A role for γ/δ Tcells in a mouse model of fracture healing,” Arthritis and Rheumatism, vol. 60,pp. 1694–1703, June 2009.[108] S. M. Garrido, F. R. Appelbaum, C. L. Willman, and D. E. Banker, “Acutemyeloid leukemia cells are protected from spontaneous and drug-induced apoptosisby direct contact with a human bone marrow stromal cell line (HS-5),” ExperimentalHematology, vol. 29, pp. 448–457, Apr. 2001.[109] Y. Parel and C. Chizzolini, “CD4+ CD8+ double positive (DP) T cells in healthand disease,” Autoimmunity Reviews, vol. 3, pp. 215–220, Mar. 2004.[110] Y. Chen, X. S. Zhou, and T. Huang, “One-class SVM for learning in imageretrieval,” in International Conference on Image Processing, vol. 1, pp. 34–37 vol.1,2001.[111] C. Thiede, S. Koch, E. Creutzig, C. Steudel, T. Illmer, M. Schaich, and G. Ehninger,“Prevalence and prognostic impact of NPM1 mutations in 1485 adult patients withacute myeloid leukemia (AML),” Blood, vol. 107, no. 10, p. 4011, 2006.[112] M. Syampurnawati, E. Tatsumi, B. Ardianto, M. Takenokuchi, Y. Nakamachi,S. Kawano, S. Kumagai, K. Saigo, T. Matsui, T. Takahashi, and Others, “DRnegativity is a distinctive feature of M1/M2 AML cases with NPM1 mutation,”Leukemia research, vol. 32, no. 7, pp. 1141–1143, 2008.[113] B. Dalal, S. Mansoor, M. Manna, S. Pi, G. Sauro, and D. Hogge, “Detection ofCD34, TdT, CD56, CD2, CD4, and CD14 by Flow Cytometry Is Associated With¡i¿ NPM1¡/i¿ and¡ i¿ FLT3¡/i¿ Mutation Status in Cytogenetically Normal AcuteMyeloid Leukemia,” Clinical Lymphoma Myeloma and Leukemia, 2012.[114] A. Karatzoglou, A. Smola, K. Hornik, and A. Zeileis, “kernlab – An {S4} Packagefor Kernel Methods in {R},” Journal of Statistical Software, vol. 11, no. 9, pp. 1–20,2004.[115] R. Gaujoux and C. Seoighe, “A flexible R package for nonnegative matrix factor-ization,” BMC Bioinformatics, vol. 11, p. 367, July 2010.[116] L. Breiman, “Bagging Predictors,” Machine Learning, vol. 24, pp. 123–140, Aug.1996.120Bibliography[117] L. I. Shlush, S. Zandi, A. Mitchell, W. C. Chen, J. M. Brandwein, V. Gupta, J. A.Kennedy, A. D. Schimmer, A. C. Schuh, K. W. Yee, J. L. McLeod, M. Doedens,J. J. F. Medeiros, R. Marke, H. J. Kim, K. Lee, J. D. McPherson, T. J. Hudson,T. H. Pan-Leukemia Gene Panel Consortium, A. M. K. Brown, Q. M. Trinh,L. D. Stein, M. D. Minden, J. C. Y. Wang, and J. E. Dick, “Identification ofpre-leukaemic haematopoietic stem cells in acute leukaemia,” Nature, vol. 506,pp. 328–333, Feb. 2014.[118] L. M. Kelly and D. G. Gilliland, “Genetics of Myeloid Leukemias,” Annual Reviewof Genomics and Human Genetics, vol. 3, no. 1, pp. 179–198, 2002.[119] Y. Ishikawa, H. Kiyoi, A. Tsujimura, S. Miyawaki, Y. Miyazaki, K. Kuriyama,M. Tomonaga, and T. Naoe, “Comprehensive analysis of cooperative gene muta-tions between class I and class II in de novo acute myeloid leukemia,” EuropeanJournal of Haematology, vol. 83, pp. 90–98, Aug. 2009.[120] Z. Li, X. Cai, C.-L. Cai, J. Wang, W. Zhang, B. E. Petersen, F.-C. Yang, andM. Xu, “Deletion of Tet2 in mice leads to dysregulated hematopoietic stem cells andsubsequent development of myeloid malignancies,” Blood, vol. 118, pp. 4509–4518,Oct. 2011.[121] F. Albano, A. Mestice, A. Pannunzio, F. Lanza, B. Martino, D. Pastore, F. Ferrara,P. Carluccio, F. Nobile, G. Castoldi, V. Liso, and G. Specchia, “The biologicalcharacteristics of CD34+ CD2+ adult acute promyelocytic leukemia and the CD34CD2 hypergranular (M3) and microgranular (M3v) phenotypes,” Haematologica,vol. 91, pp. 311–316, Jan. 2006.[122] B. Falini, K. Macijewski, T. Weiss, U. Bacher, S. Schnittger, W. Kern, A. Kohlmann,H. Klein, M. Vignetti, A. Piciocchi, and Others, “Multilineage dysplasia has noimpact on biologic, clinicopathologic, and prognostic features of AML with mutatednucleophosmin (NPM1),” Blood, vol. 115, no. 18, pp. 3776–3786, 2010.[123] A. B. Olshen, E. S. Venkatraman, R. Lucito, and M. Wigler, “Circular binarysegmentation for the analysis of array-based DNA copy number data,” Biostatistics,vol. 5, pp. 557–572, Oct. 2004.[124] V. E. Seshan and A. Olshen, DNAcopy: DNA copy number data analysis, 2013.[125] N. Nagarajan, C. Cook, M. D. Bonaventura, H. Ge, A. Richards, K. A. Bishop-Lilly,R. DeSalle, T. D. Read, and M. Pop, “Finishing genomes with limited resources:lessons from an ensemble of microbial genomes,” BMC Genomics, vol. 11, p. 242,Apr. 2010.[126] W. J. Kent, C. W. Sugnet, T. S. Furey, K. M. Roskin, T. H. Pringle, A. M. Zahler,and D. Haussler, “The Human Genome Browser at UCSC,” Genome Research,vol. 12, pp. 996–1006, June 2002.121Bibliography[127] L. R. Meyer, A. S. Zweig, A. S. Hinrichs, D. Karolchik, R. M. Kuhn, M. Wong,C. A. Sloan, K. R. Rosenbloom, G. Roe, B. Rhead, B. J. Raney, A. Pohl, V. S.Malladi, C. H. Li, B. T. Lee, K. Learned, V. Kirkup, F. Hsu, S. Heitner, R. A.Harte, M. Haeussler, L. Guruvadoo, M. Goldman, B. M. Giardine, P. A. Fujita,T. R. Dreszer, M. Diekhans, M. S. Cline, H. Clawson, G. P. Barber, D. Haussler,and W. J. Kent, “The UCSC Genome Browser database: extensions and updates2013,” Nucleic Acids Research, vol. 41, pp. D64–D69, Jan. 2013.[128] G. R. Sutherland, E. Baker, and R. S. Seshadri, “Heritable fragile sites on hu-man chromosomes. V. A new class of fragile site requiring BrdU for expression.,”American Journal of Human Genetics, vol. 32, pp. 542–548, July 1980.[129] W. Chen, V. Kalscheuer, A. Tzschach, C. Menzel, R. Ullmann, M. H. Schulz,F. Erdogan, N. Li, Z. Kijas, G. Arkesteijn, I. L. Pajares, M. Goetz-Sothmann,U. Heinrich, I. Rost, A. Dufke, U. Grasshoff, B. Glaeser, M. Vingron, and H. H. Rop-ers, “Mapping translocation breakpoints by next-generation sequencing,” GenomeResearch, vol. 18, pp. 1143–1149, July 2008.[130] M. Hunt, C. Newbold, M. Berriman, and T. D. Otto, “A comprehensive evaluationof assembly scaffolding tools,” Genome Biology, vol. 15, p. R42, Mar. 2014.[131] W. Bateson, E. R. Saunders, and R. C. Punnett, “Report II. Experimental studiesin the physiology of heredity,” tech. rep., Royal Society, 1904.[132] N. G. Copeland and N. A. Jenkins, “Development and applications of a moleculargenetic linkage map of the mouse genome,” Trends in Genetics, vol. 7, pp. 113–118,Apr. 1991.[133] Z. Huang, “A Fast Clustering Algorithm to Cluster Very Large Categorical DataSets in Data Mining,” in Research Issues on Data Mining and Knowledge Discovery,pp. 1–8, 1997.[134] Z. S. Qin, “Clustering microarray gene expression data using weighted Chineserestaurant process,” Bioinformatics, vol. 22, pp. 1988–1997, Aug. 2006.[135] BOOST C++ Libraries, 2014.[136] Intel, “Intel SSE4 Programming Reference,” Tech. Rep. July, Intel, 2007.[137] M. Maechler, P. Rousseeuw, A. Struyf, M. Hubert, and K. Hornik, “cluster: ClusterAnalysis Basics and Extensions,” 2014.[138] D. Applegate and R. Bixby, “On the Solution of Traveling Salesman Problems,”Documenta Mathematica, vol. Proceeding, pp. 645–656, 1998.[139] G. A. Croes, “A Method for Solving Traveling-Salesman Problems,” OperationsResearch, vol. 6, pp. 791–812, Dec. 1958.[140] D. Haussler, S. J. O’Brien, O. A. Ryder, F. K. Barker, M. Clamp, A. J. Crawford,R. Hanner, O. Hanotte, W. E. Johnson, J. A. McGuire, and Others, “Genome10K: a proposal to obtain whole-genome sequence for 10,000 vertebrate species.,”The Journal of heredity, vol. 100, no. 6, pp. 659–674, 2008.122Bibliography[141] E. T. Lam, A. Hastie, C. Lin, D. Ehrlich, S. K. Das, M. D. Austin, P. Desh-pande, H. Cao, N. Nagarajan, M. Xiao, and P.-Y. Kwok, “Genome mappingon nanochannel arrays for structural variation analysis and sequence assembly,”Nature Biotechnology, vol. 30, pp. 771–776, Aug. 2012.[142] J. N. Burton, A. Adey, R. P. Patwardhan, R. Qiu, J. O. Kitzman, and J. Shendure,“Chromosome-scale scaffolding of de novo genome assemblies based on chromatininteractions,” Nature Biotechnology, vol. 31, pp. 1119–1125, Dec. 2013.[143] M. Mascher, G. J. Muehlbauer, D. S. Rokhsar, J. Chapman, J. Schmutz, K. Barry,M. Mun˜oz Amatria´ın, T. J. Close, R. P. Wise, A. H. Schulman, A. Himmelbach,K. F. Mayer, U. Scholz, J. A. Poland, N. Stein, and R. Waugh, “Anchoring andordering NGS contig assemblies by population sequencing (POPSEQ),” The PlantJournal, vol. 76, pp. 718–727, Nov. 2013.[144] N. McCarthy, “Mouse Models: Closer than you think,” Nature Reviews Cancer,vol. 9, pp. 382–383, June 2009.[145] E. B. Kim, X. Fang, A. A. Fushan, Z. Huang, A. V. Lobanov, L. Han, S. M.Marino, X. Sun, A. A. Turanov, P. Yang, S. H. Yim, X. Zhao, M. V. Kasaikina,N. Stoletzki, C. Peng, P. Polak, Z. Xiong, A. Kiezun, Y. Zhu, Y. Chen, G. V.Kryukov, Q. Zhang, L. Peshkin, L. Yang, R. T. Bronson, R. Buffenstein, B. Wang,C. Han, Q. Li, L. Chen, W. Zhao, S. R. Sunyaev, T. J. Park, G. Zhang, J. Wang,and V. N. Gladyshev, “Genome sequencing reveals insights into physiology andlongevity of the naked mole rat,” Nature, vol. 479, pp. 223–227, Nov. 2011.[146] S. Holm, “A simple sequentially rejective multiple test procedure,” ScandinavianJournal of Statistics, vol. 6, no. 2, pp. 65–70, 1979.[147] Y. Benjamini and Y. Hochberg, “Controlling the false discovery rate: a practicaland powerful approach to multiple testing,” Journal of the Royal Statistical Society.Series B (Methodological), vol. 57, no. 1, pp. 289–300, 1995.[148] J. Spidlen, R. C. Leif, W. Moore, M. Roederer, International Society for theAdvancement of Cytometry Data Standards Task Force, and R. R. Brinkman,“Gating-ML: XML-based gating descriptions in flow cytometry,” Cytometry. PartA: the journal of the International Society for Analytical Cytology, vol. 73A,pp. 1151–1157, Dec. 2008.[149] W. R. Overton, “Modified histogram subtraction technique for analysis of flowcytometry data,” Cytometry, vol. 9, no. 6, pp. 619–626, 1988.[150] B. Bagwell, “A journey through flow cytometric immunofluorescence analy-ses—Finding accurate and robust algorithms that estimate positive fraction distri-butions,” Clinical Immunology Newsletter, vol. 16, pp. 33–37, Mar. 1996.[151] F. Lampariello, “Evaluation of the number of positive cells from flow cytometricimmunoassays by mathematical modeling of cellular autofluorescence,” Cytometry,vol. 15, no. 4, pp. 294–301, 1994.123Bibliography[152] A. J. Richards, J. Staats, J. Enzor, K. McKinnon, J. Frelinger, T. N. Denny, K. J.Weinhold, and C. Chan, “Setting objective thresholds for rare event detection inflow cytometry.,” Journal of immunological methods, Apr. 2014.[153] A. Roth, J. Khattra, D. Yap, A. Wan, E. Laks, J. Biele, G. Ha, S. Aparicio,A. Bouchard-Coˆte´, and S. P. Shah, “PyClone: statistical inference of clonalpopulation structure in cancer,” Nature Methods, vol. 11, pp. 396–398, Apr. 2014.[154] S. A. Chowdhury, S. E. Shackney, K. Heselmeyer-Haddad, T. Ried, A. A. Scha¨ffer,and R. Schwartz, “Phylogenetic analysis of multiprobe fluorescence in situ hybridiza-tion data from tumor cell populations,” Bioinformatics, vol. 29, pp. i189–i198,July 2013.[155] S. Sabba and S. Chikhi, “Integrating the best 2-Opt method to enhance the geneticalgorithm execution time in solving the traveler salesman problem,” in ComplexSystems and Dependability, pp. 195–208, Springer, 2012.124


Citation Scheme:


Citations by CSL (citeproc-js)

Usage Statistics



Customize your widget with the following options, then copy and paste the code below into the HTML of your page to embed this item in your website.
                            <div id="ubcOpenCollectionsWidgetDisplay">
                            <script id="ubcOpenCollectionsWidget"
                            async >
IIIF logo Our image viewer uses the IIIF 2.0 standard. To load this item in other compatible viewers, use this url:


Related Items