{"@context":{"@language":"en","Affiliation":"http:\/\/vivoweb.org\/ontology\/core#departmentOrSchool","AggregatedSourceRepository":"http:\/\/www.europeana.eu\/schemas\/edm\/dataProvider","Campus":"https:\/\/open.library.ubc.ca\/terms#degreeCampus","Creator":"http:\/\/purl.org\/dc\/terms\/creator","DateAvailable":"http:\/\/purl.org\/dc\/terms\/issued","DateIssued":"http:\/\/purl.org\/dc\/terms\/issued","Degree":"http:\/\/vivoweb.org\/ontology\/core#relatedDegree","DegreeGrantor":"https:\/\/open.library.ubc.ca\/terms#degreeGrantor","Description":"http:\/\/purl.org\/dc\/terms\/description","DigitalResourceOriginalRecord":"http:\/\/www.europeana.eu\/schemas\/edm\/aggregatedCHO","FullText":"http:\/\/www.w3.org\/2009\/08\/skos-reference\/skos.html#note","Genre":"http:\/\/www.europeana.eu\/schemas\/edm\/hasType","GraduationDate":"http:\/\/vivoweb.org\/ontology\/core#dateIssued","IsShownAt":"http:\/\/www.europeana.eu\/schemas\/edm\/isShownAt","Language":"http:\/\/purl.org\/dc\/terms\/language","Program":"https:\/\/open.library.ubc.ca\/terms#degreeDiscipline","Provider":"http:\/\/www.europeana.eu\/schemas\/edm\/provider","Publisher":"http:\/\/purl.org\/dc\/terms\/publisher","Rights":"http:\/\/purl.org\/dc\/terms\/rights","RightsURI":"https:\/\/open.library.ubc.ca\/terms#rightsURI","ScholarlyLevel":"https:\/\/open.library.ubc.ca\/terms#scholarLevel","Title":"http:\/\/purl.org\/dc\/terms\/title","Type":"http:\/\/purl.org\/dc\/terms\/type","URI":"https:\/\/open.library.ubc.ca\/terms#identifierURI","SortDate":"http:\/\/purl.org\/dc\/terms\/date"},"Affiliation":[{"@value":"Science, Faculty of","@language":"en"},{"@value":"Computer Science, Department of","@language":"en"}],"AggregatedSourceRepository":[{"@value":"DSpace","@language":"en"}],"Campus":[{"@value":"UBCV","@language":"en"}],"Creator":[{"@value":"Zare, 
Habil","@language":"en"}],"DateAvailable":[{"@value":"2011-12-13T18:27:19Z","@language":"en"}],"DateIssued":[{"@value":"2011","@language":"en"}],"Degree":[{"@value":"Doctor of Philosophy - PhD","@language":"en"}],"DegreeGrantor":[{"@value":"University of British Columbia","@language":"en"}],"Description":[{"@value":"Flow cytometry has many applications in clinical medicine and biological research. For many modern applications, traditional methods of manual data interpretation are not efficient due to the large amount of complex, high dimensional data.\nIn this thesis, I discuss some of the important challenges towards automatic analysis of flow cytometry data and propose my solutions. To validate my approach on addressing real life problems, I developed an automatic pipeline for analyzing flow cytometry data and applied it to clinical data. My pipeline can potentially be useful for improving quality check on diagnosis, assisting discovery of novel phenotypes, and making clinical recommendations.\nFurthermore, some of the challenges that I studied are rooted in more general areas of computer science, and therefore, the tools and techniques that I developed can be applied to a wider range of problems in data mining and machine learning. 
Enhancement to spectral clustering algorithm and proposing a novel scheme for scoring features are two examples of my contributions to computer science that were developed as part of this thesis.","@language":"en"}],"DigitalResourceOriginalRecord":[{"@value":"https:\/\/circle.library.ubc.ca\/rest\/handle\/2429\/39660?expand=metadata","@language":"en"}],"FullText":[{"@value":"Automatic Analysis of Flow Cytometry Data and its Application to Lymphoma Diagnosis by Habil Zare Bachelor of Mathematics, Sharif University of Technology, 2002 Master of Computer Science, Sharif University of Technology, 2004 a thesis submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in the faculty of graduate studies (Computer Science) The University Of British Columbia (Vancouver) December 2011 \u00a9 Habil Zare, 2011 Abstract Flow cytometry has many applications in clinical medicine and biological research. For many modern applications, traditional methods of manual data interpretation are not efficient due to the large amount of complex, high dimensional data. In this thesis, I discuss some of the important challenges towards automatic analysis of flow cytometry data and propose my solutions. To validate my approach on addressing real life problems, I developed an automatic pipeline for analyzing flow cytometry data and applied it to clinical data. My pipeline can potentially be useful for improving quality check on diagnosis, assisting discovery of novel phenotypes, and making clinical recommendations. Furthermore, some of the challenges that I studied are rooted in more general areas of computer science, and therefore, the tools and techniques that I developed can be applied to a wider range of problems in data mining and machine learning. Enhancement to spectral clustering algorithm and proposing a novel scheme for scoring features are two examples of my contributions to computer science that were developed as part of this thesis. 
Preface This study was approved by UBC-BCCA Research Ethics Board and the Institutional Review Boards (IRB) file number is H08-00667. All data used in this study is available through BC Cancer Agency. I contributed to the open source community by publicly releasing two R packages (i.e., SamSPECTRAL and FeaLect) available for download from Bioconductor (www.bioconductor.org\/packages\/devel\/bioc\/html\/SamSPECTRAL.html) and the Comprehensive R Archive Network (CRAN) (www.cran.rproject.org\/web\/packages\/FeaLect\/), respectively. SamSPECTRAL methodology (Chapter 2) was published in BMC Bioinformatics as an original article entitled: Data reduction for spectral clustering to analyze high throughput flow cytometry data, BMC Bioinformatics 2010,11:403 [165]. Parisa Shooshtari and I were the first co-authors and conducted the study together. Ryan Brinkman provided data and computing facilities. Arvind Gupta studied the convergence of faithful sampling. All authors participated in writing the paper by reading, editing and approving the final draft. The results presented in Section 2.3.4 were submitted for consideration for publication as an original article [2]. My contribution included performing experiments by SamSPECTRAL and partially writing the manuscript. Nima Aghaeepour computed F-measure and compared the methods. My novel methodology for feature selection (Chapter 3) is described in a manuscript to be submitted for consideration for publication. I will be the first author among: H. Zare, G. Haffari, A. Gupta, R.R. Brinkman. I designed the novel feature scoring scheme useful for feature selection (90%), implemented the algorithm (95%), developed a mathematical framework to explain the performance of the method (70%), theoretically proved its efficiency (80%), performed experiments (85%) and wrote the manuscript (70%). My results on MCL vs. 
SLL study (Chapter 4.2) were submitted for consideration for publication as an original article: Zare H., Bashashati A., R. Kridel, Aghaeepour N., G. Haffari, Connors J, Gupta A., Gascoyne R., Brinkman R, Weng A., \"Automated analysis of multidimensional flow cytometry data improves diagnostic accuracy between mantle cell lymphoma and small lymphocytic lymphoma\". I was the first author and designed the pipeline (80%), implemented the algorithms (95%), analyzed data (70%), and discovered novel phenotypes (95%). Table of Contents Abstract; Preface; Table of Contents; List of Tables; List of Figures; Glossary; Acknowledgements; Dedication; 1 Background; 1.1 Flow Cytometry; 1.1.1 Applications of Flow Cytometry; 1.2 Computational Challenges of Flow Cytometry Data Analysis; 1.3 Unsupervised Techniques for Cell Population Identification; 1.3.1 Spectral Clustering; 1.4 Feature Selection; 1.4.1 Regularization Techniques; 1.4.2 Previous Work to Improve \u21131-regularization; 1.4.3 Lasso and Bolasso; 1.5 Train-Test Approach; 1.6 Lymphoma; 2 SamSPECTRAL; 2.1 Faithful Sampling; 2.1.1 Mathematical Formulation; 2.2 Enhancement to Spectral Clustering; 2.2.1 Spectral Clustering; 2.2.2 Data Reduction Scheme; 2.2.3 Similarity Matrix; 2.2.4 Number of Clusters; 2.2.5 Combining Clusters; 2.2.6 Overview of SamSPECTRAL Algorithm; 2.2.7 Modified Markov Clustering Algorithm (MCL); 2.2.8 Testing on Real Data; 2.3 Further Experiments; 2.3.1 Resolution; 2.3.2 Synthetic Data; 2.3.3 Faithful and Uniform Sampling; 2.3.4 SamSPECTRAL in FlowCAP; 2.4 Performance of SamSPECTRAL; 2.5 Implementation; 2.6 Conclusions; 
3 Feature Scoring; 3.1 Introduction; 3.1.1 My Contributions; 3.2 Feature Scoring and Mathematical Analysis; 3.2.1 The Algorithm; 3.2.2 The Analysis; 3.3 Experiment with Real Data; 3.3.1 Data Preparation and Feature Extraction; 3.3.2 Feature Selection and Classification; 3.3.3 Additional Real Datasets from UCI Repository; 3.4 Conclusion; 4 Application to Lymphoma Diagnosis; 4.1 Automatic Methodology; 4.1.1 Data Preparation and Feature Extraction; 4.1.2 Feature Selection and Classification; 4.2 Discriminating Between Small Lymphocytic and Mantle Cell Lymphomas; 4.2.1 Overview; 4.3 Clinical Motivation; 4.4 Design and Methods; 4.4.1 Patient Samples; 4.4.2 Pathologic Classification; 4.4.3 Flow Cytometry Data; 4.5 Computational Methodology Adjusted for MCL vs. SLL Study; 4.5.1 Setting the Thresholds; 4.5.2 Dead Cell Removal; 4.5.3 Identification of Cell Populations; 4.6 Definitional Criteria for Positive\/Negative Marker Expression; 4.7 Results; 4.7.1 An Observation Motivating Computing Ratios; 4.7.2 Developing a Diagnostic Predictor; 4.7.3 Testing and Validation of the Diagnostic Predictor; 4.7.4 Sensitivity to Thresholds; 4.7.5 Best Ratio vs. Other Markers; 4.8 Discussion; 4.9 Reproducibility; 5 Conclusions; 5.1 Summary; 5.2 Further Work; Bibliography; A SamSPECTRAL Package Manual; A.1 Introduction; A.2 How to Run SamSPECTRAL?; A.2.1 An Example; A.3 Adjusting Parameters; A.3.1 Example; B Supplementary Table A; C Validating MCL vs. SLL Study on More Samples; C.1 First Time Frame; C.2 Second Time Frame; C.3 Third Time Frame; C.4 Fourth Time Frame; C.5 Fifth Time Frame; D Cluster Matching. 
List of Tables Table 2.1 Compiled results of the FlowCAP competition; Table 3.1 Comparison of area under the ROC curve between FeaLect, lars, and Bolasso on six different datasets; Table 4.1 Overall performance of my automatic pipeline for analyzing lymphoma dataset; Table 4.2 Parameters used for identifying cell populations by SamSPECTRAL; Table 4.3 Distribution of cases among \"standard\" vs \"non-standard\" immunophenotypes and Combined Ratio Scores; Table 4.4 Discriminative values on test set; Table B.1 Clinical data and FCM features for all cases investigated in MCL vs. SLL study; Table C.1 Discriminative values based on all 114 studies cases. 
List of Figures Figure 2.1 Data reduction scheme; Figure 2.2 Faithful sampling; Figure 2.3 Defining the similarity between two communities and identifying the number of clusters; Figure 2.4 Comparative clustering of the telomere dataset; Figure 2.5 Comparative clustering of dead cells (PI positive) and live cells (PI negative) in the viability data; Figure 2.6 Comparative clustering of the GvHD dataset; Figure 2.7 Comparative identification of a low density population surrounded by much denser populations in the stem cell data set; Figure 2.8 Rare population in the stem cell data set; Figure 2.9 Performance of SamSPECTRAL on synthetic data; Figure 2.10 Comparing uniform sampling with faithful sampling; Figure 3.1 FeaLect; Figure 3.2 Total feature scores in the log-scale; Figure 3.3 Variation of area under the ROC curve when different number of features are used; Figure 3.4 Improvements in the area under the ROC curves by increasing the number of training samples; Figure 3.5 Improvements in the area under the ROC curves by increasing the number of training samples; Figure 4.1 Schematic of my automated analysis procedure; Figure 4.2 Feature extraction; Figure 4.3 Scoring features; Figure 4.4 Probabilities; Figure 4.5 Setting of thresholds; Figure 4.6 Dead cell removal; Figure 4.7 Contradiction in immunophenotypic signatures; Figure 4.8 Example typical and atypical cases; Figure 4.9 Discriminative value of individual markers typically utilized for diagnosis of MCL vs. SLL; Figure 4.10 Most discriminative fluorescence ratios for MCL vs. SLL; Figure 4.11 Comparing intensities for 274 cases; Figure 4.12 Combined Ratio Scores obtained for 114 (44 MCL and 70 SLL) cases; Figure 4.13 Incorporating CD20\/CD23 ratio with a third marker to diagnose borderline cases; Figure 4.14 sIg expression for cases with borderline CD20\/CD23 ratios; Figure 4.15 ROC curves; Figure C.1 Discriminative value of individual immunophenotypic markers. Cases from the first time period; Figure C.2 Discriminative value of binary ratios. Cases from the first time period; Figure C.3 Discriminative value of individual immunophenotypic markers. Cases from the second time period; Figure C.4 Discriminative value of binary ratios. Cases from the second time period; Figure C.5 Discriminative value of individual immunophenotypic markers. Cases from the third time period; Figure C.6 Discriminative value of binary ratios. Cases from the third time period; Figure C.7 Discriminative value of individual immunophenotypic markers. Cases from the fourth time period; Figure C.8 Discriminative value of binary ratios. Cases from the fourth time period; Figure C.9 Discriminative value of individual immunophenotypic markers. Cases from the fifth time period; Figure C.10 Discriminative value of binary ratios. Cases from the fifth time period; Figure D.1 Cluster matching for tube (CD5-CD19-CD3); Figure D.2 Cluster matching for tube (CD10-CD11c-CD20); Figure D.3 Cluster matching for tube (FMC7-CD23-CD19); Figure D.4 Cluster matching for tube (CD7-CD4-CD8); Figure D.5 Cluster matching for tube (CD45-CD14-CD19); Figure D.6 Cluster matching for tube (kappa-lambda-CD19). 
Glossary dna: deoxyribonucleic acid; rna: ribonucleic acid; b-clpd: B-cell chronic lymphoproliferative disorders; dlbcl: diffuse large B-cell lymphoma; foll: follicular lymphoma; mcl: mantle cell lymphoma; sll: small lymphocytic lymphoma; mzl: marginal zone lymphoma; lbl: lymphoblastic lymphoma; ptcl: peripheral T-cell lymphoma; aild: angioimmunoblastic lymphoma; gvhd: graft-versus-host-disease; hiv: human immunodeficiency virus; fcm: flow cytometry; mfi: mean fluorescent intensity; fsc: forward scatter; ssc: side scatter; fish: fluorescence in situ hybridization; ihc: immunohistochemistry; roc: receiver operating characteristic; auc: area under the curve; mise: minimizing the integral of the squared error; who: The World Health Organization; cran: The Comprehensive R Archive Network; flowcap: flow cytometry-critical assessment of population identification methods. 
Acknowledgements I would like to thank my supervisors, Dr. Arvind Gupta for his encouragement to work and for providing me with financial support, Dr. Ryan R. Brinkman for his help in providing data, conducting studies and writing manuscripts, as well as his useful corrections and comments on the current dissertation, Dr. Andrew Weng for motivating and guiding the MCL vs. SLL study, and Drs. 
Gabor Tardosh and Valentine Kabanets for their helpful discussions on the theoretical foundations and mathematical problems that occurred during my research. I also appreciate Dr. Amir Daneshgar, my MSc supervisor, for providing me with an invaluable insight into spectral graph theory, deep enough to move the boundary of science forward and develop SamSPECTRAL algorithm. Besides, I acknowledge my parents and my other family members who have always been overwhelmingly supportive of my research and endless studies. On behalf of the authors of our paper describing SamSPECTRAL algorithm [165], I would like to thank Adrian Cortes, Connie Eaves, Peter Lansdorp and Keith Humphries for providing data, Bari Zahedi and Irma Vulto for their biological insight and manual data analysis, Josef Spidlen for assistance in running FLAME, Aaron Barsky and Nima Aghaeepour for their editorial comments, and Mani Ranjbar for programming guidance. This work was supported by NIH grants 1R01EB008400 and 1R01EB005034, the Michael Smith Foundation for Health Research, the National Science and Engineering Research Council and the MITACS Network of Centres of Excellence. My study on lymphoma dataset (Chapters 4 and 4.2) could not be accomplished without the data that was kindly provided by BC Cancer Agency (BCCA). In particular, I would like to acknowledge the Hematopathology staff at BCCA (Drs. D. Banerjee, M. Chhanabhai, M. Hayes, A. Karsan, B. Skinnider, and G. Slack) for expert pathologic diagnoses, and the BCCA Clinical Flow Cytometry Laboratory staff for their excellent technical work. The MCL vs. SLL study (Chapter 4.2) was supported by funding from NSERC, MITACS, Canadian Cancer Society grant 700374, NIH-NIBIB grant EB008400, CIHR grant 94132, the Terry Fox Foundation, and the Terry Fox Research Institute. 
BCCA received an unrestricted grant from Hoffman-LaRoche that was used to support research on the integration of PET scanning into lymphoma management. Ryan R. Brinkman and Andrew P. Weng were supported by MSFHR. Dedication To my parents who sacrificed much to allow me to enjoy the luxury of an academic life. Chapter 1 Background Analyzing large size data creates computational challenges that can be mathematically interesting. Many real life applications require challenging extensive data analysis including: natural language processing, computer vision, bioinformatics, web mining, and analysis of social networks. For example, recent biological experiments produce large data sets that provide good examples of such challenges [128]. While many data mining and machine learning techniques have been developed to address some of these challenges [122], there are still many open problems that are interesting from both points of view: computer science and biological applications. 1.1 Flow Cytometry Flow cytometry is the science of measuring physical, chemical and biological characteristics of individual cells [70]. In cytometers, cells are passed through a laser beam one by one and the light that emerges from them is captured to measure up to 19 characteristics of each cell simultaneously [121]. As the speed of passing cells is so high (several thousands of cells per second), cytometers can generate rich, informative data from a huge number of cells. 1.1.1 Applications of Flow Cytometry Flow cytometry has many applications in molecular and cell biology for both diagnosis and research purposes. In cell biology, fluorescently tagged antibodies are used to mark the cells that express desired intracellular or cell surface proteins. In molecular biology, it provides information on specific characteristics of cells such as their deoxyribonucleic acid (dna), ribonucleic acid (rna) or protein content. 
For instance, cells containing a specific dna sequence can be marked by an appropriate dna-binding dye such as SYBR Green I [46], and then passed through a cytometer [114]. Since in the cytometer, cells are studied one by one, dna content per cell is measured accordingly [35, 115, 159]. Therefore, both surface and intracellular markers can be analyzed in a flow cytometry (fcm) analysis and cells with similar molecular content form cell populations that can be identified and separated for further investigations. Some clinical and research applications of fcm include: monitoring the course and treatment of human immunodeficiency virus (hiv), diagnosis of lymphoma and leukemia, peripheral blood hematopoietic stem cell study, investigating cellular signature definition for acute graft-versus-host-disease (gvhd) [109], vaccine trials [140], stem cell research [124], immunophenotypic characterization of B-cell chronic lymphoproliferative disorders (b-clpd) [16, 118], distinguishing cancer stem cells, hematopoietic stem cell transplantation, detection of fetal cells in maternal blood, and malaria diagnosis [40]. Lymphoma Diagnosis Lymphoma is a cancer that begins in the lymphatic cells of the immune system and presents as a solid tumor of lymphoid cells [98]. Just as cancer represents many different diseases, lymphoma represents many different cancers of lymphocytes [74]. Lymphoma has many types and subtypes and 13 major neoplasms were recognized in The World Health Organization (who) classification. To distinguish them accurately, several clinical tests are used, including immunocytochemistry and gene rearrangement investigation [144, 164]. Flow cytometry is an indispensable test for differential diagnosis of lymphomas [70, 130]. In spite of its widespread application, different sets of markers are used by different pathologists, which complicates comparison between data obtained from different institutions in the world. 
Another challenge in fcm data analysis for lymphoma diagnosis is the relatively high number of markers used to differentiate between various types and sub-types of lymphoma, which results in high dimensional data. 1.2 Computational Challenges of Flow Cytometry Data Analysis Analyzing FCM data is complicated because it consists of large size data; a single sample may contain up to 1 million events with 19 dimensions [121]. Furthermore, in most tests, more than one sample is required per patient. Therefore, sophisticated data mining techniques are required. While state of the art approaches for automating FCM data analysis have adopted various tools and techniques from machine learning and statistics literature, there are important open problems that are motivating further research including: clustering, cluster matching, feature extraction, feature selection and classification [16]. Challenge 1 (Clustering: to identify cell populations in FCM data). Automated identification of FCM cell populations is complicated by overlapping and adjacent populations, especially when low and high density populations are close to each other. Analyzing such data requires clustering methods that can separate these populations. My hypothesis is that spectral clustering can provide satisfactory results. Challenge 2 (Cluster Matching: to label the clusters across different patients for the same FCM test). Due to biological variation, populations may move across patients resulting in difficulties in matching clusters appropriately. Some approaches propose labeling the clusters based on their median or mean fluorescent intensity (mfi) [80]. Clustering the centers of the populations has been applied before, relying on model-based clustering. My hypothesis is that when the number of samples is limited, non-parametric methods such as spectral clustering can lead to more robust cluster labeling by avoiding the errors caused by fitting a distribution to the data. 
If the setting of the instruments changes considerably, then data normalization might be useful to obtain more robust results [66, 82]. However, care should be taken in applying normalization techniques because some biological information may be lost by normalization leading to poor results in some applications. Challenge 3 (Feature Extraction: to define appropriate features that represent biological information in fcm data). In many research assays, the complete set of all appropriate features is not determined in advance because the underlying biology is not known. The following characteristics of fcm data have been conventionally proposed to be considered as features: frequency of a single population (number of cells of a population divided by the total number of cells in the sample), relative size of two different populations, and location of mean or median of populations that is also known as mfi by clinicians [47]. However, no objective rule has been established for preferring any of the above criteria in a specific study. I hypothesize that considering all of these features simultaneously guarantees obtaining the optimum results. Challenge 4 (Feature Selection: to determine the most appropriate features for predictive analysis and biological investigations). In applying supervised machine learning techniques, a high number of features can result in overfitting the model because the number of available patients is normally much smaller than the number of features. Furthermore, in biological and medical studies, it is desired to distinguish the most informative features and feature selection is a critical step towards properly focusing biological investigations on novel phenotype discovery. My hypothesis was that applying Lasso on the features with the highest scores can prevent the model from overfitting and provide reliable results. 
However, when I applied this regularization technique to real data, it turned out that some clinically useful features were excluded from the model. Therefore, I performed statistical analysis on features to compute a score that determines the most informative features effectively. Challenge 5 (Classification: e.g. to predict type of a lymphoma case based on FCM data). A typical application of automatic fcm data analysis is to build a classifier in order to predict the diagnosis of a patient based on their clinical data. In applying general machine learning techniques that rely on statistical analysis, \"enough\" training samples should be obtained. The ideal number of samples normally depends on the number of features; however, the number of training samples available for this study was limited by the number of lymphoma patients diagnosed in British Columbia Cancer Agency. Such limitations can potentially complicate the learning stage and sophisticated techniques should be applied carefully to prevent the model from overfitting. 1.3 Unsupervised Techniques for Cell Population Identification A classical approach for analyzing biological data is to first group individual data points based on some similarity criterion, a process known as clustering, and then, compare the outcome of clustering with the biological hypotheses. In contrast to the dramatic evolution of the FCM hardware, clustering analysis is still usually accomplished by sequential manual partitioning (a.k.a. gating) of cell events into populations through visual inspection of plots one or two dimensions at a time. Many problems have been noted with this approach to FCM data analysis, including its subjective and time consuming nature, and the difficulty of effectively analyzing high dimensional data [93, 150]. 
Automated identification of FCM cell populations is complicated by overlapping and adjacent populations, especially when low and high density populations are close to each other. Analyzing such data requires clustering methods that can separate these populations correctly.

Beginning in 2007, several groups reported the development and application of computational methods to FCM data in an effort to overcome these limitations in manual gating-based analysis, with successful results reported in each case [4, 49, 89, 111, 125, 127, 129, 146, 148, 166] (see footnote 1). However, it was unclear how the results from these algorithms compared with traditional manual gating results in general, or how these computational methods compared with each other, as every new algorithm was assessed using distinct datasets and evaluation methods. To address these shortcomings, members of the algorithm development, FCM user, and software and instrument vendor communities initiated the Flow Cytometry: Critical Assessment of Population Identification Methods (FlowCAP) project. The goals of FlowCAP were to advance the development of computational methods for the identification of cell populations of interest in flow cytometry data by providing the means to objectively test and compare these methods, and to provide guidance to the end user about how best to use these algorithms. In this thesis, I report the results from the first such FlowCAP-sponsored challenge to compare my novel approach with other automated approaches in their ability to replicate manual analysis by experts on five typical flow cytometry datasets.

Recently, sophisticated methods have been developed for objective clustering of FCM data [16, 84, 92].
The proposed clustering techniques include Bayesian clustering [87], mixture modeling approaches [21], model-based cluster analysis [3, 141], feature-guided clustering [167], density-based clustering [134], combining the curvature information with density information [110], and image processing [79]. Non-parametric methods include density clustering [30], real-time adaptive clustering [54], Kohonen self-organizing maps [20], misty mountain clustering [147], and hierarchical clustering [37].

(Footnote 1: A portion of this section was submitted for publication. I am a co-author of that manuscript, and the complete list of co-authors includes: Nima Aghaeepour, Phillip Mah, Greg Finak, Andrea A. Barbo, Jonathan Bramson, Hannes Bretschneider, Cliburn Chan, Philip L. De Jager, Connie Eaves, Arvind Gupta, Alireza Hadj-Khodabakhshi, Faysal El Khettabi, George Luta, Jose M. Maisog, Peter Májek, Geoffrey J. McLachlan, Iftekhar Naim, Radina Nikolic, Saumyadipta Pyne, Yu Qian, John Quinn, Andrew Roth, Gaurav Sharma, Parisa Shooshtari, Josef Spidlen, István P. Sugár, Jozef Vilček, Kui Wang, Andrew P. Weng, Habil Zare, Tim R. Mosmann, Holger Hoos, Jill Schoenfeld, Raphael Gottardo, Ryan R. Brinkman, Richard H. Scheuermann.)

The application of non-parametric methods is restricted since the first two are subjective due to a dependency on user-defined thresholds, the misty mountain algorithm identifies two closely situated populations as one when the respective histogram has only one peak, applying Kohonen self-organizing maps requires the number of clusters to be determined by the user, and hierarchical clustering is too time consuming for large samples containing more than ten thousand cells. Dependency on thresholds limits the application of a method because the user has to set them for each sample individually, and when the number of samples is large, this approach is too time consuming to be practical.
In contrast, SamSPECTRAL, the method that I developed for clustering flow cytometry data, has low sensitivity to its parameters, such that they can be adjusted for a dataset by trying one or two samples. Then, other samples of the corresponding dataset can be run by SamSPECTRAL using the same parameters.

In some clinical cytometry analyses, the number of clusters is known in advance from a priori biological knowledge [51]. However, determining this number accurately can be a critical obstacle for other analyses such as identifying novel populations for biomarker discovery. In such applications, the user may not be able to impose a reasonable assumption on the number of biologically meaningful populations, and therefore the methods that require the number of clusters to be known in advance are very limited in those applications [16].

Model-based clustering techniques such as flowClust [90], flowMerge [50], flowMeans [3], and FLAME [126] have been developed to improve results. flowMerge uses the flowClust framework to identify clusters based on a t-mixture model methodology, followed by a merging step to account for overestimation of the number of clusters by the Bayesian information criterion. flowMeans applies a similar approach using k-means clustering. FLAME uses a skew t-mixture model, which is in theory more robust to skew because, unlike t-distributions, skew t-distributions can be asymmetric [126]. However, the running time of this algorithm increases with the fourth power of the number of dimensions. In practice this tends to make the algorithm impractical for more than five dimensions, while FCM data can have up to 19 dimensions. Overall, the major drawback of these parametric methods is the requirement for assumptions on either the size of the clusters or the cluster distributions and shapes [27], which could result in incorrect identification of biologically interesting populations.
In addition, one challenge for existing approaches is the identification of rare populations.

1.3.1 Spectral Clustering

Spectral clustering is a non-parametric clustering method that avoids the problems of estimating probability distribution functions by using a heuristic based on graphs [156]. It has proved useful in many pattern recognition areas [11, 14, 113, 119]. Not only does it not require a priori assumptions on the size, shape or distribution of clusters, but it also has advantages that make it particularly well-suited to clustering biological data:
• It is not sensitive to outliers, noise or shape of clusters;
• While it is adjustable and biological knowledge can be utilized to adapt the method for a specific problem or dataset, the dependency on the required parameters is low, such that once they are set for a dataset, all samples of that dataset can be analyzed using the same parameters;
• There is mathematical evidence to guarantee its proper performance [157].

Two main challenges in applying the spectral clustering algorithm to large data sets are the computationally expensive steps of (1) constructing the normalized matrix and (2) computing its eigenspace. For instance, for high throughput biological data containing one million data points (i.e., vertices), it requires computing the eigenspace of a million by million matrix, which is infeasible in terms of memory and time. Although there are some approximation methods for speeding up this computation [34, 154], these could produce undesired errors in the final results. The problem of applying spectral clustering on large datasets has been studied in [52] using the Nyström method. The authors suggest a strategy of sampling data uniformly, clustering the sampled points and extrapolating this solution to the full set of points.
However, sampling data uniformly can miss low-density populations entirely if the density of adjacent populations varies considerably, a situation that often arises for biologically interesting populations in FCM data. Appendix 3 of Chapter 2 includes an experiment to explain the effect of uniform sampling in such cases.

Data reduction schemes have been developed to reduce the complexity of the FCM data while preserving the information [26, 96]. These methods reduce the dimensionality but not the size of the dataset, the latter being the more important bottleneck for spectral clustering.

1.4 Feature Selection

In many applications such as bioinformatics, natural language processing, and computer vision, a high number of features might be provided to the learning algorithm without any prior knowledge of which ones should be used. Therefore, the number of features can drastically exceed the number of training instances. In such real-life learning problems, feature selection is important to speed up learning and to improve concept quality [83]. However, the advantages of feature selection techniques come at a certain price, as the search for a subset of relevant features introduces an additional layer of complexity in the modeling task [133].

Feature selection techniques can be grouped into three classes: (1) filter techniques, (2) wrapper methods, and (3) embedded techniques [41]. Filter techniques such as Euclidean distance or the t-test assess the relevance of features by looking only at the intrinsic properties of the data, and they select features based on an inter-class separability criterion [112]. Wrapper methods such as sequential forward selection or beam search employ an induction classifier as a black box and, using cross-validation or bootstrap techniques, evaluate the feature subset candidates such that the accuracy of the classifier can often be maximized [155].
Embedded techniques such as decision trees or regularization techniques are designed to find an optimum subset of features in the process of model building [95]. In general, embedded techniques are advantageous because they capture variable and model interactions better than filter techniques, and they are computationally less demanding than wrapper methods [63, 95]. The advantages and disadvantages of all three approaches are compared in detail in [133].

1.4.1 Regularization Techniques

Many regularization methods have been developed to prevent overfitting and improve the generalization error bound of the predictor in this learning situation. Most notably, Lasso [152] is an embedded ℓ1-regularization technique for linear regression which has attracted much attention in machine learning and statistics. While an empirical comparison of the rankings provided by SVM and Lasso revealed that they perform equally in terms of prediction accuracy on pairwise difference data [71], applying Lasso is preferred in practice because it avoids the complication of choosing between different kernels required by SVM. Although efficient algorithms exist for recovering the whole regularization path for the Lasso [44], finding a subset of highly relevant features which leads to a robust predictor remains an important research question.

A well-known justification of ℓ1-regularization is that it leads to sparse solutions (i.e., a weight vector with many zeros) and thus performs model selection. Recent research [12, 13, 158, 168] has studied model consistency of the Lasso. Analyses in [12, 13, 168] show that for various decaying schemes of the regularization parameter, Lasso selects the relevant features with probability one and irrelevant features with positive probability as the number of training instances goes to infinity.
If several samples are available from the underlying data distribution, irrelevant features can be removed by simply intersecting the sets of selected features for each sample. The idea in [13] is to provide such datasets by resampling with replacement from the given training dataset using the bootstrap method [43].

1.4.2 Previous Work to Improve ℓ1-regularization

A well-known justification of ℓ1-regularization is that it leads to model selection by forcing weights corresponding to many “irrelevant” features to be exactly zero. Recent research [12, 13, 91, 101, 158, 168] has studied model consistency of the Lasso (i.e., if we know the true sparsity pattern of the underlying data-generation process, does the Lasso recover this sparsity pattern when the number of training instances increases?). Studies of various decaying schemes of the regularization parameter have shown that under specific settings (explained in [12, 13]) Lasso selects the relevant features with probability one and the irrelevant features with positive probability less than one, provided that the number of training instances tends to infinity [12, 13, 100, 168]. If several (say 100) samples were available from the underlying data distribution, irrelevant features could be removed by simply intersecting the sets of selected features for each sample. The idea in [13] is to provide such datasets by resampling with replacement from the given training dataset using the bootstrap method [43]. This approach leads to the Bolasso algorithm for feature selection, which is theoretically motivated by Proposition 6, proved under the following general assumptions.

Notations and Assumptions

Suppose the response Y ∈ R is a random variable that depends on p covariates (X_1, X_2, ..., X_p) ∈ R^p, and for the joint distribution P_XY we have:
• The cumulant generating functions E_X(e^{s ||X||_2^2}) and E_Y(e^{s Y^2}) are finite for some s > 0.
• The matrix of second-order moments Q = E(XX^⊤) is invertible.
• There is a true weight vector w ∈ R^p such that E(Y|X) = X^⊤ w, and also var(Y|X) = σ^2 almost surely for some σ.

The last condition formalizes the dependency of Y on X. Furthermore, the sparsity pattern of the model is determined by the non-zero elements of w, i.e., J = {j | w_j ≠ 0} represents the indexes of the relevant features.

The goal of the learning procedure is to predict Y from X. We assume training data is provided as n independent and identically distributed (i.i.d.) samples (x_i, y_i) ∈ R^p × R from P_XY, where i = 1, ..., n. We denote the matrix of covariates containing n samples and p features by X ∈ R^{n×p}, and the corresponding label vector by Y ∈ R^n. We aim to solve a binary classification problem if the values of Y are guaranteed to be in {−1, +1}, and to train a regression model otherwise. Any binary classification problem can be reduced to a regression problem by treating Y as a real number and considering its sign as the class label. Normally, multiclass classification can be reduced to multiple binary classification problems (e.g., using a one-versus-all scheme). Therefore, we focus on the regression formulation, which can be easily applied to binary or multiclass problems.

Lasso is an ℓ1-regularization technique for least-squares linear regression:

L := Σ_{i=1}^{n} (1/(2n)) ||y_i − w^T · x_i||_2^2 + λ ||w||_1    (1.1)

It is well known that the ℓ1-regularization term shrinks many components of the solution to zero, and thus performs feature selection [168]. There have also been some variants, such as elastic nets [169], to select highly correlated predictive features. The number of selected features in equation (1.1) is controlled by the regularization parameter λ.
A common practice is to find the best value for λ by cross-validation to maximize the prediction accuracy. Having found the best value for the regularization parameter, the features are selected based on the non-zero components of ŵ, a weight vector that minimizes the global error L in equation (1.1). We denote the indexes of all selected features by Ĵ = {j | ŵ_j ≠ 0}. However, it is known that with a fixed value for λ, ŵ will converge in probability to the unique global minimizer, over v ∈ R^p, of (v − w)^⊤ Q (v − w) + λ ||v||_1, and therefore ŵ will not be a proper estimate for w [13]. Now, the question is what would be a proper value for λ as a function of n with a theoretical basis? Assuming λ tends to zero with rate n^{−1/2}, the Bolasso algorithm for feature selection is proposed in [12] and supported by the following proposition.

Proposition 6. [13] Suppose P_XY satisfies the above general assumptions and let λ = μ_0 n^{−1/2} for a fixed constant μ_0 > 0. The probability that Bolasso does not select the correct model is upper-bounded by:

Pr(Ĵ ≠ J) ≤ m A_1 e^{−A_2 n} + A_3 (log n) / n^{1/2} + A_4 (log m) / m,

where m > 1 is the number of bootstrap samples, and all A_i are positive constants.

Now, if we send m to infinity slower than e^{A_2 n}, then with probability tending to one Bolasso will select J, exactly the relevant features. While on synthetic data Bolasso outperforms similar methods such as ridge regression, Lasso, and bagging of Lasso estimates [22], it is sometimes too strict on real data. Bolasso-S, a soft version of Bolasso, performs better in practice. Because it does not require a selected feature to be present in all bootstrap replications, and presence in at least 90% of them is considered to be enough, Bolasso-S is more flexible and thus more appropriate for practical models that are not extremely sparse [12].
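The bootstrap intersection behind Bolasso and Bolasso-S can be illustrated with a short Python sketch. It is only a minimal illustration under stated assumptions, not the implementation used in [12, 13]: the function names (soft_threshold, lasso_cd, bolasso_support), the plain coordinate-descent solver, the synthetic data, and the constant mu_0 = 0.3 are all hypothetical choices made for this example. The solver minimizes the objective with the 1/(2n) factor, and the threshold parameter switches between the strict rule (a feature must be selected in every bootstrap replicate) and the soft 90% rule.

```python
import random


def soft_threshold(rho, lam):
    # Proximal step for the l1 penalty: shrink rho toward zero by lam.
    if rho > lam:
        return rho - lam
    if rho < -lam:
        return rho + lam
    return 0.0


def lasso_cd(X, y, lam, n_iter=100):
    # Coordinate descent for (1/(2n)) * sum_i (y_i - w.x_i)^2 + lam * ||w||_1.
    n, p = len(X), len(X[0])
    cols = [[row[j] for row in X] for j in range(p)]
    scale = [sum(v * v for v in col) / n for col in cols]
    w = [0.0] * p
    resid = list(y)  # residual y - Xw for the initial w = 0
    for _ in range(n_iter):
        for j in range(p):
            if scale[j] == 0.0:
                continue
            # Correlation of feature j with the partial residual.
            rho = sum(v * (r + w[j] * v) for v, r in zip(cols[j], resid)) / n
            new_wj = soft_threshold(rho, lam) / scale[j]
            delta = new_wj - w[j]
            if delta != 0.0:
                resid = [r - delta * v for r, v in zip(resid, cols[j])]
                w[j] = new_wj
    return w


def bolasso_support(X, y, lam, n_boot=16, threshold=1.0, seed=0):
    # threshold=1.0 keeps features selected in every bootstrap replicate
    # (Bolasso); threshold=0.9 gives the softer Bolasso-S rule.
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    counts = [0] * p
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample with replacement
        w = lasso_cd([X[i] for i in idx], [y[i] for i in idx], lam)
        for j in range(p):
            if w[j] != 0.0:
                counts[j] += 1
    return [j for j in range(p) if counts[j] >= threshold * n_boot]


# Synthetic example (assumed data): only features 0 and 1 influence y.
rng = random.Random(1)
n, p = 40, 4
X = [[rng.uniform(-1.0, 1.0) for _ in range(p)] for _ in range(n)]
y = [2.0 * row[0] - 3.0 * row[1] + rng.gauss(0.0, 0.1) for row in X]
lam = 0.3 * n ** -0.5  # lambda = mu_0 * n^(-1/2) with an arbitrary mu_0 = 0.3
strict = bolasso_support(X, y, lam, threshold=1.0)
soft = bolasso_support(X, y, lam, threshold=0.9)
print(strict, soft)
```

Because both calls use the same seed and therefore the same bootstrap replicates, every feature kept by the strict rule is also kept by the soft rule; the soft rule can only add features.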
Note that Proposition 6 guarantees the performance of Bolasso only asymptotically, and in real applications, when the number of training samples is limited, the probability of selecting relevant features can be significantly less than one.

1.4.3 Lasso and Bolasso

Let A = {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)} be the given training dataset consisting of n training instances, where x_i ∈ R^p is the covariate, p is the number of features, and y_i ∈ {−1, +1} is the class label for a binary classification problem. For an unseen instance x, we treat y as a real number and consider its sign as the class label. Usually multiclass classification can be reduced to multiple binary classification problems, e.g., using a one-versus-all scheme. Lasso is an ℓ1-regularization technique for least-squares linear regression:

L := Σ_{i=1}^{n} ||y_i − w^T · x_i||_2^2 + λ ||w||_1    (1.2)

Lasso has been applied in many machine learning and data mining areas including microarray analysis, natural language processing, and compressed sensing [8, 73]. There have also been some variants, such as elastic nets [169], to select highly correlated predictive features.

The number of selected features in equation (1.2) is controlled by the regularization parameter λ. Now, the question is what would be a proper value for λ with a theoretical basis? A common practice is to find the best value for λ using cross-validation to maximize the prediction accuracy. Having found the best value for the regularization parameter, the features are selected based on the non-zero components of the weight vector that minimizes L in equation (1.2). However, when I applied this technique to the lymphoma classification problem, I observed that some informative features were not selected.
As many ignored features had been previously known in medicine to be highly relevant and useful for lymphoma diagnosis, I developed and used a novel feature scoring scheme, described in Section 3.2.1, as a more robust alternative to common feature selection techniques.

1.5 Train-Test Approach

The main goal of building a classifier is prediction, and the performance of classifiers is best measured by estimating how accurately they perform in practice. The train-test approach is a standard technique in machine learning and data mining to evaluate the performance of classifiers using a sample of data [69, 142]. Before performing any training procedure, data is partitioned into complementary subsets called “train” and “test” sets. While a typical split might be 75% for training and 25% for testing [69], more conservative scholars opt to use other ratios such as 50%-50% [142]. Then the classifier is trained on the train set and the optimum parameters are selected. Finally, the performance is estimated by measuring accuracy, sensitivity, and specificity of the classifier on the test set. It is very important to keep the test set absolutely independent from the train set because otherwise the reported performance might be exaggerated due to overfitting. That is, the parameters might be set in a particular way such that the classifier fits very well to the samples used for training but the model fails to predict labels of unseen cases. To guard against overfitting, I validated all results reported in this thesis by the conventional train-test approach (Chapter 4). In the MCL-SLL study, I chose to dedicate 44 cases as the train set and 70 independent cases as the test set, which is very conservative relative to common practice. Also, to reduce variability and where applicable, multiple rounds of the train-test procedure were performed using different partitions and the results were averaged over the rounds.
This technique is called cross-validation and is a standard technique to avoid overfitting [123].

1.6 Lymphoma

Lymphoma is a cancer in the lymphatic cells of the immune system. The disease has many types and subtypes that can be indolent or aggressive. According to the WHO classification [97], the types are classified into the following groups:
• Mature B cell neoplasms
  - Chronic lymphocytic leukemia (a.k.a. Small lymphocytic lymphoma)
  - B-cell prolymphocytic leukemia
  - Lymphoplasmacytic lymphoma (such as Waldenström macroglobulinemia)
  - Splenic marginal zone lymphoma
  - Plasma cell neoplasms
  - Extranodal marginal zone B cell lymphoma, also called MALT lymphoma
  - Nodal marginal zone B cell lymphoma (NMZL)
  - Follicular lymphoma
  - Mantle cell lymphoma
  - Diffuse large B cell lymphoma
  - Mediastinal (thymic) large B cell lymphoma
  - Intravascular large B cell lymphoma
  - Primary effusion lymphoma
  - Burkitt lymphoma (leukemia)
• Mature T cell and natural killer (NK) cell neoplasms
  - T cell prolymphocytic leukemia
  - T cell large granular lymphocytic leukemia
  - Aggressive NK cell leukemia
  - Adult T cell leukemia (lymphoma)
  - Extranodal NK/T cell lymphoma, nasal type
  - Enteropathy-type T cell lymphoma
  - Hepatosplenic T cell lymphoma
  - Blastic NK cell lymphoma
  - Mycosis fungoides (Sézary syndrome)
  - Primary cutaneous CD30-positive T cell lymphoproliferative disorders
  - Angioimmunoblastic T cell lymphoma
  - Peripheral T cell lymphoma, unspecified
  - Anaplastic large cell lymphoma
• Hodgkin lymphoma
  - Nodular sclerosis
  - Mixed cellularity
  - Lymphocyte-rich
  - Lymphocyte depleted or not depleted
  - Nodular lymphocyte-predominant Hodgkin lymphoma

In this study, we focus on common B cell lymphomas, namely, Small lymphocytic lymphoma, Follicular lymphoma, Mantle cell lymphoma, and Diffuse large B cell lymphoma.

Chapter 2 SamSPECTRAL

I hypothesized that spectral clustering could significantly improve high throughput biological data analysis.
However, serious empirical barriers are encountered when applying this method to large data sets. Specifically, for n data points, the running time is O(n^3), requiring O(n^2) units of memory. For instance, it would take 2 years and 5 terabytes of memory to analyze a typical FCM sample with 300,000 events. This situation is even worse considering that FCM experiments typically involve analysis of dozens of such samples. In this chapter, I describe my approach for addressing these computational challenges efficiently.

2.1 Faithful Sampling

To reduce the size of data, many data reduction methods have been proposed that rely on spectral graph theory. For instance, graph sparsification can reduce the number of edges significantly while guaranteeing a negligible change in the eigenspace [143]. However, the challenge in applying spectral clustering is the high number of vertices, and reducing the number of edges is insufficient for this purpose. My preliminary empirical experiments showed that uniform sampling is not useful for this application, and therefore developing an efficient sampling scheme is crucial to get satisfactory results (see Section 2.3.3). I developed a novel solution for this problem through my non-uniform, information-preserving sampling.

Figure 2.1: Data reduction scheme. The information about the local density is retrieved in this way.

My data reduction scheme (Figure 2.1) consists of two major steps. First, I sample the data in a representative manner to reduce the number of vertices of the graph (Figure 2.1b). Sample points cover the whole data space uniformly (Figure 2.2b), a property that aids in the identification of both low density and rare populations. In the second step, as described below, I define a similarity matrix that assigns weights to the edges between the sampled data points. Higher weights are assigned to the edges between nodes in dense regions so that information about the density is preserved (Figure 2.1c).
After faithful sampling is completed, the set of all representatives can be regarded as a sample from the data.

Algorithm 1 Faithful Sampling
Label all data points as unregistered.
repeat
    Pick a random unregistered point p {the representative of a new community}
    Label all unregistered data points within distance h from p as registering
    Put registering points in a set called community p
    Relabel registering points as registered
until all points are registered
return all communities

2.1.1 Mathematical Formulation

It is possible and desirable to analyze the behavior of faithful sampling in more exact mathematical terms. In fact, each eigenvector of the sampled graph G_s induces a vector on the vertices of the original graph G that may approximate an eigenvector of G. The following definitions formulate this idea more precisely.

Definition 7. Suppose G is a graph and V_s is an eigenvector of the normalized adjacency matrix of G_s, which is sampled from G by faithful sampling. A vector V defined on the vertices of G “agrees” with V_s in a community c if on all members of c, V is equal to the value of V_s on the representative of c. A vector Ṽ_s that agrees with V_s on all communities is said to be “induced” by V_s on G.

Definition 8. Under the notation of Definition 7, an eigenvector V_s of G_s approximates an eigenvector V of G with error ε if ||V − Ṽ_s||_2 < ε, where Ṽ_s is induced by V_s on G.

Let e_ij denote the weight on the edge between vertices i and j of graph G, and consider the following set of conditions A_0 on G:
• G is connected and positively weighted: e_ij > 0, ∀ i, j.
• G is symmetric: e_ij = e_ji.
• For all vertices i, j, and k, e_jk ≤ e_ik + e_ij. That is, the weight of the edge between two nodes should be no more than the weight of any other path between those nodes.

I conjecture that the following can be proven mathematically in graph theory:

Conjecture 9.
[SamSPECTRAL] Suppose G_s is sampled from a graph G by faithful sampling. For any h_0 > 0, under conditions A_0 on G, if the neighborhood parameter h < h_0 (see Algorithm 1), there exist λ_0 < 1 and ε ≥ 0 (depending on A_0 and h_0) such that every eigenvector V_s of G_s with eigenvalue more than λ_0 approximates an eigenvector of G with error ε, and vice versa, every eigenvector V of G with eigenvalue more than λ_0 approximates an eigenvector of G_s with error ε.

One can think of different sets of conditions to obtain stronger results that might have a wider range of applications. For instance, consider the following set of conditions A_α on G:
• G is connected and positively weighted: e_ij > 0, ∀ i, j.
• ∀ i, j, ∀⁺ k: e_jk ≤ e_ik + e_ij, where by ∀⁺ I mean for all except α nodes. That is, the triangle inequality should be satisfied for “most” neighbors of a vertex.

If proved to be true for small values of α, the following version of the conjecture would guarantee the performance of SamSPECTRAL in a wider range of applications such as social networks or web graphs:

Conjecture 10. [strong SamSPECTRAL] Suppose G_s is sampled from a graph G by faithful sampling. For any h_α > 0, under conditions A_α on G, if the neighborhood parameter h < h_α (see Algorithm 1), there exist λ_α < 1 and ε_α ≥ 0 (depending on A_α and h_α) such that every eigenvector V_s of G_s with eigenvalue more than λ_α approximates an eigenvector of G with error ε_α, and vice versa.

Alternatively, proving a similar conjecture may be easier and still interesting if a stronger set of conditions is assumed. I have experimentally validated the SamSPECTRAL algorithm by applying it to four different FCM data sets, and if the above conjecture is proven, it can provide a theoretical explanation for the performance of this algorithm.
Furthermore, the conditions can provide a helpful guide to decide if SamSPECTRAL is useful in other clustering applications. Such a proof is mathematically desirable but lies beyond the scope of the current dissertation. Also, I have provided my intuition in support of these conjectures based on potential theory in Section 2.2.3.

2.2 Enhancement to Spectral Clustering

In this thesis, I distinguish between the terms “biological populations”, “clusters” and “components” as follows. A population is a set of cells with similar functionality or molecular content. By a cluster, I mean a set of data points that are grouped together by the spectral clustering algorithm. I incorporate a post-processing stage on spectral clusters to find the connected components intended to estimate the biological populations.

2.2.1 Spectral Clustering

The first step is to build a graph. The vertices represent the n data points (e.g., cells in FCM data), and the edges between the vertices are weighted based on some similarity criterion. The adjacency matrix of the graph is then normalized using the following formula:

Ã = D^{−1/2} A D^{−1/2},    (2.1)

where A is the adjacency matrix of the graph and D is a diagonal matrix with the (i, i) entry being equal to the sum of the weights on the edges that are adjacent to vertex i. The next step is to compute the eigenspace of the normalized matrix. That is, all vectors V_i and values λ_i satisfying the following equation are computed:

Ã V_i = λ_i V_i.    (2.2)

In order to find k clusters, an n by k matrix is built using the k eigenvectors with the highest eigenvalues. The rows of this matrix are normalized, and finally k-means is used to cluster the rows.

However, the above method cannot be directly applied to FCM data due to the large number of data points (cells) per sample. My solution for this problem is a data reduction scheme developed specifically for this purpose.
This reduces the number of vertices significantly, but in such a way that biological information can be preserved by updating the weights on the edges.

2.2.2 Data Reduction Scheme

While data size can be reduced by known sampling methods [161], a very delicate method should be used to preserve biologically important information. From a high-level perspective, my data reduction scheme (Figure 2.1) consists of two major steps. First, I sample the data in a representative manner to reduce the number of vertices of the graph (Figure 2.1b and Algorithm 1). Sample points cover the whole data space uniformly (Figure 2.2b), a property that aids in the identification of both low density and rare populations. In the second step, as described below, I define a similarity matrix that assigns weights to the edges between the sampled data points. Higher weights are assigned to the edges between nodes in dense regions so that information about the density is preserved (Figure 2.1c).

After faithful sampling is completed, the set of all representatives can be regarded as a sample from the data. Reducing the value of parameter h will increase the number of sample points, resulting in increased computation time and required memory. Conversely, increasing h will result in fewer sample points, which may lead to too low a resolution. In such a case, the computed spectral clusters may fail to estimate the real cell populations appropriately. In our implementation, I use an iterative procedure (explained in the overview of our algorithm) to adjust h automatically such that the number of representatives will be in the range 1500-3000. As a result of this adjustment, the following two objectives are achieved. First, computing the eigenspace of a graph with a number of points in this range is feasible (it takes less than one minute on a 2.7 GHz processor).
Second, the communities are \"small\" (Figure 2.2) and the resulting resolution is high enough that no biologically interesting information is lost.

In the sampling stage, there is no preference in picking the next data point; therefore, the final distribution of the sampled points is uniform in the \"effective\" space. That is, the representatives are distributed almost uniformly in the space where data points were present (Figure 2.2). As a consequence, repeating the sampling procedure does not change the final clustering results significantly. This observation is confirmed quantitatively in Section 2.3.1.

Figure 2.2: Faithful sampling. (a) Original data from the telomere data set before sampling. (b) The distribution of representatives is almost uniform in the space after faithful sampling.

By considering just the representatives, density information is effectively ignored, so working directly with these representatives produces improper results. On the other hand, some biological information from the original data is preserved by the above algorithm and can be retrieved to guide the clustering algorithm. More precisely, for each sample point, I know the list of all points in its neighbourhood (i.e., the members of the corresponding community). In the next stage, I use this information to define the similarity between two sample points and thereby modify the behaviour of spectral clustering. In this sense, my sampling scheme is faithful, meaning that the valuable biological information from the original data points is preserved even after sampling. I call the overall procedure, which consists of faithful sampling, computing the modified similarity matrix and spectral clustering, SamSPECTRAL clustering.
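Algorithm 1 is not reproduced in this section, so the following is only a minimal sketch of a sampling procedure with the properties described above. It assumes that a representative is picked at random among the points not yet assigned, and that all unassigned points within Manhattan distance h of it form its community. The function names and the toy data are hypothetical.

```python
import random

def manhattan(p, q):
    # Manhattan (L1) distance between two points given as tuples.
    return sum(abs(a - b) for a, b in zip(p, q))

def faithful_sampling(points, h, seed=0):
    # Returns a dict mapping each representative index to the indices of its community.
    rng = random.Random(seed)
    order = list(range(len(points)))
    rng.shuffle(order)  # no preference in picking the next data point
    assigned = [False] * len(points)
    communities = {}
    for idx in order:
        if assigned[idx]:
            continue
        # All still-unassigned points in the neighbourhood join this community.
        members = [j for j in range(len(points))
                   if not assigned[j] and manhattan(points[idx], points[j]) <= h]
        for j in members:
            assigned[j] = True
        communities[idx] = members
    return communities

# Hypothetical toy data: one dense blob plus sparse background points.
rng = random.Random(1)
dense = [(rng.gauss(0.0, 0.3), rng.gauss(0.0, 0.3)) for _ in range(300)]
sparse = [(rng.uniform(-3.0, 3.0), rng.uniform(-3.0, 3.0)) for _ in range(60)]
points = dense + sparse
communities = faithful_sampling(points, h=0.5)
```

In this sketch any two representatives are more than h apart, so they spread over the occupied space regardless of density, while the community sizes retain the density information. Each pass compares a new representative against all n points, which is consistent with the O(nm) running time quoted in the overview.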
2.2.3 Similarity Matrix

In this study, I use the following heat kernel formula [85] to compute the similarity between two vertices i and j:

$s_{i,j} = e^{-D^2(p_i, p_j) / (2\u03c3^2)}$, (2.3)

where D(p_i, p_j) is the Euclidean distance between them. \u03c3 is a scaling parameter that controls how rapidly similarity between p_i and p_j falls off with increasing distance. I define the similarity between two communities c and c\u2032 as the sum of all pairwise similarities between all members of the first community and all members of the second community. That is,

$S_{c,c\u2032} = \u2211_{i\u2208c} \u2211_{j\u2208c\u2032} s_{i,j}$, (2.4)

where i and j are members of c and c\u2032 respectively. I do not normalize the similarity by dividing the above sum by the size of communities because I would lose valuable biological information that is supposed to be preserved. In fact, the size of the communities determines the local density of the data points, which is biologically of great importance.

Figure 2.3: Defining the similarity between two communities and identifying the number of clusters. (a) I define the similarity between two communities c and c\u2032 as the sum of pairwise similarities between the members of c and the members of c\u2032. (b) This figure shows the largest eigenvalues of a sample from the stem cell dataset. The number of clusters is estimated according to the knee point of the eigenvalues curve. This point is defined as the intersection of the above regression line and the line y=1. The horizontal coordinate of the knee point estimates the number of spectral clusters.

The above definition is motivated by the following intuition from potential theory that explains how biological information is preserved after faithful sampling by assigning similarities in this way. The eigenvectors of a graph are interpreted as potential functions on the electric network modeled by the graph [19].
Assuming the radius of each community is small enough, the potential values of the community members are almost the same. On the other hand, in potential theory, the equivalent conductance between a group of nodes {v_i} with equal potential values and another group of nodes {w_j} that also have equal potential values is computed by summing the pairwise conductances between nodes v_i and w_j for all i and j. Since in my model the similarity between two vertices is equivalent to the conductance between the corresponding electrical nodes, it is reasonable to sum up pairwise similarities to estimate the equivalent similarity between communities (Figure 2.3a).

2.2.4 Number of Clusters

The number of clusters must be determined before running the spectral clustering algorithm [160]. To find this number automatically and efficiently, I developed a novel method that is motivated by the following observation from spectral graph theory:

Theorem [18]: The number of connected partitions of a graph is equal to the number of eigenvectors with eigenvalue 1.

I observed that typically for FCM data, if \u03c3 is adjusted properly, as explained in the vignette (Appendix A) of the SamSPECTRAL package available through The Comprehensive R Archive Network (CRAN) (footnote 1), the first few eigenvalues are close to one and, at a point I call the knee point, they start to decrease almost linearly. I compute the knee point by applying linear regression to the eigenvalues curve (Figure 2.3b) and use the horizontal coordinate of this point as a rough estimate for the number of spectral clusters.

Footnote 1: http:\/\/www.bioconductor.org\/packages\/devel\/bioc\/html\/SamSPECTRAL.html

2.2.5 Combining Clusters

Applying spectral clustering to sampled data results in a graph partitioning that is almost optimal in the sense of having minimum normalized cut [29, 139]. However, in some cases, a biologically interesting population might be split into two or more smaller clusters by this method.
I addressed this issue by adding a post-processing stage wherein the partitions of a population are combined based on known properties of FCM cell populations. Typically, biologically meaningful cell populations in FCM data have their highest density at the centre, and their density decreases towards the border of the population. Since higher density regions indicate communities with relatively more members, the conductance between them is expected to be relatively higher (Equation 2.4). Thus, similarity between communities is higher in regions with higher densities, and the highest similarity is expected to be at the centre of the biological population. This observation forms the basis for my criterion for combining clusters. Specifically, similarity between communities determines the weight on graph edges, and I define the maximum weight of the edges of a spectral cluster as the within similarity of that cluster. Also, the maximum weight of the edges between two different spectral clusters is defined as their between similarity. If the ratio of between similarity to within similarity is greater than a predefined threshold (the separation factor), I conclude that these clusters are partitions of a single population and combine them to form a component. I repeat this stage until no two components can be combined. The final components computed in this way are called connected components of the data, and estimate the real biological populations. With smaller separation factors, spectral clusters tend to combine more often, resulting in fewer final components.

2.2.6 Overview of SamSPECTRAL Algorithm

In summary, the stages of the algorithm are as follows, assuming the data contains n points in a d dimensional space of volume V, and the parameters m (maximum number of communities), \u03c3 (scaling parameter), and separation factor are set properly.

1. Sampling:
(a) Let $h = (1/2) (V/m)^{1/d}$.
(b) Repeat:
\u2022 Run the faithful (biological information preserving) sampling algorithm. Suppose m\u2032 communities are built.
\u2022 Update: $h = h (m\u2032/m)^{1/d}$.
Until $m/2 \u2264 m\u2032 \u2264 m$.
2. Compute the similarities between communities by adding the pairwise similarities s_{i,j} defined by Equation 2.3:
$S_{c,c\u2032} = \u2211_{i\u2208c} \u2211_{j\u2208c\u2032} s_{i,j}$. (2.5)
3. Build a graph wherein each community is a vertex. Put edges between all pairs of vertices and weight them by the similarity between the corresponding communities.
4. Analyze the spectrum of the above graph to find the clusters:
(a) Normalize the adjacency matrix of the graph according to Equation 2.1.
(b) Compute the eigenspace of the graph and set k, the number of clusters, according to the knee point of the eigenvalues curve.
(c) Run the classical spectral clustering algorithm to find k clusters.
5. Combine the clusters to find connected components:
(a) Initialize the list of components to be the list of spectral clusters.
(b) Repeat:
\u2022 For any pair of components C_i, C_j, set:
separation ratio(C_i, C_j) := between similarity(C_i, C_j) / within similarity(C_i). (2.6)
\u2022 For each component C_i, compute: M(i) := max_{j \u2260 i} separation ratio(C_i, C_j).
\u2022 If for all i, M(i) \u2264 separation factor, break.
\u2022 Pick an i such that M(i) > separation factor and let: j = argmax_{j \u2260 i} separation ratio(C_i, C_j).
\u2022 Combine C_i and C_j, then update the list of components.
Until only one component remains.

In the sampling stage, I start with the initial value $h = (1/2) (V/m)^{1/d}$ for the neighbourhood. m is a parameter that controls m\u2032, the final number of sample points, such that $m/2 \u2264 m\u2032 \u2264 m$. Since my implementation uses the Manhattan metric to measure the distance between points, the volume of a community can be estimated by $(2h)^d = V/m$. Therefore, if the data points were distributed uniformly in the space, I would get m sample points in the first run.
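Step 5 of the overview can be sketched as follows. This is a minimal illustration under stated assumptions rather than the packaged implementation: the weight matrix W and the initial clusters are hypothetical toy values, the pair with the largest separation ratio is merged first (the overview allows any i with M(i) above the separation factor), and a cluster with a single vertex is treated as having within similarity 0, which the text does not specify.

```python
def within_similarity(W, C):
    # Maximum edge weight inside cluster C (0.0 for a single vertex: an assumption).
    return max((W[a][b] for a in C for b in C if a < b), default=0.0)

def between_similarity(W, Ci, Cj):
    # Maximum edge weight between clusters Ci and Cj.
    return max(W[a][b] for a in Ci for b in Cj)

def separation_ratio(W, Ci, Cj):
    # Equation 2.6, with a guard for clusters without internal edges.
    within = within_similarity(W, Ci)
    between = between_similarity(W, Ci, Cj)
    if within == 0.0:
        return float('inf') if between > 0.0 else 0.0
    return between / within

def combine_clusters(W, clusters, separation_factor):
    components = [list(c) for c in clusters]
    while len(components) > 1:
        best = None  # (ratio, i, j) with the largest separation ratio
        for i, Ci in enumerate(components):
            for j, Cj in enumerate(components):
                if i != j:
                    ratio = separation_ratio(W, Ci, Cj)
                    if best is None or ratio > best[0]:
                        best = (ratio, i, j)
        if best[0] <= separation_factor:
            break  # M(i) <= separation factor for all i
        _, i, j = best
        merged = components[i] + components[j]
        components = [c for k, c in enumerate(components) if k not in (i, j)]
        components.append(merged)
    return components

def toy_weights():
    # Hypothetical similarities between six communities (symmetric matrix).
    W = [[0.1] * 6 for _ in range(6)]
    for a, b, w in [(0, 1, 10.0), (2, 3, 8.0), (4, 5, 9.0), (1, 2, 6.0)]:
        W[a][b] = W[b][a] = w
    return W

W = toy_weights()
spectral_clusters = [[0, 1], [2, 3], [4, 5]]
merged_components = combine_clusters(W, spectral_clusters, separation_factor=0.5)
strict_components = combine_clusters(W, spectral_clusters, separation_factor=0.9)
```

With the separation factor 0.5 the first two clusters are combined, because their between similarity (6) is large relative to the within similarity of the second cluster (8); with 0.9 all three clusters stay separate, illustrating that smaller separation factors lead to fewer final components.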
In practice, however, the sampling procedure needs to be repeated after updating the neighbourhood value. According to my experiments, a few iterations are enough to fulfil the terminating condition $m/2 \u2264 m\u2032 \u2264 m$. As the running time of this part of SamSPECTRAL is O(nm), which is negligible compared to the eigenspace computation time, I did not attempt to optimize the sampling loop.

2.2.7 Modified Markov Clustering Algorithm (MCL)

Step 4 in the above algorithm is the classic spectral clustering method. This step could potentially be substituted by any clustering algorithm for weighted graphs. To verify that our approach is extensible in this sense, I substituted classic spectral clustering with Markov Clustering (MCL) [39], keeping the rest of the algorithm (the sampling and post-processing steps) unchanged.

MCL finds the partitions of a graph by simulating flow on the nodes. Simulation is done by iteratively multiplying two types of matrices that correspond to expansion and inflation operations [39]. Because the flow and the eigenspace of a graph are strongly related (footnote 2), the outcome of this approach tends to be similar to that of spectral clustering through computing the eigenspace.

2.2.8 Testing on Real Data

I tested our algorithm on four different FCM datasets, as explained briefly here. The GvHD dataset is available in the flowCore package through BioConductor and the rest are available upon request.

Stem Cells

To investigate heterogeneity in the differentiation behaviour of hematopoietic stem cells, a subpopulation of adult mouse bone marrow was isolated and then each single stem cell was transplanted into one of 352 recipients [42]. Sixteen blood samples were taken from the recipients at biweekly intervals and were studied in a cytometer. The investigation contained hundreds of data files that needed to be analyzed to count the frequency of each subtype of white cells they contained.
Telomere

In all vertebrates, telomeres consist of tandem DNA repeats of the sequence d(TTAGGG) and associated proteins. Telomere length is known to be a crucial element in ageing and in various diseases including cancer, and it can be estimated by FCM [15]. Since telomere length differs between cell populations, these populations need to be distinguished before calculating telomere length.

GvHD

Acute graft versus host disease (GvHD) is a common outcome after bone marrow transplantation. It is difficult to diagnose in its early stages in order to provide timely treatment. To investigate how FCM can help predict the development of GvHD, and to study its advantages over microarrays, peripheral blood samples from 31 patients undergoing allogeneic blood and marrow transplant were analyzed [24]. The samples were taken at progressive time points post-transplant and were stained with four appropriate lymphocyte phenotypic and activation markers defining 121 different populations using six markers.

Footnote 2: The Cheeger inequality is an example of such a relation [29].

Viability

Propidium iodide (PI) is a widely used marker for determining the viability of mammalian cells [163] because it can pass through only damaged cell membranes. However, depending on the complexity of the data, identifying dead cells automatically might still be difficult even if this marker is used. I tested the capability of our algorithm in identifying dead cells using the PI marker on a dataset from the Terry Fox Laboratory.

I implemented the SamSPECTRAL algorithm in R and applied it to four different datasets. I was able to identify some types of biologically interesting populations that were previously known to be hard to distinguish, including:

1. Overlapping populations (Figure 2.4a-c).
2. Subpopulations of a major population (Figure 2.4d-f).
3. Non-elliptical shaped populations (Figures 2.5 and 2.6a-c).
4. Low density populations close to dense ones (Figures 2.6d-f and 2.7).
5. Rare populations comprising less than 2% of all data points (Figure 2.8).

Here, I demonstrate the capabilities of SamSPECTRAL in identifying biological populations in these cases and compare our results with two state of the art methods for clustering FCM data, flowMerge (version 0.4.1) and FLAME (version 3), obtained through BioConductor and GenePattern respectively.

Overlapping Populations

Traditionally, identifying cell populations in FCM data is accomplished by visualizing the multidimensional data as a series of bivariate plots and separating interesting sections manually, in a process termed gating. Gating becomes challenging for high dimensional data because, when the data is mapped to two dimensions, some clusters may overlap, resulting in the mixing of different populations. Consequently, even a trained operator cannot identify overlapping populations properly in all cases. However, our algorithm prevents this undesired error by considering all data dimensions together (Figure 2.4a-c). Model based multidimensional techniques also generally perform well in this regard.

Subpopulations of a Population

Figure 2.4d-f shows a major blood population (granulocytes) formed from two distinct subpopulations, as verified by expert manual analysis. SamSPECTRAL could clearly distinguish between the two subpopulations. flowMerge merged these two populations into one, while FLAME split both subpopulations.

Non-elliptical Shaped Populations

While most model based techniques have a priori assumptions on the shape of populations that resulted in mixing or splitting populations, our method worked relatively well on the samples with arbitrarily shaped populations. In Figure 2.5, the PI positive population (the blue diagonal one) was clearly identified despite its non-elliptical shape.
flowMerge could also distinguish this population, but it incorrectly split the PI negative population into two parts. FLAME did not correctly distinguish the two populations.

Figure 2.6a-c shows the output of the three algorithms on a four dimensional sample from the GvHD dataset. Although the red population has a complex shape, SamSPECTRAL identified it with high accuracy. While FLAME produced a satisfactory result, flowMerge mixed this population with the one below it.

Figure 2.4: Comparative clustering of the telomere dataset. (a-c) Proper identification of overlapping populations. Although the two populations shown by red and blue contours overlap in all bivariate plots of this 3-dimensional sample, SamSPECTRAL can properly distinguish them by considering multiple parameters simultaneously. (d) SamSPECTRAL can also identify two major subpopulations of granulocytes correctly, as verified by expert analysis. (e) flowMerge does not distinguish between the two populations of interest, and (f) FLAME improperly splits the same sample into several clusters.

Figure 2.5: Comparative clustering of dead cells (PI positive) and live cells (PI negative) in the viability data. (a) SamSPECTRAL could distinguish between dead cells (blue) and live cells (red) properly. (b) flowMerge identified dead cells correctly, but split live cells into two clusters. (c) FLAME did not distinguish between these two populations.

Low Density Populations Close to Dense Populations

Figure 2.6d-f shows a sample from the GvHD dataset containing a relatively low density population and a high density population close together. SamSPECTRAL clearly distinguished the red population in the centre of the plot from the yellow dense population to its left. Moreover, it did not mix the red population with the other low density population to its right. flowMerge also clustered this sample relatively well, but required five times more processing time.
The performance of FLAME was not satisfactory for this sample because it mixed the desired population with the other low density ones.

Figure 2.7 depicts a sample from the stem cell dataset containing a relatively low density population shown in blue. In each row, three 2-dimensional plots of the 3-dimensional data sample are presented. SamSPECTRAL could distinguish the blue population although it was surrounded by three relatively denser populations (the yellow, green and red ones). flowMerge mixed this population with the yellow one, while FLAME mixed it with the red one.

Figure 2.6: Comparative clustering of the GvHD dataset. (Left) Identification of non-elliptical shaped populations. (a) SamSPECTRAL could properly identify the red, non-elliptical population, while (b) flowMerge mixed this population with the one below it. (c) FLAME produced satisfactory results in identifying this population. (Right) Identification of low density populations close to dense populations. (d) SamSPECTRAL and (e) flowMerge could correctly identify the low density population shown in red at the centre of the figure, while (f) FLAME merged this population with the other ones surrounding it.

Figure 2.7: Comparative identification of a low density population surrounded by much denser populations in the stem cell data set. (a-c) SamSPECTRAL correctly identified the blue, low density population, while (d-f) flowMerge merged it with the yellow, high density population. (g-i) FLAME merged it with the red population. (j-l) The outcome of our modified MCL was similar to that obtained by SamSPECTRAL using classic spectral clustering. This shows that SamSPECTRAL is extensible by substituting classic spectral clustering with other clustering algorithms for weighted graphs.

Rare Populations

Identifying rare populations has many significant applications in FCM experiments, including distinguishing cancer stem cells, hematopoietic stem cell transplantation, detection of fetal cells in maternal blood, detection of leukocytes in leukocyte-depleted platelet products, detection of injected cells for biotherapy and malaria diagnosis [40].

Figure 2.8 shows a typical sample from the stem cell data set that contains a rare population in red. This population is positive for all three markers and, in each sample, it comprises between 0.1% and 2% of total cells. I performed an experiment on 34 samples from the stem cell data set and compared the performance of SamSPECTRAL, flowMerge and FLAME. This rare population was distinguished manually and the result of manual gating was considered as the basis for our comparison. FLAME and flowMerge could identify this population in only 11 (32%) and 9 (26%) of the samples, respectively. SamSPECTRAL could distinguish this population in 27 (79%) samples, including all the ones that were identified by FLAME and flowMerge. In the 7 (21%) samples where SamSPECTRAL failed, the rare population of interest contained less than 0.15% of all data points.

Figure 2.8: Rare population in the stem cell data set. (a-c) This is a typical sample from the stem cell data set that contains a rare population. In these three dimensional plots, the red dots represent the cells that are positive for all three markers. Only 23\/9721 (0.24%) events belong to this population in this sample. SamSPECTRAL could properly identify the rare population in 27\/34 (79.4%) samples from the stem cell data set.

To measure the accuracy of SamSPECTRAL, I define sensitivity and specificity as follows. For each sample, I call a cell positive if it belongs to the rare population of interest, and negative otherwise. Sensitivity is defined as the number of correctly identified rare cells divided by the total number of rare cells. Accordingly, specificity is the number of cells correctly identified as negative divided by the total number of truly negative cells.
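The two measures can be computed per sample as in the sketch below, which assumes the standard reading of the definitions (true positives over all positives, true negatives over all negatives). The labels are hypothetical and do not correspond to any sample from the experiment.

```python
def sensitivity_specificity(truth, predicted):
    # truth[i] / predicted[i] are True when cell i is (reported as) a rare cell.
    tp = sum(1 for t, p in zip(truth, predicted) if t and p)
    fn = sum(1 for t, p in zip(truth, predicted) if t and not p)
    tn = sum(1 for t, p in zip(truth, predicted) if not t and not p)
    fp = sum(1 for t, p in zip(truth, predicted) if not t and p)
    sensitivity = tp / (tp + fn)  # requires at least one truly rare cell
    specificity = tn / (tn + fp)  # requires at least one truly negative cell
    return sensitivity, specificity

# Hypothetical sample: 10,000 cells of which 20 are rare; 18 rare cells are
# recovered and 2 non-rare cells are wrongly reported as rare.
truth = [i < 20 for i in range(10000)]
predicted = [i < 18 or i in (500, 501) for i in range(10000)]
sens, spec = sensitivity_specificity(truth, predicted)
```

Because negative cells vastly outnumber rare cells, specificity stays very close to 1 even when a few cells are misassigned, so sensitivity is the more informative of the two measures for rare populations.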
In the 27 (79%) cases where SamSPECTRAL correctly identified the rare population, the mean sensitivity was 0.83 with a standard deviation of 0.26. The median sensitivity was 0.99. Specificity was 1 except for one sample. Considering only samples with a rare population larger than 0.2% of the total data, I obtained median = 1, mean = 0.93 and standard deviation 0.15 for sensitivity. A detailed report of the results of this experiment is provided as a table in additional file 1, available through the BMC Bioinformatics website.

SamSPECTRAL with MCL

Figure 2.7j-l depicts the output of MCL on a sample from the stem cell dataset. I ran MCL on the sampled data obtained by our faithful sampling algorithm and then applied the post-processing step to the resulting clusters. This experiment showed no significant difference for SamSPECTRAL between clustering through computing eigenvectors (Figure 2.7a-c) and clustering by MCL (Figure 2.7j-l).

2.3 Further Experiments

2.3.1 Resolution

Previously, I explained that the resolution of the sample points (Figure 2.2) is high enough that, when the randomized faithful sampling procedure is repeated, the outcome of SamSPECTRAL does not vary significantly. The following experiment was performed to confirm this observation quantitatively.

In this experiment I used the F-measure, which is known to be appropriate for comparing clustering results of FCM data [1]. The F-measure varies in the range 0-1 and reaches its best value at 1 when the two clustering results are identical. I ran SamSPECTRAL on a sample from the stem cell data set 20 times and compared the final results. The F-measure values obtained by pairwise comparison between the final results had mean = 0.98, median = 0.98 and standard deviation 0.0097.

Figure 2.9: Performance of SamSPECTRAL on synthetic data. (a) This synthetic two dimensional data consists of a normal distribution with 30,000 points, four normal distributions each with 300 points, and uniform background noise with 4000 points. (b) Around 3000 sample points are picked by faithful sampling. These are distributed almost uniformly in the space; therefore, almost all information about density is lost if one considers only the sample points. (c) The final outcome of SamSPECTRAL confirms that the information about density can be retrieved by properly assigning weights to the edges of the graph. The high density cluster is shown in red and the surrounding sparser clusters are shown in yellow, light blue, green and black.

2.3.2 Synthetic Data

I performed the following experiment to show the effect of edge weights on the performance of spectral clustering. As shown in Figure 2.9, I produced synthetic data containing one normal distribution with relatively high density surrounded by four relatively small clusters with lower densities. The number of points in each small cluster is less than 1% of the whole data, and noise is added to the data space uniformly (Figure 2.9a). For the central dense distribution, I set \u03c3_{xx} = \u03c3_{yy} = 2, \u03c3_{xy} = \u03c3_{yx} = 0, and the surrounding clusters are normal distributions with
\u03c3^1_{xx} = 0.08, \u03c3^1_{yy} = 0.30, \u03c3^1_{xy} = \u03c3^1_{yx} = 0,
\u03c3^2_{xx} = 0.07, \u03c3^2_{yy} = 0.08, \u03c3^2_{xy} = \u03c3^2_{yx} = 0,
\u03c3^3_{xx} = 0.50, \u03c3^3_{yy} = 0.10, \u03c3^3_{xy} = \u03c3^3_{yx} = 0,
\u03c3^4_{xx} = 0.10, \u03c3^4_{yy} = 0.70, \u03c3^4_{xy} = \u03c3^4_{yx} = 0.

After faithful sampling is done (Figure 2.9b), the sample points are distributed almost uniformly, and the information about the local density of the original data is lost. However, faithful sampling provides more information than only the sample points. It also returns the members of each community, and our data reduction scheme uses this information to assign weights to the edges. According to formulas 2.3 and 2.4, the more populated and the closer two communities are, the higher the weight between them will be (Figure 2.1c).
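The effect of formulas 2.3 and 2.4 on the edge weights can be checked on a small example. The sketch below uses hypothetical communities and \u03c3 = 1; the function names are illustrative and are not those of the SamSPECTRAL package.

```python
import math

def heat_kernel(p, q, sigma):
    # Equation 2.3: similarity of two points from their squared Euclidean distance.
    d2 = sum((a - b) ** 2 for a, b in zip(p, q))
    return math.exp(-d2 / (2.0 * sigma ** 2))

def community_similarity(c1, c2, sigma):
    # Equation 2.4: sum of all pairwise similarities, deliberately not normalized.
    return sum(heat_kernel(p, q, sigma) for p in c1 for q in c2)

# Hypothetical communities: two sparse ones, denser copies of them, and a distant one.
sparse_a = [(0.0, 0.0), (0.1, 0.0)]
sparse_b = [(1.0, 0.0), (1.1, 0.0)]
dense_a = sparse_a * 5   # same locations, five times as many members
dense_b = sparse_b * 5
far_b = [(3.0, 0.0), (3.1, 0.0)]

sigma = 1.0
s_sparse = community_similarity(sparse_a, sparse_b, sigma)
s_dense = community_similarity(dense_a, dense_b, sigma)
s_far = community_similarity(sparse_a, far_b, sigma)
```

Here s_dense is 25 times s_sparse because the number of member pairs grows from 4 to 100, while s_far is smaller than s_sparse because the heat kernel decays with distance. Dividing by the community sizes would remove exactly the first effect, which is the density information the weights are meant to carry.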
According to Figure 2.9c, this strategy is successful in retrieving information about local density, as all five clusters are distinguished properly by SamSPECTRAL.

2.3.3 Faithful and Uniform Sampling

I observed that some low density populations disappeared entirely when simple uniform sampling was employed. To investigate the effect of this phenomenon on the final clustering results, I performed an experiment on a sample of the stem cell dataset that contained 48,000 events in 3 dimensions. First, 3,000 data points were selected uniformly at random. Then, I assigned a label to each of these selected points by applying classical spectral clustering to them. Finally, for each original data point, the label of the closest selected point was taken as its cluster label. Figures 2.10d and 2.10c show the results of this approach and of SamSPECTRAL, respectively. The red population that was correctly distinguished by SamSPECTRAL in Figure 2.10c comprises only 4% of the data. This population could not be distinguished properly with any setting of the parameters after uniform sampling (Figure 2.10d).

2.3.4 SamSPECTRAL in FlowCAP

The goal of Flow Cytometry: Critical Assessment of Population Identification Methods (FlowCAP) is to advance the development of computational methods for the identification of cell populations in flow cytometry data. I collaborated in the FlowCAP project by running SamSPECTRAL on the provided samples to participate in a competition between state of the art methods for flow cytometry clustering. The organizers defined four challenges in analyzing five datasets, and compared the results of different methods in an objective manner by computing F-measures. The final results are reported in Table 2.1.

The appropriate challenges for SamSPECTRAL were the second and third ones because the participants (the developers of each method) had the opportunity to set the parameters for each dataset in the best way.
In the second challenge, SamSPECTRAL ranked as the second best algorithm with a score of 31, after the ADICyt method with a score of 34. Also, in challenge three, these two methods were jointly ranked first with a score of 26.2. Notably, SamSPECTRAL scored higher than FLAME and flowMerge, which provides third-party confirmation of the superiority of its performance over these two methods, as discussed in detail in this chapter.

Figure 2.10: Comparing uniform sampling with faithful sampling. Directly applying classical spectral clustering is not efficient on this sample of the stem cell dataset, which contains 48000 cytometry events in 3 dimensions. (a) Although only 2115 data points were selected by faithful sampling, each population has a considerable number of representatives in the selected points. (b) 3000 points were selected by uniform sampling. The low density population in the middle of the plot consists of only 55 sample points, resulting in this population being incorrectly mixed with a high density one (d). (c) The result of SamSPECTRAL on the original data is satisfactory because the low density red population and the other high density populations are identified properly.
Table 2.1: Compiled results of the FlowCAP competition
Columns: Method | F-measure^a per dataset (GvHD | DLBCL | HSCT | WNV | ND) | Overall | Runtime^b (hh:mm:ss) | Rank Score^d | Method type | Ref

Challenge 1:
ADICyt | 0.81 (0.72, 0.88)^c | 0.93 (0.91, 0.95) | 0.93 (0.90, 0.96) | 0.86 (0.84, 0.87) | 0.92 (0.92, 0.93) | 0.89 | 04:50:37 | 52 | Student's t | e
flowMeans | 0.88 (0.82, 0.93) | 0.92 (0.89, 0.95) | 0.92 (0.90, 0.94) | 0.88 (0.86, 0.90) | 0.85 (0.76, 0.92) | 0.89 | 00:02:18 | 49 | K-means | [4]
FLOCK | 0.84 (0.76, 0.90) | 0.88 (0.85, 0.91) | 0.86 (0.83, 0.89) | 0.83 (0.80, 0.86) | 0.91 (0.89, 0.92) | 0.86 | 00:00:20 | 45 | Density | [127]
FLAME | 0.85 (0.77, 0.91) | 0.91 (0.88, 0.93) | 0.94 (0.92, 0.95) | 0.80 (0.76, 0.84) | 0.90 (0.89, 0.90) | 0.88 | 00:04:20 | 44 | Skew t | [125]
SamSPECTRAL | 0.87 (0.81, 0.93) | 0.86 (0.82, 0.90) | 0.85 (0.82, 0.88) | 0.75 (0.60, 0.85) | 0.92 (0.92, 0.93) | 0.85 | 00:03:51 | 39 | Graph | [166]
MM&PCA | 0.84 (0.74, 0.93) | 0.85 (0.82, 0.88) | 0.91 (0.88, 0.94) | 0.64 (0.51, 0.71) | 0.76 (0.75, 0.77) | 0.80 | 00:00:03 | 29 | Density | [148]
FlowVB | 0.85 (0.79, 0.91) | 0.87 (0.85, 0.90) | 0.75 (0.70, 0.79) | 0.81 (0.78, 0.83) | 0.85 (0.84, 0.86) | 0.82 | 00:38:49 | 28 | Student's t | e
MM | 0.83 (0.74, 0.91) | 0.90 (0.87, 0.92) | 0.73 (0.66, 0.80) | 0.69 (0.60, 0.75) | 0.75 (0.74, 0.76) | 0.78 | 00:00:10 | 28 | Density | [148]
flowClust\/Merge | 0.69 (0.55, 0.79) | 0.84 (0.81, 0.86) | 0.81 (0.77, 0.85) | 0.77 (0.74, 0.79) | 0.73 (0.58, 0.85) | 0.77 | 02:12:00 | 24 | Student's t | [49, 89]
L2kmeans | 0.64 (0.57, 0.72) | 0.79 (0.74, 0.83) | 0.70 (0.65, 0.75) | 0.78 (0.75, 0.81) | 0.81 (0.80, 0.82) | 0.74 | 00:08:03 | 20 | K-means | e
CDP | 0.52 (0.46, 0.58) | 0.87 (0.85, 0.90) | 0.50 (0.48, 0.52) | 0.71 (0.68, 0.75) | 0.88 (0.86, 0.90) | 0.70 | 00:00:57 | 19 | Gaussian | [146]
SWIFT | 0.63 (0.56, 0.70) | 0.67 (0.62, 0.71) | 0.59 (0.55, 0.62) | 0.69 (0.64, 0.74) | 0.87 (0.86, 0.88) | 0.69 | 01:14:50 | 15 | Gaussian | [108]
Ensemble Clustering | 0.88 | 0.94 | 0.97 | 0.88 | 0.94 | 0.92 | - | 64 | Ensemble | [75]

Challenge 2:
ADICyt | 0.81 (0.71, 0.89) | 0.93 (0.91, 0.95) | 0.93 (0.90, 0.96) | 0.86 (0.84, 0.87) | 0.92 (0.92, 0.93) | 0.89 | 04:50:37 | 34 | Student's t | e
SamSPECTRAL | 0.87 (0.79, 0.94) | 0.92 (0.89, 0.94) | 0.90 (0.86, 0.93) | 0.85 (0.83, 0.88) | 0.91 (0.91, 0.92) | 0.89 | 00:06:47 | 31 | Graph | [166]
FLOCK | 0.84 (0.76, 0.90) | 0.88 (0.85, 0.91) | 0.86 (0.83, 0.89) | 0.84 (0.82, 0.86) | 0.89 (0.87, 0.91) | 0.86 | 00:00:15 | 23 | Density | [127]
FLAME | 0.81 (0.75, 0.87) | 0.87 (0.84, 0.90) | 0.87 (0.82, 0.90) | 0.84 (0.83, 0.85) | 0.87 (0.86, 0.87) | 0.85 | 00:04:20 | 23 | Skew t | [125]
SamSPECTRAL-FK | 0.87 (0.80, 0.94) | 0.85 (0.81, 0.89) | 0.90 (0.86, 0.92) | 0.76 (0.71, 0.81) | 0.92 (0.91, 0.93) | 0.86 | 00:04:25 | 23 | Graph | [166]
CDP | 0.74 (0.67, 0.80) | 0.89 (0.86, 0.91) | 0.90 (0.88, 0.92) | 0.75 (0.71, 0.78) | 0.86 (0.85, 0.88) | 0.83 | 00:00:18 | 19 | Gaussian | [146]
flowClust\/Merge | 0.69 (0.53, 0.78) | 0.87 (0.85, 0.90) | 0.96 (0.94, 0.97) | 0.77 (0.75, 0.79) | 0.88 (0.81, 0.91) | 0.83 | 02:12:00 | 18 | Student's t | [49, 89]
NMF-curvHDR | 0.76 (0.69, 0.82) | 0.84 (0.83, 0.86) | 0.70 (0.67, 0.74) | 0.81 (0.77, 0.84) | 0.83 (0.83, 0.84) | 0.79 | 01:39:42 | 13 | Density | [111]
Ensemble Clustering | 0.87 | 0.94 | 0.98 | 0.87 | 0.92 | 0.91 | - | 41 | Ensemble | [75]

Challenge 3 (three per-dataset F-measures, then Overall):
ADICyt | 0.91 (0.84, 0.96) | 0.96 (0.94, 0.97) | 0.98 (0.97, 0.99) | 0.95 | 00:10:49 | 26.2 | Student's t | e
SamSPECTRAL | 0.85 (0.75, 0.93) | 0.93 (0.91, 0.95) | 0.97 (0.95, 0.98) | 0.92 | 00:02:30 | 26.2 | Graph | [166]
flowMeans | 0.91 (0.84, 0.96) | 0.94 (0.91, 0.96) | 0.95 (0.93, 0.96) | 0.93 | 00:00:01 | 23.4 | K-means | [4]
TCLUST | 0.93 (0.91, 0.96) | 0.93 (0.91, 0.95) | 0.93 (0.90, 0.95) | 0.93 | 00:00:40 | 23.4 | Trimmed K-means | [56]
FLOCK | 0.86 (0.79, 0.93) | 0.92 (0.89, 0.94) | 0.97 (0.95, 0.98) | 0.92 | 00:00:02 | 22.2 | Density | [127]
CDP | 0.85 (0.77, 0.92) | 0.92 (0.89, 0.94) | 0.76 (0.72, 0.81) | 0.84 | 00:00:21 | 16.9 | Gaussian | [146]
flowClust\/Merge | 0.88 (0.82, 0.93) | 0.90 (0.86, 0.94) | 0.83 (0.79, 0.88) | 0.87 | 00:49:24 | 15.9 | Student's t | [49, 89]
FLAME | 0.85 (0.79, 0.91) | 0.90 (0.86, 0.93) | 0.86 (0.82, 0.91) | 0.87 | 00:03:20 | 15.9 | Skew t | [125]
SWIFT | 0.90 (0.84, 0.95) | 0.00 (0.00, 0.00) | 0.88 (0.84, 0.92) | 0.59 | 00:01:37 | 11.9 | Gaussian | [108]
flowKoh | 0.85 (0.80, 0.90) | 0.85 (0.82, 0.88) | 0.87 (0.84, 0.91) | 0.86 | 00:00:42 | 9.5 | Kohonen NN | e
NMF | 0.74 (0.69, 0.78) | 0.84 (0.80, 0.88) | 0.80 (0.76, 0.84) | 0.79 | 00:01:00 | 7.5 | Density | [111]
Ensemble Clustering | 0.95 | 0.97 | 0.98 | 0.97 | - | 35.0 | Ensemble | [75]

Challenge 4:
Radial SVM | 0.89 (0.83, 0.95) | 0.84 (0.80, 0.87) | 0.98 (0.96, 0.99) | 0.96 (0.94, 0.97) | 0.93 (0.92, 0.94) | 0.92 | 00:00:18 | 21 | SVM | e
flowClust\/Merge^f | 0.92 (0.88, 0.95) | 0.92 (0.89, 0.94) | 0.95 (0.92, 0.97) | 0.84 (0.82, 0.86) | 0.89 (0.88, 0.90) | 0.90 | 05:31:50 | 19 | Student's t | [49, 89]
randomForests | 0.85 (0.78, 0.91) | 0.78 (0.74, 0.83) | 0.81 (0.79, 0.83) | 0.87 (0.84, 0.90) | 0.94 (0.92, 0.95) | 0.85 | 00:02:06 | 15 | Random Forest | [23]
FLOCK | 0.82 (0.77, 0.87) | 0.91 (0.89, 0.93) | 0.86 (0.76, 0.93) | 0.86 (0.82, 0.89) | 0.86 (0.77, 0.92) | 0.86 | 00:00:05 | 13 | Density | [127]
CDP | 0.78 (0.68, 0.87) | 0.95 (0.93, 0.97) | 0.75 (0.71, 0.78) | 0.86 (0.84, 0.88) | 0.83 (0.80, 0.86) | 0.83 | 00:00:15 | 11 | Gaussian | [146]
Ensemble Clustering | 0.91 | 0.94 | 0.95 | 0.92 | 0.94 | 0.93 | - | 26 | Ensemble | [75]

2.4 Performance of SamSPECTRAL

Although the spectral clustering algorithm is a powerful technique, it cannot be directly applied to large datasets as it is computationally expensive in both time and memory. In this study, I developed a sampling method and combined it with spectral clustering by modifying the similarity matrix based on potential theory. As a result, for the first time, analyzing FCM data using spectral methods becomes possible and practical. I applied SamSPECTRAL to four different FCM datasets to demonstrate its applicability to a broad spectrum of FCM data, and compared its performance to two state of the art model-based clustering methods optimized for FCM data.

Since our novel method for clustering, SamSPECTRAL, is a multidimensional approach, it can identify overlapping populations that are generally hard to identify by manual gating, which uses sequential two dimensional visualizations of the data. SamSPECTRAL is the first method that has demonstrated the ability to correctly identify subpopulations of major FCM cell populations.
SamSPECTRAL can also properly distinguish subpopulations of a biological population in cases where flowMerge and FLAME do not produce satisfactory results.

An important challenge in analyzing FCM data is clustering data files that contain populations that differ significantly in density. Model-based techniques can produce errors in identifying a low density population close to denser populations because they typically make assumptions on the density of clusters [27]. Our experiments demonstrated that SamSPECTRAL can properly tackle this problem.

Besides the practical observations, this capability is justified by the following observation. Spectral methodology clusters the graph such that the normalized cut is “almost” optimum [29]. Now, assume that it can distinguish between two clusters when their densities are comparable. Then, if the size of the smaller cluster is reduced without change in its shape or distribution, the normalized cut between them remains similar because the numbers of vertices and edges reduce almost proportionally to each other. Therefore, the clusters remain distinguishable. This explains why the overall performance of SamSPECTRAL is independent of cluster densities as long as their shapes are preserved.

Since parametric methods such as FLAME and flowMerge make a priori assumptions on the distribution or shape of the clusters [27], they may fail in identifying populations with arbitrary shapes. Although flowMerge attempts to solve this issue by finding more clusters than needed and then merging them together, it still does not produce satisfactory results when the shape of the cluster is complex. SamSPECTRAL has the capability of identifying arbitrary shape clusters since it is a non-parametric approach that makes no assumptions on the shape and distribution of clusters, and clusters data based only on similarity between data points.
Compared to other non-parametric methods, our algorithm has the advantages of automatically identifying the number of clusters and having low sensitivity to the predefined thresholds. Therefore, users can adjust the parameters only once by running SamSPECTRAL on one or two random samples from an FCM data set. Then, the algorithm can be run on the rest of the data set without changing the parameters.

Not only does our sampling scheme increase the speed of spectral clustering without losing important biological information, but the resulting algorithm is also faster than the other methods considered in this study. More precisely, the running time of SamSPECTRAL is O(dmn) + O(m^3), where O(dmn) is the running time for building m communities from n points in d dimensions and O(m^3) is the running time for computing the eigenspace. After this step, the k-means clustering runs very fast, in time O(kmt), to find k clusters using eigenvectors in t iterations. In comparison, the time complexity of the original MCL method is O(nr^2) with no guaranteed upper bound on the number of iterations r other than n. Practically, for our model of FCM data where all pairs of data points are connected, I could not run MCL before applying our modification to it. Moreover, the running time of SamSPECTRAL is significantly less than that of model-based techniques. The running time of flowMerge is O(d^2 k^2 n t) and FLAME runs in time O(d^4 k l n t), where l is the number of times it runs to find the optimal number of clusters. In practice, I can keep m as small as 1500-3000 without losing important biological information, and consequently SamSPECTRAL ran at least 5-10 times faster than flowMerge and FLAME on the studied datasets. Furthermore, the time efficiency of our algorithm is more noticeable for higher dimensional data such as the one provided as additional file 2. This sample contains 100,000 events in 23 dimensions and SamSPECTRAL can analyze it in less than 25 minutes on a 2.7GHz processor.
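
The asymptotic running times quoted above can be compared with a few lines of arithmetic. The following is a minimal sketch, not a benchmark: it evaluates only the dominant terms and ignores constant factors, so the printed ratios say nothing about measured wall-clock times. The values of d, n and m are taken from the text, while k, t and l are hypothetical placeholders chosen only for illustration.

```python
def samspectral_cost(d, m, n):
    """Dominant term count for O(dmn) + O(m^3); constants are ignored."""
    return d * m * n + m ** 3


def flowmerge_cost(d, k, n, t):
    """Dominant term count for O(d^2 k^2 n t)."""
    return d ** 2 * k ** 2 * n * t


def flame_cost(d, k, l, n, t):
    """Dominant term count for O(d^4 k l n t)."""
    return d ** 4 * k * l * n * t


# d, n and m come from the text; k, t and l are hypothetical placeholders.
d, n, m = 23, 100_000, 3000
k, t, l = 10, 50, 5

base = samspectral_cost(d, m, n)
for name, cost in [
    ('SamSPECTRAL', base),
    ('flowMerge', flowmerge_cost(d, k, n, t)),
    ('FLAME', flame_cost(d, k, l, n, t)),
]:
    print(f'{name:12s} {cost:.3e}  ratio to SamSPECTRAL: {cost / base:.1f}')
```

Because the d^2 and d^4 factors grow quickly with dimension, the gap widens for higher dimensional data, which is in line with the remark above about the 23-dimensional sample.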
2.5 Implementation

To run flowMerge and FLAME optimally, I used several settings for their parameters, finally selecting those that gave the best results. Also, I participated in the FlowCAP project, a competition between state-of-the-art methods of flow cytometry clustering. Each method was run by its developers, who had the opportunity to set the parameters in the best way. The results were compared in an objective manner by computing F-measures as described in Section 2.3.4.

For the SamSPECTRAL algorithm, I set m = 3000 to keep the running time below 1 minute on a 2.7GHz processor, and the obtained results remained satisfactory for all samples I analyzed and presented in this chapter. The separation factor and the scaling parameter (σ) are the two main parameters that needed to be adjusted. Decreasing σ and increasing the separation factor will result in identifying more populations. In particular, if σ is decreased, then according to the heat kernel formula, the weights on the edges of the graph will decrease exponentially. Therefore, the graph will be sparser and will tend to have more partitions. In consequence, the algorithm identifies more spectral clusters. This phenomenon can be useful in identifying rare populations. On the other hand, if the separation factor is too high, a single population may be split into parts. In our experiments, I applied SamSPECTRAL to one or two random data samples of a data set and tried different values. Then, the selected parameters were fixed and used to apply SamSPECTRAL to the rest of the data samples. The parameter values for the data sets presented in this paper are provided in additional files 3, 4, 5, and 6. See Appendix A for more explanation on how to adjust parameters for a given data set. The SamSPECTRAL package vignette is also routinely updated at Bioconductor.

2.6 Conclusions

Faithful sampling is based on potential theory.
It reduces the size of the input for spectral clustering algorithms, and consequently they can now be efficiently applied to FCM data in spite of its large size. Practically, our approach demonstrated significant advantages in proper identification of populations with non-elliptical shapes, low density populations close to dense ones, minor subpopulations of a major population, rare populations, and overlapping populations. No state-of-the-art method can solve the challenges in identifying populations with the above properties simultaneously. Moreover, applying SamSPECTRAL to other biological data such as microarrays and protein databases may result in significant improvements in gene expression and protein classification.

The bottleneck in applying SamSPECTRAL to real life problems is the requirement for parameters to be set properly, because it might produce undesirable output if default parameters are used without caution. Also, for very high dimensional data (say, more than 10 dimensions) that contains relatively few populations (say, fewer than 5), our approach for automatically identifying the number of clusters may fail. Also, for two or three dimensional data with elliptical shape populations, parametric approaches are preferred.

Besides, our faithful sampling algorithm can have interesting applications by itself. For instance, it can be used to reduce the size of the input for other clustering algorithms that are based on spectral graph theory, such as the Markov Clustering Algorithm (MCL), electrical circuit based clustering, and agent based graph clustering [59]. I have shown the extendibility of our approach in this sense by substituting classic spectral clustering with MCL, a method that has many applications in bioinformatics.
Other interesting directions for future work include applying other schemes for estimating similarities between communities, combining clusters based on other combinatorial algorithms or biological criteria, and repeating the algorithm several times to obtain a more stable outcome.

The knee point approach relies on phase transition. While some other FCM clustering techniques such as flowMerge and flowMeans are also based on phase transition to determine the number of clusters, the theoretical bases for these approaches are different. The SamSPECTRAL strategy relies on the relation between the number of connected components of a graph and its spectrum (Theorem 2.2.4), whereas flowMerge and flowMeans consider the Bayesian information criterion and the Mahalanobis distance, respectively. Alternative approaches include considering the percentage of variance vs. the number of clusters, maximizing the jump in distortion of clustering, and Silhouette. However, because determining the “real” number of clusters is subjective and mostly depends on the biological application, these approaches are best compared when they are applied in real life use cases.

As described in the next chapters, I used SamSPECTRAL to address the challenge of clustering FCM data in order to analyze a lymphoma data set. Also, matching the clusters across the patients was done by applying SamSPECTRAL.

Chapter 3

Feature Scoring

To build a robust classifier, the number of training instances is usually required to be more than the number of features; otherwise the model can become overfit. The Lasso is one of many regularization methods that have been developed to prevent overfitting and improve prediction performance in high-dimensional settings. In this chapter, I propose a novel algorithm for feature selection based on the Lasso. My hypothesis is that defining a scoring scheme that measures the “quality” of each feature can provide a more robust feature selection method.
My approach is to generate several samples from the training data, determine the best relevance-ordering of the features for each sample, and finally combine these relevance-orderings to select highly relevant features. In addition to the theoretical analysis of my feature scoring scheme, I provide an empirical evaluation using a lymphoma dataset that confirms the superiority of my method in exploratory data analysis and also in prediction performance.

3.1 Introduction

To build a robust classifier, the number of training instances is required to be more than the number of features. In many real life applications such as bioinformatics, natural language processing, and computer vision, many features might be provided to the learning algorithm without any prior knowledge on which ones should be used. Therefore, the number of features can drastically exceed the number of training instances. For instance, the number of samples for an FCM analysis can be limited to 100-300 while the number of features can be higher by a factor of two, i.e. in the range 200-600.

Many regularization methods have been developed to prevent overfitting and improve the generalization error bound of the predictor in this learning situation [152, 168, 169]. Most notably, the Lasso [152] is an ℓ1-regularization technique for linear regression which has attracted much attention in machine learning and statistics. Although efficient algorithms exist for recovering the whole regularization path [44] for the Lasso, finding a subset of highly relevant features which leads to a robust predictor is a prominent research question [64]. In particular, it is difficult to find the “right” number of features because they can be dependent on each other. Also, the best value for the regularization parameter might vary significantly when the training set changes. I propose a feature scoring scheme to address these challenges based on the combinatorial analysis of the Lasso.
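
To make the role of the regularization parameter concrete, here is a minimal sketch of the Lasso solved by cyclic coordinate descent with soft-thresholding. This is a swapped-in textbook solver, not the Lars algorithm [44] used in this chapter. The function names, the objective scaling (1/(2n))·||y − Xw||^2 + λ·||w||_1, and the toy data are assumptions made only for illustration.

```python
def soft_threshold(rho, lam):
    """Shrink rho towards zero by lam; values within [-lam, lam] become 0."""
    if rho > lam:
        return rho - lam
    if rho < -lam:
        return rho + lam
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iter=200):
    """Minimise (1/(2n)) * ||y - Xw||^2 + lam * ||w||_1 by cyclic updates."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    # z[j] is the mean squared value of column j.
    z = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if z[j] == 0.0:
                continue
            rho = 0.0
            for i in range(n):
                # Prediction from every feature except j.
                partial = sum(X[i][q] * w[q] for q in range(p) if q != j)
                rho += X[i][j] * (y[i] - partial)
            w[j] = soft_threshold(rho / n, lam) / z[j]
    return w


def selected_features(w, tol=1e-12):
    """Indices of features with a non-zero coefficient."""
    return [j for j, wj in enumerate(w) if abs(wj) > tol]


# Toy data (hypothetical): orthogonal columns, y = 3*x0 + 0.5*x1, x2 unused.
X = [[1, 1, 1], [1, -1, -1], [-1, 1, -1], [-1, -1, 1]]
y = [3.5, 2.5, -2.5, -3.5]
for lam in (0.1, 1.0, 5.0):
    w = lasso_coordinate_descent(X, y, lam)
    print(lam, selected_features(w))
```

On this toy design, fewer features survive as λ grows, which is the monotone behaviour along the regularization path that the feature scoring scheme exploits.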
3.1.1 My Contributions

I pursued the line of research discussed in section 1.4.2 and developed the “FeaLect”^1 algorithm, which is softer than Bolasso in three directions: (1) For each bootstrap sample, Bolasso considers only one model that minimizes the global error L, but I include information provided by the whole regularization path; (2) Instead of making a binary decision of inclusion or exclusion, I compute a score value for each feature that can help the user to select the more relevant ones; (3) While Bolasso-S relies on a threshold of 90%, my theoretical study of the behavior of irrelevant features leads to an analytical criterion for feature selection without using any pre-defined parameter.

I compared the performance of FeaLect with known methods on real datasets. In particular, I analyzed data from the “BC Cancer Agency” that is publicly available in the FeaLect package. As well, the source code that implements my algorithm is released as a documented R package through CRAN^2.

^1 FeaLect stands for Feature seLection.
^2 http://cran.r-project.org/web/packages/FeaLect/index.html

This chapter is organized as follows. In section 1.4.3, I present the general background on the Lasso and develop the concepts that are used in my feature selection algorithm in Section 3.2.1 and in my mathematical analysis in Section 3.2.2. Section 3.3 presents the empirical evaluation on the real-life lymphoma classification problem, and section 3.4 concludes the chapter.

3.2 Feature Scoring and Mathematical Analysis

3.2.1 The Algorithm

Let B be a random sample of size γn generated by choosing without replacement from the given training data A, where n = |A| and γ ∈ (0,1) is a parameter that controls the size of the sample sets. Using B as a training set, we recover the whole regularization path using the Lars algorithm [44]. Let F^B_k be the set of features selected by the Lasso when λ allows exactly k features to be selected.
The number of selected features is decreasing in λ and we have:

∅ = F^B_0 ⊂ ... ⊂ F^B_k ⊂ F^B_{k+1} ⊂ ··· ⊂ F^B_p = F.

For each feature f, we define a scoring scheme depending on whether or not it is selected in F^B_k:

S^B_k(f) := 1/k if f ∈ F^B_k, and 0 otherwise.    (3.1)

The above randomized procedure is repeated several times^3 for various random subsets B to compute the average score of f when exactly k features are selected, i.e. E_B[S^B_k(f)] is estimated (Figure 3.1). The total score for each feature is defined as the sum of average scores:

S(f) := ∑_k E_B[S^B_k(f)]    (3.2)

^3 According to my experiments, the convergence rate to the expected score is fast and there is no significant difference between the average scores computed by 100 or 1000 samples (Figure 3.2).

Figure 3.1: Overview of bootstrapping performed by FeaLect. A row and a column of the gray data matrix correspond to a feature and a case, respectively. 1000 models were trained, each fitted to a random subset that contains 3/4 of the cases using the Lasso technique [152]. Without any assumption from biology, medicine, or other a priori knowledge, all of the features were included for training the models. Then the selected features are scored by computing an average vote to select the most predictive ones.

Before describing the rest of the algorithm, let us have a look at the feature scores for my lymphoma classification problem. Figure 3.2 depicts the total score of features in log-scale, where features are sorted in increasing order of their total scores. The feature score curve is almost linear in the middle. I hypothesized that features with a “very high score” in the top non-linear part of the curve are good candidates for informative features.
Furthermore, the linear middle part of the curve consists of features that are responsible for the model becoming overfitted, and therefore we call them overfitting features. In the next section, a formal definition will be provided to clarify this intuitive idea and I show how this insight can be helpful in identifying informative features.

The final step of my feature selection algorithm is to fit a 3-segment spline model to the feature score curve: the first quadratic part captures the low score features, the middle linear part captures overfitting features, and the last quadratic part captures high score informative features. The middle linear part provides an analytic threshold for the score of overfitting features. The features with a score above this threshold are reported as informative features that can be used for training the final predictor and/or exploratory data analysis.

Algorithm 2 Feature Scoring
 1: for t = 1 to m do
 2:     Sample (without replacement) a random subset B from A with size γ|A|
 3:     Run Lars on B to obtain F^B_1, ..., F^B_p
 4:     Compute S^B_1, ..., S^B_p using eqn (3.1)
 5:     for k ∈ {1, ..., p} do
 6:         Update the feature scores for all features f: S(f) ← S(f) + S^B_k(f)/m
 7:     end for
 8: end for
 9: Fit a 3-segment spline (g1(.), g2(.), g3(.)) on the log-scale feature score curve (see Section 3.2.1 for more information)
10: return features corresponding to g3 as informative features

Figure 3.2: Total feature scores in the log-scale, panels (a) and (b). The middle part of the curves is linear and represents scores of the overfitting features (see section 3.2.2). The scores in the (a) and (b) diagrams are computed by 1000 and 5000 samples, respectively. The low variance between the diagrams indicates fast convergence and stability of the score definition. Also, the 30 top features are scored in the same order.

3.2.2 The Analysis

I first present a probabilistic interpretation for the score function, and then I provide a precise definition of an overfitting feature.
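
As a companion to steps 1-8 of Algorithm 2 in Section 3.2.1, the scoring loop can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the FeaLect package code: it assumes the selected sets are nested and grow by one feature at a time, so a feature entering the path at position j belongs to F^B_k for every k ≥ j. The function `marginal_entry_order` is a hypothetical stand-in for Lars that simply orders features by absolute covariance with the response. All function and parameter names are illustrative, and the spline fit of step 9 is not covered.

```python
import random


def path_scores(entry_order):
    """Sum of eq. (3.1) over k for one subsample.

    A feature entering at position j (1-based) lies in F_k for all k >= j,
    so it collects 1/j + 1/(j+1) + ... + 1/p.
    """
    p = len(entry_order)
    tail = 0.0
    scores = {}
    for j in range(p, 0, -1):  # walk backwards so tail = sum_{k=j}^{p} 1/k
        tail += 1.0 / j
        scores[entry_order[j - 1]] = tail
    return scores


def fealect_scores(data, entry_order_fn, gamma=0.75, m=100, seed=0):
    """Average the per-subsample scores over m subsets drawn without replacement."""
    rng = random.Random(seed)
    size = max(1, int(gamma * len(data)))
    total = {}
    for _ in range(m):
        subset = rng.sample(data, size)  # sampling without replacement
        for f, s in path_scores(entry_order_fn(subset)).items():
            total[f] = total.get(f, 0.0) + s / m
    return total


def marginal_entry_order(subset):
    """Hypothetical stand-in for Lars: rank features by |covariance with y|."""
    n = len(subset)
    p = len(subset[0][0])
    y_mean = sum(y for _, y in subset) / n
    strength = []
    for j in range(p):
        x_mean = sum(x[j] for x, _ in subset) / n
        cov = sum((x[j] - x_mean) * (y - y_mean) for x, y in subset) / n
        strength.append((abs(cov), j))
    strength.sort(key=lambda s: (-s[0], s[1]))
    return [j for _, j in strength]


# Toy data (hypothetical): the response depends on feature 0 only.
data = [([i - 9.5, (-1) ** i, (2 * i) % 5 - 2], 5 * (i - 9.5)) for i in range(20)]
scores = fealect_scores(data, marginal_entry_order, gamma=0.75, m=50, seed=1)
print(sorted(scores.items(), key=lambda kv: -kv[1]))
```

Under the nesting assumption, each subsample distributes a total score of exactly p across the features, so the final scores sum to p. This is the normalisation that lets S(f)/p be read as a probability.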
My definition formalizes the fact that such a feature is selected if and only if a particular subset U of instances is included in the training set of the Lasso. Then, I compute the probability that a random sample B contains U as n grows to infinity. Finally, I show that asymptotically, the log of the scores for overfitting features is linear in |U|. This explains the linearity of the middle part of the feature score curve in Figure 3.2.

Proposition 11. Suppose Pr(f = f_i) is the probability of selecting a feature f_i by the Lasso in some stage of my feature selection method in Algorithm 2. Then, the probability distribution of the random variable f is given by:

Pr(f = f_i) = (1/p) S(f_i)

Proof. By conditional probability:

Pr(f = f_i) = ∑_B Pr(f = f_i | B) Pr(B)
            = ∑_B ∑_{k=1}^{p} Pr(f = f_i | f ∈ F^B_k) Pr(f ∈ F^B_k) Pr(B)
            = ∑_B ∑_{k=1}^{p} S^B_k(f_i) Pr(f ∈ F^B_k) Pr(B)

Since we have not imposed any prior assumption, we put a uniform distribution on Pr(f ∈ F^B_k) to get:

Pr(f = f_i) = (1/p) ∑_B ∑_k S^B_k(f_i) Pr(B) = (1/p) E_B(∑_k S^B_k(f_i)) = (1/p) S(f_i).

The following definition formalizes the idea that overfitting features depend only on a specific subset of the whole data set.

Definition 12. For any subset of samples U ⊆ A and any feature f_i, we say that f_i over-fits on U if:

∀k, ∀B: f_i ∈ F^B_k ⇔ U ⊆ B

In words, f_i is selected in F^B_k if and only if B contains U.

Next, we derive the probability of including a specific set U in a randomly generated sample.

Lemma 13. For any U ⊆ A, we have:

lim_{n→∞} Pr_B(U ⊆ B) = γ^r

where r is the number of samples in U and γ is the fraction of samples chosen for a random set B.

Proof. Assuming B has γn members chosen without replacement, and writing C(a, b) for the binomial coefficient, we have:

Pr_B(U ⊆ B) = C(n−r, γn−r) / C(n, γn)
            = (n−r)! (γn)! / (n! (γn−r)!)
            = (∏_{i=1}^{n−r} i) (∏_{i=1}^{γn} i) (∏_{i=1}^{n} i^{−1}) (∏_{i=1}^{γn−r} i^{−1})
            = (∏_{i=1}^{n−r} i) · (∏_{i=1}^{γn−r} i) · (∏_{i=γn−r+1}^{γn} i) × (∏_{i=1}^{n−r} i^{−1}) · (∏_{i=n−r+1}^{n} i^{−1}) · (∏_{i=1}^{γn−r} i^{−1})
            = (∏_{i=γn−r+1}^{γn} i) · (∏_{i=n−r+1}^{n} i^{−1})
            = ∏_{i=0}^{r−1} [(γn − i)(n − i)^{−1}]
            = γ^r ∏_{i=0}^{r−1} (n − i/γ)/(n − i)
            = γ^r ∏_{i=0}^{r−1} (1 + i(1 − 1/γ)/(n − i))
            = γ^r (1 + O(n^{−1})).

The claim derives from the fact that γ is a fixed constant.

The first line of the above proof relies on the assumption that the members of the random set B are chosen without replacement.

The following theorem concludes my argument for the exponential behaviour of the total score of overfitting features. It relates the probability of selecting a feature f_i overfitting on U to the probability of including U in the sample.

Theorem 14. If a feature f_i over-fits on a set of samples U with size r, then:

lim_{n→∞} S(f_i) = p γ^r.

Proof. From Proposition 11 we have:

S(f_i) = p Pr(f = f_i) = pl ∑_{l,B} Pr(f_i ∈ F^B_l) Pr(B) = p Pr_B(U ⊆ B) = p(γ^r + O(n^{−1})).

The last equation was proved in Lemma 13.

3.3 Experiment with Real Data

Because lymphoma has various types and subtypes [74], usually 15∼30 markers are used for each patient to discriminate the exact diagnosis. I applied my algorithm to analyze lymphoma types based on FCM data [70]. In this experiment, I analyzed FCM data of 85 lymphoma patients who had been diagnosed at the BC Cancer Agency, Vancouver, Canada between 2004-2007.
They were diagnosed with diffuse large B-cell lymphoma (DLBCL) (25 cases), follicular lymphoma (FOLL) (49 cases), mantle cell lymphoma (MCL) (8 cases), and small lymphocytic lymphoma (SLL) (15 cases). The available FCM panel included the following markers using three color cytometry: CD5, CD19, CD3, CD10, CD11c, CD20, FMC7, CD23, CD20, CD7, CD4, CD8, CD45, CD14, kappa, lambda, plus the forward scatter (FSC) and side scatter (SSC) values. These 85 cases were selected uniformly at random from the cases diagnosed between 2004-2007 because during this time period the settings of the instruments were stable and did not change significantly. Also, the patients of types with occurrence less than 2 cases per year were excluded because applying my statistical analysis was not possible for such rare lymphoma types.

The patients were grouped into four top-level disease subgroups and the goal was to build a classifier that could diagnose 20 random test patients based on their FCM data, given that they belong to one of these four studied types. For each group, I trained a classifier to distinguish that group versus the others. These four classifiers were then combined to provide the top-level diagnosis.

3.3.1 Data Preparation and Feature Extraction

The blood sample of each patient was divided into 7 portions, and each portion was examined in a different tube by the cytometer. Each tube gives 5 dimensional data of 20,000-70,000 blood cells. In the first analysis step, I used the SamSPECTRAL algorithm to cluster the cells in each tube into cell populations. While it was not possible to directly apply classical spectral clustering [11, 113, 156, 157] to the lymphoma data because it involved computing eigenvectors of a big n-by-n matrix where n ranges from 20,000 to 70,000, SamSPECTRAL was capable of clustering large amounts of data in a reasonable amount of time (minutes) and memory (less than 1 GB) [165]
by performing faithful sampling, a specific sampling stage that reduces the size of the data for spectral clustering.

Each cluster computed by SamSPECTRAL was regarded as a “cell population” that could potentially have information about the lymphoma type. Without imposing any a priori knowledge on the importance of any population, I considered their sizes and the coordinates of their means as features. In total, 276 features were obtained and, ignoring those with very low variance, 224 were kept for feature selection and classification.

3.3.2 Feature Selection and Classification

To improve feature selection, we note that each run of the cross-validation gives us the value of the regularization parameter and the corresponding selected features. The idea of my algorithm in Section 3.2.1 and Bolasso [12] is to combine these selected feature sets. However, my algorithm is different from Bolasso in that (i) I bootstrap by sampling without replacement, and (ii) unlike Bolasso, which takes the intersection of these feature sets, I have a different method to combine these feature sets, yielding a soft feature selection technique. In the following section, I describe the feature ordering and feature selection algorithms.

Since the number of features was considerably larger than the number of training samples (p = 224, n = 85), a careful feature selection scheme was needed. I developed and applied a novel feature selection technique based on FeaLect to address this challenge and build a classifier as explained in section 4.1.2.

I observed that, unlike the pure Lasso, my approach selected all features that were known to be biologically and clinically interesting [78]. For instance, the CD23, FMC7, and CD20 markers were included, which are known to be discriminative in differentiating between MCL and SLL, as discussed in Chapter 4.2. Prediction accuracy was improved, confirming the efficiency of my feature selection method. I used my selected features to build a linear classifier that had precision, recall and F-measure of 98%, 94% and 96%, respectively, while the best result I obtained with the pure Lasso was 93%, 82% and 87%, respectively.

Furthermore, the features that were selected by my method and were not previously reported to be biologically relevant are under study by my clinical collaborators to discover novel biomarkers. One challenge in such a study is the high number of potential features, and my approach can assist them by scoring features according to their relevance to lymphoma types. Bolasso could not help in selecting all relevant features in the current study because it is too restrictive and, in practice, ignores potentially relevant informative features.

3.3.3 Additional Real Datasets from UCI Repository

I validated the performance of FeaLect on four datasets in addition to our lymphoma flow cytometry data and the well-known colon gene expression benchmark dataset (Table 3.1). All additional datasets are from the UCI (University of California, Irvine) Machine Learning Repository [53]. Arcene contains mass-spectrometric data for cancer and normal cases [65]. The variables of SECOM were collected from sensors and process measurements in complex modern semi-conductors with the goal of enhancing current business improvement techniques [53]. I used a version of the SECOM dataset balanced by randomly selecting equal numbers of positive and negative samples. The learning task for the Connectionist dataset is to train a network to discriminate between sonar signals bounced off a metal cylinder and those bounced off a roughly cylindrical rock [62]. ISOLET is a natural language processing dataset generated from the speech of 150 subjects with the goal of identifying which letter-name was spoken [48]. In the current study, I only considered letters A and B as positive and negative samples and excluded the rest of the samples.

Figure 3.3: [Plots: (a) Colon, (b) Lymphoma; x-axis: number of selected features; y-axis: AUC on test set.] Variation of area under the ROC curve when different numbers of features are used. The features are sorted by applying FeaLect on 20 random training samples. Then, the training samples and the highly scored features are used to build linear classifiers by Lars. The best AUC is reported by testing on a set of validation samples disjoint from the training set. For both the lymphoma and colon datasets, the performance of the optimum classifier decreases if all features are provided to Lars. This observation practically shows the advantage of using a limited number of highly scored features over pure Lars.

Figure 3.4: [Plots: ROC curves; (a) Colon, 41 cases, AUC Lars 0.53 and FeaLect 0.69; (b) Lymphoma, 238 cases, AUC Lars 0.79 and FeaLect 0.8; x-axis: false positive rate; y-axis: true positive rate.] Comparing ROC curves between FeaLect and Lars. The blue curve represents the ROC curve of the best Lasso model trained on 20 random samples using all available features, and the red curve shows the performance of the best Lasso model when only the 61 and 36 top features are provided from the colon and lymphoma datasets, respectively. While FeaLect always performs better than pure Lars, the difference is more significant for the colon dataset than for the lymphoma dataset.

Figure 3.5: [Plots: (a) Colon, (b) Lymphoma; x-axis: number of training samples; y-axis: AUC; curves: FeaLect, Lars, Bolasso.] Improvements in the area under the ROC curves by increasing the number of training samples. Except for Bolasso on the colon dataset, the average performance increases as more training samples are provided. While FeaLect and Lars converge to a common asymptotic performance on the lymphoma dataset, FeaLect is consistently superior to pure Lars on the colon dataset because the number of training samples is very limited. Table 3.1 presents similar superiority for other datasets with relatively few instances.

Table 3.1: Comparison of area under the ROC curve between FeaLect, Lars, and Bolasso on six different datasets.

Dataset | Total # of samples | Total # of features | 20 training samples: Bolasso | Lars | FeaLect | 40 training samples: Bolasso | Lars | FeaLect | Reference
Lymphoma | 258 | 505 | 0.62 | 0.81 | 0.84 | 0.67 | 0.87 | 0.88 | current
Colon | 62 | 2000 | 0.50 | 0.57 | 0.65 | 0.47 | 0.64 | 0.75 | [7]
Arcene | 100 | 10000 | 0.51 | 0.59 | 0.64 | 0.50 | 0.66 | 0.72 | [65] (UCI)
SECOM | 208 | 590 | 0.51 | 0.57 | 0.61 | 0.52 | 0.61 | 0.64 | [53] (UCI)
Connectionist | 208 | 60 | 0.63 | 0.76 | 0.78 | 0.67 | 0.78 | 0.79 | [62] (UCI)
ISOLET | 479 | 617 | 0.90 | 0.99 | 1.00 | 0.91 | 1.00 | 1.00 | [48] (UCI)

Table 3.1 compares the performance of Bolasso, pure Lars, and FeaLect on the studied datasets. Training samples were selected uniformly at random and the areas under the ROC curves (AUC) were computed on the rest of the samples (Figure 3.3). For each dataset, we repeated this procedure 100 times and reported the average AUC to avoid any dependency on the random selection of train-test sets. Both FeaLect and Lars always outperformed Bolasso. When only 20 random training samples were provided, FeaLect performed significantly better than pure Lars on every dataset except ISOLET. The number of samples in the ISOLET dataset is larger than in the other datasets, enough that both methods performed well. The difference between FeaLect and Lars decreases as the number of training samples increases to 40, except for the Colon and Arcene datasets.
Interestingly, these two datasets, with 2000 and 10000 features, are considerably higher dimensional than the other datasets. This observation supports the view that FeaLect is advantageous over lars in high-dimensional settings and that their performance converges once an \"adequate\" number of samples is provided (Figures 4.15 and 3.5).
3.4 Conclusion
The proposed statistical framework is capable of detecting overfitting features. I observed that the middle part of the score curve in the log plot is almost linear, and my theoretical analysis explains the behavior of this curve. Furthermore, I provided empirical evaluations on real data which confirm the superiority of my method, in exploratory data analysis and prediction performance, compared to the baselines. That is, my method leads to selecting features that are more meaningful according to a human expert, as well as to building a more accurate and more robust predictor (see Chapter 4.2 for quantitative details).
An advantage of my feature selection method over Bolasso is that it provides a ranking of feature relevance that can be used in prediction and\/or exploratory data analysis. For example, in the cancer classification problem that I studied in Section 3.3 (lymphoma diagnosis), distinguishing the most relevant features is of great interest from the biological and clinical point of view because it can provide clinicians with biomarkers for lymphoma subtype diagnosis. My approach is capable of correctly identifying such relevant features. The log-scale score curve can be studied in more detail, and explaining its behavior in the non-linear parts is potentially a source of insight.
If the number of features is relatively low (in the range 10-20 with a couple of hundred samples), care should be taken in applying FeaLect as it may include more features than required. In such applications, Bolasso may be preferred over FeaLect.
Although I presented the analytical arguments for the Lasso, they may also work for any other feature selection algorithm which exhibits linearity in its feature score curve. That is, features corresponding to the linear part of the scoring curve are indeed the overfitting features for that algorithm.
Chapter 4
Application to Lymphoma Diagnosis
In this thesis, I studied FCM data of 248 lymphoma patients who were diagnosed with either DLBCL (63, 25%), Follicular (141, 57%), MCL (14, 6%), or SLL (30, 12%) at the BC Cancer Agency, Vancouver, Canada between 2004-2007 (third time period). These 248 cases include all patients in this time period except those belonging to types with an occurrence of less than 2 cases per year, as applying my statistical analysis was not possible for such rare lymphoma types. Final pathologic diagnoses for all cases were determined by expert staff hematopathologists at the BC Cancer Agency after integration of findings from biopsy histology (H&E morphology), immunohistochemistry including cyclin D1 staining where indicated, flow cytometry, FISH analysis where indicated, and clinical history. The routine flow cytometry diagnostic panel comprised seven tubes, each containing three different FCM markers measured for tens of thousands of cells (see Section 4.1.1).
I combined the tools that I have developed for analyzing FCM data to design an automatic pipeline to analyze lymphoma cases based on their FCM data. The patients were grouped into four top-level disease groups that are abundant non-Hodgkin lymphoma types: DLBCL, FOLL, MCL, and SLL. Usually 15\u223c30 markers are used for each patient, where each marker distinguishes a particular cell type based on its protein content. The goal was to build a classifier that analyzed FCM data of cases from the above-mentioned lymphoma types and could objectively assign the probability of any of those cases being diagnosed with any of these four types.
I applied a one-vs-all strategy; for each group, I trained a classifier to distinguish that group from the others. Each classifier computes the probability of belonging to the corresponding type of lymphoma for any given case. The final diagnosis (label) of each case can be predicted by assigning the type which obtains the maximum probability.
I divided the samples into four random subsets of equal size, each containing 62 cases, and applied four-fold cross-validation. The training-test procedure was repeated four times, and in each iteration the test set was completely disjoint from the training set. After performing all experiments, the average performance of all 4 tests was reported by computing the mean of accuracy, sensitivity, and specificity.
4.1 Automatic Methodology
4.1.1 Data Preparation and Feature Extraction
Flow cytometry data preparation was performed at the BC Cancer Agency by laboratory staff. The blood sample of each patient was divided into 6 portions and each portion was examined in a different tube by the cytometer. More specifically, the three-color FCM data included the following tubes, all of which included FSC and SSC:
\u2022 a tube with CD5, CD19, CD3, FSC, and SSC,
\u2022 a tube with CD10, CD11c, CD20, FSC, and SSC,
\u2022 a tube with FMC7, CD23, CD19, FSC, and SSC,
\u2022 a tube with CD7, CD4, CD8, FSC, and SSC,
\u2022 a tube with CD45, CD14, CD19, FSC, and SSC,
\u2022 a tube with kappa, lambda, CD19, FSC, and SSC.
Each tube gives 5-dimensional data of 20,000-70,000 blood cells. In the first automatic analysis step, I used my enhanced spectral clustering approach to cluster the cells in each tube into cell populations. It was not possible to directly apply classical spectral clustering [11, 113, 156, 157] to the lymphoma data because it involved computing eigenvectors of a huge n-by-n matrix where n ranges from 20,000 to 70,000.
Therefore, I applied SamSPECTRAL, my enhanced spectral clustering method capable of analyzing large amounts of data in a reasonable amount of time (minutes) using an affordable amount of memory (less than 1 GB).
[Figure 4.1 diagram labels: multidimensional clustering; populations in multidimensional space; algorithm identifies the most informative markers.]
Figure 4.1: Schematic of my automated analysis procedure on two cases diagnosed with different lymphoma types. Discrete populations as defined by SamSPECTRAL clustering are graphically depicted in three-dimensional space by different colors. Three- and eight-color datasets were expressed in 5- and 10-dimensional space, respectively, since each also included FSC and SSC parameters. Each colored cluster illustrates a biological population of cells which express specific intracellular or surface proteins. The abundance of events (cells) within each cluster as well as the mean fluorescence intensity for expression of all immunophenotypic markers define the feature coordinates for each population. Features for all identified populations were compared between cases of different types to determine those with the greatest discriminative value.
Each cluster computed by SamSPECTRAL was regarded as a \"cell population\" that could potentially have information about the lymphoma type. I obtained the coordinates of the center of each cell population by computing the MFI values in all dimensions. To match the populations across the patients, I clustered the centers of all populations of all studied cases using SamSPECTRAL (Figures D.1-D.6). All populations that were grouped together in this way were considered biologically identical. Following the above cluster matching stage, and without imposing any a priori assumption on the importance of any population, I considered the size and position (i.e., coordinates of the centers) of populations as FCM features (Figure 4.2).
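The feature construction described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name population_features, the toy events, and the cluster labels are hypothetical, and the labels merely stand in for the output of a clustering step such as SamSPECTRAL, which is not called here. For each population the sketch returns its relative size followed by the mean intensity in every dimension, that is, 6 values for 5-dimensional tube data.

```python
from collections import defaultdict

def population_features(events, labels):
    # events: list of equal-length tuples, one per cell (e.g. FSC, SSC and three markers)
    # labels: one cluster id per event; assumed to come from a separate clustering step
    groups = defaultdict(list)
    for event, label in zip(events, labels):
        groups[label].append(event)
    total = len(events)
    features = {}
    for label, members in groups.items():
        size = len(members) / total  # fraction of all events in this population
        dims = len(members[0])
        # center of the population: mean intensity in each dimension
        center = [sum(m[d] for m in members) / len(members) for d in range(dims)]
        features[label] = [size] + center
    return features

# Hypothetical toy data: four cells, five dimensions, two populations
events = [
    (200, 100, 10, 20, 30),
    (220, 120, 30, 40, 50),
    (800, 600, 500, 400, 300),
    (800, 600, 700, 600, 500),
]
labels = [0, 0, 1, 1]
feats = population_features(events, labels)
print(feats[0])
```

Keeping size as the first entry and the center coordinates after it mirrors the one-size-plus-five-coordinates layout described in the text; any other ordering would work equally well as long as it is the same for every case.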
In total, 540 features were obtained (footnote 1), and after ignoring those with very low variance (features obtaining fewer than two non-zero values over all samples), 505 were kept for feature selection and classification.
Footnote 1: Because the number of populations in each tube varied between 10-20, and 6 features were obtained from each population (5 dimensions to indicate the center of the population plus one feature for the size of the population), the number of features obtained from each tube varied in the range 60-120. The total number of features summed up to 540 for all 6 tubes.
[Figure 4.2 diagram labels: stained samples of an individual patient; clustering cell populations in high dimensional space; per-tube feature lists (frequency, CD19, CD23, FMC7, CD20, CD5, ...); all FCM features of the patient.]
Figure 4.2: Feature extraction. Size and MFIs of all populations in all tubes are considered as features. The three dots indicate that a similar approach is applied to all other tubes.
4.1.2 Feature Selection and Classification
Because the number of features was considerably larger than the number of training samples (p = 505, n = 186), a careful feature selection scheme was needed. To reduce the computation time required, I imposed a predefined upper bound of 50 on the number of features based on a priori knowledge from the biology of the disease [145]. I initially applied the \u21131-regularization technique, but on its own it was not enough to prevent overfitting. Reducing the regularization parameter did not improve the results, as I observed that some of the features that were known to be biologically and clinically interesting were ignored. The number of wrongly ignored features varied between 3-5.
I also applied Bolasso [13] to select relevant features. For most bootstrap samples of the studied data, the global error defined by equation (1.2) was minimized when only a few (fewer than 4) features were selected.
Because the intersection of the selected features from several (more than 10) samples was empty, Bolasso could not result in appropriate feature selection.
Next, I applied FeaLect, my novel feature selection algorithm described in Section 3.2.1. In my experiment, I set \u03b3 = 3\/4 to be the fraction of instances used in each iteration for training. Figure 4.3 depicts the resulting feature scores for different lymphoma types after M = 1000 random runs. The log-scale plot consists of a linear part that confirms my proposed hypothesis experimentally. Furthermore, I re-ran the experiments with M = 5000 random samples and the results did not vary significantly, indicating a fast convergence rate for the feature scores. To select the informative features, a 3-segment spline model was fitted to each curve. The features corresponding to the middle linear segment were considered overfitting features and ignored for the rest of the analysis. Features with scores higher than those of the overfitting features were selected as informative features. Also, features with scores lower than those of the overfitting features were considered non-relevant and discarded. I observed that, unlike the pure Lasso, my approach selected all features that were known to be biologically and clinically interesting [78].
[Figure 4.3, panels A (DLBCL, 31 features), B (Follicular, 25 features), C (MCL, 12 features), and D (SLL, 31 features); x-axis: index; y-axis: scores.]
Figure 4.3: Scoring features by FeaLect based on their performance in classifying various types of lymphoma. The red thresholds indicate the number of features which are selected.
I used the selected features to build 100 linear classifiers, each of which was trained on a subset of the training set again.
Then, I fitted a logistic regression model to the outcome of each linear model to obtain a value in the range 0-1 that could be considered a probability. Because the performance of the combined classifiers relied on the regularization parameter, an objective strategy was required to adjust this value in the optimum way. I used the remainder of the cases in the training set to validate each classifier independently and determined the best regularization value. Finally, the probability of belonging to a type was computed by averaging all probabilities obtained from the 100 combined classifiers. I measured the performance of this strategy by validating the results on a test set that consisted of a quarter of the whole sample (62 cases). Dividing the samples into training and test sets was done uniformly at random, and the test set was totally disjoint from my training set. Figure 4.12 shows the computed probabilities for training and test sets. Table 4.1 includes the average sensitivity, specificity, and accuracy of my approach over four-fold cross-validation and shows the overall performance of my automatic pipeline. To compute the values in this table, all cases that obtained a probability of more than 0.6 were predicted to be diagnosed with a particular type. For instance, if the probability of a case being in class DLBCL was computed to be 0.7, that case was predicted as a DLBCL case.
Table 4.1: Overall performance of my automatic pipeline for analyzing lymphoma dataset.
Pathological Diagnosis | TP [a,b] | TN [c] | FP [d] | FN [e] | Sensitivity (%) [f,g] | Specificity (%) [h] | Accuracy (%) [i]
DLBCL [j] | 31 | 179 | 6 | 32 | 49 | 97 | 85
FOLL [k] | 120 | 94 | 13 | 21 | 85 | 88 | 86
MCL [l] | 7 | 233 | 1 | 7 | 50 | 100 | 97
SLL [m] | 24 | 218 | 0 | 6 | 80 | 100 | 98
[a] The results in the first four columns are on a random test set of size 62 that is disjoint from the training set of size 186.
[b] TP (true positive) = the number of cases predicted correctly to belong to the corresponding type.
[c] TN (true negative) = the number of cases predicted correctly not to belong to the corresponding type.
[d] FP (false positive) = the number of cases predicted wrongly to belong to the corresponding type.
[e] FN (false negative) = the number of cases predicted wrongly not to belong to the corresponding type.
[f] The results in the last three columns are the average of a 4-fold cross-validation.
[g] Sensitivity to a type (e.g., MCL) is the portion of cases of that type that are correctly diagnosed (TP\/(TP+FN)).
[h] Specificity is the portion of cases not belonging to the corresponding type that are correctly predicted not to belong to it (TN\/(TN+FP)).
[i] Accuracy is the portion of all cases that are correctly diagnosed.
[j] Diffuse large B cell lymphoma
[k] Follicular lymphoma
[l] Mantle cell lymphoma
[m] Small lymphocytic lymphoma
[Figure 4.4, eight panels: train set (A) and test set (B) for each of DLBCL, FOLL, MCL, and SLL; x-axis: index; y-axis: scores (0 to 1).]
Figure 4.4: Probabilities of belonging to each type of studied lymphoma computed for test and train cases. The colored dots represent the corresponding type and the black ones show the cases from other types.
Because my approach relies on statistical analysis of cases, it is limited to abundant types, and therefore, it requires the subjects to be known in advance to belong to one of the abundant categories.
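As a consistency check on Table 4.1, the three percentage columns can be recomputed from the four count columns using the formulas given in the table footnotes. The sketch below is illustrative only: the function name diagnostic_metrics is hypothetical, and pooling the counts in this way is an assumption made for the check, not a description of how the cross-validation averages were actually obtained. With the counts copied from the table, the rounded values agree with the printed percentages.

```python
def diagnostic_metrics(tp, tn, fp, fn):
    # Sensitivity, specificity and accuracy as percentages rounded to integers
    sensitivity = 100.0 * tp / (tp + fn)
    specificity = 100.0 * tn / (tn + fp)
    accuracy = 100.0 * (tp + tn) / (tp + tn + fp + fn)
    return round(sensitivity), round(specificity), round(accuracy)

# Counts copied from Table 4.1 in the order (TP, TN, FP, FN)
table_4_1 = {
    'DLBCL': (31, 179, 6, 32),
    'FOLL': (120, 94, 13, 21),
    'MCL': (7, 233, 1, 7),
    'SLL': (24, 218, 0, 6),
}

results = {name: diagnostic_metrics(*counts) for name, counts in table_4_1.items()}
for name, (sens, spec, acc) in results.items():
    print(name, sens, spec, acc)
```

Note that each row of counts sums to 248, the total number of cases in the study.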
While this condition restricts the application of my pipeline for diagnosis, my analysis has other clinical applications, such as providing new insight into the relative contribution of each FCM data feature to the overall diagnostic algorithm, and can potentially lead to improved diagnostic accuracy using FCM.
Furthermore, some of the features that I selected and that were not previously used in clinical settings were studied together with my clinical collaborators to discover novel biomarkers. Chapter 4.2 describes the outcome of such a study on differentiating between MCL and SLL.
4.2 Discriminating Between Small Lymphocytic and Mantle Cell Lymphomas
4.2.1 Overview
Mantle cell lymphoma (MCL) and small lymphocytic lymphoma (SLL) exhibit similar, but distinct immunophenotypic profiles. While many (70%-90%) cases can be diagnosed with high confidence based on flow cytometry (FCM) results alone, ambiguous cases are frequently encountered and necessitate additional studies including immunohistochemistry for cyclin D1 and fluorescence in-situ hybridization (FISH) analysis for t(11;14) translocation. However, performing and interpreting these tests requires resources which may not be available in all laboratory settings [135].
In order to determine if greater diagnostic accuracy could be achieved from FCM data alone, I developed an unbiased, machine-based algorithm and used it to automatically identify those features within the multidimensional space that best distinguish between the two disease types.
4.3 Clinical Motivation
Mantle cell lymphoma (MCL) and small lymphocytic lymphoma (SLL) are both mature B-cell neoplasms [28, 45, 67]. MCL is a proliferation of small to medium-sized B lymphocytes, with slightly irregular nuclear contours [97].
SLL, on the other hand, is the non-leukemic variant of chronic lymphocytic leukemia (CLL), typically causing lymphadenopathy and, occasionally, peripheral blood cytopenias [45], and is morphologically characterized by an accumulation of mature B lymphocytes. Being immunologically and morphologically similar, SLL and CLL are considered to be a single disease entity in the WHO classification [10, 57, 97].
Unlike SLL, which generally shows an indolent course justifying a watch-and-wait approach in asymptomatic patients [45], MCL is an aggressive lymphoma that is usually treated at diagnosis. Therefore, accurate distinction between these two diagnoses is crucial. The hallmarks of MCL are the t(11;14)(q13;q32) translocation, present in the vast majority of cases, and the resulting overexpression of cyclin D1 [103, 162]. While fluorescence in situ hybridization (FISH) and immunohistochemistry (IHC) are excellent ancillary tests for these features, performing and interpreting them requires resources which may not be available in all laboratory settings [135]. FCM is frequently utilized in the evaluation of lymphoproliferative disorders, and is especially useful in the differential diagnosis between SLL and MCL since they usually exhibit distinct immunophenotypes [107, 136]. While both lymphomas are CD5+, MCL is generally CD23- and FMC7+, whereas SLL\/CLL is usually CD23+ and FMC7-. However, a significant proportion of SLL and MCL cases (e.g., more than 15% [31]) present conflicting FCM signatures and are prone to misclassification [32]. Several groups have attempted to address this challenge by closer analysis of FCM data [5, 31, 36, 38, 55, 58, 77, 99, 104-107, 116, 136, 151], but most resulting diagnostic algorithms compromise sensitivity for specificity or vice versa [32]. For instance, the approach suggested by Morice et al. [107] was reported to have 82% sensitivity to CLL\/SLL and 56% sensitivity to MCL for 175 studied cases that were CD5+.
It is widely recognized that data analysis is by far one of the most challenging and time-consuming aspects of FCM experiments, as well as being a primary source of variation in clinical tests [3, 72, 86, 88, 117, 120, 131, 132, 149]. Investigators have traditionally relied on intuition rather than on standardized statistical inference in the analysis of FCM data [16]. Together with clinical collaborators, we hypothesized that the accuracy of diagnosis can be significantly improved by utilizing the information that already exists in FCM data, but is missed by traditional data analysis approaches. My approach was to use an unbiased algorithm to retrospectively analyze multidimensional FCM data in order to identify the most informative features. I applied the tools that I developed for automatic analysis of FCM data to address the challenges of traditional data analysis approaches. In particular, the goal of my MCL-SLL study described in this chapter was to discover more sensitive and more specific FCM features to reduce diagnostic errors, the time and effort required for data analysis, and unnecessary utilization of ancillary tests.
4.4 Design and Methods
4.4.1 Patient Samples
One hundred fourteen lymph node biopsy specimens with final pathologic diagnoses of either mantle cell lymphoma (MCL; 44 cases, 39%) or small lymphocytic lymphoma (SLL; 70 cases, 61%), and for which flow cytometric analysis was performed at the BC Cancer Agency between 1997 and 2010, were identified. Patient characteristics are detailed in Supplementary Table A. For purposes of comparison, an additional 60 diffuse large B cell lymphoma (DLBCL) and 100 follicular lymphoma (FOLL) cases were identified in a similar manner.
4.4.2 Pathologic Classification
Final pathologic diagnoses for all cases were determined by expert staff hematopathologists at the BC Cancer Agency after integration of findings from biopsy histology (H&E morphology), immunohistochemistry including cyclin D1 staining where indicated, flow cytometry, FISH analysis where indicated, and clinical history.
4.4.3 Flow Cytometry Data
A total of 88 cases were analyzed on a single-laser, three-color Beckman Coulter FC500 cytometer with surface immunophenotyping for CD19\/CD3\/CD5, CD19\/CD23\/FMC7, CD19\/kappa\/lambda, CD20\/CD10\/CD11c, CD19\/CD45\/CD14, and CD7\/CD4\/CD8. The remaining 26 cases were analyzed on a three-laser, eight-color Becton Dickinson CantoII cytometer with surface immunophenotyping for CD19\/CD20\/CD3\/CD5\/CD10\/CD11c\/kappa\/lambda and CD19\/CD3\/CD5\/CD23\/FMC7\/CD38\/CD25\/CD103. Photomultiplier (PMT) voltage settings were substantially modified twice during the 3-color data acquisition period and once during the 8-color data acquisition period, as reported in [81]. Because changes in the PMT voltage settings of the instrument alter the MFI of the corresponding markers and can mask biological information [81], we segregated the data into five distinct time periods such that PMT voltage settings were essentially constant within each time period. The time periods contained 23, 21, 44, 19 and 7 cases, respectively, as detailed in Table B.1.
Time Frames
The time periods included: 1997 to 2002 [February] containing 23 cases, 2002 [March] to 2004 [November] containing 21 cases, 2004 [December] to 2009 [February] containing 44 cases, 2009 [March] to 2009 [August] containing 19 cases, and 2009 [September] to 2010 containing 7 cases.
For two cases diagnosed between 1997-2002 (first time frame, three-color data), the FCS files of the CD5-CD19-CD3 tube were corrupted. I considered the mean CD5 intensity of all MCL or SLL patients in that time period as an estimate of the CD5 intensity of these two cases.
Also, a single case had the same issue with the kappa-lambda-CD19 tube, and we treated it accordingly.
4.5 Computational Methodology Adjusted for MCL vs. SLL Study
In this section, I explain how I applied the automatic pipeline described in this chapter (Chapter 4) specifically to the MCL vs. SLL study, and what parameters were used to obtain the clinically useful results.
I previously developed the SamSPECTRAL methodology to cluster individual flow data points and thus define cell populations automatically [165]. For each of these clusters, I computed the MFIs of all available markers and also the proportion of the number of events to obtain a complete set of flow cytometric features. Then, I applied our FeaLect methodology, a novel feature selection technique similar to the Bolasso algorithm (Chapter 3), to identify the subset of flow cytometric features which was most useful in discriminating between MCL and SLL cases [12]. Besides the SamSPECTRAL and FeaLect computational packages, which were published before, the computational tools and source code used in this study are available upon request.
To ensure that my findings are generalizable and the final results are not dependent on a specific set of samples, I first analyzed data from 44 cases which were all acquired during the 3rd time period (see Methods section for details). This 44-sample set was utilized initially to identify the most predictive features because it contained the greatest number of samples analyzed under consistent data acquisition conditions (i.e., PMT voltage settings and cytometer platform).
To compute feature values for each case, all cell populations were identified by clustering data in all dimensions available for each tube (3 fluorescence parameters plus forward and side scatter) simultaneously. Then, B-cells were determined by excluding those populations that were either CD19 or CD20 negative as defined by applying t-test comparisons (described in the Supplementary Appendix).
Finally, MFIs of each parameter were measured among identified B-cell events. The MFI of immunoglobulin light chain was taken as the greater of that for kappa or lambda.
4.5.1 Setting the Thresholds
For each feature under study, two densities were estimated separately: the density for MCL cases in a specific time frame, and the density for SLL cases in the same time frame. To estimate the density, a Gaussian kernel was used and the objective function for selecting the bandwidth h was minimizing the mean integrated squared error (MISE) [137, 138]:
$$\\mathrm{MISE}(h) := \\mathbb{E}_S \\left[ \\int \\left( \\hat{f}_h(x) - f(x) \\right)^2 dx \\right],$$
where
$$\\hat{f}_h(x) = \\frac{1}{(2\\pi)^{1\/2} h n} \\sum_{i} e^{-\\frac{(x - x_i)^2}{2h^2}}$$
is the kernel estimate of the density function f over n sample points S = {x_1, ..., x_n}. I used the standard R function density() that applies Silverman's robustness rule [137] to estimate the optimal bandwidth.
Table 4.2: Parameters used for identifying cell populations by SamSPECTRAL.
Data | \u03c3 normal | separation factor
3-color data | 200 | 0.9
8-color data | 100 | 0.8
[Figure 4.5: estimated density curves for MCL and SLL; y-axis: estimated density.]
Figure 4.5: Setting of thresholds. The best boundary (dashed green line) minimizes the dashed area under the density curves, which corresponds to the Bayes error [153].
Having estimated the densities, I selected a threshold for each feature that minimized the Bayes error [25, 153], the probability of misclassification between MCL and SLL (Figure 4.5). To find the best threshold, I computed the Bayes error for 1000 points uniformly distributed in the range of values of each feature. It is known that the optimal threshold lies on one of the intersections between the two density curves [68]. The receiver operating characteristic (ROC) curves are also presented to illustrate the sensitivity of each feature to the selection of the optimum threshold (Figure 4.15).
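The threshold search described in this section can be illustrated with a short standard-library Python sketch. It is not the code used in this study and rests on several assumptions that are not stated in the text: the function names are hypothetical, the bandwidth is a rule-of-thumb value (0.9 times the smaller of the standard deviation and IQR/1.34, times n to the power -1/5) that is not guaranteed to match the output of the R function density() exactly, the two classes are given equal priors, and the toy intensity values are invented. The misclassified area on each side of a candidate threshold is obtained from the cumulative distribution of the Gaussian kernel estimate, which is the average of normal cumulative distributions centered at the sample points.

```python
import math
import statistics

def rule_of_thumb_bandwidth(xs):
    # Assumed rule of thumb: 0.9 * min(sd, IQR / 1.34) * n^(-1/5)
    n = len(xs)
    sd = statistics.stdev(xs)
    q1, _, q3 = statistics.quantiles(xs, n=4)
    spread = min(sd, (q3 - q1) / 1.34) if q3 > q1 else sd
    return 0.9 * spread * n ** (-0.2)

def kde_cdf(t, xs, h):
    # CDF of a Gaussian kernel density estimate: the mean of normal CDFs
    return sum(0.5 * (1.0 + math.erf((t - x) / (h * math.sqrt(2.0)))) for x in xs) / len(xs)

def best_threshold(a, b, grid_size=1000):
    # Scan evenly spaced candidate thresholds over the observed range and
    # keep the one with the smallest misclassification probability.
    ha, hb = rule_of_thumb_bandwidth(a), rule_of_thumb_bandwidth(b)
    lo, hi = min(a + b), max(a + b)
    best = None
    for k in range(grid_size):
        t = lo + (hi - lo) * k / (grid_size - 1)
        fa, fb = kde_cdf(t, a, ha), kde_cdf(t, b, hb)
        # Either class may lie below the threshold, so both orientations are tried;
        # the factor 0.5 encodes the assumed equal class priors.
        err = 0.5 * min((1.0 - fa) + fb, fa + (1.0 - fb))
        if best is None or err < best[1]:
            best = (t, err)
    return best

# Hypothetical intensities of one feature for two well-separated groups
group_a = [1.0, 1.2, 0.8, 1.1, 0.9, 1.3]
group_b = [3.0, 3.2, 2.8, 3.1, 2.9, 3.3]
threshold, error = best_threshold(group_a, group_b)
print(threshold, error)
```

For well-separated groups such as these, the selected threshold falls in the gap between the two groups and the estimated error is close to zero; with overlapping groups the minimum would be expected near a crossing point of the two estimated densities, in line with the remark above.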
[Figure 4.6: scatter plot of events; x-axis: FSC; y-axis: SSC.]
Figure 4.6: Dead cell removal. Clusters with low mean FSC were excluded as non-viable cell debris.
4.5.2 Dead Cell Removal
As a preprocessing step in analyzing FCM data, events with low FSC and SSC values were considered debris and excluded from my analysis [61]. I clustered data for each case in FSC and SSC log-scale dimensions using SamSPECTRAL with parameters \u03c3 normal = 2000 and separation factor = 0.8. Dead cells were defined as those clusters with mean FSC less than 200 (Figure 4.6). After removing dead cells and debris by automatic gating in the forward and side scatter channels, I obtained five-dimensional data for, on average, 13K live cells per tube with a standard deviation of 7K.
Another, naive approach to removing dead cells is to ignore all cells with an FSC value less than a threshold. My method is advantageous over the naive approach because, as indicated in Figure 4.6, some dead cells might be located above the FSC threshold while being close to other dead cells in other dimensions. Therefore, they cluster together with other dead cells and are identified as dead cells by my method.
4.5.3 Identification of Cell Populations
Following dead cell exclusion, I clustered data for each case in all available dimensions simultaneously, including FSC and SSC, again using SamSPECTRAL. Table 4.2 includes the parameters I used to cluster events in each tube, such that cells with high biological similarity were grouped together. Therefore, a cluster in my analysis represented a biological population of cells that express specific intracellular or surface proteins. I applied two consecutive t-tests to identify B cells. The first t-test was applied on every cluster with the null hypothesis that the MFI of CD19 of that cluster is greater than the total MFI of CD19 (i.e., it is brighter than total live cells in CD19).
Clusters that passed this test with P value 0.001 were considered CD19+. As the rest of the events contained both CD19\u2212 and dim cells (non-CD19+ clusters), I applied another t-test to distinguish dim from negative clusters, with the null hypothesis that a cluster is brighter than total non-CD19+ events. The clusters that failed this test with P value 0.001 were considered CD19\u2212. All cells except the CD19\u2212 events (\"non-B cells\") were included in our analysis.
4.6 Definitional Criteria for Positive\/Negative Marker Expression
According to the WHO immunophenotypic classification, typical MCL is CD5+, CD23-, FMC7+ whereas typical SLL is CD5+, CD23+, FMC7- [33]. In routine clinical practice, the distinction between \"positive\" vs \"negative\" expression for a given marker is often based on absolute threshold values as defined by either an internal negative control population within the same staining tube or parallel analysis of unstained patient cells, patient cells stained with isotype control antibodies, or staining of cells from a \"normal\" control sample.
Table 4.3: Distribution of cases among \"standard\" vs \"non-standard\" immunophenotypes and Combined Ratio Scores.
Figure 4.7: Contradiction in immunophenotypic signatures. (A) A typical MCL case that is FMC7+ and CD23-. (B) A typical SLL case that is FMC7- and CD23+. (C) An atypical CD5+ case that should be either MCL or SLL. It presents both FMC7+ and CD23+, leading to ambiguity in following WHO recommendations. When immunophenotypic signatures are conflicting for cases such as this one, the common routine is to order further tests to diagnose unambiguously. All three cases are from the third time frame, BC Cancer Agency dataset.
While the above approach gives clear results when cell populations of interest exhibit uniform and bright expression of a given marker, it is much less informative when expression is variable or dim.
Accordingly, we elected to apply a statistical measure instead to define positive vs negative marker expression such that direct comparisons could be drawn between conventional diagnostic criteria and our automated approach. Specifically, a t-test was applied to determine positive vs negative expression for these markers. For instance, suppose two distributions dB and dT were obtained for CD23 intensity of B-cells and T-cells, respectively. If a case did not pass the null hypothesis that dB \u2212 st(dB) is greater than dT with P value 0.05, then that case was considered CD23-. The null hypothesis is equivalent to CD23 of B-cells being greater than the sum of CD23 of T-cells plus the standard deviation of CD23 for B-cells.
[Figure 4.8, panels A (typical SLL), B (typical MCL), C (atypical SLL), and D (atypical MCL); x-axis: CD23; y-axis: FMC7; populations: B-cell, T-cell.]
Figure 4.8: Example typical and atypical cases.
If the level of expression was not taken into account, 38 out of 114 (33%) cases could not be confidently diagnosed based only on positive\/negative phenotyping for FMC7 and CD23. The atypical cases include 16\/44 (36%) MCL and 22\/70 (31%) SLL cases (Table 4.3). This ratio may seem \"high\" for common practice approaches; however, our objective test above shows that a pathologist may subjectively use second-line criteria or expression level for differential diagnosis.
4.7 Results
In conventional analysis of flow cytometry data, a primary diagnostic feature is whether a defined population is scored as \"positive\" or \"negative\" for expression of a given marker.
For example, B cells in MCL are typically considered to be FMC7+ and CD23-, whereas in SLL they are typically FMC7- and CD23+. To perform an objective comparison between my automated algorithmic approach and conventional practice, I applied statistical t-test comparisons to determine whether a given marker should be scored as positive or negative for purposes of conventional diagnosis assignment (see Section 4.6 for methodological details). Interestingly, while 76\/114 (67%) of cases were correctly assigned as either MCL or SLL by FMC7+\/CD23- or FMC7-\/CD23+ criteria, respectively, 38\/114 (33%) cases were scored as either FMC7- CD23- or FMC7+ CD23+ by this approach and thus could not be unambiguously assigned. These ambiguous, or \"atypical\", cases included 16\/44 (36%) MCL and 22\/70 (31%) SLL cases, leading to 63% sensitivity, 87% specificity, and 67% accuracy for this restricted approach. Accordingly, typical clinical practice in evaluation of flow cytometry data also incorporates the level of expression for certain markers such as CD20 and immunoglobulin light chain. Expression level is seldom quantitated in a robust fashion, however, and as such these features are generally considered as \"secondary\" criteria in the differential diagnosis between MCL and SLL. In contrast, my automated diagnostic approach defines each feature primarily by the component MFI of each marker.\n4.7.1 An Observation Motivating Computing Ratios\nFigure 4.11 shows the expression levels of CD23 and FMC7. The pink dots, corresponding to MCL cases, were located above the dashed diagonal line, meaning that the level of FMC7 expression for all 14 (100%) MCL cases was more than the level of CD23 expression. 
Conversely, the majority (29 out of 30, 97%) of SLL cases expressed FMC7 less than CD23. Similarly, we compared the expression levels of CD20 and CD11c for the same 44 cases. The majority (12 out of 14, 86%) of MCL cases were located above the diagonal dashed line and 29 (97%) SLL cases were located below the threshold. Moreover, we observed that the two blue dots that are above the dashed lines in Figure 4.11 A, D do not represent the same SLL cases. The above two observations suggest that features defined as ratios are promising for improving the diagnosis.\nFigure 4.9: Discriminative value of individual markers typically utilized for diagnosis of MCL vs. SLL. Panels: A (CD23 intensity), B (FMC7 intensity), C (CD20 intensity), D (Light chain), E (CD5 intensity). Each panel shows marker intensity against the number of cases (%) \/ scaled density, with symbols for typical MCL, typical SLL, atypical, t(11;14) positive, t(11;14) negative, cyclin D1 negative, and both negative. \"Typical\" MCL cases were defined as having a CD5+, CD23-, and FMC7+ B cell signature, whereas \"typical\" SLL cases were defined as having a CD5+, CD23+, and FMC7- B cell signature. Any cases without a \"typical\" MCL or SLL immunophenotype were designated \"atypical\". Individual examples of typical and atypical cases are provided in the Supplementary Appendix. All MCL cases were confirmed as cyclin D1 positive by IHC. For clarity purposes, only cases from the third time frame (14 MCL shown in pink, 30 SLL shown in blue) are depicted, and similar figures corresponding to each of the five time frames are provided separately in Appendix C.\nFigure 4.10: Most discriminative fluorescence ratios for MCL vs. SLL. Panels: A (relative intensity of CD20 to CD23), B (relative intensity of FMC7 to CD23), C (relative intensity of CD20 to CD11c). Symbols are as depicted in Figure 4.9. 
\n4.7.2 Developing a Diagnostic Predictor\nWhen all 15 available markers were analyzed for their discriminative value, my unbiased, automated approach revealed that only a subset of markers was useful. Consistent with expectations from clinical practice, these included CD19, CD20, CD5, CD23, FMC7, CD11c, kappa, and lambda. We examined MFI values for each marker individually as well as all pairwise combinations of MFI ratios for their ability to discriminate between MCL and SLL. We observed CD23, FMC7, CD20, immunoglobulin light chain, and CD5 (typically regarded as most useful in MCL vs. SLL diagnosis [5, 9, 32, 36, 38, 55, 58, 60, 77, 99, 102, 105, 106, 116, 136, 151]) to give variable results when considered individually (Figures 4.9 and ). Interestingly, three ratios (CD20\/CD23, FMC7\/CD23, and CD20\/CD11c) showed considerable improvement in discriminating between MCL and SLL (Figure 4.10). Next, I derived a Combined Ratio Score for each case by counting the number of ratios which were above a predefined threshold (see Supplementary Appendix for details on threshold setting). For example, if all three ratios were above their corresponding thresholds, then the Combined Ratio Score was considered to be equal to 3, and represented the strongest prediction for MCL. Conversely, if all ratios were below their corresponding thresholds, then the Combined Ratio Score was regarded as 0, and represented the strongest prediction for SLL. 
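As a rough illustration of the scoring rule just described, the following Python sketch counts how many of the three ratios exceed their thresholds. The threshold values and MFI numbers are hypothetical placeholders (the actual thresholds are set as described in the Supplementary Appendix), and the sketch is written in Python rather than the R tooling used for this study.

```python
# Minimal sketch of the Combined Ratio Score.
# Assumption: the threshold values below are hypothetical, not the study's.
HYPOTHETICAL_THRESHOLDS = {
    'CD20/CD23': 1.25,
    'FMC7/CD23': 1.0,
    'CD20/CD11c': 2.25,
}


def combined_ratio_score(mfi, thresholds=HYPOTHETICAL_THRESHOLDS):
    '''Count the ratios above their thresholds (0 favors SLL, 3 favors MCL).'''
    score = 0
    for name, cutoff in thresholds.items():
        numerator, denominator = name.split('/')
        if mfi[numerator] / mfi[denominator] > cutoff:
            score += 1
    return score


# Made-up MFI values for two illustrative cases.
mcl_like = {'CD20': 600.0, 'CD23': 100.0, 'FMC7': 300.0, 'CD11c': 150.0}
sll_like = {'CD20': 200.0, 'CD23': 400.0, 'FMC7': 100.0, 'CD11c': 250.0}
print(combined_ratio_score(mcl_like))  # all three ratios exceed the cutoffs: 3
print(combined_ratio_score(sll_like))  # no ratio exceeds its cutoff: 0
```

Counting threshold crossings, rather than combining the raw ratio values, keeps the score on a fixed 0 to 3 scale regardless of how differently the individual ratios are scaled.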
When the Combined Ratio Score was applied to the entire dataset of 114 cases, all 44 MCL cases received scores of 2 or 3, and 67\/70 SLL cases received scores of 0 or 1 (Figure 4.12), thus indicating that the Combined Ratio Score achieved 100% sensitivity, 96% specificity, and 97% accuracy (Table C.1 in the Supplementary Appendix). In a more conservative approach, one could instead consider cases with scores of 3 as MCL (36\/114 cases, or 32%) and scores of 0 or 1 as SLL (67\/114 cases, or 59%), and defer to additional tests such as t(11;14) FISH or cyclin D1 immunohistochemistry (IHC) to resolve cases with scores of 2 (11\/114 cases, or 10%). This approach would avoid any misdiagnoses, yet still reduce by two-thirds the number of ambiguous cases requiring additional testing as compared to the conventional method based on positive\/negative phenotyping for FMC7 and CD23 markers only.\nFigure 4.11: Comparing intensities for 274 cases from the third time frame. CD19+ B-cells were identified and MFI of FMC7 vs. MFI of CD23 are shown for: (A) 14 MCL and 30 SLL, (B) 60 DLBCL, and (C) 100 follicular cases. (D-F) Similarly, CD20+ were identified and MFI of CD20 vs. MFI of CD11c are shown for the same cases. While MCL is clearly separated from SLL by the two dashed lines, these features do not have significant diagnostic value for DLBCL or follicular lymphoma.\nTable 4.4: Discriminative values on test set. MCL is considered as positive and SLL as negative.\nFeature; TP (a); TN (b); FP (c); FN (d); Sensitivity (e); Specificity (f); Accuracy (g)\nNovel ratios:\nCD20\/CD23 ratio; 30; 40; 0; 0; 100% (h); 100%; 100%\nFMC7\/CD23 ratio; 30; 37; 3; 0; 100%; 92%; 96%\nCD20\/CD11c ratio; 24; 33; 7; 6; 80%; 82%; 81%\nCombined Ratio Score; 30; 37; 3; 0; 100%; 92%; 96%\nCommon markers:\nCD23 intensity; 30; 35; 5; 0; 100%; 88%; 93%\nFMC7 intensity; 24; 32; 8; 6; 80%; 80%; 80%\nCD20 intensity; 24; 33; 7; 6; 80%; 82%; 81%\nLight chain; 18; 38; 2; 12; 60%; 95%; 80%\nCD5 intensity; 19; 22; 18; 11; 63%; 55%; 59%\n(a) TP (true positive) = the number of MCL cases which are diagnosed correctly. (b) TN (true negative) = the number of SLL cases which are diagnosed correctly. (c) FP (false positive) = the number of SLL cases which are wrongly diagnosed as MCL. (d) FN (false negative) = the number of MCL cases which are wrongly diagnosed as SLL. (e) Sensitivity to MCL is the portion of MCL cases that are correctly diagnosed (TP\/(TP+FN)). (f) Specificity is the portion of SLL cases that are diagnosed correctly (TN\/(TN+FP)). (g) Accuracy is the portion of all correctly diagnosed cases. (h) Performance higher than 95% is shown in boldface.\nFigure 4.12: Combined Ratio Scores obtained for 114 (44 MCL and 70 SLL) cases, plotted as Combined Ratio Score against case index with symbols as in Figure 4.9. The calculated combined scores are 2 or 3 for all MCL cases, leading to 100% sensitivity, whereas the scores of most (96%) SLL cases are 0 or 1.\n4.7.3 Testing and Validation of the Diagnostic Predictor\nSince the Combined Ratio Score was initially defined using the 44-sample training set (time period #3), we tested its performance on an independent set of 70 samples comprising the remaining cases identified in this series (time periods #1, 2, 4, and 5). The performance of the Combined Ratio Score on this 70-sample test set is summarized in Table 4.4, with comparison to each of the component pairwise ratios as well as to selected individual markers. 
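The sensitivity, specificity, and accuracy columns of Table 4.4 follow directly from the four counts, using the definitions given in the table footnotes. The short Python sketch below recomputes them for two rows of the table; the function name is an illustrative choice and is not part of the reproduction package.

```python
def diagnostic_metrics(tp, tn, fp, fn):
    '''Return (sensitivity, specificity, accuracy) as fractions.'''
    sensitivity = tp / (tp + fn)   # correctly diagnosed MCL cases
    specificity = tn / (tn + fp)   # correctly diagnosed SLL cases
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return sensitivity, specificity, accuracy


# Counts (TP, TN, FP, FN) copied from two rows of Table 4.4,
# with MCL as the positive class and SLL as the negative class.
rows = {
    'FMC7/CD23 ratio': (30, 37, 3, 0),
    'CD5 intensity': (19, 22, 18, 11),
}
for feature, counts in rows.items():
    sens, spec, acc = diagnostic_metrics(*counts)
    print(feature, round(sens, 3), round(spec, 3), round(acc, 3))
```

Up to rounding, these fractions agree with the percentages reported in the table for the same rows.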
Rather unexpectedly, this validation analysis revealed that the Combined Ratio Score performed less reliably than did the CD20\/CD23 ratio feature alone. Of note, the CD20\/CD23 ratio was superior to all other ratios and also to all individual marker features. In examining further the underlying cause for poorer performance of the Combined Ratio Score, it became apparent that FMC7 was the primary confounding variable in that it was expressed at relatively high levels (and scored as positive by t-test comparison) in confirmed SLL cases with borderline high CD20\/CD23 ratios (Figure 4.13). In contrast, sIg light chain intensity appeared to improve upon the discriminative value of the CD20\/CD23 ratio in that most cases with borderline ratios would have been assigned correctly if light chain intensity were considered (i.e. dim light chains favoring SLL). The incremental value of considering CD11c expression with CD20\/CD23 ratio appeared variable in that some cases would have been assigned correctly, but others would not.\nFigure 4.13: Incorporating CD20\/CD23 ratio with a third marker to diagnose borderline cases. 
Each gray line shows the optimum threshold which provides the best discrimination between MCL and SLL, solely based on the third marker. The dashed lines determine borderline cases, i.e., the probability of observing an MCL case with a CD20\/CD23 ratio smaller than the pink threshold is less than 0.01. (a) FMC7 can be misleading for SLL cases with borderline high CD20\/CD23 ratios. (b) Immunoglobulin light chain expression may improve diagnostic accuracy. (c) CD11c gives variable results.\nFigure 4.14: Immunoglobulin light chain (sIg) expression for cases with borderline CD20\/CD23 ratios. Each panel plots light chain intensity against the CD20 to CD23 ratio for MCL and SLL cases, with one panel per time frame (first to fifth). The gray line shows the optimum threshold which provides the best discrimination between MCL and SLL, solely based on sIg.\n4.7.4 Sensitivity to Thresholds\nA common methodology to compare the diagnostic value and robustness of several features is ROC analysis, which is performed by computing the true positive rate (TPR, or sensitivity) vs. the false positive rate (FPR, or 1-specificity) obtained by varying the threshold from the minimum observed value to the maximum observed value [170]. Generally, varying the cutoff value changes sensitivity and specificity in opposite directions; increasing sensitivity results in decreasing specificity, and vice versa. 
The ROC curve of a feature shows, for any particular value of specificity, the best sensitivity that can be obtained by that feature. ROC curves are also useful for comparing features. For instance, if the curve of one feature is above the curve of another, the former is more robust because for any particular specificity, it obtains higher sensitivity than the latter. While the ROC curves of two different features may intersect at a point, meaning they have the same performance for a particular pair of thresholds, their overall performance might be totally different when other thresholds are applied. The area under the ROC curve measures the overall performance of the feature; a larger area under the ROC curve shows that the feature is more robust because it loses less in sensitivity when higher specificity is obtained by varying the threshold. I performed ROC analysis and measured the area under the curve (AUC) for all of the studied features (Figure 4.15). Each ROC curve was obtained by combining the performance of a feature over all five time periods in the following way. First, in each time period, an ROC curve was computed by the common method: varying the threshold from the minimum observed value to the maximum observed value and computing TPR vs. FPR. Then, for any false positive rate, the total true positive rate was computed by averaging the TPRs in all time periods. The following formula defines how the contribution of each time period was weighted by considering the proportion of number of cases: $\\mathrm{TPR} = \\sum_{i \\in \\text{all periods}} \\frac{n_i}{n} \\, \\mathrm{TPR}_i$, where n is the total number of studied cases, n_i is the number of cases in the ith time period, and TPR_i is the corresponding true positive rate. The resulting curve presented the overall true positive rate vs. false positive rate on all 114 studied cases for a specific feature. ROC analysis confirmed that the curve of the CD20\/CD23 ratio was above all other curves, meaning that for any fixed specificity, it obtained the highest sensitivity compared to other features. It also achieved the maximum AUC.\nFigure 4.15: ROC curves obtained for all 114 (44 MCL and 70 SLL) cases, plotting true positive rate (sensitivity to MCL) against false positive rate (1\u2212specificity). AUC (area under the curve) by feature: CD23 (0.956), FMC7 (0.889), CD20 (0.899), Light chain (0.840), CD5 (0.552), CD20\/CD23 ratio (1.000), FMC7\/CD23 ratio (0.991), CD20\/CD11c ratio (0.937), Combined Ratio Score (0.993). CD20 to CD23 ratio (brown) is the most accurate (100%) and most robust (AUC=1) FCM feature.\n4.7.5 Best Ratio vs. Other Markers\nOur study showed that the CD20\/CD23 ratio is the best immunophenotypic feature for differential diagnosis between MCL and SLL, given that all 114 studied cases could be correctly diagnosed using this feature. Computing this ratio does not require FMC7, which is a common marker for diagnosing MCL, and it is therefore interesting to see if incorporating FMC7 with the best ratio can be helpful for diagnosis. Our objective analysis showed that FMC7 is misleading for the SLL cases which are close to the border, and therefore, considering this marker is not recommended for such cases (Figure 4.13). In contrast, if the CD20\/CD23 ratio is considered as the primary criterion for diagnosis, then light chain intensity of the clonal marker performs better than FMC7 in adding diagnostic information because it is not generally misleading for the cases close to the border (Figures 4.13 and 4.14). 
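The period-weighted averaging of true positive rates defined in Section 4.7.4 can be sketched in a few lines of Python. This is only an illustration of the formula at a single fixed false positive rate: the per-period TPR values and most of the case counts below are hypothetical (only the 44 cases of the third period and the total of 114 come from the text), and the function name is an assumption.

```python
def weighted_tpr(period_tprs, period_sizes):
    '''Combine per-period TPRs at one fixed FPR as sum of (n_i / n) * TPR_i.'''
    total = sum(period_sizes)
    return sum(n_i / total * tpr_i
               for tpr_i, n_i in zip(period_tprs, period_sizes))


# Hypothetical TPRs of one feature at a fixed false positive rate, with
# hypothetical case counts for the five time periods (they sum to 114).
tprs = [1.0, 0.9, 1.0, 0.8, 1.0]
sizes = [20, 20, 44, 10, 20]
print(round(weighted_tpr(tprs, sizes), 4))
```

Repeating this computation over a grid of false positive rates would give one combined curve per feature; weighting by the number of cases prevents a small time period from dominating the average.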
\n4.8 Discussion\nUsing an automated, unbiased approach to examine multidimensional flow cytometry data, we have attempted in this study to improve upon diagnostic accuracy in distinguishing between two immunophenotypically related lymphomas, SLL and MCL. The use of CD19 or CD20, CD5, CD23, and FMC7 as \"positive\" vs. \"negative\" markers with secondary consideration of features such as intensity of CD20 and surface immunoglobulin (sIg) light chain expression is widespread. However, some cases cannot be confidently diagnosed by applying these conventional approaches to flow cytometry data analysis. My automated algorithm has identified that the CD20\/CD23 ratio is the most robust FCM feature for discriminating SLL from MCL. Unexpectedly, inclusion of additional features such as FMC7 and CD11c expression actually results in a greater likelihood of misdiagnosis, most specifically in SLL cases with borderline high CD20\/CD23 ratios. In contrast, consideration of immunoglobulin light chain expression may improve diagnostic accuracy for these borderline cases. Our findings provide new insight into the relative contribution of each FCM data feature to the overall diagnostic algorithm, and by improving the diagnostic accuracy of FCM data analysis, can potentially reduce the amount of confirmatory ancillary testing required. For instance, although some studies have shown that the biological information of FMC7 can also be captured by CD20 intensity [76, 94], it remains common practice to consider both FMC7 expression and CD20 intensity in evaluating FCM data. Our data reveal that if considered alone, FMC7 is superior to CD20 in differentiating between MCL and SLL (Figure 4.9); however, FMC7 actually compromises diagnostic accuracy when included as part of a multiple feature diagnostic predictor. 
Because most clinical flow cytometry labs already acquire the data needed to calculate the CD20\/CD23 ratio as part of their routine diagnostic panels [77, 116, 151], this novel criterion should be very easy to implement in other centers. The optimal cutoff value for the CD20\/CD23 ratio to discriminate between SLL and MCL, however, will be sensitive to inter-laboratory variables such as staining protocols, choice of antibody clones and fluorochrome conjugates, and instrumentation settings and sensitivity. Therefore, our method currently may require recalibration by each lab using its own set of training samples to determine the optimal cutoff. If enough samples are available (say 10 of each lymphoma type), this task will be easy; otherwise, setting the best cutoff will be a bottleneck for applying our approach. The CD20\/CD23 ratio was fully discriminative for all 114 studied cases, and although I observed no exceptional case in the current study, care should be taken in using this criterion to diagnose new cases. In particular, if the value of any feature is not in the range of already analyzed cases, or if the setting of the instruments is changed, or if the antibody clones and fluorochrome conjugates are changed, it is recommended to order secondary confirmatory tests such as FISH or cyclin D1 for that case. An alternative procedure that does not require a cutoff threshold but is very conservative in determining the sample labels could be utilized. For instance, if a case obtains a CD20\/CD23 ratio greater than that of another confirmed MCL case that was diagnosed in the same clinic and tested with the same settings, then the new case can be confidently diagnosed as MCL. Accordingly, a similar conservative approach can be applied for diagnosing an SLL case; if the CD20\/CD23 ratio is less than the value reported for a previously diagnosed SLL case, then the new case should be SLL. If the ratio for a case fails to satisfy both of these criteria (i.e. 
it is below all previously diagnosed MCL cases and above all SLL ones), then that particular case falls in a gray area and additional tests such as FISH or cyclin D1 immunohistochemistry (IHC) can be ordered. By adding the final diagnosis of the gray case to the MCL-SLL database, historical information accumulates and the database grows gradually, resulting in fewer orders for additional tests in the future, as long as the settings of the FCM experiment do not change dramatically. I consider the final diagnosis of all cases in my study to be free of error because either the diagnosis was obvious from flow cytometry analysis, or the uncertain cases had been verified by FISH and cyclin D1, which are regarded as definitive tests. The current study was limited to examination of FCM data from lymph node samples to avoid other potential variables from confounding this initial analysis. Further studies are warranted to determine if our approach yields similar diagnostic accuracy in peripheral blood and bone marrow samples. Analyzing more cases from other cancer agencies can be a follow-up of our study to confirm our results on a larger number of cases. Existing high dimensional FCM data may potentially contain valuable biological information that is hidden from conventional analyses because such approaches rely on bivariate plots and manual gating. Our identification of the CD20\/CD23 ratio is an example of revealing novel features that capture inapparent, but meaningful, diagnostic information that already resides within existing FCM data. While our goal for the current study was to develop and test the automated algorithm for improving diagnostic accuracy, this approach is capable of identifying novel multidimensional features that could provide valuable prognostic information, or aid in recognition and defining of novel biologic subtypes that are currently subsumed under a single diagnostic heading. 
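The threshold-free procedure described earlier in this discussion can be sketched as follows. This Python fragment is only an illustration under stated assumptions: the function name, the label strings, and the ratio values in the two databases are hypothetical, and a case that would satisfy both criteria at once (which did not occur in this study) is conservatively sent to the gray area.

```python
def conservative_label(ratio, confirmed_mcl, confirmed_sll):
    '''Label a new case from its CD20/CD23 ratio without a fixed cutoff.'''
    above_some_mcl = any(ratio > r for r in confirmed_mcl)
    below_some_sll = any(ratio < r for r in confirmed_sll)
    if above_some_mcl and not below_some_sll:
        return 'MCL'
    if below_some_sll and not above_some_mcl:
        return 'SLL'
    # Gray area: order additional tests such as FISH or cyclin D1 IHC.
    return 'gray'


# Hypothetical CD20/CD23 ratios of previously confirmed cases from one clinic.
mcl_db = [3.2, 5.0, 7.5]
sll_db = [0.3, 0.8, 1.1]
print(conservative_label(4.0, mcl_db, sll_db))  # above a confirmed MCL case
print(conservative_label(0.5, mcl_db, sll_db))  # below a confirmed SLL case
print(conservative_label(2.0, mcl_db, sll_db))  # between the two groups

# Once the gray case is resolved by additional tests, its ratio is added to
# the matching database, so the gray area shrinks over time.
mcl_db.append(2.0)
print(conservative_label(2.5, mcl_db, sll_db))
```

Because every label is justified by an already confirmed case from the same clinic, this rule trades a larger number of deferred cases early on for not having to calibrate a cutoff.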
Unsupervised clustering of lymphoma samples facilitated by our unbiased automatic approach will be a focus of future work and can potentially lead to discovery of such novel subtypes. Computational methods allow objective assessment of the relative contribution of component data features to overall diagnostic accuracy, and reveal that some conventional criteria can actually compromise this accuracy. Furthermore, computational approaches enable exploiting the full dimensionality of FCM data and can potentially lead to discovery of novel biomarkers relevant for clinical outcome.\n4.9 Reproducibility\nMy study is entirely reproducible. Assuming R version 2.9.2 and the packages Epi, caTools, and epiR are installed, one can reproduce the figures presented in this chapter as described below. First, download the MCLSLLreproduction package, which is available upon request. The package can be installed in Linux by: R CMD INSTALL MCLSLLreproduction_1.0.tar.gz Then, start R and load the package by library(MCLSLLreproduction). It contains a copy of the FCM features that are included in Supplementary Table A. By calling the function reproduce(), the figures will be reproduced from the FCM features in less than 10 minutes.\nChapter 5 Conclusions\n5.1 Summary\nAutomated methods enhance FCM data analysis by reducing time and increasing accuracy, and they open new windows to clinical and biological research. In this thesis, some of the computational challenges, such as clustering large biological datasets and classification in the presence of high dimensional noise, were efficiently addressed by non-trivial improvements to the state-of-the-art techniques in machine learning. All results reported in this thesis were validated by conventional train-test approaches. For instance, the sensitivity, specificity, and accuracy reported in Chapter 4 that show the performance of my automatic pipeline on the lymphoma data set are averaged over four-fold cross-validation (Table 4.1). 
Also, in the MCL-SLL study, I discovered the novel criteria useful for differential diagnosis by analyzing only 44 cases. Then, I validated the performance of my approach on 70 independent cases that were disjoint from my training set. The SamSPECTRAL and FeaLect methodologies are my two novel contributions to computer science that can potentially have a wider range of applications than FCM data analysis because they facilitate application of spectral clustering to large data sets and significantly improve feature selection techniques. In particular, SamSPECTRAL enhances spectral clustering by reducing the computational time in such a way that this clustering technique is now efficiently applicable to a wider range of problems. FeaLect improves feature selection techniques based on the Lasso, such as Bolasso, by proposing a statistical scoring scheme, thus providing a \"softer\" and more robust feature selection algorithm. My automatic pipeline is useful in clinical, biological, and medical studies by allowing objective assessment of the relative contribution of component data features to overall diagnostic accuracy. Improvement to diagnostic accuracy between mantle cell lymphoma (MCL) and small lymphocytic lymphoma (SLL) is an example of such applications. I showed that the CD20\/CD23 ratio performs better than FMC7 and CD23. Furthermore, our objective approaches revealed that inclusion of FMC7 expression in the diagnostic algorithm reduces its accuracy (from 100% to 97% according to our data). This was a surprising result for the clinicians because FMC7 has been widely used in practice for years [5, 36, 55, 58, 99]; however, its performance was not objectively quantified in comparison with other markers before this study. These findings can be used in clinics to increase the accuracy of diagnosis based on flow cytometry alone. 
Thus, by reducing the number of cases for which additional tests are required, this approach can save clinical resources for better treatment.\n5.2 Further Work\nShedding more light on the Lasso performance by studying feature scores is a possible future direction of my study. Also, some of the computational tools developed in this thesis can be improved further. For instance, the SamSPECTRAL methodology can be improved by major or minor modifications such as applying other schemes for estimating similarities between communities, combining clusters based on other combinatorial algorithms or biological criteria, and repeating the algorithm several times to obtain a more stable outcome. Also, theoretical analysis of the SamSPECTRAL algorithm is mathematically interesting, in particular to investigate how eigenvectors of the sampled graph estimate the eigenvectors of the original graph. If Conjectures 9 and 10 are proven, they can provide a theoretical explanation for the performance of SamSPECTRAL. Furthermore, discovering minimum sufficient conditions for SamSPECTRAL to work properly can provide a helpful guide to decide if this algorithm is useful in other large graph clustering applications such as social networks and web graphs. While clustering is a major step of my current automatic pipeline, one can think of completely different approaches. For instance, the Wasserstein metric (also known as the earth mover's distance) can be useful to compute the similarity between FCM samples and can be a statistical basis for automatic analysis [17]. However, the time required for computing this measure of similarity is O(n^3) for a sample containing n cells, which is too much to be efficient in practice. I speculate the running time can be reduced using faithful sampling in a similar way that I enhanced spectral clustering. 
While my goal in the current study was to develop and test my automated algorithm for improving diagnostic accuracy, this approach is capable of identifying novel multidimensional features that could provide valuable prognostic information, or aid in recognition and defining of novel biologic subtypes that are currently subsumed under a single diagnostic heading. Unsupervised clustering of lymphoma samples facilitated by my unbiased automatic approach could be a focus of future work and potentially lead to discovery of such novel subtypes. One challenge in applying my methodology in such a study is selecting the most informative features because currently, FeaLect is based on classification rather than clustering. Possibly, performing a survival analysis is the right path to take. Addressing this challenge requires FeaLect to be generalized to solve regression problems as satisfactorily as it currently handles classification problems. My pipeline can also be potentially useful for improving quality check of diagnosis by identifying those patients who might benefit from having a secondary review of their diagnosis. The current clinical standard of care mandates that a subset of patients undergo secondary review as part of quality assurance. While candidates for secondary review are conventionally selected completely at random, my hypothesis is that the best candidates are those patients whose primary clinical diagnosis contradicts that of the classifier trained on retrospective FCM data. The code and computational tools that I have developed could be easily applied to lymphoma datasets from other institutions to confirm my results and conduct studies that were not possible in this thesis due to the low number of available samples. 
Examples of such studies include: investigating SLL prognosis based on FCM data, investigating transformation from follicular lymphoma to DLBCL (see footnote 1), and studying rare lymphoma types such as marginal zone lymphoma (MZL), lymphoblastic lymphoma (LBL), peripheral T-cell lymphoma (PTCL), and angioimmunoblastic lymphoma (AILD). The only challenge in such studies is setting the parameters for SamSPECTRAL, which can be addressed by following the instructions in the package vignette. Furthermore, the application of my automatic pipeline is not limited to lymphoma; it can be applied to other diseases for which flow cytometry is a diagnostic test, such as leukemia, GVHD, HIV, and psoriasis. Here, the challenge would be to include features such as cytokine response that are totally different in nature from the FCM features I studied in the current study. Semi-supervised techniques might be helpful to include more biological knowledge into the automatic analysis if such knowledge is available.\nFootnote 1: A follicular lymphoma patient may develop DLBCL, which is much more aggressive. The risk of transformation is 30% after 10 years from the diagnosis of indolent lymphoma [6].\nBibliography\n[1] N. Aghaeepour, A. H. Khodabakhshi, and R. R. Brinkman. An empirical study of cluster evaluation metrics using flow cytometry data. In Proceedings of NIPS workshop \"Clustering: Science or Art\", 2009.\n[2] N. Aghaeepour, P. Mah, G. Finak, A. A. Barbo, J. Bramson, H. Bretschneider, C. Chan, P. L. D. Jager, A. Gupta, A. Hadj-Khodabakhshi, F. E. Khettabi, G. Luta, J. M. Maisog, P. Majek, G. J. McLachlan, I. Naim, R. Nikolic, Y. Qian, J. Quinn, A. Roth, G. Sharma, P. Shooshtari, J. Spidlen, I. P. Sugar, J. Vilvcek, K. Wang, A. P. Weng, H. Zare, T. R. Mosmann, H. Hoos, J. Schoenfeld, R. Gottardo, R. Brinkman, and R. H. Scheuermann. Critical assessment of cell population identification techniques for flow cytometry data: Results of flowcap-1. Submitted, 2011.\n[3] N. Aghaeepour, R. Nikolic, H. 
Hoos, and R. Brinkman. Rapid cell population identification in flow cytometry data. Cytometry Part A, 79(1):6-13, 2011. ISSN 1552-4930.
[4] N. Aghaeepour, R. Nikolic, H. H. Hoos, and R. Brinkman. Rapid Cell Population Identification in Flow Cytometry Data. Cytometry Part A, 79(1):6-13, 2011. ISSN 1552-4930.
[5] E. Ahmad, D. Garcia, and B. Davis. Clinical utility of CD23 and FMC7 antigen coexistent expression in B-cell lymphoproliferative disorder subclassification. Cytometry, 50(1):1-7, 2002. ISSN 1097-0320.
[6] A. J. Al-Tourah, K. K. Gill, M. Chhanabhai, P. J. Hoskins, R. J. Klasa, K. J. Savage, L. H. Sehn, T. N. Shenkier, R. D. Gascoyne, and J. M. Connors. Population-based analysis of incidence and outcome of transformed non-Hodgkin's lymphoma. Journal of Clinical Oncology, 26(32):5165-5169, November 2008.
[7] U. Alon, N. Barkai, D. Notterman, K. Gish, S. Ybarra, D. Mack, and A. Levine. Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proc. Natl. Acad. Sci. USA, 96(12):6745-6750, 1999.
[8] D. Angelosante, G. Giannakis, and E. Grossi. Compressed sensing of time-varying signals. In Digital Signal Processing, 2009 16th International Conference on, pages 1-8, 2009.
[9] S. Asplund, R. McKenna, J. Doolittle, and S. Kroft. CD5-positive B-cell neoplasms of indeterminate immunophenotype: a clinicopathologic analysis of 26 cases. Applied Immunohistochemistry & Molecular Morphology, 13(4):311, 2005. ISSN 1062-3345.
[10] K. Autio, Y. Aalto, K. Franssila, E. Elonen, H. Joensuu, and S. Knuutila. Low number of DNA copy number changes in small lymphocytic lymphoma. Haematologica, 83(8):690, 1998.
[11] A. Azran and Z. Ghahramani. Spectral methods for automatic multiscale data clustering. In Computer Vision and Pattern Recognition, 2006 IEEE Computer Society Conference on, volume 1, pages 190-197, 2006.
[12] F. Bach. Model-consistent sparse estimation through the bootstrap.
Technical report, HAL-00354771, 2009.
[13] F. R. Bach. Bolasso: model consistent lasso estimation through the bootstrap. In ICML '08: Proceedings of the 25th international conference on Machine learning, 2008.
[14] F. R. Bach and M. I. Jordan. Learning spectral clustering, with application to speech separation. J. Mach. Learn. Res., 7:1963-2001, 2006.
[15] G. Baerlocher, I. Vulto, J. G. de, and P. Lansdorp. Flow cytometry and FISH to measure the average length of telomeres (flow FISH). Nature protocols, 1(5):2365, 2006.
[16] A. Bashashati and R. Brinkman. A survey of flow cytometry data analysis methods. Advances in Bioinformatics, pages 1-19, 2009.
[17] T. Bernas, E. Asem, J. Robinson, and B. Rajwa. Quadratic form: a robust metric for quantitative comparison of flow cytometric histograms. Cytometry Part A, 73(8):715-726, 2008.
[18] N. Biggs. Topics in algebraic graph theory: Encyclopedia of mathematics and its applications. 16(1):171-172, 2007.
[19] T. Biyikoglu, J. Leydold, and P. F. Stadler. Laplacian eigenvectors of graphs. Lecture Notes in Mathematics, 1915. Springer, 2007.
[20] L. Boddy, M. Wilkins, and C. Morris. Pattern recognition in flow cytometry. Cytometry, 44(3):195-209, 2001. ISSN 0196-4763.
[21] M. Boedigheimer and J. Ferbas. Mixture modeling approach to flow cytometry data. Cytometry Part A, 73(5):421-429, 2008.
[22] L. Breiman. Bagging predictors. Machine learning, 24(2):123-140, 1996. ISSN 0885-6125.
[23] L. Breiman. Random forests. Machine learning, 45(1):5-32, 2001. ISSN 0885-6125.
[24] R. Brinkman, M. Gasparetto, S. Lee, A. Ribickas, J. Perkins, W. Janssen, R. Smiley, and C. Smith. High-content flow cytometry and temporal data analysis for defining a cellular signature of graft-versus-host disease. Biol Blood Marrow Transplant, 13(6):691-700, 2007. ISSN 1083-8791.
[25] M. Brun, D. Sabbagh, S. Kim, and E. Dougherty. Corrected small-sample estimation of the Bayes error. Bioinformatics, 19(8):944, 2003.
[26] K. M. Carter, R. Raich, W. G.
Finn, and A. O. Hero. Information preserving component analysis: Data projections for flow cytometry analysis. IEEE Journal of Selected Topics in Signal Processing, 3(1):148-158, Feb. 2009.
[27] C. Chan, F. Feng, J. Ottinger, D. Foster, M. West, and T. B. Kepler. Statistical mixture modeling for cell subtype identification in flow cytometry. Cytometry Part A, 73A(8):693-701, 2008.
[28] W. Chan, J. Armitage, R. Gascoyne, J. Connors, P. Close, P. Jacobs, A. Norton, T. Lister, E. Pedrinis, F. Cavalli, et al. A clinical evaluation of the International Lymphoma Study Group classification of non-Hodgkin's lymphoma. Blood, 89(11):3909-3918, 1997.
[29] F. R. K. Chung. Spectral Graph Theory (CBMS Regional Conference Series in Mathematics, No. 92). American Mathematical Society, February 1997. ISBN 0821803158.
[30] M. P. Conrad. A rapid, non-parametric clustering scheme for flow cytometric data. Pattern Recognition, 20(2):229-35, 1987. ISSN 0305-7364.
[31] E. Costa, C. Pedreira, S. Barrena, Q. Lecrevisse, J. Flores, S. Quijano, J. Almeida, M. del Carmen Garcia-Macias, S. Bottcher, J. Van Dongen, et al. Automated pattern-guided principal component analysis vs expert-based immunophenotypic classification of B-cell chronic lymphoproliferative disorders: a step forward in the standardization of clinical immunophenotyping. Leukemia, 25(2):385-385, 2011.
[32] F. Craig and K. Foon. Flow cytometric immunophenotyping for hematologic neoplasms. Blood, 111(8):3941, 2008.
[33] A. Criel, G. Verhoef, R. Vlietinck, C. Mecucci, J. Billiet, L. Michaux, P. Meeus, A. Louwagie, A. Van Orshoven, A. Van Hoof, et al. Further characterization of morphologically defined typical and atypical CLL: a clinical, immunophenotypic, cytogenetic and prognostic study on 390 cases. British Journal of Haematology, 97(2):383-391, 1997. ISSN 1365-2141.
[34] J. K. Cullum and R. A. Willoughby. Lanczos Algorithms for Large Symmetric Eigenvalue Computations, Volume I. SIAM, 2002.
[35] Z. Darzynkiewicz, P. Smolewski, and E. Bedner. Use of flow and laser scanning cytometry to study mechanisms regulating cell cycle and controlling cell death. Clinics in laboratory medicine, 21(4):857-73, 2001.
[36] J. Delgado, E. Matutes, A. Morilla, R. Morilla, K. Owusu-Ankomah, F. Rafiq-Mohammed, I. Giudice, and D. Catovsky. Diagnostic significance of CD20 and FMC7 expression in B-cell disorders. American journal of clinical pathology, 120(5):754, 2003. ISSN 0002-9173.
[37] J. Diaz-Romero, S. Romeo, J. Bov\u00e9e, P. Hogendoorn, P. Heini, and P. Mainil-Varlet. Hierarchical clustering of flow cytometry data for the study of conventional central chondrosarcoma. Journal of cellular physiology, 225(2):601-611, 2010.
[38] F. DiRaimondo, M. Albitar, Y. Huh, S. O'Brien, M. Montillo, A. Tedeschi, H. Kantarjian, S. Lerner, R. Giustolisi, and M. Keating. The clinical and diagnostic relevance of CD23 expression in the chronic lymphoproliferative disease. Cancer, 94(6):1721-1730, 2002. ISSN 1097-0142.
[39] S. V. Dongen. Graph clustering via a discrete uncoupling process. SIAM Journal on Matrix Analysis and Applications, 30(1):121-141, 2008.
[40] A. D. Donnenberg and V. S. Donnenberg. Rare-event analysis in flow cytometry. Clinics in Laboratory Medicine, 27(3):627-652, 2007.
[41] R. Duda and P. Hart. Pattern classification and scene analysis. Wiley, 1996.
[42] B. Dykstra, D. Kent, M. Bowie, L. McCaffrey, M. Hamilton, K. Lyons, S. Lee, R. Brinkman, and C. Eaves. Long-term propagation of distinct hematopoietic differentiation programs in vivo. Cell Stem Cell, 1(2):218-29, 2007. ISSN 1934-5909.
[43] B. Efron and R. J. Tibshirani. An Introduction to the Bootstrap (Chapman & Hall\/CRC Monographs on Statistics & Applied Probability). Chapman and Hall\/CRC, 1998.
[44] B. Efron, T. Hastie, L. Johnstone, and R. Tibshirani. Least angle regression. Annals of Statistics, 32:407-499, 2004.
[45] B. Eichhorst, M. Hallek, and M. Dreyling.
Chronic lymphocytic leukemia: ESMO Clinical Recommendations for diagnosis, treatment and follow-up. Ann Oncol, 19(suppl 2):60-62, 2008.
[46] T. Elliott and X. Jin. Modulation of expression and cellular distribution of p21 by macrophage migration inhibitory factor. Journal of Inflammation, 6, 2009.
[47] M. Eric. Mfi: a flow cytometry list mode data analysis program optimized for batch processing under MS-DOS. 2001. URL http:\/\/www.umass.edu\/microbio\/mfi.
[48] M. A. Fanty and R. Cole. Spoken letter recognition. In NIPS, page 220, 1990.
[49] G. Finak, A. Bashashati, R. Brinkman, and R. Gottardo. Merging mixture components for cell population identification in flow cytometry. Advances in Bioinformatics, 2009, 2009. ISSN 1687-8027.
[50] G. Finak, A. Bashashati, R. R. Brinkman, and R. Gottardo. Merging mixture components for cell population identification in flow cytometry. Advances in Bioinformatics, 2009:1-12, 2009.
[51] W. G. Finn, K. M. Carter, R. Raich, L. M. Stoolman, and A. O. Hero. Analysis of clinical flow cytometric immunophenotyping data by clustering on statistical manifolds: Treating flow cytometry data as high-dimensional objects. Clinical Cytometry B, 76B(1):1-7, 2009.
[52] C. Fowlkes, S. Belongie, F. Chung, and J. Malik. Spectral grouping using the Nystrom method. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(2):214-225, 2004. ISSN 0162-8828. doi:http:\/\/doi.ieeecomputersociety.org\/10.1109\/TPAMI.2004.1262185.
[53] A. Frank and A. Asuncion. UCI machine learning repository, 2010.
[54] L. Fu, M. Yang, R. Braylan, and N. Benson. Real-time adaptive clustering of flow cytometric data. Pattern Recognition, 26(2):365-373, 1993.
[55] D. Garcia, M. Rooney, E. Ahmad, and B. Davis. Diagnostic usefulness of CD23 and FMC7 antigen expression patterns in B-cell lymphoma classification. American journal of clinical pathology, 115(2):258, 2001. ISSN 0002-9173.
[56] L. Garc\u00eda-Escudero, A. Gordaliza, C. Matr\u00e1n, and A.
Mayo-Iscar. A general trimming approach to robust cluster analysis. The Annals of Statistics, 36(3):1324-1345, 2008. ISSN 0090-5364.
[57] R. Gascoyne. The Science and Value of Lymphoma Classification. Orbital diseases present status and future challenges, 12:65, 2005.
[58] C. Geisler, J. Larsen, N. Hansen, M. Hansen, B. Christensen, B. Lund, H. Nielsen, T. Plesner, K. Thorling, and E. Andersen. Prognostic importance of flow cytometric immunophenotyping of 540 consecutive patients with B-cell chronic lymphocytic leukemia. Blood, 78(7):1795, 1991.
[59] M. R. Gil Alterovitz, Roseann Benson. Automation in proteomics and genomics: an engineering case-based approach. Wiley, February 2009.
[60] L. Ginaldi, M. De Martinis, E. Matutes, N. Farahat, R. Morilla, and D. Catovsky. Levels of expression of CD19 and CD20 in chronic B cell leukaemias. Journal of clinical pathology, 51(5):364, 1998. ISSN 1472-4146.
[61] E. Goldberg, H. Jeong, I. Kruglikov, R. Tremblay, R. Lazarenko, and B. Rudy. Rapid developmental maturation of neocortical FS cell intrinsic excitability. Cerebral Cortex, 21(3):666, 2011.
[62] R. Gorman and T. Sejnowski. Analysis of hidden units in a layered network trained to classify sonar targets. Neural networks, 1(1):75-89, 1988.
[63] I. Guyon. Feature extraction: foundations and applications, volume 207. Springer Verlag, 2006.
[64] I. Guyon and A. Elisseeff. An introduction to variable and feature selection. The Journal of Machine Learning Research, 3:1157-1182, 2003.
[65] I. Guyon, A. B. Hur, S. Gunn, and G. Dror. Result analysis of the NIPS 2003 feature selection challenge. In Advances in Neural Information Processing Systems 17, pages 545-552. MIT Press, 2004.
[66] F. Hahne, A. Khodabakhshi, A. Bashashati, C. Wong, R. Gascoyne, A. Weng, V. Seyfert-Margolis, K. Bourcier, A. Asare, T. Lumley, et al. Per-channel basis normalization methods for flow cytometry data. Cytometry Part A, 77(2):121-131, 2010.
[67] M. Hallek, B. Cheson, D. Catovsky, F.
Caligaris-Cappio, G. Dighiero, H. Dohner, P. Hillmen, M. Keating, E. Montserrat, K. Rai, et al. Guidelines for the diagnosis and treatment of chronic lymphocytic leukemia: a report from the International Workshop on Chronic Lymphocytic Leukemia updating the National Cancer Institute-Working Group 1996 guidelines. Blood, 111(12):5446, 2008.
[68] T. Hastie, R. Tibshirani, and J. Friedman. The elements of statistical learning: data mining, inference, and prediction. Springer Verlag, 2009. ISBN 0387848576.
[69] T. Hastie, R. Tibshirani, and J. Friedman. Model assessment and selection. The elements of statistical learning, pages 219-259, 2009.
[70] T. S. Hawley and R. G. Hawley. Flow Cytometry Protocols, 2nd edition. Methods in Molecular Biology. Humana Press, 2005.
[71] E. Hensinger, I. Flaounas, and N. Cristianini. Learning the preferences of news readers with SVM and lasso ranking. Artificial Intelligence Applications and Innovations, pages 179-186, 2010.
[72] L. Herzenberg, D. Parks, B. Sahaf, O. Perez, M. Roederer, and L. Herzenberg. The history and future of the fluorescence activated cell sorter and flow cytometry: a view from Stanford. Clinical chemistry, 48(10):1819, 2002.
[73] T. C. Hesterberg, N. H. Choi, L. Meier, and C. Fraley. Least angle and l1 penalized regression: A review. Statistics Surveys. URL http:\/\/www.i-journals.org\/ss\/viewarticle.php?id=35.
[74] W. Hiddemann, D. Longo, B. Coiffier, R. Fisher, F. Cabanillas, F. Cavalli, L. Nadler, V. De Vita, T. Lister, and J. Armitage. Lymphoma classification - the gap between biology and clinical management is closing. Blood, 1996.
[75] K. Hornik. A clue for cluster ensembles. Journal of Statistical Software, 14(12):1-25, 2005.
[76] W. Hubl, J. Iturraspe, and R. Braylan. FMC7 antigen expression on normal and malignant B-cells can be predicted by expression of CD20. Cytometry, 34(2):71-74, 1998.
[77] R. Hubmann, J. Schwarzmeier, M. Shehata, M. Hilgarth, M. Duechler, M. Dettke, and R. Berger.
Notch2 is involved in the overexpression of CD23 in B-cell chronic lymphocytic leukemia. Blood, 99(10):3742, 2002.
[78] R. Ichinohasama, J. DeCoteau, J. Myers, M. Kadin, T. Sawai, and K. Ooya. Three-color flow cytometry in the diagnosis of malignant lymphoma based on the comparative cell morphology of lymphoma cells and reactive lymphocytes. Leukemia, 11(11):1891, 1997.
[79] D. Jeffries, I. Zaidi, B. de Jong, M. Holland, and D. Miles. Analysis of flow cytometry data using an automatic processing tool. Cytometry Part A, 73(9):857-867, 2008.
[80] D. Jeffries, I. Zaidi, B. de Jong, M. J. Holland, and D. J. C. Miles. Analysis of flow cytometry data using an automatic processing tool. Cytometry A, 73(9):857-67, 2008.
[81] N. Johnson, M. Boyle, A. Bashashati, S. Leach, A. Brooks-Wilson, L. Sehn, M. Chhanabhai, R. Brinkman, J. Connors, A. Weng, et al. Diffuse large B-cell lymphoma: reduced CD20 expression is associated with an inferior survival. Blood, 113(16):3773, 2009.
[82] P. Juan-Manuel and G. Raphael. Optimizing transformations for automated, high throughput analysis of flow cytometry data. BMC Bioinformatics, 11.
[83] K. Kira and L. Rendell. The feature selection problem: Traditional methods and a new algorithm. In Proceedings of the National Conference on Artificial Intelligence, pages 129-129. John Wiley & Sons Ltd, 1992.
[84] D. Klinke II and K. Brundage. Scalable analysis of flow cytometry data using R\/Bioconductor. Cytometry Part A, 75(8):699-706, 2009.
[85] R. I. Kondor and J. Lafferty. Diffusion kernels on graphs and other discrete structures. In Proceedings of the ICML, pages 315-322, 2002.
[86] P. Krutzik, J. Irish, G. Nolan, and O. Perez. Analysis of protein phosphorylation and cellular signaling events by flow cytometry: techniques and clinical applications. Clinical Immunology, 110(3):206-221, 2004. ISSN 1521-6616.
[87] J. Lakoumentas, J. Drakos, M. Karakantza, G. Nikiforidis, and G. Sakellaropoulos.
Bayesian clustering of flow cytometry data for the diagnosis of B-chronic lymphocytic leukemia. Journal of Biomedical Informatics, 42(2):251-261, 2009.
[88] T. Lazar. Manual of Clinical Laboratory Immunology. Laboratory Hematology, 9(1):52-52, 2003. ISSN 1080-2924.
[89] K. Lo, R. Brinkman, and R. Gottardo. Automated gating of flow cytometry data via robust model-based clustering. Cytometry Part A, 73(4):321-332, 2008. ISSN 1552-4930.
[90] K. Lo, F. Hahne, R. Brinkman, and R. Gottardo. flowClust: a Bioconductor package for automated gating of flow cytometry data. BMC Bioinformatics, 10(1):1-145, April 2009.
[91] K. Lounici et al. Sup-norm convergence rate and sign concentration property of Lasso and Dantzig estimators. Electronic Journal of Statistics, 2:90-102, 2008. ISSN 1935-7524.
[92] E. Lugli, M. Roederer, and A. Cossarizza. Data analysis in flow cytometry: The future just started. Cytometry Part A, 77(7):705-13, 2010.
[93] E. Lugli, M. Roederer, and A. Cossarizza. Data analysis in flow cytometry: The future just started. Cytometry Part A, 77(7):705-713, 2010. ISSN 1552-4930.
[94] A. Lyons and C. Parish. Determination of lymphocyte division by flow cytometry. Journal of immunological methods, 171(1):131-137, 1994. ISSN 0022-1759.
[95] S. Maldonado and R. Weber. Embedded feature selection for support vector machines: State-of-the-art and future challenges. Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications, pages 304-311, 2011.
[96] R. C. Mann. On multiparameter data analysis in flow cytometry. Cytometry A, 8(2):184-189, 1987.
[97] D. Mason, N. Harris, G. Delsol, et al. WHO Classification of Tumours of Hematopoietic and Lymphoid Tissues. International Agency for Research on Cancer, (1):317-319, 2008.
[98] M. J. Matasar and A. D. Zelenetz. Overview of lymphoma diagnosis and management. Radiologic Clinics of North America, 46(2):175-198, 2008.
[99] E. Matutes, K. Owusu-Ankomah, R. Morilla, M. Garcia, A. Houlihan, T.
Que, and D. Catovsky. The immunological profile of B-cell disorders and proposal of a scoring system for the diagnosis of CLL. Leukemia: official journal of the Leukemia Society of America, 8(10):1640, 1994.
[100] N. Meinshausen and P. Buhlmann. Consistent neighborhood selection for sparse high-dimensional graphs with the Lasso. Statist. Surv, 2:61-93, 2004.
[101] N. Meinshausen and B. Yu. Lasso-type recovery of sparse representations for high-dimensional data. Annals of Statistics, 37(1):246-270, 2009.
[102] H. Meyerson, G. MacLennan, W. Husel, W. Tse, H. Lazarus, and D. Kaplan. D cyclins in CD5+ B-cell lymphoproliferative disorders. American Journal of Clinical Pathology, 125(2):241, 2006. ISSN 0002-9173.
[103] R. N. Miranda, R. C. Briggs, M. C. Kinney, P. A. Veno, R. D. Hammer, and J. B. Cousar. Immunohistochemical detection of cyclin D1 using optimized conditions is highly specific for mantle cell lymphoma and hairy cell leukemia. Modern Pathology, 13(12):1308-1314, 2000.
[104] S. Molica, A. Dattilo, A. Mannella, and D. Levato. CD11c expression in B-cell chronic lymphocytic leukemia. A comparison of results obtained with different monoclonal antibodies. Haematologica, 79(5):452, 1994.
[105] S. Molica, A. Mannella, G. Crispino, A. Dattilo, and D. Levato. Comparative flow cytometric evaluation of bcl-2 oncoprotein in CD5+ and CD5-B-cell lymphoid chronic leukemias. Haematologica, 82(5):555, 1997.
[106] R. Molot, T. Meeker, C. Wittwer, S. Perkins, G. Segal, A. Masih, R. Braylan, and C. Kjeldsberg. Antigen expression and polymerase chain reaction amplification of mantle cell lymphomas. Blood, 83(6):1626, 1994.
[107] W. Morice, P. Kurtin, J. Hodnefield, T. Shanafelt, J. Hoyer, E. Remstein, and C. Hanson. Predictive value of blood and bone marrow flow cytometry in B-cell lymphoma classification: comparative analysis of flow cytometry and tissue biopsy in 252 patients. Mayo Clinic Proceedings, 83(7):776, 2008.
[108] I. Naim, S. Datta, G. Sharma, J.
Cavenaugh, and T. Mosmann. SWIFT: Scalable weighted iterative sampling for flow cytometry clustering. Proc. IEEE Intl. Conf. Acoustics Speech and Sig. Proc., pages 509-512, 2010.
[109] U. Naumann and M. Wand. Automation in high-content flow cytometry screening. Cytometry Part A, 75(9):789-797, 2009.
[110] U. Naumann, G. Luta, and M. Wand. The curvHDR method for gating flow cytometry samples. BMC bioinformatics, 11(1):44, 2010.
[111] U. Naumann, G. Luta, and M. Wand. The curvHDR method for gating flow cytometry samples. BMC bioinformatics, 11(1):44, 2010. ISSN 1471-2105.
[112] S. Nemati, M. Basiri, N. Ghasem-Aghaee, and M. Aghdam. A novel ACO-GA hybrid algorithm for feature selection in protein function prediction. Expert systems with applications, 36(10):12086-12094, 2009.
[113] A. Y. Ng, M. I. Jordan, and Y. Weiss. On spectral clustering: Analysis and an algorithm. In Advances in Neural Information Processing Systems 14, pages 849-856. MIT Press, 2001.
[114] R. Nunez. DNA measurement and cell cycle analysis by flow cytometry. Current issues in molecular biology, 3:67-70, 2001.
[115] S. J. Ochatt. Flow cytometry (ploidy determination, cell cycle analysis, DNA content per nucleus). Medicago truncatula handbook, Chapter 2.2.7, Online 2006. URL http:\/\/www.noble.org\/medicagohandbook\/.
[116] M. Ocqueteau, J. San Miguel, M. Gonzalez, J. Almeida, and A. Orfao. Do myelomatous plasma cells really express surface immunoglobulins? Haematologica, 81(5):460, 1996.
[117] W. Overton. Modified histogram subtraction technique for analysis of flow cytometry data. Cytometry, 9(6):619-626, 1988. ISSN 1097-0320.
[118] C. Pedreira, E. Costa, S. Barrena, Q. Lecrevisse, J. Almeida, J. van Dongen, A. Orfao, et al. Generation of flow cytometry data files with a potentially infinite number of dimensions. Cytometry Part A, 73(9):834-846, 2008.
[119] W. Pentney and M. Meila. Spectral clustering of biological sequence data. In AAAI, pages 845-850, 2005.
[120] S. Perfetto, P.
Chattopadhyay, and M. Roederer. Seventeen-colour flow cytometry: unravelling the immune system. Nature Reviews Immunology, 4(8):648-655, 2004. ISSN 1474-1733.
[121] S. P. Perfetto, P. K. Chattopadhyay, and M. Roederer. Seventeen-colour flow cytometry: unravelling the immune system. Nat Rev Immunol, 4(8):648-655, August 2004.
[122] A. Persidis. Data mining in biotechnology. Nature Biotechnology, 18(2):237-238, 2000.
[123] L. Prechelt. Automatic early stopping using cross validation: quantifying the criteria. Neural Networks, 11(4):761-767, 1998.
[124] F. Preffer and D. Dombkowski. Advances in complex multiparameter flow cytometry technology: Applications in stem cell research. Clinical Cytometry B, 76(5):295-314, 2009.
[125] S. Pyne, X. Hu, K. Wang, E. Rossin, T. Lin, L. Maier, C. Baecher-Allan, G. McLachlan, P. Tamayo, D. Hafler, et al. Automated high-dimensional flow cytometric data analysis. Proceedings of the National Academy of Sciences, 106(21):8519, 2009. ISSN 0027-8424.
[126] S. Pyne, X. Hu, K. Wang, E. Rossin, T.-I. Lin, L. M. Maier, C. Baecher-Allan, G. J. Mclachlan, P. Tamayo, D. A. Hafler, P. L. De Jager, and J. P. Mesirov. Automated high-dimensional flow cytometric data analysis. Proceedings of the National Academy of Sciences, 106(21):8519-8524, May 2009.
[127] Y. Qian, C. Wei, F. Eun-Hyung Lee, J. Campbell, J. Halliley, J. Lee, J. Cai, Y. Kong, E. Sadat, E. Thomson, et al. Elucidation of seventeen human peripheral blood B-cell subsets and quantification of the tetanus response using a density-based method for the automated identification of cell populations in multidimensional flow cytometry data. Cytometry Part B: Clinical Cytometry, 78(S1):S69-S82, 2010. ISSN 1552-4957.
[128] J. Quackenbush. Computational analysis of microarray data. Nature Reviews Genetics, 2(6):418-427, 2001.
[129] J. Quinn, P. Fisher, R. Capocasale, R. Achuthanandam, M. Kam, P. Bugelski, and L. Hrebien.
A statistical pattern recognition approach for determining cellular viability and lineage phenotype in cultured cells and murine bone marrow. Cytometry Part A, 71(8):612-624, 2007. ISSN 1552-4930.
[130] A. Ribeiro, E. Vazquez-Sequeiros, L. Wiersema, K. Wang, J. Clain, and M. Wiersema. EUS-guided fine-needle aspiration combined with flow cytometry and immunocytochemistry in the diagnosis of lymphoma. Gastrointestinal endoscopy, 53(4):485-491, 2001.
[131] M. Roederer. Spectral compensation for flow cytometry: visualization artifacts, limitations, and caveats. Cytometry, 45(3):194-205, 2001. ISSN 0196-4763.
[132] M. Roederer and R. Hardy. Frequency difference gating: a multivariate method for identifying subsets that differ between samples. Cytometry, 45(1):56-64, 2001. ISSN 1097-0320.
[133] Y. Saeys, I. Inza, and P. Larra\u00f1aga. A review of feature selection techniques in bioinformatics. Bioinformatics, 23(19):2507, 2007.
[134] R. Scheuermann, Y. Qian, C. Wei, and I. Sanz. ImmPort FLOCK: Automated cell population identification in high dimensional flow cytometry data. The Journal of Immunology, 182(Meeting Abstracts 1):42-17, 2009.
[135] E. Schlette, K. Fu, and L. Medeiros. CD23 expression in mantle cell lymphoma: clinicopathologic features of 18 cases. American Journal of Clinical Pathology, 120(5):760-6, 2003.
[136] E. Schlette, K. Fu, and L. Medeiros. CD23 expression in mantle cell lymphoma: clinicopathologic features of 18 cases. American journal of clinical pathology, 120(5):760, 2003. ISSN 0002-9173.
[137] D. Scott. Multivariate density estimation: theory, practice, and visualization. Wiley-Interscience, 1992.
[138] S. Sheather and M. Jones. A reliable data-based bandwidth selection method for kernel density estimation. Journal of the Royal Statistical Society. Series B (Methodological), 53(3):683-690, 1991. ISSN 0035-9246.
[139] J. Shi and J. Malik. Normalized cuts and image segmentation.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 22:888-905, 1997.
[140] N. Shulman, M. Bellew, G. Snelling, D. Carter, Y. Huang, H. Li, S. Self, M. McElrath, and S. De Rosa. Development of an automated analysis system for data from flow cytometric intracellular cytokine staining assays from clinical vaccine trials. Cytometry Part A, 73(9):847-856, 2008.
[141] U. Simon, H. Mucha, and R. Bruggemann. Model-based cluster analysis applied to flow cytometry data. Innovations in Classification, Data Science, and Information Systems, pages 69-76, 2005.
[142] P. Smyth. Clustering using Monte Carlo cross-validation. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, pages 126-133, 1996.
[143] D. A. Spielman and S.-H. Teng. Spectral sparsification of graphs. CoRR, abs\/0808.4134, 2008.
[144] L. Staudt, G. Wright, S. Dave, and B. Tan. Methods for diagnosing lymphoma types, May 4 2010. US Patent 7,711,492.
[145] M. Stetler-Stevenson. Flow cytometry in lymphoma diagnosis and prognosis: useful? Best practice and research. Clinical haematology, 16(4):583, 2003.
[146] M. Suchard, Q. Wang, C. Chan, J. Frelinger, A. Cron, and M. West. Understanding GPU programming for statistical computation: Studies in massively parallel massive mixtures. Journal of Computational and Graphical Statistics, 19(2):419-438, 2010. ISSN 1061-8600.
[147] I. Sugar and S. Sealfon. Misty mountain clustering: application to fast unsupervised flow cytometry gating. BMC bioinformatics, 11(1):502, 2010.
[148] I. Sug\u00e1r and S. Sealfon. Misty Mountain clustering: application to fast unsupervised flow cytometry gating. BMC Bioinformatics, 11:502, 2010. ISSN 1471-2105.
[149] M. Suni, H. Dunn, P. Orr, R. Laat, E. Sinclair, S. Ghanekar, B. Bredt, J. Dunne, V. Maino, and H. Maecker. Performance of plate-based cytokine flow cytometry with automated data analysis. BMC immunology, 4(1):9, 2003. ISSN 1471-2172.
[150] A. T\u00e1rnok.
Beyond the flat world. Cytometry Part A, 79(1):1-2, 2011. ISSN 1552-4930.
[151] A. Tefferi, B. Bartholmai, T. Witzig, C. Li, C. Hanson, and R. Phyliky. Heterogeneity and clinical relevance of the intensity of CD20 and immunoglobulin light-chain expression in B-cell chronic lymphocytic leukemia. American journal of clinical pathology, 106(4):457-461, 1996. ISSN 0002-9173.
[152] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society (Series B), 58:267-288, 1996.
[153] G. Toussaint. Bibliography on estimation of misclassification. Information Theory, IEEE Transactions on, 20(4):472-479, 2002. ISSN 0018-9448.
[154] L. N. Trefethen and D. Bau. Numerical Linear Algebra. SIAM, 1997.
[155] C. Tsang, S. Kwong, and H. Wang. Genetic-fuzzy rule mining approach and evaluation of feature selection techniques for anomaly intrusion detection. Pattern Recognition, 40(9):2373-2391, 2007.
[156] U. von Luxburg. A tutorial on spectral clustering. Statistics and Computing, 17(4):395-416, December 2007.
[157] U. von Luxburg, M. Belkin, and O. Bousquet. Consistency of spectral clustering. Annals of Statistics, 36(2):555-586, 2008.
[158] M. J. Wainwright. Sharp thresholds for high-dimensional and noisy sparsity recovery using l1-constrained quadratic programming (lasso). IEEE Trans. Inf. Theor., 55(5), 2009.
[159] B. C. William and J. T. Gregory. Molecular Diagnostics: For the Clinical Laboratorian, pages 164-165. Humana Press, 2005.
[160] T. Xiang and S. Gong. Spectral clustering with eigenvector selection. Pattern Recogn., 41(3):1012-1029, 2008. ISSN 0031-3203. doi:http:\/\/dx.doi.org\/10.1016\/j.patcog.2007.07.023.
[161] D. Yan, L. Huang, and M. Jordan. Fast approximate spectral clustering. Technical Report UCB\/EECS-2009-45, EECS Department, University of California, Berkeley, Mar 2009.
[162] Y. Yatabe, R. Suzuki, K. Tobinai, Y. Matsuno, R. Ichinohasama, M. Okamoto, M. Yamaguchi, J. Tamaru, N. Uike, Y. Hashimoto, et al.
Significance of cyclin D1 overexpression for the diagnosis of mantle cell lymphoma: a clinicopathologic comparison of cyclin D1-positive MCL and cyclin D1-negative MCL-like B-cell lymphoma. Blood, 95(7):2253, 2000.
[163] C. Yeh, B. Hsi, S. Lee, and W. Faulk. Propidium iodide as a nuclear marker in immunofluorescence. II. Use with cellular identification and viability studies. J. Immunol. Methods, 43(3):269-275, 1981. ISSN 0022-1759.
[164] N. Young, T. Al-Saleem, H. Ehya, and M. Smith. Utilization of fine-needle aspiration cytology and flow cytometry in the diagnosis and subclassification of primary and recurrent lymphoma. Cancer Cytopathology, 84(4):252-261, 1998.
[165] H. Zare, P. Shooshtari, A. Gupta, and R. Brinkman. Data reduction for spectral clustering to analyze high throughput flow cytometry data. BMC Bioinformatics, 11(1):403, 2010.
[166] H. Zare, P. Shooshtari, A. Gupta, and R. Brinkman. Data reduction for spectral clustering to analyze high throughput flow cytometry data. BMC bioinformatics, 11(1):403, 2010. ISSN 1471-2105.
[167] Q. Zeng, J. Pratt, J. Pak, D. Ravnic, H. Huss, and S. Mentzer. Feature-guided clustering of multi-dimensional flow cytometry datasets. Journal of Biomedical Informatics, 40(3):325-331, 2007.
[168] P. Zhao and B. Yu. On model selection consistency of lasso. J. Mach. Learn. Res., 7:2541-2563, 2006.
[169] H. Zou and T. Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B, 67:301-320, 2005.
[170] M. Zweig and G. Campbell. Receiver-operating characteristic (ROC) plots: a fundamental evaluation tool in clinical medicine. Clinical Chemistry, 39(4):561-577, 1993.
Appendix A SamSPECTRAL Package Manual
A.1 Introduction
Data analysis is a crucial step in many recent areas of biological research, such as microarray techniques, gene expression and protein classification.
A classical approach for analysing biological data is to first group individual data points based on some similarity criterion, a process known as clustering, and then compare the outcome of clustering with the desired biological hypotheses. Spectral clustering is a non-parametric clustering method which has proved useful in many pattern recognition areas. Not only does it not require a priori assumptions on the size, shape and distribution of clusters, but it also has several features that make it an appropriate candidate for clustering biological data:
\u2022 It is not sensitive to outliers, noise or the shape of clusters.
\u2022 It is adjustable, so we can make use of biological knowledge to adapt it for a specific problem or dataset.
\u2022 There is mathematical evidence to guarantee its proper performance.
However, because of machine limitations, one faces serious practical barriers in applying this method to large data sets. SamSPECTRAL is a modification of spectral clustering that makes it applicable to large datasets.
A.2 How to Run SamSPECTRAL?
SamSPECTRAL is an R source package that can be downloaded from Bioconductor. In Linux, it can be installed by the following command:
R CMD INSTALL SamSPECTRAL_x.y.z.tar.gz
where x.y.z determines the version. The main function of this package is SamSPECTRAL(), which becomes available after loading the package with the command library(SamSPECTRAL) in R. Before running this function on a data set, some parameters must be set, including normal.sigma and separation.factor. This is best done by running the algorithm on a small number of samples (normally, 2 or 3 samples are sufficient). Then the function SamSPECTRAL() can be applied to all samples in that data set to identify cell populations in each sample.
A.2.1 An Example
This example shows how SamSPECTRAL can be run on flow cytometry data.
If f is a flow frame (which is normally read from an FCS file using flowCore), then the object \"small\" in the following example should be replaced by exprs(f).
> library(SamSPECTRAL)
> data(small_data)
> full <- small
> L <- SamSPECTRAL(full, dimension = c(1, 2, 3), normal.sigma = 200,
+     separation.factor = 0.39)
> plot(full, pch = \".\", col = L)
SamSPECTRAL is done. The results are in L, a vector that provides a numeric label for each event. All events with equal label are in one component and isolated outliers are labelled by NA. The following piece of code is not a part of the analysis and is included only for a clearer presentation of the results. The code computes the frequency of events in each component and adds a legend to the figure.
> plot(full, pch = \".\", col = L)
> frequency <- c()
> minimum.frequency <- 0.01
> frequency.large <- c()
> labels <- as.character(unique(L))
> for (label in labels) {
+     if (!is.na(label)) {
+         frequency[label] <- length(which(L == label))\/length(L)
+         if (frequency[label] > minimum.frequency)
+             frequency.large[label] <- frequency[label]
+     }
+ }
> print(frequency)
           7            1            3            4            6            8
0.7881111111 0.1193333333 0.0008888889 0.0062222222 0.0576666667 0.0013333333
           9            5            2
0.0237777778 0.0013333333 0.0010000000
> legend(x = \"topleft\", as.character(round(frequency.large, 3)),
+     col = names(frequency.large), pch = 19)
[Figure: scatter plot of FL2-H versus FL1-H with events colored by component; legend frequencies: 0.788, 0.119, 0.058, 0.024]
A.3 Adjusting Parameters
For efficiency, one can set m=3000 to keep the running time below 1 minute on a 2 GHz processor; the results normally remain satisfactory for flow cytometry data. The separation factor and the scaling parameter (\u03c3) are the two main parameters that need to be adjusted. The general approach is to run SamSPECTRAL on one or two random samples of a flow cytometry data set and try different values for \u03c3 and the separation factor. 
Then, the selected parameters are fixed and used to apply SamSPECTRAL to the rest of the data samples. An efficient strategy is explained by the following example.
A.3.1 Example
First we load the data and store the transformed coordinates in a matrix called full:
> data(small_data)
> full <- small
The objects needed for creating this vignette can either be computed directly or loaded from a previously saved workspace to save time. The latter increases the speed of building this vignette.
> run.live <- FALSE
The following parameters rarely need to be changed for flow cytometry data:
> m <- 3000
> community.weakness.threshold <- 1
> precision <- 6
> maximum.number.of.clusters <- 30
The following piece of code scales the coordinates to the range [0,1]:
> for (i.column in 1:dim(full)[2]) {
+     ith.column <- full[, i.column]
+     full[, i.column] <- (ith.column - min(ith.column))\/(max(ith.column) -
+         min(ith.column))
+ }
> space.length <- 1
To perform faithful sampling, we run:
> society <- Building_Communities(full, m, space.length,
+     community.weakness.threshold)
> plot(full[society$representatives, ], pch = 20)
[Figure: scatter plot of the sampled representatives, FL2-H versus FL1-H, both axes scaled to the range 0.0 to 1.0]
We intend to first find an appropriate value for \u03c3 and then set the separation factor. Note that normal.sigma = 1\/\u03c3^2; therefore, decreasing normal.sigma is equivalent to increasing \u03c3 and vice versa. 
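The relation between normal.sigma and the scaling parameter can be made concrete with a short conversion. The following is only a sketch under the relation stated above (normal.sigma equal to one over sigma squared); the two helper functions are hypothetical names introduced for illustration and are not part of the SamSPECTRAL package. It converts the three normal.sigma values used in this example (10, 250 and 1000) to the corresponding sigma.

```r
# Convert between normal.sigma and sigma, assuming normal.sigma = 1 / sigma^2.
# Both helper names are hypothetical and not part of SamSPECTRAL.
sigma_from_normal_sigma <- function(normal.sigma) 1 / sqrt(normal.sigma)
normal_sigma_from_sigma <- function(sigma) 1 / sigma^2

# The three normal.sigma values used in this example.
tried <- c(10, 250, 1000)
sigmas <- sigma_from_normal_sigma(tried)

# A larger normal.sigma corresponds to a smaller sigma.
print(round(sigmas, 4))
```

Under this assumption the printed values are 0.3162, 0.0632 and 0.0316, so moving normal.sigma from 10 to 1000 shrinks sigma by a factor of ten.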
We start with normal.sigma=10:
> normal.sigma <- 10
> conductance <- Conductance_Calculation(full, normal.sigma, space.length,
+     society, precision)
> if (run.live) {
+     clust_result.10 <- Civilized_Spectral_Clustering(
+         full,
+         maximum.number.of.clusters,
+         society, conductance, stabilizer = 1)
+     eigen.values.10 <- clust_result.10@eigen.space$values
+ } else data(\"eigen.values.10\")
> plot(eigen.values.10[1:50])
[Figure: eigen.values.10[1:50] plotted against Index (0 to 50); values range from 0.0 to 1.0]
We observe that the eigenvalue curve does not have a \"knee\" shape. So we increase normal.sigma (equivalently, decrease \u03c3):
> normal.sigma <- 1000
> conductance <- Conductance_Calculation(full, normal.sigma, space.length,
+     society, precision)
> if (run.live) {
+     clust_result.1000 <- Civilized_Spectral_Clustering(full,
+         maximum.number.of.clusters, society, conductance, stabilizer = 1)
+     eigen.values.1000 <- clust_result.1000@eigen.space$values
+ } else data(\"eigen.values.1000\")
> plot(eigen.values.1000[1:50])
[Figure: eigen.values.1000[1:50] plotted against Index (0 to 50); values range from 0.95 to 1.00]
We observe that in the eigenvalue plot, \"too many\" values are close to 1, but for this example we do not expect 20 populations. So we decrease normal.sigma (equivalently, increase \u03c3):
> normal.sigma <- 250
> conductance <- Conductance_Calculation(full, normal.sigma, space.length,
+     society, precision)
> clust_result.250 <- Civilized_Spectral_Clustering(
+     full,
+     maximum.number.of.clusters,
+     society, conductance, stabilizer = 1)
> eigen.values.250 <- clust_result.250@eigen.space$values
> plot(eigen.values.250[1:50])
[Figure: eigen.values.250[1:50] plotted against Index (0 to 50); values range from 0.75 to 1.00]
This is \"a right\" value for normal.sigma because the curve now has a knee shape. 
Even some variation in this parameter does not change the shape significantly (200 or 300 can be tried). Now that normal.sigma has been adjusted, the separation factor can be tuned:
> labels.for_num.of.clusters <- clust_result.250@labels.for_num.of.clusters
> number.of.clusters <- clust_result.250@number.of.clusters
> L33 <- labels.for_num.of.clusters[[number.of.clusters]]
> separation.factor <- 0.1
> component.of <- Connecting(full, society, conductance, number.of.clusters,
+     labels.for_num.of.clusters, separation.factor)$label
> plot(full, pch = \".\", col = component.of)
[Figure: scatter plot of FL2-H versus FL1-H (both axes 0.0 to 1.0), events colored by component, for separation.factor = 0.1]
This value is too small for the separation factor, and a population is merged with another by mistake. Therefore, we increase the separation factor to separate the components more:
> separation.factor <- 0.5
> component.of <- Connecting(full, society, conductance, number.of.clusters,
+     labels.for_num.of.clusters, separation.factor)$label
> plot(full, pch = \".\", col = component.of)
[Figure: scatter plot of FL2-H versus FL1-H (both axes 0.0 to 1.0), events colored by component, for separation.factor = 0.5]
This is the right value for the separation factor, as all populations are now separated. Now we can fix these values for the parameters: normal.sigma=250 and separation.factor=0.5. One can run the SamSPECTRAL algorithm on the rest of the data set without changing them, hopefully obtaining equally appropriate results.
Appendix B Supplementary Table A
The following table provides clinical data such as sex, age, diagnosis, and flow cytometry features for the cases investigated in the MCL vs. SLL study. 
Case Typical FISH_11-14 Cyclin_D1 Dx Age Sex CD20\/CD23 FMC7\/CD23 CD20\/CD11c CD23 FMC7 CD20 sIg CD5 Score Time_frame
1 TRUE ND POS MCL 85 M 2.68 2.32 2.58 0.56 1.30 1.50 2.04 0.66 3 4
2 FALSE ND POS MCL 79 M 1.25 1.92 1.59 0.70 1.34 0.87 0.86 0.75 2 4
3 FALSE POS POS MCL 54 M 2.39 2.05 1.92 0.75 1.54 1.80 1.79 1.28 3 4
4 FALSE POS POS MCL 80 M 1.81 1.47 2.44 0.72 1.06 1.31 1.38 1.16 3 4
5 TRUE ND POS MCL 75 M 2.97 2.27 1.51 0.49 1.12 1.46 1.86 1.03 3 5
6 TRUE ND POS MCL 67 M 3.64 1.03 8.51 105.97 109.17 385.66 535.15 130.43 3 1
7 TRUE ND POS MCL 32 M 3.95 1.28 6.67 105.06 134.58 414.95 393.49 275.60 3 1
8 TRUE ND POS MCL 79 M 6.43 4.30 8.14 73.41 315.72 472.16 328.60 121.67 3 1
9 TRUE ND POS MCL 53 M 2.79 2.69 3.47 76.69 205.95 214.21 527.23 147.38 2 1
10 TRUE ND POS MCL 76 M 5.85 3.91 4.18 78.63 307.06 460.02 582.61 311.89 2 1
11 TRUE ND POS MCL 73 M 3.25 3.31 146.20 145.66 482.12 473.16 493.35 326.19 3 1
12 TRUE ND POS MCL 77 M 3.67 2.95 16.73 87.98 259.73 323.11 362.13 221.85 3 1
13 FALSE ND POS MCL 64 F 4.12 1.13 2.55 71.00 80.53 292.68 536.12 209.50 2 1
14 TRUE ND POS MCL 57 M 6.44 4.60 22.82 67.93 312.70 437.61 579.01 299.61 3 1
15 TRUE ND POS MCL 59 M 4.72 2.88 5.70 95.35 274.28 450.26 330.56 387.11 3 1
16 TRUE ND POS MCL 77 F 7.35 3.10 5.13 43.94 136.13 323.03 470.75 196.51 3 1
17 TRUE ND POS MCL 54 M 4.41 2.14 5.37 104.12 222.65 459.06 408.04 325.74 3 1
18 FALSE NEG POS MCL 55 M 5.89 1.75 31.90 78.51 137.17 462.69 383.40 263.42 3 1
19 FALSE ND POS MCL 79 M 2.02 1.31 2.67 182.85 239.90 369.38 597.89 293.79 3 2
20 FALSE ND POS MCL 80 F 1.67 1.07 1.88 153.93 164.27 256.31 671.89 246.16 2 2
21 TRUE ND POS MCL 49 M 1.92 1.35 2.71 218.54 294.49 418.83 580.06 316.43 3 2
22 TRUE ND POS MCL 71 M 2.46 1.99 3.21 187.37 372.02 460.79 603.75 306.50 3 2
23 FALSE ND POS MCL 60 F 1.86 1.66 3.22 228.10 378.05 424.50 527.65 320.39 3 2
24 FALSE ND POS MCL 52 F 2.42 1.07 2.57 159.05 170.48 384.41 643.39 338.86 3 2
25 FALSE ND POS MCL 63 F 1.36 1.22 2.85 208.19 254.34 283.12 328.69 142.86 3 2
26 FALSE POS POS MCL 76 M 2.06 1.72 1.55 209.67 360.63 431.39 26.69 331.88 2 2
27 TRUE ND POS MCL 52 M 2.36 1.91 4.26 164.49 314.45 387.60 345.34 228.41 3 2
28 FALSE POS POS MCL 47 M 10.41 1.05 2.36 32.67 34.36 340.11 625.98 260.87 3 2
29 FALSE ND POS MCL 74 M 16.84 1.16 2.16 18.78 21.78 316.16 664.87 285.44 3 2
30 TRUE ND POS MCL 60 F 3.92 2.77 6.29 118.35 327.97 463.85 523.04 219.79 3 3
31 FALSE POS POS MCL 80 M 2.23 1.32 3.02 176.46 232.44 393.69 183.60 259.02 3 2
32 TRUE POS POS MCL 71 F 11.22 7.78 3.37 51.86 403.42 581.79 582.64 345.61 2 3
33 TRUE ND POS MCL 73 F 3.36 2.19 15.54 145.94 320.11 490.67 502.61 201.00 3 3
34 TRUE POS POS MCL 75 M 4.64 2.41 5.75 129.62 312.64 601.67 551.79 338.85 3 3
35 TRUE ND POS MCL 31 M 6.45 2.87 10.38 54.95 157.60 354.28 344.27 156.13 3 3
36 TRUE ND POS MCL 89 F 4.47 2.66 5.10 160.80 428.27 718.44 575.20 260.74 3 3
37 FALSE ND POS MCL 64 M 2.34 1.41 11.22 292.39 413.51 685.00 727.15 222.86 3 3
38 TRUE ND POS MCL 53 M 3.41 1.77 4.23 150.03 265.94 510.88 429.16 277.61 3 3
39 TRUE ND POS MCL 51 M 4.43 2.68 7.41 161.77 433.76 716.03 734.72 210.86 3 3
40 TRUE POS POS MCL 41 M 2.54 1.78 6.21 181.90 324.36 461.89 525.44 304.59 3 3
41 FALSE POS POS MCL 88 F 3.94 1.35 6.45 157.92 212.84 622.63 597.66 305.20 3 3
42 TRUE ND POS MCL 64 F 3.00 1.70 5.26 169.72 288.49 508.78 606.85 277.57 3 3
43 TRUE ND POS MCL 62 F 2.86 1.05 3.68 196.29 205.69 560.46 680.74 333.50 2 3
44 TRUE ND POS MCL 72 F 2.96 2.01 9.85 152.48 306.04 451.57 474.87 185.58 3 3
45 TRUE ND NEG CLL\/SLL 64 M 0.59 0.55 1.58 1.69 0.94 1.00 1.52 1.22 0 4
46 FALSE ND NEG\/POS CLL\/SLL 72 M 0.79 1.01 0.80 1.28 1.29 1.01 1.24 1.09 0 4
47 TRUE ND NEG CLL\/SLL 54 M 0.92 0.62 1.25 1.16 0.72 1.07 1.27 0.84 0 4
48 TRUE ND ND CLL\/SLL 73 M 0.91 0.46 1.41 1.59 0.73 1.44 1.67 1.28 0 4
49 FALSE ND ND CLL\/SLL 88 M 0.82 0.82 1.47 1.51 1.24 1.24 1.54 1.62 0 4
50 TRUE ND ND CLL\/SLL 63 M 0.63 0.71 0.84 1.70 1.21 1.07 1.15 1.49 0 4
51 TRUE ND ND CLL\/SLL 63 M 0.38 0.63 0.83 2.24 1.41 0.85 1.02 1.25 0 4
52 TRUE ND ND CLL\/SLL 69 F 0.74 0.76 1.26 1.47 1.11 1.09 1.88 1.39 0 4
53 FALSE ND NEG CLL\/SLL? 81 M 0.73 1.77 1.86 0.70 1.24 0.51 0.56 0.40 2 4
54 TRUE NEG ND CLL\/SLL 59 M 0.86 0.80 0.98 1.73 1.39 1.49 1.67 1.92 0 4
55 TRUE ND ND CLL\/SLL 83 F 0.56 0.52 1.03 1.64 0.85 0.93 1.40 1.22 0 4
56 TRUE ND ND CLL\/SLL 69 F 0.38 0.69 1.34 1.37 0.95 0.52 0.72 0.38 0 4
57 TRUE ND ND CLL\/SLL 59 M 0.63 0.66 1.49 1.64 1.08 1.04 1.48 1.09 0 4
58 TRUE ND NEG CLL\/SLL 72 M 0.59 0.49 1.08 1.74 0.86 1.02 1.71 1.37 0 4
59 FALSE ND NEG CLL\/SLL 70 M 0.68 0.74 1.26 1.94 1.44 1.31 1.50 1.56 0 4
60 TRUE ND ND CLL\/SLL 69 M 0.62 0.45 1.36 1.38 0.62 0.86 1.38 1.22 1 5
61 TRUE ND NEG CLL\/SLL 75 M 0.17 0.32 1.74 1.94 0.63 0.33 0.38 0.33 1 5
62 TRUE NEG NEG CLL\/SLL 75 M 1.04 0.69 1.89 0.90 0.62 0.93 0.94 0.76 1 5
63 TRUE ND NEG\/POS CLL\/SLL 58 M 0.77 0.37 1.09 1.97 0.72 1.52 1.52 1.86 0 5
64 TRUE ND NEG CLL\/SLL 77 M 1.10 0.69 1.07 0.86 0.59 0.95 1.29 0.90 0 5
65 FALSE NEG ? ? 57 F 0.93 0.57 1.33 1.40 0.79 1.31 1.23 1.44 1 5
66 FALSE ND NEG CLL\/SLL 68 M 0.61 0.13 1.83 391.09 50.66 236.89 442.80 229.81 0 1
67 FALSE ND NEG CLL\/SLL 64 M 1.76 0.39 3.32 160.78 62.97 282.20 361.46 214.38 0 1
68 TRUE ND NEG CLL\/SLL 55 M 1.18 0.24 2.13 199.44 47.73 234.79 263.64 231.28 0 1
69 TRUE ND ND CLL\/SLL 81 M 0.82 0.26 1.46 212.21 54.83 173.13 299.82 263.49 0 1
70 TRUE ND ND CLL\/SLL 60 M 0.50 0.07 4.28 432.87 30.90 214.43 312.71 271.30 0 1
71 TRUE ND NEG CLL\/SLL 77 M 0.67 0.28 1.13 428.81 118.61 285.48 399.85 119.20 0 1
72 TRUE ND ND CLL\/SLL 75 M 0.89 0.20 2.12 288.99 58.97 258.20 290.87 344.28 0 1
73 FALSE ND NEG CLL\/SLL 57 F 1.66 1.44 64.38 197.56 283.79 327.36 348.22 303.06 2 1
74 TRUE ND ND CLL\/SLL 61 M 0.32 0.12 1.56 405.16 47.01 130.52 369.79 247.32 0 1
75 TRUE ND ND CLL\/SLL 69 M 0.96 0.32 2.75 369.25 117.03 353.56 328.93 247.32 0 1
76 TRUE ND NEG CLL\/SLL 44 F 0.71 0.43 1.70 334.47 142.67 237.33 329.37 247.26 0 2
77 TRUE ND ND CLL\/SLL 71 F 0.90 0.54 1.28 276.54 148.66 249.43 307.84 182.63 0 2
78 FALSE ND ND CLL\/SLL 52 M 0.27 0.34 0.59 614.84 211.13 165.26 66.13 368.33 0 2
79 FALSE ND ND CLL\/SLL 61 F 0.15 0.44 1.22 350.46 154.83 51.98 427.67 242.22 0 2
80 TRUE NEG ND CLL\/SLL 43 F 0.62 0.47 1.14 322.65 152.42 200.59 366.95 277.16 0 2
81 FALSE ND NEG CLL\/SLL 61 F 0.36 0.26 0.78 522.65 135.69 188.33 355.33 316.57 0 2
82 TRUE ND NEG CLL\/SLL 79 M 0.05 0.36 0.91 356.78 129.99 17.13 385.37 328.76 0 2
83 FALSE ND NEG CLL\/SLL 56 F 0.73 0.63 1.64 216.92 137.01 158.69 236.20 134.94 0 2
84 FALSE ND ND CLL\/SLL 55 M 1.19 0.97 2.36 146.44 142.31 174.63 268.20 77.07 2 2
85 FALSE ? ? CLL\/SLL 52 M 1.76 1.12 2.92 186.74 209.44 328.58 283.41 195.34 1 3
86 TRUE ND NEG CLL\/SLL 53 F 1.08 0.45 1.44 282.61 126.20 304.99 536.87 385.79 0 3
87 TRUE ND ND CLL\/SLL 57 F 0.44 0.20 1.17 613.22 122.59 272.83 334.73 393.70 0 3
88 FALSE ND NEG CLL\/SLL 81 M 1.46 0.65 2.11 411.36 269.13 599.86 309.03 407.15 0 3
89 TRUE ND NEG CLL\/SLL 86 F 1.44 0.40 3.34 288.03 114.25 414.49 471.72 180.10 0 3
90 TRUE ND ND CLL\/SLL 77 M 0.97 0.16 2.84 530.67 83.90 515.22 471.40 294.02 0 3
91 TRUE ND NEG CLL\/SLL 69 M 0.83 0.23 2.09 466.63 108.19 387.16 502.90 418.08 0 3
92 TRUE ND NEG CLL\/SLL 63 F 0.83 0.15 1.18 483.96 71.72 403.20 458.46 317.88 0 3
93 TRUE ND ND CLL\/SLL 78 M 1.39 0.46 4.55 355.13 163.16 493.81 451.47 238.10 1 3
94 TRUE ND NEG CLL\/SLL 74 M 0.02 0.15 1.56 595.85 88.45 10.27 155.17 352.37 0 3
95 TRUE ND NEG CLL\/SLL 62 F 0.72 0.31 3.55 555.21 171.74 398.26 474.43 145.16 0 3
96 FALSE ND NEG CLL\/SLL 47 M 1.30 0.57 2.39 358.73 205.10 464.65 397.44 205.94 0 3
97 TRUE ND NEG CLL\/SLL 61 M 0.93 0.34 1.41 418.89 142.40 389.04 510.23 331.76 0 3
98 FALSE ND ND CLL\/SLL 90 M 0.91 0.37 1.39 510.56 187.50 465.55 391.79 400.60 0 3
99 TRUE ND ND CLL\/SLL 72 M 1.02 0.29 2.63 392.48 112.63 398.76 292.17 161.95 0 3
100 TRUE ND NEG CLL\/SLL 58 M 0.59 0.29 1.98 556.05 159.31 329.25 546.61 436.96 0 3
101 TRUE ND NEG CLL\/SLL 73 M 1.50 0.50 3.51 299.27 150.13 449.31 304.67 281.90 0 3
102 TRUE ND ND CLL\/SLL 51 M 0.96 0.31 3.07 466.27 146.01 448.06 439.52 230.04 0 3
103 FALSE ND ND CLL\/SLL 51 M 1.28 0.40 2.59 398.78 158.72 509.01 418.06 213.57 0 3
104 FALSE NEG NEG CLL\/SLL 67 M 1.74 0.78 2.59 170.14 133.32 295.41 271.02 116.91 0 3
105 TRUE ND ND CLL\/SLL 60 F 1.23 0.33 2.85 446.38 149.51 548.90 483.83 370.73 0 3
106 TRUE ND NEG CLL\/SLL 52 F 0.77 0.37 1.27 491.37 179.72 378.38 280.04 278.04 0 3
107 FALSE ND ND CLL\/SLL 56 M 0.84 0.33 1.30 618.29 206.45 516.82 433.65 421.05 0 3
108 FALSE ND NEG CLL\/SLL 72 M 1.69 0.85 1.94 368.25 312.71 620.84 293.30 374.43 0 3
109 TRUE ND NEG CLL\/SLL 53 M 0.91 0.20 1.31 550.61 108.42 502.90 372.11 144.81 0 3
110 FALSE ND ND CLL\/SLL 66 M 0.92 0.49 1.14 376.29 182.83 345.58 354.32 265.20 0 3
111 TRUE NEG NEG CLL\/SLL 52 F 0.81 0.25 1.82 474.48 120.26 382.79 297.53 316.03 0 3
112 TRUE ND ND CLL\/SLL 68 M 0.75 0.26 1.48 509.00 134.59 383.06 300.42 179.56 0 3
113 TRUE ND ND CLL\/SLL 77 M 0.89 0.23 1.68 390.47 88.66 347.88 357.42 162.82 0 3
114 TRUE ND ND CLL\/SLL 70 M 0.17 0.36 1.27 202.36 71.98 35.22 171.36 102.18 0 3
Table B.1: Clinical data and FCM features for all cases investigated in the MCL vs. SLL study.
Appendix C Validating MCL vs. SLL Study on More Samples
My data analysis pipeline identified the three ratios as top diagnostic features, solely based on analyzing data of 44 cases from the third time period. To ensure that our findings are generalizable and the final results are not dependent on the training samples, the rest of the data (70 cases) were used as an independent set to validate the diagnostic value of the discovered features. Table 4.4 in the main paper shows the comparative performance of MFIs of CD23, FMC7, CD20, surface immunoglobulin light chain, and CD5 vs. the novel ratios. These criteria are computed based on the 70 test cases, whereas Table C.1 reports the same criteria computed based on all 114 cases. There was no statistically significant difference between the results computed for the 70 test cases and the results computed by considering all 114 available cases. For instance, the average accuracy of the four novel features was 93% and 95%, respectively. Also, Figures C.1-C.10 show the dot plots and normalized estimated densities in all studied time frames. They correspond to Figures 4.9 and 4.10.
Table C.1: Discriminative values based on all 114 studied cases. MCL arbitrarily defined as the \"positive\" test and SLL as the \"negative\" test. 
Feature: TP(a) TN(b) FP(c) FN(d) Sensitivity(e) Specificity(f) Accuracy(g)
Novel ratios
CD20\/CD23 ratio: 44 70 0 0 100%(h) 100% 100%
FMC7\/CD23 ratio: 44 66 4 0 100% 94% 96%
CD20\/CD11c ratio: 36 62 8 8 82% 89% 86%
Combined Ratio Score: 44 67 3 0 100% 96% 97%
Common markers
CD23 intensity: 43 62 8 1 98% 89% 92%
FMC7 intensity: 35 60 10 9 80% 86% 83%
CD20 intensity: 34 55 15 10 77% 79% 78%
Light chain: 29 64 6 15 66% 91% 82%
CD5 intensity: 33 31 39 11 75% 44% 56%
(a) TP (true positive) = the number of MCL cases which are diagnosed correctly.
(b) TN (true negative) = the number of SLL cases which are diagnosed correctly.
(c) FP (false positive) = the number of SLL cases which are diagnosed as MCL wrongly.
(d) FN (false negative) = the number of MCL cases which are diagnosed as SLL wrongly.
(e) Sensitivity to MCL is the portion of MCL cases that are correctly diagnosed (TP\/(TP+FN)).
(f) Specificity is the portion of SLL cases that are diagnosed correctly (TN\/(TN+FP)).
(g) Accuracy is the portion of all correctly diagnosed cases.
(h) Performance higher than 95% is shown in boldface.
C.1 First Time Frame
[Figure C.1, panels A (CD23 intensity), B (FMC7 intensity), C (CD20 intensity), D (Light chain), E (CD5 intensity): dot plot and scaled density of each marker intensity for MCL and SLL cases; density axis: number of cases (%) \/ scaled density]
Figure C.1: Discriminative value of individual immunophenotypic markers. Cases from the first time period. 
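The footnote definitions of Table C.1 can be checked directly from the confusion counts. The following is a minimal sketch, assuming accuracy means (TP+TN) divided by the total number of cases; the function name is hypothetical, and the counts are copied from the FMC7/CD23 ratio and CD5 intensity rows of the table.

```r
# Sensitivity, specificity and accuracy from confusion counts, following the
# footnote definitions of Table C.1. The function name is hypothetical.
diagnostic_metrics <- function(tp, tn, fp, fn) {
  c(sensitivity = tp / (tp + fn),
    specificity = tn / (tn + fp),
    accuracy = (tp + tn) / (tp + tn + fp + fn))
}

# Counts copied from two rows of Table C.1 (MCL is the positive class).
fmc7_cd23 <- diagnostic_metrics(tp = 44, tn = 66, fp = 4, fn = 0)
cd5 <- diagnostic_metrics(tp = 33, tn = 31, fp = 39, fn = 11)

# Percentages rounded to whole numbers, as reported in the table.
print(round(100 * fmc7_cd23))
print(round(100 * cd5))
```

With these counts the rounded percentages are 100, 94 and 96 for the FMC7/CD23 ratio and 75, 44 and 56 for CD5 intensity, which agrees with the corresponding rows of Table C.1.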
[Figure C.2, panels A (relative intensity of CD20 to CD23), B (relative intensity of FMC7 to CD23), C (relative intensity of CD20 to CD11c): dot plot and scaled density of each ratio for MCL and SLL cases; legend: typical MCL, typical SLL, atypical, t(11;14) positive, t(11;14) negative, cyclin D1 negative, both negative]
Figure C.2: Discriminative value of binary ratios. Cases from the first time period.
C.2 Second Time Frame
[Figure C.3, panels A (CD23 intensity), B (FMC7 intensity), C (CD20 intensity), D (Light chain), E (CD5 intensity): dot plot and scaled density of each marker intensity for MCL and SLL cases]
Figure C.3: Discriminative value of individual immunophenotypic markers. Cases from the second time period. 
[Figure C.4, panels A (relative intensity of CD20 to CD23), B (relative intensity of FMC7 to CD23), C (relative intensity of CD20 to CD11c): dot plot and scaled density of each ratio for MCL and SLL cases; legend: typical MCL, typical SLL, atypical, t(11;14) positive, t(11;14) negative, cyclin D1 negative, both negative]
Figure C.4: Discriminative value of binary ratios. Cases from the second time period.
C.3 Third Time Frame
[Figure C.5, panels A (CD23 intensity), B (FMC7 intensity), C (CD20 intensity), D (Light chain), E (CD5 intensity): dot plot and scaled density of each marker intensity for MCL and SLL cases]
Figure C.5: Discriminative value of individual immunophenotypic markers. Cases from the third time period. 
[Figure C.6, panels A (relative intensity of CD20 to CD23), B (relative intensity of FMC7 to CD23), C (relative intensity of CD20 to CD11c): dot plot and scaled density of each ratio for MCL and SLL cases; legend: typical MCL, typical SLL, atypical, t(11;14) positive, t(11;14) negative, cyclin D1 negative, both negative]
Figure C.6: Discriminative value of binary ratios. Cases from the third time period.
C.4 Fourth Time Frame
[Figure C.7, panels A (CD23 intensity), B (FMC7 intensity), C (CD20 intensity), D (Light chain), E (CD5 intensity): dot plot and scaled density of each marker intensity for MCL and SLL cases]
Figure C.7: Discriminative value of individual immunophenotypic markers. Cases from the fourth time period.
[Figure C.8, panels A (relative intensity of CD20 to CD23), B (relative intensity of FMC7 to CD23), C (relative intensity of CD20 to CD11c): dot plot and scaled density of each ratio for MCL and SLL cases; legend: typical MCL, typical SLL, atypical, t(11;14) positive, t(11;14) negative, cyclin D1 negative, both negative]
Figure C.8: Discriminative value of binary ratios. Cases from the fourth time period.
C.5 Fifth Time Frame
[Figure C.9, panels A (CD23 intensity), B (FMC7 intensity), C (CD20 intensity), D (Light chain), E (CD5 intensity): dot plot and scaled density of each marker intensity for MCL and SLL cases]
Figure C.9: Discriminative value of individual immunophenotypic markers. Cases from the fifth time period.
[Figure C.10, panels A (relative intensity of CD20 to CD23), B (relative intensity of FMC7 to CD23), C (relative intensity of CD20 to CD11c): dot plot and scaled density of each ratio for MCL and SLL cases; legend: typical MCL, typical SLL, atypical, t(11;14) positive, t(11;14) negative, cyclin D1 negative, both negative]
Figure C.10: Discriminative value of binary ratios. Cases from the fifth time period.
Appendix D Cluster Matching
Clusters are matched by clustering the centers using the SamSPECTRAL methodology.
[Figure D.1, panels (A) and (B): scatter plot matrices over FS, SS, CD5, CD19 and CD3; legend: DLBCL, FOLL, MCL, SLL]
Figure D.1: Cluster matching for tube (CD5-CD19-CD3). (A) Each cluster of a patient is depicted by a dot that is located at the center of that cluster. (B) Each color represents a set of clusters that are matched together. 
[Figure D.2, panels (A) and (B): scatter plot matrices over FS, SS, CD10, CD11c and CD20; legend: DLBCL, FOLL, MCL, SLL]
Figure D.2: Cluster matching for tube (CD10-CD11c-CD20). (A) Each cluster of a patient is depicted by a dot that is located at the center of that cluster. (B) Each color represents a set of clusters that are matched together.
[Figure D.3, panels (A) and (B): scatter plot matrices over FS, SS, FMC7, CD23 and CD19; legend: DLBCL, FOLL, MCL, SLL]
Figure D.3: Cluster matching for tube (FMC7-CD23-CD19). (A) Each cluster of a patient is depicted by a dot that is located at the center of that cluster. (B) Each color represents a set of clusters that are matched together. 
Figure D.4: Cluster matching for tube (CD7-CD4-CD8). (A) Each cluster of a patient is depicted by a dot that is located at the center of that cluster. (B) Each color represents a set of clusters that are matched together. [Plot axes: FS, SS, CD7, CD4, CD8. Groups: DLBCL, FOLL, MCL, SLL.]
Figure D.5: Cluster matching for tube (CD45-CD14-CD19). (A) Each cluster of a patient is depicted by a dot that is located at the center of that cluster. (B) Each color represents a set of clusters that are matched together. [Plot axes: FS, SS, CD45, CD14, CD19. Groups: DLBCL, FOLL, MCL, SLL.]
Figure D.6: Cluster matching for tube (kappa-lambda-CD19). (A) Each cluster of a patient is depicted by a dot that is located at the center of that cluster. (B) Each color represents a set of clusters that are matched together. [Plot axes: FS, SS, Kappa, Lambda, CD19. Groups: DLBCL, FOLL, MCL, SLL.]","@language":"en"}],"Genre":[{"@value":"Thesis\/Dissertation","@language":"en"}],"GraduationDate":[{"@value":"2012-05","@language":"en"}],"IsShownAt":[{"@value":"10.14288\/1.0052140","@language":"en"}],"Language":[{"@value":"eng","@language":"en"}],"Program":[{"@value":"Computer Science","@language":"en"}],"Provider":[{"@value":"Vancouver : University of British Columbia Library","@language":"en"}],"Publisher":[{"@value":"University of British Columbia","@language":"en"}],"Rights":[{"@value":"Attribution-NonCommercial 3.0 Unported","@language":"en"}],"RightsURI":[{"@value":"http:\/\/creativecommons.org\/licenses\/by-nc\/3.0\/","@language":"en"}],"ScholarlyLevel":[{"@value":"Graduate","@language":"en"}],"Title":[{"@value":"Automatic analysis of flow cytometry data and its application to lymphoma diagnosis","@language":"en"}],"Type":[{"@value":"Text","@language":"en"}],"URI":[{"@value":"http:\/\/hdl.handle.net\/2429\/39660","@language":"en"}],"SortDate":[{"@value":"2011-12-31 AD","@language":"en"}],"@id":"doi:10.14288\/1.0052140"}