Flow Cytometry Data Analysis Pipeline:
Data Quality Control Tool Development and Biomarker Discovery

by

Sherrie (Xue) Wang
B.Sc., The University of British Columbia, 2017

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF
MASTER OF SCIENCE
in
The Faculty of Graduate and Postdoctoral Studies
(Bioinformatics)

THE UNIVERSITY OF BRITISH COLUMBIA
(Vancouver)

April 2020

© Sherrie (Xue) Wang 2020

The following individuals certify that they have read, and recommend to the Faculty of Graduate and Postdoctoral Studies for acceptance, the thesis entitled:

Flow Cytometry Data Analysis Pipeline - Data Quality Control Tool Development and Biomarker Discovery

submitted by Xue Sherrie Wang in partial fulfillment of the requirements for the degree of Master of Science in Bioinformatics.

Examining Committee:
Ryan R. Brinkman, Medical Genetics; Supervisor
Maxwell W. Libbrecht, Computer Science; Supervisory Committee
Sara Mostafavi, Medical Genetics and Statistics; Supervisory Committee
Peter Lansdorp, Medical Genetics and Medicine; Additional Examiner

Abstract

Technical complications occurring during the data acquisition process can impact the quality of cytometry data and its analysis results. Clogs can cause spikes in the data sets in the time domain. Other issues, such as changing machine acquisition speed, can result in a shift in the means of the populations analyzed. These outliers can bias the downstream analysis if left unchecked and as such should be identified and removed. To address this need, I developed flowCut, an R package for automated detection of anomalous events and flagging of problematic files in flow cytometry experiments. Its results are on par with manual analysis, and it outperforms the existing approaches to data quality control. flowCut has the highest F1 scores in the two types of evaluations used in this study and a zero crash rate on all files tested.

I also studied the bone marrow regeneration pattern of acute myeloid leukemia patients after chemotherapy by applying state-of-the-art automated methods. I identified cell populations and biomarkers that are uniquely present in relapsed patients when compared to normal bone marrow data. I also identified cell populations that have different regeneration dynamics between relapsed and non-relapsed patients.

Lay Summary

Flow cytometry is used widely in clinics and research for measuring blood cells. Its primary purpose is to quantify cell population compositions for diagnosis or for studying immunological characteristics of diseases. Technical issues of cytometers during data acquisition can result in inaccurate measurement of cells, which can cause an erroneous analysis of cell populations. My research focused on developing a data quality assessment tool and comparing the performance of current approaches. I also used several state-of-the-art automated data analysis methods to identify cell populations in acute myeloid leukemia patients who relapsed after undergoing chemotherapy.

Preface

The flowCut tool (included in Chapter 2) was written by Justin Meskas and myself. The tool was written in the R programming language and is available at https://github.com/jmeskas/flowCut. My main contribution to the package was writing the function for identifying and removing outlier events based on the density of summed measures (section 2.2.2). I wrote the preliminary quality checking step. I was involved in testing the tools and optimizing parameters in section 2.3. Sibyl Drissler was also engaged in tool testing.
Contents in chapters 2-3 are part of a paper to be submitted.

The project in chapter 4 was a collaboration with the Department of Laboratory Medicine, Institute of Biomedicine, Sahlgrenska Academy at the University of Gothenburg. Patients' data were collected from multiple medical centers in Gothenburg, Copenhagen, Israel, and Umea by Linda Fogelstrand, who is the head of the Department of Laboratory Medicine at the University of Gothenburg.

Table of Contents

Abstract
Lay Summary
Preface
Table of Contents
List of Tables
List of Figures
Acknowledgements
Dedication
1 Introduction
  1.1 R/Bioconductor
  1.2 Data quality assessment
    1.2.1 Current Approaches
  1.3 My Research
2 flowCut - a data quality control tool
  2.1 Workflow
  2.2 Methodology
    2.2.1 Segmentation and calculation of Z scores
    2.2.2 Removal of abnormal events
  2.3 Parameters
    2.3.1 Quality control parameters
    2.3.2 Cutoff line parameters
3 Algorithms Comparison
  3.1 Method
    3.1.1 Selection of files for evaluation
    3.1.2 Manual vs algorithm analysis
    3.1.3 F1 score as a measure for comparison
    3.1.4 File based evaluation
    3.1.5 Problem based evaluation
  3.2 Result
    3.2.1 File based evaluation results
    3.2.2 Problem based evaluation results
  3.3 Impact on Gating Analysis
  3.4 Algorithm Robustness
  3.5 Conclusion
4 Biomarker Discovery in Minimal Residual Disease
  4.1 Background
  4.2 Study Design
    4.2.1 Data
    4.2.2 Supervised Gating
    4.2.3 Unsupervised Gating and analysis pipeline
  4.3 Results
    4.3.1 Supervised Analysis Results
    4.3.2 Unsupervised Analysis Results
    4.3.3 Regeneration Dynamic
    4.4 Conclusion
5 Discussion
  5.1 Data quality control study
  5.2 Biomarker discovery study
Bibliography
Appendix
A Manual Gates for 52 files

List of Tables

2.1 Algorithms for finding cutoff lines based on density distribution
3.1 Score tables for category one and two files
3.2 Problem based F1 scores
4.1 Blood data for AML patients
4.2 Day 22 biomarkers
4.3 Last before 2nd induction biomarkers
4.4 Last before consolidation biomarkers
A.1 Manual gates for 52 files. Every two numbers correspond to one removal region

List of Figures

1.1 Quality control by flowClean and flowAI
2.1 flowCut workflow
2.2 Segmentation and Z score calculation
2.3 Density of summed Z scores
2.4 An example file showing the 98th and 2nd percentiles and the means of an FCS file
2.5 Parameter for positioning the cutoff line
3.1 Manual vs algorithm analysis
3.2 Separation of files based on confidence in manual analysis
3.3 Categorizing problem types
3.4 Category three files
3.5 Gating with and without using flowCut
4.1 Automated gating for the supervised method
4.2 Cell counts for the supervised analysis
4.3 Venn diagram
4.4 RchyOptimyx biomarker visualization
4.5 Unsupervised gating
4.6 Linear models

Acknowledgements

I would like to thank Dr. Ryan Brinkman and Justin Meskas for their support during the course of my graduate work.

Chapter 1
Introduction

Flow cytometry is a technique for studying the physical and chemical characteristics of cells using light-emitting antibodies. Cells stained with antibodies have light-emitting fluorophores attached to them. Each cell type presents particular antigens, and biologists design antibodies that specifically bind to these antigens. Stained cells then flow past one or multiple laser light sources, ideally in a single-file manner [6]. The emitted light from the cells, which is proportional to the antigen density, can be converted to electric signals and analyzed on a 2D plot. The light signals consist of three types: forward scattering, side scattering, and fluorescence emission signals. Forward scattering and side scattering measure the physical attributes of the cells, whereas the fluorescence signal measures the functional characteristics of cells [6]. For example, T cells will present CD3 antigens, and B cells will present CD19 antigens.

One critical step in flow cytometry data analysis is partitioning cells into types based on marker expressions.
Cells of the same type are grouped and selected on a 2D plot with a bounded area called a gate. We can identify subtypes of the selected cells with the addition of new markers. For example, T cells have CD4+ and CD8+ T cell subtypes, which can be gated on the CD4 and CD8 markers. Traditionally, the process of identifying cell populations is done manually. Researchers analyze two parameters at a time through visual inspection. The typical process starts by first removing dead cells and doublets. From live singlet cells, researchers can go down a path of finding targeted cell populations by following a specific partitioning (gating) strategy. The gating strategy indicates the sequence of markers to be analyzed to reach the target cell types.

Yet, manual analysis has many problems. Individual analysts can introduce subjectivity and bias into the gating analysis [12]. The presence of subjectivity makes cross-center comparative studies difficult and hinders reproducible research. Besides, recent instrumental advances and reagent expansion allow for measuring tens of surface and intracellular markers simultaneously, and allow for the generation of 20-dimensional data. Traditional manual analysis of two markers at a time cannot cope with the amount of data received. Manual analysis is time-consuming and can be ineffective at analyzing high-dimensional data [12].

There has been a surge in the production of computational tools for flow cytometry analysis in the past decade to address the challenges of manual analysis.

1.1 R/Bioconductor

More than 50 computational approaches are available for the analysis of flow cytometry data [3, 17], with a majority of the tools developed and released as free, open-source tools using the R programming language [19]. These tools have been developed for high-throughput workflows, and are not generally amenable to graphical user interfaces or manual interaction with individual files during the analysis process. However, they can be integrated into commercial tools such as FlowJo [FlowJo Bioinformatics Inc., Ashland OR] that are familiar to users.

A majority of the approaches have been released through the Bioconductor repository [20], which enforces strict requirements on cross-platform compatibility and functional documentation. Each package generally addresses one single step in the analysis pipeline, allowing users to substitute new approaches to the same challenge as the field advances.

The required core infrastructure widely used by other packages is provided by the flowCore R/Bioconductor package [9], which implements a computationally efficient data structure for reading and saving FCM data and provides systematic FCS file parsing. The flowCore infrastructure encourages new algorithm development and the use of combinations of tools in complex workflows [9].

The workflow involved in this study includes data compensation, transformation, quality control, automated gating, and biomarker identification.

Compensation and transformation: Data needs to be properly compensated, transformed, and normalized to ensure the accuracy of any subsequent gating analysis. Compensation is necessary to correctly account for the contribution of each fluorochrome to each channel in conditions of spectral overlap [17]. A well-chosen transformation facilitates population gating, visualization, and downstream analysis. The often-used transformation methods that handle negative values and display normally distributed cell types are logicle, hyperlog, and arcsinh [11].
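As a concrete illustration of this preprocessing step, the sketch below reads an FCS file with flowCore, applies the spillover matrix stored in the file, and then applies a logicle transformation estimated from the data. This is a minimal sketch under assumptions: the file name is hypothetical and the keyword holding the spillover matrix (often "SPILL") varies by instrument.

    library(flowCore)

    # Read a raw FCS file (file name is hypothetical)
    ff <- read.FCS("sample_tube1.fcs", transformation = FALSE)

    # Compensate using the spillover matrix stored in the file;
    # the keyword is often "SPILL" but is instrument dependent
    spill <- keyword(ff)[["SPILL"]]
    ff_comp <- compensate(ff, spill)

    # Estimate and apply a logicle transformation on the fluorescence channels
    fluo_channels <- colnames(spill)
    lgcl <- estimateLogicle(ff_comp, channels = fluo_channels)
    ff_trans <- transform(ff_comp, lgcl)

    summary(ff_trans)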
1.2 Data quality assessment

One goal of data quality control is to assess the stability of signal acquisition over experimental time. We can visually check the signal stability by plotting fluorescence channels against time. A stable signal acquisition shows a consistent distribution of fluorescence intensity values over time. This is the expected behavior based on the assumption that cells from a heterogeneous sample are randomly measured at any time point [16]. Changes in fluorescence intensity values in the time domain are indicative of technical variability. Abnormal events can occupy a distinct region/cluster in a 2D plot [16] and potentially get mislabeled as biologically significant events. Therefore, these events should be removed or flagged before passing to the gating analysis.

The manual inspection process can be time-consuming and subjective [7]. For removing spikes, users need to zoom in on the time channel to identify the boundaries of slivers accurately. Even so, the actual placement of boundaries can still be subjective. It is therefore necessary to develop automated methods that could remove human subjectivity and speed up the quality control process. The current approaches addressing this problem are flowClean [7] and flowAI [15].

1.2.1 Current Approaches

flowClean detects abnormal changes in the compositions of cell populations over time. It partitions each marker into high or low expression using median values and tracks the representation of the resulting 2^Q phenotypes across equally split time bins [7]. flowAI detects changes in means and variances of fluorescence intensity in the time domain [16].

Both flowClean and flowAI utilize versions of the "multiple change point detection" algorithms implemented by the "changepoint" package [10]. Specifically, flowClean uses the "pruned exact linear time (PELT)" algorithm, and flowAI uses "binary segmentation".

Binary segmentation: Binary segmentation is a computationally fast approximation algorithm that repeatedly splits the data into two groups at a time by repeating the single change point test over the split data sets until no change points are found in any part of the data [10].

Pruned exact linear time: PELT uses dynamic programming and pruning to produce the exact segmentation. It is computed based on the assumption that the number of change points increases linearly as the data set grows [10]. Namely, change points will spread throughout the entire data set rather than being confined to one portion.

A common problem with flowClean is its long run time, on average 1-4 minutes per file, and it often misses anomalies (Fig 1.1). flowAI, on the other hand, is very fast, 3-5 seconds per file, due to the efficiency of binary segmentation. However, it tends to be overly intolerant, removing large portions of normal regions, as shown in Fig 1.1.

Killick et al. (2014) note that both PELT and binary segmentation can be overly sensitive. In a normally distributed data set with three constructed change points, the PELT algorithm reported six change points while binary segmentation reported four [10]. We noted that flowAI, which uses binary segmentation, tends to be more sensitive than flowClean. The underreporting of change points by flowClean could be due to challenges in the analysis of phenotype compositions.
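To make the difference between the two segmentation strategies concrete, the sketch below runs both methods from the changepoint package on a simulated signal with a single mean shift. The simulated data and parameter choices are illustrative assumptions and are not taken from flowAI or flowClean themselves.

    library(changepoint)

    set.seed(1)
    # Simulated "median fluorescence per time bin" with one shift in the mean
    signal <- c(rnorm(200, mean = 2), rnorm(200, mean = 3))

    # Exact segmentation with PELT (the algorithm flowClean relies on)
    cpt_pelt <- cpt.mean(signal, method = "PELT", penalty = "MBIC")
    cpts(cpt_pelt)

    # Approximate segmentation with binary segmentation (used by flowAI)
    cpt_bs <- cpt.mean(signal, method = "BinSeg", Q = 5)
    cpts(cpt_bs)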
The underreport-ing of change points by flowClean could be due to challenges in the analysisof phenotype compositions.Figure 1.1: Quality control by flowClean (left) and flowAI (right) on thesame file. The colored regions are fluorescence intensity signals. Red is themost dense, followed by yellow, green, and purple. Black are the removedregions.61.3. My Research1.3 My ResearchDespite the current efforts of flowAI and flowClean, the challenge in dataquality control remains. My research focused on developing a tool that ad-dresses the ongoing problem more effectively. I hypothesized that a segment-wise statistical analysis could effectively identify outlier events. To provethis, I compared the performance of all three algorithms. In chapter 2, Idescribed the algorithm development. Chapter 3 detailed the method andresults for evaluating all three algorithms. In chapter 4, I covered the processand outcomes of using current tools for biomarker discovery.7Chapter 2flowCut - a data qualitycontrol tool2.1 WorkflowAs shown in Fig 2.1, we start with already processed FCS files, which arefiles after compensation and transformation. flowCut first checks the qualityof the time channel, flagging files with repeated time intervals or having amajority of events occurring in a short burst of time, usually at the beginningof a file. This step is to catch a problem that can potentially crash thealgorithm. Second, flowCut removes low-density sections, which are regionswith less than 1% of the range of data. Third, we begin data segmentationand score calculation for each segment. We then check if the scores arebelow a specific trigger threshold. If it is, then the files are up to standard.Otherwise, the bad data trigger flowCut to remove them based on a scoredistribution. After this, flowCut does a second quality control check andflags files with remaining problematic regions.82.2. MethodologyFigure 2.1: The figure summarizes workflow of flowCut algorithm. Overall,flowCut does a maximum of three quality checks. The first one pertains tothe time channel. The second and third are checking the stability of thefluorescence signals.92.2. Methodology2.2 MethodologyWe hypothesized that abnormal events are statistically different than thenormal events in the fluorescence versus time analysis. Naturally, we trackthe statistics of the time domain data to find abnormalities.Standard scoreIn statistics, the standard score, also called Z score, is the signed fractionalnumber of standard deviation and is used to study the deviation of datapoints from the mean value. We adapt from this concept and used absoluteZ scores 2.1 for this purpose as we are only interested in the differences fromthe mean but not the direction.|Z| = |x− µσ| (2.1)2.2.1 Segmentation and calculate Z scoresWe divided each fluorescence channel into equally populated segments, with500 events per segment for a typical FCS file (less than 20MB in size). Fig2.2 shows a two-channel (Alexa Fluor 488-A and APC-A) FCS file that isdivided into 11 segments. We calculated eight statistical measures for eachsegment according to equation 2.1. Because we use absolute Z scores, thedifferences calculated are all accumulative. Segments with high Z scoresindicate substantial deviations from the mean.102.2. MethodologyFigure 2.2: The top and bottom lines in the top two plots represent 98thand 2nd percentiles. The differences between these two lines define therange of data. The lines in the middle are the segments’ means. Pink isbefore cleaning. Brown is after cleaning. 
flowCut divides a two-channelFCS file into 11 segments. It calculates eight statistical measures, including5th, 20th, 80th, 95th percentiles, median, mean, standard deviation, andthe third moment for skewness. Summing eight statistics across all channelsresults in a vector. Each element in the vector represents one segment. Themost significantly different sections are in dark blue.112.2. Methodology2.2.2 Removal of abnormal eventsThe removal of outlier segments is based on the density distribution of thescore vector, shown in Fig 2.3. We want to find a data dependant thresholdthat separates outliers from normal cells. Any segments with values higherthan this threshold are outliers for removal. To do this, we adapted thedeGate function in the flowDensity R package [14]. The deGate functionreturns a gate line on a 1D density profile. The original purpose was to sep-arate cell populations. We utilized it in our outlier detection methodology.We manipulated the deGate function so that it always returns a gate linethat lies on the right side of the density distribution because we are onlyinterested in removing the cells that are most different.Figure 2.3: Density of summed Z scores. Outlier segments lie on the rightside of the distribution and are separated from the rest by a natural cutoffline. Segments with scores higher than the cutoff will be removed.The shape of the density profile of each marker can vary naturally. We122.3. Parametersallow natural variations of the data as long as they pass the quality controlcheck. Otherwise, we define the cutoff line to minimize overcutting andundercutting. I developed a set of rules for finding an ideal cutoff line 2.1.Table 2.1: Algorithms for finding cutoff lines based on density distribution1. flowCut first finds all the peaks (p = 1, 2, ...n) in the density distribu-tion.2. if p = 1, it uses deGate function to find a natural point along theupstream of the density distribution to remove significantly differentsegments.3. if p >= 2, flowCut checks each peak and calculates the valley heightbetween the adjacent peaks, if it is less than 1% of the maximumpeak, flowCut ignores the lower peak. If there are still more than 2peaks left and the population to be removed is less than or equal toa user specified amount, flowCut removes the significantly differentpopulation.2.3 Parameters2.3.1 Quality control parametersflowCut uses three thresholds to determine if a file passes or fails a qualitycontrol check, namely, the maximum allowable mean range, the average ofthis range across all channels, and the maximum continuous jump betweenadjacent segments. Fig 2.4 shows the maximum allowable mean range andmaximum one-step jump. If any of the parameters calculated is higher thana threshold value, flowCut starts the cleaning process. The user can adjustthe stringency of the algorithm and all by changing these parameters.132.3. ParametersFigure 2.4: An example file shows the range and mean of data before andafter cleaning. The data before cleaning is bounded by 98th and 2nd per-centile indicated by yellow (before cleaning) or dark brown (after cleaning)lines. The pink line in the middle is the connected segments means beforecleaning. The maximum range of these means is the first number on top.The second number is this range after cleaning. 
The number in the bracketindicates the maximum one-step change after cleaning.2.3.2 Cutoff line parametersUsers can set two parameters, one that defines the maximum percentageof events for removal, one that sets the thresholds to be generally higheror lower. These two parameters can wiggle the cutoff line on the densitydistribution.Maximum percentage of removal The default value is 30%. If theoutlier populations in Fig 2.3 exceeds this amount, then the cutoff line will142.3. Parametersbe moved further to the right, and nothing will be removed.Maximum valley height See Fig 2.5. It sets the maximum height ofthe tail on the distribution, defaulted to be 10% of the tallest peak. Thisparameter determines how aggressive the cutting will be. If a user setsa value larger than the default, then the user allows for generally moreaggressive cutting. In this case, the height of the valley is higher, and thecutoff line has a smaller value. Smaller threshold values allow the removalof more segments. For less aggressive cutting, the parameter will be lower,and the cutoff line moves further to the right. However, this could mean aninsufficient removal of abnormal events, and the file is not likely to pass thesecond quality control check.152.3. Parameters(a)(b) (c)Figure 2.5: By default, the threshold is placed at the valley with a heightof approximately 10% of the tallest peak, shown in a) and b). If users areto decrease this value, the threshold is moving further to the right and is atthe second valley. In this case, fewer segments will be removed c).16Chapter 3Algorithms ComparisonI evaluated the performance of all three algorithms against manual analysisfor selected files. I obtained these files from a public repository, FlowRepos-itory [22].3.1 Method3.1.1 Selection of files for evaluationI followed the following protocol when selecting files for evaluation:• Randomly download 1071 files from FlowRepository.• Eliminate corrupt files (83) that cannot be read, compensated, ortransformed.• Eliminate files crashed by any of the algorithms (145).• Keep any that required cleaning by visual inspection (50) or identi-fied with problematic regions by at least two algorithms (5). If a filehad no visually identifiable regions and only got cleaned by only onealgorithm, it was put aside and not counted toward the evaluation.173.1. Method3.1.2 Manual vs algorithm analysisFor each of the selected 55 files, I visually identified problematic regions,then ran each algorithm on these files with their default settings. Examplesare shown in Fig 3.1.Manual analysis procedure I plotted each marker channel versus timeand visually identified problematic regions. Each removal region had twoboundaries. I created a spreadsheet for storing boundaries for each of the 55files. Each row (file) has a series of an even number of boundaries that definethe regions for cutting. For example, if there are four numbers, the first andsecond numbers are the beginning and end of the first region removed. Andthe third and fourth numbers are the beginning and end of the second regionremoved. See Appendix A.3.1.3 F1 score as a measure for comparisonF1 score, in equation 3.1, is the harmonic mean of precision and recall.F1 score was used for judging algorithms’ performance in FlowCAP studies[1, 2, 5]. We borrowed the idea here in this evaluation.F1 =21recall +1precision= 2 ∗ precision ∗ recallprecision+ recall(3.1)The data selected by an algorithm can contain some portions of truepositives and false positives. 
Figure 3.1: Five exemplary files are shown with the raw data, manual analysis, flowCut, flowAI, and flowClean analysis.

3.1.4 File based evaluation

Noting that manual analysis can be subjective, I subdivided the 55 files into three categories based on the subjective confidence of the manual analysis, shown in Fig 3.2. The first category includes 17 files that have removal regions with clearly defined boundaries such as discontinuities, low-density regions, and large spikes. The second category has 35 files that have fuzzy boundaries for removal regions; examples include small spikes and boundaries in fluorescence drifting regions. This category also includes files that have overlapping problematic regions detected by at least two algorithms but not by manual analysis. The third category has three files in which manual analysis is arbitrary: the files look abnormal, but the regions for removal are not clear. For example, the first category three file in Fig 3.2 contains three shifting regions with different means; it is unclear which region(s) should be removed. Category three files contain no region of truth and were eliminated from the evaluation. For the remaining 52 files, I calculated F1 scores for category one and two files for each algorithm. We subsequently calculated the F1 score for random removal, with the percentage of removal similar to that of the manual removal, as a baseline for comparison.

3.1.5 Problem based evaluation

To analyze the problem-based performance of each algorithm, I categorized the manually identified regions into four distinctive problem types, as shown in Fig 3.3. Each type has its unique characteristics. I evaluated the algorithms' performance on these four types of problems by calculating F1 scores for all regions of each type. A single file can contain multiple types of issues. I calculated F1 scores for each of the identified problematic regions and clean regions, and their proportions in a file.

Figure 3.2: Files are divided into three categories based on confidence in the manual analysis. From category three to one there is an increase in confidence in the manual analysis. Category one (17 files) problematic regions contain distinctive discontinuity and large spike regions. Category two (35 files) problematic regions contain small spikes and intensity shifting regions. Category three (3 files) problematic regions are hard to define; therefore, no manual analysis was performed on these files.

3.2 Result

3.2.1 File based evaluation results

flowCut has the highest F1, precision, and recall scores, as shown in Table 3.1. Its run time is a close second to flowAI. flowAI removes the most events for category two files yet has a low F1, indicating that it is removing a large number of false positives. flowClean has the overall lowest F1 scores, and its run time is several magnitudes longer than that of flowAI and flowCut.

Figure 3.3: Categorizing problematic regions into four types. Two examples of each category are shown.
Two examplesof each category are shown.Table 3.1: F1 scores, precision, recall, run times, and percentages of removalof each algorithm were calculated for 17 category one and 35 category twoFlowRepository exemplary files for three algorithms and a random removal.Computed on an Intel Xeon E5-2630 CPU with 128 GB RAM.Category One:F1 scores Precision Recall Run times (s) % removedflowCut 0.79 0.75 0.90 4.8 14.0%flowAI 0.42 0.43 0.68 3.7 7.9%flowClean 0.32 0.30 0.44 54.5 4.6%random 0.16 0.16 0.16 - 15.6%Category Two:F1 scores Precision Recall Run times (s) % removedflowCut 0.5 0.5 0.71 7.18 7.5%flowAI 0.18 0.3 0.23 4.4 10.0%flowClean 0.15 0.16 0.33 190.25 1.01%random 0.12 0.12 0.12 - 12.0%Category Three files Although I eliminated category three files fromscore calculation, it is necessary to see how each algorithm deals with thesefiles. As shown in Fig 3.4, flowCut flagged two files to users. The flagging223.3. Impact on Gating Analysiscontains information of why a file fails quality check. flowCut flagging hasfour letters of either T or F, checking 1) if the events are monotonicallyincreasing with time, 2) abrupt changes in fluorescence, 3) large gradual shiftof fluorescence signals in 3) all channels and 4) one channel. For categorythree first file, flowCut indicated a large gradual change of fluorescence inone channel. The second file had both abrupt and large gradual changesof fluorescence in one channel. The third file failed the time test with afraudulent time channel, as described in section 2.1. flowAI detected someproblems in the second file, but the removal regions can not be verified bymanual at this stage. flowClean identified no problems in the first file, andcouldn’t process the second and third files due to too few cells to calculatephenotypes compositions. Note, these files did not crash flowClean.3.2.2 Problem based evaluation resultsTable 3.2 shows the mean F1 scores for each problematic type with their nor-malized percentage of events in a file. The normalized proportions is the sumof all regions by type, divided by the total number of files (52). I calculated aweighted F1 score for each algorithm according to∑5i=problemtypemeanF1×normalizedPercentages. flowCut was 0.93, flowAI 0.86, flowAI 0.83 (Table3.2).3.3 Impact on Gating AnalysisWe reproduced the gating of TCM CD8 T cells, CD45RA-CDR7+ from thepublished paper [4] with and without using flowCut. The data that had233.3. Impact on Gating AnalysisFigure 3.4: Documenting how each algorithm deals with category threefiles. flowCut flagged first two files to users, and did not process the thirdfile due to time test issues. flowAI detected no problematic regions in thefirst and third. However, only flowAI detected some regions in the secondfile. flowClean detected no problematic regions in the first file, and couldnot process the 2nd and 3rd file.Percentage of events flowCut flowAI flowCleanShift 7.23% 0.67 0.40 0.20Spike 2.60% 0.63 0.33 0.02Low Density 1.72% 0.67 0.28 0.20Discontinuity 0.98% 0.93 0.32 0.00Clean 87.47% 0.96 0.93 0.93Weighted F1 - 0.93 0.86 0.83Table 3.2: Mean F1 scores of four types of problematic regions, mean per-centage of events and the weighted F1 scores for all three algorithmsproper data quality control showed an increased proportion of the targetedcell population after gating (Fig 3.5 c). Gating on the outlier events (Fig243.4. 
3.3 Impact on Gating Analysis

We reproduced the gating of TCM CD8 T cells (CD45RA-CCR7+) from the published paper [4] with and without using flowCut. The data that had proper data quality control showed an increased proportion of the targeted cell population after gating (Fig 3.5 c). Gating on the outlier events (Fig 3.5 d) showed that the outlier events lie mostly in the bottom left quadrant, indicating that they were biasing the gating toward that region.

Figure 3.5: (a) shows fluorescence drifting (similarly for CCR7, not shown). (a)-(d) show the difference between (b) not using and (c) using flowCut. (d) shows only the gated events between the middle two grey vertical lines in (a).

3.4 Algorithm Robustness

Out of 988 files from FlowRepository, a total of 114 files (11.5%) crashed flowClean, and 65 files (6.6%) crashed flowAI. Overall, a total of 145 files crashed either flowClean or flowAI. Zero files crashed flowCut. As of October 2019, I had run flowCut on all 117,115 FlowRepository files; it has a 0% crash rate.

3.5 Conclusion

Data cleaned by flowCut improves the downstream gating. flowCut allows users to check the quality of the cleaning and to adjust the stringency of the algorithm if needed. Compared to existing methods, flowCut identified outlier events more accurately and did not fail to process any file.

Chapter 4
Biomarker Discovery in Minimal Residual Disease

4.1 Background

Patients with acute myeloid leukemia (AML) who underwent chemotherapy can sometimes relapse. The goal of this study is to examine the bone marrow regeneration pattern for characteristics that could be associated with recurrence. Immunophenotyping by flow cytometry is capable of detecting 1 leukemia cell in 10,000 normal cells [23], making it an ideal method for monitoring minimal residual disease. When choosing tools to study cell markers, I consulted the FlowCAP studies [1, 2, 5], in which a list of currently available algorithms was evaluated against manual analysis for their ability to find significant populations that correctly predict HIV patients' disease progression status. flowDensity [13] (both supervised and unsupervised) and the flowType and RchyOptimyx [18] pipeline stood out as the co-best methods.

flowDensity can be used in both supervised and unsupervised ways. In a supervised fashion, it automates the manual gating process by using customized one-dimensional density thresholds for each cell population to mimic experts' hierarchical gating order. Unlike manual gating, where the placement of gate boundaries is inherently subjective, thresholds are adjusted in a data-dependent manner for each sample.

When used in the unsupervised manner, gating thresholds are adjusted in a completely automated fashion per marker, removing customization. FloReMi [24], the other best method identified in the FlowCAP studies, used unsupervised flowDensity for marker partitioning. Since human intervention is removed from unsupervised gating analysis, experts might not find the partitioning of some markers agreeable.

4.2 Study Design

4.2.1 Data

We obtained blood data of AML patients (both relapsed and non-relapsed) at three time points after chemotherapy for five tubes. Each tube has a different set of markers (Table 4.1). The three time points of interest are: day 22, last before 2nd induction, and last before consolidation. To increase statistical power, I did random sampling so that each group at each time point contains 20 samples. There are an additional 20 normal samples for each tube. For supervised gating, the starting population for analysis is singlets. For the unsupervised method, the starting population is CD45+SSC-. The CD45+SSC- population was identified using k-means clustering with the flowPeaks package [8].
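The starting population for the unsupervised pipeline could be obtained along the following lines. This is an illustrative sketch only: the channel names and the rule for picking the CD45+SSC-low cluster are assumptions, not the exact code used in the study, and the flowPeaks accessors should be checked against the package documentation.

    library(flowCore)
    library(flowPeaks)

    # 'ff' is assumed to be a compensated, transformed flowFrame;
    # channel names below are assumptions and will differ between panels
    mat <- exprs(ff)[, c("CD45", "SSC-A")]
    fp  <- flowPeaks(mat)

    # Pick the cluster with high CD45 and low SSC as the starting population
    # (a simple heuristic for illustration; the study may have chosen differently)
    centres   <- fp$peaks$mu
    target    <- which.max(centres[, 1] - centres[, 2])
    start_pop <- ff[fp$peaks.cluster == target, ]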
Table 4.1: Blood data for AML patients

Marker panels:
  Tube 1: CD56, CD13, CD34, CD117, CD33, CD11b, HLADR, CD45
  Tube 2: CD36, CD64, CD34, CD117, CD33, CD14, HLADR, CD45
  Tube 3: CD15, NG2, CD34, CD117, CD2, CD19, HLADR, CD45
  Tube 4: CD7, CD96, CD34, CD117, CD123, CD38, HLADR, CD45
  Tube 5: CD99, CD11a, CD34, CD117, CD133, CD4, HLADR, CD45

Sample counts (identical for each of tubes 1-5): 20 normal samples; 20 relapsed and 20 non-relapsed at day 22; 20 relapsed and 20 non-relapsed at last before 2nd induction; 20 relapsed and 20 non-relapsed at last before consolidation.

4.2.2 Supervised Gating

Supervised gating requires users to have some prior experimental expectation, for example, if a user wants to replicate an existing manual process to target cell populations of interest in a specific way. In our case, we require biologists to have some preexisting knowledge regarding the regeneration pattern of bone marrow and to design a gating strategy to find populations that are likely to be interesting, i.e., differentiating between the relapsed and non-relapsed groups. It requires expertise and effort to come up with a gating strategy. We only obtained a gating strategy for tube 1. The goal was to analyze the end populations to see if any are significantly different between the two groups of patients at each time point.

4.2.3 Unsupervised Gating and analysis pipeline

Unsupervised gating can substantially increase the scale of analysis. Combined with flowType and RchyOptimyx, we can examine all possible populations defined by the markers by removing the time-limiting step of customization for each tube. The supervised analysis was limited to one tube; the unsupervised analysis, however, can be applied to all tubes.

4.3 Results

4.3.1 Supervised Analysis Results

I wrote the automated gating pipeline, shown in Fig 4.1, according to the gating strategy provided. The gates were mostly determined by the 1D density distribution of each marker, implemented with the flowDensity [14] package. However, several gates required rotation of the axes to some degree to find a proper separation: the singlets gate, the HLADR mast cells gate, the CD13+B CD33- gate, and the singlets monocyte-derived cells high-SSC gate, corresponding to gates 2, 3, 4, and 11 in Fig 4.1.

Subsequent comparisons of the cell proportions across the three time points show no significant difference (significant when p < 0.05) between the two groups of patients.

Figure 4.1: Automated gating pipeline according to the tube one gating strategy.

Figure 4.2: Cell counts across days for patients with the two disease statuses.
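A minimal sketch of how one such density-based gate can be written with flowDensity is shown below; the channel names, gate positions, and the two-step hierarchy are illustrative assumptions rather than the exact tube 1 pipeline.

    library(flowDensity)

    # 'ff' is assumed to be a compensated, transformed flowFrame.
    # Gate CD45+ SSC-low cells, then CD34+ cells within them
    # (TRUE = keep events above the density threshold, FALSE = keep below,
    #  NA = do not threshold on that channel).
    cd45_pop <- flowDensity(ff, channels = c("CD45", "SSC-A"),
                            position = c(TRUE, FALSE))
    cd34_pop <- flowDensity(getflowFrame(cd45_pop),
                            channels = c("CD34", "SSC-A"),
                            position = c(TRUE, NA))

    # Proportion of CD34+ cells relative to all events in the file
    cd34_pop@cell.count / nrow(ff)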
4.3.2 Unsupervised Analysis Results

There were no significantly different populations between the relapsed and non-relapsed groups based on t-test analysis for all five tubes across all three time points. However, when comparing each of the two groups with healthy bone marrow individually, there were significantly different populations. A portion of these populations overlap, as illustrated in Fig 4.3. I assumed that the overlapping regions are variations due to time differences, not patients' disease status. While we are interested in the non-overlapping cell populations in both the relapsed versus normal and non-relapsed versus normal comparisons, I only report here the cell populations that are exclusively different between relapsed and normal.

Figure 4.3: A Venn diagram generated for tube 2, last before 2nd induction, illustrates the regions of interesting cell populations, i.e., the non-overlapping regions of significant populations identified for relapsed vs normal (17) and non-relapsed vs normal (11).

flowDensity sets a threshold for each marker, partitioning it into high or low expression based on a 1D density profile. flowType then reports cell counts for 3^Q phenotypes, where Q is the number of markers. Each marker has three possible outcomes: high expression, low expression, or don't care. For the reported cell populations, we only allowed flowType to go down two levels in the gating analysis, that is, representing each population with a combination of at most four markers. Due to the limitation of not using a gating strategy, thresholds are set based on one single starting population. If unsupervised gating goes more than two levels down, the density profile might be entirely different from that of the starting population, making it difficult to transfer the thresholds and verify the validity of the end population.

Optimization

The cell types returned by flowType can be redundantly represented, i.e., with uninformative markers. Uninformative markers are those that do not substantially increase the significance of the biomarkers associated with an external outcome. Marker significance can best be visualized on a RchyOptimyx plot, as shown in Fig 4.4. CD64-HLADR- is the most significant biomarker with the least number of markers used.

In tables 4.2, 4.3, and 4.4, I summarize the resulting optimized biomarkers with their associated p-values and cell proportions across the time points day 22, last before 2nd induction, and last before consolidation, respectively. Only a maximum of 10 populations per day and per tube are reported here.

Figure 4.4: A RchyOptimyx plot lists all possible marker combinations for an end population. The biomarkers are colored by p-value; the most significant biomarkers are in red. The thickness of an arrow indicates the amount of increase of -log10(p-value).
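A rough sketch of this flowType/RchyOptimyx step is given below. The input objects, marker list, thresholds, and the per-phenotype t-test are illustrative assumptions; the argument names and slots follow my reading of the two packages' documentation and should be verified against the installed versions.

    library(flowDensity)
    library(flowType)
    library(RchyOptimyx)

    # 'frames' is assumed to be a list of gated CD45+SSC- flowFrames and
    # 'status' a two-level factor (relapsed / normal) of the same length.
    markers    <- c("CD36", "CD64", "CD34", "CD117", "CD33", "CD14", "HLADR")
    thresholds <- lapply(markers, function(m) deGate(frames[[1]], m))

    res <- lapply(frames, function(f)
      flowType(f, PropMarkers = markers, MarkerNames = markers,
               MaxMarkersPerPop = 4, PartitionsPerMarker = 2,
               Methods = "Thresholds", Thresholds = thresholds))

    # Cell proportions per phenotype (first entry is the full population),
    # then one p-value per phenotype comparing the two cohorts
    pheno.codes <- res[[1]]@PhenoCodes
    props <- sapply(res, function(r) r@CellFreqs / r@CellFreqs[1])
    pvals <- apply(props, 1, function(x) t.test(x ~ status)$p.value)

    # Prune uninformative markers for the most significant phenotype
    opt <- RchyOptimyx(pheno.codes = pheno.codes,
                       phenotypeScores = -log10(pvals),
                       startPhenotype = pheno.codes[which.min(pvals)],
                       pathCount = 5, trimPaths = FALSE)
    plot(opt, phenotypeScores = -log10(pvals), phenotypeCodes = pheno.codes)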
Table 4.2: Day 22 biomarkers, their associated p-values, adjusted p-values, and the mean proportions in the relapsed, non-relapsed, and normal cohorts

Phenotype | p-value | Adj. p-value | Prop. relapsed | Prop. non-relapsed | Prop. normal
Tube 1
CD13-CD34-HLADR+ | 1.3e-05 | 0.021 | 1.70e-01 | 0.20 | 0.520
CD56+CD34-CD117+CD33+ | 1.4e-05 | 0.022 | 2.4e-03 | 0.0038 | 0.015
CD13-CD34-CD33+CD11B- | 2.4e-05 | 0.040 | 5.7e-02 | 0.066 | 0.23
FSC-A+CD117-CD11B-HLADR- | 1.5e-05 | 0.024 | 2.1e-01 | 0.19 | 0.062
CD13-CD117-CD11B-HLADR- | 1.0e-05 | 0.017 | 2.1e-01 | 0.21 | 0.059
CD56-CD13-CD34-HLADR+ | 7.5e-06 | 0.012 | 1.6e-01 | 0.18 | 0.503
CD56+CD34-CD117+HLADR+ | 1.8e-05 | 0.030 | 2.4e-03 | 0.0031 | 0.014
CD13-CD117+CD33-HLADR+ | 1.5e-05 | 0.025 | 7.0e-03 | 0.017 | 0.037
Tube 3
CD15+CD117+ | 2.6e-07 | 4.2e-04 | 0.047 | 0.065 | 0.16
CD34-CD117+ | 2.1e-05 | 3.4e-02 | 0.066 | 0.12 | 0.19
FSC-A+CD19- | 8.3e-06 | 1.3e-02 | 0.45 | 0.52 | 0.93
CD15+CD19- | 1.7e-08 | 2.8e-05 | 0.19 | 0.26 | 0.56
NG2-CD19- | 3.5e-06 | 5.7e-03 | 0.41 | 0.55 | 0.90
CD117-CD19+ | 2.5e-05 | 3.9e-02 | 0.44 | 0.25 | 0.037
FSC-A+CD34-CD117+ | 1.9e-05 | 3.1e-02 | 0.061 | 0.11 | 0.18
FSC-A+CD34-CD2- | 1.0e-05 | 1.6e-02 | 0.35 | 0.43 | 0.77
CD15+CD117+CD2- | 3.0e-10 | 5.0e-07 | 0.029 | 0.050 | 0.16
CD34-CD117+CD2- | 2.2e-05 | 3.4e-02 | 0.059 | 0.089 | 0.18
Tube 4
FSC-A+CD7- | 3.3e-06 | 5.1e-03 | 0.56 | 0.58 | 0.93
FSC-A+CD96- | 3.1e-05 | 4.5e-02 | 0.66 | 0.70 | 0.92
CD117+CD123+ | 1.9e-05 | 2.8e-02 | 0.027 | 0.019 | 0.012
FSC-A+CD38- | 2.2e-05 | 3.2e-02 | 0.37 | 0.53 | 0.93
CD96-CD38- | 5.4e-07 | 8.4e-04 | 0.30 | 0.56 | 0.91
CD123-CD38- | 3.6e-07 | 5.5e-04 | 0.25 | 0.52 | 0.88
CD96-CD38+ | 2.3e-05 | 3.4e-02 | 0.58 | 0.34 | 0.044
CD117-CD38+ | 2.9e-05 | 4.2e-02 | 0.52 | 0.31 | 0.033
CD123-CD38+ | 2.6e-05 | 3.9e-02 | 0.56 | 0.34 | 0.037
CD7-HLADR- | 3.2e-05 | 4.7e-02 | 0.4 | 0.35 | 0.093

Table 4.3: Last before 2nd induction biomarkers, their associated p-values, adjusted p-values, and the mean proportions in the relapsed, non-relapsed, and normal cohorts

Phenotype | p-value | Adj. p-value | Prop. relapsed | Prop. non-relapsed | Prop. normal
Tube 1
CD34-CD117- | 2.8e-06 | 0.0046 | 0.82 | 0.80 | 0.63
FSC-A+CD34-CD117- | 3.5e-06 | 0.0057 | 0.81 | 0.79 | 0.62
CD56-CD34-CD117- | 8.4e-06 | 0.013 | 0.78 | 0.76 | 0.61
CD56+CD13-CD117+ | 1.0e-06 | 0.0016 | 0.0031 | 0.0046 | 0.012
CD56-CD117-HLADR+ | 5.8e-06 | 0.0095 | 0.73 | 0.70 | 0.54
CD34-CD117-HLADR+ | 9.8e-06 | 0.015 | 0.73 | 0.70 | 0.54
CD56-CD117+HLADR+ | 1.2e-06 | 0.0020 | 0.12 | 0.14 | 0.27
CD56+CD13+CD117+CD11B- | 2.7e-05 | 0.044 | 0.00063 | 0.0010 | 0.013
Tube 2
CD36+CD64+ | 3.3e-06 | 0.0055 | 0.60 | 0.59 | 0.42
CD36+CD33+ | 4.1e-06 | 0.0068 | 0.69 | 0.68 | 0.50
CD36-CD117-CD33+ | 1.8e-05 | 0.029 | 0.045 | 0.049 | 0.10
CD36-CD117+CD33+ | 2.5e-05 | 0.040 | 0.11 | 0.11 | 0.21
CD64+CD117+CD14+ | 2.5e-05 | 0.041 | 0.012 | 0.012 | 0.0015
Tube 3
CD117- | 6.9e-06 | 1.1e-02 | 0.83 | 0.78 | 0.65
CD117+ | 6.9e-06 | 1.1e-02 | 0.17 | 0.22 | 0.35
CD117+CD2- | 4.8e-06 | 7.8e-03 | 0.16 | 0.21 | 0.33
CD117+CD19- | 7.6e-06 | 1.2e-02 | 0.16 | 0.21 | 0.33
Tube 4
CD7+CD117+ | 1.5e-06 | 2.4e-03 | 0.0097 | 0.013 | 0.018
FSC-A+CD7+CD117- | 9.6e-06 | 1.6e-02 | 0.039 | 0.042 | 0.027
CD7+CD96-CD117+ | 3.1e-08 | 5.3e-05 | 0.0043 | 0.0072 | 0.011
CD7+CD34-CD117+ | 4.4e-09 | 7.5e-06 | 0.0048 | 0.0066 | 0.012
CD7+CD117+CD123- | 4.0e-07 | 6.6e-04 | 0.0086 | 0.011 | 0.017
CD7+CD117+CD38- | 1.7e-06 | 2.8e-03 | 0.0075 | 0.0097 | 0.015
CD96-CD117+CD38+ | 1.0e-05 | 1.7e-02 | 0.0061 | 0.00915 | 0.013
CD7+CD96-CD34-CD117+ | 5.9e-10 | 9.9e-07 | 0.0022 | 0.0035 | 0.0071
CD7+CD96-CD117+CD123- | 5.4e-09 | 9.1e-06 | 0.0035 | 0.0061 | 0.010
Tube 5
CD34-CD4-HLADR+ | 9.5e-06 | 0.016 | 0.076 | 0.090 | 0.13
Table 4.4: Last before consolidation biomarkers, their associated p-values, adjusted p-values, and the mean proportions in the relapsed, non-relapsed, and normal cohorts

Phenotype | p-value | Adj. p-value | Prop. relapsed | Prop. non-relapsed | Prop. normal
Tube 2
CD64-HLADR- | 1.6e-05 | 0.027 | 0.052 | 0.050 | 0.11
CD36+CD34-CD117+CD33+ | 8.5e-06 | 0.014 | 0.014 | 0.014 | 0.020
Tube 3
CD15- | 8.2e-06 | 0.013 | 0.67 | 0.68 | 0.42
CD15+ | 8.2e-06 | 0.013 | 0.32 | 0.31 | 0.57
FSC-A+CD15- | 1.4e-05 | 0.022 | 0.65 | 0.67 | 0.41
FSC-A+CD15+ | 7.8e-06 | 0.013 | 0.32 | 0.31 | 0.56
CD15+CD34- | 2.4e-05 | 0.037 | 0.31 | 0.30 | 0.53
CD15-CD2- | 6.2e-06 | 0.010 | 0.62 | 0.63 | 0.38
CD15+CD2- | 8.8e-06 | 0.014 | 0.32 | 0.31 | 0.57
CD15+CD19- | 7.1e-06 | 0.011 | 0.31 | 0.30 | 0.56
Tube 4
CD7-CD123+ | 1.0e-05 | 0.016 | 0.051 | 0.060 | 0.078
CD34-CD123+ | 1.7e-05 | 0.028 | 0.051 | 0.060 | 0.077
CD7-CD96+CD34- | 3.0e-05 | 0.049 | 0.017 | 0.018 | 0.025
FSC-A+CD96-CD123- | 2.4e-05 | 0.039 | 0.87 | 0.86 | 0.84
CD7-CD34-CD123+ | 4.5e-06 | 0.0075 | 0.041 | 0.051 | 0.070
CD96-CD34-CD123+ | 2.0e-05 | 0.033 | 0.050 | 0.060 | 0.076
Tube 5
CD4- | 8.8e-06 | 0.014 | 0.19 | 0.20 | 0.30
CD4+ | 8.8e-06 | 0.014 | 0.80 | 0.75 | 0.69
FSC-A+CD4- | 1.8e-05 | 0.030 | 0.18 | 0.23 | 0.29
CD99-CD4- | 1.2e-05 | 0.020 | 0.16 | 0.21 | 0.28
CD34-CD4- | 3.5e-06 | 0.0058 | 0.13 | 0.18 | 0.22
CD133-CD4- | 3.0e-06 | 0.0051 | 0.14 | 0.20 | 0.26
FSC-A+CD4+ | 6.2e-06 | 0.010 | 0.79 | 0.74 | 0.68
CD99-CD4+ | 5.1e-06 | 0.0085 | 0.79 | 0.73 | 0.67
CD117-CD4+ | 4.8e-06 | 0.0079 | 0.72 | 0.68 | 0.55
FSC-A+CD99-CD4+ | 3.7e-06 | 0.0062 | 0.78 | 0.72 | 0.66

We can validate the populations by plotting the gating with the thresholds set by flowDensity. In Fig 4.5, I show the gating of one biomarker from each tube.

Figure 4.5: An example of a gated population from each of tubes 1-5, shown in (a)-(e) respectively, using the unsupervised method.

4.3.3 Regeneration Dynamic

I compared the regeneration dynamics between relapsed and non-relapsed patients by fitting a linear regression model for each group. Slope comparison was done using the "contrast" function in the "emmeans" package [21]. Significantly different linear patterns are shown in Fig 4.6.

I made slope comparisons for all 3^8 = 6561 cell populations. With a large number of comparisons, some significance would arise by chance. To reduce chance occurrences, I lowered the p-value threshold to 0.02 and report only the cell populations with the smallest p-value in each tube in Fig 4.6.

Figure 4.6: One example of a linear model from each tube is shown. Red represents the relapsed patients' model; blue represents the non-relapsed patients' model. Panels (a)-(e) correspond to tubes 1-5, respectively. Coefficients between the two groups of patients have p < 0.02. Specifically, (a) CD13+CD34+CD33-CD11B+ has a p-value of 0.0035, (b) CD36-CD34+CD117- 0.0024, (c) CD15-NG2+CD34+CD2- 0.00028, (d) CD96+CD34-CD38-HLADR- 0.017, (e) CD34+CD117-HLADR- 0.0018.
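A sketch of this slope comparison for one cell population is shown below. The data frame layout and variable names are assumptions; the emmeans calls follow the package's standard interface for comparing trends between groups.

    library(emmeans)

    # 'dat' is a hypothetical data frame with one row per sample:
    #   prop  - proportion of a given cell population
    #   day   - days after chemotherapy
    #   group - factor: relapsed / non-relapsed
    fit <- lm(prop ~ day * group, data = dat)

    # Estimate the regeneration slope (change per day) within each group,
    # then compare the two slopes with the contrast machinery
    slopes <- emtrends(fit, ~ group, var = "day")
    contrast(slopes, method = "pairwise")   # p-value for the slope difference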
4.4 Conclusion

In this study, I discovered novel biomarkers through the unsupervised analysis. When time points were analyzed separately, these novel biomarkers had significantly (adjusted p-value < 0.05) different proportions in the relapsed patients (n=20) compared to the healthy cohort (n=20). However, these populations were not significantly different between the relapsed and the non-relapsed patients.

When analyzing the changes of cell populations over time, I discovered populations with differential dynamics between the relapsed and non-relapsed patients.

Chapter 5
Discussion

5.1 Data quality control study

In this thesis, I presented a new and effective approach to data quality control in the processing pipeline. I concluded that flowCut could successfully replace the current methods as a new state-of-the-art algorithm for addressing data quality issues caused by technical variability. I provided strong evidence that flowCut improves the accuracy of the subsequent gating analysis. Overall, my research has contributed to the improvement of a crucial component in the automated analysis pipeline, which, in turn, helps to achieve the goal of resolving bottleneck issues in the field of flow cytometry.

The FlowCAP studies had significant meaning in the field of flow cytometry. They not only identified the best algorithms for automated analysis but also established a method for evaluating and quantitatively comparing algorithms. This method influenced the study design of the comparative study included in this thesis. A key feature was using manual analysis as the truth, which has both advantages and limitations. One advantage is that the manual analysis creates a standard, without which the comparative study could not occur. Another benefit is that manual analysis is intuitive, with easy-to-follow procedures that do not require strong expertise. However, a major limitation is the subjectivity present in the manual analysis. The subjectivity lies not only in the boundary placements for removal regions but also in defining the removal regions themselves. For example, some users might not find that certain small spikes require removal, while others do. To address these limitations, I created categories based on the degree of subjectivity and evaluated each category separately. Algorithms had low F1 scores when subjectivity was high, as seen with category two files. This is intuitive, as the manual regions then contain some proportion of clean data. I chose the removal boundaries to be aggressive, i.e., removing as much bad data as necessary. However, it is worth evaluating less aggressive cutting, that is, calculating F1 scores with narrower boundaries. This could potentially improve the overall F1 scores for category two files for all three algorithms, but the ranking of the algorithms might stay the same.

To add more confidence to the results of this study, I suggest replicating the current research by multiple people or labs and comparing the results. It would be valuable to replicate any stage of the entire protocol, from file selection to F1 score calculation.

5.2 Biomarker discovery study

In chapter 4, I documented the automated method for identifying biomarkers associated with an external clinical outcome.

Cell population identification: Supervised analysis was most on par with experts' knowledge. However, it was limited by the time required to customize the pipeline, and it requires strong expertise for providing gating strategies. The customization process, including coding, verifying, and revising, lasted 3 weeks for one tube. Once we verified the gating results, we could be confident that the subsequent analysis was accurate. We do not need to back-gate to check the validity of the gates, as the gates are already established during the customization process following experts' guidance. Although supervised analysis required significant upfront effort, it offered highly accurate analysis. In this study, the supervised analysis was constrained to one tube. It is worth attempting such analysis when gating strategies for other tubes are provided.

On the other hand, unsupervised cell population identification was fast and scalable to all tubes. The marker thresholds were identified on the 1D density distribution of CD45+SSC- cells.
However, due to the elim-ination of human interventions in gating analysis, biologists might not findsome gates agreeable, which can affect the legitimacy of the subsequentbiomarkers discovered.Biomarker discovery Biomarker discovery is done in an unsupervisedmanner by examining all possible marker combinations. In this study, thepipeline I used is the unsupervised cell population identification and unsu-pervised biomarker analysis.For trade-off between accuracy and depth, I only let the gating analysisgo down two levels as the density distribution of the starting population canbe entirely different from the two or more levels down, making the gatingthresholds not transferable. In other words, the thresholds found on density475.2. Biomarker discovery studydistribution on CD45+SSC- cells might not make sense if placed on a smallsubset of these cells, especially when evaluating against multiple markers.Each population is represented by a maximum of 4 markers. This can belimiting as we are examining only the tip of an iceberg. For example, tube 4contains leukemia stem cell markers CD34, CD38 (CD34+CD38-). However,in the last two days, only CD34 is present in some populations. There wasa potential of discovering more of these cell types if we increased the depthof analysis. One other limitation is the lengthy verification process to checkthe validity of the significant populations. In this study, I had examinedthe gating of around 14-30 biomarkers per tube, which was 39% (117/300)of all generated biomarkers after optimization. With increased depth ofanalysis, we can create an even more substantial amount of biomarkers tobe examined.For future direction, experts can set up a simple gating strategy wheremarker thresholds can be found easily without much customization. Rely-ing on a gating strategy can effectively increase the depth and accuracy ofsubsequent analysis and reduce the lengthy validation effort.We can investigate the unique roles of the biomarkers reported in AMLpatients, especially in those who relapsed after chemotherapy. The au-tomated data analysis documented here can be one potential method forcharacterizing MRD regeneration patterns, which contributes to our under-standing of the disease prognosis.48Bibliography[1] N. Aghaeepour, P. Chattopadhyay, M. Chikina, T. Dhaene, S. VanGassen, M. Kursa, B. N. Lambrecht, M. Malek, G. J. McLachlan,Y. Qian, P. Qiu, Y. Saeys, R. Stanton, D. Tong, C. Vens, S. Walkowiak,K. Wang, G. Finak, R. Gottardo, T. Mosmann, G. P. Nolan, R. H.Scheuermann, and R. R. Brinkman. A benchmark for evaluation ofalgorithms for identification of cellular correlates of clinical outcomes.Cytometry A, 89(1):16–21, Jan 2016.[2] N. Aghaeepour, G. Finak, H. Hoos, T. R. Mosmann, R. Brinkman,R. Gottardo, R. H. Scheuermann, D. Dougall, A. H. Khodabakhshi,P. Mah, G. Obermoser, J. Spidlen, I. Taylor, S. A. Wuensch, J. Bram-son, C. Eaves, A. P. Weng, E. S. Fortuno, K. Ho, T. R. Kollmann,W. Rogers, S. De Rosa, B. Dalal, A. Azad, A. Pothen, A. Bran-des, H. Bretschneider, R. Bruggner, R. Finck, R. Jia, N. Zimmerman,M. Linderman, D. Dill, G. Nolan, C. Chan, F. El Khettabi, K. O’Neill,M. Chikina, Y. Ge, S. Sealfon, I. Sugar, A. Gupta, P. Shooshtari,H. Zare, P. L. De Jager, M. Jiang, J. Keilwagen, J. M. Maisog,G. Luta, A. A. Barbo, P. Majek, J. Vil?ek, T. Manninen, H. Hut-tunen, P. Ruusuvuori, M. Nykter, G. J. McLachlan, K. Wang, I. Naim,49BibliographyG. Sharma, R. Nikolic, S. Pyne, Y. Qian, P. Qiu, J. Quinn, A. Roth,P. Meyer, G. Stolovitzky, J. Saez-Rodriguez, R. Norel, M. 
Appendix A

Manual Gates for 52 files

Table A.1: Manual gates for 52 files.
Every two numbers correspond to one removal region; regions are separated below by semicolons.

Tphe0994300600 F7 R.fcs: 0 2500; 12000 20000
0003.fcs: 0 1000
100 111 SEB.fcs: 225 300
100 111 vehicle.fcs: 115 150
125 114 Pre 6b.fcs: 150 200
15 24.fcs: 3000 6000; 8500 10000
2151 074 BA20120228 020.fcs: 300 450
2151 074 BA20120228 021.fcs: 0 1000
2nd group high amylose maize 6 006.fcs: 2500 4500
7c MA+.fcs: 5325 6150
9407 2 1 NKR.fcs: 0 14800
BDL aLFA1.fcs copy: 7600 10000
TH004 TH004 mg.fcs: 0 30000
TS01 14 P7.fcs: 7 18; 21 23
TS01 25 P5.fcs: 0 16
TS16 406 P7.fcs: 85 87
TS21 802 P2.fcs: 0 28; 81 82; 96 96.5; 97.5 98.5
TS27 937 P4.fcs: 78.5 79.5; 83.7 84.3; 96 97; 99 110
TS27 952 P2.fcs: 0 11; 23 26; 54 56; 78.5 79.5
VD TH003.fcs: 0 1500
Macrophages.fcs: 10000 30000
Macrophages + Leishmania + oATP.fcs: 13800 30000
Macrophages + oATP.fcs: 11000 15000
9399 2 1 NKR.fcs: 2000 3000
9399 2 4 NKR.fcs: 2200 2400; 3400 3600; 9200 9400
9606 3 9 NKR.fcs: 16000 18000; 30000 34000; 55000 70000
binding assay dilution 6 007.fcs: 0 1500
Fig4 Algae chlorination.fcs: 0 6000
PBL 7wk M5.fcs: 15500 30000
PBMC HuKCD20014.fcs: 0 250
TS05 121 P6.fcs: 0 20
TS06 137 P3.fcs: 0 20; 82 87; 95 100
TS06 144 P5.fcs: 0 3; 7 11
TS14 351 P6.fcs: 0 2
22 dias Infectado 5.fcs: 20000 30000
13523 17012011 NS F01.fcs: 0 100
50uM ALLNAsy48h.fcs: 0 600; 1100 1800; 3200 3650; 4500 4800
6A liver 3d WT.fcs: 0 4000
ah20130417 dilution Tube 025 surface Sup 2.fcs: 0 325
B3 C.reinhardtii H2O2 5mM stained.fcs: 210 300
D33 C.reinhardtii preloaded.fcs: 190 260
GDC0941CQ D10 D10.fcs: 0 20; 760 1500
IL220 pepstim1 1501.fcs(1): 0 800; 28500 30000
Macrophages + Leismania.fcs: 12000 30000
Paciente AH18 Isotipos.fcs: 62.5 67.5; 115 120; 330 335
PBMC Tphe PROP10813 P4 35 T48B E08.fcs: 0 800; 1400 1500; 1750 1850
Phytoplankton shallow sample.fcs: 15000 20000
Specimen 001 B6 LSK.fcs: 1600 4000
TS04 92 P6.fcs: 3.5 5; 10.5 11; 97.5 98.5
TS21 813 P5.fcs: 50 52; 58.5 60; 92.5 94.5
Unstained SPL C3 C03.fcs: 0 350; 500 570; 4750 4850
UR414 24 Direct ex vivo 1502.fcs: 7250 7380; 19150 19200; 24000 26000
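For completeness, the sketch below shows one way the removal regions listed in Table A.1 could be applied to a file: events whose value on the gated channel falls inside any listed pair of boundaries are dropped. This is only an illustration of the table's convention, not the flowCut implementation; the file path is a placeholder, and the channel (often Time) and the units of the boundaries depend on the individual file.

    # Illustration of the Table A.1 convention: every two numbers define one
    # removal region, and events inside any region are removed. Not the flowCut code.
    library(flowCore)

    remove_regions <- function(ff, regions, channel = "Time") {
      vals <- exprs(ff)[, channel]
      drop <- rep(FALSE, length(vals))
      for (i in seq(1, length(regions), by = 2)) {
        drop <- drop | (vals >= regions[i] & vals <= regions[i + 1])
      }
      ff[!drop, ]                      # keep only events outside all removal regions
    }

    # Example with the gates listed for 0003.fcs (path is a placeholder):
    # ff      <- read.FCS("0003.fcs")
    # cleaned <- remove_regions(ff, regions = c(0, 1000))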
