Statistical Methods for HighThroughput GenomicsbyChi Ho LoB.Sc., The University of Hong Kong, 2003M.Phil., The University of Hong Kong, 2005A THESIS SUBMITTED IN PARTIAL FULFILLMENT OFTHE REQUIREMENTS FOR THE DEGREE OFDOCTOR OF PHILOSOPHYinThe Faculty of Graduate Studies(Statistics)THE UNIVERSITY OF BRITISH COLUMBIA(Vancouver)September 2009c Chi Ho Lo 2009AbstractThe advancement of biotechnologies has led to indispensable high-throughputtechniques for biological and medical research. Microarray is applied tomonitor the expression levels of thousands of genes simultaneously, while ow cytometry (FCM) o ers rapid quanti cation of multi-parametric prop-erties for millions of cells. In this thesis, we develop approaches based onmixture modeling to deal with the statistical issues arising from both high-throughput biological data sources.Inference about di erential expression is a typical objective in analysis ofgene expression data. The use of Bayesian hierarchical gamma-gamma andlognormal-normal models is popular for this type of problem. Some unre-alistic assumptions, however, have been made in these frameworks. In viewof this, we propose exible forms of mixture models based on an empiricalBayes approach to extend both frameworks so as to release the unrealis-tic assumptions, and develop EM-type algorithms for parameter estimation.The extended frameworks have been shown to signi cantly reduce the falsepositive rate whilst maintaining a high sensitivity, and are more robust tomodel misspeci cation.FCM analysis currently relies on the sequential application of a series ofmanually de ned 1D or 2D data lters to identify cell populations of inter-est. This process is time-consuming and ignores the high-dimensionality ofFCM data. We reframe this as a clustering problem, and propose a robustmodel-based clustering approach based on t mixture models with the Box-Cox transformation for identifying cell populations. We describe an EMalgorithm to simultaneously handle parameter estimation along with trans-formation selection and outlier identi cation, issues of mutual in uence.Empirical studies have shown that this approach is well adapted to FCMiidata, in which a high abundance of outliers and asymmetric cell populationsare frequently observed. Finally, in recognition of concern for an e cientautomated FCM analysis platform, we have developed an R package calledflowClust to automate the gating analysis with the proposed methodology.Focus during package development has been put on the computational ef- ciency and convenience of use at users’ end. The package o ers a wealthof tools to summarize and visualize features of the clustering results, and iswell integrated with other FCM packages.iiiTable of ContentsAbstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iiTable of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . ivList of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viiiList of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ixAcknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . xiiStatement of Co-Authorship . . . . . . . . . . . . . . . . . . . . . xiv1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11.1 Di erential Gene Expression Analysis of Microarray Data . . 21.1.1 The Technology of Microarrays . . . . . . . . . . . . . 21.1.2 Methods for Detecting Di erentially Expressed Genes 71.2 Gating Analysis of Flow Cytometry Data . . . . . . . . . . . 121.2.1 The Technology of Flow Cytometry . . . . . . . . . . 121.2.2 Methods for Identifying Cell Populations . . . . . . . 15Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202 Flexible Empirical Bayes Models for Di erential Gene Ex-pression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 252.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 252.2 A Bayesian Framework for Identifying Di erential Expression 272.2.1 A Hierarchical Model for Measured Intensities . . . . 272.2.2 A Mixture Model for Di erential Expression . . . . . 29iv2.2.3 Parameter Estimation using the EM-algorithm . . . . 302.3 Application to Experimental Data . . . . . . . . . . . . . . . 322.3.1 Data Description . . . . . . . . . . . . . . . . . . . . 322.3.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . 342.4 Simulation Studies . . . . . . . . . . . . . . . . . . . . . . . . 382.4.1 Data Generation . . . . . . . . . . . . . . . . . . . . . 382.4.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . 382.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 443 Flexible Mixture Modeling via the Multivariate t Distribu-tion with the Box-Cox Transformation . . . . . . . . . . . . 483.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 483.2 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . 513.2.1 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . 513.2.2 The MultivariatetDistribution with the Box-Cox Trans-formation . . . . . . . . . . . . . . . . . . . . . . . . . 533.2.3 The Mixture Model of t Distributions with the Box-Cox Transformation . . . . . . . . . . . . . . . . . . . 553.3 Application to Real Data . . . . . . . . . . . . . . . . . . . . 643.3.1 Data Description . . . . . . . . . . . . . . . . . . . . 643.3.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . 653.4 Simulation Studies . . . . . . . . . . . . . . . . . . . . . . . . 723.4.1 Data Generation . . . . . . . . . . . . . . . . . . . . . 733.4.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . 743.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 794 Automated Gating of Flow Cytometry Data via RobustModel-Based Clustering . . . . . . . . . . . . . . . . . . . . . 844.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 844.2 Materials and Methods . . . . . . . . . . . . . . . . . . . . . 884.2.1 Data Description . . . . . . . . . . . . . . . . . . . . 88v4.2.2 The Model . . . . . . . . . . . . . . . . . . . . . . . . 894.2.3 Density Estimation . . . . . . . . . . . . . . . . . . . 914.2.4 Sequential Approach to Clustering . . . . . . . . . . . 914.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 944.3.1 Application to Real Datasets . . . . . . . . . . . . . . 944.3.2 Simulation studies . . . . . . . . . . . . . . . . . . . . 1024.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1095 owClust: a Bioconductor package for automated gating of ow cytometry data . . . . . . . . . . . . . . . . . . . . . . . . 1155.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 1155.2 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . 1165.3 Results and Discussion . . . . . . . . . . . . . . . . . . . . . 1185.3.1 Analysis of Real FCM Data . . . . . . . . . . . . . . 1185.3.2 Integration with owCore . . . . . . . . . . . . . . . . 1265.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1296 Conclusion and Future Directions . . . . . . . . . . . . . . . 1316.1 Summary and Discussion . . . . . . . . . . . . . . . . . . . . 1316.2 Future Directions . . . . . . . . . . . . . . . . . . . . . . . . 1336.2.1 Robusti cation of the Empirical Bayes Model for Dif-ferential Gene Expression . . . . . . . . . . . . . . . . 1336.2.2 Development of an Automated FCM Analysis Pipeline 1356.2.3 Combining Mixture Components in Clustering . . . . 138Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142AppendicesA Additional Material for Chapter 2 . . . . . . . . . . . . . . . 145A.1 Marginal Densities of Measured Intensities . . . . . . . . . . 145A.2 Estimation of and for the Prior of ag . . . . . . . . . . . 147viA.3 Initialization of the EM Algorithm . . . . . . . . . . . . . . . 147B Vignette of the owClust Package . . . . . . . . . . . . . . . 149B.1 Licensing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149B.2 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149B.3 Installation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150B.3.1 Unix/Linux/Mac Users . . . . . . . . . . . . . . . . . 150B.3.2 Windows Users . . . . . . . . . . . . . . . . . . . . . 152B.4 Example: Clustering of the Rituximab Dataset . . . . . . . . 153B.4.1 The Core Function . . . . . . . . . . . . . . . . . . . 153B.4.2 Visualization of Clustering Results . . . . . . . . . . . 156B.4.3 Integration with owCore . . . . . . . . . . . . . . . . 159C Code to Produce the Plots in Chapter 5 . . . . . . . . . . . 163viiList of Tables2.1 Analysis of di erential expression with the HIV-1 data. . . . 342.2 Analysis of di erential expression with the HGU95A spike-indata. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 362.3 Analysis of di erential expression with the HGU133A spike-indata. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 373.1 Misclassi cation rates for di erent models applied to the bank-ruptcy and crabs datasets. . . . . . . . . . . . . . . . . . . . . 663.2 The number of components selected by the BIC for di erentmodels applied to the bankruptcy and crabs datasets. . . . . 733.3 Average misclassi cation rates for di erent models applied todatasets generated under the bankruptcy or crabs setting. . . 743.4 90% coverage intervals of the number of components selectedby the BIC for di erent models applied to datasets generatedunder the crabs setting. . . . . . . . . . . . . . . . . . . . . . 764.1 Average misclassi cation rates for di erent models applied todata generated under the Rituximab or GvHD setting. . . . . 1034.2 Modes and 80% coverage intervals of the number of clustersselected by the BIC for di erent models applied to data gen-erated under the GvHD setting. . . . . . . . . . . . . . . . . . 106viiiList of Figures1.1 The central dogma of molecular biology. . . . . . . . . . . . . 31.2 The cDNA microarray experiment. . . . . . . . . . . . . . . . 51.3 The representation of a gene with a probe set on A ymetrixGeneChip arrays. . . . . . . . . . . . . . . . . . . . . . . . . . 61.4 The schematic overview of a typical ow cytometer setup. . . 131.5 The occurrence of light scattering. . . . . . . . . . . . . . . . 141.6 Speci c binding of uorochrome-labeled antibodies to antigens. 151.7 A typical manual gating analysis. . . . . . . . . . . . . . . . . 162.1 Histograms of robust empirical estimates of ag’s with ttedLognormal density curves shown on a log scale under the ex-tended Gamma-Gamma modeling framework. . . . . . . . . . 282.2 Simulation results generated from the extended GG model. . 392.3 Simulation results generated from the extended LNN model. . 402.4 Simulation results generated from the EBarrays GG model. . 412.5 Simulation results generated from the EBarrays LNN model. 423.1 Contour plots revealing the shape of bivariate t distributionswith the Box-Cox transformation for di erent values of thetransformation parameter. . . . . . . . . . . . . . . . . . . . . 543.2 Scatterplots revealing the assignment of observations for dif-ferent models applied to the crabs dataset. . . . . . . . . . . . 633.3 Scatterplots revealing the assignment of observations for dif-ferent models applied to the bankruptcy dataset. . . . . . . . 67ix3.4 Plots revealing the location of misclassi ed observations rel-ative to the ordered uncertainties of all observations for dif-ferent models applied to the bankruptcy dataset. . . . . . . . 683.5 Plots revealing the assignment of observations for di erentmodels applied to the crabs dataset, displayed via the secondand third principal components. . . . . . . . . . . . . . . . . . 703.6 Plots revealing the location of misclassi ed observations rel-ative to the ordered uncertainties of all observations for dif-ferent models applied to the crabs dataset. . . . . . . . . . . . 713.7 Plots of BIC against the number of components for the dif-ferent models applied to the bankruptcy and crabs datasets. . 724.1 A synthetic 2D dataset with three mixture components. . . . 864.2 Strategy for clustering the GvHD positive sample to look forCD3+CD4+CD8 + cells. . . . . . . . . . . . . . . . . . . . . . 924.3 Strategy for clustering the GvHD control sample. . . . . . . . 934.4 Initial clustering of the Rituximab data using the FSC andSSC variables. . . . . . . . . . . . . . . . . . . . . . . . . . . . 954.5 BIC as a function of the number of clusters for di erent mod-els applied to the Rituximab data. . . . . . . . . . . . . . . . 964.6 Second-stage clustering of the Rituximab data using all the uorescent markers (three clusters). . . . . . . . . . . . . . . 974.7 Second-stage clustering of the Rituximab data using all the uorescent markers (four clusters). . . . . . . . . . . . . . . . 984.8 Initial clustering of the GvHD positive sample using the FSCand SSC variables. . . . . . . . . . . . . . . . . . . . . . . . . 1004.9 Second-stage clustering of the GvHD positive and controlsamples using all the uorescent markers. . . . . . . . . . . . 1014.10 A representative sample generated from the t mixture modelwith the Box-Cox transformation under the GvHD setting. . 1045.1 A plot of BIC against the number of clusters for the rst-stagecluster analysis. . . . . . . . . . . . . . . . . . . . . . . . . . . 120x5.2 A scatterplot revealing the cluster assignment in the rst-stage analysis. . . . . . . . . . . . . . . . . . . . . . . . . . . . 1225.3 A plot of BIC against the number of clusters for the second-stage cluster analysis. . . . . . . . . . . . . . . . . . . . . . . 1245.4 Plots of CD8 against CD4 for the CD3+ population. . . . . 1256.1 The overall ow of the proposed automated FCM analysispipeline. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1366.2 Clustering of red blood cell samples from the PNH data. . . . 137xiAcknowledgementsI dedicate my greatest gratitude to my supervisors, Raphael Gottardo andRyan R. Brinkman, for their inspirational guidance and continuous supportthroughout this journey. I would also like to thank my committee members,Jenny Bryan and Kevin Murphy, for their invaluable advice on my researchand career.In addition to the aforementioned exceptional scholars, I feel honoredto study at the UBC Statistics Department lled with excellent mentorsincluding Paul Gustafson, Harry Joe, John Petkau and Mat as Salibi an-Barrera (in alphabetical order). I have learnt so much from their insightfuladvice and tremendous knowledge of statistics.My list of acknowledgements also goes to Ryan R. Brinkman (again!), Ali Bashashati and Josef Spidlen, amongothers, for their assistance and helpful discussion during my internshipat the British Columbia Cancer Research Centre; Florian Hahne, Martin Morgan, Patrick Aboyoun and Marc Carlsonfrom Fred Hutchinson Cancer Research Centre for their professionaladvice on the technical issues in software development; Bakul Dalal, Maura Gasparetto, Mario Roederer, Clayton Smith andthe British Columbia Cancer Research Centre for kindly o ering clin-ical data and providing assistance to interpret the data; WestGrid for providing tremendous computational resources which areindispensable to my research, and its technical support team, RomanBaranowski in particular, for their prompt and e cient response to myendless questions about the priority system for resource allocation;xii Tony W. K. Fung for enriching my research experience with his excel-lent supervision during my Master’s at the University of Hong Kong; the o ce ladies, namely, Peggy Ng, Elaine Salameh and Viena Tran,from whom I truly feel that they always provide e cient assistancewith sincere care rather than merely carrying out their duties.The research constituting this thesis has received funding support fromGenome Canada, MITACS, NIH, NSERC, PIMS, and University GraduateFellowships.xiiiStatement of Co-AuthorshipThis thesis is completed under the supervision of Dr. Raphael Gottardo andDr. Ryan R. Brinkman.Chapter 2 of this thesis is co-authored with Dr. Raphael Gottardo.I developed the methodology, performed the analysis, and prepared themanuscript.Chapter 3 is co-authored with Dr. Raphael Gottardo. I identi ed theresearch problem, conceived of the study, developed the methodology, per-formed the analysis, and prepared the manuscript.Chapter 4 is co-authored with Dr. Ryan R. Brinkman and Dr. RaphaelGottardo. I developed the methodology, performed the analysis, and pre-pared the manuscript.Chapter 5 is co-authored with Dr. Florian Hahne, Dr. Ryan R. Brinkmanand Dr. Raphael Gottardo. I conceived of the study, developed the method-ology and software, performed the analysis, and prepared the manuscript.xivChapter 1IntroductionRecent technological advances in molecular biology have enabled the rapidquanti cation of characteristics for an enormous number of genes or cellsunder the same experimental condition. Microarray has been being a pop-ular technique for monitoring the expression levels of thousands of genesfor more than a decade, while ow cytometry (FCM) o ers quanti cationof multi-parametric properties for up to millions of cells. To date, exten-sive applications of these two high-throughput technologies can be found inhealth research, medical diagnosis and treatment, drug discovery and vac-cine development (Schena et al., 1995; DeRisi et al., 1996; Behr et al., 1999;Debouck and Goodfellow, 1999; Hengel and Nicholson, 2001; Braylan, 2004;Illoh, 2004; Mandy, 2004; Orfao et al., 2004; Bolton and Roederer, 2009).The interest in studying changes in gene expression levels over experi-mental conditions have led to the development of a wealth of methodologyfor identifying di erentially expressed genes. Meanwhile, the tremendousattention built towards FCM in recent years has urged the need for bothmethodological and software development for an automated analysis plat-form of gating, the process of identifying cell populations. In this thesis,we show that the aforementioned issues can be recast into problems of clus-tering, the process of looking for homogeneous groups of observations instatistics. We introduce exible forms of nite mixture models (Tittering-ton et al., 1985; Ban eld and Raftery, 1993; McLachlan and Peel, 2000;Fraley and Raftery, 2002), commonly applied as a statistical tool for clus-tering, which serve as the modeling basis for approaches developed to dealwith the issues arising from both high-throughput biological data sources.In this chapter, we review the technology of microarrays and several pop-ular methods for di erential gene expression. We then give a brief account1of the FCM technology as well as a few attempts to automate the gatinganalysis to date. Next, in Chapter 2, we introduce mixture models based ona exible empirical Bayes approach to detect di erentially expressed genesin microarray data. This approach releases the unreasonable assumptionsand enhances the exibility of models introduced in Newton et al. (2001) andKendziorski et al. (2003). In Chapter 3, we develop a uni ed framework tosimultaneously handle data transformation, outlier identi cation and clus-tering, issues which are of mutual in uence. This methodology stems froma mixture model using the multivariate t distributions with the Box-Coxtransformation, which can be viewed as a new class of distributions ex-tending the t distribution. We proceed to present in Chapter 4 the resultobtained on applying the proposed methodology to FCM data, from whichcell populations asymmetric in shape and an abundance of outliers are oftenobserved. In Chapter 5 we introduce an open-source software package calledflowClust to implement the methodology introduced in Chapters 3 and 4.This publicly available package addresses a bottleneck to FCM that there isa dearth of software tools to manage, analyze and present data on a soundtheoretical ground. Finally, we conclude in Chapter 6 with a discussionof the overall contribution of this research work, and directions for futureextensions.1.1 Di erential Gene Expression Analysis ofMicroarray Data1.1.1 The Technology of MicroarraysThe structure, function, development and reproduction of an organism de-pends on the type and amount of proteins present in each cell and tissue. Aprotein is a sequence of up to 20 types of amino acids, which is speci ed bythe nucleotide sequence of the encoding gene(s). The synthesis of proteinsconsists of two major stages, transcription and translation, and is describedby the central dogma of molecular biology (Crick, 1970); see Figure 1.1.The genetic information encoded by the deoxyribonucleic acid (DNA) is rst2Figure 1.1: The central dogma of molecular biology. The synthesis of pro-teins constitutes two major stages, transcription and translation. Partof the DNA is rst transcribed into the single-stranded mRNA taking acomplementary sequence. The mRNA then migrates from the nucleus tothe cytoplasm, and is translated into proteins. (Picture source: access-excellence.org)3transcribed into the messenger ribonucleic acid (mRNA), a single-strandedsequence complementary to the base sequence in the DNA. The mRNA thenmigrates from the nucleus to the cytoplasm, and is translated into proteins.When a gene is transcribed and then translated, we say that it is ex-pressed. Cells under di erent conditions tend to express di erent sets ofgenes, and thereby synthesize di erent proteins. To understand a biolog-ical process, it is important to know what proteins are being processed.Nonetheless, due to the complex structures, the analysis of proteins is dif- cult. Based on the fact that the mRNA gets translated into proteins, theanalysis of gene expression helps provide information about the biologicalprocess of interest. This is where microarrays, a technology which facil-itates the simultaneous measurement of expression levels of thousands ofgenes, come forth.The microarray technology relies on two key chemical processes, reversetranscription and hybridization. The process of reverse transcription createssingle-stranded complementary DNA (cDNA) copy of mRNA transcripts ex-perimentally isolated from a cell. Hybridization is the process of combiningtwo single strands of DNA or RNA into a single molecule. Two strandswhich are perfectly complementary to each other tend to bind together, re-sulting in speci c hybridization. The term \speci c" is used as opposed tothe case in which binding randomly occurs between two strands that do notform a complementary pair.Microarrays can be classi ed into two categories: the cDNA microarraysand the oligonucleotide arrays. Below we give a brief account of each ofthese two categories.cDNA MicroarraysA cDNA microarray consists of thousands of microscopic spots attached to asolid surface, with each spot containing a massive number of identical DNAsequences to serve as probes. The choice of probes may be customized in-house to satisfy speci c experimental needs. In a typical dual-color cDNAmicroarray experiment, two mRNA samples extracted from di erent exper-4Introduction to microarrays - Raphael Gottardo 7cDNA microarraysMix equal amounts to obtain target solutionHybridizetarget to probesScanTwo ImagesControl and TreatmentIntroduction to microarrays - Raphael Gottardo 7Control TreatmentmRNAcDNARTLabellingCy3 Cy5PCR amplification of cDNA clonesMix equal amounts to obtain target solutionHybridizetarget to probesGene 1Gene 2ScanTo ImagesControl and TreatmentSpot cDNA probes (Arrayer)Introduction to microarrays - Raphael Gottardo 7cDNA microarraysControl TreatmentmRNAcDNARTLabellingCy3 Cy5PCR amplification of cDNA clonesMix equal amounts to obtain target solutionHybridizetarget to probesGene 1Gene 2ScanTwo ImagesControl and TreatmentSpot cDNA probes (Arrayer)Control TreatmentmRNAcDNARTLabellingCy3 Cy5PCR amplification of cDNA clonesGene 1Gene 2Spot cDNA probes (Arrayer)Figure 1.2: The cDNA microarray experiment. A cDNA microarray is aglass microscope slide spotted with individual DNA sequences as probes.The cDNA solutions, prepared from mRNA by reverse transcription, arelabeled with green and red dyes respectively to identify the source (controland treatment). The mixed cDNA target solution is then hybridized withthe probes on the microarray. The array is scanned twice to obtain imagesfor the red and green intensities.imental conditions are reverse-transcribed into cDNA, labeled with di erent uorescent dyes (red and green), mixed and targeted against the probes onthe microarray. Owing to the property of preferential binding of a labeledcDNA molecule (target) to a probe containing the complementary sequence,speci c hybridization occurs, under some stringent environment. The arrayis then scanned and the red and green intensities for each spot are measured.Figure 1.2 illustrates such an experiment.Oligonucleotide ArraysIn a high-density oligonucleotide array, the probes are composed of shortDNA sequences known as oligonucleotides. A ymetrix GeneChip arrays,5Introduction to microarrays - Raphael Gottardo5Affymetrix: Probe setGene Fragment... ACGTTACGAGAGATCGATCAGTCAGTACTAGTACTTGCCTAGCTAGC ...AGATCGATCAGTCAGTACTAGTACT AGATCGATCAGTGAGTACTAGTACT Perfect Match (PM) probeMisMatch (MM) probeA probe pair consists of a PM and MM proberays - Raphael Gottardo 12... ACGTTACGA AGCTAGC ... Probe pair setPMMMFluorescent imageA probe pair consists of a PM and MM prFigure 1.3: The representation of a gene with a probe set on A ymetrixGeneChip arrays. A probe set consists of 11{20 distinct probe pairs. Eachperfect match (PM) probe contains an excerpted sequence (25bp long) ofthe gene. A mismatch (MM) probe is created from a PM probe by changingthe middle base to its complement.which consist of up to 33;000 genes with probes containing oligonucleotides25bp long, are the most popular in this technology. Each gene is representedby a set of 11{20 distinct probe pairs. The perfect match (PM) probeof each probe pair contains a section of the mRNA molecule of interest;the mismatch (MM) probe is created by changing the middle (13th) baseof the PM probe. The use of probe sets to represent genes reduces thechance of non-speci c hybridization by including only probes unique to thegenome, while the presence of MM probes helps quantify the non-speci chybridization that still occurs. A graphical depiction of the relationshipbetween gene sequence and probe set is given in Figure 1.3. Various methodsare available for computing the expression summary values from the probeintensities, for example, gcRMA (Wu et al., 2004), RMA (Irizarry et al.,2003), MAS 5 (A ymetrix Manual, 2001), and dChip (Li and Wong, 2001).Compared to cDNA microarrays, high-density oligonucleotide arrays have6a lower chance of non-speci c hybridization and a high detection speci city,and allow for more genes to be probed in one experiment. However, anA ymetrix GeneChip array does not support a dual-channel system andcan only be exposed to one sample in each experiment. Also, as an o -the-shelf product, the probes on the array are not to be customized.1.1.2 Methods for Detecting Di erentially Expressed GenesThe analysis of di erential gene expression helps us understand how genesare di erentially expressed under di erent conditions, for example, normaland cancer. In recent years, there has been a considerable amount of workon the detection of di erentially expressed genes. In the following we give areview of some representative methods.t Tests and VariantsSimplistic statistical treatments include the use of two-sample t tests onthe log intensities, or one-sample t tests on the log intensity ratios for eachgene (Callow et al., 2000). A gene is declared to be di erentially expressedif its p-value is less than a threshold, for example, 0:05. Because of thelarge number of hypothesis tests, adjustment methods such as Bonferroni orHolm-Bonferroni should be employed to control the familywise error rate,the probability of yielding one or more false positives. In addition, due tothe small number of replicates in microarray experiments, the gene-speci cvariance can be poorly estimated. Baldi and Long (2001) suggested usinga modi ed t test statistic where the denominator is regularized by using aweighted average of the gene-speci c and global variance estimates: 0s20 + (R 1)s2g 0 +R 2 ; (1.1)where R is the number of replicates, s20 and s2g are respectively the estimatesfor the global and gene-speci c variances, and 0 is a tuning parameter thatgoverns the contribution of the global variance. Here, \global" may referto all the genes, or those in the neighborhood of gene g. This regularized7variance estimate is derived from the mean of the posterior distribution ofthe gene-speci c variance in a Bayesian framework.Signi cance Analysis of Microarrays (SAM)SAM, proposed by Tusher et al. (2001), uses a regularized t statistic dgwhere a constant c is added to the gene-speci c standard error sg:dg = Mgc+sg; (1.2)where Mg denotes the average log intensity ratio for gene g. The value of cwas suggested by Efron et al. (2001) to be the 90th percentile of all sg. Toestimate empirically the distribution of the statistic dg under the null hy-pothesis (no di erential expression), di erent permutations of the replicatesare considered, and the statistic shown in Eq.(1.2) is recomputed for eachpermutation. The average of the statistics over all permutations, denoted as~dg, is then determined for each gene. By considering the displacement of dgfrom ~dg, and a threshold , asymmetric cuto s are obtained as the smallestdg such that dg ~dg > , and the largest dg such that dg ~dg < . Thethreshold is determined by controlling the false discovery rate (FDR), theproportion of falsely identi ed genes among the genes declared di erentiallyexpressed, at 10% or a reasonable level. In SAM, the FDR is estimatedas the ratio of the average number of genes called signi cant from thosepermuted datasets to that number from the original dataset.L onnstedt and Speed’s B statisticMaking use of an empirical Bayes normal mixture model, L onnstedt andSpeed (2002) proposed the log posterior odds statistic, more convenientlycalled the B statistic, to determine di erentially expressed genes. Moreexplicitly, the log intensity ratio Mgr for gene g on the r-th replicate isassumed to follow a normal distribution N( g;k 1g ). Let Ig be the indicatorvariable such that Ig = 1 if gene g is di erentially expressed and Ig = 0otherwise. The parameters g and g have the following conjugate prior8distribution: g Ga( =2;1) gj g =(0 if Ig = 0N(0;ck 1g ) if Ig = 1A mixture structure is implicitly assumed by the above speci cation on g.Let p = Pr(Ig = 1) be the proportion of di erentially expressed genes. Thelog posterior odds of di erential expression is derived to beBg = log Pr(Ig = 1jM)Pr(Ig = 0jM)= log p1 pf(MgjIg = 1)f(MgjIg = 0): (1.3)A large value of Bg is in favor of the alternative hypothesis of di erentialexpression. Note, however, that due to computational di culty, the authorsdid not estimate the proportion p and xed it a priori. In consequence, acuto on Bg for declaring di erential expression could not be determined onan objective ground.Linear Models for Microarray Data (LIMMA)LIMMA (Smyth, 2004) reformulates the aforementioned hierarchical modelof L onnstedt and Speed (2002) in the context of general linear models tocater for the case of a di erent number of replicates in di erent conditionsand the case of multiple conditions. In addition, LIMMA uses a moder-ated t statistic in place of the posterior odds statistic given by Eq.(1.3) forinferencing about di erential expression:~tg = Mg~sg=pR; (1.4)where the posterior variance estimate~s2g = s2 + (R 1)s2g +R 1 (1.5)9provides shrinkage of the sample variance s2g towards a pooled estimate s2,resulting in more stable inference when the number of replicates is small.The computation of the moderated t statistic does not depend on p inEq.(1.3), the potentially contentious parameter that is left un-estimatedin L onnstedt and Speed (2002).Efron’s Local False Discovery Rate (fdr)Efron (2004) proposed an empirical Bayes approach combined with a localversion of the false discovery rate to test for di erential expression. In thismethod, t test statistics are rst obtained, one for each gene. The associ-ated p-values are converted into z-scores de ned as zg = 1(pg), where indicates the standard normal distribution function. A two-componentmixture,f(zg) = p0f0(zg) +p1f1(zg); (1.6)where f0 and f1 refer to the density of the z-scores under the null (nodi erential expression) and alternative (di erential expression) hypothesesrespectively, and p0 and p1 are the proportion of true null and alternativehypotheses, is used to model the z-scores. The mixture density f and thenull density f0 are then empirically estimated. For each gene, inference isbased on the local false discovery rate, which is de ned asfdr(zg) ^f0(zg)^f(zg) : (1.7)The notation fdr is deliberately shown in lowercase to signify its di erencefrom the usual de nition of false discovery rate proposed by Benjamini andHochberg (1995). A gene with an fdr lower than some threshold, say, 10%,will be called di erentially expressed.10Empirical Bayes Gamma-Gamma and Lognormal-Normal Models(EBarrays)Newton et al. (2001) developed a method for detecting changes in geneexpression using a hierarchical gamma-gamma (GG) model. Kendziorskiet al. (2003) extended this to multiple replicates with multiple conditions,and provided the option of using a hierarchical lognormal-normal (LNN)model. For the GG model, each observation is modeled by a gamma dis-tribution with shape a and rate . Strength is borrowed across genes byassuming a gamma prior on . Denote by xg and yg the intensities for geneg in two conditions respectively. The following two-component mixture isused to model the data:p(xg;yg) = p pA(xg;yg) + (1 p) p0(xg;yg); (1.8)wherepA(xg;yg) =Z Qrp(xgrja; gx) ( gx) d gx Z Qrp(ygrja; gy) ( gy) d gy(1.9)is the marginal density for di erentially expressed genes using condition-speci c rate parameters, andp0(xg;yg) =Z Qrp(xgrja; g)Qrp(ygrja; g) ( g) d g (1.10)is for non-di erentially expressed genes using a common rate parameter.The LNN model assumes a lognormal sampling distribution for each obser-vation with mean and variance 2. A conjugate normal prior is imposed on . The corresponding mixture model takes a form similar to that shown inEq.(1.8), where the marginal densities are derived by considering gx6= gy(di erential expression) and gx = gy (no di erential expression) respec-tively. Note that, in both models, the assumption of a constant coe cientof variation for all genes has been implicitly made. For both models, theprior can be integrated out and the EM algorithm can be used to estimatethe unknown parameters. Inference is based on the posterior probabilities11of di erential expression.Bayesian Robust Inference for Di erential Gene Expression(BRIDGE)Gottardo et al. (2006) developed a robust Bayesian hierarchical model fortesting di erential expression. The model may be viewed as an extension ofthe LNN speci cation of EBarrays. An enhanced exibility is achieved bymeasures taken to release the implicitly made assumption of the constantcoe cient of variation in EBarrays, and to account for outliers. BRIDGEincludes an exchangeable prior for the variances, which allows di erent vari-ances for the genes whilst still achieving shrinkage of extreme variances. Inaddition, observations are modeled using a t distribution, which accounts foroutliers. A fully Bayesian approach is adopted, by assuming vague priors onthe parameters and carrying out parameter estimation using Markov chainMonte Carlo (MCMC) algorithms (Gelfand and Smith, 1990).Due to the relatively small number of replicates in microarray experi-ments, combining information across genes in statistical analysis is vital. ABayesian framework ts such a scenario very well, and is adopted by mostof the aforementioned methods. Also, mixture modeling is a popular strat-egy among these methods. Incidentally, the majority of the aforementionedmethods are also applicable to oligonucleotide arrays upon slight or no modi- cation, although they are presented primarily for cDNA microarrays. Also,most of the methods have included an extension to multiple conditions.1.2 Gating Analysis of Flow Cytometry Data1.2.1 The Technology of Flow CytometryFlow cytometry (FCM) is a high-throughput technology that o ers auto-mated quanti cation of a set of physical and chemical characteristics forup to millions of cells in a sample. The characteristics measured for eachcell include size, granularity or internal complexity, and uorescence inten-12Figure 1.4: The schematic overview of a typical ow cytometer setup. Cellsin the uidics system are aligned in single le. When a cell intercepts thelight source, light scattering and emission will occur. The scattered light iscollected by the forward scatter (FSC) and sideward scatter (SSC) detectors.The emitted light is collected by the photomultiplier tubes (PMT), each ofwhich targets at an individual narrow range of wavelengths. (Picture source:ab-direct.com)13 Introduction to Flow Cytometry: A Learning Guide 10 3.1 Light Scatter Light scattering occurs when a par ticle deflects incident laser light. The extent to which this occurs depends on the physical proper ties of a par ticle, namely its size and internal complexity . Factors that affect light scattering ar e the cell's membrane, nucleus, and any granular material inside the cell. Cell shape and sur face topography also contribute to the total light scatter .Forwar d-scatter ed light (FSC) is pr opor tional to cell-sur face ar ea or siz e. FSC is a measur ement of mostly diffracted light and is detected just off the axis of the incident laser beam in the for war d dir ection b y a photodiode (F igur e 3-1). FSC pr ovides a suitable method of detecting par ticles gr eater than a giv en siz e independent of their fluor escence and is ther efor e often used in immunophenotyping to trigger signal processing.Side-scatter ed light (SSC) is pr opor tional to cell granularity or internal complexity . SSC is a measur ement of mostly r efracted and r eflected light that occurs at any inter face within the cell wher e ther e is a change in refractiv e index (Figur e 3-1). SSC is collected at appr oximately 90 degr ees to the laser beam b y a collection lens and then redir ected b y a beam splitter to the appr opriate detector . Figure 3-1 Light-scatter ing proper ties of a cell light sourceside scatter detectorforward scatter detectorFigure 1.5: The occurrence of light scattering. The forward scatter detectormeasures the amount of light di racted by a cell in the forward direction.The sideward scatter detector measures the amount of refracted or re ectedlight. (Picture source: BD Biosciences)sity. FCM is widely used in health research and treatment for a varietyof tasks, such as the monitoring of the course and treatment of HIV in-fection, the diagnosis and monitoring of leukemia and lymphoma patients,the cross-matching of organs for transplantation, and research on vaccinedevelopment (Hengel and Nicholson, 2001; Bagwell, 2004; Braylan, 2004; Il-loh, 2004; Krutzik et al., 2004; Mandy, 2004; Orfao et al., 2004; Bolton andRoederer, 2009).Figure 1.4 gives a schematic overview of a typical ow cytometer setup.Cells are introduced into the sample core of the ow cytometer, where hy-drodynamic forces align the cells to move in single le and at speeds up to70;000 cells per second. When a cell intercepts a light source (e.g., laser),light scattering will occur. The forward scatter (FSC) detector in Figure 1.5measures the amount of light di racted in the forward direction, and isproportional to the cell-surface area or size. The sideward scatter (SSC)detector measures the amount of light refracted or re ected by any interfacewithin the cell, and is proportional to cell granularity or internal complexity.Before they are introduced into the ow cytometer, cells have beentagged with uorescently conjugated antibodies bound to the antigens (Fig-14Figure 1.6: Speci c binding of uorochrome-labeled antibodies to antigens.ure 1.6). The uorochromes will be excited by the light source to incur lightemission, when a cell intercepts the laser. The emitted light will be divertedto a series of uorescent detectors. Each of the uorescent detectors targetsat an individual narrow range of wavelengths, close to the peak emissionwavelength characteristic of an individual uorochrome. The uorescentsignals detected are proportional to the amount of individual uorochromespresent. As each uorochrome is conjugated to an antibody, the signal canbe used to measure the amount of individual antigens.Cells of di erent types have di erent correlated measurements of FSCand SSC. Antigens are also present in the cells at di erent amounts. The uorescent signals, combined with the FSC and SSC measurements, cantherefore be used to identify cell populations (homogeneous groups of cellsthat display a particular function) and their relative abundance in a sample.1.2.2 Methods for Identifying Cell PopulationsOne major component of FCM analysis involves the process of identify-ing cell populations. This is referred to as gating in the FCM community.Conventionally, the identi cation of cell populations relies on applying se-quentially a series of manually drawn lters, i.e., gates, to subset and selectregions in 1D or 2D graphical representations of the data; see Figure 1.7 foran example. In such an analysis, the choice of which sequence of parameters15Figure 1.7: A typical manual gating analysis. (a) Data are projected ontothe FSC and SSC dimensions to identify basic cell populations. An expertresearcher draws a gate on the contour density plot to de ne the lymphocytepopulation. (b) The selected cells are then projected on the CD3 dimen-sion, and CD3+ cells are de ned through an interval gate with the markedthreshold as the lower bound. (c) A quadrant gate is applied on the pro-jections along the CD4 and CD8 dimensions. Cells within the upper rightgate is referred to as CD3+CD4+CD8 +, the cell population of interest inthis analysis.to gate on and where to position the gates are highly subjective. It alsoignores the high-multidimensionality of FCM data, which may convey infor-mation that cannot be displayed on 1D or 2D projections. In addition, thereis a major concern for the manually time-consuming input to the manualgating analysis. It is not uncommon for an FCM study to include thousandsof samples to analyze, thanks to the high-throughput technological advance-ment in generating FCM data. Attempts for partial automation have beenmade by using software to automatically apply a template gating sequencewith the same set of gates on all samples. Nevertheless, the improvement inoverall e ciency is limited as the variation between samples has not beentaken into account, and therefore sample-by-sample manual adjustment ofthe position of the gates cannot be avoided. As noted in Lizard (2007),the lack of an automated analysis platform to parallel the high-throughputdata-generation platform has become a major bottleneck for FCM.In statistics, gating may be reframed as a clustering problem. There have16been few attempts to devise an automated platform with a sound statisticalframework for gating analysis. Below is a brief account of the methods thathave been applied to automate the gating analysis, with the latter two beingrecent new additions.K-means ClusteringK-means clustering (MacQueen, 1967) is a relatively early attempt for FCManalysis. Its objective is to obtain a partitionP withK clustersP1;P2;:::;PK,corresponding to cell populations in FCM data, such that the within-clustersum of squares is minimized:arg minPKXk=1Xyi2Pk(yi k)T(yi k); (1.11)where yi is an observation vector and k is the mean of Pk. The algorithmstarts with K randomly selected points as means. A K-cluster partitionof the data is obtained by assigning each observation to the nearest mean,followed by a recomputation of the cluster means. Such a procedure repeatsuntil there is no change in the assignment of observations. This methodwas found to be equivalent to the classi cation EM algorithm (Celeux andGovaert, 1992, 1995) for a Gaussian mixture model with equal mixing pro-portions, and a scalar multiple of the identity matrix as a common covariancematrix.Bayesian Mixture Modeling using Gaussian DistributionsChan et al. (2008) proposed a Bayesian approach of representing cell popula-tions with Gaussian components in a mixture model. Explicitly, observationyi in an FCM dataset is modeled asf(yijw1;:::;wK; 1;:::; K; 1;:::; K) =KXk=1wk (yij k; 1k ); (1.12)17where ( j k; 1k ) is the multivariate Gaussian density with mean k andcovariance matrix 1k , and wk is the mixing proportion. The parametersin Eq.(1.12) are assumed to follow a conjugate prior distribution:(w1;:::;wK) D( 1;:::; K) k N(mk; k 1k ) k W(rk;Vk)where D( ) and W( ) denote the Dirichlet and Wishart distributions respec-tively. Parameter estimation is performed using Gibbs sampling describedin Lavine and West (1992). The number of mixture components is selectedvia the Bayesian Information Criterion (Schwarz, 1978). Attempts to re-solve the inadequacy of Gaussian distributions to asymmetric componentsare made by merging components which share a common local mode inthe tted mixture distribution. Nevertheless, Gaussian mixture models areknown to be vulnerable to the presence of outliers, which are frequently ob-served in FCM data. Moreover, considering the large size of FCM datasets,the use of MCMC techniques for parameter estimation requires an enormousamount of computational time. In FCM analysis usually involving a largenumber of samples, time is of the essence.Mixture Modeling using Skew t DistributionsAnother model-based clustering approach of automating the gating analysiswas proposed by Pyne et al. (2009), who addressed the issues of asymmet-ric cell populations and outliers with the skew t distributions (Sahu et al.,2003). Each component in the mixture is modeled by a p-dimensional skewt distribution with the following density:f(yj ; ; ; ) = 2 ’(yj ; ; ) T r +p + !; (1.13)where’( j ; ; ) is the multivariatetdensity with mean ( > 1), variance ( 2) 1 ( > 2) and degrees of freedom , T( j ) is the univariate18standard t distribution function with degrees of freedom, = + T; = T 1(y ); 2 = 1 T 1 , and = (y )T 1(y ). A stochasticrepresentation of the distribution is given byY = + U + jZj; (1.14)whereU N(0; = )Z N(0;1= ) Ga( =2; =2)and U;Z and are all independently distributed. Deduced from Eq.(1.14),extensions of the Monte Carlo EM algorithm (Wei and Tanner, 1990) havebeen developed for parameter estimation complicated by analytically in-tractable quantities in the E step (Lin, 2009).19BibliographyA ymetrix Manual (2001). A ymetrix Microarray Suite User Guide Version5.0. Santa Clara, CA.Bagwell, C. B. (2004). DNA histogram analysis for node-negative breastcancer. Cytometry Part A, 58A(1):76{78.Baldi, P. and Long, A. D. (2001). A Bayesian framework for the analysis ofmicroarray expression data: Regularized t-test and statistical inferencesof gene changes. Bioinformatics, 17(6):509{519.Ban eld, J. D. and Raftery, A. E. (1993). Model-based Gaussian and non-Gaussian clustering. Biometrics, 49:803{821.Behr, M. A., Wilson, M. A., Gill, W. P., Salamon, H., Schoolnik, G. K.,Rane, S., and Small, P. M. (1999). Comparative genomics of BCG vaccinesby whole-genome DNA microarray. Science, 284(5419):1520{1523.Benjamini, Y. and Hochberg, Y. (1995). Controlling the false discoveryrate: a practical and powerful approach to multiple testing. Journal ofthe Royal Statistical Society, Series B, 57(1):289{300.Bolton, D. L. and Roederer, M. (2009). Flow cytometry and the future ofvaccine development. Expert Review of Vaccines, 8(6):779{789.Braylan, R. C. (2004). Impact of ow cytometry on the diagnosis andcharacterization of lymphomas, chronic lymphoproliferative disorders andplasma cell neoplasias. Cytometry Part A, 58A(1):57{61.Callow, M. J., Dudoit, S., Gong, E. L., Speed, T. P., and Rubin, E. M.(2000). Microarray expression pro ling identi es genes with altered ex-pression in HDL-de cient mice. Genome Research, 10:2022{2029.Celeux, G. and Govaert, G. (1992). A classi cation EM algorithm for clus-tering and two stochastic versions. Computational Statistics and DataAnalysis, 14(3):315{332.20Celeux, G. and Govaert, G. (1995). Gaussian parsimonious clustering mod-els. Pattern Recognition, 28(5):781{793.Chan, C., Feng, F., Ottinger, J., Foster, D., West, M., and Kepler, T. B.(2008). Statistical mixture modeling for cell subtype identi cation in owcytometry. Cytometry Part A, 73A:693{701.Crick, F. (1970). Central dogma of molecular biology. Nature, 227:561{563.Debouck, C. and Goodfellow, P. N. (1999). DNA microarrays in drug dis-covery and development. Nature Genetics, 21(1 Suppl):48{50.DeRisi, J., Penland, L., Brown, P. O., Bittner, M. L., Meltzer, P. S., Ray, M.,Chen, Y., Su, Y. A., and Trent, J. M. (1996). Use of a cDNA microarrayto analyse gene expression patterns in human cancer. Nature Genetics,14(4):457{460.Efron, B. (2004). Large-scale simultaneous hypothesis testing: the choice ofa null hypothesis. Journal of the American Statistical Association, 99:96{104.Efron, B., Tibshirani, R., Storey, J. D., and Tusher, V. (2001). Empiri-cal Bayes analysis of a microarray experiment. Journal of the AmericanStatistical Association, 96(456):1151{1160.Fraley, C. and Raftery, A. E. (2002). Model-based clustering, discriminantanalysis, and density estimation. Journal of the American Statistical As-sociation, 97(458):611{631.Gelfand, A. E. and Smith, A. F. M. (1990). Sampling-based approaches tocalculating marginal densities. Journal of the American Statistical Asso-ciation, 85:398{409.Gottardo, R., Raftery, A. E., Yeung, K. Y., and Bumgarner, R. E. (2006).Bayesian robust inference for di erential gene expression in microarrayswith multiple samples. Biometrics, 62(1):10{18.21Hengel, R. L. and Nicholson, J. K. (2001). An update on the use of owcytometry in HIV infection and AIDS. Clinics in Laboratory Medicine,21(4):841{856.Illoh, O. C. (2004). Current applications of ow cytometry in the diagnosis ofprimary immunode ciency diseases. Archives of Pathology and LaboratoryMedicine, 128(1):23{31.Irizarry, R., Hobbs, B., Collin, F., Beazer-Barclay, Y., Antonellis, K., Scherf,U., and Speed, T. (2003). Exploration, normalization, and summariesof high density oligonucleotide array probe level data. Biostatistics,4(2):249{264.Kendziorski, C., Newton, M., Lan, H., and Gould, M. N. (2003). On para-metric empirical Bayes methods for comparing multiple groups using repli-cated gene expression pro les. Statistics in Medicine, 22:3899{3914.Krutzik, P. O., Irish, J. M., Nolan, G. P., and Perez, O. D. (2004). Analysisof protein phosphorylation and cellular signaling events by ow cytometry:techniques and clinical applications. Clinical Immunology, 110(3):206{221.Lavine, M. and West, M. (1992). A Bayesian method for classi cation anddiscrimination. Canadian Journal of Statistics, 20:451{461.Li, C. and Wong, W. H. (2001). Model-based analysis of oligonucleotidearrays: expression index computation and outlier detection. Proceedingsof the National Academy of Sciences of the United States of America,98:31{36.Lin, T. I. (2009). Robust mixture modeling using multivariate skew t dis-tributions. Statistics and Computing, (In press).Lizard, G. (2007). Flow cytometry analyses and bioinformatics: interest innew softwares to optimize novel technologies and to favor the emergenceof innovative concepts in cell research. Cytometry Part A, 71A:646{647.22L onnstedt, I. and Speed, T. P. (2002). Replicated microarray data. StatisticaSinica, 12:31{46.MacQueen, J. B. (1967). Some methods for classi cation and analysis ofmultivariate observations. In LeCam, L. and Neyman, J., editors, Pro-ceedings of the fth Berkeley Symposium on Mathematical Statistics andProbability, volume 1, pages 281{297, Berkeley. University of CaliforniaPress.Mandy, F. F. (2004). Twenty- ve years of clinical ow cytometry: AIDSaccelerated global instrument distribution. Cytometry Part A, 58A(1):55{56.McLachlan, G. and Peel, D. (2000). Finite Mixture Models. Wiley-Interscience, New York.Newton, M. C., Kendziorski, C. M., Richmond, C. S., Blattner, F. R., andTsui, K. W. (2001). On di erential variability of expression ratios: im-proving statistical inference about gene expression changes from microar-ray data. Journal of Computational Biology, 8:37{52.Orfao, A., Ortuno, F., de Santiago, M., Lopez, A., and San Miguel, J. (2004).Immunophenotyping of acute leukemias and myelodysplastic syndromes.Cytometry Part A, 58A(1):62{71.Pyne, S., Hu, X., Wang, K., Rossin, E., Lin, T.-I., Maier, L. M., Baecher-Allan, C., McLachlan, G. J., Tamayo, P., Ha er, D. A., De Jager, P. L.,and Mesirov, J. P. (2009). Automated high-dimensional ow cytometricdata analysis. Proceedings of the National Academy of Sciences of theUnited States of America, 106(21):8519{8524.Sahu, S. K., Dey, D. K., and Branco, M. D. (2003). A new class of multivari-ate skew distributions with applications to Bayesian regression. CanadianJournal of Statistics, 31(2):129{150.23Schena, M., Shalon, D., Davis, R. W., and Brown, P. O. (1995). Quantita-tive monitoring of gene expression patterns with a complementary DNAmicroarray. Science, 270:467{470.Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statis-tics, 6:461{464.Smyth, G. K. (2004). Linear models and empirical Bayes methods for as-sessing di erential expression in microarray experiments. Statistical Ap-plications in Genetics and Molecular Biology, 3:Article 3.Titterington, D. M., Smith, A. F. M., and Makov, U. E. (1985). StatisticalAnalysis of Finite Mixture Distributions. Wiley, Chichester, UK.Tusher, V., Tibshirani, R., and Chu, G. (2001). Signi cance analysis of mi-croarrays applied to the ionizing radiation response. Proceedings of the Na-tional Academy of Sciences of the United States of America, 98(9):5116{5121.Wei, G. C. G. and Tanner, M. A. (1990). A Monte Carlo implementationof the EM algorithm and the poor man’s data augmentation algorithms.Journal of the American Statistical Association, 85:699{704.Wu, Z., Irizarry, R. A., Gentleman, R., Martinez-Murillo, F., and Spencer,F. (2004). A model-based background adjustment for oligonucleotide ex-pression arrays. Journal of the American Statistical Association, 99:909{917.24Chapter 2Flexible Empirical BayesModels for Di erential GeneExpression 2.1 IntroductionAs a natural development following the success of genome sequencing, DNAmicroarray technology has emerged for the sake of exploring the functioningof genomes (Schena et al., 1995). By exploiting the ability of a single-strand nucleic acid molecule to hybridize to a complementary sequence, re-searchers can simultaneously measure the expression levels of thousands ofgenes within a cell. A common task with microarray is to determine whichgenes are di erentially expressed under two di erent conditions.In recent years, there has been a considerable amount of work on thedetection of di erentially expressed genes. An early statistical treatmentcan be found in Chen et al. (1997). A common approach is to test a hypoth-esis for each gene using variants of t or F-statistics and then try to correctfor multiple testing (Efron et al., 2001; Tusher et al., 2001; Dudoit et al.,2002). Due to the small number of replicates, variation in gene expressioncan be poorly estimated. Baldi and Long (2001) and Tusher et al. (2001)suggested using a modi ed t statistic where the denominator has been reg-ularized by adding a small constant to the gene speci c variance estimate.Similar to an empirical Bayes approach this results in shrinkage of the em- A version of this chapter has been published. Lo, K. and Gottardo, R. (2007). Flexi-ble empirical Bayes models for di erential gene expression. Bioinformatics, 23(3):328{335.25pirical variance estimates towards a common estimate. L onnstedt and Speed(2002) proposed an empirical Bayes normal mixture model for gene expres-sion data, which was later extended to the two condition case by Gottardoet al. (2003) and to more general linear models by Smyth (2004) and Cuiet al. (2005), though Smyth (2004) and Cui et al. (2005) did not use mixturemodels but simply empirical Bayes normal models for variance regulariza-tion. In each case, the authors derived explicit gene speci c statistics anddid not consider the problem of estimating p the proportion of di erentiallyexpressed genes. Newton et al. (2001) developed a method for detectingchanges in gene expression in a single two-channel cDNA slide using a hi-erarchical gamma-gamma (GG) model. Kendziorski et al. (2003) extendedthis to replicate chips with multiple conditions, and provided the option ofusing a hierarchical lognormal-normal (LNN) model. Both models are im-plemented in an R package called EBarrays (Empirical Bayes microarrays)and from now on we use the name EBarrays to refer to the methodology.Both EBarrays model speci cations rely on the assumption of a constantcoe cient of variation across genes. In this chapter, we extend both modelsby releasing this assumption and introduce EM type algorithms for parame-ter estimation, thus extending the work of L onnstedt and Speed (2002) andGottardo et al. (2003) as well.The structure of this chapter is as follows. The extended forms of the twoEBarrays hierarchical models and the estimation procedures are presentedin Section 2.2. In Section 2.3, the performance of the extended models isexamined on three experimental datasets and compared to ve other baselineand commonly used methods. Section 2.4 presents a simulation study tofurther compare our empirical Bayes approach to the other methods. Finally,in Section 2.5 we discuss our results and possible extensions.262.2 A Bayesian Framework for IdentifyingDi erential Expression2.2.1 A Hierarchical Model for Measured IntensitiesIn a typical microarray experiment, two conditions are compared for geneexpression. Let us denote by Xgr and Ygr the intensities of gene g fromthe rth replicate in the two conditions respectively. Measurements betweenthe two conditions are assumed to be independent. The proposed model isan extension of the EBarrays framework (Newton et al., 2001; Kendziorskiet al., 2003). Extensions to the original two types of model formulation areconsidered in turn below.The Extended Gamma-Gamma ModelHere, a Gamma distribution is used to model the measured intensities of agiven gene. Explicitly, the probability density of Xgr (resp. Ygr) with shapeand rate parameters ag and gx (resp. gy) is given byp(xjag; gx) = 1 (ag) aggxxag 1 exp( x gx) for x> 0: (2.1)To borrow strength across genes we assume an exchangeable Gamma(a0; )prior for the rate parameters, and a Lognormal( ; ) prior for the shapeparameters. The Gamma prior is used for simplicity as it is conjugate tothe sampling distribution (Newton et al., 2001) while the Lognormal prior issuggested by a histogram plot of the empirical shape parameters estimatedby the method of moments (See Figure 2.1). The hyperparameters a0; ; and are assumed unknown and will be estimated as part of our approach.The proposed model extends the EBarrays GG model by placing a prioron the shape parameter. In the original GG model, the shape parametera was assumed to be constant and common to all genes whereas now it isgene speci c. However, strength is borrowed across genes through the priordistribution. By \borrowing strength", we mean that information from allgenes is used when estimating ag, which comes from the hyperparmeters27(a) HIV-1ADensity− 2 0 2 4 6 80.00.10.20.30.4(b) HIV-1BDensity− 2 0 2 4 6 80.00.10.20.30.4(c) HGU95A, gcRMADensity0 2 4 6 8 100.000.050.100.150.200.25(d) HGU133A, gcRMADensity0 2 4 6 8 10 120.000.050.100.150.200.25Figure 2.1: Histograms of robust empirical estimates of ag’s with ttedLognormal density curves shown on a log scale under the extended Gamma-Gamma modeling framework. The hyperparameters and are estimatedusing a robust version of the method of moments. The graphs for theHGU95A and HGU133A spike-in data are for one of the comparisons madebetween two array groups.28through the prior.The Extended Lognormal-Normal ModelThe second formulation is an extension of the EBarrays LNN framework.The intensities are assumed to be lognormally distributed, i.e., the log-transformed intensities are from a normal distribution, and we write logXgr N( gx; 1gx ) and logYgr N( gy; 1gy ) respectively. A conjugate prior isimposed on the mean gx (resp. gy) and precision gx (resp. gy). Ex-plicitly, we set gxj gx N(m;k 1gx ) and gx Gamma( ; ) respectively.In the original LNN model, the precision was assumed to be constantand common to all genes. Our proposed formulation extends the EBar-rays model by releasing the assumption of a constant coe cient of variationpexp( 1) 1, which is equivalent to the assumption of a constant variance 1 on the log scale. Note that our proposed formulation is also the frame-work of Gottardo et al. (2003). However, we use an EM based algorithm toestimate the unknown parameters, including the proportion of di erentiallyexpressed genes.On assuming a prior on both gx (resp. gy) and gx (resp. gy) commonto all genes, strength is borrowed across genes through both means andvariances of the distributions when making inferences. Again, we mean thatinformation from all genes is used when estimating both gx (resp. gy)and gx (resp. gy). In particular, this is essential for variances | due tothe small number of replicates variance estimates can be very noisy. Similarideas have been used in Smyth (2004) and Cui et al. (2005), where theauthors concentrated on variance regularization.2.2.2 A Mixture Model for Di erential ExpressionWe use a mixture model to identify di erentially expressed genes. We as-sume that a priori gx = gy (resp. gx = gy) with probability 1 p and gx 6= gy (resp. gx 6= gy) with probability p. For the latter case, themodel speci cation is just as stated in Section 2.2.1, while the former caseis modeled through setting the gene-speci c parameters common to both29conditions.Let us denote by zg the indicator variable equal to one if there is realchange in expression for gene g and zero otherwise. Then one can de nethe posterior probability of change, Pr(zg = 1jxg;yg;p; ), where xg =(xg1;xg2;:::;xgRx)0 and yg = (yg1;yg2;:::;ygRy)0 and denotes the vectorof unknown hyperparameters. Applying the Bayes rule, we obtain^zg = Pr(zg = 1jxg;yg;p; )= p pA(xg;ygj )p pA(xg;ygj ) + (1 p)p0(xg;ygj ); (2.2)where pA(xg;ygj ) and p0(xg;ygj ) denote the joint marginal density ofthe measured intensities of gene g under both the alternative (di erentialexpression) and null (no di erential expression) models respectively given . The marginal density for the extended LNN model can be computedexplicitly and is given in Appendix A.1. For the extended GG model only gcan be integrated out, and the corresponding \conditional" marginal densityis given in Appendix A.1. In the next section we describe an approximateestimation procedure to deal with this di culty.2.2.3 Parameter Estimation using the EM-algorithmHere we start with the extended LNN model as the estimation procedureis straightforward. The vector of unknown parameters = ( 0;p)0, where = (m;k; ; )0, can be estimated by maximizing the integrated likelihoodusing the EM-algorithm (Dempster et al., 1977). The estimation of p isimportant since it calibrates the posterior probability of change for multipletesting, as seen in Eq.(2.2). Such estimation is also part of some multipletesting procedure such as Storey’s q-value (Storey, 2003). Estimation of theparameter p can be di cult (Smyth, 2004; Bhowmick et al., 2006), and assuggested by Newton et al. (2001) we place a Beta(2;2) prior over p, whichavoids numerical issues when p gets close to 0 or 1. Given the large numberof genes, the prior on p has essentially no e ect on the nal estimation, andthus on the number of genes called di erentially expressed.30Treating the zg’s as missing data, the complete-data log-likelihood isgiven bylc( jx;y;z) =Xghzg log pA(xg;ygj ) + (1 zg) log p0(xg;ygj )+ (1 +zg) log p+ (2 zg) log(1 p)i: (2.3)During the E-step, the expectation is obtained by replacing zg by ^zg as givenby Eq.(2.2) while the M-step consists of maximizing the conditional expec-tation with respect to the parameter vector = ( 0;p)0. At convergence,the estimated parameters can be substituted into Eq.(2.2) to compute theposterior probability of change for each gene.Because the prior of the extended GG model is not conjugate to the sam-pling distribution, only the marginal density conditional on ag is analyticallyavailable for each gene. We refer to it as the conditional marginal density.To incorporate information about the prior for the ag’s, we propose to esti-mate the hyperparameters and beforehand through an empirical Bayesapproach using the method of moments (see Appendix A.2 for details), andadd log[ (agj ; )] to the log conditional density as a penalty term. Again,treating the zg’s as missing data, the corresponding modi ed complete-datalog-likelihood can be written as~lc( jx;y;z) = Xgnzg log pA(xg;ygj ;ag) + (1 zg) log p0(xg;ygj ;ag)+ (1 +zg) log(p) + (2 zg) log(1 p) + log (agj ; )o;(2.4)where = (a0; )0. The vector of parameters to be estimated becomes = (a1;a2;:::;aG; 0;p)0.Similar to the extended LNN model, we can use the EM algorithm tomaximize the modi ed marginal likelihood. During the E-step, to obtain theconditional expectation of the modi ed complete-data log-likelihood, zg inEq.(2.4) is replaced by ^zg as in Eq.(2.2). The M-step consists of maximizing31~lc given the current zg’s. Such maximization can be di cult given the highdimensionality of and here we suggest to exploit the conditional structureof the model during the maximization step, namely that given and p,the genes are conditionally independent and each ag can be maximized overseparately. Let us split the unknown parameters into two groups, namely, 1 = (a1;a2;:::;aG)0 (gene-speci c shape parameters) and 2 = ( 0;p)0(global parameters). Then the M-step would consist of iteratively maximiz-ing over 1 given 2 and 2 given 1. Here, we decided to maximize over 1 only once during the rst iteration to reduce the computational burden,and then take EM-iterations with respect to 2 only until convergence. Itturns out that the estimates obtained were very similar to the ones obtainedwhen maximizing over both 1 and 2, while signi cantly reducing thecomputing time.Details about the estimation of ( ; ) and initialization of the EM algo-rithm can be found in Appendix A.3.2.3 Application to Experimental Data2.3.1 Data DescriptionTo illustrate our methodology we use three publicly available microarraydatasets: one cDNA experiment and two A ymetrix spike-in experiments.All three have the advantage that in each case the true state (di erentiallyexpressed or not) of all or some of the genes is known.The HIV-1 DataThe expression levels of 4608 cellular RNA transcripts were measured onehour after infection with human immunode ciency virus type 1 (HIV-1) us-ing four replicates on four di erent slides. 13 HIV-1 genes have been includedin the set of RNA transcripts to serve as positive controls, i.e., genes knownin advance to be di erentially expressed. Meanwhile, 29 non-human geneshave also been included and act as negative controls, i.e., genes known tobe not di erentially expressed. Another dataset was obtained by repeating32the four aforementioned experiments but with an RNA preparation di erentfrom that for the rst dataset. For easy reference, we label the two datasetsas HIV-1A and HIV-1B respectively. See van’t Wout et al. (2003) for moredetails of the HIV-1 data. The data were lowess normalized using a globallowess normalization step (Yang et al., 2002).The HGU95A Spike-In DataThis dataset was obtained from a spike-in study by A ymetrix used to de-velop and validate the MAS 5.0 (A ymetrix Manual, 2001) platform. Theconcentrations of 14 spiked-in human gene groups in 14 groups of HGU95AGeneChip c arrays were arranged in a Latin square design. The concentra-tions of the 14 groups in the rst array group are 0;0:25;0:5;1;2;4;8;16;32;64;128;256;512 and 1024 pM respectively. Each subsequent array grouprotates the spike-in concentrations by one group such that each human genewas spiked-in at a particular concentration level on exactly one array group,and each concentration level came with exactly one spiked-in gene group ineach array group. There are three technical replicates in each array group.The third array group has been removed from the analysis as one of itsreplicates was missing. We use a set of 16 spiked-in genes in our list inrecognition of the extras reported by Hsieh et al. (2003) and Cope et al.(2004). Analysis is performed on each set of probe summary indices com-puted using gcRMA (Wu et al., 2004), RMA (Irizarry et al., 2003a), MAS 5and dChip (Li and Wong, 2001) respectively.The HGU133A Spike-In DataThis dataset was obtained from another spike-in study done with HGU133Aarrays. A total of 42 spiked-in genes were organized in 14 groups, andthe concentrations used were 0;0:125;0:25; 0:5;1;2;4;8;16;32;64;128;256and 512 pM. The arrangement of the spike-in concentrations was similarto the Latin square design stated above. Again, there are three technicalreplicates in each array group. For more information see Irizarry et al.(2003b). In addition to the original 42, we claim that another 20 genes33should also be included in the spiked-in gene list as they consistently showsigni cant di erential expression across the array groups in the exploratorydata analysis. Similar observations have been made by She er et al. (2005).Moreover, the probe sets of three genes contain probe sequences exactlymatching those for the spiked-ins. These probes should be hybridized bythe spike-ins as well. As a result, our expanded spiked-in gene list contains65 entries in total.2.3.2 ResultsWe compare our proposed methods { extended GG (eGG) and extendedLNN (eLNN) models { to ve other methods, namely, EBarrays GG andLNN models, the popular Signi cance Analysis of Microarrays (SAM) (Tusheret al., 2001), Linear Models for Microarray data (LIMMA) (Smyth, 2004),and a fully Bayesian approach named BRIDGE (Gottardo et al., 2006a).The results have been organized in Tables 2.1{2.3.In the analysis of the HIV-1 data, we obtain the number of genes calleddi erentially expressed (DE) for each method. Among those genes calledTable 2.1: Analysis of di erential expression with the HIV-1 data.(a) HIV-1AMethod DE TP FP GG 24 13 0LNN 18 13 1eGG 13 13 0eLNN 14 13 0LIMMA 13 13 0SAM 13 13 0BRIDGE 14 13 0(b) HIV-1BMethod DE TP FP GG 18 11 1LNN 18 11 1eGG 12 11 0eLNN 12 11 0LIMMA 11 11 0SAM 13 11 0BRIDGE 11 11 0The FDR is controlled at 0:1. The numbers of TP and FP are based on the controls, namely, the 13 (resp. 12in the second experiment) HIV-1 and the 29 non-human genes of which the statesare known in advance. They do not represent the true numbers of TP and FP inthe entire data.34DE, we look at the number of true positives (TP), i.e., genes known to beDE in advance, and the number of false positives (FP), i.e., genes known tobe not DE. Gottardo et al. (2006b) showed that one of the HIV genes, whichwas expected to be highly di erentially expressed had a very small estimatedlog ratio and did not properly hybridize in the second experiment (HIV-1B).We removed the corresponding gene from the list of known di erentiallyexpressed genes. Thus there are 13 genes known to be DE in the rstexperiment and 12 in the second. To compare the performance between theseven methods, we intend to control the false discovery rate (FDR) at a xed level of 0:1. The FDR cuto s can be selected using a direct posteriorprobability calculation as described in Newton et al. (2004). For the HIV-1A dataset, when the FDR is controlled at 0:1, all methods can identify the13 positive controls. Meanwhile, EBarrays LNN has made one FP. Similarresult is observed when the HIV-1B dataset is considered. All methodsdetect 11 out of the 12 positive controls but both versions of EBarrays (GGand LNN) have made one FP. Concluded from the HIV-1 datasets, alongwith LIMMA, SAM and BRIDGE our proposed eGG and eLNN methodsappear to perform the best as they recognize the most positive controls anddo not get any FP.For the HGU95A spike-in data, after removing the array group with onemissing replicate, we have a set of 13 array groups. To evaluate the di erentmethods we compare the rst array group to the other array groups, leadingto 12 comparisons. Since dChip may return negative probe summary indices,which cannot be processed by the aforementioned methods, those genes withnegative summary indices were ltered out. This excluded 5:5 spike-ins onaverage. This time, since we know the actual status of each gene, we cancheck the true FDR of each method against the desired FDR. In addition,we look at the number of false negatives (FN) as a power assessment.Unlike the results on the HIV-1 data, SAM does not show a competitiveperformance. A large number of FN (>11) have been observed with SAM forboth gcRMA and RMA summary indices, considering that there are only 16entries in our spiked-in gene list. eLNN and LIMMA have the actual FDRclosest to the desired FDR in general, though they have a relatively large35Table 2.2: Analysis of di erential expression with the HGU95A spike-indata.(a) gcRMAMethod FN FDRGG 2.42 0.22LNN 1.83 0.22eGG 1.58 0.28eLNN 5.83 0.09LIMMA 4.33 0SAM 11.25 0.05BRIDGE 3.6 0.06(b) RMAMethod FN FDRGG 2.42 0.28LNN 2.42 0.25eGG 2.25 0.2eLNN 3.25 0.15LIMMA 3.08 0.08SAM 12.58 0.23BRIDGE 2.33 0.17(c) MAS 5Method FN FDRGG 6.5 0.7LNN 5.42 0.84eGG 4.33 0.53eLNN 7.08 0.26LIMMA 5.58 0.27SAM 5.83 0.27BRIDGE 12.08 0(d) dChipMethod FN FDRGG 3.25 0.7LNN 3.58 0.74eGG 2.83 0.43eLNN 6.08 0.34LIMMA 4.83 0.3SAM 3 0.45BRIDGE 4.00 0.34The FDR is controlled at 0:1. The values of FN and FDR shown are the averagesacross the 12 comparisons.number of FN cases regarding MAS 5 and dChip summary indices. Theactual FDRs for EBarrays GG and LNN methods are too high comparedto the other methods, and our proposed extended versions have lowered therates by a wide margin while keeping relatively small FN rates.The HGU133A spike-in data have a set of 14 array groups, and therefore13 comparisons have been made. A total of 14 out of 65 spiked-in geneson average have been ltered from the analysis with dChip due to negativesummary indices. The relative performance of the six methods is similar tothat for the HGU95A data. It is worth mentioning that eGG is the onlymethod that can sustain the FN cases to a low number for all four types36Table 2.3: Analysis of di erential expression with the HGU133A spike-indata.(a) gcRMAMethod FN FDRGG 5.85 0.2LNN 5.92 0.2eGG 6.46 0.23eLNN 13.08 0.07LIMMA 10.38 0.08SAM 22.23 0.12BRIDGE 6.01 0.11(b) RMAMethod FN FDRGG 4.38 0.14LNN 4.46 0.13eGG 5.23 0.06eLNN 6.69 0.09LIMMA 6.15 0.03SAM 17.15 0.1BRIDGE 4.53 0.08(c) MAS 5Method FN FDRGG 15.77 0.89LNN 15.85 0.87eGG 9.23 0.59eLNN 15.77 0.23LIMMA 13.85 0.31SAM 13.77 0.28BRIDGE 18.46 0.25(d) dChipMethod FN FDRGG 9.31 0.48LNN 9.69 0.58eGG 6.69 0.44eLNN 11.31 0.3LIMMA 9.38 0.26SAM 5.08 0.28BRIDGE 6.92 0.51The FDR is controlled at 0:1. The values of FN and FDR shown are the averagesacross the 13 comparisons.of probe summary indices, though its FDR is higher than the desired one.SAM has considerably more FN cases than the other methods for gcRMAand RMA, while its FDR is close to the desired one. Similarly, eLNN andLIMMA exhibit good FDR performance but with better FN rates. Again,the FDRs for EBarrays GG and LNN methods are at quite a high level,while their extended versions (eGG and eLNN) have signi cantly reducedthe rates while keeping relatively small FN rates.372.4 Simulation Studies2.4.1 Data GenerationWe now use a series of simulations to study the performance of our empir-ical Bayes framework under di erent model speci cations compared to theoriginal EBarrays framework and the methods presented in Section 2.3.2.In order to do so, we generated data from the following models: EBarraysGG (a = 5;a0 = 0:8; = 15), EBarrays LNN (m = 5; 2 = 2; 1 = 0:25, 2 being the variance parameter of the prior of gx or gy), extended GG( = 2; = 1;a0 = 1; = 20) and extended LNN (m = 5;k = 12; = 2; =0:5). The values of the parameters are set in the proximity of the estimatesfrom the HIV-1 data. We xed the number of genes to 500, the number ofreplicates to three in each group and generated 100 datasets under each ofthe above models for two di erent values of p =f0:1;0:2g.2.4.2 ResultsThe seven methods mentioned in Section 2.3.2 are applied to each simu-lated dataset to make inference about di erential expression. Results aresummarized graphically in two ways: a plot of the actual FDR against thedesired FDR, and a plot of the number of FP against the number of FN.The curves show the average results across the 100 simulated datasets. Foreach dataset, results are collected by setting the cuto s for the posteriorprobabilities or p-values at di erent points in turn in detecting di erentialexpression.As expected, the EBarrays GG and LNN models perform quite poorlycompared to the eGG and eLNN models when the variance is not constantand clearly under estimate the FDR (Figures 2.2 and 2.3). On the otherhand, the eGG and eLNN models are comparable to EBarrays when thevariance is constant, showing that strength borrowing across genes is work-ing well (Figures 2.4 and 2.5). Finally, both GG and eGG (resp. LNN andeLNN) appear to perform relatively well under LNN and eLNN (resp. GGand eGG) model speci cations respectively. This con rms previous simula-38tion studies (Kendziorski et al., 2003).Overall, SAM is not performing very well and tend to under estimatethe FDR by a large amount. Meanwhile, LIMMA and BRIDGE consistentlyshow good performance for data generated from the four models, suggestingthat they are good candidates for identifying di erential expression under awide variety of settings.0.1 0.2 0.3 0.4 0.5 0.60.10.30.5p = 0.1Desired FDRActual FDR10 15 20 25 30 35 40020406080p = 0.1FNFP0.1 0.2 0.3 0.4 0.5 0.60.10.30.5p = 0.2Desired FDRActual FDR20 30 40 50 60 70 80020406080p = 0.2FNFPLegendGGLNNeGGeLNNLIMMASAMBRIDGE151617181920510203025 30 35 405102030Figure 2.2: Simulation results generated from the extended GG model.390.1 0.2 0.3 0.4 0.5 0.60.10.30.5p = 0.1Desired FDRActual FDR10 15 20 25 30 35 40020406080p = 0.1FNFP0.1 0.2 0.3 0.4 0.5 0.60.10.30.5p = 0.2Desired FDRActual FDR20 30 40 50 60 70 80020406080p = 0.2FNFPLegendGGLNNeGGeLNNLIMMASAMBRIDGE151617181920510203025 30 35 405102030Figure 2.3: Simulation results generated from the extended LNN model.400.1 0.2 0.3 0.4 0.5 0.60.10.30.5p = 0.1Desired FDRActual FDR10 15 20 25 30 35 40020406080p = 0.1FNFP0.1 0.2 0.3 0.4 0.5 0.60.10.30.5p = 0.2Desired FDRActual FDR20 30 40 50 60 70 80020406080p = 0.2FNFPLegendGGLNNeGGeLNNLIMMASAMBRIDGE151617181920510203025 30 35 405102030Figure 2.4: Simulation results generated from the EBarrays GG model.410.1 0.2 0.3 0.4 0.5 0.60.10.30.5p = 0.1Desired FDRActual FDR10 15 20 25 30 35 40020406080p = 0.1FNFP0.1 0.2 0.3 0.4 0.5 0.60.10.30.5p = 0.2Desired FDRActual FDR20 30 40 50 60 70 80020406080p = 0.2FNFPLegendGGLNNeGGeLNNLIMMASAMBRIDGE151617181920510203025 30 35 405102030Figure 2.5: Simulation results generated from the EBarrays LNN model.422.5 DiscussionWe have extended the EBarrays empirical Bayes framework for di erentialgene expression by releasing the constant coe cient of variation assump-tion, and introducing two algorithms that can be used for parameter esti-mation. Using both experimental and simulated data we have shown thatthe extended framework clearly improves the original framework. In addi-tion, it appears that the eLNN model performs better than the eGG one asshown with the spike-in data, and that it is comparable to BRIDGE, a morecomputational fully Bayesian approach. This is not the case for the orig-inal EBarrays framework, where the GG model generally performs better.This con rms previous ndings of Gottardo et al. (2006a) and suggests thatEBarrays GG is more robust to the model misspeci cation of a constantcoe cient of variation compared to the LNN formulation. However, whenthe EBarrays model formulations are extended and the constant coe cientof variation assumption is released, the LNN model seems more appropriate.In spite of the complications accompanying the model enhancementsrelative to the original EBarrays framework, the proposed methodology re-mains to be highly competitive in terms of processing time. In the analysiswith the HGU133A data of >20000 genes, it takes about ve minutes tocomplete the eGG or eLNN analysis of one comparison between the twoarray groups each with three replicates on the R platform.In this chapter, we have compared our approach with ve alternatives,but there are many other methods for detecting di erentially expressed geneswith gene expression data. We chose these ve because they are either ob-vious baseline methods or widely used; they are also representative of othermethods. More comparisons between statistical tests can be found in Cuiand Churchill (2003). Among explicit adjustments for multiple testing, weconsidered the FDR control method as it is interpretable under each method.For simplicity and ease of comparison, we assumed that we were in a sit-uation with only two conditions of interest. However, the methodology couldeasily be extended to the multiple condition case (Kendziorski et al., 2003) ormore complex ANOVA-type designs (Cui and Churchill, 2003; Smyth, 2004).43BibliographyA ymetrix Manual (2001). A ymetrix Microarray Suite User Guide Version5.0. Santa Clara, CA.Baldi, P. and Long, A. D. (2001). A Bayesian framework for the analysis ofmicroarray expression data: Regularized t-test and statistical inferencesof gene changes. Bioinformatics, 17(6):509{519.Bhowmick, D., Davison, A. C., Goldstein, D. R., and Ru eux, Y. (2006).A Laplace mixture model for identi cation of di erential expression inmicroarray experiments. Biostatistics, 7(4):630{641.Chen, Y., Dougherty, E. R., and Bittner, M. L. (1997). Ratio-based decisionsand the quantitative analysis of cDNA microarray images. Journal ofBiomedical Optics, 2(4):364{374.Cope, L. M., Irizarry, R. A., Ja ee, H. A., Wu, Z., and Speed, T. P. (2004).A benchmark for A ymetrix GeneChip expression measures. Bioinfor-matics, 20(3):323{331.Cui, X. and Churchill, G. A. (2003). Statistical tests for di erential expres-sion in cDNA microarray experiments. Genome Biology, 4(4):210.Cui, X., Hwang, J. T., Qiu, J., Blades, N. J., and Churchill, G. A. (2005).Improved statistical tests for di erential gene expression by shrinking vari-ance components estimates. Biostatistics, 6(1):59{75.Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). Maximum likeli-hood from incomplete data via the EM algorithm. Journal of the RoyalStatistical Society, Series B, 39:1{38.Dudoit, S., Yang, Y. H., Callow, M. J., and Speed, T. P. (2002). Statisticalmethods for identifying di erentially expressed genes in replicated cDNAmicroarray experiments. Statistica Sinica, 12(1):111{139.44Efron, B., Tibshirani, R., Storey, J. D., and Tusher, V. (2001). Empiri-cal Bayes analysis of a microarray experiment. Journal of the AmericanStatistical Association, 96(456):1151{1160.Gottardo, R., Pannucci, J. A., Kuske, C. R., and Brettin, T. (2003). Sta-tistical analysis of microarray data: a Bayesian approach. Biostatistics,4:597{620.Gottardo, R., Raftery, A. E., Yeung, K. Y., and Bumgarner, R. E. (2006a).Bayesian robust inference for di erential gene expression in microarrayswith multiple samples. Biometrics, 62(1):10{18.Gottardo, R., Raftery, A. E., Yeung, K. Y., and Bumgarner, R. E. (2006b).Quality control and robust estimation for cDNA microarrays with repli-cates. Journal of the American Statistical Association, 101(473):30{40.Hsieh, W. P., Chu, T. M., and Wol nger, R. D. (2003). Who are thosestrangers in the Latin square? In Johnson, K. F. and Lin, S. M., editors,Methods of Microarray Data Analysis III: Papers from CAMDA’02, pages199{208. Kluwer, Boston.Irizarry, R., Hobbs, B., Collin, F., Beazer-Barclay, Y., Antonellis, K., Scherf,U., and Speed, T. (2003a). Exploration, normalization, and summariesof high density oligonucleotide array probe level data. Biostatistics,4(2):249{264.Irizarry, R. A., Bolstad, B. M., Collin, F., Cope, L. M., Hobbs, B., andSpeed, T. P. (2003b). Summaries of A ymetrix GeneChip probe leveldata. Nucleic Acids Research, 31(4):e15.Kendziorski, C., Newton, M., Lan, H., and Gould, M. N. (2003). On para-metric empirical Bayes methods for comparing multiple groups using repli-cated gene expression pro les. Statistics in Medicine, 22:3899{3914.Li, C. and Wong, W. H. (2001). Model-based analysis of oligonucleotidearrays: expression index computation and outlier detection. Proceedings45of the National Academy of Sciences of the United States of America,98:31{36.L onnstedt, I. and Speed, T. P. (2002). Replicated microarray data. StatisticaSinica, 12:31{46.Newton, M., Noueiry, A., Sarkar, D., and Ahlquist, P. (2004). Detectingdi erential gene expression with a semiparametric hierarchical mixturemethod. Biostatistics, 5(2):155{176.Newton, M. C., Kendziorski, C. M., Richmond, C. S., Blattner, F. R., andTsui, K. W. (2001). On di erential variability of expression ratios: im-proving statistical inference about gene expression changes from microar-ray data. Journal of Computational Biology, 8:37{52.Schena, M., Shalon, D., Davis, R. W., and Brown, P. O. (1995). Quantita-tive monitoring of gene expression patterns with a complementary DNAmicroarray. Science, 270:467{470.She er, W., Upfal, E., Sedivy, J., and Noble, W. S. (2005). A learned com-parative expression measure for A ymetrix GeneChip DNA microarrays.Proceedings of IEEE Computational Systems Bioinformatics Conference,pages 144{154.Smyth, G. K. (2004). Linear models and empirical Bayes methods for as-sessing di erential expression in microarray experiments. Statistical Ap-plications in Genetics and Molecular Biology, 3:Article 3.Storey, J. D. (2003). The positive false discovery rate: a Bayesian interpre-tation and the q-value. Annals of Statistics, 31:2013{2035.Tusher, V., Tibshirani, R., and Chu, G. (2001). Signi cance analysis of mi-croarrays applied to the ionizing radiation response. Proceedings of the Na-tional Academy of Sciences of the United States of America, 98(9):5116{5121.46van’t Wout, A. B., Lehrman, G. K., Mikheeva, S. A., O’Kee e, G. C., Katze,M. G., Bumgarner, R. E., Geiss, G. K., and Mullins, J. I. (2003). Cellulargene expression upon human immunode ciency virus type 1 infection ofCD4+-T-cell lines. Journal of Virology, 77:1392{1402.Wu, Z., Irizarry, R. A., Gentleman, R., Martinez-Murillo, F., and Spencer,F. (2004). A model-based background adjustment for oligonucleotide ex-pression arrays. Journal of the American Statistical Association, 99:909{917.Yang, Y. H., Dudoit, S., Luu, P., Lin, D. M., Peng, V., Ngai, J., and Speed,T. P. (2002). Normalization for cDNA microarray data: a robust com-posite method addressing single and multiple slide systematic variation.Nucleic Acids Research, 30(4):e15.47Chapter 3Flexible Mixture Modelingvia the Multivariate tDistribution with theBox-Cox Transformation 3.1 IntroductionIn statistics, model-based clustering (Titterington et al., 1985; Ban eld andRaftery, 1993; McLachlan and Peel, 2000; Fraley and Raftery, 2002) is apopular unsupervised approach to look for homogeneous groups of observa-tions. The most commonly used model-based clustering approach is basedon nite normal mixture models, which has been shown to give good re-sults in various applied elds, for example, gene expression (Yeung et al.,2001; McLachlan et al., 2002; Pan et al., 2002), image analysis (Wehrenset al., 2004; Fraley et al., 2005; Li et al., 2005), medical diagnosis (Schroeteret al., 1998; Forbes et al., 2006) and astronomy (Kriessler and Beers, 1997;Mukherjee et al., 1998).However, normal mixture models rely heavily on the assumption thateach component follows a normal distribution symmetric in shape, whichis often unrealistic. A common remedy for the asymmetry issue is to lookfor transformations of the data that make the normality assumption morerealistic. Box and Cox (1964) discussed the power transformation in the A version of this chapter has been submitted for publication. Lo, K. and Gottardo,R. (2009). Flexible Mixture Modeling via the Multivariate t Distribution with the Box-CoxTransformation.48context of linear regression, which has also been applied to normal mixturemodels (Schork and Schork, 1988; Gutierrez et al., 1995).Another line of attempts to resolve the asymmetry observed in data is toenhance the exibility of the normal distribution by introducing skewness.Azzalini (1985) developed a class of univariate skew normal distributionswith the introduction of a shape parameter to account for the skewness,which had been put to use in a mixture modeling context by Lin et al.(2007b). A multivariate version of the skew normal distributions was rstproposed by Azzalini and Dalla Valle (1996), with various generalizationsor modi cations ensuing. One such modi cation was found in Sahu et al.(2003), who developed a new class of multivariate skew elliptically sym-metric distributions with applications to Bayesian regression models, andincluded the multivariate skew normal distribution as a special case. Asopposed to Azzalini and Dalla Valle’s (1996) formulation of the skew nor-mal distribution, the correlation structure in that of Sahu et al. (2003) isnot a ected by the introduction of skewness in the sense that independencebetween elements of a random vector is preserved irrespective of changesin the skewness parameters. The latter formulation was adopted by Lin(2009a), who introduced the multivariate skew normal mixture model anddescribed an ECM algorithm (Meng and Rubin, 1993) for maximum like-lihood estimation. However, the implementation of this methodology isextremely computationally intensive. A simpli ed version of Sahu et al.’s(2003) formulation has recently been suggested by Pyne et al. (2009), whoparameterized skewness in the form of a vector in place of a matrix. Asa result of this simpli cation, the computational complexity of parameterestimation has been reduced considerably.In addition to non-normality, there is also the problem of outlier iden-ti cation in mixture modeling. Outliers can have a signi cant e ect on theresulting clustering. For example, they will usually lead to overestimatingthe number of components in order to provide a good representation of thedata (Fraley and Raftery, 2002). If a more robust model is used, fewerclusters may su ce. Outliers can be handled in the model-based clusteringframework, by either replacing the normal distribution with a more robust49one (e.g., t; see Peel and McLachlan, 2000; McLachlan and Peel, 2000) oradding an extra component to accommodate the outliers (e.g., uniform; seeSchroeter et al., 1998).Transformation selection and outlier identi cation are two issues whichcan have heavy mutual in uence (Carroll, 1982; Atkinson, 1988). While astepwise approach in which transformation is preselected ahead of outlierdetection (or vice versa) may be considered, it is unlikely to tackle the prob-lem well in general, as the preselected transformation may be in uenced bythe presence of outliers. One possible means of handling the two issues si-multaneously is through the application of skew t distributions (Azzalini andCapitanio, 2003; Sahu et al., 2003) in mixture modeling. Such an attemptwas given by Lin et al. (2007a), who proposed a skew t mixture model basedon the formulation of Azzalini and Capitanio (2003), but it is con ned tothe univariate case. Not until recently has a multivariate version of the skewt mixture model come to light. Lin (2009b) and Pyne et al. (2009) adopteda similar approach to the case of skew normal in de ning the multivariateskew t distribution, thereby simplifying Sahu et al.’s (2003) formulation witha vector in place of a skewness matrix.In view of the aforementioned issues, we propose a uni ed frameworkbased on mixture models using a new class of skewed distributions, namely,the multivariate t distributions with the Box-Cox transformation, to handletransformation selection and outlier identi cation simultaneously. The t dis-tribution provides a robust mechanism against outliers with its heavier tailsrelative to the normal distribution (Lange et al., 1989). The Box-Cox trans-formation is a type of power transformation, which can bring skewed databack to symmetry, a property of both the normal and t distributions. Alongwith the introduction of the mixture model using this new class of distribu-tions, we also describe a convenient means of parameter estimation via theEM algorithm (Dempster et al., 1977). Whilst the proposed framework holdsa big appeal of being computationally much simpler than mixture modelingusing skew t distributions, it performs well in various scenarios comparedto a wealth of competing approaches, as shown in subsequent sections ofthis chapter. A simpli ed form of our proposed framework has been ap-50plied to ow cytometry, which shows a favorable performance in identifyingcell populations (Chapter 4). This chapter presents a comprehensive frame-work that substantially enriches that previous simpli ed version, includingthe selection of component-speci c transformations, and the estimation ofdata outlyingness. In addition, it focuses at the computational developmentof the proposed methodology, and includes a large-scale comparison withcompeting approaches such as those using the skew normal or t mixturedistributions.The structure of this chapter is as follows. In Section 3.2 we rst intro-duce the new class of skewed distributions, the multivariate t distributionswith the Box-Cox transformation. Then we introduce the mixture modelusing the proposed distributions, and present details including outlier iden-ti cation, density estimation and the selection of the number of components.In addition, we describe an EM algorithm (Dempster et al., 1977) to simul-taneously handle parameter estimation and transformation selection for ourproposed mixture model. In Section 3.3, the performance of the proposedframework is examined on real datasets and compared to a wealth of com-monly used approaches. Section 3.4 presents extensive simulation studies tofurther evaluate our proposed framework relative to the other approaches.Finally, in Section 3.5 we summarize and discuss our ndings.3.2 Methodology3.2.1 PreliminariesThe Multivariate t DistributionThe multivariate t distribution has found its use as a robust modeling toolin various elds of applied statistics like linear and non-linear regression,time series, and pedigree analysis; see Lange et al. (1989) and Kotz andNadarajah (2004) for examples. The t distribution is applied in place of thenormal distribution when the latter fails to o er long enough tails for theerror distribution. Formally, a random vector y of length p is said to followa p-dimensional multivariate t distribution with mean ( > 1), covariance51matrix ( 2) 1 ( > 2) and degrees of freedom if its density functionis given by’p(yj ; ; ) = ( +p2 )j j 1=2( )p=2 ( 2)f1 + (y )T 1(y )= g +p2: (3.1)The degrees of freedom may be viewed as a robustness tuning parameter,as it controls the fatness of the tails of the distribution. When !1, thet distribution approaches a p-dimensional multivariate normal distributionwith mean , covariance matrix , and density function p(yj ; ) = 1(2 )p=2j j1=2 exp 12(y )T 1(y ) : (3.2)An account of the development of the maximum likelihood estimation ofthe multivariate t distribution can be found in Liu and Rubin (1995), Liu(1997) and Peel and McLachlan (2000). The estimation involves the useof the EM algorithm or its variants including the ECM (Meng and Rubin,1993) and ECME (Liu and Rubin, 1994) algorithms. The crux of thesealgorithms constitutes the fact that we can parameterize a t distributionusing a normal-gamma compound distribution. The degrees of freedom may be jointly estimated along with other unknown parameters, or xed apriori when the sample size is small. In the latter case, the setting with = 4 has been found to provide good protection against outliers and workwell in many applications (see, for example, Lange et al., 1989; Stephens,2000).Box-Cox TransformationThe power transformation proposed by Box and Cox (1964) was originallyintroduced to make data with asymmetric distributions ful ll the normal-ity assumption in a regression model. The Box-Cox transformation of an52observation y is de ned as follows:y( ) =( y 1 6= 0logy = 0 (3.3)where is referred to as the transformation parameter. Note that thisfunction is de ned for positive values of y only. In view of the need to han-dle negative-valued data in some applications, we adopt a modi ed version(Bickel and Doksum, 1981) of the Box-Cox transformation which is alsode ned for negative values:y( ) = sgn(y)jyj 1 ; > 0: (3.4)There exist several modi ed versions of the Box-Cox transformationto handle negative-valued data, for example, the log-shift transformation,which was also proposed in Box and Cox’s (1964) paper for the originalBox-Cox transformation. The advantage of our choice given by Eq.(3.4) isthat, while continuity is maintained across the whole range of the data, itretains the simplicity of the form of the transformation without introducingadditional parameters; when all data are positive, it reduces to the originalversion.3.2.2 The Multivariate t Distribution with the Box-CoxTransformationIn this subsection, we propose a new class of distributions, namely, the mul-tivariate t distributions with the Box-Cox transformation (tBC), to handletransformation and to accommodate outliers simultaneously. Explicitly, arandom vector y of length p following such a distribution has a densityfunction speci ed by’p(y( )j ; ; ) jJp(y; )j; (3.5)53(a) λ = 0.56810121468101214(b) λ = 16810121468101214(c) λ = 26810121468101214(d) λ = 56810121468101214(e) λ = 06810121468101214(f) λ = −16810121468101214(g) λ = −26810121468101214(h) λ = −56810121468101214Figure 3.1: Contour plots revealing the shape of bivariate t distributionswith the Box-Cox transformation for di erent values of the transformationparameter. Each distribution has a mean of 10 and unit variance along eachdimension. The degrees of freedom parameter is xed at eight. The valuesof the transformation parameter range from 5 (extremely right-skewed)to 5 (extremely left-skewed).where jJp(y; )j = jy 11 y 12 y 1p j is the Jacobian induced by the Box-Cox transformation. Equivalently speaking, the random vector y follows amultivariate t distribution after being Box-Cox transformed. It is di cultto derive the exact mean and variance of the distribution in closed form.However, using rst-order Taylor series expansion, approximations for themean and covariance matrix can be derived. The mean can be approximatedby a vector of length p with the j-th element being sgn( j+1)j j+1j1= ,and the variance by =( 2) Dp( ; ) Dp( ; ), where Dp( ; ) is adiagonal matrix of order p with the j-th diagonal element being j j +1j1= 1. The various shapes that can be represented by the tBC are shownin Figure 3.1.Analogous to the case of the t distribution without transformation, thetBC approaches a multivariate normal distribution with the Box-Cox trans-54formation (NBC) when !1. In addition, this class of distributions alsoincludes the untransformed version of the multivariate t or normal distribu-tion. The untransformed t or normal distribution is recovered by setting in Eq.(3.5) to one, although there is a translation of one unit to the left ineach direction on the original scale (due to the term 1= in Eq.(3.4)).The exible class oftBC o ers robustness against both outliers and asym-metry observed in data. Comparatively, the t distribution alone is deemedrobust in the sense that it o ers a mechanism to accommodate outliers. Asnoted by Lange et al. (1989), however, the t distribution is not robust againstasymmetric error distributions. When asymmetry is observed, data trans-formation is desired for the sake of restoring symmetry, and subsequentlydrawing proper inferences. The introduction of the tBC is therefore in linewith Lange et al.’s notion.3.2.3 The Mixture Model of t Distributions with theBox-Cox TransformationThe ModelMaking use of the tBC introduced in the last subsection, we now de nea G-component mixture model in which each component is described bya tBC. Given data y, with independent p-dimensional observation vectorsyi;i = 1;:::;n, the likelihood for the tBC mixture model is given as follows:L( jy) =nYi=1GXg=1wg’p(y( g)i j g; g; g) jJp(yi; g)j;GXg=1wg = 1: (3.6)The mixing proportion wg is the probability that an observation belongs totheg-th component. Estimates of the unknown parameters = ( 1;:::; G)where g = (wg; g; g; g; g) can be obtained conveniently using an EMalgorithm described in the next subsection. Analogous to the case of tBC,the mixture distribution approaches that for an NBC mixture model with’p( j g; g; g) being replaced by p( j g; g) when g!1for all g. Also,the class of tBC mixture models includes the conventional, untransformed t55or normal mixture model, obtained by xing g = 1 for all g. Note that arestricted form of Eq.(3.6) has been previously applied to identify cell popu-lations in ow cytometry data, on setting a global transformation parameter = g and xing g = 4 for all g (Chapter 4).Maximum Likelihood EstimationIn this subsection we illustrate how transformation selection can be handledalong with parameter estimation simultaneously via an EM algorithm. As inthe algorithm for a t mixture model described in Peel and McLachlan (2000),we rst de ne two types of missing data to augment the set of complete data.One is the unobserved component membership zi = (zi1;:::;ziG) withzig =8<:1 if yi belongs to the g-th component0 otherwiseassociated with each observation yi. Each vector Zi follows independently amultinomial distribution with one trial and event properties w = (w1;:::;wG),denoted as Zi MG(1;w). Another type of missing data is the weight ui,coming from the normal-gamma compound parameterization for the t dis-tribution, such thatYijui;zig = 1 N( g; g=ui) (3.7)independently for i = 1;:::;n, and Ui Ga( g=2; g=2). The advantage ofwriting the model in this way is that, conditional upon the Ui’s, the samplingerrors are again normal but with di erent precisions, and estimation becomes56a weighted least squares problem. The complete-data log-likelihood becomeslc( jy;z;u) =nXi=1GXg=1zignloghwg p(y( g)i j g; g=ui) jJp(yi; g)ji+ log Ga(uij g=2; g=2)o=nXi=1GXg=1zignlogwg p2 log(2 ) 12 logj gj ui2 (y( g)i g)T 1g (y( g)i g) + ( g 1)pXj=1logjyijj+ g2 log g2 log g2 + g2 (logui ui) + p2 1 loguio;(3.8)where Ga( j ) is the density function of ui. The E-step of the EM algorithminvolves the computation of the conditional expectation of the complete-data log-likelihood E (lcjy). To facilitate this, we need to compute ~zig E (Zigjyi), ~uig E (Uijyi;zig = 1) and ~sig E (logUijyi;zig = 1):~zig wg’p(y( g)i j g; g; g) jJp(yi; g)jPGk=1wk’p(y( k)i j k; k; k) jJp(yi; k)j; (3.9)~uig g +p g + (y( g)i g)T g 1(y( g)i g)(3.10)and~sig log ~uig + g +p2 log g +p2 ; (3.11)where ( ) is the digamma function. Note that, if we assume a global trans-formation parameter , then Eq.(3.9) used to compute ~zig is slightly simpli- ed as~zig wg’p(y( )i j g; g; g)PGk=1wk’p(y( )i j k; k; k): (3.12)57As can be seen in the following, ~sig only appears in Eq.(3.18) or Eq.(3.19) forthe update of the degrees of freedom g. If we x g to some predeterminedvalue, then ~sig is not needed and Eq.(3.11) can be omitted. Upon plugging~zig, ~uig and ~sig into Eq.(3.8) for zig, ui and logui respectively, we obtain theconditional expectation of the complete-data log-likelihood.In the M-step, we update the parameter estimates with values whichmaximize the conditional expectation of the complete-data log-likelihood.The mixing proportions are updated with the following formula:^wg ngn ; (3.13)where ng Pi~zig. The estimation of g and g needs to be consideredalong with the transformation parameter g of the Box-Cox transforma-tion. Closed-form solutions for g and g are available conditional on gas follows,^ g =Pni=1 ~zig~uigy( g)iPni=1 ~zig~uig= h1( g); (3.14)^ g =Pni=1 ~zig~uig(y( g)i ^ g)(y( g)i ^ g)Tng = h2( g): (3.15)No closed-form solution is available for g, but on substituting ^ g = h1( g)and ^ g = h2( g) into the conditional expectation of the complete-datalog-likelihood for g and g respectively, the problem reduces to a one-dimensional search of g. Explicitly, the optimization is recast as a one-dimensional root- nding problem of the equation @E (lcjy)=@ g = 0, inwhich@E (lcjy)@ g =@@ gnXi=1~zig( ~uig2h(y( g)i g)T 1g (y( g)i g)i+ ( g 1)pXj=1logjyijj)=nXi=1h ~zig~uig(y( g)i g)T 1gi@y( g)i@ g +nXi=1~zigpXj=1logjyijj(3.16)58where @y( g)i =@ g is a vector of length p whose j-th element is 2g sgn(yij)jyijj g( g logjyijj 1) + 1 , and g and g are replaced with^ g = h1( g) and ^ g = h2( g) respectively. The equation may be solvednumerically using, for example, Brent’s (1973) algorithm. If we assume aglobal transformation parameter instead, the left hand side of the equationto consider is slightly modi ed from Eq.(3.16) as@E (lcjy)@ =nXi=18<:GXg=1h ~zig~uig(y( )i g)T 1gi9=;@y( )i@ +nXi=1pXj=1logjyijj:(3.17)Once a numerical estimate of g has been obtained, we substitute it backinto Eqs.(3.14){(3.15) to update g and g respectively.To complete the M-step, we need to update the estimate of the degreesof freedom g, unless it is xed a priori. From Eq.(3.8), we see that thereare no overlaps between terms involving ( g; g; g) and those involving g.Hence, the incorporation of the Box-Cox transformation does not complicatethe estimation of g. Again, since there is no closed-form solution availablefor g, we turn it into a one-dimensional root- nding problem by consideringthe equation @E (lcjy)=@ g = 0, in which@E (lcjy)@ g =@@ gnXi=1~zign g2 log g2 log g2 + g2 (~sig ~uig)o/ngnlog g2 + 1 g2 o+nXi=1~zig(~sig ~uig): (3.18)If we assume a global degrees of freedom = g for all g, the derivative@E (lcjy)=@ is given by@E (lcjy)@ /nnlog 2 + 1 2 o+nXi=1GXg=1~zig(~sig ~uig): (3.19)Alternatively, to improve the convergence, we may exploit the advantageof the ECME algorithm (Liu and Rubin, 1994) and switch to update by59optimizing the constrained actual log-likelihood function:^ arg max 8<:nXi=1log0@GXg=1wg’p(y( g)i j g; g; ) jJp(yi; g)j1A9=;: (3.20)Apart from an intuitive sense that a faster convergence is expected on dis-regarding the information of the parameter estimates obtained from theprevious iteration (which is carried over by the conditional expectation ofthe complete-data log-likelihood otherwise) as well as considering the actuallikelihood instead of its approximation, it also saves a little computationalburden by circumventing the computation of ~sig.The EM algorithm alternates between the E and M-steps until conver-gence. The quantity ~zig may be interpreted as the posterior probability thatobservation yi belongs to the g-th component. The maximum a posterioricon guration results from assigning each observation to the component as-sociated with the largest ~zig value. The uncertainty corresponding to eachassignment may be conveniently quanti ed as 1 maxg ~zig (Bensmail et al.,1997).Outlier Identi cationJust like the case of ~zig, the introduction of ~uig does not only facilitate theimplementation of the EM algorithm, but also aids in the interpretation ofthe nal estimated model. As seen from Eqs.(3.14){(3.15), ~uig serves as theweight in the weighted least squares estimation of g and g. It holds a neg-ative relationship with the Mahalanobis distance (y( g)i g)T g 1(y( g)i g) between yi and g on the transformed scale, as given by Eq.(3.10).Hence, a small value of ~uig would suggest that the corresponding obser-vation is an outlier, and diminish its in uence on the estimation of theparameters. In contrast, in the absence of such a mechanism, a normal mix-ture model is not robust against outliers, as the constraint Pg~zig = 1 forall i restricts all observations to make equal contributions overall towardsparameter estimation.Exploiting such a mechanism, we may conveniently set up a rule of calling60an observation with the associated ~uig value smaller than a threshold, say,0:5, an outlier. Such a threshold may be selected on a theoretical basis byconsidering the one-to-one correspondence between ~uig and the Mahalanobisdistance which follows some standard, known distribution. On noting that(y( g)i g)T g 1(y( g)i g)=p F(p; g); (3.21)where y( g)i follows ap-dimensionaltdistribution with parameters ( g; g; g)and F( ) denotes an F distribution, a threshold c for ~uig may be determinedby considering the desired threshold quantile level of the distributionstated in Eq.(3.21):c = g +p g +pF1 (p; g); (3.22)where F1 ( ) denotes the quantile of the F distribution such that Pr(F F1 ) = 1 . For instance, if g = 4;p = 5, and the desired thresholdquantile level is = 0:9, then the corresponding threshold for ~uig is c =0:37 given the 0:9 quantile F0:1(5;4) = 4:051. Any observation with theassociated ~uig < 0:37 will be deemed an outlier.From Eq.(3.10), we can also see how the degrees of freedom g partici-pates in robustifying the parameter estimation process. A smaller value of g tends to downweight outliers to a greater extent, while a large enoughvalue tends to regress all weights to one, approaching the case of the NBCmodel. In addition, the upper bound of ~uig o ers a guide for setting thedegrees of freedom = g for all g, if it is preferred to be xed in advance.The weight ~uig takes a positive value on (0;1 + p= g), and for a moderate-valued g, its mean is around one. To avoid a point in the vicinity of thecentral location of a mixture component from imposing excessive in uenceon the estimation of parameters, we may set accordingly such that theratio p= is maintained at an appropriate level, for example, one to 1:5.Density EstimationOne advantage of mixture modeling based on the normal distribution is thatthe marginal distribution for any subset of the dimensions is also normally61distributed with the mean and covariance matrix extracted from the con-formable dimensions (Johnson and Wichern, 2002). This favorable propertyis also observed in the multivariate t distiribution (Liu and Rubin, 1995;Kotz and Nadarajah, 2004), making the estimation of the marginal densityfor any dimensions available at a very low computational cost. Consider thepartition Y = (Y1;Y2) as an example. If Y comes from a multivariate tdistribution with degrees of freedom and with mean and covariance matrixconformably partitioned as = ( 1; 2) and 2 = 2 11 12 21 22!respectively, then its subset Y1 will follow a t distribution with mean 1,covariance matrix =( 2) 11 and the same degrees of freedom. Thisnice property is easily extended to a t mixture model with more than onecomponent, and, in addition, preserved in our proposed tBC mixture model.One can easily derive the marginal density by extracting the conformablepartitions from the means, covariance matrices and the Jacobian. The 90thpercentile region of the mixture components shown in Figure 3.2 is producedby these means.Selecting the Number of ComponentsWhen the number of mixture components is unknown, we apply the BayesianInformation Criterion (BIC) (Schwarz, 1978) to guide the selection. The BICprovides a convenient approximation to the integrated likelihood of a modeland, in the context of mixture models, is de ned asBICG = 2 log ~LG KG logn; (3.23)where ~LG is the likelihood value of Eq.(3.6) evaluated at the maximumlikelihood estimates of , and KG is the number of independent parametersfor a G-component mixture model. The BIC would then be computed for arange of possible values for G and the one with the largest BIC (or relatively6210 15 2068101214161820(a) t + Box−CoxFrontal lobe sizeRear widtha71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71Blue maleBlue femaleOrange maleOrange female10 15 2068101214161820(b) tFrontal lobe sizeRear widtha71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71Blue maleBlue femaleOrange maleOrange female10 15 2068101214161820(c) Normal + Box−CoxFrontal lobe sizeRear widtha71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71Blue maleBlue femaleOrange maleOrange female10 15 2068101214161820(d) NormalFrontal lobe sizeRear widtha71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71Blue maleBlue femaleOrange maleOrange female10 15 2068101214161820(e) Skew tFrontal lobe sizeRear widtha71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71Blue maleBlue femaleOrange maleOrange female10 15 2068101214161820(f) Skew NormalFrontal lobe sizeRear widtha71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71Blue maleBlue femaleOrange maleOrange femaleFigure 3.2: Scatterplots revealing the assignment of observations for di erentmodels applied to the crabs dataset, projected onto the dimensions of thefrontal lobe size and the width of the rear region of the carapace. The solidlines represent the 90th percentile region of the components in the mixturemodels. The line colors match with the group they are labeled, determinedin a way such that the lowest misclassi cation rate is derived. Misclassi edobservations are drawn in red, overriding the original colors used to revealtheir true group memberships. 63close to it) would be selected. Often, the BIC is applied in line with theprinciple of parsimony, by which we favor a simpler model if it does notincur a downgrade of the modeling performance. Suppose there are twotBC mixture models with G1 and G2 components respectively such thatG1 < G2. Under the notion of this principle, we would prefer the simplermodel, i.e., the one with G1 components, unless a very strong evidence ofimproved performance signi ed by an increase of >10 (Kass and Raftery,1995; Fraley and Raftery, 2002) is observed from BICG1 over BICG2.3.3 Application to Real Data3.3.1 Data DescriptionTo illustrate our methodology we use the following two real datasets.The bankruptcy datasetThis dataset was obtained from a study which conducted nancial ratio anal-ysis to predict corporate bankruptcy (Altman, 1968). The sample consistsof 66 manufacturing rms in the United States, of which 33 are bankruptand the other 33 solvent. The data collected include the ratio of retainedearnings (RE) to total assets, and the ratio of earnings before interest andtaxes (EBIT) to total assets. They were derived from nancial statementsreleased two years prior to bankruptcy, and statements from the solvent rms during the same period.The crabs datasetMeasurements in this dataset were collected from a study of rock crabs ofgenus Leptograpsus (Campbell and Mahon, 1974). The sample is composedof 50 crabs for each combination of species (blue and orange color forms) andsex (male and female), resulting in a total of 200 observations. There are ve morphological measurements, namely, the frontal lobe size, the width ofthe rear region of the carapace, the length of the carapace along the midline,64the maximum width of the carapace, and the depth of the body, for eachcrab.3.3.2 ResultsWe compare the performance of six mixture modeling approaches using dif-ferent mixture distributions, namely, t with the Box-Cox transformation(tBC), t, normal with the Box-Cox transformation (NBC), normal, skewt, and skew normal. Since all observations in the two datasets come withknown labels, we can assess and compare the models based on the followingtwo criteria: misclassi cation rates and the number of components selected.Classi cationWe t the two datasets using the six aforementioned models in turn, on xingthe number of mixture components at the known values, i.e., two for thebankruptcy dataset and four for the crabs dataset. The same initializationstrategy is applied to the EM algorithm for all the models. Each time,10 random partitions are generated, each of which is followed by a fewEM runs. The one delivering the highest likelihood value is taken as theinitial con guration for the eventual EM algorithm. At convergence of theEM algorithm, misclassi cation rates, i.e., the proportions of observationsassigned to the incorrect group, are computed. Each misclassi cation rateis determined as the minimum considering all permutations of the labels ofthe components.Table 3.1 shows the misclassi cation rates for the di erent models. Ascan be seen, for the bankruptcy dataset, the tBC and NBC mixture modelsdeliver misclassi cation rates (15:2% and 16:7% respectively) lower than theother methods by a large margin. By taking a graphical inspection of theresults, we nd that the poor classi cation performance of the other fourmethods is due to the inability to resolve the shape of the two groups ofobservations properly (Figures 3.3(b,d{f)). The challenge likely arises fromthe scattered group of bankrupt rms, with its most concentrated regionlocated at the upper right corner and in close proximity to the dense group65Table 3.1: Misclassi cation rates for di erent models applied to thebankruptcy and crabs datasets.Model Bankruptcy CrabstBC 0.152 (10) 0.070 (14)t 0.273 (18) 0.075 (15)NBC 0.167 (11) 0.345 (69)Normal 0.318 (21) 0.290 (58)Skew t 0.303 (20) 0.085 (17)Skew Normal 0.394 (26) 0.175 (35)The best results are shown in bold. The numbers of misclassi ed cases are givenwithin parentheses.of solvent rms. The sensitivity of normal mixture models to outliers isclearly demonstrated in this example: the obvious outlier at the bottom ofthe scatterplot leads to an excessively sparse component representing thebankrupt group. Consequently, most observations in the bankrupt grouphave been absorbed by the compact component representing the solventgroup. The shapes of the components in the t, skew t and skew normalmixture models are not all the same, but it appears that for all of themthe scattered group of bankrupt rms are split into two components withone absorbing a concentration extending to the left and the other to thebottom. In contrast, both the tBC and NBC mixture models provide a nicerepresentation of both groups of observations (Figures 3.3(a,c)). The groupof bankrupt rms is resolved quite successfully upon a proper transformation(^ 0:5 for both models) of the observations.As another means of performance assessment, we look into the locationof the misclassi ed observations in a plot of the ordered uncertainties (Fig-ure 3.4). On observing that the misclassi ed observations have spread overthe entire range of the uncertainties, it suggests that the t, skew t and skewnormal mixture models simply provide an incorrect representation of thetwo groups (Figures 3.4(b,e,f)). The quality of the t using the tBC andNBC mixture models respectively is con rmed by the corresponding uncer-tainty plots (Figures 3.4(a,c)). We can see that the observations associatedwith high uncertainties are also the ones most likely to be misclassi ed.66−300−200−1000−250−150−500(a) t + Box−CoxRE ratioEBIT ratioa71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71 a71a71a71a71a71a71a71a71a71a71BankruptSolvent−300−200−1000−250−150−500(b) tRE ratioEBIT ratioa71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71 a71a71a71a71a71a71a71a71a71a71BankruptSolvent−300−200−1000−250−150−500(c) Normal + Box−CoxRE ratioEBIT ratioa71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71 a71a71a71a71a71a71a71a71a71a71BankruptSolvent−300−200−1000−250−150−500(d) NormalRE ratioEBIT ratioa71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71 a71a71a71a71a71a71a71a71a71a71BankruptSolvent−300−200−1000−250−150−500(e) Skew tRE ratioEBIT ratioa71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71 a71a71a71a71a71a71a71a71a71a71BankruptSolvent−300−200−1000−250−150−500(f) Skew NormalRE ratioEBIT ratioa71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71 a71a71a71a71a71a71a71a71a71a71BankruptSolventFigure 3.3: Scatterplots revealing the assignment of observations for di er-ent models applied to the bankruptcy dataset. The black solid lines repre-sent the 90th percentile region of the components in the mixture models.Misclassi ed observations are drawn in red.670 1020304050600.00.10.20.30.40.5(a) t + Box−CoxIndexUncertainty0 1020304050600.00.10.20.30.40.5(b) tIndexUncertainty0 1020304050600.00.10.20.30.40.5(c) Normal + Box−CoxIndexUncertainty0 1020304050600.00.10.20.30.40.5(d) NormalIndexUncertainty0 1020304050600.00.10.20.30.40.5(e) Skew tIndexUncertainty0 1020304050600.00.10.20.30.40.5(f) Skew NormalIndexUncertaintyFigure 3.4: Plots revealing the location of misclassi ed observations relativeto the ordered uncertainties of all observations for di erent models appliedto the bankruptcy dataset. Locations of the misclassi ed observations aremarked with red vertical lines.68The results on the crabs dataset once again show that the tBC mixturemodel delivers the best performance in terms of misclassi cation rate (7%).It is followed closely by the t (7:5%) and skew t (8:5%) mixture models. Fig-ure 3.2 shows a scatterplot of the crabs dataset projected onto the rst twodimensions, namely, the frontal lobe size and the width of the rear regionof the carapace. However, unlike the case for the bankruptcy dataset withonly two dimensions, a visually clear discrimination of the four groups in thecrabs dataset cannot be achieved by projecting the observations onto anytwo out of the ve dimensions. Therefore, we opt for displaying the crabsdataset on its second versus third principal components which provides agood visually discriminating e ect. Figures 3.5(a,b) suggest that those fewmisclassi ed observations in the tBC and t mixture models are all likely inthe overlapping region of neighboring groups, justifying that these modelsprovide a good representation of all the four groups in the dataset. This isfurther con rmed by a check on the uncertainty plots, in which the misclas-si ed observations are also among the ones with the highest uncertainties(Figures 3.6(a,b)). Meanwhile, from Figures 3.5(c,d,f) we nd that, for thepoorly performed NBC, normal and skew normal mixture models, misclas-si ed cases are concentrated on one or two of the groups. Figures 3.2(c,d,f)reveal that these models incorrectly split the assignment of the observationsfrom those groups into other components. As expected, these three poorlyperforming normal-based models have misclassi ed observations spreadingover the entire range in the uncertainty plots (Figures 3.6(c,d,f)).Selecting the Number of ComponentsTo facilitate this part of analysis, when we apply the aforementioned mod-els, we t the data and compute the BIC once for each choice of the numberof mixture components G = 1;2;:::;M, where M = 6 for the bankruptcydataset and M = 8 for the crabs dataset. These values are chosen for M be-cause they are well above the true number of groups (two for the bankruptcydataset and four for the crabs dataset) such that little change in the result isexpected when we further increase M; numerical problems may arise when69a71a71a71a71a71a71a71a71 a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71 a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71−0.50.0 0.5−0.6−0.4−0.20.00.20.4(a) t + Box−CoxComp.2Comp.3a71Blue maleBlue femaleOrange maleOrange femalea71a71a71a71a71a71a71a71 a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71 a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71−0.50.0 0.5−0.6−0.4−0.20.00.20.4(b) tComp.2Comp.3a71Blue maleBlue femaleOrange maleOrange femalea71a71a71a71a71a71a71a71 a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71 a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71−0.50.0 0.5−0.6−0.4−0.20.00.20.4(c) Normal + Box−CoxComp.2Comp.3a71Blue maleBlue femaleOrange maleOrange femalea71a71a71a71a71a71a71a71 a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71 a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71−0.50.0 0.5−0.6−0.4−0.20.00.20.4(d) NormalComp.2Comp.3a71Blue maleBlue femaleOrange maleOrange femalea71a71a71a71a71a71a71a71 a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71 a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71−0.50.0 0.5−0.6−0.4−0.20.00.20.4(e) Skew tComp.2Comp.3a71Blue maleBlue femaleOrange maleOrange femalea71a71a71a71a71a71a71a71 a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71 a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71−0.50.0 0.5−0.6−0.4−0.20.00.20.4(f) Skew NormalComp.2Comp.3a71Blue maleBlue femaleOrange maleOrange femaleFigure 3.5: Plots revealing the assignment of observations for di erent mod-els applied to the crabs dataset, displayed via the second and third princi-pal components. Misclassi ed observations are drawn in red, overriding theoriginal colors used to reveal their true group memberships.700 50 1001502000.00.10.20.30.40.5(a) t + Box−CoxIndexUncertainty0 50 1001502000.00.10.20.30.40.5(b) tIndexUncertainty0 50 1001502000.00.10.20.30.40.5(c) Normal + Box−CoxIndexUncertainty0 50 1001502000.00.10.20.30.40.5(d) NormalIndexUncertainty0 50 1001502000.00.10.20.30.40.5(e) Skew tIndexUncertainty0 50 1001502000.00.10.20.30.40.5(f) Skew NormalIndexUncertaintyFigure 3.6: Plots revealing the location of misclassi ed observations relativeto the ordered uncertainties of all observations for di erent models appliedto the crabs dataset. Locations of the misclassi ed observations are markedwith red vertical lines.71111111 2 3 4 5−1440−1400−1360−1320(a) BankruptcyNo. of clustersBIC2222233333444445555566666123456t + Box−CoxtNormal + Box−CoxNormalSkew tSkew Normal111111111 2 3 4 5 6 7 8−3300−3200−3100−3000−2900(b) CrabsNo. of clustersBIC2222222233333333444444445555555566666666123456t + Box−CoxtNormal + Box−CoxNormalSkew tSkew NormalFigure 3.7: Plots of BIC against the number of components for the di erentmodels applied to the bankruptcy and crabs datasets.M is too large, moreover. From the BIC curves shown in Figure 3.7, weobserve that one single peak is observed for each modeling choice over therange of the number of components attempted. The number of componentsat which a peak is observed is deemed optimal by the BIC for the respec-tive model. The BIC has selected the correct number of components (two)for all the mixture models except normal when applied to the bankruptcydataset (Table 3.2). As to the crabs dataset in which the separation of thegroups is less clear-cut, it poses a challenge of selecting the right numberof components to most models. Only the tBC and t mixture models haveresolved the correct number of components (four) guided by the BIC. Thisresult further con rms with what we have observed in the last subsectionthat the four-component mixture model using the tBC or t mixture modelprovides the best representation of the data out of all candidates.3.4 Simulation StudiesWe have conducted a series of simulations to further evaluate the relativeperformance of our proposed framework to the other approaches presented72Table 3.2: The number of components selected by the BIC for di erentmodels applied to the bankruptcy and crabs datasets.Model Bankruptcy CrabstBC 2 4t 2 4NBC 2 3Normal 3 3Skew t 2 2Skew Normal 2 3The best results are shown in bold.in Section 3.3.2. The di erent approaches are evaluated for their sensitivityagainst model misspeci cation, using the following two criteria: the accuracyin the assignment of observations, and the accuracy in selecting the numberof components.3.4.1 Data GenerationTo facilitate the comparison, we generate data from the following mixturemodels: tBC, skew t, t and normal. To assess the accuracy in the assign-ment of observations, two settings of parameter values have been adopted:one taken from the estimates obtained when applying each of the aforemen-tioned models to the bankruptcy dataset, and the other one from the crabsdataset, with the number of components set as the respective known values.As a result, each dataset generated from the bankruptcy setting consistsof two components and two dimensions, while that from the crabs settinghas four components and ve dimensions. For datasets generated under thebankruptcy setting we x the number of observations at 200, while it is setas 500 for the crabs setting. 100 datasets are generated from each of theaforementioned models under each setting. To study the accuracy in select-ing the number of components, we focus at the crabs setting. Pertainingto this criterion, the crabs setting o ers a better platform to discriminatethe relative performance of the di erent approaches for its larger numberof groups and higher dimensions. 1000 observations are generated from the73Table 3.3: Average misclassi cation rates for di erent models applied todatasets generated under the bankruptcy or crabs setting.Model used to t datatBC t NBC Normal Skew t Skew Norm.Model used to tBC 0.075 0.124 0.075 0.126 0.142 0.110generate data Skew t 0.100 0.094 0.225 0.189 0.087 0.191under the bank- t 0.109 0.109 0.126 0.134 0.114 0.144ruptcy setting Normal 0.032 0.032 0.033 0.030 0.032 0.031Model used to tBC 0.011 0.014 0.057 0.074 0.015 0.016generate data Skew t 0.024 0.023 0.046 0.060 0.020 0.021under the t 0.024 0.021 0.048 0.070 0.023 0.023crabs setting Normal 0.027 0.029 0.042 0.038 0.028 0.028The best results are shown in bold.crabs setting instead to avoid numerical problems that may arise from smallcomponents formed when the number of components is signi cantly largerthan the true number.3.4.2 ResultsClassi cationWe apply the six approaches presented in Section 3.3.2 in turn to eachgenerated dataset. Model tting is done by presuming that the numberof components is known, i.e., two for the bankruptcy setting and four forthe crabs setting. Similar to the way we determined the misclassi cationrates in our real data analysis, we consider all permutations of the labelsof the components and take the lowest one out of all misclassi cation ratescomputed. The performance of the di erent models is compared via theaverage misclassi cation rates.As shown clearly in Table 3.3, our proposed tBC mixture model is theonly model that remains the best or close to the best in all the compar-isons made. It delivers the lowest misclassi cation rates under both settings(7:5% and 1:1% respectively) when data are generated from the tBC mix-ture model. The exibility of the tBC mixture model is exhibited whenwe look into its performance in the scenario of model misspeci cation. It74remains close to the respective true model in those cases, and even de-livers the lowest misclassi cation rate when the true model is t under thebankruptcy setting (10:9%), or normal under the crabs setting (2:7%). Con-trariwise, when data are generated from the tBC mixture model, with a lackof mechanisms to handle asymmetric components, both the t and normalmixture models do not perform well. It is worth noting that even the skewt mixture model, which is intended for data departing from symmetry, alsoperforms poorly; the associated misclassi cation rate is as high as 14:2%under the bankruptcy setting, while that for tBC is only 7:5%. When dataare generated from the skew t mixture model, taking advantage of the cor-rect speci cation the skew t mixture model performs well. The tBC mixturemodel also shows a competent performance, however. Meanwhile, the skewt mixture model performs satisfactorily when the true mixture model is tor normal. The normal mixture model cannot match the others at all whendata are generated from models other than normal, showing its vulnerabil-ity to outliers and asymmetric components. In addition, it is interesting tonotice that the normal mixture model gives a rather high misclassi cationrate (3:8%) relative to the levels attained by tBC, t and skew t (2:7%{2:9%)when it itself is the true model for data generation under the crabs setting.It seems that the t-based mixture models are more robust to initializationof the EM algorithm.Selecting the Number of ComponentsIn this part of study, each time when we apply a model to a dataset generatedunder the crabs setting, we set the number of components from one up toeight in turn. The number of components is then selected to be the onewhich delivers the highest BIC. Table 3.4 summarizes the result and givesthe 90% coverage intervals of the number of components selected for eachmodel out of the 100 repetitions.The tBC mixture model selects the correct number of components (four)in the majority of repetitions, even in case of model misspeci cation. It is theonly model that remains to contain only the true number of components in75Table 3.4: 90% coverage intervals of the number of components selected bythe BIC for di erent models applied to datasets generated under the crabssetting.Model used to t datatBC t NBC Normal Skew t Skew Norm.Model used tBC (4, 4) (4, 4) (4, 4) (4, 4) (3, 5) (3, 5)to generate Skew t (4, 4) (4, 4) (4, 5) (4, 5) (4, 4) (4, 4)data t (4, 4) (4, 4) (4, 5) (4, 5) (4, 4) (4, 4)Normal (4, 4) (4, 5) (4, 4) (4, 5) (4, 4) (4, 4)The best results are shown in bold.all the 90% coverage intervals. On the other hand, both the skew t and skewnormal mixture models fail to distinguish the four groups properly in about30% of the datasets generated from the tBC mixture model. Besides, boththe NBC and normal mixture models, when applied to datasets generatedfrom the t or skew t mixture model, tend to require an additional componentto accommodate the data in an excess of outliers.3.5 DiscussionIn this chapter, we have introduced a new class of distributions, the t distri-butions with the Box-Cox transformation, for mixture modeling. The pro-posed methodology is in line with Lange et al.’s (1989) notion that trans-formation selection and outlier identi cation are two issues of mutual in- uence and therefore should be handled simultaneously. In our real dataapplications and simulation studies, we have shown the exibility of thismethodology in accommodating asymmetric components in the presence ofoutliers, and in coping with model misspeci cation. The vulnerability ofthe normal-based models to outliers is exposed in the analysis of the crabsdataset, in which the presence of outliers prevents a clear distinction of thefour groups. A lack of mechanisms to downsize the in uence of remote ob-servations undermines the ability of these approaches to properly locate thecores of the four groups in the dataset. On the other hand, the analysisof the bankruptcy dataset provides a very good example of demonstrating76the importance of incorporating data transformation in clustering. In theabsence of a means to accommodate components departing from symmetry,the t mixture model fails to provide a reasonable representation of the data,while the number of groups is known in advance. Our simulation studieshave con rmed these ndings.As mentioned in the Introduction, although mixture modeling using ourproposed tBC distributions and that using the skew t distributions followtwo lines of development with more or less the same aim, our approach hasan appeal of being computationally much simpler to implement. As noted inLin (2009b), di culties have been encountered in evaluating the conditionalexpectation of the complete-data log-likelihood in the E-step of the EM al-gorithm for the skew t mixture model. The objective function cannot bederived in closed form due to the presence of analytically intractable quan-tities. Numerical techniques for optimization as well as integration needto be employed extensively to update a vast amount of quantities in boththe E and M-steps of the algorithm, undermining the computational sta-bility therein. Besides, the parameterization that accounts for skewness inour proposed model originates from the family of power transformations,which is intuitively interpretable. It is less trivial to interpret the skewnessvector parameterized in the skew t distribution, however. In addition, aspresented in Section 3.2.3, the way to identify outliers using our approachis straightforward and on a theoretical ground. Exploiting the relationshipbetween ~uig and the quantile of an F distribution through Eq.(3.22), it isalmost costless to proceed with outlier identi cation once the EM algorithmis completed. On the contrary, when the skew t mixture model is used,we cannot determine such a threshold by recasting it as a known quantityobtained from a standard distribution. In consequence, it demands extracomputational e ort to identify outliers, especially when the dimension ofthe data is high. Finally, perhaps most importantly, as demonstrated fromour real data applications and simulation studies, the simplicity of the com-putational implementation of our proposed methodology is not achieved atthe expense of the quality of performance. The results have shown that ourproposed approach performs as well as that based on the skew t mixture77model, or even slightly better.An open-source software package that facilitates ow cytometry analysiswith the methodology proposed in this chapter has been developed andis available at Bioconductor (Gentleman et al., 2004); see Chapter 5 fordetails. It is released as an R package called flowClust and addresses thevast demand for software development from the ow cytometry community.flowClust is dedicated to the automated identi cation of cell populations,and is well integrated into other ow cytometry packages. Meanwhile, werecognize the potential of the proposed methodology in other elds, and theimportance of developing a general-purpose tool like MCLUST (Fraley andRaftery, 2002, 2006), the popular software that performs clustering analysisbased on normal mixture models. We are going to work on such a general-purpose, standalone software that will serve as a contribution to the generalpublic.78BibliographyAltman, E. I. (1968). Financial ratios, discriminant analysis and the predic-tion of corporate bankruptcy. Journal of Finance, 23(4):589{609.Atkinson, A. C. (1988). Transformations unmasked. Technometrics, 30:311{318.Azzalini, A. (1985). A class of distributions which includes the normal ones.Scandinavian Journal of Statistics, 12:171{178.Azzalini, A. and Capitanio, A. (2003). Distributions generated by pertur-bation of symmetry with emphasis on a multivariate skew t-distribution.Journal of the Royal Statistical Society, Series B, 65(2):367{389.Azzalini, A. and Dalla Valle, A. (1996). The multivarite skew-normal dis-tribution. Biometrika, 83(4):715{726.Ban eld, J. D. and Raftery, A. E. (1993). Model-based Gaussian and non-Gaussian clustering. Biometrics, 49:803{821.Bensmail, H., Celeux, G., Raftery, A. E., and Robert, C. P. (1997). Inferencein model-based cluster analysis. Statistics and Computing, 7:1{10.Bickel, P. J. and Doksum, K. A. (1981). An analysis of transformationsrevisited. Journal of the American Statistical Association, 76(374):296{311.Box, G. E. P. and Cox, D. R. (1964). An analysis of transformations. Journalof the Royal Statistical Society, Series B, 26:211{252.Brent, R. (1973). Algorithms for Minimization without Derivatives.Prentice-Hall, Englewood Cli s, NJ.Campbell, N. A. and Mahon, R. J. (1974). A multivariate study of variationin two species of rock crab of the genus Leptograpsus. Australian Journalof Zoology, 22(3):417{425.79Carroll, R. J. (1982). Prediction and power transformations when the choiceof power is restricted to a nite set. Journal of the American StatisticalAssociation, 77(380):908{915.Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). Maximum likeli-hood from incomplete data via the EM algorithm. Journal of the RoyalStatistical Society, Series B, 39:1{38.Forbes, F., Peyrard, N., Fraley, C., Georgian-Smith, D., Goldhaber, D. M.,and Raftery, A. E. (2006). Model-based region-of-interest selection indynamic breast MRI. Journal of Computer Assisted Tomography, 30:675{687.Fraley, C., Raftery, A., and Wehrens, R. (2005). Incremental model-basedclustering for large datasets with small clusters. Journal of Computationaland Graphical Statistics, 14(3):529{546.Fraley, C. and Raftery, A. E. (2002). Model-based clustering, discriminantanalysis, and density estimation. Journal of the American Statistical As-sociation, 97(458):611{631.Fraley, C. and Raftery, A. E. (2006). MCLUST version 3 for R: Normalmixture modeling and model-based clustering. Technical Report 504, Uni-versity of Washington, Department of Statistics.Gentleman, R. C., Carey, V. J., Bates, D. M., Bolstad, B., Dettling, M.,Dudoit, S., Ellis, B., Gautier, L., Ge, Y., Gentry, J., Hornik, K., Hothorn,T., Huber, W., Iacus, S., Irizarry, R., Leisch, F., Li, C., Maechler, M.,Rossini, A. J., Sawitzki, G., Smith, C., Smyth, G., Tierney, L., Yang, J.Y. H., and Zhang, J. (2004). Bioconductor: open software development forcomputational biology and bioinformatics. Genome Biology, 5(10):R80.Gutierrez, R. G., Carroll, R. J., Wang, N., Lee, G.-H., and Taylor, B. H.(1995). Analysis of tomato root initiation using a normal mixture distri-bution. Biometrics, 51:1461{1468.80Johnson, R. A. and Wichern, D. W. (2002). Applied Multivariate StatisticalAnalysis. Prentice Hall, Upper Saddle River, NJ.Kass, R. E. and Raftery, A. E. (1995). Bayes factors. Journal of the Amer-ican Statistical Association, 90(430):773{795.Kotz, S. and Nadarajah, S. (2004). Multivariate t Distributions and TheirApplications. Cambridge University Press, Cambridge.Kriessler, J. R. and Beers, T. C. (1997). Substructure in galaxy clusters: atwo-dimensional approach. The Astronomical Journal, 113:80{100.Lange, K. L., Little, R. J. A., and Taylor, J. M. G. (1989). Robust statisticalmodeling using the t-distribution. Journal of the American StatisticalAssociation, 84:881{896.Li, Q., Fraley, C., Bumgarner, R. E., Yeung, K. Y., and Raftery, A. E.(2005). Donuts, scratches and blanks: Robust model-based segmentationof microarray images. Bioinformatics, 21(12):2875{2882.Lin, T. I. (2009a). Maximum likelihood estimation for multivariate skewnormal mixture models. Journal of Multivariate Analysis, 100(2):257{265.Lin, T. I. (2009b). Robust mixture modeling using multivariate skew tdistributions. Statistics and Computing, (In press).Lin, T. I., Lee, J. C., and Hsieh, W. J. (2007a). Robust mixture modelingusing the skew t distribution. Statistics and Computing, 17:81{92.Lin, T. I., Lee, J. C., and Yen, S. Y. (2007b). Finite mixture modellingusing the skew normal distribution. Statistica Sinica, 17:909{927.Liu, C. (1997). ML estimation of the multivariate t distribution and the EMalgorithm. Journal of Multivariate Analysis, 63:296{312.Liu, C. and Rubin, D. (1994). The ECME algorithm: a simple extension ofEM and ECM with faster monotone convergence. Biometrika, 81(4):633{648.81Liu, C. and Rubin, D. (1995). ML estimation of the t distribution using EMand its extensions, ECM and ECME. Statistica Sinica, 5:19{39.McLachlan, G. and Peel, D. (2000). Finite Mixture Models. Wiley-Interscience, New York.McLachlan, G. J., Bean, R. W., and Peel, D. (2002). A mixture model-basedapproach to the clustering of microarray expression data. Bioinformatics,18(3):413{422.Meng, X. L. and Rubin, D. B. (1993). Maximum likelihood estimation viathe ECM algorithm: a general framework. Biometrika, 80:267{278.Mukherjee, S., Feigelson, E. D., Babu, G. J., Murtagh, F., Fraley, C., andRaftery, A. E. (1998). Three types of gamma ray bursts. The AstrophysicalJournal, 508:314{327.Pan, W., Lin, J., and Le, C. T. (2002). Model-based cluster analysis ofmicroarray gene-expression data. Genome Biology, 3(2):R9.Peel, D. and McLachlan, G. J. (2000). Robust mixture modelling using thet distribution. Statistics and Computing, 10(4):339{348.Pyne, S., Hu, X., Wang, K., Rossin, E., Lin, T.-I., Maier, L. M., Baecher-Allan, C., McLachlan, G. J., Tamayo, P., Ha er, D. A., De Jager, P. L.,and Mesirov, J. P. (2009). Automated high-dimensional ow cytometricdata analysis. Proceedings of the National Academy of Sciences of theUnited States of America, 106(21):8519{8524.Sahu, S. K., Dey, D. K., and Branco, M. D. (2003). A new class of multivari-ate skew distributions with applications to Bayesian regression. CanadianJournal of Statistics, 31(2):129{150.Schork, N. J. and Schork, M. A. (1988). Skewness and mixtures of nor-mal distributions. Communications in Statistics: Theory and Methods,17:3951{3969.82Schroeter, P., Vesin, J.-M., Langenberger, T., and Meuli, R. (1998). Robustparameter estimation of intensity distributions for brain magnetic reso-nance images. IEEE Transactions on Medical Imaging, 17(2):172{186.Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statis-tics, 6:461{464.Stephens, M. (2000). Bayesian analysis of mixture models with an unknownnumber of components { an alternative to reversible jump methods. An-nals of Statistics, 28:40{74.Titterington, D. M., Smith, A. F. M., and Makov, U. E. (1985). StatisticalAnalysis of Finite Mixture Distributions. Wiley, Chichester, UK.Wehrens, R., Buydens, L. M. C., Fraley, C., and Raftery, A. E. (2004).Model-based clustering for image segmentation and large datasets viasampling. Journal of Classi cation, 21:231{253.Yeung, K. Y., Fraley, C., Murua, A., Raftery, A. E., and Ruzzo, W. L.(2001). Model-based clustering and data transformations for gene expres-sion data. Bioinformatics, 17(10):977{987.83Chapter 4Automated Gating of FlowCytometry Data via RobustModel-Based Clustering 4.1 IntroductionFlow cytometry (FCM) can be applied to analyze thousands of samplesper day. However, as each dataset typically consists of multiparametricdescriptions of millions of individual cells, data analysis can present a sig-ni cant challenge. As a result, despite its widespread use, FCM has notreached its full potential because of the lack of an automated analysis plat-form to parallel the high-throughput data generation platform. As notedin Lizard (2007), in contrast to the tremendous interest in the FCM tech-nology, there is a dearth of statistical and bioinformatics tools to manage,analyze, present, and disseminate FCM data. There is considerable demandfor the development of appropriate software tools, as manual analysis ofindividual samples is error-prone, non-reproducible, non-standardized, notopen to re-evaluation, and requires an inordinate amount of time, makingit a limiting aspect of the technology (Roederer and Hardy, 2001; Roedereret al., 2001a,b; de Rosa et al., 2003; Bagwell, 2004; Braylan, 2004; Redelman,2004; Tzircotis et al., 2004; Spidlen et al., 2006).The process of identifying homogeneous groups of cells that display aparticular function, known as gating, is one major component of FCM anal- A version of this chapter has been published. Lo, K., Brinkman, R. R. and Gottardo,R. (2008). Automated gating of ow cytometry data via robust model-based clustering.Cytometry Part A, 73A(4):321{332.84ysis. As mentioned in Chapter 1, currently, to a large extent, gating relieson using software to apply a series of manually drawn gates (i.e., data l-ters) that select regions in 2D graphical representations of FCM data. Thisprocess is based largely on intuition rather than standardized statistical in-ference (Parks, 1997; Suni et al., 2003; Bagwell, 2004). It also ignores thehigh-dimensionality of FCM data, which may convey information that can-not be displayed in 1D or 2D projections. This is illustrated in Figure 4.1with a synthetic dataset, consisting of two dimensions, generated from a tmixture model (McLachlan and Peel, 2000) with three components. Whilethe three clusters can be identi ed using both dimensions, the structure ishardly recognized when the dataset is projected on either dimension. Suchan example illustrates the potential loss of information if we disregard themultivariate nature of the data. The same problem occurs when projectingthree (or more) dimensional data onto two dimensions.Several attempts have been made to automate the gating process. Amongthose, the K-means algorithm (MacQueen, 1967) has found the most ap-plications (Murphy, 1985; Demers et al., 1992; Bakker Schut et al., 1993;Wilkins et al., 2001). Demers et al. (1992) have proposed an extension ofK-means allowing for non-spherical clusters, but this algorithm has beenshown to lead to performance inferior to fuzzy K-means clustering (Wilkinset al., 2001). In fuzzy K-means (Rousseeuw et al., 1996), each cell can belongto several clusters with di erent association degrees, rather than belongingcompletely to only one cluster. Even though fuzzy K-means takes intoconsideration some form of classi cation uncertainty, it is a heuristic-basedalgorithm and lacks a formal statistical foundation. Other popular choicesinclude hierarchical clustering algorithms (e.g., linkage or Pearson coe -cients method). However, these algorithms are not appropriate for FCMdata, since the size of the pairwise distance matrix increases in the orderof n2 with the number of cells, unless they are applied to some preliminarypartition of the data (Bakker Schut et al., 1993), or they are used to clusteracross samples, each of which is represented by a few statistics aggregatingmeasurements of individual cells (Maynadi e et al., 2002; Lugli et al., 2007).Classi cation and regression trees (Breiman et al., 1984), arti cial neural85Figure 4.1: A synthetic 2D dataset with three mixture components. Thethree components can easily be identi ed when both dimensions are used(lower left), while the two density curves produced from projecting the dataon either dimension fail to capture the structure.86networks (Boddy and Morris, 1999) and support vector machines (Burges,1998; Sch olkopf and Smola, 2002) have also been used in the context ofFCM analyses (Beckman et al., 1995; Kothari et al., 1996; Boddy et al.,2000; Morris et al., 2001), but these supervised approaches require trainingdata, which are not always available.In statistics, the problem of nding homogeneous groups of observationsis referred to as clustering. An increasingly popular choice is model-basedclustering (Titterington et al., 1985; McLachlan and Basford, 1988; Ban eldand Raftery, 1993; McLachlan and Peel, 2000; Fraley and Raftery, 2002),which has been shown to give good results in many applied elds involvinghigh dimensions (greater than ten); see, for example, Yeung et al. (2001),Fraley and Raftery (2002) and Pan et al. (2002). In this chapter, we proposeto apply an unsupervised model-based clustering approach to identify cellpopulations in FCM analysis. In contrast to previous unsupervised methods(Murphy, 1985; Demers et al., 1992; Bakker Schut et al., 1993; Roederer andHardy, 2001; Roederer et al., 2001a,b; Wilkins et al., 2001), our approachprovides a formal uni ed statistical framework to answer central questions:How many populations are there? Should we transform the data? Whatmodel should we use? How should we deal with outliers (aberrant observa-tions)? These questions are fundamental to FCM analysis where one doesnot usually know the number of populations, and where outliers are fre-quent. By performing clustering using all variables consisting of uorescentmarkers, the full multidimensionality of the data is exploited, leading tomore accurate and more reproducible identi cation of cell populations.The most commonly used model-based clustering approach is based on nite Gaussian mixture models (Titterington et al., 1985; McLachlan andBasford, 1988; McLachlan and Peel, 2000; Fraley and Raftery, 2002). How-ever, Gaussian mixture models rely heavily on the assumption that eachcomponent follows a Gaussian distribution, which is often unrealistic. As aremedy, transformation of the data is often considered. On the other hand,there is the problem of outlier identi cation in mixture modeling. Transfor-mation selection can be heavily in uenced by the presence of outliers (Car-roll, 1982; Atkinson, 1988), which are frequently observed in FCM data.87To handle the issues of transformation selection and outlier identi cationsimultaneously, in Chapter 3 we have developed an automated clusteringapproach based on t mixture models with the Box-Cox transformation. Thet distribution is similar in shape to the Gaussian distribution with heav-ier tails and thus provides a robust alternative (Lange et al., 1989). TheBox-Cox transformation is a type of power transformation, which can bringskewed data back to symmetry, a property of both the Gaussian and t dis-tributions. In particular, the Box-Cox transformation is e ective for datawhere the dispersion increases with the magnitude, a scenario not uncom-mon to FCM data.4.2 Materials and Methods4.2.1 Data DescriptionTo demonstrate our proposed automated clustering we use two FCM datasetspublicly available at http://www.ficcs.org/software.html.The Rituximab DatasetFlow cytometric high-content screening (Abraham et al., 2004) was appliedin a drug-screening project to identify agents that would enhance the anti-lymphoma activity of Rituximab, a therapeutic monoclonal antibody (Gas-paretto et al., 2004). 1600 di erent compounds were distributed into dupli-cate 96-well plates and then incubated overnight with the Daudi lymphomacell line. Rituximab was then added to one of the duplicate plates and bothplates were incubated for several more hours. In addition to cells treatedwith the compound alone, other controls included untreated cells and cellstreated with Rituximab alone. During the entire culture period, cells wereincubated with the thymidine analogue BrdU to label newly synthesizedDNA. Following culture, cells were stained with anti-BrdU and the DNAbinding dye 7-AAD. The proportion of cells in various phases of the cellcycle and undergoing apoptosis was measured with multiparameter FACSanalysis.88The GvHD DatasetGraft-versus-Host Disease (GvHD) occurs in allogeneic hematopoietic stemcell transplant recipients when donor-immune cells in the graft initiate anattack on the skin, gut, liver, and other tissues of the recipient. It is oneof the most signi cant clinical problems in the eld of allogeneic blood andmarrow transplantation. FCM was used to collect data on patients sub-jected to bone marrow transplant with a goal of identifying biomarkers topredict the development of GvHD. The GvHD dataset is a collection ofweekly peripheral blood samples obtained from 31 patients following allo-geneic blood and marrow transplant (Brinkman et al., 2007). Peripheralblood mononuclear cells were isolated using Ficoll-Hypaque and then cry-opreserved for subsequent batch analysis. At the time of analysis, cells werethawed and aliquoted into 96-well plates at 1 104 to 1 105 cells per well.The 96-well plates were then stained with 10 di erent four-color antibodycombinations. All staining and analysis procedures were miniaturized sothat small number of cells could be stained in 96-well plates with optimallydiluted uorescently conjugated antibodies.4.2.2 The ModelIn statistics, model-based clustering (Titterington et al., 1985; McLachlanand Basford, 1988; McLachlan and Peel, 2000; Fraley and Raftery, 2002)is a popular unsupervised approach to look for homogeneous groups of ob-servations. The most commonly used model-based clustering approach isbased on nite Gaussian mixture models, which have been shown to givegood results in various applied elds (Ban eld and Raftery, 1993; McLach-lan and Peel, 2000; Fraley and Raftery, 2002, 2006). However, Gaussianmixture models might give poor representations of clusters in the presenceof outliers, or when the clusters are far from elliptical in shape, phenom-ena commonly observed in FCM data. In view of this, we have proposedan approach based on t mixture models (McLachlan and Peel, 2000; Peeland McLachlan, 2000) coupled with a variant of the Box-Cox transforma-tion (Bickel and Doksum, 1981), which is also de ned for negative-valued89data, to handle the two aforementioned issues simultaneously. Please referto Chapter 3 for a detailed account of an Expectation-Maximization (EM)algorithm (Dempster et al., 1977) for the simultaneous estimation of all un-known parameters along with transformation selection. When the numberof clusters is unknown, we use the Bayesian Information Criterion (BIC)(Schwarz, 1978), which gives good results in the context of mixture models(Fraley and Raftery, 1998, 2002).While it is possible to estimate the degrees of freedom parameter ofthe t distribution for each component of the mixture model as part of theEM algorithm (Peel and McLachlan, 2000), xing it to a reasonable prede-termined value for all components reduces the computational burden whilestill providing robust results. A reasonable value for is four, which leads toa distribution similar to the Gaussian distribution, with slightly fatter tailsaccounting for outliers. Besides, the EM algorithm needs to be initialized.In this chapter, we apply a type of agglomerative hierarchical clusteringbased on Gaussian models (Ban eld and Raftery, 1993; Fraley, 1998) forinitialization. Model-based Gaussian hierarchical clustering is a stepwiseprocess aimed to maximize the classi cation likelihood function (Ban eldand Raftery, 1993; Celeux and Govaert, 1992). The process starts withtreating each observation itself as one cluster, and then successively mergespairs of clusters leading to the highest increase in the likelihood until thedesired number of clusters is reached. This initialization method is thesame as the one used in the model-based clustering strategy proposed byFraley and Raftery (2002, 2006), as implemented in the R package mclust.As mentioned in the Introduction, hierarchical clustering algorithms pose aproblem with FCM data as they require the storage of a pairwise distancematrix which increases in the order of n2 with the number of cells. In viewof this, we apply hierarchical clustering to a subset of data, and perform oneEM iteration to cluster the remaining data to complete the initial partition.904.2.3 Density EstimationTo visualize FCM data, it may be convenient to project high-dimensionaldata on 1D or 2D density plots. One such application can be found in theanalysis of the GvHD data, in which cells selected through the CD3+ gatewere projected on the CD4 and CD8 dimensions to produce contour plots(see Figures 4.2 and 4.3). Usually, nonparametric methods are applied toproduce such plots. However, all nonparametric methods require a tuningparameter (e.g., bandwidth for kernel density estimation; see Silverman,1986) to be speci ed to control the smoothness of these plots, and di er-ent softwares have di erent default settings. In the model-based clusteringframework, such plots can be easily generated at a very low computationalcost once estimates of the model parameters are available. The degree ofsmoothness is controlled by the number of components, which is chosenby the Bayesian Information Criterion (BIC) (Schwarz, 1978). Please seeSection 3.2.3 for more details on implementation.4.2.4 Sequential Approach to ClusteringIn practice, gating is often done on a preselected subset of data chosen byprojecting the data on the forward light scatter (FSC) and sideward lightscatter (SSC) dimensions. These two variables, which measure the relativemorphological properties (corresponding roughly to cell size and shape) ofthe cells, are often used to distinguish basic cell types (e.g., monocytes andlymphocytes) or to remove dead cells and cell debris. As a consequence,similar to Hahne et al. (2006), we have adopted a sequential approach toclustering. We rst use the FSC and SSC variables to cluster the data and nd basic cell populations, and then perform clustering on one or more pop-ulations of interest using all other variables consisting of uorescent markers.However, our methodology could also be applied to any subset or the entireset of variables.91Figure 4.2: Strategy for clustering the GvHD positive sample to look forCD3+CD4+CD8 + cells. The manual gating strategy is shown in (a{c).(a) Using FlowJo, a gate was drawn by an expert researcher to de ne thelymphocyte population. (b) The selected cells were projected on the CD3dimension, and CD3+ cells were de ned through setting an interval gate. (c)Cells within the upper right gate were referred to as CD3+CD4+CD8 +. (d{f) A t mixture model with the Box-Cox transformation was used to mimicthis manual selection process; here we display the corresponding densityestimates. For FlowJo, the density estimates correspond to kernel estimates,while for our gating strategy, the density estimates are obtained from theestimated mixture models.92Figure 4.3: Strategy for clustering the GvHD control sample. (a{c) Thesame manual gating strategy was applied by the expert researcher. (c)The upper right gate corresponding to the CD3+CD4+CD8 + populationcontains very few cells, a distinct di erence from the positive sample. (d{f)A t mixture model with the Box-Cox transformation was used to mimicthis manual selection process; here we display the corresponding densityestimates.934.3 Results4.3.1 Application to Real DatasetsThe Rituximab datasetWe have re-analyzed a part of the Rituximab dataset using our sequen-tial clustering approach. This data contains 1545 cells and four variables:FSC, SSC and two uorescent markers, namely, 7-AAD and anti-BrdU. Wecompared the di erent models, namely, t mixture with Box-Cox, t mixture,Gaussian mixture with Box-Cox, and Gaussian mixture, with the results ob-tained through expert manual analysis using the commercial gating softwareFlowJo (Tree star, Ashland, Oregon) and the K-means clustering algorithm(MacQueen, 1967). As mentioned in Section 4.2.4, we use a sequential ap-proach where we rst cluster the FSC vs. SSC variables to select basic cellpopulations ( rst stage), and then cluster the selected population(s) usingall remaining variables (second stage).Figure 4.4(a) shows the initial gating performed by a researcher usingFlowJo on the FSC and SSC variables. To facilitate the comparison of ourclustering approach with manual analysis at the second stage, we tried tomimic this analysis. In order to do so, we used a t mixture model with Box-Cox transformation, xing the number of components at one, and removedpoints with weights ~u (please refer to Section 3.2.3 for details) less than 0:5,corresponding to outliers. As shown in Figure 4.4, the selected cells are notexactly the same but close enough to allow us to compare our clusteringapproach to manual gating results when using the two uorescent markers.At the second stage, we compare the di erent clustering models on theselected cells. Since the number of clusters is unknown in advance, we makeuse of the BIC. The BIC curves shown in Figure 4.5, corresponding to thedi erent models, peak around three to four clusters, motivating us to ex-amine the results obtained using three (Figure 4.6) and four (Figure 4.7)clusters respectively. As expected, K-means performs poorly as sphericalclusters do not provide a good t. Similarly, untransformed mixture mod-els (t and Gaussian), constrained by the assumption of elliptical clusters,94Figure 4.4: Initial clustering of the Rituximab data using the FSC and SSCvariables. (a) In typical analysis a gate was manually drawn to select agroup of cells for further investigation. (b) A t mixture model with Box-Cox transformation was used to mimic this manual selection process. In (b)points (shown in gray) outside the boundary drawn in black have weights ~uless than 0:5 and will be removed from the next stage. It can be shown thatthis boundary corresponds approximately to the 90th percentile region forthe t distribution transformed back on the original scale using the Box-Coxparameter. The numbers shown in both plots are the percentages of pointswithin the boundaries which are extracted for the next stage. Both gatescapture the highest density region, as shown by the two density estimates.are not exible enough to capture the top cluster. Furthermore, Gaussianmixture models (even with the Box-Cox transformation) are very sensitiveto outliers, which can result in poor classi cation. For example, when fourclusters are used, the Gaussian mixture model breaks the larger cluster intotwo to accommodate outliers, while the Gaussian mixture model with theBox-Cox transformation also has a large spread out cluster to accommo-date outliers. Finally, Figures 4.6(b) and 4.7(b) show that our t mixturemodel-based clustering approach with the Box-Cox transformation can pro-vide comparable results with the manual gating analysis by identifying three95a71a71a71a71 a71a711 2 3 4 5 6−17200−16800−16400no. of clustersBICa71t + Box−CoxtGaussian + Box−CoxGaussianFigure 4.5: BIC as a function of the number of clusters for di erent modelsapplied to the Rituximab data. All models have a maximum BIC valuearound three to four clusters, though there is some uncertainty as the BICvalues are relatively close.of the four clusters with well- t boundaries. Note, however, that none ofthe four clustering methods detect the left rectangular gate seen on Fig-ure 4.6(a), which is most likely because of its lower cell density comparedto the other gates and the lack of clear separation along the \7-AAD" di-mension. This gate, which corresponds to apoptotic cells (Gasparetto et al.,2004), contains a loose assemblage of cells located at the left of the three farright gates. Our methodology permits the identi cation of the three rightclusters with well- t boundaries, and thus could be combined with expertknowledge in order to identify apoptotic cells. For example, one could com-pute a one dimensional boundary at the left-end border of the two largestclusters, and automatically label cells on the left of that line apoptotic.96Figure 4.6: Second-stage clustering of the Rituximab data using all the u-orescent markers (three clusters). (a) Four gates were drawn by a researcherto de ne four populations of interest. (b{f) Clustering was performed onthe cells preselected from the rst stage as shown in Figure 4.4(b). Thenumber of clusters was set to be three. (b{c) Points outside the boundarydrawn in black have weights less than 0:5 and are labeled with \ " whent distributions were used. (d{f) For clustering performed without using tdistributions, for comparison sake, boundaries are drawn in a way such thatthey correspond to the region of the same percentile which the boundariesdrawn in (b{c) represent. Di erent symbols are used for the di erent clus-ters. The numbers shown in all plots are the percentages of cells assigned toeach cluster. The K-means algorithm is equivalent to the classi cation EMalgorithm (Celeux and Govaert, 1992, 1995) for a Gaussian mixture modelassuming equal proportions and a common covariance matrix being a scalarmultiple of the identity matrix. The spherical clusters with equal volumesdrawn in (f) correspond to such a constrained model.97Figure 4.7: Second-stage clustering of the Rituximab data using all the u-orescent markers (four clusters). (a) Four gates were drawn by a researcherto de ne four populations of interest. (b{f) Clustering was performed oncells preselected from the rst stage. The number of clusters was set tobe four. (b{c) Points outside the boundary drawn in black have weightsless than 0:5 and are shown in gray when t distributions were used. (d{f)For clustering performed without using t distributions, for comparison sake,boundaries are drawn in a way such that they correspond to the region ofthe same percentile which the boundaries drawn in (b{c) represent. Di er-ent symbols are used for the di erent clusters. The numbers shown in allplots are the percentages of cells assigned to each cluster.98Having shown the superiority of our clustering framework in terms of exibility and robustness compared to common approaches, we now turn toa larger dataset to demonstrate further its capability.The GvHD DatasetTwo samples of the GvHD dataset (Brinkman et al., 2007) have been re-analyzed, one from a patient who eventually developed acute GvHD, andone from a control. Both datasets consist of more than 12,000 cells and fourmarkers, namely, anti-CD4, anti-CD8 , anti-CD3 and anti-CD8, in additionto the FSC and SSC variables. One objective of the analysis is to look forthe CD3+CD4+CD8 + cells. To demonstrate the capability of our proposedautomated clustering approach, we try to mimic the gating strategy statedin Brinkman et al. (2007). Figures 4.2(a{c) and 4.3(a{c) show the gatingperformed by an expert researcher using FlowJo.In the initial gating, we rst extracted the lymphocyte population usingthe FSC and SSC variables by applying a t mixture model with the Box-Cox transformation, xing the number of clusters from one to eight in turn.Figure 4.8(a) shows that the BIC for the positive sample has a large increasefrom three to four clusters and remains relatively constant afterwards, sug-gesting a model t using four clusters is appropriate. Figure 4.8(b) is thecorresponding scatterplot showing the cluster assignment of the points onremoving those with weights less than 0:5, regarded as outliers. It is clearthat the region combining three of the clusters formed matches closely withthe gate drawn by the researcher as shown in Figure 4.2(a), correspondingto the lymphocyte population.The next two stages in the manual gating strategy consist of locatingthe CD3+ cells by placing an interval gate in the CD3 density plot (Fig-ure 4.2(b)), and then identifying the CD3+CD4+CD8 + cells through theupper right gate in the CD4 vs CD8 contour plot (Figure 4.2(c)). Whenapplying our proposed clustering approach, we can combine these two stagesby handling all the variables consisting of uorescent markers at once, fullyutilizing the multidimensionality of FCM data.99Figure 4.8: Initial clustering of the GvHD positive sample using the FSC andSSC variables. (a) The BIC curve remains constant beyond four clusters. (b)The scatterplot reveals the use of three clusters to represent the lymphocytepopulation and the remaining cluster (shown in gray) for dead cells. Pointsshown in gray have weights less than 0:5 and will be removed from the nextstage.The tted model with 12 clusters seems to provide a good t as suggestedby the BIC (Figure 4.9(a)). We compared our results with those obtainedthrough the manual gating approach by rst examining the estimated den-sity projected on the CD3 dimension. The unimodal, yet skewed, densitycurve suggests that it is composed of two populations with substantiallydi erent proportions superimposed on each other (Figure 4.2(e)). At a levelof around 280, we can well separate the 12 cluster means along the CD3dimension into two groups, and use the group with high cluster means inthe CD3 dimension to represent the CD3+ population. The unimodal na-ture of the density curve (Figures 4.2(b,e)) implies that the two underlyingpopulations somewhat mix together, and therefore setting a xed cuto toclassify the cells is likely inappropriate. The merit of our automated cluster-ing approach is shown here, that, instead of setting a cuto , it makes use of100Figure 4.9: Second-stage clustering of the GvHD positive sample (a{b) andcontrol sample (c{d) using all the uorescent markers. Clustering was per-formed on the cells preselected from the rst stage. For the positive sample,(a) the BIC reaches a maximum at 12 clusters; (b) the scatterplot reveals thecluster assignment of the cells. Points which are assigned to the ve clusterswith high CD3 means are classi ed as CD3+ cells. The ve regions drawnin solid lines form the CD3+ population. The two regions in the upper rightmarked with the symbols are identi ed as the CD3+CD4+CD8 + popu-lation. For the control sample, (c) little increment is observed in the BICbeyond seven clusters, suggesting that seven clusters, much fewer than forthe positive sample, are enough to model the data in the second stage; (d)the scatterplot reveals the cluster assignment of the cells. Only two clustershave been used to model the CD3+ population.101the information provided by the other dimensions to help classify the cellsinto CD3+/CD3 populations. The group with high cluster means in theCD3 dimension consists of ve clusters, and among these ve clusters, wecan easily identify the two clusters at the upper right in the CD4 vs CD8 scatterplot (Figure 4.9(b)) as the CD3+CD4+CD8 + population.We have applied the same strategy to the control sample; see Figures 4.3and 4.9(c{d). Figure 4.9(c) suggests that, this time, only seven clustersare necessary as the BIC is relatively at after that. The associated gat-ing results for the control sample is characterized by an absence of theCD3+CD4+CD8 + cells, a distinct di erence from the positive sample. Thisfeature is also captured using our automated clustering approach; the ttedmodel contains no clusters at the upper right of the CD4 vs CD8 scatter-plot (Figure 4.9(d)). This cell population was of speci c interest, as it wasidenti ed as one possibly predictive of GvHD, based on the manual gatinganalysis in Brinkman et al. (2007).4.3.2 Simulation studiesWe have conducted a series of simulations to study the performance of di er-ent model-based clustering approaches under di erent model speci cations.Model performance is compared using the following two criteria: (a) theaccuracy in cluster assignment; (b) the accuracy in selecting the numberof clusters. We performed two simulation studies, one where we set thedimension to two resembling the Rituximab dataset, and one where the di-mension was set to four resembling the GvHD dataset. In each case, wegenerated data from each of the following models: t mixture with Box-Cox,t mixture, Gaussian mixture with Box-Cox, and Gaussian mixture, usingthe parameter estimates obtained at the second stage in the Rituximab andGvHD (positive sample) analyses. For the GvHD, to reduce computationalburden, we only selected the ve clusters with the largest means in the CD3dimension, corresponding to the CD3+ population. We refer to the simu-lation experiments as the Rituximab and the GvHD settings, respectively.We xed the number of cells at 500 and generated 1000 datasets under each102Table 4.1: Average misclassi cation rates for di erent models applied todata generated under the Rituximab or GvHD setting.Model used to t datat+Box-Cox t G.+Box-Cox GaussianModel used to t+Box-Cox 0.187 0.211 0.279 0.251generate data t 0.255 0.263 0.339 0.315under the Ritux- G.+Box-Cox 0.321 0.400 0.251 0.352imab setting Gaussian 0.344 0.329 0.317 0.301Model used to t+Box-Cox 0.112 0.116 0.205 0.230generate data t 0.107 0.111 0.191 0.221under the GvHD G.+Box-Cox 0.135 0.143 0.139 0.152setting Gaussian 0.134 0.132 0.132 0.126G. = Gaussian; the best results are shown in bold.of the aforementioned models. To study the accuracy in selecting the num-ber of clusters using BIC, we generated 100 datasets from the same GvHDsetting with 1000 cells. Here, we used 1000 cells to avoid numerical prob-lems with small clusters when the number of clusters used is signi cantlylarger than the true number, while we decreased the number of datasets to100 because of the increase in computation when estimating the number ofclusters.Classi cation ResultsThe four clustering methods in comparison were applied to each of the 1000datasets generated from each model. Model tting was done by presumingthat the number of clusters is known, i.e., four clusters for the Rituximabsetting and ve for GvHD. We compared the models via misclassi cationrates, i.e., the proportions of cells assigned to incorrect clusters. Whencomputing the misclassi cation rates, all permutations of the cluster labelswere considered, and the lowest misclassi cation rate was determined.The scatterplot of one of the datasets (GvHD setting) generated from thet mixture model with Box-Cox transformation can be found in Figure 4.10.Overall results are shown in Table 4.1. As expected, the Gaussian mixturemodels perform poorly when data were generated from the t mixture modelsbecause of a lack of mechanisms to handle outliers. When a transformation103Figure 4.10: A representative sample generated from the t mixture modelwith the Box-Cox transformation under the GvHD setting. (a) The sampleis displayed through the CD4 and CD8 dimensions. (b{e) Classi cationresults are shown for the four clustering methods. Di erent plotting symbolsare used for di erent clusters. Misclassi ed points are marked with the symbols. 104was applied during data generation, the mixture models without the Box-Cox transformation fail to perform well. On the contrary, the exibilityof the t mixture model with the Box-Cox transformation does not penalizetoo much for model misspeci cation. This is illustrated by the results fromthe GvHD setting: the t mixture model with the Box-Cox transformationgives the lowest misclassi cation rates when the true model is instead the tmixture model without transformation or the Gaussian mixture model withthe Box-Cox transformation.Selecting the Number of ClustersIn this part of the study, the four models in comparison were applied to eachof the 100 datasets generated, setting the number of clusters from one to tenin turn. The number of clusters that delivered the highest BIC was selected.We compared the models via the mode and the 80% coverage interval of thenumber of clusters selected out of the 100 repetitions. As shown in Table 4.2,thetmixture models can select the correct number of clusters in the majorityof repetitions, even in case of model misspeci cation. In addition, theydeliver the same 80% coverage intervals as the Gaussian mixture modelsdo when data were generated from Gaussian mixtures, suggesting that therobustness against outliers of the t mixture models provides satisfactoryprotection against model misspeci cation. On the contrary, the Gaussianmixture models tend to overestimate the number of clusters when an excessof outliers is present in the data generated from t mixtures, and in mostinstances in which overestimation happens, six clusters are selected.4.4 DiscussionThe experimental data and the simulation studies have demonstrated theimportance of handling transformation selection, outlier identi cation andclustering simultaneously. While a stepwise approach in which transforma-tion is preselected ahead of outlier detection (or vice versa) may be consid-ered, it is unlikely to tackle the problem well in general, as the preselected105Table 4.2: Modes and 80% coverage intervals of the number of clustersselected by the BIC for di erent models applied to data generated underthe GvHD setting.Model used to t datat+Box-Cox t G.+Box-Cox GaussianM. Int. M. Int. M. Int. M. Int.Model t+Box-Cox 5 (5, 6) 5 (5, 6) 6 (6, 7) 6 (6, 8)used to t 5 (5, 7) 5 (5, 6) 6 (6, 7) 6 (6, 8)generat G.+Box-Cox 5 (5, 6) 5 (5, 6) 5 (5, 6) 5 (5, 6)data Gaussian 5 (5, 6) 5 (5, 6) 5 (5, 6) 5 (5, 6)M. = mode; Int. = 80% coverage interval; G. = Gaussian.transformation may be in uenced by the presence of outliers. This is shownin the analysis of the Rituximab dataset. Without outlier removal the useof Gaussian mixture models led to inappropriate transformation and poorclassi cation in order to accommodate outliers (Figures 4.6(d) and 4.7(d)).Conversely, without transformation, the t mixture model could not modelthe shape of the top cluster well (Figures 4.6(c) and 4.7(c)). Similarly, it isnecessary to perform transformation selection and clustering simultaneously(Gutierrez et al., 1995; Schork and Schork, 1988) as opposed to a stepwiseapproach. It is di cult to know what transformation to select beforehandas one only observes the mixture distribution, and the classi cation labelsare unknown. A skewed distribution could be the result of one dominantcluster and one (or more) smaller cluster. As shown by our analysis with theexperimental data and the simulation studies, our proposed approach basedon t mixture models with Box-Cox transformation bene ts from handlingthese issues, which have mutual in uence, simultaneously. Furthermore,con rmed by results of our simulation studies, our proposed approach isrobust against model misspeci cation and can avoid the problem of Gaus-sian mixture models that excessive clusters are often needed to provide areasonable t in case of model misspeci cation (Yeung et al., 2001).One of the bene ts of model-based clustering is that it provides mecha-nism for both \hard" clustering (i.e., the partitioning of the whole data intoseparate clusters) and fuzzy clustering (i.e., a \soft" clustering approach in106which each event may be associated with more than one cluster). The latterapproach is in line with the rationale that there exists uncertainty about towhich cluster an event should be assigned. The overlaps between clusters asseen in Figures 4.6 and 4.9 reveal such uncertainty in the cluster assignment.It is well known that the convergence of the EM algorithm depends onthe initial conditions used. A bad initialization may incur slow convergenceor convergence to a local minimum. In the real-data examples and the simu-lation studies, we used a deterministic approach called hierarchical clustering(Ban eld and Raftery, 1993; Fraley, 1998) for initialization. We have foundthis approach to perform well in the datasets explored here. However, betterinitialization, perhaps incorporating expert knowledge, might be needed formore complex datasets. For example, if there is a high level of noise in thedata, it might be necessary to use an initialization method that accounts forsuch outliers; see Fraley and Raftery (2002) for an example.To estimate how long it takes to analyze a sample of size typical for anFCM dataset, we have carried out a test run on a synthetic dataset, whichconsists of one million events and 10 dimensions. To complete an analysiswith 10 clusters, it took about 20 minutes on a 3GHz Intel Xeon proces-sor with 2GB of RAM. This illustrates that the algorithm should be quickenough for analyzing a large ow dataset. In general, the computationaltime increases linearly with the number of events and increases in the orderof p2 with the number of variables, p, per EM iteration. This is an advantageover hierarchical clustering in which the computational time and memoryspace required increase in the order of n2 with the number of events, makinga hierarchical approach impractical when a sample of a moderate size, say,>5000, is investigated.Like all clustering approaches, the methodology we have developed in-cludes assumptions which may limit the applicability of this approach, andit will not identify every cell population in every sample. If the distribu-tion of the underlying population is highly sparse without a well-de nedcore, our approach may not properly identify all sub-populations. This isillustrated in the Rituximab analysis where the loosely structured group ofapoptotic cells was left undetected. This in turn has hindered the capa-107bility of the approach from giving satisfactory esimates of the G1 and Sfrequences for the identi ed clusters that would be desired for normal anal-ysis of a 7-AAD DNA distribution for cultured cells. On the other handidenti cation of every cluster may not always be important. The Rituximabstudy was designed as a high throughput drug screen to identify compoundsthat caused a >50% reduction in S-phase cells (Gasparetto et al., 2004),as would be captured by both the manual gates and our automated anal-ysis should it occur. Furthermore, the exact identi cation of every clusterthrough careful manual analysis may not always be possible, especially inhigh throughput experiments. For instance, in the manual analysis of theGvHD dataset, a quadrant gate was set in Figure 4.2(c) in order to iden-tify the CD3+CD4+CD8 + population which was of primary interest. Forconvenience sake, this gate was set at the same level across all the samplesbeing investigated. While ve clusters can be clearly identi ed on the graph,it would be time-consuming to manually adjust the positions of each of thegates for all the samples in a high-throughput environment as well as identifyall novel populations. Contrariwise, our automated approach can identifythese clusters in short order without the need for manual adjustment. Tocomplete the analysis of the GvHD dataset (>12,000 cells, six dimensions)to identify the CD3+CD4+CD8 + population (Figure 4.2), it took less than ve minutes, using the aforementioned sequential approach to clustering, onan Intel Core 2 Duo with 2GB of RAM running Mac OS X 10.4.10.A rigorous quantitative assessment is important before implementingthis, or any approach, as a replacement for expert manual analysis. Theavailability of a wide variety of example data would aid in the developmentand evaluation of automated analysis methodologies. We are therefore de-veloping such a public resource, and would welcome contributions from thewider FCM community.108BibliographyAbraham, V. C., Taylor, D. L., and Haskins, J. R. (2004). High contentscreening applied to large-scale cell biology. Trends in Biotechnology,22(1):15{22.Atkinson, A. C. (1988). Transformations unmasked. Technometrics, 30:311{318.Bagwell, C. B. (2004). DNA histogram analysis for node-negative breastcancer. Cytometry Part A, 58A(1):76{78.Bakker Schut, T. C., de Grooth, B. G., and Greve, J. (1993). Cluster analysisof ow cytometric list mode data on a personal computer. Cytometry,14(6):649{659.Ban eld, J. D. and Raftery, A. E. (1993). Model-based Gaussian and non-Gaussian clustering. Biometrics, 49:803{821.Beckman, R. J., Salzman, G. C., and Stewart, C. C. (1995). Classi cationand regression trees for bone marrow immunophenotyping. Cytometry,20(3):210{217.Bickel, P. J. and Doksum, K. A. (1981). An analysis of transformationsrevisited. Journal of the American Statistical Association, 76(374):296{311.Boddy, L. and Morris, C. W. (1999). Arti cial neural networks for patternrecognition. In Fielding, A. H., editor, Machine Learning Methods forEcological Applications, pages 37{87. Kluwer, Boston.Boddy, L., Morris, C. W., Wilkins, M. F., Al-Haddad, L., Tarran, G. A.,Jonker, R. R., and Burkill, P. H. (2000). Identi cation of 72 phytoplanktonspecies by radial basis function neural network analysis of ow cytometricdata. Marine Ecology Progress Series, 195:47{59.109Braylan, R. C. (2004). Impact of ow cytometry on the diagnosis andcharacterization of lymphomas, chronic lymphoproliferative disorders andplasma cell neoplasias. Cytometry Part A, 58A(1):57{61.Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J. (1984). Clas-si cation and Regression Trees. Wadsworth & Brooks, Monterey, CA.Brinkman, R. R., Gasparetto, M., Lee, S. J. J., Ribickas, A., Perkins, J.,Janssen, W., Smiley, R., and Smith, C. (2007). High-content ow cy-tometry and temporal data analysis for de ning a cellular signature ofGraft-versus-Host disease. Biology of Blood and Marrow Transplantation,13(6):691{700.Burges, C. J. C. (1998). A Tutorial on Support Vector Machines for PatternRecognition. Kluwer, Boston.Carroll, R. J. (1982). Prediction and power transformations when the choiceof power is restricted to a nite set. Journal of the American StatisticalAssociation, 77(380):908{915.Celeux, G. and Govaert, G. (1992). A classi cation EM algorithm for clus-tering and two stochastic versions. Computational Statistics and DataAnalysis, 14(3):315{332.Celeux, G. and Govaert, G. (1995). Gaussian parsimonious clustering mod-els. Pattern Recognition, 28(5):781{793.de Rosa, S. C., Brenchley, J. M., and Roederer, M. (2003). Beyond sixcolors: a new era in ow cytometry. Nature Medicine, 9(1):112{117.Demers, S., Kim, J., Legendre, P., and Legendre, L. (1992). Analyzing mul-tivariate ow cytometric data in aquatic sciences. Cytometry, 13(3):291{298.Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). Maximum likeli-hood from incomplete data via the EM algorithm. Journal of the RoyalStatistical Society, Series B, 39:1{38.110Fraley, C. (1998). Algorithms for model-based Gaussian hierarchical clus-tering. SIAM Journal on Scienti c Computing, 20(1):270{281.Fraley, C. and Raftery, A. E. (1998). How many clusters? Which cluster-ing method? Answers via model-based cluster analysis. The ComputerJournal, 41(8):578{588.Fraley, C. and Raftery, A. E. (2002). Model-based clustering, discriminantanalysis, and density estimation. Journal of the American Statistical As-sociation, 97(458):611{631.Fraley, C. and Raftery, A. E. (2006). MCLUST version 3 for R: Normalmixture modeling and model-based clustering. Technical Report 504, Uni-versity of Washington, Department of Statistics.Gasparetto, M., Gentry, T., Sebti, S., O’Bryan, E., Nimmanapalli, R.,Blaskovich, M. A., Bhalla, K., Rizzieri, D., Haaland, P., Dunne, J.,and Smith, C. (2004). Identi cation of compounds that enhance theanti-lymphoma activity of rituximab using ow cytometric high-contentscreening. Journal of Immunological Methods, 292(1{2):59{71.Gutierrez, R. G., Carroll, R. J., Wang, N., Lee, G.-H., and Taylor, B. H.(1995). Analysis of tomato root initiation using a normal mixture distri-bution. Biometrics, 51:1461{1468.Hahne, F., Arlt, D., Sauermann, M., Majety, M., Poustka, A., Wiemann, S.,and Huber, W. (2006). Statistical methods and software for the analysisof high throughput reverse genetic assays using ow cytometry readouts.Genome Biology, 7(8):R77.Kothari, R., Cualing, H., and Balachander, T. (1996). Neural network anal-ysis of ow cytometry immunophenotype data. IEEE Transactions onBiomedical Engineering, 43(8):803{810.Lange, K. L., Little, R. J. A., and Taylor, J. M. G. (1989). Robust statisticalmodeling using the t-distribution. Journal of the American StatisticalAssociation, 84:881{896.111Lizard, G. (2007). Flow cytometry analyses and bioinformatics: interest innew softwares to optimize novel technologies and to favor the emergenceof innovative concepts in cell research. Cytometry Part A, 71A:646{647.Lugli, E., Pinti, M., Nasi, M., Troiano, L., Ferraresi, R., Mussi, C., Salvioli,G., Patsekin, V., Robinson, J. P., Durante, C., Cocchi, M., and Cos-sarizza, A. (2007). Subject classi cation obtained by cluster analysis andprincipal component analysis applied to ow cytometric data. CytometryPart A, 71A:334{344.MacQueen, J. B. (1967). Some methods for classi cation and analysis ofmultivariate observations. In LeCam, L. and Neyman, J., editors, Pro-ceedings of the fth Berkeley Symposium on Mathematical Statistics andProbability, volume 1, pages 281{297, Berkeley. University of CaliforniaPress.Maynadi e, M., Picard, F., Husson, B., Chatelain, B., Cornet, Y., Le Roux,G., Campos, L., Dromelet, A., Lepelley, P., Jouault, H., Imbert, M.,Rosenwadj, M., Verg e, V., Bissi eres, P., Rapha el, M., B en e, M. C., Feuil-lard, J., and GEIL (2002). Immunophenotypic clustering of myelodys-plastic syndromes. Blood, 100(7):2349{2356.McLachlan, G. and Peel, D. (2000). Finite Mixture Models. Wiley-Interscience, New York.McLachlan, G. J. and Basford, K. E. (1988). Mixture Models: Inference andApplications to Clustering. Marcel Dekker Inc, New York, NY.Morris, C. W., Autret, A., and Boddy, L. (2001). Support vector machinesfor identifying organisms { a comparison with strongly partitioned radialbasis function networks. Ecological Modelling, 146(1{3):57{67.Murphy, R. F. (1985). Automated identi cation of subpopulations in owcytometric list mode data using cluster analysis. Cytometry, 6(4):302{309.Pan, W., Lin, J., and Le, C. T. (2002). Model-based cluster analysis ofmicroarray gene-expression data. Genome Biology, 3(2):R9.112Parks, D. R. (1997). Data processing and analysis: data management. InRobinson, J. P., editor, Current Protocols in Cytometry, chapter 10. JohnWiley & Sons, Inc, New York.Peel, D. and McLachlan, G. J. (2000). Robust mixture modelling using thet distribution. Statistics and Computing, 10(4):339{348.Redelman, D. (2004). CytometryML. Cytometry Part A, 62A(1):70{73.Roederer, M. and Hardy, R. R. (2001). Frequency di erence gating: amultivariate method for identifying subsets that di er between samples.Cytometry, 45(1):56{64.Roederer, M., Moore, W., Treister, A., Hardy, R. R., and Herzenberg, L. A.(2001a). Probability binning comparison: a metric for quantitating mul-tivariate distribution di erences. Cytometry, 45(1):47{55.Roederer, M., Treister, A., Moore, W., and Herzenberg, L. A. (2001b).Probability binning comparison: a metric for quantitating univariate dis-tribution di erences. Cytometry, 45(1):37{46.Rousseeuw, P. J., Kaufman, L., and Trauwaert, E. (1996). Fuzzy cluster-ing using scatter matrices. Computational Statistics and Data Analysis,23(1):135{151.Sch olkopf, B. and Smola, A. J. (2002). Learning with Kernels: SupportVector Machines, Regularization, Optimization and Beyond. The MITPress, Cambridge, Massachusetts.Schork, N. J. and Schork, M. A. (1988). Skewness and mixtures of nor-mal distributions. Communications in Statistics: Theory and Methods,17:3951{3969.Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statis-tics, 6:461{464.Silverman, B. W. (1986). Density Estimation for Statistics and Data Anal-ysis. Chapman-Hall, New York.113Spidlen, J., Gentleman, R. C., Haaland, P. D., Langille, M., Le Meur, N.,Ochs, M. F., Schmitt, C., Smith, C. A., Treister, A. S., and Brinkman,R. R. (2006). Data standards for ow cytometry. OMICS, 10(2):209{214.Suni, M. A., Dunn, H. S., Orr, P. L., de Laat, R., Sinclair, E., Ghanekar,S. A., Bredt, B. M., Dunne, J. F., Maino, V. C., and Maecker, H. T.(2003). Performance of plate-based cytokine ow cytometry with auto-mated data analysis. BMC Immunology, 4:9.Titterington, D. M., Smith, A. F. M., and Makov, U. E. (1985). StatisticalAnalysis of Finite Mixture Distributions. Wiley, Chichester, UK.Tzircotis, G., Thorne, R. F., and Isacke, C. M. (2004). A new spreadsheetmethod for the analysis of bivariate ow cytometric data. BMC CellBiology, 5:10.Wilkins, M. F., Hardy, S. A., Boddy, L., and Morris, C. W. (2001). Com-parison of ve clustering algorithms to classify phytoplankton from owcytometry data. Cytometry, 44(3):210{217.Yeung, K. Y., Fraley, C., Murua, A., Raftery, A. E., and Ruzzo, W. L.(2001). Model-based clustering and data transformations for gene expres-sion data. Bioinformatics, 17(10):977{987.114Chapter 5 owClust: a Bioconductorpackage for automated gatingof ow cytometry data 5.1 IntroductionIn Chapter 4, we mentioned the lack of an automated analysis platformto parallel the high-throughput data-generation platform in ow cytometry(FCM). How to resolve this current bottleneck has become an open questionamong the FCM community. Recently, a suite of several R packages provid-ing infrastructure for FCM analysis have been released though Bioconductor(Gentleman et al., 2004), an open source software development project forthe analysis of genomic data. owCore (Hahne et al., 2009), the core pack-age among them, provides data structures and basic manipulation of FCMdata. owViz (Sarkar et al., 2008) o ers visualization tools, while owQprovides quality control and quality assessment tools for FCM data. Finally, owUtils provides utilities to deal with data import/export for owCore.In spite of these low-level tools, there is still a dearth of software that helpsautomate FCM gating analysis with a sound theoretical foundation (Lizard,2007).In view of the aforementioned issues, based on a formal statistical cluster-ing approach, we have developed the owClust package to help resolve thecurrent bottleneck. owClust implements a robust model-based cluster- A version of this chapter has been published. Lo, K., Hahne, F., Brinkman, R. R.and Gottardo, R. (2009). owClust: a Bioconductor package for automated gating of owcytometry data. BMC Bioinformatics, 10:145.115ing approach (Peel and McLachlan, 2000; McLachlan and Peel, 2000; Fraleyand Raftery, 2002) which extends the multivariate t mixture model with theBox-Cox transformation proposed in Chapter 4. As a result of the exten-sions made, owClust has included options allowing for a cluster-speci cestimation of the Box-Cox transformation parameter and/or the degrees offreedom parameter.5.2 ImplementationWith the robust model-based clustering approach described in Chapter 4as the theoretical basis, we have developed owClust, an R package toconduct an automated FCM gating analysis and produce visualizations forthe results. owClust is released through Bioconductor (Gentleman et al.,2004), along with those R packages mentioned in Section 5.1. The GNUScienti c Library (GSL) is needed for successful installation of owClust.We have provided a vignette (Appendix B) that comes with owClust toenunciate details about installation, and procedures of linking GSL to R,especially for Windows users.In recognition of the potential need for analyzing a large number of FCMsamples in parallel, during the process of package development, tremendouse ort has been put into code optimization and automation. The source codefor the entire model- tting process via the EM algorithm is written in C foroptimal utilization of system resources, and makes use of the Basic LinearAlgebra Subprograms (BLAS) library, which facilitates multithreaded pro-cesses when an optimized library is provided. To ensure that code is devel-oped in an e cient manner, vectorization is administered wherever possiblein order to attain minimal explicit looping, one of the major factors lead-ing to sub-optimal e ciency in programming with R. In addition, insteadof straightforward conversion of mathematical formulae into programmingcode, a comprehensive account of the EM algorithm has been taken andthe code has been developed in a fashion such that redundant computationof the same or uncalled-for quantities is avoided. On the other hand, theencounter with undesirable execution halt at runtime due to computational116errors would undermine the level of automation achieved. This is criticalespecially when a user needs to analyze a large number of samples. On thisaccount, we have developed substantial error handling strategies to copewith various scenarios such as poor initialization of the EM algorithm, andfailure of root- nding for the transformation parameter. Another importantmeasure taken towards automation is the provision of a good default set-ting for parameters (e.g., search interval for the root- nding problem, andtolerance level for the convergence of EM) involved at di erent steps of themodel- tting process, or for arguments (e.g., colors for representing individ-ual clusters, and cuto s for de ning outliers) used in ltering or visualizingthe clustering result. Whilst parameter tuning for individual samples maystill be feasible in a small-scale study, it becomes impractical when hundredsof samples need to be processed in parallel. We have undergone an exten-sive tuning process to test against a large number of real FCM samples suchthat sensible results or visualization would be delivered within a reasonabletimeframe for the majority of cases upon which the default setting is ap-plied. Finally, in consideration of convenience from users’ perspective, manyfunctions or methods in owClust have been speci cally adapted to caterfor various input data structures. E ort has also been made to adapt to thecustom of FCM researchers whilst developing tools of visualization and forconstructing data lters.A formal object-oriented programming discipline, the S4 system (Cham-bers, 2004), has been adopted to build the owClust package. Two keyfeatures of the S4 system, namely, multiple dispatch and multiple inheri-tance, have been essential for de ning classes and methods. For most genericfunctions de ned or utilized in owClust (e.g., Subset, split and plot),method dispatch relies on the multiple dispatch capabilities and is done inaccordance with a signature taking more than one argument. Incidentally,inheritance is employed to extend classes de ned in other packages; see Sec-tion 5.3.2 for details about integration with other Bioconductor packagesdedicated to FCM analysis. In particular, for the sake of organization, mul-tiple inheritance is exploited such that multiple classes can be extendedsimultaneously.117The core function, flowClust, of the package implements the clusteringmethodology and returns an object of class flowClust. A flowClust objectstores essential information related to the clustering result which can beretrieved through various methods such as summary, Map, getEstimates,etc. To visualize the clustering results, the plot and hist methods can beapplied to produce scatterplots, contour or image plots and histograms.To enhance communications with other Bioconductor packages designedfor the cytometry community, owClust has been built with the aim ofbeing highly integrated with owCore. Methods in owClust can bedirectly applied on a flowFrame, the standard R implementation of a FlowCytometry Standard (FCS) le de ned in owCore; FCS is the typicalstorage mode for FCM data. Another step towards integration is to overloadbasic ltering methods de ned in owCore (e.g., filter, %in%, Subsetand split) in order to provide similar functionality for classes de ned in owClust.5.3 Results and Discussion5.3.1 Analysis of Real FCM DataIn this section, we illustrate how to use owClust to conduct an auto-mated gating analysis of real FCM data. For demonstration, we use thegraft-versus-host disease (GvHD) data (Brinkman et al., 2007). The dataare stored in FCS les, and consist of measurements of four uorescentlyconjugated antibodies, namely, anti-CD4, anti-CD8 , anti-CD3 and anti-CD8, in addition to the forward scatter and sideward scatter parameters.One objective of the gating analysis is to look for the CD3+CD4+CD8 +cell population, a distinctive feature found in GvHD-positive samples. Wehave adopted a two-stage strategy (Section 4.2.4): we rst cluster the databy using the two scatter parameters to identify basic cell populations, andthen perform clustering on the population of interest using all uorescenceparameters.At the initial stage, we extract the lymphocyte population using the118forward scatter (FSC-H) and sideward scatter (SSC-H) parameters:GvHD <- read.FCS("B07", trans=FALSE)res1 <- flowClust(GvHD, varNames=c("FSC-H", "SSC-H"), K=1:8)To estimate the number of clusters, we run flowClust on the data repeti-tively with K=1 up to K=8 clusters in turn, and apply the Bayesian Informa-tion Criterion (BIC) (Schwarz, 1978) to guide the choice. Values of the BICcan be retrieved through the criterion method. Figure 5.1 shows that theBIC curve remains relatively at beyond four clusters. We therefore choosethe model with four clusters. Below is a summary of the correspondingclustering result:** Experiment Information **Experiment name: Flow ExperimentVariables used: FSC-H SSC-H** Clustering Summary **Number of clusters: 4Proportions: 0.1779686 0.1622115 0.3882043 0.2716157** Transformation Parameter **lambda: 0.1126388** Information Criteria **Log likelihood: -146769.5BIC: -293765.9ICL: -300546.2** Data Quality **Number of points filtered from above: 168 (1.31%)Number of points filtered from below: 0 (0%)Rule of identifying outliers: 90% quantileNumber of outliers: 506 (3.93%)Uncertainty summary:Min. 1st Qu. Median Mean 3rd Qu. Max. NA’s9.941e-04 1.211e-02 3.512e-02 8.787e-02 1.070e-01 6.531e-01 1.680e+02The estimate of the Box-Cox parameter is 0:11, implying a transformationclose to a logarithmic one ( = 0).1191 2 3 4 5 6 7 8−304000−302000−300000−298000−296000−294000No. of clustersBICglobal λcluster−specific λFigure 5.1: A plot of BIC against the number of clusters for the rst-stagecluster analysis. The two curves correspond to the settings with a common and cluster-speci c respectively for the rst-stage cluster analysis. Littledi erence in the BIC values between the two settings is observed. The BICcurves remain relatively at beyond four clusters, suggesting that the model t using four clusters is appropriate.120Note that, by default, flowClust selects the same transformation forall clusters. We have also enabled the option of estimating the Box-Coxparameter for each cluster. For instance, if a user nds the shapes ofthe clusters signi cantly deviate from one another and opts for a di erenttransformation for each cluster, he may write the following line of code:res1s <- flowClust(GvHD, varNames=c("FSC-H", "SSC-H"), K=1:8,trans=2)The trans argument acts as a switch to govern how is handled: xed ata predetermined value (trans=0), estimated and set common to all clusters(trans=1), or estimated for each cluster (trans=2). Incidentally, the optionof estimating the degrees of freedom parameter has also been made avail-able, either common to all clusters or speci c to each of them. The nu.estargument is the corresponding switch and takes a similar interpretation totrans. Such an option of estimating further ne-tunes the model- ttingprocess such that the tted model can re ect the data-speci c level of abun-dance of outliers. To compare the models adopting a di erent combinationof these options, one may make use of the BIC again. Figure 5.1 shows thatlittle di erence in the two BIC curves corresponding to the default setting(common ) and the setting with cluster-speci c respectively can be ob-served. In accordance with the principle of parsimony in statistics whichfavors a simpler model, we opt for the default setting here.Graphical functionalities are available to users for visualizing a wealth offeatures of the clustering results, including the cluster assignment, outliers,and the size and shape of the clusters. Figure 5.2 is a scatterplot showingthe cluster assignment of points upon the removal of outliers. Outliers areshown in grey with the + symbols. The black solid lines represent the 90%quantile region of the clusters which de nes the cluster boundaries. Thesummary shown above states that the default rule used to identify outliersis 90% quantile, which means that a point outside the 90% quantile regionof the cluster to which it is assigned will be called an outlier. In mostapplications, the default rule should be appropriate for identifying outliers.In case a user wants ner control and would like to specify a di erent rule,121200 400 600 800 100002004006008001000FSC−HeightSSC−Height++++++++++++++++++++++ ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++a71a71a71a71Clusters1234Figure 5.2: A scatterplot revealing the cluster assignment in the rst-stageanalysis. Clusters 1, 3 and 4 correspond to the lymphocyte population,while cluster 2 is referred to as the dead cell population. The black solidlines represent the 90% quantile region of the clusters which de ne the clusterboundaries. Points outside the boundary of the cluster to which they areassigned are called outliers and marked with \+".122he may apply the ruleOutliers replacement method:ruleOutliers(res1[[4]]) <- list(level=0.95)An excerpt of the corresponding summary is shown below:** Data Quality **Number of points filtered from above: 168 (1.31%)Number of points filtered from below: 0 (0%)Rule of identifying outliers: 95% quantileNumber of outliers: 133 (1.03%)As shown in the summary, this rule is more stringent than the 90% quantilerule: 133 points (1:03%) are now called outliers, as opposed to 506 points(3:93%) in the default rule.Clusters 1, 3 and 4 in Figure 5.2 correspond to the lymphocyte popu-lation de ned with a manual gating strategy adopted in Brinkman et al.(2007). We then extract these three clusters to proceed with the second-stage analysis:GvHD2 <- split(GvHD, res1[[4]], population=list(lymphocyte=c(1,3,4), deadcells=2))The subsetting method split allows us to split the data into severalflowFrame’s representing the di erent cell populations. To extract the lym-phocyte population (clusters 1, 3 and 4), we may type GvHD2$lymphocyteor GvHD2[[1]], which is a flowFrame. By default, split removes out-liers upon extraction. The deadcells=2 list element is included above fordemonstration purpose; it is needed only if we want to extract the dead cellpopulation (cluster 2), too.In the second-stage analysis, in order to fully utilize the multidimension-ality of FCM data we cluster the lymphocyte population using all the four uorescence parameters, namely, anti-CD4 (FL1-H), anti-CD8 (FL2-H),anti-CD3 (FL3-H) and anti-CD8 (FL4-H), at once:res2 <- flowClust(GvHD2$lymphocyte, varNames=c("FL1-H","FL2-H", "FL3-H", "FL4-H"), K=1:15)123a71a71a71a71a71a71a71a71a71 a71a71 a71 a71 a71a712 4 6 8 10 12 14−450000−445000−440000−435000No. of clustersBICFigure 5.3: A plot of BIC against the number of clusters for the second-stagecluster analysis. The BIC curve remains relatively at beyond 11 clusters,suggesting that the model t using 11 clusters is appropriate.The BIC curve remains relatively at beyond 11 clusters (Figure 5.3), sug-gesting that the model with 11 clusters provides a good t. Figure 5.4(a)shows a contour plot superimposed on a scatterplot of CD8 against CD4for the sub-population of CD3-stained cells, which were selected based on athreshold obtained from a negative control sample (Brinkman et al., 2007).We can easily identify from it the red and purple clusters at the upper rightas the CD3+CD4+CD8 + cell population. A corresponding image plot isgiven by Figure 5.4(b). The code used to produce all the plots shown in thischapter can be found in Appendix C.124(a)0 100 200 300 400 500 6000100200300400500600CD4−HeightCD8ββ−Heighta71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71 a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71 a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71 a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71 a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71 a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71 a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71 a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71 a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71 a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71 a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71 a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71 a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71 a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71 a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71 a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71 a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71a71(b)0 100 200 300 400 500 6000100200300400500600CD4−HeightCD8ββ−HeightFigure 5.4: Plots of CD8 against CD4 for the CD3+ population. (a) Acontour plot is superimposed on a scatterplot. The red and purple clustersat the upper right correspond to the CD3+CD4+CD8 + cell population,indicative of the GvHD. (b) The ve clusters corresponding to the CD3+population can also be identi ed clearly on an image plot.125The example above shows how an FCM analysis is conducted with the aidof owClust. When the number of cell populations is not known in advance,and the BIC values are relatively close over a range of the possible numberof clusters, the researcher may be presented with a set of possible solutionsinstead of a clear-cut single one. In such a case, the level of automationmay be undermined as the researcher may need to select the best one basedon his expertise. We acknowledge that more e ort is needed to extend ourproposed methodology towards a higher level of automation. Currently, weare working on an approach which successively merges the clusters in thesolution as suggested by the BIC using some entropy criterion to give amore reasonable estimate of the number of clusters; see Section 6.2.3 formore details.5.3.2 Integration with owCoreAs introduced in Section 5.1, owClust has been built in a way suchthat it is highly integrated with the owCore package. The core func-tion flowClust which performs the clustering operation may be replacedby a call to the constructor tmixFilter creating a filter object similarto the ones used in other gating or ltering operations found in owCore(e.g., rectangleGate, norm2Filter, kmeansFilter). As an example, thecoderes1 <- flowClust(GvHD, varNames=c("FSC-H", "SSC-H"), K=1:8)used in the rst-stage analysis of the GvHD data may be replaced by:s1filter <- tmixFilter("lymphocyte", c("FSC-H", "SSC-H"), K=1:8)res1f <- filter(GvHD, s1filter)The use of a dedicated tmixFilter-class object separates the task of specify-ing the settings (tmixFilter) from the actual ltering operation (filter),facilitating the common scenario in FCM gating analysis that ltering withthe same settings is performed upon a large number of data les. Thefilter method returns a list object res1f with elements each of classtmixFilterResult, which directly extends the filterResult class de ned126in owCore. Users may apply various subsetting operations de ned for thefilterResult class in a similar fashion on a tmixFilterResult object. Forinstance,Subset(GvHD[,c("FSC-H", "SSC-H")], res1f[[4]])outputs a flowFrame that is the subset of the GvHD data upon the removalof outliers, consisting of the two selected parameters, FSC-H and SSC-H,only. Another example is given by the split method introduced earlier inSection 5.3.1.We realize that occasionally a researcher may opt to combine the useof owClust with ltering operations in owCore to de ne the wholesequence of an FCM gating analysis. To enable the exchange of resultsbetween the two packages, lters created by tmixFilter may be treatedlike those from owCore; users of owCore will nd that lter operators,namely, &, |, ! and %subset%, also work in the owClust package. Forinstance, suppose the researcher is interested in clustering the CD3+ cellpopulation which he de nes by constructing an interval gate with the lowerend-point at 280 on the CD3 parameter. He may use the following code toperform the analysis:rectGate <- rectangleGate(filterId="CD3+", "FL3-H"=c(280, Inf))s2filter <- tmixFilter("s2filter", c("FL1-H", "FL2-H", "FL3-H","FL4-H"), K=5)res2f <- filter(GvHD2$lymphocyte, s2filter %subset% rectGate)The constructors rectangleGate and tmixFilter create two filter ob-jects storing the settings of the interval gate and flowClust, respectively.When the last line of code is run, the interval gate will rst be applied tothe GvHD data. flowClust is then performed on a subset of the GvHDdata contained by the interval gate.5.4 Conclusion owClust is an R package dedicated to FCM gating analysis, addressingthe increasing demand for software capable of processing and analyzing the127voluminous amount of FCM data e ciently via an objective, reproducibleand automated means. The package implements a statistical clustering ap-proach using multivariate t mixture models with the Box-Cox transforma-tion introduced in Chapter 4, and provides tools to summarize and visualizeresults of the analysis. The package contributes to the cytometry commu-nity by o ering an e cient, automated analysis platform which facilitatesthe active, ongoing technological advancement.128BibliographyBrinkman, R. R., Gasparetto, M., Lee, S. J. J., Ribickas, A., Perkins, J.,Janssen, W., Smiley, R., and Smith, C. (2007). High-content ow cy-tometry and temporal data analysis for de ning a cellular signature ofGraft-versus-Host disease. Biology of Blood and Marrow Transplantation,13(6):691{700.Chambers, J. M. (2004). Programming with Data: a Guide to the S Lan-guage. Springer, New York, NY.Fraley, C. and Raftery, A. E. (2002). Model-based clustering, discriminantanalysis, and density estimation. Journal of the American Statistical As-sociation, 97(458):611{631.Gentleman, R. C., Carey, V. J., Bates, D. M., Bolstad, B., Dettling, M.,Dudoit, S., Ellis, B., Gautier, L., Ge, Y., Gentry, J., Hornik, K., Hothorn,T., Huber, W., Iacus, S., Irizarry, R., Leisch, F., Li, C., Maechler, M.,Rossini, A. J., Sawitzki, G., Smith, C., Smyth, G., Tierney, L., Yang, J.Y. H., and Zhang, J. (2004). Bioconductor: open software development forcomputational biology and bioinformatics. Genome Biology, 5(10):R80.Hahne, F., Le Meur, N., Brinkman, R. R., Ellis, B., Haaland, P., Sarkar, D.,Spidlen, J., Strain, E., and Gentleman, R. (2009). owCore: a Biocon-ductor package for high throughput ow cytometry. BMC Bioinformatics,10:106.Lizard, G. (2007). Flow cytometry analyses and bioinformatics: interest innew softwares to optimize novel technologies and to favor the emergenceof innovative concepts in cell research. Cytometry Part A, 71A:646{647.McLachlan, G. and Peel, D. (2000). Finite Mixture Models. Wiley-Interscience, New York.Peel, D. and McLachlan, G. J. (2000). Robust mixture modelling using thet distribution. Statistics and Computing, 10(4):339{348.129Sarkar, D., Le Meur, N., and Gentleman, R. (2008). Using owViz to visu-alize ow cytometry data. Bioinformatics, 24(6):878{879.Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statis-tics, 6:461{464.130Chapter 6Conclusion and FutureDirections6.1 Summary and DiscussionThe intent of this thesis is to develop statistical methodology based on exible forms of nite mixture models to address issues arising from high-throughput biological data sources. In Chapter 2, we introduced an empiri-cal Bayes approach to detect di erentially expressed genes from microarraydata, extending the hierarchical Gamma-Gamma and Lognormal-Normalmodels (Newton et al., 2001; Kendziorski et al., 2003). The extension re-sults in a release of the unreasonable assumption of a constant coe cientof variation for all genes, and has been shown to remarkably improve theoriginal framework. Next, in Chapter 3, we proposed a mixture modelingframework based on a new class of skewed distributions, the multivariatet distribution with the Box-Cox transformation. We emphasize on a con-current treatment of data transformation and outlier identi cation, insteadof tackling the two issues of mutual impact in a sequential manner. Theapproach is robust to both asymmetric components and outliers, and re-mains to be highly competitive in comparisons made with the computa-tionally much more complicated approach using the skew t mixture model.In Chapter 4, we reframed the gating analysis in ow cytometry (FCM)as a clustering problem, and applied the approach proposed in Chapter 3to automate the identi cation of cell populations. The result shows thatour approach is well adapted to FCM data, in which a high abundance ofoutliers is often observed. Moreover, our approach has an appeal of being131computationally competitive, which is crucial for FCM analysis consideringthe high dimensionality of data and the large number of samples usuallyinvolved. In recognition of concern of the FCM community which has beenseeking an automated analysis platform with a solid theoretical foundation,we have developed an open-source software package called flowClust, de-tails of which are delineated in Chapter 5. While flowClust is publiclyreleased as an R package, its core part that implements the model- ttingprocess is coded in C to ensure computational e ciency at users’ end. Tofacilitate the convenience of use, speci c e orts have been made to adaptto the custom of FCM researchers, such as developing tools of visualizationand for constructing data lters, or \gates". flowClust has been built ina way such that it directly accepts data in FCM-dedicated format, and iswell integrated with other Bioconductor FCM packages. The package isunder active maintenance for further enrichment in the modeling aspect,feature enhancement for presentation and dissemination of analysis results,and integration to existing and upcoming FCM analysis tools.From a monochromatic technology dated back to the late 1960’s, FCMhas evolved into a technology that can simultaneously measure nearly 20parameters for each cell (Perfetto et al., 2004). To date, the LSR II owcytometer from BD Biosciences (San Jose, California) can detect up to 18colors (corresponding to biomarkers such as antigens) in one experiment.To the accompaniment of technological advancement, the impact of FCMon a wealth of elds of biology and medicine has undergone tremendousgrowth in the last few years; see, for example, Valet and T arnok (2003),Valet (2005), Herrera et al. (2007) and Lizard (2007). We believe that FCMin the next few years will reach a level of prominence that microarray hasattained in the last decade. Along with the increase in dimensionality ofFCM data, it becomes apparent that the traditional way of gating analysisby relying on expertise in de ning a gating sequence and positioning thegates is ine cient. How to resolve the bottleneck of a lack of an analysisplatform to parallel such a high-throughput data generation platform hasbecome an open question among the FCM community. A pleasant trend hasbeen observed over the past one or two years, when more research work of132statistical methodology dedicated to FCM comes to light (e.g., Lugli et al.,2007; Boedigheimer and Ferbas, 2008; Chan et al., 2008; Lo et al., 2008; Pyneet al., 2009). Such an accelerating trend can also be observed from regu-lar meetings of the Flow Informatics and Computational Cytometry Society(FICCS) and other conferences. Since published in April 2008, our article(corresponding to Chapter 4 of this thesis) has been cited 18 times to dateaccording to the search result of Web of Science and Google Scholar. Mean-while, a steady overall increase in the download statistics for flowClust hasbeen observed from the Package Downloads Report at Bioconductor. Theseevidences provide a positive sign that our proposed methodology has thepotential for being a mainstream automated gating approach in an FCManalysis pipeline.6.2 Future DirectionsIn the remainder of this chapter, we brie y describe a few possible directionsfor future research, and report preliminary results therein.6.2.1 Robusti cation of the Empirical Bayes Model forDi erential Gene ExpressionThe extension we proposed in Chapter 2 allows for a gene-speci c coe cientof variation in the hierarchical empirical Bayes models originated from New-ton et al. (2001) and Kendziorski et al. (2003) for microarray data. Suchan enhanced exibility does not e ectively constitute a mechanism to ac-commodate outliers, though. An outlying data value could occur because ofscratches or dust on the surface, imperfections in the glass slide, or imperfec-tions in the array production. As a possible way to robustify the empiricalBayes approach, we may consider the eLNN formulation and replace thelognormal sampling distribution with a log t distribution. In other words,133we build a model with the following hierarchical representation:logxgr = gx + gxrpwgr gxj gx N(m;k 1gx ) gxrj gx N(0; 1gx ) gx Gamma( ; )wgr Gamma r2 ; r2 (6.1)where wgr and gxr are independent and therefore gxr=pwgr follows a cen-tral t distribution with scale matrix 1gx and degrees of freedom r. All othernotations in (6.1) follow the convention used in Chapter 2, and the modelspeci cation for ygr can be derived accordingly. If we x wgr = 1 for all gand r, the aforementioned model reduces to the eLNN model introduced inChapter 2.The joint prior on gx; gx and wgr is not conjugate to the samplingdistribution, and the marginal density cannot be derived in closed form.However, the marginal density is analytically available conditional on wgr.As a result, it is possible to proceed in a way similar to what we describedin Section 2.2.3 for the eGG model in which a closed-form marginal densityis available conditional on the gene-speci c shape parameter. We may takeaccordingly the log prior density of wg = (wg1;wg2;:::;wgR)0 as the penaltyterm, and consider the modi ed complete-data log-likelihood~lc( jx;y;z) = Xgnzg log pA(xg;ygj ;wgr) + (1 zg) log p0(xg;ygj ;wgr)+ (1 +zg) log(p) + (2 zg) log(1 p) +Xrlog (wgrj r)o;(6.2)where = (m;k; ; )0 and = (w1;w2;:::;wG; 0;p)0. Parameter esti-mation may then be handled by the EM-type algorithm described in Sec-tion 2.2.3 in which the M-step is split into two constrained maximizationsteps. This robust approach provides a favorable alternative to the fully134Bayesian approach BRIDGE (Gottardo et al., 2006), which takes a similarmodel speci cation but relies on MCMC techniques and is computationallyintensive.6.2.2 Development of an Automated FCM AnalysisPipelineThe analysis of FCM data usually involves two major components: (1) iden-tifying homogeneous cell populations (commonly referred to as gating), eachof which displays a particular biological function, and (2) nding correlationsbetween identi ed cell populations and clinical diagnosis. We presented inChapter 4 the statistical methodology based on robust model-based clus-tering to automate the gating process. An ensuing research focus would bedevising a methodology that extracts features from the result of the auto-mated gating analysis to facilitate disease diagnosis, and identi es biomark-ers that correlate with the disease. Essentially, we would like to develop apipeline, with minimal manual intervention, for the di erent stages of FCMdata analysis, including identi cation of cell populations, extraction of use-ful features (biomarkers) correlated with a target disease, and classi cationof samples. Figure 6.1 shows the overall ow of the proposed data analysispipeline (Bashashati et al., 2009).As a motivational example of FCM analysis which ts into such a pipeline,here we present our preliminary study on paroxysmal nocturnal hemoglobin-uria (PNH), a disease of red blood cell breakdown with release of hemoglobininto the urine. The objective of the study is to build a classi cation rulethat separates subjects according to their disease status (positive or nega-tive). A series of FCM samples were obtained from 17 PNH patients and 15controls. A complete set of samples for one subject includes two red bloodcell samples and three white blood cell samples. Each sample consists ofa distinct antigenic marker. Figure 6.2 shows two histograms from the redblood cell samples of a PNH patient and a control respectively. A distinctivesubpopulation of low intensities is found in the positive sample. This servesas the discriminating information for subject classi cation.135Figure 6.1: The overall ow of the proposed automated FCM analysispipeline.To quantify the discriminating information, we applied the methodol-ogy described in Chapter 4 to cluster each red blood cell sample into twosubpopulations. The separation between the two cluster means, that isexpected to be large for a positive sample, provides the basis of discrimi-nating the two groups of subjects. We proceeded in a similar manner foreach cell type, namely, granulocytes, lymphocytes and monocytes, identi edin the white blood cell samples. At the next-stage analysis, subjects wererepresented with the features of interest (i.e., the separation between twocluster means), or a subset of them, extracted from the clustering stage. Webuilt classi ers using support vector machines (SVM) (Sch olkopf and Smola,2002) with a linear kernel. Leave-one-out cross-validation was used to assessthe accuracy of the classi ers built. Classi ers with > 97% accuracy havebeen found, with a few features consistently found among them.The preliminary study on PNH presented a simpli ed scenario of typicalFCM analysis. Very often, we do not know the number of cell populationsin advance, and multiple colors are used in each sample. In such a case, abetter example is given by our current study in which we attempt to de-vise an analysis pipeline to discriminate subtypes of lymphoma and identifybiomarkers that contribute to such a classi cation (Bashashati et al., 2009).Data in this study were generated at the British Columbia Cancer Agency136(a) A positive sampleHistogram of datadataDensity0 200 400 6000.0000.0010.0020.0030.0040.005! !!! !! !! !!! ! !! ! !!!! ! !! ! ! !!!! !! !! !!!! !!! !!! ! !!!!! ! !! !!! ! !!! !! !! ! !!! !!! ! !!! ! !!! !! !! !! !! !!! !!!! !!! !! !! !(b) A control sampleHistogram of datadataDensity0 1002003004005006007000.0000.0010.0020.0030.0040.0050.006!! ! !!!! !!!!! !! !! !! ! !!!! !!! !! !! !! !!!! ! !! !!!! !! !!! !!! !! !!! !!! !! !! !!!! ! !! !! !! !! !! !!! !! !!! ! !!!! !!! !!!! !! !! !!!Figure 6.2: Clustering of red blood cell samples from the PNH data. Thegraphs shown are the histograms of CD55 from (a) a positive sample, and(b) a control sample. The presence of the distinctive subpopulation of lowintensities in the positive sample is also expected to be observed on some/allof the three basic cell types from positive white blood cell samples. A clinicaldiagnosis would determine a subject to be PNH positive if the distinctivesubpopulation is observed from at least two cell types.in 2002{2007. FCM samples were obtained from biopsies of lymph nodesfrom 438 lymphoma patients of di erent subtypes. Samples were dividedinto seven tubes, each of which was stained with a distinct set of three uorescently conjugated antibodies.To proceed, we rst apply the robust model-based clustering methodol-ogy to identify cell populations, and use the BIC to determine the numberof cell populations in each sample. Statistics such as the proportion, meanand variance for each cluster are extracted, resulting in a long list of can-didate features with discriminating information. The majority of featuresare expected to be uninformative, and in order to discard them we applythe minimum redundancy maximum relevance (mRMR) feature selectiontechnique (Peng et al., 2005; Ding and Peng, 2005). The mRMR techniqueaims at selecting features that are relevant to the class label (i.e., subtype oflymphoma) whilst minimizing the redundancy of the selected features; the137Euclidean distance or, more e ectively, the Pearson correlation coe cientbetween features may be taken as the redundancy measure. Based on theselected features, we build a classi er using SVM or random forest (Breiman,2001) to classify the samples, and make predictions about future incomingsamples.In our current attempt, we randomly split the samples into the trainingand testing sets. Samples in the training set are used to select informativefeatures and build the classi ers, while the training samples are used for per-formance evaluation. To date, the devised analysis pipeline is con ned to abinary classi cation, i.e., discrimination between two subtypes of lymphoma.Our preliminary result shows that 80%{96% accuracy has been achieved ina few binary classi cations performed. Features (in terms of markers) iden-ti ed to be informative are in line with previous biological ndings (Dogan,2005), providing promising evidence that the proposed analysis pipeline canextract biologically meaningful features from FCM data. Subsequent workwould be re ning the various components of the pipeline in order to achievehigher discriminating accuracy, and extending the methodology to facilitatemulti-class discriminations.6.2.3 Combining Mixture Components in ClusteringIn clustering analysis, very often the number of clusters is unknown andrequires estimation. There are several approaches for selecting the numberof components in model-based clustering, such as resampling, cross valida-tion, and various information criteria; see McLachlan and Peel (2000) for areview. In this thesis, our approach to the problem is based on the BIC.Model selection based on the BIC has been shown not to underestimatethe number of clusters asymptotically (Leroux, 1992). Moreover, the BICis computationally cheap to compute once maximum likelihood estimationof the model parameters has been completed, an advantage over other ap-proaches, especially in the context of FCM where datasets tend to be verylarge. Nevertheless, if the correct model is not in the family of models beingconsidered, more than one mixture component may be needed to provide a138reasonable representation of an individual cluster of data. In such a case,the BIC tends to select an excessive number of components relative to thecorrect number of clusters (Biernacki and Govaert, 1997; Biernacki et al.,2000). Biernacki et al. (2000) attempted to rectify this problem by propos-ing an alternative to the BIC based on the integrated completed likelihood(ICL). The ICL criterion turns out to be equivalent to the BIC penalizedby the entropy of the corresponding clustering:ICLG = BICG 2 ENTG; (6.3)whereENTG = nXi=1GXg=1^zig log ^zig (6.4)is the entropy for the corresponding G-component mixture model, and ^zigis the conditional probability that the i-th observation arises from the g-thcomponent. The entropy ENTG is a measure of relevance of the G com-ponents from a mixture model to the partition of data. Conceptually, itincreases with the scale of overlap between the components. In consequence,the ICL favors models with well-separated mixture components. In prac-tice, however, the ICL tends to underestimate the correct number of clusters(Murphy and Martin, 2003). Such a tendency was also observed when weattempted to apply the ICL to FCM data.In a current attempt, we propose an approach for selecting the number ofclusters by combining the ideas underlying the BIC and ICL (Baudry et al.,2008). The BIC is used to select the number of components in the mixturemodel in order to provide a good representation of data. We then de nea sequence of possible solutions by hierarchical merging of the componentsidenti ed by the BIC. The decision about which components to merge isbased on the same entropy criterion given by Eq.(6.4) that the ICL uses. Inthis way, we propose a way of interpreting the mixture model by identifyingthe set of merged components as one cluster.In the following, we describe in details the hierarchical merging scheme.At each stage, we choose two mixture components to be merged so as to139minimize the entropy of the resulting clustering. If components k and k0from a G-component solution are merged, the conditional probability ^zigwill remain the same for every g except for k and k0. The new cluster k[k0then has the following conditional probability:^zi;k[k0 = ^zik + ^zik0: (6.5)The entropy for the resulting (G 1)-cluster solution is nXi=18<:Xg6=k;k0^zig log ^zig + ^zi;k[k0 log ^zi;k[k09=;: (6.6)The two components k and k0 to be merged are those minimizing the crite-rion nXi=1 ^zik log ^zik + ^zik0 log ^zik0 ^zi;k[k0 log ^zi;k[k0 among all possible pairs of components. Components in the model selectedby the BIC are successively merged by repeating the aforementioned proce-dure, until the data are reduced to one single cluster.The proposed approach yields one solution for each value ofg = 1;2;:::;G,and the user can choose between them on substantive grounds. If a moreautomated procedure is desired for choosing a single solution, one possibilityis to select, among the possible solutions, the solution providing the numberof clusters selected by the ICL. An alternative is to detect an \elbow" on theentropy curve, i.e., the graph of entropy against the number of clusters. In-tuitively, when mixture components overlap signi cantly, the correspondingentropy will be large. As overlapping components are combined in subse-quent stages of the hierarchical merging scheme, the entropy will decrease.When only well-separated components are left in the clustering solution, fur-ther merging will incur little reduction in the resultant entropy. This ideahas been formalized by Finak et al. (2009) in which a changepoint analysisis performed. On setting the changepoint at g = 2;3;:::;G 1 in turn, aseries of two-segment piecewise linear regression models is used to t the140entropy curve. The optimal location ~g of the changepoint is determined bythe regression model with the minimum residual sum of squares. Finally, thepresence or absence of such a changepoint may be determined by comparingthe two-segment piecewise regression model with a simple linear regressionmodel via the BIC or ANOVA. If the result is in favor of the two-segmentpiecewise regression model, the proposed hierarchical merging scheme is ableto select a ~g-cluster solution as the optimal.141BibliographyBashashati, A., Lo, K., Gottardo, R., Gascoyne, R. D., Weng, A., andBrinkman, R. (2009). A pipeline for automated analysis of ow cytometrydata: Preliminary results on lymphoma sub-type diagnosis. ConferenceProceedings of the IEEE Engineering in Medicine and Biology Society, (Inpress).Baudry, J.-P., Raftery, A. E., Celeux, G., Lo, K., and Gottardo, R. (2008).Combining mixture components for clustering. Submitted to Journal ofComputational and Graphical Statistics.Biernacki, C., Celeux, G., and Govaert, G. (2000). Assessing a mixturemodel for clustering with the integrated completed likelihood. IEEETransactions on Pattern Analysis and Machine Intelligence, 22(7):719{725.Biernacki, C. and Govaert, G. (1997). Using the classi cation likelihood tochoose the number of clusters. Computing Science and Statistics, 29:451{457.Boedigheimer, M. J. and Ferbas, J. (2008). Mixture modeling approach to ow cytometry data. Cytometry Part A, 73A(5):421{429.Breiman, L. (2001). Random forests. Machine Learning, 45:5{32.Chan, C., Feng, F., Ottinger, J., Foster, D., West, M., and Kepler, T. B.(2008). Statistical mixture modeling for cell subtype identi cation in owcytometry. Cytometry Part A, 73A:693{701.Ding, C. and Peng, H. (2005). Minimum redundancy feature selection frommicroarray gene expression data. Journal of Bioinformatics and Compu-tational Biology, 3(2):185{205.Dogan, A. (2005). Modern histological classi cation of low grade B-celllymphomas. Best Practice and Research: Clinical Haematology, 18(1):11{26.142Finak, G., Bashashati, A., Brinkman, R., and Gottardo, R. (2009). Mergingmixture components for cell population identi cation in ow cytometry.Submitted to Advances in Bioinformatics.Gottardo, R., Raftery, A. E., Yeung, K. Y., and Bumgarner, R. E. (2006).Bayesian robust inference for di erential gene expression in microarrayswith multiple samples. Biometrics, 62(1):10{18.Herrera, G., Diaz, L., Martinez-Romero, A., Gomes, A., Villamon, E.,Callaghan, R. C., and O’Connor, J. E. (2007). Cytomics: a multiparamet-ric, dynamic approach to cell research. Toxicology In Vitro, 21(2):176{182.Kendziorski, C., Newton, M., Lan, H., and Gould, M. N. (2003). On para-metric empirical Bayes methods for comparing multiple groups using repli-cated gene expression pro les. Statistics in Medicine, 22:3899{3914.Leroux, M. (1992). Consistent estimation of a mixing distribution. Annalsof Statistics, 20:1350{1360.Lizard, G. (2007). Flow cytometry analyses and bioinformatics: interest innew softwares to optimize novel technologies and to favor the emergenceof innovative concepts in cell research. Cytometry Part A, 71A:646{647.Lo, K., Brinkman, R. R., and Gottardo, R. (2008). Automated gating of ow cytometry data via robust model-based clustering. Cytometry PartA, 73A(4):321{332.Lugli, E., Pinti, M., Nasi, M., Troiano, L., Ferraresi, R., Mussi, C., Salvioli,G., Patsekin, V., Robinson, J. P., Durante, C., Cocchi, M., and Cos-sarizza, A. (2007). Subject classi cation obtained by cluster analysis andprincipal component analysis applied to ow cytometric data. CytometryPart A, 71A:334{344.McLachlan, G. and Peel, D. (2000). Finite Mixture Models. Wiley-Interscience, New York.143Murphy, T. B. and Martin, D. (2003). Mixtures of distance-based models forranking data. Computational Statistics and Data Analysis, 41(3{4):645{655.Newton, M. C., Kendziorski, C. M., Richmond, C. S., Blattner, F. R., andTsui, K. W. (2001). On di erential variability of expression ratios: im-proving statistical inference about gene expression changes from microar-ray data. Journal of Computational Biology, 8:37{52.Peng, H., Long, F., and Ding, C. (2005). Feature selection based on mu-tual information: criteria of max-dependency, max-relevance, and min-redundancy. IEEE Transactions on Pattern Analysis and Machine Intel-ligence, 27(8):1226{1338.Perfetto, S. P., Chattopadhyay, P. K., and Roederer, M. (2004). Seventeen-colour ow cytometry: unravelling the immune system. Nature ReviewsImmunology, 4(8):648{655.Pyne, S., Hu, X., Wang, K., Rossin, E., Lin, T.-I., Maier, L. M., Baecher-Allan, C., McLachlan, G. J., Tamayo, P., Ha er, D. A., De Jager, P. L.,and Mesirov, J. P. (2009). Automated high-dimensional ow cytometricdata analysis. Proceedings of the National Academy of Sciences of theUnited States of America, 106(21):8519{8524.Sch olkopf, B. and Smola, A. J. (2002). Learning with Kernels: SupportVector Machines, Regularization, Optimization and Beyond. The MITPress, Cambridge, Massachusetts.Valet, G. (2005). Human cytome project, cytomics, and systems biology:the incentive for new horizons in cytometry. Cytometry Part A, 64A:1{2.Valet, G. K. and T arnok, A. (2003). Cytomics in predictive medicine. Cy-tometry Part B: Clinical Cytometry, 53B(1):1{3.144Appendix AAdditional Material forChapter 2A.1 Marginal Densities of Measured IntensitiesUnder the extended GG model, the joint marginal densities of measuredintensities of a given gene g are developed without integrating ag away,i.e., they are conditional on ag. Denote by G(x;a;b) the Gamma densityfunction with shape a and rate b. The explicit forms of the conditionalmarginal densities are given bypA(xg;ygj ;ag) =Z 10RYr=1G(xgr;ag; gx) G( gx; ) d gx Z 10RYr=1G(ygr;ag; gy) G( gy; ) d gy= (Rag +a0) R(ag) (a0) 2 2a0 Qrxgr ygr ag 1 ( +Prxgr)( +Prygr) Rag+a0(A.1)andp0(xg;ygj ;ag) =Z 10RYr=1G(xgr;ag; g)RYr=1G(ygr;ag; g) G( g; ) d g= (2Rag +a0) 2R(ag) (a0) a0 Qrxgr ygr ag 1 +Prxgr +Prygr 2Rag+a0; (A.2)where = (a0; )0.145The joint marginal densities of measured intensities under the extendedLNN model are developed in a similar fashion, this time by integrating gand g away. Denote by LN(x;a;b) the Lognormal density function withmean and variance parameters a and b respectively, and by N(x;a;b) thenormal density function. The marginal densities are developed as follows:pA(xg;ygj )=Z 10Z 1 1YrLN(xgr; gx; 1gx ) N( gx;m;k 1gx )G( gx; ; ) d gxd gx Z 10Z 1 1YrLN(ygr; gy; 1gy ) N( gy;m;k 1gy )G( gy; ; ) d gyd gy= 2 2 R2 + Qrxgr ygr (2 )R (kR+ 1) 2( ) ( + 12k" kPr logxgr +m 2kR+ 1 +kPr(logxgr)2 +m2#) (R2 + ) ( + 12k" kPr logygr +m 2kR+ 1 +kPr(logygr)2 +m2#) (R2 + )(A.3)andp0(xg;ygj )=Z 10Z 1 1YrLN(xgr; g; 1g )YrLN(ygr; g; 1g ) N( g;m;k 1g ) G( g; ; ) d gd g= (R+ ) Qrxgr ygr (2 )R(2kR+ 1)12 ( ) ( + 12k k Pr logxgr +Pr logygr +m 22kR+ 1+khPr(logxgr)2 +Pr(logygr)2i+m2!) (R+ ); (A.4)146where = (m;k; ; )0.A.2 Estimation of and for the Prior of agAs mentioned in Section 2.2.3, to make use of the modi ed complete-datalog-likelihood given by Eq.(2.4) in the extended GG model we need to pro-vide estimates of the hyperparameters for the Lognormal( ; ) prior of agbeforehand. Here we propose to use the method of moments (MM) to esti-mate and . First we would like to come up with simple estimates of theag’s. On noting that the coe cient of variation is given by 1=pag for eachgene, a robust empirical estimate of ag may be provided by~ag = med(xg;yg)2mad(xg;yg)2;where med and mad stand for median and median absolute deviation re-spectively. Note that a robust counterpart to mean and standard deviationis adopted since there are usually relatively few replicates. With these crudeestimates of ag’s, we can then obtain the estimates of and :^ = med(flog ~agg) and ^ = mad(flog ~agg)2:Again, a robust version of MM is proposed here.A.3 Initialization of the EM AlgorithmWe need to initialize the parameters to be estimated before the EM-typealgorithm described in Section 2.2.3 can be applied. Similar to the esti-mation for and above, robust MM estimates of (a;a0; ) are obtainedfor the extended GG model. Similar measure is taken for (m; ; ) if thedata are modeled under the extended LNN framework, while k is empiricallychosen to be 30. After the crude estimation step, updated estimates of theaforementioned parameters are obtained on maximizing the correspondingmarginal null log-likelihood under either model formulation. This step is147taken in order to bring the initial estimates closer to the estimates returnedby the EM algorithm. Using these initial estimates together with p set as0:5, the most likely value under the Beta(2;2) prior, initial estimates of zg’sare obtained, which are then used to update the parameter estimates in theEM algorithm.148Appendix BVignette of the owClustPackageB.1 LicensingUnder the Artistic License, you are free to use and redistribute this software.However, we ask you to cite the following papers if you use this software forpublication.1. Lo, K., Brinkman, R. R., and Gottardo, R. (2008). Automated gatingof ow cytometry data via robust model-based clustering. CytometryPart A, 73A(4):321{332.2. Lo, K., Hahne, F., Brinkman, R. R., and Gottardo, R. (2009). ow-Clust: a Bioconductor package for automated gating of ow cytometrydata. BMC Bioinformatics, 10:145.B.2 OverviewWe apply a robust model-based clustering approach proposed by Lo et al.(2008) to identify cell populations in ow cytometry data. The proposedapproach is based on multivariate t mixture models with the Box-Cox trans-formation. This approach generalizes Gaussian mixture models by modelingoutliers using t distributions and allowing for clusters taking non-ellipsoidalshapes upon proper data transformation. Parameter estimation is carriedout using an Expectation-Maximization (EM) algorithm which simultane-ously handles outlier identi cation and transformation selection. Please referto Lo et al. (2008) for more details.149This owClust package consists of a core function to implement theaforementioned clustering methodology. Its source code is built in C foroptimal utilization of system resources. Graphical functionalities are avail-able to users for visualizing a wealth of features of the clustering results,including the cluster assignment, outliers, and the size and shape of theclusters. The tted mixture model may be projected onto any one/two di-mensions and displayed by means of a contour or image plot. Currently, owClust provides two options for estimating the number of clusters whenit is unknown, namely, the Bayesian Information Criterion (BIC) and theIntegrated Completed Likelihood (ICL). owClust is built in a way such that it is highly integrated with ow-Core, the core package for ow cytometry that provides data structuresand basic manipulation of ow cytometry data. Please read Section B.4.3for details about actual implementation.B.3 InstallationB.3.1 Unix/Linux/Mac UsersTo build the owClust package from source, make sure that the followingis present on your system: a C compiler GNU Scienti c Library (GSL) Basic Linear Algebra Subprograms (BLAS)A C compiler is needed to build the package as the core function is codedin C. GSL can be downloaded at http://www.gnu.org/software/gsl/.In addition, the package uses BLAS to perform basic vector and matrixoperations. Please go to http://www.netlib.org/blas/faq.html#5 for alist of optimized BLAS libraries for a variety of computer architectures. Forinstance, Mac users may use the built-in vecLib framework, while users ofIntel machines may use the Math Kernel Library (MKL).150For the package to be installed properly you may have to type the fol-lowing command before installation:export LD_LIBRARY_PATH=’/path/to/GSL/:/path/to/BLAS/’:$LD_LIBRARY_PATHwhich will tell R where your GSL and BLAS libraries are. Note that thismay have already been con gured on your system, so you may not have todo so. In case you need to do it, you may consider including this line inyour .bashrc such that you do not have to type it every time.If GSL is installed to some non-standard location such that it cannotbe found when installing owClust, you may set the environment variableGSL CONFIG to point to the correct copy of gsl-config, for example,export GSL_CONFIG=’/global/home/username/gsl-1.12/bin/gsl-config’For convenience sake, this line may also be added to .bashrc.Now you are ready to install the package:R CMD INSTALL flowClust_x.y.z.tar.gzThe package will look for a BLAS library on your system, and by defaultit will choose gslcblas, which is not optimized for your system. To use anoptimized BLAS library, you can use the --with-blas argument which willbe passed to the configure.ac le. For example, on a Mac with vecLibpre-installed the package may be installed via:R CMD INSTALL flowClust_x.y.z.tar.gz --configure-args="--with-blas=’-framework vecLib’"On a 64-bit Intel machine which has MKL as the optimized BLAS library,the command may look like:R CMD INSTALL flowClust\_x.y.z.tar.gz --configure-args="--with-blas=’-L/usr/local/mkl/lib/em64t/ -lmkl -lguide -lpthread’"where /usr/local/mkl/lib/em64t/ is the path to MKL.If you prefer to install a prebuilt binary, you need GSL for successfulinstallation.151B.3.2 Windows UsersYou need the GNU Scienti c Library (GSL) for the owClust package.GSL is freely available at http://gnuwin32.sourceforge.net/packages/gsl.htm for Windows distributions.To install a prebuilt binary of owClust and to load the package suc-cessfully you need to tell R where to link GSL. You can do that by adding/path/to/libgsl.dll to the Path environment variable. To add this youmay right click on \My Computer", choose \Properties", select the \Ad-vanced" tab, and click the button \Environment Variables". In the dialogbox that opens, click \Path" in the variable list, and then click \Edit". Add/path/to/libgsl.dll to the \Variable value" eld. It is important thatthe le path does not contain any space characters; to avoid this you maysimply use the short forms (8.3 DOS le names) found by typing dir /x atthe Windows command line. For example, the following may be added tothe Path environment variable:C:/PROGRA~1/GNUWIN32/binand the symbol ; is used to separate it from existing paths.To build owClust from source (using Rtools), in addition to adding/path/to/libgsl.dll to Path, you need to tell owClust where yourGSL library and header les are. You can do that by setting up two en-vironment variables GSL LIB and GSL INC with the correct path to the li-brary les and header les respectively. You can do this by going to the\Environment Variables" dialog box as instructed above and then click-ing the \New" button. Enter GSL LIB in the \Variable name" eld, and/path/to/your/gsl/lib/directory in the \Variable value" eld. Like-wise, do this for GSL INC and /path/to/your/gsl/include/directory.Remember to use \/" instead of \\" as the directory delimiter.You can download Rtools at http://www.murdoch-sutherland.com/Rtools/ which provides the resources for building R and R packages. Youshould add to the Path variable the paths to the various components ofRtools. Please read the \Windows Toolset" appendix at http://cran.152r-project.org/doc/manuals/R-admin.html#The-Windows-toolset formore details.B.4 Example: Clustering of the RituximabDatasetB.4.1 The Core FunctionTo demonstrate the functionality we use a ow cytometry dataset froma drug-screening project to identify agents that would enhance the anti-lymphoma activity of Rituximab, a therapeutic monoclonal antibody. Thedataset is an object of class flowFrame; it consists of eight parameters,among them only the two scattering parameters (FSC.H, SSC.H) and two uorescence parameters (FL1.H, FL3.H) are of interest in this experiment.Note that, apart from a typical matrix or data.frame object, owClustmay directly take a flowFrame, the standard R implementation of an FCS le, which may be returned from the read.FCS function in the owCorepackage, as data input. The following code performs an analysis with onecluster using the two scattering parameters:> library(flowClust)> data(rituximab)> summary(rituximab)FSC.H SSC.H FL1.H FL2.H FL3.H FL1.A FL1.W TimeMin. 59.0 11.0 0.0 0.0 1.0 0.00 0.0 21st Qu. 178.0 130.0 197.0 55.0 150.0 0.00 0.0 140Median 249.0 199.0 244.0 116.0 203.0 0.00 0.0 285Mean 287.1 251.8 349.2 126.4 258.3 73.46 17.6 2943rd Qu. 331.0 307.0 445.0 185.0 315.0 8.00 0.0 451Max. 1023.0 1023.0 974.0 705.0 1023.0 1023.00 444.0 598> res1 <- flowClust(rituximab, varNames=c("FSC.H", "SSC.H"),K=1, B=100)153B is the maximum number of EM iterations; for demonstration purpose herewe set a small value for B. The main purpose of performing an analysiswith one cluster here is to identify outliers, which will be removed fromsubsequent analysis.Next, we would like to proceed with an analysis using the two uores-cence parameters on cells selected from the rst stage. The following codeperforms the analysis with the number of clusters being xed from one tosix in turn:> rituximab2 <- rituximab[rituximab %in% res1,]> res2 <- flowClust(rituximab2, varNames=c("FL1.H", "FL3.H"),K=1:6, B=100)We select the best model based on the BIC. Values of the BIC can be re-trieved through the criterion method. By inspection, the BIC values stayrelatively constant beyond three clusters. We therefore choose the modelwith three clusters and print a summary of the corresponding clusteringresult:> summary(res2[[3]])** Experiment Information **Experiment name: Flow ExperimentVariables used: FL1.H FL3.H** Clustering Summary **Number of clusters: 3Proportions: 0.2658702 0.5091045 0.2250253** Transformation Parameter **lambda: 0.4312673** Information Criteria **Log likelihood: -16475.41BIC: -33080.88ICL: -34180.67** Data Quality **Number of points filtered from above: 0 (0%)154Number of points filtered from below: 0 (0%)Rule of identifying outliers: 90% quantileNumber of outliers: 96 (6.99%)Uncertainty summary:Min. 1st Qu. Median Mean 3rd Qu. Max.0.0005699 0.0153000 0.1669000 0.2019000 0.3738000 0.5804000The summary states that the rule used to identify outliers is 90% quantile,which means that a point outside the 90% quantile region of the cluster towhich it is assigned will be called an outlier. To specify a di erent rule, wemake use of the ruleOutliers replacement method. The example belowapplies the more conservative 95% quantile rule to identify outliers:> ruleOutliers(res2[[3]]) <- list(level=0.95)Rule of identifying outliers: 95% quantile> summary(res2[[3]])...** Data Quality **Number of points filtered from above: 0 (0%)Number of points filtered from below: 0 (0%)Rule of identifying outliers: 95% quantileNumber of outliers: 35 (2.55%)We can also combine the rule set by the z.cutoff argument to identifyoutliers. Suppose we would like to assign an observation to a cluster onlyif the associated posterior probability is greater than 0:6. We can add thisrule with the following command:> ruleOutliers(res2[[3]]) <- list(z.cutoff=0.6)Rule of identifying outliers: 95% quantile,probability of assignment < 0.6> summary(res2[[3]])155...** Data Quality **Number of points filtered from above: 0 (0%)Number of points filtered from below: 0 (0%)Rule of identifying outliers: 95% quantile,probability of assignment < 0.6Number of outliers: 317 (23.07%)This time more points are called outliers. Note that such a change of the rulewill not incur a change of the model- tting process. The information aboutwhich points are called outliers is conveyed through the flagOutliers slot,a logical vector in which the positions of TRUE correspond to points beingcalled outliers.By default, when 10 or more points accumulate on the upper or lowerboundary of any parameter, the flowClust function will lter those points.To change the threshold count from the default, users may specify max.countand min.count when running flowClust. To suppress ltering at the upperand/or the lower boundaries, set max.count and/or min.count as 1. Wecan also use the max and min arguments to control ltering of points, butfrom a di erent perspective. For instance, if we are only interested in cellswhich have a FL1.H measurement within (0;400) and FL3.H within (0;800),we may use the following code to perform the cluster analysis:> flowClust(rituximab2, varNames=c("FL1.H", "FL3.H"), K=2,B=100, min=c(0,0), max=c(400,800))B.4.2 Visualization of Clustering ResultsInformation such as the cluster assignment, cluster shape and outliers maybe visualized by calling the plot method to make a scatterplot:> plot(res2[[3]], data=rituximab2, level=0.8, z.cutoff=0)Rule of identifying outliers: 80% quantile1560 200 400 600 80002004006008001000FL1.HFL3.HThe level and/or z.cutoff arguments are needed when we want to applya rule di erent from that stored in the ruleOutliers slot of the flowClustobject to identify outliers.To look for densely populated regions, a contour/image plot can be made:> res2.den <- density(res2[[3]], data=rituximab2)> plot(res2.den)157FL1.HFL3.H 2e−06 2e−06 4e−06 6e−06 8e−06 1e−05 1.2e−05 1.4e−05 1.8e−05 2.4e−05 2.6e−05 0 200 400 600 80002004006008001000> plot(res2.den, type="image")0 200 400 600 80002004006008001000FL1.HFL3.H158When we want to examine how the tted model and/or the data aredistributed along one chosen dimension, we can use the hist method:> hist(res2[[3]], data=rituximab2, subset="FL1.H")FL1.HDensity0 200 400 600 8000.0000.0010.0020.0030.0040.0050.006a71 a71a71a71a71 a71a71 a71 a71 a71a71 a71a71a71 a71a71a71 a71a71 a71 a71a71a71 a71a71 a71a71a71 a71a71a71a71 a71a71a71a71 a71a71 a71a71a71 a71a71 a71 a71a71 a71a71a71a71 a71 a71a71a71 a71a71a71 a71a71 a71a71a71 a71a71 a71 a71a71a71 a71a71 a71a71a71a71 a71a71a71 a71 a71a71a71 a71a71a71a71 a71 a71a71 a71a71a71a71 a71 a71a71 a71 a71a71a71 a71a71a71a71 a71 a71a71 a71a71 a71a71a71 a71a71a71a71 a71a71a71 a71 a71 a71a71a71a71 a71 a71a71a71 a71a71 a71 a71 a71a71 a71a71 a71a71 a71a71 a71 a71a71 a71 a71a71a71a71a71 a71a71a71 a71 a71a71a71a71a71a71 a71 a71a71a71 a71a71a71 a71 a71a71a71 a71a71 a71a71 a71 a71a71a71 a71 a71a71 a71 a71a71 a71a71a71 a71a71a71a71 a71 a71 a71a71 a71a71a71a71 a71 a71a71a71a71 a71 a71a71 a71a71 a71a71a71 a71a71 a71a71 a71 a71a71 a71a71a71a71a71 a71 a71a71 a71a71a71 a71a71 a71a71a71a71 a71 a71a71 a71a71 a71a71a71 a71 a71a71 a71a71 a71a71a71 a71a71 a71a71 a71a71 a71a71a71 a71a71a71 a71a71 a71a71a71 a71a71 a71a71a71a71 a71a71 a71 a71a71a71 a71a71 a71a71 a71 a71 a71a71 a71 a71a71 a71 a71a71 a71a71 a71a71a71 a71a71a71a71 a71a71a71 a71a71a71 a71 a71 a71a71 a71a71a71 a71a71a71 a71 a71a71 a71a71 a71 a71a71a71 a71a71 a71a71 a71a71 a71a71a71 a71a71 a71a71 a71a71 a71a71a71 a71a71a71 a71a71 a71a71a71 a71 a71 a71a71a71a71a71The subset argument may also take a numeric value:> hist(res2[[3]], data=rituximab2, subset=1)Since FL1.H is the rst element of res2[[3]]@varNames, this line pro-duces exactly the same histogram as the one generated by the line takingsubset="FL1.H". Likewise, the subset argument of both plot methodsaccepts either a numeric or a character vector to specify which two variablesare to be shown on the plot.B.4.3 Integration with owCoreAs mentioned in Overview, e ort has been made to integrate owClustwith the owCore package. Users will nd that most methods de ned in owCore also work in the context of owClust.159The very rst step of integration is to replace the core function flowClustwith a call to the constructor tmixFilter followed by the filter method.The aim is to wrap clustering in a ltering operation like those found in ow-Core. The tmixFilter function creates a filter object to store all settingsrequired for the ltering operation. The object created is then passed to theactual ltering operation implemented by the filter method. The use of adedicated tmixFilter-class object separates the task of specifying the set-tings (tmixFilter) from the actual ltering operation (filter), facilitatingthe common scenario in FCM gating analysis that ltering with the samesettings is performed upon a set of data les.As an example, the ltering operation that resembles the second-stageclustering using FL1.H and FL3.H with three clusters (see Section B.4.1) isimplemented by the following code:> s2filter <- tmixFilter("s2filter", c("FL1.H", "FL3.H"), K=3,B=100)> res2f <- filter(rituximab2, s2filter)The object res2f is of class tmixFilterResult, which extends themultipleFilterResult class de ned in owCore. Users may apply vari-ous subsetting operations de ned for the multipleFilterResult class in asimilar fashion on a tmixFilterResult object:> Subset(rituximab2, res2f)flowFrame object ’A02’with 1267 cells and 8 observables:name desc range$P1 FSC.H FSC-Height 1024$P2 SSC.H Side Scatter 1024$P3 FL1.H Anti-BrdU FITC 1024$P4 FL2.H <NA> 1024$P5 FL3.H 7 AAD 1024$P6 FL1.A <NA> 1024$P7 FL1.W <NA> 1024160$P8 Time Time (204.80 sec.) 1024135 keywords are stored in the ’descripton’ slot> split(rituximab2, res2f, population=list(sc1=1:2, sc2=3))$sc1flowFrame object ’A02 (1,2)’with 976 cells and 8 observables:name desc range$P1 FSC.H FSC-Height 1024$P2 SSC.H Side Scatter 1024$P3 FL1.H Anti-BrdU FITC 1024$P4 FL2.H <NA> 1024$P5 FL3.H 7 AAD 1024$P6 FL1.A <NA> 1024$P7 FL1.W <NA> 1024$P8 Time Time (204.80 sec.) 10243 keywords are stored in the ’descripton’ slot$sc2flowFrame object ’A02 (3)’with 291 cells and 8 observables:name desc range$P1 FSC.H FSC-Height 1024$P2 SSC.H Side Scatter 1024$P3 FL1.H Anti-BrdU FITC 1024$P4 FL2.H <NA> 1024$P5 FL3.H 7 AAD 1024$P6 FL1.A <NA> 1024$P7 FL1.W <NA> 1024$P8 Time Time (204.80 sec.) 1024136 keywords are stored in the ’descripton’ slot161The Subset method above outputs a flowFrame consisting of observationswithin the data-driven lter constructed. The split method separates thedata into two populations upon the removal of outliers: the rst populationis formed by observations assigned to clusters 1 and 2 constructed by the lter, and the other population consists of observations assigned to cluster 3.The two populations are returned as two separate flowFrame’s, which arestored inside a list and labelled with sc1 and sc2 respectively.The %in% operator from owCore is also de ned for a tmixFilterResultobject. A logical vector will be returned in which a TRUE value means thatthe corresponding observation is accepted by the lter. In fact, the imple-mentation of the Subset method needs to call %in%.The object returned by tmixFilter is of class tmixFilter, which ex-tends the filter class in owCore. Various operators, namely, &, |, ! and%subset%, which have been constructed for filter objects in owCore,also produce similar outcomes when applied to a tmixFilter object. Forexample, to perform clustering on a subset of data enclosed by a rectanglegate, we may apply the following code:> rectGate <- rectangleGate(filterId="rectRegion","FL1.H"=c(0, 400), "FL3.H"=c(0, 800))> MBCfilter <- tmixFilter("MBCfilter", c("FL1.H", "FL3.H"),K=2, B=100)> filter(rituximab2, MBCfilter %subset% rectGate)A filterResult produced by the filter named ’MBCfilter inrectRegion’162Appendix CCode to Produce the Plots inChapter 5# Figure 5.1plot(criterion(res1, "BIC"), xlab="No. of clusters", ylab="BIC",type="b", pch=2)points(criterion(res1s, "BIC"), type="b", pch=3, col=2)legend("bottomright", col=1:2, pch=2:3, lty=1,legend=c(expression(paste("global ", lambda)),expression(paste("cluster-specific ", lambda))))# Figure 5.2plot(res1[[4]], data=GvHD, pch.outliers="+", xlab="FSC-Height",ylab="SSC-Height")legend("bottomright", col=2:5, legend=1:4, title="Clusters",pch=20)# Figure 5.3plot(criterion(res2, "BIC"), xlab="No. of clusters", ylab="BIC",type="b")# Figure 5.4(a)CD3p <- which(getEstimates(res2[[11]])$locations[,3] > 280)plot(res2[[11]], data=GvHD2$lymphocyte, include=CD3p, ellipse=F,pch=20, xlab="CD4-Height", ylab=expression(paste("CD8", beta,"-Height")))163res2d <- density(res2[[11]], data=GvHD2$lymphocyte, include=CD3p)plot(res2d, drawlabels=F, add=T, nlevels=20)# Figure 5.4(b)plot(res2d, type="image", nlevels=100, xlab="CD4-Height", ylab=expression(paste("CD8", beta, "-Height")))164
- Library Home /
- Search Collections /
- Open Collections /
- Browse Collections /
- UBC Theses and Dissertations /
- Statistical methods for high throughput genomics
Open Collections
UBC Theses and Dissertations
Featured Collection
UBC Theses and Dissertations
Statistical methods for high throughput genomics Lo, Chi Ho 2009
pdf
Page Metadata
Item Metadata
Title | Statistical methods for high throughput genomics |
Creator |
Lo, Chi Ho |
Publisher | University of British Columbia |
Date | 2009 |
Date Issued | 2009-10-08T20:16:29Z |
Description | The advancement of biotechnologies has led to indispensable high-throughput techniques for biological and medical research. Microarray is applied to monitor the expression levels of thousands of genes simultaneously, while flow cytometry (FCM) offers rapid quantification of multi-parametric properties for millions of cells. In this thesis, we develop approaches based on mixture modeling to deal with the statistical issues arising from both high-throughput biological data sources. Inference about differential expression is a typical objective in analysis of gene expression data. The use of Bayesian hierarchical gamma-gamma and lognormal-normal models is popular for this type of problem. Some unrealistic assumptions, however, have been made in these frameworks. In view of this, we propose flexible forms of mixture models based on an empirical Bayes approach to extend both frameworks so as to release the unrealistic assumptions, and develop EM-type algorithms for parameter estimation. The extended frameworks have been shown to significantly reduce the false positive rate whilst maintaining a high sensitivity, and are more robust to model misspecification. FCM analysis currently relies on the sequential application of a series of manually defined 1D or 2D data filters to identify cell populations of interest. This process is time-consuming and ignores the high-dimensionality of FCM data. We reframe this as a clustering problem, and propose a robust model-based clustering approach based on t mixture models with the Box-Cox transformation for identifying cell populations. We describe an EM algorithm to simultaneously handle parameter estimation along with transformation selection and outlier identification, issues of mutual influence. Empirical studies have shown that this approach is well adapted to FCM data, in which a high abundance of outliers and asymmetric cell populations are frequently observed. Finally, in recognition of concern for an efficient automated FCM analysis platform, we have developed an R package called flowClust to automate the gating analysis with the proposed methodology. Focus during package development has been put on the computational efficiency and convenience of use at users' end. The package offers a wealth of tools to summarize and visualize features of the clustering results, and is well integrated with other FCM packages. |
Extent | 9508183 bytes |
Genre |
Thesis/Dissertation |
Type |
Text |
File Format | application/pdf |
Language | eng |
Collection |
Electronic Theses and Dissertations (ETDs) 2008+ |
Date Available | 2009-10-08 |
Provider | Vancouver : University of British Columbia Library |
DOI | 10.14288/1.0067770 |
URI | http://hdl.handle.net/2429/13762 |
Degree |
Doctor of Philosophy - PhD |
Program |
Statistics |
Affiliation |
Science, Faculty of Statistics, Department of |
Degree Grantor | University of British Columbia |
Graduation Date | 2009-11 |
Campus |
UBCV |
Scholarly Level | Graduate |
Aggregated Source Repository | DSpace |
Download
- Media
- [if-you-see-this-DO-NOT-CLICK]
- ubc_2009_fall_lo_chiho.pdf [ 9.07MB ]
- Metadata
- JSON: 1.0067770.json
- JSON-LD: 1.0067770+ld.json
- RDF/XML (Pretty): 1.0067770.xml
- RDF/JSON: 1.0067770+rdf.json
- Turtle: 1.0067770+rdf-turtle.txt
- N-Triples: 1.0067770+rdf-ntriples.txt
- Original Record: 1.0067770 +original-record.json
- Full Text
- 1.0067770.txt
- Citation
- 1.0067770.ris
Full Text
Cite
Citation Scheme:
Usage Statistics
Country | Views | Downloads |
---|---|---|
China | 7 | 6 |
United States | 4 | 1 |
Belgium | 3 | 0 |
Russia | 3 | 0 |
France | 2 | 0 |
India | 1 | 0 |
Spain | 1 | 0 |
Germany | 1 | 2 |
Australia | 1 | 0 |
City | Views | Downloads |
---|---|---|
Unknown | 11 | 2 |
Shenzhen | 4 | 2 |
Ashburn | 3 | 0 |
Beijing | 2 | 4 |
Badalona | 1 | 0 |
Kansas City | 1 | 0 |
Putian | 1 | 0 |
{[{ mDataHeader[type] }]} | {[{ month[type] }]} | {[{ tData[type] }]} |
Share
Embed
Customize your widget with the following options, then copy and paste the code below into the HTML
of your page to embed this item in your website.
<div id="ubcOpenCollectionsWidgetDisplay">
<script id="ubcOpenCollectionsWidget"
src="{[{embed.src}]}"
data-item="{[{embed.item}]}"
data-collection="{[{embed.collection}]}"
data-metadata="{[{embed.showMetadata}]}"
data-width="{[{embed.width}]}"
async >
</script>
</div>
Our image viewer uses the IIIF 2.0 standard.
To load this item in other compatible viewers, use this url:
http://iiif.library.ubc.ca/presentation/dsp.24.1-0067770/manifest