Science, Faculty of
Mathematics, Department of
DSpace
UBCV
Bhaskar, Dhananjay
2017-04-24T18:19:11Z
2017
Master of Science - MSc
University of British Columbia
Individual cells adapt their morphology as a function of their differentiation status and in response to environmental cues and selective pressures. While it known that the great majority of these cues and pressures are mediated by changes in intracellular signal transduction, the precise regulatory mechanisms that govern cell shape, size and polarity are not well understood. Systematic investigation of cell morphology involves experimentally perturbing biochemical pathways and observing changes in phenotype. In order to facilitate this work, experimental biologists need software capable of analyzing a large number of microscopic images to classify cells and recognize cell types. Furthermore, automatic cell classification enables pathologists to rapidly diagnose diseases like leukemia that are marked by cell shape deformation.
This thesis describes a methodology to identify cells in microscopy images and compute quantitative descriptors that characterize their morphology. Phase-contrast microscopy data is used for the purpose of demonstration. Cells are identified with minimal user input using advanced image segmentation methods. Features (e.g. area, perimeter, curvature, circularity, convexity, etc.) are extracted from segmented cell boundary to quantify cell morphology. Correlated features are combined to reduce dimensionality and the resulting feature set is clustered to identify distinct cell morphologies. Clustering results obtained from different combinations of features are compared to identify a minimal set of features without compromising classification accuracy.
https://circle.library.ubc.ca/rest/handle/2429/61342?expand=metadata
Morphology Based Cell ClassificationUnsupervised Machine Learning ApproachbyDhananjay BhaskarB.Sc., The University of British Columbia, 2015A THESIS SUBMITTED IN PARTIAL FULFILLMENT OFTHE REQUIREMENTS FOR THE DEGREE OFMASTER OF SCIENCEinThe Faculty of Graduate and Postdoctoral Studies(Mathematics)THE UNIVERSITY OF BRITISH COLUMBIA(Vancouver)April 2017c© Dhananjay Bhaskar 2017AbstractIndividual cells adapt their morphology as a function of their differenti-ation status and in response to environmental cues and selective pressures.While it known that the great majority of these cues and pressures are me-diated by changes in intracellular signal transduction, the precise regulatorymechanisms that govern cell shape, size and polarity are not well under-stood. Systematic investigation of cell morphology involves experimentallyperturbing biochemical pathways and observing changes in phenotype. Inorder to facilitate this work, experimental biologists need software capableof analyzing a large number of microscopic images to classify cells and recog-nize cell types. Furthermore, automatic cell classification enables patholo-gists to rapidly diagnose diseases like leukemia that are marked by cell shapedeformation.This thesis describes a methodology to identify cells in microscopy im-ages and compute quantitative descriptors that characterize their morphol-ogy. Phase-contrast microscopy data is used for the purpose of demonstra-tion. Cells are identified with minimal user input using advanced imagesegmentation methods. Features (e.g. area, perimeter, curvature, circular-ity, convexity, etc.) are extracted from segmented cell boundary to quantifycell morphology. Correlated features are combined to reduce dimensionalityand the resulting feature set is clustered to identify distinct cell morpholo-gies. Clustering results obtained from different combinations of features arecompared to identify a minimal set of features without compromising clas-sification accuracy.iiPrefaceThis is the original and unpublished work of Dhananjay Bhaskar. No ethicsapproval was required for this work. The in-vitro pancreatic carcinomaimages (MIA PaCa-2 human cell line) used to demonstrate the methodol-ogy were acquired by Pamela Dean, PhD student working with Dr. CalvinRoskelley at The University of British Columbia (UBC), using phase-contrastmicroscopy.NSERC USRA ProjectThe feature extraction section in Chapter 2 is based on work done in col-laboration with Darrick Lee as part of an Undergraduate Summer ResearchAward (USRA) project from May to August 2016. Darrick implementedcode to compute circular, elliptical, polygonal and cubic spline fits to cellboundary.URO REX Mentorship ProgramDhananjay supervised undergraduate students MoHan Zhang and CindyTan from November 2016 to March 2017 as part of the Research Experi-ence Program (REX) organized by Undergraduate Research Opportunities(URO), a student-run organization at UBC. MoHan improved and extendedthe feature computation work performed by Darrick Lee. MoHan also im-plemented code to compute rectangular fits to cell boundary with guidancefrom Dhananjay. Cindy investigated the use of silhouette score analysisand gap statistic to determine number of clusters in synthetically generateddata. Partial results from Cindy’s project are presented in the unsupervisedclassification section of Chapter 2.iiiTable of ContentsAbstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iiPreface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iiiTable of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . ivList of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viList of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viiAcknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . ix1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11.1 Literature Review . . . . . . . . . . . . . . . . . . . . . . . . 21.2 Research Objective . . . . . . . . . . . . . . . . . . . . . . . 51.3 Thesis Overview . . . . . . . . . . . . . . . . . . . . . . . . . 52 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72.1 Image Processing . . . . . . . . . . . . . . . . . . . . . . . . 82.1.1 Foreground Detection . . . . . . . . . . . . . . . . . . 92.2 Feature Extraction . . . . . . . . . . . . . . . . . . . . . . . . 142.2.1 Hu’s Moment Invariants . . . . . . . . . . . . . . . . 152.2.2 Geometrical and Boundary Features . . . . . . . . . . 172.2.3 Shape Factors . . . . . . . . . . . . . . . . . . . . . . 252.3 Dimensionality Reduction . . . . . . . . . . . . . . . . . . . . 272.3.1 Principal Component Analysis (PCA) . . . . . . . . . 282.4 Unsupervised Classification . . . . . . . . . . . . . . . . . . . 282.4.1 K-means Clustering Algorithm . . . . . . . . . . . . . 302.4.2 Silhouette Score Analysis . . . . . . . . . . . . . . . . 313 Results and Discussion . . . . . . . . . . . . . . . . . . . . . . 333.1 Exploratory Data Analysis . . . . . . . . . . . . . . . . . . . 363.2 Clustering Using Hu’s Moment Invariants . . . . . . . . . . . 43ivTable of Contents3.3 Clustering Using Geometrical Feature Descriptors . . . . . . 483.4 Clustering Using Geometrical and Boundary Features . . . . 523.5 Clustering Using Shape Factors . . . . . . . . . . . . . . . . 574 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 625 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68AppendixA Dimensionality Reduction Using t-SNE . . . . . . . . . . . . 73vList of Tables2.1 Features extracted from ellipse and circle fits of cell images . 212.2 Features extracted from bounding rectangle fits . . . . . . . . 222.3 List of other geometrical features . . . . . . . . . . . . . . . . 242.4 Features extracted from curvature of cubic spline fit . . . . . 252.5 List of non-dimensional shape factors . . . . . . . . . . . . . . 27viList of Figures2.1 Standard pipeline for cell classification using microscopy data 82.2 Foreground detection from phase contrast image . . . . . . . 102.3 Mathematical morphology: image erosion . . . . . . . . . . . 112.4 Mathematical morphology: image dilation . . . . . . . . . . . 122.5 Cell segmentation . . . . . . . . . . . . . . . . . . . . . . . . . 132.6 Geometrical fits and boundary interpolation . . . . . . . . . . 192.7 Computing chain code from boundary pixels . . . . . . . . . . 232.8 Computation of polygonal fit from chain code . . . . . . . . . 242.9 Clustering synthetic data using silhouette score analysis . . . 323.1 Selection of correctly segmented cells . . . . . . . . . . . . . . 343.2 Four distinct morphologies in MIA PaCa-2 phase images . . . 363.3 Plots of Hu’s moment invariants for all cells in the data set . 373.4 Thresholding Hu’s invariant moment φ1 . . . . . . . . . . . . 383.5 Manual classification of cells by thresholding φ1 . . . . . . . . 393.6 Thresholding Hu’s invariant moment φ7 . . . . . . . . . . . . 403.7 Manual classification of cells by thresholding φ7 . . . . . . . . 413.8 Correlation between shape factors, cell area and perimeter . . 423.9 PCA of Hu’s moment invariants . . . . . . . . . . . . . . . . . 433.10 Biplot for 2-component PCA using Hu’s moment invariants . 453.11 Feature agglomeration tree for Hu’s moment invariants . . . . 463.12 Silhouette score analysis of Hu’s moment invariants . . . . . . 473.13 Classification of cells using Hu’s moment invariants . . . . . . 483.14 PCA of normalized geometrical features . . . . . . . . . . . . 493.15 Biplot for 2-component PCA using geometrical features . . . 493.16 Feature agglomeration tree for geometrical features . . . . . . 503.17 Silhouette analysis and clustering of geometrical features . . . 513.18 Classification of cells using geometrical features . . . . . . . . 533.19 PCA, silhouette analysis and clustering of combined (geomet-rical and boundary) features . . . . . . . . . . . . . . . . . . . 543.20 Biplot for 2-component PCA using combined features . . . . 55viiList of Figures3.21 Feature agglomeration tree for combined features . . . . . . . 563.22 PCA of non-dimensional shape factors . . . . . . . . . . . . . 573.23 Analyzing correlation in shape factors using biplot and fea-ture agglomeration . . . . . . . . . . . . . . . . . . . . . . . . 583.24 Silhouette analysis and clustering of shape factors . . . . . . 593.25 Classification of cells using shape factors . . . . . . . . . . . . 605.1 Boundary curvature of a protrusive cell . . . . . . . . . . . . 655.2 Boundary curvature of an elliptical cell . . . . . . . . . . . . . 65A.1 Clustering Hu’s moment invariants using 2-component t-SNE 74A.2 Improved classification of cells using Hu’s moment invariants 75A.3 t-SNE using combination of geometrical and boundary features 76A.4 Clustering shape factors using 2-component t-SNE . . . . . . 76A.5 Subset of elongated cells corresponding to clusters 1, 2 and 3 77A.6 Protrusive, circular and elongated cells identified in Figure A.4 78A.7 Segmented cell images of outlier points in Figure A.4 . . . . . 78viiiAcknowledgmentsI am greatly indebted to Dr. Leah Edelstein-Keshet for giving me the op-portunity to pursue undergraduate and graduate research at UBC under hersupervision. In addition to providing generous funding and unconditionalsupport at a moment’s notice, Dr. Keshet is largely responsible for myenthusiasm to pursue interdisciplinary research as a career. Under her guid-ance, I was able to pursue multiple independent projects of my own volitionand acquire knowledge in fields as diverse as microscopy image processing,machine learning, mathematical modeling and multi-scale computationalsimulation of cell scale biology. Furthermore, Dr. Keshet introduced me toleading researchers in mathematical and computational biology by encourag-ing me to attend training workshops, present at conferences and collaborateextensively. The research reported here was supported by NSERC DiscoveryGrant (RGPIN-41870) awarded to Dr. Keshet.I thank my mentors Dr. James Feng and Dr. Calvin Roskelley for theiradvice and constructive critique. Dr. Keshet and Dr. Feng organized jointresearch group meetings where I presented my work and received valuablefeedback. Dr. Roskelley provided biological perspective and experimentaldata for multiple research projects. I thank the graduate chair, Dr. DanielCoombs, for his valuable time spent reviewing the final draft.I owe many thanks to my peers for their help and kind words of encour-agement. Cole Zmurchok and Hildur Knu´tsdo´ttir are inspiring role modelswho invested their time to show me the ropes. At UBC, I had the incredibleopportunity to work with talented undergraduate students: Eviatar Bach(Summer 2015 USRA), Darrick Lee (Summer 2016 USRA), MoHan Zhangand Cindy Tan (Fall and Winter 2016, URO REX Program), who motivatedme to take on ambitious projects. Finally, I am grateful to my family andfriends (Aditya Mandapati, Claire Guerrier, Lydia Lin and Michael Irvine)for their enduring support that has been instrumental in my success.Dhananjay BhaskarixChapter 1IntroductionModern medicine and high throughput quantitative biology have fueleddemand for computational methods that enable practitioners to analyzelarge quantities of data in an efficient manner. Data science and data miningcan be used to inform experimental design, test diagnostic protocols, makepredictions, verify and generate new hypotheses. Historically, the bioinfor-matics community has been at the forefront of developing tools that leverageadvancements in parallel processing, GPU computing and machine learningto process ‘omics’ (e.g. genomics, proteomics, metabolomics) data. Increas-ingly, methods for extracting useful information from microscopic data arereceiving greater attention [4]. From a modeling perspective, quantificationof microscopic data is crucial to understand cell and tissue scale biology, in-cluding developmental, vascular and cancer biology. Translationally, avail-ability of computational tools to process microscopic images is necessary forautomating diagnoses and generating patient-specific treatments.This thesis describes a methodology for identification of cells from phase-contrast microscopy images, quantification of cell geometry and morphology-based cell classification. The implementation of this methodology, resultingin an image processing and machine learning pipeline that operates semi-automatically (with minimal manual intervention), is suitable for analyzinga large number of images. Since the methodology does not rely on label-ing cells with biomarkers, it can be easily adapted to work with confocalor fluorescent images. However, to illustrate how the method works, in-vitro phase-contrast images of pancreatic carcinoma cells (MIA PaCa-2 cellline) are used in this thesis. The MIA PaCa-2 cell line exhibits a numberof different morphologies in monolayer culture. Therefore, it is ideal fortesting, developing and benchmarking new analytical tools designed for twodimensional image analysis, cell recognition and classification. Phase mi-croscopy was chosen for demonstration purposes because phase images areparticularly challenging to segment [39].11.1. Literature ReviewThe methodology described herein has numerous applications. The pre-cise mechanisms for regulation of cell shape and size are not well understood.Such methodology can be used as an image-based screening procedure todetermine whether an experimental perturbation (e.g. treatment with achemical compound, small interfering RNA or genetic manipulation) leadsto changes in cell morphology [35]. It can be used to detect outlier cell mor-phologies, i.e. cells that appear different from a control or reference group.Therefore, the methodology constitutes a systematic approach for scientificinvestigation of cell morphology. In a medical context, ability to identifyand count different cell types from biological samples (e.g. blood, biopsy)can be used for diagnoses, estimation of risk and prognosis.1.1 Literature ReviewManual classification of cells using computerized image analysis and quan-tification of morphological descriptors was well established before classifica-tion using machine-learning techniques became the norm. For instance,Merson-Davies and Odds classified Candida albicans cells into spherical,pseudohyphae (elongated) and true hyphae by computing morphology in-dex (Mi) of each yeast cell using maximum length (l), maximum diameter(d) and septal diameter (s) obtained from image analysis [23]. They foundthat Mi = ls/d2 is correlated with the content of chitin (a cell wall compo-nent) in cells and can be reliably used in place of subjective descriptions toidentify cell morphology.Machine learning techniques can be broadly categorized into two groups:supervised learning and unsupervised learning. In the context of using ma-chine learning to classify data, supervised classification refers to techniquesthat require training data (i.e. subset of input data for which the classifica-tion outcome is known a priori) in order to “teach” a classification algorithmto recognize patterns in input data and get the desired output. In contrast,unsupervised classification does not require any training data, instead theinput data is clustered or organized into different classes based on the inher-ent properties of the data. For the purpose of cell classification, input datais typically a randomized list of feature vectors, where each feature vectoris an ordered set of “features” (sometimes referred to as “descriptors”) thatquantify the morphology of a cell. The terms “ features” and “descriptors”are used interchangeably throughout this thesis. The remainder of this sec-tion describes recent work (2013 onwards) in the field of morphology based21.1. Literature Reviewcell classification and its applications.A significant proportion of recent cell classification literature is devotedto diagnosis of leukemia. Putzu and Di Ruberto segmented nuclei of lym-phocytes from blood samples in order to identify patients with acute lym-phoblastic leukemia (ALL) [31]. They computed 30 shape descriptors, 4color descriptors and 16 texture descriptors to quantify irregularities in nu-clear shape and used a Support Vector Machine (SVM, a supervised learningmethod) to perform supervised binary classification. Textural descriptorswere calculated for 0, 45, 90 and 135 degree rotations of the nucleus tomaintain rotational invariance. Patients with ALL were identified with 0.25percent mis-classification rate. The technique for segmenting nuclei usingwatershed algorithm described in this paper is also part of the methodologypresented in this thesis.Amin et al. further developed ALL classification methodology by com-puting a richer set of 77 geometrical and statistical features from bloodand bone marrow smears to further categorize cell nuclei into 3 subtypes:L1, L2 and L3, based on the French-American-British (FAB) classificationsystem [2]. Soon thereafter, Reta et al. classified bone marrow images todistinguish between families of acute leukemia (ALL and AML), subtypesof ALL (L1 and L2) and subtypes of AML (M2, M3 and M5) using a varietyof classifiers [33].Chankong et al. proposed a method for automatic cervical cell segmenta-tion and classification using single-cell images obtained from a Pap test [7].Nuclear and cytoplasmic regions of cells were separated computationally.Cells were classified into normal, low grade squamous intraepithelial lesion(LSIL), high grade squamous intraepithelial lesion (HSIL), and squamouscell carcinoma (SCC) phenotypic categories (in increasing order of malig-nancy) with over 95% accuracy. The authors computed six features toquantify the shape and coarseness of the nucleus, including some dimen-sionless shape factors that are described in the feature extraction sectionof Chapter 2. The feature vector also included three descriptors of overallcell shape and the size of the nucleus in relation to cell size. The authorscompared results obtained from various supervised classifiers. These includeK-nearest neighbors (KNN), artificial neural network (ANN) and SVM, notreviewed in this thesis.31.1. Literature ReviewNosaka and Fukui classified fluorescence staining patterns of human ep-ithelial type 2 (HEp-2) cells into six categories in order to perform automaticantinuclear antibody (ANA) analysis [26]. They trained a SVM classifier us-ing features computed from multiple size-scaling and rotations of each cellimage. The classifier outperformed other methods using the HEp-2 imagesdata set created by the Mivia Lab at the University of Salermo.Nanni et al. developed a methodology to classify phase-contrast mi-croscopy images of retinal pigment epithelial (RPE) cells (derived from hu-man pluripotent stem cell) using ensembles of cell texture descriptors [25].Cells were classified into three maturation stages to asses suitability forimplantation or in-vitro use. The methodology is generally applicable toa variety of biological image classification problems. However, it requiresavailability of an annotated data set to train SVM classifiers.As co-culture systems gain popularity in the study of interactions betweendifferent cell types, automatic identification of distinct morphologies andcell types in co-culture microscopy is becoming more relevant. Logan et al.developed a pixel-based learning methodology (where pixel intensities of cellimages are used as features) that can accurately identify multiple fluorescentmorphologies [20]. Although it requires customized tuning for each cell typewith distinct morphology, the authors demonstrated that the method cansegment and count hepatocytes and fibroblasts in an unsupervised manner,without requiring explicit feature computation.Ahonen et al. used unsupervised clustering methods to classify simulatedtumor spheroids and images of PC3 human prostate cancer spheroids. Theyused geometrical features calculated from ellipse fitting, boundary featuresobtained from principal curve fitting and texture features comprised of localsample moments and local binary patterns to classify tumors using a clus-tering algorithm [1]. The complete feature vector consisted of 193 shape-based descriptors and 178 texture-based descriptors. After dimensionalityreduction using principal component analysis (PCA), the authors identified4 clusters using only geometrical features, 4 clusters using only boundaryfeatures and 3 clusters using only texture features. The four distinct clus-ters correspond to smooth spherical, spherical with rough borders, sphericalwith appendages and highly irregular phenotypes.41.2. Research ObjectiveFinally, it should be noted that morphological classification of neuronspresents a distinct set of challenges and idiosyncrasies. According to a recentreview, supervised methods outperform unsupervised clustering methodsdespite the availability of large amounts of data due to lack of consensus inclass delineation, sensitivity to algorithm parameters, limitations in featureextraction and lack of robustness in cluster identification [41].1.2 Research ObjectiveThe primary goal of the research described in this thesis is to develop amethodology for unsupervised classification of cells based solely on morphology-based features. While supervised classifiers have been shown to perform well,they require annotated input data for training and perform poorly if inputdata is too dissimilar to training data. In practical applications, annotateddata is typically not available and generally requires significant time andeffort to acquire. The insistence on using only morphological descriptors(called label-free classification) serves to keep the methodology generallyapplicable and free from limitations of labeling techniques or adverse effectsof staining reagents. To facilitate development of the methodology and forthe purpose of validation, phase-contrast images of pancreatic carcinomacells (MIA PaCa-2) are used, although any microscopic data with reason-able spatial resolution and heterogeneity in cell shape would suffice. Cur-rent label-free classification methodologies mostly rely only on qualitativelysingular type of features and cannot achieve multi-class classification [9].Conversely, some methodologies that compute multiple feature descriptorscannot process high-throughput data due to computational complexity offeature extraction. Therefore, an objective of this work is to identify a min-imal set of feature descriptors that can classify cells without compromisingon the outcome.1.3 Thesis OverviewThe thesis assumes some background knowledge in image processing andmachine learning. An attempt is made to define terminology and introduceconcepts without detracting from the main content of the research. Thethesis is organized into five chapters. Their contents are summarized below:Chapter 1 provides an introduction to the research problem and a brief sum-mary of recent literature in this field.51.3. Thesis OverviewChapter 2 describes the methodology used to perform image segmentation,feature extraction, dimensionality reduction and cluster identification. To-gether, the techniques described in this chapter form the basis of an unsu-pervised cell classification pipeline.Chapter 3 illustrates a typical application of the methodology using the MIAPaCa-2 pancreatic cancer data set provided by the Roskelley Lab at UBC.Chapter 4 summarizes the advantages and limitations of the methodologycompared to other existing approaches.Chapter 5 identifies areas for improvement and describes future work.Appendix A describes classification results obtained by using t-SNE insteadof PCA for dimensionality reduction.6Chapter 2MethodologyThis chapter describes a methodology for unsupervised classification ofcells from phase contrast microscopy images. Typically, the machine learn-ing algorithm is embedded into a processing pipeline that converts mi-croscopy images into numerical data corresponding to individual cells [35].The pipeline consists of image processing, feature extraction, dimensionalityreduction, classification and validation steps, which are described in detailbelow.The first phase of image processing is to enable separation of foregroundand background by removing artifacts, reducing noise and compensatingfor uneven illumination. Subsequently, image segmentation methods (tech-niques that divide an image into regions of interest) are used to identifycells amongst the foreground pixels. The choice of segmentation algorithmdepends on the image and cell type. No single algorithm is capable of iden-tifying cells from multiple image sources. Often, manual tuning of algorithmparameters is required to achieve optimal performance.Once cells are segmented, quantifiable descriptors of cell shape and sizeare computed. These are referred to as feature vectors and they form thebasis for distinguishing between cells using a classification algorithm. Fea-ture vectors are normalized to have zero mean and standard deviation ofunity in order to prevent discrimination between features due to differencein their magnitudes or signs. The performance of the classifier depends onthe quality of segmentation and accuracy of features. Two cells can be dis-tinguished (i.e. assigned different labels by the classifier) if their featurevectors differ to a significant degree. Furthermore, in order to identify a cer-tain morphology, one or more features must capture unique characteristicsof that morphology.Most widely used classification algorithms that rely on computation ofdistance metric between features tend to perform well for low dimensional72.1. Image Processingfeature vectors. Dimensionality reduction techniques convert high dimen-sional feature vectors to low dimensional vectors by identifying correlationsand combining multiple features in a manner that maximizes the variancebetween projections of the data in the low dimensional space. Two pop-ular dimensionality reduction techniques, Principal Component Analysis(PCA) and t-distributed Stochastic Neighborhood Embedding (t-SNE) aredescribed in Section 2.3 and Appendix A respectively.After dimensionality reduction, clustering algorithms are used to classifydata points corresponding to individual cells. Several clustering algorithmsare publicly available and have been empirically evaluated using syntheticdata for their performance, robustness and accuracy [29]. Although otheralgorithms may yield better results, the k-means algorithm is widely used forits ease of implementation and the availability of several parameter estima-tion techniques to estimate the parameter k, corresponding to the numberof clusters in the data set.Figure 2.1: Standard pipeline for cell classification using microscopy dataFigure 2.1 illustrates the overall methodology organized in a data pro-cessing pipeline. The pipeline is implemented in a modular manner andindividual components can be replaced if required. For instance, to classifyfluorescent images instead of phase-contrast images, replacing the imagesegmentation method would suffice. Similarly, if the k-means classificationalgorithm is unable to find clusters in the input data or the algorithm fails toconverge, then it can be easily swapped for a more sophisticated algorithm.2.1 Image ProcessingIdentifying cells in an image is essential for automating the recognitionof multiple cell types in large cell populations. Automated processing of2-D images to count cells and identify cell types using morphological mea-surements has been steadily gaining traction since the 1960s. Over the pastdecades, literature on the subject has grown exponentially, with more thanhalf of the bulk of papers appearing after the year 2000 [22].82.1. Image ProcessingWith the exception of neuron segmentation, the vast majority of currentsegmentation methods are based on few basic approaches; namely, intensitythresholding, feature detection, morphological filtering, region accumulationand deformable model fitting. These methods are reviewed in [22]. Regionaccumulation methods such as Voronoi-based methods or watershed trans-form can result in inaccurate cell boundaries by mis-specifications of the cellregion to be divided or by over-segmentation. Similarly, popular deformablemodel approaches such as geodesic active contours or level sets, which de-tect cell boundaries by minimizing a predefined energy functional, can resultin poor boundary detection because they use local optimization algorithmsthat only guarantee to find a local minimum or use the gradient vector fieldof the image to decode the boundary information [10].Segmentation of bright field and phase contrast images is generally morechallenging compared to fluorescent images. The latter usually have bettercontrast and deformable model fitting techniques like active contour or levelsets work well [39]. Distinctive bright white patches or halo surrounding cellsin bright field and phase contrast images prevent accurate determination ofcell boundary. Therefore, a custom approach is required for each applicationthat takes heterogeneity in cell shape, population density, variability in cellcompartmentalization, etc., into account.The following sections describe an approach for segmenting phase contrastimages using a combination of edge detection, thresholding, mathematicalmorphology and watershed transform.2.1.1 Foreground DetectionForeground detection is performed in three stages. Firstly, the Sobel-Feldman derivative filter is applied to the original grayscale image to findedge points. These points are pixel locations in the image correspondingto non-zero intensity changes. The Sobel-Feldman operator uses two 3 × 3kernels, one for derivative in a horizontal direction and the other for deriva-tive in a vertical direction, which are convolved with the original image tocalculate gradient approximation. The result is binarized by thresholding,with the value of the threshold specified by the user.92.1. Image Processing(a) Original Image (b) Edge Detection(c) Dilated & Filled (d) Foreground Binary MaskFigure 2.2: Foreground detection from phase contrast image(a) Original phase contrast microscopy image of MIA PaCa-2 pancreaticcarcinoma cell line courtesy of the Roskelley Lab at UBC. (b) Edge pointdetection by applying Sobel-Feldman derivative filter and conversion fromgrayscale to binary by thresholding. (c) Dilation by line-shaped structuralelement. (d) Resulting foreground markers obtained after filling, removal ofsmall artifacts and objects connected with image boundary.The binary image produced by edge detection is further manipulated usingmathematical morphology, as shown in Figure 2.2. Mathematical morphol-ogy is a collection of set-theoretic operations on binary images that havebeen used for image enhancement, noise removal, edge detection, etc. Foun-dations of mathematical morphology are based on two operations: erosionand dilation.Erosion of image A by structural element (abbreviated “strel”) B, resulting102.1. Image Processingin image E is defined as:A B = {z ∈ E|Bz ⊆ A},where Bz is the translation of B by the vector z:Bz = {b+ z|b ∈ B}, ∀z ∈ E.In practice, erosion leads to shrinking or thinning of the binary image, asshown in Figure 2.3.(a) Original Image (b) Eroded by disk strel (c) Eroded by square strelFigure 2.3: Mathematical morphology: image erosion(a) Original image of UBC logo. (b) Image eroded by disk structural elementwith radius of 5 pixels. (c) Image eroded by square structural element withside length of 7 pixels. The diagonal length of the square element is 10pixels, which is equivalent to the diameter of the disk element.Dilation of image A by structural element B, resulting in image E is definedas:A⊕B =⋃z∈BAz,where Az is the translation of A by the vector z:Az = {a+ z|a ∈ A}.Dilation is used to grow or thicken regions in a binary image, as shown inFigure 2.4. All other morphological operations can be defined by composingerosions and dilations.112.1. Image Processing(a) Original Image (b) Dilated by disk strel (c) Dilated by square strelFigure 2.4: Mathematical morphology: image dilation(a) Original image of UBC logo. (b) Image dilated by disk structural elementwith radius of 5 pixels. (c) Image dilated by square structural element withside length of 7 pixels. Notice that corners are rounded in Figure 2.4b asa consequence of the shape of the disk, whereas corners remain sharp inFigure 2.4c.In the second stage of foreground detection, edge points are connectedby dilating the image with line shaped structural elements. The size of thestructural elements is specified by the user. This leads to the formationof closed loops around isolated cells or clusters of tightly packed cells. Thefinal stage involves filling of closed loops (using imfill in MATLAB), removalof small objects whose size is below user-specified threshold and removal ofobjects that cross the image boundary.Cell SegmentationForeground detection resulted in separation of foreground and backgroundfrom the original image. The foreground binary image is eroded multipletimes (number of erosions and size of structural element is specified by theuser) to obtain foreground markers, a majority of which lie inside cell bound-aries. The marker-based watershed transform is a region accumulation ap-proach that segments cell boundaries using foreground markers and gradientof the original image. The number of correct segmentations in the result de-pends on the pre-processing of markers prior to the segmentation [36].122.1. Image Processing(a) Foreground Markers (b) Background Markers(c) Watershed Segmentation (d) Manual SelectionFigure 2.5: Cell segmentation(a) Eroded binary foreground markers overlaid on top of original image.(b) Background markers computed by applying watershed transform to thedistance transform of foreground markers. (c) Result of watershed segmen-tation using foreground and background markers. (d) Manual selection ofcorrectly segmented cells.The watershed segmentation algorithm requires both foreground and back-ground markers. Foreground markers specify regions inside individual cellswhereas background markers specify regions between adjacent cells. To pre-vent over-segmentation due to background markers being too close to objectsof interest (i.e. cells), background markers are computed by calculating theskeleton by influence zones (SKIZ) of the foreground markers [24]. The in-fluence zone of a foreground marker is the set of neighboring pixels thatare closer to that foreground marker than to any other foreground markers.SKIZ is the boundary between influence zones of all foreground markers. Itis analogous to the Voronoi tessellation of foreground markers in the image132.2. Feature Extractionplane. In practice, the background markers are computed using a simpletwo-step procedure. Firstly, the distance transform of the foreground mark-ers is computed. Then, ridge lines (corresponding to background markersor SKIZ, as shown in Figure 2.5b) are determined by computing the water-shed transform of the distance-transformed foreground markers [32]. Onceboth foreground and background markers are computed, the priority-floodwatershed algorithm is applied to the original image, resulting in watershedlines corresponding to the boundary of the cells as shown in Figure 2.5c. Anoutline of the watershed algorithm follows:Priority-Flood Watershed Algorithm [3]:Step 1: Foreground and background markers are chosen. Each set of con-nected markers is assigned a different label.Step 2: The neighboring pixels of each marked area are inserted into a pri-ority queue (a list of objects sorted by their priority level), with a prioritylevel corresponding to the magnitude of the gradient of intensity at thatpixel.Step 3: The pixel with the lowest priority level is extracted from the pri-ority queue. If all labeled neighbors of the extracted pixel have the samelabel, then the pixel is assigned their label. All non-marked neighbors thatare not yet in the priority queue are put into the priority queue.Step 4: Step 3 is repeated until the priority queue is empty.The entire image segmentation process requires minimal user input. Theuser has to specify the threshold parameters, size of structural elementsand number of iterations for morphological operations. Finally, the user isrequired to manually select correct segmentation from the result of applyingthe watershed transform. This ensures that only correctly segmented cellsare assigned unique ID numbers (and serialized) for further processing inthe pipeline.2.2 Feature ExtractionIn general, there are two approaches for extracting morphological featuresfrom segmented cell shapes: boundary-based methods that extract informa-tion from points on the cell boundary and area-based methods which useall points on the interior and boundary of the cell shape. The area basedmethods are more robust to small perturbations in cell shape and are easyto compute. For example, to accurately estimate the area of a given shape it142.2. Feature Extractionis sufficient to count the number of pixels that make up the shape, whereasperimeter estimation is not so straightforward [18]. The main advantage ofboundary based features such as curvature functions, cubic spline interpo-lation of cell boundary, normalized Fourier shape descriptors, etc., is thatthey provide a good quantization of angles, corners and curves in the im-age. These details are lost when summing over all image pixels to computearea based features like Hu’s moment invariants and non-dimensional shapefactors.2.2.1 Hu’s Moment InvariantsThe mathematical concept of moments can be used to quantify global prop-erties of an image. Global image properties refer to an image as a wholerather than its components. Consider a binary segmented cell image definedby intensity function f : R2 → {0, 1} that maps each pixel (x, y) to 0 or 1depending on whether the pixel lies outside or inside/on the cell boundary.Raw image moments mpq of order p+ q are defined as projections of imagef(x, y) to basis xpyq:mpq =∫ ∞−∞∫ ∞−∞xpyqf(x, y)dxdy.In a discrete setting, the integral is replaced by a sum over all pixels in theraster image:mpq =∑x∑yxpyqf(x, y).The zeroth raw moment, m00, is the sum of intensities over all pixels inthe image. For a binary image, where pixel intensity is equal to 1 if thepixel is part of the cell and 0 otherwise, m00 corresponds to the area of thecell. x¯ = m10/m00 and y¯ = m01/m00 are coordinates (x, y) of the centroid.To confer translational invariance, one can compute central image momentsabout the mean:µpq =∑x∑y(x− x¯)p(y − y¯)qf(x, y).It can be easily verified that the zeroth central moment, µ00, is equivalentto m00 and it corresponds to the area of the segmented cell.Now, consider an image scaled by factor λ: f ′(x, y) = f(x/λ, y/λ). Thescaled image (defined by function f ′) can be smaller (if λ > 1) or larger152.2. Feature Extraction(if λ < 1) compared to the original image (defined by function f). Centralmoments of the scaled image are given by:µ′pq =∫ ∫xpyqf(x/λ, y/λ)dxdy.Substituting x′ = x/λ and y′ = y/λ, we derive:µ′pq =∫ ∫(λx′)p(λy′)qf(x′, y′)λ2dx′dy′,µ′pq = λ(p+q+2)µpq.Therefore, normalized central moments that are translation and scale in-variant can be obtained by setting area (µ′00) equal to unity:µ′00 = λ2µ00 =⇒ λ = µ−1200 .Normalized central moments of order p+ q are typically denoted by η:ηpq = µ−(p+q+2)200 µpq.However, a similar calculation corresponding to image rotated by angleθ: f ′(x, y) = f(x cos(θ) + y sin(θ),−x sin(θ) + y cos(θ)) does not yield ro-tationally invariant moments. Rotational invariance requires a nonlineartransformation that is not trivial to compute. Hu derived these nonlinearexpressions from normalized central moments up to order three using alge-braic invariants [16]. Hu’s moment invariants are widely used for translation,scaling and rotation invariant pattern recognition, including recognition oftyped English language characters [17]. Typically a feature vector for imageclassification is comprised of seven invariants:φ1 = η20 + η02,φ2 = (η20 − η02)2 + 4η211,φ3 = (η30 − 3η12)2 + (3η21 − η03)2,φ4 = (η30 + η12)2 + (η21 + η03)2,φ5 = (η30 − 3η12)(η30 + η12)[(η30 + η12)2 − 3(η21 + η03)2]+(3η21 − η03)(η21 + η03)[3(η30 + η12)2 − (η21 + η03)2],φ6 = (η20 − η02)[(η30 + η12)2 − (η21 + η03)2] + 4η11(η30 + η12)(η21 + η03),φ7 = (3η21 − η03)(η30 + η12)[(η30 + η12)2 − 3(η21 + η03)2]−(η30 − 3η12)(η21 + η03)[3(η30 + η12)2 − (η21 + η03)2].162.2. Feature ExtractionNote that φ7 has the additional property of being skew invariant andtherefore can be used to distinguish between mirror images. Unlike rawor central moments, these finite order invariant moments do not form acomplete set of image descriptors. While higher order moments can becalculated, image reconstruction given a set of Hu’s moment invariants isnot straightforward. Furthermore, all seven invariant moments are zero forimages that are rotationally symmetric [30].Dunn and Brown used shape measures (extension, dispersion and elonga-tion) and principal axis orientation calculated using φ1 and φ2 to charac-terize the shape and alignment adopted by chick heart fibroblasts on micro-fabricated grooved substrata [11]. However, recent literature on morphology-based cell classification omits the use of Hu’s moment invariants. The roleof these invariants and their usefulness is investigated in Chapter 3 andAppendix A.2.2.2 Geometrical and Boundary FeaturesIn order to achieve high-throughput cell classification, the feature vectordescribing the shape of the cell should be concise and computationally in-expensive to calculate. Hu’s moment invariants meet this criterion but donot have any intuitive meaning. Given an arbitrary set of Hu’s momentinvariants, one cannot easily imagine the shape of the object that producedthose moments.Normalized Fourier shape descriptors (FSD) represent the boundary ofan object using a subset of coefficients from the Fourier transform of itscontour [13]. However, the number of coefficients required to accuratelyrepresent a given object depends on the curvature of the object and usuallya large number is required. Furthermore, it is not evident how the numberof coefficients required can be estimated a priori without trial and error.Curvature functions are computed from the contour of an object anddescribe how much a curve bends at each point. Peaks in the curvaturefunction correspond to corners on the object. Urdiales et al. describe anon-parametric method for efficient computation of a short feature vectorfor a planar shape using its curvature function [40]. However, in order toproduce a short feature vector, their algorithm requires pre-computing a setof representative curvature functions by calculating the Fourier transformof curvature functions of typical shapes. Then, for a given input shape, the172.2. Feature Extractionalgorithm returns similarity measures of the shape’s curvature function com-pared to the representative set. The pre-computation step is not desirablefor high-throughput cell classification (particularly for on-line classificationwhere the entire data set is not available in advance), therefore curvaturefunction-based features are omitted.In view of the above, geometrical descriptors included in the feature setconstitute estimates of dimensions of the cell obtained by fitting conics (cir-cles and ellipses) as well as uncertainty in those estimates obtained fromgoodness of fit measures. Conic fitting is universally applicable to all planarobjects and requires only a set of boundary points (obtained from segmen-tation) as input. Boundary descriptors are obtained from cubic spline in-terpolation of the boundary of the cell, where the number of spline points isestimated using a manually adjusted smoothing parameter. These descrip-tors encode information about the shape and size of the cell that is easy tovisualize (as shown in Figure 2.6) and understand. Like Hu’s moment invari-ants, these features are resistant to affine transformations (scaling, rotationand translation) as well as to noise in the shape boundary.Consider an arbitrary geometry f(θ) = 0 parametrized by M features,θ = (θ1, ..., θM )T . To fit this geometry to a set of boundary points (xi, yi)Ni=1,consider the following optimization problem:argminθN∑i=1r2i (θ),where ri is the orthogonal distance between boundary point (xi, yi) andshape f(θ) = 0.For example, to fit a circle f(θ) = 0 ⇐⇒ x2+y2−r2 = 0, where θ = (r).Similarly, for ellipse fitting, f(θ) = 0 ⇐⇒ ax2 + by2 + cxy+ dx+ ey+ f =0, where θ = (a, b, c, d, e, f). However, the application of ordinary leastsquares does not return consistent estimates for parameter(s) θ. Therefore,typically one of the following techniques are used: gradient-weighted leastsquares fitting, bias-corrected renormalization fitting or extended Kalmanfiltering [6].182.2. Feature ExtractionFigure 2.6: Geometrical fits and boundary interpolationLeft: Circle, ellipse and polygon fits for three distinct cell geometries. Mid-dle: Rectangle fit is used to compute maximum and minimum Feret diame-ter. Right: Cubic spline interpolation of cell boundary colored by magnitudeof curvature.Ellipse and Circle FittingConsider an ellipse centered at (xc, yc) and rotated by angle α. Let (xt, yt)be the closest point on the ellipse to boundary point (xi, yi). Then theshortest distance Di from the boundary point to the ellipse is given by:xt = xc + a cos(α) cos(t)− b sin(α) sin(t),yt = yc + a sin(α) cos(t) + b cos(α) sin(t),Di =√(xi − xt)2 + (yi − yt)2.In this case, the optimal least squares solution can be computed directlywithout an iterative approach [15]. The stable and robust fitting methodreturns parameters θ = (xc, yc, a, b, α). In addition to parameters obtainedfrom fitting, goodness of fit is estimated by calculating its variance as fol-lows [28]:192.2. Feature ExtractionSuppose (x¯, y¯) is the centroid and (xi, yi)Ni=1 are the boundary points onthe contour of the shape that is being fitted. Then covariance matrix of thecontour is:C =1NN∑i=1ViVTi =(cxx cxycyx cyy),whereVi =(xi − x¯yi − y¯),and,cxx =1NN∑i=1(xi − x¯)2,cxy =1NN∑i=1(xi − x¯)(yi − y¯),cyx =1NN∑i=1(yi − y¯)(xi − x¯),cyy =1NN∑i=1(yi − y¯)2.Lengths of the two principal axes of the ellipse fit can be obtained by cal-culating eigenvalues of the covariance matrix:det(C − λ1,2I) = 0,λ1 =12[cxx + cyy +√(cxx + cyy)2 − 4(cxxcyy − c2xy)],λ2 =12[cxx + cyy −√(cxx + cyy)2 − 4(cxxcyy − c2xy)],Ellipse eccentricity, e =λ2λ1.Variance is the standard deviation of radial distance from the centroid tothe boundary points divided by the mean. Variance close to zero indicatesa good fit.Variance of fit =σRµR,202.2. Feature Extractionwhere,µR =1NN∑i=1di,σR =√√√√ 1NN∑i=1(di − µR)2,and di =√V Ti C−1Vi.The following table summarizes features obtained from ellipse and circlefitting:Feature Range DescriptionEllipse Eccentricity [0, 1]Close to 0, the ellipse is circu-lar. Close to 1, it is elongated.Ellipse Major AxisLength[0,∞) The length of the major axisof the ellipse fit.Ellipse Minor AxisLength[0,∞) The length of the minor axisof the ellipse fit.Ellipse Area [0,∞) Area of the ellipse fit.Ellipse Perimeter [0,∞) Perimeter of the ellipse fit.Ellipse Variance [0, 1]A goodness of fit measure forthe ellipse fit.Circle Radius [0,∞) Radius of the circle fit.Circle Area [0,∞) Area of the circle fit.Circle Variance [0, 1]A goodness of fit measure forthe circle fit.Table 2.1: Features extracted from ellipse and circle fits of cell imagesRectangle FittingFitting a bounding rectangle to boundary points obtained from segmen-tation is different from fitting conic sections because a rectangle cannot bedescribed by a single continuous function. A rectangle consists of four suchfunctions with constraints between them. Chaudhuri and Samal describe astep-wise procedure: 1) finding the centroid of the object, 2) determiningprincipal axes, 3) computing the upper and lower furthest edge points along212.2. Feature Extractionthe boundary, and finally, 4) finding the vertices of the bounding rectan-gle [8]. The following table summarizes features obtained from rectanglefit of a cell. These features are used in computation of elongation (a nondimensional shape factor) below. MoHan Zhang, an undergraduate studentat UBC, implemented this method as part of a research experience programproject.Feature DescriptionMaximum Feret Diameter Furthest distance between any twoparallel tangents on the cellMinimum Feret Diameter Shortest distance between any twoparallel tangents on the cellTable 2.2: Features extracted from bounding rectangle fitsPolygon FittingA polygon fit along the cell boundary is computed using the 3-pixel vector(3PV) method described by Inoue and Kimura [18]. The 3PV method is de-signed for calculating the perimeter of low resolution raster objects, wherecounting the number of pixels at the boundary of the object results in inac-curacies. Starting from an arbitrary location, adjacent boundary pixels areassembled in an ordered set. Each element in the set is an ensemble of threeadjacent pixels enumerated in counterclockwise order. The spatial configu-ration of each 3-pixel ensemble is specified using a pair of integers from 0to 7. The integers specify the direction of counterclockwise travel betweenconsecutive pixels. Integers 0, 2, 4, 6, represent east, south, west and northdirections respectively, as shown in Figure 2.7. Similarly, integers 1, 3, 5, 7,are used to encode southeast, southwest, northwest and northeast directionsrespectively for diagonal placement of pixels. Therefore, a 3-pixel ensemblecorresponding to “L” shape (Figure 2.7) is represented by integers [4, 6] in-dicating west and north direction of travel in counterclockwise manner. Theordered set of integer representations for 3-pixel ensembles starting from thearbitrary location is defined as the chain code of the object boundary.222.2. Feature ExtractionFigure 2.7: Computing chain code from boundary pixels(a) Reference diagram from computing chain code. (b) An example of 3-pixel configuration corresponding to√2 length. (c) An example of 3-pixelconfiguration corresponding to√5 length.As part of geometrical feature extraction, 3PV method is used to computethe perimeter of segmented cell images from the chain code representation ofthe cell boundary. Adjacent pair of elements in the chain code is referred toas a chain pair. Inoue and Kimura [18] specify corrections to typical perime-ter calculation (the so-called 1,√2,√5 method, illustrated in Figure 2.7, forcomputing distances for straight, diagonal, and straight followed by diago-nal placement of pixels respectively) for all possible combinations of chainpairs. A set of vectors consisting of pairs of points along the cell boundary(called 3-pixel vectors, since they are derived from chain pairs correspondingto 3-pixel ensembles) is computed where the sum of lengths of these vectorsprovides an accurate estimate of the perimeter of the cell boundary. Withminor adjustments (to account for cases where adjacent vectors do not alignhead to tail) as shown in Figure 2.8, the 3-pixel vector is used to obtain apolygonal fit to the cell geometry. Darrick Lee, an undergraduate summerresearch student at UBC, implemented the 3PV method and the cubic splineinterpolation of cell boundary (derived from the polygonal fit) described inthe next section.232.2. Feature ExtractionFigure 2.8: Computation of polygonal fit from chain code(a) 3-pixel vectors obtained from chain code [0, 1, 7] corresponding to east,southeast and northeast direction of travel between two adjacent 3-pixelensembles on the cell boundary. (b) Ordered set of points along the cellboundary corresponding to the 3-pixel vectors. (c) Connecting points inorder results in incorrect polygon segment as 3-pixel vectors are not alignedhead to tail. (d) A vertex is removed to correct the boundary segment.Feature Method DescriptionCell Area Pixel CountingNumber of pixels inside seg-mented cell boundary.Cell Perimeter3-pixel Vector(3PV) MethodEstimate obtained from chaincode for pixels on cell boundary.Table 2.3: List of other geometrical featuresCubic Spline Boundary FittingA cubic spline interpolation along the cell boundary is computed from ver-tices of the polygon fit described in the previous section. Curvature is calcu-lated by sampling points on the spline. As expected, the number of changesin sign (positive to negative and vice-versa) correlates with convexity of thecell shape. Convex circular cells have positive curvature throughout theboundary and zero changes in curvature sign. For cells with non-zero num-ber of sign flips, areas of high positive curvature correspond to protrusiveregions on the boundary. Boundary descriptors in the feature vector includenumber of changes in the curvature sign (i.e. number of zero crossings),global maximum and global minimum of curvature.242.2. Feature ExtractionFeature DescriptionNumber of sign flipsNumber of times curvature changessign from positive to negative andvice-versa.Maximum curvatureAbsolute maxima of curvature oncell boundary.Minimum curvatureAbsolute minima of curvature oncell boundary.Table 2.4: Features extracted from curvature of cubic spline fit2.2.3 Shape FactorsThe previous section described features obtained by fitting various geome-tries to a cell boundary. Shape factors are non-dimensional quantities thatare computed by counting pixels in a segmented cell image, and its convexhull, bounding box and bounding rectangular fit. Note the distinction be-tween bounding box and the rectangle fit. The edges of the bounding boxare parallel to Cartesian axes whereas the major axis of the bounding rect-angular fit is aligned to the principal axis of the cell shape. Shape factorsare widely used to classify particulate matter [5, 27] and often used as partof feature vectors designed to classify cell shapes [7, 31].ExtentThe ratio of the number of pixels belonging to a segmented cell to thenumber of pixels in its bounding box is defined as the extent. The boundingbox spans horizontally from the leftmost pixel to the rightmost pixel andvertically from the topmost pixel to the bottommost pixel. Extent is closeto zero if a cell is elongated and close to unity if the cell is uniformly spreadout.SoliditySolidity is a measurement of the overall concavity of an object. It is definedas the ratio of the number of pixels belonging to the segmented cell to thenumber of pixels in its convex hull. As cell shape deforms from a convexpolygon or circle to a more elliptical or protrusive shape, its convex hull areaincreases compared to the cell area and solidity correspondingly decreases.252.2. Feature ExtractionFor the MIA PaCa-2 pancreatic cancer data set, rounded cells typically havesolidity values that approach unity.CompactnessCompactness is defined as the ratio of the circular equivalent diameter tothe maximum Feret diameter obtained from the bounding rectangular fit:Compactness =√4(Acell)piMax. Feret diameterThe circular equivalent diameter, also known as area-equivalent diameter,is defined as the diameter of a circle with the same area as the object. Likeextent, compactness is close to zero if a cell is elongated or ‘I’ shaped.ElongationElongation is defined as (1 − Aspect Ratio). Aspect ratio is obtained fromthe rectangle fit as the ratio of minimum to maximum Feret diameters. Forelongated cells, maximum Feret diameter is much larger than minimum Feretdiameter, therefore their elongation is close to unity. Conversely, for circularcells, both diameters are roughly the same. Therefore the elongation of suchcells is close to zero.CircularityCircularity measures the degree to which an object is similar to a circle:Circularity =√4piAcellP 2cellIt can be easily verified that circularity for a perfect circle is unity. Regularpolygons approach a circle as their number of edges increases. It should benoted that a low value of circularity does not necessarily mean that the cellshape lacks rotational symmetry. Circularity close to zero typically indicateselongated or protrusive (e.g. starfish-like) morphology.ConvexityConvexity is defined as the ratio of the convex hull perimeter and the ac-tual perimeter of the object. It is highly sensitive to deviations from convexgeometry. Convexity is close to zero for highly non-convex cell geometries262.3. Dimensionality Reductionand close to unity for epithelial-like cells with polygonal morphology (absentfrom MIA PaCa-2 data set) or circular cells.The following table describes non-dimensional shape factors included inthe feature vector and formulas for their computation:Feature Range Equation DescriptionExtent [0, 1]AcellAbounding boxRatio of pixels belonging tosegmented cell to pixels in thebounding box.Solidity [0, 1] AcellAconvexRatio of pixels belonging tosegmented cell to pixels in theconvex hull.Compactness [0, 1]√4(Acell)piMax. DiameterRatio of circular equivalent di-ameter to maximum Feret di-ameter.Elongation [0, 1] 1− Min. DiameterMax. Diameter1 - Aspect Ratio. Close to 1for elongated cells and close to0 for circular cells.Circularity [0, 1]√4piAcellP 2cellDegree of resemblance to acircle.Convexity [0, 1] Pconvex hullPcellRatio of the convex hullperimeter to the cell perime-ter.Table 2.5: List of non-dimensional shape factors2.3 Dimensionality ReductionHigh-dimensional data exhibits “curse of dimensionality”, i.e. distancesbetween all pairs of points converges to the same value in higher dimensions.Therefore, unsupervised classification methods that rely on clustering algo-rithms to categorize the data set using some distance metric fail to performwell for high dimensional feature vectors. To prevent this problem, cluster-ing is typically performed on a low-dimensional data set after dimensionalityreduction.272.4. Unsupervised Classification2.3.1 Principal Component Analysis (PCA)Principal Component Analysis is a mathematical technique that exploitsvariance in a given data set in order to make patterns in the data salient.It is commonly used in machine learning to visualize and manipulate highdimensional feature vectors. PCA transforms high dimensional data into alow dimensional subspace, where the basis for the low dimensional subspaceis a linear combination of high dimensional basis vectors. The linearity as-sumption reduces the problem of finding an appropriate transformation tothe problem of finding an appropriate projection. The linear combination ofthe original basis is determined in a manner such that the low dimensionalbasis vectors (also known as principal components) correspond to directionwith the greatest variance in the data. In other words, PCA projects highdimensional feature vectors to a new (low dimensional) coordinate system,ensuring that the first axis in the new coordinate system (PCA 1) has maxi-mum variation, the second axis (PCA 2) has the second-most variation, andso on.Mathematically, the principal components are the eigenvectors of the co-variance matrix of the original data set. These eigenvectors are orthogonalsince the covariance matrix is symmetric. Reducing dimensionality of thefeature space by projecting all feature vectors to a low dimensional spacereduces the complexity of cluster identification and k-means clustering. Thenumber of principal components (i.e. dimensionality of the transformedspace) is chosen by plotting the variance in data explained by each princi-pal component versus the number of components. Typically, the numberof principal components is determined by finding an “elbow” in this plot.The elbow signifies the turning point where the trade-off between includingadditional variance is offset by the complexity of dealing with more compo-nents. Generally, most of the variance in the original data is explained bya small number of principal components. Sometimes 2 or 3 components arechosen for ease of plotting. As a rule of thumb, in this thesis, the number iscomponents is chosen such that the total explained variance exceeds 80%.2.4 Unsupervised ClassificationUnsupervised classification refers to grouping of quantifiable objects byinferring relationships between these objects. Clustering algorithms are usedto automatically group the data points (also called descriptors or features282.4. Unsupervised Classificationin the context of machine learning) corresponding to the objects of inter-est. The terms “clustering” and “unsupervised classification” are used in-terchangeably throughout this thesis. Multiple methods for clustering existand they are generally categorized into one of four types [37]:Prototype-Based Methods: Objects belong to the same cluster if theirdistance (a measure of similarity) to the prototype that defines the clusteris smaller compared to their distance to prototypes of other clusters. Theprototype is the most representative point of a given cluster, often lyingat the center of that cluster. k-means, k-medoids and X-means algorithmsbelong to this category.Graph-Based Methods: Relationships between objects (nodes in thiscontext) are represented by edges between nodes in a graph. Objects areclustered by identifying connected components, cliques and neighboringnodes in the graph. Some advanced algorithms define criteria for splittingedges in the Minimum Spanning Tree (MST) of the graph to obtain a forestof clustered points. The Fuzzy C-Means MST Clustering algorithm, MarkovClustering algorithm, Iterative Conductance Cutting algorithm, GeometricMST Clustering algorithm and Normalized Cut Clustering (NCC) algorithmbelong to this category [12].Density-Based Methods: Objects are represented by points in space. Acluster refers to group of densely packed points surrounded by a region oflow density. DBSCAN (density-based spatial clustering of applications withnoise) and OPTICS (ordering points to identify the clustering structure)algorithms are examples of density-based approach to clustering.Shared Property or Conceptual Methods: This is a catch-all categorythat defines a cluster as groups of objects that share some common property.Irrespective of the method used for clustering the data, it is importantto find parameters to optimize algorithm performance and ensure that theresults obtained from the algorithm are meaningful. Many clustering al-gorithms are known to be very sensitive to their input parameters [19].Measures for assessing the efficacy of clustering like Davies-Bouldin indexand silhouette score (defined below) are useful for evaluating the results ofclustering algorithms.292.4. Unsupervised ClassificationThe k-means algorithm and DBSCAN are two widely used methods forclustering data in low dimensions. While the k-means algorithm requires oneparameter (k), the DBSCAN algorithm requires two parameters (minPts,the minimum number of points in the neighborhood of a core point andneighborhood distance ) for distinguishing between core points, boundarypoints and non-reachable points. The k-means algorithm is used for clus-tering data in this thesis due to its ease of implementation and availabilityof methods for parameter estimation. Due to the modular nature of theunsupervised cell classification pipeline, the k-means clustering algorithmcan be swapped for more sophisticated algorithms like X-means or OPTICSthat do not require any input parameters in the future.2.4.1 K-means Clustering AlgorithmThe k-means algorithm partitions input data points, xi (i = 1, . . . , N),into k subsets or clusters, where points in a cluster Cp (p ∈ {2, . . . , k}) areassociated with cluster center µp. The partitioning is determined by min-imizing the distance between points and their cluster centers according tosome user-specified distance metric.K-means Algorithm:Step 1: Randomly pick k points µ1, . . . , µk, from input data xi as clustercenters. Initialize k clusters, C1, . . . , Ck, with these points.Step 2: Assign each point xi to the “nearest” cluster, Cp. The nearest clus-ter is determined by minimizing the sum of squared intra-cluster distancesbetween points and their cluster centers:argminp∑xj∈Cp(D(xj, µp))2.Typically, Euclidean measure is used to compute distance D.Step 3: Calculate new cluster centers by computing the mean of points ineach cluster:µp =1|Cp|∑xj∈CpxjStep 4: Repeat steps 2 and 3 until cluster centers stop changing.Typically, the algorithm is run multiple times with different random initialchoices of cluster centers to ensure stable convergence. In addition to theinput data points, the user has to supply parameter, k, which denotes the302.4. Unsupervised Classificationnumber of clusters in the data. There are four popular methods to findthe optimal number of clusters in an arbitrary data set: elbow heuristic,Bayesian information criterion (abbreviated BIC), silhouette analysis [34]and gap statistic [38]. In practice, using features computed from the MIAPaCa-2 data set, silhouette analysis (described below) performed best interms of robustness and convergence.2.4.2 Silhouette Score AnalysisSilhouette analysis is the study of the degree of separation between clustersof data points using silhouette coefficients. The silhouette coefficient for agiven data point, P , is a measure that quantifies the degree to which the datapoint belongs to its assigned cluster, C. It is computed as follows. Let a bethe mean distance between point P and every other point in its own clusterC. Let b be the mean distance between P and every point in the nearestneighboring cluster. Then, the silhouette coefficient for point P is (b −a)/max (a, b). The silhouette coefficient ranges from -1 to 1. A coefficientvalue near 1 indicates that P has undoubtedly been classified correctly, avalue around 0 indicates that the clustering of P has some ambiguity, anda value near -1 indicates it is likely that P was classified incorrectly.Rousseeuw described a heuristic using silhouette coefficients to identifythe number of clusters in a given data set [34]. Points are clustered us-ing k-means for various values of parameter k. Assuming that the algorithmconverges and gives stable results, the silhouette score is computed by calcu-lating the average of silhouette coefficients for all data points. The number ofclusters in the data set, i.e. the optimal value for k, is one that maximizesthe silhouette score. This technique is demonstrated using syntheticallygenerated data in Figure 2.9. 10,000 data points corresponding to 10 clus-ters (1000 points per cluster) are generated by transforming and combininguniform random distributions (see Figure 2.9a). Figure 2.9b shows a plotof silhouette score computed for various values of parameter k. The mostprobable value of k is automatically determined by finding the maximum inthis plot. For the synthetic data set, the silhouette score is maximized atk = 10. Figure 2.9c shows the k-means clustering result (corresponding tok = 10), with points colored according to their cluster label. Sorted valuesof silhouette coefficients for individual data points (grouped by cluster la-bel) are shown in Figure 2.9d, with the vertical red line depicting the overallsilhouette score obtained by averaging.312.4. Unsupervised Classification(a) 10,000 synthetic data points corre-sponding to 10 user-defined cluster shapes(b) Determining optimal k automaticallyby computing average silhouette score(c) Labeled data points (obtained by k-means) corresponding to 10 clusters(d) Sorted silhouette scores for datapoints in each clusterFigure 2.9: Clustering synthetic data using silhouette score analysis32Chapter 3Results and DiscussionThis chapter illustrates the methodology described in Chapter 2 usingdata acquired by segmenting phase-contrast microscopy images of the MIAPaCa-2 cancer cell line. MIA PaCa-2 is a human cell line that was estab-lished by A. Yunis, et al. [42] from primary tumor tissue of the pancreas. Itis currently used as an in-vitro model to study carcinogenesis in pancreaticductal adenocarcinoma [14]. The morphological and genetic characteristicsof cells belonging to the MIA PaCa-2 cell line are well understood and read-ily available in the literature [14]. Therefore, unsupervised classification ofthese cells based on their morphology is of little biological relevance. Thiscell line was chosen for the purpose of demonstrating the methodology, sinceMIA PaCa-2 cells exhibit several distinct morphological shapes.Phase-contrast MIA PaCa-2 images are segmented using the methodologydescribed in Section 2.1. Using 40 images acquired by the Roskelley Lab atUBC, 149 correctly segmented cells are identified by visually inspecting thesegmentation result and comparing it to the original phase image. A typicalsegmentation result is shown in Figure 3.1. Cells that cross the boundaryof the image are removed during mathematical morphology stage of imageprocessing. Note that boundaries of cells that are closely packed cannot beresolved by the watershed algorithm (see Figure 3.1b), since these cells shareforeground markers and are treated as one object. The number of iterationsrequired to separate foreground markers (obtained from mathematical mor-phology) of closely packed cells using erosion is high. Since all foregroundmarkers are eroded equal number of times to keep the image segmentationprocess as automated as possible, too many erosion operations prevents cor-rect segmentation of elongated cells due to the loss of foreground markersin their long thin “tails”. Therefore, the number of iterations of erosion isa parameter that affects the range of cell sizes and population density atwhich correct segmentations can be obtained. Once this parameter is man-ually chosen, the user must identify correctly segmented cells (by selectingIDs assigned to watershed outlines, as shown in Figure 3.1b) for serializa-tion. Serialization refers to the process of saving segmented boundaries in a33Chapter 3. Results and Discussiondatabase for feature extraction and further processing.Each cell is assigned a unique ID (henceforth referred to as UID) after allimages are segmented and manually selected cells are serialized. UID is usedto keep track of cells (and the original image from which they are obtained)through the remainder of the classification process.(a) MIA PaCa-2 image acquired using phase-contrast microscopy(b) Boundaries (tagged with IDs) ob-tained using mathematical morphologyand watershed segmentation(c) Correctly segmented cells are serial-ized by manually selecting IDs (from Fig-ure 3.1b)Figure 3.1: Selection of correctly segmented cells34Chapter 3. Results and DiscussionAs proof-of-concept, a manually curated subset of the 149 segmented cellsis used for unsupervised classification in this thesis. The subset, consisting of63 segmented cells, is preselected to only include cells with clear and distinctmorphologies. The remaining cells with ambiguous morphology are omittedfrom the data set as follows. A preliminary set of features is computed forall 149 cells. Each feature vector is reduced to a two dimensional vectorusing principal component analysis (PCA). Outlier points (correspondingto anomalous cell morphologies) are identified and manually removed fromthe data set. UIDs of outliers are noted for further inspection as these cellsare often of interest to experimental biologists. Manual outlier removal is atime-consuming task that does not scale with input data size. Automatictechniques for outlier detection (currently under development) will removethis bottleneck in the future. In addition, points that appear to lie betweenclusters in the two-component PCA space are removed to retain groups ofcells that have distinctive morphologies. Removal of points between clustersimproves stability of the k-means clustering algorithm. Stability and conver-gence of the algorithm is crucial for estimating parameter k (correspondingto number of clusters in the data set) using silhouette score analysis.Four distinct morphologies are evident in the MIA PaCa-2 phase images(see Figure 3.2). The manually curated data set contains 15 cells with “cir-cular” morphology. These rounded cells appear brighter compared to othercells in the phase images. Typically, cells are assumed to adopt a circularmorphology and become stationary before undergoing mitosis. The dataset contains 12 cells with “protrusive” morphology. In the absence of live-imaging data for verification, these cells are assumed to be migratory withthe presence of one or more lamellipodia on their cell boundary. The data setalso contains 16 and 20 cells exhibiting “elliptical” and “elongated” morphol-ogy respectively. Elliptical cells are non-circular, oval or teardrop shapedand lack distinctive lamellipodia. Elongated cells are highly stretched alongtheir principal axis. In other words, the maximum Feret length of elongatedcells is much larger compared to their minimum Feret length.353.1. Exploratory Data AnalysisFigure 3.2: Four distinct morphologies in MIA PaCa-2 phase imagesValidation of any unsupervised classification methodology requires com-parison with ground truth. Ground truth refers to annotated data set whereeach data point is assigned a label based on the category to which it be-longs. In this case, the data set consists of 63 cells, further categorized intofour labeled groups. The four labels correspond to 12, 15, 16 and 20 cellswith protrusive, circular, elliptical and elongated morphology respectively.The clustering algorithm also assigns labels (one label per identified clus-ter) to each data point. The efficacy of the methodology is determined bymeasuring the number of incorrect classifications. Labels assigned by theclustering algorithm are compared to ground truth labels in order to identifymisclassified cells.3.1 Exploratory Data AnalysisUnsupervised classification requires extraction of features from segmentedcell images. A feature vector is computed for each cell using the featureextraction process described in Section 2.2. Feature extraction decomposes363.1. Exploratory Data Analysiscell images into their morphological characteristics using mathematical tech-niques to quantify cell shape and size. Each feature vector contains 27 fea-tures, including 7 Hu’s invariant moments, 11 geometrical features, 3 bound-ary features and 6 shape factors. This section provides insight into diversitywithin the curated data set (consisting of 63 cells represented by points inthe feature space) as well as relationships between different features.(a) φ1 and φ2(b) φ3, φ4 and φ6(c) φ5(d) φ7Figure 3.3: Plots of Hu’s moment invariants for all cells in the data set373.1. Exploratory Data AnalysisHu’s moment invariants are plotted against cell UID (arranged in no par-ticular order) in Figure 3.3. Surprisingly, all moment invariants are corre-lated. The high degree of correlation is evident in Figure 3.3a, where φ1 andφ2 follow the same peak and trough pattern. Similarly, φ3, φ4, φ5 (plottedseparately in Figure 3.3c due to their low values) and φ6 peak simultane-ously for certain cells. Furthermore, φ2, φ3, φ4, φ6 and φ5 increasingly lackresolution (in that order) and fail to provide any information that cannotbe obtained from φ1.Rotationally symmetric objects have moment invariant values close tozero. Therefore, one can distinguish between circular cells and non-circular“stretched” cells (those that have a more elliptical or elongated morphology)by thresholding φ1, as shown in Figures 3.4 and 3.5. Although cells withstretched morphology are typically larger in size in the MIA PaCa-2 dataset, the discrimination is based on cell shape rather than cell size, as valuesof Hu’s moment invariants are not affected by scaling cell size.Figure 3.4: Thresholding Hu’s invariant moment φ1The threshold value, 0.3, is determined by trial and error to separate cellsinto two groups as follows. Cells (identified by their UID on the horizontalaxis) with circular morphology have φ1 values below the threshold. Con-versely, those with stretched morphology have φ1 values above the threshold.383.1. Exploratory Data Analysis(a) Segmented cell images for subset of cells with φ1 > 0.3(b) Segmented cell images for subset of cells with φ1 < 0.3Figure 3.5: Manual classification of cells by thresholding φ1Skew invariants, φ7 are plotted against cell UIDs in Figure 3.3d. Values ofφ7 occupy a narrow range from −8× 10−7 to 6× 10−7. On first inspection,393.1. Exploratory Data Analysisit appears that one can classify cells into three distinct categories using twothresholds (see Figure 3.6). However, there is no clear visually apparentmorphological difference between cells corresponding to positive peaks in φ7values and cells corresponding to negative peaks in φ7 values, as shown inFigures 3.7a and 3.7b.Figure 3.6: Thresholding Hu’s invariant moment φ7Thresholds are arbitrarily chosen in an attempt to classify cells using φ7.Cells with φ7 values close to zero have circular morphology (not shown inthis figure).403.1. Exploratory Data Analysis(a) Segmented cell images (and corresponding UIDs) for cells with φ7 values abovethe upper threshold (corresponding to φ7 > 10−7).(b) Segmented cell images (and corresponding UIDs) for cells with φ7 values belowthe lower threshold (corresponding to φ7 < −10−7).Figure 3.7: Manual classification of cells by thresholding φ7Pairs of shape factors, cell area and perimeter computed from segmentedcell images are plotted against each other in Figure 3.8. Correlation be-tween these features offers insight as to which features might be combinedtogether during dimensionality reduction (PCA). Sum of squared residuals(R2) quantifies the error in linear regression between pairs of shape factors.413.1. Exploratory Data Analysis(a) Compactness vs. Extent (b) Convexity vs. Solidity(c) Elongation vs. Circularity (d) Cell Area & PerimeterFigure 3.8: Correlation between shape factors, cell area and perimeterStrong positive correlation (evidenced by low R2 value) is observed be-tween convexity and solidity, as shown in Figure 3.8b. Note that the major-ity of cells in the data set are highly convex. The data set contains cells thatrepresent values across a wide range of elongation and circularity. Circu-lar cells can be identified by plotting perimeter versus area and identifyingpoints that lie close to the curve P =√4piA in Figure 3.8d. The more a celldeforms from circular shape, the further away from the curve will its (A,P )value be on this graph.423.2. Clustering Using Hu’s Moment Invariants3.2 Clustering Using Hu’s Moment InvariantsAs demonstrated in Section 3.1, Hu’s moment invariants can be used (withmanually assigned thresholds) to distinguish between circular and stretchedcell morphology. However, the goal of this thesis is to determine whethercells can be classified automatically. Can appropriate thresholds be deter-mined implicitly and automatically by performing dimensionality reductionand clustering? To investigate further and answer similar questions regard-ing geometrical descriptors and shape factors in the following sections, thefollowing procedure is implemented.(a) PCA Elbow Plot (b) 1-component PCA(c) 2-component PCA (d) 3-component PCAFigure 3.9: PCA of Hu’s moment invariants433.2. Clustering Using Hu’s Moment InvariantsPrincipal Component Analysis (PCA) is used to transform segmentedcells denoted by points in seven dimensional feature space (consisting of allmoment invariants, φ1 . . . φ7) to a lower dimensional subspace. Dimension-ality reduction in this manner facilitates the identification of clusters usingsilhouette analysis.According to the elbow plot (Figure 3.9a), two principal components ac-count for over 80% of the explained variance. Since all moment invariantscomputed from MIA PaCa-2 segmented cell images happen to be correlated(see Section 3.1), most of the variance in the data can be captured by asingle principal component, as shown in Figure 3.9b. This is further verifiedin 2-component and 3-component PCA plots (Figures 3.9c and 3.9d), wheremajority of the data points are clustered and outliers are responsible for thevariance. However, in conformity with the rule of thumb (retaining over80% of variance in data) and the elbow heuristic, two principal componentsare used in subsequent analysis.The coefficients or weights assigned to the linear combination of features(φ1 . . . φ7) for the first and second principal component are (up to 2 decimaldigits):PC1 = (0.32, 0.322, 0.44, 0.44, 0.42, 0.44,−0.15),and, PC2 = (−0.58,−0.58, 0.16, 0.15, 0.27, 0.13,−0.42),respectively. By definition, majority of the variation in data is explainedby the first principal component (PC1), represented by the horizontal axisin the 2-component PCA plot (Figure 3.9c). Projecting points on this axisprovides an approximation of 1-component PCA, shown in Figure 3.9b.Typically, this information is visualized in a biplot, like the one depicted inFigure 3.10. A biplot, in the context of PCA, consists of points and vectorsdrawn on a plot where the axes represent the principal components. Pointsare used to represent the transformed data set on the principal componentaxes, same as plots in Figure 3.9. Vectors, drawn on the same axes, representthe feature variables expressed in terms of the basis vectors of (PCA1, 0) and(0,PCA2).443.2. Clustering Using Hu’s Moment InvariantsFigure 3.10: Biplot for 2-component PCA using Hu’s moment invariantsThe alignment of vectors in the biplot indicates their degree of correla-tion. Notice the correlation between φ1 and φ2, also evident in Figure 3.3a.Figure 3.10 suggests that if outliers (points scattered on the right) are re-moved, then majority of variation in the data will be captured in φ1 and φ2,as expected from the exploratory data analysis in Section 3.1.Correlation between features can also be visualized with the aid of a fea-ture agglomeration tree, shown in Figure 3.11. The tree is built bottom-upby recursively merging features (or combination thereof) using Ward’s link-age method for hierarchical cluster analysis. Ward’s method combines clus-ters of features using the sum of squared deviations from points to centroidsas a distance metric. In other words, features are combined based on theproportion of variance explained by those features, in a manner similar toPCA. The leaves of the feature agglomeration tree are individual featuresin the feature vector. The number assigned to internal nodes indicates thedegree of correlation between its children, with a higher number indicatinggreater correlation. The root of the tree is assigned the lowest number, sinceit combines two sub-trees (or clusters of features) with minimal correlationbetween them.453.2. Clustering Using Hu’s Moment InvariantsFigure 3.11: Feature agglomeration tree for Hu’s moment invariantsSilhouette analysis using data points projected on two-component PCAaxes predicted two clusters, as shown in Figure 3.12. Labeled data pointsare shown in Figure 3.12c. Same exact results, with regard to number ofclusters and allocation of points to clusters, are obtained using 1-componentor 3-component PCA.k-means clustering (with k = 2) using Hu’s moment invariants as featurevector classified cells into two groups: 58 cells corresponding to black clusterlabels and 5 cells corresponding to green cluster labels. Since the value ofHu’s moment invariants is not affected by cell size, separation of cells intwo clusters is based on cell shape. Cells labeled in green have morphologythat is highly non-circular and corresponds to high values of Hu’s momentinvariants seen in Figures 3.4 and 3.5. Segmented cell images correspondingto labeled points in the clustering result are shown in Figure 3.13.463.2. Clustering Using Hu’s Moment Invariants(a) Identifying number of clusters (b) Silhouette scores for K = 2(c) Labeled data points (annotated with UID) for 2-component PCAFigure 3.12: Silhouette score analysis of Hu’s moment invariants473.3. Clustering Using Geometrical Feature Descriptors(a) Subset of cells corresponding to black cluster label(b) All cells corresponding to green cluster labelFigure 3.13: Classification of cells using Hu’s moment invariants3.3 Clustering Using Geometrical FeatureDescriptorsThis section describes results obtained by following the same procedureas Section 3.2, but using geometrical features instead of Hu’s moment in-variants. Geometrical features (summarized in Tables 2.1 and 2.2) consist ofinformation obtained from circle, ellipse, rectangle and polygon fits. PCAreveals that two principal components retain over 80% of explained vari-ance, as shown in Figure 3.14a. Features expressed in terms of principalcomponents are shown in the biplot (Figure 3.15). Ellipse perimeter andminimum Feret length are correlated and aligned with the second principalcomponent. Majority of variance in the data (over 70% according to Fig-ure 3.14a) is captured by the remaining features which are aligned in thedirection of the first principal component. The feature agglomeration tree(Figure 3.16) confirms this observation.483.3. Clustering Using Geometrical Feature Descriptors(a) PCA Elbow Plot (b) 2-component PCAFigure 3.14: PCA of normalized geometrical featuresFigure 3.15: Biplot for 2-component PCA using geometrical featuresFeatures obtained from ellipse fit, circle fit, rectangle fit and polygonal fitare plotted in red, magenta, yellow and green respectively.493.3.ClusteringUsingGeometricalFeatureDescriptorsFigure 3.16: Feature agglomeration tree for geometrical featuresNote that polygon area (denoted “Area” in graph above) and ellipse area are highly correlated. Similarly, polygonperimeter (denoted “Perimeter” in graph above) and circle perimeter are highly correlated. This is also noticeablein the alignment and (almost) equal magnitude of their vectors in the biplot (Figure 3.15), representing the(almost) equal weights assigned to these features in the principal components.503.3. Clustering Using Geometrical Feature DescriptorsTwo clusters are identified by performing silhouette analysis on geomet-rical feature data transformed using PCA, as shown in Figure 3.17a. Thetransformed data is clustered using k-means algorithm with parameter k =2, to assign label to each cell UID (see Figure 3.17b).(a) Identifying number of clusters using silhouette score(b) Labeled data points (annotated with UID)Figure 3.17: Silhouette analysis and clustering of geometrical features513.4. Clustering Using Geometrical and Boundary FeaturesClustering using geometrical features resulted in improved classificationof cells, clearly demonstrated by plotting segmented cell images correspond-ing to cluster labels in Figure 3.18. There is clear morphological differencebetween cells in Figure 3.18a and cells in Figure 3.18b. Cells correspondingto the 23 points in the green labeled cluster are smaller in size comparedto cells corresponding to 40 points in the black labeled cluster. The clas-sification is primarily based on cell size rather than cell shape, in contrastto classification using Hu’s moment invariants. This can be easily verified,since some large cells with circular or protrusive morphology are labeled inblack (see Figure 3.18c) as opposed to the majority of such cells that arelabeled in green.The results described above are obtained by classifying transformed fea-tures using k-means algorithm, using parameter k = 2 corresponding tothe maximum silhouette score. However, if geometrical features are classi-fied with k = 4 (not shown), using a priori knowledge that the data setcontains four different morphologies, then morphologically similar cells areplaced in the same cluster with only two mis-classifications. Consequently,employing a better cluster identification mechanism instead of silhouettescore or replacing k-means with another clustering algorithm can improveunsupervised cell classification using geometrical features.3.4 Clustering Using Geometrical and BoundaryFeaturesAnother approach for improving results obtained in the previous sectionis to consider increasing variance in the data set by introducing additionalfeatures. Performing PCA on a larger set of features exploits increasedvariance and enables k-means to identify new clusters. Since geometricalfeatures lack information about the boundary of cells, it is natural to as-sume that the combination of geometrical and boundary features will leadto better classification. Boundary features encode information about peaksand changes in the curvature of cell boundary. Therefore, boundary fea-tures can be used to distinguish between cells with multiple protrusions (i.e.more sign flips in curvature of boundary) and circular/elliptical cells thathave mostly positive curvature. This hypothesis is tested in the this section.523.4. Clustering Using Geometrical and Boundary Features(a) Subset of cells corresponding to green cluster label(b) Subset of cells corresponding to black cluster label(c) Circular/protrusive cells in black clusterFigure 3.18: Classification of cells using geometrical features533.4. Clustering Using Geometrical and Boundary Features(a) PCA Elbow Plot(b) Identifying number of clusters(c) Labeled data points (annotated with UID)Figure 3.19: PCA, silhouette analysis and clustering of combined (geomet-rical and boundary) features543.4. Clustering Using Geometrical and Boundary FeaturesAs shown in Figure 3.19, addition of boundary features did not lead toidentification of any new clusters. In the two-component PCA space, place-ments of points in Figure 3.19c is almost identical to Figure 3.17b. Thegreen cluster contains 40 points and the black cluster contains 30 points.There is no difference in the allocation of points to cluster labels. Possiblereasons for lack of improvement can be ascertained by looking at correlationsbetween features.Figure 3.20: Biplot for 2-component PCA using combined featuresThe color scheme for geometrical features is same as Figure 3.15. Boundaryfeatures are plotted in blue.Both biplot (Figure 3.20) and feature agglomeration tree (Figure 3.21)confirm that minimum boundary curvature is correlated to ellipse perimeterand minimum Feret perimeter. Maximum boundary curvature is correlatedto ellipse variance and other features in the left sub-tree obtained by splittingthe feature agglomeration tree at its root node.553.4.ClusteringUsingGeometricalandBoundaryFeaturesFigure 3.21: Feature agglomeration tree for combined features563.5. Clustering Using Shape Factors3.5 Clustering Using Shape FactorsSix non-dimensional shape factors, extent, solidity, compactness, elonga-tion, circularity and convexity (see Section 2.2.3 for definitions) were com-puted for each segmented cell image. Principal component analysis (Fig-ure 3.22) reveals that over 80% of variance in data set can be capturedusing two principal components.(a) PCA Elbow Plot (b) 2-component PCAFigure 3.22: PCA of non-dimensional shape factorsBiplot diagram (Figure 3.23a) and feature agglomeration tree (Figure 3.23b)both confirm that compactness, extent and circularity are highly correlated.This indicates that similar clustering results (for the MIA PaCa-2 data set)can be obtained by computing just one out of these three shape factors.Silhouette analysis confirms the presence of four clusters, as shown inFigure 3.24. Cell UIDs corresponding to the four cluster labels are shownin Figure 3.24c.573.5. Clustering Using Shape Factors(a) Biplot for 2-component PCA using shape factors(b) Feature agglomeration tree for shape factorsFigure 3.23: Analyzing correlation in shape factors using biplot and featureagglomeration583.5. Clustering Using Shape Factors(a) Identifying number of clustersusing silhouette score (b) Silhouette scores for each cluster(c) Labeled data points (annotated with UID)Figure 3.24: Silhouette analysis and clustering of shape factors593.5. Clustering Using Shape Factors(a) Subset of cells corresponding to black cluster label(b) Subset of cells corresponding to blue cluster label(c) Subset of cells corresponding to yellow cluster label(d) Subset of cells corresponding to green cluster labelFigure 3.25: Classification of cells using shape factorsClassification results are further improved using shape factors. A subset ofcells corresponding to each cluster label are shown in Figure 3.25. The black,blue, yellow and green labels identify cells with predominantly elongated,circular, protrusive and elliptical morphology respectively. According tooriginal annotations, four cells were mis-classified. An elliptical cell was mis-classified and incorrectly labeled as elongated. Three protrusive cells were603.5. Clustering Using Shape Factorsmis-classified: two were labeled as elliptical and one was labeled circular.Note that adding boundary features to the feature vector (in addition toshape factors) does not improve the result.In summary, using shape factors as features led to automatic identificationof four clusters corresponding to the elliptical, elongated, circular and pro-trusive morphologies evident in the manually vetted data set. Unsupervisedclassification of shape factors through a combination of PCA, silhouettescore analysis and k-means resulted in just four mis-classifications. Hu’smoment invariants, geometrical and boundary features did not perform aswell as shape factors using this methodology. Classification of cells usingHu’s moment invariants can be improved further by using t-SNE insteadof PCA for dimensionality reduction. Results obtained using t-SNE aredescribed in Appendix A.61Chapter 4ConclusionsCorrectly identifying specific objects such as cells in grayscale images withnoise remains a challenging problem. Even the human visual system, thathas evolved to recognize complex shapes within highly unstructured back-grounds can fail to find specific objects, misinterpret visual cues, or fail todetect cryptic shapes. What is more, even when all cells in a microscopyimage are detected and outlined (segmented), it is still challenging to hu-mans to distinguish between discrete classes when the classes have someoverlap. For this reason, computational methods, which have nowhere nearthe discriminating power of the human visual system (at least for a smallnumber of objects) are extremely challenging to develop. In this thesis, MIAPaCa-2 pancreatic carcinoma images are used to develop and test a pipelinefor unsupervised cell classification, using a number of pre-existing methodsthat are adapted, improved, or modified. The process of assembling thispipeline has led to several areas of learning, both of the underlying biology,and of aspects of mathematical and computational aspects of the problems.An image segmentation procedure that combines mathematical morphol-ogy and marker-based watershed segmentation algorithm is used to auto-matically obtain cell boundaries from phase-contrast images of MIA PaCa-2cells. Correctly segmented cell boundaries are manually selected for featureextraction. The feature extraction process is not only an important com-ponent of the unsupervised classification pipeline, but it is also a means toquantify cell morphology in a manner that is useful for other applications.By constructing a feature vector through geometrical fitting, computationof boundary spline interpolation and shape factors as part of the overallmethodology, the wealth of information gathered can be used by biologiststo quantify cell and tissue morphology, determine the presence of multiplecell types, and/or mutations or epigenetic changes that affect cell shape.With additional development, the methods discussed here can be used tostudy changes in tissue geometry during morphogenesis using images ac-quired through time-lapse microscopy.62Chapter 4. ConclusionsAfter feature extraction, a manually curated subset of cell features isused to identify circular, elliptical, protrusive and elongated cells. Whilethe necessity for manual curation indicates some of the limitations of thismethodology, it should be noted that MIA PaCa-2 cells do not exhibit adiscrete set of shapes. As is the case with majority of other cell types, themorphology of MIA PaCa-2 cells lies in a continuum of shapes which makestheir classification difficult, even for trained experts. For typical applicationsthat require identification of distinct shapes in a heterogeneous population(arising from co-cultures, mixture of control and experimental group, etc.),variability in features is expected to be much higher compared to variabilityin the MIA PaCa-2 single cell line data. This suggests that one aspect offuture work would be to test the methods on a variety of cell images, wherethere is a clearer distinction between cell types or morphologies.Cell features are classified using the k-means clustering algorithm afterperforming dimensionality reduction to take advantage of correlation be-tween features. PCA is used to compute a linear combination of features inorder to project data from high dimensional feature space to low dimensionalprincipal component space while retaining at least 80% of the variance inthe original feature data. Silhouette analysis is used to determine the mostprobable number of clusters in the low dimensional space. After clustering,segmented cell images corresponding to cluster labels (assigned to cell UIDs)are used to identify the number of mis-classifications.One of the goals for this thesis is to identify a minimal set of featuresthat are computationally inexpensive to calculate but also competent inclassifying cells. The number of mis-classifications is used to evaluate theperformance of different kinds of features. As shown in Chapter 3, cluster-ing using shape factors resulted in fewest mis-classifications. Improved cellclassification using Hu’s moment invariants is possible, although it requiresthe use of a non-linear embedding for dimensionality reduction (detailed inAppendix A). It should be noted that more information (besides global ex-trema and number of sign flips in curvature) can be obtained from the splineboundary fit, which may result in improvement of classification as well asidentification of new classes based on locations of cell protrusions. Sugges-tions for improvement in the methodology and potential areas for futurework are identified in the next chapter.63Chapter 5Future WorkThe previous chapters describe a general methodology for unsupervisedcell classification. The methodology is implemented in a modular pipelinethat consists of algorithms used to perform image segmentation, featureextraction, dimensionality reduction and cluster identification techniques,in a sequential order. Each stage of the process takes input and producesoutput in a specified format, making the components of the pipeline re-placeable. For example, the phase-contrast image segmentation componentcan be replaced by another component to segment images obtained througha different kind of microscopy. The pipeline is designed to handle high-throughput data, eliminating the need for manual intervention as much aspossible. Currently, correctly segmented cells have to be manually identifiedand only a curated subset of the data is passed on to the dimensionality re-duction step. With the improvements in methodology suggested below, it ispossible that a fully automated pipeline will become a reality in the future.The addition of more quantifiable morphology based features can be usedto increase the distance between non-similar cells represented by points inhigh dimensional feature space. This naturally improves the clustering ofpoints after dimensionality reduction, potentially eliminating the need formanual curation. Suggestion for new features are highlighted below:The curvature of the cubic spline interpolation of cell boundary identifiescell protrusions as regions of positive curvature, as demonstrated in Fig-ure 5.1. However, for cells with long straight segments in cell boundary (seeFigure 5.2), there are large number of changes in sign of curvature along thestraight edges. Therefore, the number of positive curvature segments doesnot correspond to number of protrusions. As a result, the number of pro-trusions along the cell boundary is not represented in the boundary featurevector. Furthermore, information about the location of protrusions alongthe cell boundary is also missing from the boundary feature vector. Thislimits the methodology, preventing it from distinguishing between cells that64Chapter 5. Future Workhave protrusions oriented in the same direction and those cells that haverandom placed protrusions along their boundary.(a) Cropped phase con-trast microscopy imageshowing a protrusive MIAPaCa-2 cell(b) Cubic spline interpola-tion of cell boundary show-ing regions of positive cur-vature (in red) and nega-tive curvature (in blue)(c) Spline color-coded bymagnitude of curvature,using the same colorscheme as 5.1bFigure 5.1: Boundary curvature of a protrusive cell(a) Cropped phase con-trast microscopy imageshowing an elliptical MIAPaCa-2 cell(b) Cubic spline interpola-tion of cell boundary show-ing regions of positive cur-vature (in red) and nega-tive curvature (in blue)(c) Spline color-coded bymagnitude of curvature,using the same colorscheme as 5.1bFigure 5.2: Boundary curvature of an elliptical cellIn addition to Hu’s moment invariants, geometrical features, boundaryfeatures and shape factors, objects (including segmented cell images) can beclassified by the degree of symmetry in their shape. In order to classify cellsirrespective of their position and orientation, features should be invariant toimage translation and rotation. Therefore, quantifying symmetry in a cellshape would require identifying its principal axes and computing features inrelation to those axes. Furthermore, cell polarity (dissimilarity of “front” vs“back”) is known to be an important aspect of cell motility. Future work65Chapter 5. Future Workshould aim to quantify this feature.The choice of clustering algorithm and cluster identification strategy cansignificantly impact the outcome of unsupervised classification. While k-means and silhouette analysis are employed in this thesis due to their sim-plicity and widespread use elsewhere, replacing them with advanced cluster-ing algorithms (e.g. OPTICS, variants of DBSCAN, Voronoi-based meth-ods) will likely result in improved identification of clusters and robustnessto noise. While two principal components captured over 80% of variance indata for all cases considered in Chapter 3, any choice of clustering algorithmshould have the ability to deal with three or more principal components.Methods for supervised classification of cells (based on morphological fea-tures) are not covered in this thesis for two reasons. Firstly, supervisedclassifiers like Support Vector Machines (SVMs), random forests and deepneural networks have already been shown to work for similar problems inrecent publications (see Section 1.1). These methods require training usingannotated data, a major disadvantage compared to unsupervised classifi-cation. Annotation of data is a laborious process that typically requireslabeling cells experimentally (while making sure that the labeling techniquedoes not have unintended consequences for structural and functional aspectsof the cell) or manually assigning labels to each cell after image acquisition.Secondly, supervised classifiers like convolutional neural networks (CNNs)that do not require pre-computed features can easily achieve the end goal(i.e. cell classification based on morphology), but valuable quantitative in-formation obtained as part of the feature extraction process is lost. CNNsbelong to the class of pixel-based learning methods, which require a list ofpixel intensities in the segmented cell image as input. Features are implicitlyencoded in the weights and biases of neurons during the training process,thus eliminating the need for feature extraction. While a trained CNN canbe used to identify distinguishing features in segmented cell images throughdeconvolution, the information thus obtained often lacks biophysical mean-ing.Deconvolution of a CNN provides a list of image filters (a subset of whichcan be interpreted as edge and corner detectors) that can be applied to asegmented cell image in order to extract features for the purpose of classi-fication. However, these filters typically do not correspond to measurablequantities like magnitude of curvature, perimeter length, etc. Despite the66Chapter 5. Future Worklack of biophysical meaning, filters obtained by deconvolution might be usedin conjunction with other feature extraction techniques described in Chap-ter 3 to yield improved classification results.Certain artificial neural networks (ANNs) can be used in an unsuper-vised manner to perform feature extraction without training data. However,the features computed by these ANNs lack biophysical meaning, similar toCNNs. Autoencoders, a type of ANN, is widely used to perform dimension-ality reduction. The efficacy of using an autoencoder network instead ofPCA or t-SNE for the purpose of cell classification will be evaluated in thefuture.The work described in this thesis is limited to the classification of cellsin still images. By (re)-constructing the trajectory of cells and their lineageusing images acquired through time-lapse microscopy, the feature vectorcan be augmented with information about cell movement and interactionsbetween cells. Future work in this direction would involve computing “liveimaging features” to study collective cell migration in wound-healing assays.The live imaging features will include cell velocity, neighbor count and es-timate of cell cycle time. This would enable classification of cells based onmigratory patterns and identification of cells that morph (from one shapeto another) over time.67Bibliography[1] Ilmari Ahonen, Ville Hrm, Hannu-Pekka Schukov, Matthias Nees, andJaakko Nevalainen. Morphological Clustering of Cell Cultures Basedon Size, Shape, and Texture Features. Statistics in BiopharmaceuticalResearch, 8(2):217–228, April 2016.[2] Morteza Moradi Amin, Saeed Kermani, Ardeshir Talebi, andMostafa Ghelich Oghli. Recognition of Acute Lymphoblastic LeukemiaCells in Microscopic Images Using K-Means Clustering and SupportVector Machine Classifier. Journal of Medical Signals and Sensors,5(1):49–58, 2015.[3] Richard Barnes, Clarence Lehman, and David Mulla. Priority-flood: Anoptimal depression-filling and watershed-labeling algorithm for digitalelevation models. Computers & Geosciences, 62:117–127, 2014.[4] Manuele Bicego and Pietro Lovato. A bioinformatics approach to2d shape classification. Computer Vision and Image Understanding,145:59–69, April 2016.[5] R. G. Billiones, M. L. Tackx, and M. H. Daro. The geometric features,shape factors and fractal dimensions of suspended particulate matter inthe Scheldt Estuary (Belgium). Estuarine, Coastal and Shelf Science,48(3):293–305, 1999.[6] Christina Carlsson. Vehicle size and orientation estimation using geo-metric fitting. PhD thesis, Division of Automatic Control, Departmentof Electrical Engineering, Linkpings universitet, Linkping, 2000. OCLC:474207954.[7] Thanatip Chankong, Nipon Theera-Umpon, and Sansanee Auephan-wiriyakul. Automatic cervical cell segmentation and classificationin Pap smears. Computer Methods and Programs in Biomedicine,113(2):539–556, February 2014.68Bibliography[8] D. Chaudhuri and A. Samal. A simple method for fitting of boundingrectangle to closed regions. Pattern Recognition, 40(7):1981–1989, July2007.[9] Claire Lifan Chen, Ata Mahjoubfar, Li-Chia Tai, Ian K. Blaby, AllenHuang, Kayvan Reza Niazi, and Bahram Jalali. Deep Learning in Label-free Cell Classification. Scientific Reports, 6(1), August 2016.[10] S. Dimopoulos, C. E. Mayer, F. Rudolf, and J. Stelling. Accurate cellsegmentation in microscopy images using membrane patterns. Bioin-formatics, 30(18):2644–2651, September 2014.[11] G. A. Dunn and A. F. Brown. Alignment of fibroblasts on groovedsurfaces described by a simple geometric transformation. Journal ofcell science, 83(1):313–340, 1986.[12] P. Foggia, G. Percannella, C. Sansone, and M. Vento. Benchmark-ing graph-based clustering algorithms. Image and Vision Computing,27(7):979 – 988, 2009. 7th IAPR-TC15 Workshop on Graph-based Rep-resentations (GbR 2007).[13] Hubert Fonga. Pattern recognition in gray-level images by Fourier anal-ysis. Pattern Recognition Letters, 17(14):1477–1489, December 1996.[14] Rui Gradiz, Henriqueta C. Silva, Lina Carvalho, Maria FilomenaBotelho, and Anabela Mota-Pinto. MIA PaCa-2 and PANC-1 pan-creas ductal adenocarcinoma cell lines with neuroendocrine differentia-tion and somatostatin receptors. Scientific Reports, 6(1), April 2016.[15] Radim Halr and Jan Flusser. Numerically stable direct least squaresfitting of ellipses. In Proc. 6th International Conference in CentralEurope on Computer Graphics and Visualization. WSCG, volume 98,pages 125–132. Citeseer, 1998.[16] Ming-Kuei Hu. Visual pattern recognition by moment invariants. IREtransactions on information theory, 8(2):179–187, 1962.[17] Zhihu Huang and Jinsong Leng. Analysis of Hu’s moment invariants onimage scaling and rotation. In Computer Engineering and Technology(ICCET), 2010 2nd International Conference on, volume 7, pages V7–476. IEEE, 2010.69Bibliography[18] K. Inoue and K. Kimura. A method for calculating the perimeter of ob-jects for automatic recognition of circular defects. NDT International,20(4):225–230, August 1987.[19] Ferenc Kovcs, Csaba Legny, and Attila Babos. Cluster validity mea-surement techniques. In 6th International symposium of hungarian re-searchers on computational intelligence. Citeseer, 2005.[20] David J. Logan, Jing Shan, Sangeeta N. Bhatia, and Anne E. Carpen-ter. Quantifying co-cultured cell phenotypes in high-throughput usingpixel-based classification. Methods, 96:6–11, March 2016.[21] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579–2605, 2008.[22] Erik Meijering. Cell segmentation: 50 years down the road [life sci-ences]. IEEE Signal Processing Magazine, 29(5):140–145, 2012.[23] L. A. Merson-Davies and F. C. Odds. A morphology index for character-ization of cell shape in Candida albicans. Microbiology, 135(11):3143–3152, 1989.[24] Fernand Meyer. The watershed concept and its use in segmentation: abrief history. arXiv preprint arXiv:1202.0216, 2012.[25] Loris Nanni, Michelangelo Paci, Florentino Luciano Caetano dos San-tos, Heli Skottman, Kati Juuti-Uusitalo, and Jari Hyttinen. TextureDescriptors Ensembles Enable Image-Based Classification of Matura-tion of Human Stem Cell-Derived Retinal Pigmented Epithelium. PLOSONE, 11(2):e0149399, February 2016.[26] Ryusuke Nosaka and Kazuhiro Fukui. HEp-2 cell classification usingrotation invariant co-occurrence among local binary patterns. PatternRecognition, 47(7):2428–2436, July 2014.[27] Eric Olson. Particle shape factors and their use in image analysis-part1: Theory. Journal of GXP Compliance, 15(3):85, 2011.[28] Fred Park. Shape descriptor/feature extraction techniques, 2011.[29] Yaling Pei and Osmar Zaane. A synthetic data generator for clusteringand outlier analysis. 2006.70Bibliography[30] Richard J Prokop and Anthony P Reeves. A survey of moment-based techniques for unoccluded object representation and recogni-tion. CVGIP: Graphical Models and Image Processing, 54(5):438–460,September 1992.[31] Lorenzo Putzu and Cecilia Di Ruberto. White blood cells identifica-tion and classification from leukemic blood image. In Proceedings of theIWBBIO international work-conference on bioinformatics and biomed-ical engineering, pages 99–106, 2013.[32] Y. Qin, W. Wang, W. Liu, and N. Yuan. Extended-Maxima TransformWatershed Segmentation Algorithm for Touching Corn Kernels. Ad-vances in Mechanical Engineering, 5(0):268046–268046, January 2015.[33] Carolina Reta, Leopoldo Altamirano, Jesus A. Gonzalez, Raquel Diaz-Hernandez, Hayde Peregrina, Ivan Olmos, Jose E. Alonso, and RubenLobato. Segmentation and Classification of Bone Marrow Cells Im-ages Using Contextual Information for Medical Diagnosis of AcuteLeukemias. PLoS ONE, 10(6), June 2015.[34] Peter J. Rousseeuw. Silhouettes: a graphical aid to the interpretationand validation of cluster analysis. Journal of computational and appliedmathematics, 20:53–65, 1987.[35] Christoph Sommer and Daniel W. Gerlich. Machine learning in cellbiology teaching computers to recognize phenotypes. Journal of CellScience, 126(24):5529–5539, December 2013.[36] H. Q. Sun and Y. J. Luo. Adaptive watershed segmentation of binaryparticle image. Journal of Microscopy, 233(2):326–330, February 2009.[37] Pang-Ning Tan, Michael Steinbach, and Vipin Kumar. Introductionto Data Mining. Pearson Addison Wesley, 2006. Google-Books-ID:YHsWngEACAAJ.[38] Robert Tibshirani, Guenther Walther, and Trevor Hastie. Estimatingthe number of clusters in a data set via the gap statistic. Journal of theRoyal Statistical Society: Series B (Statistical Methodology), 63(2):411–423, January 2001.[39] Shutong Tse, Laura Bradbury, Justin WL Wan, Haig Djambazian,Robert Sladek, and Thomas Hudson. A combined watershed and levelset method for segmentation of brightfield cell images. In SPIE Medical71Imaging, pages 72593G–72593G. International Society for Optics andPhotonics, 2009.[40] Cristina Urdiales, Antonio Bandera, and F. Sandoval. Non-parametricplanar shape representation based on adaptive curvature functions. Pat-tern Recognition, 35(1):43–53, 2002.[41] Xavier Vasques, Laurent Vanel, Guillaume Villette, and Laura Cif.Morphological Neuron Classification Using Machine Learning. Fron-tiers in Neuroanatomy, 10, November 2016.[42] A. A. Yunis, G. K. Arimura, and D. J. Russin. Human pancreatic carci-noma (MIA PaCa-2) in continuous culture: sensitivity to asparaginase.International Journal of Cancer, 19(1):128–135, January 1977.72Appendix ADimensionality ReductionUsing t-SNEt-distributed stochastic neighborhood embedding (t-SNE) is a dimension-ality reduction technique that produces a non-linear embedding from highdimensional space to low dimensional space. t-SNE is often used in place ofPCA due to its tendency to preserve local structure in data. Unlike PCA,t-SNE is a non-parametric learning algorithm that handles non-linearity inthe data very well. The embedding is learned in the process of movingdata to the low dimensional space. Consequently, t-SNE does not providea function for transforming data from the high dimensional space to thelow dimensional space. Furthermore, the t-SNE algorithm requires multipleinput parameters including perplexity, early exaggeration, learning rate andnumber of iterations. While default values of these parameters work wellfor widely publicized open data sets, the algorithm is sensitive to perplexityand learning rate parameters for features included in the MIA PaCa-2 dataset. The perplexity parameter is similar to k in the k-nearest neighbors(KNN) classifier algorithm. It is used to build a nearest neighbor graphin the high dimensional feature space. The t-SNE model building processinvolves performing random walks on this feature graph. The learning rateparameter plays an important role in preventing the algorithm from gettingstuck in a local minimum while minimizing the Kullback-Leibler divergence,a non-convex cost function. For more details on the t-SNE algorithm, pleaserefer to Maaten and Hinton (2008) [21].Shortcomings of t-SNE, including its stochastic nature (requiring multipleruns to ensure convergence), absence of parameter estimation techniques andlack of simplicity (compared to PCA where the linear transformation can beeasily analyzed), are the main reasons for its exclusion from the main textof this thesis.73Appendix A. Dimensionality Reduction Using t-SNEAs shown in Chapter 3, Figure 3.12c, Hu’s moment invariants did notcluster properly using PCA. However, using 2-component t-SNE, four clus-ters corresponding to circular, protrusive, elliptical and elongated cells areeasily identifiable (see Figures A.1 and A.2). The results described in thisappendix are limited to 2-component t-SNE to allow for comparisons withresults obtained by using PCA.Figure A.1: Clustering Hu’s moment invariants using 2-component t-SNE13 circular cells, 13 elliptical cells, 13 protrusive cells, 19 elongated cellsand 5 outliers are identified in the 2-component t-SNE plot. Contrary toobservations in Section 3.1, a non-linear transformation of Hu’s momentinvariants is capable of distinguishing between various cell morphologies.However, finding parameters for the t-SNE algorithm (perplexity = 10 andlearning rate = 500) requires manual exploration of the parameter space.Furthermore, if new data is made available, then a new embedding has tobe learned in order to transform the data into the two dimensional space.74Appendix A. Dimensionality Reduction Using t-SNE(a) Subset of circular cells clustered within blue region(b) Subset of elliptical cells clustered within red region(c) Subset of protrusive cells clustered within yellow region(d) Subset of elongated cells clustered within green regionFigure A.2: Improved classification of cells using Hu’s moment invariantsClustering is not evident when combination of geometrical features andboundary features are embedded in two-dimensional space using t-SNE, asshown in A.3. A similar t-SNE plot of (only) geometrical features (not in-cluded here) has nearly identical placement of points. Thus, adding bound-ary information to geometrical features does not result in improvement ofunsupervised classification, reinforcing the result obtained using PCA.75Appendix A. Dimensionality Reduction Using t-SNEFigure A.3: t-SNE using combination of geometrical and boundary featuresFigure A.4: Clustering shape factors using 2-component t-SNESix clustered regions are identified in the two component t-SNE plot ofshape factors, as shown in Figure A.4. Clusters 1, 2 and 3 correspond to76Appendix A. Dimensionality Reduction Using t-SNEa more elongated morphology (see Figure A.5), while clusters 4, 5 and 6correspond to protrusive, circular and elliptical cells respectively (see Fig-ure A.6). One can argue that some of the outlier points (see Figure A.7)correspond to cells with somewhat ambiguous morphology.(a) Cluster 1(b) Cluster 2(c) Cluster 3Figure A.5: Subset of elongated cells corresponding to clusters 1, 2 and 3Notice the subtle distinction in morphology between different clusters ofelongated cells in Figure A.5.77Appendix A. Dimensionality Reduction Using t-SNE(a) Subset of protrusive cells corresponding to Cluster 4(b) Subset of circular cells corresponding to Cluster 5(c) Subset of elliptical cells corresponding to Cluster 6Figure A.6: Protrusive, circular and elongated cells identified in Figure A.4Figure A.7: Segmented cell images of outlier points in Figure A.478
Thesis/Dissertation
2017-05
10.14288/1.0345604
eng
Mathematics
Vancouver : University of British Columbia Library
University of British Columbia
Attribution-NonCommercial-NoDerivatives 4.0 International
http://creativecommons.org/licenses/by-nc-nd/4.0/
Graduate
Morphology based cell classification : unsupervised machine learning approach
Text
http://hdl.handle.net/2429/61342