Open Collections

UBC Theses and Dissertations

UBC Theses Logo

UBC Theses and Dissertations

Automated classification of homologous human chromosomes using digital fluorescence microscopy images Mousavi, Parvin 2001

Your browser doesn't seem to have a PDF viewer, please download the PDF to view this item.

Item Metadata

Download

Media
831-ubc_2001-715094.pdf [ 5.97MB ]
Metadata
JSON: 831-1.0065275.json
JSON-LD: 831-1.0065275-ld.json
RDF/XML (Pretty): 831-1.0065275-rdf.xml
RDF/JSON: 831-1.0065275-rdf.json
Turtle: 831-1.0065275-turtle.txt
N-Triples: 831-1.0065275-rdf-ntriples.txt
Original Record: 831-1.0065275-source.json
Full Text
831-1.0065275-fulltext.txt
Citation
831-1.0065275.ris

Full Text

Automated Classification of Homologous Human Chromosomes Using Digital Fluorescence Microscopy Images by P a r v i n Mousav i M . S c , D . I . C . , Imperial College of Science, Technology and Medic ine , 1995 A T H E S I S S U B M I T T E D I N P A R T I A L F U L F I L L M E N T O F T H E R E Q U I R E M E N T S F O R T H E D E G R E E O F D o c t o r o f P h i l o s o p h y i n T H E F A C U L T Y O F G R A D U A T E S T U D I E S (Department of Elec t r ica l & Computer Engineering) We accept this thesis as conforming to the required standard Thle/Universifey of British Columbia September 2001 © Pa rv in Mousavi , 2001 In presenting this thesis in partial fulfilment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the head of my department or by his or her representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission. Department of f^l^c-hn'O^J ^ ^pmpu-LPf ^^^'nuZJL^L^ The University of British Columbia Vancouver, Canada Date H Sj2^). C2QO / DE-6 (2/88) A b s t r a c t This thesis is concerned with the classification of chromosome pairs into their pa-ternal and maternal homologs. Many disorders, including cancer, have inheritable components present in one of the parental chromosomes. In order to trace the in-volvement of such chromosome abnormalities in the development of cancer, it is important to analyze the parental homologs separately. In the last decade, the role of image processing techniques in chromosome analysis has significantly expanded. This is due to the introduction of novel peptide nucleic acid probes and high quality microscopes, which result in a chromosome imaging technique known as "quantita-tive fluorescence in-situ hybridization". We use this technique to acquire images of chromosomes with high quantitative information for homolog classification. There is no biological method to verify homolog differentiation. However, for certain chromosomes, after staining with fluorescent probes, centromere inten-sities, telomere lengths and chromosome shapes show apparent differences among homologs. These differences reflect true heteromorphism and a cytotechnician can visually classify homologs for such chromosomes using the heteromorphic proper-ties. We choose to automatically analyze homolog classification for heteromorphic chromosomes, since we are provided with ground truth for classification results. In addition, since no robust means exists for verifying the results, we use multi-feature analysis of digital chromosome images. The high correlation in the classification results using multiple features is employed as a verification tool. Features used for homolog classification in this study are centromere and i i telomere intensities, and the length of the P-arm. To calculate centromere intensi-ties, we study and develop methods for their segmentation. Two novel methods are introduced to segment centromeres from DAPI images. These methods use mode histograms, thresholding, and the 2D gradient of the chromosomes. Intensities of the segmented centromeres are calculated and homologous chromosomes are clas-sified into parental classes. The results are verified on heteromorphic chromosome 16, where a technician can visually tell the homologs apart. A fuzzy segmentation algorithm is employed to segment centromeres from FITC images. Intensities of the segmented centromeres are used to classify homologs. These results are then verified using heteromorphic chromosome 22, where the appearance of P-arms are different among homologs and can be differentiated by a technician. Telomere lengths are also calculated for each chromosome using previously developed software. For the first time, a novel automatic quantitative classification method is proposed for homolog differentiation using multiple features. This method is based on mutual information maximization applied to an unsupervised neural network ar-chitecture. We designed and tested our algorithm on chromosome 16, since for this chromosome, a cytotechnician can visually classify the homologs, providing us with ground truth. The neural network consists of separate modules trained to classify homologs using independent features. The first module classifies homologs of chro-mosome 16 based on differences in their centromere intensities. The second module utilizes the P and the Q-arm telomere lengths to classify homologous chromosomes 16. Mutual information is then maximized between the outputs of the modules, training them to produce the same classification results for a given chromosome. A unique input structure is suggested to enable the neural network to compare fea-tures within a pair of homologous chromosomes. The results of this work constitute a major step towards multi-feature analysis of chromosome images. Moreover, such analysis will allow a comprehensive evaluation of centromere and telomere repeat content in various chromosome instability syndromes, such as cancer. iii C o n t e n t s Abstract ii Contents iv List of Tables viii List of Figures ix Glossary xiv Acknowledgements xvi 1 Introduction 1 1.1 Human Chromosomes and Analysis: Background 2 1.1.1 Telomere Function in Chromosomes 5 1.1.2 Quantitative Chromosome Visualization and Analysis . 6 1.2 Homolog Classification: Methods 7 1.3 The Significance of Homolog Classification in Cancer Research: Motivation 7 1.4 Homolog Classification: Challenges 8 1.5 Thesis Outline 11 iv 2 Previous Research 14 2.1 Chromosome Analysis 14 2.1.1 Automated Karyotyping 15 2.1.2 Separation of Overlapping Chromosomes for Karyotyping 21 2.2 Fluorescence In-situ Hybridization (FISH) and Quantitative FISH Technologies 22 2.2.1 Q-FISH and Quantitative Measurements of Telomere Lengths 23 2.3 Homologous Chromosomes and Homolog Classification . . . . 24 3 Image Acquisition and Database Preparation 29 3.1 Q-FISH Technology and Slide Preparation 32 3.2 Image Acquisition System 34 3.2.1 Hardware Considerations 43 3.3 Preprocessing of the Acquired Images 46 3.3.1 Single Probe Experiments 48 3.4 Database Preparation 53 4 Segmentation of Centromeres from D A P I Images Using G r a -dient and Thresholding Methods 60 4.1 Heteromorphism in Homologous Chromosomes and Centromere Segmentation 61 4.2 Centromere Segmentation 62 4.3 Mode Histogram 63 4.4 Gradient of the Image and Thresholding 65 4.4.1 Gradient Operator 65 4.4.2 Centromere Segmentation Procedure 67 v 4.5 Homolog Classification of Chromosome 16 69 4.6 Conclusion 70 5 Segmentation of Centromeres from CY5 , F I T C and CY35 Images Using an Iterative Fuzzy Segmentation Algorithm 73 5.1 Fuzzy Sets Theory 74 5.2 Centromere Segmentation 76 5.2.1 Image Normalization and Neighbourhood Mask . . . . 78 5.2.2 Fuzzy Membership Values 78 5.2.3 Error Calculation and Updating . . . 80 5.3 Results and Discussion . 81 5.3.1 Homolog Classification of Chromosome 22 85 5.4 Noise Tolerance 86 5.5 Conclusion 92 6 Multi-feature Analysis of Homologous Chromosomes 94 6.1 Telomere Length Measurements and Data Normalization . . . 95 6.2 Fuzzy C-means Clustering of Telomere Data 96 6.3 Classification of Homologous Chromosome 16 Using Centromere Intensities 99 6.4 Comparison of Centromere and Telomere based Classification for Chromosome 16 99 6.5 Automated Classification of Homologous Chromosomes Using Mutual Information Maximization 101 6.6 Feed-forward Neural Networks 102 6.6.1 Error Back-propagation 104 6.7 Neural Network Architecture 106 vi 6.8 Mutual Information Maximization Theory 108 6.9 Network Learning and Update Rule 110 6.9.1 A Simple Classification Example 112 6.10 Training Experiments Design and Results 115 6.10.1 Training Using Simulated Data 115 6.10.2 Training Using Real Data 117 7 Conclusions and Future Work 122 7.1 Overview 122 7.2 Conclusions 123 7.3 Suggestions for Future Work 127 Bibliography 129 vii List o f Tables 4.1 The IFI values for centromeres of homologous chromosomes 16 of the database 70 5.1 The IFI values for segmented FITC-centromeres of homologous chromosomes 22 from the database 86 6.1 Centromere intensities and telomere lengths for the example. . 113 6.2 IFI values for centromeres of homologous chromosomes 16 of the simulated database 118 6.3 IFI values for centromeres of homologous chromosomes 16 of the database 120 viii L i s t o f F i g u r e s 1.1 From human being to DNA 3 1.2 Structure of a metaphase human chromosome 4 2.1 Typical Karyogram of a metaphase 16 2.2 Medial axis of a metaphase chromosome 17 3.1 An overview of the image acquisition system 30 3.2 From chromosome slide preparation to database development. 31 3.3 Excitation and emission spectra for DAPI (top) and F I T C (bot-tom) fluorescent probes 35 3.4 Excitation and emission spectra for CY3 (top) and CY5 (bot-tom) fluorescent PNA probes • 36 3.5 Acquired images of a sample metaphase; DAPI (top) and CY3 (bottom) : 37 3.6 Acquired images of the same metaphase as in Figure 3.5 F I T C (top) and CY5 (bottom) 38 3.7 Picture of the Zeiss Axioplan imaging system 40 3.8 Picture of the Zeiss Axioplan2 imaging system 41 3.9 The schematics of a general fluorescence microscope [1] 42 ix 3.10 Fluorescence intensities and photobleaching of beads visualized with a mercury lamp with and without a neutral density filter. Intensities of randomly chosen beads (each shown with a certain symbol) are studied over time 44 3.11 Schematic of filter block for a fluorescence microscope 46 3.12 CY3.5 image with spectral spreading; highlighted with yellow arrows in two instances (top). Colour compensated CY3.5 im-age (bottom) 51 3.13 FITC-centromere original and colour compensated images for single chromosomes; centromere segmentation from the original and compensated images • . 52 3.14 Snapshot of the ISIS software with the karyogram for the shown metaphase spread 55 3.15 Snapshot of the T F L - T E L O software which calculates telomere lengths for individual chromosomes 56 3.16 Snapshot of the developed software that automatically finds, isolates and saves each chromosome as an image 58 3.17 Snapshots of the developed software in section 3.4(3) for DAPI (top) and FITC (bottom) images 59 4.1 Centromere heteromorphism in a pair of chromosome 16 ho-mologs 61 4.2 Centromere segmentation using the mode histogram method. . 64 4.3 Centromere detection using thresholding and gradient method. 67 x 4.4 Bar chart presenting the comparison between IFI values for cen-tromeres of homologous chromosomes 16 in twelve metaphase images of a patient. Homologl refers to chromosomes with lower intensity centromeres, and Homolog2 refers to chromo-somes with higher intensity centromeres 71 5.1 Precision and significance in the real world [2] 75 5.2 Snapshots of a metaphase image prepared using DAPI (top) and FITC-centromere (bottom) probes 77 5.3 Fuzzy membership function 79 5.4 Segmentation of centromeres for a pair of homologous chromo-some 22 using three different neighbourhood mask sizes. Seg-mentation results for homolog 1 are shown in the top picture while results for homolog 2 are shown in the bottom picture. . 83 5.5 Centromere segmentation for homologs of chromosome 22 (top) and chromosome 19 (bottom) in a metaphase spread 84 5.6 The chromosome (DAPI), inverted centromere (FITC) and nor-malized segmented centromere images for two pairs of homolo-gous Chromosome 22 87 5.7 The chromosome (DAPI), inverted centromere (FITC) and nor-malized segmented centromere images for two other pairs of ho-mologous Chromosome 22 88 5.8 Scatter-plot and box-plots of relative changes in segmented cen-tromere areas versus additive Gaussian noise 90 5.9 Scatter-plots of relative changes in segmented centromere areas versus additive Gaussian noise for variances of up to 0.06 (A) and 0.7(B) 91 xi 5.10 An example of the original and noisy centromere images for additive Gaussian noise with different variances 92 6.1 Distribution of telomere lengths for chromosomes 16 in the database (top); Classification of telomere data using c-means clustering algorithm (bottom) 97 6.2 Classification of homologous chromosomes 16 based on differ-ences in their centromere intensities 98 6.3 Classification of homologous chromosomes 16 using centromere intensities as well as telomere lengths; the results of the two independent classifiers are completely correlated 100 6.4 An example of a feed-forward network which has two layers of adaptive weights 102 6.5 Structure of the proposed neural network architecture 107 6.6 Centromere intensities and telomere lengths; example. Each homolog is shown by a circle located at (Q,P). The circle size is proportional to the homolog centromere intensity 112 6.7 Initial weights, variances and outputs of the proposed architec-ture while learning to classify the example 114 6.8 Final weights and outputs of the proposed architecture after learning to classify the example 115 6.9 Centromere intensities and telomere lengths; simulated data. Each chromosome is shown by a circle located at (Q,P). The circle size is proportional to the centromere intensity 116 xii 6.10 Classification results for the simulated database. The two mod-ules in the neural network classify homologous chromosomes into the same two classes as shown by dotted lines and coloured circles 117 6.11 Centromere intensities and telomere lengths; real data. Cen-tromere intensity differences between homologs is rather low. . 119 6.12 The new neural network architecture 120 6.13 Classification of the real database. The two modules classify homologous chromosomes into the same two groups 121 xii i Glossary Acrocentric: Chromosomes that have the centromere located very near to one end. Centromere: The center of the chromosome. C Y 3 : A fluorescent dye; in this research is bound to telomere specific probes and makes only telomeres visible under a fluorescence microscope. D A P I : A fluorescent D N A probe which binds to the entire chro-mosome and makes the chromosome visible under a flu-orescence microscope. Denaturing: Unbinding of the D N A double helix. D N A : Deoxyribonucleic acid; the chemical inside the nucleus of a cell structured as a double helix which carries the heredity factors, genes, and passes from one generation to the other. F I S H : Fluorescence in-situ hybridization; a chromosome imag-ing technique based on fluorescent probes and fluores-cence microscopy. F I T C : A fluorescent dye; in this research is bound to cen-tromere specific probes and makes only centromeres vis-ible under a fluorescence microscope. Heteromorphism: The variability in the length and intensity between ho-mologous chromosomes. xiv Homologs: A pair of the same chromosome type; each homolog in the pair is inherited from one parent. IFI: Integrated fluorescence intensity; the total fluorescence detected from an object. MAT: Medial axis transform; a skeletal representation of a region. MLP: Multi-layer perceptron; a multi-layered network with units having either threshold or sigmoidal activation functions. PNA probes: Peptide nucleic acid probes; are designed to bind to pre-specified sites on chromosomes with high accuracy enabling quantitative analysis of the resulting chromo-some images. Probes: Defined nucleic acid segments used to identify specific repeat sequences on chromosomes. Q-FISH: Quantitative FISH; a new FISH technique that employs directly labelled PNA probes. Telomeres: The natural ends of chromosomes. xv Acknowledgements "If I have seen farther than others, it is because I have stood on the shoulders of giants." -Isaac Newton, philosopher and mathematician (164-2-1727) My utmost debt of gratitude is to my supervisors, Professor Rabab Ward, Professor Peter Lansdorp and Professor Sidney Fels for their inspiring advice, consistent and generous support and, invaluable guidance in technical and nontechnical matters. I am thankful to all the staff at the Terry Fox Labora-tory of B C Cancer Research Center for their support. Special thanks go to Ms. Elizabeth Chavez for her valuable help in preparing the slides and images used in this thesis. I would also like to thank Dr. Steven Poon and Dr. Mohammad Sameti for their fruitful discussions and suggestions in the early stages of this project. The financial support of the National Institutes of Health and the Fac-ulty of Graduate Studies of UBC through their numerous fellowships and awards is greatly appreciated. I am grateful to my past and present friends and colleagues at the Im-age Processing and Multi-media group and, the Department of Electrical and Computer Engineering of U B C for providing a friendly and encouraging work environment. In particular, I would like to thank my friend Simon DiMaio for his welcoming spirit and for being open-handedly helpful. Special thanks go to my friends and colleagues Shahram Shirani, Berna Erol, Mohammad and Mandana Sameti and, Mehdi and Elham Kazemi-nia for their good-will and help. I would also like to extend my appreciation to all the staff at Centre for xvi Integrated Computer Systems Research and the Department of Electrical and Computer Engineering of U B C particularly Mr. Tony Leugner and Mr. Rob Ross for their constant support during the course of my PhD. I am thankful to Ms. Kirsty Barclay for proofreading parts of my thesis. My deepest gratitude and love goes to my husband, Key van, who has always been my comfort, support and inspiration. His unconditional love and encouragement has been my constant source of energy. No words can express my appreciation and love to my parents who have passionately and selflessly encouraged and supported me all my life. All the good in me is from them and I have yet to see the limits of their wisdom. Last but not the least, I would like to thank my sister and best friend Negareh and my brother Saleh for their love and care. PARVIN MOUSAVI The University of British Columbia September 2001 xvii Chapter 1 Introduction Analyzing microscopy images of metaphase chromosomes prepared using Flu-orescence In-situ Hybridization (FISH) technology is presently a broad area of cancer research. Biologists believe that cancer is related to specific chro-mosome abnormalities. Identifying the paternal and maternal homologs of chromosomes in a patient would be of tremendous benefit in studies of genet-ics and cancer research. Advances in this area of research are largely dependent on developments in image processing techniques. In 1956, improved staining methods enabled Tjio and Levan to discover that a human being has 46 chromosomes [3]. Since then, our knowledge of chromosome abnormalities as a cause of disease has increased enormously [4]. The invention of banding procedures, coupled with molecular cytoge-netics, has also greatly enhanced our understanding of the role of chromosomes in clinical diagnosis and cancer pathology. The recent revolution in molecu-lar methods involving FISH techniques has revealed new vistas of insight into chromosome abnormalities at a higher magnitude [5]. This thesis is concerned with the automation of homolog classification, 1 that is, differentiation between the paternal and maternal homologs in a chro-mosome pair of the same type. Presently, homolog classification cannot be performed, except in rare circumstances, for a few of the 23 pairs of chromo-somes, where a separation can be made by visual inspection of a technician. There are no commercial or non-commercial software packages available for this purpose. 1.1 Human Chromosomes and Analysis: Back-ground The body of a human being is made up of millions of cells. Inside the nu-cleus of each cell, there are of 23 pairs of chromosomes. Each pair is called a "homologous pair" and the two chromosomes are homologs. The chromosome pairs are labelled from 1 to 22, and X and Y. In each pair, one chromosome is inherited from the mother, known as the maternal homolog, and the other chromosome is inherited from the father, known as the paternal homolog. Chromosomes consist of protein and one long threadlike molecule, called Deoxyribonucleic Acid (DNA). DNA carries the heredity factors, genes, and passes from one generation to the next. Therefore, it can be said that chro-mosomes determine the inherent properties of a species (see Figure 1.1) [6]. During cell division, chromosomes replicate and divide, going through different phases. Metaphase is one of the phases where chromosomes are du-plicated but not yet separated. We study homologous chromosomes at the metaphase stage; the chromosome structure in this phase is described in Fig-ure 1.2. As shown in Figure 1.2, each metaphase chromosome consists of two 2 3 P-arms Centromere Figure 1.2: Structure of a metaphase human chromosome. sister chromatids with four arms, known as P-arms and Q-arms. The sister chromatids are simply duplicates and identical most of the time. The center of the chromosome is known as the "centromere". Telomeres form the ends of the chromosome and are known to play an important role in cell life-span and in the development of cancerous cells. Telomeres and centromeres are critical structural and functional ele-ments of chromosomes. During the last 5 years, studies of telomere lengths have supported the notion that the integrity of telomeres is a necessity for accurate chromosomal distribution during cell division [7]. The centromere re-gions of chromosomes are characterized by the presence of one or more satellite DNAs. The type and amount of repetitive DNA varies, not only among chro-mosomes, but also between homologs. Metaphase chromosomes of the same type can differ in their DNA con-tent among individuals. The variability in the length and intensity between 4 homologs is known as "heteromorphism", sometimes referred to as "polymor-phism". These two terms will be used interchangeably in this thesis. Quantitative DNA content of the chromosomes, as well as centromeres and telomeres, are essential features used in this thesis for homolog classifica-tion. 1.1.1 Telomere Function in Chromosomes Telomeres are situated at the natural ends of chromosomes and protect the chromosome during cell division. They have a unique molecular structure that is crucial to normal chromosome behavior. Telomeres in humans comprise hun-dreds of repetitive genetic sequences. In the vast majority of cells, minuscule damage occurs with each cell division [8]. This damage is characterized in telomeres by the loss of 50 to 200 base pairs, and as a result the telomeres get shorter after each division. After about 100 divisions, the telomeres signal the cell to die. Scientists believe telomeres act as a molecular clock, telling the cell when its time is up. Tumor cells, however, are immortal. This is reportedly due to the activation of an enzyme known as telomerase. Telomerase continually adds sequences to the damaged telomeres, maintaining chromosomal length and stability, which allows tumor progression. In the last decade there has been a tremendous interest in studying the function and length of telomeres. Re-searchers are using what they know so far about telomeres and other cellular mechanisms to attack cancer. 5 1.1.2 Quantitative Chromosome Visualization and Anal-ysis In order to visualize chromosomes, their centromeres and telomeres quanti-tatively, as mentioned earlier, a technique known as fluorescence in-situ hy-bridization is utilized. FISH is one of the most significant developments in cytogenetics. As seen from the name, FISH is based on fluorescent probes hy-bridizing (binding) to different structures on chromosomes, which enables the examination of these chromosomes under a fluorescence microscope. Probes are defined nucleic acid segments that can be used to identify specific strings on the chromosome. As the probe is fluorescent, only the strings the probe binds to can be seen with the fluorescence microscope. FISH makes it possi-ble for us to see different structures of chromosomes as fluorescent spots. It has become an important tool in cytogenetic research and in routine clinical chromosome diagnostics. Quantitative fluorescent in-situ hybridization (Q-FISH) is a new FISH technique that employs directly labelled synthetic peptide nucleic acid probes. Synthetic PNA probes are small single stranded DNA strings that bind to the pre-specified sites on chromosomes more accurately. Q-FISH technique has some distinct differences from the more established FISH procedure (with DNA probes), in that it can allow extraction of quantitative information more effectively [9]. Following the Q-FISH procedure, images of chromosomes are acquired which have areas of dark and light bandings on the chromosomes as well as different intensities in centromeres/telomeres between homologs. The different intensities are a result of different amounts of DNA in those areas, as well as the probe's affinity for different nuclear acid sequences. 6 Using quantitative image analysis techniques, the intensity and other features of the fluorescent images can be calculated. Differences in the amount of DNA between homologs can be quantified and used as a tool for distinguish-ing between homologous chromosomes. 1.2 Homolog Classification: Methods In order to classify homologous chromosomes into maternal and paternal classes, a slide of metaphase chromosomes is labelled with several fluorescent probes. Each probe binds to certain structures on chromosomes and makes only these structures visible. Using this method, the entire chromosome, its centromeres, and its telomeres are visualized separately through different colour channels of a fluorescent microscope. Features from images of each of the visualized structures are extracted and used alone, as well as in combination for homolog differentiation. 1.3 The Significance of Homolog Classification in Cancer Research: Motivation Cancer is known as a somatic genetic disease, in which an abnormal form of a gene suddenly appears in some part of the body. The first acquired genetic abnormality provides some growth advantage to the cells. Upon each round of cell division, telomeres shorten until one telomere reaches a critical length. At this point a DNA damage signal is generated and the abnormal cells stop dividing. It seems that in different individuals, different chromosomes become shorter sooner than others, since specific chromosomes vary in telomere lengths 7 [10]. As specific chromosomes have critically short telomeres, they also become unstable and may fuse with other short telomeres. This may trigger loss of genetic loci (the specific place on a chromosome where a gene is located) in chromosomes, known as loss of heterozygosity (LOH). L O H is a major manifestation of the genetic instability that is a characteristic of cancer cells. To test the role of specific chromosomes in L O H , the ability to analyze telomere lengths in separate homologs would greatly increase the resolving power of the analysis. Moreover, although the somatic changes occurred are not passed on to the next generation, various predispositions to cancer are inherited as abnormal genes. In addition, the statistics also indicate the possi-bility of cancer being an inherited genetic disease. Figures have been reported on familial breast cancer. In order to get reliable quantitative information on cancer as an inherited genetic disease, it is important to differentiate between maternal and paternal homologs and trace the genes responsible for cancer back into past generations. Until now, all chromosome specific telomere length analysis have been done on pooled homologs. 1.4 Homolog Classification: Challenges Homolog classification is a difficult and exciting venture to explore and has not been studied sufficiently since the introduction of cytogenetics. This has been partially due to the low quality of available acquired images before the advent of FISH techniques and powerful microscopes. In addition, probes available in the past did not provide enough quantitative information for the images to be processed for homolog classification. With the introduction of PNA probes and Q-FISH procedures that pro-vide highly quantitative chromosome images, this thesis studies the automa-8 tion of homolog classification. Three main challenges faced in the automation of homolog classification and the strategies used to address them are listed below: • Throughout the image acquisition process, from labelling the chromo-somes with fluoro-probes to visualizing them using a fluorescent micro-scope, noise is introduced into the final acquired images. This degrades the quality of the images that will be analyzed for homolog classification purposes. Preventive procedures prior to image acquisition, and pre-processing algorithms prior to image analysis, are proposed to minimize the temporal and spatial noise. Preventive procedures include the stabilization of the microscope light source, the use of anti-fade agents to minimize photobleaching of the chromosomes, and the use of high quality emission and excitation filter blocks to minimize the spectral overlap of the probes. Preprocessing algorithms include background subtraction and colour com-pensation algorithms. Images of chromosomes labelled with certain probes may have higher background intensities, which should be com-pensated for. The high background is a result of the interaction of the anti-fade agent with the certain probe used for labelling the chromo-somes. • Provided we have high quality images of chromosomes with low levels of noise, homolog classification remains a challenging task. There are no means (including biological) by which the results of homolog clas-sification can be verified. Verification of homologous chromosome clas-sification is only possible for certain heteromorphic chromosomes. For 9 these chromosomes, in more than 50% of the population, homologs show enough differences in their DNA content and shape to be differentiated by the visual inspection of a cytotechnician. We chose these chromo-somes for automated homolog classification since we are provided with ground truth on their classification results. For each chromosome, the correlation of classification results employing different methods is used as another form of verification. Two different types of heteromorphism are studied in several chromo-somes. Chromosomes 1, 9 and 16 show apparent differences in the cen-tromere intensities of homologs after labelling with the DAPI probe. We developed two algorithms to segment the centromeres of these chromo-somes from their DAPI images. The differences in the intensities of the segmented centromeres were then used to classify homologs. The results of classification were compared to that performed by the cytotechnician. Chromosome 22 has another form of heteromorphism. In the DAPI im-ages of this chromosome the homologs show differences in their P-arm shapes, which enables their visual differentiation by a technician. We study the centromere images of this chromosome which are visualized by the FITC probe, and propose a segmentation algorithm. Homolog classification results using the centromere intensities calculated after seg-mentation are compared to those performed by the technician based on shape differences. We then propose an automated classification procedure based on an un-supervised modular neural network architecture. This procedure uses different features and classifies homologs based on maximizing the in-formation between its modules. This procedure can be generalized to 10 examine as-yet-untested features that might be of importance in ho-molog classification. The proposed architecture has the ability to derive features from the input space that might be of use for homolog classi-fication. Therefore, it can be used to attempt to classify chromosomes with no known heteromorphism. • The last challenge undertaken in this research is the lack of a compre-hensive test database. The entire process of database preparation is extremely lengthy and involves a great deal of human interaction which can be very expensive. In addition, PNA probes, and in general Q -FISH, are new techniques whose implementation protocols are changing and improving everyday. Therefore databases acquired over a period of time can differ, making it difficult to develop a comprehensive stan-dard database. Our developed algorithms should be able to foresee and tolerate such changes (although minor) in the chromosome images. Ver-ification of homolog classification using several different methods also helps in the case of smaller databases. Finally, our proposed automated unsupervised classifier has the advantage of being able to handle smaller databases. 1.5 Thesis Outline In Chapter 2 of this thesis, quantitative chromosome analysis methods are reviewed. These include automated karyotyping algorithms and attempts to fully automate this process. An overview of the quantification of the FISH technique and the applications of PNA probes and Q-FISH in measurements of DNA content of chromosomes are given. Finally, various techniques for 11 homolog classification, from C-band heteromorphisms in the early years to Q-FISH techniques and quantitative analysis in recent years, are reviewed in detail. In Chapter 3, the slide preparation protocol and the hardware of the image acquisition system, including the microscope, are discussed. Quantifi-cation and preprocessing of the acquired images to remove noise introduced by preparation and the system, are then presented. The database preparation procedure is also described comprehensively. In Chapter 4, two novel methods are proposed to segment centromeres from images of chromosomes prepared using the DAPI probe. The intensities of the segmented centromere areas using the proposed methods are then cal-culated. Homologous chromosomes are classified into maternal and paternal classes based on the differences of the calculated centromere intensities. Cer-tain heteromorphic chromosomes (such as chromosome 16) whose homologous centromeres exhibit different intensities are classified based on these differ-ences and the cytotechnician's visual inspection. The correlation of different classifications are used to verify homolog differentiation of chromosome 16. In Chapter 5, an algorithm is proposed for centromere segmentation to separate homologous chromosomes that differ in the number of centromere repeat sequences. The images are acquired using centromere specific probes la-belled with FITC and CY5 fluorophores. The algorithm is an iterative method based on fuzzy sets theory. It starts with assigning a "membership value to the centromere" for each pixel in the image. A predefined error function is minimized by iteratively updating the membership values and eventually seg-menting the centromeres. Homologous chromosomes 22 are classified based on intensity differences in their centromeres from FITC images. In addition, chro-mosome 22, a heteromorphic chromosome, whose homologs can differ physi-12 cally based on existence and non-existence of P-arms, are classified using this property. The two classification methods are then compared and their corre-lation is used as a verification of homolog classification for this chromosome. In Chapter 6, first a fuzzy C-means clustering algorithm is applied to classify homologous chromosomes using two features extracted in Chapter 3 (telomere lengths) and Chapter 4 (centromere intensities). The two features seem to be highly correlated. Inspired by these results, a novel quantitative and automatic homolog classification method is proposed. This methods is based on mutual information maximization applied to an unsupervised mod-ular neural network architecture. Each module is trained to classify homologs using an independent feature. The mutual information between the outputs of the modules is maximized, training the network to present the same classifica-tion result for a certain chromosome using independent features. The results of automated homolog classification for chromosome 16 are presented using the suggested architecture. Finally, extensions of the proposed architecture for use in homolog classification employing more features, as well as for use as a feature selection and testing tool, are discussed. Chapter 7 summarizes the contributions of this thesis along with sug-gestions for future research. 13 Chapter 2 Previous Research Over the past four decades, the role that chromosomes play in human disease and developmental disorders has become a primary concern of many scientists [5]. The first person to study human chromosomes is believed by some to have been Arnold (1879), however, the exact number of human chromosomes was not determined until almost a century later by Tjio and Levan (1956) [3]. Since then the area of chromosome analysis has drastically changed, playing a major role in disease diagnosis and treatment in this modern age. 2.1 Chromosome Analysis With the advent of various staining procedures in the early 1970s, a new area in genetics emerged known as "cytogenetics". In 1970 a breakthrough was made when Caspersson et al. [11] discovered that Quinacrine Mustard (QM), a DNA-binding fluorescent probe, binds preferentially to certain chromosomal regions [12]. When QM-stained chromosomes were illuminated with ultraviolet light, a banding pattern perpendicular to the long axis of the chromosome was 14 observed. The banding patterns were different for each chromosome pair, providing a means of identifying each pair [12]. The process of identifying and classifying chromosomes into their 23 types from a metaphase spread is known as Karyotyping. Karyotyping is a skilled and important task in pre-natal diagnosis of genetic abnormality and in diagnosis and monitoring of cancer [13]. Since 1970, numerous stains have been discovered which also produce a characteristic banded appearance. Among the earlier bandings, G-bands highlight the whole chromosome, while C-bands highlight the centromeres. A picture of a Karyogram demonstrating classified chromosomes is shown in Figure 2.1. 2.1.1 Automated Karyotyping Most of the studies in the field of chromosome analysis concern "Automated Karyotyping" which is the automated process of recognizing and categorizing the chromosomes from images of a metaphase spread. Karyotyping is a highly time consuming procedure if performed manually by a cytotechnician. Systems were introduced as early as 1964 to automate and classify chromosomes [14]. During the past four decades, considerable development has been made in attempts to automate or semi automate chromosome analysis [15]. The main features used in Karyotyping have been relative chromosome lengths, the centromeric index (ratio of the chromosome P-arm to the total chromosome length) and banding patterns described by density profiles. The density profile of a chromosome is calculated along the central axis of the chromosome (usually computed by Medial Axis Transform), taking values on perpendicular directions to the medial axis (see Figure 2.2). Most classifiers transform chromosome density profiles into a set of features and combine them 15 Figure 2.1: Typical Karyogram of a metaphase. 16 Medial Ax is Figure 2.2: Medial axis of a metaphase chromosome. with shape and length profiles for automated classification. Granlund (1976) uses several types of descriptors of chromosome density profiles for automation of chromosome classification. These descriptors are curves, Fourier descriptors and distribution functions [12]. Although relatively high success rates of classification are reported (78%-90%), the system can not handle touching, bending or overlapping chromosomes. A common challenge for automation of chromosome classification in early days was the inability to reproduce the chromosome data [16]. Also, the low resolution and high amounts of noise introduced during preparation, acquisition, and digitizing of the images degraded the efficiency of the proposed methods. Another problem, which still remains a challenge in designing a global automated chromosome classification system, is the variability of chromosome profiles (such as shape and density) between individuals as well as within an individual. At present, most clinical automated Karyotyping systems require in-teraction by a cytotechnician to correct errors. Piper (1989) proposes a fully automatic feature extraction system for chromosome classification where the option of interactive correction of the axis and centromere location has been omitted. New features describing banding patterns and the shape of the chro-17 mosome are proposed. Classification error rates are measured and reported. It is proposed that if a classifier is trained and tested within the same database, error rates are decreased by up to 10%. Also choosing feature sets appropriate to each database leads to improved classifier performance [15]. The variability of density and chromosome profiles has made chromo-some classification a challenging pattern recognition problem. Statistical, rule-based and neural computing strategies have been adopted over the years to solve this problem. Wu et al. (1989) propose a knowledge-based system for chromosome classification where a multi-stage recursive "hypothesize-and-identify" paradigm is adopted [17]. Ritter et al. introduced a statistical model of metaphase chromosomes for classification [18]. Later they propose a heuris-tic method of parameter estimation and design a Bayesian classifier for auto-matic chromosome classification [19]. Since density profiles are highly variable, artificial neural networks offer the possibility of adaptable classifiers for chromosomes. Of particular inter-est are the feature extraction properties such models exhibit, which allow unrefined information to be presented to the classifier rather than specific in-tuitively defined features [13]. Errington etal (1993) describe a combination of Multi-Layer Perceptrons (MLP) for classifying isolated chromosomes and propose that these perform as well as, or significantly better than, a well de-veloped statistical classifier [13]. Lerner et.al (1994) use an M L P and evaluate the significance of chromosome features to classification with two feature se-lection algorithms: "knock-out" and "Principal Component Analysis" [20]. Levinstein (1994) also proposes multi-layer perceptrons for chromosome clas-sification [21]. The results emphasize the significance of the centromeric index and chromosome length in classification. They also demonstrate the impor-tance of retaining most of the image information whenever small training sets 18 were used. Lerner et al. report maintaining classification performance using MLPs and only 70% of the available features [4, 20]. Original features are the cen-tromeric index, chromosome length and density profile. The density profile related features are calculated after medial axis transform of the chromosome. Two approaches were used to MAT, one based on skeletonisation and the other based on piecewise linear approximation. Features extracted from the density profiles as well as centromeric index were used in a M L P for classification. A feature selection algorithm was then used to choose the best features. They report a 97% rate of correct classification on a test dataset [22]. The same au-thors have compared the performance of an M L P neural network as a classifier of human chromosomes to that of a Bayes piecewise classifier on 5 types of chromosomes. They conclude that the M L P classifier outperformed the Bayes classifier for all the combination of the features and for all sizes of training sets [23]. Neural networks in Karyotyping are still an active area of research and different structures and learning algorithms for automatic chromosome classification are still being tested [24]. Several researchers have worked on automated Karyotyping of abnormal cells where there are numerical aberrations and the number of chromosomes in the cell is not 46. Eskizmirliler et al. (1996) suggest a hybrid Karyotyping system based on two different types of neural networks. The density profile related features are extracted using wavelet transform. The classifier con-sists of a supervised network which classifies the chromosome set as "normal" by searching for numerical abnormalities. The unsupervised network has 24 ANNs assigned to different classes of chromosomes which performs the final Karyotyping [25]. Stanley, et al. also present a technique that is extendible to detecting numerical aberrations using the similarity between homologs in 19 a certain class. Chromosome assignment to a specific class is initially based on neural networks using banding pattern and centromeric index criteria and concluding with homolog matching [26]. Lerner (1998) studies a completely au-tomatic neural network based chromosome classifier. In this study all aspects of the analysis, namely segmentation, feature extraction, and classification are discussed. Also separation of clusters of partially occluded chromosomes is performed in an effort for a complete automatic chromosome analysis system [27]. Other systems, such as fuzzy rule-based systems, have been suggested for chromosome classification. These report a relatively high classification rate (about 87%) for a selected number of chromosomes [28, 29]. A genetic algorithm approach was also proposed for chromosome classification by Piper (1995), who reported little improvement in classification results and high com-putational costs for these methods [30]. Zimmerman et al. (1986) proposed a homolog matching procedure prior to automated Karyotyping to increase the efficiency of these methods. The study assumes that a chromosome whose length does not deviate by more than 20% from a comparison chromosome is considered a candidate for homolog matching. Homolog matching is performed using a cross correlation of the chromosome profiles, both with and without chromosome stretching. They report 90% success rates in matching on two datasets [31]. Several methods have been proposed during the last two decades for fea-ture description of the chromosome band profiles [32]. These include Fourier coefficients [33, 34], Gaussian fitting functions [12], Weighted Density Distri-bution (WDD) descriptors [15, 35] and Wavelet based descriptors [32, 34]. Wu et al. (2000) report chromosome classification performances using wavelet-based banding pattern descriptors comparable to the current best-performing 20 W D D method [32]. Some studies suggest automatic detection of metaphases in an attempt to speed up the Karyotyping process [36, 37]. 2.1.2 Separation of Overlapping Chromosomes for Kary-otyping Most of the available commercial automatic Karyotyping software packages are in fact semi-automatic and require the interaction of a cytotechnician, es-pecially in cases of overlapping chromosomes. A number of successful chromo-some separation methods have used shape-based techniques for this purpose. Ji (1994) proposed a rule-based technique to analyze concavities along the chromosome shape and search for a minimum density path between touching chromosomes [38]. In a method studied by Lerner (1998), segmentation of occluding chromosomes is performed by connecting points of high concavity (cut points) along curvatures [27, 39]. Lines connecting pairs of cut points are hypothesized as separating lines. The two segments created by such a line are suspected of being touching chromosomes and a chromosome classifier is employed to verify the corresponding hypothesis [27, 39]. Agam et al. (1997), review methods for touching chromosome separa-tion. They propose a geometrical method for chromosome separation where the separated segments are evaluated by fitting to some prototype shapes [27, 40]. Stanley et al. present a homolog matching algorithm for overlapping chromosome recognition. The undistorted gray level information in isolated chromosomes 2 is used for identifying overlapped chromosomes. An isolated chromosome prototype for chromosome 2 is obtained using neural networks. The overlapped homolog is then matched to the prototype for chromosome 21 identification [26]. Poon et al. developed a chromosome segmentation method which segments most of the chromosomes correctly and addresses the prob-lems of touching chromosomes [41]. The proposed algorithm consists of a com-bination of segmentation algorithms such as thresholding, average and rank difference filtering [42], as well as dilation and erosion algorithms [41]. 2.2 Fluorescence In-situ Hybridization (FISH) and Quantitative FISH Technologies Fluorescence in-situ hybridization is one of the most important breakthroughs in cytogenetics. This technique is rapidly expanding in medical research and clinical diagnosis. It allows the detection of specific DNA and R N A sequences at the individual cell/chromosome level [43, 44]. Since the introduction of FISH, image analysis techniques have played a major role in chromosome analysis. One of the advantages of FISH is that several targets can be visual-ized simultaneously in the same sample [45, 46]. Based on this characteristic, Multi-fluor FISH techniques have been studied to paint each human chromo-some with a different fluor combination for Karyotyping purposes [46, 47, 48]. In a study by Roth et al. (1997), the ratio analysis of images, prepared using different fluoro-probes, has been used for detection of DNA changes in specific parts of chromosomes [49]. Image processing algorithms and protocols have been established for quantifying images acquired following FISH [9, 43, 44, 50, 51]. Nederlof et al. (1992) use image processing algorithms and propose a system for quantifi-cation of FISH signals to accurately measure fluorescent spot intensities [50]. Castleman et al. (1996) give a comprehensive overview of quantitative FISH 22 procedure employing image analysis techniques. These include the automatic calibration of the colour images, isolation and measuring spots of interest, and counting dots. They suggest an image processing algorithm to isolate each fluorophore to a single colour channel in multi-colour FISH [44]. Although FISH techniques have been around for at least 15 years, only relatively recently has quantitative fluorescent in-situ hybridization, Q-FISH, a FISH procedure adapted for Peptide Nucleic Acid (PNA) probes, been de-veloped [9, 52, 53]. Relative to conventional FISH techniques with RNA or DNA probes, PNA probes can hybridize to denatured D N A target sequences without competition from identical complementary DNA sequences. There-fore, the observed fluorescence intensity is directly correlated to the length of the target sequence. The high sensitivity and degree of accuracy in Q-FISH allow the extraction of quantitative information. The bright fluorescence sig-nals combined with the low background of directly labelled PNA probes offer unique possibilities for a variety of studies [9]. The ability to simultaneously visualize several sites on chromosomes also provides opportunities to study homologs separately. Lansdorp et al. present the application of PNA probes in quantitative FISH and propose a thorough protocol for Q-FISH [9, 53]. 2.2.1 Q-FISH and Quantitative Measurements of Telom-ere Lengths Telomeres form the physical ends of chromosomes. Their proper functioning is vital to normal cell growth [54]. Shortening of telomere lengths is believed to act as a biological clock to stop cell division. On the other hand, activation of telomerase (the enzyme that synthesizes telomere) is proposed to have a role in immortalizing cancerous cells [55]. Several authors have reviewed telomeres, 23 their controls and their role in normal and cancerous cells [54, 55, 56]. Lans-dorp reviews the mechanism of telomere shortening and its relation to aging and development of cancer [57]. With the advent of Q-FISH techniques, it is possible to measure telom-ere lengths in specific chromosome arms from metaphase cell preparations [58]. Poon et al. have developed a quantitative system to measure telom-ere lengths on individual chromosomes by image cytometry [53]. In this sys-tem, images of metaphase chromosome spreads (stained with DAPI) and of telomeres (stained with CY3-labelled telomere probe) are taken using a high resolution C C D camera. The images are preprocessed to remove noise in-troduced during preparation and acquisition procedures. Chromosomes and telomeres are then segmented and four telomeres are assigned to each chro-mosome. Individual telomere lengths are finally obtained for all four arms on each chromosome of the metaphase spread [9, 42, 53, 59]. Several studies on telomere lengths of chromosomes and their clinical significance have been made since then using this system [10, 52, 58, 60]. 2.3 Homologous Chromosomes and Homolog Classification Despite the vast amount of research conducted on chromosome analysis, not much has been done on homolog classification (into maternal and paternal classes). This is due to the fact that microscopy images of chromosomes avail-able in the past did not contain enough quantitative information. As mentioned before, metaphase chromosomes of the same type can differ in DNA content among healthy individuals [61]. The variability char-24 acterized by differences in the length and intensity of the homologs of these chromosomes is known as heteromorphism. Even before the advent of banding techniques, it was recognized that homologs of certain human chromosomes exhibit consistent morphological differences [62]. Heteromorphism was char-acterized in metaphase spreads which were stained to produce chromosome banding patterns in the 1970s [61, 63]. With the advances in probes and FISH techniques it has been possible to study the heteromorphic regions with more precision, and quantitatively. Verma and Dosik (1980) review human chromosome heteromorphisms, their nature and clinical significance as well as methods to identify different variants [62]. Trask et al. quantitatively study heteromorphisms using flow Karyotyping in a large number of cells [64]. They also review the inheritance of chromosome heteromorphisms using the same method [61]. In the past century, scientists have been studying the chromosomal or-der and suggest that homologous chromosomes are not distributed randomly in the nucleus [65, 66]. Some studies have used quantitative methods to measure homologous chromosome distances in the nucleus. FISH combined with fluo-rescence and confocal microscopes have been used to measure the distances of centromeres of certain chromosomes in two dimensions as well as three dimen-sions. These distances have been compared to those generated by a random generator and concluded that the distribution of the studied homologous chro-mosomes are not random [67, 68]. The possibility of homolog separation using these distributions, however, has never been studied. Synthetic PNA probes in combination with Q-FISH technique have be-come powerful tools in quantitative studies of chromosomes and specific DNA sequences. It is believed that the type and amount of repetitive DNA not only varies between chromosomes but also varies between homologs [7]. Meyne 25 and Moyzis (1994) review in-situ hybridization using synthetic probes for cen-tromere and telomeres. They suggest application of these two methods for labelling the repetitive sequences in the peri-centromeric regions of chromo-somes and analyzing the polymorphisms of these repeats [7]. Chromosome heteromorphisms are very useful characters for differenti-ating maternal and paternal homologs. Their use as genetic markers depends on the fact that they are stable, heritable and have low rates of mutation [69]. Some of the most common sites of heteromorphisms are the short arms of the acrocentric chromosomes such as chromosome 22 and the centromeric regions of chromosomes 1,9 and 16 [62]. Another heteromorphism recently studied is the heteromorphism in telomere lengths of certain homologous human chro-mosomes. Lansdorp et al. (1996) first reported heterogeneity in the telomere length of human chromosomes [52]. Zijlmans et al. (1997) later reported heteromorphism in telomere lengths of mice. Q-FISH techniques was used to prepare and acquire telomere images of mice. Homologous chromosomes of the studied mice were often characterized by widely different telomere lengths [60]. Martens et al. (1998) studied telomere lengths of human chromosomes using PNA probes and Q-FISH technique. They report chromosome specific het-erogeneity in human chromosomes such as short telomere lengths on P-arms of chromosome 17. They suggest the heteromorphism in telomere lengths of human chromosomes be used to classify homologs [10]. Pommier et al. (1998) also present telomere heteromorphisms and their application in classification of several homologous chromosomes [70]. We studied telomere lengths in human chromosome 16 and reported heterogeneity in telomere lengths of homologous chromosomes 16. We then suggested homolog classification combining telom-ere lengths and centromere intensities of chromosome 16 [71]. 26 Centromeric sequence polymorphisms are being increasingly reported. O'Keefe et al. describe probes specific for chromosome 17 centromere D N A re-peats [69]. They report homologous chromosome classification of chromosome 17 in 56.4% of individuals studied [69]. Mason et al. (1975) first reported studying centromere heteromor-phisms of chromosomes 1 and 16 quantitatively. Centromeres were labelled using the so called C-banding technique. A gradient method was used to dis-tinguish the C-bands and optical densities of the bands were measured. After statistical analysis of the results, they reported higher heteromorphism for ho-molog classification in chromosome 1 than chromosome 16 [72]. Lauder (1978) studies the problem of estimating polymorphism in homologous chromosomes using bivariate analysis. In this analysis, two variables are measured simul-taneously on the chromosome. The variables are the optical density of both the C-band and the chromosome itself [73]. Estimates of polymorphism are used to answer such questions as which parts of chromosomes contribute to the differences observed in homologs. Using the bivariate estimation, the author suggests the C-band with larger mean area is part of the chromosome with a larger mean area, and that a substantial part of the difference between the sizes in the homologs is due to the difference in the size of C-bands [73]. Although the above and similar studies in the 1970s [74, 75] were promis-ing initial studies on quantitative measurements of homologs, not much has been done on homolog classification until recently. This is due to the fact that microscopy images of chromosomes available in the past did not contain enough quantitative information. Obviously, the process of homolog classi-fication is dependent on the staining techniques. The results of C-banding used for staining centromeres were not always reproducible and were variant [73]. Furthermore, the quantification methods used previously resulted in low 27 resolution images with relatively high amounts of noise from both preparation procedures and the hardware employed. The introduction of Q-FISH technique and novel peptide nucleic acid probes [76] have provided a new and powerful tool for the field of cytogenetics, and specifically homolog classification. These techniques result in chromosome images with high quantitative information. Since then, centromere segmen-tation algorithms for classification of homologous chromosomes have been de-veloped [77, 78, 79]. We have classified homologs of certain heteromorphic chromosomes combining multiple features extracted from Q-FISH acquired images [71, 77, 79, 80]. We also proposed an automatic quantitative classifi-cation method for homolog classification using neural networks and maximum information theory [81]. This method can also be generalized to test features that are known to have some information for homolog classification [81]. 28 Chapter 3 Image Acquisition and Database Preparation Multi-spectral acquisition and analysis of human chromosome images using quantitative fluorescence in-Situ hybridization technique is performed employ-ing a high quality fluorescence microscope and a digital image acquisition sys-tem. Slides of metaphase chromosomes are first treated with certain probes. Different probes allow visualization of different parts of the chromosomes. Multi-spectral images of these slides are then taken using a digital camera attached to the fluorescence microscope. Acquired images,of the slide are pre-processed to remove some of the spatial and temporal noise introduced by the system [53]. Once images of chromosomes in several spectral planes have been acquired, semi-automatic and automatic procedures are used to recognize and categorize each chromosome, isolate the chromosome as a rectangular region and measure certain features on specific image planes. These rectangular re-gions are finally saved as images that make up our database. An overview of the image acquisition and processing system is shown in Figure 3.1. A 29 Illumination Source Microscope Optics / CCD Camera Microscope Stage Figure 3.1: An overview of the image acquisition system. flowchart presenting the organization of this chapter, including steps taken in the entire process of chromosome slide preparation to database development, is shown in Figure 3.2. As seen from the flowchart, in this chapter we will first study the chromosome slide preparation procedure and the properties of the fluorescent probes used for labelling them. Next, the image acquisition system (micro-scope) is described in detail. Challenges and solutions to minimize noise, quantify and make the final images reproducible, from a hardware point of view, are also presented. These include the stabilization of the light source of the microscope, minimization of photobleaching and spectral filtering of the coloured acquired images. The semi-automated karyotyping software (Meta-Systems ISIS) and the telomere length measurement software ( T F L - T E L O ) are described next. The acquired images of the chromosomes, despite the pre-cautions taken in image acquisition, in some cases need to be preprocessed to minimize remaining temporal and spatial noise. The preprocessing steps are discussed in detail. Finally, the automated algorithm that finds, isolates and 30 Slide Preparation Image Acquisition Microscope System H Chromosome labeling using PNA probes and Q-FISH 'MS. Light source Filter blocks Pre-Processing Background compensation Spectral overlap compensation Database Development Automatic Database Preparation Karyotyping MetaSystems Telomere Lengths TFL-TELO Figure 3.2: From chromosome slide preparation to database development. 31 saves each chromosome, from the 5-colour acquired images, as our database is presented. 3.1 Q-FISH Technology and Slide Preparation Fluorescence is a molecular phenomenon in which a substance absorbs light of some colour and almost instantaneously radiates light of another colour. The radiated light has lower energy and therefore longer wavelength. This process is known as excitation and emission [1]. Q-FISH involves applying different fluorescent probes which bind to certain repeat sequences on chromosomes and examining these chromosomes under a fluorescent microscope. The Q-FISH technique enables the quantita-tive measuring of some characteristics of certain structures (such as telomere or centromere) for specific individual chromosomes from metaphase cell prepara-tions. Q-FISH has become possible using directly labelled peptide nucleic acid probes [58]. Probes are nucleic acid molecules that bind to specific comple-mentary DNA sequences on chromosomes, and combined with fluoro-chromes, make these sequences observable. PNA probes in particular hybridize to spec-ified complementary segments on the chromosome DNA under conditions that favour annealing of the probe but disfavor renaturation of the DNA strands. Not until recently have synthetic PNA probes been developed to hybridize to repeat sequences on chromosomes with sufficient efficiency to consider quan-titative analysis of the resulting fluorescence images [52, 53]. PNA probes can hybridize to denatured DNA target sequences without competition from identical complementary DNA sequences. Therefore, the length of the target telomere or centromere spot is related to the number of hybridized probes. In turn, the fluorescence intensity of a given telomere or centromere spot is 32 correlated with its corresponding repeat length. The images used in this research are produced in several stages. In the first stage, standard cytogenetic techniques are used to prepare metaphase chromosomes from dividing cells [9]. Metaphase cells are then fixed and dropped onto a slide before being air-dried overnight. Centromere and telom-ere specific PNA probes including CY3, FITC, CY5 and CY3.5 are added to the chromosomes on the slide. The slide is then heated at 80°C for 3 minutes. The DNA of chromosomes is denatured completely during this period. The slide is left at room temperature for an hour afterwards. This allows hybridiza-tion of the PNA probes to the targeted sequences on the chromosomes. After hybridization, slides are washed to remove unbound PNA probes and a DNA counterstain labelled with DAPI is added to the slide. As probes are fluorescent, each time the slide is illuminated the amount of fluorescence that can be excited later decreases. This phenomenon is known as photobleaching. The amount of photobleaching depends on the time needed to focus on a metaphase as well as exposure times for different wavelengths. In order to minimize photobleaching and hence acquire quantitative images in all different colour channels, anti-fade agents are added to the slide during the preparation stage. Slides are now ready to be visualized under a microscope for quantitative analysis. As mentioned above, the probes used for preparing the slides are DAPI, CY3 telomere, FITC, CY3.5 and CY5 centromere probes. DAPI has a blue appearance, highlights the whole chromosome, and is excited at a wavelength of 359 nm (for an example see Figure 3.3-top). FITC has a green appearance, makes only the centromeres of the chromosomes visible, and is excited at a wavelength of 490 nm. CY3.5 has a yellow appearance and is another PNA probe used to label the centromeres. CY3 has an orange appearance, highlights 33 the telomeres at the ends of chromosomes and is excited at a wavelength of 554nm. Finally, CY5 has a red appearance and is excited at a wavelength of 649nm. The excitation and emission spectra for four of the more frequently used fluorescent probes are shown in Figures 3.3 and 3.4. At the beginning of this research, only DAPI and CY3 probes were available for visualizing the chromosomes and the telomeres, respectively. Later, centromere specific PNA probes were developed. Combining these probes with FITC, CY3.5, and CY5 fluorescent dyes, the visulaization of centromeres of chromosomes also became possible. At every stage of our project, the state-of-the-art probes and chromosome imaging techniques were employed to extract features for homolog classification. Acquired images of a sample metaphase are shown in Figure 3.5 for DAPI and CY3 planes and in Figure 3.6 for FITC and CY5 planes. In these images, one chromosome is highlighted as "Chromosome N" to emphasize how the same PNA labelled chromosome is viewed when excited at different wavelengths. As seen from Figures 3.3 and 3.4, the spectrum of the fluorescent probes can be relatively wide. In preparing multi-fluoroprobe stained images, care is taken to choose probes that have little spectral overlap. In addition, to further minimize the effect of this overlap, multiple filters are used prior to image acquisition. However, there still exists some spectral leakage of the DAPI image into other image planes. 3.2 Image Acquisition System Once the slide of metaphase chromosomes labelled with multiple PNA probes is ready, typically fifteen images of the slide are taken by a fluorescence mi-34 300 350 400 450 500 550 600 Wavelength (nm) 350 400 450 500 550 600 650 Wavelength (nm) Figure 3.3: Excitation and emission spectra for DAPI (top) and F I T C (bot-tom) fluorescent probes. 35 T 1 r | 1 I 400 450 500 550 600 650 700 Wavelength (nm) 500 550 600 650 700 750 800 Wavelength (nm) Figure 3.4: Excitation and emission spectra for CY3 (top) and CY5 (bottom) fluorescent PNA probes. 36 Figure 3.5: Acquired images of a sample metaphase; DAPI (top) and C Y 3 Chromosome N • Figure 3.6: Acquired images of the same metaphase as in Figure 3.5 F I T C (top) and CY5 (bottom). 38 croscope. Two different microscope systems have been used throughout this research for acquisition. The first imaging system includes a widefield Zeiss Axioplan microscope with a hybrid Mercury/Xenon lamp (200W. OptiQuip). A Microlmager MI1400-12 Charged Coupled Device (CCD) camera is used to acquire and digitize the images (Figure 3.7). The second acquisition system uses a new Zeiss Axioplan2 microscope with a Mercury lamp (100W, HBO100 Zeiss). Two different C C D cameras have been used to acquire and digitized images in this system. A high resolution P C O SensiCam C C D camera from Northern Eclipse and a C C D camera provided by MetaSystems Inc. are used for image acquisition (Figure 3.8). In general, the image acquisition system is divided into three subsystems as follows: 1. A fluorescence microscope consisting of an illumination source, micro-scope optics and microscope stage, which produce the images of the prepared slide; 2. A C C D camera to capture and digitize the images; 3. A computer system which stores the images, sends instructions to the microscope stage and is used in the preprocessing of the stored images. The microscope is the major component of the imaging system. It consists of an illumination source, microscope optics, including excitation and emission filter blocks, and the microscope stage with a mechanical z-drive for auto-focus. A schematic of a general fluorescence microscope is shown in Figure 3.9 [1]. In fluorescence microscopy, a selected wavelength of light is filtered from the illumination source and used to excite the chromosomes on the slide. The sites on the chromosomes which are labelled with fluoro-probes sensitive 39 Figure 3.7: Picture of the Zeiss Axioplan imaging system. 40 Figure 3.8: Picture of the Zeiss Axioplan2 imaging system. 41 Figure 3.9: The schematics of a general fluorescence microscope [1]. to this particular light wavelength will emit light at a higher wavelength and lower energy as shown in Figures 3.3 and 3.4. The emitted light is focused and magnified by the objective lens of the microscope [42]. The emission filter will then selectively pass through the wavelength of the emitted light and filter out the other wavelengths, including the reflected excitation wavelength. The image of the highlighted site on the chromosome can be viewed through the oc-ulars of the microscope as well as acquired using a C C D camera. The selection of the C C D camera is based on a number of important requirements desirable for quantitative fluorescence microscopy. These requirements are high spa-tial resolution and a large field of view, sufficient photometric resolution, and multi-spectral image acquisition capability. It is desirable to have a detector that can sample at a rate which is at least twice the resolution of the system (so that aliasing effects do not occur) [42, 82]. This translates to a sensor 42 with a pixel spacing of less than 7 microns. Given such a small pixel size, the detector should have about 1000x1000 pixels to capture the entire metaphase. In order to have sufficient photometric resolution, we need to use a camera with a large dynamic range (e.g. Microlmager or SensiCam with 12 bits or 4096 gray levels). Generally, a photometric resolution of only 8 true bits (256 gray levels) of information is sufficient to represent the data. Therefore, the eight most significant bits of the high resolution camera are used to represent the image stored in the computer. Images of chromosomes are acquired by exciting the slide at five differ-ent wavelengths and recording the images through five colour channels. The acquired images can now be analyzed for homolog classification. 3.2.1 Hardware Considerations To make the acquired images of chromosomes quantitative and reproducible, and to minimize the amount of photobleaching during acquisition, two hard-ware components are studied. Illumination source stabilization An ideal illumination source is one that has sufficient intensity for the desired excitation wavelengths and that does not fluctuate over time [42]. Typical fluo-rescent microscope illumination sources are Xenon or Mercury lamps. Mercury lamps have higher intensity than Xenon lamps, which makes them more suit-able for quantitative image acquisition. However, these lamps can fluctuate over time, resulting in non-uniform illumination in acquired images from the same slide. In the Axioplan microscope system, a hybrid Xenon/Mercury lamp is used. In the newer Axioplan2 microscope system, we use a Mercury lamp for illumination which has very high intensity suitable for multi-colour quantita-43 Fluorescence at 100% light and no Neutral Density filter 2500 i '5) c 2000 £ ?! 1500 -o c <D 1000 uoresc 500 LL 0 y- Xxx 50 100 150 200 250 300 Time(sec) Fluorescence at 100% light and Neutral Density filter ND1 600 I 500 | 400 8 300 c V 8 200 I § 100 (Xxxxxxxxx*x*xxxxxxxxXxxX K X X X X X X X X x X X V+++++++ 100 200 300 Time (sec) 400 500 Figure 3.10: Fluorescence intensities and photobleaching of beads visualized with a mercury lamp with and without a neutral density filter. Intensities of randomly chosen beads (each shown with a certain symbol) are studied over time. tive imaging. Experiments with fluorescent beads are performed to determine the fluctuations of the lamp over acquisition time and to determine the inten-sity level where the lamp is most stable. It was concluded that the Mercury lamp is most stable when used at full intensity (100%). However, if a slide of chromosomes is visualized under the Mercury lamp with 100% light intensity, the process of photobleacing accelerates very quickly (see Figure 3.10-top). A neutral density filter was used to reduce the light intensity to 10% of the 44 original value. Using this filter, the microscope has a stable light source with a reasonable intensity for the selected wavelengths (see Figure 3.10-bottom). In this figure, several beads are chosen randomly (shown with symbols such as o, 0, x, • • •) and their intesities are studied over time with and without a neutral density filter in the optical path of the microscope. Excitation and emission filter sets In acquiring multi-colour images of chromosomes we take care to label chro-mosomes with probes that have little spectral overlap. In addition, to further minimize this overlap, multiple filters are used prior to image acquisition. A typical filter block for a fluorescence microscope is shown in Figure 3.11. The main elements of this block are an excitation filter, an emission filter and a dichroic mirror. The excitation filter is a colour filter that transmits only those wavelengths of the illumination that efficiently excite a specific fluoro-probe [1]. The emission filter is also a colour filter and attenuates all the light trans-mitted by the excitation filter but transmits any fluorescence emitted by the labelled chromosomes. The dichroic mirror is a thin layer of specially coated glass and is set at a 45 degree angle to the optical path of the microscope. This mirror has the unique ability to reflect the excitation light but transmit the emitted fluorescence with high efficiency. Filter blocks are designed specifically for different probes. In the Ax-ioplan2 microscope system, a separate excitation/emission filter block is used for viewing each of the colour channels (SP filters, Chroma Corp.). In the older Axioplan microscope, there is a separate excitation filter for each colour channel, however, only one emission filter is used for viewing all the channels. This filter is designed to allow multiple bands of wavelengths to pass. With separate filter blocks, to view different channels, the filter blocks have to be in-45 Excitation " T Filter Objective Sample Figure 3.11: Schematic of filter block for a fluorescence microscope. terchanged throughout the acquisition process. This may introduce pixel shifts between images of chromosomes acquired at different wavelengths. However, these blocks have emission and excitation filters designed specifically for a cer-tain wavelength with narrow bandpasses that minimize spectral overlapping. Experiments with fluorescent beads were performed to determine pixel shifts for the Axioplan2 microscope system. It was concluded that this shift is less than 2 pixels and can be ignored. 3.3 Preprocessing of the Acquired Images In order to have consistent and quantitative images for homolog classification purposes, noise has to be minimized in the acquired images of chromosomes. As mentioned before, each time a chromosome slide is illuminated the amount 46 of fluorescence that can be excited decreases. As a result of photobleaching, images acquired from a metaphase can get weaker as acquisition continues. In order to prevent photobleaching and hence keep images quantitative and time invariant, anti-fade agents are added to the chromosome slides. A pow-erful anti-fade is used known as Prolong, which ensures the telomere image is quantitative over the period of image acquisition. Unfortunately, this anti-fade agent introduces a lot of background noise to the acquired images. Background subtraction of the acquired images is performed prior to homolog classification to compensate for this effect. Zero gray levels are set to zero brightness, that is, minimum gray levels in each image are mapped to zero intensity. Although high quality emission filters are used in the image acquisition path, there still exists some spectral overlap of the fluoro-probes. This is due to the bell shape emission characteristics of the fluoro-probes (Figures 3.3 and 3.4). Sometimes these characteristic diagrams are extended over a wide range of wavelength, and therefore, overlap with consequent fluoro-probes in the range. After image acquisition, this problem shows itself most noticeably as leakage of the DAPI image into other colour channels (mainly FITC). In addition, CY3.5-a centromere specific PNA probe, has excitation and emission wavelengths close to those of CY3. In acquiring CY3.5 images, it is noticed that there is a wavelength overlap between CY3 and CY3.5 images that cannot be completely filtered out by our emission filter block. This results in some telomere brightness appearing in the CY3.5 centromere images of chromosomes (Figure 3.13 - top). To compensate for the above mentioned problem, an approach called "single probe experiments" is taken. Single probe experiments are applied as a tool to minimize the spectral spreading of fluoro-probes across colour channels. 47 3.3.1 Single Probe Experiments An effective method to remove the spreading of fluoro-probe brightness among other colour channels is colour compensation [83, 84]. Using this method, each fluoro-probe is isolated to a single colour channel. Different chromosome structures labelled with different fluoro-probes can, therefore, be processed independently. Colour compensation is based on the idea that the brightness of the red fluoro-probe (in a five colour system) is spread by the imaging process among the five camera channels (R, G, B, O and Y), and likewise for the green, blue, yellow and orange fluoro-probes [83]. A 5x5 matrix, C is defined which specifies the colour spreading [83]. Each element of this matrix c^ - is the proportion of the brightness of fluoro-probe j that spreads into channel i. The elements of this matrix are determined using "single probe experi-ments". Five slides of metaphase chromosomes, each treated with one of the five different probes (DAPI only, FITC only, CY3.5 only, Cy3 only and CY5 only), are prepared. From each slide, images of four metaphase spreads are acquired under blue, green, yellow, orange and red channels. The algorithm is explained for images of DAPI-only stained chromosomes; however, the proce-dure is identical for other single-probe labelled images. In ideal conditions, for the DAPI-only stained slide, one expects images of DAPI stained chromosomes to only show up in the blue channel. However, the DAPI chromosome images appear to some extent in other colour channels. Images of the chromosomes la-belled with DAPI-only are visualized through the green (used to acquire FITC images), yellow (used to acquire CY35 images), orange and red channels at the typical exposure times used for these probes. Intensity values are measured over the chromosome and background areas in DAPI, FITC, CY3.5, CY3 and 48 CY5 images. Percentages of the DAPI brightness leaking into other colour channels are calculated by averaging these values in four metaphase images of the DAPI-stained slide. Similar calculations are performed for images of all single probe stained slides and the elements of the colour spread matrix C are determined. DAPI FITC CT3.5 CY3 CY5 0.77 0.006 0.001 0.005 0.006 Blue 0.03 0.865 0.029 0.03 0.044 Green 0.12 0.033 0.91 0.104 0.04 Yellow 0.045 0.023 0.027 0.845 0.03 Orange 0.035 0.073 0.033 0.016 0.880 Red In matrix C, column one, for example, shows that only 77% of the DAPI intensity is acquired in the blue channel. Twelve percent of DAPI appears in the yellow channel, 3% in the green, 4.5% in the orange and 3.5% in the red. Each column of this matrix adds up to one and in ideal conditions, C would be equal to the identity matrix of the same dimensions. If A is the 5x1 vector of the real brightness values of the five fluoro-probes, then B defined as is the 5x1 vector of the acquired images. Let D denote a 5x5 compensation matrix such that B = C * A (3.2) A = D * B. (3-3) Substituting Equation 3.3 in 3.2 it is driven B = C * D * B = ^ C * D = I (3.4) 49 where I is the 5x5 identity matrix. Equation 3.4 is satisfied when D = C T 1 . (3.5) The real intensity values of the fluoro-probes is therefore calculated as follows A = C _ 1 * B (3.6) where C _ 1 , the inverse of C is c- x = 1.3 -0.008 -0.035 1.163 -0.042 -0.094 -0.001 -0.007 -0.008 -0.034 -0.036 -0.055 -0.161 -0.034 1.106 -0.133 -0.043 • (3.7) -0.062 -0.027 -0.033 1.190 -0.037 -0.038 -0.014 1.143 From Equation 3.6, to correct the Blue channel (DAPI) image, for each pixel one should take 130% of the acquired intensity in the blue channel and subtract 3.5%, 16.1%, 6.2% and 4.2% of the intensity values in the green, yellow, orange and red channels respectively. Equation 3.6 is applied on a pixel by pixel basis to all five colour acquired images and the colour compensated new images are calculated. An example of a CY3.5 centromere image before and after colour com-pensation is shown in Figure 3.12. As seen from the original (top) image, CY3 image has spread across the CY3.5 image and some telomere brightness is visualized in CY3.5 channel (a couple of such points are highlighted with yel-low arrows). In the colour compensated image, however, mostly centromeres labelled with CY3.5 probe are seen. Two examples of colour compensa-tion for single chromosomes are also shown in Figure 3.13. In the top figure, FITC-centromere images of one chromosome before and after colour compen-sation are shown. Also a centromere segmentation algorithm (discussed later 50 CY35 Centromere Image - Spectral Overlap of CY3 1 1 1 f # # % 1 P * i % ' '.i / w $ CY35 Centromere Image - Colour Compensated • | 1 41 <• * p * * • lp % I Figure 3.12: CY3.5 image with spectral spreading; highlighted with yellow arrows in two instances (top). Colour compensated CY3.5 image (bottom). 51 Figure 3.13: FITC-centromere original and colour compensated images for sin-gle chromosomes; centromere segmentation from the original and compensated images. in Chapter 5) is applied to both images and the segmentation results are com-pared. In the bottom picture of Figure 3.13, again an FITC-centromere image of a single chromosome is shown before and after colour compensation. Note that the original image has a low quality due to the high background as well as spectral spreading. The original and the colour-compensated images are nor-malized for presentation purposes only and shown in this figure. The results of centromere segmentation (using the method discussed in Chapter 5) are also shown from the original and colour-compensated images for comparison. 3.4 Database Preparation After acquiring typically fifteen metaphase images from a prepared slide and preprocessing the images as discussed in the last section, we need to develop a database of chromosomes. In order to prepare such database, the metaphase images are processed in several stages as described below: 1. Images of metaphase chromosomes are "karyotyped" using a semi-automatic procedure. Karyotyping is the process of recognizing and categorizing the 23 different chromosome types from images of metaphase spreads. For karyotyping, we use the isis software from MetaSystems corpora-tion [85]. This software partially automates the karyotyping process and a cytotechnician then completes the procedure. The cytotechnician is highly trained and experienced, and has previously performed numerous similar procedures. A snapshot of the software with karyotyped chromo-somes for a metaphase spread is shown in Figure 3.14. The software also generates a table that contains the co-ordinates for all chromosomes and their karyotype numbers. 53 2. The lengths of telomeres of each individual chromosome can also be calculated from metaphase chromosomes prepared using Q-FISH tech-nology and labelled with telomere-specific PNA probes. A software de-veloped at Terry Fox laboratory, T F L - T E L O , is used for telomere length calculations. This program takes an image of metaphase chromosomes (stained with DAPI) and another image of their corresponding telom-eres (labelled with CY3) and generates telomere intensity values for each chromosome in the metaphase image [53]. Telomeres are segmented from the background using an average differ-ence filter [42] and thresholding. Each segmented object is then labelled and dilated to represent the real telomere area. Metaphase chromosomes are also segmented from the background by thresholding and using the average difference filter as well as a rank difference filter as morphologi-cal and edge detection operators [53]. Once telomeres and chromosomes are segmented, the two images are superimposed with their edges high-lighted. During the superimposing process, the telomere image is shifted in both the x and y directions until the maximum number of telomeres are placed inside the edges of the detected chromosomes. A table is generated as a result of the algorithm which includes the chromosome number as labelled by the program, the karyotype number and the cor-responding four telomere intensities for each chromosome. A snapshot of the T F L - T E L O is shown in Figure 3.15. 3. Once metaphase chromosomes in all the fifteen images of a certain database have been karyotyped, their coordinates tabulated and each individual chromosome's telomere lengths calculated, an automated procedure cre-ates our image database. The program scans the coordinate table of chro-54 —\ m to H DD o o <3 > cn a> CO 3 (A (A (A 1 4 O N cr BJ cn & O > O CD r- m to ESP m >< .—*-7\ O CO 0) (7) CD O a. o CO ft CQ O n o w CD O PT o ft o w < ft i ca W ft T3 0) CO ft & o o "O o cr 0 ' o O ft ft .—t-ft o ft o Q L 1 I X Figure 3.14: Snapshot of the ISIS software with the karyogram for the shown metaphase spread. 55 £* Save Sgrt Bint Lb, cm 198 1404 46 270 1530 45 360 1782 44 216 1476 43 414 778.30 80 450 5 00 172.59 273.35 630 827.05 738 777.89 720 771.27 StdDev 25569 237.31 291.18 313.26 148 38 143.17 1 1 985 932 641 1553 0 4310 0.80 2 19 898 653 418 472 0 2441 1.74 3 0 898 653 0 0 0 1551 0.00 4 1 984 794 632 1180 0 3570 0.99 6 0 632 1160 0 0 0 1792 0.00 6 8 831 705 682 579 0 2797 1 22 7 15 905 836 426 672 0 2839 1.58 8 15 571 311 1146 839 101 2969 0.42 8 2 881 701 554 460 0 2596 1 56 10 5 789 782 733 492 0 2795 1 28 11 10 825 1108 821 767 0 3520 1.22 12 16 960 780 759 819 0 3317 1.10 13 0 790 0 0 0 0 760 000 14 e 479 493 384 706 0 2062 0.89 15 0 759 819 0 0 0 1578 000 16 e 1079 1022 1366 1205 0 4673 0.82 17 12 859 1595 684 436 0 3573 2.19 18 24 1033 947 646 600 0 3226 1 59 19 5 914 816 933 647 0 3310 1.09 20 0 933 0 0 0 0 933 0.00 21 9 1126 959 837 453 0 3375 1 62 22 3 662 972 1030 906 0 3572 084 23 181185 1047 499 602 129 3462 1.81 5 24 0 499 0 0 0 0 499 0.00 25 18 541 see 688 556 0 2374 0.91 26 121178 298 1297 922 97 3792 0.64 27 7 1046 813 1010 1143 0 4012 086 28 4 670 870 505 1002 0 3047 1.02 29 14 766 786 612 643 0 2827 1.25 30 3 567 327 762 762 0 2416 059 31 11 704 411 590 794 97 2596 0.75 32 13 937 937 1225 1357 0 4457 0 73 33 0 937 0 0 0 0 937 0.00 34 17 684 990 750 784 0 3209 1.09 35 0 762 0 0 0 0 762 000 36 11 1314 746 1857 1495 472 5684 054 37 0 466 746 1495 1857 472 5037 032 38 231482 952 902 627 0 3962 1.59 39 21 755 647 0 0 0 1403 000 40 20 951 717 614 492 0 2773 1.51 41 19 218 566 562 586 0 1932 0.68 42 20 619 879 495 434 0 2428 1.61 43 161006 1019 751 819 0 3595 1.29 44 14 875 0 0 0 0 975 0.00 45 10 667 759 629 1267 0 3321 0.75 46 13 695 695 625 715 0 2729 1.04 47 9 1048 496 689 971 0 3403 0.83 48 22 475 654 653 703 0 2484 063 49 22 439 738 544 567 0 2289 1.06 50 2 1121 929 853 1019 0 3923 1.10 51 8 347 689 649 439 0 2124 0.95 52 0 929 0 0 0 0 929 0.00 53 17 877 832 566 237 0 2312 1.88 54 7 760 971 923 655 0 3510 097 55 4 554 786 1321 1410 0 4070 0.49 56 21 1119 1119 671 372 0 3281 2.15 Figure 3.15: Snapshot of the T F L - T E L O software which calculates telomere lengths for individual chromosomes. 56 mosomes for each image and every chromosome is cut from the DAPI as well as its corresponding CY3, FITC, CY5 and CY3.5 images. Each cut image is saved using the format "Database# -Image# -Chromosome# -Homolog# -Image type (DAPI,...).TIF". These images are used for fea-ture extraction and classification in the future chapters. Snapshots of the developed software that automatically finds, isolates and saves each chromosome as an image from several wavelength planes are shown in Figures 3.16 and 3.17. In summary, the image database is developed by acquiring about 15 metaphase images of a prepared slide in 5 colours. Each coloured image shows specifically labelled segments of the chromosomes (such as centromeres and telomeres). These images are then preprocessed to eliminate the biological and hardware introduced noise. At this stage, the metaphase chromosomes are karyotyped and telomere lengths as well as the position of each chromosome are generated using automated and semi-automated softwares. Finally, using the table of co-ordinates of chromosomes, each chromosome and its corresponding centromere and telomere images are isolated as rectangular regions and saved as separate images. The database is used in the next chapters to extract features from each image channel and perform homolog classification. 57 Figure 3.16: Snapshot of the developed software that automatically finds, isolates and saves each chromosome as an image. 58 Chapter 4 Segmentation of Centromeres from DAPI Images Using Gradient and Thresholding Methods Cancer is believed to have an inheritable component that, in some cases, is re-lated to specific chromosome abnormalities. In order to study the role of such chromosome abnormalities in cancer (or other disorders), the ability to ana-lyze separate homologs is highly desirable. Centromere intensities are believed to be an important differentiating feature between homologs. In this chapter we describe two techniques to segment centromeres from images of chromo-somes prepared using the DAPI probe. After segmenting the centromeres, their intensities are measured over the segmented areas for a homologous pair of chromosomes. Homologs of certain "heteromorphic" chromosomes such as chromosome 16 are then classified into two classes (reflecting the maternal and 60 Homologl - Bright Centromere Homolog2 - Dark Centromere Figure 4.1: Centromere heteromorphism in a pair of chromosome 16 homologs. paternal chromosomes) based on the differences in their centromere intensities. 4 .1 Heteromorph ism in Homologous Ch romo-somes and Centromere Segmentat ion The development of Q-FISH methods in preparing images of human chro-mosomes has made the quantitative studies of homolog classification of het-eromorphic chromosomes possible. In this case, heteromorphism in a pair of homologous chromosomes refers to visible intensity differences between the centromere areas of the homologs. Chromosomes 1, 9 and 16 are heteromor-phic in a high proportion of individuals (more than 50%), i.e. for each pair of these chromosomes, the centromere intensity in one homolog is brighter than that of the other homolog. An example of a pair of heteromorphic homologous human chromosome 16 is shown in 4.1. Homolog classification of polymorphic chromosomes 1,9 and 16 is possible in certain circumstances, using the visual inspection by a technician. We choose to automate homolog classification of these chromosomes since we have ground truth on the classification results for them. In order to quantify homolog classification in these chromosomes, we developed two algorithms to segment the centromeres from the chromosomes. 61 The overall intensities of the segmented areas using both methods were calcu-lated. Homologs of chromosomes 16 were classified into two groups of maternal and paternal classes based on the differences in their centromere intensities. These results were also compared to that performed by the technician. 4.2 Centromere Segmentat ion As mentioned in Section 4.1, the two homologs of chromosomes 1, 9 and 16 show apparent differences in their centromeres after staining with DAPI. Most likely these differences reflect true heteromorphism and different amounts of centromere heterochromatic DNA. This heteromorphism can be used for clas-sification purposes. Once the images of the chromosomes are captured and preprocessed, the centromere of each of the two homologous pairs of chromo-somes are segmented and the DAPI fluorescence intensities are measured over the centromere areas. The two methods developed to segment the centromeres are described in Sections 4.3 and 4.4. The reason for developing two methods to segment the centromeres is due to the fact that there does not exist any other means by which the results of segmentation can be verified. Thus we use the similarities in the results of the two segmentation methods as a verification tool. Also, as mentioned in Chapter 1, there are typically small datasets available for homolog classification. In addition, the process of chromosome labelling and image acquisition is developing constantly. The similarity in the results of the two developed segmentation algorithms is used to ensure the algorithms' ability to deal with the fore-mentioned challenges. The first proposed segmentation method involves mode histograms in which the noise level in the image is decreased to a minimum. A threshold 62 is then selected for the mode histogram, within the gray level range of the chromosome image. In the second method, referred to as gradient method, segmentation is performed using the two-dimensional (2D) gradient of the chromosome images as well as intensity thresholding. The results from both these methods were similar as will be discussed later. 4.3 Mode Histogram Having prepared the data-base as described in Chapter 3, our goal is to detect the centromeres of chromosomes from DAPI images. Figure 4.2 shows a chro-mosome 16 image from our data-base. As seen from this figure, centromeres are brighter in intensity than the rest of the chromosome, therefore applying thresholding methods seems like a natural choice to segment the centromeres. In the method we developed for centromere segmentation, threshold values are determined using gray level mode histograms. The mode histogram of an image represents the relative frequency of occurrence of the various gray levels in the image. For the segmentation algorithms that use the gray level mode histogram, threshold levels are generally placed at the valleys between two peaks of the image histogram [86]. Each peak represents an area where a large number of pixels have the same gray level values [87]. Using such a method, however, will not yield successful results in segmenting the centromeres. This is due to the distribution of the chromosome pixels which results in a mode histogram that does not contain many peaks or valleys (Figure 4.2). In our developed method, the gray level histogram for each chromo-some image is formed. In order to decrease the noise level in the image to a minimum, we delete the upper bins in the image histogram. These cor-63 Mode histogram of the chromosome image 250 200 1 | 150h 1 5 100 0 z 50 0 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Intensity -> Figure 4.2: Centromere segmentation using the mode histogram method. respond to 2% of the total number of pixels. We also delete the lower bins which also correspond to 2% of the total number of pixels. In a chromosome image, most of the pixels belong to the background and the non-centromere areas of the chromosome while a relatively small number of pixels belong to the centromere. Therefore if peaks were present in the mode histogram, they will correspond to the background and/or non-centromere chromosome pixels. Centromere pixels, on the other hand, fall in the high intensity part of the image histogram. Taking advantage of the fact that chromosome pixels are higher in intensity and lower in number, the threshold value is set to 7/10 of 64 the chromosome image gray level range. The gray level range of the mode histogram is defined such that approximately 96% of the image pixels are in that range. The threshold value (7/10) was chosen according to both the data gathered from the images and by trial and error. As a result of the above thresholding algorithm, the pixels whose inten-sities sum to 75% of the total image intensity are eliminated, and the algorithm segments the centromere. The segmentation results from this method for one chromosome are shown in Figure 4.2. 4.4 Gradient of the Image and Thresholding Gradient methods form other favorable methods for segmentation purposes. In Figure 4.3 the image of a chromosome .16 (Figure 4.3-a) as well as the in-tensity profile along a central axis (which passes through the centromere) of this chromosome (Figure 4.3-b) are shown. Studying the middle section of this profile, it is perceived that the centromere has higher intensities compared to other areas of the chromosome. It is also observed that the gradient of intensi-ties in most of the centromere area is higher than the rest of the chromosome. We used this knowledge to develop a gradient-based method for centromere segmentation. 4.4.1 Gradient Operator Most common methods of differentiation in image processing applications rely on the gradient operator. The gradient magnitude, g(m,n), of an image, u(m,ri), measures the amount of amplitude change in neighbouring pixels in two orthogonal directions, for each pixel in the image. These magnitudes are 65 approximately calculated by a pair of masks or gradient operators which we name H i and H 2 [87]. The gradient masks compute the horizontal and vertical differences of local sums. The gradients of an image in each direction, gi (m, n) and g2(m,n), are defined as: gi(m,n) A ^^hi(i,j)u(i+-m,j + n) = u(m,ra) * h\{—TO, — ra)(4.1) * 3 g2(m,n) A M*» •?')"(* + m>J + n) = u(m,n) * h2{-m,-n) i j and the total magnitude of the gradient is defined as: g(m, n) = \Jgl(m,n) + g$(m, ra) (4.2) In digital hardware applications, the magnitude of the gradient is often calcu-lated using simplified forms of Equation 4.2 such as norm calculation. In our application, however, we use the original form for calculating the gradient of a chromosome image. A few of the commonly known gradient masks are the Prewitt, Sobel and Isotropic masks [87, 88]. We use a 3x3 Prewitt mask for gradient calculation where and H 2 are defined as: H 1 = H , -1 0 1 -1 0 1 -1 0 1 -1 -1 -1 0 0 0 1 1 1 (4.3) (4.4) This mask reduces the effect of noise in the data and yields zero gradients for uniform regions of the image. 66 a) chromosome #16 b) Image profile along the chromosome 0 10 20 30 40 Pixel -> C) Histogram of the image gradient d) Segmented Centromere 25 20 T 0 0.05 0.1 0.15 0.2 Gradient -» Figure 4.3: Centromere detection using thresholding and gradient method. 4.4.2 Centromere Segmentation Procedure In the second algorithm that we developed for centromere segmentation, the two dimensional gradient of the chromosome image is calculated using the above mentioned method. The gray level mode histogram of this gradient image is then formed (Figure 4.3-c). The valleys towards the right end of the gradient histogram correspond to image regions with large intensity differences. As seen from Figure 4.3-b, since the centromere area has high gradients, select-ing the image pixels corresponding to the right end of the gradient histogram 67 should help in segmenting the centromere. By implementing this algorithm, most of the centromere pixels of the chromosomes are found; however, two major problems remain to be addressed: 1. Some pixels in the centromere were missing, i.e. were not detected as part of the centromere. This is explained by examining the peak of the chromosome intensity profile. As seen from the intensity profile, several pixels in the peak have the same (or close) intensity values, i.e. the image profile of the chromosome has a relatively flat top. This flat portion of the image profile peak has a very low gradient, therefore they will be eliminated by the segmentation algorithm. 2. Edges of the chromosome have a relatively high gradient as well. Some edge pixels are also segmented in the process of thresholding the gradient image. Finding a threshold value for the gradient image which only segments the centromeres is sometimes not possible. The above algorithm was modified by defining another requirement for segmentation. Since the centromere pixels of chromosomes are brighter than the rest of the chromosome image, the centromere is segmented by choosing: 1. pixels that lie in the maximum difference region of the image gradient histogram and 2. pixels that have an intensity value more than a certain threshold. This threshold is set close to the intensity values of the centromeres pixels found using condition 1. In addition, it is chosen high enough to avoid segmenting other parts of the chromosome. Applying this method, all the pixels of the centromeres of chromosomes were detected. The result of the segmentation for one chromosome is shown in 68 Figure 4.3-d. This figure is comparable to the results from the Mode histogram method shown in Figure 4.2. 4.5 Homolog Classif icat ion of Chromosome 16 As seen in sections 4.3 and 4.4, two approaches were taken to segment the centromeres of DAPI images of chromosomes. These two methods yielded similar results. As there is no biological method to confirm these results, the similarity yielded by the two algorithms in segmenting the centromeres is used as one verification method of the results. Once the centromere regions were detected in all chromosomes 16, the total Integrated Fluorescence Intensity (IFI) was calculated over each of these regions. The IFI values were computed for the segmented centromeres of the two homologs of chromosome 16. For each segmentation method, a set of IFI values were obtained. Homologs of Chromosome 16 were classified into two groups of maternal and paternal homologs based on their centromere intensity differences. Homolog classification results were identical using centromeres segmented by mode histogram and gradient methods. Centromere intensities calculated for twelve pairs of homologous chromosomes 16 (segmented using mode histogram method) are shown in Table 4.1. In this table, Homolog 1 refers to the class containing chromosome centromeres with lower intensities, and Homolog 2 refers to the class containing chromosome centromeres with higher intensities. A bar chart representing the same IFI values is also shown in Figure 4.4 so as to compare the centromere intensities in homologous pairs. As it can be seen from this chart, in most of the cases, the difference between the IFI values of the centromeres in the homologs is considerably high and only in 16% of the cases (Image9 and ImagelO), this difference is lower. Maternal and 69 Centromere Intensity Image 1 Image2 Image3 Image4 Homolog 1 63.59 73.93 65.13 63.86 Homolog 2 84.44 78.40 71.79 70.04 Centromere Intensity Image5 Image6 Image7 Image8 Homolog 1 94.76 67.65 57.34 59.05 Homolog 2 103.92 93.83 63.59 73.93 Centromere Intensity Image9 ImagelO Imagell Image 12 Homolog 1 65.13 64.31 79.07 47.70 Homolog 2 68.76 69.10 103.26 93.83 Table 4.1: The IFI values for centromeres of homologous chromosomes 16 of the database. paternal homologs of chromosome 16 are classified based on the centromere differences shown in this bar chart. Another approach to verify our classification methods is to compare the results with those performed by the visual inspection of an experienced cy-totechnician. The cytotechnician used centromere polymorphism and visually classified homologous chromosomes 16 into two classes of "dark-centromered homologs" and "bright-centromered homologs". The results of our automated classification attained by following the developed segmentation algorithms, complied perfectly with the classification anticipated by the cytotechnician. Chromosomes 16 were, therefore, successfully classified in two groups of mater-nal and paternal homologs. Chromosomes 1 and 9 have similar heteromorphic properties to chromosome 16 and can be classified similarly. 4.6 Conclus ion In this chapter we developed two different algorithms which segment cen-tromeres from DAPI images of chromosomes. The first algorithm is based on 70 0) s a 100 £ 2 1 4) 8 O 80 O £• in c 60 a) o c o •a a « 20 • • Homologl I I Homolog2 1 2 3 4 5 6 7 8 9 10 11 12 Metaphase images of one patient -» Figure 4.4: Bar chart presenting the comparison between IFI values for cen-tromeres of homologous chromosomes 16 in twelve metaphase images of a patient. Homologl refers to chromosomes with lower intensity centromeres, and Homolog2 refers to chromosomes with higher intensity centromeres. noise reduction followed by thresholding the mode histogram of the chromo-some image. The second algorithm uses the 2D gradient of the chromosome image and centromere intensities for segmentation. We then used these segmentation results and calculated the integrated fluorescence intensities over the segmented centromere areas. Homolog classi-fication of chromosome 16 was performed based on the differences in the IFIs of the centromeres of a homologous pair. On the other hand, since chromo-some 16 is highly polymorphic, an experienced technician can visually classify the chromosomes into classes of bright and dark-centromered homologs. The classification results using each of the developed algorithms complied perfectly with those of the technician. 71 In the next Chapter we study another form of heteromorphism present in chromosome 22 and take advantage of it for classifying homologs of this chromosome. 72 Chapter 5 Segmentation of Centromeres from CY5, FITC and CY35 Images Using an Iterative Fuzzy Segmentation Algorithm An area of particular interest in cancer genetics research is the quantitative analysis of chromosome images obtained following Q-FISH. This technique is adapted specially for use with novel PNA probes; three of the developed PNA probes are tagged to CY5, FITC and CY35 fluorophores. These probes are designed to target specific and different repeat sequences only in the "cen-tromere" area of chromosomes. As mentioned before, centromere intensities reflect repeat length polymorphisms and are important features for homolog classification. Segmentation of centromeres from images prepared using FITC, CY3 and CY35 centromere-probes is a major step towards multi-feature anal-ysis of homologous chromosomes. 73 In this chapter, another form of heteromorphism present in acrocen-tric chromosomes such as chromosome 22 is studied. Chromosome 22 shows heteromorphic properties after staining with DAPI. The P-arms of this chro-mosome are apparent in one homolog whereas they are almost non-existent in the other homolog. We study the centromere images of this chromosome (prepared using the above mentioned PNA probes) quantitatively, to analyze and search for centromere heteromorphisms. A new algorithm is proposed for centromere segmentation in order to separate homologous chromosomes that differ in the number of centromere repeat sequences. The algorithm is based on fuzzy set theory and gradient descent method. In this method, a fuzzy membership value is assigned to each pixel in the centromere image. Ah iterative algorithm then updates and minimizes a defined error function. Cen-tromeres of human chromosomes are segmented and distinguished from their background using this algorithm. As mentioned above, chromosome 22 is highly polymorphic. The paternal and maternal homologs of this chromosome often show sufficient morphological variations to allow their separation. We have used this information to validate our segmentation algorithm. Homologs of chromosome 22 are classified based on their morphological differences. From the proposed segmentation algorithm, centromere intensities are calculated for homologous chromosomes 22. This chromosome is also classified based on the differences in the centromere intensities form FITC images. The classification results of these two methods are compared and they agree completely. 5.1 Fuzzy Sets Theory Fuzzy sets theory first introduced in 1965 by L.A. Zadeh proposed a new way of decision making involving uncertain or "fuzzy" data [89, 90]. This theory 74 Precision Significance Figure 5.1: Precision and significance in the real world [2]. has the ability to represent the trade-off between significance and precision in problem solving, something that humans have been managing for a very long time. An example of this trade-off is shown in Figure 5.1 in real life [2]. Fuzzy sets theory starts with the concept of fuzzy sets. A fuzzy set as defined by Zadeh is "a class of objects with a continuum of grades of member-ship" [89]. Such a set is characterized by a membership function which assigns to each object a membership value ranging between full-membership(l) and non-membership(0). A fuzzy set A on the universe X is a set defined by a membership function HA representing a mapping: ^ A : X ^ { 0 , 1 } . (5.1) The value of PA(X) for the fuzzy set A is called the membership value or the grade of membership of x e X . The membership value represents the degree of re belonging to the fuzzy set A [91]. 75 5.2 Centromere Segmentation Three different centromere specific PNA probes are tagged with FITC, CY3.5 and CY5 fluorophores. Although they target different repeat sequences in the centromere, the images acquired from these probes can be analyzed similarly for centromere segmentation purposes. In this chapter, the centromere seg-mentation algorithm is explained in detail for FITC images of chromosomes, however the algorithm is applicable to CY3.5 and CY5 images as well. An ex-ample of FITC image and its corresponding DAPI image are shown in Figure 5.2. Although centromeres in the FITC images have higher intensities than their backgrounds, segmentation is not a straightforward task. The boundaries of centromeres in microscopy images are not well defined and have extremely gradual transitions; therefore, simple thresholding methods are not successful in segmenting them from their backgrounds. In addition, as mentioned in Ch-pater 1, one of the challenges in homolog classification is the rapid change in the chromosome image preparation and acquisition techniques. This further accentuates the need for a robust segmentation method. Our proposed seg-mentation method is based on the fuzzy sets approach. As the centromeres have unclear boundaries, the fuzzy set approach makes the process of assigning a pixel to the centromere or the background region more accurate. On the other hand, sometimes, there exists a small spectral overlap of DAPI images in the FITC-centromere image. In order to determine whether or not a pixel belongs to the centromere, we need to examine the intensities of the surrounding pixels as well as the gray level value of that particular pixel. A neighbourhood mask is used for this purpose. Centromere segmentation is performed in three steps described in the following sections. 76 Figure 5.2: Snapshots of a metaphase image prepared using DAPI (top) and FITC-centromere (bottom) probes. 77 5.2.1 Image Normalization and Neighbourhood Mask The acquired microscopy images of centromeres are indexed images with a gray-level range of 0-255. Each image is initially converted to an intensity image by determining the minimum and maximum gray levels of the image and normalizing all the pixel values in the image to a number between 0.0 and 1.0. As mentioned above, for determining the membership of a pixel to the centromere, the surrounding pixels of that particular pixel need to be exam-ined. Therefore a 3x3 neighbourhood mask, centered at the pixel of interest, with distinct weight coefficients is defined as: W(iJ) = exp(-^M) (5.2) where ||(i,j)|| is the Euclidean distance between position and the center of the mask [92]. The constant a controls the shape of the exponential and is set to 2. 5.2.2 Fuzzy Membership Values We define a fuzzy set A as: A^{ centromere pixels of an FITC image } (5.3) A sigmoid fuzzy membership function is applied to the FITC image assigning a membership value to set A for each pixel of the image (Figure 5.3). The fuzzy membership value of each pixel determines its degree of mem-bership to the centromere (A) or to the background (1 — A) [93, 94]. In Figure 5.3, for example, an image pixel with a gray level intensity of 0.9 will be mem-ber of the centromere with a probability of almost 90%. On the other hand, 78 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 oo [ : 1 oo 0.0 0.1 0.2 0.3 0.4 0.5 T 0.6 0.7 0.8 0.9 1.0 Normalized gray-level Figure 5.3: Fuzzy membership function. a pixel with a gray level intensity of 0.1 is part of the centromere with a probability of less than 5%, i.e. is a background pixel with a probability of 95%. In the sigmoid membership function, the threshold, T , is always the gray-level value corresponding to the membership value of 0.5. In other words, the location of T relative to the function curve is fixed, i.e. a larger T shifts the membership function to the right while a smaller T shifts the member-ship function to the left. For segmentation purposes, usually the threshold, T, is chosen in the valley between the two peaks of the image histogram [86]; however, for our images this is not applicable as the histograms of centromere images do not have two peaks. This is due to the low number of centromere pixels in the image as well as the gray level distribution of the centromere images. In our application, we chose T as the gray level corresponding to the lowest number of pixels, right after the peak of the image histogram (lo-cal minimum). These points are calculated for all centromere images in the database, from their histograms. Averaging the values, we determine T = 0.3 79 for this case. 5.2.3 Error Calculation and Updating After assigning a membership value to each pixel, an iterative process updates and finally determines the membership of each pixel to the centromere. The error function is defined as [92]: E = Y:0(i,j)(l-0(i,j)) (5.4) hi where 0(i, j) is the fuzzy membership value of the image pixel («, j)- The error, E, is a non-negative number and is minimized only if the fuzzy membership value, 0(i,j), of each image pixel is either 1 or 0. Inspired by the Gradient method and Back-Propagation algorithm, the updating rule is written as: 0+{i,j) = 0(i,j) + v ^!§H).0(-)(.-o(M)) <«> where E(i, j) is the error caused by pixel (i, j), and rj is the learning rate. In or-der to include the role of neighbouring pixels in defining a pixel as centromere, using Equation 5.2, Equation 5.5 is rewritten as: 0+(i,j)^0(i,j) + r] E w(k,i)(-dE(k,iy •0(i,j)(l-0(i,jj) (5.6) lk,l€N° \ dO{k,l), where W(k,l) are the weight coefficients of the mask defining a neighborhood iV° around After substituting Equation 5.4, the latter becomes: 0+(i,j) = 0{i,j) + 2rl £ W(k,l)(0(k,0-0.5) 0(i,j)(l-0(i,j)) (5.7) 80 Note that the pixels with fuzzy membership values close to "0" (background) or "1" (centromere) are highly unlikely to become members of region "1" (cen-tromere) or "0" (background). Therefore, the updating process can be ac-celerated by applying this information. To do so, Equation 5.7 is changed to: 0+(i,j)=0(i,j)+AO(i,j) (5.8) where AO(i,j) ;0(i,j) < 0.15 or > 0.9 2rj Ek,ieN°W(k,l)(0(k, 0 - 0 . 5 ) otherwise Equation 5.8 makes a quick decision for pixels with fuzzy values close to 0 or 1 (because (0(k, I) — 0.5) has a large value), but for a fuzzier pixel ((0(k,l) — 0.5) closer to zero), there is a smaller change in its intensity, causing a delay in segmentation of the pixel. 5.3 Results and Discussion As described in the previous section, a segmentation algorithm was developed based on fuzzy sets theory and error back-propagation. The algorithm was implemented and used to segment the centromeres from their background for all chromosomes in the prepared data-base. The size of the neighbourhood mask W(i,j) is an important variable in the segmentation algorithm. Several experiments were performed with different mask sizes (3x3, 5x5 and 7x7). As the size of the neighbourhood mask increases, the area of the segmented centromere decreases. This is due to the fact that with a larger mask size more bright pixels are needed around a certain pixel for it to be segmented as centromere. Therefore, some edge pixels in the FITC-centromere image will 81 be eliminated during segmentation. The 3x3 neighbourhood mask was the best size and was chosen as the algorithm default. Segmentation results, for a pair of homologous chromosome 22, using different neighbourhood mask sizes are shown in Figure 5.4. Segmentation was performed successfully on all of the centromere im-ages in the database. Examples of the centromere segmentation for two ho-mologs of chromosome 22 and chromosome 19 are shown in Figure 5.5. The original, normalized, and segmented centromeres for the two homologs of chro-mosome 22 and chromosome 19 are shown in the top and bottom parts of this figure, respectively. As mentioned before, there is no biological method to validate our seg-mentation results. In addition, the PNA centromere probes are quite new and manual segmentation of the centromeres by a cytotechnician is not a standard procedure yet. However as a crude verification, we first compared our seg-mentation results with those of manual segmentation of a cytotechnician on the prepared database. To further verify our results, we used chromosomes 22 as these chromosomes would be classified into their paternal and maternal homologs using differences in their DAPI images. Segmentation results (using our method above) of the FITC images are also used to classify chromosomes 22 into two classes. This is done by calculating the integrated fluorescence intensity over the resultant segmented area of every chromosome and studying the differences in the values of this feature. The classification results of the two methods should agree. 82 Normalized Image r>2 - Chr. 22 Segmented Image - 3x3 mask Segmented Image - 5x5 mask Segmented Image - 7x7 mask Figure 5.4: Segmentation of centromeres for a pair of homologous chromosome 22 using three different neighbourhood mask sizes. Segmentation results for homolog 1 are shown in the top picture while results for homolog 2 are shown in the bottom picture. 83 Original Image h1-Chr.22 Normalized Image seg. Image h1-IFI=32.9 Original Image h2-Chr.22 Normalized Image seg. Image h2-IFI=30.0 Original Image h1-Chr.19 Normalized Image seg. Image h1-IFI=59.2 Original Image h2-Chr.19 Normalized Image seg. Image h2-IFI=39.4 Figure 5.5: Centromere segmentation for homologs of chromosome 22 (top) and chromosome 19 (bottom) in a metaphase spread. 84 5.3.1 Homolog Classification of Chromosome 22 Once the centromere regions in the F I T C images are segmented for all chro-mosomes 22, the total integrated fluorescence intensity is calculated for each centromere. The IFI values are computed for the two homologs of chromo-some 22, for twelve metaphase images of a patient (in the prepared database), using the above method and the results are shown in Table 5.1. Homologs of chromosome 22 are classified into two groups based on the differences in their IFI values. On the other hand, chromosome 22 is highly polymorphic. The two homologs of chromosome 22 have apparent differences in their upper arms (P-arms) in DAPI images (see Figures 5.6 and 5.7). One homolog has short P-arms (top left in each section of Figures 5.6 and 5.7) while the other homolog has no P-arms (bottom left in each section of Figures 5.6 and 5.7). Homologs of chromosome 22 are also classified using this morphological difference. In Figures 5.6 and 5.7, examples of four pairs of chromosome 22 ho-mologs with and without P-arms as well as their centromere images, seg-mented centromeres and total IFI values are shown. The original centromeres are shown as inverse images to make the comparison of centromeres and the segmentation results easier. As seen from these figures and Table 5.1, homologs with P-arms have higher IFI values than homologs without P-arms. Homolog classification is performed successfully using both the difference in the P-arms and the difference in centromere IFI values. The results of the two methods comply perfectly and the accuracy of the classification is 100%. Chromosomes 22 are classified into two classes representing maternal and paternal homologs. A challenge in homolog classification, as referred to in Chpater 1, is the noise introduced during slide preparation and image acquisition stages. De-85 Table 5.1: The IFI values for segmented FITC-centromeres of homologous chromosomes 22 from the database. Chromosome 22 Image 1 Image2 Image3 Image4 Image5 Image6 with P-arm 11881 10712 7157 6421 7348 5619 without P-arm 7356 7525 4687 5166 5403 4181 Chromosome 22 Image7 Image8 Image9 ImagelO Image 11 Image 12 with P-arm 7656 8513 8433 7991 9818 7337 without P-arm 5134 4913 5029 4753 5206 4827 spite the precautions taken in the fore-mentioned processes, our segmentation method should perform well in the presence of acceptable noise. In the next section, we model the total noise with a Gaussian distribution function [95] and study the performance of the segmentation algorithm in noisy images. 5.4 Noise Tolerance Gaussian noise is the simplest form of additive random noise whose probability distribution function is defined as: p<*>=vkexrt-^f} (5-9) where px and o~x are the mean and variance of the Gaussian noise, respec-tively [95, 96]. In order to test the tolerance of our segmentation algorithm to noise, we started with adding Gaussian noise with 0 mean and 0.001 variance to our centromere images. The variance of noise was gradually increased first in smaller steps (0.001) and later in larger steps (0.005 and 0.05). Centromeres were segmented from the noisy images using our segmentation algorithm and the relative changes in the total intensities of the segmented areas were mea-86 Ch.22 w p-arm Inv. Cent. Nor. Seg-Cent. IFI=10712 Figure 5.6: The chromosome (DAPI), inverted centromere (FITC) and normal-ized segmented centromere images for two pairs of homologous Chromosome 22. 87 Figure 5.7: The chromosome (DAPI), inverted centromere (FITC) and nor-malized segmented centromere images for two other pairs of homologous Chro-mosome 22. sured. In Figures 5.8-A and 5.8-B, the scatter-plot and box-plot for relative changes in segmented centromere area versus additive Gaussian noise are shown for 1200 centromere images. In these plots the variance of noise is between, 0.001 and 0.01. We fitted the best line to the scatter plot in Figure 5.8-A. As seen from this figure, the average amount of change in centromere intensity is quite low. This change is about 2.5% for Gaussian noise with 0.001 variance and 6% for Gaussian noise with 0.01 variance. This amount is very low compared to typical centromere differences in homologous chromosomes which is approximately 30% (see Table 5.1). Increasing the amount of the Gaussian noise in the image, the relative change in the centromere intensity increases as well. Scatter plots and poly-nomial approximations to these plots are shown in Figure 5.9-A and 5.9-B for noise variances up to 0.06 and 0.7 respectively. As seen from these images, the scatter plot can eventually be approximated by an exponential function. An example of a centromere image and the noisy images resulting from additive Gaussian noise with different variances are shown in Figure 5.10. As seen in this image, noise with a variance of more than 0.01, distorts the centromere image extensively. The distortion is obvious to the naked eye and the shape of the centromere changes. These images, if produced during the image acquisition procedure, can be easily distinguished and discarded by the technician. Therefore, although high relative-changes occur in segmented centromere areas with noise variance > 0.01, our segmentation algorithm will not be dealing with these images at all. On the other hand, the segmentation is successful in the presence of low amounts of noise and the changes in the segmented area are acceptable. 89 A - Relative changes in segmented centromere areas vs. noise n r -0.001 0.002 0.003 0.004 0.005 0.006 0.007 0.008 0.009 0.01 Variance of additive Gaussian noise - » B - Boxplot - relative changes in segmented centromere area vs. noise 0.001 0.002 0.003 0.004 0.005 0.006 0.007 0.008 0.009 Variance of additive Gaussian noise - » Figure 5.8: Scatter-plot and box-plots of relative changes in segmented cen-tromere areas versus additive Gaussian noise. 90 A - Relative changes in segmented centromere area vs. noise 0.02 0.03 0.04 Variance of additive Gaussian noise -0.05 0.06 B - Relative changes in segmented centromere area vs. noise Variance of additive Gaussian noise -Figure 5.9: Scatter-plots of relative changes in segmented centromere areas versus additive Gaussian noise for variances of up to 0.06 (A) and 0.7 (B). 91 Original Image % Mm Noise Variance = 0.01 Noise Variance = 0.03 ,» J^r V O Figure 5.10: An example of the original and noisy centromere images for additive Gaussian noise with different variances. 5.5 Conclusion In this chapter an iterative fuzzy segmentation algorithm was implemented and applied to centromere images of chromosomes acquired using PNA-centromere probes. Using this algorithm, the centromeres were successfully segmented from their backgrounds. To verify the segmentation results, the total in-tegrated fluorescence intensity of the segmented regions were calculated for chromosome 22 and the homologs of this chromosome were classified into two groups of maternal and paternal classes. Since chromosomes 22 show morpho-logical differences between homologs in the DAPI images, we used this criteria to classify chromosome 22 into parental homolog classes as well. The classi-Noise Variance = 0.001 Noise Variance = 0.015 Noise Variance = 0.04 Noise Variance = 0.005 Noise Variance = 0.02 Noise Variance = 0.05 92 fication results using these two criteria were compared and the two methods complied completely. The segmentation algorithm is successful in segmenting the centromeres from their backgrounds and the classification resulting from it is completely correlated with the ground truth. This correlation is also used as an assurance in analyzing smaller databases. The segmentation algorithm was also tested in the presence of noise and worked very well in the typical noise levels existing in centromere images. The results of this research constitute a major contribution to multi-feature analysis of chromosome images for homolog classification purposes. Such analysis will allow a comprehensive evaluation of centromere repeat con-tent in various chromosomal instability syndromes including cancer. In the last two chapters, homolog classification has been performed us-ing multiple features. Homolog classes are verified by the high correlation of classification results using different features. However, the study of this corre-lation as well as multi-feature analysis of chromosomes have been performed non-automatically. In the next chapter, we propose a novel quantitative and automatic algorithm for multi-feature analysis of homologous chromosomes. 93 Chapter 6 Multi-feature Analysis of Homologous Chromosomes Multi-feature analysis of human chromosome images forms an important ap-proach towards classification of homologous chromosomes. This is due to the fact that there exists no robust mean by which the results of homolog classifica-tion can be verified. Thus high correlation in the homolog classification results obtained using several different features can be employed as a verification tool. In this chapter, we first use a fuzzy c-means clustering algorithm .to clas-sify homologous chromosomes using the "telomere lengths" feature. We then note that, for heteromorphic chromosomes (such as chromosome 16), the cen-tromere intensity in one homolog is different from that of the other homolog. The gradient method developed in Chapter 4 is applied to classify homologous chromosomes using "centromere intensity differences". We show in sections 6.1-6.4 that the results of the two different classification methods using the two different features are highly correlated. Inspired by these results, a novel quantitative and automatic method is then proposed for homolog classifica-94 tion using multiple features. The classification method is based on Mutual Information Maximization applied to an unsupervised neural network archi-tecture. The neural network consists of separate modules that are trained to classify homologs using independent features. Mutual information is then maximized between the outputs of the modules forcing them to produce the same classification results, for a given chromosome. This opens a new venue for the application of unsupervised learning methods and mutual information maximization theory to homolog classification of human chromosomes. 6.1 Telomere Length Measurements and Data Normalization T F L - T E L O , the software developed at the Terry Fox laboratory for measuring telomere lengths, was applied to the CY3 images of our prepared database [53]. As a result, a table is generated that includes four telomere values for the four arms of each chromosome in a metaphase, namely Pi, P2, Qi and Q 2 - Since the two sister chromatids are duplicates in a chromosome, one expects the telomere Pi to be identical to P2 as well as Q\ identical to Q2- In practice, these values are not the same due to biological and quantitative reasons. In order to be consistent in all the telomere related analysis, P the average values of Pi and P 2 and Q the average value of Qi and Q2 are calculated and used instead. At this stage the P and Q are expressed as intensity values and are transformed to telomere lengths known also as Telomere Fragment Units (TFU) as follows: p !300 , „ , . P - = 4 8 ^ 5 — ^ n Q 1 3 0 0 (RO\ Q t f U = 4 8 ^ 5 ^ F ( 6 - 2 ) 95 The constant value, k is the fluorescence intensity of the calibration beads and is unique for each data set. The PTFU and QTFU values are com-puted for all the chromosomes in the database. These values are then nor-malized in the following fashion. The mean, p, and the standard deviation, <j, are measured for PTFU and QTFU, over the entire database. The normal-ized telomere lengths, PN and QJV> are then calculated using the following equations. PTFU — P LA Q\ PN = (6.3) a Q n = Qrrv-p a The normalized telomere lengths for all chromosomes 16 in the database are extracted. P^ and QN values are plotted against each other for chromosomes 16 in order to study the clustering characteristics of their distribution. Each chromosome is shown as a point on the plot where the x axis represents the normalized Q-arm telomere length and the y axis represents the normalized P-arm telomere length for that chromosome (Figure 6.1-a). 6.2 Fuzzy C-means Clustering of Telomere Data A fuzzy c-means clustering algorithm is applied to the data shown in Figure 6.1 in order to classify the telomere data into two clusters [2]. In this algorithm, each data point, xk, belongs to a cluster to some degree which is specified by a fuzzy membership value, pik. The algorithm starts with an initial guess for the cluster centers (vi), which is the mean location of each cluster. The initial guess for the cluster centers is most likely incorrect. Every data point is then assigned a membership value for each cluster. By iteratively updating the cluster centers and the membership values for each data point, the algorithm 96 (a) Chromosome 16 telomere lengths 0.1 0.2 0.3 0.4 0.5 0.6 0.7 Q-arm telomere length (b) Classitication of chr. 16 using telomere data 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 Q-arm telomere length Figure 6.1: Distribution of telomere lengths for chromosomes 16 in the database (top); Classification of telomere data using c-means clustering al-gorithm (bottom). 97 1 0.9 0.8 £0 .7 o> c o g> 0.6 o | ° 0.5 o |o.4 i D. 0.3 0.2 0.1 Classification of chr. 16 using centromere data o Bright Centromere O Dark Centromere o o O Dark Centromeres o° o ° _ Bright Centromeres CD o ° 0.1 0.2 0.3 0.4 0.5 0.6 0.7 Q-arm telomere length 0.8 0.9 Figure 6.2: Classification of homologous chromosomes 16 based on differences in their centromere intensities. moves the cluster centers to the correct location within a data set [94]. The iteration is based on minimizing the distance function from any data point to a cluster center weighted by the membership value of that point as shown in Equation 6.5 [97]. minz = £ X > ^ ) m | l ^ - ^ l | 2 (6-5) i = l k=l In this equation, m > 1 is a weighting exponent affecting the fuzzy member-ship value. The iteration is continued until the objective function is no longer decreasing. The clustering results of the telomere data are shown in Figure 6.1-b. Homologs of chromosome 16 are classified into two groups using this algorithm. The clustering algorithm has knowledge of the number of clusters the data should be classified into, prior to the clustering step. However, the algorithm has no knowledge that the clusters should contain equal number of 98 points. It acts independently and successfully classifies the data points into two classes of equal points. 6.3 Classification of Homologous Chromosome 16 Using Centromere Intensities Chromosome 16 is highly polymorphic, i.e. for each pair of this chromosome, the centromere intensity in one homolog is higher than that of the other ho-molog. The gradient algorithm developed in Chapter 4 was used to segment the centromeres of this chromosome. Once segmentation is performed, the integrated fluorescence intensity of the centromere areas are calculated for all of the chromosomes 16 in the database and compared in homologous pairs. Homologous chromosomes are classified in two groups of maternal and pater-nal homologs based on their centromere intensity. The classification results using centromere intensities are shown in Figure 6.2. In order to make the results more comprehensible, each chromosome in this figure is shown as one point similar to Figure 6.1. The black points present homologs with darker centromere intensities while the white points present homologs with brighter centromere intensities. 6.4 Comparison of Centromere and Telomere based Classification for Chromosome 16 As seen in the previous sections, two methods were used to classify homologs of chromosome 16. In the first method, telomere values of the prepared database were calculated, normalized and plotted. Fuzzy c-means algorithm was then 99 1 0.9 0.8 £ 0 . 7 oi c o u 0.6 v E ° 0.5 0) Homolog classification using telomere & centromere data o Bright Centromere O Dark Centromere o o ° Dark Centromeres Cluster 1 - Telomere Data o° o N O N x 0 o CO o Bright Centromeres * Cluster 2 - Telomere Data 0.1 0.2 0.3 0.4 0.5 0.6 0.7 Q-arm telomere length 0.8 0.9 Figure 6.3: Classification of homologous chromosomes 16 using centromere intensities as well as telomere lengths; the results of the two independent classifiers are completely correlated. applied to classify chromosomes 16 in two classes of parental homologs, using the "telomere" data. The second algorithm segmented the centromeres of chromosomes 16 and used "centromere IFI differences" for classification. As there is no biological method to verify the results, a comparative study of the outcomes of the two algorithms is used as a verification. In order to make the comparison of the two methods easier, the results of the two algorithms have been shown on the same plot in Figure 6.3. The classification result using centromere data is shown as gray and white circles representing homologs with darker centromeres and homologs with brighter centromeres, respectively. The results of homolog classification using telomere data is shown by dotted lines separating the two clusters. Studying Figure 6.3, it is noticed that homolog classification using 100 telomere and centromere information are completely correlated. The two in-dependent classifiers cluster the chromosomes into the same two classes. Note that according to Figure 6.3, chromosome 16 homologs with brighter cen-tromeres tend to have P and Q-arm telomere lengths located in the lower left hand corner of the P — Q axis. 6.5 Automated Classification of Homologous Chromosomes Using Mutua l Information Maximization As mentioned before, multi-feature analysis of human chromosomes is an essen-tial step in classification of homologs. So far, homolog classification algorithms have only used one feature at a time to differentiate homologs. In addition, these algorithms are not 100% accurate in homolog classification. Further-more, attempts to classify homologs using multiple features employ ad-hoc manual strategies in contrast to automatic algorithms for this purpose. The next logical step for homolog classification is to automate the procedure of multi-feature analysis and feature combination for differentiation of homologs. We propose an automatic quantitative classification method for homol-ogous chromosomes using multiple features. The classification method is based on the mutual information maximization criterion. This criterion is applied to an unsupervised neural network architecture [98]. We designed and tested our algorithm on chromosome 16 since for this chromosome, a cytotechnician can visually classify homologs providing us with ground truth. The neural network architecture is composed of two separate modules. Each module is designed to classify homologs of chromosome 16 based on independent features. The 101 z=g(ai) a — \wx 1— 4-i li i 1=1 Output layer Hidden layer Input layer Bias Figure 6.4: An example of a feed-forward network which has two layers of adaptive weights. learning algorithm maximizes the mutual information between the outputs of the two modules training them to produce the same classification result using different features for a given chromosome. This neural network architecture is able to extract higher-level features from the input data that cannot be de-tected otherwise. In addition, the architecture can be generalized to classify homologous chromosomes using as many features as available. 6.6 Feed-forward Neural Networks In recent years, feed-forward neural network architectures have been used in pattern recognition problems to a large extent [99]. Feed-forward neural net-102 works are general frameworks that provide us with the ability to represent non-linear functional mappings between a set of input variables and a set of output variables. These networks have successive layers of processing units, with connections running from every unit in one layer to every unit in the next layer, but with no other connection such as feedback loops permitted. This ensures that the network outputs can be calculated as explicit functions of the inputs and weights. An example of such network is shown in Figure 6.4. This network consists of d inputs, M hidden units and c output units. The bias terms in this architecture are treated as units with an input value permanently set to Xd = 1 for the input layer and ZM = 1 for the hidden layer. The analytic function corresponding to this figure can be written as follows. The input of the jth hidden unit is obtained by forming a weighted linear summation of the d input values as: d aJ = E wji)xi (6-6) i=l where is the first layer weight connecting input i to hidden unit j. The output of hidden unit j is then obtained by transforming the linear sum in Equation 6.6 using a transfer function g(.) yielding to: Zj = g(aj) (6.7) The input of each output unit, k, is computed as a linear summation of the outputs of the hidden units in a similar form to Equation 6.6 as follows: M = E wkjzJ (6-8) 3=1 The outputs of the network are obtained by transforming this linear combina-tion using a non-linear transfer function, to give: yk = g{a-k)- (6.9) 103 The transfer functions of the hidden and output layers need not be the same as indicated by using two different notations of ^(.) and <?(.), respectively. 6.6.1 Error Back-propagation In order for a neural network to learn a suitable mapping from a given data set, it needs to be trained during a learning phase. Learning is based on the definition of a suitable error which is a differentiable function of the weights. This error is then minimized with respect to the weights and biases in the network. Error back-propagation technique was popularized by Rumelhart, Hinton and Williams [100]. In this technique, the error function is minimized using an iterative procedure; the derivatives of the error function are taken and evaluated with respect to the weights and error is back-propagated through the network. In order to minimize error, these derivatives are used to com-pute adjustments made to the weights, based on gradient descent or similar methods. For a general feed-forward neural network, the error back-propagation algorithm can be derived as follows. Assume, the error, E, is a differentiable function of the network outputs, that is: The derivative of the error with respect to some weight, Wji, can be calculated considering E depends on Wji only via the summed input aj to unit j and applying the chain rule for partial derivatives: Introducing a new notation and using Equations 6.8 and 6.11, we can write: E = E(yu...,yc). (6.10) dE _ dE daj (6.11) dwji daj duiji (6.12) 104 da* OWji thus dE — = S,Z, (6.14) therefore, in order to evaluate the derivatives gjjjr? w e need only to calculate the value of 8 for each hidden and output unit in the network and apply equation 6.14. For the output units, 8k is calculated based on its definition in Equation 6.12 as: 4 s = ' V ) f • ( 6 - 1 5 ) For the hidden units, 8 can again be derived using the chain rule for partial derivatives, _ dE ^ dEdak daj ^ dak oaj where the sum is over all units k to which unit j sends connections. Using the definition of 8 and Equations 6.7 and 6.8, 8j can be re-written as: 8j = g'(aj) ^ 2 wkjh (6.17) k which indicates that the value of 8 for a particular hidden unit can be obtained by propagating the <5's backwards from units higher up in the network. The back-propagation algorithm can be summarized in five steps: 1. An input vector is presented to the neural network and is forward prop-agated using Equations 6.6-6.9 to find the outputs of all the hidden and output units. 2. f5fc is evaluated for all the output units using 6.15 3. <5's are back-propagated in the network using Equation 6.17 to obtain 8j for all hidden units 105 4. Equation 6.14 is used to evaluate the derivatives of error 5. Weights are updated using a parameter optimization method. For the standard case of gradient descent method, weights are updated using: dE AWjl = - n ^ - . (6.18) 6.7 Neural Network Architecture Our proposed network architecture for automated classification of homologous chromosomes consists of two separate modules as shown in Figure 6.5. Each module is a three layer feed forward neural network that classifies homologs of chromosome 16 using one feature (centromere intensity / telomere length). Our goal is to design an unsupervised learning algorithm which trains the two networks to present the same classification for a data point, based on different features. A possible solution to this problem is to maximize the correlation be-tween the outputs of the networks, y\ and y2. Maximum correlation, however, is not a sufficient condition to solving our problem. Two variables can be per-fectly correlated if, for example, they are nearly always zero [98]. In this case, the outputs of the network convey almost no information about the inputs, i.e. a minimal entropy distribution. A better method to train the networks is to maximize the mutual infor-mation between the outputs of the modules, instead. We apply this technique to design our learning algorithm. In Figure, 6.5, the first module receives centromere intensity as an input. It then learns to classify homologous chro-mosomes 16 based on this feature. The second module, on the other hand, receives normalized P and Q-arm telomere lengths as inputs. It learns to 106 Maximize Information Module 1 Module 2 Figure 6.5: Structure of the proposed neural network architecture. classify the homologs based on these lengths. The restriction of mutual infor-mation maximization is applied as an objective function between the outputs of the two modules. An error function is measured and back-propagated to the hidden layers of the networks through the gradients. This method trains the neural network architecture to produce the same classification results for the two modules. This architecture can be generalized to perform classification using more than two features by adding a new module to the network, for each inde-pendent new feature. As a result, several well-known chromosome features (extracted from images prepared using different probes) which are speculated to be of some importance in homolog differentiation can be examined using our proposed architecture. Each of these features can be used as inputs to the network in combination with known "homolog differentiating" features, for specific chromosomes. The changes in classification results relative to the original classification performed with known-features only can determine the value of the new feature in homolog differentiation. 107 Homolog classification using our extracted features is not a straightfor-ward task since these features are not linearly separable in most cases. On the other hand, the perception of the relationship between features in a pair of ho-mologs is essential to the classification process. Therefore, the representation of the feature space to the neural network (the input structure), in addition to the training method, plays a major role in homolog classification. 6.8 Mutua l Information Maximization Theory The mutual information between the outputs of two modules in Figure 6.5, in the continuous case is defined as: I(Vu y2) = H(Vl) + H(y2) - H(yu 2/2) (6.19) where H(yx) is the entropy of yi, H(y2) is the entropy of y2 and H(yi,y2) is the entropy of the joint distributions of y\ and y2 and can be written as [96]: /oo P^lnPiy^dy^ E{-\nP(yi)} -00 /oo P(y2)\nP(y2)dyi = E{-\nP(y2)} (6.20) -00 /oo roo / P{yi,y2)\nP{yi,y2)dyldy2 -00 J—oo = E{-\nP(yuy2)} (6.21) From Equations 6.19, 6.20 and 6.21, mutual information can be rewritten as: I = EL^^-\. (6.22) \ P(yi)P(y2)f K ' The concept of mutual information maximization has been extensively applied to signal and noise separation applications as well as to capture coherence in 108 the input feature space [98, 101]. Using a similar idea, we assume the output of one module, e.g. y2, to be a noisy version of the output of the second module, yi, that is: V2 = yi + n. (6.23) Therefore, in order to successfully predict the signal y\, the information that y2 conveys about y\ (Iy2,yi) should be maximized. By making an assumption that both yi and the additive noise have Gaussian distributions with zero mean, I y 2 , y i can be written as [96]: *™ = °-5 l n ER} = V(y2 - Vl) ( 6 > 2 4 ) where V(y2) is the variance of y2 and V(y2 — y\) is the noise variance. In order to maximize Iy2:yi, V(y2) should be maximized, whereas V(y2 — y{) should be minimized. For symmetry and simplification, Equation 6.24 is generalized as [98]: Iy2,yi = l n 7777 7^—7 + l n 7777 T T V ^ T " ( 6 - 2 5 ) V(y2) l n V{Vi) V(y2-y1) + k V(y2-yi) + k' According to Equation 6.25, each module maximizes the information its output conveys about the other module. Therefore, each module in the network will minimize V(y2 — y\), maximizing both V(yi) and V(y2). In order to prevent / from becoming infinite, a small constant k > 0 is added to the denominators in Equation 6.25. The error is then defined as: £ = - / M ' = - ' V t e - ! / 1 ) + * " 1 V t e - J 1 ) + * ' ( 6 ' 2 6 ) and used as our objective function. 109 6.9 Network Learning and Update Rule We use error back-propagation and batch training as our unsupervised learning algorithm. Starting with small random weights, all training cases (N cases) are propagated forward through the networks to calculate outputs, activations of the hidden units and the error. For the error function defined as the following equation, S = -Iy. y = - ln T / , V { V i \ , - ln w f V { V j \ u (6.27) v"Vi V(Vi - Vj) + k V{yi-yj) + k the gradient of the error with respect to y^, for the a training case, is derived as: dS dy? d dy? d In V{Vi) In V{Vi) V(yi - yj) + k V(yi - yj) + k d d-lnV(yi)-2—HV(yi-y]) + k) 1 d aV(Vi) ~ d V{yi)dyf y°lJ V(yi-yj) + kdyl where V(yi) and V(yi — yj) in turn are defined as: 6.28) V(Vi) = E {y*} - {E {Vi}f = ^ E ( y f ) 2 - d E ^ ) (6.29) a=l a=l Viyi-yj) = E {(Vi - yj)2} - (E {yi - Vj})2 = T ^ E ^ - ^ - ^ E ^ - ! / ? ) ) 2 (6-30) i V a=l 7 V a=l. Substituting Equations 6.29 and 6.30 in Equation 6.28, the final expression for the gradient of the error with respect to the output yf can be written as: dS dy? 2 "N y?-E{Vi} 2(y? - yf) - E {Vi - Vj} V{Vi) V(Vi - Vj) + Nk (6.31) 110 where and yj are the outputs of the ith and jth units in the top layer, E {y^} denotes the mean of yi and N is the number of training cases. In order to calculate this gradient, two sweeps are made through the data. In the first forward sweep, the mean and variances are calculated. During the second sweep the gradient is computed. Error is then back-propagated to the hidden layers through the gradients and weight updates are measured using the weight update rule: W™ = W%es - n £ ) 5«x? - pw%" + aAW%es, (6-32) a = l and AW[trjes = W?™3 - Wff. (6.33) Here Wjj denotes the weight connecting the jth unit of a higher layer to the ith unit of a lower layer, Sj is the back-propagated gradient from the jth unit, Xi shows the output of the ith unit and 77, ft and a represent the learning rate, momentum coefficient and weight decay factor, respectively. The gradient of the outputs are large at the beginning of the training and tend to get smaller as training proceeds. Therefore, we start our training with a very small learning rate and slowly increase it. Ideally, an adaptive algo-rithm should take care of the changes in the learning rate, however developing such an algorithm is a difficult task. We manually control the step size at the initial stages of the learning procedure. Momentum is introduced later in the training to damp oscillations during gradient descent. A weight decay factor is also implemented in the update rule to prevent the network weights from increasing unboundedly. During training, there is no need for regular manual intervention. Once the parameters for the network are tuned, the training process runs completely automatically. Conjugate gradient method with line 111 Example 0.8 I 0.6 o u g 0.4 n I O. 0.2 o O Bright Centromere O Dark Centromere 0.2 0.4 0.6 Q-arm telomere length 0.8 Figure 6.6: Centromere intensities and telomere lengths; example. Each ho-molog is shown by a circle located at (Q, P). The circle size is proportional to the homolog centromere intensity. search is another effective method to train the network and its parameters could be easily tuned. 6.9.1 A Simple Classification Example We present a simple simulated classification example to explain the network architecture and its learning procedure clearly. In this example, simulated features for two homologs of a chromosome are used as inputs to the neural network shown in Figure 6.5. The neural network should classify the homologs into two classes. This example presents a classification problem with an ob-vious solution. It will, therefore, not require a network with as many units as in Figure 6.5. However, we use this architecture for demonstrating the classification process for clarity purposes. 112 Centromere intensities and telomere lengths (features used as inputs to the neural network) for the two homologs to be classified are shown in Table 6.1 and Figure 6.6. Table 6.1: Centromere intensities and telomere lengths for the example. Centromere P-arm telomere Q-arm telomere Homolog 1 1 0 0 Homolog 2 0.1 1 1 As shown in Table 6.1, homologl has higher centromere intensity but shorter telomere lengths than homolog2. The same features for the two ho-mologs are shown in Figure 6.6 as well. In this figure each homolog is shown as a circle placed at its P-arm (y axis) and Q-arm (x axis) telomere lengths. The size of the circle is proportional to the intensity of the centromeres. In addition, the white circle shows the homolog with brighter centromere while the gray circle shows the homolog with darker centromere. We now study the learning procedure of the neural network for classifi-cation of the data in Table 6.1. Our goal is to classify the two homologs into two classes by maximizing the information between the outputs of module 1 and module 2. The input units have a linear transfer function whereas the hidden and output units have sigmoid transfer functions. Starting with small random weights (as shown in Figure 6.7), small learning coefficient, and us-ing batch training a forward sweep is made through the data to calculate the outputs of the network and their variances. As seen in Figure 6.7, the original variances are very low. In order for the network to learn the problem, the information between yi and y2 and hence the variances (from Equation 6.26) should be maximized. Error is calculated according to Equation 6.27 and is 113 0.4999 0.4999 £ =-/ =54.2 V(y2-yl)=2.0e-13 0.5005 0.5005 -4.4e-3 Telomere Lengths bias I o o O.l 1 1 Module 1 Module 2 Figure 6.7: Initial weights, variances and outputs of the proposed architecture while learning to classify the example. shown in Figure 6.7. A second forward sweep is made through the data to calculate gradients as explained in Section 6.6.1. The calculated gradients are back-propagated and weight updates are obtained from Equation 6.32. Error back-propagation is continued until information is maximized (i.e. error is minimized) between the two modules in the network. The final weights and outputs of the the network once it is settled and trained are shown in Figure 6.8. As seen in this figure, after training, both modules result in the same classification for a given homolog and the homologs are classified into two classes. 114 0.0005 0.9995 £ =-/ =-1.83 0.0004 0.9996 -6.38 {-2.72 Centromere Intensity Telomere Lengths l o.i 0 0 1 1 Module 1 Module 2 Figure 6.8: Final weights and outputs of the proposed architecture after learn-ing to classify the example. 6.10 Training Experiments Design and Results Two sets of experiments were designed to train and test the neural network. Simulated data was used in the first set of experiments where a rather simple classification problem was introduced to the network. In the second set of experiments, the real data with a more difficult classification problem was utilized. 6.10.1 Training Using Simulated Data In this experiment, centromere intensities and telomere lengths are simulated for twelve pairs of chromosomes 16 as illustrated in Figure 6.9. In this figure (similar to Figure 6.6), each chromosome is represented by a circle located at 115 Simulated Data 0.8 p 0.6 ; 0.4 h 0.2 (9o o o o O ° o O Bright Centromere O Dark Centromere 0.1 0.2 0.3 0.4 0 .5 0.6 0.7 0.8 Q-arm telomere length 0.9 Figure 6.9: Centromere intensities and telomere lengths; simulated data. Each chromosome is shown by a circle located at (Q,P). The circle size is propor-tional to the centromere intensity. (Q,P) (the Q-arm and P-arm telomere lengths). Moreover, the size of each circle is proportional to the centromere intensity of that particular homolog. Two distinct clusters of homologs are easily noticed in the figure based on the P versus Q-arm telomere length distribution. On the other hand, the centromere data show that the difference in centromere intensities in one pair of homologous chromosomes is quite high. Furthermore, there is an evident distinction between homologs with bright centromeres and those with dark centromeres, that is the lowest intensity of the bright centered homolog in the entire database (the smallest white circle) is well above the highest intensity of the dark centered homolog (the largest gray circle). This data presents a relatively easy classification task to the neural network modules. Training the neural network using this database, the error converges to its desired minimum in a relatively low number of iterations. The classifica-116 Classification of the simulated data i 1 1 1 1 r I o I o I o I ° 0° v o o Class2 \ ° 0 Module 1 & Module 2 \ s & Module 2 O Bright Centromere O Dark Centromere 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Q-arm telomere length Figure 6.10: Classification results for the simulated database. The two mod-ules in the neural network classify homologous chromosomes into the same two classes as shown by dotted lines and coloured circles. tion results are shown in Figure 6.10. Studying these results, the two modules classify chromosomes 16 into the same two homologous groups. Module 1 clas-sifies chromosomes 16 into two classes of bright and dark centered homologs. On the other hand, module 2 classifies homologous chromosomes so that class 1 and class 2 consist of homologs with P and Q telomere lengths distributed in the lower left corner and upper right corner of Figure 6.10, respectively. Mutual information between the outputs of the two networks is maximized at the end of the training period. 6.10.2 Training Using Real Data Centromere intensities and telomere lengths calculated for twelve pairs of ho-mologous chromosomes 16 are used as training data for the neural network, n 1 i 0.8 [ c j J) 0.6r o ii ;o.4h 0 . 2 I (fyO x Module 1 o O o * <$> 0 ° i _l l _ 117 Table 6.2: IFI values for centromeres of homologous chromosomes 16 of the simulated database. Centromere Intensity Pair 1 Pair 2 Pair 3 Pair 4 Pair 5 Pair 6 Homolog 1 0.25 0.31 0.84 0.72 0.32 0.85 Homolog 2 0.77 0.73 0.28 0.27 1.00 0.33 Centromere Intensity Pair 7 Pair 8 Pair 9 Pair 10 Pair 11 Pair 12 Homolog 1 0.72 0.24 0.74 0.23 0.93 0.85 Homolog 2 0.25 0.74 0.28 0.83 0.22 0.26 in this experiment. The distribution of the P and Q-arm telomere lengths as well as the centromere intensities are shown in Figure 6.11. Here again, each chromosome is shown as a circle at (Q, P) and the size of the circles are proportional to the intensities of the centromeres. As seen, unlike the previous database, the distribution of the P and Q-arm telomere lengths is not easily clusterable. In addition, the difference in centromere intensities within a ho-mologous pair is rather low (Table 6.3). Furthermore, the centromere data are not linearly classifiable as bright and dark centromeres. Module 1, in the neural network architecture proposed in Figure 6.5, will not be able to classify homologous chromosomes based on the current information. According to the current architecture, for training purposes, module 1 only finds a threshold and classifies the centromere values based on this threshold.. Such a threshold does not exist for the real data, therefore training the neural network using this data will not yield satisfactory results. Module 1 in the network architecture has no knowledge of two chromosomes being homologs, that is if homolog 1 is classified in class 1, homolog 2 cannot be in the same class. In addition, the input structure does not provide an opportunity for this module to compare centromere intensities within a pair 118 Real Data 0.9 0.8 £0.7h O) c o o 0.6 | °0.5h s |o.4 I a. 0.3 0.2 h 0.1 0 o Bright Centromere O Dark Centromere O o o° o o o o o o o cO O 0.1 o 0.2 0.3 0.4 0.5 0.6 0.7 Q-arm telomere length 0.8 0.9 Figure 6.11: Centromere intensities and telomere lengths; real data. Cen-tromere intensity differences between homologs is rather low. of homologs. To overcome this problem, another unit is added to the input layer of module 1 as shown in Figure 6.12. Training the network for each chromosome, the chromosome centromere intensity as well as its homologous pair centromere intensity are presented as inputs to module 1. Inputs of module 2 are st i l l the P and Q - a r m telomere lengths of that chromosome. Training the network wi th this information, we expect that if pair (ci, c2,P\, qi) (c\: centromere intensity of homolog 1 and c 2 : centromere intensity of homolog 2) is classified in class 1 then ( c 2 , C i , p 2 , Q2) be classified in class 2. The classification results after training this network are shown in Figure 6.13 as "Class 1" and "Class 2" wi th the dotted lines as well as coloured circles. Naturally, training is slower than it was for the simulated data, however, the two modules successfully classify homologs into the same two classes (It should 119 Maximize Information Module 1 Module 2 Figure 6.12: The new neural network architecture. be noted that the real data presented in Figure 6.13 is the same data as in Figure 6.3 merely with a different presentation). The unique input structure suggested enables the neural network to compare features within a pair of homologous chromosomes. The neural network architecture can be generalized to classify more complex databases. This architecture can also be used to test the effectiveness of existing and new chromosome features in homolog classification. Table 6.3: IFI values for centromeres of homologous chromosomes 16 of the database. Centromere Intensity Pair 1 Pair 2 Pair 3 Pair 4 Pair 5 Pair 6 Homolog 1 0.62 0.72 0.69 0.68 1.00 0.91 Homolog 2 0.82 0.76 0.63 0.62 0.92 0.65 Centromere Intensity Pair 7 Pair 8 Pair 9 Pair 10 Pair 11 Pair 12 Homolog 1 0.62 0.72 0.63 0.62 1.00 0.91 Homolog 2 0.55 0.57 0.66 0.67 0.77 0.46 120 Real Data 1 0.9 0.8 £ 0 . 7 a c a o 0.6 u E ° 0.5 o i 0.4 0.3 0.2 0.1 0 V o ° O N . o s O N O 0 s s o O O CD, o , o 0.1 0.2 0.3 o Bright Centromere O Dark Centromere Class2 Module 1 & Module 2 Classl » Module 1 & Module 2 0.4 0.5 0.6 0.7 Q-arm telomere length 0.8 0.9 Figure 6.13: Classification of the real database. The two modules classify homologous chromosomes into the same two groups. 121 C h a p t e r 7 C o n c l u s i o n s a n d F u t u r e W o r k 7.1 Overview Many disorders have an inheritable component that is related to specific chro-mosome abnormalities. Statistics also indicate the possibility of cancer being an inherited genetic disease. In order to study the role of such chromosome ab-normalities in specific disorders, the ability to analyze separate homologs of a chromosome is highly desirable. In this thesis, we addressed homolog classifica-tion using quantitative analysis of microscopy images of human chromosomes following Q-FISH procedures. Despite the vast amount of research conducted on analyzing chromo-some images since the 1960s, it has only been a few years since the problem of homolog classification (into maternal and paternal classes) has been addressed systematically [10, 52, 77]. Microscopy images of chromosomes available in the past did not contain enough quantitative information. Recently, however, the development of novel PNA probes and high quality microscopes have made Q-FISH technology possible, which in turn provides chromosome images with 122 high quantitative information. Image processing techniques have the potential to greatly advance cancer research by enabling the analysis of such quantitative images of chromosomes. 7.2 Conclusions Both centromeres and telomeres contain highly repetitive nucleic acid se-quences. After staining the cells with fluorescent PNA probes, some chro-mosomes show apparent differences between their centromeres/telomeres in their homologs. Most likely these differences reflect true heteromorphism and different amounts of centromere/telomere repeats and heterochromatic DNA. We took advantage of this heteromorphism for homolog classification. Mul-tiple features were extracted from images of chromosomes and combined to differentiate homologous chromosomes. The major contributions of this thesis are as follows: 1) Chromosomes were labelled with several fluorescent PNA probes, each targeting different repeat sequences in the chromosome. The slides were visualized at different wavelengths of light specific to the fluorescent probes. Therefore, multi-colour images of the chromosomes were acquired using a C C D camera. As the probes are fluorescent, repetitive illumination of the slide re-sults in decreased amounts of fluorescence that can be excited later, producing weaker image intensities. In order to keep the images quantitative, anti-fade agents are added to the chromosomes at the preparation stage, which intro-duces a high amount of background noise. Also, for multi-colour imaging of chromosomes, despite the care to choose probes with little spectral overlap, and the use of powerful filters, there still exists some spill-over from certain colour images into the rest of the colour channels. 123 In order to compensate for these problems, we performed single probe experiments, where five slides of chromosomes are prepared each labelled with single and different probes. Images of each slide are taken under five different colour channels to compute the amounts of image intensity spill-over from one channel to the others. A colour compensation matrix is then formed whose values are unique to our microscope setup, hardware and preparation techniques. The colour compensation matrix is used to correct and eliminate the spread of fluoro-probe brightnesses among other channels in the acquired images. 2) We developed an automated algorithm to form our database from the acquired and preprocessed images. This algorithm uses a table of coordinates of chromosomes generated by the Karyotyping procedure. The program scans this table for each chromosome. Using the scanned coordinates, every chro-mosome is cut from the DAPI as well as its corresponding CY3, FITC, CY5 and CY35 images. Each isolated chromosome is then saved as a tagged im-age using the format "Database#-Image#-Chromosome#-Homolog#-Image type(DAPI,..).TIF". 3) We use the specific heteromorphic properties of chromosomes 1, 9 and 16 for classifying the homologs of these chromosomes. Chromosomes 1, 9 and 16 are heteromorphic in a high proportion of individuals, that is, for each pair of these chromosomes, after staining with DAPI, the centromere in-tensity in one homolog appears brighter than that of the other homolog. To measure these intensities, we developed two novel algorithms to segment the centromeres from DAPI images of chromosomes. The first proposed segmen-tation method involves mode histograms, in which the noise level in the image is decreased to a minimum. A threshold is then selected for the mode his-togram, within the gray level range of the chromosome image. In the second 124 method, referred to as the gradient method, segmentation is performed using the two-dimensional (2D) gradient of the chromosome images as well as inten-sity thresholding. We then used our segmentation results and calculated the integrated fluorescence intensities over the segmented centromere areas. For chromosome 16, for example, homolog classification was performed based on the intensity differences in centromeres of a homologous pair. On the other hand, since chromosome 16 is highly heteromorphic, an experienced techni-cian can visually classify the chromosomes into classes of bright and dark-centromered homologs. The classification results using both of our developed algorithms complied perfectly with those of the technician. 4) Acrocentric chromosomes, such as chromosome 22, present a dif-ferent kind of heteromorphism than that of chromosome 16. We analyze the centromeres of this chromosome to study centromere heteromorphisms as well. A new algorithm is proposed to segment the centromeres from images of chro-mosomes 22 acquired using PNA-centromere probes. The algorithm is based on fuzzy set theory and gradient descent method. In this method, a fuzzy membership value is assigned to each pixel in the centromere image. An it-erative algorithm then updates and minimizes a defined error function. As a crude verification of the segmentation results, we first compare our segmen-tation with those achieved through manual segmentation of a cytotechnician on the prepared database. To further verify the results, we used chromosomes 22, a heteromorphic chromosome. In more than 50% of the population, pater-nal and maternal homologs of chromosome 22 show sufficient morphological variations in DAPI images to allow their separation. One homolog has small P-arms while the other homolog has no P-arms. This criteria was used to classify homologs of chromosome 22. Segmentation results of the FITC im-ages were also used to classify chromosomes 22 into two classes. This was 125 done by calculating the integrated fluorescence intensity over the resultant segmented area of every chromosome and comparing the values in a homol-ogous pair. The classification results using these two criteria were compared and the two methods complied completely. 5) Telomere lengths are believed to contain high amounts of heterochro-matic DNA. Individual telomere lengths are measured for each chromosome in our database using T F L - T E L O software. For each specific chromosome, we employed a fuzzy c-means clustering algorithm for classification of homologous chromosomes into two clusters, using the relative telomere lengths (Normalized P-arm versus Q-arm telomere lengths). On the other hand, for heteromorphic chromosomes (such as chromo-some 16), the centromere intensity in one homolog is different from the other. The gradient method developed in Chapter 4 was used to classify homologous chromosomes using centromere intensity differences. We noticed that the re-sults of the two different classification methods are highly correlated. The two independent classifiers cluster the chromosomes into the same two classes. 6) Inspired by these results, a novel quantitative and automatic method was then proposed for homolog classification using multiple features. The clas-sification method is based on Mutual Information Maximization, applied to an unsupervised neural network architecture. We designed and tested our al-gorithm on chromosome 16, since for this chromosome, a cytotechnician can visually classify homologs providing us with ground truth. The neural net-work consists of two separate modules. Each module is designed to classify homologs of chromosome 16 based on independent and different features. The learning algorithm maximizes the mutual information between the outputs of the two modules, training them to produce the same classification result using different features for a given chromosome. Once the parameters of the learn-126 ing algorithm are tuned, training is performed completely automatically for homolog classification. The proposed method was successfully applied to clas-sify homologs of chromosome 16 with 100% accuracy in the studied database. This neural network architecture is able to extract higher-level features from the input data that cannot be detected otherwise. In addition, the architec-ture can be generalized to classify homologous chromosomes using as many features as are available. As a result, several well-known chromosome features speculated to be of little or no importance in homolog differentiation can be examined using our proposed architecture. 7.3 Suggestions for Future Work A few suggestions for future research are as follows: • The database preparation process is a very lengthy procedure which makes it extremely difficult to create large databases for a given ex-periment. Acquired images of the chromosomes are Karyotyped using a commercial software. These images are then entered into the T F L -T E L O software for telomere length measurements. Finally, the database is created by isolating desired chromosomes from the images. Although we have automated the last part of this procedure, more work needs to be done in order to shorten this process. Automation and integration of the Karyotyping process and the telomere length measurements will certainly help to reach this goal. • The PNA probes employed in this research, and in general Q-FISH, are new techniques whose implementation protocols are improving con-stantly. Higher qualities of PNA probes and more standardized Q-FISH 127 protocols will increase the quality of the final acquired images. Therefore, the efficiency of homolog classification can greatly improve with new and upcoming PNA probes and with standard acquisition and quantification techniques. The proposed neural network architecture for homolog classification can be expanded to test the efficiency of existing features believed to be of some value in homolog classification. Each of these features can be used as an input to the network in combination with known "homolog differentiating" features, for specific chromosomes. The changes in clas-sification results, relative to the original classification performed with known-features only, can determine the value of the new feature in ho-molog differentiation. This neural network can be trained to learn and classify more complex relationships in the input feature set. This is useful in studying chro-mosomes that do not present obvious heteromorphic properties for their homologs to be classified easily. The network can also be generalized to classify homologous chromosomes using more features, by adding a module for each additional independent feature. 128 Bibliography 1] Handbook of optical filters for fluorescence microscopy. Chroma Tech-nology Corporation, Brattleboro, V T 05301, U.S.A., 1994. 2] The Mathworks Incorporation, Natick, MA 01760, U.S.A. Fuzzy Logic Toolbox, For Use With Matlab, 1998. 3] J.H. Tjio and A. Levan. The chromosome number of man. Hereditas, 42:1-6, 1956. 4] B. Lerner, H. Guterman, I. Dinstein, and Y. Romem. Feature selection and learning curves of a multilayer perceptron chromosome classifier. In Proceedings of the 12th I APR International Conference on Computer Vision and Image Processing, pages 497-499, 1994. 5] R.S. Verma and A. Babu. Human chromosomes, principles and tech-niques. McGraw-Hill, second edition, 1994. [6] A.J. Griffiths, J.H. Miller, D.T. Suzuki, R.C. Lewontin, and W.M. Gel-bart. An Introduction to Genetic Analysis. Freeman, sixth edition, 1996. 7] J. Meyne and R.K. Moyzis. In situ hybridization using synthetic oligomers as probes for centromere and telomere repeats. In K.H.A. Choo, editor, Methods in Molecular Biology, Vol. 33, pages 63-74. Hu-man Press Inc., Totowa, NJ, 1994. [8] Tumor cells immortality mechanism illuminated. Health Horizons, (33):2, Winter 1997/1998. [9] S.S.S. Poon and P.M. Lansdorp. Current protocols in cell biology. Sub-mitted, 2000. 129 [10] U.M. Martens, J.M. Zijlmans, S.S.S. Poon, V. Drogoulskz, J. Yui, E. Chavez, R.K. Ward, and P.M. Lansdorp. Short telomere on the p-arm of human chromosome 17. Nature Genetics, 18:76-80, 1998. [11] T. Caspersson, L. Zech, C. Johansson, and E.J. Modest. Identification of human chromosomes by DNA-binding fluorescent agents. Chromosoma, 30:215, 1970. [12] G.H. Granlund. Identification of human chromosome by using integrated density profile. IEEE Transactions on Biomedical Engineering, BME-23:182-192, 1976. [13] P.A. Errington and J. Graham. Classification of chromosomes using a combination of neural networks. In M. Taylor and P. Lisboa, editors, (Proc. of 1992 Neural Networks Workshop in) Techniques and Applica-tions of Neural Networks, pages 77-92. Ellis Horwood, 1993. [14] R. Ledley. High-speed automatic analysis of biomedical pictures. Science Magazine, 146, 1964. [15] J. Piper and E. Graham. On fully automatic feature measurement for banded chromosome classification. Cytometry, 10:242-255, 1989. [16] V. Kliment, R. Tuscany, J. Dvorak, and L.Tomasek. Computer-assisted analysis of human G-banded chromosomes. Computers and Biomedical Research, 16:20-28, 1983. [17] Q. Wu, P. Suetens, and A. Oosterlinck. On knowledge-based improve-ment of biomedical pattern recognition-a case study. In Proceedings of the Fifth Conference on Artificial Intelligence for Applications, pages 239-244, 1989. [18] G. Ritter, M.T. Gallegos, and K. Gaggermeier. Automatic context-sensitive karyotyping of human chromosomes based on elliptically sym-metric statistical distributions. Pattern Recognition, 28(6):823-831, 1995. [19] G. Ritter and M.T. Gallegos. Outliers in statistical pattern recogni-tion and an application to automatic chromosome classification. Pattern Recognition Letters, 18:525-539, 1997. 130 [20] B. Lerner, M. Levinstein, B. Rosenberg, H. Guterman, I. Dinstein, and Y. Romem. Feature selection and chromosome classification using a multilayer perceptron neural network. In Proceedings of the IEEE World Congress on Computational Intelligence, pages 3540-3545, 1994. [21] L.B. Levinstein, M. Rosenberg, B. Guterman, and H. Dinstein. Feature selection and chromosome classification using a multilayer perceptron neural network. In Proceedings of the IEEE International conference on Neural Networks, pages 3540-3545, 1994. [22] B. Lerner, H. Guterman, I. Dinstein, and Y. Romem. Medial axis transform-based features and a neural network for human chromosome classification. Pattern Recognition, 28(ll):1673-1683, 1995. [23] B. Lerner, H. Guterman, I. Dinstein, and Y. Romem. A comparison of multilayer perceptron neural network and Bayes piecewise classifier for chromosome classification. In Proceedings of the IEEE World Congress on Computational Intelligence, pages 3472-3477, 1994. [24] J.M. Cho. Chromosome classification using backpropagation neural net-works. IEEE Engineering in Medicine and Biology Magazine, 19(1):28-33, 2000. [25] S. Eskiizmirliler, A.M. Erkmen, F. Basaran, and A. Nur Cakar. A hybrid intelligent diagnostic system based on neural networks and image anal-ysis techniques in the field of automated cytogenetics. In Proceedings of the IEEE International Conference on Image Processing, pages 315-318, 1996. [26] R.J. Stanley, J.M. Keller, P. Gader, and C.W. Caldwell. Data-driven homologue matching for chromosome identification. IEEE Transactions on Medical Imaging, 17(3):451-462, June 1998. [27] B. Lerner. Toward a completely automatic neural-network-based hu-man chromosome analysis. IEEE Transactions on Systems, Man, and Cybernetics, 28(4):544-552, August 1998. Part B: Cybernetics. [28] J.M. Keller, P. Gader, and O. Sjahputera. A fuzzy logic rule-based system for chromosome recognition. In Proceedings of the 8th IEEE Symposium on Computer-based Medical Systems, 1995. 131 [29] O. Sjahputera and J.M. Keller. Evolution of a fuzzy rule-based sys-tem for automatic chromosome recognition. In Proceedings of the IEEE International conference on Fuzzy Systems, pages I—129—I—134, 1999. [30] J. Piper. Genetic algorithm for applying constraints in chromosome classification. Pattern Recognition Letters, 16:857-864, 1995. [31] S.O. Zimmerman, D.A. Johnston, F.E. Arrighi, and M.E. Rupp. Au-tomated homologue matching of human G-banded chromosomes. Com-puters in Biology and Medicine, 16(3):223-233, 1986. [32] Q. Wu and K.R. Castleman. Automated chromosome classification using wavelet-based band pattern descriptors. In Proceedings of the 8th IEEE Symposium on Computer-based Medical Systems, pages 189-194, 2000. [33] T. Caspersson, K.R. Castleman, G. Lomakka, E.J. Modest, A. Moller, R. Nathan, R.J Wall, and L. Zech. Automated karyotyping of Quin-crine Mustard stained human chromosomes. Experimental Cell Research, 67:233-235, 1971. [34] N. Sweeney, R.L. Becker, and B. Sweeney. A comparison of wavelet and fourier descriptors for a neural network chromosome classifier. In Proceedings of the IEEE International Conference in Engineering in Medicine and Biology, pages 1359-1362, 1997. [35] A.J. Dennis, K.S. Tang, and S. Zimmerman. Band features as classifica-tion measures for G-banded chromosome analysis. Computers in Biology and Medicine, 23(2):115-129, 1993. [36] N. Poulin, S.S.S. Poon, G. Kandola, and B. Palcic. Automatic detec-tion of metaphase chromosomes. In Proceedings of the IEEE Interna-tional Conference in Engineering in Medicine and Biology, pages 350-351, 1989. [37] D. Schoevaert-Brossault, C. Leonard, and J. Selva. A new method for automatic metaphase finding adaptable to different chromosome prepa-rations. Computer Programs in Biomedicine, 16:195-202, 1983. [38] L. Ji. Fully automatic chromosome segmentation. Cytometry, 17:196-208, 1994. 132 [39] B. Lerner, H. Guterman, and I. Dinstein. A classification-driven par-tially occluded segmentation (cpoos) method with application to chro-mosome analysis. IEEE Transactions on Signal Processing, 46(10):2841-2847, October 1998. [40] G. Agam and I. Dinstein. Geometric separation of partially over-lapping nonrigid objects applied to automatic chromosome classifica-tion. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(11):1212-1222, November 1997. [41] S.S.S. Poon, R.K. Ward, and P.M. Lansdorp. Segmenting telomeres and chromosomes in cells. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, pages 3413-3416, 1999. [42] S.S.S. Poon. Telomere length measurements using fluorescence mi-croscopy. PhD thesis, University of British Columbia, Vancouver, BC, Canada, 1997. [43] D. Celeda, K. Aldinger, F.M. Haar, M. Hausmann, M. Durm, H. Lud-wig, and C. Cremer. Rapid fluorescence in situ hybridization with repet-itive DNA probes: Quantification by digital image analysis. Cytometry, 17:13-25, 1994. [44] K.R. Castleman and Q. Wu T.P. Riopka. FISH image analysis. IEEE Engineering in Medicine and Biology Magazine, 15(l):67-75, 1996. [45] P.M. Nederlof, S. van der Filer, J. Vrolijk, H.J. Tanke, and A.K. Raap. Fluorescence ratio measurements of double-labelled probes for multiple in situ hybridization by digital imaging microscopy. Cytometry, 13:839-845, 1992. [46] M.R. Speicher, S.G. Ballard, and D.C. Ward. Karyotyping human chro-mosomes by combinatorial multi-fluor FISH. Nature Genetics, 12:368-375, April 1996. [47] E. Schrock, S. du Manoir, T. Veldman, B. Schoell, J. Wienberg, M.A. Ferguson-Smith, Y. Ning, D.H. Ledbetter, I. Bar-Am, D. Soenksen, Y. Garini, and T. Ried. Multicolor spectral karyotyping of human chro-mosomes. Science Magazine, 273, 1996. 133 [48] L.E. Morrison and M.S. Legator. Two-color ratio-coding of chromosome targets in fluorescence in situ hybridization: Quantitative analysis and reproducibility. Cytometry, 27:314-326, 1997. [49] K. Roth, G. Wolf, M. Dietel, and I. Petersen. Image analysis for com-parative genomic hybridization based on a karyotyping program for win-dows. Analytical and Quantitative Cytology and Histology, 19(6):461-474, December 1997. [50] P.M. Nederlof, S. van der Filer, N.P Verwoerd, J. Vrolijk, A.K. Raap, and H.J. Tanke. Quantification of fluorescence in situ hybridization sig-nals by image cytometry. Cytometry, 13:846-852, 1992. [51] F.M. Haar, M. Durm, K. Aldinger, D. Celeda, M. Hausmann, H. Ludwig, and C. Cremer. A rapid FISH technique for quantitative microscopy. BioTechnniques, 17(2):346-352, 1994. [52] P.M. Lansdorp, N. Werwoerd, F. Van de Rijke, V. Dragowska, M.T. Lit-tle, R.W. Dirks, A.K. Raap, and H.J. Tanke. Heterogeneity in telomere length of human chromosomes. Human Molecular Genetics, 5(5):685-691, 1996. [53] S.S.S. Poon and P.M. Lansdorp. Measurements of telomere length on individual chromosomes by image cytometry. Methods in Cell Biology, 64:69-98, 2001. [54] M.J. McEachern, A. Krauskopf, and E.H. Blackburn. Telomeres and their control. Annual Review of Genetics, 34:331-359, 2000. [55] M. Meyerson. Role of telomerase in normal and cancer cells. Journal of Clinical Oncology, 18(13):2626-2634, July 2000. [56] JP. Pommier, J. Lebeau, C. Ducray, and L. Sabatier. Chromosome insta-bility and alteration of telomere repeat sequences (review). Biochimie, 77(10):817-825, 1995. [57] P.M. Lansdorp. Repair of telomeric DNA prior to replicative senes-cence. Mechanisms of Aging and Development, 118(l-2):23-34, Septem-ber 2000. 134 [58] U.M. Martens, V. Brass, M. Engelhardt, S. Claser, C.F. Waller, W. Lange, C. Schmoor, S.S.S. Poon, and P.M. Lansdorp. Measurement of telomere length in haematopoietic cells using in situ hybridization techniques. Biochemical Society Transactions, 28:245-250, 2000. part 2. [59] S.S.S. Poon, U.M. Martens, and P.M. Lansdorp. Telomere length mea-surements using digital fluorescence microscopy. Cytometry, 36(4):267-278, 1999. [60] J.M. Zijlmans, U.M. Martens, S.S.S. Poon, A.K. Raap, H.J. Tanke, R.K. Ward, and P.M. Lansdorp. Telomeres in the mouse have large inter-chromosomal variations in the number of T2AG3 repeats. Proceedings of National Academy of Science, U.S.A., 94:7423-7428, 1997. [61] B. Trask, G. van den Engh, and J.W. Gray. Inheritance of chromosome heteromorphisms analyzed by high-resolution bivariate flow karyotyping. American Journal of Human Genetics, 45:753-760, 1989. [62] R.S. Verma and H. Dosik. Human chromosome heteromorphisms: Na-ture and clinical significance. International Review of Cytology, 62:361-383, 1980. [63] P.A. Jacobs. Human chromosome heteromorphisms (variants). Progress in Medical Genetics, 2:251-274, 1977. [64] B. Trask, G. van den Engh, B. Mayall, and J.W. Gray. Chromosome heteromorphism quantified by high-resolution bivariate flow karyotyping. American Journal of Human Genetics, 45:739-752, 1989. [65] G. Hadlaczky. Life order? In Proceedings of COMBIO'95: Sum-mer Workshop on Computational Modelling and Imaging in Biosciences, pages 28-30, 1995. [66] D. Dorninger, G. Karigl, and J. Loidl. Simulation of chromosome inter-locking in meiotic pairing. In F. Breitenecker and I. Husinsky, editors, Proceedings of the 1995 EUROSIM Conference, pages 969-974. Elsevier Science B.V., 1995. [67] C. Hofers, P. Baumann, G. Hummer, T.M. Jovin, and D.J. Arndt-Jovin. The localization of chromosome domains in human interphase nuclei. 135 Three-dimensional distance determinations of fluorescence in situ hy-bridization signals from confocal laser scanning microscopy. Bioimaging, 1(2):96-106, June 1993. [68] C. Hofers, T.M. Jovin, G. Hummer, and D.J. Arndt-Jovin. The lo-calization of chromosome domains in human interphase nuclei. Semi-automated two-dimensional image acquisition and analysis of fluores-cence in situ hybridization signals. Bioimaging, 1(2):107-118, June 1993. [69] C L . O'Keefe, P.E. Warburton, and A.G. Matera. Oligonucleotide probes for alpha satellite DNA variants can distinguish homologous chromo-somes by FISH. Human Molecular Genetics, 5(11):1793-1799, 1996. [70] JP. Pommier, M. Ricoul, and L. Sabatier. Telomere heteromorphism analysis and specific chromosomal instability in human fibroblasts. In The 1998 Role of Telomeres and Telomerase in Cancer and Aging meet-ing (Presentation), Ladenburg, Germany, October 1998. [71] P. Mousavi, R.K. Ward, E. Chavez, and P.M. Lansdorp. Analysis of telomere intensities in human chromosomes with application to classifi-cation of chromosome 16 homologs. In Proceedings of the IEEE Pacific Rim Conference on Communications, Computers and Signal Processing, pages 205-208, 1999. [72] D. Mason, I. Lauder, D. Rutovitz, and G. Spowart. Measurement of C-bands in human chromosomes. Computers in Biology and Medicine, 5:179-201, 1975. [73] I.J. Lauder. Estimation of polymorphism in quantitative measurements of homologous pairs of human chromosomes. Computers and Biomedical Research, 11:89-101, 1978. [74] D.H. Moore. Do homologous chromosomes differ? Two statistical tests. Cytogenetics and Cell genetics, 12(5):305-314, 1973. [75] D.H. Moore. Normalisation of chromosome measurements: A new method. Computers in Biology and Medicine, 5(1-2):21-28, June 1975. [76] P.M. Lansdorp, V. Dragowska, N. Rufer, T. Brummendorf, S.S.S. Poon, P. Mousavi, T. Duncan, and U. Martens. Applications of peptide nucleic 136 acid probes in cytometry. In Proceedings of the IS AC XIX International Congress, 1998. [77] P. Mousavi, R.K. Ward, and P.M. Lansdorp. Feature analysis and clas-sification of chromosome 16 homologs using fluorescence microscopy im-ages. IEEE Canadian Journal of Electrical and Computer Engineering, 23(4):95-98, 1999. [78] P. Mousavi, R.K. Ward, and P.M. Lansdorp. Feature analysis and clas-sification of chromosome 16 homologs using fluorescence microscopy im-ages. In Proceedings of the IEEE Canadian Conference on Electrical and Computer Engineering, pages 841-846, 1999. [79] P. Mousavi, R.K. Ward, M. Sameti, P.M. Lansdorp, and S.S. Fels. Ho-mologue classification of human chromosome images using an iterative centromere segmentation algorithm. In Proceedings of the IEEE Annual International Conference on Engineering in Medicine and Biology, pages 2033-2036, 2000. [80] P. Mousavi, R.K. Ward, P.M. Lansdorp, and S.S. Fels. Multi-feature analysis and classification of human chromosome images using cen-tromere segmentation algorithms. In Proceedings of the IEEE Inter-national Conference on Image Processing, pages 152-155, 2000. [81] P. Mousavi, S.S. Fels R.K. Ward, and P.M. Lansdorp. Classification of homologous human chromosomes using mutual information maximiza-tion. In Proceedings of the IEEE International Conference on Image Processing, 2001. [82] SensiCam PCO Computer Optics GmbH, PCO CCD Imaging. URL:www.pco.de. Kelheim, Germany. [83] K.R. Castleman. Color compensation for digitized FISH images. Bioimaging, 1:159-165, 1993. [84] K.R. Castleman. Color compensation with unequal integration times. Bioimaging, 2:160-162, 1994. [85] MetaSystems Hard und Software GmbH. isis. URL:www.metasystems.de. Altlussheim, Germany. 137 [86] M. J.T. Smith and A. Docef. A Study Guide for Digital Image Processing. Scientific Publishers, 1997. [87] A.K. Jain. Fundamentals of digital image processing. Printice Hall, Englewood Cliffs, NJ, 1986. [88] R.C. Gonzalez and R.E. Woods. Digital Image Processing. Addison-Wesley Publishing Company, 1993. [89] L.A. Zadeh. Fuzzy sets. Information and Control, 8:338-353, 1965. [90] L.A. Zadeh. Fuzzy algorithms. Information and Control, 12:94-102, 1968. [91] K. Tanaka translated by T. Niimura. An introduction to fuzzy logic for practical applications. Springer, 1996. [92] M. Sameti and R.K. Ward. A fuzzy segmentation algorithm for mam-mogram partitioning. In K. Doi, M.L. Giger, R.M. Nishikawa, and R.A. Schmidt, editors, Digital Mammography'96, pages 471-474. Elsevier Sci-ence B.V., Amesterdam, 1996. [93] D. Dumitrescu, B. Lazzerini, and L.C. Jain. Fuzzy Sets and their Appli-. cation to Clustering and Training. CRC Press, 2000. [94] J.C. Bezdek. Pattern recognition with fuzzy objective function algorithms. Plenum Press, New York, 1981. [95] E. Kreyszig. Advanced Engineering Mathematics. John Wiley, eighth edition, 1998. [96] A. Papoulis. Probability, random variables, and stochastic processes. McGraw-Hill, third edition, 1991. [97] H.-J. Zimmermann. Fuzzy set theory-and its applications. Kluwer Aca-demic Publishers, Boston, second revised edition, 1991. [98] S. Becker and G.E. Hinton. Spatial coherence as an internal teacher for a neural network. In Y. Chauvin and D. Rumelhart, editors, Backpropaga-tion: Theory, Architectures and Applications, Developments in Connec-tionist Theory, pages 313-349. Lawrence Erlbaum Associates, Hillsdale, N.J.,1995. 138 [99] J.C. Bezdek. A review of probabilistic, fuzzy, and neural models for pat-tern recognition. In C H . Chen, editor, Fuzzy Logic and Neural Network Handbook, pages 2.1-2.33. McGraw-Hill, Inc., 1996. [100] D.E. Rumelhart, G.E. Hinton, and R.J. Williams. Learning internal rep-resentations by error propagation. In D.E. Rumelhart, J.L. McClelland, and the PDP Research Group, editors, Parallel Distributed Processing: Explorations in the Microstructure of Cognition, volume 1, pages 318-364. MIT Press, Cambridge, Mass, 1986. [101] A.J. Bell and T.J. Sejnowski. An information maximisation approach to blind separation and blind deconvolution. Neural Computation, 7(6):1129-1159, 1995. 139 

Cite

Citation Scheme:

        

Citations by CSL (citeproc-js)

Usage Statistics

Share

Embed

Customize your widget with the following options, then copy and paste the code below into the HTML of your page to embed this item in your website.
                        
                            <div id="ubcOpenCollectionsWidgetDisplay">
                            <script id="ubcOpenCollectionsWidget"
                            src="{[{embed.src}]}"
                            data-item="{[{embed.item}]}"
                            data-collection="{[{embed.collection}]}"
                            data-metadata="{[{embed.showMetadata}]}"
                            data-width="{[{embed.width}]}"
                            async >
                            </script>
                            </div>
                        
                    
IIIF logo Our image viewer uses the IIIF 2.0 standard. To load this item in other compatible viewers, use this url:
http://iiif.library.ubc.ca/presentation/dsp.831.1-0065275/manifest

Comment

Related Items