@prefix vivo: . @prefix edm: . @prefix ns0: . @prefix dcterms: . @prefix dc: . @prefix skos: . vivo:departmentOrSchool "Science, Faculty of"@en, "Statistics, Department of"@en ; edm:dataProvider "DSpace"@en ; ns0:degreeCampus "UBCV"@en ; dcterms:creator "Yan, Guohua"@en ; dcterms:issued "2008-06-27T14:32:58Z"@en, "2008"@en ; vivo:relatedDegree "Doctor of Philosophy - PhD"@en ; ns0:degreeGrantor "University of British Columbia"@en ; dcterms:description """Single nucleotide polymorphisms (SNPs) have been increasingly popular for a wide range of genetic studies. A high-throughput genotyping technologies usually involves a statistical genotype calling algorithm. Most calling algorithms in the literature, using methods such as k-means and mixturemodels, rely on elliptical structures of the genotyping data; they may fail when the minor allele homozygous cluster is small or absent, or when the data have extreme tails or linear patterns. We propose an automatic genotype calling algorithm by further developing a linear grouping algorithm (Van Aelst et al., 2006). The proposed algorithm clusters unnormalized data points around lines as against around centroids. In addition, we associate a quality value, silhouette width, with each DNA sample and a whole plate as well. This algorithm shows promise for genotyping data generated from TaqMan technology (Applied Biosystems). A key feature of the proposed algorithm is that it applies to unnormalized fluorescent signals when the TaqMan SNP assay is used. The algorithm could also be potentially adapted to other fluorescence-based SNP genotyping technologies such as Invader Assay. Motivated by the SNP genotyping problem, we propose a partial likelihood approach to linear clustering which explores potential linear clusters in a data set. Instead of fully modelling the data, we assume only the signed orthogonal distance from each data point to a hyperplane is normally distributed. Its relationships with several existing clustering methods are discussed. Some existing methods to determine the number of components in a data set are adapted to this linear clustering setting. Several simulated and real data sets are analyzed for comparison and illustration purpose. We also investigate some asymptotic properties of the partial likelihood approach. A Bayesian version of this methodology is helpful if some clusters are sparse but there is strong prior information about their approximate locations or properties. We propose a Bayesian hierarchical approach which is particularly appropriate for identifying sparse linear clusters. We show that the sparse cluster in SNP genotyping datasets can be successfully identified after a careful specification of the prior distributions."""@en ; edm:aggregatedCHO "https://circle.library.ubc.ca/rest/handle/2429/958?expand=metadata"@en ; dcterms:extent "3276809 bytes"@en ; dc:format "application/pdf"@en ; skos:note "Linear Clustering with Application to Single Nucleotide Polymorphism Genotyping by Guohua Yan B.Sc., Liaocheng University, 1992 M.Sc., Beijing Normal University, 1995 M.Sc., The University of Windsor, 2003 A THESIS SUBMITTED IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF Doctor of Philosophy in The Faculty of Graduate Studies (Statistics) The University of British Columbia (Vancouver) June, 2008 c© Guohua Yan 2008 Abstract Single nucleotide polymorphisms (SNPs) have been increasingly popular for a wide range of genetic studies. 
A high-throughput genotyping technolo- gies usually involves a statistical genotype calling algorithm. Most calling algorithms in the literature, using methods such as k-means and mixture- models, rely on elliptical structures of the genotyping data; they may fail when the minor allele homozygous cluster is small or absent, or when the data have extreme tails or linear patterns. We propose an automatic genotype calling algorithm by further devel- oping a linear grouping algorithm (Van Aelst et al., 2006). The proposed algorithm clusters unnormalized data points around lines as against around centroids. In addition, we associate a quality value, silhouette width, with each DNA sample and a whole plate as well. This algorithm shows promise for genotyping data generated from TaqMan technology (Applied Biosys- tems). A key feature of the proposed algorithm is that it applies to un- normalized fluorescent signals when the TaqMan SNP assay is used. The algorithm could also be potentially adapted to other fluorescence-based SNP genotyping technologies such as Invader Assay. Motivated by the SNP genotyping problem, we propose a partial likeli- hood approach to linear clustering which explores potential linear clusters in a data set. Instead of fully modelling the data, we assume only the signed orthogonal distance from each data point to a hyperplane is normally dis- tributed. Its relationships with several existing clustering methods are dis- cussed. Some existing methods to determine the number of components in a data set are adapted to this linear clustering setting. Several simulated and real data sets are analyzed for comparison and illustration purpose. We also investigate some asymptotic properties of the partial likelihood approach. A Bayesian version of this methodology is helpful if some clusters are sparse but there is strong prior information about their approximate loca- tions or properties. We propose a Bayesian hierarchical approach which is particularly appropriate for identifying sparse linear clusters. We show that the sparse cluster in SNP genotyping datasets can be successfully identified after a careful specification of the prior distributions. ii Table of Contents Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii Table of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . x Dedication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi Statement of Co-Authorship . . . . . . . . . . . . . . . . . . . . . xii 1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.1 SNP genotyping . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.1.1 SNPs and their applications . . . . . . . . . . . . . . 1 1.1.2 TaqMan SNP genotyping technology . . . . . . . . . 2 1.2 Review of several clustering algorithms . . . . . . . . . . . . 3 1.2.1 MCLUST: normal mixture model-based clustering . . 3 1.2.2 MIXREG: mixture of linear regressions . . . . . . . . 5 1.2.3 LGA: linear grouping algorithm . . . . . . . . . . . . 5 1.3 Outline of the thesis . . . . . . . . . . . . . . . . . . . . . . . 6 Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 2 Automatic SNP genotype calling . . . . . . . . . . . . . . . . 10 2.1 Background . . . . . . . . . . . . . . . . . . . . . . 
. . . . . 10 2.1.1 TaqMan SNP genotyping . . . . . . . . . . . . . . . . 10 2.1.2 Review of some genotype calling algorithms . . . . . 12 2.1.3 ROX-normalized versus unnormalized data . . . . . . 13 2.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 2.3 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 iii Table of Contents 2.4 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 2.4.1 Data preprocessing . . . . . . . . . . . . . . . . . . . 18 2.4.2 Fitting lines using LGA . . . . . . . . . . . . . . . . . 18 2.4.3 Genotype assignment . . . . . . . . . . . . . . . . . . 19 2.4.4 Quality assessment of DNA samples and plates . . . . 20 2.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 3 A partial likelihood approach to linear clustering . . . . . 24 3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 3.2 Partial likelihood-based objective function for linear cluster- ing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 3.2.1 Partial likelihood for orthogonal regression . . . . . . 25 3.2.2 Partial likelihood for linear clustering . . . . . . . . . 26 3.3 The EM algorithm . . . . . . . . . . . . . . . . . . . . . . . . 28 3.4 Asymptotic properties . . . . . . . . . . . . . . . . . . . . . . 30 3.5 Relationships with other clustering methods . . . . . . . . . 32 3.5.1 With LGA . . . . . . . . . . . . . . . . . . . . . . . . 32 3.5.2 With normal mixture models . . . . . . . . . . . . . . 33 3.5.3 With mixture of ordinary regressions . . . . . . . . . 33 3.6 Choosing the number of linear clusters . . . . . . . . . . . . 33 3.6.1 Bootstrapping the partial likelihood ratio . . . . . . . 33 3.6.2 Information criteria . . . . . . . . . . . . . . . . . . . 34 3.7 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 3.7.1 Simulated data I . . . . . . . . . . . . . . . . . . . . . 35 3.7.2 Simulated data II . . . . . . . . . . . . . . . . . . . . 37 3.7.3 Australia rock crab data . . . . . . . . . . . . . . . . 37 3.7.4 Taqman single nucleotide polymorphism genotyping data . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42 3.8 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 4 Bayesian linear clustering . . . . . . . . . . . . . . . . . . . . . 51 4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 51 4.2 Model specification . . . . . . . . . . . . . . . . . . . . . . . 55 4.2.1 Model for one cluster: orthogonal regression . . . . . 55 4.2.2 Model for linear clustering . . . . . . . . . . . . . . . 55 4.3 Sampling algorithms . . . . . . . . . . . . . . . . . . . . . . . 57 iv Table of Contents 4.3.1 Gibbs sampling . . . . . . . . . . . . . . . . . . . . . 57 4.3.2 Metropolis-Hastings sampling . . . . . . . . . . . . . 60 4.4 Australia rock crab data . . . . . . . . . . . . . . . . . . . . 62 4.5 Taqman single nucleotide polymorphism genotyping data . . 64 4.6 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76 Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78 5 Consistency and asymptotic normality . . . . . . . . . . . . 80 5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 80 5.2 Population version of the objective function . . . . . . . . . . 82 5.3 Consistency . . . . . . . . . . . . . . . . . 
. . . . . . . . . . 85 5.4 Asymptotical normality . . . . . . . . . . . . . . . . . . . . . 94 Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96 6 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97 6.1 Robustness consideration . . . . . . . . . . . . . . . . . . . . 97 6.2 Asymptotics . . . . . . . . . . . . . . . . . . . . . . . . . . . 97 6.3 Model Extension . . . . . . . . . . . . . . . . . . . . . . . . . 98 6.4 Variable/model selection . . . . . . . . . . . . . . . . . . . . 98 6.5 Bayesian computation . . . . . . . . . . . . . . . . . . . . . . 98 Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99 Appendix A Glossary of some genetic terms . . . . . . . . . . . . . . . . . 100 v List of Tables 3.1 Criteria for choosing the numbers of clusters, K, when MLC is applied to simulated data I. “Boot p-value” is the p-value using the bootstrapping partial likelihood ratio test for H0 : K = K0 vs H1 : K = K0+1 where 99 bootstrapping samples are used . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 3.2 Misclassification matrices of simulated data II from MLC, MCLUST, LGA and MIXREG. . . . . . . . . . . . . . . . . . 38 3.3 Criteria for choosing the number of clusters, K, when MLC is applied to the blue crab data. “Boot p-value” is the p- value using the bootstrapping partial likelihood ratio test for H0 : K = K0 vs H1 : K = K0 + 1 where 99 bootstrapping samples are used. . . . . . . . . . . . . . . . . . . . . . . . . . 39 3.4 Misclassification matrices of the blue crab data (using log scales) from MLC, MCLUST, LGA and MIXREG. . . . . . . 40 3.5 Largest membership probabilities of the 7 points labelled in Figure 3.5 by MCLUST and MLC. . . . . . . . . . . . . . . . 43 vi List of Figures 2.1 Scatterplots of ROX-normalized data for (a): a good plate; (b) & (c): messy plates; (d): a plate with only a few points in the YY cluster; (e): a plate with no points in the YY cluster; (f): a plate with only one cluster. . . . . . . . . . . . . . . . . 11 2.2 Scatterplots of unnormalized and ROX-normalized data for a plate. Both points and letters represent samples. . . . . . . . 14 2.3 Probability of calling a genotype versus ROX . . . . . . . . . 15 2.4 Silhouette width versus ROX value. Left panel: k-means re- sults using ROX-normalized data; right panel: LGA result using unnormalized data. . . . . . . . . . . . . . . . . . . . . 16 2.5 Control points in scatterplots of ROX-normalized and unnor- malized data for a plate. Letter “P” represents a positive control point and ”N” a negative control point. . . . . . . . 16 2.6 Left panel: the calling result of SDS software displayed in the scatterplot of unnormalized data. Right panel: the scor- ing results of LGA using unnormalized data. The symbol ◦ represents noncalls. . . . . . . . . . . . . . . . . . . . . . . . 17 3.1 Comparison of clustering results of simulated data I. Upper- left panel displays the true classification; upper-right, lower- left and lower-right are that of MLC, LGA and MCLUST. . . 36 3.2 Pairwise scatterplots of simulated data II. . . . . . . . . . . . 38 3.3 Pairwise scatterplots of the blue crab data. . . . . . . . . . . 40 3.4 Clustering results of the blue crab data (using log scales) from MLC, MCLUST, LGA and MIXREG. CL is used as the re- sponse in MIXREG. . . . . . . . . . . . . . . . . . . . . . . . 41 3.5 Clustering results from four methods applied to a Taqman data set plate. 
◦, + and 4 denote the labels assigned for the three linear clusters (genotypes) by each method, × denotes points assigned as negative controls and ¦ denotes points as- signed to the background cluster. . . . . . . . . . . . . . . . . 44 vii List of Figures 4.1 Scatterplot of one plate in which the variant allele homozy- gous cluster has only five points (upper-left). . . . . . . . . . 53 4.2 Clustering results of several clustering algorithms. For k- means and MCLUST, the number of clusters are set to 4; for the remaining algorithms, the number of lines are set to 3. . . 54 4.3 Pairwise scatter plots of the blue crab data. . . . . . . . . . . 63 4.4 Evolution of sampled values for θ for 11, 000 iterations using Algorithm 1 for blue crab data. . . . . . . . . . . . . . . . . . 65 4.5 Evolution of log unnormalized posterior density of sampled values for (θ, z1, . . ., zn) for 11, 000 iterations using Algorithm 1 for blue crab data. . . . . . . . . . . . . . . . . . . . . . . . 66 4.6 Autocorrelations of sampled values for θ for 11, 000 iterations using Algorithm 1 for blue crab data. . . . . . . . . . . . . . . 67 4.7 Density curves of sampled values for θ for 10, 000 iterations using Algorithm 1, after a burn-in of 1000 iterations for blue crab data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68 4.8 Unnormalized marginal posterior density values of θ along a line segment (1 − t)θ̂1 + θ̂2 connecting the two estimated modes θ̂1 and θ̂2. One mode is estimated by maximum a posteriori estimation from 10, 000 iterations using Algorithm 1, after a burn-in of 1000 iterations and the other is obtained by permutation. The blue crab data is used. . . . . . . . . . . 69 4.9 Estimated posterior probability of being classified into Class 1 of the 100 crabs from 10, 000 iterations using Algorithm 1, after a burn-in of 1000 iterations. The solid vertical line separates these crabs by their true classes and the horizontal dashed line corresponds to a probability of 0.5. . . . . . . . . 69 4.10 Evolution of log unnormalized posterior density of sampled values for η for 11, 000 iterations using Algorithm 2 for the SNP genotyping data. . . . . . . . . . . . . . . . . . . . . . . 72 4.11 Evolution of sampled values for θ for 11, 000 iterations using Algorithm 2 for the SNP genotyping data. The first three panels are for slopes. . . . . . . . . . . . . . . . . . . . . . . . 73 4.12 Autocorrelations of sampled values for θ for 11, 000 iterations using Algorithm 2 for the SNP genotyping data. . . . . . . . 74 4.13 Density curves of sampled values for θ for 11, 000 iterations using Algorithm 2 for the SNP genotyping data. . . . . . . . 75 4.14 Clustering results of the SNP genotyping data in which points are classified into the cluster with the largest posterior mem- bership probability. . . . . . . . . . . . . . . . . . . . . . . . . 76 viii List of Figures 4.15 Clustering results of the Bayesian approach for a plate with- out a sparse cluster. . . . . . . . . . . . . . . . . . . . . . . . 77 ix Acknowledgements Foremost I would like to thank my supervisors Dr. William J. Welch and Dr. Ruben H. Zamar, for their excellent guidance, continuous encouragement, great patience and generous support. I am indebted to them for many things I have learned which lay the foundation of my future career. I am very grateful to other members of my supervisory committee, Dr. Paul Gustafson and Dr. Arnaud Doucet for their valuable comments and suggestions. I am also very grateful to Dr. Harry Joe and Dr. 
Lang Wu for their great help and valuable advice both in statistics and in my job application process. Also I would like to thank them for teaching me playing table tennis although they must have been frustrated that I am a hopeless learner. Many thanks to Dr. Mat́ıas Salibián-Barrera, Dr. Raphael Gottardo, Dr. Nancy Heckman and Dr. Jiahua Chen for their help and valuable sugges- tions. I also thank Dr. Jason Loeppky for his help on my proposal. Special thanks to Ms. Loubna Akhabir and Ms. Treena McDonald for providing me the data and explaining to me the genomic background. I would also like to thank Ms. Elaine Salameh, Ms. Christine Graham, Ms. Rhoda Morgan, Ms. Peggy Ng and Ms. Viena Tran for their help with administrative matters. Furthermore, I thank my fellow students for their help throughout my program. I choose not to list their names here but their help and friendship will be remembered forever. Finally, I would like to acknowledge the support of the University of British Columbia through a University Graduate Fellowship. x To my parents, my wife and my daughter xi Statement of Co-Authorship This thesis is completed under the supervision of Dr. William J. Welch and Dr. Ruben H. Zamar. Chapter 2 is co-authored with Dr. William J. Welch, Dr. Ruben H. Zamar, Ms. Loubna Akhabir and Ms. Treena McDonald. I conducted the data analysis and prepared a draft of the manuscript. Chapter 3 is co-authored with Dr. William J. Welch and Dr. Ruben H. Zamar. My main contribution is the derivation of the partial likelihood. I conducted the data analysis and prepared a draft of the manuscript. Chapter 4 is co-authored with Dr. William J. Welch and Dr. Ruben H. Zamar. My main contributions are the formulation of the model and the specification of the prior distribution. I conducted the data analysis and prepared the manuscript. Chapter 5 is co-authored with Dr. William J. Welch and Dr. Ruben H. Zamar. I derived the asymptotic properties and prepared the manuscript. xii Chapter 1 Introduction This thesis work investigates methods to detect linearly shaped clusters in a dataset. It is motivated by a clustering problem in SNP (single nucleotide polymorphism) genotyping and much effort of this thesis has been devoted to automatic genotype calling algorithms in TaqMan SNP genotyping tech- nology. In this chapter, we introduce the SNP genotyping problem, review several clustering algorithms used in the rest of the thesis and give an outline of the thesis. 1.1 SNP genotyping 1.1.1 SNPs and their applications A single nucleotide polymorphism (SNP, pronounced as “snip”) is a single- base variation in a genome. The genetic code is specified by the four nu- cleotide “letters”: A (adenine), C (cytosine), T (thymine) and G (guanine). There are two complimentary DNA strands. It is sufficient to consider only one. SNP variation occurs when a single nucleotide, such as an A, is re- placed by one of the other three letters C, G or T at a particular base of the target strand. An example of a SNP is the alteration of the DNA segment AAGGTTA to ATGGTTA, where the second A in the first snippet is re- placed with a T (www.ncbi.nlm.nih.gov). A SNP variation usually involves only two nucleotides; the two possible nucleotides are called two SNP alleles. Throughout we shall use “X” to generically represent the common wild-type allele and “Y” the variant allele. The DNA sequences of any two individuals are mostly identical and SNPs are found about every 250 − 350 base pairs in the human genome (Niu et al., 2002). 
Approximately 3 to 5 percent of a person’s DNA sequence codes for the production of proteins, most SNPs are found outside of these “coding sequences”. SNPs found within a coding sequence, cSNPs, are of particular interest to researchers because they are more likely to alter the biological function of a protein (www.ncbi.nlm.nih.gov). Moreover, the abundance of SNPs makes them useful markers for genetic association studies that work 1 Chapter 1. Introduction to localize the genes involved in complex diseases or adverse drug reactions. The popularity of SNPs is also due to their usual biallelic property which makes them amenable to automated genotyping. (Ranade et al., 2001). 1.1.2 TaqMan SNP genotyping technology Determination of the alleles at a single nucleotide polymorphism site is called genotyping. The nature of large-scale association studies requires rapid, re- liable and cost-effective SNP genotyping. Various SNP genotyping technolo- gies have been developed. The choice of a technology depends on whether a few SNPs are to be typed in many individuals or many different SNPs are to be examined in a few individuals (Callegaro et al., 2006). The TaqMan SNP Genotyping Assay (Applied Biosystems) is a widely used fluorescence-based high-throughput genotyping technology suitable for the former case. In this method, the region flanking the SNP site of interest is amplified in the presence of two probes each specific for one or the other allele. Probes haves a fluor, called “reporter” at one end but do not fluoresce when free in solution because they have a “quencher” at the other end that absorbs fluorescence from the reporter. During the amplification in which many copies of the same sequence are produced, the probe specifically base- paired with the target is unwound, its reporter liberated from the quencher and the fluorescence is increased. The presence of two probes, each labelled with a different fluor, allows the detection of both alleles in a single tube (De La Vega et al., 2005; Ranade et al., 2001). In the TaqMan SNP genotyping technology, DNA samples of many in- dividuals are arranged in a 96- or 384-well plate and they are amplified simultaneously. For each individual, two quantitative fluorescent signals are measured at the end of amplification for the two alternative SNP alleles, indicating their presence or absence. The pair of signals for an individual forms a point in a scatterplot. Ideally there are four distinct clusters in a scatterplot. The NTC (no template control) cluster lies in the lower-left cor- ner, close to the origin, containing negative control cases (a few cases that do not have DNA samples) and samples that fail to amplify. In the lower-right, upper-left, and upper-right corners are three clusters presumably containing samples of wild-type homozygotes, variant homozygotes, and heterozygotes, respectively. Genotype calls for individual samples are made by a clustering algorithm in the propriety Sequence Detection Software (Applied Biosystems). How- ever, considerable manual intervention of an expert operator is required to assess the data quality, to set fluorescent signal thresholds and to decide the 2 Chapter 1. Introduction genotypes, especially when the variant allele Y is rare. The accuracy of SNP genotype calls is critical to later studies. 
Even a small amount of genotyping error can have serious consequences for haplotype analysis, linkage analysis, genetic distance estimation and background linkage disequilibrium estimation (Kang et al., 2004). For a brief review of clustering algorithms in fluorescence-based SNP genotyping, see Chapter 2. One aim of this thesis is to develop a reliable, automated SNP genotype calling algorithm for the TaqMan SNP genotyping technology that requires minimal manual intervention and works well even when the variant allele homozygous cluster YY is sparse. We conclude this subsection by noting that any genotype calling algorithm proposed for the TaqMan SNP genotyping technology should be applicable to other fluorescence-based genotyping methods, such as the Invader Assay (Mein et al., 2000). In the next section, we review several competing clustering algorithms used in the rest of the thesis.

1.2 Review of several clustering algorithms

1.2.1 MCLUST: normal mixture model-based clustering

Finite mixture models have long been used for clustering problems; see, for example, Banfield and Raftery (1993), Fraley and Raftery (1998), Fraley and Raftery (2002) and MacLachlan and Peel (2000). In this approach, the data are assumed to arise from a mixture of probability distributions, and each component of the mixture is regarded as a cluster. This model-based method has several advantages over heuristic approaches. First, it adapts to the data through a variety of parameter restrictions coupled with a model selection mechanism, for example in the packages MCLUST (Fraley and Raftery, 2006) and EMMIX (McLachlan et al., 1999). Second, several early heuristic clustering criteria, including the widely used k-means method, were later formulated within model frameworks; a statistical model helps clarify for which data sets a particular clustering algorithm is likely to work well. Third, within a model framework many statistical procedures are readily applicable. For example, maximum likelihood is a natural optimization criterion; the estimated probability that a data point belongs to each component is an appealing measure of uncertainty for the resulting classification; a component in a mixture model, which has a clear meaning in the assumed model, can be interpreted as a cluster; and the choice of clustering method and number of clusters becomes a model selection problem.

MCLUST is a contributed R package for normal mixture modeling and model-based clustering. It provides functions for normal mixture model-based clustering, classification likelihood-based hierarchical clustering, normal mixture model-based density estimation, discriminant analysis and so on. In this subsection, we review only the normal mixture model-based clustering that is relevant to this thesis. Given observations $\mathbf{x}_1, \ldots, \mathbf{x}_n$, MCLUST assumes a normal mixture model

$$p(\mathbf{x}_i \mid \theta) = \sum_{k=1}^{K} p_k\, \phi(\mathbf{x}_i \mid \boldsymbol{\mu}_k, \Sigma_k), \qquad i = 1, \ldots, n,$$

where $K$ is the number of components, $p_k$ is a mixing proportion with $p_k > 0$ and $\sum_{k=1}^{K} p_k = 1$, $\phi(\cdot \mid \boldsymbol{\mu}_k, \Sigma_k)$ is the multivariate normal density with mean vector $\boldsymbol{\mu}_k$ and covariance matrix $\Sigma_k$, and $\theta$ is the collection of all parameters.
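To make the mixture formulation concrete, the following is a minimal sketch in Python/NumPy of how the mixture density and the corresponding posterior membership probabilities ("responsibilities") could be evaluated for given parameters. It is an illustration only, not the MCLUST implementation (an R package that also fits the parameters by EM); the function names and the toy parameters at the end are hypothetical.

```python
import numpy as np

def mvn_density(x, mu, cov):
    """Multivariate normal density phi(x | mu, Sigma)."""
    d = len(mu)
    diff = x - mu
    norm_const = np.sqrt((2.0 * np.pi) ** d * np.linalg.det(cov))
    return float(np.exp(-0.5 * diff @ np.linalg.inv(cov) @ diff) / norm_const)

def mixture_responsibilities(X, weights, means, covs):
    """Return p(x_i | theta) for each row of X and the posterior probability
    that x_i belongs to component k, i.e. p_k*phi_k / sum_j p_j*phi_j."""
    densities = np.array([[p * mvn_density(x, mu, cov)
                           for p, mu, cov in zip(weights, means, covs)]
                          for x in X])              # shape (n, K)
    mixture = densities.sum(axis=1)                 # p(x_i | theta)
    resp = densities / mixture[:, None]             # membership probabilities
    return mixture, resp

# Hypothetical two-component example in two dimensions.
weights = [0.6, 0.4]
means = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
covs = [np.eye(2), np.eye(2)]
X = np.array([[0.1, -0.2], [2.8, 3.1]])
mix, resp = mixture_responsibilities(X, weights, means, covs)
```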
Banfield and Raftery (1993) proposed a general framework for geometric cross-cluster constraints by parameterizing the covariance matrices through the eigenvalue decomposition

$$\Sigma_k = \lambda_k D_k A_k D_k^{T},$$

where $D_k$ is the orthogonal matrix of eigenvectors, $A_k$ is a diagonal matrix whose elements are proportional to the eigenvalues, and $\lambda_k$ is an associated constant of proportionality. The orientation of the principal components of $\Sigma_k$ is determined by $D_k$, while $A_k$ determines the shape of the density contours; $\lambda_k$ specifies the volume of the corresponding ellipsoid. These characteristics (orientation, shape and volume) are usually estimated from the data and can be allowed to vary between clusters or constrained to be the same for all clusters. This parameterization is very flexible; it includes, but is not restricted to, earlier proposals such as equal-volume spherical variance ($\Sigma_k = \lambda I$), which has a close relationship with the k-means method, constant variance ($\Sigma_k = \Sigma$) and the most general unconstrained variance. The EM algorithm is used for maximum likelihood estimation; details are omitted. All models corresponding to the above parameterization possibilities are usually tried. Selection of the number $K$ of components/clusters, as well as of the model, is through the Bayesian Information Criterion (BIC), which relates to the Bayes factor. The BIC has the form

$$\mathrm{BIC} = 2l - \log(n)\, v,$$

where $l$ is the log-likelihood of the model and $v$ is the number of independent parameters in the model. For noisy data, MCLUST adds a first-order Poisson process to the normal mixture model,

$$p(\mathbf{x}_i \mid \theta) = \frac{p_0}{V} + \sum_{k=1}^{K} p_k\, \phi(\mathbf{x}_i \mid \boldsymbol{\mu}_k, \Sigma_k),$$

where $V$ is the hyper-volume of the data region, $p_0 > 0$ and $\sum_{k=0}^{K} p_k = 1$.

1.2.2 MIXREG: mixture of linear regressions

Mixtures of linear regressions may be regarded as a special case of mixture models. The response of interest is assumed univariate. Let $(y_1, \mathbf{x}_1), \ldots, (y_n, \mathbf{x}_n)$ be the observations. The mixture of regressions model is

$$p(y_i \mid \theta, \mathbf{x}_i) = \sum_{k=1}^{K} p_k\, \phi(y_i \mid \mathbf{x}_i^{T}\boldsymbol{\beta}_k, \sigma_k^2), \qquad i = 1, \ldots, n,$$

where $K$ is the number of components, $p_k$ is a mixing proportion with $p_k > 0$ and $\sum_{k=1}^{K} p_k = 1$, $\boldsymbol{\beta}_k$ is a vector of regression coefficients, and $\phi(\cdot \mid \mathbf{x}_i^{T}\boldsymbol{\beta}_k, \sigma_k^2)$ is the univariate normal density with mean $\mathbf{x}_i^{T}\boldsymbol{\beta}_k$ and variance $\sigma_k^2$. For ease of notation, we assume that an intercept is already included in $\boldsymbol{\beta}_k$ if necessary. The EM algorithm can be used for maximum likelihood estimation. MIXREG is also a contributed R package (Turner, 2006); the computational details are in Turner (2000). The package allows for the selection of equal variances ($\sigma_k^2 = \sigma^2$) or unequal variances. Model selection can be performed using BIC outside the package MIXREG.

1.2.3 LGA: linear grouping algorithm

Van Aelst et al. (2006) proposed a linear grouping algorithm to detect linear patterns in a dataset. It combines ideas from k-means, orthogonal regression and resampling, and can uncover linear patterns when most traditional algorithms do not work well. LGA is the corresponding contributed R package (Harrington, 2007). Suppose that we are to uncover $k$ linear groups in a data set with $n$ points in $d$ dimensions. The linear grouping algorithm works as follows:

1. Initialization. Starting values are generated by randomly selecting $k$ mutually exclusive subsets of $d$ points ($d$-subsets). For each of these $d$-subsets, a hyperplane through the $d$ points is computed.

2. Updating group assignment.
Each point is assigned to its closest hy- perplane in the sense of orthogonal distance; each hyperplane is re- computed from the updated grouping using orthogonal regression. 3. Iterative refinement. The objective function is set to be the aggre- gated sum of the squared orthogonal distances of the data points from their closest hyperplane. Repeat step 2 until no significant improve- ment is attained. A moderate number of iterations (say, 10) is usually sufficient. 4. Resampling. Repeat steps 1 − 3 a number of times (say, 100) and select the solution with the lowest aggregated sum of squared orthog- onal distances. This solution is refined further as in step 3 until no improvement is achieved. The idea of silhouette width (Rousseeuw, 1987) is adapted to the linear grouping scenario to measure the strength of the group assignment. Denote by s1(i) and s2(i) the orthogonal distances of point i from its closest and second closest hyperplanes, respectively. The silhouette width for point i is defined as w(i) = 1− s1(i) s2(i) . The GAP statistic (Tibshirani et al., 2001) is used to determine the number of linear groups in a data set. 1.3 Outline of the thesis In Chapter 2, we propose a genotype calling algorithm by further adapting the linear grouping algorithm (Van Aelst et al., 2006). A key feature of this algorithm is that it applies to unnormalized fluorescent signals. Whereas there is no explicit statistical model in Chapter 2, a partial likelihood approach to linear clustering is proposed in Chapter 3. It borrows ideas from normal mixture models and mixtures of regressions and provides an extension to the heuristic linear grouping algorithm. The SNP genotyping problem is revisited in this model framework. The method in Chapter 3 is more flexible than that in Chapter 2, but still cannot handle sparse (homozygous variant) clusters well. In Chapter 4, we move to a Bayesian hierarchical approach to the SNP genotyping problem. With careful specification of the prior structures, the Bayesian approach is able to handle a sparse variant allele homozygous cluster. 6 Chapter 1. Introduction Chapter 5 includes the technical details of the asymptotic results for the partial likelihood approach in Chapter 3. Chapter 6 summarizes a few possible future research directions. 7 Bibliography Banfield, J. and A. Raftery (1993). Model-Based Gaussian and Non- Gaussian Clustering. Biometrics 49 (3), 803–821. Callegaro, A., R. Spinelli, L. Beltrame, S. Bicciato, L. Caristina, S. Cen- suales, G. De Bellis, and C. Battaglia (2006). Algorithm for automatic genotype calling of single nucleotide polymorphisms using the full course of TaqMan real-time data. Nucleic Acids Research 34 (7), e56. De La Vega, F., K. Lazaruk, M. Rhodes, and M. Wenz (2005). Assessment of two flexible and compatible SNP genotyping platforms: TaqMan snp genotyping assays and the SNPlex genotyping system. Mutation research. Fundamental and molecular mechanisms of mutagenesis 573 (1-2), 111–135. Fraley, C. and R. AE (2006). mclust Version 3 for R: Normal Mixture Modeling and Model-based Clustering. Technical report, Technical Report 504, University of Washington, Department of Statistics. Fraley, C. and A. Raftery (1998). How Many Clusters? Which Cluster- ing Method? Answers Via Model-Based Cluster Analysis. The Computer Journal 41 (8), 578. Fraley, C. and A. Raftery (2002). Model-Based Clustering, Discriminant Analysis, and Density Estimation. Journal of the American Statistical Association 97 (458), 611–632. Harrington, J. (2007). 
lga: Tools for linear grouping analysis (LGA). R package version 1.0-0. Kang, H., Z. Qin, T. Niu, and J. Liu (2004). Incorporating Genotyping Uncertainty in Haplotype Inference for Single-Nucleotide Polymorphisms. The American Journal of Human Genetics 74 (3), 495–510. MacLachlan, G. and D. Peel (2000). Finite mixture models. J. Wiley. 8 Bibliography McLachlan, G., D. Peel, K. Basford, and P. Adams (1999). The EMMIX software for the fitting of mixtures of normal and t-components. Journal of Statistical Software 4 (2). Mein, C., B. Barratt, M. Dunn, T. Siegmund, A. Smith, L. Esposito, S. Nutland, H. Stevens, A. Wilson, M. Phillips, et al. (2000). Evaluation of Single Nucleotide Polymorphism Typing with Invader on PCR Amplicons and Its Automation. Genome Research 10 (3), 330–343. Niu, T., Z. Qin, X. Xu, and J. Liu (2002). Bayesian Haplotype Inference for Multiple Linked Single-Nucleotide Polymorphisms. Am. J. Hum. Genet 70, 157–169. Ranade, K., M. Chang, C. Ting, D. Pei, C. Hsiao, M. Olivier, R. Pesich, J. Hebert, Y. Chen, V. Dzau, et al. (2001). High-Throughput Genotyping with Single Nucleotide Polymorphisms. Genome Research 11 (7), 1262– 1268. Rousseeuw, P. (1987). Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics 20, 53–65. Tibshirani, R., G. Walther, and T. Hastie (2001). Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society. Series B (Statistical Methodology) 63 (2), 411–423. Turner, T. (2000). Estimating the propagation rate of a viral infection of potato plants via mixtures of regressions. Journal of the Royal Statistical Society Series C(Applied Statistics) 49 (3), 371–384. Turner, T. (2006). mixreg: Functions to fit mixtures of regressions. R package version 0.0-2. Van Aelst, S., X. Wang, R. Zamar, and R. Zhu (2006). Linear grouping using orthogonal regression. Computational Statistics and Data Analy- sis 50 (5), 1287–1312. 9 Chapter 2 Automatic genotype calling of single nucleotide polymorphisms using a linear grouping algorithm 2.1 Background 2.1.1 TaqMan SNP genotyping Single nucleotide polymorphisms (SNPs) make up about 90% of all human genetic variations. They occur approximately every 100 to 300 bases along the 3.2-billion-base human genome. They have been investigated as pop- ular biological markers for a wide range of genetic studies. Such studies require reliable and cost-effective high-throughput genotyping technologies. Various SNP genotyping technologies have been developed and commercial- ized. Shi (2001) provides a review of some SNP genotyping technologies. De La Vega et al. (2005) give an elaborate assessment of two popular SNP genotyping technologies, TaqMan SNP Genotyping Assay and the SNPlex Genotyping System (Applied Biosystems). Kang et al. (2004) give a brief overview of three widely used high-throughput technologies, the TaqMan SNP Genotyping Assay, the OLA (Oligonucleotide Ligation Assay) and the MassARRAY system. Although they employ different mechanisms, the most popular high-throughput technologies share the same conceptual framework: for each DNA sample, two quantitative signal intensities are measured after amplification for two alternative SNP alleles, indicating their presence or absence; these pairs of signal intensities are used to call genotypes by an ex- A version of this chapter will be submitted for publication. Authors: Guohua Yan, William J. Welch, Ruben H. Zamar, Loubna Akhabir and Treena McDonald. 
pert manually or by a clustering algorithm. The two alleles are generically called X and Y hereafter.

[Figure 2.1 here: six scatterplots, panels (a)-(f), of Allele.Y.Rn versus Allele.X.Rn; panel (a) marks the XX, XY, YY and NTC clusters.]
Figure 2.1: Scatterplots of ROX-normalized data for (a): a good plate; (b) & (c): messy plates; (d): a plate with only a few points in the YY cluster; (e): a plate with no points in the YY cluster; (f): a plate with only one cluster.

Figure 2.1 illustrates several scenarios that are typical of genotyping data from the TaqMan SNP Genotyping Assays (Applied Biosystems) at the James Hogg iCAPTURE Centre. In this technology, the two alleles are labelled by the fluorescent dyes VIC and FAM; in addition, a third dye, ROX, which is assumed unchanged during the amplification, is used to normalize the data. The proprietary ABI PRISM 7900HT Sequence Detection System Plate Utility Software (referred to as the SDS system hereafter) actually uses VIC/ROX and FAM/ROX (referred to as ROX-normalized data hereafter) to make genotype calls. In Figure 2.1, ROX-normalized signals are shown. Ideally, there are four distinct clusters in a scatterplot, as in Figure 2.1 (a). The NTC (no template control) cluster lies in the lower-left corner, close to the origin, containing negative control cases (a few cases that do not have DNA samples) and samples that fail to amplify. In the lower-right, upper-left and upper-right corners are three clusters labelled XX, YY and XY, presumably containing samples of wild-type homozygotes, variant homozygotes, and heterozygotes, respectively. Owing to various artifacts, however, segregation can be poor, with points lying between clusters, which makes the calls less trustworthy (Figure 2.1 (b) and (c)). Furthermore, in some plates there are only a few points, or even no points, in the variant allele homozygous cluster YY, which often causes classical clustering algorithms to fail or to make genotype calls incorrectly (Figure 2.1 (d) and (e)). To our knowledge, the proprietary clustering algorithm incorporated in the SDS system cannot make genotype calls in this situation and one has to call manually. In extreme cases, there is only one cluster present (Figure 2.1 (f)), which usually indicates that something has gone wrong in the amplification procedure.

2.1.2 Review of some genotype calling algorithms

For a small project, it is possible to make genotype calls manually. In most cases, it is not hard for an expert to perform this job, and the “eyeballing” procedure usually gives reasonable results due to its sophisticated incorporation of prior information. For large-scale studies, however, manual scoring can become a daunting challenge. Furthermore, humans are likely to make errors due to fatigue or oversight when a job becomes routine, and different readers may have different views (Kang et al., 2004; van den Oord et al., 2003). van den Oord et al.
(2003) conducted a study on the error rates of manual scoring and three statistical procedures and report that these statistical procedures uniformly outperform the manual procedure in error rates. Perhaps the most widely used clustering method in the SNP genotyping literature is the classical k-means, possibly with minor modification (Akula et al., 2002; Olivier et al., 2002; Ranade et al., 2001). Olivier et al. (2002) notice that k-means algorithms often split one genotype group of the data, especially when the variant allele homozygous cluster has only a few data points or when there is no sharp distinction between one of the genotype clusters and the NTC cluster. They develop a new algorithm, the CA al- gorithm, that initially examines all data points and determines the approx- imate location of the centroid for each cluster by dividing the space into sections for each expected cluster. They report that the heuristic algorithm works better than classical k-means. Ranade et al. (2001) and Akula et al. (2002) assign a “quality score” to each sample. After the genotype assignments are made, they assume that each genotype cluster is bivariate normally distributed. For each sample, 12 Chapter 2. Automatic SNP genotype calling they calculate the probability density value of the normal distribution for its genotype cluster as the quality score that this sample is called. Akula et al. (2002) also assume equal covariance matrices across the clusters, argu- ing that this assumption gives conservative but better results than otherwise. Lovmar et al. (2005) propose using “silhouette scores” to assess the qual- ity of SNP genotype clusters, the idea of which originates from Rousseeuw (1987). They report that the measure is satisfactory and empirically the genotypes can be unequivocally assigned without manual inspection when the silhouette score is greater than 0.65. van den Oord et al. (2003) propose a mixture model approach, in which each cluster is assumed to follow a normal distribution. The number of component clusters and the initial values of the parameters are set upon inspecting the scatterplot. This is a more delicate approach than the k- means approach since it considers the elliptical structures in the assignments of genotypes and the k-means algorithm can be regarded as a special case of the mixture models approach. Kang et al. (2004) go one step further in this direction. They assume a bivariate t-mixture model, in which a t distribution with small degrees of freedom is assumed for each cluster. They argue that clusters from SNP genotyping data usually have heavier tails than the normal distribution and report that the t-mixture model is less sensitive to outlying points and has advantages over the normal mixture model or the k-means approach. Fujisawa et al. (2004) also use a model-based approach in which only angle data are used. 2.1.3 ROX-normalized versus unnormalized data The datasets illustrated in Figure 2.1 are after ROX-normalization. An important feature of our approach is that it uses data without ROX-nor- malization. We now discuss four reasons why we prefer to work with the unnormalized data. First, the motivation for ROX-normalization is presumably to form spher- ical clusters for which classical clustering algorithms such as k-means can be used. Inspecting the ROX-normalized scatterplots, there are quite a few plates for which the normalization does not produce reasonable spherical structures as expected, e.g., Figure 2.1 (c). 
Second, the incorporation of ROX intensities is aiming to correct system- atic biases in the chemistry such as plate to plate variation. However, there are scenarios where some points may be pulled away from their home clus- ters merely by this correction and fail to be called a genotype; by the same token, points which fail to amplify may be assigned a genotype incorrectly 13 Chapter 2. Automatic SNP genotype calling due to this correction. In the left panel of Figure 2.2, points a, b, c are away 0.5 1.0 1.5 2.0 2.5 1 2 3 4 Allele.X.Rn Al le le .Y .R n a b cd e ROX−normalized data 2000 4000 6000 8000 12000 50 00 10 00 0 15 00 0 20 00 0 Allele.X Al le le .Y a b c d e Unnormalized data Figure 2.2: Scatterplots of unnormalized and ROX-normalized data for a plate. Both points and letters represent samples. from the heterozygous cluster when the data are ROX-normalized, but are within a linear arranged cluster in the right panel when the unnormalized data are used. Similarly, point d is away from the variant allele homozygous cluster in the left panel, but is much closer to the variant allele homozygous cluster in the right panel. The proprietary software assigns very low quality values for these points. Intuitively, these points “should” be classified into one of the genotype clusters, or at least with higher “confidence” than their “quality” values indicate. Point e is outlying in both panels; in this case, normalization will not help any way. Third, the undisclosed, proprietary algorithm used by the SDS system for clustering, and hence calling genotypes, seems to have difficulty with samples with extreme (small or large) ROX values, as evidenced in Figure 2.3. The curve in Figure 2.3 shows that the empirical chance of calling a genotype by the proprietary algorithm decreases when the ROX value is too low or too high. Unnormalized data, like in Figure 2.2, often show well-separated clusters, but they are along lines. The linear grouping algorithm (LGA) (Van Aelst et al., 2006) can identify such clusters. When we apply LGA to the unnor- malized data, the problem of low-call rate for extreme ROX values is much alleviated. Figure 2.4 contrasts the silhouette width (a measure of calling quality) versus ROX value when the LGA algorithm is applied to unnormal- 14 Chapter 2. Automatic SNP genotype calling 0 5000 10000 15000 0. 0 0. 2 0. 4 0. 6 0. 8 1. 0 ROX Em pi ric al c ha nc e of c al lin g Figure 2.3: Probability of calling a genotype versus ROX ized data and when the k-means method is used for ROX-normalized data. For k-means, the definition of silhouette width in Rousseeuw (1987) is used; for the LGA, the definition of silhouette width in Van Aelst et al. (2006) is used (see also equation (2.1)). Last, the positions of positive control points in scatterplots of ROX- normalized data suggest the use of unnormalized data. In a TaqMan SNP genotyping assay, individual samples are usually arranged in a 96- or 384- well plate and are amplified simultaneously. For quality assessment purpose, some wells have DNA samples with known genotypes (in our example 12 in a 384-well plate), called positive controls, while some other wells have no DNA samples (typically 8 in a 384-well plate), called negative controls or no template controls (NTC). The positions of control points in one plate are shown in Figure 2.5. The positive control points are represented by “P”. Note that quite a few control points are outside of their corresponding genotype clusters in the left panel where ROX-normalized data are used. 
In the right panel, however, these points are well within their corresponding linear clusters. In conclusion, we shall use unnormalized data hereafter in our method of making genotype calls.

[Figure 2.4 here: two panels plotting silhouette width against ROX value.]
Figure 2.4: Silhouette width versus ROX value. Left panel: k-means results using ROX-normalized data; right panel: LGA result using unnormalized data.

[Figure 2.5 here: scatterplots of one plate with positive controls marked “P” and negative controls marked “N”.]
Figure 2.5: Control points in scatterplots of ROX-normalized and unnormalized data for a plate. Letter “P” represents a positive control point and “N” a negative control point.

2.2 Results

As an illustration, we apply the LGA approach to the unnormalized data in Figure 2.2. The silhouette threshold is set to 0.75, which means a point is called only if its distance from the nearest line is at most one quarter of its distance from the second nearest line. We set the signal threshold empirically. Let $m_x$ and $m_y$ be the median values of the X signals (“Allele.X”) and Y signals (“Allele.Y”), respectively. Any point with X signal less than $0.5\,m_x$ and Y signal less than $0.5\,m_y$ is not called. The results are shown in Figure 2.6. More points are called compared with the SDS software.

[Figure 2.6 here: two scatterplots of unnormalized signals, Allele.Y versus Allele.X.]
Figure 2.6: Left panel: the calling result of the SDS software displayed in the scatterplot of unnormalized data. Right panel: the scoring results of LGA using unnormalized data. The symbol ◦ represents noncalls.

2.3 Conclusions

We have proposed an automatic genotype calling algorithm by taking advantage of a linear grouping algorithm (Van Aelst et al., 2006). The proposed method uses unnormalized signals and clusters points around lines rather than around centroids. In addition, we associate a quality value, the silhouette width (Rousseeuw, 1987; Van Aelst et al., 2006), with each DNA sample and with a whole plate as well. This algorithm shows promise for genotyping data from TaqMan technology (Applied Biosystems). The algorithm is reliable. It could potentially be adapted to other fluorescence-based SNP genotyping technologies as well, such as the Invader Assay.

2.4 Methods

In the proposed calling algorithm, unnormalized data are used and each genotype cluster is represented by a straight line.

2.4.1 Data preprocessing

In a TaqMan SNP genotyping assay, the quality of a plate is monitored by making sure that all positive controls go to the correct genotype clusters and negative controls stay in the NTC cluster. Without well-to-well contamination, the negative controls should have low signals for both the VIC and FAM dyes (the horizontal and vertical coordinates in a scatterplot, respectively), since there are no samples in these wells to be amplified. It is advantageous to empirically discard negative control points and a portion (5% or 10%, say) of points with very low signals (close to the origin) prior to applying LGA to locate the lines.
Negative control points should not contribute to the lines, which determine the orientations of the genotype clusters. (Some points with low signals are assigned to lines afterwards.)

2.4.2 Fitting lines using LGA

In SNP genotyping there are at most three clusters, ignoring the NTC cluster. Two clusters are possible when there are very few or no points in the upper-left, YY cluster. We fit three lines and two lines to the data separately using LGA. LGA minimizes the aggregated sum of squared orthogonal distances of the points to their closest hyperplanes. We make some modifications specific to the genotyping setting.

Grid-based initialization. LGA usually depends on multiple starts to have a large probability of attaining the global minimum. In the genotyping setting, it is computationally affordable to try a sequence of initializations such that the global minimum is very likely attained. We set the intercepts to 0 for all lines and choose the slopes from a set of values $\{s_1, \ldots, s_m\}$, where the $\arctan(s_i)$ form an equally spaced grid from $0$ to $\pi/2$, since the slopes should always be positive in the SNP genotyping setting. Given a set of intercepts and slopes, points are assigned to their nearest lines and the lines are then recalculated; these two steps are iterated until convergence. Denote by $s_1(i)$ and $s_2(i)$ the orthogonal distances of point $i$ from its closest and second closest lines, respectively. The silhouette width for point $i$ is defined as

$$w(i) = 1 - \frac{s_1(i)}{s_2(i)}. \qquad (2.1)$$

From the multiple starts, we choose the solution that has the largest average silhouette width. (This criterion respects small clusters and penalizes two lines that are too close.) In addition, we restrict the slope of the XX cluster to be smaller than that of the XY cluster, and the slope of the XY cluster to be smaller than that of the YY cluster. Another restriction is that there should be fewer points in YY than in XX and XY. For each LGA fit, we associate the lines with the smallest, second-smallest and largest slopes with XX, XY and YY, respectively. We simply discard solutions in which any slope is negative or in which there are more points in the cluster with the largest slope than in one of the other clusters.

Number of lines. We also choose between the best two-line solution and the best three-line solution according to this average silhouette width. If the three-line solution has the larger average silhouette width, each line represents a genotype cluster; if the two-line solution is chosen, the lines correspond to XX and XY.

2.4.3 Genotype assignment

For genotype assignment, we could empirically set a signal threshold beforehand, below which a point is not called, labelled as “Undetermined”, and assigned to the “NTC” cluster. The threshold can be set based on previous genotyping practice in a laboratory. In this case, a point beyond the threshold is assigned to its closest line and its genotype is called accordingly; a silhouette width value is calculated for each point to measure the adequacy of the fitted line structures. A cutoff point for the silhouette widths is also predetermined, such that outlying points with low silhouettes are also labelled as “Undetermined”. Our code adapts LGA to deal with thresholding of points as follows. Let $m_x = \mathrm{median}(x_i)$ and $m_y = \mathrm{median}(y_i)$, where $x_i$ and $y_i$ are the X signal and Y signal of sample $i$, respectively. Points with $x_i \le 0.5\,m_x$ and $y_i \le 0.5\,m_y$ are not called.
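The assignment and thresholding rules just described are simple to express in code. Below is a minimal sketch in Python/NumPy, assuming the fitted lines from the modified LGA are already available as (intercept, slope) pairs; it illustrates equation (2.1) and the median-based threshold rather than reproducing the authors' R code, and the function names and the 0.75 silhouette cutoff (the value used in Section 2.2) are for illustration only.

```python
import numpy as np

def orthogonal_distances(x, y, lines):
    """Orthogonal distance of each point (x_i, y_i) to each line y = a + b*x."""
    cols = [np.abs(b * x - y + a) / np.sqrt(b * b + 1.0) for (a, b) in lines]
    return np.column_stack(cols)                        # shape (n, number of lines)

def call_genotypes(x, y, lines, sil_cutoff=0.75):
    """Assign each point to its nearest line and compute w(i) = 1 - s1(i)/s2(i);
    points weak in both channels or with low silhouette are left uncalled (-1)."""
    D = orthogonal_distances(x, y, lines)
    order = np.argsort(D, axis=1)
    idx = np.arange(len(x))
    s1, s2 = D[idx, order[:, 0]], D[idx, order[:, 1]]
    w = 1.0 - s1 / s2
    # Median-based signal threshold from Section 2.4.3.
    weak = (x <= 0.5 * np.median(x)) & (y <= 0.5 * np.median(y))
    calls = np.where(weak | (w < sil_cutoff), -1, order[:, 0])
    return calls, w
```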
As an alternative to an empirical signal threshold, we could penalize points closer to the origin by modifying the definition of the silhouette width. For example,

$$w(i) = 1 - \frac{s_1(i) + c}{s_2(i) + c}, \qquad (2.2)$$

where the tuning parameter $c$ is also set empirically, such that points too close to the origin are not called. Another version we tried is as follows. Let $r_i = \sqrt{x_i^2 + y_i^2}$ be the distance of point $i$ from the origin, and let $m = \min\{r_i\}$ and $M = \mathrm{median}\{r_i\}$. Let

$$c_i = \left( \frac{\min(r_i, M) - m/2}{M - m/2} \right)^{1/2}.$$

The adjusted value of the silhouette width $s_i$ is

$$s_i^{*} = c_i s_i. \qquad (2.3)$$

This downweights the silhouette score for points with weak $x_i$ and $y_i$ signals.

2.4.4 Quality assessment of DNA samples and plates

The silhouette width for a DNA sample serves as a quality value for the genotype call. In addition, for each plate, the average silhouette width can be computed; an expert may decide upon a threshold such that plates with a lower average silhouette width are regarded as unreliable. Meanwhile, the average silhouette width of the positive controls in a plate and the average silhouette width of the negative controls are also indications of the quality of the genotype calls for a specific plate.

2.5 Discussion

The existing methods assume spherical or elliptical clusters. The unnormalized signals tend to fall along lines, however, forcing normalization to try to produce clusters with elliptical shapes, yet the ROX normalization signal may be noisy. We take a quite different approach, identifying the linear clusters associated with the unnormalized signals. As seen in Figure 2.6, fewer uncalled samples result. We call only samples with at least one strong signal. The definition of “strong” will vary from plate to plate. We normalize internally, by not calling points having weak signals relative to the median signals, rather than externally with respect to ROX.

Bibliography

Akula, N., Y. Chen, K. Hennessy, T. Schulze, G. Singh, and F. McMahon (2002). Utility and accuracy of template-directed dye-terminator incorporation with fluorescence-polarization detection for genotyping single nucleotide polymorphisms. Biotechniques 32 (5), 1072–1078.

De La Vega, F., K. Lazaruk, M. Rhodes, and M. Wenz (2005). Assessment of two flexible and compatible SNP genotyping platforms: TaqMan SNP genotyping assays and the SNPlex genotyping system. Mutation Research. Fundamental and Molecular Mechanisms of Mutagenesis 573 (1-2), 111–135.

Fujisawa, H., S. Eguchi, M. Ushijima, S. Miyata, Y. Miki, T. Muto, and M. Matsuura (2004). Genotyping of single nucleotide polymorphism using model-based clustering. Bioinformatics 20, 5.

Kang, H., Z. Qin, T. Niu, and J. Liu (2004). Incorporating Genotyping Uncertainty in Haplotype Inference for Single-Nucleotide Polymorphisms. The American Journal of Human Genetics 74 (3), 495–510.

Lovmar, L., A. Ahlford, M. Jonsson, and A. Syvänen (2005). Silhouette scores for assessment of SNP genotype clusters. BMC Genomics 6, 35.

Olivier, M., L. Chuang, M. Chang, Y. Chen, D. Pei, K. Ranade, A. de Witte, J. Allen, N. Tran, D. Curb, et al. (2002). High-throughput genotyping of single nucleotide polymorphisms using new biplex invader technology. Nucleic Acids Research 30 (12), e53.

Ranade, K., M. Chang, C. Ting, D. Pei, C. Hsiao, M. Olivier, R. Pesich, J. Hebert, Y. Chen, V. Dzau, et al. (2001). High-Throughput Genotyping with Single Nucleotide Polymorphisms. Genome Research 11 (7), 1262–1268.

Rousseeuw, P. (1987).
Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics 20, 53–65. 22 Bibliography Shi, M. (2001). Enabling large-scale pharmacogenetic studies by high- throughput mutation detection and genotyping technologies. Clin Chem 47 (2), 164–172. Van Aelst, S., X. Wang, R. Zamar, and R. Zhu (2006). Linear grouping using orthogonal regression. Computational Statistics and Data Analy- sis 50 (5), 1287–1312. van den Oord, E., Y. Jiang, B. Riley, K. Kendler, and X. Chen (2003). FD-TDI SNP Scoring by Manual and Statistical Procedures: A Study of Error Rates and Types. BioTechniques 34 (3), 610–624. 23 Chapter 3 A partial likelihood approach to linear clustering 3.1 Introduction In this paper we are concerned with detecting linearly shaped clusters in a data set. This work is motivated by a clustering problem in a SNP (single nucleotide polymorphism) genotyping setting. Data of this type are bivari- ate, consisting of two signals, and conform mostly to linear clusters. More detailed discussion of these genotyping data appears in Section 3.7. Van Aelst et al. (2006) proposed a method called the Linear Grouping Algorithm (LGA) for linear clustering problems. For earlier approaches to linear clustering, see the references therein. There is a need for algorithms specialized in linear clustering. First, the usual mixture of normal or t dis- tributions (see for example, Banfield and Raftery, 1993; Fraley and Raftery, 1998, 2002; MacLachlan and Peel, 2000, among others) aims at elliptical structures in a data set; when a data set displays highly linear patterns, especially when these patterns are not well separated from each other, we need a different criterion for detecting linear patterns. Second, most ear- lier linear clustering approaches assume that a response variable is available supervising the search for linear clusters, see for example, Spath (1982), DeSarbo and Cron (1988) and Turner (2000). Van Aelst et al. (2006) have a simulated data set illustrating that the selection of a response variable is not trivial. In that data set, two linear clusters are obvious in the first two dimensions; the third dimension is merely noise. If the second variable is selected as response, the mixture of regressions method by Spath (1982) successfully recovers the linear structures; it fails when the third variable is used as the response. Chen et al. (2001) proposed a method to detect linear A version of this chapter has been submitted for publication. Authors: Guohua Yan, William J. Welch and Ruben H. Zamar. 24 Chapter 3. A partial likelihood approach to linear clustering structures one by one without assuming a response variable in the computer vision context. We propose in this paper a partial likelihood approach and compare it with existing algorithms such as LGA (Harrington, 2007), MCLUST (Fraley and Raftery, 2007) and MIXREG (Turner, 2006). Our approach borrows ideas from finite mixture model-based clustering and from mixture of regres- sions. Each linear cluster is characterized by a hyperplane; the orthogonal deviation of a data point from that hyperplane is assumed to be normally distributed but the position of the data on the hyperplane is not modelled. With this strategy, all variables are treated symmetrically, as compared with the mixture of regressions approach. The rest of the paper is organized as follows. 
In Section 3.2, we describe a partial likelihood-based objective function, leading to a model-based linear clustering algorithm, MLC. In Section 3.3, an EM algorithm to maximize the partial likelihood function is presented. In Section 3.4, we discuss the asymptotic properties of MLC. In Section 3.5, we briefly discuss its relationships with existing algorithms such as MCLUST, LGA and MIXREG. In Section 3.6, we propose several methods for determining the number of linear clusters. In Section 3.7, several real and simulated datasets are used to compare MLC with LGA, MCLUST and MIXREG. Some closing remarks are given in Section 3.8.

3.2 Partial likelihood-based objective function for linear clustering

3.2.1 Partial likelihood for orthogonal regression

First, we formulate a partial likelihood representation for orthogonal regression (see, for example, Fuller, 1987). To model a single linear cluster, assume that the vector-valued data $\mathbf{x}_1, \ldots, \mathbf{x}_n$ are independently drawn from a random mechanism represented by a random vector $\mathbf{X}$ in a $d$-dimensional space. We do not model the distribution of $\mathbf{X}$ except for its deviation from an unknown hyperplane. Let $\{\mathbf{x} : \mathbf{a}'\mathbf{x} - b = 0\}$ be a hyperplane. For identifiability purposes, we assume that $\mathbf{a}'\mathbf{a} = 1$ and that the first nonzero element of $\mathbf{a}$ is positive. We assume that the signed orthogonal deviation $\mathbf{a}'\mathbf{X} - b$ from the hyperplane $\{\mathbf{x} : \mathbf{a}'\mathbf{x} - b = 0\}$ is normally distributed, i.e.,
\[
\mathbf{a}'\mathbf{X} - b \sim N(0, \sigma^2). \tag{3.1}
\]
Since $\mathbf{a}$ and $b$ are unknown, $\mathbf{a}'\mathbf{X} - b$ is also unknown. Nevertheless, we can still estimate $\mathbf{a}$, $b$ and $\sigma^2$ through the partial likelihood function
\[
\prod_{i=1}^{n} N(\mathbf{a}'\mathbf{x}_i - b;\, 0, \sigma^2), \tag{3.2}
\]
where $N(\cdot\,; 0, \sigma^2)$ is the density function of the normal distribution with mean 0 and variance $\sigma^2$. Let $\bar{\mathbf{x}}$ and $S$ be the sample mean and sample covariance matrix of $\mathbf{x}_1, \ldots, \mathbf{x}_n$, respectively. Then the maximum partial likelihood estimate $\hat{\sigma}^2$ of $\sigma^2$ is the smallest eigenvalue of $S$; the maximum partial likelihood estimate $\hat{\mathbf{a}}$ of $\mathbf{a}$ is the (standardized) eigenvector associated with $\hat{\sigma}^2$, which is not necessarily unique; and the maximum partial likelihood estimate of $b$ is $\hat{b} = \hat{\mathbf{a}}'\bar{\mathbf{x}}$. See, for example, Fuller (1987). Note that model (3.1) and the partial likelihood (3.2) treat the components of $\mathbf{x}$ symmetrically.
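A minimal numerical sketch of this closed-form fit (our own illustration, using NumPy; names are placeholders, and the $n-1$ versus $n$ divisor of the sample covariance matrix is ignored):

```python
import numpy as np

def fit_hyperplane(X):
    """Maximum partial likelihood fit of a single hyperplane a'x = b.

    X : (n, d) data matrix.
    Returns (a, b, sigma2), where a is the unit eigenvector of the sample
    covariance matrix for its smallest eigenvalue, b = a' x_bar, and sigma2
    is that smallest eigenvalue (the variance of the orthogonal deviations).
    """
    x_bar = X.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(X, rowvar=False))
    a = eigvecs[:, 0]                      # eigenvalues come in ascending order
    nz = np.flatnonzero(a)[0]              # sign convention: first nonzero element positive
    if a[nz] < 0:
        a = -a
    return a, float(a @ x_bar), float(eigvals[0])
```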
3.2.2 Partial likelihood for linear clustering

Now we assume that the data $\mathbf{x}_1, \ldots, \mathbf{x}_n$ are independently drawn from a more complicated random mechanism, still represented by a random vector $\mathbf{X}$ in a $d$-dimensional space which we do not model fully. The data now lie around $K$ hyperplanes $\{\mathbf{x} : \mathbf{a}_k'\mathbf{x} = b_k\}$, $k = 1, \ldots, K$. Let $\mathbf{Z} = (Z_1, \ldots, Z_K)'$ be a random vector indicating these hyperplanes, where $Z_k = 1$ with probability $p_k$ for $k = 1, \ldots, K$, and let $\mathbf{p} = (p_1, \ldots, p_K)'$. We assume that, conditional on $Z_k = 1$,
\[
\mathbf{a}_k'\mathbf{X} - b_k \sim N(0, \sigma_k^2), \qquad k = 1, \ldots, K.
\]
Let $\mathbf{z}_1, \ldots, \mathbf{z}_n$ be the corresponding unobservable indicators for the data $\mathbf{x}_1, \ldots, \mathbf{x}_n$. Let $\boldsymbol{\kappa}$ be the collection of component parameters, $\boldsymbol{\kappa} = (\mathbf{a}_1', b_1, \sigma_1^2, \ldots, \mathbf{a}_K', b_K, \sigma_K^2)'$, and $\boldsymbol{\theta} = (\boldsymbol{\kappa}', \mathbf{p}')'$. When the indicators $\mathbf{z}_1, \ldots, \mathbf{z}_n$ are regarded as unknown parameters, the partial likelihood function for the parameters $(\boldsymbol{\theta}', \mathbf{z}_1', \ldots, \mathbf{z}_n')'$ is
\[
L(\boldsymbol{\theta}, \mathbf{z}_1, \ldots, \mathbf{z}_n \mid \mathbf{x}_1, \ldots, \mathbf{x}_n) = \prod_{i=1}^{n} \prod_{k=1}^{K} \left[ p_k N(\mathbf{a}_k'\mathbf{x}_i - b_k;\, 0, \sigma_k^2) \right]^{z_{ik}}. \tag{3.3}
\]
This is a so-called classification likelihood; see, for example, Scott and Symons (1971) and Banfield and Raftery (1993) in the context of mixture models for elliptical clusters. In this approach, the parameters $\boldsymbol{\theta}$ and the indicators $\mathbf{z}_1, \ldots, \mathbf{z}_n$ are chosen to maximize (3.3), and data points are classified according to these indicators.

In another view (see, for example, Fraley and Raftery, 1998, 2002), the indicators $\mathbf{z}_1, \ldots, \mathbf{z}_n$ are regarded as realizations of random vectors $\mathbf{Z}_1, \ldots, \mathbf{Z}_n$, which in turn are an independent and identically distributed sample from $\mathbf{Z}$. After integrating out the indicators $\mathbf{z}_1, \ldots, \mathbf{z}_n$, the partial likelihood function for $\boldsymbol{\theta}$ is
\[
L(\boldsymbol{\theta} \mid \mathbf{x}_1, \ldots, \mathbf{x}_n) = \prod_{i=1}^{n} \sum_{k=1}^{K} p_k N(\mathbf{a}_k'\mathbf{x}_i - b_k;\, 0, \sigma_k^2). \tag{3.4}
\]
The latter is a mixture likelihood approach. Celeux and Govaert (1993) carried out simulation studies of classification likelihood and mixture likelihood approaches in normal mixture models for elliptical clusters and reported that no likelihood method uniformly outperforms the others. It is noted that the classification approach cannot consistently estimate the parameters, due to the "all-or-nothing" classification bias (Bryant and Williamson, 1978; Ganesalingam, 1989; Marriott, 1975). In our experience with linear clustering, both approaches give practically very similar clustering results. We pursue in this paper the mixture likelihood (3.4), as many existing model selection criteria such as BIC can then be used and a more tractable asymptotic theory is available (see Section 3.4).

As in the usual normal mixture model for elliptical clusters, the partial likelihood function (3.4) is unbounded: when a cluster consists only of points lying exactly on a hyperplane, the contribution of each of these points to the partial likelihood tends to infinity as the variance tends to zero. The infinity occurs on the boundary of the parameter space. Hathaway (1985) proposed a constrained formulation of maximum likelihood estimation in the univariate normal mixture model. Specifically, the author added a constraint on the standard deviations,
\[
\min_{1 \le i \ne j \le K} (\sigma_i / \sigma_j) \ge c > 0, \tag{3.5}
\]
where $c$ is a known constant determined a priori. We incorporate the constraint (3.5) into the maximization of the partial likelihood function (3.4) (see Step 2 of the EM algorithm in Section 3.3). The solution may depend on the choice of the constant $c$; a usual strategy is to decrease $c$ gradually and monitor the resulting solutions (Hathaway, 1986). [Chen and Kalbfleisch (1996) and Ciuperca et al. (2003) proposed penalized approaches to this unboundedness problem which are also readily applicable to linear clustering, although not adopted in this paper; on the other hand, Hathaway's constraint can itself be regarded as a form of penalization of the likelihood.]

The partial likelihood function (3.4) naturally brings the clustering problem into a finite mixture model framework. Standard EM algorithms can be adapted to maximize (3.4); once the maximum partial likelihood estimate $\hat{\boldsymbol{\theta}}$ is obtained, data point $\mathbf{x}_i$ can be assigned to the component with the largest posterior probability. These probabilities are given by
\[
\hat{w}_{ik} = \frac{\hat{p}_k N(\hat{\mathbf{a}}_k'\mathbf{x}_i - \hat{b}_k;\, 0, \hat{\sigma}_k^2)}{\sum_{l=1}^{K} \hat{p}_l N(\hat{\mathbf{a}}_l'\mathbf{x}_i - \hat{b}_l;\, 0, \hat{\sigma}_l^2)}, \qquad i = 1, \ldots, n; \; k = 1, \ldots, K, \tag{3.6}
\]
and also serve as a measure of the uncertainty of classifying data point $\mathbf{x}_i$.

3.3 The EM algorithm

Now we describe an EM algorithm for maximizing the partial likelihood (3.4). It is straightforward and works well in our experience. The completed log partial likelihood is
\[
l(\boldsymbol{\theta} \mid \mathbf{x}_1, \ldots, \mathbf{x}_n, \mathbf{z}_1, \ldots, \mathbf{z}_n) = \sum_{i=1}^{n} \sum_{k=1}^{K} z_{ik} \left\{ \log(p_k) + \log N(\mathbf{a}_k'\mathbf{x}_i - b_k;\, 0, \sigma_k^2) \right\}. \tag{3.7}
\]
In the E-step, with $\boldsymbol{\theta}^{(t-1)}$ from the previous iteration, we have
\[
w_{ik}^{(t)} \equiv E\!\left(Z_{ik} \mid \boldsymbol{\theta}^{(t-1)}, \mathbf{x}_1, \ldots, \mathbf{x}_n\right) = \frac{p_k^{(t-1)} N\!\left(\mathbf{a}_k'^{(t-1)}\mathbf{x}_i - b_k^{(t-1)};\, 0, \sigma_k^{2(t-1)}\right)}{\sum_{l=1}^{K} p_l^{(t-1)} N\!\left(\mathbf{a}_l'^{(t-1)}\mathbf{x}_i - b_l^{(t-1)};\, 0, \sigma_l^{2(t-1)}\right)}. \tag{3.8}
\]
In the M-step, we have
\[
p_k^{(t)} = \frac{\sum_{i=1}^{n} w_{ik}^{(t)}}{n}. \tag{3.9}
\]
Let
\[
\bar{\mathbf{x}}_k^{(t)} = \frac{\sum_{i=1}^{n} w_{ik}^{(t)} \mathbf{x}_i}{\sum_{i=1}^{n} w_{ik}^{(t)}}
\qquad \text{and} \qquad
\Sigma_k^{(t)} = \sum_{i=1}^{n} w_{ik}^{(t)} (\mathbf{x}_i - \bar{\mathbf{x}}_k^{(t)})(\mathbf{x}_i - \bar{\mathbf{x}}_k^{(t)})'.
\]
Then
\[
\mathbf{a}_k^{(t)} \text{ is the eigenvector associated with the smallest eigenvalue of } \Sigma_k^{(t)}, \tag{3.10}
\]
\[
b_k^{(t)} = \mathbf{a}_k'^{(t)} \bar{\mathbf{x}}_k^{(t)}, \tag{3.11}
\]
and
\[
\sigma_k^{2(t)} = \frac{\sum_{i=1}^{n} w_{ik}^{(t)} (\mathbf{a}_k'^{(t)}\mathbf{x}_i - b_k^{(t)})^2}{\sum_{i=1}^{n} w_{ik}^{(t)}}. \tag{3.12}
\]
If all $\sigma_k^2$ in equation (3.4) are assumed to be equal, then the common value is estimated by
\[
\sigma^{2(t)} = \frac{\sum_{i=1}^{n} \sum_{k=1}^{K} w_{ik}^{(t)} (\mathbf{a}_k'^{(t)}\mathbf{x}_i - b_k^{(t)})^2}{n}. \tag{3.13}
\]
The MLC algorithm is described as follows.

1. Initialize with $\boldsymbol{\theta}^{(0)}$. To initialize the EM algorithm, we adopt the strategy of Van Aelst et al. (2006). We randomly select $K$ mutually exclusive subsets of $d+1$ observations from $\mathbf{x}_1, \ldots, \mathbf{x}_n$. For $k = 1, \ldots, K$, we compute the maximum partial likelihood estimates $\mathbf{a}_k^{(0)}$, $b_k^{(0)}$ and $(\sigma_k^2)^{(0)}$ from the $k$th subset of $d+1$ observations (see Subsection 3.2.1). The proportions $p_k^{(0)}$ are all set to $1/K$. The initial values are then $\boldsymbol{\theta}^{(0)} = ((\mathbf{a}_1')^{(0)}, b_1^{(0)}, (\sigma_1^2)^{(0)}, \ldots, (\mathbf{a}_K')^{(0)}, b_K^{(0)}, (\sigma_K^2)^{(0)}, p_1^{(0)}, \ldots, p_K^{(0)})'$.

2. If constraint (3.5) is not satisfied, go back to step 1; otherwise go to step 3.

3. Update $\boldsymbol{\theta}^{(0)}$ for a predefined number of iterations or until the improvement in the partial likelihood (3.7) is less than a predefined threshold. At iteration $t$,
• E-step. Update $w_{ik}^{(t)}$ by equation (3.8).
• M-step. Update $p_k^{(t)}$, $\mathbf{a}_k^{(t)}$, $b_k^{(t)}$ and $(\sigma_k^2)^{(t)}$ by equations (3.9), (3.10), (3.11) and (3.12), respectively. If $c = 1$, which corresponds to equal variances across clusters, use (3.13) in place of (3.12).

4. Repeat steps 1–3 a predefined number of times. The final cluster labels $\hat{\mathbf{z}}_1, \ldots, \hat{\mathbf{z}}_n$ and parameter estimates $\hat{\boldsymbol{\theta}}$ are given by the solution that has the largest log partial likelihood (3.4) and satisfies constraint (3.5).

For the classification likelihood (3.3), we can use a so-called CEM (classification EM) algorithm. A C-step is inserted between the E-step and the M-step of step 3. In the C-step, $\mathbf{x}_i$ is assigned to the cluster with the largest $w_{ik}^{(t)}$, i.e., the maximum $w_{ik}^{(t)}$ is replaced by the value 1 and all other $w_{ik}^{(t)}$ take the value 0. If the maximum value of $w_{ik}^{(t)}$ is not unique, one of the tying clusters is chosen at random.
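To make the E- and M-steps concrete, here is a compact sketch of the procedure in Python/NumPy. It is our own illustration rather than the authors' implementation: initialization from $K$ disjoint subsets of $d+1$ points follows step 1 above, but the multiple restarts and the handling of constraint (3.5) are simplified, and all names are placeholders.

```python
import numpy as np

def fit_weighted_hyperplane(X, w):
    """Weighted orthogonal-regression fit; implements (3.10)-(3.12)."""
    w = w / w.sum()
    x_bar = w @ X
    Xc = X - x_bar
    Sigma = Xc.T @ (Xc * w[:, None])
    eigvals, eigvecs = np.linalg.eigh(Sigma)
    a = eigvecs[:, 0]
    b = float(a @ x_bar)
    sigma2 = max(float(w @ (X @ a - b) ** 2), 1e-12)   # guard degenerate fits
    return a, b, sigma2

def normal_pdf(r, sigma2):
    return np.exp(-0.5 * r ** 2 / sigma2) / np.sqrt(2.0 * np.pi * sigma2)

def mlc_em(X, K, n_iter=200, tol=1e-8, c=0.0, seed=None):
    """One EM run for the partial likelihood (3.4); returns p, planes, w."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    subsets = rng.permutation(n)[: K * (d + 1)].reshape(K, d + 1)
    planes = [fit_weighted_hyperplane(X[s], np.ones(d + 1)) for s in subsets]
    p = np.full(K, 1.0 / K)
    loglik_old = -np.inf
    for _ in range(n_iter):
        # E-step (3.8): posterior membership probabilities.
        dens = np.column_stack(
            [p[k] * normal_pdf(X @ a - b, s2) for k, (a, b, s2) in enumerate(planes)]
        )
        w = dens / dens.sum(axis=1, keepdims=True)
        # M-step (3.9)-(3.12): proportions and per-cluster hyperplanes.
        p = w.mean(axis=0)
        planes = [fit_weighted_hyperplane(X, w[:, k]) for k in range(K)]
        sd = np.sqrt([s2 for _, _, s2 in planes])
        if c > 0.0 and sd.min() / sd.max() < c:
            break   # constraint (3.5) violated; such a run would be discarded
        loglik = float(np.log(dens.sum(axis=1)).sum())
        if loglik - loglik_old < tol:
            break
        loglik_old = loglik
    return p, planes, w
```

In the spirit of step 4, one would call `mlc_em` from many random starts and keep the run with the largest log partial likelihood; the hard-assignment CEM variant replaces `w` by its row-wise argmax indicator after the E-step.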
3.4 Asymptotic properties

Let
\[
f(\mathbf{x}; \boldsymbol{\theta}) = \sum_{k=1}^{K} p_k N(\mathbf{a}_k'\mathbf{x} - b_k;\, 0, \sigma_k^2),
\]
which is the contribution of one observation to the partial likelihood (3.4). The constrained parameter space is
\[
\Theta_c = \left\{ \boldsymbol{\theta} = (\mathbf{a}_1, b_1, \sigma_1^2, \ldots, \mathbf{a}_K, b_K, \sigma_K^2, p_1, \ldots, p_K) :
\begin{array}{l}
\mathbf{a}_k'\mathbf{a}_k = 1, \; -\infty < b_k < \infty, \; \sigma_k > 0, \; k = 1, \ldots, K, \\
\min_{i,j} \sigma_i/\sigma_j \ge c > 0, \; 0 < p_k < 1, \; \sum_{k=1}^{K} p_k = 1
\end{array}
\right\}.
\]
Assuming that the data arise from a probability measure $P$, we have the following results.

Theorem 3.1 Let $P$ be an absolutely continuous probability measure with finite second moments, and let $g(\boldsymbol{\theta}) \equiv \int \log f(\mathbf{x}; \boldsymbol{\theta}) \, dP(\mathbf{x})$. Then the supremum of $g$ over $\Theta_c$ is finite and attainable.

Since the function $f(\cdot\,; \boldsymbol{\theta})$ is not a density function, this theorem and the following ones cannot be proved simply by verifying standard regularity conditions.
The proofs find their roots in Wald (1949), Redner (1981), Hathaway (1983, 1985) and Garćıa-Escudero et al. (2007) and are available on request. Theorem 3.2 Let P be an absolutely continuous probability measure with finite second moments. Let X1, . . . ,Xn be a random sample from P . Then a global maximizer of the partial likelihood function L(θ |X1, . . ., Xn) in (3.4) over Θc exists almost surely if n ≥ Kd+ 1. 30 Chapter 3. A partial likelihood approach to linear clustering Note that points in Θc are not identifiable for f(x; · ). The function f(x; · ) remains the same if we permute the labels 1, . . ., K; pk1 and pk2 are not identifiable if (ak1 , bk1 , σ 2 k1 ) = (ak2 , bk2 , σ 2 k2 ). Thus the consistency result is in a quotient topology space. Let ∼ be an equivalent relation on Θc such that θ1 ∼ θ2 if and only if f(x;θ1) = f(x;θ2) almost surely in P . Denote by Θqc the quotient topological space consisting of all equivalent classes of ∼. For a point θ0 that maximizes g(θ) = ∫ log f(x;θ)dP (x), its equivalent class in Θqc is denoted by θ q 0. Theorem 3.3 (Consistency). Let X1, . . . ,Xn be a sample from an abso- lutely continuous probability measure P with finite second moments. Let θ̂ (n) be a global maximizer of the partial likelihood function L(θ|X1, . . ., Xn) in (3.4) over Θc. Then θ̂ (n) → θq0 almost surely in the topological space Θqc. Let v1(θ) = E {( ∂ log f(X;θ) ∂θ )( ∂ log f(X;θ) ∂θ )′} , and v2(θ) = E { ∂2 log f(X;θ) ∂θ∂θ′ } , where the expectations are taken with respect to P . Theorem 3.4 (Asymptotic normality). Let X1, . . . ,Xn be a sample from an absolutely continuous probability measure P with finite sixth moments. Let θ̂ (n) be a subsequence of global maximizers of the partial likelihood func- tion L(θ|X1, . . . ,Xn) in (3.4) over Θc, which tends to an interior point θ0. Then √ n(θ̂ (n) − θ0) L→ N(0, v(θ0)), where v(θ0) = [v2(θ0)] +v1(θ0)[v2(θ0)] + and A+ is the Moore-Penrose in- verse of A. In equation (3.6), ŵik is a function of xi and θ̂. Denote ŵik = hk(xi, θ̂), k = 1, . . . ,K. By the Delta method, we have Corollary 3.1 Let X1, . . . ,Xn be a sample from an absolutely continuous probability measure P with finite sixth moments. Let θ̂ (n) be a subsequence of global maximizers of the partial likelihood function L(θ|X1, . . . ,Xn) in 31 Chapter 3. A partial likelihood approach to linear clustering (3.4) over Θc, which tends to an interior point θ0. Let x be a data point. Then for k = 1, . . . ,K, √ n(hk(x, θ̂ (n) )− hk(x,θ0)) L→ N(0, [h(0)k (x,θ0)]′v(θ0)[h(0)k (x,θ0)]), where h (0) k (x,θ0) = ∂hk(x,θ) ∂θ ∣∣∣∣ θ = θ0 . Using this corollary, we can build approximate confidence intervals for wik by replacing θ0 with θ̂ and hence evaluate the clustering of a data point. 3.5 Relationships with other clustering methods 3.5.1 With LGA In the LGA of Van Aelst et al. (2006), the objective function is the aggre- gated sum of squared orthogonal distances of the data points to their closest hyperplanes. Using our notation, LGA minimizes d(κ, z1:n) = n∑ i=1 K∑ k=1 zik(a ′ kxi − bk)2. (3.14) In the classification likelihood (3.3), if we assume that the variances σ2k are equal across all components (or equivalently, c = 1 in (3.5)) and the mixing proportions pk are equal, the log completed partial likelihood is l(κ, z1:n) = n∑ i=1 K∑ k=1 zik log[N(a ′ kxi − bk; 0, σ2)]. (3.15) It is straightforward to check that the minimization of (3.14) and the max- imization of (3.15) are equivalent. 
Hence the iterative procedure in LGA coincides with the CEM algorithm, which is discussed in Section 3.3, for maximizing (3.15). In this sense, the proposed partial likelihood approach, or the classifica- tion partial likelihood (3.3), is an extension and a model for LGA. The pro- posed approach permits one dispersion parameter σ2k for each cluster while LGA implicitly uses one dispersion parameter for all clusters. Furthermore, in our model framework, we are able to use the membership probability wik (see (3.6)) to assess the uncertainty of clustering a data point while an ad hoc measure, silhouette width, is used in LGA. 32 Chapter 3. A partial likelihood approach to linear clustering 3.5.2 With normal mixture models In the M-step in Section 3.3, we can see that the proposed approach is closely related to normal mixture models. The difference resides in the E- step: the proposed approach weighs distances to hyperplanes while the latter compares distances to cluster centres. In the use of normal mixture models one is forced to estimate the means and also the covariance matrix of each component. Banfield and Raftery (1993) reparameterized the covariance matrices and lead to a variety of par- simonious covariance matrix structures. The unrestricted covariance matrix requires K(d + d(d + 1)/2) parameters and the most parsimonious model, equal spherical covariance matrices, requires only Kd + 1 parameters. The proposed approach can be regarded as a parsimonious simplification of an- other type which is appropriate for highly linear data sets. It requires d+1 parameters per cluster, i.e., d first order parameters for the location of a hyperplane and one second order parameter for the dispersion around the hyperplane. 3.5.3 With mixture of ordinary regressions As mentioned in the introduction, a mixture of ordinary regressions ap- proach requires a response variable for guidance. A response variable suit- able for clustering purpose may not exist. Furthermore, due to the “re- gression to the mean” phenomenon, the regression hyperplanes are different when different variables are selected as the response. The proposed ap- proach treats each variable symmetrically, and therefore is more suitable in a clustering setting. 3.6 Choosing the number of linear clusters In some problems, the number of linear clusters is obvious from subject- matter knowledge; in other problems, it has to be estimated from the data. Too many components exploit the randomness of the data and make it hard to interpret the clustering; too few clusters lose information and may be misleading. 3.6.1 Bootstrapping the partial likelihood ratio A natural approach is to apply the partial likelihood ratio test to a sequences of hypotheses K = K0 against K = K0+1, starting from K0 = 1. However, 33 Chapter 3. A partial likelihood approach to linear clustering −2 lnλ, where λ is the likelihood ratio, does not have a usual asymptotic χ2 distribution under the null hypothesis. The reason is that the regularity conditions are violated as the parameter space under the null hypothesis is on the edge of the parameter space under the alternative hypothesis when the former is embedded into the latter. Much theoretic research has been done on likelihood ratios for mixture problems (Chen and Chen, 2001a,c,b; Chen et al., 2001; Chen, 1998; Chen and Kalbfleisch, 2005; Everitt and Hand, 1981; Thode Jr et al., 1988, among others). We adapt the bootstrap approach in McLachlan (1987) to our setting. 
The log likelihood ratio statistic −2 log λ for the test of the null hypothesis of K = K1 versus the alternative K = K2 can be bootstrapped as follows. Let θ̂ be the maximum partial likelihood estimate of all parameters from the original sample under the null hypothesis. For i = 1, . . . , n, sample zi from the multinomial distribution with probabilities (p̂1, . . ., p̂K1); if zik = 1, sample orthogonal distance ei from N(0, σ̂ 2 k). The ith data point in the bootstrap sample is xi + (ei − b̂k)âk. Suppose B copies of n samples are generated. The value of −2 log λi is computed after fitting mixture models for K = K1 and K = K2, i = 1, . . . , B. The significance level of the test is obtained by comparing −2 log λ with the empirical distribution of −2 log λ1, . . . ,−2 log λB. 3.6.2 Information criteria The partial likelihood approach enables us to use an approximate Bayes fac- tor to determine the number of linear clusters. Banfield and Raftery (1993) used a heuristically derived approximation to twice the log Bayes factor, called AWE (approximate weight of evidence) to determine the number of clusters. Fraley and Raftery (2002) used BIC (Bayesian information crite- rion) as a more reliable approximation to twice the log Bayes factor, BIC(K) = 2L(K)− vK log(n), (3.16) where L(K) is the log partial likelihood of a mixture of K components and vK is the number of independent parameters to be estimated in a K component mixture. As stated in Fraley and Raftery (2002), although the regularity conditions that underly the approximation are not satisfied in finite mixture models, there are some theoretical and practical support of its use for determining the number of clusters. Keribin (2000) showed that the estimator of the number of clusters based on BIC is consistent; Fraley and Raftery (1998, 2002) included a range of applications in which estimating the number of clusters based on BIC has given good results. 34 Chapter 3. A partial likelihood approach to linear clustering A large number of criteria have been proposed in the literature, including NEC (Biernacki et al., 1999; Celeux and Soromenho, 1996), ICL (Biernacki et al., 2000), and cross validation likelihood (Smyth, 2000). The performance of some of these criteria were compared by Biernacki et al. (1999). As we are adopting a partial likelihood approach, in principle almost all methods proposed in a model framework are approximately applicable although we do not fully model the data. As pointed out by Everitt et al. (2001), “it is advisable not to depend on a single rule for selecting the number of groups, but to synthesize the results of several techniques”. In our experience, though, all the methods mentioned above work well if there do exist obvious linear patterns in a data set. 3.7 Examples 3.7.1 Simulated data I One data set of size 300 in two dimensions is simulated to illustrate the different behaviours of MCLUST, LGA and the proposed MLC. The xi1 are uniformly sampled from the interval (0, 15); for the first 250 points, xi2 = xi1 + 3²i and for the last 50 points, xi2 = 8 + 1.5xi1 + ²i, where ²i are independent and identically standard normal distributed. The data are shown in the top-left panel of Figure 3.1. For this data set, MCLUST chooses a model of three clusters with differ- ent volumes, different shapes and different orientations by the BIC criterion. LGA favours one cluster using the gap statistic. MLC selects two clusters by BIC, ICL, NEC and the bootstrapping likelihood ratio test (see Table 3.1). 
We set the threshold c = 0.05. The result of MIXREG, which is omitted, is similar to that of MLC, as we generated the data in its favour.

Table 3.1: Criteria for choosing the number of clusters, K, when MLC is applied to simulated data I. "Boot p-value" is the p-value of the bootstrapping partial likelihood ratio test for H0 : K = K0 vs H1 : K = K0 + 1, where 99 bootstrap samples are used.

  K    BIC        ICL        NEC    Boot p-value
  1    -1541.61   –          –      0
  2    -1461.63   -1477.28   0.14   1
  3    -1468.25   -1712.08   0.69   –

Figure 3.1 displays the clustering results from MCLUST, LGA and MLC when the number of clusters is set to 2 for all methods. This simulated data set helps us understand how these algorithms work. MCLUST tries to find elliptical structures, but the two linear clusters have data distributions along the lines that are not well modelled by normal distributions. For LGA, when two clusters are roughly parallel and touching, it imposes a structure on the data set and tends to partition the data set into bands of equal width.

Figure 3.1: Comparison of clustering results for simulated data I. The upper-left panel displays the true classification; the upper-right, lower-left and lower-right panels are those of MLC, LGA and MCLUST.

3.7.2 Simulated data II

Another dataset is simulated to demonstrate the difference between MIXREG and the other three algorithms, MLC, MCLUST and LGA. This dataset has 200 data points in four-dimensional space. For the first 100 data points, the first two components, $x_{i1}$ and $x_{i2}$, are linearly related while the third and fourth components, $x_{i3}$ and $x_{i4}$, are randomly scattered:
\[
x_{i1} = t_i + \epsilon_{i1}, \quad x_{i2} = 0.5\, t_i + \epsilon_{i2}, \quad x_{i3}, x_{i4} \sim N(0, 5^2), \qquad \text{for } i = 1, \ldots, 100. \tag{3.17}
\]
Here $\epsilon_{ij} \sim N(0, 1)$ and $t_i \sim N(0, 5^2)$. For the remaining 100 data points, the first two components are randomly scattered while the last two components are linearly related:
\[
x_{i1}, x_{i2} \sim N(0, 5^2), \quad x_{i3} = t_i + \epsilon_{i3}, \quad x_{i4} = 0.5\, t_i + \epsilon_{i4}, \qquad \text{for } i = 101, \ldots, 200. \tag{3.18}
\]
Again, $\epsilon_{ij} \sim N(0, 1)$ and $t_i \sim N(0, 5^2)$. The pairwise scatterplots of the simulated data are in Figure 3.2; a short simulation sketch is given after Table 3.2 below.

Figure 3.2: Pairwise scatterplots of simulated data II.

From the way the dataset is simulated, it is not possible to find a natural response variable to discriminate the two linear clusters. As a result, MIXREG does not work well for this dataset and the two clusters are not well separated. The misclassification matrices of the four algorithms are in Table 3.2. MLC, MCLUST and LGA have the same good performance here. Note that (3.17) and (3.18) generate (elongated) multivariate normal clusters in x1 and x2 or in x3 and x4, respectively. The assumptions of MCLUST are therefore satisfied, but MLC nonetheless has the same performance.

Table 3.2: Misclassification matrices of simulated data II from MLC, MCLUST, LGA and MIXREG.

                MLC, MCLUST, LGA   MIXREG (Resp = x1 or x2)   MIXREG (Resp = x3)   MIXREG (Resp = x4)
  True class      1     2            1     2                    1     2              1     2
  1              88    12           26    74                    5    95             63    37
  2               9    91           25    75                   10    90             75    25
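For concreteness, the following short sketch (our own, not from the thesis) generates a dataset of the form (3.17)-(3.18):

```python
import numpy as np

rng = np.random.default_rng(0)
t = rng.normal(0.0, 5.0, size=200)
eps = rng.normal(0.0, 1.0, size=(200, 4))
X = rng.normal(0.0, 5.0, size=(200, 4))     # start with N(0, 5^2) noise in all four components
X[:100, 0] = t[:100] + eps[:100, 0]         # first cluster: x1, x2 linearly related
X[:100, 1] = 0.5 * t[:100] + eps[:100, 1]
X[100:, 2] = t[100:] + eps[100:, 2]         # second cluster: x3, x4 linearly related
X[100:, 3] = 0.5 * t[100:] + eps[100:, 3]
```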
3.7.3 Australia rock crab data

Now we consider the rock crab data set of Campbell and Mahon (1974) on the genus Leptograpsus. We focus on the 100 blue crabs, 50 of which are males and 50 are females. Each crab has five measurements: FL, the width of the frontal lip; RW, the rear width; CL, the length along the midline; CW, the maximum width of the carapace; and BD, the body depth, all in mm. As in Peel and McLachlan (2000), we are interested in the classification of male and female crabs. Peel and McLachlan (2000) had 18 crabs misclassified applying a mixture of two t distributions using all five measurements. Using a mixture of two normal distributions resulted in one more crab being misclassified. Inspecting all pairwise scatterplots of the data set, RW and CL display two fairly well separated lines, as in Figure 3.3. We apply MLC, MCLUST, LGA and MIXREG to these two measurements.

Figure 3.3: Pairwise scatterplots of the blue crab data.

Table 3.3: Criteria for choosing the number of clusters, K, when MLC is applied to the blue crab data. "Boot p-value" is the p-value of the bootstrapping partial likelihood ratio test for H0 : K = K0 vs H1 : K = K0 + 1, where 99 bootstrap samples are used.

  K    BIC      ICL      NEC    Boot p-value
  1    476.13   –        –      0
  2    518.88   500.77   0.24   0.63
  3    506.35   446.83   0.51   –

For this data set, MCLUST chooses two clusters with different volumes, different shapes and different orientations by the BIC criterion. LGA chooses two clusters by the gap statistic. MLC also chooses two clusters by the various criteria in Table 3.3. MCLUST has 13 cases misclassified. LGA and MLC both have 7 cases misclassified. Inspecting the posterior probabilities of classification, MLC assigns roughly equal probabilities to the two clusters for two of the misclassified cases. In other words, there is extreme uncertainty about the assignment of these two cases, and their misclassifications are not surprising. There are no such diagnostic probabilities for LGA. MIXREG has eight cases misclassified when RW is taken as the response; it has seven cases misclassified if CL is taken as the response. If we take logarithms of RW and CL, only five crabs are misclassified by LGA and MLC. MIXREG has five cases misclassified if CL is taken as the response; it has 15 cases if RW is taken as the response.
The results are summarized in Table 3.4, and the clustering results are displayed in Figure 3.4.

Table 3.4: Misclassification matrices of the blue crab data (using log scales) from MLC, MCLUST, LGA and MIXREG.

                MLC        MCLUST     LGA        MIXREG (Response: CL)   MIXREG (Response: RW)
  True class     1    2     1    2     1    2      1    2                  1    2
  1             50    0    50    0    50    0     50    0                 38   12
  2              5   45    14   36     5   45      5   45                  3   47

Figure 3.4: Clustering results of the blue crab data (using log scales) from MLC, MCLUST, LGA and MIXREG. CL is used as the response in MIXREG.

3.7.4 TaqMan single nucleotide polymorphism genotyping data

A single nucleotide polymorphism (SNP) is a single-base variation in a genome. The genetic code is specified by the four nucleotide "letters": A (adenine), C (cytosine), T (thymine) and G (guanine). A SNP usually involves only two of these four letters, referred to as allele X and allele Y. An individual has two copies of genetic information passed from both parents, so the genotype at a specific genetic position has three possible types: homozygote for allele X, homozygote for allele Y and heterozygote for both alleles. SNP genotyping, the determination of the genotypes of individuals, plays an increasing role in genetics studies.

The TaqMan assay is a fluorescence-based high-throughput genotyping technology. Two fluorescent signals, x1 and x2, called Allele.X and Allele.Y below, are used to detect the presence or absence of the two alleles, and their intensities determine the genotypes. (There is a third fluorescent signal, presumably for normalizing the signal intensities; however, its effectiveness is questionable in some situations. We choose to use signals without this normalization.)

Blood samples of many individuals are assayed in wells of a plate simultaneously. For quality control purposes, there are usually some blood samples with known genotypes, called positive controls, and there are some wells without genetic material, called negative controls. In the scatterplot of a SNP data set, there are typically four clusters: three clusters for the three possible genotypes and one for negative controls. In addition, there may be some failed samples. A clustering algorithm is usually employed to classify the samples. Algorithms commonly used include k-means algorithms (Olivier et al., 2002; Ranade et al., 2001) and model-based algorithms (Fujisawa et al., 2004; Kang et al., 2004). Finding an algorithm that works for all plates/SNPs is a challenge.

We fit a mixture of five components to the SNP genotyping data: three for the genotype clusters, one for the negative controls and one for failed samples. The three genotype components are modelled linearly with the mixture likelihood (3.4). The fourth component is modelled with a bivariate normal distribution, and the fifth is a uniform component for possible outlying points.

In the model fitting, we use the known labels for the negative controls. All the negative controls are assigned to the fourth component a priori, as are all points with both signals less than the corresponding maximum intensities of the negative controls. Denote the set of these points by N. We do not use the known labels of the positive controls; these samples are used only for evaluating the clustering results. Hence, the modified mixture likelihood for the problem is
\[
L(\boldsymbol{\kappa}, p_1, \ldots, p_5 \mid \mathbf{x}_1, \ldots, \mathbf{x}_n) = \prod_{i \notin N} \left\{ \sum_{k=1}^{3} p_k N(\mathbf{a}_k'\mathbf{x}_i - b_k;\, 0, \sigma_k^2) + p_4 N(\mathbf{x}_i; \boldsymbol{\mu}, \Sigma) + p_5 U(\mathbf{x}_i; R) \right\} \times \prod_{i \in N} N(\mathbf{x}_i; \boldsymbol{\mu}, \Sigma), \tag{3.19}
\]
where $U$ is the uniform distribution on the rectangular region $R$ given by $(\min(x_{i1}), \max(x_{i1})) \times (\min(x_{i2}), \max(x_{i2}))$. This mix of linear, elliptical and uniform components demonstrates the flexibility of our modelling approach.
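As a rough numerical illustration of how (3.19) combines linear, elliptical and uniform components (our own sketch, not the thesis code; the hyperplane, normal and uniform parameters are placeholders, and SciPy is assumed for the normal densities):

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

def modified_mixture_loglik(X, in_N, p, planes, mu, Sigma, rect):
    """Log of the modified mixture likelihood (3.19).

    X      : (n, 2) unnormalized (Allele.X, Allele.Y) signals.
    in_N   : boolean mask for the set N (negative controls and weak points).
    p      : mixing proportions (p1, ..., p5).
    planes : three (a, b, sigma2) triples for the genotype lines.
    mu, Sigma : parameters of the bivariate normal component.
    rect   : ((x1_min, x1_max), (x2_min, x2_max)) for the uniform component.
    """
    area = (rect[0][1] - rect[0][0]) * (rect[1][1] - rect[1][0])
    ntc = multivariate_normal.pdf(X, mean=mu, cov=Sigma)
    lines = sum(
        p[k] * norm.pdf(X @ a - b, scale=np.sqrt(s2))
        for k, (a, b, s2) in enumerate(planes)
    )
    mix = lines + p[3] * ntc + p[4] / area      # density for points outside N
    return float(np.sum(np.log(mix[~in_N])) + np.sum(np.log(ntc[in_N])))
```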
Figure 3.5 displays the clustering results for one plate using four different algorithms: MLC, MCLUST, LGA and MIXREG. Note that the majority of the points lie on three lines, hence the need for linear clustering.

Figure 3.5: Clustering results from four methods applied to a TaqMan data set plate. ◦, + and △ denote the labels assigned to the three linear clusters (genotypes) by each method, × denotes points assigned as negative controls and ⋄ denotes points assigned to the background cluster.

The results of the modified MLC and of MCLUST are displayed in the upper-left and upper-right panels of Figure 3.5. All the points assigned to one of the three genotype clusters by MCLUST are assigned exactly the same way by MLC. Points 1 to 7, however, are assigned by MCLUST as background noise, while MLC classifies all of these points to one of the genotype clusters. Indeed, we know that these points should belong to the genotype clusters, and furthermore, points 1, 3, 6 and 7 are positive controls, all classified correctly by MLC. Table 3.5 displays the largest membership probabilities of these seven points under MCLUST and MLC, respectively. For the first six of the seven points, the two methods are very confident of their differing assignments. We speculate that MCLUST does not place these points in the genotype clusters because they appear to be outliers with respect to the normality assumptions. In contrast, MLC classifies a point using only its proximity to a line and does not model the position along the line.

Table 3.5: Largest membership probabilities of the 7 points labelled in Figure 3.5 by MCLUST and MLC.

  Points    1      2      3      4      5      6      7
  MLC       0.99   1.00   1.00   0.99   0.98   0.86   0.99
  MCLUST    0.96   0.93   1.00   1.00   1.00   0.97   0.50

The lower-left panel of Figure 3.5 displays the results from LGA. Since LGA is formulated to deal only with linear clusters, we set the number of clusters to three. (If four or five clusters are chosen, the genotype clusters are mixed up.) The restriction to linear clusters means that LGA has difficulty separating the three genotype clusters from the negative controls and the failed samples. In the lower-right panel, we choose the signal for Allele Y as the response to implement MIXREG and set the number of regression lines to 3. MIXREG fails completely here.
It finds only one of the three linear clusters. We obtained similar results if the signal for Allele X is chosen as the response. 3.8 Discussion We have proposed a flexible model-based approach to linear clustering. It is flexible in the sense that only the deviation from a hyperplane is mod- elled parametrically; the position on the hyperplane is not modelled. The advantage of this approach is illustrated in the genotyping example, where the distribution along the line is complex and difficult to model. Further- more, as was also illustrated in this example, we can incorporate elliptical clusters as necessary and a background cluster, borrowing from standard model-based clustering. Robustness to outliers is desirable, as the assumption of normal devia- tions around hyperplanes is sensitive to large deviations in the orthogonal direction. In addition to the inclusion of a uniform background cluster (Ban- field and Raftery, 1993), one option would be to use a heavier tailed distribu- tion, for example, Student’s t distribution with small degrees of freedom or with degrees of freedom depending on the data. This would adapt Peel and McLachlan (2000)’s EM algorithm for t mixture models from the elliptical context to the linear. The adaptation is straightforward but computationally more expensive. Further ideas include estimating the component covariance matrices in the M-step in a robust way, for example, trimming off some points. With a′x = b, we are specifying a hyperplane in d− 1 dimensions. With little effort, this could be generalized to a mixture of partial likelihoods, each of which specifies a hyperplane of dimension q < d, l(κ,p|x1:n) = n∏ i=1 K∑ k=1 pkN(A ′ kxi − bk;0,Σk), (3.20) where A is of dimension d × (d − q), b is a vector of dimension d − q, 45 Chapter 3. A partial likelihood approach to linear clustering and Σk is a (d − q) × (d − q) covariance matrix for the deviation from the hyperplane. In the extreme case of a 0-dimension hyperplane, which is a point, we have the usual mixture of multivariate normal distributions. A mixture of components with various dimensions could be considered. A Bayesian version of this methodology would be helpful if some clusters are sparse but there is strong prior information about their approximate locations or properties (e.g., the parameters defining lines). 46 Bibliography Banfield, J. and A. Raftery (1993). Model-Based Gaussian and Non- Gaussian Clustering. Biometrics 49 (3), 803–821. Biernacki, C., G. Celeux, and G. Govaert (1999). An improvement of the NEC criterion for assessing the number of clusters in a mixture model. Pattern Recognition Letters 20 (3), 267–272. Biernacki, C., G. Celeux, and G. Govaert (2000). Assessing a mixture model for clustering with the integrated completed likelihood. IEEE Transactions on Pattern Analysis and Machine Intelligence 22 (7), 719–725. Bryant, P. and J. Williamson (1978). Asymptotic behaviour of classification maximum likelihood estimates. Biometrika 65 (2), 273–281. Campbell, N. and R. Mahon (1974). A multivariate study of variation in two species of rock crab of genus Leptograpsus. Australian Journal of Zoology 22 (3), 417–425. Celeux, G. and G. Govaert (1993). Comparison of the mixture and the classification maximum likelihood in cluster analysis. Journal of statistical computation and simulation(Print) 47 (3-4), 127–146. Celeux, G. and G. Soromenho (1996). An entropy criterion for assessing the number of clusters in a mixture model. Journal of Classification 13 (2), 195–212. Chen, H. and J. 
Chen (2001a). Large sample distribution of the likelihood ratio test for normal mixtures. Statistics and Probability Letters 52 (2), 125–133. Chen, H. and J. Chen (2001c). The Likelihood Ratio Test for Homogeneity in Finite Mixture Models. The Canadian Journal of Statistics/La Revue Canadienne de Statistique 29 (2), 201–215. Chen, H. and J. Chen (2001b). The likelihood ratio test for homogeneity in the finite mixture models. Canadian Journal of Statistics 29 (2), 201–215. 47 Bibliography Chen, H., J. Chen, and J. Kalbfleisch (2001). A modified likelihood ra- tio test for homogeneity in finite mixture models. Journal of the Royal Statistical Society: Series B (Methodological) 63 (1), 19–29. Chen, H., P. Meer, and D. Tyler (2001). Robust regression for data with multiple structures. 2001 IEEE Conference on Computer Vision and Pat- tern Recognition 1, 1069–1075. Chen, J. (1998). Penalized likelihood-ratio test for finite mixture models with multinomial observations. The Canadian Journal of Statistics 26 (4), 583–599. Chen, J. and J. Kalbfleisch (1996). Penalized Minimum-Distance Estimates in Finite Mixture Models. The Canadian Journal of Statistics/La Revue Canadienne de Statistique 24 (2), 167–175. Chen, J. and J. Kalbfleisch (2005). Modified likelihood ratio test in finite mixture models with a structural parameter. Journal of Statistical Planning and Inference 129 (1-2), 93–107. Ciuperca, G., A. Ridolfi, and J. Idier (2003). Penalized Maximum Like- lihood Estimator for Normal Mixtures. Scandinavian Journal of Statis- tics 30 (1), 45–59. DeSarbo, W. and W. Cron (1988). A maximum likelihood methodology for clusterwise linear regression. Journal of Classification 5 (2), 249–282. Everitt, B. and D. Hand (1981). Finite mixture distributions. Monographs on Applied Probability and Statistics, London: Chapman and Hall . Everitt, B., S. Landau, and M. Leese (2001). Cluster Analysis. Hodder Arnold. Fraley, C. and A. Raftery (1998). How Many Clusters? Which Cluster- ing Method? Answers Via Model-Based Cluster Analysis. The Computer Journal 41 (8), 578. Fraley, C. and A. Raftery (2002). Model-Based Clustering, Discriminant Analysis, and Density Estimation. Journal of the American Statistical Association 97 (458), 611–632. Fraley, C. and A. Raftery (2007). mclust: Model-Based Clustering / Normal Mixture Modeling. R package version 3.1-1. 48 Bibliography Fujisawa, H., S. Eguchi, M. Ushijima, S. Miyata, Y. Miki, T. Muto, and M. Matsuura (2004). Genotyping of single nucleotide polymorphism using model-based clustering. Bioinformatics 20, 5. Fuller, W. (1987). Measurement error models. Wiley Series in Probability and Mathematical Statistics, New York: Wiley, 1987 . Ganesalingam, S. (1989). Classification and Mixture Approaches to Clus- tering via Maximum Likelihood. Applied Statistics 38 (3), 455–466. Garćıa-Escudero, L., A. Gordaliza, R. San Mart́ın, S. van Aelst, and R. Za- mar (2007). Robust linear clustering. Preprint. Harrington, J. (2007). lga: Tools for linear grouping analysis (LGA). R package version 1.0-0. Hathaway, R. (1983). Constrained maximum-likelihood estimation for a mixture of m univariate normal distributions. Ph. D. thesis, Rice Univer- sity. Hathaway, R. (1985). A Constrained Formulation of Maximum-Likelihood Estimation for Normal Mixture Distributions. The Annals of Statis- tics 13 (2), 795–800. Hathaway, R. (1986). A constrained EM algorithm for univariate normal mixtures. Journal of Statistical Computation and Simulation 23 (3), 211– 230. Kang, H., Z. Qin, T. Niu, and J. Liu (2004). 
Incorporating Genotyping Uncertainty in Haplotype Inference for Single-Nucleotide Polymorphisms. The American Journal of Human Genetics 74 (3), 495–510. Keribin, C. (2000). Sankhya: The Indian Journal of Statistics 62, 49–66. MacLachlan, G. and D. Peel (2000). Finite mixture models. J. Wiley. Marriott, F. (1975). Separating mixtures of normal distributions. Biomet- rics 31 (3), 767–769. McLachlan, G. (1987). On Bootstrapping the Likelihood Ratio Test Stastis- tic for the Number of Components in a Normal Mixture. Applied Statis- tics 36 (3), 318–324. 49 Bibliography Olivier, M., L. Chuang, M. Chang, Y. Chen, D. Pei, K. Ranade, A. de Witte, J. Allen, N. Tran, D. Curb, et al. (2002). High-throughput genotyping of single nucleotide polymorphisms using new biplex invader technology. Nucleic Acids Research 30 (12), e53. Peel, D. and G. McLachlan (2000). Robust mixture modelling using the t distribution. Statistics and Computing 10 (4), 339–348. Ranade, K., M. Chang, C. Ting, D. Pei, C. Hsiao, M. Olivier, R. Pesich, J. Hebert, Y. Chen, V. Dzau, et al. (2001). High-Throughput Genotyping with Single Nucleotide Polymorphisms. Genome Research 11 (7), 1262– 1268. Redner, R. (1981). Note on the Consistency of the Maximum Likelihood Estimate for Nonidentifiable Distributions. The Annals of Statistics 9 (1), 225–228. Scott, A. and M. Symons (1971). Clustering Methods Based on Likelihood Ratio Criteria. Biometrics 27 (2), 387–397. Smyth, P. (2000). Model selection for probabilistic clustering using cross- validated likelihood. Statistics and Computing 10 (1), 63–72. Spath, H. (1982). Fast algorithm for clusterwise linear regression. COM- PUT. 29 (2), 175–181. Thode Jr, H., S. Finch, and N. Mendell (1988). Simulated Percentage Points for the Null Distribution of the Likelihood Ratio Test for a Mixture of Two Normals. Biometrics 44 (4), 1195–1201. Turner, T. (2000). Estimating the propagation rate of a viral infection of potato plants via mixtures of regressions. Journal of the Royal Statistical Society Series C(Applied Statistics) 49 (3), 371–384. Turner, T. (2006). mixreg: Functions to fit mixtures of regressions. R package version 0.0-2. Van Aelst, S., X. Wang, R. Zamar, and R. Zhu (2006). Linear grouping using orthogonal regression. Computational Statistics and Data Analy- sis 50 (5), 1287–1312. Wald, A. (1949). Note on the Consistency of the Maximum Likelihood Estimate. The Annals of Mathematical Statistics 20 (4), 595–601. 50 Chapter 4 Bayesian linear clustering with application to single nucleotide polymorphism genotyping 4.1 Introduction This paper was motivated by a clustering problem in single nucleotide poly- morphism (SNP) genotyping. A single nucleotide polymorphism (SNP, pro- nounced as “snip”) is a single-base variation in a genome. The genetic code of life is specified by the four nucleotide “letters”: A (adenine), C (cyto- sine), G (guanine) and T (thymine). There are two complementary DNA strands. It is sufficient to consider only one. SNP variation occurs when a single nucleotide, such as an A, is replaced by one of the other three letters C, G or T. One SNP usually involves only two letters, referred to generi- cally throughout this paper as allele X and allele Y . An individual has two copies of genetic information passed from both parents. So the genotypes at a specific genetic position have three possibilities: homozygote for allele X, homozygote for allele Y and heterozygote for both alleles. 
SNP geno- typing, determination of genotypes of individuals, plays an increasing role in genetics studies. For a small project, it is possible to make genotype calls manually. In most cases, it is not hard for an expert to perform this job, and the “eye- balling” procedure usually gives reasonable results due to its sophisticated incorporation of prior information. For large-scale studies, however, manual A version of this chapter will be submitted for publication. Authors: Guohua Yan, William J. Welch and Ruben H. Zamar. 51 Chapter 4. Bayesian linear clustering scoring can become a daunting challenge. Typical SNP genotyping applica- tions involve thousands of patients and hundreds of SNPs. Hence, reliable automated genotyping methods are highly needed. TaqMan SNP Genotyping Assay (Applied Biosystems) is a fluorescence- based high-throughput genotyping technology. Blood samples of many in- dividuals are arranged in a 96- or 384-well plate and are assayed simultane- ously. Two fluorescent signals, x1 and x2, also called Allele.X and Allele.Y below, are used to detect the presence or absence of the two alleles. There is a third fluorescent signal presumably for normalizing the signal intensities. However, its effectiveness is suspicious in some situations. See Chapter 2. For quality controls purpose, some blood samples with known genotypes are also included in a plate; these samples are called positive controls. As well, some wells do not have genetic material; these are called negative controls. In the scatterplot of a SNP data set, there are typically four clusters, as in Figure 4.1. In the lower-right, upper-left, and upper-right corners are three clusters, presumably containing samples of wild-type homozygotes, variant homozygotes, and heterozygotes, respectively. In the lower-left cor- ner, the cluster may contain negative controls and/or some failed samples. A clustering algorithm is usually employed to call the SNP genotypes. Al- gorithms commonly used in the literature include k-means (Olivier et al., 2002; Ranade et al., 2001) and model-based algorithms (Fujisawa et al., 2004; Kang et al., 2004). The proprietary software Sequence Detection Sys- tem is included in a thermal cycler. Usually human intervention of an expert operator is usually needed to review the genotype calling results. Finding an algorithm that works well for all plates/SNPs is a challenge, especially when the variant allele homozygous genotype cluster is sparse, in which standard clustering algorithms often fail. Several clustering algorithms k-means, MCLUST (Fraley and AE, 2006), LGA (Harrington, 2007), MIXREG (Turner, 2006) and MLC (Yan et al., 2008) are applied to the genotyping data in Figure 4.1. Figure 4.2 dis- plays their corresponding clustering results. We can see that all of these algorithms fail to identify the five points in the upper-left corner as vari- ant allele homozygotes while an expert operator would do with confidence. These algorithms fail because the variant allele homozygous genotype cluster is very sparse. This motivates us to adopt a Bayesian approach to incorpo- rate available prior information. In addition, we found that it is convenient to model the genotype clusters as linear structures. Most clustering methods and algorithms cluster data points around “cen- ters”. However, some data sets, as in the SNP genotype setting, form groups 52 Chapter 4. Bayesian linear clustering 0.2 0.4 0.6 0.8 1.0 1.2 1.4 0. 5 1. 0 1. 5 2. 
Figure 4.1: Scatterplot of one plate in which the variant allele homozygous cluster has only five points (upper-left).

through linear relationships, and standard clustering techniques are usually not able to find these linear patterns. By linear clustering, we mean detecting these linear clusters in a data set.

Van Aelst et al. (2006) proposed a method called the Linear Grouping Algorithm (LGA) for linear clustering problems without assuming a response variable. Yan et al. (2008) proposed a partial likelihood approach which models only the signed orthogonal distance from each data point to a hypothesized hyperplane and treats all variables symmetrically.

In this paper we introduce a hierarchical modelling approach, which is particularly appropriate for identifying sparse linear clusters. We show that the sparse cluster in our SNP genotyping dataset can be successfully identified after a careful specification of the prior distributions. The rest of this paper is organized as follows. In Section 2, we describe a hierarchical model framework for linear clustering. In Section 3, we discuss sampling issues in detail. Our approach to linear clustering is illustrated with a relatively simple dataset (the crab dataset) in Section 4. In Section 5, we revisit the motivating SNP genotyping dataset and show that a careful specification of the prior distributions is critical to the success of the clustering algorithm. A brief discussion follows in Section 6.

Figure 4.2: Clustering results of several clustering algorithms. For k-means and MCLUST, the number of clusters is set to 4; for the remaining algorithms, the number of lines is set to 3.
4.2 Model specification

4.2.1 Model for one cluster: orthogonal regression

We first introduce a Bayesian approach to orthogonal regression (see, e.g., Fuller, 1987). To model a single cluster, assume that the vector-valued data x_1, ..., x_n are drawn by a random mechanism represented by a random vector X in a d-dimensional space. We do not model the distribution of X except for its deviation from an unknown hyperplane.

Let {x : a'x - b = 0} be a hyperplane. For identifiability purposes, we assume that a'a = 1 and that the first nonzero element of a is positive. Our prior information about this hyperplane is summarized by a prior distribution π(a, b). Given a and b, the signed orthogonal deviation a'X - b from the hyperplane {x : a'x - b = 0} has a density function p(· | σ). Let π(σ) be the prior distribution for σ. Denote θ = (a', b, σ)'. The posterior distribution of θ is

\pi(\theta \mid x_1, \ldots, x_n) \propto \left\{ \prod_{i=1}^{n} p(a'x_i - b \mid \sigma) \right\} \pi(a, b)\,\pi(\sigma).  (4.1)

Let x̄ and S be the sample mean and sample covariance matrix of x_1, ..., x_n, respectively. If we take p(· | σ) to be the normal density function N(· | 0, σ²) and set π(a, b) ∝ 1 and π(σ) ∝ 1, the maximum a posteriori estimator of θ is (â', b̂, σ̂)', where σ̂² is the smallest eigenvalue of S, â is the (standardized) eigenvector associated with σ̂², and b̂ = â'x̄. This is a connection with orthogonal regression; see, for example, Fuller (1987). For the distributions of the eigenvalue σ̂² and the eigenvector â when X is normal, see Anderson (2003).

4.2.2 Model for linear clustering

Now we assume that the data x_1, ..., x_n are drawn by a more complicated random mechanism, still represented by a random vector X in a d-dimensional space, which we do not model fully. The data now lie around K hyperplanes {x : a_k'x = b_k}, k = 1, ..., K, for which our prior knowledge is summarized in π(a_1, b_1, ..., a_K, b_K). Let Z = (Z_1, ..., Z_K)' be a random vector indicating these hyperplanes: Z_k = 1 with probability p_k for k = 1, ..., K, Z_k = 0 or 1, and \sum_{k=1}^{K} Z_k = 1. Let p = (p_1, ..., p_K)'. We denote by π(p) the prior distribution for p. We assume that, conditional on Z_k = 1, the signed orthogonal deviation a_k'X - b_k has a density function p(· | σ_k), for k = 1, ..., K. Let π(σ_1, ..., σ_K) be the prior distribution for (σ_1, ..., σ_K)'. Let z_1, ..., z_n be the corresponding unobservable indicators for the data x_1, ..., x_n, where z_i = (z_{i1}, ..., z_{iK})' for i = 1, ..., n. Let θ be the collection of all the parameters, θ = (a_1', b_1, σ_1, ..., a_K', b_K, σ_K, p')'.

Formally, the posterior distribution of the unobservable cluster indicators z_1, ..., z_n and θ is

\pi(z_1, \ldots, z_n, \theta \mid x_1, \ldots, x_n) \propto \left\{ \prod_{i=1}^{n} \prod_{k=1}^{K} \big[p_k\, p(a_k'x_i - b_k \mid \sigma_k)\big]^{z_{ik}} \right\} \pi(a_1, b_1, \ldots, a_K, b_K)\,\pi(\sigma_1, \ldots, \sigma_K)\,\pi(p).  (4.2)

This formula is the basis for our linear clustering, in which we are primarily interested in the cluster indicators z_1, ..., z_n; the elements of θ are nuisance parameters. A Gibbs sampling scheme based on this formula is detailed in the next section.

In addition, if we sum out the indicators z_1, ..., z_n, the marginal distribution for θ is

\pi(\theta \mid x_1, \ldots, x_n) \propto \left\{ \prod_{i=1}^{n} \sum_{k=1}^{K} p_k\, p(a_k'x_i - b_k \mid \sigma_k) \right\} \pi(a_1, b_1, \ldots, a_K, b_K)\,\pi(\sigma_1, \ldots, \sigma_K)\,\pi(p).  (4.3)

This marginal distribution can be used if we would like to set up other sampling schemes such as Metropolis-Hastings, tempering MCMC or SMC methods.
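The orthogonal-regression connection of Section 4.2.1 is easy to verify numerically. The R sketch below is our own illustration, not code from the thesis: it computes the flat-prior maximum a posteriori estimate of (a, b, σ²) for a single cluster from the eigendecomposition of the sample covariance matrix.

```r
# Minimal sketch: flat-prior MAP estimate for one linear cluster,
# i.e. orthogonal regression via the sample covariance matrix.
orthogonal_fit <- function(X) {          # X: n x d numeric matrix
  xbar <- colMeans(X)
  S    <- cov(X)
  eig  <- eigen(S, symmetric = TRUE)
  d    <- ncol(X)
  a    <- eig$vectors[, d]               # eigenvector of the smallest eigenvalue
  if (a[which(a != 0)[1]] < 0) a <- -a   # sign convention: first nonzero element positive
  list(a = a,                            # normal vector of the fitted hyperplane
       b = sum(a * xbar),                # intercept: b = a' xbar
       sigma2 = eig$values[d])           # orthogonal variance: smallest eigenvalue of S
}

# Example on simulated data scattered around the line x2 = 2 * x1:
set.seed(1)
x1 <- runif(200, 0, 1.5)
X  <- cbind(x1, 2 * x1 + rnorm(200, sd = 0.05))
orthogonal_fit(X)
```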
Once a MCMC sample for θ is available, it is straightforward to sample cluster labels at each iteration. At iteration t, with θ^{(t)}, sample z_i from the multinomial distribution

\Pr(z_i = k) \propto p_k^{(t)}\, p\big((a_k^{(t)})'x_i - b_k^{(t)} \mid \sigma_k^{(t)}\big).  (4.4)

After a sequence of iterations, we have a sample (z_i^{(1)}, \ldots, z_i^{(T)}) from the posterior marginal distribution π(z_i | x_1, ..., x_n), for i = 1, ..., n. The point x_i can be classified into the cluster with the largest frequency in the posterior sample; and the largest frequency provides a good measure of the cluster membership.

4.3 Sampling algorithms

4.3.1 Gibbs sampling

The model framework in (4.2) is conceptually straightforward; the posterior sampling, however, can be difficult, as in any Bayesian mixture modelling (Celeux et al., 2000). We present here a sampling algorithm for normal orthogonal deviations and, to make the algorithm more efficient, we integrate some parameters by using conjugate priors. For ease of presentation, we assume the data are in a 2-dimensional space.

The density functions for normal orthogonal deviations are

p(a_k'x - b_k \mid \sigma_k) = N(a_k'x - b_k;\, 0, \sigma_k^2),  (4.5)

for k = 1, ..., K, where N(·; 0, σ²) is the normal density function with mean 0 and variance σ². The priors for (σ_1², ..., σ_K²) follow independent inverse Gamma distributions,

\pi(\sigma_1^2, \ldots, \sigma_K^2) = \prod_{k=1}^{K} IG(\sigma_k^2;\, \delta_{1k}, \delta_{2k}),  (4.6)

where IG(·; δ_{1k}, δ_{2k}) is the density function of the inverse Gamma distribution with shape parameter δ_{1k} and rate parameter δ_{2k}. In the case of equal variances, σ_1² = ... = σ_K² = σ², we assume

\pi(\sigma^2) = IG(\sigma^2;\, \delta_1, \delta_2).  (4.7)

For b_1, ..., b_K, we set

\pi(b_1, \ldots, b_K \mid \sigma_1^2, \ldots, \sigma_K^2) = \prod_{k=1}^{K} N(b_k;\, \kappa_{1k}, \sigma_k^2/\kappa_{2k}).  (4.8)

With these specifications, we consider the following special case of the posterior distribution (4.2),

\pi(z_1, \ldots, z_n, \theta \mid x_1, \ldots, x_n) \propto \left\{ \prod_{i=1}^{n} \prod_{k=1}^{K} \big[p_k N(a_k'x_i - b_k;\, 0, \sigma_k^2)\big]^{z_{ik}} \right\} \times \left\{ \prod_{k=1}^{K} N(b_k;\, \kappa_{1k}, \sigma_k^2/\kappa_{2k})\, IG(\sigma_k^2;\, \delta_{1k}, \delta_{2k}) \right\} \pi(a_1, \ldots, a_K)\,\pi(p).  (4.9)

Conditional on θ and the data x_1, ..., x_n, z_1, ..., z_n are independently distributed,

z_i \mid (\theta, x_i) \sim M\big(1, (p_1 N(a_1'x_i - b_1;\, 0, \sigma_1^2), \ldots, p_K N(a_K'x_i - b_K;\, 0, \sigma_K^2))'\big),  (4.10)

for i = 1, ..., n, where M(n, p) is the probability mass function for the multinomial distribution with n trials and probability vector p. Let

n_k = \sum_{i=1}^{n} z_{ik}, \qquad \bar{x}_k = \frac{1}{n_k} \sum_{i=1}^{n} z_{ik} x_i,  (4.11)

and

A_k = \sum_{i=1}^{n} z_{ik} (x_i - \bar{x}_k)(x_i - \bar{x}_k)'.  (4.12)

The full conditional distribution of b_k is

b_k \mid \text{e.e.} = b_k \mid (z_1, \ldots, z_n, a_k, \sigma_k^2, x_1, \ldots, x_n) \sim N\!\left( \frac{\kappa_{1k}\kappa_{2k} + n_k a_k'\bar{x}_k}{n_k + \kappa_{2k}},\; \frac{\sigma_k^2}{n_k + \kappa_{2k}} \right),  (4.13)

for k = 1, ..., K, where “e.e.” stands for “everything else”. We integrate out b_1, ..., b_K from (4.9) and get

\pi(z_1, \ldots, z_n, a_1, \sigma_1^2, \ldots, a_K, \sigma_K^2, p \mid x_1, \ldots, x_n) \propto \left\{ \prod_{k=1}^{K} p_k^{n_k} (\sigma_k^2)^{-n_k/2} (n_k + \kappa_{2k})^{-1/2} \exp\!\left(-\frac{\delta^*_{2k}}{2\sigma_k^2}\right) \right\} \times \left\{ \prod_{k=1}^{K} IG(\sigma_k^2;\, \delta_{1k}, \delta_{2k}) \right\} \pi(a_1, \ldots, a_K)\,\pi(p),  (4.14)

where

\delta^*_{2k} = a_k' A_k a_k + n_k (a_k'\bar{x}_k)^2 + \kappa_{1k}^2 \kappa_{2k} - \frac{(\kappa_{1k}\kappa_{2k} + n_k a_k'\bar{x}_k)^2}{n_k + \kappa_{2k}}.  (4.15)

The distribution of σ_k² conditioning on everything else except b_k is

\sigma_k^2 \mid (z_1, \ldots, z_n, a_k, x_1, \ldots, x_n) \sim IG\!\left(\delta_{1k} + \frac{n_k}{2},\; \delta_{2k} + \frac{\delta^*_{2k}}{2}\right).  (4.16)

If we assume that σ_1² = ... = σ_K² = σ², then the distribution of σ² conditioning on z_1, ..., z_n, a_1, ..., a_K, p is

\sigma^2 \mid (z_1, \ldots, z_n, a_1, \ldots, a_K, x_1, \ldots, x_n) \sim IG\!\left(\delta_1 + \frac{n}{2},\; \delta_2 + \sum_{k=1}^{K} \frac{\delta^*_{2k}}{2}\right).  (4.17)

We further integrate out σ_1², ..., σ_K² from (4.14) and get
\pi(z_1, \ldots, z_n, a_1, \ldots, a_K, p \mid x_1, \ldots, x_n) \propto \left\{ \prod_{k=1}^{K} p_k^{n_k} (n_k + \kappa_{2k})^{-1/2}\, \Gamma\!\left(\delta_{1k} + \frac{n_k}{2}\right) \left(\delta_{2k} + \frac{\delta^*_{2k}}{2}\right)^{-(\delta_{1k} + n_k/2)} \right\} \times \pi(a_1, \ldots, a_K)\,\pi(p).  (4.18)

For the proportions p, we use a Dirichlet prior,

\pi(p) = D(p;\, \alpha),  (4.19)

where D(·; α) is the Dirichlet density function with parameter α = (α_1, ..., α_K)'. The full conditional distribution of p is

p \mid \text{e.e.} = p \mid (z_1, \ldots, z_n) \sim D(\alpha^*),  (4.20)

where α* = (α*_1, ..., α*_K)' with α*_k = α_k + n_k for k = 1, ..., K. Therefore,

\pi(z_1, \ldots, z_n, a_1, \ldots, a_K \mid x_1, \ldots, x_n) \propto \left\{ \frac{1}{\Gamma\big(\sum_{k=1}^{K} \alpha^*_k\big)} \prod_{k=1}^{K} \Gamma(\alpha^*_k)\, (n_k + \kappa_{2k})^{-1/2}\, \Gamma\!\left(\delta_{1k} + \frac{n_k}{2}\right) \left(\delta_{2k} + \frac{\delta^*_{2k}}{2}\right)^{-(\delta_{1k} + n_k/2)} \right\} \times \pi(a_1, \ldots, a_K).  (4.21)

Note that a_k'a_k = 1 for k = 1, ..., K. We re-parameterize a_k as

a_k = \left( \frac{1}{\sqrt{1 + a_k^2}},\; \frac{a_k}{\sqrt{1 + a_k^2}} \right)',

for k = 1, ..., K, and use normal priors for (a_1, ..., a_K)',

\pi(a_1, \ldots, a_K) = \prod_{k=1}^{K} N(a_k;\, \nu_{1k}, \nu_{2k}^2).  (4.22)

The conditional distribution of a_k given the cluster indicators is

\pi(a_k \mid z_1, \ldots, z_n, x_1, \ldots, x_n) \propto \Gamma\!\left(\delta_{1k} + \frac{n_k}{2}\right) \left(\delta_{2k} + \frac{\delta^*_{2k}}{2}\right)^{-(\delta_{1k} + n_k/2)} N(a_k;\, \nu_{1k}, \nu_{2k}^2).  (4.23)

From the above discussion, a possible sampling algorithm is as follows.

Algorithm 1:
1. Initialize the algorithm with θ^{(0)}.
2. At iteration t,
(a) Sample cluster indicators z_1^{(t)}, ..., z_n^{(t)} from the multinomial distribution (4.10), where θ is replaced with θ^{(t-1)}.
(b) i. Sample p^{(t)} from the Dirichlet distribution (4.20), where the cluster indicators are replaced with z_1^{(t)}, ..., z_n^{(t)}.
    ii. Sample a_1^{(t)}, ..., a_K^{(t)} using a random walk Metropolis-Hastings scheme from (4.23), where the cluster indicators are replaced by z_1^{(t)}, ..., z_n^{(t)}, respectively.
    iii. Sample (σ_1²)^{(t)}, ..., (σ_K²)^{(t)} from the inverse Gamma distribution (4.16), where z_1, ..., z_n and a_1, ..., a_K are replaced by z_1^{(t)}, ..., z_n^{(t)} and a_1^{(t)}, ..., a_K^{(t)}, respectively; in the case of equal variances σ_1² = ... = σ_K² = σ², sample (σ²)^{(t)} from (4.17).
    iv. Sample b_1, ..., b_K from the normal distribution (4.13), with all parameters and cluster indicators replaced with those in the current iteration.

4.3.2 Metropolis-Hastings sampling

Alternatively, we can sum out the indicators z_1, ..., z_n and sample θ from the marginal posterior distribution π(θ | x_1, ..., x_n) in (4.3). Standard random walk Metropolis-Hastings algorithms can be used directly to sample from this marginal posterior distribution. However, we found that it is advantageous to reparameterize θ so that constraints such as \sum_{k=1}^{K} p_k = 1 and a'a = 1 are implicitly incorporated (see Section 4.5). Since the proposal distribution and the posterior distribution then have the same support, the sampling schemes on the transformed parameters are more efficient. Suppose we reparameterize θ as θ = φ(η), such that φ is a one-to-one map and the domain of η is a full Euclidean space.

For the SNP genotyping data, the slopes -1/a_1, -1/a_2 and -1/a_3 of the three genotype lines are required to satisfy

0 < -\frac{1}{a_1} < -\frac{1}{a_2} < -\frac{1}{a_3}, \qquad -\frac{1}{a_3} + \frac{1}{a_2} > -\frac{1}{a_2} + \frac{1}{a_1},  (4.26)

which maintains the order of the positive slopes of the three genotype clusters and requires that the “gap” between the two lines of the heterozygotic cluster and the variant allele homozygotic cluster cannot be too small relative to the “gap” between the lines of the wild type allele homozygotic cluster and the heterozygotic cluster. These constraints on slopes are natural for SNP genotyping data. To implement Algorithm 2, we reparameterize (a_1, a_2, a_3) into (η_1, η_2, η_3),

a_1 = -1/\exp(\eta_1), \quad a_2 = -1/(\exp(\eta_1) + \exp(\eta_2)), \quad a_3 = -1/(\exp(\eta_1) + 2\exp(\eta_2) + \exp(\eta_3)).  (4.27)

It is obvious that the constraint (4.26) is satisfied.
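The effect of the reparametrization (4.27) can be checked directly: for any real (η_1, η_2, η_3) the implied slopes are positive, increasing, and satisfy the gap condition in (4.26). The following R sketch (illustrative only; the function name is ours) makes this explicit.

```r
# Minimal sketch (not thesis code): the slope reparametrization in (4.27).
# For any real (eta1, eta2, eta3) the implied slopes -1/a_k are positive,
# increasing, and the second gap exceeds the first, so (4.26) holds automatically.
slopes_from_eta <- function(eta) {
  a1 <- -1 / exp(eta[1])
  a2 <- -1 / (exp(eta[1]) + exp(eta[2]))
  a3 <- -1 / (exp(eta[1]) + 2 * exp(eta[2]) + exp(eta[3]))
  s  <- c(-1 / a1, -1 / a2, -1 / a3)        # slopes of the three genotype lines
  list(a = c(a1, a2, a3), slopes = s,
       gap_ok = (s[3] - s[2]) > (s[2] - s[1]))
}

slopes_from_eta(c(-0.8, 0.4, 1.2))   # arbitrary eta; gap_ok is always TRUE
```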
In the case of few points in the variant allele homozygotic cluster, the orthogonal variance σ_3² may not be estimated reliably. Hence we require

\sigma_3^2 = \sigma_1^2,  (4.28)

and then apply a log transformation to (σ_1², σ_2²),

\sigma_1^2 = \exp(\eta_7), \qquad \sigma_2^2 = \exp(\eta_8).  (4.29)

For the variances (σ_{11}, σ_{22}), a log transformation is applied,

\sigma_{11} = \exp(\eta_{11}), \qquad \sigma_{22} = \exp(\eta_{12}).  (4.30)

For the proportions, we add the constraint

p_3 < p_1, \qquad p_3 < p_2,  (4.31)

which requires that the proportion of variant allele homozygotic points be less than that of the other two genotype clusters. We reparameterize the proportions p first with q for constraint (4.31),

p_1 = q_1 + q_3/3, \quad p_2 = q_2 + q_3/3, \quad p_3 = q_3/3, \quad p_4 = q_4, \quad p_5 = q_5,  (4.32)

and then apply a logit transformation to q,

q_1 = \frac{\exp(\eta_{13})}{1 + \exp(\eta_{13}) + \cdots + \exp(\eta_{16})}, \quad \ldots, \quad q_4 = \frac{\exp(\eta_{16})}{1 + \exp(\eta_{13}) + \cdots + \exp(\eta_{16})}.  (4.33)

With the constraints (4.26), (4.28) and (4.31), we can fully identify the five components of the mixture. Label-switching is effectively prevented by these informative constraints. In addition, under these constraints, noninformative priors still lead to a proper posterior distribution.

From (4.27), (4.29), (4.30) and (4.33), we obtain a one-to-one map from θ = (a_1, a_2, a_3, b_1, b_2, b_3, σ_1², σ_2², σ_3², μ_1, μ_2, σ_{11}, σ_{22}, p_1, ..., p_5) to η = (η_1, η_2, η_3, b_1, b_2, b_3, η_7, η_8, μ_1, μ_2, η_{11}, η_{12}, η_{13}, η_{14}, η_{15}, η_{16}), in which the constraints (4.26), (4.28) and (4.31) are satisfied. The absolute value of the Jacobian of the transformation is

|J| = \left| \frac{\partial \theta}{\partial \eta} \right| = \big|a_3^2 (a_1 - a_2)(a_2 - 2a_1)/a_1\big|\, \sigma_1^2 \sigma_2^2 \sigma_{11} \sigma_{22} (p_1 - p_3)(p_2 - p_3) p_3 p_4 p_5.  (4.34)

We set the priors as follows: η_k ∼ N(0, 10^{10}), b_k ∼ N(0, 10^{-6}), σ_k² ∼ IG(0.0001, 0.0001), for k = 1, 2, 3, and π(μ, σ_{11}, σ_{22}) ∝ 1/(σ_{11}σ_{22}), π(q) ∝ 1. The priors for b_1, b_2, b_3 are strong, with the implication that all three lines should roughly pass through the origin; the priors for all other parameters are vague apart from the constraints specified above.

The five-component mixture model is applied to the plate shown in Figure 4.1. For an algorithm to work in this relatively high dimensional problem, it is essential that the initial state be close to the substantial support of the posterior distribution. We first run the “optim” function in the R language multiple times, initialized with standard normal N(0, I_{16}) draws. Then we initialize the algorithm with the solution with the largest posterior density, say η^{(0)}.

[Figure 4.10: Evolution of the log unnormalized posterior density of sampled values for η for 11,000 iterations using Algorithm 2 for the SNP genotyping data.]

We first run Algorithm 2.0, which is relatively slow, to get a rough estimate of the covariance matrix of η. Next we run the faster Algorithm 2. To alleviate the effect of autocorrelations, we keep only one simulated point for every ten iterations. Hereafter the iteration numbers are based on the “thinned” samples. Figure 4.10 displays the evolution of the log unnormalized posterior density of the sampled values of η for 11,000 iterations. Figure 4.11 displays the evolution of the sampled values of θ for 11,000 iterations; the first three panels are for the slopes of the three lines, -1/a_1, -1/a_2 and -1/a_3. From these two figures we conclude empirically that the sampler has reached a stationary distribution.
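The multi-start initialization described above can be sketched as follows. Here `log_post_eta` is a hypothetical stand-in for the unnormalized log posterior density of η, which is not written out in this chapter, and the number of random starts is arbitrary.

```r
# Minimal sketch of the multi-start initialization: run optim from several
# standard normal N(0, I_16) starting points and keep the best mode found.
find_start <- function(log_post_eta, n_starts = 20, dim = 16) {
  fits <- lapply(seq_len(n_starts), function(s) {
    eta0 <- rnorm(dim)                                 # N(0, I_16) starting value
    optim(eta0, log_post_eta, method = "BFGS",
          control = list(fnscale = -1, maxit = 500))   # fnscale = -1: maximize
  })
  best <- which.max(vapply(fits, function(f) f$value, numeric(1)))
  fits[[best]]$par                                     # eta^(0): highest posterior density found
}
```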
[Figure 4.11: Evolution of sampled values for θ for 11,000 iterations using Algorithm 2 for the SNP genotyping data. The first three panels are for slopes.]

[Figure 4.12: Autocorrelations of sampled values for θ for 11,000 iterations using Algorithm 2 for the SNP genotyping data.]

Figure 4.12 displays the autocorrelations of the sampled values for θ for 11,000 iterations. There are high autocorrelations for most parameters, indicating that the sampler is not very efficient; more efficient transformations and samplers will be investigated in future research. Further checking the density curves of the sampled values for θ for 11,000 iterations in Figure 4.13, we observe that all the estimated density curves are roughly unimodal.

The clustering results for this plate are displayed in Figure 4.14, where each point is classified into the cluster with the largest posterior membership probability. We note that the Bayesian method does assign several points to the variant allele homozygous genotype cluster (represented by “4” in Figure 4.14). Quite a few points are classified as background noise
(represented by “5”); this is conservative because the orthogonal variation in the wild-type allele homozygous genotype cluster XX is very small and the orthogonal variation of cluster YY is linked to that of cluster XX. By depleting points in the YY cluster, we have observed that the Bayesian method also works if only one point is present in the YY cluster, which is due to the informative priors and constraints. Some other plates with sparse clusters were also analyzed using the Bayesian approach and the clustering results are satisfactory; they are omitted here. For plates without sparse clusters, the effect of the above prior specification is minimal, and we usually obtain clustering results similar to those of the partial likelihood approach in Chapter 3. Figure 4.15 shows the clustering results of the Bayesian approach for the plate analyzed in Chapter 3.

[Figure 4.13: Density curves of sampled values for θ for 11,000 iterations using Algorithm 2 for the SNP genotyping data.]

[Figure 4.14: Clustering results of the SNP genotyping data in which points are classified into the cluster with the largest posterior membership probability.]

[Figure 4.15: Clustering results of the Bayesian approach for a plate without a sparse cluster.]
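The classification rule used for Figure 4.14, assigning each point to its most frequent sampled label, can be sketched in a few lines of R. This is an illustrative sketch only: the matrix name `z_draws` and the helper function are ours, not objects defined in the thesis.

```r
# Minimal sketch: classify each point by the most frequent cluster label across
# thinned MCMC iterations, as in Figure 4.14.
# `z_draws` is assumed to be a T x n integer matrix of sampled labels z_i^(t).
classify_points <- function(z_draws) {
  apply(z_draws, 2, function(zi) {
    tab   <- table(zi)
    label <- as.integer(names(which.max(tab)))   # modal label for point i
    prob  <- max(tab) / length(zi)               # its posterior membership frequency
    c(label = label, membership = prob)
  })
}
# classify_points(z_draws) returns, for every well on the plate, the called
# genotype cluster together with a measure of confidence in that call.
```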
4.6 Discussion

We have proposed a Bayesian approach to linear clustering. It is flexible in the sense that only the deviation from a hyperplane is modelled parametrically; the position on the hyperplane is not modelled. The advantage of this approach is illustrated in the genotyping example, where the distribution along the line is complex and difficult to model. Furthermore, as was also illustrated in the SNP genotyping application, we can incorporate elliptical clusters as necessary, as well as a background cluster, borrowing from standard model-based clustering.

In our examples, label-switching is prevented either by a Gibbs sampler applied to a posterior distribution with isolated modes or by informative priors. In more general situations, we may need the ideas of tempering MCMC or sequential Monte Carlo to explore the whole support of the posterior distribution and to deal with the label-switching problem.

In this paper, the number of linear clusters is assumed known. When the number of clusters is unknown, our first thought is to investigate the feasibility of the reversible jump MCMC of Richardson and Green (1997). This may imply a heavy computational burden. A related problem is the scalability of the Bayesian approach to large datasets and high dimensions. We leave these problems for future research.

Bibliography

Anderson, T. (2003). An Introduction to Multivariate Statistical Analysis. Wiley Series in Probability and Statistics.

Campbell, N. and R. Mahon (1974). A multivariate study of variation in two species of rock crab of genus Leptograpsus. Australian Journal of Zoology 22(3), 417–425.

Celeux, G., M. Hurn, and C. Robert (2000). Computational and inferential difficulties with mixture posterior distributions. Journal of the American Statistical Association 95(451).

Fraley, C. and A. E. Raftery (2006). mclust version 3 for R: Normal mixture modeling and model-based clustering. Technical Report 504, University of Washington, Department of Statistics.

Fujisawa, H., S. Eguchi, M. Ushijima, S. Miyata, Y. Miki, T. Muto, and M. Matsuura (2004). Genotyping of single nucleotide polymorphism using model-based clustering. Bioinformatics 20(5).

Fuller, W. (1987). Measurement Error Models. Wiley Series in Probability and Mathematical Statistics. New York: Wiley.

Gelman, A., J. Carlin, H. Stern, and D. Rubin (2004). Bayesian Data Analysis. Boca Raton, FL: Chapman and Hall/CRC.

Harrington, J. (2007). lga: Tools for linear grouping analysis (LGA). R package version 1.0-0.

Kang, H., Z. Qin, T. Niu, and J. Liu (2004). Incorporating genotyping uncertainty in haplotype inference for single-nucleotide polymorphisms. The American Journal of Human Genetics 74(3), 495–510.

Olivier, M., L. Chuang, M. Chang, Y. Chen, D. Pei, K. Ranade, A. de Witte, J. Allen, N. Tran, D. Curb, et al. (2002). High-throughput genotyping of single nucleotide polymorphisms using new biplex invader technology. Nucleic Acids Research 30(12), e53.

Peel, D. and G. McLachlan (2000). Robust mixture modelling using the t distribution. Statistics and Computing 10(4), 339–348.

Ranade, K., M. Chang, C. Ting, D. Pei, C. Hsiao, M. Olivier, R. Pesich, J. Hebert, Y. Chen, V. Dzau, et al. (2001). High-throughput genotyping with single nucleotide polymorphisms. Genome Research 11(7), 1262–1268.

Richardson, S. and P. Green (1997). On Bayesian analysis of mixtures with an unknown number of components. Journal of the Royal Statistical Society, Series B (Methodological) 59(4), 731–792.

Turner, T. (2006). mixreg: Functions to fit mixtures of regressions. R package version 0.0-2.

Van Aelst, S., X. Wang, R. Zamar, and R. Zhu (2006). Linear grouping using orthogonal regression. Computational Statistics and Data Analysis 50(5), 1287–1312.

Yan, G., W. Welch, and R. Zamar (2008). A likelihood approach to linear clustering. Preprint.
Chapter 5

Consistency and asymptotic normality in a partial likelihood approach to linear clustering

(A version of this chapter will be submitted for publication. Authors: Guohua Yan, William J. Welch and Ruben H. Zamar.)

5.1 Introduction

By linear clustering, we mean detecting linearly shaped clusters in a data set. We proposed in Yan et al. (2008) a parsimonious partial likelihood approach to linear clustering. Assume that the data x_1, ..., x_n are drawn by a random mechanism, represented by a random vector X in a d-dimensional space, which we do not model fully. Assume that the data lie around K hyperplanes {x : a_k'x = b_k}, k = 1, ..., K. Let Z = (Z_1, ..., Z_K)' be a random vector indicating these hyperplanes, with Z_k = 1 with probability p_k for k = 1, ..., K. Let p = (p_1, ..., p_K)'. We assume that, conditional on Z_k = 1,

a_k'X - b_k \sim N(0, \sigma_k^2), \qquad k = 1, \ldots, K.

Let z_1, ..., z_n be the corresponding unobservable indicators for the data x_1, ..., x_n. Let κ be the collection of component parameters, κ = (a_1', b_1, σ_1², ..., a_K', b_K, σ_K²)', and θ = (κ', p')'. The indicators z_1, ..., z_n can be regarded as realizations of random vectors Z_1, ..., Z_n, which in turn are an independent and identically distributed sample from Z. After integrating out the indicators z_1, ..., z_n, the partial likelihood function for the parameters θ is

L(\theta \mid x_1, \ldots, x_n) = \prod_{i=1}^{n} \sum_{k=1}^{K} p_k N(a_k'x_i - b_k;\, 0, \sigma_k^2).  (5.1)

As in the usual normal mixture model, the partial likelihood function (5.1) is unbounded: when a cluster consists only of points lying on a hyperplane, the contribution of each of these points to the partial likelihood tends to infinity. The infinity occurs on the boundary of the parameter space. We adopt the constraint of Hathaway (1985). Specifically, the following constraint is imposed on the standard deviations,

\min_{1 \le i \ne j \le K} (\sigma_i/\sigma_j) \ge c > 0,  (5.2)

where c is a known constant determined a priori. In this article, constraint (5.2) is assumed whenever we refer to the partial likelihood function (5.1).

The partial likelihood function (5.1) naturally brings the clustering problem into a finite mixture model framework. An EM algorithm can be used to maximize (5.1); once a maximum partial likelihood estimate θ̂ is obtained, data point x_i can be assigned to the component with the largest posterior probability. The probabilities are given by

\hat{w}_{ik} = \frac{\hat{p}_k N(\hat{a}_k'x_i - \hat{b}_k;\, 0, \hat{\sigma}_k^2)}{\sum_{l=1}^{K} \hat{p}_l N(\hat{a}_l'x_i - \hat{b}_l;\, 0, \hat{\sigma}_l^2)}, \qquad i = 1, \ldots, n; \; k = 1, \ldots, K,  (5.3)

which also serve as a measure of the uncertainty in classifying data point x_i.

We shall investigate the asymptotic properties of solutions to the partial likelihood function (5.1). As (5.1), regarded as a function of x, is not a density function for X, classical asymptotic results for maximum likelihood estimators cannot be used directly. We borrow ideas in the formulation and in the proofs of results from García-Escudero et al. (2007), Wald (1949), Redner (1981) and Hathaway (1983, 1985).

The rest of this article is organized as follows. Section 5.2 discusses the population counterpart of the partial likelihood function (5.1) and establishes the existence of its maximum. Section 5.3 proves the consistency of a maximizer of the partial likelihood function (5.1) for the maximizer of the population counterpart. The asymptotic normality of a solution of the partial likelihood function (5.1) is investigated in Section 5.4.
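The EM algorithm mentioned above can be sketched as follows for two-dimensional data. This is a minimal illustration, not the MLC implementation of Yan et al. (2008): the M-step uses the weighted orthogonal-regression update suggested by the single-cluster connection in Chapter 4, the Hathaway constraint (5.2) is not enforced, and no safeguards against empty components are included.

```r
# Minimal sketch: EM for the partial likelihood (5.1) with normal orthogonal
# deviations. M-step: weighted orthogonal regression per component.
em_linear_clust <- function(X, K, n_iter = 100) {
  n <- nrow(X); d <- ncol(X)
  z <- sample(K, n, replace = TRUE)                 # crude random initialization
  w <- outer(z, seq_len(K), "==") * 1
  p <- rep(1 / K, K); a <- matrix(0, d, K); b <- numeric(K); s2 <- numeric(K)
  for (it in seq_len(n_iter)) {
    # M-step: weighted orthogonal regression for each component
    for (k in seq_len(K)) {
      wk   <- w[, k]; p[k] <- mean(wk)
      xbar <- colSums(wk * X) / sum(wk)
      S    <- crossprod(sqrt(wk) * sweep(X, 2, xbar)) / sum(wk)
      eig  <- eigen(S, symmetric = TRUE)
      a[, k] <- eig$vectors[, d]                    # smallest-eigenvalue eigenvector
      b[k]   <- sum(a[, k] * xbar)
      s2[k]  <- eig$values[d]                       # weighted orthogonal variance
    }
    # E-step: posterior membership probabilities, equation (5.3)
    dens <- sapply(seq_len(K), function(k)
      p[k] * dnorm(as.vector(X %*% a[, k]) - b[k], mean = 0, sd = sqrt(s2[k])))
    w <- dens / rowSums(dens)
  }
  list(p = p, a = a, b = b, sigma2 = s2, w = w)
}
```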
5.2 Population version of the objective function

To motivate the population version of the objective function (5.1), we first review the connection between maximum likelihood estimation and the Kullback-Leibler divergence. See, for example, Kullback and Leibler (1951) and Eguchi and Copas (2006).

Suppose that probability distributions P and Q are absolutely continuous with respect to the Lebesgue measure λ. Let p(x) and q(x) be the density functions (Radon-Nikodym derivatives) of P and Q respectively with respect to λ. Then the Kullback-Leibler divergence from P to Q is defined as

D_{KL}(P \,\|\, Q) = \int \log \frac{p(x)}{q(x)}\, dP(x).

Given a random sample x_1, x_2, ..., x_n from the underlying distribution P, let P_n be the empirical distribution. Now let Q_θ be a statistical model having density f(x; θ) with respect to λ, where θ is a collection of unknown parameters. The Kullback-Leibler divergence from P to Q_θ is

D_{KL}(P \,\|\, Q_\theta) = \int [\log p(x) - \log f(x; \theta)]\, dP(x).

The empirical version of D_{KL}(P \| Q_θ) is

D_{KL}(P_n \,\|\, Q_\theta) = \frac{1}{n} \sum_{i=1}^{n} [\log(1/n) - \log f(x_i; \theta)] = \log(1/n) - \frac{1}{n} \sum_{i=1}^{n} \log f(x_i; \theta).

Apart from the factor 1/n, the second term is just the log likelihood function. Hence, maximizing the likelihood function is equivalent to minimizing the Kullback-Leibler divergence D_{KL}(P_n \| Q_θ). If P = Q_{θ_0} for some θ_0, then a maximum likelihood estimator is strongly consistent under some regularity conditions, i.e., it converges almost surely to argmin_θ D_{KL}(P \| Q_θ).

Back to the linear clustering setting, let

f(x; \theta) = \sum_{k=1}^{K} p_k f_k(x; \theta),  (5.4)

where

f_k(x; \theta) = \frac{1}{\sqrt{2\pi}\,\sigma_k} \exp\!\left( -\frac{(a_k'x - b_k)^2}{2\sigma_k^2} \right),

and x is a generic point in \mathbb{R}^d. The population version of the objective function is

g(\theta) = \int \log f(x; \theta)\, dP(x),  (5.5)

considered over the constrained parameter space

\Theta_c = \left\{ \theta :\; a_k'a_k = 1,\; \sigma_k > 0,\; k = 1, \ldots, K;\; \sum_{k=1}^{K} p_k = 1;\; \min_{i,j} \sigma_i/\sigma_j \ge c > 0 \right\}.

We show in this section that the supremum of g is finite and attainable. First we prove the following lemma.

Lemma 5.1 Let P be an absolutely continuous probability measure with finite second moments. Let 0 < c < 1 and c ≤ σ_k ≤ 1/c for all k = 1, ..., K. Let

s(\kappa) \equiv \int \min_k \frac{(a_k'x - b_k)^2}{\sigma_k^2}\, dP(x).

Then s(κ) attains its infimum s_0 > 0 under constraint (5.2) at some κ_0.

Proof It is obvious that 0 < s(κ) < ∞. First we show that it is possible to bound b_k, k = 1, ..., K, without missing the infimum. Suppose that there is a positive number r < |b_k| for k = 1, ..., K. Let r_0 > 0 be such that Prob(|X| ≤ r_0) > 0. Then when r > r_0,

s(\kappa) \ge c^2 (r - r_0)^2\, \mathrm{Prob}(|X| \le r_0) \to \infty, \quad \text{as } r \to \infty.

Now, without loss of generality, we assume that |b_k| ≤ v for some v and for k = 1, ..., K - 1. Let r = c^2(|b_K| - v). Then, as |b_K| gets large,

s(\kappa) \ge \int_{|x| \le r} \min_{1 \le k \le K} \frac{(a_k'x - b_k)^2}{\sigma_k^2}\, dP(x) = \int_{|x| \le r} \min_{1 \le k \le K-1} \frac{(a_k'x - b_k)^2}{\sigma_k^2}\, dP(x) \to \int \min_{1 \le k \le K-1} \frac{(a_k'x - b_k)^2}{\sigma_k^2}\, dP(x), \quad \text{as } |b_K| \to \infty,

and the last integral is greater than or equal to \int \min_{1 \le k \le K} \frac{(a_k'x - b_k)^2}{\sigma_k^2}\, dP(x), where the |b_K| in the last term is an arbitrary number bounded from below by v. Since

\min_k \frac{(a_k'x - b_k)^2}{\sigma_k^2} \le \frac{2(|x|^2 + v^2)}{c^2},

by Lebesgue's dominated convergence theorem, s is continuous. Now that it is constrained on a compact set, the infimum of s is attainable at some κ_0 with s_0 = s(κ_0) > 0.

Theorem 5.1 Let P be an absolutely continuous probability measure with finite second moments. Then the supremum of g, which is defined in (5.5), over Θ_c is finite and attainable.

Proof Let σ_k² = τ_k²σ² for k = 1, ..., K. For every x and θ, we have

\log f(x, \theta) \le \log \max_k f_k(x, \theta) = \max_k \log f_k(x, \theta).
Therefore, g(θ) ≤ ∫ max k log fk(x,θ)dP (x) = ∫ −min k { 1 2 log(2piτ2kσ 2) + (a′kx− bk)2 2τ2kσ 2 } dP (x) ≤ −1 2 log(2pic2)− log σ − 1 2σ2 {∫ min k (a′kx− bk)2 τ2k dP (x) } ≤ −1 2 log(2pic2)− log σ − s0 2σ2 , 84 Chapter 5. Consistency and asymptotic normality where Lemma 5.1 is used. Clearly, the last term attains its maximum at σ2 = s0 and it goes to −∞ as σ → 0 or∞. Take an arbitrary θ0, there exist positive numbers s1 and s2 such that g(θ) < g(θ0) uniformly for θ such that σ < s1/c or σ > cs2. Now if we bound s1/c ≤ σ ≤ cs2, we have log f(x,θ) ≥ logmin k fk(x,θ) = min log fk(x,θ) = min(−1 2 log(2piτ2kσ 2)− (a ′ kx− bk)2 2τ2kσ 2 ) ≥ −1 2 log(2pis22)− 2(|x|2 +maxk |bk|2) 2s21 , and log f(x,θ) ≤ logmax k fk(x,θ) ≤ −1 2 log(2pis21). By Lebesgue’s dominated convergence theorem, g is continuous. If bk → ±∞ for some k, then by Fatou’s lemma, we have lim sup bk→±∞ g(θ) ≤ ∫ lim sup bk→±∞ log f(x,θ)dP (x) = ∫ log ∑ k′ 6=k pk′fk′(x,θ)dP (x). Therefore, there exist real numbers t1 and t2 such that we would not miss the supremum of g, if we bound θ in the compact set {θ : s1 ≤ σ2k ≤ s2, t1 ≤ bk ≤ t2, for all k = 1, . . . ,K}. Since g is continuous, its supremum is attainable. Remark In the proof it is not ruled out that some pk may be 0 for g to attain its maximum. 5.3 Consistency First we prove that there exists a global maximizer of the partial likelihood function (5.1) over Θc. Theorem 5.2 Let {x1, . . . ,xn} be a set of points in r, then (a′kxi − bk) 2 ≥ (a′kxi − rsgn(bk))2. So l(θ) is not decreased if bk is replaced with rsgn(bk) whenever |bk| ≥ r. So we can bound θ such that |bk| ≤ r, for all k = 1, . . . ,K. Let s0 = min{(a1,b1,...,aK ,bK):a′kak=1,|bk|≤r} max i min k (a′kxi − bk)2. Since the points x1, . . . ,xn cannot fit in K hyperplanes, we have s0 > 0. There exists a point xj such that (a ′ kxj − bk)2 ≥ s0 for all k = 1, . . . ,K. Note also that 1/σk ≤ 1/(cσ1) and that 1/σk ≥ c/σ1, we have l(θ) ≤ ∑ i6=j log( ∑ k pk 1√ 2piσk ) + log( ∑ k pk 1√ 2piσk exp(− s0 2σ2k )) ≤ (n− 1) log( ∑ k pk 1√ 2picσ1 ) + log( ∑ k pk 1√ 2picσ1 exp(− cs0 2σ21 )) = −n log( √ 2picσ1)− cs0 2σ21 . The last term tends to −∞ as σ21 tends to zero. If some σ2k tends to zero, so do all other σ2ks. Thus, there exists a positive number s1 such that l(θ) ≤ l(θ0) whenever a σ2k < s1. On the other hand, l(θ) ≤ ∑ i log( ∑ k pk 1√ 2piσk ) ≤ n log(max k 1√ 2piσk ) = −nmin k log( √ 2piσk). If some σ2k tends to ∞, so do all other σ2k’s and then l(θ) decreases to −∞. Similarly, there is a positive number s2 such that l(θ) ≤ l(θ0) whenever a σ2k > s2. Therefore, we can bound θ in a compact set {θ : |bk| ≤ r, s1 ≤ σ2k ≤ s2, for all k = 1, . . . ,K}. As l is continuous, a global maximizer over Θc exists. Corollary 5.1 Let P be an absolutely continuous probability measure with finite second moments. Let X1, . . . ,Xn be a sample from P . Then a con- strained global maximizer of l(θ|X1, . . . ,Xn) exists almost surely if n ≥ Kd+ 1. 86 Chapter 5. Consistency and asymptotic normality Proof As P is absolute continuous,X1, . . . ,Xn are distinct with probability 1. We need Kd points to determine K hyperplanes. Each of the remaining points lies on one of these hyperplanes with probability 0, since the set of points in K hyperplanes has zero probability. To prove the consistency result, we need some preliminary results. Lemma 5.2 If the probability measure P has finite second moments, then∫ | log f(x,θ)|dP (x) <∞. 
Proof ∫ | log f(x,θ)| dP (x) = ∫ [log f(x,θ)]+ dP (x) + ∫ [log f(x,θ)]−dP (x) ≤ ∫ {x:f(x,θ)≥1} [logmax k fk(x,θ)] +dP (x) + ∫ {x:f(x,θ)<1} [logmin k fk(x,θ)] −dP (x) ≤ ∫ {x:f(x,θ)≥1} ∑ k [log fk(x,θ)] +dP (x) + ∫ {x:f(x,θ)<1} ∑ k [log fk(x,θ)] −dP (x) = ∑ k ∫ | log fk(x,θ)|dP (x) ≤ ∑ k [| log( √ 2piσk)|+ ∫ (a′kx− bk)2 2σ2k dP (x)] <∞. Lemma 5.3 Let w(x,θ, ρ) = sup {θ′:|θ′−θ|≤ρ} f(x,θ′). Then ∫ [logw(x,θ, ρ)]+dP (x) <∞. 87 Chapter 5. Consistency and asymptotic normality Proof In fact, w(x,θ, ρ) ≤ sup {θ′:|θ′−θ|≤ρ} ∑ k fk(x,θ) ≤ sup {θ′:|θ′−θ|≤ρ} ∑ k 1√ 2piσk ≡ M(θ, ρ). Therefore,∫ [logw(x,θ, ρ)]+dP (x) ≤ ∫ [logM(θ, ρ)]+dP (x) <∞. Lemma 5.4 lim ρ→0 ∫ logw(x,θ, ρ)dP (x) = ∫ log f(x,θ)dP (x). Proof Since logw(x,θ, ·) is an increasing function of ρ, as ρ→ 0, [logw(x, θ, ρ)]− increases. By the monotone convergence theorem, lim ρ→0 ∫ [logw(x,θ, ρ)]−dP (x) = ∫ [log f(x,θ)]−dP (x). When ρ is sufficiently small, [logw(x,θ, ρ)]+ is dominated by [logw(x, θ, ρ0)] + for some ρ0. And the latter is integrable by Lemma 5.3. By Lebesgue’s dominated convergence theorem, lim ρ→0 ∫ [logw(x,θ, ρ)]+dP (x) = ∫ [log f(x,θ)]+dP (x). Notice that ∫ [logw(x,θ, ρ)]+dP (x) and ∫ [log f(x,θ)]+dP (x) are finite, the lemma is proved. Note that points in Θc are not identifiable for f(x; ·). The function f(x; ·) remains the same if we permutate the labels 1, . . ., K; pk1 and pk2 are not identifiable if (ak1 , bk1 , σ 2 k1 ) = (ak2 , bk2 , σ 2 k2 ). Thus the consistency result is in a quotient topology space. Let ∼ be an equivalent relation on Θc such that θ1 ∼ θ2 if and only if f(x;θ1) = f(x;θ2) almost surely in P . Denote by Θqc the quotient topological space consisting of all equivalent classes of ∼. For a point θ0 that maximizes ∫ log f(x;θ)dP (x), its equivalent class in Θqc is denoted by θ q 0. 88 Chapter 5. Consistency and asymptotic normality Theorem 5.3 Let X1, . . . ,Xn be a sample from P which is absolutely con- tinuous with finite second moments. Let B be a compact subset of Θc that contains C0. Let θ̂ (n) be a global maximizer of l(θ|X1, . . . ,Xn) on B. Then θ̂ (n) → θq0 almost surely in the topological space Bq. Proof Let ω be a closed subset of B which does not intersect with C0. For each point θ in ω, we associate a positive value ρθ such that E logw(X,θ, ρθ) < E log f(X,θ0), where θ0 is a point in C0. The existence of such a ρθ follows from Lemma 5.4 and the definition of C0 (C0 exists because P has finite second moments, by Theorem 5.1; θ̂ (n) exists because P is absolutely continuous, by Theo- rem 5.2). Since B is compact, its closed subset ω is also compact. There exists a finite number of points θ1, . . . ,θh in ω such that ω ⊂ h⋃ j=1 {θ′ : |θ′ − θj | ≤ ρθj}. Then 0 ≤ sup θ∈ω f(x1,θ) . . . f(xn,θ) ≤ h∑ j=1 w(x1,θj , ρθj ) . . . w(xn,θj , ρθj ). By the strong law of large numbers, Prob { lim n→∞ n∑ i=1 [logw(Xi,θj , ρθj )− log f(Xi,θ0)] = −∞ } = 1, for j = 1, . . . , h. That is, Prob { lim n→∞ w(X1,θj , ρθj ) . . . w(Xn,θj , ρθj ) f(X1,θ0) . . . f(Xn,θ0) = 0 } = 1, for j = 1, . . . , h. Therefore, we have Prob { lim n→∞ supθ∈ω f(X1,θ) . . . f(Xn,θ) f(X1,θ0) . . . f(Xn,θ0) = 0 } = 1. (5.6) Denote |θ − C0| ≡ minθ0∈C0 |θ − θ0|. The minimum is attainable since C is a closed set in B and hence compact. We need only to prove that 89 Chapter 5. Consistency and asymptotic normality all limit points θ∗ of the sequence θ̂ (n) are in C0. If not, there exists a limit point θ∗ and an ² > 0 such that |θ∗ − C0| ≥ ². This implies that there are infinitely many θ̂ (n) that lie in ω² ≡ {θ : |θ − C0| ≥ ²}. 
Thus f(x1, θ̂ (n) ) . . . f(xn, θ̂ (n) ) ≤ supθ∈ω² f(x1,θ) . . . f(xn,θ) for infinitely many n. Since f(x1, θ̂ (n) ) . . . f(xn, θ̂ (n) ) ≥ f(x1,θ0) . . . f(xn,θ0). We have f(x1,θ0) . . . f(xn,θ0) ≤ sup θ∈ω² f(x1,θ) . . . f(xn,θ), for infinitely many n. Since ω² is a closed set in B, this is an event with probability zero according to equation 5.6. This completes the proof. In next step, we shall show that limit points of the constrained global maximizer θ̂ (n) over Θc are almost surely in a compact space and hence consistency follows from Theorem 5.3. Lemma 5.5 Let X1, . . . ,Xn be a sample of P which is absolutely continu- ous with finite second moments. Let θ̂ (n) is a constrained global maximizer of l(θ|X1, . . . ,Xn) over Θc. Then there exist positive numbers s1 and s2 such that Prob{lim inf n→∞ mink (σ̂ (n) k ) 2 ≥ s1} = 1, (5.7) and Prob{lim sup n→∞ max k (σ̂ (n) k ) 2 ≤ s2} = 1. (5.8) Proof First we prove Equation (5.7). Let σ̂ (n) k = τ̂ (n) k σ̂ (n) 1 for k = 1, . . . ,K. Then τ̂ (n) k ∈ [c, 1/c]. We write h(σ) = l(p̂ (n) 1 , . . . , p̂ (n) K , â (n) 1 , . . . , â (n) K , b̂ (n) 1 , . . . , b̂ (n) K , τ̂ (n) 1 σ, . . . , τ̂ (n) K σ). Then it satisfies dh(σ) dσ | σ̂ (n) 1 = 0. This gives rise to n∑ i=1 − 1 σ̂ (n) 1 f(Xi, θ̂ (n) ) + 1 (σ̂ (n) 1 ) 3 ∑K k=1 ((â (n) k )′Xi−b̂(n)k )2 (τ̂ (n) k )2 p̂ (n) k fk(Xi, θ̂ (n) ) f(Xi, θ̂ (n) ) = 0. 90 Chapter 5. Consistency and asymptotic normality Solving for σ̂ (n) 1 yields (σ̂ (n) 1 ) 2 = 1 n n∑ i=1 ∑K k=1 ((â (n) k )′Xi−b̂(n)k )2 (τ̂ (n) k )2 p̂ (n) k fk(Xi, θ̂ (n) ) f(Xi, θ̂ (n) ) . Then, we have (σ̂ (n) 1 ) 2 ≥ c2 1 n n∑ i=1 min 1≤k≤K ((â (n) k ) ′Xi − b̂(n)k )2. (5.9) We now show that for n0 ≥ Kd+ 1, s(n0) = E{ inf{(a1,...,aK ,b1,...,bK):a′kak=1} n0∑ i=1 min 1≤k≤K (a′kXi − bk)2} > 0. (5.10) In fact,∫ inf {(a1,...,aK ,b1,...,bK):a′kak=1} n0∑ i=1 min 1≤k≤K (a′kxi − bk)2dPn0(x) = ∫ inf {(a1,...,aK ,b1,...,bK):a′kak=1,|bk|≤r(x1,...,xn0 )} n0∑ i=1 min 1≤k≤K (a′kxi − bk)2dPn0(x) = ∫ min {(a1,...,aK ,b1,...,bK):a′kak=1,|bk|≤r(x1,...,xn0 )} n0∑ i=1 min 1≤k≤K (a′kxi − bk)2dPn0(x) ≡ ∫ s0(x1, . . . ,xn0)dP n0(x) > 0, since s0(X1, . . . ,Xn0) > 0 almost surely from Corollary5.1 and the argument in Theorem 5.2. By the strong law of large numbers, we have 1 m m∑ j=1 { inf {(a1,...,aK ,b1,...,bK):a′kak=1} n0∑ i=1 min 1≤k≤K (a′kX(j−1)n0+i − bk)2 } → s(n0), with probability 1. Let s1 = c 2s(n0, P )/n0. Then by equation (5.9), lim inf n→∞ (σ̂ (n) 1 ) 2 ≥c2 lim inf m→∞ { 1 mn0 inf {(a1,...,aK ,b1,...,bK):a′kak=1} mn0∑ i=1 min 1≤k≤K (a′kXi − bk)2} ≥ c 2 n0 lim m→∞ 1 m m∑ j=1 { inf {(a1,...,aK ,b1,...,bK):a′kak=1} n0∑ i=1 min 1≤k≤K (a′kX(j−1)n0+i − bk)2} =s1, 91 Chapter 5. Consistency and asymptotic normality with probability 1. The same s1 serves as a lower bound of lim infn(σ̂ (n) k ) 2 for k = 2, . . . ,K as well. Noticing that s1 does not depend on a set of ob- servations x1, . . . ,xn, we have proved Equation (5.7). Now we prove Equation (5.8). Let Θsc = {θ : θ ∈ Θc, σ2k > s for some k}. Define q(x, s) = sup θ∈Θsc f(x,θ). Then q(x, s) ≤ 1√ 2pisc . By Lemma 5.2, E| log f(X,θ0)| <∞. There exists a positive number s2, such that E(log(q(X, s2))) < E(log f(X,θ0). By the strong law of large numbers, Prob{ lim n→∞ n∑ i=1 [log q(Xi, s2)− log f(Xi,θ0)] = −∞} = 1. This implies that Prob{ lim n→∞ supθ∈Θs2c f(X1,θ) . . . f(Xn,θ) f(X1,θ0) . . . f(Xn,θ0) = 0} = 1. Equation 5.8 follows immediately, since θ̂ (n) always satisfies f(X1, θ̂ (n) ) . . . f(Xn, θ̂ (n) ) f(X1,θ0) . . . f(Xn,θ0) ≥ 1. Now we prove the main consistency result. Theorem 5.4 Let X1, . . . 
,Xn be a sample from an absolutely continuous probability measure P with finite second moments. Let θ̂ (n) be a global max- imizer of l(θ |X1, . . . ,Xn) over Θc. Then θ̂(n) → θq0 almost surely in the topological space Θqc. Proof From Lemma 5.5, we need only to consider the subspace of Θc, Θsc = {θ ∈ Θc : s1 ≤ σ2k ≤ s2, for all k = 1, . . . ,K}, 92 Chapter 5. Consistency and asymptotic normality where s1, s2 are positive numbers determined in Lemma 5.5. Since Θsc is not compact, Theorem 5.3 cannot be used directly. We shall use the compactification device in Hathaway (1985) and also in Kiefer and Wolfowitz (1956). In the space Θsc, define the metric δ(θ,θ′) = ∑ i | arctanθi − arctanθ′i|, where | · | is the Euclidean distance and θi and θ′i are components of θ and θ′ respectively. Let Θ̄sc be the set of Θsc along with all its limit points. Then Θ̄sc = { θ : 0 ≤ pk ≤ 1, ∑K k=1 pk = 1,mini,j σi/σj ≥ c > 0, a′kak = 1,−∞ ≤ bk ≤ ∞, s1 ≤ σk ≤ s2, k = 1, . . . ,K. } is compact. Since f(x, ·) is continuous on Θsc, it can be extended to Θ̄sc as f(x,θ) = ∑ k pkI(∞ < bk <∞)fk(x,θ). We have shown g(θ) = E(log f(x,θ)) is continuous on Θsc. It is continuous on Θ̄sc as well. To see this, let θ (n) be a sequence tending to θ. If all bk = ±∞, or all pk = 0 whenever bk 6= ±∞, then f(x,θ) = 0 and g(θ) = −∞. In this case, g(θ(n)) = ∫ log f(x,θ(n))dP (x) ≤ ∫ max {k:pk>0} log fk(x,θ (n))dP (x) → ∫ − min {k:pk>0} [ 1 2 log(2piσ2k) + (a′kx− bk)2 2σ2k ] dP (x) = −∞. If there is some k such that pk 6= 0 and bk 6= ±∞. Then log f(x,θ(n)) ≤ logmax fk(x,θ(n)) ≤ −1 2 log(2pis1), and log f(x,θ(n)) ≥ log min {k:pk>0,bk 6=±∞} p (n) k fk(x,θ (n)) ≥ log min {k:pk>0,bk 6=±∞} pkfk(x,θ)− ², 93 Chapter 5. Consistency and asymptotic normality for some ² > 0 when n is sufficiently large. By Lemma 5.2, the latter term is integrable. By Lebesgue’s dominated convergence theorem, g is continuous at θ. Lemmas 5.3 and 5.4 are easily seen to hold and hence we can repeat literally the proof of Theorem 5.3. This completes the proof. 5.4 Asymptotical normality The global maximizer θ̂ (n) converges in the above quotient topological space. There is hence a subsequence, still denoted by θ̂ (n) , which converges to a point θ0 in C0 in the original space. If θ0 is an interior point of Θc, then we can expand l′(θ̂ (n) ) about θ0, l′(θ̂ (n) ) =l′(θ0) + l′′(θ0)(θ̂ (n) − θ0) + 1 2 [(θ̂ (n) − θ0)T l′′′1 (θ∗)(θ̂ (n) − θ0)], . . . , (θ̂(n) − θ0)T l′′′s (θ∗)(θ̂ (n) − θ0)]T , (5.11) where s is the dimension of θ and θ∗ is a point on the line segment connecting θ̂ (n) and θ0; l ′ is an s-dimensional vector of first derivatives of l, l′′ and l′′′j are s× s matrices of second and third derivatives of l. For ease of notation, we do not differentiate which components of p and ak are free parameters and which are not; it is assumed that the constraints ∑ pk = 1 and a ′ kak = 1 are taken care in the calculation of these derivatives. The left-hand side of (5.11) is 0, because θ̂ (n) satisfies the first order conditions. Let v0(θ0) = E[((log f(X,θ0)) ′]. Then v0(θ0) = g ′(θ0) = 0 by Lebesgue’s dominated convergence theorem. Let v1(θ0) = E[((log f(X,θ0)) ′)2]. It is straightforward to verify that all the entries of v1(θ0) are finite if the underlying distribution P has finite fourth moments. Then by the central limit theorem, the first derivative l′(θ0)/ √ n is asymptotical normal, 1√ n l′(θ0) L→ N(0, v1(θ0)). 94 Chapter 5. Consistency and asymptotic normality Let v2(θ0) = E[(log f(X,θ0)) ′′], and v3(θ0) = E[(log f(X,θ0)) ′′′]. 
Again, it is straightforward but tedious to verify that all the entries of v_2(θ_0) are finite if P has finite fourth moments, and that all the entries of v_3(θ_0) are finite if P has finite sixth moments. By the strong law of large numbers, the second derivative l''(θ_0)/n tends to a constant matrix,

\frac{1}{n} l''(\theta_0) \to v_2(\theta_0),

and the third derivative l''' is bounded entry-wise. So we have the following theorem.

Theorem 5.5 Let X_1, ..., X_n be a sample from an absolutely continuous probability measure P with finite sixth moments. Let θ̂^{(n)} be a subsequence of global maximizers of l(θ | X_1, ..., X_n) over Θ_c which tends to an interior point θ_0. Then

\sqrt{n}\,(\hat{\theta}^{(n)} - \theta_0) \xrightarrow{L} N(0, v(\theta_0)),

where v(θ_0) = [v_2(θ_0)]^+ v_1(θ_0) [v_2(θ_0)]^+ and A^+ is the Moore-Penrose inverse of a matrix A.

In equation (5.3), ŵ_{ik} is a function of x_i and θ̂. Denote ŵ_{ik} = h_k(x_i, θ̂), k = 1, ..., K. By the delta method, we have the following corollary.

Corollary 5.2 Let X_1, ..., X_n be a sample from an absolutely continuous probability measure P with finite sixth moments. Let θ̂^{(n)} be a subsequence of global maximizers of L(θ | X_1, ..., X_n) over Θ_c which tends to an interior point θ_0. Let x be a data point. Then

\sqrt{n}\,\big(h_k(x, \hat{\theta}^{(n)}) - h_k(x, \theta_0)\big) \xrightarrow{L} N\big(0,\, [h_k^{(0)}(x, \theta_0)]'\, v(\theta_0)\, [h_k^{(0)}(x, \theta_0)]\big),

for k = 1, ..., K, where h_k^{(0)}(x, θ_0) = ∂h_k(x, θ)/∂θ evaluated at θ = θ_0.

Using this corollary, we can build approximate confidence intervals for w_{ik} by replacing θ_0 with θ̂ and hence evaluate the clustering of a data set.

Bibliography

Eguchi, S. and J. Copas (2006). Interpreting Kullback–Leibler divergence with the Neyman–Pearson lemma. Journal of Multivariate Analysis 97(9), 2034–2040.

García-Escudero, L., A. Gordaliza, R. San Martín, S. van Aelst, and R. Zamar (2007). Robust linear clustering. Preprint.

Hathaway, R. (1983). Constrained maximum-likelihood estimation for a mixture of m univariate normal distributions. Ph.D. thesis, Rice University.

Hathaway, R. (1985). A constrained formulation of maximum-likelihood estimation for normal mixture distributions. The Annals of Statistics 13(2), 795–800.

Kullback, S. and R. Leibler (1951). On information and sufficiency. The Annals of Mathematical Statistics 22(1), 79–86.

Redner, R. (1981). Note on the consistency of the maximum likelihood estimate for nonidentifiable distributions. The Annals of Statistics 9(1), 225–228.

Wald, A. (1949). Note on the consistency of the maximum likelihood estimate. The Annals of Mathematical Statistics 20(4), 595–601.

Yan, G., W. Welch, and R. Zamar (2008). A likelihood approach to linear clustering. Preprint.

Chapter 6

Future Work

In this thesis work, we have developed a SNP genotype calling algorithm based on the linear grouping algorithm (Van Aelst et al., 2006), proposed a flexible model-based approach to linear clustering, and introduced a Bayesian approach to linear clustering with specific relevance to the SNP genotyping problem. In this chapter, we briefly describe a few possible directions for future research.

6.1 Robustness consideration

Robustness to outliers is desirable in linear clustering, as the assumption of normal deviations around hyperplanes is sensitive to large deviations in the orthogonal direction. In addition to the inclusion of a uniform background cluster (Banfield and Raftery, 1993), one option would be to use a heavier-tailed distribution, for example, Student's t distribution with small degrees of freedom or with degrees of freedom depending on the data.
In the partial likelihood approach, this would mean adapting the EM algorithm of Peel and McLachlan (2000) for t mixture models from the elliptical context to the linear clustering setting. The adaptation is straightforward but computationally more expensive. Further ideas include estimating the component covariance matrices in the M-step in a robust way, for example, by trimming off some points as done by García-Escudero et al. (2007). In the Bayesian approach, we have already used a Student's t distribution with small degrees of freedom.

6.2 Asymptotics

For the partial likelihood approach, we produced some asymptotic results in the case of normal orthogonal deviations. For more general cases, such as Student's t distribution, these results have not been established. We shall study the asymptotic behaviour in more general cases.

6.3 Model extension

With a'x = b, we are specifying a hyperplane of d - 1 dimensions. With little effort, this could be generalized to a mixture of partial likelihoods, each of which specifies a hyperplane of dimension q < d,

l(\kappa, p \mid x_{1:n}) = \prod_{i=1}^{n} \sum_{k=1}^{K} p_k N(A_k'x_i - b_k;\, 0, \Sigma_k),  (6.1)

where A_k is of dimension d × (d - q), b_k is a vector of dimension d - q, and Σ_k is a (d - q) × (d - q) covariance matrix for the deviation from the hyperplane. In the extreme case of a 0-dimensional hyperplane, which is a point, we have the usual mixture of multivariate normal distributions. A mixture of components with various dimensions could also be considered.

6.4 Variable/model selection

For a large, high dimensional dataset, efficient computation is essential in order to uncover linear patterns. In addition to developing more efficient algorithms, one idea is to use a subset of the data and/or a subset of the variables. To this end, we are interested in methods to screen variables as well as to determine the number of linear structures.

6.5 Bayesian computation

In our examples in the Bayesian approach, label-switching is prevented either by a Gibbs sampler applied to a posterior distribution with isolated modes or by informative priors. In more general situations, we may need the ideas of tempering MCMC or sequential Monte Carlo to explore the whole support of the posterior distribution and to deal with the label-switching problem. In addition, the number of linear clusters is assumed known in our approach. When the number of clusters is unknown, our first thought is to investigate the feasibility of the reversible jump MCMC of Richardson and Green (1997). This may imply a heavy computational burden. A related problem is again the scalability of the Bayesian approach to large datasets and high dimensions. We leave these problems for further research.

Bibliography

Banfield, J. and A. Raftery (1993). Model-based Gaussian and non-Gaussian clustering. Biometrics 49(3), 803–821.

García-Escudero, L., A. Gordaliza, R. San Martín, S. van Aelst, and R. Zamar (2007). Robust linear clustering. Preprint.

Peel, D. and G. McLachlan (2000). Robust mixture modelling using the t distribution. Statistics and Computing 10(4), 339–348.

Richardson, S. and P. Green (1997). On Bayesian analysis of mixtures with an unknown number of components. Journal of the Royal Statistical Society, Series B (Methodological) 59(4), 731–792.

Van Aelst, S., X. Wang, R. Zamar, and R. Zhu (2006). Linear grouping using orthogonal regression. Computational Statistics and Data Analysis 50(5), 1287–1312.
99 Appendix A Glossary of some genetic terms (Extracted from the website http://www.genome.gov/glossary.cfm.) adenine (A) One of the four bases in DNA that make up the letters ATGC, adenine is the ”A”. The others are guanine, cytosine, and thymine. Adenine always pairs with thymine. allele One of the variant forms of a gene at a particular locus, or location, on a chromosome. Different alleles produce variation in inherited characteristics such as hair color or blood type. In an individual, one form of the allele (the dominant one) may be expressed more than another form (the recessive one). base pair Two bases which form a ”rung of the DNA ladder.” A DNA nucleotide is made of a molecule of sugar, a molecule of phosphoric acid, and a molecule called a base. The bases are the ”letters” that spell out the genetic code. In DNA, the code letters are A, T, G, and C, which stand for the chemicals adenine, thymine, guanine, and cytosine, respectively. In base pairing, adenine always pairs with thymine, and guanine always pairs with cytosine. chromosome One of the thread-like ”packages” of genes and other DNA in the nucleus of a cell. Different kinds of organisms have different numbers of chromosome. Humans have 23 pairs of chromosomes, 46 in all: 44 autosomes and two sex chromosomes. Each parent contributes one chromosome to each pair, so children get half of their chromosomes from their mothers and half from their fathers. 100 Appendix A. Glossary of some genetic terms cytosine One of the four bases in DNA that make up the letters ATGC, cytosine is the ”C”. The others are adenine, guanine, and thymine. Cytosine always pairs with guanine. deoxyribonucleic acid (DNA) The chemical inside the nucleus of a cell that carries the genetic instructions for making living organisms. diploid The number of chromosomes in most cells except the gametes. In humans, the diploid number is 46. dominant A gene that almost always results in a specific physical charac- teristic, for example, a disease, even though the patient’s genome possesses only one copy. With a dominant gene, the chance of passing on the gene (and therefore the disease) to children is 50-50 in each pregnancy. gene The functional and physical unit of heredity passed from parent to offspring. Genes are pieces of DNA, and most genes contain the information for making a specific protein. gene amplification An increase in the number of copies of any particular piece of DNA. A tumor cell amplifies, or copies, DNA segments naturally as a result of cell signals and sometimes environmental events. gene expression The process by which proteins are made from the instruc- tions encoded in DNA. genetic code (ATCG) The instructions in a gene that tell the cell how to make a specific protein. A, T, G, and C are the ”letters” of the DNA code; they stand for the chemicals adenine, thymine, guanine, and cytosine, respectively, that make up the nucleotide bases of DNA. Each gene’s code combines the four chemicals in various ways to spell out 3-letter ”words” that specify which amino acid is needed at every step in making a protein. genetic marker A segment of DNA with an identifiable physical location on a chromosome and whose inheritance can be followed. A marker can be a gene, or it can be some section of DNA with no known function. Because DNA segments that lie near each other on a chromosome tend to be inherited together, markers are often used as indirect ways of tracking the inheritance 101 Appendix A. 
Glossary of some genetic terms pattern of a gene that has not yet been identified, but whose approximate location is known. genome All the DNA contained in an organism or a cell, which includes both the chromosomes within the nucleus and the DNA in mitochondria. genotype The genetic identity of an individual that does not show as out- ward characteristics. guanine One of the four bases in DNA that make up the letters ATGC, guanine is the ”G”. The others are adenine, cytosine, and thymine. Guanine always pairs with cytosine. haploid The number of chromosomes in a sperm or egg cell, half the diploid number. Haplotype It refers to a set of SNPs found to be statistically associated on a single chromatid. With this knowledge, the identification of a few alleles of a haplotype block unambiguously identifies all other polymorphic sites in this region. Such information is most valuable to investigate the genet- ics behind common diseases and is collected by the International HapMap Project (http://en.wikipedia.org/wiki/Haplotype). Hardy-Weinberg equilibrium (HWE) It states that, under certain con- ditions, after on generation of random mating, the genotype frequencies at a single gene locus will become fixed at a particular equilibrium value. It also specifies that those equilibrium frequencies can be represented as a sim- ple function of the allele frequencies at that locus.(http://en.wikipedia.org /wiki /Hardy-Weinberg equilibrium). heterozygous Possessing two different forms of a particular gene, one in- herited from each parent. homozygous Possessing two identical forms of a particular gene, one in- herited from each parent. linkage The association of genes and/or markers that lie near each other on a chromosome. Linked genes and markers tend to be inherited together. 102 Appendix A. Glossary of some genetic terms linkage disequilibrium (LD) The non-random association of alleles at two or more loci on a chromosome. It describes a situation in which some combinations of alleles or genetic markers occur more or less frequently in a population than would be expected from a random formation of haplotypes from alleles based on their frequencies. In population genetics, linkage dise- quilibrium is said to characterize the haplotype distribution at two or more loci (http://en.wikipedia.org/wiki/Linkage disequilibrium). locus The place on a chromosome where a specific gene is located, a kind of address for the gene. The plural is ”loci,” not ”locuses.” non-coding DNA The strand of DNA that does not carry the information necessary to make a protein. The non-coding strand is the mirror image of the coding strand and is also known as the antisense strand. nucleotide One of the structural components, or building blocks, of DNA and RNA. A nucleotide consists of a base (one of four chemicals: adenine, thymine, guanine, and cytosine) plus a molecule of sugar and one of phos- phoric acid. phenotype The observable traits or characteristics of an organism, for ex- ample hair color, weight, or the presence or absence of a disease. Phenotypic traits are not necessarily genetic. polymerase chain reaction (PCR) A fast, inexpensive technique for making an unlimited number of copies of any piece of DNA. Sometimes called ”molecular photocopying,” PCR has had an immense impact on biology and medicine, especially genetic research. polymorphism A common variation in the sequence of DNA among indi- viduals. primer A short oligonucleotide sequence used in a polymerase chain reac- tion. 
probe A piece of labeled DNA or RNA or an antibody used to detect the function of a gene. 103 Appendix A. Glossary of some genetic terms single nucleotide polymorphism Common, but minute, variations that occur in human DNA at a frequency of one every 1,000 bases. These vari- ations can be used to track inheritance in families. SNP is pronounced ”snip”. thymine One of the four bases in DNA that make up the letters ATGC, thymine is the ”T”. The others are adenine, guanine, and cytosine. Thymine always pairs with adenine. wild-type allele The allele designated as the standard (“normal”) for a strain of organism. 104"@en ; edm:hasType "Thesis/Dissertation"@en ; vivo:dateIssued "2008-11"@en ; edm:isShownAt "10.14288/1.0066454"@en ; dcterms:language "eng"@en ; ns0:degreeDiscipline "Statistics"@en ; edm:provider "Vancouver : University of British Columbia Library"@en ; dcterms:publisher "University of British Columbia"@en ; dcterms:rights "Attribution-NonCommercial-NoDerivatives 4.0 International"@en ; ns0:rightsURI "http://creativecommons.org/licenses/by-nc-nd/4.0/"@en ; ns0:scholarLevel "Graduate"@en ; dcterms:title "Linear clustering with application to single nucleotide polymorphism genotyping"@en ; dcterms:type "Text"@en ; ns0:identifierURI "http://hdl.handle.net/2429/958"@en .