Linear Clustering with Application toSingle Nucleotide PolymorphismGenotypingbyGuohua YanB.Sc., Liaocheng University, 1992M.Sc., Beijing Normal University, 1995M.Sc., The University of Windsor, 2003A THESIS SUBMITTED IN PARTIAL FULFILMENT OFTHE REQUIREMENTS FOR THE DEGREE OFDoctor of PhilosophyinThe Faculty of Graduate Studies(Statistics)The University of British Columbia(Vancouver)June, 2008c Guohua Yan 2008AbstractSingle nucleotide polymorphisms (SNPs) have been increasingly popular fora wide range of genetic studies. A high-throughput genotyping technolo-gies usually involves a statistical genotype calling algorithm. Most callingalgorithms in the literature, using methods such as k-means and mixture-models, rely on elliptical structures of the genotyping data; they may failwhen the minor allele homozygous cluster is small or absent, or when thedata have extreme tails or linear patterns.We propose an automatic genotype calling algorithm by further devel-oping a linear grouping algorithm (Van Aelst et al., 2006). The proposedalgorithm clusters unnormalized data points around lines as against aroundcentroids. In addition, we associate a quality value, silhouette width, witheach DNA sample and a whole plate as well. This algorithm shows promisefor genotyping data generated from TaqMan technology (Applied Biosys-tems). A key feature of the proposed algorithm is that it applies to un-normalized fluorescent signals when the TaqMan SNP assay is used. Thealgorithm could also be potentially adapted to other fluorescence-based SNPgenotyping technologies such as Invader Assay.Motivated by the SNP genotyping problem, we propose a partial likeli-hood approach to linear clustering which explores potential linear clustersin a data set. Instead of fully modelling the data, we assume only the signedorthogonal distance from each data point to a hyperplane is normally dis-tributed. Its relationships with several existing clustering methods are dis-cussed. Some existing methods to determine the number of components in adata set are adapted to this linear clustering setting. Several simulated andreal data sets are analyzed for comparison and illustration purpose. We alsoinvestigate some asymptotic properties of the partial likelihood approach.A Bayesian version of this methodology is helpful if some clusters aresparse but there is strong prior information about their approximate loca-tions or properties. We propose a Bayesian hierarchical approach which isparticularly appropriate for identifying sparse linear clusters. We show thatthe sparse cluster in SNP genotyping datasets can be successfully identifiedafter a careful specification of the prior distributions.iiTable of ContentsAbstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iiTable of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . iiiList of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viList of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viiAcknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . xDedication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiStatement of Co-Authorship . . . . . . . . . . . . . . . . . . . . . xii1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11.1 SNP genotyping . . . . . . . . . . . . . . . . . . . . . . . . . 11.1.1 SNPs and their applications . . . . . . . . . . . . . . 11.1.2 TaqMan SNP genotyping technology . . . . . . . . . 21.2 Review of several clustering algorithms . . . . . . . . . . . . 31.2.1 MCLUST: normal mixture model-based clustering . . 31.2.2 MIXREG: mixture of linear regressions . . . . . . . . 51.2.3 LGA: linear grouping algorithm . . . . . . . . . . . . 51.3 Outline of the thesis . . . . . . . . . . . . . . . . . . . . . . . 6Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82 Automatic SNP genotype calling . . . . . . . . . . . . . . . . 102.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . 102.1.1 TaqMan SNP genotyping . . . . . . . . . . . . . . . . 102.1.2 Review of some genotype calling algorithms . . . . . 122.1.3 ROX-normalized versus unnormalized data . . . . . . 132.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172.3 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17iiiTable of Contents2.4 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 182.4.1 Data preprocessing . . . . . . . . . . . . . . . . . . . 182.4.2 Fitting lines using LGA . . . . . . . . . . . . . . . . . 182.4.3 Genotype assignment . . . . . . . . . . . . . . . . . . 192.4.4 Quality assessment of DNA samples and plates . . . . 202.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 223 A partial likelihood approach to linear clustering . . . . . 243.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 243.2 Partial likelihood-based objective function for linear cluster-ing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 253.2.1 Partial likelihood for orthogonal regression . . . . . . 253.2.2 Partial likelihood for linear clustering . . . . . . . . . 263.3 The EM algorithm . . . . . . . . . . . . . . . . . . . . . . . . 283.4 Asymptotic properties . . . . . . . . . . . . . . . . . . . . . . 303.5 Relationships with other clustering methods . . . . . . . . . 323.5.1 With LGA . . . . . . . . . . . . . . . . . . . . . . . . 323.5.2 With normal mixture models . . . . . . . . . . . . . . 333.5.3 With mixture of ordinary regressions . . . . . . . . . 333.6 Choosing the number of linear clusters . . . . . . . . . . . . 333.6.1 Bootstrapping the partial likelihood ratio . . . . . . . 333.6.2 Information criteria . . . . . . . . . . . . . . . . . . . 343.7 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 353.7.1 Simulated data I . . . . . . . . . . . . . . . . . . . . . 353.7.2 Simulated data II . . . . . . . . . . . . . . . . . . . . 373.7.3 Australia rock crab data . . . . . . . . . . . . . . . . 373.7.4 Taqman single nucleotide polymorphism genotypingdata . . . . . . . . . . . . . . . . . . . . . . . . . . . . 423.8 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 474 Bayesian linear clustering . . . . . . . . . . . . . . . . . . . . . 514.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 514.2 Model specification . . . . . . . . . . . . . . . . . . . . . . . 554.2.1 Model for one cluster: orthogonal regression . . . . . 554.2.2 Model for linear clustering . . . . . . . . . . . . . . . 554.3 Sampling algorithms . . . . . . . . . . . . . . . . . . . . . . . 57ivTable of Contents4.3.1 Gibbs sampling . . . . . . . . . . . . . . . . . . . . . 574.3.2 Metropolis-Hastings sampling . . . . . . . . . . . . . 604.4 Australia rock crab data . . . . . . . . . . . . . . . . . . . . 624.5 Taqman single nucleotide polymorphism genotyping data . . 644.6 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 785 Consistency and asymptotic normality . . . . . . . . . . . . 805.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 805.2 Population version of the objective function . . . . . . . . . . 825.3 Consistency . . . . . . . . . . . . . . . . . . . . . . . . . . . 855.4 Asymptotical normality . . . . . . . . . . . . . . . . . . . . . 94Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 966 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 976.1 Robustness consideration . . . . . . . . . . . . . . . . . . . . 976.2 Asymptotics . . . . . . . . . . . . . . . . . . . . . . . . . . . 976.3 Model Extension . . . . . . . . . . . . . . . . . . . . . . . . . 986.4 Variable/model selection . . . . . . . . . . . . . . . . . . . . 986.5 Bayesian computation . . . . . . . . . . . . . . . . . . . . . . 98Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99AppendixA Glossary of some genetic terms . . . . . . . . . . . . . . . . . 100vList of Tables3.1 Criteria for choosing the numbers of clusters, K, when MLCis applied to simulated data I. ?Boot p-value? is the p-valueusing the bootstrapping partial likelihood ratio test for H0 :K = K0 vs H1 : K = K0 +1 where 99 bootstrapping samplesare used . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 373.2 Misclassification matrices of simulated data II from MLC,MCLUST, LGA and MIXREG. . . . . . . . . . . . . . . . . . 383.3 Criteria for choosing the number of clusters, K, when MLCis applied to the blue crab data. ?Boot p-value? is the p-value using the bootstrapping partial likelihood ratio test forH0 : K = K0 vs H1 : K = K0 + 1 where 99 bootstrappingsamples are used. . . . . . . . . . . . . . . . . . . . . . . . . . 393.4 Misclassification matrices of the blue crab data (using logscales) from MLC, MCLUST, LGA and MIXREG. . . . . . . 403.5 Largest membership probabilities of the 7 points labelled inFigure 3.5 by MCLUST and MLC. . . . . . . . . . . . . . . . 43viList of Figures2.1 Scatterplots of ROX-normalized data for (a): a good plate;(b) & (c): messy plates; (d): a plate with only a few points inthe YY cluster; (e): a plate with no points in the YY cluster;(f): a plate with only one cluster. . . . . . . . . . . . . . . . . 112.2 Scatterplots of unnormalized and ROX-normalized data for aplate. Both points and letters represent samples. . . . . . . . 142.3 Probability of calling a genotype versus ROX . . . . . . . . . 152.4 Silhouette width versus ROX value. Left panel: k-means re-sults using ROX-normalized data; right panel: LGA resultusing unnormalized data. . . . . . . . . . . . . . . . . . . . . 162.5 Control points in scatterplots of ROX-normalized and unnor-malized data for a plate. Letter ?P? represents a positivecontrol point and ?N? a negative control point. . . . . . . . 162.6 Left panel: the calling result of SDS software displayed inthe scatterplot of unnormalized data. Right panel: the scor-ing results of LGA using unnormalized data. The symbol openbulletrepresents noncalls. . . . . . . . . . . . . . . . . . . . . . . . 173.1 Comparison of clustering results of simulated data I. Upper-left panel displays the true classification; upper-right, lower-left and lower-right are that of MLC, LGA and MCLUST. . . 363.2 Pairwise scatterplots of simulated data II. . . . . . . . . . . . 383.3 Pairwise scatterplots of the blue crab data. . . . . . . . . . . 403.4 Clustering results of the blue crab data (using log scales) fromMLC, MCLUST, LGA and MIXREG. CL is used as the re-sponse in MIXREG. . . . . . . . . . . . . . . . . . . . . . . . 413.5 Clustering results from four methods applied to a Taqmandata set plate. openbullet, + and triangle denote the labels assigned for thethree linear clusters (genotypes) by each method, ? denotespoints assigned as negative controls and diamondmath denotes points as-signed to the background cluster. . . . . . . . . . . . . . . . . 44viiList of Figures4.1 Scatterplot of one plate in which the variant allele homozy-gous cluster has only five points (upper-left). . . . . . . . . . 534.2 Clustering results of several clustering algorithms. For k-means and MCLUST, the number of clusters are set to 4; forthe remaining algorithms, the number of lines are set to 3. . . 544.3 Pairwise scatter plots of the blue crab data. . . . . . . . . . . 634.4 Evolution of sampled values for theta for 11,000 iterations usingAlgorithm 1 for blue crab data. . . . . . . . . . . . . . . . . . 654.5 Evolution of log unnormalized posterior density of sampledvalues for (theta, z1, ..., zn) for 11,000 iterations using Algorithm1 for blue crab data. . . . . . . . . . . . . . . . . . . . . . . . 664.6 Autocorrelations of sampled values for theta for 11,000 iterationsusing Algorithm 1 for blue crab data. . . . . . . . . . . . . . . 674.7 Density curves of sampled values for theta for 10,000 iterationsusing Algorithm 1, after a burn-in of 1000 iterations for bluecrab data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 684.8 Unnormalized marginal posterior density values of theta alonga line segment (1 - t)?1 + ?2 connecting the two estimatedmodes ?1 and ?2. One mode is estimated by maximum aposteriori estimation from 10,000 iterations using Algorithm1, after a burn-in of 1000 iterations and the other is obtainedby permutation. The blue crab data is used. . . . . . . . . . . 694.9 Estimated posterior probability of being classified into Class1 of the 100 crabs from 10,000 iterations using Algorithm1, after a burn-in of 1000 iterations. The solid vertical lineseparates these crabs by their true classes and the horizontaldashed line corresponds to a probability of 0.5. . . . . . . . . 694.10 Evolution of log unnormalized posterior density of sampledvalues for eta for 11,000 iterations using Algorithm 2 for theSNP genotyping data. . . . . . . . . . . . . . . . . . . . . . . 724.11 Evolution of sampled values for theta for 11,000 iterations usingAlgorithm 2 for the SNP genotyping data. The first threepanels are for slopes. . . . . . . . . . . . . . . . . . . . . . . . 734.12 Autocorrelations of sampled values for theta for 11,000 iterationsusing Algorithm 2 for the SNP genotyping data. . . . . . . . 744.13 Density curves of sampled values for theta for 11,000 iterationsusing Algorithm 2 for the SNP genotyping data. . . . . . . . 754.14 Clustering results of the SNP genotyping data in which pointsare classified into the cluster with the largest posterior mem-bership probability. . . . . . . . . . . . . . . . . . . . . . . . . 76viiiList of Figures4.15 Clustering results of the Bayesian approach for a plate with-out a sparse cluster. . . . . . . . . . . . . . . . . . . . . . . . 77ixAcknowledgementsForemost I would like to thank my supervisors Dr. William J. Welch and Dr.Ruben H. Zamar, for their excellent guidance, continuous encouragement,great patience and generous support. I am indebted to them for many thingsI have learned which lay the foundation of my future career.I am very grateful to other members of my supervisory committee, Dr.Paul Gustafson and Dr. Arnaud Doucet for their valuable comments andsuggestions.I am also very grateful to Dr. Harry Joe and Dr. Lang Wu for theirgreat help and valuable advice both in statistics and in my job applicationprocess. Also I would like to thank them for teaching me playing table tennisalthough they must have been frustrated that I am a hopeless learner.Many thanks to Dr. Mat? s Salibi?n-Barrera, Dr. Raphael Gottardo, Dr.Nancy Heckman and Dr. Jiahua Chen for their help and valuable sugges-tions. I also thank Dr. Jason Loeppky for his help on my proposal.Special thanks to Ms. Loubna Akhabir and Ms. Treena McDonald forproviding me the data and explaining to me the genomic background.I would also like to thank Ms. Elaine Salameh, Ms. Christine Graham,Ms. Rhoda Morgan, Ms. Peggy Ng and Ms. Viena Tran for their help withadministrative matters.Furthermore, I thank my fellow students for their help throughout myprogram. I choose not to list their names here but their help and friendshipwill be remembered forever.Finally, I would like to acknowledge the support of the University ofBritish Columbia through a University Graduate Fellowship.xTo my parents, my wife and my daughterxiStatement of Co-AuthorshipThis thesis is completed under the supervision of Dr. William J. Welch andDr. Ruben H. Zamar.Chapter 2 is co-authored with Dr. William J. Welch, Dr. Ruben H.Zamar, Ms. Loubna Akhabir and Ms. Treena McDonald. I conducted thedata analysis and prepared a draft of the manuscript.Chapter 3 is co-authored with Dr. William J. Welch and Dr. Ruben H.Zamar. My main contribution is the derivation of the partial likelihood. Iconducted the data analysis and prepared a draft of the manuscript.Chapter 4 is co-authored with Dr. William J. Welch and Dr. Ruben H.Zamar. My main contributions are the formulation of the model and thespecification of the prior distribution. I conducted the data analysis andprepared the manuscript.Chapter 5 is co-authored with Dr. William J. Welch and Dr. Ruben H.Zamar. I derived the asymptotic properties and prepared the manuscript.xiiChapter 1IntroductionThis thesis work investigates methods to detect linearly shaped clusters ina dataset. It is motivated by a clustering problem in SNP (single nucleotidepolymorphism) genotyping and much effort of this thesis has been devotedto automatic genotype calling algorithms in TaqMan SNP genotyping tech-nology. In this chapter, we introduce the SNP genotyping problem, reviewseveral clustering algorithms used in the rest of the thesis and give an outlineof the thesis.1.1 SNP genotyping1.1.1 SNPs and their applicationsA single nucleotide polymorphism (SNP, pronounced as ?snip?) is a single-base variation in a genome. The genetic code is specified by the four nu-cleotide ?letters?: A (adenine), C (cytosine), T (thymine) and G (guanine).There are two complimentary DNA strands. It is sufficient to consider onlyone. SNP variation occurs when a single nucleotide, such as an A, is re-placed by one of the other three letters C, G or T at a particular base of thetarget strand. An example of a SNP is the alteration of the DNA segmentAAGGTTA to ATGGTTA, where the second A in the first snippet is re-placed with a T (www.ncbi.nlm.nih.gov). A SNP variation usually involvesonly two nucleotides; the two possible nucleotides are called two SNP alleles.Throughout we shall use ?X? to generically represent the common wild-typeallele and ?Y? the variant allele. The DNA sequences of any two individualsare mostly identical and SNPs are found about every 250 - 350 base pairsin the human genome (Niu et al., 2002).Approximately 3 to 5 percent of a person?s DNA sequence codes forthe production of proteins, most SNPs are found outside of these ?codingsequences?. SNPs found within a coding sequence, cSNPs, are of particularinterest to researchers because they are more likely to alter the biologicalfunction of a protein (www.ncbi.nlm.nih.gov). Moreover, the abundance ofSNPs makes them useful markers for genetic association studies that work1Chapter 1. Introductionto localize the genes involved in complex diseases or adverse drug reactions.The popularity of SNPs is also due to their usual biallelic property whichmakes them amenable to automated genotyping. (Ranade et al., 2001).1.1.2 TaqMan SNP genotyping technologyDetermination of the alleles at a single nucleotide polymorphism site is calledgenotyping. The nature of large-scale association studies requires rapid, re-liable and cost-effective SNP genotyping. Various SNP genotyping technolo-gies have been developed. The choice of a technology depends on whether afew SNPs are to be typed in many individuals or many different SNPs areto be examined in a few individuals (Callegaro et al., 2006).The TaqMan SNP Genotyping Assay (Applied Biosystems) is a widelyused fluorescence-based high-throughput genotyping technology suitable forthe former case. In this method, the region flanking the SNP site of interestis amplified in the presence of two probes each specific for one or the otherallele. Probes haves a fluor, called ?reporter? at one end but do not fluorescewhen free in solution because they have a ?quencher? at the other end thatabsorbs fluorescence from the reporter. During the amplification in whichmany copies of the same sequence are produced, the probe specifically base-paired with the target is unwound, its reporter liberated from the quencherand the fluorescence is increased. The presence of two probes, each labelledwith a different fluor, allows the detection of both alleles in a single tube(De La Vega et al., 2005; Ranade et al., 2001).In the TaqMan SNP genotyping technology, DNA samples of many in-dividuals are arranged in a 96- or 384-well plate and they are amplifiedsimultaneously. For each individual, two quantitative fluorescent signals aremeasured at the end of amplification for the two alternative SNP alleles,indicating their presence or absence. The pair of signals for an individualforms a point in a scatterplot. Ideally there are four distinct clusters in ascatterplot. The NTC (no template control) cluster lies in the lower-left cor-ner, close to the origin, containing negative control cases (a few cases that donot have DNA samples) and samples that fail to amplify. In the lower-right,upper-left, and upper-right corners are three clusters presumably containingsamples of wild-type homozygotes, variant homozygotes, and heterozygotes,respectively.Genotype calls for individual samples are made by a clustering algorithmin the propriety Sequence Detection Software (Applied Biosystems). How-ever, considerable manual intervention of an expert operator is required toassess the data quality, to set fluorescent signal thresholds and to decide the2Chapter 1. Introductiongenotypes, especially when the variant allele Y is rare.The accuracy of SNP genotype calls is critical to later studies. Eventhe slightest amount of genotyping errors can lead to serious consequenceson haplotype analysis, linkage analysis, genetics distance estimation andbackground linkage disequilibrium estimation (Kang et al., 2004). For abrief review of clustering algorithms in fluorescence-based SNP genotyping,see Chapter 2. One aim of this thesis to develop a reliable automated SNPgenotype calling algorithm in the TaqMan SNP genotyping technology thatinvolves minimal manual intervention and should also work well even thevariant allele homozygous cluster YY is sparse. We conclude this subsectionby noting that any genotyping calling algorithm proposed in the TaqManSNP genotyping technology should be applicable to other fluorescence-basedgenotyping method, such as Invader Assay (Mein et al., 2000).In next section, we review several competing clustering algorithms usedin the rest of the thesis.1.2 Review of several clustering algorithms1.2.1 MCLUST: normal mixture model-based clusteringFinite mixture model has long been proposed for clustering problems. Seefor example, Banfield and Raftery (1993), Fraley and Raftery (1998), Fraleyand Raftery (2002), MacLachlan and Peel (2000). In this approach, dataare assumed arising from a mixture of probability distributions; each com-ponent of the mixture is regarded as a cluster. This model-based methodhas some advantage over a heuristic approach. First of all, it is somehowdata dependent via a variety of parameter restrictions coupled with a modelselection mechanism, for example in the packages MCLUST (Fraley and AE,2006) and EMMIX (McLachlan et al., 1999). Second, some early proposedheuristic clustering criteria, including the most widely used kmeans method,were later formulated in model frameworks. A statistical model helps peopleunderstand for what data sets a particular clustering algorithm is likely towork well. Third, within a model framework, many statistical proceduresare readily applicable. For example, maximum likelihood is naturally usedas an optimizing criterion; the estimated probabilities of a data point con-forming to the components is an appealing measure of uncertainty of theresulting classification; a component in a mixture model, which has a clearmeaning in the assumed model, can be interpreted as a cluster; the deter-mination of clustering methods and the number of clusters is then a modelselection problem.3Chapter 1. IntroductionMCLUST is a contributed R package for normal mixture modeling andmodel-based clustering. It provides a number of functions for normal mix-ture model-bases clustering, classification likelihood-based hierarchical clus-tering, normal mixture model-based density estimation and discriminantanalysis etc. In this subsection, we review only the normal mixture model-based clustering which is relevant to the thesis.Given observations x1, ..., xn, MCLUST assumes a normal mixturemodelp(xi|theta) =Ksummationdisplayk=1pkphik(xi|?k,?k), i = 1,...,n,where K is the number of components, pk is the mixing proportion, pk > 0,summationtextKk=1 pk = 1, phi(?|?k,?k) is the multivariate normal density with meanvector ?k and covariance matrix ?k and theta is the collection of all parameters.Banfield and Raftery (1993) proposed a general framework for geomet-ric cross-cluster constraint by parameterizing covariance matrices througheigenvalue decomposition?k = lambdakDkAkDTk ,where Dk is the orthogonal matrix of eigenvectors, Ak is a diagonal matrixwhose elements are proportional to the eigenvalues, and lambdak is an associatedconstant of proportionality. The orientation of principal components of ?kis determined by Dk, while Ak determines the shape of the density contours;lambdak specifies the volume of the corresponding ellipsoid. Characteristics (ori-entation, shape and volume) of distributions are usually estimated from thedata and can be allowed to vary between clusters or constrained to be thesame for all clusters. This parameterization is very flexible; it includes butis not restricted to earlier proposals such as equal-volume spherical variance(?k = lambdaI) which has a close relationship with the k-means method, constantvariance (?k = ?) and the most general unconstrained variance.The EM algorithm is used for maximum likelihood estimation; details areomitted. All models corresponding to the above parameterization possibili-ties are usually tried. Selection of the number K of components/clusters aswell as models is through the Bayesian Information Criterion (BIC), whichrelates to the Bayes factor. The BIC has the formBIC = 2l -log(n)v,where l is the log-likelihood of the model and v is the number of independentparameters in the model.4Chapter 1. IntroductionFor noisy data, MCLUST adds a first order Poisson process to the normalmixture model,p(xi|theta) = p0V +Ksummationdisplayk=1pkphik(xi|?k,?k),where V is the hyper-volume of the data region, p0 > 0 and summationtextKk=0 pk = 1.1.2.2 MIXREG: mixture of linear regressionsMixtures of linear regressions may be regarded as a special case of mixturemodels. The response of interest is assumed univariate. Let (y1,x1), ...,(yn,xn) be the observations. The mixture of regression models isp(yi|theta,xi) =Ksummationdisplayk=1pkphi(yi|xTi betak,sigma2k), i = 1,...,n,where K is the number of components, pk is a mixing proportion, pk > 0,summationtextKk=1 pk = 1, betak is the vector of regression coefficients, phi(?|xTi betak,sigma2k) is theunivariate normal density with mean xTi betak and variance sigma2k. For ease ofnotation, we assume that an intercept is already included in betak if necessary.The EM algorithm can be used for maximum likelihood estimation.MIXREG is also a contributed R package (Turner, 2006); the computationaldetails are in Turner (2000).The package allows for the selection of equal variances (sigma2k = sigma2) orunequal variances. Model selection can be performed using BIC outside thepackage MIXREG.1.2.3 LGA: linear grouping algorithmVan Aelst et al. (2006) proposed a linear grouping algorithm to detect linearpatterns in a dataset. It combines ideas of k-means, orthogonal regressionand resampling and can uncover linear patterns when most traditional algo-rithms do not work well. LGA is the corresponding contributed R package(Harrington, 2007). Suppose that we are to uncover k linear groups in a dataset with n points in d dimensions. The linear grouping algorithm works asfollows:1. Initialization. Starting values are generated by randomly selecting kmutually exclusive subsets of d points (d-subsets). For each of thesed-subsets, a hyperplane through the d points is computed.5Chapter 1. Introduction2. Updating group assignment. Each point is assigned to its closest hy-perplane in the sense of orthogonal distance; each hyperplane is re-computed from the updated grouping using orthogonal regression.3. Iterative refinement. The objective function is set to be the aggre-gated sum of the squared orthogonal distances of the data points fromtheir closest hyperplane. Repeat step 2 until no significant improve-ment is attained. A moderate number of iterations (say, 10) is usuallysufficient.4. Resampling. Repeat steps 1 - 3 a number of times (say, 100) andselect the solution with the lowest aggregated sum of squared orthog-onal distances. This solution is refined further as in step 3 until noimprovement is achieved.The idea of silhouette width (Rousseeuw, 1987) is adapted to the lineargrouping scenario to measure the strength of the group assignment. Denoteby s1(i) and s2(i) the orthogonal distances of point i from its closest andsecond closest hyperplanes, respectively. The silhouette width for point i isdefined asw(i) = 1 - s1(i)s2(i).The GAP statistic (Tibshirani et al., 2001) is used to determine the numberof linear groups in a data set.1.3 Outline of the thesisIn Chapter 2, we propose a genotype calling algorithm by further adaptingthe linear grouping algorithm (Van Aelst et al., 2006). A key feature of thisalgorithm is that it applies to unnormalized fluorescent signals.Whereas there is no explicit statistical model in Chapter 2, a partiallikelihood approach to linear clustering is proposed in Chapter 3. It borrowsideas from normal mixture models and mixtures of regressions and providesan extension to the heuristic linear grouping algorithm. The SNP genotypingproblem is revisited in this model framework.The method in Chapter 3 is more flexible than that in Chapter 2, but stillcannot handle sparse (homozygous variant) clusters well. In Chapter 4, wemove to a Bayesian hierarchical approach to the SNP genotyping problem.With careful specification of the prior structures, the Bayesian approach isable to handle a sparse variant allele homozygous cluster.6Chapter 1. IntroductionChapter 5 includes the technical details of the asymptotic results for thepartial likelihood approach in Chapter 3.Chapter 6 summarizes a few possible future research directions.7BibliographyBanfield, J. and A. Raftery (1993). Model-Based Gaussian and Non-Gaussian Clustering. Biometrics 49(3), 803?821.Callegaro, A., R. Spinelli, L. Beltrame, S. Bicciato, L. Caristina, S. Cen-suales, G. De Bellis, and C. Battaglia (2006). Algorithm for automaticgenotype calling of single nucleotide polymorphisms using the full courseof TaqMan real-time data. Nucleic Acids Research 34(7), e56.De La Vega, F., K. Lazaruk, M. Rhodes, and M. Wenz (2005). Assessmentof two flexible and compatible SNP genotyping platforms: TaqMan snpgenotyping assays and the SNPlex genotyping system. Mutation research.Fundamental and molecular mechanisms of mutagenesis 573(1-2), 111?135.Fraley, C. and R. AE (2006). mclust Version 3 for R: Normal MixtureModeling and Model-based Clustering. Technical report, Technical Report504, University of Washington, Department of Statistics.Fraley, C. and A. Raftery (1998). How Many Clusters? Which Cluster-ing Method? Answers Via Model-Based Cluster Analysis. The ComputerJournal 41(8), 578.Fraley, C. and A. Raftery (2002). Model-Based Clustering, DiscriminantAnalysis, and Density Estimation. Journal of the American StatisticalAssociation 97(458), 611?632.Harrington, J. (2007). lga: Tools for linear grouping analysis (LGA). Rpackage version 1.0-0.Kang, H., Z. Qin, T. Niu, and J. Liu (2004). Incorporating GenotypingUncertainty in Haplotype Inference for Single-Nucleotide Polymorphisms.The American Journal of Human Genetics 74(3), 495?510.MacLachlan, G. and D. Peel (2000). Finite mixture models. J. Wiley.8BibliographyMcLachlan, G., D. Peel, K. Basford, and P. Adams (1999). The EMMIXsoftware for the fitting of mixtures of normal and t-components. Journalof Statistical Software 4(2).Mein, C., B. Barratt, M. Dunn, T. Siegmund, A. Smith, L. Esposito,S. Nutland, H. Stevens, A. Wilson, M. Phillips, et al. (2000). Evaluation ofSingle Nucleotide Polymorphism Typing with Invader on PCR Ampliconsand Its Automation. Genome Research 10(3), 330?343.Niu, T., Z. Qin, X. Xu, and J. Liu (2002). Bayesian Haplotype Inference forMultiple Linked Single-Nucleotide Polymorphisms. Am. J. Hum. Genet 70,157?169.Ranade, K., M. Chang, C. Ting, D. Pei, C. Hsiao, M. Olivier, R. Pesich,J. Hebert, Y. Chen, V. Dzau, et al. (2001). High-Throughput Genotypingwith Single Nucleotide Polymorphisms. Genome Research 11(7), 1262?1268.Rousseeuw, P. (1987). Silhouettes: A graphical aid to the interpretationand validation of cluster analysis. Journal of Computational and AppliedMathematics 20, 53?65.Tibshirani, R., G. Walther, and T. Hastie (2001). Estimating the numberof clusters in a data set via the gap statistic. Journal of the Royal StatisticalSociety. Series B (Statistical Methodology) 63(2), 411?423.Turner, T. (2000). Estimating the propagation rate of a viral infection ofpotato plants via mixtures of regressions. Journal of the Royal StatisticalSociety Series C(Applied Statistics) 49(3), 371?384.Turner, T. (2006). mixreg: Functions to fit mixtures of regressions. Rpackage version 0.0-2.Van Aelst, S., X. Wang, R. Zamar, and R. Zhu (2006). Linear groupingusing orthogonal regression. Computational Statistics and Data Analy-sis 50(5), 1287?1312.9Chapter 2Automatic genotype callingof single nucleotidepolymorphisms using a lineargrouping algorithm2.1 Background2.1.1 TaqMan SNP genotypingSingle nucleotide polymorphisms (SNPs) make up about 90% of all humangenetic variations. They occur approximately every 100 to 300 bases alongthe 3.2-billion-base human genome. They have been investigated as pop-ular biological markers for a wide range of genetic studies. Such studiesrequire reliable and cost-effective high-throughput genotyping technologies.Various SNP genotyping technologies have been developed and commercial-ized. Shi (2001) provides a review of some SNP genotyping technologies.De La Vega et al. (2005) give an elaborate assessment of two popular SNPgenotyping technologies, TaqMan SNP Genotyping Assay and the SNPlexGenotyping System (Applied Biosystems). Kang et al. (2004) give a briefoverview of three widely used high-throughput technologies, the TaqManSNP Genotyping Assay, the OLA (Oligonucleotide Ligation Assay) and theMassARRAY system. Although they employ different mechanisms, the mostpopular high-throughput technologies share the same conceptual framework:for each DNA sample, two quantitative signal intensities are measured afteramplification for two alternative SNP alleles, indicating their presence orabsence; these pairs of signal intensities are used to call genotypes by an ex-A version of this chapter will be submitted for publication. Authors: Guohua Yan,William J. Welch, Ruben H. Zamar, Loubna Akhabir and Treena McDonald.10Chapter 2. Automatic SNP genotype callingpert manually or by a clustering algorithm. The two alleles are genericallycalled X and Y hereafter.0.0 0.5 1.0 1.5 2.0 2.50.51.01.52.02.53.03.5(a)Allele.X.RnAllele.Y.Rn(XX)(XY)(YY)(NTC)0.5 1.0 1.5 2.0 2.5 3.012345(b)Allele.X.RnAllele.Y.Rn1.0 1.5 2.0 2.5 3.0 3.523456(c)Allele.X.RnAllele.Y.Rn1.5 2.0 2.5 3.0 3.51.01.52.02.53.03.54.0(d)Allele.X.RnAllele.Y.Rn2.0 2.5 3.0 3.51.01.52.02.53.0(e)Allele.X.RnAllele.Y.Rn0.5 1.0 1.5 2.0 2.5 3.00.40.60.81.01.21.4(f)Allele.X.RnAllele.Y.RnFigure 2.1: Scatterplots of ROX-normalized data for (a): a good plate; (b)& (c): messy plates; (d): a plate with only a few points in the YY cluster;(e): a plate with no points in the YY cluster; (f): a plate with only onecluster.Figure 2.1 illustrates several scenarios that are typical in genotypingdata from the TaqMan SNP Genotyping Assays (Applied Biosystems) at theJames Hogg iCAPTURE Centre. In this technology, two alleles are labelledby fluorescent dyes VIC and FAM; in addition, a third dye ROX, whichis assumed unchanged during the amplification, is used to normalize thedata. The proprietary ABI PRISM rcirclecopyrt 7900HT Sequence Detection SystemPlate Utility Software (referred to as SDS system hereafter) actually usesVIC/ROX and FAM/ROX (referred to as ROX-normalized data hereafter)to make genotype calls. In Figure 2.1, ROX-normalized signals are shown.Ideally, there are four distinct clusters in a scatterplot as in Figure 2.1(a). The NTC (no template control) cluster lies in the lower-left corner,close to the origin, containing negative control cases (a few cases that donot have DNA samples) and samples that fail to amplify. In the lower-right, upper-left and upper-right corners are three clusters labelled XX, YY11Chapter 2. Automatic SNP genotype callingand XY, presumably containing samples of wild-type homozygotes, varianthomozygotes, and heterozygotes, respectively. Owing to various artifacts,however, segregation can be poor with points lying between clusters, whichmake the calls less trustworthy (Figure 2.1 (b) and (c)). Furthermore, insome plates there are only a few points or even no points in the variant allelehomozygous cluster YY, which often causes classical clustering algorithmsto fail or to make genotype calls incorrectly (Figure 2.1 (d) and (e)). To ourknowledge, the proprietary clustering algorithm incorporated in the SDSsystem cannot make genotyping calls in this situation and one has to callmanually. In extreme cases, there is only one cluster present (Figure 2.1 (f)),which usually indicates that something has gone wrong in the amplificationprocedure.2.1.2 Review of some genotype calling algorithmsFor a small project, it is possible to make genotype calls manually. In mostcases, it is not hard for an expert to perform this job, and the ?eyeballing?procedure usually gives reasonable results due to its sophisticated incorpo-ration of prior information. For large-scale studies, however, manual scoringcan become a daunting challenge. Furthermore, humans are likely to makeerrors due to fatigue or oversight when a job becomes routine and differentreaders may have different views (Kang et al., 2004; van den Oord et al.,2003). van den Oord et al. (2003) conducted a study on the error ratesof manual scoring and three statistical procedures and report that thesestatistical procedures uniformly outperform the manual procedure in errorrates.Perhaps the most widely used clustering method in the SNP genotypingliterature is the classical k-means, possibly with minor modification (Akulaet al., 2002; Olivier et al., 2002; Ranade et al., 2001). Olivier et al. (2002)notice that k-means algorithms often split one genotype group of the data,especially when the variant allele homozygous cluster has only a few datapoints or when there is no sharp distinction between one of the genotypeclusters and the NTC cluster. They develop a new algorithm, the CA al-gorithm, that initially examines all data points and determines the approx-imate location of the centroid for each cluster by dividing the space intosections for each expected cluster. They report that the heuristic algorithmworks better than classical k-means.Ranade et al. (2001) and Akula et al. (2002) assign a ?quality score? toeach sample. After the genotype assignments are made, they assume thateach genotype cluster is bivariate normally distributed. For each sample,12Chapter 2. Automatic SNP genotype callingthey calculate the probability density value of the normal distribution forits genotype cluster as the quality score that this sample is called. Akulaet al. (2002) also assume equal covariance matrices across the clusters, argu-ing that this assumption gives conservative but better results than otherwise.Lovmar et al. (2005) propose using ?silhouette scores? to assess the qual-ity of SNP genotype clusters, the idea of which originates from Rousseeuw(1987). They report that the measure is satisfactory and empirically thegenotypes can be unequivocally assigned without manual inspection whenthe silhouette score is greater than 0.65.van den Oord et al. (2003) propose a mixture model approach, in whicheach cluster is assumed to follow a normal distribution. The number ofcomponent clusters and the initial values of the parameters are set uponinspecting the scatterplot. This is a more delicate approach than the k-means approach since it considers the elliptical structures in the assignmentsof genotypes and the k-means algorithm can be regarded as a special caseof the mixture models approach. Kang et al. (2004) go one step furtherin this direction. They assume a bivariate t-mixture model, in which a tdistribution with small degrees of freedom is assumed for each cluster. Theyargue that clusters from SNP genotyping data usually have heavier tails thanthe normal distribution and report that the t-mixture model is less sensitiveto outlying points and has advantages over the normal mixture model or thek-means approach. Fujisawa et al. (2004) also use a model-based approachin which only angle data are used.2.1.3 ROX-normalized versus unnormalized dataThe datasets illustrated in Figure 2.1 are after ROX-normalization. Animportant feature of our approach is that it uses data without ROX-nor-malization. We now discuss four reasons why we prefer to work with theunnormalized data.First, the motivation for ROX-normalization is presumably to form spher-ical clusters for which classical clustering algorithms such as k-means canbe used. Inspecting the ROX-normalized scatterplots, there are quite a fewplates for which the normalization does not produce reasonable sphericalstructures as expected, e.g., Figure 2.1 (c).Second, the incorporation of ROX intensities is aiming to correct system-atic biases in the chemistry such as plate to plate variation. However, thereare scenarios where some points may be pulled away from their home clus-ters merely by this correction and fail to be called a genotype; by the sametoken, points which fail to amplify may be assigned a genotype incorrectly13Chapter 2. Automatic SNP genotype callingdue to this correction. In the left panel of Figure 2.2, points a,b,c are away0.5 1.0 1.5 2.0 2.51234Allele.X.RnAllele.Y.Rnabcd eROX-normalized data2000 4000 6000 8000 120005000100001500020000Allele.XAllele.YabcdeUnnormalized dataFigure 2.2: Scatterplots of unnormalized and ROX-normalized data for aplate. Both points and letters represent samples.from the heterozygous cluster when the data are ROX-normalized, but arewithin a linear arranged cluster in the right panel when the unnormalizeddata are used. Similarly, point d is away from the variant allele homozygouscluster in the left panel, but is much closer to the variant allele homozygouscluster in the right panel. The proprietary software assigns very low qualityvalues for these points. Intuitively, these points ?should? be classified intoone of the genotype clusters, or at least with higher ?confidence? than their?quality? values indicate. Point e is outlying in both panels; in this case,normalization will not help any way.Third, the undisclosed, proprietary algorithm used by the SDS system forclustering, and hence calling genotypes, seems to have difficulty with sampleswith extreme (small or large) ROX values, as evidenced in Figure 2.3. Thecurve in Figure 2.3 shows that the empirical chance of calling a genotype bythe proprietary algorithm decreases when the ROX value is too low or toohigh.Unnormalized data, like in Figure 2.2, often show well-separated clusters,but they are along lines. The linear grouping algorithm (LGA) (Van Aelstet al., 2006) can identify such clusters. When we apply LGA to the unnor-malized data, the problem of low-call rate for extreme ROX values is muchalleviated. Figure 2.4 contrasts the silhouette width (a measure of callingquality) versus ROX value when the LGA algorithm is applied to unnormal-14Chapter 2. Automatic SNP genotype calling0 5000 10000 150000.00.20.40.60.81.0ROXEmpirical chance of callingFigure 2.3: Probability of calling a genotype versus ROXized data and when the k-means method is used for ROX-normalized data.For k-means, the definition of silhouette width in Rousseeuw (1987) is used;for the LGA, the definition of silhouette width in Van Aelst et al. (2006) isused (see also equation (2.1)).Last, the positions of positive control points in scatterplots of ROX-normalized data suggest the use of unnormalized data. In a TaqMan SNPgenotyping assay, individual samples are usually arranged in a 96- or 384-well plate and are amplified simultaneously. For quality assessment purpose,some wells have DNA samples with known genotypes (in our example 12 ina 384-well plate), called positive controls, while some other wells have noDNA samples (typically 8 in a 384-well plate), called negative controls orno template controls (NTC). The positions of control points in one plateare shown in Figure 2.5. The positive control points are represented by?P?. Note that quite a few control points are outside of their correspondinggenotype clusters in the left panel where ROX-normalized data are used. Inthe right panel, however, these points are well within their correspondinglinear clusters.In conclusion, we shall use unnormalized data hereafter in our methodof making genotype calls.15Chapter 2. Automatic SNP genotype calling0 1750 3250 4750 6250 7750 9500 155000.00.20.40.60.81.0ROXkmeans silhouette Width0 1750 3250 4750 6250 7750 9500 155000.00.20.40.60.81.0ROXLGA Silhouette WidthFigure 2.4: Silhouette width versus ROX value. Left panel: k-means resultsusing ROX-normalized data; right panel: LGA result using unnormalizeddata.0.5 1.0 1.5 2.0 2.51234Allele.X.RnAllele.Y.RnPPPPPPPPP PPPNROX-normalized data2000 4000 6000 8000 120005000100001500020000Allele.XAllele.YPPPPPPPPPPPNNNNNUnnormalized dataFigure 2.5: Control points in scatterplots of ROX-normalized and unnor-malized data for a plate. Letter ?P? represents a positive control point and?N? a negative control point.16Chapter 2. Automatic SNP genotype calling2.2 ResultsAs an illustration, we apply the LGA approach to the unnormalized data inFigure 2.2. The silhouette threshold is set to be 0.75, which means a pointis called if its distance from the nearest line is one quarter of its distancefrom the second nearest line. We set the signal threshold empirically. Letmx and my be the median values for X signals (?Allele.X?) and Y signals(?Allele.Y?) respectively. Any points with X signal less than 0.5mx and Ysignal less than 0.5my are not called. The results are in Figure 2.6. Morepoints are called compared with the SDS software.0 5000 10000 150000500010000150002000025000Allele.XAllele.YUnnormalized data0 5000 10000 150000500010000150002000025000Allele.XAllele.YUnnormalized dataFigure 2.6: Left panel: the calling result of SDS software displayed in thescatterplot of unnormalized data. Right panel: the scoring results of LGAusing unnormalized data. The symbol openbullet represents noncalls.2.3 ConclusionsWe have proposed an automatic genotype calling algorithm by taking ad-vantage of a linear grouping algorithm (Van Aelst et al., 2006). The pro-posed method uses unnormalized signals and clusters points around lines asagainst centroids. In addition, we associate a quality value, silhouette width(Rousseeuw, 1987; Van Aelst et al., 2006), with each DNA sample and awhole plate as well. This algorithm shows promise for genotyping datafrom TaqMan technology (Applied Biosystems). The algorithm is reliable.17Chapter 2. Automatic SNP genotype callingIt could be potentially adapted to other fluorescent-based SNP genotypingtechnologies as well, such as the Invader Assay.2.4 MethodsIn the proposed calling algorithm, unnormalized data are used and eachgenotype cluster is represented by a straight line.2.4.1 Data preprocessingIn a TaqMan SNP genotyping assay, the quality of a plate is monitored bymaking sure that all positive controls go to the correct genotype clustersand negative controls stay in the NTC clusters. Without well to well con-tamination, the negative controls should have low signals for both VIC andFAM dyes (horizontal and vertical coordinates in a scatterplot, respectively)since there are no samples in these wells to be amplified. It is advantageousthat we empirically discard negative control points and a portion (5% or10%, say) of points with very low signals (close to the origin) prior to apply-ing LGA to locate the lines. Negative control points should not contributeto the lines which decide the orientations of the genotype clusters. (Somepoints with low signals are assigned to lines afterwards.)2.4.2 Fitting lines using LGAIn SNP genotyping, there are at most three clusters, ignoring the NTCcluster. Two clusters are possible when there are very few or no pointsin the upper-left, YY cluster. We shall fit three lines and two lines tothe data separately using LGA. LGA minimizes aggregated sum of squaredorthogonal distances of points to their closest hyperplanes. We made somemodification specific to the genotyping setting.Grid based initialization. LGA usually depends on multiple starts tohave large probability of attaining the global minimum. In the genotypingsetting, it is computationally affordable to try a sequence of initializationssuch that the global minimum is very likely attained. We set the interceptsto 0 for all lines and choose the slopes from a set of values {s1,? ? ? ,sm},where atan(si) are an equally spaced grid from 0 to pi/2, since the slopesshould always be positive in the SNP genotyping setting.Given a set of intercepts and slopes, points are assigned to their nearestlines; these lines are recalculated. These two steps are iterated until conver-gence. Denote by s1(i) and s2(i) the orthogonal distances of point i from its18Chapter 2. Automatic SNP genotype callingclosest and second closest lines, respectively. The silhouette width for pointi is defined asw(i) = 1 - s1(i)s2(i). (2.1)From the multiple starts, we choose the solution that has the largest averagesilhouette width. (This criterion respects small clusters and penalizes twolines which are too close.)In addition, we shall restrict the slope for the XX cluster be smallerthan that of the XY cluster and the slope of the XY cluster be smallerthan that of the YY cluster. Another restriction is added such that thereare few points in YY than XX and XY. For each LGA, we associate thelines with smallest, second-smallest and largest slopes to XX, XY, and YY,respectively. We simply discard solutions in which any slope is negative orthere are more points in the cluster with the largest slope than one of theother clusters.Number of lines. We also choose between the best two line solutionand the best three line solution according to this average silhouette width. Ifthe three line solution has large average silhouette width, each line representsa genotype cluster; if the two line solution is chosen, they correspond to XXand XY.2.4.3 Genotype assignmentFor genotype assignment, we could empirically set a signal threshold before-hand, below which a point is not called, labelled as ?Undetermined?, andassigned to the ?NTC? cluster. The threshold can be set based on previousgenotyping practice in a laboratory. In this case, a point beyond the thresh-old is assigned to its closest line and its genotype is called accordingly; asilhouette width value is calculated for each point to measure adequacy ofthe fitted line structures. A cutoff point for the silhouette widths is also pre-determined such that outlying points with low silhouettes are also labelledas ?Undetermined?.Our code adapts LGA to deal with thresholding of points as follows. Letmx = median(xi),andmy = median(yi),where xi and yi are the X signal and Y signal of sample i respectively. Pointswith xi <=< 0.5mx and yi <=< 0.5my are not called.19Chapter 2. Automatic SNP genotype callingAs an alternative to an empirical signal threshold, we could penalizepoints closer to the origin by modifying the definition of silhouette width.For example,w(i) = 1 - s1(i) + cs2(i) + c, (2.2)where the tuning parameter c is also set empirically such that points tooclose to the origin are not called.Another version we tried is as follows. Letri =radicalBigx2i + y2i ,be the distance of point i from the origin. Letm = min{ri},andM = median{ri}.Letci =parenleftbiggmin(ri,M) - m2M - m2parenrightbigg12.The adjusted value for silhouette width si issasteriskmathi = cisi. (2.3)This downweights the silhouette score for points with weak xi and yi signals.2.4.4 Quality assessment of DNA samples and platesThe silhouette width for a DNA sample serves as a quality value of thegenotype call. In addition, for each plate, the average silhouette width canbe computed. An expert may decide upon a threshold such that plateswith lower average silhouette width are regarded as unreliable. Meanwhile,the average silhouette width of positive controls in a plate and the averagesilhouette width of negative controls are also indications of the quality ofgenotype calls for a specific plate.2.5 DiscussionThe existing methods assume spherical or elliptical clusters. The unnormal-ized signals tend to fall on lines, however, forcing normalization to try to20Chapter 2. Automatic SNP genotype callingproduce clusters with elliptical shapes, but the ROX normalization signalmay be noisy. We take a quite different approach, identifying the linear clus-ters associated with the unnormalized signals. As seen in Figure 2.6, feweruncalled samples results. We call only samples with at least one strongsignal. The definition of ?strong? will vary from plate to plate. We normal-ize internally by not calling points having weak signals relative to mediansignals, rather than externally with respect to ROX.21BibliographyAkula, N., Y. Chen, K. Hennessy, T. Schulze, G. Singh, and F. McMa-hon (2002). Utility and accuracy of template-directed dye-terminator in-corporation with fluorescence-polarization detection for genotyping singlenucleotide polymorphisms. Biotechniques 32(5), 1072?1078.De La Vega, F., K. Lazaruk, M. Rhodes, and M. Wenz (2005). Assessmentof two flexible and compatible SNP genotyping platforms: TaqMan snpgenotyping assays and the SNPlex genotyping system. Mutation research.Fundamental and molecular mechanisms of mutagenesis 573(1-2), 111?135.Fujisawa, H., S. Eguchi, M. Ushijima, S. Miyata, Y. Miki, T. Muto, andM. Matsuura (2004). Genotyping of single nucleotide polymorphism usingmodel-based clustering. Bioinformatics 20, 5.Kang, H., Z. Qin, T. Niu, and J. Liu (2004). Incorporating GenotypingUncertainty in Haplotype Inference for Single-Nucleotide Polymorphisms.The American Journal of Human Genetics 74(3), 495?510.Lovmar, L., A. Ahlford, M. Jonsson, and A. Syv?nen (2005). Silhouettescores for assessment of SNP genotype clusters. feedback.Olivier, M., L. Chuang, M. Chang, Y. Chen, D. Pei, K. Ranade,A. de Witte, J. Allen, N. Tran, D. Curb, et al. (2002). High-throughputgenotyping of single nucleotide polymorphisms using new biplex invadertechnology. Nucleic Acids Research 30(12), e53.Ranade, K., M. Chang, C. Ting, D. Pei, C. Hsiao, M. Olivier, R. Pesich,J. Hebert, Y. Chen, V. Dzau, et al. (2001). High-Throughput Genotypingwith Single Nucleotide Polymorphisms. Genome Research 11(7), 1262?1268.Rousseeuw, P. (1987). Silhouettes: A graphical aid to the interpretationand validation of cluster analysis. Journal of Computational and AppliedMathematics 20, 53?65.22BibliographyShi, M. (2001). Enabling large-scale pharmacogenetic studies by high-throughput mutation detection and genotyping technologies. ClinChem 47(2), 164?172.Van Aelst, S., X. Wang, R. Zamar, and R. Zhu (2006). Linear groupingusing orthogonal regression. Computational Statistics and Data Analy-sis 50(5), 1287?1312.van den Oord, E., Y. Jiang, B. Riley, K. Kendler, and X. Chen (2003).FD-TDI SNP Scoring by Manual and Statistical Procedures: A Study ofError Rates and Types. BioTechniques 34(3), 610?624.23Chapter 3A partial likelihood approachto linear clustering3.1 IntroductionIn this paper we are concerned with detecting linearly shaped clusters in adata set. This work is motivated by a clustering problem in a SNP (singlenucleotide polymorphism) genotyping setting. Data of this type are bivari-ate, consisting of two signals, and conform mostly to linear clusters. Moredetailed discussion of these genotyping data appears in Section 3.7.Van Aelst et al. (2006) proposed a method called the Linear GroupingAlgorithm (LGA) for linear clustering problems. For earlier approaches tolinear clustering, see the references therein. There is a need for algorithmsspecialized in linear clustering. First, the usual mixture of normal or t dis-tributions (see for example, Banfield and Raftery, 1993; Fraley and Raftery,1998, 2002; MacLachlan and Peel, 2000, among others) aims at ellipticalstructures in a data set; when a data set displays highly linear patterns,especially when these patterns are not well separated from each other, weneed a different criterion for detecting linear patterns. Second, most ear-lier linear clustering approaches assume that a response variable is availablesupervising the search for linear clusters, see for example, Spath (1982),DeSarbo and Cron (1988) and Turner (2000). Van Aelst et al. (2006) havea simulated data set illustrating that the selection of a response variableis not trivial. In that data set, two linear clusters are obvious in the firsttwo dimensions; the third dimension is merely noise. If the second variableis selected as response, the mixture of regressions method by Spath (1982)successfully recovers the linear structures; it fails when the third variable isused as the response. Chen et al. (2001) proposed a method to detect linearA version of this chapter has been submitted for publication. Authors: Guohua Yan,William J. Welch and Ruben H. Zamar.24Chapter 3. A partial likelihood approach to linear clusteringstructures one by one without assuming a response variable in the computervision context.We propose in this paper a partial likelihood approach and compare itwith existing algorithms such as LGA (Harrington, 2007), MCLUST (Fraleyand Raftery, 2007) and MIXREG (Turner, 2006). Our approach borrowsideas from finite mixture model-based clustering and from mixture of regres-sions. Each linear cluster is characterized by a hyperplane; the orthogonaldeviation of a data point from that hyperplane is assumed to be normallydistributed but the position of the data on the hyperplane is not modelled.With this strategy, all variables are treated symmetrically, as compared withthe mixture of regressions approach.The rest of the paper is organized as follows. In Section 3.2, we describea partial likelihood-based objective function, leading to a model-based linearclustering algorithm, MLC. In Section 3.3, an EM algorithm to maximizethe partial likelihood function is presented. In Section 3.4, we discuss theasymptotic properties of MLC. In Section 3.5, we briefly discuss its relation-ships with existing algorithms such as MCLUST, LGA and MIXREG. InSection 3.6, we propose several methods for determining the number of lin-ear clusters. In Section 3.7, several real and simulated datasets are used tocompare MLC with LGA, MCLUST and MIXREG. Some closing remarksare given in Section 3.8.3.2 Partial likelihood-based objective functionfor linear clustering3.2.1 Partial likelihood for orthogonal regressionFirst, we formulate a partial likelihood representation for orthogonal regres-sion (see for example, Fuller, 1987). To model a single linear cluster, assumethat the vector-valued data x1,...,xn are independently drawn from a ran-dom mechanism represented by a random vector X in a d-dimensional space.We do not model the distribution of X except for the deviation from an alsounknown hyperplane.Let {x : aprimex - b = 0} be a hyperplane. For identifiability purpose, weassume that aprimea = 1 and that the first nonzero element of a is positive. Weassume that the signed orthogonal deviation aprimeX - b from the hyperplane{x : aprimex -b = 0}, is normally distributed, i.e.,aprimeX -b similar N(0,sigma2). (3.1)25Chapter 3. A partial likelihood approach to linear clusteringSince a and b are unknown, aprimeX -b is also unknown. Nevertheless, we canstill estimate a, b and sigma2 through the following partial likelihood functionnproductdisplayi=1N(aprimexi -b; 0,sigma2), (3.2)where N(? ; 0,sigma2) is the density function of the normal distribution withmean 0 and variance sigma2.Let ? and S be the sample mean and sample covariance matrix ofx1,...,xn respectively. Then the maximum partial likelihood estimate ?2 ofsigma2 is the smallest eigenvalue of S; the maximum partial likelihood estimate? of a is the (standardized) eigenvector associated with ?2, which is notnecessarily unique. Finally, the maximum partial likelihood estimate ? of bis ? = ?prime ?. See for example Fuller (1987).Note that model (3.1) and the partial likelihood (3.2) treat the compo-nents of x symmetrically.3.2.2 Partial likelihood for linear clusteringNow we assume that the data x1,...,xn are independently drawn from amore complicated random mechanism, still represented by a random vectorX in a d-dimensional space which we do not model fully. The data now liearound K hyperplanes {x : aprimekx = bk}, k = 1,...,K. Let Z = (Z1,...,ZK)primebe a random vector indicating these hyperplanes, where Zk = 1 with prob-ability pk for k = 1,...,K. Let p = (p1,...,pK)prime. We assume that, condi-tional on Zk = 1,aprimekX -bk similar N(0,sigma2k), k = 1,...,K.Let z1,...,zn be the corresponding unobservable indicators for the datax1,...,xn. Let kappa be the collection of component parameters, kappa = (aprime1,b1,sigma21, ..., aprimeK, bK, sigma2K)prime, and theta = (kappaprime,pprime)prime.When the indicators z1,...,zn are regarded as unknown parameters, thepartial likelihood function for parameters (thetaprime,zprime1,...,zprimen)prime isL(theta,z1,...,zn|x1,...,xn) =nproductdisplayi=1Kproductdisplayk=1[pkN(aprimekxi -bk; 0,sigma2k)]zik. (3.3)This is a so-called classification likelihood. See for example Scott andSymons (1971) and Banfield and Raftery (1993) in the context of mixture26Chapter 3. A partial likelihood approach to linear clusteringmodels for elliptical clusters. In this approach, the parameters theta and indi-cators z1,...,zn are chosen to maximize (3.3) and data points are classifiedaccording to these indicators.In another view (see for example, Fraley and Raftery, 1998, 2002), the in-dicators z1, ..., zn are regarded as realizations of random vectors Z1,...,Zn,which in turn are an independent and identically distributed sample from Z.After integrating out the indicators z1,...,zn, the partial likelihood functionfor theta isL(theta|x1,...,xn) =nproductdisplayi=1Ksummationdisplayk=1pkN(aprimekxi -bk; 0,sigma2k). (3.4)The latter is a mixture likelihood approach. Celeux and Govaert (1993)did simulation studies of classification likelihood approaches and mixturelikelihood approaches in normal mixture models for elliptical clusters andreported that no likelihood method uniformly outperforms the others. It isnoted that the classification approach cannot consistently estimate parame-ters due to the ?all-or-nothing? classification bias (Bryant and Williamson,1978; Ganesalingam, 1989; Marriott, 1975). In our experience with linearclustering, both approaches give practically very similar clustering results.We shall pursue in this paper the mixture likelihood (3.4), as many existingmodel selection criteria like BIC can be used and allows for a more feasibleasymptotic theory (see Section 3.4).As in the usual normal mixture model for elliptical clusters, the partiallikelihood function (3.4) is unbounded: when a cluster consists of only pointslying on a hyperplane, the contribution of each of these points to the partiallikelihood tends to infinity as the variance tends to zero. The infinity occurson the boundary of the parameter space.Hathaway (1985) proposed a constrained formulation of the maximumlikelihood estimation in the univariate normal mixture model. Specifically,the author added a constraint on the standard deviations,min1<=<inegationslash=j<=<K(sigmai/sigmaj) greaterequal c > 0, (3.5)where c is a known constant determined a priori. We shall incorporate theconstraint (3.5) into the partial likelihood function (3.4) (See Step 2 in theEM algorithm in Section 3.3). The solution may depend on the choice ofthe constant c. A usual strategy is to decrease c gradually and monitorthe resulting solutions (Hathaway, 1986). [Chen and Kalbfleisch (1996);Ciuperca et al. (2003) proposed penalized approaches to this unboundednessproblem which are also readily applicable to linear clustering although not27Chapter 3. A partial likelihood approach to linear clusteringadopted in this paper; on the other hand, the Hathaway?s constraint canalso be regarded as a form of penalization on the likelihood.]The partial likelihood function (3.4) naturally brings the clustering prob-lem into a finite mixture model framework. Standard EM algorithms can beadapted to maximize (3.4); once the maximum partial likelihood estimate ?is obtained, data point xi can be assigned to the component with the largestposterior probability. The probabilities are given by?ik = ?pkN(?aprimekxi -?bk; 0, ?sigma2k)summationtextKk=1 ?pkN(?aprime xi -?bk; 0, ?sigma2), i = 1,...,n; k = 1,...,K, (3.6)which also serve as a measure of uncertainty of classifying data point xi.3.3 The EM algorithmNow we describe an EM algorithm for maximizing the partial likelihood(3.4). It is straightforward and works well in our experience.The completed log partial likelihood isl(theta|x1,...,xn,z1,...,zn) =nsummationdisplayi=1Ksummationdisplayk=1zik{log(pk) + log(N(aprimekxi -bk; 0,sigma2k))}.(3.7)In the E-step, with theta(t-1) from the previous iteration, we havew(t)ik equivalence E(Zik|theta(t-1),x1,...,xn)= p(t-1)k N(aprime(t-1)k xi -b(t-1)k ; 0,sigma2k(t-1))summationtextKk=1 p(t-1)k N(aprime(t-1)k xi -b(t-1)k ; 0,sigma2k(t-1)). (3.8)In the M-step, we havep(t)k =summationtextni=1 w(t)ikn . (3.9)Let?(t)k =summationtextni=1 w(t)ik xisummationtextni=1 w(t)ik,and?(t)k =nsummationdisplayi=1w(t)ik (xi - ?(t)(k))(xi - ?(t)(k))prime.28Chapter 3. A partial likelihood approach to linear clusteringThena(t)k is the eigenvector associated with the smallest eigenvalue of ?(t)k ,(3.10)b(t)k = aprime(t)k ?(t)(k), (3.11)andsigma2k(t) =summationtextni=1 w(t)ik (aprime(t)k xi -b(t)k )2summationtextni=1 w(t)ik. (3.12)If all sigma2k in equation (3.4) are assumed to be equal, then the common valueis estimated bysigma2(t) =summationtextni=1summationtextKk=1 w(t)ik (aprime(t)k xi -b(t)k )2n . (3.13)The MLC algorithm is described as follows.1. Initialize with theta(0). To initialize the EM algorithm, we adopt the strat-egy of Van Aelst et al. (2006). We randomly select K mutually exclu-sive subsets of d + 1 observations from x1,...,xn. For k = 1,...,K,we compute the maximum partial likelihood estimates a(0)k , b(0)k and(sigma2k)(0) from the kth subset of d+1 observations (see Subsection 3.2.1).The proportions p(0)k are all set as 1/K. The initial values are thentheta(0) = ((aprime1)(0),b(0)1 ,(sigma21)(0),...,(aprimeK)(0),b(0)K ,(sigma2K)(0),p(0)1 ,...,p(0)K )prime.2. If constraint (3.5) is not satisfied, go back to step 1; otherwise go tostep 3.3. Update theta(0) for a predefined number of iterations or until the improve-ment in the partial likelihood (3.7) is less than a predefined threshold.At iteration t,? E-step. Update w(t)ik by equation (3.8).? M-step. Update p(t)k , a(t)k , b(t)k and (sigma2k)(t) by equations (3.9),(3.10), (3.11) and (3.12), respectively. If c = 1, which correspondsto equal variances across clusters, use (3.13) in place of (3.12).4. Repeat steps 1 ? 3 for a predefined number of times. The final clus-ter labels ?1,...,?n and parameter estimates ? are the solution thathas the largest completed log partial likelihood (3.4) and satisfies con-straint (3.5).29Chapter 3. A partial likelihood approach to linear clusteringFor the classification likelihood (3.3), we can use a so-called CEM (classi-fication EM) algorithm. A C-step is inserted between the E-step and M-stepof step 3. In the C-step, xi is assigned to the cluster with the largest w(t)ik ,i.e., the maximum w(t)ik is replaced by the value 1 and all other w(t)ik take thevalue 0. If the maximum value of w(t)ik is not unique, choose one of the tyingclusters at random.3.4 Asymptotic propertiesLetf(x;theta) =Ksummationdisplayk=1pkN(aprimekx -bk; 0,sigma2k),which is the contribution of one observation to the partial likelihood (3.4).The constrained parameter space isThetac =bracelefttpbraceleftmidbraceleftbttheta = (a1,b1,sigma21,...,aK,bK,sigma2K,p1,...,pK) :aprimekak = 1,-infinity < bk < infinity,sigmak > 0,k = 1,...,K.mini,j sigmai/sigmaj greaterequal c > 0,0 < pk < 1,summationtextKk=1 pk = 1.bracerighttpbracerightmidbracerightbt.Assuming that the data arise from a probability distribution measure P, wehave the following results.Theorem 3.1 Let P be an absolutely continuous probability measure withfinite second moments. Letg(theta) equivalenceintegraldisplaylog f(x;theta)dP(x).Then the supremum of g over Thetac is finite and attainable.Since the function f(?,theta) is not a density function, this theorem andthe following ones cannot be proved by verifying some regularity conditions.The proofs find their roots in Wald (1949), Redner (1981), Hathaway (1983,1985) and Garc? -Escudero et al. (2007) and are available on request.Theorem 3.2 Let P be an absolutely continuous probability measure withfinite second moments. Let X1,...,Xn be a random sample from P. Thena global maximizer of the partial likelihood function L(theta |X1, ..., Xn) in(3.4) over Thetac exists almost surely if n greaterequal Kd + 1.30Chapter 3. A partial likelihood approach to linear clusteringNote that points in Thetac are not identifiable for f(x;? ). The functionf(x;? ) remains the same if we permute the labels 1, ..., K; pk1 and pk2 arenot identifiable if (ak1,bk1,sigma2k1) = (ak2,bk2,sigma2k2). Thus the consistency resultis in a quotient topology space. Let similar be an equivalent relation on Thetac suchthat theta1 similar theta2 if and only if f(x;theta1) = f(x;theta2) almost surely in P. Denoteby Thetaqc the quotient topological space consisting of all equivalent classes ofsimilar. For a point theta0 that maximizes g(theta) = integraltext log f(x;theta)dP(x), its equivalentclass in Thetaqc is denoted by thetaq0.Theorem 3.3 (Consistency). Let X1,...,Xn be a sample from an abso-lutely continuous probability measure P with finite second moments. Let?theta(n) be a global maximizer of the partial likelihood function L(theta|X1, ...,Xn) in (3.4) over Thetac. Then ?(n) arrowright thetaq0 almost surely in the topological spaceThetaqc.Letv1(theta) = Ebraceleftbiggparenleftbiggpartialdiff log f(X;theta)partialdiffthetaparenrightbiggparenleftbiggpartialdiff log f(X;theta)partialdiffthetaparenrightbiggprimebracerightbigg,andv2(theta) = Ebraceleftbiggpartialdiff2 log f(X;theta)partialdiffthetapartialdiffthetaprimebracerightbigg,where the expectations are taken with respect to P.Theorem 3.4 (Asymptotic normality). Let X1,...,Xn be a sample froman absolutely continuous probability measure P with finite sixth moments.Let ?(n) be a subsequence of global maximizers of the partial likelihood func-tion L(theta|X1,...,Xn) in (3.4) over Thetac, which tends to an interior point theta0.Then radicaln(?(n) -theta0) L N(0,v(theta0)),where v(theta0) = [v2(theta0)]+v1(theta0)[v2(theta0)]+ and A+ is the Moore-Penrose in-verse of A.In equation (3.6), ?ik is a function of xi and ?. Denote ?ik = hk(xi, ?),k = 1,...,K. By the Delta method, we haveCorollary 3.1 Let X1,...,Xn be a sample from an absolutely continuousprobability measure P with finite sixth moments. Let ?(n) be a subsequenceof global maximizers of the partial likelihood function L(theta|X1,...,Xn) in31Chapter 3. A partial likelihood approach to linear clustering(3.4) over Thetac, which tends to an interior point theta0. Let x be a data point.Then for k = 1,...,K,radicaln(hk(x, ?theta(n)) -hk(x,theta0))LarrowrightN(0,[h(0)k (x,theta0)]primev(theta0)[h(0)k (x,theta0)]),whereh(0)k (x,theta0) = partialdiffhk(x,theta)partialdiffthetavextendsinglevextendsinglevextendsinglevextendsingle theta = theta0.Using this corollary, we can build approximate confidence intervals forwik by replacing theta0 with ? and hence evaluate the clustering of a data point.3.5 Relationships with other clustering methods3.5.1 With LGAIn the LGA of Van Aelst et al. (2006), the objective function is the aggre-gated sum of squared orthogonal distances of the data points to their closesthyperplanes. Using our notation, LGA minimizesd(kappa,z1:n) =nsummationdisplayi=1Ksummationdisplayk=1zik(aprimekxi -bk)2. (3.14)In the classification likelihood (3.3), if we assume that the variances sigma2k areequal across all components (or equivalently, c = 1 in (3.5)) and the mixingproportions pk are equal, the log completed partial likelihood isl(kappa,z1:n) =nsummationdisplayi=1Ksummationdisplayk=1zik log[N(aprimekxi -bk; 0,sigma2)]. (3.15)It is straightforward to check that the minimization of (3.14) and the max-imization of (3.15) are equivalent. Hence the iterative procedure in LGAcoincides with the CEM algorithm, which is discussed in Section 3.3, formaximizing (3.15).In this sense, the proposed partial likelihood approach, or the classifica-tion partial likelihood (3.3), is an extension and a model for LGA. The pro-posed approach permits one dispersion parameter sigma2k for each cluster whileLGA implicitly uses one dispersion parameter for all clusters. Furthermore,in our model framework, we are able to use the membership probability wik(see (3.6)) to assess the uncertainty of clustering a data point while an adhoc measure, silhouette width, is used in LGA.32Chapter 3. A partial likelihood approach to linear clustering3.5.2 With normal mixture modelsIn the M-step in Section 3.3, we can see that the proposed approach isclosely related to normal mixture models. The difference resides in the E-step: the proposed approach weighs distances to hyperplanes while the lattercompares distances to cluster centres.In the use of normal mixture models one is forced to estimate the meansand also the covariance matrix of each component. Banfield and Raftery(1993) reparameterized the covariance matrices and lead to a variety of par-simonious covariance matrix structures. The unrestricted covariance matrixrequires K(d + d(d + 1)/2) parameters and the most parsimonious model,equal spherical covariance matrices, requires only Kd + 1 parameters. Theproposed approach can be regarded as a parsimonious simplification of an-other type which is appropriate for highly linear data sets. It requires d + 1parameters per cluster, i.e., d first order parameters for the location of ahyperplane and one second order parameter for the dispersion around thehyperplane.3.5.3 With mixture of ordinary regressionsAs mentioned in the introduction, a mixture of ordinary regressions ap-proach requires a response variable for guidance. A response variable suit-able for clustering purpose may not exist. Furthermore, due to the ?re-gression to the mean? phenomenon, the regression hyperplanes are differentwhen different variables are selected as the response. The proposed ap-proach treats each variable symmetrically, and therefore is more suitable ina clustering setting.3.6 Choosing the number of linear clustersIn some problems, the number of linear clusters is obvious from subject-matter knowledge; in other problems, it has to be estimated from the data.Too many components exploit the randomness of the data and make it hardto interpret the clustering; too few clusters lose information and may bemisleading.3.6.1 Bootstrapping the partial likelihood ratioA natural approach is to apply the partial likelihood ratio test to a sequencesof hypotheses K = K0 against K = K0 +1, starting from K0 = 1. However,33Chapter 3. A partial likelihood approach to linear clustering-2lnlambda, where lambda is the likelihood ratio, does not have a usual asymptoticchi2 distribution under the null hypothesis. The reason is that the regularityconditions are violated as the parameter space under the null hypothesis ison the edge of the parameter space under the alternative hypothesis whenthe former is embedded into the latter. Much theoretic research has beendone on likelihood ratios for mixture problems (Chen and Chen, 2001a,c,b;Chen et al., 2001; Chen, 1998; Chen and Kalbfleisch, 2005; Everitt and Hand,1981; Thode Jr et al., 1988, among others).We adapt the bootstrap approach in McLachlan (1987) to our setting.The log likelihood ratio statistic -2log lambda for the test of the null hypothesisof K = K1 versus the alternative K = K2 can be bootstrapped as follows.Let ? be the maximum partial likelihood estimate of all parameters from theoriginal sample under the null hypothesis. For i = 1,...,n, sample zi fromthe multinomial distribution with probabilities (?1, ..., ?K1); if zik = 1,sample orthogonal distance ei from N(0, ?2k). The ith data point in thebootstrap sample is xi + (ei - ?k)?k. Suppose B copies of n samples aregenerated. The value of -2log lambdai is computed after fitting mixture modelsfor K = K1 and K = K2, i = 1,...,B. The significance level of thetest is obtained by comparing -2log lambda with the empirical distribution of-2log lambda1,...,-2log lambdaB.3.6.2 Information criteriaThe partial likelihood approach enables us to use an approximate Bayes fac-tor to determine the number of linear clusters. Banfield and Raftery (1993)used a heuristically derived approximation to twice the log Bayes factor,called AWE (approximate weight of evidence) to determine the number ofclusters. Fraley and Raftery (2002) used BIC (Bayesian information crite-rion) as a more reliable approximation to twice the log Bayes factor,BIC(K) = 2L(K) -vK log(n), (3.16)where L(K) is the log partial likelihood of a mixture of K componentsand vK is the number of independent parameters to be estimated in a Kcomponent mixture. As stated in Fraley and Raftery (2002), although theregularity conditions that underly the approximation are not satisfied infinite mixture models, there are some theoretical and practical support of itsuse for determining the number of clusters. Keribin (2000) showed that theestimator of the number of clusters based on BIC is consistent; Fraley andRaftery (1998, 2002) included a range of applications in which estimatingthe number of clusters based on BIC has given good results.34Chapter 3. A partial likelihood approach to linear clusteringA large number of criteria have been proposed in the literature, includingNEC (Biernacki et al., 1999; Celeux and Soromenho, 1996), ICL (Biernackiet al., 2000), and cross validation likelihood (Smyth, 2000). The performanceof some of these criteria were compared by Biernacki et al. (1999).As we are adopting a partial likelihood approach, in principle almostall methods proposed in a model framework are approximately applicablealthough we do not fully model the data. As pointed out by Everitt et al.(2001), ?it is advisable not to depend on a single rule for selecting thenumber of groups, but to synthesize the results of several techniques?. Inour experience, though, all the methods mentioned above work well if theredo exist obvious linear patterns in a data set.3.7 Examples3.7.1 Simulated data IOne data set of size 300 in two dimensions is simulated to illustrate thedifferent behaviours of MCLUST, LGA and the proposed MLC. The xi1are uniformly sampled from the interval (0,15); for the first 250 points,xi2 = xi1 + 3epsilon1i and for the last 50 points, xi2 = 8 + 1.5xi1 + epsilon1i, where epsilon1iare independent and identically standard normal distributed. The data areshown in the top-left panel of Figure 3.1.For this data set, MCLUST chooses a model of three clusters with differ-ent volumes, different shapes and different orientations by the BIC criterion.LGA favours one cluster using the gap statistic. MLC selects two clusters byBIC, ICL, NEC and the bootstrapping likelihood ratio test (see Table 3.1).We set the threshold c = 0.05. The result of MIXREG, which is omitted, issimilar to that of MLC as we generate the data in favour of it. Figure 3.1displays the clustering results from MCLUST, LGA and MLC when thenumbers of clusters are all set as 2.This simulated data helps us understand how these algorithms work.MCLUST tries to find elliptical structures, but the two linear clusters havedata distributions along the lines not well modelled by normal distributions.For LGA, when two clusters are roughly parallel and touching, it imposesa structure to the data set and tends to partition the data set into bandswith equal width.35Chapter 3. A partial likelihood approach to linear clustering0 5 10 150102030True classificationx1x20 5 10 150102030MLCx1x20 5 10 150102030LGAx1x20 5 10 150102030MCLUSTx1x2Figure 3.1: Comparison of clustering results of simulated data I. Upper-leftpanel displays the true classification; upper-right, lower-left and lower-rightare that of MLC, LGA and MCLUST.36Chapter 3. A partial likelihood approach to linear clusteringTable 3.1: Criteria for choosing the numbers of clusters, K, when MLC isapplied to simulated data I. ?Boot p-value? is the p-value using the boot-strapping partial likelihood ratio test for H0 : K = K0 vs H1 : K = K0 + 1where 99 bootstrapping samples are usedCriterionK BIC ICL NEC Boot p-value1 -1541.61 ? ? 02 -1461.63 -1477.28 0.14 13 -1468.25 -1712.08 0.69 ?3.7.2 Simulated data IIAnother dataset is simulated to demonstrate the difference between MIX-REG and the other three algorithms, MLC, MCLUST and LGA. Thisdataset has 200 data points in four dimensional space. For the first 100data points, the first two components, xi1 and xi2, are linearly related whilethe third and the fourth components, xi3 and xi4, are randomly scattered:xi1 = ti + epsilon1i1, xi2 = 0.5ti + epsilon1i2, xi3,xi4 similar N(0,52), for i = 1,...,100.(3.17)Here epsilon1ij similar N(0,1) and ti similar N(0,52). For the remaining 100 data points, thefirst two components are randomly scattered while the last two componentsare linearly related:xi1,xi2 similar N(0,52), xi3 = ti + epsilon1i3, xi4 = 0.5ti + epsilon1i4, for i = 101,...,200.(3.18)Again, epsilon1ij similar N(0,1) and ti similar N(0,52). The pairwise scatterplots of thesimulated data are in Figure 3.2.From the way the dataset is simulated, it is not possible to find a nat-ural response variable to discriminate the two linear clusters. As a result,MIXREG does not work well for this dataset and the two clusters are notwell separated. The misclassification matrices of the four algorithms are inTable 3.2. MLC, MCLUST and LGA have the same good performance here.Note that (3.17) and (3.18) generate (elongated) multivariate normal clus-ters in x1 and x2 or in x3 and x4, respectively. The assumptions of MCLUSTare therefore satisfied, but MLC nonetheless has the same performance.3.7.3 Australia rock crab dataNow we consider the rock crab data set of Campbell and Mahon (1974) onthe genus Leptograpsus. We focus on the 100 blue crabs, 50 of which are37Chapter 3. A partial likelihood approach to linear clusteringx1-10 -5 0 5 1011111111111111111 111111111111111111 1111111111 111111111111111111111111111111111111111111111111 111222222222222222 222 2222222222222222222222222222222222222222 22222222222222222222 222 2222222222222111111111111111111 1111111111111111 111111111 111111111111 111111111111111111111111111111 111 11111111222 2222222 22222222 222 2222222222222 222222222222 2222222222222222222222222222222222 222222222222222 22-15 -10 -5 0 5 10 15 20-15-5051011111111 11111111 1111111111111111111 111 1111 111 111 111111111 1111111111111111111 11111111111 111111111 11 222 222222222222222 222 2222222222222 222222222222 22222222222222 22222222222222222222 222222222222222 22-10-505101111111 1111111111111 1 111111111 11111 1 1111 111111111111111111111111 111111111111111 11111 11 11111111111222222222222222222222222222222222 222222222222222222222222222222222222222222222222222222222222222222x2 1111111 1111111111111 1 111111111111 111 1111111111111111111111111 11 11111111 111111 1111111 11 111 11111111222 22 22222222222222222222222222222222222222222222 22222222222222222222222222222222222222222222222222211111111 1111111111111111111111111 111 1111 111111111111111111111111 11111111 111111111111111 1111111111222 2222222222222222222222222222222222222222222222 222222222222222222222222222222222222222222222222222111111111111111111111 11111111111111 111111111111111111111111111111 1111111111 111111111111111111111112222222222222 222222222222222222222222222 222222222 222222222 22222222 2222222222222222222222222222222211111 1111111111111111 1111 11111111111 111111111111111111111111111111 1111111111 111111111111111111111112222222222222 222222222222222222222222222 2222222222222222222222 222 22222 22222222222 22222222 222222222x3-10-505101111111111111111111111111 11111111111 111 111111111111111111111111111 111111111 11 11111111111111111111112222222222222222222222222222222222222222 22222222222222222222222222222222222222222 2222222222222222-15 -10 -5 0 5 10-15-505101520111 111 1 111 1111 111111 11111111 1 11111111111111111111111 11 11111111111 111111111111 11111111111111111111112 22222222222 222 22222222222222222 222222 22 222222 22 22222222222222222222 22 222 22222 222 22 222222 222111 11 1 1111 1111 11111111111111 1 11111111111111111111111 11 11111111111 111111111111 11111111111111111111112 222222222222 22 222222222222222222 22 2222 22222222222 2222222 222 2222222222222 222 222 22222 22 2 222222222-10 -5 0 5 101111 111 111 1111 111111 111111111 1111111111111111111111111 11111111111 111111111111111111 1111111111111111222 22 222222222 22 222222 222222222222 2 222 22 22 222222222 22222222 222222222222 2222 222 22 22222 222222 222x4Figure 3.2: Pairwise scatterplots of simulated data II.Table 3.2: Misclassification matrices of simulated data II from MLC,MCLUST, LGA and MIXREG.MLC,MCLUST, MIXREGLGA Resp= x1 or x2 Resp= x3 Resp= x4True class 1 2 1 2 1 2 1 21 88 12 26 74 5 95 63 372 9 91 25 75 10 90 75 2538Chapter 3. A partial likelihood approach to linear clusteringTable 3.3: Criteria for choosing the number of clusters, K, when MLCis applied to the blue crab data. ?Boot p-value? is the p-value using thebootstrapping partial likelihood ratio test for H0 : K = K0 vs H1 : K =K0 + 1 where 99 bootstrapping samples are used.CriterionK BIC ICL NEC Boot p-value1 476.13 ? ? 02 518.88 500.77 0.24 0.633 506.35 446.83 0.51 ?males and 50 are females. Each crab has five measurements, FL, the widthof frontal lip, RW, the rear width, CL, the length along the midline, CW,the maximum width of the carapace, and BD, the body depth in mm. As inPeel and McLachlan (2000), we are interested in the classification of maleand female crabs. Peel and McLachlan (2000) had 18 crabs misclassifiedapplying a mixture of two t distributions using all the five measurements.Using a mixture of two normal distributions resulted in one more crab beingmisclassified.Inspecting all pairwise scatter plots of the data set, RW and CL displaytwo fairly well separated lines as in Figure 3.3. We apply MLC, MCLUST,LGA and MIXREG to these two measurements.For this data set, MCLUST chooses two clusters with different volumes,different shapes and different orientations by the BIC criterion. LGA choosestwo clusters by the gap statistic. MLC also chooses two clusters by the var-ious criteria in Table 3.3. MCLUST has 13 cases misclassified. LGA andMLC both have 7 cases misclassified. Inspecting the posterior probabili-ties of classification, MLC assigns roughly equal probabilities to the twoclusters for two of the misclassified cases. In other words, there is extremeuncertainty about the assignment of these two cases, and their misclassifi-cations are not surprising. There are no such diagnostic probabilities forLGA. MIXREG has eight cases misclassified when RW is regarded as theresponse; it has seven cases misclassified if CL is taken as the response.If we take logarithms of RW and CL, only five crabs are misclassified byLGA and MLC. MIXREG has five cases misclassified, if CL is taken as theresponse; it has 15 cases if RW is taken as the response. The results aresummarized in Table 3.4. The clustering results are displayed in Figure 3.4.39Chapter 3. A partial likelihood approach to linear clusteringFL8 12 16 20 30 40 50812162081216RWCL1525354520304050CW8 12 16 20 15 25 35 45 6 10 14 186101418BDFigure 3.3: Pairwise scatterplots of the blue crab data.Table 3.4: Misclassification matrices of the blue crab data (using log scales)from MLC, MCLUST, LGA and MIXREG.MLC MCLUST LGA MIXREG MIXREG(Response: CL) (Response: RW)True class 1 2 1 2 1 2 1 2 1 21 50 0 50 0 50 0 50 0 38 122 5 45 14 36 5 45 5 45 3 4740Chapter 3. A partial likelihood approach to linear clustering2.0 2.2 2.4 2.6 2.82.83.03.23.43.63.8MLCRWCL2.0 2.2 2.4 2.6 2.82.83.03.23.43.63.8MCLUSTRWCL2.0 2.2 2.4 2.6 2.82.83.03.23.43.63.8LGARWCL2.0 2.2 2.4 2.6 2.82.83.03.23.43.63.8MIXREGRWCLFigure 3.4: Clustering results of the blue crab data (using log scales)from MLC, MCLUST, LGA and MIXREG. CL is used as the response inMIXREG.41Chapter 3. A partial likelihood approach to linear clustering3.7.4 Taqman single nucleotide polymorphism genotypingdataA single nucleotide polymorphism (SNP) is a single-base variation in agenome. The genetic code is specified by the four nucleotide ?letters?: A(adenine), C (cytosine), T (thymine) and G (guanine). A SNP involvesusually only two possibilities of these four letters, referred to as allele Xand allele Y . An individual has two copies of genetic information passedfrom both parents. So the genotypes at this specific genetic position hasthree possible types: homozygote for allele X, homozygote for allele Y andheterozygote for both alleles. SNP genotyping, determination of genotypesof individuals, plays an increasing role in genetics studies.Taqman assay is a fluorescence-based high-throughput genotyping tech-nology. Two fluorescent signals, x1 and x2, called Allele.X and Allele.Ybelow, are used to detect the presence or absence of the two alleles, andtheir intensities determine the genotypes. (There is a third fluorescent sig-nal presumably for normalizing the signal intensities. However, its effective-ness is suspicious in some situations. We choose to use signals without thisnormalization.)Blood samples of many individuals are assayed in wells of a plate si-multaneously. For quality controls purpose, there are usually some bloodsamples with known genotypes, called positive controls, and there are somewells without genetic material, called negative controls.In the scatterplot of a SNP data set, there are typically four clusters:three clusters for the three possible genotypes and one for negative controls.In addition, there may be some failed samples. A clustering algorithm isusually employed to classify the samples. Algorithms commonly used includek-means algorithms (Olivier et al., 2002; Ranade et al., 2001) and model-based algorithms (Fujisawa et al., 2004; Kang et al., 2004). Finding analgorithm that works for all plates/SNPs is a challenge.We fit a mixture of five components to the SNP genotyping data, forthe three genotype clusters, the negative controls, and failed samples, re-spectively. The three genotype components are modelled linearly with themixture likelihood (3.4). The fourth component is modelled with a bivari-ate normal distribution, and the fifth is a uniform component for possibleoutlying points.In the model fitting, we use the known labels for the negative controls.All the negative controls are assigned to the fourth component a priori,as are all points with both signals less than the corresponding maximumintensities of negative controls. Denote the set of these points by N. We42Chapter 3. A partial likelihood approach to linear clusteringTable 3.5: Largest membership probabilities of the 7 points labelled inFigure 3.5 by MCLUST and MLC.Points1 2 3 4 5 6 7MLC 0.99 1.00 1.00 0.99 0.98 0.86 0.99MCLUST 0.96 0.93 1.00 1.00 1.00 0.97 0.50do not use the known labels of the positive controls; these samples are usedonly for evaluating the clustering results.Hence, the modified mixture likelihood for the problem isL(kappa,p1,...,p5|x1,...,xn)=productdisplayinegationslashelementNbraceleftBigg 3summationdisplayk=1pkN(aprimekxi -bk; 0,sigma2k) + p4N(xi;?,?) + p5U(xi;R)bracerightBigg?productdisplayielementNN(xi;?,?), (3.19)where U is the uniform distribution on the rectangular region R given by(min(xi1), max(xi1)) ? (min(xi2), max(xi2)). This mix of linear, ellipti-cal and uniform components demonstrates the flexibility of our modellingapproach.Figure 3.5 displays the clustering results of one plate using four differentalgorithms: MLC, MCLUST, LGA and MIXREG. Note that the majorityof the points lie on three lines, hence the need for linear clustering. Theresults of modified MLC and MCLUST are displayed in the upper-left andthe upper-right panels of Figure 3.5. All the points assigned to one of thethree genotype clusters by MCLUST are assigned exactly the same wayby MLC. Points 1 to 7, however, are assigned by MCLUST as backgroundnoise while MLC classifies all these points to one of the genotype clusters.Indeed, we know that these points should belong to the genotype clusters,and furthermore, points 1, 3, 6 and 7 are positive controls, all classifiedcorrectly by MLC. Table 3.5 displays the largest membership probabilitiesof these seven points by MCLUST and MLC, respectively. For the first sixof the seven points, the two methods are very confident of their differentassignments. We speculate here that, MCLUST does not place these pointsin the genotype clusters because they appear to be outliers with respect tothe normality assumptions. In contrast, MLC classifies a point only usingthe proximity to a line and does not model the position along a line.43Chapter 3. A partial likelihood approach to linear clustering0.0 0.5 1.0 1.50.00.51.01.52.0MLCAllele.XAllele.Y 12345670.5 1.0 1.50.51.01.52.0MCLUSTAllele.XAllele.Y12345670.5 1.0 1.50.51.01.52.0LGAAllele.XAllele.Y0.5 1.0 1.50.51.01.52.0MIXREGAllele.XAllele.YFigure 3.5: Clustering results from four methods applied to a Taqman dataset plate. openbullet, + and triangle denote the labels assigned for the three linear clusters(genotypes) by each method, ? denotes points assigned as negative controlsand diamondmath denotes points assigned to the background cluster.44Chapter 3. A partial likelihood approach to linear clusteringThe lower-left panel of Figure 3.5 displays the results from LGA. SinceLGA is formulated to deal with only linear clusters, we set the number ofclusters to three. (If four or five clusters are chosen, the genotype clusters aremixed up.) The restriction to linear clusters means that LGA has difficultyseparating the three genotype clusters from the negative controls and thefailed samples. In the lower-right panel, we choose the signal for Allele Y asthe response to implement MIXREG and set the number of regression linesto 3. MIXREG fails completely here. It finds only one of the three linearclusters. We obtained similar results if the signal for Allele X is chosen asthe response.3.8 DiscussionWe have proposed a flexible model-based approach to linear clustering. Itis flexible in the sense that only the deviation from a hyperplane is mod-elled parametrically; the position on the hyperplane is not modelled. Theadvantage of this approach is illustrated in the genotyping example, wherethe distribution along the line is complex and difficult to model. Further-more, as was also illustrated in this example, we can incorporate ellipticalclusters as necessary and a background cluster, borrowing from standardmodel-based clustering.Robustness to outliers is desirable, as the assumption of normal devia-tions around hyperplanes is sensitive to large deviations in the orthogonaldirection. In addition to the inclusion of a uniform background cluster (Ban-field and Raftery, 1993), one option would be to use a heavier tailed distribu-tion, for example, Student?s t distribution with small degrees of freedom orwith degrees of freedom depending on the data. This would adapt Peel andMcLachlan (2000)?s EM algorithm for t mixture models from the ellipticalcontext to the linear. The adaptation is straightforward but computationallymore expensive. Further ideas include estimating the component covariancematrices in the M-step in a robust way, for example, trimming off somepoints.With aprimex = b, we are specifying a hyperplane in d-1 dimensions. Withlittle effort, this could be generalized to a mixture of partial likelihoods, eachof which specifies a hyperplane of dimension q < d,l(kappa,p|x1:n) =nproductdisplayi=1Ksummationdisplayk=1pkN(Aprimekxi -bk;0,?k), (3.20)where A is of dimension d ? (d - q), b is a vector of dimension d - q,45Chapter 3. A partial likelihood approach to linear clusteringand ?k is a (d - q) ? (d - q) covariance matrix for the deviation from thehyperplane. In the extreme case of a 0-dimension hyperplane, which is apoint, we have the usual mixture of multivariate normal distributions. Amixture of components with various dimensions could be considered.A Bayesian version of this methodology would be helpful if some clustersare sparse but there is strong prior information about their approximatelocations or properties (e.g., the parameters defining lines).46BibliographyBanfield, J. and A. Raftery (1993). Model-Based Gaussian and Non-Gaussian Clustering. Biometrics 49(3), 803?821.Biernacki, C., G. Celeux, and G. Govaert (1999). An improvement of theNEC criterion for assessing the number of clusters in a mixture model.Pattern Recognition Letters 20(3), 267?272.Biernacki, C., G. Celeux, and G. Govaert (2000). Assessing a mixture modelfor clustering with the integrated completed likelihood. IEEE Transactionson Pattern Analysis and Machine Intelligence 22(7), 719?725.Bryant, P. and J. Williamson (1978). Asymptotic behaviour of classificationmaximum likelihood estimates. Biometrika 65(2), 273?281.Campbell, N. and R. Mahon (1974). A multivariate study of variationin two species of rock crab of genus Leptograpsus. Australian Journal ofZoology 22(3), 417?425.Celeux, G. and G. Govaert (1993). Comparison of the mixture and theclassification maximum likelihood in cluster analysis. Journal of statisticalcomputation and simulation(Print) 47(3-4), 127?146.Celeux, G. and G. Soromenho (1996). An entropy criterion for assessingthe number of clusters in a mixture model. Journal of Classification 13(2),195?212.Chen, H. and J. Chen (2001a). Large sample distribution of the likelihoodratio test for normal mixtures. Statistics and Probability Letters 52(2),125?133.Chen, H. and J. Chen (2001c). The Likelihood Ratio Test for Homogeneityin Finite Mixture Models. The Canadian Journal of Statistics/La RevueCanadienne de Statistique 29(2), 201?215.Chen, H. and J. Chen (2001b). The likelihood ratio test for homogeneity inthe finite mixture models. Canadian Journal of Statistics 29(2), 201?215.47BibliographyChen, H., J. Chen, and J. Kalbfleisch (2001). A modified likelihood ra-tio test for homogeneity in finite mixture models. Journal of the RoyalStatistical Society: Series B (Methodological) 63(1), 19?29.Chen, H., P. Meer, and D. Tyler (2001). Robust regression for data withmultiple structures. 2001 IEEE Conference on Computer Vision and Pat-tern Recognition 1, 1069?1075.Chen, J. (1998). Penalized likelihood-ratio test for finite mixture modelswith multinomial observations. The Canadian Journal of Statistics 26(4),583?599.Chen, J. and J. Kalbfleisch (1996). Penalized Minimum-Distance Estimatesin Finite Mixture Models. The Canadian Journal of Statistics/La RevueCanadienne de Statistique 24(2), 167?175.Chen, J. and J. Kalbfleisch (2005). Modified likelihood ratio test in finitemixture models with a structural parameter. Journal of Statistical Planningand Inference 129(1-2), 93?107.Ciuperca, G., A. Ridolfi, and J. Idier (2003). Penalized Maximum Like-lihood Estimator for Normal Mixtures. Scandinavian Journal of Statis-tics 30(1), 45?59.DeSarbo, W. and W. Cron (1988). A maximum likelihood methodologyfor clusterwise linear regression. Journal of Classification 5(2), 249?282.Everitt, B. and D. Hand (1981). Finite mixture distributions. Monographson Applied Probability and Statistics, London: Chapman and Hall.Everitt, B., S. Landau, and M. Leese (2001). Cluster Analysis. HodderArnold.Fraley, C. and A. Raftery (1998). How Many Clusters? Which Cluster-ing Method? Answers Via Model-Based Cluster Analysis. The ComputerJournal 41(8), 578.Fraley, C. and A. Raftery (2002). Model-Based Clustering, DiscriminantAnalysis, and Density Estimation. Journal of the American StatisticalAssociation 97(458), 611?632.Fraley, C. and A. Raftery (2007). mclust: Model-Based Clustering / NormalMixture Modeling. R package version 3.1-1.48BibliographyFujisawa, H., S. Eguchi, M. Ushijima, S. Miyata, Y. Miki, T. Muto, andM. Matsuura (2004). Genotyping of single nucleotide polymorphism usingmodel-based clustering. Bioinformatics 20, 5.Fuller, W. (1987). Measurement error models. Wiley Series in Probabilityand Mathematical Statistics, New York: Wiley, 1987.Ganesalingam, S. (1989). Classification and Mixture Approaches to Clus-tering via Maximum Likelihood. Applied Statistics 38(3), 455?466.Garc? -Escudero, L., A. Gordaliza, R. San Mart? , S. van Aelst, and R. Za-mar (2007). Robust linear clustering. Preprint.Harrington, J. (2007). lga: Tools for linear grouping analysis (LGA). Rpackage version 1.0-0.Hathaway, R. (1983). Constrained maximum-likelihood estimation for amixture of m univariate normal distributions. Ph. D. thesis, Rice Univer-sity.Hathaway, R. (1985). A Constrained Formulation of Maximum-LikelihoodEstimation for Normal Mixture Distributions. The Annals of Statis-tics 13(2), 795?800.Hathaway, R. (1986). A constrained EM algorithm for univariate normalmixtures. Journal of Statistical Computation and Simulation 23(3), 211?230.Kang, H., Z. Qin, T. Niu, and J. Liu (2004). Incorporating GenotypingUncertainty in Haplotype Inference for Single-Nucleotide Polymorphisms.The American Journal of Human Genetics 74(3), 495?510.Keribin, C. (2000). Sankhya: The Indian Journal of Statistics 62, 49?66.MacLachlan, G. and D. Peel (2000). Finite mixture models. J. Wiley.Marriott, F. (1975). Separating mixtures of normal distributions. Biomet-rics 31(3), 767?769.McLachlan, G. (1987). On Bootstrapping the Likelihood Ratio Test Stastis-tic for the Number of Components in a Normal Mixture. Applied Statis-tics 36(3), 318?324.49BibliographyOlivier, M., L. Chuang, M. Chang, Y. Chen, D. Pei, K. Ranade,A. de Witte, J. Allen, N. Tran, D. Curb, et al. (2002). High-throughputgenotyping of single nucleotide polymorphisms using new biplex invadertechnology. Nucleic Acids Research 30(12), e53.Peel, D. and G. McLachlan (2000). Robust mixture modelling using the tdistribution. Statistics and Computing 10(4), 339?348.Ranade, K., M. Chang, C. Ting, D. Pei, C. Hsiao, M. Olivier, R. Pesich,J. Hebert, Y. Chen, V. Dzau, et al. (2001). High-Throughput Genotypingwith Single Nucleotide Polymorphisms. Genome Research 11(7), 1262?1268.Redner, R. (1981). Note on the Consistency of the Maximum LikelihoodEstimate for Nonidentifiable Distributions. The Annals of Statistics 9(1),225?228.Scott, A. and M. Symons (1971). Clustering Methods Based on LikelihoodRatio Criteria. Biometrics 27(2), 387?397.Smyth, P. (2000). Model selection for probabilistic clustering using cross-validated likelihood. Statistics and Computing 10(1), 63?72.Spath, H. (1982). Fast algorithm for clusterwise linear regression. COM-PUT. 29(2), 175?181.Thode Jr, H., S. Finch, and N. Mendell (1988). Simulated PercentagePoints for the Null Distribution of the Likelihood Ratio Test for a Mixtureof Two Normals. Biometrics 44(4), 1195?1201.Turner, T. (2000). Estimating the propagation rate of a viral infection ofpotato plants via mixtures of regressions. Journal of the Royal StatisticalSociety Series C(Applied Statistics) 49(3), 371?384.Turner, T. (2006). mixreg: Functions to fit mixtures of regressions. Rpackage version 0.0-2.Van Aelst, S., X. Wang, R. Zamar, and R. Zhu (2006). Linear groupingusing orthogonal regression. Computational Statistics and Data Analy-sis 50(5), 1287?1312.Wald, A. (1949). Note on the Consistency of the Maximum LikelihoodEstimate. The Annals of Mathematical Statistics 20(4), 595?601.50Chapter 4Bayesian linear clusteringwith application to singlenucleotide polymorphismgenotyping4.1 IntroductionThis paper was motivated by a clustering problem in single nucleotide poly-morphism (SNP) genotyping. A single nucleotide polymorphism (SNP, pro-nounced as ?snip?) is a single-base variation in a genome. The genetic codeof life is specified by the four nucleotide ?letters?: A (adenine), C (cyto-sine), G (guanine) and T (thymine). There are two complementary DNAstrands. It is sufficient to consider only one. SNP variation occurs when asingle nucleotide, such as an A, is replaced by one of the other three lettersC, G or T. One SNP usually involves only two letters, referred to generi-cally throughout this paper as allele X and allele Y . An individual has twocopies of genetic information passed from both parents. So the genotypesat a specific genetic position have three possibilities: homozygote for alleleX, homozygote for allele Y and heterozygote for both alleles. SNP geno-typing, determination of genotypes of individuals, plays an increasing rolein genetics studies.For a small project, it is possible to make genotype calls manually. Inmost cases, it is not hard for an expert to perform this job, and the ?eye-balling? procedure usually gives reasonable results due to its sophisticatedincorporation of prior information. For large-scale studies, however, manualA version of this chapter will be submitted for publication. Authors: Guohua Yan,William J. Welch and Ruben H. Zamar.51Chapter 4. Bayesian linear clusteringscoring can become a daunting challenge. Typical SNP genotyping applica-tions involve thousands of patients and hundreds of SNPs. Hence, reliableautomated genotyping methods are highly needed.TaqMan SNP Genotyping Assay (Applied Biosystems) is a fluorescence-based high-throughput genotyping technology. Blood samples of many in-dividuals are arranged in a 96- or 384-well plate and are assayed simultane-ously. Two fluorescent signals, x1 and x2, also called Allele.X and Allele.Ybelow, are used to detect the presence or absence of the two alleles. There isa third fluorescent signal presumably for normalizing the signal intensities.However, its effectiveness is suspicious in some situations. See Chapter 2.For quality controls purpose, some blood samples with known genotypesare also included in a plate; these samples are called positive controls. Aswell, some wells do not have genetic material; these are called negativecontrols.In the scatterplot of a SNP data set, there are typically four clusters,as in Figure 4.1. In the lower-right, upper-left, and upper-right corners arethree clusters, presumably containing samples of wild-type homozygotes,variant homozygotes, and heterozygotes, respectively. In the lower-left cor-ner, the cluster may contain negative controls and/or some failed samples.A clustering algorithm is usually employed to call the SNP genotypes. Al-gorithms commonly used in the literature include k-means (Olivier et al.,2002; Ranade et al., 2001) and model-based algorithms (Fujisawa et al.,2004; Kang et al., 2004). The proprietary software Sequence Detection Sys-tem is included in a thermal cycler. Usually human intervention of an expertoperator is usually needed to review the genotype calling results. Finding analgorithm that works well for all plates/SNPs is a challenge, especially whenthe variant allele homozygous genotype cluster is sparse, in which standardclustering algorithms often fail.Several clustering algorithms k-means, MCLUST (Fraley and AE, 2006),LGA (Harrington, 2007), MIXREG (Turner, 2006) and MLC (Yan et al.,2008) are applied to the genotyping data in Figure 4.1. Figure 4.2 dis-plays their corresponding clustering results. We can see that all of thesealgorithms fail to identify the five points in the upper-left corner as vari-ant allele homozygotes while an expert operator would do with confidence.These algorithms fail because the variant allele homozygous genotype clusteris very sparse. This motivates us to adopt a Bayesian approach to incorpo-rate available prior information. In addition, we found that it is convenientto model the genotype clusters as linear structures.Most clustering methods and algorithms cluster data points around ?cen-ters?. However, some data sets, as in the SNP genotype setting, form groups52Chapter 4. Bayesian linear clustering0.2 0.4 0.6 0.8 1.0 1.2 1.40.51.01.52.0Allele.XAllele.YFigure 4.1: Scatterplot of one plate in which the variant allele homozygouscluster has only five points (upper-left).through linear relationships and standard clustering techniques are usuallynot able to find these linear patterns. By linear clustering, we mean detect-ing these linear clusters in a data set.Van Aelst et al. (2006) proposed a method called Linear Grouping Al-gorithm (LGA) for linear clustering problems without assuming a responsevariable. Yan et al. (2008) proposed a partial likelihood approach whichmodels only the signed orthogonal distance from each data point to a hy-pothesized hyperplane and treats all variables symmetrically.In this paper we introduce a hierarchical modeling approach, which isparticularly appropriate for identifying sparse linear clusters. We show thatthe sparse cluster in our SNP genotyping dataset can be successfully iden-tified after a careful specification of the prior distributions. The rest of thispaper is organized as follows. In Section 2, we describe a hierarchical modelframework for linear clustering. In Section 3, we discuss in details samplingissues. Our approach to linear clustering is illustrated with a relatively sim-ple dataset (crab dataset) in Section 4. In Section 5, we revisit the SNPgenotyping motivating dataset and show that a careful specification of theprior distributions is critical for the success of the clustering algorithm. Abrief discussion follows in Section 6.53Chapter 4. Bayesian linear clustering4342 22424232 21 1331 11 11132 2422231 111 1111 1311 1132 22 223341 131 11311431122 2232 12312 1 133121 1112 222 222 2232 111 1311 131322323131331 13113133323 33223331 111 13321 1322 11 111 1 13133 31 1 111112 222 232 1 123314141343232232 22 1 11111 11 111112324342424343424243412332 12322 222 1211331 131332 1112422 2422 222 232232242 222 2232222232232 2242222 22 2 22 22 23222244242 2321 1 11 131131331441141 10.2 0.4 0.6 0.8 1.0 1.2 1.40.51.01.52.0k-meansAllele.XAllele.Y3341 11414131 11 1332 22 22231 1411131 122 2222 2322 2231 11 113341 131 12323432211 1131 11311 1 133211 2221 111 111 1131 111 1311 131311313131331 13113133313 33113331 111 13331 1311 11 122 1 23233 32 2 222231 111 131 1 113314141343131131 11 1 11111 11 111111314341414343414143413331 11311 111 1111331 131331 1111411 1411 111 131131141 111 1131111131131 1141111 11 1 11 11 13113134141 1311 2 22 232232332342141 20.2 0.4 0.6 0.8 1.0 1.2 1.40.51.01.52.0MCLUSTAllele.XAllele.Y2221 11212121 13 1231 11 11121 1211131 111 1111 1211 1121 11 112223 121 11211221111 1121 11231 1 122111 1111 111 111 1121 311 1211 121211212121223 12112122312 22112223 111 13211 1211 11 113 1 12122 21 1 111111 111 121 1 312212121222121121 11 3 11111 11 111111212221212322212122211221 31211 111 1111331 123221 3331211 1211 111 121131121 111 1121111131121 1121111 11 1 11 11 13113122121 1211 1 13 121121221221121 10.2 0.4 0.6 0.8 1.0 1.2 1.40.51.01.52.0LGAAllele.XAllele.Y2221 11212121 11 1231 13 11121 1211131 111 1111 1211 1131 11 112221 121 11313221111 1121 11211 1 122111 1111 111 111 1121 111 1211 121211212121221 12112122312 22112221 111 12231 1211 11 111 1 12122 21 1 111131 111 121 1 112212121222121121 11 1 11111 11 111111212221212222212122213221 11211 111 1111331 121221 1111211 1211 111 121121121 111 1121111121121 1121111 11 1 11 11 12113122121 1211 1 13 121121221221121 10.2 0.4 0.6 0.8 1.0 1.2 1.40.51.01.52.0MIXREG/MLCAllele.XAllele.YFigure 4.2: Clustering results of several clustering algorithms. For k-meansand MCLUST, the number of clusters are set to 4; for the remaining algo-rithms, the number of lines are set to 3.54Chapter 4. Bayesian linear clustering4.2 Model specification4.2.1 Model for one cluster: orthogonal regressionWe first introduce a Bayesian approach for orthogonal regression (see e.g.,Fuller, 1987). To model a single cluster, assume that the vector-valued datax1,...,xn are drawn by a random mechanism represented by a randomvector X in a d-dimensional space. We do not model the distribution of Xexcept for its deviation from an unknown hyperplane.Let {x : aprimex - b = 0} be a hyperplane. For identifiability purpose, weassume that aprimea = 1 and that the first nonzero element of a is positive. Ourprior information about this hyperplane is summarized by a prior distribu-tion pi(a,b). Given a and b, the signed orthogonal deviation aprimeX-b from thehyperplane {x : aprimex-b = 0} has a density function p(? |sigma). Let pi(sigma) be theprior distribution for sigma. Denote theta = (aprime,b,sigma)prime. The posterior distribution oftheta ispi(theta|x1,...,xn) proportionalbraceleftBigg nproductdisplayi=1p(aprimexi -b|sigma)bracerightBiggpi(a,b)pi(sigma). (4.1)Let ? and S be the sample mean and sample covariance matrix ofx1,...,xn respectively. If we take p(?|sigma) to be the normal density func-tion N(? |0,sigma2) and set pi(a,b) proportional 1 and pi(sigma) proportional 1, the maximum a posterioriestimator of theta is (?prime,?, ?)prime, where ?2 is the smallest eigenvalue of S, ? is the(standardized) eigenvector associated with ?2 and ? = ?prime ?. This is a con-nection with orthogonal regression, see for example, Fuller (1987); for thedistributions of the eigenvalue ?2 and the eigenvector ? when X is normal,see Anderson (2003).4.2.2 Model for linear clusteringNow we assume that the data x1,...,xn are drawn by a more complicatedrandom mechanism, still represented by random vector X in a d-dimensionalspace which we do not model fully. The data now lie around K hyperplanes{x : aprimekx = bk}, k = 1,...,K, for which our prior knowledge is summarizedin pi(a1,b1,...,aK,bK). Let Z = (Z1,...,ZK)prime be a random vector indicat-ing these hyperplanes, Zk = 1 with probability pk for k = 1,...,K, Zk = 0or 1 and summationtextKk=1 Zk = 1. Let p = (p1,...,pK)prime. We denote by pi(p) the priordistribution for p. We assume that, conditional on Zk = 1, the signed or-thogonal deviation aprimekX-bk has a density function p(?|sigmak) for k = 1,...,K.Let pi(sigma1,...,sigmaK) be the prior distribution for (sigma1,...,sigmaK)prime. Let z1,...,znbe the corresponding unobservable indicators for the data x1,...,xn, where55Chapter 4. Bayesian linear clusteringzi = (zi1,...,ziK)prime for i = 1,...,n. Let theta be the collection of all the param-eters,theta = (aprime1,b1,sigma1,...,aprimeK,bK,sigmaK,pprime)prime.Formally, the posterior distribution of the unobservable cluster indicatorsz1, ..., zn and theta ispi(z1,...,zn,theta|x1,...,xn)proportionalbraceleftBigg nproductdisplayi=1Kproductdisplayk=1[pkp(aprimekxi -bk|sigmak)]zikbracerightBiggpi(a1,b1,...,aK,bK)pi(sigma1,...,sigmaK)pi(p).(4.2)This formula is the basis for our linear clustering, in which we are primar-ily interested in the cluster indicators z1,...,zn and the elements of theta arenuisance parameters. A Gibbs sampling scheme based on this formula isdetailed in next section.In addition, if we sum out the indicators z1,...,zn, the marginal distri-bution for theta ispi(theta|x1,...,xn)proportionalbraceleftBigg nproductdisplayi=1Ksummationdisplayk=1pkp(aprimekxi -bk|sigmak)bracerightBiggpi(a1,b1,...,aK,bK)pi(sigma1,...,sigmaK)pi(p).(4.3)This marginal distribution can be used if we would like to set up othersampling schemes such as Metropolis-Hastings, tempering MCMC or SMCmethods. Once a MCMC sample for theta is available, it is straightforward tosample cluster labels at each iteration. At iteration t, with theta(t), sample zifrom multinomial distributionPr(zi = k) proportional p(t)k p((aprime)(t)k xi -b(t)k |sigma(t)k ). (4.4)After a sequence of iterations, we have a sample (z(1)i ,...,z(T)i ) from pos-terior marginal distribution pi(zi|x1,...,xn), for i = 1,...,n. The point xican be classified into the cluster with the largest frequency in the posteriorsample; and the largest frequency provides a good measure of the clustermembership.56Chapter 4. Bayesian linear clustering4.3 Sampling algorithms4.3.1 Gibbs samplingThe model framework in (4.2) is conceptually straightforward; the posteriorsampling, however, can be difficult, as in any Bayesian mixture modelling(Celeux et al., 2000). We present here a sampling algorithm for normalorthogonal deviations and, to make the algorithm more efficient, we integratesome parameters by using conjugate priors. For ease of presentation, weassume the data are in a 2-dimensional space.The density functions for normal orthogonal deviations arep(aprimekx -bk|sigmak) = N(aprimekx -bk; 0,sigma2k), (4.5)for k = 1,...,K, where N(? ; 0,sigma2) is the normal density function with mean0 and variance sigma2k. The priors for (sigma21,...,sigma2K) follow independent inverseGamma distributions,pi(sigma21,...,sigma2K) =Kproductdisplayk=1IG(sigma2k;delta1k,delta2k), (4.6)where IG(? ;deltak1,deltak2) is the density function of inverse Gamma with shapeparameter deltak1 and rate parameter deltak2. In the case of equal variances, sigma21 =... = sigma2K = sigma2, we assumepi(sigma2) = IG(sigma2;delta1,delta2), (4.7)For b1,...,bK, we setpi(b1,...,bK|sigma21,...,sigma2K) =Kproductdisplayk=1N(bk;kappa1k,sigma2k/kappa2k). (4.8)With these specifications, we consider the following special case of the pos-terior distribution (4.2),pi(z1,...,zn,theta|x1,...,xn)proportionalbraceleftBigg nproductdisplayi=1Kproductdisplayk=1[pkN(aprimekxi -bk; 0,sigma2k)]zikbracerightBigg?braceleftBigg Kproductdisplayk=1N(bk;kappa1k,sigma2k/kappa2k)IG(sigma2k;delta1k,delta2k)bracerightBiggpi(a1,...,aK)pi(p). (4.9)57Chapter 4. Bayesian linear clusteringConditional on theta and the data x1,...,xn, z1,...,zn are independently dis-tributed,zi|(theta,xi) similar M(1,(p1N(aprime1xi -b1; 0,sigma21),...,pKN(aprimeKxi -bK; 0,sigma2K))prime),(4.10)for i = 1,...,n, where M(n,p) is the probability mass function for multi-nomial distribution with n trials and probability vector p. Letnk =nsummationdisplayi=1zik, ?k = 1nknsummationdisplayi=1zikxi, (4.11)andAk =nsummationdisplayi=1zik(xi - ?k)(xi - ?k)prime. (4.12)The full conditional distribution of bk isbk|e.e. = bk|(z1,...,zn,ak,sigma2k,x1,...,xn)similar N(kappa1kkappa2k + nkaprimek?xknk + kappa2k ,sigma2knk + kappa2k ), (4.13)for k = 1,...,K, where ?e.e.? stands for ?everything else?. We integrateout b1,...,bK from (4.9) and getpi(z1,...,zn,a1,sigma21,...,aK,sigma2K,p|x1,...,xn)proportionalbraceleftBigg Kproductdisplayk=1pnkk (sigma2k)-nk/2(nk + kappa2k)-1/2 exp(-deltaasteriskmath2k2sigma2k )bracerightBigg?braceleftBigg Kproductdisplayk=1IG(sigma2k;delta1k,delta2k)bracerightBiggpi(a1,...,aK)pi(p), (4.14)wheredeltaasteriskmath2k = aprimekAkak + nk(aprimek?k)2 + kappa21kkappa2k - (kappa1kkappa2k + nkaprimek?xk)2nk + kappa2k . (4.15)The distribution of sigma2k conditioning on everything else except bk issigma2k|(z1,...,zn,ak,x1,...,xn) similar IG(delta1k + nk2 ,delta2k + deltaasteriskmath2k2 ). (4.16)58Chapter 4. Bayesian linear clusteringIf we assume that sigma21 = ... = sigma2K = sigma2, then the distribution of sigma2 condi-tioning on z1,...,zn,a1,...,aK,p issigma2|(z1,...,zn,a1,...,aK,x1,...,xn) similar IG(delta1 + n2,delta2 +Ksummationdisplayk=1deltaasteriskmath2k2 ). (4.17)We further integrate out sigma21,...,sigma2K from (4.14) and getpi(z1,...,zn,a1,...,aK,p|x1,...,xn)proportionalbraceleftBigg Kproductdisplayk=1pnkk (nk + kappa2k)-1/2G(delta1k + nk2 )(delta2k + deltaasteriskmath2k2 )-(delta1k+nk2 )bracerightBigg? pi(a1,...,aK)pi(p). (4.18)For proportions p, we use a Dirichlet prior,pi(p) = D(p;alpha), (4.19)where D(? ;alpha) is the Dirichlet density function with parameter alpha = (alpha1, ...,alphaK)prime. The full conditional distribution of p isp|e.e. = p|(z1,...,zn) similar D(alphaasteriskmath), (4.20)where alphaasteriskmath = (alphaasteriskmath1,...,alphaasteriskmathK)prime with alphaasteriskmathk = alpha + nk for k = 1,...,K. Therefore,pi(z1,...,zn,a1,...,aK|x1,...,xn)proportionalbraceleftBigg1G(summationtextKi=1 alphaasteriskmathk)Kproductdisplayk=1G(alphaasteriskmathk)(nk + kappa2k)-1/2G(delta1k + nk2 )(delta2k + deltaasteriskmath2k2 )-(delta1k+nk2 )bracerightBigg? pi(a1,...,aK). (4.21)Note that aprimekak = 1 for k = 1,...,K. We re-parameterize ak asak = ( 11 + a2k, ak1 + a2k)prime,for k = 1,...,K and use normal priors for (a1,...,aK)prime,pi(a1,...,aK) =Kproductdisplayk=1N(ak;nu1k,nu22k). (4.22)59Chapter 4. Bayesian linear clusteringThe conditional distribution of ak on cluster indicators ispi(ak|z1,...,zn,x1,...,xn) proportional G(delta1k+nk2 )(delta2k+deltaasteriskmath2k2 )-(delta1k+nk2 )N(ak;nu1k,nu22k).(4.23)From the above, discussion, a possible sampling algorithm is as follows.Algorithm 1:1. Initialize the algorithm with theta(0).2. At iteration t,(a) Sample cluster indicators z(t)1 ,...,z(t)n from multinomial distribu-tion (4.10), where theta is replaced with theta(t-1).(b) i. Sample p(t) from Dirichlet distribution (4.20), where the clus-ter indicators are replaced with z(t)1 , ..., z(t)n .ii. Sample a(t)1 ,...,a(t)K using a random walk Metropolis-Has-tings scheme from (4.23), where cluster indicators are re-placed by z(t)1 , ..., z(t)n , respectively.iii. Sample (sigma21)(t),...,(sigma2K)(t) from inverse Gamma distribution(4.16), where z1,...,zn,a1,...,aK are replaced by z(t)1 , ...,z(t)n , a(t)1 , ..., a(t)K , respectively; in the case of equal variancessigma21 = ... = sigma2K = sigma2, sample (sigma2)(t) from (4.17).iv. Sample b1,...,bK from normal distribution (4.13), with allparameters and cluster indicators replaced with those in thecurrent iteration.4.3.2 Metropolis-Hastings samplingAlternatively, we can sum out the indicators z1,...,zn and sample theta fromthe marginal posterior distribution pi(theta|x1,...,xn) in (4.3). Standard ran-dom walk Metropolis-Hastings algorithms can be used directly to samplefrom the marginal posterior distribution pi(theta|x1,...,xn). However, we foundthat it is advantageous to reparameterize theta so that the constraints such assummationtextKk=1 pk = 1 and aprimea = 1 are implicitly incorporated (see Section 4.5).Since the proposal distribution and the posterior distribution have the samesupport, the sampling schemes on the transformed parameters are moreefficient. Suppose we reparameterize theta astheta = phi(eta),60Chapter 4. Bayesian linear clusteringsuch that phi is a one-to-one map and that the domain of eta is Rfracturq, where q isthe dimension of eta. Let J be the JacobianJ = partialdiffthetapartialdiffeta,with the understanding that only q free components of theta are used. Thenthe marginal posterior distribution of eta ispi(eta|x1,...,xn) = pi(theta|x1,...,xn)|J|. (4.24)To sample eta, we first run a Gibbs sampler, within which a random walkMetropolis step is used if a full conditional distribution is not of an explicitform.Algorithm 2.0:1. Initialize the algorithm with eta(0).2. At iteration t, update eta(t-1)1 ,...,eta(t-1)q sequentially. For the jth com-ponent,(a) Sample nu from N(eta(t-1)j ,s2j).(b) Compute r = min{1, pi(eta(t)1 ,...,eta(t)j-1,nu,eta(t-1)j+1 ,...,eta(t-1)q |x1,...,xn)pi(eta(t)1 ,...,eta(t)j-1,eta(t-1)j ,eta(t-1)j+1 ,...,eta(t-1)q |x1,...,xn) }.(c) Update eta(t)j = nu with probability r.After a burn-in period, we can get a crude estimate ? of the covariancematrix of eta. In the second stage of the algorithm, we update eta using aMetropolis-Hastings sampler in one block.Algorithm 2:1. Initialize the algorithm with eta(0).2. At iteration t,(a) Sample nu from N(eta(t-1),s2 ? .(b) Compute r = min{1, pi(nu|x1,...,xn)pi(eta(t-1)|x1,...,xn)}.(c) Update eta(t) = nu with probability r.Here we use an adaptive strategy; see page 307, Gelman et al. (2004).61Chapter 4. Bayesian linear clustering4.4 Australia rock crab dataNow we consider the rock crab data set of Campbell and Mahon (1974) onthe genus Leptograpsus. We focus on the 100 blue crabs, 50 of which aremales and 50 are females. Each crab has five measurements, FL, the widthof frontal lip, RW, the rear width, CL, the length along the midline, CW,the maximum width of the carapace, and BD, the body depth in mm. As inPeel and McLachlan (2000), we are interested in the classification of maleand female crabs. Peel and McLachlan (2000) had 18 crabs misclassifiedapplying a mixture of two t distributions using all the five measurements.Using a mixture of two normal distributions resulted in one more crab beingmisclassified.Inspecting all pairwise scatter plots of the data set, RW and CL displaytwo fairly well separated lines as in Figure 4.3. We shall classify these crabsusing only RW and CL. The number of clusters is set to 2. We run theGibbs sampler ?Algorithm 1? based on (4.2).To initialize the sampler, we adopt the strategy of Van Aelst et al.(2006). We randomly select K = 2 mutually exclusive subsets of d + 1 = 3observations from x1,...,xn. For k = 1,...,K, we compute the maxi-mum partial likelihood estimates a(0)k , b(0)k and (sigma2k)(0) from the kth sub-set of d + 1 observations (see Subsection 2 of Yan et al. (2008)). Theproportions p(0)k are all set as 1/K. The initial values for theta are thentheta(0) = ((aprime1)(0),b(0)1 ,(sigma21)(0),...,(aprimeK)(0),b(0)K ,(sigma2K)(0),p(0)1 ,...,p(0)K )prime. We thenmonitor the evolution of cluster indicators z1,...,zn. If the number of pointsin one cluster has fallen below d + 1, the sampler will get stuck and neverrecover from it; we then re-initialize the sampler with the above strategy.With the above initialization strategy, we actually impose a constraint onthe distribution of cluster labels such that any allocation leading to less thand + 1 points for a cluster is not allowed, i.e., has a probability of zero. As aresult, we can then specify the prior distributions of component parameterswith vague priors. Specifically, we set the priors as follows.ak similar N(0,1000002),pi(bk) proportional 1,sigma2k similar IG(0.0001,0.0001),pk proportional 1,for k = 1,2.To alleviate the effect of autocorrelations, we keep only one simulatedpoint for every ten iterations. Hereafter the iteration numbers are based onthe ?thinned? samples.Figure 4.4 displays the evolution of sampled values of theta for 11,000 it-erations. Figure 4.5 displays the evolution of log unnormalized posteriordensity values of sampled values for (theta, z1, ..., zn) for 11,000 iterations.62Chapter 4. Bayesian linear clusteringFL8 12 16 20 30 40 50812162081216RWCL1525354520304050CW8 12 16 20 15 25 35 45 6 10 14 186101418BDFigure 4.3: Pairwise scatter plots of the blue crab data.63Chapter 4. Bayesian linear clusteringFrom these two figures we found no reason to suspect that the sampler hasreached a stationary distribution. Figure 4.6 displays the autocorrelationsof sampled values for theta for 11,000 iterations. The autocorrelations decayfast, especially for variances sigma21 and sigma22 and proportions p1 and p2.Further checking the density curves in Figure 4.7 of sampled values fortheta for 10,000 iterations after a burn-in of 1000 iterations, we observe thatall the estimated density curves are roughly unimodal. There is no obviousreason for the existence of minor modes. In Figure 4.8, we demonstratethe isolation of the two modes of the marginal posterior distribution of theta.One mode is estimated by maximum a posteriori estimation; the other oneis obtained by permutation of cluster indices. The horizontal axis t rangesfrom 0 to 1, the vertical axis is the unnormalized marginal posterior densityvalue of (1-t)?1 +t?2, where ?1 and ?2 are the two estimated modes. Fromthis plot, we can see that the two modes are very peaky and isolated. Itis then understandable that label-switching does not occur. While Gibbssampling is generally criticized for not traversing the whole support of theposterior distribution for theta, it is reasonable to think that it has fully exploredthe support around one isolated mode in this data set.Figure 4.9 shows the posterior probabilities of being classified into Class1 of the 100 crabs estimated from 10,000 iterations using the Gibbs samplerAlgorithm 1, after a burn-in of 1000 iterations. If we were to classify allthe crabs, i.e., classifying a crab to the class with larger posterior proba-bility, five crabs are misclassified, as indicated in the upper-left corner ofFigure 4.9. The number of misclassification is the same as that of the par-tial likelihood approach in Chapter 3; this result is expected as we usedessentially noninformative priors in this example. When informative priorsare necessary, as in our next example, the Bayesian approach does have anadvantage.We have also run the adaptive Metropolis-Hastings algorithm for thesimulation of marginal posterior distribution for theta and get very similar re-sults. It runs much slower than the Gibbs sampler though.4.5 Taqman single nucleotide polymorphismgenotyping dataNow we analyze the SNP genotyping data in Figure 4.1 using the linear clus-tering strategy. We fit a mixture of five components to the SNP genotypingdata, to represent the three genotype clusters, the negative controls, and thefailed samples, respectively. The three genotype components are modelled64Chapter 4. Bayesian linear clustering0 2000 4000 6000 8000 10000-1.50-1.35-1.20a1Iterationa10 2000 4000 6000 8000 10000-1.20-1.10-1.00a2Iterationa20 2000 4000 6000 8000 10000-0.25-0.15b1Iterationb10 2000 4000 6000 8000 10000-0.15-0.05b2Iterationb20 2000 4000 6000 8000 100000.000050.00015sigma12Iterationsigma120 2000 4000 6000 8000 100001e-043e-04sigma22Iterationsigma220 2000 4000 6000 8000 100000.30.50.7p1Iterationp10 2000 4000 6000 8000 100000.20.40.6p2Iterationp2Figure 4.4: Evolution of sampled values for theta for 11,000 iterations usingAlgorithm 1 for blue crab data.65Chapter 4. Bayesian linear clustering0 2000 4000 6000 8000 10000170180190200210IterationLog unnormalized posterior densityFigure 4.5: Evolution of log unnormalized posterior density of sampledvalues for (theta, z1, ..., zn) for 11,000 iterations using Algorithm 1 for bluecrab data.linearly. The fourth component is modelled with a bivariate distributionwith diagonal covariance matrix, and the fifth is a uniform component forpossible outlying points.In the model fitting, we use the known labels for the negative controls.All the negative controls are assigned to the fourth component a priori,as are all points with both signals less than the corresponding maximumintensities of negative controls. Denote the set of these points by N.Hence, the modified mixture likelihood for the problem isL(theta|x1,...,xn)=productdisplayinegationslashelementNbracelefttpbraceleftmidbraceleftbt3summationdisplayk=1pk 1sigmakt2(xi1 + akxi2sigmakradicalBig1 + a2k- bksigmak) + p4t2(xi;?,?) + p5U(xi;R)bracerighttpbracerightmidbracerightbt?productdisplayielementNt2(xi;?,?), (4.25)where t2 is the density for Student?s t distribution with 2 degrees of freedom,? =bracketleftbigg sigma11 00 sigma22bracketrightbiggand U is the uniform distribution on the rectangular re-gion R given by (min(xi1), max(xi1)) ?(min(xi2), max(xi2)). For robustnessconsideration, we use Student?s t distribution for orthogonal deviation of apoint from a line and also for the elliptical distribution for negative control66Chapter 4. Bayesian linear clustering0 10 20 30 400.00.40.8a1LagACF0 10 20 30 400.00.40.8a2LagACF0 10 20 30 400.00.40.8b1LagACF0 10 20 30 400.00.40.8b2LagACF0 10 20 30 400.00.40.8sigma12LagACF0 10 20 30 400.00.40.8sigma22LagACF0 10 20 30 400.00.40.8p1LagACF0 10 20 30 400.00.40.8p2LagACFFigure 4.6: Autocorrelations of sampled values for theta for 11,000 iterationsusing Algorithm 1 for blue crab data.67Chapter 4. Bayesian linear clustering-1.5 -1.4 -1.3 -1.20246810a1Density-1.25 -1.20 -1.15 -1.10 -1.05 -1.00 -0.950246810a2Density-0.25 -0.20 -0.15 -0.1005101520b1Density-0.15 -0.10 -0.05 0.00051015b2Density0.00005 0.00010 0.00015 0.0002001000020000sigma12Density0.00005 0.00015 0.00025 0.000350500015000sigma22Density0.3 0.4 0.5 0.6 0.7 0.801234567p1Density0.2 0.3 0.4 0.5 0.6 0.701234567p2DensityFigure 4.7: Density curves of sampled values for theta for 10,000 iterationsusing Algorithm 1, after a burn-in of 1000 iterations for blue crab data.68Chapter 4. Bayesian linear clustering0.0 0.2 0.4 0.6 0.8 1.00.0e+005.0e+1101.0e+1111.5e+111tUnnormalized densityFigure 4.8: Unnormalized marginal posterior density values of theta along aline segment (1 -t)?1 + ?2 connecting the two estimated modes ?1 and ?2.One mode is estimated by maximum a posteriori estimation from 10,000iterations using Algorithm 1, after a burn-in of 1000 iterations and the otheris obtained by permutation. The blue crab data is used.0 20 40 60 80 1000.00.20.40.60.81.0Index of ObservationsEstimated posterior probability of being in Class 1Figure 4.9: Estimated posterior probability of being classified into Class 1of the 100 crabs from 10,000 iterations using Algorithm 1, after a burn-inof 1000 iterations. The solid vertical line separates these crabs by their trueclasses and the horizontal dashed line corresponds to a probability of 0.5.69Chapter 4. Bayesian linear clusteringpoints. This mix of linear, elliptical and uniform components demonstratesthe flexibility of our modelling approach.Based on the subject matter knowledge, we add various constraints inthe specification of priors. For the slopes -1/ak of lines,0 < - 1a1< - 1a2< - 1a3, - 1a3+ 1a2> - 1a2+ 1a1, (4.26)which maintains the order of positive slopes of the three genotype clustersand requires that the ?gap? between the two lines of the heterozygotic clus-ters and the variant allele homozygotic clusters cannot be too small relativeto the ?gap? between the lines of the wild type allele homozygotic clus-ters and the heterozygotic clusters. These constraints on slopes are naturalfor SNP genotyping data. To implement Algorithm 2, we reparameterize(a1,a2,a3) into (eta1,eta2,eta3),bracelefttpbraceleftmidbraceleftbta1 = -1/exp(eta1),a2 = -1/(exp(eta1) + exp(eta2)),a3 = -1/(exp(eta1) + 2exp(eta2) + exp(eta3)).(4.27)It is obvious that the constraint (4.26) is satisfied.In the case of few points in the variant allele homozygotic cluster, theorthogonal variance sigma23 may not be estimated reliably. Hence we requiresigma23 = sigma21, (4.28)and then apply a log transformation to (sigma21,sigma22),sigma21 = exp(eta7), sigma22 = exp(eta8), (4.29)For variances (sigma11,sigma22), a log transformation is applied,sigma11 = exp(eta11), sigma22 = exp(eta12). (4.30)For the proportions, we add the constraintp3 < p1, p3 < p2, (4.31)which requires the proportion of variant allele homozygotic points is less thanthat of the other two genotype clusters. We reparameterize the proportionsp first with q for constraint (4.31),bracelefttpbraceexbraceexbraceexbraceexbraceleftmidbraceexbraceexbraceexbraceexbraceleftbtp1 = q1 + q3/3,p2 = q2 + q3/3,p3 = q3/3,p4 = q4,p5 = q5.(4.32)70Chapter 4. Bayesian linear clusteringand then apply a logit transformation to q,bracelefttpbraceleftmidbraceleftbtq1 = exp(eta13)/(1 + (exp(eta13) + ... + exp(eta16))),...,q4 = exp(eta16)/(1 + (exp(eta13) + ... + exp(eta16))).(4.33)With the constraints (4.26), (4.28) and (4.31), we can fully identify thefive components of the mixture. Label-switching is effectively prevented bythese informative constraints. In addition, conforming to these constraints,noninformative priors cause no problem to guarantee a proper posteriordistribution.From (4.27), (4.29), (4.30) and (4.33), we obtain a one-to-one map fromtheta = (a1,a2,a3,b1,b2,b3,sigma21,sigma22,sigma23,?1,?2,sigma11,sigma22,p1,...,p5)toeta = (eta1,eta2,eta3,b1,b2,b3,eta7,eta8,?1,?2,eta11,eta12,eta13,eta14,eta15,eta16),in which the constraints (4.26), (4.28) and (4.31) are satisfied. The absolutevalue of the Jacobian of the transformation is|J| =vextendsinglevextendsinglevextendsinglevextendsingle partialdiffthetapartialdiffetavextendsinglevextendsinglevextendsinglevextendsingle=|a23(a1 -a2)(a2 -2a1)/a1|sigma21sigma22sigma11sigma22(p1 -p3)(p2 -p3)p3p4p5. (4.34)We set the priors as follows.etak similar N(0,1010), bk similar N(0,10-6), sigma2k similar IG(0.0001,0.0001), for k = 1,2,3,andpi(?,sigma11,sigma22) proportional 1sigma11sigma22, pi(q) proportional 1.The priors for b1,b2,b3 is strong with the implication that all three linesshould roughly pass through the origin; all priors for others parameters arevague except the constraints specified above.The five-component mixture model is applied to the plate shown in Fig-ure 4.1. For an algorithm to work in this relatively high dimensional prob-lem, it is essential that the initial state be close to the substantial support ofthe posterior distribution. We first run the ?optim? function in the R lan-guage multiple times, which is initialized with standard normal N(0,I16).Then we initialize the algorithm with the solution with the largest posteriordensity, say eta(0).71Chapter 4. Bayesian linear clustering0 2000 4000 6000 8000 10000740745750755760765IterationLog unnormalized posterior densityFigure 4.10: Evolution of log unnormalized posterior density of sampledvalues for eta for 11,000 iterations using Algorithm 2 for the SNP genotypingdata.We first run Algorithm 2.0, which is relatively slow, to get a roughestimate of the covariance matrix of eta. Next we run the faster Algorithm 2.To alleviate the effect of autocorrelations, we keep only one simulated pointfor every ten iterations. Hereafter the iteration numbers are based on the?thinned? samples.Figure 4.10 displays the evolution of log unnormalized posterior densityvalues of sampled values for eta for 11,000 iterations. Figure 4.11 displaysthe evolution of sampled values of theta for 11,000 iterations. The first threepanels are for slopes of the three lines -1/a1, -1/a2 and -1/a3. Fromthese two figures we conclude empirically that the sampler has reached astationary distribution. Figure 4.12 displays the autocorrelations of sampledvalues for theta for 11,000 iterations. There are high autocorrelations in mostparameters indicating that the sampler is not very efficient; more efficienttransformations/samplers shall be investigated in future research.Further checking the density curves in Figure 4.13 of sampled values fortheta for 11,000 iterations, we observe that all the estimated density curves areroughly unimodal.The clustering results of this plate is displayed in Figure 4.14 whereeach point is classified into the cluster with the largest posterior member-ship probability. We note that the Bayesian method does identify severalpoints into the variant allele homozygous genotype clusters (represented by?4? in Figure 4.14). Quite a few points are classified as background noise72Chapter 4. Bayesian linear clustering0 2000 4000 6000 8000 100000.420.44-1 a1Iteration-1a10 2000 4000 6000 8000 100001.801.902.00-1 a2Iteration-1a20 2000 4000 6000 8000 100001015202530-1 a3Iteration-1a30 2000 4000 6000 8000 10000-0.035-0.020-0.005b1Iterationb10 2000 4000 6000 8000 10000-0.020.000.02b2Iterationb20 2000 4000 6000 8000 10000-0.020.000.02b3Iterationb30 2000 4000 6000 8000 100000.000060.00012sigma12Iterationsigma120 2000 4000 6000 8000 100002e-046e-041e-03sigma22Iterationsigma220 2000 4000 6000 8000 100000.000060.00012sigma32Iterationsigma320 2000 4000 6000 8000 100000.1000.1100.120?1Iteration?10 2000 4000 6000 8000 100000.230.250.27?2Iteration?20 2000 4000 6000 8000 100002e-046e-04sigma11Iterationsigma110 2000 4000 6000 8000 100000.00050.0025sigma22Iterationsigma220 2000 4000 6000 8000 100000.700.750.800.85p1Iterationp10 2000 4000 6000 8000 100000.150.200.25p2Iterationp20 2000 4000 6000 8000 100000.000.020.04p3Iterationp30 2000 4000 6000 8000 100000.0000.0100.020p4Iterationp40 2000 4000 6000 8000 100000.000.020.040.06p5Iterationp5Figure 4.11: Evolution of sampled values for theta for 11,000 iterations usingAlgorithm 2 for the SNP genotyping data. The first three panels are forslopes.73Chapter 4. Bayesian linear clustering0 10 20 30 400.40.60.81.0-1 a1LagACF0 10 20 30 400.650.750.850.95-1 a2LagACF0 10 20 30 400.40.60.81.0-1 a3LagACF0 10 20 30 400.40.60.81.0b1LagACF0 10 20 30 400.800.901.00b2LagACF0 10 20 30 400.750.850.95b3LagACF0 10 20 30 400.00.40.8sigma12LagACF0 10 20 30 400.00.40.8sigma22LagACF0 10 20 30 400.00.40.8sigma32LagACF0 10 20 30 400.00.40.8?1LagACF0 10 20 30 400.00.40.8?2LagACF0 10 20 30 400.00.40.8sigma11LagACF0 10 20 30 400.00.40.8sigma22LagACF0 10 20 30 400.00.40.8p1LagACF0 10 20 30 400.00.40.8p2LagACF0 10 20 30 400.00.40.8p3LagACF0 10 20 30 400.00.40.8p4LagACF0 10 20 30 400.00.40.8p5LagACFFigure 4.12: Autocorrelations of sampled values for theta for 11,000 iterationsusing Algorithm 2 for the SNP genotyping data.74Chapter 4. Bayesian linear clustering0.42 0.43 0.44 0.45 0.46020406080-1 a1Density1.80 1.85 1.90 1.95 2.00051015-1 a2Density10 15 20 25 300.000.10-1 a3Density-0.03 -0.02 -0.01 0.00020406080b1Density-0.02 -0.01 0.00 0.01 0.02 0.030103050b2Density-0.02 -0.01 0.00 0.01 0.020103050b3Density0.00006 0.00010 0.0001401000025000sigma12Density0.0002 0.0004 0.0006 0.0008 0.0010 0.0012010003000sigma22Density0.00006 0.00010 0.0001401000025000sigma32Density0.100 0.105 0.110 0.115 0.120 0.12504080120?1Density0.23 0.24 0.25 0.26 0.27 0.28 0.290103050?2Density0e+00 2e-04 4e-04 6e-04 8e-040200040006000sigma11Density0.000 0.001 0.002 0.003 0.00404008001200sigma22Density0.70 0.75 0.80 0.85051015p1Density0.15 0.20 0.25051015p2Density0.00 0.01 0.02 0.03 0.04 0.050204060p3Density0.000 0.005 0.010 0.015 0.020050150250p4Density0.00 0.01 0.02 0.03 0.04 0.05 0.060204060p5DensityFigure 4.13: Density curves of sampled values for theta for 11,000 iterationsusing Algorithm 2 for the SNP genotyping data.75Chapter 4. Bayesian linear clustering2241 11414121 11 1231 11 11121 1411131 111 11 111211 1131 11 112241 121 111211421111 1121 11211 1122111 1111 1 111 111121 111 1211 121211212121221 12112122512 22112221 11 11 122111211 11 111 1 1212221 1 111151 111 121 1 11221414124212112111 1 11111 11 111111214241414242414142411221 11211 111 1111531 121221 1111411 1411 11 11 121121141 111 1211111121121 1141111 111 11 11 121151441411211 111 1211212 21441141 10.2 0.4 0.6 0.8 1.0 1.2 1.40.51.01.52.0Allele.XAllele.YFigure 4.14: Clustering results of the SNP genotyping data in which pointsare classified into the cluster with the largest posterior membership proba-bility.(represented by ?5?); this is conservative because the orthogonal variationin the wild-type allele homozygous genotype cluster XX is very small andthe orthogonal variation of cluster YY is linked to that of cluster XX.By depleting points in the YY cluster, we have observed that the Bayesianmethod also works if only one point is present in the YY cluster, which isdue to the informative priors and constraints. Some other plates with sparseclusters are also analyzed using the Bayesian approach and the clusteringresults are satisfactory; the clustering results are omitted. For plates with-out sparse clusters, the effect of the above prior specification is minimal, weusually obtain similar clustering results to that of the partial likelihood ap-proach in Chapter 3. Figure 4.15 shows the clustering results of the Bayesianapproach for the plate analyzed in Chapter 3.4.6 DiscussionWe have proposed a Bayesian approach to linear clustering. It is flexible inthe sense that only the deviation from a hyperplane is modelled parametri-cally; the position on the hyperplane is not modelled. The advantage of thisapproach is illustrated in the genotyping example, where the distributionalong the line is complex and difficult to model. Furthermore, as was alsoillustrated in the SNP genotyping, we can incorporate elliptical clusters as76Chapter 4. Bayesian linear clustering424141 1 12121311223 14 44141111 11111231123414 44424241111331 1 14112241424141 1223131 1211442414141412231441213133124143421 1211231121 1 1214414342121212112141144 4142512232112 222212240.5 1.0 1.50.51.01.52.0Allele.XAllele.YFigure 4.15: Clustering results of the Bayesian approach for a plate withouta sparse cluster.necessary and a background cluster, borrowing from standard model-basedclustering.In our examples, label-switching is prevented either by a Gibbs samplerapplied to a posterior distribution with isolated modes or by informative pri-ors. In more general situations, we may need the ideas of tempering MCMCor Sequential Monte Carlo to explore the whole support of the posteriordistribution and to deal with the label-switching problem.In this paper, the number of linear clusters are assumed known. In thesituation of unknown number of clusters, our first thought is to investi-gate the feasibility of the Reversible Jump MCMC of Richardson and Green(1997). This may imply heavy computational burden. A related problem isthe scalability of the Bayesian approach to large datasets and high dimen-sions. We leave these problems for further research.77BibliographyAnderson, T. (2003). An introduction to multivariate statistical analysis.Wiley Series in Probability and Statistics.Campbell, N. and R. Mahon (1974). A multivariate study of variationin two species of rock crab of genus Leptograpsus. Australian Journal ofZoology 22(3), 417?425.Celeux, G., M. Hurn, and C. Robert (2000). Computational and InferentialDifficulties with Mixture Posterior Distributions. Journal of the AmericanStatistical Association 95(451).Fraley, C. and R. AE (2006). mclust Version 3 for R: Normal MixtureModeling and Model-based Clustering. Technical report, Technical Report504, University of Washington, Department of Statistics.Fujisawa, H., S. Eguchi, M. Ushijima, S. Miyata, Y. Miki, T. Muto, andM. Matsuura (2004). Genotyping of single nucleotide polymorphism usingmodel-based clustering. Bioinformatics 20, 5.Fuller, W. (1987). Measurement error models. Wiley Series in Probabilityand Mathematical Statistics, New York: Wiley, 1987.Gelman, A., J. Carlin, H. Stern, and D. Rubin (2004). Bayesian dataanalysis. Boca Raton, FL: Chapman and Hall/CRC.Harrington, J. (2007). lga: Tools for linear grouping analysis (LGA). Rpackage version 1.0-0.Kang, H., Z. Qin, T. Niu, and J. Liu (2004). Incorporating GenotypingUncertainty in Haplotype Inference for Single-Nucleotide Polymorphisms.The American Journal of Human Genetics 74(3), 495?510.Olivier, M., L. Chuang, M. Chang, Y. Chen, D. Pei, K. Ranade,A. de Witte, J. Allen, N. Tran, D. Curb, et al. (2002). High-throughputgenotyping of single nucleotide polymorphisms using new biplex invadertechnology. Nucleic Acids Research 30(12), e53.78BibliographyPeel, D. and G. McLachlan (2000). Robust mixture modelling using the tdistribution. Statistics and Computing 10(4), 339?348.Ranade, K., M. Chang, C. Ting, D. Pei, C. Hsiao, M. Olivier, R. Pesich,J. Hebert, Y. Chen, V. Dzau, et al. (2001). High-Throughput Genotypingwith Single Nucleotide Polymorphisms. Genome Research 11(7), 1262?1268.Richardson, S. and P. Green (1997). On Bayesian Analysis of Mixtureswith an Unknown Number of Components. Journal of the Royal StatisticalSociety. Series B (Methodological) 59(4), 731?792.Turner, T. (2006). mixreg: Functions to fit mixtures of regressions. Rpackage version 0.0-2.Van Aelst, S., X. Wang, R. Zamar, and R. Zhu (2006). Linear groupingusing orthogonal regression. Computational Statistics and Data Analy-sis 50(5), 1287?1312.Yan, G., W. Welch, and R. Zamar (2008). A likelihood approach to linearclustering. Preprint.79Chapter 5Consistency and asymptoticnormality in a partiallikelihood approach to linearclustering5.1 IntroductionBy linear clustering, we mean detecting linearly shaped clusters in a dataset. We proposed in Yan et al. (2008) a parsimonious partial likelihoodapproach to linear clustering. Assume that the data x1,...,xn are drawn bya random mechanism, represented by random vector X in a d-dimensionalspace which we do not model fully. Assume that The data lie around Khyperplanes {x : aprimekx = bk}, k = 1,...,K. Let Z = (Z1,...,ZK)prime be arandom vector indicating these hyperplanes and Zk = 1 with probability pkfor k = 1,...,K. Let p = (p1,...,pK)prime. We assume that, conditional onZk = 1,aprimekX -bk similar N(0,sigma2k), k = 1,...,K.Let z1,...,zn be the corresponding unobservable indicators for the datax1,...,xn. Let kappa be the collection of component parameters, kappa = (aprime1, b1,sigma21, ..., aprimeK, bK, sigma2K)prime, and theta = (kappaprime,pprime)prime. The indicators z1,...,zn can beregarded as realizations of random vectors Z1,...,Zn, which in turn are anindependent and identically distributed sample from Z. After integratingout the indicators z1,...,zn, the partial likelihood function for parametersA version of this chapter will be submitted for publication. Authors: Guohua Yan,William J. Welch and Ruben H. Zamar.80Chapter 5. Consistency and asymptotic normalitytheta isL(theta|x1,...,xn) =nproductdisplayi=1Ksummationdisplayk=1pkN(aprimekxi -bk; 0,sigma2k). (5.1)As in the usual normal mixture model, the partial likelihood function(5.1) is unbounded: when a cluster consists only of points lying on a hyper-plane, the contribution of each of these points to the partial likelihood tendsto infinity. The infinity occurs on the boundary of the parameter space.We adopt the constraint of Hathaway (1985). Specifically, the followingconstraint is imposed on the standard deviations,min1<=<inegationslash=j<=<K(sigmai/sigmaj) greaterequal c > 0, (5.2)where c is a known constant determined a priori. In this article, con-straint (5.2) is assumed whenever we refer to the partial likelihood func-tion (5.1).The partial likelihood function (5.1) naturally brings the clustering prob-lem into a finite mixture model framework. An EM algorithm can be used tomaximize (5.1); once an maximum partial likelihood estimate ? is obtained,data point xi can be assigned to the component with the largest posteriorprobability. The probabilities are given by?ik = ?pkN(?aprimekxi -?bk; 0, ?sigma2k)summationtextKk=1 ?pkN(?aprime xi -?bk; 0, ?sigma2), i = 1,...,n; k = 1,...,K, (5.3)which also serve as a measure of uncertainty of classifying data point xi.We shall investigate the asymptotic properties of solutions to the partiallikelihood function (5.1). As (5.1), regarded as a function of x, is not a den-sity function for X, classical results of asymptotics on maximum likelihoodestimators cannot be used directly. We borrow ideas in the formulation andin the proofs of results from Garc? -Escudero et al. (2007), Wald (1949),Redner (1981) and Hathaway (1983, 1985).The rest of this article is organized as follows. Section 5.2 discussesthe population counterpart of the partial likelihood function (5.1) and es-tablishes the existence of its maximum. Section 5.3 proves the consistencyof a maximum of the partial likelihood function (5.1) to the maximum ofthe population counterpart. The asymptotic normality of a solution of thepartial likelihood function (5.1) is investigated in section 5.4.81Chapter 5. Consistency and asymptotic normality5.2 Population version of the objective functionTo motivate the population version of the objective function (5.1), we firstreview the connection between maximum likelihood estimation and the Kull-back-Leibler divergence. See for example Kullback and Leibler (1951) andEguchi and Copas (2006).Suppose that probability distributions P and Q are absolutely continu-ous with respect to the Lebesgue measure lambda. Let p(x) and q(x) be the densityfunctions (Radon-Nikodym derivatives) of P and Q respectively with respectto lambda. Then the Kullback-Leibler divergence from P to Q is defined asDKL(PbardblQ) =integraldisplaylog p(x)q(x)dP(x).Given a random sample x1,x2,...,xn from the underlying distributionP, let Pn be the empirical distribution. Now let Qtheta be a statistical modelhaving density f(x;theta) with respect to lambda, where theta is a collection of unknownparameters. The Kullback-Leibler divergence from P to Qtheta isDKL(PbardblQtheta) =integraldisplay[log p(x) -log f(x;theta)]dP(x).The empirical version of DKL(P||Qtheta) isDKL(PnbardblQtheta) = 1nnsummationdisplayi=1[log(1/n) -log f(xi;theta)]= log(1/n) - 1nnsummationdisplayi=1log f(xi;theta).Apart from the factor 1/n, the second term is just the log likelihood function.Hence, maximizing the likelihood function is equivalent to minimizing theKullback-Leibler divergence DKL(PnbardblQtheta). If P = Qtheta0 for some theta0, thena maximum likelihood estimator is strongly consistent under some regularconditions, i.e., it converges almost surely to arg mintheta DKL(PbardblQtheta).Back to the linear clustering setting, letf(x;theta) =Ksummationdisplayk=1pkfk(x;theta), (5.4)wherefk(x;theta) = 12pisigmakexpparenleftbigg-(aprimekx -bk)22sigma2kparenrightbigg,82Chapter 5. Consistency and asymptotic normalityand x is a generic point in Rfracturd. We still use P to denote the underlyingdistribution of the data and use p(x) to denote the density of P with respectto the Lebesgue measure lambdad. Although f(x;theta) cannot be a density as afunction of x, we can nevertheless defineDKL(Pbardblf(?;theta)) =integraldisplay[log p(x) -log f(x;theta)]dP(x),or equivalently defineg(theta) =integraldisplaylog f(x;theta)dP(x), (5.5)and show that the maximum partial likelihood solution of (5.1) convergesto the set{theta0 : g(theta0) = maxthetag(theta)}.With constraint (5.2), The parameter space isThetac =bracelefttpbraceleftmidbraceleftbttheta = (a1,b1,sigma21,...,aK,bK,sigma2K,p1,...,pK) :0 < pk < 1,aprimekak = 1,-infinity < bk < infinity,sigmak > 0,k = 1,...,K.summationtextKk=1 pk = 1,mini,j sigmai/sigmaj greaterequal c > 0.bracerighttpbracerightmidbracerightbt.We show in this section that the supremum of g is finite and attainable.First we prove the following lemma.Lemma 5.1 Let P be an absolutely continuous probability measure withfinite second moments. Let 0 < c < 1, if c <=< sigmak <=< 1/c, for all k = 1,...,K.Lets(kappa) equivalenceintegraldisplaymink(aprimekx -bk)2sigma2k dP(x).Then s(kappa) attains its infimum s0 > 0 under constraint (5.2) at some kappa0.Proof It is obvious that 0 < s(kappa) < infinity. First we show that it is possible tobound bk, k = 1,...,K and the infimum is not missed.Suppose that there is a positive number r < |bk|, for k = 1,...,K. Letr0 > 0 such that Prob(|X| <=< r0) > 0. Then when r > r0,s(theta) greaterequal c2(r -r0)2Prob(|X| <=< r0) arrowrightinfinity, as r arrowrightinfinity.83Chapter 5. Consistency and asymptotic normalityNow without loss of generality, we assume that |bk| <=< v for some v andfor k = 1,...,K -1. Let r = c2(|bK| -v). Then when |bK| is getting large,s(kappa) greaterequalintegraldisplay|x|<=<rmin1<=<k<=<K(aprimekx -bk)2sigma2k dP(x)=integraldisplay|x|<=<rmin1<=<k<=<K-1(aprimekx -bk)2sigma2k dP(x)arrowrightintegraldisplaymin1<=<k<=<K-1(aprimekx -bk)2sigma2k dP(x), as |bK| arrowrightinfinitygreaterequalintegraldisplaymin1<=<k<=<K(aprimekx -bk)2sigma2k dP(x),where the |bK| in the last term is arbitrary number bounded from below byv.Sincemink(aprimekx -bk)2sigma2k <=<2(|x|2 + v2)c2 ,by Lebesgue?s dominated convergence theorem, s is continuous. Now it isconstrained on a compact set, the infimum of s is attainable for some kappa0 ass0 = s(kappa0) > 0.Theorem 5.1 Let P be an absolutely continuous probability measure withfinite second moments. Then the supremum of g, which is defined in (5.5),over Thetac is finite and attainable.Proof Let sigma2k = tau2k sigma2 for k = 1,...,K. For every x and theta, we havelog f(x,theta) <=< log maxkfk(x,theta) = maxklog fk(x,theta).Therefore,g(theta) <=<integraldisplaymaxklog fk(x,theta)dP(x)=integraldisplay-minkbraceleftbigg12 log(2pitau2k sigma2) + (aprime x -bk)22tau2k sigma2bracerightbiggdP(x)<=< -12 log(2pic2) -log sigma - 12sigma2braceleftbiggintegraldisplaymink(aprimekx -bk)2tau2k dP(x)bracerightbigg<=< -12 log(2pic2) -log sigma - s02sigma2 ,84Chapter 5. Consistency and asymptotic normalitywhere Lemma 5.1 is used. Clearly, the last term attains its maximum atsigma2 = s0 and it goes to -infinity as sigma arrowright0 or infinity. Take an arbitrary theta0, there existpositive numbers s1 and s2 such that g(theta) < g(theta0) uniformly for theta such thatsigma < s1/c or sigma > cs2.Now if we bound s1/c <=< sigma <=< cs2, we havelog f(x,theta) greaterequal log minkfk(x,theta)= minlog fk(x,theta)= min(-12 log(2pitau2k sigma2) - (aprimekx -bk)22tau2k sigma2 )greaterequal -12 log(2pis22) - 2(|x|2 + maxk |bk|2)2s21 ,andlog f(x,theta) <=< log maxkfk(x,theta) <=< -12 log(2pis21).By Lebesgue?s dominated convergence theorem, g is continuous.If bk arrowright?infinity for some k, then by Fatou?s lemma, we havelimsupbkarrowright?infinityg(theta) <=<integraldisplaylimsupbkarrowright?infinitylog f(x,theta)dP(x) =integraldisplaylogsummationdisplaykprimenegationslash=kpkprime fkprime (x,theta)dP(x).Therefore, there exist real numbers t1 and t2 such that we would notmiss the supremum of g, if we bound theta in the compact set{theta : s1 <=< sigma2k <=< s2,t1 <=< bk <=< t2, for all k = 1,...,K}.Since g is continuous, its supremum is attainable.Remark In the proof it is not ruled out that some pk may be 0 for g toattain its maximum.5.3 ConsistencyFirst we prove that there exists a global maximizer of the partial likelihoodfunction (5.1) over Thetac.Theorem 5.2 Let {x1,...,xn} be a set of points in Rfracturd that cannot fit inK hyperplanes. That is, there do not exist {(ak,bk) : k = 1,...,K} suchthat, for every xi, aprimekxi = bk for some k (which may depend on i). Letl(theta) =nsummationdisplayi=1log f(xi,theta).85Chapter 5. Consistency and asymptotic normalityThen a constrained global maximizer of l(theta) over Thetac exists.Proof Let B(0,r) = {x : |x| <=< r} be a ball that contains all the pointsx1,...,xn. For any theta, if there is some bk such that |bk| > r, then (aprimekxi -bk)2 greaterequal (aprimekxi - rsgn(bk))2. So l(theta) is not decreased if bk is replaced withrsgn(bk) whenever |bk| greaterequal r. So we can bound theta such that |bk| <=< r, for allk = 1,...,K.Lets0 = min{(a1,b1,...,aK,bK):aprimekak=1,|bk|<=<r}maxi mink(aprimekxi -bk)2.Since the points x1,...,xn cannot fit in K hyperplanes, we have s0 > 0.There exists a point xj such that (aprimekxj - bk)2 greaterequal s0 for all k = 1,...,K.Note also that 1/sigmak <=< 1/(csigma1) and that 1/sigmak greaterequal c/sigma1, we havel(theta) <=<summationdisplayinegationslash=jlog(summationdisplaykpk 12pisigmak) + log(summationdisplaykpk 12pisigmakexp(- s02sigma2k))<=< (n -1)log(summationdisplaykpk 12picsigma1) + log(summationdisplaykpk 12picsigma1exp(-cs02sigma21))= -nlog(radical2picsigma1) - cs02sigma21.The last term tends to -infinity as sigma21 tends to zero. If some sigma2k tends to zero,so do all other sigma2ks. Thus, there exists a positive number s1 such thatl(theta) <=< l(theta0) whenever a sigma2k < s1.On the other hand,l(theta) <=<summationdisplayilog(summationdisplaykpk 12pisigmak) <=< nlog(maxk1radical2pisigmak ) = -nmink log(radical2pisigmak).If some sigma2k tends to infinity, so do all other sigma2k?s and then l(theta) decreases to -infinity.Similarly, there is a positive number s2 such that l(theta) <=< l(theta0) whenever asigma2k > s2.Therefore, we can bound theta in a compact set{theta : |bk| <=< r,s1 <=< sigma2k <=< s2, for all k = 1,...,K}.As l is continuous, a global maximizer over Thetac exists.Corollary 5.1 Let P be an absolutely continuous probability measure withfinite second moments. Let X1,...,Xn be a sample from P. Then a con-strained global maximizer of l(theta|X1,...,Xn) exists almost surely if n greaterequalKd + 1.86Chapter 5. Consistency and asymptotic normalityProof As P is absolute continuous, X1,...,Xn are distinct with probability1. We need Kd points to determine K hyperplanes. Each of the remainingpoints lies on one of these hyperplanes with probability 0, since the set ofpoints in K hyperplanes has zero probability.To prove the consistency result, we need some preliminary results.Lemma 5.2 If the probability measure P has finite second moments, thenintegraltext| log f(x,theta)|dP(x) < infinity.Proof integraldisplay| log f(x,theta)| dP(x)=integraldisplay[log f(x,theta)]+ dP(x) +integraldisplay[log f(x,theta)]-dP(x)<=<integraldisplay{x:f(x,theta)greaterequal1}[log maxkfk(x,theta)]+dP(x)+integraldisplay{x:f(x,theta)<1}[log minkfk(x,theta)]-dP(x)<=<integraldisplay{x:f(x,theta)greaterequal1}summationdisplayk[log fk(x,theta)]+dP(x)+integraldisplay{x:f(x,theta)<1}summationdisplayk[log fk(x,theta)]-dP(x)=summationdisplaykintegraldisplay| log fk(x,theta)|dP(x)<=<summationdisplayk[| log(radical2pisigmak)| +integraldisplay (aprimekx -bk)22sigma2k dP(x)]<infinity.Lemma 5.3 Letw(x,theta,rho) = sup{thetaprime:|thetaprime-theta|<=<rho}f(x,thetaprime).Then integraldisplay[log w(x,theta,rho)]+dP(x) < infinity.87Chapter 5. Consistency and asymptotic normalityProof In fact,w(x,theta,rho) <=< sup{thetaprime:|thetaprime-theta|<=<rho}summationdisplaykfk(x,theta)<=< sup{thetaprime:|thetaprime-theta|<=<rho}summationdisplayk1radical2pisigmakequivalence M(theta,rho).Therefore,integraldisplay[log w(x,theta,rho)]+dP(x) <=<integraldisplay[log M(theta,rho)]+dP(x) < infinity.Lemma 5.4limrhoarrowright0integraldisplaylog w(x,theta,rho)dP(x) =integraldisplaylog f(x,theta)dP(x).Proof Since log w(x,theta,?) is an increasing function of rho, as rho arrowright0, [log w(x,theta,rho)]- increases. By the monotone convergence theorem,limrhoarrowright0integraldisplay[log w(x,theta,rho)]-dP(x) =integraldisplay[log f(x,theta)]-dP(x).When rho is sufficiently small, [log w(x,theta,rho)]+ is dominated by [log w(x, theta,rho0)]+ for some rho0. And the latter is integrable by Lemma 5.3. By Lebesgue?sdominated convergence theorem,limrhoarrowright0integraldisplay[log w(x,theta,rho)]+dP(x) =integraldisplay[log f(x,theta)]+dP(x).Notice that integraltext [log w(x,theta,rho)]+dP(x) and integraltext [log f(x,theta)]+dP(x) are finite, thelemma is proved.Note that points in Thetac are not identifiable for f(x;?). The function f(x;?)remains the same if we permutate the labels 1, ..., K; pk1 and pk2 are notidentifiable if (ak1,bk1,sigma2k1) = (ak2,bk2,sigma2k2). Thus the consistency result isin a quotient topology space. Let similar be an equivalent relation on Thetac suchthat theta1 similar theta2 if and only if f(x;theta1) = f(x;theta2) almost surely in P. Denoteby Thetaqc the quotient topological space consisting of all equivalent classes ofsimilar. For a point theta0 that maximizes integraltext log f(x;theta)dP(x), its equivalent class inThetaqc is denoted by thetaq0.88Chapter 5. Consistency and asymptotic normalityTheorem 5.3 Let X1,...,Xn be a sample from P which is absolutely con-tinuous with finite second moments. Let B be a compact subset of Thetac thatcontains C0. Let ?(n) be a global maximizer of l(theta|X1,...,Xn) on B. Then?theta(n) arrowrightthetaq0 almost surely in the topological space Bq.Proof Let omega be a closed subset of B which does not intersect with C0. Foreach point theta in omega, we associate a positive value rhotheta such thatE log w(X,theta,rhotheta) < E log f(X,theta0),where theta0 is a point in C0. The existence of such a rhotheta follows from Lemma 5.4and the definition of C0 (C0 exists because P has finite second moments,by Theorem 5.1; ?(n) exists because P is absolutely continuous, by Theo-rem 5.2).Since B is compact, its closed subset omega is also compact. There exists afinite number of points theta1,...,thetah in omega such thatomega propersubsethuniondisplayj=1{thetaprime : |thetaprime -thetaj| <=< rhothetaj}.Then0 <=< supthetaelementomegaf(x1,theta)...f(xn,theta) <=<hsummationdisplayj=1w(x1,thetaj,rhothetaj)...w(xn,thetaj,rhothetaj).By the strong law of large numbers,ProbbraceleftBigglimnarrowrightinfinitynsummationdisplayi=1[log w(Xi,thetaj,rhothetaj) -log f(Xi,theta0)] = -infinitybracerightBigg= 1,for j = 1,...,h. That is,Probbraceleftbigglimnarrowrightinfinity w(X1,thetaj,rhothetaj)...w(Xn,thetaj,rhothetaj)f(X1,theta0)...f(Xn,theta0)= 0bracerightbigg= 1,for j = 1,...,h. Therefore, we haveProbbraceleftbigglimnarrowrightinfinity supthetaelementomega f(X1,theta)...f(Xn,theta)f(X1,theta0)...f(Xn,theta0)= 0bracerightbigg= 1. (5.6)Denote |theta - C0| equivalence mintheta0elementC0 |theta - theta0|. The minimum is attainable sinceC is a closed set in B and hence compact. We need only to prove that89Chapter 5. Consistency and asymptotic normalityall limit points thetaasteriskmath of the sequence ?(n) are in C0. If not, there exists alimit point thetaasteriskmath and an epsilon1 > 0 such that |thetaasteriskmath - C0| greaterequal epsilon1. This implies thatthere are infinitely many ?(n) that lie in omegaepsilon1 equivalence {theta : |theta - C0| greaterequal epsilon1}. Thusf(x1, ?(n))...f(xn, ?(n)) <=< supthetaelementomegaepsilon1 f(x1,theta)...f(xn,theta) for infinitely manyn. Since f(x1, ?(n))...f(xn, ?(n)) greaterequal f(x1,theta0)...f(xn,theta0). We havef(x1,theta0)...f(xn,theta0) <=< supthetaelementomegaepsilon1f(x1,theta)...f(xn,theta),for infinitely many n. Since omegaepsilon1 is a closed set in B, this is an event withprobability zero according to equation 5.6. This completes the proof.In next step, we shall show that limit points of the constrained globalmaximizer ?(n) over Thetac are almost surely in a compact space and henceconsistency follows from Theorem 5.3.Lemma 5.5 Let X1,...,Xn be a sample of P which is absolutely continu-ous with finite second moments. Let ?(n) is a constrained global maximizerof l(theta|X1,...,Xn) over Thetac. Then there exist positive numbers s1 and s2such thatProb{liminfnarrowrightinfinity mink(?(n)k )2 greaterequal s1} = 1, (5.7)andProb{limsupnarrowrightinfinitymaxk(?(n)k )2 <=< s2} = 1. (5.8)Proof First we prove Equation (5.7). Let ?(n)k = ?(n)k ?(n)1 for k = 1,...,K.Then ?(n)k element [c,1/c]. We writeh(sigma) = l(?(n)1 ,..., ?(n)K , ?(n)1 ,..., ?(n)K ,?(n)1 ,...,?(n)K , ?(n)1 sigma,..., ?(n)K sigma).Then it satisfies dh(sigma)dsigma |?sigma(n) = 0.This gives rise tonsummationdisplayi=1- 1?sigma(n)1f(Xi, ?(n)) + 1(?sigma(n)1 )3summationtextKk=1((?(n)k )primeXi-?(n)k )2(?(n)k )2 ?p(n)k fk(Xi, ?theta(n))f(Xi, ?(n))= 0.90Chapter 5. Consistency and asymptotic normalitySolving for ?(n)1 yields(?(n)1 )2 = 1nnsummationdisplayi=1summationtextKk=1((?(n)k )primeXi-?(n)k )2(?(n)k )2 ?p(n)k fk(Xi, ?theta(n))f(Xi, ?(n)).Then, we have(?(n)1 )2 greaterequal c2 1nnsummationdisplayi=1min1<=<k<=<K((?(n)k )primeXi -?(n)k )2. (5.9)We now show that for n0 greaterequal Kd + 1,s(n0) = E{ inf{(a1,...,aK,b1,...,bK):aprimekak=1}n0summationdisplayi=1min1<=<k<=<K(aprimekXi -bk)2} > 0. (5.10)In fact,integraldisplayinf{(a1,...,aK,b1,...,bK):aprimekak=1}n0summationdisplayi=1min1<=<k<=<K(aprimekxi -bk)2dPn0(x)=integraldisplayinf{(a1,...,aK,b1,...,bK):aprimekak=1,|bk|<=<r(x1,...,xn0)}n0summationdisplayi=1min1<=<k<=<K(aprimekxi -bk)2dPn0(x)=integraldisplaymin{(a1,...,aK,b1,...,bK):aprimekak=1,|bk|<=<r(x1,...,xn0)}n0summationdisplayi=1min1<=<k<=<K(aprimekxi -bk)2dPn0(x)equivalenceintegraldisplays0(x1,...,xn0)dPn0(x) > 0,since s0(X1,...,Xn0) > 0 almost surely from Corollary5.1 and the argumentin Theorem 5.2. By the strong law of large numbers, we have1mmsummationdisplayj=1braceleftBigginf{(a1,...,aK,b1,...,bK):aprimekak=1}n0summationdisplayi=1min1<=<k<=<K(aprimekX(j-1)n0+i -bk)2bracerightBiggarrowrights(n0),with probability 1. Let s1 = c2s(n0,P)/n0. Then by equation (5.9),liminfnarrowrightinfinity (?(n)1 )2greaterequalc2 liminfmarrowrightinfinity { 1mn0inf{(a1,...,aK,b1,...,bK):aprimekak=1}mn0summationdisplayi=1min1<=<k<=<K(aprimekXi -bk)2}greaterequalc2n0 lim1mmsummationdisplayj=1{ inf{(a1,...,aK,b1,...,bK):aprimekak=1}n0summationdisplayi=1min1<=<k<=<K(aprimekX(j-1)n0+i -bk)2}=s1,91Chapter 5. Consistency and asymptotic normalitywith probability 1. The same s1 serves as a lower bound of liminfn(?(n)k )2for k = 2,...,K as well. Noticing that s1 does not depend on a set of ob-servations x1,...,xn, we have proved Equation (5.7).Now we prove Equation (5.8). Let Thetasc = {theta : theta element Thetac,sigma2k > s for some k}.Defineq(x,s) = supthetaelementThetascf(x,theta).Then q(x,s) <=< 1radical2pisc. By Lemma 5.2, E| log f(X,theta0)| < infinity. There exists apositive number s2, such thatE(log(q(X,s2))) < E(log f(X,theta0).By the strong law of large numbers,Prob{ limnarrowrightinfinitynsummationdisplayi=1[log q(Xi,s2) -log f(Xi,theta0)] = -infinity} = 1.This implies thatProb{ limnarrowrightinfinity supthetaelementThetas2c f(X1,theta)...f(Xn,theta)f(X1,theta0)...f(Xn,theta0) = 0} = 1.Equation 5.8 follows immediately, since ?(n) always satisfiesf(X1, ?(n))...f(Xn, ?(n))f(X1,theta0)...f(Xn,theta0) greaterequal 1.Now we prove the main consistency result.Theorem 5.4 Let X1,...,Xn be a sample from an absolutely continuousprobability measure P with finite second moments. Let ?(n) be a global max-imizer of l(theta |X1, ...,Xn) over Thetac. Then ?(n) arrowright thetaq0 almost surely in thetopological space Thetaqc.Proof From Lemma 5.5, we need only to consider the subspace of Thetac,Thetasc = {theta element Thetac : s1 <=< sigma2k <=< s2, for all k = 1,...,K},92Chapter 5. Consistency and asymptotic normalitywhere s1,s2 are positive numbers determined in Lemma 5.5.Since Thetasc is not compact, Theorem 5.3 cannot be used directly. We shalluse the compactification device in Hathaway (1985) and also in Kiefer andWolfowitz (1956). In the space Thetasc, define the metricdelta(theta,thetaprime) =summationdisplayi| arctanthetai -arctanthetaprimei|,where | ? | is the Euclidean distance and thetai and thetaprimei are components of theta andthetaprime respectively. Let ?sc be the set of Thetasc along with all its limit points. Then?Thetasc =braceleftbigg theta : 0 <=< pk <=< 1,summationtextKk=1 pk = 1,mini,j sigmai/sigmaj greaterequal c > 0,aprimekak = 1,-infinity <=< bk <=< infinity,s1 <=< sigmak <=< s2,k = 1,...,K.bracerightbiggis compact. Since f(x,?) is continuous on Thetasc, it can be extended to ?sc asf(x,theta) =summationdisplaykpkI(infinity< bk < infinity)fk(x,theta).We have shown g(theta) = E(log f(x,theta)) is continuous on Thetasc. It is continuouson ?sc as well. To see this, let theta(n) be a sequence tending to theta. If all bk = ?infinity,or all pk = 0 whenever bk negationslash= ?infinity, then f(x,theta) = 0 and g(theta) = -infinity. In thiscase,g(theta(n)) =integraldisplaylog f(x,theta(n))dP(x)<=<integraldisplaymax{k:pk>0}log fk(x,theta(n))dP(x)arrowrightintegraldisplay- min{k:pk>0}bracketleftbigg12 log(2pisigma2k) +(aprimekx -bk)22sigma2kbracketrightbiggdP(x)= -infinity.If there is some k such that pk negationslash= 0 and bk negationslash= ?infinity. Thenlog f(x,theta(n)) <=< log maxfk(x,theta(n)) <=< -12 log(2pis1),andlog f(x,theta(n)) greaterequal log min{k:pk>0,bknegationslash=?infinity}p(n)k fk(x,theta(n))greaterequal log min{k:pk>0,bknegationslash=?infinity}pkfk(x,theta) -epsilon1,93Chapter 5. Consistency and asymptotic normalityfor some epsilon1 > 0 when n is sufficiently large. By Lemma 5.2, the latter term isintegrable. By Lebesgue?s dominated convergence theorem, g is continuousat theta.Lemmas 5.3 and 5.4 are easily seen to hold and hence we can repeatliterally the proof of Theorem 5.3. This completes the proof.5.4 Asymptotical normalityThe global maximizer ?(n) converges in the above quotient topological space.There is hence a subsequence, still denoted by ?(n), which converges to apoint theta0 in C0 in the original space. If theta0 is an interior point of Thetac, then wecan expand lprime(?(n)) about theta0,lprime(?(n))=lprime(theta0) + lprimeprime(theta0)(?(n) -theta0)+ 12[(?(n) -theta0)T lprimeprimeprime1 (thetaasteriskmath)(?(n) -theta0)],...,(?(n) -theta0)T lprimeprimeprimes (thetaasteriskmath)(?(n) -theta0)]T ,(5.11)where s is the dimension of theta and thetaasteriskmath is a point on the line segment connecting?theta(n) and theta0; lprime is an s-dimensional vector of first derivatives of l, lprimeprime and lprimeprimeprimejare s ?s matrices of second and third derivatives of l. For ease of notation,we do not differentiate which components of p and ak are free parametersand which are not; it is assumed that the constraints summationtextpk = 1 and aprimekak = 1are taken care in the calculation of these derivatives.The left-hand side of (5.11) is 0, because ?(n) satisfies the first orderconditions.Letv0(theta0) = E[((log f(X,theta0))prime].Then v0(theta0) = gprime(theta0) = 0 by Lebesgue?s dominated convergence theorem.Letv1(theta0) = E[((log f(X,theta0))prime)2].It is straightforward to verify that all the entries of v1(theta0) are finite if theunderlying distribution P has finite fourth moments. Then by the centrallimit theorem, the first derivative lprime(theta0)/radicaln is asymptotical normal,1radicalnlprime(theta0) LarrowrightN(0,v1(theta0)).94Chapter 5. Consistency and asymptotic normalityLetv2(theta0) = E[(log f(X,theta0))primeprime],andv3(theta0) = E[(log f(X,theta0))primeprimeprime].Again, it is straightforward but tedious to verify that all the entries of v2(theta0)are finite if P has finite fourth moments and that all the entries of v3(theta0)are finite if P has finite sixth moments.By the strong law of large numbers, the second derivative lprimeprime(theta0)/n tendsto a constant matrix,1nlprimeprime(theta0) arrowrightv2(theta0),and the third derivative lprimeprimeprime is bounded entry-wise. So we have the followingTheorem 5.5 Let X1,...,Xn be a sample from an absolutely continuousprobability measure P with finite sixth moments. Let ?(n) be a subsequenceof global maximizer of l(theta|X1,...,Xn) over Thetac, which tends to an interiorpoint theta0. Then radicaln(?(n) -theta0) L N(0,v(theta0)),where v(theta0) = [v2(theta0)]+v1(theta0)[v2(theta0)]+ and A+ is the Moore-Penrose in-verse of matrix A.In equation (5.3), ?ik is a function of xi and ?. Denote ?ik = hk(xi, ?),k = 1,...,K. By the Delta method, we have the followingCorollary 5.2 Let X1,...,Xn be a sample from an absolutely continuousprobability measure P with finite sixth moments. Let ?(n) be a subsequenceof global maximizer of L(theta|X1,...,Xn) over Thetac, which tends to an interiorpoint theta0. Let x be a data point. Thenradicaln(hk(x, ?theta(n)) -hk(x,theta0))LarrowrightN(0,[h(0)k (x,theta0)]primev(theta0)[h(0)k (x,theta0)]),for k = 1,...,K, where h(0)k (x,theta0) = partialdiffhk(x,theta)partialdifftheta |theta=theta0.Using this corollary, we can build approximate confidence intervals forwik by replacing theta0 with ? and hence evaluate the clustering of a data set.95BibliographyEguchi, S. and J. Copas (2006). Interpreting Kullback?Leibler divergencewith the Neyman?Pearson lemma. Journal of Multivariate Analysis 97(9),2034?2040.Garc? -Escudero, L., A. Gordaliza, R. San Mart? , S. van Aelst, and R. Za-mar (2007). Robust linear clustering. Preprint.Hathaway, R. (1983). Constrained maximum-likelihood estimation for amixture of m univariate normal distributions. Ph. D. thesis, Rice Univer-sity.Hathaway, R. (1985). A Constrained Formulation of Maximum-LikelihoodEstimation for Normal Mixture Distributions. The Annals of Statis-tics 13(2), 795?800.Kullback, S. and R. Leibler (1951). On Information and Sufficiency. TheAnnals of Mathematical Statistics 22(1), 79?86.Redner, R. (1981). Note on the Consistency of the Maximum LikelihoodEstimate for Nonidentifiable Distributions. The Annals of Statistics 9(1),225?228.Wald, A. (1949). Note on the Consistency of the Maximum LikelihoodEstimate. The Annals of Mathematical Statistics 20(4), 595?601.Yan, G., W. Welch, and R. Zamar (2008). A likelihood approach to linearclustering. Preprint.96Chapter 6Future WorkIn this thesis work, we have developed a SNP genotype calling algorithmbased on the linear grouping algorithm (Van Aelst et al., 2006), proposed aflexible model-based approach to linear clustering and introduced a Bayesianapproach to linear clustering with specific relevance to the SNP genotypingproblem. In this chapter, we briefly describe a few possible directions forfuture research.6.1 Robustness considerationRobustness to outliers is desirable in linear clustering, as the assumption ofnormal deviations around hyperplanes is sensitive to large deviations in theorthogonal direction. In addition to the inclusion of a uniform backgroundcluster (Banfield and Raftery, 1993), one option would be to use a heaviertailed distribution, for example, Student?s t distribution with small degreesof freedom or with degrees of freedom depending on the data. In the par-tial likelihood approach, this would adapt Peel and McLachlan (2000)?s EMalgorithm for t mixture models from the elliptical context to the linear clus-tering setting. The adaptation is straightforward but computationally moreexpensive. Further ideas include estimating the component covariance ma-trices in the M-step in a robust way, for example, trimming off some pointsas done by Garc? -Escudero et al. (2007). In the Bayesian approach, wehave already used a Student?s t distribution with small degrees of freedom.6.2 AsymptoticsFor the partial likelihood approach, we produced some asymptotic results inthe case of normal orthogonal deviations. For more general cases, such asStudent?s t distribution, these results have not been established. We shallstudy the asymptotic evaluation in more general cases.97Chapter 6. Future Work6.3 Model ExtensionWith aprimex = b, we are specifying a hyperplane in d - 1 dimensions. Withlittle effort, this could be generalized to a mixture of partial likelihoods, eachof which specifies a hyperplane of dimension q < d,l(kappa,p|x1:n) =nproductdisplayi=1Ksummationdisplayk=1pkN(Aprimekxi -bk;0,?k), (6.1)where A is of dimension d ? (d - q), b is a vector of dimension d - q,and ?k is a (d - q) ? (d - q) covariance matrix for the deviation from thehyperplane. In the extreme case of a 0-dimension hyperplane, which is apoint, we have the usual mixture of multivariate normal distributions. Amixture of components with various dimensions could be considered.6.4 Variable/model selectionFor a large, high dimensional dataset, efficient computation is essential inorder to uncover linear patterns. In addition to develop more efficient algo-rithms, one idea is to use a subset of the data and/or a subset of variables.To this end, we are interested in methods to screen variables as well as todetermine the number of linear structures.6.5 Bayesian computationIn our examples in the Bayesian approach, label-switching is prevented eitherby a Gibbs sampler applied to a posterior distribution with isolated modes orby informative priors. In more general situations, we may need the ideas oftempering MCMC or Sequential Monte Carlo to explore the whole supportof the posterior distribution and deal with the label-switching problem.In addition, the number of linear clusters are assumed known in ourapproach. In the situation of unknown number of clusters, our first thoughtis to investigate the feasibility of the Reversible Jump MCMC of Richardsonand Green (1997). This may imply heavy computational burden. A relatedproblem is again the scalability of the Bayesian approach to large datasetsand high dimensions. We leave these problems for further research.98BibliographyBanfield, J. and A. Raftery (1993). Model-Based Gaussian and Non-Gaussian Clustering. Biometrics 49(3), 803?821.Garc? -Escudero, L., A. Gordaliza, R. San Mart? , S. van Aelst, and R. Za-mar (2007). Robust linear clustering. Preprint.Peel, D. and G. McLachlan (2000). Robust mixture modelling using the tdistribution. Statistics and Computing 10(4), 339?348.Richardson, S. and P. Green (1997). On Bayesian Analysis of Mixtureswith an Unknown Number of Components. Journal of the Royal StatisticalSociety. Series B (Methodological) 59(4), 731?792.Van Aelst, S., X. Wang, R. Zamar, and R. Zhu (2006). Linear groupingusing orthogonal regression. Computational Statistics and Data Analy-sis 50(5), 1287?1312.99Appendix AGlossary of some geneticterms(Extracted from the website http://www.genome.gov/glossary.cfm.)adenine (A) One of the four bases in DNA that make up the letters ATGC,adenine is the ?A?. The others are guanine, cytosine, and thymine. Adeninealways pairs with thymine.allele One of the variant forms of a gene at a particular locus, or location, ona chromosome. Different alleles produce variation in inherited characteristicssuch as hair color or blood type. In an individual, one form of the allele(the dominant one) may be expressed more than another form (the recessiveone).base pair Two bases which form a ?rung of the DNA ladder.? A DNAnucleotide is made of a molecule of sugar, a molecule of phosphoric acid,and a molecule called a base. The bases are the ?letters? that spell out thegenetic code. In DNA, the code letters are A, T, G, and C, which standfor the chemicals adenine, thymine, guanine, and cytosine, respectively. Inbase pairing, adenine always pairs with thymine, and guanine always pairswith cytosine.chromosome One of the thread-like ?packages? of genes and other DNA inthe nucleus of a cell. Different kinds of organisms have different numbers ofchromosome. Humans have 23 pairs of chromosomes, 46 in all: 44 autosomesand two sex chromosomes. Each parent contributes one chromosome to eachpair, so children get half of their chromosomes from their mothers and halffrom their fathers.100Appendix A. Glossary of some genetic termscytosine One of the four bases in DNA that make up the letters ATGC,cytosine is the ?C?. The others are adenine, guanine, and thymine. Cytosinealways pairs with guanine.deoxyribonucleic acid (DNA) The chemical inside the nucleus of a cellthat carries the genetic instructions for making living organisms.diploid The number of chromosomes in most cells except the gametes. Inhumans, the diploid number is 46.dominant A gene that almost always results in a specific physical charac-teristic, for example, a disease, even though the patient?s genome possessesonly one copy. With a dominant gene, the chance of passing on the gene(and therefore the disease) to children is 50-50 in each pregnancy.gene The functional and physical unit of heredity passed from parent tooffspring. Genes are pieces of DNA, and most genes contain the informationfor making a specific protein.gene amplification An increase in the number of copies of any particularpiece of DNA. A tumor cell amplifies, or copies, DNA segments naturally asa result of cell signals and sometimes environmental events.gene expression The process by which proteins are made from the instruc-tions encoded in DNA.genetic code (ATCG) The instructions in a gene that tell the cell howto make a specific protein. A, T, G, and C are the ?letters? of the DNAcode; they stand for the chemicals adenine, thymine, guanine, and cytosine,respectively, that make up the nucleotide bases of DNA. Each gene?s codecombines the four chemicals in various ways to spell out 3-letter ?words?that specify which amino acid is needed at every step in making a protein.genetic marker A segment of DNA with an identifiable physical locationon a chromosome and whose inheritance can be followed. A marker can bea gene, or it can be some section of DNA with no known function. BecauseDNA segments that lie near each other on a chromosome tend to be inheritedtogether, markers are often used as indirect ways of tracking the inheritance101Appendix A. Glossary of some genetic termspattern of a gene that has not yet been identified, but whose approximatelocation is known.genome All the DNA contained in an organism or a cell, which includesboth the chromosomes within the nucleus and the DNA in mitochondria.genotype The genetic identity of an individual that does not show as out-ward characteristics.guanine One of the four bases in DNA that make up the letters ATGC,guanine is the ?G?. The others are adenine, cytosine, and thymine. Guaninealways pairs with cytosine.haploid The number of chromosomes in a sperm or egg cell, half the diploidnumber.Haplotype It refers to a set of SNPs found to be statistically associated ona single chromatid. With this knowledge, the identification of a few allelesof a haplotype block unambiguously identifies all other polymorphic sitesin this region. Such information is most valuable to investigate the genet-ics behind common diseases and is collected by the International HapMapProject (http://en.wikipedia.org/wiki/Haplotype).Hardy-Weinberg equilibrium (HWE) It states that, under certain con-ditions, after on generation of random mating, the genotype frequencies ata single gene locus will become fixed at a particular equilibrium value. Italso specifies that those equilibrium frequencies can be represented as a sim-ple function of the allele frequencies at that locus.(http://en.wikipedia.org/wiki /Hardy-Weinberg equilibrium).heterozygous Possessing two different forms of a particular gene, one in-herited from each parent.homozygous Possessing two identical forms of a particular gene, one in-herited from each parent.linkage The association of genes and/or markers that lie near each otheron a chromosome. Linked genes and markers tend to be inherited together.102Appendix A. Glossary of some genetic termslinkage disequilibrium (LD) The non-random association of alleles attwo or more loci on a chromosome. It describes a situation in which somecombinations of alleles or genetic markers occur more or less frequently in apopulation than would be expected from a random formation of haplotypesfrom alleles based on their frequencies. In population genetics, linkage dise-quilibrium is said to characterize the haplotype distribution at two or moreloci (http://en.wikipedia.org/wiki/Linkage disequilibrium).locus The place on a chromosome where a specific gene is located, a kindof address for the gene. The plural is ?loci,? not ?locuses.?non-coding DNA The strand of DNA that does not carry the informationnecessary to make a protein. The non-coding strand is the mirror image ofthe coding strand and is also known as the antisense strand.nucleotide One of the structural components, or building blocks, of DNAand RNA. A nucleotide consists of a base (one of four chemicals: adenine,thymine, guanine, and cytosine) plus a molecule of sugar and one of phos-phoric acid.phenotype The observable traits or characteristics of an organism, for ex-ample hair color, weight, or the presence or absence of a disease. Phenotypictraits are not necessarily genetic.polymerase chain reaction (PCR) A fast, inexpensive technique formaking an unlimited number of copies of any piece of DNA. Sometimes called?molecular photocopying,? PCR has had an immense impact on biology andmedicine, especially genetic research.polymorphism A common variation in the sequence of DNA among indi-viduals.primer A short oligonucleotide sequence used in a polymerase chain reac-tion.probe A piece of labeled DNA or RNA or an antibody used to detect thefunction of a gene.103Appendix A. Glossary of some genetic termssingle nucleotide polymorphism Common, but minute, variations thatoccur in human DNA at a frequency of one every 1,000 bases. These vari-ations can be used to track inheritance in families. SNP is pronounced?snip?.thymine One of the four bases in DNA that make up the letters ATGC,thymine is the ?T?. The others are adenine, guanine, and cytosine. Thyminealways pairs with adenine.wild-type allele The allele designated as the standard (?normal?) for astrain of organism.104
- Library Home /
- Search Collections /
- Open Collections /
- Browse Collections /
- UBC Theses and Dissertations /
- Linear clustering with application to single nucleotide...
Open Collections
UBC Theses and Dissertations
Featured Collection
UBC Theses and Dissertations
Linear clustering with application to single nucleotide polymorphism genotyping Yan, Guohua 2008-06-27
pdf
Page Metadata
Item Metadata
Title | Linear clustering with application to single nucleotide polymorphism genotyping |
Creator |
Yan, Guohua |
Publisher | University of British Columbia |
Date Issued | 2008 |
Description | Single nucleotide polymorphisms (SNPs) have been increasingly popular for a wide range of genetic studies. A high-throughput genotyping technologies usually involves a statistical genotype calling algorithm. Most calling algorithms in the literature, using methods such as k-means and mixturemodels, rely on elliptical structures of the genotyping data; they may fail when the minor allele homozygous cluster is small or absent, or when the data have extreme tails or linear patterns. We propose an automatic genotype calling algorithm by further developing a linear grouping algorithm (Van Aelst et al., 2006). The proposed algorithm clusters unnormalized data points around lines as against around centroids. In addition, we associate a quality value, silhouette width, with each DNA sample and a whole plate as well. This algorithm shows promise for genotyping data generated from TaqMan technology (Applied Biosystems). A key feature of the proposed algorithm is that it applies to unnormalized fluorescent signals when the TaqMan SNP assay is used. The algorithm could also be potentially adapted to other fluorescence-based SNP genotyping technologies such as Invader Assay. Motivated by the SNP genotyping problem, we propose a partial likelihood approach to linear clustering which explores potential linear clusters in a data set. Instead of fully modelling the data, we assume only the signed orthogonal distance from each data point to a hyperplane is normally distributed. Its relationships with several existing clustering methods are discussed. Some existing methods to determine the number of components in a data set are adapted to this linear clustering setting. Several simulated and real data sets are analyzed for comparison and illustration purpose. We also investigate some asymptotic properties of the partial likelihood approach. A Bayesian version of this methodology is helpful if some clusters are sparse but there is strong prior information about their approximate locations or properties. We propose a Bayesian hierarchical approach which is particularly appropriate for identifying sparse linear clusters. We show that the sparse cluster in SNP genotyping datasets can be successfully identified after a careful specification of the prior distributions. |
Extent | 3276809 bytes |
Subject |
Cluster analysis Mixture models Single nucleotide polymorphism |
Genre |
Thesis/Dissertation |
Type |
Text |
File Format | application/pdf |
Language | eng |
Date Available | 2008-06-27 |
Provider | Vancouver : University of British Columbia Library |
DOI | 10.14288/1.0066454 |
URI | http://hdl.handle.net/2429/958 |
Degree |
Doctor of Philosophy - PhD |
Program |
Statistics |
Affiliation |
Science, Faculty of Statistics, Department of |
Degree Grantor | University of British Columbia |
Graduation Date | 2008-11 |
Campus |
UBCV |
Scholarly Level | Graduate |
Aggregated Source Repository | DSpace |
Download
- Media
- 24-ubc_2008_fall_yan_guohua.pdf [ 3.13MB ]
- Metadata
- JSON: 24-1.0066454.json
- JSON-LD: 24-1.0066454-ld.json
- RDF/XML (Pretty): 24-1.0066454-rdf.xml
- RDF/JSON: 24-1.0066454-rdf.json
- Turtle: 24-1.0066454-turtle.txt
- N-Triples: 24-1.0066454-rdf-ntriples.txt
- Original Record: 24-1.0066454-source.json
- Full Text
- 24-1.0066454-fulltext.txt
- Citation
- 24-1.0066454.ris
Full Text
Cite
Citation Scheme:
Usage Statistics
Share
Embed
Customize your widget with the following options, then copy and paste the code below into the HTML
of your page to embed this item in your website.
<div id="ubcOpenCollectionsWidgetDisplay">
<script id="ubcOpenCollectionsWidget"
src="{[{embed.src}]}"
data-item="{[{embed.item}]}"
data-collection="{[{embed.collection}]}"
data-metadata="{[{embed.showMetadata}]}"
data-width="{[{embed.width}]}"
async >
</script>
</div>
Our image viewer uses the IIIF 2.0 standard.
To load this item in other compatible viewers, use this url:
http://iiif.library.ubc.ca/presentation/dsp.24.1-0066454/manifest