Linear clustering with application to single nucleotide polymorphism genotyping

UBC Theses and Dissertations

Linear clustering with application to single nucleotide polymorphism genotyping Yan, Guohua

Abstract

Single nucleotide polymorphisms (SNPs) have become increasingly popular for a wide range of genetic studies. High-throughput genotyping technologies usually involve a statistical genotype calling algorithm. Most calling algorithms in the literature, using methods such as k-means and mixture models, rely on elliptical structures of the genotyping data; they may fail when the minor allele homozygous cluster is small or absent, or when the data have extreme tails or linear patterns. We propose an automatic genotype calling algorithm by further developing a linear grouping algorithm (Van Aelst et al., 2006). The proposed algorithm clusters unnormalized data points around lines rather than around centroids. In addition, we associate a quality value, the silhouette width, with each DNA sample and with the whole plate. This algorithm shows promise for genotyping data generated from TaqMan technology (Applied Biosystems). A key feature of the proposed algorithm is that it applies to unnormalized fluorescent signals when the TaqMan SNP assay is used. The algorithm could also potentially be adapted to other fluorescence-based SNP genotyping technologies such as the Invader Assay. Motivated by the SNP genotyping problem, we propose a partial likelihood approach to linear clustering which explores potential linear clusters in a data set. Instead of fully modelling the data, we assume only that the signed orthogonal distance from each data point to a hyperplane is normally distributed. The relationships of this approach with several existing clustering methods are discussed. Some existing methods to determine the number of components in a data set are adapted to this linear clustering setting. Several simulated and real data sets are analyzed for comparison and illustration purposes. We also investigate some asymptotic properties of the partial likelihood approach.
A Bayesian version of this methodology is helpful if some clusters are sparse but there is strong prior information about their approximate locations or properties. We propose a Bayesian hierarchical approach which is particularly appropriate for identifying sparse linear clusters. We show that the sparse cluster in SNP genotyping datasets can be successfully identified after a careful specification of the prior distributions.
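
As a rough illustration of what "clustering around lines rather than around centroids" can mean, the following minimal Python sketch alternates between fitting an orthogonal-regression line to each cluster and reassigning every point to the line with the smallest absolute orthogonal distance. It is not the algorithm developed in the thesis, and it implements neither the partial likelihood nor the Bayesian approach; the function names (`fit_line`, `signed_distances`, `cluster_around_lines`), the use of NumPy, the need for caller-supplied initial labels and the stopping rule are all assumptions made for this example.

```python
import numpy as np


def fit_line(points):
    """Fit a line (hyperplane) by orthogonal regression.

    Returns a unit normal vector `a` and an offset `b` such that the
    fitted line is the set of x with a . x = b.
    """
    center = points.mean(axis=0)
    # The right-singular vector with the smallest singular value of the
    # centered data is the direction of least spread, i.e. the normal.
    _, _, vt = np.linalg.svd(points - center, full_matrices=False)
    normal = vt[-1]
    return normal, float(normal @ center)


def signed_distances(points, normal, offset):
    """Signed orthogonal distance from each point to the line a . x = b."""
    return points @ normal - offset


def cluster_around_lines(points, init_labels, k, n_iter=50):
    """Alternate line fitting and reassignment, starting from init_labels.

    Assumption of this sketch: every initial cluster holds at least as
    many points as there are dimensions, so each line can be fitted.
    """
    labels = np.asarray(init_labels).copy()
    min_size = points.shape[1]
    for _ in range(n_iter):
        lines = [fit_line(points[labels == j]) for j in range(k)]
        dist = np.column_stack(
            [np.abs(signed_distances(points, a, b)) for a, b in lines]
        )
        new_labels = dist.argmin(axis=1)
        counts = np.bincount(new_labels, minlength=k)
        # Stop when the assignment is stable, or when a reassignment
        # would leave a cluster too small to fit a line.
        if np.array_equal(new_labels, labels) or counts.min() < min_size:
            break
        labels = new_labels
    # Refit so the returned lines always match the returned labels.
    lines = [fit_line(points[labels == j]) for j in range(k)]
    return labels, lines
```

For two-dimensional fluorescent signals, `points` would be an array of shape `(n, 2)` and `k` the number of genotype clusters sought. The result depends on the initial labels, so several starting assignments would normally be tried. The signed distances returned by `signed_distances` are the quantities that the partial likelihood approach described above assumes to be normally distributed.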

Item Metadata

Title	Linear clustering with application to single nucleotide polymorphism genotyping
Creator	Yan, Guohua
Publisher	University of British Columbia
Date Issued	2008
Extent	3276809 bytes
Genre	Thesis/Dissertation
Type	Text
File Format	application/pdf
Language	eng
Date Available	2008-06-27
Provider	Vancouver : University of British Columbia Library
Rights	Attribution-NonCommercial-NoDerivatives 4.0 International
DOI	10.14288/1.0066454
URI	http://hdl.handle.net/2429/958
Degree (Theses)	Doctor of Philosophy - PhD
Program (Theses)	Statistics
Affiliation	Science, Faculty of; Statistics, Department of
Degree Grantor	University of British Columbia
Graduation Date	2008-11
Campus	UBCV
Scholarly Level	Graduate
Rights URI	http://creativecommons.org/licenses/by-nc-nd/4.0/
Aggregated Source Repository	DSpace
