Open Collections
UBC Theses and Dissertations
Separation index, variable selection and sequential algorithm for cluster analysis
Qiu, Weiliang
Abstract
This thesis considers four important issues in cluster analysis: cluster validation, estimation of the number of clusters, variable weighting/selection, and generation of random clusters.
Any clustering method can partition data into several subclusters, so it is important to have a method to validate the partitions obtained. We propose a cluster separation index to address the cluster validation problem. The index is based on projecting the data in two clusters onto a one-dimensional space in which the two clusters have maximum separation. It directly measures the magnitude of the gap between a pair of clusters, is easy to compute and interpret, and has the scale equivariance property.
The ultimate goal of cluster analysis is to determine whether patterns (clusters) exist in multivariate data sets. If clusters exist, we would like to determine how many there are. We propose a sequential clustering (SEQCLUST) method that produces a sequence of estimates of the number of clusters based on varying input parameters. The most frequently occurring estimates in the sequence lead to a point estimate of the number of clusters, together with an interval estimate.
For a given data set, some variables may be more important than others for recovering the cluster structure. Some variables, called noisy variables, may even mask cluster structures, so it is necessary to downweight or eliminate their effects. We investigate when noisy variables mask cluster structures, and propose a weight-vector averaging idea and a new noisy-variable-detection method that does not require specifying the true number of clusters.
Simulation studies are an important tool for assessing and comparing the performance of clustering methods. The quality of simulated data sets depends on the cluster-generating algorithm. We propose a design for generating simulated clusters in which the distances between simulated clusters and their neighboring clusters can be controlled and the shapes, diameters and orientations of the simulated clusters can be arbitrary.
We also propose low-dimensional visualization methods and a method to determine the partial memberships of data points that are near boundaries among clusters.
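The idea of a separation index built on a one-dimensional projection can be illustrated with a small sketch. The abstract does not give the formula, so everything below is an assumption made for illustration: the function name `separation_index`, the tail parameter `alpha`, the quantile-based definition of the gap, and the use of a Fisher-style direction as a stand-in for the maximum-separation direction. It should not be read as the thesis's definition.

```python
import numpy as np


def separation_index(x1, x2, alpha=0.05):
    """Illustrative gap measure between two clusters along one direction.

    Assumptions (not taken from the thesis):
    - the projection direction is a = (S1 + S2)^{-1} (m2 - m1), a
      Fisher-style choice used in place of a true maximum-separation search;
    - the gap is measured between the alpha/2 and 1 - alpha/2 quantiles
      of the two projected clusters.
    """
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    m1, m2 = x1.mean(axis=0), x2.mean(axis=0)
    # Sample covariance matrices; atleast_2d handles the one-variable case.
    s1 = np.atleast_2d(np.cov(x1, rowvar=False))
    s2 = np.atleast_2d(np.cov(x2, rowvar=False))
    a = np.linalg.solve(s1 + s2, m2 - m1)
    # Project both clusters onto the line spanned by a. With this choice of
    # a, the projected mean of cluster 2 is not smaller than that of cluster 1.
    p1, p2 = x1 @ a, x2 @ a
    lo1, hi1 = np.quantile(p1, [alpha / 2, 1 - alpha / 2])
    lo2, hi2 = np.quantile(p2, [alpha / 2, 1 - alpha / 2])
    # Gap between the clusters relative to the total span they cover.
    return (lo2 - hi1) / (hi2 - lo1)


# Usage: two well-separated clusters and two overlapping ones.
rng = np.random.default_rng(0)
cluster_a = rng.normal(loc=0.0, size=(200, 2))
cluster_b = rng.normal(loc=10.0, size=(200, 2))
cluster_c = rng.normal(loc=0.0, size=(200, 2))
print("separated:", separation_index(cluster_a, cluster_b))
print("overlapping:", separation_index(cluster_a, cluster_c))
```

In this sketch a positive value indicates a gap between the projected clusters and a negative value indicates overlap. The value is also unchanged when both clusters are multiplied by the same constant, because the projection direction rescales with the data.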
Item Metadata
Title | Separation index, variable selection and sequential algorithm for cluster analysis
Creator | Qiu, Weiliang
Publisher | University of British Columbia
Date Issued | 2004
Extent | 14213633 bytes
Genre |
Type |
File Format | application/pdf
Language | eng
Date Available | 2009-12-02
Provider | Vancouver : University of British Columbia Library
Rights | For non-commercial purposes only, such as research, private study and education. Additional conditions apply, see Terms of Use https://open.library.ubc.ca/terms_of_use.
DOI | 10.14288/1.0091796
URI |
Degree |
Program |
Affiliation |
Degree Grantor | University of British Columbia
Graduation Date | 2005-05
Campus |
Scholarly Level | Graduate
Aggregated Source Repository | DSpace