UBC Theses and Dissertations
Extending linear grouping analysis and robust estimators for very large data sets Harrington, Justin
Cluster analysis is the study of how to partition data into homogeneous subsets so that the partitioned data share some common characteristic. In one to three dimensions, the human eye can distinguish well between clusters of data if clearly separated. However, when there are more than three dimensions and/or the data is not clearly separated, an algorithm is required which needs a metric of similarity that quantitatively measures the characteristic of interest. Linear Grouping Analysis (LGA, Van Aelst et al. 2006) is an algorithm for clustering data around hyperplanes, and is most appropriate when: 1) the variables are related/correlated, which results in clusters with an approximately linear structure; and 2) it is not natural to assume that one variable is a “response”, and the remainder the “explanatories”. LGA measures the compactness within each cluster via the sum of squared orthogonal distances to hyperplanes formed from the data. In this dissertation, we extend the scope of problems to which LGA can be applied. The first extension relates to the linearity requirement inherent within LGA, and proposes a new method of non-linearly transforming the data into a Feature Space, using the Kernel Trick, such that in this space the data might then form linear clusters. A possible side effect of this transformation is that the dimension of the transformed space is significantly larger than the number of observations in a given cluster, which causes problems with orthogonal regression. Therefore, we also introduce a new method for calculating the distance of an observation to a cluster when its covariance matrix is rank deficient. The second extension concerns the combinatorial problem for optimizing a LGA objective function, and adapts an existing algorithm, called BIRCH, for use in providing fast, approximate solutions, particularly for the case when data does not fit in memory. We also provide solutions based on BIRCH for two other challenging optimization problems in the field of robust statistics, and demonstrate, via simulation study as well as application on actual data sets, that the BIRCH solution compares favourably to the existing state-of-the-art alternatives, and in many cases finds a more optimal solution.
Item Citations and Data
Attribution-NonCommercial-NoDerivatives 4.0 International