Open Collections

UBC Theses and Dissertations

Extending linear grouping analysis and robust estimators for very large data sets. Harrington, Justin (2008)



Full Text

Extending linear grouping analysis and robust estimators for very large data sets

by Justin Harrington

BCom/BSc, University of Auckland, 1996
DipFinMath, Victoria University of Wellington, 1998
MSc, Victoria University of Wellington, 1999

A THESIS SUBMITTED IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF Doctor of Philosophy in The Faculty of Graduate Studies (Statistics), The University of British Columbia (Vancouver, Canada), May 2008.

© Justin Harrington 2008

Abstract

In this dissertation, we extend the scope of problems where the clustering algorithm Linear Grouping Analysis (LGA) can be applied.

The first extension relates to the linearity requirement inherent within LGA, and proposes a new method for relaxing this requirement by non-linearly transforming the data into a Feature Space, such that in this space the data might then form linear clusters. In order to do this we utilise what is known in the literature as the Kernel Trick. However, one side effect of this transformation is that often the dimension of the transformed space is significantly larger, which can result in more dimensions than there are observations in a given cluster, and cause problems with orthogonal regression. To deal with this situation, we also introduce a new method for calculating the distance of an observation to a cluster when its covariance matrix is rank deficient.

The second extension concerns the combinatorial problem of optimizing an LGA objective function, and adapts an existing algorithm, called BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies), for use in providing fast, approximate solutions, particularly for the case when data does not fit in memory. We also provide solutions based on BIRCH for two other challenging optimization problems in the field of robust statistics, and demonstrate, via simulation study as well as application on actual data sets, that the BIRCH solution compares favourably to the existing state-of-the-art alternatives, and in many cases finds a better solution.

Table of Contents

Abstract
Table of Contents
List of Tables
List of Figures
Acknowledgements
Dedication
Co-Authorship Statement
1 Introduction
  1.1 An Overview of Linear Grouping Analysis
    1.1.1 On the Nature of Combinatorial Problems
  1.2 Extending Linear Grouping Analysis
    1.2.1 Dimension Reduction and Clustering
    1.2.2 Transforming Data into a Feature Space
  1.3 Finding Approximate Solutions to Combinatorial Problems with Very Large Data Sets using BIRCH
    1.3.1 Motivation
    1.3.2 Literature Review
    1.3.3 Algorithms Extended
  Bibliography
2 Extending Linear Grouping Analysis
  2.1 Introduction
  2.2 Linear Grouping Criterion
  2.3 Partially Orthogonal Regression
  2.4 Linear Grouping in the Feature Space
    2.4.1 Preliminaries and Notation
    2.4.2 kLGA with Orthogonal Regression
    2.4.3 kLGA with POD
  2.5 Discussion and Future Research
  Bibliography
3 Combinatorial Problems, Large Data Sets and BIRCH
  3.1 Introduction
  3.2 Data Pre-processing
    3.2.1 Forming the Subclusters
    3.2.2 Selection of Parameters
  3.3 Applications
    3.3.1 Minimum Covariance Determinant
    3.3.2 Least Trimmed Squares
    3.3.3 Robust Linear Grouping Analysis
  3.4 Simulation Studies
    3.4.1 Simulation Study with LTS
    3.4.2 Simulation Study with MCD
  3.5 Conclusion
  Bibliography
4 Conclusion
Bibliography
Appendices
A Proofs and Derivations for Chapter 2
B birch: Working with Very Large Data Sets
  B.1 Introduction
    B.1.1 BIRCH
  B.2 Package birch: Implementation in R
    B.2.1 Generic Internal Methods
    B.2.2 Clustering
    B.2.3 Robust Statistics
  B.3 Example
  B.4 Conclusion
  Bibliography

List of Tables

1.1 Some commonly used kernels
3.1 The effect of user selected parameters on number of subclusters and performance
3.2 The fitted regression lines for each approach on each data set
3.3 The parameters used for creating the CF-tree using BIRCH for the LTS simulation study
3.4 The processing time for each LTS algorithm in seconds
3.5 The parameters used for creating the CF-tree using BIRCH for the MCD simulation study
3.6 The processing time for each MCD algorithm in seconds
B.1 The structure of a birch object
B.2 The outputs from each of the robust algorithms
B.3 A selection of results in the Example section

List of Figures

1.1 An example of LGA versus Mclust using the olfactory bulbs vs. brain weight data set
1.2 A demonstration of the importance of initialization
1.3 The number of samples required to achieve a 95% probability of a "good" initial partition
1.4 A demonstration of transformation to linearity
2.1 An example of LGA applied to the allometry data set
2.2 Introducing Partially Orthogonal Regression
2.3 Looking at the e2-e3 plane
2.4 A "bird's-eye-view" looking down onto the e1-e2 plane
2.5 Examples of LGA-POD on data sets whose clusters have covariance matrix of less than full rank
2.6 Plots from example for kLGA
2.7 Plots from examples of kLGA (POD)
3.1 TRIUMF data set, with fast-MCD, MCD-BIRCH and classical confidence ellipsoids
3.2 Smoothed color density representation of the scatter plot of the TRIUMF data set
3.3 DPOSS data set with regression lines from fast-LTS, LTS-BIRCH and classical LS
3.4 TRIUMF data set with regression lines from fast-LTS, LTS-BIRCH and classical LS
3.5 DPOSS data set with output from RLGA and RLGA-BIRCH
3.6 Box plot of the mean difference between the "true" LTS residual and the one calculated from fast-LTS and LTS-BIRCH
3.7 Bar plots of the performance for fast-LTS and LTS-BIRCH
3.8 Box plot of the MCD
3.9 Bar plots of the performance for fast-MCD, MCD-BIRCH and MCD-BIRCH-R
3.10 Box plot of the MCD
3.11 Bar plots of the performance for fast-MCD, MCD-BIRCH and MCD-BIRCH-R
B.1 The tree structure used for pre-processing in birch
B.2 A new plot method for birch
B.3 The birch plot method, applied on birch objects of differing size
B.4 The results from RLGA

Acknowledgements

First and foremost, I would like to thank my supervisor, Dr Matías Salibián-Barrera. For nearly four years he has been a most excellent adviser, collaborator and colleague (dare I say friend?), and many a time has kept me on track and focused on what is important. I am certainly appreciative of his efforts as well as his subtle, if ever so slightly dark, sense of humor. At this time I would also like to thank him for funding me for most of the last four years.

I would like to thank Professors Zamar and Welch for agreeing to be members of my supervisory committee and for their insightful and encouraging comments.

My time in Vancouver has been a magical one, and this is mostly due to the wonderful friends I have made while being here. A special thanks goes to Mike (who is a most excellent coffee buddy), and the office staff for being a source of sanity in a world of idiosyncrasy. To all my friends and colleagues I wish all the best in their future endeavours, and bid them hasta luego rather than adiós.

I dedicate this thesis to my parents and my wonderful wife. Without the many sacrifices of my parents I would never have had this opportunity, and any achievement of mine is a direct result of the excellent foundation they built for me. And finally, to my wonderful Diana: for her unwavering support and understanding I am truly indebted. The fact that she agreed to marry an investment banker cum professional student still amazes me, and I am very much looking forward to the many, many years we will have together.

Dedication

Per Angusta Ad Augusta
Through trial to triumph.

Co-Authorship Statement

Two articles, which formed the basis of Chapter 3 and Appendix B, were co-authored with Dr Matías Salibián-Barrera. These articles were:

- Harrington, J. and Salibián-Barrera, M. (2007) Finding Approximate Solutions to Combinatorial Problems with Very Large Data Sets using BIRCH, "Special Issue on Statistical Algorithms and Software" of Computational Statistics and Data Analysis; and
- Harrington, J. and Salibián-Barrera, M. (2008) birch: Working with very large data sets, Journal of Statistical Software.

For both articles, the literature review, theory, development of algorithms, coding in R, and initial drafts of the documents were completed by Justin Harrington. Further editing of the manuscripts for publication and (excellent) supervision was provided by Dr Salibián-Barrera.

Chapter 1
Introduction

Cluster analysis is the study of how to partition data into homogeneous subsets so that the partitioned data share some common characteristic. In two and three dimensions, the human eye is excellent at distinguishing between clusters of data when they are clearly separated. However, when there are more than three dimensions and/or the data is not clearly separated, an algorithm is required, which in turn needs a metric of similarity that quantitatively measures the characteristic of interest.
An oft-cited example of a clustering algorithm is k-means, which partitions data into compact spheres, a characteristic that can be measured with the within-cluster sum of squares.

Generically, clustering algorithms are combinatorial problems, insofar as they are searching for an optimal partition of the data in order to minimize an objective function. The set of all possible partitions is referred to as the solution space of the problem, and not surprisingly the size of this space increases dramatically with the number of observations, dimensions and clusters.[1]

Linear Grouping Analysis (LGA) is an algorithm for clustering data around hyperplanes that was first introduced in Van Aelst et al. (2006). This algorithm is most appropriate when:

- the variables are related/correlated, which results in clusters with an approximately linear structure; and
- it is not natural to assume that one variable is a "response", and the remainder the "explanatories".

LGA measures the compactness within each cluster via the sum of squared orthogonal distances to hyperplanes formed from the data. However, if the data is not at least approximately linear within a cluster, then clearly this measure is no longer appropriate. In order to measure orthogonal distances, the algorithm relies heavily on the covariance matrix of each cluster to determine its structure, and requires among other things a certain minimum number of observations (relative to the dimension) in order to specify the shape fully.

In this thesis we propose three new extensions to this algorithm. They are:

I - Relax the linearity assumption via transformation of the data set.
II - Allow LGA to perform in the situation where at least one of the covariance matrices does not adequately specify the structure of its cluster.
III - Make LGA scalable with respect to the number of observations.

When the data is not linearly structured, we transform the data into an alternative space, commonly referred to as the "feature space"; this is a technique that is frequently used in classification problems in the machine learning literature. Often these transformations dramatically increase the dimension of the data set, resulting in a covariance matrix that has insufficient observations to completely specify the new structure of a transformed cluster. In order to deal with this, we introduce a new distance metric, called the "partially orthogonal distance". Together, these are the topics in Chapter 2.

Finally, we introduce an adaptation of an algorithm called BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) as a means of making LGA scalable by substantially reducing the size of the solution space. In order to do this, we create subsets of the data that are compact and locally similar, and maintain a list of attributes for each subset such that clustering can be done with the subsets instead of the full data set. As it turns out, this approach can be used for a number of different algorithms that also aim to solve combinatorial problems, and we provide details of these in Chapter 3.

In the remainder of this chapter we provide an introduction to the LGA algorithm, and then provide a general introduction to some feature space methods.

[1] For example, it was shown in Welch (1982) that k-means for three or more clusters is an NP-hard problem.
We then conclude with a review of BIRCH, and the algorithms to which it is applied.

1.1 An Overview of Linear Grouping Analysis

Linear Grouping Analysis (LGA) involves grouping data around hyperplanes, and utilizes orthogonal regression to find hyperplanes that minimize the sum of the squared orthogonal distances of each observation to its representative hyperplane.

First consider the k-means algorithm, which groups the data into k spheres by minimizing the sum-of-squares within clusters. The objective function for this criterion is
$$\sum_{i=1}^{k} \sum_{j=1}^{n} z_{ij}\,(x_j - m_i)^\top (x_j - m_i),$$
where $x_j \in \mathbb{R}^d$, $j = 1,\dots,n$ is our data, $m_i$ ($i = 1,\dots,k$) are the centres of each sphere, and
$$z_{ij} = \begin{cases} 1 & \text{if observation } j \text{ is part of cluster } i \\ 0 & \text{otherwise.} \end{cases}$$
This is the arrangement of the data with the most compact clusters, where the sum-of-squares for each cluster measures the sum of squared distances of each observation in the cluster to its centre. In order to minimize the objective function, k-means has a two-step procedure:

1. Given a clustering of data, for each cluster calculate the sample mean of all the observations currently belonging to the cluster.
2. Calculate the Euclidean distance of each observation to each cluster's centre, and relabel the observation as belonging to the cluster which is closest.

This process is iterated until convergence, and is initialized with a number of different starting configurations. The result with the smallest within-cluster sum-of-squares is selected as the best partition.

Contrast this with LGA, which endeavours to find the partition with associated hyperplanes that minimizes the orthogonal distances of each observation to these hyperplanes. In this context, we parametrize a hyperplane by the vector it is orthogonal to, $a_i$, and a constant $b_i$, and denote this hyperplane $H(a_i, b_i) \equiv H_i$. The squared orthogonal distance of an observation $x_j$ to the hyperplane $H_i$ is given by
$$d_O^2(x_j, H_i) = (a_i^\top x_j - b_i)^2,$$
and so the objective function for LGA is given by
$$\sum_{i=1}^{k} \sum_{j=1}^{n} z_{ij}\, d_O^2(x_j, H_i).$$
This objective function is referred to as the Residual Orthogonal Sum of Squares (ROSS). In order to minimize this objective, an approach like that of k-means is used:

1. Given a clustering of data, find the hyperplane that best fits the data in each cluster.
2. Calculate the orthogonal distance of each observation to each hyperplane, and assign the observation to the hyperplane that is closest.

Once again, this is iterated to convergence, and is started from a number of different initial configurations. To find the parameters for the hyperplane that best fits a subset of data, we use Orthogonal Regression, as given in the following well-known result.

Result 1.1.1 (Orthogonal Regression). Let $y_1,\dots,y_n \in \mathbb{R}^d$ be a set of data with sample covariance matrix $S$. Let the eigenvectors of $S$ be $e_1,\dots,e_d$ with corresponding eigenvalues $\lambda_1 \ge \lambda_2 \ge \dots > \lambda_d \ge 0$. Then the solution to
$$\arg\min_{a,b} \sum_{i=1}^{n} (a^\top y_i - b)^2$$
subject to the constraint $\|a\| = 1$ is given by $a = e_d$ and $b = e_d^\top \bar{y}$.

Proof of 1.1.1. Adding and subtracting $a^\top \bar{y}$ yields
$$\arg\min_{\|a\|=1,\,b} \sum_{i=1}^{n} \left(a^\top y_i - b\right)^2 = \arg\min_{\|a\|=1,\,b}\; a^\top S a + n\left(a^\top \bar{y} - b\right)^2.$$
It is clear that the second term is minimized by setting $b = a^\top \bar{y}$. To find the minimum of the first term,
$$\arg\min_{\|a\|=1} a^\top S a = \arg\min_{\|a\|=1} a^\top E \Lambda E^\top a,$$
where $E$ is a column matrix containing the eigenvectors of $S$, and $\Lambda$ a diagonal matrix with the eigenvalues on the diagonal. Let $w = E^\top a$ and note that $\|w\| = \|a\|$. Then
$$\min_{\|a\|=1} a^\top E \Lambda E^\top a = \min_{\|w\|=1} w^\top \Lambda w = \min_{\|w\|=1} \sum_{i} w_i^2 \lambda_i \ge \min_{\|w\|=1} \lambda_d \sum_{i} w_i^2 = \lambda_d.$$
Finally, by substitution, this lower bound is achieved when $a = e_d$. □

For those readers familiar with Principal Components Analysis, it is clear that orthogonal regression is looking for the subspace of dimension $(d-1)$ that is closest to the data in the $L_2$-sense. This corresponds with the basis formed by the first $(d-1)$ principal components, which are the same as the first $(d-1)$ eigenvectors, and are orthogonal to the last principal component; ergo, the result above.
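As a rough illustration (not part of the lga package), Result 1.1.1 can be transcribed into a few lines of R: the fitted hyperplane is the eigenvector of the sample covariance matrix attached to the smallest eigenvalue, and the intercept is that eigenvector applied to the sample mean. The helper names orth_reg() and orth_dist2() are ours.

# Fit a hyperplane to one cluster by orthogonal regression (Result 1.1.1):
# a = eigenvector of the sample covariance with the smallest eigenvalue,
# b = a' applied to the vector of column means.
orth_reg <- function(y) {
  e <- eigen(cov(y), symmetric = TRUE)
  a <- e$vectors[, ncol(y)]                 # eigenvector of the smallest eigenvalue
  list(a = a, b = drop(crossprod(a, colMeans(y))))
}

# Squared orthogonal distances of the rows of x to the hyperplane H(a, b)
orth_dist2 <- function(x, fit) (x %*% fit$a - fit$b)^2

# Small check on points scattered tightly around the line x2 = 2 x1 + 1
set.seed(1)
u <- runif(50)
y <- cbind(u, 2 * u + 1 + rnorm(50, sd = 0.01))
fit <- orth_reg(y)
sum(orth_dist2(y, fit))    # residual orthogonal sum of squares, close to zero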
This equivalence between PCA and orthogonal regression gives us a useful relationship between the nature of the covariance matrix and its specification of the structure of the data set. Consider a data set in d dimensions, but with an orthonormal basis that has fewer than d vectors describing any meaningful variation in the data; this equates to the number of positive, non-zero eigenvalues being less than d. Clearly in this case, the orthonormal basis described by PCA will not be able to span the full dimension of the data set. This is discussed further in Chapter 2.

LGA is then given by the following algorithm.

Algorithm 1.1.2 (LGA). LGA is made up of the following five steps:

1. Scaling of the variables: each column of the data set is divided by its standard deviation so that it has unit variance.
2. Generation of starting values: a starting configuration is generated consisting of a partition of the data set into k mutually exclusive sets of d observations (i.e. an observation can belong to at most one set). For each of these sets, calculate the hyperplane that fits it (with zero error).
3. Initialization of groups: for each starting hyperplane, calculate the squared orthogonal distance of all observations to that hyperplane. Then assign each point to the hyperplane to which it is closest, and recalculate the hyperplanes based on the new cluster membership.
4. Iterative refinement: repeat step 3 for each resulting hyperplane until either a) the solution converges (i.e. the membership sets remain the same for two consecutive iterations) or b) the solution fails to converge after an arbitrary number of iterations (e.g. 10).
5. Resampling: repeat steps 2-4 a number of times and select the solution with the smallest total orthogonal sum of squares.

The code to do this is implemented in the R package lga (Harrington, 2008).
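A minimal sketch of the refinement loop in steps 3 and 4 of Algorithm 1.1.2, reusing the orth_reg() and orth_dist2() helpers introduced above. This is an illustration only, not the lga implementation: the scaling step, the resampling over many starts, and empty-cluster handling are all omitted.

# Steps 3-4: alternate between assigning each observation to its closest
# hyperplane and refitting each hyperplane by orthogonal regression.
lga_refine <- function(x, fits, max_iter = 10) {
  labels <- rep(0L, nrow(x))
  for (iter in seq_len(max_iter)) {
    d2 <- sapply(fits, function(f) orth_dist2(x, f))  # n x k matrix of squared distances
    new_labels <- max.col(-d2)                        # closest hyperplane per observation
    if (identical(new_labels, labels)) break          # membership unchanged: converged
    labels <- new_labels
    fits <- lapply(seq_along(fits), function(i)       # refit (assumes no cluster is empty)
      orth_reg(x[labels == i, , drop = FALSE]))
  }
  ross <- sum(d2[cbind(seq_len(nrow(x)), labels)])    # residual orthogonal sum of squares
  list(labels = labels, fits = fits, ross = ross)
}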
For completeness, we close this section with a brief mention of two alternatives to LGA that are both applicable in forming clusters that are linear in shape. Regression-based clustering (e.g. Gawrysiak et al. (2001)) is a close relation to LGA insofar as the methodology is broadly built around the same idea:

1. Randomly select starting configurations;
2. Fit a standard linear regression model to the data; and
3. Allocate the observations to the line of smallest residual.

This process is iterated until some criterion is met. The main limitation of this approach is that for many data mining applications there are no response/explanatory types of variables, and an arbitrary selection of these will lead to different solutions.

Model-based clustering (e.g. Banfield and Raftery (1993), Fraley and Raftery (2002)) is a flexible maximum likelihood approach to cluster analysis, and utilizes mixture models to specify the cluster characteristics. Most commonly the underlying distribution is a multivariate normal, but other options are available; for example, points clustered uniformly along a straight line (Banfield and Raftery (1992)).

In the case of the multivariate normal, the solution is found using the EM algorithm, and is implemented in R in the package mclust. An example of mclust versus lga is given in Figure 1.1. In this case, the LGA solution is more consistent with the underlying biological reasoning. However, mclust does form clusters that are elliptical in nature, and perhaps would yield a better solution with some "tweaking" of the covariance structure.

[Figure 1.1: An example of LGA versus Mclust using the olfactory bulbs (ml) vs. brain weight (g) data set, with panels (a) LGA with k = 3 and (b) Mclust with k = 3. The data set is the same as the one used in Section 4.1 of Van Aelst et al. (2006). The number of clusters was determined by Mclust using BIC. For reference, the GAP Statistic (Tibshirani et al., 2001) suggests two clusters for LGA. Key: m = monkeys, i = insectivores, p = primitive primates, c = carnivores, a = apes, h = humans, H = horses.]
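For readers who want to reproduce a comparison like Figure 1.1, the calls look roughly as follows. The lga() interface and the name of its cluster-label component are assumptions based on the package description above (consult the lga documentation); Mclust() is from the mclust package, and X stands for a two-column data matrix such as the brain weight / olfactory bulb measurements.

## Not run: requires the 'lga' and 'mclust' packages and a two-column matrix X
library(lga)
library(mclust)

fit_lga    <- lga(X, k = 3)        # clusters around three hyperplanes (assumed interface)
fit_mclust <- Mclust(X, G = 3)     # Gaussian mixture model with three components

# Cross-tabulate the two partitions; the component name on the lga fit is a guess.
table(fit_lga$cluster, fit_mclust$classification)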
starting point, since theapproximately correct structure of the clusters has already been established. In this case, only afew iterations would be required before convergence. Mathematically, we can describe a ?good?starting point with the following: let C1,... ,Ck be the true partition of the data set and S1,...,Skbe the initial starting configuration. Then a partition can be considered a good starting positioniff Si propersubset Ci universali element {1,... ,k}.Therefore, the more starting configurations used to initialize the algorithm, the higher the chanceof finding a?good?starting point. As it turns out, with LGA it is possible to enumerate the numberof starts required so that the probability of obtaining at least one sample with d points from eachgroup is 95% (i.e. there?s a 95% chance that at least one starting solution has all the observationscorrectly clustered). The probability of this occurring isp =parenleftbiggn1dparenrightbiggparenleftbiggndparenrightbigg ?parenleftbiggn2dparenrightbiggparenleftbiggn -ddparenrightbigg ?? ? ? ?parenleftbiggnkdparenrightbiggparenleftbiggn -d ?(k -1)dparenrightbigg ?k! =producttextki=1parenleftbiggnidparenrightbiggproducttextk-1i=0parenleftbiggn -d ?idparenrightbigg ?k!6Chapter 1. Introduction11 22(a) Bad starting position112(b) Good starting positionFigure 1.2: A demonstration of the importance of starting initialization, with two clusters illustratedby some points and a confidence contour. In both plots we have the same four points; however,on the left the initial partition has points from different clusters together, producing an initialhyperplane that does not reflect the true shape of the clusters. On the right the initial positionsare subsets of the true clusters, and the hyperplanes reflect the correct structure.where ni is the number of elements in cluster i, and n the total number of observations.2 Thereforewe require m samples such that 1 -(1 -p)m = 0.95. Assuming that each of the clusters are equalsized (i.e. ni = ceilingleftn/kceilingright where ceilingleftaceilingright is the ?ceiling? integer part of a), this equates tom =ceilingleftbigg log(0.05)log(1 -p)ceilingrightbigg(1.1)where p simplifies top =parenleftbiggn1dparenrightbiggkproducttextk-1i=0parenleftbiggkn1 -didparenrightbigg ?k!.A plot of equation (1.1) for different values of k and d is given in Figure 1.3.1.2 Extending Linear Grouping AnalysisIn this section we provide some background on the material in Chapter 2, which is concerned withperforming LGA with non-linear patterns. At first glance, this might seem rather contradictory,2Note that this differs from the result provided in Van Aelst et al. (2006).7Chapter 1. Introductiondm1?1021?1061?10101?10142 3 4 5k = 2k = 3k = 4k = 5Figure 1.3: The number of samples required to achieve a 95% probability of a?good?initial partitionwhen n1 = ... = nk = 1 ?106.since by definition the underlying hyperplanes must be ?flat?, and therefore not appropriate fornon-linear patterns. While this is certainly true in Euclidean space, in Chapter 2 we transform thedata so that the clusters are linear, and then perform LGA in this transformed space (called theFeature Space).However, one side effect of transforming the data is that frequently the dimension of the transformedspace is dramatically larger than the original dimension of the data, to the point where there aremore dimensions than observations, which means that the covariance matrix no longer fully specifiesthe structure of the data in all dimensions. 
Rather than reduce the dimension directly via standardalgorithms such as Principal Components Analysis, instead we propose a new metric for when thereare more dimensions than observations.What follows is a brief background on some of the other approaches used in dimension reductionfor clustering, and a brief discussion of why these approaches would not work in the context ofLGA. We then introduce a method for transforming the data, accompanied by an example whereit has also been used.8Chapter 1. Introduction1.2.1 Dimension Reduction and ClusteringDimension reduction via Principal Components Analysis (PCA) is used in many different situations.One example is when d is large, and n substantial, since the reduction in the overall size of adata set is dramatic if the number of dimensions is reduced. Another common example is wherethere are many dimensions, which makes it difficult to visualize the data, and a lower-dimensionalrepresentation is desired.PCA has been discussed previously in clustering literature, though the sentiment regarding itseffectiveness is mixed. Articles that advocate this approach include Jollife et al. (1982), whereprincipal components analysis is applied on a data set to reduce the number of dimensions from 20to 10, and an approach similar to k-means applied to the PC scores. In Jolliffe (2002), a section isdevoted to describing heuristically when it may be advantageous to apply PCA prior to clustering.One useful case is where some columns are similar (in the Euclidean distance sense), and thus areeffectively over-represented with respect to their importance in clustering. Finally, in Ben-Hur andGuyon (2003), applying PCA on DNA microarray data was used with some success.However, there are also articles that demonstrate where PCA does not work. For example, inChang (1983) it was shown that in the case of a mixture of two normals, the first few PCs maycontain less cluster structure information than other PCs. And in Yeung (2001), applying PCAprior to clustering gene expression data was found empirically not to work well.In the context of LGA, it can easily be shown that a dimension reduction approach using PCAis unlikely to work. While dimension reduction would work on clusters that are a priori knownas the true structure of each cluster is then accurate, applying PCA on a whole data set prior toclustering would not take into account the (probably) differing structures, and could even projectdata from different clusters onto the same space. Clearly, therefore, any dimension reduction shouldtake place after a clustering has been proposed.1.2.2 Transforming Data into a Feature SpaceIf the clusters of data do not form hyperplanes, then the objective function associated with LGAis not an appropriate measure of similarity within each cluster. Consider the following data setmade up of two groups of data, given in Figure 1.4(a). Clearly the data is nonlinear in nature, andthe LGA solution does not correctly identify the true groupings. By applying the transformationPhi(x) = (x21, x2), however, the data is then linear in this new space, and LGA then identifies thecorrect grouping.In this case we explicitly specified a transformation, and then applied LGA to the transformeddata. 
An alternative approach is to calculate the pairwise cross products between all transformed observations, and then, through mathematical manipulation, work with these cross products instead. This methodology is referred to as a "Kernel Method", and is the approach taken in Chapter 2.

Let $x \in \mathbb{R}^d$, and consider a nonlinear transformation $\Phi(x)$ such that $\Phi: \mathbb{R}^d \to \mathcal{F}$, which we refer to as the feature space. Using the dot product, called the kernel function and denoted
$$k(x, x') = \Phi(x)^\top \Phi(x'),$$
instead of the transformation directly is commonly called the "kernel trick". The advantage of this approach is that the transformation can then be either explicitly specified, as we did before, or implicitly defined by stating the kernel function directly. In the latter case, while it may not be possible to specify an equivalent function $\Phi$, a result called "Mercer's Condition" gives the necessary and sufficient conditions under which $\Phi$ does exist, and allows the kernel trick to be used.

Some of the more common kernel functions are presented in Table 1.1. It is common to form a matrix of the kernel functions for all observations, with $K \in \mathbb{R}^{n \times n}$ and $(K)_{ij} = k(x_i, x_j)$ for $i, j = 1,\dots,n$. The matrix $K$ is commonly referred to as the Kernel Matrix.

Name of kernel                  $k(x, z)$
Rational Quadratic*             $1 - \dfrac{\|x-z\|^2}{\|x-z\|^2 + \theta}$
Polynomial                      $\left(\gamma\, x^\top z + \theta\right)^d$
Exponential*                    $\exp\!\left(-\dfrac{\|x-z\|}{\theta}\right)$
Radial Basis Function (RBF)*    $\exp\!\left(-\dfrac{\|x-z\|^2}{\theta}\right)$
Wave*                           $\dfrac{\theta}{\|x-z\|}\,\sin\!\left(\dfrac{\|x-z\|}{\theta}\right)$

Table 1.1: Some commonly used kernels. Source: Genton (2001). The kernel functions that are starred are of the type "isotropic", which means the kernel is only a function of the distance $\|x-z\|$.
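Computing the kernel matrix K for any of the kernels in Table 1.1 takes only a line or two; the sketch below does it for the RBF kernel (the helper name and the value of θ are our own choices).

# RBF kernel matrix: (K)_ij = exp(-||x_i - x_j||^2 / theta)
rbf_kernel <- function(X, theta = 1) {
  D2 <- as.matrix(dist(X))^2      # pairwise squared Euclidean distances
  exp(-D2 / theta)
}

K <- rbf_kernel(matrix(rnorm(20), ncol = 2), theta = 0.5)
dim(K)              # 10 x 10
all(diag(K) == 1)   # k(x, x) = 1 for the RBF kernel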
In order to give a simple demonstration of how the kernel trick is applied, we give a simple example from Girolami (2002) (feature space k-means). Recall the k-means objective function, given by
$$\sum_{i=1}^{k}\sum_{j=1}^{n} z_{ij}\,(x_j - m_i)^\top(x_j - m_i),$$
where $m_i$ ($i = 1,\dots,k$) are the centres of each sphere and $z_{ij}$ denotes the membership of observation $j$ in cluster $i$. Ignoring the summation temporarily and expanding the brackets yields
$$x_j^\top x_j - 2\,x_j^\top m_i + m_i^\top m_i,$$
where $n_i = \sum_{j=1}^{n} z_{ij}$ and $m_i = \sum_{j=1}^{n} z_{ij}\,x_j / n_i$ is the sample mean of all the observations in the $i$-th cluster. Clearly
$$x_j^\top m_i = \frac{1}{n_i}\sum_{l} z_{il}\, x_l^\top x_j, \qquad m_i^\top m_i = \frac{1}{n_i^2}\sum_{l}\sum_{m} z_{il} z_{im}\, x_l^\top x_m.$$
Finally, applying the transformation $\Phi$ by substituting $x_i$ with $\Phi(x_i)$, and denoting the centre of the $i$-th cluster in the feature space by $m_i^\Phi$, yields the following representation in the feature space:
$$\sum_{i=1}^{k}\sum_{j=1}^{n} z_{ij}\,(\Phi(x_j) - m_i^\Phi)^\top(\Phi(x_j) - m_i^\Phi) = \sum_{i=1}^{k}\sum_{j=1}^{n} z_{ij}\left\{\Phi(x_j)^\top\Phi(x_j) - \frac{2}{n_i}\sum_{l} z_{il}\,\Phi(x_l)^\top\Phi(x_j) + \frac{1}{n_i^2}\sum_{l}\sum_{m} z_{il} z_{im}\,\Phi(x_l)^\top\Phi(x_m)\right\},$$
and applying the kernel trick
$$= \sum_{i=1}^{k}\sum_{j=1}^{n} z_{ij}\left\{k(x_j, x_j) - \frac{2}{n_i}\sum_{l} z_{il}\,k(x_l, x_j) + \frac{1}{n_i^2}\sum_{l}\sum_{m} z_{il} z_{im}\,k(x_l, x_m)\right\}. \qquad (1.2)$$
In other words, instead of applying the transformation to $x$ and then performing k-means, we can perform k-means in the feature space by optimizing equation (1.2) directly. An approach that is similar to this, but relevant for LGA instead, is taken in Chapter 2.
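Equation (1.2) can be evaluated from nothing more than the kernel matrix K and the membership matrix Z; the sketch below is a direct transcription of its three terms (our own helper, with K supplied, for instance, by the rbf_kernel() function above).

# Feature-space k-means objective (equation (1.2)); Z is a k x n matrix of 0/1
# memberships with one row per cluster, K the n x n kernel matrix.
kernel_kmeans_obj <- function(K, Z) {
  total <- 0
  for (i in seq_len(nrow(Z))) {
    z   <- Z[i, ]
    n_i <- sum(z)
    Kz  <- as.vector(K %*% z)
    total <- total + sum(z * (diag(K)                   # k(x_j, x_j)
                              - 2 * Kz / n_i            # -(2/n_i) sum_l z_il k(x_l, x_j)
                              + sum(z * Kz) / n_i^2))   # (1/n_i^2) sum_lm z_il z_im k(x_l, x_m)
  }
  total
}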
In this section, we give some motivation for BIRCH,including alternative implementations of this algorithm, and then review other, similar, algorithms.We conclude with a review of the robust methods and clustering algorithms that are extended inChapter 3.Interested readers should refer to Appendix B for a description of the implementation of BIRCHin R, which has been submitted to the Journal of Statistical Software.1.3.1 MotivationOne approach for solving non-convex optimization problems, such as LGA, is to explore the solutionspace in order to find the configuration that yields the global minimum of a given objective function.We do this by starting at a number of different subsets of the data set which allows the structureof the data to suggest where to go. This type of local search is more effective than just performingrandom searches over the solution space, and is the approach used in Section 1.1 for optimising theLGA objective function.However, as the number of dimensions and/or the number of observations increases, then clearly thesolution space is increasing in size, and so too the number of subsets required in order to adequatelycover this space. The algorithm BIRCH is an effective method of reducing the size of the solutionspace, making optimization much easier.BIRCH works by pre-processing a data set and grouping observations together into locally com-pact, similar subclusters. Associated with each subcluster is a vector of summary statistics, calledclustering features (CFs). In practise, these clustering features are formed using an algorithm that,heuristically, examines each observation of a data set in isolation, finds the closest subcluster, andadds it to the subcluster, updating the CFs in the process. If the observation is not close to any12Chapter 1. Introductionsubcluster, then it forms it own subcluster.It can be shown that for a certain class of algorithm, an approximately equivalent optimizationover an objective function is possible using the CFs instead of the underlying data. This is veryadvantageous, since the solution space with respect to subclusters is several orders of magnitudesmaller than that of the full data set. In this way, we can optimize more efficiently.The nature of the approximation in the process lies in the fact that we are using subclusters,rather than the underlying observations, when selecting subsets of the data. For example, considerOrthogonal Regression, given in Result 1.1.1, and assume that we have the following tuple as ourCF:CFJ =parenlefttpparenleftbtsummationdisplayielementLJ1,summationdisplayielementLJXi,summationdisplayielementLJXiXlatticetopiparenrighttpparenrightbt = (nJ,LSJ,SSJ);i.e. the number of observations, the sum of those observations, and the sum of squares of thoseobservations.Recall, that the orthogonal regression solution is given by the eigenvectors of the covariance matrixand the mean of the underlying data. It can be shown that these statistics can be calculated fromthe CFs directly ? i.e. the calculation of the orthogonal regression using the CFs is identical to thecalculation using the underlying observations making up those subclusters. In this sense, the CFscan be considered sufficient statistics for calculating orthogonal regression.Now consider the LGA algorithm, which utilizes orthogonal regression for calculating distances.In particular, LGA is trying to find the clusters of data with the minimum residual orthogonalsum of squares to their closest hyperplane. 
Now consider the equivalent with BIRCH subclusters,where LGA attempts to find the partition of subclusters that minimize the ROSS. As demonstratedabove, we can calculate the ROSS exactly, but if the ?true? partition of the data requires splittingobservations within one subcluster, then the global minimum (based on the underlying observations)cannot be achieved.Clearly, as the number of subclusters increases - equivalently, the number of observations in eachsubcluster decreases - then the need to split subclusters in order to find the global minimum is lessof an issue as the subclusters are more ?granular?. However, a direct side effect of the increasednumber of subclusters is the increased size of the solution space, which means there is a trade-offbetween accuracy and optimization efficiency. This is discussed at length in Chapter 3.Finally, one additional benefit of this approach involves the practical consideration of the limitationsof the statistical software, R (R Development Core Team, 2007). R requires that a data set be heldin memory for an analysis, which may not be possible if the data set is very large. As mentionedbefore, BIRCH only pre-processes observations one at a time, and produces an object much smallerthan the original data set. Therefore, as long as the BIRCH object can fit into memory (whichis always the case as long as the parameters for building the object are set correctly), it has nolimitation beyond that of the operating system for the size of data sets it can deal with.1.3.2 Literature ReviewThe algorithm BIRCH was first introduced in Zhang et al. (1997) in a Computer Science subjectarea journal, and focused mostly on operational details such as a self-sizing mechanism for fittingwithin a certain sized memory. In terms of clustering, this article used a hierarchical clustering13Chapter 1. Introductionparadigm on the subclusters for providing good quality initialization to a further algorithm, andbriefly discusses other clustering methods such as CLARANS (Ng and Han, 2002) and k-means.A direct extension of BIRCH is given in Chiu et al. (2001), which makes BIRCH applicable to dataof mixed type attributes. This is done through a new clustering feature, given byCFJ = (nJ,LSJ,SSJ,NBJ)where the additional element, NBJ, contains a table for each categorical attribute, and a summaryof counts for each categorical attribute in subcluster J.In the broader family of algorithms that are applicable to large data sets, there are many examplesthat are variations of BIRCH. Two of these are STING (Wang et al., 1998) and CURE (Guha et al.,2001). The STING (Statistical Information Grid-based) algorithm divides the spatial area intorectangular cells, and employs a hierarchical structure such that differing levels of cells correspondwith different resolutions of the data. Summary statistics for each cell are maintained, and areused much in the same way as the CFs in BIRCH. However, the actual contents of the CFs weredifferent, and more orientated towards SQL-type queries.In CURE (Clustering Using Representatives), a hierarchical clustering algorithm is also used. How-ever, its process of forming the hierarchy is somewhat different. CURE begins by drawing a randomsample of points out of the data set, which then become representative points. These points arethen ?shrunk? by a fraction alpha towards the centroid of the set of data, and then the closest pointsare merged together. 
1.3.2 Literature Review

The algorithm BIRCH was first introduced in Zhang et al. (1997) in a Computer Science subject area journal, and focused mostly on operational details such as a self-sizing mechanism for fitting within a certain sized memory. In terms of clustering, this article used a hierarchical clustering paradigm on the subclusters for providing good quality initialization to a further algorithm, and briefly discusses other clustering methods such as CLARANS (Ng and Han, 2002) and k-means.

A direct extension of BIRCH is given in Chiu et al. (2001), which makes BIRCH applicable to data with mixed type attributes. This is done through a new clustering feature, given by
$$\mathrm{CF}_J = (n_J, \mathrm{LS}_J, \mathrm{SS}_J, \mathrm{NB}_J),$$
where the additional element, $\mathrm{NB}_J$, contains a table for each categorical attribute, and a summary of counts for each categorical attribute in subcluster J.

In the broader family of algorithms that are applicable to large data sets, there are many examples that are variations of BIRCH. Two of these are STING (Wang et al., 1998) and CURE (Guha et al., 2001). The STING (Statistical Information Grid-based) algorithm divides the spatial area into rectangular cells, and employs a hierarchical structure such that differing levels of cells correspond with different resolutions of the data. Summary statistics for each cell are maintained, and are used much in the same way as the CFs in BIRCH. However, the actual contents of the CFs are different, and more oriented towards SQL-type queries.

In CURE (Clustering Using Representatives), a hierarchical clustering algorithm is also used. However, its process of forming the hierarchy is somewhat different. CURE begins by drawing a random sample of points out of the data set, which then become representative points. These points are then "shrunk" by a fraction $\alpha$ towards the centroid of the set of data, and then the closest points are merged together. This process is repeated, yielding at the end multiple representative points for each cluster.

Finally, BIRCH is discussed in the context of real-time data in Guha et al. (2003), as an algorithm that is capable of receiving data in increments without needing to hold the "old" data in memory. This feature is referred to as a data stream, and the article goes on to describe various characteristics that define this type of data.

1.3.3 Algorithms Extended

The applicability of the BIRCH algorithm is defined by a) the specification of clustering features that can be updated by data arriving singly, and b) the ability of an underlying objective function to be specified in terms of the clustering features instead of the underlying observations. If these two criteria are met, then BIRCH can be used to reduce the solution space of the combinatorial problem, and enable analysis of data sets that would otherwise be too large to fit in memory.

Clearly a number of combinatorial algorithms could be adapted for use with BIRCH. In Chapter 3, we demonstrate this approach with the following algorithms: Least Trimmed Squares (LTS, Rousseeuw and Leroy, 1987), Minimum Covariance Determinant (MCD, ibid) and Robust Linear Grouping Analysis (RLGA, García-Escudero et al., 2007).

LTS is a robust regression estimator, and is concerned with finding the subset of $h = \alpha n$, $\alpha \in (0.5, 1]$ observations that minimizes the sum of squared residuals
$$\sum_{i=1}^{h} r^2_{(i)}, \qquad (1.3)$$
where $r_i^2 = (y_i - x_i^\top \hat\beta)^2$ and $r^2_{(i)}$ denotes the $i$-th order statistic of these squared residuals. The mechanism for optimizing this equation is called fast-LTS, was first introduced in Rousseeuw and van Driessen (2006), and is given by the following algorithm:

Algorithm 1.3.1 (fast-LTS). For a number of different, randomly selected subsamples of size d:

1. Use the random subsample to compute the ordinary least squares (OLS) regression estimator $\hat\beta$.
2. Iterate the following steps until convergence, or until a specified number of iterations has occurred:
   (a) Let $H \leftarrow \{$the h smallest squared residuals, $r^2 = (Y - X\hat\beta)^\top(Y - X\hat\beta)\}$.
   (b) Calculate the OLS estimator $\hat\beta$ based on the observations in H.
   These steps are commonly referred to as "concentration steps".

Select the solution with the smallest LTS objective, as given by equation (1.3).

MCD is a robust estimator of location and dispersion, and is concerned with finding the subset of $h = \alpha n$, $\alpha \in (0.5, 1]$ observations whose sample covariance has the smallest determinant, as given by
$$\arg\min_{J \in \mathcal{H}} |S_J|, \qquad (1.4)$$
where $\mathcal{H}$ is the solution space of all subsets of size h,
$$S_J = \frac{1}{h-1}\sum_{i \in J}(X_i - \hat\mu_J)(X_i - \hat\mu_J)^\top \quad \text{and} \quad \hat\mu_J = \frac{1}{h}\sum_{i \in J} X_i.$$
The algorithm for minimizing this objective function is called fast-MCD (Rousseeuw and van Driessen, 1999), and is given by the following algorithm.

Algorithm 1.3.2 (fast-MCD). For a number of different, randomly selected subsamples of size (d + 1):

1. Use the initial subsample to calculate an initial estimate of the sample mean and covariance.
2. Iterate the following steps until convergence, or until a specified number of iterations has occurred:
   (a) Let $H \leftarrow \{$the h points with smallest Mahalanobis distances $(X - \hat\mu)^\top S^{-1}(X - \hat\mu)\}$.
   (b) Calculate the new $\hat\mu$, $S$ based on the observations in H.
   These steps are commonly referred to as "concentration steps".

Select the solution with the smallest MCD objective, as given by equation (1.4).

Clearly, both LTS and MCD have much in common in their approach to optimizing an objective function.
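The concentration steps shared by Algorithms 1.3.1 and 1.3.2 are only a few lines each; the sketch below shows the LTS version in its plain, in-memory form, without the nested subsampling described next or the BIRCH adaptation of Chapter 3 (the helper name is ours, and X is assumed to already contain an intercept column if one is wanted).

# Concentration steps for LTS: starting from a candidate coefficient vector,
# repeatedly keep the h observations with the smallest squared residuals and
# refit ordinary least squares on that subset.
lts_concentrate <- function(X, y, beta0, h, max_iter = 50) {
  beta  <- beta0
  H_old <- integer(0)
  for (iter in seq_len(max_iter)) {
    r2 <- as.vector((y - X %*% beta)^2)
    H  <- order(r2)[1:h]                      # indices of the h smallest residuals
    if (setequal(H, H_old)) break             # subset unchanged: converged
    beta  <- lm.fit(X[H, , drop = FALSE], y[H])$coefficients
    H_old <- H
  }
  r2 <- as.vector((y - X %*% beta)^2)
  list(beta = beta, subset = sort(order(r2)[1:h]), lts = sum(sort(r2)[1:h]))
}
# In fast-LTS, beta0 comes from an OLS fit to a random subsample of size d,
# and the whole procedure is repeated from many such starts.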
As it turns out, these algorithms are also both amended in the same fashion if the data set is large. For large n, a nested approach to optimization is used, whereby a number of subsamples of size $n_s \ll n$ are taken, and two concentration steps are performed from a variety of random starts within each subsample. For each subsample the best 10 solutions (with respect to the objective function) are stored. Then the subsamples are pooled together, and the best solutions from the previous stage are used to initialize a number of concentration steps. Once again, the best 10 solutions are retained, and used to seed concentration steps on the whole data set.

The last algorithm extended for use with BIRCH is Robust LGA (RLGA) (García-Escudero et al., 2007), which is the robust equivalent of LGA. The goal of RLGA is to find the $h = \alpha n$ subset with the minimum residual orthogonal sum-of-squares to the respective hyperplanes. Its algorithm is very similar to that given for LGA in Algorithm 1.1.2, with the following amendments:

3a. Select the $\alpha n$ subset: order the orthogonal residual of each observation to its closest hyperplane. Select the smallest $\alpha n$ observations, and use just those observations to recalculate the hyperplanes using orthogonal regression.

5. Resampling: repeat steps 2-4 a number of times and select the solution with the smallest trimmed orthogonal sum of squares.

This algorithm is implemented in the R package lga.
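The only change RLGA makes to the refinement loop of Algorithm 1.1.2 is step 3a: after assigning observations to their closest hyperplanes, only the αn observations with the smallest orthogonal residuals are used to refit. A sketch of that single step, reusing the earlier illustrative helpers (this is not the lga package internals):

# One trimmed refit: keep the alpha*n observations closest to their own
# hyperplane, then recompute each hyperplane from the retained points only.
rlga_trimmed_refit <- function(x, fits, labels, alpha = 0.9) {
  d2   <- sapply(fits, function(f) orth_dist2(x, f))
  r2   <- d2[cbind(seq_len(nrow(x)), labels)]         # residual to own hyperplane
  keep <- r2 <= sort(r2)[ceiling(alpha * nrow(x))]    # smallest alpha*n residuals
  lapply(seq_along(fits), function(i)
    orth_reg(x[keep & labels == i, , drop = FALSE]))
}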
Bibliography

J. D. Banfield and A. E. Raftery. Ice flow identification in satellite images using mathematical morphology and clustering about principal curves. Journal of the American Statistical Association, 87(417):7-16, 1992.
J. D. Banfield and A. E. Raftery. Model-based Gaussian and non-Gaussian clustering. Biometrics, 49(3):803-821, 1993.
A. Ben-Hur and I. Guyon. Detecting Stable Clusters Using Principal Component Analysis, pages 159-182. Methods in Molecular Biology. Humana Press, Mar. 2003.
A. Ben-Hur, D. Horn, H. T. Siegelmann, and V. Vapnik. Support vector clustering. Journal of Machine Learning Research, 2:125-137, 2001.
W. C. Chang. On using principal components before separating a mixture of two multivariate normal distributions. Applied Statistics, 32(3):267-275, 1983.
T. Chiu, D. Fang, J. Chen, Y. Wang, and C. Jeris. A robust and scalable clustering algorithm for mixed type attributes in large database environment. Proceedings of the seventh ACM SIGKDD, 2001.
C. Fraley and A. E. Raftery. Model-based clustering, discriminant analysis, and density estimation. Journal of the American Statistical Association, 97:611-631, 2002.
L. García-Escudero, A. Gordaliza, R. San Martín, S. Van Aelst, and R. H. Zamar. Robust linear clustering. Unpublished manuscript, 2007.
P. Gawrysiak, H. Rybinski, and M. Okoniewski. Regression - yet another clustering method. In International Intelligent Information Systems Conference, Advances in Soft Computing Journal. Springer-Verlag, 2001.
M. G. Genton. Classes of kernels for machine learning: A statistics perspective. Journal of Machine Learning Research, pages 299-312, 2001.
M. Girolami. Mercer kernel based clustering in feature space. IEEE Transactions on Neural Networks, 13(3):780-784, 2002.
S. Guha, R. Rastogi, and K. Shim. CURE: An efficient clustering algorithm for large databases. Information Systems, 26(1):35-58, 2001.
S. Guha, A. Meyerson, N. Mishra, R. Motwani, and L. O'Callaghan. Clustering data streams: Theory and practice. IEEE Transactions on Knowledge and Data Engineering, 15(3):515-528, May/June 2003.
J. Harrington. lga: Tools for linear grouping analysis (LGA), 2008. R package version 1.1-0, available on CRAN.
I. Jollife, B. Jones, and B. Morgan. Utilising clusters: A case-study involving the elderly. Journal of the Royal Statistical Society, Series A, 145(2):224-236, 1982.
I. Jolliffe. Principal Component Analysis. Springer, 2002.
A. Y. Ng, M. I. Jordan, and Y. Weiss. On spectral clustering: Analysis and an algorithm. Advances in Neural Information Processing Systems, 14, 2001.
R. T. Ng and J. Han. CLARANS: A method for clustering objects for spatial data mining. IEEE Transactions on Knowledge and Data Engineering, 14(5):1003-1016, 2002.
R Development Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2007. URL http://www.R-project.org. ISBN 3-900051-07-0.
P. Rousseeuw and A. Leroy. Robust Regression and Outlier Detection. Wiley, 1987.
P. J. Rousseeuw and K. van Driessen. Computing LTS regression for large data sets. Data Mining and Knowledge Discovery, 12:29-45, 2006.
P. J. Rousseeuw and K. van Driessen. A fast algorithm for the minimum covariance determinant estimator. Technometrics, 41(3):212-223, August 1999.
R. Tibshirani, G. Walther, and T. Hastie. Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society, Series B, 63(2):411-423, 2001.
S. Van Aelst, X. Wang, R. H. Zamar, and R. Zhu. Linear grouping using orthogonal regression. Computational Statistics & Data Analysis, 50(5):1287-1312, Mar. 2006.
W. Wang, J. Yang, and R. Muntz. STING: A statistical information grid approach to spatial data mining. In Proc. 24th Int. Conf. Very Large Data Bases, 1998.
W. J. Welch. Algorithmic complexity: three NP-hard problems in computational statistics. Journal of Statistical Computation and Simulation, 15(1):17-25, 1982.
K. Yeung. Principal component analysis for clustering gene expression data. Bioinformatics, 17(9):763-774, 2001.
T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: A New Data Clustering Algorithm and Its Applications, volume 1 of Data Mining and Knowledge Discovery, pages 141-182. Springer Netherlands, June 1997.

Chapter 2
Extending Linear Grouping Analysis*

* A version of this chapter will be submitted for publication: Harrington, J. A. and Salibian-Barrera, M., Extending Linear Grouping Analysis. Full proofs for every result, lemma and theorem are given in Appendix A.

2.1 Introduction

Linear Grouping Analysis (LGA, Van Aelst et al., 2006) involves grouping data around hyperplanes, and utilizes orthogonal regression to select hyperplanes that minimize the sum of the squared orthogonal distances of each observation to its representative hyperplane. This has the advantage that it treats all variables in a symmetric way, and does not require the identification of "response" or "explanatory" variables (Zamar, 1989).

In this article we briefly review LGA, and then introduce two new methods for extending its applicability. The first method introduces a variant of orthogonal regression, called partially orthogonal regression, which is motivated by high dimensional data which can produce clusters with covariance matrices that are rank deficient. The second method relaxes the linearity requirement of LGA by utilizing a machine learning technique of transforming the data set into a feature space and then performing LGA in this space. We then conclude with some examples, and a brief discussion including promising directions for future research.

2.2 Linear Grouping Criterion

The goal of LGA is to find k hyperplanes, parametrized by $w_i \in \mathbb{R}^d$ and $b_i \in \mathbb{R}$ and denoted $H(w_i, b_i) \equiv H_i$ ($i = 1,\dots,k$) such that $H_i = \{z : w_i^\top z = b_i\}$, with the smallest residual orthogonal sum of squares (ROSS).
a membership matrix $Z$ with binary elements $(Z)_{ij} = z_{ij} \in \{0,1\}$, $i = 1, \dots, k$, $j = 1, \dots, n$, where a "1" means that observation $j$ is a member of cluster $i$, and a set of hyperplanes $\mathcal{H} = \{H_1, \dots, H_k\}$, the objective function for LGA is given by
\[
R(x_1, \dots, x_n \mid \mathcal{H}, Z) = R(X \mid \mathcal{H}, Z) = \sum_{i=1}^{k} \sum_{j=1}^{n} z_{ij}\, d_O^2(x_j, H_i), \qquad (2.1)
\]
where
\[
d_O^2(x_j, H_i) = (w_i^\top x_j - b_i)^2
\]
is the squared orthogonal distance of $x_j$ to the hyperplane $H_i$, and $X$ denotes our set of observations.

To optimize equation (2.1) with respect to the set of hyperplanes $\mathcal{H}$ and the membership matrix $Z$, Van Aelst et al. (2006) utilize a two-step approach. For an initial partition given by membership matrix $Z^0$, do the following:

Step I: For each cluster, as defined by the membership matrix $Z^j$, calculate the best fitting hyperplane, i.e.
\[
\mathcal{H}^j = \arg\min_{\mathcal{H}} R(X \mid \mathcal{H}, Z^j).
\]

Step II: Then, given the hyperplane for each cluster, calculate the distance of every observation to each hyperplane, and assign each observation to the closest. This is used to update the membership matrix, i.e.
\[
Z^{j+1} = \arg\min_{Z} R(X \mid \mathcal{H}^j, Z).
\]

This procedure continues until convergence (equivalently, until $Z^{j+1} = Z^j$) or until $j = N$, where $N$ is a user-specified maximum number of iterations. Note that Step I can be decomposed as
\[
\min_{\mathcal{H}} R(X \mid \mathcal{H}, Z^j) = \min_{\mathcal{H}} \left( \sum_{i=1}^{k} R(x^{(i)} \mid H_i) \right),
\]
where $x^{(i)}$ are the observations in partition $i$ (as determined by $Z^j$), and these partitions are mutually exclusive sets. It follows that
\[
\min_{\mathcal{H}} R(X \mid \mathcal{H}, Z^j) = \sum_{i=1}^{k} \min_{H_i} R(x^{(i)} \mid H_i).
\]
This means that to find the set of hyperplanes that minimizes Step I, it is sufficient to find the hyperplane that minimizes the objective function within each cluster. In order to find the best fitting hyperplane in Step I, we use the following well known result.

Result 2.2.1 (Orthogonal Regression) Let $x_1, \dots, x_n \in \mathbb{R}^d$ have sample covariance matrix $S$. Let the eigenvectors of $S$ be $e_1, \dots, e_d$ with corresponding eigenvalues $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_{d-1} > \lambda_d \ge 0$ and $\mathrm{Rank}(S) \ge (d-1)$. The solution to
\[
\min_{a, b} \sum_{i=1}^{n} (a^\top x_i - b)^2
\]
subject to the constraint $\|a\| = 1$ is given by
\[
a = e_d \quad \text{and} \quad b = e_d^\top \bar{x},
\]
where $\bar{x}$ denotes the sample mean of $x_1, \dots, x_n$.

Note that in this definition the restrictions are given with respect to the smallest eigenvalues, as it is explicitly required that the smallest eigenvalue be unique (i.e. have multiplicity one). This is because if more than one of the smallest eigenvalues are equal, then the eigenvectors associated with these eigenvalues are no longer unique. In the context of orthogonal regression this situation is most likely to occur when there are fewer observations in a cluster than dimensions (more specifically, $n < d$), or if one column is linearly dependent on the others, as this would result in multiple zero-valued eigenvalues and a covariance matrix of rank less than $(d-1)$.
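As an illustration of Result 2.2.1, the best fitting hyperplane for a single cluster, and the squared orthogonal distances to it, can be computed in a few lines of R. This is a sketch written for this summary, not code from the lga package; the function names are chosen only for the example.

```r
## Orthogonal regression for one cluster (Result 2.2.1): the hyperplane's
## normal vector is the eigenvector of the sample covariance matrix with the
## smallest eigenvalue, and b is its inner product with the sample mean.
orth_regression <- function(x) {          # x: n x d matrix of observations
  S   <- cov(x)
  eig <- eigen(S, symmetric = TRUE)       # eigenvalues in decreasing order
  a   <- eig$vectors[, ncol(x)]           # e_d, the last eigenvector
  b   <- sum(a * colMeans(x))             # b = e_d' xbar
  list(a = a, b = b)
}

## Squared orthogonal distance of every row of x to the hyperplane H(a, b),
## used in Step II to reassign observations to their closest hyperplane.
orth_dist2 <- function(x, fit) (x %*% fit$a - fit$b)^2
```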
[Figure 2.1 here: scatterplot of log10(Brain Weight) against log10(Body Weight).]
Figure 2.1: An example of LGA applied to the allometry data set from Crile and Quiring (1940), with k = 4. The different colours and symbols correspond to the clusters found, and the lines are the hyperplanes associated with each cluster.

It is clear that orthogonal regression is very closely related to principal components analysis, which looks for the direction of maximal variance, given by the eigenvector associated with the largest eigenvalue. As it turns out, the solution is essentially the hyperplane formed by the basis of the first $(d-1)$ principal components, which is the hyperplane orthogonal to the last principal component.

The LGA algorithm is initialized from a number of different random starting partitions to facilitate the search for a global minimum. These are given by hyperplanes determined by $k$ random subsamples of size $d$. The number of starts is selected such that the probability of a "good" starting configuration is 95%, where a good start is defined as a configuration that is a subset of the actual partitioning. The logic behind this approach is that, given a starting configuration with hyperplanes resembling the "true" solution, Step II will then produce the correct membership matrix. For example, consider clustering $n$ observations into $k$ clusters, with the correct partition given by $C_1, \dots, C_k$. Then a good starting configuration would comprise $I_1, \dots, I_k$ where $|I_j| = d$ and $I_j \subset C_j$ for all $j \in \{1, \dots, k\}$.

Example. We reproduce an example from Van Aelst et al. (2006) using an allometry data set obtained from Crile and Quiring (1940). This data set studies the relationship between the brain and body weight of $n = 282$ vertebrates, and a logarithmic (base 10) transformation has been applied to each column. Setting $k = 4$, a plot of the clusters is given in Figure 2.1. The R package lga was used to find the clusters.

2.3 Partially Orthogonal Regression

In the discussion associated with Result 2.2.1 above, the importance of the multiplicity of the smallest eigenvalue of the sample covariance matrix was highlighted. One situation in which repeated eigenvalues occur is when the rank of the covariance matrix for a cluster is less than $(d-1)$, which is often caused by small clusters and/or high dimensions. This is a situation that occurs repeatedly in Section 2.4, where we apply transformations that increase the dimension considerably.

One approach for dealing with clustering when the rank of the sample covariance matrix for any cluster is less than $(d-1)$ would be via dimension reduction. Methods of dimension reduction have been discussed previously in the clustering literature. For example, in Jollife et al. (1982) principal components analysis is applied to a data set to reduce the number of dimensions from 20 to 10, and an approach similar to k-means is applied to the PC scores. In Jolliffe (2002), a section is devoted to describing heuristically when it may be advantageous to apply PCA prior to clustering. However, in Chang (1983), it was shown that, at least in the case of a mixture of two normals, the first few PCs may contain less cluster structure information than other PCs. And in Yeung (2001), applying PCA prior to clustering gene expression data was found not to work well empirically.

In the context of LGA, it can easily be shown that a dimension reduction approach using PCA is unlikely to work.
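A small simulated illustration of this point (constructed for this summary, not an example from the thesis): two well separated linear clusters with different orientations are pooled and reduced to one principal component, after which the two distinct linear structures can no longer be recovered from the scores.

```r
## Two linear clusters with different orientations.
set.seed(1)
c1 <- cbind(x = runif(100, -1, 1),  y = 0.05 * rnorm(100))  # lies along the x-axis
c2 <- cbind(x = 0.05 * rnorm(100),  y = runif(100, -1, 1))  # lies along the y-axis
pooled <- rbind(c1, c2)

## PCA on the pooled data ignores the cluster structure: whichever direction
## the leading component picks, projecting onto it gives heavily overlapping
## one-dimensional scores, and the direction that defines at least one of the
## clusters is discarded.
pc <- prcomp(pooled)
range(predict(pc, c1)[, 1])
range(predict(pc, c2)[, 1])
```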
While dimension reduction would work on clusters whose membership is given a priori, since the actual structure of each cluster is then known, applying PCA to the whole data set prior to clustering would not take into account the (probably) differing structures and, at worst, would project data from different clusters onto the same space. Clearly, any dimension reduction should take place after a partition has been proposed.

Unfortunately, since LGA relies heavily on the use of eigenvectors via orthogonal regression, we cannot simply ignore the dimension problems that result from the covariance matrix of any cluster having rank less than $(d-1)$: the last eigenvector is not unique, and this is the vector used, as per Result 2.2.1, for the calculation of the orthogonal distance. Therefore, in this section we introduce a new measure of distance, the Partially Orthogonal Distance (POD), as an alternative to orthogonal regression for this case.

Definition 2.3.1 Let $C \in \mathbb{R}^{d \times d}$ be a sample covariance matrix of rank $\nu$, with $1 < \nu < (d-1)$, and let $(e_i, \lambda_i)$, $i = 1, \dots, d$, be the eigenvector/eigenvalue pairs such that $\lambda_1 \ge \cdots \ge \lambda_\nu > \lambda_{\nu+1} = \cdots = \lambda_d = 0$. Let $\mu \in \mathbb{R}^d$ be a fixed point and let $E = (e_1 \mid \cdots \mid e_\nu)$ be the basis of a hyperplane $H$ going through $\mu$, consisting of the eigenvectors of $C$ with positive eigenvalues.

Then the Partially Orthogonal Distance (POD) of a point $y \in \mathbb{R}^d$ to $H$ is given by
\[
d_P^2(y; E, \mu) = z^\top \left( P_1 + P_2/\lambda_\nu \right) z, \qquad (2.2)
\]
where $z = y - \mu$, $P_1 = I - E E^\top$, $P_2 = e_\nu e_\nu^\top$, and
\[
E = \begin{pmatrix} | & | & & | \\ e_1 & e_2 & \cdots & e_\nu \\ | & | & & | \end{pmatrix}.
\]

It can easily be shown that equation (2.2) can be broken into two separate distances:
\[
d_P^2(y; E, \mu) = d_E^2(y, p) + d_S^2(p, \tilde{y}), \qquad (2.3)
\]
where $d_E^2(y, p) = z^\top P_1 z$ is the squared Euclidean distance of $y$ to its projection $p$ onto the basis $E$ going through $\mu$, and $d_S^2(p, \tilde{y}) = z^\top P_2 z / \lambda_\nu$ is a scaled distance of $y$ to the hyperplane orthogonal to the vector $e_\nu$ and going through $\mu$ (the point $\tilde{y}$ is defined in the discussion below).
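A minimal R sketch of Definition 2.3.1 (written for this summary): given the eigendecomposition of a cluster's sample covariance matrix and the number $\nu$ of eigenvectors retained, the partially orthogonal distance of a new point follows directly from equation (2.2). The function name is chosen only for the example.

```r
## Partially orthogonal distance (equation (2.2)) of a point y to the
## hyperplane spanned by the first nu eigenvectors of C, through the centre mu.
pod2 <- function(y, mu, C, nu) {
  eig <- eigen(C, symmetric = TRUE)        # eigenvalues in decreasing order
  E   <- eig$vectors[, seq_len(nu), drop = FALSE]
  e_nu      <- E[, nu]
  lambda_nu <- eig$values[nu]              # smallest retained (positive) eigenvalue
  z  <- y - mu
  P1 <- diag(length(y)) - E %*% t(E)       # I - E E'
  P2 <- e_nu %*% t(e_nu)                   # e_nu e_nu'
  drop(t(z) %*% (P1 + P2 / lambda_nu) %*% z)
}
```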
In order to motivate why this approach is appropriate when there are fewer observations than dimensions in any cluster, we give an illustration in $\mathbb{R}^3$, where only one eigenvalue is zero. Although orthogonal regression could still be applied here, we use this simple illustration to give an intuition for the approach without needing to imagine higher dimensions, where any intuition might be lost. Our example is sketched in Figure 2.2.

[Figure 2.2 here: sketch of Clusters I and II, with centres $\mu_1$ and $\mu_2$, eigenvector axes $e_1$, $e_2$, $e_3$, and a new observation $y$.]
Figure 2.2: Introducing Partially Orthogonal Regression. We have two clusters, centred at $\mu_1$ and $\mu_2$, with identical covariance matrix $C$ with eigenvalues $\lambda_1 > \lambda_2 > \lambda_3 = 0$. Note that the axes are the eigenvectors of $C$, and these axes are shown inside the clusters to emphasize their structure. An observation which we are trying to cluster is given by $y$, and it is not part of either cluster.

We have two clusters (I and II) with centres $\mu_1, \mu_2 \in \mathbb{R}^3$, and both clusters have covariance matrix $C \in \mathbb{R}^{3 \times 3}$ with eigenvalues $\lambda_1 > \lambda_2 > \lambda_3 = 0$. A situation where this might occur is where each of the clusters has been built with only three observations, as indicated in the figure. Now let us consider a new observation, $y \in \mathbb{R}^3$, which is not part of either cluster, and ponder the question of which cluster it should belong to, keeping in mind that this approach needs to generalize to the case where more than one eigenvalue is equal to zero (i.e. $\lambda_3 = 0$ has multiplicity greater than one).

[Figure 2.3 here: the $e_2$–$e_3$ plane, showing $y$, its projection $p$, and the distance $d_E^2(y, p)$.]
Figure 2.3: Looking at the $e_2$–$e_3$ plane. Note that the thickness of the clusters in the direction of $e_3$ is for illustrative purposes only.

Two new alternatives for measuring the distance of $y$ to each cluster are:

Distance I: the orthogonal distance of $y$ to the hyperplane given by the basis made up of the eigenvectors with positive eigenvalues; or

Distance II: the orthogonal distance of $y$ to the hyperplane orthogonal to the eigenvector with the smallest positive eigenvalue.

Next we shall briefly discuss each of the distances above, show that these distances correspond with the elements identified in equation (2.3), and finally justify why the inclusion of both distances is logical.

Distance I: the orthogonal distance of $y$ to the hyperplane given by the basis made up of the eigenvectors with positive eigenvalues. In our illustration, the basis with positive eigenvalues is made up of $(e_1, e_2)$, and it is orthogonal to $e_3$. Therefore, in this situation the orthogonal distance to a hyperplane defined by this basis is equivalent to the orthogonal distance as it would be calculated using Result 2.2.1 (standard orthogonal regression). This distance is illustrated in Figure 2.3, and is denoted $d_E^2(y, p)$ in equation (2.3).

However, there are two problems with using orthogonal regression directly. Firstly, since our desired application includes the situation where the multiplicity of the smallest eigenvalue is greater than one, the smallest eigenvector may not be unique, and therefore the vector $e_3$ cannot be used for calculation. Thus, instead of calculating distances using orthogonal vectors (as in Result 2.2.1), we use the projection of $y$ onto the basis for which the eigenvectors are well defined (in our example, $e_1$ and $e_2$), and then calculate the distance of our observation to this projection.

Furthermore, with this distance we are measuring how close our observation is to the hyperplane orthogonal to $e_3$ (which extends to infinity in the directions $e_1$ and $e_2$), without considering how close $y$ is to the linear structure shown in Figure 2.3. That is, $y$ may be close to the hyperplane orthogonal to $e_3$ for one cluster, but close to the data associated with another cluster in all other dimensions.

[Figure 2.4 here: the $e_1$–$e_2$ plane, showing the two clusters and $y$ (coincident with its projection $p$), together with a distance proportional to $d_S^2(p, \tilde{y})$.]
Figure 2.4: A "bird's-eye view" looking down onto the $e_1$–$e_2$ plane.

Distance II: the orthogonal distance of $y$ to the hyperplane orthogonal to the eigenvector with the smallest positive eigenvalue. In our illustration, the eigenvector associated with the smallest positive eigenvalue is $e_2$, and the distance of $y$ to this hyperplane is proportional to $d_S^2(p, \tilde{y})$, as seen in Figure 2.4. Clearly this is equivalent to performing orthogonal regression while ignoring the contribution of $e_3$, and is equivalent to the orthogonal distance in the $e_1$–$e_2$ plane.

While the linear structure of the data is visible in this plane, there are three possible faults with this approach.

- Firstly, if the two clusters differed only in the $e_3$ direction, then this would not be a useful measure, as it would be equal for both clusters.

- Secondly, we are discarding any information present in $e_3$ that might be relevant in deciding which cluster the observation belongs to (as seen in Figure 2.3).

- Finally, when calculating distance we are not considering the structure of the data in a Mahalanobis sense; i.e. the distance in the direction of $e_2$ should be expressed in terms of how spread out the data are in that direction, in the same way that the Mahalanobis distance measures distance to the centre of a data set in terms of the confidence contours of the data's covariance structure.
In order to reflect the structure of the cluster, we scale the Euclidean distance of $p$ to $\tilde{y}$ by the smallest positive eigenvalue, since this distance is solely a function of the matching eigenvector, addressing the third fault.

These arguments show that there is some value in each approach, and so we combine them into our new measure of distance, as given in equation (2.3). In order to generalize our illustration to $d$ dimensions, let us assume that we now have $\nu$ positive eigenvalues; in practice we select $\nu$ such that $c\%$ of the total variation in the data is explained, i.e.
\[
\arg\min_{\nu} \left\{ \nu : \frac{\sum_{i=1}^{\nu} \lambda_i}{\sum_{j=1}^{d} \lambda_j} > c \right\}. \qquad (2.4)
\]
Consider the basis of interest: in our illustration we have only $\lambda_3 = 0$, which defines the basis (with any meaningful variation) as $e_1$–$e_2$; in $d$ dimensions this corresponds to the basis $E = (e_1 \mid \cdots \mid e_\nu)$. Then Distance I corresponds to the distance between $y$ and its projection, $p$, on the basis $E$ going through $\mu$; this gives our result for $d_E^2(y, p)$.

Finally, recall Distance II, which measures how far away $y$ is from the clusters in the $e_1$–$e_2$ plane, as shown in Figure 2.4. Given that in this basis $y$ is equivalent to $p$ (its orthogonal projection onto the basis), the distance from $y$ to a cluster corresponds to the distance of $p$ to $\tilde{y}$, where $\tilde{y}$ is the projection of $y$ onto the hyperplane orthogonal to $e_\nu$. Given that we wish to express this distance with respect to the structure of the cluster, we scale it by the eigenvalue $\lambda_\nu$, as the distance in this direction is proportional only to $e_\nu$ and to no other eigenvector. This is the distance denoted $d_S^2(p, \tilde{y})$. By adding $d_E^2(y, p)$ and $d_S^2(p, \tilde{y})$, we then obtain Definition 2.3.1.

Remark. Consider Definition 2.3.1 and Figures 2.2–2.4, and assume $\lambda_\nu = 1$. Then $d_S^2(p, \tilde{y}) = d_E^2(p, \tilde{y})$, and by Pythagoras' theorem
\[
d_P^2(y; E, \mu) = \|y - p\|^2 + \|p - \tilde{y}\|^2 = \|y - \tilde{y}\|^2 = d_E^2(y, \tilde{y});
\]
i.e. the Partially Orthogonal Distance can be decomposed into two orthogonal distances, and is equivalent to the squared Euclidean distance from $y$ to $\tilde{y}$.

In order to perform LGA with POD, we propose the following algorithm.

Algorithm 2.3.2 (LGA-POD)
LGA with partially orthogonal distances consists of the following steps:

1. Generation of starting values: a starting configuration is generated consisting of a partition of the data set into $k$ mutually exclusive sets of nsamp observations (i.e. an observation can belong to at most one set). For each of these sets, calculate the number of eigenvectors that explain 99.99% of the variation using equation (2.4). Call this $\nu$.

2. Initialization of groups: for each starting hyperplane with corresponding $\nu$, calculate the partially orthogonal distance of all observations to that hyperplane. Then assign each point to the hyperplane to which it is closest.

3. Iterative refinement: repeat step 2 for each resulting hyperplane until either a) the solution converges (i.e. the membership sets remain the same for two consecutive iterations) or b) the solution fails to converge after an arbitrary number of iterations (e.g. 10).

4.
Resampling: Repeat steps 2-4 a number of times and select the solution with the smallest partiallyorthogonal sum of squares.The rationale behind this algorithm, in particular the setting of nu, is similar to that given inSection 2.2, insofar as we randomly select a number of different starting configurations with the hope26Chapter 2. Extending Linear Grouping Analysisof getting one that is a subset of the true partitioning. As before, this is intuitively advantageous,since the hyperplanes will already be the appropriate shape, but now also the number of eigenvectors(nu) will be set correctly.The setting of the nsamp parameter requires some care. It is desirable to set the value as smallas possible, as this makes it much easier to find a ?good? starting configuration. However, settingnsamp too low will exacerbate any rank problems further, as the number of eigenvectors withpositive eigenvalues corresponds directly with the initial number of observations in the build set.Example. We simulate two rank deficient clusters by simply adding a constant column to each.This is comparable with the data set illustrated in Figure 2.2. In this case we have generated twoclusters of size 30 from a multivariate normal Xi = N(?,?),i = 1,2, with ? = (0 0)latticetop and ? adiagonal matrix with 2 on the diagonal. Then for the first cluster we set Y1 = (X1|0) and forthe second Y2 = (X2|5). Clearly there is no variation in the third dimension, and the covariancematrix of either cluster will be of rank two. Applying the algorithm to this data set yields a perfectclustering, and a plot is given in Figure 2.5(a). Note that in the first two dimensions the clustersare not separable.Example. We simulate two clusters with more dimensions than observations, and therefore havecovariance matrices that are less than full rank. Once again using the multivariate normal distribu-tion, we generate 15 observations from a N20(?1,?) and 15 observations from a N20(?2,?), where?1 = (0,... ,0)latticetop and ?2 = (3,3,3,0,... ,0)latticetop. The covariance matrix is given by a diagonal matrixwith elements (4,2,0.1,... ,0.1). Applying the algorithm to this data set yields a perfect clustering,and a plot is given in Figure 2.5(b).2.4 Linear Grouping in the Feature SpaceIn Section 2.2 we introduced a criterion for clustering around hyperplanes in Euclidean space.Clearly, the underlying assumption is that the data in these clusters are aligned in a linear fashion.In this section we relax this assumption, by transforming the data into a Feature Space wherethe data may have a linear structure, and performing LGA within this space. We denote thetransformationPhi : Rd arrowrightFwith F element RF , and refer to the corresponding clustering algorithm in the feature space as kLGA.Since it can be shown that the LGA objective function (equation (2.1)) only depends on the cross-products of the observations, rather than working with the transformed data directly, we insteadmanipulate the cross-product k(x,xprime) = Phi(x)latticetopPhi(xprime), called the kernel function, utilizing what iscommonly called the kernel trick in machine learning literature. The advantage of this is flexibilityand generality; if we wish to specify the transformation directly, it is trivial to calculate the kernelfunction i.e. the transformation is explicitly defined. However, it is also possible to specify thekernel function directly, with the transformation being implicitly defined. 
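The two ways of specifying a kernel can be made concrete with a short R sketch; the particular maps, parameter values and function names below are chosen only for this illustration and are not prescribed by the method.

```r
## Explicit transformation: Phi(x, y) = (x, x^2, y); the kernel is then just
## the dot product of the transformed observations.
phi        <- function(p) c(p[1], p[1]^2, p[2])
k_explicit <- function(p, q) sum(phi(p) * phi(q))

## Implicit transformation: the kernel function itself is specified (here a
## radial basis function); the corresponding feature map is never written down.
k_rbf <- function(p, q, theta = 1) exp(-sum((p - q)^2) / theta)

## Kernel matrix of a small bivariate sample, using the implicit kernel.
X <- cbind(runif(5), runif(5))
K <- outer(seq_len(nrow(X)), seq_len(nrow(X)),
           Vectorize(function(i, j) k_rbf(X[i, ], X[j, ])))
```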
In the latter case, whileit may not be possible to describe the actual transformation, a result called Mercer?s Theoremprovides necessary and sufficient conditions for the kernel function such that the existence of thetransformation is guaranteed. Interested readers should refer to Sch?lkopf and Smola (2002) forfurther details.There are various examples of unsupervised learning methods using feature spaces and kernel func-27Chapter 2. Extending Linear Grouping Analysisx-3 -2 -1 0 1 2 3-3-113-3-113y-3 -2 -1 0 1 2 3 0 1 2 3 4 5012345z(a) Y1 = (X1|0), Y2 = (X2|5), Xi = N(?, ?)x1-2 0 2 4 -0.6 0.0 0.4-404-224x2x302-0.60.2x4-4 0 2 4 6 0 1 2 3 -0.4 0.0 0.4 0.8-0.40.20.8x5(b) N20(?i, ?), ?1 = (0, . . . , 0)latticetop, ?2 = (3, 3, 3, 0, . . . , 0)latticetopFigure 2.5: Examples of LGA-POD on data sets whose clusters have covariance matrix of less thanfull rank.28Chapter 2. Extending Linear Grouping Analysistion transformation in machine learning literature, though they are not as well known as the su-pervised learning methods (e.g. SVM classifiers (Sch?lkopf and Smola, 2002), SV Regression (ibid),and Least-Squares SVM (Suykens et al., 2002)). Examples of unsupervised methods include KernelPCA (Sch?lkopf et al., 1998), Kernel Feature Analysis (Smola et al., 2002), and those with specificapplication in clustering such as Spectral Clustering (Ng et al., 2001), a feature space version ofk-means (Girolami, 2002) and Support Vector Clustering (Ben-Hur et al., 2001). In this article, wehave adopted a similar framework to that of (Girolami, 2002), though the details are substantiallydifferent.In order to apply this methodology to LGA, we will transform the observations into the featurespace and then express the objective function in terms of dot-products. Assuming that all covariancematrices in the feature space have rank of F or (F -1), the results of Section 2.2 apply immediately.However, when the transformation is only given implicitly by the choice of kernel function, we willneed to use the results in Kernel PCA (kPCA, Sch?lkopf et al., 1998) to find the eigenvectors ofthe covariance matrix in the feature space. Following this, we look at the case where the covariancematrices have rank less than (F -1), which in turn closely resembles the development of the PODmetric. Finally, we will discuss some of the technical challenges associated with this calculation.2.4.1 Preliminaries and NotationApplying the transformation in equation (2.1), we are interested in minimizing the objective func-tionR(Phi(X)|H,Z) =ksummationdisplayi=1nsummationdisplayj=1zij d2O (Phi(xj),Hi)with all the parameters defined the same as in equation (2.1). It is trivial to show that if the covari-ance matrix is F or (F -1), then the best fitting hyperplane through the transformed observationshas the same parameters as given in Result 2.2.1. This necessitates the spectral decomposition ofthe sample covariance matrix of the observations Phi(xi),i = 1,... ,n, given byCPhi = 1nnsummationdisplayi=1parenleftbigPhi(xi) - ?Phi(xj)parenrightbig parenleftbigPhi(xi) - ?Phi(xj)parenrightbiglatticetop = 1nnsummationdisplayi=1?Phi(xi)?Phi(xi)latticetop,where ?(xi) = Phi(xi) - ?(x) and ?(x) is the sample mean of the transformed data. Define?k(xi,xj) = ?Phi(xi)latticetop?Phi(xj), and let the eigenvalues of CPhi be given by lambda1 greaterequal lambda2 greaterequal ... greaterequal lambdaF-1 > lambdaF greaterequal 0,with corresponding eigenvectors ei element RF ,i = 1,... 
,F.For computational ease, we generate the kernel matrix K element Rn?n, (K)ij = k(xi,xj) and the centredkernel matrix ? element Rn?n, ( ? )ij = ?(xi,xj). It was first shown in Sch?lkopf et al. (1998) that ?can be calculated from K with the following lemma.Lemma 2.4.1 The centred kernel matrix, ? is given by?K = K - 1n1nK -1nK1n +1n2 (1nK)1n,where 1n is a n ?n matrix containing ones.29Chapter 2. Extending Linear Grouping AnalysisLet the eigenvectors of ? be given by alphaj element Rn (j = 1,... n), with corresponding eigenvalues ?j.Now, we can use the following theorem, from Sch?lkopf et al. (1998).Theorem 2.4.2 (Sch? kopf et al. (1998)) The eigenvector associated with the j-th largest eigenvalueof CPhi, ej element RF is given byej = 1?lambdajnsummationdisplayi=1alphaij ?(xi),where alphaij is the i-th element of the vector alphaj element Rn, the eigenvector associated with ?j, the j-th largesteigenvalue of ? . Furthermore,lambdai =?lambdain .Note that we cannot calculate ej directly, since ?(xi) is unknown. However, we can calculateelatticetopj ?(xk) proportional summationtextalphaij ?(xi)latticetop?(xk) = summationtextalphaij?(xi,xk) for example, as the dot-product can be calculatedusing the centred kernel function in Lemma 2.4.1. And, as will be shown in the next section, thisallows us to calculate orthogonal regression in the feature space.2.4.2 kLGA with Orthogonal RegressionIn any clustering paradigm, there is the concept of a build set of observations (in this case used toform the hyperplanes) and then a subsequent step of calculating the distance of all observations tothis hyperplane. For convenience, we introduce notation that is more consistent with this approach.Let S denote the indices of the build set of observations with cardinality m = |S|. The kernelmatrix of the observations in S is given by KS element Rm?m with elements k(xi,xj) (i,j element S), andapplying Lemma 2.4.1 to KS yields its corresponding centred kernel matrix ? S. Finally, let S[i]denote the i-th element of S, andkSj =parenlefttpparenleftexparenleftexparenleftexparenleftbt(K)S[1],j(K)S[2],j...(K)S[m],jparenrighttpparenrightexparenrightexparenrightexparenrightbt element Rm,where (K)a,b denotes the (a, b) element of the kernel matrix K. Let a matrix containing thesevectors for all observations be given byKF =parenleftbiggkS1 .. kS2 .. ? ? ? .. kSnparenrightbiggelement Rm?n.We now have the building blocks for performing orthogonal regression in the feature space. Considera build set (S) of points with m observations in the feature space whose covariance matrix has rank(F - 1) or F where F is the dimension in the feature space. In order to calculate the orthogonaldistance in the feature space, we combine the results from Result 2.2.1 (Orthogonal Regression)and Theorem 2.4.2 (kPCA), to get the following distance metric.30Chapter 2. Extending Linear Grouping AnalysisResult 2.4.3 The orthogonal distance of observation y = Phi(x) to a hyperplane orthogonal to ek (withlambdak > 0 and k <=< F) and built on observations with indices in the set S, is given byd2O(y,H) = (ek ? y -ek ? ?s)2, (2.5)whereek ? ?s = 1m2radicalbig?lambdakalphalatticetopkparenleftBigmKS1 -1latticetopKS1parenrightBig, (2.6)ek ? yi = 1?lambdakalphalatticetopk (kSi - 1m1latticetopkSi ), (2.7)?s is the sample mean of the observations in S, alphak element Rn is the eigenvector associated with ?k, thek-th largest eigenvalue of ? 
S, and 1 is a vector of length m containing all ones.When the observation yi = Phi(xi) in the lemma above is part of the build set, equation (2.5)simplifies to(ek ? yi -ek ? ?)2 =parenlefttpparenleftbtmsummationdisplayj=1alphajk ?Sjiparenrighttpparenrightbt2from Lemma 2.4.1. In order to make this result easier to implement for coding, we can expressResult 2.4.3 in such a way that it can be calculated for a matrix Y (each row being an observation)with the following equation:d2O(Y,H) =parenlefttpparenleftexparenleftexparenleftexparenleftbtd2O(y1,H)d2O(y2,H)...d2O(yn,H)parenrighttpparenrightexparenrightexparenrightexparenrightbt = 1?lambdakparenleftbigg(KF - 1m1mKF )latticetopalphak -bparenrightbigg2,where 1m is a m ?m matrix containing all ones, andb = 1m2 alphaklatticetopparenleftBigmKS1m -1latticetopmKS1mparenrightBig.This derivation raises some interesting technicalities regarding rank. It was assumed (explicitlyand implicitly) in this section that the covariance matrix CPhi element RF?F has rank F or (F - 1) inthe feature space, the dimension of which is defined by the transformation Phi. However, in thecalculation of the eigenvectors of CPhi, we use the spectral decomposition of ? S element Rm?m, where mmay be larger or smaller than F. This means that the number of positive (non-zero) eigenvaluesof ? S determines uniquely the approach that is required.In practise, if the transformation Phi is explicitly specified, then the dimension of the feature spacecan be determined, and it will be known ex ante whether the covariance matrix in the feature spaceis going to have rank F or (F -1) when the number of observations in the build set is taken intoconsideration. However, if an implicit transformation is used, more care is required. In particular,some kernel functions, such as the Radial Basis Function (RBF), transform the data into an infinitedimensional space, and therefore no sample covariance matrix based on a finite sample will be full31Chapter 2. Extending Linear Grouping Analysis0.0 0.5 1.0 1.5-30-20-1001020xy(a) kLGA with an explicit transformation0 10 20 30 40 50020406080100Dimension of hyperplane% of variation explainedRBFWave(b) Eigenvalue analysis of Wave and RBF kernelsFigure 2.6: Plots from example one for kLGA. On the left we have the output from kLGA, withand explicit transformation Phi(x) = (x1,x21,x2). Also on this plot is the hyperplanes that the (non-feature space) LGA would fit. On the right we look at the percentage of total variation explained,based on the eigenvalues, for the first cluster using two different kernel functions.rank. Therefore, prior to using Result 2.4.3, careful analysis of the eigenvalues of ? S should beundertaken.Example. To introduce kLGA with orthogonal regression, we first use a simple example with an ex-plicit transformation function. For the first cluster, we have 200 independent random bivariate sam-ples from the model X similarUnif(0,1), and Y = 80(X-12)2+epsilon1 where epsilon1 similar N(0,2.82). The second clusterhas 300 independent random bivariate samples from X similarUnif(12, 32), and Y = -80(X -1)2 -10 + epsilon1where epsilon1 similar N(0,1.52). The transformation function is given by Phi(x) = (x, x2, y).The results from kLGA are given in Figure 2.6(a). 
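The simulated data for this example, and the explicit transformation, can be reproduced with the R sketch below (reading the interval endpoints as 1/2 and 3/2). Since the transformation is explicit here, kLGA amounts to running ordinary LGA on the transformed columns; the commented lga() call assumes the interface of the lga package and should be checked against its documentation.

```r
## Data-generating mechanism of the first kLGA example.
set.seed(1)
x1 <- runif(200);            y1 <-  80 * (x1 - 0.5)^2      + rnorm(200, sd = 2.8)
x2 <- runif(300, 0.5, 1.5);  y2 <- -80 * (x2 - 1.0)^2 - 10 + rnorm(300, sd = 1.5)
x  <- c(x1, x2);             y  <- c(y1, y2)

## Explicit feature space Phi(x) = (x, x^2, y): the two quadratic clusters
## become approximately linear clusters in these three coordinates.
transformed <- cbind(x, x^2, y)
# library(lga)
# res <- lga(transformed, k = 2)   # hypothetical call; see the package help
```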
As a matter of reference, we also provide ananalysis of the eigenvalues for two other kernel functions, being the Wave Kernel and the RBFkernel, given byWave: k(x,z) =braceleftBigg theta||x-z|| sinparenleftBig||x-z||thetaparenrightBigif ||x -z|| > 01 if ||x -z|| = 0(2.8)RBF k(x,z) = expparenleftbigg-||x -z||2thetaparenrightbigg. (2.9)A plot of the total variation explained versus the number of eigenvalues, as given by the equationsummationtextnui=1 lambdai/summationtextnj=1 lambdaj, nu = 1,... ,50, is provided in Figure 2.6(b) for each kernel function. Clearly thedimension of transformed data using the implicitly defined transformation is substantially largerthan that for our explicitly defined transformation, which has a dimension of three.32Chapter 2. Extending Linear Grouping Analysis2.4.3 kLGA with PODAs mentioned in the previous section, the rank of the covariance matrix in the feature space isas relevant in the feature space as it is in the Euclidean space, which was discussed at length inSection 2.3. As a result, we conclude our foray into the feature space with an extension of the PODmethodology for use in the feature space. Recalling Definition 2.3.1, and proceeding along the samelines as Section 2.4.3, we get the following result.Result 2.4.4 Let there be k eigenvectors, denoted e1,... ,ek, with positive (non-zero) eigenvalues.The Partially Orthogonal Distance (POD) of a point y = Phi(x) element RF to the hyperplane with basise1,... ,ek going through ?s is given byd2P (y,E, ?s) = zlatticetopz + 1lambdakzlatticetopekelatticetopk z -zlatticetopEElatticetopz,where z = y - ?s andE =parenlefttpparenleftbt| | |e1 e2 ? ? ? ek| | |parenrighttpparenrightbt.In practise, this equation can be expressed using equations (2.6) and (2.7) for all observations atonce, withd2P (Y,H) =parenlefttpparenleftexparenleftexparenleftexparenleftbtd2P (y1,H)d2P (y2,H)...d2P (yn,H)parenrighttpparenrightexparenrightexparenrightexparenrightbt = DiagparenleftBig ?KasteriskmathparenrightBig+ 1?lambdak/mparenleftBigg(KF - 1m1mKF -bprime)latticetop alphak?lambdakparenrightBigg2-parenleftbigg(KF - 1m1mKF -bprime)latticetopAparenrightbigg2,where m = |S|, 1m is a m ?m matrix containing all ones,?Kasteriskmath = K - 1m1mKF - 1mKF 1m + (1mKS)1m,bprime = 1m2parenleftBigmKS1 -1latticetopKS1parenrightBig,andA =parenlefttpparenleftbt| | |alpha1/radicalbig?lambda1 alpha2/radicalbig?lambda2 ? ? ? alphak/radicalbig?lambdak| | |parenrighttpparenrightbt.In order to cluster using this formula, Algorithm 2.3.2 is used, and the parameter nsamp is requiredin order to specify the size of the initial sample. Recall that in the Euclidean space (which is the33Chapter 2. Extending Linear Grouping Analysiscontext of Section 2.3), nsamp should be the smallest value such that the relevant dimensionsare included, where the relevant dimensions can be defined as the number of positive eigenvaluesdetermined in equation (2.4). However, in the feature space, there are kernel functions that implya dimension that is infinitely large, which makes this approach more difficult. In this case, thesize of nsamp given in equation (2.4) may need to be very large in order to capture the necessaryinformation (see, for example, Figure 2.6(b) for an indication of this with the RBF kernel), whichmakes the combinatorial problem extremely difficult. In the examples that follow, we have selecteda smaller value for nu in the interests of an easier optimization step, and leave the detailed study ofthis problem to further research.Example. 
In this example we generate two bivariate clusters. In the first cluster, X similarUnif(-pi,pi)and Y = sin(X)/4 + epsilon1 with epsilon1 similar N(0,1), while in the second cluster X similarUnif(-pi,pi) and Y =5sin(X)+epsilon1, with epsilon1 defined the same. Each cluster contains 100 independent observations followingthe above models.Using the wave kernel function given in equation (2.8), with theta = 10 we run the kLGA(POD)algorithm, with the results given in Figure 2.7(a).Example. In our second example, we form two clusters with the following characteristics:Cluster 1 - 400 independent observations from the model: X similarUnif(-pi,2pi) and Y = 5sin(X) +15 + epsilon1 with epsilon1 similar N(0,0.25)Cluster 2 - A combination of two models.? 100 independent observations from the model:X similarUnif(-pi,pi) and Y = -10|X| + epsilon1 where epsilon1 similar N(0,4).? 100 independent observations from the model:X similarUnif(3 -pi,3 + pi) and Y = -10|X -3| + epsilon1, where epsilon1 similar N(0,4).We use kLGA(POD) with the RBF Kernel Function (equation (2.9)). The results are given inFigure 2.7(b). Note that there is some misclassification, however this accounts for less than 1% ofthe whole data set.2.5 Discussion and Future ResearchIn this article we have introduced two significant extensions to the LGA algorithm. The first extendsLGA to allow the covariance matrix of any cluster to have rank less than (d-1) by introducing a newdistance called Partially Orthogonal Regression. The second method extends LGA to the featurespace by using the kernel trick, allowing for clustering non-linear patterns and is demonstrated forboth cases where covariance matrices have rank F or (F -1), and the rank is less than (F -1).Much of the discussion was devoted to the number of positive eigenvalues within the feature space,and how it is of crucial importance. Further research in this area is necessary, in particular therelationship between the number of eigenvalues and the number of the starting samples for initializ-ing the algorithms (given by the parameter nsamp). In the methods given here we explicitly checkthat the nature of the eigenvalues is consistent with the algorithm chosen, but have not currentlyquantified the effects of this approach.34Chapter 2. Extending Linear Grouping Analysis-2 0 2 4 6051015(a) kLGA(POD) with Wave Kernel-2 0 2 4 6-30-20-1001020(b) kLGA(POD) with RBF KernelFigure 2.7: Plots from the examples for kLGA (POD). On the left we have the output from kLGA(POD) using the wave kernel with theta = 10. On the right we have the output from kLGA (POD)using the RBF kernel with theta = 200.It is clear that further research is required into the selection of the kernel functions, and tuning theparameters of these functions, though this comment can be applied more generally to the area offeature space methodologies as a whole. However, in the area of unsupervised learning, of whichthese algorithms are part, this is especially difficult since the option of first training the algorithmusing a known solution is not relevant. Methods have been proposed elsewhere, such as crossvalidation, and this would be a logical place to start.With respect to selection of kernel functions and LGA more specifically, research is required intothe nature of the topology of the spaces generated with implicit transformations, and how theyrelate to the underlying assumptions regarding linear clusters in these spaces. 
It may be possibleto determine in which cases LGA is sensible, and in which feature spaces LGA makes no sense.One avenue of research that has already proved fruitful, though is not yet at maturity, is the useof the gap statistic (Tibshirani et al., 2001) for the selection of the polynomial degree when usinga polynomial kernel. The gap statistic is useful for examining nested models in order to discoverthe ?elbow? in residual sum of squares, and has been used successfully for finding the number ofclusters in the context of k-means and LGA. In this case, it is intuitively consistent that, as thedegree of the polynomial increases, the ROSS decreases slowly until the ?true? degree is attained,at which point it decreases dramatically. Once the degree exceeds the ?true? value, once again theROSS decreases slowly. Since the gap statistic is tuned to find such?elbows?, it seems to work well,though further work is required.35BibliographyA. Ben-Hur, D. Horn, H. T. Siegelmann, and V. Vapnik. Support vector clustering. Journal ofMachine Learning Research, 2:125?137, 2001.W. C. Chang. On using principal components before separating a mixture of two multivariatenormal distributions. Applied Statistics, 32(3):267?275, 1983.G. Crile and D. Quiring. A Record of the Body Weight and Certain Organ and Gland Weights of3690 Animals. Ohio Journal of Science, 1940.M. Girolami. Mercer Kernel Based Clustering in Feature Space. IEEE Transactions on NeuralNetworks, 13(3):780?784, 2002.I. Jollife, B. Jones, and B. Morgan. Utilising clusters: A case-study involving the elderly. Journalof Royal Statistical Society. Series A, 145(2):224?236, 1982.I. Jolliffe. Principal Component Analysis. Springer, 2002.A. Y. Ng, M. I. Jordan, and Y. Weiss. On spectral clustering: Analysis and an algorithm. Advancesin Neural Information Processing Systems, 14, 2001.B. Sch?lkopf and A. Smola. Learning with Kernels. MIT Press, 2002.B. Sch?lkopf, A. Smola, and K.-R. M?ller. Nonlinear component analysis as a kernel eigenvalueproblem. Neural Computation, 10:1299?1319, 1998.A. Smola, O. L. Mangasarian, and B. Sch?lkopf. Sparse kernel feature analysis. In Classification,Automation, and New Media. Springer, 2002.J. A. Suykens, T. Van Gestel, J. De Brabanter, B. De Moor, and J. Vandewalle. Least SquareSupport Vector Machines. World Scientific, 2002.R. Tibshirani, G. Walther, and T. Hastie. Estimating the number of clusters in a data set via thegap statistic. J. R. Statist Soc. B, 63(2):411?423, 2001.S. Van Aelst, X. Wang, R. H. Zamar, and R. Zhu. Linear Grouping Using Orthogonal Regression.Computational Statistics & Data Analysis, 50(5):1287?1312, Mar. 2006.K. Yeung. Principal component analysis for clustering gene expression data. Bioinformatics, 17(9):763?774, 2001.R. H. Zamar. Robust estimation in the errors-in-variables model. Biometrika, 76(1):149?160,1989.36Chapter 3Finding Approximate Solutions toCombinatorial Problems with VeryLarge Data Sets using BIRCHasteriskmath3.1 IntroductionA very large data set is traditionally defined as one that exceeds the size of the main memory on acomputer. Once this size is exceeded, paging/swapping is required to access the data which is veryinefficient and time consuming.In this paper we focus on the problem of finding approximate solutions to non-convex optimiza-tion problems, particularly those that arise when computing high-breakdown point robust estima-tors, such as the Minimum Covariance Determinant and the Least Trimmed Squares estimators(Rousseeuw and Leroy, 1987). 
The optima of these problems are generally found by using somevariation of random search algorithms. Recently improved methods based on this approach havebeen shown to perform better than other heuristic algorithms, like simulated annealing and tabusearch (see Salibian-Barrera et al. (2007)). The basic idea of the random subsampling algorithm(Rousseeuw, 1984) is to generate a candidate solution by randomly drawing a subsample from thedata. This candidate is then refined by performing local improvements in the objective functionto find a local minimum near this starting point. In the robust statistics context this strategy isexpected to find a good approximation to the optimum when the random subsample of the datais drawn from the ?good? observations (non outliers). In general, a very large number of randomstarts are needed to ensure the solution space is well represented among these candidates, andfurthermore, each starting point requires many refinement iterations.To fix ideas, consider the MCD estimator, which for a sample X1,... ,Xn element Rp and an integern/2 <=< h <=< n, is defined as the sample mean and covariance matrix of the subsample of size h withsample covariance matrix with the smallest determinant. This is a combinatorial problem (thereare n!/(h!(n - h)!) possible subsamples of size h) whose solution can be approximated using thefast-MCD algorithm (Rousseeuw and van Driessen, 1999). This approach randomly draws a largenumber (N) of subsamples of size p+1, which are then enlarged and iteratively refined by selectingpoints that are close (in the Mahalanobis distance sense) to the starting values. We see that thisalgorithm accesses the whole data set a very large number of times. When the data set is large (i.e.it is not possible to store it in memory), input/output delays due to paging or swapping make theuse of these methods unfeasible.asteriskmathA version of this chapter is in the review process for publication. Harrington, J. and Salibian-Barrera, M. (2007)Finding Approximate Solutions to Combinatorial Problems with Very Large Data Sets using BIRCH, ?Special Issueon Statistical Algorithms and Software? of Computational Statistics and Data Analysis.37Chapter 3. Combinatorial Problems, Large Data Sets and BIRCHWe adopt the following essential criteria for an algorithm to be successful when applied to verylarge data sets:? it should only require one read through the data set, after which the data set can be discarded;? it should only require one line of data to be held in memory at any point in time; and? it should reduce the effective size of the solution space for combinatorial problems.We will show that the BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies)algorithm can be adapted appropriately to find approximate solutions to optimization problemssuch as those mentioned above. These methods satisfy the conditions listed above. In this paperwe illustrate this approach by applying BIRCH to find approximate solutions to the optimizationproblems that define the MCD and the LTS (Least Trimmed of Squares) (Rousseeuw and Leroy,1987). Finally, we also adapt BIRCH to the linear grouping algorithm (LGA) (Van Aelst et al.,2006) and its robust version (RLGA) (Garc?a-Escudero et al., 2007).Our presentation will be organized as follows. Section 3.2 briefly describes the BIRCH algorithm.In Section 3.3 our approach is applied to compute approximations to the MCD and LTS estimators,and to run the RLGA algorithm, with very large data sets. 
We show that the algorithm we proposeworks very well (compared with the state-of-the-art alternatives) when the data set fits in memory.Moreover, our numerical experiments show that, in this case, our algorithm can be used to veryquickly compute a good starting point that when iterated using the whole data set converges tothe same solution found by the best algorithms available. Section 3.4 reports the results of twosimulation studies that indicate that this new method also works very well in large data sets.Finally, some concluding remarks are included in Section 3.5.3.2 Data Pre-processingThe data pre-processing algorithm BIRCH was first introduced in Zhang et al. (1996). It groupsthe data set into compact subclusters that have summary statistics (called Clustering Features(CF)) associated to each of them. These CFs are computed and updated as the subclusters arebeing constructed. The end result is an ?in-memory? summary of the data, where ?local? compactsubclusters are represented by appropriate summary statistics.This approach lends itself to robust statistics and clustering, in that both methodologies are con-cerned with finding areas of homogeneous data that are locally similar and dense. As such, itis a simple extension to consider using compact subclusters instead of individual observations toobtain a fast and good approximation to the solution of the corresponding optimization problems.It is important to note that by choosing the summary statistics appropriately we can evaluatethe objective function exactly. In other words, the approximated nature of this approach is onlydue to the fact that both the random starts and the subsequent iterative refining steps use sets ofpoints (subclusters) instead of individual observations. Further discussion of the influence of thisapproximation is given in Section 3.2.2.The advantage of pre-processing the data is that, once completed, the data set can be discarded,and the number of subclusters is much reduced relative to the number of observations, therebymaking the combinatorial problem smaller.38Chapter 3. Combinatorial Problems, Large Data Sets and BIRCHIt will be shown in Section 3.2.1 that at no point in time is more than one observation required tobe held in memory, and combined with the previous points, this means we have met our essentialcriteria mentioned in Section 3.1.In the original BIRCH article, the method proposed adapts a global clustering algorithm e.g.CLARANS (Ng and Han, 2002), and selects the clustering for the desired number of groups. Theseare then used as seeds for an optional cluster refinement, though it is unclear if this is based on thecomplete data set, or performs further steps using just the subclusters. An extension to k-meanson mixed type attributes is given in Chiu et al. (2001). Neither of these articles explicitly considersthe application of BIRCH from the perspective of a combinatorial problem and the reduction ofthe solution space.Since the introduction of BIRCH, other methods for the formation of tree-like structures andtheir application have been explored. In Breunig et al. (2001), BIRCH is used as an initial stepto ?compress? the data set into representative objects. However, in the context of hierarchicalclustering, the concern is that the size of each subcluster (as given by its number of observations)is not considered. 
As such, a post-processing step is proposed, involving going back to the originaldata set and sampling a number of objects, with the number proportion to the size of the subcluster.These points are subsequently used for clustering.Other improvements are offered in Nassar et al. (2004) and Tung et al. (2005) regarding how the treeis created. In the first article, the BIRCH algorithm is optimized for data sets with?real-time? datastreams by adding an attribute to subclusters that have been amended since the first construction.This allows for incremental updating of only those subclusters that have recently admitted datawithout needing to rebuild an entire tree. Finally, in the latter article, a dual criteria of spatialproximity as well as orientation is considered in the merging of subclusters; the combination ofthese attributes would make for a more compact merged subcluster.The idea of using summary statistics of subsets of the data is also explored in other ways; forexample in spatial statistics via grid based methods (e.g. Wang et al., 1998). It was also noted inthis article that care must be taken when using BIRCH that the columns are of similar scale whenthe subclusters are being formed, else the interpretation of Euclidean and Mahalanobis distancebecomes difficult, a fact that is implemented in the following algorithms. In the case where thedata set does not fit in memory, an approach for scaling is proposed in the next section.3.2.1 Forming the SubclustersA simplified algorithm for forming the subclusters is given next. Further details of the completealgorithm, including refinements such as self-sizing, subcluster splits and CF-Trees, can be foundin Zhang et al. (1996).Let X1,... ,XN denote the sample to be processed, where Xj element Rp.Pre-processing:1. The first observation X1 forms its own subcluster.2. Let T be the set of current subclusters, and let ? t, and CFt, t element T be the associated subclustercentres and clustering features (currently left unspecified).3. For each of the remaining observations X2,...,XN:39Chapter 3. Combinatorial Problems, Large Data Sets and BIRCH(a) Calculate the L2-norm distance from Xj to each of the centres ? t of the existing sub-clusters. Select the closest centre ? t0.(b) If the observation meets the?closeness?and?compactness?criteria described below, thenXj joins the t0 subcluster, and both ? t0 and CFt0 are updated.(c) If not, a new subcluster t + 1 containing Xj is formed, with ? t+1 = Xj and CFt+1 thecorresponding clustering features, and the list of subclusters is updated: T = T union {Xj}The clustering features we consider are functions of the subcluster members that are easily updatedwhen a new observation joins (e.g. the sum of the elements of the subcluster). It is clear from theabove that only a single observation is held in memory at any point in time.Note that this algorithm requires the calculation of the distances between Xj and every subclustercentre. This can be avoided by using a more efficient structure, such as a CF-tree which is similarto a ?B+?-tree (Wedekind, 1974). In this approach, each observation is passed to a root node,with a number of non-leaf nodes as children. Each non-leaf node contains the sample mean of allobservations beneath it. The observation is then simply passed down this tree to the closest non-leafnode based on L2-norm distance to the centres, and each centre is updated as the observation isgiven to it. 
Finally, when it reaches the appropriate subcluster, it is either absorbed, or else formsits own subcluster. Once a non-leaf node exceeds a (user-specified) number of subclusters, it splitswith the subclusters allocated to the split non-leaf nodes based on a proximity measure. A morecomplete description of this structure is given in Appendix B, which is also available as Harringtonand Salibian-Barrera (2008).The number of subclusters that are formed at the end of this algorithm depends on the criterionused to decide whether a new observation being processed can join an existing subcluster or needsto form a new cluster. Consider a point Xj currently being processed, and let ? t0 be the closestsubcluster centre. In order for Xj to join the t0 cluster, Zhang et al. (1996) required either of thefollowing criteria:Parameters:(i) closeness ? bardblXj - ? t0bardbl <=< a for some pre-defined a > 0.(ii) compactness ? The trace of the sample covariance matrix of the union of observations in thet0 cluster and Xj cannot exceed a pre-defined constant b > 0.Although other measures of compactness, e.g. the generalized variance, can be used in (ii), wefollowed the original BIRCH specification by using the trace. In our algorithm we use both jointly,as we found that any one of these criteria alone was not enough to form subclusters that summarizethe data well. The disadvantage of using only the?compactness?criterion is that, when the numberof observations within a subcluster is large, a point could be far from the centre of the subcluster,and yet adding it to the subcluster would not noticeably affect the sample covariance matrix ofthe members of the subcluster. In other words, it may allow for ?scatter? outside the otherwisecompact body of points within the subcluster.The disadvantage of only using the ?closeness? criterion is apparent when there are few points in asubcluster. Consider an extreme example of three observations on a line. The first two observationshave the sample mean exactly bisecting them. When the third point is added, the sample mean ofthe new subcluster would shift towards the new point, increasing the distance of the first observationto the new centre. As more points are added on this line, this shifted centre would admit more40Chapter 3. Combinatorial Problems, Large Data Sets and BIRCHpoints as they meet the closeness criteria, whilst moving the initial points further away. We refer tothis effect as ?travelling?. The subcluster centre starts shifting and the subcluster becomes unduly?long?. In other words, the compactness of the subcluster is reduced. The combination of thesecriteria, therefore, ensures that the resulting subclusters are compact.As mentioned in the proceeding section, the data set should be scaled by dividing each columnby its standard deviation prior to pre-processing. This necessitates the calculation of the standarddeviations, which can be done either directly if the data set is sufficiently small, or else by using avery large value for the closeness and compactness criteria, yielding a very small tree from whichthe calculation of these statistics is trivial. The algorithm is then re-run on the same data set witheach observation scaled as it is processed. It should be noted that a robust scale estimate wouldbe preferable for scaling the data set; however there is no ?updating formula? 
for such estimators with high breakdown point, and therefore such estimators cannot be used within the BIRCH framework.

3.2.2 Selection of Parameters

The selection of the closeness and compactness parameters is a trade-off between (a) the granularity of the summary of the data set and (b) the reduction of the effective solution space. Clearly, as these parameters increase, the number of subclusters decreases and the "size" of the subclusters increases. The impact on the solution of combinatorial problems is that the solution is "coarser", in that the subclusters reflect less of the subtleties of the underlying data set. Therefore, increasing these parameters yields solutions further away from that found using the original data set, and, vice versa, decreasing the parameters gives a solution that more closely approximates the solution based on the original data set.

However, by decreasing the parameters, the solution space is enlarged due to the increased number of subclusters, which makes for a more difficult combinatorial problem, to the point where any potential gains in the optimality of the solution might be lost. Also, an increased number of subclusters requires more memory, which may be a limiting factor.

Clearly, if a refinement step following this algorithm (as suggested in Zhang et al. (1996)) were to take place, then a "coarse" solution would be sufficient. If, however, this were not practical because the size of the data set excludes this option, then the selection of the parameters is of more significance. The question of setting the parameters is explored further in the following sections, where the results of our experiments suggest that the optimality of the solution is only slightly affected by the selection of the parameters, and that the selection is essentially irrelevant if refinement takes place.

3.3 Applications

In this section we show how to adapt this approach to compute two robust estimators (the MCD for multivariate location and scatter, and the LTS for linear regression) and a robust clustering algorithm, the Robust Linear Grouping Analysis. We will show that the approximate solutions obtained by our approach compare very well with those given by current algorithms.

Consider the CF given by the following tuple:
\[
CF_J = \left( \sum_{i \in L_J} 1, \; \sum_{i \in L_J} X_i, \; \sum_{i \in L_J} X_i X_i^\top \right) = (n_J, LS_J, SS_J);
\]
i.e. the number of observations, the sum of those observations, and the sum of squares of those observations. The set $L_J$ contains the members of subcluster $J$. Let $N$ be the number of subclusters; by construction, $L_i \cap L_j = \emptyset$ for all $i \neq j \in \{1, \dots, N\}$. We can define the addition operator in a natural way:
\[
CF_I + CF_J \overset{\text{def}}{=} (n_I + n_J, \; LS_I + LS_J, \; SS_I + SS_J) \qquad (3.1)
\]
\[
\equiv \left( \sum_{i \in L_I \cup L_J} 1, \; \sum_{i \in L_I \cup L_J} X_i, \; \sum_{i \in L_I \cup L_J} X_i X_i^\top \right),
\]
which equates exactly to the features associated with the union of the member vectors $L_I \cup L_J$. We call these statistics "sufficient", insofar as the evaluation of the objective functions we are interested in on the observations of a subcluster (or a union of subclusters) can be done exactly using only the CFs. Thus, as mentioned before, we do not need to keep the data set in memory.
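As a concrete, purely illustrative rendering of these clustering features and of the addition operator in equation (3.1), the following R sketch builds a CF, updates it with a new observation, merges two CFs, and recovers the sample mean and covariance of the underlying observations from the summary alone. The helper names are chosen only for the example.

```r
## Clustering feature of a set of observations: (n, LS, SS).
cf_new <- function(x) {                  # x: a single observation (vector)
  list(n = 1, LS = x, SS = tcrossprod(x))
}
cf_update <- function(cf, x) {           # absorb one new observation
  list(n = cf$n + 1, LS = cf$LS + x, SS = cf$SS + tcrossprod(x))
}
cf_add <- function(cf1, cf2) {           # equation (3.1)
  list(n = cf1$n + cf2$n, LS = cf1$LS + cf2$LS, SS = cf1$SS + cf2$SS)
}

## Sample mean and covariance recovered exactly from the summary (n >= 2),
## using the identity  sum (x_i - xbar)(x_i - xbar)' = SS - LS LS' / n.
cf_mean <- function(cf) cf$LS / cf$n
cf_cov  <- function(cf) (cf$SS - tcrossprod(cf$LS) / cf$n) / (cf$n - 1)
```

Reducing a matrix of observations row by row with cf_update() and then calling cf_cov() reproduces cov() of the full matrix exactly, which is the sense in which the CFs are "sufficient" for the objective functions considered below.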
3.3.1 Minimum Covariance Determinant

The Minimum Covariance Determinant (MCD) (Rousseeuw, 1984, p. 877) is a robust measure of location and dispersion. It is calculated by finding the subset of the data of size $h = \alpha n$, $\alpha \in (\tfrac{1}{2}, 1]$, whose sample covariance matrix has the smallest determinant; i.e. if $\mathcal{H}$ is the solution space of all possible subsets of size $h$, then the MCD is
\[ \arg\min_{J \in \mathcal{H}} |S_J|, \quad \text{where} \quad S_J = \frac{1}{h-1} \sum_{i \in J} (X_i - \bar X_J)(X_i - \bar X_J)^\top \quad \text{and} \quad \bar X_J = \frac{1}{h} \sum_{i \in J} X_i. \]
The location and dispersion estimates are then given by the sample mean and sample covariance of this subset.

We compared the performance of our algorithm with fast-MCD (Rousseeuw and van Driessen, 1999), which randomly selects $(p+1)$-sized subsets of the data, calculates the sample mean ($\hat\mu$) and covariance matrix ($S$) of each, and then updates them using the following "concentration" steps.

Concentration step:
- Let $H \leftarrow$ { the $h$ points with smallest Mahalanobis distances $(X - \hat\mu)^\top S^{-1} (X - \hat\mu)$ }.
- Calculate the new $\hat\mu$, $S$ based on the observations in $H$.

It can be shown that these concentration steps always decrease the objective function above, and they are iterated until convergence or until some upper bound on the number of iterations is reached. The outputs from this algorithm are estimates for $\mu$, $S$ and a vector of the points that make up the "best" $h$-subset.

As this is a local search algorithm, a large number of random starting configurations is required in order to adequately cover the solution space. It is easy to show that this number grows very rapidly as a function of the dimension $p$ and the number of observations in the data.

Our new algorithm, MCD-BIRCH, uses the subclusters derived in Section 3.2, and then attempts to find the optimal size-$h$ set $H$ using the following algorithm.

MCD-BIRCH:
1. Run the pre-processing algorithm.
2. For each subcluster, calculate its covariance matrix and the corresponding rank. Retain an index of those subclusters whose covariance matrices are of full rank.
3. Randomly select a sufficient number of subclusters of full rank to adequately cover the solution space. For each of these subclusters (denoted by $H$):
(a) Calculate the Mahalanobis distance of all remaining subcluster centres to the centre of $H$, using its centre ($\hat\mu_H$) and covariance matrix ($\hat\Sigma_H$), which are computed using the CFs of $H$.
(b) Order the distances, and select the smallest number of subclusters with closest centres such that the sum of the number of observations in the union of these subclusters exceeds $h = \alpha n$.
(c) Calculate $\hat\mu_H$ and $\hat\Sigma_H$ using the combined CF of this union of subclusters (from equation (3.1)) to form a new seed.
(d) Let $H$ be the union of these subclusters, and update its CF.
(e) Iterate from (a) until convergence or until a fixed number of iterations is exceeded.
4. Select the configuration with the smallest determinant of the combined sample covariance matrix.

Remark. Note that in Step 2 we only consider "full rank" subclusters because we need to compute Mahalanobis distances (which require the inverse of the covariance matrix). This restriction is not of much concern because subclusters that are not of full rank may not be representative of the shape, as they usually contain very few observations.
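The core of steps 3(a)-(d) can be sketched in a few lines of R. This is a simplified illustration of ours that operates directly on a list of CF triples $(n, \mathrm{LS}, \mathrm{SS})$; the production implementation is covMcd.birch in the birch package and differs in its details.

```r
# One MCD-BIRCH iteration: given the indices H of the current seed subclusters,
# return the indices of the smallest set of closest subclusters covering >= h
# observations.  `cfs` is a list of CF triples (n, LS, SS).
mcd_birch_step <- function(cfs, H, h) {
  # merge the CFs of the current seed via equation (3.1)
  n   <- sum(sapply(cfs[H], `[[`, "n"))
  LS  <- Reduce(`+`, lapply(cfs[H], `[[`, "LS"))
  SS  <- Reduce(`+`, lapply(cfs[H], `[[`, "SS"))
  muH <- LS / n
  SH  <- (SS - tcrossprod(LS) / n) / (n - 1)
  # Mahalanobis distance of every subcluster centre to the seed
  ctrs <- t(sapply(cfs, function(cf) cf$LS / cf$n))
  d2   <- mahalanobis(ctrs, center = muH, cov = SH)
  # smallest collection of closest subclusters containing at least h points
  ord   <- order(d2)
  sizes <- cumsum(sapply(cfs[ord], `[[`, "n"))
  ord[seq_len(which(sizes >= h)[1])]
}
```

Iterating this map until the selected set of subclusters stabilises, and retaining the configuration whose merged CF gives the smallest covariance determinant, yields the MCD-BIRCH solution.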
Clearly, the merged CF can be considered a sufficient statistic for the calculation of the MCD objective function. Consider a set of CFs $A$ with complete membership vector $L = \bigcup_{i \in A} L_i$. Then
\[ S_A = \frac{1}{n_A - 1}\Big( \mathrm{SS}_A - \frac{1}{n_A}\, \mathrm{LS}_A \mathrm{LS}_A^\top \Big) \;\equiv\; \frac{1}{|L| - 1} \sum_{i \in L} (X_i - \bar X_L)(X_i - \bar X_L)^\top, \]
where $\bar X_L$ is the sample mean of the observations in $L$.

This is also a local search algorithm, but rather than searching over a solution space of dimension $\binom{n}{h}$, the search is over a much smaller (but coarser) solution space. However, despite this coarseness, the following examples will show that the approximation appears to be very good. Furthermore, if the data set can fit in memory, concentration steps using the individual observations can be applied to the resulting $\hat\mu$ and $S$ estimates until convergence, as described for fast-MCD.

Example. The Digitized Palomar Sky Survey (DPOSS) data has been used in articles related to robustness (e.g. Rousseeuw and van Driessen (1999), Rousseeuw and van Driessen (2006)) as well as clustering (e.g. Rocke and Dai (2003)). It contains characteristics of celestial objects, and has 132,402 observations in 33 dimensions. For further information see Odewahn et al. (1998). In this example, we run the fast-MCD algorithm and the MCD-BIRCH algorithm on the "Aperture" and "Combined stellar function" columns for each of the F, J and N bands, yielding $p = 6$. The fast-MCD algorithm is implemented in R as covMcd from the library robustbase (original code by Rousseeuw and Van Driessen, adapted by Valentin Todorov; R package version 0.2-8), and the MCD-BIRCH algorithm as covMcd.birch in the library birch. The package birch is available on CRAN, and also contains the algorithms for LTS-BIRCH, RLGA-BIRCH and others; the pre-processing code is written in C++ and the remaining code in native R. Version 1.1-2 of birch was used for this article, and the package is discussed in detail in Harrington and Salibian-Barrera (2008), a copy of which is given in the appendices.

Using fast-MCD, the log of the determinant of the resulting covariance matrix was -30.71 with the number of random starting subsamples set at 500. This compares favourably with MCD-BIRCH, which achieved a log-determinant of -30.69 using just 100 random starting subclusters. Furthermore, the two best h-subsets (with $\alpha = 0.51$) have 99.7% of members in common. Finally, performing a refinement by using the solution from MCD-BIRCH as the starting seed for one sequence of concentration steps (to convergence) on the full data set yields a log(MCD) of -30.71, and the best h-subsets have 99.9% of observations in common.

Example. This data set was taken during a nuclear physics experiment using an accelerated ion beam incident on a hydrogen gas target at the DRAGON facility at TRIUMF, Canada's National Nuclear and Particle Physics Laboratory (Engela et al. (2005) and Hutcheon et al. (2003)); many thanks to Dr C. Ruiz from TRIUMF for providing this data. Some of the ions combine with the hydrogen atoms in a nuclear fusion reaction, resulting in an excited product nucleus being formed which "decays" within a billionth of a second to a stable energy configuration. This decay process involves the emission of one or more high energy photons, gamma rays, which are subsequently detected using a set of sensitive instruments surrounding the gas target. These reactions are measured for the purpose of understanding their role in stars in the field of Nuclear Astrophysics.

Because of the discrete quantum nature of energy levels in an excited nucleus, the gamma rays in question have discrete energies and can occur in various combinations of one another (but all adding up to the same total energy). When more than one gamma ray is detected, the values of the two highest energies are retained. Plots of the data set are given in Figures 3.1 & 3.2; there are 209,037 observations and two dimensions. (It should be noted that with this example, as well as the ones that follow, the estimates calculated are not necessarily those that would be chosen if "real" analysis, in the statistical sense, were to be done. Rather, the data sets are used for illustration purposes only.)

In this case, the fast-MCD algorithm achieved a log(MCD) of -7.531 using 500 starting subsamples, and the log(MCD) using MCD-BIRCH was -7.524 using 100 starting subclusters. There was a 98.6% intersection rate between the two best h-subsets.
Performing an additional refinement on the full data set using the solution from MCD-BIRCH yielded a log(MCD) of -7.532, with 99.9% of observations in common with fast-MCD. Contours of 95% confidence ellipsoids derived from the asymptotic $\chi^2$ approximation using the classical, fast-MCD and MCD-BIRCH estimates of location and dispersion are given in Figure 3.1, and based on this plot it is clear that the fast-MCD and MCD-BIRCH algorithms produce near-identical results.

Figure 3.1: TRIUMF Data Set: the data set with contours based on 95% confidence ellipsoids derived from the asymptotic $\chi^2$ approximation using the fast-MCD, MCD-BIRCH and classical estimates. The classical confidence contour is the red line. The axes are the highest and second-highest gamma-ray energies; the x-axis has been trimmed for clarity.

Figure 3.2: TRIUMF Data Set: a smoothed colour density representation of the scatter plot using smoothScatter from the Bioconductor package geneplotter. The contours from Figure 3.1 are also provided. The axes have not been trimmed.
                                             Compactness
Closeness              2^-2      2^-3      2^-4      2^-5      2^-6      2^-7      2^-8
2^0    log(MCD)      -29.75    -29.35    -29.87    -30.56    -30.62    -30.67    -30.69
       N                  7        10        37       160       569      1902      6570
       h              67858     68157     67628     66304     66262     66204     66205
2^-1   log(MCD)      -29.86    -30.07    -29.87    -30.56    -30.62    -30.67    -30.69
       N                 13        16        37       160       569      1902      6570
       h              67654     67173     67628     66304     66262     66204     66202
2^-2   log(MCD)      -30.08    -30.08    -30.48    -30.61    -30.62    -30.67    -30.69
       N                 49        49        51       154       569      1902      6570
       h              66421     66421     66374     66220     66262     66205     66202
2^-3   log(MCD)      -30.52    -30.52    -30.52    -30.47    -30.62    -30.67    -30.69
       N                137       137       137       163       576      1902      6570
       h              66272     66272     66272     66437     66260     66205     66202
2^-4   log(MCD)      -30.61    -30.61    -30.61    -30.64    -30.57    -30.67    -30.69
       N                409       409       409       417       625      1905      6570
       h              66243     66243     66243     66203     66347     66205     66205
2^-5   log(MCD)      -30.64    -30.64    -30.64    -30.64    -30.66    -30.68    -30.69
       N               1012      1012      1012      1012      1012      1937      6574
       h              66229     66229     66229     66229     66203     66201     66201
2^-6   log(MCD)      -30.68    -30.68    -30.68    -30.68    -30.68    -30.68    -30.69
       N               2943      2943      2943      2943      2943      2924      6589
       h              66201     66201     66201     66201     66201     66208     66201
log(MCD)-R           -30.71 for every combination of closeness and compactness

Table 3.1: The effect of the user-selected parameters on the number of subclusters and on performance. N is the number of subclusters resulting from the selection of parameters, h is the number of actual observations in the h-subset, and performance is measured by the log of the MCD. The log of the MCD after refinement on the best solution from MCD-BIRCH is given by log(MCD)-R.

Selection of Parameters for MCD-BIRCH

As mentioned in Section 3.2.2, the selection of the user parameters (compactness and closeness) is left to the user without much in the way of guidance, other than balancing the size of the combinatorial problem against the coarseness of the solution. In all applications in this article, the selection of these parameters was based entirely on achieving a certain number of subclusters. In most cases approximately 5000 subclusters was the level chosen, based on the proximity of the BIRCH solution to the algorithm it was benchmarking. In order to quantify the effect of this selection, we examine here the effect on the log-MCD for the DPOSS data set in Table 3.1 for different values of compactness and closeness.

Based on this analysis, it would appear that the selection of parameters within certain "reasonable" bounds has little effect on the eventual efficacy of the algorithm. In particular, if further refinement of the solution is performed using concentration steps on the complete data, then accuracy in the selection of the parameters is not crucial.

The philosophy of this article has been to treat the setting of these parameters much in the same way one would treat the setting of a bandwidth parameter in a smoothing operation. That is, there is no necessarily right or wrong answer, but rather something that suits the needs of the problem at hand.

3.3.2 Least Trimmed Squares

Consider a linear model of the type
\[ Y = X\beta + \epsilon \]
with squared residuals given by $r_i^2$. Let $r^2_{(i)}$ denote the ordered squared residuals.
The Least Trimmed Squares (LTS) estimator (Rousseeuw and Leroy, 1987) is a robust regression estimator that can be tuned to have a high breakdown point, independently of the dimension $p$ of the covariates $X$. As the name suggests, the LTS estimator minimizes the trimmed sum of the squared residuals. For a given choice of $n/2 \leq h \leq n$, the LTS estimator minimizes
\[ \sum_{i=1}^{h} r^2_{(i)}, \]
where $r^2_{(i)}$ denotes the $i$-th order statistic of the squared residuals.

The so-called "fast-LTS" algorithm in Rousseeuw and van Driessen (2006) uses a similar strategy to the fast-MCD algorithm described above. In this case the algorithm uses the initial random subsample to compute the ordinary least squares (OLS) regression estimator $\hat\beta$. This initial estimator is then improved using the following "concentration" steps.

Concentration step:
- Let $H \leftarrow$ { the $h$ points with smallest squared residuals $r_i^2$, where $r = Y - X\hat\beta$ }.
- Calculate the OLS estimator $\hat\beta$ based on the observations in $H$.

These steps, called concentration steps, are iterated until convergence or until a predetermined maximum number of iterations is exceeded. The output from this algorithm is the estimate for $\beta$ and a vector of the points that make up the best $h$-subset.

The BIRCH equivalent of LTS (LTS-BIRCH) uses the pre-processing from Section 3.2. However, this time the augmented matrix $Z = (X \,|\, Y)$ is pre-processed, and the elements of the corresponding CFs can be partitioned as follows:
\[ \mathrm{LS}_J = \begin{pmatrix} \sum_{i \in L_J} X_i \\ \sum_{i \in L_J} Y_i \end{pmatrix}, \qquad \mathrm{SS}_J = \begin{pmatrix} \sum_{i \in L_J} X_i X_i^\top & \sum_{i \in L_J} X_i Y_i \\ \sum_{i \in L_J} Y_i X_i^\top & \sum_{i \in L_J} Y_i^2 \end{pmatrix}. \]
Note that $\mathrm{LS}_J \in \mathbb{R}^{p+1}$ and $\mathrm{SS}_J \in \mathbb{R}^{(p+1) \times (p+1)}$. It can then be shown by elementary matrix algebra that $\hat\beta$ and $\sum_{i \in L} r_i^2$ (the within-cluster sum of squared residuals) can be calculated based entirely on the CFs, and therefore they are sufficient statistics for the purposes of calculating the LTS objective function for the observations in the $J$ subcluster.

The algorithm is as follows.

LTS-BIRCH:
1. Run the pre-processing algorithm.
2. For each subcluster, calculate the rank of the sum-of-squares matrix. Retain an index of those subclusters whose sum-of-squares matrices are of full rank.
3. Randomly select a sufficient number of subclusters of full rank to adequately cover the solution space. For each of these subclusters (denoted by $H$):
(a) Calculate the OLS estimator $\hat\beta$ based on $H$.
(b) Using $\hat\beta$, calculate the sum of squared residuals (SSR) for all other subclusters.
(c) Order these SSRs, and select the smallest number of subclusters with smallest SSRs such that the sum of the number of observations in the union of these subclusters exceeds $h = \alpha n$.
(d) Let $H$ be the union of these subclusters, and update its CF.
(e) Iterate until convergence or until a fixed number of iterations is exceeded.
4. Select the configuration with the smallest sum of squared residuals.
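To illustrate the claim that $\hat\beta$ and the sum of squared residuals follow from the CFs alone, the R sketch below (ours; not the ltsReg.birch code) computes both from the $(p+1) \times (p+1)$ sum-of-squares block of a (possibly merged) CF of the augmented matrix $Z = (X\,|\,Y)$.

```r
# OLS coefficients and sum of squared residuals from the SS block of a CF,
# with the response in the last column of Z = (X | Y).
ols_from_cf <- function(SS) {
  p   <- nrow(SS) - 1
  Sxx <- SS[1:p, 1:p, drop = FALSE]   # sum of X_i X_i'
  Sxy <- SS[1:p, p + 1]               # sum of X_i Y_i
  Syy <- SS[p + 1, p + 1]             # sum of Y_i^2
  beta <- solve(Sxx, Sxy)             # OLS fit (an intercept column, if any,
                                      # must already be part of X)
  ssr  <- Syy - 2 * sum(beta * Sxy) + drop(t(beta) %*% Sxx %*% beta)
  list(beta = beta, ssr = ssr)
}
```

Applying this to the merged CF of a candidate union of subclusters gives the quantities needed in steps 3(a)-(c) without touching the raw data.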
(a) DPOSS
    fast-LTS model:    MAperF = 1.728 - 0.957 csfF,   log(RMS) = -7.698
    LTS-BIRCH model:   MAperF = 1.712 - 0.938 csfF,   log(RMS) = -7.696
    log(RMS) with refinement = -7.698

(b) TRIUMF
    fast-LTS model:    Y = 0.595 X,   RMS = 0.03921
    LTS-BIRCH model:   Y = 0.597 X,   RMS = 0.03926
    RMS with refinement = 0.03921

Table 3.2: The fitted regression lines for each approach on each data set, and the mean squared residuals (with the logarithm applied for DPOSS). The refinement step applies one concentration step on the whole data set using the solution from LTS-BIRCH.

Example. Revisiting the DPOSS data, the LTS estimate is calculated using the ltsReg command from the package robustbase, and then using the equivalent BIRCH approach available in the birch package. Following the same approach as in Rousseeuw and van Driessen (1999), the data set consists of just those observations classed as stars, and the relationship between the attributes "csfF" and "MAperF" is modelled. The plot of the data, with fitted lines, is given in Figure 3.3, and the fitted models in Table 3.2(a). The fast-LTS algorithm used 5000 random starting subsamples, and the LTS-BIRCH algorithm 100 random starting subclusters. For a comparison of processing times for each algorithm, see Section 3.4.1.

Example. Returning to the TRIUMF data set, we perform LTS without an intercept term. A plot of the fits is given in Figure 3.4, with results in Table 3.2(b). The two best h-subsets have 97.7% of points in common. Note that the fast-LTS and LTS-BIRCH lines are sufficiently similar that they obscure each other.

Figure 3.3: DPOSS Data Set. The lines are the resulting regression lines for fast-LTS, LTS-BIRCH and classical LS (axes: csfF vs. MAperF). The red data points are points in both LTS h-subsets, and the darker points are those that are only in one of the sets.

Figure 3.4: TRIUMF Data Set. The lines are the resulting regression lines for fast-LTS, LTS-BIRCH and OLS (dashed); the axes are the highest and second-highest gamma-ray energies. The red points are those in both LTS h-subsets, and the darker points are those that are only in one of the sets. The x-axis has been trimmed for clarity.

3.3.3 Robust Linear Grouping Analysis

Linear Grouping Analysis (LGA) (Van Aelst et al., 2006) is a clustering algorithm for data sets in which the interest lies in detecting linear structures, such as hyperplanes. The main advantage of this algorithm over others with similar goals (e.g. Gawrysiak et al., 2001) is that it does not require an explanatory/response relationship between the variables (it is not model based), but rather finds hyperplanes using orthogonal regression.

Assuming that we are searching for $k$ hyperplanes, and given a membership vector $C_i$ for the $i$-th cluster, the optimization problem associated with LGA is
\[ \min_{C_1, \ldots, C_k} \sum_{i=1}^{k} \sum_{j \in C_i} d(x_j, H_i)^2, \quad (3.2) \]
with $d(x_j, H_i)$ the orthogonal distance of $x_j$ to hyperplane $H_i$. The minimization is taken over all possible disjoint partitions of $\{1, \ldots, n\}$ into $k$ sets $C_1, \ldots, C_k$. The proposed algorithm to find a solution to (3.2) starts from an initial configuration (obtained by drawing random subsamples from the data) and then forms clusters by assigning points to their closest hyperplane (using orthogonal distances). The current configuration is then updated using the new clusters, and this procedure is iterated until convergence. After repeating this for many starting configurations, the one with the smallest final residual orthogonal sum of squares is selected as the approximate solution to (3.2).
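The building block of this iteration is the orthogonal distance itself. By Result 2.2.1, the best-fitting hyperplane through a cluster passes through the cluster mean with normal vector given by the eigenvector of the cluster covariance matrix associated with the smallest eigenvalue, so the squared orthogonal distances are squared projections onto that direction. A brief R sketch of ours:

```r
# Squared orthogonal distances d(x_j, H)^2 from the rows of X to the
# hyperplane fitted to a cluster with the given centre and covariance matrix.
orth_dist2 <- function(X, centre, covmat) {
  ev <- eigen(covmat, symmetric = TRUE)
  a  <- ev$vectors[, which.min(ev$values)]          # normal of the hyperplane
  as.vector(sweep(as.matrix(X), 2, centre) %*% a)^2  # squared projections
}
```

Each point is then assigned to the hyperplane giving the smallest such distance.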
Recently, a robust version of this algorithm has been proposed in García-Escudero et al. (2007). For a given integer $n/2 \leq h \leq n$ (which determines the robustness properties of the solution), we need to solve
\[ \min_{H \in \mathcal{H}} \; \min_{C_1^H, \ldots, C_k^H} \; \sum_{i=1}^{k} \sum_{j \in C_i^H} d(x_j, H_i)^2, \]
where $\mathcal{H}$ is the set of all subsets of $\{1, \ldots, n\}$ of size $h$, and, as before, $C_1^H, \ldots, C_k^H$ form a disjoint partition of size $k$ of $H$, for each $H \in \mathcal{H}$. In terms of finding an approximate solution to this problem, the proposed algorithm is very similar to that of LGA.

- Let $H \leftarrow$ { the $h$ smallest orthogonal distances $d(x_i)$, $i = 1, \ldots, n$ }, where $d(x_i)$ denotes the orthogonal distance of $x_i$ to the closest hyperplane.
- Update the hyperplanes using only the observations in $H$.

These steps are iterated until convergence or until a pre-determined upper bound on the number of iterations is attained.

Our new algorithm, RLGA-BIRCH, instead uses the subclusters derived in Section 3.2, and is given by:
1. For each subcluster, calculate its covariance matrix and the corresponding rank. Retain an index of those subclusters whose covariance matrices are of full rank.
2. Randomly select a subset of $k$ subclusters of full rank. Use each as a seed to:
(a) Calculate the hyperplane of each seed, using the covariance estimate derived from the seed's CF.
(b) Calculate the average residual orthogonal sum-of-squares of each subcluster to every hyperplane. Retain just the distance to the closest hyperplane.
(c) Order the distances, and select the smallest subset of the closest subclusters such that the sum of the number of observations within these subclusters just exceeds $h$.
(d) For each hyperplane, generate a fit based on all subclusters that belong to it and are in the "best" subset. Use these as the seeds for the next iteration.
(e) Iterate until convergence or until a predetermined maximum number of iterations is exceeded.
3. Select the configuration with the smallest residual orthogonal sum-of-squares.

Example. DPOSS data set. We construct an outlier map (Rousseeuw and Van Zomeren, 1990) by plotting the Mahalanobis distances based on the classical estimates against the Mahalanobis distances based on robust estimates, given in Figure 3.5. We then apply RLGA and RLGA-BIRCH to this data set as a means of robustly identifying the two groups.

Figure 3.5: DPOSS Data Set: the axes are the classical and robust Mahalanobis distances. The red and green points represent the output where both RLGA and RLGA-BIRCH agree. The darker points are those where the two algorithms are not in agreement.

The residual orthogonal sum-of-squares using the original algorithm is 870.0, versus 873.0 for the BIRCH approach. The two best h-subsets have 98% of observations in common. Finally, one refinement step performed using the BIRCH solution to initialize the algorithm gives a best h-subset with a residual orthogonal sum-of-squares of 870.0, and 99.8% of observations in common with the original algorithm's solution.
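For reference, an outlier map of this kind can be produced along the following lines (a sketch only; x stands for the matrix of DPOSS columns used here, and covMcd is the robustbase routine used earlier):

```r
# Outlier map (Rousseeuw and Van Zomeren, 1990): classical versus robust
# Mahalanobis distances.  `x` is a placeholder for the data matrix.
library(robustbase)
md_classical <- sqrt(mahalanobis(x, colMeans(x), cov(x)))
rob          <- covMcd(x)                       # fast-MCD location and scatter
md_robust    <- sqrt(mahalanobis(x, rob$center, rob$cov))
plot(md_classical, md_robust,
     xlab = "MD (classical)", ylab = "MD (robust)")
```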
3.4 Simulation Studies

3.4.1 Simulation Study with LTS

The example in Section 3.3.2 suggests that LTS-BIRCH compares very well with the best algorithms available to compute LTS when the data fit in memory. In this section we report the results of a simulation study that confirms this observation. We also provide a simple example where LTS-BIRCH does significantly better than fast-LTS using far fewer starting candidates. The reason for this lies in the much smaller (though coarser) solution space belonging to LTS-BIRCH, which is easier to explore thoroughly via a random search algorithm.

Let the level of contamination be denoted $c\%$. Each data set has $[n(1-c)]$ observations generated from the null model $Y = X\beta_0 + \epsilon$, $X \sim N_p(0, I)$, $\epsilon \sim N(0,1)$, with $\beta_0 = (1, \ldots, 1)^\top$. We added $[nc]$ contaminated observations following a linear model with $\beta = (-1, 1)^\top$ when $p = 2$, and $\beta = (5.5, -23.5, 1, \ldots, 1)^\top$ when $p = 20$, and covariates $X_c \sim N_p(\mu, \Sigma)$ with $\mu = (5, 0, \ldots, 0)^\top$ and $\Sigma$ a diagonal matrix with 1.1 on the diagonal.

In order to make a fair comparison, we introduce two additional methods, referred to as LTS-BIRCH-R and fast-LTS-BIRCH. LTS-BIRCH-R takes the best solution from LTS-BIRCH and uses it as a seed for up to twenty concentration steps (fewer if the "best" subset converges earlier) using the whole data set. This allows for a more realistic comparison of objective functions with fast-LTS.

It is difficult to make meaningful comparisons of processing times due to differences in the language used for coding each algorithm. The fast-LTS algorithm is written in FORTRAN and somewhat mature. The tree-building part of LTS-BIRCH is written in C++, which was chosen for ease of development rather than speed, while the LTS part of the algorithm is coded in R. Finally, the concentration steps are performed in R.

In order to get a rough upper bound on the timing of the R part of these algorithms, we propose the following: build the CF-tree in C++, and then perform LTS using fast-LTS on just the centres of the subclusters. We call this method fast-LTS-BIRCH. While its results with respect to the objective function could be very different from those of LTS-BIRCH, the processing time of fast-LTS-BIRCH approximates the time of LTS-BIRCH were it to be coded in C/FORTRAN, since (a) the number of starting samples and steps to convergence are approximately the same; and (b) it is an upper bound, because fast-LTS performs many more concentration steps on different starting candidates than LTS-BIRCH-R.

The metrics used to measure the effectiveness of each algorithm are:

(a) the percentage difference in the "best h mean squared residual" compared to the residuals obtained with $\beta_0$, i.e.
\[ \frac{\bar r_i - \bar r_{i,0}}{\bar r_{i,0}} \times 100, \]
where $\bar r_i$ is the average of the h-subset of squared residuals from the given model (fast-LTS, LTS-BIRCH, etc.), and $\bar r_{i,0}$ is the average of the $h$ smallest squared "residuals" $Y_i - \beta_0^\top X_i$;

(b) the percentage of "outliers" (contaminated observations) that are included in the best h-subset.

In each case, we generated 100 samples, and the outputs from fast-LTS, LTS-BIRCH and LTS-BIRCH-R were retained (for fast-LTS-BIRCH, just the processing time is kept), for the following combinations of sample size $n$, number of covariates $p$, and proportion of contamination $c$:
- $p$ = 2, 20
- $n$ = 30,000, 100,000, 250,000, 1,000,000
- $c$ = 0%, 15%, 35%.

The parameters used are given in Table 3.3, and were set in order to achieve approximately 3,000-4,000 subclusters. The algorithm fast-LTS used 500 subsamples for starting seeds, and LTS-BIRCH and LTS-BIRCH-R used 100. The level $\alpha$ was set at 0.5.

                    n = 30,000   n = 100,000   n = 250,000   n = 1,000,000
  p = 2,  c = 0%        0.05          0.1           0.1            0.15
  p = 2,  c = 15%       0.1           0.1           0.1            0.3
  p = 2,  c = 35%       0.1           0.1           0.1            0.3
  p = 20, c = 35%       22            25            27             30

Table 3.3: The parameters used for creating the CF-tree using BIRCH for the LTS simulation study. The parameters are closeness and compactness; in each simulation the two parameters are equal.
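For concreteness, one simulated data set under this design (the $p = 2$ case) could be generated as sketched below; whether the contaminating observations carry their own error term is not stated above, so standard normal errors are assumed here, and the function name is ours.

```r
# One simulated data set for the LTS study, p = 2.
make_lts_data <- function(n, cont = 0.35) {
  p <- 2
  n_good <- floor(n * (1 - cont)); n_bad <- n - n_good
  beta0  <- rep(1, p)
  Xg <- matrix(rnorm(n_good * p), ncol = p)                  # X ~ N_p(0, I)
  Yg <- drop(Xg %*% beta0) + rnorm(n_good)                   # null model
  Xb <- sweep(matrix(rnorm(n_bad * p, sd = sqrt(1.1)), ncol = p),
              2, c(5, 0), "+")                               # X_c ~ N_p((5,0)', 1.1 I)
  Yb <- drop(Xb %*% c(-1, 1)) + rnorm(n_bad)                 # contaminating model
  list(X = rbind(Xg, Xb), Y = c(Yg, Yb))
}
```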
The processing time was measured on an Intel Xeon CPU 3.00GHz Linux box. Box plots of the first metric are given in Figure 3.6 and bar plots of the second metric in Figure 3.7. The processing times are given in Table 3.4.

Firstly, with the parameters p = 2, c = 0% and p = 2, c = 15% (Figures 3.6(a), (b)), it is clear that all methods produce residuals very close to those of the null model, and the performance bar plot (omitted) reinforces this, with both methods having less than 5% contaminated observations in the best h-subset for all simulations. However, for p = 2, c = 35% it becomes apparent that LTS-BIRCH is beginning to outperform fast-LTS in both residuals (Figure 3.6(c)) and percent contamination (Figure 3.7(e)). Finally, for p = 20, c = 35%, LTS-BIRCH is clearly still finding the correct solution, whereas fast-LTS is not (Figures 3.6(d) & 3.7(f)).

In terms of processing time, it is not surprising that in all cases fast-LTS was faster than LTS-BIRCH and LTS-BIRCH-R, but as identified previously, it is not reasonable to make direct comparisons due to differences in computing language. Instead, a more reasonable comparison in processing time can be made between fast-LTS and fast-LTS-BIRCH, which are clearly much closer. Given that there is a lot of scope for making this algorithm more efficient, primarily through the choice of the language in which it is coded, it is expected that with further development this algorithm will be much faster.

Figure 3.6: Box plots of the mean difference between the "true" LTS residual and the one calculated using different methods (y-axis: (Avg(r) - Avg(r0))/Avg(r0) x 100), for (a) p = 2, c = 0%; (b) p = 2, c = 15%; (c) p = 2, c = 35%; (d) p = 20, c = 35%, the last on a different y-axis scale. The methods are: (1) fast-LTS; (2) LTS-BIRCH; (3) LTS-BIRCH-R. Each data set (given by n) is simulated 100 times.

                 p = 2, c = 0%                    p = 2, c = 15%
  n              (1)    (2)    (3)    (4)         (1)    (2)    (3)    (4)
  30,000         0.8   21.0   39.6    3.2         0.8   12.5   13.1    2.2
  100,000        2.1   17.3   32.8    4.5         2.0   19.1   21.5    4.8
  250,000        4.6   25.6   49.2    9.2         4.4   29.6   36.7    9.7
  1,000,000     10.6   41.9   85.4   27.4        11.7   26.7   55.7   19.5

                 p = 2, c = 35%                   p = 20, c = 35%
  n              (1)    (2)    (3)    (4)         (1)    (2)    (3)    (4)
  30,000         0.7   11.8   12.1    2.3         2.4   55.2   57.0    4.3
  100,000        1.9   18.0   19.4    4.7         4.2   57.0   65.1    7.7
  250,000        4.1   28.2   32.8    9.7        10.3   59.7   83.3   14.3
  1,000,000     11.5   26.6   52.5   19.4        40.3   95.0  225.9   49.5

Table 3.4: The median total processing time for each LTS algorithm in seconds. Key: (1) fast-LTS; (2) LTS-BIRCH; (3) LTS-BIRCH-R; (4) fast-LTS-BIRCH.
Figure 3.7: Bar plots of the percentage of contaminated observations in the "best" subset for LTS (binned as 0-5%, 5-25%, 25-75%, >75%), for (e) p = 2, c = 35% and (f) p = 20, c = 35%. Each data set is generated 100 times and the percent contamination in the h-subset is retained. The darkest bars are for fast-LTS, the mid-shade for LTS-BIRCH, and the lightest bars for LTS-BIRCH-R.

It should be noted that, regardless of whether or not it is faster than fast-LTS, LTS-BIRCH-R does very well in all scenarios. In those situations where the fast-LTS solution has a slightly lower objective function than LTS-BIRCH (in particular, the few visible in Figures 3.6(a), (b)), concentration steps on the whole data set using the LTS-BIRCH solution as a seed yield at worst identical results to the fast-LTS solution, and in most cases a better solution.

3.4.2 Simulation Study with MCD

In this section we confirm the results found in Section 3.3.1, demonstrating via a simulation study how the number of dimensions, the level of contamination, and the number of observations affect the performance of fast-MCD and MCD-BIRCH.

Our simulation involved selecting different levels of $n$, $p$ and contamination ($c\%$), and generating 100 data sets. We then calculated the MCD for each data set using the fast-MCD algorithm as implemented in the R package robustbase, followed by MCD-BIRCH, and MCD-BIRCH with a refinement step consisting of a sequence of concentration steps performed using the best MCD-BIRCH solution as the initial seed. This last method is denoted MCD-BIRCH-R, and provides a more direct comparison, since fast-MCD applies a number of concentration steps to its best solutions. Once again we also provide the timings for fast-MCD-BIRCH for comparison with fast-MCD, with the former method carrying out fast-MCD on just the centres of the subclusters as an estimate of the upper bound on the processing time of MCD-BIRCH were it to be coded in FORTRAN/C++.

Each data set had $[n(1-c)]$ observations generated from the null model $X \sim N_p(\mu_0, I)$ with $\mu_0 = (0, \ldots, 0)^\top$. We then added $[nc]$ contaminated observations from the model $X \sim N_p(\mu, I)$ with $\mu = (10, \ldots, 10)^\top$. This is the same simulation study as that performed in Rousseeuw and van Driessen (1999).

The metrics used to measure the effectiveness of each algorithm are:

(a) the logarithm of the determinant of the sample covariance of the best h-subset, i.e. $\log(|S|)$, where $|S|$ is the MCD estimate from the method used; and

(b) the percentage of "outliers" (contaminated observations) that are included in the best h-subset.

The logarithm is applied because of the sometimes large differences in MCD estimates between algorithms and simulations.

The parameters used for creating the BIRCH trees are given in Table 3.5. Plots of each metric and method for p = 20 are given in Figures 3.8 and 3.9, and similarly in Figures 3.10 and 3.11 for p = 30.
Finally, the processing time was measured on an Intel Xeon CPU 3.00GHz Linux box; the median times are given in Table 3.6.

                     n = 100,000   n = 250,000   n = 1,000,000
  p = 20, c = 30%         23            26             30
  p = 20, c = 40%         23            26             30
  p = 30, c = 30%         40            45             50
  p = 30, c = 40%         40            45             50

Table 3.5: The parameters used for creating the CF-tree using BIRCH for the MCD simulation study. The parameters are closeness and compactness; in each simulation the two parameters are equal.

Figure 3.8: Box plots of the logarithm of the MCD estimate (log|S|) for p = 20 ((a) c = 30%, (b) c = 40%), calculated using three different methods: (1) fast-MCD; (2) MCD-BIRCH; (3) MCD-BIRCH-R. Each data set (given by n) is simulated 100 times.

Figure 3.9: Bar plots of the percentage of contaminated observations in the "best" subset for MCD (binned as <5%, 5-25%, 25-50%, >50%), for p = 20 ((a) c = 30%, (b) c = 40%). The data set is generated 100 times and the percent contamination in the h-subset is retained. The darkest bars are for fast-MCD, the mid-shade for MCD-BIRCH, and the lightest bars for MCD-BIRCH-R.

Figure 3.10: Box plots of the logarithm of the MCD estimate (log|S|) for p = 30 ((a) c = 30%, (b) c = 40%), calculated using three different methods: (1) fast-MCD; (2) MCD-BIRCH; (3) MCD-BIRCH-R. Each data set (given by n) is simulated 100 times.

In the set of plots for p = 20, it is clear that the number of observations, n, does not significantly affect the performance of the fast-MCD or MCD-BIRCH-R methods, and in all cases the median fast-MCD estimate is larger than the median MCD-BIRCH-R estimate. What is interesting is that as n increases, so too does the MCD-BIRCH estimate; yet in the bar plots in Figure 3.9 the MCD-BIRCH method still performs very well, with no contaminated observations in the best h-subset. This will be discussed shortly.

In the set of plots for p = 30, it is clear that the MCD-BIRCH-R method consistently produces the most optimal result, whereas fast-MCD only finds an equivalent result a few times. For the remainder of the simulations, fast-MCD and MCD-BIRCH produce similar, less optimal results (relative to MCD-BIRCH-R).

In both sets of box plots of the MCD, it is clear that as n increases, so too does the MCD for MCD-BIRCH, and yet the performance shown in the bar plots remains perfect. While at first glance this may appear contradictory, upon further investigation it appears to be an artifact of the methodology for selecting the radius and compactness attributes, i.e. choosing parameters such that we have approximately 3000-4000 subclusters. Therefore, although the number of subclusters has remained the same, the magnitude of the data reduction, equivalently the increase in data granularity, is getting higher. The estimates of centre and dispersion derived from the MCD-BIRCH algorithm turn out to be nearly identical to the "known" result in all cases. This phenomenon deserves more investigation, and is left for further research.
Figure 3.11: Bar plots of the percentage of contaminated observations in the "best" subset for MCD (binned as <5%, 5-25%, 25-50%, >50%), for p = 30 ((a) c = 30%, (b) c = 40%). The data set is generated 100 times and the percent contamination in the h-subset is retained. The darkest bars are for fast-MCD, the mid-shade for MCD-BIRCH, and the lightest bars for MCD-BIRCH-R.

                 p = 20, c = 30%                          p = 30, c = 30%
  n              (1)      (2)      (3)      (4)           (1)      (2)      (3)      (4)
  100,000        6.12    31.18    39.66     7.74          9.92    39.86    50.36    10.38
  250,000       16.22    44.31    72.92    14.82         29.88    57.50   109.78    19.45
  1,000,000     69.02    91.47   245.42    50.75        113.72   116.82   364.35    65.76

                 p = 20, c = 40%                          p = 30, c = 40%
  n              (1)      (2)      (3)      (4)           (1)      (2)      (3)      (4)
  100,000        8.29    50.83    60.90     8.28         16.37    56.31    74.50    12.52
  250,000       15.83    52.78    75.82    14.38         26.96    65.81   105.17    21.19
  1,000,000     64.17    96.95   203.91    48.72        109.50   115.75   314.15    65.41

Table 3.6: The median total processing time for each MCD algorithm in seconds. Key: (1) fast-MCD; (2) MCD-BIRCH; (3) MCD-BIRCH-R; (4) fast-MCD-BIRCH.

Based on the results in Table 3.6, fast-MCD has a shorter processing time than MCD-BIRCH and MCD-BIRCH-R in all cases. It should also be reiterated that fast-MCD is entirely written in FORTRAN, and therefore the fast-MCD-BIRCH algorithm provides a more direct comparison of timings. In this case, therefore, the BIRCH algorithm is significantly faster than the fast-MCD algorithm when n = 250,000 and 1,000,000, regardless of p, and certainly provides a direction for further optimization.

3.5 Conclusion

Computing robust estimators with good robustness properties generally requires finding the solution to highly complex optimization problems. The current state-of-the-art algorithms for finding approximate solutions to these problems involve the local refinement of a potentially large number of random starting points. These random initial candidates are generally constructed by drawing random subsamples from the data, and the iterative refinement steps involve calculations over the whole data set. Since these approximating algorithms typically need to access the entire data set a very large number of times, they become unfeasible in practice when the data do not fit in the computer's memory.

In this paper we show how the BIRCH algorithm can be adapted to calculate approximate solutions to problems in this class. We illustrate our approach by applying it to find approximate solutions to the optimization problems that define the LTS and MCD estimators, and also to the RLGA algorithm. Our approach is compared with the fast-LTS and fast-MCD algorithms on real large data sets that fit in memory. We show that, even for very large data sets that fit in memory, our proposal is able to find approximate LTS and MCD estimators that compare very well with those returned by the fast-LTS and fast-MCD algorithms. Furthermore, in this case (large data sets that fit in the computer's memory), our approximate solution can easily be iteratively refined (as it is done in both fast-LTS and fast-MCD).
In some cases, this procedure converges to the same solution as fast-LTS and fast-MCD (or an extremely close one); but in other cases our approach is able to find a better solution (in terms of the value of the objective function) than that returned by the fast- algorithms. Note that our algorithm can find this good starting point much faster than both the fast-LTS and fast-MCD algorithms.

Finally, the simulation studies indicate that our method performs comparably to fast-LTS and fast-MCD in simple situations (large data sets with a small number of covariates and a small proportion of outliers) and does much better than fast-LTS and fast-MCD in more challenging situations, without requiring extra computational time. These findings seem to confirm that, when attempting to find a good approximate LTS or MCD estimator for data sets that do not fit in memory, our approach provides a computationally feasible and reliable algorithm. Furthermore, when additional concentration steps are performed on the whole data set using the single best result from LTS-BIRCH and MCD-BIRCH, the resulting estimators are always at least as good as those from fast-LTS and fast-MCD.

To the best of our knowledge there is currently no other reliable algorithm in the literature that can be used for these problems without having to access the whole data set a large number of times.

Bibliography

M. M. Breunig, H.-P. Kriegel, P. Kröger, and J. Sander. Data bubbles: Quality preserving performance boosting for hierarchical clustering. ACM SIGMOD, 2001.

T. Chiu, D. Fang, J. Chen, Y. Wang, and C. Jeris. A robust and scalable clustering algorithm for mixed type attributes in large database environment. Proceedings of the seventh ACM SIGKDD, 2001.

S. Engela, D. Hutcheon, S. Bishop, L. Buchmann, J. Caggiano, M. Chatterjee, A. Chen, J. D'Auria, D. Gigliotti, U. Greife, D. Hunter, A. Hussein, C. Jewett, A. Laird, M. Lamey, W. Liu, A. Olin, D. Ottewell, J. Pearson, C. Ruiz, G. Ruprecht, M. Trinczek, C. Vockenhuber, and C. Wrede. Commissioning the DRAGON facility at ISAC. Nuclear Instruments and Methods in Physics Research, A553, 2005.

L. García-Escudero, A. Gordaliza, R. San Martín, S. Van Aelst, and R. H. Zamar. Robust linear clustering. Unpublished manuscript, 2007.

P. Gawrysiak, H. Rybinski, and M. Okoniewski. Regression - yet another clustering method. In International Intelligent Information Systems Conference, Advances in Soft Computing Journal. Springer-Verlag, 2001.

J. Harrington and M. Salibian-Barrera. birch: Working with very large data sets. Submitted to Journal of Statistical Software, 2008. URL http://www.stat.ubc.ca/~harringt/birch/birch-jss.pdf.

D. Hutcheon, S. Bishop, L. Buchmann, M. Chatterjee, A. Chen, J. D'Auria, S. Engel, D. Gigliotti, U. Greife, D. Hunter, A. Hussein, C. Jewett, N. Khan, M. Lamey, A. Laird, W. Liu, A. Olin, D. Ottewell, J. Rogers, G. Roy, H. Sprenger, and C. Wrede. The DRAGON facility for nuclear astrophysics at TRIUMF-ISAC. Nuclear Physics, A718:515-517, 2003.

S. Nassar, J. Sander, and C. Cheng. Incremental and effective data summarization for dynamic hierarchical clustering. ACM SIGMOD, 2004.

R. T. Ng and J. Han. CLARANS: A Method for Clustering Objects for Spatial Data Mining. IEEE Transactions on Knowledge and Data Engineering, 14(5):1003-1016, 2002.

S. Odewahn, S. Djorgovski, R. Brunner, and R. Gal. Data from the Digitized Palomar Sky Survey. Technical report, Dept. of Astronomy, California Institute of Technology, Pasadena, CA 91125, 1998.
R Development Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2007. URL http://www.R-project.org. ISBN 3-900051-07-0.

D. M. Rocke and J. Dai. Sampling and subsampling for cluster analysis in data mining: With applications to sky survey data. Data Mining and Knowledge Discovery, 7:215-232, 2003.

P. Rousseeuw and A. Leroy. Robust Regression and Outlier Detection. Wiley, 1987.

P. J. Rousseeuw. Least median of squares regression. Journal of the American Statistical Association, 79(388):871-880, 1984.

P. J. Rousseeuw and K. van Driessen. Computing LTS regression for large data sets. Data Mining and Knowledge Discovery, 12:29-45, 2006.

P. J. Rousseeuw and K. van Driessen. A fast algorithm for the minimum covariance determinant estimator. Technometrics, 41(3):212-223, August 1999.

P. J. Rousseeuw and B. C. Van Zomeren. Unmasking multivariate outliers and leverage points. Journal of the American Statistical Association, 85(411):633-639, 1990.

M. Salibian-Barrera, G. Willems, and R. Zamar. Computation of tau-estimates for linear regression. Unpublished manuscript, 2007.

A. K. H. Tung, X. Xu, and B. C. Ooi. CURLER: finding and visualizing nonlinear correlation clusters. ACM SIGMOD, 2005.

S. Van Aelst, X. Wang, R. H. Zamar, and R. Zhu. Linear Grouping Using Orthogonal Regression. Computational Statistics & Data Analysis, 50(5):1287-1312, Mar. 2006.

W. Wang, J. Yang, and R. Muntz. STING: A statistical information grid approach to spatial data mining. In Proc. 24th Int. Conf. Very Large Data Bases, 1998.

H. Wedekind. On the selection of access paths in a data base system. In J. Klimbie and K. Koffeman, editors, Data Base Management. Amsterdam: North-Holland, 1974.

T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: An efficient data clustering method for very large databases. In Proceedings of the 1996 ACM SIGMOD International Conference, 1996.

Chapter 4

Conclusion

This chapter contains an outline of the results obtained in the thesis, the problems that we encountered, and the directions we foresee for future work.

Chapter 2 - Extending linear grouping analysis:

- We reviewed the Linear Grouping criterion (Van Aelst et al. (2006)), as implemented in the R package lga, and discussed two requirements for LGA to work. The first requirement was with respect to the smallest eigenvalues of the covariance matrix of a cluster. In particular, the multiplicity of the smallest eigenvalue had to be one in order for the associated eigenvector to be uniquely defined and orthogonal regression to be applicable. One common situation where the multiplicity of the smallest eigenvalue would be greater than one was the case where there were fewer observations than dimensions within a cluster, a situation that was relevant to following sections. The second requirement was that the cluster must be linear in shape.

- We then introduced a substitute measure, called Partially Orthogonal Regression, for the situation where the first requirement had been violated due to there being more dimensions than observations. In particular, we considered two clusters that had multiplicity issues, and provided a measure for determining to which cluster an observation (which was not part of either cluster) belonged.
It was made up of two distance measures, the first measuring distance in the basis where the eigenvalues were positive, and the second measuring how far away an observation was from a cluster in the dimensions where the eigenvalues were zero.

- We then relaxed the second requirement by introducing linear grouping within a feature space, via an explicit or implicit transformation to linearity. In order to do this, we leveraged Kernel PCA methodology and the Kernel Trick. One side effect of these transformations, however, was that typically the dimension of the space increased dramatically (especially with implicit transformations), such that the number of observations was less than the number of dimensions. Therefore, a POD approach in the feature space was provided, along with an orthogonal regression approach in the feature space.

- It remains for further study what the best tactic would be for the selection of kernel functions, as well as their tuning, though this is a problem prevalent in much of the scholarship associated with the use of kernel methods. A good, motivating example would also be of some value.

Chapter 3 - Finding Approximate Solutions to Combinatorial Problems with Very Large Data Sets using BIRCH:

- We reviewed the algorithm BIRCH, as originally discussed in Zhang et al. (1997), which was first used as an approximate method for clustering data in the case where the data set did not fit in the available physical memory. This approach created a summary of the data set comprised of compact subclusters of observations, each with associated statistics that allowed for the calculation of an objective function on combinations of the subclusters, without needing to refer back to the original data set.

- The nature of combinatorial problems was then discussed more generally, in particular how the data set needs to be referred to a number of times during optimization, and how the size of the solution space means that the problem gets more difficult as the dimensions and/or number of observations increase. The application of BIRCH was then generalized to a particular class of combinatorial problem where the objective function can be calculated in terms of the clustering features, which is a much simpler optimization problem than optimizing the objective function on the whole data set.

- We then extended BIRCH to three challenging optimization problems in robust statistics, namely the Minimum Covariance Determinant estimate of location and dispersion, the Least Trimmed Squares estimate for building linear models, and RLGA, the robust equivalent of LGA, and showed its efficacy on real data sets and, for the first two methods, in simulation studies as well. In all cases, the BIRCH approach yielded at least equivalent, and in many cases improved, solutions relative to the original algorithms.

- While we demonstrated that the selection of parameters for BIRCH did not materially affect the results for each algorithm, especially in the cases where optional refinement steps occurred, further research on the selection of parameters might provide additional insight into further improvement. Lastly, other examples of difficult optimization problems could benefit from an implementation within the BIRCH framework, if practical.

Bibliography

S. Van Aelst, X. Wang, R. H. Zamar, and R. Zhu. Linear Grouping Using Orthogonal Regression. Computational Statistics & Data Analysis, 50(5):1287-1312, Mar. 2006.

T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: A New Data Clustering Algorithm and Its Applications, volume 1 of Data Mining and Knowledge Discovery, pages 141-182. Springer Netherlands, June 1997.
Appendix A

Proofs and Derivations for Chapter 2

The proof of Result 2.2.1

Orthogonal regression is the technique in which we select a hyperplane such that the orthogonal distances between the observations and the hyperplane are minimized. We define the hyperplane by $H(a, b) = \{ y : a^\top y = b \}$. The squared orthogonal distance between an observation $y$ and the hyperplane $H$ is then given by
\[ d_O^2(y, H) = (a^\top y - b)^2. \]
We have observations $x_1, \ldots, x_n$. Then minimizing the squared orthogonal distances yields
\[ \min_{\|a\|=1,\, b} \sum_{i=1}^n (a^\top x_i - b)^2 = \min_{\|a\|=1,\, b} \sum_{i=1}^n (a^\top x_i - a^\top \bar x + a^\top \bar x - b)^2 \]
\[ = \min_{\|a\|=1,\, b} \sum_{i=1}^n \Big\{ (a^\top x_i - a^\top \bar x)^2 - 2 (a^\top x_i - a^\top \bar x)(a^\top \bar x - b) + (a^\top \bar x - b)^2 \Big\} \]
\[ = \min_{\|a\|=1,\, b} \sum_{i=1}^n (a^\top x_i - a^\top \bar x)^2 + n (a^\top \bar x - b)^2 = \min_{\|a\|=1,\, b} \; a^\top S a + n (a^\top \bar x - b)^2. \]
It is clear that the second term is minimized by setting $b = a^\top \bar x$. To find the minimum of the first term,
\[ \min_{\|a\|=1} a^\top S a = \min_{\|a\|=1} a^\top E \Lambda E^\top a, \]
where $E \in \mathbb{R}^{d \times d}$ is a matrix with the eigenvectors of $S$ as columns, and $\Lambda \in \mathbb{R}^{d \times d}$ is a diagonal matrix with the corresponding eigenvalues on the diagonal. Let $w = E^\top a$. Then
\[ \min_{\|a\|=1} a^\top E \Lambda E^\top a = \min_{\|w\|=1} w^\top \Lambda w \quad (\text{as } \|w\| = \|a\|) = \min_{\|w\|=1} \sum_{i=1}^d w_i^2 \lambda_i \;\geq\; \min_{\|w\|=1} \lambda_d \sum_{i=1}^d w_i^2 = \lambda_d. \]
Finally, by substitution, this lower bound is achieved when $a = e_d$. $\blacksquare$

The rationale for Definition 2.3.1

Using Equation (2.3) as the starting point, we need to calculate
\[ d_P^2(y; E, \mu) = d_E^2(y, p) + d_S^2(p, \tilde p). \]
Start with $d_E^2(y, p)$: it can be shown that the distance between $y$ and its projection $p$ onto the basis $E = (e_1 | \cdots | e_\nu)$ going through $\mu$ is equal to the distance between $z = (y - \mu)$ and the basis $E$ going through the origin. Let $p^*$ be the projection of $z$ onto $E$ (going through the origin), given by
\[ p^* = \sum_{i=1}^{\nu} (e_i \cdot z)\, e_i = E E^\top z. \quad (A.1) \]
Therefore,
\[ z - p^* = z - E E^\top z = (I - E E^\top) z. \]
Let $P_1 = I - E E^\top$, and note that $P_1$ is idempotent and symmetric. Then
\[ d_E^2(y, p) = d_E^2(z, p^*) = (z - p^*)^\top (z - p^*) = z^\top P_1^\top P_1 z = z^\top P_1 z. \]
Now for $d_S^2(p, \tilde p)$: this can be thought of as the distance between $p$ and its projection $\tilde p$ onto the basis $\tilde E = (e_1 | \cdots | e_{\nu-1})$ going through $\mu$, which is equal to the distance between $p^*$ and its projection $\tilde p^*$ onto the basis $\tilde E$ going through the origin. Therefore,
\[ \tilde p^* = \tilde E \tilde E^\top p^* \]
and
\[ p^* - \tilde p^* = \sum_{i=1}^{\nu} (e_i \cdot z)\, e_i - \sum_{i=1}^{\nu-1} (e_i \cdot z)\, e_i = e_\nu e_\nu^\top z \quad (A.2) \]
by substituting in the definition of $p^*$ from equation (A.1). Finally, in the article we defined the scaled distance $d_S^2(a, b)$ in a manner similar to a Mahalanobis distance, and stated that the direction which needs to be scaled is given by $\lambda_\nu$.
Instead of simply applying this directly, we scale the direction with the inverse of the whole covariance matrix, and show that this is indeed the case:
\[ d_S^2(p, \tilde p) = d_S^2(p^*, \tilde p^*) = (p^* - \tilde p^*)^\top C^{-1} (p^* - \tilde p^*), \]
where $C = E \Lambda E^\top$ is the sample covariance in the basis $E$. Then, from equation (A.2),
\[ = z^\top e_\nu e_\nu^\top E \Lambda^{-1} E^\top e_\nu e_\nu^\top z = \frac{1}{\lambda_\nu}\, z^\top e_\nu e_\nu^\top z. \quad \blacksquare \]

The proof of Lemma 2.4.1

Starting with the definition $(\tilde K)_{ij} \equiv \tilde K_{ij} = \tilde\kappa(x_i, x_j)$ (omitting the bracket for convenience),
\[ \tilde K_{ij} = \tilde\Phi(x_i)^\top \tilde\Phi(x_j) = \Big( \Phi(x_i) - \frac{1}{n} \sum_k \Phi(x_k) \Big) \cdot \Big( \Phi(x_j) - \frac{1}{n} \sum_l \Phi(x_l) \Big) \]
\[ = \Phi(x_i) \cdot \Phi(x_j) - \frac{1}{n} \sum_l \Phi(x_i) \cdot \Phi(x_l) - \frac{1}{n} \sum_k \Phi(x_k) \cdot \Phi(x_j) + \frac{1}{n^2} \sum_k \sum_l \Phi(x_k) \cdot \Phi(x_l) \]
\[ = K_{ij} - \frac{1}{n} \sum_l K_{il} - \frac{1}{n} \sum_k K_{kj} + \frac{1}{n^2} \sum_k \sum_l K_{kl}. \quad (A.3) \]
Expressing this as matrix operations yields the result. $\blacksquare$

The proof of Result 2.4.2

We are interested in finding the solution to
\[ \lambda e = C_\Phi e, \quad (A.4) \]
where
\[ C_\Phi = \frac{1}{n} \sum_{i=1}^n \tilde\Phi(x_i) \tilde\Phi(x_i)^\top. \quad (A.5) \]
Substituting equation (A.5) into (A.4) yields
\[ \lambda e = \frac{1}{n} \sum_{i=1}^n \big( \tilde\Phi(x_i)^\top e \big)\, \tilde\Phi(x_i). \]
This means that all solutions $e$ lie in the span of the $\tilde\Phi(x_i)$. Therefore
\[ e = \sum_{i=1}^n \alpha_i\, \tilde\Phi(x_i) \quad (A.6) \]
for some scalars $\alpha_1, \ldots, \alpha_n$. Substituting equations (A.5) & (A.6) into equation (A.4), and multiplying by $\tilde\Phi(x_k)$, yields
\[ \lambda\, \tilde\Phi(x_k) \cdot \Big( \sum_{i=1}^n \alpha_i \tilde\Phi(x_i) \Big) = \tilde\Phi(x_k) \cdot \Big( \frac{1}{n} \sum_{j=1}^n \tilde\Phi(x_j)\tilde\Phi(x_j)^\top \Big) \Big( \sum_{i=1}^n \alpha_i \tilde\Phi(x_i) \Big) \]
\[ \lambda \sum_{i=1}^n \alpha_i\, \tilde\Phi(x_k) \cdot \tilde\Phi(x_i) = \frac{1}{n} \sum_{j=1}^n \sum_{i=1}^n \alpha_i \big( \tilde\Phi(x_k) \cdot \tilde\Phi(x_j) \big)\big( \tilde\Phi(x_j) \cdot \tilde\Phi(x_i) \big) \]
\[ \lambda \sum_{i=1}^n \alpha_i\, \tilde K_{ki} = \frac{1}{n} \sum_{j=1}^n \sum_{i=1}^n \alpha_i\, \tilde K_{kj} \tilde K_{ji}. \]
Generalizing this over $k$ yields
\[ n\lambda\, \tilde K \alpha = \tilde K^2 \alpha. \quad (A.7) \]
It can be shown (though currently omitted) that the solutions $\lambda$ and $\alpha$ of the equation
\[ n\lambda\, \alpha = \tilde K \alpha \]
are the same as those of equation (A.7). Finally, to make these eigenvectors orthonormal, we calculate the norm
\[ e_i \cdot e_i = \sum_{j=1}^n \sum_{k=1}^n \alpha_i^j \alpha_i^k\, \tilde\Phi(x_j)^\top \tilde\Phi(x_k), \]
where $\alpha_i^j$ is the $j$-th element of $\alpha_i$,
\[ = \sum_{j=1}^n \sum_{k=1}^n \alpha_i^j \alpha_i^k\, \tilde K_{jk} = \alpha_i \cdot \tilde K \alpha_i = n\lambda_i\, \alpha_i \cdot \alpha_i = n\lambda_i. \]
Therefore, for $e_i$ to be orthonormal, divide $\alpha_i$ by $\sqrt{n\lambda_i}$. Obviously, if $e_i \cdot e_j = 0$ then so too is $n\sqrt{\lambda_i \lambda_j}\, e_i \cdot e_j$ (i.e. they remain orthogonal). $\blacksquare$
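Lemma 2.4.1 is easy to check numerically. The R sketch below (ours) uses a simple polynomial kernel, for which the feature map is available explicitly, and compares the centred kernel matrix of equation (A.3) with the Gram matrix of the centred feature vectors.

```r
# Numerical check of Lemma 2.4.1 (centring the kernel matrix), using the
# kernel k(x, y) = (x'y)^2 purely for illustration.
set.seed(1)
X   <- matrix(rnorm(20), ncol = 2)
phi <- function(x) c(x[1]^2, sqrt(2) * x[1] * x[2], x[2]^2)   # explicit feature map
K   <- (X %*% t(X))^2                      # kernel matrix, K_ij = (x_i'x_j)^2
n   <- nrow(X)
One <- matrix(1 / n, n, n)
K_tilde <- K - One %*% K - K %*% One + One %*% K %*% One      # equation (A.3)

# Direct computation in the feature space for comparison.
Phi  <- t(apply(X, 1, phi))
Phic <- scale(Phi, scale = FALSE)          # centred feature vectors
all.equal(K_tilde, Phic %*% t(Phic), check.attributes = FALSE)
```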
= 1?lambdakmsummationdisplayj=1alphajk parenleftbigPhi(xS[j]) - ?parenrightbigbracehtipupleft bracehtipdownrightbracehtipdownleft bracehtipupright?Phi(xS[j])??sbracehtipdownleft bracehtipuprightbracehtipupleft bracehtipdownrightparenleftBigg1msummationdisplayielementSPhi(xi)parenrightBigg= 1?lambdakmsummationdisplayj=1summationdisplayielementSalphajkparenleftBigg1mPhi(xS[j]) ? Phi(xi) -1m2summationdisplaylelementSPhi(xl) ? Phi(xi)parenrightBigg= 1m2radicalbig?lambdakmsummationdisplayj=1msummationdisplayi=1alphajkparenleftBiggmKSji -msummationdisplayl=1KSliparenrightBigg= 1m2radicalbig?lambdakmsummationdisplayj=1alphajkparenleftBiggmmsummationdisplayi=1KSji -msummationdisplayi=1msummationdisplayl=1KSliparenrightBiggLet kSi be a vector, with kSi = parenleftbig(K)S[1],i (K)S[2],i ... (K)S[m],iparenrightbiglatticetop (as defined in Section 2.4.2 onpg. 30) and 1 is a vector of length m containing all ones. Thenek ? ? = 1m2radicalbig?lambdakalphalatticetopkparenleftBigmKS1 -1latticetopKS1parenrightBig(A.8)Similarly, the first term is given byek ? yi = 1?lambdakmsummationdisplayj=1alphajk parenleftbigPhi(xS[j]) - ?parenrightbig ? Phi(xi)= 1?lambdakmsummationdisplayj=1alphajkparenleftBiggPhi(xS[j]) ? Phi(xi) - 1msummationdisplaylelementSPhi(xl) ? Phi(xi)parenrightBigg= 1?lambdakmsummationdisplayj=1alphajkparenleftBiggKS[j]i - 1msummationdisplaylelementSKliparenrightBiggand writing this in vector notation= 1?lambdakalphalatticetopkparenleftbiggkSi - 1m1latticetopkSiparenrightbigg(A.9)squaresolid73Appendix A. Proofs and Derivations for Chapter 2The proof of Result 2.4.4Starting with Definition 2.3.1, and expanding the bracketd2P (y;E, ?s) = zlatticetop(P1 + P2/lambdak)z= zlatticetopz + 1lambdakzlatticetopekelatticetopk z -zlatticetopEElatticetopz (A.10)where z = y-?s. In order to express this in vector notation let zi = yi -?s, and in the remainderof this proof we tackle each of the elements of this equation.In general,zlatticetopi zj = (yi - ?s)latticetop (yj - ?s)=parenleftBiggPhi(xi) - 1nssummationdisplaykelementSPhi(xk)parenrightBigg?parenleftBiggPhi(xj) - 1nssummationdisplaylelementSPhi(xl)parenrightBigg= Kij - 1msummationdisplaylelementSKil - 1msummationdisplaykelementSKkj + 1m2summationdisplaykelementSsummationdisplaylelementSKklfrom equation (A.3). Expressing this in matrix form for all zlatticetopi zj,(i,j element {1,... ,n}) yields?Kasteriskmath = K - 1m1m ? KF - 1mKF ? 1m + 1m2 (1m ? KS) ? 1m,and the diagonal elements of this matrix give the required result.The second element of equation (A.10) can be found by applying Result 2.4.3, and recognizing thatlambdak = ?k/m from Result 2.4.2. Finally, the third element can be found by generalizing this equationfor all values of alphai,i element 1,... ,k.squaresolid74Appendix Bbirch: Working with Very Large DataSets*B.1 IntroductionDealing with large data sets in R can be difficult and time consuming, especially when the size of thedata set is larger than the available computer memory. The first difficulty arises because when thesize of a data set exceeds available physical memory, paging/swapping is required, which typicallyslows down the computer performance dramatically. An additional problem becomes apparentwhen one needs to find good approximate solutions to combinatorial problems for very large datasets. 
Appendix B

birch: Working with Very Large Data Sets

(A version of this chapter has been submitted for publication: Harrington, J. and Salibian-Barrera, M. (2008), "birch: Working with Very Large Data Sets", Journal of Statistical Software.)

B.1 Introduction

Dealing with large data sets in R can be difficult and time consuming, especially when the size of the data set is larger than the available computer memory. The first difficulty arises because when the size of a data set exceeds available physical memory, paging/swapping is required, which typically slows down computer performance dramatically. An additional problem becomes apparent when one needs to find good approximate solutions to combinatorial problems for very large data sets. More specifically, we are concerned with computing the Minimum Covariance Determinant estimator (MCD, Rousseeuw and Leroy, 1987) for multivariate observations, the Least Trimmed Squares regression estimator (LTS, ibid), the clustering provided by the Linear Grouping Analysis algorithm (LGA, Van Aelst et al., 2006), and its robust version (RLGA, García-Escudero et al., 2007). The existing approximating algorithms for these problems require multiple passes over the entire data set, and thus the effect of paging and swapping is greatly exacerbated, to the point of making them unfeasible. Furthermore, note that due to the large size of the solution space, these algorithms typically consider a number of data-driven random starts that are iteratively improved. Obtaining good approximate solutions requires a very large number of random starting points, and the number of random starts increases rapidly with the sample size, so that even for data sets that fit in memory, these algorithms may not be able to find a good approximate solution in a reasonable amount of time.

One strategy to work with large data sets is provided by a pre-processing algorithm called BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies), first introduced in Zhang et al. (1996), which summarizes the data using a tree with a relatively small number of subclusters and associated "sufficient statistics". These can be used to easily calculate sample statistics on subsets of the data without having to access the data set. This tree is built with one pass of the data, and only a single observation needs to be held in memory at any time.

The focus of this article will be describing the implementation of the package birch in R, referring the reader to Harrington and Salibian-Barrera (2007) (herein referred to as the original article) for a more in-depth discussion of the motivation and advantages of this approach. This package also provides methods for these BIRCH trees that compute very good approximate MCD estimators, LTS regression estimators, and LGA and RLGA clusterings. Additional methods are also described for performing basic operations on trees such as subsetting, plotting and basic summary statistics. Finally, we provide some examples using a commonly analyzed data set, which will demonstrate the syntax and the efficacy of this approach.

B.1.1 BIRCH

A simplified description of this algorithm is given next, with the reader referred to Zhang et al. (1996) for further details of the complete algorithm, including refinements such as self-sizing and subcluster splits. Let $x_1, \ldots, x_N$ denote the sample to be processed, where $x_j \in \mathbb{R}^d$. The algorithm creates subsets of the observations, with $L_J$ denoting the index of observations in the $J$-th subcluster. In this implementation, the clustering features (CF) are given by the tuple
\[
CF_J = \Bigl(\sum_{i \in L_J} 1, \; \sum_{i \in L_J} x_i, \; \sum_{i \in L_J} x_i x_i^\top\Bigr) = (n_J, LS_J, SS_J),
\]
i.e.
the number of observations in the leaf, the sum of those observations, and their "sum of squares". It was shown in the original article that, in each of the clustering and robust algorithms below, the above clustering features are sufficient statistics for the respective objective functions.

The algorithm processes the observations sequentially as follows:

1. The first observation $x_1$ forms its own subcluster.
2. Let $T$ be the current set of subclusters, and let $\mu_t$ and $CF_t$, $t \in T$, be the associated subcluster centres and clustering features.
3. For each of the remaining observations $x_2, \ldots, x_N$:
   (a) Calculate the distance from $x_j$ to each of the centres $\mu_t$ of the existing subclusters. Select the closest centre $\mu_{t_0}$.
   (b) If the observation meets the "closeness" and "compactness" criteria described below, then $x_j$ joins the $t_0$ subcluster, and both $\mu_{t_0}$ and $CF_{t_0}$ are updated.
   (c) If not, a new subcluster $t+1$ containing $x_j$ is formed, with $\mu_{t+1} = x_j$ and $CF_{t+1}$ the corresponding clustering features, and the list of subclusters is updated: $T = T \cup \{x_j\}$.

Once this algorithm is complete, we have a "list" of compact subclusters that are locally similar, with associated summary statistics of the underlying observations making up each subcluster.

Note that this algorithm requires the calculation of the distance between $x_j$ and every subcluster centre. This can be avoided by using more efficient structures, such as a CF-tree, which is similar to a "B+"-tree (Wedekind, 1974). In this approach, each observation is passed to a root node, with a number of non-leaf nodes as children. These non-leaf nodes contain the sample mean of all observations beneath them. The observation is then simply passed down this tree to the closest non-leaf node, and each centre is updated as the observation is given to it. Finally, when it reaches the appropriate subcluster, it is either absorbed, or else forms its own subcluster. Once a non-leaf node exceeds a (user-specified) number of subclusters, it splits, with the subclusters allocated to the split non-leaf nodes based on a proximity measure. A graph of this structure is given in Figure B.1.

Figure B.1: The tree structure used for pre-processing in birch. The area shaded in grey is the data returned as the subclusters.

The number of subclusters that are formed at the end of this algorithm, denoted as $N_L$, depends on the criteria used to decide whether a new observation being processed can join an existing subcluster or needs to form a new cluster. Consider a point $x_j$ currently being processed, and let $\mu_{t_0}$ be the closest subcluster centre. In order for $x_j$ to join the $t_0$ cluster we require:

(i) closeness: $\|x_j - \mu_{t_0}\| \le a$ for some pre-defined $a > 0$.
(ii) compactness: the trace of the sample covariance matrix of the observations in the $t_0$ cluster and $x_j$ cannot exceed a pre-defined constant $b > 0$.

Using both criteria is a departure from the approach in Zhang et al. (1996), which proposes using only one of the criteria; for a discussion of this, see the original paper. (In the implementation of this package, the user is required to set both criteria; if only one is desired, the other criterion can be set to a very high number, thereby not affecting the outcome.)

It should be noted that for combinatorial problems, such as k-means, etc., BIRCH is a useful approximation when the full data set cannot fit into memory, or is sufficiently large to make random-search algorithms unfeasible. The properties of this approximation will naturally depend on the level of coarseness of the tree, that is, the ratio of the number of subclusters to the number of observations in the underlying data set. However, any other non-combinatorial algorithm, such as ordinary least squares, that can use the CF-tuple summary statistics is doubly advantageous, since in these cases it is possible to calculate the solution exactly from the CF summaries.
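To make the role of the clustering features concrete, the short R sketch below (an illustration only, not code from the package) recovers the per-subcluster sample mean, sample covariance and the trace used in the compactness check from a tuple $(n_J, LS_J, SS_J)$ alone; these quantities correspond to the N, sumXi and sumXisq components of a birch object described in the next section.

## Illustration only: summary statistics of one subcluster from its CF tuple.
## The simulated "members" below stand in for the observations of a subcluster.
set.seed(1)
x <- matrix(rnorm(50 * 3), ncol = 3)

cf <- list(n  = nrow(x),        # number of observations
           LS = colSums(x),     # sum of the observations
           SS = crossprod(x))   # sum of x_i x_i^T (a d x d matrix)

centre <- cf$LS / cf$n
covmat <- (cf$SS - tcrossprod(cf$LS) / cf$n) / (cf$n - 1)

all.equal(centre, colMeans(x))  # TRUE: same as using the raw observations
all.equal(covmat, cov(x))       # TRUE
sum(diag(covmat))               # the trace used by the compactness criterion

Because the CF tuple can be updated when an observation joins a subcluster, none of the raw observations need to be revisited to evaluate these quantities.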
B.2 Package birch: Implementation in R

The package birch is available on CRAN (this article describes version 1.1-2 of the package) and is written in C/C++ and R (R Development Core Team, 2007). The command for creating a BIRCH tree is given by the syntax

birchOut <- birch(x, radius, compact = radius, keeptree = FALSE, ...)

where x is either an R matrix, a text file name, or any "connection" (in the R sense) to the original data set that is compatible with read.table. The arguments radius and compact specify the "closeness" and "compactness" criteria, respectively. If x is referring to a file name, then the ... argument allows the user to furnish additional information regarding the kind of data set, and is consistent with the syntax for read.table (e.g. sep="," for comma-separated variable files). The advantage of creating the tree directly from a file or a connection is that at no time is the complete data set held in memory (by design, R will process the data in batches of at most 100,000 observations).

If the argument keeptree is set to TRUE, then the CF-tree created is kept in memory. The following commands

birch.addToTree(x, birchObject, ...)
birchOut <- birch.getTree(birchObject)
birch.killTree(birchObject)

can be used to add data to the tree, retrieve the data from the tree, and remove the tree from memory. Clearly this is advantageous when additional data is to be added or when dealing with real-time data streams, though the user should be aware that, until such time as the tree is removed from memory, there are effectively two copies of the subclusters and their associated CFs and member lists in memory.

The code for creating a BIRCH tree is primarily written in C/C++, with the R API used for fetching the data. All other code (e.g. covMcd.birch, lts.birch, etc.) is written entirely in R. The structure of the object produced by birch (herein referred to as a birch object) is given in Table B.1.

Data:
  sumXi    A matrix containing the sum of the observations, sum_i x_i, in each subcluster (NL x d).
  sumXisq  An array containing the sum of squares of the observations, sum_i x_i x_i^T, in each subcluster (d x d x NL).
  N        A vector containing the number of observations in each subcluster (NL x 1).
  members  A list containing which observations are in each subcluster (length NL).

Attributes:
  xcolnames  The column names of the original data set.
  xdim       A vector containing the number of observations in the original data set, the number of columns in the original data set, and the number of subclusters.
  compact    The value for compact used when forming the tree.
  radius     The value for radius used when forming the tree.
  internal   A pointer to the tree held in memory (if applicable).

Table B.1: The structure of a birch object. NL is the number of subclusters in the object.

B.2.1 Generic Internal Methods

Various generic methods have been implemented for use on a birch object. They are:

print, summary, dim, length, [ ], cbind, plot, pairs, points, lines

The print method produces a brief description of the object, in particular the number of subclusters, the attributes used for creating the object, and the dimensions of the original data set. The summary method produces the column means and covariance of the whole data set, and can be assigned to an object (comprised of a list with names means and cov) if desired. The method dim produces a vector of length three, containing the tuple: the number of observations in the underlying data set, the number of dimensions of the data set, and the number of subclusters in the object, while the method length simply produces the number of subclusters in the object. This last method is useful when building a tree to a certain number of subclusters, as is often the case (motivation for this approach is given in the original article).
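As an aside, one simple way of building a tree to approximately a target number of subclusters is to enlarge the criteria until length falls below the target. The sketch below is a hypothetical helper, not a function from the package, and the doubling rule and the target of 5,000 subclusters are arbitrary choices for illustration:

## Hypothetical helper (not part of birch): grow the radius/compact criteria
## until the tree has at most `target` subclusters.
build.to.size <- function(x, target = 5000, radius = 0.01) {
  repeat {
    tree <- birch(x, radius, compact = radius)
    if (length(tree) <= target) return(tree)
    radius <- radius * 2   # coarser criteria produce fewer subclusters
  }
}
## e.g. mytree <- build.to.size(z, target = 5000)

Each attempt rebuilds the tree from scratch (and therefore re-reads the data), so in practice one would start from a deliberately generous radius.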
Subsets of a birch object can be created using the standard R notation [i, j], with i specifying which subclusters to retain, and j which columns. However, care should be used when subsetting by column, as

birch(z, 0.1)[,1:2]

will not yield the same result as

birch(z[,1:2], 0.1)

because the Euclidean distance and trace criteria mentioned previously will produce differing results when calculating birch on the whole data set versus on a subset of the columns. Furthermore, data that is close (in the Euclidean sense) in d dimensions may not be close in e > d dimensions, particularly when e - d is large. In order to reflect this, the compact and radius attributes will be set to NA in the former case. The method cbind is useful in the case where a constant is to be put in front of each observation of the data set, and updates the sumXi and sumXisq data. However, once again, care should be used with this command, as

birch(cbind(1,z), 0.1)

will not yield the same result as

cbind(1, birch(z, 0.1))

for the same reasons as those given when subsetting by column (the Euclidean distances will be the same, but the trace of the covariance matrix will be different). Once again, in the latter case, the compact and radius attributes will be set to NA.

We have two plotting methods for birch objects: plot for two-dimensional data, and pairs for data with more than two dimensions. Each method has an optional argument, centers, with the default set to FALSE. In the default operation, the methods produce 95% ellipses (computed with the package ellipse, which uses a multivariate normal reference distribution) based on the covariance matrix for each subcluster with N > d, centred at its sample mean; this is very effective at visualizing the data when there are a reasonable number of subclusters. However, in order to reflect the fact that each observation within a subcluster is within a certain Euclidean distance of the centre of the subcluster, as specified by the radius attribute, the ellipses are truncated at this distance. A graphical representation of this is given in Figure B.2. For subclusters with N <= d observations, only the centre of the subcluster is plotted.

Figure B.2: A new plot method. The shaded area represents the ellipse that is produced when centers=FALSE.

When the number of subclusters is large this plot becomes less informative, and setting the argument centers = TRUE produces a plot of just the sample means of each subcluster. These methods accept the usual plotting arguments, such as the col argument for assigning a different colour to each subcluster.
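For instance, a different colour can be given to every subcluster by indexing a palette with as many colours as there are subclusters; the snippet below is illustrative only and assumes a numeric matrix z is already in memory.

## Illustrative only: one colour per subcluster (z is an assumed data matrix)
tr <- birch(z, 0.1)
plot(tr, centers = TRUE, col = rainbow(length(tr)))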
B.2.2 Clustering

Even when the data set fits in memory, clustering with BIRCH can be highly advantageous for large data sets, as clustering with subclusters rather than the full dimensional data set is a much simpler combinatorial problem. In this case, the above approximate solution can easily be refined to provide a solution that is in many cases better than the one found by attempting direct optimization over the full data set. For an extended discussion on this, refer to the original article.

Two clustering methods are provided: kmeans.birch and lga.birch (rlga.birch is discussed in the following section; non-BIRCH versions of lga and rlga are available on CRAN in the package lga). Each of these algorithms takes a birch object as input, and in the simplest case the number of clusters to find. Their syntax is given by

kmeans.birch(birchObject, centers, nstart = 1)
lga.birch(birchObject, k, nsamp = 100)

where birchObject is a birch object. For k-means, centers is either (a) the number of clusters to find, or (b) a matrix of k rows and d columns containing the initial seed to use, and nstart is the number of random starts to use. For LGA, k specifies the number of clusters to find, and nsamp the number of random starting positions. Both methods return a list containing the clusters found, in terms of subclusters and the actual observations making up these subclusters, as well as the value of the objective function: the Residual Sum of Squares (RSS) in the case of kmeans.birch, and the Residual Orthogonal Sum of Squares (ROSS) for lga.birch.

An alternative approach for k-means is provided via the command dist.birch, which calculates the pairwise distance between the centres of the subclusters and produces a dist object, which can be used in conjunction with the hclust and cutree commands to provide an initial seed for the kmeans algorithm. An example of this is given in the help files, and in Section B.3.

B.2.3 Robust Statistics

Three methods are implemented in this package for robust statistics: covMcd.birch, ltsReg.birch and rlga.birch. These methods are the main focus of the original article, and so will not be discussed in detail here. (Further information on each of the underlying algorithms can be found in Rousseeuw and van Driessen (1999), Rousseeuw and van Driessen (2006), and García-Escudero et al. (2007) for RLGA; the non-BIRCH versions of the first two methods can be found in robustbase, and in lga for RLGA.) The syntax for all three methods is given by

covMcd.birch(birchObject, alpha=0.5, nsamp=100)
ltsReg.birch(birchObject, alpha=0.5, intercept=FALSE, nsamp=100)
rlga.birch(birchObject, k, alpha=0.5, nsamp=100)

where alpha is the size of the h-subset (between 0.5 and 1) as a percentage of the whole data set, and nsamp is the number of starting positions for the random search. For the LTS command, the argument intercept adds a unit vector to the data set via the cbind mechanism given in Section B.2.1. (Reiterating the comment in Section B.2.1, the cbind mechanism is only approximate w.r.t. arguments; a better alternative is to form a birch object on the data set with the unit vector added as the first column. However, at times this is not possible, and this mechanism is then applicable.) The values returned from each algorithm are given in Table B.2.

Two methods are also provided for optional "refinement steps" for MCD and LTS when the data set fits in memory. As mentioned previously (and elsewhere; see the original article), the solution using BIRCH is approximate, in particular because of the "coarseness" of the underlying tree. If a more accurate solution is required and the data set fits in memory, then an additional step can be taken whereby the solution from the BIRCH algorithm is used to seed additional iterations (local refinements) using the whole data set.
The syntax to do this is given by

covMcdBirch.refinement(covOut, x, alpha=0.5)
ltsBirch.refinement(ltsOut, x, y, alpha=0.5, intercept=FALSE)

where covOut and ltsOut are the outputs from covMcd.birch and lts.birch respectively. If the data set was not loaded into R because it was loaded directly from a file or connection, then these methods are not applicable.

MCD:
  zbar  Estimate of location.
  Sz    Estimate of covariance.
  Det   The Minimum Covariance Determinant.
  best  A list containing the index of subclusters making up the h-subset and the index of actual observations making up the h-subset.

LTS:
  best              A list containing the index of subclusters making up the h-subset and the index of actual observations making up the h-subset.
  raw.coefficients  The fitted LTS regression line.
  Resids            A list containing the sum of squared residuals for the best subset, as well as the sum of squared residuals for the whole data set (based on the LTS regression equation).

RLGA:
  clust  A list containing the clustering in terms of the subclusters and the clustering in terms of the actual observations.
  ROSS   The residual sum of squares of orthogonal distances to the fitted hyperplanes based on the best data set.

Table B.2: The outputs from each of the robust algorithms.

B.3 Example

We illustrate the capabilities of the package by analyzing the data set commonly referred to as the DPOSS data set. The Digitized Palomar Sky Survey (DPOSS) data has been used in articles related to robustness (e.g. Rousseeuw and van Driessen (1999), Rousseeuw and van Driessen (2006), etc.) as well as clustering (e.g. Rocke and Dai (2003)). It contains characteristics from celestial objects, and has 132,402 observations in 33 dimensions. For further information see Odewahn et al. (1998).

As a first step, we take a subset of the data set by selecting the "Aperture" and "Combined stellar function" variables (scaled such that the sample mean is zero, and the variance one) for just the F band, yielding p = 2 and n = 132,402. This data is in the file "dposs_2d.csv", and in order to demonstrate the functionality, it is loaded directly from a local file in the first instance, and from a web server in the second case. We then produce two trees: the first with radius set to 1.5 and compactness equal to 5, and a second with both radius and compactness set to 0.005. These attributes were set for the purposes of visualization. Plots of these are given, along with the original data set, in Figure B.3.

R> library(birch)

Figure B.3: The birch plot method, applied on birch objects of differing size. (a) Radius = 1.5, Compactness = 5, Nleafs = 14; (b) Radius = Compactness = 0.005, Nleafs = 2,283. The first plot contains a plot of the ellipses (blue), centres (red), and the underlying data (green).
The second plot just contains the centres.

R> mytree2 <- birch("dposs_2d.csv", compact=5, radius=1.5, sep=",",
+                   header=TRUE)
R> length(mytree2)
[1] 14
R> ## the object z is a matrix containing the data in dposs_2d.csv
R> plot(z, col="green", pch=".")
R> lines(mytree2, col="blue")
R> points(mytree2, col="red")
R> mytree <- birch("http://www.stat.ubc.ca/~harringt/dposs_2d.csv",
+                  0.005, sep=",", header=TRUE)
R> length(mytree)
[1] 2283
R> plot(mytree, centers=TRUE)

We then perform the MCD on a different subset of the DPOSS data set, this time selecting the "Aperture" and "Combined stellar function" for each of the F, J and N bands, and just those observations classified as stars, yielding p = 6 and n = 55,716. The full DPOSS data set is stored in the R object dposs. The following code forms a birch object, and calculates the MCD and a refinement.

R> ## Create data set
R> z <- scale(dposs[dposs[,"Class"] == 1,
+             c("MAperF","csfF","MAperJ","csfJ","MAperN","csfN")])
R> ## Create Birch Object
R> mytree <- birch(z, 0.1)
R> length(mytree)
[1] 6192
R> ## Calculate MCD
R> set.seed(1234)
R> abc <- covMcd.birch(mytree, alpha=0.5, nsamp=100)
R> summary(abc, digits=3)
CovMcd using birch
MCD = 7.566695e-10
Estimate of center
[1] -0.130 -0.291 -0.121 -0.270 -0.133 -0.540
Estimate of dispersion
       [,1]     [,2]   [,3]     [,4]   [,5]     [,6]
[1,] 0.4580 -0.06658 0.4380 -0.09068 0.4768 -0.02701  # etc...
R> log(abc$Det)
[1] -21.00209
R> abcRef <- covMcdBirch.refinement(abc, z, alpha=0.5)
R> ## Once again, summary works with this object (not shown here)
R> log(abcRef$Det)
[1] -21.24701

For reference, the equivalent result using covMcd from the package robustbase yields a log(MCD) of -21.24578, which is slightly higher (i.e. less optimal) than the one found by refining the BIRCH approximate solution.

The LTS algorithm assumes that the object provided is made up of the explanatory and response variables, much in the same way as lsfit.

R> x <- z[,2]
R> y <- z[,1]
R> mytree3 <- birch(cbind(x,y), 0.002); length(mytree3)
[1] 1818
R> set.seed(1234)
R> abc <- lts.birch(mytree3, intercept=TRUE, alpha=0.5, nsamp=100)
R> summary(abc)
LTS using birch
LTS = 1591.565
Coefficients
[1] -1.470226 -4.232881
R> abcRef <- ltsBirch.refinement(abc, x, y, alpha=0.5, intercept=TRUE)
R> summary(abcRef)
LTS using birch
LTS = 1495.513
Coefficients
       x1        x2
-1.501868 -4.331047

For reference, the equivalent result using ltsReg from the package robustbase yields an RSS of 1495.893, which again is slightly worse than the result produced by refining the BIRCH approximate solution.

In order to demonstrate RLGA we construct an outlier map (Rousseeuw and Van Zomeren, 1990) by plotting the Mahalanobis distances using standard estimates against the Mahalanobis distances using robust estimates.
We then apply RLGA-BIRCH to this data set as a means of robustly identifying the two groups.

R> library(robustbase)
R> set.seed(1234)
R> z <- scale(dposs[, c("MAperF","csfF","MAperJ","csfJ","MAperN","csfN")])
R> a <- covMcd(z)
R> ## create the data set
R> robdist <- mahalanobis(z, colMeans(z[a$best,]), cov(z[a$best,]))
R> classdist <- mahalanobis(z, colMeans(z), cov(z))
R> x <- scale(cbind(classdist, robdist), center=FALSE)
R> mytree <- birch(x, 0.001, 0.001); length(mytree)
[1] 2042
R> set.seed(1234)
R> b <- rlga.birch(mytree, k=2, alpha=0.75)

In order to plot this result in the first two dimensions, we use the following code, with the result given in Figure B.4.

R> plot(mytree, centers=TRUE, col=b$clust$sub+1)

Figure B.4: The results from RLGA.

Finally, we provide a small demonstration of the implementation of kmeans.birch. Using the tree formed for the RLGA example, we first use kmeans.birch directly.

R> abc <- kmeans.birch(mytree, 2, nstart = 100)
R> abc$RSS
[1] 145512.2
R> ## Add a refinement step (using the k-means algorithm directly)
R> centers <- rbind(summary(mytree[abc$clust$sub == 1,])$mean,
+                   summary(mytree[abc$clust$sub == 2,])$mean)
R> ghi <- kmeans(x, centers=centers)
R> sum(ghi$withinss)
[1] 145429.4

Then, running the non-BIRCH version for comparison.

R> ## Run the non-BIRCH version for comparison
R> def <- kmeans(x, 2)
R> sum(def$withinss)
[1] 145429.4
R> ## A crosstab of the clustering outputs for kmeans and kmeans.birch
R> table(def$cluster, abc$clust$obs)

        1     2
  1   650 87162
  2 44177   413

R> ## A crosstab of the clustering outputs for kmeans and
R> ## kmeans.birch with refinement step
R> table(def$cluster, ghi$clust)

        1     2
  1     0 44590
  2 87812     0

i.e. the result from kmeans.birch with a refinement step is identical to that provided by kmeans directly.

Following this, we use the command dist.birch with hclust (the hierarchical clustering algorithm) in order to suggest starting centers, and then complete one iteration with these centers as seeds.

R> hc <- hclust(dist.birch(mytree), method="ward")
R> ## Find which "clusterings" correspond with two groups
R> memb <- cutree(hc, k = 2)
R> ## Form the centers based on these
R> centers <- rbind(summary(mytree[memb == 1,])$mean,
+                   summary(mytree[memb == 2,])$mean)
R> ghi <- kmeans.birch(mytree, centers)
R> ghi$RSS
[1] 145533.9
R> ## A crosstab of the clustering
R> table(ghi$clust$obs, def$clust)

        1     2
  1 87639  1089
  2   173 43501

B.4 Conclusion

In this article we have introduced the package birch, and described its implementation in R. While most of the justification for this algorithm is provided elsewhere, in particular in Harrington and Salibian-Barrera (2007), the benefits mentioned previously are two-fold: firstly, in the case of very large data sets, a compact summary of the data can be calculated without needing to load the complete data set into memory, and secondly that for combinatorial problems, the effective solution space is significantly reduced, and an approximate solution can easily be computed using the clustering features.

The first benefit is clearly provided by functionality in the birch command for building a tree by loading a data set directly from a text file, URL or connection, and at no time is the whole data set held in memory. Clearly, therefore, the only limiting factor in terms of size is that the tree doesn't get too large, which in turn is controlled directly through the selection of the compactness and radius criteria.
To our knowledge, this is the only approach with this feature that is capable of solving combinatorial problems like the ones given here.

In demonstrating the second benefit, we refer to a summary of the results in Section B.3, given in Table B.3.

Minimum Covariance Determinant   Dimension of combinatorial problem   Obj. function
  Non-BIRCH                        55,716 x 6                          -21.246
  BIRCH without refinement          6,192 x 6                          -21.002
  BIRCH with refinement             6,192 x 6                          -21.247

Least Trimmed Squares            Dimension of combinatorial problem   Obj. function
  Non-BIRCH                        55,716 x 2                          1,495.893
  BIRCH without refinement          1,818 x 2                          1,591.565
  BIRCH with refinement             1,818 x 2                          1,495.513

k-means                          Dimension of combinatorial problem   Obj. function
  Non-BIRCH                       132,402 x 2                          145,429.4
  BIRCH without refinement          2,024 x 2                          145,512.2
  BIRCH with refinement             2,024 x 2                          145,429.4

Table B.3: A selection of results in the Example section.

The first thing that is apparent is that the BIRCH results are performing very well compared with the non-BIRCH versions even without the refinement steps, in spite of the fact that this is intended to be an "approximate" solution. Indeed, examples can be found (some are given in the original article) where the nature of the combinatorial problem and the reduced solution space allows BIRCH to achieve results that are superior to the non-BIRCH algorithms. Furthermore, when refinement is applied, the results perform as well as the non-BIRCH version. While these results are sufficient to suggest the merit of this approach, when combined with the fact that these methods are easily scaled with respect to the number of observations, and to a lesser extent the dimension, it is clear that where large data sets are involved, the BIRCH algorithm is an approach worth considering.

Bibliography

L. García-Escudero, A. Gordaliza, R. San Martín, S. Van Aelst, and R. H. Zamar. Robust linear clustering. Unpublished manuscript, 2007.

J. Harrington and M. Salibian-Barrera. Finding approximate solutions to combinatorial problems with very large data sets using BIRCH. Submitted to the Special Issue on Statistical Algorithms and Software of Computational Statistics and Data Analysis, 2007. URL http://www.stat.ubc.ca/~harringt/birch/birch.pdf.

S. Odewahn, S. Djorgovski, R. Brunner, and R. Gal. Data from the Digitized Palomar Sky Survey. Technical report, Dept. of Astronomy, California Institute of Technology, Pasadena, CA 91125, 1998.

R Development Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2007. URL http://www.R-project.org. ISBN 3-900051-07-0.

D. M. Rocke and J. Dai. Sampling and subsampling for cluster analysis in data mining: With applications to sky survey data. Data Mining and Knowledge Discovery, 7:215-232, 2003.

P. Rousseeuw and A. Leroy. Robust Regression and Outlier Detection. Wiley, 1987.

P. J. Rousseeuw and K. van Driessen. Computing LTS regression for large data sets. Data Mining and Knowledge Discovery, 12:29-45, 2006.

P. J. Rousseeuw and K. van Driessen. A fast algorithm for the minimum covariance determinant estimator. Technometrics, 41(3):212-223, August 1999.

P. J. Rousseeuw and B. C. Van Zomeren. Unmasking multivariate outliers and leverage points. Journal of the American Statistical Association, 85(411):633-639, 1990.

S. Van Aelst, X. Wang, R. H. Zamar, and R. Zhu. Linear grouping using orthogonal regression. Computational Statistics & Data Analysis, 50(5):1287-1312, Mar. 2006.

H. Wedekind. On the selection of access paths in a data base system. In J. Klimbie and K.
Koffeman, editors, Data Base Management. Amsterdam: North-Holland, 1974.

T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: An efficient data clustering method for very large databases. In Proceedings of the 1996 ACM SIGMOD International Conference, 1996.
