The Open Collections website will be undergoing maintenance on Wednesday December 7th from 9pm to 11pm PST. The site may be temporarily unavailable during this time.

UBC Theses and Dissertations

UBC Theses Logo

UBC Theses and Dissertations

Global optimization of clustering problems Shi, Mingfei

Abstract

Clustering is a fundamental unsupervised machine learning task that aims to aggregate similar data into one cluster and separate those in diverse into different clusters. Cluster analysis can always be formulated as an optimization problem. Various objective functions may lead to different clustering problems. In this thesis, we concentrate on k-means and k-center problems. Each can be formulated as a mixed-integer nonlinear programming problem. The work about k-means clustering optimization has been published on ICML 2021 [30]. Moreover, we also submitted the work about global optimization of k-center clustering to ICML 2022 and the paper has been accepted in Phase 1 of reviewing. This thesis provides a practical global optimization algorithm for these two tasks based on a reduced-space spatial branch and bound (BB) scheme. This algorithm can guarantee convergence to the global optimum by only branching on the centers of clusters, which is independent of the dataset’s cardinality. We also design several methods to construct lower and upper bounds at each node in the BB scheme. In addition, for k-center problem, a set of feasibility-based bounds tightening techniques are proposed to narrow down the domain of centers and significantly accelerate the convergence. To demonstrate the capacity of this algorithm, we present computational results on UCI datasets and compare our proposed algorithms with the off-the-shelf global optimal solvers and classical local optimal algorithms. For k-means clustering, the numerical experiments demonstrated our algorithm’s ability to handle datasets with up to 200,000 samples. Besides, for k-center clustering, the serial implementation of the algorithm on the dataset with 14 million samples and 3 features can attain the global optimum to an optimality gap of 0.1% within 2 hours.

Item Media

Item Citations and Data

Rights

Attribution-NonCommercial-NoDerivatives 4.0 International