UBC Theses and Dissertations
Bayesian models for hierarchical clustering of network data Briercliffe, Creagh
Network data represent relational information between interacting entities. They can be described by graphs, where vertices denote the entities and edges mark the interactions. Clustering is a common approach for uncovering hidden structure in network data. The objective is to find related groups of vertices, through their shared edges. Hierarchical clustering identifies groups of vertices across multiple scales, where a dendrogram represents the full hierarchy of clusters. Often, Bayesian models for hierarchical clustering of network data aim to infer the posterior distribution over dendrograms. The Hierarchical Random Graph (HRG) is likely the most popular Bayesian approach to hierarchical clustering of network data. Due to simplifications made in its inference scheme, we identify some potentially undesirable model behaviour. Mathematically, we show that this behaviour presents in two ways: symmetry of the likelihood for graphs and their complements, and non-uniformity of the prior. The latter is exposed by finding an equivalent interpretation of the HRG as a proper Bayesian model, with normalized likelihood. We show that the amount of non-uniformity is exacerbated as the size of the network data increases. In rectifying the issues with the HRG, we propose a general class of models for hierarchical clustering of network data. This class is characterized by a sampling construction, defining a generative process for simple graphs, on fixed vertex sets. It permits a wide range of probabilistic models, via the choice of distribution over edge counts between clusters. We present four Bayesian models from this class, and derive their respective properties, like expected edge density and independence. For three of these models, we derive a closed-form expression for the marginalized posterior distribution over dendrograms, to isolate the problem of inferring a hierarchical clustering. We implement these models in a probabilistic programming language that leverages state-of-the-art approximate inference methods. As our class of models use a uniform prior over dendrograms, we construct an algorithm for sampling from this prior. Finally, the empirical performance of our models is demonstrated on examples of real network data.
Item Citations and Data
Attribution-NonCommercial-NoDerivatives 4.0 International