Bayesian models for hierarchical clustering of network data

UBC Theses and Dissertations

Featured Collection

UBC Theses and Dissertations

Bayesian models for hierarchical clustering of network data Briercliffe, Creagh

Abstract

Network data represent relational information between interacting entities. They can be described by graphs, where vertices denote the entities and edges mark the interactions. Clustering is a common approach for uncovering hidden structure in network data. The objective is to find related groups of vertices, through their shared edges. Hierarchical clustering identifies groups of vertices across multiple scales, where a dendrogram represents the full hierarchy of clusters. Often, Bayesian models for hierarchical clustering of network data aim to infer the posterior distribution over dendrograms. The Hierarchical Random Graph (HRG) is likely the most popular Bayesian approach to hierarchical clustering of network data. Due to simplifications made in its inference scheme, we identify some potentially undesirable model behaviour. Mathematically, we show that this behaviour presents in two ways: symmetry of the likelihood for graphs and their complements, and non-uniformity of the prior. The latter is exposed by finding an equivalent interpretation of the HRG as a proper Bayesian model, with normalized likelihood. We show that the amount of non-uniformity is exacerbated as the size of the network data increases. In rectifying the issues with the HRG, we propose a general class of models for hierarchical clustering of network data. This class is characterized by a sampling construction, defining a generative process for simple graphs, on fixed vertex sets. It permits a wide range of probabilistic models, via the choice of distribution over edge counts between clusters. We present four Bayesian models from this class, and derive their respective properties, like expected edge density and independence. For three of these models, we derive a closed-form expression for the marginalized posterior distribution over dendrograms, to isolate the problem of inferring a hierarchical clustering. We implement these models in a probabilistic programming language that leverages state-of-the-art approximate inference methods. As our class of models use a uniform prior over dendrograms, we construct an algorithm for sampling from this prior. Finally, the empirical performance of our models is demonstrated on examples of real network data.

Item Metadata

Title	Bayesian models for hierarchical clustering of network data
Creator	Briercliffe, Creagh
Supervisor	Bouchard-Côté, Alexandre; Gustafson, Paul, 1968-
Publisher	University of British Columbia
Date Issued	2023
Description	Network data represent relational information between interacting entities. They can be described by graphs, where vertices denote the entities and edges mark the interactions. Clustering is a common approach for uncovering hidden structure in network data. The objective is to find related groups of vertices, through their shared edges. Hierarchical clustering identifies groups of vertices across multiple scales, where a dendrogram represents the full hierarchy of clusters. Often, Bayesian models for hierarchical clustering of network data aim to infer the posterior distribution over dendrograms. The Hierarchical Random Graph (HRG) is likely the most popular Bayesian approach to hierarchical clustering of network data. Due to simplifications made in its inference scheme, we identify some potentially undesirable model behaviour. Mathematically, we show that this behaviour presents in two ways: symmetry of the likelihood for graphs and their complements, and non-uniformity of the prior. The latter is exposed by finding an equivalent interpretation of the HRG as a proper Bayesian model, with normalized likelihood. We show that the amount of non-uniformity is exacerbated as the size of the network data increases. In rectifying the issues with the HRG, we propose a general class of models for hierarchical clustering of network data. This class is characterized by a sampling construction, defining a generative process for simple graphs, on fixed vertex sets. It permits a wide range of probabilistic models, via the choice of distribution over edge counts between clusters. We present four Bayesian models from this class, and derive their respective properties, like expected edge density and independence. For three of these models, we derive a closed-form expression for the marginalized posterior distribution over dendrograms, to isolate the problem of inferring a hierarchical clustering. We implement these models in a probabilistic programming language that leverages state-of-the-art approximate inference methods. As our class of models use a uniform prior over dendrograms, we construct an algorithm for sampling from this prior. Finally, the empirical performance of our models is demonstrated on examples of real network data.
Genre	Thesis/Dissertation
Type	Text
Language	eng
Date Available	2023-06-08
Provider	Vancouver : University of British Columbia Library
Rights	Attribution-NonCommercial-NoDerivatives 4.0 International
DOI	10.14288/1.0433092
URI	http://hdl.handle.net/2429/84812
Degree (Theses)	Doctor of Philosophy - PhD
Program (Theses)	Statistics
Affiliation	Science, Faculty of; Statistics, Department of
Degree Grantor	University of British Columbia
Graduation Date	2023-11
Campus	UBCV
Scholarly Level	Graduate
Rights URI	http://creativecommons.org/licenses/by-nc-nd/4.0/
Aggregated Source Repository	DSpace

Open Collections

UBC Theses and Dissertations

UBC Theses and Dissertations

Bayesian models for hierarchical clustering of network data Briercliffe, Creagh

Abstract

Item Metadata

Item Media

Item Citations and Data

Rights