Universal graph compression : stochastic block models

UBC Theses and Dissertations

Universal graph compression : stochastic block models Wang, Ziao

Abstract

Motivated by data science applications that process and mine large-scale graph data such as social networks, web graphs, and biological networks, and by the high I/O and communication costs of storing and transmitting such data, this thesis investigates lossless compression of data appearing in the form of a labeled graph. In particular, we consider a widely used random graph model, the stochastic block model (SBM), which captures the clustering effects in social networks. An information-theoretic universal compression framework is applied, in which one aims to design a single compressor that achieves the asymptotically optimal compression rate for every SBM distribution, without knowing the parameters of the SBM that generates the data. Such a graph compressor is proposed in this thesis; it universally achieves the optimal compression rate for a wide class of SBMs with edge probabilities ranging from $O(1)$ to $\Omega(1/n^{2-\epsilon})$ for any $0 < \epsilon < 1$. Existing universal compression techniques are developed mostly for stationary ergodic one-dimensional sequences with fixed alphabet size and entropy linear in the number of variables. However, the adjacency matrix of an SBM has complex two-dimensional correlations and, in the sparse regime, sublinear entropy. These challenges are alleviated through a carefully designed transform that converts two-dimensional correlated data into almost i.i.d. submatrices. The sequence of submatrices is then compressed by a Krichevsky–Trofimov compressor, whose length analysis is generalized from i.i.d. sequences to identically distributed but arbitrarily correlated sequences. On four benchmark graph datasets (protein-to-protein interaction, LiveJournal friendship, Flickr, and YouTube), the compressed files from competing algorithms (including CSR, Ligra+, a PNG image compressor, and a Lempel–Ziv compressor for two-dimensional data) take 2.4 to 27 times the space needed by the proposed scheme.
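To make the block-then-compress idea in the abstract concrete, the following is a minimal Python sketch, not the thesis's actual algorithm. It cuts an adjacency matrix into $k \times k$ submatrices, treats each submatrix as one symbol, and computes the ideal code length under the standard Krichevsky–Trofimov (add-1/2) sequential estimator. The function names `blocks_as_symbols` and `kt_code_length_bits`, the row-major block order, and the choice to ignore the symmetry of the adjacency matrix are all assumptions made for illustration.

```python
import math
from collections import Counter


def blocks_as_symbols(adj, k):
    """Cut an n x n 0/1 matrix into k x k blocks (row-major order).

    Each block is flattened into a tuple so it can serve as one symbol.
    Hypothetical helper: assumes k divides n and ignores matrix symmetry.
    """
    n = len(adj)
    if n % k != 0:
        raise ValueError("k must divide the matrix dimension")
    symbols = []
    for bi in range(0, n, k):
        for bj in range(0, n, k):
            block = tuple(
                adj[i][j]
                for i in range(bi, bi + k)
                for j in range(bj, bj + k)
            )
            symbols.append(block)
    return symbols


def kt_code_length_bits(symbols, alphabet_size):
    """Ideal code length in bits under the KT (add-1/2) estimator.

    After t symbols, a symbol seen c times is assigned probability
    (c + 1/2) / (t + alphabet_size / 2).
    """
    counts = Counter()
    total_bits = 0.0
    for t, s in enumerate(symbols):
        p = (counts[s] + 0.5) / (t + alphabet_size / 2)
        total_bits -= math.log2(p)
        counts[s] += 1
    return total_bits


if __name__ == "__main__":
    # A tiny 4 x 4 graph with a single undirected edge between vertices 0 and 1.
    adj = [
        [0, 1, 0, 0],
        [1, 0, 0, 0],
        [0, 0, 0, 0],
        [0, 0, 0, 0],
    ]
    k = 2
    symbols = blocks_as_symbols(adj, k)
    bits = kt_code_length_bits(symbols, alphabet_size=2 ** (k * k))
    print(f"{len(symbols)} blocks, about {bits:.2f} bits")
```

The sketch only reports a code length; an actual compressor would feed the same sequential probabilities to an arithmetic coder. In a sparse graph most blocks are all-zero, so the estimator quickly assigns that symbol a high probability and the per-block cost drops.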

Item Metadata

Title	Universal graph compression : stochastic block models
Creator	Wang, Ziao
Supervisor	Wang, Lele
Publisher	University of British Columbia
Date Issued	2021
Genre	Thesis/Dissertation
Type	Text
Language	eng
Date Available	2021-09-01
Provider	Vancouver : University of British Columbia Library
Rights	Attribution-NonCommercial-NoDerivatives 4.0 International
DOI	10.14288/1.0401839
URI	http://hdl.handle.net/2429/79577
Degree	Master of Applied Science - MASc
Program	Electrical and Computer Engineering
Affiliation	Applied Science, Faculty of; Electrical and Computer Engineering, Department of
Degree Grantor	University of British Columbia
Graduation Date	2021-11
Campus	UBCV
Scholarly Level	Graduate
Rights URI	http://creativecommons.org/licenses/by-nc-nd/4.0/
Aggregated Source Repository	DSpace
