UBC Theses and Dissertations

UBC Theses Logo

UBC Theses and Dissertations

Pattern matching in massive metadata graphs at scale Reza, Tahsin Arafat

Abstract

Pattern matching in graphs, that is finding subgraphs that match a smaller template graph within the large background graph is fundamental to graph analysis and serves a rich set of applications. Unfortunately, existing solutions have limited scalability, are difficult to parallelize, support only a limited set of search patterns, and/or focus on only a subset of the real-world problems. This dissertation explores avenues toward designing a scalable solution for subgraph pattern matching. In particular, this work targets practical pattern matching scenarios in large-scale metadata graphs (also known as property graphs) and designs solutions for distributed memory machines that address the two categories of matching problems, namely, exact and approximate matching. This work presents a novel algorithmic pipeline that bases pattern matching on constraint checking. The key intuition is that each vertex or edge participating in a match has to meet a set of constraints specified by the search template. The pipeline iterates over these constraints to eliminate all the vertices and edges that do not participate in any match, and reduces the background graph to the complete set of only the matching vertices and edges. Additional analysis can be performed on this reduced graph, such as full match enumeration. Furthermore, a vertex-centric formulation for this constraint checking algorithm exists, and this makes it possible to harness existing high-performance, vertex-centric graph processing frameworks. The key contributions of this dissertation are solution design following this constraint checking approach for exact and a class of edit-distance based approximate matching, and experimental evaluation to demonstrate effectiveness of the respective solutions. To this end, this work presents design and implementation of distributed vertex-centric, asynchronous algorithms that guarantee a solution with 100% precision and 100% recall for arbitrary search templates. Through comprehensive evaluation, this work provides evidence that the scalability and performance advantages of the proposed approach are significant. The highlights are scaling experiments on massive-scale real-world (up to 257 billion edges) and synthetic (up to 4.4 trillion edges) graphs, and at scales (1,024 compute nodes), orders of magnitude larger than used in the past for similar problems.

Item Media

Item Citations and Data

Rights

Attribution-NonCommercial-NoDerivatives 4.0 International