Pattern Matching in Massive Metadata Graphs at Scale

by

Tahsin Arafat Reza

a dissertation submitted in partial fulfillment of the requirements for the degree of

Doctor of Philosophy

in

the faculty of graduate and postdoctoral studies
(Electrical and Computer Engineering)

The University of British Columbia
(Vancouver)

December 2019

© Tahsin Arafat Reza, 2019

The following individuals certify that they have read, and recommend to the Faculty of Graduate and Postdoctoral Studies for acceptance, the thesis entitled:

Pattern Matching in Massive Metadata Graphs at Scale

submitted by Tahsin Arafat Reza in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Electrical and Computer Engineering.

Examining Committee:

Matei Ripeanu, Electrical and Computer Engineering
Supervisor

Roger Pearce, Lawrence Livermore National Laboratory
Supervisory Committee Member

Alexandra Fedorova, Electrical and Computer Engineering
University Examiner

Alan Wagner, Computer Science
University Examiner

Kamesh Madduri, Pennsylvania State University
External Examiner

Additional Supervisory Committee Members:

Mieszko Lis, Electrical and Computer Engineering
Supervisory Committee Member

Abstract

Pattern matching in graphs, that is, finding subgraphs that match a smaller template graph within a large background graph, is fundamental to graph analysis and serves a rich set of applications. Unfortunately, existing solutions have limited scalability, are difficult to parallelize, support only a limited set of search patterns, and/or focus on only a subset of the real-world problems.

This dissertation explores avenues toward designing a scalable solution for subgraph pattern matching.
In particular, this work targets practical pattern matching scenarios in large-scale metadata graphs (also known as property graphs) and designs solutions for distributed memory machines that address the two categories of matching problems, namely, exact and approximate matching.

This work presents a novel algorithmic pipeline that bases pattern matching on constraint checking. The key intuition is that each vertex or edge participating in a match has to meet a set of constraints specified by the search template. The pipeline iterates over these constraints to eliminate all the vertices and edges that do not participate in any match, and reduces the background graph to the complete set of only the matching vertices and edges. Additional analysis can be performed on this reduced graph, such as full match enumeration. Furthermore, a vertex-centric formulation for this constraint checking algorithm exists, and this makes it possible to harness existing high-performance, vertex-centric graph processing frameworks.

The key contributions of this dissertation are the design of solutions following this constraint checking approach for exact and a class of edit-distance based approximate matching, and an experimental evaluation demonstrating the effectiveness of the respective solutions. To this end, this work presents the design and implementation of distributed, vertex-centric, asynchronous algorithms that guarantee a solution with 100% precision and 100% recall for arbitrary search templates.

Through comprehensive evaluation, this work provides evidence that the scalability and performance advantages of the proposed approach are significant. The highlights are scaling experiments on massive-scale real-world (up to 257 billion edges) and synthetic (up to 4.4 trillion edges) graphs, and at scales (1,024 compute nodes) orders of magnitude larger than used in the past for similar problems.

Lay Summary

Pattern matching is fundamental to graph analysis and serves a rich set of applications.
Unfortunately, existing solutions have limited scalability, support only a limited set of search patterns, and/or focus on only a subset of the real-world problems. This dissertation explores avenues toward designing a scalable solution for subgraph pattern matching. This work presents a novel algorithmic pipeline to support practical pattern matching based analytics in large-scale metadata graphs (i.e., graphs with vertex and/or edge attributes) and designs solutions for distributed memory machines that address the two categories of matching problems, namely, exact and approximate matching. Through comprehensive evaluation, this work provides evidence that the scalability and performance advantages of the proposed approach are significant. The highlights are scaling experiments on massive-scale real-world (up to 257 billion edges) and synthetic (up to 4.4 trillion edges) graphs, and at scales (1,024 compute nodes) orders of magnitude larger than used in the past for similar problems.

Preface

The author of this dissertation was the primary contributor to all the research presented in this dissertation: from proposing the hypothesis, designing the key solutions, and implementing the solutions to designing the experiments and conducting the evaluation. He also led the writing effort or co-authored the corresponding peer-reviewed (or under review) publications.¹

The research presented in this dissertation has been either published at or submitted for publication to a peer-reviewed journal, conference or workshop. Below is the list of publications each chapter in this dissertation is based on (ordered by publication or submission date, latest first). The solutions were implemented using a middleware called HavoqGT [82]; project leader: Dr. Roger Pearce of Lawrence Livermore National Laboratory. Theoretical proofs in Appendix A were primarily developed by Dr.
Geoffrey Sanders of Lawrence Livermore National Laboratory.

Chapter 4 The research presented in this chapter has been accepted for publication. The author of this dissertation is the main contributor of this work - from proposing the idea, solution design and implementation, evaluation to paper writing.

(C1) T. Reza, M. Ripeanu, G. Sanders, and R. Pearce. Approximate Pattern Matching in Distributed Graphs with Precision and Recall Guarantees. ACM SIGMOD International Conference on Management of Data, SIGMOD ’20, Portland, Oregon, 14–19 June, 2020.

¹ J - journal, C - conference, W - workshop and T - technical report.

Chapter 3 The work presented in this chapter was published in two conferences and one workshop. Additionally, an extended version has been submitted to an ACM journal. The author of this dissertation is the main contributor of this work - from developing the ideas, solution design and implementation, evaluation to paper writing. The research presented in the workshop paper (W1) explores a specific optimization opportunity for a decision problem introduced in the conference papers and was led by Nicolas Tripoul.

(J1) T. Reza, H. Halawa, M. Ripeanu, G. Sanders, and R. Pearce. Scalable Pattern Matching in Metadata Graphs via Constraint Checking. ACM Transactions on Parallel Computing, TOPC. Manuscript submitted November, 2018. First revision submitted October, 2019.

(C2) T. Reza, M. Ripeanu, N. Tripoul, G. Sanders, and R. Pearce. PruneJuice: Pruning Trillion-edge Graphs to a Precise Pattern-Matching Solution. IEEE/ACM International Conference for High Performance Computing, Networking, Storage, and Analysis, SC ’18, Dallas, Texas, 11–16 November, 2018.

(W1) N. Tripoul, H. Halawa, T. Reza, M. Ripeanu, G. Sanders, and R. Pearce. There are Trillions of Little Forks in the Road. Choose Wisely! - Estimating the Cost and Likelihood of Success of Constrained Walks to Optimize a Graph Pruning Pipeline.
The 8th Workshop on Irregular Applications: Architectures and Algorithms, IA^3 ’18, co-located with SC ’18, Dallas, Texas, 11–16 November, 2018.

(C3) T. Reza, C. Klymko, M. Ripeanu, G. Sanders, and R. Pearce. Towards Practical and Robust Labeled Pattern Matching in Trillion-Edge Graphs. The 19th IEEE International Conference on Cluster Computing, Cluster ’17, Honolulu, Hawaii, 5–8 September, 2017. (One of the four papers nominated for the Best Paper Award.)

Table of Contents

Abstract . . . iii
Lay Summary . . . v
Preface . . . vi
Table of Contents . . . viii
List of Tables . . . xii
List of Figures . . . xvi
Acknowledgments . . . xxvi

1 Introduction . . . 1
  1.1 The Multiple Facets of the Pattern Matching Problem . . . 1
    1.1.1 Exact and Approximate Pattern Matching . . . 1
    1.1.2 Diversity in the Use of the Result . . . 3
    1.1.3 Precision and Recall . . . 3
    1.1.4 Pattern Matching in Metadata Graphs . . . 4
  1.2 Scalability Challenges of Pattern Matching . . . 4
  1.3 A Constraint Checking Approach for Scalable Pattern Matching . . . 5
  1.4 Research Objectives . . . 8
  1.5 Methodology . . . 8
  1.6 Summary of Contributions . . . 10
  1.7 Dissertation Organization . . . 13

2 Background and Related Work . . .
15
  2.1 Exact and Approximate Pattern Matching Definitions . . . 15
    2.1.1 Exact Pattern Matching . . . 16
    2.1.2 Approximate Pattern Matching . . . 17
    2.1.3 Induced and Non-induced Subgraph Matching . . . 18
  2.2 General Algorithmic Approaches . . . 18
    2.2.1 Exact Pattern Matching . . . 19
    2.2.2 Approximate Pattern Matching . . . 21
      2.2.2.1 Graph Similarity Estimators . . . 22
      2.2.2.2 Algorithmic Techniques . . . 22
  2.3 Distributed Graph Pattern Matching . . . 25
    2.3.1 Solutions offering Exact Matching . . . 25
    2.3.2 Solutions targeting Approximate Matching . . . 27
  2.4 Query Languages and High-level Programming Interfaces . . . 28
    2.4.1 Query Languages . . . 28
    2.4.2 API for Implementing Query specific Algorithms . . . 29
  2.5 Metadata Graphs and Pattern Matching . . . 29
    2.5.1 Metadata Graph Models . . . 29
    2.5.2 Pattern Matching in Metadata Graphs . . . 30
  2.6 Infrastructure to enable Graph Processing on Distributed Platforms . . . 30
  2.7 Input Reduction in Graph Processing . . . 31

3 Graph Pruning via Constraint Checking – A Technique for Scalable Exact Pattern Matching in Metadata Graphs . . . 34
  3.1 Design Goals and Opportunities . . . 35
  3.2 Contribution Highlights and Chapter Organization . . . 38
  3.3 Preliminaries . . . 40
  3.4 Graph Pruning via Constraint Checking for Scalable Pattern Matching – Solution Overview . . . 43
  3.5 Asynchronous Algorithms and Distributed Implementation . . . 49
  3.6 Summary of the Preliminary Investigations . . .
56
  3.7 Evaluation . . . 57
    3.7.1 Weak Scaling Experiments . . . 60
    3.7.2 Strong Scaling Experiments . . . 61
    3.7.3 Match Enumeration . . . 63
    3.7.4 Example Use Cases – Social Network Analysis and Information Mining . . . 64
    3.7.5 Precision Guarantees vs. Time-to-Solution . . . 66
    3.7.6 Impact of Design Decisions and Strategic Optimizations . . . 69
    3.7.7 Load Balancing . . . 71
    3.7.8 Defence Against System Collapse due to Message Explosion . . . 75
    3.7.9 Template Sensitivity Analysis . . . 78
    3.7.10 Non-local Constraint Selection and Ordering Optimization – A Feasibility Study . . . 80
    3.7.11 Comparison with State-of-the-Art Systems . . . 82
      3.7.11.1 Comparison with QFrag . . . 82
      3.7.11.2 Comparison with Arabesque . . . 84
  3.8 Lessons and Discussions . . . 86

4 Edit-Distance Subgraph Matching in Distributed Graphs with Precision and Recall Guarantees . . . 91
  4.1 Problem Overview and Design Opportunities . . . 93
  4.2 Solution Overview . . . 95
  4.3 Contribution Highlights and Chapter Organization . . . 96
  4.4 Preliminaries . . . 98
  4.5 Constraint Checking for Template Variant Subgraph Matching and Opportunities for Designing an Edit-Distance based Solution . . . 101
  4.6 Designing an Edit-Distance Subgraph Matching Solution following the Constraint Checking Approach . . . 104
  4.7 Asynchronous Algorithms and Distributed Implementation . . . 106
  4.8 Evaluation . . .
111
    4.8.1 Weak Scaling Experiments . . . 112
    4.8.2 Strong Scaling Experiments . . . 114
    4.8.3 Comparison with the Naïve Approach . . . 115
    4.8.4 Impact of Optimizations . . . 117
    4.8.5 Example Use Cases . . . 121
    4.8.6 Comparison with State-of-the-Art Systems . . . 123
  4.9 Lessons and Discussions . . . 125

5 Summary and Future Work . . . 128
  5.1 Graph Pruning via Constraint Checking – A Technique for Scalable Exact Pattern Matching in Metadata Graphs . . . 128
    5.1.1 Impact . . . 129
  5.2 Edit-Distance Subgraph Matching in Distributed Graphs with Precision and Recall Guarantees . . . 130
    5.2.1 Impact . . . 131
  5.3 Threat to Validity . . . 131
  5.4 Limitations . . . 132
  5.5 Future Research Directions . . . 134

Bibliography . . . 139

A Theoretical Guarantees . . . 156
  A.1 Correctness Proofs for the Constraint Checking Algorithms (assuming Restrictions on the Search Template) . . . 157
  A.2 Proof Sketch for the Solution that offers Precision and Recall Guarantees for Arbitrary Search Templates . . . 163

B Complexity Analysis . . . 165
  B.1 Local Constraint Checking . . . 165
  B.2 Non-local Constraint Checking . . . 166

C Other Projects and Publications . . .
168

List of Tables

Table 2.1 Symbolic notation used. . . . 16

Table 2.2 Categorization of pattern matching techniques found in the literature: the table shows whether these techniques support exact and/or approximate matching, offer precision and/or recall guarantees, and the type(s) of output produced. (100% precision means it is possible to use the technique such that no false positive matches are included in the final output, and 100% recall means it is possible to use the technique such that the solution retrieves all valid matches. If the technique does not offer precision and/or recall guarantees, it is labeled N/A.) The last row lists the technique introduced in this dissertation, dubbed Constraint Checking. . . . 20

Table 2.3 Comparison of past work on distributed pattern matching. The table highlights the characteristics of each solution presented (e.g., exact vs. approximate matching), its implementation infrastructure, and summarizes the details of the largest scale experiment performed. We highlight the fact that our solution is unique in terms of demonstrated scale, and the ability to perform exact matching and retrieve all matches. . . . 25

Table 3.1 Symbolic notation used. . . . 41

Table 3.2 Step-by-step illustration of non-local constraint generation: high-level description, accompanied by pictorial depiction for the template in Fig. 3.1. The figures show the steps to generate the required cycle constraints (CC), path constraints (PC), and higher-order constraints requiring template-driven search (TDS). . . . 47

Table 3.3 Properties of the datasets used for evaluation: number of vertices and directed edges, maximum, average and standard deviation of vertex degree, and the graph size in the compact CSR-like representation used (including vertex metadata). . . .
58

Table 3.4 Match enumeration statistics: number of matches for the Chain and Tree patterns (in Fig. 3.5, the top table below), and WDC (Fig. 3.7), Reddit and IMDb (Fig. 3.10) patterns (bottom table), and the enumeration times, starting from the respective pruned graphs. Note that for WDC-1, WDC-3 and WDC-5, we were not able to enumerate all the matches. . . . 65

Table 3.5 We compare two cases: direct enumeration vs. constraint checking followed by match enumeration in the pruned graph. These experiments use 64 compute nodes. For the relatively rare WDC-2 pattern, PruneJuice achieves ∼18× speedup. For WDC-6, direct enumeration leads to a crash (the generated message traffic overwhelms some of the compute nodes), while PruneJuice was able to list all the matches in under two minutes. Note the difference in runtime for WDC-2 from the numbers reported in Fig. 3.8. The testbed has been updated, including the OS kernel, C++ compiler, MPI libraries and interconnect drivers, since we conducted the scaling experiments in Fig. 3.8, and we have made various performance optimizations to our own codebase; hence, the improved performance. . . . 65

Table 3.6 Runtime for pruning (with precision guarantees) and size of the pruned solution subgraph for the WDC patterns in Fig. 3.16 (used for template topology sensitivity analysis). The table lists the number of vertices (|V∗|) and edges (2|E∗|) in the solution subgraph for each pattern. All the experiments were carried out on a 64 node deployment. . . . 79

Table 3.7 Performance comparison between QFrag and our pattern matching solution. The table shows the runtime in seconds for full match enumeration for QFrag; and separately for pruning and full match enumeration for our distributed system (labeled PruneJuice-distributed), and for a single node implementation of our graph pruning-based approach tailored for a shared memory system (labeled PruneJuice-shared).
For PruneJuice, we split time-to-solution into pruning (top row) and enumeration (bottom row) times. We use the same graphs (Patent and YouTube) and the query templates as in Fig. 3.17 (Q4 – Q7) used for evaluation of QFrag in [100]. The other small acyclic queries used in [100] require PruneJuice to run local constraint checking only and, in these cases, PruneJuice is even faster than QFrag. . . . 83

Table 3.8 Performance comparison between Arabesque and our pattern matching system (labeled PJ - short for PruneJuice). The table shows the runtime in seconds for counting 3-Clique and 4-Clique patterns. These search patterns as well as the following background graphs were used for evaluation of Arabesque in [108]. We run experiments on the same shared memory machine (with 1.5TB physical memory) we used for comparison with QFrag. Additionally, for PruneJuice, we present runtimes on 20 compute nodes. Here, PruneJuice runtimes for the single node, shared memory deployment are under the column with header PJ (1), while runtimes for the 20 node, distributed deployment are under the column with header PJ (20). . . . 85

Table 4.1 Symbolic notation used. . . . 99

Table 4.2 Properties of the datasets used for evaluation: number of vertices and directed edges, maximum, average and standard deviation of vertex degree, and the graph size in the compact CSR-like representation used (including vertex metadata). . . . 111

Table 4.3 Impact of ordering vertex labels in the increasing order of frequency for non-local constraint checking (top); impact of intuitive prototype ordering when searching them in parallel (middle); and impact of our match enumeration optimizations for edit-distance based matching (bottom). . . . 119

Table 4.4 Evaluation of load balancing/reloading on a smaller deployment along two axes: performance and efficiency.
(Top rows) Runtime for searching prototypes in parallel given a node budget (128 nodes in this example). Speedup for parallel prototype search (on a smaller deployment) over searching each prototype using 128 nodes is also shown. (Bottom rows) CPU Hours consumed by different deployment sizes for the same workload. The last row shows the CPU Hour overhead for each deployment size with respect to the two node deployment. . . . 120

List of Figures

Figure 1.1 High-level illustration of how the graph pruning approach fits in the larger context of the pattern matching problem. Here, G is the background graph, G0 is the search template and G∗ is the pruned, solution subgraph - the union of all the matching subgraphs. (a) The figure compares the traditional approach and the matching pipeline we propose: the conventional techniques typically rely on match enumeration to answer any pattern matching query, regardless of whether match listing is requested or not. In contrast, the technique we propose (depicted using solid red lines in the figure) first identifies the solution subgraph G∗ by pruning away the non-matching part of G. Compared to full match enumeration, low cost (at least in practice) algorithms for pruning are possible. Moreover, G∗ can directly answer certain pattern matching queries (highlighted using solid red lines). (b) The figure shows how our graph pruning pipeline can support diverse pattern matching scenarios - other algorithms can operate on the reduced, solution subgraph G∗ to answer various match queries (highlighted using solid blue lines). Furthermore, the pruning procedure collects additional information at the vertex granularity (§3.5) that can be leveraged to accelerate, for example, full match enumeration. . . . 7

Figure 2.1 An example of a template (left), an induced subgraph match of the template in graph (a) (top right) and a non-induced subgraph match of the template in graph (b) (bottom right).
Graph (b) has additional edges among the matching vertices that are not present in the given template, shown using red, solid lines. Graph (b) does not contain an induced subgraph match of the template. . . . 19

Figure 3.1 An example of a background graph G (center), a template graph G0 (left) and the output - the solution subgraph G∗ after vertex and edge elimination (right). The output is a refined set of vertices and edges that participate in at least one subgraph H that matches G0. Here, vertex metadata are presented as colored shapes. The eliminated vertices and edges are colored solid grey. . . . 36

Figure 3.2 Three examples of search templates and background graphs that justify the full set (local and non-local) of pruning constraints. Template (a) is a 3-Cycle; cycles of length 3k with repeated labels in the background graph meet neighborhood constraints, surviving local constraint checking. Template (b) contains several vertices with non-unique labels; to its right there is a background graph that meets individual point-to-point path constraints, also surviving (non-local) path checking. Template (c) is characterized by two 4-Cliques that overlap at a 3-Cycle; the background graph structure to the right is doubly periodic (a 4×3 torus) and meets all edge and vertex cycle constraints, surviving cycle (non-local constraint) checking. In addition to checking the local constraints, template (a) only requires cycle checking. Templates (b) and (c), however, require template-driven search to guarantee no false positives. . . . 44

Figure 3.3 Algorithm walk-through for the example background graph and template in Fig. 3.1, depicting which vertices and edges in G∗(V∗,E∗) are eliminated (in solid grey) during each iteration. The non-local constraints for G0 are listed in Table 3.2. The example does not show the application of some of the constraints, as they do not eliminate any vertices or edges. . . . 48

Figure 3.4 Caption for testbed figure . . .
57

Figure 3.5 Chain and Tree patterns used. Both patterns have two pairs of vertices with the same (numeric) label; hence, they require non-local constraint checking (NLCC), more precisely, path constraint checking. The labels used are the most frequent in the R-MAT graphs and cover ∼30% of all the vertices in the graphs. . . . 61

Figure 3.6 Runtime and pattern selectivity for weak scaling experiments, broken down into individual iterations, for the Chain (left) and Tree (right) patterns presented in Fig. 3.5. The X-axis labels present the R-MAT scale and the node count used for the experiment. (Each node hosts two processors, each with 18 cores, and we run 36 MPI processes per node.) The number of vertices and edges in each pruned solution subgraph is shown on top of the respective bar plots. The Pruning Factor (PF), i.e., the order of magnitude reduction in the number of vertices/edges compared to the original background graph, is also shown for each experiment. A flat line indicates perfect weak scaling. Time for the LCC and NLCC phases is presented using different colors. . . . 61

Figure 3.7 WDC patterns using top/second-level domain names as labels. The labels selected are among the most frequent, covering ∼81% of the vertices in the WDC graph: unsurprisingly, com is the most frequent, covering over two billion vertices; org, the 2nd most frequent after com, covers ∼220M vertices; and mil is the least frequent among these labels, covering ∼153K vertices. . . . 62

Figure 3.8 Runtime for strong scaling experiments, broken down into individual phases (LCC and NLCC are in different colors) for four of the WDC patterns presented in Fig. 3.7. The top row of X-axis labels represents the number of compute nodes. (Each node hosts two processors, each with 18 cores, and we run 36 MPI processes per node.)
The last two rows are the number of vertices and edges in the pruned graph, respectively. For better visibility, for WDC-1 (left plots), runtimes for different iterations are split into two scales on the Y-axis: LCC and NLCC-path constraints are at the bottom, and LCC and NLCC-TDS constraints are at the top. Speedup over the 64 node configuration is also shown on top of each stacked bar plot. . . . 63

Figure 3.9 Number of active vertices and edges after each iteration for the same experiments as in Fig. 3.8. The bottom row of X-axis labels represents the number of iterations required to reach a precise solution. Note that the Y-axis is on log scale. . . . 64

Figure 3.10 The scenarios and their corresponding templates for the Reddit (RDT) and IMDb graphs. RDT-1 (left): identify users with an adversarial poster-commenter relationship. Each author makes at least two posts or two comments, respectively. Comments to posts with more upvotes (P+) have a balance of negative votes (C-), and comments to posts with more downvotes (P-) have a positive balance (C+). The posts must be under different subreddits (category). RDT-2 (center): identify all poster-commenter pairs where the commenter makes at least two comments to the same post, one directly to the post and one in response to a comment. The poster also makes a comment in response to a comment. The commenter always receives a negative rating (C-) to a popular post (P+); however, comments (to the same post) by the poster have a positive rating (C+). IMDB-1 (right): find all the actresses, actors, and directors that worked together in at least two different movies that fall under at least two similar genres. . . . 66

Figure 3.11 (a) Runtime for the graph analytics patterns presented in Fig. 3.10. The labels on the X-axis represent the number of vertices and edges in the respective pruned graphs. Note that the Y-axes have different scales.
(b) Number of active vertices and edges after each iteration for the same experiments for the Reddit patterns as in (a). The labels on the bottom row of the X-axis represent the number of iterations required. Note that the Y-axis is on log scale. . . . 67

Figure 3.12 Vertex set precision over the lifetime of an execution for (a) WDC (Fig. 3.7), (b) Reddit (Fig. 3.10), and (c) R-MAT (Fig. 3.5) patterns. The X-axis presents the timeline (in seconds) while the Y-axis is the precision achieved by the end of an iteration. The markers indicate the moment in time when 100% precision has been achieved. The timeline for WDC-1 is limited to the 170th second for better visibility (WDC-1 achieves 100% precision in less than 20 seconds). For the R-MAT patterns, we show plots for Scale 28 and 37. . . . 68

Figure 3.13 (a) Performance and scalability comparison between the vertex elimination only solution (left), and the combined vertex and edge elimination solution (right) for the WDC-3 pattern. (b) Comparing synchronous and asynchronous NLCC. (c) Impact of work aggregation on runtime for the WDC patterns (for the sake of readability, only a subset of non-local constraints are considered for WDC-1). (d) Runtime performance when only TDS constraints are used vs. all NLCC constraints used. (Here, RDT-1 does not finish after two hours.) Note that in (b) and (d), we did not apply load balancing to RDT-1. All experiments in (b), (c) and (d) use 64 compute nodes. . . . 69

Figure 3.14 (a) Impact of load balancing on runtime for the WDC-1 and RDT-1 patterns. We compare two cases: without load balancing (NLB) and with load balancing through reshuffling on the same number of nodes (LB). For WDC-1, we show results for two scales, on 64 and 128 nodes. Speedup achieved by LB over NLB is also shown on the top of each bar.
(b) Performance of RDT-1 for four scenarios: (i) without load balancing on 64 nodes (NLB-64), (ii) with load balancing through reshuffling on the same number of nodes (LB-64), (iii) beginning with 64 nodes and relaunching on a 16 node deployment after load balancing (LB-16), and (iv) relaunching on a single node (36 processes) after load balancing (LB-1). The chart shows time-to-solution and CPU Hours consumed in each of the four cases. The CPU Hours consumed by NLB-64, LB-64 and LB-16 over LB-1 are also shown on the top of the respective bars. . . . 73

Figure 3.15 The figure compares system memory usage, throughout the application lifetime, for counting unlabeled 4-Motifs in the Youtube graph (see §4.8.6 for experiment details) for different batch sizes (up to 72 batches). We run this experiment on a shared memory platform with 1.5TB physical memory. We run 72 MPI processes. The X-axis is the timeline in seconds. The Y-axis is the peak system memory usage (in megabytes) at a given instant during the application lifetime. . . . 76

Figure 3.16 WDC patterns used for template topology sensitivity analysis. Templates (a) and (b) are monocycles; each has a vertex with the label edu. Template (c) is created through the union of (a) and (b). Templates (d) and (e) are constructed from (c) by incrementally adding one edge at a time. . . . 79

Figure 3.17 The patterns (reproduced from [100]) used for comparison with QFrag (results in Table 3.7). The label of each vertex is mapped, in alphabetical order, to the most frequent labels of the graph in decreasing order of frequency. Here, a represents the most frequent label, b is the second most frequent label, and so on. . . . 83

Figure 3.18 Matches for the WDC-2 pattern in the background graph. The number of matches in each of the six connected components is also shown. . . .
89
Figure 4.1 Edit-distance based subgraph matching: a search template H0 (left), background graph G (center-left), and on the right, examples of edit-distance k matches/solution subgraphs for distance k = 1 and k = 2. . . . 93
Figure 4.2 Edit-distance k = 1 and k = 2 prototypes of an example template. There are 19 prototypes at distance k ≤ 2, where each one is a connected component. . . . 101
Figure 4.3 (Top) Local and non-local constraints of a template: a vertex in an exact match needs to (i) match the label of a corresponding vertex in the template, and (ii) have edges to vertices labeled as prescribed in the adjacency structure of this corresponding vertex in the template. Based on the search template H0, we generate the set of non-local constraints K0 that are to be verified - for the example template, there are two constraints: (1) a triangle and (2) a rectangle. (Bottom) Three examples that illustrate the need for non-local constraint checks, invalid structures that local constraint checking is not guaranteed to eliminate (see Fig. 3.2 for details). . . . 102
Figure 4.4 Runtime for weak scaling experiments (left) for the RMAT-1 pattern (right) - it has 24 prototypes within distance k = 2. The X-axis labels present the R-MAT scale (top) and the node count used for the experiment (bottom). A flat line indicates perfect weak scaling. The labels used are the most frequent in the R-MAT graphs and cover ∼45% of all the vertices in the graph. For RMAT-1, the furthest edit-distance searched (k) and total prototype count (#p) are also shown. . . . 113
Figure 4.5 WDC patterns using top/second-level domain names as labels. The labels selected are among the most frequent, covering ∼21% of the vertices in the WDC graph: org covers ∼220M vertices, the 2nd most frequent after com; ac is the least frequent, still covering ∼4.4M vertices.
For each pattern, the furthest edit-distance searched (k) and total prototype count (#p) are also shown. . . . 114
Figure 4.6 Runtime for strong scaling experiments (to label vertices and edges by the prototype(s) they match), broken down by edit-distance level, for the WDC-1, 2 and 3 patterns (Fig. 4.5). Max-candidate set generation time (C) and infrastructure management overhead (S) are also shown. The top row of X-axis labels represents the number of compute nodes. Speedup over the 64 node configuration is shown on top of each stacked bar plot (the workload does not fit in memory on a smaller number of nodes). To observe natural scalability, for WDC-1 and WDC-2, we do not load balance the intermediate pruned graphs. Since we relaunch processing on smaller eight node deployments and search prototypes in parallel, for WDC-3, load balancing is implicit. . . . 114
Figure 4.7 Runtime comparison between the naïve approach and HGT for various patterns and graphs. Speedup over the naïve approach is shown on top of respective bars. For better visibility, we limit the Y-axis and show the Y-axis label (larger than the axis bound) for WDC-4, for the naïve case. RMAT-1, IMDB-1 and 4-Motif (on the Youtube graph) also include time for explicit match counting. For the rest, we report time to identify the union of all matches with precision and recall guarantees. . . . 116
Figure 4.8 Runtime per prototype for RMAT-1 (Scale 34 on 128 nodes). The top X-axis label k_p indicates a prototype p at edit-distance k. The bottom X-axis labels are the number of matches (full match enumeration) in each prototype. There are a total of 73.6M matches at distance k = 2 and no match at k < 2. The chart compares performance of two scenarios: naïve and HGT. On average, individual prototype search is 6× faster in HGT.
However, infrastructure management and load balancing account for ∼30% of the total time, which yields a 3.8× net speedup over the naïve approach (Fig. 4.7). . . . 117
Figure 4.9 Runtime broken down by edit-distance level for WDC-3 (on 128 nodes). X-axis labels: k is the edit-distance, pk is the set of prototypes at distance k, V∗k is the set of vertices that match any prototype in pk, and V∗p is the set of vertices that match a specific prototype p ∈ pk. The bottom two rows on the X-axis show: (first row) the size of all matching vertex sets (V∗k) at distance k (i.e., the number of vertices that match at least one prototype), and (bottom row) the total number of vertex/prototype labels generated at distance k. Performance of four scenarios is compared: (i) the naïve approach (§4.8.3); (ii) X - the bottom-up technique where search begins using the furthest edit-distance prototypes and consecutive searches exploit an already pruned graph; (iii) Y - the bottom-up technique including redundant work elimination, i.e., reusing results of non-local constraint checking (§4.8.4); and (iv) Z - the bottom-up technique with load balancing and relaunching processing on a smaller eight node deployment, enabling parallel prototype search (§4.7). . . . 118
Figure 4.10 Impact of load balancing on runtime for the WDC-1, 2 and 3 patterns (Fig. 4.5). We compare two cases: without load balancing (NLB) and with load balancing through reshuffling on the same number of compute nodes (LB). Speedup achieved by LB over NLB is shown on the top of each bar. . . . 120
Figure 4.11 The Reddit and IMDb templates (details in §4.8.5): for RDT-1 and IMDB-1, optional edges are shown in red, broken lines, while mandatory edges are in solid black. . . . 121
Figure 4.12 Runtime broken down by edit-distance level for WDC-4 (on 128 nodes). X-axis labels: k is the edit-distance, pk is the set of prototypes at distance k, and V∗k is the set of vertices that match any prototype in pk.
The bottom two rows on the X-axis show: (first row) the size of all matching vertex sets (V∗k) at distance k (i.e., the number of vertices that match at least one prototype), and (second row) the average search time per prototype at each edit-distance. We also show the number of vertices in the max-candidate set (X-axis label ‘C’), yet no match is found until distance k = 4. Here, the Y-axis is on log scale. . . . 122

Acknowledgments
First and foremost, I would like to thank my Advisor Professor Matei Ripeanu for his sincere guidance and patience with me. Over the years, Professor Ripeanu’s mentorship has been instrumental in shaping my research acumen. A special thanks to my collaborators/mentors Dr. Roger Pearce and Dr. Geoffrey Sanders of Lawrence Livermore National Laboratory, whose in-depth domain knowledge, as well as access to world-class computing resources through this collaboration, have significantly influenced the research I have done. I would like to thank the members of my doctoral exams: Professor Sathish Gopalakrishnan, Professor Mieszko Lis, Professor Alexandra Fedorova, Professor Alan Wagner, Professor Inanc Birol, Professor Guy Lemieux, Professor Philippe Kruchten, and the external examiner, Professor Kamesh Madduri, for their insightful feedback that has helped to improve my thesis. Finally, I would like to thank the members of the NetSysLab for their support and encouragement.

Chapter 1
Introduction
Pattern matching in graphs, that is, finding subgraphs that match a small template graph within a large background graph, has applications in areas as diverse as bioinformatics [3], social network analysis [29, 124], information mining [124], anomaly and fraud detection [56], and program analysis [64]. Subgraph matching belongs to the category of combinatorial problems - the number of subgraphs matching a template can be exponential with respect to the number of vertices and edges in the background graph [33].
If the template size is not limited, in the general case, exact matching is not known to have a polynomial time solution [116]. Therefore, heuristics are developed to minimize time-to-solution, maximize throughput or reduce memory footprint.

1.1 The Multiple Facets of the Pattern Matching Problem
Pattern matching problems vary depending on the definition of a match, the usage scenario, and demands for precision and recall guarantees. In this section, we present an overview of these different ‘facets’ of the pattern matching problem.

1.1.1 Exact and Approximate Pattern Matching
A match can be broadly categorized as exact or approximate. In exact matching, a bijective mapping between the vertices and edges in the template and those in the matching subgraph is required [116]. (In Chapter 2, we present the formal definition of a match and relevant background.)
A match is considered approximate when the template and the matching subgraph are just similar by some defined similarity metric [5, 19, 126]. Over time, approximate matching has become the moniker for a rather large set of problems that have in common that they are not limited to the requirements of exact matching - a matching subgraph can be somewhat different from the search template, and two separate approximate matches can be different from each other. Multiple real-world usage scenarios justify the need for approximate matching (and their diversity is the root cause for the diversity seen in this problem area).
Categories of these scenarios include:
(i) Performance reasons - to reduce the asymptotic complexity of exact matching (as in the general case, exact matching is not known to have a polynomial time solution) and improve time-to-solution at the expense of accuracy [3, 28, 56];
(ii) Uncertainty regarding acquired data - the acquired data can be noisy, leading to a background graph that is different from the ground truth [19, 126]; thus, in this case, an approximate match is sought to compensate for the data acquisition error (and not to reduce the complexity of the search algorithm);
(iii) Exploratory search - a user may not be able to come up with a search pattern a priori [24]. In such scenarios, the user starts with an approximate idea of what (s)he is searching for, and relies on the system’s ability to identify ‘close’ variants of the template [5, 56, 126]; and
(iv) High-throughput information extraction - such as extracting features for machine learning from a graph’s topology: as most machine learning solutions use data in tabular form, they are unable to directly incorporate topological information from networked data. One way to mitigate this issue is a feature engineering strategy which marks each vertex with the pattern(s) it participates in (a complementary approach to existing techniques, such as node2vec [39]).
Note that scenarios (ii), (iii) and (iv) may seek a solution that does not compromise accuracy - it retrieves all the subgraphs that are variants (with a bound on the acceptable difference) of the user-provided template. Unlike scenario (i), in these usage scenarios one does not seek to improve the algorithmic complexity over exact matching.

1.1.2 Diversity in the Use of the Result
Pattern matching may be used for various end goals.
These usage scenarios include: (i) to determine if at least one match exists (yes/no answer), (ii) to retrieve full match enumeration [116] and/or (iii) total match count [108], (iv) to identify the (complete) set of vertices and edges participating in matches, (v) to retrieve top-k or first-k matches [29], (vi) to rank vertices or edges based on their centrality measure relative to the matches, and (vii) to classify vertices based on label or topological similarities [5, 39].
We note that this diversity makes possible two fundamental approaches: On the one side, it is possible to design efficient algorithms highly optimized for each usage above (e.g., well known algorithms for match counting exist; yet, these algorithms are often unable to support other scenarios efficiently). On the other side, the approach we take is to design an efficient pipeline that can support all the above use cases.

1.1.3 Precision and Recall
Further categorization of pattern matching problems is possible based on the demand for precision and recall guarantees. In exact matching, a bijective mapping between the vertices and edges in the template and those in the matching subgraph is required [116]. Therefore, an exact matching solution guarantees 100% precision, i.e., no false positive matches are included in the solution set. Problem scenarios such as retrieving the full match enumeration, or the complete set of matching vertices and edges, implicitly require 100% recall, i.e., they aim to retrieve all valid matches. Other problem formulations, such as retrieving top-k or first-k matches, or retrieving only some of the edges/vertices that participate in matches, relax the 100% recall requirement.
A class of approximate matching [19, 28, 56, 113] solutions trade precision and/or recall to reduce asymptotic complexity and improve time-to-solution.
Other problems, such as retrieving all subgraphs that are variants (e.g., within a given edit-distance [13]) of the user-provided template, may demand 100% precision and 100% recall.

1.1.4 Pattern Matching in Metadata Graphs
In addition to the graph topology, metadata graphs incorporate information about vertices and edges. Typical real-world graphs are metadata graphs. The well known Property Graph Model is one of the most popular methods of representing a metadata graph, adopted by popular graph databases like Neo4j [75], OrientDB [77] and Titan [111].
Berry et al. [9, 10] introduced the problem of type isomorphism in metadata graphs, where a match identifies vertices and/or edges with the same label in the template and the background graph. Label-based matching can be a powerful tool with potential for practical, real-world applications such as social network analysis [28, 29] and link recommendation tasks [40, 113]. Furthermore, label-based matching is promising for contemporary and emerging machine learning tasks such as learning vertex representations to aid vertex role identification [48] and multi-label vertex classification [39].

1.2 Scalability Challenges of Pattern Matching
Applications that operate on graphs with tens to hundreds of billions of edges are common nowadays. Graphs at this scale are predominantly found in the following areas: (i) Social networks - recent work reports that the Facebook user graph has over one trillion social connections (or edges) [18]. The Twitter follow graph had an estimated 60 billion followers (i.e., edges) in 2014 [6, 40]. (ii) Information networks - such as webgraphs: the Web Data Commons hyperlink graph, the largest publicly available webgraph, has about 128 billion (unique) edges. The largest known publicly available Resource Description Framework (RDF) graph has over one trillion triples (i.e., the data graph has more than one trillion edges) [86].
(iii) Connectomics - the study of mapping brain networks at the level of synaptic connections [61]; a complete human brain graph is thought to have one hundred billion (10^11) vertices and a quadrillion (10^15) edges [69]. A pattern matching solution targeting scale-out, distributed memory platforms is needed at this scale.
Unfortunately, existing pattern matching solutions (we survey related work in Chapter 2) have limited capabilities: most importantly, they do not scale to massive graphs and/or support only a restricted set of search templates. Additionally, the algorithms at the core of the existing techniques are not suitable for today’s graph processing infrastructures relying on horizontal scalability and shared-nothing clusters, as most of these algorithms are inherently sequential and difficult to parallelize [71, 78, 116]. Finally, pattern matching is susceptible to combinatorial explosion of the intermediate or final algorithm state: for many queries, the number of subgraphs partially (or entirely) matching the template can grow exponentially with the number of nodes and edges in the already large background graph [96, 108], posing serious memory and communication challenges. Practical solutions for robust pattern matching in large-scale graphs remain rather an open problem.

1.3 A Constraint Checking Approach for Scalable Pattern Matching
This dissertation explores avenues toward a scalable solution for the subgraph matching problem. In particular, this work targets practical pattern matching scenarios in large-scale metadata graphs and designs solutions for distributed memory machines. It addresses both exact and approximate matching problems: First, we present a general solution for exact matching, centered around the idea of graph pruning (Chapter 3).
Then, we demonstrate that the same solution approach lends itself well to solve a class of edit-distance [13] based approximate matching problems (Chapter 4).
We propose a new algorithmic pipeline that bases pattern matching on constraint checking. This approach is motivated by viewing the search template as specifying a set of constraints that the vertices and edges participating in a match must meet. The technique first decomposes the search template into a set of constraints, and then verifies if the vertices and edges in the background graph violate these constraints, and iteratively eliminates them, eventually leading to the set of vertices and edges that is the union of all exact matches for the search template.
Hypothesis and Key Insights. The intuition for the effectiveness of this technique stems from three key observations: First, the traditionally used tree-search techniques [43, 78, 102, 116] generally attempt to enumerate all matches through explicit search. When the search fails on a tree branch, such an unprofitable path is marked invalid and ignored in the subsequent steps. In the same vein as past work that uses graph pruning [66, 129] or, more generally, input reduction [59], our conjecture is that it is much cheaper to first focus on eliminating the vertices and edges that do not meet the label and topological constraints introduced by the search template.
Second, such a pruning approach lends itself well to developing a vertex-centric algorithmic solution (presented in §3.4) and this makes it possible to harness existing high-performance, vertex-centric frameworks (e.g., GraphLab [35], Giraph [34] and HavoqGT [82]).
In our vertex-centric solution for pruning, a vertex must satisfy two types of constraints, namely, local and non-local, to possibly be part of a match. Local constraints involve only the vertex and its neighborhood: a vertex in an exact match needs to (i) match the label of a corresponding vertex in the template, and (ii) have edges to vertices labeled as prescribed in the adjacency structure of this corresponding vertex in the template. Non-local constraints are topological requirements beyond the immediate neighborhood of a vertex (e.g., that the vertex must be part of a clique).
The third observation is that full match enumeration is not the most efficient avenue to support many of the high-level graph analysis scenarios presented above. Depending on the final goal of the user, pattern matching problems fall into a number of categories which include: (a) determining if a match exists (or not) in the background graph (yes/no answer), (b) selecting all the vertices and edges that participate in matches, (c) ranking these vertices or edges based on their centrality with respect to the search template, i.e., the frequency of their participation in matches, (d) counting/estimating the total number of matches (comparable to the well-known triangle counting problem [107]), or (e) enumerating all distinct matches in the background graph. The traditional approach [78, 108, 116] is to first enumerate the matches (category (e) above) and to use the result of the enumeration to answer (a) – (d). However, this approach is limited to small background graphs or is dependent on a low number of near and exact matches within the background graph (due to exponential growth of the algorithm state).
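To make the local constraint check concrete, the following sketch (illustrative Python only, not the dissertation's distributed implementation; it assumes the graph and template are adjacency-list dictionaries with per-vertex label maps) iterates the check to a fixed point, since eliminating one vertex can invalidate the local constraints of its neighbors:

```python
def satisfies_local(v, graph, labels, alive, template, t_labels):
    # Constraint (i): v's label must appear on some template vertex.
    candidates = [u for u in template if t_labels[u] == labels[v]]
    if not candidates:
        return False
    # Constraint (ii): for at least one candidate template vertex, every
    # label in its adjacency structure must appear among v's active neighbors.
    neighbor_labels = {labels[w] for w in graph[v] if w in alive}
    return any(all(t_labels[t] in neighbor_labels for t in template[u])
               for u in candidates)

def local_constraint_prune(graph, labels, template, t_labels):
    # Iterate to a fixed point: removing a vertex may invalidate others.
    alive = set(graph)
    changed = True
    while changed:
        changed = False
        for v in list(alive):
            if not satisfies_local(v, graph, labels, alive, template, t_labels):
                alive.discard(v)
                changed = True
    return alive
```

For example, pruning against a triangle template labeled a-b-c eliminates any a-labeled vertex that lacks both a b- and a c-labeled neighbor. As the text notes, local checks alone are not sufficient: a vertex can pass them yet participate in no match, which is what the non-local constraints address.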
We take the position that a pruning-based pipeline is not only a practical solution to (a) – (d) (and to other pattern-matching-related analytics, when full match enumeration is not the main interest) but also an efficient path toward full match enumeration on large graphs.

Figure 1.1: High-level illustration of how the graph pruning approach fits in the larger context of the pattern matching problem. Here, G is the background graph, G0 is the search template and G∗ is the pruned, solution subgraph - the union of all the matching subgraphs. (a) The figure compares the traditional approach and the matching pipeline we propose: the conventional techniques typically rely on match enumeration to answer any pattern matching query, regardless of whether match listing is requested or not. In contrast, the technique we propose (depicted using solid red lines in the figure) first identifies the solution subgraph G∗ by pruning away the non-matching part of G. Compared to full match enumeration, low cost (at least in practice) algorithms for pruning are possible. Moreover, G∗ can directly answer certain pattern matching queries (highlighted using solid red lines). (b) The figure shows how our graph pruning pipeline can support diverse pattern matching scenarios - other algorithms can operate on the reduced, solution subgraph G∗ to answer various match queries (highlighted using solid blue lines). Furthermore, the pruning procedure collects additional information at the vertex granularity (§3.5) that can be leveraged to accelerate, for example, full match enumeration.

Fig.
1.1 paints the big picture of our solution approach - it highlights how our technique fits in the larger context of the pattern matching problem.
Based on these observations, the following hypothesis establishes the foundation of the research presented in this dissertation:
Systematic graph pruning through constraint checking is an effective building block toward enabling scalable pattern matching in massive metadata graphs using distributed memory systems.

1.4 Research Objectives
This thesis verifies the aforementioned hypothesis and explores opportunities toward designing scalable pattern matching solutions based on the constraint checking approach. In particular, this thesis aims to answer the following high-level questions:
(Q1) Can we design a generic exact matching solution that supports templates with arbitrary label distribution and topology, and offers precision and recall guarantees? In addition to identifying the complete set of matching vertices and edges, can this technique be used as a stepping stone to efficiently solve other problems listed in §1.1? Also, can we design an approximate matching solution (e.g., retrieving subgraphs that are close variants of the user-provided template) following this approach? What classes of approximate matching problems or similarity metrics can be supported?
(Q2) What are the key challenges and opportunities within this design space? For instance, can we develop a solution on top of an existing vertex-centric distributed graph processing middleware or do we need to develop an entirely new infrastructure? Can a constraint checking based solution address the scalability challenges?
More specifically, does it scale with the workload and the platform size?
(Q3) What are the key performance, efficiency, and quality-of-solution trade-offs for a solution based on the constraint checking approach to support different pattern matching scenarios?

1.5 Methodology
Our research methodology is driven by the algorithmic and infrastructure requirements to support pattern matching solutions following the constraint checking approach. The methodology consists of the following high-level steps:
Algorithm Design. First, we design algorithms for exact matching (Chapter 3): we begin with characterizing the problem and identifying algorithmic requirements, followed by designing the basic building blocks to verify the hypothesis using a restricted set of search patterns with constraints on the topology and vertex label distribution (§3.6). Following the success of the initial set of investigations, we focus on developing a generic solution with no restriction on the search template topology (§3.4).
We capitalize on the solution approach for exact matching and design distributed algorithms for a class of edit-distance based similar subgraph matching problems (Chapter 4).
For both categories of matching problems, first, we concentrate on developing a sequential algorithm to ensure correctness of the proposed solution and establish the theoretical foundations (§3.3, §4.4 and Appendix A). We then focus on identifying distributed solution requirements and design vertex-centric, distributed algorithms (§3.5 and §4.7).
Prototype Implementation. Informed by the requirements of the distributed algorithms, we develop proof-of-concept implementations for exact (§3.5) and edit-distance based subgraph matching (§4.7) on top of HavoqGT [82], an open-source asynchronous graph processing framework.
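The vertex-centric reformulation mentioned above can be illustrated with a shared-memory analogue (a hedged sketch in plain Python, not HavoqGT's actual API; `required` is a hypothetical simplification that maps a vertex label to the set of neighbor labels the template demands of it). A worklist drives re-checking, so only vertices whose neighborhood just changed are revisited, mirroring asynchronous, message-driven execution:

```python
from collections import deque

def worklist_prune(graph, labels, required):
    # Start with every vertex whose label occurs in the template.
    active = {v for v in graph if labels[v] in required}
    work = deque(active)
    while work:
        v = work.popleft()
        if v not in active:
            continue  # already eliminated by an earlier step
        seen = {labels[w] for w in graph[v] if w in active}
        if not required[labels[v]] <= seen:
            # Local constraint violated: eliminate v and re-check its
            # still-active neighbors, whose neighborhoods just changed.
            active.discard(v)
            work.extend(w for w in graph[v] if w in active)
    return active
```

In a distributed setting, "re-check a neighbor" becomes a message sent to the (possibly remote) partition that owns that vertex, which is what makes this formulation a natural fit for asynchronous, vertex-centric frameworks.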
The design challenges include constraint ordering, coping with the growth of the algorithm state, and load balancing, while maximizing system utilization and efficiency (in terms of memory footprint and network traffic) through techniques such as search space pruning, message aggregation, and work recycling.
Evaluation. We evaluate performance by experimenting on real-world and synthetic datasets, orders of magnitude larger than the prior work (summarized in Chapter 2). We evaluate scalability through strong scaling and weak scaling experiments using graphs with hundreds of billions, up to trillions of edges and up to tens of thousands of cores. We use templates based on patterns naturally occurring in the background graphs; templates that are relevant to real-world use cases and stress the system along multiple axes. The evaluation includes detailed analysis of design trade-offs and the impact of various optimizations. Additionally, we empirically compare our work with multiple state-of-the-art solutions.

1.6 Summary of Contributions
This section summarizes the contributions of this dissertation at a high level, while §3.2 and §4.3 present the individual contributions for the exact and edit-distance based subgraph matching solutions, respectively, in greater detail.
(i) A New Approach for Exact Pattern Matching. We present an algorithmic pipeline that bases pattern matching on constraint checking. The key intuition is that each vertex or edge participating in an exact match has to meet a set of constraints specified by the search template. Our proposed technique decomposes the search template into a set of constraints, verifies if vertices/edges in the background graph violate these constraints, and iteratively eliminates them, eventually leading to the set of vertices and edges that is the union of all exact matches of the search template.
The solution is: generic - no restrictions on the set of search patterns supported; precise - no false positives; offers 100% recall - retrieves all vertices and edges participating in matches; and scalable - with the dataset and platform size. Note that, in this dissertation, we consider non-induced subgraph matching (details in §2.1.3).
(ii) Edit-Distance based Template Variant Subgraph Matching with Precision and Recall Guarantees. We present a distributed solution for a class of approximate matching queries where the user needs full precision (i.e., there are no false positives in the returned match set) and full recall (i.e., all matching vertices/edges are identified), as well as a user-specified bound on the similarity between the user-provided search template and the (template variant) matches returned by the system. We quantitatively estimate similarity through edit-distance [13]. We observe that this problem can be equivalently stated as the problem of finding exact matches for all 0, . . . , k edit-distance prototypes (subgraphs) of a given template [126]. We base our solution on the same constraint checking approach, which offers a stepping stone to build an edit-distance based matching solution and can be used in two directions: First, one can generate the constraints that all the vertices/edges participating in an approximate match at distance k must meet (i.e., an exact match with any prototype within distance k), and use these constraints to reduce the problem space. Second, one can decompose multiple, yet similar, prototypes into their composing constraints, run these constraints, and use the information to infer exact matching into specific prototypes, thus amortizing the cost of executing each constraint as they are shared by multiple prototypes. Our design exploits key relationships between prototypes to prune the search space and eliminate redundant work, enabled by looking at the search template as a set of constraints.
(iii) Optimized Distributed Implementation.
We offer an efficient implementation on top of HavoqGT [82], an open-source MPI-based vertex-centric asynchronous graph processing framework. The implementation provides the necessary infrastructure support for both exact and the target edit-distance based subgraph matching queries. The prototype includes key optimizations that dramatically reduce the generated traffic: aggressive search space pruning, and a technique that offers message efficiency by skipping duplicate constraint checking tasks, thus preventing possible combinatorial explosion. Additional system features include the ability to load balance a pruned, intermediate graph; for the edit-distance based matching, it also offers parallelism at multiple levels and enables reusing the result of constraint checking (which eliminates a large amount of potentially redundant work).
(iv) Proof of Feasibility at Scale. We demonstrate the performance of our solution by experimenting on eight real-world and one synthetic datasets, orders of magnitude larger than the prior work (§3.7 and §4.8). We evaluate scalability through two types of experiments: first, a strong scaling experiment using real-world datasets, including the largest openly available webgraph, whose undirected version has over 257 billion edges; second, a weak scaling experiment using synthetic, R-MAT [15] generated graphs with up to 4.4 trillion edges, on up to 1,024 compute nodes (36,864 cores).
We demonstrate support for search patterns with arbitrary label distribution and topology, representative of practical queries in both relatively frequent and needle in the haystack scenarios, and, to stress our system, we consider patterns containing the highest-frequency vertex labels (up to 14 billion instances). We show that in some cases our technique prunes the graph by orders of magnitude, which, combined with the compact intermediate state constructed during pruning, makes match enumeration feasible on graphs with trillions of edges.
For approximate matching, we show support for patterns with an edit-distance large enough to generate 1,000+ prototypes.
(v) Demonstrate Support for Multiple Usage Scenarios. The constraint checking technique reduces the background graph to a set of vertices and edges that is the union of all exact matches for the user-provided template. This procedure also collects additional information that can be used to support other high-level graph analysis. We explore the following scenarios: (a) For each vertex in the solution set, we build a list of its exact match(es) in the search template (i.e., the matching template vertices). (b) For the edit-distance based matching problem, we extract a per-vertex vector indicating which prototype(s) the vertex is a match for.
We postulate that this pipeline can be used as a stepping stone to efficiently solve the other problems listed in §1.1. More precisely, other graph analyses, such as full match enumeration and match counting, can begin with the pruned subgraph (which is the union of all exact matches with 100% precision and 100% recall guarantees). We show that the information collected in (a) can be used to accelerate match enumeration in billion or trillion edge background graphs (§3.7.3). We believe the information collected in (b) can be used as vertex level features for training a machine learning pipeline such as the ones that have been used for Representation Learning [39, 42, 57, 83] in graph datasets (discussed in §4.9).
(vi) Application Demonstration. We demonstrate that our solution lends itself to efficient pattern discovery in real-world pattern matching scenarios: we use two real-world metadata graphs that we have curated from publicly available datasets, Reddit (3.9 billion vertices, 14 billion edges) and the smaller International Movie Database (IMDb), and show practical use cases of our technique to support rich pattern mining (§3.7.4 and §4.8.5).
(vii) Comparison with the State-of-the-Art.
We empirically compare our work with two recent state-of-the-art systems using five real-world graphs: (i) QFrag [100], a comparison of exact matching for match enumeration using labeled templates (§3.7.11), and (ii) Arabesque [108], a comparison of exact and edit-distance based matching for match counting using unlabeled templates (§4.8.6). The experiments demonstrate the significant advantages that our system offers for handling large graphs and complex patterns.

(viii) System Evaluation, Bottleneck Characterization and Further Design Explorations. We evaluate the effectiveness of our design choices and the optimizations that the prototype implementations embrace (§3.7.6 and §4.8.4). We study the impact of the key optimizations, such as our load balancing strategies (§3.7.7), on performance and scalability. We present a detailed bottleneck analysis and characterization of the artifacts that impact performance along multiple axes (§3.7.7 and §3.7.9). Furthermore, we present findings of design explorations to aid informed decision making, in particular constraint ordering and selection (§3.7.10).

(ix) Open-source Software Artifact. This work led to an open-source contribution: software infrastructure/toolchain1 that can be used by other researchers and practitioners. Since it was developed using open-source libraries, the application can be run on any MPI-supported Linux cluster or a single machine.

1.7 Dissertation Organization

The rest of the dissertation is organized as follows:

Chapter 2 provides problem background and presents the literature review.
Chapter 3 details the constraint checking based solution for exact pattern matching, the system design and distributed implementation, and scaling experiments using massive real-world and synthetic workloads on a leadership-class High Performance Computing (HPC) cluster.

Chapter 4 presents the edit-distance based template variant subgraph matching pipeline, the design and implementation of the distributed solution, and evaluation using real-world and synthetic workloads. Chapters 3 and 4 also highlight the impact of the design choices and the relevance of the contribution in relation to practical use cases.

Chapter 5 summarizes the two main research themes presented in this dissertation and their respective impacts, highlights the limitations of the current solutions, and discusses possible improvements and future extensions.

Appendix A presents correctness proofs for the constraint checking algorithms presented in Chapter 3, while Appendix B presents complexity analysis of these algorithms.

1www.github.com/LLNL/HavoqGT

Chapter 2

Background and Related Work

This chapter presents background information and a literature review on pattern matching in graphs. We note that the volume of related work on graph processing in general [34–36, 50, 68, 82, 106] and on pattern matching algorithms in particular [3, 10, 29, 71, 78, 116, 130] is humbling. Here, we discuss only the most relevant contributions.

This chapter is structured as follows: We begin with discussing the two primary categories of pattern matching problems (§2.1), namely, exact and approximate matching, and review the key algorithmic techniques used (§2.2). We present a detailed literature review of distributed graph pattern matching systems (§2.3). We discuss metadata graph models and metadata-based pattern matching (§2.5).
In the remainder of the chapter we discuss work related to other systems aspects of pattern matching, e.g., query languages (§2.4), distributed graph processing frameworks (§2.6), and input reduction in graph processing (§2.7). Table 2.1 lists the symbolic notation used in this chapter.

2.1 Exact and Approximate Pattern Matching Definitions

In graph pattern matching, a match can be defined in multiple ways, and variants of this problem can be divided into two broad categories: exact and approximate matching [19, 56, 126].

Table 2.1: Symbolic notation used.

Object(s)                            | Notation
template graph, vertices, edges      | G0(V0, E0)
template graph sizes                 | n0 := |V0|, m0 := |E0|
template vertices                    | V0 := {q0, q1, ..., q(n0−1)}
template edges                       | (qi, qj) ∈ E0
background graph, vertices, edges    | G(V, E)
background graph sizes               | n := |V|, m := |E|
background vertices                  | V := {v0, v1, ..., v(n−1)}
background edges                     | (vi, vj) ∈ E
label set                            | L = {0, 1, ..., nℓ − 1}
vertex label of qi                   | ℓ(qi) ∈ L
matching subgraph, vertices, edges   | H(VH, EH)

2.1.1 Exact Pattern Matching

For exact pattern matching, graph isomorphism between the search template and a subgraph of the background graph (i.e., the 'match') is sought. More formally, for labeled graphs:

Definition 1. A subgraph H(VH, EH), VH ⊂ V, EH ⊂ E, is an exact match of template graph G0(V0, E0) (in notation, H ∼ G0) if there exists a bijective function φ : V0 ←→ VH with the following properties (note that φ may not be unique for a given H):
(i) ℓ(φ(q)) = ℓ(q) for all q ∈ V0, where ℓ(q) is the vertex label of q,
(ii) ∀(q1, q2) ∈ E0, we have (φ(q1), φ(q2)) ∈ EH, and
(iii) ∀(v1, v2) ∈ EH, we have (φ−1(v1), φ−1(v2)) ∈ E0.

The general case of exact matching (and subgraph isomorphism) is not known to have a polynomial time solution when the search templates are not of fixed size [116].
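Definition 1 translates directly into a verification routine for a candidate mapping. The following Python sketch is a hypothetical illustration (not part of the dissertation's implementation); it represents graphs as label dictionaries plus sets of undirected edges, assumes φ's image consists of background vertices, and, for property (iii), takes H to be the subgraph induced by the matched vertices (for non-induced matching, EH is simply the image of the template edges and (iii) holds by construction):

```python
# Check whether a candidate bijection phi satisfies Definition 1.
# Graphs are given as (labels, edges): labels maps vertex -> label,
# edges is a set of frozensets (undirected edges).

def is_exact_match(t_labels, t_edges, g_labels, g_edges, phi):
    # phi must be a bijection from the template vertices onto the match vertices
    if set(phi) != set(t_labels) or len(set(phi.values())) != len(phi):
        return False
    # (i) labels are preserved: l(phi(q)) = l(q)
    if any(g_labels[phi[q]] != t_labels[q] for q in t_labels):
        return False
    # (ii) every template edge maps to an edge of the match
    if any(frozenset((phi[q1], phi[q2])) not in g_edges
           for (q1, q2) in (tuple(e) for e in t_edges)):
        return False
    # (iii) every edge among the matched vertices maps back to a template edge
    inv = {v: q for q, v in phi.items()}
    match_vertices = set(phi.values())
    h_edges = {e for e in g_edges if e <= match_vertices}
    return all(frozenset((inv[v1], inv[v2])) in t_edges
               for (v1, v2) in (tuple(e) for e in h_edges))

# Triangle template with labels a, b, c
t_labels = {0: "a", 1: "b", 2: "c"}
t_edges = {frozenset(e) for e in [(0, 1), (1, 2), (0, 2)]}
# Background graph containing one labeled triangle
g_labels = {10: "a", 11: "b", 12: "c", 13: "a"}
g_edges = {frozenset(e) for e in [(10, 11), (11, 12), (10, 12), (12, 13)]}

print(is_exact_match(t_labels, t_edges, g_labels, g_edges,
                     {0: 10, 1: 11, 2: 12}))  # True
```

Enumerating all such φ is the hard part; the check itself is cheap, which is what pruning-based approaches exploit.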
(It has been claimed that polynomial time solutions are possible under certain assumptions, such as the background graph being acyclic [2], planar [51], or of bounded valence [65].)

2.1.2 Approximate Pattern Matching

Approximate pattern matching is the epithet for a rather large set of matching problems that are not constrained by the requirements of exact matching. Due to its high complexity in the general case, exact subgraph matching or isomorphism is often not practical in the real world. Concurrently, there are problem scenarios where, rather than identifying exact matches only, identifying 'close' variants of the template is equally important [9, 126]. For example, the data graph can be noisy, i.e., different from the ground truth; therefore, the user relies on the system's ability to identify subgraphs that are similar to the search template. Applications of this matching approach are prevalent in problem domains such as knowledge discovery and data mining [124], information retrieval from semantic metadata stores [29], bioinformatics [3], anomaly and fraud detection [56], program analysis [64] and image analysis [19], to name a few (§1.1). (In Chapter 4, we discuss usage scenarios for approximate matching in further detail.)

In general, approximate matching algorithms aim to tolerate differences between the search template and the subgraph identified as a match [19, 126]. Here, the template and the match are merely similar by some defined similarity metric [5, 19, 126], for example, the graph edit-distance [13].

Definition 2. A subgraph H is an approximate or close match of template graph G0 if the distance function ϕ satisfies the following condition:

ϕ(H, G0, ψ) ≤ ε

Given two graphs H and G0 and a similarity estimator ψ, the distance function ϕ quantitatively measures how similar H and G0 are.
ε is the threshold (identified a priori) for acceptable distance (e.g., compared to G0, H is only allowed to be missing up to a fixed number of edges); it determines whether to accept H as a 'similar' match for G0 or reject it.

The similarity estimator (also called the similarity metric) ψ is the definition of how to compute similarity, i.e., the distance returned by ϕ. The definition (i.e., ψ) is specific to the similarity estimation model of interest, e.g., Graph Edit-distance [13], Maximum Common Subgraph [12] and Graph Kernel [52], to name a few. As an example, below we present the formal definition of graph edit-distance:

Definition 3. The graph edit-distance between two graphs H and G0, written as ψGED(H, G0), is defined as the minimum number of edit operations required to transform H to G0:

ψGED(H, G0) = min_{(e1, ..., ek) ∈ E} Σ_{i=1}^{k} c(ei)

In the above equation, E = {e1, ..., ek} is the set of possible edit operations (e.g., edge addition or deletion) that can be performed on H; k is the maximum number of edit operations allowed. The original definition of edit-distance also allows defining an (optional) cost for each edit operation: c(ei) is the cost associated with an ei ∈ E. (Chapter 4, Fig. 4.2 presents an example of an edit-distance subgraph match.)

2.1.3 Induced and Non-induced Subgraph Matching

A matching subgraph H of the template G0 can be further categorized as induced or non-induced [70]. An induced subgraph (i.e., a match) of the template does not allow additional edges, that do not match the template, between the vertices that belong to a match. The non-induced variant relaxes this constraint and allows arbitrary edges between the vertices of a match. Fig. 2.1 illustrates the difference between induced and non-induced subgraph matches.

In this dissertation, we consider non-induced subgraph matching. The solutions presented in Chapters 3 and 4 would only require small changes to support induced matching, with precision and recall guarantees for the result.
Since the induced case introduces further constraints for a matching vertex, we believe this favours our pruning-based pattern matching approach.

2.2 General Algorithmic Approaches

In this section, we review the key sequential pattern matching algorithms and general techniques. First, we focus on exact matching algorithms; then we discuss approximate matching techniques and relevant contributions. (Work related to distributed pattern matching is reviewed separately in §2.3.)

Figure 2.1: An example of a template (left), an induced subgraph match of the template in graph (a) (top right) and a non-induced subgraph match of the template in graph (b) (bottom right). Graph (b) has additional edges among the matching vertices that are not present in the given template, shown using red, solid lines. Graph (b) does not contain an induced subgraph of the template.

Table 2.2 summarizes the key categories of pattern matching techniques discussed in this section. It shows whether these techniques support exact and/or approximate matching, offer precision and/or recall guarantees, and the usage scenarios supported. (Note that techniques that focus on counting or estimating the global match frequency often incorporate optimizations that prevent them from being able to identify individual matches, e.g., the color-coding algorithm [3] and ASAP [56].)

2.2.1 Exact Pattern Matching

Early work on graph pattern matching mainly focused on solving the graph isomorphism problem [116]. The well-known Ullmann's algorithm [116] and its improvements (in terms of join order, pruning strategies and space complexity), e.g., VF2 [78] and QuickSI [102], belong to the family of tree-search based algorithms. Ullmann proposed a backtracking algorithm which finds exact matches by incrementing partial solutions and uses heuristics to prune unprofitable paths. VF2 improves the time and space complexity over Ullmann's algorithm.
Table 2.2: Categorization of pattern matching techniques found in the literature: the table shows whether these techniques support exact and/or approximate matching, offer precision and/or recall guarantees, and the type(s) of output produced. (100% precision means it is possible to use the technique such that no false positive matches are included in the final output, and 100% recall means it is possible to use the technique such that the solution retrieves all valid matches. If the technique does not offer precision and/or recall guarantees, it is labeled N/A.) The last row lists the technique introduced in this dissertation, dubbed Constraint Checking.

Technique           | Exact | Approximate | Precision and Recall              | Output
Tree-search         | X     | X           | 100% Precision and/or 100% Recall | Full Match Enumeration and Counting
Canonical Labeling  | X     | X           | 100% Precision and/or 100% Recall | Isomorphism Check
Subgraph Indexing   | X     | X           | 100% Precision and/or 100% Recall | Full Match Enumeration and Counting
Graph Simulation    |       | X           | N/A                               | Approximate Match Enumeration and Counting
Color-coding        |       | X           | N/A                               | Approximate Counting
Graph Sampling      |       | X           | N/A                               | Probabilistic Counting
Constraint Checking | X     | X           | 100% Precision and 100% Recall    | Complete Set of Matching Vertices and Edges; Full Match Enumeration and Counting

The algorithm uses a heuristic that is based on the analysis of the vertices adjacent to vertices that have been included in a partial solution. The VF2 algorithm is known to be robust and performs well in practice, and consequently has been included in the popular Boost Graph Library (BGL) [62]. A recent effort, TurboISO [44], is considered to be the most optimized among the tree-search based sequential subgraph isomorphism techniques. (Note that the pattern search can be performed in a depth-first or a breadth-first manner. The naïve pattern matching technique recursively searches the full template from each vertex in the background graph in a depth-first manner.
The tree-search algorithms are merely optimizations of this depth-first search technique.)

For large graphs, a tree search may fail midway and have to backtrack; hence, this technique can be expensive. Efficient distributed implementation of this approach is difficult for a number of reasons: existing algorithms are inherently sequential and difficult to parallelize. Furthermore, a key limitation of this technique is that the number of possible join operations (the process of adding a graph edge to an intermediate match) is combinatorially large, which makes its application to generic patterns and massive graphs, with billions or trillions of edges, impractical. Also, the above algorithms use heuristics for join order selection; as a result, performance is often sensitive to the graph topology and label frequency, and relies on expensive preprocessing for join order optimization, such as sorting the neighbor vertices by degree [44, 60].

Perhaps the best known exact matching algorithm that does not belong to the family of tree-search based algorithms is Nauty, due to McKay [71], which is based on canonical labeling of the graph. This approach, however, has high preprocessing overhead. Nauty can perform verification for isomorphism in O(n^2) time; however, transforming arbitrary input graphs to the canonical form requires exponential time [74].

In the same spirit as database indexing, subgraph indexing (i.e., indexing of frequent subgraph structures) is an approach attempted in order to reduce the number of join operations (between subgraph structures) and to lower query response time, e.g., SpiderMine [130], R-Join [17], C-Tree [46], SAPPER [126], TriAD [41] and the contributions by Sun et al. [105] and Gao et al. [32]. Unfortunately, for a billion-edge graph, this approach is infeasible to generalize: First, searching frequent subgraphs in a large graph is notoriously expensive.
Second, depending on the topology of the search template(s) and the background graph, the size of the index is often superlinear relative to the size of the graph [105].

2.2.2 Approximate Pattern Matching

In the previous section, we reviewed techniques that are primarily designed for exact matching. Earlier we established the need for approximate matching and discussed various usage scenarios (§1.1 and §2.1). Here, we discuss the key algorithmic techniques, specifically, contributions that have demonstrated the ability to accommodate relatively large graphs. First, we discuss the popular graph similarity estimators (also called proximity or closeness estimators). Then we review a number of notable contributions targeting approximate pattern matching.

2.2.2.1 Graph Similarity Estimators

There exist several techniques to estimate graph similarity. Perhaps the best known graph similarity metric is Edit-distance [13]: it is a widely adopted similarity metric [30, 72, 93, 126], easy to understand by users, and can be adapted to various use cases seen in practice by extending/restricting the set of edit operations, e.g., vertex/edge deletion or addition, and vertex/edge label substitution. Also, it is possible to support efficient approximate/template variant searches, while providing precision and recall guarantees [13]. (Our solution for template variant subgraph matching, presented in Chapter 4, uses the edit-distance metric for similarity computation.)

There are other methods to estimate graph similarity that are found in the literature: Maximum Common Subgraph (MCS) [12] (equivalent to edit-distance under a certain cost function) and Pairwise Graph Edit-distance are popular alternatives to the original edit-distance metric (low cost; however, algorithms based on this metric cannot offer precision or recall guarantees).
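To make the edit-distance estimator concrete, the following Python sketch is a hypothetical illustration (not the dissertation's implementation) under simplifying assumptions: unit edit costs, edge additions/deletions only, and a fixed vertex correspondence phi. Under these assumptions the edit-distance reduces to the symmetric difference of the two edge sets, and a candidate is accepted when the distance is within a threshold epsilon (the ε of Definition 2):

```python
# Restricted graph edit-distance under a fixed vertex correspondence:
# unit-cost edge additions/deletions only (a simplifying assumption; the
# general problem also allows vertex edits and label substitutions, and
# minimizes over all correspondences, which is much harder).

def edge_edit_distance(t_edges, h_edges, phi):
    # Map the candidate's edges back into template vertex ids via phi's inverse
    inv = {v: q for q, v in phi.items()}
    mapped = {frozenset((inv[u], inv[v])) for (u, v) in h_edges}
    template = {frozenset(e) for e in t_edges}
    # Each edge present on one side but not the other costs one edit
    return len(mapped ^ template)

def is_close_match(t_edges, h_edges, phi, epsilon):
    # Accept H as a 'similar' match iff the distance is within the threshold
    return edge_edit_distance(t_edges, h_edges, phi) <= epsilon

# Square template 0-1-2-3-0; candidate subgraph missing one edge
t_edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
phi = {0: "a", 1: "b", 2: "c", 3: "d"}
h_edges = [("a", "b"), ("b", "c"), ("c", "d")]   # edge ("d", "a") missing

print(edge_edit_distance(t_edges, h_edges, phi))      # 1
print(is_close_match(t_edges, h_edges, phi, epsilon=1))  # True
```

This mirrors the "missing up to a fixed number of edges" reading of the threshold ε discussed under Definition 2.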
Similarity based on statistical significance captured by the Chi-Square Statistic has been proposed in [25]. The Graph Kernel metric (which computes an inner product on graphs) has been used mainly for graph classification [52]. Other techniques found in the literature include Conductance-based [58] and Metapath-based [114, 122] approaches.

2.2.2.2 Algorithmic Techniques

The term 'approximate' is commonly used to differentiate a solution from exact matching, whether the goal is to (i) identify close variants of a given search template, or (ii) improve algorithmic complexity (at the expense of result accuracy). The superpolynomial nature of subgraph isomorphism, however, has led to the development of a plethora of approximate matching heuristics, and thus a wealth of related work on this topic.

The graph simulation [49] family of algorithms relax the match constraints: a match is defined by a binary relation between vertices and/or edges of the background graph and the query subgraph, e.g., matching based on vertex attributes and/or only local connectivity constraints (such as the parent-child relation) between a pair of vertices [28]. Simulation-based algorithms often have quadratic/cubic time complexity [27, 28, 67] and have been adopted by a number of projects targeting large-scale (approximate) matching [27–29, 31, 63, 67]. In general, the graph simulation approach is unable to provide precision and recall guarantees for the identified matches.

Alon et al. [3] proposed the color-coding algorithm to approximate the count of tree-like patterns (a.k.a. treelets) and showed that the time complexity of their solution is O(2^n0 m) (for tree patterns only), i.e., in time linear in the size of the background graph; an improvement over the general case, which is O(n^n0). (Here, n is the number of vertices and m is the number of edges in the background graph, and n0 is the number of vertices in the template graph.)
The technique first randomly assigns colors to the graph vertices from a set of k unique colors, where k = n0 and each color represents a vertex in the query template, and then counts the occurrences of the matches that have vertices with distinct colors. The count is then scaled up to get an estimate of the total number of matches. The estimate can be improved by repeating the process over multiple iterations of random coloring (of the background graph) and taking the average. This technique is considered a good fit for approximate motif1 counting, for example, in biological networks [3, 103, 128], and multiple parallel/distributed implementations of this technique exist [14, 103, 128]. Recently, Chakaravarthy et al. [14] extended this idea to (a restricted set of) patterns with cycles and offered an approximate solution for match counting.

Approximate matching solutions commonly adopt a sampling technique; the goal of sampling-based matching has predominantly been to improve runtime performance by analyzing only a part of the background graph; hence, solutions based on this approach cannot provide precision and/or recall guarantees [3, 19, 25, 56, 126]. The random walk [112] technique has been widely used to sample graphs for match approximation [9, 10, 113]. G-Ray is an approximate matching algorithm for finding subgraphs in time linear to the size of the background graph. It leverages random walk with restart [112] to measure the probability of an edge in the background graph being a match for an edge in the template [113]. Berry et al. [9, 10] presented a polynomial time (O(m + σ log σ)) approximate matching algorithm for enumerating a set of candidate matches. Here, m is the number of edges in the background graph and σ is the number of candidate matches. The algorithm uses biased random walk to approximate the candidate matches for the query template.

1Network motifs are connected patterns of vertex-induced embeddings that are non-isomorphic.
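Returning to the color-coding estimator described earlier in this section: it can be illustrated on small path templates. The sketch below is a hypothetical, unoptimized Python illustration (practical implementations use dynamic programming over color subsets rather than path enumeration); it colors the graph with k = n0 colors, counts simple paths whose vertices have pairwise-distinct colors, and scales the averaged count by k^k/k!, the inverse probability that a fixed match is colorful:

```python
import random
from math import factorial

# Color-coding estimate of the number of simple k-vertex paths
# (a hypothetical, unoptimized sketch for illustration only).

def count_colorful_paths(adj, color, k):
    # Count k-vertex simple paths whose vertices have pairwise-distinct colors
    def extend(path, used_colors):
        if len(path) == k:
            return 1
        total = 0
        for w in adj[path[-1]]:
            if w not in path and color[w] not in used_colors:
                total += extend(path + [w], used_colors | {color[w]})
        return total
    # Each undirected path is found from both of its endpoints; divide by 2
    return sum(extend([v], {color[v]}) for v in adj) // 2

def estimate_path_count(adj, k, iterations=100, seed=0):
    rng = random.Random(seed)
    scale = k ** k / factorial(k)   # 1 / Pr[a fixed match is colorful]
    total = 0
    for _ in range(iterations):
        color = {v: rng.randrange(k) for v in adj}  # random k-coloring
        total += count_colorful_paths(adj, color, k)
    return scale * total / iterations

# 5-cycle: the exact number of 3-vertex simple paths is 5
adj = {i: [(i - 1) % 5, (i + 1) % 5] for i in range(5)}
print(estimate_path_count(adj, k=3, iterations=2000))
```

Averaging over many random colorings drives the estimate toward the true count (here, 5), which is exactly the repetition step described above.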
The authors of this work also presented a parallel shared memory implementation of this technique. Similarly, the distributed pattern matching system ASAP [56] relies on a random walk based local neighborhood sampling technique. ASAP enables a trade-off between result accuracy and time-to-solution; it employs Chernoff bound analysis to control the result error [80].

The sampling approach is more fitting for approximate match counting [3, 56]: these techniques first sample the background graph, for example, sample edges in the background graph that match the template. The count is then scaled up to get an estimate of the total number of matches [3, 56]. To harden the statistical significance of the estimate, sophisticated techniques have been used: the Chi-Square Statistic has been used in [25], and ASAP [56] uses Chernoff bound analysis to control the estimation error in the match count [80]. Sampling-based and probabilistic match approximation techniques, however, are commonly subject to sampling noise and stochastic model errors [19, 25, 56].

The frequent graph structure indexing approach has also been adopted by some approximate matching techniques in order to reduce the number of join operations, such as C-Tree [46] and SAPPER [126]. Unfortunately, for a billion-edge graph, this approach is infeasible.

Another approximate matching approach found in the literature is based on computing the global or local minimum of the matching cost. Instead of being abandoned from the solution set, graph vertices that do not fully satisfy the query template are penalized by assigning a cost to them. Although an optimal solution is not guaranteed, finding the local minimum is a simpler problem compared to finding the global minimum, and fast, often polynomial time, solutions are achievable. Algorithms that follow this approach typically operate under an explicit error model [19].

Table 2.3: Comparison of past work on distributed pattern matching.
The table highlights the characteristics of each solution presented (e.g., exact vs. approximate matching), its implementation infrastructure, and summarizes the details of the largest scale experiment performed. We highlight the fact that our solution is unique in terms of demonstrated scale, and the ability to perform exact matching and retrieve all matches.

Contribution      | Model                 | Framework/Language | Match Type    | Max. Query Size | Metadata  | #Compute Nodes | Max. Real-world Graph | Max. Synthetic Graph
Arabesque [108]   | Tree-search           | Spark              | Exact         | 10 edges        | N/A       | 20             | 887M edges            | N/A
QFrag [100]       | Tree-search           | Spark              | Exact         | 7 edges         | Real      | 10             | 117M edges            | N/A
PGX.D/Async [96]  | Asynchronous DFS      | Java/C++           | Exact         | 4 edges         | Synthetic | 32             | N/A                   | 2B edges (Unif. rand.)
G-Miner [16]      | Tree-search           | C++                | Exact         | 4 edges         | N/A       | 15             | 1.8B edges            | N/A
Sun et al. [105]  | Subgraph Indexing     | C#.Net4            | Exact         | 30 edges        | Synthetic | 12             | 16.5M edges           | 4B vertices
Plantenga [84]    | Tree-search           | Hadoop             | Approximate   | 4-Clique        | Real      | 64             | 107B edges            | R-MAT Scale 20
SAHAD [128]       | Color-coding          | Hadoop             | Approximate   | 12 vertices     | Synthetic | 40             | N/A                   | 269M edges
FASCIA [103]      | Color-coding          | MPI                | Approximate   | 12 vertices     | N/A       | 15             | 117M edges            | 1M edges (Erdős-Renyi)
Chak. et al. [14] | Color-coding          | MPI                | Approximate   | 10 vertices     | N/A       | 512 (BG/Q)     | 2.7M edges            | R-MAT
Gao et al. [32]   | Subgraph Indexing     | Giraph             | Approximate   | 50 vertices     | Synthetic | 28             | 3.7B edges            | N/A
Ma et al. [67]    | Graph Simulation      | Python             | Approximate   | 15 vertices     | Type only | 16             | 5.1M edges            | 100M vertices
Fard et al. [31]  | Graph Simulation      | GPS                | Approximate   | N/A             | N/A       | 8              | 300M edges            | N/A
ASAP [56]         | Neighborhood Sampling | Spark              | Probabilistic | 6 edges         | N/A       | 16             | 3.7B edges            | N/A
Yuan et al. [123] | Tree-search/Join      | Java               | Exact         | 17 edges        | N/A       | 17             | 1.4B edges            | 64M vertices

Recently, machine learning has been adopted for computing graph similarity: SimGNN follows a neural network based approach to learn fine-grained vertex-level information [5].
Along the same line, node2vec [39] and RolX [48] have been proposed to learn vertex representations to aid vertex role identification and multi-label vertex classification.

2.3 Distributed Graph Pattern Matching

This section reviews a number of projects that offer pattern matching on a shared-nothing architecture, either to reduce time-to-solution or to scale to search in large background graphs. Table 2.3 summarizes the key differentiating aspects and the scale achieved. Below we group the contributions into exact and approximate matching categories.

2.3.1 Solutions offering Exact Matching

Arabesque [108] is a distributed framework offering precision and recall guarantees, implemented on top of Apache Spark [104] and Giraph [34]. Arabesque provides an API based on the Think Like an Embedding (TLE) paradigm to express graph mining algorithms (see §2.4 for details) and a Bulk Synchronous Parallel (BSP) implementation of the embedding (pattern) search engine (which follows the tree-search approach for match enumeration and counting). Arabesque replicates the input graph on all worker nodes; hence, the largest graph scale it can support is limited by the size of the main memory of a single node (the implementation also exploits HDFS storage to maintain partially computed embeddings). Through evaluation using several real-world graphs, Teixeira et al. [108] showed Arabesque's superiority over two other key systems: G-Tries [92] and GRAMI [26].

QFrag [100] is a distributed exact pattern matching system, built on top of Arabesque. Similar to Arabesque, QFrag assumes that the entire graph fits in the memory of each compute node and uses data replication to enable search parallelism. QFrag employs a sophisticated load balancing strategy to reduce time-to-solution. In QFrag, each replica runs an instance of a tree-search based pattern enumeration algorithm called TurboISO [44] (an improvement of Ullmann's algorithm [116]).
Through evaluation, the authors demonstrated QFrag's performance advantages over two other distributed pattern matching systems: (i) TriAD [41], an MPI-based distributed RDF [86] engine based on an asynchronous distributed join algorithm, and (ii) GraphFrames [23, 38], a graph processing library for Apache Spark, also based on distributed join operations. Although Arabesque and QFrag outperform most of their competitors in terms of time-to-solution, they replicate the entire graph in the memory of each compute node, which limits their applicability to relatively small graphs. In §3.7.11 and §4.8.6, we present a direct comparison of our work with QFrag and Arabesque.

PGX.D/Async [96] is a distributed system offering exact matching. It relies on asynchronous depth-first traversal for match enumeration. PGX.D/Async offers an MPI-based implementation and incorporates a flow control mechanism with a deterministic guarantee of search completion under a finite amount of memory; however, compared to our work, PGX.D/Async was demonstrated at a much smaller scale, in terms of graph sizes and number of compute nodes.

Similar to Arabesque, G-Miner [16] offers a high-level API for implementing graph mining algorithms; however, its applicability seems to be restricted to limited scenarios, as evaluation results were presented only for counting triangles and small cliques.

Sun et al. [105] present an exact subgraph matching algorithm which follows the tree-search and join approach and demonstrated it on large synthetic graphs, using larger search templates than in [84], yet not on real-world graphs.
Also, the authors mentioned that they terminate the search after the match count has reached a predefined threshold, which was set to 1,024 in their experiments (i.e., the solution does not offer recall guarantees).

2.3.2 Solutions targeting Approximate Matching

Besides our work, the best demonstrated scale is offered by Plantenga's [84] MapReduce implementation of the walk-based algorithm (similar to tree-search) for identifying type isomorphic (approximate) matches, originally proposed in [10]. Plantenga introduced the idea of adding walk-level constraints to type isomorphism; the added constraints are expected to reduce the search space of candidate walks (the solution superset). Plantenga demonstrated performance using a graph with 107 billion edges, the largest-scale experiment to date (excluding ours).

SAHAD [128] is a MapReduce implementation of the color-coding algorithm [3], originally developed for approximating the count of tree-like patterns (a.k.a. treelets) in protein-protein interaction networks. SAHAD follows a hierarchical subtemplate explore-join approach. Its application was presented on relatively smaller graphs with up to ∼300M edges. FASCIA [103] is also a color-coding based solution for approximate treelet counting, whose MPI-based implementation offers superior performance to SAHAD. Chakaravarthy et al. [14] extended the color-coding algorithm to count patterns with cycles (although it does not support arbitrary patterns) and presented an MPI-based distributed implementation (for IBM Blue Gene/Q (BG/Q) supercomputers). However, the authors demonstrated performance on graphs with only a few million edges.

ASAP [56] is a distributed system enabling approximate match counting within a given error bound. ASAP is based on Apache Spark [104] and GraphX [36]. Like Arabesque, ASAP provides a high-level API for implementing graph mining algorithms. ASAP implements a neighborhood sampling technique that estimates the template match count by sampling the edges in the background graph.
Unlike our system, the output produced by ASAP is only probabilistic; ASAP does not offer precision and recall guarantees for the returned solution set, although it allows a trade-off between result accuracy and time-to-solution and provides a technique to bound the counting error.

Gao et al. introduce an approximate matching technique based on tree-search and join [32] and evaluate it on large queries (up to 50 vertices). Here, a query template is converted into a single-sink directed acyclic graph and message transition follows its topology. Yuan et al. [123] present a join-based technique for edit-distance subgraph matching. They break down the search template into small tree substructures, enumerate these substructures and store them in memory, and then perform join operations on these substructures to identify matches. (Note that, similar to Sun et al. [105], Yuan et al. used a large set of unique labels in their experiments: 400 to 1,600 labels are distributed in the rather small background graphs. This significantly narrows down the search space, creating a favorable scenario where the matches for labeled queries are either extremely rare or do not exist at all. Our attempts to reproduce their experiments confirm this observation.)

Two distributed approximate matching solutions based on graph simulation are proposed in [31] and [67], although both are evaluated only on relatively small real-world graphs.

2.4 Query Languages and High-level Programming Interfaces

We review a number of research efforts aimed at improving end-user productivity.

2.4.1 Query Languages

SPARQL queries have been used for subgraph matching in RDF data [41, 76]. A SPARQL query disassembles a template into a set of edges and final results are constructed through multi-way join operations [41]. SPARQL has less expressive power than general subgraph matching and the space of possible join operations can be huge [105].
Cypher, a declarative graph query language for the open-source graph database Neo4j, borrows query syntax from SPARQL [22]. The Gremlin language (by Apache TinkerPop) [109, 110] addresses some of the limitations of SPARQL and allows users to write complex queries by combining declarative and procedural expressions. In addition to TinkerPop's own graph database Titan, popular graph databases like Neo4j and OrientDB also support the Gremlin language.

2.4.2 API for Implementing Query-specific Algorithms

In recent years, a number of frameworks have been developed that provide an end-user API (Application Programming Interface) to facilitate implementation of (what they refer to as) pattern mining algorithms, such as mining frequent subgraphs, counting motifs, or cliques [108]. A system targeting a fixed pattern (or a class of patterns) presents the opportunity to apply query-specific optimizations often not feasible to incorporate in a generic pattern matching system. In fact, a system optimized for searching a fixed pattern has real-world interest: a recent work describes a recommendation system at Twitter that searches 'diamond' motifs for its operations [40]. Arabesque [108], ASAP [56], RStream [119] and G-Miner [16] are examples of recent projects that provide high-level APIs for implementing pattern-specific algorithms.

2.5 Metadata Graphs and Pattern Matching

The work presented in this dissertation primarily targets labeled background graphs and search templates. The problem formulation has high practical relevance given that real-world graphs are semantic representations of some underlying information or knowledge, where vertex and/or edge attributes represent unique entities. Richer analytics are possible when the problem, in addition to the graph topology, also accounts for metadata associated with vertices and/or edges [28, 29].
In this section, we first summarize popular metadata graph models, and then review contributions that focus on pattern matching in metadata graphs.

2.5.1 Metadata Graph Models

Real-world graphs incorporate metadata and are typed - vertices and edges have associated predefined types and encode information pertaining to that type. The Property Graph Model is one of the most popular ways of representing a metadata graph, commonly used by graph databases such as Neo4j [75], OrientDB [77] and Titan (by Apache TinkerPop) [110, 111]. Within this model, objects and relationships are contextualized: typically a vertex is a noun and an edge is a verb. Using type-relationship information, it is possible to model complex graph queries as pattern matching problems.

The Resource Description Framework (RDF), also known as the Triplestore, is another example of a metadata/typed graph model [86]. Within this model, information is stored as a linked data entity, a Subject-Predicate-Object triple. Subjects and Objects are essentially designated types for graph vertices. A Predicate is an edge between two vertices; it contextualizes the relation between a Subject and an Object. Two vertices and an edge represent an RDF triple, and a collection of triples together forms an RDF graph [41, 86].

The work presented in this dissertation offers a general solution for graphs and search templates with vertex metadata; however, it can easily be extended to support graphs and templates with edge metadata.

2.5.2 Pattern Matching in Metadata Graphs

Berry et al. [9, 10] introduced the problem of type isomorphism in metadata graphs, where a match identifies vertices and/or edges of the same label in the template and the background graph. A vertex in an exact match also needs to have edges to vertices with all the labels prescribed in the adjacency structure of at least one of the template vertices.
While the labeled version does not reduce the worst-case complexity of the original exact matching problem, past experience has demonstrated that label-based matching can be a powerful tool with potential for practical, real-world applications such as social network analysis [28, 29] and link recommendation tasks [40, 113]. Furthermore, label-based matching is promising for contemporary and emerging machine learning tasks such as learning vertex representations to aid vertex role identification [39] and multi-label vertex classification [48].

2.6 Infrastructure to Enable Graph Processing on Distributed Platforms

Distributed systems are the logical choice for processing large graphs in a timely manner. Since the introduction of Pregel [68] in 2010, both academia and industry have demonstrated strong interest in developing distributed systems capable of processing massive graphs. Chaos [97], CRAY CGE [21], Giraph [34], GPS [98], GraphLab [35], GraphMat [106], GraphX [36], HavoqGT [82], IBM System G [54] and Oracle PGX [50] are some of the most notable frameworks available today.

The majority of the graph processing frameworks expose a vertex-centric API for algorithm implementation. Similar to Pregel, most of these frameworks adopt the BSP runtime model. Some of the frameworks - HavoqGT, GraphLab and Oracle PGX - also support asynchronous processing, which offers two key advantages over BSP: first, it enables overlapping communication with computation without the need for explicit barriers; second, asynchronous algorithms can exploit the low-latency (∼1µs) interconnects available on high-performance computing platforms. However, designing an asynchronous solution is more challenging.
The BSP approach is often a better fit for commodity clusters, as BSP presents optimization opportunities for harnessing network bandwidth (in the presence of high-latency interconnects).

2.7 Input Reduction in Graph Processing

Graph pruning is the operation of simplifying or compacting a graph, fitting in the larger context of input reduction techniques. Graph pruning techniques, their feasibility, and their effectiveness and applications, however, differ greatly across problems. Lulli et al. [66] present an iterative vertex pruning technique for computing Connected Components. The algorithm iteratively grows a tree for each connected component. Once a vertex is added to the tree, it is prevented from participating in the subsequent iterations, reducing overall computation as well as communication. Kusum et al. [59] present a graph reduction technique (via vertex and edge pruning) for large graphs and demonstrate its advantages for a number of fundamental graph algorithms, e.g., PageRank and Connected Components.

In this dissertation, the solution we propose fits in the wider category of input reduction/graph pruning techniques in the sense that it is based on iteratively eliminating (i.e., pruning) the vertices and edges of the background graph that do not satisfy the set of constraints generated by the search template.

The idea of search space pruning (for the purpose of avoiding exploration of unprofitable regions of the graph), however, has been explored in the past in the context of subgraph matching. Ullmann [116] proposed the idea of identifying the candidate vertex set V′(qi) - the set of vertices in the background graph G that provisionally match a template vertex qi; one set is maintained for each template vertex qi ∈ V0. The pattern search procedure ignores the vertices that do not belong to a candidate vertex set.
Ullmann's algorithm applies a number of pruning rules to refine the candidate vertex sets (thus reducing the search space): for example, vertices in G with a smaller degree than a template vertex qi are removed from the candidate vertex set V′(qi); if a vertex (in V′(qi)) does not have all the neighbors (in the candidate vertex sets) required by qi, the vertex is removed from V′(qi).

The well-known enhancements of Ullmann's algorithm propose additional pruning rules to further reduce the candidate vertex sets: VF2 [78] exploits neighbor information at three-hop distance to eliminate unpromising vertices early. QuickSI [102] enforces a vertex ordering (e.g., based on vertex label frequency) for enumerating each match; as enumeration progresses, the algorithm refines the candidate vertex sets for the current root vertex of the match(es) being discovered, eliminating the vertices that do not respect the predefined vertex order. GraphQL [47], GADDI [125] and SPath [127] exploit auxiliary neighborhood information to refine candidate vertex sets; for example, GraphQL performs pseudo-isomorphism tests and GADDI relies on frequent subgraph indexing. TurboISO [44] divides the background graph into candidate regions and enforces a unique matching order for each region, which, similar to QuickSI, enables pruning unpromising vertices during candidate region exploration.

Ullmann's subgraph isomorphism algorithm and the improvements discussed above are backtracking algorithms and are inherently sequential. Many of the pruning rules these algorithms use typically require a globally synchronized state and are not suitable for fast parallel and distributed processing. Also, global state accumulation is expensive in large graphs.
These pruning techniques are essentially 'best-effort', and are often informed by the current state of the search (i.e., match enumeration). The best known distributed exact subgraph matching solutions, Arabesque and QFrag (presented in §2.3), run a pattern enumeration routine on each graph replica which is based on the optimizations discussed in TurboISO [44]. Here, the adoption of a sequential algorithm is made possible only because each replica has a complete view of the graph and is able to apply optimizations/pruning rules based on process-local information, avoiding the need for frequent global state synchronization.

Unlike other input reduction/graph pruning solutions, however, the solution presented in Chapter 3 offers guarantees that the final pruned graph is the problem solution itself (the precise and complete union of all matches) rather than an intermediary step toward the solution. While they share some similarities, our pruning algorithms are significantly different from the pruning rules for sequential backtracking algorithms discussed earlier - we present novel distributed asynchronous algorithms for vertex and edge pruning. Furthermore, the solutions discussed before predominantly consider vertex pruning only and not explicit edge pruning. Our solution enables aggressive edge pruning for pattern matching - crucial for achieving scalability for large real-world scale-free graphs (evaluated in §3.7.6).

Chapter 3

Graph Pruning via Constraint Checking – A Technique for Scalable Exact Pattern Matching in Metadata Graphs

With the growth of the scale of networked data across diverse problem domains, and, at the same time, of the demand for rich analytics on these data, the importance of scalable graph processing solutions is today paramount. Pattern matching is a fundamental graph algorithm, yet difficult to adopt in practice, particularly in the presence of large graph datasets and complex query templates, because of its high computational demands.
Graph processing in general incurs scalability challenges: most graph algorithms have a low compute-to-memory access ratio, i.e., they are memory bound on a single machine or communication bound in a distributed setting; a high-performance, high-capacity network backbone is necessary for efficient distributed processing. Graph algorithms exhibit highly irregular memory access patterns and data-dependent parallelism. In a distributed setting, this leads to load imbalance, which in turn compromises scalability (with respect to the platform size). The scalability challenges of pattern matching, more so when an exact solution is sought, are compounded by additional factors. First, pattern matching is a combinatorial problem: the number of subgraphs partially (or entirely) matching a template can be exponential with respect to the number of vertices and edges in the background graph - posing serious memory challenges. Second, the match distribution can be highly skewed, potentially concentrated within a limited number of partitions in a distributed graph - leading to load imbalance.

In §1.3, we introduced the constraint checking approach and discussed the opportunities it presents to scale pattern matching to large graphs as well as to harness high-performance general-purpose graph processing frameworks targeting distributed platforms. The research presented in this chapter assesses the hypothesis:

Systematic graph pruning through constraint checking is an effective building block toward enabling scalable pattern matching on massive metadata graphs using distributed memory systems.

In particular, this work investigates the feasibility of the constraint checking approach toward designing a scalable distributed solution for exact pattern matching in large-scale metadata graphs and presents empirical evidence to confirm the scalability and performance advantages of the proposed technique.
First, a preliminary study (published in [88]) confirms the effectiveness of constraint-checking-based graph pruning for pattern matching, yet in the context of a restricted set of templates with constraints on the topology and vertex label distribution. Following the success of the initial set of investigations, we introduce PruneJuice (published in [89]), a distributed system for exact pattern matching that is: generic - no restrictions on the set of patterns supported; precise - no false positives; offers 100% recall - retrieves all matches; efficient - smaller algorithm state ensuring low generated network traffic; and scalable - able to process graphs with up to trillions of edges on tens of thousands of cores.

3.1 Design Goals and Opportunities

Our goal is to enable practical pattern matching solutions on large metadata graphs using distributed memory machines. We aim for a solution that (i) scales to accommodate massive metadata graphs with hundreds of billions of edges, (ii) demonstrates good scalability when operating over a large number of compute nodes, and (iii) guarantees a solution with 100% precision (i.e., no false positive matches in the solution) and 100% recall (i.e., all valid matches are included) for arbitrary search patterns. We want to leverage existing, general-purpose distributed graph processing frameworks targeting large distributed memory machines.

Figure 3.1: An example of a background graph G (center), a template graph G0 (left) and the output - the solution subgraph G∗ after vertex and edge elimination (right). The output is a refined set of vertices and edges that participate in at least one subgraph H that matches G0. Here, vertex metadata are presented as colored shapes. The eliminated vertices and edges are colored solid grey.
As most of these frameworks expose a vertex-centric programming model, the focus is on algorithmic solutions that have a natural vertex-centric description.

The Arabesque [100, 108] project offers the state-of-the-art scalable solution for exact pattern matching. Arabesque's design, however, is based on the Think Like an Embedding (TLE) paradigm which, similar to Ullmann's algorithm, identifies each isomorphic match in the background graph. Unlike the popular Think Like a Vertex (TLV) abstraction, which can accommodate a diverse range of algorithms and hence is adopted by most general-purpose graph frameworks, the only known application of TLE is pattern matching, and framework support for TLE greatly differs from that of TLV. Our motivation to pursue a TLV, or vertex-centric, solution is twofold: first, to harness an existing high-performance general-purpose graph processing framework, and second, full match enumeration is neither the only nor the most efficient avenue to support many high-level graph analysis scenarios.

Constraint Checking for Scalable Pattern Matching. The constraint checking approach is motivated by viewing the search template as specifying a set of constraints the vertices and edges that participate in a match must meet. The technique first decomposes the search template into a set of constraints, then verifies whether the vertices and edges in the background graph violate these constraints, and iteratively eliminates them, eventually leading to the set of vertices and edges that is the union of all exact matches of the search template. The intuition for the effectiveness of this technique stems from three key observations:

First, the traditionally used tree-search techniques [43, 78, 102, 116] generally attempt to enumerate all matches through explicit search. When the search fails on a tree branch, such an unprofitable path is marked invalid and ignored in the subsequent steps.
In the same vein as past work that uses graph pruning [66, 129] or, more generally, input reduction [59], we observe that it is much cheaper to first focus on eliminating the vertices and edges that do not meet the label and topological constraints introduced by the search template. Aggressive search space pruning presents the opportunity to tame the potential combinatorial explosion of the algorithm state and scale to massive graphs. Furthermore, mathematical guarantees of this approach exist - the result of pruning is the complete set of all vertices and edges that participate in at least one match, with no false positives or false negatives. Fig. 3.1 illustrates the general idea using an example graph and a search template.

Second, a vertex-centric formulation for such pruning algorithms exists, and this makes it possible to harness existing high-performance, vertex-centric frameworks (e.g., GraphLab [35], Giraph [34] or HavoqGT [82]). In the vertex-centric formulation for constraint checking, a vertex must satisfy two types of constraints, namely local and non-local constraints, to possibly be part of a match. Local constraints involve only the vertex and its neighborhood: a vertex in an exact match needs to (i) match the label of a corresponding vertex in the template, and (ii) have edges to vertices labeled as prescribed in the adjacency structure of this corresponding vertex in the template. Non-local constraints are topological requirements beyond the immediate neighborhood of a vertex (e.g., that the vertex must be part of a clique). The computational cost of checking these constraints (in an already pruned graph) can be much lower compared to full match enumeration in the large background graph.

Third, we observe that full match enumeration is not the most efficient avenue to support many high-level graph analysis scenarios.
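To make the two local-constraint conditions concrete, the following is a minimal, single-machine sketch of iterative local constraint checking over adjacency sets and label dictionaries. All names are illustrative; the actual system implements this as a distributed, asynchronous vertex-centric computation, and this simplified version checks only label presence, not the multiplicity required when a template vertex has several same-labeled neighbors.

```python
def local_constraint_checking(g_adj, g_label, t_adj, t_label):
    """Iteratively eliminate background vertices that cannot satisfy
    the local constraints: (i) a matching label, and (ii) active
    neighbors carrying the labels prescribed by the template."""
    # omega[v]: template vertices that v provisionally matches by label.
    omega = {v: {q for q in t_label if t_label[q] == g_label[v]}
             for v in g_adj}
    active = {v for v in g_adj if omega[v]}
    changed = True
    while changed:
        changed = False
        for v in list(active):
            nbr_labels = {g_label[u] for u in g_adj[v] if u in active}
            for q in list(omega[v]):
                # q's template neighbors dictate the labels v must see.
                required = {t_label[p] for p in t_adj[q]}
                if not required <= nbr_labels:
                    omega[v].discard(q)  # q no longer possible for v
            if not omega[v]:
                active.discard(v)        # vertex elimination
                changed = True
    return active
```

The surviving set corresponds to the vertices that pass local checking; edge elimination would additionally drop edges to eliminated or label-mismatched neighbors.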
Depending on the final goal of the user, pattern matching problems fall into a number of categories which include: (a) determining if a match exists (or not) in the background graph (yes/no answer), (b) selecting all the vertices and edges that participate in matches, (c) ranking these vertices or edges based on their centrality with respect to the search template, i.e., the frequency of their participation in matches, (d) counting/estimating the total number of matches (comparable to the well-known triangle counting problem), or (e) enumerating all distinct matches in the background graph. The traditional approach [78, 116] is to first enumerate the matches (category (e) above) and to use the result of the enumeration to answer (a) – (d). However, this approach is limited to small background graphs or is dependent on a low number of near and exact matches within the background graph (due to exponential growth of the algorithm state). We argue that a pruning-based pipeline is not only a practical solution to (a) – (d) (and to other pattern matching related analytics, when full match enumeration is not the main interest) but also an efficient path toward full match enumeration in large graphs. First, the pruned graph can be multiple orders of magnitude smaller than the background graph, and existing high-complexity enumeration routines thus become applicable. Second, this technique can collect additional key information to accelerate match enumeration: for each vertex in the pruned graph, it is possible to build a list of its exact match(es) in the template.

3.2 Contribution Highlights and Chapter Organization

We present a pattern matching solution that is: generic - no restrictions on the set of patterns supported; precise - no false positives; offers 100% recall - retrieves all matches; efficient - smaller algorithm state ensuring low generated network traffic; and scalable - able to process graphs with up to trillions of edges on tens of thousands of cores.
The work presented in this chapter is based on the research published in [88–90, 115]; the referred publications contain additional design details and evaluation results. The summary of contributions of this chapter is the following:

(i) Pruning based on Decomposing the Search Template into a Set of Constraints (§3.4). We have developed a technique that decomposes the search template into a set of constraints the vertices and edges that participate in a match must meet. It verifies whether vertices/edges in the background graph violate these constraints, and iteratively eliminates them. This approach eliminates all non-matching vertices (thus offering full precision) and does not incorrectly eliminate matching vertices (thus offering full recall) for arbitrary templates.

(ii) Asynchronous Algorithms with Optimized Distributed Implementation (§3.5). We have developed asynchronous, vertex-centric algorithms to verify these constraints. We offer an efficient implementation of these algorithms on top of HavoqGT [82], an open-source asynchronous graph processing framework. The prototype includes two key optimizations that dramatically reduce the generated traffic: aggressive edge elimination, and what we call work aggregation - a technique that skips duplicate checks in non-local constraint checking, thus preventing possible combinatorial explosion. Additionally, our implementation collects exact match information: not only does it prune away all vertices and edges that do not participate in any match, but, for each of the vertices that remain, it collects their exact match(es) to the search template. We use this information to accelerate match enumeration. Last but not least, our implementation enables load balancing: it checkpoints the current state of execution, reshuffles the vertex-to-processor assignment to evenly distribute vertices and edges across processing cores, and then resumes processing on the rebalanced workload. We name our system PruneJuice.

(iii) Proof of Feasibility at Scale (§3.7).
We demonstrate the applicability of the proposed solution by experimenting on real-world and synthetic datasets orders of magnitude larger than the prior work (§2.3). We evaluate scalability through two experiments: first, a strong scaling experiment using a real-world dataset, the largest openly available webgraph, whose undirected version has over 257 billion edges; second, a weak scaling experiment using synthetic, R-MAT [15] generated graphs of up to 4.4 trillion edges, on up to 1,024 compute nodes (36,864 cores). We demonstrate support for search patterns representative of practical queries in both relatively frequent and needle-in-the-haystack scenarios, and, to stress our system, consider patterns containing the highest-frequency vertex labels (up to 14 billion instances). We show that in some cases our technique prunes the graph by orders of magnitude, which, combined with the exact match information collected during pruning, makes match enumeration feasible on graphs with trillions of edges (§3.7.3).

(iv) Application Demonstration (§3.7.4). We demonstrate the ability of our solution to support practical graph analytics queries. To this end, we use two real-world metadata graphs that we have curated from publicly available datasets: Reddit (3.9 billion vertices and 14 billion edges) and the smaller Internet Movie Database (IMDb) (5 million vertices and 29 million edges), and demonstrate practical use cases of our technique to support rich graph mining.

(v) Exploring Trade-offs, and the Impact of Strategic Design Choices and Optimizations (§3.7.5, §3.7.6, §3.7.7 and §3.7.8). Our approach has the added flexibility that the search can be stopped early, which provides the ability to trade precision (i.e., false positives in the pruned graph) for a faster time to an approximate solution (or even to an accurate solution, yet without the 100% precision guarantee). We explore this trade-off as well as the impact of each optimization used and of our load balancing strategies.
The cumulative impact of these optimizations is a reduction in runtime of multiple orders of magnitude, bringing pattern matching on massive metadata graphs into the realm of possible graph analytics.

(vi) Insights into Artifacts that Influence Performance (§3.7.7 and §3.7.9). We present a number of analyses that uncover artifacts that influence the performance of the presented solution. We present a detailed characterization of the artifacts that cause load imbalance, leading to inefficient resource utilization. Furthermore, we investigate the influence of template properties, such as label selection and topology, on runtime performance.

(vii) Comparison with Existing Work (§3.7.11). We empirically compare our work with two state-of-the-art systems, QFrag [100] and Arabesque [108], and demonstrate the performance advantages that our system offers for handling large graphs and complex patterns.

3.3 Preliminaries

We aim to identify all structures within a large background graph, G, identical to a small connected template graph, G0.
We describe general graph properties for G, and use the same notation (summarized in Table 3.1) for other graph objects. (Some of the notations in Table 2.1 are repeated here for the convenience of the reader.)

Table 3.1: Symbolic notation used.

  Object(s)                                   Notation
  template graph, vertices, edges             G0(V0, E0)
  template graph sizes                        n0 := |V0|, m0 := |E0|
  template vertices                           V0 := {q0, q1, ..., qn0−1}
  template edges                              (qi, qj) ∈ E0
  set of vertices adjacent to qi in G0        adj(qi)
  background graph, vertices, edges           G(V, E)
  background graph sizes                      n := |V|, m := |E|
  background vertices                         V := {v0, v1, ..., vn−1}
  background edges                            (vi, vj) ∈ E
  set of vertices adjacent to vi in G         adj(vi)
  maximum vertex degree in G                  dmax
  average vertex degree in G                  davg
  standard deviation of vertex degree in G    dstdev
  label set                                   L = {0, 1, ..., nℓ − 1}
  vertex label of qi                          ℓ(qi) ∈ L
  vertex match function                       ω(vi) ⊂ V0
  set of non-local constraints of G0          K0
  matching subgraph, vertices, edges          H(VH, EH)
  solution subgraph, vertices, edges          G∗(V∗, E∗)

A graph G(V, E) is a collection of n vertices V = {0, 1, ..., n−1} and m edges (i, j) ∈ E, where i, j ∈ V (i is the edge's source and j is the target). Here, we only discuss simple (i.e., no self-edges), undirected, vertex-labeled graphs, although the techniques are applicable to directed, non-simple graphs with labels on both edges and vertices. An undirected G satisfies (i, j) ∈ E if and only if (j, i) ∈ E. Vertex i's adjacency list, adj(i), is the set of all j such that (i, j) ∈ E. A vertex-labeled graph also has a set of nℓ labels L, of which each vertex i ∈ V has an assignment ℓ(i) ∈ L.

A walk in G is an ordered subsequence of V where each consecutive pair is an edge in E. A walk with no repeated vertices is a path. A path with equal first and last vertex is a cycle. An acyclic graph has no cycles.

We further characterize graphs with cycles. Two disjoint cycles have no edge in common. Two distinct cycles have at least one edge not in common.
We define the cycle degree of edge (i, j) ∈ E as the number of distinct cycles (i, j) is in, written δ(i, j). The maximum cycle degree is δmax := maxE δ(i, j). A graph is edge-monocyclic if δmax = 1.

We discuss several graph objects simultaneously: the template graph G0(V0, E0), the background graph G(V, E), and the current solution subgraph G∗(V∗, E∗), with V∗ ⊂ V and E∗ ⊂ E. Our techniques iteratively refine V∗ and E∗ until they converge to the union of all subgraphs of G that exactly match the template, G0.

For clarity, when referring to vertices and edges from the template graph, G0, we will use the notation qi ∈ V0 and (qi, qj) ∈ E0. Conversely, we will use vi ∈ V and (vi, vj) ∈ E for vertices and edges from the background graph G or the solution subgraph G∗.

We assume G0 is connected, because if G0 has multiple components the matching problem can easily be reduced to solving it for each component individually.

Definition 4. A subgraph H(VH, EH), VH ⊂ V, EH ⊂ E is an exact match of the template graph G0(V0, E0) (in notation, H ∼ G0) if there exists a bijective function φ : V0 ←→ VH with the following properties (note that φ may not be unique for a given H):

(i) ℓ(φ(q)) = ℓ(q), for all q ∈ V0,
(ii) ∀(q1, q2) ∈ E0, we have (φ(q1), φ(q2)) ∈ EH, and
(iii) ∀(v1, v2) ∈ EH, we have (φ−1(v1), φ−1(v2)) ∈ E0.

Intuition for our Solution. The algorithms we develop here iteratively refine a vertex-match function ω(v) ⊂ V0 such that, for every v ∈ V, ω(v) stores a superset of all template vertices v can possibly match. Set ω(v) converges to contain all possible values of φ−1(v), where v is involved in one or more matching subgraphs.
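Definition 4 can be verified directly for a candidate mapping φ. The sketch below (illustrative names; graphs given as adjacency sets plus label dictionaries) treats H as the subgraph induced on the image of φ:

```python
def is_exact_match(phi, t_adj, t_label, g_adj, g_label):
    """Check Definition 4 for a candidate mapping phi: template
    vertex -> background vertex, with H taken as the subgraph
    induced on phi's image."""
    image = set(phi.values())
    if len(image) != len(phi):                 # phi must be injective
        return False
    # (i) labels are preserved
    if any(g_label[phi[q]] != t_label[q] for q in phi):
        return False
    # (ii) every template edge maps to an edge of H
    if any(phi[p] not in g_adj[phi[q]] for q in t_adj for p in t_adj[q]):
        return False
    # (iii) every edge of H maps back to a template edge
    inv = {v: q for q, v in phi.items()}
    for v in image:
        for u in g_adj[v]:
            if u in image and inv[u] not in t_adj[inv[v]]:
                return False
    return True
```

The pruning algorithms never materialize φ explicitly; they only maintain, per vertex, the candidate set ω(v) of template vertices that could serve as φ⁻¹(v).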
When a single constraint involving q ∈ V0 is violated/unmet, q is no longer a possibility for v in a match and q is removed: ω(v) ← ω(v) \ {q}.

The algorithms developed in our preliminary study (presented in §3.6 and Appendix A) require that all vertex labels in the search template G0 are unique, and that G0 is acyclic or edge-monocyclic, to ensure 100% precision and 100% recall. Since then, we have extended the algorithms and infrastructure (in §3.6) to offer a general solution which achieves 100% precision and 100% recall for arbitrary search templates.

Remark 1. Given an ordered sequence of all n0 vertices {q1, q2, ..., qn0} ⊂ V0, a simple (although potentially expensive) search from v1 ∈ V∗ verifies whether v1 is in a match, with φ(q1) = v1, or not. The search lists an ordered sequence {v1, v2, ..., vn0} ⊂ V∗, with φ defined as φ(qk) = vk. Search step k proposes a new vk, checking Def. 4 (i) and (ii). If all checks are passed, the search accepts vk and moves on to step (k+1), but terminates if no such vk exists in V∗. If the full list is generated with all label and edge checks passed, then there exists an H ∼ G0 with VH = {v1, v2, ..., vn0}.

We call this Template-Driven Search (TDS), presented in the next section, and develop an efficient distributed version in §3.5 to apply to the solution subgraph G∗(V∗, E∗). If TDS has been applied successfully, then there are no false positives remaining, independently of the structure of V0. We note that TDS is needed only for the general case; for multiple other specific cases, simpler verification routines can be used.

3.4 Graph Pruning via Constraint Checking for Scalable Pattern Matching – Solution Overview

Our goal is to realize a technique which systematically eliminates all the vertices and edges that do not participate in any match H ∼ G0. This approach is motivated by viewing the template G0 as specifying a set of constraints the vertices and edges that participate in a match must meet.
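The search of Remark 1 admits a compact backtracking sketch. Names are illustrative, and the routine assumes the vertex order traverses the (connected) template so that each step can check Def. 4 (i) and (ii) against already-placed vertices:

```python
def tds_verify(v1, order, t_adj, t_label, g_adj, g_label):
    """Remark 1 sketch: try to extend phi(order[0]) = v1 to a full
    mapping, checking labels and edges to already-placed template
    neighbors at every step, backtracking on failure."""
    def extend(mapping, k):
        if k == len(order):
            return True                        # full sequence accepted
        qk = order[k]
        for v in g_adj:                        # candidate for phi(qk)
            if v in mapping.values():
                continue                       # keep phi injective
            if g_label[v] != t_label[qk]:
                continue                       # Def. 4 (i)
            if any(mapping[p] not in g_adj[v]  # Def. 4 (ii)
                   for p in t_adj[qk] if p in mapping):
                continue
            mapping[qk] = v
            if extend(mapping, k + 1):
                return True
            del mapping[qk]                    # backtrack
        return False

    if g_label[v1] != t_label[order[0]]:
        return False
    return extend({order[0]: v1}, 1)
```

This sequential sketch answers whether v1 can play the role of φ(q1); in the distributed version (§3.5), the partial mapping travels with the token through the solution subgraph instead of being held in one process.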
As a trivial example, any vertex v whose label ℓ(v) is not present in G0 cannot be present in an exact match. A vertex in an exact match also needs to have non-eliminated edges to non-eliminated vertices labeled as prescribed in the adjacency structure of the corresponding template vertex. Local constraints that involve a vertex and its neighborhood can be checked by having vertices communicate their 'provisional' template match(es) with their one-hop neighbors in the solution subgraph G∗(V∗, E∗) (i.e., the currently pruned background graph). We call this process Local Constraint Checking (LCC). Our experiments show that LCC is responsible for removing the bulk of non-matching vertices and edges.

Figure 3.2: Three examples of search templates and background graphs that justify the full set (local and non-local) of pruning constraints. Template (a) is a 3-Cycle; cycles of length 3k with repeated labels in the background graph meet the neighborhood constraints, surviving local constraint checking. Template (b) contains several vertices with non-unique labels; to its right there is a background graph that meets individual point-to-point path constraints, also surviving (non-local) path checking. Template (c) is characterized by two 4-Cliques that overlap at a 3-Cycle; the background graph structure to the right is doubly periodic (a 4×3 torus) and meets all edge and vertex cycle constraints, surviving cycle (non-local constraint) checking. In addition to checking the local constraints, template (a) only requires cycle checking. Templates (b) and (c), however, require template-driven search to guarantee no false positives.

Some classes of templates (with cycles and/or repeated vertex labels) require additional routines to check non-local properties (i.e., topological requirements beyond the immediate neighborhood of a vertex in the template) and to guarantee that all non-matching vertices are eliminated. (Fig.
3.2 illustrates the need for these additional checks with examples). To support arbitrary templates, we have developed a process which we dub Non-local Constraint Checking (NLCC): first, based on the search template G0, we generate the set of constraints K0 that are to be verified, and then prune the background graph using each of them.

Alg. 1 presents an overview of our solution. This section provides high-level descriptions of the local and non-local constraint checking routines, while §3.5 provides the detailed distributed asynchronous algorithms for a vertex-centric abstraction. As an overview, Fig. 3.3 illustrates the complete workflow for the graph and pattern in Fig. 3.1, for which constraint generation is detailed in Table 3.2.

Algorithm 1 Main Constraint Checking Loop
 1: Input: background graph G(V, E), template G0(V0, E0)
 2: Output: solution subgraph G∗(V∗, E∗)
 3: generate non-local constraint set K0 from G0(V0, E0)
 4: G∗ ← LOCAL_CONSTRAINT_CHECKING(G, G0)
 5: while K0 is not empty do
 6:     pick and remove the next constraint C0 from K0
 7:     G∗ ← NON_LOCAL_CONSTRAINT_CHECKING(G∗, G0, C0)
 8:     if a vertex has been eliminated or has one of its provisional matches removed then
 9:         G∗ ← LOCAL_CONSTRAINT_CHECKING(G∗, G0)
10: return G∗

Local Constraint Checking (LCC) involves a vertex and its neighborhood. The algorithm performs the following two operations. (i) Vertex elimination: the algorithm excludes the vertices that do not have a corresponding label in the template, then iteratively excludes the vertices that do not have neighbors as labeled in the template. For templates that have vertices with multiple neighbors with the same label, the algorithm verifies if a matching vertex in the background graph has a
(ii) Edge elimination: this excludes edges to eliminated neighbors and edges to neighbors whose labels do not match the labels prescribed in the adjacency structure of its corresponding template vertex (e.g., Fig. 3.3, Iteration #1). Edge elimination is crucial for scalability since, in a distributed setting, no messages are sent over eliminated edges, thus significantly improving the overall efficiency of the system (evaluated in §3.7.6, Fig. 3.13(a)).

Non-local Constraint Checking (NLCC) aims to exclude vertices that fail to meet topological and label constraints beyond the one-hop neighborhood, which LCC is not guaranteed to eliminate (Fig. 3.2). We have identified three types of non-local constraints which can be verified independently: (i) Cycle Constraints (CC), (ii) Path Constraints (PC), and (iii) constraints that require Template-Driven Search (TDS) (see Remark 1). For arbitrary templates, TDS constraints based on aggregating multiple paths/cycles enable further pruning and ensure that pruning yields no false positives. Checking TDS constraints, however, can be expensive. To reduce the overall cost, we first generate single cycle- and path-based constraints, which are usually less costly to verify, and prune the graph using them before deploying TDS (the effectiveness of this ordering is evaluated in Fig. 3.13(c)).

High-level Algorithmic Approach. Regardless of the constraint type, NLCC leverages a token passing approach: tokens are issued by background graph vertices whose corresponding template vertices are identified to have non-local constraints. After a fixed number of steps, we check if a token has arrived where expected (e.g., back to the originating vertex for checking the existence of a cycle). If not, then the token issuing vertex does not satisfy the required constraint and its corresponding template match is removed. If a vertex has no template match remaining, it is eliminated.
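The token-passing check just described can be sketched sequentially for a cycle constraint. Everything below (function and variable names, the frontier-based formulation) is our illustrative simplification, not the system's API; the real implementation passes tokens asynchronously between distributed vertices (Alg. 5 and 6).

```python
# Sequential sketch of a cycle-constraint check: a token issued at v0 must
# be able to return to v0 in exactly len(label_seq) + 1 hops, visiting
# vertices with the prescribed labels along the way. Like the real check,
# it tests for the existence of a walk (not necessarily a simple cycle);
# TDS constraints handle the remaining cases.

def cycle_constraint_holds(adj, labels, v0, label_seq):
    """adj: vertex -> set of neighbors; labels: vertex -> label;
    label_seq: labels expected at hops 1..k-1; hop k must return to v0."""
    k = len(label_seq) + 1
    frontier = {v0}
    for step in range(k):
        nxt = set()
        for u in frontier:
            for w in adj[u]:
                if step == k - 1:          # final hop: must land back on v0
                    if w == v0:
                        return True
                elif labels[w] == label_seq[step]:
                    nxt.add(w)             # token forwarded to a label match
        frontier = nxt
    return False

# A labeled triangle satisfies the 3-cycle constraint issued at vertex 0:
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}}
labels = {0: 'a', 1: 'b', 2: 'c'}
print(cycle_constraint_holds(adj, labels, 0, ['b', 'c']))  # True
```

If no such walk completes, the corresponding template match would be removed from the issuing vertex, mirroring the elimination step described above.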
Along the token path, the algorithm verifies that all expected labels are encountered and, where necessary, uses the path information accumulated with the token to verify that constraints on distinct/repeated node identities are met. Next, we discuss how each type of non-local constraint is verified.

Cycle Constraints (CC). Higher-order structures within G that survive LCC, but do not contain G0, are possible if G0 contains a cycle (this happens if G contains one or more unrolled cycles as in Fig. 3.2, template (a)). To address this, we directly check for cycles of the correct length.

Path Constraints (PC). If the template G0 has two or more vertices with the same label, three or more hops away from each other, then structures in G that survive LCC, yet contain no match, are possible (Fig. 3.2, template (b)). Thus, for every vertex pair with the same label in G0, we directly check the existence of a path of the correct length and label sequence for prospective matching vertices in G∗. In contrast to cycle checking, after a fixed number of steps, a token must be received by a vertex different from the initiating vertex but with an identical label.

Template-Driven Search (TDS) Constraints. These are partial (composed of the union of two or more path and/or cycle constraints) or complete (i.e., including all edges of the template) walks on the template. The token walks the constraint in the background graph and verifies that each node visited meets its neighborhood (beyond one-hop) constraints (Remark 1); in our distributed memory setting, this is done by maintaining a history of the walk and checking that previously visited vertices are revisited as expected. TDS constraints are crucial to guarantee zero false positives for templates that are non-edge-monocyclic or have repeated vertex labels (Fig. 3.2, templates (b) and (c)).

Non-local Constraint Generation. We generate non-local constraints following the heuristic presented in Table 3.2.
The three types of non-local constraints, namely, Cycle Constraints, Path Constraints and TDS Constraints, are generated incrementally; Table 3.2 provides a step-by-step illustration of non-local constraint generation for an example template, and Fig. 3.3 shows a complete example of how pruning progresses using the generated constraints.

Table 3.2: Step-by-step illustration of non-local constraint generation: high-level description, accompanied by a pictorial depiction for the template in Fig. 3.1. The figures show the steps to generate the required cycle constraints (CC), path constraints (PC), and higher-order constraints requiring template-driven search (TDS).

Vertex Classification (Steps 1-2). Identify all the leaf vertices (i.e., vertices with only one neighbor) with unique labels. They are not considered for non-local constraint checking, as LCC guarantees pruning if there is no match. Also identify the vertices with duplicate labels.

Cycle Constraints (Step 3). If the template has cycles, then individual cycles are identified and a cycle constraint is generated for each cycle.

Path Constraints (Step 4). If there are vertices with identical labels, first, they are identified. Next, for all possible combinations of vertex pairs with identical labels, we identify all existing paths of three-hop length or greater. (LCC precisely checks identical label pairs that are one or two hops from each other.) One such path, for each vertex pair, is generated as a path constraint (e.g., Step 4, pentagonal vertices). Here, two optimizations are applied to minimize the number of path constraints to be verified: (i) if there are multiple paths connecting two terminal vertices, then the shortest path is generated as a path constraint; (ii) if all the edges comprising a path also belong to a cycle constraint, that particular path is excluded from the set of path constraints. Verification of the cycle constraint will implicitly check for the existence of a successful walk of appropriate length connecting the terminal vertices (of the path of interest).

TDS Constraints (Step 5). We generate TDS constraints in three steps. First, for templates with multiple cycles sharing more than one vertex (i.e., the template is non-edge-monocyclic), a TDS cyclic constraint is generated through the union of previously identified cycle constraints. This results in a higher-order cyclic structure with a maximal set of edges that cover all the cycles sharing at least one edge (e.g., Step 5(1)). Second, for templates with repeated labels, a new TDS constraint is generated through the union of all previously identified path constraints. This procedure generates a higher-order structure that covers all the template vertices with repeated labels (e.g., Step 5(2)). The final step generates a TDS constraint as the union of the two previously identified constraints (e.g., Step 5(3)). Note that the above is a heuristic; more TDS constraints can be generated by creating various possible combinations of cycles and paths. Only this third step is mandatory to eliminate all false positives.

Figure 3.3: Algorithm walk-through for the example background graph and template in Fig. 3.1, depicting which vertices and edges in G∗(V∗,E∗) are eliminated (in solid grey) during each iteration. The non-local constraints for G0 are listed in Table 3.2. The example does not show the application of some of the constraints, as those do not eliminate vertices or edges.

Token Generation.
For cyclic constraints, a token must be initiated from each vertex that may participate in the substructure, whereas for path constraints, tokens are only initiated from terminal vertices. Tokens are started from vertices (that belong to the same cyclic substructure) in the increasing order of their label frequency in the background graph.

Constraint Optimization. Non-local constraint verification checks for the existence of at least one successful walk of the appropriate length. There are alternatives to how tokens could be passed around to complete a walk. The final steps in non-local constraint generation focus on optimizing the walks for token passing. Whenever possible, we orchestrate each walk so the vertices are visited in the increasing order of label frequency in the background graph. (This procedure has negligible overhead, as label frequency is computed only once per label set and we only sort the vertex list of a template which, typically, has 10^1-10^2 elements.) Here, the goal is to curb combinatorial growth of the algorithm state (or, more specifically, in the distributed memory setting, the number of messages). This optimization has the potential of eliminating a large part of the graph without explorations deep into an excessive number of branches in the background graph.

Constraint Ordering. We use a second set of heuristics to optimize the order in which constraints are scheduled for verification. First, we check for path and cycle constraints, as they tend to be less expensive than TDS constraints. Second, we order the non-local constraints with respect to increasing length of the walk, as longer walks are more susceptible to combinatorial explosion.

3.5 Asynchronous Algorithms and Distributed Implementation

This section presents the system implementation on top of HavoqGT [82], an MPI-based framework that supports asynchronous graph algorithms in the distributed environment.
First, we describe the constraint checking algorithms in the vertex-centric abstraction of HavoqGT [45]. Then we discuss other key system components and optimizations we have incorporated.

HavoqGT. Our choice of HavoqGT is driven by multiple considerations. First, unlike most graph processing frameworks that primarily support the BSP model, HavoqGT has been designed to support asynchronous algorithms, which is essential to achieve high performance. Asynchronous algorithms can exploit the low latency (∼1µs) interconnect on leadership-class HPC platforms. Second, the framework has demonstrated excellent scaling properties [81, 82]. Finally, it enables load balancing: HavoqGT's delegate partitioned graph distributes the edges of each high-degree vertex across multiple compute nodes, which is crucial for achieving scalability for scale-free graphs with a skewed degree distribution.

In HavoqGT, graph algorithms are implemented as vertex-callbacks: the user-defined visit() callback can access and update the state of a vertex. The framework offers the ability to generate events (a.k.a. a 'visitor' in HavoqGT's vocabulary) that trigger this callback, either at the entire graph level using the do_traversal() method, or for a neighboring vertex using the push(visitor) call. When a vertex wants to pass data to a neighbor, invoking push(visitor) enqueues the relevant visitor to the distributed message queue, which exploits MPI asynchronous communication primitives for exchanging messages. This enables asynchronous vertex-to-vertex communication. The asynchronous graph computation completes when all 'visitor' events have been processed, which is determined by a distributed quiescence detection algorithm [120].

Alg. 1 outlines the key steps of the constraint checking procedure. Below, we describe the distributed implementation of the local and non-local constraint checking, and match enumeration routines. Alg. 2 lists the state maintained by each active vertex and its initialization.
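As a concrete, if simplified, rendition of this per-vertex state, the following Python sketch mirrors Alg. 2; the field names alpha, omega, eps and tau stand in for α(vj), ω(vj), ε(vj) and τ(vj), and are our own naming, not HavoqGT's (the actual implementation is C++/MPI).

```python
# Simplified Python rendition of the per-vertex state of Alg. 2.
# alpha: active flag; omega: set of provisional template matches;
# eps: active-edge map (neighbor -> the match set that neighbor last
# communicated); tau: tokens already forwarded (NLCC work aggregation).
from dataclasses import dataclass, field

@dataclass
class VertexState:
    alpha: bool = False
    omega: set = field(default_factory=set)
    eps: dict = field(default_factory=dict)
    tau: set = field(default_factory=set)

def init_vertex(v_label, adj, template_labels):
    """template_labels: template vertex -> label. A vertex starts active
    iff some template vertex carries its label (Alg. 2, lines 1-2)."""
    omega = {q for q, lbl in template_labels.items() if lbl == v_label}
    eps = {u: None for u in adj}   # value fields initially empty (Alg. 2, line 3)
    return VertexState(alpha=bool(omega), omega=omega, eps=eps)

# A vertex labeled 'a' is a provisional match for both 'a'-labeled
# template vertices; a vertex labeled 'z' starts eliminated:
state = init_vertex('a', {5, 7}, {'q0': 'a', 'q1': 'b', 'q2': 'a'})
print(state.alpha, sorted(state.omega))  # True ['q0', 'q2']
```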
Additionally, Appendix B presents complexity analyses of the key constraint checking routines.

Algorithm 2 Vertex State and Initialization
1: status of vertex vj: α(vj) ← true (active) if ∃qk ∈ V0 s.t. ℓ(vj) = ℓ(qk), otherwise false (i.e., vj has been eliminated)
2: set of possible matches in template for vertex vj: ω(vj) ← initially all qk ∈ V0 s.t. ℓ(qk) = ℓ(vj)
3: map of active edges of vertex vj: ε(vj) ← keys are initialized to adj(vj); the value field, which is initially ∅, is set to ω(vi), for each vi ∈ ε(vj) that has communicated its state to vj
4: set of tokens already forwarded by vertex vj: τ(vj) ← initially empty, used for work aggregation in NLCC

Local Constraint Checking is implemented as an iterative process (Alg. 3 and the corresponding callback, Alg. 4). Each iteration initiates an asynchronous traversal by invoking the do_traversal() method and, as a result, each active vertex receives a visitor with msgtype = init. In the triggered visit() callback, if the label of a vertex vj in the graph is a match for the label of any vertex in the template and the vertex is still active, it creates visitors for all its active neighbors in ε(vj) with msgtype = alive (Alg. 4, line #9). When a vertex vj is visited with msgtype = alive, it verifies whether the sender vertex vs satisfies one of its own (i.e., vj's) local constraints by invoking the function η(vs,vj). By the end of an iteration, if vj satisfies all the template constraints, i.e., it has neighbors with the required labels (and, if needed, a minimum number of distinct neighbors with the same label as prescribed in the template), it stays active (i.e., α(vj) = true) for the next iteration. For templates that have multiple vertices with the same label, in any iteration, a vertex with that label in the background graph could match any of these vertices in the template, so each match must be verified independently. If vj fails to satisfy the required local constraints for a template vertex qk ∈ ω(vj), qk is removed from ω(vj).
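A sequential sketch of this per-iteration match filtering follows; it is our simplification of Alg. 3 (lines 6-21), with η reduced to a subset test over the matches each neighbor reports, and the distributed message exchange elided.

```python
# Simplified, sequential version of one LCC pass over a single vertex:
# drop edges to neighbors that no longer report any template match,
# accumulate the matches reported by surviving neighbors, and remove
# each provisional match qk whose template adjacency adj_t[qk] is not
# covered by those reports.

def lcc_pass(omega_v, eps_v, adj_t):
    """omega_v: provisional matches of vj; eps_v: neighbor -> set of that
    neighbor's provisional matches (empty/None means no 'alive' message);
    adj_t: template vertex -> set of its template neighbors."""
    reported = set()
    for vi, omega_i in list(eps_v.items()):
        if not omega_i:
            del eps_v[vi]               # edge eliminated
        else:
            reported |= omega_i         # accumulate matched neighbors
    omega_v = {qk for qk in omega_v if adj_t[qk] <= reported}
    active = bool(eps_v) and bool(omega_v)
    return omega_v, eps_v, active

# Template path q0-q1-q2; vj provisionally matches q1 and hears from
# neighbors matching q0 and q2, plus one silent neighbor (edge dropped):
omega, eps, active = lcc_pass({'q1'},
                              {5: {'q0'}, 6: {'q2'}, 7: set()},
                              {'q1': {'q0', 'q2'}})
print(omega, active)  # {'q1'} True
```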
At any stage, if ω(vj) becomes empty, then vj is marked inactive (α(vj) ← false) and never creates visitors again. Edge elimination excludes two categories of edges: first, the edges to neighbors vi ∈ ε(vj) from which vj did not receive an alive message, and, second, the edges to neighbors whose labels do not match the labels prescribed in the adjacency structure of the corresponding template vertex/vertices in ω(vj). A vertex vj is also marked inactive if its active edge list ε(vj) becomes empty. Iterations continue until no vertex or edge is marked inactive.

Algorithm 3 Local Constraint Checking
1: η(vs,vj) - verifies if vs satisfies a local constraint of vj; returns ω(vs) if constraints are met, ∅ otherwise
2: procedure LOCAL_CONSTRAINT_CHECKING(G, G0)
3: do
4:     do_traversal(msgtype ← init)
5:     barrier
6:     for all vj ∈ V do
7:         ω′ ← ∅ ▷ set of matches in template for neighbors of vj
8:         for all vi ∈ ε(vj) do
9:             if η(vi,vj) = ∅ then
10:                ε(vj).remove(vi) ▷ edge eliminated
11:                continue
12:            else
13:                ω′ ← ω′ ∪ η(vi,vj) ▷ accumulate matched neighbors
14:            reset the value field of vi ∈ ε(vj) for the next iteration
15:        for all qk ∈ ω(vj) do ▷ for each potential match
16:            if adj(qk) ⊈ ω′ then
17:                ▷ qk does not meet neighbor requirements
18:                ω(vj).remove(qk) ▷ remove from the set of potential matches
19:                continue
20:        if ε(vj) = ∅ or ω(vj) = ∅ then
21:            α(vj) ← false ▷ vertex eliminated
22: while vertices or edges are eliminated ▷ global detection

Non-local Constraint Checking iterates over K0, the set of non-local constraints to be checked, and validates each C0 ∈ K0, one at a time. Alg. 5 describes the solution to verify a single constraint: tokens are initiated through an asynchronous traversal by invoking the do_traversal() method and, as a result, each active vertex receives a visitor with msgtype = init.
Each active vertex vj ∈ G∗ that is a potential match for the template vertex q0 at the head of a 'path' C0 broadcasts a token to all its active neighbors in ε(vj) with msgtype = forward. A map γ is used to track these token issuers. A token is a tuple (t,r) where t is an ordered list of vertices that have forwarded the token and r is the hop-counter; t0 ∈ t is the token-issuing vertex in G∗. The ordered list t is essential for TDS since it enables detection of distinct vertices with the same label in the token path. For simpler templates, such as templates with unique vertex labels and only edge-monocycles, t may only contain t0 to keep the message size small.

Algorithm 4 Local Constraint Checking Visitor
1: visitor state: vj - vertex that is visited
2: visitor state: vs - vertex that originated the visitor
3: visitor state: ω(vs) - set of possible matches in template for vertex vs
4: visitor state: msgtype - init or alive
5: procedure VISIT(G, vq) ▷ vq - visitor queue (the distributed message queue)
6: if α(vj) = false then return
7: if msgtype = init then
8:     for all vi ∈ ε(vj) do
9:         vis ← LCC_VISITOR(vi, vj, ω(vj), alive)
10:        vq.push(vis)
11: else if msgtype = alive then
12:     ε(vj).get(vs) ← ω(vs)

Algorithm 5 Non-local Constraint Checking
1: procedure NON_LOCAL_CONSTRAINT_CHECKING(G, G0, C0)
2:     γ ← map of token source vertices (in G) for C0; the value field (initialized to false) is set to true if the token source vertex meets the requirements of C0
3:     do_traversal(msgtype ← init)
4:     barrier
5:     for all vj ∈ γ do
6:         if γ.get(vj) ≠ true then
7:             ▷ violates C0, eliminate potential match
8:             ω(vj).remove(q0) where q0 is the first vertex in C0
9:             if ω(vj) = ∅ then ▷ no potential match left
10:                α(vj) ← false ▷ vertex eliminated
11:    ∀vj ∈ V, reset τ(vj)

Algorithm 6 Non-local Constraint Checking Visitor
1: visitor state: vj - vertex that is visited
2: visitor state: token - the token is a tuple (t,r) where t is an ordered list of vertices that have forwarded the token and r is the hop-counter; t0 ∈ t is the vertex that originated the token
3: visitor state: msgtype - init, forward or ack
4: µ(vj,C0,token) - verifies if vj satisfies requirements of C0 for the current state of token; returns true if constraints are met, false otherwise
5: procedure VISIT(G, vq)
6: if α(vj) = false then return
7: if msgtype = init and ∃qk ∈ ω(vj) where qk = q0 ∈ C0 then
8:     ▷ initiate a token; vj is the token source
9:     t.add(vj); r ← 1; token ← (t,r); γ.insert(vj, false)
10:    for all vi ∈ ε(vj) do
11:        vis ← NLCC_VISITOR(vi, token, forward)
12:        vq.push(vis)
13: else if msgtype = forward then ▷ vj received a token
14:    if token ∉ τ(vj) then ▷ work aggregation optimization
15:        τ(vj).insert(token)
16:    else return ▷ ignore if vj already forwarded a copy of token
17:    if µ(vj,C0,token) = true and token.r < |C0| then
18:        ▷ the walk can be extended with vj and it has not reached the length |C0| yet
19:        token.t.add(vj); token.r ← token.r + 1
20:        for all vi ∈ ε(vj) do ▷ forward the token
21:            vis ← NLCC_VISITOR(vi, token, forward)
22:            vq.push(vis)
23:    else if µ(vj,C0,token) = true and token.r = |C0| then
24:        ▷ the walk has reached the length |C0|
25:        if C0 is cyclic and t0 = vj then
26:            γ.get(vj) ← true; return ▷ vj meets requirements of C0
27:        else if C0 is acyclic and t0 ≠ vj then
28:            vis ← NLCC_VISITOR(t0, token, ack)
29:            vq.push(vis) ▷ send ack to the token originator, t0 ∈ t
30: else if msgtype = ack then
31:    γ.get(vj) ← true; return ▷ vj meets requirements of C0

When an active vertex vj receives a token with msgtype = forward, it verifies that ω(vj) is a match for the next entry in C0, that it has received the token from a valid neighbor (with respect to entries in C0), and that the current hop count is < |C0|.
If these requirements are satisfied (i.e., µ(vj,C0,token) returns true), vj sets itself as the forwarding vertex (vj is added to t), increments the hop count, and broadcasts the token to all its active neighbors in ε(vj). If any of the constraints are not met, vj drops the token. If the hop count r is equal to |C0| and vj is the same as the source vertex in the token, for a cyclic template, a path has been found and vj is marked true in γ. For path constraints, an acknowledgement is sent to the token issuer to update its status in γ (Alg. 6, lines #28-#31). Once verification of a constraint C0 has been completed, the vertices that are not marked true in γ are invalidated/eliminated, i.e., α(vj) ← false (Alg. 5, line #10).

Work Aggregation. All NLCC constraints attempt to identify if a walk exists from a specific vertex and through vertices with specific labels. Since the goal is to identify the existence of any such path, and multiple intermediate/complete paths in the background graph often exist, to prevent combinatorial explosion, our duplicate work detection mechanism prevents an intermediary vertex (in the token path) from forwarding a duplicate token. NLCC uses an unordered set τ(vj) (Alg. 2, line #4) for work aggregation (see Alg. 6, line #14): at each vertex, this is used to detect if another copy of a token has already visited the vertex vj (taking a different path). The performance impact of this optimization is evaluated in §3.7.6.

Load Balancing. Load imbalance issues are inherent to problems involving irregular data structures, such as graphs, especially when these need to be partitioned for processing over multiple nodes. For our pattern matching solution, load imbalance can be further caused by two artifacts: First, over the course of execution our solution causes the workload to mutate, i.e., we prune away vertices and edges.
Second, the distribution of matches in the background graph may be nonuniform: the vertices and edges that participate in the matches may reside on a small, potentially concentrated, part of the graph. (In §3.7.7, we present a detailed characterization of these artifacts.)

The iterative nature of the constraint checking pipeline allows us to adopt a pseudo-dynamic load balancing approach: First, we checkpoint the current state of execution (at the end of an asynchronous constraint checking phase): the pruned graph, i.e., the set of active vertices and edges, and the per-vertex state indicating template matches, ω(vj) (Alg. 2). Next, using HavoqGT's distributed graph partitioning module, we reshuffle the vertex-to-processor assignment to evenly distribute vertices (with ω(vj) remaining intact) and edges across processing cores. Processing is then resumed on the rebalanced workload. Furthermore, depending on the size of the pruned graph, it is possible to resume processing on a smaller deployment (primarily for efficiency reasons, such as conserving CPU hours). Over the course of the execution, checkpointing and rebalancing can be repeated as needed. We evaluate the effectiveness of different load balancing strategies and present an analysis of their impact on performance in §3.7.7.

Termination and Output. If NLCC is not required, the search terminates when no vertex is eliminated (or none of its provisional matches are removed) in an LCC iteration. Otherwise, the search terminates when all constraints in K0 have been verified and no vertex is eliminated (or none of its provisional matches is removed) in the following LCC phase. The output is: (i) the set of vertices and edges that survived the iterative elimination process and, (ii) for each vertex in this set, the mapping in the template where a match has been identified.

Match Enumeration Queries. A distributed match enumeration or counting routine can operate on the pruned, solution subgraph: Alg.
6 can be slightly modified to obtain the enumeration of the matches in the background graph: the constraint used is the full template, work aggregation is turned off, and each possible match is verified. For each of the vertices that remain in the solution set, the pruning procedure collects their exact match(es) to the search template. We use this information to accelerate match enumeration.

Metadata Store. The metadata is stored independently of the graph topology itself (which uses the CSR format [7]). At initialization, only the required attributes are read from the file(s) stored on a distributed file system. A light-weight distributed process builds the in-memory (or memory-mapped) metadata store. On 256 nodes, for the 257 billion edge Web Data Commons hyperlink graph [94], the metadata store can be built in under a minute. Although, in this work, we consider vertex metadata (i.e., labels) only, support for edge metadata is trivial within the presented infrastructure.

3.6 Summary of the Preliminary Investigations

In §3.4, we presented the constraint checking based exact matching solution that supports arbitrary templates; the implementation details are in §3.5. Before we dive into the evaluation (§3.7) of the generic, distributed exact matching solution, we first present a summary of our preliminary research and the initial set of investigations (published in [88]).

The first solution offers precision and recall guarantees for a restricted set of templates with constraints on the topology and vertex label distribution: (i) no two template vertices have the same label, and (ii) the template is edge-monocyclic, i.e., no two cycles share an edge. The original non-local constraint checking supports cycle constraints only (path constraint checking and template-driven search were developed to alleviate the above restrictions). Furthermore, the solution offers precision and recall guarantees for the matching vertices only.
Since it did not have support for explicit edge elimination, there were no precision guarantees for the edges included in the solution set. (Support for edge elimination, however, has a significant impact on performance and scalability as well; see §3.7.6, Fig. 3.13(a) for an example.)

Evaluation using a distributed implementation confirms the effectiveness of the constraint checking based graph pruning for pattern matching: using the same datasets used in §3.7, we demonstrate strong and weak scaling ability on up to 256 compute nodes, yet in the context of a restricted set of templates with constraints on the topology and vertex label distribution (as mentioned earlier).

In Appendix A, we provide correctness proofs for the constraint checking algorithms, assuming the restricted scenario above, i.e., two constraints on the topology and vertex label distribution of the search template: (i) no two template vertices have the same label and (ii) the template is edge-monocyclic. Additionally, in Appendix A, we present a correctness proof sketch for the general exact matching solution (§3.4) that alleviates the above restrictions and offers precision and recall guarantees for arbitrary search templates.

Considering a restricted scenario for the correctness proofs in Appendix A has two goals: first, to simplify the problem and thus keep the math manageable; second, to show that pattern specific optimizations are possible within the constraint checking approach: in Appendix B, we established that local constraint checking has a polynomial cost. In Appendix A, we prove that acyclic templates with unique vertex labels only require local constraint checking (a polynomial time routine) to produce a solution with 100% precision and 100% recall.

Figure 3.4: The Quartz cluster at the Lawrence Livermore National Laboratory; 71st in the TOP500 list published in November 2018.
(In the past, dedicated solutions, even parallel distributed systems, have been developed for a comparable problem, approximate treelet counting [3, 14, 103, 128].) This is a key design feature of the constraint checking approach: constraint decomposition of the template enables dynamic selection of the most appropriate pruning routine(s) for the target template.

3.7 Evaluation

We evaluate the feasibility of the proposed solution at scale: we present strong (§3.7.2) and weak (§3.7.1) scaling experiments of pruning on massive real-world and synthetic graphs, respectively; subsequently, we demonstrate full match enumeration starting from the pruned graph (§3.7.3); we evaluate the effectiveness of our load balancing technique and other optimizations our system incorporates (§3.7.7 and §3.7.6); we highlight the use of our system in the context of realistic data analytics scenarios (§3.7.4); we explore time-to-solution vs. precision/guarantees trade-offs (§3.7.5); and finally, we compare our solution with two recent systems, QFrag [100] and Arabesque [108] (§3.7.11).

Footnote 2: www.top500.org

Table 3.3: Properties of the datasets used for evaluation: number of vertices and directed edges; maximum, average and standard deviation of vertex degree; and the graph size in the compact CSR-like representation used (including vertex metadata).

Dataset                        Type       |V|    2|E|   dmax   davg   dstdev   Size
Web Data Commons [94]          Real       3.5B   257B   95M    72.3   3.6K     2.7TB
Reddit [87]                    Real       3.9B   14B    19M    3.7    483.3    460GB
Internet Movie Database [55]   Real       5M     29M    552K   5.8    342.6    581MB
CiteSeer [108]                 Real       3.3K   9.4K   99     3.6    3.4      741KB
Mico [108]                     Real       100K   2.2M   1.4K   22     37.1     36MB
Patent [100]                   Real       2.7M   28M    789    10.2   10.8     480GB
YouTube [100]                  Real       4.6M   88M    2.5K   19.5   21.7     1.4GB
LiveJournal [4]                Real       4.8M   69M    20K    17     36       1.2GB
R-MAT up to Scale 37 [15]      Synthetic  137B   4.4T   612M   32     4.9K     45TB
Additionally, we present a detailed bottleneck analysis and discuss the artifacts that impact performance (§3.7.7 and §3.7.9); we discuss how a lack of infrastructure support (in the implementation framework and communication substrate) to prevent unwanted message buildup may lead to system collapse, and the defence mechanism we have developed to address this issue (§3.7.8).

Testbed. The testbed is the 2.6 petaflop Quartz cluster at the Lawrence Livermore National Laboratory, comprising 2,634 nodes and the Intel Omni-Path interconnect. Each node has two 18-core Intel Xeon E5-2695v4 @2.10GHz processors and 128GB of memory [85]. We run one MPI process per core (i.e., 36 per node).

Datasets. We summarize the main characteristics of the datasets used for evaluation in Table 3.3 and explain how we have generated vertex labels where necessary. For all graphs, we created undirected versions: two directed edges are used to represent each undirected edge.

The Web Data Commons (WDC) graph is a webgraph whose vertices are webpages and edges are hyperlinks. To create vertex labels, we extract the top-level domain names from the webpage URLs, e.g., .org or .edu. If the URL contains a common second-level domain name, it is chosen over the top-level domain name. For example, from ox.ac.uk, we select .ac as the vertex label. A total of 2,903 unique labels are distributed among the 3.5B vertices in the background graph.

We curated the Reddit (RDT) social media graph from an open archive [87] of billions of public posts and comments from Reddit.com. Reddit allows its users to rate (upvote or downvote) others' posts and comments. The graph has four types of vertices: Author, Post, Comment and Subreddit (a category for posts).
For Post and Comment type vertices there are three possible labels: Positive, Negative, and Neutral (indicating the overall balance of positive and negative votes) or No Rating. An edge is possible between an Author and a Post, an Author and a Comment, a Subreddit and a Post, a Post and a Comment (to that Post), and between two Comments that have a parent-child relationship.

The International Movie Database (IMDb) graph was curated from the publicly available repository [55]. The graph has five types of vertices: Movie, Genre, Actress, Actor and Director. An edge is only possible between a Movie type vertex and a non-Movie type vertex.

We use the smaller Patent and YouTube graphs to compare with published results by Serafini et al. [100]. The Patent graph has 37 unique vertex labels, while the larger YouTube graph has 108 unique vertex labels. Also, we use the CiteSeer, Mico, Patent, YouTube and LiveJournal unlabeled, real-world graphs primarily to compare with published results by Teixeira et al. [108].

The synthetic R-MAT graphs exhibit an approximate power-law degree distribution [15]. These graphs were created following the Graph 500 [37] standard: 2^Scale vertices and a directed edge factor of 16. For example, a Scale 30 graph has |V| = 2^30 and 2|E| ≈ 32×2^30 (since we create an undirected version). Since we use the R-MAT graphs for weak scaling experiments, we aim to generate labels such that the graph structure changes little as the graph scales. To this end, we leverage vertex degree information to create vertex labels, computed using the formula ℓ(vi) = ⌈log2(d(vi)+1)⌉. For instance, for the Scale 37 graph this results in 30 unique vertex labels.

Search Templates. To stress our system, we use templates based on naturally occurring patterns and experiment with both rare and frequent patterns. The WDC (Fig. 3.7), Patent and YouTube (Fig. 3.17), and R-MAT (Fig. 3.5) patterns include vertex labels that are among the most frequent in the respective graphs. The Reddit and IMDb patterns (Fig.
3.10) include most of the vertex labels in these two graphs. We chose templates to exercise different constraint checking scenarios: the search templates have multiple vertices with the same label and non-edge-monocyclic properties (which require relatively expensive non-local constraint checking).

Experimental Methodology. The strong and weak scaling experiments evaluate the performance of precise pruning (i.e., we verify all the constraints required to guarantee zero false positives). The performance metric for the scaling experiments is the search time for a single template. All runtime numbers provided are averages over 10 runs. For weak scaling experiments, we do not present scaling numbers for a single node, as this experiment does not involve network communication and benefits from data locality. For strong scaling experiments, the smallest experiment uses 64 nodes, as this is the lowest number of nodes that can load the WDC graph topology and vertex metadata in memory.

3.7.1 Weak Scaling Experiments

To evaluate the ability to process massive graphs, we use weak scaling experiments and the synthetic R-MAT graphs up to Scale 37 (4.4T edges), and up to 1,024 nodes (36,864 cores). Fig. 3.5 shows the two search patterns used and Fig. 3.6 presents the runtimes. Since there are multiple vertices in the patterns with identical labels (at more than three-hop distance), the patterns require NLCC - path constraint checking - to ensure zero false positives in the pruned solution subgraph. We see steady scaling all the way up to the Scale 37 graph on 1,024 nodes. Runtime is broken down into the individual iterations to evaluate scaling and the individual contribution of each intermediate step. As the background graph gets pruned, the subsequent iterations require less time. Fig.
3.6 includes (at the top of each bar) the final number of vertices and edges that participate in the respective patterns. Note that the NLCC phases (needed to guarantee a precise solution) do not delete vertices or edges; hence, no further LCC phase is invoked.

Figure 3.5: Chain and Tree patterns used. Both patterns have two pairs of vertices with the same (numeric) label; hence, they require non-local constraint checking (NLCC), more precisely, path constraint checking. The labels used are the most frequent in the R-MAT graphs and cover ∼30% of all the vertices in the graphs.

Figure 3.6: Runtime and pattern selectivity for weak scaling experiments, broken down into individual iterations, for the Chain (left) and Tree (right) patterns presented in Fig. 3.5. The X-axis labels present the R-MAT scale and the node count used for the experiment. (Each node hosts two processors, each with 18 cores, and we run 36 MPI processes per node.) The number of vertices and edges in each pruned solution subgraph is shown on top of their respective bar plots. The Pruning Factor (PF), i.e., the order of magnitude reduction in the number of vertices/edges compared to the original background graph, is also shown for each experiment. A flat line indicates perfect weak scaling. Time for the LCC and NLCC phases is presented using different colors.

3.7.2 Strong Scaling Experiments

Fig.
3.8 shows the runtimes for strong scaling experiments when using the real-world WDC graph on up to 1,024 nodes (36,864 cores). Intuitively, pattern matching on the WDC graph is harder than on the R-MAT graphs as the WDC graph is denser, has a highly skewed degree distribution, and the high-frequency labels that we used also belong to vertices with a very high neighbor degree. We use a subset of the patterns presented in Fig. 3.7. WDC-1 is acyclic, yet has multiple vertices with the same label and thus requires non-local constraint checking (PC and TDS constraints). For better visibility, the plot splits checking initial LCC and NLCC-path constraints (bottom left) from NLCC-TDS constraints (top left). We notice near perfect scaling for the LCC phases; however, some of the NLCC phases do not show linear scaling (explained in §3.7.7).

Figure 3.7: WDC patterns using top/second-level domain names as labels. The labels selected are among the most frequent, covering ∼81% of the vertices in the WDC graph: unsurprisingly, com is the most frequent, covering over two billion vertices; org covers ∼220M vertices, the 2nd most frequent after com; and mil is the least frequent among these labels, covering ∼153K vertices.

WDC-2 is an example of a pattern with multiple cycles sharing edges, and

Figure 3.8: Runtime for strong scaling experiments, broken down into individual phases (LCC and NLCC are in different colors) for four of the WDC patterns presented in Fig.
3.7. The top row of X-axis labels represents the number of compute nodes. (Each node hosts two processors, each with 18 cores, and we run 36 MPI processes per node.) The last two rows are the number of vertices and edges in the pruned graph, respectively. For better visibility, for WDC-1 (left plots), runtimes for different iterations are split into two scales on the Y-axis: LCC and NLCC-path constraints are at the bottom, and LCC and NLCC-TDS constraints are at the top. Speedup over the 64 node configuration is also shown on top of each stacked bar plot.

relies on CC and TDS constraint checking to guarantee no false positive matches. WDC-2 shows near linear scaling, with about one-third of the total time spent in the first LCC phase and little time spent in the NLCC phases. WDC-3, a monocyclic template, shows steady scaling for both LCC and NLCC phases.

The WDC-5 pattern includes the top three most frequent labels, namely, com, org and net, and covers ∼72% of the vertices in the WDC graph. Similar to WDC-1, here the majority of the time is spent in verifying the non-local constraints. Also, for WDC-5, the NLCC phases hardly scale with increasing node count (due to the heavily skewed template match distribution among the graph partitions, as explained in §3.7.7).

3.7.3 Match Enumeration

As our technique prunes the graph by orders of magnitude (see Fig. 3.9 and Fig. 3.11(b)), match counting and full match enumeration in large graphs become feasible. Table 3.4 (top) lists the number of distinct matches and the time to enumerate the Chain and Tree patterns on some of the R-MAT graphs we used. While these results prove that our match enumeration routine scales well, the match counts and enumeration times for the WDC, Reddit and IMDb patterns listed in Table 3.4 (bottom) are more revealing. Additionally, in Table 3.5, we present a comparison which highlights the advantage of our approach over direct match enumeration.

Figure 3.9: Number of active vertices and edges after each iteration for the same experiments as in Fig. 3.8. The bottom row of X-axis labels represents the number of iterations required to reach a precise solution. Note that the Y-axis is on log scale.

3.7.4 Example Use Cases – Social Network Analysis and Information Mining

We demonstrate the ability of our scalable pattern matching technique to support complex data analytics scenarios in the context of social networks and knowledge graphs. Today's user experience on social media platforms is tainted by the existence of malicious actors such as bots, trolls, and spammers. This highlights the importance of detecting unusual activity patterns that may indicate potential malicious attacks. We present three use cases: one general search query for the IMDb graph and two queries that attempt to uncover suspicious activity in the Reddit dataset. Fig. 3.10 summarizes the scenarios we target and presents the

Table 3.4: Match enumeration statistics: number of matches for the Chain and Tree patterns (in Fig. 3.5, top table below), and the WDC (Fig. 3.7), Reddit and IMDb (Fig. 3.10) patterns (bottom table), and the enumeration times, starting from the respective pruned graphs. Note that for WDC-1, WDC-3 and WDC-5, we were not able to enumerate all the matches.

R-MAT    #Compute    Chain                 Tree
Scale    Nodes       Count     Time (s)    Count     Time (s)
28       2           2,716     10.4        1,186     10.4
31       16          3,747     10.5        1,488     10.4
34       128         7,529     11.1        3,766     11.1
37       1024        55,710    10.1        32,532    5.5

Template          WDC-1   WDC-2   WDC-3   WDC-5   WDC-6   RDT-1   RDT-2   IMDB-1
Count             668M*   2,444   7.7B*   466M*   1.9M    24K     518K    840K
Time              4min    1.8s    24hr    1.1hr   34s     6.8s    4.9s    10hr
#Compute Nodes    64      64      128     128     128     64      64      8

Table 3.5: We compare two cases: direct enumeration vs. constraint checking followed by match enumeration in the pruned graph.
These experiments use 64 compute nodes. For the relatively rare WDC-2 pattern, PruneJuice achieves ∼18× speedup. For WDC-6, direct enumeration leads to a crash (the generated message traffic overwhelms some of the compute nodes), while PruneJuice was able to list all the matches in under two minutes. Note the difference in runtime for WDC-2 from the numbers reported in Fig. 3.8. The testbed has been updated (including the OS kernel, C++ compiler, MPI libraries and interconnect drivers) since we conducted the scaling experiments in Fig. 3.8, and we have made various performance optimizations to our own codebase; hence, the improved performance.

Template   Direct Enumeration   PruneJuice   Speedup
WDC-2      19min                62s          18×
WDC-6      Crash                94s          N/A

corresponding search patterns.

RDT-1. Identify users with an adversarial poster-commenter relationship. Each author (A) makes at least two posts or two comments, respectively. Comments to posts with more upvotes (P+) have a balance of negative votes (C-), and comments to posts with more downvotes (P-) have a positive balance (C+). The posts must be under different subreddits (S), a category for posts.

RDT-2. Identify all poster-commenter pairs where the commenter, i.e., an author (A), makes at least two comments to the same post, one directly to the post and one in response to a comment. The poster, i.e., an author (A), also makes a comment in response to a comment. The commenter always receives a negative rating (C-) to a popular post (P+); however, comments (to the same post) by the

Figure 3.10: The scenarios and their corresponding templates for the Reddit (RDT) and IMDb graphs. RDT-1 (left): identify users with an adversarial poster-commenter relationship. Each author makes at least two posts or two comments, respectively.
Comments to posts with more upvotes (P+) have a balance of negative votes (C-), and comments to posts with more downvotes (P-) have a positive balance (C+). The posts must be under different subreddits (category). RDT-2 (center): identify all poster-commenter pairs where the commenter makes at least two comments to the same post, one directly to the post and one in response to a comment. The poster also makes a comment in response to a comment. The commenter always receives a negative rating (C-) to a popular post (P+); however, comments (to the same post) by the poster have a positive rating (C+). IMDB-1 (right): find all the actresses, actors, and directors that worked together in at least two different movies that fall under at least two similar genres.

poster have a positive rating (C+).

IMDB-1. Find all the actresses, actors, and directors that worked together in at least two different movies that fall under at least two similar genres.

Fig. 3.11 shows runtimes for these scenarios, broken down to individual LCC and NLCC iteration levels. Although RDT-1 is much less frequent than RDT-2, on the same 64 nodes, pruning for RDT-1 takes more than 3× longer to complete as it spends more time verifying the non-local constraints. Although both patterns have a 6-Cycle, RDT-2 allows verification of the two smaller cycles in isolation. (For NLCC, a longer path typically results in larger generated message traffic.) IMDB-1, a complex cyclic pattern, on eight nodes, spends the majority of the time verifying non-local constraints (specifically, TDS constraints).

3.7.5 Precision Guarantees vs. Time-to-Solution

Our approach gradually refines G∗(V∗,E∗) down to the complete set of vertices and edges that participate in at least one match and guarantees no false positives.
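The per-iteration vertex-set precision used in this section - the fraction of the current refined vertex set V∗ that participates in at least one match - can be computed directly from the two sets. A minimal sketch (the sets below are illustrative, not measured data):

```python
# Sketch: per-iteration vertex-set precision of the pruned graph.
# V_star is the current refined vertex set; true_matching is the set of
# vertices that participate in at least one match.
def precision(V_star: set, true_matching: set) -> float:
    if not V_star:
        return 1.0  # an empty refined set contains no false positives
    return len(true_matching & V_star) / len(V_star)

# Example: 8 surviving vertices, 6 of which are true matches -> 75%.
V_star = {1, 2, 3, 4, 5, 6, 7, 8}
true_matching = {1, 2, 3, 4, 5, 6}
print(precision(V_star, true_matching))  # 0.75
```

Precision reaches 100% exactly when every vertex still in V∗ belongs to at least one match, which is the point the markers in Fig. 3.12 identify.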
Given that this is an iterative process, it is natural to investigate at what rate G∗ is refined, and whether there are opportunities to trade between the precision (or existence of precision guarantees) of an intermediary solution, and time-to-solution.

Figure 3.11: (a) Runtime for the graph analytics patterns presented in Fig. 3.10. The labels on the X-axis represent the number of vertices and edges in the respective pruned graphs. Note that the Y-axes have different scales. (b) Number of active vertices and edges after each iteration for the same experiments for the Reddit patterns as in (a). The labels on the bottom row of the X-axis represent the number of iterations required. Note that the Y-axis is on log scale.

Fig. 3.12 shows the evolution of the precision of the intermediate solution over time for the various patterns. We define precision as the ratio between the number of distinct vertices that participate in at least one match, and the size of

Figure 3.12: Vertex set precision over the lifetime of an execution for (a) WDC (Fig. 3.7), (b) Reddit (Fig. 3.10), and (c) R-MAT (Fig. 3.5) patterns. The X-axis presents the timeline (in seconds) while the Y-axis is the precision achieved by the end of an iteration.
The markers indicate the moment in time when 100% precision has been achieved. The timeline for WDC-1 is limited to the 170th second for better visibility (WDC-1 achieves 100% precision in less than 20 seconds). For the R-MAT patterns, we show plots for Scale 28 and 37.

refined vertex set V∗ at the end of an iteration. We note that: (i) the rate at which precision improves is pattern dependent, and (ii) for some patterns, even after precision reaches 100% and no more vertices are pruned, the algorithm continues in order to guarantee that no false positives are left. For example, in Fig. 3.12(a), WDC-1 quickly reaches 100% precision; however, 99% of the execution time is spent verifying five non-local constraints (top left WDC-1 plot in Fig. 3.8) to guarantee no false positive matches. WDC-3, however, shows a different behaviour: the complex structure does not reach 90% precision until the very end, and converges to 100% quickly afterwards. We believe that the rate at which precision is achieved is partly influenced by the order in which constraints are verified, and our heuristics for non-local constraint verification ordering leave room for improvement (we explore this in §3.8).

Figure 3.13: (a) Performance and scalability comparison between the vertex elimination only solution (left), and the combined vertex and edge elimination solution (right) for the WDC-3 pattern. (b) Comparing synchronous and asynchronous NLCC. (c) Impact of work aggregation on runtime for the WDC patterns (for the sake of readability, only a subset of non-local constraints are considered for WDC-1). (d) Runtime performance when only TDS constraints are used vs.
all NLCC constraints used. (Here, RDT-1 does not finish after two hours.) Note that in (b) and (d), we did not apply load balancing to RDT-1. All experiments in (b), (c) and (d) use 64 compute nodes.

3.7.6 Impact of Design Decisions and Strategic Optimizations

Our prototype implementation embraces design choices and incorporates optimizations that offer a multitude of performance gains. Here we study the impact of these design decisions and optimizations.

Edge Elimination. Fig. 3.13(a) highlights the key scalability and performance impact of edge elimination: without it, the NLCC phases take almost one order of magnitude longer and the entire pruning takes 2–9× longer. Without edge elimination, the WDC-3 pattern results in 3,180,678 edges selected (some are false positives). Edge elimination identifies the true positive matches and reduces the number of active edges to 255,022. In other words, the graph is 12.5× sparser, which in turn improves the overall message efficiency of the system. Additionally, we note that edge elimination enables us to search the WDC patterns in Fig. 3.16(a) and Fig. 3.16(c). The vertex elimination-only solution presented in our initial work was not sufficient to search these patterns on 64 nodes, primarily due to the presence of high-degree vertices with the most frequent labels org and net.

Asynchronous Processing. Our system is designed to harness the advantages of an asynchronous graph processing framework, yet a synchronous one could easily support the same algorithms. Fig. 3.13(b) shows the runtime advantage of asynchronicity for two patterns, WDC-1 and RDT-1 (2.7× and 3.5× gains, respectively), compared with a synchronous version that adds a barrier after each NLCC token propagation step. Asynchronous NLCC makes it possible for all walks to progress independently without synchronization overheads. Synchronous NLCC is implemented within HavoqGT as well.

Work Aggregation. Fig.
3.13(c) shows the performance gains enabled by the work aggregation strategy employed by NLCC (presented in §3.5 and Alg. 6). The magnitude of the gain, 10–50%, is data dependent and more pronounced when the pattern is abundant, e.g., WDC-1 has 600M+ instances (Table 3.4).

Rationale for Including Smaller Constraints. For patterns for which TDS constraints are required for precision guarantees, the path and cycle constraints serve as performance optimizations (their goal is to prune away the non-matching part of the graph early), and we are interested in evaluating the impact of these optimizations. To this end, we compare time-to-solution for an input configuration that generates and uses all NLCC constraints, and one that uses only the TDS constraints required to guarantee 100% precision. Our experiments show that, although it increases the number of iterations, verifying (potentially less expensive) simpler and smaller substructures first is extremely effective: for some patterns (e.g., RDT-1) the system is not able to complete in a reasonable time without these constraints, while for others (e.g., WDC-2) these constraints enable a 2.4× speedup (Fig. 3.13(d)).

3.7.7 Load Balancing

For our pattern matching solution, load imbalance can be caused by two artifacts: First, over the course of execution our solution causes the workload to mutate, i.e., we prune away vertices and edges. Second, the distribution of matches in the background graph may be nonuniform: the vertices and edges that participate in the matches may reside on a small, potentially concentrated, portion of the graph. In this section, we first present a detailed characterization of these artifacts. Then we discuss three different load balancing strategies that can be employed within our pipeline, evaluate their effectiveness, and present an analysis of their impact on performance.

Characterizing Load Imbalance Issues.
Load imbalance can indeed occur: for instance, for the relatively rare WDC-2 pattern, when using 64 nodes, the vertices and edges that participate in the final selection are distributed over as few as 111 partitions out of 2,304 (64 nodes × 36 MPI processes per node). The distribution is concentrated: 90% of the matching edges are on 83 partitions, while more than half of the matching edges reside on only 20 partitions. For the more frequent WDC-1 pattern, 50% of the matching edges are on less than 5% of the total partitions (on a 64 node deployment), which becomes less than 3% of the total partitions on a 128 node deployment.

Load imbalance, due to the above mentioned artifacts, impacts performance along multiple axes - most importantly, it impairs scalability. A typical distributed system achieves parallelism by processing the data partitions (mapped to physical processors or cores) concurrently. A balanced share of the work among parallel partitions is key to achieving scalability with respect to the processor count. For scalable graph processing, it is imperative that partitions have an (almost) equal share of edges, since the edge count dictates the number of messages that are sent and received by a partition, i.e., the share of work it is responsible for. In a distributed system, work imbalance among partitions introduces stragglers that dominate the overall time-to-solution. In the presence of stragglers, for example, doubling the computing resources does not yield an equal gain in scalability. This explains the sublinear scaling for the non-local constraints in our strong scaling experiments (see Fig. 3.8).

We observe further nonuniformity in the match distribution at the vertex granularity: the number of matches a vertex participates in can significantly vary across the matching vertex set V∗. As an example, let us consider the WDC-2 pattern (Fig. 3.7) whose matches are shown in Fig. 3.18 (they form six connected components).
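The partition-level concentration statistics above (e.g., 90% of WDC-2's matching edges residing on 83 of 2,304 partitions) can be derived from the per-partition matching-edge counts with a simple greedy measure. A minimal sketch with hypothetical counts, not the measured WDC-2 distribution:

```python
# Sketch: smallest number of partitions that together hold a given
# fraction of the matching edges (a simple concentration measure).
def partitions_holding(edge_counts, fraction):
    total = sum(edge_counts)
    covered, used = 0, 0
    for count in sorted(edge_counts, reverse=True):
        if covered >= fraction * total:
            break
        covered += count
        used += 1
    return used

# Hypothetical distribution: 4 heavy partitions and 20 light ones.
counts = [400, 300, 200, 100] + [5] * 20
print(partitions_holding(counts, 0.50))  # 2
print(partitions_holding(counts, 0.90))  # 4
```

A small result relative to the total partition count indicates a concentrated match distribution, the situation that hinders the scalability of the NLCC phases.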
The largest connected component contains 2,262 matches (bottom row, center). In this connected component, there is a single gov vertex, depicted using a triangle in solid grey, which participates in 2,262 matches (out of a total of 2,444 matches). This artifact is more pronounced in the case of the WDC-1 and WDC-3 patterns. For WDC-1, 99% of the matching vertices are part of a single connected component. There are multiple vertices that belong to over three million matches. The numbers are more striking for the frequent WDC-3 pattern: a single vertex participates in over 34M matches.

This irregularity has crucial performance implications; in particular, it hinders the scalability of the routines that rely on multi-hop graph walks (or token passing), such as non-local constraint checking and full match enumeration. When the matches are concentrated on a few compute nodes and only a few vertices participate in a large number of matches, the partitions that these vertices reside on send/receive a larger portion of the message traffic. In this case, increasing the number of processors does not help, as, in our current infrastructure, processing at the vertex granularity cannot be 'scaled out' efficiently. Furthermore, given that a partition processes the local message queue sequentially, message traffic targeting popular vertices can overwhelm the respective partitions. Consequently, these bottlenecked partitions become the key performance limiter. The above reasoning explains why some of the non-local constraint checking phases do not scale well (see Fig. 3.8 and Fig. 3.13(a)).

Strategies to Address Load Imbalance Issues. We explore three strategies to address load balancing issues: (i) reshuffling the load, (ii) load consolidation, i.e., reloading the shuffled load on fewer nodes to optimize for efficient resource usage, and (iii) using replication for the vertices and edges that remain in the current pruned solution subgraph to minimize time-to-solution.
Strategies (ii) and (iii) are effective when the (intermediate or final) pruned graph has a much smaller memory footprint compared to the original background graph.

Figure 3.14: (a) Impact of load balancing on runtime for the WDC-1 and RDT-1 patterns. We compare two cases: without load balancing (NLB) and with load balancing through reshuffling on the same number of nodes (LB). For WDC-1, we show results for two scales, on 64 and 128 nodes. The speedup achieved by LB over NLB is also shown on top of each bar. (b) Performance of RDT-1 for four scenarios: (i) without load balancing on 64 nodes (NLB-64), (ii) with load balancing through reshuffling on the same number of nodes (LB-64), (iii) beginning with 64 nodes and relaunching on a 16 node deployment after load balancing (LB-16), and (iv) relaunching on a single node (36 processes) after load balancing (LB-1). The chart shows time-to-solution and CPU Hours consumed in each of the four cases. The CPU Hours consumed by NLB-64, LB-64, and LB-16 relative to LB-1 are also shown on top of the respective bars.

(i) Load Reshuffling. We employ a pseudo-dynamic load balancing strategy. First, we checkpoint the current state of execution: the pruned graph, i.e., the set of active vertices and edges, and the per-vertex state indicating template matches, ω(vj) (Alg. 2). Next, using HavoqGT's distributed graph partitioning module, we reshuffle the vertex-to-processor assignment to evenly distribute vertices (with ω(vj) remaining intact) and edges across processing cores. Processing is then resumed on the rebalanced workload. Depending on the size of the pruned graph, it is possible to resume processing on a smaller deployment (we explore this avenue in the next section).
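The edge-balancing objective of the reshuffling step can be approximated with a simple greedy pass over the checkpointed pruned graph: assign the heaviest (highest-degree) surviving vertices first, always to the least-loaded partition. This is a minimal sketch only; HavoqGT's actual distributed partitioner works differently, and the vertices and degrees below are hypothetical. The per-vertex match state ω(v) would simply travel with its vertex.

```python
import heapq

# Sketch: reassign surviving vertices to partitions so that each
# partition receives a near-equal share of edges (greedy heaviest-first).
def rebalance(vertex_degrees, num_partitions):
    heap = [(0, p) for p in range(num_partitions)]  # (edge load, partition)
    heapq.heapify(heap)
    assignment = {}
    for v, deg in sorted(vertex_degrees.items(), key=lambda kv: -kv[1]):
        load, p = heapq.heappop(heap)   # least-loaded partition
        assignment[v] = p
        heapq.heappush(heap, (load + deg, p))
    return assignment

# Hypothetical active vertices of a pruned graph with their degrees.
degrees = {"a": 50, "b": 30, "c": 20, "d": 20, "e": 10}
assign = rebalance(degrees, 2)
loads = [sum(d for v, d in degrees.items() if assign[v] == p) for p in (0, 1)]
print(sorted(loads))  # [60, 70]
```

Balancing on edge counts (rather than vertex counts) reflects the observation above that a partition's edge share dictates its message volume, i.e., its share of work.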
Over the course of the execution, checkpointing and rebalancing can be repeated as needed (identification of the trigger point at which to perform load balancing is discussed in §3.8).

As a proof of feasibility, to examine the impact of this technique, we analyze the runs for the WDC-1 and RDT-1 patterns. We chose some of the real-world workloads as they are more likely to lead to imbalance than a synthetically generated load. Fig. 3.14(a) compares the performance of the pruning algorithms with and without load balancing. For these examples, we perform workload rebalancing only once: for WDC-1, before verifying the TDS constraints, and for RDT-1, when the pruned graph is four orders of magnitude smaller. The extent of load imbalance is more severe for WDC-1 on the smaller 64 node deployment compared to using 128 nodes; workload rebalancing improves time-to-solution by 3.1× and 1.3× on 64 and 128 nodes, respectively. In the case of RDT-1, as a result of load balancing, the gain in time-to-solution is 1.7× (on a 64 node deployment). Given that the pruned graphs are orders of magnitude smaller than the original graph, the time spent in checkpointing, rebalancing, and relaunching the computation is often negligible compared to the gain in time-to-solution. In Table 3.4, we run enumeration for WDC-1, WDC-3, WDC-5, RDT-1 and IMDB-1 on the respective rebalanced graphs.

(ii) Smaller Deployment - for efficiency. One may argue that when the current solution subgraph G∗ is sufficiently small, it is more efficient to create load balanced partitions targeting a smaller deployment. There are two different aspects of 'efficiency' concerns that may support this conjecture: First, moving to a smaller deployment reduces power usage and may yield better normalized performance with respect to energy consumption.
Second, for the scenario where the matches are highly concentrated on a limited number of nodes/partitions (which hinders the scalability of the non-local constraint checking phase), a smaller deployment offers locality as well as better amortization of the cost of distributed communication.

To this end, we set up a simple case study using the Reddit dataset and the RDT-1 pattern. In the previous section, we discussed our load balancing approach and demonstrated its application for the WDC-1 and RDT-1 patterns (Fig. 3.14(a)). For this experiment, we resume processing using the rebalanced workload on smaller deployments: from the original 64 node deployment we switch to a 4× smaller deployment comprised of 16 nodes. In a second use case, we resume processing on the rebalanced workload on a single node (running 36 processes). (In this experiment, the 'current' solution subgraph G∗ is about four orders of magnitude smaller than the original background graph G, which makes it possible to move to a smaller deployment.) Fig. 3.14(b) compares four scenarios: (a) without load balancing (NLB-64), (b) with load balancing (LB-64), (c) with load balancing and relaunching on a smaller 16 node deployment (LB-16), and (d) relaunching on a single node after load balancing (LB-1). In addition to time-to-solution, we also compare the CPU Hours consumed in each of the four cases. (A platform's net energy consumption is directly related to total CPU Hours expended.) Fig. 3.14(b) shows that, with respect to time-to-solution, LB-64 has a marginal advantage over LB-16 and LB-1. However, LB-1 holds a significant advantage in terms of CPU Hour consumption: it is 6.1× more efficient than LB-64. The overhead for NLB-64 is a whopping 10.4× compared to LB-1. These results support the argument that load balanced partitions targeting a smaller deployment yield better normalized performance with respect to energy consumption.

(iii) Full Replication - for performance.
One of the main uses of distributed systems is to harness concurrent processing to accelerate time-to-solution, even when the problem at hand has a small memory footprint. In this scenario, it is not important to partition the data and distribute it among the physical processing units. Instead, the input data can be replicated across the compute nodes for parallel processing (while preserving locality at the same time). Arabesque [108] and QFrag [100] (presented in §3.7.11) are examples of replicated distributed pattern matching systems. For our approach, when the current solution subgraph G∗ is sufficiently small, it is possible to replicate it across distributed nodes and relaunch processing. We, however, believe that this would require developing a dynamic work balancing mechanism (in the same vein as task fragmentation in QFrag). (Nondeterministic growth of the algorithm states on some replicas can cause work imbalance.) Expensive operations like non-local constraint checking and full match enumeration are the primary beneficiaries of this load balancing technique.

3.7.8 Defence Against System Collapse due to Message Explosion

In §3.7.7, we discussed the scalability challenges stemming from load imbalance issues. Here we discuss how the same artifacts (as in §3.7.7) and a lack of infrastructure support to prevent unwanted message buildup may lead to system collapse. We then present the defence mechanism our current implementation incorporates to address this issue.

For the non-local constraint checking and match enumeration routines, the impact of irregular artifacts is the most adverse: they can cause unpredictably imbalanced traffic and overwhelm a fraction of the compute nodes, and in the worst case, cause a crash.
Also, combinatorial growth of intermediate matches may lead to message explosion, causing the system to collapse. Memory exhaustion at the node level and network congestion leading to fault manifestation in the underlying messaging infrastructure are the leading causes of failure. This issue is compounded by the fact that the current implementation of HavoqGT lacks a flow control mechanism that can prevent unwanted message buildup (backpressure mechanisms, however, are challenging to implement in a fully asynchronous messaging environment).

Figure 3.15: The figure compares system memory usage, throughout the application lifetime, for counting unlabeled 4-Motifs in the Youtube graph (see §4.8.6 for experiment details) for different batch sizes (up to 72 batches). We run this experiment on a shared memory platform with 1.5TB physical memory. We run 72 MPI processes. The X-axis is the timeline in seconds. The Y-axis is the peak system memory usage (in megabytes) at a given instant during the application lifetime.

A pseudo batch processing technique to prevent system collapse due to message buildup. To prevent system collapse due to message buildup, we have developed an application-level mechanism specific to our pattern matching system: in non-local constraint checking or full match enumeration, the generated message traffic is related to the number of token source vertices (Alg. 6). In general, the non-local constraint checking routine launches a single asynchronous task (Alg. 5) to verify all the vertices (that are a match for the template vertex at the head of the path representing a non-local constraint). Here, all vertices initiate tokens simultaneously and processing progresses in an asynchronous manner (Alg. 6).
This setup presents the opportunity to reduce the number of active messages in the system at a given instant by simply reducing the number of token source vertices within a single asynchronous task, thus potentially lowering the peak traffic volume that may persist in the system at a given moment. We adopt a pseudo batch processing technique: we split a large asynchronous task (non-local constraint checking or enumeration) into multiple smaller asynchronous tasks - in each batch, a subset of vertices issue tokens (this procedure invokes Alg. 6 multiple times). The goal is to ensure that, at any given instant, the number of active messages (in flight or waiting in the message queue) remains within the platform's capacity. The batch size can be determined at multiple granularities: first, at the MPI process (a.k.a. MPI rank) level - in a single batch, the vertices on a limited number of MPI ranks initiate tokens; and second, at the vertex level - in a single batch, tokens are issued from a fixed number of vertices within an MPI process. In effect, batching slows down the processing rate in favour of message efficiency and efficient memory utilization. We demonstrate that the proposed batch processing technique is effective in practice: we study the impact of the batch size on peak system memory usage and the performance trade-offs of this approach; more importantly, we show that this technique can prevent system collapse. Fig. 3.15 compares system memory usage, throughout the application lifetime, for different batch sizes (1–72 batches), for counting unlabeled patterns in the Youtube graph. We run this experiment on a shared memory platform with 1.5TB physical memory. In the #Batch 1 configuration, token source vertices on all the MPI processes initiate tokens simultaneously, while in the #Batch 72 configuration (i.e., the complete work is split into 72 smaller asynchronous tasks), in each batch, token source vertices on one MPI process initiate tokens.
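The batching scheme above can be sketched as follows. This is a simplified, single-process Python illustration, not the HavoqGT/MPI implementation; `run_token_walks` is a hypothetical stand-in for one asynchronous token-passing task (Alg. 6), and the round-robin split is only one possible batching policy:

```python
def make_batches(token_sources, num_batches):
    """Split the token source vertices into num_batches roughly equal
    groups; each group becomes one smaller asynchronous task."""
    batches = [[] for _ in range(num_batches)]
    for i, v in enumerate(token_sources):
        batches[i % num_batches].append(v)
    return [b for b in batches if b]  # drop empty groups

def check_in_batches(token_sources, num_batches, run_token_walks):
    # Instead of one large task in which every source initiates a token
    # simultaneously, issue tokens batch by batch: the number of
    # in-flight messages at any instant is bounded by the batch size.
    results = []
    for batch in make_batches(token_sources, num_batches):
        results.extend(run_token_walks(batch))  # one smaller async task
    return results
```

At the MPI rank granularity, setting `num_batches` to the number of ranks would correspond to the #Batch 72 configuration above, where in each batch only the sources on a single rank issue tokens.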
For #Batch 1, peak memory usage is ∼1.1TB. With less than 10% performance loss, #Batch 2 reduces memory usage by over 2× compared to #Batch 1. #Batch 72 improves peak memory consumption by ∼21× over the default configuration (i.e., #Batch 1) at the expense of a ∼3.5× slowdown in time-to-solution. For #Batch 18, the peak memory usage is less than 100GB, with less than a 1.5× performance penalty (note that in our testbed, the Quartz cluster [85], each node has only 128GB of main memory). The experiment results confirm the effectiveness of the batch processing technique: by controlling the amount of parallel work that is initiated, i.e., systematically slowing down the processing rate, we are able to limit the peak volume of messages in the system. This keeps the peak memory usage within the platform's capacity, preventing memory exhaustion that may lead to system collapse. As described earlier, memory exhaustion is one of the primary failure scenarios that leads to a crash. (We observe that the relation between the batch size, peak memory consumption, and time-to-solution is not linear. Further investigation is required to aid informed decision making with regard to selecting the optimal batch size.) Using the above described batch processing technique, we were successful in searching the following patterns (that would otherwise lead to a crash): constraint checking for WDC-1 on 64 nodes, full match enumeration for WDC-1, WDC-3, WDC-5 and IMDB-1, and counting some of the unlabeled patterns in the Mico, Youtube and LiveJournal graphs (§3.7.11 and §4.8.6).

3.7.9 Template Sensitivity Analysis

We investigate the influence of template properties, such as label selection and topology, on the performance of the graph pruning procedure. For this study, we consider the WDC graph and the patterns in Fig. 3.7 and Fig. 3.16.

Label Sensitivity. To understand how label popularity influences performance, we consider the WDC-3 and WDC-4 patterns (Fig.
3.7): WDC-4 has the same topology as WDC-3 yet has labels that are less frequent. The two patterns share five out of the eight vertex labels; the labels of WDC-3 and WDC-4 cover ∼15% and ∼4% of the vertices in the background graph, respectively. For WDC-4, the solution subgraph (|V∗| = 430 and 2|E∗| = 914) is about two orders of magnitude smaller than that of WDC-3 (see Fig. 3.8). The pruning time for WDC-4 is at most 2.6× faster on 512 compute nodes, averaging 1.8× faster across different scales.

Topology Sensitivity. The template topology dictates the type and number of different constraints that are to be verified. For example, if the template is a monocycle (e.g., Fig. 3.16(a)) then only a single cycle check is required; if the template is non-edge-monocyclic (e.g., Fig. 3.16(d)) then the relatively more expensive template-driven search is needed for precision guarantees. To understand how different template topologies influence performance, we study the WDC patterns in Fig. 3.16: Templates (a) and (b) are monocycles, each has a vertex with the label edu. Template (c) is created through the union of (a) and (b). Templates (d) and (e)

Figure 3.16: WDC patterns used for template topology sensitivity analysis. Templates (a) and (b) are monocycles, each has a vertex with the label edu. Template (c) is created through the union of (a) and (b). Templates (d) and (e) are constructed from (c) by incrementally adding one edge at a time.

Table 3.6: Runtime for pruning (with precision guarantees) and size of the pruned solution subgraph for the WDC patterns in Fig. 3.16 (used for template topology sensitivity analysis). The table lists the number of vertices (|V∗|) and edges (2|E∗|) in the solution subgraph for each pattern.
All the experiments were carried out on a 64-node deployment.

Template   (a)        (b)    (c)      (d)     (e)
|V∗|       413,527    548    18,345   39      82
2|E∗|      4,095,646  1,506  139,260  166     34
Time       41min      39s    2.6min   2.1min  1.8min

are constructed from (c) by incrementally adding one edge at a time. Templates (a) – (c) are edge-monocyclic, thus only require checking cycle constraints. Non-edge-monocyclic templates (d) and (e) require the template-driven search; template (e) needs to verify the existence of a clique (consisting of vertices with labels gov, org, edu, and net). From the topology point of view, among all the constraints in these examples, the clique is the most complex substructure and its verification requires the longest ‘walk’. Table 3.6 lists the runtimes for pruning (with precision guarantees) for the WDC patterns in Fig. 3.16. The table also shows the number of vertices (|V∗|) and edges (2|E∗|) in the solution subgraph for each pattern. It may be intuitive that the more constraints there are to verify, the slower the system is to prune the background graph to a precise solution. However, our experience with the WDC graph and the patterns in Fig. 3.16 did not exhibit any concrete evidence to affirm this rationale. Template (a) has only one 4-Cycle to check; however, it has the slowest time-to-solution: it has the largest number of vertex and edge matches, largely due to the presence of 400M+ vertices in the background graph with the labels org and net. Templates (c), (d) and (e), which require template-driven search, show, on average, ∼20× faster time-to-solution compared to template (a). The complex templates (c), (d) and (e) introduce additional local and non-local constraints. There are at least an order of magnitude more vertices in the background graph, with labels org and net, that satisfy the constraints of template (a) than those of templates (c), (d) and (e).
As a result, templates (c), (d) and (e) eliminate the majority of the non-matching vertices (and edges) early, leading to a faster time-to-solution, with the most complex template (e) being the rarest and the fastest to finish among the three. The key observation here is that it is the abundance of the constraints (in the background graph) that governs performance: template (c), which incorporates the 4-Cycle constraint that is not present in (b), has two orders of magnitude more vertex and edge matches in the background graph, as well as a slower runtime than that of (b). Similarly, there are only a handful of vertices in the background graph that satisfy the requirements of the complex substructure of (e), i.e., they belong to a clique. Rarity of the clique substructure/constraint leads to rapid pruning, resulting in a faster time-to-solution compared to (c) and (d).

3.7.10 Non-local Constraint Selection and Ordering Optimization – A Feasibility Study

Our current pipeline uses a number of ad-hoc, intuition-based heuristics for non-local constraint selection and ordering. As these choices have a sizable impact on performance, we explored whether a better solution is possible (published in [115]). We observed that two primitives, the estimation of constraint selectivity - the number of vertices likely to be eliminated by the constraint - and its verification cost - the runtime cost of verifying it in the background graph - are sufficient to design informed heuristics. To this end, we proposed a first solution to make these estimates and demonstrated it using the shared memory implementation of our pruning pipeline used in §3.7.11.1. Our experience demonstrates that estimation of constraint selectivity and verification cost is feasible with low runtime overheads and offers accurate enough information to optimize our pruning pipeline by a significant margin. We present a solution based on Stochastic Graph Modeling [101].
There are, however, a number of challenges associated with modeling the estimation problem. (i) Real-world graphs are hard to model - standard graph models can be used as an approximation; however, the approximation embedded in the model becomes a source of error for any derived estimates. (ii) There is a trade-off between, on the one side, model accuracy and its ability to model real-world graphs with high fidelity, and, on the other side, mathematical tractability of the derived metrics we are interested in. (iii) Furthermore, there are two competing factors: the model complexity - the volume of information needed - and, on the flip side, the effort to gather and maintain this data (given that the graph gets pruned, hence the need to dynamically update the data after verifying each constraint). The model estimates the effectiveness of a constraint, which is the ratio between the number of vertices it will prune (i.e., its selectivity) and the time it will take to verify the constraint (i.e., its cost). The number of vertices pruned for a specific constraint can be found from the probability that it will be satisfied. This is tied to the size of the search space (often combinatorial) that can be explored, the different labels that are part of the constraint relative to their frequency in the background graph, and the additional requirements that need to be met, such as the existence of cycles. Estimation of the runtime cost of verifying a constraint poses additional challenges. It requires the estimation model to account for the underlying system architecture, i.e., shared memory or distributed memory, and to account for salient implementation-level optimizations that are highly architecture/platform dependent. For our shared memory implementation, to estimate cost, we approximate the number of edges that are traversed during the search, as they are an approximation of the number of memory accesses on a single machine.
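The effectiveness-based ordering this metric induces can be sketched as follows. This is an illustration only, not the actual model from [115]: the constraint records, the field names, and the numeric estimates are hypothetical (in the real pipeline they would be derived from the stochastic graph model and updated dynamically as the graph is pruned):

```python
def effectiveness(constraint):
    # Ratio of estimated selectivity (vertices the constraint is expected
    # to prune) to estimated verification cost (approximated, on shared
    # memory, by the number of edges traversed during the search).
    return constraint["est_pruned_vertices"] / constraint["est_edges_traversed"]

def order_constraints(constraints):
    # Verify the most "profitable" non-local constraints first: those
    # expected to prune the most vertices per unit of work.
    return sorted(constraints, key=effectiveness, reverse=True)

# Hypothetical estimates for three non-local constraints:
constraints = [
    {"name": "4-cycle",  "est_pruned_vertices": 5_000,  "est_edges_traversed": 2_000_000},
    {"name": "tds-path", "est_pruned_vertices": 40_000, "est_edges_traversed": 8_000_000},
    {"name": "clique",   "est_pruned_vertices": 90_000, "est_edges_traversed": 3_000_000},
]
print([c["name"] for c in order_constraints(constraints)])
# → ['clique', 'tds-path', '4-cycle']
```

Since each verified constraint shrinks the graph, a full implementation would re-estimate the remaining constraints after every pruning step rather than sorting once.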
(A detailed description of the estimation model and its derivations is available in [115].3) Evaluation using several real-world graphs (Reddit, IMDb, Patent and YouTube) confirmed the applicability of our model to generate an optimized ordering of non-local constraints [115]. We observed up to 45% improvement in time-to-solution over the intuition-based heuristics used for constraint ordering. Generally, the new approach produces an ordering that leads to better runtime for the complex patterns, such as the Reddit and IMDb cases (26% better on average), and similar runtime for the simpler cases, like Patent and YouTube, where the intuition-based heuristics are optimal or close to optimal. Additional gains are available when the model is also used to select the constraints.

3 A version of the paper including an extended appendix is available at http://www.ece.ubc.ca/~matei/papers/ia3-nicolas-full.pdf

Transferring this experience to a shared-nothing platform. Two extensions are needed to transfer this experience to a shared-nothing platform, both related to estimating the constraint verification cost. On the one side, the cost model is different: in a distributed system, typically, the bottleneck is inter-node communication; hence, the estimation model must account for network traffic and latency. Since the graph is partitioned and distributed over multiple nodes, for high-precision estimation the cost model should differentiate between accesses to node-local edges and remote edges, given that the latter are more expensive.
On the other side, the non-local constraint checking algorithm is different (breadth-first and not depth-first as in the shared memory implementation; each forwarding vertex essentially broadcasts a token to its neighbors), which impacts the set of optimizations used.

3.7.11 Comparison with State-of-the-Art Systems

We empirically compare our work with two state-of-the-art pattern matching systems, QFrag [100] and Arabesque [108].

3.7.11.1 Comparison with QFrag

Similar to our solution, QFrag targets exact pattern matching on distributed platforms, yet there are two main differences: QFrag assumes that the entire graph fits in the memory of each compute node and uses data replication to enable search parallelism. More importantly, QFrag employs a sophisticated load balancing strategy to achieve scalability. QFrag is implemented on top of Apache Spark [104] and Giraph [34]. In QFrag, each replica runs an instance of a pattern enumeration algorithm called TurboISO [44] (essentially an improvement of Ullmann’s algorithm [116]). Through evaluation, the authors demonstrated QFrag’s performance advantages over two other distributed pattern matching systems: (i) TriAD [41], an MPI-based distributed RDF [86] engine based on an asynchronous distributed join algorithm, and (ii) GraphFrames [23, 38], a graph processing library for Apache Spark, based on distributed join operations.

Figure 3.17: The patterns (Q4, Q6, Q7 and Q8, reproduced from [100]) used for comparison with QFrag (results in Table 3.7). The label of each vertex is mapped, in alphabetical order, to the most frequent label of the graph in decreasing order of frequency. Here, a represents the most frequent label, b is the second most frequent label, and so on.

Table 3.7: Performance comparison between QFrag and our pattern matching solution.
The table shows the runtime in seconds for full match enumeration for QFrag; and, separately, for pruning and full match enumeration for our distributed system (labeled PruneJuice-distributed) and for a single node implementation of our graph pruning-based approach tailored for a shared memory system (labeled PruneJuice-shared). For PruneJuice, we split time-to-solution into pruning (top row) and enumeration (bottom row) times. We use the same graphs (Patent and YouTube) and the query templates in Fig. 3.17 (Q4 – Q8) used for the evaluation of QFrag in [100]. The other small acyclic queries used in [100] require PruneJuice to run local constraint checking only and, in these cases, PruneJuice is even faster than QFrag.

      QFrag            PruneJuice-distributed  PruneJuice-shared
      Patent  YouTube  Patent  YouTube         Patent  YouTube
Q4    4.19    8.08     0.238   0.704           0.100   0.400
                       0.223   1.143           0.010   0.010
Q6    5.99    10.26    0.874   2.340           0.070   1.730
                       0.065   0.301           0.005   0.010
Q7    6.36    11.89    0.596   1.613           0.130   0.820
                       0.039   0.180           0.005   0.010
Q8    10.05   14.48    0.959   2.633           0.100   1.370
                       0.049   0.738           0.001   0.010

Given that we have demonstrated the scalability of our solution (Serafini et al. also demonstrate good scalability properties for QFrag [100], yet on much smaller graphs), we are interested in establishing a comparison baseline at the single node scale. To this end, we run experiments on a modern shared memory machine with 60 CPU-cores, and use the two real-world graphs (Patent and YouTube) and four query patterns (Fig. 3.17) that were used for the evaluation of QFrag [100]. We run QFrag with 60 threads and our distributed solution with 60 MPI processes. The results are summarized in Table 3.7: QFrag runtimes for match enumeration (first pair of columns) are comparable with the results presented in [100], so we have reasonable confidence that we replicate their experiments well.
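The per-query speedups over QFrag implied by Table 3.7 can be recomputed directly from its entries (a small illustrative script; times in seconds, copied from the table, with the combined PruneJuice-distributed time taken as pruning plus enumeration):

```python
# (QFrag, PruneJuice-distributed pruning, PruneJuice-distributed enumeration)
table_3_7 = {
    ("Q4", "Patent"):  (4.19,  0.238, 0.223),
    ("Q4", "YouTube"): (8.08,  0.704, 1.143),
    ("Q6", "Patent"):  (5.99,  0.874, 0.065),
    ("Q6", "YouTube"): (10.26, 2.340, 0.301),
    ("Q7", "Patent"):  (6.36,  0.596, 0.039),
    ("Q7", "YouTube"): (11.89, 1.613, 0.180),
    ("Q8", "Patent"):  (10.05, 0.959, 0.049),
    ("Q8", "YouTube"): (14.48, 2.633, 0.738),
}

# Speedup of PruneJuice-distributed over QFrag for each (query, graph) pair.
speedups = {k: qfrag / (prune + enum)
            for k, (qfrag, prune, enum) in table_3_7.items()}
print(min(speedups.values()), max(speedups.values()))  # roughly 3.9 and 10.0
```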
With respect to combined pruning and enumeration time, our system (second pair of columns, presenting pruning and enumeration time separately) is consistently faster than QFrag on all the graphs, for all the queries. We note that our distributed solution does not take advantage of the shared memory of the machine at the algorithmic or implementation level (we use different processes, one MPI process per core), and has the system overhead of MPI communication between processes. (Additionally, unlike QFrag, our system is not handicapped by the memory limit of a single machine, as it supports graph partitioning and can process graphs larger than those that fit in the memory of a single node.) To highlight the effectiveness of our technique and get some intuition on the magnitude of the MPI overheads in this context, we implemented our technique for shared memory and present runtimes (when using 60 threads) for the same set of experiments in Table 3.7 (the two rightmost columns). We notice up to an order of magnitude improvement in performance compared to the distributed implementation running on a single node. In summary, we observe that our distributed solution works about 4–10× faster than QFrag and, if excluding distributed system overheads - considering the pruning time of the shared memory solution and conservatively reusing the enumeration runtime of the distributed solution - about 6–100× faster than QFrag.

3.7.11.2 Comparison with Arabesque

Arabesque is a framework offering precision and recall guarantees, implemented on top of Apache Spark [104] and Giraph [34]. Arabesque provides an API based on the Think Like an Embedding (TLE) paradigm to express graph mining algorithms, and a BSP implementation of the embedding search engine. Similar to QFrag, Arabesque replicates the input graph on all worker nodes; hence, the largest graph scale it can support is limited by the size of the memory of a single node. As Teixeira et al.
[108] showed Arabesque's superiority over two other systems, G-Tries [92] and GRAMI [26], we indirectly compare with these two systems as well. For the comparison, we use the problem of counting cliques in an unlabeled graph (an implementation is available with the Arabesque release). Cliques are complete graphs where every two distinct vertices are adjacent. For example, three vertices form a 3-Clique (i.e., a triangle), which has three edges; a 4-Clique has four vertices and six edges.

Table 3.8: Performance comparison between Arabesque and our pattern matching system (labeled PJ - short for PruneJuice). The table shows the runtime for counting 3-Clique and 4-Clique patterns. These search patterns as well as the following background graphs were used for the evaluation of Arabesque in [108]. We run experiments on the same shared memory machine (with 1.5TB physical memory) we used for the comparison with QFrag. Additionally, for PruneJuice, we present runtimes on 20 compute nodes. Here, PruneJuice runtimes for the single node, shared memory deployment are under the column with header PJ (1), while runtimes for the 20 node, distributed deployment are under the column with header PJ (20).

             3-Clique                     4-Clique
             Arabesque  PJ (1)  PJ (20)   Arabesque  PJ (1)  PJ (20)
CiteSeer     3.2s       0.04s   0.02s     3.6s       0.06s   0.02s
Mico         13.6s      27.0s   11.0s     1min       72min   21min
Patent       1.3min     17.3s   1.6s      2.2min     32.8s   8.3s
Youtube      6.5min     2.1min  12.7s     Crash      6.4min  1.4min
LiveJournal  8.9min     2.4min  11.2s     2.5hr+     1.8hr   41.3min

Table 3.8 compares the results of counting three- and four-vertex cliques, using Arabesque and our distributed system (labeled PJ - short for PruneJuice), on the same real-world graphs used for the evaluation of Arabesque in [108]. We run experiments on the same shared memory machine used (in the previous section) for the comparison with QFrag. Additionally, for PruneJuice, we present runtimes on 20 compute nodes.
(We attempted the Arabesque experiments on 20 nodes too; however, Arabesque would crash with an out of memory (OOM) error for the larger Patent, Youtube and LiveJournal graphs. Each compute node in our distributed testbed has only 128GB memory. Our multi-core shared memory testbed, however, has 1.5TB physical memory. Furthermore, for the Arabesque workloads that successfully completed on the 20 node deployment, we did not notice any speedup over the single node run.) Note that Arabesque users have to specify a purpose-built algorithm for counting cliques, whereas ours is a generic pattern matching solution, not optimized for counting cliques only. Furthermore, in addition to replicating the data graph, Arabesque also exploits HDFS storage for maintaining the algorithm state (i.e., intermediate matches). PruneJuice was able to count all the clique patterns in all graphs; it took a maximum time of 1.8 hours to count 4-Cliques in the LiveJournal graph on the single node, shared memory system. When using 20 nodes, for the same workload, the runtime came down to 41.3 minutes. Arabesque's performance degrades for larger graphs and search templates: Arabesque performs reasonably well for the 3-Clique pattern; for the larger graphs, PruneJuice is at most 3.7× faster. The 4-Clique pattern highlights the advantage of our system: for the Patent graph, PruneJuice is 4× faster on the shared memory platform. For the LiveJournal graph, Arabesque did not finish in 2.5 hours (we terminated processing). For the Youtube graph, Arabesque would crash after running for 45 minutes. PruneJuice, on the other hand, completed clique counting for both graphs.
For the smaller, yet highly skewed Mico graph, Arabesque outperforms PruneJuice: for the 4-Clique pattern, Arabesque completes clique counting in about one minute, whereas it takes PruneJuice 72 minutes on the same platform; this workload highlights the advantage of replicating the data graph for parallel processing, which presents the opportunity for harnessing load balancing techniques that are efficient and effective.

3.8 Lessons and Discussions

This section presents a brief summary of our experiences designing and developing the constraint checking based pattern matching solution, and the revelations and lessons we have learned from the experiment results presented in §3.7. We organize the key discussion points in a question-answer format:

(i) Does the original intuition behind the constraint checking approach hold in practice? The work we present is centered around the idea of pruning away the non-matching part of the graph (i.e., vertices and edges) as early as possible. In addition to this search space reduction, a vertex-centric formulation of this approach exists: not only does it harness fine-grained parallelism, but it also presents the opportunity to first identify matches at the vertex and edge level, which, in practice, is less expensive compared to the conventional techniques that rely on full match enumeration (note the less than a minute pruning time vs. 24 hours match enumeration time, even on the pruned graph, for WDC-3). Furthermore, decomposing a complex template into a set of constraints (i.e., substructures) that are less expensive to verify compared to searching the full template prevents potential combinatorial explosion of the algorithm state.
In §3.7, we have demonstrated the ability of our solution to identify matches (including match enumeration on the pruned graphs) in the largest publicly available real-world graph and a 4.4 trillion edge synthetic graph.

(ii) Following this approach, is it possible to design an exact matching solution that supports arbitrary patterns? Yes. The exact matching solution we present is generic; no assumptions about the background graph and the search template are made. The preliminary study (summarized in §3.6) scoped the problem to a restricted set of templates toward developing the first solution. Subsequently, we have developed techniques to alleviate these restrictions and presented a general solution that offers precision and recall guarantees for arbitrary templates. Our approach is novel in the sense that, in contrast to the conventional techniques that rely on full match enumeration to answer a pattern matching query, we first identify the set of vertices and edges in the background graph that match the template. (All the matches can be listed by operating on this solution set.) We have demonstrated that, following the constraint checking approach, it is possible to develop practical solutions that are scalable and less prone to combinatorial explosion of the algorithm state.

(iii) Does it scale with the dataset size and the platform size? Yes. We show the ability of our solution to find patterns, both frequent and rare, in the largest publicly available real-world graph and a 4.4 trillion edge synthetic graph, orders of magnitude larger than used by past contributions.
We demonstrate good strong scaling (although influenced by the graph topology and match distribution) and steady weak scaling on up to 1,024 nodes or 36K cores, the largest scale to date for similar problems.

(iv) Is the solution approach generic - can it be implemented within any general-purpose, distributed graph processing framework (other than HavoqGT) or on a different architecture? The constraint checking approach is implementation independent and can be implemented within any graph framework. The vertex-centric algorithms presented in §3.4 can be implemented within any graph processing framework (e.g., GraphLab [35] and Giraph [34]) that adopts a vertex-centric model. Although we presented HavoqGT-based asynchronous, distributed algorithms (to harness the low latency interconnects of HPC platforms), synchronous algorithms can be seamlessly supported. A compute cluster made of commodity hardware may not have an HPC-class fast network backbone; the BSP approach is often a better fit for commodity clusters, as BSP presents optimization opportunities for harnessing network bandwidth (in the presence of high latency interconnects).

(v) Does presenting the results as the ‘union of all matches’, rather than explicitly listing all the matches, serve rich analytics? We argue that often full match enumeration is not the most efficient avenue to support many high-level graph analysis scenarios. There are three important takeaways: First, while our match enumeration technique is able to enumerate an immense number of matches (see, for example, results for WDC-1, WDC-3 and WDC-5 with 450+ million matches, or even IMDB-1 with 800+ thousand matches), presenting results as pruned vertex/edge sets (with fewer than 10⁵ vertices) avoids potential combinatorial explosion and makes it feasible to carry out further analytics. Second, as Fig.
3.18 clearly shows, presenting the results as the ‘union of all matches’ (rather than explicit match enumeration) is not only more space efficient, but also, in some cases, easier to understand by a human analyst. Finally, we note that the key to supporting match enumeration is edge pruning: this reduces the edge density in the pruned WDC graph by a factor of 10–15× (compared to using vertex pruning alone).

Figure 3.18: Matches for WDC-2 in the background graph. The number of matches in each of the six connected components is also shown.

(vi) What are the key performance bottlenecks? Within the constraint checking pipeline, the non-local constraint verification steps are the bottleneck for the majority of the patterns (Fig. 3.8 and Fig. 3.11(a)). In the worst case, when the work aggregation mechanism is ineffective (due to the sparse distribution of matches in the background graph), non-local constraint checking can be as expensive as full match enumeration (on the pruned graph). We have identified other artifacts leading to load imbalance, which limits the performance of non-local constraint checking: (i) load imbalance due to the irregular distribution of the matches in the background graph, and (ii) load imbalance as a result of pruning - again, when the match distribution in the background graph is highly skewed and concentrated on a few compute nodes/graph partitions (investigated in §3.7.7).

(vii) In what scenarios is the presented solution most effective, and where is it not? The constraint checking approach is most effective when the graph and the search template present early pruning opportunities, especially through local constraint checking - each iteration of LCC has polynomial cost, O(n+m) (the same as that of the PageRank algorithm). Our technique holds a key advantage for searching templates that have a large diameter and edge count, yet are rare in the large background graph.
A conventional tree-search technique would exhaustively search all possible instances of a path consisting of all the edges in the template. It is possible that there is a combinatorial number of intermediate matches for the template (while the actual matches are very few). Our pruning-based technique is particularly powerful in detecting and eliminating such intermediate matches early (often without enumerating a single instance of the full template). The WDC-3 pattern (Fig. 3.7), presented in §3.7.2, is an example of the above described scenario. A pruning-based approach may introduce additional overhead when the user seeks full match enumeration and there is little to eliminate - most of the vertices and edges in the graph participate in matches. If the template has a large number of, potentially expensive, non-local constraints to check that do not yield pruning, constraint checking can become an overhead. The WDC-1 pattern (Fig. 3.7), presented in §3.7.2, represents this scenario: Fig. 3.12(a) shows that WDC-1 quickly reaches 100% precision; however, 99% of the execution time is spent verifying five non-local constraints (top left WDC-1 plot in Fig. 3.8) to guarantee no false positive matches. Table 3.4 shows that, for WDC-1, match enumeration on the pruned graph is much faster than pruning with precision guarantees. This also highlights the need for techniques that can make an informed decision for an early switch to full match enumeration.

(viii) Informed Decision Making. This work uses a number of ad-hoc heuristics, conceived based on our intuition and observations, that would benefit from informed decision making. Here, we briefly introduce these problems, while in §5.5 we discuss possible future work to address them.
Deciding (a) the order in which non-local constraints are checked and selecting the optimal set of constraints that maximizes performance (explored in §3.7.10 for a shared memory implementation), (b) when to trigger load balancing, and (c) when to stop pruning early and switch to match enumeration (when requested by the user).

Chapter 4

Edit-Distance Subgraph Matching in Distributed Graphs with Precision and Recall Guarantees

Over time, approximate matching has become the moniker for a rather large set of problems that have in common that they are not confined to the requirements of exact matching, where a bijective mapping between the vertices and edges in the template, and those in the matching subgraph, is sought. In the approximate case, the template and the match need only be similar according to some defined similarity metric. Multiple real-world usage scenarios justify the need for approximate matching (and their diversity is the root cause for the diversity seen in this problem area). Categories of such scenarios include:

(S1) Dealing with the computational intractability of exact matching - to reduce the asymptotic complexity of exact matching (as in the general case, this problem is not known to have a polynomial time solution) and improve time-to-solution [3, 28, 56]. We note that generally the performance gains are obtained by relaxing the quality of the solution over at least one of multiple axes (the diversity of these axes highlights the many overlapping meanings with which the term approximate has been used): on the one side, most solutions do not offer precision guarantees or recall guarantees, and often offer neither; on the other side, the matches offered are only similar to the search template entered by the user, and often the similarity level cannot be user controlled [3, 56].

(S2) Uncertainty regarding acquired data - the acquired data can be noisy, leading to a background graph that is different from the ground truth [19, 126].
In these cases, an effective approach is to first identify the (potentially similar) subgraphs that may be of interest and have to be further inspected; for example, genomics pipelines [79].

(S3) Exploratory search - a user may not be able to come up with a search template a priori [24]. In such scenarios, the user starts with an approximate idea of what (s)he may be searching for, and relies on the system's ability to identify 'close' variants of the template [5, 56, 126]. Multiple application areas (e.g., financial fraud detection and organized crime group identification) have usage scenarios that fall in this category. A recent example is the 2017 Pulitzer Prize-winning Panama Papers investigation, which exposed patterns of offshore tax structures used by the world's richest elites. These structures were uncovered from leaked financial documents, which were connected together creating a graph that was made publicly accessible and explored by journalists around the world [53].

(S4) Information extraction - such as extracting features for machine learning from a graph's topology: as most machine learning solutions use data in tabular form, they are unable to directly incorporate topological information from networked data. A number of recent efforts, however, address this problem [39, 57, 83]. The common theme here is extracting vertex level features for training a machine learning pipeline with the end goal of categorizing vertices. Past work used sampling techniques to collect neighborhood information [39, 83] and used that information as the vertex feature during training.
A complementary feature engineering strategy can be marking each vertex with the subgraphs (often closely related) it participates in; a potentially new avenue worth exploring for usage scenarios that (can) exploit pattern information in networked data.

Figure 4.1: Edit-distance based subgraph matching: a search template H0 (left), background graph G (center-left), and on the right, examples of edit-distance k matches/solution subgraphs for distance k = 1 and k = 2.

(S5) Other use cases - Aggarwal et al. [1, 56] present additional use cases for approximate matching that can be reduced to frequent subgraph mining [3, 103], motif counting [3, 103], graph alignment [79, 121] or link recommendation; beyond searching for the 'diamond' motif used by Twitter's Who To Follow service [40]. Utilizing higher-order graph structures (e.g., motifs) in graph analytics (such as ranking and clustering) increases the flexibility of these techniques [8].

Note that scenarios (S2), (S3), (S4) and (S5) may seek a solution that does not compromise accuracy - it retrieves all the subgraphs that are variants (with a bound on the acceptable difference) of the user-provided template. Unlike scenario (S1), in these usage scenarios one does not seek to improve algorithm complexity over exact matching (i.e., subgraph isomorphism).

4.1 Problem Overview and Design Opportunities

Our goal is to identify subgraphs in the background graph that are closely related to a user-provided template. We primarily target problem scenarios where the user needs full precision (i.e., there are no false positives in the returned match set) and full recall (i.e., all matching vertices and edges are identified).
Multiple use cases in categories (S2) – (S5) above benefit from these properties, as well as from a user-specified bound on the similarity between the user-provided search template and the matches returned by the system. Additionally, we target a situation where all vertices and edges of the background graph that participate in matches need to be identified - what we call the solution subgraph (§3.4). This scenario best matches the feature extraction scenario - item (S4) above. To this end, this work focuses on designing scalable solutions for identifying all user-defined edit-distance variants of user-provided labeled patterns and demonstrating the viability of the proposed approach for scenarios (S3) and (S4) above.

Edit-Distance for Graph Similarity Computation. We aim to identify subgraphs in the background graph that are similar to a given search template. We quantitatively estimate similarity through edit-distance [13]. The intuition for the semantics of the search is the following: the user is searching for a set of entities each belonging to some category (as we support matches for labeled graphs) and specifies a superset of the relationships between them. Fig. 4.1 presents an example. There exist several techniques for graph similarity computation; for example, Maximum Common Subgraph (MCS) [12] and Graph Kernels [52], as well as techniques based on capturing statistical significance [25].
Our choice of edit-distance as the similarity metric is motivated by three observations: (i) Edit-distance is a widely adopted similarity metric (related work in Chapter 2) that is easy for users to understand; (ii) The metric can be adapted to various use cases seen in practice (e.g., restricting edge-deletion by marking some edges as mandatory - demonstrated in §4.8.5; or extending the set of edit operations, e.g., with vertex label changes) with minimal changes to the supporting infrastructure; and (iii) It is feasible to support efficient similar subgraph search, while providing precision and recall guarantees [13], based on this similarity metric, as we show in the rest of this work.

With this distance metric, given a search template, computing the set of matches within distance k is equivalent to generating all search prototypes (a prototype is a version of the original search template within the edit-distance bound specified by the user) at distance 0, . . . , k (i.e., all derivations of the original search template, for example, after 0, . . . , k edge deletions) [126], and computing the union of the exact matches for each of them.

Constraint Checking for Edit-Distance Variant Subgraph Matching. In Chapter 3, we introduced the constraint checking approach to support (exact) pattern matching based analytics and demonstrated that an implementation of this technique can scale to graphs with trillions of edges on thousands of compute nodes. The constraint checking based solution decomposes the search template into a set of constraints, verifies if vertices and edges in the background graph violate these constraints, and iteratively eliminates them, eventually leading to the set of vertices and edges that is the union of all exact matches for the search pattern. The key intuition is that each vertex or edge participating in a match has to meet a set of constraints specified by the search template.
More precisely, we observe that the class of approximate matching queries we are interested in, i.e., edit-distance variant subgraph matching, can be equivalently stated as the problem of finding exact matches for all 0, . . . , k edit-distance prototypes of a given template [126]. This approach offers a stepping stone to build a similar subgraph mining solution and can be used in two directions: First, one can generate the constraints that all the vertices and edges participating in a match at distance k (i.e., an exact match with any prototype within distance k) must meet, and use these constraints to reduce the problem space. Second, one can decompose multiple, yet similar, prototypes into their composing constraints, check these constraints, and use the information to infer exact matches to specific prototypes; thus amortizing the cost of executing each constraint, as they are shared by multiple prototypes.

4.2 Solution Overview

We present an algorithmic pipeline for edit-distance variant subgraph matching following the constraint checking approach. Our solution retrieves matches for all the edit-distance k subgraphs of the user-provided template by solving the equivalent problem of finding exact matches for all 0, . . . , k edit-distance prototypes of the template. The solution embraces two key design mechanisms: (i) search space reduction through iterative pruning, and (ii) a technique to exploit relationships between prototypes at edit-distance one of each other to eliminate redundant constraint verification. Finally, we evaluate the feasibility of the proposed solution using a proof-of-concept implementation on top of HavoqGT [82].

In the context of this work, we restrict the possible edits to a single category: edge deletion. (Including edge addition as a possible graph edit is fairly straightforward, as the matches generated are included in the result of a search applied to a fully connected clique, which we demonstrate in §4.8.5.) We introduce a further restriction: deleting edges (from the user-provided search template) should maintain connectivity, thus each search prototype should be a weakly connected graph.

The Output, Precision and Recall.
Given a search template and an edit-distance k, our solution can produce: (i) The union of matches for each prototype at distance δ ≤ k, i.e., for each prototype, the complete set of vertices and edges that participate in at least one match, with guarantees of 100% precision and 100% recall in each set; (ii) The union of matches for all the k edit-distance prototypes; (iii) For all vertices in the background graph, a per-vertex vector indicating which prototype(s) the vertex is a match for (useful for (S4) above - to generate labels for a machine learning pipeline); and (iv) Full match enumeration - the list of matches for each prototype (when the query is selective enough for this output to be tractable). Assuming only the search template is presented, our system can also support an exploratory search mode (i.e., (S3) above): the system initially searches for exact matches to the search template; it then continues with distance one matches, and extends the search by incrementally increasing the edit-distance until a user-defined condition is met (e.g., the first x prototypes are matched, y matches are found, or the time budget expires).

4.3 Contribution Highlights and Chapter Organization

We present a distributed subgraph matching solution that identifies up to k edit-distance matches of a given template with 100% precision and 100% recall guarantees. This work capitalizes on the constraint checking approach for pattern matching (presented in Chapter 3) to harness its scalability advantages, to enable aggressive search space pruning, and to exploit opportunities to reuse results of constraint checking collected while previously searching prototypes, for eliminating redundant computations. The work presented in this chapter is based on the research published in [91]. The summary of the contributions of this chapter is the following.
Additional design details and evaluation results are available in [91].

(i) Solution Design (§4.6). We design a solution for a class of edit-distance based matching problems that can be generalized as identifying exact matches for prototypes up to edit-distance k from a given search template. We capitalize on the constraint checking approach for exact matching (Chapter 3), which decomposes the template into a set of constraints and iteratively eliminates vertices and edges in the background graph that do not meet the constraints. Our design exploits key relationships (§4.4) between prototypes to prune the search space and eliminate redundant work; enabled by looking at the search template as a set of constraints.

(ii) Optimized Distributed Implementation (§4.7). We offer a proof-of-concept implementation on top of HavoqGT [82], an open-source distributed graph framework enabling asynchronous processing. To this end, we have extended the constraint checking primitives provided by the exact matching solution (Chapter 3): our implementation provides infrastructure support for transferring results of constraint verification (and match enumeration) between prototypes at edit-distance one of each other - the key enabler for eliminating redundant work; load balancing - with the ability to relaunch computation on a pruned graph using a smaller deployment and to search prototypes in parallel; and producing key types of output, e.g., labeling vertices by prototype membership(s).

(iii) Proof of Feasibility at Scale (§4.8). We demonstrate the performance and utility of our solution by experimenting on datasets orders of magnitude larger than used by the prior work.
We show a strong scaling experiment using a real-world dataset, the largest openly available webgraph, whose undirected version has over 257 billion edges; and a weak scaling experiment using synthetic, R-MAT [15] generated graphs of up to 1.1 trillion edges, on up to 256 compute nodes (9,216 cores). We show support for patterns with arbitrary label distribution and topology, and with an edit-distance large enough to generate 1,000+ prototypes. To stress our system, we consider patterns containing the highest-frequency vertex labels (up to 9.5 billion instances).

(iv) Application Demonstration (§4.8.5). We demonstrate that our solution lends itself to efficient computation and pattern discovery in real-world pattern matching scenarios: we use two real-world metadata graphs that we have curated from publicly available datasets, Reddit (3.9 billion vertices, 14 billion edges) and the smaller International Movie Database (IMDb) (5 million vertices, 29 million edges), and show practical use cases of our technique to support rich pattern mining. Also, we show a use case of exploratory search: the system begins with a 6-Clique and extends the search by incrementally updating the edit-distance until the first match(es) are discovered; searching over 1,500 prototypes in the process.

(v) Impact of Design Choices and Optimizations (§4.8.4). We study the impact and trade-offs of a number of optimizations used. We demonstrate that the cumulative impact of these optimizations significantly improves the runtime over the naïve approach (§4.8.3), which searches each prototype in the background graph independently. Furthermore, we investigate a number of load balancing techniques and study their performance and efficiency implications.

(vi) Comparison with Existing Work (§4.8.6).
We empirically compare our work with a state-of-the-art system, Arabesque [108], and demonstrate the significant advantages that our system offers for handling large graphs and complex patterns.

4.4 Preliminaries

Our aim is to find structures similar to a small labeled search template graph, H0, within a very large labeled background graph, G. We describe important graph properties of G, H0, and the other graph objects we employ. Table 4.1 summarizes the notation used in this chapter (some of the notation in Table 3.1 is repeated here for the convenience of the reader).

A vertex-labeled graph G(V, E, L) is a collection of n vertices V = {0, ..., n−1} and m edges (i, j) ∈ E, where i, j ∈ V, and each vertex has a discrete label ℓ(i) ∈ L. We often omit L in writing G(V, E), as the label set is shared by all graph objects in a given calculation. Here, we assume G is simple (i.e., no self-edges or multiple edges), undirected ((i, j) ∈ E implies (j, i) ∈ E), and vertex-labeled, although the techniques we develop are easily generalized, including to edge-labeled graphs.

Table 4.1: Symbolic notation used.

Object(s)                                    Notation
background graph, vertices, edges            G(V, E)
background graph sizes                       n := |V|, m := |E|
background vertices                          V := {v0, v1, ..., vn−1}
background edges                             (vi, vj) ∈ E
maximum vertex degree in G                   dmax
average vertex degree in G                   davg
standard deviation of vertex degree in G     dstdev
label set                                    L = {0, 1, ..., |L| − 1}
vertex label of vi                           ℓ(vi) ∈ L
search template, vertices, edges             H0(W0, F0)
distance k prototype p                       Hk,p(Wk,p, Fk,p)
set of all distance k prototypes             Pk
solution subgraph w.r.t. Hk,p                G∗k,p(V∗k,p, E∗k,p)
set of non-local constraints for H0          K0

We discuss several graph objects simultaneously and use sub- and super-scripts to denote the associations of graph constituents: the background graph G(V, E), the search template H0(W0, F0) (vertices, W0, and edges, F0), as well as several low-edit-distance approximations to H0, which we call template prototypes.
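To make the notation concrete, the graph objects above can be sketched as minimal single-machine Python structures. This is an illustrative toy, not HavoqGT's distributed representation; all class and function names here are ours, not the dissertation's.

```python
from dataclasses import dataclass

# Toy stand-ins for the graph objects of Table 4.1.
@dataclass
class LabeledGraph:
    """Undirected, vertex-labeled graph G(V, E, L)."""
    labels: dict        # vertex id -> label; encodes both V and the map l(.)
    edges: frozenset    # frozenset of frozenset({i, j}) undirected edges

    @property
    def n(self):        # n := |V|
        return len(self.labels)

    @property
    def m(self):        # m := |E| (each undirected edge counted once)
        return len(self.edges)

def make_graph(labels, edge_list):
    return LabeledGraph(dict(labels), frozenset(frozenset(e) for e in edge_list))

# A tiny background graph G and search template H0
G = make_graph({0: 'a', 1: 'b', 2: 'c', 3: 'b'},
               [(0, 1), (1, 2), (2, 3), (3, 0), (1, 3)])
H0 = make_graph({0: 'a', 1: 'b', 2: 'c'}, [(0, 1), (1, 2), (2, 0)])

# A distance-1 prototype H_{1,p}: same vertex set W0, one template edge
# deleted (the result must stay connected, per the restriction in §4.2).
H1p = LabeledGraph(H0.labels, H0.edges - {frozenset({2, 0})})
assert H1p.m == H0.m - 1
```

Representing undirected edges as two-element frozensets makes the symmetry (i, j) ∈ E ⇔ (j, i) ∈ E hold by construction.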
Although edit-distance could be a very general (user-defined) metric, here we focus on edits consisting of edge removals/additions that maintain a subgraph of H0, to favor algorithmic simplicity.

Definition 5. (Prototypes within Edit-Distance k) The edge removal/addition edit-distance between two graphs G1, G2 with |V1| = |V2| (and all vertex label counts equivalent) is the minimum number of edge additions or removals one needs to perform to make G1 isomorphic to G2. For template H0(W0, F0), we define a template prototype Hδ,p(Wδ,p, Fδ,p) as a connected subgraph of H0 such that Wδ,p = W0 and Fδ,p ⊂ F0 that is edit-distance δ ≤ k from H0. The index p merely distinguishes multiple edit-distance δ prototypes. Note that H0,0 is the template H0 itself. Let Pk be the set of all connected prototypes within edit-distance k.

Definition 6. (Prototype Match Vectors Pk) We define G∗δ,p ⊂ G to be the solution subgraph associated with prototype Hδ,p: the subgraph containing all vertices and edges participating in one or more exact matches to Hδ,p. The membership of a vertex (or edge) to each G∗δ,p for all p ∈ Pk is stored in a length |Pk| binary vector that represents if, and how closely, the vertex (or edge) matches H0 within distance k.

The prototype match vectors represent a rich set of discrete features usable in a machine learning context; our techniques could also populate the vector with prototype participation rates, should a richer set of features be desired.

We list below the two key relationships between templates that are edit-distance one from each other that we leverage to form more efficient edit-distance subgraph match algorithms. Essentially, we can more efficiently compute the solution subgraph for a given prototype G∗δ,p from information gained while previously computing solution subgraphs of prototypes with higher- (or lower-) edit-distance.

Observation 1. Containment Rule.
Consider two prototypes Hδ,p, Hδ+1,p′ in Pk that are within edit-distance one of each other, i.e., Fδ,p = Fδ+1,p′ ∪ {(qip′, qjp′)}. Let E(ℓ(qip′), ℓ(qjp′)) be the set of all edges in E that are incident to labels ℓ(qip′) and ℓ(qjp′). We have V∗δ,p ⊂ V∗δ+1,p′ and E∗δ,p ⊂ E(ℓ(qip′), ℓ(qjp′)) ∪ E∗δ+1,p′. This implies

V∗δ,p ⊂ ⋂_{p′: Fδ,p = Fδ+1,p′ ∪ {(qip′, qjp′)}} V∗δ+1,p′

and

E∗δ,p ⊂ ⋂_{p′: Fδ,p = Fδ+1,p′ ∪ {(qip′, qjp′)}} ( E∗δ+1,p′ ∪ E(ℓ(qip′), ℓ(qjp′)) ).

Observation 2. Recycling Non-Local Constraints. We recycle information gained during non-local constraint checking in a 'top-down' manner: if a vertex/edge passes a non-local constraint check (which is relatively expensive) for Hδ,p1, then it will pass the check for Hδ+1,p2. Additionally, we recycle information in a 'bottom-up' manner: if a vertex/edge passes a non-local constraint check for Hδ+1,p2, then it will likely pass the check for Hδ,p1 (or Hδ+1,p3), and we postpone checking that constraint until the last verification phase for G∗δ,p1 (or G∗δ,p3). ('Lateral' recycling between k = δ prototypes is also possible but not explored in this work.)

Figure 4.2: Edit-distance k = 1 and k = 2 prototypes of an example template. There are 19 prototypes at distance k ≤ 2, where each one is a connected component.

4.5 Constraint Checking for Template Variant Subgraph Matching and Opportunities for Designing an Edit-Distance based Solution

Given a search template H0 and an edit-distance k, our primary goals are to: (i) identify the set of solution subgraphs for each prototype p ∈ Pk; and (ii) for each vertex v in the background graph G, populate a |Pk| length binary vector indicating in which prototypes from Pk, v participates in at least one match. The solution we present aims to address the key scalability and performance challenges in the design space.
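The per-vertex |Pk|-length binary vector of Definition 6 can be sketched as a simple bitset. This is an illustrative, non-distributed toy (in the actual system this state is partitioned with the graph); the class and method names are ours.

```python
# Toy sketch of the per-vertex prototype match vector rho(v) of Definition 6:
# one bit per prototype in P_k, set when the vertex participates in at least
# one exact match for that prototype.
class MatchVector:
    def __init__(self, num_prototypes):
        self.bits = 0              # packed binary vector of length |P_k|
        self.size = num_prototypes

    def mark(self, prototype_index):
        """Record that this vertex matches prototype `prototype_index`."""
        self.bits |= (1 << prototype_index)

    def matches(self, prototype_index):
        return bool(self.bits & (1 << prototype_index))

    def as_list(self):
        """Expand to an explicit 0/1 feature vector, e.g., for ML training."""
        return [(self.bits >> i) & 1 for i in range(self.size)]

# Example: a vertex that matches prototypes 0 and 3 out of |P_k| = 5
rho_v = MatchVector(5)
rho_v.mark(0)
rho_v.mark(3)
assert rho_v.as_list() == [1, 0, 0, 1, 0]
```

The expanded 0/1 vector is exactly the kind of discrete per-vertex feature that scenario (S4) feeds to a machine learning pipeline.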
This approach takes advantage of the Containment Rule (§4.4, Observation 1) and builds on past work on constraint checking, previously used only in the context of exact pattern matching [88, 89]. Our solution embraces two key design mechanisms: (i) search space reduction, and (ii) redundant work elimination. The infrastructure we present provides support for iterative search space pruning; reusing the information gained while previously searching prototypes, between prototypes at edit-distance one of each other - the key enabler for eliminating redundant work; load balancing; and producing key types of output; all while offering a compact memory representation of the graph topology and the additional state that needs to be maintained during the computation.

Figure 4.3: (Top) Local and non-local constraints of a template: a vertex in an exact match needs to (i) match the label of a corresponding vertex in the template, and (ii) have edges to vertices labeled as prescribed in the adjacency structure of this corresponding vertex in the template. Based on the search template H0, we generate the set of non-local constraints K0 that are to be verified - for the example template, there are two constraints: (1) a triangle and (2) a rectangle. (Bottom) Three examples that illustrate the need for non-local constraint checks, invalid structures that local constraint checking is not guaranteed to eliminate (see Fig. 3.2 for details).

This section is structured as follows: We first briefly present how constraint checking has been used in the past [89] to enable scalable exact pattern matching and highlight the key primitives used. We then explain how we capitalize on these primitives and summarize the challenges a distributed pipeline for mining edit-distance variants of a given pattern has to address.
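As a minimal illustration of the search space reduction enabled by the Containment Rule, the candidate vertex set for a distance-δ prototype can be bounded by intersecting the solution-vertex sets of its distance-(δ+1) derivatives. This is a toy sketch with fabricated sets, not the distributed implementation; the function name is ours.

```python
# Toy illustration of the Containment Rule (Observation 1, §4.4): vertices
# matching a distance-delta prototype p must lie inside the intersection of
# V*_{delta+1,p'} over every prototype p' obtained from p by deleting one edge.
def candidate_vertices(child_solutions):
    """Upper bound on V*_{delta,p}, given V*_{delta+1,p'} for each child p'."""
    out = None
    for verts in child_solutions:
        out = set(verts) if out is None else (out & set(verts))
    return out if out is not None else set()

# Pretend prototype p (distance delta) has two distance-(delta+1) children
# whose solution subgraphs contain these (fabricated) vertex sets:
children = [{1, 2, 3, 5, 8}, {2, 3, 5, 9}]
cands = candidate_vertices(children)
assert cands == {2, 3, 5}
# The exact search for p now runs on this reduced vertex set instead of all
# of G; the rule guarantees no matching vertex falls outside it (no recall
# loss), while precision still comes from checking p's own constraints.
```

This is why the pipeline can search lower-distance prototypes on the already pruned graphs of their higher-distance derivatives.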
In §4.7, we describe in detail each functional component of the distributed infrastructure we have built to this end.

Constraint Checking for Exact Pattern Matching. A template H0 can be interpreted as a set of constraints that the vertices and edges participating in a match must meet. In the vertex-centric formulation for constraint checking, a vertex must satisfy two types of constraints, local and non-local, to participate in a match. Fig. 4.3 (top) depicts the two types of constraints for an example template. Local constraints involve only the vertex and its one-hop neighborhood. A vertex in an exact match needs to have non-eliminated edges to non-eliminated vertices labeled as prescribed by the adjacency structure of its corresponding vertex in the search template. Non-local constraints are topological requirements beyond the immediate neighborhood of a vertex. These are required by some classes of templates (with cycles and/or repeated vertex labels) and require additional routines to check non-local properties and to guarantee that all non-matching vertices are eliminated. (Fig. 4.3 (bottom) illustrates the need for these additional checks with examples.) Past work [89] has identified three types of non-local constraints which can be verified and optimized independently: (i) cycle constraints (CC), (ii) path constraints (PC), and (iii) constraints that require template-driven search (TDS). For arbitrary templates, TDS constraints based on aggregating multiple paths/cycles enable further pruning and ensure no false positives (see Chapter 3, §3.4 for details).

An exact matching pipeline iterates over these constraints to eliminate all the vertices and edges that do not participate in any match and reduces the background graph to a subgraph which is the union of all matches: the complete set of all vertices and edges that participate in at least one match, with no false positives or false negatives. Alg. 8 is an overview of the main loop for exact matching (adapted from Alg.
1 to make it more relevant to the edit-distance based matching case). This procedure may also collect additional information to accelerate match counting or full match enumeration (see §4.7).

Search Space Reduction. The core design philosophy of this technique, to support exact matching, is search space reduction: constraint checking is used to prune away the non-matching part of the graph (i.e., vertices and edges) as early as possible; first using local constraint checking, then the relatively expensive non-local constraint checking. For the edit-distance subgraph search problem, we seek techniques that operate in a similar fashion. We observe that there exist natural dependencies among the matches for the prototypes that are at edit-distance one from each other (i.e., a one edge difference). Since the prototypes at distance k = δ + 1 are generated by removing one of the (permitted) edges from a k = δ prototype, the solution subgraph G∗δ,p for a prototype p at distance k = δ is a subset of the union of the solution subgraphs ⋃ G∗δ+1,q for the prototypes q derived by removing one edge from p. A precise definition of this relationship is presented in §4.4. In the context of edit-distance subgraph matching, this relationship presents the opportunity to use search space reduction techniques and infer matches at distance k = δ from the union of all matches at distance k = δ + 1 with guarantees of recall (no false negatives are introduced). This has two advantages: (i) the search for matches at level k = δ is performed on the reduced graph at level k = δ + 1, and (ii) the result of some constraints already verified at distance k = δ + 1 can be reused (without rechecking) at level k = δ, thus leading to work reuse (more on this next).

Redundant Work Elimination. Specifying the search template as a set of constraints enables us to identify properties that a supplied search template H0 and its prototypes Pk share, at a finer granularity.
Since it has all the desired edges, H0 represents the constraint superset. For example, in Fig. 4.2, the k = 1 prototypes (0) – (2) have the rectangle and prototypes (3) – (6) have the triangle, while H0 has both. According to the containment rule in §4.4, in this example, if a vertex participates in a match for H0, it must participate in at least one match for one of the δ ≤ k prototypes; such a vertex must meet the rectangle and/or triangle constraints. The key advantage of using our constraint verification based approach is that, for the same vertex in G, such non-local constraints can be verified only once (for either H0 or a prototype) and this information can be reused in later searches - eliminating a large amount of potentially redundant work (we demonstrate the advantage of this artifact in §4.8.4).

4.6 Designing an Edit-Distance Subgraph Matching Solution following the Constraint Checking Approach

This section provides a high-level overview of the proposed edit-distance based matching pipeline, while the distributed implementation details are in §4.7. Alg. 7 is the top-level procedure to search for matches within k edit-distance of a template H0. The system iterates over the set of prototypes Pk and, for each vertex v ∈ G, identifies the set of prototypes (ρ(vj) in Alg. 9) in which v participates in at least one match.

Prototype Generation. From the supplied template H0, prototypes in Pk are generated through recursive edge removal: distance k = δ + 1 prototypes are constructed from distance k = δ prototypes by removing one of the edges (while respecting the restriction that prototypes are connected graphs). If the template has mandatory edge requirements, then only the optional edges are subjected to removal. We also perform an isomorphism check to eliminate duplicates. For prototypes with non-local constraints, each constraint is assigned a unique identifier. If two
If twoprototypes have the same non-local constraint, they also inherit its unique identifier(used by non-local constraint checking to track and ignore redundant checks for thesame vertex in the background graph).MaximumCandidate Set Generation. The first step is generating themaximum104candidate set - the set of vertices and edges in the background graph that mayparticipate in a match. This procedure excludes vertices that do not have a chanceto participate in a (prototype) match: it excludes the vertices that do not have acorresponding label in the template, then, iteratively, excludes the vertices thatdo not have at least one neighbor as defined in the search template H0. (Ifthe template has mandatory edge requirements then the vertices without all themandatory neighbors are excluded.) As a key optimization to limit generatednetwork traffic in later steps, the procedure also excludes edges to eliminatedneighbors and to neighbors whose labels do not match the labels prescribed inthe adjacency structure of the search template (see §4.7). Prototype search beginson the max-candidate set.Algorithm 7 Identify up to k Edit-Distance Matches1: Input: background graph G(V,E), templateH0(W0,F0), edit-distance k2: Output: (i) per-vertex vector (of length |Pk |) indicating v’s prototype match(ρ); (ii) for each prototype p ∈ Pk , the solution subgraph G∗δ,p3: Algorithm:4: generate prototype set Pk fromH0(W0,F0)5: for each prototype p ∈ Pk , identify the non-local constraint set K06: δ = k7: G∗δ+1←MAX_CANDIDATE_SET(G,H0)8: do9: G∗t ← ∅10: for all p ∈ Pδ do . alternatively, prototypes can be searched in parallel11: G∗δ,p← SEARCH_PROTOTYPE(G∗δ+1,p)12: G∗t ← G∗t ∪G∗δ,p; output G∗δ,p . matches can be listed by enumeratingin G∗δ,p13: G∗δ+1← G∗t ; δ = δ−1 . distributed G∗δ+1 can be load rebalanced14: while δ ≥ 0Match Identification within Edit-Distance k. In Alg. 
7, we present a bottom-up search approach to introduce the general pipeline (while in the following section we also discuss optimization opportunities presented by a top-down approach). We reduce the problem of mining similar patterns to: (i) the problem of finding exact matches to prototypes up to edit-distance k (within the max-candidate set), and (ii) finding matches at distance k = δ from the union of solution subgraphs of prototypes at distance k = δ + 1, iteratively decreasing δ until δ = 0. The intuition for why this leads to a correct solution is based on the containment rule (§4.4): the union of matches for the k = δ prototypes is a subset of the union of matches for the k = δ + 1 prototypes (thus no false negatives), and on the previously established property (§3.4) that checking all the generated constraints produces a precise solution set (no false positives). Alg. 8 is the procedure to identify matches at the individual prototype level.

Algorithm 8 Search Routine for a Single Prototype
1: procedure SEARCH_PROTOTYPE(G∗δ+1, p)
2:   K0 ← non-local constraint set of p
3:   G∗δ ← LOCAL_CONSTRAINT_CHECKING(G∗δ+1, p)
4:   while K0 is not empty do
5:     pick and remove next constraint C0 from K0
6:     G∗δ ← NON_LOCAL_CONSTRAINT_CHECKING(G∗δ, p, C0)
7:     if any vertex in G∗δ has been eliminated or
8:        has one of its potential matches removed then
9:       G∗δ ← LOCAL_CONSTRAINT_CHECKING(G∗δ, p)
10:  return G∗δ    ▷ if a vertex vj ∈ G∗δ, then p is marked as a match in ρ(vj)

4.7 Asynchronous Algorithms and Distributed Implementation

This section presents the system implementation on top of the distributed graph framework, HavoqGT (introduced in Chapter 3, §3.5). Alg. 7 is the top-level procedure to identify matches within edit-distance k, while the routine in Alg. 8 identifies the vertices and edges that match a prototype. In a distributed setting, each process essentially runs an instance of Alg. 7 and Alg. 8 on the distributed graph topology data. Alg.
8 invokes the two core primitives that perform local and non-local constraint checking⁵. This section first presents these routines in the vertex-centric abstraction of HavoqGT and the key state maintained by each vertex and its initialization (Alg. 9), and then highlights the functionality needed to support edit-distance based matching (introduced in §4.6), and the various optimizations implemented.

⁵In §3.5, we presented the constraint checking routines for exact matching. Since the template variant subgraph matching solution extends these primitives to support the additional algorithmic requirements, we briefly describe the constraint checking routines with the modifications highlighted.

Algorithm 9 Vertex State and Initialization
1: set of possible matches in a prototype for vertex vj: ω(vj) ▷ prototype state
2: set of matching neighbors in a prototype for vertex vj: ω′(vj) ▷ prototype state
3: map of active edges of vertex vj: ε(vj) ← keys are initialized to adj∗(vj) ⊂ E∗; the value field is an 8-bit bitset, where individual bits indicate if the edge is active in the max-candidate set, and/or edit-distance, and/or in a prototype solution subgraph ▷ prototype state
4: set of non-local constraints vertex vj satisfies: κ(vj) ▷ global state
5: vector of prototype matches for vj: ρ(vj) ▷ global state

Local Constraint Checking (LCC) is implemented as an iterative process. Alg. 10 presents the high-level algorithm and the corresponding visit() callback. Each iteration initiates an asynchronous traversal by invoking the do_traversal() method and, as a result, each active vertex receives a visitor. In the triggered visit() callback, if the label of an active vertex vj in the graph is a match for the label of any vertex in the template, it creates visitors for all its active neighbors in ε(vj). When a vertex vj is visited, it verifies whether the sender vertex vs satisfies one of its own (i.e., vj's) local constraints.
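The iterative label and neighbor checks just described can be illustrated with a small single-process sketch. This is a deliberately simplified stand-in, not the HavoqGT implementation: it checks required neighbor-label sets only (ignoring multiplicity, edge elimination, and the asynchronous visitor machinery), and the toy graph, template, and function names are ours:

```python
# Simplified local constraint checking (LCC): iteratively remove vertices
# that cannot satisfy the template's label/adjacency requirements.
# Single-process toy stand-in for the distributed visitor-based version.

def local_constraint_checking(adj, label, template_adj, template_label):
    """adj: vertex -> set of neighbors; label: vertex -> label.
    template_adj / template_label: same maps, for the search template."""
    # Start from vertices whose label occurs in the template at all.
    active = {v for v in adj if label[v] in set(template_label.values())}
    changed = True
    while changed:  # iterate until a fixed point: nothing is eliminated
        changed = False
        for v in list(active):
            nbr_labels = {label[n] for n in adj[v] if n in active}
            ok = False
            for u, lu in template_label.items():
                if lu != label[v]:
                    continue
                # labels the template vertex u requires among its neighbors
                required = {template_label[w] for w in template_adj[u]}
                if required <= nbr_labels:
                    ok = True
                    break
            if not ok:
                active.discard(v)  # vertex eliminated
                changed = True
    return active

# Toy background graph: a labeled triangle a-b-c plus a dangling vertex d.
adj = {'a': {'b', 'c'}, 'b': {'a', 'c'}, 'c': {'a', 'b'}, 'd': {'c'}}
label = {'a': 'X', 'b': 'Y', 'c': 'Z', 'd': 'X'}
# Template: a path X - Y - Z (template vertices 0:X, 1:Y, 2:Z).
template_adj = {0: {1}, 1: {0, 2}, 2: {1}}
template_label = {0: 'X', 1: 'Y', 2: 'Z'}

# d has label X but no Y-labeled neighbor, so it is pruned away.
print(sorted(local_constraint_checking(adj, label, template_adj, template_label)))
# → ['a', 'b', 'c']
```

In the distributed version these membership checks are driven by messages from neighbors rather than by direct inspection of `adj`, but the fixed-point structure is the same.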
By the end of an iteration, if vj satisfies all the template constraints, i.e., it has neighbors with the required labels, it stays active for the next iteration. Edge elimination excludes two categories of edges: first, the edges to neighbors vi ∈ ε(vj) from which vj did not receive a message, and, second, the edges to neighbors whose labels do not match the labels prescribed in the adjacency structure of the corresponding template vertices in ω(vj). Iterations continue until no vertex or edge is marked inactive. (Implementation of the max-candidate set generation procedure, described in §4.6, is based on local constraint checking.)

Algorithm 10 Local Constraint Checking
1: procedure LOCAL_CONSTRAINT_CHECKING(G∗(V∗,E∗), p)
2: do
3:   do_traversal(); barrier
4:   for all vj ∈ V∗ do
5:     if vj does not meet local constraints of at least one vertex of p then
6:       remove vj from G∗ ▷ vertex eliminated
7:     else if a neighbor vi ∈ ε(vj) does not satisfy requirements of p then
8:       remove vi from ε(vj) ⊂ E∗ ▷ edge eliminated
9: while vertices or edges are eliminated
10: procedure VISIT(G∗(V∗,E∗), vq) ▷ vq - the distributed message queue
11: for all vi ∈ ε(vj) do ▷ vj is the vertex that is being visited
12:   vis ← LCC_VISITOR(vi, vj, ω(vj))
13:   vq.push(vis) ▷ triggers visit() for vi

Non-local Constraint Checking (NLCC) iterates over K0, the set of non-local constraints to be checked, and validates each C0 ∈ K0 one at a time. NLCC leverages a token passing approach. Alg. 11 presents the general solution (including the visit() callback) to verify a single constraint: tokens are initiated through an asynchronous traversal by invoking the do_traversal() method. Each active vertex vj ∈ G∗ that is a potential match for the template vertex at the head of a 'path' C0 broadcasts a token to all its active neighbors in ε(vj). When an active vertex vj receives a token, if all requirements are satisfied, vj sets itself as the forwarding vertex (vj is added to t), increments the hop count r, and broadcasts the token to all its active neighbors⁶. If any of the constraints are violated, vj drops the token. If r is equal to |C0| and vj is a match for the template vertex at the tail of C0, vj is marked as it meets requirements of C0 (Alg. 11, lines #14 and #15).

Algorithm 11 Non-local Constraint Checking
1: procedure NON_LOCAL_CONSTRAINT_CHECKING(G∗(V∗,E∗), p, C0)
2: do_traversal(); barrier
3: for all vj ∈ V∗ that initiated a token do
4:   if vj violates C0 then
5:     remove this match from ω(vj) and if ω(vj) = ∅, remove vj from G∗ ▷ vertex eliminated
6: visitor state: token - a tuple (t,r) where t is an ordered list of vertices that have forwarded the token and r is the hop-counter; t0 ∈ t is the token initiator
7: procedure VISIT(G∗(V∗,E∗), vq)
8: for all vi ∈ ε(vj) do ▷ vj is the vertex that is being visited
9:   if token = ∅ and vj matches the first entry in C0 and C0 ∉ κ(vj) then
10:    t.add(vj); r ← 1; token ← (t,r)
11:  else if token.r < |C0| and vj matches the token.r-th entry in C0 then
12:    token.t.add(vj); token.r ← token.r + 1
13:  else if token.r = |C0| and vj matches the token.r-th entry in C0 then
14:    κ(vj).insert(C0); return ▷ globally identify vj as it meets requirements of C0
15:  else return ▷ drop token
16:  vis ← NLCC_VISITOR(vi, token); vq.push(vis)

Caching the Result of NLCC - Enabler for Redundant Work Elimination. We reuse the result of constraint checking - a vertex in the background graph that satisfies a non-local constraint for a k = δ+1 prototype does not need to verify the same constraint in the subsequent δ ≤ k prototypes that share the constraint, and hence avoids redundant work (Alg. 11, line #9). This optimization is crucial for cyclic patterns that have dense and highly concentrated matches. We demonstrate the impact of this optimization in §4.8.4.

Search Space Pruning in the Bottom-Up Mode.
In §4.4 and §4.6, we established that, when performing a bottom-up search (i.e., starting from the furthest distance prototypes), matches at distance k = δ can be computed from the union of all matches (i.e., the solution subgraph) at distance k = δ+1. The distributed infrastructure implements this functionality in a simple, yet efficient manner (Alg. 7, lines #12 and #13): we use a hash-based data structure to store the distributed graph to allow fast modifications (i.e., vertex and edge deletion) as well as to aggregate the matches at distance k = δ+1 (on which k = δ distance matches are searched). Furthermore, this approach enables aggressive pruning - a potentially large number of vertices and edges that would never belong to a distance k = δ match are eliminated early, while searching δ < k matches.

⁶Here, various optimizations are possible, e.g., work aggregation and filtering neighbors based on provisional match information.

Load Balancing. Load imbalance issues are inherent to problems involving irregular data structures, such as graphs. For our pattern matching solution, load imbalance is caused by two artifacts: first, over the course of execution, our solution causes the workload to mutate (as we prune away vertices and edges), and, second, the nonuniform distribution of matches: the vertices and edges that participate in the matches may often reside in a small, potentially concentrated, part of the background graph. (§3.7.7 presents a detailed analysis.)

To address these issues, we can rebalance/reload a pruned, max-candidate set or intermediate graph (G∗δ+1), before searching the k = δ prototypes (Alg. 7, line #13). We checkpoint the current state of execution, reshuffle vertex-to-processor assignment to evenly distribute vertices and edges across processing cores, and reload only the set of active vertices and edges that participate in at least one of the k = δ+1 prototypes.
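The bottom-up driver of Alg. 7 - search every prototype at the current distance, then feed the union of their solution subgraphs to the next (smaller) distance - can be sketched as follows. This is a minimal single-process sketch under toy assumptions: "graphs" are plain vertex sets, and `search_prototype` is a stand-in for the local plus non-local constraint-checking routines:

```python
# Bottom-up edit-distance search driver (simplified): matches at distance
# delta are searched only within the union of solution subgraphs found at
# distance delta+1 (the containment rule), so the search space shrinks
# monotonically as delta decreases.

def bottom_up_search(candidate_set, prototypes_by_distance, search_prototype, k):
    """prototypes_by_distance: {delta: [prototype, ...]} for delta in 0..k.
    Returns {prototype: solution set} over all prototypes searched."""
    results = {}
    current = candidate_set            # max-candidate set seeds distance k
    for delta in range(k, -1, -1):     # k, k-1, ..., 0
        union = set()                  # union of delta-level solution subgraphs
        for p in prototypes_by_distance[delta]:
            matched = search_prototype(current, p)
            results[p] = matched
            union |= matched
        current = union                # prune: next level searches this union
    return results

# Toy instance: a "prototype" is just a set of allowed vertices, and a
# vertex matches the prototype if it belongs to that set.
def search_prototype(graph_vertices, allowed):
    return graph_vertices & allowed

protos = {1: [frozenset('abcd'), frozenset('abce')],   # delta = 1
          0: [frozenset('abc')]}                        # delta = 0
out = bottom_up_search(set('abcdef'), protos, search_prototype, k=1)
print(sorted(out[frozenset('abc')]))  # → ['a', 'b', 'c']
```

In the real system, `current` is the distributed, hash-stored pruned graph, and the hand-off between levels is exactly the point at which the checkpoint/reshuffle/reload step described above can be applied.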
We can also reload on fewer processors, and reshuffle vertex-to-processor assignment to evenly distribute vertices and edges across processing cores. k = δ prototype searches are then resumed on the rebalanced distributed graph. The effectiveness of reshuffling is evaluated in Fig. 4.10, while the effectiveness of reloading on a smaller processor set is evaluated in Table 4.4.

Multi-level Parallelism. The implementation offers multiple levels of parallelism: in addition to vertex level parallelism (i.e., vertices check constraints in parallel), the infrastructure also enables searching prototypes in parallel (Alg. 7, line #10): prototypes at distance k = δ can be searched in parallel (while respecting the containment rule) by replicating the max-candidate set (or the distance k = δ+1 pruned graph) on multiple (potentially smaller) deployments. Fig. 4.9 and Table 4.4 evaluate the impact of this design artifact.

Match Enumeration and Counting Optimization. Given the containment rule, and since a k = δ+1 prototype is a direct descendant of a k = δ prototype (see §4.4), edit-distance based matching presents the opportunity for reusing the results of k = δ+1 match enumeration for identifying k = δ matches: a k = δ prototype match can be identified from the already computed k = δ+1 matches by extending the resulting matches by one edge (instead of repeating the search for all edges in the prototype - evaluated in §4.8.4).

Top-Down Search Mode. Alg. 7 presents the bottom-up search approach that identifies k = δ matches within the results at k = δ+1. An alternative is to perform the search in a top-down manner: the system initially searches for exact matches to the full template and extends the search by increasing the edit-distance by one, until a user-defined condition is met. Our implementation also supports this search mode, with small additions to Alg. 7. For brevity, we omit the details, yet §4.8.5 evaluates a use case of this search mode.

Metadata Store, Termination and Output.
The implementation uses the distributed metadata store developed earlier (§3.5). Alg. 7 terminates when all the prototypes within k distance have been searched, or even earlier when no active vertex is left in the background graph. The system can produce different output as listed in §4.2; output is produced in a distributed manner - each MPI process independently writes the locally accumulated results to individual files.

4.8 Evaluation
We evaluate the performance and utility of our solution by experimenting on massive graph datasets: we present strong (§4.8.2) and weak (§4.8.1) scaling experiments on massive real-world and synthetic graphs, respectively. We demonstrate the ability of our system to support patterns with arbitrary topology and scale to 1,000+ prototypes. We evaluate the effectiveness of our design choices and optimizations (§4.8.4). We highlight the use of our system in the context of realistic data analytics scenarios (§4.8.5). Finally, we directly compare our solution with a recent work, Arabesque [108] (§4.8.6), as well as with the naïve approach (§4.8.3).

Testbed. We use the same Quartz cluster at the Lawrence Livermore National Laboratory used for evaluation in §3.7. We run one MPI process per core (i.e., 36 per node).

Table 4.2: Properties of the datasets used for evaluation: number of vertices and directed edges, maximum, average and standard deviation of vertex degree, and the graph size in the compact CSR-like representation used (including vertex metadata).

Dataset                    Type       |V|    2|E|   dmax  davg  dstdev  Size
Web Data Commons [94]      Real       3.5B   257B   95M   72.3  3.6K    2.7TB
Reddit [87]                Real       3.9B   14B    19M   3.7   483.3   460GB
IMDb [55]                  Real       5M     29M    552K  5.8   342.6   581MB
CiteSeer [108]             Real       3.3K   9.4K   99    3.6   3.4     741KB
Mico [108]                 Real       100K   2.2M   1.4K  22    37.1    36MB
Patent [108]               Real       2.7M   28M    789   10.2  10.8    480MB
YouTube [108]              Real       4.6M   88M    2.5K  19.2  21.7    1.4GB
LiveJournal [4]            Real       4.8M   69M    20K   17    36      1.2GB
R-MAT up to Scale 35 [15]  Synthetic  34.4B  1.1T   222M  32    3.5K    17TB

Datasets.
Table 4.2 summarizes the main characteristics of the datasets used for evaluation, and shows their storage requirements. For all graphs, we created undirected versions. In Chapter 3, §3.7, we introduced the Web Data Commons (WDC), Reddit (RDT), IMDb and R-MAT datasets, and how we have generated vertex labels. We use the smaller CiteSeer, Mico, Patent, YouTube and LiveJournal real-world graphs primarily to compare against published results in [108] and [56].

Search Templates. To stress our system, we (i) use templates based on patterns naturally occurring in the background graphs (rather than synthetically introduced); (ii) experiment with both rare and frequent patterns; (iii) explore search scenarios that lead to generating 100+ and 1,000+ prototypes (WDC-3 and WDC-4 in Fig. 4.5); (iv) include search template vertex labels that are among the most frequent in the respective graphs; and (v) similar to Arabesque [108], use unlabeled patterns for counting motifs (§4.8.6).

Experimental Methodology. The performance metric for all experiments is the time-to-solution for searching all the prototypes of a template - for each matching vertex, identify the list of prototypes it participates in (and, in some cases, full match enumeration). The time spent transitioning and resuming computation on an intermediate pruned graph and load balancing are included in the reported time (where applicable). All runtime numbers provided are averages over 10 runs. For weak scaling experiments, we do not present scaling numbers for a single node as this experiment does not involve network communication and benefits from data locality. For strong scaling experiments, the smallest experiment uses 64 nodes, as this is the lowest number of nodes that can load the WDC graph topology and vertex metadata in memory. Unless explicitly mentioned, we verify all the constraints required to guarantee zero false positives. We label our technique as
We label our technique asHGT wherever necessary.4.8.1 Weak Scaling ExperimentsTo evaluate the ability to process massive graphs, we use weak scaling experimentsand the synthetic R-MAT graphs up to Scale 35 (∼1.1T edges), and up to 256 nodes(9,216 cores). Fig. 4.4 shows the search template, RMAT-1, and the runtimes.RMAT-1 has up to k = 2 (before getting disconnected), generating a total of 24prototypes; 16 of which at k = 2. On average, ∼70% of the time is spent in theactual search, while remaining 30% in infrastructure management - switching from11201002003004005006002, 284, 298, 3016, 3132, 3264, 33128, 34256, 35Time (s)#Compute nodes, R-MAT scale2516 8743RMAT-1k=2, #p 24Figure 4.4: Runtime for weak scaling experiments (left) for the the RMAT-1 pattern (right) - ithas 24 prototypes within distance k = 2. The X-axis labels present the R-MAT scale (top)and the node count used for the experiment (bottom). A flat line indicates perfect weakscaling. The labels used are the most frequent in the R-MAT graphs and cover ∼45% ofall the vertices in the graph. For RMAT-1, the furthest edit-distance searched (k) and totalprototype count (#p) are also shown.k = δ+1 to k = δ pruned graph and load balancing. In spite of the random natureof R-MAT generation, we see mostly consistent scaling in runtime, except for theScale 33 graph, for which the RMAT-1 pattern happens to be very rare, whichexplains the faster search time. (Scale 33 has 17.5M matches compared to 64M inScale 32 and 73M in Scale 34, 3.7× and 4.2× fewer, than in the respective graphs.This is partly because the vertex label ‘8’ is very rare in Scale 33 - less than 1% ofthe vertices have this label. For Scale 34 and 35, for example, the ratio is 3.2% and2.3%, respectively.)To evaluate individual contribution of prototype search, in Fig. 4.8, runtime isbroken down at the individual prototype level. The results are for Scale 34 on 128nodes. 
The chart also shows the number of matches for each prototype, a total of over 73M matches. The number of matches varies among the prototypes by orders of magnitude. k = 0 and k = 1 prototypes have slightly longer runtime as they require non-local constraint checking.

Figure 4.5: WDC patterns using top/second-level domain names as labels. The labels selected are among the most frequent, covering ∼21% of the vertices in the WDC graph: org covers ∼220M vertices, the 2nd most frequent after com; ac is the least frequent, still covering ∼4.4M vertices. For each pattern, the furthest edit-distance searched (k) and total prototype count (#p) are also shown: WDC-1 (k=2, #p 24), WDC-2 (k=2, #p 20), WDC-3 (k=4, #p 152), and WDC-4 (k=4, #p 1,941).

Figure 4.6: Runtime for strong scaling experiments (to label vertices and edges by the prototype(s) they match), broken down by edit-distance level, for the WDC-1, 2 and 3 patterns (Fig. 4.5). Max-candidate set generation time (C) and infrastructure management overhead (S) are also shown. The top row of X-axis labels represents the number of compute nodes. Speedup over the 64 node configuration is shown on top of each stacked bar plot (the workload does not fit in memory on a smaller number of nodes). To observe natural scalability, for WDC-1 and WDC-2, we do not load balance the intermediate pruned graphs. Since we relaunch processing on smaller eight node deployments and search prototypes in parallel, for WDC-3, load balancing is implicit.

4.8.2 Strong Scaling Experiments
Fig. 4.6 shows the runtimes for strong scaling experiments when using the real-world WDC graph on up to 256 nodes (9,216 cores).
Intuitively, pattern matching on the WDC graph is harder than on the R-MAT graphs as the WDC graph is denser, has a highly skewed degree distribution, and, importantly, the high-frequency labels in the search templates also belong to vertices with a high neighbor degree. We use the patterns WDC-1, 2 and 3 in Fig. 4.5. To stress the system, we have chosen search templates that generate tens to hundreds of prototypes (WDC-3 has 100+, up to k = 4, prototypes). These patterns have complex topology, e.g., multiple cycles sharing edges, and rely on expensive non-local constraint checking to guarantee no false positives. In Fig. 4.6, time-to-solution is broken down by edit-distance level. It also shows time spent in pruning to obtain the max-candidate set, and infrastructure management, separately. We see moderate scaling for both WDC-1 and WDC-2, up to 2.7× and 2×, respectively, with the furthest k prototypes scaling a lot better as they are mostly acyclic. Since WDC-3 has 100+ prototypes, we leverage the opportunity to search multiple prototypes in parallel: given that the max-candidate set is much smaller than the original background graph (∼138M vertices), we replicate it on smaller eight node deployments (this involves repartitioning the pruned graph and load balancing) and search multiple prototypes in parallel. For example, WDC-3 has 61 k = 3 prototypes; on 64 nodes they can be searched in eight batches (each batch running eight parallel search instances). We observe 2.4× speedup on 256 nodes.

4.8.3 Comparison with the Naïve Approach
We study the performance advantage of our solution over a naïve approach which generates all prototypes and searches them independently in the background graph. Fig. 4.7 compares time-to-solution of our technique with the naïve approach for various patterns and graphs.
(The reported time for HGT includes time spent in search and infrastructure management.)

To further explain performance, we study the runs for the RMAT-1 (Scale 34) and WDC-3 patterns at finer detail (both on 128 nodes). Fig. 4.8 shows runtime for RMAT-1, broken down to the prototype level: on average, individual prototype search is 6× faster in the HGT solution. However, the max-candidate set generation and load balancing of the pruned graph(s) have additional overhead, which is ∼30% of the total time in this case. The max-candidate set for the Scale 34 graph has ∼2.7B vertices and ∼5.9B edges, and ∼10% of the total time is spent in load balancing this intermediate graph, hence the resulting 3.8× speedup over the naïve approach (Fig. 4.7). There are 73.6M matches at distance k = 2 and no match at k < 2.

Figure 4.7: Runtime comparison between the naïve approach and HGT for various patterns and graphs. Speedup over the naïve approach is shown on top of the respective bars. For better visibility, we limit the Y-axis and show the Y-axis label (larger than the axis bound) for WDC-4, for the naïve case. RMAT-1, IMDB-1 and 4-Motif (on the Youtube graph) also include time for explicit match counting. For the rest, we report time to identify the union of all matches with precision and recall guarantees.

In Fig. 4.9, runtime for WDC-3 (which has 100+ prototypes) is broken down per edit-distance level. The figure shows how various optimizations improve search performance over the naïve approach, visible here for distance k = 2 and k = 3 prototypes. The max-candidate set for this pattern is smaller relative to RMAT-1 (Scale 34), yet still large, 138M vertices and 1B edges; therefore, the infrastructure management overhead is lower as well. The furthest edit-distance (k = 4) pruned graph also has only 15M vertices.
Infrastructure management and load balancing account for less than 1% of the total time. For the most optimized case, when prototypes are searched in parallel, each on an eight node deployment, the total gain in runtime is ∼3.4× over the naïve solution.

Figure 4.8: Runtime per prototype for RMAT-1 (Scale 34 on 128 nodes). The top X-axis label k_p indicates a prototype p at edit-distance k. The bottom X-axis labels are the number of matches (full match enumeration) in each prototype. There are a total of 73.6M matches at distance k = 2 and no match at k < 2. The chart compares performance of two scenarios: naïve and HGT. On average, individual prototype search is 6× faster in HGT. However, infrastructure management and load balancing account for ∼30% of the total time, which yields a 3.8× net speedup over the naïve approach (Fig. 4.7).

4.8.4 Impact of Optimizations
In this section, we study how the design choices and the optimizations that our prototype implementation incorporates impact search performance:

Redundant Work Elimination. One key optimization our solution incorporates is work recycling - reuse of the result of constraint checking to avoid redundant checks (details in §4.6 and §4.7). This optimization is crucial for cyclic patterns that have dense and highly concentrated matches in the background graph. This alone offers 2× speedup for WDC-3 (Fig. 4.9, notice the improvement for k = 2 and k = 1 in scenario Y) and 1.5× speedup for IMDB-1 (Fig. 4.7), over the respective naïve runs. The gain is due to a reduction in the number of messages that are communicated during NLCC; for WDC-3 the reduction is 3.5× and for IMDB-1 it is 6.7×.

Constraint and Prototype Ordering.
We use a simple heuristic to improve the performance of non-local constraint checking: each 'walk' is orchestrated so that vertices with lower frequency labels are visited early (comparable to the degree-based ordering used in many triangle counting solutions [107]). We explore a second optimization opportunity with respect to work ordering - when searching prototypes in parallel, the performance is maximized when the runs for the most expensive prototypes are overlapped.

Figure 4.9: Runtime broken down by edit-distance level for WDC-3 (on 128 nodes). X-axis labels: k is the edit-distance, pk is the set of prototypes at distance k, V∗k is the set of vertices that match any prototype in pk, and V∗p is the set of vertices that match a specific prototype p ∈ pk. The bottom two rows on the X-axis show: (first row) the size of all matching vertex sets (V∗k) at distance k (i.e., the number of vertices that match at least one prototype), and (bottom row) the total number of vertex/prototype labels generated at distance k. Performance of four scenarios is compared: (i) the naïve approach (§4.8.3); (ii) X - the bottom-up technique where search begins using the furthest edit-distance prototypes and consecutive searches exploit an already pruned graph; (iii) Y - the bottom-up technique including redundant work elimination, i.e., reusing results of non-local constraint checking (§4.8.4); and (iv) Z - the bottom-up technique with load balancing and relaunching processing on a smaller eight node deployment, enabling parallel prototype search (§4.7).

Table 4.3 summarizes the impact of the two optimization strategies discussed here. For prototype ordering, we manually reorder the prototypes (for maximum overlap of expensive searches) based on the knowledge from a previous run (in Fig. 4.6 and Fig.
4.9); thus, the table shows an upper bound for the performance gain obtainable with heuristics that aim to project prototype cost.

Table 4.3: Impact of ordering vertex labels in the increasing order of frequency for non-local constraint checking (top); impact of intuitive prototype ordering when searching them in parallel (middle); and impact of our match enumeration optimizations for edit-distance based matching (bottom).

Ordering Constraints based on Label Frequency (WDC-2)
  Random: 2.4hr    Ordered: 2.4min   Speedup: 62×
Intelligent Prototype Ordering (WDC-3)
  Random: 1hr      Ordered: 19.7min  Speedup: 3.1×
Optimized Match Enumeration (4-Motif, Youtube)
  Naïve: 2.3hr     HGT: 34min        Speedup: 3.9×

Optimized Match Enumeration. We evaluate the advantage of the match enumeration/counting optimization for edit-distance based matching presented in §4.7. The 4-Motif pattern (6 prototypes) has 200B+ instances in the unlabeled Youtube graph. When the optimized match enumeration technique is employed, we observe ∼3.9× speedup (Table 4.3).

Load Balancing. We examine the impact of load balancing (presented in §4.7) by analyzing the runs for the WDC-1, 2 and 3 patterns (as real-world workloads are more likely to lead to imbalance than synthetically generated load). Fig. 4.10 compares the performance of our system with and without load balancing. For these examples, we perform workload rebalancing once, after pruning the background graph to the max-candidate set, which, for the WDC-1, 2 and 3 patterns, has 33M, 22M and 138M vertices, respectively (2-3 orders of magnitude smaller than the original WDC graph). Rebalancing improves time-to-solution by 3.8× for WDC-1, 2× for WDC-2 and 1.3× for WDC-3. Load balancing the pruned intermediate graph takes, for example, ∼22 seconds for WDC-3 (Fig. 4.6).

Reloading on a Smaller Deployment and Parallel Prototype Search.
Once the max-candidate set has been computed, the implementation can take advantage of the fact that this is smaller than the original background graph - the implementation supports reloading the max-candidate set on one (or more) smaller deployment(s), and possibly running multiple prototype searches in parallel. We explore the performance space for two optimization criteria: (i) minimizing time-to-solution - all nodes continue to be used but different prototypes may be searched in parallel, each on smaller deployments; (ii) minimizing the total CPU Hours [118] - a smaller set of nodes continues to be used, and prototypes are searched one at a time. (Note that these are different optimization points, as the first may generate inefficiencies due to load imbalance and limited parallelism.)

Figure 4.10: Impact of load balancing on runtime for the WDC-1, 2 and 3 patterns (Fig. 4.5). We compare two cases: without load balancing (NLB) and with load balancing through reshuffling on the same number of compute nodes (LB). Speedup achieved by LB over NLB is shown on the top of each bar.

Table 4.4: Evaluation of load balancing/reloading on a smaller deployment along two axes: performance and efficiency. (Top rows) Runtime for searching prototypes in parallel given a node budget (128 nodes in this example). Speedup for parallel prototype search (on a smaller deployment) over searching each prototype using 128 nodes is also shown. (Bottom rows) CPU Hours consumed by different deployment sizes for the same workload. The last row shows the CPU Hour overhead for each deployment size with respect to the two node deployment.

#Compute Nodes             128    8     4      2
Parallel Prototype Search
Time (min)                 124    60    12     15
Speedup over 128 Nodes     N/A    2.1×  10.3×  8.3×
Sequential Prototype Search
CPU Hour                   9,531  588   204    192
Overhead w.r.t. 2 Nodes    50×    3×    1.1×   N/A

Table 4.4 presents the results for the WDC-3 pattern (Fig.
4.5): it compares the CPU Hour overhead with respect to running on a two node deployment. It also lists the runtime for searching prototypes in parallel given a node budget (128 nodes in this example) - a smaller deployment typically offers more parallelism. Here, the processing rate on the two node deployment is too slow to yield a notable advantage over the four node deployment.

Figure 4.11: The Reddit and IMDb templates (details in §4.8.5): for RDT-1 and IMDB-1, optional edges are shown in red, broken lines, while mandatory edges are in solid black.

4.8.5 Example Use Cases
To show how our approach can support complex data analytics scenarios, we present three use cases: (i) a query that attempts to uncover suspicious activity in the Reddit dataset, and uses optional and mandatory edges; (ii) a search on the IMDb dataset with mandatory edge requirements; and (iii) an example of a top-down exploratory search using the WDC-4 pattern. Fig. 4.5 and Fig. 4.11 present the corresponding search templates, and Fig. 4.7 shows the runtimes.

Social Network Analysis. Today's user experience on social media platforms is tainted by the existence of malicious actors such as bots, trolls, and spammers. This highlights the importance of detecting unusual activity patterns that may indicate potential malicious attacks. The RDT-1 query: identify users with an adversarial poster-commenter relationship. Each author (A) makes at least two posts or two comments, respectively. Comments to posts with more upvotes (P+) have a balance of negative votes (C-), and comments to posts with more downvotes (P-) have a positive balance (C+). The posts must be under different subreddits (S), a category for posts. Furthermore, the user is interested in the scenarios where both the posts and/or the comments were not necessarily by the same author.
In other words, a valid match can be missing an author-post or an author-comment edge (Fig. 4.11). The query has a total of five prototypes and over 708K matches (including 24K precise).

Information Mining. IMDB-1 represents the following query: find all the actress, actor, director, 2× movies tuples, where at least one individual has the same role in two different movies between 2012 and 2017, and both movies fall under the genre Sport. The query has a total of seven prototypes and 303K matches (including 78K precise).

Exploratory Search. We present an exploratory search scenario: the user starts from an undirected 6-Clique (WDC-4 in Fig. 4.5) and the search (query) is progressively relaxed until matches are found: edges are deleted from the search template until the first matches are found. For this search template, the first matches are found at k = 4 after sifting through 1,941 prototypes.

Figure 4.12: Runtime broken down by edit-distance level for WDC-4 (on 128 nodes). X-axis labels: k is the edit-distance, pk is the set of prototypes at distance k, and V∗k is the set of vertices that match any prototype in pk. The bottom two rows on the X-axis show: (first row) the size of all matching vertex sets (V∗k) at distance k (i.e., the number of vertices that match at least one prototype), and (second row) the average search time per prototype at each edit-distance. We also show the number of vertices in the max-candidate set (X-axis label 'C'), yet no match is found until distance k = 4. Here, the Y-axis is on a log scale.

Fig.
4.12 shows the runtime broken down per edit-distance level, and the number of matching vertices: no match is found until k = 4, where only 144 vertices participate in matches.

4.8.6 Comparison with State-of-the-Art Systems

We empirically compare our work with Arabesque [108], the state-of-the-art system enabling distributed subgraph mining in large-scale graphs. Additionally, we present a discussion where we indirectly compare with ASAP [56], a sampling-based approximate matching solution.

Comparison with Arabesque. Arabesque is a framework offering precision and recall guarantees, implemented on top of Apache Spark [104] and Giraph [34]. Arabesque provides an API based on the Think Like an Embedding (TLE) paradigm to express graph mining algorithms, and a BSP implementation of the embedding search engine. Arabesque replicates the input graph on all worker nodes; hence, the largest graph scale it can support is limited by the size of the memory of a single node. As Teixeira et al. [108] showed Arabesque's superiority over two other systems, G-Tries [92] and GRAMI [26], we indirectly compare with these two systems as well.

For the comparison, we use the problem of counting network motifs in an unlabeled graph (an implementation is available with the Arabesque release). Network motifs are connected patterns of vertex-induced embeddings that are non-isomorphic. For example, three vertices can form two possible motifs - a chain and a triangle - while up to six motifs are possible for four vertices. Our edit-distance based matching approach lends itself to solving the motif counting problem: from the maximal-edge motif (e.g., a 4-Clique for four vertices), through recursive edge removal, we generate the remaining motifs (i.e., the prototypes in our vocabulary). Then we use our system to search and count matches for all the prototypes.
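This prototype generation step - deriving all smaller motifs from the maximal-edge motif through recursive edge removal - can be sketched in a few lines of Python. The sketch below is purely illustrative (the function names are ours, not part of the dissertation's MPI/HavoqGT implementation), and uses a brute-force isomorphism check that is only feasible for small motif sizes:

```python
# Illustrative sketch (not the dissertation's code): generate the template
# prototypes for motif counting by recursive edge removal from the
# maximal-edge motif (a clique), keeping only the connected, pairwise
# non-isomorphic graphs.
import itertools

def is_connected(n, edges):
    # Undirected connectivity check via DFS over an adjacency map.
    adj = {v: set() for v in range(n)}
    for e in edges:
        u, v = tuple(e)
        adj[u].add(v)
        adj[v].add(u)
    seen, stack = {0}, [0]
    while stack:
        for w in adj[stack.pop()] - seen:
            seen.add(w)
            stack.append(w)
    return len(seen) == n

def isomorphic(n, g1, g2):
    # Brute-force isomorphism test over all vertex permutations;
    # acceptable only because motif sizes are tiny (3-4 vertices).
    if len(g1) != len(g2):
        return False
    return any({frozenset((p[u], p[v])) for u, v in map(tuple, g1)} == g2
               for p in itertools.permutations(range(n)))

def motifs(n):
    # Start from the n-clique and recursively delete one edge at a time.
    clique = frozenset(frozenset(e)
                       for e in itertools.combinations(range(n), 2))
    found, frontier = [clique], [clique]
    while frontier:
        nxt = []
        for g in frontier:
            for e in g:
                h = g - {e}  # one edge-deletion "edit"
                if is_connected(n, h) and \
                        not any(isomorphic(n, h, m) for m in found):
                    found.append(h)
                    nxt.append(h)
        frontier = nxt
    return found

print(len(motifs(3)), len(motifs(4)))  # 2 6 (chain and triangle; six 4-vertex motifs)
```

Each graph returned by `motifs(n)` corresponds to one prototype that the system then searches for and counts exactly.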
The following table compares the results of counting three- and four-vertex motifs using Arabesque and our system (labeled HGT), on the same real-world graphs used for the evaluation of Arabesque in [108]. (We use 20 compute nodes as in [108].) Note that Arabesque users have to specify a purpose-built algorithm for counting motifs, whereas ours is a generic pattern matching solution, not optimized to count motifs only.

             3-Motif              4-Motif
             Arabesque   HGT      Arabesque   HGT
CiteSeer     9.2s        0.02s    11.8s       0.03s
Mico         34.0s       11.0s    3.4hr       57min
Patent       2.9min      1.6s     3.3hr       2.3min
Youtube      40min       12.7s    7hr+        34min
LiveJournal  11min       10.3s    Crash       1.3hr

Our system was able to count all the motifs in all graphs; it took a maximum time of 1.3 hours to count four-vertex motifs in the LiveJournal graph. Arabesque's performance degrades for larger graphs and search templates: it was only able to count all the motifs in the small CiteSeer (∼10K edges) graph in <1 hour. For 4-Motif in LiveJournal, after running for about 60 minutes, Arabesque crashes with an out-of-memory (OOM) error. (We have been in contact with the Arabesque authors to make sure we best employ their system. Also, the observed performance is comparable with the Arabesque runtimes recently reported in [56] and [119].)

Comparing with a Sampling-based Technique – ASAP. One key feature that distinguishes our work from the relatively common sampling-based matching techniques [10, 25, 56, 113, 126] is that we guarantee 100% precision and 100% recall. As a result, our solution has the same asymptotic complexity as exact matching. The goal of sampling-based pattern matching has predominantly been to improve runtime performance by analyzing only a part of the background graph, for example by sampling the edges of the background graph, thus offering degraded precision and/or recall [25, 56, 126]. One key limitation of most of these contributions is the difficulty of gauging the quality of the result, as Iyer et al. [56] argue. Iyer et al.
[56] provide, however, a technique to bound the estimation error for motif counting. To compare with this technique, since we were unable to obtain a copy of ASAP's implementation to run it on our machines, we repeat the experiment described in [56], attempt to build a comparable platform by using the same number of nodes and memory (ASAP uses 16 nodes with 61GB memory; we (HGT) use 8 nodes with 128GB memory), and compare our results with the runtimes reported in [56]. Note that the ASAP results have 5% error with 95% confidence, while the HGT results are precise. The following table lists the results.

             3-Motif           4-Motif
             ASAP     HGT      ASAP     HGT
LiveJournal  11.5s    25.9s    41.6s    2.2hr

Although the authors showed that ASAP outperforms Arabesque, one can argue that the precision is worth the extra cost when local counts of rare motifs are desired. Sampling techniques are indeed able to approximate global counts, or local counts at heavy hitters (vertices where many motifs are present), with high relative accuracy. However, they are not able to achieve high local relative accuracy when counts are low integers. As an extreme example, consider looking for a single triangle in an otherwise bipartite graph. Until the singular triangle is detected, there is 100% relative error at the three vertices involved in the triangle.

4.9 Lessons and Discussions

In this section, we reflect on the findings of the experimental evaluations and the lessons learned. We organize the key discussion points in a question-answer format:

(i) Does the intuition behind our design approach hold in practice? More specifically, is constraint checking an effective solution approach for the target edit-distance based matching problem?

Yes.
First, looking at the search template as a set of constraints presents the opportunity to check the existence of the shared substructures (i.e., the constraints that are common to a subset of the template prototypes) in the background graph only once (as opposed to repeating the checks for each prototype), eliminating potentially redundant computations.

Second, the byproduct of constraint checking is the solution sets of vertices and edges (i.e., the union of all matches) that participate in at least one exact match. Retrieving the solution set as a union of matches, rather than explicit matches, enabled us to establish the containment rule (§4.4), which facilitates search space pruning: in the bottom-up search approach, the matches for a k = δ prototype are a subset of the union of matches of the k = δ+1 prototypes, and the k = δ prototype searches can be performed within the reduced set, i.e., the union of the k = δ+1 prototype matches. In §4.8.3, we demonstrated the cumulative impact of these design artifacts.

(ii) Does the solution meet the requirements of the target application scenarios?

Yes. Our goal was to develop a generic solution for problem scenarios that demand precision and recall guarantees. The presented work scopes edit operations to edge deletions; however, it does not make any assumptions about the background graph or the search template. It allows a user-specified bound on the similarity between the search template (i.e., a maximum edit-distance, including the mandatory edge constraint) and the similar matches returned by the system.

Consequently, the implementation produces the key output sets, all with 100% precision and 100% recall guarantees: (a) per-vertex prototype membership, (b) the per-prototype solution subgraph (i.e., the set of matching vertices and edges), and (c) the solution subgraph at each edit-distance level.
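The interplay between the containment rule and the per-prototype solution sets above can be captured by a toy bottom-up driver. The sketch below is purely illustrative (graphs are modeled as plain vertex sets, and `exact_match` stands in for the distributed constraint checking routine; none of these names come from the actual system):

```python
# Toy illustration of the containment rule (not the distributed
# implementation): level k searches are confined to the union of the
# level k+1 solution sets, so the searched graph only shrinks.

def bottom_up_search(background, prototypes_at, k_max, exact_match):
    """Search prototypes level by level, from k_max down to 0.

    exact_match(vertices, prototype) must return the prototype's
    solution set (the union of all vertices participating in an exact
    match), computed only within `vertices`.
    """
    solutions = {}
    reduced = set(background)  # level k_max searches the full graph
    for k in range(k_max, -1, -1):
        solutions[k] = {p: exact_match(reduced, p) for p in prototypes_at[k]}
        # Containment rule: matches of the k-1 prototypes lie inside
        # the union of the level-k solution sets.
        reduced = set().union(*solutions[k].values())
    return solutions

# Hypothetical per-prototype solution sets standing in for the results
# of constraint checking on a 4-vertex background graph.
oracle = {'P1a': {1, 2}, 'P1b': {2, 3}, 'P0': {2}}
sols = bottom_up_search(background={1, 2, 3, 4},
                        prototypes_at={0: ['P0'], 1: ['P1a', 'P1b']},
                        k_max=1,
                        exact_match=lambda vs, p: oracle[p] & vs)
print(sols[0]['P0'])  # {2} - found within the reduced set {1, 2, 3}
```

In the real pipeline the reduction operates over the vertices and edges of a distributed graph, and the results of constraints already verified at level k+1 are additionally reused at level k.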
Furthermore, our implementation incorporates optimizations (that exploit opportunities presented by the containment rule) to count the total number of matches or list all the matches for each template prototype (§4.7).

A potential high-impact application of our solution lies in the area of machine learning, especially feature generation for Representation Learning [42]. Recently, machine learning has been used to support classification tasks on networked data; notable contributions include DeepWalk [83], node2vec [39], and Graph Convolutional Networks (GCN) [57]. A popular classification task is to learn a model that predicts the category of the unlabeled vertices. In addition to vertex attributes, another set of features that can be used is the set of topological patterns the vertices participate in. More generally, the goal is to use vertex-level features as a signal for a machine learning pipeline to categorize vertices. Our solution can be used to bulk-label the background graph exactly with such features - our solution produces exactly what is needed to this end: a per-vertex vector that indicates whether the vertex participates in a match with each specific possible approximation of a search template, bounded by a user-provided edit-distance.

(iii) Is this a scalable solution?

Yes. One of our primary goals was to design a constraint checking based solution to harness its demonstrated scaling properties (§3.7.1 and §3.7.2). Since we reduce the approximate matching problem to finding exact matches of each template variation, we can exploit the 'scalable' primitives that we have developed earlier (§3.5).

We show the ability of our solution to find patterns, both frequent and rare, in the largest publicly available real-world graph and a 1.1 trillion edge synthetic graph.
We demonstrate good strong scaling and steady weak scaling on up to 256 nodes or 9K cores, the largest scale to date for similar problems. Also, we show support for an edit-distance large enough to generate 1,000+ prototypes.

(iv) In what scenarios is this technique most effective, and where is it not?

Since the solution is based on the constraint checking approach, it directly inherits the strengths and limitations of the constraint checking approach for exact matching (discussed in §3.8 and §5.4).

(v) Decision problems.

In addition to the decision problems discussed in §3.8, the approximate matching solution introduces additional problems that would benefit from system support to make informed decisions: (a) when to trigger load balancing, i.e., reshuffle the vertex-to-processor assignment, so that, despite the overhead of load balancing, performance is maximized; (b) in what circumstances switching to parallel prototype search on a smaller deployment makes sense, given that a smaller deployment offers better locality but reduced parallelism and less memory; (c) for a given template and edit-distance query, which search mode should be selected, bottom-up or top-down; more precisely, which search mode would maximize performance and/or efficiency, for instance, minimize redundant constraint checks.

Chapter 5

Summary and Future Work

Pattern matching is a fundamental graph algorithm and has the ability to answer complex graph queries. However, the superpolynomial nature of the exact matching problem limits practical applications of pattern matching to relatively small graph datasets and simple patterns. The research presented in this dissertation explores avenues to offer solutions that address the scalability limitations of the pattern matching problem in practice.
We make research contributions to both categories of pattern matching problems, i.e., exact and approximate matching, and through evaluation we demonstrate that our contributions advance the state-of-the-art of pattern matching with respect to the ability to accommodate large graph datasets and to scale with the distributed platform size.

In the remainder of this chapter, we first summarize our research contributions and their impact. Then, we highlight the limitations of the current solutions and discuss possible improvements and future extensions.

5.1 Graph Pruning via Constraint Checking – A Technique for Scalable Exact Pattern Matching in Metadata Graphs

This work presents a new algorithmic pipeline to support pattern matching in large-scale metadata graphs using distributed memory systems. To this end, we propose the idea of graph pruning via constraint checking: this technique decomposes the search template into a set of constraints, verifies whether vertices/edges in the background graph violate these constraints, and iteratively eliminates them, eventually leading to the set of vertices and edges that is the union of all exact matches for the pattern. We present asynchronous algorithms that use both vertex and edge elimination to iteratively prune the original graph and reduce it to a subgraph which represents the union of all matches. We have developed pruning techniques that guarantee a solution with 100% precision (i.e., no false positives in the pruned graph) and 100% recall (i.e., all vertices and edges participating in matches are included) for arbitrary search templates. Our algorithms are vertex-centric and asynchronous; thus, they map well onto existing high-performance graph frameworks.

Evaluation using up to 257 billion edge real-world and up to 4.4 trillion edge synthetic R-MAT graphs, on up to 1,024 nodes (36,864 cores), confirms the scalability of our solution.
We demonstrate that, depending on the search template, our approach prunes the graph by orders of magnitude, which enables match enumeration and counting on graphs with trillions of edges. Our success stems from a number of key design ingredients: asynchronicity, aggressive vertex and edge elimination while harnessing massive parallelism, intelligent work aggregation to ensure low message overhead, effective pruning constraints, and lightweight per-vertex state.

5.1.1 Impact

In addition to the research contributions summarized in §3.2, this work has the following impact:

• First, this work pioneered the idea of reinterpreting a search template as a set of constraints that can be used to iteratively prune the background graph to an exact solution set. Constraints are substructures that compose the full template. Constraints can be verified independently on a per-vertex basis (as opposed to per-match), which presents the opportunity for ample parallelism. The advantages of this approach are twofold: search space pruning and low generated traffic (in a distributed setting) when constructing the solution set, preventing a potential combinatorial explosion of the algorithm state. We have demonstrated the largest scale to date, both in terms of graph size and platform size, when solving similar problems.

• Second, to the best of our knowledge, this is the first work to propose a pattern matching pipeline that first identifies what we call the solution subgraph - the set of vertices and edges that participate in exact matches. As discussed in §3.1, contrary to the traditional practice of relying on full match enumeration to answer all exact matching queries, identifying the solution subgraph (through iterative pruning) is sufficient to answer some categories of subgraph queries (with 100% precision and 100% recall guarantees).
• Third, this work has resulted in an open-source software artifact, PruneJuice: an MPI-based distributed pattern matching system implemented on top of HavoqGT, which can identify exact matches for arbitrary templates in large background graphs with metadata associated with vertices.

5.2 Edit-Distance Subgraph Matching in Distributed Graphs with Precision and Recall Guarantees

We present an efficient distributed algorithmic pipeline to identify matches within k edit-distance of a user-provided template in large-scale metadata graphs. Our solution approach is based on the observation that this problem can be generalized as identifying exact matches for up to k edit-distance variations of the given search template. We capitalize on the constraint checking approach for exact matching, which decomposes the template into a set of constraints and iteratively eliminates vertices and edges in the background graph that do not meet the constraints. Our design exploits key relationships between template variations (that are at one edit-distance of each other) to prune the search space and eliminate redundant work; this is enabled by looking at the search template as a set of constraints.

We implement the proposed solution on top of HavoqGT and demonstrate scalability using up to 257 billion edge real-world and up to 1.1 trillion edge synthetic R-MAT graphs, on up to 256 nodes (9,216 cores). We demonstrate that our solution lends itself to efficient computation and edit-distance variant subgraph discovery in practical scenarios, and comfortably outperforms the best known work when solving the same search problem.

5.2.1 Impact

In addition to the research contributions summarized in §4.3, this work has the following impact:

• First, this work demonstrates the utility of the constraint checking approach beyond exact matching: this technique offers opportunities for design optimizations for a class of edit-distance based matching problems.
We have shown that, within this design space, a solution can exploit relationships between closely related template variations to prune the search space, as well as reuse the information gained from an earlier search to eliminate redundant constraint verification.

• Second, we show that the constraint checking approach accommodates the well-established graph similarity metric, the graph edit-distance. Although edit-distance can identify exact matches within the user-specified bound on similarity, in the general case edit-distance computation is not polynomial and is therefore expensive for large datasets. Interestingly, the constraint checking approach offers unique optimization opportunities for edit-distance based graph similarity computations: first, the search for matches at level k = δ is performed on the reduced graph at level k = δ+1; and second, the result of various constraints already verified at distance k = δ+1 can be reused (without rechecking) at level k = δ, thus leading to work reuse. To the best of our knowledge, we are the first to realize these optimizations for graph edit-distance computation.

• Third, this work has resulted in an open-source software artifact: a distributed edit-distance based pattern matching system implemented on top of HavoqGT, which supports arbitrary templates and can identify all approximate matches within k edit-distance of a user-specified template in large background graphs with metadata associated with vertices.

5.3 Threats to Validity

While we have demonstrated a significant advantage of our implementations of the pattern matching solutions over the competing systems, especially in the presence of larger graphs, complex patterns, and dense matches (§3.7.11 and §4.8.6), we acknowledge that, beyond the algorithmic techniques used, the implementation software stack has an impact on the observed performance.
For example, both Arabesque [108] and QFrag [100] are based on Apache Spark [104], which embraces a vastly different systems software stack compared to MPI, which PruneJuice is based on. For the same algorithmic technique, the influence of the implementation software substrate can be noticeable in the observed performance.

We, however, have reasonable confidence that these comparative studies do highlight the advantage of our solutions, as we made an effort to make fair comparisons: (i) we compared the different systems by running them on the same platform (hardware, operating system, and deployment size) and using identical graph datasets and queries; (ii) we directly contacted the authors of Arabesque and QFrag to ensure we best employ their systems; (iii) we worked closely with the HPC system administrators at the Livermore Computing facility to deploy Spark and HDFS on our testbed, using the configurations most suitable for the platform; (iv) we compiled all the systems from source on our testbed; (v) the observed performance for Arabesque and QFrag is comparable with the runtimes reported in the original publications [100, 108], as well as the Arabesque runtimes recently reported in [56] and [119].

5.4 Limitations

In this section, we discuss the limitations of the work presented in this dissertation. Earlier, in §3.8 and §4.9, we highlighted the scenarios where our solutions are most effective and where they are not, in the exact and approximate matching contexts, respectively. Here, we present additional discussion points; we begin by categorizing the limitations of our proposed solutions based on their respective sources.

• Limitations stemming from major design decisions. Our pipeline inherits the limitations of systems that perform exact matching (compared to systems that trade result accuracy for performance, e.g., based on sampling [56] or graph simulation [28]).
Similarly, our system inherits all the limitations of its communication and middleware infrastructure, MPI and HavoqGT, respectively. One example is the lack of a sophisticated flow control mechanism in these infrastructures, which sometimes leads to message buildup and system collapse.

• Limitations stemming from the targeted use cases. In the same vein, we note that our system targets graph analytics scenarios (queries that need to cover the entire graph), rather than the traditional graph database queries that attempt to find a specific pattern around a vertex indicated by the user (where other systems may perform equally well).

• Limitations stemming from attempting to design a generic system. Systems optimized for specific patterns may perform better (e.g., systems optimized to count/enumerate triangles [107] or treelets [128], or systems relying on multi-join indices [105] to support patterns with limited diameter).

• Limitations stemming from incomplete understanding. While we propose heuristics that appear to work well in our experiments, one of the key challenges is making informed decisions regarding load balancing, constraint ordering and selection, and search mode selection (bottom-up or top-down in the edit-distance subgraph matching case). We believe graph statistics at different stages of execution can be used to dynamically make effective decisions to address the above problems. In §3.7.10, we explored opportunities for optimal constraint ordering and selection in a shared memory setting. However, the net gain of using this technique in the distributed setting is yet to be determined, given the overhead associated with collecting distributed graph statistics and the difficulties associated with developing an effective model for a highly dynamic system.

• Limitations stemming from irregular artifacts.
In addition to the irregular graph topology, which in a distributed setting often leads to workload imbalance, we have identified other artifacts that limit the performance of the proposed solution. HavoqGT's delegate partitioned graph distributes the edges of high-degree vertices across multiple compute nodes - crucial to obtaining scalability for graphs with skewed degree distributions. However, this load balancing technique does not address an important artifact specific to pattern matching: the distribution of the template matches in the background graph can be highly irregular; the vertices and edges that participate in the matches may often reside in a small, potentially concentrated, part of the distributed graph. For the non-local constraint checking and match enumeration routines, the impact of this artifact is the most adverse: they can cause unpredictably imbalanced traffic and overwhelm a fraction of the compute nodes, introduce stragglers, and, in the worst case, lead to system collapse.

5.5 Future Research Directions

While we believe the work presented in this dissertation represents a significant advance in practical pattern matching in large, real-world graphs, further investigations in a number of areas can improve the efficiency and robustness of the presented solutions. This section discusses future research directions to address some of the limitations of the current work, and outlines possible extensions and future work in that context.

• Informed Decision Making. The graph pruning pipeline introduces a number of decision problems. At present it uses ad-hoc heuristics, developed based on our intuition. We believe modelling approaches similar to the one presented in §3.7.10 can be used to inform the following decisions:

Decide when to trigger load rebalancing.
We have evidence that load imbalance can be an issue, and that load rebalancing is effective (§3.7.7 and §4.8.4). Our decision to trigger rebalancing, however, is backward looking: in our experiments, we load balanced the current pruned graph before verifying the series of template-driven search constraints. For the edit-distance based approximate matching solution, we typically load balance the max-candidate set. We postulate that estimating the cost of running future constraints can lead to better load balancing decisions (including relaunching processing on a smaller deployment and searching prototypes in parallel in the edit-distance based matching).

Decide whether pruning is useful at all. For all the scenarios we have experimented with so far, either direct search (i.e., relying on enumeration to begin with) failed, or the pruning-based pipeline was faster than a solution based on direct search. However, one can construct scenarios where direct search is sufficient (small unlabeled patterns with abundant and concentrated matches in the background graph are a possible example). A model similar to the one in §3.7.10 can be used to inform this decision.

Decide when to switch from pruning to direct enumeration. If full match enumeration is the end goal (and not just the union of all matches), then ceasing pruning early and switching to enumeration can yield greater benefits. Our current distributed solution is unable to make this decision and, unless manually configured not to do so, it always prunes down to a precise solution before enumeration. We have demonstrated that using the modeling approach described in §3.7.10 (for a shared memory implementation) to inform the switch from pruning to enumeration offers noticeable gains.

• Further Optimizations for Non-local Constraint Checking.
At present, the distributed design of non-local constraint checking relies on the work aggregation mechanism to limit the number of messages that are communicated. The effectiveness of work aggregation, however, is intertwined with the underlying graph topology: it is most effective when there are common vertices among the paths traversed by a token, and not effective when paths do not intersect. The non-local constraint checking algorithms require that only a single path that conforms with the requirements of the constraint being verified is found. The rest of the paths successfully identified by the NLCC routine are essentially redundant (a concomitant of parallel processing). Two possible optimizations beyond work aggregation are worth exploring: (i) instead of broadcasting a token to all its neighbors, a vertex could speculatively select a reduced set of neighbors to relay the token to; the speculation can be informed by the selectivity metric discussed earlier (§3.7.10). (ii) Once a copy of a token has been returned to the originating vertex and the vertex has been identified to satisfy the constraint, the 'in flight' tokens from the same originator can be discarded. (Our shared memory implementation, in §3.7.10, does include this optimization: once a vertex is identified to satisfy the constraint of interest, the parallel instance of the recursive token passing routine is terminated immediately. In a distributed message passing setting, however, an efficient implementation of this optimization is far more challenging.)

• Improving Load Balancing. In §3.7.7, we discussed the load imbalance issues relevant to our pattern matching solution and our load balancing techniques to address them. Beyond reshuffling the vertex-to-processor assignment (to evenly distribute vertices and edges across processing cores), additional design optimizations are possible to improve load balancing.
When the matches are concentrated within a limited number of graph partitions, these partitions send/receive a larger portion of the message traffic. In HavoqGT, a partition (i.e., an MPI process) processes its local message queue sequentially - message traffic targeting popular vertices can overwhelm the respective partitions. Consequently, these bottlenecked partitions become the key performance limiter. For the non-local constraint checking and match enumeration routines, the impact of this artifact is the most adverse.

HavoqGT's delegate partitioned graph distributes the edges of a high-degree vertex across multiple compute nodes. (Similar techniques are used by other frameworks to improve load balancing [35, 106].) This technique replicates a high-degree vertex on multiple compute nodes, thus distributing its processing across multiple nodes. Although the current implementation of this technique has proven effective for low complexity algorithms such as breadth-first search, where the generated message traffic is typically bounded by the number of edges in the background graph, for our pattern matching problem the generated message traffic can grow exponentially, and the overhead of distributed state synchronization among vertex replicas becomes significant; hence, the current implementation of our pattern matching solution is unable to harness distributed processing at the vertex granularity. Further design explorations are required toward offering infrastructure support that would enable our pattern matching solution to efficiently scale out processing at the individual vertex level.

• Expanding the Set of Queries that are Supported. The current prototype implementation can be easily extended to support a richer set of subgraph matching scenarios. Here, we list two important categories not supported in the current implementation:

Support for templates with edge metadata.
In this work, we have considered graphs and templates with metadata associated with vertices only. However, real-world graphs also have edge metadata. The property graph model, commonly used by graph databases, uses edge metadata to contextualize relationships between typed entities (i.e., vertices). Within the constraint checking/graph pruning model, support for edge metadata matching naturally demands infrastructure that enables edge elimination and, therefore, can be built on top of the current infrastructure. Additionally, we think this feature would create the opportunity for temporal analysis to harness the sophistication of pattern matching.

Extending the set of edit operations. We have demonstrated the edit-distance variant subgraph matching system in a limited context: edit operations are restricted to edge deletion. Also, we presented one special use case where the user can indicate 'mandatory' edges (§4.8.5). Supporting edge addition is trivial (as it only further restricts the match set). Extending the set of edit operations to support wildcard labels on vertices or edges [25], or edge 'flip' actions (i.e., swapping edges while keeping the total number of edges constant), fits our pipeline's design and requires small updates.

• API for Implementing Pattern Mining Algorithms. A future endeavor is developing an end-user API to facilitate the implementation of pattern mining algorithms, such as finding motifs, mining frequent subgraphs, or counting cliques. A system targeting a fixed pattern (or a class of patterns) presents the opportunity to apply pattern-specific optimizations not available in a generic pattern matching system. In fact, a system optimized for searching a fixed pattern has real-world interest: a recent work describes a recommendation system at Twitter that searches 'diamond' motifs for its operations [40]. Arabesque [108], ASAP [56] and RStream [119] are examples of recent projects that provide high-level APIs for implementing graph mining algorithms.
Our goal is to provide a set of primitives that can be used to express a graph mining algorithm. These primitives are the basic building blocks of a pattern search procedure and are implemented within our MPI runtime system, as graph operations or constraint checking routines.

• Support for Dynamic Graphs. Recently, high-performance graph analysis on large-scale dynamic graphs has received a lot of attention from the research community [43, 99, 117]. There exist several examples of real-world temporal graphs (i.e., graphs that change over time, e.g., the Wikipedia reference graph [73], the time-evolving web graph of the uk domain [11], and the Twitter mention graph [95]) that encode information about an event, e.g., edge creation and/or deletion, that mutates the graph (typically in the form of edge metadata). Graph evolution can be monitored by referring to the edge creation and/or deletion timestamps associated with each edge.

In the future, we are interested in enabling support for dynamic graphs. Over the course of execution, our solution mutates the workload (it prunes away non-matching vertices and edges of the background graph); the constraint checking pipeline natively operates on an evolving graph. This makes us optimistic about the potential of the constraint checking model to accommodate dynamic graphs.

Bibliography

[1] C. C. Aggarwal and H. Wang, editors. Managing and Mining Graph Data, volume 40 of Advances in Database Systems. Springer, 2010. ISBN 978-1-4419-6044-3. doi:10.1007/978-1-4419-6045-0. URL https://doi.org/10.1007/978-1-4419-6045-0.

[2] A. V. Aho and J. E. Hopcroft. The Design and Analysis of Computer Algorithms. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1st edition, 1974. ISBN 0201000296.

[3] N. Alon, P. Dao, I. Hajirasouliha, F. Hormozdiari, and S. C. Sahinalp. Biomolecular network motif counting and discovery by color coding. Bioinformatics, 24(13):i241–i249, July 2008. ISSN 1367-4803. doi:10.1093/bioinformatics/btn163.
URLhttp://dx.doi.org/10.1093/bioinformatics/btn163.[4] L. Backstrom, D. Huttenlocher, J. Kleinberg, and X. Lan. Group formationin large social networks: Membership, growth, and evolution. InProceedings of the 12th ACM SIGKDD International Conference onKnowledge Discovery and Data Mining, KDD ’06, pages 44–54, NewYork, NY, USA, 2006. ACM. ISBN 1-59593-339-5.doi:10.1145/1150402.1150412. URLhttp://doi.acm.org/10.1145/1150402.1150412.[5] Y. Bai, H. Ding, S. Bian, T. Chen, Y. Sun, and W. Wang. Simgnn: A neuralnetwork approach to fast graph similarity computation. In Proceedings ofthe Twelfth ACM International Conference on Web Search and DataMining, WSDM ’19, pages 384–392, New York, NY, USA, 2019. ACM.ISBN 978-1-4503-5940-5. doi:10.1145/3289600.3290967. URLhttp://doi.acm.org/10.1145/3289600.3290967.[6] Beevolve. Beevolve twitter study, 2015. URLhttp://www.beevolve.com/twitter-statistics.139[7] N. Bell and M. Garland. Implementing sparse matrix-vector multiplicationon throughput-oriented processors. In Proceedings of the Conference onHigh Performance Computing Networking, Storage and Analysis, SC ’09,pages 18:1–18:11, New York, NY, USA, 2009. ACM. ISBN978-1-60558-744-8. doi:10.1145/1654059.1654078. URLhttp://doi.acm.org/10.1145/1654059.1654078.[8] A. R. Benson, D. F. Gleich, and J. Leskovec. Higher-order organization ofcomplex networks. Science, 353(6295):163–166, 2016. ISSN 0036-8075.doi:10.1126/science.aad9029.[9] J. W. Berry. Practical heuristics for inexact subgraph isomorphism. InTechnical Report SAND2011-6558W. Sandia National Laboratories, 2011.[10] J. W. Berry, B. Hendrickson, S. Kahan, and P. Konecny. Software andalgorithms for graph queries on multithreaded architectures. In 2007 IEEEInternational Parallel and Distributed Processing Symposium, pages 1–14,March 2007. doi:10.1109/IPDPS.2007.370685.[11] P. Boldi, M. Santini, and S. Vigna. A large time-aware web graph. SIGIRForum, 42(2):33–38, Nov. 2008. ISSN 0163-5840.doi:10.1145/1480506.1480511. 
URLhttp://doi.acm.org/10.1145/1480506.1480511.[12] H. Bunke. On a relation between graph edit distance and maximumcommon subgraph. Pattern Recogn. Lett., 18(9):689–694, Aug. 1997.ISSN 0167-8655. doi:10.1016/S0167-8655(97)00060-3. URLhttp://dx.doi.org/10.1016/S0167-8655(97)00060-3.[13] H. Bunke and G. Allermann. Inexact graph matching for structural patternrecognition. Pattern Recogn. Lett., 1(4):245–253, May 1983. ISSN0167-8655. doi:10.1016/0167-8655(83)90033-8. URLhttp://dx.doi.org/10.1016/0167-8655(83)90033-8.[14] V. T. Chakaravarthy, M. Kapralov, P. Murali, F. Petrini, X. Que,Y. Sabharwal, and B. Schieber. Subgraph counting: Color coding beyondtrees. In 2016 IEEE International Parallel and Distributed ProcessingSymposium (IPDPS), pages 2–11, May 2016.doi:10.1109/IPDPS.2016.122.[15] D. Chakrabarti, Y. Zhan, and C. Faloutsos. R-mat: A recursive model forgraph mining. In Proceedings of the Fourth SIAM Int. Conf. on DataMining, page p. 442. Society for Industrial Mathematics, 2004.140[16] H. Chen, M. Liu, Y. Zhao, X. Yan, D. Yan, and J. Cheng. G-miner: Anefficient task-oriented graph mining system. In Proceedings of theThirteenth EuroSys Conference, EuroSys ’18, pages 32:1–32:12, New York,NY, USA, 2018. ACM. ISBN 978-1-4503-5584-1.doi:10.1145/3190508.3190545. URLhttp://doi.acm.org/10.1145/3190508.3190545.[17] J. Cheng, J. X. Yu, B. Ding, P. S. Yu, and H. Wang. Fast graph patternmatching. In 2008 IEEE 24th International Conference on DataEngineering, pages 913–922, April 2008.doi:10.1109/ICDE.2008.4497500.[18] A. Ching, S. Edunov, M. Kabiljo, D. Logothetis, and S. Muthukrishnan.One trillion edges: Graph processing at facebook-scale. Proc. VLDBEndow., 8(12):1804–1815, Aug. 2015. ISSN 2150-8097.doi:10.14778/2824032.2824077. URLhttp://dx.doi.org/10.14778/2824032.2824077.[19] D. Conte, P. Foggia, C. Sansone, and M. Vento. Thirty years of graphmatching in pattern recognition. 
International Journal of PatternRecognition and Artificial Intelligence, 18(03):265–298, 2004.doi:10.1142/S0218001404003228. URLhttp://www.worldscientific.com/doi/abs/10.1142/S0218001404003228.[20] T. Cormen, T. Cormen, C. Leiserson, I. Books24x7, M. Press, M. I.of Technology, R. Rivest, M.-H. P. Company, and C. Stein. Introduction ToAlgorithms. Introduction to Algorithms. MIT Press, 2001. ISBN9780262032933. URL https://books.google.ca/books?id=NLngYyWFl_YC.[21] CRAY. Cray graph engine (cge): Graph analytics for big data, 2017. URLhttp://www.cray.com/products/analytics/cray-graph-engine.[22] Cypher. Chapter 3. cypher - the neo4j developer manual v3.1, 2016. URLhttps://neo4j.com/docs/developer-manual/current/cypher.[23] A. Dave, A. Jindal, L. E. Li, R. Xin, J. Gonzalez, and M. Zaharia.Graphframes: An integrated api for mixing graph and relational queries. InProceedings of the Fourth International Workshop on Graph DataManagement Experiences and Systems, GRADES ’16, pages 2:1–2:8, NewYork, NY, USA, 2016. ACM. ISBN 978-1-4503-4780-8.doi:10.1145/2960414.2960416. URLhttp://doi.acm.org/10.1145/2960414.2960416.141[24] B. Du, S. Zhang, N. Cao, and H. Tong. First: Fast interactive attributedsubgraph matching. In Proceedings of the 23rd ACM SIGKDDInternational Conference on Knowledge Discovery and Data Mining, KDD’17, pages 1447–1456, New York, NY, USA, 2017. ACM. ISBN978-1-4503-4887-4. doi:10.1145/3097983.3098040. URLhttp://doi.acm.org/10.1145/3097983.3098040.[25] S. Dutta, P. Nayek, and A. Bhattacharya. Neighbor-aware search forapproximate labeled graph matching using the chi-square statistics. InProceedings of the 26th International Conference on World Wide Web,WWW ’17, pages 1281–1290, Republic and Canton of Geneva,Switzerland, 2017. International World Wide Web Conferences SteeringCommittee. ISBN 978-1-4503-4913-0. doi:10.1145/3038912.3052561.URL https://doi.org/10.1145/3038912.3052561.[26] M. Elseidy, E. Abdelhamid, S. Skiadopoulos, and P. Kalnis. 
Grami:Frequent subgraph and pattern mining in a single large graph. Proc. VLDBEndow., 7(7):517–528, Mar. 2014. ISSN 2150-8097.doi:10.14778/2732286.2732289. URLhttp://dx.doi.org/10.14778/2732286.2732289.[27] W. Fan. Graph pattern matching revised for social network analysis. InProceedings of the 15th International Conference on Database Theory,ICDT ’12, pages 8–21, New York, NY, USA, 2012. ACM. ISBN978-1-4503-0791-8. doi:10.1145/2274576.2274578. URLhttp://doi.acm.org/10.1145/2274576.2274578.[28] W. Fan, J. Li, S. Ma, N. Tang, Y. Wu, and Y. Wu. Graph pattern matching:From intractable to polynomial time. Proc. VLDB Endow., 3(1-2):264–275,Sept. 2010. ISSN 2150-8097. doi:10.14778/1920841.1920878. URLhttp://dx.doi.org/10.14778/1920841.1920878.[29] W. Fan, X. Wang, and Y. Wu. Diversified top-k graph pattern matching.Proc. VLDB Endow., 6(13):1510–1521, Aug. 2013. ISSN 2150-8097.doi:10.14778/2536258.2536263. URLhttp://dx.doi.org/10.14778/2536258.2536263.[30] S. Fankhauser, K. Riesen, and H. Bunke. Speeding up graph edit distancecomputation through fast bipartite matching. In Proceedings of the 8thInternational Conference on Graph-based Representations in PatternRecognition, GbRPR’11, pages 102–111, Berlin, Heidelberg, 2011.142Springer-Verlag. ISBN 978-3-642-20843-0. URLhttp://dl.acm.org/citation.cfm?id=2009206.2009219.[31] A. Fard, M. U. Nisar, L. Ramaswamy, J. A. Miller, and M. Saltz. Adistributed vertex-centric approach for pattern matching in massive graphs.In 2013 IEEE International Conference on Big Data, pages 403–411, Oct2013. doi:10.1109/BigData.2013.6691601.[32] J. Gao, C. Zhou, J. Zhou, and J. X. Yu. Continuous pattern detection overbillion-edge graph using distributed framework. In 2014 IEEE 30thInternational Conference on Data Engineering, pages 556–567, March2014. doi:10.1109/ICDE.2014.6816681.[33] M. R. Garey and D. S. Johnson. Computers and Intractability; A Guide tothe Theory of NP-Completeness. W. H. Freeman & Co., New York, NY,USA, 1990. 
ISBN 0716710455.[34] Giraph. Giraph, 2016. URL http://giraph.apache.org.[35] J. E. Gonzalez, Y. Low, H. Gu, D. Bickson, and C. Guestrin. Powergraph:Distributed graph-parallel computation on natural graphs. In Proceedingsof the 10th USENIX Conference on Operating Systems Design andImplementation, OSDI’12, pages 17–30, Berkeley, CA, USA, 2012.USENIX Association. ISBN 978-1-931971-96-6. URLhttp://dl.acm.org/citation.cfm?id=2387880.2387883.[36] J. E. Gonzalez, R. S. Xin, A. Dave, D. Crankshaw, M. J. Franklin, andI. Stoica. Graphx: Graph processing in a distributed dataflow framework.In Proceedings of the 11th USENIX Conference on Operating SystemsDesign and Implementation, OSDI’14, pages 599–613, Berkeley, CA,USA, 2014. USENIX Association. ISBN 978-1-931971-16-4. URLhttp://dl.acm.org/citation.cfm?id=2685048.2685096.[37] Graph 500. Graph 500 benchmark, 2016. URL http://www.graph500.org.[38] GraphFrames. Graphframes, 2017. URL http://graphframes.github.io.[39] A. Grover and J. Leskovec. Node2vec: Scalable feature learning fornetworks. In Proceedings of the 22Nd ACM SIGKDD InternationalConference on Knowledge Discovery and Data Mining, KDD ’16, pages855–864, New York, NY, USA, 2016. ACM. ISBN 978-1-4503-4232-2.doi:10.1145/2939672.2939754. URLhttp://doi.acm.org/10.1145/2939672.2939754.143[40] P. Gupta, V. Satuluri, A. Grewal, S. Gurumurthy, V. Zhabiuk, Q. Li, andJ. Lin. Real-time twitter recommendation: Online motif detection in largedynamic graphs. Proc. VLDB Endow., 7(13):1379–1380, Aug. 2014. ISSN2150-8097. doi:10.14778/2733004.2733010. URLhttp://dx.doi.org/10.14778/2733004.2733010.[41] S. Gurajada, S. Seufert, I. Miliaraki, and M. Theobald. Triad: A distributedshared-nothing rdf engine based on asynchronous message passing. InProceedings of the 2014 ACM SIGMOD International Conference onManagement of Data, SIGMOD ’14, pages 289–300, New York, NY, USA,2014. ACM. ISBN 978-1-4503-2376-5. doi:10.1145/2588555.2610511.URL http://doi.acm.org/10.1145/2588555.2610511.[42] W. L. 
Hamilton, R. Ying, and J. Leskovec. Representation Learning onGraphs: Methods and Applications. arXiv e-prints, art. arXiv:1709.05584,Sep 2017.[43] W. Han, Y. Miao, K. Li, M. Wu, F. Yang, L. Zhou, V. Prabhakaran,W. Chen, and E. Chen. Chronos: A graph engine for temporal graphanalysis. In Proceedings of the Ninth European Conference on ComputerSystems, EuroSys ’14, pages 1:1–1:14, New York, NY, USA, 2014. ACM.ISBN 978-1-4503-2704-6. doi:10.1145/2592798.2592799. URLhttp://doi.acm.org/10.1145/2592798.2592799.[44] W.-S. Han, J. Lee, and J.-H. Lee. Turboiso: Towards ultrafast and robustsubgraph isomorphism search in large graph databases. In Proceedings ofthe 2013 ACM SIGMOD International Conference on Management ofData, SIGMOD ’13, pages 337–348, New York, NY, USA, 2013. ACM.ISBN 978-1-4503-2037-5. doi:10.1145/2463676.2465300. URLhttp://doi.acm.org/10.1145/2463676.2465300.[45] HavoqGT. Havoqgt, 2016. URL http://software.llnl.gov/havoqgt.[46] H. He and A. K. Singh. Closure-tree: An index structure for graph queries.In 22nd International Conference on Data Engineering (ICDE’06), pages38–38, April 2006. doi:10.1109/ICDE.2006.37.[47] H. He and A. K. Singh. Graphs-at-a-time: Query language and accessmethods for graph databases. In Proceedings of the 2008 ACM SIGMODInternational Conference on Management of Data, SIGMOD ’08, pages405–418, New York, NY, USA, 2008. ACM. ISBN 978-1-60558-102-6.doi:10.1145/1376616.1376660. URLhttp://doi.acm.org/10.1145/1376616.1376660.144[48] K. Henderson, B. Gallagher, T. Eliassi-Rad, H. Tong, S. Basu, L. Akoglu,D. Koutra, C. Faloutsos, and L. Li. Rolx: Structural role extraction andmining in large graphs. In Proceedings of the 18th ACM SIGKDDInternational Conference on Knowledge Discovery and Data Mining, KDD’12, pages 1231–1239, New York, NY, USA, 2012. ACM. ISBN978-1-4503-1462-6. doi:10.1145/2339530.2339723. URLhttp://doi.acm.org/10.1145/2339530.2339723.[49] M. R. Henzinger, T. A. Henzinger, and P. W. Kopke. 
Computingsimulations on finite and infinite graphs. In Proceedings of the 36th AnnualSymposium on Foundations of Computer Science, FOCS ’95, pages 453–,Washington, DC, USA, 1995. IEEE Computer Society. ISBN0-8186-7183-1. URL http://dl.acm.org/citation.cfm?id=795662.796255.[50] S. Hong, S. Depner, T. Manhardt, J. Van Der Lugt, M. Verstraaten, andH. Chafi. Pgx.d: A fast distributed graph processing engine. InProceedings of the International Conference for High PerformanceComputing, Networking, Storage and Analysis, SC ’15, pages 58:1–58:12,New York, NY, USA, 2015. ACM. ISBN 978-1-4503-3723-6.doi:10.1145/2807591.2807620. URLhttp://doi.acm.org/10.1145/2807591.2807620.[51] J. E. Hopcroft and J. K. Wong. Linear time algorithm for isomorphism ofplanar graphs (preliminary report). In Proceedings of the Sixth AnnualACM Symposium on Theory of Computing, STOC ’74, pages 172–184,New York, NY, USA, 1974. ACM. doi:10.1145/800119.803896. URLhttp://doi.acm.org/10.1145/800119.803896.[52] T. Horváth, T. Gärtner, and S. Wrobel. Cyclic pattern kernels for predictivegraph mining. In Proceedings of the Tenth ACM SIGKDD InternationalConference on Knowledge Discovery and Data Mining, KDD ’04, pages158–167, New York, NY, USA, 2004. ACM. ISBN 1-58113-888-1.doi:10.1145/1014052.1014072. URLhttp://doi.acm.org/10.1145/1014052.1014072.[53] M. Hunger and W. Lyon. Analyzing the panama papers with neo4j: Datamodels, queries and more, 2016. URLhttps://neo4j.com/blog/analyzing-panama-papers-neo4j.[54] IBM. Ibm system g, 2017. URL http://systemg.research.ibm.com.[55] IMDb. Imdb public data, 2016. URL http://www.imdb.com/interfaces.145[56] A. P. Iyer, Z. Liu, X. Jin, S. Venkataraman, V. Braverman, and I. Stoica.ASAP: Fast, approximate graph pattern mining at scale. In 13th USENIXSymposium on Operating Systems Design and Implementation (OSDI 18),pages 745–761, Carlsbad, CA, 2018. USENIX Association. ISBN978-1-931971-47-8. URLhttps://www.usenix.org/conference/osdi18/presentation/iyer.[57] T. N. 
Kipf and M. Welling. Semi-supervised classification with graphconvolutional networks. In 5th International Conference on LearningRepresentations, ICLR 2017, Toulon, France, April 24-26, 2017,Conference Track Proceedings, 2017. URLhttps://openreview.net/forum?id=SJU4ayYgl.[58] Y. Koren, S. C. North, and C. Volinsky. Measuring and extractingproximity graphs in networks. ACM Trans. Knowl. Discov. Data, 1(3), Dec.2007. ISSN 1556-4681. doi:10.1145/1297332.1297336. URLhttp://doi.acm.org/10.1145/1297332.1297336.[59] A. Kusum, K. Vora, R. Gupta, and I. Neamtiu. Efficient processing of largegraphs via input reduction. In Proceedings of the 25th ACM InternationalSymposium on High-Performance Parallel and Distributed Computing,HPDC ’16, pages 245–257, New York, NY, USA, 2016. ACM. ISBN978-1-4503-4314-5. doi:10.1145/2907294.2907312. URLhttp://doi.acm.org/10.1145/2907294.2907312.[60] J. Lee, W.-S. Han, R. Kasperovics, and J.-H. Lee. An in-depth comparisonof subgraph isomorphism algorithms in graph databases. In Proceedings ofthe 39th international conference on Very Large Data Bases, PVLDB’13,pages 133–144. VLDB Endowment, 2013. URLhttp://dl.acm.org/citation.cfm?id=2448936.2448946.[61] W.-C. Lee, V. Bonin, M. Reed, B. J. Graham, G. Hood, K. Glattfelder, andR. C. Reid. Anatomy and function of an excitatory network in the visualcortex. 532, 03 2016.[62] B. G. Library. The boost graph library (bgl), 2017. URLhttp://www.boost.org/doc/libs/master/libs/graph/doc/index.html.[63] G. Liu, K. Zheng, Y. Wang, M. A. Orgun, A. Liu, L. Zhao, and X. Zhou.Multi-constrained graph pattern matching in large-scale contextual socialgraphs. In 2015 IEEE 31st International Conference on Data Engineering,pages 351–362, April 2015. doi:10.1109/ICDE.2015.7113297.146[64] D. Lo, H. Cheng, J. Han, S.-C. Khoo, and C. Sun. Classification ofsoftware behaviors for failure detection: A discriminative pattern miningapproach. 
In Proceedings of the 15th ACM SIGKDD InternationalConference on Knowledge Discovery and Data Mining, KDD ’09, pages557–566, New York, NY, USA, 2009. ACM. ISBN 978-1-60558-495-9.doi:10.1145/1557019.1557083. URLhttp://doi.acm.org/10.1145/1557019.1557083.[65] E. M. Luks. Isomorphism of graphs of bounded valence can be tested inpolynomial time. Journal of Computer and System Sciences, 25(1):42 – 65,1982. ISSN 0022-0000.doi:http://dx.doi.org/10.1016/0022-0000(82)90009-5. URLhttp://www.sciencedirect.com/science/article/pii/0022000082900095.[66] A. Lulli, E. Carlini, P. Dazzi, C. Lucchese, and L. Ricci. Fast connectedcomponents computation in large graphs by vertex pruning. IEEETransactions on Parallel and Distributed Systems, 28(3):760–773, March2017. ISSN 1045-9219. doi:10.1109/TPDS.2016.2591038.[67] S. Ma, Y. Cao, J. Huai, and T. Wo. Distributed graph pattern matching. InProceedings of the 21st International Conference on World Wide Web,WWW ’12, pages 949–958, New York, NY, USA, 2012. ACM. ISBN978-1-4503-1229-5. doi:10.1145/2187836.2187963. URLhttp://doi.acm.org/10.1145/2187836.2187963.[68] G. Malewicz, M. H. Austern, A. J. Bik, J. C. Dehnert, I. Horn, N. Leiser,and G. Czajkowski. Pregel: A system for large-scale graph processing. InProceedings of the 2010 ACM SIGMOD International Conference onManagement of Data, SIGMOD ’10, pages 135–146, New York, NY, USA,2010. ACM. ISBN 978-1-4503-0032-2. doi:10.1145/1807167.1807184.URL http://doi.acm.org/10.1145/1807167.1807184.[69] A. Matveev, Y. Meirovitch, H. Saribekyan, W. Jakubiuk, T. Kaler, G. Odor,D. Budden, A. Zlateski, and N. Shavit. A multicore path toconnectomics-on-demand. In Proceedings of the 22Nd ACM SIGPLANSymposium on Principles and Practice of Parallel Programming, PPoPP’17, pages 267–281, New York, NY, USA, 2017. ACM. ISBN978-1-4503-4493-7. doi:10.1145/3018743.3018766. URLhttp://doi.acm.org/10.1145/3018743.3018766.[70] C. McCreesh, P. Prosser, C. Solnon, and J. Trimble. 
When subgraphisomorphism is really hard, and why this matters for graph databases. J.147Artif. Int. Res., 61(1):723–759, Jan. 2018. ISSN 1076-9757. URLhttp://dl.acm.org/citation.cfm?id=3241691.3241707.[71] B. D. Mckay and A. Piperno. Practical graph isomorphism, ii. J. Symb.Comput., 60:94–112, Jan. 2014. ISSN 0747-7171.doi:10.1016/j.jsc.2013.09.003. URLhttp://dx.doi.org/10.1016/j.jsc.2013.09.003.[72] B. T. Messmer and H. Bunke. A new algorithm for error-tolerant subgraphisomorphism detection. IEEE Trans. Pattern Anal. Mach. Intell., 20(5):493–504, May 1998. ISSN 0162-8828. doi:10.1109/34.682179. URLhttp://dx.doi.org/10.1109/34.682179.[73] A. Mislove, M. Marcon, K. P. Gummadi, P. Druschel, and B. Bhattacharjee.Measurement and analysis of online social networks. In Proceedings of the7th ACM SIGCOMM Conference on Internet Measurement, IMC ’07, pages29–42, New York, NY, USA, 2007. ACM. ISBN 978-1-59593-908-1.doi:10.1145/1298306.1298311. URLhttp://doi.acm.org/10.1145/1298306.1298311.[74] T. Miyazaki. The complexity of mckay’s canonical labeling algorithm.Groups and Computation II, DIMACS Series Discrete MathematicsTheoretical Computer Science, 28:239–256, 1997.[75] Neo4j. Neo4j - property graph, 2016. URLhttps://neo4j.com/developer/graph-database/#property-graph.[76] T. Neumann and G. Weikum. The rdf-3x engine for scalable managementof rdf data. The VLDB Journal, 19(1):91–113, Feb. 2010. ISSN1066-8888. doi:10.1007/s00778-009-0165-y. URLhttp://dx.doi.org/10.1007/s00778-009-0165-y.[77] OrientDB. Orientdb, 2018. URL https://orientdb.com.[78] L. P. Cordella, P. Foggia, C. Sansone, and M. Vento. A (sub)graphisomorphism algorithm for matching large graphs. IEEE Trans. PatternAnal. Mach. Intell., 26(10):1367–1372, Oct. 2004. ISSN 0162-8828.doi:10.1109/TPAMI.2004.75. URLhttp://dx.doi.org/10.1109/TPAMI.2004.75.[79] B. Paten, A. M. Novak, J. M. Eizenga, and G. Erik. Genome graphs and theevolution of genome inference. bioRxiv, 2017. doi:10.1101/101816. 
URLhttps://www.biorxiv.org/content/early/2017/03/14/101816.148[80] A. Pavan, K. Tangwongsan, S. Tirthapura, and K.-L. Wu. Counting andsampling triangles from a graph stream. Proc. VLDB Endow., 6(14):1870–1881, Sept. 2013. ISSN 2150-8097.doi:10.14778/2556549.2556569. URLhttp://dx.doi.org/10.14778/2556549.2556569.[81] R. Pearce, M. Gokhale, and N. M. Amato. Scaling techniques for massivescale-free graphs in distributed (external) memory. In Proceedings of the2013 IEEE 27th International Symposium on Parallel and DistributedProcessing, IPDPS ’13, pages 825–836, Washington, DC, USA, 2013.IEEE Computer Society. ISBN 978-0-7695-4971-2.doi:10.1109/IPDPS.2013.72. URLhttp://dx.doi.org/10.1109/IPDPS.2013.72.[82] R. Pearce, M. Gokhale, and N. M. Amato. Faster parallel traversal of scalefree graphs at extreme scale with vertex delegates. In Proceedings of theInternational Conference for High Performance Computing, Networking,Storage and Analysis, SC ’14, pages 549–559, Piscataway, NJ, USA, 2014.IEEE Press. ISBN 978-1-4799-5500-8. doi:10.1109/SC.2014.50. URLhttps://doi.org/10.1109/SC.2014.50.[83] B. Perozzi, R. Al-Rfou, and S. Skiena. Deepwalk: Online learning of socialrepresentations. In Proceedings of the 20th ACM SIGKDD InternationalConference on Knowledge Discovery and Data Mining, KDD ’14, pages701–710, New York, NY, USA, 2014. ACM. ISBN 978-1-4503-2956-9.doi:10.1145/2623330.2623732. URLhttp://doi.acm.org/10.1145/2623330.2623732.[84] T. Plantenga. Inexact subgraph isomorphism in mapreduce. J. ParallelDistrib. Comput., 73(2):164–175, Feb. 2013. ISSN 0743-7315.doi:10.1016/j.jpdc.2012.10.005. URLhttp://dx.doi.org/10.1016/j.jpdc.2012.10.005.[85] Quartz. Quartz, 2017. URL https://hpc.llnl.gov/hardware/platforms/Quartz.[86] RDF. Largetriplestores, 2017. URLhttps://www.w3.org/wiki/LargeTripleStores.[87] Reddit. Reddit public data, 2017. URLhttps://github.com/dewarim/reddit-data-tools.[88] T. Reza, C. Klymko, M. Ripeanu, G. Sanders, and R. Pearce. 
Towardspractical and robust labeled pattern matching in trillion-edge graphs. In1492017 IEEE International Conference on Cluster Computing (CLUSTER),pages 1–12, Sept 2017. doi:10.1109/CLUSTER.2017.85.[89] T. Reza, M. Ripeanu, N. Tripoul, G. Sanders, and R. Pearce. Prunejuice:Pruning trillion-edge graphs to a precise pattern-matching solution. InProceedings of the International Conference for High PerformanceComputing, Networking, Storage, and Analysis, SC ’18, pages 21:1–21:17,Piscataway, NJ, USA, 2018. IEEE Press. URLhttp://dl.acm.org/citation.cfm?id=3291656.3291684.[90] T. Reza, H. Halawa, M. Ripeanu, G. Sanders, and R. Pearce. Scalablepattern matching in metadata graphs via constraint checking. NetSysLabTechnical Report, 0(0):1:1–1:42, Oct. 2019. ISSN 0000-0000.doi:00.0000/0000000. URL https://arxiv.org/abs/1912.08453.[91] T. Reza, M. Ripeanu, G. Sanders, and R. Pearce. Approximate patternmatching in distributed graphs with precision and recall guarantees. InProceedings of the 2020 ACM SIGMOD International Conference onManagement of Data, SIGMOD ’20, pages 1–17, Portland, Oregon, USA,June 2020. doi:00.0000/SIGMOD.2020.00.[92] P. Ribeiro and F. Silva. G-tries: A data structure for storing and findingsubgraphs. Data Min. Knowl. Discov., 28(2):337–377, Mar. 2014. ISSN1384-5810. doi:10.1007/s10618-013-0303-4. URLhttp://dx.doi.org/10.1007/s10618-013-0303-4.[93] K. Riesen and H. Bunke. Approximate graph edit distance computation bymeans of bipartite graph matching. Image Vision Comput., 27(7):950–959,June 2009. ISSN 0262-8856. doi:10.1016/j.imavis.2008.04.004. URLhttp://dx.doi.org/10.1016/j.imavis.2008.04.004.[94] O. L. Robert Meusel, Christian Bizer. Web data commons - hyperlinkgraphs, 2016. URL http://webdatacommons.org/hyperlinkgraph/index.html.[95] D. M. Romero, B. Meeder, and J. Kleinberg. Differences in the mechanicsof information diffusion across topics: Idioms, political hashtags, andcomplex contagion on twitter. 
In Proceedings of the 20th InternationalConference on World Wide Web, WWW ’11, pages 695–704, New York,NY, USA, 2011. ACM. ISBN 978-1-4503-0632-4.doi:10.1145/1963405.1963503. URLhttp://doi.acm.org/10.1145/1963405.1963503.150[96] N. P. Roth, V. Trigonakis, S. Hong, H. Chafi, A. Potter, B. Motik, andI. Horrocks. Pgx.d/async: A scalable distributed graph pattern matchingengine. In Proceedings of the Fifth International Workshop on GraphData-management Experiences & Systems, GRADES’17, pages 7:1–7:6,New York, NY, USA, 2017. ACM. ISBN 978-1-4503-5038-9.doi:10.1145/3078447.3078454. URLhttp://doi.acm.org/10.1145/3078447.3078454.[97] A. Roy, L. Bindschaedler, J. Malicevic, and W. Zwaenepoel. Chaos:Scale-out graph processing from secondary storage. In Proceedings of the25th Symposium on Operating Systems Principles, SOSP ’15, pages410–424, New York, NY, USA, 2015. ACM. ISBN 978-1-4503-3834-9.doi:10.1145/2815400.2815408. URLhttp://doi.acm.org/10.1145/2815400.2815408.[98] S. Salihoglu and J. Widom. Gps: A graph processing system. InProceedings of the 25th International Conference on Scientific andStatistical Database Management, SSDBM, pages 22:1–22:12, New York,NY, USA, 2013. ACM. ISBN 978-1-4503-1921-8.doi:10.1145/2484838.2484843. URLhttp://doi.acm.org/10.1145/2484838.2484843.[99] S. Sallinen, K. Iwabuchi, S. Poudel, M. Gokhale, M. Ripeanu, andR. Pearce. Graph colouring as a challenge problem for dynamic graphprocessing on distributed systems. In Proceedings of the InternationalConference for High Performance Computing, Networking, Storage andAnalysis, SC ’16, pages 30:1–30:12, Piscataway, NJ, USA, 2016. IEEEPress. ISBN 978-1-4673-8815-3. URLhttp://dl.acm.org/citation.cfm?id=3014904.3014945.[100] M. Serafini, G. De Francisci Morales, and G. Siganos. Qfrag: Distributedgraph search via subgraph isomorphism. In Proceedings of the 2017Symposium on Cloud Computing, SoCC ’17, pages 214–228, New York,NY, USA, 2017. ACM. ISBN 978-1-4503-5028-0.doi:10.1145/3127479.3131625. 
URLhttp://doi.acm.org/10.1145/3127479.3131625.[101] C. Seshadhri, A. Pinar, and T. G. Kolda. An in-depth analysis of stochastickronecker graphs. J. ACM, 60(2):13:1–13:32, May 2013. ISSN 0004-5411.doi:10.1145/2450142.2450149. URLhttp://doi.acm.org/10.1145/2450142.2450149.151[102] H. Shang, Y. Zhang, X. Lin, and J. X. Yu. Taming verification hardness:An efficient algorithm for testing subgraph isomorphism. Proc. VLDBEndow., 1(1):364–375, Aug. 2008. ISSN 2150-8097.doi:10.14778/1453856.1453899. URLhttp://dx.doi.org/10.14778/1453856.1453899.[103] G. M. Slota and K. Madduri. Complex network analysis using parallelapproximate motif counting. In Proc. 28th IEEE Int’l. Parallel andDistributed Processing Symposium (IPDPS), pages 405–414. IEEE, May2014. doi:http://dx.doi.org/10.1109/IPDPS.2014.50.[104] Spark. Spark, 2017. URL https://spark.apache.org.[105] Z. Sun, H. Wang, H. Wang, B. Shao, and J. Li. Efficient subgraph matchingon billion node graphs. Proc. VLDB Endow., 5(9):788–799, May 2012.ISSN 2150-8097. doi:10.14778/2311906.2311907. URLhttp://dx.doi.org/10.14778/2311906.2311907.[106] N. Sundaram, N. Satish, M. M. A. Patwary, S. R. Dulloor, M. J. Anderson,S. G. Vadlamudi, D. Das, and P. Dubey. Graphmat: High performancegraph analytics made productive. Proc. VLDB Endow., 8(11):1214–1225,July 2015. ISSN 2150-8097. doi:10.14778/2809974.2809983. URLhttp://dx.doi.org/10.14778/2809974.2809983.[107] S. Suri and S. Vassilvitskii. Counting triangles and the curse of the lastreducer. In Proceedings of the 20th International Conference on WorldWide Web, WWW ’11, pages 607–614, New York, NY, USA, 2011. ACM.ISBN 978-1-4503-0632-4. doi:10.1145/1963405.1963491. URLhttp://doi.acm.org/10.1145/1963405.1963491.[108] C. H. C. Teixeira, A. J. Fonseca, M. Serafini, G. Siganos, M. J. Zaki, andA. Aboulnaga. Arabesque: A system for distributed graph mining. InProceedings of the 25th Symposium on Operating Systems Principles,SOSP ’15, pages 425–440, New York, NY, USA, 2015. ACM. 
ISBN978-1-4503-3834-9. doi:10.1145/2815400.2815410. URLhttp://doi.acm.org/10.1145/2815400.2815410.[109] TinkerPop. Gremlin, 2016. URL https://tinkerpop.apache.org/gremlin.[110] TinkerPop. Tinkerpop, 2016. URL http://tinkerpop.apache.org.[111] TinkerPop. Titan, 2016. URL http://titan.thinkaurelius.com.152[112] H. Tong, C. Faloutsos, and J.-Y. Pan. Fast random walk with restart and itsapplications. In Proceedings of the Sixth International Conference on DataMining, ICDM ’06, pages 613–622, Washington, DC, USA, 2006. IEEEComputer Society. ISBN 0-7695-2701-9. doi:10.1109/ICDM.2006.70.URL http://dx.doi.org/10.1109/ICDM.2006.70.[113] H. Tong, C. Faloutsos, B. Gallagher, and T. Eliassi-Rad. Fast best-effortpattern matching in large attributed graphs. In Proceedings of the 13thACM SIGKDD International Conference on Knowledge Discovery andData Mining, KDD ’07, pages 737–746, New York, NY, USA, 2007. ACM.ISBN 978-1-59593-609-7. doi:10.1145/1281192.1281271. URLhttp://doi.acm.org/10.1145/1281192.1281271.[114] H. Tong, C. Faloutsos, and Y. Koren. Fast direction-aware proximity forgraph mining. In Proceedings of the 13th ACM SIGKDD InternationalConference on Knowledge Discovery and Data Mining, KDD ’07, pages747–756, New York, NY, USA, 2007. ACM. ISBN 978-1-59593-609-7.doi:10.1145/1281192.1281272. URLhttp://doi.acm.org/10.1145/1281192.1281272.[115] N. Tripoul, H. Halawa, T. Reza, G. Sanders, R. Pearce, and M. Ripeanu.There are trillions of little forks in the road. choose wisely! - estimating thecost and likelihood of success of constrained walks to optimize a graphpruning pipeline. In 2018 IEEE/ACM 8th Workshop on IrregularApplications: Architectures and Algorithms (IA3), pages 20–27, Nov 2018.doi:10.1109/IA3.2018.00010.[116] J. R. Ullmann. An algorithm for subgraph isomorphism. J. ACM, 23(1):31–42, Jan. 1976. ISSN 0004-5411. doi:10.1145/321921.321925. URLhttp://doi.acm.org/10.1145/321921.321925.[117] K. Vora, R. Gupta, and G. Xu. 
Kickstarter: Fast and accurate computationson streaming graphs via trimmed approximations. SIGARCH Comput.Archit. News, 45(1):237–251, Apr. 2017. ISSN 0163-5964.doi:10.1145/3093337.3037748. URLhttp://doi.acm.org/10.1145/3093337.3037748.[118] E. Walker. The real cost of a cpu hour. Computer, 42(4):35–41, Apr. 2009.ISSN 0018-9162. doi:10.1109/MC.2009.135. URLhttp://dx.doi.org/10.1109/MC.2009.135.[119] K. Wang, Z. Zuo, J. Thorpe, T. Q. Nguyen, and G. H. Xu. Rstream:Marrying relational algebra with streaming for efficient graph mining on a153single machine. In 13th USENIX Symposium on Operating Systems Designand Implementation (OSDI 18), pages 763–782, Carlsbad, CA, 2018.USENIX Association. ISBN 978-1-931971-47-8. URLhttps://www.usenix.org/conference/osdi18/presentation/wang.[120] M. P. Wellman and W. E. Walsh. Distributed quiescence detection inmultiagent negotiation. In Proceedings Fourth International Conference onMultiAgent Systems, pages 317–324, 2000.doi:10.1109/ICMAS.2000.858469.[121] A. Yasar and U. V. Çatalyürek. An iterative global structure-assistedlabeled network aligner. In Proceedings of the 24th ACM SIGKDDInternational Conference on Knowledge Discovery and Data Mining, KDD’18, pages 2614–2623, New York, NY, USA, 2018. ACM. ISBN978-1-4503-5552-0. doi:10.1145/3219819.3220079. URLhttp://doi.acm.org/10.1145/3219819.3220079.[122] X. Yu, Y. Sun, B. Norick, T. Mao, and J. Han. User guided entity similaritysearch using meta-path selection in heterogeneous information networks. InProceedings of the 21st ACM International Conference on Information andKnowledge Management, CIKM ’12, pages 2025–2029, New York, NY,USA, 2012. ACM. ISBN 978-1-4503-1156-4.doi:10.1145/2396761.2398565. URLhttp://doi.acm.org/10.1145/2396761.2398565.[123] Y. Yuan, G. Wang, J. Y. Xu, and L. Chen. Efficient distributed subgraphsimilarity matching. The VLDB Journal, 24(3):369–394, June 2015. ISSN1066-8888. doi:10.1007/s00778-015-0381-6. URLhttp://dx.doi.org/10.1007/s00778-015-0381-6.[124] Z. 
Zeng, J. Wang, L. Zhou, and G. Karypis. Out-of-core coherent closed quasi-clique mining from large dense graph databases. ACM Trans. Database Syst., 32(2), June 2007. ISSN 0362-5915. doi:10.1145/1242524.1242530. URL http://doi.acm.org/10.1145/1242524.1242530.

[125] S. Zhang, S. Li, and J. Yang. Gaddi: Distance index based subgraph matching in biological networks. In Proceedings of the 12th International Conference on Extending Database Technology: Advances in Database Technology, EDBT ’09, pages 192–203, New York, NY, USA, 2009. ACM. ISBN 978-1-60558-422-5. doi:10.1145/1516360.1516384. URL http://doi.acm.org/10.1145/1516360.1516384.

[126] S. Zhang, J. Yang, and W. Jin. Sapper: Subgraph indexing and approximate matching in large graphs. Proc. VLDB Endow., 3(1-2):1185–1194, Sept. 2010. ISSN 2150-8097. doi:10.14778/1920841.1920988. URL http://dx.doi.org/10.14778/1920841.1920988.

[127] P. Zhao and J. Han. On graph query optimization in large networks. Proc. VLDB Endow., 3(1-2):340–351, Sept. 2010. ISSN 2150-8097. doi:10.14778/1920841.1920887. URL http://dx.doi.org/10.14778/1920841.1920887.

[128] Z. Zhao, G. Wang, A. R. Butt, M. Khan, V. S. A. Kumar, and M. V. Marathe. Sahad: Subgraph analysis in massive networks using hadoop. In 2012 IEEE 26th International Parallel and Distributed Processing Symposium, pages 390–401, May 2012. doi:10.1109/IPDPS.2012.44.

[129] F. Zhou, S. Mahler, and H. Toivonen. Simplification of Networks by Edge Pruning, pages 179–198. Springer Berlin Heidelberg, Berlin, Heidelberg, 2012. ISBN 978-3-642-31830-6. doi:10.1007/978-3-642-31830-6_13. URL https://doi.org/10.1007/978-3-642-31830-6_13.

[130] F. Zhu, Q. Qu, D. Lo, X. Yan, J. Han, and P. Yu. Mining top-k large structural patterns in a massive network. Proceedings of the VLDB Endowment, 4(11):807–818, Aug. 2011. ISSN 2150-8097.

Appendix A

Theoretical Guarantees

We provide correctness proofs for the constraint checking algorithms presented in Chapter 3.
This appendix is organized in two sections: §A.1 presents detailed correctness proofs for the constraint checking algorithms, assuming a restricted scenario with constraints on the topology and vertex label distribution of the search template (§3.6 and [88]). §A.2 presents a correctness proof sketch for the general exact matching solution (§3.5) that alleviates the above restrictions and offers precision and recall guarantees for arbitrary search templates. As in the rest of this dissertation, the proofs are for non-induced subgraph matching (§2.1).

We consider a restricted scenario to present detailed correctness proofs (§A.1) primarily for two reasons: first, to simplify the problem and thus keep the math manageable; second, to show that, for a restricted set of patterns, the constraint checking technique presents further optimization opportunities, e.g., for a class of acyclic templates, local constraint checking, which is a polynomial time routine, leads to the precise solution set (see Corollary 3 and Appendix B, §B.1). In §A.2, we discuss how the proofs for the constrained templates can be extended to develop proofs for the solution that offers precision and recall guarantees for arbitrary templates.

A.1 Correctness Proofs for the Constraint Checking Algorithms (assuming Restrictions on the Search Template)

Here, we present correctness proofs for the constraint checking algorithms; the proofs assume a restricted scenario (presented in §3.6) with constraints on the topology and vertex label distribution of the search template: (i) no two template vertices have the same label and (ii) the template is edge-monocyclic, i.e., no two cycles share an edge.
The solution presented in §3.6 offers precision and recall guarantees for the matching vertices only (as it does not support explicit edge elimination). The contents of this section are based on the materials published in [88].

Algorithm 12 Main Constraint Checking Loop (Serial Version)
Input: background graph G, template graph G0
Output: a vertex subset T ⊂ V containing exact matches
1: while vertices are being eliminated from T do
2:   refine T with Local Constraint Checking
3:   refine T with Cycle Checking

The goal is to find subsets of vertices S ⊂ V that exactly match the template G0. The constraint checking routines iteratively eliminate vertices, refining a set T that always contains all vertices that are included in an exact match, T ⊃ ⋃_{S∼G0} S. The goal is to shrink T as aggressively as possible without throwing out any vertices that can participate in a match. Alg. 12 presents the overall iterative process (for the above restricted scenario). We also discuss simplified serial algorithms for Local Constraint Checking (Alg. 13) and Cycle Checking, i.e., non-local constraint checking (Alg. 14); these are free from distributed implementation details and help explain the theoretical proofs presented later.

Definition 7. A vertex set S ⊂ V is an exact match of template graph G0(V0, E0) (in notation, S ∼ G0) if there exists a bijective function φ : V0 ←→ S with the properties:
(i) ℓ(φ(q)) = ℓ(q) for all q ∈ V0, and
(ii) ∀(q1, q2) ∈ E0 we have (φ(q1), φ(q2)) ∈ E.

Assumption 1 (Unique Labels in Template). For any qi, qj ∈ V0 such that qi ≠ qj, we assume ℓ(qi) ≠ ℓ(qj).

Assumption 2 (Template is Edge-Monocyclic). The search template G0 is edge-monocyclic, i.e., no two cycles belonging to G0 share an edge.

Local Constraint Checking. Alg. 13 iteratively generates a sequence of vertex match functions, fk(v) for each iteration k = 0, 1, ..., kmax, that map V onto V0 ∪ {∅}, where ∅ is a null value that represents v not being part of any matching subset.
Essentially, fk(v) = q means that, given the computed knowledge up to iteration k of our algorithm, vertex v ∈ V is still a possible match for vertex q ∈ V0. The fk(v) are related to φ−1(v) with requirements similar to Def. 7, in the following sense. For k > 0: (i) the labels match, ℓ(fk−1(v1)) = ℓ(v1); and (ii) there are matching edges, i.e., ∀(fk−1(v1), q2) ∈ E0, there exists v2 with fk−1(v2) = q2 and (v1, v2) ∈ E.

Algorithm 13 Local Constraint Checking (Serial Version)
Input: G, G0, a vertex subset T ⊂ V, maximum iterations kmax
Output: a refined vertex subset T ⊂ V
1: ∀q ∈ V0, v ∈ T set f0(v) = q when ℓ(v) = ℓ(q)  ▷ initialize matching functions
2: for k = 1, ..., kmax do  ▷ eliminate vertices iteratively
3:   ΔT ← 0
4:   for v : fk−1(v) ≠ ∅ do
5:     fk(v) ← fk−1(v)
6:     for q ∈ adj(fk−1(v)) do  ▷ check local constraints
7:       if no v′ ∈ adj(v) with fk−1(v′) = q then
8:         fk(v) ← ∅; T ← T \ {v}; ΔT ← ΔT − 1
9:   if ΔT = 0 then break

The algorithm excludes the vertices that do not have a corresponding label in the template, then, iteratively, excludes the vertices that do not have similarly labeled neighbors as in the template. More formally, the initialization of Alg. 13 defines f0 for every v ∈ V. Every vertex v∗ ∈ V with a label not represented in G0 is immediately eliminated, f0(v∗) = ∅. Every other v ∈ V is assigned the unique q ∈ V0 with matching label, f0(v) = q such that ℓ(v) = ℓ(q). Then the algorithm proceeds iteratively in step k, checking that the G-neighborhood of each v does not violate the constraints specified in the G0-neighborhood of fk−1(v), and eliminates v if a single constraint is violated. For an acyclic template, this process is guaranteed to stop eliminating vertices after kmax = diam(G0) + 1 iterations (see Corollary 3).

Cycle Checking. We leverage a token passing approach to detect cycles of appropriate length (see Alg. 14). Let K0 be a set of cycle constraints to be checked. Each member C0 ∈ K0 is a length r = |C0| cycle in G0 beginning, w.l.o.g., at q0: C0 = {(q0, q1), (q1, q2), ..., (qr−1, q0)}.
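To make the serial formulation of Alg. 13 concrete, it can be sketched as a short Python routine. The adjacency-list graph encoding and dictionary-based label maps below are assumptions made for this illustration (the dissertation's actual implementation is distributed, on HavoqGT), and the sketch relies on Assumption 1 (unique template labels):

```python
# Sketch of Alg. 13 (serial local constraint checking).
# G, G0: adjacency lists {vertex: set(neighbors)}; label, label0: vertex -> label.
# Unique template labels (Assumption 1) let a label identify a single q in V0.

def local_constraint_checking(G, label, G0, label0, T, k_max):
    by_label = {label0[q]: q for q in G0}          # label -> unique template vertex
    f = {v: by_label.get(label[v]) for v in T}     # f0: None plays the null role
    T = {v for v in T if f[v] is not None}         # drop unlabeled-in-template vertices
    for _ in range(k_max):
        eliminated = set()
        for v in T:
            q = f[v]
            # every template neighbor of q needs a matching active neighbor of v
            for q2 in G0[q]:
                if not any(f.get(u) == q2 for u in G[v] if u in T):
                    eliminated.add(v)
                    break
        if not eliminated:
            break                                  # fixpoint reached
        T -= eliminated                            # apply eliminations after the pass,
        for v in eliminated:                       # so checks above used f_{k-1}
            f[v] = None
    return T
```

Note that eliminations are applied only after each pass, mirroring the algorithm's use of fk−1 within iteration k.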
For each v0 ∈ T with ℓ(v0) = ℓ(q0), we initiate tokens that are passed through edges in G whose ends match the vertex labels in C0. After r steps, we check whether a self-issued token was received by each initial sender, thus completing a length r cycle. Once all the cycles in K0 have been verified, we remove all initiating vertices that did not receive their own tokens in the expected number of steps.

Algorithm 14 Cycle Checking (Serial Version)
Input: G, G0, a vertex subset T ⊂ V, set of cycles K0
Output: a refined vertex subset T ⊂ V
1: for C0 ∈ K0 do
2:   Let (q0, q1) be the first edge in C0
3:   A0 ← ∅; A ← ∅  ▷ initialize cycles
4:   for all v0 ∈ T with ℓ(v0) = ℓ(q0) do
5:     A0 ← A0 ∪ {v0}
6:     A ← A ∪ {(v0, v0, 0)}
7:   for s = 1, 2, ..., |C0| do  ▷ loop to process cycles
8:     Let (qi, qj) be the s-th edge in C0
9:     B ← ∅
10:    for every (v, v0, s−1) ∈ A do
11:      for v′ ∈ adj(v) ∩ T do
12:        if ℓ(qj) = ℓ(v′) then
13:          B ← B ∪ {(v′, v0, s)}
14:    A ← B
15:  for every v0 ∈ A0 do  ▷ remove vertices without the cycle
16:    if (v0, v0, |C0|) ∉ A then T ← T \ {v0}

We note that K0 should contain all orderings of each cycle (i.e., each cycle is represented r times, starting once at each participating vertex) to guarantee that each remaining vertex in T participates in each cycle in the template. In other words, each vertex participating in a cycle must issue tokens for that cycle¹.

In Alg. 14, several sets maintain the token initiators and the instances of various tokens. The set A0 ⊂ T is the set of all vertices that initiate cycle attempts in G. After the s-th stage of the algorithm, A is a collection of three-tuples (v, v0, s) that store the vertex v ∈ V reached by the s-th step of a partial cycle in G beginning at token-issuing vertex v0.

¹In [115], we present a shared memory solution that offers further optimizations.

Theorem 1. Let S ⊂ V be an exact match of G0 (see Def. 7). No phase of Alg. 13 will remove any vertex from S.

Proof. Let v1 be any vertex in S. By Def. 7, there exists a bijective mapping, φ, from V0 onto S.
The matching function is initialized as f0(v1) = φ−1(v1) = q1, and no vertex in S is thrown out by the initial checking of labels. Now, assume fk(v1) = q1 and no vertex in S was eliminated in the k-th iteration. For any q2 ∈ adj(q1), (q1, q2) ∈ E0, and we require a vertex with label ℓ(q2) in adj(v1). By Def. 7(ii), we have (v1, φ(q2)) ∈ E. Further, by Def. 7(i), ℓ(φ(q2)) = ℓ(q2), and the constraint associated with (q1, q2) is met. All constraints are satisfied, so fk+1(v1) = q1 and v1 is not eliminated at the (k+1)-th iteration. The same holds for all other vertices in S, so v1 will never be eliminated. □

It is straightforward to see that Alg. 14 (cycle checking) will not throw out any vertices that are in an exact match. After r iterations, the vertices left seem like plausible matches in the sense that the same length r walks (in terms of labels) can be taken from a surviving vertex as from a correspondingly labeled vertex in the template.

Lemma 2. Under Assumption 1, let r iterations of Alg. 13 be performed. If fr(v0) = q0, then for every length r walk, W0, in G0 starting at q0, there is a length r walk, W, in G starting at v0 with the same sequence of vertex labels as W0.

Proof. Let W0 = {(q0, q1), (q1, q2), ..., (qr−1, qr)} be a length r walk starting at q0 in the undirected version of the template, G0. If fr(v0) = q0, then for each i = 1, ..., r there exists a vi ∈ V such that (vi−1, vi) ∈ E and fr−i(vi) = qi, with ℓ(vi) = ℓ(qi). □

Lemma 2 is extremely useful. For acyclic graphs, it helps us derive the maximum number of iterations that can be taken before no more vertices are eliminated.

Corollary 3. If G0 is acyclic, no more vertices are eliminated after kmax := diam(G0) + 1 iterations of Alg. 13.

Proof. Because G0 is acyclic, this is a direct consequence of Lemma 2. □

Additionally, Lemma 2 shows that when G0 has unique labels and is acyclic, Alg. 13 successfully removes all vertices from G that do not participate in an exact match.

Corollary 4.
If Assumption 1 is met and G0 is acyclic, then after kmax iterations of Alg. 13, every remaining vertex v0 ∈ T is a member of at least one S ⊂ V that matches G0.

Proof. If fkmax(v0) = q0, then we apply Lemma 2 to construct an S ⊂ V containing v0 and exactly matching G0. □

When G0 is not acyclic, Alg. 14 is also employed. However, checking cycle participation for vertices is not enough to guarantee that there are no false positive vertices in T after completing Alg. 12 (see Fig. 3.2, the 4×3 torus on the far right, for a pathological example). Additional constraints (e.g., distances and/or edge participation in cycles) are required to remove such structures. Under Assumption 2 that G0 is edge-monocyclic (edges participate in at most one cycle), we have the following result that guarantees no false positive vertices. It is important to note that K0 must contain every cycle C0 in G0. To prove the result, we leverage the tree-like quality of edge-monocyclic graphs and the local constraints met by vertices in T to construct a matching S ⊂ T.

Theorem 5. If Assumptions 1 and 2 are met, then when Alg. 12 terminates, every remaining vertex v0 ∈ T is a member of at least one S ⊂ V that matches G0.

Proof. We prove that if v0 survives local constraint checking and participates in every cycle listed in K0 involving q0, then an S that matches G0 can be constructed. Let U0 be any vertex spanning tree [20] of G0, rooted at q0, the unique vertex in the template for which ℓ(q0) = ℓ(v0). Additionally, we have R0, the set of edges in G0 not in the spanning tree U0. For each edge (qi, qj) ∈ R0, we have a single length-r cycle C0 involving (qi, qj) and (r−1) edges in U0. These |R0| cycles have no edge overlap by the edge-monocyclic property, and |K0| = |R0|.

By Lemma 2, there exists at least one tree subgraph U in T that matches U0. Below, we show that we are able to modify this tree within T until it yields an S
that matches G0. If the vertex set of U is not a match, then there exists an edge (qi, qj) in R0 such that the corresponding vertices vi, vj in U are not connected in G. We let (qi, qj) be the first such edge encountered in a deterministic breadth-first search tree ordering (where ties are broken by vertex number). In G0, edge (qi, qj) forms a cycle C0 emanating from the edge, up the spanning tree in both directions until the two paths meet at a mutual ancestor qa. By the edge-monocyclic property, the vertices in C0 contain no more crossing edges in R0; in particular, qi and qj are not connected to any other vertex in C0 other than their respective parents. The vertices in U associated with those in C0 are not involved in a cycle with the same labels. For va to be in T, there must be some other |C0| − 1 vertices in T that have a cycle matching C0, so we replace the part of U below va with the vertices that contain the cycle (including the other generations below, as guaranteed by Lemma 2) to get a new U′ that still contains root v0. Now either U′ is a match or a new edge (q′i, q′j) exists in R0 such that the corresponding vertices v′i, v′j in U′ are not connected in G. In this case, we repeat the process at most |K0| times until we have constructed a match. □

For edge-monocyclic G0, we also have a guarantee on how many iterations of each algorithm are required for T to converge with no false positives.

Corollary 6. If Assumptions 1 and 2 are met, after all cycles have been checked (from every vertex in the cycle) with Alg. 14, and kmax := diam(G0) + 1 iterations of Alg. 13 are run, then there are no false positives left in T.

The previous result shows we have no false positives for edge-monocyclic G0 if we run Alg. 14 (cycle checking) followed by kmax iterations of Alg. 13 (local constraint checking).
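For illustration, the token-passing idea of Alg. 14 that the results above rely on can also be sketched serially in Python. Encoding each cycle constraint as its label sequence is a simplification made for this example; it is valid here because template labels are unique under Assumption 1:

```python
# Sketch of Alg. 14 (serial cycle checking via token passing).
# G: adjacency lists {v: set(neighbors)}; label: v -> label; T: active vertex set.
# Each cycle constraint is a label sequence [l0, l1, ..., l_{r-1}] for a
# length-r cycle that starts and ends at a vertex labeled l0.

def cycle_checking(G, label, T, cycle_labels):
    T = set(T)
    for labels in cycle_labels:
        r = len(labels)
        # a token is a tuple (current vertex, issuing vertex, step)
        initiators = {v for v in T if label[v] == labels[0]}
        tokens = {(v0, v0, 0) for v0 in initiators}
        for s in range(1, r + 1):
            next_label = labels[s % r]     # step r must return to an l0 vertex
            passed = set()
            for v, v0, _ in tokens:
                for u in G[v]:
                    if u in T and label[u] == next_label:
                        passed.add((u, v0, s))
            tokens = passed
        # keep only initiators whose own token came back in exactly r steps
        completed = {v0 for v, v0, s in tokens if v == v0}
        T -= initiators - completed
    return T
```

As in the serial algorithm, the full solution would list each cycle once per participating vertex (all rotations) so every surviving vertex has issued its own tokens.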
In practice, to benefit from the efficient, aggressive elimination of local constraint checking and to minimize cycle checking, we run local constraint checking first, then check one cycle at a time followed by local constraint checking, repeating until all vertices participating in all cycles have had their participation in the cycles checked. This strategy takes nc·r iterations of Alg. 14, where nc is the number of template cycles and r is the average cycle length.

A.2 Proof Sketch for the Solution that offers Precision and Recall Guarantees for Arbitrary Search Templates

The work presented in Chapter 3 builds on the preliminary solution (published in [88], summary available in §3.6) that offers precision and recall guarantees for a restricted set of templates with constraints on the topology and vertex label distribution: (i) no two template vertices have the same label (Assumption 1) and (ii) the template is edge-monocyclic, i.e., no two cycles share an edge (Assumption 2). The generic constraint checking pipeline (§3.4) introduces path constraint checking and template-driven search to alleviate the above restrictions (i.e., Assumptions 1 and 2). Furthermore, the extended solution also offers precision and recall guarantees for the edges included in the solution set.

For templates with duplicate vertex labels, the extended local constraint checking routine in Alg. 3 eliminates vertices in the background graph that do not have the minimum number of distinct active neighbors with the same label, as prescribed in the template. By extending the proof for Theorem 1, we can show that this procedure does not throw out any valid vertex. Similar reasoning can be applied to construct a proof that shows no valid edge from the background graph will be eliminated either.

If the template has at least two vertices with the same label that are more than two hops away from each other, then path constraint checking (§3.4) is employed: similar to cycle checking (Alg.
14), it leverages a length r walk to verify uniqueness of the terminal vertices (of a walk in the background graph) that provisionally match vertices in the template with the same label (Alg. 5). We can construct a proof similar to that of Lemma 2 to show that if a match for the template exists in the background graph, such a length r walk also exists. Also, path constraint checking will not throw out any vertex that is in an exact match.

Template-driven search (TDS) addresses two shortcomings of the solution in [88]: it introduces advanced checks that are able to remove invalid substructures similar to the ones illustrated in Fig. 3.2, templates (b) and (c). More specifically, TDS aims at offering precision and recall guarantees for templates that (i) have more than two vertices with the same label and/or (ii) are non-edge-monocyclic, i.e., the template has at least two cycles that share an edge. In §3.4, Table 3.2, we present how TDS constraints are generated from existing path and cycle constraints, and Alg. 5 describes how TDS constraints are verified. Similar to cycle and path constraint checking, template-driven search also employs a length r walk that follows a sequence of vertices in the template. TDS constraints, however, are arbitrary substructures: a walk is constructed following a depth-first-like walk on the substructure. A length r walk representing a TDS constraint traverses each vertex and edge in the substructure at least once (possibly multiple times). For TDS constraint checking, for each walk in the background graph, Alg. 5 maintains a list containing the vertices visited in sequence. This additional information allows us to verify uniqueness of matching vertices that have the same label or, for non-edge-monocyclic templates, to verify that the same vertex has been visited multiple times in the order prescribed in the TDS constraint.
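To make the visited-list idea concrete, the following minimal sketch checks walks through the background graph against a constraint that marks which positions must revisit an earlier vertex and which must be distinct. The constraint encoding (a label sequence plus `same_as` links) is a simplification invented for this example, not the format used by Alg. 5:

```python
# Sketch: verifying background-graph walks against a TDS-style constraint.
# A constraint is a label sequence plus "same_as" links: same_as[i] = j means
# step i must revisit the vertex seen at step j; unlinked steps that share a
# label must be pairwise distinct.

def walks_matching_constraint(G, label, start, labels, same_as):
    def extend(walk):
        i = len(walk)
        if i == len(labels):
            yield walk
            return
        for u in G[walk[-1]]:
            if label[u] != labels[i]:
                continue
            if i in same_as:
                if u != walk[same_as[i]]:
                    continue          # must revisit the prescribed vertex
            elif any(u == walk[j] for j in range(i)
                     if labels[j] == labels[i] and j not in same_as):
                continue              # same-labeled fresh steps must be distinct
            yield from extend(walk + [u])

    if label[start] != labels[0]:
        return []
    return list(extend([start]))
```

A walk surviving this check is the analogue of a token that carries its visited-vertex list and satisfies the constraint at every step.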
Similar to the cycle checking discussed above, it is easy to see that TDS only removes vertices that violate the match constraints; thus, it offers recall guarantees. Furthermore, we can construct a proof similar to that of Theorem 5 to show that, for arbitrary search templates, when Alg. 1 terminates, every remaining vertex and edge in G∗(V∗, E∗) is a member of at least one match of the search template G0.

Appendix B

Complexity Analysis

We attempt to estimate the space, time, and generated message complexity for both the local constraint checking (LCC) and non-local constraint checking (NLCC) routines in §3.5. Note that, except for the first iteration of LCC, the constraint checking routines are invoked on the 'current' pruned solution subgraph G∗(V∗, E∗), where |G∗| ≤ |G|.

B.1 Local Constraint Checking

We mainly focus on analyzing the complexity of one iteration of the LCC routine presented in Alg. 3.

Space Complexity. In each iteration of LCC, each active vertex vi ∈ V∗ maintains a set of its template vertex matches/exclusions ω(vi), where |ω(vi)| = |V0|. Therefore, the space complexity of LCC is linear in the size of the template: O(|V∗| × |V0|). In our implementation, we use a bit vector to store the template vertex matches to reduce memory overhead. For example, if the template has 64 vertices, the per-vertex (of G∗) storage requirement is eight bytes. Additionally, in one iteration of LCC, an active vertex creates one visitor per active edge; therefore, the storage requirement for the visitor queue (the message queue in HavoqGT) is O(|E∗|).

Time Complexity. In each iteration of LCC, all active vertices in V∗ visit all their respective active neighbors (in E∗). In iteration k, only the vertices and edges that survived iteration k−1 are considered. Therefore, the time complexity of the k-th iteration is O(|V∗k−1| + |E∗k−1|).
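The bit-vector encoding mentioned under space complexity can be illustrated with a small sketch (the helper names here are illustrative, not HavoqGT's actual data layout): each active vertex keeps one bit per template vertex, so a 64-vertex template costs a single 64-bit word, i.e., eight bytes, per vertex.

```python
# Sketch: per-vertex match state as a bit vector, one bit per template vertex.
# For a template with at most 64 vertices, each state fits in one 64-bit word.

def make_state(num_template_vertices):
    # initially every template vertex is a candidate match (all bits set)
    return (1 << num_template_vertices) - 1

def exclude(state, q):
    # clear bit q: template vertex q is no longer a possible match
    return state & ~(1 << q)

def matches(state, q):
    # test whether template vertex q is still a candidate
    return (state >> q) & 1 == 1
```

Compared to a per-vertex set of template vertex ids, this keeps the O(|V∗| × |V0|) state compact and makes the per-check work a few bitwise operations.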
Initially, i.e., when k = 0 and no vertices or edges have been eliminated (V∗ = V and E∗ = E), the time complexity of the first iteration is O(|V| + |E|), the most expensive of all iterations. Let us assume LCC stops eliminating vertices and edges after kmax iterations; hence, the total (worst case) time complexity of LCC is O(kmax × (|V| + |E|)). For an acyclic template with unique labels, kmax = diam(G0) + 1 (see Corollary 3 for the proof). An analysis of the worst case complexity for an arbitrary template does not take us far: the upper bound on the maximum number of iterations in LCC is kmax ≤ |E|. The worst case is when, in each iteration, only a few or no vertices and/or edges are eliminated and a large number of iterations is required. In practice, for real-world, scale-free graphs, the first few steps of LCC reduce |G| by several orders of magnitude, yielding costs nowhere near the worst case bounds (see the evaluation section (§3.7) for multiple examples).

Message Complexity. In each iteration, an active vertex creates one visitor per active edge, resulting in one message. The analysis is similar to the one above: the message complexity of one iteration of LCC is O(|E∗|).

B.2 Non-local Constraint Checking

We study the complexity of the NLCC routine, presented in Alg. 5, for checking a single constraint C0 ∈ K0. Note that for a cyclic constraint, a token must be initiated from each vertex that may participate in the substructure representing C0, i.e., in Alg. 5, all vertices in G∗ that match at least one vertex in C0 initiate a token.

Space Complexity. The NLCC routine requires two additional algorithm states: (i) γ, the map of token source vertices (in G∗) for C0, requires at most O(|V∗|) storage.
(ii) τ(vj), the set of tokens already forwarded by a vertex vj, used for work aggregation: if C0 is edge-monocyclic and has unique vertex labels, the per-vertex storage requirement for τ(vj) is no more than O(|γ|), or O(|V∗| × |γ|) in total for G∗. For arbitrary templates, however, the cost is superpolynomial and proportional to the message complexity discussed later. Similarly, the worst case storage requirement for the visitor queue is also superpolynomial (and directly related to the generated message traffic).

Time Complexity. In NLCC, each constraint C0 ∈ K0 is verified by passing around tokens. Each active vertex in V∗ that could be a template match for the first vertex in C0 issues a token, identified by an entry in γ, where |γ| ≤ |V∗|. In the distributed message passing setting, token passing happens in a breadth-first search manner (on shared memory, a more work-efficient, depth-first search like implementation is possible). The effort related to token propagation is bounded by |γ| (the number of tokens), the average degree connectivity, and the depth of the propagation (i.e., the size of the constraint |C0|). For an arbitrary constraint C0, the cost is exponential: let r indicate a step in the walk represented by C0; at r = 1, in the worst case, a token is received by at most (|V∗| − 1) vertices, and at r = 2, each of these vertices forwards the same token to at most (|V∗| − 2) vertices. To propagate |γ| tokens, this results in visiting |γ| × (|V∗| − 1) × (|V∗| − 2) × ... × (|V∗| − r) vertices, where r = |C0|. Since |γ| ≤ |V∗|, the sequential cost of verifying constraint C0 is O(|V∗|^|C0|).

Message Complexity. As discussed above, in NLCC, each vertex visitation by (a copy of) a token results in one message.
Therefore, the message complexity of checking a non-local constraint C0 is O(|V∗|^|C0|).

Heuristics like work aggregation, however, prevent a vertex from forwarding duplicate copies of a token, which reduces the time and message propagation effort in practice.

Appendix C

Other Projects and Publications

Additionally, as a PhD student, the author of this dissertation has contributed to three other projects:

(i) SAR Data Processing on Multiple GPUs [J2, C4].
(ii) Graph Processing on CPU-GPU Hybrid Platforms [T1, W3].
(iii) NUMA-aware Graph Processing [W2].

These projects led to the following publications that the author of this dissertation has authored/co-authored:

(J2) T. Reza, A. Zimmer, J. Delgado Blasco, P. Ghuman, T. Aasawat, and M. Ripeanu. PtSel: Accelerating Persistent Scatterer Pixel Selection for InSAR Processing. IEEE Transactions on Parallel and Distributed Systems, TPDS, 29(1), 16–30, January 2018. (Accepted May 2017.)

(C4) T. Reza, A. Zimmer, P. Ghuman, and M. Ripeanu. Accelerating Persistent Scatterer Pixel Selection for InSAR Processing. The 26th IEEE International Conference on Application-specific Systems, Architectures and Processors, ASAP '15, Toronto, Ontario, 27–29 July 2015.

(W2) T. Aasawat, T. Reza, and M. Ripeanu. Scale-Free Graph Processing on a NUMA Machine. The 8th Workshop on Irregular Applications: Architectures and Algorithms, IA³ '18, co-located with SC '18, Dallas, Texas, 11–16 November 2018.

(W3) T. Aasawat, T. Reza, and M. Ripeanu. How Well Do CPU, GPU and Hybrid Graph Processing Frameworks Perform? The 4th IEEE International Workshop on High-Performance Big Data, Deep Learning, and Cloud Computing, HPBDC '18, co-located with IPDPS '18, Vancouver, British Columbia, 21 May 2018.

(T1) A. Gharaibeh, T. Reza, E. Santos-Neto, L. Beltrão Costa, S. Sallinen, and M. Ripeanu. Efficient Large-Scale Graph Processing on Hybrid CPU and GPU Systems. NetSysLab Technical Report, December 2014.