UBC Theses and Dissertations

Scale-Free Graph Processing on a NUMA Machine. Aasawat, Tanuj Kr, 2018.


Scale-Free Graph Processing on a NUMA Machine

by

Tanuj Kr Aasawat

B. Engineering, Jadavpur University, India, 2015

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF

Master of Applied Science

in

THE FACULTY OF GRADUATE AND POSTDOCTORAL STUDIES

(Electrical and Computer Engineering)

The University of British Columbia (Vancouver)

October 2018

© Tanuj Kr Aasawat, 2018

The following individuals certify that they have read, and recommend to the Faculty of Graduate and Postdoctoral Studies for acceptance, the thesis entitled:

Scale-Free Graph Processing on a NUMA Machine

submitted by Tanuj Kr Aasawat in partial fulfillment of the requirements for the degree of Master of Applied Science in Electrical and Computer Engineering.

Examining Committee:

Matei Ripeanu, Electrical and Computer Engineering (Supervisor)
Sathish Gopalakrishnan, Electrical and Computer Engineering (Examining Committee Member)
Karthik Pattabiraman, Electrical and Computer Engineering (Examining Committee Member)

Abstract

The importance of high-performance graph processing for solving big data problems in high-impact applications is greater than ever before. Graph workloads incur highly irregular memory accesses, which lead to poor data locality, load imbalance, and data-dependent parallelism. Distributed graph processing frameworks, such as Google's Pregel, that employ memory-parallel, shared-nothing systems have experienced tremendous success in terms of scale and performance. Modern shared-memory systems embrace the so-called Non-Uniform Memory Access (NUMA) architecture, which has proven more scalable (in terms of number of cores and memory modules) than the Symmetric Multiprocessing (SMP) architecture. In many ways, a NUMA system resembles a shared-nothing distributed system: physically distinct processing cores and memory regions (although cache-coherent in NUMA). Memory accesses to remote NUMA domains are more expensive than local accesses.
This presents the opportunity to transfer the know-how and design of distributed graph processing to develop shared-memory graph processing solutions optimized for NUMA systems, an avenue that remains surprisingly little explored.

In this dissertation, we explore whether a distributed-memory-like middleware that makes graph partitioning and communication between partitions explicit can improve performance on a NUMA system. We design and implement a NUMA-aware graph processing framework that treats the NUMA platform as a distributed system and embraces its design principles, in particular explicit partitioning and inter-partition communication. We further explore design trade-offs to reduce communication overhead and propose a solution that embraces the design philosophies of distributed graph processing systems and, at the same time, exploits optimization opportunities specific to single-node systems. We demonstrate up to 13.9× speedup over a state-of-the-art NUMA-aware framework, Polymer, and up to 3.7× scalability on a four-socket machine using graphs with tens of billions of edges.

Lay Summary

Large-scale graph processing introduces various performance and efficiency challenges due to the scale and inherently irregular topology of graphs. Distributed graph processing frameworks, like Google's Pregel, that employ multi-node platforms have experienced tremendous success in terms of scale and performance. Modern single-node systems embrace the Non-Uniform Memory Access (NUMA) architecture, which is more scalable than other architectures. In many ways, a NUMA system resembles a distributed system: physically distinct CPUs and memory. This presents the opportunity to transfer the wisdom of distributed graph processing to NUMA systems.

In this dissertation, we design and implement a NUMA-aware graph processing framework that explores whether a distributed-memory-like middleware that makes graph partitioning and inter-partition communication explicit can improve performance on a NUMA system.
We demonstrate up to 13.9× speedup over a state-of-the-art NUMA-optimized framework and up to 3.7× scalability on a four-socket machine using graphs with tens of billions of edges.

Preface

This thesis is based on the research project done by me under the supervision and guidance of Professor Matei Ripeanu. I was responsible for the design, implementation, modeling, validation, evaluation, and analysis of the results, along with taking the lead in the publication writing effort. The research presented in this thesis has been either published or accepted for publication.

The work that this thesis extends and evaluates against was selected based on the following preliminary study; Professor Ripeanu and Tahsin helped me in the analysis of the results and in editing the publication.

Tanuj Kr Aasawat, Tahsin Reza, Matei Ripeanu, "How Well Do CPU, GPU and Hybrid Graph Processing Frameworks Perform?", pages 458-466, 2018 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), May 2018.

The research presented herein has been accepted for publication. Professor Ripeanu and Tahsin helped me in the analysis of my design and the results, and in editing the publication.

Tanuj Kr Aasawat, Tahsin Reza, Matei Ripeanu, "Scale-Free Graph Processing on a NUMA Machine", IEEE Workshop on Irregular Applications: Architectures and Algorithms (IA3), held in conjunction with SC18, The International Conference for High Performance Computing, Networking, Storage, and Analysis, Dallas, TX, USA, November 2018.

Table of Contents

Abstract
Lay Summary
Preface
Table of Contents
List of Tables
List of Figures
Acknowledgments

1 Introduction
  1.1 Hypothesis
  1.2 Contributions
  1.3 Dissertation Structure

2 Background
  2.1 Graph
  2.2 Graph Processing
  2.3 Hardware Platforms
    2.3.1 Shared-nothing cluster
    2.3.2 Symmetric Multi-Processor (SMP) Architecture
    2.3.3 Non-Uniform Memory Access (NUMA) Architecture
  2.4 Bulk-Synchronous Parallel (BSP) Processing Model
  2.5 BSP-Style Graph Processing
  2.6 Graph Algorithms
    2.6.1 PageRank
    2.6.2 Breadth-First Search
    2.6.3 Single-Source Shortest Path
  2.7 Related Work

3 A BSP-style NUMA-aware Graph Processing Framework
  3.1 Intuition
  3.2 Graph Partitioning
  3.3 Design Opportunities for NUMA-aware Graph Processing
    3.3.1 Data structures
    3.3.2 NUMA 2-Box Design
    3.3.3 NUMA 1-Box Design
    3.3.4 NUMA 0-Box Design
  3.4 Analytical Model for Estimating Performance
  3.5 Mapping GAS model to NUMA design

4 Experiment Design
  4.1 System Implementation
  4.2 TestBed
  4.3 Workload
  4.4 Experimental Methodology

5 Experimental Results
  5.1 Impact of Graph Partitioning
    5.1.1 Partitioning - Key Insights
  5.2 Performance Evaluation of Designs
    5.2.1 Performance of NUMA 2-Box design
    5.2.2 Performance of NUMA 1-Box design
    5.2.3 Performance of NUMA 0-Box design
    5.2.4 Communication Designs - Key Insights
    5.2.5 Strong Scaling Experiments
    5.2.6 Graph500 submissions
  5.3 Accuracy of the Analytical Model
  5.4 Comparison with Existing Work

6 Conclusion

Bibliography

List of Tables

Table 2.1: Memory bandwidth characteristics of our testbed (four-socket Intel Xeon E7-4870 v2; more details in Section 4.2). Memory bandwidth is measured with a custom benchmark using arrays of size 1 GB. Memory latency is measured using the Intel Memory Latency Checker.

Table 3.1: Memory access pattern of the different NUMA designs, for the PageRank algorithm. V/V' and E/E' represent the number of local/remote vertices and edges in a partition. N is the number of partitions.
Memory accesses are represented by XZY, where X is the read/write operation, Y is the sequential/random access pattern, and Z indicates local/remote memory access.

Table 4.1: Workload used for evaluation.

Table 5.1: Cost model evaluation for the PageRank algorithm. The numbers represent the average speedup of NUMA 2-Box against the other designs, for all workloads, as predicted by the cost model and as observed empirically from experiments.

Table 5.2: Execution time (in seconds) and peak memory consumption (in GB) of Polymer and our best-performing NUMA design (NUMA-xB). We show the peak memory consumption among all the NUMA designs. Missing data points mean Polymer ran out of memory.

List of Figures

Figure 2.1: On the left, a directed graph with 6 vertices and 7 edges. On the right, the Compressed Sparse Row (CSR) representation of the graph. The CSR format has two arrays, a 'vertex array' (or offset array) and an 'edge array'. The 'vertex array' contains the starting index of the outgoing edges originating from each vertex (represented by the indices of the vertex array). The 'edge array' contains only the destination vertices of the edges.

Figure 2.2: An illustration of the SMP/UMA (left) and NUMA (right) architectures.

Figure 2.3: Local and remote memory bandwidth for read and write operations on a 4-socket Intel Xeon (E7-4870 v2, Ivy Bridge) machine, for different workloads (size in MB).

Figure 2.4: A high-level illustration of the Bulk Synchronous Parallel model.

Figure 2.5: BSP graph processing, depicting the computation and communication phases in a superstep. In the computation phase, the state of a vertex and its shadow copy (on a remote partition) are updated independently. In the communication phase, the vertex and its shadow copy communicate to determine the correct state.
For example, in SSSP, after communication, both the vertex and its shadow copy commit to the minimum distance at that point in the traversal.

Figure 3.1: High-level illustration of the inter-partition communication of the NUMA 2-Box design. V and E are the buffers that represent the graph in CSR format (as described in Section 2.1 and Figure 2.1). S is the state buffer for local vertices. The bottom blue and red solid lines depict the communication paths for NUMA 2-Box, where explicit memory copies through in- and out-boxes are required. For push-based algorithms, during the computation phase, each partition manipulates its local state buffer S for local vertices, and updates for remote vertices are aggregated locally in the outbox buffer. During the communication phase, the outbox is copied into the inbox of the respective remote partition, and the updates are then applied to the respective local state buffers.

Figure 3.2: High-level illustration of the NUMA 1-Box design. V and E are the buffers that represent the graph in CSR format (as described in Section 2.1 and Figure 2.1). S is the state buffer for local vertices. For push-based algorithms, during the computation phase, each partition manipulates its local state buffer S for local vertices, and updates for remote vertices are aggregated locally in the outbox buffer. During the communication phase, the updates in the remote outbox are accessed sequentially and applied to the respective local state buffers.

Figure 3.3: High-level illustration of the NUMA 0-Box design, which overlaps computation with communication. S is the state buffer for local vertices. Note that in this design we dispense with the communication infrastructure. During the computation phase, each partition manipulates its local state buffer S for local vertices, and remote updates are written directly to the respective local state buffer of the remote partition. Atomic writes are used to ensure correctness.
Figure 3.4: Number of remote vertices vs. number of remote updates in each partition in every superstep of Direction-Optimized BFS (BFS-DO) for RMAT31 in the NUMA 2-Box design. The Y-axis represents the frequency (in millions, log scale) of the number of remote updates (per superstep) and the total number of remote vertices in each partition. The ss-'x' labels on the X-axis represent the sequence of supersteps. We observe that remote updates are ∼22× fewer than the number of remote vertices.

Figure 3.5: Gather, Apply, and Scatter phases in the NUMA 2-Box design, for a pull-based algorithm. During the computation phase, each local vertex gathers the state of its neighbors and applies the computed value to its state. In the communication phase, it scatters its new state to the respective inbox buffer, which is then copied to the outbox buffer on the remote partition. In this way, all the shadow copies of a vertex have the same state before the start of the next superstep.

Figure 4.1: High-level design of our BSP-style NUMA-aware framework.

Figure 5.1: Load imbalance of the traditional (Sorted and Random) and the new Hybrid partitioning strategies for the PageRank algorithm on the RMAT31 graph using the NUMA 2-Box design. The X-axis shows the supersteps (denoted by 'ss') required for execution. The Y-axis shows the computation time (lower is better) of the four partitions for the three partitioning strategies.

Figure 5.2: Load imbalance of the traditional (Sorted and Random) and the new Hybrid partitioning strategies for the BFS-DO algorithm on the RMAT31 graph using the NUMA 2-Box design. The X-axis shows the supersteps (denoted by 'ss') required for execution. The Y-axis shows the computation time (lower is better) of the four partitions.

Figure 5.3: Performance of the NUMA designs against Totem and numactl for the PageRank (left) and BFS-DO (right) algorithms on the RMAT31 graph.
The Y-axis shows execution time (lower is better). For NUMA-2B and NUMA-1B, where we do explicit communication, we show the breakdown of execution time into computation and communication time.

Figure 5.4: Billions of traversed edges per second (TEPS) achieved by Totem, numactl, and the NUMA designs, for (a) PageRank and (b) BFS-DO algorithms on RMAT[28-32] (synthetic), and Twitter and clueWeb (real-world) workloads. Note that the Y-axis shows traversed edges per second (TEPS) (higher is better).

Figure 5.5: Billions of traversed edges per second (TEPS) achieved by Totem, numactl, and the NUMA designs, for (a) BFS-TD and (b) SSSP algorithms on RMAT[28-32] (up to RMAT31 for SSSP, since it requires a weighted edge list) (synthetic), and Twitter and clueWeb (real-world) workloads. Note that the Y-axis shows traversed edges per second (TEPS) (higher is better).

Figure 5.6: Strong scaling of Totem, numactl, and the NUMA designs on our 4-socket machine compared to 1 socket (memory: 384 GB). Weighted RMAT29 (weighted edge-list size: 192 GB) is used for SSSP, and unweighted RMAT30 (edge-list size: 256 GB) is used for all other algorithms.

Figure 5.7: Analytical model prediction for the PageRank algorithm. The Y-axis shows the speedup of NUMA 2-Box against NUMA 1-Box and NUMA 0-Box as predicted by the cost model and as calculated empirically through experiments.

Acknowledgments

I would like to sincerely thank my advisor, Professor Matei Ripeanu, for giving me the opportunity to work on this high-quality research project and for his invaluable guidance, insightful feedback, and support throughout this journey.
I am grateful to him for encouraging and allowing me to do internships at IBM Almaden Research Center and Amazon Web Services.

I would like to thank my lab colleagues for providing their feedback during meetings and research presentations, and for sharing their knowledge and experience with me.

Last but not least, this journey would not have been possible without the consistent motivation from my brother Manish, and the immense support and inspiration from my parents Jawahar and Mohini. My deepest gratitude to them.

To my parents and brother

Chapter 1

Introduction

Graph processing is at the core of a wide range of big data problems, such as online social network analysis [11, 18], bioinformatics [19, 29], transport network analysis [37], and financial and business analytics [20], to name a few. Additionally, graph processing has found new applications in machine learning and data mining.

Graph algorithms incur highly irregular, data-dependent memory access patterns, which lead to poor data locality. Further, most graph algorithms have a low compute-to-memory-access ratio, i.e., they are memory-bound. Many real-world graphs are massive: some have hundreds of billions of edges, and hence a huge memory footprint. For example, the Facebook graph [8] and the Web Data Commons hyperlink graph [4] each have more than 100 billion edges, which requires over two terabytes of memory.

To process such huge graphs, frameworks like Google's Pregel [26] and GraphLab [17] running on large shared-nothing clusters have traditionally been used, as these platforms provide large aggregate memory. Most of these frameworks use the Bulk Synchronous Parallel (BSP) Processing Model [39].
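The BSP pattern over explicitly partitioned data can be sketched as follows. This is a hypothetical, single-threaded Python illustration of the compute/communicate/barrier structure; the function names and message-passing interface are our own, not Pregel's actual API or the framework developed in this thesis:

```python
# Hypothetical sketch of the BSP pattern over explicit partitions.
# Each superstep has a computation phase (each partition touches only its
# own state and inbox) followed by an explicit communication phase.

def bsp(partition_state, compute, max_supersteps=50):
    """Run BSP supersteps until no partition emits messages.

    partition_state: dict partition_id -> mutable local state
    compute(pid, state, inbox) -> dict dest_partition_id -> [messages]
    """
    inboxes = {pid: [] for pid in partition_state}
    for _ in range(max_supersteps):
        outgoing = {pid: [] for pid in partition_state}
        # Computation phase: partitions work independently.
        for pid, state in partition_state.items():
            for dest, msgs in compute(pid, state, inboxes[pid]).items():
                outgoing[dest].extend(msgs)
        # Barrier + communication phase: deliver all outboxes at once.
        if not any(outgoing.values()):
            break  # global convergence: no messages in flight
        inboxes = outgoing
    return partition_state

# Toy use: two partitions each hold one value; supersteps propagate the max.
state = {0: {"value": 3, "sent": False}, 1: {"value": 7, "sent": False}}

def propagate_max(pid, st, inbox):
    new_value = max([st["value"]] + inbox)
    if new_value != st["value"] or not st["sent"]:
        st["value"], st["sent"] = new_value, True
        return {other: [new_value] for other in state if other != pid}
    return {}

bsp(state, propagate_max)
print(state[0]["value"], state[1]["value"])  # both converge to 7
```

In a real framework the computation phase runs one worker per node (or, in this thesis, per NUMA domain) in parallel, and the barrier is a genuine global synchronization point.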
Here, the graph is partitioned explicitly among the processing units, and, as these clusters are not cache-coherent, the communication between the processing units is explicit. This is in contrast with graph processing frameworks [30, 35] that target single-node shared-memory systems and treat the shared-memory system as if it were based on the Symmetric Multi-Processor (SMP) architecture. In the SMP architecture, the access time to any location in memory is uniform; therefore, there is no need for data partitioning.

Non-Uniform Memory Access (NUMA, a.k.a. distributed shared-memory) machines introduce a dilemma. On the one hand, they provide shared memory, so graph processing frameworks that treat a shared-memory system as an SMP architecture can be used directly. On the other hand, the cost of memory accesses is non-uniform (i.e., a socket has faster access to the local memory associated with it than to remote/non-local memory associated with other sockets), so explicit data placement is needed to obtain maximum performance, and a graph framework developed in the style of frameworks that target distributed systems may prove to offer advantages.

1.1 Hypothesis

Since the NUMA architecture resembles distributed systems, our intuition is that a graph processing framework targeting the NUMA architecture, developed in the style of frameworks that target distributed systems (explicit partitioning and communication), provides the following three potential avenues for performance improvement: (i) control over data placement through explicit partitioning, which allows the design of, and experimentation with, different partitioning strategies to improve load balancing and overall performance, (ii) better locality, and (iii) the ability to explore different communication trade-offs, since NUMA is a shared-memory system.
Based on these intuitions, we postulate the following hypothesis: a distributed-memory-like middleware that makes graph partitioning and communication between partitions explicit can improve performance on a NUMA system.

To test this hypothesis, we design and implement a NUMA-aware graph processing framework that treats the NUMA platform as a distributed system and hence embraces its design principles, in particular explicit partitioning and communication, and we evaluate it against state-of-the-art NUMA-oblivious [15] and NUMA-aware [42] graph processing frameworks. We further describe optimization techniques to reduce communication overhead. Finally, we provide a set of practical guidelines for choosing the appropriate partitioning and communication strategies.

1.2 Contributions

The contributions of this dissertation are:

1) Design Exploration: Given their resemblance, there exist opportunities to transfer the know-how and design philosophies of distributed graph processing to develop shared-memory graph processing solutions optimized for NUMA systems. To this end, we explore a reference distributed design (Section 3.3). In particular, we evaluate the performance of a fully distributed (referred to as the NUMA 2-Box design, §3.3.2) and a shared-memory (referred to as the NUMA 1-Box design, §3.3.3) inter-partition communication strategy (where each partition belongs to a NUMA domain), and examine how they compare against a NUMA-oblivious implementation.
We found that, on a NUMA platform, a graph processing solution based on the design philosophies that target shared-nothing distributed systems consistently outperforms the state-of-the-art NUMA-oblivious shared-memory solution (Section 5.2). Additionally, we explore two distributed graph partitioning techniques for NUMA, and introduce a new partitioning technique (Section 3.2) that leads to load balance of up to 95% and overall performance improvements of up to 5.3×.

2) A New NUMA-aware Design: Based on our design explorations, we propose a design (referred to as the NUMA 0-Box design, §3.3.4) that takes into account the distributed shared-memory nature of NUMA and consists of explicit graph partitioning and implicit communication. It improves data locality through NUMA-aware partitioning and, at the same time, minimizes the overhead of remote accesses by overlapping remote memory operations with computation (Section 3.3).

Our evaluation shows that this new design offers up to 2.37× improvement in time-to-solution for BFS, up to 2.27× for SSSP, and up to 1.89× for PageRank over the respective NUMA-oblivious implementations. This design, however, did not improve the performance of PageRank over the NUMA 2-Box design (explained in Section 5.2).

3) Analytical Model for Performance Prediction: We present an analytical model for predicting algorithm performance for the three aforementioned NUMA designs (Section 3.4). We demonstrate the effectiveness of our prediction model for PageRank by comparing it with empirical results (Section 5.3).

4) Evaluation: We evaluate the aforementioned three NUMA-aware designs for the following applications: PageRank, BFS, and SSSP, using both real-world and synthetic graphs (with up to 128 billion undirected edges), on an Intel NUMA platform with four sockets and 1.5 TB of memory. A summary of our findings follows:

(i) We compare the three graph partitioning strategies and find that our proposed approach offers up to 5.3× speedup and 95% load-balanced partitions.
(Section 5.1)

(ii) We demonstrate scalability on up to four sockets on a NUMA platform: the maximum speedup (over one socket) achieved is 3.7× for PageRank, 2.9× for BFS, and 2.8× for SSSP. (Section 5.2)

(iii) We show RMAT scaling using graphs of up to Scale 32. Our BFS implementation achieves a maximum of 39 giga traversed edges per second (GTEPS). (Section 5.2)

(iv) We compare our work with a recent NUMA-aware graph processing framework, Polymer, and demonstrate that our solution consistently outperforms Polymer, e.g., up to 13.9× faster for BFS. Additionally, our solution is ∼4.4× more memory-efficient. (Section 5.4)

(v) Finally, we present the performance numbers we achieved in the Graph500 competition, where we secured World Rank 2 (June 2018 list) for the SSSP kernel, and placed among the top 3 single-node submissions for the BFS kernel. (Section 5.2)

1.3 Dissertation Structure

The rest of this dissertation is organized as follows. Chapter 2 presents background and related work. Chapter 3 describes the design of our NUMA-aware graph processing framework. Chapter 4 presents the methodology used to implement and evaluate the designs introduced in Chapter 3. Chapter 5 evaluates the performance of our designs. Chapter 6 concludes the dissertation.

Chapter 2

Background

This chapter provides the background information required to understand the contributions of this dissertation. First, it presents a brief overview of graphs (§2.1) and graph processing (§2.2). Then it describes three common CPU-based hardware platforms (§2.3) used for graph processing. Next, it describes the Bulk-Synchronous Parallel (BSP) model (§2.4), a popular processing model for distributed systems, and explains how it is leveraged in the context of graph processing (§2.5). Finally, it provides an overview of the graph algorithms that we use (§2.6), followed by related work (§2.7).

2.1 Graph

A graph G = (V, E), as shown in Figure 2.1, consists of a set of vertices V and a set of edges E.
If the edges of a graph are unidirectional, the graph is called a directed graph; if all the edges are bidirectional, it is called an undirected graph. The edges of a directed graph are drawn as arrows (pointing towards the destination vertex), as shown in Figure 2.1, while the edges of an undirected graph are typically drawn as lines.

Graph Storage. Graphs are usually stored using either linked lists or arrays. For high performance, and to store large graphs in memory efficiently, array-based formats such as Compressed Sparse Row (CSR), Coordinate (COO), Compressed Sparse Column (CSC), or Doubly Compressed Sparse Column (DCSC) are used.

[Figure 2.1: On the left, a directed graph with 6 vertices and 7 edges. On the right, the Compressed Sparse Row (CSR) representation of the graph. The CSR format has two arrays, a 'vertex array' (or offset array) and an 'edge array'. The 'vertex array' contains the starting index of the outgoing edges originating from each vertex (represented by the indices of the vertex array). The 'edge array' contains only the destination vertices of the edges.]

Figure 2.1 presents the CSR format, a popular format (which we also use) for achieving better performance and memory efficiency. The CSR format targets directed graphs; to store an undirected graph, each of its edges is represented by two directed edges (one in each direction). As shown in Figure 2.1, the 'vertex array' and the 'edge array' are the two CSR data structures. The size of the 'vertex array' equals the number of vertices in the graph, and it contains the starting index of the neighbor list of each vertex. The indices of the 'vertex array' represent the vertex IDs. The 'edge array' contains the destination vertices of the edges originating from each source vertex in the 'vertex array', and its size equals the number of edges in the graph.
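To make the layout concrete, the following sketch (our own illustrative Python, not the thesis framework's code) builds the CSR arrays for the graph of Figure 2.1. Note one small variant relative to the description above: it appends a sentinel entry at the end of the vertex array, so the neighbor list of every vertex, including the last one, spans [vertex_array[v], vertex_array[v+1]).

```python
# Illustrative CSR construction for the example graph of Figure 2.1.
# The edge list below is read off the figure: 0->{1,2,3}, 2->1, 3->4, 4->5, 5->0.
edges = [(0, 1), (0, 2), (0, 3), (2, 1), (3, 4), (4, 5), (5, 0)]
num_vertices = 6

def build_csr(num_vertices, edges):
    """Build (vertex_array, edge_array) for a directed graph."""
    # Count the out-degree of each vertex.
    degree = [0] * num_vertices
    for src, _ in edges:
        degree[src] += 1
    # Prefix sums give each vertex's starting offset into edge_array;
    # the extra sentinel slot holds the total edge count.
    vertex_array = [0] * (num_vertices + 1)
    for v in range(num_vertices):
        vertex_array[v + 1] = vertex_array[v] + degree[v]
    # Fill in destination vertices, tracking the next free slot per vertex.
    edge_array = [0] * len(edges)
    cursor = list(vertex_array[:-1])
    for src, dst in edges:
        edge_array[cursor[src]] = dst
        cursor[src] += 1
    return vertex_array, edge_array

def neighbors(vertex_array, edge_array, v):
    # Neighbor list of v spans [vertex_array[v], vertex_array[v+1]).
    return edge_array[vertex_array[v]:vertex_array[v + 1]]

vertex_array, edge_array = build_csr(num_vertices, edges)
print(vertex_array)                            # [0, 3, 3, 4, 5, 6, 7]
print(neighbors(vertex_array, edge_array, 2))  # [1]
```

The two printed arrays match the 'vertex array' and 'edge array' of Figure 2.1 (plus the sentinel 7), and the neighbor-list lookup reproduces the vertex-2 example worked out in the text.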
For example, in Figure 2.1, vertex 2 has one outgoing edge, to vertex 1. The value 3, at index 2 in the 'vertex array', is the index of the start of the neighbor list of vertex 2 in the 'edge array'. The size of the neighbor list of a vertex is determined by subtracting the value at the vertex's index in the 'vertex array' from the next value (or from the total edge count, for the last vertex). Therefore, the neighbor list size of vertex 2 is (value at index 2+1) - (value at index 2), i.e., 4 - 3 = 1.

2.2 Graph Processing

A wide set of big data problems, like analyzing online social networks, bioinformatics, financial and business analytics, and transport network analysis, to name a few, can be modeled as graphs. For example, in online social networks, people are represented by vertices and an edge between two people represents their friendship. Further, domains like machine learning and data mining are also exploring graph processing at their core. In all these high-impact applications, in order to extract meaningful insights from the huge data, these massively large graphs need to be processed fast yet efficiently (w.r.t. cost).

Graph algorithms and workloads have the following key characteristics that make them challenging to process efficiently.

1. Iterative. A typical graph algorithm processes a graph in rounds, where in each round only a set of vertices is active and can be processed in parallel. For example, in BFS, processing starts from a source vertex and activates only its neighbor vertices, which then iterate over their respective neighbor vertices in the next round, and so on.

2. Highly irregular, data-dependent memory access patterns. Graph processing suffers from highly irregular, data-dependent memory access patterns, as the neighbors of a vertex are scattered in memory. This leads to poor data locality and a high number of random memory accesses.

3. Low compute-to-memory-access ratio. Most graph algorithms, like BFS and SSSP, have a low compute-to-memory-access ratio, i.e.,
they do very little computation per memory access, and are thereby memory-bound. For example, in BFS, very little processing is done on the data associated with a vertex, and most of the time is spent accessing the neighbors.

4. Hard to obtain balanced partitions. Many real-world graphs have a heavily skewed, 'power-law' [14] vertex degree distribution: most of the vertices have a low edge degree, while a few vertices have a high edge degree (e.g., celebrities in online social networks) and connect to a large part of the graph. These types of graphs are also called scale-free graphs.

If we process these graphs on a distributed system, their heavily skewed degree distribution makes it hard to obtain balanced partitions, which hurts load balance and overall performance. Furthermore, the partitioning algorithms developed specifically to obtain good partitions are computationally expensive.

5. Large memory footprint. Many real-world graphs are massive: some have hundreds of billions of edges, leading to a huge memory footprint. For example, the current Facebook graph (a social network graph with ∼137 billion edges) and the Web Data Commons hyperlink graph (a web graph with ∼128 billion edges) [3, 4] require more than 2 TB of memory.
To process such large graphs efficiently, the whole graph needs to be in memory.

2.3 Hardware Platforms

This section describes the three popular CPU-based hardware platforms.

2.3.1 Shared-Nothing Cluster

A distributed system, or shared-nothing cluster, consists of hundreds to thousands of processing units (also called nodes), each with access to only its own memory, connected with each other through fast interconnects such as OmniPath and InfiniBand.

Given the huge memory footprint of real-world graphs, traditionally these large shared-nothing clusters have been used, as they offer large aggregated memory. These memory-parallel, shared-nothing clusters have experienced tremendous success in terms of scale and performance, as can be seen in the Graph500 competition [2] (which ranks supercomputers for data-intensive applications), and they have been used by many distributed graph processing frameworks including Google’s Pregel [26] and GraphLab [17].

These shared-nothing clusters are not cache-coherent. Here, the graph is partitioned explicitly (one partition on each of the nodes) and the communication between nodes is explicit. Graph partitioning leads to boundary edges that cross over between nodes. The nodes run the graph algorithm kernel in parallel on their respective partitions, and communicate with other processing units to share remotely updated vertex state.

Figure 2.2: An illustration of SMP/UMA (left) and NUMA (right) architectures.

2.3.2 Symmetric Multi-Processor (SMP) Architecture

In the Symmetric Multi-Processor (SMP) architecture, memory is shared between processing units, and it provides uniform access time to any location in memory. It is, therefore, also termed the Uniform Memory Access (UMA) architecture. As shown in Figure 2.2 (left), the CPU cores share the same memory bus to access the memory. This leads to uniform access time from any core to any location in memory.
In recent SMP machines, the memory available can be up to a few hundred gigabytes. This helps in processing mid-size graphs that fit into memory, at a lower development cost compared to a distributed system, as there is no need for explicit partitioning and inter-partition communication.

The drawback of this architecture is that, as the memory bus is shared among all the cores, the design leads to contention on the shared bus as the number of CPU cores increases. Therefore, the design does not scale with the number of CPU cores and memory modules.

2.3.3 Non-Uniform Memory Access (NUMA) Architecture

NUMA is a shared-memory architecture consisting of a set of processors (often called sockets), each with its own local memory. Each socket is connected with the other sockets through an interconnect (Quick-Path Interconnect in Intel systems). Accessing socket-local memory takes distinctly less time than accessing remote memory over the interconnect.

Figure 2.3: Local and remote memory bandwidth for read and write operations on a 4-socket Intel Xeon (E7-4870 v2, Ivy Bridge) machine, for different workloads (size in MB).

NUMA addresses the scalability issues of the SMP architecture, and provides higher overall memory bandwidth. NUMA distributes memory to each processor (Fig. 2.2, right): each processor has fast access to its local memory, while to access the local memory of another processor it has to traverse the slower interconnect. This reduces contention on the local memory bus and provides the opportunity to scale the system. Note that scalability comes at the cost of remote memory accesses. Obtaining maximum
Obtaining maximumperformance requires careful placement of data to avoid/minimize remote memoryaccesses with lower latency and higher bandwidth.We benchmark our testbed, a four socket Intel Xeon system (more descriptionin Section 4.2), by extending Stream [27] benchmark to measure local and remote,read and write memory bandwidth for both sequential and random accesses. Wemeasure these access patterns, because they frequently occur in graph processing.We run the experiments for array size from 1 MB and double the array size until1 GB (at which point it saturates the memory bandwidth). Table 2.1 presents thememory bandwidth achieved in different access modes for arrays of size 1 GB. We10Access Local RemoteRead Throughput (MB/s) Sequential 2464 2069Random 286 226Write Throughput (MB/s) Sequential 1438 1024Random 238 188Latency (ns) 119 178Table 2.1: Memory bandwidth characteristics of our Testbed (Four socket In-tel Xeon, E7-4870 v2; more description in Section 4.2). Memory band-width is measured with custom benchmark with arrays of size 1 GB.Memory latency is measured using Intel Memory Latency Checker.observe that non-local access throughput is up to 26% and 40% slower than localaccess throughput for read and write operations, respectively. Interestingly, remotesequential access throughput is as much as 9× and 6× more than local randomaccess throughput for read and write operations, respectively. This observation isimportant for graph processing, which incurs highly irregular memory access pat-terns. On the other side, remote memory latency, as measured using Intel MemoryLatency Checker, is 49% more than local memory latency.2.4 Bulk-Synchronous Parallel (BSP) Processing ModelBulk-Synchronous Parallel (BSP) model is a popular processing model targetingdistributed systems. Therefore, BSP model implies that the data is partitioned andpartitions are allocated on the processing elements. In the BSP model (Fig. 
2.4), the processing consists of a sequence of rounds, or supersteps in BSP terminology. Each superstep consists of three phases, executed in order: computation, communication, and synchronization.

- In the computation phase, each processing unit processes its respective partition independently.

- In the communication phase, processing units exchange messages with the respective remote partitions, and apply the remote updates to their local buffers.

- The synchronization phase guarantees that the next cycle restarts only after all messages have been delivered.

Figure 2.4: A high-level illustration of the Bulk Synchronous Parallel model.

This sequence of supersteps continues until a convergence or termination condition has been satisfied. Finally, if required, the results are aggregated from all the partitions.

2.5 BSP-Style Graph Processing

Graph computations can be modeled as Gather-Apply-Scatter (GAS) [17], where graph processing follows sequences of gather, apply, and scatter operations. In the gather phase, vertices gather information from their neighbors to update their local state in the apply phase; then, in the scatter phase, they communicate their updated value to their neighbors. For example, in PageRank, a vertex computes its rank by gathering the ranks of its in-degree neighbors, and scatters its new rank to its out-degree neighbors. The BSP processing model inherently matches this iterative nature of graph algorithms.

Figure 2.5: BSP graph processing depicting computation and communication phases in a superstep.
In the computation phase, the state of a vertex and its shadow copy (on a remote partition) are updated independently. In the communication phase, the vertex and its shadow copy communicate to determine the correct state. For example, in SSSP, after communication, both the vertex and its shadow copy commit to the minimum distance at that point in the traversal.

This sequence of gather, apply, and scatter operations resembles the three phases of a superstep in the BSP model.

Since the BSP model implies that the data is partitioned and allocated on the processing elements, the initial step is to partition the graph (the partitions in Figure 2.5 are of the graph shown in Figure 2.1). Each partition has a set of local vertices and edges. Since an edge is associated with two vertices, which could be on different partitions, each partition maintains a map for its remote vertices (vertex 3 in PID0 and vertex 0 in PID1, in Fig. 2.5). Further, each partition maintains algorithm-specific state buffer(s) (such as the rank array in PageRank) for local vertices (buffer S0 in PID0 and S1 in PID1, in Fig. 2.5) as well as for remote vertices (buffer S′1 for remote vertices in PID0 and S′0 for remote vertices in PID1, in Fig. 2.5).

The three phases of a superstep of the BSP model are performed as follows in the context of graph processing.

In the computation phase, processing units work in parallel, execute the graph algorithm's kernel on the set of vertices belonging to their partition, and update their local state buffer (buffers S0 and S1 in Fig. 2.5). The local state of active remote vertices is also updated and aggregated locally in the respective buffer (buffers S′1 and S′0 in Fig. 2.5).

In the communication phase, each partition exchanges the messages for the boundary edges, and applies the remote updates received to its local state buffers. In Fig. 2.5, both partitions transfer the state of remote vertices to make sure the local and remote states of the vertices are the same, i.e.
S0 and S′0 are the same, and S1 and S′1 are the same.

Finally, the synchronization phase ensures that all the partitions are updated with the latest state of the remote vertices before the superstep cycle restarts.

Similar to the generic BSP model, the sequence terminates once every processing unit has finished processing its respective partition. After termination, the final result is aggregated from all the processing units through a global reduction.

2.6 Graph Algorithms

We consider the PageRank, Breadth-First Search - Top Down (BFS-TD), Breadth-First Search - Direction Optimized (BFS-DO), and Single-Source Shortest Path (SSSP) algorithms. We use these algorithms to evaluate our work because (i) they have been widely studied in the context of high-performance graph processing systems and used in past studies [5, 15, 17, 30, 35, 36, 42], (ii) BFS and SSSP are also used as benchmarks for the Graph500 competition [2], which ranks supercomputers for data-intensive applications, (iii) they are the building blocks of more complex graph algorithms (for example, BFS is used as a subroutine in connected components, max flow, betweenness centrality, and clustering), and (iv) they are good representatives for studying the performance of any hardware platform under irregular memory access patterns.

A short description of each algorithm follows.

2.6.1 PageRank

PageRank [31] is a well-known algorithm used by search engines for ranking web pages. In PageRank, a vertex computes its rank based on the ranks of its neighbors. The algorithm continues until the ranks of all vertices converge, or a predefined number of iterations has been completed. PageRank has a high compute-to-memory-access ratio, and the workload is stable in every iteration, since each iteration computes the rank of every vertex. It can be implemented as a pull-based or a push-based algorithm [35].
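As a concrete sketch of the pull-based variant (which we implement), the following toy, single-threaded version assumes a hypothetical CSR layout of incoming edges (`in_index`/`in_edges`) and a fixed iteration count, as in our implementation:

```python
def pagerank_pull(in_index, in_edges, out_degree, iterations=10, d=0.85):
    """Pull-based PageRank over a CSR of incoming edges.

    in_index/in_edges: CSR arrays of each vertex's in-neighbors
    (hypothetical layout for illustration); out_degree[v] is the
    number of outgoing edges of vertex v.
    """
    n = len(in_index) - 1
    rank = [1.0 / n] * n
    for _ in range(iterations):          # fixed iteration count (Section 2.6.1)
        new_rank = [0.0] * n
        for v in range(n):               # each vertex pulls from its in-neighbors
            s = sum(rank[u] / out_degree[u]
                    for u in in_edges[in_index[v]:in_index[v + 1]])
            new_rank[v] = (1.0 - d) / n + d * s
        rank = new_rank                  # each vertex writes only its own slot,
                                         # so no atomics are needed
    return rank
```

Because each vertex writes only its own rank entry, a parallel version of this loop needs no atomic operations, which is the efficiency argument made above.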
In the pull-based approach, each vertex ‘pulls’ the ranks of its neighbors over its incoming edges to compute its new rank. In the push-based approach, each vertex ‘pushes’ its rank to its neighbors over its outgoing edges. Note that the push-based approach is less efficient, since its parallel implementation requires atomic operations [30]. We implement the pull-based approach, and the algorithm kernel executes for a predefined number of iterations [16].

2.6.2 Breadth-First Search

BFS is a graph traversal algorithm which determines the level of each vertex, starting from a source vertex. It is a fundamental graph algorithm which is also used as a subroutine in complex graph algorithms, like connected components, betweenness centrality, max flow, and clustering. BFS has a low compute-to-memory-access ratio, and since it is a traversal-based algorithm, the workload is not stable across iterations (or supersteps). Like other graph traversal algorithms, BFS uses the concept of a frontier: the set of active vertices that are processed in the current iteration to build the next frontier. The next frontier can be manipulated in different ways; we explore three implementations of the BFS algorithm.

BFS - Top-Down (BFS-TD)

This is the classic level-synchronous approach to BFS. In each iteration it processes all the edges of the vertices in the current frontier, to build the next frontier from the unvisited vertices that can be reached.
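The top-down step can be sketched over a CSR graph as follows (a toy, single-threaded version; a parallel implementation would need atomic operations to claim a vertex for the next frontier, which is the write-traffic issue discussed below):

```python
def bfs_top_down(index, edges, source):
    """Level-synchronous top-down BFS over a CSR graph.

    index/edges: CSR vertex and edge arrays (as in Section 2.1).
    Returns the BFS level of each vertex (-1 if unreachable).
    """
    n = len(index) - 1
    level = [-1] * n
    level[source] = 0
    frontier = [source]
    depth = 0
    while frontier:
        depth += 1
        next_frontier = []
        for u in frontier:                        # scan every edge of the frontier
            for v in edges[index[u]:index[u + 1]]:
                if level[v] == -1:                # unvisited: claim it
                    level[v] = depth
                    next_frontier.append(v)
        frontier = next_frontier
    return level
```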
For power-law graphs, it has been observed that this approach leads to (1) a drastic increase in frontier size in the initial few supersteps, followed by a steep decrease in the tailing supersteps, and (2) high write traffic in the initial supersteps, since many edges in the current frontier try to add the same vertex to the next frontier [10].

BFS - Direction Optimized (BFS-DO)

Direction-Optimized BFS [10] addresses the above-mentioned drawbacks of the Top-Down version of BFS by manipulating the next frontier in a bottom-up way when the current frontier is large. In the Bottom-Up step, it iterates over unvisited vertices and selects for the next frontier those that have a neighbor in the current frontier. This drastically reduces the number of edges explored, especially when the frontier is large, since once an unvisited vertex with a neighbor in the current frontier is found, there is no need to explore its other edges. This especially reduces work for high-degree vertices. Further, this approach does not require any atomic operations, as the only writes update the state of the unvisited vertex to include it in the next frontier, while the rest of the accesses are reads (checking whether any of its neighbors are in the current frontier), thereby reducing contention [10, 34]. The Direction-Optimized BFS kernel starts with the Top-Down step and, once the frontier is large enough, switches to the Bottom-Up step. For the final supersteps, when the frontier is again small, it switches back to the Top-Down step. Note that switching between the steps (from Top-Down to Bottom-Up and back) is heuristics-based, and one needs to hand-tune the heuristics to attain maximum performance on a particular graph.

BFS-Graph500

The Graph500 competition [2] has different requirements for measuring performance. First, it counts an undirected edge as only one edge, while we represent an undirected edge as two directed edges, one in each direction.
Therefore, we have to halve the number of edges traversed when computing the performance. Second, it requires including the algorithm initialization time in the reported execution time, which is not standard practice in the literature. And finally, it requires the BFS tree as output, rather than the level of each vertex in the BFS tree. We have implemented our BFS-DO as the kernel inside the Graph500 skeleton, and modified its data structures to meet the Graph500 requirements.

2.6.3 Single-Source Shortest Path

SSSP is a traversal-based graph algorithm that finds the shortest path from a source vertex to every vertex in its connected component. It has wide applications, including IP routing, transportation networks, and social network analysis. In SSSP, each edge is associated with a predefined weight, which typically is a measure of the ‘cost’ of transitioning from one vertex to one of its neighbors. Weights increase the memory footprint of the graph by almost 2× (depending on the data type used). We use the following two implementations of the SSSP algorithm.

SSSP

We adapt the Bellman-Ford algorithm [1] to implement SSSP, as it provides better scope for parallelism and allows active vertices to perform the relax operation on their edges within the same iteration [16]. This reduces the number of supersteps, since more vertices become active per superstep. The algorithm gives the distance of each vertex from the given source vertex.

Graph500-SSSP

Along with the distance buffer, containing the distance of each vertex from the source vertex, Graph500 also requires, for SSSP, an sssp-tree containing the parent of each vertex. This is expected to increase the communication overhead in every superstep by almost 2×, because the parents for boundary edges must be communicated among the partitions along with the respective distance values. We optimize this by not communicating the tree at all during supersteps, and instead aggregating the tree only in the aggregation phase.
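The Bellman-Ford-style relaxation underlying both SSSP variants can be sketched as follows (a toy, single-partition, single-threaded version; `weights[i]` is assumed to hold the weight of the i-th CSR edge):

```python
def sssp_bellman_ford(index, edges, weights, source):
    """Frontier-driven Bellman-Ford relaxation: in each superstep,
    every active vertex relaxes its outgoing edges, and vertices whose
    distance improved become active for the next superstep."""
    n = len(index) - 1
    INF = float("inf")
    dist = [INF] * n
    dist[source] = 0.0
    active = {source}
    while active:                        # one iteration per superstep
        next_active = set()
        for u in active:
            for i in range(index[u], index[u + 1]):
                v, w = edges[i], weights[i]
                if dist[u] + w < dist[v]:   # relax edge (u, v)
                    dist[v] = dist[u] + w
                    next_active.add(v)      # v is active next superstep
        active = next_active
    return dist
```

Letting every improved vertex become active in the very next superstep is what reduces the superstep count relative to a strictly level-by-level traversal.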
In the computation phase, if the edge is a boundary edge, we store the partition ID of the remote vertex and the local ID of the parent vertex in the tree buffer. The distance buffer is updated in the same way as in the above implementation of SSSP. During communication, if the distance in the remote distance buffer is smaller for a remote vertex, the local tree buffer stores the remote partition ID for the respective vertex. This way, it knows that the parent is in the respective remote partition. In the aggregation phase, it iterates over the local tree buffers of all the partitions to aggregate the results. If the value in a local tree buffer corresponds to a remote partition, it determines the parent by looking at the respective tree buffer in that remote partition, and gets the global ID of the parent from the global map.

2.7 Related Work

Since their inception, NUMA architectures have been a source of performance issues in performance-critical applications targeting shared-memory systems, because of their distributed shared memory.

This section first describes the work done on addressing performance issues on NUMA architectures in general. It then discusses the graph processing frameworks that target single-node shared-memory systems and are NUMA-oblivious. Finally, it discusses the graph processing framework that targets shared-memory NUMA architectures, followed by NUMA-aware graph kernels.

NUMA-aware work. As shared-memory NUMA architectures become ubiquitous in today’s commodity servers, many works have shown the performance issues of running applications that were implemented for shared memory (assuming an SMP architecture), and have presented optimizations that improve their performance on NUMA-based shared-memory systems. For databases, works like [22, 25] have shown maximum performance gains in the range of 3× - 6× by accelerating different data management primitives and in-memory storage operations with NUMA optimizations. Further, NUMA
Further, NUMAeffects are also severe in Machine Learning [28] and Deep Learning [33], whereworkload is regular. NUMA-Caffe [33] have shown that the convolution layer (themost significant and time consuming layer [21]) in Convolution Neural Network(a type of Deep Neural Network) leads to maximum remote memory accesses, andwith NUMA optimizations they achieved performance gain of 2× to 14×.Shared-memory Graph Processing Frameworks. Since shared-memory sys-tems have up to few hundreds of gigabytes of memory available, which is enough toprocess mid-size graphs, frameworks like Galois [30], Ligra [35] and Totem [15]have been developed. These frameworks treats shared-memory system as if it isbased on SMP architecture. And since NUMA architecture is also shared-memorybased, these frameworks run on NUMA architecture as well. But, they suffer fromthe distributed nature of the shared-memory in NUMA, and hence do not perform18and scale well [5].NUMA-aware Graph Processing Framework. To the best of our knowledge,Polymer [42] is the only NUMA-aware graph processing framework. It embracesthe design philosophy of distributed systems, and extends Ligra to improve perfor-mance on NUMA-architecture based shared-memory systems. (discussed in detailin Section 5.4) NUMA-aware graph kernels. There are work, like [40, 41], thathave optimized specific graph kernels, primarily BFS for Graph500, for NUMAarchitecture. The NUMA-aware optimizations that they have done are specific tograph kernels. On the other hand, all the NUMA optimizations in our work aregraph algorithm agnostic.19Chapter 3A BSP-style NUMA-aware GraphProcessing Framework3.1 IntuitionThe NUMA architecture resembles distributed shared-nothing platforms. As de-scribed in previous section, the BSP processing model naturally matches graphcomputation. Therefore, for distributed graph processing, BSP graph processingmodel is commonly used. 
On a NUMA machine, the expected benefits of using the BSP model are: (i) explicit data placement, which means data is processed at the node-local level, with the potential to reduce processing time through better locality, since no remote accesses are made during processing; (ii) explicit partitioning, which allows experimentation with different load-balancing techniques; and (iii) since NUMA is a shared-memory system, we can explore different inter-partition communication trade-offs to reduce communication overhead. The advantages obtained through explicit data placement and partitioning need to outweigh the overheads of running the BSP model on a shared-memory NUMA system. The expected overheads are: (i) inter-partition communication overhead; (ii) memory overhead, since we have to store the state of remote vertices on each partition; (iii) thread management; (iv) development overhead, i.e. designing and implementing partitioning, inter-partition communication, and result aggregation; and (v) the pre-processing overhead of partitioning.

The goal of this study is to evaluate whether a distributed-memory-like middleware on a shared-memory NUMA machine provides performance advantages in spite of the aforementioned overheads.

This chapter describes the design of our BSP-style, NUMA-aware graph processing framework, starting with graph partitioning strategies (§3.2), followed by the design opportunities for BSP-style graph processing on a shared-memory NUMA system (§3.3). It then provides an analytical model to predict their performance (§3.4). Finally, it describes how the GAS model maps to our design (§3.5).

3.2 Graph Partitioning

Distributed graph processing begins with partitioning the graph and allocating the
The goals of partitioning are: (i) to process largegraphs - to leverage the large aggregated memory, (ii) to improve load balance, and(iii) to process the partitions in parallel.Graph partitioning is an NP-complete problem [7], and having balanced par-titioning on real-world power-law [14] graphs is challenging [6, 23, 24]. Populardistributed graph processing frameworks like Pregel [26] and GraphLab [17], dorandom partitioning, where vertices are distributed randomly among the processingunits, as it leads to uniform vertex degree distribution. Heterogeneous distributedsystem like Totem [15] uses Sorted/Degree-aware partitioning and have shown thatthis strategy performs better than Random partitioning on a single-node hybridsystem. The success criteria for a good partitioning strategy are: (i) better loadbalance, and (ii) most importantly, better overall performance.In a NUMA system, explicitly partitioning the graph does not help in process-ing larger graphs, as the memory available is fixed, but it provides better locality(by serving all the accesses from local memory of the NUMA node where the par-tition is assigned to), and enables implementing and experimenting with differentpartitioning strategies, designed for distributed systems, to improve load balanceand overall performance. We have implemented two graph partitioning strategies,random and Sorted/Degree-aware. We also introduce a new partitioning strategythat leads to better load balance and higher performance, than the above two parti-21tioning strategies.Random Partitioning. In this partitioning strategy, vertices are assigned ran-domly to the processing units. Random partitioning is a popular strategy among thedistributed graph processing systems, like Pregel and GraphLab, targeting graphshaving power-law vertex degree distribution [14]. It increases the probability ofeach partition having equal variability in terms of vertex degree.Sorted or Degree-aware Partitioning. 
In this approach the vertices are first sorted by degree, and then assigned to the processing units as contiguous chunks of vertices with an even share of edges. This strategy leads to better locality, since the likelihood of having most of a vertex's neighbors in the same partition increases. It has been shown to perform better than random partitioning in Totem [15], a heterogeneous distributed system. But load imbalance increases significantly, as shown in Fig. 5.1, since a few partitions get a dense subgraph (a chunk of high-degree vertices contains few vertices) while others get a sparse subgraph (the tailing chunks consist of low-degree vertices, so those subgraphs contain most of the vertices).

New Strategy - Hybrid Partitioning. We observed that Random partitioning leads to better load balance but suffers from poor data locality. Sorted or Degree-aware partitioning, on the other hand, achieves better data locality but leads to severe load imbalance. With these observations, we designed and implemented a hybrid partitioning technique that alleviates this problem. In the first step, we randomly assign the vertices to the processing units, as in random partitioning. Then, we sort the vertex list of each individual subgraph by degree. Randomly assigning the vertices to the processing units increases the probability that each partition has equal variability in terms of vertex degree (thereby increasing the chance that the generated load is well balanced), while sorting the individual vertex lists improves data locality [15]. Later, in Chapter 5, we compare its performance with the other two strategies (Fig. 5.1).

3.3 Design Opportunities for NUMA-aware Graph Processing

Since we partition the graph and place one partition on each NUMA node, all accesses during the computation phase of a superstep are served from local memory, allowing us to perform the computation in parallel.
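As a concrete sketch of the hybrid strategy from Section 3.2 (a toy version; the real implementation would additionally build per-partition CSR structures, and the seeded RNG is an assumption for reproducibility):

```python
import random

def hybrid_partition(degrees, num_partitions, seed=0):
    """Hybrid partitioning: vertices are first assigned to partitions
    at random (balancing expected load), then each partition's vertex
    list is sorted by degree (improving locality)."""
    rng = random.Random(seed)
    parts = [[] for _ in range(num_partitions)]
    for v in range(len(degrees)):
        parts[rng.randrange(num_partitions)].append(v)   # random step
    for p in parts:
        p.sort(key=lambda v: degrees[v], reverse=True)   # degree-aware step
    return parts
```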
Since the NUMA system is a shared-memory system, we have the opportunity to explore shared-memory-specific optimizations to reduce communication overhead. We explore three communication alternatives that address the motivating question of this dissertation: to what degree can designing for NUMA as for a distributed-memory system enable performance (by explicitly exploiting locality), in spite of the inherent overheads (message exchange), in an application-agnostic way?

In this section, we describe the data structures used in our framework, and the three design options for optimizing communication overhead.

3.3.1 Data structures

To store the graph in memory, we use the Compressed Sparse Row (CSR) format, as described in Section 2.1 and shown in Figure 3.1 (arrays V and E). As presented in [16], the arrays V and E represent the CSR data structure, where Vi contains the start index of the neighbors of vertex i in the edge array E. In each partition p, the vertex IDs range from zero to (|Vp| - 1), where Vp is the set of local vertices belonging to the partition. The edge array E stores the destination vertex of each edge, with the partition ID encoded in its high-order bits (shown in Fig. 3.1 as subscripts). For boundary (or remote) edges, the value stored in E depends on the communication design selected. For the NUMA 2-Box (§3.3.2) and NUMA 1-Box (§3.3.3) designs, the value stored is the index of the edge's entry in the outbox buffer (discussed later), not the remote neighbor ID. For the NUMA 0-Box design (§3.3.4), the value stored is the remote neighbor ID.

The array S, of length |Vp|, represents the algorithm-specific local state of each local vertex in the partition. The allocation of the message (outbox and inbox) buffers varies with the design option (discussed in detail under each design option). The outbox buffer holds the messages for the remote neighbors, and has an entry for each remote neighbor.
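The encoding of the partition ID into the high-order bits of an entry of E can be sketched as follows (the bit widths are assumptions for illustration; the actual layout depends on the entry size used by the framework):

```python
PID_BITS = 2                 # enough for 4 NUMA sockets (assumption)
ID_BITS = 64 - PID_BITS      # assuming 64-bit edge-array entries

def encode(pid, local_id):
    """Pack a partition ID into the high-order bits of an edge entry,
    as described for the edge array E."""
    return (pid << ID_BITS) | local_id

def decode(entry):
    """Recover (partition ID, local vertex ID) from an edge entry."""
    return entry >> ID_BITS, entry & ((1 << ID_BITS) - 1)
```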
The inbox buffer is for the messages for the local vertices which are remote to other partitions, and therefore has an entry for each local vertex that is remote to another partition. Both message buffers consist of two arrays: one stores the remote vertex ID, and the other stores the corresponding message. More details are provided with each of the design options described below.

Figure 3.1: High-level illustration of the inter-partition communication of the NUMA 2-Box design. V and E are the buffers representing the graph in CSR format (see Section 2.1 and Figure 2.1). S is the state buffer for local vertices. The bottom blue and red solid lines depict communication paths for NUMA 2-Box, where explicit memory copies through in- and out-boxes are required. For push-based algorithms, during the computation phase, each partition manipulates its local state buffer S for local vertices, and updates for remote vertices are aggregated locally in the outbox buffer. During the communication phase, the outbox is copied into the inbox of the respective remote partition, and the updates are then applied to the respective local state buffers.

3.3.2 NUMA 2-Box Design

In this design, we fully embrace the design philosophy of a distributed system, treating NUMA as a shared-nothing distributed system in which nodes are independent and connected through the interconnect. As shown in Fig. 3.1, this design uses two message buffers for communication: an outbox at the source and an inbox at the destination partition. Further, as mentioned before, for remote edges the value stored in E is the index of the edge's entry in the outbox buffer. So, the value in E for the entries 01 and 21 (in the left partition, PID-0), and 40 (in the right partition, PID-1), is replaced by the index of its entry in the respective outbox buffer, i.e.
by 01 and 11 (where subscript 1 denotes the outbox for remote partition 1), and by 00 (where subscript 0 denotes the outbox for remote partition 0), respectively.

The BSP-style graph processing in this design, for both push-based and pull-based algorithms, proceeds as follows (how we implement the GAS model in our design is explained in §3.5).

For push-based algorithms like BFS and SSSP, in the computation phase, each partition manipulates its local state buffer S for updates to its local vertices. All the updates for remote vertices are aggregated and stored in the respective outbox buffers. In the communication phase, as shown in Figure 3.1, the partitions transfer (blue and red arrows) their outbox buffers to the corresponding remote inbox buffers, and apply the remote updates, received from remote partitions in their corresponding inbox, to their local state buffers (red arrows from inbox to buffer S) if the necessary conditions are met (for example, in SSSP, a remote distance value is committed only if it is smaller than the current value).

For pull-based algorithms, like PageRank, during the compute phase, each local vertex updates its state by reading the state of its incoming neighbors. The state of remote incoming neighbors is accessed from the respective outbox buffer. During the communication phase, the local vertices (which are remote in other partitions) write their new state into the respective inbox buffer, which is then copied to the outbox buffer of the remote partition. In the next superstep, this updated state is used to calculate the new state of the local vertices.

Advantages

(i) Zero remote memory accesses. This design incurs zero remote memory accesses, since all accesses are local in both the computation and communication phases, and the message buffer (out/in box) is explicitly copied to the remote partition’s message buffer (in/out box).

(ii) Message aggregation. A vertex can be associated with many edges (the average degree of the vertices is 32 for our synthetic workloads and ∼75 for the real-world graphs; Section 4.3). Aggregating the remote updates for the remote vertices locally means sending only one message per remote vertex during the communication phase. This drastically decreases the inter-partition traffic and the communication time.

Drawbacks

(i) Communication overhead. Since remote vertices are marked and counted during the partitioning step, the size of the message buffers remains unaltered during the algorithm's execution. Though message aggregation reduces the number of messages sent, there is still communication overhead when the message buffer is mostly empty, which is often the case for algorithms, like BFS and SSSP, where communication happens via selected edges only in every superstep.

3.3.3 NUMA 1-Box Design

Since NUMA is a shared-memory system, instead of having two explicit message boxes, one at the source and another at the destination, only one buffer can be physically allocated per partition, and the pointer to the box can be swapped during the communication phase. In this design we allocate only one message buffer, at the source, and assign it to the outbox, because the outbox in the source partition is the inbox in the destination partition.

Figure 3.2: High-level illustration of the NUMA 1-Box design. V and E are the buffers representing the graph in CSR format (see Section 2.1 and Figure 2.1). S is the state buffer for local vertices. For push-based algorithms, during the computation phase, each partition manipulates its local state buffer S for local vertices, and updates for remote vertices are aggregated locally in the outbox buffer. During the communication phase, updates in the remote outbox are sequentially accessed and applied to the respective local state buffers.
Similar to the NUMA 2-Box design, the value stored for remote edges in E is the index of the corresponding entry in the respective outbox buffer.

The BSP-style graph processing in this design works as follows for both push-based and pull-based algorithms.

The computation phase for both push-based and pull-based algorithms is the same as in the NUMA 2-Box design, as in both cases the outbox is allocated on the source partition; only the communication phase differs.

For push-based algorithms, in the communication phase the partitions swap the pointer to their outbox message buffer with the pointer of the corresponding remote inbox message buffer. Each partition then performs remote sequential accesses to read the remote updates and applies them to its local state buffer, as shown in Figure 3.2.

For pull-based algorithms, during the communication phase each partition writes the new state of its local vertices (which are remote in other partitions), stored in the local state buffer S, to the respective inbox buffer (which in this design is a pointer to the outbox buffer of the remote partition: the inbox in the source partition is the outbox in the destination partition), using remote sequential writes.

Advantages

(i) No explicit message transfer. This design incurs zero remote memory accesses during the computation phase, same as the previous design. In the communication phase, it passes only the pointer to the physically allocated box, rather than copying the entire message buffer.

(ii) Message aggregation. The message aggregation advantage is the same as in the NUMA 2-Box design.

Figure 3.3: High-level illustration of NUMA 0-Box design, which overlaps computation with communication. S is the state buffer for local vertices. Note that in this design we get rid of the communication infrastructure. During the computation phase, each partition manipulates its local state buffer S for local vertices, and remote updates are written directly to the respective local state buffer of the remote partition.
Atomic writes are used to ensure correctness.

Drawbacks

(i) Communication overhead. Although this design does not transfer the message buffer explicitly in the communication phase, all accesses to the message buffer are remote sequential, and they are in turn bounded by the costly random accesses needed to update the local state buffer. Further, similar to the NUMA 2-Box design, it suffers from communication overhead when the message buffer is mostly empty.

3.3.4 NUMA 0-Box Design

In this design we exploit the fact that NUMA is a distributed shared-memory system. We do explicit partitioning as if NUMA were a distributed system, but we access the state buffers as if we were in a shared-memory system. As shown in Fig. 3.3, we do not use any communication infrastructure.

During the computation phase, for push-based algorithms, if a remote vertex is visited, its state is updated directly in the local state buffer of the respective remote partition, thereby overlapping computation with communication. Atomic operations are used for updating shared state, to ensure consistency.

Figure 3.4: Number of remote vertices vs. number of remote updates in each partition in every superstep of Direction-Optimized BFS (BFS-DO) for RMAT31 in the NUMA 2-Box design. The Y-axis represents frequency (in millions), in log-scale, of the number of remote updates (per superstep) and the total number of remote vertices in each partition. The ss-'x' on the X-axis represents the sequence of supersteps. We observe that remote updates are ∼22× fewer than the number of remote vertices.

For pull-based algorithms, like PageRank, during the compute phase each local vertex updates its state (e.g. rank for PageRank) by gathering the state of its incoming neighbors.
For remote incoming neighbors, it reads the state of the remote vertex from the local state buffer of the respective remote partition.

Note that, in this design, the state of all the boundary edges is accessed remotely.

Advantages

(i) Overlapping computation with communication. This design overlaps computation with communication. It performs better for algorithms like BFS and SSSP, where communication happens only via a selective set of edges in every superstep. For example, in our experiments, as shown in Fig. 3.4, we observe that the number of remote updates in the execution of BFS is ∼22× less than the total number of remote vertices.

Drawbacks

(i) Communication overhead. Since in this design the state of all the boundary edges is accessed remotely, it performs poorly for algorithms like PageRank, where there is a message along every boundary edge, compared to the NUMA 2-Box design, where the number of messages equals the number of remote vertices (not edges). The lack of message aggregation severely increases the number of remote memory accesses.

3.4 Analytical Model for Estimating Performance

To determine the right communication design for an algorithm, we evaluate the three designs analytically for PageRank, as a use case. Table 3.1 presents the memory access pattern of all the designs in the computation and communication phases, for the pull-based PageRank algorithm.

In the NUMA 2-Box design, during the computation phase all accesses are local: the design sequentially updates the rank of all the local vertices, performing random accesses to read the values of their neighbors scattered in local memory, including remote neighbors (which are stored in the corresponding outbox buffers). In the communication phase, it sequentially writes the recent state of the local vertices that are remote in other partitions to the respective inbox buffers.
Lastly, it transfers the inbox buffers to the outbox buffers allocated on the respective remote partitions.

In the NUMA 1-Box design, the access pattern during the computation phase is the same as in the NUMA 2-Box design. But in the communication phase, since the inbox in the source partition points to the outbox in the respective remote partition, it performs remote sequential writes to propagate the new rank of its local vertices (accessed in local-random fashion) to the outboxes of the remote partitions (where the local vertex is remote).

Further note that, for the NUMA 1-Box design, the message buffer could instead be allocated at the destination partition. But that leads to much more expensive (E' * Random Remote Read) accesses in the computation phase. Therefore, we discard this variation.

Finally, in the NUMA 0-Box design, the computation and communication phases are overlapped. In the computation phase, the rank of each local vertex is computed by reading the ranks of its incoming neighbors. To access the ranks of its remote neighbors, it performs Random Remote Read accesses.

NUMA 2-Box Design
  Computation (local accesses):    V * WriteLocalSeq + (E+E') * ReadLocalRand
  Communication (local accesses):  (N-1) * V' * (ReadLocalRand + WriteLocalSeq)
  Communication (memcopy):         (N-1) * memcopy()

NUMA 1-Box Design - Box on Source partition
  Computation (local accesses):    V * WriteLocalSeq + (E+E') * ReadLocalRand
  Communication (pointer copy; local + remote accesses): (N-1) * V' * (ReadLocalRand + WriteRemoteSeq)

NUMA 0-Box Design - Overlapped Computation and Communication
  Local accesses:    V * WriteLocalSeq + E * ReadLocalRand
  Remote accesses:   E' * ReadRemoteRand

Table 3.1: Memory access pattern of the different NUMA designs, for the PageRank algorithm. V/V' and E/E' represent the number of local/remote vertices and edges in the partition. N is the number of partitions. Memory accesses are represented as XZY, where X is the read/write operation, Z is the local/remote memory access, and Y is the sequential/random access.

Having an analytical model helps in selecting the appropriate communication mode depending on the access pattern of an algorithm.
A similar analytical model can be designed for other algorithms by observing the access pattern during both the computation and communication phases.

3.5 Mapping GAS model to NUMA design

In this section, we briefly describe how our design matches graph computation that follows the GAS (Gather-Apply-Scatter) model, considering the NUMA 2-Box design option as an example.

In our design, the graph computation is implemented either as Gather-Apply-Scatter, for pull-based algorithms like PageRank, or as Scatter-Apply-Gather, for push-based algorithms like BFS and SSSP.

In pull-based algorithms, like PageRank, during the compute phase each local vertex updates its state (e.g. rank for PageRank) in the local state buffer S, by gathering the state of its neighbors. If the neighbor is a remote vertex, the value in the edge array E stores the index of its outbox entry, which contains its state (such as rank in PageRank). For example, as shown in Figure 3.5, in PID-0 (left), vertex 5 gathers the rank of its neighbors 40 (local vertex) and 21 (remote vertex). Similarly, in PID-1, vertex 2 gathers the rank of its neighbor 11 (local vertex). In the apply phase (Figure 3.5), the state of the vertex is updated with the newly computed rank. Now, since the new state of the vertex needs to be in sync with its shadow copies in the remote partitions as well, in the communication phase (i) the vertex first scatters/updates its value in the respective inbox buffer, and then (ii) the inbox buffer is transferred to the outbox buffer on the remote partition. In this way, all the shadow copies of a vertex have the same state before the next superstep starts.

In push-based algorithms, like BFS, during the computation phase each active vertex scatters its state to its neighbors. For local vertices, the state is applied/synced implicitly (to its entry in the local state buffer S), but for remote vertices, the state is updated only in their outbox entry.
In the communication phase, the partitions transfer the respective outbox buffer to the corresponding remote inbox buffer, and then the local vertices gather the updated values and commit the changes.

Figure 3.5: Gather, Apply and Scatter phases in the NUMA 2-Box design, for a pull-based algorithm. During the computation phase, each local vertex gathers the state of its neighbors and applies the computed value to its state. In the communication phase, it scatters its new state to the respective inbox buffer, which is then copied to the outbox buffer on the remote partition. In this way, all the shadow copies of a vertex have the same state before the start of the next superstep.

Chapter 4

Experiment Design

This chapter describes the implementation of our graph processing framework, our NUMA testbed, the synthetic and real-world workloads (graphs) used, and the experimental methodology we follow.

4.1 System Implementation

To implement our NUMA-aware designs, we extend a state-of-the-art NUMA-oblivious graph processing framework, Totem [15], which presumes SMP-based CPUs. It does hybrid graph processing on CPUs and GPUs, where the GPUs have discrete memory, and therefore it follows a distributed-systems design. Similar to distributed systems, it follows the Bulk Synchronous Parallel processing model, and communicates between CPU and GPU with message buffers. We use this NUMA-oblivious framework because: First, in our previous study [5] we observed that it outperforms state-of-the-art graph processing frameworks, including Intel's GraphMat [36] and Galois [30], by up to an order of magnitude. Second, and most importantly, its processing model, BSP, matches our needs.

Fig. 4.1 presents the high-level design of our framework. As input, the user provides the graph (workload), the graph kernel (e.g. BFS, PageRank), the partitioning and communication strategy, along with optimization options. We allocate all the data structures belonging to a partition on its respective NUMA node using the libnuma library.
To launch the partitions in parallel and perform the computation independently, we leverage the nested parallelism offered by OpenMP.

Figure 4.1: High-level design of our BSP-style NUMA-aware framework.

In the first level of parallelism, we create as many threads as the number of NUMA nodes available, called parent threads. From each of these parent threads, child threads, equal in number to the cores available on each NUMA node, are spawned on the respective NUMA nodes. Threads are assigned to the correspondingly numbered cores (i.e. Thread 0 is assigned to Core 0, and so on). We set OMP_PROC_BIND=spread,close to ensure that the initial threads are spawned on different NUMA nodes and that child threads stay close to their respective parent thread. Similarly, during the communication phase, especially in the NUMA 2-Box design, the parent thread on each NUMA node is responsible for transferring the content of the outbox to the inbox of the remote partition, and then the child threads apply the updates from the inbox to the respective local state buffers. This process continues until the global finish flag is set.

In our experiments, we evaluate how our NUMA-aware framework performs against the state-of-the-art NUMA-oblivious framework Totem. Further, we run the NUMA-oblivious framework with numactl's interleave option, which allocates memory on all the NUMA nodes in a round-robin fashion, instead of Linux's first-touch policy, which allocates data on the memory node touched first by the thread.
We compare against this as well, as it allocates memory pages uniformly over all the NUMA nodes in a non-deterministic way.

4.2 TestBed

To explore the benefits of our designs we use a four-socket Intel Xeon machine (E7-4870 v2, Ivy Bridge), with 60 cores, 1536 GB of memory, and 120 MB of L3 cache. In Section 2.3.3 we discussed the key memory characteristics of our testbed using Figure 2.3 and Table 2.1.

4.3 Workload

Graph      #Vertices (|V|)   #Edges (2|E|)   Edge-list size (GB)
RMAT28     256M              8B               64
RMAT29     512M              16B              128
RMAT30     1B                32B              256
RMAT31     2B                64B              512
RMAT32     4B                128B             1024
Twitter    51M               3.9B             15
clueWeb12  978M              74B              286

Table 4.1: Workload used for evaluation.

We consider both real-world graphs and large Recursive MATrix (RMAT) scale-free graphs, from scale 28 to 32, for evaluating our designs. Twitter [9] is an online social network graph, while clueWeb12 [3] is a hyperlink web graph. Synthetic graphs are generated using the RMAT generator [12] with the following parameters: (A,B,C) = (0.57, 0.19, 0.19) and an average vertex degree of 16. All the graphs were made undirected, following the Graph500 standard. We use RMAT graphs to evaluate our design because: First, they are adopted by today's widely accepted Graph500 benchmark [2]. Second, RMAT graphs have similar characteristics to real-world graphs: they have a low diameter and a 'power-law' [14] (highly heterogeneous) vertex degree distribution.

We use 64-bit vertex and edge ids to store the RMAT graphs in memory, as we store the partition id in the highest-order bits. The largest graph we run, RMAT32, has an edge list of size 1 TB (1024 GB). For evaluation, we run the experiments 20 times for each workload and report the average. For BFS and SSSP, we use a different randomly generated source vertex for each run. For SSSP, we use 32-bit weights in the range (0, 1M] so as to have a highly diverse weight distribution.
For PageRank, we run each experiment for five PageRank iterations and normalize the execution time to one iteration.

4.4 Experimental Methodology

Partitioning

Explicit partitioning enables the implementation of, and experimentation with, different partitioning strategies to achieve better load balancing and data locality. We experiment with the three partitioning strategies described in Section 3.2. We define load imbalance as the ratio of the computation time of the slowest partition to that of the fastest partition.

Performance Evaluation of Designs

We evaluate the performance of the NUMA designs we introduced in Section 3.3. We compare the performance of our designs with that of the NUMA-oblivious framework Totem, and with Totem run under non-deterministic numactl interleaving, and report the algorithm execution time. Consistent with standard practice in the domain, 'execution time' does not include time spent in pre- or post-processing steps such as graph loading, graph partitioning, and result aggregation. We further evaluate our designs for strong scaling w.r.t. resources. For scalability, we consider the largest graph that fits in the memory of one socket (384 GB). For PageRank and BFS we consider RMAT30, with an edge-list size of 256 GB, and for SSSP we consider weighted RMAT29, with a weighted edge-list size of 192 GB.

Note: We are unable to provide hardware counter measurements because hardware counters for remote memory accesses are not supported by the testbed.

Accuracy of the Analytical Model

To select the appropriate communication design, we have presented an analytical model in Table 3.1 for PageRank. We verify its accuracy.

Comparison with Existing Work - Polymer

Finally, we compare against Polymer [42], the only NUMA-aware single-node graph processing framework (to the best of our knowledge).
It has been shown to perform better than the state-of-the-art single-node graph processing frameworks Galois [30], Ligra [35], and X-Stream [32].

Chapter 5

Experimental Results

This chapter presents and discusses the experimental results of our designs. We first discuss the impact of graph partitioning on load balancing and overall performance improvement. Next, in Section 5.2 we explore the performance of the NUMA designs (described in Chapter 3) for different graph applications (§2.6) on both synthetic and real-world graphs (§4.3). In Section 5.3, we evaluate the accuracy of our analytical model. Finally, we compare our designs with a state-of-the-art NUMA-aware graph processing framework, Polymer (Section 5.4).

5.1 Impact of Graph Partitioning

Explicit partitioning enables the design of, and experimentation with, different partitioning strategies to achieve better load balancing and overall performance improvement. In this section, we evaluate the impact of the three partitioning strategies (discussed in §3.2) on load balancing and overall performance for the PageRank (which has a fixed workload in every superstep) and Direction-Optimized BFS (BFS-DO, which has a dynamic workload per superstep) algorithms.

Figure 5.1 shows the impact of our hybrid strategy, compared to the Random and Sorted/Degree-aware (§3.2) strategies, on load balancing for PageRank using the RMAT31 graph. For PageRank, where the workload in every superstep is fixed, the Random strategy leads to a load imbalance of only 1.03× (i.e. the slowest partition is only 3% slower than the fastest partition). The Sorted/Degree-aware strategy suffers

Figure 5.1: Load imbalance of the traditional (Sorted and Random) and new Hybrid partitioning strategies for the PageRank algorithm on the RMAT31 graph using the NUMA 2-Box design. The x-axis is for supersteps (denoted by 'ss') required for execution.
The y-axis is for the computation time (lower the better) of the four partitions for the three partitioning strategies.

a load imbalance of 1.46×, but performs 1.69× better than Random partitioning. This is because the random partitioning strategy increases the probability that each partition has equal variability in terms of vertex degree; therefore, we observe better load balance. On the other hand, it also increases the probability that the neighbors of a vertex are scattered in memory, leading to poor data locality [16]. With the Sorted strategy, the first partition gets the densest part of the graph (containing the few high-degree vertices, e.g., PID-0 in Figure 5.1) and the last partition gets the sparsest part (containing most of the vertices, with low degree). The Sorted strategy leads to better data locality and overall performance for PageRank, but since the dense partition gets processed faster than the sparse partition, it leads to a higher load imbalance compared to the Random strategy.

The Hybrid strategy achieves a load imbalance of only 1.05×. This leads to an overall performance improvement of 1.18× and 2× against the Sorted and Random strategies, respectively.

Figure 5.2: Load imbalance of the traditional (Sorted and Random) and new Hybrid partitioning strategies for the BFS-DO algorithm on the RMAT31 graph using the NUMA 2-Box design. The x-axis is for supersteps (denoted by 'ss') required for execution. The y-axis is for the computation time (lower the better) of the four partitions for the traditional and new partitioning strategies. Note the different Y-axis scales.

For BFS-DO, where the workload changes drastically in every superstep, as shown in Figure 5.2, we observe a significantly higher load imbalance, of 10.1×, with the Sorted strategy. With the Sorted strategy, the initial three supersteps are executed with the Top-down kernel, followed by three supersteps with the Bottom-up kernel, and the remaining supersteps again with the Top-down kernel.
In the Top-down stage, the frontier builds up quickly for the dense partition (since it has the high-degree vertices); hence we observe that in superstep 3, the dense partition (PID-0) takes a significant amount of time because it processes a huge frontier. The Random strategy achieves a load imbalance of 1.35×. Even for algorithms like BFS-DO, where the workload during every superstep is highly dynamic, the Hybrid strategy achieves a load imbalance of only 1.13×. Since BFS-DO is a memory-bound and cache-sensitive algorithm, better load balance and locality lead to better performance. The Random strategy performs 3.4× better than the Sorted strategy, while the Hybrid strategy performs 5.3× and 1.55× better than the Sorted and Random strategies, respectively.

Since our hybrid strategy achieves better overall performance, in all the following experiments we use our hybrid partitioning strategy to partition the graph for NUMA-aware graph processing.

5.1.1 Partitioning - Key Insights

1. Better load balance does not mean better performance (as observed for the Random vs. Sorted strategy). Note that it is important to achieve better load balance, to optimize resource usage (i.e. to avoid overloading a few resources while the remaining resources are idle). But the end goal of high-performance graph processing is to process the graph as fast as possible.

2. The hybrid partitioning strategy strikes the right balance between load balance and locality, and hence offers improved performance.

3. Graph partitioning is an NP-Complete problem. There exist sophisticated partitioning strategies that offer improved load balancing across partitions; however, they are costly: for them, graph partitioning takes much longer than with the simpler partitioning techniques we explored in this dissertation.

4. Real-world graphs are heavily skewed and therefore very difficult to partition. The partitioning strategies leveraged by distributed graph processing frameworks like Google's Pregel and GraphLab, and the ones we have experimented with, are simple and low-cost.
These techniques are based on the hypothesis that there are no natural communities in these real-world graphs; indeed, we are skeptical that there are natural clusters in many of these graphs.

5. Further, our infrastructure is flexible enough that users can plug in and experiment with different partitioning strategies.

5.2 Performance Evaluation of Designs

In this section we first evaluate the performance of our three communication designs on both synthetic and real-world workloads for different graph algorithms. Then we provide the strong-scaling (w.r.t. resources) experiment results of our designs. Finally, we mention the performance numbers of our Graph500 submissions.

5.2.1 Performance of NUMA 2-Box design.

As observed in Fig. 5.3, for the RMAT31 graph, NUMA 2-Box is 2.07× and 1.63× faster than NUMA-oblivious Totem for the PageRank and BFS-DO algorithms, respectively.

Since PageRank has a high compute-to-memory-access ratio, most of the time is spent in the computation phase. Further, because of the explicit 2-Box communication, all the remote updates are sent in a batch and all the accesses are local (in both the computation and communication phases). This leads to spending only 3% of the execution time in the communication phase.

In BFS-DO, which has a low compute-to-memory-access ratio, as shown previously in Fig. 3.4, relatively few remote vertices have messages in each superstep. This leads to a higher communication cost of 26.9% of the execution time. numactl does not provide comparable performance because the pages are distributed among the NUMA nodes in a non-deterministic round-robin fashion, so the data distribution is not graph-topology aware.

Further, Fig. 5.4 and Fig.
5.5 present that (i) for synthetic graphs, NUMA 2-Box performs better than both Totem and numactl for all the algorithms (up to 2.08×, 1.88×, and 1.91× against Totem, and 20%, 26%, and 33% better than numactl, for PageRank, BFS-DO and BFS-TD, respectively), except for SSSP. For SSSP, it performs up to 33% better than Totem, but is up to 91% slower than numactl (discussed later). (ii) For the Twitter graph, NUMA 2-Box is up to 63%, 29%, and 11% faster for the PageRank, BFS-TD, and SSSP algorithms, respectively, against Totem. Real-world graphs like clueWeb are heavily skewed and require a huge number of supersteps to converge (#supersteps: 94 and 135 for BFS-DO, and 76 and 129 for SSSP, by Totem and NUMA 2-Box, respectively, for clueWeb). For the clueWeb graph, NUMA 2-Box achieves a good performance improvement, of 69%, only for PageRank. For traversal-based algorithms, especially BFS-DO and SSSP, it does not perform well. BFS-DO requires hand-tuning the parameters that switch between the Top-down and Bottom-up stages; that is why we do not observe a major improvement. Since all the scales of synthetic graphs have similar characteristics, the switching parameters are easy to tune.

Figure 5.3: Performance of the NUMA designs against Totem and numactl for the PageRank (left) and BFS-DO (right) algorithms on the RMAT31 graph. The Y-axis is for execution time (lower the better). For NUMA-2B and NUMA-1B, where we do explicit communication, we show the breakdown of execution time into computation and communication time.

For both types of workloads (synthetic and real-world), numactl performs better than the NUMA 2-Box design for SSSP.
As mentioned earlier, for SSSP, the optimizations to activate the neighbors in the same iteration were done to reduce the number of supersteps, assuming an SMP-based architecture [16]. This leads to the partition with the source vertex spending more time in the computation phase than the others, in the initial few supersteps. Note that our NUMA-aware design is application-agnostic and we do not modify the applications.

5.2.2 Performance of NUMA 1-Box design.

The NUMA 1-Box design performs better than Totem and is competitive with numactl. However, it does not perform as well as NUMA 2-Box because it consumes more time during the communication phase: it performs remote sequential accesses to the remote message buffer, which are bounded by slow local random updates to the state buffer, as shown in Figure 5.3.

Figure 5.4: Billion Traversed Edges Per Second (TEPS) achieved by Totem, numactl and the NUMA designs, for (a) PageRank and (b) BFS-DO algorithms on RMAT[28-32] (synthetic), and Twitter and clueWeb (real-world) workloads. Note, the Y-axis is for Traversed Edges Per Second (TEPS) (higher the better).

Figure 5.5: Billion Traversed Edges Per Second (TEPS) achieved by Totem, numactl and the NUMA designs, for (a) BFS-TD and (b) SSSP algorithms on RMAT[28-32] (up to RMAT31 for SSSP, since it requires a weighted edge list) (synthetic), and Twitter and clueWeb (real-world) workloads. Note, the Y-axis is for Traversed Edges Per Second (TEPS) (higher the better).
In the NUMA 1-Box design too, all the remote updates are aggregated in message buffers, in a similar way as in the NUMA 2-Box design. Hence, it does not perform well for the BFS-DO and SSSP algorithms, but performs well for PageRank and BFS-TD compared to both Totem and numactl, as shown in Figure 5.4 and Figure 5.5.

5.2.3 Performance of NUMA 0-Box design.

The NUMA 0-Box design performs better for algorithms like BFS and SSSP, where in each superstep there are messages from selective boundary edges only, not from all. As shown in Fig. 5.3, for BFS-DO, overlapping computation and communication (by directly updating the remote vertices) in every superstep leads to better performance. Note that the gain, almost equivalent to the communication time in NUMA 2-Box, is achieved because of the implicit communication, since the number of remote updates in every superstep is much less than the number of remote vertices (∼22× for the RMAT31 graph), as shown previously in Fig. 3.4.

From Fig. 5.3, Fig. 5.4 and Fig. 5.5, we observe that the NUMA 0-Box design performs better than Totem and numactl, as well as the other NUMA designs, for the PageRank (up to 1.89×), BFS-TD (up to 1.62×), BFS-DO (up to 2.37×) and SSSP (up to 2.27×) algorithms. Further, even though real-world graphs are heavily skewed, it provides a better performance improvement, for traversal-based algorithms, than the other NUMA designs as well as Totem and numactl (except for BFS-DO on the clueWeb graph, which requires hand-tuning the switching parameters).

For BFS-TD, though NUMA 0-Box performs better, its gain over the other NUMA designs is smaller. This is because in BFS-TD the frontier size increases drastically, which increases the remote memory accesses.
But for SSSP, it activates the remote vertices as well, in the same iteration, which increases the likelihood of every partition having active vertices in the initial supersteps, thereby achieving better load balance and overall performance.

Further, for algorithms like PageRank, where each vertex calculates its rank by pulling the ranks of all its neighbors, all the remote messages, sent over every boundary edge, are read/accessed in remote random fashion, which leads to a degradation in performance.

5.2.4 Communication Designs - Key Insights

1. Although explicit communication can be perceived as extravagant for a cache-coherent shared-memory system, its performance benefits on a NUMA system are indisputable.

2. The performance gain of NUMA 2-Box, compared to NUMA 1-Box, comes from doing local accesses during computation, and from copying data in bulk from the source partition to the destination partition, followed by local random/sequential accesses for local read/write operations in the communication phase. In the NUMA 1-Box design, even though the remote updates are read sequentially, it is bounded by slow local random writes to the local state buffers.

3. The NUMA 2-Box design incurs zero remote memory accesses and reduces the number of messages sent during the communication phase to only the number of remote vertices, regardless of the number of edges associated with them.

4. NUMA 2-Box performs better for algorithms, like PageRank, where the communication volume is high (i.e. where most of the neighbors are updated). For algorithms with low communication volume (i.e. where only a small subset of neighbors is updated), like traversal-based algorithms such as BFS and SSSP, NUMA 0-Box provides better performance. This explains why some algorithms find a better solution with one design and others with another.

5.
For BFS and SSSP, the partition with the source vertex ends up spending more time in the initial supersteps in the NUMA 2-Box and 1-Box designs, as the active vertices are confined to the partition holding the source vertex. This degrades the performance of the NUMA 2-Box and 1-Box designs for these algorithms. On the other hand, for PageRank, NUMA 0-Box ends up doing a remote random access for each remote edge, with no message reduction as in the NUMA 2-Box design, which degrades its performance.

Figure 5.6: Strong scaling of Totem, numactl and the NUMA designs on our 4-socket machine compared to 1 socket (memory: 384 GB). Weighted RMAT29 (weighted edge-list size: 192 GB) is used for SSSP and unweighted RMAT30 (edge-list size: 256 GB) is used for all other algorithms.

5.2.5 Strong Scaling Experiments

Here we evaluate how our designs scale on our four-socket NUMA testbed compared to one socket. We use the largest graph that fits into the memory of one socket (384 GB). For SSSP, which requires a weighted graph, we use RMAT29 with a weighted edge-list size of 192 GB. For all other algorithms we use unweighted RMAT30 with an edge-list size of 256 GB.

Fig. 5.6 shows that our NUMA designs fill the performance gap left by Totem by scaling to as much as 3.7×, 2.9×, 2.7× and 2.8×, compared to the 2.0×, 1.7×, 2.1×, and 1.3× achieved by Totem, for the PageRank, BFS-DO, BFS-TD and SSSP algorithms, respectively.

5.2.6 Graph500 submissions.

The Graph500 competition ranks supercomputers worldwide for the data-intensive applications BFS and SSSP.

In SSSP, a kernel added to Graph500 in November 2017, we secured World Rank 2 (in the June 2018 list, published during the ISC conference).

For BFS, a mature kernel in Graph500, we rank among the top three single-node submissions. For RMAT31, our NUMA 0-Box design achieved 10.73 Billion TEPS, a performance gain of 28% against our previous submission (8.37 Billion TEPS), obtained by running NUMA-oblivious Totem with numactl on the same NUMA machine.
Note that this performance improvement is smaller than the 50% improvement of NUMA 0-Box over numactl for plain BFS-DO on the same RMAT31 workload, shown in Figure 5.3. This is because, as expected, our NUMA design spends more time in initialization (which is timed in Graph500) than the NUMA-oblivious design, as it has to initialize the state of all the partitions. Further, the largest graph submission we made has 128 billion undirected edges (RMAT32), with an edge-list size of 1 TB. NUMA-oblivious Totem could not run RMAT32, as its memory requirement is ∼2× the edge-list size, as observed in our previous study [5].

Note that our framework is a generic graph processing framework that enables users to develop multiple applications, including BFS and SSSP, and applies optimizations in an application-agnostic way, while the codes we compete against in Graph500 are developed for these specific applications (as described in the corresponding publications [13, 38, 40, 41]).

5.3 Accuracy of the Analytical Model

From our experiments, we observed that the designs perform differently depending on the memory access pattern they provide, along with the communication volume of the graph algorithm. For example, the NUMA 2-Box design, which makes all accesses local and aggregates messages for remote vertices, performs best for algorithms that send messages to most of the remote vertices in each superstep, like PageRank. NUMA 0-Box, which overlaps computation with communication, gives the best performance for algorithms like BFS and SSSP, where messages travel over selective edges only.

We calculated the expected theoretical time to solution of each of the three designs for the PageRank algorithm, using the machine characteristics in Table 2.1, for different workloads. We observed that for all the RMAT graphs and for
Figure 5.7: Analytical model prediction for the PageRank algorithm. The Y-axis shows the speedup of NUMA 2-Box against NUMA 1-Box and NUMA 0-Box, as predicted by the cost model and as measured empirically, for RMAT28 through RMAT32.

all the partitions, ∼25% of the edges were local and ∼75% were remote (with four partitions and random distribution, the probability of a vertex being local is 0.25, and of being remote 0.75), and local vertices and remote/ghost vertices constitute ∼64% and ∼36% of the total vertices (V+V') in each partition. As seen in Fig. 5.7 and Table 5.1, our model correctly predicts the relative ranking of the designs by performance. Note that this is a high-level prediction, since we have not taken into account caching, the effect of prefetching, and other low-level details; in particular, for the NUMA 0-Box design, we did not account for the impact of overlapping computation and communication in the analytical model.

Design                           Cost Model   Empirical
NUMA 2-Box                       1×           1×
NUMA 1-Box on source partition   1.055×       1.07×
NUMA 0-Box                       1.46×        1.21×

Table 5.1: Cost model evaluation for the PageRank algorithm. The numbers represent the average speedup of NUMA 2-Box against the other designs, over all workloads, as predicted by the cost model and as observed empirically.

Application developers can easily extend this model to other applications to determine the optimal communication strategy, since in all graph applications we only need to look at the memory access pattern, as specified in the memory access pattern table.

5.4 Comparison with Existing Work

Finally, we compare our framework against Polymer, the only NUMA-optimized graph processing framework we are aware of. It also embraces the design philosophy of distributed systems, and is developed on top of Ligra, a shared-memory graph processing framework.
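To illustrate how a developer might extend the analytical model of Section 5.3 to a new application, the sketch below derives the relative per-edge cost of the three designs from the measured edge split (∼25% local, ∼75% remote). The latency constants are illustrative placeholders, not the machine characteristics of Table 2.1, so only the relative ordering of the designs is meaningful here:

```python
# Hedged sketch of a Section 5.3-style cost model: per-edge cost of each
# design as (fraction of edges) x (assumed access cost). The cost constants
# below are placeholders; the thesis uses measured machine characteristics.

LOCAL_SEQ, LOCAL_RAND, REMOTE_RAND = 1.0, 4.0, 10.0  # assumed relative costs

def pagerank_cost(local_frac, remote_frac):
    return {
        # 0-Box: local edges are local random accesses; every remote edge
        # incurs a remote random access.
        "0-Box": local_frac * LOCAL_RAND + remote_frac * REMOTE_RAND,
        # 1-Box: remote updates are read sequentially, but applying them
        # requires local random writes to the state buffers.
        "1-Box": local_frac * LOCAL_RAND
                 + remote_frac * (LOCAL_SEQ + LOCAL_RAND),
        # 2-Box: all accesses are local; aggregated messages are copied in
        # bulk (sequential) and read back sequentially on the destination.
        "2-Box": local_frac * LOCAL_RAND
                 + remote_frac * (LOCAL_SEQ + LOCAL_SEQ),
    }

costs = pagerank_cost(0.25, 0.75)
# 2-Box comes out cheapest, matching the relative ordering in Table 5.1.
print({k: round(v / costs["2-Box"], 2) for k, v in costs.items()})
```

The predicted magnitudes depend entirely on the assumed costs; plugging in the real sequential/random, local/remote latencies of the target machine is what makes the prediction quantitative.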
The key difference between Polymer and our work is that Polymer does vertex-cut partitioning [17], while we do edge-cut partitioning. In vertex-cut partitioning, a single vertex is partitioned among multiple nodes. This requires replicating vertices on multiple nodes, but it enables distributing the computation on a single vertex over multiple NUMA nodes. This is especially beneficial for highly skewed, real-world power-law graphs, where a few vertices have millions of edges associated with them. This technique further provides the advantage of placing on a partition only those edges whose destination vertex is also local to the partition, so that both the source and the destination vertices are local. This reduces the number of remote memory accesses. Polymer implements push-based PageRank and adapts the Bellman-Ford algorithm for SSSP. For BFS, it implements Top-Down BFS, and does not support Direction-Optimized BFS.

Table 5.2 summarizes the performance of the best performing NUMA design against Polymer. For BFS, we present the performance of our design for BFS-Top-Down only, since Polymer does not have a Direction-Optimized BFS implementation.
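The edge-cut placement our framework uses (in contrast to Polymer's vertex-cut) can be sketched as follows. The hash-free random placement and graph sizes are illustrative assumptions, but the sketch reproduces the ∼25% local / ∼75% remote edge split that Section 5.3 reports for four randomly populated partitions:

```python
import random

# Sketch of edge-cut partitioning: each vertex is assigned to exactly one
# partition, and an edge is "local" iff both endpoints land in the same
# partition. Random placement over 4 partitions is an assumption here,
# standing in for the framework's actual assignment.
NUM_PARTITIONS = 4

random.seed(1)
n_vertices, n_edges = 10_000, 100_000
assignment = [random.randrange(NUM_PARTITIONS) for _ in range(n_vertices)]
edges = [(random.randrange(n_vertices), random.randrange(n_vertices))
         for _ in range(n_edges)]

local = sum(assignment[s] == assignment[d] for s, d in edges)
print(f"local edges: {local / n_edges:.1%}")  # close to 25% with 4 partitions
```

With p partitions and uniformly random endpoints, the chance that an edge is local is 1/p, which is exactly the 0.25 local probability quoted for our four-partition setup; the remaining ∼75% of edges cross partitions and drive the communication volume studied above.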
Missing data points mean Polymer failed ("Error: Segmentation fault (core dumped)") for the respective graph workloads (including the ClueWeb graph), as it ran out of memory.

Algorithm   Workload   Polymer               NUMA-xB
                       Time (s)   Mem (GB)   Time (s)   Mem (GB)
PageRank    RMAT28     5.1        365        1.46       80
            RMAT29     11.6       735        3.48       162
            RMAT30     26.2       1401       8.29       330
            RMAT31     -          -          21.4       674
            RMAT32     -          -          49.4       1366
            Twitter    1.53       144        0.75       21.8
BFS         RMAT28     11.37      366        1.29       81
            RMAT29     17.17      740        2.36       162
            RMAT30     34.63      1302       4.94       322
            RMAT31     -          -          12.63      653
            RMAT32     -          -          24.65      1319
            Twitter    13.1       93         0.94       19.2
SSSP        RMAT28     11.5       437        4.34       115
            RMAT29     24.95      886        9.56       232
            RMAT30     -          -          21.79      452
            RMAT31     -          -          53.59      867
            Twitter    5.3        115        8.4        41

Table 5.2: Execution time (in seconds) and peak memory consumption (in GB) of Polymer and our best performing NUMA design (NUMA-xB). We show the peak memory consumption among all the NUMA designs. Missing data points mean Polymer ran out of memory.

Our design outperforms Polymer by up to 3.5×, 13.9×, and 2.6× for the PageRank, BFS, and SSSP algorithms, respectively. Polymer, with its vertex-cut partitioning, consumes ∼5.7× more memory than the size of the respective edge list of the graph, and ∼4.4× more memory than our NUMA designs. Polymer is faster than our design only on Twitter for the SSSP algorithm: it performs 1.58× better, but at the cost of consuming 2.8× more memory.

Chapter 6: Conclusion

In this dissertation, we postulated the hypothesis that a distributed-memory-like middleware, which makes graph partitioning and inter-partition communication explicit, improves performance on a NUMA system. To test our hypothesis, we designed a NUMA-aware graph processing framework that embraces the design philosophies of distributed graph processing frameworks, especially explicit partitioning and communication.
Based on the lessons from this design, we proposed (i) a new hybrid partitioning strategy, which leads to a near-optimal load balance of 95% and improves overall performance by up to 5.3×, and (ii) a new NUMA-aware hybrid design that exploits the fact that NUMA is a distributed shared-memory architecture: it leverages the benefits of a distributed system (by explicitly partitioning the graph among NUMA nodes) and of a shared-memory system (by performing implicit communication to access the state buffers), and overlaps computation with communication. This design provides performance gains of 1.89×, 2.37×, and 2.27× for the PageRank, BFS, and SSSP algorithms, respectively, against the state-of-the-art NUMA-oblivious framework, and secured us a good ranking in the Graph500 competition.

To summarize, we presented a scalable (up to 3.7×), high-performance (up to 13.9×), memory-efficient (∼4.4×), generic NUMA-aware graph processing framework that outperforms the state-of-the-art NUMA-oblivious (up to 2.37×) as well as NUMA-aware (up to 13.9×) graph processing frameworks. We observed that treating NUMA as a distributed system not only improves performance but also opens up opportunities to explore different partitioning and communication strategies on a NUMA machine. Finally, since nowadays each node in a high-end, large-scale distributed system embraces a NUMA architecture with at least two sockets, our design has the potential to improve the performance of an entire cluster by improving the performance of each node.

Bibliography

[1] L. R. Ford. Network Flow Theory. Report P-923, 1956. → page 17

[2] Graph500. URL https://graph500.org/. → pages 8, 14, 16, 37

[3] Laboratory for Web Algorithmics. URL http://law.di.unimi.it/datasets.php. → pages 8, 37

[4] Web Data Commons - Hyperlink Graph. URL http://webdatacommons.org/hyperlinkgraph/. → pages 1, 8

[5] T. K. Aasawat, T. Reza, and M. Ripeanu. How well do CPU, GPU and hybrid graph processing frameworks perform?
In 2018 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), pages 458–466, May 2018. doi:10.1109/IPDPSW.2018.00082. → pages 14, 19, 34, 50

[6] A. Abou-Rjeili and G. Karypis. Multilevel algorithms for partitioning power-law graphs. In Proceedings of the 20th International Conference on Parallel and Distributed Processing, IPDPS '06, pages 124–124, Washington, DC, USA, 2006. IEEE Computer Society. ISBN 1-4244-0054-6. URL http://dl.acm.org/citation.cfm?id=1898953.1899055. → page 21

[7] K. Andreev and H. Räcke. Balanced graph partitioning. In Proceedings of the Sixteenth Annual ACM Symposium on Parallelism in Algorithms and Architectures, SPAA '04, pages 120–124, New York, NY, USA, 2004. ACM. ISBN 1-58113-840-7. doi:10.1145/1007912.1007931. URL http://doi.acm.org/10.1145/1007912.1007931. → page 21

[8] L. Backstrom, P. Boldi, M. Rosa, J. Ugander, and S. Vigna. Four degrees of separation. In Proceedings of the 4th Annual ACM Web Science Conference, WebSci '12, pages 33–42, 2012. ISBN 978-1-4503-1228-8. → page 1

[9] A. S. Badashian and E. Stroulia. Measuring user influence in GitHub: The million follower fallacy. In Proceedings of the 3rd International Workshop on CrowdSourcing in Software Engineering, CSI-SE '16, pages 15–21, 2016. ISBN 978-1-4503-4158-5. → page 36

[10] S. Beamer, K. Asanović, and D. Patterson. Direction-optimizing breadth-first search. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, SC '12, pages 12:1–12:10, 2012. ISBN 978-1-4673-0804-5. → page 16

[11] M. Cha, H. Haddadi, F. Benevenuto, and K. P. Gummadi. Measuring user influence in Twitter: The million follower fallacy. In ICWSM '10: Proceedings of the International AAAI Conference on Weblogs and Social Media, 2010. → page 1

[12] D. Chakrabarti, Y. Zhan, and C. Faloutsos. R-MAT: A recursive model for graph mining. In Proceedings of the Fourth SIAM International Conference on Data Mining, page 442. Society for Industrial Mathematics, 2004.
→ page 37

[13] Z. Cui, L. Chen, M. Chen, Y. Bao, Y. Huang, and H. Lv. Evaluation and optimization of breadth-first search on NUMA cluster. In 2012 IEEE International Conference on Cluster Computing, pages 438–448, Sept 2012. doi:10.1109/CLUSTER.2012.29. → page 50

[14] M. Faloutsos, P. Faloutsos, and C. Faloutsos. On power-law relationships of the internet topology. In Proceedings of the Conference on Applications, Technologies, Architectures, and Protocols for Computer Communication, SIGCOMM '99, pages 251–262, 1999. ISBN 1-58113-135-6. → pages 7, 21, 22, 37

[15] A. Gharaibeh, L. Beltrão Costa, E. Santos-Neto, and M. Ripeanu. A yoke of oxen and a thousand chickens for heavy lifting graph processing. In Proceedings of the 21st International Conference on Parallel Architectures and Compilation Techniques, PACT '12, pages 345–354, 2012. ISBN 978-1-4503-1182-3. → pages 2, 14, 18, 21, 22, 34

[16] A. Gharaibeh, E. Santos-Neto, L. B. Costa, and M. Ripeanu. Efficient large-scale graph processing on hybrid CPU and GPU systems. CoRR, abs/1312.3018, 2013. → pages 15, 17, 23, 40, 44

[17] J. E. Gonzalez, Y. Low, H. Gu, D. Bickson, and C. Guestrin. PowerGraph: Distributed graph-parallel computation on natural graphs. In Proceedings of the 10th USENIX Conference on Operating Systems Design and Implementation, OSDI '12, pages 17–30, 2012. ISBN 978-1-931971-96-6. → pages 1, 8, 12, 14, 21, 52

[18] P. Gupta, A. Goel, J. Lin, A. Sharma, D. Wang, and R. Zadeh. WTF: The who to follow service at Twitter. In Proceedings of the 22nd International Conference on World Wide Web, WWW '13, pages 505–514, New York, NY, USA, 2013. ACM. ISBN 978-1-4503-2035-1. doi:10.1145/2488388.2488433. URL http://doi.acm.org/10.1145/2488388.2488433. → page 1

[19] T. Ideker, O. Ozier, B. Schwikowski, and A. F. Siegel. Discovering regulatory and signalling circuits in molecular interaction networks. Bioinformatics, 18(Suppl 1):S233–S240, 2002. → page 1

[20] G. Iori, G. D. Masi, O.
V. Precup, G. Gabbi, and G. Caldarelli. A network analysis of the Italian overnight money market. Journal of Economic Dynamics and Control, 32(1):259–278, 2008. ISSN 0165-1889. doi:https://doi.org/10.1016/j.jedc.2007.01.032. URL http://www.sciencedirect.com/science/article/pii/S0165188907000474. Applications of statistical physics in economics and finance. → page 1

[21] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. B. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. CoRR, abs/1408.5093, 2014. URL http://arxiv.org/abs/1408.5093. → page 18

[22] T. Kissinger, T. Kiefer, B. Schlegel, D. Habich, D. Molka, and W. Lehner. ERIS: A NUMA-aware in-memory storage engine for analytical workloads. In ADMS@VLDB, 2014. → page 18

[23] K. Lang. Finding good nearly balanced cuts in power law graphs. Tech. Rep., Yahoo! Research Labs, Nov. 2004. → page 21

[24] J. Leskovec, K. J. Lang, A. Dasgupta, and M. W. Mahoney. Community structure in large networks: Natural cluster sizes and the absence of large well-defined clusters. CoRR, abs/0810.1355, 2008. URL http://arxiv.org/abs/0810.1355. → page 21

[25] Y. Li, I. Pandis, R. Müller, V. Raman, and G. M. Lohman. NUMA-aware algorithms: the case of data shuffling. In CIDR, 2013. → page 18

[26] G. Malewicz, M. H. Austern, A. J. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski. Pregel: A system for large-scale graph processing. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, SIGMOD '10, pages 135–146, 2010. ISBN 978-1-4503-0032-2. → pages 1, 8, 21

[27] J. D. McCalpin. Memory bandwidth and machine balance in current high performance computers. IEEE Computer Society Technical Committee on Computer Architecture (TCCA) Newsletter, pages 19–25, Dec. 1995. → page 10

[28] D. Mhembere, D. Zheng, C. E. Priebe, J. T. Vogelstein, and R. Burns. Knor: A NUMA-optimized in-memory, distributed and semi-external-memory k-means library.
In Proceedings of the 26th International Symposium on High-Performance Parallel and Distributed Computing, HPDC '17, pages 67–78, New York, NY, USA, 2017. ACM. ISBN 978-1-4503-4699-3. doi:10.1145/3078597.3078607. URL http://doi.acm.org/10.1145/3078597.3078607. → page 18

[29] N. Nagarajan and M. Pop. Sequence assembly demystified. Nature Reviews Genetics, 14:157–167, 2013. → page 1

[30] D. Nguyen, A. Lenharth, and K. Pingali. A lightweight infrastructure for graph analytics. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles, SOSP '13, pages 456–471, 2013. ISBN 978-1-4503-2388-8. → pages 1, 14, 15, 18, 34, 38

[31] L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: Bringing order to the web, 1999. → page 15

[32] A. Roy, I. Mihailovic, and W. Zwaenepoel. X-Stream: Edge-centric graph processing using streaming partitions. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles, SOSP '13, pages 472–488, 2013. ISBN 978-1-4503-2388-8. → page 38

[33] P. Roy, S. L. Song, S. Krishnamoorthy, A. Vishnu, D. Sengupta, and X. Liu. NUMA-Caffe: NUMA-aware deep learning neural networks. ACM Trans. Archit. Code Optim., 15(2):24:1–24:26, June 2018. ISSN 1544-3566. doi:10.1145/3199605. URL http://doi.acm.org/10.1145/3199605. → page 18

[34] S. Sallinen, A. Gharaibeh, and M. Ripeanu. Accelerating direction-optimized breadth first search on hybrid architectures. CoRR, abs/1503.04359, 2015. URL http://arxiv.org/abs/1503.04359. → page 16

[35] J. Shun and G. E. Blelloch. Ligra: A lightweight graph processing framework for shared memory. In Proceedings of the 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP '13, pages 135–146, 2013. ISBN 978-1-4503-1922-5. → pages 1, 14, 15, 18, 38

[36] N. Sundaram, N. Satish, M. M. A. Patwary, S. R. Dulloor, M. J. Anderson, S. G. Vadlamudi, D. Das, and P. Dubey. GraphMat: High performance graph analytics made productive. Proc.
VLDB Endow., 8(11):1214–1225, July 2015. ISSN 2150-8097. → pages 14, 34

[37] A. Tizghadam and A. Leon-Garcia. A graph theoretical approach to traffic engineering and network control problem. In 2009 21st International Teletraffic Congress, pages 1–8, Sept 2009. → page 1

[38] K. Ueno, T. Suzumura, N. Maruyama, K. Fujisawa, and S. Matsuoka. Extreme scale breadth-first search on supercomputers. In 2016 IEEE International Conference on Big Data (Big Data), pages 1040–1047, Dec 2016. doi:10.1109/BigData.2016.7840705. → page 50

[39] L. G. Valiant. A bridging model for parallel computation. Commun. ACM, 33(8):103–111, Aug. 1990. ISSN 0001-0782. → page 1

[40] Y. Yasui, K. Fujisawa, and K. Goto. NUMA-optimized parallel breadth-first search on multicore single-node system. In Proceedings of the 2013 IEEE International Conference on Big Data, 6-9 October 2013, Santa Clara, CA, USA, pages 394–402, 2013. doi:10.1109/BigData.2013.6691600. URL https://doi.org/10.1109/BigData.2013.6691600. → pages 19, 50

[41] Y. Yasui, K. Fujisawa, and Y. Sato. Fast and energy-efficient breadth-first search on a single NUMA system. In Proceedings of the 29th International Conference on Supercomputing - Volume 8488, ISC 2014, pages 365–381, New York, NY, USA, 2014. Springer-Verlag New York, Inc. ISBN 978-3-319-07517-4. doi:10.1007/978-3-319-07518-1_23. URL http://dx.doi.org/10.1007/978-3-319-07518-1_23. → pages 19, 50

[42] K. Zhang, R. Chen, and H. Chen. NUMA-aware graph-structured analytics. In Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP 2015, pages 183–193, 2015. ISBN 978-1-4503-3205-7. → pages 2, 14, 19, 38

