UBC Theses and Dissertations

Data-driven spatial locality Miucin, Svetozar 2018

Data-driven Spatial Locality

by

Svetozar Miucin

B.Sc., University of Novi Sad, 2010
M.Sc., University of Novi Sad, 2011

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY in The Faculty of Graduate and Postdoctoral Studies (Electrical and Computer Engineering)

THE UNIVERSITY OF BRITISH COLUMBIA (Vancouver)

January 2019

© Svetozar Miucin 2018

The following individuals certify that they have read, and recommend to the Faculty of Graduate and Postdoctoral Studies for acceptance, the dissertation entitled:

Data-driven Spatial Locality

submitted by Svetozar Miucin in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Electrical and Computer Engineering.

Examining committee:

Alexandra Fedorova, Electrical and Computer Engineering (Supervisor)
Ivan Beschastnikh, Computer Science (Supervisory Committee Member)
Mieszko Lis, Electrical and Computer Engineering (Supervisory Committee Member)
Michael J. Feeley, Computer Science (University Examiner)
Raymond Ng, Computer Science (University Examiner)

Additional Supervisory Committee Members:

James Larus, École Polytechnique Fédérale de Lausanne (External Examiner)

Abstract

Over the past decades, core speeds have been improving at a much higher rate than memory bandwidth. This has caused the performance bottlenecks in modern software to shift from computation to data transfers. Hardware caches were designed to mitigate this problem, based on the principles of temporal and spatial locality. However, with the increasingly irregular access patterns in software, locality is difficult to preserve. Researchers and practitioners devote a lot of time and effort to improving memory performance from the software side. This is done either by restructuring the code to make access patterns more regular, or by changing the layout of data in memory to better accommodate caching policies.
Experts often use correlations between the access pattern of an algorithm and properties of the objects it operates on to devise new ways to lay data out in memory. Prior work has shown the memory layout design process to be largely manual and difficult enough to result in high-level publications.

Our contribution is a set of tools, techniques and algorithms for automatic extraction of correlations between data and access patterns of programs. In order to collect a sufficient level of detail about memory accesses, we present a compiler-based access instrumentation framework called DINAMITE. Further, we introduce access graphs, a novel way of representing spatial locality properties of programs, generated from memory access traces. We use access graphs as a basis for Hierarchical Memory Layouts – a novel algorithm for estimating performance improvements to be gained from better data layouts. Finally, we present our Data-driven Spatial Locality techniques, which use the information available from the previous steps to automatically extract the correlations between data and access patterns commonly used by experts to inform better layout design.

Lay Summary

Over the past decades, the disparity between processor and main memory speeds has grown significantly. Many important applications today suffer from poor memory performance. To improve these applications, experts manually tune the placement of data in memory. This process requires a deep understanding of the algorithm and the underlying hardware on which it runs. This work presents insights and analysis into how experts create performant memory layouts, and proposes new abstractions, algorithms and techniques to automate parts of the process. The proposed solutions have the potential of helping performance-minded engineers to improve data layout in their programs.

Preface

Chapter 2 is a modified version of our arXiv technical report:

• S. Miucin, C. Brady, and A. Fedorova.
“DINAMITE: A modern approach to memory performance profiling.” arXiv preprint arXiv:1606.00396 (2016). Technical report.

A shorter, peer-reviewed version of this work was published, ©2016 Association for Computing Machinery, as:

• S. Miucin, C. Brady, and A. Fedorova. “End-to-end memory behavior profiling with DINAMITE.” Proceedings of the 2016 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering. ACM, 2016.

I was the lead investigator on the project. Conor Brady helped build the infrastructure to connect the DineroIV [35] cache simulator and DINAMITE-compiled binaries, and implemented and wrote the material appearing in 2.3.2. I wrote the rest of the article and conducted all other experiments in the chapter. Other authors provided editorial and technical advice.

Chapter 3 is a modified version of previously published peer-reviewed work, ©2018 Association for Computing Machinery:

• Svetozar Miucin and Alexandra Fedorova. 2018. Data-driven Spatial Locality. In The International Symposium on Memory Systems (MEMSYS), October 1–4, 2018, Old Town Alexandria, VA, USA. ACM, New York, NY, USA, 11 pages. https://doi.org/10.1145/3240302.3240417

I was the lead investigator on the project. All of the research and writing is my own, with editorial and technical advice from co-authors.

The work in 3.5.4 was conducted after the “Data-driven Spatial Locality” publication. I wrote all the text within it and conducted all described experiments.

Table of Contents

Abstract
Lay Summary
Preface
Table of Contents
List of Tables
List of Figures
List of Code Listings
Acknowledgements
Dedication
1 Introduction
2 DINAMITE
2.1 Introduction
2.2 System design
2.2.1 LLVM instrumentation pass
2.2.2 Log format
2.2.3 Logging libraries
2.2.4 Analysis toolkit
2.3 Evaluation
2.3.1 Identifying cache offenders
2.3.2 Structure splitting
2.3.3 Shared variable detection
2.4 Future work and conclusions
3 Data-driven spatial locality
3.1 Introduction
3.2 Access Graphs
3.3 Hierarchical Memory Layout
3.4 Data-driven locality
3.4.1 Generating input vectors
3.4.2 Coverage
3.4.3 Training methodology and evaluation criteria
3.4.4 Tidy: a memory allocator wrapper
3.5 Evaluation
3.5.1 Hierarchical Memory Layouts
3.5.2 Data layout in storage
3.5.3 Data-driven spatial locality
3.5.4 Benchmark suite experiments
4 Related work
5 Conclusion
Bibliography
Appendices
Hardware

List of Tables

2.1 Log size comparison in 429.mcf
2.2 Cost breakdown of text and binary formats for 429.mcf, per single log entry
2.3 Instrumentation overhead comparison – 429.mcf
2.4 Logging library performance – 429.mcf
2.5 429.mcf top miss offenders
2.6 CSV output of the miss summary tool for fluidanimate
2.7 Most accessed shared variables
3.1 SPEC CPU2017 and PARSEC analysis summary

List of Figures

2.1 DINAMITE system diagram
2.2 Cost breakdown of text and binary formats for 429.mcf, per single log entry
2.3 Impact of buffering on performance of 429.mcf
2.4 Scaling improvements in PARSEC 3.0 fluidanimate application
2.5 Tool output and modified code for structure splitting of 429.mcf
2.6 Scaling improvements in WiredTiger after removing the shared variable bug
3.1 Workflow diagram
3.2 Access graph example
3.3 Hierarchical Memory Layout example
3.4 Hyperparameter search in mesh traversal
3.5 Hierarchical Memory Layouts cache and DTLB misses. Event counts are normalized to original layout results.
3.6 Cache misses, TLB misses, and runtime for HML-derived layouts in graph traversals. Normalized to original layout metrics.
3.7 Data feature importance and categorical accuracies, PageRank
3.8 Data feature importance and categorical accuracies, mesh traversal
3.9 Red-black tree grouping example
3.10 Stall breakdown
3.11 Mesh traversal – memory performance comparison between different bucket sizes
3.12 PageRank – memory performance comparison between different bucket sizes
3.13 RB trees – memory performance comparison between different bucket sizes
3.14 Data feature importance and categorical accuracies, red-black trees
3.15 Performance improvement from using the Tidy allocator wrapper with hints based on the knowledge extracted by random forests
3.16 Performance measurements for SPEC CPU2017 and PARSEC 3.0
3.17 Examples of community groupings

List of Code Listings

2.1 Trace plugin base class
2.2 Example Spark Streaming kernel
2.3 429.mcf pbeampp.c excerpt
2.4 fluidanimate pthreads.cpp code excerpt
listings/hotcold.txt
2.5 WiredTiger shared variable analysis result (JSON)
3.1 PageRank implementation loop body
3.2 Mesh traversal main loop
3.3 Red-black tree search function
3.4 531.deepsjeng: Index calculation based on hash value
3.5 ferret: Sorted array accesses
3.6 ferret: Sorting criterion

Acknowledgements

I extend my sincerest thanks to my advisor Dr. Alexandra Fedorova, whose guidance made this research possible and made me the researcher I am today. I would also like to thank all of the contributors and co-authors for all their hard work.

I would like to thank my family for believing in me throughout this journey and my friends who were always there for me.

Finally, I would like to thank Mitacs, STMicroelectronics and Oracle Labs for providing me with internship and co-op opportunities that helped put some of my work into context.

Dedication

To my wife, Nevena, whose love, support and brilliant discussions on scientific methodology made life orders of magnitude more enjoyable.

Chapter 1

Introduction

Cache-based computer architectures are ubiquitous. They date back to the 1970s, when hardware architects introduced them to tackle the growing divide between CPU speeds and RAM bandwidth [1] [21].

Caches are banks of memory that are orders of magnitude faster than DRAM, but also orders of magnitude smaller in size. Their purpose is to hold a portion of the working set close to the cores for quick access. Caches decide what data to keep local based on two assumptions about software behaviour: spatial and temporal locality [70].

Spatial locality means that if a program accesses one location in memory, it is likely to access its neighbouring locations in the near future. This principle is reflected in the way cache memories transfer data. The unit of transfer between cache memory and RAM is called a cache line and is typically 64B in size.
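As an illustration of why cache-line transfers reward sequential access (this sketch is ours, not part of the dissertation), consider summing the same matrix in two traversal orders. Both functions below compute the same result, but they interact with 64B cache lines very differently:

```cpp
#include <cstddef>
#include <vector>

// Illustrative sketch: the row-major walk visits consecutive addresses, so
// every byte of each fetched 64B cache line is used before the line is
// evicted. The column-major walk strides by a full row between accesses,
// touching a new cache line on almost every access for wide matrices.
long sum_row_major(const std::vector<long>& m, std::size_t rows, std::size_t cols) {
    long sum = 0;
    for (std::size_t r = 0; r < rows; ++r)
        for (std::size_t c = 0; c < cols; ++c)
            sum += m[r * cols + c];   // consecutive addresses: good spatial locality
    return sum;
}

long sum_col_major(const std::vector<long>& m, std::size_t rows, std::size_t cols) {
    long sum = 0;
    for (std::size_t c = 0; c < cols; ++c)
        for (std::size_t r = 0; r < rows; ++r)
            sum += m[r * cols + c];   // stride of cols * sizeof(long) bytes: poor locality
    return sum;
}
```

For matrices larger than the last-level cache, the row-major version is typically several times faster, even though both loops execute exactly the same additions.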
Furthermore, most modern caches are equipped with next-line prefetchers, which take assumptions about spatial locality a step further and issue sequential memory requests in advance.

Temporal locality means that if a program accesses one location, it is likely to access it again in the near future. This is reflected in cache line replacement policies such as Least Recently Used (LRU). Due to the closed nature of cache implementations in hardware, we cannot be sure which replacement policies are implemented in production (reverse-engineering work exists [6] suggesting modified Pseudo-LRU); however, recent research, such as Dynamic Re-reference Interval Prediction [39] and Protection Distances [39], shows that temporal locality is still one of the main considerations in designing caches.

The entire mechanism of transferring data to and from DRAM is automatically executed in hardware. Because of this, cache-based systems are very easy to program compared to the Explicitly Managed Memory systems described in the prologue. The user does not need to worry about which data are currently in the cache; they just need to ensure that their programs exhibit good spatial and temporal locality.

As programs become more complex, preserving spatial and temporal locality becomes difficult. Studies from as far back as two decades ago [7] show that modern database software spends roughly 50% of its time stalled, waiting on memory transfers to complete. This trend continued into the era of cloud computing [30], and the poor memory performance is attributed, among other things, to the increase in working set size and irregular access patterns. Improving locality of access has been shown to increase the overall performance of programs, in some cases by as much as an order of magnitude [45] [43] [61] [14] [32] [75]. However, optimizing memory performance is difficult.
Significant advances have been made in understanding and optimizing various array access patterns – static analysis of loops [22] [24], loop tiling [57] [72], etc. Dynamically allocated pointer-based data structures, such as lists, graphs and meshes, tell a different story. Solutions are typically custom-tailored to a specific algorithm and data structure combination. One approach to optimizing memory performance of programs relies on changing the layout of objects in memory to achieve better spatial locality given the algorithm's access pattern. The intuition for making better memory layouts is to put objects that are frequently accessed together close to each other in memory. Prior work in the area [10] [42] [69] [65] [17] [25] [75] [67] [36] [37] [76] [50] led us to two observations:

1. Each new data structure and algorithm requires a significant effort in understanding the access patterns. This is done by experts in both the domain and memory optimization areas.

2. Often, layout solutions are guided by the data itself. For example, GraphChi [42] groups edges based on their destination node, and sorts such groups based on their source nodes, allowing iterative algorithms to make mostly linear accesses to storage when fetching data.

The rest of the work in this dissertation is based on the second observation, that the in-domain properties of objects can be used to inform better layouts, and aims to reduce the amount of effort and expert knowledge noted in the first observation.

Chapter 2 presents our work in the extraction of memory access data from programs, along with tools to process the said data. Chapter 3 introduces a novel graph-based framework to capture the spatial locality properties of a program. It builds on this framework a set of techniques and tools aimed at characterizing the potential for performance improvement, and finally extracting the kind of expert knowledge identified in prior memory layout work.
This expert knowledge entails understanding the connections between the access pattern of the algorithm and the properties of the accessed data objects, and how they can be used to derive better layouts.

Hypothesis 1.1. The primary hypothesis of this dissertation is that the type of expert knowledge about the relationship of data and access patterns used in prior work can be extracted automatically from full memory access traces.

Hypothesis 1.2. The secondary hypothesis is that the extracted expert knowledge can be used directly in conjunction with hint-based allocators to improve memory performance of programs.

The main contribution of this dissertation is the validation of hypotheses 1.1 and 1.2. Namely, we have built a semi-automated method for extracting information about data and access pattern relationships from programs. To that end, we contribute the following:

• A compiler-based approach to full memory access tracing, described in chapter 2. We show the benefits of using compile-time instrumentation to reduce the overhead and increase the amount of available information in memory access tracing.

• Access graphs – a model for capturing spatial locality properties of programs based on memory access traces, described in 3.2.

• A method for evaluating the potential for memory performance improvement from applying better data layout strategies. Our approach uses access graphs to derive better data layouts, which can be used either to estimate performance gains or, more directly, to derive better layouts of data in storage.

• The eponymous Data-driven Spatial Locality technique, which uses random forest classifiers and data available from memory access traces to find correlations between data values and access patterns of a program. This validates hypothesis 1.1.

• An evaluation of the applicability of hint-based allocators to improve memory performance in programs, based on the knowledge extracted by the Data-driven Spatial Locality technique.
Our evaluation confirms hypothesis 1.2 for one class of programs and reveals the obstacles to applying hint-based allocators more widely.

Chapter 2

DINAMITE

2.1 Introduction

Memory performance is a limiting factor in many important programs. Traditional database systems, web servers, scientific algorithms and modern data analytics programs alike were observed to spend 50-80% of CPU cycles stalled on memory [7] [30]. That is, 50-80% of the time these programs are unable to commit any instructions due to outstanding long-latency memory accesses. Understanding and addressing the causes of these bottlenecks is of paramount importance. Performance improvements from a more efficient memory layout or improved locality of access are usually significant and in some cases reach an order of magnitude [45] [43] [61] [14] [32] [75].

At the same time, optimizing memory performance is notoriously difficult. Compiler optimizations can be effective when static analysis is sufficient to infer improvement opportunities. However, the scope of static optimizations is limited [18], partly because insight into a bottleneck can often be gained only during execution, and partly because the compiler is limited in how it can change data structure layout, particularly with dynamically allocated data structures and in unmanaged languages. As a result, developers often resort to manually optimizing their data structures and algorithms, relying on tools for dynamic program analysis and memory profiling.

Unfortunately, most existing tools suffer from a lack of generality, portability, or flexibility. Conventional CPU profilers, such as perf [23], aim to identify source locations that generate the majority of cache misses, but because of skid effects in hardware counters [12] [38], or compiler optimizations such as function inlining, this information is often imprecise or plain wrong.
Cachegrind [56] accurately identifies source lines generating cache misses, but does not provide actionable insight that might lead the programmer to reduce them. Dprof [61] identifies data structures and fields that are responsible for cache misses due to sharing among threads, but it is not flexible enough to address other causes of poor memory performance and was designed specifically for the Linux kernel. Similarly, Memprof [43] focuses on identifying objects that cause remote accesses on NUMA systems, but the implementation is Linux-specific, tied to AMD hardware, and does not lend itself to other types of analyses.

To address this gap, we built DINAMITE – a toolkit for Dynamic INstrumentation and Analysis for MassIve Trace Exploration. DINAMITE uses compile-time instrumentation to inject tracing code into the program. At runtime, the program generates precise traces containing every memory access, its source location and the corresponding variable name, type and value. These traces are then used to perform various memory-related analyses, for example: identifying the highest cache-miss offenders, locating hot and cold fields of a data structure, correlating locality of accesses with values of variables, detecting true and false sharing, building arbitrary models of memory access patterns, and many others.

The approach of using instrumentation and tracing is by itself not new; it is used in Pin [51], Valgrind/Cachegrind [56] and other similar tools. Its main downside is high runtime overhead and very large execution traces, which can reach hundreds of gigabytes even for small programs.
However, for the very challenging task of memory performance debugging this approach is often the only practical option, because certain analyses, e.g., those relying on cache simulation, can be performed only with a precise execution trace.

Key contributions of DINAMITE are as follows:

• The instrumentation is implemented as a pass in LLVM [44], so it is applicable to any language with an LLVM front-end.

• Since the instrumentation is done at compile time, the source-level debug information assigned to trace entries is precise and easy to extract.

• DINAMITE is specifically designed for memory access instrumentation. By instrumenting at compile time, we can embed all of the available debug information at no runtime cost. We show that shifting the overhead to compilation and using a custom binary format and buffering in trace generation allows the runtime overhead to remain similar to or smaller than that of state-of-the-art instrumentation tools such as Pin and Valgrind, while providing much more information about memory accesses.

• DINAMITE gives the user flexibility in how to handle execution traces. The traces can be stored in the file system, but if the user does not wish to or cannot store these typically large traces, they can be analyzed on-the-fly using a streaming analytics engine like Spark Streaming [77] (or any other similar engine).

• DINAMITE is easy to extend with additional analysis tools. A developer can write a new tool with a few lines of Scala (if using DINAMITE with Spark) or any other language of choice. We target advanced developers who understand how software interacts with the memory hierarchies of modern processors, so we wanted to give them ultimate flexibility in analysing memory traces.

We built three tools on top of DINAMITE. The first one produces variable names and source lines responsible for the highest number of cache accesses and misses.
Using it, we reduced the last-level cache (LLC) miss rate of 429.mcf from SPEC2006 by 55% and improved its performance by 12%. We also reduced the LLC miss rate and improved the performance of PARSEC's fluidanimate by 50% and 15%, respectively.

The second tool implements Chilimbi's structure splitting algorithm [19]. Thanks to it, we reduced the LLC miss rate of SPEC2006's 429.mcf by 60%, with a corresponding 20% reduction in runtime.

The third tool detects program variables that are heavily shared by many threads. This tool enabled us to detect a previously known performance bottleneck in WiredTiger, MongoDB's back-end key/value store [3] [4] [5]. Even though the bottleneck was already known and fixed before we created DINAMITE (in fact, this was one of the motivating reasons for DINAMITE), the original discovery took several weeks, while DINAMITE pin-pointed it in a few hours. The performance improvement of the read-only sequential LevelDB benchmark implemented over WiredTiger reached a factor of 20 for 32 threads.

The rest of the chapter is organized as follows. Section 2.2 provides an overview of the DINAMITE design. Sections 2.2.2, 2.2.1 and 2.2.3 contain a detailed discussion of the log format, the LLVM instrumentation pass and the logging library. Section 2.2.4 discusses two implementations of analysis frameworks – one in native C++ and another using Spark Streaming. Section 2.3 describes the tools we created with DINAMITE and evaluates them on three applications. Section 2.4 elaborates on possible avenues for future work.

2.2 System design

Our system is built from three components: an LLVM instrumentation pass, a collection of output logging libraries and an analysis toolkit. A system overview is shown in Figure 2.1. Different line types leaving the logging library show possible data paths within the system. A path taken by data depends on the type of analysis desired.
[Figure 2.1: DINAMITE system diagram. Source code is compiled by the LLVM instrumentation pass into an instrumented binary, which is dynamically linked against a logging library; the logging library sends log records to the filesystem, to a cache simulator, or over TCP to the native C++ and Spark Streaming analysis frameworks.]

Target applications are compiled with an LLVM [44] compiler configured to include our instrumentation pass. Configuration is trivial: it only requires changing the compiler invocation command. Most large software projects allow specifying the compiler command via an environment variable.

Our instrumentation pass instruments three types of events: function entry or exit, memory allocation, and memory access. For each event the instrumentation pass injects a function call to an externally linked logging library. The logging library, linked dynamically at runtime, will produce a log record of an event in binary or text format (described below). Log records are either stored in the file system or streamed over a socket. Stored traces can be analyzed using pre-packaged DINAMITE scripts written in Python or C++ (see Section 2.3), or the user can write her own tools using their language of choice. Log records streamed over the socket are processed by the Spark Streaming engine. DINAMITE includes several analysis kernels for Spark written in Scala; users can also write their own.

DINAMITE allows chaining analysis passes, similar to the Unix pipe command. For many of our analyses we stream the log data to a cache simulator tool (written in C++) that annotates each memory access log entry to indicate whether the access was a cache hit or a miss and forwards the annotated log entry to the Spark Streaming engine.

The rest of this section describes the system in more detail.

2.2.1 LLVM instrumentation pass

We chose to use the LLVM infrastructure for the implementation for the following reasons:

1. It can be used on programs written in any language supported by an LLVM front-end compiler.
To date, those include, but are not limited to, C, C++, D, Haskell, Objective-C, Swift, Ruby, etc. There is even a compiler that translates Java bytecode into the LLVM intermediate representation. Given the popularity of LLVM, we can expect this list to grow in the future.

2. It lets us add instrumentation at the level of the intermediate representation (IR), which is more convenient than instrumenting a binary. LLVM IR is an assembly-like language that is more abstract than machine code (e.g., it assumes an unlimited set of registers). When IR is translated into binary, a single memory access can be expanded into multiple machine instructions, which can introduce noise into traces and make it difficult to attribute accesses to source-code-level constructs.

3. Full debug information is available at the IR level. The front end we used, clang, embeds debug information into the IR as an abstraction of the DWARF format that is easy to parse with the tools provided in the LLVM framework.

Our instrumentation pass begins by crawling the IR debug metadata to extract information about complex data types (structs, classes, unions) and categorizes them by connecting their corresponding internal LLVM references with type and field names. This is necessary because of type aliasing: a single type in C or C++ can have multiple names because of typedefs. Without this metadata extraction, LLVM only knows about the original type definitions, but IR instructions may contain references to different names for the same type. We store all the type alias information in map-like data structures within the pass.

The core of the instrumentation pass iterates over the current module and visits each function, each basic block and each instruction within it. For each encountered function, it places a function begin log call at the beginning of its first basic block, and a function end log call at the end of each basic block that ends the function.

Memory allocation functions are treated separately.
Our pass must first recognize whether a function is a memory allocator and then gather the information about the allocated type, size and address. For each called function, our instrumentation checks the function's name against the list of known allocator functions. We generalize allocator functions as functions that take two arguments, the number of elements and the size of a single element, and output the address that points to the start of the allocated region. This model encompasses all allocation library APIs that we have encountered and, combined with the type information available through LLVM, contains all the relevant information that describes an allocation.

The list of allocators is provided in a separate file, where each entry is described with a function name and three indices indicating the position of all the relevant fields (the number of elements, the size of the element and the allocation's base pointer) in the argument list. If the allocation address is the function's return value, its index will be set to −1. Standard allocation functions (such as malloc and calloc) are included in the configuration file that comes with our pass. If the program uses any non-standard allocator functions, the user must add them to that file. The pass places an allocation event log call after each call to an allocator function.

Strings written to the log are encoded with a unique integer identifier to preserve space. The integer-to-string mappings are placed into JSON documents created by the instrumentation pass. The mappings must be consistent across modules, but LLVM compiler passes do not preserve any state across modules. Therefore, we load the JSON mappings before compiling each module and write back any updates at the end.

For ease of use with C and C++ projects, our instrumentation pass gets registered with LLVM's pass manager for standalone clang invocations.
As clang supports most of GCC's compilation flags, this makes integrating our instrumentation into existing projects, in most cases, as easy as changing the compiler invocation variable.

2.2.2 Log format

All log events contain a field for a thread identifier. We limit this field to 8 bits to conserve space, but it can easily be expanded if needed. Distinguishing between 256 unique threads was sufficient for our case studies.

Allocation and access events share fields that contain the file name, the line number, and the column number of the event's source code location. Allocation events additionally contain the base address of the allocated memory region, the size of a single allocated element, the number of allocated elements, and their type.

Access events contain the accessed address, the type of the access (read or write), the name and type of the variable corresponding to the access, and the value at the accessed address.

Function events contain the event type (entry into or exit from the function) and the function name.

Depending on the configuration, log records can be produced in text or binary format. In text format, each field of the entry is printed to a file, separated by a delimiter character. In binary format, different types of events are contained in a parent logentry structure. The logentry structure contains a type field that identifies the payload as a function, access, or allocation event. The payload is a union of the corresponding three log entry types. In the current version of DINAMITE, each log entry takes 48 bytes in total.

The log format involves a surprising trade-off between performance and log size, which we evaluate in the next section.

2.2.3 Logging libraries

Instrumented programs do not contain any logic for producing log records. Instead, they contain calls to the externally linked logging library.
Our implementation contains three different library versions based on log format and output destination: text-to-file, binary-to-file, and binary-to-socket.

The effect of log format on log size

In our implementation, each binary log record takes up 48 bytes of storage. Text entries are variable in size and depend on the number of characters needed to encode all the values. The two extremes of an entry size in text format are:

• Minimum: 18 bytes, when each field can be encoded with a single digit, plus single-character delimiters.
• Maximum: 77 bytes, when each field has the maximum value for its storage type in binary format.

The reality is somewhere in between, as shown in Table 2.1, which compares log sizes in text and binary format for SPEC2006 429.mcf, with 4.5 billion memory accesses. Text format generates smaller logs; log size is important, because real workloads generate hundreds of gigabytes of logs. At the same time, using the text format results in a much higher performance overhead (evaluated in the next section). Since we can forego storing large logs by relying on DINAMITE's streaming model, we always use DINAMITE with the binary log format to avoid the overhead.

Table 2.1: Log size comparison in 429.mcf

Number of accesses:  ~4.5 billion
Binary log size:     205 GB
Text log size:       172 GB

Figure 2.2: Cost breakdown of text and binary formats for 429.mcf, per single log entry

The effect of log format on performance

For detailed insight into the overhead of running the instrumented binary, we break it down into the following components:

• base cost: the cost of executing the uninstrumented code
• log cost: the cost of invoking the logging library function
• format cost: the cost of preparing the log entry for writing
• output cost: the cost of writing the log entry

Table 2.2 and Figure 2.2 show the broken-down cost of the instrumentation for 429.mcf.
We report all costs except the output cost, because it depends on where we write the log data; the output costs for different output destinations are reported later in this section.

Log cost is fixed and must be paid for every instrumented event. The only way to reduce this cost is to avoid instrumenting certain events altogether, according to a user-defined criterion at compile time. For example, if we are not interested in exploring the entire memory access trace of a program, but only in accesses to a single type, we can tell the compiler to instrument only those. Similarly, we can limit instrumentation to certain functions. Instrumenting isolated data structures or types can reveal some memory access patterns, but this kind of filtering is not appropriate for understanding whole-program memory behavior or for any analysis involving cache simulation.

Table 2.2: Cost breakdown of text and binary formats for 429.mcf, per single log entry

Format                   Binary      Text
Base                     2.33 ns     2.33 ns
Log cost                 13.66 ns    13.66 ns
Format cost              15.99 ns    789.95 ns
Total time (no output)   31.98 ns    805.94 ns

Format cost is the cost of packing the log data according to the specified format. It is dominant for the text format, because formatting strings is a very expensive operation relative to producing a binary record. The text format suffers a 25x higher run time relative to the binary format. That is why we always resort to the binary format in our experiments, despite its higher storage overhead.

Output cost is the cost of writing the log records to a file or sending them over a socket. Besides the cost of accessing the storage medium, it requires a system call. We mitigate this overhead via buffering. Figure 2.3 shows the effects of different buffer sizes on the runtime of 429.mcf in the binary format, writing to a RAM disk.

Figure 2.3: Impact of buffering on performance of 429.mcf (buffer size in entries vs. runtime in seconds)
By increasing the output buffer size, the performance improvement reaches its maximum at around 20x over the unbuffered version. In the rest of our measurements, we used an output buffer of 4096 entries.

Table 2.3 compares the performance of 429.mcf instrumented with DINAMITE against the slowdowns of two major instrumentation frameworks: Intel Pin and Valgrind. The Valgrind performance degradation reported here is obtained from Nethercote et al. [56], and refers to Valgrind's MemCheck tool, which performs memory error checking with a summary output at the end of execution. This is not a fair comparison to access instrumentation, but to the best of our knowledge, a Valgrind tool comparable in functionality to DINAMITE is not available. Numbers for Pin were obtained from the supplied pinatrace tool, with output to a RAM disk. In Table 2.3, the slowest version of DINAMITE without analysis instruments each access with full debug information (as described in Section 2.2.2) and outputs it to a RAM disk filesystem in binary format. Even at this level of detail and with full output enabled, DINAMITE is only 60% slower than Valgrind's MemCheck and almost 10x faster than the comparable access instrumentation in Pin. Even when using the Spark analysis pipeline, DINAMITE is only about 50% slower than pinatrace.

Table 2.4 compares the running times of 429.mcf with different variants of log formats and outputs. Note that text-formatted output makes the instrumented run very slow: 33x slower than the binary format. Sending the trace over a TCP socket to netcat is faster than writing it to a RAM disk. However, introducing Spark Streaming into the pipeline makes the TCP streaming execution 15x slower. Optimizing this would require a detailed analysis of Spark's data receiving system and is left for future work.

Our design decouples the generation of log records from their processing. Alternatively, embedding analysis logic into the logging library is also possible.
We opted against it for the following reasons:

• Instrumented programs share the heap with the logging library. Adding significant bookkeeping data structures to the heap could affect the placement of the program's data and diminish the accuracy of traces.
• Decoupling analysis from logging allows for flexibility in the languages and frameworks used for analyzing memory traces.

Table 2.3: Instrumentation overhead comparison - 429.mcf

Framework                                       Slowdown
Pin (pinatrace output to RAM disk)              354x
Valgrind (MemCheck)                             22x
DINAMITE (empty instrumentation)                7x
DINAMITE (binary format, no output)             14x
DINAMITE (binary format, output to RAM disk)    36x
DINAMITE (Spark analysis)                       537x

Table 2.4: Logging library performance - 429.mcf

Version                Destination            Time [s]   Slowdown
Uninstrumented         nil                    10.05      1x
Text (unbuffered)      RAM disk               11820      1176x
Binary (file, buff.)   RAM disk               360.09     36x
Binary (file, buff.)   Hard disk              1426       142x
TCP (buffered)         netcat >/dev/null      339.12     34x
TCP (buffered)         Spark (access count)   5400       537x

2.2.4 Analysis toolkit

The analysis toolkit consists of two different frameworks for writing log-processing applications. Logs recorded to a filesystem are processed with the native analysis framework written in C++. The framework includes support for parsing logs and allows the user to easily extend the analysis by writing a new C++ class. Alternatively, users can write their own parsing and analysis tools in any language of their choice. Logs streamed over a TCP socket are processed live with Spark Streaming drivers. Similarly, the user could configure DINAMITE to use any other system to ingest or analyze streaming logs (e.g., Kafka, Google Dataflow). Our toolkit also includes a simple cache simulator, which processes and annotates streamed log records with cache hit/miss indicators.
Next, we describe these components in more detail.

Native analysis framework

The C++ framework provides support for writing arbitrary analysis kernels. To write a new kernel, the programmer must extend the TracePlugin class (shown in Listing 2.1).

Listing 2.1: Trace plugin base class

    class TracePlugin {
        ...
    protected:
        NameMaps *nmaps;
        TracePlugin(const char *name);
    public:
        virtual void processLog(logentry *log) = 0;
        virtual void finalize() = 0;
        virtual void passArgs(char *args) = 0;
    };

The framework reads log records into a buffer and passes each log entry to the chosen plugin by invoking its processLog(logentry*) method. At the end of the log file, the framework calls the plugin's finalize() method, which is used for writing the output of the analysis.

Spark Streaming analysis framework

To analyze streamed log records with Spark Streaming, the programmer writes an analysis kernel in Scala. To this end, our framework provides a custom Spark Streaming Receiver class and a log converter. A receiver accepts batches of log events in binary format over a TCP socket and stores each log entry in its associated StreamingContext.

Listing 2.2 shows a Spark Streaming kernel for counting the number of memory accesses per variable. To extract useful information from the entries, the incoming DStream is routed through a map operation that invokes our LogConverter class on each entry. LogConverter unpacks and outputs log data as Scala classes, distinguishing between function events, allocation events, and access events. To get a DStream of instances of a certain event type, logs are filtered with a class-matching operation. These events are then mapped to (varId, 1) pairs and reduced by summing over variable IDs. Persistent state is updated by invoking Spark Streaming's updateStateByKey() operation. A custom update function, omitted from our listing for brevity, updates the counts by summing new results with the previous state.
Results are then output to the console or the filesystem.

Listing 2.2: Example Spark Streaming kernel

    def main(args: Array[String]) {
      val sparkConf = new SparkConf()
        .setAppName("AccessCounter");
      val ssc = new StreamingContext(sparkConf,
        new Duration(1000));
      ssc.checkpoint("/checkpoints/");

      val logs = ssc
        .receiverStream(new LogReceiver(9999))
        .map(rawlog =>
          LogEntryReader.extractEntry(rawlog));

      val counts = logs
        .filter(log =>
          log.isInstanceOf[AccessLog])
        .map(access =>
          (access.asInstanceOf[AccessLog].varId, 1L))
        .reduceByKey(_+_)
        .updateStateByKey(sumUpdater);

      counts.print();

      ssc.start();
      ssc.awaitTermination();
    }

Integration with Spark Streaming gives the programmer access to the full set of Spark Streaming operations and makes it possible to process logs as they are output from a live running program.

Cache simulator

For detailed analysis of program cache behaviour, we wrote a simple cache simulator, which is placed as an intermediate step between the generation of the log output and the analysis framework (or the filesystem, if we are saving the logs for offline processing).

We simulate a single-level cache, typically configured with parameters reflecting the last-level cache of our target system. The cache simulator accepts log entries over a socket, much like the Spark Streaming analysis framework. It annotates each memory access with an indicator of whether it was a cache hit or a miss. The annotated logs are either passed on to the analysis framework or stored in a filesystem for offline processing.

In our evaluation of the system, we found that having cache behaviour information was essential for identifying certain optimization opportunities.
2.3 Evaluation

In this section we describe three tools that we built using DINAMITE and demonstrate how they guided our optimization of three applications: SPEC's 429.mcf, PARSEC's fluidanimate, and WiredTiger, MongoDB's key-value store.

All experiments described in this section were performed on the machines listed in the Appendix.

2.3.1 Identifying cache offenders

"For large working sets it is important to use the available cache as well as possible. To achieve this, it might be necessary to rearrange data structures. While it is easier for the programmer to put all the data which conceptually belongs together in the same data structure, this might not be the best approach for maximum performance."
Ulrich Drepper, 2007 [27]

This tool identifies the program variables and source lines that generate the most last-level cache accesses and misses. We demonstrate its use on 429.mcf and fluidanimate below.

429.mcf

429.mcf performs single-depot vehicle scheduling using the network simplex method. The implementation represents the nodes and arcs of the network as C structs. In the benchmark description, the author mentions reordering the fields of both the node and arc structs in an attempt to reduce cache misses and improve performance [34]. Nevertheless, DINAMITE enabled additional optimizations.

Table 2.5 shows the output of the cache-offender tool. We notice that a disproportionate number of misses are caused by the ident field of the arc struct, more than four times as many as for the second most accessed field, node.orientation.

Table 2.5: 429.mcf top miss offenders

Variable Name      File        Line   Miss count
arc.ident          pbeampp.c   167    45247271
node.orientation   mcfutil.c   85     1104784
node.basic_arc     mcfutil.c   86     988543
arc.cost           mcfutil.c   86     767273
node.potential     pbeampp.c   170    235696

Upon closer inspection, we noticed that all the arc structures are allocated as a single large array, even though they represent nodes in a linked data structure. The majority of accesses to arc.ident were made within a single loop (shown in Listing 2.3).
The loop iterates over the arc array until it finds a match, and only then accesses the arc's other fields. Every time arc.ident was accessed, the corresponding cache line was filled with the other fields, most of which were not used before the cache line was evicted. This data layout and access pattern waste cache space and memory bandwidth. We addressed the problem by restructuring the array of arcs from the array-of-structures layout into a structure of arrays.

Listing 2.3: 429.mcf pbeampp.c excerpt

    165 for( ; arc < stop_arcs; arc += nr_group )
    166 {
    167     if( arc->ident > BASIC )
    168     {
    169         /* red_cost = bea_compute_red_cost(arc); */
    170         red_cost = arc->cost - arc->tail->potential + arc->head->potential;
    171         if( bea_is_dual_infeasible( arc, red_cost ) )
    172         {
    173             basket_size++;
    174             perm[basket_size]->a = arc;
    175             perm[basket_size]->cost = red_cost;
    176             perm[basket_size]->abs_cost = ABS(red_cost);
    177         }
    178     }
    179 }

Our modifications brought a 55% reduction in LLC misses and a 12% improvement in the overall runtime.

fluidanimate

Fluidanimate is an Intel Recognition, Mining and Synthesis application that uses the Smoothed Particle Hydrodynamics method to simulate an incompressible fluid. It uses the Navier-Stokes equation to derive fluid density fields. It is included in the PARSEC 3.0 benchmark suite because of the increasing significance of physics simulation in video-game programming and
Table 2.6 shows output of thetool: variable names and source lines responsible for the most cache misses.The top cache offenders are Cell.next and Vec3.x. The names of thestructs and fields suggest that a Cell is an element of a linked collection.Listing 2.4 shows the code excerpt pointed to by the output of our tool. Wecan immediately see that the code generating misses is a traversal of grid ofCell structures in which only the next field is touched.Looking at the definitions for Cell and Vec3 types we can see that Cellrepresents a linked list of containers for arrays of Vec3 structures that con-tain three-dimensional vectors. The arrays themselves are contained withinthe Cell struct in their entirety. The total size of a Cell struct with thepayload was 896 bytes, making a single instance span 14 cache lines.This data layout is poorly optimized for traversing lists of Cells, becauseeach new Cell access generates a cache miss. Our idea, therefore, was toallocate the Cell’s payload, which is rarely touched, separately from therest of the structure. The structure would then include a pointer to itspayload; since each Cell’s payload consists of multiple arrays, adding alayer of indirection to access the payload would not be much of a penalty.Allocating the Cell payload separately brings the size of the structure downto 16 bytes . Since consecutive calls to a memory allocator function for avariable of the same size will return near consecutive addresses in moststandard libraries, consecutive Cells will be allocated close together, andseveral of them will fit into a single cache line.Listing 2.4: fluidanimate pthreads.cpp code excerpt192.3. 
    522 void ClearParticlesMT(int tid)
    523 {
    524     for(int iz = grids[tid].sz; iz < grids[tid].ez; ++iz)
    525         for(int iy = grids[tid].sy; iy < grids[tid].ey; ++iy)
    526             for(int ix = grids[tid].sx; ix < grids[tid].ex; ++ix)
    527             {
    528                 int index = (iz*ny + iy)*nx + ix;
    529                 cnumPars[index] = 0;
    530                 cells[index].next = NULL;
    531                 last_cells[index] = &cells[index];
    532             }
    533 }

This change brought a 50% reduction in the LLC cache miss rate and a 15% reduction in runtime with 16 threads (see Figure 2.4).

Figure 2.4: Scaling improvements in the PARSEC 3.0 fluidanimate application (normalized execution time, original vs. modified, 1-32 threads)

An interesting observation is that, in the original implementation, Cells were allocated cache-aligned and padded to fill an entire cache line, indicating a prior effort to make better use of the cache hierarchy. However, with DINAMITE we discovered that making the Cell struct larger by padding actually hurt performance on our system. Intricacies of modern multi-core memory hierarchies can mislead even very experienced programmers. Powerful performance analysis tools are thus crucially important.

2.3.2 Structure splitting

Our structure splitting tool is based on the class splitting algorithm proposed by Chilimbi et al. for Java programs [19]. The algorithm analyses how the program accesses the members of a class to determine whether the class is fit to split into two separate classes. Splitting classes is motivated by the idea that hot fields, or fields that are accessed significantly more than cold fields, should be placed in a separate class so that more hot data can be packed into a single cache block. To access fields that are considered cold, the hot class includes a pointer to the cold class. (This is similar to the optimization that we applied to fluidanimate.)

The algorithm begins by identifying live classes.
A class is considered live if it is accessed more than a certain threshold number of times, and only live classes are considered for splitting. Next, fields in live classes are marked as hot or cold based on their access counts relative to their class's; if a field is accessed significantly more than the other fields, it is considered hot. Full details of the algorithm are described elsewhere [19].

Chilimbi et al. implemented the splitting algorithm for Java classes, using the JVM for access statistics and BIT, a Java byte-code instrumentation tool [47]. We implemented the splitting algorithm in DINAMITE, making it accessible to a wider range of programs, including those written in unmanaged languages. The tool works as follows: a Spark Streaming driver receives access logs from the instrumented binary and produces the list of variables and their access counts. A Python script then generates a chart for each live structure showing the weights assigned by the algorithm to individual fields, with black bars for hot fields and gray bars for cold fields. Programmers then split their structures according to the hot and cold fields in the chart.

Figure 2.5 shows the chart produced by the structure splitting tool for the arc struct in 429.mcf, as well as the modified arc struct code. Similarly, struct node (not shown) was another live struct with both hot and cold fields. Splitting hot and cold fields in these structs delivered a 20% speedup and reduced the LLC miss rate by 60%, as measured with perf.

2.3.3 Shared variable detection

On machines with even a handful of cores, variables updated by multiple threads can quickly become a scaling bottleneck [14], even if these variables are not protected by a lock or accessed via atomic instructions [26]. Repeatedly updating a shared variable from different cores stresses the coherency
protocol and can slow down the program by an order of magnitude relative to a sharing-free execution. Tools for detecting shared-variable bottlenecks do exist, but they are hardware-specific (e.g., Intel's VTune [53] works only on Intel machines; DProf [61] and Memprof [43] work only on AMD hardware) and can be non-trivial to set up (DProf and Memprof require changing the kernel). DINAMITE is easily extended to detect shared-variable bottlenecks in any binary that can be compiled with LLVM.

Figure 2.5: Tool output and modified code for structure splitting of 429.mcf

    struct cold_arc {
        flow_t flow;
        arc_p nextout;
        arc_p nextin;
    };

    struct hot_arc {
        cost_t cost;
        node_p tail, head;
        int ident;
        cost_t org_cost;
        struct cold_arc *ca;
    };

To demonstrate, we created a simple tool that we then used to quickly find a known scalability bottleneck in WiredTiger [3] [5], a MongoDB storage engine [4]. To truly test the experience of creating new tools with DINAMITE, the student who created the shared variable detection tool was not told which variables and source locations triggered the bottleneck; he was only advised that a bottleneck exists and given instructions for running the problematic workload.

The engineer who originally diagnosed the scalability issue took about a week to do so after observing poor performance; she used Memprof, which required communication with its authors and changes to the kernel. Even though the changes were simple, they would likely be considered "beyond the call of duty" by many developers. The DINAMITE tool took several hours to create by a person familiar with the overall framework and consists of two simple Spark Streaming kernels and a Python script.

The first kernel identifies the top shared variables.
The second kernel processes the execution traces again, looking only at the frequently shared variables, and collects the source locations where the accesses are made. The tool could be structured with a single kernel that both identifies the top shared variables and records the source lines, but we found that having two kernels is simpler and performs better under Spark Streaming.

The first kernel translates memory access log entries into (accessed address, variable identifier, thread identifier) tuples. Each tuple acts as the key in a map-reduce transformation that produces a list of variable-identifying tuples and the corresponding total memory access counts. The result is stored in a persistent table.

Next, a Python script reads that table and transforms the data into a dictionary, where each accessed address serves as a key and the corresponding value contains the variable identifier and the access counts performed by each thread. The script filters the results according to the following criteria:

• It removes all entries accessed by only a single thread.
• It removes all entries that are not uniformly shared by threads. We define uniform sharing as follows: let Asorted be the list of per-thread access counts for the address, sorted in descending order and zero-indexed; if Asorted[0] < 2 * Asorted[1], the sharing is uniform.

The output is then sorted in descending order by the total number of accesses. Table 2.7 shows the first five entries of the output generated for the LevelDB sequential read benchmark over WiredTiger (release 2.6.0) executed with 32 threads, which triggered the bottleneck. The top offender is the field v in the __wt_stats struct.

These results only point to the variable responsible for the shared accesses. To find the root cause, we use the second Spark kernel to find the source locations where the accesses are performed.
The second kernel is very similar to the first, except that it discards all log entries that do not correspond to the top shared variable, and keeps the source lines where the accesses occurred. We sort the results by the number of accesses in descending order.

Table 2.7: Most accessed shared variables

Address    # Accesses   # Threads   Variable
0x64D900   42495568     32          __wt_stats.v
0x64D1A4   26183326     74          __wt_connection_impl
0x64E0EC   7233836      72          __wt_connection_impl.N/A
0x64D100   4786616      36          __wt_txn_global.states
0x64D540   4786370      34          __wt_stats.v

(The connection structure is shown as accessed by more than 70 threads because the benchmark creates and tears down additional threads before the measured run.)

Listing 2.5: WiredTiger shared variable analysis result (JSON)

    {
        "threadcount": 18,
        "totalcount": 311881,
        "threads": [
            [156, 19492],
            [163, 19520],
            ...
        ],
        "file": "wiredtiger/build_posix/../src/btree/bt_curnext.c",
        "line": 446,
        "variable": "__wt_stats.v"
    }

Listing 2.5 shows the first entry of the output (redacted for brevity), which correctly identifies the source location responsible for the bottleneck. The fields in the output JSON document are self-explanatory, apart from the threads list, which contains (thread id, thread access count) pairs.

It turns out that this sequential read-only workload suffered from scalability problems because threads were incrementing a shared statistics counter after each read operation. This problem was later fixed by implementing per-thread statistic buffers.

Figure 2.6 shows the performance impact of the bug on Machine A (described in the Appendix). A seemingly benign counter increment, which takes negligible time in a single-threaded execution, quickly escalates into a huge scaling bottleneck with as few as four threads and slows down the workload by an impressive factor of 20 with 32 threads.
Previous work reported a similar performance impact of shared variables on multicore machines [26]. With increasing core counts on new hardware, the importance of tools that enable productive memory performance analysis will continue to grow.

Figure 2.6: Scaling improvements in WiredTiger after removing the shared-variable bug (bandwidth in MB/s, 1-32 threads, with and without the bug)

2.4 Future work and conclusions

We described the implementation of DINAMITE and discussed the performance implications of our design choices: fine-grained compiler-based instrumentation and a flexible analysis framework. Our detailed breakdown of the costs involved in instrumenting memory accesses leads us to conclude that this kind of design is not only feasible, but allows for a great level of detail in the generated traces, while keeping the slowdown comparable to state-of-the-art instrumentation frameworks. We introduced a novel fusion of instrumentation and stream processing that eliminates the need to store traces and provides an easy-to-use Spark Streaming API for analysis purposes. Finally, we demonstrated the utility of DINAMITE by performing three different types of analysis that were difficult, impossible, or constrained to a certain OS/hardware platform with the previously available tools.

In future work, we plan to expand the kinds of analysis that can be done on access traces using a streaming framework. Spark Streaming currently buffers incoming log records and processes them at the expiration of a configurable timeslice. In our experience, the timeslice must be rather large (e.g., one second) to avoid performance problems with Spark, but with such a large timeslice the batches contain hundreds of thousands of accesses. Adding support for a framework that is able to process a batch of records after it accumulates a specified number of records would let us have
finer granularity in our analysis. This kind of setup would allow discovering access patterns within small windows of time, which is important for certain optimizations.

Further, we plan to explore and reduce the slowdown of DINAMITE when using the full analysis pipeline with Spark Streaming. Finding the best way to integrate log generation and analysis is an important factor in improving the overall productivity of engineers using our system.

Finally, our implementation of the cache simulator is rather simplistic. Adopting a multi-level cache simulator such as Dinero IV [28] and adding more cache information to the logs would help improve our understanding of how different data organizations and access patterns affect program efficiency.

Chapter 3

Data-driven spatial locality

3.1 Introduction

The memory wall is the phenomenon where the cost of memory accesses exceeds the cost of non-memory instructions to the point that the program spends most of its CPU time waiting on memory. Wulf and McKee predicted that modern software would hit the memory wall at the beginning of this millennium [73]. Ailamaki, DeWitt, Hill and Wood showed in 1999 that database systems spent 20-50% of their CPU time waiting on memory [7]. For modern "cloud" workloads this figure is 50% on average, and it reaches 90% for OLTP benchmarks [30]. Caches mitigate memory latency, but it is believed that they will never catch up with the voracious appetite of modern applications [30].

To navigate the limitations of hardware, programmers invest substantial effort into software optimizations aimed at reducing cache miss rates. One family of such optimizations involves rearranging the program's data in memory in order to improve spatial locality. Spatial locality occurs when data items that are accessed close together in time also reside close together in memory. Hardware caches fundamentally rely on spatial locality for efficient operation. Finding an optimal arrangement of objects in memory is NP-hard.
A guiding principle used in prior work on memory layouts is to put objects that are frequently accessed together close to each other in the address space. Our literature review revealed that these optimizations are largely manual and require deep understanding of the program's algorithms and data structures, making many of them the subjects of top-tier publications [10, 17, 25, 36, 37, 42, 50, 65, 67, 69, 75, 76]. These algorithms deliver significant performance improvements, but are very difficult to implement.

Many memory layout optimization algorithms rely on using features of the data itself to inform the placement of objects in memory. For example, it is common in mesh traversal algorithms to pack mesh nodes and triangles according to their in-domain proximity – objects with similar Cartesian coordinates. We aim to generalize this approach to any program that operates on many objects in memory, and to automate the extraction of knowledge needed to derive new layout strategies.

This work introduces access graphs – a novel representation of a program's memory access patterns, constructed from dynamic memory access traces. Access graphs have memory objects for nodes, and their edges show how frequently the program accesses two objects together. Using access graphs, we reframe the memory layout problem as a combination of community detection and graph linear arrangement – both well-researched problems with many good heuristic solutions. Based on these heuristics, we build a new algorithm called Hierarchical Memory Layouts (HML) that computes layouts with improved spatial locality. Hierarchical Memory Layouts combined with cache simulation give an estimate of the possible cache improvement achievable through layout changes. We use the output of HML to train random forest classifiers to automatically extract the relationships between data features (e.g. Cartesian coordinates) and memory access patterns (e.g.
traversal). We then use the discovered relationships to improve the layout of the program data in memory or on disk.

The contributions of this work are as follows:

1. Access graphs: A novel way of representing memory access patterns within a program. Access graphs reveal which objects in memory are accessed contemporaneously. They are computed from allocation and access traces. By using our proposed analysis techniques, programmers can automatically extract the expert knowledge of access patterns that is key in most prior work on data layouts.

2. Hierarchical Memory Layouts (HML): A new algorithm that uses access graphs to automatically derive improved data layouts for memory-intensive programs. HML combines prior work in graph community detection and linear arrangement in a novel way in the context of spatial locality optimization. Using HML, programmers can get an estimate of how much room for improvement there is in a program's data layout. In certain workloads, HML can be used directly to recompute layouts of data in storage.

3. Data-Driven Locality: A novel application of Random Forest Classifiers to detect correlations between memory access patterns and data properties. We use random forests to learn which features of the data itself can be used at runtime to group allocated objects, with the goal of improving spatial locality. We use these techniques to automatically infer expert knowledge from prior work on data structure layouts (graphs and meshes), and expand the application to red-black trees. To evaluate the performance gains we built Tidy, a hint-based allocator wrapper that lets us use objects' data field values to guide allocation at runtime.

Figure 3.1 shows a detailed workflow diagram for the techniques presented in this paper. We will not go into much detail about each of the nodes in the diagram, but we encourage the reader to refer back to it as they read through sections 3.2-3.4. It is divided into three different stages, outlined with dotted lines.
The first stage shows the process of access graph creation, which is described in detail in §3.2. The second stage is layout performance evaluation. In this stage, we evaluate the potential for performance improvement from changing the data layout of the program. Layout performance evaluation is covered in §3.3. Finally, if the programmer deems it worthwhile to change the data layout based on cache simulation results, they proceed to the third stage – Data-Driven Spatial Locality, described in §3.4. In this stage, we use machine learning techniques to discover data features that can be used as hints for Tidy, our allocator wrapper.

The only parts of the workflow that require human input are the inspection of performance evaluation results and the modification of the program's allocator calls in the third stage. The automation of the performance evaluation is possible simply by setting predefined thresholds for cache miss improvement.

The rest of this chapter contains evaluation and discussion of our results in §3.5.

3.2 Access Graphs

Access graphs are a novel way of aggregating a program's memory access trace to capture properties related to spatial locality. In this section we formally define access graphs and explain how we use them to reason about and improve spatial locality.

To refer to units of memory holding a datum of a specific type we interchangeably use the terms data items, data elements and data objects. The contents of the location could be an instance of a C struct, a C++ object or another kind of data – this distinction is not important for our tools. Besides accesses to dynamically allocated (heap) objects, memory access traces contain accesses to global variables and local (stack) variables. We focus on large data structures that generate many cache misses – they are
most likely to consist of dynamically allocated objects. Therefore, we filter data objects of other kinds from the trace.

Figure 3.1: Workflow diagram. The three stages are access graph creation, layout performance evaluation and Data-Driven Spatial Locality.

Definition 3.2.1 (Object Access Trace) An Object Access Trace is a filtered form of a program's dynamic memory access trace. It is obtained by first removing all stack and global accesses from the trace. Next, all the accesses that target heap objects are replaced with accesses to the target objects' base addresses.

Example: If an access writes to address 0xdeadbee8 and it is determined that the address is within the bounds of an object with base address 0xdeadbee0, the write to 0xdeadbee8 is replaced with a write to 0xdeadbee0.

Definition 3.2.2 (Object Map) An Object Map is a hash map of all the allocated objects. It is generated by processing the original memory access trace, before converting it into an Object Access Trace. When an object is accessed, the offset at which the access was made is recorded in the map, along with the type of access (read/write) and the value read/written. Object Maps give information about the type of an object and the contents of all of its data fields.

Example: If an access writes the value 3.14 to address 0xdeadbee8 of the object at 0xdeadbee0, the Object Map entry for that object is updated with information about a write with value 3.14 at offset 8.

Object maps allow us to connect memory addresses with properties of the corresponding objects.
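As a concrete illustration, the trace filtering and Object Map bookkeeping above, together with the access-graph edge counting of Definition 3.2.3 (below), might be implemented as follows. This is a sketch, not DINAMITE's actual log format: the trace record shape (kind, op, addr, value) and the allocation table passed in are assumptions made for the example.

```python
from bisect import bisect_right
from collections import defaultdict, deque

def build_access_structures(trace, allocations, cl=2):
    """Build the Object Access Trace, Object Map and access graph edge
    weights from a raw trace.
    trace: iterable of (kind, op, addr, value) records, where kind is
           'heap', 'stack' or 'global';
    allocations: dict mapping each heap object's base address to its size."""
    bases = sorted(allocations)

    def base_of(addr):
        # Find the heap allocation containing addr, if any.
        i = bisect_right(bases, addr) - 1
        if i >= 0 and addr < bases[i] + allocations[bases[i]]:
            return bases[i]
        return None

    object_trace = []               # accesses rewritten to base addresses
    object_map = defaultdict(list)  # base -> [(offset, op, value), ...]
    edges = defaultdict(int)        # {u, v} -> contemporaneous-access count
    window = deque(maxlen=cl - 1)   # the last CL-1 object accesses

    for kind, op, addr, value in trace:
        if kind != 'heap':
            continue                # drop stack/global accesses
        base = base_of(addr)
        if base is None:
            continue
        object_map[base].append((addr - base, op, value))
        for prev in window:         # accesses within CL of one another
            if prev != base:        # no self-edges
                edges[frozenset((prev, base))] += 1
        window.append(base)
        object_trace.append(base)
    return object_trace, dict(object_map), dict(edges)
```

For an object trace equivalent to the sequence A, B, C, A, B, B, C with CL = 2, this sketch yields edge weights A-B = 2, B-C = 2 and A-C = 1.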
For example, the map may inform us that the memory address 0xdeadbee0 contains an object of type mesh_node_t with the value 12.34 at offset 0 (the x-coordinate field), and the value 42.1 at offset 8 (the y-coordinate field). The mapping between memory addresses and objects will be used later to find correlations between the access pattern and the object properties, and to improve the data layout in memory or on disk. For example, our algorithms will automatically discover that objects with similar x, y coordinates are accessed close together in time; by allocating mesh objects with similar coordinates close together in space we will improve spatial locality and reduce the execution time.

Definition 3.2.3 (Access Graph) An access graph is an undirected graph where there is a vertex for every dynamically allocated object in the Object Access Trace. Two vertices are connected by an edge if there are at least two contemporaneous memory accesses to the objects represented by these vertices. Two accesses are considered contemporaneous if they occur within CL memory accesses of one another. CL is called a locality constraint. Whenever we detect the first contemporaneous access we create an edge between the two vertices and assign it the weight of one. Whenever we detect another contemporaneous access to vertices already linked by an edge, we increment the edge's weight by one.

Figure 3.2: Access graph example (an Object Access Trace and the corresponding access graph)

Figure 3.2 shows an example of an Object Access Trace with the corresponding access graph for CL = 2. Note that D and C have no edge between them because there are no contemporaneous accesses to them in the trace. Objects B and C have an edge with the weight of two, because there are two contemporaneous accesses to them in the example trace.

The choice of the CL value for computing access graphs has two aspects that should be considered: computational cost and captured information.
From the computational cost point of view, the CL value tells us exactly how many edge additions/updates we have to perform for each access in the Object Access Trace. Because of this, CL should be as low as possible, while retaining important information in the access graph. As for the second aspect, we empirically tested higher values of CL for Hierarchical Memory Layout computation (described in §3.3), but did not observe a significant improvement in the quality of the results. For our analysis techniques, we used CL = 2, meaning we only consider two immediately adjacent accesses in the Object Access Trace. Using access graphs for other purposes may require reconsidering this choice.

From these definitions, we infer the following: the weight of the edges in an access graph tells us which objects get accessed contemporaneously the most. The more occurrences of accesses to A and B within a window CL in the program's memory access trace, the heavier the weight of the edge between A and B. As a result, access graphs enable the automation of spatial locality optimizations that in the past, to the best of our knowledge, were performed manually.

Given an access graph, how do we use it to improve spatial locality? The graph tells us which objects are accessed contemporaneously. How do we translate this information into a more efficient program? To answer this question, we will break it down into two parts. First, we will find out if there is potential to reduce the cache miss rate by using a different data layout. Given an Object Access Trace and an Object Map, we will assume that we can rearrange the memory addresses of the objects in any way we like (without worrying how this could be achieved in practice), replay the access trace in a simulator, and evaluate the cache miss rate resulting from the new layout. We compute the new layout using the new Hierarchical Layout Algorithm that we describe in §3.3.
Although this is not a concrete solution that a programmer can use directly, evaluating layout changes in an abstract way will help us understand if there is potential to improve performance by changing the layout.

The second part of the question asks how we can improve the data layout in a concrete program. Our work on Hierarchical Memory Layouts in §3.3 describes the process of obtaining good memory layouts that can be used directly to reorder data in storage. Section 3.4 presents a machine learning technique that trains on the Object Map and discovers the properties of the data objects available at object allocation time that can be used to guide object placement in memory at runtime.

3.3 Hierarchical Memory Layout

In an access graph, objects that have a lot of contemporaneous accesses are connected by heavily weighted, i.e., strong, edges. Relying on this property, we can reframe our grouping objective as the well-researched problem of community detection in networks.

Community detection algorithms detect groups of nodes in graphs such that the connectivity within a group is strong (many edges, higher weights), and the connectivity between groups is weak (few edges, lower weights). In the context of access graphs, community detection algorithms would place into the same groups objects that are often accessed contemporaneously, and would place into different groups objects that are rarely accessed contemporaneously.

Our Hierarchical Memory Layout algorithm extends the multilevel community detection algorithm by Blondel et al. [13]. Multilevel community detection starts with every graph node representing its own community. It expands communities by adding close neighbours into them, making sure that each change results in higher inter-cluster separation and intra-cluster proximity (together called modularity).
When such a change is impossible, the algorithm stores the community assignment and transforms the graph by fusing all nodes within a community into a single node, and aggregating all edges to other such community nodes. This process is repeated on the transformed graph until no remaining change improves modularity.

Applying Blondel's algorithm to access graphs produces a hierarchy of communities. We will refer to the level with the highest number of small communities as the first level or bottom level, interchangeably. Subsequent levels with fewer, larger communities will be referred to as being higher in the hierarchy.

Nodes in first level communities are not ordered internally. This may not be a problem if the entire community fits within a unit of spatial locality (e.g., cache line, VM page, disk page, etc.). Unfortunately, we cannot choose the size of the communities; it is dictated by the access graph's structure. Frequently, first level communities turn out to be so large that a random permutation of objects within them loses any beneficial locality properties. In these cases we need to find a good internal ordering of objects for first level communities.

Ordering access graph nodes within first level communities follows two rules:

• The stronger the edge between two objects, the closer they should be placed in the layout.

• The relative placement of object pairs that have no edge between them is irrelevant.

These rules are in line with the optimization objective of the Minimum Linear Arrangement problem (MinLA). Minimum Linear Arrangement is a known NP-hard problem, for which researchers have explored many heuristics [62].
Because access graph edges are weighted, we use the weighted variant of the problem.

Definition 3.3.1 (Weighted Minimum Linear Arrangement) Given a graph G(V, E) with |V| = n, find a one-to-one function

ϕ : V → {1, ..., n}

that minimizes the Linear Arrangement cost (LA), defined as

LA(G, ϕ) = ∑_{(u,v)∈E} w(u, v) · |ϕ(u) − ϕ(v)|

Definition 3.3.1 states that Minimum Linear Arrangement has the objective of linearly laying out graph nodes so as to minimize the distance between connected nodes. The weighted version of the problem prioritizes reducing the distance between pairs of nodes with stronger edges. In the context of access graphs, MinLA heuristics will try to place objects that are frequently accessed contemporaneously as close as possible in the memory layout.

In our work we use the Spectral Sequencing heuristic [40] proposed by Juvan et al. to approximate solutions to MinLA on access graphs' communities.

Definition 3.3.2 (Spectral Sequencing) Spectral Sequencing computes the Fiedler vector of the graph G – the eigenvector x(2) corresponding to the second lowest eigenvalue λ2 of the Laplacian matrix L_G of the graph G. It then produces the ordering function ϕ such that

ϕ(u) < ϕ(v) ⇔ x(2)(u) < x(2)(v)

Spectral Sequencing was shown [62] to give results of good quality at a low computational cost. Hierarchical Memory Layouts use Spectral Sequencing as a sub-algorithm, but it can be replaced with any suitable MinLA heuristic.

The Hierarchical Memory Layout algorithm operates on the Access Graph (constructed from the Object Access Trace) in two phases, utilizing the two previously described algorithms.

The first phase performs multilevel community detection, producing community levels L1, ..., Ln, where L1 represents the first computed community level – the one with the largest number of small communities. As the levels increase, communities become fewer in number and greater in size.

The second phase performs Spectral Sequencing on each community in L1.
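A minimal sketch of the Spectral Sequencing step of Definition 3.3.2, using NumPy's dense symmetric eigensolver. A production implementation over large communities would use a sparse eigensolver instead; the function and parameter names here are invented for illustration.

```python
import numpy as np

def spectral_sequence(nodes, edges):
    """Order graph nodes by the Fiedler vector of the weighted Laplacian.
    nodes: list of hashable node identifiers;
    edges: dict mapping (u, v) pairs to positive edge weights."""
    index = {n: i for i, n in enumerate(nodes)}
    n = len(nodes)
    lap = np.zeros((n, n))
    for (u, v), w in edges.items():
        i, j = index[u], index[v]
        lap[i, j] -= w          # off-diagonal entry: -w(u, v)
        lap[j, i] -= w
        lap[i, i] += w          # diagonal entry: weighted degree
        lap[j, j] += w
    # eigh returns eigenvalues in ascending order; column 1 holds the
    # eigenvector of the second-smallest eigenvalue (the Fiedler vector).
    _, vecs = np.linalg.eigh(lap)
    fiedler = vecs[:, 1]
    return sorted(nodes, key=lambda u: fiedler[index[u]])
```

For a path graph the Fiedler vector is monotone along the path, so the returned order recovers the path itself (up to reversal), which is exactly the arrangement that minimizes the weighted LA cost in that case.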
The objects within each L1 community are ordered according to the linear arrangement obtained from Spectral Sequencing. Every community level contains all of the nodes in the original graph – the only difference is how the nodes are grouped.

Our use of both community detection and Minimum Linear Arrangement heuristics raises the question: why not use Spectral Sequencing to lay out the entire access graph? This is a valid question, and using only Spectral Sequencing would produce good layouts. However, a linear layout of a graph's nodes obscures a property that is needed for our data-driven spatial locality technique. Data-driven spatial locality needs groups of data objects to use as training class labels (the whole process is described in detail in §3.4). Community detection algorithms output groups with desirable properties – strong connections within a group, and weak connections to nodes in other groups.

To construct the final layout, we label each object with a community vector. The community vector of an object is a tuple of indices (In, I(n−1), ..., I1, ISS). Index Ik is simply a unique identifier for the community at level k that the object belongs to. ISS is the linear layout index of the node within its L1 community. Due to the nature of Blondel's multilevel community detection algorithm, if two nodes have the same Ik index, they are guaranteed to have the same Ik+1 index.

We lexicographically sort the objects by their community vectors to produce the final Hierarchical Memory Layout.

Figure 3.3: Hierarchical Memory Layout example (first level communities 1_1 and 1_2 within second level community 2_1, with nodes in Spectral Sequencing order)

Figure 3.3 shows a simple example of the information produced by Hierarchical Memory Layouts. The two detected first level communities ({A, D, B} and {E, C, G, F}) are internally ordered by Spectral Sequencing.
They belong to the same second level community and will be placed next to each other in the final layout.

We use the output of Hierarchical Memory Layout for two purposes. First, we rearrange the objects in the memory access trace according to the layout and feed the new trace into the cache simulator (§3.5.1) to estimate whether improving the data layout is likely to improve performance. Second, we train a machine learning model (§3.4) to find out if any data features of the objects can be used during the data allocation phase to improve performance in a concrete program.

3.4 Data-driven locality

Prior work on memory layouts uses expert domain and algorithm knowledge to obtain better layouts. These solutions often use features of the data itself, such as object fields, to inform the generation of layouts. For example, in mesh traversals, the visitation order goes from one mesh object (triangle, point, edge) to its neighbours. Thus, grouping these objects in memory based on their spatial proximity (neighbours are close together in Euclidean space) yields layouts that exhibit better spatial locality. Another example is iterative graph algorithms such as PageRank. Solutions like GraphChi [42] group edges by source and destination nodes to improve spatial locality. However, the process of deriving these layouts is largely manual and relies on expert knowledge in both the algorithm's domain and memory optimization.

We present a way to automate the discovery of correlations between the spatial locality expressed in the algorithm and the features of the data itself. The question our technique aims to answer is the following:

Given the Object Map described in §3.2, and the first level communities detected by the Hierarchical Memory Layout algorithm, is it possible to decide which community an object belongs to, based only on its data features?

We use random forest classifiers [15] for this task.
Random forests are a learning method for classification and regression that combines multiple decision trees and overcomes the decision trees' tendency to overfit. We chose random forests for two main reasons:

1. They give good results and are relatively easy to set up compared to other classifiers such as neural networks.

2. They are less opaque than other techniques. This means that once a random forest learns to classify data from a dataset, it is possible to extract the contribution of input vector elements to the decision process (further explained in §3.4.3).

However, our technique is not inherently tied to random forests. It can use any other suitable classification algorithm without modification.

3.4.1 Generating input vectors

The input vectors for our classifier are generated from the set of all detected data features. We split data features into three categories: primary features, secondary features and meta-features.

The first, trivial, type of data features are primitive data fields – primary features. Primary features are all non-pointer fields within an object. An example of this would be the coordinates of an object in a mesh.

The second type of data features are primitive features of neighbouring objects – secondary features. Here, object A is a neighbour of object B if B contains a pointer field that holds the value of A's memory address. Secondary features can be used in allocation policies when the neighbouring objects are initialized and known in advance. For example, if the mesh traversal workload's triangle object is written so that it only contains pointers to the triangle's points, we can use the points' Euclidean coordinate features to inform the allocation of triangles (provided points are initialized before triangles).

The third type of data features are meta-features. These are not present in the data itself as object fields, but rather describe some inherent properties of objects.
Examples of meta-features are array indices of objects, memory addresses assigned to objects within the observed execution trace, the size/type of an object, the allocation point in the source code, etc. A correlation between spatial locality and the array index (or memory address) of objects can indicate that the current layout already does well in terms of spatial locality. A correlation between size and spatial locality would mean that one should use an allocator that bins objects based on allocation size (a common strategy [46] [29]).

3.4.2 Coverage

Our technique focuses on the relationship between the contents of the data being accessed and the sequence of accesses to it. From that point of view, in terms of coverage, it is important to capture two properties that reflect the subsequent run we are optimizing for:

• The taken code paths should exhibit the same access patterns over large data structures as the program we are optimizing for. For example, if one is trying to improve the spatial locality of a PageRank implementation, they should not profile an execution of depth-first search. Even though the data structure may be the same, the access pattern is not, so the extracted relationships between features of data and access pattern would not carry over. If a program does exhibit different traversal patterns in a single run, our technique would arrive at the "least common denominator" for them. The edges contained in the access graph would be weighted as the sum of the weights of the different traversal patterns, so our technique would take both into account when producing layouts and grouping criteria.

• The data structure contents should be similar to those encountered in the target program. For example, to extract relationships between data and access patterns in a PageRank algorithm, one should make sure that the input graph in the profiling run has similar characteristics to realistic graphs (e.g. similar degree distribution).
If the user fails to provide a similar input, the conclusions about the best grouping criteria might not be optimal for executions on different inputs.

3.4.3 Training methodology and evaluation criteria

Our full dataset consists of all the objects in the access graph, with their input feature vectors and community labels. When training the classifier, we generate 5-fold cross-validation datasets with an 80% / 20% split between training and testing data. The 80% partition is used for training, and we verify and compute accuracy on the remaining 20%.

To extract the features that contribute the most to the accuracy (feature importance), we use the gini method proposed by Breiman [16], implemented in scikit-learn's [59] RandomForestClassifier class. This method evaluates a feature's importance as the measure of all decision tree splits that include the said feature, normalized over the entire forest. The more decision tree splits a feature is involved in, the more important it is deemed for the classifier. Categorical accuracy is the percentage of samples in the test dataset for which the classifier predicted the correct label. Top-5 categorical accuracy is the percentage of samples in the test dataset for which the correct label was within the top 5 choices of the classifier. We report categorical accuracy, top-5 categorical accuracy and feature importance information in §3.5.3.

Figure 3.4: Hyperparameter search in mesh traversal (categorical and top-5 categorical accuracy for 100, 200, 400 and 500 trees)

We performed all our experiments with √N_features random features considered when splitting a node. The number of trees was chosen between 100, 200, 400 and 500. We performed each classifier training run with all four tree counts and found that, in general, higher tree counts yield slightly better classifiers for our datasets. In the remainder of the text, all classifier results represent the best classifier we found through exploration for the given dataset.
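The training setup just described (an 80% / 20% split, √N_features candidate features per split, and gini feature importances from scikit-learn's RandomForestClassifier) can be sketched on synthetic data. The dataset below, a "coordinate" feature that determines the community label plus two noise features, is invented for illustration and is not one of the thesis's datasets.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000
# Columns: an x-coordinate-like feature and two uninformative fields.
X = rng.uniform(0.0, 1.0, size=(n, 3))
# Community label correlates only with the first feature.
labels = (X[:, 0] * 4).astype(int)

# 80% / 20% split between training and testing data.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, labels, test_size=0.2, random_state=0)

# max_features='sqrt' considers sqrt(N_features) features at each split.
clf = RandomForestClassifier(n_estimators=100, max_features='sqrt',
                             random_state=0)
clf.fit(X_tr, y_tr)

accuracy = clf.score(X_te, y_te)        # categorical accuracy
importances = clf.feature_importances_  # gini importance per feature
```

Since the labels depend only on the first feature, its gini importance dominates, which is the kind of signal the chapter uses to identify locality-relevant object fields.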
Figure 3.4 shows the categorical and top-5 categorical accuracy for our mesh dataset for different tree counts. As we can see, the categorical accuracy shows very little variance, but our classifier is able to improve the quality of mislabeled results (ranking the correct results higher) with more trees. More advanced hyperparameter tuning techniques exist and can be used with our technique; however, this is out of the scope of our work as it pertains to the specifics of the chosen classification algorithms.

We verify that the detected fields are indeed correlated to the locality groups of objects in two ways:

1. For algorithms for which there is published research on layout optimization using data features, we verify that the properties detected by our Data-driven Spatial Locality techniques have been used in prior work to inform layout design.

2. For all algorithms, we manually inspect the code under analysis. We identify the main accessor code portions for the given data structure, and reason about the sequence in which the data structure is traversed and how it relates to the object properties, as visible from the source code. We show code snippets responsible for the algorithms' access patterns, and present our reasoning to support our conclusions. We perform the steps that an expert would perform to find correlations between access patterns and data features. We then verify that the features discovered by our technique are the same as those identified manually.

3.4.4 Tidy: a memory allocator wrapper

To test the impact of using detected data features to inform layout at runtime, we built Tidy. Tidy is an arena-based allocator, meaning that it organizes allocated objects into different arenas. Unlike most other arena-based allocators, Tidy chooses the arena for a newly allocated object based on a hint provided by the programmer.
In our context, the hint is the feature of the object that correlates with the desired object grouping – the feature that we discover using random forests. The idea is that upon object allocation the programmer passes the value of that feature to the allocator wrapper. For example, if the programmer is allocating a vertex of a triangle, she would pass the X and Y coordinates to Tidy. Tidy would then convert these values into an arena index, such that vertices with similar X, Y coordinates get allocated in the same arena. Using this method we achieve the desired grouping of objects, the one suggested to us by the access graph, in a concrete execution.

The idea of hint-based allocators is not new. Chilimbi et al. [19] proposed ccmalloc – a cache-conscious allocator that accepts hints. Hints in ccmalloc are addresses of previously allocated objects; the allocator attempts to place the new item as close as possible to the one whose address was provided as the hint. Tidy can be adapted to use different kinds of data for hints, and we consider it a generalization of ccmalloc's approach.

The hint taken by Tidy has the form of an n-dimensional vector; the size of the vector is given to Tidy upon initialization and is stored in a Tidy context. Tidy then allocates an n-dimensional array of arenas, and the vector elements (modulo the size of the dimension) are used to index into that array. This implies a linear mapping of hints to the arena space. Non-linear mappings are also possible; we plan to explore them in the future.

To use Tidy, the programmer replaces calls to existing memory allocators with calls to Tidy. If a call to a standard malloc routine has the interface of:

malloc(size_t size);

a call to Tidy looks like:

tidy_alloc(tidy_ctx_t *ctx,
           size_t size,
           unsigned int *hint);

The programmer can configure the size of the arena as well as the size of the dimension, or opt to use the default settings.
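The hint-to-arena indexing scheme can be illustrated in a few lines of Python. This is a sketch of the mapping only, not Tidy's C implementation; the class name, field names and dimension sizes are invented for the example.

```python
class TidySketch:
    """Illustrative model of Tidy's hint-to-arena mapping: each element
    of an n-dimensional hint vector, taken modulo the size of its
    dimension, indexes into an n-dimensional array of arenas."""

    def __init__(self, dims):
        self.dims = dims    # e.g. (4, 4) for 2-D hints
        self.arenas = {}    # arena index tuple -> list of objects

    def alloc(self, obj, hint):
        # Linear mapping of the hint to the arena space.
        idx = tuple(h % d for h, d in zip(hint, self.dims))
        self.arenas.setdefault(idx, []).append(obj)
        return idx

tidy = TidySketch(dims=(4, 4))
# Vertices with similar (x, y) hints land in the same arena.
a = tidy.alloc('vertex_1', hint=(1, 2))
b = tidy.alloc('vertex_2', hint=(1, 2))
c = tidy.alloc('vertex_3', hint=(3, 0))
```

In the real allocator the arena holds the objects' storage; here a list stands in for it, which is enough to show how the modulo indexing groups objects with similar hints.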
More experiments are needed to determine whether an optimal arena size can be pre-determined from the properties of the access graph, whether it needs to be tuned individually for each workload, or whether there is a single (perhaps architecture-dependent) default that works well across the board.

The programmer need not specify the total number of arenas in advance or the total number of allocated elements; if Tidy runs out of space in an arena, it allocates a new one for the same set of hints. To allocate arenas, Tidy uses libc malloc, but it can be changed to use any other allocator.

3.5 Evaluation

In this section we first show how Hierarchical Memory Layouts (§3.3) can be used to estimate potential performance improvements from improved data layouts (§3.5.1). Following that, in §3.5.2 we show how HML can be used directly to derive better data layouts in storage. We show that data-driven layout techniques (§3.4) can be used to detect correlations between data features of objects and their layout, where such correlations exist, and to guide dynamic memory allocations. Finally, we apply the data-driven spatial locality techniques to a subset of the SPEC benchmark suite and discuss our findings in §3.5.4.

In our experiments we use nine applications, two of which are used to evaluate improved storage layout only.

Simple data structure benchmarks. The three benchmarks in this set are our own implementations of PageRank, mesh traversal and red-black trees in C/C++.

PageRank stores data as node and edge objects. The graph is initialized from an edge list file. Each node and edge is separately allocated using C++'s operator new(). Nodes contain their ID, a data field and vectors of pointers to their in-edges and out-edges.
Edges contain the IDs of the source and destination nodes, and a floating point data field.

Mesh traversal operates on a 2D network of node and triangle objects. Triangles contain pointers to their three nodes and three adjacent triangles. Nodes contain their x and y coordinates and a vector of pointers to all adjacent triangles. The data is initialized by allocating objects one by one, according to input from a node and triangle list file.

Red-black trees are collections of nodes, where each node has pointers to its parent and two children, a floating point payload, and a colour field. The benchmark fills the tree with random nodes, and then executes a series of lookups.

SPEC CPU2017 memory intensive benchmarks. SPEC CPU2017 is the 2017 release of the popular benchmark suite. For our Hierarchical Memory Layout experiments, we used three memory-intensive benchmarks (according to Amaral et al. [8]): 505.mcf, 520.omnetpp and 531.deepsjeng. 505.mcf is a mass transportation route planning program written in C. 520.omnetpp is a discrete event simulator of a large 10 gigabit network, written in C++. 531.deepsjeng is a speed-chess program with deep positional understanding, written in C++. Each manipulates a large number of heap objects, making them suitable for applying our Hierarchical Memory Layout algorithm.

Kyoto Cabinet's kcstashtest. Kyoto Cabinet is a key-value database management library. It is a direct successor of Tokyo Cabinet, developed by FAL Labs and used by the Japanese social network Mixi. It contains multiple different implementations of the data store back-end. The benchmark shown here, kcstashtest, performs writes and reads in Kyoto Cabinet's StashDB data store variant. StashDB internally keeps records in hash tables.

Graph traversal benchmarks.
To evaluate how using HML can improve data layouts in storage, we use graph traversal benchmarks of our own implementation.

We obtain the memory access traces required for generating the access graphs using DINAMITE [55]. DINAMITE is an LLVM pass that instruments every memory access and compiles the program, such that information about memory allocations, accesses, and data types is emitted to a log file. Our techniques would work with any memory access tracing tool that provides this information.

To estimate the performance effect of changing the data layout via HML, we simulate a cache hierarchy using Dinero IV [35]. The simulated Dinero cache is modelled after the Intel® Core™ i5-7600K. The L1 D-cache has a capacity of 64kB and is 8-way set associative. The L2 has a capacity of 256kB and is 4-way set associative. The LLC has a capacity of 8MB and is 16-way set associative. Our DTLB simulator has 128 4kB page entries and is 4-way associative.

3.5.1 Hierarchical Memory Layouts

The main purpose of our Hierarchical Memory Layout algorithm is to provide an estimate of the upper bound on performance improvement from changing the data layout. The output of the algorithm is the multilevel communities produced by the first stage described in §3.3 and the final layout, which maps original object addresses to addresses of the objects in the improved layout.

Figure 3.5: Hierarchical Memory Layouts cache and DTLB misses. Event counts are normalized to original layout results.

Figure 3.5 shows simulated cache event counts for the first seven benchmarks after applying the Hierarchical Memory Layout algorithm. The numbers are normalized to the simulated cache event counts of the original layout.
Five benchmarks out of seven show significant improvements in cache miss rates.

We notice a consistent trend: HML improves the miss rate at the higher levels of the memory hierarchy, such as the last-level cache (LLC) and the TLB, to a larger extent than at the lower levels of the hierarchy (L1 and L2). That is because it is easier to organize objects into large groups accessed contemporaneously within a relatively large time window (macro grouping) than into small groups accessed contemporaneously within a very short time window (micro grouping).

Furthermore, performance improvements depend on the size of data objects. In 505.mcf, for example, most of the allocated objects are over 100B in size and do not fit into a single cache line. Thus, performance improvements that could occur due to more efficient packing of objects within the same cache line do not happen.

From the red-black tree benchmark we learn that the extent of improvements from HML also depends on the access pattern. In red-black trees, the allocated objects are not as big as the ones in 505.mcf, but the tree itself exceeds the cache memory capacity by far. The combination of the large dataset and random lookups means that the algorithm does not revisit the same tree nodes often. The improvement in cache performance is thus low. However, DTLB misses improve by 51%.

Our conclusion is that the concrete performance gains from HML depend largely on the size of objects, the size of the working set and the access pattern of the program. Furthermore, HML tends to better improve spatial locality at a coarser granularity: at the level of the TLB or the LLC.

Let us look back on what these HML results mean. Our algorithm operates under the assumption that it is possible to reorder objects in memory in an arbitrary way. This assumption does not hold for the majority of real-world programs. Recorded memory access traces, on the other hand, are an idealized environment for testing different layouts.
The main purpose of HML is to give an estimate of how much performance is to be gained from changing the layout of items in the best case. We show that for the selected memory-intensive benchmarks there is much room for performance improvement from reordering data.

We explore two ways in which output from Hierarchical Memory Layouts can be used in practice to achieve better performance in programs. In §3.5.2 we explore the possibility of using HML to directly inform the layout of data in storage, and in §3.5.3 we show the results of applying data-driven layout techniques (§3.4) to our benchmark set.

When such optimizations are not possible in practice, Hierarchical Memory Layouts provide a starting point for work on layout improvement. The output of HML is a concrete layout of data that improves spatial locality, groups of objects that get accessed frequently together within the given program, and a descriptive Object Map which ties the previous two to the actual data within the program. Researchers in the future can use these layouts as stepping stones towards new locality optimization techniques.

3.5.2 Data layout in storage

Programs whose data layout is directly inherited from an input file, for example those that mmap the input file to materialize data in memory or those that dynamically allocate data objects in the same order as they appear in the input file, can directly benefit from the HML technique.

Figure 3.6: Cache misses, TLB misses, and runtime for HML-derived layouts in graph traversals.
Normalized to original layout metrics.

We can reorganize the input data in the file in the same order as suggested by the HML algorithm and as a result obtain better spatial locality at runtime.

To evaluate such a scenario we wrote an application in C++ that performs graph traversal in either breadth-first search (BFS) or depth-first search (DFS) order. We ran two benchmarks. The first one performs ten BFS traversals, starting from a randomly selected node each time. The second benchmark works the same way, but uses DFS.

For these benchmarks, an optimal strategy for spatial locality would be to allocate the nodes in the same order as they are traversed, but because the traversal begins with a different node each time, there is no single "optimal" layout that we can use. Instead, we run the HML algorithm on the memory access traces for these benchmarks to suggest an improved layout. HML outputs the order of the nodes, where each node is identified by its unique ID. We then reorganize the input file such that the nodes appear in the same order as suggested by HML. We create one input file for the BFS benchmark and another one for DFS.

Figure 3.6 shows the runtime, cache misses and TLB misses obtained using the HML, relative to the baseline layout, where nodes in the input file are sorted by their numeric ID. These measurements were obtained on the actual hardware. We do not provide simulation results, because we are able to apply HML directly. We observe an improvement in runtime of 14% for BFS and 18% for DFS. L1 and L2 miss counts show a slight degradation, which is made up for by a significant reduction in L3 and DTLB misses. Again, we see the pattern observed in §3.5.1, where HML tends to optimize better for L3 and TLB, while keeping L1 and L2 cache misses the same or slightly worse than the original layout.
The improvements in L3 and DTLB misses outweigh this degradation and produce a positive effect on the running time.

3.5.3 Data-driven spatial locality

We applied the data-driven spatial locality techniques to the first seven benchmarks described in §3.5. In three of these, mesh traversal, PageRank and red-black trees, our system identified data features that could be used as hints for the Tidy memory allocator.

For PageRank and mesh traversal, our system automatically identified the same features that in the past were discovered manually: source nodes for PageRank [42] and Cartesian coordinates for mesh traversal [75]. This was a positive confirmation of the effectiveness of our techniques.

PageRank

Figure 3.7 shows that the source node is the main contributor to accurate community prediction in PageRank. The edge weight also has high predictive power, but it is not available at runtime, so we disregard it when testing new layout strategies with Tidy. Listing 3.1 shows the code that gets executed for each node n in the graph. The first loop iterates over all the in-edges of the node, meaning these edges would share a destination node value. The second loop iterates over all out-edges of the node, and these edges share the same source node. This means that, to optimize access locality for the first loop, one would group edges based on their destination nodes, and to optimize for the second, one would group based on the edges' source nodes. Our dataset is a snapshot of a Twitter follower graph. The graph has a power-law distribution of in-degrees – few highly popular nodes with many followers and many nodes with few followers. Popular nodes will be a destination for many edges. In order to detect the destination node field as a feature correlated with locality of access, edges with the same destination node would need to belong to the same community.
However, our algorithm decides on community sizes based on the edge weights in an access graph, typically producing communities that are smaller in size than the in-edge counts of highly popular nodes. This means that, observing the first loop in listing 3.1 for one such node, the same destination feature would be distributed over many locality communities. Due to the nature of the graph, the source node feature corresponds more closely to the structure of communities, which is what our classifier has detected successfully.

In GraphChi [42], edges are grouped by their destination node, and edges within a group are ordered based on their source node IDs. Grace [63] groups edges based on their source node, and sorts groups based on their destination node. Chronos [33] processes temporal graphs, and stores each epoch's edges in segments grouped by their source node. Our technique was able to automatically detect the source node as a grouping factor in our PageRank implementation – a grouping criterion used in graph processing systems optimized for data locality.

Listing 3.1: PageRank implementation loop body

    // read neighbors
    int num_in_edges = n->num_inedges();
    if (num_in_edges > 0) {
        for (int j = 0; j < n->num_inedges(); j++) {
            sum += n->inedges[j]->weight;
        }
    }

    // compute pagerank and store
    float pagerank = RANDOMRESETPROB + (1 - RANDOMRESETPROB) * sum;
    float cmp = fabs(n->data - pagerank);
    if (cmp > 0.00000001) {
        value_changed = 1;
    }
    n->data = pagerank;

    // distribute pagerank to the other nodes
    int num_out_edges = n->num_outedges();
    if (num_out_edges > 0) {
        float pagerankcont = pagerank / num_out_edges;
        for (int j = 0; j < num_out_edges; j++) {
            n->outedges[j]->weight = pagerankcont;
        }
    }

Mesh traversal

Figure 3.8 shows importance scores of mesh node fields in the mesh traversal algorithm.
We can see that the x and y Cartesian coordinates were picked up by the random forest as being the most important for classification. Listing 3.2 shows the main traversal loop in our implementation. The algorithm runs for a predetermined number of steps, starting from the first triangle in the mesh. Traversal follows the path through the least visited neighbouring triangles, and for each triangle touches all of the triangle's points by reading and accumulating the points' coordinate values. Two triangles that get processed in succession will thus have similar spatial coordinates, and so will their points. In prior work, mesh layouts typically follow similar strategies of placing geometrically close points and triangles close together in memory.

Cache-oblivious mesh layouts [75] order vertices to minimize the layout distance between neighbors. This approach effectively groups vertices that are close to each other in the mesh – their coordinates have similar values. Other layout algorithms [9] [71] [68] place neighbouring objects in volumetric grid cells that are laid out in memory according to space-filling curves. Space-filling curves linearize multi-dimensional spaces for improved spatial locality. Volumetric grid cells that are close together in domain space (have similar coordinates) will be placed together in the layout. Gopi et al. [31] produce mesh layouts by organizing objects into single-triangle strips. A single-triangle strip is a linearization technique that follows a path of neighbouring triangles until it encompasses the entire mesh.
Again, this technique groups neighbouring triangles in the layout – ones with similar spatial coordinates.

Our analysis of our program's access pattern and our literature review support our framework's output – that the x and y fields (spatial coordinates within the mesh) correspond to locality groups.

The community prediction accuracy is high, meaning we can use the x and y coordinates to group objects at allocation time with Tidy.

Listing 3.2: Mesh traversal main loop

    while (steps-- > 0) {
        int idx = find_min_neighbour(triangle);

        next = triangle->neighbours[idx];

        triangle->data += 1;
        triangle = next;

        touch_points(triangle->points[0], triangle->points[1],
                     triangle->points[2]);
    }

Red-black trees

Figure 3.7: Data feature importance and categorical accuracies, PageRank (categorical accuracy: 0.758, top-5 categorical accuracy: 0.904).

Figure 3.8: Data feature importance and categorical accuracies, mesh traversal (categorical accuracy: 0.633, top-5 categorical accuracy: 0.928).

Figure 3.9: Red-black tree grouping example.

Red-black trees are considered difficult to optimize for spatial locality, and we are not aware of any heuristics used in the past to improve their layout. Our system, on the other hand, was able to discover one. Figure 3.14 shows that the payload field, which is used to rebalance the tree, has the highest predictive power. Listing 3.3 shows the code of our red-black tree search implementation. Intuitively, small values will be stored in the left side of the tree, and large values in the right side.
To illustrate how this reflects on grouping objects by payload, we refer to figure 3.9. The figure shows a small red-black tree with 30 nodes, containing random payloads in the (0, 200) range. Leaf nodes connect to the sentinel node, which is a null value guard object in our implementation. We use this example tree to verify that the conclusion of our analysis – that payload corresponds to access locality – is indeed true. We assigned different colours to nodes, based on 20-wide payload binning. From this example, we can see that grouping objects by value ranges of their payload has the property of splitting the tree into vertical slices. A single tree search execution will, in the worst case, visit nodes along a path from the root to one of the leaves. From the figure, we can see that such paths are localized to a few colour groups. A search path corresponds to access pattern locality, and colour groups in figure 3.9 correspond to payload value similarity. From this, we conclude that our algorithm did find a meaningful correlation between locality groups and the payload feature value.

Listing 3.3: Red-black tree search function

    rbt_node_t *cur = T->root;
    while ((cur->payload != value) && (cur != T->sentinel)) {
        if (value < cur->payload) {
            cur = cur->left;
        } else {
            cur = cur->right;
        }
    }

    if (cur == T->sentinel) {
        return NULL;
    } else {
        return cur;
    }

With a categorical prediction accuracy of 0.98, red-black trees have the features with the strongest predictive power of all three benchmarks.

Hint-based allocation

We used the discovered fields in all three benchmarks to generate hints for the Tidy allocator wrapper. Our mapping is relatively simple, performing one- or two-dimensional binning. We bin values into buckets by dividing the value space into a grid.
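A minimal sketch of such grid binning follows; the bin width, origin, and grid-row length are illustrative parameters, not the values we used in our experiments:

```c
/* Illustrative sketch of one- and two-dimensional binning of data
 * feature values into integer hints; all parameters are assumptions. */

/* Bin a scalar feature (e.g. a red-black tree payload). */
unsigned int hint_1d(float value, float origin, float bin_width)
{
    return (unsigned int)((value - origin) / bin_width);
}

/* Bin a coordinate pair (e.g. mesh x, y): each grid cell yields
 * one integer hint value. */
unsigned int hint_2d(float x, float y, float origin, float bin_width,
                     unsigned int cells_per_row)
{
    unsigned int bx = (unsigned int)((x - origin) / bin_width);
    unsigned int by = (unsigned int)((y - origin) / bin_width);
    return by * cells_per_row + bx;
}
```

Objects whose feature values fall into the same cell receive the same hint and are therefore placed in the same arena by the allocator wrapper.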
Each grid cell corresponds to one integer hint value. The grid size was experimentally tuned to the best performing value, and the performance numbers on real hardware are reported in Figure 3.15.

The results of our Tidy experiments are in line with the results from the simulated evaluation presented in §3.5.1. We see a runtime improvement of 25% for mesh traversal, 27% for PageRank and 14% for red-black tree queries. These improvements correlate with the reduction in cache/DTLB misses. We observe the same trend we saw in simulation – grouping items with Tidy does better for memories with higher latencies, in the case of red-black trees even degrading L1 and L2 miss counts by 20%. As we observed earlier, HML is better at macro grouping than at micro grouping, providing improvements at the higher levels of the memory hierarchy (L3 and TLB), but not necessarily at the lower levels. This will sometimes result in L1 and L2 cache miss degradation, but L3 and DTLB improvements typically outweigh these losses.

Discrepancies in numbers between the simulation and the Tidy results can be attributed to two factors. First, Tidy is a best-effort layout strategy. It opportunistically allocates objects within the same memory buckets without ordering them in the exact same way as HML would. This is a limitation of using dynamic allocation – we cannot expect the program to anticipate the exact place in memory where each object should be stored. The second factor is imprecision in the simulator itself. We use Dinero IV, configured to mimic the caches of our test hardware; however, we have no way of knowing how it differs from the hardware itself.

To show that the layout obtained with Tidy is responsible for the runtime improvements we observed, we show detailed stall measurements for the three benchmarks in figure 3.10.
Instruction cache miss rates for each benchmark fall under 1%, meaning that interference between code and data caching is insignificant, so we omit instruction cache reports for brevity. For each benchmark, we measure only the data structure traversal segments (excluding setup and allocation). Each benchmark executes the same number of instructions over the course of our measurements. Further, the code that is being measured does not differ between the baseline and Tidy layout versions. Figure 3.10 shows a significant reduction in memory stalls over all levels of the hierarchy. In the controlled context of our experiments, we can conclude that the layout imposed by Tidy did indeed reduce the amount of time the CPU spends stalled during execution.

Finally, figures 3.11, 3.12 and 3.13 show the memory performance of the three analyzed benchmarks, obtained by changing the bucket size from 64B (cache line size) to 4096B (page size) in a geometric sequence. The results indicate that increasing the bucket size improves performance in most cases. Smaller buckets will fill up faster, thus requiring Tidy to allocate new ones. With a random allocation sequence, this means that new buckets associated with the same hint will be scattered around in memory.

In the red-black tree benchmark, we see a degradation of L1 and L2 performance with the increase in bucket size, while LLC and DTLB misses improve. The size of a red-black tree node in our experiment is 40 bytes. If we group nodes in buckets of 64 bytes, we can fit only two nodes per bucket. As the buckets are allocated in a random sequence, the final layout will resemble the malloc baseline more than the ideal grouping obtained by HML.
As shown in previous experiments, HML does worse in L1 and L2 misses for red-black trees, and we observe a gradual increase in miss counts on these levels as we move from a layout resembling the baseline to a layout resembling the simulated HML results.

Figure 3.10: Stall breakdown.

Figure 3.11: Mesh traversal – Memory performance comparison between different bucket sizes.

Figure 3.12: Pagerank – Memory performance comparison between different bucket sizes.

Figure 3.13: RB Trees – Memory performance comparison between different bucket sizes.

Figure 3.14: Data feature importance and categorical accuracies, red-black trees (categorical accuracy: 0.983, top-5 categorical accuracy: 0.999).

3.5.4 Benchmark suite experiments

In this section, we show our experiments with running the full data-driven spatial locality pipeline on a set of benchmarks. This work is done towards
the validation of our hypothesis 1.2.

Figure 3.15: Performance improvement from using the Tidy allocator wrapper with hints based on the knowledge extracted by random forests.

Selecting benchmarks

For our evaluation, we chose the most recent versions of two popular benchmark suites: SPEC CPU2017 and the PARSEC 3.0 apps suite. SPEC and PARSEC benchmarks are standardized versions of various scientific and industrial algorithms that have been designed and optimized by experts. The benchmarks were compiled with gcc and run with perf to collect hardware performance counters. Due to problems with the 2017 version of the SPEC benchmarks, we omit cache measurements for 520.omnetpp (missing data files), 500.perlbench (crash), 502.gcc (compilation fails) and 503.bwaves (does not finish). The 500.perlbench benchmark crashes near the end of execution, so we did perform the data-driven spatial locality analysis on it, up to the point of the crash. We considered the remaining 21 benchmarks from SPEC, and 5 benchmarks from PARSEC.

Our final selection was based on two criteria: the benchmarks need to be written in either C or C++, and they need to put significant strain on the memory hierarchy. The first condition is due to the fact that our tracing framework currently supports C and C++, which eliminated a number of SPEC benchmarks written in Fortran. The second condition is imposed by our goal of improving memory performance.

We measure the memory performance of programs by collecting hardware performance counter values with perf. The selected performance counters are the percentage of time the program was stalled waiting on any memory operation, and the cache miss rates for the L1, L2 and LLC caches. The measured performance is shown in figure 3.16.
We set a cut-off point of 30% for LLC misses and time spent waiting on memory, to capture all the significant high values of these metrics across the benchmark suite. The cut-off is represented by a dashed horizontal line in figure 3.16. Table 3.1 shows a list of all the benchmarks we considered, along with whether they were included in our data-driven spatial locality analysis, and a short summary of our findings. We colour-code the table entries for readability – red entries are the ones that were not considered due to some limitation (typically compiler incompatibilities, or build/runtime problems), blue entries are the benchmarks which we do not consider memory intensive (according to the previously discussed criteria), and black entries are the ones that we performed data-driven spatial locality analysis on. In the latter group of benchmarks, bold typeface marks the ones we discuss in more detail in the text that follows.

The general steps of our data-driven spatial locality analysis are outlined in chapter 3.4. For the selected suite of benchmarks, we first generate an access graph and an object map by observing all the allocations the program makes. We extract the access graph communities, and form feature vectors for each detected community, across all community levels. We split the input data – feature vector and label pairs – into training and test datasets with an 80%/20% ratio. We then feed the training feature vectors, paired with community labels, into a random forest classifier, and observe errors from the subsequent classification of the test dataset. We always pick the community level that yields the lowest classification error. If this first analysis pass does not succeed in learning the classification, we re-generate the access graph and object map using only the three data structures with the highest allocation counts within the program and repeat the process. We omit primitive types
We omit primitive typesfrom our analysis, as they are always allocated as parts of arrays.Out of the benchmarks selected for analysis, 557.xz uses most of itsallocated memory on opaque buffers – these we have no programmatic way ofanalyzing with current tools, 523.xalancbmk and 519.lbm and fluidanimatereveal no correlation between accesses and data.593.5.EvaluationFigure 3.16: Performance measurements for SPEC CPU2017 and PARSEC 3.0603.5. EvaluationResults505.mcfNo correlation between access patterns and data was detected in thefull trace of 505.mcf. However, when we filtered the access trace so that itcontains only the accesses to the most allocated data structures, arc andnode, our classifier detected correlation between the arc objects’ positionsin the allocated array and the access pattern of the algorithm. Both arcand node are allocated in large arrays, and the arc data structure is ac-cessed linearly in the function primal bea mpp(). This property has beensuccessfully detected by our tools, and there is no optimization we can applyin terms of laying out separate objects since the accesses to array elementsare already linear. However, in our prior analysis of the CPU2006 versionof the same benchmark, discussed in section 2.3 , we discovered that mcfiterates over a subset of data fields in an Array of Structures (AoS). We haveshown that both extracting cold fields in a separate data structure (§2.3.2),and converting the data into the Structure of Arrays (SoA) (§2.3.1) formatimproves performance.510.parest First, we analyzed 510.parest by looking at the access graphcreated from allocated objects of all types. Our tools have found no correla-tion between data fields and the access graph communities. We then focusedour analysis on the two most-allocated data types: STL’s Rb tree nodeand the Point class from 510.parest ’s dealii namespace. 
With a categorical accuracy score of 72.06% and a top-5 categorical accuracy of 92.06% for Rb_tree_node classification, our predictor can recognize a correlation between the data fields and access graph communities. The random forest classifier shows that the strongest candidate feature of RB nodes is the allocation sequence number (ordered integers increasing with each allocation made during execution), followed by the addresses of parent and child nodes. This indicates that the memory placement of nodes is already aligned with the locality of access.

531.deepsjeng

This benchmark is based on the commercial chess engine Deep Sjeng [58]. The engine solves different chess placements using a proprietary algorithm. Analyzing the data object clusters with our random forest classifier yields a categorical accuracy of only 42.19%; however, the top-5 categorical accuracy is very high at 89.44%. Upon manual inspection, we discovered that the low categorical accuracy can be attributed to different clusters having very similar feature values.

This is due to the fact that the structure of clusters is dictated by the
Table 3.1: SPEC CPU2017 and PARSEC analysis summary

    Benchmark        Included  Summary
    500.perlbench    No        Crashes during execution
    502.gcc          No        Compilation fails
    520.omnetpp      No        Missing data files
    508.namd         No        Good memory performance
    511.povray       No        Good memory performance
    525.x264         No        Good memory performance
    526.blender      No        Good memory performance
    538.imagick      No        Good memory performance
    541.leela        No        Good memory performance
    544.nab          No        Good memory performance
    548.exchange2    No        Good memory performance
    997.specrand_fr  No        Good memory performance
    999.specrand_ir  No        Good memory performance
    blackscholes     No        Good memory performance
    swaptions        No        Good memory performance
    vips             No        Good memory performance
    519.lbm          Yes       No correlation found
    523.xalancbmk    Yes       No correlation found
    557.xz           Yes       Mostly opaque buffers
    fluidanimate     Yes       No correlation found
    505.mcf          Yes       Found correlations between array index and
                               access pattern for struct arc. This means
                               accesses to the array are already linear.
    531.deepsjeng    Yes       Found correlations for the hash field of
                               ttbucket_t. Detected field is already used
                               for indexing into arrays.
    510.parest       Yes       Found correlation between allocation sequence
                               and access pattern for RB tree nodes. Objects
                               allocated together are already close together
                               in memory.
    ferret           Yes       Found correlations in a small data structure
                               that gets moved a lot. Cannot apply hint-based
                               allocator optimizations to objects that move
                               around the memory space.

access graph and the community detection algorithm. Even when the correlation between access patterns and field values exists, the size and boundaries of access graph communities determine whether this correlation will be visible to the random forest classifier. To illustrate, figure 3.17 shows three different feature/community groupings. Gray boxes depict data objects, and the inscribed numbers are feature values. Figure 3.17a shows the ideal scenario where the detected communities correspond well to the object features.
Figure 3.17b shows a scenario illustrating our findings in 531.deepsjeng: similar feature values belonging to different clusters. This results in lower categorical accuracy because similar feature vectors map onto different labels, making it difficult for the classifier to make good predictions. However, in this scenario, the top-k accuracies will still be high, as all the similar communities will get high votes from the classifier. Figure 3.17c shows the opposite, where the communities are too large, and different feature vectors are mapped to a single community. We have not encountered this in practice, but it is possible because the output of community detection is based solely on access graph edges.

Figure 3.17: Examples of community groupings. (a) good community grouping; (b) similar features mapping into different communities; (c) different features mapping into the same communities.

It is important to note that, while high categorical accuracy is desirable, low values are not indicative of an absence of correlation. The user needs to observe the combination of community structure and size, categorical accuracy and top-k categorical accuracy to draw accurate conclusions.

Our random forest classifier identified the hash field of struct ttbucket_t as being tied to locality. Source code inspection revealed that the hash field is used as the first index into a 2-dimensional array of struct ttbucket_t objects that store computation results. In both of the main accessor functions, StoreTT() and ProbeTT(), the array is indexed using the calculation shown in listing 3.4.

Listing 3.4: 531.deepsjeng: Index calculation based on hash value

index = (unsigned int)nhash;
temp = &(TTable[index % TTSize]);

The algorithm traverses this object store (the temp pointer in listing 3.4) linearly over its second index. As such, these objects are already grouped based on their hash fields, and we can confirm that the identified correlation is indeed present and proper grouping implemented.

PARSEC ferret

PARSEC's ferret benchmark is based on the Ferret toolkit [52] for content-based similarity search of feature-rich data such as audio recordings, digital images and sensor data. The version included in the benchmark suite is configured specifically for image search.

Our analysis yielded a classifier with categorical accuracy of 63.56%, and top-5 categorical accuracy of 78.95%. The most important features for the classifier were identified to be the reg1 and reg2 fields in the RegionPair data structure. The start of the main accessor loop for the RegionPair data structure is shown in listing 3.5.

Listing 3.5: ferret: Sorted array accesses

sorted = bucket_sort( pair, num_edges );

/* Merge similar regions */
for ( ik = 0; ik < num_edges; ik++ )
{
    reg1 = find_set( parent, sorted[ik].reg1 );
    reg2 = find_set( parent, sorted[ik].reg2 );

The algorithm, however, relies on the RegionPair array being sorted by a calculated cumulative histogram value (shown in listing 3.6).

Listing 3.6: ferret: Sorting criterion

/* Perform bucket sort */
for ( ih = 0; ih < num_elems; ih++ )
{
    sorted[cum_histo[pair[ih].delta]++] = pair[ih];
}

While it is possible to add a level of indirection to the accesses (allowing us to sort pointers instead of objects), doing so would not be beneficial in this case, since the data structure is only 12 bytes in size – only 50% larger than a 64-bit pointer.

Conclusion

The evaluation presented here shows that a subset of the selected benchmarks exhibits correlation between data features and the programs' access patterns. However, existing software is often written in such ways that the use of the Tidy allocator is impossible without rewriting the code. The main obstacle to applying Tidy that we identified is arrays of objects.
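A toy sketch of this obstacle follows. The code is entirely hypothetical (not taken from any benchmark): a hash table stored as an array of buckets, where the access path is the index computation itself. Reordering the buckets to improve locality silently breaks lookups unless the indexing scheme changes in lockstep — here, via an explicit indirection table.

```python
# Hypothetical illustration: arrays of objects resist relayout because
# the algorithm reaches objects through computed integer indices.

def h(key, n):
    return key % n              # original indexing scheme

n = 8
buckets = [[] for _ in range(n)]
for key in [1, 2, 3, 4]:
    buckets[h(key, n)].append(key)

# Suppose a locality analysis suggests placing the buckets in a new order.
order = sorted(range(n), reverse=True)     # hypothetical improved layout
relaid = [buckets[i] for i in order]

# Naive lookup against the relaid array uses stale indices and fails.
def naive_lookup(key):
    return key in relaid[h(key, n)]

# Correct lookup needs a remapping from old index to new position --
# effectively a new hash function for the new layout.
position = {old: new for new, old in enumerate(order)}

def remapped_lookup(key):
    return key in relaid[position[h(key, n)]]

assert all(remapped_lookup(k) for k in [1, 2, 3, 4])
```

The indirection table here plays the role of the "new hash function" discussed below; for real hash tables, deriving such a function from the discovered feature/access-pattern correlation is the hard part.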
Access semantics of arrays are index-based: the algorithm accesses data objects through their integer indices. In such cases, the identified link between data features and access pattern cannot be exploited out of the box, because a new layout would alter the indexing scheme of the data set. For example, consider an implementation of a hash table using an array of linked lists. In such a data structure, the indices are calculated using a hash function. Exploiting the discovered correlation between object features and access pattern would require writing a new hash function. Kraska et al. explore the possibility of using neural networks to learn better indexing schemes [41], and there is potential for future work in combining their work with our Data-Driven Spatial Locality techniques.

Using arrays is commonly understood to be better for performance than using linked lists or other dynamically changing data structures. This is true as long as the access pattern over the array items is linear or strided. Our study shows that attempting to design performant data structures from the get-go often leaves out opportunities for applying Data-Driven Spatial Locality techniques at a later stage. This aligns with our observation that expert-optimized data layouts typically treat layout as a first-order concern in the design process.

To conclude, we note three classes of programs with respect to the applicability of our Data-Driven Spatial Locality techniques:

1. Programs in which correlations are present, and applying hint-based allocation is feasible. This category contains programs with "textbook" versions of pointer-based data structures. Objects are allocated separately and no memory optimizations are applied in the original code.

2. Programs in which correlations are present, and point to optimizations that are already applied.
An example of this class of programs is 531.deepsjeng, in which our techniques detected a correlation between hash values used to index into arrays and the program's access pattern. Our techniques can be used to verify that such optimizations are already applied.

3. Programs in which correlations are present, but changing the layout would require significant changes to the program.

Chapter 4

Related work

Several instrumentation frameworks have been designed previously with the goal of solving memory bottleneck problems. Limitations of these frameworks include targeting a specific programming language, providing coarse-grained instrumentation abstractions, or providing fine-grained instrumentation abstractions that are sampled to reduce overhead. Other tools are less flexible in that users can only view output through a user interface, or are designed for a specific use case.

Existing instrumentation frameworks like Pin [51] and Valgrind [56] instrument programs and allow programmers to build dynamic analysis tools on top of them, much like DINAMITE. Pin achieves instrumentation through a highly optimized JIT compiler that intercepts the first instruction of a supplied executable and compiles instrumentation functions to execute where appropriate. Pintools, created by the programmer using the Pin framework, describe the instrumentation analysis. Runtime binary instrumentation is limited in that information gleaned from the framework does not translate to code-level data structures out of the box. Without knowing what machine instructions relate to higher-level data, programmers do not immediately see patterns related to their original work. DINAMITE instrumentation is inserted at compile time, allowing it to retain code-level information about data that is valuable to the programmer using the tool.

Valgrind is another framework that performs runtime binary instrumentation.
It addresses the lack of code-level information that other runtime analysis frameworks suffer from by supporting a technique called "shadow values". Shadow values replace values in memory and registers with values that describe them. Valgrind requires tool writers to implement their own shadow values – a technique that is difficult to implement. The framework supports the implementation of shadow values by providing registers to store them and extra output functionality for printing these values during execution. Though this technique enriches instrumentation by providing more context about memory accesses during execution, it is a complex task in practice, and can easily slow down instrumentation if not done carefully. DINAMITE provides memory access information by default; each log entry contains the address of the memory location accessed and is available for tool writers to query. Register information is absent, but support for it can easily be added using the llvm.read_register intrinsic [2]. Programmers using DINAMITE are not left to implement complex and bug-prone functionality to obtain memory access information.

Zhao et al. [79] describe a tool designed to detect true and false sharing, built on top of the memory shadowing framework Umbra [78]. Umbra is a performant yet general-purpose implementation of shadow values, achieved by mapping only allocated memory instead of the whole address space. Memory sharing is detected by associating cache lines with shadow memory exposed by Umbra. Association of threads to addresses is done via a bitmap describing the thread IDs responsible for each access. Sheriff [48] addresses the same problem, but requires the programmer either to rewrite source code or to rely on sampling to catch culprits, and is specifically designed to detect false sharing.
DINAMITE achieves the same result by inserting thread IDs into each log entry, which also contains the accessed address. The log entry further contains all the source-level details necessary to pinpoint exactly where in the code the accesses were performed, and to what data type.

Memprof [43] is a tool that profiles objects that make remote memory accesses on NUMA machines, so that programmers can potentially minimize them. Memory accesses are measured through instruction-based sampling (IBS), which relies on hardware support and requires a kernel module, constraining the tool to Linux/AMD platforms. DINAMITE can be extended with a native plugin or Spark kernel to generate the same information. It sacrifices performance in favour of accuracy and portability, as it does not rely on hardware support for instruction sampling.

Similar to Memprof, DProf [61] uses IBS to acquire memory traces to locate data types that stress the cache. Programmers can view data type statistics and how these types behave in the cache, which data types generate the most cache accesses and misses, and the most common functions that operate on these data types. DProf required changes to the operating system, which is a significant barrier to its adoption. The flexibility of DINAMITE enabled us to plug a cache simulator into the framework and generate the same information as DProf, but with greater flexibility to add new analyses and without attachment to a particular operating system or hardware.

Other tools provide more source-level detail but are slower and less flexible. Memspy [54] provides rich information on program execution, including the execution time, miss rate, and memory stall time broken down by code and data. Details are tracked by executing the application through a memory simulator and instrumenting it through a preprocessor, which is not as portable as LLVM. Memspy reported a 22×–58× slowdown in execution time.
DINAMITE's slowdowns are competitive with modern instrumentation frameworks, and it provides a pluggable framework for all kinds of data analyses.

Like MACPO [64], our tool instruments data accesses at the compiler level instead of the binary level to keep source-level information. To combat overhead, MACPO limits instrumentation to "snapshots" of program execution that are staggered in an attempt to capture complete program behaviour. Trace size is reduced by limiting instrumentation not only to "snapshots" but also to non-scalar data types. Similarly, DProf and Memprof use IBS and only output the most commonly accessed data and their cache statistics. DINAMITE instruments all memory accesses, inviting unlimited flexibility of analysis at the expense of higher runtime overhead.

Cache-conscious structure layout and definition [19] [20] blazed the trail for the ideas presented here. Collocating contemporaneously accessed objects and hint-based allocators (the previously discussed ccmalloc) were both explored by the authors. Our access graphs allow for novel analyses that help programmers understand which data should be grouped at runtime.

Profile-based data layout optimizations were explored by Rubin et al. [66]. Their approach uses profiling techniques to identify objects that are the top offenders against cache performance. This information is used to apply a series of known layout optimizations, such as structure splitting, field reordering and reordering of whole objects. Our work attempts to solve the layout problem in a more holistic fashion – by understanding the interplay between accesses to all of the objects in a data set.

The use of data structure fields to inform data layout creation was inspired by GraphChi [42]. The authors proposed a layout specifically for bulk-synchronous processing on graphs; we aspired to capture the essence of these ideas so they could be used for data structures in general.
With access graphs, we aimed to remove the necessity of domain knowledge for reasoning about good layouts. We showed that our techniques reach similar conclusions about laying out edges for PageRank.

The higher-order theory of locality (HOTL) [74] sets up a mathematical framework for thinking about different locality metrics. From it, Xiang et al. develop a novel low-overhead way of sampling locality, and demonstrate the ability to predict miss rates from the acquired information. While providing an excellent basis for reasoning about locality, the techniques presented in HOTL do not expose actionable data on how to improve locality in a program. Access graphs take a different approach of observing the full access trace and providing insight into how objects relate to each other in terms of access locality. This level of detail, while incurring large overheads compared to sampling techniques, brings new information on what can be done to the data layout to improve performance.

Yoon et al. [75] use a graph representation for generating a cache-oblivious layout of a mesh, but their representation is very different from access graphs. Edges are formed between vertices connected in a mesh; in other words, this representation is specific to the mesh data structure and requires knowing which vertices are connected. Access graphs operate on any generic memory access trace and require no knowledge of data structure specifics.

Liu et al. propose a sampling tool [49] that correlates bad memory performance with data objects: static or dynamically allocated variables. Similar approaches can be found in DProf [61] and MemProf [43], which correlate cache misses and remote memory accesses with data items.
Looking at data as a first-order citizen in memory performance analysis is a good approach to helping programmers better understand memory bottlenecks. Access graphs use a similar data-centric approach, but aspire to bring more actionable insight by observing the entire execution trace and extracting locality-related relationships between data objects.

Peled et al. propose a context-based cache prefetcher model [60] that detects semantic locality and issues prefetches based on it. Their approach is to implement a machine learning model in hardware that observes "contexts" of memory accesses during execution. A context contains information such as register contents, access and branch histories, compiler-provided type information, etc. The approach used by Peled et al. and our data-driven spatial locality method are similar in that both base their techniques on data-centric contextual information. They sit on two sides of a trade-off spectrum: semantic locality is generally applicable out of the box and has insight into lower-level information, but the amount of data it can observe at any point in time is limited by hardware constraints. Access graphs and their related tools observe a much larger body of information about the execution, but at a higher level. They do not provide performance benefits on the fly, but are used to derive better locality optimizations.

Previous allocator work such as dlmalloc [46] and jemalloc [29] inspired Tidy's arena-based allocation. However, these allocators rely on the information available through regular allocation calls, namely the size of the object being allocated, to group data. Our approach with Tidy was to redefine what an allocation call looks like, and to show that additional information can further improve locality. We do not compare against dlmalloc or jemalloc, as our work is mostly orthogonal – we show that informed hints can improve a layout's performance even when allocating objects of the same size throughout the program. Tidy is a proof-of-concept allocator wrapper, and our implementation does not focus on allocation speed. The insights from Tidy can be used to bring hint-based allocation into state-of-the-art allocator libraries.

Chapter 5

Conclusion

In the past decades, the gap between CPU speeds and RAM bandwidth has been increasing. To make matters worse, the amount of data modern programs process is growing rapidly as well. Hardware architects are working on tackling the problem using technologies such as novel cache prefetchers and near-memory processing units. On the software side, experts spend a lot of time and effort devising new ways of laying data out in memory to bring better spatial locality back into programs. We have observed a trend in prior work on memory layouts: gaining knowledge about the access pattern of an algorithm, relating it to the data itself, and using that knowledge to guide memory layout design. This process has been mostly manual, and difficult enough to warrant top-tier publications. Based on these observations, we set out to investigate whether the burden of obtaining knowledge about the relationship between access patterns and data can be delegated to machines.

The first step towards our goal is data collection. With the advent of big data and the revival of machine learning, the opportunity arose to analyze full, detailed memory access traces. To this end, we built DINAMITE (chapter 2), an LLVM compiler pass that instruments memory accesses, allocations and function events in a program. We show that compiler-based instrumentation can be used to deliver very rich information about every access, at a fraction of the runtime cost of previously available instrumentation tools.
When compared against the pinatrace access tracing tool from the Pin instrumentation framework, DINAMITE reduced the overhead of tracing by an order of magnitude, while delivering a higher level of detail about the accesses (section 2.2.3). We show how DINAMITE's modular library structure can be used to feed access traces into existing big-data processing systems, which allows for faster exploratory memory performance debugging (section 2.3.3).

The second step towards our goal is the analysis of collected memory access traces, and the formulation of techniques that enable the detection of relationships between data and access patterns. We created a novel representation of access patterns called access graphs (section 3.2). Access graphs capture the spatial locality properties of a program. Their nodes represent data objects allocated within a program, and their edges the number of contemporaneous accesses of two objects. Based on the spatial locality principle of caches, placing objects that often get accessed contemporaneously close in memory has the potential for improving performance. To detect groups of such objects, we use a novel combination of community detection and linear graph layout algorithms to form a new heuristic – Hierarchical Memory Layouts (section 3.3). Hierarchical Memory Layouts approximate solutions to the NP-hard problem of optimal data layout in memory. We show that our solution manages to outperform the baseline in all observed cases in terms of LLC and DTLB miss rates (section 3.5.1). Further, Hierarchical Memory Layouts can be used to directly generate better layouts of data in cases where layouts in storage match layouts in memory (section 3.5.2). Finally, we developed the eponymous Data-Driven Spatial Locality technique, which detects correlations between the data itself and the groups of data objects that should be placed close in memory (section 3.4).
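The node-and-edge construction described above can be sketched as follows. This is a simplified, pure-Python illustration of the access-graph idea, not DINAMITE's actual implementation; the sliding-window size is an illustrative parameter, not a value from the dissertation.

```python
from collections import deque

# Sketch of access-graph construction: nodes are data objects, and an
# edge's weight counts how often two objects are accessed within the
# same sliding window of the trace ("contemporaneously").

def build_access_graph(trace, window=4):
    """trace: object IDs in access order -> {frozenset(pair): weight}."""
    edges = {}
    recent = deque(maxlen=window)   # the last `window` accesses
    for obj in trace:
        for other in set(recent):
            if other != obj:
                key = frozenset((obj, other))
                edges[key] = edges.get(key, 0) + 1
        recent.append(obj)
    return edges

# Objects A and B are interleaved, so they accumulate a heavy edge and
# are good candidates for co-location; C is touched only at the end and
# stays weakly connected to both.
trace = ["A", "B", "A", "B", "A", "B", "C"]
graph = build_access_graph(trace, window=2)
print(graph[frozenset(("A", "B"))])   # 5: A and B co-occur in every window
print(graph[frozenset(("A", "C"))])   # 1
```

Grouping heavily connected nodes (here, A and B) is then the job of the community detection and linear layout steps described in section 3.3.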
Our findings indicate that many programs exhibit such correlations (section 3.5.4), validating our primary hypothesis. However, applying memory optimizations based on that knowledge is feasible only for a subset of programs with simple allocation patterns, which partially validates our secondary hypothesis.

Our work sets a foundation for several future research directions. Starting from DINAMITE, the computation model that most stream processing engines provide today is not well suited for tasks such as the creation of access graphs. It is worth investigating what changes need to be introduced in order to support scalable sliding-window processing of memory access traces. The overheads of using Spark Streaming with DINAMITE are notably very high – over an order of magnitude higher than simply storing traces. There is potential for improvement here that could make exploratory memory performance debugging a more attractive and feasible option to be included in engineers' tool sets. The overhead of DINAMITE itself could be further reduced by using static instead of dynamic linkage for the logging libraries. This would tie the compiled binary to a single logging variant, a trade-off we have not looked into in detail.

The second avenue of potential future work would be the exploration of access graphs themselves. Access graphs are a novel tool in the context of memory optimizations, and we have explored only one application of them – grouping of objects based on graph communities. More work could be done to explore possible new uses for access graphs.

The Data-Driven Spatial Locality technique presented in this dissertation uses random forests for the sake of simplicity of use and transparency of results – random forests deliver good results without extensive tuning, and it is possible to analyze the decision process to find the features that contributed the most to successful classification.
With the rapid progress happening in the area of artificial neural networks, it is reasonable to expect them to attain similar characteristics to random forests. Given the amount of work that goes into optimizing the performance of neural networks and the possibility of compiling trained networks into native code, it would be interesting to look into the feasibility of directly embedding neural networks into hint-based allocators.

Bibliography

[1] Cache memory store in a processor of a data processing system, July 22 1975. US Patent 3,896,419.

[2] LLVM language reference manual, 2016.

[3] WiredTiger: making big data roar, 2016.

[4] WiredTiger storage engine, 2016.

[5] WT-2029: improve scalability of statistics, 2016.

[6] Andreas Abel and Jan Reineke. Reverse engineering of cache replacement policies in Intel microprocessors and their evaluation. In Performance Analysis of Systems and Software (ISPASS), 2014 IEEE International Symposium on, pages 141–142. IEEE, 2014.

[7] Anastassia Ailamaki, David J DeWitt, Mark D Hill, and David A Wood. DBMSs on a modern processor: Where does time go? In VLDB '99, Proceedings of 25th International Conference on Very Large Data Bases, September 7-10, 1999, Edinburgh, Scotland, UK, number DIAS-CONF-1999-001, 1999.

[8] José Nelson Amaral, Edson Borin, Dylan Ashley, Caian Benedicto, Elliot Colp, Joao Henrique Stange Hoffmam, Marcus Karpoff, Erick Ochoa, Morgan Redshaw, and Raphael Ernani Rodrigues. The Alberta workloads for the SPEC CPU 2017 benchmark suite.

[9] Tetsuo Asano, Desh Ranjan, Thomas Roos, Emo Welzl, and Peter Widmayer. Space-filling curves and their use in the design of geometric data structures. Theoretical Computer Science, 181(1):3–15, 1997.

[10] Michael A Bender, Erik D Demaine, and Martin Farach-Colton. Cache-oblivious B-trees. In Foundations of Computer Science, 2000. Proceedings. 41st Annual Symposium on, pages 399–409. IEEE, 2000.

[11] Christian Bienia. Benchmarking Modern Multiprocessors.
PhD thesis, Princeton University, January 2011.

[12] Georgios Bitzes and Andrzej Nowak. The overhead of profiling using PMU hardware counters. CERN openlab report, 2014.

[13] Vincent D Blondel, Jean-Loup Guillaume, Renaud Lambiotte, and Etienne Lefebvre. Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment, 2008(10):P10008, 2008.

[14] Silas Boyd-Wickizer, Austin T Clements, Yandong Mao, Aleksey Pesterev, M Frans Kaashoek, Robert Morris, Nickolai Zeldovich, et al. An analysis of Linux scalability to many cores. In OSDI, volume 10, 2010.

[15] Leo Breiman. Random forests. Machine Learning, 45(1):5–32, 2001.

[16] Leo Breiman. Manual on setting up, using, and understanding random forests v3.1. Statistics Department, University of California Berkeley, CA, USA, 1, 2002.

[17] Gerth Stølting Brodal, Rolf Fagerberg, and Riko Jacob. Cache oblivious search trees via binary trees of small height. In Proceedings of the Thirteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 39–48. Society for Industrial and Applied Mathematics, 2002.

[18] Derek Bruening, Timothy Garnett, and Saman Amarasinghe. An infrastructure for adaptive dynamic optimization. In Code Generation and Optimization, 2003. CGO 2003. International Symposium on. IEEE, 2003.

[19] Trishul M Chilimbi, Bob Davidson, and James R Larus. Cache-conscious structure definition. In ACM SIGPLAN Notices, volume 34. ACM, 1999.

[20] Trishul M Chilimbi, Bob Davidson, and James R Larus. Cache-conscious structure definition. In ACM SIGPLAN Notices, volume 34, pages 13–24. ACM, 1999.

[21] William P Churchill Jr. Memory access technique, November 1 1977. US Patent 4,056,845.

[22] Daniel Cordes, Heiko Falk, and Peter Marwedel. A fast and precise static loop analysis based on abstract interpretation, program slicing and polytope models. In Code Generation and Optimization, 2009. CGO 2009. International Symposium on, pages 136–146. IEEE, 2009.

[23] Arnaldo Carvalho de Melo.
The new Linux 'perf' tools. 2010.

[24] Marianne De Michiel, Armelle Bonenfant, Hugues Cassé, and Pascal Sainrat. Static loop bound analysis of C programs based on flow analysis and abstract interpretation. In Embedded and Real-Time Computing Systems and Applications, 2008. RTCSA'08. 14th IEEE International Conference on, pages 161–166. IEEE, 2008.

[25] Erik D Demaine. Cache-oblivious algorithms and data structures. Lecture Notes from the EEF Summer School on Massive Data Sets, 8(4):1–249, 2002.

[26] Dave Dice, Yossi Lev, and Mark Moir. Scalable statistics counters. In Proceedings of the Twenty-Fifth Annual ACM Symposium on Parallelism in Algorithms and Architectures. ACM, 2013.

[27] Ulrich Drepper. What every programmer should know about memory. Red Hat, Inc, 11, 2007.

[28] Jan Edler and Mark D Hill. Dinero IV trace-driven uniprocessor cache simulator, 1998.

[29] Jason Evans. A scalable concurrent malloc(3) implementation for FreeBSD. In Proc. of the BSDCan Conference, Ottawa, Canada, 2006.

[30] Michael Ferdman, Almutaz Adileh, Onur Kocberber, Stavros Volos, Mohammad Alisafaee, Djordje Jevdjic, Cansu Kaynak, Adrian Daniel Popescu, Anastasia Ailamaki, and Babak Falsafi. Clearing the clouds: a study of emerging scale-out workloads on modern hardware. In ACM SIGPLAN Notices, volume 47. ACM, 2012.

[31] M Gopi and David Eppstein. Single-strip triangulation of manifolds with arbitrary topology. In Computer Graphics Forum, volume 23, pages 371–379. Wiley Online Library, 2004.

[32] Kazushige Goto and Robert A Geijn. Anatomy of high-performance matrix multiplication. ACM Transactions on Mathematical Software (TOMS), 34(3), 2008.

[33] Wentao Han, Youshan Miao, Kaiwei Li, Ming Wu, Fan Yang, Lidong Zhou, Vijayan Prabhakaran, Wenguang Chen, and Enhong Chen. Chronos: a graph engine for temporal graph analysis. In Proceedings of the Ninth European Conference on Computer Systems, page 1. ACM, 2014.

[34] John L. Henning. SPEC CPU2006 benchmark descriptions.
ACM SIGARCH Computer Architecture News, 34(4), 2006.

[35] Mark D Hill and J Elder. DineroIV trace-driven uniprocessor cache simulator, 1998.

[36] Martin Isenburg and Peter Lindstrom. Streaming meshes. In Visualization, 2005. VIS 05. IEEE, pages 231–238. IEEE, 2005.

[37] Martin Isenburg, Yuanxin Liu, Jonathan Shewchuk, and Jack Snoeyink. Streaming computation of Delaunay triangulations. In ACM Transactions on Graphics (TOG), volume 25, pages 1049–1056. ACM, 2006.

[38] Marty Itzkowitz, Brian JN Wylie, Christopher Aoki, and Nicolai Kosche. Memory profiling using hardware counters. In Supercomputing, 2003 ACM/IEEE Conference. IEEE, 2003.

[39] Aamer Jaleel, Kevin B Theobald, Simon C Steely Jr, and Joel Emer. High performance cache replacement using re-reference interval prediction (RRIP). In ACM SIGARCH Computer Architecture News, volume 38, pages 60–71. ACM, 2010.

[40] Martin Juvan and Bojan Mohar. Optimal linear labelings and eigenvalues of graphs. Discrete Applied Mathematics, 36(2):153–168, 1992.

[41] Tim Kraska, Alex Beutel, Ed H Chi, Jeffrey Dean, and Neoklis Polyzotis. The case for learned index structures. In Proceedings of the 2018 International Conference on Management of Data, pages 489–504. ACM, 2018.

[42] Aapo Kyrola, Guy E Blelloch, and Carlos Guestrin. GraphChi: Large-scale graph computation on just a PC. USENIX, 2012.

[43] Renaud Lachaize, Baptiste Lepers, and Vivien Quéma. MemProf: A memory profiler for NUMA multicore systems. In Presented as part of the 2012 USENIX Annual Technical Conference (USENIX ATC 12), 2012.

[44] Chris Lattner and Vikram Adve. LLVM: A compilation framework for lifelong program analysis & transformation. In Code Generation and Optimization, 2004. CGO 2004. International Symposium on. IEEE, 2004.

[45] Chris Lattner and Vikram Adve. Automatic pool allocation: improving performance by controlling data structure layout in the heap. In ACM SIGPLAN Notices, volume 40. ACM, 2005.

[46] Doug Lea.
dlmalloc, 2010.

[47] Han Bok Lee and Benjamin G Zorn. BIT: A tool for instrumenting Java bytecodes. In USENIX Symposium on Internet Technologies and Systems, 1997.

[48] Tongping Liu and Emery D Berger. Sheriff: precise detection and automatic mitigation of false sharing. ACM SIGPLAN Notices, 46(10), 2011.

[49] Xu Liu and John Mellor-Crummey. Pinpointing data locality problems using data-centric analysis. In Code Generation and Optimization (CGO), 2011 9th Annual IEEE/ACM International Symposium on, pages 171–180. IEEE, 2011.

[50] Zhiyu Liu, Irina Calciu, Maurice Herlihy, and Onur Mutlu. Concurrent data structures for near-memory computing. In Proceedings of the 29th ACM Symposium on Parallelism in Algorithms and Architectures, pages 235–245. ACM, 2017.

[51] Chi-Keung Luk, Robert Cohn, Robert Muth, Harish Patil, Artur Klauser, Geoff Lowney, Steven Wallace, Vijay Janapa Reddi, and Kim Hazelwood. Pin: building customized program analysis tools with dynamic instrumentation. In ACM SIGPLAN Notices, volume 40. ACM, 2005.

[52] Qin Lv, William Josephson, Zhe Wang, Moses Charikar, and Kai Li. Ferret: a toolkit for content-based similarity search of feature-rich data. ACM SIGOPS Operating Systems Review, 40(4):317–330, 2006.

[53] Rama Kishan Malladi. Using Intel® VTune™ Performance Analyzer events/ratios & optimizing applications. http://software.intel.com, 2009.

[54] Margaret Martonosi, Anoop Gupta, and Thomas Anderson. MemSpy: Analyzing memory system bottlenecks in programs. In ACM SIGMETRICS Performance Evaluation Review, volume 20. ACM, 1992.

[55] Svetozar Miucin, Conor Brady, and Alexandra Fedorova. DINAMITE: A modern approach to memory performance profiling. arXiv preprint arXiv:1606.00396, 2016.

[56] Nicholas Nethercote and Julian Seward. Valgrind: a framework for heavyweight dynamic binary instrumentation. In ACM SIGPLAN Notices, volume 42. ACM, 2007.

[57] Preeti Ranjan Panda, Hiroshi Nakamura, Nikil D Dutt, and Alexandru Nicolau.
Augmenting loop tiling with data alignment for improvedcache performance. IEEE transactions on computers, 48(2):142–149,1999.[58] Gian-Carlo Pascutto. Sjeng 11.2 deep sjeng, 2018.[59] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion,O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Van-derplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, andE. Duchesnay. Scikit-learn: Machine learning in Python. Journal ofMachine Learning Research, 12:2825–2830, 2011.[60] Leeor Peled, Shie Mannor, Uri Weiser, and Yoav Etsion. Semantic lo-cality and context-based prefetching using reinforcement learning. InComputer Architecture (ISCA), 2015 ACM/IEEE 42nd Annual Inter-national Symposium on, pages 285–297. IEEE, 2015.[61] Aleksey Pesterev, Nickolai Zeldovich, and Robert T Morris. Locatingcache performance bottlenecks using data profiling. In Proceedings ofthe 5th European conference on Computer systems. ACM, 2010.[62] Jordi Petit. Experiments on the minimum linear arrangement problem.Journal of Experimental Algorithmics (JEA), 8:2–3, 2003.[63] Vijayan Prabhakaran, Ming Wu, Xuetian Weng, Frank McSherry, Li-dong Zhou, and Maya Haridasan. Managing large graphs on multi-coreswith graph awareness. 2012.[64] Ashay Rane and James Browne. Enhancing performance optimizationof multicore chips and multichip nodes with data structure metrics. InProceedings of the 21st international conference on Parallel architec-tures and compilation techniques. ACM, 2012.80Bibliography[65] Amitabha Roy, Ivo Mihailovic, and Willy Zwaenepoel. X-stream: Edge-centric graph processing using streaming partitions. In Proceedings ofthe Twenty-Fourth ACM Symposium on Operating Systems Principles,pages 472–488. ACM, 2013.[66] Shai Rubin, Rastislav Bod´ık, and Trishul Chilimbi. An efficient profile-analysis framework for data-layout optimizations. In ACM SIGPLANNotices, volume 37, pages 140–153. ACM, 2002.[67] Pedro V Sander, Diego Nehab, and Joshua Barczak. 
Fast triangle re-ordering for vertex locality and reduced overdraw. In ACM Transac-tions on Graphics (TOG), volume 26, page 89. ACM, 2007.[68] Shankar Prasad Sastry. Dynamic meshing techniques for quality im-provement, untangling, and warping. 2012.[69] Julian Shun and Guy E Blelloch. Ligra: a lightweight graph processingframework for shared memory. In ACM Sigplan Notices, volume 48,pages 135–146. ACM, 2013.[70] Alan Jay Smith. Cache memories. ACM Computing Surveys (CSUR),14(3):473–530, 1982.[71] Huy T Vo, Claudio T Silva, Luiz F Scheidegger, and Valerio Pascucci.Simple and efficient mesh layout with space-filling curves. Journal ofGraphics Tools, 16(1):25–39, 2012.[72] Michael Wolfe. More iteration space tiling. In Proceedings of the 1989ACM/IEEE conference on Supercomputing, pages 655–664. ACM, 1989.[73] Wm A Wulf and Sally A McKee. Hitting the memory wall: implicationsof the obvious. ACM SIGARCH computer architecture news, 23(1):20–24, 1995.[74] Xiaoya Xiang, Chen Ding, Hao Luo, and Bin Bao. Hotl: a higher ordertheory of locality. In ACM SIGARCH Computer Architecture News,volume 41, pages 343–356. ACM, 2013.[75] Sung-Eui Yoon, Peter Lindstrom, Valerio Pascucci, and DineshManocha. Cache-oblivious mesh layouts. In ACM Transactions onGraphics (TOG), volume 24. ACM, 2005.[76] Sung-Eui Yoon and Dinesh Manocha. Cache-efficient layouts of bound-ing volume hierarchies. In Computer Graphics Forum, volume 25, pages507–516. Wiley Online Library, 2006.81[77] Matei Zaharia, Tathagata Das, Haoyuan Li, Scott Shenker, and IonStoica. Discretized streams: an efficient and fault-tolerant model forstream processing on large clusters. In Presented as part of the, 2012.[78] Qin Zhao, Derek Bruening, and Saman Amarasinghe. Umbra: Effi-cient and scalable memory shadowing. In Proceedings of the 8th an-nual IEEE/ACM international symposium on Code generation and op-timization. ACM, 2010.[79] Qin Zhao, David Koh, Syed Raza, Derek Bruening, Weng-Fai Wong,and Saman Amarasinghe. 
Appendix A

Hardware

All experiments in the Chapter 2 evaluation were performed on one of the following machines:

• Machine A - AMD Opteron 6272, 4 chips with 16 cores and 16 MB of last-level cache each, and 512 GB of RAM
• Machine B - AMD Opteron 2435, 2 chips with 6 cores and 6 MB of last-level cache each, and 32 GB of RAM
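Readers who want to record the same parameters for their own test machine can use standard Linux utilities. The following is a minimal sketch; it assumes coreutils (`nproc`), glibc (`getconf`), and procfs are available, and note that `getconf` may report 0 for cache sizes inside some virtualized environments:

```shell
# Report core count, last-level cache size, and installed RAM on Linux.
nproc                          # number of available logical cores
getconf LEVEL3_CACHE_SIZE      # last-level (L3) cache size in bytes
grep MemTotal /proc/meminfo    # total installed RAM
```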
