"Applied Science, Faculty of"@en . "Electrical and Computer Engineering, Department of"@en . "DSpace"@en . "UBCV"@en . "Ye, Lyuyu"@en . "2019-06-20T15:41:12Z"@en . "2019"@en . "Master of Applied Science - MASc"@en . "University of British Columbia"@en . "Data structure splicing (DSS) refers to reorganizing data structures by merging or splitting them, reordering fields, inlining pointers, etc. DSS has been used, with demonstrated benefits, to improve spatial locality. When data fields that are accessed together are also collocated in the address space, the utilization of hardware caches improves and cache misses decline. \r\n\r\nA number of approaches to DSS have been proposed, but each addressed only one or two splicing optimizations (e.g., only splitting or only field reordering) and used an underlying abstraction that could not be extended to include others. Our work proposes a single abstraction, called Data Structure Access Graph (D-SAG), that (a) covers all data-splicing optimizations proposed previously and (b) unlocks new ones. Having a common abstraction has two benefits: (1) It enables us to build a single tool that hosts all DSS optimizations under one roof, eliminating the need to adopt multiple tools. (2) It avoids conflicts: e.g., where one tool suggests to split a data structure in a way that would conflict with another tool\u00E2\u0080\u0099s suggestion to reorder fields. \r\n\r\nBased on the D-SAG abstraction, we build a toolchain that uses static and dynamic analysis to recommend DSS optimizations to developers. Using this tool, we identify ten benchmarks from the SPEC CPU2017 and PARSEC suites that are amenable to DSS, as well as a workload on RocksDB that stresses its memory table. Restructuring data structures following the tool\u00E2\u0080\u0099s suggestion improves performance by an average of 11% (geomean) and reduces cache misses by an average of 28% (geomean) for seven of these workloads."@en . "https://circle.library.ubc.ca/rest/handle/2429/70711?expand=metadata"@en . "Data Structure SplicingbyLyuyu YeB.A.Sc, Simon Fraser University, 2016B.Eng, Zhejiang University, 2016A THESIS SUBMITTED IN PARTIAL FULFILLMENTOF THE REQUIREMENTS FOR THE DEGREE OFMaster of Applied ScienceinTHE FACULTY OF GRADUATE AND POSTDOCTORALSTUDIES(Electrical and Computer Engineering)The University of British Columbia(Vancouver)June 2019c\u00C2\u00A9 Lyuyu Ye, 2019The following individuals certify that they have read, and recommend to the Fac-ulty of Graduate and Postdoctoral Studies for acceptance, the thesis entitled:Data Structure Splicingsubmitted by Lyuyu Ye in partial fulfillment of the requirements for the degree ofMaster of Applied Science in Electrical and Computer Engineering.Examining Committee:Alexandra Fedorova, Electrical and Computer EngineeringSupervisorMieszko Lis, Electrical and Computer EngineeringCommittee Member, Co-readerSathish Gopalakrishnan, Electrical and Computer EngineeringCommittee Member, ChairiiAbstractData structure splicing (DSS) refers to reorganizing data structures by merging orsplitting them, reordering fields, inlining pointers, etc. DSS has been used, withdemonstrated benefits, to improve spatial locality. 
When data fields that are accessed together are also collocated in the address space, the utilization of hardware caches improves and cache misses decline.

A number of approaches to DSS have been proposed, but each addressed only one or two splicing optimizations (e.g., only splitting or only field reordering) and used an underlying abstraction that could not be extended to include others. Our work proposes a single abstraction, called the Data Structure Access Graph (D-SAG), that (a) covers all data-splicing optimizations proposed previously and (b) unlocks new ones. Having a common abstraction has two benefits: (1) It enables us to build a single tool that hosts all DSS optimizations under one roof, eliminating the need to adopt multiple tools. (2) It avoids conflicts: e.g., where one tool suggests splitting a data structure in a way that would conflict with another tool's suggestion to reorder fields.

Based on the D-SAG abstraction, we build a toolchain that uses static and dynamic analysis to recommend DSS optimizations to developers. Using this tool, we identify ten benchmarks from the SPEC CPU2017 and PARSEC suites that are amenable to DSS, as well as a workload on RocksDB that stresses its memory table. Restructuring data structures following the tool's suggestions improves performance by an average of 11% (geomean) and reduces cache misses by an average of 28% (geomean) for seven of these workloads.

Lay Summary

Data structure code is often written without attention to access locality, especially in modern, sophisticated software. Hardware, however, requires locality to perform efficiently, and it is hard to reason about locality without explicit guidance or insight into the runtime workload. We propose D-SAG (Data Structure Access Graph) and a corresponding algorithm to address this issue. Our solution automatically analyzes a runtime profile and provides developers with several types of suggested code changes for optimizing data structure definitions. The proposed solution improves access locality, and thereby cache usage and performance, for memory-bound applications.

Preface

This thesis extends a paper under submission to the MEMSYS 2019 conference:

- L. Ye, M. Lis, and A. Fedorova. "A Unifying Abstraction for Data Structure Splicing".

I was the lead investigator on this work and conducted all of the research. I did the writing, with editorial help and technical advice from my co-authors.

Table of Contents

Abstract
Lay Summary
Preface
Table of Contents
List of Tables
List of Figures
Acknowledgments
1 Introduction
2 Related Work
3 Abstraction and Algorithms
3.1 Requirements for a Common Abstraction
3.2 Access Affinity
3.3 D-SAG
3.4 D-SAG Analysis
3.4.1 Stage 1: Class Splitting and Merging
3.4.2 Stage 2: Field Inlining
3.4.3 Stage 3: Field Reordering
4 Implementation
4.1 Phase 1: Identifying memory-bound programs and functions
4.2 Phase 2: Memory trace collection and static analysis
4.3 Phase 3: D-SAG and Analysis
4.4 Phase 4: Simulating proposed changes
5 Evaluation
5.1 Benchmarks
5.2 Methodology
5.3 Performance
5.4 Case Study - canneal
6 Discussion and Future Work
6.1 Automating recommendations
6.2 Algorithm Validity and Correctness
6.3 Threats to Effectiveness
7 Conclusion
Bibliography
A Hardware

List of Tables

Table 2.1 Data Structure Splicing Optimizations Comparison
Table 5.1 Benchmarks Categories

List of Figures

Figure 1.1 Bytes used in a cache line before eviction.
Figure 1.2 A snippet of a data structure in RocksDB and the corresponding D-SAG.
Figure 1.3 Recommended data structure definitions produced by our toolchain for the code in Figure 1.2.
Figure 3.1 A snippet of a data structure in PARSEC canneal and the corresponding D-SAG.
Figure 3.2 Recommended data structure definitions produced by our tool for the code in Figure 3.1.
Figure 3.3 The data structure and the code accessing it for the D-SAG in Figure 3.4.
Figure 3.4 The D-SAG (G0) for the code in Figure 3.3 under an example workload.
Figure 3.5 G1, the result of Stage 1, transformed from G0 in Figure 3.4.
Figure 3.6 G2, the result of Stage 2, transformed from G1 in Figure 3.5.
Figure 3.7 G3, the result of Stage 3, transformed from G2 in Figure 3.6.
Figure 3.8 Data structure definition suggestions: final output after Stage 3.
Figure 4.1 The pipeline of the proposed toolchain.
Figure 5.1 Runtime of the optimized version normalized to the original.
Figure 5.2 Cache misses of the optimized version normalized to the original.
Figure 5.3 Relevant RocksDB code before and after optimizations.
Figure 5.4 Simulated cache misses measured on rearranged traces, normalized to those measured on the original trace.
Figure 5.5 Cache line utilization for the original trace and the trace rearranged according to optimization suggestions.
Figure 5.6 Access pattern of canneal before optimization.
Figure 5.7 Access pattern of canneal after optimizations.

Acknowledgments

I would like to express my deepest appreciation to my supervisor Prof. Alexandra Fedorova, whose guidance made this research possible. I owe particular thanks to my graduated lab mate Dr. Svetozar Miucin, whose work provides the foundation of my research. I would also like to thank all of the contributors and co-authors for their hard work. I offer my enduring gratitude to the faculty, staff and my fellow students at UBC, who have inspired me to continue my work in this field. Special thanks go to my family for believing in me throughout this journey, and to my friends who were always there for me.

Finally, I would like to thank Google for providing me with an internship opportunity that inspired and helped me in my research direction.

Chapter 1
Introduction

Hardware caches rely on two kinds of intuition about locality: temporal locality (i.e., that recently accessed data is likely to be accessed in the near future), and spatial locality (i.e., that data placed together in the address space are also likely to be accessed together in time).

To take advantage of spatial locality, caches fetch data from memory in batches ("cache lines" of usually 64-128 bytes). If the program has a high degree of spatial locality, hardware caches operate efficiently: once they fetch (or pre-fetch) a cache line, all or most of the line will be used by subsequent accesses. Efficient caching lowers memory access latency, reduces memory bandwidth requirements, and results in better performance.

Unfortunately, spatial locality in modern applications is very low [15]. Figure 1.1 shows that the cache line utilization of different benchmarks (i.e., the number of bytes accessed out of the 64 bytes brought into the cache) is only about 40% of the cache line on average. The appetites of modern datacentre workloads already exceed the capacities of feasible caches [12], and low utilization only exacerbates this problem.

Figure 1.1: Bytes used in a cache line before eviction (64-byte cache lines, 8MB LLC, 16-way set associative), measured in an enhanced version of the DineroIV [10] cache simulator. In most applications, very little of a cache line is used before the line is evicted.
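Cache line utilization of the kind reported in Figure 1.1 can be measured with a small amount of per-line bookkeeping in a trace-driven simulator. The sketch below illustrates the idea; it is our own illustrative code, not the actual DineroIV modification used for Figure 1.1, and the names LineStats and UtilTracker are invented for this example.

    #include <bitset>
    #include <cstdint>
    #include <unordered_map>
    #include <vector>

    // Per-line bookkeeping: which of the 64 bytes were touched while resident.
    struct LineStats {
        std::bitset<64> touched;
    };

    class UtilTracker {
    public:
        // Called on every access the simulator sees (address, size in bytes).
        void onAccess(uint64_t addr, unsigned size) {
            uint64_t line = addr >> 6;  // 64-byte lines
            unsigned offset = addr & 63;
            LineStats& stats = resident_[line];
            for (unsigned i = 0; i < size && offset + i < 64; ++i)
                stats.touched.set(offset + i);
        }

        // Called when the simulated cache evicts a line.
        void onEvict(uint64_t addr) {
            auto it = resident_.find(addr >> 6);
            if (it == resident_.end()) return;
            bytesUsed_.push_back(it->second.touched.count());
            resident_.erase(it);
        }

        // Average fraction of each 64-byte line used before eviction.
        double avgUtilization() const {
            if (bytesUsed_.empty()) return 0.0;
            uint64_t total = 0;
            for (unsigned b : bytesUsed_) total += b;
            return static_cast<double>(total) / (64.0 * bytesUsed_.size());
        }

    private:
        std::unordered_map<uint64_t, LineStats> resident_;
        std::vector<unsigned> bytesUsed_;
    };

On each access the tracker marks the touched bytes of the resident line; on eviction it records how many of the 64 bytes were ever touched, and the average of those counts over all evictions gives the utilization plotted in Figure 1.1.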
Spatial locality can be improved either by reorganizing the cache structure (in hardware) or by reorganizing the data layout (in software). While hardware solutions have been proposed (e.g., adaptive cache line granularity [e.g., 15] or sub-cache-line filtering [e.g., 22]), they have so far not found implementation in processors from the major commercial vendors. In this work, therefore, we assume fixed hardware and focus on restructuring the application's data layout.

Writing programs with good spatial locality is hard. Data structures are often naturally organized along primarily semantic lines, and developing intuition about co-temporal accesses to single fields in a large structure is rarely straightforward. To help programmers with this challenge, prior work proposed tools that recommend (or automatically make) changes in data structure or class definitions. Examples include class splitting, where "hot" (frequently accessed) fields are segregated from "cold" (infrequently accessed) fields [7, 14, 17]; field reordering, where fields that are accessed together are co-located in the data structure [7, 11]; and pointer inlining, where a pointer in a structure is replaced with the data it refers to, avoiding locality-breaking indirection [8].

We observe that all these techniques, which we call data structure splicing (DSS), are conceptually similar. They rely on reorganizing the definitions of classes or data structures guided by observations of how individual fields are accessed. Yet, there is no common, unifying abstraction that enables reasoning about these optimizations. The lack of a common abstraction has several downsides. First, different optimizations were embodied in disparate tools; to apply all of them, the developer may need to use, for example, one tool for class splitting, another for field merging, and yet another for pointer inlining. Second, disparate tools may produce conflicting recommendations: for example, one tool may suggest reordering fields in a way that conflicts with a class-splitting suggestion produced by another tool. Third, other conceptually similar optimizations, such as class merging and field migration, were overlooked in prior work, as we show in this thesis.

Our main contribution is a unifying abstraction for data structure splicing: the Data Structure Access Graph (D-SAG). A D-SAG is a graph where each field is represented by a node, and fields that are accessed together in time are connected by edges. Edges have weights, with higher weights signifying more contemporaneous accesses of a pair of fields. Figure 1.2 shows an abridged example of the LRUHandle data structure from RocksDB [3] and the corresponding D-SAG.

Our second contribution is a set of algorithms that analyze a D-SAG and recommend changes to data structures that improve spatial locality.
These algorithms use graph clustering and produce recommendations to the programmer. Figure 1.3 shows the output of our algorithm for the RocksDB example in Figure 1.2: the recommendation is to collocate specific fields (found to be frequently accessed together) in the same class. Applying these changes to RocksDB reduces the runtime of a workload that stresses its memory table by 20%.

As a proof of concept, we developed a toolchain that automatically generates a D-SAG from C and C++ code (using static and dynamic analysis) and recommends changes to data structures or classes. Our toolchain is based on the DINAMITE [20] LLVM pass, and works for C/C++ programs that can be compiled with LLVM 3.5. The analysis is primarily useful for memory-bound programs that use complex classes or data structures (as opposed to arrays of primitives, which are usually already optimized for spatial locality). Based on these criteria, we were able to analyze ten memory-bound benchmarks from SPEC CPU2017, PARSEC, and RocksDB's db_bench, and improve performance by an average of 11% (geomean) and reduce cache misses by an average of 28% (geomean) for seven of these workloads.

The rest of the thesis is organized as follows: Chapter 2 describes related work. Chapter 3 presents the D-SAG abstraction and the algorithms for its analysis. Chapter 4 describes the implementation. Chapter 5 presents the evaluation. Chapter 6 discusses limitations and future work. Chapter 7 concludes.

Figure 1.2: A snippet of a data structure in RocksDB and the corresponding D-SAG. Thicker edges represent more frequent co-temporal accesses.

Figure 1.3: Recommended data structure definitions produced by our toolchain for the code in Figure 1.2. MR is the LLC miss ratio, while MRP is the percentage contributed by this field to the overall miss ratio.

Chapter 2
Related Work

Prior studies that automatically identified DSS optimizations are summarized in Table 2.1. The focus of these studies was to identify one or two specific optimizations, while our goal is to design a common abstraction for many DSS techniques.

Table 2.1: Data Structure Splicing Optimizations Comparison

    Optimization            | Prior works        | Our work
    ------------------------|--------------------|---------
    Class splitting         | Yes [7] [14] [17]  | Yes
    Pointer-field inlining  | Yes [8]            | Yes
    Field reordering        | Yes [7] [17] [11]  | Yes
    Class merging           | No                 | Yes
    Field migration         | No                 | Yes

In "Cache-Conscious Structure Definition" [7], Chilimbi et al. focus on two optimizations: class splitting and field reordering. For class splitting, they use the abstraction of hot/cold fields: frequently accessed 'hot' fields are isolated in a separate class from infrequently accessed 'cold' fields. This abstraction does not enable DSS optimizations other than class splitting, and may not improve cache line utilization if the co-located hot fields are not accessed contemporaneously.

Chilimbi's field-reordering optimization does rely on computing affinity between fields, which is very similar to our definition of affinity (see Section 3.2). However, their analysis is done on fields within a single class, so cross-class optimizations are not possible. In contrast, our D-SAG abstraction detects field affinities within a class as well as across classes. As a result, D-SAG is poised to detect class-merging and field-migration opportunities, whereas previously used abstractions were not (see Table 2.1).
Hundt et al. [14] modify a compiler to perform structure splitting and peeling by finding dead fields. The lack of runtime information may prevent identifying fields that are accessed together and those that generate many cache misses. That aside, Hundt's optimizations were performed on individual data structures only.

Lin et al. proposed a compiler framework that integrates data structure splitting and field reordering with array flattening [17]. For class splitting, they use an abstraction similar to Chilimbi's cache-conscious data structures: they split out of the class the fields whose access count is higher than the average. For field reordering, they proposed a simple heuristic that sorts the fields according to their access count so that hot fields are grouped. (A similar strategy was used by Eimouri et al. [11].) The limitations of their class-splitting approach are similar to those of Chilimbi's work. The limitations of their field-reordering strategy are that cross-class optimizations are not possible, and that grouping fields without considering their affinity may yield poor performance.

Dolby et al. use pointer field inlining to solve performance issues caused by software module isolation [8]. They analyze memory traces to inline pointer fields of objects that have a "parent-child" or "one-to-one pointer field" relationship. This approach is effective, but does not produce optimizations besides pointer field inlining.

Our D-SAG abstraction is inspired by Miucin's access graphs [19]. In Miucin's access graphs, nodes represented individual memory addresses, and the graphs themselves were used to guide dynamic allocation of data structure instances. In D-SAG, however, nodes represent data structure fields, and the graphs are used to reorganize data structures in the application's source code.

Chapter 3
Abstraction and Algorithms

3.1 Requirements for a Common Abstraction

To understand the requirements for an effective common abstraction for DSS optimizations, let us consider two examples: the RocksDB code fragment from Figure 1.2 and the canneal code fragment in Figure 3.1.

The RocksDB code fragment (Figure 1.2) shows a data structure that accounts for most of the cache misses (~85%) in the RocksDB workload that stresses its memory table (see Chapter 5). The graph in the figure shows that the fields next_hash and hash are accessed both frequently (both are included in the graph) and together in time (they are connected by a thick edge). However, in the memory layout that corresponds to this definition, they are separated by six other fields, which are not accessed contemporaneously. Worse yet, they are likely to span more than one cache line: for example, even if LRUHandle were allocated on a 64-byte cache line boundary, next_hash would be in the first line and hash would span the first and second lines. Similarly, the fields next, key_length, and deleter are accessed together but separated by other fields; while they span only 48 bytes, LRUHandle allocation in the RocksDB code does not respect cache line boundaries, so in reality the three fields are likely to be allocated across two cache lines. To improve cache line usage, therefore, we would want to place next_hash and hash, as well as, separately, next, key_length, and deleter, close in memory.
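To make the desired layout concrete, the sketch below shows a simplified before/after version of such a structure. Only the field names mentioned above come from Figure 1.2; the field types and the placeholder fields standing in for the intervening members are invented for illustration.

    #include <cstddef>
    #include <cstdint>

    // Before: the co-accessed pairs are separated by fields that are not
    // accessed with them, so next_hash and hash tend to land in different
    // cache lines.
    struct LRUHandleBefore {
        LRUHandleBefore* next_hash;
        LRUHandleBefore* next;
        LRUHandleBefore* prev;                // placeholder
        void (*deleter)(const char*, void*);
        size_t charge;                        // placeholder
        size_t key_length;
        uint32_t refs;                        // placeholder
        uint32_t hash;
    };

    // After: fields with high access affinity are adjacent, so the hot
    // paths touch fewer cache lines per object.
    struct LRUHandleAfter {
        LRUHandleAfter* next_hash;            // hot pair: next_hash + hash
        uint32_t hash;
        LRUHandleAfter* next;                 // hot group: next + key_length + deleter
        size_t key_length;
        void (*deleter)(const char*, void*);
        LRUHandleAfter* prev;                 // colder fields follow
        size_t charge;
        uint32_t refs;
    };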
Observe that reasoning about this example requires knowing (a) which of a structure's fields are accessed often, and (b) which fields are accessed together.

The data structures in Figure 3.1 (from canneal in PARSEC [5], where they account for ~65% of all cache misses) demonstrate that this is not quite enough. The original code frequently accesses the fields fanin, fanout, and present_loc of class netlist_elem together in time, while the field item_name is accessed infrequently. Moreover, accesses to present_loc dereference it and access its elements x and y. To improve cache line usage, then, we would want to separate item_name from fanin and fanout, inline the x and y subfields of present_loc, and place all four close in the address space. To reach this conclusion, we needed not only the intra-structure access frequencies as in the RocksDB example above, but also the ability to track these across different related structures.

In other words, a common abstraction must (a) identify frequently and infrequently accessed fields, and (b) capture contemporaneous accesses, both within and across data structure boundaries. As the basis for this abstraction, we first define the term access affinity, to which we have been loosely referring as accesses occurring "together in time".

Figure 3.1: A snippet of a data structure in PARSEC canneal and the corresponding D-SAG.

Figure 3.2: Recommended data structure definitions produced by our tool for the code in Figure 3.1.

3.2 Access Affinity

To capture which fields are accessed together, and how closely together in time, we leverage Mattson's notion of stack distance [18]. Given accesses to two fields u and v in a memory access trace, stack distance is the number of accesses to unique data elements between the accesses to u and v; for example, in the trace 'uababv', the stack distance between u and v is two (the unique elements a and b). Intuitively, the lower the stack distance, the more closely in time two fields are accessed.

Next, we observe that, for the purposes of optimizing spatial locality, only short stack distances matter. Two fields with a long stack distance may not yield an improved cache hit rate even if they are placed next to each other, as the cache line may well have been evicted between the two accesses.

We therefore define an affinity event as an occurrence, in a memory access trace, of a stack distance below a threshold t. In other words, if a pair of fields u and v were accessed with fewer than t other unique elements accessed in between, we detect an affinity event. We then define the access affinity between a pair of fields as the number of affinity events between the two fields in the memory trace. (We discuss threshold selection in Section 4.3.)

The concept of access affinity allows us to reason about a pair of fields; in the next section, we combine the access affinity information for all data structures in the program into one abstraction that allows us to co-locate data that are frequently accessed together.
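Under this definition, affinity events can be detected in a single pass over the trace. The following sketch (our own illustrative code, not the tool's exact implementation) keeps the unique recently accessed fields in an LRU stack; an access to field u produces an affinity event with every distinct field found within the top t positions of the stack.

    #include <cstddef>
    #include <cstdint>
    #include <deque>
    #include <map>
    #include <utility>

    using FieldId = std::uint32_t;

    class AffinityTracker {
    public:
        explicit AffinityTracker(std::size_t threshold) : t_(threshold) {}

        // Process one trace entry: an access to field u.
        void onAccess(FieldId u) {
            std::size_t distance = 0;  // stack distance of the field being visited
            for (FieldId v : stack_) {
                if (v == u) continue;      // the endpoint itself is not counted
                if (distance >= t_) break; // beyond the threshold: stop
                ++affinity_[key(u, v)];    // one affinity event for the pair (u, v)
                ++distance;
            }
            // Move u to the top of the stack of unique recently accessed fields.
            for (auto it = stack_.begin(); it != stack_.end(); ++it)
                if (*it == u) { stack_.erase(it); break; }
            stack_.push_front(u);
        }

        // Resulting edge weight between fields a and b in the D-SAG.
        std::uint64_t affinity(FieldId a, FieldId b) const {
            auto it = affinity_.find(key(a, b));
            return it == affinity_.end() ? 0 : it->second;
        }

    private:
        static std::pair<FieldId, FieldId> key(FieldId a, FieldId b) {
            return a < b ? std::make_pair(a, b) : std::make_pair(b, a);
        }
        std::size_t t_;
        std::deque<FieldId> stack_;  // unique fields, most recent first
        std::map<std::pair<FieldId, FieldId>, std::uint64_t> affinity_;
    };

For the trace 'uababv' with t = 3, the final access to v finds b at distance 0, a at distance 1, and u at distance 2, so the pair (u, v) receives one affinity event.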
3.3 D-SAG

To reason about the relationships among different fields from different data structures, we construct the Data Structure Access Graph (D-SAG).

A D-SAG is an undirected graph where each node represents a field in a data structure or a class. Each node includes a counter that indicates how many times the corresponding field was accessed. Edges carry weights equal to the access affinities between the corresponding pairs of fields. For example, an edge u-v with a weight of 20 indicates that fields u and v were accessed within the threshold stack distance on twenty separate occasions. Because access counts are in general input-dependent, a D-SAG is specific to both the application (data structure definitions) and the memory access trace (access affinities). (We explain how we obtain memory access traces in Chapter 4.)

Figure 3.4 shows the D-SAG constructed from the memory access trace of the example code in Figure 3.3. (Such graphs are produced automatically by our tools.) Thicker edges represent stronger affinity.

3.4 D-SAG Analysis

To demonstrate the usefulness of the D-SAG abstraction for optimizing spatial locality, we show how it can be used to express three different optimizations that restructure class fields for spatial locality: class splitting and merging, field inlining, and field reordering. We organize the optimizations as a three-stage pipeline to show how they can be applied together using the common D-SAG abstraction.

3.4.1 Stage 1: Class Splitting and Merging

Splitting and merging transforms a set of classes into new classes where (a) fields frequently accessed together are merged into the same class, and (b) fields that are rarely accessed together are split into different classes. The first aspect means that accessing one field will also bring into the cache fields that are likely to be accessed soon (i.e., spatial locality is improved); the second aspect means that fields that are not likely to be accessed soon are not brought into the cache (i.e., cache wastage is reduced).

Figure 3.3: The data structure and the code accessing it for the D-SAG in Figure 3.4.

Figure 3.4: The D-SAG (G0) for the code in Figure 3.3 under an example workload. Nodes represent data structure fields, with the colour representing access frequency (red = "hot", blue = "cold"). Edge weights represent access affinity between a pair of fields: the thicker the edge, the more often the fields are accessed together. Dashed outlines identify the data structure to which the outlined fields belong.

To effect the splitting/merging optimization, we want to transform the D-SAG into several clusters, where each cluster consists of nodes that are connected by heavy edges. This is similar to the community detection problem, where the goal is to cluster the graph in such a way that the whole graph has the best modularity score [21]. A high modularity score implies dense connections between the nodes within clusters but sparse connections between nodes in different clusters, which corresponds to our goal of optimizing spatial locality by placing frequently co-accessed data close by but rarely co-accessed data far away.
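For reference, the modularity Q of a clustering, in Newman's standard formulation [21], is

    Q = \frac{1}{2m} \sum_{i,j} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] \delta(c_i, c_j)

where, in D-SAG terms, A_{ij} is the affinity between fields i and j, k_i = \sum_j A_{ij} is the total affinity incident on field i, m is the total edge weight in the graph, and \delta(c_i, c_j) is 1 if fields i and j are in the same cluster and 0 otherwise. Clusterings that score high on Q concentrate affinity inside clusters, which is exactly the property we want from a class layout.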
Figure 3.5: G1, the result of Stage 1, transformed from G0 in Figure 3.4. This graph is manually crafted for illustration.

In our implementation, we use the multi-level community detection graph-clustering algorithm [6] to perform the clustering: we empirically observed that this algorithm produces a high-quality clustering with little tuning (e.g., we do not need to specify the number of clusters) and good performance (it is faster than other candidates, e.g., k-means or spectral clustering).

Formally, this stage takes an input D-SAG G0(V, E, C0), where

- V is the set of the nodes in the D-SAG,
- E is the set of weighted edges in the D-SAG, and
- C0 is the set of clusters where each cluster c_i represents a data structure in the original code,

and produces an output graph G1(V, E, C0, C1), where

- C1 is a set of clusters produced by the graph-clustering algorithm, with each cluster c_j containing a set of fields that are frequently accessed together.

For example, this optimization transforms G0 in Figure 3.4 into G1 in Figure 3.5. The class Large is split into class Large.1 and class Large.2 because the fields large_b and large_d are cold and do not have affinity to the other fields of the same class (large_a, large_c, and large_e). Class Foo and class Bar are merged because of strong affinity, as indicated by the heavy edges.
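In source-code terms, the Stage 1 transformation of Figure 3.5 corresponds to definitions along the following lines. Only the field and class names come from the figure; the field types are invented for illustration.

    // Before Stage 1: hot and cold fields share one class.
    struct Large {
        int large_a;   // hot
        int large_b;   // cold
        int large_c;   // hot
        int large_d;   // cold
        int large_e;   // hot
    };

    // After Stage 1: the cold fields are split into their own class, so
    // they no longer dilute the cache lines holding the hot fields.
    struct Large1 { int large_a; int large_c; int large_e; };
    struct Large2 { int large_b; int large_d; };

    // Conversely, Foo and Bar, whose fields show strong cross-class
    // affinity, are merged into one class (field order is decided later,
    // in Stage 3).
    struct FooBar {
        int foo_head, foo_mid, foo_tail;
        int bar_a, bar_b, bar_c;
    };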
3.4.2 Stage 2: Field Inlining

Figure 3.6: G2, the result of Stage 2, transformed from G1 in Figure 3.5. This graph is manually crafted for illustration.

The field inlining optimization merges substructure fields referenced via a pointer into the enclosing structure if they are frequently accessed together with other enclosing-structure fields. (We saw an example of this in the canneal code snippet in Figure 3.1, where the netlist_elem class contained a pointer to a location_t subclass with fields x and y.) Field inlining improves cache hit rates because the inlined subfields are more likely to be in the same cache line than if they were in a separate structure (spatial locality). This optimization also has a secondary effect of improving instruction-level parallelism, as the subfield accesses are no longer dependent on the data read from the enclosing structure and can be issued to the cache/MSHRs even if the enclosing-structure access is a cache miss.

To effect field inlining, we start with the output of Stage 1, that is, the graph G1(V, E, C0, C1), where C0 clusters fields according to the original class hierarchy and C1 clusters fields according to access affinity. The C1 clustering already brings together fields from the enclosing structure and the relevant fields from the substructure; all that remains, then, is to remove the substructure pointer if all fields were inlined.

For all v_j in V and all c_i in C1, v_j is considered for removal if

- v_j belongs to c_i,
- v_j is of pointer type, and
- all fields of the original class pointed to by v_j were merged into cluster c_i at Stage 1.

The inlining stage produces a graph G2(V, E, C0, C2), where

- C2 is the clustering C1 minus the pointer fields to classes with all fields inlined.

For example, the algorithm at this stage transforms G1 in Figure 3.5 into G2 in Figure 3.6: foo_bar_p is removed because it is of pointer type Bar* and the class Bar was fully merged into class Foo in Stage 1.
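In code, the Stage 2 transformation corresponds to a before/after pair like the following (a hypothetical reconstruction in the spirit of Figure 3.3; field types are invented):

    // Before Stage 2: Foo reaches Bar's fields through a pointer, adding
    // a dependent load and scattering co-accessed data across allocations.
    struct Bar { int bar_a, bar_b, bar_c; };
    struct FooBefore {
        int  foo_head;
        Bar* foo_bar_p;   // every use costs an extra indirection
        int  foo_tail;
    };

    // After Stage 2: Bar's fields are inlined and foo_bar_p is removed,
    // since every field of Bar was merged into Foo's cluster in Stage 1.
    struct FooAfter {
        int foo_head;
        int bar_a, bar_b, bar_c;  // formerly reached via *foo_bar_p
        int foo_tail;
    };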
3.4.3 Stage 3: Field Reordering

Although Stage 1 ensures that a group of fields with high affinity is grouped together in the same data structure, it does not necessarily result in good spatial locality, as the fields with the highest affinity may not be placed next to each other. This matters especially if allocation is done without regard to cache line boundaries (as is the case with the RocksDB data structure in Figure 1.2), as the structure might begin anywhere in the cache line and two fields may end up in different lines even if they are within, say, 64 bytes of each other.

To decrease the impact of such splits on the cache miss rate, we organize fields within each data structure to ensure that fields with the highest access affinities are immediately next to each other in the memory layout. This ensures that even if structures are split across cache lines at random points, fields that are accessed together most frequently are likely to end up in the same cache line.

Figure 3.7: G3, the result of Stage 3, transformed from G2 in Figure 3.6. This graph can be generated by our toolchain.

Stage 3 begins with the output from Stage 2: a graph G2(V, E, C0, C2) where C2 clusters fields by affinity and inlines high-affinity substructures. The objective is to produce a graph G3(V, E, C0, C3) where

- for all c_i in C3, the fields in c_i are ordered to place fields with high affinity close together.

This goal is similar to the problem known as weighted minimum linear arrangement (MinLA): given a weighted graph G(V, E) with |V| = n, find a one-to-one function \varphi: V -> {1, ..., n} that minimizes \sum_{i,j \in V} |\varphi(i) - \varphi(j)| \cdot e_{ij}. MinLA is known to be NP-hard [13] and does not scale to the large number of fields in modern applications.

To achieve the goal efficiently, we developed an order-preserving variant of hierarchical clustering. We repeatedly group pairs of nodes (fields) inside each cluster in descending order of edge weights (affinities) until all nodes in each cluster are grouped into one. Every time a pair of nodes is grouped, the pair is merged and treated as one node, with the edge weights from the merged nodes to any outside node combined via addition. When merging two nodes, we preserve their relative order in the original data structure; the resulting merged node inherits the order of the component node that appears earlier in the structure. If the pair of nodes comes from different data structures in the original code, then the one with more accesses is ordered first in the merged structure, based on the intuition that more frequently used fields will gain more benefit from being brought in with other data. Finally, the merged node's access frequency is the sum of the access frequencies of its components. (A code sketch of this procedure is given at the end of this section.)

Figure 3.7 shows an example of field reordering. In cluster Foo+Bar, the nodes foo_head and foo_tail are considered first because they are connected by the heaviest edge. They are merged into one node, with foo_head before foo_tail according to the original field order. Next, bar_a is placed after foo_tail because foo_tail-bar_a is the second-heaviest edge and the access frequency of bar_a is less than that of the previously merged node foo_head+foo_tail. The process continues for the remaining nodes, with bar_b grouped after bar_a, and bar_c after bar_b. Finally, foo_mid is grouped after bar_c: the merged node inherits the original position of its leading field foo_head, which precedes foo_mid in the original structure, so foo_mid is placed after it.

Figure 3.8 shows the final output generated by our toolchain (Chapter 4) after running the three-stage analysis pipeline on the code from Figure 3.3.

The optimizations are generated in different stages, but they do not conflict with each other, because the later stages only refine the output of the earlier stages (e.g., Stages 2 and 3 cannot re-form clusters).

Figure 3.8: Data structure definition suggestions: the final output after Stage 3. 'Size' is the size of the field or the class, MR is the LLC miss ratio generated by this field, and MRP is the normalized miss ratio percentage, i.e., the fraction contributed by this field to the overall miss ratio.
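The order-preserving merge described in Section 3.4.3 can be sketched as follows. This is our own illustrative code, not the tool's implementation: each node carries its field list, its original structure and position, and its access frequency, and the heaviest-edge merge loop implements the ordering rules above.

    #include <algorithm>
    #include <cstdint>
    #include <map>
    #include <string>
    #include <utility>
    #include <vector>

    struct Node {
        std::vector<std::string> fields;  // field names, in current layout order
        int structId;   // original structure; -1 after a cross-struct merge
        int orderKey;   // original position of the leading component
        uint64_t freq;  // combined access frequency
    };

    // Greedy, order-preserving hierarchical merge of one cluster.
    // w[{i, j}] holds the affinity between nodes i and j.
    std::vector<std::string> reorderCluster(
            std::vector<Node> nodes,
            std::map<std::pair<int, int>, uint64_t> w) {
        auto key = [](int a, int b) {
            return std::make_pair(std::min(a, b), std::max(a, b));
        };
        std::vector<bool> alive(nodes.size(), true);

        for (;;) {
            // Pick the heaviest remaining edge between two live nodes.
            int bu = -1, bv = -1;
            uint64_t best = 0;
            for (const auto& [e, wt] : w)
                if (alive[e.first] && alive[e.second] && wt > best) {
                    best = wt; bu = e.first; bv = e.second;
                }
            if (bu < 0) break;  // nothing left to merge

            Node& u = nodes[bu];
            Node& v = nodes[bv];
            // Same structure: keep the original field order.  Different
            // structures: the more frequently accessed node goes first.
            bool sameStruct = (u.structId == v.structId) && u.structId != -1;
            bool uFirst = sameStruct ? (u.orderKey < v.orderKey)
                                     : (u.freq >= v.freq);
            const std::vector<std::string>& a = uFirst ? u.fields : v.fields;
            const std::vector<std::string>& b = uFirst ? v.fields : u.fields;
            std::vector<std::string> merged(a);
            merged.insert(merged.end(), b.begin(), b.end());
            u.fields = std::move(merged);
            u.orderKey = std::min(u.orderKey, v.orderKey);  // earlier position wins
            u.structId = sameStruct ? u.structId : -1;
            u.freq += v.freq;
            alive[bv] = false;

            // Fold v's edges into u's by addition.
            for (int x = 0; x < (int)nodes.size(); ++x) {
                if (!alive[x] || x == bu) continue;
                auto ev = w.find(key(bv, x));
                if (ev != w.end()) { w[key(bu, x)] += ev->second; w.erase(ev); }
            }
            w.erase(key(bu, bv));
        }

        // Concatenate any surviving groups in their original order.
        std::vector<int> rest;
        for (int i = 0; i < (int)nodes.size(); ++i)
            if (alive[i]) rest.push_back(i);
        std::sort(rest.begin(), rest.end(), [&](int a, int b) {
            return nodes[a].orderKey < nodes[b].orderKey;
        });
        std::vector<std::string> layout;
        for (int i : rest)
            layout.insert(layout.end(), nodes[i].fields.begin(),
                          nodes[i].fields.end());
        return layout;
    }

Seeding the nodes with the Foo and Bar fields, their original positions, and their access frequencies from Figure 3.7 reproduces the order foo_head, foo_tail, bar_a, bar_b, bar_c, foo_mid derived above.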
Chapter 4
Implementation

In this chapter we describe the toolchain that embodies the abstraction and algorithms we propose in this thesis. The toolchain takes an original program as input and suggests DSS optimizations as output, as in the examples in Figures 1.3 and 3.2.

The entire workflow is shown in Figure 4.1. Phase 1 involves profiling the program with Linux perf [2] (a) to determine whether it is memory-bound and (b) to identify the functions responsible for most of the cache misses. Phase 2 extracts data structure definitions from the binary and instruments the binary to produce the memory access trace. After running the instrumented binary and recording the memory trace, Phase 3 builds the D-SAG, analyzes it, and produces data structure splicing suggestions. Phase 4 simulates the proposed suggestions by reorganizing the original memory trace as if the data structures were modified according to the suggestions of Phase 3. The rearranged trace is fed through the Dinero cache simulator [10] to see if the proposed changes would bear fruit.

Next, we describe each phase in more detail.

Figure 4.1: The pipeline of the proposed toolchain. The blue rectangles are the components of the pipeline; the green trapezoids are the outputs of the components they are connected to.

4.1 Phase 1: Identifying memory-bound programs and functions

We profile the program with Linux perf [2] to determine if it is memory-bound and to identify the functions responsible for many cache misses. Knowing the culprits allows us to instrument only those functions and not the entire program; as a result, the memory trace is shorter and the analysis phase completes faster.

We measure the L1, L2, and LLC misses and filter out programs whose L1 miss ratio is below 3% and whose LLC miss ratio is below 1%. Functions with an L1 miss ratio above 0.5% or an LLC miss ratio above 0.2% are chosen for instrumentation.

4.2 Phase 2: Memory trace collection and static analysis

We obtain the data structure definitions contained in the DWARF [9] debug information in the binary. The program needs to be compiled with '-g' to make that information available. The data structure definitions are used in Phase 3, during D-SAG construction.

We compile the program with LLVM clang; during compilation, we also run the DINAMITE [20] instrumentation pass. DINAMITE instruments memory accesses in the functions selected during Phase 1, as well as the memory allocation routines. Combined with the data structure definitions, this information allows us to determine when an object is allocated, what fields it contains, and when those fields are accessed. (We infer the type of an object from the pointer-casting instruction that follows its allocation.)

4.3 Phase 3: D-SAG and Analysis

To construct the D-SAG, we parse the memory access trace. We set up a shadow heap to detect newly allocated objects, identify accesses to their fields, and detect affinity events (Section 3.2).

For each entry in the memory access trace:

- if it is an allocation entry, we allocate the corresponding object(s) on the shadow heap and record the type;
- if it is a memory access to a dynamically allocated object, we find the corresponding object in the shadow heap and determine the field that was accessed by mapping the field's offset to the DWARF data structure definition.

If the access is not to a dynamically allocated object, we simply record it into the stash of recently accessed addresses, which we use to detect affinity events.

If the access is to a field in a dynamically allocated object and this is the first time we encounter an access to that field, we create a new node in the D-SAG. We then examine the stash of recently accessed memory addresses to detect affinity events (fields accessed within the stack distance threshold of each other; see Section 3.2). If an affinity event between two fields is detected, we either create an edge with a weight of one between the corresponding nodes, or increment the weight if the edge already exists.

We tried different stack distance thresholds for the applications we evaluated and found the value of ten to work best. It is small enough to not create too many edges and large enough to detect affinity between fields accessed contemporaneously. Tuning the threshold to the individual program and hardware was beyond the scope of this work.

We treat primitive-type allocations (e.g., int, double, etc.) as single-field classes. For example, if an array of int is allocated at line 7 of a source file A, and another array of int is allocated at line 8 of the same source file A, we create two new classes: class-A-l7 and class-A-l8. If the members of these arrays have high affinity to each other, D-SAG will suggest merging the single-element "classes", which essentially merges the two arrays so that their members are interleaved.
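The shadow-heap lookup at the core of this phase can be sketched as follows. This is our own condensed code: TypeInfo stands in for the DWARF field tables, and the Alloc records stand in for DINAMITE's allocation entries.

    #include <cstdint>
    #include <iterator>
    #include <map>
    #include <string>
    #include <utility>
    #include <vector>

    // Field layout recovered from DWARF: (offset within the object, name),
    // sorted by offset.
    struct TypeInfo {
        std::string name;
        std::vector<std::pair<uint64_t, std::string>> fields;
    };

    // One live allocation: possibly an array of size / elemSize objects.
    struct Alloc { uint64_t base, size, elemSize; const TypeInfo* type; };

    class ShadowHeap {
    public:
        void onAlloc(uint64_t base, uint64_t size, uint64_t elemSize,
                     const TypeInfo* type) {
            live_[base] = Alloc{base, size, elemSize, type};
        }
        void onFree(uint64_t base) { live_.erase(base); }

        // Map a raw access address to "Type::field"; returns "" if the
        // address does not fall inside any tracked object.
        std::string resolve(uint64_t addr) const {
            auto it = live_.upper_bound(addr);  // first allocation past addr
            if (it == live_.begin()) return "";
            const Alloc& a = std::prev(it)->second;
            if (addr >= a.base + a.size) return "";
            uint64_t off = (addr - a.base) % a.elemSize;  // offset in one object
            // Find the field with the largest offset <= off.
            const std::string* field = nullptr;
            for (const auto& [fo, fn] : a.type->fields) {
                if (fo <= off) field = &fn;
                else break;
            }
            return field ? a.type->name + "::" + *field : "";
        }

    private:
        std::map<uint64_t, Alloc> live_;  // keyed by allocation base address
    };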
Upon processing the memory trace, we have the corresponding D-SAG, which we then analyze using the three-stage process described in Section 3.4. The output of this phase is a set of text files with DSS recommendations, like the ones shown in Figures 1.3 and 3.2.

4.4 Phase 4: Simulating proposed changes

To gauge the potential impact of the DSS optimizations recommended by Phase 3, we rearrange the original memory trace to reflect the new locations of fields after the recommended code changes. We feed the new trace through the DineroIV cache simulator [10], which we modified to also measure cache line utilization. As we show in Chapter 5, the miss rates produced by the simulator track those measured on real hardware when we apply the changes to the original programs. This means that we can use the output of the simulator to decide whether the estimated reduction in the miss rates and the improvement in cache line utilization justify investing the effort into changing the actual code.
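The trace rewriting in this phase amounts to an address remap: each access that the shadow heap attributes to an (object, field) pair is reassigned the offset the field would have in the proposed layout. A minimal sketch with our own names (the real rewriter also relocates fields that move to a different class, which this sketch omits):

    #include <cstdint>
    #include <unordered_map>

    // Maps a field's old offset within its type to its offset in the
    // proposed layout (as produced by Phase 3 for that type).
    using OffsetMap = std::unordered_map<uint32_t, uint32_t>;

    // Rewrite one trace address. `base` is the start of the enclosing
    // object (known from the shadow heap); `remap` is the layout change
    // for its type.
    uint64_t rewriteAddress(uint64_t addr, uint64_t base,
                            const OffsetMap& remap) {
        uint32_t oldOff = static_cast<uint32_t>(addr - base);
        auto it = remap.find(oldOff);
        // Fields not mentioned in a suggestion keep their old offset.
        return it == remap.end() ? addr : base + it->second;
    }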
Chapter 5
Evaluation

5.1 Benchmarks

We evaluated our toolchain on C/C++ benchmarks from the SPEC CPU 2017 [4] and PARSEC [5] suites, as well as a modified version of the read random workload from RocksDB's db_bench [3] that stresses its LRU table. (We isolate the function that accesses the LRU table, which accounts for the highest fraction of CPU utilization and LLC misses, 15% and 20% respectively; we optimize and measure only that portion of the benchmark.)

From the SPEC and PARSEC applications, we excluded: four as not compatible with LLVM-clang 3.5 (required by DINAMITE); three where heavy use of template types limited the static analysis the current version of our toolchain could handle; and one application where a custom memory allocator prevented our tools from tracking the allocated objects and collecting the traces. This is a limitation of our implementation rather than of our techniques; a more sophisticated version of our tools could address these challenges. For the time being, however, we excluded these applications from our analysis.

This left us with the 21 applications listed in Table 5.1. All of them were analyzed with our tools using the workflow shown in Figure 4.1. Eleven of these benchmarks had no opportunities for optimization: they either already had near-perfect cache line utilization (Group 2), had a low cache miss rate (Group 3), or used a single array of primitive types or simple classes instead of complex classes or data structures (Group 4). The remaining applications, in Group 1, went through the entire analysis pipeline; our tools suggested optimizations for all of them except 520.omnetpp_r, 525.x264_r, and 557.xz_r.

Table 5.1: Benchmarks Categories

Optimizable:
1. Sophisticated classes and memory-bound: rocksdb, canneal, streamcluster, fluidanimate, ferret, 505.mcf_r, 511.povray_r, 520.omnetpp_r, 557.xz_r, 525.x264_r

No optimization opportunities:
2. Cache line usage near 100%: dedup, bodytrack, blackscholes, freqmine, swaptions, 508.namd_r
3. Not memory-bound: 544.nab_r, 541.leela_r, 500.perlbench_r
4. Simple class definitions: 519.lbm_r, 531.deepsjeng_r

5.2 Methodology

We evaluated the benchmarks on a machine with an Intel Core i5-7600K four-core CPU, with 32KB L1 instruction and 32KB L1 data caches per core, a 256KB L2 cache per core, and a unified 6MB LLC. The cache line size is 64 bytes.

We ran our tool pipeline on the benchmarks in Group 1 and manually applied the data structure changes suggested by our tools. If the tool suggested a large number of changes, we picked only the ones that affect hot data structures (data structures that account for at least 2% of LLC misses). In two cases, 505.mcf_r and 511.povray_r, the reorganization of data structures reverberated across the code with an overwhelming number of required changes (over 100 code locations in each benchmark). To reduce the manual effort, we isolated from these benchmarks only the functions that dominated the runtime (>90% of CPU time).

Figure 5.1: Runtime of the optimized version normalized to the original.

5.3 Performance

We report the runtime and the cache miss rates (measured using perf) before and after optimizations. We also report the cache line utilization and the cache misses measured via simulation (as part of the pipeline in Figure 4.1).

Figure 5.1 shows the runtime normalized to the original, and Figure 5.2 shows the cache misses. The runtime improved for seven out of ten benchmarks, by up to 30% and by 11% on average (geometric mean). The three benchmarks that did not improve are the ones for which our tools did not produce optimization suggestions; their data structures were already organized according to access affinity.

For those benchmarks that did improve, the runtime improvement can be explained by the reduction in cache misses. For fluidanimate the number of L1 misses slightly increased, but the reduction in L2 and LLC misses provided a hefty compensation. Since the cost of an L1 miss (that hits in the L2 cache) is on the order of ten CPU cycles, while the cost of an L2 miss (that hits in the LLC) is on the order of 40 [1], this trade-off has a net positive effect on performance.

Figure 5.2: Cache misses of the optimized version normalized to the original.

Figure 5.3: Relevant RocksDB code before and after optimizations.

The number of executed instructions (not shown) remained unchanged in all applications except RocksDB, where it increased by 4% when we applied the optimizations. That is because the logic of accessing the LRU table changed (Figure 5.3 shows the original and the modified code). This price was worth paying, since the resulting improvements in spatial locality reduced the runtime by 20%.

Figure 5.4: Simulated cache misses measured on rearranged traces, normalized to those measured on the original trace.

To get a further glimpse into the root cause of the observed performance improvements, we used an enhanced version of the DineroIV [10] cache simulator. We rearranged the memory addresses in the trace according to the suggestions produced by our tools, isolating the effects of the improved memory layout from any secondary effects that the code changes might produce. Figure 5.4 shows the simulated miss rates obtained on the rearranged trace. The simulated miss rates improve very similarly to those measured on real hardware, suggesting that the effects we observed can be explained by the better memory layout.

Figure 5.5 reveals the driving force behind these improvements.
The new memory layout increased the utilization of cache lines for all benchmarks that experienced improved performance.

Data structure splicing is particularly effective for applications that allocate a large number of class objects consecutively (e.g., in a large array). This is the case for canneal, streamcluster, and 505.mcf_r, where our optimizations make the memory layout more compact. To further illustrate this effect, we study how our optimizations improved the performance of canneal.

Figure 5.5: Cache line utilization for the original trace and the trace rearranged according to the optimization suggestions.

5.4 Case Study - canneal

Figure 5.6 shows the original data access pattern in canneal. It repeatedly and randomly

- picks an element from a large array of class A, and
- reads the index and pointer fields in the class A object to access a class B object and two class C objects.

Essentially, there are four random memory accesses in each iteration: one to a class A object, one to a class B object, and one each to two class C objects. The combined footprint of all the objects of classes A, B, and C is much larger than the LLC, which means that the four random accesses result in four cache misses most of the time. The cache line utilizations of objects of classes A, B, and C are 56%, 12.5%, and 25% respectively, which gives an average cache line utilization of 30%.

Since the class B object is always accessed through the index field of a class A object, our toolchain suggests moving the fields of class B into class A, as shown in Figure 3.2. Figure 5.7 shows the access pattern after applying the optimizations. The new access pattern gives an average cache line utilization of 36%.

Because of the merge between classes A and B, the number of random memory accesses in each iteration drops from four to three, making it possible in theory to reduce the number of cache misses by 25%; indeed, the actual reduction in the cache miss rate and the runtime is very close to 25%.

Figure 5.6: Access pattern of canneal before optimization.

Figure 5.7: Access pattern of canneal after optimizations.

Chapter 6
Discussion and Future Work

6.1 Automating recommendations

Our tools recommend optimizations to a developer as opposed to applying them to the source code automatically. This has both advantages and limitations. The advantage is that the bar for tool adoption is low: a developer does not have to commit to an experimental compiler, and the resulting code readability (and ease of debugging) will not suffer, because the changes are applied by the developer. The downside is that sometimes the amount of manual changes may be overwhelming (as was the case for two applications in our study). On the other hand, automatic source code transformations are not always easy to apply, because some optimizations (e.g., splitting inherited classes in C++) are hard to apply without a human decision. Understanding how to best combine the simplicity of hand-applied changes and the efficiency of automatic transformations is a possible avenue for future work.

When doing automatic transformations, we need to consider memory safety. For instance, when migrating a field from one class to another, which essentially increases the size of the receiving class, we want to make sure that the behavior is still correct and does not raise issues like buffer overflow. Additionally, field migration or reordering may cause problems because, in languages like C/C++, an application may access a field via its offset, using pointer arithmetic.
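For example, code of the following kind behaves correctly only as long as the hand-computed offset matches the layout; a tool that reorders or migrates fields would silently break it (illustrative snippet):

    #include <cstdint>

    struct Record { int32_t id; int32_t payload; };

    int32_t read_payload_by_offset(Record* r) {
        // Assumes payload sits immediately after id; field reordering or
        // migration invalidates this hand-computed offset.
        char* raw = reinterpret_cast<char*>(r);
        return *reinterpret_cast<int32_t*>(raw + sizeof(int32_t));
    }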
Analyzing the access pattern for safety using the memory access trace is not sufficient, because the access pattern may change if the input changes. Since C/C++ are not memory-safe, we do not believe that the recommendations produced by our tools can be applied automatically in the general case. We believe that a semi-automated tool, like the one we propose, is the more practical approach.

6.2 Algorithm Validity and Correctness

It is known that the problem of optimizing the cache hit rate (and hence the problem of data structure layout) is NP-hard [13, 16]. Our algorithm uses a greedy, heuristic approach. While we cannot prove or claim optimality, the advantage is the simplicity and speed of the algorithm (D-SAG construction and analysis took between 30 minutes and an hour on our benchmarks).

In the current implementation, we run two kinds of graph analysis on the same trace at once: graph clustering and inter-cluster analysis. We would like to understand whether repeating all or some of these analyses several times, after the data structures have been reorganized according to the earlier output of the algorithm, would yield better results.

6.3 Threats to Effectiveness

We use a pointer-type inference method to decide the type of an allocated object. Though this works for most C and C++ programs, generic pointer casting can still invalidate this mechanism: e.g., an allocated object that is referred to by a pointer that is later cast to another type might be mislabeled by our approach. A more sophisticated type inference method (e.g., combining static analysis with dynamic information) could mitigate this issue.

Merging classes with significantly different numbers of allocated objects sometimes might not make sense: e.g., the D-SAG might detect high affinity between a field of a singleton class and a field of a class that is a member of a large array. We can address this issue by adding a constraint in Stage 1 whereby edges between fields of different classes are removed if the classes differ greatly in the number of allocated objects.

Chapter 7
Conclusion

In this thesis, we present a novel method to produce diverse types of data structure splicing optimizations under one single abstraction. We introduce the D-SAG (Data Structure Access Graph), which reveals the access pattern and access affinity among the fields of data structures. Based on the D-SAG, we present a multi-stage algorithm that analyzes the graph and produces a comprehensive set of data structure splicing optimization suggestions. We evaluated the preliminary version of our D-SAG-based toolchain on a diverse set of workloads. On applications that are amenable to our optimizations (i.e., that do not already have full cache line utilization), our technique reduces cache misses by an average of 28% (geomean) and improves performance by an average of 11% (geomean).

Bibliography

[1] Performance analysis guide for Intel Core i7 processor and Intel Xeon 5500 processors, 2019. https://software.intel.com/sites/products/collateral/hpc/vtune/performance_analysis_guide.pdf

[2] perf: Linux profiling with performance counters, 2019. https://perf.wiki.kernel.org/

[3] RocksDB: a persistent key-value store, 2019. https://rocksdb.org/

[4] Standard Performance Evaluation Corporation, 2019. https://www.spec.org/

[5] C. Bienia. Benchmarking Modern Multiprocessors. PhD thesis, Princeton University, January 2011.
[6] V. Blondel, J.-L. Guillaume, R. Lambiotte, and E. Lefebvre. Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment, 2008. doi:10.1088/1742-5468/2008/10/P10008

[7] T. M. Chilimbi, B. Davidson, and J. R. Larus. Cache-conscious structure definition. In Proceedings of the ACM SIGPLAN 1999 Conference on Programming Language Design and Implementation (PLDI '99), pages 13-24, 1999. doi:10.1145/301618.301635

[8] J. Dolby and A. Chien. An automatic object inlining optimization and its evaluation. In Proceedings of the ACM SIGPLAN 2000 Conference on Programming Language Design and Implementation (PLDI '00), pages 345-357, 2000. doi:10.1145/349299.349344

[9] M. J. Eager. Introduction to the DWARF debugging format, 2012. http://www.dwarfstd.org/doc/Debugging%20using%20DWARF-2012.pdf

[10] J. Edler and M. D. Hill. Dinero IV: Trace-driven uniprocessor cache simulator, 1998.

[11] T. Eimouri, K. B. Kent, A. Micic, and K. Taylor. Using field access frequency to optimize layout of objects in the JVM. In Proceedings of the 31st Annual ACM Symposium on Applied Computing (SAC '16), pages 1815-1818, 2016. doi:10.1145/2851613.2851942

[12] M. Ferdman, A. Adileh, O. Kocberber, S. Volos, M. Alisafaee, D. Jevdjic, C. Kaynak, A. D. Popescu, A. Ailamaki, and B. Falsafi. Clearing the clouds: A study of emerging scale-out workloads on modern hardware. In Proceedings of the Seventeenth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS XVII), pages 37-48, 2012. doi:10.1145/2150976.2150982

[13] M. R. Garey and D. S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman & Co., 1979.

[14] R. Hundt, S. Mannarswamy, and D. Chakrabarti. Practical structure layout optimization and advice. In Proceedings of the International Symposium on Code Generation and Optimization (CGO '06), pages 233-244, 2006. doi:10.1109/CGO.2006.29

[15] S. Kumar, H. Zhao, A. Shriraman, E. Matthews, S. Dwarkadas, and L. Shannon. Amoeba-cache: Adaptive blocks for eliminating waste in the memory hierarchy. In 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture, pages 376-388, 2012. doi:10.1109/MICRO.2012.42

[16] R. Lavaee. The hardness of data packing. In Proceedings of the 43rd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (POPL '16), pages 232-242, 2016. doi:10.1145/2837614.2837669
[17] J. Lin and P.-C. Yew. A compiler framework for general memory layout optimizations targeting structures. In Proceedings of the 2010 Workshop on Interaction Between Compilers and Computer Architecture (INTERACT-14), pages 5:1-5:8, 2010. doi:10.1145/1739025.1739033

[18] R. L. Mattson, J. Gecsei, D. R. Slutz, and I. L. Traiger. Evaluation techniques for storage hierarchies. IBM Systems Journal, 9(2):78-117, June 1970. doi:10.1147/sj.92.0078

[19] S. Miucin and A. Fedorova. Data-driven spatial locality. In MEMSYS 2018. ACM, 2018.

[20] S. Miucin, C. Brady, and A. Fedorova. End-to-end memory behavior profiling with DINAMITE. In Proceedings of the 2016 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering (FSE 2016), pages 1042-1046, 2016. doi:10.1145/2950290.2983941

[21] M. E. J. Newman. Modularity and community structure in networks. Proceedings of the National Academy of Sciences, 103(23):8577-8582, 2006. doi:10.1073/pnas.0601602103

[22] M. K. Qureshi, M. A. Suleman, and Y. N. Patt. Line distillation: Increasing cache capacity by filtering unused words in cache lines. In 2007 IEEE 13th International Symposium on High Performance Computer Architecture (HPCA), 2007.

Appendix A
Hardware

All experiments in Chapter 5 were performed on the following machine:

- Intel Core i5-7600K CPU @ 3.80GHz, 4 cores, 4 x 32KB L1d cache, 4 x 256KB L2 cache, 6MB L3 cache.