UBC Theses and Dissertations

UBC Theses Logo

UBC Theses and Dissertations

A model for thread and memory placement on NUMA systems Funston, Justin 2018

Your browser doesn't seem to have a PDF viewer, please download the PDF to view this item.

Item Metadata


24-ubc_2018_february_funston_justin.pdf [ 1.8MB ]
JSON: 24-1.0363032.json
JSON-LD: 24-1.0363032-ld.json
RDF/XML (Pretty): 24-1.0363032-rdf.xml
RDF/JSON: 24-1.0363032-rdf.json
Turtle: 24-1.0363032-turtle.txt
N-Triples: 24-1.0363032-rdf-ntriples.txt
Original Record: 24-1.0363032-source.json
Full Text

Full Text

A Model for Thread and Memory Placement on NUMASystemsbyJustin FunstonB.Sc. Computer Science, Gonzaga University, 2010A THESIS SUBMITTED IN PARTIAL FULFILLMENTOF THE REQUIREMENTS FOR THE DEGREE OFDoctor of PhilosophyinTHE FACULTY OF GRADUATE AND POSTDOCTORALSTUDIES(Electrical and Computer Engineering)The University of British Columbia(Vancouver)January 2018c Justin Funston, 2018AbstractThe problem of placement of threads, or virtual cores, on physical cores in a mul-ticore system has been studied for over a decade. Despite this effort, we still donot know how to assign virtual to physical cores on a non-uniform memory access(NUMA) system so as to meet a performance target while minimizing resourceconsumption. Prior work has made large strides in this area, but these solutionseither addressed hardware with specific properties, leaving us unable to general-ize the models to other systems, or modeled much simpler effects than the actualperformance in different placements.An interdependent problem is how to place memory on NUMA systems. Poormemory placement causes congestion on interconnect links, contention for mem-ory controllers, and ultimately long memory access times and poor performance.Commonly used operating system techniques for NUMA memory placement failto achieve optimal performance in many cases.Our contribution is a general framework for reasoning about workload place-ment and memory placement on machines with shared resources. This frameworkenables us to automatically build an accurate performance model for any machinewith a hierarchy of known shared resources. Using our methodology, data centeroperators can minimize the number of NUMA (CPU+memory) nodes allocated foran application or a service, while ensuring that it meets performance objectives.More broadly, the methodology empowers them to efficiently “pack” virtual con-tainers on the physical hardware. We also present an effective solution for placingmemory that avoids congestion on interconnects due to memory traffic and addi-tionally selects the best page size that balances translation lookaside buffer (TLB)effects against more granular memory placement. The solutions proposed can sig-iinificantly improve performance and work at the operating system level so they donot require changes to applications.iiiLay SummaryModern server-class computer hardware is becoming increasingly complex as hard-ware designers scale core counts. This hardware runs important applications likescientific simulations, databases, and machine learning, and represents a significantportion of the worlds electricity usage. Software must be carefully designed andoptimized to get the most out of such hardware, otherwise performance can sufferand energy is wasted. This work presents insights and analysis into how soft-ware interacts with modern server-class hardware, and proposes techniques andalgorithms to automatically optimize software for it. The solutions proposed cansignificantly improve performance and work at the operating system level so theydo not require changes to applications.ivPrefaceThe research chapters of this dissertation (Chapters 2–5) span multiple relatedprojects done in collaboration, all of which are previously published or currentlyunder peer-review. Chapter 2 is the culmination of the projects and represents thebulk of my personal contributions.At the time of this writing, the work in Chapter 2 is under peer-review as: Justin Funston, Maxime Lorrillere, David Vengerov, Baptiste Lepers Jean-Pierre Lozi, Vivien Quema, and Alexandra Fedorova. A Practical Model forPlacement of Workloads on Multicore NUMA Systems. Submitted to the13th European Conference on Computer Systems.I was the lead investigator responsible for research direction, experiment de-sign, data analysis, and solution design. Maxime Lorrillere conducted the exper-iments in Section 2.5 and Section A.1, implemented the fast memory migrationmechanism in Section 2.5, and wrote Section 2.5 of the manuscript. I wrote themajority of the rest of the chapter’s manuscript and conducted all other experi-ments in the chapter. Other co-authors provided technical and editorial advice.Chapter 3 is a modified version of previously published work. It is c2012IEEE, reprinted with permission from: Justin R. Funston, Kaoutar El Maghraoui, Joefon Jann, Pratap Pattnaik, andAlexandra Fedorova. 2012. An SMT-Selection Metric to Improve Multi-threaded Applications Performance. In Proceedings of the 26th InternationalParallel & Distributed Processing Symposium (IPDPS 12). IEEE, Washing-ton, DC, USA, 1388–1399. https://doi.org/10.1109/IPDPS.2012.125vI was the lead investigator responsible for research direction, experiment de-sign, data analysis, and solution design, and I wrote the majority of the manuscript.Co-authors provided technical and editorial advice.Chapter 4 is a modified version of previously published work. It is c2013ACM, reprinted with permission from: Mohammad Dashti, Alexandra Fedorova, Justin Funston, Fabien Gaud, Re-naud Lachaize, Baptiste Lepers, Vivien Quema, and Mark Roth. 2013. Traf-fic Management: A Holistic Approach to Memory Placement on NUMASystems. In Proceedings of the 18th International Conference on Architec-tural Support for Programming Languages and Operating Systems (ASPLOS13). ACM, New York, NY, USA, 381–394.https://doi.org/10.1145/2451116.2451157Mohammad Dashti, Fabien Gaud, and I jointly conducted the initial investiga-tion into NUMA effects, including the discovery of the importance of congestionover locality (reported in Section 4.1 and Section 4.2). I designed and implementedthe page-level replication mechanism described in Section 4.3.3 with debugginghelp from Fabien Gaud. I wrote the sections of the manuscript relevant to my con-tributions. Fabien Gaud designed and implemented the Carrefour algorithm. Allco-authors including myself provided technical and editorial advice.Chapter 5 is a modified version of previously published work: Fabien Gaud, Baptiste Lepers, Jeremie Decouchant, Justin Funston, Alexan-dra Fedorova, and Vivien Quema. 2014. Large Pages May Be Harmful onNUMA Systems. In Proceedings of the 2014 USENIX Annual TechnicalConference (ATC 14). USENIX Association, Berkeley, CA, USA, 231–242.I conducted the initial investigation into NUMA and large pages, including thediscovery and analysis of the hot page and page-level false sharing problems (re-ported in Section 5.2 and Section 5.3.1), and wrote the sections of the manuscriptrelevant to my contributions. Fabien Gaud and Baptiste Lepers designed and im-plemented the Carrefour-LP algorithm. All co-authors including myself providedtechnical and editorial advice.viTable of ContentsAbstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iiLay Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ivPreface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vTable of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viiList of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiList of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiiGlossary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xvAcknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xvi1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 A Model for Placement of Workloads on Multicore NUMA Systems 62.1 Introduction & Motivation . . . . . . . . . . . . . . . . . . . . . 72.2 Background & Related Work . . . . . . . . . . . . . . . . . . . . 92.2.1 State of the Art . . . . . . . . . . . . . . . . . . . . . . . 92.2.2 Assumptions and Limitations . . . . . . . . . . . . . . . 122.3 Abstract Machine Model . . . . . . . . . . . . . . . . . . . . . . 152.4 Performance Predictions . . . . . . . . . . . . . . . . . . . . . . 20vii2.4.1 Predicting Performance Categories . . . . . . . . . . . . . 222.4.2 Predicting Performance with Machine Learning . . . . . . 252.4.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 272.5 Using the Model in Practice . . . . . . . . . . . . . . . . . . . . 322.5.1 A Potential Use Case . . . . . . . . . . . . . . . . . . . . 322.5.2 Memory Migration Overhead . . . . . . . . . . . . . . . 342.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 383 An SMT-Selection Metric . . . . . . . . . . . . . . . . . . . . . . . . 393.1 Background & Motivation . . . . . . . . . . . . . . . . . . . . . 403.2 The SMT-Selection Metric . . . . . . . . . . . . . . . . . . . . . 443.2.1 SMTsm on IBM’s POWER7 Processor . . . . . . . . . . 473.2.2 SMTsm on Intel’s Nehalem Processor . . . . . . . . . . . 483.3 Experimental Methodology . . . . . . . . . . . . . . . . . . . . . 503.3.1 System Configuration . . . . . . . . . . . . . . . . . . . 503.3.2 Benchmarks . . . . . . . . . . . . . . . . . . . . . . . . . 503.4 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 523.4.1 SMT-Selection Metric (SMTsm) Evaluation . . . . . . . . 523.4.2 SMTsm Evaluation at a Lower-SMT Level . . . . . . . . 563.4.3 Metric Evaluation Across Chips . . . . . . . . . . . . . . 583.5 Applying the SMT-Selection Metric . . . . . . . . . . . . . . . . 603.5.1 Using Gini Impurity to Decide on a Good SMTsm Threshold 613.5.2 Using the Average PPI (Percentage Performance Improve-ment) Method to Decide on a Good SMTsm Threshold . . 623.6 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . 643.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 674 NUMA Traffic Management through Memory Placement . . . . . . 684.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 694.2 Traffic Congestion on Modern NUMA Systems . . . . . . . . . . 704.3 Design and Implementation . . . . . . . . . . . . . . . . . . . . . 744.3.1 The Mechanisms . . . . . . . . . . . . . . . . . . . . . . 744.3.2 The Algorithm . . . . . . . . . . . . . . . . . . . . . . . 76viii4.3.3 Implementation . . . . . . . . . . . . . . . . . . . . . . . 794.4 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 834.4.1 Testbed . . . . . . . . . . . . . . . . . . . . . . . . . . . 844.4.2 Single-Application Workloads . . . . . . . . . . . . . . . 854.4.3 Multi-Application Workloads . . . . . . . . . . . . . . . 904.4.4 Overhead . . . . . . . . . . . . . . . . . . . . . . . . . . 944.4.5 Impact on Energy Consumption . . . . . . . . . . . . . . 954.4.6 Discussion: Hardware Support . . . . . . . . . . . . . . . 954.5 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . 974.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1015 Large Pages on NUMA Systems . . . . . . . . . . . . . . . . . . . . 1025.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1025.2 Large Pages and Adverse NUMA Effects . . . . . . . . . . . . . 1045.2.1 Experimental Platform . . . . . . . . . . . . . . . . . . . 1045.2.2 Large Pages on Linux . . . . . . . . . . . . . . . . . . . 1045.3 Solutions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1075.3.1 Page Balancing is Not Enough . . . . . . . . . . . . . . . 1095.3.2 Carrefour-LP . . . . . . . . . . . . . . . . . . . . . . . . 1105.4 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1155.4.1 Performance Evaluation . . . . . . . . . . . . . . . . . . 1155.4.2 Overhead Assessment . . . . . . . . . . . . . . . . . . . 1195.4.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . 1205.4.4 Very Large Pages . . . . . . . . . . . . . . . . . . . . . . 1215.5 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1215.5.1 Large Pages and TLB Performance . . . . . . . . . . . . 1215.5.2 Large Page Support and Optimization . . . . . . . . . . . 1225.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1236 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128ixA Supporting Materials . . . . . . . . . . . . . . . . . . . . . . . . . . 142A.1 Non-Interference of Workloads on Separate NUMA Nodes . . . . 142A.2 Performance Prediction Results for the ML Model . . . . . . . . . 143xList of TablesTable 1.1 Summary of related work and their capabilities . . . . . . . . . 3Table 2.1 Scheduling concerns for the AMD test system . . . . . . . . . 14Table 2.2 Fast memory migration evaluation . . . . . . . . . . . . . . . 34Table 4.1 NUMA traffic congestion effects . . . . . . . . . . . . . . . . 73Table 4.2 Statistics used by the Carrefour algorithm . . . . . . . . . . . . 76Table 4.3 Number of pages replicated, interleaved, and co-located . . . . 89Table 4.4 Carrefour performance improvement over Linux . . . . . . . . 90Table 4.5 Carrefour performance improvement for multiple applications . 91Table 5.1 NUMA and TLB metrics for selected benchmarks . . . . . . . 106Table 5.2 Detailed metrics for workloads using Carrefour-2M . . . . . . 111Table 5.3 CG.D, UA.B, and UA.C NUMA metrics . . . . . . . . . . . . 117Table 6.1 Scheduling concerns for an AMD Zen system . . . . . . . . . 125xiList of FiguresFigure 1.1 An example NUMA system . . . . . . . . . . . . . . . . . . 4Figure 2.1 WiredTiger performance for various workload placements . . 8Figure 2.2 Test systems used . . . . . . . . . . . . . . . . . . . . . . . . 15Figure 2.3 Three example performance clusters . . . . . . . . . . . . . . 23Figure 2.4 Prediction results for postgres-tpcc on the AMD test system. . 28Figure 2.5 Prediction results for spark-pr-lj on the Intel test system. . . . 28Figure 2.6 Prediction results for kmeans on the AMD test system. . . . . 28Figure 2.7 Prediction results for canneal on the Intel test system. . . . . . 29Figure 2.8 Prediction results for dc.B on the AMD test system. . . . . . . 29Figure 2.9 Prediction results for ft.C on the Intel test system. . . . . . . . 29Figure 2.10 Prediction results for freqmine on the Intel test system. . . . . 30Figure 2.11 Prediction results for kmeans on the Intel test system. . . . . . 30Figure 2.12 Prediction results for WTbtree on the Intel test system. . . . . 30Figure 2.13 Use-case evaluation, AMD system . . . . . . . . . . . . . . . 35Figure 2.14 Use-case evaluation, Intel system . . . . . . . . . . . . . . . 36Figure 3.1 Comparison of performance with SMT1 vs. SMT4 . . . . . . 41Figure 3.2 SMT speedup versus simple metrics . . . . . . . . . . . . . . 42Figure 3.3 A generic processor execution engine. . . . . . . . . . . . . . 45Figure 3.4 IBM POWER7 out-of-order execution engine. . . . . . . . . . 47Figure 3.5 Intel Nehalem out-of-order execution engine. . . . . . . . . . 49Figure 3.6 SMTsm @SMT4 vs. SMT4/SMT1 on POWER7 . . . . . . . 53Figure 3.7 Instruction mix of five benchmarks on POWER7 . . . . . . . 53xiiFigure 3.8 SMTsm @SMT4 vs. SMT4/SMT2 on POWER7 . . . . . . . 54Figure 3.9 SMTsm @SMT2 vs. SMT2/SMT1 on POWER7 . . . . . . . 54Figure 3.10 SMTsm @SMT2 vs. SMT2/SMT1 on Intel Nehalem . . . . . 56Figure 3.11 SMTsm @SMT1 vs. SMT4/SMT1 on POWER7 . . . . . . . 57Figure 3.12 SMTsm @SMT1 vs. SMT2/SMT1 on Intel Nehalem . . . . . 57Figure 3.13 SMTsm @SMT4 vs. SMT4/SMT1 on 2xPOWER7 . . . . . . 58Figure 3.14 SMTsm @SMT4 vs. SMT4/SMT2 on 2xPOWER7 . . . . . . 59Figure 3.15 SMTsm @SMT1 vs. SMT2/SMT1 on 2xPOWER7 . . . . . . 59Figure 3.16 Gini impurity for potential thresholds for the SMTsm . . . . . 62Figure 3.17 Average performance improvement vs. SMTsm . . . . . . . . 64Figure 4.1 Performance by thread and memory configuration . . . . . . . 71Figure 4.2 Streamcluster IC imbalance by memory configuration . . . . . 75Figure 4.3 Global decisions in Carrefour. . . . . . . . . . . . . . . . . . 77Figure 4.4 Carrefour evaluation, PARSEC/Metis . . . . . . . . . . . . . 86Figure 4.5 Carrefour evaluation, NAS . . . . . . . . . . . . . . . . . . . 87Figure 4.6 Load imbalance for selected benchmarks . . . . . . . . . . . 88Figure 4.7 Latency and locality for selected benchmarks . . . . . . . . . 89Figure 4.8 Carrefour evaluation, multi-application workloads . . . . . . 92Figure 4.9 Multi-application memory controller imbalance . . . . . . . . 93Figure 4.10 Multi-application interconnect imbalance . . . . . . . . . . . 93Figure 4.11 Multi-application memory latency . . . . . . . . . . . . . . . 93Figure 4.12 Multi-application local memory access ratio . . . . . . . . . . 94Figure 4.13 Effect of Carrefour on energy consumption . . . . . . . . . . 96Figure 5.1 THP performance improvement over Linux . . . . . . . . . . 107Figure 5.2 Carrefour-2M evaluation . . . . . . . . . . . . . . . . . . . . 108Figure 5.3 Carrefour-LP evaluation on selected applications . . . . . . . 116Figure 5.4 Evaluation of Carrefour-LP’s components . . . . . . . . . . . 117Figure 5.5 Carrefour-LP evaluation on remaining applications . . . . . . 119Figure A.1 Slowdown with cg.C as the interfering workload . . . . . . . 143Figure A.2 Slowdown with mg.C as the interfering workload . . . . . . . 143Figure A.3 Slowdown with streamcluster as the interfering workload . . . 144xiiiFigure A.4 Accuracy of predictions on the AMD system. . . . . . . . . . 147Figure A.5 Accuracy of predictions on the Intel system. . . . . . . . . . . 150xivGlossaryCMP Chip multiprocessingHPE Hardware performance event, a.k.a. hardware performance counterIBS Instruction-based samplingIC InterconnectIPC Instructions per cycleML Machine learningNUMA Non-uniform memory accessSMP Symmetric multiprocessingSMT Simultaneous multithreadingTHP Transparent huge pagesTLB Translation lookaside bufferxvAcknowledgmentsI give my warmest thanks to my advisor Dr. Fedorova for her peerless guidanceand for helping me achieve my potential as a researcher. I would also like to thankall of my collaborators and co-authors for their essential hard work and insight.I would like to thank my family for all of their love and support.Lastly I would like to thank IBM for the internship that led to the researchin Chapter 3, and Oracle Labs and the British Columbia Innovation Council forfunding the work of Chapter 4.xviChapter 1IntroductionData centers use 2% of the electricity consumed in the United States [85] and3% of the world’s [17] electricity. These data centers run important applicationssuch as scientific simulations, data analytics, web servers, and databases, and theapplications often have strict performance requirements. In order to keep powerconsumption under control hardware designers have introduced systems with con-tinually increasing core counts and power control features to reduce energy usagewhen parts of the system are idle. Using a smaller number of larger systems (i.e.systems with more cores) can be 25%–33% more power efficient than using twiceas many systems of half the size (based on our own experiments, vendor specifica-tions [3], and previous studies [13]).Increasing core counts, though, come at the cost of increased hardware com-plexity and architectural trade-offs made for scalability. Applications and systemsoftware must be designed and optimized with the hardware architecture in mind,otherwise performance can suffer. This dissertation focuses on how system soft-ware can optimize the performance of applications running on the large server-class systems typically used in data centers, and on how data center operators canmake reliable performance trade-offs for better server utilization or reduced powerconsumption. The first step in doing so is choosing the level of parallelism an ap-plication should use (in other words, how many cores it runs on). Due to scalabilitylimits or load patterns, it is not always beneficial to give all the cores of system toa single application. Additionally, in cloud environments it is common for cus-1tomers to pay for a set number of cores upfront, so the problem of determininghow many cores to use is moot in that case. A comprehensive solution on howto determine the level of parallelism for an application is beyond the scope of thisdissertation, but the primary contribution of Chapter 3 is a solution for it applicableto simultaneous multithreading (SMT) hardware.The second step after the number of cores has been determined is to choosewhich cores to use. The choice of cores to use, which we call a placement, affectswhat hardware resources are available to the application which in turn affects per-formance and power usage. Section 1.1 gives an overview of the relevant hardwareresources and how they affect applications. The key complicating factor for se-lecting a placement is that different applications have different needs for hardwareresources. The primary contribution of Chapter 2 is a practical model for work-load placement. Our model does not require modifying applications, can be easilyadapted to various architectures, and can be used to predict the performance of dif-ferent placements. Performance prediction, as opposed to simply finding the bestperforming placement, is crucial because it allows trade-offs to be made. Most ofthe extensive previous research in the area does not attempt to predict performance,only handles a single type of hardware resource (for example, only consideringbeneficial cache sharing but not taking into account SMT sharing of functionalunits or other hardware resource sharing), requires expert knowledge and manualtweaking to adapt to a new architecture (usually because one would need to writecarefully designed micro-benchmarks for new hardware resources), or is not de-ployable online because new applications would require several offline profilingruns. Table 1.1 provides an overview of significant related work and their respec-tive short-comings compared to our solution described in Chapter 2. A detaileddescription of related work is provided in Section 2.2.Predicting the performance of workload placements allows for one especiallyimportant trade-off. For many applications, using fewer hardware resources willonly have a minimal or moderate effect on performance. If data center operators orcloud server providers can predict these cases and the performance impact is withinacceptable tolerances, then they can use remaining hardware resources to pack ad-ditional workloads or applications onto the same server. This increases utilizationand efficiency and ultimately reduces energy usage. Section 2.5 evaluates specific2scenarios where our workload placement solution can increase server utilizationwhile maintaining performance goals.The final step after workload placement is memory placement. Large multicoresystems often have a non-uniform memory access (NUMA) architecture (describedin more detail in Section 1.1), which means that memory placement can also af-fect performance. Better memory placement will reduce contention for hardwareresources which in turn reduces memory access times. Chapters 4 and 5 describeour solution to the memory placement problem, which provides significant perfor-mance benefits over standard techniques.PredictsPerformanceMultipleHardwareResourcesEasilyAdaptedDeployableOnlineChapter 2’sSolution Y Y Y YPandia [49] Y Y N NSMiTe [105] Y Y N YBubble-Flux [101] Y Y N YAsymsched [59] N N Y YDINO [107] N N Y YThreadClustering [93] N N Y YTable 1.1: Summary of related work and their capabilities1.1 BackgroundModern server-class systems have a complex hierarchy of shared resources, whichis a necessity to scale them to high core counts. Figure 1.1 shows an exampleNUMA system. At the lowest level of the hierarchy is a hardware context (alsoknown as a hardware thread or, confusingly, as a core on some architectures). Thehardware context contains everything required to track the execution state of asoftware thread such as the program counter and registers. When a software threadis assigned to a hardware context it can execute on the CPU.At the next level of the hierarchy is the physical core. A physical core en-3Memory Node 1 Memory Node 2Memory Node 3 Memory Node 4L3 cacheC5L3 cacheC6C8C7L3 cacheC13L3 cacheC14C16C15Node 3Node 4C9 C10C12C11Node 1Node 2C1 C2C4C3Figure 1.1: A modern NUMA system, with four nodes and four cores pernode. c2013 ACM, reprinted with permission from [36].compasses data and instruction caches, functional units (the ALU, branch units,load/store units, etc.), and the instruction decode and dispatch units. Simultaneousmultithreading (SMT) is a feature that places multiple hardware contexts onto asingle physical core. The physical core’s resources are shared between hardwarecontexts, although on some architectures some of the resources may be duplicatedfor each context (for example, each hardware context might get its own L1 cachebut have to share the L2 cache with other contexts). In general, SMT can improvethe overall throughput of the system but each software thread has lower perfor-mance than if it is able to execute alone on the physical core because of the com-petition for core resources. If the application heavily shares data between threadsand benefits from very low latency communication between threads then SMT canactually improve individual threads’ performance (the kmeans benchmark on ourAMD test system, shown in Figure A.4, is one such instance).At the highest level of the hierarchy is the non-uniform memory access (NUMA)node. A NUMA node contains a set of cores and an associated set of locally con-nected memory. Programs can access memory on any NUMA node from any core4transparently without special consideration. The operating system decides whichNUMA node to allocate memory on. Memory accesses to the local memory havea lower latency. NUMA nodes are connected by interconnect (IC) links, so mem-ory accesses to remote memory travel over the IC links and have a higher latency.Cores on the same NUMA node typically share an L3 cache, the memory con-troller, and the IC links. So, if all the cores on a NUMA node are performingvery memory-intensive computations then the bottleneck will likely be the sharedmemory subsystem and performance will suffer.If software does not take into account all of these shared resources then con-tention for the resources results in poor performance, sometimes resulting in slow-downs of 2 or more! The following chapters analyze the performance effects indetail and propose practical and effective solutions.5Chapter 2A Model for Placement ofWorkloads on Multicore NUMASystems1Data center operators balance two objectives: providing a satisfactory experiencefor the customer and efficiently allocating the hardware that runs customer virtualinstances or operator services. We believe that hardware provisioning should notbe a guessing game. No one should have to allocate more hardware than neededjust out of the fear that providing less could violate a service-level objective.Unfortunately, mapping threads, or virtual CPUs (vCPUs), to physical cores onmulticore NUMA systems is still like playing a guessing game. Modern NUMAsystems consist of several nodes that share various resources. Cores within a nodemay share SMT pipelines and caches. Nodes themselves share an interconnect,which may be asymmetric, providing higher bandwidth for some links than forothers and for some directions of communication than for others (e.g., Figure 2.2).The challenge of placing workloads on a NUMA system has to do with com-plex interactions between the workload and the hardware. Our solution, presentedin this chapter, consists of two main contributions. First, is an abstract machinemodel (Section 2.3) that abstracts hardware resources and provides a practical way1This chapter is an expanded version of work currently under peer-review as [45]6to represent placements. Second, we build on the abstract machine model to builda performance prediction model (Section 2.4) that predicts the performance ofimportant placements.Attribution: Maxime Lorrillere conducted the experiments in Section 2.5 andSection A.1, implemented the fast memory migration mechanism in Section 2.5,and wrote Section 2.5 of the manuscript. I wrote the majority of the rest of thechapter’s manuscript, was responsible for all areas of research, and conducted allother experiments in the chapter.2.1 Introduction & MotivationConsider Figure 2.1a, which shows performance of MongoDB’s WiredTiger key-value store [4] on a NUMA Intel machine. This workload, which runs a BTreesearch using 24 threads, runs almost twice as fast when all of the threads are placedon a single node, as opposed to being spread across two or more nodes. On Fig-ure 2.1b, on the other hand, we see that the same workload configured with 16threads on an AMD NUMA system runs much faster when it has four nodes at itsdisposal, rather than two2. On the Intel system, this application prefers having allof its threads co-located on a single NUMA node and using SMT because it enjoyslower communication latencies. On the AMD machine, on the other hand, it suf-fers from contention for shared resources when all of its threads are crowded on asmall number of nodes. How can a data center operator know, save for trying allpossible placements, that to achieve the best possible performance this containershould be placed on a single node on an Intel system but on four nodes (withoutSMT) on an AMD system? This is an example of questions that we aim to answerin this work.The observations detailed in the previous paragraph are not new and were stud-ied in our community for more than a decade [21–23, 36, 41, 46, 48, 49, 56, 57,59, 69, 87, 101, 105, 107, 108]. Yet, as we elaborate in Section 2.2, we still donot have a method for deciding how a particular placement, a mapping of vCPUsto physical cores, on an arbitrary multicore system will affect performance of an2With eight cores/node we could not fit all threads on a single node without creating contentionfor CPU cycles.71 node 2 nodes 3 nodes 4 nodes05001000150020002500Operations/s (x1000)SMTno-SMT(a) WiredTiger, Intel2 nodes 4 nodes 8 nodes0200400600Operations/s (x1000)SMTno-SMT(b) WiredTiger, AMDFigure 2.1: Throughput of the WiredTiger key-value store on two NUMAsystems according to the number of nodes.unknown workload.We propose a general framework for reasoning about workload placementon machines with shared resources. Our methodology is not tied to a particularmachine, specific shared resources or certain hardware performance events. Weabstract a multicore machine as a collection of shared resources, called schedulingconcerns, where each concern produces a score indicating how many threads usethe resource in a given placement. Vectors of scores uniquely identify distinctplacements: placements that differ with respect to resource usage. Our abstractionrelies on important simplifications (see Section 2.2.2) that enable us to dramaticallyreduce the number of distinct placements from billions of all possible to dozens,and makes the problem tractable.8We develop a methodology to automatically build a model for predicting per-formance in all distinct Pareto-efficient placements, given a specification of amachine’s scheduling concerns and observations of the workload at runtime. Ourmodel predicts performance to within 5% of actual on average. With the resultingmodel users can make decisions such as “give application X as few NUMA nodesas possible while making sure that its throughput remains above Y operations persecond”. Our framework is flexible: a previously unmodeled shared resource canbe added to the model as a new scheduling concern and the model is updated auto-matically. We do not have to manually redesign the model for every new machine.Many performance models similar to ours used hardware performance events(HPE), observed on a single placement as model features [12, 46, 56, 59, 101, 107].Our results show that single-placements HPEs cannot produce a reliable perfor-mance model on complex modern systems. Our solution uses actual performanceobservations on two different placements as model input features and does not re-quire selecting hardware performance events that may correlate with performance3.Our method requires running a workload for a short period of time in twodifferent placements before it can generate performance predictions, so it can beused online. Nevertheless, migrating the workload from one placement to anothercan be costly, because we may need to migrate memory between NUMA nodes.To evaluate this effect, we analyze and improve on migration overheads in Linux.Our results should help potential users decide when our method is viable for anonline deployment.2.2 Background & Related Work2.2.1 State of the ArtWorkload placement on multicore systems has been explored for over a decade.Early studies examined contention between single-threaded applications for a spe-cific resource, such as the SMT instruction pipeline [46, 87], or shared caches andmemory controllers [43, 69, 101, 107]. Later work extended the techniques to3HPEs, such as instructions/cycle may be used for performance measurements, but any otherapplication-specific metric is acceptable.9multithreaded workloads and to additional resource combinations, such as SMTand shared caches [105], memory controllers and the shared interconnect [36, 59].While laying a crucial foundation for our work, these prior techniques did not pro-vide a general solution for reasoning about such systems.For instance, the DIO algorithm [107] showed us how to avoid interference forshared memory controllers. A simplified version of the algorithm is as follows:monitor the memory-intensiveness of running threads using HPEs for a samplingperiod, sort threads by memory-intensiveness, and then place threads on domainssharing memory resources such that the aggregate memory-intensiveness is bal-anced on domains. This algorithmic approach returns a new thread placement itconsiders to be optimal. In the absence of unaccounted for resources affecting theperformance of the thread placement, the algorithm improves performance. Notethat it does not try to predict how much performance will improve, and only han-dles one set of shared resources.The AsymSched algorithm of Lepers et al. [59] showed us how to place ap-plications on machines with asymmetric interconnects, by using knowledge of thetopology and the application’s traffic patterns to place threads on nodes with goodinterconnect connectivity.Other previous scheduling algorithms pursued a goal of placing threads thatshare data near each other (e.g. sharing a cache or on the same NUMA node). Tamet al. [93] provided a scheduling algorithm aimed at maximizing the benefit ofthread sharing. It is similar in approach to DIO. Thread sharing is measured onlineand then the algorithm enacts a new thread placement that optimizes the overallsharing on the system. As with DIO, it does not attempt to predict performance.Techniques used in prior work did not allow for automatic combination of sev-eral models. Every contention model required manual design: careful selectionof hardware performance events [36, 43, 56, 59, 107] or even manual crafting ofartificial “probe” workloads or “Rulers” [101, 105] that must be run side-by-sidewith the target workloads to determine their sensitivity to contention.Most existing scheduling algorithms have a common structure. As inputs, theytake knowledge of the hardware and some online metrics from the application (usu-ally obtained from HPEs). Then as an output they suggest a thread placement thatshould provide optimum performance with respect to the resources they individu-10ally account for.A comprehensive solution should account for all the hardware resources thatmatter to thread placement, but combining existing algorithms is difficult. Thereis no baseline for comparing the importance of one algorithm over another. If thealgorithms suggest conflicting thread placements, then it is necessary to balancethe performance effect of each algorithm against the other, but as discussed thealgorithms can at best only say whether a given placement is better or worse thananother. They cannot predict the performance effect so the scheduler has no wayto know which algorithm should be followed or if a compromise is best.One could try to develop a new algorithm that incorporates the ideas of theoriginal algorithms. It would require extensive manual testing and tuning to deter-mine the best way to balance the concerns, would likely be sensitive to differingworkloads and hardware, and would require expert knowledge that covers both ofthe original algorithms. All of these are serious downsides to the approach.The state of Linux scheduler exemplifies these difficulties. The Linux sched-uler attempts to address many thread placement concerns, but it does so in an ad-hoc way with many heuristics and hacks. In one of our previous works we dis-covered four major bugs in the Linux scheduler, which resulted in a 22% to 138xperformance impact in extreme cases [61].System identification is a well-studied area in the field of control systems[30, 31, 60]. The goal of system identification is to model a system’s outputs givensome number of inputs. For our purposes, the inputs would be the workload place-ment and the dynamic application behavior and the output we want to model is theapplication’s performance. The main difficulty is that system identification gener-ally assumes that the inputs are easily measurable and real-valued, for example, aninput voltage. This does not apply in our case of workload placement. Decidingexactly what to use as inputs to our model and how to quantify them is an importantand difficult problem on its own, and is discussed in the following sections. Forthis reason we approached the problem with the field of machine learning in mind,rather than system identification.Dwyer et al. used an automated model-building methodology, where automati-cally selected features (from all HPEs available on the machine) were fed into a va-riety of machine-learning models [41]. However, the model predicted a rather sim-11ple outcome: a performance degradation when a target workload was co-scheduledwith an interfering one, and not the performance in different placements. Consis-tent with our finding that HPEs observed in a single placement are poor modelfeatures, Dwyer’s study reported rather poor prediction accuracy in many cases.A recent system, Pandia [49], made significant advances. It accurately pre-dicts relative performance of different workload placements on multicore NUMAmachines. It can also predict relative performance of an application with differ-ent numbers of threads, but such predictions in Pandia require performance ob-servations of six runs with different thread counts, which is difficult to do onlinebecause most real applications cannot easily reconfigure their thread count on de-mand. Despite addressing many limitations of previous work and producing re-markably accurate performance predictions, fundamentally Pandia still relies onthe machine-specific modelling methodology that prevents easily transferring re-sults to other systems. Pandia’s authors capture factors that contribute to perfor-mance, such as cache contention, latency of communication, and load balancing,in a set of machine-specific equations. If the model had to be adapted to anothermachine, for example one with an asymmetric interconnect, the equations wouldhave to be manually reformulated.We believe that investing that much effort into designing new models for ev-ery new type of hardware puts an unreasonable burden on system engineers. Es-sentially, while there is significant existing theory for scheduling with respect totime-sharing [86, 100], there is not any developed theory or framework for space-sharing. Instead, we sought a future-proof methodology that uses easily availableinformation about a machine’s shared resources and automatically builds an accu-rate performance model.2.2.2 Assumptions and LimitationsTo make our methodology robust and extensible we make a few simplifying as-sumptions, which we describe here.Identically scored placements yield identical performance. As we explain inSection 2.3, a placement is identified by how many vCPUs share each hardwareresource. We refer to the degree of sharing for a resource as the score; a vector12of scores thus identifies a placement. Placements with identical score vectors aredeemed to yield identical performance for a given workload. This statement as-sumes that our machine model must be aware of all shared resources that mightyield variations in performance. This assumption is made by most solutions in thisspace, because one can only model the factors of which one is aware. A radicallydifferent approach would be a statistical technique that searches for an optimallyperforming placement by trying a sufficient number of random placements [76].Unfortunately, the best known techniques require trying thousands of placementsand assumes that performance in all placements fits a Generalized Pareto distribu-tion — an assumption that does not hold in our case.A workload is encapsulated in a virtual container. This assumption sits wellwith many data centers that use virtualization for a variety of reasons. In our case,it makes the problem easier to solve than if we viewed a workload as an amor-phous collection of threads. With a thread-centric view, the degree of concurrencycan unpredictably change, either because of application logic or because of OSscheduling decisions. When it changes, performance can be affected because ofintra-application scaling issues or because of the change in pressure on shared re-sources. Combining both effects in a single model is cumbersome: we concludedthat for robustly accurate predictions that do not require offline runs we need totrain a separate model for each feasible number of vCPUs in a container. Managedcloud environments present their offerings as a menu of virtual instances with afixed number of vCPUs per instance. For example, AWS offers a dozen instanceswith the number of different core counts limited to ten [5]. So we can feasibly traina separate model for each machine and each vCPU count. We are not interested infinding the optimal number of threads or vCPUs for the workload; for that, userscan leverage other tools [49, 90].We also assume that each vCPU of the container is performing roughly thesame type of work. This assumption is not very strict though; we have workloadsin our test set that have database maintenance threads or garbage collection threadsin addition to worker threads and these workloads still work well with our model.A NUMA node is a unit of resource allocation. We assume that vCPUs areallocated to containers in units of entire NUMA nodes. That is, given a virtual13Concern Score Resources Cost? InversePerf?L2/SMTNumber of L2caches in useL2 cache, instruc-tion fetch and de-code, and floatingpoint unitsY YL3Number of L3caches in useL3 cache, memorycontroller, and band-width to DRAMY YInterconnectAggregate band-width betweennodes in useInterconnect band-widthN NTable 2.1: Scheduling concerns used on our AMD test system (shown in Fig-ure 2.2).container the goal is to decide across how many NUMA nodes to spread the con-tainer; we do not co-locate different containers on the same node. We rely on thisobservation to make the problem tractable. Modelling contention among differentcontainers on the same NUMA node is much more difficult. On the other hand, ifmultiple containers on the same machine are mapped to different NUMA nodes,they can co-exist without interference if the nodes used for different containers donot share interconnect links, which is a configuration easy to enforce. We con-firmed that this property holds experimentally and the results are shown in A.1.We consider only balanced placements. A balanced placement is one wherethe number of vCPUs is evenly divisible by any number of shared resource unitsconsidered for placement. For instance, if we have shared L3 caches on the system,we will only consider placements where the number of vCPUs sharing each L3cache is equal. Uneven sharing can cause unpredictable performance effects onthe workload, for example by creating stragglers, so we choose to not model theseeffects. Since we are already assuming that resources will be allocated to containersin multiples of NUMA nodes the balance assumption is not very limiting.14(a) AMD Opteron 6272nodeNode 0Node 6Node 5Node 3Node 4Node 2Node 1Node 7(b) AMD interconnect(c) Intel Xeon E7-4830v3 nodeFigure 2.2: The two systems used in this chapter. The first is a quad AMDOpteron 6272. It has eight NUMA nodes (schematically shown in Fig-ure 2.2a) connected with an asymmetric interconnect (Figure 2.2b) and atotal of 64 cores. Pairs of cores share the instruction front-end, L2 cache,and floating point units. The second system is a quad Intel Xeon E7-4830 v3 with four NUMA nodes (Figure 2.2c) and 96 hardware threads(12 physical cores per node with SMT). The interconnect (not shown)is symmetric.2.3 Abstract Machine ModelA major obstacle to a solution to the virtual core placement problem is the sheernumber of possible placements. For 16 virtual cores on a 64 core system the num-ber of possible placements is the combinations of 16 objects chosen from a set of1564, which is on the order of 1014.It is essential to exploit the symmetry in the system to reduce the number ofplacements to a manageable number. By this we mean that for most types of sharedresources it does not matter which shared resources are being used but how much ofthe shared resources is available to the workload, and having a way to quantify thiswould allow us to eliminate the vast majority of placements by only consideringthose that are actually relevant with respect to performance.We tackle this with the concept of scheduling concerns. A single schedul-ing concern is responsible for a single hardware resource, or an inseparable setof hardware resources that affect the performance of thread placements. The pri-mary purpose of a scheduling concern is to provide a numerical score when givena thread placement as input. The score represents the utilization of the particularresource and it only depends on the vCPU placement, not the dynamic behaviorof a workload. A simple example is an “L2 cache” resource. The scheduling con-cern for the L2 cache measures the utilization of L2 caches on a system wherethere are multiple L2 caches shared by sets of cores, so the score would simply bethe number of L2 caches in-use by vCPUs. The score remains constant with re-spect to the symmetry of the hardware resource the concern encompasses. So, twoplacements might use completely different NUMA nodes but if they use the samenumber of L2 caches then they will both have the same L2 cache score. A vectorof numeric scores for all scheduling concerns uniquely identifies each placementthat is distinct with respect to sharing of resources.There are two additional pieces of information a scheduling concern needs inorder to identify the important placements. The first is whether the concern’s scoreis proportional to the user’s cost, which is the case for resources like NUMA nodesbecause fewer nodes (lower score) means more containers can be packed onto asystem. If a lower score for a resource only meant worse performance, we couldsimply discard placements with a lower score for that resource (all other scoresbeing equal) from our list of important placements. But since we want users tobe able to make cost-performance trade-offs, placements with lower scores butpotentially lower cost could still be relevant. The second piece of informationneeded by a scheduling concern is whether the resource encompassed by a concerncan ever have an inverse relationship with performance. For some resources, like16the L2 cache, a higher score is usually better, but for some workloads such as thoseshowing cooperative cache sharing, a smaller score (using fewer L2 caches) mayactually improve performance. For other resources, like the shared interconnectdescribed below, a lower score will never improve performance and would notresult in a lower cost for the user, so we can safely ignore placements with lowerscores when everything else is equal.In practice, a single scheduling concern may cover multiple shared resourcesbecause some resources are inseparable with respect to thread placement. Threadssharing a physical core via SMT typically share a cache, the instruction front-end,and functional units. In cases like this, a single scheduling concern is still sufficient.Our AMD test system (shown in Figure 2.2) has multiple NUMA nodes, anasymmetric interconnect, and a form of SMT. For this system we developed thescheduling concerns shown in Table 2.1. For the L2/SMT and L3 concerns, thescore for a particular placement can be calculated directly from information pro-vided by the operating system. The OS also provides information on the inter-connect topology, but it is simpler and more accurate to measure the aggregatebandwidth with a benchmark for each possible combination of nodes (we usestream [67] for our measurements). Each concern is relatively simple, easy toimplement, and can be developed independently. Since it does not require a perfor-mance expert, we envision the specification of scheduling concerns being providedas part of system BIOS.It is easy to see how scheduling concerns could be developed for other hard-ware resources. For example, a system with asymmetric CPUs could have a CPUconcern where the score is the frequency of the CPUs in use, or a concern thataccounts for some nodes being closer to I/O links, where the score is 1 if the nodesin use are near I/O links and 0 otherwise.Next, from the concerns and hardware topology we need to derive the impor-tant placements. An important placement must have three properties: (1) conformto our balanced assumption, (2) be feasible: i.e., not assign more than one vCPU toa single hardware thread, and (3) not be superseded by a strictly better placement.Given a score s and the number of vCPUs v, the balance property is encodedas v mod s = 0, and the feasibility property is encoded as v=s Capacity, wherecapacity is the number of hardware threads available in a single instance of the re-17source: e.g. there are eight hardware threads per L3 cache on our AMD test system.We also define the Count of a concern as the total number of that resource on thesystem, so our AMD test system has an L2Count of 32 for example. The first stepin generating important placements is generating the possible scores that satisfy thebalance and feasibility requirements individually. This is done for each schedulingconcern that can affect cost or have an inverse relationship with performance. Forour AMD test system this step is shown in Algorithm 1.The next step is computing possible packings of the system, i.e. combina-tions of placements that fill the entire machine. Recall that we already know howmany vCPUs each instance/container will use (based on our assumptions in Sec-tion 2.2.2). Depending on the specific placement used, a container may or maynot use all available NUMA nodes. If it does not use all available NUMA nodes,then more containers can be packed onto the system. This step computes the pack-ings based on the possible scores generating in the previous step. Specifically, weare using the score of the scheduling concern that corresponds to the highest levelof the hierarchy and our unit of resource allocation, which in our case is the L3scheduling concern. The packings are generated with a recursive method shown inAlgorithm 2.Next, as shown in Algorithm 3, packings that are duplicates and packings thatare not Pareto-efficient with respect to the interconnect score are filtered out (sincethe interconnect concern does not affect cost and cannot have an inverse relation-ship with performance). Because the L2 and L3 scores can affect cost or have aninverse relationship with performance, placements are not filtered based on them.Lastly, the placements that make up the remaining packings are “expanded” bycalculating which L2 scores are feasible for the number of nodes in use, and thenthe complete placement is added to the list of important placements. If the sys-tem in question has a deeper hierarchy, with another scheduling concern belowthe L2 concern, for example, then the L2 scores would be expanded first and thenthe scheduling concern below it would be expanded based on the expanded andfeasible L2 scores. In general, “score expansion” starts at the highest level of thehierarchy not including the scheduling concern used in generating packings, andthen goes downward to the lowest level.As an example of Pareto-efficient packing, on our AMD test system we know18Algorithm 1 Generating possible L2 and L3 scoresL3Scores = List()for i 1;L3Count doif v=i L3Capacity^ v mod i= 0 thenL3Scores.append(i)end ifend forL2Scores = List()for i 1;L2Count doif v=i L2Capacity^ v mod i= 0 thenL2Scores.append(i)end ifend forreturn L3Scores, L2Scoreswe need to keep the four-node placement that uses nodes f2;3;4;5g because itis the four-node placement with the highest interconnect score. Therefore theplacement using nodes f0;1;6;7g is also an important placement and will be keptbecause it is the placement that can be packed with the best four-node place-ment. Continuing, suppose that we consider a four-node placement that uses nodesf0;1;4;5g. If we were to use this placement at runtime, the remaining set of fournodes, potentially used for another workload, is f2;3;6;7g. Both of these place-ments have poor interconnect scores, in part because there is a two-hop distancebetween nodes f0;5g and nodes f3;6g. Instead, we can pack the machine with abetter combination of four-node placements: f0;2;4;6g and f1;3;5;7g. Using thisobservation, the vectors for placements f0;2;4;6g and f1;3;5;7g will be kept overthe worse pair of four-node placements.After this process is complete, we are left with the important placements. Forour AMD system we have 12 of them: two 8-node placements (one sharing L2caches and one not), two 2-node placements (with the best and second-best in-terconnect score), and eight 4-node placements (half sharing L2 caches, half not,and various interconnect scores relevant for packing). Our Intel test system (Fig-ure 2.2), on the other hand, only uses an L2/SMT concern and an L3 concern.With 24 virtual cores per container, it has seven important placements which areall of the placements that satisfy the balance and feasibility constraints: a one node19Algorithm 2 Generating packings of placementsPackings = List()procedure MAKEPACKINGS(L3Scores, NodesLeft, CurrentPacking)for all L3S in L3Scores doif L3S> len(NodesLeft) thencontinueend iffor all n in Combinations(NodesLeft, L3S) doRemaining = NodesLeft - nNewPacking = CurrentPacking.append(n)if len(Remaining) > 0 thenMakePackings(L3Scores, Remaining, NewPacking)elsePackings.append(NewPacking)end ifend forend forend procedurereturn Packingsplacement sharing L2 caches, two 2-node placements, two 3-node placements, andtwo 4-node placements.2.4 Performance PredictionsAutomatic model-building techniques learn how to map a set of features describingdata to a predicted outcome. The outcome we would like to model is a vector ofperformance values in all important placements, relative to a baseline placement.For example, if there are three important placements, and the performance in thesecond and third is 20% and 30% better than that in the first baseline placement,the performance vector will be: [1:0;0:8;0:7]. Our data elements are executionsof workloads in different placements, and the features are some metrics describingthe execution.One way to frame the problem is to predict a performance category. That is,assuming that our target workloads can be categorized according to their perfor-mance vectors, we can train the model to predict the category and then use the20Algorithm 3 Generating important placementsNodes = range(0, L3Count)Packings = MakePackings(L3Scores, Nodes, List())Remove duplicates from Packingsfor all (a,b) in Permutations(Packings, 2) doif L3 Scores in a 6= L3 Scores in b thencontinueend ifaIC = Sorted interconnect scores of a placementsbIC = Sorted interconnect scores of b placementsToRemove = Truefor i in range(0, len(aIC)) doif aIC[i]> bIC[i] thenToRemove = Falseend ifend forif ToRemove thenRemove a from Packingsend ifend forImportantPlacements = List()for all Placements p in Packings don L2Count=L3CountL3S = L3 Score of pfor all L2S in L2Scores doif n L3S L2S thenImportantPlacements.append(p)end ifend forend forreturn ImportantPlacementscategory’s average vector as the predicted outcome.We begin by presenting an implementation of this idea (Section 2.4.1). Our firstmethod runs a workload in two or three configurations in order to narrow down theperformance category to which it belongs. Although this is not the final method weuse, it demonstrates the fundamental benefits of using actual observations of perfor-mance as predictive features of the model. Our final method (Section 2.4.2) uses21performance observations in two configurations as input features into a machinelearning (ML) model. Section 4.4 shows that relying only on hardware perfor-mance events (HPE) that may correlate with performance is not nearly as effectiveas using the measurements of actual performance.2.4.1 Predicting Performance CategoriesAlthough there is an infinite number of possible workloads and a complex interac-tion between scheduling concerns, we observed that workloads fall into a relativelysmall number of categories where each category has very similar performance vec-tors and is representative of a specific relationship to scheduling concerns. Forexample, workloads that are not memory intensive and are not adversely affectedby sharing SMT contexts would belong to the same category (where thread place-ment does not matter). Another category would be one where using fewer NUMAnodes and fewer physical cores greatly hurts performance, and so on.To discover these categories, we use the clustering algorithm k-means, whichuses performance vectors as the elements. It partitions the elements into k clustersso as to minimize the within-cluster sum of squares (i.e., Euclidean distance) be-tween the vectors. We select the value of k that maximizes the average Silhouettecoefficient [6, 79] over all data points, which is the standard practice in the field.The set of applications we experimented with are drawn from the NAS ParallelBenchmark suite [11], Parsec suite [20], the Metis map-reduce benchmarks [65],and BLAST [7]. Also included are the Linux kernel compile gcc benchmark, twoSpark graph workloads, TPC-C [95] and TPC-H [96] on Postgres and a WiredTiger [4]BTree benchmark. Workloads were run using lxc containers and configured touse 16 vCPUs on the AMD system and 24 vCPUs on the Intel system (Fig. 2.2).Within containers, the number of application threads is set so as to achieve >70%CPU utilization on each core, typical of what is done in practice.The workloads on our Intel test system were clustered into six groups. Figure2.3 shows the performance by placement for the workloads found in three exampleclusters. Considering the similarity of performance curves in each cluster, it isclear that clustering has worked well in this case. An inspection of the remainingclusters and the clusters produced on our AMD test system showed similar results.22Figure 2.3: Performance by placement for workloads in three example clus-ters on Intel. The x-axis shows the placement ID. The y-axis showsperformance relative to the baseline placement (#2).This procedure helps to determine the quality and completeness of the trainingworkloads. If k-means cannot create good clusters, then the training set could beincomplete. In our case the quality of clusters is good overall, but some clusters,e.g., the one with kmeans and WTbtree, contains only two elements and couldbenefit from adding other workloads.It also leads to an intuitive method for making performance predictions: runthe workload in several placements by trying them during the first few seconds ofthe execution without interrupting the workload, measure the performance in each,23and based on those results use the process of elimination to determine the categoryto which it belongs. Once we know the category we can use category’s averageperformance per-placement as the predictions.The workload must first be run on the baseline placement; our performancemeasurements are all normalized to this placement so it is required but does notprovide any information about the workload. Every placement tried thereafterhelps narrow down the category to which the workload belongs (if the performanceis outside of the range of performance values seen during training for a particularcategory then we can conclude it does not belong to that category) until only theworkload’s predicted category remains. The full algorithm for determining a work-load’s category, which we call the iterative method, is shown in Algorithm 4.For an arbitrary system, the worst-case number of measurement placementsneeded to complete Algorithm 4 is min(n1;k) where n is the number of impor-tant placements and k is the number of clusters. This worst-case happens whenonly a single cluster can be eliminated per measurement placement. Real-worldsystems are likely to have a much better worst-case though, such as our test systemswhich only require three placements in the worst-case. This is because there arelikely to be some correlations between clusters and correlations between schedul-ing concerns. For example consider a system that has cluster A, cluster B, andsome number of other clusters. Cluster A prefers having more L3 caches and moreL2 caches, and because memory intensiveness affects both L2 and L3 cache prefer-ences there is no cluster that prefers more L3 caches and prefers fewer L2 caches.Cluster B prefers more L2 caches but does not care how many L3 caches it has. Asingle measurement placement showing that a workload prefers fewer L2 cacheseliminates both cluster A and cluster B. Similarly a single placement could elimi-nate both cluster A or B and clusters that prefer fewer L2 caches. In this way weare likely to avoid the theoretical worst-case on real systems.On our test systems the iterative method requires trying only two or three place-ments (including the baseline) to determine a workload’s category and producesfairly accurate predictions. For example, 29 out of the 41 workloads on our Inteltest system have predictions within 6% of the actual performance, but the otherworkloads have at least one placement that has a prediction error around 20%.The key insight we draw from the iterative method is that performance mea-24surements from multiple placements have very high predictive power due to thefact that workloads tend to belong to distinct categories with respect to their perfor-mance behavior. This motivates our approach of using performance measurementsas inputs into a ML model, which is described in the next section.Algorithm 4 Iterative method for determining performance categoryfor all Cluster c in Clusters dofor each Placement p domin = min. performance within Cluster for pmax = max. performance within Cluster for pInterval of c for placement = [min, max]end forend forExcluded = List()Run baseline placement and measure performancewhile len(excluded) < len(c)-1 doPop next placement from pRun next placement and measure performancefor each Interval for placement doif Performance is outside interval thenAppend cluster belonging to interval to excludedend ifend forif Performance is outside all intervals thenFind closest interval and exclude all other clustersend ifif All clusters excluded thenReturn no clusterend ifend whilereturn cluster not in excluded2.4.2 Predicting Performance with Machine LearningWhile the method of predicting performance categories using the process of elimi-nation is intuitive and robust, it uses a category’s average as the predicted outcome,which is a rather rough measure. We found that we can achieve higher accuracywith a more refined modeling technique. Our final approach directly predicts per-25formance vectors using a machine learning model, a multi-output Random Forestregressor. In other words, the model produces one predicted output per placement.As inputs, the model uses performance measurements observed in two differentplacements: a baseline and one additional. These placements were selected duringthe model training as those yielding the highest accuracy.The random forest model has significantly better accuracy for this problemthan other regression models like neural networks, support vector machines, andnonlinear regression. Many regression models rely on an assumption that the datafits a particular shape (for example, logarithmic or polynomial) but the data in ourcase has many instances of step-function and piece-wise behavior. A random forestrequires no assumption of the shape of the data. Neural networks do not have thisassumption either but they tend to require much more training data to be effective.An alternative to regression is a search-based machine learning approach. Inthis case, a machine learning algorithm could continually try placements until thegoal is reached. The benefit of this approach is that potentially it would not needto rely on training workloads. A major downside though is that it would requiremany more placements to be tried which requires migrating memory and wouldhave a prohibitive performance cost. For this reason we focused on the regressionapproach.Finally, to evaluate our new approach relative to the best known practices wealso train a multi-output Random Forest regressor using HPEs observed in a singlebaseline placement as input features. The best baseline placement was identifiedduring training.Modern machines have many hundreds of HPEs, some more than 1000 [103].Sampling that many online cannot be done without large sampling errors. Oneoption is to include the list of important HPEs in the specification of each schedul-ing concern. This assumes that the engineer providing the specification somehowknows which counters would work best — an assumption we found to be not inthe spirit of maximally automating the prediction process. Another approach is toobtain measurements with all possible HPEs during training and use feature se-lection methods to identify the best predictors, similarly to [41]. This approach isautomatic, but increases the training time from hours to weeks. For example on ourIntel machine with nearly 1000 performance events, the time to measure all coun-26ters for our training set while ensuring acceptable sampling error would amount to66 days.Instead we used a combination of the manual and automatic approaches. Westarted with a set of plausible features (41 HPE derived metrics on our Intel testsystem and 25 on our AMD test system) covering cache, memory, TLB, intercon-nect, and pipeline behavior, which are metrics commonly used in similar work.We then used Sequential Forward Selection [39, 54] (SFS) to pick the best ones.SFS involves iterating over all potential features, selecting the one that improvesprediction accuracy the most, and then repeating this process until prediction accu-racy no longer improves. On the AMD test system the SFS process results in fourfeatures: the L3 cache misses per cycle, the L1 cache miss rate per instruction, theTLB misses per cycle, and the cache sharing percentage (calculated from the statesof cache lines as they are evicted). On the Intel test system, three features wereselected: the percentage of stall cycles where at least one memory request waspending, the cache sharing percentage (calculated from the states of cache lines asthey are inserted), and the L2 cache misses per instruction caused by the prefetcher.We also tried adding both performance measurements and HPEs as modelinputs, but this did not improve accuracy at all on our AMD system and onlymarginally improved accuracy for two workloads on our Intel system. In the endwe had two model variants to compare: the first one used as inputs the actual per-formance measurements observed in two placements, and the second one used onlythe HPEs observed in a single placement.2.4.3 ResultsThe full per-benchmark performance prediction results for the AMD test systemare provided in the appendix in Figure A.4 and for the Intel test system in FigureA.5. The figures for the benchmarks discussed in this section are also providedhere for ease of reference.The results are per-application cross-validated. For example, when trainingthe model that will be used for predicting a Spark workload neither the data fromspark-cc (a Spark connected components algorithm run on the LiveJournal database)nor spark-pr-lj (a PageRank algorithm run on the LiveJournal database) is included27 0 0.2 0.4 0.6 0.8 1 1.2 1.4 0  1  2  3  4  5  6  7  8  9  10  11Normalized PerformancePlacementActualPredicted: HPEsPredicted: Perf MeasurementsFigure 2.4: Prediction results for postgres-tpcc on the AMD test system. 0 0.2 0.4 0.6 0.8 1 1.2 0  1  2  3  4  5  6Normalized PerformancePlacementActualPredicted: HPEsPredicted: Perf MeasurementsFigure 2.5: Prediction results for spark-pr-lj on the Intel test system. 0 0.2 0.4 0.6 0.8 1 1.2 0  1  2  3  4  5  6  7  8  9  10  11Normalized PerformancePlacementActualPredicted: HPEsPredicted: Perf MeasurementsFigure 2.6: Prediction results for kmeans on the AMD test system.28 0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 0  1  2  3  4  5  6Normalized PerformancePlacementActualPredicted: HPEsPredicted: Perf MeasurementsFigure 2.7: Prediction results for canneal on the Intel test system. 0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 0  1  2  3  4  5  6  7  8  9  10  11Normalized PerformancePlacementActualPredicted: HPEsPredicted: Perf MeasurementsFigure 2.8: Prediction results for dc.B on the AMD test system. 0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 1.8 2 0  1  2  3  4  5  6Normalized PerformancePlacementActualPredicted: HPEsPredicted: Perf MeasurementsFigure 2.9: Prediction results for ft.C on the Intel test system.29 0 0.2 0.4 0.6 0.8 1 1.2 0  1  2  3  4  5  6Normalized PerformancePlacementActualPredicted: HPEsPredicted: Perf MeasurementsFigure 2.10: Prediction results for freqmine on the Intel test system. 0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 0  1  2  3  4  5  6Normalized PerformancePlacementActualPredicted: HPEsPredicted: Perf MeasurementsFigure 2.11: Prediction results for kmeans on the Intel test system. 0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 1.8 0  1  2  3  4  5  6Normalized PerformancePlacementActualPredicted: HPEsPredicted: Perf MeasurementsFigure 2.12: Prediction results for WTbtree on the Intel test system.30in the training.Overall the accuracy when using only the actual performance measurementsas model features is high. The predicted performance is within 4.4% of actualon average on the AMD system, and within 6.6% on Intel. Postgres running theTPCC benchmark (postgres-tpcc) on the AMD test system (Figure 2.4) and spark-pr-lj on the Intel test system (Figure 2.5) are examples where the predictions arevery accurate.There are a few cases where the training set did not include any workloads thatbehaved similarly to the predicted benchmark, which results in poor predictions.For example kmeans on the AMD system (Figure 2.6), which was the only bench-mark in our training set that preferred SMT, or canneal on the Intel system (Figure2.7). As explained in Section 2.4, we can identify weaknesses in the training setusing the performance clusters, so it is straightforward to figure out where extratime and effort could be spent on adding new training workloads.Prediction accuracy when using only the HPEs from a single placement wasa lot less reliable. On the AMD system it produced good results overall, but theaccuracy was still noticeably worse for dc.B (Figure 2.8) compared to the modelvariant that relied only on actual performance measurement. The real shocker isthe results on Intel, where the model relying only on HPEs produced many poorpredictions. It completely missed the performance trend for ft.C (Figure 2.9) andfreqmine (Figure 2.10), produced errors of over 40% for kmeans (Figure 2.11)and WTbtree (Figure 2.12), and is noticeably worse for several other workloads.Using both the actual performance measurements and HPEs yielded small accuracyimprovements on the Intel system, but made no difference on the AMD system.An example of why HPEs observed in a single placement could have poorpredictive power, and one of the reasons why the Intel system produced worse pre-dictions, is predicting the effect of inter-thread communication latency. There is ahuge latency difference for communication between a single-node placement andplacements including more than one node. For some applications, reduced inter-thread communication latency when all threads are running on a single node has amajor performance impact, as is the case for WTbtree. Separating the sensitivity tolatency from overall memory intensiveness (which can be measured by the cachemiss rate) is difficult to do with HPEs. Similarly, it is also very difficult to deter-31mine if a workload’s working set will fit in a given number of L3 caches by onlymeasuring HPEs on a single placement. We conclude that, contrary to prior be-lief, single-placement HPE observations are not reliable features for modellingperformance on multicore NUMA systems.2.5 Using the Model in PracticeThere are many ways in which data center operators can use our model. To il-lustrate one potential use case we set up a scenario where the user would like topack as many instances of a given virtual container into a physical server whilerespecting a performance target. To assess the overheads, we measure the costsof container migration on our test systems (the overhead of prediction inference isnegligible).2.5.1 A Potential Use CaseIn our experiment we use virtual containers of three types: one running WiredTigerwith a B-tree search workload, another running Postgres with the TPC-H workload,and another running Spark with the PageRank workload on a LiveJournal database.We have many containers of each type, and our goal is to pack as many of them aswe can per physical server without violating a performance goal.The performance goal can be specified in terms of an application-level metricsuch as transactions per second or a generic metric such as instructions-per-cycle.The placement policy is agnostic to the metric used and only requires that theapplication make this metric available at runtime. For clarity of presentation, wesimply set the performance goals to correspond to 90%, 100% and 110% of theperformance observed in the baseline placement.We compare four hypothetical container placement policies. The first policy,referred to as ML, is based on our techniques. It decides how many nodes to allo-cate to the container based on performance observations in two placements and themodel presented in the previous section. It runs the workload in two placementsduring the first few seconds of the execution without interrupting the workload,and then migrates it into the best predicted placement. To separate various aspectsof performance, the results shown here do not include the migration overhead; it32is studied separately in the next section. The second policy, Conservative, is anaı¨ve policy that allocates the entire machine to each instance, allowing only oneinstance per machine. The third policy, Aggressive, is another simple policy thatfills the system with as many instances as possible, maximizing machine utilizationat the risk of performance violations. For example, our AMD system allows up tofour 16-core instances and our Intel system up to four 24-core instances. NeitherConservative nor Aggressive pin vCPUs to cores, allowing Linux to perform themapping in the way it wishes, and possibly creating unneeded contention. As an al-ternative, we also evaluate a more sophisticated fourth policy, Smart-Aggressive.This policy is similar to Aggressive, except each instance is pinned to the best min-imum set of nodes, which we define as having the highest interconnect bandwidth.This policy requires an analysis of the interconnect topology in order to find thecorrect set of nodes.We could not make a fair comparison to any other method presented in earlierwork. As we explained in Section 2.2, most earlier models targeted very differentsystems and most did not predict performance vectors, so we could not apply themdirectly.We evaluate the policies by measuring how many instances of the same work-load they were able to pack per machine (higher is better) and the degree of vi-olation of the performance goal as the percent of the target (lower is better). Allworkloads were run using lxc containers and configured to use 16 vCPUs on theAMD system and 24 vCPUs on the Intel system. Figures 2.13 and 2.14 show the re-sults for the three container types. The bars show the number of instances packed(left y-axis), while the “stars” shows the deviation from the target performancegoal, expressed as percentage (right y-axis).The ML policy always meets the performance goal while in most cases packingmore instances per machine than the conservative scheduler. The conservative pol-icy not only wastes resources, but also, surprisingly, may cause performance targetviolations (Figs. 2.13a and 2.14b), because the Linux scheduler may map vCPUsunevenly to shared resources, causing contention where it could be avoided.The aggressive policy packs a maximum possible number of containers permachine, at the cost of performance target violations, up to 46% with WiredTigeron AMD, and 43% with Spark on Intel. It is surprising that even when the aggres-33Benchmark Memory(GB)FastMigration (s)DefaultLinux (s)BLAST 18.5 3.0 5.9canneal 1.1 0.3 3.9fluidanimate 0.7 0.3 2.3freqmine 1.3 0.3 4.2gcc 1.4 0.3 2.8kmeans 7.2 1.5 6.5pca 12.0 2.8 10.0postgres-tpch 26.8 5.8 117.1postgres-tpcc 37.7 14.9 431.0spark-cc 17.0 3.7 139.9spark-pr-lj 17.1 3.8 137.0streamcluster 0.1 0.1 0.4swaptions 0.01 0.1 0.0ft.C 5.0 1.3 19.4dc.B 27.3 5.4 51.7wc 15.4 3.4 19.5wr 17.1 3.6 18.9WTbtree 36.3 6.3 43.8Table 2.2: Migration performance on the AMD system, compared to the de-fault Linux migration method. The amount of memory includes pro-cesses’ memory and the page cache associated with the container.sive policy packs the same number of containers per machine as the model-basedpolicy, it still often reports a higher violation percent. That is because this pol-icy allows virtual containers to share NUMA nodes. Smart-aggressive addressesthis shortcoming, but even that policy can cause performance violations (e.g., 20%for WiredTiger on AMD), because it does not take into account all ways in whichworkload placement might affect performance.2.5.2 Memory Migration OverheadMemory migration in Linux is known to be inefficient [59]. Our migration tech-nique is based on that proposed by Lepers et al. [59]. Their method freezes theapplication, parses its memory map and migrates pages in parallel using a collec-34ModelPredictionsConservative Aggressive Aggressive(Smart)01234instances/machinePerformance goal90%100%110%010203040% of violations (*)(a) WiredTigerModelPredictionsConservative Aggressive Aggressive(Smart)01234instances/machinePerformance goal90%100%110%036912% of violations (*)(b) Postgres (TPC-H)ModelPredictionsConservative Aggressive Aggressive(Smart)01234instances/machinePerformance goal90%100%110%06121824% of violations (*)(c) Spark (PageRank)Figure 2.13: Instances packed per machine (left y-axis) and performance goalviolation (as %, right y-axis) on the AMD system.35ModelPredictionsConservative Aggressive Aggressive(Smart)01234instances/machinePerformance goal90%100%110%05101520% of violations (*)(a) WiredTigerModelPredictionsConservative Aggressive Aggressive(Smart)01234instances/machinePerformance goal90%100%110%0481216% of violations (*)(b) Postgres (TPC-H)ModelPredictionsConservative Aggressive Aggressive(Smart)01234instances/machinePerformance goal90%100%110%010203040% of violations (*)(c) Spark (PageRank)Figure 2.14: Instances packed per machine (left y-axis) and performance goalviolation (as %, right y-axis) on the Intel system.36tion of worker threads. We improve on the method in [59] by reducing the lockingoverhead when migrating shared pages and by creating mechanisms for migratinga page cache, which is not migrated at all by either Leper’s method or by defaultLinux. Table 2.2 show the total time needed to migrate the workloads used in x2.4.Note that Default Linux migration time does not include the time needed to migratethe page cache, because this is not supported, and yet this can be a very large frac-tion of migration overhead (93% with BLAST, 75% with TPC-C and 62% on TPC-H). The overhead of migration highly depends on the workload: a few big files inthe page cache are more likely to generate contention during migration than manysmall files, because each file has its own memory map. Processes sharing memorywill incur a higher per-page migration cost, as is the case with Postgres. Overall,we found that migrating a large amount of memory can be done in a few seconds,and is order of magnitude faster than default Linux migration method (38 fasterfor Spark). One exception is Postgres TPC-C, whose migration takes 14.9 sec-onds. This benchmark has an atypically large number of processes and threads:167 and 33067 respectively! Linux’s mechanism for updating the cpuset, whichchanges when migration is performed, has per-task overhead and is very slow foran application with such an extreme task count.A drawback of this method is that it requires freezing the container duringmigration in order to reduce contention on some critical kernel locks. As a result, itis not suitable for interactive latency-sensitive workloads. In this case, we have theoption of not freezing the container and instead throttling the bandwidth given tothe migration process so as to reduce the impact on the running application. Thus,the migration takes more time but with a smaller impact on the running container.Using this method, the overhead of migration for the WiredTiger workload wasbetween 3% and 6%, and the migration took 60 seconds4.Overall, we observe that the migration overhead is proportional to the amountof memory used by the container, except in cases with extremely high threadcounts. Using the container’s memory footprint, the user can estimate whether thecost of migration would permit an online deployment of the container placementalgorithm, or if it is preferable to use the model offline for placement of recurring4WiredTiger is the only workload we could evaluate in the interactive mode because it is the onlyone able to report performance during the execution.37jobs.2.6 SummaryModern multicore systems have a complex hierarchy of shared resources and per-formance can vary wildly depending on how virtual CPUs are mapped to hardwarecontexts. Operators waste resources and money by using conservative and sub-optimal placement policies.We have shown a solution to this problem using a methodology to abstract asystem’s shared resources, identify important placements, and predict their perfor-mance. We presented a method for predicting performance based on multi-outputregression. The best accuracy is achieved when observations of actual performanceon two placements are used as model features. Hardware performance events, con-ventionally used as features for predictive models, turned out to be not as predictiveas previously thought. Our method can lead to very significant advantages in ma-chine utilization while keeping performance guarantees.CPU architecture is continually changing, often by sharing resources betweencores in new ways, in order to continue scaling the core count. AMD’s newlyintroduced Zen architecture [32] has L3 cache sharing separate from sharing thememory controller. Intel’s Haswell-E architecture has asymmetric links betweenNUMA nodes through its cluster-on-die feature [71], which has unique perfor-mance implications different from other asymmetric architectures. The flexibilityof our methods means that they can be used on systems like these or future archi-tectures without significant retooling by an expert.38Chapter 3An SMT-Selection Metric1SMT is an architectural technique used to improve the overall performance of awide range of applications [97]. It is designed to improve CPU utilization by ex-ploiting both instruction-level parallelism and thread-level parallelism. Chapter 2addressed scheduling with respect to SMT in conjunction with other schedulingconcerns, but it assumed that the number of vCPUs a workload uses is alreadyknown beforehand. This chapter, on the other hand, focuses on how to choose thelevel of parallelism of an application running on an SMT system. In other words,Given a multithreaded application, will performance improve if additional hard-ware contexts are available via SMT as we increase the number of threads? Pre-cisely, we are considering situations where the number of physical cores is fixed,the number of hardware threads/contexts used per physical core varies (that is,different SMT-levels), and the number of software threads is set to the numberof available hardware threads. Because some applications scale poorly to higherthread counts and because the utilization benefit of SMT is application dependent,using SMT can sometimes hurt performance significantly in this scenario.We propose an SMT-selection metric (SMTsm) to answer this question, andit is the primary contribution of this chapter. It does not require any changes toapplications or operating systems and incurs low overhead. The metric is mea-sured online while applications are being run. Our metric-based approach relieson HPEs and measures the tendency of a workload to run better or worse in more1This chapter is a modified version of work previously published in [46]39hardware contexts. SMTsm can be easily integrated in any user-level scheduler oreven kernel schedulers to provide insights and intelligent decisions about the rightSMT level to be used by a workload. SMTsm can be measured periodically andhence allows adaptively choosing the optimal SMT level for a workload as it goesthrough different phases. We also show how this metric can be used in a scheduleror a user-level optimizer to help guide scheduling decisions.The rest of the chapter is organized as follows: Section 3.1 provides back-ground information and motivating examples, Section 3.2 describes the SMT-selectionmetric and its rationale. Section 3.3 presents the experimental methodology adoptedusing two processor architectures. Performance evaluation is presented in Sec-tion 3.4. Section 3.5 describes ways the SMT-selection metric can be used. Finally,related work is examined in Section 3.6 and concluding remarks and future workare discussed in Section Background & MotivationWith SMT, the processor handles a number of instruction streams from differentthreads in the same cycle. The execution context, like the program counter, isduplicated for each hardware thread, while most CPU resources, such as the exe-cution units, the branch prediction resources, the instruction fetch and decode unitsand the cache, are shared competitively among hardware threads. In general theprocessor utilization increases because there are more instructions available to fillexecution units and because instructions from other hardware threads can be exe-cuted while another instruction is stalled on a cache miss. Since the threads sharesome of the key resources, it is performance-efficient to schedule threads with anti-correlated resource requirements.Several studies have shown that SMT does not always improve the performanceof applications [51, 66, 81]. The performance gains from SMT vary depending ona number of factors: The scalability of the workload, the CPU resources usedby the workload, the instruction mix of the workload, the cache footprint of theworkload, the degree of sharing among the software threads, etc. Figure 3.1 showsthe performance of three benchmarks with and without SMT (4-way SMT) on the8-core POWER7 microprocessor. We first run the application with eight threads at40SMT1. Then we quadruple the number of threads and enable SMT4. Note that forEquake, SMT4 degraded the performance of the application, while it improved theperformance of EP. MG’s performance was oblivious to whatever SMT level wasused.Equake MG EP00.511.522.53SMT1SMT4ApplicationsP er fo rma nc e No rma li ze d t o SMT 1Figure 3.1: Comparison of performance with SMT1 vs. SMT4 for threeapplications on an 8-core POWER7 system. Each application is runalone in a separate experiment. The application uses eight threads un-der SMT1 and 32 threads under SMT4 and threads are bound to theirown hardware contexts.In general, workloads that benefit from SMT contain threads that under-utilizecertain processor resources as well as threads are able to make use of those re-sources. Reasons why such ”symbiotic” situations occur include:1. A large number of cache misses: For a non-SMT processor, when an instruc-tion miss occurs, no more instructions are issued to the pipeline until moreinstructions have been brought to the instruction cache. A similar situationhappens in the case of a data cache miss, the stream of instructions ceases ex-ecution until the missing data is brought to the cache. Such situations couldresult in delays ranging from tens to hundreds of cycles. SMT enables one ormore other hardware threads to execute their instruction streams when suchdelays occur; hence, maximizing the use of the processor pipeline.2. Long chains of instruction dependencies: Inter-instruction dependencies limitthe instruction-level parallelism of applications. Based on the layout of the410 10 20 30 40 50 60 70 8000.511.522.53L1 misses/1000 instructionsSMT 4/ SMT 1 Sp ee du p1 2 3 4 5 6 7 8 9 10 1100.511.522.53CPISMT 4/ SMT 1 Sp ee du p0 2 4 6 8 10 1200.511.522.53Branch Mispredictions/1000 instructionsSMT 4/ SMT 1 Sp ee du p0 10 20 30 40 50 60 7000.511.522.53% of VSU InstructionsSMT 4/ SMT 1 Sp ee du pFigure 3.2: Speedup on SMT4/SMT1 plotted against cache misses, CPI,branch-mispredictions, and fraction-of-floating-point/vector instruc-tions for 27 benchmarks on the POWER7 processor. Eight threads areused under SMT1 (on eight cores), 32 threads are used under SMT4,and threads are bound to their own hardware contexts.multiple pipeline stages, compilers attempt to generate independent instruc-tions that can be executed in parallel. When dependencies exist, the nextinstruction ceases execution until it can receive the results of the previousinstruction. So if the workload exhibits very long chains of instruction de-pendencies, SMT could help to fill the gaps by allowing other independentinstruction streams from other threads to execute in the otherwise idle exe-cution units.3. A large number of branch mis-predictions: When the branch history tableand the branch target buffer are not large enough to service a large number ofbranch mis-predictions, the execution units remain idle. This is again anotheropportunity for SMT to allow other hardware threads to use the executionunits while the branch mis-prediction is being resolved.The workloads in these examples are expected to benefit from SMT, becauseone or more threads leave resources idle, but other threads have sufficient diversityin the instruction mix to put these resources to use. At the same time, if a workload42consists of threads that are individually well optimized for a super-scalar processor(e.g., they do not leave resources idle), this workload is not expected to benefitfrom SMT, because there are no resource gaps to fill.While SMT allows executing multiple streams of instructions in the same cycle,it also introduces more resource contention among the hardware threads that are co-scheduled on the same core. If any of the shared resources becomes a bottleneck,all threads contending for the resource will suffer, and SMT will not be beneficial.Properties of workloads that create contention for resources include:1. A homogeneous instruction mix: If one or few types of instruction are morecommon than others, the workload may create contention for the functionalunit responsible for this type of instruction. For example, workloads thatare floating-point intensive are likely to gain little from simultaneous multi-threading.2. Intensive use of the memory system: Irrespective of instruction mix, a work-load stressing the memory system (e.g., because of poor cache locality) maycause memory-related stalls to become even longer and more frequent on anSMT processor due to increased contention for the memory bandwidth. Asa result, processor resource utilization could decrease instead of increasing.In summary, we intuitively understand that workloads that benefit from SMThave threads that under-use resources, which other threads are able to use, while atthe same time not creating contention for these resources. At the same time, beingable to predict what is the right SMT level to use for a given workload is not atrivial task. This requires a thorough knowledge of both the internals of the work-loads and the internals of the hardware they run on. The complexity of predictingthe right SMT level increases as the number of supported SMT levels increases.For instance, IBM’s POWER7 processor [55] has 4-way SMT multithreading andexposes to applications three different levels: SMT disabled or SMT1 level, 2-waySMT or SMT2 level, and 4-way SMT or SMT4 level. Although this technologyprovides more flexibility, it also introduces more complexity since the user needsto decide what is the right SMT level for their running application.43In an attempt to see if it is possible to predict performance improvements fromSMT by just looking at applications’ characteristics, we plotted the speedup ob-tained at SMT4 vs. SMT1 against four main application metrics on a POWER7machine: L1 cache misses, branch mispredictions, cycles per instruction (CPI),and fraction of floating point operations. The experiment was conducted using27 representative multithreaded benchmarks on a POWER7 system (more detailsabout the benchmarks used will be presented in subsequent sections). Figure 3.2,shows that there is no correlation between any of the four metrics and the SMTspeedup.One option for SMT tuning is to compare application performance with andwithout SMT offline and then use the configuration resulting in better performancein the field. However, this method is not effective if the hardware used in the fieldis not the same as that used for original testing, and if the application behavior sig-nificantly changes depending on the input. Another option is to vary the SMT levelonline and observe changes in the instructions-per-cycle (IPC), but this method haslimited applicability, because not all systems allow changing the SMT level online.Furthermore, IPC is not always an accurate indicator of application performance(e.g., in case of spin-lock contention).3.2 The SMT-Selection MetricThe rationale behind the SMT-selection metric is based on how well the instruc-tions of a workload can utilize the various pipelines of a processor during eachcycle. An ideal workload for SMT would have a good mix of instructions thatare capable of filling all available functional units at each cycle. Figure 3.3 showsthe pipeline of a generic processor core. In each cycle, the processor fetches fromthe instruction cache a fixed number of instructions. These instructions are thendecoded and buffered. As resources become available, instructions can be dis-patched/issued to the various execution/functional units in an out-of-order manner.Issue ports are the pathways through which instructions are issued to the variousfunctional units, which can operate independently. If the instructions that are is-sued consist of a mix of load, store, branch, integer, and floating point instructionsand there are little data dependencies between them, then all functional units will44Fetch LogicDecode LogicDispatch, Reorder Buffer & Issue LogicExecution UnitLoad / StoreLoad / StoreData CacheData TranslationBranchInstruction CacheExecution UnitExecution UnitIssue PortsFigure 3.3: A generic processor execution engine.be able to be used concurrently, hence increasing the utilization of the processor.We define the term ideal SMT instruction mix to mean a mix of instructions thatis proportional to the number and types of the processor’s issue ports and functionalunits. With an ideal mix, the processor is able to execute the maximum number ofinstructions supported. In order for SMT to increase utilization there needs to beinstructions available from all the hardware contexts to use as many issue ports aspossible. Consider a multithreaded application whose vast majority of instructionsare fixed point (integer) instructions. Running the application with more hardwarecontexts will not help because the fixed point units were already occupied most ofthe time with one hardware context. On the other hand, if we have an applicationwith an ideal SMT instruction mix, then SMT should improve performance sincethe processor will have more opportunities to fill all the execution units.Since SMTsm must be able to predict whether an application benefits fromadditional SMT resources as we increase the number of threads, it must also in-clude some measure of scalability within the application itself. After all, if thereare software-related scalability bottlenecks, the application will not run better withincreased number of threads irrespective of hardware. We observe that instructionmix, which is crucial for predicting hardware resource utilization in SMTsm, is45also a good indicator of software scalability. An application that spends significanttime spinning on locks will have a large percentrage of branch instructions and ahigh deviation from the ideal SMT mix.Equation 3.1 shows how to calculate the SMTsm metric for the generic proces-sor discussed above, where Pi denotes a unique issue port, N is the total number ofissue ports, DispHeld is the fraction of cycles the dispatcher was held due to lackof resources, TotalTime is the wall-clock time elapsed, and AvgT hrdTime is theaverage time spent by each hardware thread. Smaller metric values indicate greaterpreference for a higher SMT. The metric consists of three factors: i) the instructionmix’s deviation from an ideal SMT instruction mix, ii) the fraction of cycles thatthe dispatcher was held due to lack of resources, and iii) the ratio of the wall-clocktime elapsed to the average CPU time elapsed across all threads. fPi is the fractionof instructions that are issued to Pi. For example, to calculate fP1, the number ofinstructions issued through port 1 is divided by the total number of instructions.SMT sm= (N1∑i=0( fPi1=N)2)1=2DispHeld (TotalTime=AvgT hrdTime)(3.1)The second factor of the SMT-selection metric is the fraction of cycles thatthe dispatcher was held due to lack of resources. The meaning of resources isarchitecture dependent and may include many items but it should primarily referto the issue queues of the execution units. If the issue queues are filling up to thepoint where the dispatcher is held, then having additional instruction streams todispatch from is not going to be useful. This factor is important to have in additionto the instruction mix because it indirectly captures the effect of instruction-levelparallelism and cache misses. The number of cycles the dispatcher is held due toresources is easily obtained through HPEs in many modern processors.The final factor of the metric is the ratio of the wall-clock time elapsed to theaverage CPU time elapsed per hardware thread. This measures scalability limi-tations manifested through sleeping or Amdahl’s law as opposed to busy waiting.46This factor does not have a direct relationship with SMT preference, but scalabilityis an important factor to consider since additional software threads are needed touse the available SMT hardware contexts.In the following subsections, we illustrate how SMTsm metric is measured fortwo different processor architectures: IBM’s POWER7 and Intel’s Nehalem Corei7.3.2.1 SMTsm on IBM’s POWER7 ProcessorIn a given cycle, the POWER7 [55] core can fetch up to eight instructions, decodeand dispatch up to six instructions, and issue and execute up to eight instructions.The core has 12 execution units: two fixed point units, two load/store units, fourdouble-precision floating-point pipelines, one vector unit, one branch unit, onecondition register (CR) unit, and one decimal floating point pipeline. POWER7processors support up to 4-way SMT. In other words, up to four hardware contextscan concurrently use the core. If there is only a single software thread running on acore, the processor automatically runs the core at SMT1 which gives the hardwarethread access to resources that would be partitioned or disabled at higher SMTlevels. Similarly, if there are only two software threads on a core then the core runsat SMT2.Instruction Fetch Unit (IFU)Instruction Dispatch UnitCR IssueQueueBranch IssueQueueUnified Queue 0(UQ 0)Fixed Point (FP0)Load/Store (LS0)Vector Scalar (VS0)Basic FP, VSX FP, VMX FP,VMX Complex,VMX Simple,64-Byte StoreUnified Queue 1(UQ 1)Fixed Point (FP1)Load/Store (LS1)Vector Scalar (VS1)Basic FP, VSX FP, Decimal FP,Permute,64–byte Store,128–byte StoreCR Unit  Branch UnitIssue PortsFigure 3.4: IBM POWER7 out-of-order execution engine.Figure 3.4 shows that an issue port in POWER7 is tied to a type of instruc-47tion. For instance, a fixed point instruction always uses a fixed point issue port.There are a total of eight issue ports: 1 port corresponds to a conditional register(CR) instruction, 1 port corresponds to a branch instruction, the remaining 6 issueports are divided equally between the two unified issue queues, UQ0 and UQ1.Through each UQ, up to one load/store instruction, one fixed point instruction (FP)and one vector scalar (VS) instruction can be issued concurrently. It is importantto note here that the CR unit is a special unit. It is tightly tied to the branch unit.It is also not heavily used in general. This unit has been mainly designed to avoidsending the compare instructions through the FP unit to avoid tying branch predic-tion to the FP unit. Therefore, we consider in our metric both the CR and branchunits as one execution unit. So, an ideal SMT instruction mix for the POWER7architecture would consist of 1/7 loads, 1/7 stores, 1/7 branches, 2/7 FP instruc-tions, and 2/7 VS instructions. The loads and stores are separated because theyrely on separate resources like the load and store buffers. To measure the sec-ond term of the equation (dispatcher held) in POWER7, the hardware performanceevent PM DISP CLB HELD RES can be used. The SMT-selection metric for thePOWER7 processor is shown in Equation 3.2.P7SMT sm= (( fL1=7)2+( fS1=7)2+( fBR1=7)2+( fV S2=7)2+( fFP2=7)2)1=2DispHeld (TotalTime=AvgT hrdTime)(3.2)3.2.2 SMTsm on Intel’s Nehalem ProcessorOn the Nehalem Core i7, the number of issue ports equals the maximum number ofinstructions that can be issued in a cycle (see Figure 3.5). In contrast to POWER7,each of the six issue ports is used for a variety of unrelated instructions [94]. Theunified reservation station serves as a single scheduler for all the execution units.It is responsible for assigning instructions to the different execution units. The48Unified Reservation Stations (36 Entries)Integer ALU &ShiftInteger ALU &LEAInteger ALU &ShiftLoadStoreAddressStoreDataFP MultiplyFP DivideSSE Integer ALUInteger ShufflesFP AddInteger ALU &ShiftInteger ALU &ShiftSSE Integer MultiplyBranchFP ShuffleSSE Integer ALUInteger ShufflesPort 1Data CachePort 0Port 2Port 3Port 4Port 5Figure 3.5: Intel Nehalem out-of-order execution engine.core can issue up to 6 instructions per cycle. Three of them are memory opera-tions (load, store address and store data), and the other three are computationalinstructions (floating point, branch, and integer operations). Intel’s Nehalem coresupports 2-way SMT.Equation 3.3 shows the SMT-selection metric for Intel’s Nehalem Core i7 pro-cessor. The term fPi refers to the fraction of instructions that have been issuedthrough port i ( i 2 [0;1;2;3;4;5]). Since the issue ports on Nehalem are not re-lated to a single type of instruction, we simply measure the number of instructionsissued to each port. All instructions map to a single issue port, except for integerALU instructions which map to three ports, so the mix of instructions sent to eachissue port is sufficient for calculating the SMT-selection metric. Dispatch held canbe obtained using RAT STALLS event with the rob read port unit mask [52].Ci7SMT sm= (5∑i=0( fPi1=6)2)1=2DispHeld (TotalTime=AvgT hrdTime)(3.3)493.3 Experimental Methodology3.3.1 System ConfigurationExperiments were conducted on an AIX/POWER7 system and a Linux/Core i7(Nehalem) system.The AIX/POWER7 system uses AIX 6.1.5 and two 8-core POWER7 chips.For the single-chip experiments, the benchmarks were restricted to run on one8-core chip. The POWER7 CPU is clocked at 3.8 GHz and the system has 64GB of RAM. The C, C++, and Fortran benchmarks were compiled with IBM XLcompiler using these flags: -O3 -qstrict -qarch=auto -qsimd=auto -q64and -qsmp=omp. The MPI programs use the IBM Parallel Operating Environmentversion the Java benchmarks use the 64-bit IBM JVM version 1.6.0. TheSMT levels on POWER7 can be changed without rebooting the system by runningthe smtctl command with privileged access.The Linux/Core i7 system uses Linux kernel 2.6.34 with 3GB of RAM and afour cores Intel Core i7 965 clocked at 3.2 GHz with two SMT threads per core.GCC 4.4.5 was used to compile the benchmarks with the flags -O3 -march=nativeand -fopenmp where appropriate. Unlike POWER7, the SMT level can only bechanged by rebooting and modifying a BIOS setting. In our experiments SMT2 isalways enabled in the BIOS. Therefore to simulate SMT1 we only use one soft-ware thread per core. This better represents typical use cases because SMTsm isdesigned to be used dynamically at run-time.3.3.2 BenchmarksThe experiments use a diverse set of benchmarks to capture the variations in char-acteristics of various workloads. The benchmarks are drawn from the NAS ParallelBenchmarks (NPB) v3.3.1, the PARSEC Benchmark Suite v2.1, the SSCA2 bench-mark, the STREAM synthetic benchmark, the SPEC OMP2001 benchmark suitev3.2 and two commercial benchmarks. Due to compatibility issues we were notable to run all of the benchmarks on the POWER7 system. Due to time constraints,we focused mostly on the POWER7 system, because it supports a higher SMTlevel than Nehalem; as a result we did not run all of the benchmarks on Nehalem.50A brief description of the benchmarks used is outlined below. The NAS parallel benchmark Suite [11] is a set of programs that have beeninitially designed to evaluate the performance of supercomputers. Both theMPI and OpenMP versions were used on AIX/POWER7 but only the OpenMPversions were used on Linux/Core i7. PARSEC benchmarks [20]: PARSEC stands for Princeton Application Repos-itory for Shared-Memory Computers. It is a set of programs designed toevaluate the performance of Chip-Multiprocessors (CMPs). PARSEC bench-marks mimic multithreaded applications from different fields such as recog-nition, mining, and large-scale commercial applications. PARSEC does notofficially support the AIX operating system, so only a handful of the bench-marks were able to be used on the AIX/POWER7 system. SSCA2 benchmarks [10]: SSCA, which stands for the Scalable SyntheticCompact Applications, is a computational graph theory benchmark that usesOpenMP. It consists of four kernels with irregular access to a large, directed,and weighted multi-graph. This benchmark is characterized by integer oper-ations, a large memory footprint, and irregular memory access patterns. STREAM [67] is a synthetic benchmark designed to measure memory band-width and also uses OpenMP. To obtain reasonable running times for ourexperiments, we have increased the the default array size and number of it-erations to 4577.6 MB and 1000 respectively. SPEC OMP benchmark Suite [82] is adapted from the SPEC CPU2000 bench-marks. Its goal is to evaluate the performance of openMP applications onshared memory multi-processors. We have used the SPEC OMP experimentsonly on the AIX/POWER7 system. DayTrader [1] is a Websphere benchmark application that emulates an on-line stock trading system. The application simulates typical trading opera-tions such as login, viewing portfolios, looking up stock quotes, and buyingor selling stock shares. The benchmark consists of a websphere front-end, aDB2 database, and a load generator. The DayTrader client is a Linux Intel51Xeon machine running the JIBE (Websphere Studio Workload Simulator),which simulates a specifiable number of concurrent browser clients. Wesimulated 500 clients for stressing the DayTrader application running on theDayTrader server. This number of clients was sufficient to keep the servercontinuously busy with waiting requests to be processed. SPECjbb2005 is a Java server benchmark from the Standard PerformanceEvaluation Corporation [2] based on the TPC-C benchmark specifications.It simulates a 3-tier system in a JVM with emphasis on the middle tier. SPECjbb05-contention is a custom benchmark derived from SPECjbb2005.The primary change introduced in SPECjbb05-contention is that all workerthreads operate on a single warehouse instance instead of each worker threadoperating on its own warehouse instance. This introduces synchronizationcontention that is not present in SPECjbb2005.3.4 EvaluationIn all of the experiments conducted, the number of software threads used is cho-sen to be the same as the number of available hardware threads/contexts in theOS instance. For example, in the AIX instance on one 8-core POWER7 chip, 32software threads were used at SMT4, 16 software threads were used at SMT2, and8 software threads were used at SMT1. Similarly, on the Linux instance on the4-core Core i7 machine, 8 software threads were used at SMT2, and 4 softwarethreads were used at SMT1.The metric was originally developed and tested on the POWER7 system. Afterfinalizing the metric, it was evaluated on the Core i7 machine.3.4.1 SMT-Selection Metric (SMTsm) EvaluationFigure 3.6 shows the relationship between the SMT-selection metric measured atSMT4 and the speedup obtained on SMT4 relative to SMT1 on the AIX/POWER7system. We can see a clear correlation between the metric value and the speedup,and the correlation is strong enough to predict the optimum SMT level in mostcases. If we set a threshold close to the value of 0.07 then we can be confident that520 0.025 0.05 0.075 0.1 0.125 0.15 0.175 0.2 0.225 0.2500.511.522.53AmmpAppluApsiEquakeFma3dGafortMgridSwimWupwiseBlackscholesBTCG_MPIDedupEPEP_MPIFluidanimateFT_MPIISIS_MPILU_MPIMGMG_MPI SSCA2StreamStreamclusterSPECjbbSPECjbb_contentionDaytraderSMT-selection Metric @ SMT4S MT 4/ SMT 1 S pe ed up← Threshold LineFigure 3.6: SMT4/SMT1 speedup vs. metric evaluated @SMT4 – AIX in-stance on an 8-core POWER7 chip.0%10%20%30%40%50%60%70%80%90%100%blacksholes fluidanimate dedup SSAC2 specjbb_contention idealP7SMTmix% Loads % Stores % Branches % FXU %VSU1.82 1.35 0.86 0.78 0.25SMT4/SMT1 SpeedupFigure 3.7: Instruction mix of five benchmarks – AIX instance on an 8-corePOWER7 chip.530 0.05 0.1 0.15 0.2 0.2500. AppluApsiEquakeFma3dGafortMgridSwimWupwiseBlackscholesBTCG_MPIDedupEPEP_MPIFT_MPIISIS_MPILU_MPIMGMG_MPISSCA2StreamStreamclusterSPECjbbSPECjbb_contentionDaytraderSMT-selection Metric @ SMT4S MT 4/ SMT 2 S pe ed up← Threshold LineFigure 3.8: SMT4/SMT2 speedup vs. metric evaluated @SMT4 – AIX in-stance on an 8-core POWER7 chip.0 0.05 0.1 0.15 0.2 0.25 0.300. CG_MPIDedupEPEP_MPIFluidanimateFT_MPIISIS_MPILU_MPIMGMG_MPISSCA2StreamStreamclusterSPECjbbSPECjbb_contentionDaytraderSMT-selection Metric @ SMT2S MT 2/ SMT 1 S pe ed upFigure 3.9: SMT2/SMT1 speedup vs. metric evaluated @SMT2 – AIX in-stance on an 8-core POWER7 chip.54any application with a metric greater than the threshold will perform better or thesame at SMT1 than SMT4, and applications with a metric less than the thresholdbenefit or are not harmed by using SMT4. This is true for 93% of the benchmarksevaluated. Applications that fall to the left of the threshold are likely to preferSMT4, with only two of the evaluated benchmarks having a metric less than thethreshold and performing slightly worse at SMT4.In Figure 3.7, we clearly see a correlation between the instruction mix andthe SMT4/SMT1 speedup. We have selected representative benchmarks from theset of benchmarks studied. As we move from the left of the figure to the right,the speedup going from SMT1 to SMT4 decreases from 1.82 to 0.25, while theinstruction mix tends to be more and more dominated by one or two functionalunits.The metric versus SMT4/SMT2 speedup on AIX/POWER7 is shown in Fig-ure 3.8. Once again a threshold of 0.07 provides good separation. All of thebenchmarks with a metric greater than the threshold prefer SMT2. Three bench-marks have a metric less than the threshold and a speedup less than 1 but greaterthan 0.9. All of the remaining benchmarks have a metric below the threshold anda speedup greater than 1.The experiment shown in Figure 3.9 is the same as the previous experimentsexcept it uses the SMT2 over SMT1 speedup. In this case, the SMT-selectionmetric is not capable of always making an accurate prediction. For metric valuesbelow 0.07 or above 0.19, we can predict the optimum SMT level. However, formetric values between 0.07 and 0.19, it is not possible to predict the application’sSMT preference.Figure 3.10 shows the SMT-selection metric compared to the SMT2/SMT1speedup on the Linux/Core i7 system. In this experiment, a stronger correlationthan in any of the AIX/POWER7 experiments is observed, but there is one outlieron the far right which is Streamcluster from PARSEC. With only eight softwarethreads running at SMT2, there is less synchronization contention so only a few ofthe benchmarks prefer SMT1 over SMT2. In this case there is not much motivationfor SMT optimization but the experiment does show that the SMT-selection metriccan be adapted to other architectures.550 0.02 0.04 0.06 0.08 0.1 0.12 0.14 0.1600. CGdedupEPfacesimferretfluidanimatefreqmineFTLUraytraceSPstreamclusterswaptionsUAvipsSSCA2x264SMT-selection Metric @ SMT2S MT 2/ SMT 1 S pe ed up← Threshold LineFigure 3.10: SMT2/SMT1 speedup vs. metric evaluated @SMT2 – Linuxinstance on a quad-core Core i7 system.3.4.2 SMTsm Evaluation at a Lower-SMT LevelThe previous subsection evaluated how well the SMTsm estimated performancespeedup is, when the SMTsm is measured at the highest supported SMT level(SMT4). In this subsection, we evaluate the metric when the application is runningat a lower SMT level, and the metric is used to predict the speedup at a higher SMTlevel.Figures 3.11 and 3.12 show the same experiments presented in subsection 3.4.1but with the SMTsm measured at the lowest supported SMT level. The experimentsdid not show a good correlation between the metric and the speedup. This is notsurprising, as the metric is not able to foresee scalability limitations caused by morethreads at a higher SMT level; the metric is only capable of detecting a slowdownwhen it is happening. At SMT1 we are not able to accurately capture contentionas it was the case at SMT4, so the metric breaks down at SMT1. Therefore, it isimportant to use the metric at the highest SMT-level available. Moreover, in allSMT-capable processors, the highest SMT-level is always used as the default sincemany multi-threaded applications benefit from SMT. This motivates further the useof the metric at higher SMT-levels to predict whether going to a lower SMT-level560 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.4500.511.522.53AmmpAppluApsiEquakeFma3dGafortMgridSwimWupwiseBlackscholesBTCG_MPIDedupEPEP_MPIFluidanimateFT_MPIISIS_MPILU_MPIMGMG_MPISSCA2Stream StreamclusterSPECjbbSPECjbb_contentionSMT-selection Metric @ SMT1S MT 4/ SMT 1 S pe ed upFigure 3.11: SMT4/SMT1 speedup vs. metric evaluated @SMT1 – AIX in-stance on an 8-core POWER7 chip.0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.400. Metric @ SMT1S MT 2/ SMT 1 S pe ed upFigure 3.12: SMT2/SMT1 speedup vs. metric evaluated @SMT1 – Linuxinstance on a quad-core i7 system.57benefits the running workload.3.4.3 Metric Evaluation Across Chips0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45 0.500.511.522.53EPBTMGISDedupFluidanimateBlacksholesSSCA2StreamclusterStreamSPECjbb_contentionSPECjbbCG_MPIFT_MPIEP_MPIIS_MPIAmmpAppluApsiEquakeFma3dGafortMgridSwimWupwiseSMT-selection Metric @ SMT4S MT 4/ SMT 1 S pe ed up← Threshold LineFigure 3.13: SMT4/SMT1 speedup vs. metric evaluated @SMT4 – AIX in-stance on two 8-core POWER7 chips.Figures 3.13, 3.14, and 3.15 give the results for the SMTsm prediction experi-ments on an AIX instance running on a two-chip POWER7 system. For these ex-periments there are 16 cores, which means 64 software threads are used at SMT4,32 threads are used at SMT2, and 16 threads are used at SMT1. Using two chips in-troduces two new variables that the metric must compensate for to remain accurate.First, there is a performance penalty for cross-chip communication, so applicationsthat are more sensitive to NUMA effects may affect the metric differently. Second,the number of running software threads is doubled at all SMT levels compared tothe single chip case, so the effect of scalability is amplified.For the SMT4/SMT1 case presented in Figure 3.13, the results are similar tothe SMT4/SMT1 experiment with only one chip. However, there are more bench-marks that are mis-predicted. We also notice, that applications that have a metricnear the threshold are more likely to be mispredicted. Another difference with thesingle chip experiment is that more applications prefer SMT1 over SMT4. This is580 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45 0.500. Metric @ SMT4S MT 4/ SMT 2 S pe ed up← Threshold LineFigure 3.14: SMT4/SMT2 speedup vs. metric evaluated @SMT4 – AIX in-stance on two 8-core POWER7 chips.0 0.05 0.1 0.15 0.2 0.25 0.3 0.3500. SSCA2StreamStreamclusterAmmpAppluApsiEquakeFma3dGafortMgridSwimWupwiseSPECjbb_contentionSPECjbbSMT-selection Metric @ SMT2S MT 2/ SMT 1 S pe ed upFigure 3.15: SMT2/SMT1 speedup vs. metric evaluated @SMT1 – AIX in-stance on two 8-core POWER7 chips.59expected since with more software threads, more contention for synchronizationresources will be introduced, and hence more scalability limitations.The SMT4/SMT2 results (Figure 3.14) look better than the SMT4/SMT1 re-sults, but there is still only a small difference in metric values between SMT4-preferring applications and SMT1-preferring applications. Figure 3.15 demon-strates that SMT2/SMT1 prediction is ineffective, the same as in the single chipcase.SMT preference prediction is important for large systems with many cores be-cause more applications will be hindered by SMT as synchronization overheadand contention over CPU resources overtake the benefits of SMT. The results showthat the SMT-selection metric is still useful at 16 cores, but more work needs to bedone since the metric is less accurate at 16 cores than at 8 cores. One possibility isthat the scalability detection aspect of the metric starts to break down with a largenumber of threads. This is supported by the fact that the metric works better atSMT4/SMT2 prediction with 16 cores, since the change in the number of softwarethreads is smaller than when predicting SMT4/SMT1 speedup.3.5 Applying the SMT-Selection MetricThe SMT-selection metric can be used by operating systems to guide schedulingdecisions. It can also be used by user-level optimizers or application tuners todynamically adjust the SMT level of the underlying system to improve the perfor-mance of running applications.To use the SMT-selection metric, the formula must first be adapted to the targetarchitecture. In section 3.2, we presented the metric for the IBM POWER7 andIntel Nehalem architectures. The metric can be ported to other architectures insimilar ways. The threshold for changing the SMT level needs to be determinedfor each new system. This can be achieved by running a representative set ofworkloads, recording the SMT speedups and the observed SMTsm metric valuesfor each workload, as we did in section 3.4. Once the (metric, speedup) values aregathered, the threshold can be obtained automatically using statistical techniques.We describe two methods to obtain a good SMTsm threshold for deciding when achange in SMT level would benefit the performance of a given application.603.5.1 Using Gini Impurity to Decide on a Good SMTsm ThresholdGini impurity [77] is a measure of how well separated or clustered a set is. We lookfor a separator (potential threshold) that leads to the lowest overall Gini impurityas follows:1. Re-label the (metric, speedup) tuples into the form (metric, i) with (i 2f0;1g), setting i=0 if the speedup is less than 1, and i=1 if the speedup isgreater than or equal to 1.2. Divide the tuples into 2 sets fL=Left-set, R=Right-setg based on whether themetric value is to the left or to the right of the separator value.3. Calculate the Gini impurity of the left-set (IL) and the right-set (IR) as shownin equations 3.4 and 3.5, where jL0j denotes the size of the left-set withi= 0 (i.e., speedup< 1), jL1j denotes the size of the left-set with i= 1 (i.e.,speedup  1), and jLj is the size of the entire left set (jLj = jL0j+ jL1j).Similar notation is used with the right set.IL = 1jL1jjLj2jL0jjLj2(3.4)IR = 1jR1jjRj2jR0jjRj2(3.5)4. Calculate the overall Gini impurity using equation 3.6.Impurity=jLjjL+Rj IL+jRjjL+Rj IR (3.6)An impurity of 0 indicates that the set is perfectly separated, i.e. all of thesample points to one side of the separator have a speedup greater than or equalto 1, and all of the remaining points are on the other side. A high impurity valuemeans that the selected separator is not a good classifier, and vice versa.61Figure 3.16 shows the results of using Gini impurity to provide a suitable metricthreshold value at which to decide on performing a change in SMT level to improveperformance, when using SMT4/SMT1 speedup data on POWER7. The dottedvertical lines mark the range of optimal thresholds. The figure also displays twoeasy ways to observe the qualitative fitness of the SMT-selection metric on a givensystem and a set of benchmarks. First, is how low the impurity is at its lowestpoint, which represents how good a prediction can be made. In the figure, thelowest impurity is 0.23 which is good, as verified by the fact that only four of thebenchmarks were misclassified with the threshold obtained by this method (referto Figure 3.6 which has four points to the left of the separator with a speedupbelow 1). Second, is how large the range of optimal thresholds is. If the range isvery small, then a new application with a metric beyond that range is likely to bemispredicted.0 0.05 0.1 0.15 0.2 0.2500. mp ur i tyFigure 3.16: Total overall Gini impurity for potential thresholds of theSMTsm metric for SMT4/SMT1 speedup on POWER7.3.5.2 Using the Average PPI (Percentage Performance Improvement)Method to Decide on a Good SMTsm ThresholdWith this method we are trying to estimate how much performance improvementwe would obtain if we switched from the default SMT level (e.g., SMT4) to a lower62one (e.g., SMT1) as dictated by different thresholds. The threshold with the highestestimated Percentage Performance Improvement (PPI) is deemed the best. In orderto do this, for each potential SMTsm threshold value, and for each benchmark, weestimate a PPI value as follows: If the benchmark’s measured SMTsm value is less than the threshold in ques-tion, then its PPI is set to 0. Essentially, this means that the benchmark isnot expected to benefit from a lower SMT setting, so its expected PPI fromswitching to lower SMT level is zero. If the benchmark’s measured metric value is greater than the threshold valuein question, then the PPI is set to (( 1SMT 4=SMT 1 speedup  1)  100. In otherwords, if the benchmark is expected to benefit from a lower SMT settingbased on the current threshold, we calculate PPI as the performance im-provement at SMT1 relative to SMT4 (expressed in percent).Then, we take the average of the PPIs over the whole set of benchmarks as theY-value to plot against that threshold value. This gives us the average expectedperformance improvement at each threshold level. Examining this data, we canchoose the best threshold – the one that gives us the highest PPI.Figure 3.17 shows an example of using this method for SMT4/SMT1 perfor-mance improvement prediction on POWER7. The results are similar to those usingGini impurity, but this methods provides the following additional benefits:1. It can be used to easily show how much performance improvement the SMTsmmetric can provide. The Gini impurity method only shows that the metric isworking, but cannot show potential improvements.2. It also gives a view of potential PPIs over a range of threshold values. Eventhough the range of optimal metric thresholds is relatively small in bothmethods, we can see from Figure 3.17 that there is actually a large rangeof potential threshold values where we have an average PPI that is greaterthan 15%. This means that a new application whose metric value falls intothis range is not likely to experience a severe negative effect from using thismetric value as a threshold for deciding on a change in SMT level.630 0.05 0.1 0.15 0.2 0.250510152025ThresholdAv er ag e I mp ro ve me nt  ov er  De fa ul t  ( %)Figure 3.17: Average SMT4/SMT1 percentage performance improvement ofall the benchmarks vs. SMTsm values – AIX instance on POWER7.3. This method can also provide a better threshold than the Gini impurity methodin some cases, because Gini impurity does not consider the amount of speedup.For example, there could be a few benchmarks with speedup values just be-low one, and a benchmark with a very large speedup just to the right of them.In this case Gini impurity would suggest putting the threshold to the left ofall the mentioned benchmarks so as to classify more benchmarks correctly,whereas this method would suggest putting the threshold to the right, therebypreserving the large speedup in return for minimal slow-downs in the otherbenchmarks.3.6 Related WorkSMT job schedulers and SMT performance characterization comprise the majorityof related work. The SMT job schedulers are designed to find high performing(often referred to as symbiotic) co-schedules from a larger selection of runningapplications. They do not attempt to optimize the SMT level itself like our SMT-selection metric. The SMT performance characterizations do investigate the effectof the SMT level but none of them propose a general metric or algorithm for op-timizing it. Additionally, most of the previous work focuses on single-threaded64applications while our work studies multi-threaded applications.Mathis et al [66] evaluate and analyze the effect of SMT2 on the POWER5 CPUwith single-threaded applications. To measure the SMT2 gain of an application,they simply run one copy of the application per available hardware thread/contextwith and without SMT. The authors found that most of the tested applications havea moderate performance improvement with SMT. They also found that applicationswith the smallest improvement have more cache misses when using SMT. Thisresult is less applicable to multi-threaded applications because the total amount ofwork and data does not increase as the number of threads increases, like it doeswhen you run more copies of a single-threaded application and because threads ofa multi-threaded application may share data.Ruan et al [80] evaluate and analyze the effect of SMT on network servers butdo not attempt any optimization. They found that the overhead of using an SMP-capable kernel sometimes outweighs the benefit of SMT, but this is irrelevant todaysince all modern server CPUs are at least dual-core. The authors also discoveredthat SMT can sometimes hurt performance when there is more than one CPU whichsupports our claim that the SMT level should be optimized.Snavely and Tullsen [87] describe a job scheduler for SMT systems calledSOS (Sample, Optimize, Symbios). The goal of SOS is to choose an effectiveco-schedule of applications from a pool of ready-to-run applications. SOS has asampling phase where it tries many different co-schedules and measures a perfor-mance predictor metric from hardware counters. Then, it has a symbiotic phasewhere it runs the co-schedules with the best predicted performance. The authorsevaluated several different predictors and found that a high IPC and a low L1 datacache miss rate are both good predictors. They did try a predictor based on theinstruction mix, but it only looked at integer and floating point instructions and itdid not take into account the mix of execution units. Snavel et al [88] extendedSOS to support application priorities. Overall, SOS is effective for finding goodco-schedules among many single-threaded applications, but it is not designed tochoose the best SMT level for a multi-threaded application.Settle et al [84] designed a job scheduler similar to SOS in its goals: findoptimal co-schedules from a set of single-threaded applications. They use customhardware performance counters to create a fine-grained view of the cache access65patterns of the applications, from which they derive co-schedules with an averageof 7% improvement over the default scheduler.Eyerman and Eeckhout [42] propose an SMT job scheduler that is meant tosurmount the shortcomings of SOS. They use a probabilistic model to co-scheduleapplications without the need for a sampling phase, and it can be configured tooptimize for throughput or turn-around time. The major downside of their approachis that it requires specialized CPU counters that are not available on commercialhardware.Tam et al [93] present a solution for scheduling threads on SMP-CMP-SMTsystems. Their goal is to reduce remote cache accesses by scheduling threads to-gether that access the same data. The authors approach this problem by using hard-ware performance counters to monitor the addresses of memory that cause remotecache accesses and then scheduling together (on the same chip or on the same core)threads that access the same memory. They achieve 5-7% performance improve-ments for a handful of applications, but their system is not meant to determine theoptimal SMT level.Parallel scalability prediction is also related to our work, since the SMT-selectionmetric must estimate the effect of scalability when changing the SMT level. Pre-vious work in the area requires many sample data points [37][14] or access to thesource code of the application [68], so it is not suited to our purpose. Our approachonly uses information available at run-time to detect scalability limitations and isaccurate enough for SMT-selection prediction for most applications. Unlike otherworks, we only attempt to determine if an application is experiencing slowdowndue to scalability limitations, i.e. we do not try to predict the upward scalability ofan application.The WASH AMP scheduler [53] is one recent work that predicts applicationscalability online. It does so by instrumenting locks and measuring the relativetime of waiting for locks against total run time. The downside is that instrumenta-tion must be implemented for each parallel environment (WASH AMP specificallytargets the Java VM), and for some environments the instrumentation may incuroverhead.663.7 SummarySimultaneous multithreading can provide substantial benefits in the utilization ofCPU resources. However, automatically predicting when SMT fails to providethe expected increase in performance for many applications is still not a well-understood area of research.This chapter examines a methodology for SMT-level selection. At the heart ofour methodology is the SMT-selection metric that is capable of predicting poten-tial change in application performance when the SMT-level is changed. We haveshown that it is very difficult to predict SMT preference by just relying on cer-tain parameters like cache misses, branch mispredictions, number of floating pointinstructions, or CPI.Our performance evaluation used a large number of multithreaded standardbenchmarks that represent a wide range of applications behavior. Our results haveshown that the SMT-selection metric was able to predict the correct SMT speedupin 93% of the cases on the IBM POWER7 processor, and in 86% of the cases on theIntel Nehalem processor. The metric can easily be adapted to other architecturesonce we have a good understanding of the issue ports and functional units usedby the target architecture. We have also presented an algorithm based on the Giniimpurity that can be used to accurately obtain a range of SMT-selection metricthresholds that can be used by schedulers or application optimizers.While we tried to capture most of the factors that could impact SMT perfor-mance for a general microprocessor, the SMTsm still does not address directlysome issues like instruction-level dependencies and relative execution speeds ofvarious instruction types. SMTsm attempts to approximate such effects indirectlythrough the dispatch-held factor. Studying such effects is the subject of our futureinvestigations. More future work needs to be done to increase the accuracy of pre-diction, to test the metric on other architectures, to improve the scalability of themetric when applied to a much larger number of cores.67Chapter 4NUMA Traffic Managementthrough Memory Placement1As we have seen in Chapter 2, NUMA can have a huge effect on performance,and careful thread placement is crucial for performance. By choosing the correctnumber of nodes and the correct interconnect links, one can balance the latencyand bandwidth requirements of an application. The other side of the coin to threadplacement is memory placement. A complete solution must both place threads andplace memory correctly. We envision that the placement algorithm in Chapter 2would be used to place threads first, and then our algorithm for memory placementdescribed in this chapter would be used to place memory intelligently.Optimal performance on NUMA systems can be achieved only if we placememory in consideration of the system’s physical layout and the application’s char-acteristics. How to achieve this on modern systems with acceptable overhead is theprimary research question of this chapter, and the solution to the problem, de-scribed in Section 4.3, is the main contribution.Attribution: Mohammad Dashti, Fabien Gaud, and I jointly conducted theinitial investigation into NUMA effects, including the discovery of the importanceof congestion over locality (reported in Section 4.1 and Section 4.2). I designedand implemented the page-level replication mechanism described in Section 4.3.31This chapter is a modified version of work previously published in [36]68with debugging help from Fabien Gaud. Fabien Gaud designed and implementedthe Carrefour algorithm.4.1 BackgroundPrevious work on NUMA-aware memory placement focused on maximizing local-ity of accesses, that is, placing memory pages such that data accesses are satisfiedfrom a local node whenever possible. That was done to avoid very high costs ofremote memory accesses. Contrary to insights from previous work, we discoverthat on modern NUMA systems remote wire delays, that is, delays resulting fromtraversing a greater physical distance to reach a remote node, are not the mostimportant source of performance overhead. On the other hand, congestion on in-terconnect links and in memory controllers, which results from high volume of dataflowing across the system, can dramatically hurt performance. This motivates thedesign of new NUMA-aware memory placement policies.To make these statements concrete, consider the following facts. On NUMAsystems circa 1990s, the time to access data from a remote node took 4-10 timeslonger than from a local node [98]. On NUMA systems that are built today, re-mote wire delays add at most 30% to the cost of a memory access [25]. For mostprograms, this latency differential alone would not have a substantial impact onperformance. However, fast modern CPUs are able to generate memory requestsat very high rates. Massive data traffic creates congestion in memory controllerqueues and on interconnects. When this happens, memory access latencies can be-come as large as 1000 cycles, from a normal latency of only around 200. Such adramatic increase in latencies can slow down data-intensive applications by morethan a factor of three. Fortunately, high latencies can be avoided or substantiallyreduced if we carefully place memory pages on nodes so as to avoid traffic conges-tion.In response to the changes in hardware bottlenecks, we approach the problemof thread and memory placement on NUMA systems from an entirely new per-spective. We look at it as the problem of traffic management. Our algorithm, calledCarrefour2, places threads and memory so as to avoid traffic hotspots and prevent2Carrefour– (French) intersection, crossroads.69congestion in memory controllers and on interconnect links. This is akin to trafficmanagement in the context of city planning: popular residential and business hubsmust be placed so as to avoid congestion on the roads leading to these destinations.The mechanisms used in our algorithm: e.g., migration and replication of mem-ory pages, are well understood, but the algorithm itself is new. Our algorithmmakes decisions based on global observations of traffic congestion. Previous algo-rithms optimized for locality, and relied on local information, e.g., access patternof individual pages. We found that in order to effectively manage congestion onmodern systems we need an arsenal of techniques that go beyond optimizing local-ity. While locality plays a role in managing congestion (when we reduce remoteaccesses, we reduce interconnect traffic), alone it is not sufficient to achieve thebest performance. The challenge in designing Carrefour was to understand how tocombine different mechanisms in an effective solution for modern hardware.Implementing an effective NUMA-aware algorithm on modern systems presentsseveral challenges. Modern systems do not have the same performance monitoringhardware that was present (or assumed) on earlier systems. Existing instructionsampling hardware cannot gather the profiling data needed for the algorithm withthe desired accuracy and speed. We had to navigate around this problem in ourdesign. Furthermore, the memory latencies that we are optimizing are lower thanon older systems, so we can tolerate less overhead in the algorithm.We implemented Carrefour in Linux and evaluated it with several data-centricapplications: k-means clustering, face recognition, map/reduce, and others. Car-refour improves performance of these applications, with the largest gain of 3.6speedup. When memory placement cannot be improved Carrefour never hurts per-formance by more than 4%. Existing NUMA-aware patches for the Linux kernelperform less reliably and in general fall short of improvements achieved with Car-refour.4.2 Traffic Congestion on Modern NUMA SystemsIn this section, we demonstrate that the effects of traffic congestion are more sub-stantial than those of wire delays, and motivate why memory placement algorithmsmust be redesigned. To that end, we report data from two sets of experiments. In70 0 20 40 60 80 100BT    CG    DC    EP    FT    IS    LU    MG    SP    UA    bodytrack    facesim    fluidanimate    streamcluster    swaptions    x264    kmeans    matrixmult    PCA    wrmem    Per for mance di ff er ence (%)(a) Performance difference for single-thread versions of applications between local andremote memory configurations. 0 20 40 60 80 100BT (F)CG (F)DC (F)EP (-)FT (F)IS (I)LU (F)MG (F)SP (F)UA (F)bodytrack (-)facesim (I)fluidanimate (-)streamcluster (I)swaptions (-)x264 (I)kmeans (I)matrixmult (-)PCA (I)wrmem (F)Per for mance di ff er ence (%)(b) Absolute performance difference for multi-thread versions of applications betweenFirst-touch (F) and interleaving (I).Figure 4.1: Performance difference of applications depending on the threadand memory configuration.71the first set, our goal is to measure the effects of wire delays only. We run ap-plications in two configurations: local-memory and remote-memory. Under local-memory, the thread and its data are co-located on the same NUMA node; underremote-memory, the thread runs on a different node than its data. To ensure thatwire delay is the dominant performance factor, we had to avoid congestion onmemory controllers and interconnects, so we run one application at a time and useonly one thread in each application. We do not include applications with CPUutilization less than 30%, because memory performance is not their main bottle-neck. The experiments are run on a system described in Section 4.4 as MachineA. We use applications from the NAS, PARSEC and map/reduce Metis suites, alsodescribed in Section 4.4.Figure 4.1(a) shows relative completion time under remote-memory vs. local-memory configuration. The performance degrades by at most 20% under remote-memory, which is consistent with at most 30% difference in local-vs-remote mem-ory latencies measured in microbenchmarks [25].In the second set of experiments, we want to observe traffic congestion, so werun each application with as many threads as there are cores. Threads are boundto their own cores and threads may access memory from any of the four NUMAnodes. We demonstrate how performance varies under two memory placementpolicies on Linux, as they induce different degrees of traffic congestion. The firstpolicy is First-touch (F): the default policy where the memory pages are placedon the node where they are first accessed. The second policy is Interleaving (I),where memory pages are spread evenly across all nodes. Although these are notthe only possible and not necessarily the best policies, comparing them illustratesthe salient effects of traffic congestion.Figure 4.1(b) shows the absolute difference in completion time achieved un-der first-touch and interleaving. The policy that performed the best is indicatedin parenthesis next to the application name; a “-” is shown when the applicationperforms equally well with either policy. We observe that the differences are oftenmuch larger than what we can expect from wire delays alone. For Streamcluster, ak-means clustering application from PARSEC, the performance varies by a factorof two depending on the memory placement policy!To illustrate that these differences are due to traffic congestion, we show in Ta-72Streamcluster PCABest (I) Worst (F) Best (I) Worst (F)Local access ratio 25% 25% 25% 33%Memory latency 476 1197 465 660Mem-ctrl. imbalance 8% 170% 5% 130%IC: imbalance, (avg) 22% (59%) 85% (33%) 20% (48%) 68% (31%)L3MPKI 16.85 16.89 7.35 7.4IPC 0.29 0.15 0.52 0.36Table 4.1: NUMA traffic congestion effectsble 4.1 some supporting data for the two applications, Streamcluster and PCA (amap/reduce application):Local access ratio: The percent of all memory accesses sourced from a local node.Memory latency: The average number of cycles to satisfy a memory request fromany node.Memory controller imbalance: The standard deviation of the load across allmemory controllers, expressed as percent of the mean. Load is measured as thenumber of requests per time unit.Average interconnect (IC) usage: The utilized interconnect bandwidth as percentof total, averaged across all links.Interconnect (IC) imbalance: The standard deviation of utilization across thelinks as percent of mean utilization.L3MPKI: The number of last-level (L3) cache misses per thousand instructions.IPC: The number of instructions per cycle.The data in Table 4.1 leads to several curious observations. First, we see thatlocality of memory accesses either does not change regardless of the memory man-agement policy, or decreases under the better performing policy. For Streamclus-ter, most of the memory pages happen to be placed on a single node under first-touch (because a single thread initializes them at the beginning of the program).Under interleaving the pages are spread across all nodes, but since the threads ac-cess data from all four nodes, the overall access ratio is about the same in bothconfigurations. For PCA, interleaving decreases the local access ratio and yet in-creases performance. So the first surprising conclusion is that better locality does73not necessarily improve performance!And yet, the IPC substantially improves (2 for Streamcluster and 41% forPCA), while the L3 miss rate, as well as L1 and L2 miss rates, remain unchanged.The explanation emerges if we look at the memory latency. Under interleaving, thememory latency reduces by a factor of 2.48 for Streamcluster and 1.39 for PCA.This effect is entirely responsible for performance improvement under the betterpolicy. The question is, what is responsible for memory latency improvements? Itturns out that interleaving dramatically reduces memory controller and intercon-nect congestion by alleviating the load imbalance and mitigating traffic hotspots.Rows 5, 6 in Table 4.1 show significant reductions in imbalance under interleaving,and Figure 4.2 illustrates these effects visually for Streamcluster. So even withoutimproving locality (we even reduce it for PCA), we are able to substantially im-prove performance. And yet, existing NUMA-aware algorithms disregarded trafficcongestion, optimizing for locality only. Our work addresses this shortcoming.Although the two selected applications performed significantly better under in-terleaving, this does not mean that interleaving is the only desired policy on modernNUMA hardware. In fact, as Figure 4.1(b) shows, many NAS applications fared alot worse with interleaving. In the process of designing the algorithm we learnedthat a range of techniques — interleaving, page replication and co-location — mustbe judiciously applied to different parts of the address space depending on globaltraffic conditions and page access patterns. So the challenge in designing a goodalgorithm is understanding when to apply each technique, while navigating aroundthe challenges of obtaining accurate performance data and limiting the overhead.4.3 Design and ImplementationWe begin by describing the mechanisms composing the algorithm: page co-location,interleaving, and replication. Then we explain how they fit together.4.3.1 The MechanismsPage co-location is when we re-locate the physical page to the same node as thethread that accesses it. Co-location works well for pages that are accessed by asingle thread or by threads co-located on the same node.741%1%1%97%25% 25%25% 25%Figure 4.2: Traffic imbalance under first-touch (left) and interleaving (right)for Streamcluster. Nodes and links bearing the majority of the trafficare shown proportionately larger in size and in brighter colors. Thepercentage values show the fraction of memory requests destined foreach node. The figure is drawn to scale.Page interleaving is about evenly distributing physical pages across nodes.Interleaving is useful when we have imbalance on memory controllers and inter-connect links, and when pages are accessed by many threads. Operating systemsusually provide an interleaving allocation policy, but only give an option to enableor disable it globally for the entire application. We found that interleaving worksbest when judiciously applied to parts of the address space that will benefit from it.Page replication is about placing a copy of a page on several memory nodes.Replication distributes the pressure across memory controllers, alleviating traffichotspots. An added bonus is eliminating remote accesses on replicated pages.When done right, replication can bring very large performance improvements. Un-fortunately, replication also has costs. Since we keep multiple copies of the samepage, we must synchronize their contents, which is like running a cache coherencyprotocol in software. The costs can be very significant if there is a lot of fine-grained read/write sharing. Another potential source of overhead is the synchro-nization of page tables. Since modern hardware walks page tables automatically,page tables themselves must be replicated across nodes and kept in sync. Finally,replication increases the memory footprint. We should avoid it for workloads withlarge memory footprints for fear of increasing the rate of hard page faults.75Global statisticsMC-IMB Memory controller imbalance (as defined in Section 4.2)LAR Local access ratio (as defined in Section 4.2)MAPTU Memory (DRAM) accesses per time unit (microsecond)Per-application statisticsMRR Memory read ratio. Fraction of DRAM accesses that are readsCPU% Percent CPU utilizationPer-page statisticsNumber of accesses The number of sampled data loads that fell in that pageAccess type Read-only or read-writeTable 4.2: Statistics collected for the algorithm.4.3.2 The AlgorithmOur memory management algorithm has three components: measurement, globaldecisions and page-local decisions. The measurement component continuouslygathers various metrics (Table 4.2) that later drive page placement decisions. Globaland per-application metrics are collected using HPEs with very low overhead. Per-page statistics are collected via instruction-based sampling (IBS) [40], which canintroduce significant overheads at high sampling rates. Section 4.3.3 describeshow we keep the overheads at bay. Global decisions are based on system-widetraffic congestion and workload properties which determine what mechanisms touse. Page-local decisions examine access patterns of individual pages to decidetheir fate.Global DecisionsThe global decision-making process is outlined in Figure 4.3.Step 1: We decide whether to enable Carrefour. We only want to run Car-refour for applications that generate substantial memory traffic. Other applicationswould not be affected by memory placement policies, so there is no reason tosubject them to sampling overhead. This decision is driven by the application’smemory access rate (MAPTU – see Table 4.2). Carrefour is enabled for applica-tions with the MAPTU above a certain threshold. The MAPTU threshold is to bedetermined experimentally and the right setting may vary from system to system.76MAPTU > 50 ?Enable CarrefourDisable CarrefourYesNoStep 1MRR > 95%&&Free RAM ≥ 1 −_	?Enable replicationDisable replicationYesNoStep 2MC_IMB > 35% ?Enable interleavingDisable interleavingYesNoStep 3LAR < 80% ?Enable co-locationDisable co-locationYesNoStep 4Figure 4.3: Global decisions in Carrefour.We found the threshold of 50 MAPTU worked well on all hardware we evaluated,and the performance was not very sensitive to its choice. To determine the rightMAPTU threshold on a system very different from ours, we recommend runninga benchmark suite under different NUMA policies, noting which applications areaffected and using the lowest observed MAPTU from those experiments.Once we decided whether there is sufficient memory traffic to justify runningCarrefour, we need to decide which of the available mechanisms, replication, inter-leaving and co-location, should be enabled for each application given its memoryaccess patterns. The goal here is to choose the most beneficial techniques and avoidany associated overhead. The next three steps take care of this decision.Step 2: We decide whether it is worthwhile to use replication. Replication risksintroducing significant overheads if it forces us to run out of RAM (and causesadditional hard page faults) or requires frequent synchronization of pages acrossnodes (see more discussion in Section 4.3.3). To avoid the first peril, we conserva-tively enable replication only if there is sufficient free RAM to replicate the entireresident set. That is, the fraction of free RAM must be at least 1 1NUM NODES . Thisis a conservative threshold, because not all pages will be replicated, and not all res-ident pages will be accessed frequently enough to generate significant page faultoverhead if evicted. Evaluating the trade-off between replication benefit and po-tentially increased page-fault rate was outside the scope of the work. This requiresworkloads that both benefit from replication and have very large memory-residentsets, which we did not encounter in our experiments.To avoid the overhead associated with the synchronization of page content77across nodes, we do not replicate pages that are frequently written. An applicationmust have the memory read ratio (MRR) of at least 95% in order for its memorypages to be considered for replication3. The setting of this parameter can have avery significant effect on performance. While we found that the performance wasnot sensitive when we varied the parameter in the range of 90-99%, it is alwayssafe to err on the high side.Step 3: We decide whether to use interleaving. Interleaving improves perfor-mance if we have large memory controller imbalance. We enable interleaving ifmemory controller imbalance is above 35%, but found that the performance wasnot highly sensitive to this parameter. Applications that benefit from interleavingusually begin with a very large imbalance.Step 4: We decide whether or not to enable co-location. Co-location will betriggered only for pages that are accessed from a single node, and so it will notexacerbate the imbalance if memory-intensive threads are evenly spread acrossnodes. Therefore, we enable co-location if the local access rate is slightly less thanideal (LAR < 80%). Performance is not highly sensitive to this parameter; weobserved that if this parameter is completely eliminated from the algorithm andco-location is always enabled then the largest performance impact is only a fewpercent.Although we expect that optimal settings for the parameters used in the algo-rithm would vary from one system to another, we found that we did not need toadjust the settings when we moved between the two experimental systems used inour evaluation. Although our systems had the same number of nodes and both usedAMD CPUs, they differed in the number of cores per node, the cache-coherencyprotocol (broadcast vs. directory-based), and one had a higher interconnect through-put than the other. Therefore, it is possible that the algorithm parameters settingsare rather stable across all but drastically different systems.3MRR is approximated as fraction of L1 refills from DRAM in modified state, because there isno HPE that provides this quantity precisely per core, as opposed to per-node. Similarly, due to HPElimitations described in Section 4.3.3, it is very difficult to measure the MRR per page. That is whywe use the MRR for the entire application.78Page-local DecisionsCarrefour makes page-local decisions depending on the mechanisms enabled: e.g.,pages are only considered for replication if replication is enabled for that applica-tion. The following explanation assumes that all three mechanisms are enabled.To decide the fate of each page, we need at least two memory-access samplesfor that page. If the page was accessed from only a single node we migrate it to thatnode. If the page is accessed from two or more nodes, it is a candidate for eitherinterleaving or replication. If the accesses are read-only, the page is replicated.Otherwise it is marked for interleaving. To decide where to place a page markedfor interleaving, we use two probabilities: Pmigrate and Pnode. Pmigrate determinesthe likelihood of migrating the page away from the current node. Pmigrate is theMAPTU of the current node as the fraction of MAPTU on all nodes, so the higherthe load on the current node relative to others, the higher the chance that we willmigrate a page. Pnode gives us the probability of migrating a page to a particularnode, and it is the complement of Pmigrate for that node, so Carrefour will migratethe page to the least loaded node.4.3.3 ImplementationWe implemented Carrefour in the Linux kernel 3.6.0. Carrefour measures the se-lected performance indicators, and with periodicity of one second makes decisionsregarding page placement and resets statistic counters. To a large extent, Car-refour relies on well-understood mechanisms in the Linux kernel, such as physicalpage migration. The non-trivial aspects of the implementation were understandinghow to accomplish fast and accurate sampling of memory accesses and navigatingaround the overheads of replication. We describe how we overcame these chal-lenges in the two sections that follow.Fast and Accurate Memory Access SamplingA crucial goal of the algorithm is to quickly and accurately detect memory pagesthat cause the most DRAM accesses, and accurately estimate the read/write ratioof those pages. To that end, we used Instruction-Based Sampling (IBS): hardware-supported sampling of instructions available in AMD processors. Intel processors79support similar functionality in the form of PEBS: Precise Event-Based Sampling.IBS can be configured to deliver instruction samples at a desired interval, e.g., af-ter expiration of a certain number of cycles or micro-ops. Each sample containsdetailed information about the sampled instruction, such as the address of the ac-cessed data (if the instruction is a load or a store), whether or not it missed inthe cache and how long it took to fetch the data. Unfortunately, every deliveredsample generates an interrupt, so processing samples at a high rate becomes verycostly. Other systems that relied on IBS performed off-line profiling [57, 75], sothey could tolerate much higher overhead than what would be acceptable in ouronline algorithm.After experimenting with IBS on our systems, we found that for most appli-cations the sampling interval of 130,000 cycles incurs a reasonable overhead ofless than 5%. The desired sampling rate can be trivially derived for new systems:it amounts to experimenting with different sampling rates and settling for the onethat generates acceptable runtime overhead.Our initial decision was to filter out all the samples that did not generate aDRAM access. However, we found that the resulting number of samples was ex-tremely low. Even very memory-intensive workloads access DRAM only a fewtimes for every thousand instructions. That, combined with a low IBS samplingfrequency, gave us the sampling rate of less than one hundred thousandth of a per-cent, and made it very difficult to generate a sufficient number of samples. Further-more, filtering samples that did not access DRAM caused us to miss the accessesgenerated by the hardware prefetcher. These accesses are not part of any instructionso they will not be tagged by IBS. For prefetch-intensive applications, we obtain avery small number of samples and a very distorted read-write ratio.To address this problem, we used two solutions. First, is the adaptive samplingrate. When the program begins to run, we sample it at a relatively high rate of1/65,000 cycles. If after this measurement phase we take fewer than ten actionsin the algorithm (an action is any change in page placement) we switch to a muchlower rate of 1/260,000 cycles, which has a negligible performance impact. Other-wise we continue sampling at the high rate.The second solution was, when filtering IBS samples, to retain not just the datasamples that accessed the DRAM, but those that hit in the first-level cache as well.80First-level cache loads include accesses to prefetched data, so we avoid prefetcher-related inaccuracy. On the one hand, considering cache accesses can introduce“noise” in the data, because we could be sampling pages that never access DRAM.On the other hand, Carrefour is only activated for memory-intensive applications,and for them there is a higher correlation between the accesses that hit in the cacheand those that access DRAM.With these two solutions combined, we were able to successfully identify thepages that are worth replicating, while this was nearly impossible prior to intro-ducing these solutions. For example, for Streamcluster we used to be able to detectonly a few percent of the pages that are worth replicating, but with these solutionsin place, we were able to identify 100% of them4.However, even though performance became much better (we were able to speedup Streamcluster by 26% relative to the default kernel), we were still far from the“ideal” manual replication, which sped it up by more than 2.5. To approach idealperformance, we had to mitigate the overheads of replication, which we describenext.ReplicationReplication has overhead from the following three sources. First, there is the initialset-up cost and slightly more expensive page faults. Modern hardware walks pagetables automatically, so a separate copy of a page table must be created for eachnode. Page faults become slightly more costly, because a new page table entrymust be installed on every node. To avoid these costs when we are not likely tobenefit from replication, we avoid replication unless the applications has at least afew hundred pages marked for replication5.The second source of overhead comes from additional hard page faults if weexceed the physical RAM capacity by replicating pages. As explained earlier, weavoided this overhead by conservatively setting the free memory threshold whenenabling replication.The final and most significant source of overhead stems from the need to syn-4Streamcluster holds shared data in a single large array, so it is trivial to detect which data isworth replicating and implement a manual solution to use as the performance upper-bound.5We use the threshold of at least 500 pages. Performance is not highly sensitive to this parameter.81chronize the contents of replicated pages when they are written. This involves aphysical page copy and is very costly. Before explaining how we avoid this over-head we provide a brief overview of our implementation of replication.In Linux, a process address space is represented by a mm struct, which keepstrack of valid address space segments and holds a pointer to the page table, whichis stored as a hierarchical array. Since modern hardware walks page tables au-tomatically, we cannot modify the structure of the page table to point to severalphysical locations (one for each node) for a given virtual page. Instead, we mustmaintain a separate copy of the page table for each node and synchronize the pagetables when they are modified, even for virtual pages that are not replicated. Linuxdictates that the page table entry (PTE) be locked when it is being modified. Wedo not make any changes to this locking protocol. The only difference is that wedesignate one copy of the page table as the master copy, and only lock the PTE inthe master copy while installing the corresponding PTEs into all other replicas.When a page is replicated, we create a physical copy on every memory nodethat runs threads from the corresponding application. We install a different virtual-to-physical translation in each node’s page table. We write-protect the replicatedpage, so when any node writes that page we receive a page protection fault. Tohandle this fault, we read-protect the page on all nodes except the faulting one, andenable writing on the faulting node. If another node accesses that page, we mustcopy the new version of the page to that node, enable the page for reading andprotect it from writing.We refer to all the actions needed to keep the pages synchronized as page col-lapses. Collapses are extremely costly, and would occur if we replicate a page thatis write-shared. Even with very infrequent writes (e.g., one in 1000 accesses), thecollapse overhead could be prohibitively high. With limited capabilities of IBS,we are unable to detect read/write ratio on individual pages with sufficient accu-racy. That is why we use the application-wide MRR and disable replication for theentire application if the MRR is low. Furthermore, we monitor collapse statisticsof individual pages and disable replication for any page that generated more thanfive collapses. We allow five collapses because this enables applications to per-form some writes for initialization or a phase-change without losing the benefits ofreplication, but any page with regular (even if infrequent) writes will have replica-82tion disabled because the cost of page collapses will almost always outweigh thebenefits of replication.With these optimizations, as well as those described in Section 4.3.3, we wereable to avoid replication costs for the applications we tested and approached within10% the performance of manual replication for Streamcluster.4.4 EvaluationIn this section, we study the performance of Carrefour on a set of benchmarks.The main questions that we address are the following:1. How does Carrefour impact the performance of applications, includingthose that cannot benefit from its heuristics?2. How does Carrefour compare against existing heuristics for modern NUMAhardware?3. How well does Carrefour leverage the different memory placement mecha-nisms?4. What is the overhead of Carrefour?To assess the performance of Carrefour, we compare it against three other con-figurations:Linux: A standard Linux kernel with the default first-touch memory allocationpolicy.Manual interleaving: A standard Linux kernel with the interleaving policy man-ually enabled for the application.AutoNUMA: A recent Linux patchset [34] considered as the best thread and mem-ory management algorithm available for Linux.The rest of the section is organized as follows: we first describe our experi-mental testbed. Next, we study single-application scenarios, followed by work-loads with multiple co-scheduled applications. We then detail the overhead of Car-refour and conclude with a discussion on additional hardware support that wouldimprove Carrefour’s operation.834.4.1 TestbedWe used two different machines for the experiments:Machine A has four 2.3GHz AMD Opteron 8385 processors with 4 cores in each(16 cores in total) and 64GB of RAM. It features 4 nodes (i.e., 4 cores and 16GBof RAM per node) interconnected with HyperTransport 1.0 links.Machine B has four 2.6GHz AMD Opteron 8435 processors with 6 cores in each(24 cores in total) and 64GB of RAM. It features 4 nodes (i.e., 6 cores and 16GBof RAM per node) interconnected with HyperTransport 3.0 links.We used Linux kernel v3.6 for all experiments. For the AutoNUMA configu-ration, we used AutoNUMA v27 and disabled PMD scan because we it decreasesperformance on all applications we measured6.Workloads were configured to use one thread per core available on the testsystem and threads were bound to their own cores.We used the following set of applications: the PARSEC benchmark suitev2.1 [20], the FaceRec facial recognition engine v5.0 [18], the Metis MapReducebenchmark suite [65] and the NAS parallel benchmark suite v3.3 [11]. PARSECapplications run with the native workload. From the available workloads in NASwe chose those with the running time of at least ten seconds. We excluded appli-cations whose CPU utilization was below 33%, because they were not affected bymemory management policies. We also excluded applications using shared mem-ory across processes (as opposed to threads) or memory-mapped files, because ourreplication mechanism does not yet support this behaviour. For FaceRec, we usedtwo kinds of workloads: a short-running one and a long running one (named Fac-eRecLong in the remainder of the paper). The reason why we present two differentworkloads is to show that Carrefour is able to successfully handle very short work-loads (less than 4s on machine B when running with Carrefour). Each experimentwas run ten times. Overall, we observed a standard deviation between 1% and 2%for the Linux, Manual interleaving and Carrefour configurations. AutoNUMA has amore significant standard deviation, up to 9% on machine A and 13% on machineB.6We also tested AutoNUMA v28. The performance results are very similar. However, we ob-served a significantly higher standard deviation (up to 18% on machine A and 23% on machine B)which makes profiling very difficult.844.4.2 Single-Application WorkloadsPerformance comparison. Figures 4.4 and 4.5 show the performance improve-ment relative to default Linux obtained under all configurations for machine A andmachine B. Performance improvement is computed as:De f aultLinuxtimeSystemtimeSystemtime100%;where System can be either Carrefour, Manual interleaving or AutoNUMA.We can make two main observations. First, Carrefour almost systematicallyoutperforms default Linux, AutoNUMA, and Manual interleaving, sometimesquite substantially. For instance, when running Streamcluster or FaceRecLong, weobserve that Carrefour is up to 58% faster than Manual interleaving, up to 165%faster than AutoNUMA, and up to 263% faster than default Linux on machine B.Second, we observe that, unlike other techniques, Carrefour never performs sig-nificantly worse than default Linux: the maximum performance degradation overLinux is below 4%. In contrast, AutoNUMA and Manual interleaving cause per-formance degradations of up to 25% and 38% respectively. Additionally, ManualInterleaving has a very irregular impact on performance. While it does fairly wellfor PARSEC, NAS applications suffer significant performance degradation (up to38%) when run with manual interleaving.IS is an exception among the NAS benchmarks: Manual interleaving improvesits performance, while Carrefour does no better than default Linux. This is becauseIS suffers from very short imbalance bursts that we are not able to correct dueto limited sampling accuracy achievable with low overhead using existing HPEs.Section 4.4.6 discusses hardware support that would help us address this problem.In order to understand the reasons for the performance impact of the differentpolicies, we study in detail a set of applications whose performance is improvedthe most by Carrefour. To that end, we present several metrics: the load imbalanceon memory controllers (Figure 4.6(a)), the load imbalance on interconnect links(Figure 4.6(b)), the average memory latency (Figure 4.7(a)) and the local accessratio (Figure 4.7(b)).We draw the following observations. First, Carrefour much better balances85-20 0 20 40 60 80 100 120 140 160 180BodytrackFacesimFluidanimateStreamclusterSwaptionsx264FaceRecFaceRecLongKmeansMatrixmultiplyPCAWrmemPer for mance impr ovement wi th respect  t o Li nux (%) AutoNUMAManual interleavingCarrefourMachine A 0 30 60 90 120 150 180 210 240 270BodytrackFacesimFluidanimateStreamclusterSwaptionsx264FaceRecFaceRecLongKmeansMatrixmultiplyPCAWrmemPer for mance impr ovement wi th respect  t o Li nux (%) AutoNUMAManual interleavingCarrefourMachine BFigure 4.4: PARSEC/Metis: AutoNUMA, Manual interleaving and Car-refour vs.Default Linux.the load on both memory controllers and interconnect links than Linux and Au-toNUMA. Not surprisingly, Manual interleaving is also very good at balancingthe load. Nevertheless, we observe in Figure 4.7(a) that Carrefour induces loweraverage memory latencies than Manual interleaving, which explains its better per-formance. To understand why Carrefour reduces memory latencies we refer toFigure 4.7(b), which shows that Carrefour not only balances the load on memorycontrollers and interconnect links, but also often induces a much higher ratio oflocal memory accesses than other techniques. This is a consequence of Carrefour’sjudiciously applying the right techniques (interleaving, replication or co-location)in places where they are beneficial. Interleaving mostly balances the load; repli-cation and co-location in addition to balancing the load improve the local accessratio. Better locality improves latencies in two ways: it avoids remote wire delays86-40-30-20-10 0 10BT CG DC EP FT IS LU MG SP UAPer for mance impr ovement wi th respect  t o Li nux (%) AutoNUMAManual interleavingCarrefourMachine A-40-20 0 20 40 60BT CG DC EP FT IS LU MG SP UAPer for mance impr ovement wi th respect  t o Li nux (%) AutoNUMAManual interleavingCarrefourMachine BFigure 4.5: NAS: AutoNUMA, Manual interleaving and Carrefour vs. De-fault Linux.and, most importantly, decreases congestion on the interconnect links.We also study MG as a representative example of the NAS applications, forwhich Carrefour does not bring significant improvements over default Linux(but still performs better than AutoNUMA and Manual interleaving in mostcases). MG has a low imbalance and a very good local access ratio to begin with.That is why Manual interleaving has a very bad impact on such workloads, sig-nificantly decreasing the local access ratio and as a result stressing the interconnect.Looking inside Carrefour. To better understand the behavior of Carrefour, weshow in Table 4.3 the number of replicated pages, the number of interleaved pages,and the number of co-located pages for the chosen benchmarks. These numbersprovide a better insight into how Carrefour manages the memory. We observe thatall three memory placement mechanisms are in use, and that most applicationsrely on two or three techniques. As discussed previously, MG does not suffer from87 0 20 40 60 80 100 120 140 160Facesim Streamcluster FaceRec FaceRecLong PCA MG SPLoad imbal anceon memor y cont rol l er s (%)LinuxAutoNUMAManual interleavingCarrefour(a) Memory controllers 0 20 40 60 80Facesim Streamcluster FaceRec FaceRecLong PCA MG SPLoad imbal anceon int er connect  l i nks (%)LinuxAutoNUMAManual interleavingCarrefour(b) Interconnect linksFigure 4.6: Load imbalance for selected single-application benchmarks (ma-chine A).traffic congestion, so Carrefour does not enable any technique for this application.A legitimate question we can ask is whether Carrefour always selects the besttechnique. In Table 4.4, we report the performance improvement over Linux ob-tained when running a full-fledged Carrefour and when running a reduced versionof Carrefour enabling only one technique at a time. We observe that Carrefour sys-tematically selects the best technique. It is also interesting to remark how differenttechniques work together and how the numbers provided here echo those in Ta-ble 4.3. We notice that, for all the studied applications except MG and SP, thecombination of several techniques employed by Carrefour outperforms any singletechnique, even when a given technique has a dominant impact (e.g., for Stream-cluster). The slight performance degradation for MG corresponds to the monitoringoverhead of Carrefour.88 0 200 400 600 800 1000 1200Facesim Streamcluster FaceRec FaceRecLong PCA MG SPAvg lat ency (nbCycl es/ req)LinuxAutoNUMAManual interleavingCarrefour(a) Average memory latency 0 20 40 60 80 100 120Facesim Streamcluster FaceRec FaceRecLong PCA MG SPRat io of  l ocalmemor y accesses (%) LinuxAutoNUMAManual interleavingCarrefour(b) Local memory access ratioFigure 4.7: DRAM latency and locality for selected single-applicationbenchmarks (machine A).No. replicated pages No. interleaved pages No. migrated pagesFacesim 0 431 10.1kStreamcluster 25.4k 14.5k 858FaceRec 4k 3 1.3kFaceRecLong 4.1k 5 1.4kPCA 31k 33 41.3kMG 0 0 1SP 0 305 1.7kTable 4.3: Number of memory pages that are replicated, interleaved and co-located on single-application workloads (machine A).89Carrefour Replication Interleaving Co-locationFacesim 74% -4% 0% 65%Streamcluster 184% 176% 94% 51%FaceRec 66% 61% 32% 1%FaceRecLong 117% 113% 51% 1%PCA 46% 45% 29% 24%MG -2% -2% -2% -2%SP 8% -1% -7% 8%Table 4.4: Performance improvement over Linux when running Car-refour and the three different techniques individually on single-application workloads (machine A).4.4.3 Multi-Application WorkloadsIn this section, we study how Carrefour behaves in the context of workloads withmultiple applications that are co-scheduled on the same machine. The goal is toassess that Carrefour is able to work on complex access patterns and to make thedistinction between the diverse requirements of different applications. We con-sider several scenarios based on some of the applications previously studied inSection 4.4.2: (i) MG + Streamcluster, (ii) PCA + Streamcluster, and (iii) FaceRe-cLong + Streamcluster. We chose these scenarios because they exhibit interestingpatterns, which require combining several memory placement techniques in orderto achieve good performance.Each application is run with half as many threads as the number of cores (i.e.,8 threads on machine A, 12 on machine B). With two applications, the workloadoccupies all the available cores. The threads of each application are clustered onthe same node, so each application uses all the cores on two of the four nodes on amachine.Figure 4.8 shows, for each workload, the performance improvement with re-spect to Linux for AutoNUMA, Manual interleaving and Carrefour on machine Aand machine B. We observe that Carrefour always outperforms AutoNUMA andManual interleaving, by up to 62% and 36% respectively. Besides, Carrefour alsooutperforms default Linux, while Manual interleaving hurts MG with a 25% slow-down. Overall, the results obtained with Manual interleaving are closer to the ones90Carrefour Replication Interleaving Co-locationMG2% / 71% 2% / 73% -5% / 17% -1% / 6%+ StreamclusterPCA24% / 57% 18% / 57% 8% / 5% 14% / 1%+ StreamclusterFaceRecLong53% / 71% 53% / 71% 12% / 9% 12% / 4%+ StreamclusterTable 4.5: Multi-application workloads: performance improvement overLinux when running Carrefour and the three different techniques indi-vidually (machine A).of Carrefour compared to the other setups.The reason why Manual interleaving performs relatively well in these scenar-ios is because, with each application using two domains, there is a lot less cross-domain traffic than in the single-application case. Hence there are fewer problemsthat need to be fixed and there is a smaller discrepancy between the performanceof different memory management algorithms.To explain the results, we show the same detailed metrics as in Section 4.4.2:load imbalance on memory controllers, load imbalance on interconnect links, av-erage DRAM latency and ratio of local DRAM accesses in Figures 4.9, 4.10, 4.11and 4.12 respectively. The metrics are aggregated for the two applications of eachworkload. As was the case with the single-application workloads, we see thatCarrefour systematically improves the latency of the studied workloads for tworeasons: a more balanced load on memory controllers and interconnect links aswell as an improved locality for DRAM accesses. Note that, for each workload,the depicted latencies are averaged over the two applications. This explains thesmall latency variations between Linux, AutoNUMA and Manual Interleaving inthe case of MG + Streamcluster.Finally, we show in Table 4.5 the performance improvement for each applica-tion when the effects of only one of the techniques are enabled. We observe thatthe multi-applications workloads perform as well or better with an arsenal of tech-niques used in Carrefour rather than with any single technique alone (especiallyfor PCA + Streamcluster).91-20 0 20 40 60 80MG+ StreamclusterPCA+ StreamclusterFaceRecLong+ StreamclusterPer for mance impr ovement wi th respect  t o Li nux (%)MG Streamcluster PCA Streamcluster FaceRecLong StreamclusterAutoNUMAManual interleavingCarrefourMachine A-20 0 20 40 60 80MG+ StreamclusterPCA+ StreamclusterFaceRecLong+ StreamclusterPer for mance impr ovement wi th respect  t o Li nux (%)MG Streamcluster PCA Streamcluster FaceRecLong StreamclusterAutoNUMAManual interleavingCarrefourMachine BFigure 4.8: Multi-application workloads: AutoNUMA, Manual interleavingand Carrefour vs. Default Linux.92 0 20 40 60 80 100MG+ StreamclusterPCA+ StreamclusterFaceRecLong+ StreamclusterLoad imbal anceon memor y cont rol l er s (%)LinuxAutoNUMAManual interleavingCarrefourFigure 4.9: Multi-application workloads: load imbalance on memory con-trollers (machine A). 0 20 40 60 80MG+ StreamclusterPCA+ StreamclusterFaceRecLong+ StreamclusterLoad imbal anceon int er connect  l i nks (%)LinuxAutoNUMAManual interleavingCarrefourFigure 4.10: Multi-application workloads: load imbalance on interconnectlinks (machine A). 0 100 200 300 400 500 600 700MG+ StreamclusterPCA+ StreamclusterFaceRecLong+ StreamclusterAvg lat ency (nbCycl es/ req)LinuxAutoNUMAManual interleavingCarrefourFigure 4.11: Multi-application workloads: average memory latency (ma-chine A).93 0 20 40 60 80 100MG+ StreamclusterPCA+ StreamclusterFaceRecLong+ StreamclusterRat io of  l ocalmemor y accesses (%)LinuxAutoNUMAManual interleavingCarrefourFigure 4.12: Multi-application workloads: local memory access ratio (ma-chine A).4.4.4 OverheadCarrefour incurs CPU and memory overhead. The first source of CPU overhead isthe periodic IBS profiling. To measure CPU overhead, we compared performanceof Carrefour with Linux on those applications where Carrefour does not yield anyperformance benefits. We observed the overhead between 0.2% and 3.2%. Theadaptive sampling rate in Carrefour is crucial to keeping this overhead low. Asecond and potentially significant source of CPU overhead is replication, if weperform a lot of collapses. A single collapse costs a few hundred microsecondswhen it occurs in isolation. Parallel collapses can take a few milliseconds becauseof lock contention. That is why it is crucial to avoid collapses and other synchro-nization events by disabling replication for write-intensive workloads, as is done inCarrefour.The first source of memory overhead is the allocation of data structures to keeptrack of profiling data. This overhead is negligible: e.g., 5MB on Machine A with64GB of RAM. Carrefour’s data structures are pre-allocated on startup to avoidmemory allocation during the runtime. We limit the number of profiled pages to30,000 to avoid the cost of managing dynamically sized structures. The secondsource of memory overhead is memory replication. When enabled, replication in-troduces a memory footprint overhead of 400MB (353%), 60MB (210%), 60MB(126%) and 614MB (5%) for Streamcluster, FaceRec, FacerecLong and PCA re-spectively.944.4.5 Impact on Energy ConsumptionIt was observed that remote memory accesses require significantly more energythan local ones [35]. Since Carrefour may both decrease and increase the num-ber of remote memory accesses, we were interested in evaluating its impact onenergy consumption7. We show the results for selected applications from single-application workloads on Machine A. We report the increase in energy consump-tion as well as the increase in completion time of all configurations compared todefault Linux in Figure 4.13. Completion time increase is computed here as:SystemtimeDe f aultLinuxtimeDe f aultLinuxtime100%:There is a strong relationship between the completion time and the energy con-sumption: if the completion time is decreased, the energy consumption is alsodecreased proportionally. As a result, Carrefour saves up to 58% of energy. Whenno traffic management is needed, Carrefour on its own has a low impact on energyconsumption (e.g., 2% on MG).More generally, we found that the increase of remote memory accesses haslittle or no impact on the global energy consumption of the machine. For example,Manual interleaving drops the local access ratio of MG from 97% to 25% andthus proportionally increases the number of remote accesses. However, the energyconsumption increase is slightly lower that the completion time increase, whichindicates that the extra energy overhead of remote memory accesses have no strongimpact on overall energy consumption.4.4.6 Discussion: Hardware SupportWe have shown that a traffic management system like Carrefour can bring signif-icant performance benefits. However, the challenge in building Carrefour was theneed to navigate around the limitations of the performance monitoring units of ourhardware as well as the costs of replicating pages. In this section, we draw someinsights on the features that could be integrated into future machines in order tofurther mitigate the overhead and improve accuracy, efficiency and performance of7We used IPMI which gives access to the current global power consumption on our servers.95-70-60-50-40-30-20-10 0 10 20 30 40 50 60 70AutonumaManual Int.CarrefourAutonumaManual Int.CarrefourAutonumaManual Int.CarrefourAutonumaManual Int.CarrefourAutonumaManual Int.CarrefourAutonumaManual Int.CarrefourAutonumaManual Int.CarrefourI ncr ease in wi th respect  t o Li nux (%)Energy consumptionCompletion timeSPMGPCAFaceRecLongFaceRecStreamclusterFacesimFigure 4.13: Increase in completion time and energy consumption for se-lected single-application benchmarks with respect to Linux (machineA). Lower is better.traffic management algorithms.First, Carrefour would benefit from hardware profiling mechanisms that sam-ple memory accesses with high precision and low overhead. For instance, it wouldbe useful to have a profiling mechanism that accumulates and aggregates page ac-cess statistics in an internal buffer before triggering an interrupt. In this regard, theAMD Lightweight Profiling [38] facility seems a promising evolution of profilinghardware8, but we believe the hardware should go even further, and not only accu-mulate the samples but be configured to aggregate them according to user needs,to reduce the number of interrupts even further.Second, Carrefour would benefit from dedicated hardware support for memoryreplication. We believe that there should be interfaces allowing the operating sys-tem to indicate to the processor which pages to replicate. The processor would thenbe in charge of replicating the pages on the nodes accessing it and maintaining con-sistency between the various replicas (in the same way as it maintains consistencyfor cache lines). Given that maintaining consistency between frequently writtenpages is costly, we believe that such processors should also be able to trigger aninterrupt when a page is written too frequently. The OS would then decide to keepthe page replicated or to revert the replication decision.8Unfortunately, Lightweight Profiling is only available on recent AMD processors and we werenot able to evaluate it in this work.96This hardware support can be made a lot more scalable than cache coherencyprotocols, because it is controlled by the OS, which, armed with better hardwareprofiling, will only invoke it for pages that perform very little write sharing. So theactual synchronization protocol would be triggered infrequently.4.5 Related WorkIn this section, we explain how Carrefour relates to different works on multicoresystems. First, we review systems aimed at maximizing data locality. Second, wecontrast Carrefour with previous contention-aware systems. Third, we considerapplication-level techniques to mitigate contention on data-sharing applications.Finally, we discuss traffic characterization observations for modern NUMA sys-tems.NUMA-aware thread and memory management policies were proposed for ear-lier research systems [24, 28, 58, 98] as well as in commercial OS. Their maindifference from our work is that their goal was to optimize locality. However,on modern systems, the main performance problems are due to traffic congestion.Our algorithm is the first one that meets the goal of mitigating traffic congestion.Among the above-mentioned works, the one most related to Carrefour is the systemby Verghese et al. [98] for early cache-coherent NUMA machines, which leveragespage replication and migration mechanisms. Their system relies on assumptionsabout hardware support that do not hold on currently available machines (e.g., pre-cise per-page access statistics). Thus, Carrefour’s logic is more involved, as it ismore difficult to amortize the costs of the monitoring and memory page placementmechanisms. The authors noticed that locality-driven optimizations could, as a sideeffect, reduce the overall contention in the system. However, their system does notsystematically address contention issues. For instance, shared written pages are nottaken into account, whereas Carrefour uses memory interleaving techniques whenthere is contention on such pages. Moreover, the load on memory controllers isignored when making page replication/migration decisions.Similarly to earlier NUMA-aware policies, Solaris and Linux focus primarilyon co-location of threads and data, but to the disadvantage of data-sharing work-loads, replication is not supported. Linux provides the option to interleave parts97of the address space across all memory nodes, but the decision when to invoke theinterleaving is left to the programmer or the administrator. Solaris supports thenotion of a home load group, such that the thread’s memory is always allocated inits home group and the thread is preferentially scheduled in its home group. This,again, favours locality, but does not necessarily address traffic congestion.The recent AutoNUMA patches for Linux also implement locality-driven opti-mizations, using two main heuristics. First, threads migrate toward nodes holdingthe majority of the pages accessed by these threads. Memory residence is deter-mined by page fault statistics. Second, pages are periodically unmapped from aprocess address space and, upon the next page fault, migrated to the requestingnode. As shown in the evaluation section, this approach yields irregular results.We attribute this limitation to the following sources of overhead: local thread/pagemigration decisions that do not take data sharing patterns nor access frequenciesinto account (thus leading to page bouncing or useless migrations) nor memorycontroller or interconnect load (thus leading to memory load imbalance/conges-tion), continuous overhead due to the scanning/unmapping of page-table entriesand the corresponding soft page faults. In contrast, Carrefour makes global dataplacement decisions based on precise traffic patterns and adjusts the monitoringoverhead based on the observed contention level.Locality-driven optimizations for data-sharing applications were addressed ina study that dynamically identified data-sharing thread groups and co-located themon the same node [93]. However, that solution was for a system with a centralized(UMA) memory architecture. Thus, it only studied the benefits of thread placementfor improved cache locality and did not address the placement of memory pageson multiple memory nodes.Zhou and Demsky [106] investigated how to distribute memory pages to cacheson a many-core Tilera processor, in order to implement an efficient garbage collec-tor. The authors tried various policies but found that maximizing locality was thebest approach for their system. This is in contrast to the systems Carrefour is tar-geting, where reducing congestion is more important than just improving locality.Also, the authors used a very different method for monitoring page access patternsthat relies on software-serviced TLB misses, which is not possible on x86.Several recent studies addressed contention issues in the memory hierarchy.98Some of these works were designed for UMA systems [56, 69, 107] and are in-efficient on NUMA systems because they fail to address or even accentuate is-sues such as remote access latencies and contention on memory controllers andon the interconnect links [23]. Other works have been specifically designed forNUMA systems but only partially address contention issues. The N-Mass threadplacement algorithm [63] attempts to achieve good DRAM locality while avoidingcache contention. However, it does not address contention issues at the level ofmemory controllers and interconnect links. Two studies [9, 64] have shown theimportance of taking memory controller congestion into account for data place-ment decisions, but they did not provide a complete solution to address multi-levelresource contention. The most comprehensive work to date on NUMA-aware con-tention management is the DINO scheduler [23], which spreads memory intensivethreads across memory domains and accordingly migrates the corresponding mem-ory pages. However, DINO does not address workloads with data sharing betweenthreads or processes, which require identifying per-page memory access patternsand making the appropriate data placement decisions.When introducing a new resource-management policy in the OS, it is worthasking whether a similar or better effect could be achieved by restructuring theapplication. In our context, it is important to consider the so-called no-sharingprinciple of application design. The key idea behind no-sharing is that the datamust be partitioned or replicated between memory nodes, and a thread needing toaccess data in a different domain than its own either migrates to the target domainor asks the thread running in that domain to perform the work on its behalf, insteadof fetching the data over the memory channels [16, 27, 47, 70, 74, 83, 89]. Whilethe no-sharing architecture was primarily motivated by the need to avoid locking, itcould similarly help reduce the amount of traffic sent across the interconnect, andthus alleviate the traffic congestion problem.Unfortunately, no-sharing architectures are not a universal remedy. First of all,they trade-off data accesses for messages or thread migration; the trade-off is onlyworth making if the size of the data used in a single operation is much larger thanthe size of the message or the state of the migrated thread [27]. Second, adoptinga no-sharing architecture often requires very significant changes to the application(and to the OS, if the application is OS-intensive). A good illustration of the poten-99tial challenges can be gleaned from two studies that converted a database systemto the no-sharing design. The first study took the path of least resistance and sim-ply replicated and/or partitioned the database among domains, adding a message-routing layer on top [83]. While this worked well for small read-mostly workloads,for large workloads replication had very significant memory overhead (unaccept-able because of increased paging), and partitioning required a priori knowledgeof query-to-data mapping, which is not a reasonable assumption in a general case.A solution that overcame these limitations, DORA [74], required a very signifi-cant restructuring of the database system, which could easily amount to millions ofdollars in development costs for a commercial database.Our goal was to address scenarios where adopting a no-sharing architecture isnot feasible either for technical reasons or for practical considerations. Providingan OS-level, rather than an application-level, solution allows us to address manyapplications at once. Understanding the limitations of the OS-level solution anddetermining what optimizations can be done only at the level of an application isan open research question.A recent study characterized the performance of emerging “scale-out” work-loads on modern hardware [44]. The authors observed that there is little inter-thread data sharing, and the utilization of the off-chip memory bandwidth is low.As a result, they argue that memory bandwidth on existing processors is unneces-sarily high. Our findings do not agree with this observation. Although it is also truethat the workloads we consider perform very little fine-grained data sharing, theystill stress the cross-chip interconnect, because they access a large working set,which is spread across the entire NUMA memory space. The authors of the scale-out study reported a very low bandwidth utilization (<10%), even for databaseworkloads. In contrast, our measurements show utilizations higher than 45% inmost cases. These differences could be because the authors of [44] used a systemwith only two chips and 12 cores in their experiments. On larger systems, morethreads are making requests to remote memories and so there is greater pressure onbandwidth. Further, we found that low bandwidth utilization is not necessarily ahealthy symptom. In our experiments, performance dropped steeply even as band-width utilization went from 10% to 30%. The interconnect became the bottleneckeven at a fraction of the total available bandwidth! The reason is that memory re-100quests are not spread evenly in time; they are bursty. Burstiness causes requeststo clash on the link even if the overall bandwidth is not exceeded. In summary,we conclude that contrary to the suggestion made in [44], it is too early to reducethe bandwidth of cross-chip interconnects on large multicore systems, especially ifthey are used for running large data-centric workloads.4.6 SummaryCarrefour is a memory management algorithm for NUMA systems that managestraffic on memory controllers and interconnects. Earlier NUMA-aware memorymanagement policies aimed to mitigate the cost of remote wire delays, which is nolonger the main bottleneck on modern systems. Carrefour’s design was motivatedby the evolution of modern NUMA hardware, where traffic congestion plays amuch larger role in performance than wire delays.System design principles embodied in Carrefour are important not only fortoday’s systems, but also for future hardware. The amount of memory bandwidthper core is projected to decrease in the future [27], so managing traffic congestionwill be as crucial as ever.Smart memory placement, as achieved by Carrefour, is only part of the equa-tion. If there is only one application using all of the cores of a machine, thengood memory placement is sufficient for optimizing memory traffic and there-fore performance. But, if an application does not fill the whole machine, thenthread placement becomes equally important. The strategy described in Chap-ter 2 addresses this side of the equation. It can be used to decide which NUMAnodes and interconnect links to use while considering all other relevant shared re-sources. The workload placement algorithm in Chapter 2 complements Carrefour,and when working in tandem they can optimize multi-application workloads andunder-utilized systems.101Chapter 5Large Pages on NUMA Systems1The previous chapter showed that to achieve good performance on NUMA systemswe need to place memory such that interconnect contention is minimized and localaccesses are maximized. In this chapter we show that large pages sometimes makeoptimal memory placement impossible, and specifically that large pages can hurtperformance on NUMA systems. This observation, the analysis of the causes, andthe solution in the form of an extension to the Carrefour algorithm of Chapter 4,constitute the primary contributions of this chapter.Attribution: I conducted the initial investigation into NUMA and large pages,including the discovery and analysis of the hot page and page-level false sharingproblems (reported in Section 5.2 and Section 5.3.1). Fabien Gaud and BaptisteLepers designed and implemented the Carrefour-LP algorithm.5.1 BackgroundApplications with large memory working sets require many virtual-to-physical ad-dress translations in page tables and TLBs. This drives up physical RAM con-sumption, increases TLB miss rate, and hurts performance [15, 19, 73]. Accordingto one report, a large Oracle DBMS installation with 500 concurrent connectionsconsumed 7GB of RAM for page tables alone! [33]. To address this problem, mostmodern hardware and operating systems introduced support for large pages. On1This work is a modified version of work previously published in [48]102x86 systems large pages are typically 2MB (512 times larger than regularly-sized4KB pages), and support for 1GB pages is on the way2. Using larger pages re-quires fewer translations to cover the address space and diminishes the pressure onthe TLB and physical memory.While large pages are crucial for performance of large-memory systems, they,unfortunately, also have downsides. Previous work reported and addressed in-creased memory footprints and physical memory fragmentation [92]. In this chap-ter, we report on a new problem: large pages can exacerbate harmful NUMAeffects, such as poor locality and imbalance. Using large pages makes the unit ofmemory management (a page) more coarse. As a result, it is more likely that manyfrequently accessed memory addresses happen to map to the same physical pageand overload the memory node hosting it — the hot-page effect. The hot-page ef-fect cannot be addressed by page migration and balancing; page splitting must beperformed prior to any attempts to rebalance memory. Likewise, large pages leadto more frequent page-level false sharing among threads, where threads access dif-ferent data on the same page. False sharing leads to poor locality, which cannot beaddressed by page migration alone.Even though hot pages and false sharing touched only a couple of bench-marks in our set, these effects will become pervasive on systems with much largerpages (e.g., 1GB), which are becoming common. Therefore, we implementedCarrefour-LP which addresses these problems by dynamically splitting large pagesas needed. For applications affected by hot pages and false sharing, Carrefour-LPimproves performance by 10%-80% relative to Carrefour alone. Carrefour togetherwith Carrefour-LP significantly diminish or completely eliminate the performancedegradation introduced by large pages and improve performance of some applica-tions by 2-3 relative to Linux with large pages.21GB pages are already supported by the hardware; support by the OS is still nascent, so fewapplications are able to use them at the time of this writing.1035.2 Large Pages and Adverse NUMA Effects5.2.1 Experimental PlatformFor our experiments, we used two different server-class machines:Machine C has two 1.7GHz AMD Opteron 6164 HE processors, with 12 coresper processor, and 64GB of RAM. The system is equally divided into four NUMAnodes (i.e., six cores and 12GB of RAM per node).Machine D has four AMD Opteron 6272 processors, each with 16 cores (64 coresin total), and 512GB of RAM. It has eight NUMA nodes with 8 cores and 64GBof RAM per node.Both machines have HyperTransport 3.0 interconnect links.We are running on Linux 3.9 and are using Transparent Huge Pages (THP) forlarge page allocation3. THP works by backing allocations of anonymous memorywith 2MB pages whenever possible. Other kinds of memory, such as memorymapped files, are unaffected by THP and use 4KB pages. THP also uses a kernelthread to periodically scan for free memory regions that are at least 2MB in size,which are then used to replace groups of existing 4KB pages.We used several benchmark suites representing a variety of different work-loads: the NAS Parallel Benchmarks suite [11] which is comprised of numerickernels, MapReduce benchmarks from Metis [65], SSCA v2.2 (a graph process-ing benchmark) [10] with a problem size of 20, and SPECjbb [2]. From the NASbenchmark suite we picked the benchmarks that ran for at least 15 seconds. Thememory usage of the benchmarks ranges from 518MB for EP from the NAS suiteto 34,291MB for IS from NAS.5.2.2 Large Pages on LinuxFigure 5.1 compares the performance of 4KB pages and 2MB pages using THP. Wecan see that THP increases performance (by up to 109%) for several benchmarkson both machines (e.g. WC, WR, WRMEM, and SSCA), but also significantly de-3Linux also allows using large pages via libhugetlbfs, but the latter required recompiling appli-cations and pre-allocating memory for large pages, which was inconvenient, and, moreover, did notperform better than THP in our experiments.104creases performance by as much as 43% in some cases. CG, UA, and SPECjbb areall negatively affected by THP. Therefore, 2MB pages are not universally beneficialand neither are 4KB pages, so there is no “one size fits all.”To understand this phenomenon, we recorded two metrics that represent thepotential benefits of large pages: the number of L2 cache misses caused by pagetable walks (obtainable from HPEs), and the maximum time spent in the page faulthandler by any core. L2 misses due to page table walks is a good indicator for theeffect of TLB misses on performance. We expect large pages to increase the TLBcoverage and reduce page table sizes. As a result, we expect the number L2 cachemisses due to page table walks to drop when we use large pages. Similarly, largepages will reduce the number of page faults for allocations and thus the time spentin the page fault handler.We also monitored two metrics related to NUMA efficiency: the local accessratio (LAR), which is the percentage of accesses to local memory, and the trafficimbalance on the memory controllers. Traffic imbalance is defined as the stan-dard deviation of the memory request rate across the controllers, expressed as thepercent of the mean. For memory intensive applications, a low LAR and a highimbalance signify a NUMA issue.Table 5.1 shows the profiling results for a subset of interesting applications.As expected, applications that benefited from 2MB pages in Figure 5.1 (WC andSSCA) have fewer L2 misses due to page table walks, and for WC significantlyless time spent in the page fault handler. The effects can be dramatic. For example,with SSCA on machine C the percentage of L2 misses due to page table walksis decreased from 15% to 2% when using 2MB pages, which results in a 17%performance increase. WC, which experiences a similar decrease in L2 misses butalso a large decrease in time spent on page faults, has its performance increasedmore than two-fold on machine D.The two other profiled benchmarks, CG and UA, perform much worse with2MB pages. The profiling reveals that the degradation is caused by NUMA effects.With CG and 4KB pages, the load on the memory controllers is almost perfectlybalanced, but with 2MB pages the imbalance is 20% on machine C and 59% onmachine D. For UA, the problem is that the LAR decreases when using large pages,from about 88% to around 66%.105CG.D (D) UA.C (D) WC (D)Perf. incr. THP/4k (%) -43 -15 109Time in page faulthandler(% of total time)Linux2182ms(0.1%)102ms(0.2%)8731ms(37.6%)THP445ms(0%)53ms(0.1%)3682ms(32.3%)% L2 misses dueto page table walksLinux 0 0 10THP 0 0 1Local access ratio (%) Linux 40 88 50THP 36 66 55Imbalance (%) Linux 1 14 147THP 59 12 136SSCA.20 (C) SPECjbb (C)Perf. incr. THP/4k (%) 17 -6Time in page faulthandler(% of total time)Linux90ms(0%)8369ms(2.1%)THP147ms(0.1%)5905ms(1.5%)% L2 misses dueto page table walksLinux 15 7THP 2 0Local access ratio (%) Linux 25 12THP 26 15Imbalance (%) Linux 8 16THP 52 39Table 5.1: Detailed analysis of various application on machine C and D. Themachine type is indicated in parentheses next to the name of the bench-mark.106-30-20-10 0 10 20 30BT.BCG.DDC.AEP.CFT.CIS.DLU.BMG.DSP.BUA.BUA.CWC WR KmeansMatrixMultiplypcawrmemSSCA.20SPECjbbPer f.  i mpr ovement  r el at ivet o def aul t Li nux (%)THPMachine C-30-20-10 0 10 20 30BT.BCG.DDC.AEP.CFT.CIS.DLU.BMG.DSP.BUA.BUA.CWC WR KmeansMatrixMultiplypcawrmemSSCA.20SPECjbbPer f.  i mpr ovement  r el at ivet o def aul t Li nux (%)THP-43109 70 51Machine DFigure 5.1: THP performance improvement over Linux on machine C andmachine D. THP sometimes performs better than Linux, sometimesworse.SPECjbb presents an interesting case. While the data in Figure 5.1 suggeststhat it does not benefit from large pages, profiling reveals that using large pagesactually decreases the percent of L2 misses due to page table walks. At the sametime, SPECjbb suffers from NUMA issues: the imbalance rises from 16% to 39%with large pages. Therefore, SPECjbb could benefit from large pages if NUMAeffects were reduced.5.3 SolutionsThe previous section demonstrated that using large pages may introduce NUMAissues, which may either degrade performance relative to small pages (as they didfor CG and UA) or leave the performance unchanged but prevent an applicationfrom enjoying the benefits of large pages (as they did for SPECjbb). In this section107-30-20-10 0 10 20 30CG.DLU.BUA.BUA.CMatrixMultiplywrmemSSCA.20SPECjbbPer f.  i mpr ovement  r el at ivet o def aul t Li nux (%)THPCarrefour-2MMachine C-30-20-10 0 10 20 30CG.DLU.BUA.BUA.CMatrixMultiplywrmemSSCA.20SPECjbbPer f.  i mpr ovement  r el at ivet o def aul t Li nux (%)THPCarrefour-2M-43 -4051 46Machine DFigure 5.2: Performance improvement of Carrefour-2M and THP over Linuxon applications whose NUMA metrics are affected by THP (2MBpages). Carrefour-2M is not always able to solve the problems for ap-plications that suffer from THP.we first demonstrate that using a NUMA-aware page placement algorithm elimi-nates the NUMA issues for some applications, motivating the use of NUMA-awarepage placement with large pages.We then identify two new problems that a placement algorithm unaware oflarge pages does not address: the hot-page effect and the page-level false sharing.These effects, while affecting only two applications in our experiments, will be-come especially important as much larger pages (e.g., 1GB) come into use. Toaddress them, we introduce large-page extensions (LP) to the Carrefour algorithmfrom the previous chapter (all experiments shown in Chapter 4 used 4KB pages).For clarity of presentation, from now on we will focus on those applicationsthat experience NUMA issues when large pages are used. Specifically, if the LARor the imbalance is made worse by more than 15% by using large pages as op-108posed to small ones on either machine, the application is selected for presentation,otherwise it is omitted. The selected applications are: CG.D, LU.B, UA.B, UA.C,matrixmultiply, wrmem, SSCA, SPECjbb. For completeness, and to demonstratethat our solutions do not hurt the applications they cannot help, we do includeperformance results for the excluded applications at the end of Section Page Balancing is Not EnoughWe ran Carrefour in the kernel configured with 2M pages (Carrefour-2M). Figure5.2 shows the performance of Carrefour-2M compared to Linux with 2M pages(labeled as THP) relative to Linux with 4K pages (labeled as Linux). We observethat while Carrefour-2M does improve performance for some applications, it failsto solve the problem across the board. For SPECjbb, Carrefour-2M addresses theNUMA issue; as shown in Table 5.2 it restores the balance on memory controllersthat was introduced by large pages and improves the LAR.At the same time, Carrefour-2M fails to improve performance for UA and CG.To understand why, we show profiling data for these applications in Table 5.2.We report five metrics: the percentage of total accesses to the most used page(PAMUP), the number of hot pages (NHP) defined as pages comprising more than6% of the total accesses4, the percentage of memory accesses to pages shared byat least two threads (PSP), the percentage of accesses to local memory (LAR), andthe traffic imbalance on the memory controllers.The results for CG reveal that there is a hot page problem. Large pages causethe heavily accessed regions of the address space to be coalesced into a small num-ber of hot pages (the PAMUP significantly increases), and because there are fewerhot pages than NUMA nodes it is impossible to balance them.UA does not have a hot page issue, but it does have more pages that are sharedamong threads when large pages are used (the PSP significantly increases). Thishappens because each page holds more data and is thus more likely to contain dataused by multiple threads. Since the threads do not share data, but share the page,we refer to this problem as page-level false sharing. Carrefour-2M is then forced4In order to perfectly balance the load on a 8-node NUMA machine, each node must be the targetof 12.5% of the total memory accesses. Thus, we consider that if a page represents more than half ofthis amount, it is likely to create imbalance.109to interleave these pages whereas if there were less sharing the pages could beplaced on the nodes where they are most heavily used for maximum locality. As aresult, Carrefour-2M delivers a lower LAR than Linux with small pages.In summary, Carrefour-2M is only able to address NUMA issues induced bylarge pages in cases where they are not caused by the hot-page effect and page-levelfalse-sharing.While these problems affected only two applications in our experiments, theywill become pervasive as pages much larger than 2MB come into use. 1GB pagesare already supported by the hardware; applications like large DBMS clearly mo-tivate their use [33]. We did not evaluate 1GB pages, because they are poorlysupported in Linux. 1GB pages are not compatible with THP, and while in theoryit is possible to use them with lighugetlbfs, that has many challenges. First of all,the implementation is unreliable. We were not able to enforce the use of 1GB pageswith NAS applications and observed many crashes with the Metis suite (becausethe latter uses a custom memory allocator). Second, the splitting of large pages,which is crucial to our solution, is not supported by libhugetlbfs and implementingit would require a significant effort.However, since the use-case for very large pages is definitely there, they willbecome more common as the OS support improves. Then, the hot-page effect andpage-level false sharing will become more common (Section 5.4.4 provides somepreliminary data). To address these problems, we propose large-page extensions toCarrefour.5.3.2 Carrefour-LPIntuition suggests two basic solutions to the problem: conservative — prevent theproblem by only creating large pages when necessary, or reactive — start withlarge pages and fix NUMA problems when they are observed. Each approach haspotential benefits and drawbacks. The conservative approach can avoid NUMA re-lated performance degradation but can also miss out on the benefits of large pages.On the other hand, the reactive approach will benefit from large pages, but must beable to quickly and accurately detect NUMA issues and must pay the overhead offixing them.110Linux THP Carrefour2MSPECjbbPAMUP 2% 6% 6%NHP 0 0 0PSP 10% 36% 36%Imbalance 16% 39% 19%LAR 26% 28% 27%CG.DPAMUP 0% 8% 8%NHP 0 3 3PSP 18% 34% 34%Imbalance 0% 20% 20%LAR 45% 45% 45%UA.BPAMUP 6% 6% 6%NHP 0 0 0PSP 16% 70% 70%Imbalance 9% 15% 17%LAR 90% 61% 58%Table 5.2: Proportion of accesses to the most-used page (PAMUP) in %,number of hot pages (NHP), proportion of memory accesses to sharedpages (PSP) in %, Imbalance in % and local access ratio (LAR) in % forLinux, THP and Carrefour-2M, on machine C (24 cores).We found that a good algorithm must be a combination of these approaches.The reactive component of our algorithm continuously monitors HPEs lookingfor the presence of NUMA effects under large pages, applies the page balancingtechniques of Carrefour and splits the large pages if the latter are ineffective. Theconservative component of the algorithm continuously monitors the virtual mem-ory metrics and re-enables large pages if they are expected to deliver benefit butwere previously disabled.We also found that it is more practical and involves less overhead to enablelarge pages in the beginning and disable them later if they are deemed harmful. Inparticular, many applications have intensive memory-allocation phases at the verybeginning of the program that suffer from lock contention if small pages are used.Our full algorithm is presented in Algorithm 5. The algorithm also details theHPEs that are being monitored. Since the monitoring is done continuously, the111algorithm caters to phase changes in applications. Below we describe the rationalebehind the decisions made in the algorithm.Reactive ComponentThe job of the reactive component is to disable large pages when they are harmfulto the extent that even Carrefour-2M’s page-balancing techniques cannot addressthe performance degradation. To that end, it estimates the local access ratio (LAR),a vital metric for detecting NUMA issues, with and without Carrefour and largepages.We use AMD’s instruction-based sampling (IBS)5 to sample memory accessesto pages, and to learn whether the access was made from a local or a remote node.We only consider pages that have at least one sample where the access was servicedfrom DRAM, so that our decisions are not affected by pages that are easily cached.From the IBS samples, we estimate the LAR that would be obtained if the sharedpages were migrated to a random node and if non-shared pages were migrated tothe local node (i.e. interleaving and migrating pages with the Carrefour-2M algo-rithm). We also calculate the LAR that would be obtained if the same techniquewere used but with all of the 2MB pages split into 4KB pages.Estimating the LAR for various what-if scenarios (e.g., if a page were migratedor if large pages were split into regular-sized) is trivial with IBS samples. IBSgives us data addresses and the node from which they were accessed. So we cancompute the current LAR as well as the LAR that would be obtained if the pageswhere placed on different nodes. Similarly, we can map the data addresses to 4KBpages and compute the same metrics for the scenario if the large pages were split.If, based on our estimates, the LAR can be improved by 15% with Carrefour-2M only and without splitting the pages, we simply run Carrefour-2M. Otherwise,if splitting pages would improve the LAR by at least 5%, then all shared 2MBpages are demoted into 4KB pages. Note that we are being cautious here: we tryto address NUMA issues by page migration first, and split pages only if absolutelynecessary. Splitting pages has overhead and may hurt applications that benefitfrom large pages. In addition, large pages with more than 6% of the total accesses5Intel systems have a similar facility called PEBS (Precise Event-Based Sampling).112(hot pages, as defined in Section 5.3.1) are split and the constituent 4KB pages areinterleaved.This part of the algorithm relies on two thresholds. The first one is the 15%threshold used to decide whether we can improve the LAR simply by rearrangingmemory pages, without having to split large pages. That threshold was relativelyeasy to set across applications: the key is to use a relatively large number, sincewe want to be rather confident that we can improve performance without havingto split pages. The second threshold, the 5% performance gain that we expectfrom splitting pages, needs to be any non-negligible number that would justify thesplitting. Again, that threshold was relatively easy to tune across applications.In the algorithm, we use the LAR computed per-application. Another optionwould be to use the LAR computed per-page, however this was difficult to do,because existing hardware monitoring facilities prevent us from obtaining enoughsamples to accurately compute per-page LAR (and even per-application LAR asexplained in the next section). This is why the algorithm splits all 2MB pageswhen it detects the LAR can be improved.Conservative ComponentThe job of the conservative component is to re-enable large pages when they havebeen disabled but monitoring shows that they would be beneficial again. The con-servative component uses two criteria to determine the benefit of large pages: theperformance impact of TLB misses (based on the fraction of L2 misses caused bypage table walks) and the maximum percentage of time any core spends processingpage faults. The reason why we consider the time spent processing page faults isthat large pages improve performance by decreasing this time. Indeed, soft pagefaults not only take CPU time, but also incur costly synchronization [26]. The lat-ter is the reason why we use the maximum fraction as opposed to the average: lockcontention will be determined by the slowest core that holds page table locks.The conservative component works as follows. If the impact of TLB missesis estimated to be greater than a threshold of 5%, then 2MB page allocation and2MB page promotion6 are both enabled via THP. Similarly, if the time spent in the6Page promotion refers to dynamic consolidation of regular-sized pages into large pages. It issupported by the default Linux kernel. We set the frequency for page promotion checks to every113Algorithm 5 Large-page Extensions to CarrefourEnable 2MB page allocation and promotionwhile true doGather HPEs and IBS samples for 1 secif L2 misses due to page table walks > 5% thenEnable 2MB page allocationEnable 2MB page promotionelse if Max time spent on page faults > 5% thenEnable 2MB page allocationend ifif Estimated LAR improvement with only Carrefour > 15% thenSPLIT PAGES = falseelse if Estimated LAR improvement with Carrefour and splitting pages >5% thenSPLIT PAGES = trueend ifif SPLIT PAGES = true or 2MB page allocation is disabled thenSplit all shared 2MB pages into 4KB pagesDisable 2MB page allocationend ifSplit and interleave 2MB hot pagesInterleave and migrate pages with Carrefourend whilepage fault handler was more than a threshold of 5%, then 2MB page allocation isenabled but not 2MB page promotion, since there is little benefit in promoting thepages on which we had already paid the cost of page faults.In order to estimate the impact of TLB misses on performance, we use thefraction of L2 cache misses due to page table walks. This assumes that TLB missesprimarily degrade performance when a page table traversal causes an L2-cachemiss (in that case, the miss is satisfied either from the L3 cache or from the DRAM,both of which are costly), and that the application’s performance is dominated byL2 cache misses. Although this is a coarse approximation, it works well becauseapplications that experience a lot of cache misses due to page table walks are thosewith large page tables. This implies that they have large memory footprints, and so10ms.114they are memory-intensive. Therefore, it is safe to assume, for these applications,that variations in performance can be primarily explained by the number of L2cache misses. Conversely, applications with a very small fraction of L2 cachemisses resulting from page table walks are not memory-intensive, so for them theimpact of TLB misses is negligible.5.4 Evaluation5.4.1 Performance EvaluationFigure 5.3 shows performance of Carrefour-LP and THP relative to Linux with 4Kpages. We continue focusing only on the applications affected by NUMA issues;the remaining applications are presented for completeness in Figure 5.5. Figure 5.3shows that Carrefour-LP: Restores performance of applications that suffered under large pages and donot stand to benefit from them: CG.D, UA.B, UA.C. Improves performance of applications that were expected to benefit fromTHP but did not (or did not benefit fully): SSCA and SPECjbb, both onmachine C. Does not significantly hurt performance of the applications where NUMAeffects did not cause performance degradation under large pages and whereno performance improvements from large pages were expected (the remain-ing applications).We next provide the detailed analysis of Carrefour-LP. We analyze the con-tribution to performance improvements of its three components: Carrefour-2M,conservative and reactive. We demonstrate when and why it is sufficient to just useCarrefour-2M alone and explain how both conservative and reactive componentscontribute to the solution. The performance breakdown is shown in Figure 5.4.Workloads other than CG.C, UA.B and UA.C are not affected by the hot-pageeffect and page-level false sharing, so in these cases Carrefour-LP performs simi-larly to Carrefour-2M alone. It is able to meet the performance of Carrefour-2Mwith minimal overhead (at most 3.7% on machine C and 2.1% on machine D).115-30-20-10 0 10 20 30CG.DLU.BUA.BUA.CMatrixMultiplywrmemSSCA.20SPECjbbPer f.  i mpr ovement  r el at ivet o def aul t Li nux (%)THPCarrefour-LPMachine C-30-20-10 0 10 20 30CG.DLU.BUA.BUA.CMatrixMultiplywrmemSSCA.20SPECjbbPer f.  i mpr ovement  r el at ivet o def aul t Li nux (%)THPCarrefour-LP-4351 46Machine DFigure 5.3: Performance improvement on a reduced set of applications ofTHP and Carrefour-LP over Linux, on machine C and machine D.Table 5.3 demonstrates that Carrefour-LP eliminates the hot-page effect andpage-level false sharing and improves NUMA metrics where Carrefour-2M fails.For UA, the LAR drops from about 90% to roughly 60% under THP and remainsat that low level under Carrefour-2M. Carrefour-LP is able to restore it almost tothe previous level by dynamically splitting pages.For CG.D, enabling large pages disturbs the perfect memory-controller balanceenjoyed under small pages. Carrefour-2M is unable to restore it, while Carrefour-LP restores it almost entirely.We now analyze the importance of the two components in Carrefour-LP. Fig-ure 5.4 presents the performance obtained when running Carrefour-2M alone (la-beled as Carrefour-2M), Carrefour-2M with the reactive component designed forCarrefour-LP (labeled as Reactive), the original Carrefour runtime (working on4kB pages) together with the conservative component (labeled as Conservative),and Carrefour-LP (labeled as Carrefour-LP). Figure 5.4 shows that in all cases, en-116-30-20-10 0 10 20 30CG.DLU.BUA.BUA.CMatrixMultiplywrmemSSCA.20SPECjbbPer f.  i mpr ovement  r el at ivet o def aul t Li nux (%)Carrefour-2M ConservativeReactiveCarrefour-LPMachine C-30-20-10 0 10 20 30CG.DLU.BUA.BUA.CMatrixMultiplywrmemSSCA.20SPECjbbPer f.  i mpr ovement  r el at ivet o def aul t Li nux (%)Carrefour-2MConservativeReactiveCarrefour-LP-4046324546Machine DFigure 5.4: Performance improvement on a reduced set of applications ofCarrefour-2M, the conservative component, the reactive component andCarrefour-LP over Linux with THP, on machine C and machine D.Local Access ratio (%) Imbalance (%)Default THP Carr. Carr. LP Default THP Carr. Carr. LPLinux 2M Linux 2MCG.D (B) 40 36 38 39 1 59 69 3UA.B (A) 90 61 58 85 9 15 17 10UA.C (B) 88 66 68 82 14 12 9 14Table 5.3: NUMA metrics for CG.D on machine D, UA.B on machine C, andUA.C on machine D.abling the two components (as done in Carrefour-LP) is always the best choice (orclose to the best). The conservative component alone does not solve the problem,because it begins with 4K pages. For SPECjbb, for example, it does not detect theneed for large pages soon enough, so the performance is not as good as it could be.We similarly observed that using the conservative component alone hurts perfor-117mance of many applications that were not included in this analysis (but shown inFigure 5.5) for the same reason: large pages were not enabled soon enough. Theseapplications have an intense memory allocation phase at startup, which can benefitgreatly from large pages due to fewer page faults, but the conservative componentdoes not enable large pages soon enough.Using the reactive component alone works well on some applications. ForCG.D, it is able to detect the hot page and split it. Similarly, it is also able to splitthe falsely shared pages for UA.B and UA.C. However, on some applications, itfails to bring the maximum performance improvement that can be achieved with2M pages (e.g. SSCA on machine C and SPECjbb on machine D). The reason isthat the LAR is sometimes misestimated, and this results in 2M pages being splitin applications that do not suffer from NUMA issues. For instance, on SSCA,the algorithm predicts a LAR of 59% if large pages were all split into 4k pages,whereas the actual LAR obtained after splitting is equal to 25%.The problem is, in order to estimate the LAR under regular-sized pages giventhe data samples collected under large pages, we need to have enough sampleson the constituent sub-pages. Unfortunately, we found it to be very difficult togather enough samples; increasing the sampling rate results in unacceptably highoverhead. A promising solution would be to use Lightweight Profiling (LWP).LWP is an extension of AMD processors that aims at providing the same level ofdetails as IBS with less overhead. To reduce the overhead, LWP stores samples in aring buffer and only interrupts the processor when the buffer is full. Unfortunately,on available AMD processors, LWP is only partially implemented: LWP samplesonly contain the instruction pointer of the sampled instruction and a timestamp.This information is not sufficient to predict LAR. Another potential solution is theShim profiler [102], but only if the application leaves some cores unused for theprofiler.Because of these deficiencies in hardware profiling, the reactive componentmay make mistakes in deciding when to split large pages. This is where the con-servative component comes to the rescue and re-creates the large pages when theyare expected to help.We conclude this section by explaining some performance results in Figure 5.5,which contains applications where THP did not create any NUMA issues. The118-30-20-10 0 10 20 30BT.BDC.AEP.CFT.CIS.DMG.DSP.BWC WR KmeanspcaPer f.  i mpr ovement  r el at ivet o def aul t Li nux (%)THPCarrefour-LP89 74Machine C-30-20-10 0 10 20 30BT.BDC.AEP.CFT.CIS.DMG.DSP.BWC WR KmeanspcaPer f.  i mpr ovement  r el at ivet o def aul t Li nux (%)THPCarrefour-LP78 217 109100706366Machine DFigure 5.5: Performance improvement of THP and Carrefour-LP over Linuxon applications whose NUMA metrics are not affected by THP, on ma-chine C and machine D.key observation is that the overhead of Carrefour-LP does not significantly hurtthese applications. Moreover, EP.C, SP.B and pca enjoy better (sometimes muchbetter) performance with Carrefour-LP than with THP. That is because they hadNUMA issues to begin with (which were not exacerbated by large pages), and sothe Carrefour-2M component of the algorithm helped to address them.5.4.2 Overhead AssessmentOverhead in Carrefour-LP comes from collecting and storing IBS samples, com-puting the metrics based on these samples, migrating and splitting pages. Overall,the overhead of Carrefour-LP compared to the reactive approach is negligible: be-tween 1% and 2% on all applications (on all machines) except CG (3.2%) and IS119(2.1%) on machine D. Even on these two applications, the overhead is still withinthe standard deviation.Compared to Carrefour-2M, the overhead is also small. The maximum over-head observed is 3.7% on machine C (SP.B) and 3.2% on machine D (LU.B), buton average it is below 2%.Compared to Linux with 4k pages, Carrefour-LP has an overhead of less than3%, except on FT, IS (machine C) and LU (machine D). This overhead is notspecific to Carrefour-LP but is rather caused by Carrefour-2M, which spends toomuch time migrating large pages. Since our solution is built on top of Carrefour-2M, it also suffers from the same overhead.5.4.3 DiscussionThe solution could be much improved if we had a more accurate way of estimatingthe LAR. Currently, with inaccurate estimates, the solution may split and migratepages when there is no benefit to be gained, which is why Carrefour-LP degradesperformance of LU by 3.5% compared to Carrefour-2M. We believe that the LARcould be predicted more accurately if we could collect more data samples withoutadditional overhead. A complete implementation of LWP (i.e., if LWP providedthe same kind of samples as IBS) would solve this problem.Our earlier implementation had scalability issues on the system with 64 cores.The reason was that the centralized data structure where we stored IBS sampleshad to be accessed and locked from multiple nodes. We addressed this problem bymaintaining a data structure per node. The per-node structures are still accessedby multiple cores, so we may need to revisit this scaling issue on larger machines.Overall, the algorithm is likely to scale well because all work generated by aninterrupt is performed independently on each node, so the number of nodes cangrow without creating scalability bottlenecks.Splitting pages did not create too much overhead, but the use of the page tablelock for THP operations is clearly a scalability concern. Linux developers areworking on finer grain locks at the time of this writing, so we hope that this problemwill be avoided.We did not observe many oscillations, where we go back and forth between120splitting and enabling large pages. Overall, Carrefour-LP seems to be more robustthan the conservative and the reactive components used independently, because itnaturally supports transient states and phase changes by continuously re-examiningits decisions.5.4.4 Very Large PagesAlthough accessing the very large 1GB pages via libhugetlbfs proved challengingfor most applications, we were able to enable them in SSCA and in streamclus-ter (an application from PARSEC)7. We immediately observed the hot-page andpage-level false-sharing problems. With 1GB pages, lots of hot small pages werecoalesced on a single NUMA node, and the performance dropped dramatically. ForSSCA it degraded by 34%; for streamcluster by a factor of 4. Neither of these ap-plications suffered performance degradation when 2M pages were used. Althoughpreliminary, these data suggest a much more pervasive presence of NUMA issueswhen very large pages are used, and so Carrefour-LP will become even more im-portant in the future.5.5 Related Work5.5.1 Large Pages and TLB PerformanceSeveral studies have characterized the effect of TLB misses and large pages [19,50, 73, 99, 104]. Battacharjee and Martonosi [19] specifically looked at the effectof TLB misses on multicore systems with multithreaded workloads. They foundthat some applications, such as Canneal from the PARSEC benchmark suite, spendup to 0.7 cycles per instruction on servicing D-TLB misses. Another study [73]showed performance improvements of up to 25% in the NAS benchmark suite dueto using large pages. For large-scale HPC applications, Zhang et al. [104] foundthat large pages improve communication performance significantly.Weisberg and Wiseman [99] used the SPEC CPU2000 benchmarks to evaluatethe relationship between page size and the number of TLB misses. They argue that7The PARSEC suite was not included in our study, because its applications did not experienceperformance differences under THP with 2M pages.121a 4KB page size is much too small for most applications, and conclude that a pagesize of 256KB and a 64-entry TLB is sufficient to drastically reduce the number ofTLB misses.Sudan et al. [91] motivate the need for small pages. They show that using 1KBpages allows optimizing the usage of the DRAM row-buffer, yielding substantialenergy savings and decreasing the average latency of memory accesses.All these works motivate the use of different page sizes, but none of themhighlight or quantify the impact of NUMA on the performance obtained whenusing different page sizes.5.5.2 Large Page Support and OptimizationMany software systems have been designed that make large pages easier to use ormore effective.Navarro et al. [72] described an algorithm for operating system support of largepages that reduces fragmentation and does not require memory copies to createlarge pages. Using their algorithm, a page fault reserves a physical memory regionof the size of a large page, but it initially only allocates and maps a small page.Subsequent page faults use the reserved space until it has been completely allo-cated, at which point the region is promoted to a large page. The algorithm doesnot attempt to optimize the placement of large pages.Cascaval et al. [29] developed a model to predict the benefit of using largepages on individual data structures of applications, based on the predicted num-ber of TLB misses and page faults. The predictions are computed using HPEsthroughout multiple runs of the application. The data structures that are predictedto benefit the most from large pages are backed by large pages. A similar methodis described in [78], with the major difference being that large page promotions areperformed at runtime.Magee and Qasem [62] also devised a system for restricting the usage oflarge pages to applications that benefit the most from them. At compile-time, theworking-set size is estimated through static analysis. If the estimated working-setsize is greater than the coverage of the target CPU’s TLB, then large pages areused.122A different approach is explored by Basu et al. [15]. Instead of managingthe use of large pages at the OS level, they propose a hardware extension thatallows applications to directly map memory segments. Addresses within directlymapped segments bypass the TLB and so translation is nearly free. The segmentsare conceptually similar to very large pages and provide similar benefits, but theauthors do not analyze the potential NUMA effects which would be exacerbatedby the large size of the segments.In summary, previous works mostly focused on the limited availability of largepages and on reducing memory fragmentation. Several systems have been designedto ensure that applications that benefit from large pages actually use them, but noexisting work has revealed and addressed the NUMA issues raised by large pages.5.6 SummaryWe demonstrated that using large pages can create or exacerbate NUMA issues likereduced locality or imbalance. We showed that these problems can be in some casesaddressed by using a NUMA-aware page placement algorithm, but the latter stum-bles upon two problems: the hot-page effect and page-level false sharing, whichcannot be addressed via page migration. To address these problems, we imple-mented Carrefour-LP: large-page extensions to the NUMA-aware page placementalgorithm described in Chapter 4. Our results show that Carrefour-LP restores theperformance when it was lost due to large pages and makes their benefits accessibleto applications.Solutions like Carrefour-LP will be even more important in the future, whenvery large pages (1GB in size) will be in widespread use.123Chapter 6ConclusionA reoccurring theme throughout Chapters 2–5 is the continual increase in hardwarecomplexity and the burden it places on system software to get the most out of it.Due to physical limitations, hardware architects must improve chip performanceby increasing core counts and adding features to improve utilization rather thanincreasing frequency as has been done in the past. This leads to a multitude ofshared resources such as core functional units when using SMT, caches at variouslevels, interconnects, and memory controllers. As we have seen, this resourcesharing can cause significant performance effects, and reducing the contention forthe shared resources is essential for optimizing performance.And yet modern operating systems rely on simple and usually insufficientmechanisms to address shared resource contention. For example, Linux’s defaultNUMA memory allocation policy is first-touch, meaning that memory is allocatedon the NUMA node it is first accessed from. This can lead to extensive contentionfor memory controllers and interconnect bandwidth, as shown in Chapter 4. An-other example is the Linux thread scheduler which generally tries to spread threadsapart in an effort to give the threads as much resources as possible. Again, thisrelatively simple strategy does not provide good performance for every applicationas seen in Section 2.5. In general, the simple solutions employed by operatingsystems are not optimal for all applications but only a subset of them.The first step towards a comprehensive solution is insight into the problem it-self, specifically how applications interact with the hardware and the effects on per-124Concern Score Resources Cost? InversePerf?L2/SMTNumber of L2caches in useL2 cache, L1 cache,instruction front-end, and functionalunitsY YL3Number of L3caches in useL3 cache Y YMem. ControllerNumber of mem-ory controllerpairs in useMemory controllers,DRAM bandwidth,I/OY YTable 6.1: Scheduling concerns for an AMD Zen system.formance. The insights are important research contributions on their own. Chap-ter 2 showed the predictive power of performance measurements and relative lackof predictive power of HPEs, Chapter 3 demonstrated the importance of the in-struction mix with respect to SMT preference, Chapter 4 proved the importance ofmanaging NUMA traffic congestion over focusing solely on locality, and Chapter5 identified the hot page and page-level false sharing issues with NUMA and largepages.From these observations we built a complete and practical solution to the prob-lem of NUMA and multicore resource contention. The first step is workload place-ment, described in Chapter 2. Our approach relies on an offline training phase thatruns a training set of benchmarks and builds a machine learning model to predictthe performance of applications at different thread placement configurations. Atruntime, a new, previously unseen application requires only two performance mea-surements in two different placements as inputs into the machine learning model,and then accurate predictions for other placements can be made. In our exper-iments the average prediction error was 6.6% and 4.4% on our two test systems,and in Section 2.5 we showed how this translates into more efficient machine pack-ing while maintaining performance targets.An important aspect of the solution to workload placement in Chapter 2 is thatit is flexible and easy to adapt to future architectures. As long as new hardware ar-chitectures conform to the basic structure of having a hierarchy of shared resources,125scheduling concerns can be developed for them easily. To demonstrate this, Table6.1 shows the scheduling concerns that would be used for AMD’s recently releasedZen architecture [8, 32]. The Zen architecture is significantly different from theBulldozer architecture used in Chapter 2 (Table 2.1). Many more resources areshared at the SMT level on the Zen architecture, including all the functional unitsrather than just the floating point units, and the L1 cache is shared as well. In addi-tion it has another level of hierarchy because the L3 cache is shared at a lower levelthan the memory controllers. Despite these differences it is still simple to derivethe scheduling concerns in Table 6.1 (current Zen platforms have symmetric inter-connects, but it would also be simple to add an asymmetric interconnect concern ifit is required in the future).The second step after thread placement is memory placement. The Carrefour-LP algorithm (Chapter 5) places memory pages on NUMA nodes so that inter-connect congestion is minimized and additionally it manages the size of memorypages so that TLB and page fault effects are weighed against the needs of balancinginterconnect traffic. By measuring HPEs and utilizing instruction-based sampling,Carrefour-LP monitors NUMA statistics and per-page access patterns which it usesto make its decisions. This is done at runtime with low overhead and without re-quiring changes to the application. In extreme cases Carrefour-LP can improveperformance by 3 compared to default Linux and it consistently improves perfor-mance over other standard techniques in other cases.The practicality of our approach is what sets this work apart from other re-cent research in the field, such as Pandia [49]. Current research in the field tendsto require burdensome offline profiling for new applications, or extensive expertknowledge to adapt the techniques to new hardware architectures. Neither is thecase for our solution.Still, there is much work to be done in the area. The workload placement al-gorithm of Chapter 2 assumes that the level of parallelism of the workload (i.e.the number of vCPUs the workload uses) is known beforehand. While this is areasonable assumption in many use-cases such as clients of a cloud environment,it would still be quite helpful to be able to predict the effect of increasing or de-creasing the level of parallelism. The results reported in Chapter 3 are a start to thisproblem but they only apply SMT systems, so more work could be done to extend126the technique to a broader context. A second area of future work is studying theprecise relationship between thread placement and memory placement. Our solu-tion places threads first and then places memory, but it is possible that if threadand memory placement is considered at the same time then further performancegains could be obtained. Lastly, our workload placement method makes some as-sumptions. For example, it assumes that workloads will not interfere with eachother which can lead to cores being left idle in some cases. Another example isthat it only considers balanced placements. If these assumptions can be loosenedor removed through further research then the method would be able to increaseefficiency to an even greater extent. One avenue of complementary research wouldbe to leverage application-specific data from the compiler or run-time system toimprove predictions or help remove the assumptions and limits of the solution.For example compiler and run-time information can be used to determine how theload is balanced among workers (and the likelihood of stranglers), which could beuseful for removing the assumption of balanced placements.127Bibliography[1] 2011. Apache DayTrader Benchmark. (2011). Retrieved May 30th 2017from http://geronimo.apache.org/GMOxDOC22/daytrader-a-more-complex-application.html. ! pages 51[2] 2013. SPECjbb2005. (2013). Retrieved May 30th 2017 fromhttp://www.spec.org/jbb2005/. ! pages 52, 104[3] 2016. Dell EMC Enterprise Infrastructure Planning Tool. (2016). RetrievedMay 30th 2017 from http://dell-ui-eipt.azurewebsites.net. ! pages 1[4] 2016. The WiredTiger key-value store. (2016). Retrieved May 30th 2017from http://source.wiredtiger.com. ! pages 7, 22[5] 2017. Amazon EC2 Instance Types. (2017). Retrieved May 30th 2017 fromhttps://aws.amazon.com/ec2/instance-types. ! pages 13[6] 2017. Selecting the number of clusters with silhouette analysis on KMeansclustering. http://scikit-learn.org/stable/auto examples/cluster/plot kmeans silhouette analysis.html. (2017). ! pages 22[7] Stephen F. Altschul, Warren Gish, Webb Miller, Eugene W. Myers, andDavid J. Lipman. 1990. Basic local alignment search tool. Journal ofmolecular biology 215, 3 (1990), 403–410.https://doi.org/10.1016/S0022-2836(05)80360-2 ! pages 22[8] AMD. 2017. Processor Programming Reference PPR for AMD Family 17hModel 01h, Revision B1 Processors. Sunnyvale, CA, USA. ! pages 126[9] Manu Awasthi, David W. Nellans, Kshitij Sudan, Rajeev Balasubramonian,and Al Davis. 2010. Handling the Problems and Opportunities Posed byMultiple On-chip Memory Controllers. In Proceedings of the 19thInternational Conference on Parallel Architectures and Compilation128Techniques (PACT ’10). ACM, New York, NY, USA, 319–330.https://doi.org/10.1145/1854273.1854314 ! pages 99[10] David A. Bader, John Feo, John Gilbert, Jeremy Kepner, David Koester,Eugene Loh, Kamesh Madduri, Bill Mann, and Theresa Meuse. 2007. HPCSScalable Synthetic Compact Applications #2 Graph Analysis. ! pages 51,104[11] David H. Bailey, Eric Barszcz, John T. Barton, David S. Browning,Robert L. Carter, Leonardo Dagum, Rod A. Fatoohi, P. O. Frederickson,T. A. Lasinski, R. S. Schreiber, H. D. Simon, V. Venkatakrishnan, and S. K.Weeratunga. 1991. The NAS Parallel Benchmarks - Summary andPreliminary Results. In Proceedings of the 1991 ACM/IEEE Conference onSupercomputing (SC ’91). ACM, New York, NY, USA, 158–165.https://doi.org/10.1145/125826.125925 ! pages 22, 51, 84, 104[12] Mohammad Banikazemi, Dan Poff, and Bulent Abali. 2008. PAM: A NovelPerformance/Power Aware Meta-scheduler for Multi-core Systems. InProceedings of the 2008 ACM/IEEE Conference on Supercomputing (SC’08). IEEE Press, Piscataway, NJ, USA, Article 39, 12 pages.http://dl.acm.org/citation.cfm?id=1413370.1413410 ! pages 9[13] Scott Barielle. 2011. Calculating TCO for energy. IBM Systems Magazine:Power (2011), 38–40. ! pages 1[14] Bradley J. Barnes, Barry Rountree, David K. Lowenthal, Jaxk Reeves,Bronis de Supinski, and Martin Schulz. 2008. A Regression-basedApproach to Scalability Prediction. In Proceedings of the 22nd AnnualInternational Conference on Supercomputing (ICS ’08). ACM, New York,NY, USA, 368–377. https://doi.org/10.1145/1375527.1375580 ! pages 66[15] Arkaprava Basu, Jayneel Gandhi, Jichuan Chang, Mark D. Hill, andMichael M. Swift. 2013. Efficient Virtual Memory for Big Memory Servers.In Proceedings of the 40th Annual International Symposium on ComputerArchitecture (ISCA ’13). ACM, New York, NY, USA, 237–248.https://doi.org/10.1145/2485922.2485943 ! pages 102, 123[16] Andrew Baumann, Paul Barham, Pierre-Evariste Dagand, Tim Harris,Rebecca Isaacs, Simon Peter, Timothy Roscoe, Adrian Schu¨pbach, andAkhilesh Singhania. 2009. The Multikernel: A New OS Architecture forScalable Multicore Systems. In Proceedings of the 22nd ACM SIGOPSSymposium on Operating Systems Principles (SOSP ’09). ACM, New York,NY, USA, 29–44. https://doi.org/10.1145/1629575.1629579 ! pages 99129[17] Tom Bawden. 2016. Global warming: Data centres to consume three timesas much energy in next decade, experts warn. (2016). Retrieved September16th 2017 from https://www.independent.co.uk/environment/global-warming-data-centres-to-consume-three-times-as-much-energy-in-next-decade-experts-warn-a6830086.html. ! pages 1[18] J. Ross Beveridge, David Bolme, Bruce A. Draper, and Marcio Teixeira.2005. The CSU Face Identification Evaluation System. Machine Vision andApplications 16, 2 (2005), 128–138.https://doi.org/10.1007/s00138-004-0144-7 ! pages 84[19] Abhishek Bhattacharjee and Margaret Martonosi. 2009. Characterizing theTLB Behavior of Emerging Parallel Workloads on Chip Multiprocessors. InProceedings of the 18th International Conference on Parallel Architecturesand Compilation Techniques (PACT ’09). IEEE, Washington, DC, USA,29–40. https://doi.org/10.1109/PACT.2009.26 ! pages 102, 121[20] Christian Bienia and Kai Li. 2009. PARSEC 2.0: A New Benchmark Suitefor Chip-Multiprocessors. In Proceedings of the 5th Annual Workshop onModeling, Benchmarking and Simulation. ! pages 22, 51, 84[21] Sergey Blagodurov and Alexandra Fedorova. 2011. In Search forContention-descriptive Metrics in HPC Cluster Environment. In Proceedingsof the 2nd ACM/SPEC International Conference on PerformanceEngineering (ICPE ’11). ACM, New York, NY, USA, 457–462.https://doi.org/10.1145/1958746.1958815 ! pages 7[22] Sergey Blagodurov, Alexandra Fedorova, Evgeny Vinnik, Tyler Dwyer, andFabien Hermenier. 2015. Multi-objective Job Placement in Clusters. InProceedings of the 27th International Conference for High PerformanceComputing, Networking, Storage and Analysis (SC ’15). ACM, New York,NY, USA, Article 66, 12 pages. https://doi.org/10.1145/2807591.2807636! pages[23] Sergey Blagodurov, Sergey Zhuravlev, Mohammad Dashti, and AlexandraFedorova. 2011. A Case for NUMA-aware Contention Management onMulticore Systems. In Proceedings of the 2011 USENIX Annual TechnicalConference (ATC ’11). USENIX Association, Berkeley, CA, USA. ! pages7, 99[24] William Bolosky, Robert Fitzgerald, and Michael Scott. 1989. Simple butEffective Techniques for NUMA Memory Management. In Proceedings of130the 12th ACM Symposium on Operating Systems Principles (SOSP ’89).ACM, New York, NY, USA, 19–31. https://doi.org/10.1145/74850.74854! pages 97[25] Silas Boyd-Wickizer, Haibo Chen, Rong Chen, Yandong Mao, FransKaashoek, Robert Morris, Aleksey Pesterev, Lex Stein, Ming Wu, YuehuaDai, Yang Zhang, and Zheng Zhang. 2008. Corey: An Operating System forMany Cores. In Proceedings of the 8th USENIX Conference on OperatingSystems Design and Implementation (OSDI’08). USENIX Association,Berkeley, CA, USA, 43–57. ! pages 69, 72[26] Silas Boyd-Wickizer, Austin T. Clements, Yandong Mao, Aleksey Pesterev,M. Frans Kaashoek, Robert Morris, and Nickolai Zeldovich. 2010. AnAnalysis of Linux Scalability to Many Cores. In Proceedings of the 9thUSENIX Conference on Operating Systems Design and Implementation(OSDI ’10). USENIX Association, Berkeley, CA, USA, 1–16. ! pages 113[27] Silas Boyd-Wickizer, M. Frans Kaashoek, Robert Morris, and NickolaiZeldovich. 2011. A Software Approach to Unifying Multicore Caches.Technical Report MIT-CSAIL-TR-2011-032. MIT. ! pages 99, 101[28] Timothy Brecht. 1993. On the Importance of Parallel Application Placementin NUMA Multiprocessors. In USENIX Experiences with Distributed andMultiprocessor Systems (SEDMS ’93), Vol. 4. USENIX Association,Berkeley, CA, USA, 1–18. ! pages 97[29] Calin Cascaval, Evelyn Duesterwald, Peter F Sweeney, and Robert WWisniewski. 2005. Multiple page size modeling and optimization. InProceedings of the 14th International Conference on Parallel Architecturesand Compilation Techniques (PACT ’05). IEEE, Washington, DC, USA,339–349. https://doi.org/10.1109/PACT.2005.32 ! pages 122[30] Sheng Chen, SA Billings, and PM Grant. 1990. Non-linear systemidentification using neural networks. International journal of control 51, 6(1990), 1191–1214. ! pages 11[31] Sheng Chen, Stephen A Billings, and Wan Luo. 1989. Orthogonal leastsquares methods and their application to non-linear system identification.International Journal of control 50, 5 (1989), 1873–1896. ! pages 11[32] Michael Clark. 2016. A New, High Performance x86 Core Design fromAMD. Video. In Proceedings of the 28th Symposium on High Performance131Chips (HC ’16). Retrieved May 30th 2017 fromhttps://www.hotchips.org/archives/2010s/hc28/ ! pages 38, 126[33] Kevin Closson. 2009. Quantifying Hugepages Memory Savings with OracleDatabase 11g. (July 2009). Retrieved May 30th 2017 fromhttp://kevinclosson.wordpress.com/2009/07/28/quantifying-hugepages-memory-savings-with-oracle-database-11g/. ! pages 102, 110[34] Jonathan Corbet. 2012. AutoNUMA: the other approach to NUMAscheduling. LWN (March 2012). ! pages 83[35] Bill Dally. 2011. Power, Programmability, and Granularity: The Challengesof ExaScale Computing. In Proceedings of the 25th International Parallel &Distributed Processing Symposium (IPDPS ’11). IEEE, Washington, DC,USA, 878–. https://doi.org/10.1109/IPDPS.2011.420 ! pages 95[36] Mohammad Dashti, Alexandra Fedorova, Justin Funston, Fabien Gaud,Renaud Lachaize, Baptiste Lepers, Vivien Quema, and Mark Roth. 2013.Traffic Management: A Holistic Approach to Memory Placement on NUMASystems. In Proceedings of the 18th International Conference onArchitectural Support for Programming Languages and Operating Systems(ASPLOS ’13). ACM, New York, NY, USA, 381–394.https://doi.org/10.1145/2451116.2451157 ! pages 4, 7, 10, 68[37] Arash Deshmeh, Jacob Machina, and Angela Sodan. 2010. ADEPTscalability predictor in support of adaptive resource allocation. InProceedings of the 24th International Parallel & Distributed ProcessingSymposium (IPDPS ’10). IEEE, Washington, DC, USA, 1–12.https://doi.org/10.1109/IPDPS.2010.5470430 ! pages 66[38] Advanced Micro Devices. 2010. AMD64 Technology Lightweight ProfilingSpecification. Sunnyvale, CA, USA. ! pages 96[39] Norman Richard Draper and H Smith. 1966. Applied regression analysis.John Wiley & Sons. ! pages 27[40] P.J Drongowski and B.D. Center. 2007. Instruction-based sampling: A newperformance analysis technique for AMD family 10h processors. Sunnyvale,CA, USA. ! pages 76[41] Tyler Dwyer, Alexandra Fedorova, Sergey Blagodurov, Mark Roth, FabienGaud, and Jian Pei. 2012. A Practical Method for Estimating PerformanceDegradation on Multicore Processors, and Its Application to HPC132Workloads. In Proceedings of the International Conference on HighPerformance Computing, Networking, Storage and Analysis (SC ’12). IEEE,Los Alamitos, CA, USA, Article 83, 11 pages.https://doi.org/10.1109/SC.2012.11 ! pages 7, 11, 26[42] Stijn Eyerman and Lieven Eeckhout. 2010. Probabilistic Job SymbiosisModeling for SMT Processor Scheduling. In Proceedings of the 15thInternational Conference on Architectural Support for ProgrammingLanguages and Operating Systems (ASPLOS ’10). ACM, New York, NY,USA, 91–102. https://doi.org/10.1145/1736020.1736033 ! pages 66[43] Alexandra Fedorova, Margo Seltzer, and Michael D. Smith. 2007.Improving Performance Isolation on Chip Multiprocessors via an OperatingSystem Scheduler. In Proceedings of the 16th International Conference onParallel Architecture and Compilation Techniques (PACT ’07). IEEE,Washington, DC, USA, 25–38. https://doi.org/10.1109/PACT.2007.40 !pages 9, 10[44] Michael Ferdman, Almutaz Adileh, Onur Kocberber, Stavros Volos,Mohammad Alisafaee, Djordje Jevdjic, Cansu Kaynak, Adrian DanielPopescu, Anastasia Ailamaki, and Babak Falsafi. 2012. Clearing the Clouds:A Study of Emerging Scale-out Workloads on Modern Hardware. InProceedings of the 17th International Conference on Architectural Supportfor Programming Languages and Operating Systems (ASPLOS ’12). ACM,New York, NY, USA, 37–48. https://doi.org/10.1145/2150976.2150982 !pages 100, 101[45] Justin Funston, Maxime Lorrillere, David Vengerov, Baptiste LepersJean-Pierre Lozi, Vivien Quema, and Alexandra Fedorova. 2018. A PracticalModel for Placement of Workloads on Multicore NUMA Systems. InSubmitted to the 13th European Conference on Computer Systems (EuroSys’18). ! pages 6[46] Justin R. Funston, Kaoutar El Maghraoui, Joefon Jann, Pratap Pattnaik, andAlexandra Fedorova. 2012. An SMT-Selection Metric to ImproveMultithreaded Applications’ Performance. In Proceedings of the 26thInternational Parallel & Distributed Processing Symposium (IPDPS ’12).IEEE, Washington, DC, USA, 1388–1399.https://doi.org/10.1109/IPDPS.2012.125 ! pages 7, 9, 39[47] Ben Gamsa, Orran Krieger, Jonathan Appavoo, and Michael Stumm. 1999.Tornado: Maximizing Locality and Concurrency in a Shared Memory133Multiprocessor Operating System. In Proceedings of the 3rd Symposium onOperating Systems Design and Implementation (OSDI ’99). USENIXAssociation, Berkeley, CA, USA, 87–100. ! pages 99[48] Fabien Gaud, Baptiste Lepers, Jeremie Decouchant, Justin Funston,Alexandra Fedorova, and Vivien Que´ma. 2014. Large Pages May BeHarmful on NUMA Systems. In Proceedings of the 2014 USENIX AnnualTechnical Conference (ATC ’14). USENIX Association, Berkeley, CA, USA,231–242. ! pages 7, 102[49] Daniel Goodman, Georgios Varisteas, and Tim Harris. 2017. Pandia:Comprehensive Contention-sensitive Thread Placement. In Proceedings ofthe 12th European Conference on Computer Systems (EuroSys ’17). ACM,New York, NY, USA, 254–269. https://doi.org/10.1145/3064176.3064177! pages 3, 7, 12, 13, 126[50] Mel Gorman and Patrick Healy. 2010. Performance Characteristics ofExplicit Superpage Support. In Proceedings of the 6th Annual Workshorp onthe Interaction between Operating Systems and Computer Architecture(WIOSCA ’10). Springer-Verlag, Berlin, Heidelberg.https://doi.org/10.1007/978-3-642-24322-6 24 ! pages 121[51] Wei Huang, Jiang Lin, Zhao Zhang, and J. Morris Chang. 2005.Performance Characterization of Java Applications on SMT Processors. InProceedings of the 2005 International Symposium on Performance Analysisof Systems and Software (ISPASS ’05). 102–111.https://doi.org/10.1109/ISPASS.2005.1430565 ! pages 40[52] Intel. 2011. Intel 64 and IA-32 Architectures Software Developer’s Manual.Santa Clara, CA, USA. ! pages 49[53] Ivan Jibaja, Ting Cao, Stephen M. Blackburn, and Kathryn S. McKinley.2016. Portable Performance on Asymmetric Multicore Processors. InProceedings of the 2016 International Symposium on Code Generation andOptimization (CGO ’16). ACM, New York, NY, USA, 24–35.https://doi.org/10.1145/2854038.2854047 ! pages 66[54] George H John, Ron Kohavi, and Karl Pfleger. 1994. Irrelevant features andthe subset selection problem. In Machine learning: proceedings of theeleventh international conference. 121–129. ! pages 27134[55] Ron Kalla, Balaram Sinharoy, William J. Starke, and Michael Floyd. 2010.POWER7: IBM’s Next-Generation Server Processor. IEEE Micro 30, 2(March 2010), 7–15. https://doi.org/10.1109/MM.2010.38 ! pages 43, 47[56] Rob Knauerhase, Paul Brett, Barbara Hohlt, Tong Li, and Scott Hahn. 2008.Using OS Observations to Improve Performance in Multicore Systems.IEEE Micro 28, 3 (May 2008), 54–66. https://doi.org/10.1109/MM.2008.48! pages 7, 9, 10, 99[57] Renaud Lachaize, Baptiste Lepers, and Vivien Que´ma. 2012. MemProf: AMemory Profiler for NUMA Multicore Systems. In Proceedings of the 2012USENIX Annual Technical Conference (ATC ’12). USENIX Association,Berkeley, CA, USA, 53–64. ! pages 7, 80[58] R. P. LaRowe, Carla S. Ellis, and Mark A. Holliday. 1992. Evaluation ofNUMA memory management through modeling and measurements. IEEETransactions on Parallel and Distributed Systems 3, 6 (Nov 1992), 686–701.https://doi.org/10.1109/71.180624 ! pages 97[59] Baptiste Lepers, Vivien Que´ma, and Alexandra Fedorova. 2015. Thread andMemory Placement on NUMA Systems: Asymmetry Matters. InProceedings of the 2015 USENIX Annual Technical Conference (ATC ’15).USENIX Association, Berkeley, CA, USA, 277–289. ! pages 3, 7, 9, 10,34, 37[60] Lennart Ljung. 1998. System identification. In Signal analysis andprediction. Springer, 163–173. ! pages 11[61] Jean-Pierre Lozi, Baptiste Lepers, Justin Funston, Fabien Gaud, VivienQue´ma, and Alexandra Fedorova. 2016. The Linux Scheduler: A Decade ofWasted Cores. In Proceedings of the 11th European Conference onComputer Systems (EuroSys ’16). ACM, New York, NY, USA, Article 1,16 pages. https://doi.org/10.1145/2901318.2901326 ! pages 11[62] Joshua Magee and Apan Qasem. 2009. A Case for Compiler-drivenSuperpage Allocation. In Proceedings of the 47th Annual SoutheastRegional Conference (ACM-SE 47). ACM, New York, NY, USA, Article 82,4 pages. https://doi.org/10.1145/1566445.1566553 ! pages 122[63] Zoltan Majo and Thomas R. Gross. 2011. Memory Management in NUMAMulticore Systems: Trapped Between Cache Contention and InterconnectOverhead. In Proceedings of the 2011 International Symposium on Memory135Management (ISMM ’11). ACM, New York, NY, USA, 11–20.https://doi.org/10.1145/1993478.1993481 ! pages 99[64] Zoltan Majo and Thomas R. Gross. 2011. Memory System Performance in aNUMA Multicore Multiprocessor. In Proceedings of the 4th AnnualInternational Conference on Systems and Storage (SYSTOR ’11). ACM,New York, NY, USA, Article 12, 10 pages.https://doi.org/10.1145/1987816.1987832 ! pages 99[65] Yandong Mao, Robert Morris, and Frans Kaashoek. 2010. OptimizingMapReduce for multicore architectures. Technical Report. MIT. ! pages22, 84, 104[66] Harry M. Mathis, Alex E. Mericas, John D. McCalpin, Richard J.Eickemeyer, and Steven R. Kunkel. 2005. Characterization of simultaneousmultithreading (SMT) efficiency in POWER5. IBM Journal Research andDevelopment 49 (July 2005), 555–564. Issue 4/5. ! pages 40, 65[67] John D. McCalpin. 1995. Memory Bandwidth and Machine Balance inCurrent High Performance Computers. IEEE Computer Society TechnicalCommittee on Computer Architecture (TCCA) Newsletter (December 1995),19–25. ! pages 17, 51[68] Celso L. Mendes, Jhy chun Wang, and Daniel A. Reed. 1995. AutomaticPerformance Prediction and Scalability Analysis for Data Parallel Programs.In In Proceedings of the CRPC Workshop on Data Layout and PerformancePrediction. 45–51. ! pages 66[69] Andreas Merkel, Jan Stoess, and Frank Bellosa. 2010. Resource-consciousScheduling for Energy Efficiency on Multicore Processors. In Proceedingsof the 5th European Conference on Computer Systems (EuroSys ’10). ACM,New York, NY, USA, 153–166. https://doi.org/10.1145/1755913.1755930! pages 7, 9, 99[70] Zviad Metreveli, Nickolai Zeldovich, and M. Frans Kaashoek. 2012.CPHASH: A Cache-partitioned Hash Table. In Proceedings of the 17th ACMSIGPLAN Symposium on Principles and Practice of Parallel Programming(PPoPP ’12). ACM, New York, NY, USA, 319–320.https://doi.org/10.1145/2145816.2145874 ! pages 99[71] Daniel Molka, Daniel Hackenberg, Robert Scho¨ne, and Wolfgang E. Nagel.2015. Cache Coherence Protocol and Memory Performance of the IntelHaswell-EP Architecture. In Proceedings of the 44th International136Conference on Parallel Processing (ICPP ’15). IEEE, 739–748.https://doi.org/10.1109/ICPP.2015.83 ! pages 38[72] Juan Navarro, Sitaram Iyer, Peter Druschel, and Alan Cox. 2002. Practical,Transparent Operating System Support for Superpages. In Proceedings ofthe 5th Symposium on Operating Systems Design and Implementation (OSDI’02). USENIX Association, Berkeley, CA, USA, 89–104. ! pages 122[73] Ranjit Noronha and Dhabaleswar K Panda. 2007. Improving scalability ofOpenMP applications on multi-core systems using large page support. InProceedings of the 21st International Parallel and Distributed ProcessingSymposium (IPDPS ’07). IEEE, Washington, DC, USA, 1–8.https://doi.org/10.1109/IPDPS.2007.370683 ! pages 102, 121[74] Ippokratis Pandis, Ryan Johnson, Nikos Hardavellas, and AnastasiaAilamaki. 2010. Data-oriented Transaction Execution. Proc. VLDB Endow.3, 1-2 (Sept. 2010), 928–939. https://doi.org/10.14778/1920841.1920959! pages 99, 100[75] Aleksey Pesterev, Nickolai Zeldovich, and Robert T. Morris. 2010. LocatingCache Performance Bottlenecks Using Data Profiling. In Proceedings of the5th European Conference on Computer Systems (EuroSys ’10). ACM, NewYork, NY, USA, 335–348. https://doi.org/10.1145/1755913.1755947 !pages 80[76] Petar Radojkovic, Paul M. Carpenter, Miquel Moreto´, Vladimir Cakarevic,Javier Verdu´, Alex Pajuelo, Francisco J. Cazorla, Mario Nemirovsky, andMateo Valero. 2016. Thread Assignment in Multicore/MultithreadedProcessors: A Statistical Approach. IEEE Trans. Computers 65, 1 (2016),256–269. https://doi.org/10.1109/TC.2015.2417533 ! pages 13[77] Lior Rokach and Oded Maimon. 2005. Top-down induction of decision treesclassifiers - a survey. IEEE Transactions 35, 4 (Nov 2005), 476–487.https://doi.org/10.1109/TSMCC.2004.843247 ! pages 61[78] Theodore H Romer, Wayne H Ohlrich, Anna R Karlin, and Brian N Bershad.1995. Reducing TLB and memory overhead using online superpagepromotion. In Proceedings of the 22nd Annual International Symposium onComputer Architecture. IEEE, Washington, DC, USA, 176–187.https://doi.org/10.1145/223982.224419 ! pages 122137[79] Peter J Rousseeuw. 1987. Silhouettes: a graphical aid to the interpretationand validation of cluster analysis. Journal of computational and appliedmathematics 20 (1987), 53–65. ! pages 22[80] Yaoping Ruan, Vivek S. Pai, Erich Nahum, and John M. Tracey. 2005.Evaluating the Impact of Simultaneous Multithreading on Network ServersUsing Real Hardware. In Proceedings of the 2005 ACM SIGMETRICSInternational Conference on Measurement and Modeling of ComputerSystems (SIGMETRICS ’05). ACM, New York, NY, USA, 315–326.https://doi.org/10.1145/1064212.1064254 ! pages 65[81] Satish Kumar Sadasivam and Prathiba Kumar. 2011. SPECfp2006 CPIStack and SMT Benefits Analysis on POWER7 Systems. In Proceedings ofthe 17th Meeting of the IBM HPC Systems Scientific Computing User Group(ScicomP ’11). Paris, France. ! pages 40[82] Hideki Saito, Greg Gaertner, Wesley Jones, Rudolf Eigenmann, HidetoshiIwashita, Ron Lieberman, Matthijs Waveren, and Brian Whitney. 2002.Large System Performance of SPEC OMP2001 Benchmarks. InProceedings of the 4th International Symposium on High PerformanceComputing (ISHPC ’02). Springer-Verlag, Berlin, Heidelberg, 370–379.https://doi.org/10.1007/3-540-47847-7 34 ! pages 51[83] Tudor-Ioan Salomie, Ionut Emanuel Subasu, Jana Giceva, and GustavoAlonso. 2011. Database Engines on Multicores, Why Parallelize when YouCan Distribute?. In Proceedings of the 6th European Conference onComputer Systems (EuroSys ’11). ACM, New York, NY, USA, 17–30.https://doi.org/10.1145/1966445.1966448 ! pages 99, 100[84] Alex Settle, Joshua Kihm, Andrew Janiszewski, and Dan Connors. 2004.Architectural Support for Enhanced SMT Job Scheduling. In Proceedings ofthe 13th International Conference on Parallel Architectures andCompilation Techniques (PACT ’04). IEEE, Washington, DC, USA, 63–73.https://doi.org/10.1109/PACT.2004.7 ! pages 65[85] Arman Shehabi, Sarah Smith, Dale Sartor, Richard Brown, Magnus Herrlin,Jonathan Koomey, Eric Masanet, Nathaniel Horner, Ineˆs Azevedo, andWilliam Lintner. 2016. United States data center energy usage report.Technical Report LBNL-1005775. Lawrence Berkeley National Laboratory.! pages 1[86] Abraham Silberschatz, Peter Baer Galvin, and Greg Gagne. 2009. OperatingSystem Concepts, Eighth Edition. Wiley, Hoboken, NJ, USA. ! pages 12138[87] Allan Snavely and Dean M. Tullsen. 2000. Symbiotic Jobscheduling for aSimultaneous Multithreaded Processor. In Proceedings of the 9thInternational Conference on Architectural Support for ProgrammingLanguages and Operating Systems (ASPLOS ’00). ACM, New York, NY,USA, 234–244. https://doi.org/10.1145/378993.379244 ! pages 7, 9, 65[88] Allan Snavely, Dean M. Tullsen, and Geoff Voelker. 2002. SymbioticJobscheduling with Priorities for a Simultaneous Multithreading Processor.In Proceedings of the 2002 ACM SIGMETRICS International Conference onMeasurement and Modeling of Computer Systems (SIGMETRICS ’02).ACM, New York, NY, USA, 66–76. https://doi.org/10.1145/511334.511343! pages 65[89] Xiang Song, Haibo Chen, Rong Chen, Yuanxuan Wang, and Binyu Zang.2011. A Case for Scaling Applications to Many-core with OS Clustering. InProceedings of the 6th European Conference on Computer Systems (EuroSys’11). ACM, New York, NY, USA, 61–76.https://doi.org/10.1145/1966445.1966452 ! pages 99[90] Srinath Sridharan, Gagan Gupta, and Gurindar S. Sohi. 2014. Adaptive,Efficient, Parallel Execution of Parallel Programs. In Proceedings of the 35thACM SIGPLAN Conference on Programming Language Design andImplementation (PLDI ’14). ACM, New York, NY, USA, 169–180.https://doi.org/10.1145/2594291.2594292 ! pages 13[91] Kshitij Sudan, Niladrish Chatterjee, David Nellans, Manu Awasthi, RajeevBalasubramonian, and Al Davis. 2010. Micro-pages: Increasing DRAMEfficiency with Locality-aware Data Placement. In Proceedings of the 15thInternational Conference on Architectural Support for ProgrammingLanguages and Operating Systems (ASPLOS ’10). ACM, New York, NY,USA, 219–230. https://doi.org/10.1145/1736020.1736045 ! pages 122[92] Madhusudhan Talluri, Shing Kong, Mark D. Hill, and David A. Patterson.1992. Tradeoffs in Supporting Two Page Sizes. In Proceedings of the 19thAnnual International Symposium on Computer Architecture (ISCA ’92).ACM, New York, NY, USA, 415–424.https://doi.org/10.1145/139669.140406 ! pages 103[93] David Tam, Reza Azimi, and Michael Stumm. 2007. Thread Clustering:Sharing-aware Scheduling on SMP-CMP-SMT Multiprocessors. InProceedings of the 2nd ACM SIGOPS European Conference on Computer139Systems (EuroSys ’07). ACM, New York, NY, USA, 47–58.https://doi.org/10.1145/1272996.1273004 ! pages 3, 10, 66, 98[94] Michael E. Thomadakis. 2011. The Architecture of the Nehalem Processorand Nehalem-EP SMP Platforms. Technical Report. Texas A&MUniversity. ! pages 48[95] Transaction Processing Performance Council (TPC). 2010. TPC BenchmarkC Standard Specification. San Francisco, CA, USA. ! pages 22[96] Transaction Processing Performance Council (TPC). 2014. TPC BenchmarkH Standard Specification. San Francisco, CA, USA. ! pages 22[97] Dean M. Tullsen, Susan J. Eggers, and Henry M. Levy. 1998. Simultaneousmultithreading: maximizing on-chip parallelism. In 25 Years of theInternational Symposia on Computer Architecture (selected papers) (ISCA’98). ACM, New York, NY, USA, 533–544. ! pages 39[98] Ben Verghese, Scott Devine, Anoop Gupta, and Mendel Rosenblum. 1996.Operating System Support for Improving Data Locality on CC-NUMACompute Servers. In Proceedings of the 7th International Conference onArchitectural Support for Programming Languages and Operating Systems(ASPLOS ’96). ACM, New York, NY, USA, 279–289.https://doi.org/10.1145/237090.237205 ! pages 69, 97[99] Pinchas Weisberg and Yair Wiseman. 2009. Using 4KB page size for virtualmemory is obsolete. In 2009 IEEE International Conference on InformationReuse & Integration (IRI ’09). IEEE, Washington, DC, USA, 262–265.https://doi.org/10.1109/IRI.2009.5211562 ! pages 121[100] Chee Siang Wong, Ian K. T. Tan, Rosalind Deena Kumari, and Fun Wey.2008. Towards achieving fairness in the Linux scheduler. Operating SystemsReview 42, 5 (2008), 34–43. https://doi.org/10.1145/1400097.1400102 !pages 12[101] Hailong Yang, Alex Breslow, Jason Mars, and Lingjia Tang. 2013.Bubble-flux: Precise Online QoS Management for Increased Utilization inWarehouse Scale Computers. In Proceedings of the 40th AnnualInternational Symposium on Computer Architecture (ISCA ’13). ACM, NewYork, NY, USA, 607–618. https://doi.org/10.1145/2485922.2485974 !pages 3, 7, 9, 10140[102] Xi Yang, Stephen M. Blackburn, and Kathryn S. McKinley. 2015.Computer Performance Microscopy with Shim. SIGARCH Comput. Archit.News 43, 3 (June 2015), 170–184.https://doi.org/10.1145/2872887.2750401 ! pages 118[103] Gerd Zellweger, Denny Lin, and Timothy Roscoe. 2016. So ManyPerformance Events, So Little Time. In Proceedings of the 7th ACM SIGOPSAsia-Pacific Workshop on Systems (APSys ’16). ACM, New York, NY, USA,Article 14, 9 pages. https://doi.org/10.1145/2967360.2967375 ! pages 26[104] Panyong Zhang, Bo Li, Zhigang Huo, and Dan Meng. 2009. Evaluating theEffect of Huge Page on Large Scale Applications. In Proceedings of the2009 International Conference on Networking, Architecture, and Storage(NAS ’09). IEEE, Washington, DC, USA, 74–81.https://doi.org/10.1109/NAS.2009.18 ! pages 121[105] Yunqi Zhang, Michael A Laurenzano, Jason Mars, and Lingjia Tang. 2014.SMiTe: Precise QoS Prediction on Real-System SMT Processors to ImproveUtilization in Warehouse Scale Computers. In Proceedings of the 47thAnnual International Symposium on Microarchitecture (MICRO ’14). IEEE,Washington, DC, USA, 406–418. https://doi.org/10.1109/MICRO.2014.53! pages 3, 7, 10[106] Jin Zhou and Brian Demsky. 2012. Memory Management for Many-coreProcessors with Software Configurable Locality Policies. In Proceedings ofthe 2012 International Symposium on Memory Management (ISMM ’12).ACM, New York, NY, USA, 3–14.https://doi.org/10.1145/2258996.2259000 ! pages 98[107] Sergey Zhuravlev, Sergey Blagodurov, and Alexandra Fedorova. 2010.Addressing Shared Resource Contention in Multicore Processors viaScheduling. In Proceedings of the 15th International Conference onArchitectural Support for Programming Languages and Operating Systems(ASPLOS ’10). ACM, New York, NY, USA, 129–142.https://doi.org/10.1145/1736020.1736036 ! pages 3, 7, 9, 10, 99[108] Sergey Zhuravlev, Juan Carlos Saez, Sergey Blagodurov, AlexandraFedorova, and Manuel Prieto. 2012. Survey of Scheduling Techniques forAddressing Shared Resources in Multicore Processors. ACM Comput. Surv.45, 1, Article 4 (Dec. 2012), 28 pages.https://doi.org/10.1145/2379776.2379780 ! pages 7141Appendix ASupporting MaterialsA.1 Non-Interference of Workloads on Separate NUMANodesOne of the assumptions of our workload placement solution stated in Section 2.2.2is that workloads placed on separate NUMA nodes and not sharing interconnectlinks will not interfere with one another. Figures A.1–A.3 show the results ofexperiments (conducted on the AMD test system shown in Figure 2.2). The exper-iments run two workloads at a time: one potentially interfering workload and oneworkload where the performance change relative to running alone is reported. Theslowdown observed is always less than 5% and usually less than 2%, confirmingour assumption that co-scheduled workloads placed on separate NUMA nodes andnot sharing interconnect links will not significantly interfere with one another.142Figure A.1: Performance slowdown with cg.C as the interfering workloadFigure A.2: Performance slowdown with mg.C as the interfering workloadA.2 Performance Prediction Results for the ML ModelThis section contains the full performance prediction results for the ML modeldescribed in Section 2.4.2. Figures A.4 and A.5 show the actual and predictedperformance for each workload for important placements on the AMD and Intelsystems. The x-axis shows the IDs of the important placements, numbered 0–11 onthe AMD system and 0–6 on the Intel system. The y-axis shows the performancein the placements relative to the baseline. Placement #2 was chosen as the baselinefor the AMD system, and placement #1 for the Intel system.143Figure A.3: Performance slowdown with streamcluster as the interferingworkload144 0 0.2 0.4 0.6 0.8 1 1.2 1.4 0  1  2  3  4  5  6  7  8  9  10  11BLASTActualPredicted: HPEPredicted: Perf Measurements 0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 0  1  2  3  4  5  6  7  8  9  10  11bt.BActualPredicted: HPEPredicted: Perf Measurements 0 0.2 0.4 0.6 0.8 1 1.2 1.4 0  1  2  3  4  5  6  7  8  9  10  11cannealActualPredicted: HPEPredicted: Perf Measurements 0 0.5 1 1.5 2 2.5 0  1  2  3  4  5  6  7  8  9  10  11cg.CActualPredicted: HPEPredicted: Perf Measurements 0 0.2 0.4 0.6 0.8 1 1.2 0  1  2  3  4  5  6  7  8  9  10  11CommDB-tpchActualPredicted: HPEPredicted: Perf Measurements 0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 0  1  2  3  4  5  6  7  8  9  10  11dc.BActualPredicted: HPEPredicted: Perf Measurements 0 0.2 0.4 0.6 0.8 1 1.2 0  1  2  3  4  5  6  7  8  9  10  11ep.CActualPredicted: HPEPredicted: Perf Measurements 0 0.2 0.4 0.6 0.8 1 1.2 0  1  2  3  4  5  6  7  8  9  10  11fluidanimateActualPredicted: HPEPredicted: Perf Measurements 0 0.2 0.4 0.6 0.8 1 1.2 0  1  2  3  4  5  6  7  8  9  10  11freqmineActualPredicted: HPEPredicted: Perf Measurements 0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 0  1  2  3  4  5  6  7  8  9  10  11ft.CActualPredicted: HPEPredicted: Perf Measurements 0 0.2 0.4 0.6 0.8 1 1.2 1.4 0  1  2  3  4  5  6  7  8  9  10  11gccActualPredicted: HPEPredicted: Perf Measurements 0 0.2 0.4 0.6 0.8 1 1.2 0  1  2  3  4  5  6  7  8  9  10  11kmeans-15MActualPredicted: HPEPredicted: Perf Measurements 0 0.2 0.4 0.6 0.8 1 1.2 0  1  2  3  4  5  6  7  8  9  10  11kmeansActualPredicted: HPEPredicted: Perf Measurements 0 0.2 0.4 0.6 0.8 1 1.2 1.4 0  1  2  3  4  5  6  7  8  9  10  11lu.BActualPredicted: HPEPredicted: Perf Measurements 0 0.2 0.4 0.6 0.8 1 1.2 0  1  2  3  4  5  6  7  8  9  10  11matrixmultiplyActualPredicted: HPEPredicted: Perf MeasurementsFigure A.4145 0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 1.8 2 0  1  2  3  4  5  6  7  8  9  10  11mg.BActualPredicted: HPEPredicted: Perf Measurements 0 0.5 1 1.5 2 2.5 0  1  2  3  4  5  6  7  8  9  10  11mg.CActualPredicted: HPEPredicted: Perf Measurements 0 0.2 0.4 0.6 0.8 1 1.2 0  1  2  3  4  5  6  7  8  9  10  11pcaActualPredicted: HPEPredicted: Perf Measurements 0 0.2 0.4 0.6 0.8 1 1.2 1.4 0  1  2  3  4  5  6  7  8  9  10  11postgres-tpchActualPredicted: HPEPredicted: Perf Measurements 0 0.2 0.4 0.6 0.8 1 1.2 1.4 0  1  2  3  4  5  6  7  8  9  10  11postgres-tpccActualPredicted: HPEPredicted: Perf Measurements 0 0.2 0.4 0.6 0.8 1 1.2 1.4 0  1  2  3  4  5  6  7  8  9  10  11spark-ccActualPredicted: HPEPredicted: Perf Measurements 0 0.2 0.4 0.6 0.8 1 1.2 1.4 0  1  2  3  4  5  6  7  8  9  10  11spark-pr-ljActualPredicted: HPEPredicted: Perf Measurements 0 0.2 0.4 0.6 0.8 1 1.2 1.4 0  1  2  3  4  5  6  7  8  9  10  11spark-pr-wpActualPredicted: HPEPredicted: Perf Measurements 0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 1.8 2 0  1  2  3  4  5  6  7  8  9  10  11sp.BActualPredicted: HPEPredicted: Perf Measurements 0 0.5 1 1.5 2 2.5 0  1  2  3  4  5  6  7  8  9  10  11sp.CActualPredicted: HPEPredicted: Perf Measurements 0 0.2 0.4 0.6 0.8 1 1.2 0  1  2  3  4  5  6  7  8  9  10  11streamcluster-125KActualPredicted: HPEPredicted: Perf Measurements 0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 1.8 0  1  2  3  4  5  6  7  8  9  10  11streamclusterActualPredicted: HPEPredicted: Perf Measurements 0 0.2 0.4 0.6 0.8 1 1.2 0  1  2  3  4  5  6  7  8  9  10  11swaptionsActualPredicted: HPEPredicted: Perf Measurements 0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 0  1  2  3  4  5  6  7  8  9  10  11ua.BActualPredicted: HPEPredicted: Perf Measurements 0 0.2 0.4 0.6 0.8 1 1.2 1.4 0  1  2  3  4  5  6  7  8  9  10  11vipsActualPredicted: HPEPredicted: Perf MeasurementsFigure A.4146 0 0.2 0.4 0.6 0.8 1 1.2 1.4 0  1  2  3  4  5  6  7  8  9  10  11wcActualPredicted: HPEPredicted: Perf Measurements 0 0.2 0.4 0.6 0.8 1 1.2 1.4 0  1  2  3  4  5  6  7  8  9  10  11wrActualPredicted: HPEPredicted: Perf Measurements 0 0.2 0.4 0.6 0.8 1 1.2 1.4 0  1  2  3  4  5  6  7  8  9  10  11WTbtreeActualPredicted: HPEPredicted: Perf Measurements 0 0.2 0.4 0.6 0.8 1 1.2 1.4 0  1  2  3  4  5  6  7  8  9  10  11x264ActualPredicted: HPEPredicted: Perf MeasurementsFigure A.4: Accuracy of predictions on the AMD system.147 0 0.2 0.4 0.6 0.8 1 1.2 0  1  2  3  4  5  6BLASTActualPredicted: HPEPredicted: Perf Measurements 0 0.2 0.4 0.6 0.8 1 1.2 0  1  2  3  4  5  6bodytrackActualPredicted: HPEPredicted: Perf Measurements 0 0.2 0.4 0.6 0.8 1 1.2 1.4 0  1  2  3  4  5  6bt.BActualPredicted: HPEPredicted: Perf Measurements 0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 0  1  2  3  4  5  6bt.CActualPredicted: HPEPredicted: Perf Measurements 0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 0  1  2  3  4  5  6cannealActualPredicted: HPEPredicted: Perf Measurements 0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 0  1  2  3  4  5  6cg.BActualPredicted: HPEPredicted: Perf Measurements 0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 0  1  2  3  4  5  6cg.CActualPredicted: HPEPredicted: Perf Measurements 0 0.2 0.4 0.6 0.8 1 1.2 1.4 0  1  2  3  4  5  6dc.AActualPredicted: HPEPredicted: Perf Measurements 0 0.2 0.4 0.6 0.8 1 1.2 1.4 0  1  2  3  4  5  6dc.BActualPredicted: HPEPredicted: Perf Measurements 0 0.2 0.4 0.6 0.8 1 1.2 0  1  2  3  4  5  6dedupActualPredicted: HPEPredicted: Perf Measurements 0 0.2 0.4 0.6 0.8 1 1.2 0  1  2  3  4  5  6ep.CActualPredicted: HPEPredicted: Perf Measurements 0 0.2 0.4 0.6 0.8 1 1.2 0  1  2  3  4  5  6ep.DActualPredicted: HPEPredicted: Perf Measurements 0 0.2 0.4 0.6 0.8 1 1.2 0  1  2  3  4  5  6ferretActualPredicted: HPEPredicted: Perf Measurements 0 0.2 0.4 0.6 0.8 1 1.2 0  1  2  3  4  5  6fluidanimateActualPredicted: HPEPredicted: Perf MeasurementsFigure A.5148 0 0.2 0.4 0.6 0.8 1 1.2 0  1  2  3  4  5  6freqmineActualPredicted: HPEPredicted: Perf Measurements 0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 1.8 2 0  1  2  3  4  5  6ft.CActualPredicted: HPEPredicted: Perf Measurements 0 0.2 0.4 0.6 0.8 1 1.2 0  1  2  3  4  5  6gccActualPredicted: HPEPredicted: Perf Measurements 0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 1.8 0  1  2  3  4  5  6is.DActualPredicted: HPEPredicted: Perf Measurements 0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 0  1  2  3  4  5  6kmeansActualPredicted: HPEPredicted: Perf Measurements 0 0.2 0.4 0.6 0.8 1 1.2 1.4 0  1  2  3  4  5  6lu.BActualPredicted: HPEPredicted: Perf Measurements 0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 0  1  2  3  4  5  6lu.CActualPredicted: HPEPredicted: Perf Measurements 0 0.2 0.4 0.6 0.8 1 1.2 0  1  2  3  4  5  6matrixmultiplyActualPredicted: HPEPredicted: Perf Measurements 0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 1.8 2 0  1  2  3  4  5  6mg.CActualPredicted: HPEPredicted: Perf Measurements 0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 1.8 2 0  1  2  3  4  5  6mg.DActualPredicted: HPEPredicted: Perf Measurements 0 0.2 0.4 0.6 0.8 1 1.2 0  1  2  3  4  5  6pcaActualPredicted: HPEPredicted: Perf Measurements 0 0.2 0.4 0.6 0.8 1 1.2 0  1  2  3  4  5  6postgres-tpchActualPredicted: HPEPredicted: Perf Measurements 0 0.2 0.4 0.6 0.8 1 1.2 0  1  2  3  4  5  6postgres-tpccActualPredicted: HPEPredicted: Perf Measurements 0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 0  1  2  3  4  5  6raytraceActualPredicted: HPEPredicted: Perf Measurements 0 0.2 0.4 0.6 0.8 1 1.2 0  1  2  3  4  5  6spark-ccActualPredicted: HPEPredicted: Perf MeasurementsFigure A.5149 0 0.2 0.4 0.6 0.8 1 1.2 0  1  2  3  4  5  6spark-pr-ljActualPredicted: HPEPredicted: Perf Measurements 0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 1.8 2 0  1  2  3  4  5  6sp.BActualPredicted: HPEPredicted: Perf Measurements 0 0.5 1 1.5 2 2.5 0  1  2  3  4  5  6sp.CActualPredicted: HPEPredicted: Perf Measurements 0 0.2 0.4 0.6 0.8 1 1.2 1.4 0  1  2  3  4  5  6streamclusterActualPredicted: HPEPredicted: Perf Measurements 0 0.2 0.4 0.6 0.8 1 1.2 0  1  2  3  4  5  6swaptionsActualPredicted: HPEPredicted: Perf Measurements 0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 0  1  2  3  4  5  6ua.BActualPredicted: HPEPredicted: Perf Measurements 0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 0  1  2  3  4  5  6ua.CActualPredicted: HPEPredicted: Perf Measurements 0 0.2 0.4 0.6 0.8 1 1.2 0  1  2  3  4  5  6vipsActualPredicted: HPEPredicted: Perf Measurements 0 0.2 0.4 0.6 0.8 1 1.2 1.4 0  1  2  3  4  5  6wcActualPredicted: HPEPredicted: Perf Measurements 0 0.2 0.4 0.6 0.8 1 1.2 0  1  2  3  4  5  6wrActualPredicted: HPEPredicted: Perf Measurements 0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 1.8 0  1  2  3  4  5  6WTbtreeActualPredicted: HPEPredicted: Perf Measurements 0 0.2 0.4 0.6 0.8 1 1.2 0  1  2  3  4  5  6x264ActualPredicted: HPEPredicted: Perf MeasurementsFigure A.5: Accuracy of predictions on the Intel system.150


Citation Scheme:


Citations by CSL (citeproc-js)

Usage Statistics



Customize your widget with the following options, then copy and paste the code below into the HTML of your page to embed this item in your website.
                            <div id="ubcOpenCollectionsWidgetDisplay">
                            <script id="ubcOpenCollectionsWidget"
                            async >
IIIF logo Our image viewer uses the IIIF 2.0 standard. To load this item in other compatible viewers, use this url:


Related Items