Whose Cache Line Is It Anyway: Automated Detection of False Sharing. Spear, Mark Anthony. 2015.

Whose Cache Line Is It Anyway?
Automated Detection of False Sharing

by

Mark Anthony Spear

B.S.E., Princeton University, 2008

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE in The Faculty of Graduate and Postdoctoral Studies (Computer Science)

THE UNIVERSITY OF BRITISH COLUMBIA (Vancouver)

August 2015

© Mark Anthony Spear 2015

Abstract

The abstraction of a cache is useful to hide the vast difference in speed between computer processors and main memory. For this abstraction to maintain correctness, concurrent access to memory by different processors has to be coordinated so that a consistent view of memory is maintained.

Cache coherency protocols are responsible for this coherency, but they can have adverse implications for performance. The operational granularity of these protocols is a "cache line" (e.g. 64 bytes). Depending on the data contained in the cache line and the data's access patterns, the coherence can be superfluous and the performance implications severe: consider the case where each byte within a cache line is exclusively read and written by a specific core, so that no coherence between cores should be necessary.

My collaborators and I developed a system which detects this phenomenon, known as false sharing (the subject of this thesis), and present a system that can automatically rewrite programs as they are running to avoid the performance penalty. Some parallel programming benchmarks, such as linear regression, can see up to a 3x-6x performance improvement.

Preface

This thesis presents work done as part of a larger system built in collaboration with Mihir Nanavati, Nathan Taylor, Shriram Rajagopalan, Dutch T. Meyer, William Aiello, and Andrew Warfield.
It has been published as a paper entitled "Whose Cache Line is it Anyway: Operating System Support for Live Detection and Repair of False Sharing", in the proceedings of the 8th European Conference on Computer Systems (EuroSys 2013) [22].

Chapter 4 describes the architecture of the entire system, which was designed collectively with Mihir Nanavati, Nathan Taylor, Shriram Rajagopalan, Dutch T. Meyer, William Aiello, and Andrew Warfield. The implementation of the page granularity analysis (Section 4.2.2) belongs to Mihir Nanavati, the byte-level analysis (Section 4.2.3) to Shriram Rajagopalan and Mihir Nanavati, and the final diagnosis (Section 4.2.4) to the author and Mihir Nanavati.

The remapping technique described in Section 4.2.5 belongs to Nathan Taylor and Mihir Nanavati, and has also been published as Cachekata: Memory Hierarchy Optimization via Dynamic Binary Translation [26].

The integration of the performance counters (Section 4.2.1) into Xen, and CCBench (Section 5.3), have been implemented in collaboration with Mihir Nanavati.

All other parts of this thesis, unless specified otherwise, are the work of the author.

Table of Contents

Abstract
Preface
Table of Contents
List of Tables
List of Figures
Glossary
Acknowledgements
Dedication
1 Introduction
2 Background
  2.1 Cache Coherence and False Sharing
    2.1.1 Cache Coherence on the x86 Architecture
    2.1.2 A Concrete Example of False Sharing
    2.1.3 Significance
  2.2 Performance Counters
    2.2.1 Precise Event Based Sampling
  2.3 Page Tables
3 Related Work
  3.1 Memory Performance and False Sharing
  3.2 Detecting False Sharing via Instrumentation
  3.3 Simulation
  3.4 Sampling Event Counters
  3.5 Dynamic Analysis for Performance
4 Design and Architecture of False Sharing Detection
  4.1 Design Overview
  4.2 System Architecture
    4.2.1 Performance Counters
    4.2.2 Page Granularity Analysis
    4.2.3 Byte Level Analysis
    4.2.4 Diagnosis
    4.2.5 Remapping
    4.2.6 Precise Event Based Sampling
5 Evaluation
  5.1 Performance Counter Selection
  5.2 Detection Times and Execution Under Plastic
  5.3 CCBench
  5.4 Shared Memory Benchmarks
  5.5 Effect of Rewriting Optimizations
6 Discussion
  6.1 Future Work
7 Conclusion
Bibliography
Appendix
  A Selected Code Listings

List of Tables

4.1 Performance counters useful for detecting false sharing
5.1 Microbenchmarks in CCBench

List of Figures

2.1 True, false, and no sharing of data
2.2 False sharing in Phoenix's linear regression benchmark
2.3 False sharing of lreg as a product of cores
4.1 System Architecture
4.2 Performance counter values after running Listing A.1
4.3 Sharing status of a byte in the access log
4.4 Independent data on a single cache line
5.1 SNOOP_RESPONSE.HITM values over time while running various applications
5.2 Linear regression running under Plastic
5.3 Performance of Phoenix, Parsec and CCBench suites running with Plastic
5.4 Normalized Performance of Linear Regression with and without Plastic
Glossary

EPT: Extended Page Tables
HITM: HIT-Modified
HVM: Hardware Virtual Machine
MA: Machine Address
MESIF: Modified Exclusive Shared Invalid Forward (a cache coherence protocol based on MESI)
MESI: Modified Exclusive Shared Invalid (a cache coherence protocol)
MIPS: Millions of instructions per second
MMU: Memory Management Unit
MSR: Model-specific register
PA: Physical Address
PEBS: Precise Event Based Sampling
PMU: Performance Monitoring Unit
RDMSR: Read MSR
RDPMC: Read Performance Monitoring Counters
RFO: Read for ownership
VA: Virtual Address
WRMSR: Write MSR

Acknowledgements

This thesis would not have been possible without the guidance and patience of my supervisor, Andrew Warfield, and my collaborators and fellow students in the NSS lab. Over my time at UBC we have worked on many fun projects, both successes and failures, where the fun of the projects was only second to the fun of the scotch and beer after them.

Dedication

For Kathleen, Dad, Mom, and Mike, who have wanted this thesis done more than anyone.

Chapter 1: Introduction

The performance of individual processor cores has shown stagnant growth for a number of years, and the current trend toward faster computation is to increase the number of cores [12]. However, this architectural shift requires a simultaneous paradigm shift for programmers who are used to single-threaded programming. Programmers must now be aware of additional potential problems including scheduling, cache coherency, synchronization, communication, resource contention, and various other issues. Many of these issues have been around for years in larger distributed systems, but now they are being faced on individual machines.

CPUs try to hide some of the underlying distributed nature using mechanisms like cache coherence to pretend memory accesses are uniformly consistent. However, programs which are written obliviously to the underlying microarchitecture are prone to be caught by the mismatch between the programming model and the machine model for memory access performance.
False sharing is an example of this class of problem: ignorance of the hardware's "cache line" granularity for cache coherence can result in an application with unrelated data, accessed by separate cores, on the same cache line. The coherence protocol will keep invalidating data in the various cores' caches, resulting in coherence misses instead of cache hits.

As we will discuss in Chapter 3, designing systems to detect false sharing is not new, but one would typically not want to use these systems due to the impracticality of their performance overhead.

We cannot rely on developers to performance-test for every microarchitecture (some of which have not even been developed yet). Nor can we rely on always having access to source code to recompile or modify programs.

A system for detecting false sharing must be low-overhead, so that we can always keep it running, yet precise enough to find the exact pieces of data contributing to false sharing. A dynamic solution is necessary: Plastic.

This thesis presents a false sharing detection system which uses a staged approach, described in Chapter 4. This design achieves these goals by using some low-overhead advanced processor features (performance counters, event sampling) combined with sparing, temporary use of high-overhead detection (page table manipulations and breakpoints) and analysis algorithms.

The performance of the end-to-end system is demonstrated on various workloads in Chapter 5. Our results show a 3-6x speedup in workloads which exhibit significant amounts of false sharing.

Chapter 2: Background

The work presented in this thesis depends on a number of specifics of the x86 (and related) architectures: cache coherence (and the related phenomenon of false sharing), performance counters, and virtualized memory address spaces implemented with page tables.

2.1 Cache Coherence and False Sharing

Multiple-core parallel computing systems have private per-core caches, but present a coherent view of memory to all execution units.
In order to achieve a coherent view, cache coherency protocols such as Modified Exclusive Shared Invalid (MESI) and Modified Exclusive Shared Invalid Forward (MESIF) provide a view which is easier for programmers to reason about.

These coherence protocols are optimized for the common case, where most data that is shared is being read, potentially even read-only. Writes to non-exclusive cache lines can be particularly slow, as old data needs to be invalidated everywhere and potentially repopulated.

In Figure 2.1, consider a single cache line of 8 DWORDS (64 bytes), C, with multiple accessors: Core A and Core B. If the same data in cache line C is being read and written by Core A and Core B, then we have an instance of true sharing. A lock or a reference counter are common examples of true sharing. By necessity, the cache line will be in each core's caches at various times. If the data being read from cache line C by Core A is disjoint from the data which was written by Core B (e.g. as in cache line C-2), and the data is only incidentally on the same cache line, then we have an instance of false sharing. It is a case of unfortunate luck that the granularity of data isolation (a cache line) is too coarse for the actual coherency needs during a false sharing scenario.

Figure 2.1: True, false, and no sharing of data: only in the case of false sharing does a cache-line-granularity view of accessors differ from the finer-grained view.

If the data were laid out differently, e.g.
in separate cache lines C-3 and C-4, then each line would be protected from invalidations due to the other core's execution.

In the absence of multiple cores and other code being executed, since Core A continually reads its data in cache line C, one would expect it to remain in cache. However, if Core B writes to cache line C, the data in Core A's cache becomes invalidated, and then Core A's reads would result in a coherence miss. This happens because the cores negotiate what data should be stored in their caches to preserve consistency. The cache coherence protocol will be performing invalidations and generating coherence misses, even though the various cores don't need the rest of the contended data on the cache line. As we will see in Chapter 5, this can result in significant performance implications.

2.1.1 Cache Coherence on the x86 Architecture

The x86 processor cache architecture has remained relatively unchanged since Intel's release of the Nehalem microarchitecture in 2008. In order to account for the increasing speed differential between CPUs and main memory, modern processors have multiple layers of caches. The Nehalem architecture has multiple cores per physical socket, each of which has a private L1 and L2 cache. All cores on the same socket share a larger common L3 cache. Between these different caches, replicated on all sockets in the system, and main memory, significant effort is required to keep a coherent view of data at any given time.

The MESIF cache coherency protocol is responsible for maintaining consistency between caches on multiple cores. Each of a core's cache lines has a state associated with it, and the shared L3 cache acts as a directory [10]. Simultaneous reads by multiple cores cause cache lines to exist in a "Shared" state. Any write requires exclusive ownership. The "Modified" state represents exclusive ownership with data which has not been written back to main memory.
"Exclusive" is exclusive ownership, but the data in the cache is also in main memory. Entering either of these states means the same cache line (if present) will become "Invalid" in other cores.

In order for a modified cache line to be read by another core, the data must propagate to a common location. For a local (on-socket) core's cache lines, the modified data will be updated in the package's shared L3 cache. However, a cross-socket read of modified data will require the data to be written back to main memory. Latencies of contentious memory accesses thus vary significantly depending on the topology of the cores involved [21].

2.1.2 A Concrete Example of False Sharing

The "linear regression" test of the Phoenix [25] parallel benchmark suite is a popular example of false sharing [16, 32]. Figure 2.2 shows the lreg_args structure which is responsible for false sharing:

    struct {
        pthread_t tid;
        POINT_T *points;
        int num_elems;
        long long SX;
        long long SY;
        long long SXX;
        long long SYY;
        long long SXY;
    } lreg_args;

Figure 2.2: False sharing in Phoenix's linear regression benchmark. Despite different heap organizations and structure padding, both 32- and 64-bit binaries exhibit false sharing.

Each thread of the linear regression application stores its state in one of these structures, and they are all stored consecutively in memory in an array. The program has each thread continually updating these structures in a tight loop as the benchmark runs.
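The per-thread update pattern can be sketched as follows. This is an illustrative reconstruction, not the benchmark's actual code: the struct mirrors Figure 2.2 (the points array and element count are elided), and the worker function and iteration count are invented for the sketch.

```c
#include <pthread.h>

/* Illustrative reconstruction of the benchmark's per-thread state
 * (mirrors Figure 2.2; the points array and element count are elided). */
typedef struct {
    pthread_t tid;
    long long SX, SY, SXX, SYY, SXY;   /* running sums */
} lreg_args;

/* The structures are allocated consecutively, so neighbouring entries
 * can land on the same 64-byte cache line. */
static lreg_args args[4];

/* Each thread updates only its own entry, in a tight loop: every write
 * can invalidate the line holding a neighbour's entry on another core. */
static void *worker(void *arg) {
    lreg_args *a = arg;
    for (long long x = 0; x < 1000000; x++) {
        long long y = 2 * x + 1;       /* stand-in for a data point */
        a->SX  += x;      a->SY  += y;
        a->SXX += x * x;  a->SYY += y * y;
        a->SXY += x * y;
    }
    return 0;
}
```

Running one such worker per core over this array produces exactly the coherence misses described above, even though no two threads ever touch the same bytes.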
Because the threads run independently on their own structs for the majority of execution time, there is no true sharing occurring.

Figure 2.3: False sharing of lreg as a product of cores (normalized performance relative to linear speedup, as the number of cores grows, with and without false sharing).

However, since the structures are packed onto the same cache line, they do exhibit false sharing. In fact, the false sharing is so severe that the program performs worse as it is given more cores to execute on. Figure 2.3 compares the program's scalability against a version that has the data aligned to eliminate false sharing. Without false sharing, the program scales nearly linearly. With false sharing, updates must be committed to memory before ownership can be transferred, reducing the performance of the cache hierarchy to that of main memory.

False sharing depends on many dynamic properties in a system. Workload is one: the per-thread structures in the linear regression example receive continuous updates from all threads, while examples in the Linux kernel often feature a single variable with frequent updates surrounded by read-mostly data [7]. The organization of the program binary is another: Figure 2.2 also shows the same source file compiled as both 32-bit and 64-bit binaries. Despite identical source and identical cache organization, the nature of the false sharing is different: one case results in a 52-byte structure that tiles poorly across cache lines, where the other produces an ideally sized 64-byte structure, but then misaligns it because of allocator metadata.
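When source can be changed, the standard mitigation is to give each per-thread structure its own cache line. Below is a minimal sketch using C11 `alignas`; the 64-byte line size and the type name are assumptions for illustration, not part of the benchmark.

```c
#include <stdalign.h>

#define CACHE_LINE 64   /* assumed line size; holds for the x86 parts discussed here */

/* Aligning the first member to a cache-line boundary forces the whole
 * struct's alignment, and hence its padded size, up to 64 bytes, so
 * consecutive array elements can never share a line. */
typedef struct {
    alignas(CACHE_LINE) long long SX;
    long long SY, SXX, SYY, SXY;
} padded_lreg_args;

_Static_assert(sizeof(padded_lreg_args) % CACHE_LINE == 0,
               "each element must occupy whole cache lines");
```

Note that this only helps for statically or stack-allocated arrays; heap allocations must also come from an alignment-preserving allocator (e.g. posix_memalign), since, as Figure 2.2 shows, allocator metadata can misalign even an ideally sized structure.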
This is not a compiler- or language-specific issue: for example, Java 7 is known to optimize out manually-added cache line padding [28] which was added by the programmer to try to avoid false sharing.

2.1.3 Significance

False sharing is a significant issue in cache coherent systems, and one that will increase in significance in the future. Experimental hardware architectures [11, 27] and operating systems [3] have begun exploring systems without cache coherence, but the requirement of coherence for general purpose computing will likely not vanish any time soon. A scalability analysis of the Linux kernel [7] shows that the current architecture will continue to work well with greater degrees of parallelism on existing models. Researchers have also shown that it is practical to scale cache coherence protocols to highly parallel architectures [19].

There are numerous examples of false sharing problems at larger scales than current x86 server architectures. David Dice has discussed false sharing issues with the Java garbage collector implementation on a 256-core Niagara server [8]. Similarly, the previously mentioned Linux kernel scalability investigation [7] revealed a number of cache contention issues relating to both true and false sharing.

2.2 Performance Counters

Hardware performance counters are implemented using extra special-purpose registers on the processor to track performance-impacting behaviours of the hardware. There are many kinds of events which can be counted, but the set of events can vary from one processor to another. Cross-platform use of performance counters can be difficult, since the kinds of events one cares about may be different from architecture to architecture.
In recognition of this difficulty, Intel has created a number of "Architectural" counters which behave consistently across microarchitectures and will continue to be supported on subsequent revisions [13].

There are only a few of these counter registers available, and the specific number varies by platform. For example, Intel's Nehalem microarchitecture has 4 performance counters available. While few in number, the counters are programmable, and thus a wide combination of different things can be monitored simultaneously.

The Intel Nehalem PMU guide [31] discusses the specific details of how performance counters are configured via the Write MSR (WRMSR) instruction writing to specific control registers. The counters themselves are also implemented as model-specific registers (MSRs) and thus can be read with the Read MSR (RDMSR) instruction, but a faster instruction exists for reading performance counters: Read Performance Monitoring Counters (RDPMC). One of the most important factors about using performance counters is that the overhead of using them is negligible [6].

2.2.1 Precise Event Based Sampling

Precise Event Based Sampling (PEBS) is an extension to the hardware profiling facilities available on Intel processors, used to capture more information when profiling workload behaviour.

PEBS-enabled performance counters can be configured to periodically sample information about the processor state when the counter overflows a preconfigured threshold. The machine state captured includes the values in various registers: flags, instruction pointer, general purpose registers, and some specialized fields which were added in Nehalem.

2.3 Page Tables

Another hardware technology leveraged by Plastic is hardware page table support for virtual memory.

All modern x86 CPUs support virtual memory, and hypervisors must virtualize virtual memory itself [1].

A Memory Management Unit (MMU) is responsible for mapping a Virtual Address (VA) space (usually per-process) to a Physical Address (PA) space.
Some processors have means of extending this mechanism to support virtualization in hardware: Extended Page Tables (EPT) is Intel's version of this technology. This extra level of indirection allows the hypervisor to control PA to Machine Address (MA) mappings on a per-virtual-machine level. For systems without hardware support for MMU virtualization, there is a software technique which can be used: shadow page tables. Shadow page tables are under the control of the hypervisor, and are the composition of the VA-PA and PA-MA mappings.

In addition to allowing one address to be mapped to another, page table entries have associated permission bits, which we take advantage of to carefully control access to memory during the detection pipeline in Plastic.

Chapter 3: Related Work

3.1 Memory Performance and False Sharing

Many memory allocators use private heaps, which helps avoid false sharing and its potential performance implications [2]. Allocators can be susceptible to two kinds of false sharing: active false sharing (when objects allocated by different threads can wind up on the same cache line) and passive false sharing (when objects freed by a remote thread are immediately allocatable by that same remote thread). TCMalloc is susceptible to both kinds of allocator-generated false sharing [2]. Hoard is a memory allocator designed for scalability (and avoiding false sharing is an explicitly considered factor), though it is still subject to passive false sharing [2, 5].

3.2 Detecting False Sharing via Instrumentation

Pluto uses Valgrind to keep track of load and store events across different threads and performs worst-case analysis of possible false sharing, while imposing two orders of magnitude of overhead [9].

Sheriff [16] detects false sharing (with no false positives) with approximately 20% overhead on average. The system makes assumptions about the usage of the pthread threading library which can break correctness guarantees if the assumptions are violated, e.g.
if the application uses lock-free data structures.

Predator samples memory accesses using instrumented binaries and predicts false sharing on "virtual cache lines", which can detect potential false sharing even if it did not actually occur with the given object alignment and hardware cache configuration [17]. This approach has 6x performance and 2x memory overhead. While Predator can predict different-alignment false sharing, the theoretical predictions may not be relevant for a given compiled binary and system.

All instrumentation-based approaches are quickly discounted for a general-purpose false sharing detection and mitigation system, since modifying the source may not be an option.

3.3 Simulation

Cache simulation methods can provide data about behaviour in systems with various memory hierarchies and coherence algorithms. Cmp$im (a simulator which uses Pin [18]) can simulate workloads running at 4-10 Millions of instructions per second (MIPS), and is three orders of magnitude faster than more accurate simulators [15]. However, this is still quite slow, and thus, unlike Plastic, Cmp$im [15] is only suitable for development and testing environments, and cannot reasonably run on general purpose systems.

3.4 Sampling Event Counters

The Intel Performance Tuning Utility [30] uses performance counters to identify the presence of cache line contention. However, it suffers from a high false positive rate, as it does not attempt to distinguish whether the contention is caused by true sharing or is actually false sharing. It does not analyze the performance implications of specific data being falsely contended, nor does it integrate with a solution that attempts to alleviate the performance impacts.

3.5 Dynamic Analysis for Performance

Feedback-driven optimization is a technique which can be used to inform compiler optimization decisions based on representative workloads of a running system [24].
An instrumented version of the code is run to collect profiling information, which is used as an input for a subsequent optimized recompilation. Such a system cannot be used on generic applications, for which source code may not be available. Plastic is a part of a larger system that (once it detects false sharing) remaps data and rewrites code for generic applications which exhibit false sharing [22, 26]. As shown in this thesis, these techniques for detecting and remapping falsely-shared data can lead to a 3x-6x performance improvement in severe cases of false sharing.

Chapter 4: Design and Architecture of False Sharing Detection

4.1 Design Overview

Figure 4.1 shows the high-level architecture of the false sharing detection pipeline in Plastic. Precise detection of false sharing is expensive in terms of the performance hit on the running system. However, various more performant heuristics can be used to detect potential false sharing. Each phase of Plastic uses increasingly expensive techniques for identifying (potential) false sharing, but on a narrower and narrower scope. This detection funnel is sufficiently lightweight on one end that it can run continuously without adverse performance impact, and sufficiently precise on the other to detect significant false sharing, so that a dynamic binary rewriting system can significantly increase system performance. PEBS, however, is not integrated into this funnel, due to platform limitations described in Section 4.2.6.

4.2 System Architecture

4.2.1 Performance Counters

As mentioned in Section 2.2, performance counters can track performance-impacting behaviour of hardware. However, these are still physical registers, and thus must be shared by all code executing on the system. This sharing includes multiple processes running under an operating system, and multiple virtual machines running under a hypervisor. These supervisor-level systems manage what is being counted, and do the appropriate saving and
restoring of the register state when the currently running applications and VMs switch. Intel has published guidelines for sharing access to the Performance Monitoring Unit (PMU) [14]; however, for the purposes of Plastic we currently assume we are the only part of the system using the PMU.

Figure 4.1: System Architecture. The detection funnel: performance counters (Section 4.2.1), Precise Event-Based Sampling (Section 4.2.6), coarse-grained MMU-based conflict analysis (Section 4.2.2), fine-grained MMU-based conflict analysis (Section 4.2.3), and contention diagnosis (Section 4.2.4).

A number of the important performance counters we keep track of are listed in Table 4.1.

Table 4.1: Performance counters useful for detecting false sharing.

  SNOOP_RESPONSE.HITM: A cache snoop request to a particular core was a hit and the value was modified from what is in memory.
  MEM_UNCORE_RETIRED.OTHER_CORE_L2_HITM: A HITM occurred and modified data was found in another core's cache.
  EXT_SNOOP.ALL_AGENTS.HITM: A similar HITM counter for Intel Core 2 architectures.
  INST_RETIRED.ANY: The number of instructions retired (completed).
  L2_TRANSACTIONS.RFO: RFO requests for prefetches and demand misses.

HITM counters are relevant to detecting cache line contention because they indicate that the requested data was in a remote cache line while that cache line was in a modified state. A high ratio of these relative to the total number of executed instructions can indicate potential false sharing, but also occurs when true sharing is occurring [29].

Figure 4.2: Performance counter values (L2_RFO, L2_HITM, SNOOP_HITM) after running Listing A.1 in each of five modes (V1-V5).

In Figure 4.2 we show several false-sharing-related performance counters tested against 5 modes of false sharing generated by the code found in Listing A.1.
SNOOP_RESPONSE.HITM responds quite well to all modes of false sharing, so we use it as our primary indicator of potential false sharing.

In order to disambiguate potential false sharing from true sharing, when this high-HITM-per-instruction scenario is encountered we turn on more heavyweight instrumentation. This allows us to precisely determine whether this truly is a case of false sharing (and if so, where the false sharing is occurring), without subjecting typical system execution to the slower detection stages.

4.2.2 Page Granularity Analysis

Xen is a virtualization platform that manages the physical hardware of machines and provides a common abstraction to guest virtual machines. It manages per-virtual-machine address spaces using mechanisms like shadow page tables or EPT. In Plastic we extend this concept to per-core page tables, though operating systems can use similar techniques with per-thread page tables [4, 16]. Each of these per-core page tables begins with every page marked as protected. When data on a page is accessed, the permissions are updated to allow the access. The type of access (i.e., read vs. write) is recorded along with the page. Results and permissions are periodically reset, so that evaluation is done in a series of epochs. Since contention requires at least one writer and one other accessor, the output of this stage is the set of pages which are in the intersection of the write bitmap for one core and the access bitmap for all other cores. This contention could still be a case of true sharing at this point, so further analysis is required to differentiate the class of contention.

4.2.3 Byte Level Analysis

Byte level analysis is done by causing every relevant memory access to fault. The set of contended pages detected in Section 4.2.2 is configured to have the processor fault on every access.
The fault handler logs the access, restores the appropriate permissions for the page so that the access performs as expected, and sets up a single-step breakpoint. As soon as the memory-accessing instruction is retired, the permissions are again reconfigured to cause the next access to generate a fault. Thus, every access generates a fault and is logged.

Since sharing is more closely associated with different threads than different cores (e.g. a thread migrating from Core A to Core B is not really "sharing" data even if we detect accesses on two different cores), we distinguish threads by logging an appropriate segment register: fs (x86_64) or gs (x86).

The output of this phase is the set of accessed bytes, along with the identity of the accessors and the mode of access.

4.2.4 Diagnosis

Figure 4.3: Sharing status of a byte in the access log. (States: Unaccessed, Read Only (RO), Read Write (RW), Read Only Shared (ROS), and Read Write Shared (RWS). Reads and writes by the same accessor keep a byte in an exclusive state; an access by another accessor moves it to a shared state.)

Feeding the byte level access log into the state machine depicted in Figure 4.3 allows Plastic to make a determination of the sharing state for each byte. A contended cache line has at least one writer (cache lines with bytes in either the RW or RWS states) and multiple accessors. False sharing is happening if there are bytes with different accessors (as determined by the logged thread identifiers). Regions within the cache line are grouped according to accessor, and groups with writes are relocated to their own cache lines on the isolated page by the remapping engine (Section 4.2.5).

4.2.5 Remapping

After determining that false sharing is occurring at a particular location, Plastic forwards that information to a remapping engine [23, 26].
This engine safely moves contended data onto separate cache lines, and rewrites executing code to refer to data in the new location. Figure 4.4 demonstrates the repositioning of data onto different cache lines. The performance benefits of removing false sharing in this way are demonstrated in the evaluation in Chapter 5.

Figure 4.4: Independent pieces of data on a single cache line can cause false sharing. To avoid the negative performance impact, they can be remapped onto separate cache lines. (In the example shown, the detector reports false sharing at (0x170c, 4) and (0x1730, 8) within the original 4K page at 0x1000. Remapping allows the original data page to be composed from multiple byte-granularity regions: untouched spans map onto an underlay page at 0xf1000, while the contended bytes are isolated at 0xf2000 and 0xf2280.)

4.2.6 Precise Event Based Sampling

With the exception of PEBS, Plastic's detection proceeds in a stepwise manner. PEBS is exceptional in that, while it has the potential for helping with detection, the Nehalem-era implementation does not provide the additional data necessary to help identify false sharing.

For false sharing detection, the interesting information currently recorded in Nehalem would be the instruction pointer. However, this only really helps for inspecting the code, as we need to determine the location of the data being potentially falsely shared.
It is insufficient to know some instructions which contribute to false sharing, especially since PEBS actually records the "at-retirement" state of the instruction: it records "IP + 1", where the "+ 1" refers to the subsequent instruction rather than a literal byte offset. Consider the case where we believe data is contributing to false sharing. If we wish to remap the data to another location in the address space, we must ensure that all instructions which would ever attempt to access this data will be redirected to the correct location. Sampling instructions may give us an idea of the cost of remapping, but in itself cannot give an exhaustive list of places to remap.

Future work could use PEBS to determine whether to proceed to subsequent detection stages. Looking at samples of the instructions that access the candidate false-shared data may be able to filter out un-rewritable code: even if this definitely is false-shared data, we may not be able to do anything about it, so there is no point in making the precise determination. The specialized fields relating to load latency recorded by the PEBS mechanism could also be used in the future (Section 6.1).

Chapter 5

Evaluation

Plastic is evaluated on a dual socket, 8-core Nehalem system with 32 GB of memory. Each processor is a 4-core 64-bit Intel Xeon E5506 with private, per-core L1 and L2 caches and a shared L3. Plastic is implemented as an extension to Xen 4.2 and runs a Linux kernel as Dom0. All tests are run inside an 8-core Ubuntu Hardware Virtual Machine (HVM) guest. First, we describe our selection of performance counters, then show the detailed execution of a false sharing workload under Plastic, followed by an evaluation of Plastic across a range of applications.

5.1 Performance Counter Selection

As the first phase of our detection pipeline, careful selection of events that are representative of potential false sharing is crucial to success.
Candidate counters were selected by analyzing Intel's platform documentation [13, 30, 31] and evaluating behaviour under many workloads. The workloads contained known examples of false sharing and applications which did not exhibit false sharing. Candidate performance counters were selected on the basis of how effectively they discriminated between these two cases. Figure 5.1 shows an example counter which is only significantly incremented in applications which exhibit false sharing. (Many similar graphs are left out for space reasons.)

Figure 5.1: SNOOP_RESPONSE.HITM values over time while running various applications.

5.2 Detection Times and Execution Under Plastic

Figure 5.2 demonstrates the behaviour of an example workload which exhibits false sharing (specifically, the linear regression benchmark from Phoenix) running under Plastic. Because of the false sharing in the application, the workload immediately exhibits low performance due to the costly coherence invalidations. Plastic quickly identifies that false sharing is occurring (within 2 seconds). After the locations of false sharing are given to the rewrite engine of Plastic, the HITM rate significantly decreases and the throughput jumps significantly for the remainder of the program's execution.

Figure 5.2: Linear regression running under Plastic. (Benchmark throughput in millions of records and HITM count in millions, plotted against time in milliseconds.)

5.3 CCBench

False sharing is a well-understood phenomenon from the perspective of hardware implementations and those programmers who actively think about hardware-level operation. Often its grave performance impact is mitigated during development due to careful profiling and testing. However, sometimes real workloads are beyond the scale of parallelism tested.
CCBench is a microbenchmark suite designed to simulate similar troublesome workloads at a much smaller scale.

Table 5.1 taxonomizes different kinds of memory access patterns, observed in these real workloads, that are responsible for scalability bottlenecks, and the performance of the corresponding microbenchmarks with and without Plastic.

fs_independent: Multiple accessors to independent variables (at least one writer). Examples: Linear Regression in Phoenix [25]; spinlock pool in Boost [20]; bookkeeping in the Java GC [8]. Fixable: Yes.
fs_mixed: Shared read-only data co-located with contended data. Example: net_device struct in Linux [7]. Fixable: Yes.
bitmask: Bitmasks and flags. Example: page struct in Linux [7]. Fixable: No.
true_share: Shared read-write data. Examples: locks and global counters. Fixable: No.

Table 5.1: Microbenchmarks in CCBench. The fixable column denotes whether the bottleneck can be fixed by simply remapping memory.

5.4 Shared Memory Benchmarks

In Figure 5.3 we see the impact of Plastic on a number of workloads in CCBench, Phoenix, and Parsec.

Programs exhibiting clear, pessimal false sharing, like linear regression and the synthetic workload, enjoy a significant performance improvement while running under Plastic. Those without false sharing show low overhead, and the overall performance is indistinguishable within the error bars.

5.5 Effect of Rewriting Optimizations

Figure 5.4 shows how the performance of the linear regression benchmark scales from 1 to 8 cores, normalized against idealized linear speedup. Four versions of the program are shown:

• a version with false sharing executing normally
• an unmodified version running under Plastic without optimized rewriting
• an unmodified version running under Plastic with optimized rewriting
• a version of the program where false sharing was fixed by hand

Figure 5.3: Performance of the Phoenix, Parsec and CCBench suites running with Plastic, normalized relative to linear speedup.

With an increase in cores, performance of the unmodified version decreases dramatically. As the number of coherency misses increases to the point that cores are constantly stalling, the multi-threaded version is slower than a single-threaded version, even in terms of absolute run time. The source-optimized version shows almost linear speedup. The version running with unoptimized instrumentation does not suffer from false sharing, but has to perform enough comparisons that it is even slower than the version with false sharing. Using specialized rewriting under Plastic yields a 3-6x performance improvement over the regular version.

An important takeaway is the observation that remapping can potentially have a negative impact on performance as well. We currently assume that the optimized instrumentation to remove false sharing is always desirable. Future work should take into account the rate of non-Plastic instructions retired as an indicator that Plastic should roll back the remappings and run with the original code and original data layout.
Figure 5.4: Normalized performance of linear regression with and without Plastic, relative to linear speedup, from 1 to 8 cores (false sharing, source fixed, detector optimized, and detector unoptimized versions).

Chapter 6

Discussion

Plastic is a functional proof of concept demonstrating that dramatic false sharing can be identified and fixed in a manner which does not have a significant performance impact on workloads which do not exhibit false sharing. For benign workloads, the temporary performance impact only occurs during short detection phases (if at all, as most workloads tend not to even progress into the slower phases of detection).

Therefore it is feasible to consider the Plastic system as a desirable layer of system software, protecting applications from unexpected performance characteristics of the underlying hardware consistency model.

With additional performance counters, one can imagine other hardware "worst case scenarios" being detected and mitigated in similar ways. The utility of this system can be thought of as a post-compiler binary optimization step, similar to the HotSpot JVM's adaptive optimization, but generalized to x86 / x86_64 instructions rather than Java bytecode.

6.1 Future Work

Plastic is currently part of a system aimed at dynamically improving the behaviour of existing binaries at runtime. However, the detection pipeline could be integrated with debug symbols and development toolchains to allow developers to fix the problem at the source. This method of fixing false sharing would provide a larger performance win compared to the rewritten binary, and would be increasingly relevant for an extended detector which can identify other sources of poor performance which cannot be automatically mitigated.

If Plastic were to continually run alongside an arbitrary system, it should
be made to adhere to Intel's recommendations for sharing the PMU [14]. These recommendations include "Agents must be able to be configured and function without PerfMon resources", which would mean Plastic would be idle while the PMU is in use by other agents. Unfortunately, running in the hypervisor is insufficient to guarantee that we always have access to the PMU, since firmware may also use counters; when discussing not laying claim to an in-use counter, the guidelines state: "This includes the VMM. The VMM should ensure any counter in-use by firmware is not disabled."

In post-Nehalem architectures, where PEBS facilities can export the memory locations corresponding to HITM events, future stages of the pipeline could be restricted to even fewer memory pages to track. This could decrease detection overhead even further. PEBS integration would also be useful for other cache analyses unrelated to false sharing, by providing a gauge of how long some data takes to be loaded due to poor cache utilization. The remapping engine may be able to determine a more optimal arrangement of data so that a larger portion of the working set is available in cache.

More architectural performance counters would be necessary for cross-microarchitecture compatibility of any system that would rely on them. While detection/analysis during development could reasonably run only on a restricted set of platforms, widespread availability of counters would serve two goals: 1) being able to deploy a system like Plastic on end-user systems, and 2) since development and deployment environments are different, they could be running into different sets of hardware-inflicted performance issues.

Chapter 7

Conclusion

Plastic demonstrates that an operating system (or hypervisor) can take responsibility for intelligent management of shared hardware resources being used inefficiently by software. Software can generate pathological use cases for caches, with significant negative performance implications, by being unaware of low level implementation details like hardware cache line size. Plastic shows that low overhead detection of hardware issues can be combined with a larger system [22] in order to dynamically fix these cases, resulting in 3x-6x improved throughput.

This work intelligently uses pieces of data exported by the hardware, and can be used to motivate exposing more counters and more information about data or code which is causing hardware to press up against the inefficient parts of its design. In the case of Plastic we show that some classes of issues can be automatically alleviated, though in general feedback could be directed to software developers to work around these hardware limitations when possible.

Bibliography

[1] Ole Agesen. Software and hardware techniques for x86 virtualization. VMware Inc, Palo Alto, CA, 2009. → pages 10

[2] Martin Aigner, Christoph M. Kirsch, Michael Lippautz, and Ana Sokolova. Fast, multicore-scalable, low-fragmentation memory allocation through large virtual memory and global data structures. arXiv preprint arXiv:1503.09006, 2015. → pages 11

[3] Andrew Baumann, Paul Barham, Pierre-Evariste Dagand, Tim Harris, Rebecca Isaacs, Simon Peter, Timothy Roscoe, Adrian Schupbach, and Akhilesh Singhania. The multikernel: a new OS architecture for scalable multicore systems. In SOSP, 2009. → pages 8

[4] T. Bergan, N. Hunt, L. Ceze, and S. D. Gribble. Deterministic process groups in dOS. In OSDI, 2010. → pages 17

[5] Emery D. Berger, Kathryn S. McKinley, Robert D. Blumofe, and Paul R. Wilson. Hoard: A scalable memory allocator for multithreaded applications. ACM SIGPLAN Notices, 35(11):117–128, 2000. → pages 11

[6] Georgios Bitzes and Andrzej Nowak. The overhead of profiling using PMU hardware counters. 2014. → pages 9

[7] Silas Boyd-Wickizer, Austin T. Clements, Yandong Mao, Aleksey Pesterev, M.
Frans Kaashoek, Robert Morris, and Nickolai Zeldovich. An analysis of Linux scalability to many cores. In OSDI, 2010. → pages 7, 8, 24

[8] David Dice. False sharing induced by card table marking. David Dice's Weblog, Feb 2011. → pages 8, 24

[9] Stephan M. Gunther and Josef Weidendorfer. Assessing cache false sharing effects by dynamic binary instrumentation. In WBIA, 2009. → pages 11

[10] John L. Hennessy and David A. Patterson. Computer Architecture, Fifth Edition: A Quantitative Approach. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 5th edition, 2011. → pages 5

[11] John Howard, Saurabh Dighe, Yatin Hoskote, Sriram Vangal, David Finan, Gregory Ruhl, Devon Jenkins, Howard Wilson, Nitin Borkar, Gerhard Schrom, et al. A 48-core IA-32 message-passing processor with DVFS in 45nm CMOS. In Solid-State Circuits Conference Digest of Technical Papers (ISSCC), 2010 IEEE International, pages 108–109. IEEE, 2010. → pages 8

[12] Wei Huang, Karthick Rajamani, Mircea R. Stan, and Kevin Skadron. Scaling with design constraints: Predicting the future of big chips. IEEE Micro, (4):16–29, 2011. → pages 1

[13] Intel Corporation. Intel 64 and IA-32 Architectures Software Developers Manual. 043 edition, 2012. → pages 9, 21

[14] Peggy Irelan and Shihjong Kuo. Performance monitoring unit sharing guide. https://software.intel.com/sites/default/files/ea/95/30388. → pages 15, 28

[15] Aamer Jaleel, Robert S. Cohn, Chi-Keung Luk, and Bruce Jacob. CMPSim: A Pin-based on-the-fly multi-core cache simulator. In MOBS, 2008. → pages 12

[16] Tongping Liu and Emery D. Berger. Sheriff: precise detection and automatic mitigation of false sharing. In Proceedings of the 2011 ACM International Conference on Object Oriented Programming Systems Languages and Applications, OOPSLA '11, pages 3–18, New York, NY, USA, 2011. ACM. → pages 6, 11, 17

[17] Tongping Liu, Chen Tian, Ziang Hu, and Emery D. Berger. Predator: Predictive false sharing detection.
In Proceedings of the 19th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP '14, pages 3–14, New York, NY, USA, 2014. ACM. → pages 11

[18] Chi-Keung Luk, Robert Cohn, Robert Muth, Harish Patil, Artur Klauser, Geoff Lowney, Steven Wallace, Vijay Janapa Reddi, and Kim Hazelwood. Pin: building customized program analysis tools with dynamic instrumentation. In Proceedings of the 2005 ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI '05, 2005. → pages 12

[19] Milo M. K. Martin, Mark D. Hill, and Daniel J. Sorin. Why on-chip cache coherence is here to stay. To appear, CACM, 2012. → pages 8

[20] mcmcc. false sharing in boost::detail::spinlock pool? http://stackoverflow.com/questions/11037655/false-sharing-in-boostdetailspinlock-pool, June 2012. → pages 24

[21] Daniel Molka, Daniel Hackenberg, Robert Schone, and Matthias S. Muller. Memory performance and cache coherency effects on an Intel Nehalem multiprocessor system. In PACT, 2009. → pages 5

[22] Mihir Nanavati, Mark Spear, Nathan Taylor, Shriram Rajagopalan, Dutch T. Meyer, William Aiello, and Andrew Warfield. Whose cache line is it anyway?: operating system support for live detection and repair of false sharing. In Proceedings of the 8th ACM European Conference on Computer Systems, pages 141–154. ACM, 2013. → pages iii, 13, 29

[23] Mihir Nanavati, Mark Spear, Nathan Taylor, Shriram Rajagopalan, Dutch T. Meyer, William Aiello, and Andrew Warfield. Whose cache line is it anyway?: Operating system support for live detection and repair of false sharing. In Proceedings of the 8th ACM European Conference on Computer Systems, EuroSys '13, pages 141–154, New York, NY, USA, 2013. ACM. → pages 19

[24] Vinodha Ramasamy, Paul Yuan, Dehao Chen, and Robert Hundt. Feedback-directed optimizations in GCC with estimated edge profiles from hardware event sampling. In Proceedings of GCC Summit 2008, pages 87–102, 2008.
→ pages 12

[25] Colby Ranger, Ramanan Raghuraman, Arun Penmetsa, Gary Bradski, and Christos Kozyrakis. Evaluating MapReduce for multi-core and multiprocessor systems. In Proceedings of the 2007 IEEE 13th International Symposium on High Performance Computer Architecture, HPCA '07, pages 13–24, Washington, DC, USA, 2007. IEEE Computer Society. → pages 6, 24

[26] Nathan Bryan Taylor. Cachekata: memory hierarchy optimization via dynamic binary translation. 2013. → pages iii, 13, 19

[27] C. Thacker. Beehive: A many-core computer for FPGAs (v5). Unpublished manuscript, Jan 2010. → pages 8

[28] Martin Thompson. Re: kernel + gcc 4.1 = several problems. http://ondioline.org/mail/cmov-a-bad-idea-on-out-of-order-cpus. → pages 8

[29] Avoiding and identifying false sharing among threads. http://software.intel.com/en-us/articles/avoiding-and-identifying-false-sharing-among-threads/. → pages 16

[30] Intel performance tuning utility. http://software.intel.com/en-us/articles/intel-performance-tuning-utility/. → pages 12, 21

[31] Intel microarchitecture codename Nehalem performance monitoring unit programming guide (Nehalem core PMU). https://software.intel.com/sites/default/files/76/87/30320. → pages 9, 21

[32] Qin Zhao, David Koh, Syed Raza, Derek Bruening, Weng-Fai Wong, and Saman Amarasinghe. Dynamic cache contention detection in multi-threaded applications. In VEE, 2011. → pages 6

Appendix A

Selected Code Listings

Listing A.1: Selected variations of code triggering false sharing

#ifndef _GNU_SOURCE
#define _GNU_SOURCE 1
#endif

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <string.h>
#include <errno.h>
#include <pthread.h>

unsigned long *globalvar;
int quit = 0;
unsigned long iterations = (1UL << 32);

void *worker(void *t)
{
    long id = (long)t;
    unsigned long i = iterations;
    int reader = id & 1;
    int off = id / 2;
    unsigned long localvar = 0;
    unsigned char *ptr = (unsigned char *)(globalvar + id);

    while (i--)
    {
#if V1
        // simple increment
        globalvar[id]++;
#elif V2
        // increment via temporary stack variable
        localvar = globalvar[id];
        localvar++;
        globalvar[id] = localvar;
#elif V3
        // update part of 8 byte word
        ptr[id]++;
#elif V4
        // alternating reads & writes
        (i & 1) ? (localvar = globalvar[id]) : (++globalvar[id]);
#elif V5
        // always read or always write
        reader ? (localvar = globalvar[off]) : (++globalvar[off]);
#endif
    }
    return NULL;
}

#define PIN_THREADS 1

int main(int argc, char *argv[])
{
    pthread_t *threads;
    pthread_attr_t *attr;
    cpu_set_t cpuset;
    int i, nthreads;

    nthreads = sysconf(_SC_NPROCESSORS_ONLN);
    if (nthreads < 0) {
        perror("failed to get number of online processors");
        return -1;
    }

    iterations /= nthreads;

    posix_memalign((void **)&globalvar, 4096, 4096);
    memset(globalvar, 0, 4096);

    threads = malloc(sizeof(pthread_t) * nthreads);
    attr = malloc(sizeof(pthread_attr_t) * nthreads);

#if PIN_THREADS
    // pin main thread to cpu0
    CPU_ZERO(&cpuset);
    CPU_SET(0, &cpuset);
    sched_setaffinity(0, sizeof(cpuset), &cpuset);

    // pin threads
    for (i = 0; i < nthreads; i++) {
        CPU_ZERO(&cpuset);
        CPU_SET(i, &cpuset);
        pthread_attr_init(&attr[i]);
        pthread_attr_setaffinity_np(&attr[i], sizeof(cpuset), &cpuset);
    }

    for (i = 0; i < nthreads; i++)
        pthread_create(&threads[i], &attr[i], worker, (void *)(long)i);
#else
    for (i = 0; i < nthreads; i++)
        pthread_create(&threads[i], NULL, worker, (void *)(long)i);
#endif

    for (i = 0; i < nthreads; i++)
        pthread_join(threads[i], NULL);

    free(globalvar);
    free(threads);
    free(attr);

    return 0;
}

