{"@context":{"@language":"en","Affiliation":"http:\/\/vivoweb.org\/ontology\/core#departmentOrSchool","AggregatedSourceRepository":"http:\/\/www.europeana.eu\/schemas\/edm\/dataProvider","Campus":"https:\/\/open.library.ubc.ca\/terms#degreeCampus","Creator":"http:\/\/purl.org\/dc\/terms\/creator","DateAvailable":"http:\/\/purl.org\/dc\/terms\/issued","DateIssued":"http:\/\/purl.org\/dc\/terms\/issued","Degree":"http:\/\/vivoweb.org\/ontology\/core#relatedDegree","DegreeGrantor":"https:\/\/open.library.ubc.ca\/terms#degreeGrantor","Description":"http:\/\/purl.org\/dc\/terms\/description","DigitalResourceOriginalRecord":"http:\/\/www.europeana.eu\/schemas\/edm\/aggregatedCHO","Extent":"http:\/\/purl.org\/dc\/terms\/extent","FileFormat":"http:\/\/purl.org\/dc\/elements\/1.1\/format","FullText":"http:\/\/www.w3.org\/2009\/08\/skos-reference\/skos.html#note","Genre":"http:\/\/www.europeana.eu\/schemas\/edm\/hasType","GraduationDate":"http:\/\/vivoweb.org\/ontology\/core#dateIssued","IsShownAt":"http:\/\/www.europeana.eu\/schemas\/edm\/isShownAt","Language":"http:\/\/purl.org\/dc\/terms\/language","Program":"https:\/\/open.library.ubc.ca\/terms#degreeDiscipline","Provider":"http:\/\/www.europeana.eu\/schemas\/edm\/provider","Publisher":"http:\/\/purl.org\/dc\/terms\/publisher","Rights":"http:\/\/purl.org\/dc\/terms\/rights","RightsURI":"https:\/\/open.library.ubc.ca\/terms#rightsURI","ScholarlyLevel":"https:\/\/open.library.ubc.ca\/terms#scholarLevel","Title":"http:\/\/purl.org\/dc\/terms\/title","Type":"http:\/\/purl.org\/dc\/terms\/type","URI":"https:\/\/open.library.ubc.ca\/terms#identifierURI","SortDate":"http:\/\/purl.org\/dc\/terms\/date"},"Affiliation":[{"@value":"Applied Science, Faculty of","@language":"en"},{"@value":"Electrical and Computer Engineering, Department of","@language":"en"}],"AggregatedSourceRepository":[{"@value":"DSpace","@language":"en"}],"Campus":[{"@value":"UBCV","@language":"en"}],"Creator":[{"@value":"Chen, 
Xi","@language":"en"}],"DateAvailable":[{"@value":"2009-06-11T20:07:40Z","@language":"en"}],"DateIssued":[{"@value":"2009","@language":"en"}],"Degree":[{"@value":"Master of Applied Science - MASc","@language":"en"}],"DegreeGrantor":[{"@value":"University of British Columbia","@language":"en"}],"Description":[{"@value":"As the number of transistors integrated on a chip continues to increase, a growing challenge is accurately modeling performance in the early stages of processor design. Analytical modeling is an alternative to detailed simulation with the potential to shorten the development cycle and provide additional insight. This thesis proposes hybrid analytical models to predict the impact of pending cache hits, hardware prefetching, and realistic miss status holding register (MSHR) resources on superscalar performance. We propose techniques to model the non-negligible influences of pending hits and the fine-grained selection of instruction profile window blocks on the accuracy of hybrid analytical models. We also present techniques to estimate the performance impact of data prefetching by modeling the timeliness of prefetches and to account for a limited number of MSHRs by restricting the size of profile window blocks. As with earlier hybrid analytical models, our approach is roughly two orders of magnitude faster than detailed simulations. Overall, our techniques reduce the error of our baseline from 39.7% to 10.3% when the number of MSHRs is unlimited. When modeling a processor with data prefetching, a limited number of MSHRs, or both, our techniques result in an average error of 13.8%, 9.5% and 17.8%, respectively. Moreover, this thesis proposes analytical models for predicting the cache contention and throughput of heavily fine-grained multithreaded architectures such as Sun Microsystems' Niagara. 
We first propose a novel probabilistic model using statistics characterizing individual threads run in isolation as inputs to accurately predict the number of extra cache misses due to cache contention among a large number of threads. We then present a Markov chain model for analytically estimating the throughput of multicore, fine-grained multithreaded architectures. Combined, the two models accurately predict system throughput obtained from a detailed simulator with an average error of 8.3% for various cache configurations. We also show that our models can find the same optimized design point of fine-grained multithreaded chip multiprocessors for application-specific workloads 65 times faster than detailed simulations. Furthermore, we show that our models accurately predict cache contention and throughput trends across varying workloads on real hardware, a Sun Fire T1000 server.","@language":"en"}],"DigitalResourceOriginalRecord":[{"@value":"https:\/\/circle.library.ubc.ca\/rest\/handle\/2429\/8979?expand=metadata","@language":"en"}],"Extent":[{"@value":"1122882 bytes","@language":"en"}],"FileFormat":[{"@value":"application\/pdf","@language":"en"}],"FullText":[{"@value":"Analytical Modeling of Modern Microprocessor Performance by Xi Chen B.A.Sc., The University of British Columbia, 2006 A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF APPLIED SCIENCE in The Faculty of Graduate Studies (Electrical and Computer Engineering) THE UNIVERSITY OF BRITISH COLUMBIA (Vancouver) June, 2009 \u00a9 Xi Chen 2009 \fAbstract As the number of transistors integrated on a chip continues to increase, a growing challenge is accurately modeling performance in the early stages of processor design. Analytical modeling is an alternative to detailed simulation with the potential to shorten the development cycle and provide additional insight.
This thesis proposes hybrid analytical models to predict the impact of pending cache hits, hardware prefetching, and realistic miss status holding register (MSHR) resources on superscalar performance. We propose techniques to model the non-negligible influences of pending hits and the fine-grained selection of instruction profile window blocks on the accuracy of hybrid analytical models. We also present techniques to estimate the performance impact of data prefetching by modeling the timeliness of prefetches and to account for a limited number of MSHRs by restricting the size of profile window blocks. As with earlier hybrid analytical models, our approach is roughly two orders of magnitude faster than detailed simulations. Overall, our techniques reduce the error of our baseline from 39.7% to 10.3% when the number of MSHRs is unlimited. When modeling a processor with data prefetching, a limited number of MSHRs, or both, our techniques result in an average error of 13.8%, 9.5% and 17.8%, respectively. Moreover, this thesis proposes analytical models for predicting the cache contention and throughput of heavily fine-grained multithreaded architectures such as Sun Microsystems\u2019 Niagara. We first propose a novel probabilistic model using statistics characterizing individual threads run in isolation as inputs to accurately predict the number of extra cache misses due to cache contention among a large number of threads. We then present a Markov chain model for analytically estimating the throughput of multicore, fine-grained multithreaded architectures. Combined, the two models accurately predict system throughput obtained from a detailed simulator with an average error of 8.3% for various cache configurations. We also show that our models can find the same optimized design point of fine-grained multithreaded chip multiprocessors for application-specific workloads 65 times faster than detailed simulations. 
Furthermore, we show that our models accurately predict cache contention and throughput trends across varying workloads on real hardware\u2014a Sun Fire T1000 server. \fTable of Contents
Abstract
Table of Contents
List of Tables
List of Figures
Acknowledgements
1 Introduction
1.1 Motivation
1.2 Contributions
1.3 Background
1.3.1 Superscalar Pipeline
1.3.2 Cache Memory
1.3.3 Data Prefetching
1.3.4 Fine-grained Multithreaded Architectures
1.4 Organization
2 Hybrid Analytical Modeling of Pending Cache Hits, Data Prefetching, and MSHRs
2.1 First-Order Processor Model
2.2 Modeling Long Latency Memory System
2.2.1 Modeling Pending Data Cache Hits
2.2.2 Accurate Exposed Miss Penalty Compensation
2.2.3 Modeling Data Prefetching
2.2.4 Modeling a Limited Number of MSHRs
2.2.5 Profiling Window Selection
3 Modeling Cache Contention Among Many Threads
3.1 Probabilistic Cache Contention Models
3.2 New Locality Metrics For Modeling Cache Contention
3.3 Accurately Modeling Cache Contention with Many Threads
4 Modeling Fine-Grained Multithreaded Throughput
4.1 Sum of Cycles Model
4.2 Sum of IPCs Model
4.3 Bernoulli Model
4.4 A Markov Chain Model
4.4.1 Homogeneous Workloads
4.4.2 Heterogeneous Workloads and Multiple Stall Conditions
5 Methodology
5.1 Modeling Pending Cache Hits, Data Prefetching, MSHRs
5.2 Modeling Cache Contention and Throughput
6 Experimental Results
6.1 Modeling Pending Cache Hits, Data Prefetching, MSHRs
6.1.1 Modeling Pending Data Cache Hits
6.1.2 Modeling Different Prefetching Techniques
6.1.3 Modeling Limited Number of MSHRs
6.1.4 Putting It All Together
6.1.5 Speedup of the Hybrid Analytical Model
6.2 Modeling Cache Contention and Throughput
6.2.1 Model Accuracy Evaluation
6.2.2 Modeling S(x) and b(i, x) Using Binomial Distributions
6.2.3 Modeling b(i, x) Using An Inductive Probability Function
6.2.4 Sensitivity Study
6.2.5 Case Study: Optimizing Threads Per Core for Application-Specific Workloads
6.2.6 Hardware Validation
7 Related Work
7.1 Empirical Models
7.2 Statistical Simulation
7.3 Analytical Models for Superscalar Microprocessors
7.4 Cache Contention Models
7.5 Throughput Model
8 Conclusions and Future Work
8.1 Conclusions
8.2 Future Work
8.2.1 Improving the Applicability of Our Hybrid Analytical Models
8.2.2 Analytical Modeling of Simultaneous Multithreading Processors
8.2.3 Analytical Modeling of Chip Multiprocessors
8.2.4 Analytical Modeling of DRAM Scheduling Policies
Bibliography
Appendix A Limiting Case of Time-Sharing
\fList of Tables
4.1 Transition probability definitions for a four-thread fine-grained multithreaded architecture
4.2 Transition probability definitions for an N-thread fine-grained multithreaded architecture
5.1 Microarchitectural parameters
5.2 Simulated benchmarks
5.3 Parameters of the simulated architecture
5.4 Simulated workloads for each core
6.1 Modeling errors for various L2 cache configurations versus detailed simulation
\fList of Figures
1.1 Time to simulate 1 second runtime of a large-scale chip multiprocessor
1.2 A superscalar microprocessor pipeline
1.3 A processor with two levels of cache
1.4 Instruction interleaving in a four-thread fine-grained multithreaded architecture
1.5 Hiding memory latency in a four-thread fine-grained multithreaded architecture [37]
2.1 Useful instructions issued per cycle (IPC) over time
2.2 Profiling examples
2.3 An example showing how two data independent misses are connected by a pending hit
2.4 Impact of pending data cache hit latency on CPI D$miss
2.5 Algorithm for analyzing a pending hit in an instruction trace when a prefetching mechanism is applied
2.6 An example motivating Figure 2.5 part B
2.7 An example explaining Figure 2.5 part C
2.8 An example showing profiling with ROBsize = 8 and NMSHR = 4
2.9 An example comparing plain profiling and SWAM profiling with ROBsize = 8
3.1 An example of circular sequences in a set of a four-way associative LRU cache
3.2 Extra L2 cache misses due to cache contention when multiple copies of mcf are running in parallel
3.3 Average number of distinct cache sets accessed during a given number of consecutive memory instructions
3.4 Probability distribution of the number of distinct blocks being accessed in a cache set for mcf
3.5 Explanation of each term of probH(2, g)
4.1 A Markov chain model for a four-thread fine-grained multithreaded architecture
6.1 Penalty cycles per miss with fixed number of cycles compensation for plain profiling (Unlimited MSHRs)
6.2 CPI due to D$miss and modeling error for different profiling techniques (unlimited MSHRs)
6.3 CPI due to D$miss and modeling error while prefetch-on-miss (POM), tagged prefetch (Tag), or stride prefetch (Stride) technique is applied
6.4 CPI due to D$miss for NMSHR = 16, NMSHR = 8, and NMSHR = 4
6.5 Error of the modeled CPI D$miss for NMSHR = 16, NMSHR = 8, and NMSHR = 4
6.6 Ratio of the number of extra L1 and L2 cache misses due to cache contention to the number of misses when each thread runs alone
6.7 Scatter plot for the predicted and the simulated cache misses increase pair
6.8 Impact of the number of groups used to classify circular sequences on the accuracy of our cache contention model
6.9 Predicted throughput and error from different models compared to the simulated throughput
6.10 Average distinct number of cache sets accessed during a given number of consecutive memory instructions for a 12-way, 64B line, 4096-set LRU L2 cache, from both assuming binomial distribution of cache blocks among sets (Binomial) and analyzing program traces (Trace)
6.11 Probability distribution of the number of distinct blocks being accessed in a cache set for mcf on a 12-way, 64B line, 3 MB cache, from both assuming binomial distribution of cache blocks among sets (Binomial) and analyzing program traces (Trace). Horizontal axis is the value of i, and x is the measurement interval counted in consecutive memory instructions
6.12 Average absolute errors of modeled extra L1 and L2 cache misses (versus detailed simulation) by modeling S(x) and b(i, x) from binomial distribution (Binomial) and from analyzing program traces (Trace). Result normalized to Binomial
6.13 Probability distribution of the number of distinct blocks being accessed in a cache set for mcf on a 12-way, 64B line, 3 MB cache, from both an inductive function (Prob) and analyzing program traces (Trace). Horizontal axis is the value of i, and x is the measurement interval counted in consecutive memory instructions
6.14 Average absolute errors of modeled extra L1 and L2 cache misses (versus detailed simulation) by modeling b(i, x) from an inductive function (Prob) and from analyzing program traces (Trace). Result normalized to Prob
6.15 Comparison of predicted throughput to the throughput measured with detailed simulation for various cache configurations
6.16 Optimization Case Study: Modeled throughput under different configurations
6.17 Predicted ratio of the number of extra L2 misses (per thread) due to cache contention to the number of L2 misses when a thread runs alone compared to actual hardware (Sun Fire T1000)
6.18 Predicted average throughput per core versus the throughput reported by performance counters on a Sun Fire T1000
\fAcknowledgements I am very grateful to my supervisor, Professor Tor Aamodt, for his constant advice and encouragement during the last two and a half years. Tor introduced me to the beauty of computer architecture and taught me the essence of research. This thesis would not have been possible without his help and support.
I would also like to thank Professor Vikram Krishnamurthy and Professor Steve Wilton for serving on my thesis committee and providing insightful feedback. I would like to thank my colleagues and friends Samer Al-Kiswany, Armin Bahramshahry, Ali Bakhoda, Wilson Fung, Abdullah Gharaibeh, Johnny Kuan, Jason Kuo, Elizeu Santos-Neto, Ivan Sham, Andrew Turner, Henry Wong, and George Yuan for the interesting discussions we had and for the valuable comments they provided on the work in this thesis. They made my experience at UBC unforgettable and I feel very lucky to have met them. I will miss the ping pong breaks with Ali, George, and Abdullah so much. I am indebted to my parents for their unconditional love and support since I was born. They always cheered me up when I was upset and guided me through the most difficult times. I am very grateful to my wife Wendy for her understanding and support during my master\u2019s studies. Words cannot express my gratitude to her and I dedicate this work to her. Finally, I would like to express my thanks to the Natural Sciences and Engineering Research Council (NSERC) of Canada, which provided financial support for this work through PGS-M and CGS-M awards. ix \fChapter 1 Introduction To design a new microprocessor, computer architects typically create a cycle-accurate simulator and run numerous simulations to quantify performance trade-offs. Not only is the task of creating such a simulator time-consuming, but running such simulations can also be slow. Both are significant components of overall design-to-market time. As microprocessor design cycles stretch with increasing transistor budgets, architects effectively start each new project with less accurate information about the eventual process technology that will be used, leading to designs that may not achieve the full potential of a given process technology node. An orthogonal approach to obtaining performance estimates for a proposed design is analytical modeling [70]. 
An analytical model employs mathematical formulas that approximate the performance of the microprocessor being designed based upon program characteristics and microarchitectural parameters. One of the potential advantages of analytical modeling is that it requires much less time than crafting and running a performance simulator. Thus, when an architect has analytical models available to evaluate a given design, the models can help significantly shorten the design cycle. While workload sampling [61], parallel simulation [56], and FPGA acceleration [13, 68] can reduce simulation time, these approaches all require the creation of detailed performance models before any feedback on a proposed change in microarchitecture is available to the design team. Moreover, another key advantage of analytical modeling is its ability to provide insight that a cycle-accurate simulator may not. 1 \f1.1. Motivation First, this thesis proposes and evaluates techniques to predict the impact of pending cache hits, hardware data prefetching, and realistic miss status holding register (MSHR) resources on superscalar microprocessor performance in the presence of long latency memory systems by applying instruction trace analysis. Next, it presents a novel probabilistic cache contention model to accurately predict extra cache misses due to cache contention among a large number of threads. Finally, it proposes a Markov chain model for analytically estimating the throughput of multi-core, fine-grained multithreaded architectures. The rest of this chapter describes the motivation for this thesis, lists its contributions, explains some fundamental concepts required to understand the thesis, and summarizes the organization of the thesis. 1.1 Motivation As the complexity of modern computer systems keeps increasing, cycle-accurate performance simulators have become the most important tool for computer architecture researchers. 
While in 1973 only two out of 28 (i.e., 7%) papers appearing in the International Symposium on Computer Architecture, the flagship conference in computer architecture, were simulation-based, by 2004, 27 out of 31 (i.e., 87%) papers appearing in the conference used performance simulators to collect the data they presented [70]. There exist many flexible cycle-accurate performance simulators [5, 7, 44, 57, 67]. Given a performance simulator, researchers can evaluate a new microarchitecture by modifying the simulator to model it and analyzing the statistics of interest provided by the simulator. Moreover, the results generated by a performance simulator that models the microarchitecture being evaluated in sufficient detail can be very accurate, making performance simulators the primary tool for evaluating new microprocessor designs. 2 \f1.1. Motivation [Figure 1.1: Time to simulate 1 second runtime of a large-scale chip multiprocessor [71]. X-axis: simulation speed (instructions\/second), from 1 K to 1 M; y-axis: simulation time (hours). Annotated simulation times range from 370 days for cycle-accurate simulators down to 53 minutes for emulators, with functional and behavioral simulators in between (37 days, 3.7 days, 9 hours).] However, the simulation speed of a cycle-accurate performance simulator is usually several orders of magnitude slower than real execution, making exhaustive exploration of the large design space of modern microprocessors infeasible. Intel researchers have recently pointed out the challenge of using cycle-accurate performance simulators to simulate a future large-scale chip multiprocessor (LCMP) platform of 32 cores, with each core running at 3 GHz and at 3 cycles per instruction (CPI) [71], as shown in Figure 1.1. The x-axis of this figure shows the speed of different types of simulators, in instructions simulated per second, for modeling such a large system; the y-axis shows the number of hours required to simulate one second of run time of the system. 
From the figure we observe that, to simulate such a large multi-core system, the simulation speed of a cycle-accurate detailed simulator is in the range of a few thousand instructions per second. Thus, it would take several months to simulate one second of real execution, which is obviously too long to wait when designing new microprocessors. Analytical modeling is an alternative to performance simulation that is useful for efficiently exploring the huge design space of modern microprocessors. 3 \f1.2. Contributions An analytical model takes as inputs the major microarchitectural parameters of the system being modeled and the program characteristics of the application being executed, and it predicts the performance of the system running that application. Although an analytical model is less accurate and less flexible than a cycle-accurate performance simulator, since it captures and models the performance impact of only the major design parameters, its speed is usually orders of magnitude faster, making efficient exploration of the large design space of modern microprocessors possible. Moreover, although the absolute accuracy of an analytical model is lower than that of a performance simulator, its relative accuracy is usually good enough to track the impact of varying a microarchitectural parameter; therefore, an analytical model can be very useful in the early stages of microprocessor design. Another important advantage of an analytical model over a cycle-accurate performance simulator is that the analytical model can provide insight that the performance simulator may not. Although a performance simulator can generate numerous statistics, it is impossible to quantify the impact of varying microarchitectural parameters without exhaustively simulating each set of parameters. 
On the other hand, since an analytical model usually consists of mathematical formulas, it can provide insight that systematizes the impact of changing microarchitectural parameters on overall performance; therefore, it can help designers quickly and accurately find the optimal design points [35]. 1.2 Contributions This thesis makes the following contributions:
1. It shows that the performance impact of pending data cache hits is non-negligible for memory intensive applications and describes how to model their effect on performance in the context of a trace driven hybrid analytical model. 4 \f1.2. Contributions
2. It presents a novel technique to more accurately compensate for the potential overestimation of the modeled CPI D$miss, which relies upon analysis of an application\u2019s individual characteristics.
3. It proposes a technique to model the CPI D$miss when a data prefetching mechanism is applied in a microprocessor, without requiring a cycle-accurate performance simulator.
4. It describes a technique to analytically model the impact of a limited number of outstanding cache misses supported by a memory system.
5. It proposes two novel techniques to better analyze the overlapping of long data cache misses.
6. It proposes a novel cache contention model to accurately predict the number of extra data cache misses due to cache contention among a large number of threads.
7. It shows how to compute quantities used in our cache contention model to predict miss rate by leveraging binomial probability modeling techniques originally developed for predicting cache misses due to context switching in time sharing systems [2, 64] (and thereby differentiates our techniques from these earlier approaches).
8. It presents four analytical models of varying accuracy and complexity for estimating the throughput of a multi-core, fine-grained multithreaded single-issue processor (similar to Sun Niagara). 
The most sophisticated and accurate throughput model utilizes a Markov chain to leverage our cache contention model.
9. It shows that our combined cache contention and Markov chain throughput model accurately predicts the throughput of a multicore, fine-grained multithreaded processor obtained from a detailed simulator for various cache configurations. 5 \f1.3. Background
10. It applies the combined cache contention and Markov chain throughput model to optimize a multi-core fine-grained multithreaded processor for two different application specific workloads, yielding the same design points as detailed simulation.
11. Finally, it validates the models against real hardware (a Sun Fire T1000 server).
One of the most important goals of analytical modeling is to provide insights that cannot be easily obtained from detailed simulation. However, since there has been little work on analytical modeling compared to the large amount of work based upon detailed simulation over the last couple of decades [70], a necessary intermediate step is to bring analytical modeling fully up to date with the advances of detailed simulation. We believe that our work significantly narrows the gap between the two methodologies. 1.3 Background This section provides an overview of the fundamental concepts of the superscalar processor pipeline, cache memory, data prefetching, and fine-grained multithreaded architectures used in this thesis. 1.3.1 Superscalar Pipeline Pipelining [36] is a key technique used by modern processors to overlap the execution of multiple instructions. A processor pipeline consists of several stages, and different pipeline stages can execute different instructions in parallel, improving overall performance compared to a non-pipelined processor by exploiting instruction-level parallelism. Pipeline stages are connected one to the next, and instructions flow through the pipeline during their execution. 
In a superscalar pipeline [60], multiple instructions can be processed simultaneously in a pipeline stage in one clock cycle, further improving overall performance. The techniques that we propose in Chapter 2 focus on analytical modeling of modern superscalar out-of-order execution processors, in which data-independent instructions can be executed out of program order.

Figure 1.2: A superscalar microprocessor pipeline (Fetch, Decode, Dispatch, Issue, Execute, Writeback, Commit)

The pipeline of the superscalar processor that we model in this thesis is illustrated in Figure 1.2, and the functionality of each pipeline stage is as follows:

• Fetch: Fetch instructions from memory based upon the program counter (PC).

• Decode: Decode the fetched instructions to obtain information such as instruction types and input and output registers.

• Dispatch: Rename input and/or output registers to eliminate false data dependencies among instructions. Send decoded instructions into the instruction window, a structure that keeps track of the program order of all in-flight instructions.

• Issue: Send instructions to the corresponding functional units when their input data is available.

• Execute: Execute instructions on the functional units. Fetch data requested by load instructions from memory.

• Writeback: Complete execution. Forward each available result to the instructions waiting on it so they can start execution.

• Commit: Retire instructions from the instruction window and update machine state.

1.3.2 Cache Memory

In computer architecture, a cache memory is a storage array containing a subset of the most frequently used data in main memory, and it has proven to be a very important part of modern microprocessors [62].
Due to its small size compared to main memory, cache memory is orders of magnitude faster than main memory; therefore, it can significantly improve the performance of a processor by quickly providing the requested data (provided that data is already stored in the cache). Modern processors usually employ multiple levels of cache to improve the performance of their memory system.

Figure 1.3: A processor with two levels of cache (L1 I-cache, L1 D-cache, unified L2 cache, and main memory)

Figure 1.3 shows the memory hierarchy of a processor with two levels of cache. The cache memory consists of a level one instruction cache (L1 I-cache), a level one data cache (L1 D-cache), and a unified level two cache (L2 cache). While the L1 I-cache and L1 D-cache store a subset of the data contained in the L2 cache, the L2 cache stores a subset of the data contained in main memory. When the processor needs to fetch an instruction, it first accesses the L1 I-cache to search for the requested instruction. If the processor misses in the L1 I-cache, it accesses the L2 cache; if it also misses in the L2 cache, it accesses main memory. Similarly, the processor first accesses the L1 D-cache to search for the data requested by memory instructions. Cache memory in modern superscalar processors is nonblocking, meaning multiple main memory requests are allowed to be served in parallel. Nonblocking caches effectively reduce the penalty per miss by overlapping memory accesses. To implement nonblocking caches, information about outstanding cache misses needs to be stored in miss status holding registers (MSHRs) [38]. In Section 2.2.4, we propose a technique to analytically model the performance impact of the number of MSHRs in a superscalar processor.

1.3.3 Data Prefetching

Data prefetching is a technique that brings data from main memory into cache memory before it is required, so as to hide (or partially hide) long memory access latency.
Once the prefetched data comes back from main memory, it is placed in the cache. Then, when the data is actually requested, it can be accessed much more quickly from the cache than if a request had to be made to main memory. Many hardware data prefetching strategies have been proposed [3, 25, 32, 62]. In Section 2.2.3, we present a technique to analytically predict the performance impact of hardware data prefetching mechanisms.

1.3.4 Fine-grained Multithreaded Architectures

Multithreading is a technique that makes a processor capable of interleaving the execution of multiple threads. The motivation for multithreading is to extract thread-level parallelism among the threads being executed; when the progress of one thread stalls on events such as branch mispredictions or cache misses, other co-running threads can keep processor resources busy, increasing the utilization of the execution resources on the processor. Various approaches to multithreading have been proposed, including fine-grained multithreading (FGMT) [65, 66], coarse-grained multithreading (CGMT) [23], and simultaneous multithreading (SMT) [67]. In this thesis we focus on modeling fine-grained multithreaded architectures. In a fine-grained multithreaded architecture, the execution of multiple co-running threads can be interleaved cycle by cycle; therefore, a fine-grained multithreaded processor can execute instructions from a different thread on every cycle.

Figure 1.4: Instruction interleaving in a four-thread fine-grained multithreaded architecture

Figure 1.4 shows a simple example of instruction interleaving in a four-thread fine-grained multithreaded scalar processor. In this example, the processor executes one instruction per cycle and instructions from Thread 1 (T1) to Thread 4 (T4) are executed in a round-robin fashion.
Ideally, when a thread stalls on cache misses, the long memory access latency can be hidden by useful computation in other co-running threads, as Figure 1.5 illustrates.

Figure 1.5: Hiding memory latency in a four-thread fine-grained multithreaded architecture [37]

In Chapter 3 and Chapter 4 we propose techniques to analytically model the cache contention and the throughput of fine-grained multithreaded architectures similar to Sun Microsystems' Niagara [37].

1.4 Organization

The rest of the thesis is organized as follows:

• Chapter 2 proposes techniques to predict the impact of pending cache hits, hardware data prefetching, and realistic miss status holding register (MSHR) resources on superscalar performance in the presence of long latency memory systems, when employing hybrid analytical models that apply instruction trace analysis [10].

• Chapter 3 presents two novel metrics to quantify the temporal locality of an application and proposes a novel cache contention model, based upon these two metrics, to accurately predict the number of extra data cache misses due to cache contention when the number of threads sharing a cache approaches and/or exceeds the cache's associativity [12].

• Chapter 4 presents four analytical models of varying accuracy and complexity for estimating the throughput of a multi-core, fine-grained multithreaded single-issue processor (similar to Sun Niagara). The most sophisticated and accurate throughput model utilizes a Markov chain to leverage the above cache contention model [12].

• Chapter 5 describes our simulation methodology.

• Chapter 6 presents and analyzes our experimental results.

• Chapter 7 reviews related work.

• Chapter 8 concludes this thesis and suggests future work.
Chapter 2

Hybrid Analytical Modeling of Pending Cache Hits, Data Prefetching, and MSHRs

2.1 First-Order Processor Model

Before explaining the details of our techniques introduced in Section 2.2, it is necessary to review the basics of the first-order model of superscalar microprocessors. Karkhanis and Smith's first-order model [34] leverages the observation that the overall performance of a superscalar microprocessor can be estimated reasonably well by subtracting the performance losses due to different types of miss-events from the processor's sustained performance in the absence of miss-events. The miss-events considered include long latency data cache misses (e.g., L2 cache misses for a memory system with a two-level cache hierarchy), instruction cache misses, and branch mispredictions. Figure 2.1 illustrates this approach. When there are no miss-events, the performance of the superscalar microprocessor is approximated by a constant number of useful instructions issued per cycle (IPC) over time. When a miss-event occurs, performance falls and the IPC gradually decreases to zero. After the miss-event is resolved, the decreased IPC ramps back up to the stable value seen under ideal conditions. A careful analysis of this behavior leads to the first-order model [34].

Figure 2.1: Useful instructions issued per cycle (IPC) over time used in the first-order model [34]

While Figure 2.1 shows each miss-event occurring only after the previous miss-events have been resolved, in a real processor it is possible for different types of miss-events to overlap. For example, a load instruction can miss in the data cache a few cycles after a branch is mispredicted.
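The first-order decomposition described above can be sketched in a few lines of Python; the function and the example event counts below are illustrative assumptions, not parameters taken from this thesis:

```python
def first_order_cpi(base_cpi, total_insts, miss_events):
    """First-order CPI estimate: sustained CPI plus per-type miss-event
    penalties, assuming miss-events of different types rarely overlap.

    miss_events: iterable of (event_count, penalty_cycles) pairs, one per
    miss-event type (e.g., long latency data cache misses, I-cache misses,
    branch mispredictions).
    """
    extra_cycles = sum(count * penalty for count, penalty in miss_events)
    return base_cpi + extra_cycles / total_insts
```

Because each miss-event type simply contributes its own additive CPI component, the model can be evaluated in microseconds for any point in the design space, which is the source of the speed advantage over detailed simulation.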
However, it has been observed (and we confirmed) that overlapping between different types of miss-events is rare enough that ignoring it results in negligible error in typical applications [20, 34]. This thesis focuses on improving the accuracy of the modeled CPI_D$miss (i.e., the CPI component due to long latency data cache misses) since it is the component with the largest error in prior first-order models [33, 34]. Note that short latency data cache misses (i.e., in this thesis, L1 data cache misses that hit in the L2 cache) are not regarded as miss-events in prior first-order models [33, 34]; they are modeled as long-execution-latency instructions when modeling the base CPI. In the rest of this thesis, we use the term "cache misses" to mean long latency data cache misses. As noted by Karkhanis and Smith [34], the interactions between microarchitectural events of the same type cannot be ignored. Our baseline technique for modeling data cache misses, based upon Karkhanis and Smith's first-order model [34], analyzes dynamic instruction traces created by a cache simulator. To differentiate such models, which analyze instruction traces, from earlier analytical models [2, 14, 31, 43, 51, 52] that do not, we also refer to them as hybrid analytical models in this thesis. In each profile step, ROB_size consecutive instructions in the trace are put into the profiling window (or block) and analyzed, where ROB_size is the size of the re-order buffer. If all of the loads missing in the data cache in a profile step are data independent of each other, they are considered overlapped (i.e., the overlapped misses have the same performance impact as a single miss). When data dependencies exist between misses, the maximum number of misses in any single data dependency chain is recorded, and the execution of all the other misses is modeled as hidden under this dependency chain.
In the rest of this thesis, num_serialized_D$miss represents the maximum number of misses measured in any single data dependency chain in a block of instructions, accumulated over all blocks making up the entire instruction trace. When all instructions in the trace have been analyzed, the CPI_D$miss can be estimated as

CPI_D$miss = (num_serialized_D$miss × mem_lat) / total_num_instructions    (2.1)

where mem_lat is the main memory latency and total_num_instructions is the total number of instructions committed (of any type). The following two simple examples show how num_serialized_D$miss is updated. In both examples we assume, for simplicity, that the size of the instruction window is eight. In Figure 2.2(a), there are three data cache misses (i.e., i1, i3, and i6) in the profiling window and they are data independent of each other. Although i9 is also a data cache miss, it is not considered to be overlapped with the other three misses. After this profile step, num_serialized_D$miss is incremented by one. Figure 2.2(b) shows a similar example, but i6 depends on i1 (filled with the same pattern as i1). Therefore, the maximum number of misses in the same data dependency chain is two, and num_serialized_D$miss is incremented by two when the profile step finishes.

Figure 2.2: Profiling examples. Each arrow corresponds to a dynamic instruction in the trace. Data cache misses are filled with patterns. A miss depending on previous misses is filled with the same pattern as the youngest miss on which it depends. The corresponding data dependency graph is also shown to the right in (a) and (b), and each node is an instruction in the dynamic instruction trace.
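The per-block analysis of Figure 2.2 amounts to a longest-chain computation over a small dependency graph. The following sketch shows one way to write it; the list-of-dicts trace format is an illustrative assumption, not the thesis's actual tooling:

```python
def serialized_misses(block):
    """Maximum number of data cache misses on any single dependency chain
    within one profiling block (the per-block increment to
    num_serialized_D$miss).

    block: list of dicts with 'is_miss' (bool) and 'deps' (indices of
    earlier instructions in the block that this instruction depends on).
    """
    chain = [0] * len(block)  # misses on the longest chain ending at each node
    for i, inst in enumerate(block):
        longest_dep = max((chain[d] for d in inst['deps']), default=0)
        chain[i] = longest_dep + (1 if inst['is_miss'] else 0)
    return max(chain, default=0)

def cpi_dmiss(num_serialized, mem_lat, total_insts):
    """Equation 2.1: CPI component due to long latency data cache misses."""
    return num_serialized * mem_lat / total_insts
```

Independent misses, as in Figure 2.2(a), contribute one to num_serialized_D$miss; a two-miss dependency chain, as in Figure 2.2(b), contributes two.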
The CPI_D$miss modeled in Equation 2.1 often overestimates the actual CPI_D$miss since out-of-order execution enables overlap of computation with long latency misses. A simple solution proposed by Karkhanis and Smith [34] is to compensate by subtracting a fixed number of cycles per serialized data cache miss, based upon the ROB size. The intuition for this compensation is that when a load issues and accesses the cache, it can be the oldest instruction in the ROB, the youngest instruction in the ROB, or somewhere in between. If the instruction is the oldest or nearly the oldest, the performance loss (penalty of the instruction) is the main memory latency. On the other hand, if the instruction is the youngest or nearly the youngest one in the ROB and the ROB is full, its penalty can be partially hidden by the cycles required to drain all instructions before it, and can be approximated as mem_lat − ROB_size/issue_width [34]. It has been observed that loads missing in the cache are usually relatively old when they issue [34]; thus, perhaps the simplest (though not most accurate) approach is to use no compensation at all [34]. The mid-point of the two extremes mentioned above can also be used (i.e., a load missing in the cache is assumed to be in the middle of the ROB when it issues), in which case the numerator in Equation 2.1 becomes num_serialized_D$miss × (mem_lat − ROB_size/(2 × issue_width)) [33].

2.2 Modeling Long Latency Memory System

In this section, we describe how we model pending cache hits, data prefetching, and a limited number of MSHRs.

2.2.1 Modeling Pending Data Cache Hits

The method of modeling long latency data cache misses described in Section 2.1 profiles dynamic instruction traces generated by a cache simulator [34].
Since a cache simulator provides no timing information, it classifies the load or store that brings a block into the cache as a miss, and all subsequent instructions accessing the block before it is evicted as hits. However, the actual latency of many instructions classified as hits by a cache simulator is much longer than the cache hit latency. For example, if two nearby load instructions access data in the same block that is not currently in the cache, the first load will be classified as a miss by the cache simulator and the second load as a hit, even though the data would still be on its way from memory in a real processor implementation. Therefore, since the second load is classified as a hit in the dynamic instruction trace, it is ignored when modeling CPI_D$miss using the approach described in Section 2.1. More importantly, a significant source of error arises when two or more data independent load instructions that miss in the data cache are connected by a third, pending data cache hit.

Figure 2.3: An example showing how two data independent misses (i1, i3) are connected by a pending hit (i2), upon which i3 is data dependent. We model a fictitious dependence from i1 to i3 to account for the effect of spatial locality.
i1: LD, R1, 0(R2)   miss (Block A)
i2: LD, R3, 4(R2)   pending hit (Block A; spatial locality with i1)
i3: LD, R4, 0(R3)   miss (Block B)

We elaborate what "connected" means using the simple example in Figure 2.3. In this example, i1 and i3 are two loads that miss and are data independent of each other, while i2 is a pending hit since it accesses data in the same cache block as i1. The model described in Section 2.1 classifies i1 and i3 as overlapped, and the performance penalty due to each miss is estimated as half of the memory access latency (the total penalty is the same as for a single miss).
However, this approximation is inaccurate since i3 is data dependent on the pending data cache hit i2, and i2 gets its data only when i1 obtains its data from memory (i.e., i1 and i2 are waiting for data from the same block). Therefore, in the actual hardware, i3 can only start execution after i1 gets its data from memory, although there is no true data dependence between i1 and either i2 or i3. This scenario is common since most programs contain significant spatial locality. The appropriate way to model this situation is to consider i1 and i3 to be serialized in our analytical model, even though they are data independent and access distinct cache blocks. Figure 2.4 shows the impact that pending data cache hits combined with spatial locality have on overall performance for processors with long memory latencies. The first bar (w/ PH) illustrates the measured CPI_D$miss for each benchmark on the detailed simulator described in Section 5.1, and the second bar (w/o PH) shows the measured CPI_D$miss when all pending data cache hits are simulated as having a latency equal to the L1 data cache hit latency. From this figure, we observe that the difference is significant for eqk, mcf, em, hth [72], and prm.

Figure 2.4: Impact of pending data cache hit latency on CPI_D$miss (measured over all instructions committed from a detailed simulator), for benchmarks app, art, eqk, luc, swm, mcf, em, hth, prm, and lbm

To model the effects of pending data cache hits analytically, we first need to identify them without a detailed simulator. At first this may seem impossible, since no timing information is provided by the cache simulator. We tackle this by assigning each instruction in the dynamic instruction trace a sequence number in program order and labeling each memory access instruction in the trace with the sequence number of the instruction that first brings the accessed memory block into the cache.
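The sequence-number labeling just described, and the pending-hit test it enables, can be sketched as follows; the tuple-based trace format is an illustrative assumption:

```python
def pending_hits(window):
    """Identify pending data cache hits within one profiling window.

    window: list of (iseq, is_hit, bringer_iseq) tuples, where bringer_iseq
    is the sequence number of the instruction that first brought the
    accessed block into the cache. A hit whose bringer is still inside the
    same window is a pending hit: its data is still in flight from memory.
    """
    in_window = {iseq for iseq, _, _ in window}
    return [iseq for iseq, is_hit, bringer in window
            if is_hit and bringer != iseq and bringer in in_window]
```

In the example of Figure 2.3, i2 would be flagged because its block was brought in by i1, which is still in the same window; a hit whose bringer retired in an earlier window is treated as an ordinary hit.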
Then, when we profile the instruction trace, if a hit accesses data from a cache block that was first brought into the cache by an instruction still in the profiling window, it is regarded as a pending data cache hit. For every pending hit identified using this approach (e.g., i2 in Figure 2.3), there is a unique earlier instruction in the profiling window that first brought in the cache block accessed by that pending hit (e.g., i1 in Figure 2.3). When we notice a data dependence between a later cache miss (e.g., i3 in Figure 2.3) and the pending hit (i2), we model a dependence between the earlier miss (i1) and the instruction that is data dependent on the pending hit (i3), since the two instructions (i1 and i3) have to execute serially due to the constraints of the microarchitecture.

2.2.2 Accurate Exposed Miss Penalty Compensation

While the model described in Section 2.1 uses a fixed number of cycles to adjust the modeled CPI_D$miss, we found that compensation with a fixed number of cycles (a constant ratio of the re-order buffer size) does not provide consistently accurate compensation for all of the benchmarks that we studied, resulting in large modeling errors (see Figure 6.1). To capture each benchmark's distinct distribution of long latency data cache misses, we propose a novel compensation method. The new method is motivated by our observation that the number of cycles hidden for a load missing in the cache is roughly proportional to the distance between that load and the immediately preceding load that missed in the cache (we define the distance between two instructions as the difference between their instruction sequence numbers). This is because when a load instruction misses in the cache, most of the instructions between that load and the immediately preceding long latency miss are independent of that load.
Therefore, we approximate the latency of the later load that can be overlapped with useful computation as the time needed to drain those intermediate instructions from the instruction window, which we estimate as the distance between the two loads divided by the issue width. When we profile an instruction trace, the average distance between two consecutive loads missing in the cache is also collected and used to adjust the modeled CPI_D$miss. If the distance between two misses exceeds the window size, it is truncated, since the miss latency can be overlapped by at most ROB_size − 1 instructions. Equation 2.2, below, shows how the CPI_D$miss is adjusted by subtracting a compensation term, (dist/issue_width) × num_D$miss, from the numerator in Equation 2.1.

CPI_D$miss = (num_serialized_D$miss × mem_lat − (dist/issue_width) × num_D$miss) / total_num_instructions    (2.2)

Here dist is the average distance between two consecutive loads that miss in the cache, and the term dist/issue_width represents the average number of cycles hidden for each cache miss. The product of this term and the total number of loads missing in the cache (num_D$miss) becomes the total number of cycles used to compensate for the overestimation of the baseline profiling method.

2.2.3 Modeling Data Prefetching

Data prefetching is a technique that brings data from memory into the cache before it is required, so as to hide (or partially hide) long memory access latency. Many hardware data prefetching strategies have been proposed [3, 25, 32, 62]. In this section, we demonstrate how to extend our model described in Section 2.2.1 to estimate the CPI_D$miss when a data prefetching technique is employed, without running detailed simulations. To model the CPI_D$miss when a particular prefetching method is applied, a cache simulator implementing that prefetching method is needed to generate a dynamic instruction trace. While this does require some coding, we found that the resulting analytical model obtains very accurate results and is two orders of magnitude faster than detailed simulations.

if ( the instruction (crntInst) is a pending hit (e.g., i8 in Fig 2.6) ) {
    find the most recent instruction (prevInst) in the profiling window (e.g., i6 in Fig 2.6)
        that brings crntInst's required data into the cache
    // A: calculate the latency of the current instruction (estimated hidden latency)
    crntInst.lat = max(memLat - (crntInst.iseq - prevInst.iseq) / issueWidth, 0)
    crntInst.lat = crntInst.lat / memLat   // normalize crntInst.lat to the memory latency
    crntInst.length = max(inst.length), where inst is an instruction on which crntInst
        directly depends (true data dependency exists, e.g., i7 -> i8 in Fig 2.6)
    if ( crntInst.length < prevInst.length - prevInst.lat ) {
        // B: tardy prefetch
        crntInst.length = crntInst.length + 1
        crntInst.lat = 1
    } else {
        accmLength = prevInst.length - prevInst.lat + crntInst.lat
        if ( accmLength > crntInst.length ) {
            // C: timely prefetch
            crntInst.lat = accmLength - crntInst.length
            crntInst.length = accmLength
        } else
            crntInst.lat = 0
    }
}

Notation:
crntInst: the current instruction being analyzed
prevInst: the instruction bringing the data required by the current instruction into the cache
crntInst.iseq: the instruction sequence number of the current instruction
issueWidth: the issue width of the microprocessor
memLat: the memory access latency
crntInst.lat: the normalized time interval between the issue and the writeback of the current instruction
crntInst.length: the normalized length of the data dependency chain up to the current instruction

Figure 2.5: Algorithm for analyzing a pending hit in an instruction trace when a prefetching mechanism is applied
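As a concrete sketch, the analysis of Figure 2.5 can be written in Python as follows; the dictionary-based trace representation ('iseq', 'lat', 'length', 'dep_lengths') is our own illustrative format, not part of the thesis's tooling:

```python
def analyze_pending_hit(crnt, prev, issue_width, mem_lat):
    """Sketch of the Figure 2.5 pending-hit analysis (parts A, B, and C).

    crnt and prev are dicts: 'iseq' is the instruction sequence number;
    prev carries 'lat' and 'length' (both normalized to the memory
    latency); 'dep_lengths' holds the .length values of the instructions
    crnt truly depends on. Updates crnt in place and returns it.
    """
    # A: latency left after the prefetch distance hides part of mem_lat,
    # normalized to units of memory latency
    crnt['lat'] = max(mem_lat - (crnt['iseq'] - prev['iseq']) / issue_width, 0) / mem_lat
    crnt['length'] = max(crnt['dep_lengths'], default=0)
    if crnt['length'] < prev['length'] - prev['lat']:
        # B: tardy prefetch -- crnt would issue before the prefetch is even
        # triggered, so it really misses and pays a full memory latency
        crnt['length'] += 1
        crnt['lat'] = 1
    else:
        accm_length = prev['length'] - prev['lat'] + crnt['lat']
        if accm_length > crnt['length']:
            # C: timely prefetch that still leaves some latency exposed
            crnt['lat'] = accm_length - crnt['length']
            crnt['length'] = accm_length
        else:
            crnt['lat'] = 0  # data arrived before crnt issued: fully hidden
    return crnt
```

For the i245 example discussed with Figure 2.7 (prefetch triggered 160 instructions earlier, issue width four, 200-cycle memory), this sketch yields crnt['lat'] = 0.8 and crnt['length'] = 2.8, matching the text.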
As described in Section 2.2.1, when a cache simulator generates an instruction trace, each memory access instruction in the trace is labeled with the sequence number of the instruction that first brought the data into the cache. If the data required by a load was brought into the cache by a prefetch, the load is labeled with the sequence number of the earlier instruction that triggered the prefetch. Recall that, when no prefetching mechanism is applied, an instruction trace generated by a cache simulator is divided into blocks of instructions and each block is analyzed in a profile step. In each profile step, the maximum number of loads that are in a dependence chain and miss in the cache is recorded. However, when an effective prefetching method is implemented, many loads that would have missed in the cache become hits. More specifically, many of them become pending hits, given that some prefetches cannot fully hide the memory access latency. We found that to accurately model prefetch performance, it is necessary to approximate the timeliness of the prefetches, and consequently the latencies of these pending hits, relatively accurately. Figure 2.5 illustrates how we analyze a pending hit in an instruction trace when a particular prefetching mechanism is applied. Here a pending hit can be due either to a prefetch or to a demand miss and, in both cases, it is analyzed using the algorithm in Figure 2.5. For each pending hit (crntInst in Figure 2.5), we find the instruction (prevInst in Figure 2.5) that brought crntInst's required data into the cache.

Figure 2.6: An example motivating Figure 2.5 part B. (Node annotations: i1.length = 1, i2.length = 0, i3.length = 1, i4.length = 1, i5.length = 2, i6.length = 2, i7.length = 1, i8.length = 2 + i8.lat, which should instead be 1 + i8.lat; i6 triggers a prefetch of i8's data.)
We approximate crntInst's latency based upon the observation that, typically, the further prevInst is from crntInst, the more of crntInst's latency can be hidden. The hidden latency of crntInst is estimated as the number of instructions between crntInst and prevInst divided by the issue width of the microprocessor being modeled. Note that we employ the approximation of an ideal CPI equal to 1/issueWidth in this calculation. Then, crntInst's latency is estimated as the difference between the memory access latency and the hidden latency, or zero if the memory latency is completely hidden. This latency is in cycles, and we normalize it by dividing by the main memory latency, since the num_serialized_D$miss accumulated after each profile step is represented in units of main memory latency. The part of the code marked B in Figure 2.5 models a significant phenomenon (late, or tardy, prefetches) that we observed in our study of various hardware prefetching mechanisms. Since the instruction trace being analyzed is generated by a cache simulator that is not aware of the out-of-order execution of the superscalar microprocessor being modeled, a pending hit due to prefetching indicated by the cache simulator is often actually a miss during out-of-order execution. Figure 2.6 shows a simplified example illustrating how this may happen. In this example, there are eight instructions, labeled i1 to i8 in program order. Figure 2.6 shows the data dependency graph constructed during profiling according to an instruction trace generated by a cache simulator, assuming the pseudo-code marked B in Figure 2.5 is not included. In Figure 2.6, i1 and i5 are loads missing in the data cache (represented by the shaded circles) and i6 triggers a prefetch that brings the data accessed by load i8 into the cache when i6 issues (represented by the broken-line arrow labeled "prefetch" from i6 to i8).
For each instruction, the longest normalized length of the data dependency chain up to and including that instruction is shown (in units of main memory latency). For example, "i3.length = 1" above i3 in Figure 2.6 means that it takes one memory access latency from when i1 (the first instruction in the profile step) issues until i3 finishes execution, since i3 is data dependent on i1, which missed in the cache. Since i8 is a pending hit (represented by the circle filled with hatching) and the associated prefetch is started when i6 issues, i8.length is calculated, without B, as the sum of i6.length and i8.lat, where i8.lat is estimated in part A of Figure 2.5 as (memLat − (i8.iseq − i6.iseq)/issueWidth)/memLat. In this example, i8.lat is almost equal to 1.0 since i8 is very close to i6. Although the data accessed by i8 is regarded as prefetched by the algorithm in Figure 2.5 without B, i8 is actually (as determined by detailed simulation) a miss rather than a pending hit, due to out-of-order execution. In Figure 2.6, we observe that i6.length is bigger than i7.length. Therefore, before i6 (e.g., a load instruction) issues (and hence triggers a hardware-generated prefetch), i8 has already issued and missed in the data cache. Thus, the prefetch provides no benefit. The code marked B in Figure 2.5 accurately accounts for this significant effect of out-of-order scheduling by checking whether crntInst (i8) issues before the prefetch is triggered. We observed that removing part B in Figure 2.5 increases the average error for the three prefetching techniques that we model from 13.8% to 21.4%, while adding part B slows our model by less than 2%. An example in Figure 2.7 shows how the part of the code marked C in Figure 2.5 models the case when a useful prefetch occurs in out-of-order execution (i.e., a prefetch which lowers CPI).
Figure 2.7: An example explaining Figure 2.5 part C. (Node annotations: i1.length = 1, i2.length = 0, i3.length = 1, i4.length = 2, i83.length = 2, i84.length = 2, i85.length = 2, i86.length = 2, i245.length = 2.8; i3 triggers a prefetch for i83, and i85 triggers a prefetch for i245.)

In Figure 2.7, only the nine relevant instructions are shown out of the 256 instructions included in a profile step (assuming ROB_size is 256). Among these nine instructions, i1 and i4 are loads that miss in the data cache, and both i3 and i85 trigger prefetches, making i83 and i245, respectively, pending hits. The number of cycles hidden by the prefetch triggered by i3 is estimated as (i83.iseq − i3.iseq)/issueWidth = (83 − 3)/4 = 20 (when the issue width is four), and the remaining latency after normalization is calculated as (memLat − 20)/memLat = 0.9 (we assume throughout our examples that the memory access latency is 200 cycles). However, since i83 is data dependent on i4 and i4.length = 2, by the time i83 issues, its prefetched data has already arrived at the data cache and its real latency becomes zero (this case corresponds to the "else part" inside part C in Figure 2.5). The number of cycles hidden by the prefetch for i245 is estimated (from part A in Figure 2.5) as (i245.iseq − i85.iseq)/issueWidth = (245 − 85)/4 = 40, with a remaining normalized latency of (memLat − 40)/memLat = 0.8. Since the instruction triggering the prefetch (i85) and the instruction that i245 directly depends on (i86) finish execution around the same time (i.e., i85.length = i86.length), i245.length becomes 2.8 and i245.lat becomes 0.8 (this case corresponds to the "if part" inside part C in Figure 2.5).

2.2.4 Modeling a Limited Number of MSHRs

The method of analytically modeling the CPI due to long latency data cache misses described in Section 2.1 assumes that at most ROB_size cache misses can be overlapped.
However, this assumption is unreasonable for most modern processors, since the maximum number of outstanding cache misses the system supports is limited by the number of Miss Status Holding Registers (MSHRs) [4, 22, 38] in the processor. In a real processor, the issue of memory operations to the memory system has to stall when the available MSHRs run out. In the technique described in Section 2.1, the profiling window, which has the same size as the instruction window, is always assumed to be full when modeling CPI_D$miss. In order to model a limited number of outstanding cache misses, we need to refine this assumption. During a profile step, we first stop putting instructions into the profiling window when the number of instructions that miss in the data cache and have been analyzed is equal to N_MSHR (the number of MSHRs), and then update num_serialized_D$miss based only upon those instructions that have been analyzed to that point.¹ Figure 2.8 illustrates how the profiling technique works when the number of outstanding cache misses supported is limited to four. Once we encounter N_MSHR (four) cache misses in the instruction trace (i.e., i1, i2, i4, and i6), the profile step stops and num_serialized_D$miss is updated (i.e., the profiling window is made shorter). In the example, the four misses are data independent of each other (and not connected with each other via a pending hit as described in Section 2.2.1), thus num_serialized_D$miss is incremented by only one. Although i7 also misses in the cache, it is included in the next profile step since all four MSHRs have been used.

¹ In real execution, cache misses that are regarded as not present in the profiling window simultaneously due to a lack of available MSHRs could actually be in the instruction window simultaneously. Reducing the profiling window size only approximates the performance loss due to a limited number of MSHRs. We leverage this observation in Section 2.2.5.
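The shortened profile step of Figure 2.8 can be sketched as follows (a minimal illustration in our own notation, not the thesis implementation; the trace is reduced to hit/miss flags):

```python
def profile_steps(trace, rob_size, n_mshr):
    """Partition an instruction trace into profile steps, cutting a step
    short once n_mshr data-cache misses have been analyzed (Section 2.2.4).
    trace: list of booleans, True = data-cache miss.
    Returns the profile steps as lists of trace indices."""
    steps, i = [], 0
    while i < len(trace):
        step, misses = [], 0
        while i < len(trace) and len(step) < rob_size:
            step.append(i)
            if trace[i]:
                misses += 1
            i += 1
            if misses == n_mshr:
                break   # all MSHRs accounted for: shorten the profiling window
        steps.append(step)
    return steps
```

With a trace shaped like Figure 2.8 (misses at i1, i2, i4, i6, and i7; ROBsize = 8, N_MSHR = 4), the first step ends at i6 and i7 falls into the next step, matching the figure.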
Figure 2.8: An example showing profiling with ROBsize = 8 and N_MSHR = 4. Each arrow corresponds to a dynamic instruction in the trace. Data cache misses are filled with patterns. The corresponding data dependency graph is shown to the right.

2.2.5 Profiling Window Selection

In this section, we present two important refinements to the profiling technique described in Section 2.1 (which we will refer to hereafter as plain profiling) to better model the overlapping between cache misses.

Start-with-a-miss (SWAM) Profiling

We observe that the plain profiling technique described in Section 2.1 often does not account for all of the cache misses that can be overlapped, due to the simple way in which it partitions an instruction trace. Figure 2.9(a) shows a simple example. In this example, we assume for simplicity that all the cache misses (shaded arrows) are data independent of each other. Using the profiling approach described in Section 2.1, a profile step starts at pre-determined instructions (for example, i1, i9, i17, ..., when ROBsize is eight). Therefore, although the latency of i5, i7, i9, and i11 can be overlapped, the plain profiling technique does not account for this. By making each profile step start with a cache miss, we find that the accuracy of the model improves significantly. Figure 2.9(b) illustrates this idea.

Figure 2.9: An example comparing plain profiling and SWAM profiling with ROBsize = 8. Each arrow is a dynamic instruction. Data cache misses are shaded.

Rather than starting a profile step with i1, we start a profile step with i5, so that the profiling window will include i5 to i12. The next profile step will then seek out and start with the first cache miss after i12. We call this technique start-with-a-miss (SWAM) profiling, and in Section 6.1.1 we will show that it decreases the error of plain profiling from 29.3% to 10.3% with unlimited MSHRs. We explored a sliding window approximation (starting each profile window on a successive instruction of any type), but found it did not improve accuracy while being slower. SWAM improves modeling accuracy because it more accurately reflects what the contents of the instruction window of a processor would be (a long latency miss would block at the head of the ROB).

Improved SWAM for Modeling a Limited Number of MSHRs (SWAM-MLP)

The technique for modeling MSHRs proposed in Section 2.2.4 can be combined with SWAM to better model the performance when the number of outstanding cache misses supported by the memory system is limited. The basic idea is to have each profile step start with a miss and finish either when the number of instructions that have been analyzed equals the size of the instruction window or when the number of cache misses that have been analyzed equals the total number of MSHRs. However, choosing a profiling window independent of whether a cache miss is data dependent on other misses (or connected to other misses via pending hits as described in Section 2.2.1) leads to inaccuracy, because data dependent cache misses cannot simultaneously occupy an MSHR entry. To improve accuracy further, we stop a profile step when the number of cache misses that are data independent of misses already analyzed in the same profile step (rather than the total number of cache misses analyzed) equals the total number of MSHRs. In the rest of this thesis we call this technique SWAM-MLP since it improves SWAM by better modeling memory level parallelism.
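The SWAM and SWAM-MLP partitioning rules can be sketched together (our own illustrative code, not the thesis implementation; the dependence encoding is an assumption we introduce for the sketch):

```python
def swam_mlp_steps(trace, rob_size, n_mshr):
    """SWAM-MLP partitioning sketch: each profile step starts at a cache miss
    and ends when either rob_size instructions have been analyzed or the
    next *independent* miss would exceed the n_mshr available MSHRs.
    trace encoding (ours): None = hit; for a miss, the index of an earlier
    miss it depends on, or -1 if independent of earlier misses."""
    steps, i = [], 0
    while i < len(trace):
        while i < len(trace) and trace[i] is None:
            i += 1                      # seek the miss that starts the window
        if i == len(trace):
            break
        step, indep = [], 0
        while i < len(trace) and len(step) < rob_size:
            if trace[i] is not None and trace[i] not in step:
                indep += 1              # independent miss: needs its own MSHR
                if indep > n_mshr:
                    break               # no MSHR free; defer to the next step
            step.append(i)
            i += 1
        steps.append(step)
    return steps
```

Passing an effectively unlimited n_mshr degenerates to plain SWAM: on the Figure 2.9 trace (independent misses at i5, i7, i9, i11; ROBsize = 8) the single window i5 to i12 captures all four overlapped misses. With n_mshr = 2, making one miss dependent on an in-window miss lets the window extend past an extra miss, which is exactly the SWAM-MLP refinement.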
When a miss depends on an earlier miss in the same profiling window, the later miss cannot issue until the earlier one completes. SWAM-MLP improves model accuracy because it takes into account that out-of-order execution can allow another independent miss, younger than both of the above misses, to issue. Therefore, the number of instructions that miss in the data cache and that should be analyzed in a profile step should, in this case, be more than the total number of MSHRs.

Chapter 3

Modeling Cache Contention Among Many Threads

In this chapter, we first summarize a prior cache contention model [9], then explain and evaluate its limitations. Next, we propose two novel metrics for quantifying aspects of temporal locality necessary to overcome these limitations. Finally, we propose a novel cache contention model that uses these locality metrics to accurately model cache contention for a large number of threads.

3.1 Probabilistic Cache Contention Models

Chandra et al. [9] propose a probabilistic model for predicting the number of extra cache misses due to cache contention between two threads on a chip multiprocessor architecture. The model uses information from a circular sequence profile obtained for each thread running alone. In their work, a circular sequence is defined as a sequence of accesses to the same cache set from a thread such that the first and the last access are to the same cache block, which is different from all the blocks touched by intermediate accesses. Figure 3.1 illustrates the notion of a circular sequence. In Figure 3.1 the blocks A, B, C, D, and E map to the same set in a four-way LRU cache, and several examples of circular sequences are shown in the figure. A circular sequence is denoted cseq(d, n), where d is the number of distinct blocks in the circular sequence, n is the total number of blocks in the circular sequence, and d is at least one less than n (as a result of the definition of cseq(d, n)).
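The definition can be made concrete with a short sketch (illustrative code in our own notation, not the profiling tool used in the thesis). Run on the access stream A B A A C D B E A, which is consistent with Figure 3.1, it recovers cseq(2,3), cseq(1,2), cseq(4,6), and cseq(5,6), and the hit rule of the next paragraph (d at most the associativity) marks only the last access of cseq(5,6) as a miss:

```python
def circular_sequences(accesses):
    """For one cache set's access stream, report (d, n) for every circular
    sequence: a subsequence whose first and last access touch the same
    block, bounded by that block's previous access."""
    cseqs, last = [], {}
    for i, blk in enumerate(accesses):
        if blk in last:
            window = accesses[last[blk]:i + 1]
            cseqs.append((len(set(window)), len(window)))  # (d, n)
        last[blk] = i
    return cseqs

def last_access_hits(d, assoc):
    # In an assoc-way LRU cache, the last access of cseq(d, n) hits iff d <= assoc
    return d <= assoc
```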
Figure 3.1: An example of circular sequences in a set of a four-way associative LRU cache

Figure 3.2: Extra L2 cache misses due to cache contention when multiple copies of mcf are running in parallel (L2 config. in Table 5.3)

In an x-way associative cache with LRU replacement, the last access in a circular sequence cseq(d, n) is a hit if and only if d ≤ x. Thus, by performing circular sequence profiling it is possible to determine whether a load or store misses in the cache. In Figure 3.1 the last access (to block A) results in a cache miss since it corresponds to cseq(5, 6) and 5 is greater than the cache associativity. For all the other circular sequences in Figure 3.1, the last access corresponds to a cache hit. To model the number of extra cache misses due to cache contention, a circular sequence profile is performed to obtain the distribution of cseq(d, n) for a given cache configuration. Using this information, the average of n weighted by the frequency of occurrence of cseq(d, n), denoted n̄, is computed for each d between 1 and x (where x is the cache associativity). After obtaining n̄ for each d, the prior model uses the access frequency per set (the average number of accesses per cycle to the cache set, including both hits and misses) to approximate the average time interval (in cycles) between the first and the last access of cseq(d, ∗). Here cseq(d, ∗) represents all circular sequences containing d distinct tags. Next, the estimated time interval is converted to the number of cache accesses to that set from the second thread using the second thread's access frequency.
Then, an inductive model is used to estimate the probability that a specified number of distinct blocks are accessed in the set by the second thread. Finally, the probability that a cache hit from the first thread becomes a miss due to interference from the second thread is approximated, and the number of extra cache misses of the first thread is calculated. The prior model works well when there are only two threads sharing a cache with a high associativity. As the number of threads sharing the cache increases, the prior model described above becomes less accurate. Figure 3.2 compares, for multiple copies of mcf running in parallel, the predicted result both from an extension to the prior model [9] (labeled "predicted (prior model)") and from our new model described in Section 3.3 (labeled "predicted (new model)") to the result measured on real hardware, a Niagara T1 based Sun Fire T1000 system. The predicted result is the ratio of the number of extra L2 misses per mcf due to cache contention to the number of misses when mcf runs alone. Note that we have extended the Prob approach presented for two threads in Chandra et al. [9] by considering the interleaving of sequences from more than one interfering thread.² In Figure 3.2, we observe that the error of the prior model increases significantly as the number of threads sharing the cache increases.

Figure 3.3: Average number of distinct cache sets accessed during a given number of consecutive memory instructions for a twelve-way, 64B line, 4096-set LRU L2 cache

For example, consider the case where twelve threads share a twelve-way L2 cache (i.e., 12mcf in Figure 3.2). For any given circular sequence of a thread, the number of accesses from each co-scheduled thread occurring between the first and the last access in the circular sequence is predicted to be at least one by the prior model. Therefore, only a small fraction of the last accesses in all cseq(1, ∗) circular sequences are modeled as hits (when each of the eleven co-scheduled threads inserts only one access into a cseq(1, ∗)). In this example, the actual number of L2 misses increases by a factor of 6.2× and our new model presented in Section 3.3 predicts an increase of 6.5× (4.8% error). However, the prior model predicts an increase of 37.6×, since it predicts that 95.7% of the accesses that hit in the cache when mcf runs alone will turn into cache misses (resulting in a 507.9% error).

² For "prior model" in Figure 3.2 we use
\[
P_{miss}(cseq_{X_1}(d_{X_1}, n_{X_1})) = 1 - \sum_{d_{X_2}+\dots+d_{X_N} \le A - d_{X_1}} \; \prod_{i=2}^{N} P(seq(d_{X_i}, E(n_{X_i})))
\]
to extend Equation 3 for the Prob model in Chandra et al. [9] to more than two threads. Here X_i, i ≠ 1, represents each interfering thread, A is the cache associativity, and N is the total number of threads. This equation follows directly after extending Corollary 3 in [9], such that the last access from X_1 in cseq_{X_1}(d_{X_1}, n_{X_1}) results in a miss if \(\sum_{i=1}^{N} d_{X_i} > A\) (or a hit otherwise).

3.2 New Locality Metrics For Modeling Cache Contention

While circular sequence profiling captures many aspects of temporal locality, what is missing in the prior model described in Section 3.1 is the fact that the cache footprint of a thread during a small interval of time is likely to be limited to a small number of cache sets rather than all of them. If the time interval between the first and the last access in a circular sequence is short, cache accesses from other co-scheduled threads are not likely to be sent to the same cache set containing the accesses from the circular sequence.
Therefore, even in the presence of many co-scheduled threads, the last access in a circular sequence may not turn into a miss. To quantify the above behavior we introduce the parameter S(x), which is the average number of sets accessed by a thread during x consecutive loads or stores. This can be obtained by off-line profiling of a memory access trace for a specific cache configuration (or potentially in hardware, e.g., by adding a single bit per cache set along with a single counter and a table, though detailed exploration of such implementations is beyond the scope of this work). We note that S(x) is different from the function u(B) in Agarwal et al. [2] (which is related to F_A in Thiebaut and Stone [64]). There are two differences: First, S(x) represents the average number of unique sets accessed given x memory references, whereas u(B) is the average number of unique blocks of size B measured over a fixed-length time-slice interval (which could equal x). Second, our use of S(x) is quite different from the use of u(B) or F_A in earlier work. In Section 6.2.2 we show how one can approximate S(x), assuming cache accesses follow a binomial distribution, using techniques similar to those of Thiebaut and Stone [64]. Moreover, Appendix A provides an example showing why extending the earlier approaches proposed for modeling extra cache misses due to context switching on time-sharing machines does not work in the limit as threads context switch cycle by cycle. This problem is overcome by combining S(x) with circular sequence profile information.

Figure 3.4: Probability distribution of the number of distinct blocks being accessed in a cache set for mcf on a 12-way, 64B line, 3 MB cache. The horizontal axis is the value of i, and x is the measurement interval counted in consecutive memory instructions. Panels (a) through (f) show b(i, x) for x = 1000, 10000, 20000, 50000, 100000, and 250000.

Figure 3.3 shows S(x) for different benchmarks for the L2 cache we model in Section 6.2. From the figure we observe that as the number of consecutive memory instructions being analyzed increases, the number of distinct cache sets accessed increases until saturating at the total number of cache sets (provided the data set of an application is large enough to use all sets in the cache). A thread with a higher S(x) for a given x is more likely to access sets currently used by other threads. Besides the number of cache sets accessed, a thread's cache footprint also depends on the number of distinct blocks accessed in those sets. For example, it is possible that two threads have the same value of S(x) for a fixed x (number of memory instructions), yet for each set being accessed there are ten distinct blocks from the first thread and only two from the second thread. Then, although the two threads access the same number of sets, the first thread's cache footprint over this interval would be five times larger than the second thread's. To take the number of distinct blocks in a set accessed by a thread into consideration, we propose using a probability vector, b(i, x), to estimate the probability that there are i distinct accesses to a cache set during x consecutive memory instructions, given that the cache set is accessed.
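Both metrics can be gathered in a single pass over a trace. The sketch below is our own illustration of such off-line profiling; the non-overlapping-window handling and the lumping of counts at or above the associativity into one bucket are assumptions of the sketch:

```python
def locality_metrics(set_trace, x, assoc):
    """Measure S(x) and b(i, x) from a trace of (set_index, block_tag) pairs
    over non-overlapping windows of x consecutive memory accesses.
    Returns (S(x), b) where b[i] is the probability of i distinct blocks in
    an accessed set (b[assoc] lumps together counts >= associativity)."""
    s_samples, b_counts = [], [0] * (assoc + 1)
    for start in range(0, len(set_trace) - x + 1, x):
        window = set_trace[start:start + x]
        per_set = {}
        for s, tag in window:
            per_set.setdefault(s, set()).add(tag)
        s_samples.append(len(per_set))          # sets touched in this window
        for tags in per_set.values():
            b_counts[min(len(tags), assoc)] += 1
    s_x = sum(s_samples) / len(s_samples)
    total = sum(b_counts)
    return s_x, [c / total for c in b_counts]
```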
Similar to S(x), b(i, x) can be obtained via off-line profiling. Figure 3.4 illustrates b(i, x) with different values of x for mcf. For each x, the probability that there are one to eleven (i.e., associativity − 1) distinct accesses to a cache set is shown in the first eleven bars, while the rightmost bar of each panel is the probability of having twelve (i.e., the associativity) or more distinct accesses. From Figure 3.4, we observe that when the measurement interval (i.e., x) is small, the average number of distinct accesses in each cache set being accessed is likely to be small. As x increases, the mode of each b(i, x) distribution tends to move toward higher values of i. When x is sufficiently large, the number of distinct accesses is typically greater than or equal to the associativity of the cache. Our use of b(i, x) essentially replaces the use of the inductive probability model [9]. In the next section, we utilize S(x) and b(i, x), along with circular sequence profiling, to more accurately quantify the additional number of cache misses due to contention among threads when the number of co-scheduled threads approaches or exceeds the associativity of the cache being shared.

3.3 Accurately Modeling Cache Contention with Many Threads

For concreteness, we will describe how to model the number of extra cache misses of a given thread (T1) due to sharing a four-way set associative L1 data cache with three other threads (T2, T3, and T4). We also generalize the details for different numbers of threads and cache associativities. We define the distance between two accesses from a single thread as the total number of intermediate memory accesses from that thread (i.e., independent of which cache sets are accessed).
As mentioned in Section 3.2, for a circular sequence cseq(d, n), the greater the distance between its first and last access, the more likely the last access will become a miss due to cache contention. Therefore, hereafter we describe each circular sequence together with the distance between its first and last access as cseq(d, n, r), where r represents the distance. We use cseq(d, ∗, r) for the set of circular sequences with d distinct tags, for all n such that d < n ≤ r. Briefly, our algorithm computes the extra number of cache misses due to cache contention as follows. We first compute the average number of memory accesses needed to observe a given number of unique blocks in a circular sequence. Next, given an estimate of the relative progress of each thread, we translate this into the number of memory accesses from co-scheduled threads. Then, we use S(x) to determine the number of unique sets accessed by co-scheduled threads and b(i, x) to determine the number of unique blocks accessed in those sets by the co-scheduled threads. Using this information we then compute the probability of a miss. The probability of a miss being introduced into a circular sequence is a nonlinear function of the distance r, and in general E[f(x)] ≠ f(E[x]) when f is nonlinear. Thus, we find that accuracy is improved if we compute the probability of a miss separately for groups g of ranges of r when computing the number of extra cache misses for circular sequences of different lengths. In detail, the procedure for computing the extra number of cache misses is as follows: for each value of d between one and four (i.e., the associativity of the L1 cache), we carry out the first five steps below, then use the resulting information in the last step.

Step 1: We compute the average distance over which d unique blocks are seen in a given cache set.
To do this we apply circular sequence profiling and classify circular sequences cseq(d, n, r) into different groups based upon r, where each group g corresponds to a range of r. Then, for a group g, we calculate dist(d,g), the weighted average of r (the distance between the first and last access in a circular sequence) over all circular sequences cseq(d, n, r) belonging to the group, as follows:

\[
dist_{(d,g)} = \frac{\sum_{r=\underline{r}_g}^{\overline{r}_g} r \times num(cseq(d,*,r))}{\sum_{r=\underline{r}_g}^{\overline{r}_g} num(cseq(d,*,r))}
\]

where \(\underline{r}_g\) and \(\overline{r}_g\) represent the lower and upper bounds of group g, respectively, and num(cseq(d, ∗, r)) denotes the number of occurrences of the circular sequence cseq(d, ∗, r) observed during profiling. In our study, we classify a circular sequence into one of twelve groups. The first group corresponds to all circular sequences with 1 < r < 2^5, the next ten groups to 2^i ≤ r < 2^(i+1) with i ∈ {5, 6, ..., 14}, and the last group corresponds to all circular sequences with r ≥ 2^15. We will show the impact of the number of groups chosen to classify a circular sequence on the accuracy of our cache contention model in Section 6.2.1.
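Group assignment and the weighted mean distance can be sketched as follows (illustrative code; cseq_counts is a hypothetical histogram of the kind circular sequence profiling would produce):

```python
def group_of(r):
    """Map a distance r to one of the twelve groups above: r < 2^5, then
    2^i <= r < 2^(i+1) for i = 5..14, then r >= 2^15."""
    if r < 2 ** 5:
        return 0
    for i in range(5, 15):
        if r < 2 ** (i + 1):
            return i - 4
    return 11

def mean_distance(cseq_counts):
    """cseq_counts: dict mapping (d, r) -> num(cseq(d, *, r)).
    Returns dist[(d, g)], the occurrence-weighted mean of r per group."""
    num, den = {}, {}
    for (d, r), count in cseq_counts.items():
        key = (d, group_of(r))
        num[key] = num.get(key, 0) + r * count
        den[key] = den.get(key, 0) + count
    return {k: num[k] / den[k] for k in num}
```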
Step 2: For each dist(d,g), calculate nT2(d,g), nT3(d,g), and nT4(d,g), the corresponding numbers of accesses from T2, T3, and T4, respectively, based upon the access frequency of those threads:

\[
n_{T_i(d,g)} = dist_{(d,g)} \times \frac{access\_freq(T_i)}{access\_freq(T_1)} \quad \text{for } i = 2, 3, 4 \text{ (up to \#threads)}
\]

where access_freq(Ti) is the number of memory instructions per cycle for thread Ti.³

Step 3: Using nTi(d,g) from Step 2, find STi(d,g), the average number of cache sets accessed during nTi(d,g) consecutive memory access instructions, by using S(x) described in Section 3.2:

\[
S_{T_i(d,g)} = S(n_{T_i(d,g)}) \quad \text{for } i = 2, 3, 4 \text{ (up to \#threads)}
\]

Step 4: Using nTi(d,g) from Step 2 and b(i, x) (defined in Section 3.2), find dTi(d,g)(k), the distribution of unique cache blocks accessed in a set by thread Ti, as follows:

\[
d_{T_i(d,g)}(k) = b(k, n_{T_i(d,g)}) \quad \text{for } i = 2, 3, 4 \text{ (up to \#threads)}; \; k = 1, 2, 3, 4 \text{ (up to cache assoc.)}
\]

Step 5: For each group of circular sequences, calculate probH(d, g), the probability that the last access of a circular sequence cseq(d, n, r) in group g is not turned into a miss due to cache contention. Recall that a circular sequence cseq(d, n, r) accesses d unique blocks in r memory accesses, and here we focus on the subset of those cseq(d, n, r) where r is in group g. To model the probability that another thread (Ti) accesses the cache set containing the circular sequence, we divide STi(d,g), obtained from Step 3, by the total number of cache sets (which we will represent with S) to obtain STi(d,g)/S.

³ As we will discuss later in Section 4.4.2, when the cache contention model is combined with the Markov chain model to predict throughput, access_freq(Ti) is estimated from a simple single-thread, sum-of-CPI-component model using the (current best) estimate of the number of cache misses.
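Steps 2 through 4 chain together as below (an illustrative sketch in our own names; s_of and b_of stand in for the profiled S(x) and b(i, x) functions):

```python
def interfering_thread_stats(dist_dg, access_freq, s_of, b_of, n_threads, assoc):
    """For one (d, g) pair, derive per-interfering-thread quantities:
    n  - accesses issued by Ti while T1 covers dist(d,g) accesses (Step 2),
    S  - sets Ti touches during those accesses, via S(x)       (Step 3),
    d  - distribution of distinct blocks Ti puts in a set, via b (Step 4)."""
    stats = {}
    for t in range(2, n_threads + 1):
        n_t = dist_dg * access_freq[t] / access_freq[1]
        stats[t] = {
            "n": n_t,
            "S": s_of(n_t),
            "d": [b_of(k, n_t) for k in range(1, assoc + 1)],
        }
    return stats
```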
The intuition here is that the chance that any given set (i.e., the set containing the circular sequence) is accessed by a thread during a fixed time interval is proportional to the number of distinct sets accessed by the thread during that amount of time. We compute the desired probability as the sum of four parts (since there are four threads per core): (i) the probability that none of T2, T3, or T4 accesses the set containing the circular sequence; (ii) the probability that exactly one of T2, T3, or T4 accesses the set containing the circular sequence, combined with the probability that the number of distinct accesses to the set from that thread is less than or equal to the difference between the cache's associativity A and d (i.e., the maximum number of distinct accesses from other threads without turning the last access of the circular sequence into a miss); (iii) the probability that two of T2, T3, or T4 access the set, combined with the probability that the sum of the numbers of distinct accesses to the set from the two threads is less than or equal to A − d; and (iv) the probability that T2, T3, and T4 all access the set, combined with the probability that the sum of the numbers of distinct accesses to the set from all three threads is less than or equal to A − d. The general formula for computing probH(d, g) for an arbitrary number of threads N and associativity A is:

\[
prob_H(d,g) = \prod_{i=2}^{N}\left(1-\frac{S_{T_i(d,g)}}{S}\right) + \sum_{j=1}^{N-1} \;\; \sum_{\substack{2\le i_1,i_2,\dots,i_{N-1}\le N\\ i_1\ne i_2\ne\dots\ne i_{N-1}}} \;\; \sum_{\substack{\forall \alpha_{i_n}\in 1..A,\\ \sum_{n=j}^{N-1}\alpha_{i_n}\le A-d}} \left[\prod_{n=1}^{j-1}\left(1-\frac{S_{T_{i_n}(d,g)}}{S}\right)\right] \left[\prod_{n=j}^{N-1}\frac{S_{T_{i_n}(d,g)}}{S}\, d_{T_{i_n}(d,g)}(\alpha_{i_n})\right]
\]

To illustrate the above formula, we show a concrete example of computing probH(2, g) for the case where the number of co-scheduled threads (N) is four and the cache associativity (A) is four.
Recall that probH(2, g) is the probability of a hit for all circular sequences in group g that access two unique blocks. Based upon the formula, probH(2, g) is modeled as the sum of three components. The first component, corresponding to part (i) in Step 5, is

\[
\left(1-\frac{S_{T_2(2,g)}}{S}\right)\left(1-\frac{S_{T_3(2,g)}}{S}\right)\left(1-\frac{S_{T_4(2,g)}}{S}\right).
\]

The second component, corresponding to part (ii) in Step 5, is

\[
\begin{aligned}
&\left(\tfrac{S_{T_2(2,g)}}{S} d_{T_2(2,g)}(1)\right)\left(1-\tfrac{S_{T_3(2,g)}}{S}\right)\left(1-\tfrac{S_{T_4(2,g)}}{S}\right)
+ \left(1-\tfrac{S_{T_2(2,g)}}{S}\right)\left(\tfrac{S_{T_3(2,g)}}{S} d_{T_3(2,g)}(1)\right)\left(1-\tfrac{S_{T_4(2,g)}}{S}\right) \\
+\; &\left(1-\tfrac{S_{T_2(2,g)}}{S}\right)\left(1-\tfrac{S_{T_3(2,g)}}{S}\right)\left(\tfrac{S_{T_4(2,g)}}{S} d_{T_4(2,g)}(1)\right)
+ \left(\tfrac{S_{T_2(2,g)}}{S} d_{T_2(2,g)}(2)\right)\left(1-\tfrac{S_{T_3(2,g)}}{S}\right)\left(1-\tfrac{S_{T_4(2,g)}}{S}\right) \\
+\; &\left(1-\tfrac{S_{T_2(2,g)}}{S}\right)\left(\tfrac{S_{T_3(2,g)}}{S} d_{T_3(2,g)}(2)\right)\left(1-\tfrac{S_{T_4(2,g)}}{S}\right)
+ \left(1-\tfrac{S_{T_2(2,g)}}{S}\right)\left(1-\tfrac{S_{T_3(2,g)}}{S}\right)\left(\tfrac{S_{T_4(2,g)}}{S} d_{T_4(2,g)}(2)\right).
\end{aligned}
\]

Here each term represents a different combination of a single thread Ti, i > 1, accessing the set and introducing one (e.g., dT2(2,g)(1)) or two (e.g., dT2(2,g)(2)) unique blocks. Such interleaved accesses would leave a circular sequence cseq(2, n, r) from T1 a cache hit. The third component, corresponding to part (iii) in Step 5, is

\[
\begin{aligned}
&\left(\tfrac{S_{T_2(2,g)}}{S} d_{T_2(2,g)}(1)\right)\left(\tfrac{S_{T_3(2,g)}}{S} d_{T_3(2,g)}(1)\right)\left(1-\tfrac{S_{T_4(2,g)}}{S}\right)
+ \left(\tfrac{S_{T_2(2,g)}}{S} d_{T_2(2,g)}(1)\right)\left(1-\tfrac{S_{T_3(2,g)}}{S}\right)\left(\tfrac{S_{T_4(2,g)}}{S} d_{T_4(2,g)}(1)\right) \\
+\; &\left(1-\tfrac{S_{T_2(2,g)}}{S}\right)\left(\tfrac{S_{T_3(2,g)}}{S} d_{T_3(2,g)}(1)\right)\left(\tfrac{S_{T_4(2,g)}}{S} d_{T_4(2,g)}(1)\right).
\end{aligned}
\]

Note that for probH(2, g) there is no component related to part (iv) described in Step 5, since a cseq(2, n, r) can tolerate at most two distinct tags from other threads without the last access in the circular sequence turning into a miss in a four-way LRU cache.
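Rather than expanding the combinations by hand as above, probH can be computed by direct enumeration. The following sketch is our own illustration (exhaustive enumeration over which interfering threads touch the set and how many distinct blocks each inserts), mirroring parts (i) through (iv) of Step 5:

```python
from itertools import product

def prob_hit(d, S_frac, d_dist, assoc):
    """probH for circular sequences with d distinct blocks (Step 5).
    S_frac[t]: S_Tt(d,g)/S, the chance interfering thread t touches the set;
    d_dist[t][k-1]: probability t inserts k distinct blocks, given it
    touches the set. A hit survives while the inserted blocks from all
    accessing threads total at most assoc - d."""
    threads = list(S_frac)
    total = 0.0
    for mask in range(1 << len(threads)):
        accessing = [t for b, t in enumerate(threads) if mask >> b & 1]
        p_access = 1.0
        for b, t in enumerate(threads):
            p_access *= S_frac[t] if mask >> b & 1 else 1.0 - S_frac[t]
        if not accessing:
            total += p_access               # part (i): nobody touches the set
            continue
        for alphas in product(range(1, assoc + 1), repeat=len(accessing)):
            if sum(alphas) <= assoc - d:    # still at most assoc blocks in the set
                p = p_access
                for t, a in zip(accessing, alphas):
                    p *= d_dist[t][a - 1]
                total += p
    return total
```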
Figure 3.5 illustrates each term in the calculation of probH(2, g) by showing an associated example in which the last access of a circular sequence cseq(2, n, r) from T1 still hits in the cache when T1 is co-scheduled with T2, T3, and T4. In this figure, a block labeled XTi represents a cache block X requested by Ti. Note that only one example is shown for each term; other similar combinations are taken into account by the probability calculation, since the order of blocks between the first and last access can change (e.g., the sequences AT1 BT1 CT2 AT1 and AT1 CT2 BT1 AT1 are both accounted for by the same term in the probability calculation).

Step 6: Finally, the total number of extra misses is calculated as:

\[
\sum_{d=1}^{4} \sum_{g=1}^{12} \sum_{r=\underline{r}_g}^{\overline{r}_g} \left[(1 - prob_H(d,g)) \times num(cseq(d,*,r))\right]
\]

Here 4 is the cache associativity and 12 is the number of groups of distances that we use to classify a circular sequence. For our study, we implement the above calculations in Matlab, and the running time (on an Intel Core 2 Duo desktop system) is on the order of 0.3 seconds. This is the time to calculate probH and the extra number of cache misses, given S(x), b(i, x), and information about circular sequence frequencies. The run time is insensitive to the benchmark, the number of instructions per benchmark, and the number of sets. However, the run time scales superlinearly as the cache associativity increases (0.3 seconds corresponds to computing the data in Figure 6.6 for the cache configurations listed in Table 5.3).
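Step 6 reduces to a single weighted sum; a sketch in our own names, where prob_h would come from Step 5, cseq_counts from profiling, and group_of is the distance-to-group mapping of Step 1:

```python
def extra_misses(prob_h, cseq_counts, assoc, group_of):
    """Total extra misses: sum over d, groups g, and distances r of
    (1 - probH(d, g)) * num(cseq(d, *, r)).
    prob_h: dict (d, g) -> hit probability; cseq_counts: dict (d, r) -> count."""
    total = 0.0
    for (d, r), count in cseq_counts.items():
        if 1 <= d <= assoc:
            total += (1 - prob_h[(d, group_of(r))]) * count
    return total
```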
Figure 3.5: Explanation of each term of probH(2, g). For each term, the figure shows an interleaving in which the last access of a circular sequence cseq(2, n, r) from T1 still hits: none of T2, T3, and T4 inserts a tag into the circular sequence; one of them inserts one or two tags; or two of them each insert one tag.

Chapter 4

Modeling Fine-Grained Multithreaded Throughput

In this chapter, we present several analytical models for predicting the throughput of fine-grained multithreaded architectures. The models proposed in this thesis apply directly to multiprogrammed workloads in which threads do not communicate with each other (i.e., do not incur synchronization overheads). To model the performance of parallel workloads in which threads communicate, the models proposed here could be employed as the lower-level system model of a two-level hierarchical parallel performance model such as the one proposed by Adve and Vernon [1].
In such models, the lower-level system model is invoked to predict the throughput of the system up until the next synchronization point. Moreover, if an OS performs unrelated processing on some thread contexts, one could treat the OS as a thread with a different memory access pattern, while any synchronization among threads via the OS could potentially be modeled by incorporating our model into the two-level hierarchical model mentioned above.

4.1 Sum of Cycles Model

One approach to predicting overall throughput is to simply assume that no overlapping of execution occurs on a core. While this assumption is pessimistic for memory-bound applications, it does match the case where an application is entirely compute-bound. We evaluate this model using:

\[
throughput_C = \frac{\sum_{i=1}^{N_C} num\_inst_{T_i}}{\sum_{i=1}^{N_C} cycle_{T_i}}
\]

where throughput_C is the throughput on core C, num_inst_Ti is the number of instructions executed by thread Ti, cycle_Ti is the number of cycles taken by the thread to run those instructions when the thread runs alone, and N_C is the total number of threads running on core C. In Section 6.2.1 we will show that this model always underestimates the real throughput of a fine-grained multithreaded core on our workloads (by 58% on average).

4.2 Sum of IPCs Model

The model in Section 4.1 always underestimates throughput since it ignores the fact that stalls from different threads on a core can overlap. To overcome this drawback, a simple approach is to sum up each thread's isolated instructions per cycle (IPC), the IPC when the thread runs alone. This works well when the execute stage of a fine-grained multithreaded pipeline is lightly utilized by a single thread. Such behavior is common for memory-bound applications. However, we find that this model always overestimates the real throughput (by 55% on average for our workloads, as shown in Section 6.2.1).
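The two bounds can be written down directly (an illustrative sketch; per-thread instruction counts and isolated cycle counts are the inputs):

```python
def sum_of_cycles_throughput(num_inst, cycles):
    """Section 4.1: no overlap between threads, so aggregate instructions
    are divided by the sum of each thread's isolated cycles (a lower bound
    in practice)."""
    return sum(num_inst) / sum(cycles)

def sum_of_ipcs_throughput(num_inst, cycles):
    """Section 4.2: stalls overlap perfectly, so isolated IPCs simply add.
    The sum can exceed the single-issue peak of 1.0, one reason this model
    overestimates."""
    return sum(n / c for n, c in zip(num_inst, cycles))
```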
4.3 Bernoulli Model

Both models above oversimplify pipeline resource sharing in fine-grained multithreaded architectures. To more accurately model the overlapping of stalls from threads running on a core, we assume that every cycle each thread performs a "coin toss" (or Bernoulli trial) to determine whether it has an instruction ready to execute. Ignoring the effects of resource contention between threads, the probability that a thread T_i has an instruction ready to execute can be estimated as IPC_{T_i} (the thread's isolated IPC), since this is a value between zero and one for a single-issue pipeline. Then 1 - IPC_{T_i} can be used to estimate the probability that T_i cannot issue an instruction on any given cycle, and prod_{i=1}^{N_C} (1 - IPC_{T_i}) becomes an estimate of the probability that no thread has an instruction ready to issue on a given cycle, where N_C is the number of threads running on a core and IPC_{T_i} represents the isolated IPC of thread T_i.

[Figure 4.1: A Markov chain model for a four-thread fine-grained multithreaded architecture. S_n represents the state in which n threads are suspended and p_ij represents the transition probability from S_i to S_j.]

Thus, the throughput of fine-grained multithreaded core C can be estimated as the product of the peak issue width (1.0) and the probability that at least one thread can issue an instruction in a cycle: throughput_C = 1.0 × (1 - prod_{i=1}^{N_C} (1 - IPC_{T_i})). Concurrent with our work, Govindaraju et al. proposed a customized analytical model [26] similar to our Bernoulli model. In Section 6.2.1 we will show that our Bernoulli model reduces error to an overestimate of real throughput of 23% on average.
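The Bernoulli estimate is equally direct to compute. The sketch below, with assumed isolated IPC values for a hypothetical four-thread single-issue core, illustrates the formula; the numbers are not from the thesis.

```python
import math

def bernoulli_throughput(isolated_ipcs, issue_width=1.0):
    # Probability that no thread has a ready instruction this cycle.
    p_none = math.prod(1.0 - ipc for ipc in isolated_ipcs)
    # Throughput = peak issue width times P(at least one thread ready).
    return issue_width * (1.0 - p_none)

# Assumed isolated IPCs of four threads on a single-issue core.
print(bernoulli_throughput([0.5, 0.25, 0.4, 0.2]))  # approximately 0.82
```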
4.4 A Markov Chain Model

The techniques proposed in Sections 4.1 to 4.3 do not model contention in the memory system or the time dependence of long-latency stall events. To account for these factors, we use a Markov chain to model the partial hiding of one thread's stalls by execution from other threads. We will show in Section 6.2.1 that, when combined with the novel cache contention model we proposed in Section 3.3, the Markov chain model proposed in this section reduces the average error of modeled throughput to 7.9% compared against detailed simulation.

Figure 4.1 illustrates an example of the Markov chain we use to model throughput for the case of a four-thread, single-issue, in-order, fine-grained multithreaded pipeline with a memory system that supports at most one outstanding long-latency cache miss per thread. We note that the assumption of one outstanding cache miss per thread does match the designs of Sun Niagara T1 and T2 processors [47]. Experiments with our detailed simulator show that supporting additional outstanding cache misses per thread improves performance by 4.6% on average for the configuration and workloads described in Table 5.3 and Table 5.4. Extending our models to multiple outstanding cache misses per thread is beyond the scope of this study and is left for future work.

The Markov chain illustrated in Figure 4.1 has five states labeled from S_0 to S_4, where state S_n corresponds to the case in which n threads are currently suspended from issuing new instructions because of a prior long-latency event such as a cache miss. In state S_0 no thread is suspended (i.e., all threads are ready to issue instructions), whereas S_4 means that all four threads are suspended (i.e., no thread is currently able to issue instructions). In any cycle, if there is at least one thread ready to issue, then we assume one thread will issue an instruction.
Thus, the throughput of an N-thread fine-grained multithreaded core can be modeled as 1 - prob(S_N), where the term prob(S_N) represents the probability of being in S_N (no threads ready; S_4 in Figure 4.1). From each state, only certain transitions are possible since only one thread can execute in a given cycle. The allowed transitions among states for a four-thread fine-grained multithreaded core are shown in Figure 4.1. There are three types of transitions: upstream transitions (illustrated by broken lines, e.g., p_01), downstream transitions (illustrated by solid lines, e.g., p_10), and staying transitions (illustrated by broken lines separated by dots, e.g., p_00). An upstream transition corresponds to S_x -> S_y, where x < y; a downstream transition corresponds to S_x -> S_y, where x > y; and a staying transition does not change the state (Footnote 4). At most one thread can become newly suspended on a cycle, preventing transitions such as S_0 -> S_2, while multiple threads can be reactivated in the same cycle (e.g., one thread finishes a long-latency floating point operation while another thread gets its required data from memory). To determine the state transition probabilities, we first consider homogeneous workloads in Section 4.4.1 and then heterogeneous workloads in Section 4.4.2.

4.4.1 Homogeneous Workloads

In this section we describe how to obtain the state transition probabilities p_ij assuming that all threads have the same program characteristics, so that the probability of being suspended is the same for each thread. To simplify the discussion, we also assume only one type of event can stall a thread. We will relax both assumptions in Section 4.4.2. To begin, we define two parameters: p and M. The term p represents the probability of a thread being suspended. For homogeneous workloads p is the same for all threads and can be calculated as the fraction of instructions causing a thread to be suspended.
The term M represents the latency (in cycles) of the event causing a thread to be suspended. We model the probability of reactivating a suspended thread on any given cycle as 1/M since, of the next M cycles, the suspended thread can only be reactivated in the last one. Thus, the probability that a suspended thread remains suspended on a given cycle is 1 - 1/M = (M-1)/M.

Footnote 4: Although the state is not changed, some events might occur. For example, it is possible that the state remains unchanged while one suspended thread is reactivated and another thread becomes newly suspended.

Table 4.1: Transition probability definitions for a four-thread fine-grained multithreaded architecture. C(n,k) denotes the binomial coefficient; the third column lists how many threads become newly suspended and/or activated.

Upstream | Value | (#susp., #act.)
p_01 | p | (1,0)
p_12 | p((M-1)/M) | (1,0)
p_23 | p((M-1)/M)^2 | (1,0)
p_34 | p((M-1)/M)^3 | (1,0)

Downstream | Value | (#susp., #act.)
p_10 | (1-p)(1/M) | (0,1)
p_20 | (1-p)(1/M)^2 | (0,2)
p_21 | (1-p)C(2,1)(1/M)((M-1)/M) + p(1/M)^2 | (0,1) or (1,2)
p_30 | (1-p)(1/M)^3 | (0,3)
p_31 | (1-p)C(3,2)(1/M)^2((M-1)/M) + p(1/M)^3 | (0,2) or (1,3)
p_32 | (1-p)C(3,1)(1/M)((M-1)/M)^2 + p C(3,2)(1/M)^2((M-1)/M) | (0,1) or (1,2)
p_40 | (1/M)^4 | (0,4)
p_41 | C(4,3)(1/M)^3((M-1)/M) | (0,3)
p_42 | C(4,2)(1/M)^2((M-1)/M)^2 | (0,2)
p_43 | C(4,1)(1/M)((M-1)/M)^3 | (0,1)

Staying | Value | (#susp., #act.)
p_00 | 1-p | (0,0)
p_11 | (1-p)((M-1)/M) + p(1/M) | (0,0) or (1,1)
p_22 | (1-p)((M-1)/M)^2 + p C(2,1)(1/M)((M-1)/M) | (0,0) or (1,1)
p_33 | (1-p)((M-1)/M)^3 + p C(3,1)(1/M)((M-1)/M)^2 | (0,0) or (1,1)
p_44 | ((M-1)/M)^4 | (0,0)

Table 4.1 shows the transition probabilities and lists (in the third column) how many threads become newly suspended and/or activated for each possible transition of a four-thread fine-grained multithreaded architecture. Table 4.2 generalizes them for an N-thread architecture.
Table 4.2: Transition probability definitions for an N-thread fine-grained multithreaded architecture. C(n,k) denotes the binomial coefficient.

Upstream (j > i):
  j = i+1: p_ij = p((M-1)/M)^i; (#susp., #act.) = (1,0)

Downstream (j < i):
  i != N, j != 0: p_ij = (1-p) C(i, i-j) (1/M)^(i-j) ((M-1)/M)^j + p C(i, i-j+1) (1/M)^(i-j+1) ((M-1)/M)^(j-1); (#susp., #act.) = (0, i-j) or (1, i-j+1)
  i != N, j = 0: p_ij = (1-p) (1/M)^i; (#susp., #act.) = (0, i-j)
  i = N: p_ij = C(i, i-j) (1/M)^(i-j) ((M-1)/M)^j; (#susp., #act.) = (0, i-j)

Staying (j = i):
  i != N, i != 0: p_ii = (1-p)((M-1)/M)^i + p C(i,1)(1/M)((M-1)/M)^(i-1); (#susp., #act.) = (0,0) or (1,1)
  i = 0: p_ii = 1-p; (#susp., #act.) = (0,0)
  i = N: p_ii = ((M-1)/M)^i; (#susp., #act.) = (0,0)

For example, consider the upstream transition p_23 (i.e., the probability of S_2 -> S_3), which is modeled as p((M-1)/M)^2: p corresponds to the probability that one thread becomes newly suspended, and ((M-1)/M)^2 corresponds to the probability that both of the threads that have already been suspended remain suspended. The term (1,0) in the third column indicates that one thread becomes newly suspended and no threads are newly activated for this transition.

Some state transitions can be achieved in more than one way. For instance, the downstream transition p_32 (i.e., the probability of S_3 -> S_2) is modeled as (1-p)C(3,1)(1/M)((M-1)/M)^2 + p C(3,2)(1/M)^2((M-1)/M). In the term before the plus sign, (1-p) is the probability that the currently executing thread does not become suspended, and C(3,1)(1/M)((M-1)/M)^2 is the probability that, of the three threads that have already been suspended, one thread is reactivated this cycle (probability 1/M) and the other two remain suspended (probability ((M-1)/M)^2).
Note that C(3,1) = 3 is the number of ways to choose the one thread that is activated out of the three threads. In the term after the plus sign, p is the probability that the currently executing thread becomes newly suspended, and C(3,2)(1/M)^2((M-1)/M) denotes the probability that, of the three threads that have already been suspended, one thread remains suspended and the other two threads are reactivated.

Having modeled the transition probabilities, we construct the transition matrix M_T and find the steady-state probability distribution vector by:

v_s = [prob(S_0) prob(S_1) ... prob(S_N)] = lim_{n->inf} v_i M_T^n = v_s M_T

where M_T = [p_ij] is an (N+1)-by-(N+1) transition matrix created using the p_ij defined in Table 4.2, and v_i = [1 0 ... 0] is the initial state vector (initially, all N threads are ready to issue an instruction). The equation for v_s can be evaluated quickly using standard Markov chain analysis [29].

4.4.2 Heterogeneous Workloads and Multiple Stall Conditions

When the threads running on a fine-grained multithreaded core are different, the probability of being suspended is different for each thread. Moreover, there are usually multiple types of events that can stall a thread, and these events often have different latencies. One way to extend the model described in Section 4.4.1 to heterogeneous workloads is to split each state of the model into multiple states. For example, rather than using a single state S_1 to represent that the number of suspended threads is one, we can use N states to specify which particular thread out of all N threads is suspended. Different probabilities p for each thread are then used for the transitions from S_0 to each of these states. To completely specify which threads are suspended in all situations, we would need 2^N states, making the Markov model considerably more complicated.
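Before turning to that simplification, the homogeneous model of Section 4.4.1 can be sketched end-to-end in a few lines of Python. The code below is an illustration with assumed values of p and M, not the thesis implementation: it builds M_T from the Table 4.2 definitions and evaluates v_s by the power iteration implied by the limit above, then reports 1 - prob(S_N).

```python
from math import comb

def transition_matrix(N, p, M):
    # Build M_T = [p_ij] for an N-thread core from Table 4.2.
    # p: probability that the executing thread becomes suspended.
    # M: event latency in cycles; a suspended thread is reactivated
    #    with probability 1/M on any given cycle.
    r, s = 1.0 / M, (M - 1.0) / M       # reactivate / remain suspended
    T = [[0.0] * (N + 1) for _ in range(N + 1)]
    for i in range(N + 1):
        for j in range(N + 1):
            if i == N and j <= i:
                # No thread is executing in S_N: i - j threads reactivate.
                T[i][j] = comb(i, i - j) * r ** (i - j) * s ** j
            elif i < N and j == i + 1:
                # Upstream: the executing thread suspends, all i
                # suspended threads remain suspended.
                T[i][j] = p * s ** i
            elif i < N and j <= i:
                # Executing thread survives while i - j threads wake,
                stay = (1 - p) * comb(i, i - j) * r ** (i - j) * s ** j
                # or it suspends while i - j + 1 threads wake.
                move = (p * comb(i, i - j + 1) * r ** (i - j + 1)
                        * s ** (j - 1)) if j >= 1 else 0.0
                T[i][j] = stay + move
    return T

def markov_throughput(N, p, M, iters=2000):
    # Power iteration: v_s = lim v_i * M_T^n, starting from [1, 0, ..., 0].
    T = transition_matrix(N, p, M)
    v = [1.0] + [0.0] * N
    for _ in range(iters):
        v = [sum(v[i] * T[i][j] for i in range(N + 1))
             for j in range(N + 1)]
    return 1.0 - v[N]      # the core issues unless all N threads stall

# Hypothetical inputs: four threads, 20% of instructions suspend a
# thread, 10-cycle stall events.
print(markov_throughput(4, 0.2, 10))
```

A quick sanity check is that every row of the matrix sums to one, and that for N = 4 the entries reproduce Table 4.1.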
To tackle the challenge of different thread types without introducing extra states, we instead keep the same Markov chain structure as in the homogeneous workload case and compute a single value p for all the threads as follows: p = sum_{i=1}^{N} p_i w_i, where p denotes the average probability that a thread will be suspended, p_i is the probability that thread i will be suspended, and w_i is a weight factor equal to the number of instructions executed by the thread divided by the total number of instructions executed on the core. Therefore, p can be further expressed as p = sum_{i=1}^{N} (I_i / N_i)(N_i / N_total) = sum_{i=1}^{N} I_i / N_total, where I_i is the number of instructions causing thread i to be suspended, N_i is the number of instructions executed by thread i, and N_total is the total number of instructions executed on the fine-grained multithreaded core.

Given the total number of instructions executed on a core, we first assume that the number of instructions executed by each thread on that core (i.e., N_i) is proportional to the thread's isolated IPC, computed using a simple sum of CPI components model [27]. Then, based upon the estimated N_i for each thread, we use the novel cache contention model described in Section 3.3 to predict its number of extra cache misses due to cache sharing. Next, we add the product of the number of extra misses for each thread and the miss latency to the original number of cycles taken to execute N_i instructions, and then divide the sum by N_i to estimate a new CPI for the thread. Finally, we update N_i for each thread to make it proportional to the thread's new IPC (the reciprocal of its new CPI) and use the updated N_i to obtain p for the cache contention and Markov chain throughput models. Next, we tackle the issue of modeling systems where multiple events may suspend a thread and where each event has a distinct latency.
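The averaging of p over threads reduces to a single division, as the algebra above shows. The sketch below checks the two equivalent forms on made-up per-thread counts; the numbers are hypothetical.

```python
def average_p_weighted(I, N_inst):
    # p = sum_i p_i * w_i, with p_i = I_i / N_i and w_i = N_i / N_total.
    n_total = sum(N_inst)
    return sum((I_i / N_i) * (N_i / n_total)
               for I_i, N_i in zip(I, N_inst))

def average_p_direct(I, N_inst):
    # Equivalent closed form: p = sum_i I_i / N_total.
    return sum(I) / sum(N_inst)

# Hypothetical counts: thread 1 suspends on 10 of its 100 instructions,
# thread 2 on 60 of its 300.
I, N_inst = [10, 60], [100, 300]
print(average_p_weighted(I, N_inst), average_p_direct(I, N_inst))
```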
Similar to finding an average p for all threads, we compute an average M for all the events that might suspend a thread as follows: M = sum_{i=1}^{k} M_i s_i, where k, M_i, and s_i represent the number of different types of events, the latency of event i, and the weight of event i, respectively. The latency of each event suspending a thread is assumed to be known (a simplification in the case of DRAM and interconnect contention). To find the weight of an event, we also need to know how many times the event occurs during execution. The frequency of long-latency floating point instructions can be obtained from the instruction trace used during circular sequence profiling. To find the number of cache misses, the cache contention model described in Section 3.3 is used. The weight s_i of an event is computed as the product of its number of occurrences and its latency, normalized as a fraction of total execution time. For example, if Event1 with 5-cycle latency happens 50 times and Event2 with 100-cycle latency happens 10 times, then the average M is calculated as M = 5 × (50×5)/(50×5 + 10×100) + 100 × (10×100)/(50×5 + 10×100) = 81.

Although our Markov chain model is proposed for modeling fine-grained multithreaded architectures, we believe our model can potentially provide performance predictions for more complex processor cores by considering state transitions to occur only for long-latency L2 cache misses (rather than L1 misses) and by scaling the estimated throughput by a factor accounting for the peak issue width of an SMT core (assuming a well-balanced design [35]) and the memory-level parallelism within a thread. However, a detailed evaluation of this extension is beyond the scope of this thesis.
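Returning to the event-latency averaging above, the worked example (a 5-cycle event occurring 50 times and a 100-cycle event occurring 10 times) can be checked numerically with a short sketch; this is an illustration of the formula, not code from the thesis.

```python
def average_event_latency(events):
    # events: list of (latency_i, occurrences_i) pairs.  The weight s_i
    # of event i is its stall cycles as a fraction of all stall cycles.
    total_stall = sum(lat * cnt for lat, cnt in events)
    return sum(lat * (lat * cnt / total_stall) for lat, cnt in events)

# Example from the text: Event1 = 5 cycles x 50, Event2 = 100 cycles x 10.
print(average_event_latency([(5, 50), (100, 10)]))  # prints 81.0
```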
Chapter 5

Methodology

5.1 Modeling Pending Cache Hits, Data Prefetching, MSHRs

To evaluate our analytical model, we have modified SimpleScalar [7] to simulate the performance loss due to long-latency data cache misses when accounting for a limited number of MSHRs. We compare against a cycle-accurate simulator rather than real hardware to validate our models since a simulator provides insights that would be challenging to obtain without changes to currently deployed superscalar performance counter hardware [20]. We believe the most important factor is comparing two or more competing (hybrid) analytical models against a single detailed simulator, provided the latter captures the behavior one wishes to model analytically. Table 5.1 describes the microarchitectural parameters used in this study.

Table 5.1: Microarchitectural parameters
Machine Width: 4
ROB Size: 256
LSQ Size: 256
L1 D-Cache: 16KB, 32B/line, 4-way, 2-cycle latency
L2 Cache: 128KB, 64B/line, 8-way, 10-cycle latency
Memory Latency: 200 cycles

Note that we focus on predicting only the CPI component for data cache misses using our model. We verified that the CPI contribution due to overlapping miss events is small for our benchmarks with realistic branch prediction and instruction caches [11], so our comparison in Section 6.1 is to a detailed cycle-accurate simulator in which instruction cache misses have the same latency as hits and all branches are predicted perfectly. In the rest of this thesis, we focus on how to accurately predict CPI_D$miss, which is the performance loss due to long-latency data cache misses when both the branch predictor and the instruction cache are ideal (this is the same methodology applied to model CPI_D$miss in [34]).

To evaluate the technique proposed in Section 2.2.3 for estimating CPI_D$miss when a prefetching mechanism is applied, we have applied our modeling techniques to predict the performance benefit of three different prefetching mechanisms: prefetch-on-miss [62], tagged prefetch [25], and stride prefetch [3]. When prefetch-on-miss [62] is applied, an access to a cache block that results in a cache miss will initiate a prefetch for the next sequential block in memory, provided that block is not already in the cache. The tagged prefetch mechanism [25] adds a tag bit to each cache block to indicate whether the block was demand-fetched or prefetched. When a prefetched block is referenced, the next sequential block is prefetched if it is not in the cache. The stride prefetch technique [3] uses a reference prediction table (RPT) to detect address referencing patterns. Each entry in the RPT is assigned a state, and a state machine controls the state of each entry. Whether a prefetch is initiated depends on the current state of the entry [3]. In this study, we modeled a 128-entry, 4-way RPT indexed by the microprocessor's program counter (PC).

To stress our model, we simulate a relatively small L2 cache compared to contemporary microprocessors. We note that the size of the L2 cache we simulated is close to those employed in microprocessors shipped at the time the benchmarks we use were released. The benchmarks chosen are ones from SPEC 2000 [63] and OLDEN [8] that have at least 10 long-latency data cache misses for every 1000 instructions simulated (10 MPKI). Table 5.2 lists the miss rates of these benchmarks and the labels used to represent them in figures. Moreover, for each benchmark, we select 100M representative instructions to simulate using the SimPoint toolkit [61].
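To make the prefetchers concrete, the toy Python model below implements tagged (next-sequential-block) prefetching with a tag bit per block, as described above. It is a simplified sketch with a fully associative array and FIFO replacement, not the modified SimpleScalar used in this study; prefetch-on-miss behaves the same except that a reference to a prefetched block does not trigger a further prefetch.

```python
class TaggedPrefetchCache:
    def __init__(self, num_blocks, block_size):
        self.num_blocks = num_blocks
        self.block_size = block_size
        self.tag_bit = {}     # block number -> True if prefetched and
                              # not yet referenced by a demand access
        self.fifo = []        # simplistic FIFO replacement order

    def _insert(self, block, prefetched):
        if block in self.tag_bit:
            return            # already cached: nothing to do
        if len(self.fifo) == self.num_blocks:
            del self.tag_bit[self.fifo.pop(0)]
        self.tag_bit[block] = prefetched
        self.fifo.append(block)

    def access(self, addr):
        block = addr // self.block_size
        if block not in self.tag_bit:
            # Demand miss: fetch the block and prefetch the next one.
            self._insert(block, prefetched=False)
            self._insert(block + 1, prefetched=True)
            return "miss"
        if self.tag_bit[block]:
            # First reference to a prefetched block: clear the tag bit
            # and prefetch the next sequential block.
            self.tag_bit[block] = False
            self._insert(block + 1, prefetched=True)
            return "prefetched hit"
        return "hit"

cache = TaggedPrefetchCache(num_blocks=8, block_size=32)
# A sequential walk misses once and then rides the prefetches.
print([cache.access(a) for a in (0, 32, 64, 0)])
```

A "prefetched hit" here corresponds to the hit-due-to-a-prefetch case that the profiling in Section 6.1.2 treats specially, since its latency may not be fully hidden.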
5.2 Modeling Cache Contention and Throughput

Table 5.2: Simulated benchmarks
Benchmark | Label | Miss rate | Suite
173.applu | app | 31.1 MPKI | SPEC 2000
179.art | art | 117.1 MPKI | SPEC 2000
183.equake | eqk | 15.9 MPKI | SPEC 2000
189.lucas | luc | 13.1 MPKI | SPEC 2000
171.swim | swm | 23.5 MPKI | SPEC 2000
181.mcf | mcf | 90.1 MPKI | SPEC 2000
em3d | em | 74.7 MPKI | OLDEN
health | hth | 45.7 MPKI | OLDEN
perimeter | prm | 18.7 MPKI | OLDEN
470.lbm | lbm | 17.5 MPKI | SPEC 2006

We evaluated the accuracy of our analytical models in three steps. First, we compared our analytical models against a detailed simulator for a microarchitecture similar to Sun Microsystems' Niagara T1 [37], which we developed by modifying SMTSIM [67]. Table 5.3 shows the microarchitectural parameters simulated. As discussed earlier in Section 4.4, our baseline model allows at most one outstanding miss per thread. We do not enforce cache inclusion between the L1 and the L2 cache (earlier cache contention studies have shown this has minimal effect [9]). Second, we applied our combined analytical cache contention and Markov chain throughput model to obtain two optimized application-specific multithreaded processor designs. Finally, we validated our analytical models by comparing their predictions against a Sun Fire T1000, which has a T1 Niagara processor containing 8 cores, each with 4 thread contexts (32 threads total), and runs the Solaris 10 operating system. We used Shade [15] to collect instruction traces that are later analyzed to obtain inputs to our models. We report the arithmetic mean of the absolute value of error since it reports the largest error number and since we are interested in averaging the error of the IPC prediction (not the average speedup), but we also report the geometric mean and harmonic mean of the absolute error to allay any concern that these numbers might lead to different conclusions.

Table 5.3: Parameters of the simulated architecture
# Cores: 8 in-order cores
# Threads: 32, 4 threads per core
Pipeline: 6 stages
Branch Pred.: 4Kb gShare, 256-entry BTB, private
L1-L2 Interconnect: Crossbar [47]
L1 Data Cache: 16KB, 16B/line, 4-way, LRU, 1-cycle hit latency, private
L1 Inst. Cache: 16KB, 16B/line, 4-way, LRU, 1-cycle hit latency, private
L2 Cache: 3MB banked, 64B/line, 12-way, LRU, 10-cycle hit latency, global
ITLB/DTLB: 64 entry, private
Memory Latency: 110 cycles

Table 5.4: Simulated workloads for each core
Core 1: Workload 1 (C1W1): ammp-applu-art-mcf
Core 2: Workload 1 (C2W1): ammp-applu-art-mcf
Core 3: Workload 2 (C3W2): bzip2-mgrid-swim-equake
Core 4: Workload 2 (C4W2): bzip2-mgrid-swim-equake
Core 5: Workload 3 (C5W3): ammp-art-mgrid-equake
Core 6: Workload 3 (C6W3): ammp-art-mgrid-equake
Core 7: Workload 4 (C7W4): applu-bzip2-mcf-swim
Core 8: Workload 4 (C8W4): applu-bzip2-mcf-swim

When we validated our cache contention model, we used our detailed simulator and hardware performance counters to obtain information about the number of instructions executed by each thread. Then, for each thread, we analyzed its memory instruction trace to obtain the two temporal locality metrics that we proposed in Section 3.2 as well as the circular sequence profile. Next, we predicted the number of extra misses of each thread using the cache contention model described in Section 3.3 and compared the modeled extra misses to the results from our detailed simulator and performance counters. To obtain the actual number of extra misses, we also ran each thread alone to obtain the number of cache misses without cache sharing.
When we validated our Markov chain throughput model, we first approximated the number of instructions executed by each thread as proportional to its isolated IPC (we used performance counters to obtain the isolated IPC when we compared to hardware), given the total number of instructions executed by all threads in a core. Then we applied our cache contention model to predict the number of extra misses for each thread and adjusted each thread's IPC using the product of the number of extra misses and the miss latency. Then we re-estimated the number of instructions executed by each thread based on the thread's refined IPC. Next, we applied the cache contention model again to approximate the number of extra misses for each thread based on its refined instruction count. Finally, we used the refined instruction count and the number of extra misses for each thread as the input to our Markov chain model to estimate the throughput (Footnote 5).

For the comparison against the detailed simulator, we chose 8 benchmarks from the SPEC 2000 benchmark suite [63] to form heterogeneous multiprogrammed workloads for each fine-grained multithreaded core being simulated. Table 5.4 shows the simulated workload of each core for our first study. We ran our detailed simulator such that each core executed at least 400 million instructions. For our hardware comparison, we evaluated homogeneous workloads consisting of a varying number of threads, each running a memory-intensive application (mcf). We also evaluated several heterogeneous workloads consisting of multiple instances of two types of applications to obtain 32 threads in total running concurrently. Specifically, we consider workloads consisting of the following combinations: 16gzip+16eqk (each core runs two gzip and two equake), 16mcf+16gzip, and 16mcf+16art. We compiled these benchmarks on the Sun Fire T1000 using gcc 3.4.3 with -O2 optimization and ran them with their train inputs (mcf and art) and test inputs (gzip and equake).
Footnote 5: We found that additional "iterations" through the model to update the relative IPC of each thread due to cache contention did not improve model accuracy.

To obtain the miss latencies required by our throughput model, we created several simple microbenchmarks (each about 20 lines of C), which we ran on a single thread context. Based upon these separate calibration measurements, we used fixed latencies of 20 cycles, 220 cycles, and 50 cycles for an L1 cache miss that hits in the L2 cache, an L2 cache miss, and a floating point instruction, respectively (these are higher than reported in [59]).

Chapter 6

Experimental Results

6.1 Modeling Pending Cache Hits, Data Prefetching, MSHRs

This section summarizes our experimental results.

6.1.1 Modeling Pending Data Cache Hits

Section 2.1 describes prior proposals for compensating for the overestimation of modeled penalty cycles per serialized miss using a fixed number of cycles. Figure 6.1(a) and Figure 6.1(b) illustrate the modeled results after compensation with constant cycles without, and with, the pending hit compensation technique described in Section 2.2.1, respectively. In these two figures, we show results using five different constant compensation factors. The first bar (oldest) corresponds to the assumption that an instruction that misses in the cache is always the oldest one in the instruction window when it issues (accesses the first-level cache). The second bar (1/4) corresponds to the assumption that there are always (1/4) × ROB_size = 64 in-flight instructions older than a cache miss when it issues, and it is similar to the next two bars, (1/2) and (3/4). The fifth bar (youngest) corresponds to the assumption that there are always ROB_size - 1 older instructions in the window when the instruction issues (i.e., the instruction is always the youngest one in the window when it issues).
The last bar (actual) shows the simulated penalty cycles per cache miss from cycle-accurate simulation. From this data, we observe that no one fixed-cycle compensation method performed consistently best across the benchmarks we studied. For example, in Figure 6.1(a) we observe that error is minimized using "youngest" for app, art, luc, swm, and lbm, but minimized using "oldest" for em, mcf, and hth, while eqk and prm require something in between. The harmonic mean for each fixed-cycle compensation method is also shown, and we notice that, because positive and negative errors cancel out, the harmonic means of some fixed-cycle compensation methods appear close to the detailed simulation results. However, it is important to recognize that their accuracy on individual benchmarks is quite poor. Using the fixed-cycle compensation method, we find that the smallest arithmetic mean of absolute error is 43.5% when not modeling pending hits and 26.9% when modeling pending hits, both resulting when "youngest" compensation is employed.

[Figure 6.1: Penalty cycles per miss with fixed number of cycles compensation for plain profiling (unlimited MSHRs). Panel (a): not modeling pending data cache hits; panel (b): modeling pending data cache hits. Each panel plots penalty cycles per miss for the oldest, 1/4, 1/2, 3/4, youngest, and actual bars across the benchmarks and their harmonic mean (HM).]

To account for the distinct behavior of each benchmark, we use the average distance between two consecutive cache misses to compensate for the overestimation of the modeled extra cycles due to long-latency data cache misses, as described in Section 2.2.2.
Figure 6.2(a) compares the CPI_D$miss for both the plain profiling technique described in Section 2.1 and the start-with-a-miss (SWAM) profiling technique described in Section 2.2.5 (with pending hits modeled) to the results from detailed simulation. The first bar (Plain w/o comp) and the third bar (SWAM w/o comp) correspond to the modeled results without any compensation; the second bar (Plain w/ comp) and the fourth bar (SWAM w/ comp) are the modeled results with the compensation technique described in Section 2.2.2. Figure 6.2(a) and Figure 6.2(b) show that for benchmarks with heavy pointer chasing, such as mcf, em3d, and hth, ignoring the effects of pending data cache hits results in a dramatic underestimate of CPI_D$miss. As discussed in Section 2.2.1, the reason for this is that many data-independent misses are connected by pending cache hits, which must be appropriately modeled. Moreover, as we expect, SWAM profiling is more accurate than plain profiling since it can capture more overlapping data cache misses.

Figure 6.2(b) illustrates the error of each modeling technique after compensation. From Figure 6.2(b) we observe that the arithmetic mean of the absolute error (mean) decreases from 39.7% to 29.3% when modeling pending cache hits. We also observe from Figure 6.2(b) that the arithmetic mean of the absolute error for SWAM profiling when pending data cache hits are modeled (SWAM w/ PH) is about 3.9 times lower than for plain profiling when pending hits are not modeled (Plain w/o PH): the arithmetic mean of the absolute error decreases from 39.7% to 10.3%. The geometric mean of the absolute error decreases from 26.4% to 8.2%, and the harmonic mean of the absolute error decreases from 15.3% to 6.9%.
Accuracy also improves, and not just for "micro-benchmarks" [72]: in Figure 6.2(b), comparing "Plain w/o PH" to "SWAM w/ PH", we find that, on average, the arithmetic mean of the absolute error decreases from 31.6% to 9.1% for the five SPEC 2000 benchmarks excluding mcf.

[Figure 6.2: CPI due to D$miss and modeling error for different profiling techniques (unlimited MSHRs). Panel (a) plots the CPI due to D$miss for each benchmark; panel (b) plots the modeling error (%) of Plain w/o PH, Plain w/ PH, and SWAM w/ PH across the benchmarks and their mean.]

6.1.2 Modeling Different Prefetching Techniques

[Figure 6.3: CPI due to D$miss and modeling error while the prefetch-on-miss (POM), tagged prefetch (Tag), or stride prefetch (Stride) technique is applied. Panel (a) plots the CPI due to D$miss with each prefetching technique; panel (b) plots the modeling error (%) of each technique with (w/ PH) and without (w/o PH) pending hits modeled.]

In this section, we evaluate CPI_D$miss when modeling the three prefetching techniques mentioned in Section 5.1 (with unlimited MSHRs). Figure 6.3(a) compares the actual CPI_D$miss to the modeled one for the three prefetching methods. For each prefetching method, both the prediction
when each pending hit is analyzed according to the algorithm described in Figure 2.5 (labeled "w/ PH") and the prediction when pending hits are treated as normal hits (labeled "w/o PH") are shown. We use SWAM in both cases. When employing the algorithm in Figure 2.5, we apply SWAM as follows: when we analyze the trace, we let each profile step start with a miss or with a hit due to a prefetch. The latter refers to a demand request whose data was brought into the data cache by a previous prefetch (we start with it since its latency may not be fully hidden and thus it may stall commit). Figure 6.3(b) shows the error of the model for each benchmark. From Figure 6.3(b) we observe that if pending hits are not appropriately modeled (i.e., a pending hit is simply treated as a hit and not analyzed based upon the algorithm in Figure 2.5), the modeled CPI_D$miss always underestimates the actual CPI_D$miss. The reason is that, with a prefetching technique applied, a large fraction of the misses occurring when there is no prefetching become pending hits, since prefetches generated by that prefetching technique cannot fully hide the memory access latency of those misses. By using the method of analyzing pending hits that we propose in Section 2.2.3 to model prefetching, the arithmetic mean of the absolute error decreases from 22.2% to 10.7% for prefetch-on-miss, from 56.4% to 9.4% for the tagged prefetch technique, and from 72.9% to 21.3% for the stride prefetch technique (i.e., the arithmetic mean of the absolute error decreases from 50.5% to 13.8% overall for the three data prefetching techniques modeled).

6.1.3 Modeling Limited Number of MSHRs

All of the results we have seen thus far are for modeling a processor with an unlimited number of MSHRs. This section compares the modeled CPI_D$miss when the number of available MSHRs is limited.
Figure 6.4(a), (b), and (c) compare the modeled CPI_D$miss to the simulated results when the maximum number of MSHRs in the processor is sixteen, eight, and four, respectively. We show data for eight MSHRs and four MSHRs since we note that Prescott has only eight MSHRs [6] and Willamette has only four MSHRs [28].

[Figure 6.4: CPI due to D$miss for N_MSHR = 16, N_MSHR = 8, and N_MSHR = 4. Bars: Plain w/o MSHR, Plain w/ MSHR, SWAM, SWAM-MLP, actual.]

[Figure 6.5: Error of the modeled CPI due to D$miss for N_MSHR = 16, N_MSHR = 8, and N_MSHR = 4.]

For each benchmark, the first bar (Plain w/o MSHR) shows the modeled CPI_D$miss from plain profiling (which is not aware that there is a limited number of MSHRs and therefore always provides the same result for each benchmark), and the second bar (Plain w/ MSHR) shows the modeled CPI_D$miss from plain profiling with the technique for modeling a limited number of MSHRs (Section 2.2.4) included.
The third and fourth bars show the modeled CPI_D$miss from SWAM (Section 2.2.5) and SWAM-MLP (Section 2.2.5), respectively. For all four profiling techniques, pending hits are modeled using the method described in Section 2.2.1. The modeling error based on the data in Figure 6.4(a)–(c) is illustrated in Figure 6.5(a)–(c).

SWAM-MLP is consistently better than SWAM. We observe that as the total number of MSHRs decreases, the advantage of SWAM-MLP over SWAM becomes significant, especially for eqk, mcf, em, and hth, for which data dependences among cache misses are more likely, affecting the size of the profiling window that SWAM-MLP chooses. SWAM decreases the arithmetic mean of the absolute error from 32.6% (Plain w/o MSHR) to 9.8%, from 32.4% to 12.8%, and from 35.8% to 23.2%, when the number of MSHRs is sixteen, eight, and four, respectively.⁶ SWAM-MLP further decreases the error to 9.3%, 9.2%, and 9.9%⁷ (i.e., SWAM-MLP decreases the error of plain profiling (Plain w/o MSHR) from 33.6% to 9.5% when the number of MSHRs is limited). Note that accuracy improves not only for pointer-chasing benchmarks: for the SPEC 2000 benchmarks excluding mcf, the average error is reduced from 48.1% to 7.0% when comparing Plain w/o MSHR to SWAM-MLP.

6.1.4 Putting It All Together

We also evaluated the combination of the techniques for modeling prefetching (Section 2.2.3) and SWAM-MLP to model the performance of the three prefetching methods with a limited number of MSHRs. On average, the error of modeling prefetching is 15.2%, 17.7%, and 20.5% when the number of MSHRs is sixteen, eight, and four, respectively (an average of 17.8% across all three prefetching methods).

6.1.5 Speedup of the Hybrid Analytical Model

One of the most important advantages of the hybrid analytical model presented in this thesis over detailed simulation is its speed of analysis.
On average, our model is 150, 156, 170, and 229 times faster than the detailed simulator when the number of MSHRs is unlimited, sixteen, eight, and four, respectively, with a minimum speedup of 91×. Moreover, for estimating the performance impact of prefetching, our model is on average 184, 185, 215, and 327 times faster than the detailed simulator when the number of MSHRs is unlimited, sixteen, eight, and four, respectively, with a minimum speedup of 87×. These speedups were measured on a 2.33 GHz Intel Xeon E5345 processor.

⁶ Geometric mean is reduced from 19.4% to 7.4%, from 21.5% to 9.7%, and from 21.8% to 10.9%; harmonic mean is reduced from 8.5% to 5.8%, from 14.5% to 7.0%, and from 10.2% to 5.1%.

⁷ Geometric mean of the absolute error is 6.5%, 6.7%, and 5.2%, and harmonic mean of the absolute error is 4.6%, 5.2%, and 3.3%, when the number of MSHRs is sixteen, eight, and four, respectively.

6.2 Modeling Cache Contention and Throughput

In this section we evaluate our analytical cache contention model and our throughput models for fine-grained multithreaded architectures against both a cycle-accurate performance simulator and real hardware.

6.2.1 Model Accuracy Evaluation

Figures 6.6(a) and 6.6(b) illustrate the impact of cache sharing on the number of extra misses for our L1 data cache and L2 cache, respectively, when the heterogeneous workloads described in Table 5.4 are simulated. For each thread, we plot the ratio of the number of extra misses due to cache contention to the number of misses without cache contention (i.e., when the thread runs alone). Figures 6.6(a) and 6.6(b) show that our cache contention model accurately predicts the number of additional cache misses due to cache sharing across varying workloads. We find that the modeling error is less significant for L2 misses, which also have a higher impact on performance than L1 misses.
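The per-thread quantity plotted in Figure 6.6, and the relative error reported for it, can be sketched as follows. This is a hedged illustration, not the evaluation code; the counts follow the T1-on-Core-1 L1 example discussed later in this section.

```python
# Hedged sketch of the evaluation metric used in this section: for each
# thread we plot (extra misses due to cache contention) / (misses when the
# thread runs alone), and report the relative error of the model's
# predicted extra misses.
misses_alone = 6_153_000    # L1 misses when the thread runs alone (~6153K)
misses_shared = 6_250_000   # L1 misses with cache sharing (~97K extra)

extra_simulated = misses_shared - misses_alone
ratio_simulated = extra_simulated / misses_alone   # quantity plotted (~1.6%)

extra_predicted = 242_000                          # model's predicted extras
rel_error = (extra_predicted - extra_simulated) / extra_simulated  # ~150%
```

Because the denominator is the count of extra misses, a thread with very few extra misses can show a large relative error even when the absolute prediction is close, which is exactly the L1 behavior discussed next.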
For the L1 cache, the arithmetic mean of the absolute error of our model is 40.3%, the geometric mean is 12.1%, the harmonic mean is 0.9%, and the maximum absolute error is 150.0%. For the L2 cache, the arithmetic mean of the absolute error is 4.4%, the geometric mean is 1.1%, the harmonic mean is 0.2%, and the maximum absolute error is 20.3%.

[Figure 6.6: Ratio of the number of extra L1 and L2 cache misses due to cache contention to the number of misses when each thread runs alone. The order of the threads in each core is shown in Table 5.4. (a) L1 data cache; (b) L2 cache.]

Figure 6.7 shows a scatter plot of the predicted and the simulated ratio pairs. Each data point in this figure corresponds to a pair of bars in Figure 6.6(a) or Figure 6.6(b). From Figure 6.7 we notice that our predicted values are highly correlated with the simulated ones. The correlation coefficients between the predicted and the simulated ratios for the L1 and L2 caches are 0.9993 and 0.9999, respectively. We noticed that the average and maximum error for the L1 cache is higher than for the L2 cache, mainly because the number of extra L1 misses due to cache contention is much smaller than the number of extra L2 misses. For example, the maximum error of our model in predicting extra L1 misses comes from T1 on Core 1, where the number of L1 misses is about 6153K without cache contention and increases by only 97K due to cache contention. For this thread, our model predicts the number of extra misses to be 242K, resulting in a 150% error.
However, this error does not noticeably impact the accuracy of the Markov chain throughput model, since the total number of extra L1 misses is quite small compared to the number of L1 misses without cache contention (i.e., the number of L1 misses increases by only 1.6%).

[Figure 6.7: Scatter plot of the predicted and the simulated cache-miss increase pairs for the L1 and L2 caches. Extra L2 misses are more significant to performance than extra L1 misses.]

As mentioned in Section 3.3, the probability of a miss being introduced into a circular sequence is a nonlinear function of the distance r. Hence the probability of a miss is best computed by classifying circular sequences into groups based upon r. We classify circular sequences cseq(d, n, r) into groups based upon r and calculate, for each group, the probability that the last access of a circular sequence from the group still hits in the cache when the cache is shared (i.e., prob_H(d, g) described in Section 3.3), using the average r of all the circular sequences in the group. The accuracy of the cache contention model proposed in Section 3.3 is therefore affected by the number of groups chosen to cluster circular sequences. Figure 6.8(a) shows the arithmetic mean of the absolute error of our cache contention model over all applications as the number of groups used to classify circular sequences varies from one to twenty. When there is a single group, it includes all circular sequences. When there are two
groups, the first group corresponds to all the circular sequences with 1 < r < 2^5, and the second group contains all other circular sequences. When there are n groups (with n > 2), the first group corresponds to all the circular sequences with 1 < r < 2^5, the next n − 2 groups correspond to 2^i ≤ r < 2^(i+1) for i ∈ {5, 6, ..., n + 2}, and the last group corresponds to all the circular sequences with r ≥ 2^(n+3).

[Figure 6.8: Impact of the number of groups used to classify circular sequences on the accuracy of our cache contention model. (a) Arithmetic mean of the absolute errors for different numbers of groups; (b) errors for different numbers of groups for equake running on Core 5.]

It is interesting to note from Figure 6.8(a) that the accuracy of the cache contention model does not improve monotonically as the number of groups increases. To better understand this seemingly counter-intuitive result, Figure 6.8(b) shows the error of our cache contention model for equake running on Core 5 (see Table 5.4), which serves as a good representative of the other applications. In Figure 6.8(b) we observe that, when the number of groups used is less than or equal to six, the errors of our cache contention model are significantly negative, indicating that the estimated numbers of extra cache misses due to cache contention are much smaller than the simulated results. Moreover, the errors become significantly positive when the number of groups is between seven and eleven inclusive, meaning that the estimated numbers of extra cache misses are much larger than the simulated counterparts.
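The grouping boundaries just described can be written down directly. A minimal sketch (the function name and structure are ours, not the thesis code) mapping a circular-sequence distance r to a group index when n groups are used:

```python
# Hedged sketch: group assignment for circular sequences by distance r.
# Group 0 holds 1 < r < 2^5; groups 1 .. n-2 hold 2^i <= r < 2^(i+1) for
# i = 5 .. n+2; the last group (index n-1) holds r >= 2^(n+3).

def group_of(r, n=12):
    if r < 2 ** 5:
        return 0
    for i in range(5, n + 3):        # i = 5, 6, ..., n + 2
        if r < 2 ** (i + 1):
            return i - 4             # groups 1 .. n-2
    return n - 1                     # r >= 2^(n+3)
```

With n = 12, the setting adopted in this study, the middle groups bin r in powers of two from 2^5 up to 2^15, and the last group collects all r ≥ 2^15.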
When the number of groups used to categorize circular sequences is less than or equal to six, we notice that the average distance calculated for all the circular sequences belonging to the last group (e.g., the fifth group if only five groups are used) is biased by the circular sequences with relatively small r, resulting in an underestimated probability that the last access of a circular sequence in the last group becomes a miss due to cache contention (i.e., an underestimated 1 − prob_H(d, g)). On the other hand, when the number of groups is between seven and eleven, the average distance calculated for the last group is biased by the circular sequences with relatively large r, leading to an overestimated 1 − prob_H(d, g). When the number of groups is greater than or equal to twelve, the error decreases and becomes stable, indicating that the number of groups is sufficiently large to effectively separate circular sequences with significantly different r. In our study, we used twelve groups to separate circular sequences; the range of r associated with each group is described in Section 3.3.

Figure 6.9(a) compares the throughput on each core predicted by the different models described in Chapter 4 to the simulated throughput, and Figure 6.9(b) shows the error of each model. The first bar ("sumCYCLE") corresponds to the model described in Section 4.1, which always underestimates the actual throughput (with an arithmetic mean of the absolute error of 58%) since it assumes that the execution of each thread on a core is completely serialized. The second bar ("sumIPC") is obtained using the model described in Section 4.2, which always overestimates the actual throughput (with an error of 55% on average). The third bar ("Bernoulli") represents the Bernoulli model
described in Section 4.3, which is more accurate than the previous two models (with an average error of 23%), but whose error is still significant for some workloads (e.g., 54% for Core 1 Workload 1) since it does not take cache contention into account. Finally, the fourth bar ("MC") represents the Markov chain model proposed in Section 4.4 combined with the cache contention model proposed in Section 3.3; it predicts the throughput significantly more accurately, with an arithmetic mean of the absolute error over the eight cores of 7.9% and a maximum error of 14.3%.

[Figure 6.9: Predicted throughput and error of the different models compared to the simulated throughput. (a) Throughput (IPC); (b) error of the different models. Bars: sumCYCLE, sumIPC, Bernoulli, MC.]

[Figure 6.10: Average number of distinct cache sets accessed during a given number of consecutive memory instructions for a 12-way, 64B-line, 4096-set LRU L2 cache, from both assuming a binomial distribution of cache blocks among sets (Binomial) and analyzing program traces (Trace). (a) ammp; (b) mgrid; (c) swim.]

6.2.2 Modeling S(x) and b(i, x) Using Binomial Distributions

Prior work [2, 64] proposes analytical models for the cache contention of processes due to context switching in a time-sharing system.
These models assume that all cache sets are equally likely to be accessed by a process and that the number of cache blocks accessed in a set follows a binomial distribution. Appendix A illustrates that applying these earlier models in the limiting case of fine-grained, cycle-by-cycle switching does not work (no cache contention is predicted). Our analytical cache contention model combines statistics of individual thread behavior to predict the performance when threads run together. It uses the functions S(x) and b(i, x), which we have so far assumed are profiled directly. In this section, we show how to use the binomial-distribution assumption of the earlier analytical cache contention models to model both S(x) and b(i, x), and we quantify the impact this approach has on the accuracy of the model. The advantage of the approach in this section versus directly measuring S(x) and b(i, x) is that it enables us to profile an application only once even if we vary the number of sets or the associativity of the cache.

[Figure 6.11: Probability distribution of the number of distinct blocks accessed in a cache set for mcf on a 12-way, 64B-line, 3 MB cache, from both assuming a binomial distribution of cache blocks among sets (Binomial) and analyzing program traces (Trace). (a) b(i, x = 1000); (b) b(i, x = 10000); (c) b(i, x = 100000).]

Recall that b(i, x) denotes the probability that the number of unique cache blocks accessed in a cache set is i during x consecutive memory references, given that the cache set is
accessed. (In Figure 6.11, the horizontal axis is the value of i, and x is the measurement interval counted in consecutive memory instructions.) To analytically model b(i, x) for a thread, we need to know the number of unique cache blocks requested by the thread during the x memory references (i.e., the thread's cache footprint, as it is called in [64]). This information can be collected by analyzing the thread's program trace, and we denote it by U(x) hereafter. After obtaining U(x), we can model b(i, x) based upon the assumption of a binomial distribution of cache blocks among cache sets. If we use a random variable Z to represent the number of unique cache blocks accessed in a cache set, then the probability that Z is equal to any integer between 0 and U(x) can be modeled as follows:

P[Z = i] = C(U(x), i) · p^i · (1 − p)^(U(x)−i),  for i = 0, 1, 2, ..., U(x),

where C(U(x), i) is the binomial coefficient "U(x) choose i", p = 1/N, and N is the total number of cache sets. The intuition behind the above formula is that the assignment of cache blocks among cache sets is approximated as a binomial process with U(x) trials, each with probability of success p (i.e., the probability of choosing one set of interest out of the N cache sets). Then b(i, x) can be modeled as

b(i, x) = C(U(x), i) · p^i · (1 − p)^(U(x)−i)  for 1 ≤ i ...

Appendix A. Limiting Case of Time-Sharing

... i > 1 is impossible since we assume interleaving by each access). Equation 3.4 from Thiebaut and Stone gives P[Y = 0] = (N − 1)/N and P[Y = 1] = 1/N (Y = i with i > 1 is impossible since we assume interleaving by each access). Thus Equation 3.6 from Thiebaut and Stone yields

P[Z = 0] = P[X = 0] · P[0 ≤ Y ≤ K] + P[X = 1] · P[0 ≤ Y ≤ K − 1] = ((N − 1)/N × 1) + ((1/N) × 1) = 1,

and subsequently Equation 3.7 from Thiebaut and Stone yields P[Z = 1] = P[X = 1] · P[Y = K] = 0 (since P[Y = K] = 0 for K > 1 in a set-associative cache).
Then Equation 3.8 from Thiebaut and Stone yields X̄ = 1/N, and Equation 3.9 from Thiebaut and Stone yields Z̄ = 0. Consequently, Equation 3.10 from Thiebaut and Stone shows that the average reload transient equals F_A − (N × X̄ − N × Z̄) = 1 − (N × (1/N) − N × 0) = 0. Thus no cache contention is predicted in the limit of fine-grained, cycle-by-cycle interleaving.
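The arithmetic of this limiting case is easy to check numerically. A hedged sketch, with assumed example values for the number of sets N and the associativity K; the equation numbers refer to Thiebaut and Stone's model, as in the text:

```python
# Hedged numeric check of the limiting-case derivation above: Thiebaut and
# Stone's reload-transient equations specialized to cycle-by-cycle
# interleaving.  N (number of cache sets) and K (associativity, K > 1) are
# assumed example values.
N, K = 64, 4

# Each single interleaved access touches exactly one of the N sets, so the
# per-interval footprints X and Y are 0 or 1 (Equation 3.4 gives the Y
# case; X is symmetric).
P_X = {0: (N - 1) / N, 1: 1 / N}
P_Y = {0: (N - 1) / N, 1: 1 / N}

# Equations 3.6/3.7: one access can never fill all K ways of a set, so
# P[0 <= Y <= K] = P[0 <= Y <= K-1] = 1 and P[Y = K] = 0.
P_Z0 = P_X[0] * 1.0 + P_X[1] * 1.0
P_Z1 = P_X[1] * 0.0

# Equations 3.8-3.10: the average reload transient vanishes.
X_bar = 1 / N
Z_bar = 0.0
reload_transient = 1 - (N * X_bar - N * Z_bar)
```

As the text concludes, the predicted reload transient is exactly zero, i.e., these earlier models predict no cache contention at all in this limit.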