Analytical Modeling of Modern Microprocessor Performance by Xi Chen B.A.Sc., The University of British Columbia, 2006 A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF APPLIED SCIENCE in The Faculty of Graduate Studies (Electrical and Computer Engineering) THE UNIVERSITY OF BRITISH COLUMBIA (Vancouver) June, 2009 c Xi Chen 2009 ° Abstract As the number of transistors integrated on a chip continues to increase, a growing challenge is accurately modeling performance in the early stages of processor design. Analytical modeling is an alternative to detailed simulation with the potential to shorten the development cycle and provide additional insight. This thesis proposes hybrid analytical models to predict the impact of pending cache hits, hardware prefetching, and realistic miss status holding register (MSHR) resources on superscalar performance. We propose techniques to model the non-negligible influences of pending hits and the fine-grained selection of instruction profile window blocks on the accuracy of hybrid analytical models. We also present techniques to estimate the performance impact of data prefetching by modeling the timeliness of prefetches and to account for a limited number of MSHRs by restricting the size of profile window blocks. As with earlier hybrid analytical models, our approach is roughly two orders of magnitude faster than detailed simulations. Overall, our techniques reduce the error of our baseline from 39.7% to 10.3% when the number of MSHRs is unlimited. When modeling a processor with data prefetching, a limited number of MSHRs, or both, our techniques result in an average error of 13.8%, 9.5% and 17.8%, respectively. Moreover, this thesis proposes analytical models for predicting the cache contention and throughput of heavily fine-grained multithreaded architectures such as Sun Microsystems’ Niagara. We first propose a novel probabilistic model using statistics characterizing individual threads run in isolation as inputs to accurately predict the number of extra cache misses due to cache contention among a large number of threads. We then present a Markov chain model for analytically estimating the throughput of multicore, fine-grained multithreaded architectures. Combined, the two models accurately predict system throughput obtained from a detailed simulator with an average error of 8.3% for various cache configurations. We also show that our models can find the same optimized design point of fine-grained multithreaded chip multiprocessors for application-specific workloads 65 times faster than detailed simulations. Furthermore, we show that our models accurately predict cache contention and throughput trends across varying workloads on real hardware—a Sun Fire T1000 server. ii Table of Contents Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii Table of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii List of Tables vi . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 Introduction . . . . . . . . . . . . . . . . . . . . . . 1.1 Motivation . . . . . . . . . . . . . . . . . . . . . 1.2 Contributions . . . . . . . . . . . . . . . . . . . 1.3 Background . . . . . . . . . . . . . . . . . . . . 1.3.1 Superscalar Pipeline . . . . . . . . . . . . 1.3.2 Cache Memory . . . . . . . . . . . . . . . 1.3.3 Data Prefetching . . . . . . . . . . . . . 1.3.4 Fine-grained Multithreaded Architectures 1.4 Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 Hybrid Analytical Modeling of Pending Cache Hits, MSHRs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.1 First-Order Processor Model . . . . . . . . . . . . . . . . 2.2 Modeling Long Latency Memory System . . . . . . . . . 2.2.1 Modeling Pending Data Cache Hits . . . . . . . . 2.2.2 Accurate Exposed Miss Penalty Compensation . . 2.2.3 Modeling Data Prefetching . . . . . . . . . . . . . 2.2.4 Modeling a Limited Number of MSHRs . . . . . . 2.2.5 Profiling Window Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix . 1 . 2 . 4 . 6 . 6 . 8 . 9 . 9 . 10 Prefetching, and . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 12 16 16 19 21 25 26 3 Modeling Cache Contention Among Many Threads . . . . . . . . . . . . . . . . . 29 3.1 Probabilistic Cache Contention Models . . . . . . . . . . . . . . . . . . . . . . . . . 29 iii Table of Contents 3.2 3.3 New Locality Metrics For Modeling Cache Contention . . . . . . . . . . . . . . . . . 33 Accurately Modeling Cache Contention with Many Threads . . . . . . . . . . . . . 35 4 Modeling Fine-Grained Multithreaded Throughput . . . . . 4.1 Sum of Cycles Model . . . . . . . . . . . . . . . . . . . . . . . 4.2 Sum of IPCs Model . . . . . . . . . . . . . . . . . . . . . . . . 4.3 Bernoulli Model . . . . . . . . . . . . . . . . . . . . . . . . . . 4.4 A Markov Chain Model . . . . . . . . . . . . . . . . . . . . . . 4.4.1 Homogeneous Workloads . . . . . . . . . . . . . . . . . 4.4.2 Heterogeneous Workloads and Multiple Stall Conditions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 43 44 44 46 47 50 5 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53 5.1 Modeling Pending Cache Hits, Data Prefetching, MSHRs . . . . . . . . . . . . . . . 53 5.2 Modeling Cache Contention and Throughput . . . . . . . . . . . . . . . . . . . . . . 55 6 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6.1 Modeling Pending Cache Hits, Data Prefetching, MSHRs . . . . . . . . . . 6.1.1 Modeling Pending Data Cache Hits . . . . . . . . . . . . . . . . . . 6.1.2 Modeling Different Prefetching Techniques . . . . . . . . . . . . . . 6.1.3 Modeling Limited Number of MSHRs . . . . . . . . . . . . . . . . . 6.1.4 Putting It All Together . . . . . . . . . . . . . . . . . . . . . . . . . 6.1.5 Speedup of the Hybrid Analytical Model . . . . . . . . . . . . . . . 6.2 Modeling Cache Contention and Throughput . . . . . . . . . . . . . . . . . 6.2.1 Model Accuracy Evaluation . . . . . . . . . . . . . . . . . . . . . . 6.2.2 Modeling S(x) and b(i, x) Using Binomial Distributions . . . . . . . 6.2.3 Modeling b(i, x) Using An Inductive Probability Function . . . . . 6.2.4 Sensitivity Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6.2.5 Case Study: Optimizing Threads Per Core for Application-Specific loads . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6.2.6 Hardware Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Work. . . . . . . . . . . . . . . . . . . . 7 Related Work . . . . . . . . . . . . . . . . . . . . . . 7.1 Empirical Models . . . . . . . . . . . . . . . . . . 7.2 Statistical Simulation . . . . . . . . . . . . . . . . 7.3 Analytical Models for Superscalar Microprocessors 7.4 Cache Contention Models . . . . . . . . . . . . . . 7.5 Throughput Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59 59 59 63 64 66 66 67 67 73 77 79 . 81 . 83 86 86 87 87 88 89 iv Table of Contents 8 Conclusions and Future Work . . . . . . . . . . . . . . . . . . . . . . . . 8.1 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8.2.1 Improving the Applicability of Our Hybrid Analytical Models . 8.2.2 Analytical Modeling of Simultaneous Multithreading Processors 8.2.3 Analytical Modeling of Chip Multiprocessors . . . . . . . . . . . 8.2.4 Analytical Modeling of DRAM Scheduling Policies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90 90 91 91 91 92 92 Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94 Appendix A Limiting Case of Time-Sharing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104 v List of Tables 4.1 4.2 Transition probability definitions for a four-thread fine-grained multithreaded architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48 Transition probability definitions for an N-thread fine-grained multithreaded architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49 5.1 5.2 5.3 5.4 Microarchitectural parameters . . . . . . Simulated benchmarks . . . . . . . . . . Parameters of the simulated architecture Simulated workloads for each core . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54 55 56 56 6.1 Modeling errors for various L2 cache configurations versus detailed simulation . . . . 81 vi List of Figures 1.1 1.2 1.3 1.4 1.5 Time to simulate 1 second runtime of a large-scale chip multiprocessor . . . . . A superscalar microprocessor pipeline . . . . . . . . . . . . . . . . . . . . . . . A processor with two levels of cache . . . . . . . . . . . . . . . . . . . . . . . . Instruction interleaving in a four-thread fine-grained multithreaded architecture Hiding memory latency in a four-thread fine-grained multithreaded architecture 2.1 2.2 2.3 2.4 2.5 Useful instructions issued per cycle (IPC) over time . . . . . . . . . . . . . . . . . . Profiling examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . An example showing how two data independent misses are connected by a pending hit Impact of pending data cache hit latency on CP ID$miss . . . . . . . . . . . . . . . . Algorithm for analyzing a pending hit in an instruction trace when a prefetching mechanism is applied . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . An example motivating Figure 2.5 part B . . . . . . . . . . . . . . . . . . . . . . . . An example explaining Figure 2.5 part C . . . . . . . . . . . . . . . . . . . . . . . . An example showing profiling with ROBsize = 8 and NM SHR = 4 . . . . . . . . . . . An example comparing plain profiling and SWAM profiling with ROBsize = 8 . . . . 2.6 2.7 2.8 2.9 3.1 3.2 . . . 3 . . . 7 . . . 8 . . . 10 [37] 10 13 15 17 18 20 22 24 26 27 3.5 An example of circular sequences in a set of a four-way associative LRU cache . . . Extra L2 cache misses due to cache contention when multiple copies of mcf are running in parallel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Average number of distinct cache sets accessed during a given number of consecutive memory instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Probability distribution of the number of distinct blocks being accessed in a cache set for mcf . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Explanation of each term of probH (2, g) . . . . . . . . . . . . . . . . . . . . . . . . 4.1 A Markov chain model for a four-thread fine-grained multithreaded architecture . . . 45 6.1 Penalty cycles per miss with fixed (Unlimited MSHRs) . . . . . . . CPI due to D$miss and modeling MSHRs) . . . . . . . . . . . . . . 3.3 3.4 6.2 . 30 . 30 . 32 . 34 . 42 number of cycles compensation for plain profiling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60 error for different profiling techniques (unlimited . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62 vii List of Figures 6.3 6.4 6.5 6.6 6.7 6.8 6.9 6.10 6.11 6.12 6.13 6.14 6.15 6.16 6.17 6.18 CPI due to D$miss and modeling error while prefetch-on-miss (POM), tagged prefetch (Tag), or stride prefetch (Stride) technique is applied. . . . . . . . . . . . . . . . . . CPI due to D$miss for NM SHR = 16, NM SHR = 8, and NM SHR = 4. . . . . . . . . . Error of the modeled CP ID$miss for NM SHR = 16, NM SHR = 8, and NM SHR = 4. . Ratio of the number of extra L1 and L2 cache misses due to cache contention to the number of misses when each thread runs alone . . . . . . . . . . . . . . . . . . . . . Scatter plot for the predicted and the simulated cache misses increase pair . . . . . . Impact of the number of groups used to classify circular sequences on the accuracy of our cache contention model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Predicted throughput and error from different models compared to the simulated throughput . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Average distinct number of cache sets accessed during a given number of consecutive memory instructions for a 12-way, 64B line, 4096-set LRU L2 cache, from both assuming binomial distribution of cache blocks among sets (Binomial) and analyzing program traces (Trace) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Probability distribution of the number of distinct blocks being accessed in a cache set for mcf on a 12-way, 64B line, 3 MB cache, from both assuming binomial distribution of cache blocks among sets (Binomial) and analyzing program traces (Trace). Horizontal axis is value of i, and x is the measurement interval counted in consecutive memory instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Average absolute errors of modeled extra L1 and L2 cache misses (versus detailed simulation) by modeling S(x) and b(i, x) from binomial distribution (Binomial) and from analyzing program traces (Trace). Result normalized to Binomial . . . . . . . . Probability distribution of the number of distinct blocks being accessed in a cache set for mcf on a 12-way, 64B line, 3 MB cache, from both an inductive function (Prob) and analyzing program traces (Trace). Horizontal axis is value of i, and x is the measurement interval counted in consecutive memory instructions . . . . . . . . Average absolute errors of modeled extra L1 and L2 cache misses (versus detailed simulation) by modeling b(i, x) from an inductive function (Prob) and from analyzing program traces (Trace). Result normalized to Prob . . . . . . . . . . . . . . . . . . . Comparison of predicted throughput to the throughput measured with detailed simulation for various cache configurations . . . . . . . . . . . . . . . . . . . . . . . . . . Optimization Case Study: Modeled throughput under different configurations . . . . Predicted ratio of the number of extra L2 misses (per thread) due to cache contention to the number of L2 misses when a thread runs alone compared to an actual hardware (Sun Fire T1000) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Predicted average throughput per core versus the throughput reported by performance counters on a Sun Fire T1000 . . . . . . . . . . . . . . . . . . . . . . . . . . . 63 65 65 68 69 70 72 73 74 76 78 79 80 82 83 84 viii Acknowledgements I am very grateful to my supervisor, Professor Tor Aamodt, for his constant advice and encouragement during the last two and a half years. Tor introduced me to the beauty of computer architecture and taught me the essence of research. This thesis would not have been possible without his help and support. I would also like to thank Professor Vikram Krishnamurthy and Professor Steve Wilton for being the members of my thesis committee and providing insightful feedback. I would like to thank my colleagues and friends Samer Al-Kiswany, Armin Bahramshahry, Ali Bakhoda, Wilson Fung, Abdullah Gharaibeh, Johnny Kuan, Jason Kuo, Elizeu Santos-Neto, Ivan Sham, Andrew Turner, Henry Wong, and George Yuan for the interesting discussions we had and for the valuable comments they provided for the work in this thesis. They made my experience at UBC unforgettable and I felt very lucky to be able to meet them. I will miss the ping pong breaks with Ali, George, and Abdullah so much. I am indebted to my parents for their unconditional love and support since I was born. They always cheered me up when I was upset and guided me through the most difficult time. I am very grateful to my wife Wendy for her understanding and support during my master’s study. Words cannot express my gratitude for her and I dedicate this work to her. Finally I would like to express my thanks to Natural Sciences and Engineering Research Council (NSERC) of Canada that provided financial support for this work through PGS-M and CGS-M awards. ix Chapter 1 Introduction To design a new microprocessor, computer architects typically create a cycle-accurate simulator and run numerous simulations to quantify performance trade-offs. Not only is the task of creating such a simulator time-consuming, but also running such simulations can be slow. Both are significant components of overall design-to-market time. As microprocessor design cycles stretch with increasing transistor budgets, architects effectively start each new project with less accurate information about the eventual process technology that will be used, leading to designs that may not achieve the full potential of a given process technology node. An orthogonal approach to obtaining performance estimates for a proposed design is analytical modeling [70]. An analytical model employs mathematical formulas that approximate the performance of the microprocessor being designed based upon program characteristics and microarchitectural parameters. One of the potential advantages of analytical modeling is that it requires much less time than crafting and running a performance simulator. Thus, when an architect has analytical models available to evaluate a given design, the models can help significantly shorten the design cycle. While workload sampling [61], parallel simulation [56], and FPGA acceleration [13, 68] can reduce simulation time, these approaches all require the creation of detailed performance models before any feedback on a proposed change in microarchitecture is available to the design team. Moreover, another key advantage of analytical modeling is its ability to provide insight that a cycle-accurate simulator may not. 1 1.1. Motivation First, this thesis proposes and evaluates techniques to predict the impact of pending cache hits, hardware data prefetching, and realistic miss status holding register (MSHR) resources on superscalar microprocessor performance in the presence of long latency memory systems by applying instruction trace analysis. Next, it presents a novel probabilistic cache contention model to accurately predict extra cache misses due to cache contention among a large number of threads. Finally, it proposes a Markov chain model for analytically estimating the throughput of multi-core, fine-grained multithreaded architectures. The rest of this chapter describes the motivation for this thesis, lists its contributions, explains some fundamental concepts required to understand the thesis, and summarizes the organization of the thesis. 1.1 Motivation As the complexity of modern computer systems keeps increasing, cycle-accurate performance simulators have become the most important tool for computer architecture researchers. While in 1973, only two out of 28 (i.e., 7%) papers appearing in the International Symposium on Computer Architecture, the flagship conference in computer architecture, were simulation-based; in 2004, 27 out of 31 (i.e., 87%) papers appearing in the conference used performance simulators to collect data they presented [70]. There exist many cycle-accurate performance simulators [5, 7, 44, 57, 67] that are flexible to use. Given a performance simulator, researchers can evaluate a new microarchitecture by modifying the simulator to model the new microarchitecture and analyzing the statistics of interest provided by the simulator. Moreover, the result generated from a performance simulator that models enough detail of the microarchitecture being evaluated can be very accurate, making performance simulators the primary tool to evaluate a new microprocessor design. 2 1.1. Motivation 10000 simulation time (hours) 370 days 8000 Cycle-accurate simulators 6000 Functional simulators 4000 Behavioral simulators Emulators 2000 37 days 53 minutes 9 hours 3.7 days 0 10 K 1M 100 K 10 K 1K simulation speed (instructions/second) Figure 1.1: Time to simulate 1 second runtime of a large-scale chip multiprocessor [71] However, the simulation speed of a cycle-accurate performance simulator is usually several orders of magnitude slower than real execution, making exhaustively exploring the large design space of modern microprocessors infeasible. Intel researchers have recently pointed out the challenge of using cycle-accurate performance simulators to simulate future large-scale chip multiprocessor (LCMP) platform of 32 cores, with each core running at 3 GHz and at 3 cycles per instruction (CPI) [71], as shown in Figure 1.1. The x-axis of this figure shows the speed of different types of simulators in instructions simulated per second for modeling such a large system; the y-axis shows the number of hours required to simulate one second run time of the system. From the figure we observe that, to simulate such a large multi-core system, the simulation speed of a cycle-accurate detailed simulator is in the range of a few thousands of instructions per second. Thus, it will take several months to complete simulating one second of real execution, which is obviously too long to wait in the process of designing new microprocessors. Analytical modeling is an alternative to performance simulation useful for efficiently exploring 3 1.2. Contributions the huge design space of modern microprocessors. An analytical model takes as inputs the major microarchitectural parameters of the system being modeled and the program characteristics of an application being executed and it predicts the performance of the system running the application. Although an analytical model, when compared to a cycle-accurate performance simulator, is less accurate and flexible since it only captures and models the performance impact of some major design parameters, its speed is usually orders of magnitude faster than the simulator, making efficiently exploring the large design space of modern microprocessors possible. Moreover, although the absolute accuracy of an analytical model is less than a performance simulator, its relative accuracy is usually good enough to track the impact of varying a microarchitectural parameter; therefore, an analytical model can be very useful in the early design stages of microprocessor design. Another important advantage of an analytical model over a cycle-accurate performance simulator is that the analytical model can provide insight that the performance simulator may not. Although a performance simulator can generate numerous statistics, it is impossible to quantify the impact of varying microarchitectural parameters without exhaustively simulating each set of parameters. On the other hand, since an analytical model usually consists of mathematical formulas, it can provide insight to systemize the impact of changing microarchitectural parameters on the overall performance; therefore, it can help designers to quickly and accurately find the optimal design points [35]. 1.2 Contributions This thesis makes the following contributions: 1. It shows that the performance impact of pending data cache hits is non-negligible for memory intensive applications and describes how to model their effect on performance in the context of a trace driven hybrid analytical model. 4 1.2. Contributions 2. It presents a novel technique to more accurately compensate for the potential overestimation of the modeled CP ID$miss , which relies upon analysis of an application’s individual characteristics. 3. It proposes a technique to model the CP ID$miss when a data prefetching mechanism is applied in a microprocessor, without requiring a cycle-accurate performance simulator. 4. It describes a technique to analytically model the impact of a limited number of outstanding cache misses supported by a memory system. 5. It proposes two novel techniques to better analyze the overlapping of long data cache misses. 6. It proposes a novel cache contention model to accurately predict the number of extra data cache misses due to cache contention among a large number of threads. 7. It shows how to compute quantities used in our cache contention model to predict miss rate by leveraging binomial probability modeling techniques originally developed for predicting cache misses due to context switching in time sharing systems [2, 64] (and thereby differentiates our techniques from these earlier approaches). 8. It presents four analytical models of varying accuracy and complexity for estimating the throughput of a multi-core, fine-grained multithreaded single-issue processor (similar to Sun Niagara). The most sophisticated and accurate throughput model utilizes a Markov chain to leverage our cache contention model. 9. It shows that our combined cache contention and Markov chain throughput model accurately predict the throughput of a multicore, fine-grained multithreaded processor obtained from a detailed simulator for various cache configurations. 5 1.3. Background 10. It applies the combined cache contention and Markov chain throughput model to optimize a multi-core fine-grained multithreaded processor for two different application specific workloads yielding the same design points as detailed simulation. 11. Finally, it validates the models against real hardware (a Sun Fire T1000 server). One of the most important goals of analytical modeling is to provide insights that cannot be easily obtained from detailed simulation. However, since there was little work related to analytical modeling compared to the large amount of work based upon detailed simulation in the last a couple of decades [70], an intermediate necessary step is to bring analytical modeling fully up to date with the advances of detailed simulation. We believe that our work significantly shortens the gap between the two different methodologies. 1.3 Background This section provides an overview of the fundamental concepts of superscalar processor pipeline, cache memory, data prefetching, and fine-grained multithreaded architectures for this thesis. 1.3.1 Superscalar Pipeline Pipelining [36] is a key technique used by modern processors to overlap the execution of multiple instructions. A processor pipeline consists of several stages and different pipeline stages can execute different instructions in parallel, improving the overall performance compared to a non-pipelined processor by exploiting instruction-level parallelism. Pipeline stages are connected one to the next and instructions flow through the pipeline during their execution. In a superscalar pipeline [60], multiple instructions can be processed simultaneously in a pipeline stage in one clock cycle, further improving the overall performance. The techniques that we propose in Chapter 2 focus on analyt- 6 1.3. Background ical modeling of modern superscalar out-of-order execution processors in which data independent instructions can be executed out of their program order. Fetch Decode Dispatch Issue Execute Writeback Commit Figure 1.2: A superscalar microprocessor pipeline The pipeline of the superscalar processor that we modeled in this thesis is illustrated in Figure 1.2 and the functionality of each pipeline stage is described as follows: • Fetch: Fetch instructions from memory based upon the program counter (PC). • Decode: Decode instructions being fetched to obtain information such as instruction types, input, and output registers. • Dispatch: Rename input and/or output registers to eliminate false data dependencies among instructions. Send decoded instructions into the instruction window that is a structure keeping track of the program order of all in-flight instructions. • Issue: Send instructions to corresponding function units when their input data is available. • Execute: Execute instructions by functional units. Fetch data requested by load instructions from memory. • Writeback: Complete execution. Forward the available result to instructions that are waiting for it to start execution. • Commit: Retire instructions from the instruction window and update machine states. 7 1.3. Background 1.3.2 Cache Memory In computer architecture, cache memory is referred to as a storage array containing a subset of most frequently used data in main memory and it has been proven to be a very important part in modern microprocessors [62]. Due to its small size compared to main memory, cache memory is orders of magnitude faster than the main memory; therefore, it can significantly improve the performance of a processor by quickly providing it with the requested data (given the requested data has been stored in the cache before). Modern processors usually apply multiple levels of cache to improve the performance of their memory system. L1 I-cache Processor L2 Cache Main Memory L1 D-cache Figure 1.3: A processor with two levels of cache Figure 1.3 shows the memory hierarchy of a processor with two levels of cache. The cache memory consists of a level one instruction cache (L1 I-cache), a level one data cache (L1 D-cache), and a unified level two cache (L2 Cache). While the L1 I-cache and L1 D-cache stores a subset of data contained in the L2 cache, the L2 cache stores a subset of data contained in the main memory. When the processor needs to fetch an instruction, it first accesses the L1 I-cache to search for the requested instruction. If the processor misses in the L1 I-cache, it will access the L2 cache. Then, it will access the main memory if it also misses in the L2 cache. On the other hand, the processor will first access L1 D-cache to search for the data requested for memory instructions. Cache memory in modern superscalar processors is nonblocking, meaning multiple main memory requests are allowed to be served in parallel. Nonblocking caches effectively reduce the penalty 8 1.3. Background per miss by overlapping memory accesses. To implement nonblocking caches, information about outstanding cache misses needs to be stored in miss status holding registers (MSHRs) [38]. In Section 2.2.4, we propose a technique to analytically model the performance impact of the number of MSHRs in a superscalar processor. 1.3.3 Data Prefetching Data prefetching is a technique to bring data from main memory into cache memory before it is required so as to hide (or partially hide) long memory access latency. Once the prefetched data comes back from the main memory, it is placed in the cache. Then, when the data is actually requested, it can be accessed much more quickly from the cache than if it had to make a request from the main memory. Many hardware data prefetching strategies have been proposed before [3, 25, 32, 62]. In Section 2.2.3, we present a technique to analytically predict the performance impact of hardware data prefetching mechanisms. 1.3.4 Fine-grained Multithreaded Architectures Multithreading is a technique to make a processor capable of interleaving the execution of multiple threads. The motivation of multithreading is to extract thread-level parallelism among multiple threads being executed; therefore, when the progress of one thread is stalled on events such as branch mispredictions and cache misses, other co-running threads can still keep processor resources busy, increasing the utilization of the execution resources on the processor. Various approaches of multithreading have been proposed, including fine-grained multithreading (FGMT) [65, 66], coarsegrained multithreading (CGMT) [23], and simultaneous multithreading (SMT) [67]. In this thesis we focus on modeling fine-grained multithreaded architectures. In a fine-grained multithreaded architecture, the execution of multiple co-running threads can be interleaved cycle-by-cycle. Therefore, a fine-grained multithreaded processor can execute in9 1.4. Organization T1 T2 T3 T4 T1 T2 T3 T4 Cycle: n n+1 n+2 n+3 n+4 n+5 n+6 n+7 time Figure 1.4: Instruction interleaving in a four-thread fine-grained multithreaded architecture structions from a different thread on every cycle. Figure 1.4 shows a simple example of instruction interleaving in a four-thread fine-grained multithreaded scalar processor. In this example, the processor executes one instruction per cycle and instructions from Thread 1 (T1) to Thread 4 (T4) are executed in a round-robin fashion. Ideally, when a thread stalls on cache misses, the long memory access latency can be hidden by useful computation of other co-running threads, as Figure 1.5 illustrates. T1 T2 T3 T4 C C M C M C M M C Compute latency Memory latency M Figure 1.5: Hiding memory latency in a four-thread fine-grained multithreaded architecture [37] In Chapter 3 and Chapter 4 we propose techniques to analytically model the cache contention and the throughput of fine-grained multithreaded architectures similar to Sun Microsystems’ Niagara [37]. 1.4 Organization The rest of the thesis is organized as follows: • Chapter 2 proposes techniques to predict the impact of pending cache hits, hardware data prefetching, and realistic miss status holding register (MSHR) resources on superscalar per10 1.4. Organization formance in the presence of long latency memory systems when employing hybrid analytical models that apply instruction trace analysis [10]. • Chapter 3 presents two novel metrics to quantify the temporal locality of an application and proposes a novel cache contention model, based upon the above two metrics, to accurately predict the number of extra data cache misses due to cache contention when the number of threads sharing a cache approaches and/or exceeds the cache’s associativity [12]. • Chapter 4 presents four analytical models of varying accuracy and complexity for estimating the throughput of a multi-core, fine-grained multithreaded single-issue processor (similar to Sun Niagara). The most sophisticated and accurate throughput model utilizes a Markov chain to leverage the above cache contention model [12]. • Chapter 5 describes our simulation methodology. • Chapter 6 presents and analyzes our experimental results. • Chapter 7 reviews related work. • Chapter 8 concludes this thesis and suggests future work. 11 Chapter 2 Hybrid Analytical Modeling of Pending Cache Hits, Data Prefetching, and MSHRs 2.1 First-Order Processor Model Before explaining the details of our techniques introduced in Section 2.2, it is necessary to be familiar with the basics of the first-order model of superscalar microprocessors. Karkhanis and Smith’s first-order model [34] leverages the observation that the overall performance of a superscalar microprocessor can be estimated reasonably well by subtracting the performance losses due to different types of miss-events from the processor’s sustained performance under the absence of miss-events. The miss-events considered include long latency data cache misses (e.g., L2 cache misses for a memory system with two-level cache hierarchy), instruction cache misses, and branch mispredictions. Figure 2.1 illustrates this approach. When there are no miss-events, the performance of the superscalar microprocessor is approximated by a stable IPC, expressed through a constant useful instructions issued per cycle (IPC) over time. When a miss-event occurs, the performance of the processor falls and the IPC gradually decreases to zero. After the miss-event is resolved, the 12 2.1. First-Order Processor Model IPC miss-event #1 miss-event #2 miss-event #3 time Figure 2.1: Useful instructions issued per cycle (IPC) over time used in the first-order model [34] decreased IPC ramps up to the stable value under ideal conditions. A careful analysis of this behavior leads to the first-order model [34]. While Figure 2.1 shows that a miss-event occurs only after the previous miss-events have been resolved, in a real processor it is possible for different types of miss-events to overlap. For example, a load instruction can miss in the data cache a few cycles after a branch is mispredicted. However, it has been observed (and we confirmed) that overlapping between different types of miss-events is rare enough that ignoring it results in negligible error in typical applications [20, 34]. This thesis focuses on improving the accuracy of the modeled CP ID$miss (i.e., CPI component due to long latency data cache misses) since it is the component with the largest error in prior firstorder models [33, 34]. Note that short latency data cache misses (i.e., L1 data cache misses that hit in the L2 cache in this thesis) are not regarded as miss-events in prior first-order models [33, 34] and they are modeled as long-execution-latency instructions when modeling the base CPI. In the rest of this thesis, we use the term “cache misses” to represent long latency data cache misses. As noted by Karkhanis and Smith [34], the interactions between microarchitectural events of the same type cannot be ignored. Our baseline technique for modeling data cache misses, based upon Karkhanis and Smith’s firstorder model [34], analyzes dynamic instruction traces created by a cache simulator. To differentiate such models, which analyze instruction traces, from earlier analytical models [2, 14, 31, 43, 51, 52] 13 2.1. First-Order Processor Model that do not, we also refer to them as hybrid analytical models in this thesis. In each profile step, a ROBsize number of consecutive instructions in the trace are put into the profiling window (or block) and analyzed, where ROBsize is the size of the re-order buffer. If all of the loads missing in the data cache in a profile step are data independent of each other, they are considered overlapped (i.e., the overlapped misses have the same performance impact as a single miss). When data dependencies exist between misses, the maximum number of misses in the same data dependency chain is recorded and the execution of all the other misses are modeled to be hidden under this dependency chain. In the rest of this thesis, num serialized D$miss represents the sum of the maximum number of misses measured in any single data dependency chain in a block of instructions, accumulated over all blocks making up the entire instruction trace. When all instructions in the trace have been analyzed, the CP ID$miss can be estimated as CP ID$miss = num serialized D$miss × mem lat total num instructions (2.1) where mem lat stands for the main memory latency and total num instructions is the total number of instructions committed (of any type). The following two simple examples show how to update num serialized D$miss. In both examples we assume that the size of the instruction window is eight for simplicity. In Figure 2.2(a), there are three data cache misses (i.e., i1, i3, and i6) in the profiling window and they are data independent of each other. Although i9 is also a data cache miss, it is not considered to be overlapped with the other three misses. After this profile step, num serialized D$miss is incremented by one. Figure 2.2(b) shows a similar example, but i6 depends on i1 (filled by the same pattern as i1). Therefore, the maximum number of misses in the same data dependency chain is two, and num serialized D$miss is incremented by two when the profile step is finished. 14 2.1. First-Order Processor Model Profiling Window i1 i3 i6 i2 i4 i7 i5 i8 i1 i2 i3 i4 i5 i6 i7 i8 i9 i10 i9 i10 (a) No data dependencies between misses Profiling Window i1 i3 i5 i2 i4 i7 i6 i1 i2 i3 i4 i5 i6 i7 i8 i9 i10 i8 i9 i10 (b) With data dependencies between misses Figure 2.2: Profiling examples. Each arrow corresponds to a dynamic instruction in the trace. Data cache misses are filled with patterns. A miss depending on previous misses is filled with the same pattern as the youngest miss on which it depends. Corresponding data dependency graph is also shown to the right in (a) and (b), and each node is an instruction in the dynamic instruction trace. The CP ID$miss modeled in Equation 2.1 often overestimates the actual CP ID$miss since outof-order execution enables overlap of computation with long latency misses. A simple solution proposed by Karkhanis and Smith [34] is to subtract a fixed number of cycles per serialized data cache miss based upon ROB size to compensate. The intuition for this compensation is that when a load issues and accesses the cache, it can be the oldest instruction in the ROB, the youngest instruction in the ROB, or somewhere in between. If the instruction is the oldest or nearly the oldest, the performance loss (penalty of the instruction) is the main memory latency. On the other hand, if the instruction is the youngest or nearly the youngest one in the ROB and the ROB is full, its penalty can be partially hidden by the cycles required to drain all instructions before it, and can be approximated as mem lat − ROBsize issue width [34]. It has been observed that loads missing in the 15 2.2. Modeling Long Latency Memory System cache are usually relatively old when they issue [34]; and thus, perhaps the simplest (though not most accurate) approach is to use no compensation at all [34]. The mid-point of the two extremes mentioned above can also be used (i.e., a load missing in the cache is assumed to be in the middle of ROB when it issues), and the numerator in Equation 2.1 becomes num serialized D$miss × (mem lat − 2.2 ROBsize 2×issue width ) [33]. Modeling Long Latency Memory System In this section, we describe how we model pending cache hits, data prefetching, and a limited number of MSHRs. 2.2.1 Modeling Pending Data Cache Hits The method of modeling long latency data cache misses described in Section 2.1 profiles dynamic instruction traces generated by a cache simulator [34]. Since a cache simulator provides no timing information, it classifies the load or store bringing a block into the cache as a miss and all subsequent instructions accessing the block before it is evicted as hits. However, the actual latency of many instructions classified as a hit by a cache simulator is much longer than the cache hit latency. For example, if there are two close load instructions accessing data in the same block that is not currently in the cache, the first load will be classified as a miss by the cache simulator and the second load as a hit, even though the data would still be on its way from memory in a real processor implementation. Therefore, since the second load is classified as a hit in the dynamic instruction trace, it is ignored in the process of modeling CP ID$miss using the approach described in Section 2.1. More importantly, a significant source of errors results when two or more data independent load instructions that miss in the data cache are connected by a third pending data cache hit. We 16 2.2. Modeling Long Latency Memory System fictitious dependence we model to account for the effect of spatial locality i1: LD, R1, 0(R2) i2: LD, R3, 4(R2) i3: LD, R4, 0(R3) spatial locality miss (Block A) pending hit (Block A) miss (Block B) Figure 2.3: An example showing how two data independent misses (i1, i3) are connected by a pending hit (i2), upon which i3 is data dependent. elaborate what “connected” means using the simple example in Figure 2.3. In this example, i1 and i3 are two loads that miss and they are data independent of each other, while i2 is a pending hit since it accesses the data in the same cache block as i1. The model described in Section 2.1 classifies i1 and i3 as overlapped and the performance penalty due to each miss using that approach is estimated as half of the memory access latency (total penalty is the same as if there is a single miss). However, this approximation is inaccurate since i3 is data dependent on the pending data cache hit i2 and i2 gets its data when i1 obtains its data from memory (i.e., i1 and i2 are waiting for the data from the same block). Therefore, in the actual hardware, i3 can only start execution after i1 gets its data from memory although there is no true data dependence between i1 and either i2 or i3. This scenario is common since most programs contain significant spatial locality. The appropriate way to model this situation is to consider i1 and i3 to be serialized in our analytical model, even though they are data independent and access distinct cache blocks. Figure 2.4 shows the impact that pending data cache hits combined with spatial locality have on overall performance for processors with long memory latencies. The first bar (w/ PH) illustrates measured CP ID$miss for each benchmark on the detailed simulator described in Section 5.1 and the second bar (w/o PH) shows the measured CP ID$miss when all the pending data cache hits are simulated as having a latency equal to the L1 data cache hit latency. From this figure, we observe that the difference is significant for eqk, mcf , em, hth [72], and prm. 17 2.2. Modeling Long Latency Memory System w/ PH CPI due to D$miss 8 w/o PH 7 6 5 4 3 2 1 0 app art eqk luc swm mcf em hth prm lbm Figure 2.4: Impact of pending data cache hit latency on CP ID$miss (measured over all instructions committed from a detailed simulator) To model the effects of pending data cache hits analytically, we first need to identify them without a detailed simulator. At first, this may seem impossible since there is no timing information provided by the cache simulator. We tackle this by assigning each instruction in the dynamic instruction trace a sequence number in program order and labeling each memory access instruction in the trace with the sequence number of the instruction that first brings the memory block into the cache. Then, when we profile the instruction trace, if a hit accesses data from a cache block that was first brought into the cache by an instruction still in the profiling window, it is regarded as a pending data cache hit. For every pending hit identified using this approach (e.g., i2 in Figure 2.3), there is a unique instruction earlier in the profiling window that first brought in the cache block accessed by that pending hit (e.g., i1 in Figure 2.3). When we notice a data dependence between a later cache miss 18 2.2. Modeling Long Latency Memory System (e.g., i3 in Figure 2.3) and the pending hit (i2), we model a dependence between the early miss (i1) and the instruction that is data dependent on the pending hit (i3) since the two instructions (i1 and i3) have to execute serially due to the constraints of the microarchitecture. 2.2.2 Accurate Exposed Miss Penalty Compensation While the model described in Section 2.1 uses a fixed number of cycles to adjust the modeled CP ID$miss , we found that compensation with a fixed number of cycles (a constant ratio of the reorder buffer size) does not provide consistently accurate compensation for all of the benchmarks that we studied, resulting in large modeling errors (see Figure 6.1). To capture the distinct distribution of long latency data cache misses of each benchmark, we propose a novel compensation method. The new method is motivated by our observation that the number of cycles hidden for a load missing in the cache is roughly proportional to the distance between the load and the immediately preceding load that missed in the cache (we define the distance between two instructions to be the difference between their instruction sequence number). This is because when a load instruction misses in the cache, most of the instructions between that load and the immediately preceding long latency miss are independent of that load. Therefore, we approximate the latency of the later load that can be overlapped with useful computation as the time used to drain those intermediate instructions from the instruction window, which we estimate as the distance between the two loads divided by the issue width. When we profile an instruction trace, the average distance between two consecutive loads missing in the cache is also collected and used to adjust the modeled CP ID$miss . If the distance between two misses exceeds the window size, it is truncated since the miss latency can be overlapped by at most ROBsize − 1 instructions. Equation 2.2, below, shows how the CP ID$miss is adjusted by subtracting a compensation term, 19 2.2. Modeling Long Latency Memory System if ( the instruction (crntInst) is a pending hit (e.g., i8 in Fig 2.6) ) { find the most recent instruction (prevInst) in profiling window (e.g., i6 in Fig 2.6) that brings crntInst’s required data into cache estimated hidden latency A crntInst.lat = max(memLat - (crntInst.iseq – prevInst.iseq) / issueWidth, 0) // calculate the latency of the current instruction crntInst.lat = crntInst.lat / memLat // normalize the crntInst.lat to the memory latency crntInst.length = max(inst.length) where inst is an instruction on which crntInst directly depends (true data dependency exists, e.g., i7 i8 in Fig 2.6) if (crntInst.length < prevInst.length – prevInst.lat) { Notation crntInst.length = critInst.length + 1 tardy the current instruction being analyzed B prefetch crntInst: crntInst.lat = 1 prevInst: the instruction bringing the data required } else { by the current instruction into the cache accmLength = prevInst.length – prevInst.lat + crntInst.lat crntInst.iseq: the instruction sequence number of if (accmLength > crntInst.length) { the current instruction crntInst.lat = accmLength – crntInst.length C issueWidth: the issue width of the microprocessor crntInst.length = accmLength timely memLat: the memory access latency } prefetch crntInst.lat: the normalized time interval between else the issue and the writeback of the crntInst.lat = 0 current instruction } crntInst.length: the normalized length of the data } dependency chain up to the current instruction Figure 2.5: Algorithm for analyzing a pending hit in an instruction trace when a prefetching mechanism is applied dist issue width × num D$miss, from the numerator in Equation 2.1. CP ID$miss = num serialized D$miss × mem lat − ( issuedist width × num D$miss) total num instructions (2.2) Here dist is the average distance between two consecutive loads that miss in the cache and the term dist issue width represents the average number of cycles hidden for each cache miss. The product of this term and the total number of loads missing in the cache (num D$miss) becomes the total number of cycles used to compensate for the overestimation of the baseline profiling method. 20 2.2. Modeling Long Latency Memory System 2.2.3 Modeling Data Prefetching Data prefetching is a technique to bring data from memory into the cache before it is required so as to hide (or partially hide) long memory access latency. Many hardware data prefetching strategies have been proposed before [3, 25, 32, 62]. In this section, we demonstrate how to extend our model described in Section 2.2.1 to estimate the CP ID$miss when a data prefetching technique is employed without running detailed simulations. To model the CP ID$miss when a particular prefetching method is applied, a cache simulator implementing that prefetching method is needed to generate a dynamic instruction trace. While this does require some coding, we found that the resulting analytical model obtains very accurate results and is two orders of magnitude faster than detailed simulations. As described in Section 2.2.1, when a cache simulator generates an instruction trace, each memory access instruction in the trace is labeled with the sequence number of the instruction that first brought the data into the cache. If the data required by a load was brought into the cache by a prefetch, then the load is labeled with the sequence number of the previous instruction that triggered the prefetch. Recall that, when no prefetching mechanism is applied, an instruction trace generated by a cache simulator is divided into blocks of instructions and each block is analyzed in a profile step. In each profile step, the maximum number of loads that are in a dependence chain and miss in the cache is recorded. However, when an effective prefetching method is implemented, many loads that would have missed in the cache become hits. To be more specific, many of them become pending hits given that some of the prefetches cannot fully hide the memory access latency. We found that to accurately model prefetch performance, it is necessary to approximate the timeliness of the prefetches and consequently the latencies of these pending hits relatively accurately. Figure 2.5 illustrates how we analyze a pending hit in an instruction trace when a particular prefetching mechanism is applied. Here a pending hit can either be due to a prefetch or a demand 21 2.2. Modeling Long Latency Memory System i1.length = 1 i5.length = 2 i6.length = 2 i3.length = 1 i1 i5 prefetch i6 should be 1+i8.lat i3 i2 i4 i2.length = 0 i4.length = 1 i7 i8 i7.length = 1 i8.length = 2 + i8.lat Figure 2.6: An example motivating Figure 2.5 part B miss and, in both cases, it is analyzed using the algorithm in Figure 2.5. For each pending hit (crntInst in Figure 2.5), we find the instruction (prevInst in Figure 2.5) that brought crntInst’s required data into the cache. We approximate crntInst’s latency based upon the observation that typically the further prevInst is from crntInst, the more latency of crntInst can be hidden. The hidden latency of crntInst is estimated as the number of instructions between crntInst and prevInst divided by the issue width of the microprocessor being modeled. Note that we employ the approximation of ideal CPI equal to 1/issueWidth in this calculation. Then, crntInst’s latency is estimated as the difference between the memory access latency and the hidden latency, or zero if the memory latency is completely hidden. This latency is in cycles, and we normalize it by dividing it by the main memory latency since the accumulated num serialized D$miss after each profile step is represented in units of main memory latency. The part of the code marked B in Figure 2.5 models a significant phenomenon (late or tardy prefetches) that we observed in our study of various hardware prefetching mechanisms. Since the instruction trace being analyzed is generated by a cache simulator that is not aware of the out-oforder execution of the superscalar microprocessor being modeled, a pending hit due to prefetching indicated by the cache simulator is often actually a miss during out-of-order execution. Figure 2.6 shows a simplified example illustrating how this may happen. In this example, there are eight 22 2.2. Modeling Long Latency Memory System instructions and they are labeled from i1 to i8 in program order. Figure 2.6 shows the data dependency graph constructed during profiling according to an instruction trace generated by a cache simulator assuming the pseudo-code marked B in Figure 2.5 is not included. In Figure 2.6, i1 and i5 are loads missing in the data cache (represented by the shaded circles) and i6 triggers a prefetch that brings the data accessed by a load i8 into the cache when i6 issues (represented by the broken line arrow labeled “prefetch” from i6 to i8). For each instruction, the longest normalized length of the data dependency chain up to and including that instruction is shown (in units of main memory latency). For example, “i3.length=1” above i3 in Figure 2.6 means that it takes one memory access latency from when i1 (the first instruction in the profile step) issues until i3 finishes execution since i3 is data dependent on i1, which missed in the cache. Since i8 is a pending hit (represented by the circle filled with hatching) and the associated prefetch is started when i6 issues, i8.length is calculated, without B, as the sum of i6.length and i8.lat, where i8.lat is estimated in part A in Figure 2.5 as memLat− i8.iseq−i6.iseq issueW idth memLat . In this example, i8.lat is almost equal to 1.0 since i8 is very close to i6. Although the data accessed by i8 is regarded as being prefetched by the algorithm in Figure 2.5 without B, i8 is actually (as determined by detailed simulation) a miss rather than a pending hit due to out-of-order execution. In Figure 2.6, we observe that i6.length is bigger than i7.length. Therefore, before i6 (e.g., a load instruction) issues (and hence triggers a hardware generated prefetch), i8 has already issued and missed in the data cache. Thus, the prefetch provides no benefit. The code marked B in Figure 2.5 accurately takes account of this significant effect of outof-order scheduling by checking if crntInst (i8) issues before the prefetch is triggered. We observed that removing part B in Figure 2.5 increases the average error for the three prefetching techniques that we model from 13.8% to 21.4% while adding part B slows our model by less than 2%. An example in Figure 2.7 shows how the part of code marked C in Figure 2.5 models the case when a useful prefetch occurs in out-of-order execution (i.e., a prefetch which lowers CPI). In 23 2.2. Modeling Long Latency Memory System prefetch i83.length = 2 i1.length = 1 i4.length = 2 i1 i3 i2 i83 i85.length = 2 i4 i85 i84 i3.length = 1 i245.length = 2.8 i86 i245 i86.length = 2 i84.length = 2 i2.length = 0 prefetch Figure 2.7: An example explaining Figure 2.5 part C Figure 2.7, only nine relevant instructions are shown out of the 256 instructions included in a profile step (assuming ROBsize is 256). Among these nine instructions, i1 and i4 are loads that miss in the data cache and both i3 and i85 trigger prefetches, making i83 and i245, respectively, pending hits. The number of cycles hidden in the prefetch triggered by i3 is estimated as i83.iseq−i3.iseq issueW idth = 83−3 4 = 20 (when issue width is four), and then the remaining latency after normalization is calculated as memLat−20 memLat = 0.9 (we assume throughout our examples that memory access latency is 200 cycles). However, since i83 is data dependent of i4 and i4.length=2, when i83 issues, its prefetched data has already arrived at the data cache and its real latency becomes zero (this case corresponds to the “else part” inside of part C in Figure 2.5). The number of cycles hidden by the prefetch for i245 is estimated (from part A in Figure 2.5) as normalized latency of memLat−40 memLat i245.iseq−i85.iseq issueW idth = 245−85 4 = 40 with remaining = 0.8. Since the instruction triggering the prefetch (i85) and the instruction that i245 directly depends on (i86) finish execution around the same time (i.e., i85.length=i86.length), i245.length becomes 2.8 and i245.lat becomes 0.8 (this case corresponds to the “if part” inside of part C in Figure 2.5). 24 2.2. Modeling Long Latency Memory System 2.2.4 Modeling a Limited Number of MSHRs The method of analytically modeling the CPI due to long latency data cache misses described in Section 2.1 assumes that at most ROBsize cache misses can be overlapped. However, this assumption is unreasonable for most modern processors since the maximum number of outstanding cache misses the system supports is limited by the number of Miss Status Holding Registers (MSHRs) [4, 22, 38] in the processor. In a real processor, the issue of memory operations to the memory system has to stall when available MSHRs run out. Based upon the technique described in Section 2.1, the profiling window with the same size as the instruction window is always assumed to be full when modeling CP ID$miss . In order to model a limited number of outstanding cache misses, we need to refine this assumption. During a profile step, we first stop putting instructions into the profiling window when the number of instructions that miss in the data cache and have been analyzed is equal to NM SHR (number of MSHRs) and then update num serialized D$miss only based upon those instructions that have been analyzed to that point1 . Figure 2.8 illustrates how the profiling technique works when the number of outstanding cache misses supported is limited to four. Once we encounter NM SHR (four) cache misses in the instruction trace (i.e., i1, i2, i4, and i6), the profile step stops and num serialized D$miss is updated (i.e., the profiling window is made shorter). In the example, the four misses are data independent of each other (and not connected with each other via a pending hit as described in Section 2.2.1), thus num serialized D$miss is incremented by only one. Although i7 also misses in the cache, it is included in the next profile step since all four MSHRs have been used. 1 In real execution, cache misses that are regarded as not present in the profiling window simultaneously due to lack of available MSHRs could actually be in the instruction window simultaneously. Reducing the profiling window size only approximates the performance loss due to a limited number of MSHRs. We leverage this observation in Section 2.2.5. 25 2.2. Modeling Long Latency Memory System Profiling Window next profile step starts here i1 i2 i3 i6 i5 i1 i2 i3 i4 i5 i6 ROB Size i7 i8 i9 i10 i7 i4 i9 i8 i10 Figure 2.8: An example showing profiling with ROBsize = 8 and NM SHR = 4. Each arrow corresponds to a dynamic instruction in the trace. Data cache misses are filled with patterns. Corresponding data dependency graph is shown to the right. 2.2.5 Profiling Window Selection In this section, we present two important refinements to the profiling technique described in Section 2.1 (which we will refer to hereafter as plain profiling) to better model the overlapping between cache misses. Start-with-a-miss (SWAM) Profiling We observe that often the plain profiling technique described in Section 2.1 does not account for all of the cache misses that can be overlapped, due to the simple way in which it partitions an instruction trace. Figure 2.9(a) shows a simple example. In this example, we assume that all the cache misses (shaded arrows) are data independent of each other for simplicity. Using the profiling approach described in Section 2.1, a profile step starts at pre-determined instructions (for example, i1, i9, i17..., when ROBsize is eight). Therefore, although the latency of i5, i7, i9, and i11 can be overlapped, the plain profiling technique does not account for this. By making each profile step start with a cache miss, we find that the accuracy of the model improves significantly. Figure 2.9(b) illustrates this idea. Rather than starting a profile step with 26 2.2. Modeling Long Latency Memory System 1st profile step 2nd profile step i1 i2 i3 i4 i5 i6 i7 i8 i9 i10 i11 i12 i13 i14 (a) Plain Profiling A profile step starts with a miss i1 i2 i3 i4 i5 i6 i7 i8 i9 i10 i11 i12 i13 i14 (b) SWAM Profiling Figure 2.9: An example comparing plain profiling and SWAM profiling with ROBsize = 8. Each arrow is a dynamic instruction. Data cache misses are shaded. i1, we start a profile step with i5, so that the profiling window will include i5 to i12. Then, the next profile step will seek and start with the first cache miss after i12. We call this technique start-with-a-miss (SWAM) profiling and in Section 6.1.1 we will show that it decreases the error of plain profiling from 29.3% to 10.3% with unlimited MSHRs. We explored a sliding window approximation (start each profile window on a successive instruction of any type), but found it did not improve accuracy while begin slower. SWAM improves modeling accuracy because it more accurately reflects what the contents of the instruction window of a processor would be (a long latency miss would block at the head of the ROB). Improved SWAM for Modeling a Limited Number of MSHRs (SWAM-MLP) The technique for modeling MSHRs proposed in Section 2.2.4 can be combined with SWAM to better model the performance when the number of outstanding cache misses supported by the 27 2.2. Modeling Long Latency Memory System memory system is limited. The basic idea is to have each profile step start with a miss and finish either when the number of instructions that have been analyzed equals the size of the instruction window or when the number of cache misses that have already been analyzed equals the total number of MSHRs. However, choosing a profiling window independent of whether a cache miss is data dependent on other misses (or connected to other misses via pending hits as described in Section 2.2.1) leads to inaccuracy because data dependent cache misses cannot simultaneously occupy an MSHR entry. To improve accuracy further, we stop a profile step when the number of cache misses that are data independent of misses that have been analyzed in the same profile step (rather than the number of cache misses being analyzed) equals the total number of MSHRs. In the rest of this thesis we call this technique SWAM-MLP since it improves SWAM by better modeling memory level parallelism. When a miss depends on an earlier miss in the same profiling window, the later miss cannot issue until the earlier one completes and SWAM-MLP improves model accuracy because it takes into account that out-of-order execution can allow another independent miss that is younger than both of the above misses to issue. Therefore, the number of instructions that miss in the data cache and that should be analyzed in a profile step should, in this case, be more than the total number of MSHRs. 28 Chapter 3 Modeling Cache Contention Among Many Threads In this chapter, we first summarize a prior cache contention model [9], then explain and evaluate its limitations. Next, we propose two novel metrics for quantifying aspects of temporal locality necessary to overcome these limitations. Finally, we propose a novel cache contention model using these locality metrics to accurately model cache contention for a large number of threads. 3.1 Probabilistic Cache Contention Models Chandra et al. [9] propose a probabilistic model for predicting the number of extra cache misses due to cache contention between two threads on a chip multiprocessor architecture. The model uses information from a circular sequence profile obtained for each thread running alone. In their work, a circular sequence is defined as a sequence of accesses to the same cache set from a thread such that the first and the last access are to the same cache block, which is different from all the blocks touched by intermediate accesses. Figure 3.1 illustrates the notion of a circular sequence. In Figure 3.1 the blocks A, B, C, D, and E map to the same set in a four-way LRU cache. Several examples of circular sequences are shown in the Figure. A circular sequence is denoted as cseq(d, n), where d is the number of distinct blocks in the circular sequence, n is the total number of blocks in the circular sequence, and d is at least one less than n (as a result of the definition of cseq(d, n)). 29 3.1. Probabilistic Cache Contention Models cseq(4,6) A B A A cseq(2,3) cseq(1,2) C miss D B E A cseq(5,6) Figure 3.1: An example of circular sequences in a set of a four-way associative LRU cache extra L2 misses 100x actual (Sun Fire T1000) predicted (prior model) predicted (new model) 10x 1x 2mcf 4mcf 8mcf 12mcf 16mcf 0.1x Figure 3.2: Extra L2 cache misses due to cache contention when multiple copies of mcf are running in parallel (L2 config. in Table 5.3) In an x-way associative cache with LRU replacement the last access in a circular sequence cseq(d, n) is a hit if and only if d ≤ x. Thus, by performing circular sequence profiling it is possible to determine whether a load or store misses in the cache. In Figure 3.1 the last access (to block A) results in a cache miss since it corresponds to cseq(5, 6) and 5 is greater than the cache associativity. For all the other circular sequences in Figure 3.1, the last access corresponds to a cache hit. To model the number of extra cache misses due to cache contention, a circular sequence profile is performed to obtain the distribution of cseq(d, n) for a given cache configuration. Using this 30 3.1. Probabilistic Cache Contention Models information, the average of n weighted by the frequency of occurrence of cseq(d, n), denoted n, is computed for each d between 1 and x (where x is the cache associativity). After obtaining n for each d, the prior model uses the access frequency per set (average number of accesses per cycle to the cache set, including both hits and misses) to approximate the average time interval (in cycles) between the first and the last access of cseq(d, ∗). Here cseq(d, ∗) represents all circular sequences containing d distinct tags. Next, the estimated time interval is converted to the number of cache accesses to that set from the second thread using the second thread’s access frequency. Then, an inductive model is used to estimate the probability that a specified number of distinct blocks are accessed in the set from the second thread. Finally, the probability that a cache hit from the first thread becomes a miss due to interference from the second thread is approximated, and the number of extra cache misses of the first thread is calculated. The prior model works well when there are only two threads sharing a cache with a high associativity. As the number of threads sharing the cache increases, the prior model described above becomes less accurate. Figure 3.2 compares, for multiple copies of mcf running in parallel, the predicted result both from an extension to the prior model [9] (labeled “predicted (prior model)”) and from our new model described in Section 3.3 (labeled “predicted (new model)”) to the result measured on real hardware–a Niagara T1 based Sun Fire T1000 system. The predicted result is the ratio of the number of extra L2 misses per mcf due to cache contention to the number of misses when mcf runs alone. Note we have extended the Prob approach presented for two threads in Chandra et al. [9] by considering the interleaving of sequences from more than one interfering thread.2 In Figure 3.2, we observe that the error of the prior model increases significantly as the number of threads sharing the cache increases. For example, consider the case where twelve threads share a 2 P For “prior dX2 +...+dX N ≤A−dX1 model” Q i=2...N in Figure 3.2 we use Pmiss (cseqX1 (dX1 , nX1 )) = 1 − P (seq(dXi , E(nXi ))) to extend Equation 3 for the Prob model in Chandra et al. [9] for more than two threads. Here Xi , i 6= 1, represents each interfering thread, A is the cache associativity, and N is the total number of threads. This equation follows P directly after extending Corollary 3 in [9], such that the last access from X1 in cseqX1 (dX1 , nX1 ) results in a miss if N i=1 dXi > A (or a hit otherwise). 31 3.1. Probabilistic Cache Contention Models ammp applu art bzip2 equake mcf mgrid swim 4500 # distinct cache sets 4000 3500 3000 2500 2000 1500 1000 500 0 0 0 50 50 100 150 200 200 K100 memory 150 instructions 250 250 300 300 Figure 3.3: Average number of distinct cache sets accessed during a given number of consecutive memory instructions for a twelve-way, 64B line, 4096-set LRU L2 cache twelve-way L2 cache (i.e., 12mcf in Figure 3.2). For any given circular sequence of a thread, the number of accesses from each co-scheduled thread occurring between the first and the last access in the circular sequence is predicted to be at least one by the prior model. Therefore, only a small fraction of the last accesses in all cseq(1, ∗) circular sequences are modeled as hits (when each of the eleven co-scheduled threads inserts only one access in a cseq(1, ∗)). In this example, the actual number of L2 misses increases by a factor of 6.2× and our new model presented in Section 3.3 predicts an increase of 6.5× (4.8% error). However, the prior model predicts an increase of 37.6× since it predicts 95.7% of the accesses that hit in the cache when mcf runs alone will turn into cache misses (resulting in a 507.9% error). 32 3.2. New Locality Metrics For Modeling Cache Contention 3.2 New Locality Metrics For Modeling Cache Contention While circular sequence profiling captures many aspects of temporal locality, what is missing in the prior model described in Section 3.1 is the fact that the cache footprint of a thread during a small interval of time is likely to be limited to a small number of cache sets rather than all of them. If the time interval between the first and the last access in a circular sequence is short, cache accesses from other co-scheduled threads are not likely to be sent to the same cache set containing the accesses from the circular sequence. Therefore, even in the presence of many co-scheduled threads the last access in a circular sequence may not turn into a miss. To quantify the above behavior we introduce the parameter S(x), which is the average number of sets accessed by a thread during x consecutive loads or stores. This can be obtained by off-line profiling a memory access trace for a specific cache configuration (or potentially in hardware—e.g., by adding a single bit per cache set along with a single counter and a table—though detailed explorations of such implementations are beyond the scope of this work). We note that S(x) is different from the function u(B) in Agarwal et al. [2] (which is related to FA in Thiebaut and Stone [64]). There are two differences: First, S(x) represents the average number of unique sets accessed given x memory references whereas u(B) is the average number of unique blocks of size B measured over a fixed length time-slice interval (which could equal x). Second, our use of S(x) is quite different from the use of u(B) or FA in earlier work. We show how one can approximate S(x) assuming cache accesses follow a binomial distribution using techniques similar to those used by Thiebaut and Stone [64] in Section 6.2.2. Moreover, Appendix A provides an example showing why extending the earlier approaches proposed for modeling extra cache misses due to context switching on time-sharing machines does not work in the limit as threads context switch cycle by cycle. This problem is overcome by combining S(x) with circular sequence profile information. 33 1 1 0.8 0.8 0.8 0.6 0.4 0.2 0.6 0.4 0.2 0 0.6 0.4 0.2 0 1 2 3 4 5 6 7 8 9 101112 # distinct accesses in a set 1 2 3 4 5 6 7 8 9 101112 # distinct accesses in a set 1 2 3 4 5 6 7 8 9 101112 # distinct accesses in a set (a) b(i, x = 1000) (b) b(i, x = 10000) (c) b(i, x = 20000) 1 1 0.8 0.8 0.8 0.6 0.4 0.2 0 probability 1 probability probability 0 probability 1 probability probability 3.2. New Locality Metrics For Modeling Cache Contention 0.6 0.4 0.2 0 1 2 3 4 5 6 7 8 9 101112 # distinct accesses in a set (d) b(i, x = 50000) 0.6 0.4 0.2 0 1 2 3 4 5 6 7 8 9 101112 # distinct accesses in a set (e) b(i, x = 100000) 1 2 3 4 5 6 7 8 9 101112 # distinct accesses in a set (f) b(i, x = 250000) Figure 3.4: Probability distribution of the number of distinct blocks being accessed in a cache set for mcf on a 12-way, 64B line, 3 MB cache. Horizontal axis is value of i, and x is the measurement interval counted in consecutive memory instructions Figure 3.3 shows S(x) of different benchmarks for the L2 cache we model in Section 6.2. From the figure we observe that as the number of consecutive memory instructions being analyzed increases, the number of distinct cache sets accessed increases until saturating at the total number of cache sets (provided the data set of an application is large enough to use all sets in the cache). A thread with higher S(x) for a given x is more likely to access sets currently used by other threads. Besides the number of cache sets accessed, a thread’s cache footprint also depends on the number of distinct blocks accessed in those sets. For example, it is possible that there are two threads with the same value of S(x) for a fixed x (number of memory instructions). However, for each set being accessed, there are ten distinct blocks from the first thread and only two from the second thread. 34 3.3. Accurately Modeling Cache Contention with Many Threads Then, although the two threads access the same number of sets, the first thread’s cache footprint over this interval would be five times larger than the second thread. To take the number of distinct blocks in a set accessed by a thread into consideration, we propose using a probability vector, b(i, x), to estimate the probability that there are i distinct accesses to a cache set during x consecutive memory instructions, given that the cache set is accessed. Similar to S(x), b(i, x) can be obtained via off-line profiling. Figure 3.4 illustrates b(i, x) with different values of x for mcf. For each x, the probability that there are one to eleven (i.e., associativity-1) distinct accesses to a cache set is shown in the first eleven bars, while the rightmost bar of each figure is the probability of having more than or equal to twelve (i.e., associativity) distinct accesses. From Figure 3.4, we observe that when the measurement interval (i.e., x) is small, the average number of distinct accesses in each cache set being accessed is likely to be small. As x increases, the mode in each b(i, x) distribution tends to move toward higher values of i. When x is sufficiently large, the number of distinct accesses is typically greater than or equal to the associativity of the cache. Our use of b(i, x) essentially replaces the use of the inductive probability model [9]. In the next section, we utilize S(x), b(i, x), along with circular sequence profiling to more accurately quantify the additional number of cache misses due to contention among threads when the number of co-scheduled threads approaches or exceeds the associativity of the cache being shared. 3.3 Accurately Modeling Cache Contention with Many Threads For concreteness, we will describe how to model the number of extra cache misses of a given thread (T1) due to sharing a four-way set associative L1 data cache with three other threads (T2, T3, and T4). We also generalize the details for different numbers of threads and cache associativities. We define the distance between two accesses from a single thread as the total number of intermediate memory accesses from the thread (i.e., independent of which cache sets are accessed). 35 3.3. Accurately Modeling Cache Contention with Many Threads As mentioned in Section 3.2, for a circular sequence cseq(d, n), the higher the distance between its first and last access, the more likely the last access will become a miss due to cache contention. Therefore, hereafter we describe each circular sequence with the distance between its first and last access as cseq(d, n, r), where r represents the distance. We use cseq(d, ∗, r) for the set of circular sequences with d distinct tags, for all n such that d < n ≤ r. Briefly, our algorithm computes the extra number of cache misses due to cache contention as follows. We first compute the average number of memory accesses to obtain a given number of unique blocks in a circular sequence. Next, given an estimate of the relative progress of each thread we translate this to the number of memory accesses from co-scheduled threads. Then, we use S(x) to determine the number of unique sets accessed by co-scheduled threads and b(i, x) to determine the number of unique blocks accessed in those sets by the co-scheduled threads. Using this information we then compute the probability of a miss. The probability of a miss being introduced into a circular sequence is a nonlinear function of the distance r. In general E[f(x)] 6= f(E[x]) when f is nonlinear. Thus, we find accuracy is improved if we compute the probability of a miss for separate groups g of ranges of r when computing the number of extra cache misses for circular sequences with different lengths. In detail the procedure for computing the extra number of cache misses is as follows: For each value of d between one and four (i.e., the associativity of the L1 cache), we carry out the first five steps below, then we use the resulting information in the last step. Step 1: We compute the average distance to see d unique blocks in a given cache set. To do this we apply circular sequence profiling and classify circular sequences cseq(d, n, r) into different groups based upon r. Each group g corresponds to a range of r. Then, for a group g, calculate dist(d,g) , the weighted r (distance between the first and last access in a circular sequence) for all circular 36 3.3. Accurately Modeling Cache Contention with Many Threads sequences cseq(d, n, r) belonging to the group as follows: Prg dist(d,g) = r=rg [r × num(cseq(d, ∗, r))] Prg r=rg num(cseq(d, ∗, r)) where rg and rg represents the lower and the upper bounds of the group g, respectively, and num(cseq(d, ∗, r)) denotes the number of occurrences of the circular sequence cseq(d, ∗, r) observed during profiling. In our study, we classify a circular sequence into one of twelve groups. The first group corresponds to all the circular sequences with 1 < r < 25 , the next ten groups with 2i ≤ r < 2i+1 , i ∈ 5, 6, ..., 14, and the last group corresponds to all the circular sequences with r ≥ 215 . We will show the impact of the number of groups chosen to classify a circular sequence on the accuracy of our cache contention model in Section 6.2.1. Step 2: For each dist(d,g) , calculate nT2(d,g) , nT3(d,g) , nT4(d,g) , the corresponding numbers of accesses for T2 , T3 , and T4 , respectively, based upon the access frequency of those threads: nTi(d,g) = dist(d,g) × access freq(Ti ) access freq(T1 ) for i = 2, 3, 4 (up to #threads) where access freq(Ti ) is the number of memory instructions per cycle for thread Ti .3 Step 3: Using nTi(d,g) from Step 2, find STi(d,g) , the average number of cache sets accessed during nTi(d,g) consecutive memory access instructions, by using S(x) described in Section 3.2: STi(d,g) = S(nTi(d,g) ) for i = 2, 3, 4 (up to #threads) Step 4: Using nTi(d,g) from Step 2 and b(i, x) (defined in Section 3.2), find dTi(d,g) (k)—the distri3 As we will discuss later in Section 4.4.2, when the cache contention model is combined with the Markov chain model to predict throughput, access freq(Ti ) is estimated from a simple single thread, sum-of-CPI-component model using the (current best) estimated number of cache misses. 37 3.3. Accurately Modeling Cache Contention with Many Threads bution of unique cache blocks accessed in a set by thread Ti —as follows: dTi(d,g) (k) = b(k, nTi(d,g) ) for i = 2, 3, 4 (up to #threads); k = 1, 2, 3, 4 (up to cache assoc.) Step 5: For each group of circular sequences, calculate probH (d, g), the probability that the last access of a circular sequence cseq(d, n, r) in group g is not turned into a miss due to cache contention. Recall a circular sequence cseq(d, n, r) accesses d unique blocks in r memory accesses and here we focus on the subset of those cseq(d, n, r) where r is in group g. To model the probability that another thread (Ti ) accesses the cache set containing the circular sequence, we divide STi(d,g) , obtained from Step 3, by the total number of cache sets (which we will represent with S) to obtain STi(d,g) S . The intuition here is that the chance any given set (i.e., the set containing the circular se- quence) is accessed by a thread during a fixed time interval is proportional to the number of distinct sets accessed by the thread during that amount of time. We compute the desired probability as the sum of four parts (since there are four threads per core): (i) the probability that neither T2 , T3 , or T4 accesses the set containing the circular sequence, (ii) the probability that only one of T2 , T3 , or T4 accesses the set containing the circular sequence combined with the probability the number of distinct accesses to the set from that thread is less than or equal to the difference between the cache’s associativity A and d (i.e., the maximum number of distinct accesses from other threads without turning the last access of the circular sequence into a miss), (iii) the probability that two of T2 , T3 , or T4 access the set combined with the probability that the sum of the number of distinct accesses to the set from the two threads is less than or equal to the difference A − d, and (iv) the probability that T2 , T3 , and T4 all access the set combined with the probability that the sum of the number of distinct accesses to the set from all three threads is less than or equal to the difference A − d. 38 3.3. Accurately Modeling Cache Contention with Many Threads The general formula for computing probH (d, g) for arbitrary number of threads N, and associa( N−1 X STi (d,g) (1 − )+ S i=2 j=1 tivity A is: i=N Y X 2≤i1 ,i2 ,...,iN−1 ≤N i1 6=i2 ...6=iN−1 P ∀αin ∈1..A, N−1 n=j αin ≤A−d "j−1 #) Y Y STi (d,g) STin (d,g) N−1 n (1 − ) dTin (d,g) (αin ) S S n=1 n=j To illustrate the above formula, we show a concrete example of computing probH (2, g) for the case when the number of co-scheduling threads (N) is four and the cache associativity (A) is four. Recall probH (2, g) is the probability of a hit for all circular sequences in group g that access two unique blocks. Based upon the formula, probH (2, g) is modeled as the sum of three components. The first component is (1 − ST2 (2,g) )(1 S − ST3 (2,g) )(1 S − ST4 (2,g) ) S corresponding to the part (i) in Step 5. The second component is ST2 (2,g) dT2 (2,g) (1))(1 S ST3 (2,g) )(1 S − ST4 (2,g) ) S + (1 − ST2 (2,g) ST3 (2,g) )( S dT3 (2,g) (1))(1 S − ST4 (2,g) ) S + (1 − ST2 (2,g) )(1 S ST3 (2,g) ST4 (2,g) )( S dT4 (2,g) (1)) S + ( − ST3 (2,g) )(1 S − ST4 (2,g) ) S + (1 − ST2 (2,g) ST3 (2,g) )( S dT3 (2,g) (2))(1 S − ST4 (2,g) ) S + (1 − ST2 (2,g) )(1 S ( ST2 (2,g) dT2 (2,g) (2))(1 S − − − ST3 (2,g) ST4 (2,g) )( S dT4 (2,g) (2)) S and it corresponds to the part (ii) in Step 5. Here each term represents a different combination of a single thread Ti , i > 1 accessing the set and introducing one (e.g., dT2 (2,g) (1)) or two (e.g., dT2 (2,g) (2)) unique blocks. Such interleaved accesses would leave a circular sequence cseq(2, n, r) 39 3.3. Accurately Modeling Cache Contention with Many Threads from T1 a cache hit. The third component is ( ST2 (2,g) S dT2 (2,g) (1))( T3S(2,g) dT3 (2,g) (1))(1 S ( ST2 (2,g) dT2 (2,g) (1))(1 S (1 − − ST4 (2,g) ) S + ST3 (2,g) ST4 (2,g) )( S dT4 (2,g) (1)) S + − ST2 (2,g) ST3 (2,g) S )( S dT3 (2,g) (1))( T4S(2,g) dT4 (2,g) (1)) S corresponding to the part (iii) in Step 5. Note for probH (2, g) there is no component that is related to the part (iv) described in Step 5 since a cseq(2, n, r) can at most tolerate two distinct tags from other threads without turning the last access in the circular sequence into a miss in a four-way LRU cache. Figure 3.5 illustrates each term in the calculation for probH (2, g) by showing an associated example in which the last access of a circular sequence cseq(2, n, r) from T1 still hits in the cache when T1 is co-scheduled with T2, T3, and T4. In this figure, a block labeled XTi represents a cache block X requested by Ti. Note only one example is shown for each term and other similar combinations are taken into account by the probability calculation since the order of blocks between the first and last access can change (e.g., the sequence AT1 BT1 CT2 AT1 and AT1 CT2 BT1 AT1 are both accounted for by the same term in the probability calculation). Step 6: Finally, the total number of extra misses is calculated as: rg 4 X 12 X X [(1 − probH (d, g)) × num(cseq(d, ∗, r))] d=1 g=1 r=rg Here 4 is the cache associativity and 12 is the number of groups of distances that we use to classify a circular sequence. For our study, we implement the above calculations in Matlab and the running time (on an Intel Core 2 Duo desktop system) is on the order of 0.3 seconds. This is the time to calculate prob H 40 3.3. Accurately Modeling Cache Contention with Many Threads and the extra number of cache misses, given S(x), b(i, x), and information about circular sequences frequencies. The run time is insensitive to different benchmarks, the number of instructions per benchmark, and the number of sets. However, the run time scales superlinearly as the number of cache’s associativity increases (0.3 seconds corresponds to computing the data in Figure 6.6 for the cache configurations listed in Table 5.3). 41 3.3. Accurately Modeling Cache Contention with Many Threads A circular sequence when T1 runs alone AT1 BT1 AT1 None of T2, T3, and T4 inserts a tag into the circular sequence AT1 BT1 AT1 S (2,g) S (2,g) S (2,g) (1- T2 )(1- T3 )(1- T4 ) S S S One of T2, T3, and T4 insert tags without turning the last access of the circular sequence into a miss AT1 BT1 ET3 AT1 S (2,g) S (2,g) S (2,g) ( T2 dT2(2,g)(1))(1- T3 )(1- T4 ) S S S S (2,g) ST3(2,g) S (2,g) (1- T2 )( dT3(2,g)(1))(1- T4 ) S S S AT1 BT1 GT4 AT1 S (2,g) ST4(2,g) S (2,g) (1- T2 )(1- T3 )( dT4(2,g)(1)) S S S AT1 BT1 CT2 AT1 AT1 BT1 CT2 DT2 AT1 S (2,g) S (2,g) S (2,g) ( T2 dT2(2,g)(2))(1- T3 )(1- T4 ) S S S AT1 BT1 ET3 FT3 AT1 S (2,g) ST3(2,g) S (2,g) (1- T2 )( dT3(2,g)(2))(1- T4 ) S S S AT1 BT1 GT4 HT4 AT1 S (2,g) ST4(2,g) S (2,g) (1- T2 )(1- T3 )( dT4(2,g)(2)) S S S Two of T2, T3, and T4 insert tags without turning the last access of the circular sequence into a miss AT1 BT1 CT2 ET3 AT1 S (2,g) S (2,g) S (2,g) ( T2 dT2(2,g)(1))( T3 dT3(2,g)(1))(1- T4 ) S S S AT1 BT1 CT2 GT4 AT1 S (2,g) ST4(2,g) S (2,g) ( T2 dT2(2,g)(1))(1- T3 )( dT4(2,g)(1)) S S S AT1 BT1 ET3 GT4 AT1 S (2,g) ST3(2,g) S (2,g) (1- T2 )( dT3(2,g)(1))( T4 dT4(2,g)(1)) S S S Figure 3.5: Explanation of each term of probH (2, g) 42 Chapter 4 Modeling Fine-Grained Multithreaded Throughput In this chapter, we present several analytical models for predicting the throughput of fine-grained multithreaded architectures. The models proposed in this thesis apply directly to multiprogrammed workloads in which threads do not communicate with each other (i.e., incur synchronization overheads). To model performance of parallel workloads in which threads communicate, the models proposed here could be employed as the lower-level system model of a two-level hierarchical parallel performance model such as the one proposed by Adve and Vernon [1]. In such models, the lower-level system model is invoked to predict the throughput of the system up until the next synchronization point. Moreover, if an OS performs unrelated processing on some thread contexts, one could treat the OS as a thread with a different memory access pattern while any synchronization among threads via the OS could potentially be modeled by incorporating our model with the two-level hierarchical model mentioned above. 4.1 Sum of Cycles Model One approach to predicting overall throughput is to simply assume no overlapping of execution occurs on a core. While this assumption is pessimistic for memory-bound applications, it does match the case where an application is entirely compute-bound. We evaluate this model using: 43 4.2. Sum of IPCs Model throughputC = PN C i=1 num instTi P , NC i=1 cycleTi where throughputC is the throughput on core C, num instTi is the number of instructions executed by thread Ti, cycleTi is the number of cycles taken by the thread to run those instructions when the thread runs alone and NC is the total number of threads running on core C. In Section 6.2.1 we will show that this model always underestimates the real throughput of a fine-grained multithreaded core on our workloads (by 58% on average). 4.2 Sum of IPCs Model The model in Section 4.1 always underestimates throughput since it ignores the fact that stalls from different threads on a core can overlap. To overcome this drawback, a simple approach is to sum up each thread’s isolated instructions per cycle (IPC)—the IPC when the thread runs alone. This works well when the execute stage of a fine grained multithreaded pipeline is lightly utilized by a single thread. Such behavior is common for memory-bound applications. However, we find that this model will always overestimate the real throughput (by 55% on average for our workloads as shown in Section 6.2.1). 4.3 Bernoulli Model Both models above oversimplify pipeline resource sharing in fine-grained multithreaded architectures. To more accurately model overlapping of stalls from threads running on a core, we assume that every cycle each thread performs a “coin toss” (or Bernoulli trial) to determine whether it has an instruction ready to execute. Ignoring the effects of resource contention between threads, the probability that a thread Ti has an instruction ready to execute can be estimated as IPCTi (the thread’s isolated IPC) since this is a value between zero and one for a single-issue pipeline. Then 1 − IPCTi can be used to estimate the probability that Ti cannot issue an instruction on any 44 4.4. A Markov Chain Model p01 p00 S0 p40 S1 p10 p41 p 20 p30 p44 p11 p21 p31 p42 S4 p12 S2 p22 p32 p43 p34 p23 S3 p33 Figure 4.1: A Markov chain model for a four-thread fine-grained multithreaded architecture. Sn represents the state in which n threads are suspended and pij represents the transition probability from Si to Sj . given cycle and QNC i=1 (1 − IPCTi ) becomes an estimate for the probability that no thread has an instruction ready to issue on a given cycle, where NC is the number of threads running on a core and IPCTi represents the isolated IPC of thread Ti. Thus, the throughput of fine-grained multithreaded core C can be estimated as the product of the peak issue width (1.0) and the probability that at QC (1 − IPCTi )). least one thread can issue an instruction in a cycle: throughputC = 1.0 × (1 − Ni=1 Concurrent with our work, Govindaraju et al. proposed a customized analytical model [26] similar to our Bernoulli model. In Section 6.2.1 we will show that our Bernoulli model reduces error to an overestimate of real throughput of 23% on average. 45 4.4. A Markov Chain Model 4.4 A Markov Chain Model The techniques proposed from Section 4.1 to 4.3 do not model contention in the memory system or the time dependence of long latency stall events. To account for these factors, we use a Markov chain to model the partial hiding of stalls of a thread by executions from other threads. We will show in Section 6.2.1 that, when combined with the novel cache contention model we proposed in Section 3.3, the Markov chain model proposed in this section reduces the average error of modeled throughput to 7.9% compared against detailed simulation. Figure 4.1 illustrates an example of the Markov chain we use to model throughput for the case of a four-thread, single-issue, in-order, fine-grained multithreaded pipeline with a memory system that supports at most one outstanding long latency cache miss per thread. We note that the assumption of one outstanding cache miss per thread does match the designs of Sun Niagara T1 and T2 processors [47]. Experiments with our detailed simulator show that supporting additional outstanding cache misses per thread improves performance by 4.6% on average for the configuration and workloads described in Table 5.3 and Table 5.4. Extending our models for multiple outstanding cache misses per thread is beyond the scope of this study and is left for future work. The Markov chain illustrated in Figure 4.1 has five states labeled from S0 to S4 , where state Sn corresponds to the case that n threads are currently suspended from issuing new instructions because of a prior long latency event such as a cache miss. In state S0 no thread is suspended (i.e., all threads are ready to issue instructions) whereas S4 means that all four threads are suspended (i.e., no thread is currently able to issue instructions). In any cycle, if there is at least one thread ready to issue then we assume one thread will issue an instruction. Thus, the throughput of an N-thread fine-grained multithreaded core can be modeled as 1 − prob(SN ), where the term prob(SN ) represents the probability of being at SN (no threads ready—S4 in Figure 4.1). From each state, only certain transitions are possible since only one thread can execute in a 46 4.4. A Markov Chain Model given cycle. The allowed transitions among states for a four-thread fine-grained multithreaded core are shown in Figure 4.1. There are three types of transitions: upstream transitions (illustrated by broken lines, e.g., p01 ), downstream transitions (illustrated by solid lines, e.g., p10 ), and staying transitions (illustrated by broken lines separated by dots, e.g., p00 ). An upstream transition corresponds to Sx → Sy , where x < y; a downstream transition corresponds to Sx → Sy , where x > y; and a staying transition does not change states4 . At most one thread can become newly suspended on a cycle, thus preventing transitions such as S0 → S2 , while multiple threads can be reactivated in the same cycle (e.g., one thread finishes a long latency floating point operation and another thread gets its required data from memory). To determine the state transition probabilities we first consider homogeneous workloads in Section 4.4.1 and then heterogeneous workloads in Section 4.4.2. 4.4.1 Homogeneous Workloads In this section we describe how to obtain the state transition probabilities pij assuming that all threads have the same program characteristics so that the probability of being suspended is the same for each thread. To simplify the discussion we also assume only one type of event can stall a thread. We will relax both assumptions in Section 4.4.2. To begin, we define two parameters: p and M. The term p represents the probability of a thread being suspended. For homogeneous workloads p is the same for all threads and can be calculated as the fraction of instructions causing a thread to be suspended. The term M represents the latency (in cycles) of the event causing a thread to be suspended. We model the probability of reactivating a suspended thread on any given cycle as 1 M since, in the next M cycles, the suspended thread can only 4 Although the state is not changed, some events might occur. For example, it is possible that the state remains unchanged while one suspended thread is reactivated and another thread becomes newly suspended. 47 4.4. A Markov Chain Model Table 4.1: Transition probability definitions for a four-thread fine-grained multithreaded architecture Upstream p01 p12 p23 p34 Value p p( M−1 M ) M−1 2 p( M ) 3 p( M−1 M ) (#susp., #act.) (1,0) (1,0) (1,0) (1,0) Downstream p10 p20 p21 p30 p31 p32 p40 p41 p42 p43 Value (1 − p)( 1M ) (1 − p)( 1M )2 (1 − p)( 1M ) + p( 1M )2 p)( 1M )3 ¡3(1 ¢ − M−1 1 2 (1 − p( M1 )3 ¡3¢p)1 1 (M−1M 2)( M ) ¡+ ¢ 1 2 (1 − p) 1 ( M )( M ) + p 31 ( M−1 M )( M ) 1 4 (M) ¡4¢ M−1 1 3 M )( M ) ¡41¢ (M−1 2 ( M1 )2 ¡24¢( 1M )M−1 3 1 ( M )( M ) (#susp., #act.) (0,1) (0,2) (0,1) or (1,2) (0,3) (0,2) or (1,3) (0,1) or (1,2) (0,4) (0,3) (0,2) (0,1) Staying p00 p11 p22 p33 p44 Value 1−p 1 (1 − p)( M−1 M ) ¡+¢p( M ) 2 1 M−1 M−1 2 (1 − p)( M ) + p¡ 1¢ ( M )( M ) 3 1 M−1 2 3 (1 − p)( M−1 M ) + p 1 ( M )( M ) M−1 4 ( M ) (#susp., #act.) (0,0) or (1,1) (0,0) or (1,1) (0,0) or (1,1) (0,0) or (1,1) (0,0) be reactivated in the last cycle. Thus, the probability that a suspended thread remains suspended on a given cycle is 1 − 1 M = M−1 M . Table 4.1 shows the transition probabilities and lists (in the third column) how many threads become newly suspended and/or activated for each possible transition for a four-thread fine-grained multithreaded architecture. Table 4.2 generalizes for an N-thread architecture. For example, con2 sider the upstream transition p23 (i.e., the probability of S2 → S3 ) is modeled as p( M−1 M ) , where p 2 corresponds to the probability that one thread becomes newly suspended, and ( M−1 M ) corresponds 48 4.4. A Markov Chain Model Table 4.2: Transition probability definitions for an N-thread fine-grained multithreaded architecture Upstream (j > i) j=i+1 Value pij = Downstream (j < i) i 6= N, j 6= 0 i 6= N, j = 0 i=N i p( M−1 M ) ¡ i ¢ 1 i−j M−1 j ¡ i ¢ 1 i−j+1 M−1 j−1 = (1 − p) i−j ( M ) ( M ) + p i−j+1 (M) ( M ) ¡ i ¢ 1 i−j M−1 j = (1 − p) i−j ( M ) ( M ) ¡ i ¢ 1 i−j M−1 j = i−j (M) ( M ) Staying (j = i) i 6= N, i 6= 0 i=0 i=N (1,0) Value pij pij pij Value i p)( M−1 M ) pii = (1 − pii = 1 − p i pii = ( M−1 M ) (#susp., #act.) ¡¢ i−1 + p i1 ( 1M )( M−1 M ) (#susp., #act.) (0,i-j) or (1,i-j+1) (0,i-j) (0,i-j) (#susp., #act.) (0,0) or (1,1) (0,0) (0,0) to the probability that both the threads that have already been suspended remain suspended. The term (1, 0) in the third column indicates that one thread becomes newly suspended and no threads are newly activated for this transition. Some state transitions can be achieved in more than one way. For instance, the downstream ¡¢ ¡¢ 2 + p 3 ( 1 )2 ( M−1 ). ) transition p32 (i.e., the probability of S3 → S2 ), is modeled as (1 − p) 31 ( M1 )( M−1 M M 2 M In the term before the plus sign, (1 − p) is the probability that the currently executing thread ¡¢ 2 does not become suspended and 31 ( 1M )( M−1 M ) is the probability that, in the three threads that have already been suspended, one thread is reactivated this cycle (probability M1 ) and the other ¡¢ 2 ). Note that 3 is the number of ways to choose the one two remain suspended (probability ( M−1 ) M 1 thread that is activated out of three threads. In the term after the plus sign, p is the probability that ¡¢ the current executing thread becomes newly suspended and 32 ( M1 )2 ( M−1 M ) denotes the probability that, in the three threads that have already been suspended, one thread remains suspended and the two other threads are reactivated. Having modeled the transition probabilities, we construct the transition matrix MT and find the 49 4.4. A Markov Chain Model steady state probability distribution vector by: − − → → vs = [prob(S0 ) prob(S1 ) ... prob(SN )] = lim − vi MnT = → vs MT n→∞ where MT = [pij ] is an (N + 1)-by-(N + 1) transition matrix, created using pij defined in Table 4.2, → and − vi = [1 0 ... 0] is the initial state vector (initially, all N threads are ready to issue an instruction). → The equation for − vs can be evaluated quickly using standard Markov chain analysis [29]. 4.4.2 Heterogeneous Workloads and Multiple Stall Conditions When the threads running on a fine-grained multithreaded core are different, the probability of being suspended for each thread is different. Moreover, there are usually multiple types of events that can stall a thread and these events often have different latencies. One way to extend the model described in Section 4.4.1 for heterogeneous workloads is to separate each state of the model into multiple states. For example, rather than using a single state S1 to represent that the number of suspended threads is one, we can use N states to specify which particular thread out of all N threads is suspended. Then different probabilities p for each thread are used for the transitions from S0 to one of each of these states. To completely specify which threads are suspended in all situations, we need 2N states, making the Markov model more complicated. To tackle the challenge of different thread types without introducing extra states, we instead keep the same Markov chain structure as in the homogenous workload case and instead compute a P single value p for all the threads as follows: p = Ni=1 pi wi , where p denotes the average probability that a thread will be suspended, pi is the probability that thread i will be suspended, and wi is a weight factor equal to the number of instructions executed for the thread, divided by the 50 4.4. A Markov Chain Model total number of instructions executed on the core. Therefore, p can be further expressed as: P P Ni Ii = Ni=1 Ntotal , where Ii is the number of instructions causing the thread i to be p = Ni=1 NIii Ntotal suspended, Ni is the number of instructions executed for thread i, and Ntotal is the total number of instructions executed on the fine-grained multithreaded core. Given the total number of instructions executed on a core, we first assume that the number of instructions executed by each thread on that core (i.e., Ni ) is proportional to the thread’s isolated IPC computed using a simple sum of CPI component model [27]. Then, based upon the estimated Ni for each thread, we use the novel cache contention model described in Section 3.3 to predict its number of extra cache misses due to cache sharing. Next, we add the product of the number of extra misses for each thread and the miss latency to the original number of cycles taken to execute Ni instructions and then divide the sum by Ni to estimate a new CPI for the thread. Finally, we update Ni for each thread to make it proportional to its new CPI and use the updated Ni to obtain p for the cache contention and Markov chain throughput model. Next, we tackle the issue of modeling systems where multiple events may suspend a thread and where each has a distinct latency. Similar to finding an average p for all threads, we compute P an average M for all the events that might suspend a thread as follows: M = ki=1 Mi si , where k, Mi , si represent the number of different types of events, latency of an event i, and the weight of the event, respectively. The latency of each event suspending a thread is assumed to be known (a simplification in the case of DRAM and interconnect contention). To find the weight of an event, we also need to know how many times the event occurs during execution. The frequency of long latency floating point instructions can be obtained from the instruction trace used during circular sequence profiling. To find the number of cache misses, the cache contention model described in Section 3.3 is used. Above, the weight si for an event is computed as the product of its number of occurrences and its latency, normalized as a fraction of total execution time. For example, if Event1 with 5-cycle latency happens 50 times and Event2 with 100-cycle latency happens 10 times, 51 4.4. A Markov Chain Model then the average M is calculated as M = 5 × 50×5 50×5+10×100 + 100 × 10×100 50×5+10×100 = 81. Although our Markov chain model is proposed for modeling fine-grained multithreaded architectures, we believe that, by considering state transitions to occur only for long latency L2 cache misses (rather than L1 misses) and scaling the estimated throughput by a factor taking account of the peak issue width of an SMT core (assuming a well balanced design [35]) and the memory level parallelism within a thread, our model can potentially provide performance predictions for more complex processor cores. However, a detailed evaluation of this extension is beyond the scope of this thesis. 52 Chapter 5 Methodology 5.1 Modeling Pending Cache Hits, Data Prefetching, MSHRs To evaluate our analytical model, we have modified SimpleScalar [7] to simulate the performance loss due to long latency data cache misses when accounting for a limited number of MSHRs. We compare against a cycle accurate simulator rather than real hardware to validate our models since a simulator provides insights that would be challenging to obtain without changes to currently deployed superscalar performance counter hardware [20]. We believe the most important factor is comparing two or more competing (hybrid) analytical models against a single detailed simulator provided the latter captures the behavior one wishes to model analytically. Table 5.1 describes the microarchitectural parameters used in this study. Note we are focusing on predicting only the CPI component for data cache misses using our model. We verified that the CPI contribution due to overlapping miss events is small for our benchmarks with realistic branch prediction and instruction caches [11] so our comparisons in Section 6.1 is to a detailed cycle accurate simulator in which instruction cache misses have the same latency as hits and all branches are predicted perfectly. In the rest of this thesis, we focus on how to accurately predict CP ID$miss , which is the performance loss due to long latency data cache misses when both branch predictor and instruction cache are ideal (this is the same methodology applied to model CP ID$miss described in [34]). To evaluate the technique proposed in Section 2.2.3 for estimating the CP ID$miss when a prefetching mechanism is applied, we have applied our modeling techniques to predict the perfor53 5.1. Modeling Pending Cache Hits, Data Prefetching, MSHRs Table 5.1: Microarchitectural parameters Machine Width ROB Size LSQ Size L1 D-Cache L2 Cache Memory Latency 4 256 256 16KB, 32B/line, 4-way, 2-cycle latency 128KB, 64B/line, 8-way, 10-cycle latency 200 cycles mance benefit of three different prefetching mechanisms: prefetch-on-miss [62], tagged prefetch [25], and stride prefetch [3]. When prefetch-on-miss [62] is applied, an access to a cache block that results in a cache miss will initiate a prefetch for the next sequential block in memory given that the block is not in the cache. The tagged prefetch mechanism [25] adds a tag bit to each cache block to indicate whether the block was demand-fetched or prefetched. When a prefetched block is referenced, the next sequential block is prefetched if it is not in the cache. The stride prefetch technique [3] uses a reference prediction table (RPT) to detect address referencing patterns. Each entry in the RPT is assigned a state and a state machine is applied to control the state of each entry. Whether a prefetch is initialized or not depends on the current state of the entry [3]. In this study, we modeled a 128-entry, 4-way RPT that is indexed by the microprocessor’s program counter (PC). To stress our model, we simulate a relatively small L2 cache compared to contemporary microprocessors. We note that the size of the L2 cache that we simulated is close in size to those employed in microprocessors shipped at the time when those benchmarks we use were released. The benchmarks chosen are ones from SPEC 2000 [63] and OLDEN [8] that have at least 10 long latency data cache misses for every 1000 instructions simulated (10MPKI). Table 5.2 illustrates the miss rates of these benchmarks and the labels used to represent them in figures. Moreover, for each benchmark, we select 100M representative instructions to simulate using the Sim-Point toolkit [61]. 54 5.2. Modeling Cache Contention and Throughput Table 5.2: Simulated benchmarks Benchmark 173.applu 179.art 183.equake 189.lucas 171.swim 181.mcf em3d health perimeter 470.lbm 5.2 Label app art eqk luc swm mcf em hth prm lbm Miss rate 31.1MPKI 117.1MPKI 15.9MPKI 13.1MPKI 23.5MPKI 90.1MPKI 74.7MPKI 45.7MPKI 18.7MPKI 17.5MPKI Suite SPEC 2000 SPEC 2000 SPEC 2000 SPEC 2000 SPEC 2000 SPEC 2000 OLDEN OLDEN OLDEN SPEC 2006 Modeling Cache Contention and Throughput We evaluated the accuracy of our analytical models in three steps. First, we compared our analytical models against a detailed simulator for a microarchitecture similar to Sun Microsystems’ Niagara T1 [37] that we developed by modifying SMTSIM [67]. Table 5.3 shows the microarchitectural parameters simulated. As discussed earlier in Section 4.4, our baseline model allows at most one outstanding miss per thread. We do not enforce cache inclusion between the L1 and the L2 cache (earlier cache contention studies have shown this has minimal effect [9]). Second, we applied our combined analytical cache contention and Markov chain throughput model to obtain two optimized application-specific multithreaded processor designs. Finally, we validated our analytical models by comparing their predictions against a Sun Fire T1000, which has a T1 Niagara processor containing 8 cores each with 4 thread contexts (32 threads total) and runs the Solaris 10 operating system. We used Shade [15] to collect instruction traces that are later analyzed to obtain inputs to our models. We report the arithmetic mean of the absolute value of error since it reports the largest error number and since we are interested in averaging the error of the IPC prediction (not the average speedup), but we also report geometric mean and harmonic mean of the absolute error to 55 5.2. Modeling Cache Contention and Throughput Table 5.3: Parameters of the simulated architecture # Cores # Threads Pipeline Branch Pred. L1-L2 Interconnect L1 Data Cache L2 Cache L1 Inst. Cache L2 Cache ITLB/DTLB Memory Latency 8 in-order cores 32, 4 threads per core 6 stages 4Kb gShare, 256-entry BTB, private Crossbar [47] 16KB, 16B/line, 4-way, LRU, 1-cycle hit latency, private 3MB banked, 64B/line, 12-way, LRU, 10-cycle hit lat., global 16KB, 16B/line, 4-way, LRU, 1-cycle hit latency, private 3MB banked, 64B/line, 12-way, LRU, 10-cycle hit lat., global 64 entry, private 110 cycles Table 5.4: Simulated workloads for each core Core Core Core Core Core Core Core Core 1 2 3 4 5 6 7 8 Workload Workload Workload Workload Workload Workload Workload Workload 1 1 2 2 3 3 4 4 (C1W1) (C2W1) (C3W2) (C4W2) (C5W3) (C6W3) (C7W4) (C8W4) ammp-applu-art-mcf ammp-applu-art-mcf bzip2-mgrid-swim-equake bzip2-mgrid-swim-equake ammp-art-mgrid-equake ammp-art-mgrid-equake applu-bzip2-mcf-swim applu-bzip2-mcf-swim allay any concern that these numbers might lead to different conclusions. When we validated our cache contention model, we used our detailed simulator and hardware performance counters to obtain information about the number of instructions executed by each thread. Then for each thread, we analyzed its memory instructions trace to obtain the two temporal locality metrics that we proposed in Section 3.2 as well as the circular sequence profile. Next, we predicted the number of extra misses of each thread using the cache contention model described in Section 3.3 and compared the modeled extra misses to the result from our detailed simulator and performance counters. To obtain the actual number of extra misses, we also ran each thread alone 56 5.2. Modeling Cache Contention and Throughput to obtain the number of cache misses without cache sharing. When we validated our Markov chain throughput model, we first approximated the number of instructions executed by each thread as proportional to its isolated IPC (we used performance counters to obtain the isolated IPC when we compared to hardware), given the total number of instructions executed by all threads in a core. Then we applied our cache contention model to predict the number of extra misses for each thread and adjusted each thread’s IPC by using the product of the number of extra misses and the miss latency. Then we re-estimated the number of instructions executed by each thread based on the thread’s refined IPC. Next we applied the cache contention model again to approximate the extra number of misses for each thread based on its refined instruction count. Finally, we used the refined instruction count and the extra number of misses for each thread as the input to our Markov chain model to estimate the throughput5 . For the comparison against the detailed simulator, we chose 8 benchmarks from the SPEC 2000 Benchmark Suite [63] to form heterogeneous multiprogrammed workloads for each fine-grained multithreaded core being simulated. Table 5.4 shows the simulated workload of each core for our first study. We ran our detailed simulator such that each core executed at least 400 million instructions. For our hardware comparison, we evaluated homogeneous workloads consisting of a varying number of threads, each running a memory intensive application (mcf). We also evaluated several heterogeneous workloads consisting of multiple instances of two types of applications to obtain 32 threads in total running concurrently. Specifically, we consider workloads consisting of the following combinations: 16gzip+16eqk (each core runs two gzip and two equake), 16mcf+16gzip, 16mcf+16art. We compiled these benchmarks on the Sun Fire T1000 using gcc 3.4.3 with -O2 optimization and ran them with their train inputs (mcf and art) and test inputs (gzip and equake). 5 We found that additional “iterations” through the model to update the relative IPC of each thread due to cache contention did not improve model accuracy. 57 5.2. Modeling Cache Contention and Throughput To obtain miss latencies required by our throughput model, we created several simple (about 20 lines in C) microbenchmarks, which we ran on a single thread context. Based upon these separate calibration measurements, we used fixed latencies of 20 cycles, 220 cycles, and 50 cycles for an L1 cache miss that hits in the L2 cache, an L2 cache miss, and a floating point instruction, respectively (these are higher than reported in [59]). 58 Chapter 6 Experimental Results 6.1 Modeling Pending Cache Hits, Data Prefetching, MSHRs This section summarizes our experimental results. 6.1.1 Modeling Pending Data Cache Hits Section 2.1 describes prior proposals for compensating for the overestimation of modeled penalty cycles per serialized miss using a fixed number of cycles. Figure 6.1(a) and Figure 6.1(b) illustrate the modeled results after compensation with constant cycles both without, and with the pending hit compensation technique described in Section 2.2.1, respectively. In these two figures, we show results using five different constant compensation factors. The first bar (oldest) corresponds to the assumption that an instruction that misses in the cache is always the oldest one in the instruction window when it issues (accesses the first level cache). The second bar (1/4) corresponds to the assumption that there are always 1 4 ROBsize = 64 in-flight instructions older than a cache miss when it issues and it is similar to the next two bars, (1/2) and (3/4). The fifth bar (youngest) corresponds to the assumption that there are always ROBsize − 1 older instructions in the window when the instruction issues (i.e., the instruction is always the youngest one in the window when it issues). The last bar (actual) shows the simulated penalty cycles per cache miss from cycle accurate simulation. From this data, we observe that there is no one fixed cycle compensation method that performed consistently the best for all of the benchmarks we studied. For example, 59 6.1. Modeling Pending Cache Hits, Data Prefetching, MSHRs in Figure 6.1(a) we observe that error is minimized using “youngest” for app, art, luc, swm, and lbm, but minimized using “oldest” for em, mcf, and hth, while, eqk and prm requires something in-between. The harmonic mean for each fixed cycle compensation method is also shown and we notice that, due to the fact that positive and negative errors cancel out, the harmonic means of some fixed cycle compensation methods appear close to the detailed simulation results. However, it is important to recognize that their accuracy on individual benchmarks is quite poor. By using the fixed cycle compensation method, we find that the smallest arithmetic mean of absolute error is 43.5% when not modeling pending hits and 26.9% when modeling pending hits, resulting when penalty cycles per miss employing “youngest” compensation. oldest 1/4 1/2 3/4 youngest actual 200 150 100 50 0 app art eqk luc swm mcf em hth prm lbm HM penalty cycles per miss (a) Not modeling pending data cache hits oldest 1/4 1/2 3/4 youngest actual 200 150 100 50 0 app art eqk luc swm mcf em hth prm lbm HM (b) Modeling pending data cache hits Figure 6.1: Penalty cycles per miss with fixed number of cycles compensation for plain profiling (Unlimited MSHRs) 60 6.1. Modeling Pending Cache Hits, Data Prefetching, MSHRs To account for the distinct behavior of each benchmark, we use the average distance between two consecutive cache misses to compensate for the overestimation of the modeled extra cycles due to long latency data cache misses as described in Section 2.2.2. Figure 6.2(a) compares the CP ID$miss for both the plain profiling technique described in Section 2.1 and the start-with-a-miss (SWAM) profiling technique described in Section 2.2.5 (with pending hits modeled) to the results from detailed simulation. The first bar (Plain w/o comp) and the third bar (SWAM w/o comp) correspond to the modeled results without any compensation; the second bar (Plain w/ comp) and the fourth bar (SWAM w/ comp) are the modeled results with the compensation technique described in Section 2.2.2. Figure 6.2(a) and Figure 6.2(b) show that for benchmarks with heavy pointer chasing such as mcf , em3, and hth, ignoring the effects of pending data cache hits results in a dramatic underestimate for CP ID$miss . As discussed in Section 2.2.1, the reason for this is that many data independent misses are connected by pending cache hits, which must be appropriately modeled. Moreover, as we expect, SWAM profiling is more accurate than plain profiling since it can capture more overlapping data cache misses. Figure 6.2(b) illustrates the error of each modeling technique after compensation. From Figure 6.2(b) we observe that the arithmetic mean of the absolute error (mean) decreases from 39.7% to 29.3% when modeling pending cache hits. We also observe from Figure 6.2(b) that the arithmetic mean of the absolute error for SWAM profiling when pending data cache hits are modeled (SWAM w/ PH) is about 3.9 times lower than plain profiling when pending hits are not modeled (Plain w/o PH): the arithmetic mean of the absolute error decreases from 39.7% to 10.3%. Geometric mean of the absolute error decreases from 26.4% to 8.2%, and harmonic mean of the absolute error decreases from 15.3% to 6.9%. Accuracy also improves, and not just for “micro-benchmarks” [72]: In Figure 6.2(b), comparing “Plain w/o PH” to “SWAM w/ PH”, we find that, on average, the arithmetic mean of the absolute error decreases from 31.6% to 9.1% for the five SPEC 2000 benchmarks excluding mcf . 61 1.6 8 1.4 7 CPI due to D$miss CPI due to D$miss 6.1. Modeling Pending Cache Hits, Data Prefetching, MSHRs 1.2 1 0.8 0.6 0.4 0.2 6 5 4 3 2 1 0 0 app art eqk luc swm prm lbm mcf em hth (a) CPI due to D$miss n ea m ht h em cf SWAM w/ PH m lb m pr m m Plain w/ PH sw lu c eq k ap p error (%) ar t Plain w/o PH 100 80 60 40 20 0 -20 -40 -60 -80 -100 (b) Modeling error Figure 6.2: CPI due to D$miss and modeling error for different profiling techniques (unlimited MSHRs) 62 6.1. Modeling Pending Cache Hits, Data Prefetching, MSHRs Modeling Different Prefetching Techniques 1.4 8 1.2 7 CPI due to D$miss CPI due to D$miss 6.1.2 1 0.8 0.6 0.4 0.2 6 5 4 3 2 1 0 0 app art eqk luc swm prm lbm mcf em hth (a) CPI due to D$miss with different prefetching techniques app art eqk luc POM w/o PH Stride w/ PH swm prm lbm Tag w/ PH Stride w/o PH mcf em hth mean error (%) 100 80 60 40 20 0 -20 -40 -60 -80 -100 POM w/ PH Tag w/o PH (b) Modeling error with different prefetching techniques Figure 6.3: CPI due to D$miss and modeling error while prefetch-on-miss (POM), tagged prefetch (Tag), or stride prefetch (Stride) technique is applied. In this section, we evaluate CP ID$miss when modeling the three prefetching techniques mentioned in Section 5.1 (with unlimited MSHRs). Figure 6.3(a) compares the actual CP ID$miss to the modeled one for the three prefetching methods. For each prefetching method, both the prediction 63 6.1. Modeling Pending Cache Hits, Data Prefetching, MSHRs when each pending hit is analyzed according to the algorithm described in Figure 2.5 (labeled “w/ PH”) and the prediction when pending hits are treated as normal hits (labeled with “w/o PH”) are shown. We use SWAM in both cases. When employing the algorithm in Figure 2.5, we apply SWAM as follows: When we analyze the trace we let each profile step start with a miss or a hit due to a prefetch. The latter refers to a demand request whose data was brought into the data cache by a previous prefetch (we start with it since its latency may not be fully hidden and thus it may stall commit). Figure 6.3(b) shows the error of the model for each benchmark. From Figure 6.3(b) we observe that if pending hits are not appropriately modeled (i.e., a pending hit is simply treated as a hit and not analyzed based upon the algorithm in Figure 2.5), the modeled CP ID$miss always underestimates the actual CP ID$miss . The reason is that with a prefetching technique applied, a large fraction of the misses occurring when there is no prefetching become pending hits since prefetches generated by that prefetching technique cannot fully hide the memory access latency of those misses. By using the method of analyzing pending hits that we propose in Section 2.2.3 to model prefetching, the arithmetic mean of the absolute error decreases from 22.2% to 10.7% for prefetch-on-miss, from 56.4% to 9.4% for tagged prefetch technique, and from 72.9% to 21.3% for stride prefetch technique (i.e., the arithmetic mean of the absolute error decreases from 50.5% to 13.8% overall for the three data prefetching techniques modeled). 6.1.3 Modeling Limited Number of MSHRs All of the results that we have seen thus far are for modeling a processor with an unlimited number of MSHRs. This section compares modeled CP ID$miss when the number of available MSHRs is limited. Figure 6.4(a), (b), and (c) compare the modeled CP ID$miss to the simulated results when the maximum number of MSHRs in a processor is sixteen, eight, and four, respectively. We show data for eight MSHRs and four MSHRs since we note that Prescott has only eight MSHRs [6] and Williamette has only four MSHRs [28]. For each benchmark, the first bar (Plain w/o MSHR) shows 64 Plain w/ MSHR actual SWAM app art eqk luc swm mcf em Plain w/o MSHR SWAM-MLP 9 8 7 6 5 4 3 2 1 0 Plain w/ MSHR actual app art hth prm lbm HM (a) NM SHR = 16 Plain w/o MSHR SWAM-MLP SWAM CPI due to D$miss Plain w/o MSHR SWAM-MLP 9 8 7 6 5 4 3 2 1 0 CPI due to D$miss CPI due to D$miss 6.1. Modeling Pending Cache Hits, Data Prefetching, MSHRs eqk luc swm mcf em 9 8 7 6 5 Plain w/ MSHR actual SWAM 4 3 2 1 0 app art eqk luc swm mcf em hth prm lbm HM hth prm lbm HM (b) NM SHR = 8 (c) NM SHR = 4 (a) NM SHR = 16 (b) NM SHR = 8 lb m ea n m ht h pr m cf em Plain w/ MSHR SWAM-MLP m lu c sw m Plain w/o MSHR SWAM ar t eq k error (%) 100 80 60 40 20 0 -20 -40 -60 -80 -100 ap p m lb ea n m ht h pr m cf em Plain w/ MSHR SWAM-MLP m lu c sw m Plain w/o MSHR SWAM ar t eq k 100 80 60 40 20 0 -20 -40 -60 -80 -100 ap p error (%) m lb ea n m ht h pr m cf em Plain w/ MSHR SWAM-MLP m lu c sw m Plain w/o MSHR SWAM ar t eq k 100 80 60 40 20 0 -20 -40 -60 -80 -100 ap p error (%) Figure 6.4: CPI due to D$miss for NM SHR = 16, NM SHR = 8, and NM SHR = 4. (c) NM SHR = 4 Figure 6.5: Error of the modeled CP ID$miss for NM SHR = 16, NM SHR = 8, and NM SHR = 4. the modeled CP ID$miss from plain profiling (i.e., it is not aware that there are a limited number of MSHRs and always provides the same result for each benchmark) and the second bar (Plain w/ MSHR) shows the modeled CP ID$miss from plain profiling with the technique of modeling a limited number of MSHRs (Section 2.2.4) included. The third and the fourth bar illustrates the modeled CP ID$miss from SWAM (Section 2.2.5) and SWAM-MLP (Section 2.2.5), respectively. For these four profiling techniques, pending hits are modeled using the method described in Section 2.2.1. The modeling error based on the data in Figure 6.4(a)–(c) is illustrated in Figure 6.5(a)–(c). SWAM-MLP is consistently better than SWAM. We observe that as the total number of MSHRs decreases, the advantage of SWAM-MLP over SWAM becomes significant, especially for eqk, mcf , 65 6.1. Modeling Pending Cache Hits, Data Prefetching, MSHRs em, and hth, for which it is more likely to have data dependence among cache misses thus affecting the size of the profiling window that SWAM-MLP chooses. SWAM decreases the arithmetic mean of the absolute error from 32.6% (Plain w/o MSHR) to 9.8%, from 32.4% to 12.8%, and from 35.8% to 23.2%, when the number of MSHRs is sixteen, eight, and four, respectively6 . SWAM-MLP further decreases the error to 9.3%, 9.2%, and 9.9%7 (i.e., SWAM-MLP decreases the error of plain profiling (Plain w/o MSHR) from 33.6% to 9.5% when the number of MSHRs is limited). Note that accuracy improves not only for pointer-chasing benchmarks: For the SPEC 2000 benchmarks excluding mcf , average error reduces from 48.1% to 7.0% comparing Plain w/o MSHR to SWAM-MLP. 6.1.4 Putting It All Together We also evaluated the combination of the techniques for modeling prefetching (Section 2.2.3) and SWAM-MLP to model the performance of the three prefetching methods with limited MSHRs. On average, the error of modeling prefetching is 15.2%, 17.7%, and 20.5%, when the number of MSHRs is sixteen, eight, and four, respectively (average of 17.8% across all three prefetch methods). 6.1.5 Speedup of the Hybrid Analytical Model One of the most important advantages of the hybrid analytical model we present in this thesis versus detailed simulations is its fast speed of analysis. On average our model is 150, 156, 170, and 229 times faster than the detailed simulator when the number of MSHRs is unlimited, sixteen, eight, and four, respectively, with a minimum speedup of 91×. Moreover, for estimating the performance impact of prefetching, on average our model is 184, 185, 215, and 327 times faster than the detailed simulator when the number of MSHRs is unlimited, sixteen, eight, and four, respectively, with 6 Geometric mean is reduced from 19.4% to 7.4%, from 21.5% to 9.7%, and from 21.8% to 10.9%; harmonic mean is reduced from 8.5% to 5.8%, from 14.5% to 7.0%, and from 10.2% to 5.1% 7 Geometric mean of the absolute error is 6.5%, 6.7%, 5.2%, and harmonic mean of the absolute error is 4.6%, 5.2%, 3.3%, when the number of MSHRs is sixteen, eight, and four, respectively. 66 6.2. Modeling Cache Contention and Throughput a minimum speedup of 87×. These speedups were measured on a 2.33 GHz Intel Xeon E5345 processor. 6.2 Modeling Cache Contention and Throughput In this section we evaluate our analytical cache contention model and throughput models for finegrained multithreaded architectures against both a cycle-accurate performance simulator and real hardware. 6.2.1 Model Accuracy Evaluation Figure 6.6(a) and 6.6(b) illustrate the impact of cache sharing on the number of extra misses for our L1 data cache and L2 cache, respectively, when the heterogeneous workloads described in Table 5.4 are simulated. For each thread, we plot the ratio of the number of extra misses due to cache contention to the number of misses without cache contention (i.e., when the thread runs alone). Figure 6.6(a) and 6.6(b) show that our cache contention model accurately predicts the number of additional cache misses due to cache sharing across varying workloads. We find that modeling error is less significant for L2 misses, which also have a higher impact on performance than L1. For the L1 cache, the arithmetic mean of absolute error of our model is 40.3%, geometric mean is 12.1%, harmonic mean is 0.9%, and the maximum absolute error is 150.0%. For the L2 cache, the arithmetic mean of absolute error of our model is 4.4%, geometric mean is 1.1%, harmonic mean is 0.2%, and the maximum absolute error is 20.3%. Figure 6.7 illustrates a scatter plot for the predicted and the simulated ratio pair. Each data point on this figure corresponds to a pair of bars in Figure 6.6(a) and Figure 6.6(b). From Figure 6.7 we notice that our predicted values are highly correlated with the simulated ones. The correlation coefficient between the predicted and the simulated ratio for L1 and L2 cache is 0.9993 and 0.9999, 67 6.2. Modeling Cache Contention and Throughput extra L1 misses simulated 1x modeled T1 T2 T3 T4 T1 T2 T3 T4 T1 T2 T3 T4 T1 T2 T3 T4 T1 T2 T3 T4 T1 T2 T3 T4 T1 T2 T3 T4 T1 T2 T3 T4 0.01x 0.0001x C1W1 C2W1 C3W2 C4W2 C5W3 C6W3 C7W4 C8W4 0.000001x (a) L1 data cache simulated extra L2 misses 1000x modeled 100x 10x 1x 0.1x 0.01x T1 T2 T3 T4 T1 T2 T3 T4 T1 T2 T3 T4 T1 T2 T3 T4 T1 T2 T3 T4 T1 T2 T3 T4 T1 T2 T3 T4 T1 T2 T3 T4 C1W1 C2W1 C3W2 C4W2 C5W3 C6W3 C7W4 C8W4 (b) L2 cache Figure 6.6: Ratio of the number of extra L1 and L2 cache misses due to cache contention to the number of misses when each thread runs alone. The order of the threads in each core is shown in Table 5.4. respectively. We noticed that the average and maximum error for the L1 cache is higher than the L2 cache and we found it is mainly due to the fact that the number of extra L1 misses due to cache contention is much smaller than the number of extra L2 misses. For example, the maximum error of our model that predicts extra L1 misses is from T1 on Core 1, where the number of L1 misses is about 6153K without cache contention and it is increased by only 97K due to cache contention. For this thread, our model predicts the number of extra misses to be 242K, resulting a 150% error. However, this 68 6.2. Modeling Cache Contention and Throughput predicted (new model) 1000x L2 L1 More significant to performance 10x 0.00001x 0.0001x 0.001x 0.01x 0.1x Less significant to performance measured (detailed simulator) 0.1x 1x 10x 100x 1000x 0.001x 0.00001x Figure 6.7: Scatter plot for the predicted and the simulated cache misses increase pair error does not noticeably impact the accuracy of the Markov chain throughput model since the total number of extra L1 misses is quite small compared to the number of L1 misses without cache contention (i.e., the number of L1 misses increases only by 1.6%). As mentioned in Section 3.3, the probability of a miss being introduced into a circular sequence is a nonlinear function of the distance r. Hence the calculation of the probability of a miss is best computed by classifying circular sequences into groups based upon r. We classify circular sequences cseq(d, n, r) into different groups based upon r and calculate, for each group, the probability that a last access of a circular sequence from the group still hits in the cache when the cache is shared (i.e., probH (d, g) described in Section 3.3), using the average r of all the circular sequences in the group. The accuracy of our cache contention model proposed in Section 3.3 is affected by the number of groups chosen to cluster circular sequences. Figure 6.8(a) shows the arithmetic mean of the absolute error of our cache contention model for all the applications as the number of groups used to classify circular sequences varies from one to twenty. When there is a single group it includes all circular sequences. When there are two 69 average of abs. error 6.2. Modeling Cache Contention and Throughput 35% 30% 25% 20% 15% 10% 5% 0% 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 number of groups error of equake on Core 5 (a) Arithmetic mean of absolute errors with different number of groups 200% 150% 100% 50% 0% -50% 1 -100% 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 number of groups (b) Errors with different number of groups for equake running on Core 5 Figure 6.8: Impact of the number of groups used to classify circular sequences on the accuracy of our cache contention model groups the first group corresponds to all the circular sequences with 1 < r < 25 , and the second group contains all other circular sequences. When there are n groups (with n > 2), the first group corresponds to all the circular sequences with 1 < r < 25 , the next n − 2 groups correspond to 2i ≤ r < 2i+1 , i ∈ 5, 6, ..., n + 2, and the last group corresponds to all the circular sequences with r ≥ 2n+3 . It is interesting to note from Figure 6.8(a) that the accuracy of the cache contention model does not improve monotonically as the number of groups increases. To better understand the reason behind this seemingly counter-intuitive result, we also show, in Figure 6.8(b), the value of the error 70 6.2. Modeling Cache Contention and Throughput of our cache contention model for equake running on Core 5 (see Table 5.4) as it serves as a good representative for other applications. In Figure 6.8(b) we observe that, when the number of groups used is less than or equal to six, the errors of our cache contention model are significant negative values, indicating that the estimated numbers of extra cache misses due to cache contention are much less than the simulated results. Moreover, errors become significantly positive when the number of groups is between seven and eleven inclusive, meaning that the estimated numbers of extra cache misses are much more than the simulated counterparts. When the number of groups used to categorize circular sequences is less than or equal to six, we notice that the average distance calculated for all the circular sequences belonging to the last group (e.g., the fifth group if only five groups are used) is biased by the circular sequences with relatively small r, resulting in an underestimated probability that the last access of a circular sequence in the last group becomes a miss due to cache contention (i.e., an underestimated 1 − probH (d, g)). On the other hand, when the number of groups is between seven and eleven, the average distance calculated for the last group is biased by the circular sequences with relatively large r, leading to an overestimated 1 − probH (d, g). When the number of groups is greater than or equal to twelve, the error reduces and becomes stable, indicating that the number of groups is sufficiently large to effectively separate circular sequences with significantly different r. In our study, we used twelve groups to separate circular sequences and the range of r associated with each group is described in Section 3.3. Figure 6.9(a) compares the throughput on each core predicted by the different models described in Chapter 4 to the simulated throughput and Figure 6.9(b) shows the error of each model. The first bar (“sumCYCLE”) corresponds to the model described in Section 4.1 that always underestimates the actual throughput (with an arithmetic mean of the absolute error of 58%) since the model assumes that the execution of each thread on a core is completely serialized. The second bar (“sumIPC”) is obtained using the model described in Section 4.2 that always overestimates the actual throughput (with an error of 55% on average). The third bar (“Bernoulli”) represents the Bernoulli model 71 throughput (IPC) 6.2. Modeling Cache Contention and Throughput 1.2 sumCYCLE sumIPC Bernoulli MC simulated 1 0.8 0.6 0.4 0.2 0 C1W1 C2W1 C3W2 C4W2 C5W3 C6W3 C7W4 C8W4 (a) Throughput (IPC) sumCYCLE 60% 101% 101% C1W1 C2W1 sumIPC Bernoulli MC 40% error 20% 0% -20% C3W2 C4W2 C5W3 C6W3 C7W4 C8W4 -40% -60% -66% -66% (b) Error of different models Figure 6.9: Predicted throughput and error from different models compared to the simulated throughput described in Section 4.3 that is more accurate than the previous two models (with an average error of 23%), but its error is still significant for some workloads (e.g., the error is 54% for Core 1 Workload 1) since it does not take into account cache contention. Finally, the fourth bar (“MC”) represents the Markov chain model proposed in Section 4.4 combined with our cache contention model proposed in Section 3.3 that predicts the throughput significantly more accurately with an arithmetic mean of the absolute error over eight cores of 7.9% and a maximum error of 14.3%. 72 5000 4000 3000 2000 Binomial Trace 1000 0 0 20 40 60 80 K memory instructions (a) ammp 100 # cache set accessed # cache set accessed # cache set accessed 6.2. Modeling Cache Contention and Throughput 3000 2000 1000 Binomial Trace 0 0 20 40 60 80 K memory instructions 100 2000 1000 Binomial Trace 0 0 20 40 60 80 100 K memory instructions (b) mgrid (c) swim Figure 6.10: Average distinct number of cache sets accessed during a given number of consecutive memory instructions for a 12-way, 64B line, 4096-set LRU L2 cache, from both assuming binomial distribution of cache blocks among sets (Binomial) and analyzing program traces (Trace) 6.2.2 Modeling S(x) and b(i, x) Using Binomial Distributions Prior work [2, 64] proposes analytical models for cache contention of processes due to context switching in a time-sharing system. These models assume that all cache sets are equally likely to be accessed by a process and that the number of cache blocks accessed in a set follows a binomial distribution. Appendix A illustrates that applying these earlier models in the limiting case of fine grained switching cycle-by-cycle does not work (no cache contention is predicted). Our analytical cache contention model combines statistics of individual thread behavior to predict the performance when threads run together. It uses the functions S(x), b(i, x) which earlier we assume are profiled directly. In this section, we show how to use the binomial distribution assumption used in earlier analytical cache contention models to model both S(x) and b(i, x), and we quantify the impact this approach has on the accuracy of the model. The advantage of the approach in this section versus directly measuring S(x) and b(i, x) is that it enables us to profile an application only once even if we vary the number of sets or associativity of the cache. Recall that b(i, x) denotes the probability that the number of unique cache blocks being accessed in a cache set accessed is i during x consecutive memory references, given that the cache set is 73 6.2. Modeling Cache Contention and Throughput 0.8 0.6 0.4 0.2 1 Binomial Trace 0.8 0.6 0.4 0.2 0 2 3 4 5 6 7 8 9 10 11 12 # distinct accesses in a set (a) b(i, x = 1000) Binomial Trace 0.8 0.6 0.4 0.2 0 1 probability 1 Binomial Trace probability probability 1 0 1 2 3 4 5 6 7 8 9 10 11 12 # distinct accesses in a set (b) b(i, x = 10000) 1 2 3 4 5 6 7 8 9 10 11 12 # distinct accesses in a set (c) b(i, x = 100000) Figure 6.11: Probability distribution of the number of distinct blocks being accessed in a cache set for mcf on a 12-way, 64B line, 3 MB cache, from both assuming binomial distribution of cache blocks among sets (Binomial) and analyzing program traces (Trace). Horizontal axis is value of i, and x is the measurement interval counted in consecutive memory instructions accessed. To analytically model b(i, x) of a thread, we need to know the number of unique cache blocks requested by the thread during the x memory references (i.e., the thread’s cache footprint as called in [64]). This information can be collected by analyzing the thread’s program trace and we denote it by U(x) hereafter. After obtaining U(x), we can model b(i, x) based upon the assumption of binomial distribution of cache blocks among cache sets. If we use a random variable Z to represent the number of unique cache blocks accessed in a cache set, then the probability that Z is equal to any integer between 0 and U(x) can be modeled as follows: µ ¶ U(x) i P[Z = i] = p (1 − p)U(x)−i i for i = 0, 1, 2, ..., U(x) where p = 1/N and N is the total number of cache sets. The intuition behind the above formula is that the assignment of cache blocks among cache sets is approximated as a binomial process with U(x) trials, each with the probability of success p (i.e., the probability of choosing one set of interest 74 6.2. Modeling Cache Contention and Throughput out of N cache sets). Then b(i, x) can be modeled as below: ¡U(x)¢ i U(x)−i 1≤i<A i p (1 − p) b(i, x) = P∞ P∞ ¡U(x)¢ k U(x)−k i=A k=A P[Z = k] = k=A k p (1 − p) P[Z = i] = where A is the cache associativity. Note that b(A, x) is the accumulated probability over P[Z = i] for i ≥ A since it represents the probability that the number of distinct cache blocks being accessed in a cache set is greater or equal to A as described in Section 3.2. Also recall that S(x) represents the number of distinct cache sets accessed by a thread during x consecutive memory references. In our cache contention model, after S(x) is collected, the probability that a cache set is not accessed during the x consecutive memory references is modeled as 1 − S(x)/N, as described in Section 3.3. By assuming the binomial distribution of cache blocks among cache sets, this probability is essentially modeled by P[Z = 0]. Therefore, after obtaining b(i, x), S(x) can be modeled as N(1 − (1 − N1 )U(x) ) by solving the following equation: 1− S(x) 1 = P[Z = 0] = (1 − )U(x) N N Figure 6.10(a), 6.10(b), and 6.10(c) compares the S(x) modeled by assuming binomial distribution of cache blocks among cache sets (labeled “Binomial”) to the one collected from analyzing program traces (labeled “Trace”) for ammp, mgrid, and bzip2, respectively. We notice that the difference between Binomial and Trace is very small for some applications (e.g., mgrid as shown in Figure 6.10(b)). However, compared to Trace, Binomial underestimates the number of distinct cache sets accessed during a certain number of consecutive memory instructions for some applications (e.g., ammp as shown in Figure 6.10(a)) and it overestimates for some other applications (e.g., bzip2 as shown in Figure 6.10(c)). Figure 6.11(a), 6.11(b), and 6.11(c) compares the b(i, x) modeled by assuming binomial distri75 6.2. Modeling Cache Contention and Throughput normalized error 1.1 Binomial Trace 1 0.9 0.8 0.7 L1 L2 Figure 6.12: Average absolute errors of modeled extra L1 and L2 cache misses (versus detailed simulation) by modeling S(x) and b(i, x) from binomial distribution (Binomial) and from analyzing program traces (Trace). Result normalized to Binomial bution of cache blocks among cache sets (labeled “Binomial”) to the one modeled from analyzing program traces (labeled “Trace”) for mcf during 1000, 10000, and 100000 consecutive memory references, respectively. We observe from these figures that the difference between Binomial and Trace is not significant. Figure 6.12 shows the average absolute errors of predicted number of extra L1 and L2 cache misses due to cache sharing by both modeling S(x) and b(i, x) assuming binomial distributions of cache blocks among cache sets as described above (labeled “Binomial”) and quantifying S(x) and b(i, x) from analyzing program traces (labeled “Trace”). The bars in Figure 6.12 are normalized to the results from Binomial. Due to the fact that, as shown in Figure 6.10 and 6.11, S(x) and b(i, x) modeled from Binomial tracks their counterpart from Trace well, the error from Binomial is close to the error from Trace. However, since S(x) and b(i, x) obtained from analyzing program traces more directly and accurately model the program characteristics, the average error from Trace is reduced by 16.4% and 3.1% for L1 data cache and L2 cache, respectively, compared to the error from Binomial. Note that Binomial actually does not eliminate the overhead of collecting and 76 6.2. Modeling Cache Contention and Throughput analyzing program traces since it still requires the number of unique cache blocks accessed in a certain number of memory references (i.e., U(x)), which can only be obtained by analyzing program traces. Moreover, although prior work [2, 64] assumes binomial distribution of cache blocks among cache sets, this work mainly focuses on modeling cache contention due to context switching in a time-sharing system. To model the cache contention due to much more fine-grained overlapping of threads such as in fine-grained multithreaded architectures (i.e., threads are interleaving access by access), circular sequences profiling and the procedures proposed in Section 3.3 are indispensable. 6.2.3 Modeling b(i, x) Using An Inductive Probability Function The reason the probabilistic cache contention model proposed by Chandra et al. [9] is not scalable to a large number of threads is that it does not take into the account of the fact that the cache footprint of a thread during a small interval of time is likely to be limited to a small number of cache sets (i.e., the model does not include a parameter similar to S(x) proposed in Section 3.2). However, the inductive probability (Prob) model in [9] models information similar to b(i, x) in our cache contention model. In this section, we show the impact of modeling b(i, x) using the inductive probability function in [9] rather than directly extracting it from program traces on the accuracy of our cache contention model. The inductive probability function in [9] estimates the probability that given n cache accesses in a set from a thread, there are d distinct blocks among these accesses. To be consistent with [9], we denote the inductive probability function as P(seq(d, n)) for the following discussion. To model b(i, x) of a thread using P(seq(d, n)), we need to know the average number of cache accesses in a set, given x consecutive memory references. Since S(x) represents the number of cache sets accessed by the thread during a total of x accesses, we can approximate the average number of accesses per 77 6.2. Modeling Cache Contention and Throughput 1 0.6 0.4 0.2 1 Prob Trace 0.8 probability Prob Trace probability probability 1 0.8 0.6 0.4 0.2 0 2 3 4 5 6 7 8 9 10 11 12 0.4 0 1 # distinct accesses in a set 0.6 0.2 0 1 Prob Trace 0.8 2 3 4 5 6 7 8 9 10 11 12 1 # distinct accesses in a set (a) b(i, x = 1000) (b) b(i, x = 10000) 2 3 4 5 6 7 8 9 10 11 12 # distinct accesses in a set (c) b(i, x = 100000) Figure 6.13: Probability distribution of the number of distinct blocks being accessed in a cache set for mcf on a 12-way, 64B line, 3 MB cache, from both an inductive function (Prob) and analyzing program traces (Trace). Horizontal axis is value of i, and x is the measurement interval counted in consecutive memory instructions set as x/S(x). Then b(i, x) can be modeled as below: b(i, x) = P(seq(i, x/S(x))) P∞ k=A P(seq(k, x/S(x))) 1≤i<A i=A where A is the cache associativity. Also note that b(A, x) is the accumulated probability over P(seq(i, x/S(x))) for i ≥ A since it is defined in Section 3.2 as the probability that the number of distinct cache blocks being accessed in a cache set is greater or equal to A. Figure 6.13(a), 6.13(b), and 6.13(c) compares the b(i, x) modeled by using the inductive function in [9] as described above (labeled “Prob”) to the b(i, x) collected from analyzing program traces (labeled “Trace”) for mcf during 1000, 10000, and 100000 consecutive memory references, respectively. We observe from Figure 6.13 that the average distinct accesses in a set modeled from Prob is always larger than Trace (e.g., the average distinct accesses in a set modeled by Prob is 2.0 versus 1.4 by Trace for Figure 6.13(b)), resulting in an overestimation of the number of extra misses due to cache sharing and, consequently, larger error for Prob. Figure 6.14 shows the average absolute error of the predicted number of extra L1 and L2 cache misses due to cache sharing versus simu78 6.2. Modeling Cache Contention and Throughput normalized error 1.2 1 Prob Trace 0.8 0.6 0.4 0.2 0 L1 L2 Figure 6.14: Average absolute errors of modeled extra L1 and L2 cache misses (versus detailed simulation) by modeling b(i, x) from an inductive function (Prob) and from analyzing program traces (Trace). Result normalized to Prob lation by both modeling b(i, x) based upon the inductive probability function as described above (labeled “Prob”) and modeling b(i, x) from analyzing program traces (labeled “Trace”). Moreover, the bars in Figure 6.14 are normalized to the results from Prob. From this figure we observe that, by quantifying b(i, x) from analyzing program traces to more directly and accurately account for the program characteristics, the average error reduces by 83.1% and 66.3% for L1 data cache and L2 cache, respectively. 6.2.4 Sensitivity Study To test the accuracy of our cache contention model described in Section 3.3 and Markov chain throughput model proposed in Section 4.4 for different cache configurations, we apply our models for different combinations of the number of L2 cache sets (1024, 2048, or 4096) and the L2 cache associativity (8, 12, or 16). Figure 6.15 compares the predicted average throughput over all cores to the measured throughput from detailed simulation for different L2 cache configurations, while keeping other parameters unchanged. For each cache configuration, the notation AK-B means that the number of L2 cache 79 6.2. Modeling Cache Contention and Throughput 0.645 4K-16 0.64 4K-12 0.635 2K-16 actual 0.63 2K-12 0.625 4K-8 1K-16 0.62 1K-12 2K-8 0.615 0.61 0.605 0.66 1K-8 0.665 0.67 0.675 0.68 0.685 0.69 0.695 predicted Figure 6.15: Comparison of predicted throughput to the throughput measured with detailed simulation for various cache configurations (AK-B means that the number of L2 sets is AK and the L2 associativity is B) sets is AK (i.e., A × 1024) and the L2 cache associativity is B. We observe from Figure 6.15 our models accurately track the impact of varying cache configurations on overall performance obtained from the cycle-accurate simulator described in Section 5.2, with an average absolute error of 8.3% and correlation coefficient of 0.9906. The constant offset is due to a mismatch in the number of instructions executed per thread predicted by our model and the actual number executed per thread in detailed simulation. Table 6.1 illustrates the average modeling errors of our models for various L2 cache configurations, compared to detailed simulation. For each L2 configuration, the average absolute errors for predicted extra L1 data cache misses, L2 cache misses, and throughput are shown. On average, our models predict extra L1 data cache misses, extra L2 cache misses, and throughput obtained from detailed simulation with an error of 45.15%, 7.39%, and 8.26%, respectively. 80 6.2. Modeling Cache Contention and Throughput Table 6.1: Modeling errors for various L2 cache configurations versus detailed simulation (AK-B means that the number of L2 sets is AK and the L2 associativity is B) L2 conf. 1K-8 1K-12 1K-16 2K-8 2K-12 2K-16 4K-8 4K-12 4K-16 Average 6.2.5 L1 44.61% 45.86% 50.01% 48.01% 44.94% 43.27% 45.80% 41.12% 42.71% 45.15% L2 34.44% 5.12% 3.78% 4.67% 3.31% 2.84% 2.92% 4.46% 4.95% 7.39% throughput 8.67% 8.55% 8.47% 8.70% 8.21% 8.21% 8.22% 7.88% 7.47% 8.26% Case Study: Optimizing Threads Per Core for Application-Specific Workloads In this section, we present a case study of using our analytical models to predict the overall throughput of systems with different number of cores and threads with the aim of finding the optimal configuration for some application-specific workloads. We choose mcf and bzip2 for the study. We vary the number of cores of the system from 1 to 8 and the number of threads per core from 1 to 4 (i.e., 32 configurations in total) and in each configuration we predict the overall throughput when running the same application for each thread. Figure 6.16(a) and 6.16(b) show the modeled throughput in all configurations for mcf and bzip2, respectively. It is interesting to note that our analytical model indicates that when the number of cores is fixed, the throughput scales up as the number of threads per core increases for mcf but down for bzip2. The reason for this is the memory-bound nature of mcf, which causes the throughput to be relatively low when mcf runs by itself on a core. Thus, when running and sharing resources with other threads, the overall throughput improves. On the other hand, bzip2 is compute-bound and, 81 2 throughput (IPC) throughput (IPC) 6.2. Modeling Cache Contention and Throughput 1 0 8 6 4 # cores 4 2 0 0 (a) mcf 2 # threads/core 10 5 0 8 6 4 # cores 2 0 0 4 2 # threads/core (b) bzip2 Figure 6.16: Optimization Case Study: Modeled throughput under different configurations when the number of threads sharing a core is over two, extra cache misses due to cache contention actually outweigh the advantage of fine-grained multithreading, degrading the overall throughput. Therefore, the optimal configuration for mcf is a system with eight cores with four threads per core, while the optimal configuration for bzip2 is eight cores with two threads per core. Our analytical model predicts the same optimal configuration for each workload as the one obtained from detailed simulations. For mcf, the arithmetic mean of the absolute error, the geometric mean, and the harmonic mean are 2.2%, 2.1%, and 1.8%, respectively. For bzip2, they are 14.3%, 4.0%, and 1.0%, respectively. Moreover, our model is much faster than running detailed simulations. The time taken by our analytical model to find the throughput in all 32 configurations for each benchmark was as follows: We required 20 minutes to collect traces and statistics required for the cache contention model and roughly 10 seconds to evaluate the cache contention and Markov chain model, while the detailed simulator takes more than 22 hours (i.e., overall, 65 times faster.8 ) 8 In some sense we are conservative in stating the speedup of our analytical models over detailed simulations as traces only need to be collected and analyzed once for each application and the overhead can be amortized by reusing them. 82 6.2. Modeling Cache Contention and Throughput Hardware Validation extra L2 misses per thread 6.2.6 actual (Sun Fire T1000) predicted (new model) 10x 8x 6x 4x 2x 0x 12mcf 16mcf 20mcf 24mcf 28mcf 32mcf extra L2 misses per thread (a) homogeneous workloads actual (Sun Fire T1000) 1000x gzip mcf predicted (new model) art eqk 100x 10x 1x 16mcf+16gzip 16gzip+16eqk 16mcf+16art 16mcf+16art 16gzip+16eqk (b) heterogeneous workloads Figure 6.17: Predicted ratio of the number of extra L2 misses (per thread) due to cache contention to the number of L2 misses when a thread runs alone compared to an actual hardware (Sun Fire T1000). Part (a) nmcf means n copies of mcf run in parallel. Part (b) nA+mB means n copies of benchmark A and m copies of benchmark B run in parallel. Name above bars in (b) indicates which benchmark’s L2 cache misses are reported. While our detailed simulator captures the effects we set out to model analytically, we wish to find out how closely our models predict real hardware performance. Thus, we compare our models to the actual performance measured on a Sun Fire T1000 server (i.e., a Niagara T1 machine [37]). Figure 6.17(a) compares the predicted against actual increase in L2 load misses per thread, normalized to the number of L2 cache load misses in the absence of contention for a homogeneous workload—a varying number of copies of the memory intensive benchmark mcf. From Figure 6.17(a) we observe 83 6.2. Modeling Cache Contention and Throughput that as the number of threads running in parallel increases, the number of extra L2 load misses per thread also increases due to additional cache contention. Our model accurately predicts this trend. In Figure 6.17(a) the arithmetic mean of the absolute error of our model is 8.7% (improving the accuracy of the prior model [9]—not shown in the figure—by a factor of 48.1×). Similarly, Figure 6.17(b) compares the average number of extra L2 load misses per thread executing a variety of heterogeneous workloads. In Figure 6.17(b), the arithmetic mean of the absolute error of our throughput per core (IPC) model is 22.6% (improving the accuracy of the prior model [9]—not shown—by a factor of 163.4×). actual (Sun Fire T1000) sumCYCLE sumIPC Bernoulli MC 1.2 1 0.8 0.6 0.4 0.2 0 32mcf 16mcf+16gzip 16art+16mcf 16gzip+16eqk Figure 6.18: Predicted average throughput per core versus the throughput reported by performance counters on a Sun Fire T1000. Here nA+mB has the same meaning as in Figure 6.17(b). Figure 6.18 compares the throughput per core predicted by our combined cache contention and Markov throughput model (“MC”), as well as that predicted by the three simpler models described in Sections 4.1 to 4.3 (“sumCYCLE”, “sumIPC”, and “Bernoulli”, respectively) against the Sun Fire T1000 server (“actual (Sun Fire T1000)”) for four different workloads (one homogeneous, three heterogeneous). We observe that MC tends to track the actual hardware better than the simpler models. The standard deviation of the error for MC is 10.7% which is much lower than 26.1%, 113.7%, and 90.8% for sumCYCLE, sumIPC, and Bernoulli, respectively. Moreover, we notice that MC 84 6.2. Modeling Cache Contention and Throughput always overestimates actual throughput (by 30.6% on average—arithmetic mean of the absolute error, 29.4% for geometric mean, or 28.2% for harmonic mean). We believe this bias is due partly to the fact that we used an average L2 cache miss latency measured in the absence of DRAM contention. The detailed performance simulator used for comparison from Section 6.2.1 to 6.2.5 does not model detailed DRAM timing (our detailed simulator also does not model page faults). Analytically modeling the increased L2 cache miss latency due to DRAM contention when multiple threads run together is beyond the scope of this study and is left for our future work. 85 Chapter 7 Related Work This chapter describes some related work. First, we introduce some other alternatives to detailed simulation for exploring design space, including empirical models and statistical simulation. Then, we describe some prior analytical models for superscalar microprocessors. Finally, we summarize prior analytical cache models and throughput models. 7.1 Empirical Models Some empirical models such as artificial neural networks [30] and regression models [39, 40] have been proposed to reduce the time of exploring design space of new microprocessor design. These empirical models have to be trained from performance samples obtained from detailed cycle-accurate simulators, given a set of microarchitectural parameters. Then, these models can interpolate the performance when given new sets of microarchitectural parameters, with much faster speed than the performance simulators. However, these empirical models still require the effort of creating detailed performance simulators and they cannot provide insight for the impact of varying microarchitectural parameters without exhaustively exploring the design space. 86 7.2. Statistical Simulation 7.2 Statistical Simulation Statistical Simulation [17, 18, 24, 53, 55] is another technique proposed to help computer architects to efficiently explore the huge design space at the early phases of microprocessor design. In general, it consists of three steps: statistical profiling, synthetic trace generation and trace-driven simulation. In the first step, a real program trace is analyzed to extract statistical profile of the program, including instruction mix, branch statistics, cache statistics, etc.. Once the statistical profile of the program is collected, a synthetic trace is generated based upon it. The generated synthetic is actually a statistical representative of and is much shorter than the original program. Finally the generated synthetic is fed into a trace-driven simulator modeling the microprocessor being modeled. Since the generated synthetic trace is usually more compact than the original program, the trace-driven simulation takes much less time than to run the simulation for the whole original program. Moreover, the simulated performance by using the synthetic trace is shown to be very close to the result obtained by simulating the whole original program. However, similar to empirical models, statistical simulation still requires the effort of crafting a detailed performance simulator and it cannot provide insights for the impact of changing microarchitectural parameters without exhaustively running simulations. 7.3 Analytical Models for Superscalar Microprocessors There exist many analytical models proposed for superscalar microprocessors [45, 46, 51, 54]. A common limitation of early models is that they assume a perfect data cache. As the gap between memory and microprocessor speed increases, data cache misses must be properly modeled to achieve reasonable accuracy. Jacob et al. [31] propose an analytical memory model that applies a variational calculus approach to determine a memory hierarchy optimized for the average access time given a fixed hardware 87 7.4. Cache Contention Models budget with an (implicit) assumption that memory level parallelism does not occur. Noonburg and Shen present a superscalar microprocessor Markov chain model [52] that does not model long memory latency. The first-order model proposed by Karkhanis and Smith [34] is the first to attempt to account for long latency data cache misses. Our analytical model improves the accuracy of our re-implementation of their technique described in Section 2.1 (based upon details available in the literature) by modeling pending data cache hits, and extends it to estimate the performance impact of hardware data prefetching techniques and a limited number of outstanding cache misses. Some earlier work has investigated the performance impact increasing the number of MSHRs using detailed simulations [4, 22]. The hybrid analytical model that we propose can be used to estimate the performance impact of a limited number of MSHRs without requiring a detailed simulator. Concurrent with our work, Eyerman proposed an approach similar to SWAM described in Section 2.2.5, except that the profile window slides to begin with each successive long latency miss [19]. Reportedly, a pending hit compensation mechanism was used [41]. 7.4 Cache Contention Models Thiebaut and Stone [64] propose an analytical model for cache-reload transients (i.e., cache contention due to context switching) and Agarwal et al. [2] propose an analytical cache model to predict a process’ cache miss rate. Both do so on a time-sharing system running multiprogrammed workloads. Falsafi and Wood [21] extend Thiebaut and Stone’s model to account for the interaction of more than two threads and apply analytical modeling to predict cost/performance of the Wisconsin Wind Tunnel simulator. They also consider the case where parallel threads share some memory regions (leading to prefetching effects), which we did not consider in our study. While all of the above work assumes a binomial distribution of cache blocks among cache sets, our model 88 7.5. Throughput Model uses S(x) and b(i, x) to more directly quantify the probability of a co-scheduled thread inserting a certain number of accesses into a cache set to better model the very fine-grained interleaving of threads (we assume that threads are interleaving access by access). As described in Section 3.1 the Prob model of Chandra et al. [9] is accurate if the number of threads is significantly below the associativity of the cache being shared, but does not model contention well as the number of threads is increased. While Chandra et al. mention their model could be applied to predict overall performance, they do not describe such a technique in detail and do not provide quantitative evaluation of throughput modeling as we have done in Chapter 4 and Chapter 6, respectively. 7.5 Throughput Model Yamamoto et al. [69] describe an analytical model for a multistreamed superscalar processor. The processor they model is essentially an SMT processor (capable of issuing instructions from multiple threads to the superscalar core on a single cycle) with the assumption of perfect branch prediction and caches that always hit. They present a probabilistic model to account for structural hazards by considering the instruction mix versus number and type of function units and use a correction factor computed by performing detailed simulation for a single thread in isolation to account for the effects of control and data hazards. In contrast, our model, while focused on single-issue fine-grained multithreaded architectures, requires no detailed simulation infrastructure to obtain accurate throughput predictions. Moreover, we propose a cache contention model to predict the extra number of cache misses rather than assuming perfect caches. 89 Chapter 8 Conclusions and Future Work This chapter concludes the thesis and discusses some future work. 8.1 Conclusions In this thesis, we propose and evaluate several improvements to an existing analytical performance model for superscalar processors. We show the importance of properly modeling pending data cache hits and propose a technique to accurately model their performance impact, while not requiring a performance simulator. We then extend this model to estimate the performance of a microprocessor when an arbitrary data prefetching method is applied. Moreover, we propose a technique to quantify the impacts of a limited number of MSHRs. Overall, the techniques we introduce in this thesis can reduce the modeling error of our baseline from 39.7% to 10.3% for a set of memory intensive benchmarks. The average error of our model is 13.8% when modeling several prefetching strategies and 9.5% when modeling a limited number of supported outstanding cache misses. We also propose novel techniques for analytically modeling the throughput of fine-grained multithreaded chip multiprocessors such as Sun Microsystems’ Niagara. Throughput is estimated by applying a Markov chain model where states represent numbers of stalled threads, along with transition statistics obtained using a novel cache contention model that takes account of temporal locality within a thread and accurately predicts cache contention among a large number of threads. Overall, the average error of performance predicted by our models is 8.3% versus a cycle-accurate 90 8.2. Future Work simulator that captures the major effects we attempt to model analytically for several cache configurations. Our models also find the same optimum configuration for some application-specific workloads as the detailed simulator. Furthermore, our models achieve these results 65 times faster than detailed performance simulations. We also validate our models against a Sun Fire T1000 and show that our models accurately predict the performance of real hardware. 8.2 Future Work The work in this thesis leads to some new areas of research and this section briefly discusses some of them. 8.2.1 Improving the Applicability of Our Hybrid Analytical Models The hybrid analytical model described in Chapter 2 extends the first-order model [34] by modeling data prefetching and a limited number of MSHRs, increasing the applicability of the prior first-order model. We believe that the applicability of our model can be further improved by adding more detail to the existing model to predict the performance impact of some prior proposed microarchitectural innovations such as speculative precomputation [16], value prediction [42], run-ahead execution [50], etc., without requiring cycle-accurate performance simulators. When the improved model is available, it might also help researchers to better understand the bottlenecks of those existing techniques, improve them, and propose new innovations. 8.2.2 Analytical Modeling of Simultaneous Multithreading Processors The Markov chain throughput model proposed in Chapter 4 along with the cache contention model described in Chapter 3 can accurately predict the performance of fine-grained multithreaded architectures with scalar pipelines. However, analytical modeling of modern simultaneous multithreading 91 8.2. Future Work (SMT) processors with superscalar pipelines is still an open question. We believe that, by considering state transitions to occur only for long latency L2 cache misses (rather than L1 misses) and scaling the estimated throughput by a factor taking account of the peak issue width of an SMT core (assuming a well balanced design [35]) and the memory level parallelism within a thread, our Markov chain model can potentially provide approximate performance predictions for more complex processor cores. 8.2.3 Analytical Modeling of Chip Multiprocessors The hybrid analytical model described in Chapter 2 accurately predicts the performance of singlecore out-of-order execution superscalar processors and provides a building block for developing analytical models for chip multiprocessors (CMP) with out-of-order execution superscalar cores. While, in Chapter 4, we have proposed several analytical models for chip multiprocessors consisting of fine-grained multithreaded cores with scalar pipelines, analytical modeling the performance for chip multiprocessors with out-of-order execution superscalar cores is challenging due to cache sharing among cores. Although the cache contention model proposed in Chapter 3 can be applied to predict the extra number of cache misses due to cache sharing, the performance impact of cache sharing for CMP with out-of-order execution superscalar cores can not be simply predicted by the extra number of cache misses due to the memory level parallelism extracted by out-of-order execution. To solve the problem, an analytical model that accurately quantifies the performance impact of extra number of cache misses, given data dependency chain information of the application being executed, is required. 8.2.4 Analytical Modeling of DRAM Scheduling Policies As the speed gap between microprocessors and memory systems keeps increasing, memory access latency has become one of the most critical factors determining the performance of modern micro92 8.2. Future Work processors. To achieve higher performance of memory systems, several DRAM scheduling policies have been proposed [48, 49, 58]. In this thesis we assumed a uniform fixed latency by abstracting the impact of DRAM timing and scheduling policy on memory access latency. An analytical model that is able predict the distribution of memory access latency, given a set of parameters of DRAM timing and scheduling policy would help make prior and our analytical models more applicable for chip designers to use. Moreover, such an analytical model might provide insight helping designers come up with better DRAM scheduling policies. 93 Bibliography [1] Vikram S. Adve and Mary K. Vernon. Parallel Program Performance Prediction using Deterministic Task Graph Analysis. ACM Transactions on Computer Systems, 22(1):94–136, 2004. [2] Anant Agarwal, Mark Horowitz, and John Hennessy. An Analytical Cache Model. ACM Transactions on Computer Systems, 7(2):184–215, 1989. [3] Jean-Loup Baer and Tien-Fu Chen. An Effective On-chip Preloading Scheme to Reduce Data Access Penalty. In Supercomputing ’91: Proceedings of the 1991 ACM/IEEE Conference on Supercomputing, pages 176–186, 1991. [4] Samson Belayneh and David R. Kaeli. A Discussion on Non-blocking/Lockup-free Caches. SIGARCH Computer Architecture News, 24(3):18–25, 1996. [5] Nathan L. Binkert, Ronald G. Dreslinski, Lisa R. Hsu, Kevin T. Lim, Ali G. Saidi, and Steven K. Reinhardt. The M5 Simulator: Modeling Networked Systems. IEEE Micro, 26(4):52–60, 2006. [6] Darrell Boggs, Aravindh Baktha, Jason Hawkins, Deborah T. Marr, J. Alan Miller, Patrice R Roussel, Ronak Singhal, Bret Toll, and K.S. Venkatraman. The Microarchitecture of the Intel° R R Pentium° 4 Processor on 90nm Technology. Intel° Technology Journal, 8(1), 2004. 94 Chapter 8. Bibliography [7] Doug Burger and Todd M. Austin. The SimpleScalar Tool Set, Version 2.0. SIGARCH Computer Architecture News, 25(3):13–25, 1997. [8] M. C. Carlisle. Olden: Parallelizing Programs with Dynamic Data Structures on DistributedMemory Machines. PhD thesis, Princeton University, June 1996. [9] Dhruba Chandra, Fei Guo, Seongbeom Kim, and Yan Solihin. Predicting Inter-Thread Cache Contention on a Chip Multi-Processor Architecture. In HPCA ’05: Proceedings of the 11th International Symposium on High-Performance Computer Architecture, pages 340–351, 2005. [10] Xi E. Chen and Tor M. Aamodt. Hybrid Analytical Modeling of Pending Cache Hits, Data Prefetching, and MSHRs. In MICRO ’08: Proceedings of the 41st Annual IEEE/ACM International Symposium on Microarchitecture, pages 59–70, 2008. [11] Xi E. Chen and Tor M. Aamodt. An improved analytical superscalar microprocessor memory model. In 4th Workshop on Modeling, Benchmarking and Simulation (MoBS 2008), pages 7–16, June 2008. [12] Xi E. Chen and Tor M. Aamodt. A First-order Fine-grained Multithreaded Throughput Model. In HPCA ’09: Proceedings of the 9th International Symposium on High-Performance Computer Architecture, pages 329–340, 2009. [13] Derek Chiou, Dam Sunwoo, Joonsoo Kim, Nikhil A. Patil, William H. Reinhart, Darrel Eric Johnson, Jebediah Keefe, and Hari Angepat. FPGA-Accelerated Simulation Technologies (FAST): Fast, Full-System, Cycle-Accurate Simulators. In MICRO ’07: Proceedings of the International Symposium on Microarchitecture, pages 249–261, 2007. [14] C. K. Chow. On Optimization of Memory Hierarchies. IBM Journal of Research and Development, (2):194–203, 1974. 95 Chapter 8. Bibliography [15] Bob Cmelik and David Keppel. Shade: A Fast Instruction-set Simulator for Execution Profiling. In Proceedings of the Conference on Measurement and Modeling of Computer Systems (SIGMETRICS), pages 128–137, 1994. [16] Jamison D. Collins, Hong Wang, Dean M. Tullsen, Christopher Hughes, Yong-Fong Lee, Dan Lavery, and John P. Shen. Speculative Precomputation: Long-Range Prefetching of Delinquent Loads. In ISCA ’01: Proceedings of the 28th Annual International Symposium on Computer Architecture, pages 14–25, 2001. [17] Lieven Eeckhout, Robert H. Bell Jr., Bastiaan Stougie, Koen De Bosschere, and Lizy K. John. Control Flow Modeling in Statistical Simulation for Accurate and Efficient Processor Design Studies. In ISCA ’04: Proceedings of the 31st Annual International Symposium on Computer Architecture, page 350, 2004. [18] Lieven Eeckhout, Sebastien Nussbaum, James E. Smith, and Koen De Bosschere. Statistical Simulation: Adding Efficiency to the Computer Designer’s Toolbox. IEEE Micro, 23(5):26–38, 2003. [19] Stijn Eyerman. Analytical Performance Analysis and Modeling of Superscalar and Multithreaded Processors. PhD thesis, Ghent, 2008. [20] Stijn Eyerman, Lieven Eeckhout, Tejas S. Karkhanis, and James E. Smith. A Performance Counter Architecture for Computing Accurate CPI Components. In ASPLOS-XII: Proceedings of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems, pages 175–184, 2006. [21] Babak Falsafi and David A. Wood. Modeling Cost/Performance of A Parallel Computer Simulator. ACM Transactions on Modeling and Computer Simulation, 7(1):104–130, 1997. 96 Chapter 8. Bibliography [22] K. I. Farkas, N. P. Jouppi, and P. Chow. How Useful Are Non-Blocking Loads, Stream Buffers and Speculative Execution in Multiple Issue Processors? In HPCA ’95: Proceedings of the 1st International Symposium on High-Performance Computer Architecture, page 78, 1995. [23] Matthew K. Farrens and Andrew R. Pleszkun. Strategies for Achieving Improved Processor Throughput. In ISCA ’91: Proceedings of the 18th annual international symposium on Computer architecture, pages 362–369, 1991. [24] Davy Genbrugge, Lieven Eeckhout, and Koen De Bosschere. Accurate Memory Data Flow Modeling in Statistical Simulation. In ICS ’06: Proceedings of the 20th Annual International Conference on Supercomputing, pages 87–96, 2006. [25] J. D. Gindele. Buffer Block Prefetching Method. IBM Technical Disclosure Bulletin, 20(2):696– 697, 1977. [26] Venkatraman Govindaraju, Peter Djeu, Karthikeyan Sankaralingam, Mary Vernon, and William R. Mark. Toward A Multicore Architecture for Real-time Ray-tracing. In MICRO ’08: Proceedings of the 41st Annual IEEE/ACM International Symposium on Microarchitecture, pages 176–187, 2008. [27] John L. Hennessy and David A. Patterson. Computer Architecture: A Quantitative Approach, 4th edition. Morgan Kaufmann, 2007. [28] Glenn Hinton, Dave Sager, Mike Upton, Darrell Boggs, Doug Carmean, Alan Kyker, and R R Patrice Roussel. The Microarchitecture of the Pentium° 4 Processor. Intel° Technology Journal, 5(1), 2001. [29] Paul G. Hoel, Charles J. Stone, and Sidney C. Port. Introduction to Stochastic Processes. Waveland Press, 1987. 97 Chapter 8. Bibliography [30] Engin Ïpek, Sally A. McKee, Rich Caruana, Bronis R. de Supinski, and Martin Schulz. Efficiently Exploring Architectural Design Spaces via Predictive Modeling. In ASPLOS-XII: Proceedings of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems, pages 195–206, 2006. [31] Bruce L. Jacob, Peter M. Chen, Seth R. Silverman, and Trevor N. Mudge. An Analytical Model for Designing Memory Hierarchies. IEEE Transactions on Computers, 45(10):1180–1194, 1996. [32] Norman P. Jouppi. Improving Direct-mapped Cache Performance by the Addition of a Small Fully-associative Cache and Prefetch Buffers. In ISCA ’90: Proceedings of the 17th Annual International Symposium on Computer Architecture, pages 364–373, 1990. [33] Tejas S. Karkhanis. Automated Design of Application-specific Superscalar Processors. PhD thesis, University of Wisconsin, 2006. [34] Tejas S. Karkhanis and James E. Smith. A First-order Superscalar Processor Model. In ISCA ’04: Proceedings of the 31st Annual International Symposium on Computer Architecture, pages 338–349, 2004. [35] Tejas S. Karkhanis and James E. Smith. Automated Design of Application Specific Superscalar Processors: An Analytical Approach. In ISCA ’07: Proceedings of the 34th Annual International Symposium on Computer Architecture, pages 402–411, 2007. [36] Peter M. Kogge. The Architecture of Pipelined Computers. McGraw-Hill, 1981. [37] Poonacha Kongetira, Kathirgamar Aingaran, and Kunle Olukotun. Niagara: A 32-Way Multithreaded Sparc Processor. IEEE Micro, 25(2):21–29, 2005. [38] David Kroft. Lockup-free Instruction Fetch/Prefetch Cache Organization. In ISCA ’81: Pro- 98 Chapter 8. Bibliography ceedings of the 8th Annual International Symposium on Computer Architecture, pages 81–87, 1981. [39] Benjamin C. Lee and David M. Brooks. Accurate and Efficient Regression Modeling for Microarchitectural Performance and Power Prediction. In ASPLOS-XII: Proceedings of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems, pages 185–194, 2006. [40] Benjamin C. Lee, Jamison Collins, Hong Wang, and David M. Brooks. CPR: Composable Performance Regression for Scalable Multiprocessor Models. In MICRO ’08: Proceedings of the 41st Annual IEEE/ACM International Symposium on Microarchitecture, pages 270–281, 2008. [41] Lieven Eeckhout. Personal Communication. Sept, 2008. [42] Mikko H. Lipasti and John Paul Shen. Exceeding the Dataflow Limit via Value Prediction. In MICRO ’96: Proceedings of the 29th Annual IEEE/ACM International Symposium on Microarchitecture, pages 226–237, 1996. [43] J. E. MacDonald and K. L. Sigworth. Storage Hierarchy Optimization Procedure. IBM Journal of Research and Development, 19(2):133–140, 1975. [44] Peter S. Magnusson, Magnus Christensson, Jesper Eskilson, Daniel Forsgren, Gustav Hållberg, Johan Högberg, Fredrik Larsson, Andreas Moestedt, and Bengt Werner. Simics: A Full System Simulation Platform. Computer, 35(2):50–58, 2002. [45] P. Michaud, A. Seznec, and S. Jourdan. Exploring Instruction-Fetch Bandwidth Requirement in Wide-issue Superscalar Processors. In PACT ’99: Proceedings of the 1999 International Conference on Parallel Architectures and Compilation Techniques, pages 2–10, 1999. 99 Chapter 8. Bibliography [46] P. Michaud, A. Seznec, and S. Jourdan. An Exploration of Instruction Fetch Require- ment in Out-of-Order Superscalar Processors. International Journal of Parallel Programming, 29(1):35–38, 2001. [47] Sun Microsystems. OpenSPARCTM T2 Core Microarchitecture Specification. Sun Microsystems, Inc., 2007. [48] Onur Mutlu and Thomas Moscibroda. Stall-Time Fair Memory Access Scheduling for Chip Multiprocessors. In MICRO ’07: Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture, pages 146–160, 2007. [49] Onur Mutlu and Thomas Moscibroda. Parallelism-Aware Batch Scheduling: Enhancing both Performance and Fairness of Shared DRAM Systems. In ISCA ’08: Proceedings of the 35th International Symposium on Computer Architecture, pages 63–74, 2008. [50] Onur Mutlu, Jared Stark, Chris Wilkerson, and Yale N. Patt. Runahead Execution: An Alternative to Very Large Instruction Windows for Out-of-Order Processors. In HPCA ’03: Proceedings of the 9th International Symposium on High-Performance Computer Architecture, page 129, 2003. [51] Derek B. Noonburg and John P. Shen. Theoretical Modeling of Superscalar Processor Performance. In MICRO ’94: Proceedings of the 27th Annual ACM/IEEE International Symposium on Microarchitecture, pages 52–62, 1994. [52] Derek B. Noonburg and John P. Shen. A Framework for Statistical Modeling of Superscalar Processor Performance. In HPCA ’97: Proceedings of the 3rd International Symposium on High-Performance Computer Architecture, pages 298–309, 1997. [53] Sébastien Nussbaum and James E. Smith. Modeling Superscalar Processors via Statistical 100 Chapter 8. Bibliography Simulation. In PACT ’01: Proceedings of the 2001 International Conference on Parallel Architectures and Compilation Techniques, pages 15–24, 2001. [54] D. J. Ofelt. Efficient Performance Prediction for Modern Microprocessors. PhD thesis, Stanford University, 1999. [55] Mark Oskin, Frederic T. Chong, and Matthew Farrens. HLS: Combining Statistical and Symbolic Simulation to Guide Microprocessor Designs. SIGARCH Computer Architecture News, 28(2):71–82, 2000. [56] Steven K. Reinhardt, Mark D. Hill, James R. Larus, Alvin R. Lebeck, James C. Lewis, and David A. Wood. The Wisconsin Wind Tunnel: virtual prototyping of parallel computers. In Proceedings of the Conference on Measurement and Modeling of Computer Systems (SIGMETRICS), pages 48–60, 1993. [57] Jose Renau, Basilio Fraguela, James Tuck, Wei Liu, Milos Prvulovic, Luis Ceze, Smruti Sarangi, Paul Sack, Karin Strauss, and Pablo Montesinos. SESC simulator, January 2005. http://sesc.sourceforge.net. [58] Scott Rixner, William J. Dally, Ujval J. Kapasi, Peter Mattson, and John D. Owens. Memory Access Scheduling. In ISCA ’00: Proceedings of the 27th Annual International Symposium on Computer Architecture, pages 128–138, 2000. R [59] Denis Sheahan. Developing and Tuning Applications on UltraSPARC°T1 Chip Multithreading Systems. Sun Microsystems, Inc., 1.2 edition, October 2007. [60] John Paul Shen and Mikko H. Lipasti. Modern Processor Design: Fundamentals of Superscalar Processors. McGraw-Hill, 2005. 101 Chapter 8. Bibliography [61] Timothy Sherwood, Erez Perelman, Greg Hamerly, and Brad Calder. Automatically Characterizing Large Scale Program Behavior. In ASPLOS-X: Proceedings of the 10th International Conference on Architectural Support for Programming Languages and Operating Systems, pages 45–57, 2002. [62] Alan Jay Smith. Cache Memories. ACM Computing Surveys, 14(3):473–530, 1982. [63] Standard Performance Evaluation Corporation. SPEC CPU2000 benchmarks. http://www.spec.org. [64] Dominique Thiebaut and Harold S. Stone. Footprints in the Cache. ACM Transactions on Computer Systems, 5(4):305–329, 1987. [65] M. R. Thistle and B. J. Smith. A Processor Architecture for Horizon. In Supercomputing ’88: Proceedings of the 1988 ACM/IEEE conference on Supercomputing, pages 35–41, 1988. [66] James E. Thornton. Parallel Operation in The Control Data 6600. In AFIPS ’64 (Fall, part II): Proceedings of the October 27-29, 1964, fall joint computer conference, part II: very high speed computer systems, pages 33–40, 1965. [67] D. M. Tullsen, S. J. Eggers, and H. M. Levy. Simultaneous Multithreading: Maximizing OnChip Parallelism. In ISCA ’95: Proceedings of the 22nd Annual International Symposium on Computer Architecture, pages 392–403, 1995. [68] John Wawrzynek, David Patterson, Mark Oskin, Shih-Lien Lu, Christoforos Kozyrakis, James C. Hoe, Derek Chiou, and Krste Asanovi. RAMP: Research Accelerator for Multiple Processors. IEEE Micro, 27(2):46–57, 2007. [69] W. Yamamoto, M.J. Serrano, A.R. Talcott, R.C. Wood, and M. Nemirosky. Performance Esti- 102 Chapter 8. Bibliography mation of Multistreamed, Superscalar Processors. In 27th Hawaii Int’l Conf. System Sciences, Jan. 1994. [70] Joshua J. Yi, Lieven Eeckhout, David J. Lilja, Brad Calder, Lizy K. John, and James E. Smith. The Future of Simulation: A Field of Dreams. Computer, 39(11):22–29, 2006. [71] Li Zhao, Ravi R. Iyer, Jaideep Moses, Ramesh Illikkal, Srihari Makineni, and Donald Newell. Exploring Large-Scale CMP Architectures Using ManySim. IEEE Micro, 27(4):21–33, 2007. [72] Craig B. Zilles. Benchmark Health Considered Harmful. SIGARCH Computer Architecture News, 29(3):4–5, 2001. 103 Appendix A Limiting Case of Time-Sharing Below we show that extending the earlier approaches from Thiebaut and Stone [64] to model very fine-grained cache contention (we assume that threads are interleaving access by access) results in this earlier model always predicting there is no cache contention. This is a consequence of assumptions about the behavior of a time-sharing system and the associated granularity of interleaving of processes (several thousands of memory references rather than cycle-by-cycle). In particular, in deriving Equation 3.10 on page 313 of their paper, Thiebaut and Stone assume that after every task switch the data accessed by the new thread is the whole working set FA . The following evaluates Thiebaut and Stone’s model assuming two threads interleave accesses “cycle by cycle”. Following Thiebaut and Stone, we let X be a random variable representing the number of blocks belonging to the first thread in a given set and Y be a random variable representing the number of blocks belonging to the second thread in a given set. Similarly, Z is a random variable representing the number of first thread’s blocks evicted by the second thread in a given set, N is the number of sets, and K is the set associativity. from Thiebaut and Stone’s paper [64] gives P[X = 0] = N−1 N and P[X = 1] = 1 N Equation 3.1 (X = i with i > 1 is impossible since we assume interleaving by each access). Equation 3.4 from Thiebaut and Stone gives P[Y = 0] = N−1 N and P[Y = 1] = 1 N (Y = i with i > 1 is impossible since we assume interleaving by each access). Thus Equation 3.6 from Thiebaut and Stone yields P[Z = 0] = 1 P[X = 0]P[0 ≤ Y ≤ K] + P[X = 1]P[0 ≤ Y ≤ K − 1] = ( N−1 N × 1) + ( N × 1) = 1 and subsequently Equa- 104 Appendix A. Limiting Case of Time-Sharing tion 3.7 from Thiebaut and Stone yields P[Z = 1] = P[X = 1]P[Y = K] = 0 (since P[Y = K] = 0 for K > 1, set-associative cache). Then Equation 3.8 from Thiebaut and Stone yields X = 1 N and Equa- tion 3.9 from Thiebaut and Stone yields Z = 0. Consequently, Equation 3.10 from Thiebaut and Stone shows the average reload transient equals FA − (N × X − N × Z) = 1 − (N × 1 N − N × 0) = 0. Thus no cache contention is predicted for the limit of fine-grained interleaving cycle by cycle. 105
- Library Home /
- Search Collections /
- Open Collections /
- Browse Collections /
- UBC Theses and Dissertations /
- Analytical modeling of modern microprocessor performance
Open Collections
UBC Theses and Dissertations
Featured Collection
UBC Theses and Dissertations
Analytical modeling of modern microprocessor performance Chen, Xi 2009
pdf
Notice for Google Chrome users:
If you are having trouble viewing or searching the PDF with Google Chrome, please download it here instead.
If you are having trouble viewing or searching the PDF with Google Chrome, please download it here instead.
Page Metadata
Item Metadata
Title | Analytical modeling of modern microprocessor performance |
Creator |
Chen, Xi |
Publisher | University of British Columbia |
Date Issued | 2009 |
Description | As the number of transistors integrated on a chip continues to increase, a growing challenge is accurately modeling performance in the early stages of processor design. Analytical modeling is an alternative to detailed simulation with the potential to shorten the development cycle and provide additional insight. This thesis proposes hybrid analytical models to predict the impact of pending cache hits, hardware prefetching, and realistic miss status holding register (MSHR) resources on superscalar performance. We propose techniques to model the non-negligible influences of pending hits and the fine-grained selection of instruction profile window blocks on the accuracy of hybrid analytical models. We also present techniques to estimate the performance impact of data prefetching by modeling the timeliness of prefetches and to account for a limited number of MSHRs by restricting the size of profile window blocks. As with earlier hybrid analytical models, our approach is roughly two orders of magnitude faster than detailed simulations. Overall, our techniques reduce the error of our baseline from 39.7% to 10.3% when the number of MSHRs is unlimited. When modeling a processor with data prefetching, a limited number of MSHRs, or both, our techniques result in an average error of 13.8%, 9.5% and 17.8%, respectively. Moreover, this thesis proposes analytical models for predicting the cache contention and throughput of heavily fine-grained multithreaded architectures such as Sun Microsystems' Niagara. We first propose a novel probabilistic model using statistics characterizing individual threads run in isolation as inputs to accurately predict the number of extra cache misses due to cache contention among a large number of threads. We then present a Markov chain model for analytically estimating the throughput of multicore, fine-grained multithreaded architectures. Combined, the two models accurately predict system throughput obtained from a detailed simulator with an average error of 8.3% for various cache configurations. We also show that our models can find the same optimized design point of fine-grained multithreaded chip multiprocessors for application-specific workloads 65 times faster than detailed simulations. Furthermore, we show that our models accurately predict cache contention and throughput trends across varying workloads on real hardware, a Sun Fire T1000 server. |
Extent | 1122882 bytes |
Genre |
Thesis/Dissertation |
Type |
Text |
FileFormat | application/pdf |
Language | eng |
Date Available | 2009-06-11 |
Provider | Vancouver : University of British Columbia Library |
Rights | Attribution-NonCommercial-NoDerivatives 4.0 International |
DOI | 10.14288/1.0065506 |
URI | http://hdl.handle.net/2429/8979 |
Degree |
Master of Applied Science - MASc |
Program |
Electrical and Computer Engineering |
Affiliation |
Applied Science, Faculty of Electrical and Computer Engineering, Department of |
Degree Grantor | University of British Columbia |
GraduationDate | 2009-11 |
Campus |
UBCV |
Scholarly Level | Graduate |
Rights URI | http://creativecommons.org/licenses/by-nc-nd/4.0/ |
AggregatedSourceRepository | DSpace |
Download
- Media
- 24-ubc_2009_fall_chen_xi.pdf [ 1.07MB ]
- Metadata
- JSON: 24-1.0065506.json
- JSON-LD: 24-1.0065506-ld.json
- RDF/XML (Pretty): 24-1.0065506-rdf.xml
- RDF/JSON: 24-1.0065506-rdf.json
- Turtle: 24-1.0065506-turtle.txt
- N-Triples: 24-1.0065506-rdf-ntriples.txt
- Original Record: 24-1.0065506-source.json
- Full Text
- 24-1.0065506-fulltext.txt
- Citation
- 24-1.0065506.ris
Full Text
Cite
Citation Scheme:
Usage Statistics
Share
Embed
Customize your widget with the following options, then copy and paste the code below into the HTML
of your page to embed this item in your website.
<div id="ubcOpenCollectionsWidgetDisplay">
<script id="ubcOpenCollectionsWidget"
src="{[{embed.src}]}"
data-item="{[{embed.item}]}"
data-collection="{[{embed.collection}]}"
data-metadata="{[{embed.showMetadata}]}"
data-width="{[{embed.width}]}"
data-media="{[{embed.selectedMedia}]}"
async >
</script>
</div>
Our image viewer uses the IIIF 2.0 standard.
To load this item in other compatible viewers, use this url:
https://iiif.library.ubc.ca/presentation/dsp.24.1-0065506/manifest