UBC Theses and Dissertations
Analytical modeling of modern microprocessor performance
Chen, Xi
As the number of transistors integrated on a chip continues to increase, a growing challenge is accurately modeling performance in the early stages of processor design. Analytical modeling is an alternative to detailed simulation with the potential to shorten the development cycle and provide additional insight. This thesis proposes hybrid analytical models to predict the impact of pending cache hits, hardware prefetching, and realistic miss status holding register (MSHR) resources on superscalar performance. We propose techniques to model the non-negligible influences of pending hits and the fine-grained selection of instruction profile window blocks on the accuracy of hybrid analytical models. We also present techniques to estimate the performance impact of data prefetching by modeling the timeliness of prefetches and to account for a limited number of MSHRs by restricting the size of profile window blocks. As with earlier hybrid analytical models, our approach is roughly two orders of magnitude faster than detailed simulations. Overall, our techniques reduce the error of our baseline from 39.7% to 10.3% when the number of MSHRs is unlimited. When modeling a processor with data prefetching, a limited number of MSHRs, or both, our techniques result in an average error of 13.8%, 9.5% and 17.8%, respectively.
Moreover, this thesis proposes analytical models for predicting the cache contention and throughput of heavily fine-grained multithreaded architectures such as Sun Microsystems' Niagara. We first propose a novel probabilistic model using statistics characterizing individual threads run in isolation as inputs to accurately predict the number of extra cache misses due to cache contention among a large number of threads. We then present a Markov chain model for analytically estimating the throughput of multicore, fine-grained multithreaded architectures. Combined, the two models accurately predict system throughput obtained from a detailed simulator with an average error of 8.3% for various cache configurations. We also show that our models can find the same optimized design point of fine-grained multithreaded chip multiprocessors for application-specific workloads 65 times faster than detailed simulations. Furthermore, we show that our models accurately predict cache contention and throughput trends across varying workloads on real hardware, a Sun Fire T1000 server.
Attribution-NonCommercial-NoDerivatives 4.0 International