UBC Theses and Dissertations
Improving GPU programming models through hardware cache coherence Singh, Inderpreet
Graphics Processing Units (GPUs) have been shown to be effective at achieving large speedups over contemporary chip multiprocessors (CMPs) on massively parallel programs. The lack of well-defined GPU memory models, however, prevents support of high-level languages like C++ and Java, and negatively impacts their programmability. This thesis proposes to improve GPU programmability by adding support for a well-defined memory consistency model through hardware cache coherence. We show that GPU coherence introduces a new set of challenges different from that posed by scalable cache coherence for CMPs. First, introducing conventional directory coherence protocols adds unnecessary coherence traffic overhead to existing GPU applications. Second, the massively multithreaded GPU architecture presents significant storage overheads for buffering thousands of in-flight coherence requests. Third, these protocols increase the verification complexity of the GPU memory system. Recent research, Library Cache Coherence (LCC), explored the use of time-based approaches in CMP coherence protocols. This thesis describes a time-based coherence framework for GPUs, called Temporal Coherence (TC), that exploits globally synchronized counters in single-chip systems to develop a streamlined GPU coherence protocol. Synchronized counters enable all coherence transitions, such as invalidation of cache blocks, to happen synchronously, eliminating all coherence traffic and protocol races. We present two implementations of TC, called TC-Strong and TC-Weak. TC-Strong implements an optimized version of LCC, while TC-Weak uses a novel timestamp based memory fence mechanism to implement Release Consistency on GPUs. TC-Weak eliminates TC-Strong’s trade-off between stalling stores and increasing L1 miss rates to improve performance and reduce interconnect traffic. By providing coherent L1 caches, TC-Weak improves the performance of GPU applications requiring coherence by 85% over disabling the non-coherent L1 caches in the baseline GPU. We also show that write-through protocols outperform a writeback protocol on a GPU as the latter suffers from increased traffic due to unnecessary refills of write-once data.
Item Citations and Data
Attribution 3.0 Unported