Improving GPU programming models through hardware cache coherence

UBC Theses and Dissertations

Featured Collection

UBC Theses and Dissertations

Improving GPU programming models through hardware cache coherence Singh, Inderpreet

Abstract

Graphics Processing Units (GPUs) have been shown to be effective at achieving large speedups over contemporary chip multiprocessors (CMPs) on massively parallel programs. The lack of well-defined GPU memory models, however, prevents support of high-level languages like C++ and Java, and negatively impacts their programmability. This thesis proposes to improve GPU programmability by adding support for a well-defined memory consistency model through hardware cache coherence. We show that GPU coherence introduces a new set of challenges different from that posed by scalable cache coherence for CMPs. First, introducing conventional directory coherence protocols adds unnecessary coherence traffic overhead to existing GPU applications. Second, the massively multithreaded GPU architecture presents significant storage overheads for buffering thousands of in-flight coherence requests. Third, these protocols increase the verification complexity of the GPU memory system. Recent research, Library Cache Coherence (LCC), explored the use of time-based approaches in CMP coherence protocols. This thesis describes a time-based coherence framework for GPUs, called Temporal Coherence (TC), that exploits globally synchronized counters in single-chip systems to develop a streamlined GPU coherence protocol. Synchronized counters enable all coherence transitions, such as invalidation of cache blocks, to happen synchronously, eliminating all coherence traffic and protocol races. We present two implementations of TC, called TC-Strong and TC-Weak. TC-Strong implements an optimized version of LCC, while TC-Weak uses a novel timestamp based memory fence mechanism to implement Release Consistency on GPUs. TC-Weak eliminates TC-Strong’s trade-off between stalling stores and increasing L1 miss rates to improve performance and reduce interconnect traffic. By providing coherent L1 caches, TC-Weak improves the performance of GPU applications requiring coherence by 85% over disabling the non-coherent L1 caches in the baseline GPU. We also show that write-through protocols outperform a writeback protocol on a GPU as the latter suffers from increased traffic due to unnecessary refills of write-once data.

Item Metadata

Title	Improving GPU programming models through hardware cache coherence
Creator	Singh, Inderpreet
Publisher	University of British Columbia
Date Issued	2013
Description	Graphics Processing Units (GPUs) have been shown to be effective at achieving large speedups over contemporary chip multiprocessors (CMPs) on massively parallel programs. The lack of well-defined GPU memory models, however, prevents support of high-level languages like C++ and Java, and negatively impacts their programmability. This thesis proposes to improve GPU programmability by adding support for a well-defined memory consistency model through hardware cache coherence. We show that GPU coherence introduces a new set of challenges different from that posed by scalable cache coherence for CMPs. First, introducing conventional directory coherence protocols adds unnecessary coherence traffic overhead to existing GPU applications. Second, the massively multithreaded GPU architecture presents significant storage overheads for buffering thousands of in-flight coherence requests. Third, these protocols increase the verification complexity of the GPU memory system. Recent research, Library Cache Coherence (LCC), explored the use of time-based approaches in CMP coherence protocols. This thesis describes a time-based coherence framework for GPUs, called Temporal Coherence (TC), that exploits globally synchronized counters in single-chip systems to develop a streamlined GPU coherence protocol. Synchronized counters enable all coherence transitions, such as invalidation of cache blocks, to happen synchronously, eliminating all coherence traffic and protocol races. We present two implementations of TC, called TC-Strong and TC-Weak. TC-Strong implements an optimized version of LCC, while TC-Weak uses a novel timestamp based memory fence mechanism to implement Release Consistency on GPUs. TC-Weak eliminates TC-Strong’s trade-off between stalling stores and increasing L1 miss rates to improve performance and reduce interconnect traffic. By providing coherent L1 caches, TC-Weak improves the performance of GPU applications requiring coherence by 85% over disabling the non-coherent L1 caches in the baseline GPU. We also show that write-through protocols outperform a writeback protocol on a GPU as the latter suffers from increased traffic due to unnecessary refills of write-once data.
Genre	Thesis/Dissertation
Type	Text
Language	eng
Date Available	2013-06-08
Provider	Vancouver : University of British Columbia Library
Rights	Attribution 3.0 Unported
DOI	10.14288/1.0073882
URI	http://hdl.handle.net/2429/44548
Degree	Master of Applied Science - MASc
Program	Electrical and Computer Engineering
Affiliation	Applied Science, Faculty of; Electrical and Computer Engineering, Department of
Degree Grantor	University of British Columbia
Graduation Date	2013-11
Campus	UBCV
Scholarly Level	Graduate
Rights URI	http://creativecommons.org/licenses/by/3.0/
Aggregated Source Repository	DSpace

Open Collections

UBC Theses and Dissertations

UBC Theses and Dissertations

Improving GPU programming models through hardware cache coherence Singh, Inderpreet

Abstract

Item Metadata

Item Media

Item Citations and Data

Rights