UBC Theses and Dissertations
Efficient synchronization mechanisms for scalable GPU architectures
Ren, Xiaowei
The Graphics Processing Unit (GPU) has become a mainstream computing platform for a wide range of applications. Unlike latency-critical Central Processing Units (CPUs), throughput-oriented GPUs provide high performance by exploiting massive application parallelism. In parallel programming, synchronization is necessary to exchange information between threads that depend on one another. However, inefficient synchronization support can serialize thread execution and restrict parallelism significantly. Because parallelism is key to GPU performance, we aim to provide efficient and reliable synchronization support for both single-GPU and multi-GPU systems. To achieve this goal, this dissertation explores multiple abstraction layers of computer systems, including programming models, memory consistency models, cache coherence protocols, and application-specific knowledge of graphics rendering. First, to reduce programming burden without introducing data races, we propose Relativistic Cache Coherence (RCC) to enforce Sequential Consistency (SC). By using logical timestamps to avoid stalls for write permission acquisition, RCC is 30% faster than the best prior SC proposal, and only 7% slower than the best non-SC design. Second, we introduce GETM, the first GPU Hardware Transactional Memory (HTM) with eager conflict detection, to help programmers implement deadlock-free, yet aggressively parallel code. Compared to the best prior GPU HTM, GETM is up to 2.1× (1.2× gmean) faster, with 3.6× lower area overheads and 2.2× lower power overheads. Third, we design HMG, a hierarchical cache coherence protocol for multi-GPU systems. By leveraging the latest scoped memory model, HMG not only avoids the full cache invalidation of software coherence protocols, but also filters out write invalidation acknowledgments and transient coherence states. Despite minimal hardware overhead, HMG can achieve 97% of the performance of an idealized caching system.
Finally, we propose CHOPIN, a novel Split Frame Rendering (SFR) scheme that takes advantage of the parallelism of image composition. CHOPIN can eliminate the performance overheads of primitive duplication and sequential primitive distribution that exist in previous work. CHOPIN outperforms the best prior SFR implementation by up to 56% (25% gmean) in an 8-GPU system.
Attribution 4.0 International