Data Movement Optimizations for GPU-based Non-uniform Processing-in-memory Systems

Author	: Kishore Punniyamurthy
Total Pages	: 292
Release	: 2021
OCLC	: 1262092982

Book Synopsis: Data Movement Optimizations for GPU-based Non-uniform Processing-in-memory Systems by Kishore Punniyamurthy

Data Movement Optimizations for GPU-based Non-uniform Processing-in-memory Systems was written by Kishore Punniyamurthy and released in 2021, with 292 pages in total. Book excerpt:

Recent technological trends have aided the design and development of large-scale heterogeneous systems in several ways: 1) 3D-stacking has enabled opportunities to place compute units into memory stacks, and 2) advancements in packaging technology now allow integrating high-bandwidth memory in the same package as compute. These trends have opened up a new class of non-uniform processing-in-memory (NUPIM) system architectures. NUPIM systems consist of multiple modules, each integrating (2.5D or 3D stacked) memory and compute together in the same package, interconnected via an off-chip network. Such modularity allows system scalability, but it also exacerbates the performance and energy penalty of data movement. Inter-module data movement becomes the limiting factor for performance and energy-efficiency scaling. Existing approaches to address data movement either do not account for dynamic, performance-critical application and system interactions, or incur high overhead that does not scale to NUPIM systems. My work focuses on addressing both the cause and the effect of data movement in NUPIM systems by collecting and exploiting knowledge about application and system behavior using scalable, low-overhead software and hardware techniques. Specifically, my research addresses data movement by: 1) accelerating critical data to mitigate traffic impact, 2) reducing the number of data bits moved, and 3) eliminating the need to move data in the first place.

To mitigate traffic impact, I first propose a low-overhead yet scalable scheme for congestion management in off-chip NUPIM networks. This approach dynamically tracks congested links and memory divergence using low-overhead techniques, and then accelerates the performance-critical data traffic. The collected information is further used to dynamically manage link widths and save I/O energy. Results show that the proposed scheme achieves on average 16% (and up to 33%) improvement over the baseline and 10% (and up to 29%) improvement over other congestion mitigation schemes.

To reduce I/O link traffic in NUPIM systems, I further propose cacheline utilization-aware link traffic compression (CUALiT). CUALiT exploits the variation in temporal and spatial utilization of individual cacheline words to achieve higher compression ratios. I utilize a novel mechanism to predict the utilization of cachelines across warps at word granularity. Unutilized words are pruned, latency-critical words are compressed with traditional schemes, and words with temporal slack are coalesced across cachelines and compressed lazily to achieve higher compression ratios. Results show that CUALiT achieves up to 24% lower system energy and on average 11% (up to 2x) higher performance over traditional compression schemes.

Finally, to help eliminate the need to move data, knowledge about application locality is critical in co-locating data and compute. I propose TAFE, a framework for accurate dynamic thread address footprint estimation of GPU applications. TAFE combines minimal static address pattern annotations with dynamic data dependency tracking to compute threadblock-specific address footprints of both data-dependent and data-independent access patterns prior to kernel launch. I propose pure software as well as hardware-assisted mechanisms for lightweight dependency tracking with minimal overhead. Furthermore, I develop compiler support for the framework to improve its applicability and reduce programmer overhead. Simulator-based evaluations show that TAFE achieves 91% estimation accuracy across a range of benchmarks. TAFE-assisted page/threadblock mapping improves performance by 32%-45% across different configurations. When evaluating TAFE on a real multi-GPU system, results show that TAFE-based data-placement hints reduce application runtime by 10% on average while minimizing programmer effort.
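
As a rough illustration of what a threadblock-specific address footprint might look like, the Python sketch below handles only the simplest data-independent case: it assumes each thread reads one array element at index `blockIdx * blockDim + threadIdx`. The function names, the 4 KiB page size, the `page_to_module` table, and the majority-vote placement rule are all hypothetical choices made for this example; they are not taken from TAFE, and data-dependent patterns are not covered.

```python
from collections import Counter

PAGE_SIZE = 4096  # bytes; an assumed page size for this sketch


def block_footprint(block_id, block_dim, base_addr, elem_size, page_size=PAGE_SIZE):
    """Return the set of page numbers touched by one threadblock.

    Assumes thread t of block b accesses the address
    base_addr + (b * block_dim + t) * elem_size (an affine pattern).
    """
    first = base_addr + block_id * block_dim * elem_size
    last = first + block_dim * elem_size - 1  # last byte the block touches
    return set(range(first // page_size, last // page_size + 1))


def place_blocks(num_blocks, block_dim, base_addr, elem_size, page_to_module):
    """Map each threadblock to the module holding most of its pages.

    page_to_module is a hypothetical dict from page number to module id.
    """
    placement = {}
    for b in range(num_blocks):
        pages = block_footprint(b, block_dim, base_addr, elem_size)
        votes = Counter(page_to_module[p] for p in pages)
        placement[b] = votes.most_common(1)[0][0]
    return placement


if __name__ == "__main__":
    # 8 blocks of 512 threads reading 4-byte elements (2 KiB per block),
    # with 4 pages interleaved across 2 modules.
    page_to_module = {p: p % 2 for p in range(4)}
    print(place_blocks(8, 512, 0, 4, page_to_module))
```

Because the footprint here depends only on launch parameters, it can be computed before the kernel runs, which is the property that lets placement decisions be made ahead of launch in this simplified setting.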
