Modern GPU systems are constantly evolving to meet the needs of compute-intensive applications in scientific and machine learning domains. However, there is typically a gap between the hardware capacity and the achi...
ISBN: (Print) 9781450337236
With the increased popularity of multi-GPU nodes in modern HPC clusters, it is imperative to develop matching programming paradigms for their efficient utilization. In order to take advantage of the local GPUs and the low-latency, high-throughput interconnects that link them, programmers need to meticulously adapt parallel applications with respect to load balancing, boundary conditions, and device synchronization. This paper presents MAPS-multi, an automatic multi-GPU partitioning framework that distributes the workload based on the underlying memory access patterns. The framework consists of host- and device-level APIs that allow programs to run efficiently on a variety of GPU and multi-GPU architectures. The framework implements several layers of code optimization, device abstraction, and automatic inference of inter-GPU memory exchanges. The paper demonstrates that MAPS-multi achieves near-linear scaling on fundamental computational operations, as well as on real-world applications in deep learning and multivariate analysis.
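The abstract gives no code, but the kind of pattern-driven workload distribution it describes can be illustrated with a hypothetical sketch (my own, not MAPS-multi's actual API): splitting the rows of a 2D grid across GPUs and widening each piece with the one-row halos that a 5-point stencil access pattern would require, which is also where inter-GPU memory exchanges arise.

```python
# Hypothetical sketch of pattern-based multi-GPU partitioning (not the
# MAPS-multi API): divide num_rows rows of a 2D grid across num_gpus
# devices, and extend each owned range with halo rows implied by a
# stencil of the given radius.

def partition_with_halos(num_rows, num_gpus, halo=1):
    """Return, per GPU, the (start, end) owned row range and the
    (lo, hi) range including halo rows needed by the stencil."""
    base, rem = divmod(num_rows, num_gpus)
    parts = []
    start = 0
    for g in range(num_gpus):
        # earlier GPUs absorb the remainder, one extra row each
        end = start + base + (1 if g < rem else 0)
        lo = max(0, start - halo)          # halo row(s) from the GPU above
        hi = min(num_rows, end + halo)     # halo row(s) from the GPU below
        parts.append({"owned": (start, end), "with_halo": (lo, hi)})
        start = end
    return parts
```

For example, `partition_with_halos(10, 4)` assigns rows (0,3), (3,6), (6,8), (8,10) to the four GPUs; the halo ranges overlap neighbouring partitions by one row, and those overlapping rows are exactly the data a framework must exchange between devices after each iteration.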
ISBN: (Print) 9781450335591
In this paper we present AMGE, a programming framework and runtime system that transparently decomposes GPU kernels and executes them on multiple GPUs in parallel. AMGE exploits the remote memory access capability in modern GPUs to ensure that data can be accessed regardless of its physical location, allowing our runtime to safely decompose and distribute arrays across GPU memories. It optionally performs a compiler analysis that detects array access patterns in GPU kernels. Using this information, the runtime can choose more efficient computation and data distribution configurations than previous works. The GPU execution model allows AMGE to hide the cost of remote accesses if they are kept below 5%. We demonstrate that a thread block scheduling policy that distributes remote accesses throughout the whole kernel execution further reduces their overhead. Results show 1.98x and 3.89x execution speedups for 2 and 4 GPUs, respectively, for a wide range of dense computations compared to the original versions on a single GPU.
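To make the 5% figure concrete, here is a back-of-the-envelope estimate (my own sketch, not part of AMGE) of the remote-access fraction for a row-wise array decomposition where each element reads its four stencil neighbours; only the neighbour reads that cross an internal partition boundary must go to another GPU's memory.

```python
# Rough model (not AMGE's implementation): fraction of accesses that are
# remote when a (rows x cols) array is split row-wise across num_gpus
# devices and each element reads its 4 stencil neighbours.

def remote_access_fraction(rows, cols, num_gpus):
    assert rows % num_gpus == 0, "assume an even row-wise split"
    boundaries = num_gpus - 1        # internal partition boundaries
    # each internal boundary causes 2 * cols remote neighbour reads
    # (one row reading up across it, one row reading down across it)
    remote = 2 * cols * boundaries
    total = 4 * rows * cols          # ~4 neighbour reads per element
    return remote / total
```

For a 1024x1024 array on 4 GPUs this gives about 0.15% remote accesses, comfortably under the 5% threshold the abstract cites; the fraction grows with the number of partition boundaries and shrinks with the partition height, which is why coarse row-wise splits of large arrays keep remote traffic cheap.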
ISBN: (Print) 9781450328098
We present AMGE, a programming framework and runtime system to decompose data and GPU kernels and execute them on multiple GPUs concurrently. AMGE exploits the remote memory access capability of recent GPUs to guarantee data accessibility regardless of its physical location, thus allowing AMGE to safely decompose and distribute arrays across GPU memories. AMGE also includes a compiler analysis to detect array access patterns in GPU kernels. The runtime uses this information to automatically choose the best computation and data distribution configuration. Through effective use of GPU caches, AMGE achieves good scalability in spite of the limited interconnect bandwidth between GPUs. Results show 1.95x and 3.73x execution speedups for 2 and 4 GPUs, respectively, for a wide range of dense computations compared to the original versions on a single GPU.