ISBN (Print): 9783031232190; 9783031232206
While there has been growing interest in supporting accelerators, especially GPU accelerators, in large-scale systems, the user typically has to work with low-level GPU programming models such as CUDA along with the low-level Message Passing Interface (MPI). We believe higher-level programming models, such as Partitioned Global Address Space (PGAS) programming models, enable productive parallel programming at both the intra-node and inter-node levels on homogeneous and heterogeneous nodes. However, GPU programming with PGAS languages is still limited in practice because there remains a large performance gap between compiler-generated GPU code and hand-tuned GPU code; hand-optimization of CPU-GPU data transfers is also an important contributor to this gap. Thus, it is not uncommon for the user to eventually write a fully external GPU program that includes both the host part, i.e., GPU memory (de)allocation and host-device/device-host data transfers, and the device part, i.e., GPU kernels, and call it from their primary language, which is not very productive. Our key observation is that the complexity of writing the external GPU program comes not only from writing GPU kernels in the device part, but also from writing the host part. In particular, interfacing objects in the primary language with raw C/C++ pointers is tedious and error-prone, especially because high-level languages usually have a well-defined type system with type inference. In this paper, we introduce the GPUAPI module, which offers multiple abstraction levels over low-level GPU API routines for high-level programming models, with a special focus on PGAS languages, allowing the user to choose an appropriate abstraction level depending on their tuning scenario. The module is also designed to work with multiple standard low-level GPU programming models: CUDA, HIP, DPC++, and SYCL, thereby significantly improving productivity and portability. We use Chapel as the primary example, and our preliminary performance ...