Heterogeneous systems consisting of multiple multi-core CPUs and many-core accelerators have recently come into wide use, and more and more parallel applications are being developed for such systems. To fully utilize multiple compute devices that cooperatively and concurrently execute data-parallel kernels on heterogeneous systems, a feedback-based dynamic and elastic task scheduling scheme is proposed. By flexibly and dynamically adjusting the workload between devices during execution, it provides better load balance, higher device utilization, and lower scheduling overhead. This scheme is well suited to data-parallel kernels whose computation and data are uniformly distributed, but less suited to kernels whose computation and data are non-uniformly distributed. Therefore, an asynchronous-based dynamic and elastic task scheduling scheme is also proposed. By dynamically adjusting the chunk size according to performance changes at runtime, it avoids device underutilization, load imbalance across devices, and frequent kernel launches, inter-device data transfers, and inter-device synchronizations. A series of experiments is conducted with 8 representative parallel applications on a hybrid CPU-GPU-MIC system. The results show that the two proposed inter-device task scheduling schemes achieve efficient CPU-GPU-MIC co-processing of different parallel applications by effectively partitioning work across devices.
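The general idea of feedback-driven workload adjustment can be illustrated with a small simulation. This is only a minimal sketch under stated assumptions, not the scheduling scheme from the abstract: devices are modeled as fixed throughputs (items per second), work is handed out in synchronous rounds, and the function names `split_by_weights` and `simulate_feedback_schedule` are hypothetical. After each round, the throughput observed on each device becomes the weight used to split the next batch.

```python
from typing import List, Tuple


def split_by_weights(n: int, weights: List[float]) -> List[int]:
    """Split n items proportionally to weights; leftovers go to the largest weights."""
    total = sum(weights)
    shares = [int(n * w / total) for w in weights]
    leftover = n - sum(shares)
    order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    for i in order[:leftover]:
        shares[i] += 1
    return shares


def simulate_feedback_schedule(
    total_items: int, true_rates: List[float], batch: int
) -> Tuple[List[int], float]:
    """Simulate round-based scheduling with throughput feedback.

    true_rates are the (assumed, hidden) device throughputs in items/second.
    Returns the items assigned per device and the simulated makespan.
    """
    estimated = [1.0] * len(true_rates)  # no prior knowledge: equal split first
    assigned = [0] * len(true_rates)
    remaining = total_items
    makespan = 0.0
    while remaining > 0:
        n = min(batch, remaining)
        shares = split_by_weights(n, estimated)
        times = [s / r for s, r in zip(shares, true_rates)]
        makespan += max(times)  # assumption: devices synchronize after each round
        # Feedback step: observed throughput replaces the previous estimate.
        estimated = [
            s / t if t > 0 else e for s, t, e in zip(shares, times, estimated)
        ]
        for i, s in enumerate(shares):
            assigned[i] += s
        remaining -= n
    return assigned, makespan


if __name__ == "__main__":
    # Hypothetical devices: one slow (1 item/s) and one three times faster.
    assigned, makespan = simulate_feedback_schedule(800, [1.0, 3.0], batch=100)
    print(assigned, makespan)
```

In this toy setting the faster device ends up with most of the items, and the simulated makespan is lower than the 400 seconds a static equal split would take. A real scheduler would also have to account for kernel launch, data transfer, and synchronization costs, which this sketch ignores.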
Data-parallel applications running on heterogeneous high-performance computing platforms require a nonuniform distribution of the workload among the available processes. Data partitioning algorithms are formulated as an optimization problem: starting from the computational performance models of the processes, the goal is to find the partition that minimizes the communication cost. Traditionally, communication volume is the metric used to guide the partitioning. This metric, however, is unable to capture the complexity of current heterogeneous systems, which have uneven communication channels and execute applications with different communication patterns. In this paper, we discuss the role of analytical communication performance models as a metric in partitioning algorithms. First, we describe a method to programmatically predict the communication cost of a data-parallel kernel based on the τ-Lop analytical model. We show that this figure better captures the communication features of applications and platforms. We present results showing that this approach builds partitions that match or improve the performance of data-parallel applications on heterogeneous platforms with respect to previous volume-based strategies.
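The difference between a volume metric and a modeled-cost metric can be shown with a toy comparison. This is a hedged sketch, not the τ-Lop model: it substitutes a simple latency-plus-bandwidth cost per transfer, assumes transfers do not overlap, and uses invented channel parameters and candidate partitions (`CHANNELS`, `CANDIDATES`) purely for illustration.

```python
from typing import Dict, List, Tuple

Transfer = Tuple[str, int]  # (channel name, bytes sent over that channel)

# Hypothetical channels: name -> (latency in seconds, bandwidth in bytes/second).
CHANNELS: Dict[str, Tuple[float, float]] = {
    "shared_memory": (1e-6, 10e9),
    "network": (50e-6, 1e9),
}

# Hypothetical candidate partitions, described by the transfers they induce.
CANDIDATES: Dict[str, List[Transfer]] = {
    "A": [("network", 4_000_000), ("shared_memory", 1_000_000)],  # 5 MB total
    "B": [("network", 1_000_000), ("shared_memory", 6_000_000)],  # 7 MB total
}


def volume(transfers: List[Transfer]) -> int:
    """Traditional metric: total bytes communicated, regardless of channel."""
    return sum(nbytes for _, nbytes in transfers)


def modeled_cost(
    transfers: List[Transfer], channels: Dict[str, Tuple[float, float]]
) -> float:
    """Stand-in cost model: latency + bytes / bandwidth, summed over transfers."""
    return sum(
        channels[name][0] + nbytes / channels[name][1] for name, nbytes in transfers
    )


best_by_volume = min(CANDIDATES, key=lambda k: volume(CANDIDATES[k]))
best_by_cost = min(CANDIDATES, key=lambda k: modeled_cost(CANDIDATES[k], CHANNELS))

if __name__ == "__main__":
    print("volume metric picks:", best_by_volume)
    print("cost model picks:", best_by_cost)
```

With these invented numbers the two metrics disagree: partition A moves fewer bytes overall, but partition B pushes most of its traffic through the faster channel and has the lower modeled cost. This is the kind of effect that uneven communication channels can produce.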