Optimal partitioning of a square computational domain over several heterogeneous processors, balancing the load of the processors and minimizing the inter-processor communication cost, is crucial for data parallel den...
详细信息
Optimal partitioning of a square computational domain over several heterogeneous processors, balancing the load of the processors and minimizing the inter-processor communication cost, is crucial for data parallel dense linear algebra and other applications having similar communication pattern on modern hybrid servers. Although a solution has been found for two processors, the cases of three and more processors are still open. The state of-the-art solution for three processors uses an approximation communication cost function which fails to accurately account for the total amount of data moved between processors, leaving thus the question of its global optimality unanswered. In this work, we formulate and solve a mathematical problem of optimal partitioning a real-valued square over three heterogeneous processors using a new cost function, which accurately accounts for the total amount of data communicated between processors. We also develop an original method for accurate experimental evaluation of the communication time of data movement between memories of the compute devices in the hybrid platform during the execution of data parallel applications. We successfully use this method in the experimental validation of our mathematical results. Finally, we propose a communication energy model predicting the dynamic energy consumption of data movement between processors and experimentally validate its accuracy. This model predicts, and the experiments confirm, that the performance-optimal partition is not necessarily energy optimal.
In this paper, we study the problem of partitioning a matrix over a small number of interconnected heterogeneous processors. This problem is crucial for data parallel dense linear algebra and other applications with s...
详细信息
ISBN:
(纸本)9781728189468
In this paper, we study the problem of partitioning a matrix over a small number of interconnected heterogeneous processors. This problem is crucial for data parallel dense linear algebra and other applications with similar communication patterns on modern hybrid servers, integrating several heterogeneous compute devices such as CPUs, GPUs and other accelerators. The objective is to balance the load of the heterogeneous devices while minimising the communication cost. While the problem has been solved for the case of two processors, it is still open for three and more processors. The state-of-the-art solution for the case of three processors uses a communication cost function, which does not accurately account for the total amount of data moved between processors and therefore leaves the question of its global optimality open. In this work, we propose a cost function, which accurately represents the total amount of data moved between processors. Then, we formulate and solve the problem of optimal partitioning of a square computational domain, using this accurate communication cost function. Finally, we propose and implement an original experimental methodology for accurate measurement of the communication time of parallel applications on hybrid heterogeneous servers, integrating multi-core CPUs and various accelerators. We apply this methodology to experimental validation of our mathematical result.
暂无评论