版权所有:内蒙古大学图书馆 技术提供:维普资讯• 智图
内蒙古自治区呼和浩特市赛罕区大学西街235号 邮编: 010021
作者机构:Univ Coll Dublin Sch Comp Sci Dublin D04 V1W8 4 Ireland
出 版 物:《IEEE ACCESS》 (IEEE Access)
年 卷 期:2018年第6卷
页 面:69075-69106页
核心收录:
基 金:Science Foundation Ireland [14/IA/2474] Science Foundation Ireland (SFI) [14/IA/2474] Funding Source: Science Foundation Ireland (SFI)
主 题:Data parallelism data partitioning energy energy optimization homogeneous multicore CPU clusters load balancing parallel algorithms performance performance optimization
摘 要:Data partitioning algorithms aiming to minimize the execution time and the energy of computations in self-adaptable data-parallel applications on modern extreme-scale multicore platforms must address two critical challenges. First, they must take into account the new complexities inherent in these platforms such as severe resource contention and non-uniform memory access. Second, they must have low practical runtime and memory costs. The sequential data partitioning algorithms addressing the first challenge have a theoretical time complexity of O(m*m*p*p) where m is the number of points in the discrete speed/energy function and p is the number of available processors. They, however, exhibit high practical runtime cost and excessive memory footprint, therefore, rendering them impracticable for employment in self-adaptable applications executing on extreme-scale multicore platforms. We present, in this paper, the parallel data partitioning algorithms that address both the challenges. They take as input the functional models of performance and energy consumption against problem size and output workload distributions, which are globally optimal solutions. They have a low time complexity of O(m*m*p) thereby providing a linear speedup of O(p) and low memory complexity of O(n) where n is the workload size expressed as a multiple of granularity. They employ dynamic programming approach, which also facilitates the easier integration of performance and energy models of communications. We experimentally study the practical cost of application of our algorithms in two data-parallel applications, matrix multiplication and fast Fourier transform, on a cluster in Grid 5000 platform. We demonstrate that their practical runtime and memory costs are low making them ideal for employment in self-adaptable applications. We also show that the parallel algorithms exhibit tremendous speedups over the sequential algorithms. Finally, using theoretical analysis for a forecast exascale platfor