Many parallel algorithms have recently been developed to accelerate solving the subset-sum problem on heterogeneous CPU-GPU systems. However, within each compute node, only one CPU core is used to control the GPU while all the remaining CPU cores sit idle, so a large fraction of the CPU resources is wasted. In this paper, based on a cost-optimal parallel two-list algorithm, we propose a novel heterogeneous cooperative computing approach to solve the subset-sum problem on a hybrid CPU-GPU cluster, which makes full use of all available computational resources of the cluster. Unbalanced workload distribution and heavy communication overhead are the two main obstacles to heterogeneous cooperative computing. In order to assign the most suitable workload to each compute node, partition it reasonably between the CPU and the GPU within each node, and minimize the inter-node and intra-node communication costs, we design a communication-avoiding workload distribution scheme suited to the parallel two-list algorithm. Based on this scheme, we provide an efficient heterogeneous cooperative implementation of the algorithm. A series of experiments are conducted on a hybrid CPU-GPU cluster, where each node has two 6-core CPUs and one GPU. The results show that heterogeneous cooperative computing significantly outperforms both CPU-only and GPU-only computing. © 2016 Elsevier Inc. All rights reserved.
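The abstract does not reproduce the communication-avoiding distribution scheme itself, so the following is only a minimal, generic sketch of static workload partitioning: work units are divided across nodes in proportion to a benchmarked per-node throughput and then split between the CPU cores and the GPU inside each node. The function name `split_workload` and the fixed `cpu_gpu_ratio` parameter are illustrative assumptions, not part of the paper's scheme.

```python
def split_workload(total_work, node_throughputs, cpu_gpu_ratio):
    """Generic static partitioning sketch (not the paper's scheme):
    divide `total_work` units across nodes in proportion to a measured
    per-node throughput, then split each node's share between its CPU
    cores and its GPU; `cpu_gpu_ratio` is the fraction of a node's
    share handled by the CPU side."""
    total_speed = sum(node_throughputs)
    plan = []
    for speed in node_throughputs:
        node_share = round(total_work * speed / total_speed)
        cpu_part = round(node_share * cpu_gpu_ratio)
        plan.append((cpu_part, node_share - cpu_part))
    return plan

# Example: 3 nodes, the third roughly twice as fast, CPUs handle 30% within a node.
print(split_workload(1_000_000, [1.0, 1.0, 2.0], 0.3))
```

In practice such ratios would be calibrated by profiling, and rounding leftovers would have to be reassigned; the sketch omits both for brevity.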
ISBN (print): 9781479938445
Recently, hybrid CPU/GPU clusters have been widely used to tackle compute-intensive problems such as the subset-sum problem. The two-list algorithm is a well-known approach to solving this problem. However, a hybrid MPI-CUDA dual-level parallelization of the algorithm on such a cluster is not straightforward. The key challenge is how to allocate the most suitable workload to each node so as to achieve good load balancing between nodes and minimize the communication overhead. Therefore, this paper proposes an effective workload distribution scheme that assigns a reasonable workload to each node. Based on this scheme, an efficient MPI-CUDA parallel implementation of the two-list algorithm is presented. A series of experiments are conducted to compare the performance of the hybrid MPI-CUDA implementation with that of the best sequential CPU implementation, a single-node CPU-only implementation, a single-node GPU-only implementation, and a hybrid MPI-OpenMP implementation on the same cluster configuration. The results show that the proposed hybrid MPI-CUDA implementation not only offers significant performance benefits but also exhibits excellent scalability.
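As an illustration of only the inter-node (MPI) level of such a dual-level decomposition, the sketch below divides the subset sums of one half of the items evenly among ranks and lets each rank binary-search the other half's sorted sums for a complement. It is a simplified CPU-only sketch assuming mpi4py; the helper names (`subset_sums`, `found_on_rank`) are hypothetical and it is not the paper's MPI-CUDA implementation.

```python
from bisect import bisect_left
from mpi4py import MPI

def subset_sums(items):
    """Enumerate all 2^len(items) subset sums of `items` (unsorted)."""
    sums = [0]
    for x in items:
        sums += [s + x for s in sums]
    return sums

def found_on_rank(weights, target, comm):
    """Each rank scans its own block of one half's sums and binary-searches
    the other half's sorted sums; results are combined with an allreduce."""
    half = len(weights) // 2
    a = subset_sums(weights[:half])            # this rank's search candidates
    b = sorted(subset_sums(weights[half:]))    # sorted complements
    rank, size = comm.Get_rank(), comm.Get_size()
    block = (len(a) + size - 1) // size        # even block distribution
    hit = 0
    for s in a[rank * block:(rank + 1) * block]:
        j = bisect_left(b, target - s)
        if j < len(b) and b[j] == target - s:
            hit = 1
            break
    return comm.allreduce(hit, op=MPI.SUM) > 0

if __name__ == "__main__":
    comm = MPI.COMM_WORLD
    reachable = found_on_rank([3, 34, 4, 12, 5, 2], 9, comm)
    if comm.Get_rank() == 0:
        print("target reachable:", reachable)   # run e.g.: mpiexec -n 4 python demo.py
```

In the papers above, the GPU within each node would process most of a rank's block, which the sketch intentionally leaves out.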
For more than three decades, the well-known two-list algorithm of Horowitz and Sahni [3] has remained the serial upper bound for the 0-1 knapsack problem with n items (KP01), running in time O(2^(n/2)). Recently, Chedid [2] suggested an optimal parallelization of that algorithm for a KP01 variant, the subset-sum problem, on a CREW PRAM with p = 2^(n/8) processors. It is shown here that, in addition to being incomplete, Chedid's result is a particular case of the one given by Sanches et al. [6]. © 2009 Elsevier B.V. All rights reserved.
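For reference, the classic Horowitz-Sahni meet-in-the-middle (two-list) decision procedure for the subset-sum problem can be sketched as follows. This is the textbook serial algorithm behind the O(2^(n/2)) bound mentioned above, not any of the parallel variants discussed here, and for brevity it sorts each half's sums rather than generating them in sorted order, so it carries an extra logarithmic factor.

```python
def subset_sums(items):
    """Return the sorted list of all 2^len(items) subset sums of `items`."""
    sums = [0]
    for x in items:
        sums += [s + x for s in sums]
    return sorted(sums)

def two_list_subset_sum(weights, target):
    """Meet-in-the-middle decision procedure: split the n items into two
    halves, enumerate the 2^(n/2) subset sums of each half, and scan the
    two sorted lists in opposite directions for a pair summing to `target`."""
    half = len(weights) // 2
    a = subset_sums(weights[:half])   # list A, ascending
    b = subset_sums(weights[half:])   # list B, ascending
    i, j = 0, len(b) - 1
    while i < len(a) and j >= 0:
        s = a[i] + b[j]
        if s == target:
            return True
        if s < target:                # too small: advance in A
            i += 1
        else:                         # too large: retreat in B
            j -= 1
    return False

print(two_list_subset_sum([3, 34, 4, 12, 5, 2], 9))   # True (4 + 5)
print(two_list_subset_sum([3, 34, 4, 12, 5, 2], 30))  # False
```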
In 1994, Chang et al. [Parallel Computing 20 (1994)] claimed a parallelization of the two-list algorithm with cost O(2^(5n/8)) on a shared-memory CREW SIMD PRAM model of computation. In 1997, Lou and Chang [Parallel Computing 22 (1997)] proposed a novel search phase for the two-list algorithm which, when combined with Chang et al.'s generation phase, gives an optimal parallelization of the two-list algorithm. In 2002, Sanches et al. [Parallel Computing 28 (2002)] proved that the results concerning Chang et al.'s generation phase are incorrect, invalidating both Chang et al.'s and Lou and Chang's results. In this paper, we describe a new generation phase for the two-list algorithm which, when combined with the search phase of Lou and Chang, reclaims an optimal parallelization of the two-list algorithm with cost O(2^(n/2)). © 2007 Elsevier B.V. All rights reserved.
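To make the "generation phase" concrete: the serial idea it parallelizes is to keep the list of subset sums sorted by merging at every step, so the sorted 2^(n/2)-element list is produced in time proportional to its final size rather than with a trailing sort. The sketch below shows only this serial merging idea; how the paper's new generation phase distributes these merges across processors is not reproduced here.

```python
def sorted_subset_sums(items):
    """Generation-phase idea in serial form: maintain a sorted list of
    subset sums and, for each new item, merge it with its shifted copy.
    The total merge work is about twice the final list length."""
    sums = [0]
    for x in items:
        shifted = [s + x for s in sums]   # already sorted, since sums is
        merged, i, j = [], 0, 0
        while i < len(sums) and j < len(shifted):   # standard two-way merge
            if sums[i] <= shifted[j]:
                merged.append(sums[i]); i += 1
            else:
                merged.append(shifted[j]); j += 1
        merged += sums[i:] + shifted[j:]
        sums = merged
    return sums

print(sorted_subset_sums([5, 2, 7]))   # [0, 2, 5, 7, 7, 9, 12, 14]
```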