Two parallel computer paradigms available today are multi-core accelerators such as the Sony, Toshiba and IBM Cell or Graphics Processing Unit (GPUs), and massively parallel message-passing machines such as the IBM Bl...
详细信息
Two parallel computer paradigms available today are multi-core accelerators such as the Sony, Toshiba and IBM Cell or Graphics Processing Unit (GPUs), and massively parallel message-passing machines such as the IBM Blue Gene (BG). The solution of systems of linear equations is one of the most central processing unit-intensive steps in engineering and simulation applications and can greatly benefit from the multitude of processing cores and vectorisation on today's parallel computers. We parallelise the conjugate gradient (CG) linear equation solver on the Cell Broadband Engine and the IBM Blue Gene/L machine. We perform a scalability analysis of CG on both machines across 1, 8 and 16 synergistic processing elements and 1-32 cores on BG with heptadiagonal matrices. The results indicate that the multi-core Cell system outperforms by three to four times the massively parallel BG system due to the Cell's higher communication bandwidth and accelerated vector processing capability.
We develop an optimized FFT based Poisson solver on a CPU-GPU heterogeneous platform for the case when the input is too large to fit on the GPU global memory. The solver involves memory bound computations such as 3D F...
详细信息
ISBN:
(纸本)9780769549712
We develop an optimized FFT based Poisson solver on a CPU-GPU heterogeneous platform for the case when the input is too large to fit on the GPU global memory. The solver involves memory bound computations such as 3D FFT in which the large 3D data may have to be transferred over the PCIe bus several times during the computation. We develop a new strategy to decompose and allocate the computation between the GPU and the CPU such that the 3D data is transferred only once to the device memory, and the executions of the GPU kernels are almost completely overlapped with the PCI data transfer. We were able to achieve significantly better performance than what has been reported in previous related work, including over 50 GFLOPS for the three periodic boundary conditions, and over 40 GFLOPS for the two periodic, one Neumann boundary conditions. The PCIe bus bandwidth achieved is over 5GB/s, which is close to the best possible on our platform. For all the cases tested, the single 3D PCIe transfer time, which constitutes a lower bound on what is possible on our platform, takes almost 70% of the total execution time of the Poisson solver.
Matrix factorization is known to be an effective method for recommender systems that are given only the ratings from users to items. Currently, stochastic gradient (SG) method is one of the most popular algorithms for...
详细信息
Matrix factorization is known to be an effective method for recommender systems that are given only the ratings from users to items. Currently, stochastic gradient (SG) method is one of the most popular algorithms for matrix factorization. However, as a sequential approach, SG is difficult to be parallelized for handling web-scale problems. In this article, we develop a fast parallel SG method, FPSG, for shared memory systems. By dramatically reducing the cache-miss rate and carefully addressing the load balance of threads, FPSG is more efficient than state-of-the-art parallel algorithms for matrix factorization.
暂无评论