The tridiagonal solver is an important kernel and is widely supported in mainstream numerical libraries. While parallel algorithms have been studied for many-core architectures, the performance of current algorithms and implementations is still hindered by input-size sensitivity and limited cross-platform portability. In this paper, we propose WM-pGE, a novel algorithm for the batched solution of diagonally dominant tridiagonal systems. The algorithm balances the key design objectives, including computational complexity, memory complexity, parallelism, and input-size sensitivity, better than existing algorithms. Moreover, we present a concise formulation that enables implementation and cross-platform optimization without loss of efficiency or generality, by extracting the platform-dependent work into only four vector operators. Results from our batched tridiagonal experiments show that the proposed algorithm outperforms the prior work PCR-pThomas by 25% and 12% on an NVIDIA Tesla V100 in single and double precision, respectively. On Intel KNL, our method achieves a 10% performance improvement over PCR-pThomas in double precision.
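For context, the "pThomas" component of the PCR-pThomas baseline referenced above assigns independent systems to threads and solves each with the classical Thomas algorithm. The following is only a minimal serial sketch in C of a single Thomas solve for one diagonally dominant system; the batched thread-per-system distribution and the WM-pGE algorithm itself are not reproduced here.

#include <stddef.h>

/* Classical Thomas algorithm for a single tridiagonal system
 *   a[i]*x[i-1] + b[i]*x[i] + c[i]*x[i+1] = d[i],  i = 0..n-1
 * (a[0] and c[n-1] do not correspond to real couplings).
 * Stable without pivoting when the system is diagonally dominant.
 * Overwrites c and d as scratch storage. */
static void thomas_solve(size_t n, const double *a, const double *b,
                         double *c, double *d, double *x)
{
    /* Forward elimination */
    c[0] = c[0] / b[0];
    d[0] = d[0] / b[0];
    for (size_t i = 1; i < n; ++i) {
        double m = 1.0 / (b[i] - a[i] * c[i - 1]);
        c[i] = c[i] * m;
        d[i] = (d[i] - a[i] * d[i - 1]) * m;
    }
    /* Back substitution */
    x[n - 1] = d[n - 1];
    for (size_t i = n - 1; i-- > 0; )
        x[i] = d[i] - c[i] * x[i + 1];
}

A batched "pThomas"-style variant simply runs such a solve on many independent (possibly already reduced) systems at once, for example one system per GPU thread or OpenMP task.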
The tridiagonal solver is an important kernel used in a wide range of applications and is well supported in mainstream numerical libraries. Quite a few parallel algorithms have been developed, but the best-performing algorithm varies across architectures as well as input sizes. Targeting this algorithm-choice challenge, we present a model-guided approach that determines the best batched tridiagonal algorithm for various many-core architectures and input sizes efficiently and effectively, achieving an algorithm-choice accuracy of over 92% on important architectures. Following this approach, we propose a hybrid CR-PCR-pThomas algorithm that balances computation and memory access. The hybrid algorithm outperforms the current state-of-the-art alternatives by up to 32% and 21% on Pascal P100 and Knights Landing, respectively. On the SW26010 processor that powers the No. 6 supercomputer Sunway TaihuLight, we present an improved cyclic reduction algorithm, Dist-CR, which outperforms the Thomas algorithm with speedups of up to 2.14x.
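As a rough illustration of the CR/PCR building block behind such hybrids, the C snippet below performs one parallel cyclic reduction step on a tridiagonal system stored as coefficient arrays, treating out-of-range neighbor rows as identity rows. This is only a sketch of the generic PCR update, not the paper's CR-PCR-pThomas scheme or Dist-CR.

#include <stddef.h>

/* One PCR reduction step with stride s on the system
 *   a[i]*x[i-s] + b[i]*x[i] + c[i]*x[i+s] = d[i].
 * Each equation i is combined with equations i-s and i+s so that,
 * after the step, row i only couples x[i-2s] and x[i+2s].
 * Rows outside [0, n) are treated as identity rows (b = 1, a = c = d = 0).
 * In a GPU kernel the loop body would run as one thread per row. */
static void pcr_step(size_t n, size_t s,
                     const double *a, const double *b,
                     const double *c, const double *d,
                     double *na, double *nb, double *nc, double *nd)
{
    for (size_t i = 0; i < n; ++i) {
        double k1 = (i >= s)    ? a[i] / b[i - s] : 0.0;
        double k2 = (i + s < n) ? c[i] / b[i + s] : 0.0;

        na[i] = (i >= s)    ? -a[i - s] * k1 : 0.0;
        nc[i] = (i + s < n) ? -c[i + s] * k2 : 0.0;
        nb[i] = b[i]
              - ((i >= s)    ? c[i - s] * k1 : 0.0)
              - ((i + s < n) ? a[i + s] * k2 : 0.0);
        nd[i] = d[i]
              - ((i >= s)    ? d[i - s] * k1 : 0.0)
              - ((i + s < n) ? d[i + s] * k2 : 0.0);
    }
}

After roughly log2(n) such steps every row decouples and x[i] = d[i]/b[i]; hybrids like CR-PCR-pThomas instead stop reducing early and finish the small remaining independent subsystems with serial Thomas solves.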
A parallel numerical simulation algorithm is presented for fractional-order systems involving Caputo-type derivatives, based on the Adams-Bashforth-Moulton predictor-corrector scheme. The parallel algorithm is implemented using several different approaches: a pure MPI version, a combination of MPI with OpenMP optimization, and a memory-saving speedup approach. All tests were run on a BlueGene/P cluster, and comparative results for the running time are provided. As an applied experiment, the solutions of a fractional-order version of a system describing a forced series LCR circuit are numerically computed, depicting cascades of period-doubling bifurcations that lead to the onset of chaotic behavior.
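For reference, the fractional Adams-Bashforth-Moulton scheme underlying this kind of simulation, in its standard Diethelm-Ford-Freed form for $D^{\alpha} y = f(t, y)$ with $0 < \alpha \le 1$ and uniform step $h$ (higher orders add further initial-condition terms, and the paper's weight conventions may differ slightly), reads:

\[
y^{P}_{n+1} = y_0 + \frac{1}{\Gamma(\alpha)} \sum_{j=0}^{n} b_{j,n+1}\, f(t_j, y_j),
\qquad
b_{j,n+1} = \frac{h^{\alpha}}{\alpha}\bigl[(n+1-j)^{\alpha} - (n-j)^{\alpha}\bigr],
\]
\[
y_{n+1} = y_0 + \frac{h^{\alpha}}{\Gamma(\alpha+2)}
\Bigl[f(t_{n+1}, y^{P}_{n+1}) + \sum_{j=0}^{n} a_{j,n+1}\, f(t_j, y_j)\Bigr],
\]
\[
a_{0,n+1} = n^{\alpha+1} - (n-\alpha)(n+1)^{\alpha},
\qquad
a_{j,n+1} = (n-j+2)^{\alpha+1} + (n-j)^{\alpha+1} - 2(n-j+1)^{\alpha+1}, \quad 1 \le j \le n.
\]

The history sums over all previous values of $f(t_j, y_j)$ make the per-step cost grow linearly with $n$, which is presumably what the MPI/OpenMP parallelization and the memory-saving variant in the abstract target.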
This work presents a parallel version of a complex numerical algorithm for solving an elastohydrodynamic piezoviscous lubrication problem studied in tribology. The numerical algorithm combines regula falsi, fixed-point techniques, finite elements, and duality methods. The execution of the sequential program on a workstation requires significant CPU time and memory resources. Thus, in order to reduce the computational cost, we have applied parallelization techniques to the most costly parts of the original source code. Some blocks of the sequential code were also redesigned for execution on a multicomputer. In this paper, our parallel version is described in detail, execution times that show its efficiency in terms of speedup are presented, and new numerical results that establish the convergence of the algorithm for higher imposed load values when using finer meshes are depicted. As a whole, this paper illustrates the difficulties involved in parallelizing and optimizing complex numerical algorithms based on finite elements. (C) 2001 Elsevier Science B.V. All rights reserved.
Compact finite difference schemes provide highly accurate solutions. However, the resulting systems from compact schemes are tridiagonal systems that are difficult to solve efficiently on parallel computers. Considering their almost symmetric Toeplitz structure, a parallel algorithm, simple parallel prefix (SPP), is proposed. The SPP algorithm requires less memory than conventional LU decomposition and is efficient on parallel machines. It consists of a prefix communication pattern and AXPY operations. Both the computation and the communication can be truncated without degrading the accuracy when the system is diagonally dominant. A formal accuracy study has been conducted to provide a simple truncation formula. Experimental results have been measured on a MasPar MP-1 SIMD machine and on a Cray 2 vector machine. They show that the simple parallel prefix algorithm is a good algorithm for symmetric and almost symmetric Toeplitz tridiagonal systems and for compact schemes on high-performance computers.
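The truncation works because, for diagonally dominant systems, the influence of distant equations decays exponentially. One way to see this decay is through the pivot recurrence of the LU factorization of a constant-stencil (Toeplitz) tridiagonal matrix; the small illustrative C program below (not the SPP algorithm itself, and with an arbitrary example stencil) shows how quickly the pivots settle:

#include <math.h>
#include <stdio.h>

/* Pivot recurrence of the LU factorization of a Toeplitz tridiagonal
 * matrix with constant stencil (a, b, c):
 *   d_1 = b,   d_i = b - a*c / d_{i-1}.
 * For a diagonally dominant stencil (|b| > |a| + |c|) the pivots
 * converge geometrically to a fixed point, so far-away contributions
 * can be truncated with little loss of accuracy. */
int main(void)
{
    const double a = -1.0, b = 4.0, c = -1.0;   /* |b| > |a| + |c| */
    double d = b;
    for (int i = 2; i <= 10; ++i) {
        double d_new = b - a * c / d;
        printf("i = %2d  d_i = %.15f  |change| = %.2e\n",
               i, d_new, fabs(d_new - d));
        d = d_new;
    }
    return 0;
}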
The problem associated with stiff ordinary differential equation (ODE) systems in parallel processing is that the computation cannot be started simultaneously on many processors with an explicit formula. The propose...
Communication and synchronization costs are a key problem in parallel computing. Studying direct and iterative numerical methods on nearest-neighbor-type distributed systems, we give speedup evaluations that depend on computation, communication, and control costs, and compare some of them with experimental measurements.
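As an illustration of the kind of evaluation described, a generic speedup model that separates the three cost components (the paper's actual model is not reproduced here) can be written as

\[
S(p) = \frac{T_{\mathrm{seq}}}{T_{\mathrm{comp}}(p) + T_{\mathrm{comm}}(p) + T_{\mathrm{ctrl}}(p)},
\]

where $T_{\mathrm{comp}}(p) \approx T_{\mathrm{seq}}/p$ for a perfectly divisible workload, $T_{\mathrm{comm}}(p)$ accounts for nearest-neighbor message exchanges (typically a per-message latency plus a per-word bandwidth term), and $T_{\mathrm{ctrl}}(p)$ collects synchronization and control overhead.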