The tridiagonal solver is an important kernel and is widely supported in mainstream numerical libraries. While parallel algorithms have been studied for many-core architectures, the performance of current algorithms and implementations is still hindered by input-size sensitivity and limited cross-platform portability. In this paper, we propose WM-pGE, a novel algorithm for the batched solution of diagonally dominant tridiagonal systems. The algorithm balances the key design objectives, including computational complexity, memory complexity, parallelism, and input-size sensitivity, better than existing algorithms. Moreover, we present a concise formulation that enables implementation and cross-platform optimization without loss of efficiency or generality, by extracting the platform-dependent work into only four vector operators. Results from our batched tridiagonal experiments show that the proposed algorithm outperforms the prior work PCR-pThomas by 25% and 12% on an NVIDIA Tesla V100 in single and double precision, respectively. On Intel KNL, our method achieves a 10% performance improvement over PCR-pThomas in double precision.
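For context, the "pThomas" component of the PCR-pThomas baseline referenced above assigns independent systems to threads and solves each with the classical Thomas algorithm. The following is only a minimal serial sketch in C of a single Thomas solve for one diagonally dominant system; the batched thread-per-system distribution and the WM-pGE algorithm itself are not reproduced here.

#include <stddef.h>

/* Classical Thomas algorithm for a single tridiagonal system
 *   a[i]*x[i-1] + b[i]*x[i] + c[i]*x[i+1] = d[i],  i = 0..n-1
 * (a[0] and c[n-1] do not correspond to real couplings).
 * Stable without pivoting when the system is diagonally dominant.
 * Overwrites c and d as scratch storage. */
static void thomas_solve(size_t n, const double *a, const double *b,
                         double *c, double *d, double *x)
{
    /* Forward elimination */
    c[0] = c[0] / b[0];
    d[0] = d[0] / b[0];
    for (size_t i = 1; i < n; ++i) {
        double m = 1.0 / (b[i] - a[i] * c[i - 1]);
        c[i] = c[i] * m;
        d[i] = (d[i] - a[i] * d[i - 1]) * m;
    }
    /* Back substitution */
    x[n - 1] = d[n - 1];
    for (size_t i = n - 1; i-- > 0; )
        x[i] = d[i] - c[i] * x[i + 1];
}

A batched "pThomas"-style variant simply runs such a solve on many independent (possibly already reduced) systems at once, for example one system per GPU thread or OpenMP task.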
The tridiagonal solver is an important kernel used in a wide range of applications and is well supported in mainstream numerical libraries. Quite a few parallel algorithms have been developed, but the best-performing algorithm varies across architectures as well as input sizes. Targeting this algorithm-choice challenge, we present a model-guided approach that determines the best batched tridiagonal algorithm for various many-core architectures and input sizes efficiently and effectively, achieving an algorithm-choice accuracy of over 92% on important architectures. Following this approach, we propose a hybrid CR-PCR-pThomas algorithm that balances computation and memory access. The hybrid algorithm outperforms the current state-of-the-art alternatives by up to 32% and 21% on Pascal P100 and Knights Landing, respectively. On the SW26010 processor that powers the No. 6 supercomputer Sunway TaihuLight, we present an improved cyclic reduction algorithm, Dist-CR, which outperforms the Thomas algorithm with speedups of up to 2.14x.
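As a rough illustration of the CR/PCR building block behind such hybrids, the C snippet below performs one parallel cyclic reduction step on a tridiagonal system stored as coefficient arrays, treating out-of-range neighbor rows as identity rows. This is only a sketch of the generic PCR update, not the paper's CR-PCR-pThomas scheme or Dist-CR.

#include <stddef.h>

/* One PCR reduction step with stride s on the system
 *   a[i]*x[i-s] + b[i]*x[i] + c[i]*x[i+s] = d[i].
 * Each equation i is combined with equations i-s and i+s so that,
 * after the step, row i only couples x[i-2s] and x[i+2s].
 * Rows outside [0, n) are treated as identity rows (b = 1, a = c = d = 0).
 * In a GPU kernel the loop body would run as one thread per row. */
static void pcr_step(size_t n, size_t s,
                     const double *a, const double *b,
                     const double *c, const double *d,
                     double *na, double *nb, double *nc, double *nd)
{
    for (size_t i = 0; i < n; ++i) {
        double k1 = (i >= s)    ? a[i] / b[i - s] : 0.0;
        double k2 = (i + s < n) ? c[i] / b[i + s] : 0.0;

        na[i] = (i >= s)    ? -a[i - s] * k1 : 0.0;
        nc[i] = (i + s < n) ? -c[i + s] * k2 : 0.0;
        nb[i] = b[i]
              - ((i >= s)    ? c[i - s] * k1 : 0.0)
              - ((i + s < n) ? a[i + s] * k2 : 0.0);
        nd[i] = d[i]
              - ((i >= s)    ? d[i - s] * k1 : 0.0)
              - ((i + s < n) ? d[i + s] * k2 : 0.0);
    }
}

After roughly log2(n) such steps every row decouples and x[i] = d[i]/b[i]; hybrids like CR-PCR-pThomas instead stop reducing early and finish the small remaining independent subsystems with serial Thomas solves.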
A parallel numerical simulation algorithm is presented for fractional-order systems involving Caputo-type derivatives, based on the Adams-Bashforth-Moulton predictor-corrector scheme. The parallel algorithm is implemented using several different approaches: a pure MPI version, a combination of MPI with OpenMP optimization, and a memory-saving speedup approach. All tests were run on a BlueGene/P cluster, and comparative results for the running time are provided. As an applied experiment, the solutions of a fractional-order version of a system describing a forced series LCR circuit are numerically computed, depicting cascades of period-doubling bifurcations that lead to the onset of chaotic behavior.
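For reference, the fractional Adams-Bashforth-Moulton scheme underlying this kind of simulation, in its standard Diethelm-Ford-Freed form for $D^{\alpha} y = f(t, y)$ with $0 < \alpha \le 1$ and uniform step $h$ (higher orders add further initial-condition terms, and the paper's weight conventions may differ slightly), reads:

\[
y^{P}_{n+1} = y_0 + \frac{1}{\Gamma(\alpha)} \sum_{j=0}^{n} b_{j,n+1}\, f(t_j, y_j),
\qquad
b_{j,n+1} = \frac{h^{\alpha}}{\alpha}\bigl[(n+1-j)^{\alpha} - (n-j)^{\alpha}\bigr],
\]
\[
y_{n+1} = y_0 + \frac{h^{\alpha}}{\Gamma(\alpha+2)}
\Bigl[f(t_{n+1}, y^{P}_{n+1}) + \sum_{j=0}^{n} a_{j,n+1}\, f(t_j, y_j)\Bigr],
\]
\[
a_{0,n+1} = n^{\alpha+1} - (n-\alpha)(n+1)^{\alpha},
\qquad
a_{j,n+1} = (n-j+2)^{\alpha+1} + (n-j)^{\alpha+1} - 2(n-j+1)^{\alpha+1}, \quad 1 \le j \le n.
\]

The history sums over all previous values of $f(t_j, y_j)$ make the per-step cost grow linearly with $n$, which is presumably what the MPI/OpenMP parallelization and the memory-saving variant in the abstract target.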
This work presents a parallel version of a complex numerical algorithm for solving an elastohydrodynamic piezoviscous lubrication problem studied in tribology. The numerical algorithm combines regula falsi, fixed-point techniques, finite elements, and duality methods. The execution of the sequential program on a workstation requires significant CPU time and memory resources. Thus, in order to reduce the computational cost, we have applied parallelization techniques to the most costly parts of the original source code. Some blocks of the sequential code were also redesigned for execution on a multicomputer. In this paper, our parallel version is described in detail, execution times that show its efficiency in terms of speedup are presented, and new numerical results that establish the convergence of the algorithm for higher imposed load values when using finer meshes are depicted. As a whole, this paper illustrates the difficulties involved in parallelizing and optimizing complex numerical algorithms based on finite elements. (C) 2001 Elsevier Science B.V. All rights reserved.
Compact finite difference schemes provide highly accurate solutions. However, the resulting systems from compact schemes are tridiagonal systems that are difficult to solve efficiently on parallel computers. Considering their almost symmetric Toeplitz structure, a parallel algorithm, simple parallel prefix (SPP), is proposed. The SPP algorithm requires less memory than conventional LU decomposition and is efficient on parallel machines. It consists of a prefix communication pattern and AXPY operations. Both the computation and the communication can be truncated without degrading the accuracy when the system is diagonally dominant. A formal accuracy study has been conducted to provide a simple truncation formula. Experimental results have been measured on a MasPar MP-1 SIMD machine and on a Cray 2 vector machine. They show that the simple parallel prefix algorithm is a good algorithm for symmetric and almost symmetric Toeplitz tridiagonal systems and for compact schemes on high-performance computers.
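The truncation works because, for diagonally dominant systems, the influence of distant equations decays exponentially. One way to see this decay is through the pivot recurrence of the LU factorization of a constant-stencil (Toeplitz) tridiagonal matrix; the small illustrative C program below (not the SPP algorithm itself, and with an arbitrary example stencil) shows how quickly the pivots settle:

#include <math.h>
#include <stdio.h>

/* Pivot recurrence of the LU factorization of a Toeplitz tridiagonal
 * matrix with constant stencil (a, b, c):
 *   d_1 = b,   d_i = b - a*c / d_{i-1}.
 * For a diagonally dominant stencil (|b| > |a| + |c|) the pivots
 * converge geometrically to a fixed point, so far-away contributions
 * can be truncated with little loss of accuracy. */
int main(void)
{
    const double a = -1.0, b = 4.0, c = -1.0;   /* |b| > |a| + |c| */
    double d = b;
    for (int i = 2; i <= 10; ++i) {
        double d_new = b - a * c / d;
        printf("i = %2d  d_i = %.15f  |change| = %.2e\n",
               i, d_new, fabs(d_new - d));
        d = d_new;
    }
    return 0;
}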
The problem associated with stiff ordinary differential equation (ODE) systems in parallel processing is that the computation cannot be started simultaneously on many processors with an explicit formula. The propose...
Communication and synchronization costs are a key problem in parallel computing. Studying direct and iterative numerical methods on nearest-neighbor-type distributed systems, we give speedup evaluations that depend on computation, communication, and control costs, and compare some of them with experimental measurements.
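As an illustration of the kind of evaluation described, a generic speedup model that separates the three cost components (the paper's actual model is not reproduced here) can be written as

\[
S(p) = \frac{T_{\mathrm{seq}}}{T_{\mathrm{comp}}(p) + T_{\mathrm{comm}}(p) + T_{\mathrm{ctrl}}(p)},
\]

where $T_{\mathrm{comp}}(p) \approx T_{\mathrm{seq}}/p$ for a perfectly divisible workload, $T_{\mathrm{comm}}(p)$ accounts for nearest-neighbor message exchanges (typically a per-message latency plus a per-word bandwidth term), and $T_{\mathrm{ctrl}}(p)$ collects synchronization and control overhead.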