A known scalability bottleneck of the parallel 3D FFT is its use of all-to-all communications. Here, we present S3DFT, a library that circumvents this by using point-to-point communication, albeit at a higher arithmetic complexity. This approach exploits three variants of Cannon's algorithm with adaptations for block tensor-matrix multiplications. We demonstrate S3DFT's efficient use of hardware resources and its scaling using up to 16,464 cores of the JUWELS Cluster. However, in a comparison with well-established 3D FFT libraries, its parallel efficiency and performance were found to fall behind. A detailed analysis identifies the cause in two of its component algorithms, which scale poorly owing to how their communication patterns are mapped onto subsets of the fat-tree topology. This result exposes a potential drawback of running block-wise parallel algorithms on systems with fat-tree networks, caused by increased communication latencies along specific directions of the mesh of processing elements.
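As a rough illustration of the point-to-point pattern that Cannon's algorithm relies on, the following C/MPI sketch multiplies distributed blocks on a q x q process grid using only nearest-neighbour shifts. It is the generic textbook formulation, not S3DFT's block tensor-matrix variant; the block size NB and the requirement of a square process grid are assumptions.

/* Generic sketch of Cannon's algorithm on a q x q MPI process grid;
 * illustrative only, not the S3DFT implementation. */
#include <mpi.h>
#include <stdlib.h>
#include <math.h>

#define NB 64  /* local block dimension (assumed) */

static void local_gemm(const double *A, const double *B, double *C, int n)
{
    for (int i = 0; i < n; ++i)
        for (int k = 0; k < n; ++k)
            for (int j = 0; j < n; ++j)
                C[i*n + j] += A[i*n + k] * B[k*n + j];
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int nprocs;
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int q = (int)round(sqrt((double)nprocs));   /* assume a perfect square */
    if (q * q != nprocs) MPI_Abort(MPI_COMM_WORLD, 1);

    int dims[2] = {q, q}, periods[2] = {1, 1}, coords[2];
    MPI_Comm grid;
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 0, &grid);
    int grid_rank;
    MPI_Comm_rank(grid, &grid_rank);
    MPI_Cart_coords(grid, grid_rank, 2, coords);

    double *A = calloc(NB*NB, sizeof(double));
    double *B = calloc(NB*NB, sizeof(double));
    double *C = calloc(NB*NB, sizeof(double));
    /* ... fill A and B with the local blocks of the global operands ... */

    /* Nearest-neighbour ranks for the per-step shifts. */
    int left, right, up, down;
    MPI_Cart_shift(grid, 1, -1, &right, &left);  /* column direction */
    MPI_Cart_shift(grid, 0, -1, &down, &up);     /* row direction    */

    /* Initial skew: row i of A shifts left by i, column j of B shifts up by j. */
    int src, dst;
    MPI_Cart_shift(grid, 1, -coords[0], &src, &dst);
    MPI_Sendrecv_replace(A, NB*NB, MPI_DOUBLE, dst, 0, src, 0, grid, MPI_STATUS_IGNORE);
    MPI_Cart_shift(grid, 0, -coords[1], &src, &dst);
    MPI_Sendrecv_replace(B, NB*NB, MPI_DOUBLE, dst, 1, src, 1, grid, MPI_STATUS_IGNORE);

    /* q multiply-shift steps: only point-to-point neighbour traffic. */
    for (int step = 0; step < q; ++step) {
        local_gemm(A, B, C, NB);
        MPI_Sendrecv_replace(A, NB*NB, MPI_DOUBLE, left, 2, right, 2, grid, MPI_STATUS_IGNORE);
        MPI_Sendrecv_replace(B, NB*NB, MPI_DOUBLE, up,   3, down,  3, grid, MPI_STATUS_IGNORE);
    }

    free(A); free(B); free(C);
    MPI_Finalize();
    return 0;
}

Each step touches only four neighbours per rank, which is exactly the property that avoids the all-to-all exchange of the transpose-based 3D FFT, at the cost of q local multiplications instead of one.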
ISBN:
(Print) 9783319996738; 9783319996721
TIM-3D is a continuum-mechanics simulation code that uses arbitrary-shape unstructured polyhedral Lagrangian meshes. Parallelism in TIM-3D is provided at three levels in a mixed-memory model. The first two levels use spatial decomposition in the MPI-based distributed-memory model. At the first level, calculations are parallelized over task fragments (domains). At the second level, calculations within one domain are parallelized over para-domains. At the third level, iterations of the calculation loops are parallelized in the OpenMP-based shared-memory model. The paper considers the fine-grained parallelization algorithms (second level). These algorithms are complementary to the OpenMP shared-memory parallelism implemented earlier. The fine-grained parallelization can be done both with overlapping in one row of para-domain interface cells and without overlapping. These approaches are compared in terms of their parallel efficiency using one of the test simulations.
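The three-level scheme can be pictured with a generic MPI+OpenMP skeleton: one communicator split per domain, one rank per para-domain inside it, and OpenMP across the cell loop. The sketch below is illustrative only; the constants, the placeholder cell update, and the communicator layout are assumptions, not TIM-3D's actual structure.

/* Generic two-level MPI decomposition with OpenMP loop parallelism;
 * names and sizes are illustrative, not taken from TIM-3D. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

#define NCELLS 100000           /* cells owned by one para-domain (assumed) */
#define N_PARA_PER_DOMAIN 4     /* para-domains per domain (assumed) */

int main(int argc, char **argv)
{
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    /* Level 1: group ranks into coarse task fragments (domains). */
    int domain_id = world_rank / N_PARA_PER_DOMAIN;
    MPI_Comm domain_comm;
    MPI_Comm_split(MPI_COMM_WORLD, domain_id, world_rank, &domain_comm);

    /* Level 2: each rank inside a domain owns one para-domain. */
    int para_rank, para_size;
    MPI_Comm_rank(domain_comm, &para_rank);
    MPI_Comm_size(domain_comm, &para_size);
    printf("world rank %d -> domain %d, para-domain %d of %d\n",
           world_rank, domain_id, para_rank, para_size);

    double *u     = calloc(NCELLS, sizeof(double));
    double *u_new = calloc(NCELLS, sizeof(double));

    /* The exchange of one row of interface cells with neighbouring
       para-domains (e.g. MPI_Sendrecv on domain_comm) would go here;
       that row can either be overlapped or recomputed, which is the
       comparison made in the paper. */

    /* Level 3: OpenMP shared-memory parallelism over the cell loop. */
    #pragma omp parallel for
    for (int c = 1; c < NCELLS - 1; ++c)
        u_new[c] = 0.5 * (u[c - 1] + u[c + 1]);   /* placeholder cell update */

    free(u); free(u_new);
    MPI_Comm_free(&domain_comm);
    MPI_Finalize();
    return 0;
}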
ISBN:
(Print) 9781450360791
We present a novel distributed-memory algorithm to improve the strong scalability of the solution of a sparse triangular system. This operation appears in the solve phase of direct methods for solving general sparse linear systems, Ax = b. Our 3D sparse triangular solver employs several techniques, including a 3D MPI process grid, elimination tree parallelism, and data replication, all of which reduce the per-process communication when combined. We present analytical models to understand the communication cost of our algorithm and show that our 3D sparse triangular solver can reduce the per-process communication volume asymptotically by a factor of O(n^{1/4}) and O(n^{1/6}) for problems arising from the finite element discretizations of 2D "planar" and 3D "non-planar" PDEs, respectively. We implement our algorithm for use in SuperLU_DIST3D, using a hybrid MPI+OpenMP programming model. Our 3D triangular solve algorithm, when run on 12K cores of a Cray XC30, outperforms the current state-of-the-art 2D algorithm by 7.2x for planar and 2.7x for non-planar sparse matrices, respectively.
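A minimal way to picture the 3D process grid is a Pr x Pc x Pz Cartesian communicator whose xy layers work on independent elimination-tree subtrees while the z axis holds replicated ancestor data and reduces partial contributions. The sketch below shows only that communicator plumbing; the grid shape, the communicator names, and the single Allreduce stand-in are assumptions, not SuperLU_DIST3D internals.

/* Illustrative 3D process grid with per-layer and per-depth communicators;
 * not the SuperLU_DIST3D implementation. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int nprocs, rank;
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Let MPI choose a balanced Pr x Pc x Pz shape
       (e.g. 48 ranks might map to 4 x 4 x 3). */
    int dims[3] = {0, 0, 0};
    MPI_Dims_create(nprocs, 3, dims);
    int periods[3] = {0, 0, 0};
    MPI_Comm grid3d;
    MPI_Cart_create(MPI_COMM_WORLD, 3, dims, periods, 0, &grid3d);

    /* 2D layer communicator: ranks sharing a z index handle one set of
       independent elimination-tree subtrees. */
    int keep_xy[3] = {1, 1, 0};
    MPI_Comm layer_comm;
    MPI_Cart_sub(grid3d, keep_xy, &layer_comm);

    /* 1D depth communicator: ranks stacked along z replicate the ancestor
       blocks and must combine their partial solutions. */
    int keep_z[3] = {0, 0, 1};
    MPI_Comm depth_comm;
    MPI_Cart_sub(grid3d, keep_z, &depth_comm);

    /* After each layer finishes its local triangular solve, partial
       contributions to the shared ancestor unknowns are summed along z. */
    double x_partial = 0.0, x_sum = 0.0;   /* stand-in for a block of x */
    MPI_Allreduce(&x_partial, &x_sum, 1, MPI_DOUBLE, MPI_SUM, depth_comm);

    if (rank == 0)
        printf("process grid: %d x %d x %d\n", dims[0], dims[1], dims[2]);

    MPI_Comm_free(&layer_comm);
    MPI_Comm_free(&depth_comm);
    MPI_Comm_free(&grid3d);
    MPI_Finalize();
    return 0;
}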
ISBN:
(Print) 9781450323789
In this paper, we present a new out-of-core sort algorithm, designed for problems that are too large to fit into the aggregate RAM available on modern supercomputers. We analyze the performance, including the cost of I/O, and demonstrate the fastest (to the best of our knowledge) reported throughput using the canonical Sort Benchmark on a general-purpose, production HPC resource running Lustre. By clever use of available storage and a formulation of asynchronous data transfer mechanisms, we are able to almost completely hide the computation (sorting) behind the I/O latency. This latency hiding enables us to achieve comparable execution times, including the additional temporary I/O required, between a large sort problem (5 TB) run as a single, in-RAM sort and our out-of-core approach using 1/10th the amount of RAM. In our largest run, sorting 100 TB of records using 1,792 hosts, we achieved an end-to-end throughput of 1.24 TB/min using our general-purpose sorter, improving on the current Daytona record holder by 65%.
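The latency-hiding idea, reading the next chunk while sorting the current one, can be sketched with double buffering and POSIX AIO. The file name, the chunk size, and the 100-byte-record / 10-byte-key layout (the usual Sort Benchmark format) are assumptions here; the actual sorter also overlaps the writes and the merge phase, which this toy sketch omits.

/* Double-buffered run formation: sort the current chunk while the next
 * chunk is read asynchronously.  Illustrative only. */
#include <aio.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define CHUNK_BYTES (64 * 1024 * 1024)   /* in-RAM run size (assumed) */
#define REC_BYTES   100                  /* benchmark-style record size */

static int cmp_rec(const void *a, const void *b)
{
    return memcmp(a, b, 10);             /* 10-byte keys */
}

int main(void)
{
    int fd = open("input.dat", O_RDONLY);          /* hypothetical input file */
    if (fd < 0) { perror("open"); return 1; }

    char *buf[2] = { malloc(CHUNK_BYTES), malloc(CHUNK_BYTES) };

    /* Kick off the first read. */
    struct aiocb cb;
    memset(&cb, 0, sizeof cb);
    cb.aio_fildes = fd;
    cb.aio_buf    = buf[0];
    cb.aio_nbytes = CHUNK_BYTES;
    cb.aio_offset = 0;
    aio_read(&cb);

    off_t offset = 0;
    int cur = 0;
    for (;;) {
        /* Wait for the outstanding read of buf[cur] to finish. */
        const struct aiocb *const list[1] = { &cb };
        while (aio_error(&cb) == EINPROGRESS)
            aio_suspend(list, 1, NULL);
        ssize_t got = aio_return(&cb);
        if (got <= 0) break;

        /* Immediately start reading the next chunk into the other buffer... */
        offset += got;
        memset(&cb, 0, sizeof cb);
        cb.aio_fildes = fd;
        cb.aio_buf    = buf[1 - cur];
        cb.aio_nbytes = CHUNK_BYTES;
        cb.aio_offset = offset;
        aio_read(&cb);

        /* ...and sort the chunk we already have while that read is in flight.
           Writing the sorted run back out (also asynchronously) would follow. */
        qsort(buf[cur], got / REC_BYTES, REC_BYTES, cmp_rec);

        cur = 1 - cur;
    }

    free(buf[0]); free(buf[1]);
    close(fd);
    return 0;
}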