A known scalability bottleneck of the parallel 3D FFT is its use of all-to-all communications. Here, we present S3DFT, a library that circumvents this by using point-to-point communication, albeit at a higher arithmetic complexity. This approach exploits three variants of Cannon's algorithm with adaptations for block tensor-matrix multiplications. We demonstrate S3DFT's efficient use of hardware resources and its scaling using up to 16,464 cores of the JUWELS Cluster. However, in a comparison with well-established 3D FFT libraries, its parallel efficiency and performance were found to fall behind. A detailed analysis identifies the cause in two of its component algorithms, which scale poorly owing to how their communication patterns are mapped onto subsets of the fat-tree topology. This result exposes a potential drawback of running block-wise parallel algorithms on systems with fat-tree networks, caused by increased communication latencies along specific directions of the mesh of processing elements.
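As a rough illustration of the point-to-point pattern that Cannon's algorithm relies on, the following C/MPI sketch multiplies distributed blocks on a q x q process grid using only nearest-neighbour shifts. It is the generic textbook formulation, not S3DFT's block tensor-matrix variant; the block size NB and the requirement of a square process grid are assumptions.

/* Generic sketch of Cannon's algorithm on a q x q MPI process grid;
 * illustrative only, not the S3DFT implementation. */
#include <mpi.h>
#include <stdlib.h>
#include <math.h>

#define NB 64  /* local block dimension (assumed) */

static void local_gemm(const double *A, const double *B, double *C, int n)
{
    for (int i = 0; i < n; ++i)
        for (int k = 0; k < n; ++k)
            for (int j = 0; j < n; ++j)
                C[i*n + j] += A[i*n + k] * B[k*n + j];
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int nprocs;
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int q = (int)round(sqrt((double)nprocs));   /* assume a perfect square */
    if (q * q != nprocs) MPI_Abort(MPI_COMM_WORLD, 1);

    int dims[2] = {q, q}, periods[2] = {1, 1}, coords[2];
    MPI_Comm grid;
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 0, &grid);
    int grid_rank;
    MPI_Comm_rank(grid, &grid_rank);
    MPI_Cart_coords(grid, grid_rank, 2, coords);

    double *A = calloc(NB*NB, sizeof(double));
    double *B = calloc(NB*NB, sizeof(double));
    double *C = calloc(NB*NB, sizeof(double));
    /* ... fill A and B with the local blocks of the global operands ... */

    /* Nearest-neighbour ranks for the per-step shifts. */
    int left, right, up, down;
    MPI_Cart_shift(grid, 1, -1, &right, &left);  /* column direction */
    MPI_Cart_shift(grid, 0, -1, &down, &up);     /* row direction    */

    /* Initial skew: row i of A shifts left by i, column j of B shifts up by j. */
    int src, dst;
    MPI_Cart_shift(grid, 1, -coords[0], &src, &dst);
    MPI_Sendrecv_replace(A, NB*NB, MPI_DOUBLE, dst, 0, src, 0, grid, MPI_STATUS_IGNORE);
    MPI_Cart_shift(grid, 0, -coords[1], &src, &dst);
    MPI_Sendrecv_replace(B, NB*NB, MPI_DOUBLE, dst, 1, src, 1, grid, MPI_STATUS_IGNORE);

    /* q multiply-shift steps: only point-to-point neighbour traffic. */
    for (int step = 0; step < q; ++step) {
        local_gemm(A, B, C, NB);
        MPI_Sendrecv_replace(A, NB*NB, MPI_DOUBLE, left, 2, right, 2, grid, MPI_STATUS_IGNORE);
        MPI_Sendrecv_replace(B, NB*NB, MPI_DOUBLE, up,   3, down,  3, grid, MPI_STATUS_IGNORE);
    }

    free(A); free(B); free(C);
    MPI_Finalize();
    return 0;
}

Each step touches only four neighbours per rank, which is exactly the property that avoids the all-to-all exchange of the transpose-based 3D FFT, at the cost of q local multiplications instead of one.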
ISBN:
(Print) 9783319996738; 9783319996721
TIM-3D is a continuum-mechanics simulation code that uses arbitrary-shape unstructured polyhedral Lagrangian meshes. Parallelism in TIM-3D is provided at three levels in a mixed-memory model. The first two levels use spatial decomposition in the MPI-based distributed-memory model. At the first level, calculations are parallelized over task fragments (domains). At the second level, calculations within one domain are parallelized over para-domains. At the third level, iterations of the calculation loops are parallelized in the OpenMP-based shared-memory model. The paper considers the fine-grained parallelization algorithms (second level). These algorithms are complementary to the OpenMP shared-memory parallelism implemented earlier. The fine-grained parallelization can be done both with overlapping in one row of para-domain interface cells and without overlapping. These approaches are compared in terms of their parallel efficiency using one of the test simulations.
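The three-level scheme can be pictured with a generic MPI+OpenMP skeleton: one communicator split per domain, one rank per para-domain inside it, and OpenMP across the cell loop. The sketch below is illustrative only; the constants, the placeholder cell update, and the communicator layout are assumptions, not TIM-3D's actual structure.

/* Generic two-level MPI decomposition with OpenMP loop parallelism;
 * names and sizes are illustrative, not taken from TIM-3D. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

#define NCELLS 100000           /* cells owned by one para-domain (assumed) */
#define N_PARA_PER_DOMAIN 4     /* para-domains per domain (assumed) */

int main(int argc, char **argv)
{
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    /* Level 1: group ranks into coarse task fragments (domains). */
    int domain_id = world_rank / N_PARA_PER_DOMAIN;
    MPI_Comm domain_comm;
    MPI_Comm_split(MPI_COMM_WORLD, domain_id, world_rank, &domain_comm);

    /* Level 2: each rank inside a domain owns one para-domain. */
    int para_rank, para_size;
    MPI_Comm_rank(domain_comm, &para_rank);
    MPI_Comm_size(domain_comm, &para_size);
    printf("world rank %d -> domain %d, para-domain %d of %d\n",
           world_rank, domain_id, para_rank, para_size);

    double *u     = calloc(NCELLS, sizeof(double));
    double *u_new = calloc(NCELLS, sizeof(double));

    /* The exchange of one row of interface cells with neighbouring
       para-domains (e.g. MPI_Sendrecv on domain_comm) would go here;
       that row can either be overlapped or recomputed, which is the
       comparison made in the paper. */

    /* Level 3: OpenMP shared-memory parallelism over the cell loop. */
    #pragma omp parallel for
    for (int c = 1; c < NCELLS - 1; ++c)
        u_new[c] = 0.5 * (u[c - 1] + u[c + 1]);   /* placeholder cell update */

    free(u); free(u_new);
    MPI_Comm_free(&domain_comm);
    MPI_Finalize();
    return 0;
}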
ISBN:
(Print) 9781450360791
We present a novel distributed-memory algorithm to improve the strong scalability of the solution of a sparse triangular system. This operation appears in the solve phase of direct methods for solving general sparse linear systems, Ax = b. Our 3D sparse triangular solver employs several techniques, including a 3D MPI process grid, elimination tree parallelism, and data replication, all of which reduce the per-process communication when combined. We present analytical models to understand the communication cost of our algorithm and show that our 3D sparse triangular solver can reduce the per-process communication volume asymptotically by a factor of O(n^{1/4}) and O(n^{1/6}) for problems arising from the finite element discretizations of 2D "planar" and 3D "non-planar" PDEs, respectively. We implement our algorithm for use in SuperLU_DIST3D, using a hybrid MPI+OpenMP programming model. Our 3D triangular solve algorithm, when run on 12K cores of a Cray XC30, outperforms the current state-of-the-art 2D algorithm by 7.2x for planar and 2.7x for non-planar sparse matrices, respectively.
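A minimal way to picture the 3D process grid is a Pr x Pc x Pz Cartesian communicator whose xy layers work on independent elimination-tree subtrees while the z axis holds replicated ancestor data and reduces partial contributions. The sketch below shows only that communicator plumbing; the grid shape, the communicator names, and the single Allreduce stand-in are assumptions, not SuperLU_DIST3D internals.

/* Illustrative 3D process grid with per-layer and per-depth communicators;
 * not the SuperLU_DIST3D implementation. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int nprocs, rank;
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Let MPI choose a balanced Pr x Pc x Pz shape
       (e.g. 48 ranks might map to 4 x 4 x 3). */
    int dims[3] = {0, 0, 0};
    MPI_Dims_create(nprocs, 3, dims);
    int periods[3] = {0, 0, 0};
    MPI_Comm grid3d;
    MPI_Cart_create(MPI_COMM_WORLD, 3, dims, periods, 0, &grid3d);

    /* 2D layer communicator: ranks sharing a z index handle one set of
       independent elimination-tree subtrees. */
    int keep_xy[3] = {1, 1, 0};
    MPI_Comm layer_comm;
    MPI_Cart_sub(grid3d, keep_xy, &layer_comm);

    /* 1D depth communicator: ranks stacked along z replicate the ancestor
       blocks and must combine their partial solutions. */
    int keep_z[3] = {0, 0, 1};
    MPI_Comm depth_comm;
    MPI_Cart_sub(grid3d, keep_z, &depth_comm);

    /* After each layer finishes its local triangular solve, partial
       contributions to the shared ancestor unknowns are summed along z. */
    double x_partial = 0.0, x_sum = 0.0;   /* stand-in for a block of x */
    MPI_Allreduce(&x_partial, &x_sum, 1, MPI_DOUBLE, MPI_SUM, depth_comm);

    if (rank == 0)
        printf("process grid: %d x %d x %d\n", dims[0], dims[1], dims[2]);

    MPI_Comm_free(&layer_comm);
    MPI_Comm_free(&depth_comm);
    MPI_Comm_free(&grid3d);
    MPI_Finalize();
    return 0;
}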
ISBN:
(Print) 9781450323789
In this paper, we present a new out-of-core sort algorithm, designed for problems that are too large to fit into the aggregate RAM available on modern supercomputers. We analyze the performance, including the cost of I/O, and demonstrate the fastest (to the best of our knowledge) reported throughput using the canonical Sort Benchmark on a general-purpose, production HPC resource running Lustre. By clever use of available storage and a formulation of asynchronous data transfer mechanisms, we are able to almost completely hide the computation (sorting) behind the I/O latency. This latency hiding enables us to achieve comparable execution times, including the additional temporary I/O required, between a large sort problem (5 TB) run as a single, in-RAM sort and our out-of-core approach using 1/10th the amount of RAM. In our largest run, sorting 100 TB of records using 1,792 hosts, we achieved an end-to-end throughput of 1.24 TB/min using our general-purpose sorter, improving on the current Daytona record holder by 65%.
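The latency-hiding idea, reading the next chunk while sorting the current one, can be sketched with double buffering and POSIX AIO. The file name, the chunk size, and the 100-byte-record / 10-byte-key layout (the usual Sort Benchmark format) are assumptions here; the actual sorter also overlaps the writes and the merge phase, which this toy sketch omits.

/* Double-buffered run formation: sort the current chunk while the next
 * chunk is read asynchronously.  Illustrative only. */
#include <aio.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define CHUNK_BYTES (64 * 1024 * 1024)   /* in-RAM run size (assumed) */
#define REC_BYTES   100                  /* benchmark-style record size */

static int cmp_rec(const void *a, const void *b)
{
    return memcmp(a, b, 10);             /* 10-byte keys */
}

int main(void)
{
    int fd = open("input.dat", O_RDONLY);          /* hypothetical input file */
    if (fd < 0) { perror("open"); return 1; }

    char *buf[2] = { malloc(CHUNK_BYTES), malloc(CHUNK_BYTES) };

    /* Kick off the first read. */
    struct aiocb cb;
    memset(&cb, 0, sizeof cb);
    cb.aio_fildes = fd;
    cb.aio_buf    = buf[0];
    cb.aio_nbytes = CHUNK_BYTES;
    cb.aio_offset = 0;
    aio_read(&cb);

    off_t offset = 0;
    int cur = 0;
    for (;;) {
        /* Wait for the outstanding read of buf[cur] to finish. */
        const struct aiocb *const list[1] = { &cb };
        while (aio_error(&cb) == EINPROGRESS)
            aio_suspend(list, 1, NULL);
        ssize_t got = aio_return(&cb);
        if (got <= 0) break;

        /* Immediately start reading the next chunk into the other buffer... */
        offset += got;
        memset(&cb, 0, sizeof cb);
        cb.aio_fildes = fd;
        cb.aio_buf    = buf[1 - cur];
        cb.aio_nbytes = CHUNK_BYTES;
        cb.aio_offset = offset;
        aio_read(&cb);

        /* ...and sort the chunk we already have while that read is in flight.
           Writing the sorted run back out (also asynchronously) would follow. */
        qsort(buf[cur], got / REC_BYTES, REC_BYTES, cmp_rec);

        cur = 1 - cur;
    }

    free(buf[0]); free(buf[1]);
    close(fd);
    return 0;
}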