Stencil computations lie at the heart of many scientific and industrial applications. Stencil algorithms pose several challenges on machines with cache based memory hierarchy, due to low re-use of memory accesses if s...
详细信息
ISBN:
(纸本)9781665454445
Stencil computations lie at the heart of many scientific and industrial applications. Stencil algorithms pose several challenges on machines with cache based memory hierarchy, due to low re-use of memory accesses if special care is not taken to optimize them. This work shows that for stencil computation a novel algorithm that leverages a localized communication strategy effectively exploits the second generation Cerebras WaferScale Engine (WSE-2), which has no cache hierarchy. This study focuses on a 25-point stencil finite-difference method for the 3D wave equation, a kernel frequently used in earth modeling as numerical simulation. In essence, the algorithm trades memory accesses for data communication and takes advantage of the fast communication fabric provided by the architecture. The algorithm -historically memory-bound- becomes computebound. This allows the implementation to achieve near perfect weak scaling, reaching up to 503 TFLOPs on WSE-2, a figure that only full clusters can eventually yield.
Stencil computations lie at the heart of many scientific and industrial applications. Stencil algorithms pose several challenges on machines with cache based memory hierarchy, due to low re-use of memory accesses if s...
详细信息
Stencil computations lie at the heart of many scientific and industrial applications. Stencil algorithms pose several challenges on machines with cache based memory hierarchy, due to low re-use of memory accesses if special care is not taken to optimize them. This work shows that for stencil computation a novel algorithm that leverages a localized communication strategy effectively exploits the second generation Cerebras Wafer-Scale Engine (WSE-2), which has no cache hierarchy. This study focuses on a 25-point stencil finite-difference method for the 3D wave equation, a kernel frequently used in earth modeling as numerical simulation. In essence, the algorithm trades memory accesses for data communication and takes advantage of the fast communication fabric provided by the architecture. The algorithm ---historically memory-bound--- becomes compute-bound. This allows the implementation to achieve near perfect weak scaling, reaching up to 503 TFLOPs on WSE-2, a figure that only full clusters can eventually yield.
The performance of CPU-based and GPU-based systems is often low for PDF codes, where large, sparse, and often structured systems of linear equations must be solved. Iterative solvers are limited by data movement, both...
详细信息
ISBN:
(纸本)9781728199986
The performance of CPU-based and GPU-based systems is often low for PDF codes, where large, sparse, and often structured systems of linear equations must be solved. Iterative solvers are limited by data movement, both between caches and memory and between nodes. Here we describe the solution of such systems of equations on the Cerebras Systems CS-1, a wafer-scale processor that has the memory bandwidth and communication latency to perform well. We achieve 0.86 PFLOPS on a single wafer-scale system for the solution by BiCGStab of a linear system arising from a 7-point finite difference stencil on a 600 x 595 x 1536 mesh, achieving about one third of the machine's peak performance. We explain the system, its architecture and programming, and its performance on this problem and related problems. We discuss issues of memory capacity and floating point precision. We outline plans to extend this work towards full applications.
暂无评论