Block low-rank (BLR) compression can significantly reduce the memory and time costs of parallel sparse direct solvers. In this paper, we investigate the performance of the BLR triangular solve phase, which we observe ...
详细信息
Block low-rank (BLR) compression can significantly reduce the memory and time costs of parallel sparse direct solvers. In this paper, we investigate the performance of the BLR triangular solve phase, which we observe to be underwhelming when dealing with many right-hand sides (RHS). We explain that this is because the bottleneck of the triangular solve is not in accessing the BLR LU factors, but rather in accessing the RHS, which are uncompressed. Motivated by this finding, we propose several new hybrid variants, which combine the right-looking and left-looking communication patterns to minimize the number of accesses to the RHS. We confirm via a theoretical analysis that these new variants can significantly reduce the total communication volume. We assess the impact of this reduction on the time performance on a range of real-life applications using the MUMPS solver, obtaining up to 20% time reduction.
Sampled Dense Times Dense Matrix Multiplication (SDDMM) and Sparse Times Dense Matrix Multiplication (SpMM) appear in diverse settings, such as collaborative filtering, document clustering, and graph embedding. Freque...
详细信息
ISBN:
(纸本)9781665481069
Sampled Dense Times Dense Matrix Multiplication (SDDMM) and Sparse Times Dense Matrix Multiplication (SpMM) appear in diverse settings, such as collaborative filtering, document clustering, and graph embedding. Frequently, the SDDMM output becomes the input sparse matrix for a subsequent SpMM operation. Existing work has focused on shared memory parallelization of these primitives. While there has been extensive analysis of communication-minimizing distributed 1.5D algorithms for SpMM, no such analysis exists for SDDMM or the back-to-back sequence of SDDMM and SpMM, termed FusedMM. We show that distributed memory 1.5D and 2.5D algorithms for SpMM can be converted to algorithms for SDDMM with identical communication costs and input / output data layouts. Further, we give two communication-eliding strategies to reduce costs further for FusedMM kernels: either reusing the replication of an input dense matrix for the SDDMM and SpMM in sequence, or fusing the local SDDMM and SpMM kernels. We benchmark FusedMM algorithms on Cori, a Cray XC40 at LBNL, using ***-Renyi random matrices and large real-world sparse matrices. On 256 nodes with 68 cores each, 1.5D FusedMM algorithms using either communication eliding approach can save at least 30% of time spent exclusively in communication compared to executing a distributed-memory SpMM and SDDMM kernel in sequence. Our 2.5D communication-eliding algorithms save 21% of communication time compared to the unoptimized sequence. On real-world matrices with hundreds of millions of edges, all of our algorithms exhibit at least a 10x speedup over the SpMM algorithm in PETSc. On these matrices, our communication-eliding techniques exhibit runtimes up to 1.6 times faster than an unoptimized sequence of SDDMM and SpMM. We embed and test the scaling of our algorithms in real-world applications, including collaborative filtering via alternating-least-squares and inference for attention-based graph neural networks.
Data movement is the dominating factor affecting performance and energy in modern computing systems. Consequently, many algorithms have been developed to minimize the number of I/O operations for common computing patt...
详细信息
ISBN:
(纸本)9781450370998
Data movement is the dominating factor affecting performance and energy in modern computing systems. Consequently, many algorithms have been developed to minimize the number of I/O operations for common computing patterns. Matrix multiplication is no exception, and lower bounds have been proven and implemented both for shared and distributed memory systems. Reconfigurable hardware platforms are a lucrative target for I/O minimizing algorithms, as they offer full control of memory accesses to the programmer. While bounds developed in the context of fixed architectures still apply to these platforms, the spatially distributed nature of their computational and memory resources requires a decentralized approach to optimize algorithms for maximum hardware utilization. We present a model to optimize matrix multiplication for FPGA platforms, simultaneously targeting maximum performance and minimum off-chip data movement, within constraints set by the hardware. We map the model to a concrete architecture using a high-level synthesis tool, maintaining a high level of abstraction, allowing us to support arbitrary data types, and enables maintainability and portability across FPGA devices. Kernels generated from our architecture are shown to offer competitive performance in practice, scaling with both compute and memory resources. We offer our design as an open source project(1) to encourage the open development of linear algebra and I/O minimizing algorithms on reconfigurable hardware platforms.
This paper discusses parallel algorithms for computing intersections between pairs of meshes. We used parallel intersection algorithms to compute interpolation weights in coupled solvers which are part of multi-physic...
详细信息
ISBN:
(纸本)9798400706394
This paper discusses parallel algorithms for computing intersections between pairs of meshes. We used parallel intersection algorithms to compute interpolation weights in coupled solvers which are part of multi-physics simulations. We present a parallel algorithm for computing intersections that has linear computational complexity. We analyze the computation and communication complexities of this algorithm, along with lower bounds for parallel intersection computation. The algorithm has low contention and can be executed on many-core CPUs or offloaded to GPUs. We present strong scaling results for this algorithm on a heterogeneous machine with multiple GPUs per node.
With increasing sizes of distributed systems, there comes an increased risk of communication bottlenecks. In the past decade there has been a growing interest in communication-avoidingalgorithms. The distributed memo...
详细信息
With increasing sizes of distributed systems, there comes an increased risk of communication bottlenecks. In the past decade there has been a growing interest in communication-avoidingalgorithms. The distributed memory Fast Fourier Transform is an important algorithm which suffers from major communication bottlenecks. In this work, we take a look at an existing communication-avoiding algorithm FMM-FFT, an alternative to FFT which utilizes the Fast Multipole Method (FMM) to reduce communications to a single all-to-all communication. We present a detailed implementation of FMM-FFT relying on modern libraries and demonstrate it on two distinct distributed memory architectures notably a traditional Intel Xeon based HPC cluster and then a Beowulf cluster. We show that while the FMM-FFT is significantly slower than FFT on the traditional HPC cluster, on the Beowulf cluster it outperforms standard FFT, consistently getting speedups of 1.5x or more against FFTW. We then proceed to show how the communication to computation cost metric is important and useful in explaining the performance results of FMM-FFT against standard FFT. The source code pertaining to this work is being made publicly available under a permissive open source licence at Github.
Each successive generation of computer architecture has brought new challenges to achieving high performance mathematical solvers, necessitating development and analysis of new algorithms, which are then embodied in s...
详细信息
Each successive generation of computer architecture has brought new challenges to achieving high performance mathematical solvers, necessitating development and analysis of new algorithms, which are then embodied in software libraries. These libraries hide architectural details from applications, allowing them to achieve a level of portability across platforms from desktops to world-class high performance computing (HPC) systems. Thus there has been an informal translational computer science process of developing algorithms and distributing them in open source software libraries for adoption by applications and vendors. With the move to exascale, increasing intentionality about this process will benefit the long-term sustainability of the scientific software stack.
We review the state of the art in the formulation, implementation, and performance of so-called high-order/low-order (HOLO) algorithms for challenging multiscale problems. HOLO algorithms attempt to couple one or seve...
详细信息
We review the state of the art in the formulation, implementation, and performance of so-called high-order/low-order (HOLO) algorithms for challenging multiscale problems. HOLO algorithms attempt to couple one or several high-complexity physical models (the high order model, HO) with low-complexity ones (the low-order model, LO). The primary goal of HOLO algorithms is to achieve nonlinear convergence between HO and LO components while minimizing memory footprint and managing the computational complexity in a practical manner. Key to the HOLO approach is the use of the LO representations to address temporal stiffness, effectively accelerating the convergence of the HO/LO coupled system. The HOLO approach is broadly underpinned by the concept of nonlinear elimination, which enables segregation of the HO and LO components in ways that can effectively use heterogeneous architectures. The accuracy and efficiency benefits of HOLO algorithms are demonstrated with specific applications to radiation transport, gas dynamics, plasmas (both Eulerian and Lagrangian formulations), and ocean modeling. Across this broad application spectrum, HOLO algorithms achieve significant accuracy improvements at a fraction of the cost compared to conventional approaches. It follows that HOLO algorithms hold significant potential for high-fidelity system scale multiscale simulations leveraging exascale computing. (C) 2016 Elsevier Inc. All rights reserved.
communication, i.e., moving data between levels of a memory hierarchy or between processors over a network, is much more expensive (in time or energy) than arithmetic. There has thus been a recent focus on designing a...
详细信息
ISBN:
(纸本)9781509021406
communication, i.e., moving data between levels of a memory hierarchy or between processors over a network, is much more expensive (in time or energy) than arithmetic. There has thus been a recent focus on designing algorithms that minimize communication and, when possible, attain lower bounds on the total number of reads and writes. However, most previous work does not distinguish between the costs of reads and writes. Writes can be much more expensive than reads in some current and emerging storage devices such as nonvolatile memories. This motivates us to ask whether there are lower bounds on the number of writes that certain algorithms must perform, and whether these bounds are asymptotically smaller than bounds on the sum of reads and writes together. When these smaller lower bounds exist, we then ask when they are attainable;we call such algorithms "write-avoiding" (WA), to distinguish them from "communication-avoiding" (CA) algorithms, which only minimize the sum of reads and writes. We identify a number of cases in linear algebra and direct N-body methods where known CA algorithms are also WA (some are and some aren't). We also identify classes of algorithms, including Strassen's matrix multiplication, Cooley-Tukey FFT, and cache oblivious algorithms for classical linear algebra, where a WA algorithm cannot exist: the number of writes is unavoidably within a constant factor of the total number of reads and writes. We explore the interaction of WA algorithms with cache replacement policies and argue that the Least Recently Used policy works well with the WA algorithms in this paper. We provide empirical hardware counter measurements from Intel's Nehalem-EX microarchitecture to validate our theory. In the parallel case, for classical linear algebra, we show that it is impossible to attain lower bounds both on interprocessor communication and on writes to local memory, but either one is attainable by itself. Finally, we discuss WA algorithms for sparse iterativ
Due to the evolution of massively parallel computers towards deeper levels of parallelism and memory hierarchy, and due to the exponentially increasing ratio of the time required to transfer data, either through the m...
详细信息
ISBN:
(纸本)9783642328206
Due to the evolution of massively parallel computers towards deeper levels of parallelism and memory hierarchy, and due to the exponentially increasing ratio of the time required to transfer data, either through the memory hierarchy or between different compute units, to the time required to compute floating point operations, the algorithms are confronted with two challenges. They need not only to be able to exploit multiple levels of parallelism, but also to reduce the communication between the compute units at each level of the hierarchy of parallelism and between the different levels of the memory hierarchy. In this paper we present an algorithm for performing the LU factorization of dense matrices that is suitable for computer systems with two levels of parallelism. This algorithm is able to minimize both the volume of communication and the number of messages transferred at every level of the two-level hierarchy of parallelism. We present its implementation for a cluster of multicore processors based on MPI and Pthreads. We show that this implementation leads to a better performance than routines implementing the LU factorization in well-known numerical libraries. For matrices that are tall and skinny, that is they have many more rows than columns, our algorithm outperforms the corresponding algorithm from ScaLAPACK by a factor of 4.5 on a cluster of 32 nodes, each node having two quad-core Intel Xeon EMT64 processors.
The running time of an algorithm depends on both arithmetic and communication (i.e., data movement) costs, and the relative costs of communication are growing over time. In this work, we present both theoretical and p...
详细信息
ISBN:
(纸本)9781450311601
The running time of an algorithm depends on both arithmetic and communication (i.e., data movement) costs, and the relative costs of communication are growing over time. In this work, we present both theoretical and practical results for tridiagonalizing a symmetric band matrix: we present an algorithm that asymptotically reduces communication, and we show that it indeed performs well in practice. The tridiagonalization of a symmetric band matrix is a key kernel in solving the symmetric eigenvalue problem for both full and band matrices. In order to preserve sparsity, tridiagonalization routines use annihilate-and-chase procedures that previously have suffered from poor data locality. We improve data locality by reorganizing the computation, asymptotically reducing communication costs compared to existing algorithms. Our sequential implementation demonstrates that avoidingcommunication improves runtime even at the expense of extra arithmetic: we observe a 2x speedup over Intel MKL while doing 43% more floating point operations. Our parallel implementation targets shared-memory multicore platforms. It uses pipelined parallelism and a static scheduler while retaining the locality properties of the sequential algorithm. Due to lightweight synchronization and effective data reuse, we see 9.5x scaling over our serial code and up to 6x speedup over the PLASMA library, comparing parallel performance on a ten-core processor.
暂无评论