检索结果-内蒙古大学图书馆

communication avoiding BLOCK LOW-RANK PARALLEL MULTIFRONTAL TRIANGULAR SOLVE WITH MANY RIGHT-HAND SIDES

SIAM JOURNAL ON MATRIX ANALYSIS AND APPLICATIONS 2024年第1期45卷 148-166页

作者： Amestoy, Patrick Boiteau, Olivier Buttari, Alfredo Gerest, Matthieu Jezequel, Fabienne L'excellent, Jean-yves Mary, Theo ENS Lyon Mumps Technol 46 Allee Italie F-69007 Lyon France EDF R&D F-91120 Palaiseau France CNRS IRIT 2 Rue Charles Camichel F-31071 Toulouse France Sorbonne Univ CNRS LIP6 F-75005 Paris France Univ Paris Pantheon Assas F-75005 Paris France Sorbonne Univ CNRS LIP6 F-75005 Paris France

Block low-rank (BLR) compression can significantly reduce the memory and time costs of parallel sparse direct solvers. In this paper, we investigate the performance of the BLR triangular solve phase, which we observe to be underwhelming when dealing with many right-hand sides (RHS). We explain that this is because the bottleneck of the triangular solve is not in accessing the BLR LU factors, but rather in accessing the RHS, which are uncompressed. Motivated by this finding, we propose several new hybrid variants, which combine the right-looking and left-looking communication patterns to minimize the number of accesses to the RHS. We confirm via a theoretical analysis that these new variants can significantly reduce the total communication volume. We assess the impact of this reduction on the time performance on a range of real-life applications using the MUMPS solver, obtaining up to 20% time reduction.

关键词： numerical linear algebra block low-rank matrices data sparse matrices LU factorization triangular solve linear systems low-rank approximations communication avoiding algorithms

来源：评论

学校读者我要写书评

暂无评论

Distributed-Memory Sparse Kernels for Machine Learning 36

Distributed-Memory Sparse Kernels for Machine Learning

引用

36th IEEE International Parallel and Distributed Processing Symposium (IEEE IPDPS)

作者： Bharadwaj, Vivek Buluc, Aydin Demmel, James Univ Calif Berkeley EECS Dept Berkeley CA 94720 USA Lawrence Berkeley Natl Lab Computat Res Div Berkeley CA USA

ISBN: (纸本)9781665481069

Sampled Dense Times Dense Matrix Multiplication (SDDMM) and Sparse Times Dense Matrix Multiplication (SpMM) appear in diverse settings, such as collaborative filtering, document clustering, and graph embedding. Frequently, the SDDMM output becomes the input sparse matrix for a subsequent SpMM operation. Existing work has focused on shared memory parallelization of these primitives. While there has been extensive analysis of communication-minimizing distributed 1.5D algorithms for SpMM, no such analysis exists for SDDMM or the back-to-back sequence of SDDMM and SpMM, termed FusedMM. We show that distributed memory 1.5D and 2.5D algorithms for SpMM can be converted to algorithms for SDDMM with identical communication costs and input / output data layouts. Further, we give two communication-eliding strategies to reduce costs further for FusedMM kernels: either reusing the replication of an input dense matrix for the SDDMM and SpMM in sequence, or fusing the local SDDMM and SpMM kernels. We benchmark FusedMM algorithms on Cori, a Cray XC40 at LBNL, using ***-Renyi random matrices and large real-world sparse matrices. On 256 nodes with 68 cores each, 1.5D FusedMM algorithms using either communication eliding approach can save at least 30% of time spent exclusively in communication compared to executing a distributed-memory SpMM and SDDMM kernel in sequence. Our 2.5D communication-eliding algorithms save 21% of communication time compared to the unoptimized sequence. On real-world matrices with hundreds of millions of edges, all of our algorithms exhibit at least a 10x speedup over the SpMM algorithm in PETSc. On these matrices, our communication-eliding techniques exhibit runtimes up to 1.6 times faster than an unoptimized sequence of SDDMM and SpMM. We embed and test the scaling of our algorithms in real-world applications, including collaborative filtering via alternating-least-squares and inference for attention-based graph neural networks.

关键词： SDDMM SpMM FusedMM communication avoiding algorithms

来源：评论

学校读者我要写书评

暂无评论

Flexible communication avoiding Matrix Multiplication on FPGA with High-Level Synthesis 20

Flexible Communication Avoiding Matrix Multiplication on FPG...

引用

ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA)

作者： Licht, Johannes de Fine Kwasniewski, Grzegorz Hoefler, Torsten Swiss Fed Inst Technol Zurich Switzerland

ISBN: (纸本)9781450370998

Data movement is the dominating factor affecting performance and energy in modern computing systems. Consequently, many algorithms have been developed to minimize the number of I/O operations for common computing patterns. Matrix multiplication is no exception, and lower bounds have been proven and implemented both for shared and distributed memory systems. Reconfigurable hardware platforms are a lucrative target for I/O minimizing algorithms, as they offer full control of memory accesses to the programmer. While bounds developed in the context of fixed architectures still apply to these platforms, the spatially distributed nature of their computational and memory resources requires a decentralized approach to optimize algorithms for maximum hardware utilization. We present a model to optimize matrix multiplication for FPGA platforms, simultaneously targeting maximum performance and minimum off-chip data movement, within constraints set by the hardware. We map the model to a concrete architecture using a high-level synthesis tool, maintaining a high level of abstraction, allowing us to support arbitrary data types, and enables maintainability and portability across FPGA devices. Kernels generated from our architecture are shown to offer competitive performance in practice, scaling with both compute and memory resources. We offer our design as an open source project(1) to encourage the open development of linear algebra and I/O minimizing algorithms on reconfigurable hardware platforms.

关键词： i/o optimization linear algebra communication avoiding algorithms matrix multiplication systolic arrays fpga gemm vivado hls high-level synthesis hls

来源：评论

学校读者我要写书评

暂无评论

Parallel algorithms for Intersection Computation 24

Parallel Algorithms for Intersection Computation

引用

Proceedings of the Platform for Advanced Scientific Computing Conference

作者： Aparna Sasidharan Illinois Institute of Technology Chicago United States of America

ISBN: (纸本)9798400706394

This paper discusses parallel algorithms for computing intersections between pairs of meshes. We used parallel intersection algorithms to compute interpolation weights in coupled solvers which are part of multi-physics simulations. We present a parallel algorithm for computing intersections that has linear computational complexity. We analyze the computation and communication complexities of this algorithm, along with lower bounds for parallel intersection computation. The algorithm has low contention and can be executed on many-core CPUs or offloaded to GPUs. We present strong scaling results for this algorithm on a heterogeneous machine with multiple GPUs per node.

关键词： Kd-trees intergrid communication coupling software mesh intersection communication avoiding algorithms

来源：评论

学校读者我要写书评

暂无评论

Parallel Fast Multipole Method accelerated FFT on HPC clusters

引用

PARALLEL COMPUTING 2021年 104卷

作者： Mehta, Chahak Karthi, Amarnath Jetly, Vishrut Chaudhury, Bhaskar Dhirubhai Ambani Inst Informat & Commun Technol Grp Computat Sci & HPC Gandhinagar 382007 India

With increasing sizes of distributed systems, there comes an increased risk of communication bottlenecks. In the past decade there has been a growing interest in communication-avoiding algorithms. The distributed memory Fast Fourier Transform is an important algorithm which suffers from major communication bottlenecks. In this work, we take a look at an existing communication-avoiding algorithm FMM-FFT, an alternative to FFT which utilizes the Fast Multipole Method (FMM) to reduce communications to a single all-to-all communication. We present a detailed implementation of FMM-FFT relying on modern libraries and demonstrate it on two distinct distributed memory architectures notably a traditional Intel Xeon based HPC cluster and then a Beowulf cluster. We show that while the FMM-FFT is significantly slower than FFT on the traditional HPC cluster, on the Beowulf cluster it outperforms standard FFT, consistently getting speedups of 1.5x or more against FFTW. We then proceed to show how the communication to computation cost metric is important and useful in explaining the performance results of FMM-FFT against standard FFT. The source code pertaining to this work is being made publicly available under a permissive open source licence at Github.

关键词： Fast Fourier Transform Fast Multipole Method Beowulf cluster communication avoiding algorithms Parallel programming High performance computing

来源：评论

学校读者我要写书评

暂无评论

Translational process: Mathematical software perspective

引用

JOURNAL OF COMPUTATIONAL SCIENCE 2021年 52卷

作者： Dongarra, Jack Gates, Mark Luszczek, Piotr Tomov, Stanimire Univ Tennessee Knoxville TN 37996 USA Oak Ridge Natl Lab Oak Ridge TN USA Univ Manchester Manchester Lancs England

Each successive generation of computer architecture has brought new challenges to achieving high performance mathematical solvers, necessitating development and analysis of new algorithms, which are then embodied in software libraries. These libraries hide architectural details from applications, allowing them to achieve a level of portability across platforms from desktops to world-class high performance computing (HPC) systems. Thus there has been an informal translational computer science process of developing algorithms and distributing them in open source software libraries for adoption by applications and vendors. With the move to exascale, increasing intentionality about this process will benefit the long-term sustainability of the scientific software stack.

关键词： DATAFLOW scheduling runtimes Hardware accelerators communication avoiding algorithms

来源：评论

学校读者我要写书评

暂无评论

Multiscale high-order/low-order (HOLO) algorithms and applications

引用

JOURNAL OF COMPUTATIONAL PHYSICS 2017年 330卷 21-45页

作者： Chacon, L. Chen, G. Knoll, D. A. Newman, C. Park, H. Taitano, W. Willert, J. A. Womeldorff, G. Los Alamos Natl Lab Los Alamos NM 87545 USA Inst Def Anal Alexandria VA 22311 USA

We review the state of the art in the formulation, implementation, and performance of so-called high-order/low-order (HOLO) algorithms for challenging multiscale problems. HOLO algorithms attempt to couple one or several high-complexity physical models (the high order model, HO) with low-complexity ones (the low-order model, LO). The primary goal of HOLO algorithms is to achieve nonlinear convergence between HO and LO components while minimizing memory footprint and managing the computational complexity in a practical manner. Key to the HOLO approach is the use of the LO representations to address temporal stiffness, effectively accelerating the convergence of the HO/LO coupled system. The HOLO approach is broadly underpinned by the concept of nonlinear elimination, which enables segregation of the HO and LO components in ways that can effectively use heterogeneous architectures. The accuracy and efficiency benefits of HOLO algorithms are demonstrated with specific applications to radiation transport, gas dynamics, plasmas (both Eulerian and Lagrangian formulations), and ocean modeling. Across this broad application spectrum, HOLO algorithms achieve significant accuracy improvements at a fraction of the cost compared to conventional approaches. It follows that HOLO algorithms hold significant potential for high-fidelity system scale multiscale simulations leveraging exascale computing. (C) 2016 Elsevier Inc. All rights reserved.

关键词： High-order/low-order HOLO Multiscale algorithms Implicit timestepping communication avoiding algorithms

来源：评论

学校读者我要写书评

暂无评论

Write-avoiding algorithms 30

Write-Avoiding Algorithms

引用

30th IEEE International Parallel and Distributed Processing Symposium (IPDPS)

作者： Carson, Erin Demme, James Grigori, Laura Knight, Nicholas Koanantakool, Penporn Schwartz, Oded Simhadri, Harsha Vardhan NYU Courant Inst Math Sci New York NY 10003 USA Univ Calif Berkeley Dept Math Berkeley CA 94720 USA Univ Calif Berkeley Comp Sci Div Berkeley CA 94720 USA Univ Paris 06 UPMC CNRS UMR 7598Lab Jacques Louis Lions France Alpines INRIA Paris Rocquencourt Paris France Univ Calif Berkeley Div Comp Sci Berkeley CA 94720 USA Hebrew Univ Jerusalem Sch Engn & Comp Sci IL-91905 Jerusalem Israel Lawrence Berkeley Natl Lab Computat Res Div Berkeley CA USA

ISBN: (纸本)9781509021406

communication, i.e., moving data between levels of a memory hierarchy or between processors over a network, is much more expensive (in time or energy) than arithmetic. There has thus been a recent focus on designing algorithms that minimize communication and, when possible, attain lower bounds on the total number of reads and writes. However, most previous work does not distinguish between the costs of reads and writes. Writes can be much more expensive than reads in some current and emerging storage devices such as nonvolatile memories. This motivates us to ask whether there are lower bounds on the number of writes that certain algorithms must perform, and whether these bounds are asymptotically smaller than bounds on the sum of reads and writes together. When these smaller lower bounds exist, we then ask when they are attainable;we call such algorithms "write-avoiding" (WA), to distinguish them from "communication-avoiding" (CA) algorithms, which only minimize the sum of reads and writes. We identify a number of cases in linear algebra and direct N-body methods where known CA algorithms are also WA (some are and some aren't). We also identify classes of algorithms, including Strassen's matrix multiplication, Cooley-Tukey FFT, and cache oblivious algorithms for classical linear algebra, where a WA algorithm cannot exist: the number of writes is unavoidably within a constant factor of the total number of reads and writes. We explore the interaction of WA algorithms with cache replacement policies and argue that the Least Recently Used policy works well with the WA algorithms in this paper. We provide empirical hardware counter measurements from Intel's Nehalem-EX microarchitecture to validate our theory. In the parallel case, for classical linear algebra, we show that it is impossible to attain lower bounds both on interprocessor communication and on writes to local memory, but either one is attainable by itself. Finally, we discuss WA algorithms for sparse iterativ

关键词： Non-volatile memories communication avoiding algorithms write complexity lower bounds linear algebra N-body methods Krylov subspace methods

来源：评论

学校读者我要写书评

暂无评论

avoiding communication through a Multilevel LU Factorization

引用

18th International Conference on Euro-Par Parallel Processing

作者： Donfack, Simplice Grigori, Laura Khabou, Amal Univ Paris 11 Lab Rech Informat INRIA Saclay Ile France Paris France

ISBN: (纸本)9783642328206

Due to the evolution of massively parallel computers towards deeper levels of parallelism and memory hierarchy, and due to the exponentially increasing ratio of the time required to transfer data, either through the memory hierarchy or between different compute units, to the time required to compute floating point operations, the algorithms are confronted with two challenges. They need not only to be able to exploit multiple levels of parallelism, but also to reduce the communication between the compute units at each level of the hierarchy of parallelism and between the different levels of the memory hierarchy. In this paper we present an algorithm for performing the LU factorization of dense matrices that is suitable for computer systems with two levels of parallelism. This algorithm is able to minimize both the volume of communication and the number of messages transferred at every level of the two-level hierarchy of parallelism. We present its implementation for a cluster of multicore processors based on MPI and Pthreads. We show that this implementation leads to a better performance than routines implementing the LU factorization in well-known numerical libraries. For matrices that are tall and skinny, that is they have many more rows than columns, our algorithm outperforms the corresponding algorithm from ScaLAPACK by a factor of 4.5 on a cluster of 32 nodes, each node having two quad-core Intel Xeon EMT64 processors.

关键词： LU factorization communication avoiding algorithms multiple levels of parallelism

来源：评论

学校读者我要写书评

暂无评论

communication avoiding Symmetric Band Reduction

Communication Avoiding Symmetric Band Reduction

引用

17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming

作者： Ballard, Grey Demmel, James Knight, Nicholas Univ Calif Berkeley Berkeley CA 94720 USA

ISBN: (纸本)9781450311601

The running time of an algorithm depends on both arithmetic and communication (i.e., data movement) costs, and the relative costs of communication are growing over time. In this work, we present both theoretical and practical results for tridiagonalizing a symmetric band matrix: we present an algorithm that asymptotically reduces communication, and we show that it indeed performs well in practice. The tridiagonalization of a symmetric band matrix is a key kernel in solving the symmetric eigenvalue problem for both full and band matrices. In order to preserve sparsity, tridiagonalization routines use annihilate-and-chase procedures that previously have suffered from poor data locality. We improve data locality by reorganizing the computation, asymptotically reducing communication costs compared to existing algorithms. Our sequential implementation demonstrates that avoiding communication improves runtime even at the expense of extra arithmetic: we observe a 2x speedup over Intel MKL while doing 43% more floating point operations. Our parallel implementation targets shared-memory multicore platforms. It uses pipelined parallelism and a static scheduler while retaining the locality properties of the sequential algorithm. Due to lightweight synchronization and effective data reuse, we see 9.5x scaling over our serial code and up to 6x speedup over the PLASMA library, comparing parallel performance on a ten-core processor.

关键词： algorithms Performance Theory Symmetric eigenvalue problem communication avoiding algorithms Band reduction

来源：评论

学校读者我要写书评

暂无评论

建议与咨询留下您的常用邮箱和电话号码，以便我们向您反馈解决方案和替代方法

分类表

所选分类

限定检索结果

文献类型

馆藏范围

日期分布

学科分类号

主题

机构

作者

语言

请选择保存的检索档案：

请选择收藏分类：

通借通还

建议与咨询 留下您的常用邮箱和电话号码，以便我们向您反馈解决方案和替代方法

分类表

所选分类

限定检索结果

文献类型

馆藏范围

日期分布

学科分类号

主题

机构

作者

语言

请选择保存的检索档案： 新增检索档案 确定 取消

请选择收藏分类： 新增自定义分类 确定 取消

通借通还

建议与咨询留下您的常用邮箱和电话号码，以便我们向您反馈解决方案和替代方法

请选择保存的检索档案：

请选择收藏分类：