Search results - Inner Mongolia University Library

SHIFTED CHOLESKY QR FOR COMPUTING THE QR FACTORIZATION OF ILL-CONDITIONED MATRICES

SIAM JOURNAL ON SCIENTIFIC COMPUTING, 2020, Vol. 42, No. 1, pp. A477-A503

Authors: Fukaya, Takeshi; Kannan, Ramaseshan; Nakatsukasa, Yuji; Yamamoto, Yusaku; Yanagisawa, Yuka. Affiliations: Hokkaido Univ, Sapporo, Hokkaido, Japan; Arup, 3 Piccadilly Pl, Manchester M1 3BN, Lancs, England; Univ Oxford, Math Inst, Oxford OX2 6GG, England; Univ Electrocommun, Tokyo, Japan; Waseda Univ, Waseda Res Inst Sci & Engn, Tokyo, Japan

The Cholesky QR algorithm is an efficient communication-minimizing algorithm for computing the QR factorization of a tall-skinny matrix $X \in \mathbb{R}^{m \times n}$, where $m \gg n$. Unfortunately it is inherently unstable and often breaks down when the matrix is ill-conditioned. A recent work [Yamamoto et al., ETNA, 44, pp. 306--326 (2015)] establishes that the instability can be cured by repeating the algorithm twice (called CholeskyQR2). However, the applicability of CholeskyQR2 is still limited by the requirement that the Cholesky factorization of the Gram matrix $X^\top X$ runs to completion, which means that it does not always work for matrices $X$ with the 2-norm condition number $\kappa_2(X)$ roughly greater than $u^{-1/2}$, where $u$ is the unit roundoff. In this work we extend the applicability to $\kappa_2(X) = O(u^{-1})$ by introducing a shift to the computed Gram matrix so as to guarantee the Cholesky factorization $R^\top R = A^\top A + sI$ succeeds numerically. We show that the computed $AR^{-1}$ has reduced condition number that is roughly bounded by $u^{-1/2}$, for which CholeskyQR2 safely computes the QR factorization, yielding a computed $Q$ of orthogonality $\|Q^\top Q - I\|_2$ and residual $\|A - QR\|_F / \|A\|_F$ both of the order of $u$. Thus we obtain the required QR factorization by essentially running Cholesky QR thrice. We extensively analyze the resulting algorithm shiftedCholeskyQR3 to reveal its excellent numerical stability. The shiftedCholeskyQR3 algorithm is also highly parallelizable, and applicable and effective also when working with an oblique inner product. We illustrate our findings through experiments, in which we achieve significant speedup over alternative methods.

Keywords: QR factorization; Cholesky QR factorization; oblique inner product; roundoff error analysis; communication-avoiding algorithms
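The three-pass structure described in the abstract above can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the default shift (a modest multiple of $u\|X\|_2^2$ scaled by the matrix dimensions) is an assumption not stated in the abstract, and a general dense solve stands in for the triangular solve that a tuned code would use.

```python
import numpy as np

def cholesky_qr(X):
    """One pass of plain Cholesky QR: X = Q R with R from the Gram matrix."""
    G = X.T @ X                         # n-by-n Gram matrix
    R = np.linalg.cholesky(G).T         # upper-triangular R with G = R^T R
    Q = np.linalg.solve(R.T, X.T).T     # Q = X R^{-1} (a triangular solve in practice)
    return Q, R

def shifted_cholesky_qr3(X, shift=None):
    """Shifted first pass followed by two plain passes (CholeskyQR2)."""
    m, n = X.shape
    u = np.finfo(X.dtype).eps / 2       # unit roundoff
    if shift is None:
        # ASSUMPTION: illustrative shift proportional to u * ||X||_2^2.
        shift = 11 * (m * n + n * (n + 1)) * u * np.linalg.norm(X, 2) ** 2
    # Pass 1: the shift keeps the Gram matrix numerically positive definite.
    R1 = np.linalg.cholesky(X.T @ X + shift * np.eye(n)).T
    Q1 = np.linalg.solve(R1.T, X.T).T   # better conditioned than X, not yet orthogonal
    # Passes 2 and 3: CholeskyQR2 applied to the preconditioned matrix.
    Q2, R2 = cholesky_qr(Q1)
    Q3, R3 = cholesky_qr(Q2)
    return Q3, R3 @ R2 @ R1             # X = Q3 (R3 R2 R1)
```

To try it, build a tall matrix with a prescribed condition number (for example from random orthonormal factors and logarithmically spaced singular values) and compare `np.linalg.norm(Q.T @ Q - np.eye(n), 2)` with the unit roundoff.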

Applying the swept rule for solving explicit partial differential equations on heterogeneous computing systems

JOURNAL OF SUPERCOMPUTING, 2021, Vol. 77, No. 2, pp. 1976-1997

Authors: Magee, Daniel J.; Walker, Anthony S.; Niemeyer, Kyle E. Affiliations: Oregon State Univ, Sch Mech Ind & Mfg Engn, Corvallis, OR 97331, USA; Los Alamos Natl Lab, Los Alamos, NM 87545, USA

Applications that exploit the architectural details of high-performance computing (HPC) systems have become increasingly invaluable in academia and industry over the past two decades. The most important hardware development of the last decade in HPC has been the general purpose graphics processing unit (GPGPU), a class of massively parallel devices that now contributes the majority of computational power in the top 500 supercomputers. As these systems grow, small costs such as latency (due to the fixed cost of memory accesses and communication) accumulate in a large simulation and become a significant barrier to performance. The swept time-space decomposition rule is a communication-avoiding technique for time-stepping stencil update formulas that attempts to reduce latency costs. This work extends the swept rule by targeting heterogeneous CPU/GPU architectures representing current and future HPC systems. We compare our approach to a naive decomposition scheme with two test equations using an MPI+CUDA pattern on 40 processes over two nodes containing one GPU. The swept rule produces a factor of 1.9 to 23 speedup for the heat equation and a factor of 1.1 to 2.0 speedup for the Euler equations, using the same processors and work distribution, and with the best possible configurations. These results show the potential effectiveness of the swept rule for different equations and numerical schemes on massively parallel compute systems that incur substantial latency costs.

Keywords: Domain decomposition; Heterogeneous computing; Partial differential equations; Computational fluid dynamics; communication-avoiding algorithms

Accelerating solutions of one-dimensional unsteady PDEs with GPU-based swept time-space decomposition

JOURNAL OF COMPUTATIONAL PHYSICS, 2018, Vol. 357, pp. 338-352

Authors: Magee, Daniel J.; Niemeyer, Kyle E. Affiliation: Oregon State Univ, Sch Mech Ind & Mfg Engn, Corvallis, OR 97331, USA

The expedient design of precision components in aerospace and other high-tech industries requires simulations of physical phenomena often described by partial differential equations (PDEs) without exact solutions. Modern design problems require simulations with a level of resolution difficult to achieve in reasonable amounts of time, even in effectively parallelized solvers. Though the scale of the problem relative to available computing power is the greatest impediment to accelerating these applications, significant performance gains can be achieved through careful attention to the details of memory communication and access. The swept time-space decomposition rule reduces communication between sub-domains by exhausting the domain of influence before communicating boundary values. Here we present a GPU implementation of the swept rule, which modifies the algorithm for improved performance on this processing architecture by prioritizing use of private (shared) memory, avoiding interblock communication, and overwriting unnecessary values. It shows significant improvement in the execution time of finite-difference solvers for one-dimensional unsteady PDEs, producing speedups of 2-9x for a range of problem sizes compared with simple GPU versions and 7-300x compared with parallel CPU versions. However, for a more sophisticated one-dimensional system of equations discretized with a second-order finite-volume scheme, the swept rule performs 1.2-1.9x worse than a standard implementation for all problem sizes. (C) 2017 Elsevier Inc. All rights reserved.

Keywords: GPU computing; Partial differential equations; Computational fluid dynamics; High-performance computing; communication-avoiding algorithms; Domain decomposition
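The phrase "exhausting the domain of influence before communicating boundary values" can be illustrated with a small pure-Python sketch. It covers only the first, communication-free phase of a swept-style scheme for a three-point stencil (the explicit 1D heat update), under the assumption of a periodic global grid. The function names are hypothetical, and this is not the authors' GPU implementation.

```python
def heat_step(u, r):
    """One explicit heat-equation step on a periodic 1D grid (global reference)."""
    n = len(u)
    return [u[i] + r * (u[i - 1] - 2.0 * u[i] + u[(i + 1) % n]) for i in range(n)]

def local_triangle(block, r):
    """Advance a sub-domain as far as possible using only its own data.

    With a three-point stencil each time step loses one valid point per edge,
    so the computed values form a triangle in space-time. Only after this
    triangle is exhausted would edge values need to be exchanged.
    """
    levels = [list(block)]
    cur = levels[0]
    while len(cur) > 2:
        cur = [cur[i] + r * (cur[i - 1] - 2.0 * cur[i] + cur[i + 1])
               for i in range(1, len(cur) - 1)]
        levels.append(cur)
    return levels
```

For a block of 8 points, `local_triangle` yields levels of length 8, 6, 4, and 2, that is, three time steps completed before any neighbour data is required. A naive decomposition would exchange boundary values after every step.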

Orthogonal Layers of Parallelism in Large-Scale Eigenvalue Computations

ACM TRANSACTIONS ON PARALLEL COMPUTING, 2023, Vol. 10, No. 3, pp. 1-31

Authors: Alvermann, Andreas; Hager, Georg; Fehske, Holger. Affiliations: Univ Greifswald, Inst Phys, Felix Hausdorff Str 6, D-17489 Greifswald, Germany; Friedrich Alexander Univ Erlangen Nurnberg, Erlangen Natl High Performance Comp Ctr, Martensstr 1, D-91058 Erlangen, Germany

We address the communication overhead of distributed sparse matrix-(multiple)-vector multiplication in the context of large-scale eigensolvers, using filter diagonalization as an example. The basis of our study is a performance model, which includes a communication metric that is computed directly from the matrix sparsity pattern without running any code. The performance model quantifies to which extent scalability and parallel efficiency are lost due to communication overhead. To restore scalability, we identify two orthogonal layers of parallelism in the filter diagonalization technique. In the horizontal layer the rows of the sparse matrix are distributed across individual processes. In the vertical layer bundles of multiple vectors are distributed across separate process groups. An analysis in terms of the communication metric predicts that scalability can be restored if, and only if, one implements the two orthogonal layers of parallelism via different distributed vector layouts. Our theoretical analysis is corroborated by benchmarks for application matrices from quantum and solid state physics, road networks, and nonlinear programming. We finally demonstrate the benefits of using orthogonal layers of parallelism with two exemplary application cases (an exciton and a strongly correlated electron system), which incur either small or large communication overhead.

Keywords: Distributed computing; sparse matrix-vector multiplication; communication-avoiding algorithms

Graph Expansion and Communication Costs of Fast Matrix Multiplication

JOURNAL OF THE ACM, 2012, Vol. 59, No. 6, pp. 1-23

Authors: Ballard, Grey; Demmel, James; Holtz, Olga; Schwartz, Oded. Affiliation: Univ Calif Berkeley, Berkeley, CA 94720, USA

The communication cost of algorithms (also known as I/O-complexity) is shown to be closely related to the expansion properties of the corresponding computation graphs. We demonstrate this on Strassen's and other fast matrix multiplication algorithms, and obtain the first lower bounds on their communication costs. In the sequential case, where the processor has a fast memory of size $M$, too small to store three $n$-by-$n$ matrices, the lower bound on the number of words moved between fast and slow memory is, for a large class of matrix multiplication algorithms, $\Omega\big((n/\sqrt{M})^{\omega_0} \cdot M\big)$, where $\omega_0$ is the exponent in the arithmetic count (e.g., $\omega_0 = \lg 7$ for Strassen, and $\omega_0 = 3$ for conventional matrix multiplication). With $p$ parallel processors, each with fast memory of size $M$, the lower bound is asymptotically lower by a factor of $p$. These bounds are attainable both for sequential and for parallel algorithms and hence optimal.

Keywords: algorithms; design; performance; communication-avoiding algorithms; fast matrix multiplication; I/O-complexity
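To get a feel for the bound $(n/\sqrt{M})^{\omega_0} \cdot M$ quoted in the abstract above, the expression can be evaluated for both exponents. The sketch below ignores the constant hidden in the $\Omega$ notation, so its values are only meaningful relative to each other. The function name and parameters are hypothetical.

```python
import math

def comm_lower_bound(n, M, omega0, p=1):
    """Leading term (n / sqrt(M))**omega0 * M / p of the communication bound.

    n      : matrix dimension
    M      : fast-memory size in words
    omega0 : exponent of the arithmetic count (3 classical, log2(7) Strassen)
    p      : number of processors (the bound scales down by a factor of p)
    The constant factor hidden by the Omega notation is omitted.
    """
    return (n / math.sqrt(M)) ** omega0 * M / p

classical = comm_lower_bound(1024, 4096, 3)
strassen = comm_lower_bound(1024, 4096, math.log2(7))
print(f"classical: {classical:.3e}, Strassen: {strassen:.3e}")
```

With $\omega_0 = 3$ the expression reduces to the familiar $n^3/\sqrt{M}$. The smaller exponent for Strassen lowers the leading term, which matches the abstract's point that the bound depends on the exponent of the arithmetic count.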

Recent Developments in Iterative Methods for Reducing Synchronization

18th International Symposium on Distributed Computing and Applications for Business Engineering and Science (DCABES)

Authors: Zou, Qinmeng; Magoules, Frederic. Affiliation: Univ Paris Saclay, Cent Supelec, F-91190 Gif Sur Yvette, France

ISBN: (print) 9781728128658

On modern parallel architectures, the cost of synchronization among processors can often dominate the cost of floating-point computation. Several modifications of the existing methods have been proposed in order to keep the communication cost as low as possible. This paper aims at providing a brief overview of recent advances in parallel iterative methods for solving large-scale problems. We refer the reader to the related references for more details on the derivation, implementation, performance, and analysis of these techniques.

Keywords: communication-avoiding algorithms; s-step iterative methods; pipelined Krylov subspace methods; asynchronous iterations

Graph Expansion and Communication Costs of Fast Matrix Multiplication

23rd Annual Symposium on Parallelism in Algorithms and Architectures

Authors: Ballard, Grey; Demmel, James; Holtz, Olga; Schwartz, Oded. Affiliation: Univ Calif Berkeley, Dept Comp Sci, Berkeley, CA 94720, USA

ISBN: (print) 9781450307437

Keywords: communication-avoiding algorithms; Fast matrix multiplication; I/O-Complexity

A Supernodal All-Pairs Shortest Path Algorithm

25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP)

Authors: Sao, Piyush; Kannan, Ramakrishnan; Gera, Prasun; Vuduc, Richard. Affiliations: Oak Ridge Natl Lab, Oak Ridge, TN 37830, USA; Georgia Inst Technol, Atlanta, GA 30332, USA

ISBN: (print) 9781450368186

We show how to exploit graph sparsity in the Floyd-Warshall algorithm for the all-pairs shortest path (APSP) problem. Floyd-Warshall is an attractive choice for APSP on high-performing systems due to its structural similarity to solving dense linear systems and matrix multiplication. However, if sparsity of the input graph is not properly exploited, Floyd-Warshall will perform unnecessary asymptotic work and thus may not be a suitable choice for many input graphs. To overcome this limitation, the key idea in our approach is to use the known algebraic relationship between Floyd-Warshall and Gaussian elimination, and import several algorithmic techniques from sparse Cholesky factorization, namely, fill-in reducing ordering, symbolic analysis, supernodal traversal, and elimination tree parallelism. When combined, these techniques reduce computation, improve locality and enhance parallelism. We implement these ideas in an efficient shared memory parallel prototype that is orders of magnitude faster than an efficient multi-threaded baseline Floyd-Warshall that does not exploit sparsity. Our experiments suggest that the Floyd-Warshall algorithm can compete with Dijkstra's algorithm (the algorithmic core of Johnson's algorithm) for several classes of sparse graphs.

Keywords: graph algorithm; sparse matrix computations; shared-memory parallelism; communication-avoiding algorithms
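For reference, the dense Floyd-Warshall baseline that the abstract above takes as its starting point can be sketched as below. This is only the classical triple loop, whose structure resembles Gaussian elimination with (min, +) in place of (+, ×). It does not implement the supernodal techniques of the paper, and the single skip of unreachable pivots is a trivial shortcut, not the paper's sparsity exploitation. The function name is hypothetical.

```python
import math

def floyd_warshall(dist):
    """Classical dense Floyd-Warshall on an adjacency matrix.

    dist[i][j] is the edge weight from i to j, math.inf if there is no edge,
    and 0 on the diagonal. Returns a new matrix of shortest-path lengths.
    """
    n = len(dist)
    d = [row[:] for row in dist]          # do not modify the input
    for k in range(n):                    # pivot vertex, like an elimination step
        for i in range(n):
            dik = d[i][k]
            if dik == math.inf:           # i cannot reach k: nothing to update
                continue
            for j in range(n):
                if dik + d[k][j] < d[i][j]:
                    d[i][j] = dik + d[k][j]
    return d

INF = math.inf
graph = [
    [0,   4,   7,   INF],
    [INF, 0,   1,   INF],
    [INF, INF, 0,   2],
    [INF, INF, INF, 0],
]
shortest = floyd_warshall(graph)
```

In this small example the path 0 → 1 → 2 (length 5) replaces the direct edge 0 → 2 (length 7). The loop still visits all $n^3$ index triples in the worst case, which is the unnecessary work on sparse inputs that the abstract refers to.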

I/O-Optimal Algorithms for Symmetric Linear Algebra Kernels

34th ACM Symposium on Parallelism in Algorithms and Architectures (SPAA)

Authors: Beaumont, Olivier; Eyraud-Dubois, Lionel; Langou, Julien; Verite, Mathieu. Affiliations: Univ Bordeaux, Inria Ctr, Bordeaux, France; Univ Colorado Denver, Denver, CO, USA

ISBN: (print) 9781450391467

In this paper, we consider two fundamental symmetric kernels in linear algebra: the Cholesky factorization and the symmetric rank-k update (SYRK), with the classical three nested loops algorithms for these kernels. In addition, we consider a machine model with a fast memory of size $S$ and an unbounded slow memory. In this model, all computations must be performed on operands in fast memory, and the goal is to minimize the amount of communication between slow and fast memories. As the set of computations is fixed by the choice of the algorithm, only the ordering of the computations (the schedule) directly influences the volume of communications. We prove lower bounds of $\frac{1}{3\sqrt{2}} \frac{N^3}{\sqrt{S}}$ for the communication volume of the Cholesky factorization of an $N \times N$ symmetric positive definite matrix, and of $\frac{1}{\sqrt{2}} \frac{N^2 M}{\sqrt{S}}$ for the SYRK computation of $A \cdot A^T$, where $A$ is an $N \times M$ matrix. Both bounds improve the best known lower bounds from the literature by a factor $\sqrt{2}$. In addition, we present two out-of-core, sequential algorithms with matching communication volume: TBS for SYRK, with a volume of $\frac{1}{\sqrt{2}} \frac{N^2 M}{\sqrt{S}} + O(NM \log N)$, and LBC for Cholesky, with a volume of $\frac{1}{3\sqrt{2}} \frac{N^3}{\sqrt{S}} + O(N^{5/2})$. Both algorithms improve over the best known algorithms from the literature by a factor $\sqrt{2}$, and prove that the leading terms in our lower bounds cannot be improved further. This work shows that the operational intensity of symmetric kernels like SYRK or Cholesky is intrinsically higher (by a factor $\sqrt{2}$) than that of corresponding non-symmetric kernels (GEMM and LU factorization).

Keywords: communication-avoiding algorithms; linear algebra; symmetric kernels; syrk; cholesky
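The leading terms of the two lower bounds stated in the abstract above are simple to evaluate numerically. The sketch below does only that: it drops the lower-order terms $O(N^{5/2})$ and $O(NM \log N)$, and the function names are hypothetical.

```python
import math

def cholesky_io_lower_bound(N, S):
    """Leading term N^3 / (3 * sqrt(2) * sqrt(S)) for an N-by-N Cholesky."""
    return N ** 3 / (3.0 * math.sqrt(2.0) * math.sqrt(S))

def syrk_io_lower_bound(N, M, S):
    """Leading term N^2 * M / (sqrt(2) * sqrt(S)) for SYRK with an N-by-M matrix A."""
    return N ** 2 * M / (math.sqrt(2.0) * math.sqrt(S))

# Example sizes chosen so that sqrt(2 * S) is an integer (256).
N, M, S = 1024, 512, 2 ** 15
chol_words = cholesky_io_lower_bound(N, S)   # 1024**3 / (3 * 256)
syrk_words = syrk_io_lower_bound(N, M, S)    # 1024**2 * 512 / 256
print(f"Cholesky: {chol_words:.0f} words, SYRK: {syrk_words:.0f} words")
```

Both terms scale as $1/\sqrt{S}$, so quadrupling the fast memory halves the leading communication volume in this model.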

Tera-Scale 1D FFT with Low-Communication Algorithm and Intel® Xeon Phi™ Coprocessors

International Conference for High Performance Computing, Networking, Storage and Analysis (SC)

Authors: Park, Jongsoo; Bikshandi, Ganesh; Vaidyanathan, Karthikeyan; Tang, Ping Tak Peter; Dubey, Pradeep; Kim, Daehyun. Affiliations: Intel Corp, Parallel Comp Lab, Santa Clara, CA 95051, USA; Intel Corp, Software & Serv Grp, Santa Clara, CA 95051, USA

ISBN: (print) 9781450323789

This paper demonstrates the first tera-scale performance of Intel® Xeon Phi™ coprocessors on 1D FFT computations. Applying a disciplined performance programming methodology of sound algorithm choice, valid performance model, and well-executed optimizations, we break the tera-flop mark on a mere 64 nodes of Xeon Phi and reach 6.7 TFLOPS with 512 nodes, which is 1.5x that achievable on the same number of Intel® Xeon® nodes. It is a challenge to fully utilize the compute capability presented by many-core wide-vector processors for bandwidth-bound FFT computation. We leverage a new algorithm, Segment-of-Interest FFT, with low inter-node communication cost, and aggressively optimize data movements in node-local computations, exploiting caches. Our coordination of low communication algorithm and massively parallel architecture for scalable performance is not limited to running FFT on Xeon Phi; it can serve as a reference for other bandwidth-bound computations and for emerging HPC systems that are increasingly communication limited.

Keywords: Bandwidth Optimizations; communication-avoiding algorithms; FFT; Wide-Vector Many-Core Processors; Xeon Phi
