Java is gaining acceptance as a language for high performance computing, as it is platform independent and safe. A parallel linear algebra package is fundamental for developing parallel numerical applications. In this paper, we present plapackJava, a Java interface to PLAPACK, a parallel linear algebra library. This interface is simple to use and object-oriented, with good support for initialization of distributed objects. The experiments we have performed indicate that plapackJava does not introduce a significant overhead with respect to PLAPACK. (C) 2000 Elsevier Science B.V. All rights reserved.
A new algorithm for computing an orthogonal decomposition of a rectangular m × n matrix A on a shared-memory parallel computer is described. The algorithm uses Givens rotations and has the feature that its synchronization cost is low. In particular, for a multiprocessor having p processors, an analysis of the algorithm shows that this cost is O(n²/p) if m/p ≥ n, and O(mn/p²) if m/p < n. Note that in the latter case, the synchronization cost is smaller than O(n²/p). Therefore, the synchronization cost of the algorithm proposed in this article is bounded by O(n²/p) when m ≥ n. This is important for machines where synchronization cost is high, and when m ≫ n. Analysis and experiments show that the algorithm is effective in balancing the load and producing high efficiency (speedup).
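The parallel synchronization scheme is not reproduced in the abstract, but the Givens rotation primitive it builds on is standard: rotations acting on disjoint row pairs within a column are independent, which is what shared-memory variants exploit. A minimal sequential sketch in Python/NumPy follows; `givens` and `givens_qr` are illustrative names, not from the paper.

```python
import numpy as np

def givens(a, b):
    """Return (c, s) such that [[c, s], [-s, c]] maps (a, b) to (r, 0)."""
    if b == 0.0:
        return 1.0, 0.0
    r = np.hypot(a, b)
    return a / r, b / r

def givens_qr(A):
    """Reduce A (m x n) to upper-trapezoidal form by Givens rotations,
    sweeping each column bottom-up; applied sequentially for clarity."""
    R = A.astype(float)
    m, n = R.shape
    for j in range(n):
        for i in range(m - 1, j, -1):   # zero R[i, j] against row i-1
            c, s = givens(R[i - 1, j], R[i, j])
            G = np.array([[c, s], [-s, c]])
            R[[i - 1, i], j:] = G @ R[[i - 1, i], j:]
    return R
```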
In this work a systolic matrix product algorithm was adapted to a linear transputer network. A model for finding the optimal work-load balance is given, which describes the behaviour of the adapted systolic algorithm. In addition, this model allows us to find the optimal granularity to obtain maximum speedup.
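The abstract does not reproduce the model, but the trade-off it optimizes is the classic pipeline one: on a linear chain of p stages, larger grains amortize per-message overhead while increasing pipeline fill time. The generic fill-plus-drain sketch below is an assumption for illustration, not the paper's model; `pipeline_time`, `best_grain`, and the cost parameters are invented names.

```python
def pipeline_time(n, p, g, t_flop, t_msg):
    """Generic fill-plus-drain model for a linear pipeline of p stages.

    n      -- total work items (e.g. matrix columns)
    g      -- grain size (items forwarded per message)
    t_flop -- compute time per item per stage
    t_msg  -- fixed overhead per message hop
    """
    chunk = g * t_flop + t_msg                 # one grain through one stage
    return (p - 1) * chunk + (n / g) * chunk   # pipeline fill + steady state

def best_grain(n, p, t_flop, t_msg, grains):
    """Pick the grain size minimizing the modeled execution time."""
    return min(grains, key=lambda g: pipeline_time(n, p, g, t_flop, t_msg))
```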
A linear-rotation-based algorithm is proposed for solving systems of linear equations, Ax = b. The algorithm modifies the conventional Gaussian elimination method and avoids the problems of numerical singularity and ill-conditioning. In this study, both a trapezoidal systolic array of processors and a linear array of n processors are implemented for this algorithm. The trapezoidal systolic array performs the triangularization of the matrix A using the modified linear rotation algorithm, while the linear array performs the backward substitution that evaluates the solution x. The computing time for solving a linear system is O(5n) time units. In addition, an implicit representation of the elimination factor by a sign-parameter sequence, instead of a numerical value, is introduced to simplify the hardware complexity. This systolic architecture is simple, uniform, and regular, and is therefore well suited to implementation as a VLSI chip.
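The backward-substitution stage assigned to the linear array is standard; a minimal sequential sketch follows (the systolic scheduling is not shown, and `back_substitute` is an illustrative name). On the paper's linear array, each processor would own one row and pass partial sums along the chain.

```python
import numpy as np

def back_substitute(U, b):
    """Solve U x = b for upper-triangular U by backward substitution."""
    n = len(b)
    x = np.zeros(n)
    for i in range(n - 1, -1, -1):
        # subtract contributions of already-solved unknowns, then scale
        x[i] = (b[i] - U[i, i + 1:] @ x[i + 1:]) / U[i, i]
    return x
```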
Parallel algorithms for the triangularization of large, sparse, and unsymmetric matrices are presented. The method combines parallel reduction with a new parallel pivoting technique, control over the generation of fill-in, and checks for numerical stability, all done in parallel with the work distributed over the active processes. The parallel pivoting technique uses the compatibility relation between pivots to identify parallel pivot candidates, and uses the Markowitz numbers of pivots to minimize fill-in. This technique is not a preordering of the sparse matrix; it is applied dynamically as the decomposition proceeds.
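Under the usual definitions, the Markowitz number of a candidate pivot (i, j) is (rᵢ − 1)(cⱼ − 1), where rᵢ and cⱼ count nonzeros in its row and column, and two pivots are compatible when, at minimum, they share no row or column. The sketch below uses that simplified compatibility test; the paper's full compatibility relation and stability check are not reproduced, and the function names are illustrative.

```python
def markowitz(nnz_row, nnz_col, i, j):
    """Markowitz number of candidate pivot (i, j): an upper bound on
    the fill-in its elimination step can create."""
    return (nnz_row[i] - 1) * (nnz_col[j] - 1)

def compatible_pivot_set(candidates, nnz_row, nnz_col):
    """Greedily pick pivots sharing no row or column, so they can be
    eliminated concurrently; prefer small Markowitz numbers."""
    chosen, used_rows, used_cols = [], set(), set()
    for i, j in sorted(candidates,
                       key=lambda ij: markowitz(nnz_row, nnz_col, *ij)):
        if i not in used_rows and j not in used_cols:
            chosen.append((i, j))
            used_rows.add(i)
            used_cols.add(j)
    return chosen
```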
This work analyses two techniques for auto-tuning linear algebra routines for hybrid combinations of multicore CPUs and manycore coprocessors (single or multiple GPUs and MICs). The first technique is based on basic models of the execution time of the routines, whereas the second manages only empirical information obtained during the installation of the routines. The final goal in both cases is to obtain a balanced assignment of the work to the computing components in the system. The study is carried out with a basic kernel (matrix-matrix multiplication) and a higher-level routine (LU factorization) which uses the auto-tuned basic routine. Satisfactory results are obtained, with experimental execution times close to the lowest experimentally achievable. (C) 2015 Elsevier B.V. All rights reserved.
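In both techniques the balancing goal reduces to splitting the work in proportion to each device's effective speed, so that all components finish at roughly the same time. A minimal sketch of that rule; the function name and the 1:4:4 benchmark figures are illustrative, not from the paper.

```python
def balanced_split(n_rows, speeds):
    """Assign matrix rows to devices in proportion to measured speed
    (rows per second), so all devices finish at about the same time."""
    total = sum(speeds)
    shares = [round(n_rows * s / total) for s in speeds]
    shares[-1] = n_rows - sum(shares[:-1])   # absorb rounding error
    return shares

# e.g. 10000 rows over a CPU and two GPUs benchmarked at 1:4:4
print(balanced_split(10000, [1.0, 4.0, 4.0]))  # -> [1111, 4444, 4445]
```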
The sparse matrix-vector product is an important computational kernel that runs inefficiently on many computers with superscalar RISC processors. In this paper we analyse the performance of the sparse matrix-vector product with symmetric matrices originating from the FEM and describe techniques that lead to a fast implementation. It is shown how these optimisations can be incorporated into an efficient parallel implementation using message passing. We conduct numerical experiments on many different machines and show that our optimisations speed up the sparse matrix-vector multiplication substantially. (C) 2001 Elsevier Science B.V. All rights reserved.
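A standard optimisation for symmetric FEM matrices is to store only one triangle and let each stored off-diagonal entry contribute twice, halving memory traffic. The CSR sketch below shows that idea; it is not necessarily the paper's exact scheme, and `sym_spmv` is an illustrative name.

```python
import numpy as np

def sym_spmv(n, indptr, indices, data, x):
    """y = A x for symmetric A stored as its upper triangle in CSR."""
    y = np.zeros(n)
    for i in range(n):
        for k in range(indptr[i], indptr[i + 1]):
            j = indices[k]
            y[i] += data[k] * x[j]
            if j != i:                 # mirror the off-diagonal entry
                y[j] += data[k] * x[i]
    return y
```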
A new approach to the parallel computation of the singular value decomposition (SVD) of a matrix A ∈ C^{m×n} is proposed. Contrary to the known algorithms that use a static cyclic ordering of the subproblems simultaneously solved in one iteration step, the proposed implementation of the two-sided block-Jacobi method uses a dynamic ordering of subproblems. The dynamic ordering takes into account the actual status of matrix A. In each iteration step, a set of off-diagonal blocks is determined that reduces the Frobenius norm of the off-diagonal elements of A as much as possible and, at the same time, can be annihilated concurrently. This task is equivalent to the maximum-weight perfect matching problem, and a greedy algorithm for its efficient solution is presented. Computational experiments with both types of ordering, incorporated into the two-sided block-Jacobi method, were performed on an SGI-Cray Origin 2000 parallel computer using the Message Passing Interface (MPI). The results confirm that the dynamic ordering is much more efficient, with regard to the amount of work required to compute the SVD to a given accuracy, than the static cyclic ordering. (C) 2002 Elsevier Science B.V. All rights reserved.
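The greedy step can be sketched directly: order the off-diagonal block pairs by the Frobenius norm they would remove and accept each pair whose block indices are still free, so the chosen subproblems can be annihilated concurrently. The weights and names below are illustrative; the paper's exact tie-breaking is not reproduced.

```python
def greedy_block_pairing(weights):
    """Greedy approximation to maximum-weight perfect matching.

    weights: dict mapping block-index pairs (i, j), i < j, to the
    Frobenius norm of the corresponding off-diagonal blocks.
    """
    matched, pairs = set(), []
    for (i, j), w in sorted(weights.items(), key=lambda kv: -kv[1]):
        if i not in matched and j not in matched:
            pairs.append((i, j))
            matched.update((i, j))
    return pairs

# e.g. 4 block rows/columns
w = {(0, 1): 5.0, (0, 2): 1.0, (0, 3): 2.0,
     (1, 2): 4.0, (1, 3): 1.5, (2, 3): 3.0}
print(greedy_block_pairing(w))  # -> [(0, 1), (2, 3)]
```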
In this note, we describe two algorithms for factoring symmetric positive definite band matrices. Both algorithms are designed for a p × p array of processors with local memory, with p ≤ O(m), where m is the semibandwidth of the matrix. The p × p model is appropriate for the Ametek Series 2010, which is a mesh, and for the popular hypercube architectures, which contain the mesh. Furthermore, the p × p model is studied because it is the minimal connectivity required to implement these algorithms. The first algorithm is obtained by reflecting, or folding, a systolic array of Brent and Luk. An alternative approach, a torus wrap of the band matrix, yields an efficient method that allows the matrix to stay in place during the course of the algorithm. Both algorithms achieve nearly optimal efficiency for matrices whose bandwidth is large relative to the size of the processor array. In fact, their efficiency is asymptotically 1, in contrast to the 1/3 efficiency of the algorithm that can be derived directly from that of Brent and Luk, or as presented by Chan et al. We conclude with a medium-grain adaptation of these methods, which maintains their near optimality.
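The torus wrap named above is the standard cyclic distribution: entry (i, j) lives on processor (i mod p, j mod p), so every processor holds an evenly scattered piece of the band and the data can stay in place as the factorization proceeds. A minimal sketch, with an illustrative function name:

```python
def torus_wrap_owner(i, j, p):
    """Owner of matrix entry (i, j) under a torus (cyclic) wrap onto a
    p x p processor mesh."""
    return (i % p, j % p)

# entries of a 6x6 matrix on a 2x2 mesh
for i in range(6):
    print([torus_wrap_owner(i, j, 2) for j in range(6)])
```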
The LU factorization is an important numerical algorithm for solving systems of linear equations in science and engineering, and is characteristic of many dense linear algebra computations. For example, it has become the de facto numerical algorithm implemented within the LINPACK benchmark to rank the most powerful supercomputers in the world, collected by the TOP500 website. Multicore processors continue to present challenges to the development of fast and robust numerical software due to the increasing levels of hardware parallelism and the widening gap between core and memory speeds. In this context, the difficulty in developing new algorithms for the scientific community resides in the combination of two goals: achieving high performance while maintaining the accuracy of the numerical algorithm. This paper proposes a new approach for computing the LU factorization in parallel on multicore architectures, which not only improves the overall performance but also sustains the numerical quality of the standard LU factorization algorithm with partial pivoting. While the update of the trailing submatrix is computationally intensive and highly parallel, the inherently problematic portion of the LU factorization is the panel factorization, due to its memory-bound characteristic as well as the atomicity of selecting the appropriate pivots. Our approach uses a parallel fine-grained recursive formulation of the panel factorization step and implements the update of the trailing submatrix with the tile algorithm. Based on conflict-free partitioning of the data and lockless synchronization mechanisms, our implementation lets the overall computation flow naturally without contention. The dynamic runtime system called QUARK is then able to schedule tasks with heterogeneous granularities and to transparently introduce algorithmic lookahead. The performance results of our implementation are competitive compared to the currently available software packages and libraries.
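The panel/update split described above corresponds to the textbook right-looking blocked LU with partial pivoting; the sketch below shows that structure sequentially, without the paper's recursive panel, tile layout, or QUARK scheduling (`blocked_lu` is an illustrative name).

```python
import numpy as np

def blocked_lu(A, nb):
    """Right-looking blocked LU with partial pivoting, in place.

    Each outer step does a memory-bound panel factorization with pivot
    selection, then the highly parallel trailing-submatrix update."""
    A = A.astype(float)
    n = A.shape[0]
    perm = np.arange(n)
    for k in range(0, n, nb):
        e = min(k + nb, n)
        for j in range(k, e):                       # -- panel factorization
            pvt = j + int(np.argmax(np.abs(A[j:, j])))
            if pvt != j:                            # swap full rows
                A[[j, pvt]] = A[[pvt, j]]
                perm[[j, pvt]] = perm[[pvt, j]]
            A[j + 1:, j] /= A[j, j]
            A[j + 1:, j + 1:e] -= np.outer(A[j + 1:, j], A[j, j + 1:e])
        if e < n:
            for j in range(k + 1, e):               # U12 = L11^{-1} A12
                A[j, e:] -= A[j, k:j] @ A[k:j, e:]
            A[e:, e:] -= A[e:, k:e] @ A[k:e, e:]    # -- trailing update
    return A, perm
```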