Performant numerical solving of differential equations is required for large-scale scientific modeling. In this manuscript we focus on two questions: (1) how can researchers empirically verify theoretical advances and consistently compare methods in production software settings and (2) how can users (scientific domain experts) keep up with the state-of-the-art methods to select those which are most appropriate? Here we describe how the confederated modular API of *** addresses these concerns. We detail the package-free API which allows numerical methods researchers to readily utilize and benchmark any compatible method directly in full-scale scientific applications. In addition, we describe how the complexity of the method choices is abstracted via a polyalgorithm. We show how scientific tooling built on top of ***, such as packages for dynamical systems quantification and quantum optics simulation, both benefit from this structure and provide themselves as convenient benchmarking tools.
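The abstract does not show the package's actual API. As a minimal, hypothetical sketch of the polyalgorithm idea it describes (dispatching to a solver based on problem traits; the method names and trait thresholds below are invented for illustration, not taken from the package):

```python
# Hypothetical polyalgorithm for ODE solver selection. The traits, thresholds,
# and method names are illustrative only, not the package's actual choices.

def choose_method(stiff: bool, tolerance: float, size: int) -> str:
    """Return the name of a suitable integrator for the given problem traits."""
    if stiff:
        # Implicit methods pay off for stiff problems; multistep BDF-style
        # schemes tend to win on large systems, Rosenbrock on small ones.
        return "implicit-bdf" if size > 1000 else "rosenbrock"
    # Non-stiff: tight tolerances favor a higher-order explicit Runge-Kutta.
    return "rk-high-order" if tolerance < 1e-8 else "rk-adaptive"
```

The point of such a dispatch layer is that the domain expert states the problem and a tolerance, and the library, not the user, tracks which state-of-the-art method fits.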
Despite extensive research, optimal performance has not easily been available previously for matrix multiplication (especially for large matrices) on most architectures because of the lack of a structured approach and the limitations imposed by matrix storage formats. A simple but effective framework is presented here that lays the foundation for building high-performance matrix-multiplication codes in a structured, portable and efficient manner. The resulting codes are validated on three different representative RISC and CISC architectures on which they significantly outperform highly optimized libraries such as ATLAS and other competing methodologies reported in the literature. The main component of the proposed approach is a hierarchical storage format that efficiently generalizes the applicability of the memory-hierarchy-friendly Morton ordering to arbitrary-sized matrices. The storage format supports polyalgorithms, which are shown here to be essential for obtaining the best possible performance for a range of problem sizes. Several algorithmic advances are made in this paper, including an oscillating iterative algorithm for matrix multiplication and a variable recursion cutoff criterion for Strassen's algorithm. The authors expose the need to standardize linear algebra kernel interfaces, distinct from the BLAS, for writing portable high-performance code. These kernel routines operate on small blocks that fit in the L1 cache. The performance advantages of the proposed framework can be effectively delivered to new and existing applications through the use of object-oriented or compiler-based approaches. Copyright (C) 2002 John Wiley & Sons, Ltd.
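The paper's hierarchical format generalizes Morton (Z-order) indexing to arbitrary matrix sizes; that generalization is not reproducible from the abstract, but the underlying classic Morton index, which interleaves the bits of the row and column coordinates so that nearby blocks stay nearby in memory, can be sketched directly:

```python
# Standard Morton (Z-order) index: interleave row and column bits.
# This is the textbook construction the paper builds on, not the paper's
# generalized arbitrary-size format.

def morton_index(row: int, col: int, bits: int = 16) -> int:
    """Map a (row, col) pair to its Z-order curve position."""
    z = 0
    for i in range(bits):
        z |= ((row >> i) & 1) << (2 * i + 1)  # row bit goes to odd position
        z |= ((col >> i) & 1) << (2 * i)      # col bit goes to even position
    return z
```

Walking a matrix in increasing Morton index visits it in recursive quadrants, which is why the ordering interacts well with cache hierarchies and with recursive algorithms such as Strassen's.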
Collective communication operations are widely used in MPI applications and play an important role in their performance. However, the network heterogeneity inherent to grid environments represents a great challenge to developing efficient high-performance computing applications. In this work we propose a generic framework based on communication models and adaptive techniques for dealing with collective communication patterns on grid platforms. Toward this goal, we address the hierarchical organization of the grid, selecting the most efficient communication algorithms at each network level. Our framework is also adaptive to grid load dynamics, since it considers transient network characteristics when dividing the nodes into clusters. Our experiments with the broadcast operation on a real-grid setup indicate that an adaptive framework allows significant performance improvements on MPI collective communications. (C) 2007 Elsevier Inc. All rights reserved.
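The framework itself is not specified in the abstract. A toy sketch of the hierarchical idea it describes, a two-level broadcast that crosses the slow wide-area links once per cluster (root to a per-cluster leader) and then fans out over the fast local network, might look like this (the node numbering and list-of-lists cluster representation are assumptions for illustration):

```python
# Toy two-level broadcast plan for a hierarchical (grid) topology.
# clusters: list of clusters, each a list of node ids; clusters[0][0] is root.

def hierarchical_broadcast(clusters):
    """Return the point-to-point sends as (src, dst) pairs, in phase order."""
    root = clusters[0][0]
    sends = []
    # Phase 1: root crosses the slow inter-cluster links once per cluster,
    # reaching one designated leader in each remote cluster.
    for cluster in clusters[1:]:
        sends.append((root, cluster[0]))
    # Phase 2: each leader fans out over its fast local network.
    for cluster in clusters:
        for node in cluster[1:]:
            sends.append((cluster[0], node))
    return sends
```

The payoff is that each expensive wide-area link carries the message exactly once; a flat broadcast tree oblivious to the hierarchy may cross those links many times.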
The authors investigate the performance of several preconditioned conjugate gradient-like algorithms and a standard stationary iterative method (block-line successive overrelaxation (SOR)) on linear systems of equations that arise from a nonlinear elliptic flame sheet problem simulation. The nonlinearity forces a pseudotransient continuation process that makes the problem parabolic and thus compacts the spectrum of the Jacobian matrix so that simple relaxation methods are viable in the initial stages of the solution process. However, because of the transition from parabolic to elliptic character as the timestep is increased in pursuit of the steady-state solution, the performance of the candidate linear solvers spreads as the domain of convergence of Newton's method is approached. In numerical experiments over the course of a full nonlinear solution trajectory, short-recurrence or optimal Krylov algorithms combined with Gauss-Seidel (GS) preconditioning yield better execution times than the standard block-line SOR techniques, but SOR performs competitively at a smaller storage cost until the final stages. Block-incomplete factorization preconditioned methods, on the other hand, require nearly a factor of two more storage than SOR and are uniformly less effective during the pseudotransient stages. The advantage of GS preconditioning is partly attributable to the exploitation of a dominant convection direction in the examples; nevertheless, a multidomain version of GS with streamwise coupling lagged at rows between adjacent subdomains incurs only a modest penalty.
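Pseudotransient continuation, as described, shifts the Newton iteration by 1/Δt and relaxes that shift as the steady state is approached, which is exactly the parabolic-to-elliptic transition the abstract discusses. A scalar toy sketch (the initial timestep, growth factor, and step count are illustrative choices, not values from the paper):

```python
# Scalar pseudotransient continuation for F(u) = 0: the 1/dt shift damps the
# Newton step early on (parabolic character), and growing dt recovers plain
# Newton near the steady state (elliptic character).

def pseudo_transient(F, dF, u0, dt0=0.1, growth=2.0, steps=30):
    """Drive F(u) = 0 to its root via timestep-damped Newton updates."""
    u, dt = u0, dt0
    for _ in range(steps):
        du = -F(u) / (1.0 / dt + dF(u))  # shifted (damped) Newton step
        u += du
        dt *= growth                     # relax the damping each step
    return u
```

In the PDE setting the scalar shift 1/dt becomes a diagonal shift of the Jacobian, which is what compacts its spectrum and keeps simple relaxation-based solvers viable in the early stages.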
In the iterative linear solver package LINSOL, several generalized conjugate gradient (CG) methods (or, briefly, CG-type methods) with quite different properties are implemented. With these methods, polyalgorithms with automatic method switching are constructed. The "emergency exit" that is taken in the worst case is the ATPRES method (which is very robust, but very slow). In this paper we investigate whether (I)LU preconditioning would be a better emergency exit and how the drop tolerance for small elements in ILU affects the convergence behavior. The answer will be: it depends. (C) 2001 IMACS. Published by Elsevier Science B.V. All rights reserved.
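LINSOL's actual switching logic is not given in the abstract. The "emergency exit" pattern it describes, trying fast but fragile CG-type methods first and falling back to a slow, robust method only when they fail, can be sketched generically (the solver callable interface below is an assumption made for illustration, not LINSOL's API):

```python
# Toy polyalgorithm with an "emergency exit": each solver is a callable
# returning (solution, converged); the last entry is the robust fallback.

def polyalgorithm_solve(solvers, A, b):
    """Try solvers in order of expected speed; return (solution, method name)."""
    for solve in solvers[:-1]:
        x, converged = solve(A, b)
        if converged:
            return x, solve.__name__
    # Worst case: the robust (but slow) emergency exit is assumed to succeed.
    x, _ = solvers[-1](A, b)
    return x, solvers[-1].__name__
```

The paper's question then becomes: which method should sit in that last slot, ATPRES or an ILU-preconditioned variant, and at which ILU drop tolerance; its answer is that no single choice wins on all systems.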