With the emergence of new massively parallel systems in high-performance computing that allow scientific simulations to run on thousands of processors, the mean time between failures of large machines is decreasing from several weeks to a few minutes. The ability of hardware and software components to handle these singular events, called process failures, is therefore becoming increasingly important. For a scientific code to continue despite a process failure, the application must be able to retrieve the lost data items. The recovery procedure after failures might be fairly straightforward for elliptic and linear hyperbolic problems. However, reversing time for parabolic problems appears to be the most challenging part because it is an ill-posed problem. This paper focuses on new fault-tolerant numerical schemes for the time integration of parabolic problems. The new algorithm allows the application to recover from process failures and to numerically reconstruct the lost data of the failed process(es), avoiding the expensive roll-back operation required in most checkpoint/restart schemes. As a fault-tolerant communication library, we use the fault-tolerant message passing interface developed by the Innovative Computing Laboratory at the University of Tennessee. Experimental results show promising performance: the three-dimensional parabolic benchmark code is able to recover and keep running after failures, adding only a very small penalty to the overall execution time. (C) 2007 Elsevier Inc. All rights reserved.
Algorithm-based fault tolerance (ABFT) is an inexpensive method of incorporating fault tolerance into existing applications. Applications are modified to operate on encoded data and produce encoded results, which may then be checked for correctness. An attractive feature of the scheme is that it requires little or no modification to the underlying hardware or system software. Previous algorithm-based methods for developing reliable versions of numerical programs for general-purpose multicomputers have mostly concerned themselves with error detection. A truly fault-tolerant algorithm, however, needs to locate errors and recover from them once they are located. In a parallel processing environment, this corresponds to locating the faulty processors and recovering the data they corrupted. In this paper, we first present a general scheme for performing fault location and recovery under the ABFT framework. Our fault model assumes that a faulty processor can corrupt all the data it possesses. The fault-location scheme is an application of system-level diagnosis theory to the ABFT framework, while the fault-recovery scheme uses ideas from coding theory to maintain redundant data and uses this redundancy to recover corrupted data in the event of processor failures. Results are presented on implementations of three numerical algorithms on a 16-processor Intel iPSC/2 hypercube multicomputer, which demonstrate acceptably low overheads for the single and double fault location and recovery cases.
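The recovery idea in this abstract (keep redundant encoded data, then rebuild what a failed processor held) can be illustrated with a simple sum checksum. The sketch below is not the paper's scheme: the function names `encode`, `scale_all`, and `recover` are hypothetical, each "processor" is just a Python list, and only a single failure is handled. It shows that a linear operation applied to both data and checksum keeps the encoding valid, so a lost block can be rebuilt by subtraction.

```python
# Minimal sketch of checksum-based recovery (assumed setup, not the paper's code).
# Each inner list stands for the data block owned by one processor.

def encode(blocks):
    """Return the element-wise sum of all data blocks (the checksum block)."""
    return [sum(column) for column in zip(*blocks)]

def scale_all(blocks, checksum, alpha):
    """Apply the same linear operation (scaling by alpha) to data and checksum.

    Because the operation is linear, the scaled checksum is still the sum
    of the scaled blocks, so the encoding survives the computation.
    """
    scaled_blocks = [[alpha * v for v in block] for block in blocks]
    scaled_checksum = [alpha * v for v in checksum]
    return scaled_blocks, scaled_checksum

def recover(blocks, checksum, failed):
    """Rebuild the block at index `failed` from the survivors and the checksum."""
    rebuilt = list(checksum)
    for index, block in enumerate(blocks):
        if index == failed:
            continue  # the failed processor's data is treated as unavailable
        rebuilt = [r - v for r, v in zip(rebuilt, block)]
    return rebuilt

blocks = [[1, 2], [3, 4], [5, 6]]
checksum = encode(blocks)
blocks, checksum = scale_all(blocks, checksum, 10)
lost = 1
blocks[lost] = None  # simulate the failure of processor 1
restored = recover(blocks, checksum, lost)
```

A single sum checksum tolerates one lost block; handling two simultaneous failures, as the abstract mentions, would need additional independent checksums.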
The parallel Diagonal Dominant (PDD) algorithm is an efficient tridiagonal solver. In this paper, a detailed study of the PDD algorithm is given. First the PDD algorithm is extended to solve periodic tridiagonal systems and its scalability is studied. Then the reduced PDD algorithm, which has a smaller operation count than that of the conventional sequential algorithm for many applications, is proposed. Accuracy analysis is provided for a class of tridiagonal systems, the symmetric and skew-symmetric Toeplitz tridiagonal systems. Implementation results show that the analysis gives a good bound on the relative error, and the PDD and reduced PDD algorithms are good candidates for emerging massively parallel machines.
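For reference, the conventional sequential tridiagonal solver that the abstract uses as a baseline is usually taken to be the Thomas algorithm (forward elimination followed by back substitution). The sketch below shows only that sequential baseline, not the PDD or reduced PDD algorithm; the function name `thomas_solve` and the array layout (`lower[0]` and `upper[-1]` unused) are assumptions made for illustration. It does no pivoting, so it is only safe for systems such as diagonally dominant ones.

```python
# Sequential Thomas algorithm for a tridiagonal system (baseline sketch only).
# lower[i] is the entry A[i][i-1], diag[i] is A[i][i], upper[i] is A[i][i+1].
# lower[0] and upper[n-1] are ignored.

def thomas_solve(lower, diag, upper, rhs):
    n = len(diag)
    c = [0.0] * n  # modified upper diagonal
    d = [0.0] * n  # modified right-hand side

    # Forward elimination: remove the sub-diagonal row by row.
    c[0] = upper[0] / diag[0] if n > 1 else 0.0
    d[0] = rhs[0] / diag[0]
    for i in range(1, n):
        denom = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / denom if i < n - 1 else 0.0
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / denom

    # Back substitution: solve the resulting upper bidiagonal system.
    x = [0.0] * n
    x[n - 1] = d[n - 1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x

# Diagonally dominant test system tridiag(-1, 4, -1) with known solution 1, 2, 3, 4.
solution = thomas_solve(
    [0.0, -1.0, -1.0, -1.0],
    [4.0, 4.0, 4.0, 4.0],
    [-1.0, -1.0, -1.0, 0.0],
    [2.0, 4.0, 6.0, 13.0],
)
```

The forward sweep is inherently sequential, which is the dependency that parallel tridiagonal solvers have to break.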
We present and release in open source format a sparse linear solver which efficiently exploits heterogeneous parallel computers. The solver can be easily integrated into scientific applications that need to solve large and sparse linear systems on modern parallel computers made of hybrid nodes hosting Nvidia Graphics Processing Unit (GPU) accelerators. The work extends previous efforts of some of the authors in the exploitation of a single GPU accelerator and proposes an implementation, based on the hybrid MPI-CUDA software environment, of a Krylov-type linear solver relying on an efficient Algebraic MultiGrid (AMG) preconditioner already available in the BootCMatchG library. Our design for the hybrid implementation has been driven by the best practices for minimizing data communication overhead when multiple GPUs are employed, yet preserving the efficiency of the GPU kernels. Strong and weak scalability results of the new version of the library on well-known benchmark test cases are discussed. Comparisons with the Nvidia AmgX solution show a speedup, in the solve phase, up to 2.0x.
We introduce a concept of generalized diagonal dominance for nonlinear functions. As in the linear case, this brings together several apparently different classes of nonlinear functions, such as strictly diagonally dominant functions and certain M-functions. With our concept we easily obtain a quite far-reaching result on the global convergence of asynchronous iterative methods for finding zeros of nonlinear functions. Special cases include some known and several new convergence results for special iterative methods such as the nonlinear JOR, SOR, and SSOR methods.
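The linear case that the abstract takes as its starting point can be sketched briefly: for a strictly diagonally dominant matrix, the Jacobi over-relaxation (JOR) iteration with relaxation parameter in (0, 1] converges to the solution of the linear system. The code below covers only this linear, synchronous setting, not the nonlinear or asynchronous methods of the paper; the function names and the fixed iteration count are assumptions for illustration.

```python
# Linear-case illustration only: strict diagonal dominance check and JOR iteration.

def is_strictly_diagonally_dominant(a):
    """True if |a[i][i]| exceeds the sum of the other |a[i][j]| in every row."""
    return all(
        abs(row[i]) > sum(abs(v) for j, v in enumerate(row) if j != i)
        for i, row in enumerate(a)
    )

def jor_solve(a, b, omega=1.0, iterations=100):
    """Run a fixed number of JOR sweeps starting from the zero vector.

    omega = 1.0 gives the plain Jacobi method.
    """
    n = len(b)
    x = [0.0] * n
    for _ in range(iterations):
        x_new = []
        for i in range(n):
            off_diagonal = sum(a[i][j] * x[j] for j in range(n) if j != i)
            jacobi_value = (b[i] - off_diagonal) / a[i][i]
            # Blend the old value with the Jacobi update.
            x_new.append((1.0 - omega) * x[i] + omega * jacobi_value)
        x = x_new
    return x

a = [[4.0, 1.0], [2.0, 5.0]]
b = [6.0, 12.0]  # exact solution is x = (1, 2)
dominant = is_strictly_diagonally_dominant(a)
x_jacobi = jor_solve(a, b, omega=1.0)
x_relaxed = jor_solve(a, b, omega=0.8)
```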
We propose a hybrid sparse linear system solver based on M-matrix splitting and block-row projection (BRP). We split the sparse coefficient matrix A into two (nonsingular) M-matrices, and construct an augmented larger linear system which we solve using a BRP method. The robustness of BRP is compared with that of ILUT-preconditioned GMRES and of the sparse direct solver Pardiso. We also demonstrate the parallel scalability of BRP on a cluster of multicore nodes. (C) 2017 Elsevier B.V. All rights reserved.
In this paper we describe a parallel algorithm for solving large sparse nonsingular linear systems Ax = f, of order n, using the Hermitian Skew-Hermitian splitting approach for handling the augmented linear system, of order 2n, that arises from the linear least squares problem of minimizing the 2-norm of (f-Ax). We use restarted GMRES as the outer iteration with the Hermitian Skew-Hermitian Splitting (HSS) preconditioner. In solving systems involving this preconditioner, the most time-consuming part is handling shifted skew-symmetric systems. We solve such systems using successive overrelaxation (SOR). Theoretical analysis shows that our solver always converges to the unique solution of Ax = f. We present several numerical experiments that demonstrate the robustness of our solver compared to other schemes, and show its parallel scalability on a single multicore node. (C) 2016 Elsevier Ltd. All rights reserved.
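The splitting that gives the HSS preconditioner its name writes a square matrix as the sum of its Hermitian part H = (A + A^H)/2 and its skew-Hermitian part S = (A - A^H)/2. The sketch below only computes and checks this splitting on a small dense matrix stored as nested lists; it does not build the preconditioner, the augmented system, or the GMRES/SOR solver described in the abstract, and the function name `hss_split` is hypothetical.

```python
# Hermitian / skew-Hermitian splitting of a small dense matrix (illustration only).

def hss_split(a):
    """Return (H, S) with H = (A + A^H)/2 and S = (A - A^H)/2, so that A = H + S."""
    n = len(a)
    h = [[(a[i][j] + a[j][i].conjugate()) / 2 for j in range(n)] for i in range(n)]
    s = [[(a[i][j] - a[j][i].conjugate()) / 2 for j in range(n)] for i in range(n)]
    return h, s

a = [[4, 2], [0, 6]]
h, s = hss_split(a)
# For this real matrix H is symmetric and S is skew-symmetric:
# h == [[4.0, 1.0], [1.0, 6.0]] and s == [[0.0, 1.0], [-1.0, 0.0]]
```

For a real matrix the conjugate transpose reduces to the ordinary transpose, which is why the abstract speaks of shifted skew-symmetric systems.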
A flexible parallel deterministic solver of the Boltzmann-Poisson system for 2D semiconductor device simulation on computer clusters is presented. The simulator is obtained by parallelizing a previously proposed numerical scheme based on high-order finite difference weighted essentially non-oscillatory (WENO) schemes. Although the underlying numerical scheme presents important advantages over direct simulation Monte Carlo methods, it demands very high computing power. For this reason, the different calculation phases in the numerical scheme have been parallelized. The data subdomain that accounts for most of the computational workload has been suitably distributed among the processors, and several parallel design decisions have been made in order to achieve good performance. Moreover, the resulting parallel application can be easily adjusted to simulate a wide range of devices and could be easily used by engineers without a mathematical background in the underlying numerical scheme. The parallel algorithm has been implemented in C++ augmented with calls to MPI functions and functions of optimized linear algebra libraries. Several experiments have been performed by simulating particular MOSFET and DG-MOSFET devices on an SMP cluster in order to show its efficiency. (C) 2008 Elsevier B.V. All rights reserved.
Multicore CPUs can be combined with GPUs to perform computations over 3D unstructured meshes on heterogeneous CPU-GPU clusters. The authors explain how to unlock the CPUs' computing power without slowing down other tasks related to data movement. By solving the representative diffusion equation using the cell-centered finite volume method, the authors demonstrate that combining the computing capacity of CPUs and GPUs delivers a performance advantage over the GPU-only approach.
In this paper we discuss numerical methods and algorithms for the solution of NLTE stellar atmosphere problems involving expanding atmospheres, e.g., those found in novae, supernovae and stellar winds. We show how a scheme of nested iterations can be used to reduce the high dimension of the problem to a number of problems with smaller dimensions. As examples of these sub-problems, we discuss the numerical solution of the radiative transfer equation for relativistically expanding media with spherical symmetry, the solution of the multi-level non-LTE statistical equilibrium problem for extremely large model atoms, and our temperature correction procedure. Although modern iteration schemes are very efficient, parallel algorithms are essential in making large-scale calculations feasible; we therefore discuss some parallelization schemes that we have developed. (C) 1999 Elsevier Science B.V. All rights reserved.