Java is gaining acceptance as a language for high performance computing, as it is platform independent and safe. A parallel linear algebra package is fundamental for developing parallel numerical applications. In this paper, we present plapackJava, a Java interface to PLAPACK, a parallel linear algebra library. This interface is simple to use and object-oriented, with good support for initialization of distributed objects. The experiments we have performed indicate that plapackJava does not introduce a significant overhead with respect to PLAPACK. (C) 2000 Elsevier Science B.V. All rights reserved.
A new algorithm for computing an orthogonal decomposition of a rectangular m × n matrix A on a shared-memory parallel computer is described. The algorithm uses Givens rotations and has the feature that its synchronization cost is low. In particular, for a multiprocessor having p processors, an analysis of the algorithm shows that this cost is O(n²/p) if m/p ≥ n, and O(mn/p²) if m/p < n. Note that in the latter case, the synchronization cost is smaller than O(n²/p). Therefore, the synchronization cost of the algorithm proposed in this article is bounded by O(n²/p) when m ≥ n. This is important for machines where synchronization cost is high, and when m ≫ n. Analysis and experiments show that the algorithm is effective in balancing the load and producing high efficiency (speedup).
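The parallel synchronization scheme is not reproduced in the abstract, but the Givens rotation primitive it builds on is standard: rotations acting on disjoint row pairs within a column are independent, which is what shared-memory variants exploit. A minimal sequential sketch in Python/NumPy follows; `givens` and `givens_qr` are illustrative names, not from the paper.

```python
import numpy as np

def givens(a, b):
    """Return (c, s) such that [[c, s], [-s, c]] maps (a, b) to (r, 0)."""
    if b == 0.0:
        return 1.0, 0.0
    r = np.hypot(a, b)
    return a / r, b / r

def givens_qr(A):
    """Reduce A (m x n) to upper-trapezoidal form by Givens rotations,
    sweeping each column bottom-up; applied sequentially for clarity."""
    R = A.astype(float)
    m, n = R.shape
    for j in range(n):
        for i in range(m - 1, j, -1):   # zero R[i, j] against row i-1
            c, s = givens(R[i - 1, j], R[i, j])
            G = np.array([[c, s], [-s, c]])
            R[[i - 1, i], j:] = G @ R[[i - 1, i], j:]
    return R
```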
In this work a systolic matrix product algorithm was adapted to a linear transputer network. A model for finding the optimal work-load balance is given, which describes the behaviour of the adapted systolic algorithm. In addition, this model allows us to find the optimal granularity to obtain maximum speedup.
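The abstract does not reproduce the model, but the trade-off it optimizes is the classic pipeline one: on a linear chain of p stages, larger grains amortize per-message overhead while increasing pipeline fill time. The generic fill-plus-drain sketch below is an assumption for illustration, not the paper's model; `pipeline_time`, `best_grain`, and the cost parameters are invented names.

```python
def pipeline_time(n, p, g, t_flop, t_msg):
    """Generic fill-plus-drain model for a linear pipeline of p stages.

    n      -- total work items (e.g. matrix columns)
    g      -- grain size (items forwarded per message)
    t_flop -- compute time per item per stage
    t_msg  -- fixed overhead per message hop
    """
    chunk = g * t_flop + t_msg                 # one grain through one stage
    return (p - 1) * chunk + (n / g) * chunk   # pipeline fill + steady state

def best_grain(n, p, t_flop, t_msg, grains):
    """Pick the grain size minimizing the modeled execution time."""
    return min(grains, key=lambda g: pipeline_time(n, p, g, t_flop, t_msg))
```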
A linear-rotation-based algorithm is proposed for solving systems of linear equations, Ax = b. The algorithm modifies the conventional Gaussian elimination method and avoids the problems of numerical singularity and ill-conditioning. In this study, both a trapezoidal systolic array of processors and a linear array of n processors are implemented for this algorithm. The trapezoidal systolic array performs the triangularization of the matrix A using the modified linear rotation algorithm, while the linear array performs the backward substitution that evaluates the solution x. The computing time for solving a linear system is O(5n) time units. In addition, an implicit representation of the elimination factor by a sign-parameter sequence, instead of a numerical value, is introduced to simplify the hardware complexity. This systolic architecture is simple, uniform, and regular, and is therefore well suited to implementation as a VLSI chip.
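The backward-substitution stage assigned to the linear array is standard; a minimal sequential sketch follows (the systolic scheduling is not shown, and `back_substitute` is an illustrative name). On the paper's linear array, each processor would own one row and pass partial sums along the chain.

```python
import numpy as np

def back_substitute(U, b):
    """Solve U x = b for upper-triangular U by backward substitution."""
    n = len(b)
    x = np.zeros(n)
    for i in range(n - 1, -1, -1):
        # subtract contributions of already-solved unknowns, then scale
        x[i] = (b[i] - U[i, i + 1:] @ x[i + 1:]) / U[i, i]
    return x
```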
Parallel algorithms for the triangularization of large, sparse, and unsymmetric matrices are presented. The method combines parallel reduction with a new parallel pivoting technique, control over the generation of fill-in, and checks for numerical stability, all done in parallel with the work distributed over the active processes. The parallel pivoting technique uses the compatibility relation between pivots to identify parallel pivot candidates, and uses the Markowitz numbers of pivots to minimize fill-in. This technique is not a preordering of the sparse matrix; it is applied dynamically as the decomposition proceeds.
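Under the usual definitions, the Markowitz number of a candidate pivot (i, j) is (rᵢ − 1)(cⱼ − 1), where rᵢ and cⱼ count nonzeros in its row and column, and two pivots are compatible when, at minimum, they share no row or column. The sketch below uses that simplified compatibility test; the paper's full compatibility relation and stability check are not reproduced, and the function names are illustrative.

```python
def markowitz(nnz_row, nnz_col, i, j):
    """Markowitz number of candidate pivot (i, j): an upper bound on
    the fill-in its elimination step can create."""
    return (nnz_row[i] - 1) * (nnz_col[j] - 1)

def compatible_pivot_set(candidates, nnz_row, nnz_col):
    """Greedily pick pivots sharing no row or column, so they can be
    eliminated concurrently; prefer small Markowitz numbers."""
    chosen, used_rows, used_cols = [], set(), set()
    for i, j in sorted(candidates,
                       key=lambda ij: markowitz(nnz_row, nnz_col, *ij)):
        if i not in used_rows and j not in used_cols:
            chosen.append((i, j))
            used_rows.add(i)
            used_cols.add(j)
    return chosen
```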
This work analyses two techniques for auto-tuning linear algebra routines for hybrid combinations of multicore CPUs and manycore coprocessors (single or multiple GPUs and MICs). The first technique is based on basic models of the execution time of the routines, whereas the second manages only empirical information obtained during the installation of the routines. The final goal in both cases is to obtain a balanced assignment of the work to the computing components in the system. The study is carried out with a basic kernel (matrix-matrix multiplication) and a higher-level routine (LU factorization) which uses the auto-tuned basic routine. Satisfactory results are obtained, with experimental execution times close to the lowest experimentally achievable. (C) 2015 Elsevier B.V. All rights reserved.
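In both techniques the balancing goal reduces to splitting the work in proportion to each device's effective speed, so that all components finish at roughly the same time. A minimal sketch of that rule; the function name and the 1:4:4 benchmark figures are illustrative, not from the paper.

```python
def balanced_split(n_rows, speeds):
    """Assign matrix rows to devices in proportion to measured speed
    (rows per second), so all devices finish at about the same time."""
    total = sum(speeds)
    shares = [round(n_rows * s / total) for s in speeds]
    shares[-1] = n_rows - sum(shares[:-1])   # absorb rounding error
    return shares

# e.g. 10000 rows over a CPU and two GPUs benchmarked at 1:4:4
print(balanced_split(10000, [1.0, 4.0, 4.0]))  # -> [1111, 4444, 4445]
```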
The sparse matrix-vector product is an important computational kernel that runs inefficiently on many computers with superscalar RISC processors. In this paper we analyse the performance of the sparse matrix-vector product with symmetric matrices originating from the FEM and describe techniques that lead to a fast implementation. It is shown how these optimisations can be incorporated into an efficient parallel implementation using message passing. We conduct numerical experiments on many different machines and show that our optimisations speed up the sparse matrix-vector multiplication substantially. (C) 2001 Elsevier Science B.V. All rights reserved.
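A standard optimisation for symmetric FEM matrices is to store only one triangle and let each stored off-diagonal entry contribute twice, halving memory traffic. The CSR sketch below shows that idea; it is not necessarily the paper's exact scheme, and `sym_spmv` is an illustrative name.

```python
import numpy as np

def sym_spmv(n, indptr, indices, data, x):
    """y = A x for symmetric A stored as its upper triangle in CSR."""
    y = np.zeros(n)
    for i in range(n):
        for k in range(indptr[i], indptr[i + 1]):
            j = indices[k]
            y[i] += data[k] * x[j]
            if j != i:                 # mirror the off-diagonal entry
                y[j] += data[k] * x[i]
    return y
```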
A new approach to the parallel computation of the singular value decomposition (SVD) of a matrix A ∈ C^{m×n} is proposed. Contrary to the known algorithms that use a static cyclic ordering of the subproblems simultaneously solved in one iteration step, the proposed implementation of the two-sided block-Jacobi method uses a dynamic ordering of subproblems. The dynamic ordering takes into account the actual status of matrix A. In each iteration step, a set of off-diagonal blocks is determined that reduces the Frobenius norm of the off-diagonal elements of A as much as possible and, at the same time, can be annihilated concurrently. This task is equivalent to the maximum-weight perfect matching problem, and a greedy algorithm for its efficient solution is presented. Computational experiments with both types of ordering, incorporated into the two-sided block-Jacobi method, were performed on an SGI-Cray Origin 2000 parallel computer using the Message Passing Interface (MPI). The results confirm that the dynamic ordering is much more efficient, with regard to the amount of work required to compute the SVD to a given accuracy, than the static cyclic ordering. (C) 2002 Elsevier Science B.V. All rights reserved.
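The greedy step can be sketched directly: order the off-diagonal block pairs by the Frobenius norm they would remove and accept each pair whose block indices are still free, so the chosen subproblems can be annihilated concurrently. The weights and names below are illustrative; the paper's exact tie-breaking is not reproduced.

```python
def greedy_block_pairing(weights):
    """Greedy approximation to maximum-weight perfect matching.

    weights: dict mapping block-index pairs (i, j), i < j, to the
    Frobenius norm of the corresponding off-diagonal blocks.
    """
    matched, pairs = set(), []
    for (i, j), w in sorted(weights.items(), key=lambda kv: -kv[1]):
        if i not in matched and j not in matched:
            pairs.append((i, j))
            matched.update((i, j))
    return pairs

# e.g. 4 block rows/columns
w = {(0, 1): 5.0, (0, 2): 1.0, (0, 3): 2.0,
     (1, 2): 4.0, (1, 3): 1.5, (2, 3): 3.0}
print(greedy_block_pairing(w))  # -> [(0, 1), (2, 3)]
```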
In this note, we describe two algorithms for factoring symmetric positive definite band matrices. Both algorithms are designed for a p × p array of processors with local memory, with p ≤ O(m), where m is the semibandwidth of the matrix. The p × p model is appropriate for the Ametek Series 2010, which is a mesh, and for the popular hypercube architectures, which contain the mesh. Furthermore, the p × p model is studied because it is the minimal connectivity required to implement these algorithms. The first algorithm is obtained by reflecting, or folding, a systolic array of Brent and Luk. An alternative approach, a torus wrap of the band matrix, yields an efficient method that allows the matrix to stay in place during the course of the algorithm. Both algorithms achieve nearly optimal efficiency for matrices whose bandwidth is large relative to the size of the processor array. In fact, their efficiency is asymptotically 1, in contrast to the 1/3 efficiency of the algorithm that can be derived directly from that of Brent and Luk, or as presented by Chan et al. We conclude with a medium-grain adaptation of these methods, which maintains their near optimality.
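The torus wrap named above is the standard cyclic distribution: entry (i, j) lives on processor (i mod p, j mod p), so every processor holds an evenly scattered piece of the band and the data can stay in place as the factorization proceeds. A minimal sketch, with an illustrative function name:

```python
def torus_wrap_owner(i, j, p):
    """Owner of matrix entry (i, j) under a torus (cyclic) wrap onto a
    p x p processor mesh."""
    return (i % p, j % p)

# entries of a 6x6 matrix on a 2x2 mesh
for i in range(6):
    print([torus_wrap_owner(i, j, 2) for j in range(6)])
```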
The LU factorization is an important numerical algorithm for solving systems of linear equations in science and engineering, and is characteristic of many dense linear algebra computations. For example, it has become the de facto numerical algorithm implemented within the LINPACK benchmark to rank the most powerful supercomputers in the world, collected by the TOP500 website. Multicore processors continue to present challenges to the development of fast and robust numerical software due to the increasing levels of hardware parallelism and the widening gap between core and memory speeds. In this context, the difficulty in developing new algorithms for the scientific community resides in the combination of two goals: achieving high performance while maintaining the accuracy of the numerical algorithm. This paper proposes a new approach for computing the LU factorization in parallel on multicore architectures, which not only improves the overall performance but also sustains the numerical quality of the standard LU factorization algorithm with partial pivoting. While the update of the trailing submatrix is computationally intensive and highly parallel, the inherently problematic portion of the LU factorization is the panel factorization, due to its memory-bound characteristic as well as the atomicity of selecting the appropriate pivots. Our approach uses a parallel fine-grained recursive formulation of the panel factorization step and implements the update of the trailing submatrix with the tile algorithm. Based on conflict-free partitioning of the data and lockless synchronization mechanisms, our implementation lets the overall computation flow naturally without contention. The dynamic runtime system called QUARK is then able to schedule tasks with heterogeneous granularities and to transparently introduce algorithmic lookahead. The performance results of our implementation are competitive compared to the currently available software packages and libraries.
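The panel/update split described above corresponds to the textbook right-looking blocked LU with partial pivoting; the sketch below shows that structure sequentially, without the paper's recursive panel, tile layout, or QUARK scheduling (`blocked_lu` is an illustrative name).

```python
import numpy as np

def blocked_lu(A, nb):
    """Right-looking blocked LU with partial pivoting, in place.

    Each outer step does a memory-bound panel factorization with pivot
    selection, then the highly parallel trailing-submatrix update."""
    A = A.astype(float)
    n = A.shape[0]
    perm = np.arange(n)
    for k in range(0, n, nb):
        e = min(k + nb, n)
        for j in range(k, e):                       # -- panel factorization
            pvt = j + int(np.argmax(np.abs(A[j:, j])))
            if pvt != j:                            # swap full rows
                A[[j, pvt]] = A[[pvt, j]]
                perm[[j, pvt]] = perm[[pvt, j]]
            A[j + 1:, j] /= A[j, j]
            A[j + 1:, j + 1:e] -= np.outer(A[j + 1:, j], A[j, j + 1:e])
        if e < n:
            for j in range(k + 1, e):               # U12 = L11^{-1} A12
                A[j, e:] -= A[j, k:j] @ A[k:j, e:]
            A[e:, e:] -= A[e:, k:e] @ A[k:e, e:]    # -- trailing update
    return A, perm
```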