Blockwise access to data is a central theme in the design of efficient external memory (EM) algorithms. A second important issue, when more than one disk is present, is fully parallel disk I/O. In this paper we present a simple, deterministic simulation technique that transforms certain Bulk Synchronous Parallel (BSP) algorithms into efficient parallel EM algorithms. It optimizes blockwise data access and parallel disk I/O and, at the same time, utilizes multiple processors connected via a communication network or shared memory. We obtain new, improved parallel EM algorithms for a large number of problems, including sorting, permutation, matrix transpose, several geometric and GIS problems including three-dimensional convex hulls (two-dimensional Voronoi diagrams), and various graph problems. We show that certain parallel algorithms known for the BSP model can be used to obtain EM algorithms that meet well-known I/O complexity lower bounds for various problems, including sorting.
ISBN: 0769524451 (print)
In recent years, reconfigurable technology has emerged as a popular choice for implementing various types of cryptographic functions. Nevertheless, insufficient effort has been devoted to fully exploiting the tremendous amount of parallelism intrinsic to FPGAs for this class of algorithms. In this paper, we focus on block cipher architectures and explore design decisions that leverage the multi-grained parallelism inherent in many of these algorithms. We demonstrate the usefulness of this approach with a highly parallel FPGA implementation of the AES standard, and present results detailing the area/delay tradeoffs resulting from our design decisions.
ISBN: 9783540320715 (digital)
ISBN: 3540292357 (print)
Large-scale distributed computing infrastructures involve a large number of nodes, poor communication performance, and continuously varying resources that are not always available. In this paper, we focus on the different tools available for mining traces of the activities of such architectures. We propose new techniques for fast management of a parallel frequent itemset mining algorithm. The techniques allow us to exhibit statistical results about the activity of more than one hundred PCs connected to the web.
ISBN: 0780391950 (print)
This paper focuses on a parallel version of the particle swarm optimization (PSO) algorithm that can significantly reduce execution time for solving complex large-scale optimization problems. The paper gives an overview of the PSO algorithm and then proposes a design and an implementation of parallel PSO. The proposed algorithm eliminates redundant synchronizations and optimizes message transfer to overlap communication with computation. The experimental results show that the proposed parallel PSO algorithm obtained a speedup of 13.2 times with 14 processors.
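For orientation, the sequential PSO loop that such a parallel design starts from can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name `pso`, the inertia and acceleration values (`w`, `c1`, `c2`), the search bounds, and the fixed seed are all assumptions chosen for the example. The per-particle objective evaluation is the step that parallel variants typically distribute across processors.

```python
import random


def pso(f, dim, n_particles=20, iters=200, lo=-5.0, hi=5.0,
        w=0.7, c1=1.5, c2=1.5, seed=0):
    """Minimize f over R^dim with a basic global-best PSO (illustrative only)."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    pos = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]

    # Personal bests start at the initial positions.
    pbest = [p[:] for p in pos]
    pbest_val = [f(p) for p in pos]

    # Global best is the best personal best seen so far.
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]

    for _ in range(iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Velocity update: inertia + pull to personal best + pull to global best.
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            # Objective evaluation: the part a parallel version would distribute.
            val = f(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val
```

Calling `pso(lambda x: sum(v * v for v in x), dim=2)` returns the best position found and its objective value for a simple sphere function. Updating the global best inside the particle loop, as done here, is exactly the kind of shared state that forces synchronization once particles are spread over processors.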
A parallel algorithm for block-tridiagonal linear systems already exists [1]. We present a different parallel algorithm for such systems. As in Ref. 1, we give a convergence proof for the case where the coefficient matrix is an M-matrix; unlike Ref. 1, we also give a convergence proof for the case where the coefficient matrix is positive definite. Our algorithm is a parallel two-stage iterative method for solving large block-tridiagonal linear systems Ax = b on distributed-memory multicomputers. We give convergence proofs for the cases where the PEk method or the GS method is applied to the inner iteration and the coefficient matrix A is a symmetric positive definite matrix or an M-matrix, respectively. Finally, we give a numerical example with tabulated results for three cases: (1) our algorithm with PE inner iteration (PE is one kind of PEk); (2) our algorithm with GS inner iteration; (3) the multi-splitting algorithm of Ref. 1. Preliminary numerical results indicate that our algorithm needs less time and achieves higher efficiency than the algorithm of Ref. 1.
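The general shape of a two-stage iteration can be sketched as below. This is an assumption-laden illustration rather than the authors' algorithm: it assumes GS denotes Gauss-Seidel, uses a block-Jacobi outer splitting A = M - N (M holds the diagonal blocks), and runs a fixed number of Gauss-Seidel sweeps as the inner iteration on M y = N x + b. The name `two_stage_solve` and its parameters are hypothetical. Because M is block diagonal, the inner sweeps for different blocks are independent, which is where a distributed-memory version could assign one block per processor.

```python
def two_stage_solve(A, b, block_size, outer_iters=100, inner_sweeps=2):
    """Two-stage iteration: block-Jacobi outer splitting, Gauss-Seidel inner sweeps."""
    n = len(b)
    x = [0.0] * n
    for _ in range(outer_iters):
        # Outer stage: rhs = b + N x, where N = M - A contains the
        # (negated) entries of A outside the diagonal blocks.
        rhs = []
        for i in range(n):
            blk = i // block_size
            off_block = sum(A[i][j] * x[j]
                            for j in range(n) if j // block_size != blk)
            rhs.append(b[i] - off_block)

        # Inner stage: approximately solve M y = rhs with a few
        # Gauss-Seidel sweeps, starting from the current iterate.
        y = x[:]
        for _ in range(inner_sweeps):
            for i in range(n):
                start = (i // block_size) * block_size
                end = min(start + block_size, n)
                s = sum(A[i][j] * y[j] for j in range(start, end) if j != i)
                y[i] = (rhs[i] - s) / A[i][i]
        x = y
    return x
```

For a small strictly diagonally dominant tridiagonal matrix (diagonal 4, off-diagonals -1) treated as block-tridiagonal with 2x2 blocks, the iterates approach the exact solution; the number of inner sweeps trades inner accuracy against the cost of each outer step.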
ISBN: 0780394925 (print)
Large latencies over WANs will remain an obstacle to running communication-intensive parallel applications in Grid environments. This paper takes one such application, Gaussian elimination of dense matrices, and describes a parallel algorithm that is highly tolerant of latency. The key technique is a pivoting strategy called batched pivoting, which requires much less frequent synchronization than other methods. Although it is a relaxed pivoting method that may select pivots other than the 'best' ones, we show that it achieves good numerical accuracy. In experiments with random matrices of sizes from 64 to 49,152, batched pivoting achieves numerical accuracy comparable to that of partial pivoting. We also evaluate the parallel execution speed of our implementation and show that it is much more tolerant of latency than partial pivoting.
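As a point of reference, the partial pivoting baseline that the abstract compares against can be sketched as follows. This is not batched pivoting, whose details the abstract does not give; it is the standard sequential method, and the function name `solve_partial_pivoting` is chosen here for illustration. The pivot search in every column is the step that forces a synchronization per column in a parallel setting.

```python
def solve_partial_pivoting(A, b):
    """Solve Ax = b by Gaussian elimination with partial (row) pivoting."""
    n = len(b)
    # Work on an augmented copy [A | b] so the inputs are not modified.
    M = [row[:] + [b[i]] for i, row in enumerate(A)]

    for k in range(n):
        # Pivot search: pick the row with the largest |entry| in column k.
        p = max(range(k, n), key=lambda r: abs(M[r][k]))
        if M[p][k] == 0.0:
            raise ValueError("matrix is singular")
        M[k], M[p] = M[p], M[k]

        # Eliminate column k from all rows below the pivot row.
        for r in range(k + 1, n):
            factor = M[r][k] / M[k][k]
            for c in range(k, n + 1):
                M[r][c] -= factor * M[k][c]

    # Back substitution on the resulting upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x
```

A system whose leading entry is zero, such as `A = [[0, 2, 1], [1, 1, 1], [2, 1, 0]]`, shows why some row exchange is required at all: without pivoting, the first elimination step would divide by zero.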
ISBN: 3540309357 (print)
The Parallel Disks Model (PDM) has been proposed to alleviate the I/O bottleneck that arises in the processing of massive data sets. Sorting has been extensively studied on the PDM due to the fundamental nature of the problem. Several randomized algorithms are known for sorting. Most of the prior algorithms suffer from undue complications in memory layouts or implementation, or lack a tight analysis. In this paper we present a simple randomized algorithm that sorts in optimal time with high probability and has all the desirable features for practical implementation.
Authors: Biros, G; Ghattas, O
Affiliations: NYU, Courant Inst Math Sci, Dept Comp Sci, New York, NY 10012 USA; Carnegie Mellon Univ, Dept Biomed Engn, Ultrascale Simulat Lab, Pittsburgh, PA 15213 USA; Carnegie Mellon Univ, Dept Civil & Environm Engn, Ultrascale Simulat Lab, Pittsburgh, PA 15213 USA
Large-scale optimization of systems governed by partial differential equations (PDEs) is a frontier problem in scientific computation. Reduced quasi-Newton sequential quadratic programming (SQP) methods are state-of-the-art approaches for such problems. These methods take full advantage of existing PDE solver technology and parallelize well. However, their algorithmic scalability is questionable; for certain problem classes they can be very slow to converge. In this two-part article we propose a new method for steady-state PDE-constrained optimization, based on the idea of using a full space Newton solver combined with an approximate reduced space quasi-Newton SQP preconditioner. The basic components of the method are Newton solution of the first-order optimality conditions that characterize stationarity of the Lagrangian function; Krylov solution of the Karush-Kuhn-Tucker (KKT) linear systems arising at each Newton iteration using a symmetric quasi-minimum residual method; preconditioning of the KKT system using an approximate state/decision variable decomposition that replaces the forward PDE Jacobians by their own preconditioners, and the decision space Schur complement (the reduced Hessian) by a BFGS approximation initialized by a two-step stationary method. Accordingly, we term the new method Lagrange-Newton-Krylov-Schur (LNKS). It is fully parallelizable, exploits the structure of available parallel algorithms for the PDE forward problem, and is locally quadratically convergent. In part I of this two-part article, we investigate the effectiveness of the KKT linear system solver. We test our method on two optimal control problems in which the state constraints are described by the steady-state Stokes equations. The objective is to minimize dissipation or the deviation from a given velocity field; the control variables are the boundary velocities. Numerical experiments on up to 256 Cray T3E processors and on an SGI Origin 2000 include scalability and
ISBN: 3540260323 (print)
This paper describes the parallelization of the Spatial Approximation Tree. This data structure has been shown to be an efficient index structure for solving range queries in high-dimensional metric space databases. We propose a method for balancing the work performed by the processors. The method is self-tuning and dynamically follows changes in the workload generated by user queries. Empirical results with different databases show efficient performance in practice. The algorithmic design is based on the bulk-synchronous model of parallel computing.
ISBN: 3540287000 (print)
Sophisticated parallel matrix multiplication algorithms like PDGEMM exhibit a complex structure and can be controlled by a large set of parameters, including blocking factors and the block sizes used for serial execution on each participating processor. However, selecting the parameters so that execution time is minimized requires a deep understanding of both the parallel algorithm and the execution platform. In this article, we describe a simple mechanism that automatically selects a suitable set of parameters for PDGEMM, which leads to a minimum execution time in most cases.