Parallel adaptive algorithms for the approximation of a multi-dimensional integral over a hyper-rectangular region are described. Algorithms with a centralized global region collection are compared to algorithms using local region collections. The latter algorithms should result in better scalability since global communication is avoided. Both types of algorithms are compared to quasi-Monte Carlo integration. Tests are performed using Genz's test functions and speed-up results are given.
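To make the idea of a global region collection concrete, here is a minimal sequential sketch in Python. It is not the algorithm from the paper: the midpoint rule, the bisection-based error estimate, and the names `integrate`, `evaluate`, `split` and `max_regions` are all assumptions chosen for brevity. The heap plays the role of the centralized collection that a parallel version would have to share among processors.

```python
import heapq

def midpoint(f, lo, hi):
    """Midpoint-rule estimate of the integral of f over the box [lo, hi]."""
    volume = 1.0
    for a, b in zip(lo, hi):
        volume *= (b - a)
    center = [(a + b) / 2 for a, b in zip(lo, hi)]
    return volume * f(center)

def split(lo, hi):
    """Bisect the box along its longest edge; return the two halves."""
    axis = max(range(len(lo)), key=lambda k: hi[k] - lo[k])
    mid = (lo[axis] + hi[axis]) / 2
    left_hi = list(hi)
    left_hi[axis] = mid
    right_lo = list(lo)
    right_lo[axis] = mid
    return (list(lo), left_hi), (right_lo, list(hi))

def evaluate(f, lo, hi):
    """Return (estimate, error estimate) for one region.

    The error estimate is the difference between the coarse midpoint value
    and the sum over the two halves (an assumed, deliberately crude rule).
    """
    coarse = midpoint(f, lo, hi)
    fine = sum(midpoint(f, a, b) for a, b in split(lo, hi))
    return fine, abs(fine - coarse)

def integrate(f, lo, hi, max_regions=2000):
    """Global adaptive integration: always refine the region with the largest error."""
    estimate, error = evaluate(f, lo, hi)
    counter = 0  # unique tie-breaker so the heap never compares the box lists
    heap = [(-error, counter, estimate, list(lo), list(hi))]
    while len(heap) < max_regions:
        _, _, _, r_lo, r_hi = heapq.heappop(heap)
        for sub_lo, sub_hi in split(r_lo, r_hi):
            counter += 1
            estimate, error = evaluate(f, sub_lo, sub_hi)
            heapq.heappush(heap, (-error, counter, estimate, sub_lo, sub_hi))
    return sum(item[2] for item in heap)

# Example: integrate x^2 + y^2 over the unit square (exact value 2/3).
approx = integrate(lambda p: p[0] ** 2 + p[1] ** 2, [0.0, 0.0], [1.0, 1.0])
```

In a variant with local region collections, each worker would keep its own heap over a subset of the regions, which avoids contention on the shared heap at the cost of possibly refining regions that are not globally the worst.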
We derive an efficient parallel algorithm to find all occurrences of a pattern string in a subject string in O(log n) time, where n is the length of the subject string. The number of processors employed is of the order of the product of the two string lengths. The theory of powerlists [J. Kornerup, PhD Thesis, 1997; J. Misra, ACM Trans. Programming Languages Systems 16 (16) (1994) 1737-1740] is central to the development of the algorithm and its algebraic manipulations. (C) 2002 Elsevier Science B.V. All rights reserved.
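The processor count and the logarithmic time bound can be illustrated with a simple data-parallel scheme, simulated sequentially below. This is only a sketch under assumptions: it does not use powerlists and is not the derivation from the paper, and the name `find_occurrences` is hypothetical. One conceptual processor is assigned to each (shift, pattern position) pair, and the per-shift results are combined by a tree reduction with logical AND.

```python
def find_occurrences(subject, pattern):
    """Return all start indices at which pattern occurs in subject."""
    n, m = len(subject), len(pattern)
    if m == 0 or m > n:
        return []  # convention chosen here: an empty pattern yields no matches
    # Step 1: one comparison per (shift i, offset j) pair; all are independent,
    # so with about n * m processors this step takes constant time.
    rows = [[subject[i + j] == pattern[j] for j in range(m)]
            for i in range(n - m + 1)]
    # Step 2: AND-reduce each row pairwise. Every round halves the row width,
    # so ceil(log2(m)) rounds suffice.
    while len(rows[0]) > 1:
        rows = [[all(row[k:k + 2]) for k in range(0, len(row), 2)]
                for row in rows]
    return [i for i, row in enumerate(rows) if row[0]]

# Example usage: overlapping matches are reported too.
positions = find_occurrences("abracadabra", "abra")
```

Each round of the `while` loop corresponds to one parallel time step, which is where the logarithmic depth comes from in this simplified picture.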
A parallel genetic algorithm for optimization is outlined, and its performance on both mathematical and biomechanical optimization problems is compared to a sequential quadratic programming algorithm, a downhill simplex algorithm and a simulated annealing algorithm. When high-dimensional non-smooth or discontinuous problems with numerous local optima are considered, only the simulated annealing and the genetic algorithm, which are both characterized by a weak search heuristic, are successful in finding the optimal region in parameter space. The key advantage of the genetic algorithm is that it can easily be parallelized at negligible overhead.
Monte Carlo computations are considered easy to parallelize. However, the results can be adversely affected by defects in the parallel pseudorandom number generator used. A parallel pseudorandom number generator must be tested for two types of correlations: (i) intra-stream correlation, as for any sequential generator, and (ii) inter-stream correlation for correlations between random number streams on different processes. Since bounds on these correlations are difficult to prove mathematically, large and thorough empirical tests are necessary. Many of the popular pseudorandom number generators in use today were tested when computational power was much lower, and hence they were evaluated with much smaller test sizes. This paper describes several tests of pseudorandom number generators, both statistical and application-based. We show defects in several popular generators. We describe the implementation of these tests in the SPRNG [ACM Trans. Math. Software 26 (2000) 436; SPRNG-scalable parallel random number generators. SPRNG 1.0-http://www.ncsa.uiuc.edu/Apps/SPRNG; SPRNG 2.0-http://sprng.cs.fsu.edu] test suite and also present results for the tests conducted on the SPRNG generators. These generators have passed some of the largest empirical random number tests. (C) 2002 Elsevier Science B.V. All rights reserved.
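As a rough illustration of what an inter-stream check looks like, the sketch below computes the sample Pearson correlation between two streams. It is an assumption-laden toy, not one of the SPRNG tests: it uses Python's standard `random.Random` rather than an SPRNG generator, it creates "parallel" streams by naive seeding with different integers, and the helper names `stream`, `pearson` and `inter_stream_correlation` are hypothetical.

```python
import random

def stream(seed, count):
    """One pseudorandom stream, as a process with its own seed might generate it."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(count)]

def pearson(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def inter_stream_correlation(seed_a, seed_b, count=10000):
    """Correlate element k of one stream with element k of the other."""
    return pearson(stream(seed_a, count), stream(seed_b, count))

# Distinct seeds should give a value near 0; identical seeds give 1.
r_distinct = inter_stream_correlation(1, 2)
r_same = inter_stream_correlation(7, 7)
```

A single linear correlation like this can only catch gross defects. The abstract's point is that much larger and more varied tests are needed before a generator can be trusted.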
In awari, a two-person game of pure skill, players sow stones into pits on a board. The game's rules define how to capture stones, and the player who captures the most wins the game. For more than a decade, researchers have studied computerized techniques to play awari. The authors have now solved the game by determining the score of 889,063,398,406 board positions and storing them in databases. They performed the necessary computations on a 144-processor parallel computer with 72 gigabytes of main memory and a fast Myrinet interconnect.
When solving time-dependent partial differential equations on parallel computers using the nonoverlapping domain decomposition method, one often needs numerical boundary conditions on the boundaries between subdomains. These numerical boundary conditions can significantly affect the stability and accuracy of the final algorithm. In this paper, a stability and accuracy analysis of the existing methods for generating numerical boundary conditions will be presented, and a new approach based on explicit predictors and implicit correctors will be used to solve convection-diffusion equations on parallel computers, with application to aerospace engineering for the solution of Euler equations in computational fluid dynamics simulations. Both theoretical analyses and numerical results demonstrate significant improvement in stability and accuracy by using the new approach. (C) 2003 Elsevier Science Ltd. All rights reserved.
Given n values $x_1, x_2, \ldots, x_n$ and an associative binary operation $\circ$, the prefix problem is to compute $x_1 \circ x_2 \circ \cdots \circ x_i$ for $1 \le i \le n$. Prefix circuits are combinational circuits for solving the prefix problem. For any n-input prefix circuit D with depth d and size s, if $d + s = 2n - 2$, then D is depth-size optimal. In general, a prefix circuit with a small depth is faster than one with a large depth. For prefix circuits with the same depth, a prefix circuit with a smaller fan-out occupies less area and is faster in VLSI implementation. This paper is on constructing parallel prefix circuits that are depth-size optimal with small depth and small fan-out. We construct a depth-size optimal prefix circuit H4 with fan-out 4. It has the smallest depth among all known depth-size optimal prefix circuits with a constant fan-out; furthermore, when $n \ge 136$, its depth is less than, or equal to, those of all known depth-size optimal prefix circuits with unlimited fan-out. A size lower bound of prefix circuits is also derived. Some properties related to depth-size optimality and size optimality are introduced; they are used to prove that H4 is depth-size optimal.
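The depth-size condition can be checked on the simplest possible case. The sketch below evaluates the serial prefix circuit, in which each output feeds the next operation node, and counts its size and depth. It is not the H4 construction from the paper, and the function name `serial_prefix` is hypothetical. The serial circuit uses n - 1 operation nodes arranged in a chain of depth n - 1, so it meets depth + size = 2n - 2, although with the largest possible depth.

```python
import operator
from itertools import accumulate

def serial_prefix(values, op):
    """Evaluate the serial prefix circuit.

    Returns (outputs, size, depth), where size counts operation nodes and
    depth is the longest chain of operation nodes from an input to an output.
    """
    outputs = [values[0]]
    node_depth = [0]  # the first output is just the first input, depth 0
    size = 0
    for x in values[1:]:
        outputs.append(op(outputs[-1], x))
        node_depth.append(node_depth[-1] + 1)  # one level deeper than its predecessor
        size += 1
    return outputs, size, max(node_depth)

# Example with addition as the associative operation.
values = [1, 2, 3, 4, 5]
outputs, size, depth = serial_prefix(values, operator.add)
same_as_library = outputs == list(accumulate(values, operator.add))
is_depth_size_optimal = (depth + size == 2 * len(values) - 2)
```

Circuits with smaller depth must pay for it with extra nodes. The interest of constructions such as the one in the abstract is in lowering the depth while keeping the sum at exactly 2n - 2.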
We investigate the relation between fine-grained and coarse-grained distributed computations of a class of problems related to the generic transitive closure problem (TC for short). We choose an intricate systolic algorithm for the TC problem, by Guibas, Kung and Thompson (GKT algorithm for short), as a starting point due to its particularly close relationship to matrix multiplication. The GKT algorithm reduces the TC problem to three successive parallel matrix multiplications. We extract the main ideas of this algorithm, namely different path decompositions related to min-paths and max-paths computations, and devise a two-pass parallel algorithm, such that the second pass is purely a triangular matrix multiplication involving exactly 1/3 of the total number of elementary operations (multiplying two single elements of the matrix). This is helpful in coarse-grained parallel computations since matrix multiplication parallelizes well. A novel approach is used, and as a first result a more efficient and simpler two-pass fine-grained algorithm is designed. The second result is a non-trivial transformation of this fine-grained algorithm into a coarse-grained (and more practical) version. The full proof of correctness of the transformation, which is presented in the appendices, is quite complex and is the hardest result of the paper. Our algorithms are specially structured to directly show the correspondence between the main fine-grained and the main coarse-grained operations.
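For readers unfamiliar with the underlying problem, the following sketch computes the transitive closure of a directed graph with Warshall's classical sequential algorithm. This is a baseline only: it is neither the GKT systolic algorithm nor the two-pass algorithm described in the abstract, and the name `transitive_closure` is chosen here for illustration. It does show the triple loop over elementary operations whose structure resembles matrix multiplication.

```python
def transitive_closure(adjacency):
    """Warshall's algorithm on a boolean adjacency matrix.

    Returns a new matrix reach with reach[i][j] True exactly when there is
    a path of at least one edge from vertex i to vertex j.
    """
    n = len(adjacency)
    reach = [list(row) for row in adjacency]  # copy so the input is not modified
    for k in range(n):            # allow vertex k as an intermediate stop
        for i in range(n):
            if reach[i][k]:
                for j in range(n):
                    if reach[k][j]:
                        reach[i][j] = True  # one elementary operation
    return reach

# Example: a chain 0 -> 1 -> 2 gains the implied edge 0 -> 2.
chain = [
    [False, True, False],
    [False, False, True],
    [False, False, False],
]
closure = transitive_closure(chain)
```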
The Jacobi-Davidson (JD) algorithm was recently proposed for evaluating a number of the eigenvalues of a matrix. JD goes beyond pure Krylov-space techniques; it cleverly expands its search space by solving the so-called correction equation, thus in principle providing a more powerful method. Preconditioning the Jacobi-Davidson correction equation is mandatory when large, sparse matrices are analyzed. We considered several preconditioners: classical block-Jacobi and IC(0), together with approximate inverse (AINV or FSAI) preconditioners. The rationale for using approximate inverse preconditioners is their high parallelization potential, combined with their efficiency in accelerating the iterative solution of the correction equation. An analysis was carried out on the sequential performance of preconditioned JD for the spectral decomposition of large, sparse matrices, which originate in the numerical integration of partial differential equations arising in physical and engineering problems. It was found that JD is highly sensitive to preconditioning, and it can display an irregular convergence behavior. We parallelized JD by data-splitting techniques, combining them with techniques to reduce the amount of communication data. Our own parallel, preconditioned code was executed on a dedicated parallel machine, and we present the results of our experiments. Our JD code achieves an appreciable degree of parallelism. Its performance was also compared with those of PARPACK and parallel DACG. (C) 2003 Elsevier Science B.V. All rights reserved.
This study empirically compares two approaches to parallel 3-D OSEM that differ as to whether calculations are assigned to nodes by projection number or by transaxial plane number. For projection space decomposition (PSD), the forward projection is completely parallel, but backprojection requires a slow image synchronization. For image space decomposition (ISD), the communication associated with forward projection can be overlapped with calculation, and the communication associated with backprojection is more efficient. To compare these methods, an implementation of 3-D OSEM for three PET scanners is developed that runs on an experimental 9-node, 18-processor cluster computer. For selected benchmarks, both methods exhibit speedups in excess of eight on nine nodes, and comparable performance for the tested range of cluster sizes.