An efficient parallel algorithm for forward dynamics computation of human figures is proposed. The algorithm is capable of handling any kinematic chains including structure-varying ones. The asymptotic complexity of the algorithm is O(N) in serial computation and O(log N) in parallel computation on O(N) processors for most practical kinematic chains. The idea is to assemble a kinematic chain by adding the joints one by one and compute the constraint forces at the new joints using the principle of virtual work. The parallelism of the algorithm can be adapted for parallel processing systems with any number of processors by simply changing the assembly order. Simulation examples on an 8-node cluster demonstrate the effectiveness of the algorithm.
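The effect of assembly order on parallel depth can be illustrated with a small counting sketch. It is not the authors' algorithm: it assumes a simple serial chain and only counts how many dependent steps each order needs. The function names are hypothetical.

```python
def sequential_assembly_steps(num_links):
    """Steps needed when joints are added one after another (assumed serial chain)."""
    return max(num_links - 1, 0)


def pairwise_assembly_rounds(num_links):
    """Rounds needed when adjacent sub-chains are joined in pairs each round.

    All joins within one round are independent of each other, so with enough
    processors each round costs one parallel step.
    """
    groups = num_links
    rounds = 0
    while groups > 1:
        groups = (groups + 1) // 2  # an odd leftover group waits for the next round
        rounds += 1
    return rounds


if __name__ == "__main__":
    for n in (2, 8, 100):
        print(n, sequential_assembly_steps(n), pairwise_assembly_rounds(n))
```

Under these assumptions the sequential order grows linearly with the number of links, while the pairwise order grows logarithmically, which matches the O(N) and O(log N) figures quoted in the abstract.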
This paper develops a new parallel algorithm for computing the inverse of a banded matrix when extended in its maximum entropy sense. The algorithm developed here computes the inverse in two parallel steps. The first parallel step uses a modified Schur's complement technique to compute the individual inverses in each of the block matrices in parallel. The second parallel step then adds the overlapped sub-blocks inside the band. The parallel time complexity of our algorithm is O(bw^3), requiring n/((bw-1)/2)-1 processors, where the matrix is of size n×n with a bandwidth of bw. The parallel time required is independent of the size of the matrix and depends only on the bandwidth of the matrix if n/((bw-1)/2)-1 processors are employed. We also provide a multithreaded implementation of the algorithm for use on SMP machines so that the algorithm can be used without requiring n/((bw-1)/2)-1 processors. Even in the serial implementation, the algorithm developed here is considerably better than existing serial algorithms for computing the banded inverse in the maximum entropy sense.
ISBN: 9780769515243 (print)
One of the outstanding challenges of computational science and engineering is large-scale nonlinear parameter estimation of systems governed by partial differential equations. These are known as inverse problems, in contradistinction to the forward problems that usually characterize large-scale simulation. Inverse problems are significantly more difficult to solve than forward problems, due to ill-posedness, large dense ill-conditioned operators, multiple minima, space-time coupling, and the need to solve the forward problem repeatedly. We present a parallel algorithm for inverse problems governed by time-dependent PDEs, and scalability results for an inverse wave propagation problem of determining the material field of an acoustic medium. The difficulties mentioned above are addressed through a combination of total variation regularization, preconditioned matrix-free Gauss-Newton-Krylov iteration, algorithmic checkpointing, and multiscale continuation. We are able to solve a synthetic inverse wave propagation problem through a pelvic bone geometry involving 2.1 million inversion parameters in 3 hours on 256 processors of the Terascale Computing System at the Pittsburgh Supercomputing Center.
Output-queued switches are appealing because they have better latency and throughput than input-queued switches. However, they are difficult to build: a direct implementation of an N×N output-queued switch requires the switching fabric and the packet memories at the outputs to run at N times the line rate. Attempts have been made to implement output queuing with slow components, e.g., by having memories at both inputs and outputs running at twice the line rate. In these approaches, even though the packet memory speed is reduced, the scheduler time complexity is high: at least Ω(N). We show that idealized output queuing can be simulated in a shared memory architecture with (3N-2) packet memories running at the line rate, using a scheduling algorithm whose time complexity is O(log^2 N) on a parallel random access machine (PRAM). The number of processing elements and memory cells used by the PRAM is a small multiple of the size of the idealized switch.
ISBN: 0769517609 (print)
This paper describes a parallel divide-and-conquer algorithm for Delaunay triangulation. The algorithm finds the affected zone, which covers the triangles that may be modified during the merge of two sub-block triangulations. With the aid of the affected zone, communication between processors is reduced, the time complexity of divide-and-conquer remains O(n log n), and the affected zone can be found in O(n) time steps, where n is the number of points. The code was implemented in C, FORTRAN and MPI, so it is easy to port the program to other machines. Experimental results on an IBM SP2 show that a parallel efficiency of 34%-96% for general distributions can be achieved on a 16-node distributed memory system.
In this paper, we present a parallel algorithm for Gaussian elimination in both a shared memory environment using OpenMP and a distributed memory environment using MPI. Parallel LU and Gaussian algorithms for linear systems are studied extensively, and the results of examining various load balancing schemes on both platforms are presented. The results show an improvement in many cases over the default implementation.
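Why load balancing matters for Gaussian elimination can be seen in a small serial sketch. The abstract does not say which schemes were compared, so the block and cyclic row distributions below are assumed, textbook-style examples, and all function names are hypothetical.

```python
def gaussian_solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting (serial sketch)."""
    n = len(a)
    # Work on an augmented copy so the inputs are left untouched.
    m = [[float(v) for v in a[i]] + [float(b[i])] for i in range(n)]
    for k in range(n):
        pivot = max(range(k, n), key=lambda r: abs(m[r][k]))
        if m[pivot][k] == 0.0:
            raise ValueError("matrix is singular")
        m[k], m[pivot] = m[pivot], m[k]
        # The updates of rows k+1..n-1 are independent of each other; this is
        # the loop a shared- or distributed-memory version would split up.
        for i in range(k + 1, n):
            factor = m[i][k] / m[k][k]
            for j in range(k, n + 1):
                m[i][j] -= factor * m[k][j]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def active_rows_per_worker(n, k, num_workers, cyclic=True):
    """Count rows still being updated at step k for each worker (assumed layouts)."""
    counts = [0] * num_workers
    block = -(-n // num_workers)  # ceiling division: rows per worker in block layout
    for row in range(k + 1, n):
        worker = row % num_workers if cyclic else row // block
        counts[worker] += 1
    return counts
```

With 8 rows, 2 workers and elimination step k = 3, the block layout leaves one worker idle while the cyclic layout splits the remaining rows evenly, which is the kind of imbalance a load balancing scheme aims to remove.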
ISBN: 0769514774 (print)
Summary form only given. We show a parallel algorithm using a rectangle greedy matching technique which requires a linear number of processors and O(log(M)log(n)) time on the PRAM EREW model. The algorithm is suitable for practical parallel architectures such as a mesh of trees, a pyramid or a multigrid. We implement a sequential procedure which simulates the compression performed by the parallel algorithm, and it achieves 95 to 97 percent of the compression of a previous sequential heuristic. To achieve logarithmic time we partition an m×n image, I, into x×y rectangular areas where x and y are Θ(log^{1/2} mn). In parallel for each area, one processor applies the sequential parsing algorithm, so that, in logarithmic time, each area is parsed into rectangles, some of which are monochromatic. Before encoding, we compute larger monochromatic rectangles by merging the ones adjacent on the horizontal boundaries and then on the vertical boundaries, doubling in this way the length and width of each area at each step.
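The partitioning step can be sketched as follows. This is a hedged illustration only: the abstract fixes the tile side up to a constant factor, so the base-2 logarithm, the rounding, and the function names are all assumptions.

```python
import math


def tile_side(m, n):
    """Tile side of order sqrt(log(m*n)); base 2 and rounding are assumptions."""
    return max(1, round(math.sqrt(math.log2(m * n))))


def partition(m, n):
    """Return (row, col, height, width) tiles covering an m x n image."""
    s = tile_side(m, n)
    return [
        (r, c, min(s, m - r), min(s, n - c))  # edge tiles may be smaller
        for r in range(0, m, s)
        for c in range(0, n, s)
    ]
```

Each tile then holds on the order of log(mn) pixels, which is why a single processor can parse its tile sequentially within the logarithmic time bound.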
ISBN: 0769515126 (print)
The paper concerns parallel computing and its application to computing the full set of Lyapunov exponents of general nonlinear parameter-dependent continuous ordinary differential equations. Based on a standard serial algorithm developed by Wolf et al. (1985), we present a parallel algorithm using the block-cyclic decomposition method, and then apply it to compute the Lyapunov exponents of a continuous differential equation. Performance tests on the DAWNING-2000 II supercomputer show that the parallel algorithm has a high degree of parallelism, requires no message passing (little communication cost), and performs little I/O. In addition, the algorithm can be extended to ordinary differential equations of any higher dimension.
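A block-cyclic decomposition assigns consecutive blocks of work items to processes in round-robin order. The sketch below shows only the index mapping. What is being distributed (for example, sampled parameter values) and the function names are assumptions, since the abstract does not give these details.

```python
def block_cyclic_owner(index, block_size, num_procs):
    """Process that owns work item `index` under a block-cyclic layout."""
    return (index // block_size) % num_procs


def items_for_process(rank, num_items, block_size, num_procs):
    """All work items assigned to process `rank`."""
    return [
        i for i in range(num_items)
        if block_cyclic_owner(i, block_size, num_procs) == rank
    ]
```

Because every process can compute its own item list from its rank alone, no messages are needed to distribute the work, which is consistent with the low communication cost reported in the abstract.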
This paper presents several strategies for parallel implementations of the greedy randomized adaptive search procedure (GRASP) and the variable neighborhood search (VNS) applied to a combinatorial optimization problem known as the traveling purchaser problem (TPP). Parallel algorithms based on master-worker, completely distributed, and independent models, using static and dynamic load balancing, are proposed. The performance of these parallel algorithms is analyzed by comparing them with one another and with their sequential versions.