Physically based simulation of cloth in virtual environments is a computationally demanding problem. It involves modeling the internal material properties of the textile (physical modeling) and also treating interacti...
详细信息
Physically based simulation of cloth in virtual environments is a computationally demanding problem. It involves modeling the internal material properties of the textile (physical modeling) and also treating interactions with the surrounding scene (collision handling). In this paper, we present an approach to parallel cloth simulation designed for distributedmemory parallel architectures, particularly clusters built of commodity components. We discuss parallel techniques for the physical modeling phase as well as for the collision handling phase which can significantly reduce the respective computation times. To deal with the very fine granularity of the physical modeling phase we apply a static data decomposition approach based on graph partitioning. In order to cope with the high irregularity of the collision handling phase we employ taskparallel techniques based on fully dynamic problem decomposition. We show how both techniques can be integrated into a robust parallel cloth simulation method which can deal with considerably complex scenes. (c) 2007 Elsevier B.V. All rights reserved.
In this work, we propose an efficient parallel implementation of the nonsymmetric block Lanczos algorithm for the computation of few extreme eigenvalues, and corresponding eigenvectors, of real nonhermitian matrices f...
详细信息
In this work, we propose an efficient parallel implementation of the nonsymmetric block Lanczos algorithm for the computation of few extreme eigenvalues, and corresponding eigenvectors, of real nonhermitian matrices for distributedmemory multicomputers. The reorganisation of the block Lanczos algorithm implemented allows to exploit a coarse-grained parallelism and to harness the computational power of the target architectures. The computational kernels of the algorithm are matrix-matrix multiplications, with dense and sparse factors, QR factorisation and singular value decomposition. To reduce the total amount of communication involved in the matrix-matrix multiplication with a sparse factor, we substitute each matrix appearing in the algorithm with its transpose. Then, we develop an efficient parallelisation of the matrix-matrix multiplication when the second factor is sparse. Some other linear algebra operations are performed using ScaLAPACK library. The parallel eigensolver has been tested on a cluster of PCs. All reported results show the proposed algorithm is efficient on the target architectures for problems of adequate dimension.
This paper addresses the problem of scheduling parallel programs represented as directed acyclic task graphs for execution on distributedmemory parallel architectures. Because of the high communication overhead in ex...
详细信息
This paper addresses the problem of scheduling parallel programs represented as directed acyclic task graphs for execution on distributedmemory parallel architectures. Because of the high communication overhead in existing parallel machines, a crucial step in scheduling is task clustering, the process of coalescing fine grain tasks into single coarser ones so that the overall execution time is minimized. The task clustering problem is NP-hard, even when the number of processors is unbounded and task duplication is allowed. A simple greedy algorithm is presented for this problem which, for a task graph with arbitrary granularity, produces a schedule whose makespan is at most twice optimal. Indeed, the quality of the schedule improves as the granularity of the task graph becomes larger. For example, if the granularity is at least 1/2, the makespan of the schedule is at most 5/3 times optimal. For a task graph with n tasks and e inter-task communication constraints, the algorithm runs in O(n(n lg n + e)) time, which is n times faster than the currently best known algorithm for this problem. Similar algorithms are developed that produce: (1) optimal schedules for coarse grain graphs;(2) 2-optimal schedules for trees with no task duplication;and (3) optimal schedules for coarse grain trees with no task duplication.
An NP-complete problem of finding a fundamental-cycle set of a graph G with minimum total length is considered. Two parallel algorithms of O(n2/p + n log n log p) and O(m + n2/p + n log(n/p) + n log p) costs to find a...
详细信息
An NP-complete problem of finding a fundamental-cycle set of a graph G with minimum total length is considered. Two parallel algorithms of O(n2/p + n log n log p) and O(m + n2/p + n log(n/p) + n log p) costs to find a suboptimal solution to this problem are presented (p is a number of processors, n is a number of vertices, and m is a number of edges of G). The algorithms partition an edge and vertex set of G among processors, respectively, and use a new heuristic method to solve the problem. A message-based tree-connected MIMD computer is assumed as a model of parallel computations. The algorithms were implemented for a binary tree of 15 transputers, and the experiments were conducted on a wide range of random graphs. The results show that the vertex set partition algorithm with inferior theoretical cost gives better speedups and finds the fundamental-cycle sets of shorter total lengths as compared to the edge set partition algorithm.
In this paper, we present an efficient parallel algorithm to solve Toeplitz-block and block-Toeplitz systems in distributedmemory multicomputers. This algorithm parallelizes the Generalized Schur Algorithm to obtain ...
详细信息
In this paper, we present an efficient parallel algorithm to solve Toeplitz-block and block-Toeplitz systems in distributedmemory multicomputers. This algorithm parallelizes the Generalized Schur Algorithm to obtain the semi-normal equations. Our parallel implementation reduces the communication cost and optimizes the memory access. The experimental analysis on a cluster of personal computers shows the scalability of the implementation. The algorithm is portable because it is based on standard tools and libraries, such as ScaLAPACK and MPI.
In this paper, we present a parallel system called PHR for computing hierarchical radiosity solutions of complex scones. The system is targeted for multi-processor architectures with distributedmemory. The system eva...
详细信息
In this paper, we present a parallel system called PHR for computing hierarchical radiosity solutions of complex scones. The system is targeted for multi-processor architectures with distributedmemory. The system evaluates and subdivides the interactions level by level in a breadth first fashion, and the interactions are redistributed at the end of each level to keep load balanced. In order to allow interactions freely travel across processors, all the patch data is replicated on all the processors. Hence, the system favors load balancing at the expense of increased communication volume. However, the results show that the overhead of communication is negligible compared with total execution time. We obtained a speed-up of 25 for 32 processors in our test scenes.
In this paper, we propose a new parallel algorithm which exploits the symmetry of the Jacobi method for computing the eigenvalues of a real and symmetric square matrix A on a distributedmemory multiprocessor.
In this paper, we propose a new parallel algorithm which exploits the symmetry of the Jacobi method for computing the eigenvalues of a real and symmetric square matrix A on a distributedmemory multiprocessor.
In this paper we give an algorithm to broadcast a message in a wraparound mesh distributed-memory parallel architecture with parallel monodirectional links. This algorithm uses a general strategy based on the diffusio...
详细信息
In this paper we give an algorithm to broadcast a message in a wraparound mesh distributed-memory parallel architecture with parallel monodirectional links. This algorithm uses a general strategy based on the diffusion of the message in edge-disjoint spanning trees. We first present in this setting the results of Saad and Schultz and the improvements obtained by Simmen. We then give an asymptotically optimal broadcasting algorithm improving the preceding results. It uses in the wraparound mesh the constructions of two edge-disjoint spanning trees rooted at a given node and of minimum depth.
We propose a model for describing and predicting the parallel performance of a broad class of parallel numerical software on distributed memory architectures. The purpose of this model is to allow reliable predictions...
详细信息
We propose a model for describing and predicting the parallel performance of a broad class of parallel numerical software on distributed memory architectures. The purpose of this model is to allow reliable predictions to be made for the performance of the software on large numbers of processors of a given parallel system, by only benchmarking the code on small numbers of processors. Having described the methods used, and emphasized the simplicity of their implementation, the approach is tested on a range of engineering software applications that are built upon the use of multigrid algorithms. Despite their simplicity, the models are demonstrated to provide both accurate and robust predictions across a range of different parallel architectures, partitioning strategies and multigrid codes. In particular, the effectiveness of the predictive methodology is shown for a practical engineering software implementation of an elastohydrodynamic lubrication solver. (C) 2010 Civil-Comp Ltd and Elsevier Ltd. All rights reserved.
An irregular problem models the evolution of a system where several elements are irregularly distributed in a domain. The evolution modifies this distribution in a way that cannot be foreseen and the behavior of each ...
详细信息
An irregular problem models the evolution of a system where several elements are irregularly distributed in a domain. The evolution modifies this distribution in a way that cannot be foreseen and the behavior of each element depends upon the elements close to it according to a problem dependent relation. Starting from a hierarchical representation of the domain, we define a parallelization methodology that includes a load balancing strategy that preserves this locality property and a strategy to collect information distributed onto the processing nodes. (C) 2001 Elsevier Science B.V. All rights reserved.
暂无评论