ISBN: 9781581133462 (print)
Current trends in high performance computing suggest that users will soon have widespread access to clusters of multiprocessors with hundreds, if not thousands, of processors. This unprecedented degree of parallelism will undoubtedly expose scalability limitations in existing applications, where scalability is the ability of a parallel algorithm on a parallel architecture to effectively utilize an increasing number of processors. Users will need precise and automated techniques for detecting the cause of limited scalability. This paper addresses this dilemma. First, we argue that users face numerous challenges in understanding application scalability: managing substantial amounts of experiment data, extracting useful trends from this data, and reconciling performance information with their application's design. Second, we propose a solution to automate this data analysis problem by applying fundamental statistical techniques to scalability experiment data. Finally, we evaluate our operational prototype on several applications, and show that statistical techniques offer an effective strategy for assessing application scalability. In particular, we find that non-parametric correlation of the number of tasks to the ratio of the time for communication operations to overall communication time provides a reliable measure for identifying communication operations that scale poorly.
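The measure named in the last sentence can be sketched as follows. This is a minimal illustration, not the authors' prototype: it assumes Spearman's rank correlation as the non-parametric statistic, and the function names, the `threshold` value, and the timing data are all hypothetical.

```python
import math

def ranks(values):
    """Return 1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)

def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank vectors."""
    return pearson(ranks(x), ranks(y))

def poorly_scaling_ops(tasks, op_times, threshold=0.9):
    """Flag operations whose share of total communication time rises with task count.

    tasks:    number of tasks in each experiment
    op_times: operation name -> time spent in that operation per experiment
    threshold is an assumed cut-off, not a value taken from the paper.
    """
    totals = [sum(times[i] for times in op_times.values())
              for i in range(len(tasks))]
    flagged = {}
    for op, times in op_times.items():
        ratios = [t / total for t, total in zip(times, totals)]
        rho = spearman(tasks, ratios)
        if rho >= threshold:
            flagged[op] = rho
    return flagged

# Hypothetical measurements from five runs with increasing task counts.
tasks = [2, 4, 8, 16, 32]
op_times = {
    "allreduce": [1.0, 2.0, 4.5, 10.0, 22.0],
    "send_recv": [4.0, 4.1, 4.0, 4.2, 4.1],
}
flagged = poorly_scaling_ops(tasks, op_times)
```

Because the statistic works on ranks, it only asks whether an operation's share of communication time grows monotonically with the task count, without assuming any particular growth curve.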
ISBN: 0780367154 (print)
In this paper, we present a parallel algorithm for lossless compression. The algorithm is based on the 3-D Wavelet Transform (3-D WT). The system under study consists of a parallel implementation of wavelet compression software running on a cluster of eight nodes linked by a high-performance local area network, Myrinet. The parallel implementation is based on the standard Message Passing Interface (MPI). We have used three implementations of the MPI standard: MPI-BIP, MPICH, and LAM. Experimental results are reported based on these three implementations. We provide performance results of this parallel system for the compression of video sequences. Some efficiency problems in the TCP implementation are reported and resolved for this system.
ISBN: 0769509886 (print)
It is clear that writing software for parallel architectures is a non-trivial process. This has encouraged much research in an effort to provide tools to assist parallel software development. However, while these tools may cater for architecture-specific problems, they do little for the concept of parallel software engineering, as the end product is usually neither scalable nor portable. The introduction of a level of abstraction in the expression of parallel algorithms can elevate the reasoning process above architectural constraints and assist the production of more flexible code. This paper outlines an object-oriented parallel algorithm development paradigm based on a Task and Channel notation, and examines the utilisation of Java™ technologies in the development of a distributed Java™ Virtual Machine architecture on which algorithms expressed in this notation may be executed dynamically.
ISBN: 0769512607 (print)
We propose a parallel algorithm for stabilizing large discrete-time linear control systems on a Beowulf cluster. Our algorithm first separates the Schur stable part of the linear control system using an inverse-free iteration for the matrix disc function, and then computes a stabilizing feedback matrix for the unstable part. This stage requires the numerical solution of a Stein equation. This linear matrix equation is solved using the sign function method after applying a Cayley transformation to the original equation. The experimental results on a cluster composed of Intel PII processors and a Myrinet interconnection network show the parallelism and scalability of our approach.
ISBN: 0769512607 (print)
This article describes the displacement decomposition and its benefits for the parallelization of the preconditioned conjugate gradient method for finite element elasticity problems. It deals with both fixed and variable preconditioning based on this decomposition. The numerical efficiency of the parallel algorithms is demonstrated on an academic benchmark and a real-life modelling problem.
We solve an optimal control problem for controlled parabolic Ito equations by a stochastic quasigradient method. Because the numerical solution of such problems requires large amounts of computation time, we investigate the parallelization of the algorithm. We distribute the computations of space stages over several processor nodes of a parallel computer. We obtain an efficient algorithm with low communication cost by using a ring topology.
This paper proposes a parallel algorithm for computing an N (= k^n) point Lagrange interpolation on k-ary n-cube networks. The algorithm consists of three phases: initialisation, main and final. There is no computation in the initialisation phase. The main phase is composed of N/2 steps, each consisting of four multiplications and four subtractions, and an additional step including one division and one multiplication. Communication in the main phase is based on an all-to-all broadcast algorithm on a Hamiltonian ring embedded in a k-ary n-cube. The final phase is carried out in n × ⌊k/l⌋ steps, each requiring one addition. A performance evaluation of the proposed algorithm reveals a near-optimum speedup for a typical range of system parameters used in current state-of-the-art implementations. Our study also reveals that when implementation cost is taken into account, low-dimensional k-ary n-cubes achieve better speedup than their higher-dimensional counterparts.
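For reference, the quantity being computed is the value at a query point of the unique polynomial through N given points. The sketch below is a plain sequential version, not the paper's k-ary n-cube algorithm; the function name and sample data are hypothetical. The subtractions and multiplications in the inner loop correspond to the kind of work the abstract says is spread across the steps of the main phase, and the outer sum to the additions of the final phase.

```python
def lagrange_interpolate(xs, ys, x):
    """Evaluate at x the Lagrange polynomial through the points (xs[i], ys[i]).

    Sequential O(N^2) reference; the xs values must be pairwise distinct.
    """
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        # Build y_i * L_i(x), where L_i is the i-th Lagrange basis polynomial.
        term = float(yi)
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term  # accumulate the N partial terms
    return total

# Hypothetical sample: three points taken from f(x) = x^2 + x + 1.
xs = [0, 1, 2]
ys = [1, 3, 7]
value = lagrange_interpolate(xs, ys, 3)
```

Each of the N terms depends on all N sample points, which suggests why an all-to-all exchange appears in a parallel formulation.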
By using cardinality and relevance information about a set of attributes and concept hierarchies, a top-down incremental data partitioning method is proposed for quantitative rule derivation from database in paralleli...
P-complete problems seem to have no parallel algorithm which runs in polylogarithmic time using a polynomial number of processors. A P-complete problem is in class EP (Efficient and Polynomially fast) if and only if t...
This article analyses and compares the techniques of algorithmic blocking and storage blocking with lookahead for distributed memory LU, LL^T, and QR factorizations. Concepts and some useful properties of a simplified model of lookahead are explored. Issues in the implementation of lookahead are discussed, which are more involved for the case of LL^T and QR factorizations. The article also explains how hybrid algorithmic blocking and lookahead techniques can be implemented. Results, given on the Fujitsu AP1000 and AP+ multicomputers, indicate that both methods are superior to storage blocking, and that the hybrid method is optimal for smaller matrices, due to savings in communication startups. For larger matrices, algorithmic blocking gave the best performance (except for LL^T on the AP+), due to its better load-balancing properties. Performance models, predicting the minimum matrix size where lookahead becomes effective, indicate that this trend can be expected for machines with lower communication-to-computation speeds, but that the range where lookahead is superior is extended.