Grain packing is an important problem in the development of efficient parallel programs. It is desirable that grain packing be performed automatically, so that the programmer can write parallel programs without being troubled by the details of parallel programming languages and parallel architectures, and so that the same parallel program can be executed efficiently on different machines. This paper presents a 2D Compression (2DC) grain packing method for determining optimal grain size and inherent parallelism concurrently. This ability stems mainly from 2DC's continual effort to achieve conflicting objectives. Experimental results demonstrate that 2DC improves solution effectiveness in comparison with state-of-the-art approaches that aim at improving either speedup or resource utilization. Additionally, 2DC can determine inherent parallelism, which means that users are no longer required to specify the number of processors before the compilation stage.
A library, called PAD, of basic parallel algorithms and data structures for the PRAM is currently being implemented using the PRAM programming language Fork95. The main motivations of the PAD project are to study the PRAM as a practical programming model, and to provide an organized collection of basic PRAM algorithms for the SB-PRAM under completion at the University of Saarbruecken. We give a brief survey of Fork95 and describe the main components of PAD. Finally, we report on the status of the language and library and discuss further developments.
This book constitutes the refereed proceedings of the 11th International Symposium on Parallel Architectures, Algorithms and Programming, PAAP 2020, held in Shenzhen, China, in December 2020.
ISBN: 9789811600104 (digital)
ISBN: 9789811600098 (print)
We discuss parallel sorting algorithms and their implementations suitable for cluster architectures in order to optimize cluster resources. We focus on the time spent in computation and on the load balancing properties when processors run at different speeds, i.e., speeds related by a constant multiplicative factor (our weak definition of a heterogeneous platform). One scheme is under study: parallel sorting by sampling (either the regular sampling technique introduced by Shi and Schaeffer [J. Parallel Distrib. Comput. 14 (4) (1992) 361] or the over-partitioning scheme introduced by Li and Sevcik [Parallel sorting by over-partitioning, in: Proceedings of the Sixth Annual Symposium on Parallel Algorithms and Architectures, ACM Press, New York, June 1994]). What matters most in the paper is the load balance factor and not necessarily the execution time. It is clear that improved load balance leads to improved execution time. The results presented in the paper demonstrate that load balancing for computers with heterogeneous processing capacity is more challenging than for the homogeneous case. The survey, through the sorting case study, allows us to identify some algorithmic issues and software challenges that must be mastered on heterogeneous cluster platforms in order to better utilize them: data decomposition techniques, scheduling, and load balancing methods. © 2002 Elsevier Science B.V. All rights reserved.
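As a rough illustration of the sampling idea mentioned in this abstract, the following is a minimal sequential simulation of sorting by regular sampling for the homogeneous case. It is a sketch, not the paper's implementation: the function name `psrs_sort`, the simplified pivot choice, and the returned per-worker loads are assumptions made here for illustration, and the heterogeneous (speed-weighted) variant is not modeled.

```python
import bisect
from heapq import merge


def psrs_sort(data, p):
    """Simulate sorting by regular sampling with p equal-speed workers.

    Returns (sorted_list, loads), where loads[k] is the number of
    elements the hypothetical worker k merges in the final phase.
    """
    n = len(data)
    if p <= 1 or n < p * p:
        # Too little data to sample p items per block: sort directly.
        return sorted(data), [n]
    # Phase 1: split into p nearly equal blocks and sort each locally.
    bounds = [i * n // p for i in range(p + 1)]
    blocks = [sorted(data[bounds[i]:bounds[i + 1]]) for i in range(p)]
    # Phase 2: take p regularly spaced samples from every sorted block.
    samples = sorted(b[j * len(b) // p] for b in blocks for j in range(p))
    # Phase 3: pick p - 1 pivots at regular positions in the sample list
    # (a simplified pivot rule, assumed here for brevity).
    pivots = [samples[i * p] for i in range(1, p)]
    # Phase 4: cut every block at the pivots; piece k goes to worker k.
    buckets = [[] for _ in range(p)]
    for b in blocks:
        cuts = [0] + [bisect.bisect_right(b, pv) for pv in pivots] + [len(b)]
        for k in range(p):
            buckets[k].append(b[cuts[k]:cuts[k + 1]])
    # Phase 5: each worker merges the sorted pieces it received.
    merged = [list(merge(*pieces)) for pieces in buckets]
    result = [x for part in merged for x in part]
    return result, [len(part) for part in merged]
```

With the loads in hand, one possible load balance factor is `max(loads) / (len(data) / p)`; values close to 1 indicate an even final distribution.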
We demonstrate an approach to parallel programming based on skeletons - parameterized program schemas with efficient implementations over diverse architectures. The contribution of the paper is two-fold: (1) we classify divide-and-conquer (DC) algorithms and provide a family of provably correct parallel implementations for a particular DC skeleton, called DH (distributable homomorphism); (2) we adjust the mathematical specification of the Fast Fourier Transform (FFT) to the DH skeleton and thereby obtain a generic SPMD program, well suited for implementation under MPI. The generic program includes the efficient FFT solutions used in practice - the binary-exchange and the 2D- and 3D-transpose implementations - as special cases.
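To make the divide-and-conquer structure of the FFT concrete, here is a minimal sequential radix-2 sketch. It is not the DH skeleton or the SPMD/MPI program from the paper; the function name `fft` and the power-of-two restriction are assumptions chosen for this illustration. The two recursive calls are independent, which is the property a divide-and-conquer skeleton can exploit for parallel execution.

```python
import cmath


def fft(x):
    """Recursive radix-2 FFT of a sequence whose length is a power of two."""
    n = len(x)
    if n < 1 or n & (n - 1):
        raise ValueError("length must be a power of two")
    if n == 1:
        return [complex(x[0])]
    # Divide: transform even- and odd-indexed halves independently.
    even = fft(x[0::2])
    odd = fft(x[1::2])
    # Combine: butterfly step with twiddle factors exp(-2*pi*i*k/n).
    out = [0j] * n
    half = n // 2
    for k in range(half):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + half] = even[k] - t
    return out
```

For example, the transform of a unit impulse is constant, and the transform of a constant signal concentrates all energy in the first coefficient.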
A complicated class-cluster is transformed into several test blocks, and these test blocks are assigned to the processing cores of a multi-core computer. Each processing core executes multiple threads in parallel to han...
To sort multisets efficiently in parallel on heterogeneous multi-core clusters whose nodes have different numbers of processing cores, different computing and communication capabilities, and distinct sizes of main...
ISBN: 9781450311601 (print)
We present CaCUDA - a GPGPU kernel abstraction and a parallel programming framework for developing highly efficient large-scale scientific applications using stencil computations on hybrid CPU/GPU architectures. CaCUDA is built upon the Cactus computational toolkit, an open source problem solving environment designed for scientists and engineers. Due to the flexibility and extensibility of the Cactus toolkit, the addition of a GPGPU programming framework required no changes to the Cactus infrastructure, guaranteeing that existing features and modules will continue to work without modification. CaCUDA was tested and benchmarked using a 3D CFD code based on a finite difference discretization of the Navier-Stokes equations.
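For readers unfamiliar with the term, a stencil computation updates each grid point from a fixed pattern of neighbors. The sketch below shows one Jacobi-style sweep of a 5-point stencil on a 2D grid in plain Python. It does not use or represent the CaCUDA or Cactus APIs; the function name `jacobi_step` and the averaging rule are assumptions chosen only to illustrate the access pattern.

```python
def jacobi_step(grid):
    """Apply one 5-point averaging stencil sweep to the interior of a 2D grid.

    Boundary values are copied unchanged; every interior point becomes the
    mean of its four direct neighbors in the old grid.
    """
    rows, cols = len(grid), len(grid[0])
    new = [row[:] for row in grid]  # copy so reads use only old values
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            new[i][j] = 0.25 * (
                grid[i - 1][j] + grid[i + 1][j]
                + grid[i][j - 1] + grid[i][j + 1]
            )
    return new
```

Because every interior update reads only the previous grid, all points can be computed independently, which is what makes this pattern a natural fit for GPU kernels.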
ISBN: 9780769543017 (print)
Recently we proposed occam-pi as a high-level language for programming massively parallel reconfigurable architectures. The design of occam-pi incorporates ideas from CSP and pi-calculus to facilitate expressing parallelism and reconfigurability. The feasibility of this approach was illustrated by building three occam-pi implementations of DCT executing on an Ambric. However, because DCT is a simple and well-studied algorithm, it remained uncertain whether occam-pi would also be effective for programming novel, more complex algorithms. In this paper, we demonstrate the applicability of occam-pi for expressing various degrees of parallelism by implementing a significantly large case study, focus criterion calculation in an autofocus algorithm, on the Ambric architecture. Autofocus is a key component of synthetic aperture radar systems. Two implementations of focus criterion calculation were developed and evaluated on the basis of performance. Comparison of the performance results with a single-threaded software implementation of the same algorithm shows that the throughputs of the two implementations are 11x and 23x higher than that of the sequential implementation, despite a 9x lower clock frequency. The two designs are, respectively, 29x and 40x more energy efficient.