Authors:
Benkner, S.; Brandes, T.
Univ Vienna, Inst Software Sci, A-1090 Vienna, Austria
Fraunhofer Inst Algorithms & Sci Comp (SCAI), Schloss Birlinghoven, D-53754 St Augustin, Germany
Clusters of shared-memory multiprocessors (SMPs) have become the most promising parallel computing platforms for scientific computing. However, SMP clusters significantly increase the complexity of user application development when using the low-level application programming interfaces MPI and OpenMP, forcing users to deal with both distributed-memory and shared-memory parallelization details. In this paper we present extensions of High Performance Fortran (HPF) for SMP clusters which enable the compiler to adopt a hybrid parallelization strategy, efficiently combining distributed-memory with shared-memory parallelism. By means of a small set of new language features, the hierarchical structure of SMP clusters may be specified. This information is utilized by the compiler to derive inter-node data mappings for controlling distributed-memory parallelization across the nodes of a cluster and intra-node data mappings for extracting shared-memory parallelism within nodes. Additional mechanisms are proposed for specifying inter- and intra-node data mappings explicitly, for controlling specific shared-memory parallelization issues, and for integrating OpenMP routines in HPF applications. The proposed features have been realized within the ADAPTOR and VFC compilers. The parallelization strategy for clusters of SMPs adopted by these compilers is discussed, as well as a hybrid-parallel execution model based on a combination of MPI and OpenMP. Experimental results indicate the effectiveness of the proposed features. Copyright (C) 2004 John Wiley & Sons, Ltd.
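The paper targets compiler-generated code, which the abstract does not show; the fragment below is only a hand-written sketch of the hybrid execution model it describes, with MPI ranks at the inter-node (distributed-memory) level and OpenMP threads at the intra-node (shared-memory) level. The array, block distribution, and reduction are illustrative assumptions, not taken from the paper.

    /* Minimal hand-written sketch of the hybrid MPI+OpenMP execution model:
     * MPI ranks handle the inter-node level, OpenMP threads the intra-node
     * level.  N, the block distribution, and the reduction are illustrative. */
    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    #define N 1000000

    int main(int argc, char **argv)
    {
        int rank, nprocs;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        /* Inter-node mapping: each node owns one contiguous block
           (remainder ignored for brevity). */
        int chunk = N / nprocs;
        double local_sum = 0.0, global_sum = 0.0;

        /* Intra-node mapping: the node's block is shared among OpenMP threads. */
        #pragma omp parallel for reduction(+:local_sum)
        for (int i = 0; i < chunk; i++) {
            double x = (double)(rank * chunk + i);
            local_sum += x * x;          /* stand-in for real node-local work */
        }

        /* Distributed-memory step: combine per-node results with MPI. */
        MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0)
            printf("sum of squares = %e\n", global_sum);

        MPI_Finalize();
        return 0;
    }

Such a program is typically launched with one MPI rank per SMP node and OMP_NUM_THREADS set to the number of cores per node, which matches the two-level data mapping the abstract describes.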
Microscopy is becoming increasingly digital and dependent on computation. Some of the computational tasks in microscopy are computationally intense, such as image restoration (deconvolution), some optical calculations, image segmentation, and image analysis. Several modern microscope technologies enable the acquisition of very large data sets. 3D imaging of live cells over time, multispectral imaging, very large tiled 3D images of thick samples, or images from high throughput biology all can produce extremely large images. These large data sets place a heavy burden on laboratory computer resources. This combination of computationally intensive tasks and larger data sizes can easily exceed the capability of single personal computers. The large multiprocessor computers that are the traditional technology for larger tasks are too expensive for most laboratories. An alternative approach is to use a number of inexpensive personal computers as a cluster; that is, to use multiple networked computers programmed to run the problem in parallel on all the computers in the cluster. By using relatively inexpensive over-the-counter hardware and open source software, this approach can be much more cost effective for many tasks. We discuss the different computer architectures available, and their advantages and disadvantages. (C) 2004 Wiley-Liss, Inc.
Application of pattern-based approaches to parallel programming is an active area of research today. The main objective of pattern-based approaches to parallel programming is to facilitate the reuse of frequently occu...
The Monte Carlo (MC) method is a simple but effective way to perform simulations involving complicated or multivariate functions. The Quasi-Monte Carlo (QMC) method is similar but replaces independent and identically ...
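The abstract (truncated above) contrasts plain Monte Carlo with quasi-Monte Carlo sampling; the sketch below illustrates that contrast on a simple 2-D integral. The integrand, the sample count, and the choice of a Halton low-discrepancy sequence are illustrative assumptions, not details from the paper.

    /* Plain Monte Carlo vs. quasi-Monte Carlo for a 2-D integral: the same
     * estimator, but QMC replaces i.i.d. pseudo-random points with
     * low-discrepancy (Halton) points. */
    #include <math.h>
    #include <stdio.h>
    #include <stdlib.h>

    static double f(double x, double y) { return exp(-(x * x + y * y)); }

    /* Radical-inverse function: the building block of Halton/van der Corput points. */
    static double radical_inverse(unsigned n, unsigned base)
    {
        double inv = 1.0 / base, r = 0.0, scale = inv;
        while (n > 0) {
            r += (double)(n % base) * scale;
            scale *= inv;
            n /= base;
        }
        return r;
    }

    int main(void)
    {
        const unsigned N = 100000;
        double mc = 0.0, qmc = 0.0;

        for (unsigned i = 0; i < N; i++) {
            /* MC: independent, identically distributed pseudo-random points. */
            double ux = rand() / (RAND_MAX + 1.0);
            double uy = rand() / (RAND_MAX + 1.0);
            mc += f(ux, uy);

            /* QMC: Halton points in bases 2 and 3 instead of random points. */
            qmc += f(radical_inverse(i + 1, 2), radical_inverse(i + 1, 3));
        }
        printf("MC  estimate: %.6f\nQMC estimate: %.6f\n", mc / N, qmc / N);
        return 0;
    }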
ISBN (Digital): 9780191712869
ISBN (Print): 9780198529392
This book explains the use of the bulk synchronous parallel (BSP) model and the BSPlib communication library in parallel algorithm design and parallel programming. The main topics treated in the book are central to the area of scientific computation: solving dense linear systems by Gaussian elimination, computing fast Fourier transforms, and solving sparse linear systems by iterative methods based on sparse matrix-vector multiplication. Each topic is treated in depth, starting from the problem formulation and a sequential algorithm, through a parallel algorithm and its cost analysis, to a complete parallel program written in C and BSPlib, and experimental results obtained using this program on a parallel computer. Throughout the book, emphasis is placed on analyzing the cost of the parallel algorithms developed, expressed in three terms: computation cost, communication cost, and synchronization cost. The book contains five example programs written in BSPlib, which illustrate the methods taught. These programs are freely available as the package BSPedupack. An appendix on the message-passing interface (MPI) discusses how to program in a structured, bulk synchronous parallel style using the MPI communication library, and presents MPI equivalents of all the programs in the book.
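As a flavour of the BSPlib style the book teaches, here is a minimal sketch of one superstep: each process computes a partial result, bsp_put()s it into every other process's array, and sums the parts after bsp_sync(). It is not one of the book's five BSPedupack programs, and exact argument types (int vs. size_t) vary between BSPlib implementations.

    /* One BSPlib superstep: all-to-all exchange of partial sums, followed by
     * a local reduction after the synchronization barrier. */
    #include <stdio.h>
    #include "bsp.h"

    #define P 4                               /* requested number of processes */

    void spmd_part(void)
    {
        bsp_begin(P);
        int p = bsp_nprocs();                 /* processes actually granted */
        int s = bsp_pid();

        double part = (double)(s + 1) * (s + 1);   /* stand-in for local work */
        double parts[P];
        bsp_push_reg(parts, P * sizeof(double));   /* make 'parts' remotely writable */
        bsp_sync();                                /* registration takes effect */

        /* Superstep: put my partial result into slot s of everyone's 'parts'. */
        for (int t = 0; t < p; t++)
            bsp_put(t, &part, parts, s * sizeof(double), sizeof(double));
        bsp_sync();                                /* all puts are now visible */

        double total = 0.0;
        for (int t = 0; t < p; t++)
            total += parts[t];
        printf("proc %d: total = %f\n", s, total);

        bsp_pop_reg(parts);
        bsp_end();
    }

    int main(int argc, char **argv)
    {
        bsp_init(spmd_part, argc, argv);
        spmd_part();
        return 0;
    }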
ISBN (Print): 0769522254
DataGrids are becoming increasingly important for sharing large data collections, research achievements, and resources. The BSP model is a widely used parallel programming model, and the idea of the superstep in the BSP model can help sequence DataGrid access and storage in a regular fashion. When services are not isolated from each other in multi-user environments, this should make it possible to avoid four kinds of phenomena during data access and storage: lost updates, dirty reads, non-repeatable reads, and phantom reads.
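The abstract gives no code; the fragment below only illustrates the superstep discipline it appeals to, reusing the bsp_begin()/bsp_end() scaffolding of the previous sketch. Because updates issued with bsp_put() become visible only at the next bsp_sync(), every process reads the same committed snapshot within a superstep. The record type and single-writer pattern are hypothetical.

    /* Within a superstep every process sees the state committed at the last
     * bsp_sync(); writes issued here take effect only at the next one, so a
     * reader never observes a half-finished update. */
    #include "bsp.h"

    typedef struct { int version; double balance; } record_t;

    static record_t rec;                        /* one replica per process */

    void update_superstep(double delta)
    {
        bsp_push_reg(&rec, sizeof rec);
        bsp_sync();

        record_t snap = rec;                    /* same snapshot on every process */

        if (bsp_pid() == 0) {                   /* a single writer this superstep */
            record_t upd = snap;
            upd.balance += delta;
            upd.version += 1;
            for (int t = 0; t < bsp_nprocs(); t++)
                bsp_put(t, &upd, &rec, 0, sizeof upd);   /* deferred until sync */
        }

        bsp_sync();            /* superstep boundary: all replicas now agree */
        bsp_pop_reg(&rec);
    }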
Summary form only given. This paper describes the development of a fine-grained meta-heuristic for solving large strip packing problems with guillotine layouts. An architecture-adaptive environment, aCe, and the aCe C parallel programming language are used to implement a massively parallel genetic simulated annealing (GSA) algorithm. The parallel GSA combines the temperature schedule of simulated annealing with the crossover and mutation operators that are applied to chromosome populations in genetic algorithms. For our problem, chromosomes are normalized postfix expressions that represent guillotine strip packings. Preliminary results for some benchmark data sets are reported and indicate that the parallel GSA method holds promise as a technique for solving the strip packing problem.
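The abstract does not list the operators; the sketch below only shows the kind of machinery it describes: chromosomes as normalized postfix expressions over rectangle ids and the guillotine-cut operators 'H' and 'V', with a simulated-annealing Metropolis test deciding whether a mutated offspring survives. The single-character encoding, the swap mutation, and decode_height() are illustrative assumptions, not the paper's exact operators.

    /* Chromosome = normalized postfix expression, e.g. "12H3V":
     * (rect 1 stacked on rect 2), placed beside rect 3.  Assumes single-digit
     * rectangle ids and at least two operands per expression. */
    #include <ctype.h>
    #include <math.h>
    #include <stdlib.h>
    #include <string.h>

    typedef struct {
        char   expr[64];   /* postfix encoding of a guillotine packing */
        double height;     /* strip height of the decoded packing (fitness) */
    } chromosome_t;

    /* Placeholder decoder: a real one would evaluate the slicing tree encoded
     * by the postfix expression and return the packed strip height. */
    static double decode_height(const char *expr) { (void)expr; return 0.0; }

    /* Mutation: swap two rectangle symbols, which always preserves a valid
     * normalized postfix expression. */
    void mutate(chromosome_t *c)
    {
        int n = (int)strlen(c->expr), i, j;
        do { i = rand() % n; } while (!isdigit((unsigned char)c->expr[i]));
        do { j = rand() % n; } while (!isdigit((unsigned char)c->expr[j]) || j == i);
        char tmp = c->expr[i]; c->expr[i] = c->expr[j]; c->expr[j] = tmp;
        c->height = decode_height(c->expr);
    }

    /* Metropolis acceptance at temperature T: better packings always survive,
     * worse ones with probability exp(-delta / T), as in simulated annealing. */
    int accept(const chromosome_t *parent, const chromosome_t *child, double T)
    {
        double delta = child->height - parent->height;
        if (delta <= 0.0)
            return 1;
        return (rand() / (RAND_MAX + 1.0)) < exp(-delta / T);
    }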
OmniRPC is a Grid RPC system for parallel programming in a grid environment. In order to understand the performance characteristics of OmniRPC, we executed a synthetic benchmark program that varies the execution time on remote nodes and the amount of communication, on several configurations of our grid environment. The results show that application performance improves when RPC data transmissions are smaller than 10 KB, the job time on remote nodes exceeds 4 seconds, and RPCs are issued more than 256 times. Our results also show a small performance degradation when the communication multiplexing feature is used. We also measured the performance of the EP application from the NAS parallel benchmark suite. For EP, performance is almost the same whether SSH or the Globus GRAM is used for agent invocation. As a practical application, we parallelized the CONFLEX molecular conformation search program using OmniRPC. Compared with the MPI version of CONFLEX, CONFLEX-G achieves comparable efficiency and gains additional speed by using two or more clusters.
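The sketch below shows the asynchronous master/worker pattern an OmniRPC client of this kind uses. The function names (OmniRpcInit, OmniRpcCallAsync, OmniRpcWaitAll, OmniRpcFinalize) follow the OmniRPC client API described in its papers, but the header name, the remote entry "conf_search", and its argument list are assumptions that would have to match a real IDL definition and the installed distribution.

    /* Hedged sketch of an OmniRPC client: issue many asynchronous RPCs and
     * wait for all of them; the runtime schedules them over the remote nodes
     * listed in the host file. */
    #include <stdio.h>
    #include "OmniRpcC.h"       /* assumed client header name */

    #define NTASKS 256          /* roughly the call count at which, per the
                                   measurements above, RPC overhead amortizes */

    int main(int argc, char **argv)
    {
        OmniRpcRequest reqs[NTASKS];
        double results[NTASKS];

        OmniRpcInit(&argc, &argv);

        /* Each remote job should run for seconds and move little data, the
           regime the benchmark above identifies as profitable. */
        for (int i = 0; i < NTASKS; i++)
            reqs[i] = OmniRpcCallAsync("conf_search", i, &results[i]);

        OmniRpcWaitAll(NTASKS, reqs);

        for (int i = 0; i < NTASKS; i++)
            printf("task %d -> %f\n", i, results[i]);

        OmniRpcFinalize();
        return 0;
    }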
Summary form only given. We compare the performance of three programming paradigms for the parallelization of nested loop algorithms on SMP clusters. More specifically, we propose three alternative models for tiled nested loop algorithms, namely a pure message passing paradigm as well as two hybrid ones that implement communication both through message passing and through shared memory access. The hybrid models adopt an advanced hyperplane scheduling scheme that allows both minimal thread synchronization and pipelined execution with overlapping of computation and communication phases. We focus on the experimental evaluation of all three models and test their performance against several iteration spaces and parallelization grains with the aid of a typical micro-kernel benchmark. We conclude that the hybrid models can in some cases be more beneficial than the monolithic pure message passing model, as they better exploit the configuration characteristics of a hierarchical parallel platform such as an SMP cluster.
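The paper's micro-kernel is not reproduced here; the sketch below only shows the general shape of such a hybrid model: MPI ranks exchange boundary rows with non-blocking calls while OpenMP threads update interior tiles, overlapping communication with computation. The 4-point stencil, sizes, and array names are illustrative assumptions.

    /* Hybrid skeleton: non-blocking MPI halo exchange overlapped with an
     * OpenMP update of the interior rows; boundary rows are finished after
     * the halos arrive. */
    #include <mpi.h>
    #include <omp.h>
    #include <string.h>

    #define NX 1024            /* local rows per rank (plus 2 halo rows) */
    #define NY 1024
    #define STEPS 100

    static double a[NX + 2][NY], b[NX + 2][NY];

    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        int up   = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
        int down = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

        for (int t = 0; t < STEPS; t++) {
            MPI_Request req[4];

            /* Post the halo exchange for the boundary rows (inter-node level). */
            MPI_Irecv(a[0],      NY, MPI_DOUBLE, up,   0, MPI_COMM_WORLD, &req[0]);
            MPI_Irecv(a[NX + 1], NY, MPI_DOUBLE, down, 1, MPI_COMM_WORLD, &req[1]);
            MPI_Isend(a[1],      NY, MPI_DOUBLE, up,   1, MPI_COMM_WORLD, &req[2]);
            MPI_Isend(a[NX],     NY, MPI_DOUBLE, down, 0, MPI_COMM_WORLD, &req[3]);

            /* Meanwhile, OpenMP threads update the interior rows (intra-node level). */
            #pragma omp parallel for
            for (int i = 2; i < NX; i++)
                for (int j = 1; j < NY - 1; j++)
                    b[i][j] = 0.25 * (a[i-1][j] + a[i+1][j] + a[i][j-1] + a[i][j+1]);

            /* Wait for the halos, then finish the rows that depend on them. */
            MPI_Waitall(4, req, MPI_STATUSES_IGNORE);
            for (int j = 1; j < NY - 1; j++) {
                b[1][j]  = 0.25 * (a[0][j]    + a[2][j]    + a[1][j-1]  + a[1][j+1]);
                b[NX][j] = 0.25 * (a[NX-1][j] + a[NX+1][j] + a[NX][j-1] + a[NX][j+1]);
            }
            memcpy(a, b, sizeof(a));
        }

        MPI_Finalize();
        return 0;
    }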
ISBN (Print): 9780769520858
In conventional multiprocessor SoC (MPSoC) design methods, we find two problems: lack of SW code portability and lack of early SW validation. These problems cause a long design cycle. To resolve them, we present a concept of two-layer hardware-dependent software (HdS). The presented HdS consists of a hardware abstraction layer that abstracts the sub-system architecture and an SoC abstraction layer that abstracts the global MPSoC architecture. During the exploration of global and sub-system architectures, the application programming interfaces of the presented two-layer HdS keep the SW independent of architectural changes. The simulation models of the two-layer HdS enable validation of the entire system, including the SW and HW designs, early in the design flow. We show the effectiveness of the presented methodology in the MPSoC architecture exploration of an OpenDiVX encoder system design.
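The abstract does not name the APIs; the sketch below is a hypothetical illustration of the two-layer structure it describes, with application SW calling only an SoC abstraction layer, which in turn calls a per-sub-system hardware abstraction layer, so that exploring a different architecture rebinds the HAL without touching the SW. All identifiers are invented for illustration.

    /* Two-layer structure: application SW -> SoC abstraction layer -> HAL. */
    #include <stdio.h>
    #include <string.h>
    #include <stddef.h>

    /* Hardware abstraction layer: hides one sub-system (local bus, DMA, ...). */
    typedef struct {
        int (*read)(unsigned port, void *buf, size_t len);
        int (*write)(unsigned port, const void *buf, size_t len);
    } hal_ops_t;

    /* SoC abstraction layer: hides the global interconnect; application SW
       talks only to logical channels, never to the HAL directly. */
    typedef struct {
        const hal_ops_t *hal;   /* bound to one candidate sub-system's HAL */
        unsigned port;          /* logical port, remapped per architecture */
    } soc_channel_t;

    static int soc_send(soc_channel_t *ch, const void *data, size_t len)
    {
        /* Swapping bus for NoC or shared memory only rebinds ch->hal;
           the application task (e.g. an encoder stage) is unchanged. */
        return ch->hal->write(ch->port, data, len);
    }

    /* Dummy HAL standing in for a simulation model of one sub-system. */
    static int dummy_write(unsigned port, const void *buf, size_t len)
    {
        (void)buf;
        printf("write of %zu bytes to port %u\n", len, port);
        return 0;
    }
    static int dummy_read(unsigned port, void *buf, size_t len)
    {
        memset(buf, 0, len);
        printf("read of %zu bytes from port %u\n", len, port);
        return 0;
    }

    int main(void)
    {
        hal_ops_t hal = { dummy_read, dummy_write };
        soc_channel_t ch = { &hal, 3 };
        char frame[16] = "macroblock data";
        return soc_send(&ch, frame, sizeof frame);
    }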