The proceedings contain 125 papers. The topics discussed include: nearly optimal algorithms for broadcast on d-dimensional all-port and wormhole-routed torus;minimizing total communication distance of a time-step opti...
ISBN:
(纸本)0818684038
The proceedings contain 125 papers. The topics discussed include: nearly optimal algorithms for broadcast on d-dimensional all-port and wormhole-routed torus;minimizing total communication distance of a time-step optimal broadcast in mesh networks;protocols for non-deterministic communication over synchronous channels;compiler optimization of implicit reductions for distributed memory multiprocessors;local enumeration techniques for sparse algorithms;an expression-rewriting framework to generate communication sets for HPF programs with block-cyclic distribution;evaluation of compiler and runtime library approaches for supporting parallel regular applications;automatic differentiation for message-passing parallel programs;processor lower bound formulas for array computations and parametric diophantine systems;a flexible class of parallel matrix multiplication algorithms;caching-efficient multithreaded fast multiplication of sparse matrices;permutation capability of optical multistage interconnection networks;HIPIQS: a high-performance switch architecture using input queuing;on the bisection width and expansion of butterfly networks;multiprocessor architectures using multi-hop multi-OPS lightwave networks and distributed control;an enhanced co-scheduling method using reduced ms-state diagrams;predicated software pipelining technique for loops with conditions;and analyzing the individual combined effects of speculative and guarded execution on a superscalar architecture.
The multiple minimum degree (MMD) algorithm and its variants have enjoyed more than 20 years of research and progress in generating fill-reducing orderings for sparse, symmetric, positive definite matrices. Although c...
详细信息
ISBN:
(纸本)3540653872
The multiple minimum degree (MMD) algorithm and its variants have enjoyed more than 20 years of research and progress in generating fill-reducing orderings for sparse, symmetric, positive definite matrices. Although conceptually simple, efficient implementations of these algorithms are deceptively complex and highly specialized. In this case study, we present an object-oriented library that implements several recent minimum degree-like algorithms. We discuss how object-oriented design forces us to decompose these algorithms in a different manner than earlier codes and demonstrate how this impacts the flexibility and efficiency of our C++ implementation. We compare the performance of our code against other implementations in C or Fortran.
High-performance scientific computing relies increasingly on high-level, large-scale, object-oriented software frameworks to manage both algorithmic complexity and the complexities of parallelism: distributed data man...
详细信息
ISBN:
(纸本)3540653872
High-performance scientific computing relies increasingly on high-level, large-scale, object-oriented software frameworks to manage both algorithmic complexity and the complexities of parallelism: distributed data management, process management, inter-process communication, and load balancing. This encapsulation of data management, together with the prescribed semantics of a typical fundamental component of such object-oriented frameworks-a parallel or serial array class library-provides an opportunity for increasingly sophisticated compile-time optimization techniques. This paper describes two optimizing transformations suitable for certain classes of numerical algorithms, one for reducing the cost of inter-processor communication, and one for improving cache utilization;demonstrates and analyzes the resulting performance gains;and indicates how these transformations are being automated.
The nano-threads programming model was proposed to effectively integrate multiprogramming on shared-memory multiprocessors, with the exploitation of fine-grain parallelism from standard applications. A prerequisite fo...
详细信息
Irregular, particle-based applications that use trees, for example hierarchical N-body applications, are important consumers of multiprocessor cycles, and are argued to benefit greatly in programming ease from a coher...
详细信息
Irregular, particle-based applications that use trees, for example hierarchical N-body applications, are important consumers of multiprocessor cycles, and are argued to benefit greatly in programming ease from a coherent shared address space programming model. As more and more supercomputing platforms that can support different programming models become available to users, from tightly-coupled hardware-coherent machines to clusters of workstations or SMPs, to truly deliver on its ease of programming advantages to application users it is important that the shared address space model not only perform and scale well in the tightly-coupled case but also port well in performance across the range of platforms (as the message passing model can). For tree-based N-body applications, this is currently not true: while the actual computation of interactions ports well, the parallel tree building phase can become a severe bottleneck on coherent shared address space platforms, in particular on platforms with less aggressive, commodity-oriented communication architectures (even though it takes less 3 percent of the time in most sequential executions). The authors therefore investigate the performance of five parallel tree building methods in the context of a complete galaxy simulation on four very different platforms that support this programming model.
The goal of High Performance Fortran (HPF) is to 'address the problems of writing data parallel programs where the distribution of data affects performance', providing the user with a high-level language inter...
详细信息
ISBN:
(纸本)9780897919982
The goal of High Performance Fortran (HPF) is to 'address the problems of writing data parallel programs where the distribution of data affects performance', providing the user with a high-level language interface for programming scalable parallelarchitectures and delegating to the compiler the task of producing an explicity parallel message-passing program. For some applications, this approach may result in dramatic performance losses. An important example is the inspector/executor paradigm, which HPF uses to support irregular data accesses in parallel loops. In many cases, the compiler does not have sufficient information to decide whether an inspector computation is redundant or needs to be repeated. In such cases, the performance of the whole program may be significantly degraded. In this paper, we describe an approach to solve this problem through the introduction of constructs allowing explicit manipulation of communication schedules at the HPF language level. The goal is to avoid the use of extrinsics for expressing irregular computation via message-passing primitives, while guaranteeing essentially the same performance. These languages features allow the user to control the reuse of schedules and to specify access patterns that may be used to compute a schedule. They have been implemented as part of the HPF+ language and we report some preliminary performance numbers from this implementation.
One significant challenge in bringing the power of parallel machines to application programmers is providing them with a suite of software tools simikvto the tools that sequential programmers currently utilize. In par...
详细信息
ISBN:
(纸本)0897919718
One significant challenge in bringing the power of parallel machines to application programmers is providing them with a suite of software tools simikvto the tools that sequential programmers currently utilize. In partici liai automatic or semi-automatic testing tools for parallel program are lacking. This paper describes our work in automatic generstioa of all-du-paths for testing parallel programs. Our goal is to demonstrate that with some extension sequential test data adequacy cri teria are still applicable to parallel program testing. The concepts and algorithms in this paper have been incorporated as the foundatior of our Delaware parallel Software Testing Aid della pasta.
This paper describes a parallel model for a distributed memory architecture of a non traditional evolutionary computation method, which integrates constraint propagation and evolution programs. This integration provid...
详细信息
ISBN:
(纸本)3540650784
This paper describes a parallel model for a distributed memory architecture of a non traditional evolutionary computation method, which integrates constraint propagation and evolution programs. This integration provides a problem-independent optimisation strategy for large scale constrained combinatorial problems over finite integer domains. We have adopted a global parallelisation approach which preserves the properties, behaviour, and theoretical studies of the sequential algorithm. Moreover, high speedup is achieved since genetic operators are coarse-grained, as they perform a search in a discrete space carrying out constraint propagation. A global parallelisation implies a single population but, as we focus on distributed memory architectures, the single virtual population is physically distributed among the processors. Selection and mating consider all the individuals in the population, but the application of genetic operators is performed in parallel. The implementation of the model has been tested on a GRAY T3E multiprocessor using two complex constrained optimisation problems. Experiments have proved the efficiency of this approach since linear speedups have been obtained.
Extensive research shows that the implementation of evolutionary algorithms are costly in many cases. This paper reports on an alternative construct, namely non-uniform cellular automatons (CA) with which (1) it is po...
详细信息
Extensive research shows that the implementation of evolutionary algorithms are costly in many cases. This paper reports on an alternative construct, namely non-uniform cellular automatons (CA) with which (1) it is possible to increase the capacity for complex computations while preserving the main features of CAs, (2) evolve parallel cellular machines to perform computational tasks via cellular programming, and (3) attain evolware with implementations centering on hardware. The hardware architecture of the cellular programming evolware departs from in that: (1) a group of individual cells is in operation rather than a single one, (2) genetic operators are implemented on-board rather than on offline computer, and (3) the evolutionary phase proceeds uninterrupted via a continuous machine's operation. The active components of the evolware complex comprise exclusively the latest cellular FPGA circuit. Accordingly, this paper is organized as follows: Section I presents fundamentals and analysis of computations in Quasi-Uniform Cellular Automata in a quasi-uniform cellular space. Section II introduces the basic approach for Coevolving Cellular Computation and Cellular Machines, denoted Cellular programming. Section III adopts synchronization as a criteria to attain evolware centered on hardware. Cellular programming is implemented in hardware. The algorithm is slightly modified (without loss of performance): the genetic operators, one-point crossover and mutation, are replaced with the operator uniform crossover. This results in the creation of an offspring genome from two parent genomes (bit strings) with a 50% probability for each parent. Finally, Section IV analyzes performance and results.
In the literature, there are quite a few sequential and parallel algorithms to solve problems in a distance-hereditary graph G utilizing techniques discovered from the properties of the problems. Based on structural p...
详细信息
暂无评论