In order to extract high levels of performance from modern parallel architectures, the effective management of deep memory hierarchies is very important. While architectural advances in caches help in better utilization of the memory hierarchy, compiler-directed locality enhancement techniques are also important. In this paper we propose a locality improvement technique that uses data space (array layout) transformations, in contrast to most of the previous work, which is based on iteration space (loop) transformations. In other words, rather than changing the order of loop iterations, our technique modifies the memory layouts of multi-dimensional arrays. In comparison with previous work on data transformations, it brings two novelties. First, we formulate the problem on a special graph structure called the layout graph (LG) and use integer linear programming (ILP) methods to determine optimal layouts. Second, in addition to static layout detection, our approach also enables the compiler to determine optimal dynamic layouts, that is, layouts that can be changed across loop nest boundaries. We believe that this is the first attempt to determine optimal dynamic memory layouts. We also present preliminary experimental results on the SGI Origin 2000 distributed shared memory multiprocessor. Our results so far are encouraging and indicate that the additional compilation time taken by the solver is tolerable.
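As a rough illustration of why the memory layout of an array matters for locality, the following minimal sketch compares the address stride of one loop nest under two layouts. It is not the layout graph or ILP formulation of the paper; the function names (`row_major`, `col_major`, `mean_stride`) and the array sizes are assumptions made for this example only.

```python
def row_major(i, j, rows, cols):
    """Linear address of A[i][j] when rows are stored contiguously."""
    return i * cols + j


def col_major(i, j, rows, cols):
    """Linear address of A[i][j] when columns are stored contiguously."""
    return j * rows + i


def mean_stride(layout, rows, cols):
    """Average distance between consecutive addresses for a loop nest
    that walks down each column: for j in cols: for i in rows: A[i][j].
    """
    addrs = [layout(i, j, rows, cols)
             for j in range(cols)
             for i in range(rows)]
    gaps = [abs(b - a) for a, b in zip(addrs, addrs[1:])]
    return sum(gaps) / len(gaps)


if __name__ == "__main__":
    # Hypothetical 4 x 8 array; the loop order is fixed, only the layout changes.
    print("row-major mean stride:", mean_stride(row_major, 4, 8))
    print("col-major mean stride:", mean_stride(col_major, 4, 8))
```

With the loop order held fixed, the column-major layout gives unit stride for this nest, which is the kind of effect a data space transformation aims for without reordering iterations.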
In this paper we present the first system that implements OpenMP on a network of shared-memory multiprocessors. This system enables the programmer to rely on a single, standard, shared-memory API for parallelization within a multiprocessor and between multiprocessors. It is implemented via a translator that converts OpenMP directives to appropriate calls to a modified version of the TreadMarks software distributed memory system (SDSM). In contrast to previous SDSM systems for SMPs, the modified TreadMarks uses POSIX threads for parallelism within an SMP node. This approach greatly simplifies the changes required to the SDSM in order to exploit the intra-node hardware shared memory. We present performance results for six applications (SPLASH-2 Barnes-Hut and Water; NAS 3D-FFT, SOR, TSP and MGS) running on an SP2 with four four-processor SMP nodes. A comparison between the threaded implementation and the original implementation of TreadMarks shows that using the hardware shared memory within an SMP node significantly reduces the amount of data and the number of messages transmitted between nodes, and consequently achieves speedups up to 30% better than the original versions. We also compare SDSM against message passing. Overall, the speedups of multithreaded TreadMarks programs are within 7-30% of the MPI versions.
The design and optimised (in time and space) implementation of a systolic circuit dedicated to aerial image matching is proposed. Evaluated on a Xilinx XC 4010 XL, the final run-time data-adaptive architecture offers an equivalent processing speed-up of 2000 compared to a sequential solution.
With the increasing number of scientific applications manipulating huge amounts of data, effective data management is an increasingly important problem. Unfortunately, so far the solutions to this data management problem either require deep understanding of specific storage architectures and file layouts (as in high-performance file systems) or produce unsatisfactory I/O performance in exchange for ease-of-use and portability (as in relational DBMSs). In this paper we present a new environment which is built around an active meta-data management system (MDMS). The key components of our three-tiered architecture are user application, the MDMS, and a hierarchical storage system (HSS). Our environment overcomes the performance problems of pure database-oriented solutions, while maintaining their advantages in terms of ease-of-use and portability. The high levels of performance are achieved by the MDMS, with the aid of user-specified directives. Our environment supports a simple, easy-to-use yet powerful user interface, leaving the task of choosing appropriate I/O techniques to the MDMS. We discuss the importance of an active MDMS and show how the three components, namely application, the MDMS, and the HSS, fit together. We also report performance numbers from our initial implementation and illustrate that significant improvements are made possible without undue programming effort.
ISBN: (print) 0780344553
The purpose of the paper is to describe a new semi-automated design space exploration method based on genetic programming. A new control/dataflow specification method is proposed, as well as appropriate models for hardware parts and algorithms. With this method we are able to test many different hardware architectures and algorithms against cost, speed, computation time and other constraints within a very short time. The remaining manual work is to exploit the model parameters of the components of the architecture and the algorithm. In contrast to other approaches, our method is suited for embedded and distributed systems. The method, models and application are explained in detail by means of a comprehensive case study.
ISBN: (print) 0818684038
Irregular particle-based applications that use trees, for example hierarchical N-body applications, are important consumers of multiprocessor cycles, and are argued to benefit greatly in programming ease from a coherent shared address space programming model. As more and more supercomputing platforms that can support different programming models become available to users, from tightly-coupled hardware-coherent machines to clusters of workstations or SMPs, to truly deliver on its ease of programming advantages to application users it is important that the shared address space model not only perform and scale well in the tightly-coupled case but also port well in performance across the range of platforms (as the message passing model can). For tree-based N-body applications, this is currently not true: while the actual computation of interactions ports well, the parallel tree building phase can become a severe bottleneck on coherent shared address space platforms, in particular on platforms with less aggressive, commodity-oriented communication architectures (even though it takes less than 3 percent of the time in most sequential executions). We therefore investigate the performance of five parallel tree building methods in the context of a complete galaxy simulation on four very different platforms that support this programming model: an SGI Origin2000 (an aggressive hardware cache-coherent machine with physically distributed memory), an SGI Challenge bus-based shared memory multiprocessor, an Intel Paragon running a shared virtual memory protocol in software at page granularity, and a Wisconsin Typhoon-zero in which the granularity of coherence can be varied using hardware support but the protocol runs in software (in the last case using both a page-based and a fine-grained protocol). We find that the algorithms used successfully and widely distributed so far for the first two platforms cause overall application performance to be very poor on the latter two commodit
We demonstrate an approach to parallel programming based on skeletons: parameterized program schemas with efficient implementations over diverse architectures. The contribution of the paper is two-fold: (1) we classify divide-and-conquer (DC) algorithms and provide a family of provably correct parallel implementations for a particular DC skeleton, called DH (distributable homomorphism); (2) we adjust the mathematical specification of the Fast Fourier Transform (FFT) to the DH skeleton and thereby obtain a generic SPMD program, well suited for implementation under MPI. The generic program includes the efficient FFT solutions used in practice (the binary-exchange and the 2D- and 3D-transpose implementations) as special cases.
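To make the divide-and-conquer structure of the FFT concrete, here is a minimal sequential sketch of the textbook radix-2 recursion. It is not the DH skeleton or the generic SPMD program described in the abstract; it only shows the split, recurse, and combine pattern that such a skeleton would parameterize. The function name `fft` and the restriction to power-of-two input lengths are assumptions of this example.

```python
import cmath


def fft(xs):
    """Recursive radix-2 FFT; len(xs) is assumed to be a power of two."""
    n = len(xs)
    if n == 1:
        return list(xs)
    # Divide: transform the even- and odd-indexed halves independently.
    even = fft(xs[0::2])
    odd = fft(xs[1::2])
    half = n // 2
    # Combine: apply twiddle factors to the odd half, then butterfly.
    tw = [cmath.exp(-2j * cmath.pi * k / n) * odd[k] for k in range(half)]
    return ([even[k] + tw[k] for k in range(half)] +
            [even[k] - tw[k] for k in range(half)])


if __name__ == "__main__":
    print(fft([1, 1, 1, 1]))
```

The two recursive calls are independent of each other, which is the property a parallel implementation of a DC skeleton can exploit.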
ISBN: (print) 9780897919890
We study the problem of scheduling systems of affine recurrence equations (SAREs), a convenient formalism for modeling massively parallel computations. We unify, in a single framework, the two most important methods for solving the problem: the Farkas method and the vertex method, both using linear programming. Then we compare the efficiency of the methods, in terms of number of variables, number of constraints and execution time of the resolution, on real-world examples arising from parallelization problems. Our conclusions show that the Farkas method is significantly better than the vertex method.
ISBN: (print) 0818684038
Vector prefix and reduction are collective communication primitives in which all processors must cooperate. We present two parallel algorithms, the direct algorithm and the split algorithm, for vector prefix and reduction computation on coarse-grained, distributed-memory parallel machines. Our algorithms are relatively architecture independent and can be used effectively in many applications such as Pack/Unpack, Array Prefix/Reduction Functions, and Array Combining Scatter Functions, which are defined in Fortran 90 and in High Performance Fortran. Experimental results on the CM-5 are presented.
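For readers unfamiliar with the primitives, the sketch below shows what vector prefix and vector reduction compute, using elementwise addition as the combining operator. It is a sequential reference simulation in which each inner list stands for the vector held by one processor; it is not the direct or the split algorithm of the paper, and the function names are assumptions of this example.

```python
from itertools import accumulate


def vector_prefix(vectors):
    """Return, for each simulated processor i, the elementwise sum of the
    vectors held by processors 0..i (an inclusive prefix)."""
    def add(u, v):
        return [a + b for a, b in zip(u, v)]
    return list(accumulate(vectors, add))


def vector_reduction(vectors):
    """Return the elementwise sum over all simulated processors."""
    return vector_prefix(vectors)[-1]


if __name__ == "__main__":
    # Three hypothetical processors, each holding a vector of length 2.
    data = [[1, 2], [3, 4], [5, 6]]
    print(vector_prefix(data))
    print(vector_reduction(data))
```

A parallel algorithm has to produce the same per-processor results while limiting the number and size of messages exchanged, which is where the communication cost trade-offs arise.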
ISBN: (print) 3540643591
We describe a family of reconfigurable parallel architectures for logic emulation. They are supposed to be applicable like conventional FPGAs, while covering a larger range of circuit sizes and clock frequencies. In order to evaluate the performance of such programmable designs, we also need software methods for code generation from circuit descriptions. We propose a combination of scheduling and routing algorithms for embedding calculations into the target architecture.