In this paper we present fine-grained multithreaded algorithms and implementations for the Fast Fourier Transform (FFT) problem. The FFT problem has been formulated using two distinct approaches based on the dataflow ...
详细信息
In this paper we present fine-grained multithreaded algorithms and implementations for the Fast Fourier Transform (FFT) problem. The FFT problem has been formulated using two distinct approaches based on the dataflow concepts. The first approach, referred to as the receiver-initiated algorithm, realizes the FFT iterations as a parent-child relationship while fully exploiting the underlying parallelism. The second approach, referred to as the sender-initiated algorithm, follows a data-flow model based on the producer-consumer style of programming and can be adopted to different architectural parameters for achieving high performance. The implementations of the proposed algorithms have been carried out on the EARTH (Efficient Architecture for Running THreads) platform. For both the algorithms, we analyze the ratio of remote vs local threads and study its impact on the experimental results. Our implementation results show that for certain block sizes on fixed problem size and machine size, the receiver-initiated approach performs better than the sender-initiated approach. For large number of processors, both the algorithms perform well, yielding execution times of only 10 msec for an input of 16 K data points on a 64 processor machine, assuming each processor running at 140 MHz clock speed.
Hierarchical algorithms such as multigrid applications form an important cornerstone for scientific computing. In this study, we take a first step toward evaluating parallel language support for hierarchical applicati...
详细信息
This paper presents some results of programming efficient matching algorithms on a new asynchronous parallelprogramming model. Matching algorithms are widely used in image processing when considering high-level treat...
详细信息
This paper presents some results of programming efficient matching algorithms on a new asynchronous parallelprogramming model. Matching algorithms are widely used in image processing when considering high-level treatments. Pattern analysis, database search, 2D and 3D reconstruction all need matching algorithms to perform. Experiments we did were mainly oriented towards a particular matching problem: the stable marriage algorithm. Different implementations of this algorithm have been done on a massively parallel asynchronous model. This model relies on a network of asynchronously communicating processors leading to very fast SIMD treatments. The asynchronous model and implementations of the matching algorithm are presented. An example of image processing problem is also used for illustration purpose and supports the architectural discussion and results.
The paper proposes two parallel meta-heuristics. One is a cooperative parallel tabu search which incorporates historical information exchange among processors in addition to its own searching of each processor. The ot...
详细信息
ISBN:
(纸本)0769509363
The paper proposes two parallel meta-heuristics. One is a cooperative parallel tabu search which incorporates historical information exchange among processors in addition to its own searching of each processor. The other is a cooperative parallel search between genetic algorithm and tabu search processes. Through computational experiment, we observe the improvement of solutions by our proposed method.
I/O intensive applications have posed great challenges to computational scientists. A major problem of these ap plications is that users have to sacrifice performance requirement in order to satisfy storage capacity r...
详细信息
ISBN:
(纸本)0769507840
I/O intensive applications have posed great challenges to computational scientists. A major problem of these ap plications is that users have to sacrifice performance requirement in order to satisfy storage capacity requirement bl a conventional computing environment. Further performance improvement is impeded by the physical nature of these storage media evert state-of-the-art I/O optimizations are employed. In this paper we present a distributed multi-storage resource architecture that carr satisfy both performance and capacity requirements by employing multiple storage resources. Compared to traditional single storage resource architecture, our architecture provides a more flexible and reliable computing environment. It can bring new opportunities for high performance computing as well as inherit state-of-the-art I/O optimization approaches that have already been developed. We also develop an Application programming Interface (API) that provides transparent management and access to various storage resources in our computing environment. As I/O usually dominates the performance in I/O intensive applications, we establish an I/O performance prediction mechanism which consists of a performance database and a prediction algorithm to help users better evaluate and schedule their applications. A tool is also developed to help users automatically generate the performance database. The experiments show that our multi-storage resource architecture is a promising platform for high performance distributed computing.
This paper adopts a transformational programming approach for deriving massively parallel algorithms from functional specifications. It gives a brief description of a framework for relating key higher order functions ...
详细信息
ISBN:
(纸本)0780365429
This paper adopts a transformational programming approach for deriving massively parallel algorithms from functional specifications. It gives a brief description of a framework for relating key higher order functions such as map, reduce, and scan with communicating processes with different configurations. The parallelisation of many interesting functional algorithms can then be systematically synthesized by combining "off the shelf" parallel implementations of instances of these higher order functions. Efficiency in the final message-passing algorithms is achieved by exploiting data parallelism, for generating the intermediate results in parallel; and functional parallelism, for processing intermediate results in stages such that the output of one stage is simultaneously input to the next one. This approach is illustrated through a case study for testing whether all the elements of a given list are distinct. Bird-Meertens formalism is used to concisely carry out algebraic transformations.
We have parallelized a Monte Carlo photon transport algorithm. Three different parallel versions of the algorithm were developed. The first version is for the Tera Multi-Threaded Architecture (MTA) and uses Tera speci...
详细信息
We have parallelized a Monte Carlo photon transport algorithm. Three different parallel versions of the algorithm were developed. The first version is for the Tera Multi-Threaded Architecture (MTA) and uses Tera specific directives. The second version, which uses MPI library calls, has been implemented on both the CRAY T3E and the 8-way SMP IBM SP with Power3 processors. The third version is a hybrid MPI-OpenMP implementation and is used on the SMP IBM SP. This version uses MPI to communicate between nodes and OpenMP to perform shared memory operations among processors within a node. We explain the three different parallelization approaches and present parallel performance results of these three parallel implementations on three different machines. We observe near perfect speedup for the three versions on the three architectures. The results on the SMP IBM SP suggest that the hybrid MPI-OpenMP programming is suitable for SMP type machines.
We study the problem of scheduling a set of n independent parallel tasks on m processors, where in addition to the processing time there is a size associated with each task indicating that the task can be processed on...
详细信息
The bulk synchronous task scheduling problem (BSSP) is known as an effective task scheduling problem for distributed-memory machines, but the time complexity of BSSP is unknown. This paper presents a proof of NP-compl...
详细信息
ISBN:
(纸本)0769509363
The bulk synchronous task scheduling problem (BSSP) is known as an effective task scheduling problem for distributed-memory machines, but the time complexity of BSSP is unknown. This paper presents a proof of NP-completeness of BSSP even in the case of unit time tasks and positive integer constant communication delays. This paper also gives an approximation algorithm for BSSP in several restricted cases.
In this work we describe and analyze algorithms for 2-D wavelet packet decomposition for MIMD distributed memory architectures. We discuss two different approaches: On the one hand algorithms generating the entire wav...
详细信息
In this work we describe and analyze algorithms for 2-D wavelet packet decomposition for MIMD distributed memory architectures. We discuss two different approaches: On the one hand algorithms generating the entire wavelet packet subband structure (as required for adaptive applications), on the other hand algorithms generating the lowest subband level only (as required for numerical applications). We investigate several optimizations and generalizations of corresponding message passing algorithms and finally compare the results obtained on a Cray T3D and a Parsytec GCel 1024.
暂无评论