We focus on implementing high-level functional algorithms in reconfigurable hardware. The approach adopts the transformational programming paradigm for deriving massively parallel algorithms from functional specifications. It extends previous work by systematically generating efficient circuits and mapping them onto reconfigurable hardware. The massive parallelisation of the algorithm works by carefully composing "off the shelf" highly parallel implementations of each of the basic building blocks involved in the algorithm. These basic building blocks are a small collection of well-known higher-order functions such as map, fold, and zipwith. Using function decomposition and data refinement techniques, these powerful functions are refined into highly parallel implementations described in Hoare's CSP. The CSP descriptions are closely associated with Handel-C program fragments; Handel-C is a programming language based on C, extended with parallelism and communication primitives taken from CSP. In the final stage, the circuit description is generated by compiling the Handel-C programs and mapping the result onto the targeted reconfigurable hardware, such as the RC-1000 FPGA system from Celoxica. The approach is illustrated by a case study that generates several versions of the matrix multiplication algorithm.
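As an illustration of the functional starting point for such a derivation, the sketch below expresses matrix multiplication purely with the building blocks named above (map, fold, zipwith). It is written in Python rather than a functional specification language or Handel-C, and the function names are ours, so treat it as a minimal sketch of the specification side, not the authors' derivation.

```python
from functools import reduce

def dot(row, col):
    # zipwith (*) followed by a fold (+): the two building blocks a
    # derivation like the one described would refine into parallel circuits
    return reduce(lambda acc, x: acc + x,
                  map(lambda ab: ab[0] * ab[1], zip(row, col)), 0)

def matmul(a, b):
    # map the dot product over every (row, column) pair
    cols = list(zip(*b))              # transpose b to expose its columns
    return [[dot(row, col) for col in cols] for row in a]

print(matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))   # [[19, 22], [43, 50]]
```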
We present an efficient parallel algorithm for scheduling n unit-length tasks on m identical processors when the precedence graphs are interval orders. Our algorithm requires O(log² v + (n log n)/v) time and O(nv² + n²) operations on the CREW PRAM, where v can be any number between 1 and n. By choosing v = √n, we obtain an O(√n log n)-time algorithm with O(n²) operations. For v = n/log n, we have an O(log² n)-time algorithm with O(n³/log² n) operations. The previous solution takes O(log² n) time with O(n³ log² n) operations on the CREW PRAM. Our improvement is mainly due to a simple dynamic programming recurrence for computing the lengths of optimal schedules and a reduction of the m-processor scheduling problem for interval orders to that of finding a maximum matching in a convex bipartite graph. (C) 2003 Elsevier Science (USA). All rights reserved.
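The reduction's target problem, maximum matching in a convex bipartite graph (each left vertex is adjacent to a contiguous interval of right vertices), has a well-known greedy solution: scan the right vertices in increasing order and always match the available left vertex whose interval ends earliest. The sequential sketch below illustrates that problem only; it is not the paper's CREW PRAM algorithm, and the interval representation and names are ours.

```python
import heapq
from collections import defaultdict

def convex_bipartite_matching(intervals, num_right):
    """Greedy maximum matching in a convex bipartite graph.

    intervals[i] = (l, r): left vertex i is adjacent to right vertices l..r.
    Scanning right vertices in increasing order and matching the candidate
    with the earliest interval end is optimal (Glover's rule).
    """
    start = defaultdict(list)
    for i, (l, r) in enumerate(intervals):
        start[l].append((r, i))

    heap, matching = [], {}
    for v in range(num_right):
        for r, i in start[v]:
            heapq.heappush(heap, (r, i))
        while heap and heap[0][0] < v:      # drop candidates whose interval ended
            heapq.heappop(heap)
        if heap:
            _, i = heapq.heappop(heap)
            matching[i] = v
    return matching

# three left vertices competing for three right vertices
print(convex_bipartite_matching([(0, 1), (0, 2), (1, 1)], 3))   # size-3 matching
```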
This paper presents a natural and efficient implementation of the classical broadcast message-passing routine that optimizes performance on Ethernet-based clusters. A simple algorithm for parallel matrix multiplication is specifically designed to take advantage of both the parallel computing facilities (CPUs) provided by clusters and the optimized performance of broadcast messages on Ethernet-based clusters. This simple parallel matrix multiplication algorithm also takes into account possibly heterogeneous computing hardware and maintains a balanced workload across computers according to their relative computing power. Performance tests are presented on a heterogeneous cluster as well as on a homogeneous cluster, where the algorithm is compared with the parallel matrix multiplication provided by the ScaLAPACK library. Another simple parallel algorithm is proposed for LU matrix factorization (a general method for solving dense systems of equations) following the same guidelines used for the parallel matrix multiplication algorithm. Some performance tests are presented on a homogeneous cluster.
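The load-balancing idea can be sketched compactly: split the rows of the left matrix in proportion to each node's relative speed, broadcast the right matrix once, and let every node multiply only its own block. The numpy sketch below is a serial stand-in for that scheme; the speed figures are hypothetical, the function names are ours, and a real run would move the blocks with a message-passing layer such as MPI.

```python
import numpy as np

def row_partition(n_rows, speeds):
    """Split n_rows among workers proportionally to their relative speed."""
    share = np.array(speeds, dtype=float) / sum(speeds)
    counts = np.floor(share * n_rows).astype(int)
    counts[-1] += n_rows - counts.sum()        # give the remainder to the last worker
    bounds = np.concatenate(([0], np.cumsum(counts)))
    return [(bounds[i], bounds[i + 1]) for i in range(len(speeds))]

def hetero_matmul(a, b, speeds):
    # b is conceptually broadcast once to every worker; each worker multiplies
    # only its own block of rows of a
    parts = [a[lo:hi] @ b for lo, hi in row_partition(a.shape[0], speeds)]
    return np.vstack(parts)

a, b = np.random.rand(100, 40), np.random.rand(40, 60)
c = hetero_matmul(a, b, speeds=[1.0, 2.0, 1.5])   # hypothetical relative speeds
assert np.allclose(c, a @ b)
```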
We present a parallel algorithm for performing multipoint linkage analysis of genetic marker data on large family pedigrees. The algorithm effectively distributes both the computation and memory requirements of the analysis. We discuss an implementation of the algorithm in the Genehunter linkage analysis package (version 2.1), enabling Genehunter to run on distributed-memory platforms for the first time. Our preliminary benchmarks indicate reasonable scalability of the algorithm even for fixed-size problems, with parallel efficiencies of 75% or more on up to 128 processors. In addition, we have extended the hard-coded limit of 16 non-founding individuals in Genehunter 2.1 to a new limit of 32 non-founding individuals. (C) 2003 Elsevier Inc. All rights reserved.
We present practical parallel algorithms using prefix computations for various problems that arise in pairwise comparison of biological sequences. We consider both constant and affine gap penalty functions, full-sequence and subsequence matching, and space-saving algorithms. Commonly used sequential algorithms solve the sequence comparison problems in O(mn) time and O(m + n) space, where m and n are the lengths of the sequences being compared. All the algorithms presented in this paper are time optimal with respect to the sequential algorithms and can use O(n/log n) processors, where n is the length of the larger sequence. While optimal parallel algorithms for many of these problems are known, we use a simple framework and demonstrate how these problems can be solved systematically using repeated parallel prefix operations. We also present a space-saving algorithm that uses O(m + n/p) space and runs in optimal time, where p is the number of processors used. We implemented the parallel space-saving algorithm and provide experimental results on an IBM SP-2 and a Pentium cluster. (C) 2003 Elsevier Science (USA). All rights reserved.
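To see why prefix operations fit here, note that in the standard O(mn) alignment recurrence the only dependence inside a row is the horizontal gap term, and with a linear gap penalty that term folds into a single prefix-min scan over the row. The numpy sketch below shows that rewrite for one row and uses it to compute a plain edit distance; it is a sequential illustration of the scan-based formulation, with a gap penalty and cost function chosen by us, not the paper's parallel algorithms themselves.

```python
import numpy as np

def next_row(prev, a_i, b, g, cost):
    """Row i of the alignment DP table from row i-1.

    D[i][j] = min(D[i-1][j-1] + cost(a_i, b[j-1]),
                  D[i-1][j]   + g,
                  D[i][j-1]   + g)
    The last term couples cells within the row; rewriting it as
    D[i][j] = j*g + min_{k<=j} (T[k] - k*g), where T[k] collects the two
    previous-row terms, reduces the whole row to one prefix-min scan.
    """
    n = len(b)
    d0 = prev[0] + g                                    # D[i][0]
    t = np.minimum(prev[:-1] + np.array([cost(a_i, c) for c in b]),
                   prev[1:] + g)                        # T[1..n]
    j = np.arange(1, n + 1)
    u = np.concatenate(([d0], t - j * g))               # U[0..n]
    row = j * g + np.minimum.accumulate(u)[1:]
    return np.concatenate(([d0], row))

def edit_distance(a, b, g=1.0, cost=lambda x, y: 0.0 if x == y else 1.0):
    row = g * np.arange(len(b) + 1, dtype=float)        # D[0][j] = j*g
    for ch in a:
        row = next_row(row, ch, b, g, cost)
    return row[-1]

print(edit_distance("parallel", "prefix"))              # 6.0
```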
Author: Kepner, J. (MIT Lincoln Laboratory, Lexington, MA 02420, USA)
2D convolution is a staple of digital image processing. The advent of large-format imagers makes it possible to literally "pave" the focal plane of an optical sensor with silicon, which results in very large images that can require a significant amount of computation to process. Filtering of large images via 2D convolutions is often complicated by a variety of effects (e.g., non-uniformities found in wide-field-of-view instruments) that must be compensated for in the filtering process by changing the filter across the image. This paper describes a fast (FFT-based) method for convolving images with slowly varying filters. A parallel version of the method is implemented using a multi-threaded approach, which allows more efficient load balancing and a simpler software architecture. The method has been implemented within a high-level interpreted language (IDL), while also exploiting an open-standard vector library (VSIPL) and open-standard parallel directives (OpenMP). The parallel approach and software architecture are generally applicable to a variety of algorithms and have the advantage of giving users the convenience of an easy operating environment while delivering high performance from fully portable code. (C) 2003 Elsevier Science (USA). All rights reserved.
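A common way to approximate convolution with a slowly varying filter is to tile the image and FFT-convolve each tile with the kernel evaluated at that tile's centre. The scipy/numpy sketch below shows only that core idea: the kernel_for function, its variation law, and the tile size are illustrative assumptions, and the blending needed at tile borders (which a careful implementation such as the paper's must handle) is omitted.

```python
import numpy as np
from scipy.signal import fftconvolve

def varying_filter_convolve(image, kernel_for, tile=128):
    """FFT-based convolution with a filter that varies slowly across the image.

    Each tile is convolved (via FFT) with the kernel evaluated at the tile
    centre; boundary blending between tiles is omitted to keep the sketch short.
    """
    out = np.zeros_like(image, dtype=float)
    rows, cols = image.shape
    for r0 in range(0, rows, tile):
        for c0 in range(0, cols, tile):
            r1, c1 = min(r0 + tile, rows), min(c0 + tile, cols)
            k = kernel_for((r0 + r1) // 2, (c0 + c1) // 2)   # local filter
            out[r0:r1, c0:c1] = fftconvolve(image[r0:r1, c0:c1], k, mode="same")
    return out

# toy example: a Gaussian-like blur whose width grows across the image
def kernel_for(r, c, size=9):
    y, x = np.mgrid[-(size // 2):size // 2 + 1, -(size // 2):size // 2 + 1]
    sigma = 1.0 + 3.0 * (r + c) / 2048.0          # hypothetical variation law
    k = np.exp(-(x * x + y * y) / (2 * sigma * sigma))
    return k / k.sum()

img = np.random.rand(512, 512)
blurred = varying_filter_convolve(img, kernel_for)
```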
We present here a new deterministic parallel algorithm for the two-processor scheduling problem. The algorithm uses only O(n³) processors and takes O(log² n) time on a CREW PRAM. In order to prove the above bounds, we show how to compute in NC the lexicographically first matching for a special kind of convex bipartite graph. (C) 2002 Elsevier Science B.V. All rights reserved.
Effective design of parallel matrix multiplication algorithms relies on the consideration of many interdependent issues based on the underlying parallel machine or network on which such algorithms will be implemented, as well as the type of methodology utilized by an algorithm. In this paper, we determine the parallel complexity of multiplying two (not necessarily square) matrices on parallel distributed-memory machines and/or networks. In other words, we provide an achievable parallel run-time that cannot be beaten by any algorithm (known or unknown) for solving this problem. In addition, any algorithm that claims to be optimal must attain this run-time. In order to obtain results that are general and useful across a span of machines, we base our results on the well-known LogP model. Furthermore, three important criteria must be considered in order to determine the running time of a parallel algorithm; namely, (i) local computational tasks, (ii) the initial data layout, and (iii) the communication schedule. We provide optimality results by first proving general lower bounds on parallel run-time. These lower bounds lead to significant insights on (i)-(iii) above. In particular, we present what types of data layouts and communication schedules are needed in order to obtain optimal run-times. We prove that no one data layout can achieve optimal running times for all cases. Instead, optimal layouts depend on the dimensions of each matrix and on the number of processors. Lastly, optimal algorithms are provided.
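For readers unfamiliar with LogP, the sketch below shows the kind of accounting such an analysis rests on: under the four parameters (L: latency, o: per-message overhead, g: gap between injections, P: processors), a run-time estimate is the sum of a communication schedule's cost and the local computation. The decomposition used here (rows of A split evenly, B sent sequentially from one root rather than by a broadcast tree) and the flop rate are purely illustrative assumptions, not the paper's optimal layouts or its lower bounds.

```python
def logp_rowwise_matmul_time(n, m, k, P, L, o, g, flop_time):
    """Rough LogP-style estimate for C = A (n x m) x B (m x k), with A split
    row-wise over P processors and B sent from processor 0 to each other one.

    Each send of B is charged as one message: o at the sender, L in flight,
    o at the receiver, with successive injections at the root spaced by
    max(g, o).  Local work is 2*(n/P)*m*k flops per processor.
    """
    if P > 1:
        comm = (P - 2) * max(g, o) + o + L + o   # time until the last copy of B lands
    else:
        comm = 0.0
    comp = 2.0 * (n / P) * m * k * flop_time
    return comm + comp

# hypothetical parameters (seconds): cluster-like latency, ~1 GFLOP/s per node
print(logp_rowwise_matmul_time(n=1024, m=1024, k=1024, P=16,
                               L=5e-6, o=2e-6, g=4e-6, flop_time=1e-9))
```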
We propose a parallel and asynchronous approach that gives near-optimal solutions to the non-fixed point-to-point connection problem. This problem is NP-hard and has practical applications in multicast routing. The technique adopted to solve the problem is an organization of heuristics that communicate with each other by means of a virtual shared memory. This technique is called A-Teams (Asynchronous Teams). The virtual shared memory is implemented on a physically distributed memory system. Computational results comparing our approach with a branch-and-cut algorithm are presented. (C) 2003 Elsevier Science B.V. All rights reserved.
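The A-Teams organization itself is easy to picture: a pool of candidate solutions held in shared memory, and several heuristic agents that asynchronously read a solution, try to improve it, and write the result back, while a destroyer trims the pool. The threaded toy below sketches only that structure; the placeholder problem, the improvement move, and the pool policy are ours, and the paper's agents run over physically distributed memory rather than threads in one process.

```python
import random
import threading
import time

class SharedMemory:
    """Tiny stand-in for the A-Teams shared solution pool."""
    def __init__(self, capacity=20):
        self.capacity = capacity
        self.pool = []                      # list of (cost, solution)
        self.lock = threading.Lock()

    def read(self):
        with self.lock:
            return random.choice(self.pool) if self.pool else None

    def write(self, cost, solution):
        with self.lock:
            self.pool.append((cost, solution))
            self.pool.sort(key=lambda cs: cs[0])
            del self.pool[self.capacity:]   # destroyer: drop the worst entries

def agent(memory, improve, cost, rounds=200):
    """One asynchronous heuristic: read, try to improve, write back."""
    for _ in range(rounds):
        picked = memory.read()
        candidate = improve(picked[1] if picked else None)
        memory.write(cost(candidate), candidate)
        time.sleep(0)                       # yield; agents run asynchronously

# toy placeholder problem: minimise the sum of a 0/1 vector by random bit clears
cost = sum
def improve(sol, n=32):
    sol = list(sol) if sol else [1] * n
    sol[random.randrange(n)] = 0
    return sol

memory = SharedMemory()
threads = [threading.Thread(target=agent, args=(memory, improve, cost)) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print("best cost found:", memory.pool[0][0])
```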
ISBN (print): 0769520170
In this paper we propose an efficient parallel algorithm with simple static and dynamic scheduling for generating combinations. It can use any number of processors NP ≤ n − m + 1 in order to generate the set of all C(n, m) combinations. The main characteristic of this algorithm is that it requires no integer larger than n during the whole computation. The performance results show that, even without a perfect load balance, the algorithm performs very well, especially when n is large. Moreover, the dynamic algorithm also performs well on heterogeneous parallel platforms.
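The bound NP ≤ n − m + 1 suggests one natural static-scheduling scheme: an m-combination of {1, …, n} has exactly n − m + 1 possible first elements, so each processor can be given a share of those first elements and generate the combinations that start with them, using only integers up to n. The sketch below implements that scheme with a round-robin assignment; it is our illustration of the idea, not necessarily the paper's exact distribution.

```python
from itertools import combinations

def combinations_for_worker(n, m, worker, num_workers):
    """Combinations of {1..n} choose m whose first element is assigned to
    this worker (round-robin over the n - m + 1 possible first elements)."""
    for first in range(1 + worker, n - m + 2, num_workers):
        for rest in combinations(range(first + 1, n + 1), m - 1):
            yield (first,) + rest

# 3 workers covering C(6, 3): together they enumerate all 20 combinations
all_combos = sorted(c for w in range(3) for c in combinations_for_worker(6, 3, w, 3))
assert all_combos == sorted(combinations(range(1, 7), 3))
print(len(all_combos))   # 20
```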