This paper presents parallel algorithms for determining the number of partitions of a given integer N, where the partitions may be subject to restrictions, such as being composed of distinct parts, of a given number o...
详细信息
This paper presents parallel algorithms for determining the number of partitions of a given integer N, where the partitions may be subject to restrictions, such as being composed of distinct parts, of a given number of parts, and/or of parts belonging to a specified set. We present a series of adaptive algorithms suitable for varying numbers of processors. The fastest of these algorithms computes the number of partitions of n with largest part equal to k, for 1 less than or equal to k less than or equal to n less than or equal to N, in time O(log(2)(N)) using O(N-2/log N) processors. parallel logarithmic time algorithms that generate partitions uniformly at random, using these quantities, are also presented. (C) 1996 Academic Press, Inc.
Programmable logic devices (PLDs) continue to grow in size and currently contain several millions of gates. At the same time, research effort is going into higher-level hardware synthesis methodologies for reconfigura...
详细信息
Programmable logic devices (PLDs) continue to grow in size and currently contain several millions of gates. At the same time, research effort is going into higher-level hardware synthesis methodologies for reconfigurable computing that can exploit PLD technology. In this paper, we explore the effectiveness and extend one such formal methodology in the design of massively parallel algorithms. We take a step-wise refinement approach to the development of correct reconfigurable hardware circuits from formal specifications. A functional programming notation is used for specifying algorithms and for reasoning about them. The specifications are realised through the use of a combination of function decomposition strategies, data refinement techniques, and off-the-shelf refinements based upon higher-order functions. The off-the-shelf refinements are inspired by the operators of communicating sequential processes (CS.P) and map easily to programs in Handel-C (a hardware description language). The Handel-C descriptions are directly compiled into reconfigurable hardware. The practical realisation of this methodology is evidenced by a case studying the matrix multiplication algorithm as it is relatively simple and well known. In this paper, we obtain several hardware implementations with different performance characteristics by applying different refinements to the algorithm. The developed designs are compiled and tested under Celoxica's RC-1000 reconfigurable computer with its 2 million gates Virtex-E FPGA. Performance analysis and evaluation of these implementations are included. (C) 2006 Elsevier Ltd. All rights reserved.
Reconfigurable devices, such as Field Programmable Gate Arrays (FPGAs), have been witnessing a considerable increase in density. State-of-the-art FPGAs are complex hybrid devices that contain up to several millions of...
详细信息
Reconfigurable devices, such as Field Programmable Gate Arrays (FPGAs), have been witnessing a considerable increase in density. State-of-the-art FPGAs are complex hybrid devices that contain up to several millions of gates. Recently, research effort has been going into higher-level parallelization and hardware synthesis methodologies that can exploit such a programmable technology. In this paper, we explore the effectiveness of one such formal methodology in the design of parallel versions of the Serpent cryptographic algorithm. The suggested methodology adopts a functional programming notation for specifying algorithms and for reasoning about them. The specifications are realized through the use of a combination of function decomposition strategies, data refinement techniques, and off-the-shelf refinements based upon higher-order functions. The refinements are inspired by the operators of Communicating Sequential Processes and map easily to programs in Handel-C (a hardware description language). In the presented research, we obtain several parallel Serpent implementations with different performance characteristics. The developed designs are tested under Celoxica's RC-1000 reconfigurable computer with its two million gates Virtex-E FPGA. Performance analysis and evaluation of these implementations are included.
We give an O (log(4) n)-time O(n(2))-processor CRCW PRAM algorithm to find a hamiltonian cycle in a strong semicomplete bipartite digraph, B, provided that a factor of B (i.e., a collection of vertex disjoint cycles c...
详细信息
We give an O (log(4) n)-time O(n(2))-processor CRCW PRAM algorithm to find a hamiltonian cycle in a strong semicomplete bipartite digraph, B, provided that a factor of B (i.e., a collection of vertex disjoint cycles covering the vertex set of B)is computed in a preprocessing step. The factor is found (if it exists) using a bipartite matching algorithm, hence placing the whole algorithm in the class Random-NC. We show that any parallel algorithm which can check the existence of a hamiltonian cycle in a strong semicomplete bipartite digraph in time O(r(n)) using p(n) processors can be used to check the existence of a perfect matching in a bipartite graph in time O(r(n) + n(2)/p(n)) using p(n) processors. Hence, our problem belongs to the class NC if and only if perfect matching in bipartite graphs belongs to NC. We also consider the problem of finding a hamiltonian path in a semicomplete bipartite digraph.
This paper addresses the problem of developing efficient parallel algorithms for the training procedure of a neural network-based Fingerprint Image Comparison (FIC) system. The target architecture is assumed to be a c...
详细信息
This paper addresses the problem of developing efficient parallel algorithms for the training procedure of a neural network-based Fingerprint Image Comparison (FIC) system. The target architecture is assumed to be a coarse-grain distributed-memory parallel architecture. Two types of parallelism-node parallelism and training set parallelism (TSP)-are investigated. Theoretical analysis and experimental results show that node parallelism has low speedup and poor scalability, while TSP proves to have the best speedup performance. TSP, however, is amenable to a slow convergence rate. To reduce this effect, a modified training set parallel algorithm using weighted contributions of synaptic connections is proposed. Experimental results show that this algorithm provides a fast convergence rate while keeping the best speedup performance obtained. The combination of TSP with node parallelism is also investigated. A good performance is achieved by this approach. This provides better scalability with the trade-off of a slight decrease in speedup. The above algorithms are implemented on a 32-node CM-5.
In this paper, we present some novel algorithms for scheduling hierarchical signal flow graphs in the domain of high-level synthesis. With complex chips that need to be designed in the future, it is expected that the ...
详细信息
In this paper, we present some novel algorithms for scheduling hierarchical signal flow graphs in the domain of high-level synthesis. With complex chips that need to be designed in the future, it is expected that the runtimes of these scheduling algorithms will be quite large. The key contributions of this paper are as follows: First, we develop a novel extension of the sequential force-directed scheduling algorithm which naturally handles loops and conditionals by coming up with a scheme of scheduling hierarchical signal flow graphs. Second, we develop three new parallel algorithms for the scheduling problem. Our parallel algorithms are portable across a wide range of parallel platforms. We report results on a set of high-level synthesis benchmarks on 8-processor SGI Origin and a 64 processor IBM SP-2. While some parallel algorithms for VLSI CAD reported by earlier researchers have reported a loss of qualities of results, our parallel algorithms produce exactly the same results as the sequential algorithms on which they are based.
Let P andQ be two convex,n-vertex polygons. We consider the problem of computing, in parallel, some functions ofP andQ whenP andQ are disjoint. The model of parallel computation we consider is the CREW-PRAM, i.e., it ...
详细信息
Let P andQ be two convex,n-vertex polygons. We consider the problem of computing, in parallel, some functions ofP andQ whenP andQ are disjoint. The model of parallel computation we consider is the CREW-PRAM, i.e., it is the synchronous shared-memory model where concurrent reads are allowed but no two processors can simultaneously attempt to write in the same memory location (even if they are trying to write the same thing). We show that a CREW-PRAM havingn 1/k processors can compute the following functions in O(k1+?) time: (i) the common tangents betweenP andQ, and (ii) the distance betweenP andQ (and hence a straight line separating them). The positive constant ? can be made arbitrarily close to zero. Even with a linear number of processors, it was not previously known how to achieve constant time performance for computing these functions. The algorithm for problem (ii) is easily modified to detect the case of zero distance as well.
We study efficient deterministic parallel algorithms on two models: restartable fail-stop CRCW PRAMs and asynchronous PRAMs. In the first model, synchronous processes are subject to arbitrary stop failures and restart...
详细信息
We study efficient deterministic parallel algorithms on two models: restartable fail-stop CRCW PRAMs and asynchronous PRAMs. In the first model, synchronous processes are subject to arbitrary stop failures and restarts determined by an on-line adversary and involving loss of private but not shared memory;the complexity measures are completed work (where processors are charged for completed fixed-size update cycles) and overhead ratio (completed work amortized over necessary work and failures). In the second model, the result of the computation is a serialization of the actions of the processors determined by an on-line adversary;the complexity measure is total work (number of steps taken by all processors). Despite their differences, the two models share key algorithmic techniques. We present new algorithms for the Write-All problem (in which P processors write ones into an array of size N) for the two models. These algorithms can be used to implement a simulation strategy for any N processor PRAM on a restartable fail-stop P processor CRCW PRAM such that it guarantees a terminating execution of each simulated N processor step, with O(log(2) N) overhead ratio, and O(min{N + P log(2) N + M log N, N . p(0.59)}) (subquadratic) completed work (where M is the number of failures during this step's simulation). This strategy has a range of optimality. We also show that the Write-All requires N + Omega(P log P) completed/total work on these models for P less than or equal to N. (C) 1996 Academic Press, Inc.
An edge switch is an operation on a graph (or network) where two edges are selected randomly and one of their end vertices is swapped with each other. Edge switch operations have important applications in graph theory...
详细信息
An edge switch is an operation on a graph (or network) where two edges are selected randomly and one of their end vertices is swapped with each other. Edge switch operations have important applications in graph theory and network analysis, such as in generating random networks with a given degree sequence, modeling and analyzing dynamic networks, and in studying various dynamic phenomena over a network. The recent growth of real-world networks motivates the need for efficient parallel algorithms. The dependencies among successive edge switch operations and the requirement to keep the graph simple (i.e., no self-loops or parallel edges) as the edges are switched lead to significant challenges in designing a parallel algorithm. Addressing these challenges requires complex synchronization and communication among the processors leading to difficulties in achieving a good speedup by parallelization. In this paper, we present distributed memory parallel algorithms for switching edges in massive networks. These algorithms provide good speedup and scale well to a large number of processors. A harmonic mean speedup of 73.25 is achieved on eight different networks with 1024 processors. One of the steps in our edge switch algorithms requires the computation of multinomial random variables in parallel. This paper presents the first non-trivial parallel algorithm for the problem, achieving a speedup of 925 using 1024 processors. Published by Elsevier Inc.
In this paper, it is shown that on the CREW model we can test whether a given permutation of 1,..., n is separable in O(log n) time with n processors. If d is the depth of the optimal (minimum) depth separating tree o...
详细信息
In this paper, it is shown that on the CREW model we can test whether a given permutation of 1,..., n is separable in O(log n) time with n processors. If d is the depth of the optimal (minimum) depth separating tree of a separable permutation, then a separating tree of depth 0(d) can be constructed on the CREW model in O(log n) time with O(n(2)) cost or alternatively in O(d log n) time with O(nd) cost. We can test whether the given separable permutation P of 1,..., k has a match in a permutation T of 1,..., n (n greater than or equal to k) in O(d log n) time with O(kn(4)) cost (the same as that of the serial algorithm). We can also find the number of matches of P in T in O(d log n) time with O(kn(6)) cost (the same as that of the serial algorithm). Both algorithms are for the CREW model. We also discuss how the space complexity of the existing serial algorithms for the decision problem can be reduced from O(kn(3)) to O(n(3) log k) and of the counting version from O(kn) to O(n(4) log k). (C) 2004 Elsevier B.V. All rights reserved.
暂无评论