In this paper, we propose an efficient algorithm for parallel prefix computation in recursive dual-net, a newly proposed network. The recursive dual-net RDNk (B) for k > 0 has (2n(0))(2k) /2 nodes and d(0) + k link...
详细信息
ISBN:
(纸本)9783642131189
In this paper, we propose an efficient algorithm for parallel prefix computation in recursive dual-net, a newly proposed network. The recursive dual-net RDNk (B) for k > 0 has (2n(0))(2k) /2 nodes and d(0) + k links per node, where no and do are the number of nodes and the node-degree of the base network B, respectively. Assume that each node holds one data item, the communication and computation time complexities of the algorithm for parallel prefix computation in RDNk (B),k > 0, are 2(k+1) - 2 + 2k * T-comm(0) and 2(k+1) - 2 + 2(k) * T-comp(0), respectively, where T-comm(0) and T-comp(0) are the communication and computation time complexities of the algorithm for parallel prefix computation in the base network B, respectively.
This paper presents the design and implementation of an efficient reconfigurable parallel prefix computation hardware on field-programmable gate arrays (FPGAs). The design is based on a pipelined dataflow algorithm, a...
详细信息
This paper presents the design and implementation of an efficient reconfigurable parallel prefix computation hardware on field-programmable gate arrays (FPGAs). The design is based on a pipelined dataflow algorithm, and control logic is added to reconfigure the system for arbitrary parallelism degree. The system receives multiple input streams of elements in parallel and produces output streams in parallel. It has an advantage of controlling the degree of parallelism explicitly at run time. The time complexity of the design is O(d+(N-d)/d), where d and N are parallelism degree and stream size, respectively. When the stream size is sufficiently larger than the initial trigger time of the pipeline (d), the time complexity becomes O(N/d). Unlike the prefixcomputation circuits found in the literature, the design is scalable for different problem sizes including unknown sized data. The design is modular based on a finite state machine, and implemented and tested for target FPGA devices Xilinx Spartan2S XC2S300EFT256-6Q and XC2S600EFG676-6.
Associativity of a binary operation allows many applications that are not possible with a non-associative binary operation. We propose a method to transform a binary expression into an associative one and present two ...
详细信息
Associativity of a binary operation allows many applications that are not possible with a non-associative binary operation. We propose a method to transform a binary expression into an associative one and present two related applications.
This pearl develops a statement about parallel prefix computation in the spirit of Knuth's 0-1-Principle for oblivious sorting algorithms. It turns out that 0-1 is not quite enough here. The perfect hammer for the...
详细信息
ISBN:
(纸本)9781595936899
This pearl develops a statement about parallel prefix computation in the spirit of Knuth's 0-1-Principle for oblivious sorting algorithms. It turns out that 0-1 is not quite enough here. The perfect hammer for the nails we are going to drive in is relational parametricity.
This pearl develops a statement about parallel prefix computation in the spirit of Knuth's 0-1-Principle for oblivious sorting algorithms. It turns out that 0- 1 is not quite enough here. The perfect hammer for th...
详细信息
This pearl develops a statement about parallel prefix computation in the spirit of Knuth's 0-1-Principle for oblivious sorting algorithms. It turns out that 0- 1 is not quite enough here. The perfect hammer for the nails we are going to drive in is relational parametricity.
parallelprefix is an important technique that has been widely accepted in many area of scientific and engineering research. In this paper we propose an improved parallel prefix computation algorithm on n × n mes...
详细信息
parallelprefix is an important technique that has been widely accepted in many area of scientific and engineering research. In this paper we propose an improved parallel prefix computation algorithm on n × n mesh network that requires 2 n + 5 times. Our proposed algorithm can be compare with the traditional parallelprefix algorithm that requires 3 n + 2 time on same architecture.
The problem of digit set conversion for fixed radix is investigated for the case of converting into a non redundant, as well as into a redundant, digit set. Conversion may he from very general digit sets, and covers a...
详细信息
The problem of digit set conversion for fixed radix is investigated for the case of converting into a non redundant, as well as into a redundant, digit set. Conversion may he from very general digit sets, and covers as special cases multiplier recodings, additions, and certain multiplications. We generalize known algorithms for conversions into non redundant digit sets, as well as apply conversion to generalize the O(log) time algorithm for conditional sum addition using parallel prefix computation, and a comparison is made with standard carry-lookahead techniques. Examples on multi-operand addition are used to illustrate the generality of this approach. O(1) time algorithms for converting into redundant digit sets are generalized based on a very simple lemma, which provides a framework for all conversions into redundant digit sets. Applications in multiplier recoding and partial product accumulation are used here as exemplifications.
prefixcomputation is a basic operation at the core of many important applications, e.g., some of the Grand Challenge problems, circuit design, digital signal processing, graph optimizations, and computational geometr...
详细信息
prefixcomputation is a basic operation at the core of many important applications, e.g., some of the Grand Challenge problems, circuit design, digital signal processing, graph optimizations, and computational geometry(1). In this paper, we present new and strict time-optimal parallel schedules for prefixcomputation with resource constraints under the concurrent-read-exclusive-write (CREW) parallel random access machine (PRAM) model. For prefix of N elements on p processors (p independent of N) when N > p(p + 1)/2, we derive Harmonic Schedules that achieve the strict optimal time (steps), inverted left perpendicular 2(N - 1)/(p + 1) inverted right perpendicular. We also derive Pipelined Schedules that have better program-space efficiency than the Harmonic Schedule, yet only require a small constant number of steps more than the optimal time achieved by the Harmonic Schedule. Both the Harmonic Schedules and the Pipelined Schedules are simple and easy to implement. For prefix of N elements on p processors (p independent of N) where N less than or equal to p(p + 1)/2, the Harmonic Schedules are not time-optimal. For these cases, we establish an optimization method for determining key parameters of time-optimal schedules, based on connections between the structure of parallelprefix and Pascal's triangle. Using the derived parameters, we devise an algorithm to construct such schedules. For a restricted class of values of Nand p, we prove that the constructed schedules are strictly time-optimal. We also give strong empirical evidence that our algorithm constructs strict time-optimal schedules for all cases where N less than or equal to p(p + 1)/2.
In this paper, we use the renewal theory to develop a Poisson random number algorithm without restart. A parallel Poisson random number generator is designed based on this algorithm and prefixcomputation. This genera...
详细信息
In this paper, we use the renewal theory to develop a Poisson random number algorithm without restart. A parallel Poisson random number generator is designed based on this algorithm and prefixcomputation. This generator iteratively produces m Poisson random numbers with mean mu in average time complexity O([m mu/n]f(n, p)) on EREW PRAM, where f(n,p) is the time for computing an n-element parallelprefix on p processors in each iteration, assuming that parallel uniform random numbers can be generated at the rate of one number per unit time per processor. If n is selected near m mu, it achieves linear speedup when p is small and the average time complexity is O(log(m mu)) when p is O(m mu).
parallel programming, whether imperative or functional, has long focused on arrays as the central data type. Meanwhile, typed functional programming has explored a variety of data types, including lists and various fo...
详细信息
parallel programming, whether imperative or functional, has long focused on arrays as the central data type. Meanwhile, typed functional programming has explored a variety of data types, including lists and various forms of trees. Generic functional programming decomposes these data types into a small set of fundamental building blocks: sum, product, composition, and their associated identities. Definitions over these few fundamental type constructions then automatically assemble into algorithms for an infinite variety of data types-some familiar and some new. This paper presents generic functional formulations for two important and well-known classes of parallel algorithms: parallel scan (generalized prefix sum) and fast Fourier transform (FFT). Notably, arrays play no role in these formulations. Consequent benefits include a simpler and more compositional style, much use of common algebraic patterns and freedom from possibility of run-time indexing errors. The functional generic style also clearly reveals deep commonality among what otherwise appear to be quite different algorithms. Instantiating the generic formulations, two well-known algorithms for each of parallel scan and FFT naturally emerge, as well as two possibly new algorithms.
暂无评论