Digital signal processing algorithms with multiple shift-invariant dependence graphs (DG's) can be mapped to field programmable gate array hardware in many different types of systolic processor arrays, Because of ...
详细信息
Digital signal processing algorithms with multiple shift-invariant dependence graphs (DG's) can be mapped to field programmable gate array hardware in many different types of systolic processor arrays, Because of the finite amount of hardware resources, the problem is to use a "right" amount of hardware in a specific configuration so to maximize the processing speed. In this paper, the problem of finding the right processor array configuration is formulated as a constrained optimization problem where the cost function includes not only the cost of individual processor arrays but also the cost of interfacing circuits. Three heuristic algorithms are presented for the optimization problem, Among them, both the Lth axial neighbor algorithm and the simulated annealing algorithm produce good results on a test case. Simulation results on the test case also indicate that the initial configuration is important in getting a good configuration for both algorithms. The Lth axial neighbor algorithm has the extra advantage of requiring less amount of performance tuning.
This paper presents a systematic methodology for exploring possible processor arrays of scalable radix 4 modular Montgomery multiplication algorithm. In this methodology, the algorithm is first expressed as a regular ...
详细信息
This paper presents a systematic methodology for exploring possible processor arrays of scalable radix 4 modular Montgomery multiplication algorithm. In this methodology, the algorithm is first expressed as a regular iterative expression, then the algorithm data dependence graph and a suitable affine scheduling function are obtained. Four possible processor arrays are obtained and analyzed in terms of speed, area, and power consumption. To reduce power consumption, we applied low power techniques for reducing the glitches and the Expected Switching Activity (ESA) of high fan-out signals in our processor array architectures. The resulting processor arrays are compared to other efficient ones in terms of area, speed, and power consumption.
This paper presents a systematic technique for expressing a string search algorithm as a regular iterative expression to explore all possible processor arrays for deep packet classification. The computation domain of ...
详细信息
This paper presents a systematic technique for expressing a string search algorithm as a regular iterative expression to explore all possible processor arrays for deep packet classification. The computation domain of the algorithm is obtained and three affine scheduling functions are presented. The technique allows some of the algorithm variables to be pipelined while others are broadcast over system-wide buses. Nine possible processor array structures are obtained and analyzed in terms of speed, area, power, and I/O timing requirements. Time complexities are derived analytically and through extensive numerical simulations. The proposed designs exhibit optimum speed and area complexities. The processor arrays are compared with previously derived processor arrays for the string matching problem.
A processor array with a reconfigurable bus system (abbreviated to PARBS) is a computation model which consists of a processor array and a reconfigurable bus system. It is a very powerful computation model in that man...
详细信息
A processor array with a reconfigurable bus system (abbreviated to PARBS) is a computation model which consists of a processor array and a reconfigurable bus system. It is a very powerful computation model in that many problems can be solved efficiently. The power of a computation model usually indicates how fast a problem can be solved under that model. In [16], Wang and Chen have shown that the two-dimensional PARBS is at least as powerful as the PRAM (parallel random access machine). That is, if a problem can be solved in O(f(n)) time on the PRAM with n processors and m memory cells, it can also be solved in O(f(n)) time on the two-dimensional PARBS with n*m processors. The reverse assertion has not been proven yet. The difficulty arises from the great flexibility in the configurations of reconfigurable system. In this paper, we show that a restricted version of the PARBS, called ORTHOGONAL PARBS, which includes all one-dimensional PARBS with two-neighbor connections and many two-dimensional PARBS with four-neighbor connections used by some researchers, can be simulated accordingly on the SUM CRCW PRAM. That is, if a problem can be solved in O(f(n)) time on the ORTHOGONAL PARBS with n processors, it can also be solved in O(f(n)) time and O(n) memory cells on the SUM CRCW PRAM.
Three parallel sorting applications and two list output protocols for the first phase of an external sort execute on a fine-grained many-core processor array that contains no algorithm-specific hardware acting as a co...
详细信息
Three parallel sorting applications and two list output protocols for the first phase of an external sort execute on a fine-grained many-core processor array that contains no algorithm-specific hardware acting as a co-processor with a variety of array sizes. Results are generated using a cycle-accurate model based on measured data from a fabricated many-core chip, and simulated for different processor array sizes. The data shows most energy efficient first-phase many-core sort requires over 65x lower energy than GNU C++ standard library sort performed on an Intel laptop-class processor and over 105x lower energy than a radix sort running on an Nvidia GPU. In addition, the highest first-phase throughput many-core sort is over 9.8x faster than the std::sort and over 14x faster than the radix sort. Both phases of a 10 GB external sort require 6.2x lower energyx time energy delay product than the std::sort and over 13x lower energyx time than the radix sort. (C) 2019 Elsevier Inc. All rights reserved.
A CMOS image sensor/processor chip fabricated in a 0.35 mu m CMOS technology is presented. The chip contains a general purpose software-programmable SIMD array of 128 x 128 processing elements. It executes over 20 GOP...
详细信息
A CMOS image sensor/processor chip fabricated in a 0.35 mu m CMOS technology is presented. The chip contains a general purpose software-programmable SIMD array of 128 x 128 processing elements. It executes over 20 GOPS while dissipating 240 mW of power and achieves pixel-processor density of 4 10 cells/mm(2). Performance and accuracy measurement results are given.
Fast Fourier transform (FFT), which has wide and variety application areas, requires very high speed computation. Since parallel processing of FFT is very attractive for high speed FFT computation, many processor arra...
详细信息
Fast Fourier transform (FFT), which has wide and variety application areas, requires very high speed computation. Since parallel processing of FFT is very attractive for high speed FFT computation, many processor arrays and multiprocessor systems have been proposed with efficient FFT algorithms. As a result of the recent development of VLSI technology, several massively parallel computers have been implemented on commercial basis. The MasPar, which is one of the SIMD type massively parallel computers, consists of an eight-neighbor processor array. This paper discusses parallel 1-D FFT algorithms on an eight-neighbor processor array. We propose three algorithms according to various data allocation methods. Then we estimate and evaluate their processing time. With the number of processors N = N(r) x N(r), processing time is estimated to be 2(N(r) - 2)t(c) + (log2N(r))t(b), where t(c) is the communication time between neighbor processors, and t(b) is the execution time for the radix 4 butterfly computation. We also compare these algorithms with the conventional radix 2 FFT algorithm implemented on a mesh processor array. It is shown that the radix 4 FFT algorithms are faster than the radix 2 algorithms. These algorithms get high speed FFT computation by combining the radix 4 FFT algorithm with the characteristics of the eight-neighbor processor array.
In this paper, we study the design of a coprocessor (CoP) to execute efficiently recursive algorithms with uniform dependencies. Our design is based on two objectives: 1) fixed bandwidth to main memory (MM) and 2) sca...
详细信息
In this paper, we study the design of a coprocessor (CoP) to execute efficiently recursive algorithms with uniform dependencies. Our design is based on two objectives: 1) fixed bandwidth to main memory (MM) and 2) scalability to higher performance without increasing MM bandwidth. Our CoP has an access unit (AU) organized as multiple queues, a processor array (PA) with regularly connected processing elements (PEs), and input/output networks for data routing. Our design is unique because it addresses input/output bottleneck and scalability, two of the most important issues in integrating processor arrays in current systems. To allow processor arrays to be widely usable, they must be scalable to high performance with little or no impact on the supporting memory system. The use of multiple queues in AU also eliminates the use of explicit data addresses, thereby simplifying the design of the control program. We present a mapping algorithm that partitions a data dependence graph (DG) of an application into regular blocks, sequences the blocks through AU, and schedules the execution of the blacks, one at a time, on PA. We show that our mapping procedure minimizes the amount of communication between blocks in the partitioned DG, and sequences the blocks through AU to reduce the communication between AU and MM. Using the matrix-product and transitive-closure applications, we study design trade-offs involving 1) division of a fixed chip area between PA and AU, and 2) improvements in speedup with respect to increases in chip area. Our results show, for a fixed chip area, 1) that there is little degradation in throughput in using it linear PA as compared to a PA organized as a square mesh, and 2) that the design is not sensitive to the division of chip area between PA and AU. We further show that, for a fixed throughput, there is an inverse square root relationship between speedup and total chip area. Our study demonstrates the feasibility of a low-cost, memory bandwidth-limite
We propose a novel, low-area, high-speed architecture for the basic operations over GF(2(m)). The proposed architecture is a processor array based, which utilizes the most significant bit multiplication algorithm and ...
详细信息
ISBN:
(纸本)9781424414307
We propose a novel, low-area, high-speed architecture for the basic operations over GF(2(m)). The proposed architecture is a processor array based, which utilizes the most significant bit multiplication algorithm and polynomial basis. A design space exploration to optimize the area and speed of the proposed architecture was done. We use the National Institute of Standard and Technology recommended polynomials, which makes our design secure and more suitable for cryptographic applications. The proposed architecture is implemented for m is an element of {163,283,571} on a Xilinx XC2V4000 device to verify its functionality and measure its performance. We achieve a frequency of 264 MHz, which allows the architecture to calculate GF(2163) multiplication in 640ns and inversion in 14..357 mu s.
暂无评论