This letter presents a parallel processor for real-time analysis of binary images. Its processing throughput exceeds 10^9 cellular logic operations per second.
Novel circuits and a design methodology for a massively parallel processor based on the matrix architecture are introduced. A fine-grained processing element (PE) circuit for high-throughput MAC operations based on Booth's algorithm enhances the performance of a 16-bit fixed-point signed MAC, which operates at up to 30.0 GOPS/W. Dedicated I/O interface circuits are designed to convert the direction of data access and to support the interleaved memory architecture, and they are implemented to maximize the efficiency of the processor core. Power management techniques for suppressing current peaks and reducing average power consumption are proposed to enhance the robustness of the macro. The circuits and the design methodology proposed in this paper are attractive for achieving a high-performance, robust, massively parallel SIMD processor core for use in multimedia SoCs.
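As a rough software illustration of the Booth-based signed MAC mentioned above, the sketch below uses radix-2 Booth recoding. The radix, the function names `booth_multiply` and `mac`, and the unbounded accumulator are assumptions made for illustration; the abstract does not state which Booth variant or accumulator width the PE circuit uses.

```python
def booth_multiply(a: int, b: int, bits: int = 16) -> int:
    """Multiply two signed `bits`-wide integers with radix-2 Booth recoding.

    Hypothetical helper: the radix is an assumption, not taken from the paper.
    """
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    if not (lo <= a <= hi and lo <= b <= hi):
        raise ValueError("operand does not fit in the given bit width")

    b_pattern = b & ((1 << bits) - 1)  # two's-complement bit pattern of b
    product = 0
    prev = 0  # implicit bit b[-1] = 0
    for i in range(bits):
        cur = (b_pattern >> i) & 1
        # Booth digit is b[i-1] - b[i], one of -1, 0, +1:
        # +1 adds the shifted multiplicand, -1 subtracts it, 0 skips it.
        digit = prev - cur
        product += digit * (a << i)
        prev = cur
    return product


def mac(acc: int, a: int, b: int, bits: int = 16) -> int:
    """Multiply-accumulate: acc + a * b (accumulator width left unbounded here)."""
    return acc + booth_multiply(a, b, bits)
```

For example, `mac(10, -7, 3)` adds the Booth product of -7 and 3 to an accumulator holding 10.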
A new class of interconnection network, called shift-net, is presented. The network is realized as a large-scale shifter operating in SIMD fashion. The shifter is built from log N stages of select switches, each stage consisting of N two-input selectors, where N is the number of processor elements. The simple control and hardware structure of shift-net permit the implementation of highly parallel processor systems. Arbitrary-distance access between processor elements eases the limitations for numerical applications. For simpler networks, power shift-net is proposed, which restricts the access pattern to powers of two, like the hypercube network.
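The log N-stage structure described above can be sketched in software as follows. This is a minimal model under stated assumptions: the shift is taken to be cyclic, stage k is taken to shift by 2^k, and the name `shift_net` is hypothetical. The abstract only specifies log N stages of N two-input selectors.

```python
def shift_net(data, distance):
    """Model a log2(N)-stage shifter built from two-input selectors.

    Assumptions (not from the abstract): the shift is cyclic, and stage k
    either passes data straight through or shifts it by 2**k positions.
    """
    n = len(data)
    if n == 0 or n & (n - 1):
        raise ValueError("number of processor elements must be a power of two")

    stages = n.bit_length() - 1  # log2(N)
    distance %= n
    out = list(data)
    for k in range(stages):
        step = 1 << k
        select = (distance >> k) & 1  # one control bit shared by the whole stage
        # Each of the N selectors picks either its own input or the input
        # located `step` positions away.
        out = [out[(i - step) % n] if select else out[i] for i in range(n)]
    return out
```

Because every selector in a stage shares one control bit, a shift by any distance needs only log N control bits, which matches the ease of control the abstract emphasizes.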
The paper presents a new method for load-flow analysis which is particularly appropriate for very large power systems. The objective has been to reduce the computation time for the analysis of a given system by tearing the network into a number of independent subsystems. The subsystem programs may be executed in parallel, resulting in a considerable time saving for on-line system control. The main advantage of the new algorithm is that the computation efficiency of the main or co-ordinating program is significantly improved. Results are presented which indicate that, for average to very large systems, the net solution time using the suggested technique is less than half of that required by a centralised decoupled method to achieve the same accuracy.
In biomedical engineering, computing time is a well-known limitation in real-time signal processing, especially when microcomputers are involved. Because the integer multiply instruction executes relatively slowly, the inner product is very time consuming. Furthermore, the data manipulations needed for the inner product calculation take a considerable amount of time. To overcome this bottleneck, we have developed a special-purpose processor for the fast calculation of inner products. The apparatus is called VIPER-II. VIPER-II calculates a number of inner products of vectors consisting of 16-bit integer-valued arrays whose length may vary from 64 to 512 elements.
We show that it is sufficient to use O(log N) logic gates to generate N memory addresses simultaneously in O(1) time for a memory system consisting of N memory modules which allows parallel, conflict-free access to any of the rows, columns, forward and backward diagonals of an N × N matrix, where N = 2^n with any n > 1. This substantially improves the previously known result that requires O(N log N) logic gates and O(log N) time for parallel generation of memory addresses.
Much of today's data is multidimensional; such data are called tensors. Tensor computations have been applied in different fields and various software libraries have been developed. However, little attention has been paid to developing a hardware architecture to accelerate tensor computations. In this article, an efficient and unified processing element (PE) array for 3-D tensor computation is demonstrated. Our PE array is optimized for thin and tall tensor-matrix multiplication and two types of tensor times matrices chain (TTMc) operations. Our design is evaluated in three case studies and compared with the state-of-the-art design. By using computation partition and rearrangement, data movement between the field-programmable gate array (FPGA) and off-chip DDR memory can be reduced by O(I-2), where I is the maximum range among all the dimensions of the data tensor. For the TTMc implementation, clock frequency has been increased by 18% compared with the state-of-the-art implementation on the same FPGA chip. An experiment on 3-D volumetric data set rendering by the tensor approximation method is conducted for demonstration. For the bricks reconstruction process, the runtime decreased by 50%, i.e., two times faster, on our FPGA implementation compared with that running on a GPU. In CANDECOMP/PARAFAC decomposition, for one iteration, the runtime has been decreased by up to 93% compared with programs implemented with TensorLy, which is a Python library.
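For readers unfamiliar with the operations named above, the following pure-Python sketch shows a mode-wise tensor-times-matrix product and a chain of such products (TTMc) on a 3-D tensor stored as nested lists. The function names `ttm` and `ttm_chain`, the zero-based mode numbering, and the ascending chaining order are illustrative assumptions. The abstract does not specify which two TTMc variants the PE array supports, and this reference code says nothing about the hardware mapping.

```python
def ttm(x, u, mode):
    """Mode-`mode` product of a 3-D tensor `x` with a matrix `u`.

    `x` is a nested list of shape (I0, I1, I2); `u` has shape (J, I_mode).
    The result replaces dimension `mode` of `x` by J.
    """
    dims = (len(x), len(x[0]), len(x[0][0]))
    if len(u[0]) != dims[mode]:
        raise ValueError("matrix columns must match the tensor's mode size")

    out_dims = list(dims)
    out_dims[mode] = len(u)
    y = [[[0] * out_dims[2] for _ in range(out_dims[1])]
         for _ in range(out_dims[0])]

    for a in range(out_dims[0]):
        for b in range(out_dims[1]):
            for c in range(out_dims[2]):
                idx = [a, b, c]
                j = idx[mode]  # row of `u` that produces this output entry
                total = 0
                for i in range(dims[mode]):
                    idx[mode] = i  # sweep along the contracted mode
                    total += u[j][i] * x[idx[0]][idx[1]][idx[2]]
                y[a][b][c] = total
    return y


def ttm_chain(x, mats):
    """Apply several mode products in ascending mode order.

    `mats` maps a mode index (0, 1 or 2) to the matrix for that mode.
    """
    for mode, u in sorted(mats.items()):
        x = ttm(x, u, mode)
    return x
```

Calling `ttm_chain(x, {1: u1, 2: u2})` contracts every mode except the first, which is the pattern that typically appears inside Tucker-style decompositions.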
A network-on-chip (NoC) based parallel processor is presented for bio-inspired real-time object recognition with a visual attention algorithm. It contains an ARM10-compatible 32-bit main processor, 8 single-instruction multiple-data (SIMD) clusters with 8 processing elements in each cluster, a cellular neural network based visual attention engine (VAE), a matching accelerator, and a DMA-like external interface. The VAE with a 2-D shift register array rapidly finds salient objects across the entire image. Then, the parallel processor performs further detailed image processing within only the pre-selected attention regions. The low-latency NoC employs dual channels, adaptive switching and packet-based power management, providing 76.8 GB/s aggregated bandwidth. The 36 mm² chip contains 1.9 M gates and 226 kB SRAM in a 0.13 µm 8-metal CMOS technology. The fabricated chip achieves a peak performance of 125 GOPS and 22 frames/sec object recognition while dissipating 583 mW at 1.2 V.
ISBN (print): 9781467308762
The genetic algorithm (GA) is an optimization algorithm based on the idea of biological evolution. GA can be applied to various combinatorial optimization problems. This paper proposes a parallel processor for the distributed genetic algorithm (DGA) using redundant binary numbers. Because a redundant binary number has redundancy, solutions can be expressed in a greater variety of ways. For this reason, the algorithm is expected to find the optimized solution more easily, and the error rate is expected to decrease. Since DGA is a parallel algorithm, its performance can be improved by using a dedicated parallel processor. The effectiveness of the proposed processor was confirmed by simulations and experiments on an FPGA circuit board.
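The redundancy referred to above can be made concrete with a small sketch. It assumes the usual redundant binary (signed-digit) convention in which each digit is -1, 0 or +1 with weights that are powers of two; the helper names `rb_value` and `rb_encodings` are hypothetical, and the paper's actual chromosome encoding is not given in the abstract.

```python
from itertools import product


def rb_value(digits):
    """Value of a redundant binary number given most significant digit first.

    Assumes the signed-digit set {-1, 0, 1}; this convention is an
    assumption, not taken from the paper.
    """
    value = 0
    for d in digits:
        if d not in (-1, 0, 1):
            raise ValueError("digits must be -1, 0 or 1")
        value = 2 * value + d
    return value


def rb_encodings(target, width):
    """All `width`-digit redundant binary strings that encode `target`."""
    return [digits for digits in product((-1, 0, 1), repeat=width)
            if rb_value(digits) == target]
```

With three digits, the value 3 has three encodings, (0, 1, 1), (1, 0, -1) and (1, -1, 1), whereas ordinary binary allows only one. This is the sense in which one solution can correspond to several chromosomes.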
ISBN (print): 1424400066
Novel circuits and a design methodology for a massively parallel processor based on the matrix architecture are introduced. A fine-grained processing element (PE) circuit for high-throughput MAC operations based on Booth's algorithm enhances the performance of a 16-bit fixed-point signed MAC, which operates at up to 30.0 GOPS/W. Dedicated I/O interface circuits are designed to convert the direction of data access and to support the interleaved memory architecture, and they are implemented to maximize the efficiency of the processor core. Power management techniques for suppressing current peaks and reducing average power consumption are proposed to enhance the robustness of the macro. The circuits and the design methodology proposed in this paper are attractive for achieving a high-performance, robust, massively parallel SIMD processor core for use in multimedia SoCs.