Computer vision (CV) is widely expected to be the next big thing in emerging *** many heterogeneous architectures for computer vision ***, plenty of data need to be transferred between different structures for heterogeneous *** long data transfer delay becomes the main problem limiting the processing speed for computer vision *** reducing data transfer delay and speeding up computer vision applications, a clustered data-driven array processor is proposed. A three-level pipelined processing element is designed which supports a two-buffer data flow interface and 8-bit, 16-bit, and 32-bit subtext parallel *** the same time, to accelerate transcendental function computation, a four-way shared pipelined transcendental function accelerator is designed, which is based on a Y-intercept adjusted piecewise linear segment algorithm. A distributed shared memory structure based on unified addressing is also *** verify efficiency of the architecture, some image processing algorithms are implemented on the proposed *** the proposed architecture has been implemented on a Xilinx ZC706 development *** same circuitry has been synthesized using SMIC 130 nm CMOS *** circuitry is able to run at 100 *** is 26.58 mm².
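As a rough illustration of piecewise linear approximation of a transcendental function, the Python sketch below builds a table of (slope, intercept) pairs over equal-width segments. The abstract does not describe the Y-intercept adjustment itself, so the adjustment used here (shifting each chord so its positive and negative errors balance) is only an assumption about what such a step might look like. The names `build_segments` and `evaluate` are hypothetical.

```python
import math


def build_segments(func, lo, hi, n_segments, samples=64):
    """Return one (slope, intercept) pair per equal-width segment of [lo, hi]."""
    width = (hi - lo) / n_segments
    table = []
    for s in range(n_segments):
        x0 = lo + s * width
        x1 = x0 + width
        # Start from the chord through the two segment endpoints.
        slope = (func(x1) - func(x0)) / width
        intercept = func(x0) - slope * x0
        # Assumed adjustment: sample the signed error of the chord and shift
        # the intercept so the largest positive and negative errors balance.
        errors = [
            func(x) - (slope * x + intercept)
            for x in (x0 + width * t / samples for t in range(samples + 1))
        ]
        intercept += (max(errors) + min(errors)) / 2
        table.append((slope, intercept))
    return table


def evaluate(table, lo, hi, x):
    """Look up the segment containing x and evaluate its line."""
    n = len(table)
    idx = min(int((x - lo) / (hi - lo) * n), n - 1)
    slope, intercept = table[idx]
    return slope * x + intercept


sine_table = build_segments(math.sin, 0.0, math.pi / 2, 16)
approx = evaluate(sine_table, 0.0, math.pi / 2, 0.5)
```

In hardware the table would typically sit in a small ROM and the evaluation would reduce to one multiply and one add per lookup, which is why this family of methods suits a shared pipelined accelerator.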
This paper discusses the design of a primary memory system for an array processor which allows parallel, conflict-free access to various slices of data (e.g., rows, columns, diagonals, etc.), and subsequent alignment of these data for processing. Memory access requirements for an array processor are discussed in general terms and a set of common requirements is defined. The ability to meet these requirements is shown to depend on the number of independent memory units and on the mapping of the data in these memories. Next, the need to align these data for processing is demonstrated and various alignment requirements are defined. Hardware which can perform this alignment function is discussed, e.g., permutation, indexing, switching or sorting networks, and a network (the omega network) based on Stone's shuffle-exchange operation [1] is presented. Construction of this network is described and many of its useful properties are proven. Finally, as an example of these ideas, an array processor is shown which allows conflict-free access and alignment of rows, columns, diagonals, backward diagonals, and square blocks in row or column major order, as well as certain other special operations.
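The shuffle-exchange idea behind the omega network can be sketched in a few lines of Python. This is a minimal model, assuming N = 2^n lines, n stages, and standard destination-tag routing (each stage applies a perfect shuffle, then the switch sets the low bit to the next destination bit, most significant first). The function names are hypothetical, and the conflict check is a simplified line-occupancy test rather than anything taken from the paper.

```python
def perfect_shuffle(index: int, n_bits: int) -> int:
    """Rotate an n_bits-wide line index left by one bit (perfect shuffle)."""
    mask = (1 << n_bits) - 1
    return ((index << 1) | (index >> (n_bits - 1))) & mask


def omega_route(source: int, dest: int, n_bits: int) -> list[int]:
    """Return the line occupied after each stage when routing source to dest."""
    position = source
    path = []
    for stage in range(n_bits):
        position = perfect_shuffle(position, n_bits)
        # The exchange switch picks its upper (0) or lower (1) output
        # according to the current destination bit, MSB first.
        dest_bit = (dest >> (n_bits - 1 - stage)) & 1
        position = (position & ~1) | dest_bit
        path.append(position)
    return path


def is_conflict_free(permutation: list[int], n_bits: int) -> bool:
    """True if no two routes occupy the same line after any stage."""
    paths = [omega_route(s, d, n_bits) for s, d in enumerate(permutation)]
    return all(
        len({p[stage] for p in paths}) == len(paths) for stage in range(n_bits)
    )


cyclic_shift = [(s + 1) % 8 for s in range(8)]   # uniform shift by one
bit_reversal = [0, 4, 2, 6, 1, 5, 3, 7]          # 3-bit index reversal
```

In this model `is_conflict_free(cyclic_shift, 3)` holds, while the bit-reversal permutation collides after the first stage, which illustrates why only some alignments pass through the network in a single pass.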
It is now possible to obtain a number of so-called array processors to attach to existing computer systems, to increase the computing power and speed available. In this paper, the application of one such array processor to image processing in electron microscopy is considered, and some of the practical experience thus gained is reported.
The Berkeley array processor is a special-purpose computer designed to perform the operations of correlation, convolution, recursive filtering, matrix multiplication, as well as a variant of the Cooley-Tukey algorithm, and others. This note describes the logical organization and performance of this device.
High-quality image reconstruction algorithms are of special importance for tomographic applications. This paper presents the register transfer level and the VLSI design of a special-purpose array processor which realizes a tomographic algorithm that produces higher-quality reconstructions than other well-known algorithms. The operation of the array processor is pipelined and, most importantly, the communication delays have been eliminated by overlapping arithmetic and logic operations with data transfer. The design and operation of the array processor, which fully exploits the special features of the algorithm to optimize the units and subunits, leads to high hardware utilization. In contrast to other attempts, the number of units is limited, resulting in a reasonably sized hardware system that achieves real-time reconstruction. The paper also presents a performance analysis of the proposed architecture.
ISBN (print): 9781467390910
Matrix computations are a fundamental tool in scientific and engineering applications. Among many such applications, Convolutional Neural Networks (CNNs), which can be computed effectively by matrix-matrix multiplications, are becoming popular, and an efficient implementation of CNNs is highly important. In this study, we have designed a parallel processor for matrix computations using a torus interconnect topology, and we implemented Cannon's algorithm for matrix-matrix multiply-add. We have evaluated the scalability of the proposed processor on a reconfigurable FPGA platform. More precisely, the designed processor, with 8 x 8 functional units and 16-bit floating-point multiply-add units, was evaluated on a Cyclone IV FPGA chip, with a performance of 27 GFlops. We also implemented CNN calculations on our processor and compared the matrix-based approach with our proposed method. As a result, our method is 25 times faster than the matrix-based approach when the processor has 8 x 8 functional units, the image size is 32 x 32, and the filter size is 5 x 5.
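Cannon's algorithm maps naturally onto a torus because every step only shifts data to a neighbouring unit. The Python sketch below simulates it sequentially, assuming one matrix element per functional unit (a real implementation would normally hold a block per unit). The function name `cannon_matmul` is hypothetical and the code is not taken from the paper.

```python
def cannon_matmul(a, b):
    """Multiply two n x n matrices by simulating Cannon's algorithm on an n x n torus."""
    n = len(a)
    # Initial alignment: row i of A is rotated left by i positions,
    # column j of B is rotated up by j positions.
    a_loc = [[a[i][(j + i) % n] for j in range(n)] for i in range(n)]
    b_loc = [[b[(i + j) % n][j] for j in range(n)] for i in range(n)]
    c = [[0] * n for _ in range(n)]
    for _ in range(n):
        # Every unit multiplies its local pair and accumulates.
        for i in range(n):
            for j in range(n):
                c[i][j] += a_loc[i][j] * b_loc[i][j]
        # Shift A one step left and B one step up along the torus links.
        a_loc = [[a_loc[i][(j + 1) % n] for j in range(n)] for i in range(n)]
        b_loc = [[b_loc[(i + 1) % n][j] for j in range(n)] for i in range(n)]
    return c


product = cannon_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
# product == [[19, 22], [43, 50]]
```

After the alignment step, unit (i, j) holds A[i][(i+j+k) mod n] and B[(i+j+k) mod n][j] at step k, so n steps cover the full dot product using nearest-neighbour communication only.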
Point mutation of amino acids is a means used by biotechnologists to improve the performance of proteins. To study a point-mutated polypeptide, one requires its global minimum energy conformation. This conformation can be determined by molecular dynamics via Langevin's equations of motion. Molecular dynamics simulations are among the most difficult problems to parallelize in a scalable manner. We provide a method for defining a special-purpose 3D array processor architecture for the molecular dynamics simulation of point-mutated polypeptides. The architecture is derived from a spatial decomposition of a known conformation of the point-mutated polypeptide or the native conformation of the given protein. By using an approximation scheme for the deterministic forces, the interprocessor communication can be kept local. The architecture affords a simple distributed load balancer and is scalable. The computational workload of the array processor architecture to perform molecular dynamics simulations under realistic conditions is addressed. An example architecture is given for point-mutated penicillin amidase.
ISBN (print): 9789881476890
The intensive calculations of convolution and the invalid calculations caused by inserted zeros in deconvolution make Generative Adversarial Networks (GANs) difficult to accelerate in hardware. By analyzing the network structure and calculation flows of GANs, this paper proposes a parallel reconfiguration scheme for convolution and deconvolution. Based on the Dynamic Programmable Reconfigurable Array Processor (DPRAP), on a 4x4 array of processing elements (PEs), flexible switching between the two convolution modes is driven by an H-tree-controlled reconfiguration mechanism. The proposed scheme is verified on the DPRAP. The experimental results show that, compared with other FPGA schemes, resource occupation can be reduced by up to 90% at a working frequency of 150 MHz. The authors also report a significant performance improvement.
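To make the "invalid calculations" concrete, the Python sketch below computes a 1D strided deconvolution (transposed convolution) in two ways: by inserting zeros and running an ordinary sliding-window convolution, and by scattering each input directly into the output. It is a toy model for counting multiplications, not the DPRAP scheme from the paper, and both function names are hypothetical.

```python
def deconv_zero_insert(x, w, stride):
    """Deconvolution via zero insertion; returns (output, multiply count)."""
    k = len(w)
    upsampled = []
    for idx, value in enumerate(x):
        upsampled.append(value)
        if idx < len(x) - 1:
            upsampled.extend([0] * (stride - 1))  # zeros between inputs
    padded = [0] * (k - 1) + upsampled + [0] * (k - 1)
    flipped = w[::-1]
    out, multiplies = [], 0
    for start in range(len(padded) - k + 1):
        acc = 0
        for t in range(k):
            acc += padded[start + t] * flipped[t]  # many operands are zero
            multiplies += 1
        out.append(acc)
    return out, multiplies


def deconv_scatter(x, w, stride):
    """Same result, multiplying only real inputs; returns (output, multiply count)."""
    k = len(w)
    out = [0] * ((len(x) - 1) * stride + k)
    multiplies = 0
    for i, value in enumerate(x):
        for j in range(k):
            out[i * stride + j] += value * w[j]
            multiplies += 1
    return out, multiplies


dense_out, dense_mults = deconv_zero_insert([1, 2, 3], [1, 10, 100], 2)
sparse_out, sparse_mults = deconv_scatter([1, 2, 3], [1, 10, 100], 2)
```

For this input both versions produce the same seven outputs, but the zero-insertion version performs 21 multiplications against 9 for the scatter version. The difference is the wasted work that a reconfigurable scheme would aim to avoid.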
This paper presents a novel architecture for an array processor, called LEAP, which is a set of simple processing *** targeted programs are perfect innermost *** using the technique called if-conversion, the control dependence can be converted to data dependence to prediction *** an innermost loop can be represented by a data dependence graph, where the vertex supports the expression statements of high-level languages. By mapping the data dependence graph to fixed PEs, each PE steps the loop iteration automatically and independently at the *** execution forms multiple pipelining *** simulation of four loops of LFK shows the effectiveness of the LEAP architecture compared with traditional CISC and RISC architectures.
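If-conversion in general can be illustrated with a small Python sketch: the branch inside a loop body is replaced by a predicate value, both arms are computed unconditionally, and the predicate selects the result. This shows the generic technique only; it says nothing about how LEAP encodes predicates, and the function names and the clamping example are hypothetical.

```python
def loop_with_branch(values, threshold):
    """Original loop: the result depends on control flow."""
    out = []
    for v in values:
        if v > threshold:
            r = v - threshold
        else:
            r = 0
        out.append(r)
    return out


def loop_if_converted(values, threshold):
    """If-converted loop: the branch becomes a data-dependent select."""
    out = []
    for v in values:
        p = v > threshold            # predicate computed as ordinary data
        taken = v - threshold        # both arms are evaluated unconditionally
        not_taken = 0
        r = p * taken + (not p) * not_taken  # select driven by the predicate
        out.append(r)
    return out


samples = [3, 8, 5, 12, 1]
converted = loop_if_converted(samples, 5)
```

Because the converted body has no branches, each statement becomes a vertex in a data dependence graph with the predicate as one more input edge, which is the form that can be mapped onto fixed processing elements.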