In this paper, the architecture for the realization of a new, highly parallel, block-type, order-recursive algorithm for LS FIR filtering is introduced. A linear array of p processing elements is used, implementing this algorithm for order p in linear time, O(p). Using a suitable scheduling of the algorithm and a pipelined divider, a threefold reduction of hardware is achieved, without significant degradation in time performance, compared to the fully parallel realization. Furthermore, the computation of the correction sums, needed for the initialization of the system, is performed on the existing linear array, resulting in additional hardware savings.
This paper proposes a high-speed multi-level-parallel array processor for programmable vision *** processor includes a 2-D pixel-parallel processing element (PE) array and a 1-D row-parallel row processor (RP) *** two arrays both operate in a single-instruction multiple-data (SIMD) fashion and share a common instruction *** sizes of the arrays are scalable according to dedicated *** PE array, each PE can communicate not only with its nearest neighbor PEs, but also with the next-nearest neighbor PEs in diagonal *** connection can help to speed up local operations in low-level image *** the other hand, global operations in mid-level processing are accelerated by the skipping chain and binary boosters in RP *** array processor was implemented on an FPGA device, and was successfully tested for various algorithms, including real-time face detection based on PPED *** results show that the image processing speed of the proposed processor is much higher than that of state-of-the-art digital vision chips.
A novel architecture named the Window-Memory Sharing processor array is proposed, which targets window operations in image processing. The architecture can be used not only for conventional image filtering, but also in practical window operations such as motion vector search in MPEG2. The derived architecture is flexible enough to satisfy the user's requirements for either area or speed.
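To make the notion of a window operation concrete, the following is a minimal software sketch of full-search motion vector estimation. It assumes the sum of absolute differences (SAD) as the matching cost, which the abstract does not specify, and the function names `sad` and `motion_search` are hypothetical. It illustrates the computation only, not the processor array that performs it.

```python
def sad(block, frame, top, left):
    """Sum of absolute differences between block and the frame patch at (top, left)."""
    return sum(
        abs(block[r][c] - frame[top + r][left + c])
        for r in range(len(block))
        for c in range(len(block[0]))
    )


def motion_search(block, frame, top, left, radius):
    """Exhaustively test every displacement within +/- radius of (top, left).

    Returns (cost, dy, dx) for the lowest-cost candidate that lies fully
    inside the frame, or None if no candidate fits.
    """
    bh, bw = len(block), len(block[0])
    best = None
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            y, x = top + dy, left + dx
            if 0 <= y <= len(frame) - bh and 0 <= x <= len(frame[0]) - bw:
                cost = sad(block, frame, y, x)
                if best is None or cost < best[0]:
                    best = (cost, dy, dx)
    return best


if __name__ == "__main__":
    # Hypothetical 6x6 test frame with a distinct value at every pixel.
    frame = [[r * 6 + c for c in range(6)] for r in range(6)]
    block = [row[2:4] for row in frame[3:5]]  # 2x2 patch taken at (3, 2)
    print(motion_search(block, frame, top=2, left=2, radius=1))
```

Every candidate position reads an overlapping window of the same frame, which is the kind of data reuse a shared-memory array can exploit.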
The two-dimensional (2-D) sliding discrete Fourier transform (DFT) algorithm can realize sliding spectrum analysis and real-time signal processing. In this paper, its fixed-point error analysis is carried out to form a theoretical basis for hardware implementation. The analysis models the error as additive white noise and arrives at the signal-to-noise ratio (SNR) successively. Then, a simplified method for the 2-D sliding DFT based on the vector radix (VR) algorithm is introduced. With this approach the fixed-point error can be reduced to the same scale as that of the 2-D FFT. As an example, the architecture and error analysis of an 8×8 2-D sliding DFT array processor based on the VR-4×4 algorithm are presented. The idea can be extended to larger DFT sizes. Finally, some comparisons are derived. (C) 1999 Elsevier Science B.V. All rights reserved.
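As background for the sliding DFT idea, the sketch below shows the standard one-dimensional sliding update, in which each bin of the new window is obtained from the old bin with one subtraction, one addition, and one complex rotation. This is the textbook 1-D recurrence, not the 2-D vector-radix method of the paper, and it uses floating point rather than fixed point. The function names are hypothetical.

```python
import cmath


def dft(x):
    """Direct O(N^2) DFT, used here only as a reference."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def slide_dft(spectrum, oldest, newest):
    """Advance an N-point DFT by one sample.

    spectrum: DFT of the current window x[m], ..., x[m+N-1]
    oldest:   x[m], the sample leaving the window
    newest:   x[m+N], the sample entering the window
    Returns the DFT of x[m+1], ..., x[m+N].
    """
    n = len(spectrum)
    return [
        (spectrum[k] - oldest + newest) * cmath.exp(2j * cmath.pi * k / n)
        for k in range(n)
    ]


if __name__ == "__main__":
    signal = [1, 2, 0, -1, 3, 5, -2, 4]  # hypothetical test data
    n = 4
    spec = dft(signal[:n])
    spec = slide_dft(spec, signal[0], signal[n])
    ref = dft(signal[1:n + 1])
    print(max(abs(a - b) for a, b in zip(spec, ref)))
```

Because every update reuses the previous result, rounding errors carry forward from window to window, which is one reason a fixed-point error analysis matters for this class of algorithm.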
In this paper, we present a novel architecture, named Window-MSPA, which targets window operations in image processing. We have previously developed a Memory Sharing Processor Array (MSPA) for fast array processing with regular iterative algorithms. Window-MSPA tries to optimize the data I/O ports and the number of processing elements so as to reduce hardware cost. The input scheme of image data is restricted to row-by-row input, which simplifies the I/O architecture. Under this practical I/O restriction, the fastest processing is achieved. In this paper, we present the general Window-MSPA design methodology for a wide variety of applications. As a practical application, we have already reported the design of an MP@HL MPEG2 Motion Estimator LSI [13]. Design formulas for the Window-MSPA architecture are given for various sizes of window operations in image processing. Thus, the derived architecture is flexible enough to satisfy the user's requirements for either area or speed.
Selection of elements and alignment of operands are fundamental operations on data, just as are arithmetic operations. Whereas sophisticated algorithms have been devised for the latter, vector processors usually lack a flexible and efficient routing unit. This is especially true of SIMD computers, to which the present study is devoted. Examples of required manipulations are: transfer, shift, diffusion, compression, expansion, mesh, perfect shuffle, and bit reversal. Using a method described in a previous paper of ours [15] we present algorithms to control a Benes network and perform these manipulations on vectors whose length is equal to the number of processing elements. Then we dispense with this constraint and propose a mechanism to rearrange vectors of any size, stored according to several schemes.
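Two of the manipulations listed above, perfect shuffle and bit reversal, can be stated compactly as index permutations. The sketch below gives plain software definitions under the usual conventions (interleaving the two halves; reversing the binary index). It does not reproduce the Benes network control algorithms of the paper, and the function names are hypothetical.

```python
def perfect_shuffle(v):
    """Interleave the first and second halves of v (length must be even)."""
    n = len(v)
    if n % 2:
        raise ValueError("length must be even")
    half = n // 2
    out = []
    for a, b in zip(v[:half], v[half:]):
        out.extend((a, b))
    return out


def bit_reversal(v):
    """Move element i to the index whose binary digits are those of i reversed."""
    n = len(v)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    bits = n.bit_length() - 1
    out = [None] * n
    for i, x in enumerate(v):
        # Reverse the fixed-width binary representation of i.
        j = int(format(i, f"0{bits}b")[::-1], 2) if bits else 0
        out[j] = x
    return out


if __name__ == "__main__":
    data = list(range(8))
    print(perfect_shuffle(data))  # [0, 4, 1, 5, 2, 6, 3, 7]
    print(bit_reversal(data))     # [0, 4, 2, 6, 1, 5, 3, 7]
```

On a SIMD machine each output position would be produced by a different processing element, so the permutation becomes a routing problem rather than a loop.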
A new architecture is presented to support the general class of real-time large-vocabulary speaker-independent continuous speech recognizers incorporating language models. Many such recognizers require multiple high-performance central processing units (CPU's) as well as high interprocessor communication bandwidth. This array processor provides a peak CPU performance of 2.56 giga-floating point operations per second (GFLOPS) as well as a high-speed communication network. In order to efficiently utilize these resources, algorithms were devised for partitioning speech models for mapping into the array processor. Also, a novel scheme is presented for a functional partitioning of the speech recognizer computations. The recognizer is functionally partitioned into six stages, namely, the linear predictive coding (LPC) based feature extractor, mixture probability computer, (phone) state probability computer, word probability computer, phrase probability computer, and traceback computer. Each of these stages is further subdivided as many times as necessary to fit the individual processing elements (PE's). The functional stages are pipelined and synchronized with the frame rate of the incoming speech signal. This partitioning also allows a multistage stack decoder to be implemented for reduction of computation. The fully configured array processor is composed of 128 PE's, each of which comprises a floating point digital signal processor, a local memory of 64-kilobyte by 32-bit words, and a custom communications device that permits each PE to talk with four adjacent PE's in a 2-D grid. A second communication network provides global communication between a host processor and each PE. Each node is programmable in the high-level C language. One recognizer we have implemented at AT&T uses 1759 phonelike units (PLU's). These units comprise phones, diphones, and triphones that are selected based on their frequency of occurrence in the training corpus. Each PLU is modeled with a thr
This paper describes a VLSI array processor system designed and built for classification problems based on the k-nearest-neighbors approach. This architecture is suitable for different pattern recognition applications and is very efficient for high-dimensional databases. The architecture is scalable with the size of the recognition problem, making the system effectively applicable to computationally intensive applications like on-line pattern recognition. A system prototype composed of a board with two processors, the software driver, and a test application has been built and evaluated. For a handwritten character recognition task, the complete system shows a speedup of 260 times over a sequential algorithm running on a Sun SPARC20 workstation. (C) 2000 Pattern Recognition Society. Published by Elsevier Science Ltd. All rights reserved.
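For reference, the sequential computation that such hardware accelerates can be sketched as follows. This is a generic k-nearest-neighbors classifier, assuming squared Euclidean distance and majority voting, neither of which the abstract specifies. The function name `knn_classify` and the sample data are hypothetical.

```python
from collections import Counter


def knn_classify(query, samples, labels, k=3):
    """Label query by majority vote among its k nearest stored samples."""
    # Squared Euclidean distance from the query to every stored sample.
    dists = sorted(
        (sum((q - s) ** 2 for q, s in zip(query, sample)), label)
        for sample, label in zip(samples, labels)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    samples = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
    labels = ["a", "a", "a", "b", "b", "b"]
    print(knn_classify((0.2, 0.1), samples, labels))
    print(knn_classify((5.5, 5.2), samples, labels))
```

The distance computations are independent of one another, so they map naturally onto an array of processors, with the cost per query growing with both the database size and the feature dimension.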
Memory access fast switching structures in a cluster are studied, and three kinds of fast switching structures (FS, LR2 SS, and LAPS) are proposed. A mixed simulation test bench is constructed and used to collect statistics on data access delay for these three structures in various cases. Finally, these structures are realized on a Xilinx FPGA development board, and DCT, FFT, SAD, IME, FME, and de-blocking filtering algorithms are mapped onto the structures. Compared with available architectures, the proposed structures have lower data access delay and smaller area.
The multiassociative processor (MAP) system is a hypothetical machine composed of eight control units (CU's) and an arbitrary number of processing elements (PE's). Each CU is allocated a subset of the identical PE's in order to process a single-instruction-stream-multiple-data-stream program. The eight CU's must be able to access a common main memory system and transmit data to subsets of the PE's over a shared data bus system. This paper discusses the analysis of these two components of the system, where this analysis relies heavily on three simulation programs. The first program interprets assembly language programs for the hypothetical machine and the other two programs model the memory system and the data bus system. The interpreter is driven by both realistic array processor programs and synthetic programs designed specifically to test the components of the system.