Various levels of parallelism have recently been introduced in advanced microprocessors to meet the demanding computing needs of digital video processing and other multimedia applications. Because many imaging algorithms are easily parallelizable, these architectural features and their wide availability at low cost have become a powerful tool in tackling both existing and new imaging applications. At the lowest level, subword parallelism is used in new instructions that process multiple multimedia data items simultaneously. Instruction-level parallelism, including subword parallelism, is realized in either very long instruction word (VLIW) or superscalar architectures, while on-chip and/or off-chip multiprocessing capability is available for easier multiprocessor system design. One difficulty in maximizing computing throughput via parallelism has been the level of programming: obtaining optimal performance has typically required assembly-level programming. We review the architectural features of several modern microprocessors, such as the TMS320C60, TM-1000, PowerPC 604, Pentium II, R10000, Alpha 21264, PA-RISC 8200, UltraSPARC-II, and TMS320C80. Various obstacles to obtaining the best performance from these microprocessors with high-level and assembly languages are discussed, and several approaches to overcoming these difficulties in diverse imaging applications are presented. (C) 1998 John Wiley & Sons, Inc.
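To make the idea of subword parallelism concrete, the following is a minimal software sketch, not taken from the paper: it emulates a packed add of four unsigned 8-bit pixels held in one 32-bit word, which is roughly what a hardware multimedia instruction does in a single operation. The names `packed_add_u8`, `pack`, and `unpack` are hypothetical and chosen for illustration only.

```python
# Subword-parallel (SWAR) addition: four unsigned 8-bit lanes in one 32-bit word.
# Illustrative emulation only; real multimedia instructions do this in hardware.
LOW7 = 0x7F7F7F7F   # low 7 bits of every lane
HIGH = 0x80808080   # top bit of every lane


def packed_add_u8(a: int, b: int) -> int:
    """Add four 8-bit lanes modulo 256 without carries crossing lane boundaries."""
    low = (a & LOW7) + (b & LOW7)          # each lane sum is at most 254, so no spill
    return (low ^ ((a ^ b) & HIGH)) & 0xFFFFFFFF  # restore the top bit of each lane


def pack(lanes):
    """Pack four values in 0..255 into one word, lanes[0] in the lowest byte."""
    word = 0
    for i, value in enumerate(lanes):
        word |= (value & 0xFF) << (8 * i)
    return word


def unpack(word):
    """Split a 32-bit word back into its four 8-bit lanes."""
    return [(word >> (8 * i)) & 0xFF for i in range(4)]
```

Each lane wraps around independently, so one word-wide operation replaces four byte-wide additions.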
ISBN: 0819429163 (print)
A study is made of a non-smooth optimization problem arising in adaptive optics, which involves the real-time control of a deformable mirror designed to compensate for atmospheric turbulence and other dynamic image degradation factors. One formulation of this problem yields a functional $f(U) = \sum_{i=1}^{n} \max_{j} \{(U^{T} M_j U)_{ii}\}$ to be maximized over orthogonal matrices $U$ for a fixed collection of $n \times n$ symmetric matrices $M_j$. We first consider the situation, which can arise in practical applications, where the matrices $M_j$ are "nearly" pairwise commutative. Besides giving useful bounds, results for this case lead to a simple corollary providing a theoretical closed-form solution for globally maximizing $f$ if the $M_j$ are simultaneously diagonalizable. However, even here conventional optimization methods for maximizing $f$ are not practical in a real-time environment. The general optimization problem is quite difficult and is approached using a heuristic Jacobi-like algorithm. Numerical tests indicate that the algorithm provides an effective means to optimize performance for some important adaptive-optics systems.
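As a small illustration of the functional described above, the sketch below evaluates f(U) for matrices given as nested lists. It only evaluates the objective; it does not implement the paper's Jacobi-like heuristic or its closed-form solution. The function name `f` and the example matrices in the usage note are assumptions made for illustration.

```python
def f(U, Ms):
    """Evaluate f(U) = sum_i max_j (U^T M_j U)_ii for n x n nested-list matrices."""
    n = len(U)
    total = 0.0
    for i in range(n):
        u = [U[r][i] for r in range(n)]  # i-th column of U
        # The diagonal entry (U^T M U)_ii equals the quadratic form u^T M u.
        total += max(
            sum(u[r] * M[r][c] * u[c] for r in range(n) for c in range(n))
            for M in Ms
        )
    return total
```

For example, with the diagonal matrices diag(3, 0) and diag(1, 2), the identity gives f = max(3, 1) + max(0, 2) = 5, while a 45-degree rotation gives f = 1.5 + 1.5 = 3, so the value clearly depends on the choice of U.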
This paper describes the architecture and design of a general-purpose parallel image processor chip called the SliM-II Image Processor. The chip has a linear array of 64 processing elements (PEs), operates at 30 MHz in worst-case simulation, and delivers 1.92 GIPS. SliM-II can greatly reduce inter-PE communication overhead through sliding, which overlaps inter-PE communication with computation. In contrast to existing array processors, each PE has a multiplier, which is quite effective for convolution, template matching, etc. The instruction set can execute an ALU operation, data I/O, and inter-PE communication simultaneously in one instruction cycle. In addition, during the ALU/multiplier operation, SliM-II provides parallel load/store between the register file and on-chip memory, as in DSP chips. The bandwidth of data I/O and inter-PE communication increases due to bit-parallel paths. We developed VHDL models and performed logic synthesis using the COMPASS(TM) CAD tool. We used the COMPASS(TM) 3.3 V, 0.6 μm standard cell library (v8r4.9.1). The total number of transistors is about 1.5 million. The SliM-II chip is being fabricated at LG Semiconductor Co., Ltd. The performance estimation shows a significant improvement over existing array processors for algorithms requiring multiplications.
This paper describes the architecture and design of a linear array processor chip called the SliM-II Image Processor. The chip has a linear array of 64 processing elements (PEs). In contrast to existing array processors, each PE has a multiplier, which is quite effective for convolution, template matching, etc. The instruction set can execute an ALU operation, a data I/O operation, and an inter-PE communication operation simultaneously in one instruction cycle. In addition, during the ALU/multiplier operation, SliM-II provides parallel data load/store between the register file and on-chip memory, as in DSP chips. The SliM-II contains about 1.5 million transistors in a 13.2 × 13.0 mm² core, and the package type is 208 pin PQ2. The performance estimation shows a significant improvement over existing array processors for algorithms requiring multiplications.
ISBN: 0819419230 (print)
The research presented here focuses on the general problem of finding tools and methods to compare and evaluate parallel architectures in one particular field: computer vision. As several different parallel architectures have been proposed for machine vision, some means of comparison between them are necessary in order to employ the most suitable architecture for a given application. 'Benchmarks' are the most popular tools for machine speed comparison, but they give no information on the most convenient hardware structures for implementing a given vision problem. This paper tries to overcome this weakness by proposing a definition of the concept of a tool for the evaluation of parallel architectures (more general than a benchmark), and provides a characterization of the chosen algorithms. Taking into account the different ways to process data, it is necessary to consider two different classes of machines, MISD and (MIMD, SPMD, SIMD), which offer different programming models and thus lead to two classes of algorithms. Consequently, two algorithms, one for each class, are proposed: 1) the extraction of connected components, and 2) a parallel region growing algorithm with data reorganization. The second algorithm tests the capabilities of the architecture to support the following: i) pyramidal data structures (initial region step), ii) a merge procedure between global and global information (adjacent regions to the growing region), and iii) a parallel merge procedure between local and global information (adjacent points to the growing region).
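For readers unfamiliar with the first proposed algorithm, the following is a minimal sequential sketch of connected-component extraction on a binary image. It is a plain breadth-first reference version that assumes 4-connectivity, not the parallel formulation evaluated in the paper; the name `label_components` is hypothetical.

```python
from collections import deque


def label_components(image):
    """Label 4-connected regions of nonzero pixels in a 2-D list of ints.

    Returns (labels, count), where labels has the same shape as image and
    background pixels keep label 0.
    """
    rows, cols = len(image), len(image[0])
    labels = [[0] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if image[r][c] and not labels[r][c]:
                count += 1                      # start a new component
                labels[r][c] = count
                queue = deque([(r, c)])
                while queue:                    # flood-fill the component
                    y, x = queue.popleft()
                    for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                        inside = 0 <= ny < rows and 0 <= nx < cols
                        if inside and image[ny][nx] and not labels[ny][nx]:
                            labels[ny][nx] = count
                            queue.append((ny, nx))
    return labels, count
```

A parallel implementation would typically replace the queue-driven fill with local label propagation and a merge step, which is where architectural differences between machine classes become visible.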
The relaxed look-ahead technique is presented as an attractive technique for pipelining adaptive filters. Unlike conventional look-ahead, relaxed look-ahead does not attempt to maintain the input-output mapping between the serial and pipelined architectures but preserves the adaptation characteristics. The use of this technique results in a small hardware overhead, which would not be possible with conventional look-ahead. Relaxed look-ahead is employed to develop fine-grained pipelined architectures for least mean-squared (LMS) adaptive filtering. Convergence analysis results are presented for the pipelined architecture. The proposed architecture achieves the desired speed-up with marginal or no degradation in convergence behavior. Past work in pipelined transversal LMS filtering is shown to be a special case of this architecture. Simulation results verifying the convergence analysis for the pipelined LMS filter are presented. The pipelined LMS filter is then employed to develop a high-speed adaptive differential pulse-code-modulation (ADPCM) codec. The new architecture has a negligible hardware overhead, which is independent of the number of quantizer levels, the predictor order, and the pipelining level. Additionally, the pipelined codec has a much lower output latency than the level of pipelining. Theoretical analysis indicates that the output signal-to-noise ratio (SNR) degrades as the speed-up increases. Simulations with image data indicate that speed-ups of up to 44 can be achieved with less than 1 dB loss in SNR.
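The sketch below illustrates why delaying the weight update can leave adaptation behavior largely intact. It implements the standard transversal LMS filter together with the well-known delayed-LMS variant, in which the update uses an error and input vector that are `delay` samples old. This is a generic textbook formulation assumed here for illustration; it is not claimed to be the relaxed look-ahead architecture of the paper, and the function name `lms` and its parameters are hypothetical.

```python
def lms(x, d, taps, mu, delay=0):
    """Transversal LMS filter with an optional delayed weight update.

    x: input samples, d: desired samples, taps: filter length, mu: step size.
    With delay = D > 0 the update is w <- w + mu * e(n - D) * u(n - D).
    Returns (final weights, list of a-priori errors).
    """
    w = [0.0] * taps
    pending = []            # (input vector, error) pairs waiting to be applied
    errors = []
    for n in range(len(x)):
        # Input vector u(n) = [x(n), x(n-1), ...], zero-padded at the start.
        u = [x[n - k] if n - k >= 0 else 0.0 for k in range(taps)]
        y = sum(wk * uk for wk, uk in zip(w, u))
        e = d[n] - y
        errors.append(e)
        pending.append((u, e))
        if len(pending) > delay:
            u_old, e_old = pending.pop(0)
            w = [wk + mu * e_old * uk for wk, uk in zip(w, u_old)]
    return w, errors
```

With a small step size, both `delay=0` and a modest positive delay converge to the same weights on a noise-free system identification problem, which matches the qualitative claim that pipelining need not degrade convergence much.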
ISBN: 0819413291 (print)
Our study consists of three phases: mapping the algorithms onto the three topologies, simulating the execution of these algorithms, and designing the array. Our mapping follows the standard methodology proposed for the design of systolic arrays, since our application domain is very specific and the selected algorithms are very regular. After the mapping is done, we simulate the algorithms using the SES/workbench simulation package, which allows us to collect statistics on the execution time and efficiency of our mappings and to evaluate the performance of the three topologies in our application domain using different array and problem sizes. For each algorithm and topology, the range of scalability is determined as a function of image size. In the design phase we propose an SIMD array with a 2-D torus interconnection topology as a cost-efficient solution for the scalable implementation of the selected algorithms. Considerations entering the design phase are performance as determined by simulations, cost of implementation, and ease of scaling the machine size.
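The 2-D torus topology mentioned above can be summarized in a few lines: every processing element has four neighbors, and the edges wrap around. The sketch below is only an illustration of that addressing rule under an assumed row-major grid; the function name `torus_neighbors` is hypothetical and not from the study.

```python
def torus_neighbors(row, col, rows, cols):
    """Return the four wraparound neighbors of PE (row, col) in a rows x cols torus."""
    return {
        "north": ((row - 1) % rows, col),
        "south": ((row + 1) % rows, col),
        "west": (row, (col - 1) % cols),
        "east": (row, (col + 1) % cols),
    }
```

Because the modulo operation handles the edges, corner and border elements need no special cases, which is one reason a torus is convenient for regular, systolic-style mappings.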
Among all the numerous parallel structures that have been studied for and involved in image processing, this article gives some elements for an answer to the problem of the choice of some architectures dedicated to Im...
During the past decade, three major categories of image matching algorithms have emerged: Signal-processing-based, artificial-intelligence-based, and a combination of these methods called hybrid techniques. This paper...
A one-dimensional systolic geometry processor (SGP) which is useful in image processing and pattern recognition is described. The geometry processor can be used to enhance processing speed and throughput of the host c...