When implementing today's video compression standards on programmable processors, it is essential to optimize the algorithms with respect to the underlying hardware. As an example, the core decoder functions of th...
详细信息
ISBN:
(纸本)0819427519
When implementing today's video compression standards on programmable processors, it is essential to optimize the algorithms with respect to the underlying hardware. As an example, the core decoder functions of the H.263 hybrid coding scheme were implemented on a SIMD controlled processor with four parallel VLIW data paths, the HiPAR-DSP. The decoder tasks were implemented employing local memory, parallelization on several levels, and data statistics. Special effort was paid on the computation intensive tasks IDCT, and motion compensated frame reconstruction. To speed up the IDCT computation, a data dependent approach was chosen, which distinguishes different block types. The determination of IDCT block type could be parallelized together with other tasks, thus no additional overhead is required. Frame reconstruction mainly benefits from data parallel operations and transparent DMA transfers to and from external memory.
In this paper the design of systolic array processors for computing 2-dimensional Discrete Fourier Transform (2-D DFT) is considered. We investigated three different computational schemes for designing systolic array ...
详细信息
In this paper the design of systolic array processors for computing 2-dimensional Discrete Fourier Transform (2-D DFT) is considered. We investigated three different computational schemes for designing systolic array processors using systematic approach. The systematic approach guarantees to find optimal systolic array processors from a large solution space in terms of the number of processing elements and I/O channels, the processing time, topology, pipeline period, etc. The optimal systolic array processors are scalable, modular and suitable for VLSI implementation. An application of the designed systolic array processors to the prime-factor DFT is also presented.
We present here an efficient systolic implementation for 3-D IIR digital filters. The systolic implementation is obtained by using an algebraic mapping technique. This new mapping technique gives us the choice to mix ...
详细信息
We present here an efficient systolic implementation for 3-D IIR digital filters. The systolic implementation is obtained by using an algebraic mapping technique. This new mapping technique gives us the choice to mix pipelined variables and broadcast variables. We also determine, through the mapping method, the buffer sizes, the direction of variables propagations and the data feeding and extracting points. The resultant systolic array implementation is a modular structure composed of 2-D filter modules connected by simple buffers. This new systolic implementation is regular, modular and amenable to VLSI implementation.
In this paper, we present some new regular iterative algorithms for matrix multiplication and transitive closure. With these algorithms, by spacetime mapping the 2-D arrays with 2N-1 and [(3N-1)/2] execution times for...
详细信息
In this paper, we present some new regular iterative algorithms for matrix multiplication and transitive closure. With these algorithms, by spacetime mapping the 2-D arrays with 2N-1 and [(3N-1)/2] execution times for matrix multiplication can be obtained, Meanwhile, we can derive a 2-D array with 4N-2 execution time for transitive closure based on the sequential Warshall-Floyd algorithm. All these new 2-D arrays for matrix multiplication and transitive closure have the advantages of faster and more regular than other previous designs.
The problem of designing space-optimal 2D regular arrays for N x N x N cubical mesh algorithms with linear schedule ai + bj + ck, 1 less than or equal to a less than or equal to b less than or equal to c, and N = nc, ...
详细信息
The problem of designing space-optimal 2D regular arrays for N x N x N cubical mesh algorithms with linear schedule ai + bj + ck, 1 less than or equal to a less than or equal to b less than or equal to c, and N = nc, is studied. Three novel nonlinear processor allocation methods, each of which works by combining a partitioning technique (gcd-partition) with different nonlinear processor allocation procedures (traces), are proposed to handle different cases, In cases where a + b less than or equal to c, which are dealt with by the first processor allocation method, space-optimal designs can always be obtained in which the number of processing elements is equal to N-2/c. For other cases where a + b > c and either a = b and b = c, two other optimal processor allocation methods are proposed. Besides, the closed form expressions for the optimal number of processing elements are derived for these cases.
We present an algebraic theory based on tensor products for modeling direct interconnection networks. This algebraic theory has been used for designing and implementing block recursive numerical algorithms on shared-m...
详细信息
A new systolic array design for the Algebraic Path Problem (APP) is presented that is both simpler and more efficient than previously proposed configurations. This array uses N2 orthogonally connected processing eleme...
详细信息
A new systolic array design for the Algebraic Path Problem (APP) is presented that is both simpler and more efficient than previously proposed configurations. This array uses N2 orthogonally connected processing elements and requires 2N I/O connections. Total computation time is 5N - 2, which is the minimum time possible in a systolic implementation. The data pipelining rate is one, so no pipeline interleave is required. For multiple problem instances a block pipeline rate of N can be achieved, which is optimal for an array of N2 processing elements.
Reconfigurable SIMD parallel processor is a member of SIMD architectures. Its most distinguished feature is the utilization of the reconfigurability of the interconnection network to 1) establish a network topology we...
详细信息
Reconfigurable SIMD parallel processor is a member of SIMD architectures. Its most distinguished feature is the utilization of the reconfigurability of the interconnection network to 1) establish a network topology well mapped to the algorithm communication graph so that higher efficiency can be achieved, and to 2) remove faulty processors from the network so that the system operation can be kept uninterrupted while maintaining the same or slightly degraded efficiency. This paper describes several existing reconfigurable SIMD parallel architectures and their reconfiguration mechanism, demonstrates the effectiveness of algorithm mapping through reconfiguration, and discusses fault tolerant schemes via reconfiguration.
An algorithm can be thought of as a set of indexed computations and if one computation uses data generated by another computation then this data dependence can be represented by the difference of their indexes (called...
详细信息
An algorithm can be thought of as a set of indexed computations and if one computation uses data generated by another computation then this data dependence can be represented by the difference of their indexes (called dependence vector). Many important algorithms are characterized by the fact that data dependencies are uniform, i.e., the values of the dependence vectors are independent of the indexes of computations. Linear schedules are a special class of schedules described by a linear mapping of computation indexes into time. This paper addresses the problem of identifying optimal linear schedules for uniform dependence algorithms so that their execution time is minimized. Procedures are proposed to solve this problem based on the mathematical solution of a nonlinear optimization problem. The complexity of these procedures is independent of the size of the algorithm. Actually, the complexity is exponential in the dimension of the index set of the algorithm and, for all practical purposes, very small due to the limited dimension of the index set of algorithms of practical interest. The results reported in this paper can be used to derive time-optimal systolic designs and applied in optimizing compilers to restructure programs at compile-time in order to maximally exploit available parallelism.
暂无评论