A new systolic array design for the Algebraic Path Problem (APP) is presented that is both simpler and more efficient than previously proposed configurations. The array uses N^2 orthogonally connected processing elements and requires 2N I/O connections. Total computation time is 5N - 2, which is the minimum time possible in a systolic implementation. The data pipelining rate is one, so no pipeline interleaving is required. For multiple problem instances, a block pipeline rate of N can be achieved, which is optimal for an array of N^2 processing elements.
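For readers unfamiliar with the APP, the following is a minimal sequential sketch of the recurrence that such arrays parallelize. It is not the systolic design from the abstract: the function name `algebraic_path` and the two example semirings are illustrative assumptions, and the min-plus case assumes no negative cycles so that the closure of a diagonal entry is 0.

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")


def algebraic_path(
    a: List[List[T]],
    add: Callable[[T, T], T],
    mul: Callable[[T, T], T],
    star: Callable[[T], T],
) -> List[List[T]]:
    """Return A+ = A (+) A^2 (+) ... over the semiring (add, mul, star).

    Hypothetical helper: a sequential Gauss-Jordan/Kleene-style elimination,
    O(N^3) steps, with no claim about how a systolic array schedules them.
    """
    n = len(a)
    c = [row[:] for row in a]
    for k in range(n):
        pivot = star(c[k][k])  # closure of the current pivot element
        # Build the next matrix entirely from the previous one.
        c = [
            [add(c[i][j], mul(mul(c[i][k], pivot), c[k][j])) for j in range(n)]
            for i in range(n)
        ]
    return c


INF = float("inf")

# Min-plus semiring: shortest paths (assumes no negative cycles, so star is 0).
weights = [
    [INF, 4.0, 1.0],
    [INF, INF, INF],
    [INF, 2.0, INF],
]
dist = algebraic_path(weights, min, lambda x, y: x + y, lambda x: 0.0)

# Boolean semiring: transitive closure.
edges = [
    [False, True, False],
    [False, False, True],
    [False, False, False],
]
reach = algebraic_path(
    edges, lambda x, y: x or y, lambda x, y: x and y, lambda x: True
)
```

Swapping the three semiring operations is what lets one array structure serve shortest paths, transitive closure, and related problems.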
The reconfigurable SIMD parallel processor is a member of the SIMD family of architectures. Its most distinguishing feature is the use of a reconfigurable interconnection network to 1) establish a network topology that maps well onto the algorithm's communication graph, so that higher efficiency can be achieved, and 2) remove faulty processors from the network, so that system operation continues uninterrupted at the same or slightly degraded efficiency. This paper describes several existing reconfigurable SIMD parallel architectures and their reconfiguration mechanisms, demonstrates the effectiveness of algorithm mapping through reconfiguration, and discusses fault-tolerant schemes based on reconfiguration.
This paper presents the design of a multiple-standard 1080 high definition (HD) video decoder on a mixed-grained reconfigurable computing platform integrating coarse-grained reconfigurable processing units (RPUs) and FPGAs. The proposed RPU, which includes 16 x 16 multi-functional processing elements (PEs), is used to accelerate compute-intensive tasks in video decoding. A soft-core-based microprocessor array is implemented on the FPGA and used to speed up the dynamic reconfiguration of the RPU. Furthermore, a mailbox-based communication scheme is used to improve the communication efficiency between RPUs and FPGAs. By exploiting dynamic reconfiguration of the RPUs and static reconfiguration of the FPGAs, the proposed platform achieves scalable performance and cost trade-offs to support a variety of video coding standards, including MPEG-2, AVS, H.264, and HEVC. The measured results show that the proposed platform can support H.264 1080 HD video streams at up to 57 frames per second (fps) and HEVC 1080 HD video streams at up to 52 fps at 250 MHz. It also achieves a 3.6x performance gain over an industrial coarse-grained reconfigurable processor for H.264 decoding, and a 6.43x performance boost over a general-purpose-processor-based implementation for HEVC decoding.
In this paper, we present some new regular iterative algorithms for matrix multiplication and transitive closure. By applying space-time mapping to these algorithms, 2-D arrays with execution times of 2N-1 and [(3N-1)/2] can be obtained for matrix multiplication. We also derive a 2-D array with 4N-2 execution time for transitive closure, based on the sequential Warshall-Floyd algorithm. All of these new 2-D arrays for matrix multiplication and transitive closure have the advantage of being faster and more regular than previous designs.
An algorithm can be thought of as a set of indexed computations and if one computation uses data generated by another computation then this data dependence can be represented by the difference of their indexes (called dependence vector). Many important algorithms are characterized by the fact that data dependencies are uniform, i.e., the values of the dependence vectors are independent of the indexes of computations. Linear schedules are a special class of schedules described by a linear mapping of computation indexes into time. This paper addresses the problem of identifying optimal linear schedules for uniform dependence algorithms so that their execution time is minimized. Procedures are proposed to solve this problem based on the mathematical solution of a nonlinear optimization problem. The complexity of these procedures is independent of the size of the algorithm. Actually, the complexity is exponential in the dimension of the index set of the algorithm and, for all practical purposes, very small due to the limited dimension of the index set of algorithms of practical interest. The results reported in this paper can be used to derive time-optimal systolic designs and applied in optimizing compilers to restructure programs at compile-time in order to maximally exploit available parallelism.
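The notions of dependence vector and linear schedule can be made concrete with a small sketch. This is a brute-force search, not the optimization procedure the paper proposes, and it rests on stated assumptions: the index set is a box {0..s-1} in each dimension, computation j runs at time pi . j, a schedule is valid when pi . d >= 1 for every dependence vector d, and execution time is the spread of pi . j over the index set plus one. The function names and the search bound are hypothetical.

```python
from itertools import product
from typing import Optional, Sequence, Tuple

Vector = Tuple[int, ...]


def is_valid(pi: Vector, deps: Sequence[Vector]) -> bool:
    """A linear schedule is valid if every dependence is delayed by >= 1 step."""
    return all(sum(p * d for p, d in zip(pi, dep)) >= 1 for dep in deps)


def execution_time(pi: Vector, sizes: Vector) -> int:
    """Spread of pi . j over the box index set, plus one time step.

    On a box, max(pi . j) - min(pi . j) is attained at opposite corners,
    which gives sum(|pi_i| * (size_i - 1)).
    """
    return sum(abs(p) * (s - 1) for p, s in zip(pi, sizes)) + 1


def best_linear_schedule(
    deps: Sequence[Vector], sizes: Vector, bound: int = 2
) -> Optional[Tuple[Vector, int]]:
    """Exhaustively try integer schedules with entries in [-bound, bound]."""
    best: Optional[Tuple[Vector, int]] = None
    for pi in product(range(-bound, bound + 1), repeat=len(sizes)):
        if not is_valid(pi, deps):
            continue
        t = execution_time(pi, sizes)
        if best is None or t < best[1]:
            best = (pi, t)
    return best


# Matrix-multiplication-like algorithm: 4 x 4 x 4 index set with unit
# dependence vectors along each axis.
mm_deps = [(1, 0, 0), (0, 1, 0), (0, 0, 1)]
schedule = best_linear_schedule(mm_deps, (4, 4, 4))
```

The number of candidate schedules grows with the dimension of the index set rather than with its size, which loosely mirrors the complexity remark in the abstract.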
The problem of designing space-optimal 2D regular arrays for N x N x N cubical mesh algorithms with linear schedule ai + bj + ck, 1 ≤ a ≤ b ≤ c, and N = nc, is studied. Three novel nonlinear processor allocation methods, each of which works by combining a partitioning technique (gcd-partition) with different nonlinear processor allocation procedures (traces), are proposed to handle different cases. In cases where a + b ≤ c, which are dealt with by the first processor allocation method, space-optimal designs can always be obtained in which the number of processing elements is equal to N^2/c. For other cases where a + b > c and either a = b or b = c, two other optimal processor allocation methods are proposed. In addition, closed-form expressions for the optimal number of processing elements are derived for these cases.
Coarse-grained reconfigurable architectures can enhance the performance of critical loops and computation-intensive functions. Such architectures need efficient compilation techniques to map algorithms onto customized architectural configurations. A new compilation approach uses a generic reconfigurable architecture to tackle the memory bottleneck that typically limits the performance of many applications.
Digital signal processing algorithms with multiple shift-invariant dependence graphs (DGs) can be mapped to field programmable gate array hardware in many different types of systolic processor arrays. Because hardware resources are finite, the problem is to use the "right" amount of hardware in a specific configuration so as to maximize the processing speed. In this paper, the problem of finding the right processor array configuration is formulated as a constrained optimization problem in which the cost function includes not only the cost of individual processor arrays but also the cost of interfacing circuits. Three heuristic algorithms are presented for the optimization problem. Among them, both the Lth axial neighbor algorithm and the simulated annealing algorithm produce good results on a test case. Simulation results on the test case also indicate that the initial configuration is important in obtaining a good configuration with both algorithms. The Lth axial neighbor algorithm has the additional advantage of requiring less performance tuning.
A distributed offline DISOPE algorithm for optimal state synchronization of leader-follower systems with nonlinear discrete-time dynamics is considered, which integrates model optimization with parameter estimation. It is shown that the convergent solutions of the modified linear optimal control problems satisfy the optimality conditions of the original nonlinear optimization problem with non-LQ performance indices. The heterogeneous agents can cooperate and exchange information via network communication. Based on the DISOPE algorithm, a distributed optimal control policy is obtained that assures state synchronization and minimizes the performance indices over a finite time horizon. Finally, a simulation example is provided to illustrate the effectiveness of the distributed DISOPE algorithm.
This study focuses on a particular application domain (iterative automatic target recognition tasks) and an associated specific class of dedicated heterogeneous parallel hardware platforms. For the computational environment considered, a methodology is presented for the on-line operating system to decide heuristically whether to perform a remapping of the application onto the platform based on information generated from input data by the application during execution. If the decision is to remap, the operating system will be able to select a mapping, which is appropriate for the given state of the application, from a stored set of mappings that were previously derived with an off-line heuristic.