This letter focuses on the retiming technique for register minimization. The technique was originally presented as a minimum-cost linear problem, in which a fanout gadget was proposed to model the nodes (the functional blocks) of a digital circuit that have multiple output edges, so as to obtain a retiming solution r(V) with integer values. The goal is to minimize the function $\mathrm{COST}' = \sum_{e} \beta(e)\,\omega_r(e)$ subject to feasibility and clock-period constraints. Determining the breadth coefficients $\beta(e)$ can be cumbersome for large digital circuits, as there is no suitable method in the literature. Based on concepts from graph theory and linear algebra, an algorithm for computing the breadth coefficients is proposed. An example illustrates the performance of the proposed algorithm and shows that the breadth coefficients are determined effortlessly.
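As a rough illustration of the objective in the abstract above, the following sketch evaluates the weighted register cost for a given retiming. It assumes the standard retimed edge weight $\omega_r(e) = \omega(e) + r(v) - r(u)$ for an edge $e = (u, v)$. The three-node graph, the register counts, and the breadth values are made up for illustration, and the function name `retiming_cost` is hypothetical. The sketch does not implement the letter's algorithm for computing the breadth coefficients; it only takes them as inputs.

```python
from fractions import Fraction


def retiming_cost(edges, r):
    """Evaluate sum over edges of beta(e) * w_r(e) for a retiming r.

    edges: iterable of (u, v, w, beta), where w is the register count on
           edge u -> v and beta is its breadth coefficient.
    r:     dict mapping each node to its integer retiming value.
    """
    total = Fraction(0)
    for u, v, w, beta in edges:
        # Retimed register count on edge u -> v (assumed standard definition).
        w_r = w + r[v] - r[u]
        if w_r < 0:
            # Feasibility constraint: no edge may carry a negative register count.
            raise ValueError(f"infeasible retiming on edge {u}->{v}")
        total += beta * w_r
    return total


# Illustrative graph: node "a" fans out to "b" and "c" (breadth values are assumptions).
edges = [
    ("a", "b", 1, Fraction(1, 2)),
    ("a", "c", 1, Fraction(1, 2)),
    ("b", "a", 0, Fraction(1)),
    ("c", "a", 0, Fraction(1)),
]

cost_identity = retiming_cost(edges, {"a": 0, "b": 0, "c": 0})
cost_shifted = retiming_cost(edges, {"a": 0, "b": -1, "c": 0})
```

Using exact fractions avoids rounding noise when the breadth coefficients are not integers. Clock-period constraints are deliberately left out of this sketch.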
State-of-the-art convolutional neural network accelerators show a growing interest in exploiting bit-level sparsity and eliminating the ineffectual computations of zero bits. However, the excessive redundancy and irregular distribution of nonzero bits limit the real speedup in these accelerators. To address this, we propose an algorithm-architecture codesign, named structured term pruning (STP), to boost the computation efficiency of neural network inference. Specifically, we enhance the bit sparsity by guiding the weights toward values with fewer power-of-two terms. Then, we structure the terms with layer-wise group budgets. Retraining is adopted to recover the accuracy drop. We also design the hardware of the group processing element and the fast signed-digit encoder for efficient implementation of STP networks. The system design of STP is realized with a few simple alterations to an input-stationary systolic array design. Extensive evaluation results demonstrate that STP can significantly reduce inference computation costs and achieve $2.35\times$ computational energy saving for the ResNet18 network on the ImageNet dataset.
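To make "fewer power-of-two terms" concrete, here is a minimal sketch that decomposes an integer weight into signed power-of-two terms using the non-adjacent form (NAF), one common signed-digit representation. Whether STP uses exactly this encoding is an assumption, and the function name `signed_digit_terms` is hypothetical. The sketch covers only the term count, not the pruning, group budgets, or retraining described in the abstract.

```python
def signed_digit_terms(n: int) -> list[tuple[int, int]]:
    """Return the NAF of n as a list of (sign, exponent) pairs.

    Each pair stands for the term sign * 2**exponent, with sign in {+1, -1}.
    """
    terms = []
    exponent = 0
    while n != 0:
        if n % 2:
            # Pick +1 when n = 1 (mod 4) and -1 when n = 3 (mod 4), so that
            # the next digit is guaranteed to be zero (no adjacent nonzeros).
            digit = 2 - (n % 4)
            terms.append((digit, exponent))
            n -= digit
        n //= 2
        exponent += 1
    return terms


# 7 needs three terms in plain binary (4 + 2 + 1) but only two in NAF (8 - 1).
terms_of_7 = signed_digit_terms(7)
```

A weight whose decomposition has fewer terms needs fewer shift-and-add operations, which is the kind of sparsity the abstract aims to increase.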
The Large Intelligent Surface (LIS) is a promising technology in the areas of wireless communication, remote sensing, and positioning. It consists of a continuous radiating surface located in the proximity of the users, with the capability to communicate by transmission and reception (replacing base stations). Despite its potential, there are numerous implementation challenges, the most relevant being the interconnection data rate, computational complexity, and storage. To address these challenges while ensuring scalability, hierarchical architectures with distributed processing techniques are envisioned. In this work we perform algorithm-architecture codesign to propose two distributed interference cancellation algorithms and a tree-based interconnection topology for uplink processing. We also analyze the performance, hardware requirements, and architecture trade-offs for a discrete LIS, in order to provide concrete case studies and guidelines for efficient implementation of LIS systems.
Approximation of the discrete cosine transform (DCT) is useful for reducing its computational complexity without significant impact on its coding performance. Most of the existing algorithms for approximation of the DCT target only small transform lengths, and some of them are non-orthogonal. This paper presents a generalized recursive algorithm to obtain an orthogonal approximation of the DCT, in which an approximate DCT of length N can be derived from a pair of DCTs of length N/2 at the cost of N additions for input preprocessing. We perform recursive sparse matrix decomposition and make use of the symmetries of the DCT basis vectors to derive the proposed approximation algorithm. The proposed algorithm is highly scalable for hardware as well as software implementation of DCTs of higher lengths, and it can make use of an existing approximation of the 8-point DCT to obtain an approximate DCT of any power-of-two length N > 8. We demonstrate that the proposed approximation of the DCT provides comparable or better image and video compression performance than the existing approximation methods. It is shown that the proposed algorithm involves lower arithmetic complexity than the other existing approximation algorithms. We present a fully scalable, reconfigurable parallel architecture for the computation of the approximate DCT based on the proposed algorithm. One uniquely interesting feature of the proposed design is that it can be configured for the computation of a 32-point DCT, or for parallel computation of two 16-point DCTs or four 8-point DCTs, with a marginal control overhead. The proposed architecture is found to offer many advantages in terms of hardware complexity, regularity, and modularity. Experimental results obtained from FPGA implementation show the advantage of the proposed method.
The roofline model is a popular approach for "bound and bottleneck" performance analysis. It focuses on the limits to processor performance imposed by limited bandwidth to off-chip memory. It models upper bounds on performance as a function of operational intensity, the ratio of computational operations to bytes of data moved to or from memory. While operational intensity can be directly measured for a specific implementation of an algorithm on a particular target platform, it is of interest to obtain broader insights on bottlenecks, where various semantically equivalent implementations of an algorithm are considered, along with analysis for variations in architectural parameters. This is currently very cumbersome and requires performance modeling and analysis of many variants. In this article, we address this problem by using the roofline model in conjunction with upper bounds on the operational intensity of computations as a function of cache capacity, derived from lower bounds on data movement. This enables bottleneck analysis that holds across all dependence-preserving, semantically equivalent implementations of an algorithm. We demonstrate the utility of the approach in assessing fundamental limits to performance and energy efficiency for several benchmark algorithms across a design space of architectural variations.
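The basic roofline bound mentioned in the abstract above can be sketched in a few lines: attainable performance is the smaller of the peak compute rate and the product of memory bandwidth and operational intensity. The function names and the machine numbers below are hypothetical and chosen only for illustration. The article's derivation of operational-intensity upper bounds from cache capacity is not reproduced here.

```python
def attainable_performance(peak_ops_per_s, bandwidth_bytes_per_s, operational_intensity):
    """Roofline bound: min(peak compute, bandwidth * operations-per-byte)."""
    return min(peak_ops_per_s, bandwidth_bytes_per_s * operational_intensity)


def ridge_point(peak_ops_per_s, bandwidth_bytes_per_s):
    """Operational intensity at which the memory and compute roofs meet."""
    return peak_ops_per_s / bandwidth_bytes_per_s


# Hypothetical machine: 100 Gop/s peak compute, 10 GB/s off-chip bandwidth.
PEAK = 100e9
BANDWIDTH = 10e9

memory_bound = attainable_performance(PEAK, BANDWIDTH, 2.0)    # below the ridge
compute_bound = attainable_performance(PEAK, BANDWIDTH, 20.0)  # above the ridge
ridge = ridge_point(PEAK, BANDWIDTH)
```

If the operational intensity passed in is an upper bound that holds for every valid implementation of an algorithm, as the abstract proposes, then the resulting performance figure is an upper bound for all of those implementations as well.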