Search Results - Inner Mongolia University Library

On the Retiming for Register Minimization by Means of Breadth Coefficients Matrix

IEEE EMBEDDED SYSTEMS LETTERS, 2025, Vol. 17, No. 1, pp. 58-61

Authors: Munoz, H. Emmanuel; Rivera, Jorge; Ortega-Cisneros, Susana; Gaytan-Rivas, Diego H.
Affiliation: Ctr Invest & Estudios Avanzados CINVESTAV IPN, Unidad Guadalajara, Zapopan 45010, Mexico

This letter focuses on the retiming technique for register minimization. The technique was originally presented as a minimum-cost linear problem, in which a fanout gadget was proposed to model the nodes (the functional blocks) of a digital circuit that have multiple output edges, so as to obtain a retiming solution r(V) with integer values. The goal of the technique is to minimize the function $\mathrm{COST}' = \sum_{e} \beta(e)\,\omega_r(e)$ subject to feasibility and clock-period constraints. Determining the breadth coefficients $\beta(e)$ can be cumbersome for large digital circuits, as there is no suitable method in the literature. Based on concepts from graph theory and linear algebra, an algorithm for computing the breadth coefficients is proposed. An example illustrates the performance of the proposed algorithm, with which the breadth coefficients are determined effortlessly.

Keywords: Registers; Minimization; Clocks; Digital circuits; IIR filters; Costs; Integrated circuit modeling; algorithm-architecture codesign; dataflow synthesis; digital filter synthesis
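
As a rough illustration of the cost function quoted in the abstract above, the following Python sketch evaluates the retimed edge weights and the weighted register cost for a tiny graph. It assumes the standard retiming relation, in which an edge e = (u, v) carrying w(e) registers ends up with w_r(e) = w(e) + r(v) - r(u) registers. The graph, the retiming labels, the breadth values, and all function names are invented for this example; the sketch takes the breadths as given and does not reproduce the matrix-based algorithm that the letter proposes for computing them, nor does it check the clock-period constraints.

```python
from fractions import Fraction


def retimed_weights(edges, r):
    """Return w_r(e) = w(e) + r(v) - r(u) for every edge e = (u, v)."""
    return {(u, v): w + r[v] - r[u] for (u, v), w in edges.items()}


def is_feasible(edges, r):
    """A retiming is feasible when no edge gets a negative register count."""
    return all(w >= 0 for w in retimed_weights(edges, r).values())


def register_cost(edges, r, beta):
    """Evaluate COST' = sum over edges e of beta(e) * w_r(e)."""
    wr = retimed_weights(edges, r)
    return sum(beta[e] * wr[e] for e in edges)


# Hypothetical 3-node ring: edge (u, v) -> number of registers on that edge.
edges = {("a", "b"): 1, ("b", "c"): 0, ("c", "a"): 2}
# Assumed breadth coefficients (all 1 here; fractions are allowed in general).
beta = {e: Fraction(1) for e in edges}
# Assumed integer retiming labels r(v).
r = {"a": 0, "b": 1, "c": 1}

print(retimed_weights(edges, r))
print(is_feasible(edges, r), register_cost(edges, r, beta))
```

Because the example graph is a single cycle, the retiming only moves registers around the ring: the total register count stays the same, which is a convenient sanity check when experimenting with other labels.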

Structured Term Pruning for Computational Efficient Neural Networks Inference

IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, 2023, Vol. 42, No. 1, pp. 190-203

Authors: Huang, Kai; Li, Bowen; Chen, Siang; Claesen, Luc; Xi, Wei; Chen, Junjian; Jiang, Xiaowen; Liu, Zhili; Xiong, Dongliang; Yan, Xiaolang
Affiliations: Zhejiang Univ, Inst VLSI Design, Hangzhou 310030, Peoples R China; Univ Hasselt, Engn Technol Elect ICT Dept, B-3590 Diepenbeek, Belgium; CSG Digital Grid Res Inst, Guangzhou 510670, Peoples R China; Prod Ctr Dept Sec Chip, Hangzhou 310030, Peoples R China

State-of-the-art convolutional neural network accelerators increasingly exploit bit-level sparsity to eliminate the ineffectual computations on zero bits. However, the excessive redundancy and the irregular distribution of nonzero bits limit the real speedup in these accelerators. To address this, we propose an algorithm-architecture codesign, named structured term pruning (STP), to boost the computation efficiency of neural network inference. Specifically, we enhance the bit sparsity by guiding the weights toward values with fewer power-of-two terms. Then, we structure the terms with layer-wise group budgets. Retraining is adopted to recover the accuracy drop. We also design the hardware of the group processing element and the fast signed-digit encoder for efficient implementation of STP networks. The system design of STP is realized with some easy alterations to an input-stationary systolic array design. Extensive evaluation results demonstrate that STP can significantly reduce inference computation costs and achieve $2.35\times$ computational energy saving for the ResNet18 network on the ImageNet dataset.

Keywords: algorithm-architecture codesign; compression and acceleration; neural networks; quantization; systolic array (SA)
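
To make the idea of "fewer power-of-two terms" in the abstract above more concrete, the sketch below rewrites an integer weight in non-adjacent form (a common signed-digit representation) and then keeps only a fixed number of its most significant terms. This is an assumption-laden toy, not the STP algorithm or the encoder described in the paper: the function names are hypothetical, the budget is applied to a single weight rather than to a layer-wise group, and no retraining is involved.

```python
def signed_digit_terms(n):
    """Return the non-adjacent form of integer n as (sign, exponent) terms.

    The value of n equals the sum of sign * 2**exponent over all terms,
    and no two terms use adjacent exponents.
    """
    terms = []
    exp = 0
    while n != 0:
        if n & 1:
            digit = 2 - (n % 4)  # +1 when n % 4 == 1, -1 when n % 4 == 3
            terms.append((digit, exp))
            n -= digit
        n //= 2
        exp += 1
    return terms


def prune_to_budget(n, budget):
    """Keep only the `budget` most significant signed power-of-two terms of n."""
    terms = sorted(signed_digit_terms(n), key=lambda t: t[1], reverse=True)
    return sum(sign * (1 << exp) for sign, exp in terms[:budget])


weight = 45                              # arbitrary example weight
print(signed_digit_terms(weight))        # its signed power-of-two terms
print(prune_to_budget(weight, 2))        # the weight after keeping two terms
```

The signed-digit form matters because a value such as 7 needs three terms in plain binary (4 + 2 + 1) but only two as 8 - 1, so fewer shift-and-add operations are required to multiply by it.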

Distributed and Scalable Uplink Processing for LIS: Algorithm, Architecture, and Design Trade-Offs

IEEE TRANSACTIONS ON SIGNAL PROCESSING, 2022, Vol. 70, pp. 2639-2653

Authors: Sanchez, Jesus Rodriguez; Rusek, Fredrik; Edfors, Ove; Liu, Liang
Affiliation: Lund Univ, Dept Elect & Informat Technol, S-22363 Lund, Sweden

The Large Intelligent Surface (LIS) is a promising technology in the areas of wireless communication, remote sensing, and positioning. It consists of a continuous radiating surface located in the proximity of the users, with the capability to communicate by transmission and reception (replacing base stations). Despite its potential, there are numerous challenges from an implementation point of view, with interconnection data-rate, computational complexity, and storage being the most relevant ones. To address these challenges while ensuring scalability, hierarchical architectures with distributed processing techniques are envisioned. In this work we perform algorithm-architecture codesign to propose two distributed interference cancellation algorithms and a tree-based interconnection topology for uplink processing. We also analyze the performance, hardware requirements, and architecture trade-offs for a discrete LIS, in order to provide concrete case studies and guidelines for efficient implementation of LIS systems.

Keywords: Backplanes; Signal processing algorithms; Antenna arrays; Massive MIMO; Surface waves; Transmitting antennas; Baseband; Large intelligent surface; LIS; distributed processing; algorithm-architecture codesign; equalization; inter-connection data-rate

A Generalized Algorithm and Reconfigurable Architecture for Efficient and Scalable Orthogonal Approximation of DCT

IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS I-REGULAR PAPERS, 2015, Vol. 62, No. 2, pp. 449-457

Authors: Jridi, Maher; Alfalou, Ayman; Meher, Pramod Kumar
Affiliations: ISEN Brest, Vis Team, F-29228 Brest 2, France; Nanyang Technol Univ, Sch Comp Engn, Singapore 639798, Singapore

Approximation of the discrete cosine transform (DCT) is useful for reducing its computational complexity without significant impact on its coding performance. Most of the existing algorithms for approximation of the DCT target only small transform lengths, and some of them are non-orthogonal. This paper presents a generalized recursive algorithm to obtain an orthogonal approximation of the DCT, where an approximate DCT of length N can be derived from a pair of DCTs of length N/2 at the cost of N additions for input preprocessing. We perform recursive sparse matrix decomposition and make use of the symmetries of the DCT basis vectors to derive the proposed approximation algorithm. The proposed algorithm is highly scalable for hardware as well as software implementation of DCTs of higher lengths, and it can make use of the existing approximation of the 8-point DCT to obtain an approximate DCT of any power-of-two length N > 8. We demonstrate that the proposed approximation of the DCT provides comparable or better image and video compression performance than the existing approximation methods. It is shown that the proposed algorithm involves lower arithmetic complexity than the other existing approximation algorithms. We present a fully scalable reconfigurable parallel architecture for the computation of the approximate DCT based on the proposed algorithm. One uniquely interesting feature of the proposed design is that it can be configured for the computation of a 32-point DCT, or for parallel computation of two 16-point DCTs or four 8-point DCTs, with a marginal control overhead. The proposed architecture is found to offer many advantages in terms of hardware complexity, regularity, and modularity. Experimental results obtained from an FPGA implementation show the advantage of the proposed method.

Keywords: algorithm-architecture codesign; DCT approximation; discrete cosine transform (DCT); high efficiency video coding (HEVC)
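
The recursive structure mentioned in the abstract above rests on a symmetry of the DCT basis vectors that can be checked numerically. The sketch below, which assumes the half-length transforms have length N/2, folds a length-N input into N/2 sums and N/2 differences (N additions in total) and confirms that an N/2-point DCT of the sums reproduces the even-indexed coefficients of the exact N-point DCT. It uses an unnormalized DCT-II and hypothetical function names; it does not reproduce the paper's approximation of the odd-indexed half, its 8-point kernel, or its orthogonality properties.

```python
import math


def dct2(x):
    """Unnormalized DCT-II: X[k] = sum_n x[n] * cos(pi * (2n + 1) * k / (2N))."""
    n_pts = len(x)
    return [
        sum(x[n] * math.cos(math.pi * (2 * n + 1) * k / (2 * n_pts))
            for n in range(n_pts))
        for k in range(n_pts)
    ]


def fold(x):
    """Input preprocessing: N/2 sums and N/2 differences, N additions in total."""
    half = len(x) // 2
    sums = [x[i] + x[-1 - i] for i in range(half)]
    diffs = [x[i] - x[-1 - i] for i in range(half)]
    return sums, diffs


# Arbitrary 16-point input chosen only for the demonstration.
x = [float(v) for v in (3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8, 9, 7, 9, 3)]
sums, diffs = fold(x)
full = dct2(x)          # exact 16-point transform
even_half = dct2(sums)  # 8-point transform of the folded sums

# Largest deviation between full[2m] and even_half[m]; expected to be rounding noise.
print(max(abs(full[2 * m] - even_half[m]) for m in range(8)))
```

The differences returned by `fold` feed the odd-indexed half of the transform, which is the part that an approximation scheme has to treat separately.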

On Using the Roofline Model with Lower Bounds on Data Movement

ACM TRANSACTIONS ON ARCHITECTURE AND CODE OPTIMIZATION, 2014, Vol. 11, No. 4, p. 67

Authors: Elango, Venmugil; Sedaghati, Naser; Rastello, Fabrice; Pouchet, Louis-Noel; Ramanujam, J.; Teodorescu, Radu; Sadayappan, P.
Affiliations: Ohio State Univ, Columbus, OH 43210, USA; Inria, Rocquencourt, France; Louisiana State Univ, Baton Rouge, LA 70803, USA

The roofline model is a popular approach for "bound and bottleneck" performance analysis. It focuses on the limits to the performance of processors because of limited bandwidth to off-chip memory. It models upper bounds on performance as a function of operational intensity, the ratio of computational operations per byte of data moved from/to memory. While operational intensity can be directly measured for a specific implementation of an algorithm on a particular target platform, it is of interest to obtain broader insights on bottlenecks, where various semantically equivalent implementations of an algorithm are considered, along with analysis for variations in architectural parameters. This is currently very cumbersome and requires performance modeling and analysis of many variants. In this article, we address this problem by using the roofline model in conjunction with upper bounds on the operational intensity of computations as a function of cache capacity, derived from lower bounds on data movement. This enables bottleneck analysis that holds across all dependence-preserving semantically equivalent implementations of an algorithm. We demonstrate the utility of the approach in assessing fundamental limits to performance and energy efficiency for several benchmark algorithms across a design space of architectural variations.

Keywords: Performance; Algorithms; Operational intensity upper bounds; I/O lower bounds; architecture design space exploration; algorithm-architecture codesign
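
The roofline bound described in the abstract above is simple enough to state in a few lines. The sketch below caps attainable performance at the smaller of the compute peak and the product of memory bandwidth and operational intensity, and derives an upper bound on operational intensity by dividing the operation count by a lower bound on the bytes that must be moved. The machine figures, the kernel figures, and the function names are all made up for illustration; obtaining the data-movement lower bound itself is the hard part addressed by the article and is simply taken as an input here.

```python
def attainable_performance(peak_flops, bandwidth, intensity):
    """Roofline bound: min(compute roof, memory bandwidth * operational intensity)."""
    return min(peak_flops, bandwidth * intensity)


def intensity_upper_bound(operations, min_bytes_moved):
    """Upper bound on operational intensity from a lower bound on data movement."""
    return operations / min_bytes_moved


# Hypothetical machine: 100 GFLOP/s peak compute, 10 GB/s off-chip bandwidth.
PEAK, BW = 100e9, 10e9
# Hypothetical kernel: 2e9 operations, and at least 1e9 bytes must be moved
# for the cache capacity under consideration.
oi_max = intensity_upper_bound(2e9, 1e9)
bound = attainable_performance(PEAK, BW, oi_max)

print(oi_max, bound)
```

Because `oi_max` is derived from a lower bound on data movement rather than from one measured implementation, the resulting performance bound applies to every implementation covered by that lower bound, which is what makes this kind of analysis useful for comparing architectural parameters.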
