High-level synthesis (HLS) enables the automated conversion of high-level language algorithms into synthesizable register-transfer level code, allowing computation-intensive algorithms to be accelerated on FPGAs. Most HLS tools have C++ as their input language, as it is widely known in both the software and hardware industries. However, even though C++ receives a new standard every three years, HLS tool vendors have mostly provided support and examples using C++98/03. Restricting designs to these early C++ standards imposes a productivity penalty, since the newer standards offer both reduced compilation times and a more concise, expressive, and maintainable way of writing code. In this study, we make the case for adopting modern C++ in HLS. We inspect the language features of C++11 and later and consider their benefits for HLS. We also test the current support for modern language features in two state-of-the-art commercial HLS tools. Finally, we provide an extended example demonstrating the increased clarity of code achieved with the newer standards. We note that the investigated HLS tools already support modern C++ features well, and we urge their adoption to increase designer productivity.
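The abstract does not list the specific C++11-and-later features it evaluates, so the following is only an illustrative sketch of the kind of code the argument is about: a dot product whose trip count is a compile-time template parameter, written with std::array, constexpr, and value initialization. The names and sizes are invented for the example; the point is that the statically known bound is exactly what an HLS tool needs to fully unroll or pipeline the loop.

```cpp
#include <array>
#include <cstddef>

// Modern-C++ style: the template parameter fixes the loop bound at
// compile time, so an HLS tool can fully unroll the loop; constexpr
// lets derived constants be computed during compilation, not in hardware.
template <typename T, std::size_t N>
T dot(const std::array<T, N>& a, const std::array<T, N>& b) {
    T acc{};                                 // value-initialized accumulator
    for (std::size_t i = 0; i < N; ++i)      // trip count known statically
        acc += a[i] * b[i];
    return acc;
}

int main() {
    constexpr std::size_t kTaps = 8;         // compile-time constant
    std::array<int, kTaps> coeff{1, 2, 3, 4, 4, 3, 2, 1};
    std::array<int, kTaps> sample{};
    sample.fill(1);
    return dot(coeff, sample) == 20 ? 0 : 1; // 1+2+3+4+4+3+2+1 = 20
}
```

The C++98 equivalent would pass raw pointers and a runtime length, losing both the bounds information and the type safety.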
Transformers are the mainstream of NLP applications and are becoming increasingly popular in other domains such as computer vision. Despite the improvements in model quality, their enormous computation costs make Transformers difficult to deploy, especially when the sequence length is large, as in emerging applications. The attention mechanism, the essential component of the Transformer, is the execution bottleneck due to its quadratic complexity. Prior art explores sparse patterns in attention to support long-sequence modeling, but those works rely on static or fixed patterns. We demonstrate that the sparse patterns are dynamic, depending on the input sequence. Thus, we propose Dynamic Sparse Attention (DSA), which can efficiently exploit dynamic sparse patterns in attention. Compared with other methods, our approach achieves better trade-offs between accuracy and model complexity. Moving forward, we identify challenges and provide solutions to implement DSA on existing hardware (GPUs) and on specialized hardware in order to achieve practical speedup and efficiency improvements for Transformer execution.
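The abstract does not give DSA's actual algorithm, so the sketch below is only a generic illustration of what input-dependent sparsity in attention means: for one query row, keys are kept or pruned based on a threshold derived from the scores themselves, so the mask changes with every input sequence rather than being fixed at design time. The `margin` parameter and the thresholding rule are assumptions made for this toy example, not the paper's method.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdio>
#include <vector>

// One query row of attention with an input-dependent ("dynamic") mask:
// keys are kept only if their score is within `margin` of the row
// maximum, so the kept set differs for every input.
std::vector<double> sparse_attention_row(
        const std::vector<double>& scores,
        const std::vector<std::vector<double>>& values,
        double margin) {
    double max_s = scores[0];
    for (double s : scores) max_s = std::max(max_s, s);

    std::vector<double> w(scores.size(), 0.0);
    double z = 0.0;
    for (std::size_t k = 0; k < scores.size(); ++k)
        if (scores[k] >= max_s - margin) {       // dynamic mask decision
            w[k] = std::exp(scores[k] - max_s);  // softmax over kept keys only
            z += w[k];
        }

    std::vector<double> out(values[0].size(), 0.0);
    for (std::size_t k = 0; k < scores.size(); ++k)
        for (std::size_t d = 0; d < out.size(); ++d)
            out[d] += (w[k] / z) * values[k][d]; // pruned keys contribute 0
    return out;
}

int main() {
    std::vector<double> scores = {2.0, -5.0, 1.9};  // key 1 gets pruned
    std::vector<std::vector<double>> values = {{1, 0}, {9, 9}, {0, 1}};
    auto out = sparse_attention_row(scores, values, 1.0);
    std::printf("%.3f %.3f\n", out[0], out[1]);
}
```

On hardware, the payoff comes from skipping the value reads and multiply-accumulates for pruned keys, which is where the abstract's GPU and specialized-hardware challenges arise.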
ISBN (print): 9798350308600
As neural networks are increasingly deployed on mobile and distributed computing platforms, there is a need to lower latency and increase computational speed while decreasing power and memory usage. Rather than using FPGAs as accelerators in tandem with CPUs or GPUs, we directly encode individual neural network layers as combinational logic within FPGA hardware. Utilizing binarized neural networks minimizes the arithmetic computation required, shrinking latency to only the signal propagation delay. We evaluate size-optimization strategies and demonstrate network compression via weight quantization and weight-model unification, achieving 96% of the accuracy of baseline MNIST digit-classification models while using only 3% of the memory. We further achieve an 86% decrease in model footprint, 8 mW dynamic power consumption, and <9 ns latency, validating the versatility and capability of feature-strength-based pruning approaches for binarized neural networks to flexibly meet performance requirements amid application resource constraints.
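Encoding a binarized layer as combinational logic rests on the standard XNOR-popcount identity: with weights and activations in {-1, +1} encoded as bits {0, 1}, a dot product over N bits equals 2·popcount(XNOR(w, x)) − N. A minimal software sketch of one binarized neuron under that encoding (the weight bits below are placeholders, not from the paper) is:

```cpp
#include <bit>       // std::popcount (C++20)
#include <cstdint>
#include <cstdio>

// A binarized neuron reduces to XNOR + popcount: matching bits count +1,
// mismatching bits count -1, and the sign of the total is the activation.
// This is exactly the kind of expression that collapses into a small
// block of combinational logic on an FPGA.
int binarized_neuron(std::uint64_t w, std::uint64_t x, int n_bits) {
    std::uint64_t mask = (n_bits == 64) ? ~0ULL : ((1ULL << n_bits) - 1);
    std::uint64_t agree = ~(w ^ x) & mask;        // XNOR: agreeing bit positions
    int dot = 2 * std::popcount(agree) - n_bits;  // {-1,+1} dot product
    return dot >= 0 ? 1 : 0;                      // sign activation
}

int main() {
    // 8-bit toy example: weights and an input that agree on 6 of 8 bits.
    std::uint64_t w = 0b10110100;
    std::uint64_t x = 0b10110111;
    std::printf("activation = %d\n", binarized_neuron(w, x, 8));
}
```

In hardware the XNOR array and popcount tree synthesize to pure gates, which is why latency reduces to propagation delay.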
This paper presents and discusses the implementation of a learning accelerator for an LSTM neural network on an FPGA. The accelerator implements the backpropagation-through-time algorithm for an LSTM. The presented network performs a binary classification task and consists of an LSTM and a dense layer. Its performance is compared to both a hand-coded Python implementation and an implementation using the Keras library on a GPU. The implementation is executed using the DSP blocks available via the Vivado Design Suite, which are compliant with the IEEE 754 standard. Simulation results show that the FPGA implementation remains accurate and achieves higher speed than the other solutions.
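The abstract does not spell out the cell equations the BPTT accelerator differentiates; for reference, a minimal forward LSTM step is sketched below, simplified to a scalar input and diagonal (per-unit) recurrent weights. Both simplifications are made here for brevity and are not taken from the paper.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdio>
#include <vector>

// One forward step of an LSTM cell. BPTT unrolls this step over time and
// backpropagates through the gate nonlinearities.
struct LstmCell {
    std::vector<double> Wi, Wf, Wo, Wg;  // input weights per gate
    std::vector<double> Ui, Uf, Uo, Ug;  // recurrent weights (diagonal here)
    std::vector<double> bi, bf, bo, bg;  // biases per gate
    std::vector<double> h, c;            // hidden and cell state

    static double sigm(double z) { return 1.0 / (1.0 + std::exp(-z)); }

    void step(double x) {
        for (std::size_t j = 0; j < h.size(); ++j) {
            double i = sigm(Wi[j] * x + Ui[j] * h[j] + bi[j]);      // input gate
            double f = sigm(Wf[j] * x + Uf[j] * h[j] + bf[j]);      // forget gate
            double o = sigm(Wo[j] * x + Uo[j] * h[j] + bo[j]);      // output gate
            double g = std::tanh(Wg[j] * x + Ug[j] * h[j] + bg[j]); // candidate
            c[j] = f * c[j] + i * g;      // new cell state
            h[j] = o * std::tanh(c[j]);   // new hidden state
        }
    }
};

int main() {
    LstmCell cell{{1, 1}, {1, 1}, {1, 1}, {1, 1},
                  {0.5, -0.5}, {0.5, -0.5}, {0.5, -0.5}, {0.5, -0.5},
                  {0, 0}, {0, 0}, {0, 0}, {0, 0},
                  {0, 0}, {0, 0}};
    for (double x : {0.5, -1.0, 0.25}) cell.step(x);
    std::printf("h = %.4f, %.4f\n", cell.h[0], cell.h[1]);
}
```

On the FPGA, each gate's multiply-accumulate maps naturally onto the IEEE 754-compliant DSP blocks the abstract mentions.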
Today's computing is increasingly data-intensive, heralding the age of big data. With greater data volumes come the need for faster processing, greater storage capacity, and expanded communication bandwidth, all of which imply the expenditure of more energy. Thus, energy efficiency, already a major design consideration, will assume broader significance in the coming years. As important as storage and communications are, our focus in this paper is on better technology to reduce computation (logic manipulation) power. We review majority logic, a special case of threshold logic, show how a number of common arithmetic/logic operations can be performed using the majority-gate primitive, and review an impressive array of atomic-scale logic technologies that are particularly efficient in realizing the majority or minority function. We conclude that a combination of orders-of-magnitude energy reduction by virtue of the technology used, together with implementation strategies that lead to comparable complexity in terms of majority gates when contrasted with currently used circuit primitives (AND, OR, XOR, NOT, mux), leads to energy-efficient realization of arithmetic/logic functions suitable for use in the age of big data.
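The reduction of common logic operations to the majority primitive mentioned in the abstract can be stated in a few lines. The sketch below shows the standard constructions: AND and OR by tying one majority input to a constant, and the full-adder carry, which is a majority outright.

```cpp
#include <cassert>

// 3-input majority: true when at least two inputs are true.
bool maj(bool a, bool b, bool c) { return (a & b) | (b & c) | (a & c); }

// AND, OR, and the full-adder carry all fall out of one primitive:
bool and2(bool a, bool b) { return maj(a, b, false); }          // tie an input to 0
bool or2 (bool a, bool b) { return maj(a, b, true);  }          // tie an input to 1
bool carry(bool a, bool b, bool cin) { return maj(a, b, cin); } // carry is MAJ itself

int main() {
    for (int a = 0; a < 2; ++a)
        for (int b = 0; b < 2; ++b) {
            assert(and2(a, b) == (a && b));
            assert(or2(a, b)  == (a || b));
        }
    return 0;
}
```

With inversion added (the "minority" function the abstract also mentions), this gate set is functionally complete, which is what makes majority-native nanotechnologies viable for general logic.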
This paper presents and discusses the implementation of a deep neural network for failure prediction in the cold-forging process. The implementation consists of an LSTM and a dense layer implemented on an FPGA. The network was trained beforehand on a desktop computer using the Keras library for Python, and the resulting weights and biases were embedded into the implementation. The implementation is executed using the DSP blocks available via the Vivado Design Suite, which are compliant with the IEEE 754 standard. In simulation, the network achieves 100% classification accuracy on the test data and high calculation speed.
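The train-offline-then-embed flow described in the abstract amounts to compiling the learned parameters into the design as constants. The sketch below pictures the final dense-plus-sigmoid classification stage under that scheme; the weight values and layer width are placeholders for illustration, not the paper's trained parameters.

```cpp
#include <array>
#include <cmath>
#include <cstddef>
#include <cstdio>

// Weights trained offline (e.g., in Keras) exported and baked in as
// constants; in HLS these become ROM/LUT contents rather than runtime data.
constexpr std::array<float, 4> kW = {0.42f, -1.10f, 0.77f, 0.05f};  // placeholders
constexpr float kBias = -0.31f;                                     // placeholder

// Dense layer with one output unit and sigmoid activation: the final
// binary-classification stage fed by the LSTM's hidden state.
float dense_sigmoid(const std::array<float, 4>& h) {
    float z = kBias;
    for (std::size_t j = 0; j < h.size(); ++j) z += kW[j] * h[j];
    return 1.0f / (1.0f + std::exp(-z));
}

int main() {
    std::array<float, 4> hidden = {0.1f, 0.9f, -0.3f, 0.5f};
    float p = dense_sigmoid(hidden);
    std::printf("failure probability = %.3f -> class %d\n", p, p > 0.5f);
}
```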
This paper presents and discusses the implementation of an LSTM cell on an FPGA with an activation function inspired by the CORDIC algorithm. The realization is performed using both IEEE 754 floating-point and 32-bit integer numbers. The floating-point case is analyzed with and without the DSP blocks provided by the Xilinx design suite. The alternative implementation using integer arithmetic was optimized for a minimal number of clock cycles. The presented implementation targets the xc6slx150t-2fgg900 device and achieves high calculation accuracy in both cases.
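The abstract says only that the activation is CORDIC-inspired, so the sketch below is a generic hyperbolic-CORDIC tanh in rotation mode, not the paper's design. Each iteration uses only additions and power-of-two scalings (shifts in hardware), the atanh constants would sit in a small ROM, and the CORDIC gain cancels in the final sinh/cosh division.

```cpp
#include <cmath>
#include <cstdio>

// Shift-add approximation of tanh(z) via hyperbolic CORDIC, rotation mode.
// Iterations 4, 13, 40, ... must be repeated for hyperbolic convergence;
// valid for |z| up to about 1.11.
double cordic_tanh(double z) {
    double x = 1.0, y = 0.0;                  // gain cancels in y/x
    int i = 1, repeat = 4;
    for (int step = 0; step < 20; ++step) {
        double p = std::ldexp(1.0, -i);       // 2^-i: a shift in hardware
        double e = std::atanh(p);             // precomputed ROM constant
        double d = (z >= 0) ? 1.0 : -1.0;     // drive residual angle to 0
        double xn = x + d * y * p;
        double yn = y + d * x * p;
        z -= d * e;
        x = xn; y = yn;
        if (i == repeat) repeat = 3 * repeat + 1;  // reuse i once, then move on
        else ++i;
    }
    return y / x;                             // tanh = sinh / cosh
}

int main() {
    for (double z : {-1.0, -0.5, 0.0, 0.5, 1.0})
        std::printf("z=%5.2f  cordic=%.6f  std=%.6f\n",
                    z, cordic_tanh(z), std::tanh(z));
}
```

A fixed-point version of the same recurrence, with the shifts literal and the atanh table quantized, is the natural fit for the 32-bit integer variant the abstract describes.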
Petascale supercomputers are already pushing power boundaries that can be supplied or dissipated cost-effectively; greater challenges await us in the era of exascale machines. We are thus motivated to study methods of reducing the energy cost of arithmetic operations, which can be substantial in numerically intensive applications. Addition, being both a widely used operation in itself and an important building block for synthesizing other arithmetic operations, has received much attention in this regard. Circuit and energy costs of fast adders are dominated by their fast carry networks. The availability of a simple and energy-efficient majority function in certain emerging nanotechnologies (such as quantum-dot cellular automata, single-electron tunneling, tunneling phase logic, magnetic tunnel junctions, nanoscale bar magnets, and memristors) has motivated our work to reformulate the carry recurrence in terms of fully utilized majority elements, with all three inputs usefully employed. We compare our novel designs and resulting circuits to prior proposals based on 3-input majority elements in quantum-dot cellular automata, demonstrating advantages in both speed and circuit complexity. We also show that the performance and cost advantages carry over to at least one other emerging, energy-efficient technology, single-electron tunneling, raising hopes for achieving similar benefits with other technologies, which we review very briefly.
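The abstract's starting point is the standard majority form of the carry recurrence, c_{i+1} = MAJ(a_i, b_i, c_i); the paper's actual reformulation is not given here. As a baseline sketch, a bit-serial adder written entirely in majority elements (using a known majority identity for the sum bit) looks like:

```cpp
#include <cstdint>
#include <cstdio>

// 3-input majority: true when at least two inputs are true.
bool maj(bool a, bool b, bool c) { return (a & b) | (b & c) | (a & c); }

// Ripple adder in majority logic: every carry is one MAJ with all three
// inputs usefully employed, and the sum bit uses the identity
//   s = MAJ(!cout, cin, MAJ(a, b, !cin)).
std::uint32_t ripple_add(std::uint32_t a, std::uint32_t b) {
    std::uint32_t s = 0;
    bool c = false;                              // c_0 = 0
    for (int i = 0; i < 32; ++i) {
        bool ai = (a >> i) & 1, bi = (b >> i) & 1;
        bool cout = maj(ai, bi, c);              // c_{i+1} = MAJ(a_i, b_i, c_i)
        bool sum  = maj(!cout, c, maj(ai, bi, !c));
        s |= static_cast<std::uint32_t>(sum) << i;
        c = cout;
    }
    return s;
}

int main() {
    std::printf("%u\n", ripple_add(123456789u, 987654321u));  // 1111111110
}
```

The fast adders the abstract targets replace this linear carry chain with a parallel (e.g., prefix-style) carry network, which is where the majority-element count and depth comparisons are made.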
Gaussian Mixture Models (GMMs) are widely used in many applications, such as data mining, signal processing, and computer vision, for probability density modeling and soft clustering. However, the parameters of a GMM need to be estimated from data by, for example, the Expectation-Maximization algorithm for Gaussian Mixture Models (EM-GMM), which is computationally demanding. This paper presents a novel design for the EM-GMM algorithm targeting reconfigurable platforms, with five main contributions. First, a pipeline-friendly EM-GMM with diagonal covariance matrices that can easily be mapped to hardware architectures. Second, a function evaluation unit for the Gaussian probability density based on fixed-point arithmetic. Third, an extension of the approach to support a wide range of dimensions and/or components by fitting multiple pieces of smaller dimensions onto an FPGA chip. Fourth, a cost and performance model that estimates logic resources. Fifth, a dataflow design targeting the Maxeler MPC-X2000 with a Stratix-5SGSD8 FPGA that runs over 200 times faster than a 6-core Xeon E5645 processor and over 39 times faster than a Pascal TITAN-X GPU. Our design provides a practical solution for training GMMs with hundreds of millions of high-dimensional input instances and for exploring better parameters, for low-latency and high-performance applications.
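The first contribution, a pipeline-friendly EM-GMM with diagonal covariances, builds on the textbook EM updates: diagonal Sigma turns each Gaussian evaluation into a product of one-dimensional terms, which streams well through a pipeline. The sketch below shows one EM iteration for the 1-D case, a simplification made here for brevity (the paper handles high-dimensional data and fixed-point evaluation).

```cpp
#include <cmath>
#include <cstddef>
#include <cstdio>
#include <vector>

const double kTwoPi = 6.283185307179586;

struct Gmm { std::vector<double> w, mu, var; };  // K components, 1-D data

double gauss(double x, double mu, double var) {
    return std::exp(-0.5 * (x - mu) * (x - mu) / var) / std::sqrt(kTwoPi * var);
}

// One EM iteration: E-step responsibilities, then closed-form M-step.
void em_step(Gmm& g, const std::vector<double>& xs) {
    std::size_t K = g.w.size(), N = xs.size();
    std::vector<double> Nk(K, 0), sum(K, 0), sq(K, 0);
    for (double x : xs) {                        // E-step, one pass over data
        std::vector<double> r(K);
        double z = 0;
        for (std::size_t k = 0; k < K; ++k)
            z += r[k] = g.w[k] * gauss(x, g.mu[k], g.var[k]);
        for (std::size_t k = 0; k < K; ++k) {    // accumulate sufficient stats
            double rk = r[k] / z;
            Nk[k] += rk; sum[k] += rk * x; sq[k] += rk * x * x;
        }
    }
    for (std::size_t k = 0; k < K; ++k) {        // M-step updates
        g.w[k]   = Nk[k] / N;
        g.mu[k]  = sum[k] / Nk[k];
        g.var[k] = sq[k] / Nk[k] - g.mu[k] * g.mu[k];
    }
}

int main() {
    Gmm g{{0.5, 0.5}, {-1.0, 1.0}, {1.0, 1.0}};
    std::vector<double> xs = {-2.1, -1.9, -0.8, 0.9, 1.1, 2.2};
    for (int it = 0; it < 50; ++it) em_step(g, xs);
    std::printf("mu = %.3f, %.3f\n", g.mu[0], g.mu[1]);
}
```

Note that the E-step is a single streaming pass accumulating sufficient statistics, which is what makes the algorithm amenable to a dataflow design.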
The Internet of Things (IoT) has triggered rapid advances in sensors, surveillance devices, wearables, and body area networks with advanced Human-Computer Interfaces (HCI). One such application area is the adoption of Body Worn Cameras (BWCs) by law enforcement officials. The need to be 'always-on' puts heavy constraints on battery usage in these camera front-ends, thus limiting their widespread adoption. Further, the increasing number of such cameras is expected to create a data deluge, which requires large processing, transmission, and storage capabilities. Instead of continuously capturing and streaming or storing videos, it is prudent to provide "smartness" to the camera front-end. This requires hardware-assisted image recognition and template matching in the front-end, capable of making judicious decisions on when to trigger video capture or streaming. Neural networks based on Restricted Boltzmann Machines (RBMs) have been shown to provide high accuracy for image recognition and are well suited for low-power and reconfigurable systems. In this paper we propose an RBM-based "always-on" camera front-end capable of detecting human posture. Aggressive behavior of the human in the field of view is used as a wake-up signal for further data collection and classification. The proposed system has been implemented on a Xilinx Virtex-7 XC7VX485T platform. A minimum dynamic power of 19.18 mW for a target recognition accuracy, while maintaining real-time constraints, has been measured. The hardware-software co-design illustrates the trade-offs in the design with respect to accuracy, resource utilization, processing time, and power. The results demonstrate the possibility of a true "always-on" body-worn camera system in the IoT environment.
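The abstract does not detail the RBM computation mapped to the FPGA; as a generic reference, the hidden-layer inference pass of an RBM, the matrix-vector-plus-sigmoid kernel such a front-end would evaluate per frame, is sketched below with placeholder weights and a toy-sized input.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdio>
#include <vector>

// Inference pass of an RBM used as a feature detector: hidden unit j fires
// with probability sigmoid(b_j + sum_i W[j][i] * v_i). A wake-up front-end
// can threshold these probabilities to decide when to trigger capture.
std::vector<double> rbm_hidden(const std::vector<std::vector<double>>& W,
                               const std::vector<double>& b,
                               const std::vector<double>& v) {
    std::vector<double> h(b.size());
    for (std::size_t j = 0; j < b.size(); ++j) {
        double z = b[j];
        for (std::size_t i = 0; i < v.size(); ++i) z += W[j][i] * v[i];
        h[j] = 1.0 / (1.0 + std::exp(-z));       // sigmoid activation
    }
    return h;
}

int main() {
    // Toy 4-pixel "image" and 2 hidden units with placeholder weights.
    std::vector<std::vector<double>> W = {{1.5, -0.5, 0.5, -1.0},
                                          {-1.0, 1.0, -0.5, 1.5}};
    std::vector<double> b = {-0.5, -0.5};
    std::vector<double> v = {1, 0, 1, 0};
    auto h = rbm_hidden(W, b, v);
    std::printf("p(h1)=%.3f  p(h2)=%.3f\n", h[0], h[1]);
}
```

Because inference needs only this forward pass (no sampling or training), the kernel reduces to fixed multiply-accumulate arrays plus a sigmoid lookup, which is what keeps the always-on power budget in the milliwatt range.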