High-level synthesis (HLS) enables the automated conversion of high-level language algorithms into synthesizable register-transfer level code, allowing computation-intensive algorithms to be accelerated on FPGAs. Most HLS tools have C++ as their input language, as it is widely known in both the software and hardware industries. However, even though C++ receives a new standard every three years, HLS tool vendors have mostly provided support and examples using C++98/03. Restricting designs to these early C++ standards imposes a productivity penalty, since the newer standards offer both reduced compilation times and a more concise, expressive, and maintainable way of writing code. In this study, we make the case for adopting modern C++ in HLS. We inspect the language features of C++11 and later and consider their benefits for HLS. We also test the current support for modern language features in two state-of-the-art commercial HLS tools. Finally, we provide an extended example demonstrating the increased clarity of code achieved with the newer standards. We note that the investigated HLS tools already support modern C++ features well, and we urge their adoption to increase designer productivity.
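The abstract does not list the specific C++11-and-later features it evaluates, so the following is only an illustrative sketch of the kind of code the argument is about: a dot product whose trip count is a compile-time template parameter, written with std::array, constexpr, and value initialization. The names and sizes are invented for the example; the point is that the statically known bound is exactly what an HLS tool needs to fully unroll or pipeline the loop.

```cpp
#include <array>
#include <cstddef>

// Modern-C++ style: the template parameter fixes the loop bound at
// compile time, so an HLS tool can fully unroll the loop; constexpr
// lets derived constants be computed during compilation, not in hardware.
template <typename T, std::size_t N>
T dot(const std::array<T, N>& a, const std::array<T, N>& b) {
    T acc{};                                 // value-initialized accumulator
    for (std::size_t i = 0; i < N; ++i)      // trip count known statically
        acc += a[i] * b[i];
    return acc;
}

int main() {
    constexpr std::size_t kTaps = 8;         // compile-time constant
    std::array<int, kTaps> coeff{1, 2, 3, 4, 4, 3, 2, 1};
    std::array<int, kTaps> sample{};
    sample.fill(1);
    return dot(coeff, sample) == 20 ? 0 : 1; // 1+2+3+4+4+3+2+1 = 20
}
```

The C++98 equivalent would pass raw pointers and a runtime length, losing both the bounds information and the type safety.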
Transformers are the mainstream of NLP applications and are becoming increasingly popular in other domains such as computer vision. Despite the improvements in model quality, their enormous computation costs make Transformers difficult to deploy, especially when the sequence length is large, as in emerging applications. The attention mechanism, the essential component of the Transformer, is the execution bottleneck due to its quadratic complexity. Prior art explores sparse patterns in attention to support long-sequence modeling, but those works rely on static or fixed patterns. We demonstrate that the sparse patterns are dynamic, depending on the input sequence. Thus, we propose Dynamic Sparse Attention (DSA), which can efficiently exploit dynamic sparse patterns in attention. Compared with other methods, our approach achieves better trade-offs between accuracy and model complexity. Moving forward, we identify challenges and provide solutions to implement DSA on existing hardware (GPUs) and on specialized hardware in order to achieve practical speedup and efficiency improvements for Transformer execution.
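The abstract does not give DSA's actual algorithm, so the sketch below is only a generic illustration of what input-dependent sparsity in attention means: for one query row, keys are kept or pruned based on a threshold derived from the scores themselves, so the mask changes with every input sequence rather than being fixed at design time. The `margin` parameter and the thresholding rule are assumptions made for this toy example, not the paper's method.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdio>
#include <vector>

// One query row of attention with an input-dependent ("dynamic") mask:
// keys are kept only if their score is within `margin` of the row
// maximum, so the kept set differs for every input.
std::vector<double> sparse_attention_row(
        const std::vector<double>& scores,
        const std::vector<std::vector<double>>& values,
        double margin) {
    double max_s = scores[0];
    for (double s : scores) max_s = std::max(max_s, s);

    std::vector<double> w(scores.size(), 0.0);
    double z = 0.0;
    for (std::size_t k = 0; k < scores.size(); ++k)
        if (scores[k] >= max_s - margin) {       // dynamic mask decision
            w[k] = std::exp(scores[k] - max_s);  // softmax over kept keys only
            z += w[k];
        }

    std::vector<double> out(values[0].size(), 0.0);
    for (std::size_t k = 0; k < scores.size(); ++k)
        for (std::size_t d = 0; d < out.size(); ++d)
            out[d] += (w[k] / z) * values[k][d]; // pruned keys contribute 0
    return out;
}

int main() {
    std::vector<double> scores = {2.0, -5.0, 1.9};  // key 1 gets pruned
    std::vector<std::vector<double>> values = {{1, 0}, {9, 9}, {0, 1}};
    auto out = sparse_attention_row(scores, values, 1.0);
    std::printf("%.3f %.3f\n", out[0], out[1]);
}
```

On hardware, the payoff comes from skipping the value reads and multiply-accumulates for pruned keys, which is where the abstract's GPU and specialized-hardware challenges arise.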
ISBN (print): 9798350308600
As neural networks are increasingly deployed on mobile and distributed computing platforms, there is a need to lower latency and increase computational speed while decreasing power and memory usage. Rather than using FPGAs as accelerators in tandem with CPUs or GPUs, we directly encode individual neural network layers as combinational logic within FPGA hardware. Utilizing binarized neural networks minimizes the arithmetic computation required, shrinking latency to only the signal propagation delay. We evaluate size-optimization strategies and demonstrate network compression via weight quantization and weight-model unification, achieving 96% of the accuracy of baseline MNIST digit-classification models while using only 3% of the memory. We further achieve an 86% decrease in model footprint, 8 mW dynamic power consumption, and <9 ns latency, validating the versatility and capability of feature-strength-based pruning approaches for binarized neural networks to flexibly meet performance requirements amid application resource constraints.
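Encoding a binarized layer as combinational logic rests on the standard XNOR-popcount identity: with weights and activations in {-1, +1} encoded as bits {0, 1}, a dot product over N bits equals 2·popcount(XNOR(w, x)) − N. A minimal software sketch of one binarized neuron under that encoding (the weight bits below are placeholders, not from the paper) is:

```cpp
#include <bit>       // std::popcount (C++20)
#include <cstdint>
#include <cstdio>

// A binarized neuron reduces to XNOR + popcount: matching bits count +1,
// mismatching bits count -1, and the sign of the total is the activation.
// This is exactly the kind of expression that collapses into a small
// block of combinational logic on an FPGA.
int binarized_neuron(std::uint64_t w, std::uint64_t x, int n_bits) {
    std::uint64_t mask = (n_bits == 64) ? ~0ULL : ((1ULL << n_bits) - 1);
    std::uint64_t agree = ~(w ^ x) & mask;        // XNOR: agreeing bit positions
    int dot = 2 * std::popcount(agree) - n_bits;  // {-1,+1} dot product
    return dot >= 0 ? 1 : 0;                      // sign activation
}

int main() {
    // 8-bit toy example: weights and an input that agree on 6 of 8 bits.
    std::uint64_t w = 0b10110100;
    std::uint64_t x = 0b10110111;
    std::printf("activation = %d\n", binarized_neuron(w, x, 8));
}
```

In hardware the XNOR array and popcount tree synthesize to pure gates, which is why latency reduces to propagation delay.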
This paper presents and discusses the implementation of a learning accelerator for an LSTM neural network on an FPGA. The accelerator implements the backpropagation-through-time algorithm for an LSTM. The presented network performs a binary classification task and consists of an LSTM and a dense layer. Its performance is compared to both a hand-coded Python implementation and an implementation using the Keras library on a GPU. The implementation is executed using the DSP blocks available via the Vivado Design Suite, which are compliant with the IEEE 754 standard. Simulation results show that the FPGA implementation remains accurate and achieves higher speed than the other solutions.
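The abstract does not spell out the cell equations the BPTT accelerator differentiates; for reference, a minimal forward LSTM step is sketched below, simplified to a scalar input and diagonal (per-unit) recurrent weights. Both simplifications are made here for brevity and are not taken from the paper.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdio>
#include <vector>

// One forward step of an LSTM cell. BPTT unrolls this step over time and
// backpropagates through the gate nonlinearities.
struct LstmCell {
    std::vector<double> Wi, Wf, Wo, Wg;  // input weights per gate
    std::vector<double> Ui, Uf, Uo, Ug;  // recurrent weights (diagonal here)
    std::vector<double> bi, bf, bo, bg;  // biases per gate
    std::vector<double> h, c;            // hidden and cell state

    static double sigm(double z) { return 1.0 / (1.0 + std::exp(-z)); }

    void step(double x) {
        for (std::size_t j = 0; j < h.size(); ++j) {
            double i = sigm(Wi[j] * x + Ui[j] * h[j] + bi[j]);      // input gate
            double f = sigm(Wf[j] * x + Uf[j] * h[j] + bf[j]);      // forget gate
            double o = sigm(Wo[j] * x + Uo[j] * h[j] + bo[j]);      // output gate
            double g = std::tanh(Wg[j] * x + Ug[j] * h[j] + bg[j]); // candidate
            c[j] = f * c[j] + i * g;      // new cell state
            h[j] = o * std::tanh(c[j]);   // new hidden state
        }
    }
};

int main() {
    LstmCell cell{{1, 1}, {1, 1}, {1, 1}, {1, 1},
                  {0.5, -0.5}, {0.5, -0.5}, {0.5, -0.5}, {0.5, -0.5},
                  {0, 0}, {0, 0}, {0, 0}, {0, 0},
                  {0, 0}, {0, 0}};
    for (double x : {0.5, -1.0, 0.25}) cell.step(x);
    std::printf("h = %.4f, %.4f\n", cell.h[0], cell.h[1]);
}
```

On the FPGA, each gate's multiply-accumulate maps naturally onto the IEEE 754-compliant DSP blocks the abstract mentions.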
Today's computing is increasingly data-intensive, heralding the age of big data. With greater data volumes come the need for faster processing, greater storage capacity, and expanded communication bandwidth, all of which imply the expenditure of more energy. Thus, energy efficiency, already a major design consideration, will assume broader significance in the coming years. As important as storage and communications are, our focus in this paper is on better technology to reduce computation (logic manipulation) power. We review majority logic, a special case of threshold logic, show how a number of common arithmetic/logic operations can be performed using the majority-gate primitive, and review an impressive array of atomic-scale logic technologies that are particularly efficient in realizing the majority or minority function. We conclude that a combination of orders-of-magnitude energy reduction by virtue of the technology used, together with implementation strategies that lead to comparable complexity in terms of majority gates when contrasted with currently used circuit primitives (AND, OR, XOR, NOT, mux), leads to energy-efficient realization of arithmetic/logic functions suitable for use in the age of big data.
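The reduction of common logic operations to the majority primitive mentioned in the abstract can be stated in a few lines. The sketch below shows the standard constructions: AND and OR by tying one majority input to a constant, and the full-adder carry, which is a majority outright.

```cpp
#include <cassert>

// 3-input majority: true when at least two inputs are true.
bool maj(bool a, bool b, bool c) { return (a & b) | (b & c) | (a & c); }

// AND, OR, and the full-adder carry all fall out of one primitive:
bool and2(bool a, bool b) { return maj(a, b, false); }          // tie an input to 0
bool or2 (bool a, bool b) { return maj(a, b, true);  }          // tie an input to 1
bool carry(bool a, bool b, bool cin) { return maj(a, b, cin); } // carry is MAJ itself

int main() {
    for (int a = 0; a < 2; ++a)
        for (int b = 0; b < 2; ++b) {
            assert(and2(a, b) == (a && b));
            assert(or2(a, b)  == (a || b));
        }
    return 0;
}
```

With inversion added (the "minority" function the abstract also mentions), this gate set is functionally complete, which is what makes majority-native nanotechnologies viable for general logic.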
This paper presents and discusses the implementation of a deep neural network for failure prediction in the cold-forging process. The implementation consists of an LSTM and a dense layer implemented on an FPGA. The network was trained beforehand on a desktop computer using the Keras library for Python, and the resulting weights and biases were embedded into the implementation. The implementation is executed using the DSP blocks available via the Vivado Design Suite, which are compliant with the IEEE 754 standard. In simulation, the network achieves 100% classification accuracy on the test data and high calculation speed.
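The train-offline-then-embed flow described in the abstract amounts to compiling the learned parameters into the design as constants. The sketch below pictures the final dense-plus-sigmoid classification stage under that scheme; the weight values and layer width are placeholders for illustration, not the paper's trained parameters.

```cpp
#include <array>
#include <cmath>
#include <cstddef>
#include <cstdio>

// Weights trained offline (e.g., in Keras) exported and baked in as
// constants; in HLS these become ROM/LUT contents rather than runtime data.
constexpr std::array<float, 4> kW = {0.42f, -1.10f, 0.77f, 0.05f};  // placeholders
constexpr float kBias = -0.31f;                                     // placeholder

// Dense layer with one output unit and sigmoid activation: the final
// binary-classification stage fed by the LSTM's hidden state.
float dense_sigmoid(const std::array<float, 4>& h) {
    float z = kBias;
    for (std::size_t j = 0; j < h.size(); ++j) z += kW[j] * h[j];
    return 1.0f / (1.0f + std::exp(-z));
}

int main() {
    std::array<float, 4> hidden = {0.1f, 0.9f, -0.3f, 0.5f};
    float p = dense_sigmoid(hidden);
    std::printf("failure probability = %.3f -> class %d\n", p, p > 0.5f);
}
```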
This paper presents and discusses the implementation of an LSTM cell on an FPGA with an activation function inspired by the CORDIC algorithm. The realization is performed using both IEEE 754 floating-point and 32-bit integer numbers. The floating-point case is analyzed with and without the DSP blocks provided by the Xilinx design suite. The alternative implementation using integer arithmetic was optimized for a minimal number of clock cycles. The presented implementation targets the xc6slx150t-2fgg900 device and achieves high calculation accuracy in both cases.
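The abstract says only that the activation is CORDIC-inspired, so the sketch below is a generic hyperbolic-CORDIC tanh in rotation mode, not the paper's design. Each iteration uses only additions and power-of-two scalings (shifts in hardware), the atanh constants would sit in a small ROM, and the CORDIC gain cancels in the final sinh/cosh division.

```cpp
#include <cmath>
#include <cstdio>

// Shift-add approximation of tanh(z) via hyperbolic CORDIC, rotation mode.
// Iterations 4, 13, 40, ... must be repeated for hyperbolic convergence;
// valid for |z| up to about 1.11.
double cordic_tanh(double z) {
    double x = 1.0, y = 0.0;                  // gain cancels in y/x
    int i = 1, repeat = 4;
    for (int step = 0; step < 20; ++step) {
        double p = std::ldexp(1.0, -i);       // 2^-i: a shift in hardware
        double e = std::atanh(p);             // precomputed ROM constant
        double d = (z >= 0) ? 1.0 : -1.0;     // drive residual angle to 0
        double xn = x + d * y * p;
        double yn = y + d * x * p;
        z -= d * e;
        x = xn; y = yn;
        if (i == repeat) repeat = 3 * repeat + 1;  // reuse i once, then move on
        else ++i;
    }
    return y / x;                             // tanh = sinh / cosh
}

int main() {
    for (double z : {-1.0, -0.5, 0.0, 0.5, 1.0})
        std::printf("z=%5.2f  cordic=%.6f  std=%.6f\n",
                    z, cordic_tanh(z), std::tanh(z));
}
```

A fixed-point version of the same recurrence, with the shifts literal and the atanh table quantized, is the natural fit for the 32-bit integer variant the abstract describes.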
Petascale supercomputers are already pushing power boundaries that can be supplied or dissipated cost-effectively; greater challenges await us in the era of exascale machines. We are thus motivated to study methods of reducing the energy cost of arithmetic operations, which can be substantial in numerically intensive applications. Addition, being both a widely used operation in itself and an important building block for synthesizing other arithmetic operations, has received much attention in this regard. Circuit and energy costs of fast adders are dominated by their fast carry networks. The availability of a simple and energy-efficient majority function in certain emerging nanotechnologies (such as quantum-dot cellular automata, single-electron tunneling, tunneling phase logic, magnetic tunnel junctions, nanoscale bar magnets, and memristors) has motivated our work to reformulate the carry recurrence in terms of fully utilized majority elements, with all three inputs usefully employed. We compare our novel designs and resulting circuits to prior proposals based on 3-input majority elements in quantum-dot cellular automata, demonstrating advantages in both speed and circuit complexity. We also show that the performance and cost advantages carry over to at least one other emerging, energy-efficient technology, single-electron tunneling, raising hopes for achieving similar benefits with other technologies, which we review very briefly.
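The abstract's starting point is the standard majority form of the carry recurrence, c_{i+1} = MAJ(a_i, b_i, c_i); the paper's actual reformulation is not given here. As a baseline sketch, a bit-serial adder written entirely in majority elements (using a known majority identity for the sum bit) looks like:

```cpp
#include <cstdint>
#include <cstdio>

// 3-input majority: true when at least two inputs are true.
bool maj(bool a, bool b, bool c) { return (a & b) | (b & c) | (a & c); }

// Ripple adder in majority logic: every carry is one MAJ with all three
// inputs usefully employed, and the sum bit uses the identity
//   s = MAJ(!cout, cin, MAJ(a, b, !cin)).
std::uint32_t ripple_add(std::uint32_t a, std::uint32_t b) {
    std::uint32_t s = 0;
    bool c = false;                              // c_0 = 0
    for (int i = 0; i < 32; ++i) {
        bool ai = (a >> i) & 1, bi = (b >> i) & 1;
        bool cout = maj(ai, bi, c);              // c_{i+1} = MAJ(a_i, b_i, c_i)
        bool sum  = maj(!cout, c, maj(ai, bi, !c));
        s |= static_cast<std::uint32_t>(sum) << i;
        c = cout;
    }
    return s;
}

int main() {
    std::printf("%u\n", ripple_add(123456789u, 987654321u));  // 1111111110
}
```

The fast adders the abstract targets replace this linear carry chain with a parallel (e.g., prefix-style) carry network, which is where the majority-element count and depth comparisons are made.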
Gaussian Mixture Models (GMMs) are widely used in many applications, such as data mining, signal processing, and computer vision, for probability density modeling and soft clustering. However, the parameters of a GMM need to be estimated from data by, for example, the Expectation-Maximization algorithm for Gaussian Mixture Models (EM-GMM), which is computationally demanding. This paper presents a novel design for the EM-GMM algorithm targeting reconfigurable platforms, with five main contributions. First, a pipeline-friendly EM-GMM with diagonal covariance matrices that can easily be mapped to hardware architectures. Second, a function evaluation unit for the Gaussian probability density based on fixed-point arithmetic. Third, an extension of the approach to support a wide range of dimensions and/or components by fitting multiple pieces of smaller dimensions onto an FPGA chip. Fourth, a cost and performance model that estimates logic resources. Fifth, a dataflow design targeting the Maxeler MPC-X2000 with a Stratix-5SGSD8 FPGA that runs over 200 times faster than a 6-core Xeon E5645 processor and over 39 times faster than a Pascal TITAN-X GPU. Our design provides a practical solution for training GMMs with hundreds of millions of high-dimensional input instances and for exploring better parameters, for low-latency and high-performance applications.
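The first contribution, a pipeline-friendly EM-GMM with diagonal covariances, builds on the textbook EM updates: diagonal Sigma turns each Gaussian evaluation into a product of one-dimensional terms, which streams well through a pipeline. The sketch below shows one EM iteration for the 1-D case, a simplification made here for brevity (the paper handles high-dimensional data and fixed-point evaluation).

```cpp
#include <cmath>
#include <cstddef>
#include <cstdio>
#include <vector>

const double kTwoPi = 6.283185307179586;

struct Gmm { std::vector<double> w, mu, var; };  // K components, 1-D data

double gauss(double x, double mu, double var) {
    return std::exp(-0.5 * (x - mu) * (x - mu) / var) / std::sqrt(kTwoPi * var);
}

// One EM iteration: E-step responsibilities, then closed-form M-step.
void em_step(Gmm& g, const std::vector<double>& xs) {
    std::size_t K = g.w.size(), N = xs.size();
    std::vector<double> Nk(K, 0), sum(K, 0), sq(K, 0);
    for (double x : xs) {                        // E-step, one pass over data
        std::vector<double> r(K);
        double z = 0;
        for (std::size_t k = 0; k < K; ++k)
            z += r[k] = g.w[k] * gauss(x, g.mu[k], g.var[k]);
        for (std::size_t k = 0; k < K; ++k) {    // accumulate sufficient stats
            double rk = r[k] / z;
            Nk[k] += rk; sum[k] += rk * x; sq[k] += rk * x * x;
        }
    }
    for (std::size_t k = 0; k < K; ++k) {        // M-step updates
        g.w[k]   = Nk[k] / N;
        g.mu[k]  = sum[k] / Nk[k];
        g.var[k] = sq[k] / Nk[k] - g.mu[k] * g.mu[k];
    }
}

int main() {
    Gmm g{{0.5, 0.5}, {-1.0, 1.0}, {1.0, 1.0}};
    std::vector<double> xs = {-2.1, -1.9, -0.8, 0.9, 1.1, 2.2};
    for (int it = 0; it < 50; ++it) em_step(g, xs);
    std::printf("mu = %.3f, %.3f\n", g.mu[0], g.mu[1]);
}
```

Note that the E-step is a single streaming pass accumulating sufficient statistics, which is what makes the algorithm amenable to a dataflow design.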
The Internet of Things (IoT) has triggered rapid advances in sensors, surveillance devices, wearables, and body area networks with advanced Human-Computer Interfaces (HCI). One such application area is the adoption of Body Worn Cameras (BWCs) by law enforcement officials. The need to be 'always-on' puts heavy constraints on battery usage in these camera front-ends, thus limiting their widespread adoption. Further, the increasing number of such cameras is expected to create a data deluge, which requires large processing, transmission, and storage capabilities. Instead of continuously capturing and streaming or storing videos, it is prudent to provide "smartness" to the camera front-end. This requires hardware-assisted image recognition and template matching in the front-end, capable of making judicious decisions on when to trigger video capture or streaming. Neural networks based on Restricted Boltzmann Machines (RBMs) have been shown to provide high accuracy for image recognition and are well suited for low-power and reconfigurable systems. In this paper we propose an RBM-based "always-on" camera front-end capable of detecting human posture. Aggressive behavior of the human in the field of view is used as a wake-up signal for further data collection and classification. The proposed system has been implemented on a Xilinx Virtex-7 XC7VX485T platform. A minimum dynamic power of 19.18 mW for a target recognition accuracy, while maintaining real-time constraints, has been measured. The hardware-software co-design illustrates the trade-offs in the design with respect to accuracy, resource utilization, processing time, and power. The results demonstrate the possibility of a true "always-on" body-worn camera system in the IoT environment.
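The abstract does not detail the RBM computation mapped to the FPGA; as a generic reference, the hidden-layer inference pass of an RBM, the matrix-vector-plus-sigmoid kernel such a front-end would evaluate per frame, is sketched below with placeholder weights and a toy-sized input.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdio>
#include <vector>

// Inference pass of an RBM used as a feature detector: hidden unit j fires
// with probability sigmoid(b_j + sum_i W[j][i] * v_i). A wake-up front-end
// can threshold these probabilities to decide when to trigger capture.
std::vector<double> rbm_hidden(const std::vector<std::vector<double>>& W,
                               const std::vector<double>& b,
                               const std::vector<double>& v) {
    std::vector<double> h(b.size());
    for (std::size_t j = 0; j < b.size(); ++j) {
        double z = b[j];
        for (std::size_t i = 0; i < v.size(); ++i) z += W[j][i] * v[i];
        h[j] = 1.0 / (1.0 + std::exp(-z));       // sigmoid activation
    }
    return h;
}

int main() {
    // Toy 4-pixel "image" and 2 hidden units with placeholder weights.
    std::vector<std::vector<double>> W = {{1.5, -0.5, 0.5, -1.0},
                                          {-1.0, 1.0, -0.5, 1.5}};
    std::vector<double> b = {-0.5, -0.5};
    std::vector<double> v = {1, 0, 1, 0};
    auto h = rbm_hidden(W, b, v);
    std::printf("p(h1)=%.3f  p(h2)=%.3f\n", h[0], h[1]);
}
```

Because inference needs only this forward pass (no sampling or training), the kernel reduces to fixed multiply-accumulate arrays plus a sigmoid lookup, which is what keeps the always-on power budget in the milliwatt range.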