The Internet of Things (IoT) has triggered rapid advances in sensors, surveillance devices, wearables and body area networks with advanced Human-Computer Interfaces (HCI). One such application area is the adoption of Body Worn Cameras (BWCs) by law enforcement officials. The need to be "always-on" puts heavy constraints on battery usage in these camera front-ends, thus limiting their widespread adoption. Further, the increasing number of such cameras is expected to create a data deluge, which requires large processing, transmission and storage capabilities. Instead of continuously capturing and streaming or storing videos, it is prudent to provide "smartness" to the camera front-end. This requires hardware-assisted image recognition and template matching in the front-end, capable of making judicious decisions on when to trigger video capture or streaming. Neural networks based on Restricted Boltzmann Machines (RBMs) have been shown to provide high accuracy for image recognition and are well suited for low-power and reconfigurable systems. In this paper we propose an RBM-based "always-on" camera front-end capable of detecting human posture. Aggressive behavior of a person in the field of view will be used as a wake-up signal for further data collection and classification. The proposed system has been implemented on a Xilinx Virtex 7 XC7VX485T platform. A minimum dynamic power of 19.18 mW has been measured for a target recognition accuracy while maintaining real-time constraints. The hardware-software co-design illustrates the trade-offs in the design with respect to accuracy, resource utilization, processing time and power. The results demonstrate the possibility of a true "always-on" body-worn camera system in the IoT environment.
Authors: Thompson, C. D., Division of Computer Science, University of California
This paper surveys nine designs for VLSI circuits that compute N-element Fourier transforms. The largest of the designs requires O(N^2 log N) units of silicon area; it can start a new Fourier transform every O(log N) time units. The smallest designs have about 1/N of this throughput, but they require only 1/N as much area.
We present an automated methodology for producing hardware-based random number generator (RNG) designs for arbitrary distributions using the inverse cumulative distribution function (ICDF). The ICDF is evaluated via piecewise polynomial approximation with a hierarchical segmentation scheme that involves uniform segments and segments with size varying by powers of two which can adapt to local function nonlinearities. Analytical error analysis is used to guarantee accuracy to one unit in the last place (ulp). Compact and efficient RNGs that can reach arbitrary multiples of the standard deviation sigma can be generated. For instance, a Gaussian RNG based on our approach for a Xilinx Virtex-4 XC4VLX100-12 field-programmable gate array produces 16-bit random samples up to 8.2 sigma. It occupies 487 slices, 2 block-RAMs, and 2 DSP-blocks. The design is capable of running at 371 MHz and generates one sample every clock cycle.
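The ICDF approach described above can be illustrated with a small software model. This is only a minimal sketch under stated assumptions, not the authors' generator: it uses degree-1 (linear) pieces rather than tuned polynomials, floating point rather than fixed point with a 1-ulp guarantee, and Python's `statistics.NormalDist` as the reference ICDF. The names `build_segments`, `approx_icdf` and `gaussian_sample` are hypothetical.

```python
import bisect
import random
from statistics import NormalDist

_REF = NormalDist()  # reference Gaussian ICDF, used only to build the table


def build_segments(levels=12, sub=8):
    """Segment [0.5, 1) so that segment width halves toward 1.

    Level k covers [1 - 2**-(k+1), 1 - 2**-(k+2)) and is split into `sub`
    uniform pieces. Each piece stores (u0, width, c0, c1) for a linear fit
    through the two end points of the piece.
    """
    segs = []
    for k in range(levels):
        lo = 1.0 - 2.0 ** -(k + 1)
        hi = 1.0 - 2.0 ** -(k + 2)
        w = (hi - lo) / sub
        for j in range(sub):
            u0 = lo + j * w
            y0 = _REF.inv_cdf(u0)
            y1 = _REF.inv_cdf(u0 + w)
            segs.append((u0, w, y0, (y1 - y0) / w))
    return segs


def approx_icdf(u, segs, starts):
    """Approximate the Gaussian ICDF at u in [0, 1] from the segment table."""
    if u < 0.5:
        # The Gaussian ICDF is odd about u = 0.5, so only one half is tabulated.
        return -approx_icdf(1.0 - u, segs, starts)
    # Software stand-in for segment address decoding: find the piece containing u.
    i = bisect.bisect_right(starts, u) - 1
    u0, w, c0, c1 = segs[i]
    u = min(u, u0 + w)  # clamp inputs beyond the last tabulated piece
    return c0 + c1 * (u - u0)


def gaussian_sample(segs, starts, rng=random):
    """Draw one approximately Gaussian sample from a uniform variate."""
    return approx_icdf(rng.random(), segs, starts)
```

Halving the segment width toward the tail keeps the interpolation error roughly balanced where the ICDF becomes steep, which is the intuition behind mixing uniform segments with power-of-two segments.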
ISBN: 9798350308600 (print)
As neural networks are increasingly deployed on mobile and distributed computing platforms, there is a need to lower latency and increase computational speed while decreasing power and memory usage. Rather than using FPGAs as accelerators in tandem with CPUs or GPUs, we directly encode individual neural network layers as combinational logic within FPGA hardware. Utilizing binarized neural networks minimizes the arithmetic computation required, shrinking latency to only the signal propagation delay. We evaluate size-optimization strategies and demonstrate network compression via weight quantization and weight-model unification, achieving 96% of the accuracy of baseline MNIST digit classification models while using only 3% of the memory. We further achieve an 86% decrease in model footprint, 8 mW dynamic power consumption, and <9 ns latency, validating the versatility and capability of feature-strength-based pruning approaches for binarized neural networks to flexibly meet performance requirements amid application resource constraints.
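To see why binarization reduces a layer to simple logic, consider the common XNOR-and-popcount formulation of a binarized dot product. The sketch below is an illustrative software model, not the authors' encoding; the helper names `binarize`, `xnor_popcount_dot` and `binary_neuron` are hypothetical, and bit value 1 is assumed to represent +1 while 0 represents -1.

```python
def binarize(xs):
    """Pack real values into an int bit mask: bit i is 1 when xs[i] >= 0 (meaning +1)."""
    word = 0
    for i, x in enumerate(xs):
        if x >= 0:
            word |= 1 << i
    return word


def xnor_popcount_dot(a, w, n):
    """Dot product of two {-1, +1}^n vectors packed as n-bit integers.

    XNOR marks positions where the signs agree; each agreement contributes +1
    and each disagreement -1, so the dot product equals 2 * agreements - n.
    """
    mask = (1 << n) - 1
    agree = bin(~(a ^ w) & mask).count("1")
    return 2 * agree - n


def binary_neuron(a, w, n, threshold=0):
    """Binarized neuron: outputs 1 (+1) when the dot product reaches the threshold."""
    return 1 if xnor_popcount_dot(a, w, n) >= threshold else 0
```

Because the weights are constants after training, each XNOR collapses to a wire or an inverter and the thresholded popcount becomes a fixed Boolean function of the inputs, which is what makes a purely combinational realization plausible.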
ISBN: 9781424438075 (print)
Hardware acceleration in the High Performance Computing (HPC) context is of growing interest, particularly in the field of Monte Carlo methods, where Field Programmable Gate Array (FPGA) technology has proven to be an effective medium, capable of speeding up the execution of stochastic processes by several orders of magnitude. The widespread use of reconfigurable hardware for stochastic simulation has driven significant effort towards effective implementations of hardware pseudorandom number generators (PRNGs); these generators need to exhibit statistically proven random behaviour and to be characterized by a very long period. In this paper we present the state of the art of hardware pseudorandom number generation in the context of Monte Carlo acceleration. We highlight the emerging trends in the most recent publications and offer some insights into forthcoming work. Furthermore, we provide a complete hardware description of a new Gaussian variate generator (GVG) and an exponential variate generator (EVG) based on a decision-tree technique of our own, also presented herein. The prototypes, implemented on a Xilinx Virtex II Pro XC2VP100 FPGA, occupy from 150 to 417 slices and reach 280 MHz, while exhibiting good statistical behaviour with high p-values on the chi-squared test and offering a unitary Knuth ratio.
ISBN: 9781424426270 (print)
This paper introduces a parabolic synthesis methodology for developing approximations of unary functions, such as trigonometric functions and logarithms, that are specialized for efficient hardware-mapped VLSI design. The advantages of the methodology are a short critical path, fast computation and high throughput, enabled by a high degree of architectural parallelism. The feasibility of the methodology is shown by developing an approximation of the sine function for implementation in hardware.
This paper presents and discusses the implementation of an LSTM cell on an FPGA with an activation function inspired by the CORDIC algorithm. The realization is performed using both IEEE 754 floating-point numbers and 32-bit integer numbers. The floating-point case is analyzed with and without the DSP blocks provided by the Xilinx design suite. The alternative implementation using integer arithmetic was optimized for a minimal number of clock cycles. The presented implementation uses the xc6slx150t-2fgg900 device and achieves high calculation accuracy in both cases.
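For readers unfamiliar with CORDIC, the following sketch shows how the textbook hyperbolic CORDIC in rotation mode yields tanh using only shifts, adds and a small table of atanh constants. It is not the paper's activation function, whose details are not given here; the function name `tanh_cordic`, the iteration count and the halving-based range extension are assumptions made for illustration, and floating point stands in for the fixed-point arithmetic a hardware design would use.

```python
import math


def tanh_cordic(theta, iterations=16):
    """Approximate tanh(theta) with hyperbolic CORDIC in rotation mode."""
    # Basic hyperbolic CORDIC converges only for |theta| below about 1.118,
    # so larger arguments are halved and recombined with
    # tanh(2a) = 2*tanh(a) / (1 + tanh(a)**2).
    if abs(theta) > 1.0:
        h = tanh_cordic(theta / 2.0, iterations)
        return 2.0 * h / (1.0 + h * h)

    x, y, z = 1.0, 0.0, theta
    for i in range(1, iterations + 1):
        # Iterations 4 and 13 are repeated, as required for convergence
        # of the hyperbolic variant.
        for _ in range(2 if i in (4, 13) else 1):
            d = 1.0 if z >= 0 else -1.0
            t = 2.0 ** -i  # a right shift by i bits in fixed-point hardware
            x, y = x + d * y * t, y + d * x * t
            z -= d * math.atanh(t)  # table constant in hardware
    # x and y carry the same CORDIC gain, so it cancels in the ratio.
    return y / x
```

Since sigmoid(v) = (1 + tanh(v / 2)) / 2, one such routine could in principle serve both LSTM activations.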
This paper presents and discusses the implementation of a learning accelerator for an LSTM neural network that utilizes an FPGA. The accelerator implements the backpropagation through time algorithm for an LSTM. The presented network performs a binary classification task and consists of an LSTM and a dense layer. Its performance is compared to both a hard-coded Python implementation and an implementation using the Keras library and a GPU. The implementation is executed using the DSP blocks available via the Vivado Design Suite, which are compliant with the IEEE 754 standard. The simulation results show that the FPGA implementation remains accurate and achieves higher speed than the other solutions.
This paper presents and discusses the implementation of a deep neural network for failure prediction in the cold forging process. The implementation consists of an LSTM and a dense layer implemented on an FPGA. The network was trained beforehand on a desktop computer using the Keras library for Python, and the weights and biases were embedded into the implementation. The implementation is executed using the DSP blocks available via the Vivado Design Suite, which are compliant with the IEEE 754 standard. In simulation, the network achieves 100% classification accuracy on the test data and high calculation speed.
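The inference path described above, an LSTM followed by a dense layer with pre-trained parameters, can be sketched in plain Python. This is a minimal reference model of a standard LSTM cell, not the paper's FPGA design: the parameter layout (dictionaries keyed by gate name "i", "f", "o", "g"), the sigmoid output unit, and the function names `lstm_step` and `classify` are assumptions for illustration.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def lstm_step(x, h, c, W, U, b):
    """One time step of a standard LSTM cell.

    W[g], U[g] and b[g] hold the input weights, recurrent weights and biases
    of gate g, where g is "i" (input), "f" (forget), "o" (output) or
    "g" (candidate cell state).
    """
    def gate(name, act):
        out = []
        for r in range(len(h)):
            s = b[name][r]
            s += sum(W[name][r][k] * x[k] for k in range(len(x)))
            s += sum(U[name][r][k] * h[k] for k in range(len(h)))
            out.append(act(s))
        return out

    i = gate("i", sigmoid)
    f = gate("f", sigmoid)
    o = gate("o", sigmoid)
    g = gate("g", math.tanh)
    c_new = [f[r] * c[r] + i[r] * g[r] for r in range(len(h))]
    h_new = [o[r] * math.tanh(c_new[r]) for r in range(len(h))]
    return h_new, c_new


def classify(sequence, W, U, b, w_dense, b_dense, hidden):
    """Run the LSTM over a sequence, then apply a one-unit dense layer."""
    h = [0.0] * hidden
    c = [0.0] * hidden
    for x in sequence:
        h, c = lstm_step(x, h, c, W, U, b)
    # Probability of the positive class.
    return sigmoid(sum(wd * hv for wd, hv in zip(w_dense, h)) + b_dense)
```

Embedding the trained weights as constants, as the abstract describes, corresponds here to fixing `W`, `U`, `b`, `w_dense` and `b_dense` before inference.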