Sparse coding encodes natural stimuli using a small number of basis functions known as receptive fields. In this work, we design custom hardware architectures for efficient and high-performance implementations of a sparse coding algorithm called the sparse and independent local network (SAILnet). A study of the neuron spiking dynamics uncovers important design considerations involving the neural network size, target firing rate, and neuron update step size. Optimal tuning of these parameters keeps the neuron spikes sparse and random to achieve the best image fidelity. We investigate practical hardware architectures for SAILnet: a bus architecture that provides efficient neuron communication but results in spike collisions, and a ring architecture that is more scalable but causes neuron misfires. We show that the spike collision rate is reduced with a sparse spiking neural network, so an arbitration-free bus architecture can be designed to tolerate collisions without the need for arbitration. To reduce neuron misfires, we design a latent ring architecture that damps the neuron responses for improved image fidelity. The bus and ring architectures can be combined in a hybrid architecture to achieve both high throughput and scalability. The three architectures are synthesized and placed and routed in a 65 nm CMOS technology. The proof-of-concept designs demonstrate a high sparse coding throughput of up to 952 Mpixels per second at an energy consumption of 0.486 nJ per pixel.
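For context on the spiking dynamics the abstract refers to, the following is a minimal NumPy sketch of SAILnet-style inference as a leaky integrate-and-fire loop. The names Q (feedforward receptive fields), W (lateral inhibitory weights), theta (firing thresholds), and eta (the neuron update step size) are assumptions made for illustration; the sketch omits learning and all hardware-specific details.

```python
import numpy as np

def sailnet_infer(patch, Q, W, theta, eta=0.1, n_steps=50):
    """Leaky integrate-and-fire inference loop in the spirit of SAILnet.

    patch : flattened image patch, shape (n_pixels,)
    Q     : feedforward receptive fields, shape (n_neurons, n_pixels)
    W     : lateral inhibitory weights, shape (n_neurons, n_neurons), zero diagonal
    theta : firing thresholds, shape (n_neurons,)
    eta   : neuron update step size (a key tuning parameter per the abstract)
    Returns the per-neuron spike counts accumulated over n_steps.
    """
    n_neurons = Q.shape[0]
    u = np.zeros(n_neurons)        # membrane potentials
    s = np.zeros(n_neurons)        # spikes fired in the previous step
    spikes = np.zeros(n_neurons)   # accumulated spike counts
    drive = Q @ patch              # feedforward drive, computed once per patch
    for _ in range(n_steps):
        # leaky integration with lateral inhibition from the last step's spikes
        u = (1.0 - eta) * u + eta * (drive - W @ s)
        s = (u > theta).astype(float)
        u[s > 0] = 0.0             # reset neurons that fired
        spikes += s
    return spikes
```

In this picture, the design considerations named above map onto the dimensions of Q and W (network size), the thresholds that regulate how often neurons fire (target firing rate), and the eta argument (update step size).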
Iterative image reconstruction can dramatically improve the image quality in X-ray computed tomography (CT), but the computation involves iterative steps of 3D forward- and back-projection, which impedes routine clinical use. To accelerate forward-projection, we analyze the CT geometry to identify the intrinsic parallelism and data access sequence for a highly parallel hardware architecture. To improve the efficiency of this architecture, we propose a water-filling buffer to remove pipeline stalls, and an out-of-order sectored processing scheme to reduce the off-chip memory access by up to three orders of magnitude. We perform a floating-point to fixed-point conversion based on numerical simulations and demonstrate comparable image quality at a much lower implementation cost. As a proof of concept, a 5-stage fully pipelined, 55-way parallel separable-footprint forward-projector is prototyped on a Xilinx Virtex-5 FPGA, achieving a throughput of 925.8 million voxel projections/s at a 200 MHz clock frequency, 4.6 times higher than an optimized 16-threaded program running on an 8-core 2.8-GHz CPU. A similar architecture can be applied to back-projection for a complete iterative image reconstruction system. The proposed algorithm and architecture can also be applied to hardware platforms such as graphics processing units and digital signal processors to achieve significant acceleration.
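To make the accelerated kernel concrete, the following is a toy 2D parallel-beam, ray-driven forward projector in NumPy. It only illustrates the line-integral computation that forward projection performs; it is not the separable-footprint projector or the 3D cone-beam geometry of the paper, and the function name and parameters are assumptions.

```python
import numpy as np

def forward_project_2d(image, angles, n_detectors, step=0.5):
    """Toy ray-driven forward projection (2D parallel-beam) for illustration.

    image       : square 2D array of attenuation values
    angles      : projection angles in radians
    n_detectors : number of detector bins per view
    step        : sampling interval along each ray, in pixels
    Returns a sinogram of shape (len(angles), n_detectors).
    """
    n = image.shape[0]
    center = (n - 1) / 2.0
    det = np.linspace(-center, center, n_detectors)  # detector bin offsets
    t = np.arange(-center, center, step)             # samples along each ray
    sino = np.zeros((len(angles), n_detectors))
    for a, ang in enumerate(angles):
        cos_a, sin_a = np.cos(ang), np.sin(ang)
        for d, offset in enumerate(det):
            # points along the ray through detector bin d at angle ang
            x = center + offset * cos_a - t * sin_a
            y = center + offset * sin_a + t * cos_a
            # nearest-neighbor sampling; real projectors use voxel footprints
            xi = np.clip(np.round(x).astype(int), 0, n - 1)
            yi = np.clip(np.round(y).astype(int), 0, n - 1)
            inside = (x >= 0) & (x <= n - 1) & (y >= 0) & (y <= n - 1)
            sino[a, d] = np.sum(image[yi[inside], xi[inside]]) * step
    return sino
```

In hardware, it is the memory access pattern of loop nests like this one that the sectored processing and water-filling buffer described above are designed to manage.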
This paper studies the effects of front-end imager parameters on object detection performance and energy consumption. A custom version of histogram of oriented gradients (HOG) features based on 2-bit pixel ratios is presented and shown to achieve superior object detection performance for the same estimated energy compared with conventional HOG features. A front-end hardware implementation capable of extracting these features at multiple scales is proposed, and a system-level energy analysis is performed. This energy analysis suggests a potential 19x reduction in I/O energy and a 3.3x reduction in back-end detection energy compared with conventional object detection pipelines.
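As a rough illustration of building orientation histograms from coarsely quantized gradient information, the sketch below quantizes pixel differences to a few levels before binning orientations. The quantization scheme, bin count, and function name are assumptions; the paper's 2-bit pixel-ratio features are defined by the front-end imager and differ in detail.

```python
import numpy as np

def quantized_gradient_histogram(img, n_bins=9, grad_bits=2):
    """Illustrative orientation histogram from low-precision pixel differences.

    img       : 2D grayscale image (one HOG-style cell, for simplicity)
    n_bins    : number of unsigned orientation bins over [0, pi)
    grad_bits : bit width used to quantize each gradient component (assumed)
    """
    gx = np.diff(img.astype(float), axis=1)[:-1, :]  # horizontal differences
    gy = np.diff(img.astype(float), axis=0)[:, :-1]  # vertical differences
    # quantize each gradient component to a small number of levels
    levels = 2 ** grad_bits
    scale = max(np.max(np.abs(gx)), np.max(np.abs(gy)), 1e-9)
    gxq = np.round(gx / scale * (levels - 1))
    gyq = np.round(gy / scale * (levels - 1))
    # orientation and magnitude from the quantized components
    orientation = (np.arctan2(gyq, gxq) + np.pi) % np.pi  # unsigned, [0, pi)
    magnitude = np.hypot(gxq, gyq)
    bins = np.minimum((orientation / np.pi * n_bins).astype(int), n_bins - 1)
    hist = np.zeros(n_bins)
    for b in range(n_bins):
        hist[b] = magnitude[bins == b].sum()
    return hist
```

The intent is only to show that an orientation histogram can be formed from very low-precision per-pixel comparisons, which is the style of front-end feature the abstract describes.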