ISBN (Print): 9798350330991; 9798350331004
Recent advancements in brain-computer interface (BCI) technology for steady-state visual evoked potential (SSVEP)-based target identification have shifted from traditional linear algebra (LA) techniques to more sophisticated neural network (NN) approaches, driven by their increased accuracy and consistent performance across different subjects. However, adopting NN-based algorithms has introduced complexities in wearable BCI systems, mainly due to their extensive parameter sets that demand significant memory capacity. Moreover, the computational intensity of these models requires reevaluating hardware architectures. Additionally, the advent of Transformer-based models has further advanced the state of the art, providing even higher accuracy and reduced variability in cross-subject performance while placing greater demands on hardware resources. This paper provides an overview of recent algorithmic progress in SSVEP-based target identification and proposes considerations for the hardware architecture needed to efficiently support the computation of cutting-edge Transformer-based models in wearable BCIs from the perspective of algorithm-hardware co-design.
Computing-in-memory (CIM) convolutional neural network (CNN) accelerators based on nonvolatile memory (NVM) show great potential to improve energy efficiency and throughput, but the multiple design levels and huge design space of CIM-based CNN acceleration systems make cross-level co-design methodologies and platforms highly desirable. In this work, an algorithm-hardware co-design platform, coMN, with a graphical user interface is proposed for designers to quickly verify and further optimize their designs. In the platform: 1) a mapper is developed to automatically map CNN models to CIM chips by optimizing pipelining, weight transformation, partitioning, and placement; 2) an accuracy evaluator and a performance evaluator are built to jointly estimate accuracy, energy, latency, and area overheads, considering design dependencies across multiple levels; 3) an algorithm adapter is exploited to retrain CNN weights for higher on-hardware accuracy within a limited energy budget through nonideality-aware training and energy-aware training; and 4) a hardware optimizer is developed to search the hardware microarchitecture and circuit design space in the early design stage. We conduct several case studies to verify the effectiveness of the coMN platform. Results indicate that the coMN platform efficiently enables algorithm-hardware mapping, hardware-aware algorithm adaptation, hardware-configuration exploration, and overall algorithm-hardware co-design. The coMN platform can be accessed online at https://101.42.97.22:8081/*** with username "tcad" and password "comnuser."
Object detection and classification is a key task in many computer vision applications such as smart surveillance and autonomous vehicles. Recent advances in deep learning have significantly improved the quality of results achieved by these systems, making them more accurate and reliable in complex environments. Modern object detection systems make use of lightweight convolutional neural networks (CNNs) for feature extraction, coupled with single-shot multi-box detectors (SSDs) that generate bounding boxes around the identified objects along with their classification confidence scores. Subsequently, a non-maximum suppression (NMS) module removes any redundant detection boxes from the final output. Typical NMS algorithms must wait for all box predictions to be generated by the SSD-based feature extractor before processing them. This sequential dependency between box predictions and NMS results in a significant latency overhead and degrades the overall system throughput, even if a high-performance CNN accelerator is used for the SSD feature extraction component. In this paper, we present a novel pipelined NMS algorithm that eliminates this sequential dependency and associated NMS latency overhead. We then use our novel NMS algorithm to implement an end-to-end fully pipelined FPGA system for low-latency SSD-MobileNet-V1 object detection. Our system, implemented on an Intel Stratix 10 FPGA, runs at 400 MHz and achieves a throughput of 2,167 frames per second with an end-to-end batch-1 latency of 2.13 ms. Our system achieves 5.3x higher throughput and 5x lower latency compared to the best prior FPGA-based solution with comparable accuracy.
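The latency overhead the paper targets comes from conventional greedy NMS, which must see every box before it can emit results. As a point of reference, a minimal sketch of that conventional baseline (not the paper's pipelined variant) looks like this; box layout and threshold are illustrative:

```python
def iou(a, b):
    # Intersection-over-union of two (x1, y1, x2, y2) boxes.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    if inter == 0.0:
        return 0.0
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def greedy_nms(boxes, scores, thresh=0.5):
    # Sort all boxes by score, keep each box only if it does not
    # overlap an already-kept box above the threshold. The initial
    # sort is what forces this to wait for every prediction.
    order = sorted(range(len(boxes)), key=lambda i: -scores[i])
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= thresh for k in keep):
            keep.append(i)
    return keep
```

The pipelined algorithm in the paper removes this wait-for-all dependency so NMS can overlap with box generation; the sketch above only shows why the dependency exists in the first place.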
In this paper, we propose an efficient algorithm-hardware co-design framework to realize radar-based fall detection with limited resources. We first design a compact neural network model named MB-Net with multi-branch convolutions for feature extraction from radar time-series data combined with a multi-scale wavelet transform. After that, an FPGA-based neural network (NN) accelerator tailored to the proposed network is designed. The proposed NN accelerator replaces general multipliers with non-exact multipliers to reduce hardware cost. For the multi-branch convolution layer, a novel layer computing sequence is introduced to improve the efficiency of the processing element (PE) array and reduce the memory footprint. In addition, the average pooling operation in the proposed network is folded into the quantization factors to reduce hardware cost. The experimental findings show that MB-Net can maintain competitive performance in comparison to state-of-the-art methods while its hardware cost is significantly lower. The proposed network model is implemented on a Zynq ZC702 board using only 3615 LUTs, 1843 FFs, 11.5 BRAMs, and 8 DSPs with 0.234 W power consumption. Through algorithm and hardware co-optimization, the fall detection accelerator achieves 95% PE efficiency and takes 0.346 ms latency per radar sample inference with only 80.96 µJ energy consumption.
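The idea of folding average pooling into the quantization factors can be illustrated numerically: dividing by the pool size N and then by the quantization scale s is the same as summing and dividing once by s*N, so the hardware needs only an adder tree and no divider. A minimal sketch (function names and shapes are illustrative, not from the paper):

```python
import numpy as np

def avgpool_then_quant(x, scale):
    # Reference path: average pooling followed by quantization.
    return np.round(x.mean(axis=-1) / scale)

def folded_sum_quant(x, scale):
    # Folded path: the 1/N divisor of average pooling is absorbed
    # into the quantization factor, leaving only a summation.
    n = x.shape[-1]
    return np.round(x.sum(axis=-1) / (scale * n))
```

Both paths produce identical quantized outputs, which is why the fold is free in accuracy terms.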
ISBN (Print): 9798400706981
Simultaneous Localization and Mapping (SLAM) plays a crucial role in robotics, autonomous systems, and augmented and virtual reality (AR/VR) applications by enabling devices to understand and map unknown environments. However, deploying SLAM in AR/VR applications poses significant challenges, including the demand for high accuracy, real-time processing, and efficient resource utilization, especially on compact and lightweight devices. To address these challenges, we propose SuperNoVA, which enables high-accuracy, real-time, large-scale SLAM in resource-constrained settings through a full-stack system, spanning from algorithm to hardware. In particular, SuperNoVA dynamically constructs a subgraph to meet the latency target while preserving accuracy, virtualizes hardware resources for efficient graph processing, and implements a novel hardware architecture to accelerate the SLAM backend efficiently. Evaluation results demonstrate that, for a large-scale AR dataset, SuperNoVA reduces full SLAM backend computation latency by 89.5% compared to the baseline out-of-order CPU and 78.6% compared to the baseline embedded GPU, and reduces the maximum pose error by 89% over existing SLAM solutions, while always meeting the latency target.
Brain-computer interface (BCI), a communication technology between the brain and a computer that has been under development since the 1970s, can be incorporated into wearable devices by developing powerful signal processing algorithms and semiconductor technologies. For a satisfactory BCI-based user experience, a high information transfer rate and low power consumption should be considered together without losing accuracy. Although many existing BCI algorithms have focused mainly on accuracy, their deployment on wearable devices is not straightforward due to limited hardware resources and computational capabilities. This tutorial summarizes recent advances in wearable BCI algorithms and hardware implementations from an algorithm-hardware co-design perspective and discusses future directions.
ISBN (Print): 9781665427012
Bayesian convolutional neural networks (BCNNs) have been proposed to address the problem of model uncertainty in conventional neural networks. By treating weights as distributions rather than deterministic values, BCNNs mitigate overfitting, enable training with small amounts of data, and support uncertainty evaluation. However, computing the distributions of BCNN outputs is time- and energy-consuming because it requires multiple forward passes. To address this computational problem, we propose a novel algorithm-hardware co-design approach with an approximation algorithm and hardware support for the rapid computation of BCNNs. Our observations of the absolute values of each layer's inputs and the input differences among multiple forward passes show that most of these values are significantly small compared with the remaining large values. Our algorithm treats these small values as zero, making the computation sparser. The extracted sparsity allows us to skip most multiplications. As a result, it achieves a computation reduction of 81.1% in classification tasks and 77.7% in regression tasks. Additionally, to support the algorithm-level approximation in hardware, we propose a novel dataflow specialized for our algorithm and develop a new accelerator architecture, the accelerator for sparse Bayesian neural networks (ASBNN), that can handle the sparsity extracted by the algorithm. Our evaluation demonstrates that the ASBNN successfully exploits the algorithmic computation reduction to improve computation time by 3.3x and energy efficiency by 3.7x compared with a naive implementation of dense BCNN accelerators.
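The core trick, small input differences between consecutive forward passes treated as zero so their multiplications can be skipped, can be sketched for a single matrix-vector product. This is an illustrative reconstruction, not the paper's exact algorithm; the threshold `tau` and function names are assumptions:

```python
import numpy as np

def delta_matvec(W, x_prev, y_prev, x_new, tau=1e-2):
    # Between two forward passes, only input entries whose change
    # exceeds tau trigger multiplications; the previous output is
    # reused and corrected with the sparse delta.
    delta = x_new - x_prev
    delta = np.where(np.abs(delta) < tau, 0.0, delta)
    active = np.nonzero(delta)[0]
    y_new = y_prev + W[:, active] @ delta[active]
    return y_new, len(active)
```

When most sampled inputs barely change between passes, `len(active)` stays small and most columns of W are never touched, which is the source of the reported computation reduction.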
Denoising sensor-captured images on edge display devices remains challenging due to deep neural networks' (DNNs) high computational overhead and the limitations of training on synthetic noise. This work proposes BDLUT(-D), a novel blind denoising method combining optimized lookup tables (LUTs) with hardware-centric design. While BDLUT describes the LUT-based network architecture, BDLUT-D represents BDLUT trained with a specialized noise degradation model. Designed for edge deployment, BDLUT(-D) eliminates neural processing units (NPUs) and functions as a standalone ASIC IP solution. Experimental results demonstrate that BDLUT-D achieves up to 2.42 dB improvement over state-of-the-art LUT methods on mixed-noise-intensity benchmarks while requiring only 66 KB of storage. An FPGA implementation shows over 10x reduction in logic resources and 75% less storage compared to DNN accelerators, while achieving 57% faster processing than traditional bilateral filtering methods. These optimizations enable practical integration into edge scenarios like low-cost webcam enhancement and real-time 4K-to-4K denoising without compromising resolution or latency. By enhancing silicon efficiency and removing external accelerator dependencies, BDLUT(-D) establishes a new standard for practical edge image denoising. Implementation is available at .
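The reason LUT methods map so cheaply to silicon is that inference collapses to table reads: a trained network is baked into an array indexed by quantized inputs. A deliberately simplified single-pixel sketch (BDLUT itself indexes on small local patches; the offset table here is a hypothetical stand-in for a trained LUT):

```python
import numpy as np

def apply_lut(img, lut):
    # Per-pixel inference is a single memory read: the 8-bit pixel
    # value indexes a 256-entry precomputed table.
    return lut[img]

# Hypothetical brightness-offset table standing in for trained weights.
lut = np.clip(np.arange(256) + 10, 0, 255).astype(np.uint8)
img = np.array([[0, 100], [250, 255]], dtype=np.uint8)
out = apply_lut(img, lut)
```

A 256-entry uint8 table costs 256 bytes; even patch-indexed variants stay in the tens of kilobytes, consistent with the 66 KB figure quoted above.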
Convolutional neural network (CNN)-based object detection has achieved very high accuracy; e.g., single-shot multi-box detectors (SSDs) can efficiently detect and localize various objects in an input image. However, they require a large amount of computation and memory storage, which makes it difficult to perform efficient inference on resource-constrained hardware devices such as drones or unmanned aerial vehicles (UAVs). Drone/UAV detection is an important task for applications including surveillance, defense, and multi-drone self-localization and formation control. In this article, we designed and co-optimized an algorithm and hardware for energy-efficient drone detection on resource-constrained FPGA devices. We trained an SSD object detection algorithm with a custom drone dataset. For inference, we employed low-precision quantization and adapted the width of the SSD CNN model. To improve throughput, we use dual-data-rate operations for DSPs to effectively double the throughput with limited DSP counts. For different SSD algorithm models, we analyze accuracy, or mean average precision (mAP), and evaluate the corresponding FPGA hardware utilization, DRAM communication, and throughput optimization. We evaluated the FPGA hardware on a custom drone dataset, Pascal VOC, and COCO 2017. Our proposed design achieves a high mAP of 88.42% on the multi-drone dataset, with a high energy efficiency of 79 GOPS/W and a throughput of 158 GOPS using the Xilinx Zynq ZU3EG FPGA device on the Open Vision Computer version 3 (OVC3) platform. Our design achieves 1.1 to 8.7x higher energy efficiency than prior works that used the same Pascal VOC dataset and the same FPGA device, but at a low power consumption of 2.54 W. For the COCO dataset, our MobileNet-V1 implementation achieved an mAP of 16.8 and an energy efficiency of 4.9 FPS/W, approximately 1.9x higher than prior FPGA works and other commercial hardware platforms.
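Low-precision quantization of the kind mentioned above typically maps float weights to narrow integers plus a shared scale. A generic symmetric per-tensor int8 sketch (the paper's exact bit widths and granularity may differ):

```python
import numpy as np

def quantize_int8(w):
    # Map float weights to int8 with one shared scale so FPGA DSPs
    # can multiply narrow integers instead of floats.
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    # Recover approximate float weights for accuracy evaluation.
    return q.astype(np.float32) * scale
```

The worst-case reconstruction error of this scheme is half a quantization step (scale/2), which is what makes the accuracy/mAP trade-off analyzable per bit width.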
ISBN (Print): 9798350344868; 9798350344851
This research aims to develop energy-efficient hardware accelerators for Simultaneous Localization and Mapping (SLAM) back-end applications by employing algorithm-hardware co-design. Utilizing the iSAM2 algorithm, which uses graphical modeling to solve iterative Gauss-Newton problems, we continuously update maps by incorporating solutions from previous iterations or timesteps. We address the performance bottleneck arising from memory writes of intermediate results by modifying the original algorithm. Additionally, we analyze the algorithm's parallelizability to meet latency demands. These hardware accelerators are designed as Intellectual Property (IP) blocks suitable for integration into custom Systems-on-Chip (SoCs). We evaluate the design using both holistic and block-level metrics, focusing on latency and energy efficiency. This work has implications for energy-constrained devices like drones and Extended Reality (XR) devices.
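For readers unfamiliar with the workload being accelerated, the Gauss-Newton problems mentioned above reduce to repeatedly solving normal equations built from residuals and Jacobians. A plain dense toy loop is sketched below; iSAM2 itself instead updates a factored sparse system incrementally, reusing work from previous timesteps, which is precisely what makes its memory traffic pattern a hardware concern:

```python
import numpy as np

def gauss_newton(residual, jacobian, x0, iters=20):
    # Classic batch Gauss-Newton: at each step solve the normal
    # equations (J^T J) dx = -J^T r and apply the update.
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        r = residual(x)
        J = jacobian(x)
        dx = np.linalg.solve(J.T @ J, -J.T @ r)
        x = x + dx
    return x
```

On a 1-D toy problem (find the root of x^2 - 2) this converges to sqrt(2) in a handful of iterations; real SLAM back ends solve the same structure over thousands of pose and landmark variables.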