ISBN (Print): 9798350330991; 9798350331004
Recent advancements in brain-computer interface (BCI) technology for steady-state visual evoked potential (SSVEP)-based target identification have shifted from traditional linear algebra (LA) techniques to more sophisticated neural network (NN) approaches, driven by their increased accuracy and consistent performance across different subjects. However, adopting NN-based algorithms has introduced complexities in wearable BCI systems, mainly due to their extensive parameter sets that demand significant memory capacity. Moreover, the computational intensity of these models requires reevaluating hardware architectures. Additionally, the advent of Transformer-based models has further advanced the state of the art, providing even higher accuracy and reduced variability in cross-subject performance while placing greater demands on hardware resources. This paper provides an overview of recent algorithmic progress in SSVEP-based target identification and proposes considerations for the hardware architecture needed to efficiently support the computation of cutting-edge Transformer-based models in wearable BCIs from the perspective of algorithm-hardware co-design.
Computing-in-memory (CIM) convolutional neural network (CNN) accelerators based on nonvolatile memory (NVM) show great potential to improve energy efficiency and throughput, but the multiple design levels and huge design space of CIM-based CNN acceleration systems make cross-level co-design methodologies and platforms highly desirable. In this work, an algorithm-hardware co-design platform, coMN, with a graphical user interface is proposed for designers to quickly verify and further optimize their designs. In the platform: 1) a mapper is developed to automatically map CNN models to CIM chips by optimizing pipelining, weight transformation, partitioning, and placement; 2) an accuracy evaluator and a performance evaluator are built to jointly estimate accuracy, energy, latency, and area overheads, considering design dependencies across multiple levels; 3) an algorithm adapter is exploited to retrain CNN weights for higher on-hardware accuracy within a limited energy budget through nonideality-aware training and energy-aware training; and 4) a hardware optimizer is developed to search the hardware microarchitecture and circuit design space in the early design stage. We conduct several case studies to verify the effectiveness of the coMN platform. Results indicate that the coMN platform efficiently enables algorithm-hardware mapping, hardware-aware algorithm adaptation, hardware-configuration exploration, and overall algorithm-hardware co-design. The coMN platform can be accessed online at https://101.42.97.22:8081/*** with username "tcad" and password "comnuser."
Object detection and classification is a key task in many computer vision applications such as smart surveillance and autonomous vehicles. Recent advances in deep learning have significantly improved the quality of results achieved by these systems, making them more accurate and reliable in complex environments. Modern object detection systems make use of lightweight convolutional neural networks (CNNs) for feature extraction, coupled with single-shot multi-box detectors (SSDs) that generate bounding boxes around the identified objects along with their classification confidence scores. Subsequently, a non-maximum suppression (NMS) module removes any redundant detection boxes from the final output. Typical NMS algorithms must wait for all box predictions to be generated by the SSD-based feature extractor before processing them. This sequential dependency between box predictions and NMS results in a significant latency overhead and degrades the overall system throughput, even if a high-performance CNN accelerator is used for the SSD feature extraction component. In this paper, we present a novel pipelined NMS algorithm that eliminates this sequential dependency and associated NMS latency overhead. We then use our novel NMS algorithm to implement an end-to-end fully pipelined FPGA system for low-latency SSD-MobileNet-V1 object detection. Our system, implemented on an Intel Stratix 10 FPGA, runs at 400 MHz and achieves a throughput of 2,167 frames per second with an end-to-end batch-1 latency of 2.13 ms. Our system achieves 5.3x higher throughput and 5x lower latency compared to the best prior FPGA-based solution with comparable accuracy.
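The latency overhead the paper targets comes from conventional greedy NMS, which must see every box before it can emit results. As a point of reference, a minimal sketch of that conventional baseline (not the paper's pipelined variant) looks like this; box layout and threshold are illustrative:

```python
def iou(a, b):
    # Intersection-over-union of two (x1, y1, x2, y2) boxes.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    if inter == 0.0:
        return 0.0
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def greedy_nms(boxes, scores, thresh=0.5):
    # Sort all boxes by score, keep each box only if it does not
    # overlap an already-kept box above the threshold. The initial
    # sort is what forces this to wait for every prediction.
    order = sorted(range(len(boxes)), key=lambda i: -scores[i])
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= thresh for k in keep):
            keep.append(i)
    return keep
```

The pipelined algorithm in the paper removes this wait-for-all dependency so NMS can overlap with box generation; the sketch above only shows why the dependency exists in the first place.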
In this paper, we propose an efficient algorithm-hardware co-design framework to realize radar-based fall detection with limited resources. We first design a compact neural network model named MB-Net with multi-branch convolutions for feature extraction from radar time-series data combined with a multi-scale wavelet transform. After that, an FPGA-based neural network (NN) accelerator tailored to the proposed network is designed. The proposed NN accelerator replaces general multipliers with non-exact multipliers to reduce hardware cost. For the multi-branch convolution layer, a novel layer computing sequence is introduced to improve the efficiency of the processing element (PE) array and reduce the memory footprint. In addition, the average pooling operation in the proposed network is folded into the quantization factors to reduce hardware cost. The experimental findings show that MB-Net can maintain competitive performance in comparison to state-of-the-art methods while its hardware cost is significantly lower. The proposed network model is implemented on a Zynq ZC702 board using only 3615 LUTs, 1843 FFs, 11.5 BRAMs, and 8 DSPs with 0.234 W power consumption. Through algorithm and hardware co-optimization, the fall detection accelerator achieves 95% PE efficiency and takes 0.346 ms latency per radar sample inference with only 80.96 µJ energy consumption.
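The idea of folding average pooling into the quantization factors can be illustrated numerically: dividing by the pool size N and then by the quantization scale s is the same as summing and dividing once by s*N, so the hardware needs only an adder tree and no divider. A minimal sketch (function names and shapes are illustrative, not from the paper):

```python
import numpy as np

def avgpool_then_quant(x, scale):
    # Reference path: average pooling followed by quantization.
    return np.round(x.mean(axis=-1) / scale)

def folded_sum_quant(x, scale):
    # Folded path: the 1/N divisor of average pooling is absorbed
    # into the quantization factor, leaving only a summation.
    n = x.shape[-1]
    return np.round(x.sum(axis=-1) / (scale * n))
```

Both paths produce identical quantized outputs, which is why the fold is free in accuracy terms.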
ISBN (Print): 9798400706981
Simultaneous Localization and Mapping (SLAM) plays a crucial role in robotics, autonomous systems, and augmented and virtual reality (AR/VR) applications by enabling devices to understand and map unknown environments. However, deploying SLAM in AR/VR applications poses significant challenges, including the demand for high accuracy, real-time processing, and efficient resource utilization, especially on compact and lightweight devices. To address these challenges, we propose SuperNoVA, which enables high-accuracy, real-time, large-scale SLAM in resource-constrained settings through a full-stack system, spanning from algorithm to hardware. In particular, SuperNoVA dynamically constructs a subgraph to meet the latency target while preserving accuracy, virtualizes hardware resources for efficient graph processing, and implements a novel hardware architecture to accelerate the SLAM backend efficiently. Evaluation results demonstrate that, for a large-scale AR dataset, SuperNoVA reduces full SLAM backend computation latency by 89.5% compared to the baseline out-of-order CPU and 78.6% compared to the baseline embedded GPU, and reduces the maximum pose error by 89% over existing SLAM solutions, while always meeting the latency target.
Brain-computer interface (BCI), a communication technology between the brain and a computer that has been under development since the 1970s, can be incorporated into wearable devices by developing powerful signal processing algorithms and semiconductor technologies. For a satisfactory BCI-based user experience, a high information transfer rate and low power consumption should be considered together without losing accuracy. Although many existing BCI algorithms have focused mainly on accuracy, their deployment on wearable devices is not straightforward due to limited hardware resources and computational capabilities. This tutorial summarizes recent advances in wearable BCI algorithms and hardware implementations from an algorithm-hardware co-design perspective and discusses future directions.
ISBN (Print): 9781665427012
Bayesian convolutional neural networks (BCNNs) have been proposed to address the problem of model uncertainty in conventional neural networks. By treating weights as distributions rather than deterministic values, BCNNs mitigate overfitting, enable training with small amounts of data, and support uncertainty evaluation. However, computing the distributions of BCNN outputs is time- and energy-consuming because it requires multiple forward passes. To address this computational problem, we propose a novel algorithm-hardware co-design approach with an approximation algorithm and hardware support for the rapid computation of BCNNs. Our observations of the absolute values of each layer's inputs and the input differences among multiple forward passes show that most of these values are significantly small compared with the remaining large values. Our algorithm treats these small values as zero, making the computation sparser. The extracted sparsity allows us to skip most multiplications. As a result, it achieves a computation reduction of 81.1% in classification tasks and 77.7% in regression tasks. Additionally, to support the algorithm-level approximation in hardware, we propose a novel dataflow specialized for our algorithm and develop a new accelerator architecture, the accelerator for sparse Bayesian neural networks (ASBNN), that can handle the sparsity extracted by the algorithm. Our evaluation demonstrates that the ASBNN successfully exploits the algorithmic computation reduction to improve computation time by 3.3x and energy efficiency by 3.7x compared with a naive implementation of dense BCNN accelerators.
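The core trick, small input differences between consecutive forward passes treated as zero so their multiplications can be skipped, can be sketched for a single matrix-vector product. This is an illustrative reconstruction, not the paper's exact algorithm; the threshold `tau` and function names are assumptions:

```python
import numpy as np

def delta_matvec(W, x_prev, y_prev, x_new, tau=1e-2):
    # Between two forward passes, only input entries whose change
    # exceeds tau trigger multiplications; the previous output is
    # reused and corrected with the sparse delta.
    delta = x_new - x_prev
    delta = np.where(np.abs(delta) < tau, 0.0, delta)
    active = np.nonzero(delta)[0]
    y_new = y_prev + W[:, active] @ delta[active]
    return y_new, len(active)
```

When most sampled inputs barely change between passes, `len(active)` stays small and most columns of W are never touched, which is the source of the reported computation reduction.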
Denoising sensor-captured images on edge display devices remains challenging due to deep neural networks' (DNNs) high computational overhead and the limitations of training on synthetic noise. This work proposes BDLUT(-D), a novel blind denoising method combining optimized lookup tables (LUTs) with hardware-centric design. While BDLUT describes the LUT-based network architecture, BDLUT-D represents BDLUT trained with a specialized noise degradation model. Designed for edge deployment, BDLUT(-D) eliminates neural processing units (NPUs) and functions as a standalone ASIC IP solution. Experimental results demonstrate that BDLUT-D achieves up to 2.42 dB improvement over state-of-the-art LUT methods on mixed-noise-intensity benchmarks while requiring only 66 KB of storage. An FPGA implementation shows over 10x reduction in logic resources and 75% less storage compared to DNN accelerators, while achieving 57% faster processing than traditional bilateral filtering methods. These optimizations enable practical integration into edge scenarios like low-cost webcam enhancement and real-time 4K-to-4K denoising without compromising resolution or latency. By enhancing silicon efficiency and removing external accelerator dependencies, BDLUT(-D) establishes a new standard for practical edge image denoising. Implementation is available at .
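The reason LUT methods map so cheaply to silicon is that inference collapses to table reads: a trained network is baked into an array indexed by quantized inputs. A deliberately simplified single-pixel sketch (BDLUT itself indexes on small local patches; the offset table here is a hypothetical stand-in for a trained LUT):

```python
import numpy as np

def apply_lut(img, lut):
    # Per-pixel inference is a single memory read: the 8-bit pixel
    # value indexes a 256-entry precomputed table.
    return lut[img]

# Hypothetical brightness-offset table standing in for trained weights.
lut = np.clip(np.arange(256) + 10, 0, 255).astype(np.uint8)
img = np.array([[0, 100], [250, 255]], dtype=np.uint8)
out = apply_lut(img, lut)
```

A 256-entry uint8 table costs 256 bytes; even patch-indexed variants stay in the tens of kilobytes, consistent with the 66 KB figure quoted above.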
Convolutional neural network (CNN)-based object detection has achieved very high accuracy; e.g., single-shot multi-box detectors (SSDs) can efficiently detect and localize various objects in an input image. However, they require a large amount of computation and memory storage, which makes it difficult to perform efficient inference on resource-constrained hardware devices such as drones or unmanned aerial vehicles (UAVs). Drone/UAV detection is an important task for applications including surveillance, defense, and multi-drone self-localization and formation control. In this article, we designed and co-optimized an algorithm and hardware for energy-efficient drone detection on resource-constrained FPGA devices. We trained an SSD object detection algorithm with a custom drone dataset. For inference, we employed low-precision quantization and adapted the width of the SSD CNN model. To improve throughput, we use dual-data-rate operations for DSPs to effectively double the throughput with limited DSP counts. For different SSD algorithm models, we analyze accuracy, or mean average precision (mAP), and evaluate the corresponding FPGA hardware utilization, DRAM communication, and throughput optimization. We evaluated the FPGA hardware on a custom drone dataset, Pascal VOC, and COCO 2017. Our proposed design achieves a high mAP of 88.42% on the multi-drone dataset, with a high energy efficiency of 79 GOPS/W and a throughput of 158 GOPS using the Xilinx Zynq ZU3EG FPGA device on the Open Vision Computer version 3 (OVC3) platform. Our design achieves 1.1 to 8.7x higher energy efficiency than prior works that used the same Pascal VOC dataset and the same FPGA device, but at a low power consumption of 2.54 W. For the COCO dataset, our MobileNet-V1 implementation achieved an mAP of 16.8 and an energy efficiency of 4.9 FPS/W, approximately 1.9x higher than prior FPGA works and other commercial hardware platforms.
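Low-precision quantization of the kind mentioned above typically maps float weights to narrow integers plus a shared scale. A generic symmetric per-tensor int8 sketch (the paper's exact bit widths and granularity may differ):

```python
import numpy as np

def quantize_int8(w):
    # Map float weights to int8 with one shared scale so FPGA DSPs
    # can multiply narrow integers instead of floats.
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    # Recover approximate float weights for accuracy evaluation.
    return q.astype(np.float32) * scale
```

The worst-case reconstruction error of this scheme is half a quantization step (scale/2), which is what makes the accuracy/mAP trade-off analyzable per bit width.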
ISBN (Print): 9798350344868; 9798350344851
This research aims to develop energy-efficient hardware accelerators for Simultaneous Localization and Mapping (SLAM) back-end applications by employing algorithm-hardware co-design. Utilizing the iSAM2 algorithm, which uses graphical modeling to solve iterative Gauss-Newton problems, we continuously update maps by incorporating solutions from previous iterations or timesteps. We address the performance bottleneck arising from memory writes of intermediate results by modifying the original algorithm. Additionally, we analyze the algorithm's parallelizability to meet latency demands. These hardware accelerators are designed as Intellectual Property (IP) blocks suitable for integration into custom Systems-on-Chip (SoCs). We evaluate the design using both holistic and block-level metrics, focusing on latency and energy efficiency. This work has implications for energy-constrained devices like drones and Extended Reality (XR) devices.
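For readers unfamiliar with the workload being accelerated, the Gauss-Newton problems mentioned above reduce to repeatedly solving normal equations built from residuals and Jacobians. A plain dense toy loop is sketched below; iSAM2 itself instead updates a factored sparse system incrementally, reusing work from previous timesteps, which is precisely what makes its memory traffic pattern a hardware concern:

```python
import numpy as np

def gauss_newton(residual, jacobian, x0, iters=20):
    # Classic batch Gauss-Newton: at each step solve the normal
    # equations (J^T J) dx = -J^T r and apply the update.
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        r = residual(x)
        J = jacobian(x)
        dx = np.linalg.solve(J.T @ J, -J.T @ r)
        x = x + dx
    return x
```

On a 1-D toy problem (find the root of x^2 - 2) this converges to sqrt(2) in a handful of iterations; real SLAM back ends solve the same structure over thousands of pose and landmark variables.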