ISBN (print): 9781450399166
By eliminating compute operations intelligently based on the runtime input, dynamic pruning (DP) promises to improve deep neural network inference speed substantially without incurring a major impact on accuracy. Although many DP algorithms with good pruning performance have been proposed, it remains a challenge to translate these theoretical reductions in compute operations into satisfactory end-to-end speedups in practical real-world implementations. The overhead of identifying operations to be pruned at run time, the need to efficiently process the resulting dynamic dataflow, and the non-trivial memory I/O bottleneck that emerges as the number of compute operations decreases have all contributed to the challenge of implementing practical DP systems. In this paper, the design and implementation of DPACS are presented to address these challenges. DPACS utilizes a hardware-aware dynamic spatial and channel pruning algorithm in conjunction with a dynamic dataflow engine in hardware to facilitate efficient processing of the pruned network. A channel mask precomputation scheme is designed to reduce memory I/O, and a dedicated inter-layer pipeline is used to achieve efficient indexing and dataflow of sparse activations. Extensive design space exploration has been performed using two architectural variations implemented on FPGA to accelerate multiple networks from the ResNet family on the ImageNet and CIFAR-10 datasets across a wide range of pruning ratios. Across the spectrum of configurations, DPACS achieves 1.1x to 3.9x end-to-end speedup over a baseline hardware implementation without pruning. Analysis of the tradeoff among accuracy, compute, and memory I/O performance highlights the importance of algorithm-architecture co-design in developing DP systems.
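As a rough illustration of the dynamic channel-pruning idea described above, the sketch below precomputes a binary channel mask from the input activation and evaluates only the surviving output channels of a 1x1 convolution. The predictor weights, keep ratio, and layer shapes are illustrative assumptions, not DPACS's actual design.

```python
import numpy as np

def channel_mask(x, w_pred, keep_ratio=0.5):
    """Precompute a binary channel mask from the input activation.

    x      : (C, H, W) input activation of the layer
    w_pred : (C_out, C) weights of a tiny mask predictor (hypothetical)
    """
    descriptor = x.mean(axis=(1, 2))             # global average pool -> (C,)
    scores = w_pred @ descriptor                 # per-output-channel saliency
    k = max(1, int(keep_ratio * len(scores)))    # number of channels to keep
    mask = np.zeros(len(scores), dtype=bool)
    mask[np.argsort(scores)[-k:]] = True         # keep the top-k channels
    return mask

def pruned_conv1x1(x, w, mask):
    """1x1 convolution that only computes the unmasked output channels."""
    _, H, W = x.shape
    out = np.zeros((w.shape[0], H, W), dtype=x.dtype)
    out[mask] = np.einsum('oc,chw->ohw', w[mask], x)   # pruned channels stay zero
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((16, 8, 8)).astype(np.float32)
w = rng.standard_normal((32, 16)).astype(np.float32)       # 1x1 conv weights
w_pred = rng.standard_normal((32, 16)).astype(np.float32)  # mask predictor
m = channel_mask(x, w_pred, keep_ratio=0.25)
y = pruned_conv1x1(x, w, m)
print(int(m.sum()), y.shape)    # 8 surviving output channels, (32, 8, 8) output
```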
ISBN (print): 9781665433334
Large-scale distributed deep learning training has enabled the development of more complex deep neural network models that learn from larger datasets for sophisticated tasks. In particular, distributed stochastic gradient descent intensively invokes all-reduce operations for gradient updates, which dominate communication time during iterative training epochs. In this work, we identify the inefficiency of widely used all-reduce algorithms and the opportunity for algorithm-architecture co-design. We propose the MULTITREE all-reduce algorithm, which is aware of topology and resource utilization, for efficient and scalable all-reduce operations applicable to different interconnect topologies. Moreover, we co-design the network interface to schedule and coordinate the all-reduce messages for contention-free communication, working in synergy with the algorithm. The flow control is also simplified to exploit the bulk data transfer of large gradient exchanges. We evaluate the co-design using different all-reduce data sizes in a synthetic study, demonstrating its effectiveness on various interconnection network topologies, as well as with state-of-the-art deep neural networks in real-workload experiments. The results show that MULTITREE achieves 2.3x and 1.56x communication speedup, as well as up to 81% and 30% training time reduction, compared to ring all-reduce and state-of-the-art approaches, respectively.
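To make the tree-based structure concrete, the toy simulation below splits each worker's gradient into one chunk per tree and performs a reduce-then-broadcast for every chunk, which is the basic pattern behind multi-tree all-reduce. The chunking and the idealized tree routine are illustrative assumptions; MULTITREE's topology- and resource-aware tree construction is not reproduced here.

```python
import numpy as np

def tree_allreduce(chunks, root):
    """Reduce one chunk from every worker to `root`, then broadcast the sum."""
    total = sum(chunks)                      # reduction along the tree (simulated)
    return [total.copy() for _ in chunks]    # broadcast from the root

def multitree_allreduce(grads):
    n = len(grads)                                    # number of workers = number of trees
    chunked = [np.array_split(g, n) for g in grads]   # worker r's chunk for tree t
    out = [list(c) for c in chunked]
    for t in range(n):                                # tree t, rooted at worker t
        reduced = tree_allreduce([chunked[r][t] for r in range(n)], root=t)
        for r in range(n):
            out[r][t] = reduced[r]
    return [np.concatenate(c) for c in out]

grads = [np.full(8, float(i)) for i in range(4)]      # 4 workers, 8-element gradient
result = multitree_allreduce(grads)
print(result[0])    # every worker ends with the elementwise sum 0+1+2+3 = 6
```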
Recently, Vision Transformers (ViTs) have set a new standard in computer vision (CV), showing unparalleled image-processing performance. However, their substantial computational requirements hinder practical deployment, especially on resource-limited devices common in CV applications. Token merging has emerged as a solution, condensing tokens with similar features to cut computational and memory costs. However, existing applications on ViTs often miss the mark in token compression, with rigid merging strategies and a lack of in-depth analysis of ViT merging characteristics. To overcome these issues, this paper introduces Adaptive Token Merging (AToM), a comprehensive algorithm-architecture co-design for accelerating ViTs. The AToM algorithm employs an image-adaptive, fine-grained merging strategy, significantly boosting computational efficiency. We also optimize the merging and unmerging processes to minimize overhead, employing techniques such as First-come-First-Merge mapping and Linear Distance Calculation. On the hardware side, the AToM architecture is tailor-made to exploit the AToM algorithm's benefits, with specialized engines for efficient merge and unmerge operations. Our pipeline architecture ensures end-to-end ViT processing, minimizing the latency and memory overhead introduced by the AToM algorithm. Across various hardware platforms including CPU, EdgeGPU, and GPU, AToM achieves average end-to-end speedups of 10.9x, 7.7x, and 5.4x, alongside energy savings of 24.9x, 1.8x, and 16.7x. Moreover, AToM offers 1.2x to 1.9x higher effective throughput compared to existing transformer accelerators.
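A minimal sketch of similarity-based token merging with a first-come-first-merge scan is shown below. The cosine-similarity metric, threshold, and averaging rule are illustrative assumptions and do not reproduce AToM's actual merging criterion or hardware mapping.

```python
import numpy as np

def merge_tokens(tokens, threshold=0.98):
    """Merge each token into the first earlier kept token it is similar enough to.

    Returns the merged tokens and, per input token, the slot it was merged into.
    """
    unit = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    kept, mapping = [], []
    for i in range(len(tokens)):
        target = -1
        for slot, k in enumerate(kept):              # first-come-first-merge scan
            if float(unit[i] @ unit[k]) >= threshold:
                target = slot
                break
        if target < 0:
            kept.append(i)
            mapping.append(len(kept) - 1)
        else:
            mapping.append(target)
    merged = np.zeros((len(kept), tokens.shape[1]))
    counts = np.zeros(len(kept))
    for i, slot in enumerate(mapping):               # average tokens sharing a slot
        merged[slot] += tokens[i]
        counts[slot] += 1
    return merged / counts[:, None], np.array(mapping)

def unmerge(merged_out, mapping):
    """Scatter per-merged-token outputs back to the original token positions."""
    return merged_out[mapping]

rng = np.random.default_rng(0)
toks = rng.standard_normal((6, 4))
toks[3] = toks[1] * 1.01             # make token 3 nearly identical to token 1
m, idx = merge_tokens(toks)
print(m.shape, idx)                  # fewer merged tokens; 1 and 3 share a slot
print(unmerge(m, idx).shape)         # (6, 4): restored to the original token count
```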
Point cloud-based 3-D perception is poised to become a key workload in various applications. It always leverages a feature learning network (FLN) before the backbone to obtain a uniform representation from the scattered points. Grid-based FLN (GFLN), which partitions point clouds into uniform grids, has become the main category in recent state-of-the-art (SOTA) works. However, it heavily suffers from significant memory and computation inefficiency associated with high point sparsity and critical data dependency. To address these issues, we propose FLNA, a GFLN accelerator with algorithm-architecture co-optimization for large-scale point clouds. At the algorithm level, a dataflow-decoupling strategy is adopted to alleviate the processing bottlenecks from pipeline dependency, which also reduces computation cost by 78.3% by exploiting the redundancy from inherent sparsity and special operators. Based on the algorithm co-optimization, an effective architecture is designed with efficient GFLN mapping and block-wise processing strategies. It improves on-chip memory efficiency tremendously through diverse techniques, including a linked-list-based block lookup table (LUT) and transposed feature organization. Extensively evaluated on representative benchmarks, FLNA achieves 69.9x-264.4x speedup with more than 99% energy savings compared to multiple GPUs and a CPU. It also demonstrates a substantial performance boost over SOTA point cloud accelerators while providing superior support for large-scale point clouds.
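The sketch below illustrates the grid-partitioning step that a GFLN performs before the backbone: points are hashed into uniform cells and each non-empty cell is summarized by a pooled feature. The grid size and mean-pooling reduction are illustrative assumptions; FLNA's linked-list-based block LUT and transposed feature organization are not modeled.

```python
import numpy as np

def voxelize(points, feats, grid_size=1.0):
    """Group point features by the uniform grid cell each point falls into."""
    cells = np.floor(points / grid_size).astype(np.int64)     # (N, 3) cell indices
    # Only non-empty cells are kept, which is where the sparsity savings come from.
    uniq, inverse = np.unique(cells, axis=0, return_inverse=True)
    inverse = inverse.ravel()                     # ensure 1-D across numpy versions
    pooled = np.zeros((len(uniq), feats.shape[1]))
    counts = np.zeros(len(uniq))
    np.add.at(pooled, inverse, feats)             # scatter-add features per cell
    np.add.at(counts, inverse, 1.0)
    return uniq, pooled / counts[:, None]         # cell coordinates, mean feature

rng = np.random.default_rng(0)
pts = rng.uniform(0.0, 4.0, size=(1000, 3))       # scattered 3-D points
ft = rng.standard_normal((1000, 8))               # per-point input features
cells, cell_feats = voxelize(pts, ft, grid_size=1.0)
print(cells.shape, cell_feats.shape)              # non-empty cells and their features
```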
Vision-centric 3D perception has become a key mechanism in autonomous driving. It achieves exceptional perceptual performance mainly by introducing a novel attention mechanism, multi-view cross-attention (MVCA), for learnable feature extraction and fusion from surround-view cameras. Despite its superiority, MVCA encounters severe inefficiencies in sampling, processing elements (PEs), and pipelined processing, owing to the redundant and non-uniform sampling-aggregation and rigorous inter-operator dependencies. To address these issues, this article proposes a dedicated MVCA accelerator, MVAtor, with algorithm-architecture co-optimization for flexible vision-centric 3D perception based on multi-view inputs. For sample inefficiency, a 3-tier hybrid static-dynamic sampling scheme and a sensitivity-aware feature pruning approach are proposed to eliminate 86.03% of the sample overhead and 24.48% of the memory requirement, incurring <1% accuracy loss with no need for fine-tuning. For PE inefficiency, a spatial pruner and sequential sampler collaboration strategy is proposed to improve sampler utilization without compromising the pruner's throughput, outperforming the previous design by a 53.7%-96.1% energy-delay product reduction. For pipeline inefficiency, a fine-grained-tiling-assisted, highly pipelined architecture is constructed in MVAtor by exploiting the decoupling opportunities from inter-view sparsity, thereby saving 61.03% of external memory access while boosting the overall throughput by 1.83x. Extensively evaluated on representative benchmarks, MVAtor attains 1.38x-7.67x and 1.67x-11.15x improvements in energy and area efficiency, respectively, compared to state-of-the-art related accelerators.
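For intuition about the sample-and-aggregate pattern that makes MVCA irregular, the following simplified sketch has each query sample features at a few reference locations in every camera view and combine them with learned weights. Nearest-neighbour lookup stands in for bilinear sampling, and all shapes are illustrative assumptions; MVAtor's pruning and pipelining schemes are not modeled.

```python
import numpy as np

def mvca_aggregate(view_feats, ref_points, attn_weights):
    """
    view_feats   : (V, H, W, D)  feature map of each camera view
    ref_points   : (Q, V, S, 2)  per-query sampling coordinates in [0, 1)
    attn_weights : (Q, V, S)     learned aggregation weights
    Returns one fused (Q, D) feature per query.
    """
    V, H, W, D = view_feats.shape
    Q = ref_points.shape[0]
    out = np.zeros((Q, D))
    for v in range(V):                                                  # each camera view
        ys = (ref_points[:, v, :, 0] * H).astype(int).clip(0, H - 1)    # (Q, S)
        xs = (ref_points[:, v, :, 1] * W).astype(int).clip(0, W - 1)
        sampled = view_feats[v, ys, xs]                                 # gather: (Q, S, D)
        out += (attn_weights[:, v, :, None] * sampled).sum(axis=1)      # weighted aggregate
    return out

rng = np.random.default_rng(0)
Q, V, S, D, H, W = 16, 6, 4, 32, 8, 8
fused = mvca_aggregate(rng.standard_normal((V, H, W, D)),
                       rng.uniform(size=(Q, V, S, 2)),
                       rng.uniform(size=(Q, V, S)))
print(fused.shape)    # (16, 32): one fused feature per query
```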
Models based on the attention mechanism, i.e., transformers, have shown extraordinary performance in natural language processing (NLP) tasks. However, their memory footprint, inference latency, and power consumption are still prohibitive for efficient inference at edge devices, and even at data centers. To tackle this issue, we present an algorithm-architecture co-design named DTATrans. We find empirically that the tolerance to noise varies from token to token in attention-based NLP models. This finding leads us to dynamically quantize different tokens with mixed levels of bits. Furthermore, we find that an overstrict quantization method causes a dilemma between model accuracy and model compression ratio, which impels us to explore a method to compensate for model accuracy when the compression ratio is high. Thus, in DTATrans, we design a compression framework that: 1) dynamically quantizes tokens while they are forwarded in the models; 2) jointly determines the ratio of each precision; and 3) compensates for model accuracy by exploiting lightweight computing on the 0-bit tokens. Moreover, due to the dynamic mixed-precision tokens produced by our framework, previous matrix-multiplication accelerators (e.g., systolic arrays) cannot effectively exploit the benefit of the compressed attention computation. We thus design our transformer accelerator with a variable-speed systolic array (VSSA) and propose an effective optimization strategy to alleviate the pipeline-stall problem in VSSA without hardware overhead. We conduct experiments with existing attention-based NLP models, including BERT and GPT-2, on various language tasks. Our results show that DTATrans outperforms the previous neural network accelerator Eyeriss by 16.04x in terms of speedup and 3.62x in terms of energy saving. Compared with the state-of-the-art attention accelerator SpAtten, DTATrans achieves at least 3.62x speedup and 4.22x energy efficiency improvement.
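A toy sketch of per-token mixed-precision quantization is given below: each token receives a bit-width according to a simple saliency proxy and is quantized symmetrically, with 0-bit tokens zeroed out. The saliency metric, bit budget, and 0-bit handling are illustrative assumptions rather than DTATrans's trained policy.

```python
import numpy as np

def quantize_token(t, bits):
    """Symmetric uniform quantization of one token to `bits` bits (0 = drop)."""
    if bits == 0:
        return np.zeros_like(t)            # 0-bit tokens carry no value here
    qmax = 2 ** (bits - 1) - 1
    scale = max(np.abs(t).max() / qmax, 1e-12)
    return np.round(t / scale).clip(-qmax, qmax) * scale

def dynamic_token_quant(tokens, budgets=(0, 4, 8)):
    """Give the least/middle/most salient thirds of the tokens 0/4/8 bits."""
    saliency = np.linalg.norm(tokens, axis=1)      # simple saliency proxy
    order = np.argsort(saliency)                   # least salient first
    bits = np.empty(len(tokens), dtype=int)
    for b, idx in zip(budgets, np.array_split(order, len(budgets))):
        bits[idx] = b
    quant = np.stack([quantize_token(t, b) for t, b in zip(tokens, bits)])
    return quant, bits

rng = np.random.default_rng(0)
toks = rng.standard_normal((9, 16))
q, bits = dynamic_token_quant(toks)
print(bits)                        # a mix of 0-, 4-, and 8-bit tokens
print(np.abs(q - toks).mean())     # quantization error, dominated by 0-bit tokens
```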
Deep Neural Networks (DNNs) have achieved remarkable success in various Artificial Intelligence applications. Quantization is a critical step in DNN compression and acceleration for deployment. To further boost DNN execution efficiency, many works explore leveraging input-dependent redundancy with dynamic quantization for different regions. However, the sensitive regions in the feature map are irregularly distributed, which restricts the real speedup on existing accelerators. To this end, we propose an algorithm-architecture co-design named Structured Dynamic Precision (SDP). Specifically, we propose a quantization scheme in which the high-order bit part and the low-order bit part of data can be masked independently, and a fixed number of term parts is dynamically selected for computation based on the importance of each term in the group. We also present a hardware design that enables the algorithm efficiently with small overheads, whose inference time scales roughly proportionally with precision. Evaluation experiments on extensive networks demonstrate that, compared to the state-of-the-art dynamic quantization accelerator DRQ, SDP can achieve a 29% performance gain and 51% energy reduction for the same level of model accuracy.
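The following toy example mimics the SDP idea of splitting values into high- and low-order bit parts and keeping only a fixed number of the most important parts per group during the multiply-accumulate. The group size, 4-bit part width, and per-group budget are illustrative assumptions.

```python
import numpy as np

def split_parts(x):
    """Split 8-bit integers into a high-order and a low-order 4-bit part."""
    hi = (x >> 4) << 4                # high-order part, kept at its place value
    return hi, x - hi                 # low-order part

def sdp_dot(acts, wts, group=8, parts_per_group=12):
    """Dot product keeping only the largest-magnitude activation parts per group."""
    total = 0
    for g in range(0, len(acts), group):
        a, w = acts[g:g + group], wts[g:g + group]
        hi, lo = split_parts(a)
        parts = np.concatenate([hi, lo])                      # 2*group candidate parts
        w2 = np.concatenate([w, w])
        keep = np.argsort(np.abs(parts))[-parts_per_group:]   # importance-based selection
        total += int(np.dot(parts[keep], w2[keep]))           # pruned multiply-accumulate
    return total

rng = np.random.default_rng(0)
a = rng.integers(-128, 128, size=32)
w = rng.integers(-128, 128, size=32)
print(sdp_dot(a, w), int(np.dot(a, w)))    # approximate vs. exact dot product
```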
ISBN (print): 9798350393132; 9798350393149
The wide adoption and significant computing resource cost of attention-based transformers, e.g., Vision Transformers and large language models, have driven the demand for efficient hardware accelerators. While electronic accelerators have been commonly used, there is growing interest in exploring photonics as an alternative technology due to its high energy efficiency and ultra-fast processing speed. Photonic accelerators have demonstrated promising results for convolutional neural network (CNN) workloads, which predominantly rely on weight-static linear operations. However, they encounter challenges in efficiently supporting attention-based Transformer architectures, raising questions about the applicability of photonics to advanced machine-learning tasks. The primary hurdle lies in their inefficiency in handling the unique workloads inherent to Transformers, i.e., dynamic and full-range tensor multiplication. In this work, we propose Lightening-Transformer, the first light-empowered, high-performance, and energy-efficient photonic Transformer accelerator. To overcome the fundamental limitation of existing photonic tensor core designs, we introduce a novel dynamically-operated photonic tensor core, DPTC, consisting of a crossbar array of interference-based optical vector dot-product engines that supports highly parallel, dynamic, and full-range matrix multiplication. Furthermore, we design a dedicated accelerator that integrates our novel photonic computing cores with photonic interconnects for inter-core data broadcast, fully unleashing the power of optics. A comprehensive evaluation demonstrates that Lightening-Transformer achieves >2.6x energy and >12x latency reductions compared to prior photonic accelerators, and delivers the lowest energy cost and 2 to 3 orders of magnitude lower energy-delay product compared to electronic Transformer accelerators, all while maintaining digital-comparable accuracy. Our work highlights the immense potential of photonic computing for advanced machine-learning workloads.
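As a schematic of the computational pattern DPTC is described as providing, the sketch below tiles a full matrix multiplication onto a crossbar of vector dot-product engines in which both operands can change every cycle. The crossbar geometry and ideal noiseless engines are assumptions; no photonic device physics is modeled.

```python
import numpy as np

CROSSBAR_ROWS, CROSSBAR_COLS, VEC_LEN = 4, 4, 8    # assumed core geometry

def dot_engine(a, b):
    """One ideal dot-product engine: both operands may change every cycle."""
    return float(np.dot(a, b))

def dptc_matmul(A, B):
    """Tile A (M, K) x B (K, N) onto the crossbar, VEC_LEN elements at a time."""
    M, K = A.shape
    _, N = B.shape
    out = np.zeros((M, N))
    for i0 in range(0, M, CROSSBAR_ROWS):            # block of engine rows
        for j0 in range(0, N, CROSSBAR_COLS):        # block of engine columns
            for k0 in range(0, K, VEC_LEN):          # temporal accumulation
                for i in range(i0, min(i0 + CROSSBAR_ROWS, M)):
                    for j in range(j0, min(j0 + CROSSBAR_COLS, N)):
                        out[i, j] += dot_engine(A[i, k0:k0 + VEC_LEN],
                                                B[k0:k0 + VEC_LEN, j])
    return out

rng = np.random.default_rng(0)
A, B = rng.standard_normal((8, 16)), rng.standard_normal((16, 8))
print(np.allclose(dptc_matmul(A, B), A @ B))   # True: matches a dense matmul
```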
Triangles are the basic substructure of networks, and triangle counting (TC) has been a fundamental graph computing problem in numerous fields such as social network analysis. Nevertheless, like other graph computing problems, TC involves a large amount of data transfer due to its high memory-to-computation ratio and random memory access pattern, and thus suffers from the bandwidth bottleneck of the traditional von Neumann architecture. To overcome this challenge, in this paper we propose to accelerate TC with the emerging processing-in-memory (PIM) architecture through algorithm-architecture co-optimization. To enable efficient in-memory implementations, we reformulate TC with bitwise logic operations (such as AND), and develop customized graph compression and mapping techniques for efficient dataflow management. With the emerging computational Spin-Transfer Torque Magnetic RAM (STT-MRAM) array, which is one of the most promising PIM-enabling technologies, device-to-architecture co-simulation results demonstrate that the proposed TC in-memory accelerator outperforms state-of-the-art GPU and FPGA accelerations by 12.2x and 31.8x, respectively, and achieves a 34x energy efficiency improvement over the FPGA accelerator.
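The bitwise reformulation of triangle counting can be sketched directly: with each adjacency row stored as a bitmap, the triangles through an edge (u, v) are found by AND-ing the two rows and popcounting the result. Python integers stand in for the in-memory bit arrays; the paper's graph compression and mapping techniques are not modeled.

```python
def count_triangles(adj_rows):
    """adj_rows[u] is an integer bitmap whose bit v is set iff edge (u, v) exists."""
    n, total = len(adj_rows), 0
    for u in range(n):
        for v in range(u + 1, n):
            if (adj_rows[u] >> v) & 1:                              # only existing edges
                total += bin(adj_rows[u] & adj_rows[v]).count("1")  # AND + popcount
    return total // 3          # each triangle is counted once per each of its 3 edges

# 4-clique on vertices {0, 1, 2, 3}: it contains exactly 4 triangles.
edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
rows = [0] * 4
for u, v in edges:
    rows[u] |= 1 << v
    rows[v] |= 1 << u
print(count_triangles(rows))   # 4
```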