ISBN (print): 9781450399166
By eliminating compute operations intelligently based on the runtime input, dynamic pruning (DP) promises to improve deep neural network inference speed substantially without incurring a major impact on accuracy. Although many DP algorithms with good pruning performance have been proposed, it remains a challenge to translate these theoretical reductions in compute operations into satisfactory end-to-end speedups in practical real-world implementations. The overhead of identifying operations to be pruned at run time, the need to efficiently process the resulting dynamic dataflow, and the non-trivial memory I/O bottleneck that emerges as the number of compute operations decreases have all contributed to the challenge of implementing practical DP systems. In this paper, the design and implementation of DPACS are presented to address these challenges. DPACS utilizes a hardware-aware dynamic spatial and channel pruning algorithm in conjunction with a dynamic dataflow engine in hardware to facilitate efficient processing of the pruned network. A channel mask precomputation scheme is designed to reduce memory I/O, and a dedicated inter-layer pipeline is used to achieve efficient indexing and dataflow of sparse activations. Extensive design space exploration has been performed using two architectural variations implemented on FPGA to accelerate multiple networks from the ResNet family on the ImageNet and CIFAR-10 datasets across a wide range of pruning ratios. Across the spectrum of configurations, DPACS achieves 1.1x to 3.9x end-to-end speedup over a baseline hardware implementation without pruning. Analysis of the tradeoff among accuracy, compute, and memory I/O performance highlights the importance of algorithm-architecture co-design in developing DP systems.
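As a rough illustration of the dynamic channel-pruning idea described above, the sketch below precomputes a binary channel mask from the input activation and evaluates only the surviving output channels of a 1x1 convolution. The predictor weights, keep ratio, and layer shapes are illustrative assumptions, not DPACS's actual design.

```python
import numpy as np

def channel_mask(x, w_pred, keep_ratio=0.5):
    """Precompute a binary channel mask from the input activation.

    x      : (C, H, W) input activation of the layer
    w_pred : (C_out, C) weights of a tiny mask predictor (hypothetical)
    """
    descriptor = x.mean(axis=(1, 2))             # global average pool -> (C,)
    scores = w_pred @ descriptor                 # per-output-channel saliency
    k = max(1, int(keep_ratio * len(scores)))    # number of channels to keep
    mask = np.zeros(len(scores), dtype=bool)
    mask[np.argsort(scores)[-k:]] = True         # keep the top-k channels
    return mask

def pruned_conv1x1(x, w, mask):
    """1x1 convolution that only computes the unmasked output channels."""
    _, H, W = x.shape
    out = np.zeros((w.shape[0], H, W), dtype=x.dtype)
    out[mask] = np.einsum('oc,chw->ohw', w[mask], x)   # pruned channels stay zero
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((16, 8, 8)).astype(np.float32)
w = rng.standard_normal((32, 16)).astype(np.float32)       # 1x1 conv weights
w_pred = rng.standard_normal((32, 16)).astype(np.float32)  # mask predictor
m = channel_mask(x, w_pred, keep_ratio=0.25)
y = pruned_conv1x1(x, w, m)
print(int(m.sum()), y.shape)    # 8 surviving output channels, (32, 8, 8) output
```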
ISBN (print): 9781665433334
Large-scale distributed deep learning training has enabled the development of more complex deep neural network models that learn from larger datasets for sophisticated tasks. In particular, distributed stochastic gradient descent intensively invokes all-reduce operations for gradient updates, which dominate communication time during iterative training epochs. In this work, we identify the inefficiency of widely used all-reduce algorithms and the opportunity for algorithm-architecture co-design. We propose the MULTITREE all-reduce algorithm, which is aware of topology and resource utilization, for efficient and scalable all-reduce operations applicable to different interconnect topologies. Moreover, we co-design the network interface to schedule and coordinate the all-reduce messages for contention-free communication, working in synergy with the algorithm. The flow control is also simplified to exploit the bulk data transfer of large gradient exchanges. We evaluate the co-design using different all-reduce data sizes in a synthetic study, demonstrating its effectiveness on various interconnection network topologies, as well as with state-of-the-art deep neural networks in real-workload experiments. The results show that MULTITREE achieves 2.3x and 1.56x communication speedup, as well as up to 81% and 30% training time reduction, compared to ring all-reduce and state-of-the-art approaches, respectively.
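To make the tree-based structure concrete, the toy simulation below splits each worker's gradient into one chunk per tree and performs a reduce-then-broadcast for every chunk, which is the basic pattern behind multi-tree all-reduce. The chunking and the idealized tree routine are illustrative assumptions; MULTITREE's topology- and resource-aware tree construction is not reproduced here.

```python
import numpy as np

def tree_allreduce(chunks, root):
    """Reduce one chunk from every worker to `root`, then broadcast the sum."""
    total = sum(chunks)                      # reduction along the tree (simulated)
    return [total.copy() for _ in chunks]    # broadcast from the root

def multitree_allreduce(grads):
    n = len(grads)                                    # number of workers = number of trees
    chunked = [np.array_split(g, n) for g in grads]   # worker r's chunk for tree t
    out = [list(c) for c in chunked]
    for t in range(n):                                # tree t, rooted at worker t
        reduced = tree_allreduce([chunked[r][t] for r in range(n)], root=t)
        for r in range(n):
            out[r][t] = reduced[r]
    return [np.concatenate(c) for c in out]

grads = [np.full(8, float(i)) for i in range(4)]      # 4 workers, 8-element gradient
result = multitree_allreduce(grads)
print(result[0])    # every worker ends with the elementwise sum 0+1+2+3 = 6
```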
Recently, Vision Transformers (ViTs) have set a new standard in computer vision (CV), showing unparalleled image-processing performance. However, their substantial computational requirements hinder practical deployment, especially on resource-limited devices common in CV applications. Token merging has emerged as a solution, condensing tokens with similar features to cut computational and memory costs. However, existing applications on ViTs often miss the mark in token compression, with rigid merging strategies and a lack of in-depth analysis of ViT merging characteristics. To overcome these issues, this paper introduces Adaptive Token Merging (AToM), a comprehensive algorithm-architecture co-design for accelerating ViTs. The AToM algorithm employs an image-adaptive, fine-grained merging strategy, significantly boosting computational efficiency. We also optimize the merging and unmerging processes to minimize overhead, employing techniques such as First-come-First-Merge mapping and Linear Distance Calculation. On the hardware side, the AToM architecture is tailor-made to exploit the AToM algorithm's benefits, with specialized engines for efficient merge and unmerge operations. Our pipeline architecture ensures end-to-end ViT processing, minimizing the latency and memory overhead introduced by the AToM algorithm. Across various hardware platforms including CPU, EdgeGPU, and GPU, AToM achieves average end-to-end speedups of 10.9x, 7.7x, and 5.4x, alongside energy savings of 24.9x, 1.8x, and 16.7x. Moreover, AToM offers 1.2x to 1.9x higher effective throughput compared to existing transformer accelerators.
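A minimal sketch of similarity-based token merging with a first-come-first-merge scan is shown below. The cosine-similarity metric, threshold, and averaging rule are illustrative assumptions and do not reproduce AToM's actual merging criterion or hardware mapping.

```python
import numpy as np

def merge_tokens(tokens, threshold=0.98):
    """Merge each token into the first earlier kept token it is similar enough to.

    Returns the merged tokens and, per input token, the slot it was merged into.
    """
    unit = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    kept, mapping = [], []
    for i in range(len(tokens)):
        target = -1
        for slot, k in enumerate(kept):              # first-come-first-merge scan
            if float(unit[i] @ unit[k]) >= threshold:
                target = slot
                break
        if target < 0:
            kept.append(i)
            mapping.append(len(kept) - 1)
        else:
            mapping.append(target)
    merged = np.zeros((len(kept), tokens.shape[1]))
    counts = np.zeros(len(kept))
    for i, slot in enumerate(mapping):               # average tokens sharing a slot
        merged[slot] += tokens[i]
        counts[slot] += 1
    return merged / counts[:, None], np.array(mapping)

def unmerge(merged_out, mapping):
    """Scatter per-merged-token outputs back to the original token positions."""
    return merged_out[mapping]

rng = np.random.default_rng(0)
toks = rng.standard_normal((6, 4))
toks[3] = toks[1] * 1.01             # make token 3 nearly identical to token 1
m, idx = merge_tokens(toks)
print(m.shape, idx)                  # fewer merged tokens; 1 and 3 share a slot
print(unmerge(m, idx).shape)         # (6, 4): restored to the original token count
```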
Point cloud-based 3-D perception is poised to become a key workload in various applications. It always leverages a feature learning network (FLN) before the backbone to obtain a uniform representation from the scattered points. Grid-based FLN (GFLN), which partitions point clouds into uniform grids, has become the main category in recent state-of-the-art (SOTA) works. However, it heavily suffers from significant memory and computation inefficiency associated with high point sparsity and critical data dependency. To address these issues, we propose FLNA, a GFLN accelerator with algorithm-architecture co-optimization for large-scale point clouds. At the algorithm level, a dataflow-decoupling strategy is adopted to alleviate the processing bottlenecks from pipeline dependency, which also reduces computation cost by 78.3% by exploiting the redundancy from inherent sparsity and special operators. Based on the algorithm co-optimization, an effective architecture is designed with efficient GFLN mapping and block-wise processing strategies. It improves on-chip memory efficiency tremendously through diverse techniques, including a linked-list-based block lookup table (LUT) and transposed feature organization. Extensively evaluated on representative benchmarks, FLNA achieves 69.9x-264.4x speedup with more than 99% energy savings compared to multiple GPUs and a CPU. It also demonstrates a substantial performance boost over SOTA point cloud accelerators while providing superior support for large-scale point clouds.
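The sketch below illustrates the grid-partitioning step that a GFLN performs before the backbone: points are hashed into uniform cells and each non-empty cell is summarized by a pooled feature. The grid size and mean-pooling reduction are illustrative assumptions; FLNA's linked-list-based block LUT and transposed feature organization are not modeled.

```python
import numpy as np

def voxelize(points, feats, grid_size=1.0):
    """Group point features by the uniform grid cell each point falls into."""
    cells = np.floor(points / grid_size).astype(np.int64)     # (N, 3) cell indices
    # Only non-empty cells are kept, which is where the sparsity savings come from.
    uniq, inverse = np.unique(cells, axis=0, return_inverse=True)
    inverse = inverse.ravel()                     # ensure 1-D across numpy versions
    pooled = np.zeros((len(uniq), feats.shape[1]))
    counts = np.zeros(len(uniq))
    np.add.at(pooled, inverse, feats)             # scatter-add features per cell
    np.add.at(counts, inverse, 1.0)
    return uniq, pooled / counts[:, None]         # cell coordinates, mean feature

rng = np.random.default_rng(0)
pts = rng.uniform(0.0, 4.0, size=(1000, 3))       # scattered 3-D points
ft = rng.standard_normal((1000, 8))               # per-point input features
cells, cell_feats = voxelize(pts, ft, grid_size=1.0)
print(cells.shape, cell_feats.shape)              # non-empty cells and their features
```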
Vision-centric 3D perception has become a key mechanism in autonomous driving. It achieves exceptional perceptual performance mainly by introducing a novel attention mechanism, multi-view cross-attention (MVCA), for learnable feature extraction and fusion from surround-view cameras. Despite its superiority, MVCA encounters severe inefficiencies in sampling, processing elements (PEs), and pipelined processing, owing to the redundant and non-uniform sampling-aggregation and rigorous inter-operator dependencies. To address these issues, this article proposes a dedicated MVCA accelerator, MVAtor, with algorithm-architecture co-optimization for flexible vision-centric 3D perception based on multi-view inputs. For sample inefficiency, a 3-tier hybrid static-dynamic sampling scheme and a sensitivity-aware feature pruning approach are proposed to eliminate 86.03% of the sample overhead and 24.48% of the memory requirement, incurring <1% accuracy loss with no need for fine-tuning. For PE inefficiency, a spatial pruner and sequential sampler collaboration strategy is proposed to improve sampler utilization without compromising the pruner's throughput, outperforming the previous design by a 53.7%-96.1% energy-delay product reduction. For pipeline inefficiency, a fine-grained-tiling-assisted, highly pipelined architecture is constructed in MVAtor by exploiting the decoupling opportunities from inter-view sparsity, thereby saving 61.03% of external memory access while boosting the overall throughput by 1.83x. Extensively evaluated on representative benchmarks, MVAtor attains 1.38x-7.67x and 1.67x-11.15x improvements in energy and area efficiency, respectively, compared to state-of-the-art related accelerators.
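For intuition about the sample-and-aggregate pattern that makes MVCA irregular, the following simplified sketch has each query sample features at a few reference locations in every camera view and combine them with learned weights. Nearest-neighbour lookup stands in for bilinear sampling, and all shapes are illustrative assumptions; MVAtor's pruning and pipelining schemes are not modeled.

```python
import numpy as np

def mvca_aggregate(view_feats, ref_points, attn_weights):
    """
    view_feats   : (V, H, W, D)  feature map of each camera view
    ref_points   : (Q, V, S, 2)  per-query sampling coordinates in [0, 1)
    attn_weights : (Q, V, S)     learned aggregation weights
    Returns one fused (Q, D) feature per query.
    """
    V, H, W, D = view_feats.shape
    Q = ref_points.shape[0]
    out = np.zeros((Q, D))
    for v in range(V):                                                  # each camera view
        ys = (ref_points[:, v, :, 0] * H).astype(int).clip(0, H - 1)    # (Q, S)
        xs = (ref_points[:, v, :, 1] * W).astype(int).clip(0, W - 1)
        sampled = view_feats[v, ys, xs]                                 # gather: (Q, S, D)
        out += (attn_weights[:, v, :, None] * sampled).sum(axis=1)      # weighted aggregate
    return out

rng = np.random.default_rng(0)
Q, V, S, D, H, W = 16, 6, 4, 32, 8, 8
fused = mvca_aggregate(rng.standard_normal((V, H, W, D)),
                       rng.uniform(size=(Q, V, S, 2)),
                       rng.uniform(size=(Q, V, S)))
print(fused.shape)    # (16, 32): one fused feature per query
```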
Models based on the attention mechanism, i.e., transformers, have shown extraordinary performance in natural language processing (NLP) tasks. However, their memory footprint, inference latency, and power consumption are still prohibitive for efficient inference at edge devices, and even at data centers. To tackle this issue, we present an algorithm-architecture co-design named DTATrans. We find empirically that the tolerance to noise varies from token to token in attention-based NLP models. This finding leads us to dynamically quantize different tokens with mixed levels of bits. Furthermore, we find that an overstrict quantization method causes a dilemma between model accuracy and model compression ratio, which impels us to explore a method to compensate for model accuracy when the compression ratio is high. Thus, in DTATrans, we design a compression framework that: 1) dynamically quantizes tokens while they are forwarded in the models; 2) jointly determines the ratio of each precision; and 3) compensates for model accuracy by exploiting lightweight computing on the 0-bit tokens. Moreover, due to the dynamic mixed-precision tokens produced by our framework, previous matrix-multiplication accelerators (e.g., systolic arrays) cannot effectively exploit the benefit of the compressed attention computation. We thus design our transformer accelerator with a variable-speed systolic array (VSSA) and propose an effective optimization strategy to alleviate the pipeline-stall problem in VSSA without hardware overhead. We conduct experiments with existing attention-based NLP models, including BERT and GPT-2, on various language tasks. Our results show that DTATrans outperforms the previous neural network accelerator Eyeriss by 16.04x in terms of speedup and 3.62x in terms of energy saving. Compared with the state-of-the-art attention accelerator SpAtten, DTATrans achieves at least 3.62x speedup and 4.22x energy efficiency improvement.
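A toy sketch of per-token mixed-precision quantization is given below: each token receives a bit-width according to a simple saliency proxy and is quantized symmetrically, with 0-bit tokens zeroed out. The saliency metric, bit budget, and 0-bit handling are illustrative assumptions rather than DTATrans's trained policy.

```python
import numpy as np

def quantize_token(t, bits):
    """Symmetric uniform quantization of one token to `bits` bits (0 = drop)."""
    if bits == 0:
        return np.zeros_like(t)            # 0-bit tokens carry no value here
    qmax = 2 ** (bits - 1) - 1
    scale = max(np.abs(t).max() / qmax, 1e-12)
    return np.round(t / scale).clip(-qmax, qmax) * scale

def dynamic_token_quant(tokens, budgets=(0, 4, 8)):
    """Give the least/middle/most salient thirds of the tokens 0/4/8 bits."""
    saliency = np.linalg.norm(tokens, axis=1)      # simple saliency proxy
    order = np.argsort(saliency)                   # least salient first
    bits = np.empty(len(tokens), dtype=int)
    for b, idx in zip(budgets, np.array_split(order, len(budgets))):
        bits[idx] = b
    quant = np.stack([quantize_token(t, b) for t, b in zip(tokens, bits)])
    return quant, bits

rng = np.random.default_rng(0)
toks = rng.standard_normal((9, 16))
q, bits = dynamic_token_quant(toks)
print(bits)                        # a mix of 0-, 4-, and 8-bit tokens
print(np.abs(q - toks).mean())     # quantization error, dominated by 0-bit tokens
```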
Deep Neural Networks (DNNs) have achieved remarkable success in various Artificial Intelligence applications. Quantization is a critical step in DNN compression and acceleration for deployment. To further boost DNN execution efficiency, many works explore leveraging input-dependent redundancy with dynamic quantization for different regions. However, the sensitive regions in the feature map are irregularly distributed, which restricts the real speedup on existing accelerators. To this end, we propose an algorithm-architecture co-design named Structured Dynamic Precision (SDP). Specifically, we propose a quantization scheme in which the high-order bit part and the low-order bit part of data can be masked independently, and a fixed number of term parts is dynamically selected for computation based on the importance of each term in the group. We also present a hardware design that enables the algorithm efficiently with small overheads, whose inference time scales roughly proportionally with precision. Evaluation experiments on extensive networks demonstrate that, compared to the state-of-the-art dynamic quantization accelerator DRQ, SDP can achieve a 29% performance gain and 51% energy reduction for the same level of model accuracy.
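The following toy example mimics the SDP idea of splitting values into high- and low-order bit parts and keeping only a fixed number of the most important parts per group during the multiply-accumulate. The group size, 4-bit part width, and per-group budget are illustrative assumptions.

```python
import numpy as np

def split_parts(x):
    """Split 8-bit integers into a high-order and a low-order 4-bit part."""
    hi = (x >> 4) << 4                # high-order part, kept at its place value
    return hi, x - hi                 # low-order part

def sdp_dot(acts, wts, group=8, parts_per_group=12):
    """Dot product keeping only the largest-magnitude activation parts per group."""
    total = 0
    for g in range(0, len(acts), group):
        a, w = acts[g:g + group], wts[g:g + group]
        hi, lo = split_parts(a)
        parts = np.concatenate([hi, lo])                      # 2*group candidate parts
        w2 = np.concatenate([w, w])
        keep = np.argsort(np.abs(parts))[-parts_per_group:]   # importance-based selection
        total += int(np.dot(parts[keep], w2[keep]))           # pruned multiply-accumulate
    return total

rng = np.random.default_rng(0)
a = rng.integers(-128, 128, size=32)
w = rng.integers(-128, 128, size=32)
print(sdp_dot(a, w), int(np.dot(a, w)))    # approximate vs. exact dot product
```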
ISBN (print): 9798350393132; 9798350393149
The wide adoption and significant computing resource cost of attention-based transformers, e.g., Vision Transformers and large language models, have driven the demand for efficient hardware accelerators. While electronic accelerators have been commonly used, there is growing interest in exploring photonics as an alternative technology due to its high energy efficiency and ultra-fast processing speed. Photonic accelerators have demonstrated promising results for convolutional neural network (CNN) workloads, which predominantly rely on weight-static linear operations. However, they encounter challenges in efficiently supporting attention-based Transformer architectures, raising questions about the applicability of photonics to advanced machine-learning tasks. The primary hurdle lies in their inefficiency in handling the unique workloads inherent to Transformers, i.e., dynamic and full-range tensor multiplication. In this work, we propose Lightening-Transformer, the first light-empowered, high-performance, and energy-efficient photonic Transformer accelerator. To overcome the fundamental limitation of existing photonic tensor core designs, we introduce a novel dynamically-operated photonic tensor core, DPTC, consisting of a crossbar array of interference-based optical vector dot-product engines that supports highly parallel, dynamic, and full-range matrix multiplication. Furthermore, we design a dedicated accelerator that integrates our novel photonic computing cores with photonic interconnects for inter-core data broadcast, fully unleashing the power of optics. A comprehensive evaluation demonstrates that Lightening-Transformer achieves >2.6x energy and >12x latency reductions compared to prior photonic accelerators, and delivers the lowest energy cost and 2 to 3 orders of magnitude lower energy-delay product compared to electronic Transformer accelerators, all while maintaining digital-comparable accuracy. Our work highlights the immense potential of photonic computing for advanced machine-learning workloads.
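As a schematic of the computational pattern DPTC is described as providing, the sketch below tiles a full matrix multiplication onto a crossbar of vector dot-product engines in which both operands can change every cycle. The crossbar geometry and ideal noiseless engines are assumptions; no photonic device physics is modeled.

```python
import numpy as np

CROSSBAR_ROWS, CROSSBAR_COLS, VEC_LEN = 4, 4, 8    # assumed core geometry

def dot_engine(a, b):
    """One ideal dot-product engine: both operands may change every cycle."""
    return float(np.dot(a, b))

def dptc_matmul(A, B):
    """Tile A (M, K) x B (K, N) onto the crossbar, VEC_LEN elements at a time."""
    M, K = A.shape
    _, N = B.shape
    out = np.zeros((M, N))
    for i0 in range(0, M, CROSSBAR_ROWS):            # block of engine rows
        for j0 in range(0, N, CROSSBAR_COLS):        # block of engine columns
            for k0 in range(0, K, VEC_LEN):          # temporal accumulation
                for i in range(i0, min(i0 + CROSSBAR_ROWS, M)):
                    for j in range(j0, min(j0 + CROSSBAR_COLS, N)):
                        out[i, j] += dot_engine(A[i, k0:k0 + VEC_LEN],
                                                B[k0:k0 + VEC_LEN, j])
    return out

rng = np.random.default_rng(0)
A, B = rng.standard_normal((8, 16)), rng.standard_normal((16, 8))
print(np.allclose(dptc_matmul(A, B), A @ B))   # True: matches a dense matmul
```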
Triangles are the basic substructure of networks, and triangle counting (TC) has been a fundamental graph computing problem in numerous fields such as social network analysis. Nevertheless, like other graph computing problems, TC involves a large amount of data transfer due to its high memory-to-computation ratio and random memory access pattern, and thus suffers from the bandwidth bottleneck of the traditional von Neumann architecture. To overcome this challenge, in this paper we propose to accelerate TC with the emerging processing-in-memory (PIM) architecture through algorithm-architecture co-optimization. To enable efficient in-memory implementations, we reformulate TC with bitwise logic operations (such as AND), and develop customized graph compression and mapping techniques for efficient dataflow management. With the emerging computational Spin-Transfer Torque Magnetic RAM (STT-MRAM) array, which is one of the most promising PIM-enabling technologies, device-to-architecture co-simulation results demonstrate that the proposed TC in-memory accelerator outperforms state-of-the-art GPU and FPGA accelerations by 12.2x and 31.8x, respectively, and achieves a 34x energy efficiency improvement over the FPGA accelerator.
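The bitwise reformulation of triangle counting can be sketched directly: with each adjacency row stored as a bitmap, the triangles through an edge (u, v) are found by AND-ing the two rows and popcounting the result. Python integers stand in for the in-memory bit arrays; the paper's graph compression and mapping techniques are not modeled.

```python
def count_triangles(adj_rows):
    """adj_rows[u] is an integer bitmap whose bit v is set iff edge (u, v) exists."""
    n, total = len(adj_rows), 0
    for u in range(n):
        for v in range(u + 1, n):
            if (adj_rows[u] >> v) & 1:                              # only existing edges
                total += bin(adj_rows[u] & adj_rows[v]).count("1")  # AND + popcount
    return total // 3          # each triangle is counted once per each of its 3 edges

# 4-clique on vertices {0, 1, 2, 3}: it contains exactly 4 triangles.
edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
rows = [0] * 4
for u, v in edges:
    rows[u] |= 1 << v
    rows[v] |= 1 << u
print(count_triangles(rows))   # 4
```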