ISBN (print): 9798350348606; 9783981926385
Transformer models have shown significant success across a wide range of tasks. However, the massive resources required by their inference prevent in-situ deployment in resource-constrained scenarios, leaving a high threshold for integrating their advances. Observing that these scenarios, e.g., smart homes in edge computing, usually comprise a rich set of trusted devices with untapped resources, it is promising to distribute Transformer inference onto multiple devices. However, due to the tightly coupled structure of the Transformer model, existing model parallelism approaches necessitate frequent communication to resolve data dependencies, making them unacceptable for distributed inference, especially under the weak interconnects of edge scenarios. In this paper, we propose Detransformer, a communication-efficient distributed in-situ Transformer inference system for edge scenarios. Detransformer is based on a novel block parallelism approach, whose key idea is to restructure the original single-block Transformer layer into a decoupled layer with multiple sub-blocks and to exploit model parallelism between the sub-blocks. In addition, Detransformer contains an adaptive placement approach that automatically selects the optimal placement strategy by striking a trade-off among communication capability, computing power, and memory budget. Experimental results show that Detransformer reduces distributed inference latency by up to 2.81x compared to the SOTA approach on 4 devices, while effectively maintaining task accuracy and a consistent model size.
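To make the block parallelism idea concrete, below is a minimal PyTorch sketch, not the authors' implementation; the class names SubBlock and DecoupledLayer and all hyperparameters are hypothetical. Each sub-block is a self-contained attention-plus-FFN slice with no intra-layer dependency on its siblings, so devices would only need to exchange and merge activations once per layer, instead of performing the per-operation all-reduces of conventional tensor parallelism.

import torch
import torch.nn as nn

class SubBlock(nn.Module):
    """One decoupled sub-block: a thin attention + FFN slice that can
    run on its own device with no intra-layer data dependencies."""
    def __init__(self, d_model, d_hidden, n_heads):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        a = self.norm1(x)
        h = x + self.attn(a, a, a)[0]
        return h + self.ffn(self.norm2(h))

class DecoupledLayer(nn.Module):
    """A transformer layer restructured into independent sub-blocks.
    Outputs are merged once at the layer boundary (one communication
    round) rather than after every matrix multiply."""
    def __init__(self, d_model, d_hidden, n_heads, num_blocks):
        super().__init__()
        self.blocks = nn.ModuleList(
            SubBlock(d_model, d_hidden // num_blocks, n_heads // num_blocks)
            for _ in range(num_blocks)
        )

    def forward(self, x):
        # In a real deployment each sub-block would be pinned to a
        # different device; here they run sequentially for illustration.
        return torch.stack([b(x) for b in self.blocks]).mean(dim=0)

The single merge per layer is where the communication savings over standard tensor parallelism would come from under this reading of the abstract.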
Transformer models have shown significant success across a wide range of tasks. However, the massive resources required for their inference prevent deployment on a single device with relatively constrained resources, leaving a high threshold for integrating their advancements. Observing scenarios such as smart home applications on edge devices and cloud deployment on commodity hardware, it is promising to distribute Transformer inference across multiple devices. Unfortunately, due to the tightly coupled structure of the Transformer model, existing model parallelism approaches necessitate frequent communication to resolve data dependencies, making them unacceptable for distributed inference, especially over relatively weak interconnects. In this paper, we propose Detransformer, a communication-efficient distributed Transformer inference system. The key idea of Detransformer is to co-design the Transformer architecture to reduce communication during distributed inference. In detail, Detransformer is based on a novel block parallelism approach, which restructures the original single-block Transformer layer into a decoupled layer with multiple sub-blocks, so that it can exploit model parallelism between the sub-blocks. Next, Detransformer contains an adaptive execution approach that strikes a trade-off among communication capability, computing power, and memory budget across multiple devices. It incorporates two-phase planning for execution, namely static planning and runtime planning. Static planning runs offline and consists of a profiling procedure and a weight placement strategy carried out before execution. Runtime planning dynamically determines the optimal parallel computing strategy from an expertly crafted search space based on real-time requests. Notably, this execution approach can adapt to heterogeneous devices by distributing workload according to the devices' computing capabilities. We conduct experiments for both auto-regressive and auto-encoder tasks of Transformer models.
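A rough Python sketch of how such two-phase planning might be structured, assuming the behavior described in the abstract; all names (DeviceProfile, static_plan, runtime_plan) and the small-request fallback threshold are invented for illustration and are not taken from Detransformer.

from dataclasses import dataclass

@dataclass
class DeviceProfile:
    name: str
    flops: float      # measured compute throughput
    memory: float     # memory budget in bytes
    bandwidth: float  # link bandwidth to peers in bytes/s

def static_plan(profiles, num_sub_blocks):
    """Offline phase: after profiling, place sub-block weights on devices
    in proportion to their compute capability."""
    total_flops = sum(p.flops for p in profiles)
    placement, assigned = {}, 0
    for p in profiles:
        share = round(num_sub_blocks * p.flops / total_flops)
        share = min(share, num_sub_blocks - assigned)
        placement[p.name] = list(range(assigned, assigned + share))
        assigned += share
    # Any leftover sub-blocks go to the last device.
    placement[profiles[-1].name] += list(range(assigned, num_sub_blocks))
    return placement

def runtime_plan(placement, batch_size, seq_len, profiles):
    """Online phase: pick a parallel strategy for the incoming request
    from a small search space, e.g. fall back to the single fastest
    device for tiny requests where communication would dominate."""
    if batch_size * seq_len < 128:  # illustrative threshold, not from the paper
        fastest = max(profiles, key=lambda p: p.flops)
        return {"strategy": "single-device", "device": fastest.name}
    return {"strategy": "block-parallel", "placement": placement}

profiles = [DeviceProfile("edge-0", 4e12, 8e9, 1e9),
            DeviceProfile("edge-1", 2e12, 4e9, 1e9)]
placement = static_plan(profiles, num_sub_blocks=6)
print(runtime_plan(placement, batch_size=1, seq_len=64, profiles=profiles))

The split mirrors the abstract's description: the offline phase fixes where weights live, so the online phase only has to choose among execution strategies that are cheap to switch between at request time.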