ISBN (print): 9798350348606; 9783981926385
Transformer models have shown significant success across a wide range of tasks. However, the massive resources required by their inference prevent in-situ deployment in resource-constrained scenarios, leaving a high threshold for integrating their advances. Observing that these scenarios, e.g., smart homes in edge computing, usually comprise a rich set of trusted devices with untapped resources, it is promising to distribute Transformer inference onto multiple devices. However, due to the tightly coupled structure of the Transformer model, existing model parallelism approaches necessitate frequent communication to resolve data dependencies, making them unacceptable for distributed inference, especially under the weak interconnects of edge scenarios. In this paper, we propose Detransformer, a communication-efficient distributed in-situ Transformer inference system for edge scenarios. Detransformer is based on a novel block parallelism approach, whose key idea is to restructure the original single-block Transformer layer into a decoupled layer with multiple sub-blocks and to exploit model parallelism between the sub-blocks. In addition, Detransformer contains an adaptive placement approach that automatically selects the optimal placement strategy by striking a trade-off among communication capability, computing power, and memory budget. Experimental results show that Detransformer reduces distributed inference latency by up to 2.81x compared to the SOTA approach on 4 devices, while effectively maintaining task accuracy and a consistent model size.
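To make the block parallelism idea concrete, below is a minimal PyTorch sketch, not the authors' implementation; the class names SubBlock and DecoupledLayer and all hyperparameters are hypothetical. Each sub-block is a self-contained attention-plus-FFN slice with no intra-layer dependency on its siblings, so devices would only need to exchange and merge activations once per layer, instead of performing the per-operation all-reduces of conventional tensor parallelism.

import torch
import torch.nn as nn

class SubBlock(nn.Module):
    """One decoupled sub-block: a thin attention + FFN slice that can
    run on its own device with no intra-layer data dependencies."""
    def __init__(self, d_model, d_hidden, n_heads):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        a = self.norm1(x)
        h = x + self.attn(a, a, a)[0]
        return h + self.ffn(self.norm2(h))

class DecoupledLayer(nn.Module):
    """A transformer layer restructured into independent sub-blocks.
    Outputs are merged once at the layer boundary (one communication
    round) rather than after every matrix multiply."""
    def __init__(self, d_model, d_hidden, n_heads, num_blocks):
        super().__init__()
        self.blocks = nn.ModuleList(
            SubBlock(d_model, d_hidden // num_blocks, n_heads // num_blocks)
            for _ in range(num_blocks)
        )

    def forward(self, x):
        # In a real deployment each sub-block would be pinned to a
        # different device; here they run sequentially for illustration.
        return torch.stack([b(x) for b in self.blocks]).mean(dim=0)

The single merge per layer is where the communication savings over standard tensor parallelism would come from under this reading of the abstract.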
Transformer models have shown significant success across a wide range of tasks. However, the massive resources required for their inference prevent deployment on a single device with relatively constrained resources, leaving a high threshold for integrating their advancements. Observing scenarios such as smart home applications on edge devices and cloud deployment on commodity hardware, it is promising to distribute Transformer inference across multiple devices. Unfortunately, due to the tightly coupled structure of the Transformer model, existing model parallelism approaches necessitate frequent communication to resolve data dependencies, making them unacceptable for distributed inference, especially over relatively weak interconnects. In this paper, we propose Detransformer, a communication-efficient distributed Transformer inference system. The key idea of Detransformer is to co-design the Transformer architecture to reduce communication during distributed inference. In detail, Detransformer is based on a novel block parallelism approach, which restructures the original single-block Transformer layer into a decoupled layer with multiple sub-blocks, so that it can exploit model parallelism between the sub-blocks. Next, Detransformer contains an adaptive execution approach that strikes a trade-off among communication capability, computing power, and memory budget across multiple devices. It incorporates two-phase planning for execution, namely static planning and runtime planning. Static planning runs offline and consists of a profiling procedure and a weight placement strategy carried out before execution. Runtime planning dynamically determines the optimal parallel computing strategy from an expertly crafted search space based on real-time requests. Notably, this execution approach can adapt to heterogeneous devices by distributing workload according to the devices' computing capabilities. We conduct experiments for both auto-regressive and auto-encoder tasks of Transformer models.
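A rough Python sketch of how such two-phase planning might be structured, assuming the behavior described in the abstract; all names (DeviceProfile, static_plan, runtime_plan) and the small-request fallback threshold are invented for illustration and are not taken from Detransformer.

from dataclasses import dataclass

@dataclass
class DeviceProfile:
    name: str
    flops: float      # measured compute throughput
    memory: float     # memory budget in bytes
    bandwidth: float  # link bandwidth to peers in bytes/s

def static_plan(profiles, num_sub_blocks):
    """Offline phase: after profiling, place sub-block weights on devices
    in proportion to their compute capability."""
    total_flops = sum(p.flops for p in profiles)
    placement, assigned = {}, 0
    for p in profiles:
        share = round(num_sub_blocks * p.flops / total_flops)
        share = min(share, num_sub_blocks - assigned)
        placement[p.name] = list(range(assigned, assigned + share))
        assigned += share
    # Any leftover sub-blocks go to the last device.
    placement[profiles[-1].name] += list(range(assigned, num_sub_blocks))
    return placement

def runtime_plan(placement, batch_size, seq_len, profiles):
    """Online phase: pick a parallel strategy for the incoming request
    from a small search space, e.g. fall back to the single fastest
    device for tiny requests where communication would dominate."""
    if batch_size * seq_len < 128:  # illustrative threshold, not from the paper
        fastest = max(profiles, key=lambda p: p.flops)
        return {"strategy": "single-device", "device": fastest.name}
    return {"strategy": "block-parallel", "placement": placement}

profiles = [DeviceProfile("edge-0", 4e12, 8e9, 1e9),
            DeviceProfile("edge-1", 2e12, 4e9, 1e9)]
placement = static_plan(profiles, num_sub_blocks=6)
print(runtime_plan(placement, batch_size=1, seq_len=64, profiles=profiles))

The split mirrors the abstract's description: the offline phase fixes where weights live, so the online phase only has to choose among execution strategies that are cheap to switch between at request time.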