A high-performance dataflow-centric optimization framework for deep learning inference on the edge

Authors: Zhang, Runhua; Jiang, Hongxu; Geng, Jinkun; Tian, Fangzheng; Ma, Yuhang; Wang, Haojie

Affiliations: Beihang University, Beijing, People's Republic of China; Stanford University, Stanford, CA, USA; Tsinghua University, Beijing, People's Republic of China

Published in: JOURNAL OF SYSTEMS ARCHITECTURE

Year/Volume: 2024, Vol. 152


Subject classification: 08 [Engineering]; 0835 [Engineering - Software Engineering]; 0812 [Engineering - Computer Science and Technology (degrees awardable in Engineering or Science)]

Funding: National Key Research and Development Program of China [2021ZD0110202]; Academic Excellence Foundation of BUAA; Shuimu Tsinghua Scholar Program

Keywords: Edge computing; Model inference; Dataflow-centric; Computation graph; Data locality

Abstract: Edge computing has been emerging as a popular scenario for model inference. However, inference performance on edge devices (e.g., multi-core DSPs, FPGAs) suffers from inefficiency due to the lack of highly optimized inference frameworks. Previous model inference frameworks are mainly developed in an operator-centric way, which provides insufficient acceleration for edge-based inference. Moreover, operator-centric frameworks incur significant costs for continuous development and maintenance. Targeting these drawbacks of operator-centric frameworks, we design Xenos, which automatically conducts dataflow-centric optimization of the computation graph and accelerates inference in two dimensions. Vertically, Xenos develops an operator linking technique that improves data locality by restructuring the inter-operator dataflow. Horizontally, Xenos develops a DSP-aware operator split technique that enables higher parallelism across multiple DSP units. Our evaluation demonstrates the effectiveness of the vertical and horizontal dataflow optimizations, which reduce inference time by 15.0%-84.9% and 17.9%-89.9%, respectively. Xenos also outperforms the widely used TVM by 1.1x-1.9x. Moreover, we extend Xenos to a distributed solution, d-Xenos, which employs multiple edge devices to jointly conduct the inference task and achieves a speedup of 3.68x-3.78x over a single device.
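The two optimization dimensions the abstract describes can be pictured with a minimal sketch. The Python below is illustrative only, not Xenos's actual implementation or API: the names linked, split_matmul, and n_dsp are hypothetical stand-ins. Vertical operator linking is modeled as fusing a chain of elementwise operators into a single pass, and the horizontal DSP-aware split as partitioning a matmul's output rows across DSP units.

# Illustrative sketch only: "linked", "split_matmul", and "n_dsp" are
# hypothetical names invented here to picture the two optimizations the
# abstract describes; they are not Xenos's actual API.
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def linked(ops):
    # Vertical "operator linking": run a chain of elementwise operators
    # in one pass instead of materializing an intermediate tensor per
    # operator. (numpy cannot literally keep values in registers; this
    # models the graph restructuring, not the low-level code generation.)
    def fused(x):
        for op in ops:
            x = op(x)
        return x
    return fused

def split_matmul(w, x, n_dsp=4):
    # Horizontal "DSP-aware operator split": partition a matmul along its
    # output rows so each shard could run on a separate DSP unit, then
    # concatenate the partial results.
    shards = np.array_split(w, n_dsp, axis=0)  # one shard per DSP unit
    return np.concatenate([s @ x for s in shards], axis=0)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w, x = rng.normal(size=(64, 32)), rng.normal(size=(32,))
    pipeline = linked([lambda t: t + 1.0, relu, np.tanh])
    y = pipeline(split_matmul(w, x))
    # The restructured dataflow computes the same result as the naive graph.
    assert np.allclose(y, np.tanh(relu(w @ x + 1.0)))

In a real edge deployment the gain comes from avoiding intermediate buffer traffic (vertical) and from running the shards concurrently on separate DSP units (horizontal); the sketch only shows the graph-level restructuring both techniques share.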
