Distributing the inference of convolutional neural networks (CNNs) to multiple mobile devices has been studied in recent years to achieve real-time inference without losing accuracy. However, how to map a CNN to devices remains a challenge. On the one hand, scheduling the workload of state-of-the-art CNNs with multiple devices is NP-hard because the structures of CNNs are directed acyclic graphs (DAGs) rather than simple chains. On the other hand, distributing the inference workload suffers from expensive communication and unbalanced computation due to the wireless environment and heterogeneous devices. This paper presents PICO, a pipeline cooperation framework to accelerate the inference of versatile CNNs on diverse mobile devices. At its core, PICO features: (1) a generic graph partition algorithm that considers the characteristics of any given CNN and orchestrates it into a list of model pieces with suitable granularity, and (2) a many-to-many mapping algorithm that produces the best pipeline configuration for heterogeneous devices. In our experiments with 2~8 Raspberry Pi devices, throughput is improved by 1.8~6.8x under different CPU frequencies.
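As a concrete illustration of the kind of placement problem PICO addresses, the sketch below brute-forces the cut points of a simple layer chain so that the slowest stage, which bounds pipeline throughput, is as fast as possible. The layer costs, device speeds, and the helper partition_chain are hypothetical and cover only the chain special case; PICO's actual partition algorithm handles general DAGs and is not reproduced here.

from itertools import combinations

def partition_chain(layer_costs, device_speeds):
    """Brute-force the contiguous cut points that minimize the bottleneck stage time."""
    n, k = len(layer_costs), len(device_speeds)
    best_bounds, best_bottleneck = None, float("inf")
    for cuts in combinations(range(1, n), k - 1):
        bounds = (0,) + cuts + (n,)
        # Stage time = (sum of the stage's layer costs) / (speed of its device);
        # pipeline throughput is limited by the slowest stage.
        stage_times = [sum(layer_costs[bounds[i]:bounds[i + 1]]) / device_speeds[i]
                       for i in range(k)]
        if max(stage_times) < best_bottleneck:
            best_bottleneck, best_bounds = max(stage_times), bounds
    return best_bounds, 1.0 / best_bottleneck

# Example: 6 layers of unequal cost mapped onto 3 devices of unequal speed.
bounds, throughput = partition_chain([4, 2, 6, 3, 1, 5], [1.0, 2.0, 1.5])
print(bounds, round(throughput, 3))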
To achieve high-throughput deep learning (DL) model inference on heterogeneous multiprocessor system-on-chip (HMPSoC) platforms, the use of pipelining for the simultaneous utilization of multiple resources has emerged as a promising solution. Nevertheless, current research faces two primary challenges: determining the optimal pipeline partitioning granularity, which directly influences the inference performance, and addressing the high time overhead of the search algorithms. To address these challenges, we propose Flexi-BOPI, a pipeline inference method for DL models on HMPSoCs. Flexi-BOPI offers flexible pipeline partitioning granularity down to a minimum size of a single core, enhancing performance by better adapting to the diverse computational demands of different layers in DL models. Flexi-BOPI employs a Bayesian optimization-based search algorithm to significantly reduce the search overhead. In addition, we propose a surrogate model based on the heteroscedastic Gaussian process (HGP) to address the challenge of sample noise during the evaluation process, which further reduces the search overhead. Our experimental results demonstrate that the proposed method achieves significant improvements in inference performance and search overhead compared to existing methods.
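To make the search idea concrete, here is a minimal Bayesian-optimization loop over candidate pipeline configurations, assuming a hypothetical measure_latency benchmark and scikit-learn's GaussianProcessRegressor; passing per-sample measurement variance through alpha is only a rough stand-in for the paper's heteroscedastic GP surrogate, not its implementation.

import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def measure_latency(config, repeats=3):
    # Placeholder for an on-device benchmark: returns the mean latency and the
    # spread of the repeated runs, which stands in for per-sample noise.
    x = np.asarray(config, dtype=float)
    runs = (x - 3).dot(x - 3) + np.random.normal(0, 0.3 + 0.1 * x.sum(), repeats)
    return float(runs.mean()), float(runs.std() + 1e-3)

# Candidate configurations, e.g. (big cores, little cores) assigned to a stage.
candidates = np.array([[b, l] for b in range(1, 7) for l in range(1, 7)])
X, y, noise = [], [], []
for idx in np.random.choice(len(candidates), 5, replace=False):   # initial design
    m, s = measure_latency(candidates[idx])
    X.append(candidates[idx]); y.append(m); noise.append(s ** 2)

for _ in range(20):                                                # BO iterations
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), alpha=np.array(noise))
    gp.fit(np.array(X), np.array(y))
    mu, sigma = gp.predict(candidates, return_std=True)
    imp = min(y) - mu                                              # minimizing latency
    z = imp / (sigma + 1e-9)
    ei = imp * norm.cdf(z) + sigma * norm.pdf(z)                   # expected improvement
    nxt = candidates[int(np.argmax(ei))]
    m, s = measure_latency(nxt)
    X.append(nxt); y.append(m); noise.append(s ** 2)

print("best configuration:", X[int(np.argmin(y))], "mean latency:", round(min(y), 2))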
ISBN (print): 9798400702365
The rise of Large Language Models (LLMs) has fostered the creation of innovative requirements. Locally deployed LLMs for micro-enterprises mitigate potential issues such as privacy infringement and sluggish response, but they are hampered by the limited computing capability and memory space of the devices on hand. We introduce PipeLLM, which allocates the model across devices commensurate with their computing capabilities and enables parallel execution of layers by slicing the input sequence along the token dimension. PipeLLM demonstrates the potential to accelerate LLM inference with heterogeneous devices, offering a solution for LLM deployment in micro-enterprise hardware environments.
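A toy sketch of the two ideas in this description, using the hypothetical helpers allocate_layers and pipeline_schedule (not the authors' code): layers are assigned to devices in proportion to their compute capability, and the input sequence is split along the token dimension so that, once the pipeline fills, each stage works on a different slice at the same step.

import numpy as np

def allocate_layers(num_layers, device_capability):
    # Give each device a contiguous block of layers proportional to its speed.
    shares = np.array(device_capability, dtype=float)
    counts = np.maximum(1, np.round(shares / shares.sum() * num_layers)).astype(int)
    counts[-1] = num_layers - counts[:-1].sum()          # absorb rounding error
    bounds = np.concatenate([[0], np.cumsum(counts)])
    return [(int(bounds[i]), int(bounds[i + 1])) for i in range(len(counts))]

def pipeline_schedule(num_slices, num_stages):
    # At step t, stage s works on token slice (t - s): slices flow through the
    # stages, so different devices are busy on different slices concurrently.
    return [[(t - s, s) for s in range(num_stages) if 0 <= t - s < num_slices]
            for t in range(num_slices + num_stages - 1)]

print(allocate_layers(num_layers=32, device_capability=[8.0, 4.0, 2.0]))
print(pipeline_schedule(num_slices=4, num_stages=3))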
ISBN (print): 9798350386066; 9798350386059
Pipeline parallelism is a key mechanism for ensuring the performance of large-model serving systems. These systems need to handle unpredictable online workloads with low latency and high goodput. However, due to the specific characteristics of large models and resource constraints in pipeline parallelism, existing systems struggle to balance resource allocation across pipeline stages. The primary challenge lies in the differential distribution of requests across the various stages of the pipeline. We propose Quart, a large-model serving system that focuses on optimizing the performance of key stages in pipeline parallelism. Quart dynamically identifies the key stages of the pipeline and introduces an innovative two-level, fork-based model parameter caching system to achieve rapid scaling of key stages within seconds. In evaluations with real-world request workloads, Quart reduces average response latency by up to 87.1% and increases goodput by 2.37x compared to the baseline. The experiments demonstrate that Quart effectively reduces tail latency and the average queue length of the pipeline.
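The following is a loose sketch, not Quart's implementation, of what key-stage scaling could look like: the stage with the deepest per-replica backlog is flagged as the key stage and scaled from a warm parameter cache rather than by reloading weights from storage. Stage, key_stage, scale_up, and warm_cache are all hypothetical names introduced for illustration.

from collections import deque

class Stage:
    def __init__(self, name):
        self.name, self.queue, self.replicas = name, deque(), 1

def key_stage(stages):
    # The "key" stage here is simply the one with the deepest backlog per replica.
    return max(stages, key=lambda s: len(s.queue) / s.replicas)

warm_cache = {}                                     # stage name -> cached parameters

def scale_up(stage):
    # Fast path: reuse parameters already held in the warm cache; slow path:
    # load them once and remember them for the next scaling event.
    if stage.name not in warm_cache:
        warm_cache[stage.name] = f"params-of-{stage.name}"   # stands in for a weight load
    stage.replicas += 1
    return warm_cache[stage.name]

stages = [Stage("embedding"), Stage("decoder"), Stage("lm_head")]
stages[1].queue.extend(range(12))                   # a request burst piles up at the decoder
hot = key_stage(stages)
scale_up(hot)
print(hot.name, "replicas:", hot.replicas)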