Distributing the inference of convolutional neural networks (CNNs) to multiple mobile devices has been studied in recent years to achieve real-time inference without losing accuracy. However, how to map a CNN to devices remains a challenge. On the one hand, scheduling the workload of state-of-the-art CNNs with multiple devices is NP-hard because the structures of CNNs are directed acyclic graphs (DAGs) rather than simple chains. On the other hand, distributing the inference workload suffers from expensive communication and unbalanced computation due to the wireless environment and heterogeneous devices. This paper presents PICO, a pipeline cooperation framework to accelerate the inference of versatile CNNs on diverse mobile devices. At its core, PICO features: (1) a generic graph partition algorithm that considers the characteristics of any given CNN and orchestrates it into a list of model pieces with suitable granularity, and (2) a many-to-many mapping algorithm that produces the best pipeline configuration for heterogeneous devices. In our experiments with 2~8 Raspberry Pi devices, throughput is improved by 1.8~6.8x under different CPU frequencies.
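As a concrete illustration of the kind of placement problem PICO addresses, the sketch below brute-forces the cut points of a simple layer chain so that the slowest stage, which bounds pipeline throughput, is as fast as possible. The layer costs, device speeds, and the helper partition_chain are hypothetical and cover only the chain special case; PICO's actual partition algorithm handles general DAGs and is not reproduced here.

from itertools import combinations

def partition_chain(layer_costs, device_speeds):
    """Brute-force the contiguous cut points that minimize the bottleneck stage time."""
    n, k = len(layer_costs), len(device_speeds)
    best_bounds, best_bottleneck = None, float("inf")
    for cuts in combinations(range(1, n), k - 1):
        bounds = (0,) + cuts + (n,)
        # Stage time = (sum of the stage's layer costs) / (speed of its device);
        # pipeline throughput is limited by the slowest stage.
        stage_times = [sum(layer_costs[bounds[i]:bounds[i + 1]]) / device_speeds[i]
                       for i in range(k)]
        if max(stage_times) < best_bottleneck:
            best_bottleneck, best_bounds = max(stage_times), bounds
    return best_bounds, 1.0 / best_bottleneck

# Example: 6 layers of unequal cost mapped onto 3 devices of unequal speed.
bounds, throughput = partition_chain([4, 2, 6, 3, 1, 5], [1.0, 2.0, 1.5])
print(bounds, round(throughput, 3))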
To achieve high-throughput deep learning (DL) model inference on heterogeneous multiprocessor system-on-chip (HMPSoC) platforms, the use of pipelining for the simultaneous utilization of multiple resources has emerged as a promising solution. Nevertheless, current research faces two primary challenges: determining the optimal pipeline partitioning granularity, which directly influences the inference performance, and addressing the high time overhead of the search algorithms. To address these challenges, we propose Flexi-BOPI, a pipeline inference method for DL models on HMPSoCs. Flexi-BOPI offers flexible pipeline partitioning granularity down to a minimum size of a single core, enhancing performance by better adapting to the diverse computational demands of different layers in DL models. Flexi-BOPI employs a Bayesian optimization-based search algorithm to significantly reduce the search overhead. In addition, we propose a surrogate model based on the heteroscedastic Gaussian process (HGP) to address the challenge of sample noise during the evaluation process, which further reduces the search overhead. Our experimental results demonstrate that the proposed method achieves significant improvements in inference performance and search overhead compared to existing methods.
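To make the search idea concrete, here is a minimal Bayesian-optimization loop over candidate pipeline configurations, assuming a hypothetical measure_latency benchmark and scikit-learn's GaussianProcessRegressor; passing per-sample measurement variance through alpha is only a rough stand-in for the paper's heteroscedastic GP surrogate, not its implementation.

import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def measure_latency(config, repeats=3):
    # Placeholder for an on-device benchmark: returns the mean latency and the
    # spread of the repeated runs, which stands in for per-sample noise.
    x = np.asarray(config, dtype=float)
    runs = (x - 3).dot(x - 3) + np.random.normal(0, 0.3 + 0.1 * x.sum(), repeats)
    return float(runs.mean()), float(runs.std() + 1e-3)

# Candidate configurations, e.g. (big cores, little cores) assigned to a stage.
candidates = np.array([[b, l] for b in range(1, 7) for l in range(1, 7)])
X, y, noise = [], [], []
for idx in np.random.choice(len(candidates), 5, replace=False):   # initial design
    m, s = measure_latency(candidates[idx])
    X.append(candidates[idx]); y.append(m); noise.append(s ** 2)

for _ in range(20):                                                # BO iterations
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), alpha=np.array(noise))
    gp.fit(np.array(X), np.array(y))
    mu, sigma = gp.predict(candidates, return_std=True)
    imp = min(y) - mu                                              # minimizing latency
    z = imp / (sigma + 1e-9)
    ei = imp * norm.cdf(z) + sigma * norm.pdf(z)                   # expected improvement
    nxt = candidates[int(np.argmax(ei))]
    m, s = measure_latency(nxt)
    X.append(nxt); y.append(m); noise.append(s ** 2)

print("best configuration:", X[int(np.argmin(y))], "mean latency:", round(min(y), 2))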
ISBN (print): 9798400702365
The rise of Large Language Models (LLMs) has fostered the creation of innovative requirements. Locally deployed LLMs for micro-enterprises mitigate potential issues such as privacy infringement and sluggish response, but they are hampered by the limited computing capability and memory space of the devices on hand. We introduce PipeLLM, which allocates the model across devices commensurate with their computing capabilities and enables parallel execution of layers by slicing the input sequence along the token dimension. PipeLLM demonstrates the potential to accelerate LLM inference with heterogeneous devices, offering a solution for LLM deployment in micro-enterprise hardware environments.
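A toy sketch of the two ideas in this description, using the hypothetical helpers allocate_layers and pipeline_schedule (not the authors' code): layers are assigned to devices in proportion to their compute capability, and the input sequence is split along the token dimension so that, once the pipeline fills, each stage works on a different slice at the same step.

import numpy as np

def allocate_layers(num_layers, device_capability):
    # Give each device a contiguous block of layers proportional to its speed.
    shares = np.array(device_capability, dtype=float)
    counts = np.maximum(1, np.round(shares / shares.sum() * num_layers)).astype(int)
    counts[-1] = num_layers - counts[:-1].sum()          # absorb rounding error
    bounds = np.concatenate([[0], np.cumsum(counts)])
    return [(int(bounds[i]), int(bounds[i + 1])) for i in range(len(counts))]

def pipeline_schedule(num_slices, num_stages):
    # At step t, stage s works on token slice (t - s): slices flow through the
    # stages, so different devices are busy on different slices concurrently.
    return [[(t - s, s) for s in range(num_stages) if 0 <= t - s < num_slices]
            for t in range(num_slices + num_stages - 1)]

print(allocate_layers(num_layers=32, device_capability=[8.0, 4.0, 2.0]))
print(pipeline_schedule(num_slices=4, num_stages=3))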
ISBN (print): 9798350386066; 9798350386059
Pipeline parallelism is a key mechanism for ensuring the performance of large-model serving systems. These systems need to handle unpredictable online workloads with low latency and high goodput. However, due to the specific characteristics of large models and resource constraints in pipeline parallelism, existing systems struggle to balance resource allocation across pipeline stages. The primary challenge lies in the differential distribution of requests across the various stages of the pipeline. We propose Quart, a large-model serving system that focuses on optimizing the performance of key stages in pipeline parallelism. Quart dynamically identifies the key stages of the pipeline and introduces an innovative two-level, fork-based model parameter caching system to achieve rapid scaling of key stages within seconds. In evaluations with real-world request workloads, Quart reduces average response latency by up to 87.1% and increases goodput by 2.37x compared to the baseline. The experiments demonstrate that Quart effectively reduces tail latency and the average queue length of the pipeline.
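The following is a loose sketch, not Quart's implementation, of what key-stage scaling could look like: the stage with the deepest per-replica backlog is flagged as the key stage and scaled from a warm parameter cache rather than by reloading weights from storage. Stage, key_stage, scale_up, and warm_cache are all hypothetical names introduced for illustration.

from collections import deque

class Stage:
    def __init__(self, name):
        self.name, self.queue, self.replicas = name, deque(), 1

def key_stage(stages):
    # The "key" stage here is simply the one with the deepest backlog per replica.
    return max(stages, key=lambda s: len(s.queue) / s.replicas)

warm_cache = {}                                     # stage name -> cached parameters

def scale_up(stage):
    # Fast path: reuse parameters already held in the warm cache; slow path:
    # load them once and remember them for the next scaling event.
    if stage.name not in warm_cache:
        warm_cache[stage.name] = f"params-of-{stage.name}"   # stands in for a weight load
    stage.replicas += 1
    return warm_cache[stage.name]

stages = [Stage("embedding"), Stage("decoder"), Stage("lm_head")]
stages[1].queue.extend(range(12))                   # a request burst piles up at the decoder
hot = key_stage(stages)
scale_up(hot)
print(hot.name, "replicas:", hot.replicas)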