Conventional split computing approaches for AI models that generate large outputs suffer from long transmission and inference times. Due to the limited resources of the edge server and the selfishness of mobile devices (MDs), some MDs cannot offload their tasks and must sacrifice performance. To address these issues, we formulate an optimization problem that determines one or two split points to minimize inference latency while ensuring fair offloading among MDs. Additionally, we devise a low-complexity heuristic algorithm called fast and fair split computing (F2SC). Evaluation results demonstrate that F2SC reduces inference time by 3.8% to 20.1% compared to conventional approaches while maintaining fairness. (c) 2024 The Author(s). Published by Elsevier B.V. on behalf of The Korean Institute of Communications and Information Sciences. This is an open access article under the CC BY-NC-ND license (http://***/licenses/by-nc-nd/4.0/).
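To make the split-point idea concrete, below is a minimal sketch of choosing a single split point by minimizing estimated end-to-end latency. The per-layer timings, output sizes, and bandwidth are hypothetical placeholders; the paper's actual F2SC heuristic and its fairness constraint are not reproduced here.

```python
# Sketch of single-split-point selection for split computing.
# All profiles below are illustrative assumptions, not the paper's measurements.

# Per-layer (device_time_ms, server_time_ms, output_size_kb) for a toy model.
layers = [
    (2.0, 0.4, 800.0),   # conv1
    (3.5, 0.6, 400.0),   # conv2
    (5.0, 0.9, 100.0),   # conv3
    (1.0, 0.2, 10.0),    # fc
]
BANDWIDTH_KB_PER_S = 1000.0   # assumed uplink bandwidth
RAW_INPUT_KB = 1500.0         # assumed size of the raw input sample


def end_to_end_latency(split: int) -> float:
    """Latency if layers[:split] run on the device and layers[split:] on the edge server."""
    device = sum(t_d for t_d, _, _ in layers[:split])
    server = sum(t_s for _, t_s, _ in layers[split:])
    # Transmit the output of the last on-device layer (or the raw input if split == 0).
    tx_kb = layers[split - 1][2] if split > 0 else RAW_INPUT_KB
    transmit_ms = tx_kb / BANDWIDTH_KB_PER_S * 1000.0
    return device + transmit_ms + server


best_split = min(range(len(layers) + 1), key=end_to_end_latency)
print(f"best split point: {best_split}, latency: {end_to_end_latency(best_split):.1f} ms")
```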
Deep learning inference, which makes trained deep learning models available to resource-constrained clients, is usually deployed as a cloud-based framework. However, existing cloud-based frameworks suffer from severe information leakage or incur a significant increase in communication cost. In this work, we address privacy-preserving deep learning inference so that both the privacy of the input data and the model parameters are protected at low communication and computational cost. Additionally, the user can verify the correctness of the results with small overhead, which is very important for critical applications. Specifically, by designing secure sub-protocols, we introduce a new layer that collaboratively performs the secure computations involved in the inference. With the help of secret sharing, we inject verifiable data into the input, enabling us to check the correctness of the returned inference results. Theoretical analyses and extensive experimental results on the MNIST and CIFAR10 datasets validate the superiority of the proposed privacy-preserving and verifiable deep learning inference (PVDLI) framework. (c) 2022 Elsevier B.V. All rights reserved.
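The abstract builds on secret sharing as its core primitive. Below is a minimal sketch of 2-out-of-2 additive secret sharing over a prime field, a standard building block for this kind of protocol; it is not the paper's PVDLI construction, and the field modulus and party layout are assumptions made for illustration.

```python
import secrets

# Minimal 2-party additive secret sharing over a prime field.
# The modulus and two-party setup are illustrative assumptions.
P = 2**61 - 1  # a Mersenne prime used as the modulus


def share(x: int) -> tuple[int, int]:
    """Split x into two random-looking shares that sum to x mod P."""
    r = secrets.randbelow(P)
    return r, (x - r) % P


def reconstruct(s0: int, s1: int) -> int:
    return (s0 + s1) % P


# Each party can add its shares locally; reconstruction yields the sum of the secrets,
# which is why linear layers can be evaluated on shares without revealing inputs.
a0, a1 = share(42)
b0, b1 = share(100)
assert reconstruct((a0 + b0) % P, (a1 + b1) % P) == 142
```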
ISBN (print): 9781665423830
With the rapid development of modern deep learning technology, deep neural network (DNN)-based mobile applications have been considered for various areas. However, since mobile devices are not optimized to run DNN applications due to their limited computational resources, several computation offloading-based approaches have been introduced to overcome this issue; for DNN models, it has been reported that elaborate partitioning, in which input samples are partially processed on the mobile device and the edge server executes the rest, can effectively improve runtime performance. In addition, to improve communication efficiency in the offloading scenario, there have also been studies that reduce the data transmitted between a mobile device and the edge server by leveraging model compression. However, the existing approaches share a fundamental limitation: their performance ultimately depends on the architecture of the original DNN model. To overcome this, we propose a novel neural architecture search (NAS) method that takes computation offloading into account. On top of existing NAS approaches, we additionally introduce a resource selection mask and a channel selection mask. The resource selection mask divides the operations of the target model between the mobile device and the edge server; the channel selection mask allows only selected channels to be transmitted to the edge server without reducing task performance (e.g., accuracy). Based on these two masks, we introduce a new loss function for the NAS procedure that accounts for end-to-end inference time as well as the task performance that is the original goal of NAS. In the evaluation, the proposed method is compared to existing approaches; the experimental results show that our method outperforms both previous NAS and pruning-based model partitioning approaches.
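As a rough illustration of a latency-aware NAS objective of the kind the abstract describes, the sketch below combines a task loss with a latency estimate driven by relaxed resource and channel selection masks. The cost model, the sigmoid relaxation, and the trade-off weight `lam` are assumptions for illustration, not the paper's actual loss function.

```python
import torch

# Illustrative NAS-style loss: task loss plus a latency term controlled by two
# learnable masks. All cost values and the latency model are assumptions.

def offloading_nas_loss(task_loss: torch.Tensor,
                        resource_mask: torch.Tensor,   # per-layer logits, ~1 => run on device
                        channel_mask: torch.Tensor,    # per-channel logits, ~1 => transmit channel
                        device_cost: torch.Tensor,     # per-layer device latency estimate
                        server_cost: torch.Tensor,     # per-layer server latency estimate
                        per_channel_tx_cost: float,
                        lam: float = 0.1) -> torch.Tensor:
    r = torch.sigmoid(resource_mask)   # relax the binary placement decision
    c = torch.sigmoid(channel_mask)    # relax the binary channel selection
    compute = (r * device_cost + (1 - r) * server_cost).sum()
    transmit = per_channel_tx_cost * c.sum()
    latency = compute + transmit
    return task_loss + lam * latency


# Toy usage with random tensors standing in for a real model's profiling statistics.
loss = offloading_nas_loss(
    task_loss=torch.tensor(1.3),
    resource_mask=torch.zeros(8, requires_grad=True),
    channel_mask=torch.zeros(64, requires_grad=True),
    device_cost=torch.rand(8),
    server_cost=torch.rand(8) * 0.2,
    per_channel_tx_cost=0.05,
)
loss.backward()
```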
ISBN (print): 9781728152219
Hybrid analog-digital neuromorphic accelerators show promise for a significant increase in performance per watt for deep learning inference and training compared with conventional technologies. In this work we present an FPGA demonstrator of a programmable hybrid inferencing accelerator, in which memristor analog dot-product engines are emulated by digital matrix-vector multiplication units that use FPGA SRAM for in-situ weight storage. The full-chip demonstrator, interfaced to a host over PCIe, serves as a software development platform and as a vehicle for further hardware microarchitecture improvements. The implementation of the compute cores, tiles, network-on-chip, and host interface is discussed. A new pipelining scheme is introduced to achieve high utilization of the matrix-vector multiplication units while reducing the tile data memory required for neural network layer activations. The data-flow orchestration between the tiles, controlled by a RISC-V core, is described. An inference accuracy analysis is presented for example RNN and CNN models. The demonstrator is instrumented with hardware monitors to enable performance measurement and tuning. Performance projections for a future memristor-based ASIC are also discussed.
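To illustrate the tiled matrix-vector computation such an accelerator emulates, here is a minimal software sketch in which each fixed-size tile stands in for one dot-product engine with locally stored weights and partial sums are accumulated across tiles. The tile size, data types, and accumulation scheme are assumptions, not the demonstrator's actual microarchitecture.

```python
import numpy as np

# Tiled matrix-vector multiply: each (i, j) tile computes a partial dot product,
# and partial sums are accumulated, loosely mirroring how tile outputs would be
# combined over an on-chip network. TILE and dtypes are illustrative assumptions.
TILE = 4


def tiled_matvec(weights: np.ndarray, x: np.ndarray) -> np.ndarray:
    rows, cols = weights.shape
    y = np.zeros(rows, dtype=weights.dtype)
    for i in range(0, rows, TILE):
        for j in range(0, cols, TILE):
            y[i:i + TILE] += weights[i:i + TILE, j:j + TILE] @ x[j:j + TILE]
    return y


W = np.random.randn(8, 8).astype(np.float32)
v = np.random.randn(8).astype(np.float32)
assert np.allclose(tiled_matvec(W, v), W @ v, atol=1e-5)
```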
The first-generation tensor processing unit (TPU) runs deep neural network (DNN) inference 15-30 times faster with 30-80 times better energy efficiency than contemporary CPUs and GPUs in similar semiconductor technologies. This domain-specific architecture (DSA) is a custom chip that has been deployed in Google datacenters since 2015, where it serves billions of people.