Deep neural networks (DNNs) have recently achieved impressive success across a wide range of real-world vision and language processing tasks, spanning from image classification to many other downstream vision tasks, such as object detection, tracking, and segmentation. However, previous well-established DNNs, despite being able to maintain superior accuracy, have also been evolving to be deeper and wider and thus inevitably necessitate prohibitive computational resources for both training and inference. This trend further enlarges the computational gap between computation-intensive DNNs and resource-constrained embedded computing systems, making it challenging to deploy powerful DNNs in real-world embedded computing systems towards ubiquitous embedded intelligence. To alleviate this computational gap and enable ubiquitous embedded intelligence, we focus in this survey on discussing recent efficient deep learning infrastructures for embedded computing systems, spanning from training to inference, from manual to automated, from convolutional neural networks to transformers, from transformers to vision transformers, from vision models to large language models, from software to hardware, and from algorithms to applications. Specifically, we discuss recent efficient deep learning infrastructures for embedded computing systems from the lens of (1) efficient manual network design for embedded computing systems, (2) efficient automated network design for embedded computing systems, (3) efficient network compression for embedded computing systems, (4) efficient on-device learning for embedded computing systems, (5) efficient large language models for embedded computing systems, (6) efficient deep learning software and hardware for embedded computing systems, and (7) efficient intelligent applications for embedded computing systems. We also envision promising future directions and trends, which have the potential to deliver more ubiquitous embedded intelligence. We believe th
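As an illustration of the kind of efficient manual network design this survey covers (e.g., MobileNet-style depthwise separable convolutions), here is a minimal PyTorch sketch; the layer sizes are hypothetical and not taken from the survey itself.

```python
# Minimal sketch of a depthwise separable convolution block, a common
# building block in efficient manual network design (e.g., MobileNets).
# Channel sizes below are illustrative only.
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        # Depthwise: one 3x3 filter per input channel (groups=in_ch).
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=3, stride=stride,
                                   padding=1, groups=in_ch, bias=False)
        # Pointwise: 1x1 convolution mixes information across channels.
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.bn(self.pointwise(self.depthwise(x))))

# Example: roughly 8-9x fewer multiply-accumulates than a standard 3x3
# convolution with the same input/output channels.
block = DepthwiseSeparableConv(32, 64)
y = block(torch.randn(1, 32, 56, 56))
print(y.shape)  # torch.Size([1, 64, 56, 56])
```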
The paper considers an embedded system that can either compute tasks locally by itself or offload tasks to an edge server for remote computing during the running period (RP), and that switches to sleep mode to save energy once the RP ends, i.e., when the idle period (IP) arrives. The tasks are stored in the cache; the more tasks are computed during the RP, the less cache space is occupied at the end of the RP, but at the cost of more energy consumption. Meanwhile, the sleep mode that the embedded system enters during the IP also influences the energy consumption. Therefore, how to make the optimal tradeoff between energy consumption and cache occupancy arises as an interesting issue. To address this issue, this paper first establishes an optimization-theoretical framework to formulate the energy consumption under the constraint of cache occupancy, and then derives the most energy-saving RP, computing mode (i.e., local or edge computing), and low-power mode. Based on these theoretical results, an algorithm is proposed for the embedded system to minimize energy consumption within an acceptable level of cache occupancy. Theoretical analysis and field experiments jointly verify its good performance. Note to Practitioners-This paper addresses the tradeoff between energy consumption and cache occupancy in embedded systems that operate in environments with limited available energy and cache space. It helps improve the operational efficiency of embedded systems in the Internet of Things (IoT) or Cyber-Physical Systems (CPS), where edge computing empowers embedded systems with more computing capability and sleep modes provide more energy savings, with the goal of minimizing the cumulative energy consumption while keeping the cache occupancy, in terms of stored task data bits, within an acceptable range. Experimental investigations show that the solution proposed here outperf
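To make the tradeoff concrete, the following sketch enumerates candidate running-period lengths, computing modes, and sleep modes and keeps the lowest-energy combination whose end-of-RP cache occupancy stays below a threshold. The linear energy and cache models and all parameter values are hypothetical placeholders, not the paper's actual formulation.

```python
# Hypothetical sketch of the energy-vs-cache-occupancy tradeoff: enumerate
# candidate configurations and keep the lowest-energy one whose cache
# occupancy is acceptable. All cost models below are placeholders.
from itertools import product

ARRIVAL_RATE = 2.0                     # task bits arriving per unit time
LOCAL_RATE, EDGE_RATE = 1.5, 3.0       # bits processed per unit time
LOCAL_POWER, OFFLOAD_POWER = 0.8, 0.5  # energy per unit time while computing
SLEEP_POWER = {"light": 0.10, "deep": 0.02}  # energy per unit time in the IP
CACHE_LIMIT = 40.0                     # acceptable occupancy at end of RP
IDLE_PERIOD = 20.0

def evaluate(rp, mode, sleep):
    rate = LOCAL_RATE if mode == "local" else EDGE_RATE
    power = LOCAL_POWER if mode == "local" else OFFLOAD_POWER
    # Cache occupancy at the end of the RP: arrivals minus what was computed.
    occupancy = max(0.0, ARRIVAL_RATE * rp - rate * rp)
    energy = power * rp + SLEEP_POWER[sleep] * IDLE_PERIOD
    return energy, occupancy

best = None
for rp, mode, sleep in product([5, 10, 20, 40], ["local", "edge"],
                               ["light", "deep"]):
    energy, occ = evaluate(rp, mode, sleep)
    if occ <= CACHE_LIMIT and (best is None or energy < best[0]):
        best = (energy, rp, mode, sleep, occ)

print("min-energy config (energy, RP, mode, sleep, occupancy):", best)
```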
Convolutional Neural Networks (CNNs) have significantly impacted embedded system applications across various domains. However, this exacerbates the real-time processing and hardware resource constraints of embedded systems. To tackle these issues, we propose a spin-transfer torque magnetic random-access memory (STT-MRAM)-based near-memory computing (NMC) design for embedded systems. We optimize this design in three aspects: a fast-pipelined STT-MRAM readout scheme provides higher memory bandwidth for the NMC design, enhancing real-time processing capability with a non-trivial area overhead; a direct index compression format, in conjunction with a digital sparse matrix-vector multiplication (SpMV) accelerator, supports the various matrices of practical applications and alleviates computing resource requirements; and custom NMC instructions and a stream converter for NMC systems dynamically adjust the available hardware resources for better utilization. Experimental results demonstrate that the memory bandwidth of STT-MRAM achieves 26.7 GB/s. The energy consumption and latency improvements of the digital SpMV accelerator are up to 64x and 1,120x, respectively, across matrices with sparsity spanning from 10% to 99.8%. Single-precision and double-precision element transmission increased by up to 8x and 9.6x, respectively. Furthermore, our design achieves up to 15.9x higher throughput than state-of-the-art designs.
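The abstract does not spell out the direct index compression format; as an illustrative stand-in, the sketch below shows sparse matrix-vector multiplication over a CSR-style compressed index representation, which is the general style of computation such an SpMV accelerator performs in hardware.

```python
# Illustrative software model of sparse matrix-vector multiplication (SpMV)
# over a compressed index format (CSR shown here as a stand-in for the
# paper's direct index compression format).
import numpy as np

def spmv_csr(values, col_idx, row_ptr, x):
    """Compute y = A @ x where A is stored in CSR form."""
    n_rows = len(row_ptr) - 1
    y = np.zeros(n_rows, dtype=values.dtype)
    for row in range(n_rows):
        start, end = row_ptr[row], row_ptr[row + 1]
        # Only the nonzero entries of this row are visited.
        y[row] = np.dot(values[start:end], x[col_idx[start:end]])
    return y

# Small example: a 3x3 matrix with 4 nonzeros.
values  = np.array([10.0, 20.0, 30.0, 40.0])
col_idx = np.array([0, 2, 1, 2])
row_ptr = np.array([0, 2, 3, 4])  # rows: [10, 0, 20], [0, 30, 0], [0, 0, 40]
x = np.array([1.0, 2.0, 3.0])
print(spmv_csr(values, col_idx, row_ptr, x))  # [ 70.  60. 120.]
```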
Voltage scaling is one of the most promising approaches for improving energy efficiency, but it also brings challenges to fully guaranteeing stable operation in modern VLSI. To tackle such issues, we extend DependableHD to a second version, DependableHDv2, a HyperDimensional Computing (HDC) system that can tolerate bit-level memory failures in the low-voltage region with high robustness. DependableHDv2 introduces the concept of margin enhancement for model retraining and utilizes noise injection to improve robustness; these techniques are applicable to most state-of-the-art HDC algorithms. We additionally propose a dimension-swapping technique, which aims at handling the stuck-at errors induced by aggressive voltage scaling in the memory cells. Our experiments show that under 8% memory stuck-at errors, DependableHDv2 exhibits a 2.42% accuracy loss on average, a 14.1x robustness improvement compared to the baseline HDC solution. The hardware evaluation shows that DependableHDv2 allows the system to reduce the supply voltage from 430 mV to 340 mV for both the Item Memory and the Associative Memory, providing a 41.8% reduction in energy consumption while maintaining competitive accuracy.
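Neither the margin-enhancement rule nor the exact noise model is given in the abstract; the sketch below only illustrates the general idea of injecting bit-level noise into bipolar hyperdimensional class vectors during retraining so that the learned model tolerates memory bit errors. The encoder, noise rate, and update rule are illustrative, not DependableHDv2's exact algorithm.

```python
# Hypothetical sketch of noise-injected retraining for a bipolar HDC
# classifier; all hyperparameters and the toy data set are placeholders.
import numpy as np

rng = np.random.default_rng(0)
DIM, N_FEATURES, N_CLASSES = 2048, 16, 2
PROJECTION = rng.standard_normal((DIM, N_FEATURES))

def encode(sample):
    # Toy encoder: random projection followed by sign (bipolar hypervector).
    return np.sign(PROJECTION @ sample)

def inject_noise(model, flip_prob):
    # Flip each dimension with probability flip_prob to simulate bit errors.
    flips = rng.random(model.shape) < flip_prob
    return np.where(flips, -model, model)

# Toy linearly separable data set.
X = rng.standard_normal((200, N_FEATURES))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Initial class hypervectors: bundle (sum) the encoded samples per class.
classes = np.zeros((N_CLASSES, DIM))
for xi, yi in zip(X, y):
    classes[yi] += encode(xi)

# Retraining with noise injection: the similarity check runs against a
# noise-corrupted copy of the class memory, so updates favor class vectors
# that remain separable under bit-level errors.
for _ in range(10):
    noisy = inject_noise(np.sign(classes), flip_prob=0.05)
    for xi, yi in zip(X, y):
        hv = encode(xi)
        pred = int(np.argmax(noisy @ hv))
        if pred != yi:
            classes[yi] += hv
            classes[pred] -= hv

preds = [int(np.argmax(np.sign(classes) @ encode(xi))) for xi in X]
print("training accuracy:", float(np.mean(np.array(preds) == y)))
```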
Remote IoT devices face significant security risks due to their inherent physical vulnerability. An adversarial actor with sufficient capability can monitor the devices or exfiltrate data to access sensitive information. Remotely deployed devices such as sensors need enhanced resilience against memory leakage if they perform privileged tasks. To increase the security and trust of these devices, we present a novel framework implementing a privacy homomorphism that creates sensor data directly in an encoded format. The sensor data is permuted at the time of creation in a manner that appears random to an observer. A separate secure server in communication with the device provides the information necessary for the device to perform processing on the encoded data, but does not allow decoding of the result. The device transmits the encoded results to the secure server, which retains the ability to interpret them. In this article, we show how this framework works for an image sensor that calculates differences between a stream of images, with initial results showing a throughput overhead as low as 266% compared to computing on standard unencoded numbers such as two's complement. We further show a 5,000x speedup over a recent homomorphic encryption ASIC.
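The abstract does not specify the exact encoding; the sketch below only illustrates the permutation-style idea behind such a scheme: because element-wise subtraction commutes with a fixed secret permutation, the device can difference encoded frames while only the server, which holds the inverse permutation, can interpret the result.

```python
# Hypothetical sketch: permutation-encoded image differencing. The device
# applies only the forward permutation at data creation; the secure server
# holds the inverse and can decode results. This illustrates the general
# commuting-with-permutation idea, not the paper's exact scheme.
import numpy as np

rng = np.random.default_rng(42)
H, W = 4, 4
perm = rng.permutation(H * W)   # secret permutation used by the sensor
inv_perm = np.argsort(perm)     # kept only by the secure server

def encode(frame):
    # Device side: permute pixels at the moment of capture.
    return frame.reshape(-1)[perm]

def decode(encoded):
    # Server side: undo the permutation to interpret the result.
    return encoded[inv_perm].reshape(H, W)

frame_t0 = rng.integers(0, 256, size=(H, W)).astype(np.int16)
frame_t1 = rng.integers(0, 256, size=(H, W)).astype(np.int16)

# The device computes the difference directly on encoded frames; an observer
# of device memory sees only permuted pixel values and permuted differences.
encoded_diff = encode(frame_t1) - encode(frame_t0)

# The server decodes and recovers exactly the plain-domain difference.
assert np.array_equal(decode(encoded_diff), frame_t1 - frame_t0)
print(decode(encoded_diff))
```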