ISBN (Digital): 9781728197104
ISBN (Print): 9781728197111
CNN is a popular deep learning structure that can provide intelligent processing in IoT applications. Instead of deploying resource-hungry CNN inference workloads on the cloud, it is promising to utilize local IoT devices for in-situ processing. Since a single IoT device has only limited resources available, distributing the workload over multiple local devices becomes a potential solution, especially for high-accuracy and time-sensitive tasks. However, it is non-trivial to distribute the inference of existing CNN models efficiently, as they are inherently tightly-coupled structures. In this paper, we propose a distributed in-situ CNN inference system for IoT applications built on the loosely-coupled CNN structure (LCS), synchronization-oriented partitioning (SOP), and decentralized asynchronous communication (DAC). LCS is based on two novel design ideas, the homogeneous group and the intermittent shuffle. Experiments on ImageNet classification show that LCS achieves the highest accuracy among competing structures under a given computation budget. SOP and DAC aim to convert the loosely-coupled nature of LCS into practical performance improvement: SOP partitions LCS with fewer synchronization points, and DAC reduces communication overhead by overlapping communications. When the number of IoT devices increases from 1 to 4, our system accelerates inference by up to 3.85× and reduces the memory footprint on each device by 70%, outperforming other approaches.
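The homogeneous-group and intermittent-shuffle ideas above can be sketched minimally: feature channels are partitioned into independent per-device groups, and an occasional shuffle interleaves channels across groups so information can still mix. All function names and the round-robin shuffle policy here are illustrative assumptions, not the paper's actual design.

```python
# Sketch (assumed policy): channels split into equal "homogeneous groups",
# one per device; a shuffle step interleaves them round-robin across groups.

def split_into_groups(channels, num_groups):
    """Evenly partition a channel list into homogeneous groups."""
    size = len(channels) // num_groups
    return [channels[i * size:(i + 1) * size] for i in range(num_groups)]

def intermittent_shuffle(groups):
    """One shuffle step: round-robin interleave channels across groups."""
    flat = [c for g in groups for c in g]
    num_groups = len(groups)
    return [flat[i::num_groups] for i in range(num_groups)]

channels = list(range(8))                  # 8 feature channels
groups = split_into_groups(channels, 2)    # [[0, 1, 2, 3], [4, 5, 6, 7]]
mixed = intermittent_shuffle(groups)       # [[0, 2, 4, 6], [1, 3, 5, 7]]
```

Between shuffle points, each group can run on its own device without synchronization, which is what SOP exploits when placing partition boundaries.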
Monitoring the running state of high-performance computing (HPC) systems is an important task. Small-scale cluster systems exist separately as one whole system or as part of a large-scale HPC system. Such cluster syst...
We adopted K-means clustering to efficiently partition the subcarriers and reduce the complexity of PS-QAM in an FBMC/OQAM system using a KK receiver. A net data rate of 100 Gb/s is achieved after 125 km transmission. ...
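The subcarrier-partitioning step described above can be illustrated with a plain one-dimensional K-means over a per-subcarrier quality metric (SNR is assumed here; the numbers and the two-cluster choice are made up for demonstration and are not from the paper).

```python
# Toy 1-D K-means: group subcarriers by SNR so each cluster can be
# assigned one modulation setting instead of per-subcarrier optimization.

def kmeans_1d(values, k, iters=20):
    # spread the initial centers across the sorted value range
    centers = sorted(values)[::max(1, len(values) // k)][:k]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in values:
            idx = min(range(k), key=lambda i: abs(v - centers[i]))
            clusters[idx].append(v)
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

snr = [22.1, 21.8, 14.9, 15.3, 22.5, 15.0]   # assumed per-subcarrier SNRs (dB)
centers, clusters = kmeans_1d(snr, k=2)      # low-SNR vs high-SNR groups
```

Each resulting cluster would then share one PS-QAM configuration, which is the complexity reduction the abstract refers to.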
In the big data era, users can obtain massive amounts of information from the Internet, but its value density is very low. To help users find the information they need more quickly, this paper presents the mechanism of div...
ISBN (Print): 9783981926323
Byte-addressable, non-volatile memory (NVRAM) combines the benefits of DRAM and flash memory. Its slower speed compared to DRAM, however, makes it hard to replace DRAM entirely with NVRAM. Hybrid NVRAM systems that equip both DRAM and NVRAM on the memory bus are a better solution: frequently accessed, hot pages can be stored in DRAM while other cold pages reside in NVRAM. This way, the system gets the benefits of both: high performance from DRAM, and lower power consumption and better cost/performance from NVRAM. Realizing an efficient hybrid NVRAM system requires careful page migration and accurate data temperature measurement. Existing solutions, however, often cause invalid migrations due to inaccurate data temperature accounting, because hot and cold pages are identified separately in the DRAM and NVRAM regions. Based on this observation, we propose UIMigrate, an adaptive data migration approach for hybrid NVRAM systems. The key idea is to consider data temperature across the whole DRAM-NVRAM space when determining whether a page should be migrated between DRAM and NVRAM. In addition, UIMigrate adapts to workload changes by dynamically adjusting its migration decisions. Our experiments using SPEC 2006 show that UIMigrate reduces the number of migrations and improves performance by up to 90.4% compared to existing state-of-the-art approaches.
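The unified-temperature idea can be sketched as a toy model: one access-count table covers both regions, and a page is promoted to DRAM only if it is hotter than the coldest page currently in DRAM. The class, thresholds, and promotion rule are illustrative assumptions, not UIMigrate's actual mechanism.

```python
# Toy hybrid-memory model: a single heat table spans DRAM and NVRAM,
# so migration decisions compare temperatures across the whole space.

class HybridMemory:
    def __init__(self, dram_capacity):
        self.dram_capacity = dram_capacity
        self.dram, self.nvram = set(), set()
        self.heat = {}                       # one counter table for both regions

    def access(self, page):
        self.heat[page] = self.heat.get(page, 0) + 1
        if page not in self.dram and page not in self.nvram:
            self.nvram.add(page)             # new pages start in NVRAM
        self.maybe_migrate(page)

    def maybe_migrate(self, page):
        if page in self.dram:
            return
        if len(self.dram) < self.dram_capacity:
            self.nvram.discard(page)
            self.dram.add(page)
            return
        coldest = min(self.dram, key=lambda p: self.heat[p])
        # promote only if hotter than the coldest DRAM-resident page
        if self.heat[page] > self.heat[coldest]:
            self.dram.remove(coldest)
            self.nvram.add(coldest)
            self.nvram.discard(page)
            self.dram.add(page)
```

Because both regions share one heat table, a page is never promoted merely for being the hottest within NVRAM, which is the kind of invalid migration the abstract criticizes.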
GPGPU (General-Purpose Computing on Graphics Processing Units) has been widely applied to high-performance computing. However, GPU architectures and programming models differ from those of traditional CPUs, so it is challenging to develop efficient GPU applications. This paper focuses on key techniques of programming models and compiler optimization for many-core GPUs, and addresses a number of key theoretical and technical issues. It proposes a many-threaded programming model, ab-Stream, which abstracts away architectural differences and is easy to parallelize, program, extend, and tune. In addition, it proposes memory optimization and data transfer transformation according to data classification. First, it proposes data layout pruning based on memory classification, and then proposes TaT (Transfer after Transformation) for transferring strided data between CPU and GPU. Experimental results demonstrate that the proposed techniques significantly improve performance for GPGPU applications.
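The transfer-after-transformation idea can be illustrated generically: instead of issuing many small copies of strided elements, pack them into a contiguous buffer on the host and transfer that buffer once. The packing helpers below are a hedged sketch of this general technique, not the paper's implementation.

```python
# Generic strided gather/scatter: pack strided elements contiguously
# before a bulk transfer, then scatter them back on the other side.

def pack_strided(buf, offset, stride, count):
    """Gather `count` elements starting at `offset`, `stride` apart."""
    return [buf[offset + i * stride] for i in range(count)]

def unpack_strided(packed, buf, offset, stride):
    """Scatter a contiguous buffer back into strided positions."""
    for i, v in enumerate(packed):
        buf[offset + i * stride] = v

host = list(range(12))
packed = pack_strided(host, offset=1, stride=3, count=4)   # [1, 4, 7, 10]
# `packed` would now be sent to the device in a single contiguous transfer.
```

A single contiguous copy amortizes per-transfer overhead that would otherwise be paid once per strided element.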
ISBN (Digital): 9781728199986
ISBN (Print): 9781728199993
We present OMPRACER, a static tool that uses flow-sensitive, interprocedural analysis to detect data races in OpenMP programs. OMPRACER is fast, scalable, has high code coverage, and supports the most common OpenMP features by combining state-of-the-art pointer analysis, novel value-flow analysis, happens-before tracking, and generalized modelling of OpenMP APIs. Unlike the dynamic tools that currently dominate data race detection, OMPRACER achieves almost 100% code coverage using static analysis, detecting a broader category of races without running the program or relying on specific inputs or runtime behaviour. OMPRACER has precision competitive with dynamic tools like Archer and ROMP, passing 105/116 cases in DataRaceBench with a total accuracy of 91%. OMPRACER has been used to analyze several Exascale Computing Project proxy applications containing over 2 million lines of code in under 10 minutes, and has revealed previously unknown races in an ECP proxy app and a production simulation for COVID-19.
Reverse engineering of binary executables is a critical problem in the computer security domain. On the one hand, malicious parties may recover interpretable source code from software products to gain commercial advantages. On the other hand, binary decompilation can be leveraged for code vulnerability analysis and malware detection. However, efficient binary decompilation is challenging. Conventional decompilers have the following major limitations: (i) they are only applicable to a specific source-target language pair, incurring undesired development cost for new language tasks; (ii) their output high-level code cannot effectively preserve the correct functionality of the input binary; (iii) their output program does not capture the semantics of the input, and the reversed program is hard to interpret. To address these problems, we propose Coda(1), the first end-to-end neural-based framework for code decompilation. Coda decomposes the decompilation task into two key phases. First, Coda employs an instruction type-aware encoder and a tree decoder to generate an abstract syntax tree (AST), with attention feeding, during the code sketch generation stage. Second, Coda updates the code sketch using an iterative error correction machine guided by an ensembled neural error predictor. By finding a good approximate candidate and then fixing it towards correctness, Coda achieves superior performance compared to baseline approaches. We assess Coda's performance with extensive experiments on various benchmarks. Evaluation results show that Coda achieves an average of 82% program recovery accuracy on unseen binary samples, whereas state-of-the-art decompilers yield 0% accuracy. Furthermore, Coda outperforms a sequence-to-sequence model with attention by a margin of 70% program accuracy. Our work reveals the vulnerability of binary executables and poses a new threat to the protection of Intellectual Property (IP) in software development.
This paper proposes and discusses distributed processor load balancing algorithms based on the nature-inspired approach of multi-objective Extremal Optimization. Extremal Optimization is used for defining task mi...
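A minimal Extremal Optimization sketch in the spirit of the abstract: repeatedly select the worst-loaded processor (the "worst component" in EO terms) and migrate one of its tasks to another processor. The task sizes, single load objective, and uniform-random migration target are assumptions for illustration; the paper's multi-objective formulation is more elaborate.

```python
import random

# Toy EO-style load balancer: always mutate the worst-loaded processor
# by migrating one of its tasks elsewhere.

def eo_balance(task_sizes, num_procs, steps=200, seed=0):
    rng = random.Random(seed)
    assign = [i % num_procs for i in range(len(task_sizes))]  # initial placement

    def loads():
        l = [0] * num_procs
        for task, proc in enumerate(assign):
            l[proc] += task_sizes[task]
        return l

    for _ in range(steps):
        l = loads()
        worst = max(range(num_procs), key=lambda p: l[p])     # worst component
        movable = [t for t, p in enumerate(assign) if p == worst]
        if not movable:
            continue
        task = rng.choice(movable)                            # mutate it
        assign[task] = rng.choice([p for p in range(num_procs) if p != worst])
    return assign, loads()
```

Unlike greedy schemes, EO accepts non-improving moves, which helps the search escape locally balanced but globally poor assignments.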