ISBN (print): 9783031497360
The proceedings contain 12 papers. The special focus of this conference is on Verification and Evaluation of Computer and Communication Systems. The topics include: Blockchain-Based Trust Management for IoMT Environment; Command & Control in UAVs Fleets: Coordinating Drones for Ground Missions in Changing Contexts; Verified High-Performance Computing: The SyDPaCC Approach; A QoE-Driven DRL Approach for Network Slicing Based on SFC Orchestration in SDN/NFV Enabled Networks; On Language-Based Opacity Verification Problem in Discrete Event Systems Under Orwellian Observation; An Enhanced Interface-Based Probabilistic Compositional Verification Approach; A Sound Abstraction Method Towards Efficient Neural Networks Verification; Towards Formal Verification of Node-RED-Based IoT Applications; Formal Verification of a Post-Quantum Signal Protocol with Tamarin; A Comparative Study of Online Cybersecurity Training Platforms.
ISBN (print): 9798350393613
Computing-in-Memory (CIM) is an emerging non-von Neumann computing architecture that enhances energy efficiency in AI tasks. Current-domain CIM is a common kind of design with higher potential for energy efficiency than digital-domain CIM. However, due to its fully analog design, current-domain CIM is susceptible to analog non-idealities that can introduce computational errors, thereby impacting the inference accuracy of neural networks. This paper provides a detailed analysis of the non-idealities and models current-domain CIM with these non-idealities taken into account. The model is then used to conduct design space exploration on a current-domain CIM design. To validate the model, we compare the energy efficiency predicted by the model with the measured energy efficiency of a 28nm test chip and find a high degree of agreement between them. The simulation results show that, when the tolerable maximum relative computation error is set to 0.01 and the goal is to maintain computation accuracy above 80%, parallelism higher than 7, 11, and 17 is required for 5-bit, 6-bit, and 7-bit analog-to-digital converters (ADCs), respectively, while a current-to-digital converter (CDC) requires lower parallelism.
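As a rough illustration of the kind of design-space sweep described above, the following Python sketch searches for the smallest MAC parallelism that keeps the mean relative computation error of an analog accumulation below a tolerance, for several ADC resolutions. The error model (fixed Gaussian current noise plus uniform ADC quantization over a full scale equal to the parallelism) and all parameter values are illustrative assumptions, not the paper's calibrated non-ideality model, so the resulting numbers will differ from those reported.

```python
import numpy as np

def mean_relative_error(parallelism, adc_bits, noise_sigma=0.05,
                        trials=2000, seed=0):
    """Monte Carlo estimate of the relative error of one analog MAC.

    Assumed toy model: binary activations and weights are summed in the
    analog domain, disturbed by additive Gaussian noise, then quantized
    by an ADC whose full scale equals the parallelism.
    """
    rng = np.random.default_rng(seed)
    x = rng.integers(0, 2, size=(trials, parallelism))
    w = rng.integers(0, 2, size=(trials, parallelism))
    exact = (x * w).sum(axis=1).astype(float)
    noisy = exact + rng.normal(0.0, noise_sigma, size=trials)
    lsb = parallelism / 2 ** adc_bits                 # ADC quantization step
    readout = np.round(noisy / lsb) * lsb
    nonzero = exact > 0
    return np.mean(np.abs(readout[nonzero] - exact[nonzero]) / exact[nonzero])

TOLERANCE = 0.01
for bits in (5, 6, 7):
    for p in range(2, 65):
        if mean_relative_error(p, bits) <= TOLERANCE:
            print(f"{bits}-bit ADC: smallest parallelism meeting the budget = {p}")
            break
    else:
        print(f"{bits}-bit ADC: no parallelism up to 64 meets the budget "
              "under this toy error model")
```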
Providing a high-quality performance prediction has the potential to enhance various aspects of a cluster, such as devising scheduling and provisioning policies, guiding procurement decisions, suggesting candidate app...
A highly energy-efficient Computing-in-Memory (CIM) processor for Ternary Neural Network (TNN) acceleration is proposed in this brief. Previous CIM processors for multi-bit precision neural networks showed low energy efficiency and throughput. Lightweight binary neural networks were accelerated with CIM processors for high energy efficiency but showed poor inference accuracy. In addition, most previous works suffered from poor linearity of analog computing and energy-consuming analog-to-digital conversion. To resolve these issues, we propose a Ternary-CIM (T-CIM) processor with a 16T1C ternary bitcell for good linearity within a compact area, and a charge-based partial-sum adder circuit that removes the analog-to-digital conversion which consumes a large portion of the system energy. Furthermore, flexible data mapping enables execution of whole convolution layers with a smaller bitcell memory capacity. Designed in 65 nm CMOS technology, the proposed T-CIM achieves 1,316 GOPS of peak performance and 823 TOPS/W of energy efficiency.
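To make the ternary arithmetic concrete, here is a small NumPy sketch showing how a dot product with weights restricted to {-1, 0, +1} collapses into additions and subtractions, which is the property a ternary CIM design can exploit with charge-based partial-sum accumulation. The threshold-based ternarization, the function names, and the parameter values are illustrative assumptions, not the paper's training or circuit scheme.

```python
import numpy as np

def ternarize(w, threshold=0.05):
    """Map full-precision weights to {-1, 0, +1} by a simple threshold."""
    t = np.zeros_like(w, dtype=np.int8)
    t[w > threshold] = 1
    t[w < -threshold] = -1
    return t

def ternary_dot(activations, ternary_weights):
    """With ternary weights, a dot product reduces to adds and subtracts."""
    pos = activations[ternary_weights == 1].sum()
    neg = activations[ternary_weights == -1].sum()
    return pos - neg   # zero-valued weights contribute nothing

rng = np.random.default_rng(0)
a = rng.integers(0, 16, size=256)        # e.g. 4-bit activations
w = rng.normal(0, 0.1, size=256)         # full-precision weights
tw = ternarize(w)
print(ternary_dot(a, tw), a @ tw)        # both paths give the same result
```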
Since modern high-performance computing systems are evolving towards diverse and heterogeneous architectures, the emergence of high-level portable programming models leads to a particular focus on performance portabil...
ISBN (print): 9798400700958
Tensor-train (TT) decomposition enables ultra-high compression ratios, making deep neural network (DNN) accelerators based on this method very attractive. TIE, the state-of-the-art TT-based DNN accelerator, achieved high performance by leveraging a compact inference scheme to remove unnecessary computations and memory accesses. However, TIE increases memory costs for stage-wise intermediate results and adds intra-layer data transfer, leading to limited speedups even when the models are highly compressed. To unleash the full potential of TT decomposition, this paper proposes ETTE, an algorithm and hardware co-optimization framework for an Efficient Tensor-Train Engine. At the algorithm level, ETTE proposes a new tensor core construction and computation ordering mechanism to reduce stage-wise computation and storage costs at the same time. At the hardware level, ETTE proposes a lookahead-style across-stage processing scheme to eliminate unnecessary stage-wise data movement. By fully leveraging the decoupled input and output dimension factors, ETTE develops an efficient, low-cost, memory partition-free access scheme to support the desired matrix transformation. We demonstrate the effectiveness of ETTE by implementing a 16-PE hardware prototype in 28nm CMOS technology. Compared with a GPU on various workloads, ETTE achieves 6.5x - 253.1x higher throughput and 189.2x - 9750.5x higher energy efficiency. Compared with state-of-the-art DNN accelerators, ETTE brings 1.1x - 58.3x, 2.6x - 1170.4x, and 1.8x - 2098.2x improvements in throughput, energy efficiency, and area efficiency, respectively.
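For background on why tensor-train storage is so compact, the sketch below implements plain TT-SVD in NumPy: a weight tensor is factored into a chain of small cores by sequential truncated SVDs, and the cores can be contracted back to check the approximation. This is the generic decomposition only; ETTE's tensor core construction, computation ordering, and across-stage processing are not reproduced here, and the function names, shapes, and rank bound are illustrative choices.

```python
import numpy as np

def tt_svd(tensor, max_rank):
    """Factor `tensor` into tensor-train cores via sequential truncated SVDs."""
    shape = tensor.shape
    cores, rank, mat = [], 1, tensor
    for k in range(len(shape) - 1):
        mat = mat.reshape(rank * shape[k], -1)
        u, s, vt = np.linalg.svd(mat, full_matrices=False)
        r = min(max_rank, len(s))
        cores.append(u[:, :r].reshape(rank, shape[k], r))
        mat = np.diag(s[:r]) @ vt[:r]
        rank = r
    cores.append(mat.reshape(rank, shape[-1], 1))
    return cores

def tt_contract(cores):
    """Rebuild the full tensor from its cores (only used to check the error)."""
    out = cores[0]
    for core in cores[1:]:
        out = np.tensordot(out, core, axes=([-1], [0]))
    return out.squeeze(axis=(0, -1))

# Example: a 64x64 weight matrix viewed as an 8x8x8x8 tensor.
rng = np.random.default_rng(0)
w = rng.normal(size=(8, 8, 8, 8))
cores = tt_svd(w, max_rank=16)
approx = tt_contract(cores)
err = np.linalg.norm(approx - w) / np.linalg.norm(w)
# Random data compresses poorly; trained weight tensors usually tolerate
# much smaller ranks, which is where the ultra-high compression comes from.
print(f"params: {w.size} -> {sum(c.size for c in cores)}, relative error {err:.3f}")
```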
ISBN (digital): 9781665451550
ISBN (print): 9781665451550
Development of job scheduling algorithms, which directly influence high-performance computing (HPC) cluster performance, is hindered because popular scheduling quality metrics, such as Bounded Slowdown, correlate poorly with global scheduling objectives that include job packing efficiency and fairness. This report proposes Area Weighted Response Time, a metric that offers an unbiased representation of job packing efficiency, and presents a class of new metrics, Priority Weighted Specific Response Time, that assess both packing efficiency and fairness of schedules. Examples of simulated scheduling of real workload traces, analyzed with these metrics alongside conventional ones, demonstrate that although Bounded Slowdown can be readily improved by modifying the standard First Come First Served backfilling algorithm and by using existing techniques for estimating job runtime, these improvements are accompanied by significant degradation of job packing efficiency and fairness. In contrast, improving job packing efficiency and fairness over the standard backfilling algorithm, which is designed to target those objectives, is difficult: it requires further algorithm development and more accurate runtime estimation techniques that reduce the frequency of underpredictions.
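As a sketch of how such a metric might be computed from a schedule, the snippet below averages job response times with each job weighted by its resource area (allocated cores times runtime), so that large jobs, which dominate packing, dominate the metric. This reading of "Area Weighted Response Time" is inferred from the metric's name and is an assumption; the report's exact definitions of it and of Priority Weighted Specific Response Time may differ.

```python
from dataclasses import dataclass

@dataclass
class Job:
    submit: float   # submission time
    start: float    # time the scheduler started the job
    runtime: float  # actual runtime
    cores: int      # cores allocated

def area_weighted_response_time(jobs):
    """Mean response time, with each job weighted by its area (cores * runtime)."""
    total_area = sum(j.cores * j.runtime for j in jobs)
    weighted_sum = sum(j.cores * j.runtime * (j.start + j.runtime - j.submit)
                       for j in jobs)
    return weighted_sum / total_area

# Tiny example trace: the wide, long job dominates the weighted average.
trace = [Job(submit=0, start=0, runtime=100, cores=4),
         Job(submit=0, start=50, runtime=10, cores=1),
         Job(submit=10, start=120, runtime=200, cores=16)]
print(area_weighted_response_time(trace))
```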
The Nested Neutral Point Clamped (NNPC) converter, functioning as a Voltage Source Converter (VSC), provides an effective solution for applications requiring Medium-Voltage and High-Power (MVHP). Earlier implementatio...
ISBN (print): 9798350393613
As a new information technology, edge computing has attracted much attention in recent years. Edge computing collects, stores, and processes data on edge devices, such as smartphones and sensors. Edge devices can process data in real time by reducing communication time with central servers. Therefore, various preprocessing algorithms should be executed on edge devices to provide services rapidly. Graph algorithms are one such candidate, because graph data play an important role in representing a variety of information around us, such as maps, social networks, and web structures, and are among the data collected and processed on edge devices. However, since the execution time of graph algorithms varies greatly depending on the graph data, preprocessing on edge devices may take a long time. It is therefore necessary to select an appropriate algorithm and process the data on edge devices as fast as possible. This paper proposes a method to select the appropriate graph algorithm on edge devices. Using machine learning that takes features of the graph data as input, the performance of each graph algorithm, such as its execution time, is predicted; the proposed method then selects a suitable algorithm for the requests of edge users. The evaluation results demonstrate that the proposed method can select the appropriate algorithm from several candidates depending on the characteristics of the graph data.
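The selection idea can be sketched as one runtime-prediction model per candidate algorithm, trained on graph features, with the lowest predicted execution time deciding the choice. In the sketch below, the feature set, the synthetic training data, and the regressor are illustrative assumptions rather than the paper's setup.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

def make_features(n_graphs):
    """Features per graph: [num_nodes, num_edges, avg_degree, density]."""
    nodes = rng.integers(100, 100_000, n_graphs)
    edges = rng.integers(nodes, nodes * 20)
    return np.column_stack([nodes, edges, 2 * edges / nodes,
                            2 * edges / (nodes * (nodes - 1))])

X = make_features(500)
# Synthetic "measured" runtimes for two hypothetical candidate algorithms.
runtimes = {
    "bfs_based":    X[:, 1] * 1e-6 + rng.normal(0, 0.01, len(X)),
    "matrix_based": X[:, 0] ** 1.5 * 1e-8 + rng.normal(0, 0.01, len(X)),
}

# One regression model per algorithm, predicting runtime from graph features.
models = {name: RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
          for name, y in runtimes.items()}

def select_algorithm(graph_features):
    """Pick the algorithm with the lowest predicted execution time."""
    preds = {name: m.predict([graph_features])[0] for name, m in models.items()}
    return min(preds, key=preds.get), preds

print(select_algorithm(make_features(1)[0]))
```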
ISBN (print): 9781665420273
Reservoir computing is a nascent sub-field of machine learning that relies on the recurrent multiplication of a very large, sparse, fixed matrix. We argue that direct spatial implementation of these fixed matrices minimizes the work performed in the computation and allows significant reductions in latency and power through constant propagation and logic minimization. Bit-serial arithmetic enables massive static matrices to be implemented. We present the structure of our bit-serial matrix multiplier and evaluate the use of canonical signed digit representation to further reduce logic utilization. We have implemented these matrices on a large FPGA and provide a cost model that is simple and extensible. These FPGA implementations reduce latency by 50x on average, and by up to 86x, versus GPU libraries. Compared against a recent sparse DNN accelerator, we measure a 4.1x to 47x reduction in latency depending on matrix dimension and sparsity.
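To illustrate why canonical signed digit (CSD) recoding helps, the short sketch below converts integer constants to CSD form and counts non-zero digits; in a hardwired constant multiplier each non-zero digit costs roughly one adder or subtractor, so fewer non-zero digits means less logic. This standalone recoding example is not the paper's bit-serial multiplier design, and the function names and sample constants are illustrative.

```python
def to_csd(n):
    """Canonical signed-digit encoding of a non-negative integer.

    Returns digits in {-1, 0, +1}, least-significant first, with no two
    adjacent non-zero digits.
    """
    digits = []
    while n != 0:
        if n % 2 == 0:
            d = 0
        else:
            d = 2 - (n % 4)   # +1 when n % 4 == 1, -1 when n % 4 == 3
            n -= d
        digits.append(d)
        n //= 2
    return digits

def nonzero_digits(digits):
    return sum(d != 0 for d in digits)

# Each non-zero digit of a fixed coefficient maps to one add/subtract in a
# hardwired multiplier, so CSD form often needs fewer adders than binary.
for value in (7, 23, 119, 0b101110111):
    csd = to_csd(value)
    assert sum(d << i for i, d in enumerate(csd)) == value   # round-trip check
    print(value, "binary ones:", bin(value).count("1"),
          "CSD non-zeros:", nonzero_digits(csd))
```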