Data prefetching is a widely used technique to alleviate the "memory wall" problem by fetching, in advance, data that may be touched in the near future. Generally, data prefetching is classified into hardware p...
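The preview above describes hardware data prefetching. A minimal sketch of one classic hardware scheme, a stride prefetcher, is shown below; the table layout and names are illustrative assumptions, not taken from the paper.

```python
class StridePrefetcher:
    """Tracks the last address and stride per load PC; predicts the
    next address once the same stride is observed twice in a row."""

    def __init__(self):
        self.table = {}  # pc -> (last_addr, last_stride)

    def access(self, pc, addr):
        prefetch = None
        if pc in self.table:
            last_addr, last_stride = self.table[pc]
            stride = addr - last_addr
            if stride == last_stride and stride != 0:
                prefetch = addr + stride  # predicted next access
            self.table[pc] = (addr, stride)
        else:
            self.table[pc] = (addr, 0)
        return prefetch
```

For a load walking an array with an 8-byte stride, the third access trains the entry and triggers a prefetch of the next element.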
This paper explores the problem of boundary-data classification ambiguity that arises when machine learning techniques are applied in the field of intrusion detection. The features and attributes of the boundary data ...
With the rapid development of neural networks and deep learning, speech synthesis technology has improved significantly. The end-to-end speech synthesis systems based on deep learning have been able to synthesize...
Multi-objective neural architecture search (NAS) algorithms aim to automatically search for neural architectures suited to different computing-power platforms by using multi-objective optimization methods. The LEMON...
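Multi-objective NAS, as described above, ranks candidate architectures by several competing objectives at once. A minimal sketch of the underlying idea, selecting the Pareto-optimal (non-dominated) set under two objectives, is given below; the objectives (accuracy up, latency down) are illustrative assumptions, not details of the LEMON work.

```python
def pareto_front(candidates):
    """candidates: list of (accuracy, latency) pairs, where higher
    accuracy and lower latency are better. Returns the set of
    candidates not dominated by any other candidate."""
    front = []
    for a in candidates:
        dominated = any(
            b != a and b[0] >= a[0] and b[1] <= a[1]
            for b in candidates
        )
        if not dominated:
            front.append(a)
    return front
```

A search loop would keep only this front as the archive of architectures worth deploying across different platforms.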
This research proposes a novel strategy for addressing the limitations of centralized architectures in IoT data processing. Traditional systems experience significant bandwidth use, privacy difficulties, and scalabili...
ISBN: (Print) 9798350327038
This paper addresses the challenges of voltage-sensing read operations on a PRAM-based 1S1R crossbar array, which can be used for MAC operations in processing-in-memory architectures. The nonlinearity of the readout voltage due to the parallel resistance of the accessed cells leads to a narrow sensing margin. Moreover, the SAR ADC widely used in readout circuits for area and power efficiency leads to high latency. To overcome these challenges, we introduce active feedback using a Gilbert multiplier in the bitline (BL) structure to regulate the resistance of the BL transmission gate, and an input-aware SAR logic to optimize the conversion time. The proposed macro design in a 65nm process achieves a 3.79x voltage-sensing margin with the Gilbert multiplier under a 3x3 kernel convolution operation. Furthermore, a 6-bit input-aware SAR ADC reduces average latency from 6 to 4.4 clock cycles.
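The latency reduction above comes from input-aware SAR logic that skips conversion cycles. A simplified behavioral model is sketched below, assuming the input awareness takes the form of a coarse pre-estimate that resolves the leading bits in advance; the actual circuit technique is not specified here.

```python
def sar_convert(vin, vref=1.0, bits=6, known_msbs=0, msb_hint=0):
    """Successive-approximation conversion: one comparison cycle per
    resolved bit. If `known_msbs` leading bits are supplied by an
    input-aware pre-estimate (`msb_hint`), those cycles are skipped."""
    code = msb_hint << (bits - known_msbs)
    cycles = 0
    for i in range(bits - known_msbs - 1, -1, -1):
        trial = code | (1 << i)          # tentatively set next bit
        cycles += 1                      # one comparison per bit
        if vin >= (trial / (1 << bits)) * vref:
            code = trial                 # keep the bit if DAC <= input
    return code, cycles
```

With a correct 2-bit pre-estimate, a 6-bit conversion finishes in 4 cycles instead of 6, matching the direction (not the exact mechanism) of the 6-to-4.4-cycle average reported above.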
ISBN: (Print) 9781665469586
In stream processing, data arrives constantly and is often unpredictable. It can show large fluctuations in arrival frequency, size, complexity, and other factors. These fluctuations can strongly impact application latency and throughput, which are critical factors in this domain. Therefore, there is a significant amount of research on self-adaptive techniques involving elasticity or micro-batching as a way to mitigate this impact. However, there is a lack of benchmarks and tools to help researchers investigate the implications of micro-batching and data stream frequency. In this paper, we extend a benchmarking framework to support dynamic micro-batching and data stream frequency management. We used it to create custom benchmarks and compare latency and throughput aspects of two different parallel libraries. We validate our solution through an extensive analysis of the impact of micro-batching and data stream frequency on stream processing applications using Intel TBB and FastFlow, two libraries that leverage stream parallelism on multi-core architectures. Our results demonstrate up to a 33% throughput gain, at the cost of latency, when using micro-batches. Additionally, while TBB ensures lower latency, FastFlow ensures higher throughput in the parallel applications for different data stream frequency configurations.
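The micro-batching technique studied above trades latency for throughput by grouping stream items before processing. A minimal sketch of such a batcher is shown below; the size/timeout policy and all names are illustrative assumptions, not the framework's actual API.

```python
import time

class MicroBatcher:
    """Groups incoming stream items into batches of up to `batch_size`
    items; a time-based flush bounds the latency added for slow or
    bursty streams."""

    def __init__(self, batch_size, max_wait_s):
        self.batch_size = batch_size
        self.max_wait_s = max_wait_s
        self.buf = []
        self.first_ts = None

    def push(self, item, now=None):
        """Add one item; return a full batch when it is ready to emit,
        otherwise None. `now` defaults to a monotonic clock reading."""
        now = time.monotonic() if now is None else now
        if not self.buf:
            self.first_ts = now  # start the wait timer on first item
        self.buf.append(item)
        if (len(self.buf) >= self.batch_size
                or now - self.first_ts >= self.max_wait_s):
            batch, self.buf = self.buf, []
            return batch
        return None
```

Larger `batch_size` amortizes per-item overhead (higher throughput); smaller `max_wait_s` caps the latency a straggling item can accumulate, which is exactly the trade-off the benchmarks above measure.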
ISBN: (Print) 9781665473156
More recently, it has become possible to run deep learning algorithms on edge devices such as microcontrollers due to continuous improvements in neural network optimization algorithms such as quantization and neural architecture search. Nonetheless, most of the embedded hardware available today still falls short of the requirements of running deep neural networks. As a result, specialized processors have emerged to improve the inference efficiency of deep learning algorithms. However, most are not designed for edge applications that require efficient, low-cost hardware. Therefore, we design and prototype a low-cost configurable sparse Neural Processing Unit (NPU). The NPU has a built-in buffer and a reshapable mixed-precision multiply-accumulator (MAC) array. The computing and memory resources of the NPU are parameterized, and different NPUs can be derived. Besides, users can also configure the NPU at runtime to fully utilize the resources. In our experiments, the 200MHz NPU with only 32 MACs is more than 32 times faster than the 400MHz STM32H7 when running MobileNet-V1 inference. Moreover, the yielded NPUs can achieve roofline or even beyond-roofline performance. The buffer and reshapable MAC array push the NPU's attainable performance to the roofline, while support for sparsity allows the NPU to obtain performance beyond the roofline.
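The roofline claim above refers to the standard roofline model, in which attainable performance is the minimum of the compute peak and the memory-bandwidth ceiling. A sketch with illustrative numbers (not the paper's measurements) follows.

```python
def roofline_attainable_gflops(peak_gflops, bandwidth_gbs, intensity):
    """Roofline model: attainable performance is capped either by the
    compute peak or by memory bandwidth times arithmetic intensity
    (FLOPs performed per byte moved from memory)."""
    return min(peak_gflops, bandwidth_gbs * intensity)
```

Below the ridge point (intensity = peak / bandwidth) a kernel is memory-bound; above it, compute-bound. Exploiting sparsity effectively raises the useful work done per byte and per MAC cycle, which is how a design can appear to exceed its dense roofline.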
This research explores optimization strategies employed within ***, an advanced question-generation system driven by natural language processing (NLP) and machine learning (ML) algorithms. The study delves into three ...
The massive success of blockchains has significantly catalyzed interest in the extensive deployment of practical asynchronous Byzantine fault-tolerant (BFT) consensus protocols across wide-area networks. However, exis...