As deep neural network (DNN) models become more accurate, problems such as large numbers of parameters and high computational complexity have become increasingly prominent, creating a bottleneck for deployment on resource-limited embedded platforms. In recent years, logarithm-based quantization techniques have shown great potential for reducing the inference cost of neural networks. However, current single-model logarithmic quantization has reached an upper limit of classification performance, and little work has investigated hardware implementations of neural network quantization. In this paper, we propose a full logarithmic quantization (FLQ) mechanism that quantizes both weights and activations into the logarithmic domain, compressing the parameters of the AlexNet and VGG16 models by more than 6.4 times while keeping the accuracy loss within 2.5% of the benchmark. Furthermore, we propose two optimizations of FLQ: activation segmented full logarithmic quantization (ASFLQ) and multi-ratio activation segmented full logarithmic quantization (Multi-ASFLQ), which better balance the numerical representation range against the quantization step. With weights quantized to 5 bits and activations to 4 bits, the proposed optimizations improve the top-1 accuracy of the VGG16 model by 1% and 1.6%, respectively. We then propose an implementation scheme for the computing unit corresponding to the optimized FLQ mechanism, which not only converts multiplication operations into shift operations but also integrates functions such as logarithmic bases with different ratios and sparsity processing for activations, minimizing resource consumption and avoiding unnecessary calculations. Finally, experiments with the VGG19, ResNet50, and DenseNet169 models show that the proposed method achieves good performance under lower-bit quantization.
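The abstract does not give the quantizer itself; as a rough illustration of the underlying idea, here is a minimal Python sketch of power-of-two (log-domain) quantization in which a multiply becomes an exponent addition (a shift in hardware). The function names, bit-width handling, and clipping are my own assumptions, not the paper's FLQ/ASFLQ definitions.

```python
import numpy as np

def log_quantize(x, bits=5):
    """Quantize values to signed powers of two (a simplified log-domain
    quantizer; the paper's FLQ/ASFLQ schemes are more elaborate)."""
    sign = np.sign(x)
    mag = np.clip(np.abs(x), 1e-12, None)   # avoid log(0); real schemes treat zeros/sparsity separately
    exp = np.round(np.log2(mag)).astype(np.int32)
    # Restrict exponents to the range representable with the given bit width.
    exp = np.clip(exp, -(2 ** (bits - 1)), 2 ** (bits - 1) - 1)
    return sign, exp

def shift_multiply(w_sign, w_exp, a_sign, a_exp):
    """With both operands in the log domain, a multiply is an exponent add
    (i.e., a shift in hardware) plus a sign combination."""
    return w_sign * a_sign, w_exp + a_exp

# Toy usage: quantize a weight and an activation, then 'multiply' by shifting.
w_sign, w_exp = log_quantize(np.array([0.37]), bits=5)
a_sign, a_exp = log_quantize(np.array([1.8]), bits=4)
s, e = shift_multiply(w_sign, w_exp, a_sign, a_exp)
print(s * np.exp2(e.astype(float)))   # approximates 0.37 * 1.8
```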
In future transportation, the on-board unit (OBU) is a key component of connected vehicles; its computing resources are limited, and it may not be able to handle the heavy computing burden imposed by V2X networks. For these cases, we employ a multi-access edge cloud (MEC) and a remote cloud to schedule the OBUs' tasks. The schedule aims to minimise the total completion time of all tasks and the number of computing units of the MEC server. We first introduce a multi-objective optimisation model that accounts for the tasks and cloud-edge collaboration. We then propose a task scheduling strategy for this model based on the resource matching degree. In this strategy, we propose an improved hybrid genetic algorithm and employ a resource matching measure between tasks and computing units, in terms of computing, storage and network bandwidth resources, to obtain better solutions across generations. Numerical results show the effectiveness of our strategy.
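The abstract does not define the matching metric; below is a hedged sketch of one plausible resource-matching measure across the three resource dimensions named above. The Task/Unit fields and the scoring rule are illustrative assumptions, not the paper's formulation.

```python
from dataclasses import dataclass

@dataclass
class Task:
    cpu: float        # required computing capacity (normalized)
    storage: float    # required storage
    bandwidth: float  # required network bandwidth

@dataclass
class Unit:
    cpu: float
    storage: float
    bandwidth: float

def matching_degree(task: Task, unit: Unit) -> float:
    """Illustrative matching measure: how closely a computing unit's free
    resources fit a task's demands. Returns a value in [0, 1];
    0 means the unit cannot satisfy at least one requirement."""
    pairs = ((task.cpu, unit.cpu),
             (task.storage, unit.storage),
             (task.bandwidth, unit.bandwidth))
    if any(u < t for t, u in pairs):   # insufficient in some dimension
        return 0.0
    # Units whose capacity is close to the demand score higher,
    # discouraging waste of large units on small tasks.
    return sum(t / u for t, u in pairs) / len(pairs)

# Toy usage: pick the MEC computing unit that best matches a task.
task = Task(cpu=2.0, storage=1.0, bandwidth=0.5)
units = [Unit(4.0, 2.0, 1.0), Unit(2.5, 1.2, 0.6)]
best = max(units, key=lambda u: matching_degree(task, u))
print(best)
```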
The speed of open-weight large language models (LLMs) run on GPUs, and its dependence on the task at hand, is studied in order to present a comparative analysis of the speed of the most popular open LLMs.
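The abstract does not describe the measurement setup; as a minimal sketch of how per-task generation throughput could be measured on a GPU with Hugging Face Transformers, the following uses a placeholder model name, prompt, and token budget that are not taken from the paper.

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-2-7b-hf"   # placeholder open-weight model
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.float16).cuda()

def tokens_per_second(prompt: str, max_new_tokens: int = 128) -> float:
    """Time a single generation and report generated tokens per second."""
    inputs = tok(prompt, return_tensors="pt").to("cuda")
    torch.cuda.synchronize()
    start = time.perf_counter()
    out = model.generate(**inputs, max_new_tokens=max_new_tokens)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    new_tokens = out.shape[1] - inputs["input_ids"].shape[1]
    return new_tokens / elapsed

# Task dependence can be probed by running the same measurement over
# task-specific prompt sets (e.g., summarization vs. code generation).
print(tokens_per_second("Summarize the theory of relativity in one paragraph."))
```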
Many encoding algorithms for systematic polar codes (SPC) have been introduced since SPC was proposed in 2011. However, the number of exclusive-OR (XOR) computing units has not yet been optimized. Based on an iterative property of the generator matrix and its particular lower triangular structure, we propose an optimized encoding algorithm (OEA) for SPC that reduces the number of XOR computing units compared with existing non-recursive algorithms. We also prove that this property of the generator matrix extends to different code lengths and rates of polar codes. Through matrix segmentation and transformation, we obtain a submatrix with all zero elements, saving computation resources. The proportion of zero elements in the matrix reaches up to 58.5% with the OEA for SPC when the code length is 2048 and the code rate is 0.5. Furthermore, the proposed OEA is better suited to hardware implementation than existing recursive algorithms, in which signals are transmitted bidirectionally.
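As background for the generator-matrix property the abstract relies on, here is a small Python sketch that builds the polar generator matrix as the n-fold Kronecker power of F = [[1, 0], [1, 1]] and counts its zero entries; the segmentation and transformation steps of the proposed OEA itself are not reproduced, and the zero-fraction count is only an illustration of why zero entries save XOR operations in a non-recursive encoder.

```python
import numpy as np

F = np.array([[1, 0],
              [1, 1]], dtype=np.uint8)

def polar_generator(n: int) -> np.ndarray:
    """n-fold Kronecker power of F, giving the lower-triangular
    generator matrix G_N of a length-N = 2**n polar code."""
    G = np.array([[1]], dtype=np.uint8)
    for _ in range(n):
        G = np.kron(G, F)
    return G

def zero_fraction(G: np.ndarray) -> float:
    """Fraction of zero entries; each zero is a product term that a
    non-recursive (matrix-multiplication) encoder never needs to XOR."""
    return 1.0 - G.sum() / G.size

G8 = polar_generator(3)   # N = 8 toy example
print(G8)
print(f"zero fraction: {zero_fraction(G8):.3f}")
```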