Asynchronous federated learning (AsynFL) can effectively mitigate the impact of edge-node heterogeneity on joint training while satisfying participants' privacy protection and data security. However, the frequent exchange of massive data leads to excessive communication overhead between edge and central nodes, regardless of whether the federated learning (FL) algorithm uses synchronous or asynchronous aggregation. Therefore, there is an urgent need for a method that can simultaneously account for device heterogeneity and the energy-consumption limitations of edge nodes. This paper proposes a novel fixed-point Asynchronous Federated Learning (fixedAsynFL) algorithm that mitigates the resource consumption caused by frequent data communication while alleviating the effect of device heterogeneity. fixedAsynFL uses fixed-point quantization to compress the local and global models in FL. To balance energy consumption and learning accuracy, the paper proposes a quantization-scale selection mechanism: it examines the mathematical relationship between the quantization scale and the energy consumption of the computation/communication process and, based on an upper bound on the quantization noise, optimizes the quantization scale by minimizing communication and computation energy. Experiments are performed on the MNIST dataset with several edge nodes of different computing power. The results show that fixedAsynFL with 8-bit quantization significantly reduces the communication data size, by 81.3%, and saves 74.9% of the computation energy in the training phase without significant loss of accuracy. These results indicate that the proposed fixedAsynFL algorithm can effectively address device heterogeneity and the energy-consumption limitations of edge nodes.
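The abstract does not spell out the quantizer itself; below is a minimal sketch of the kind of symmetric fixed-point quantization it describes, with the bit width standing in for the paper's quantization scale (the function names and the per-tensor scaling rule are illustrative assumptions, not the authors' code):

    import numpy as np

    def fixed_point_quantize(w, bits=8):
        # Map floats onto a signed (bits-1)-bit integer grid with one
        # scale per tensor; the integers are what gets transmitted.
        qmax = 2 ** (bits - 1) - 1            # e.g. 127 for 8 bits
        scale = np.max(np.abs(w)) / qmax
        q = np.clip(np.round(w / scale), -qmax, qmax)
        return q.astype(np.int32), scale

    def dequantize(q, scale):
        return q.astype(np.float32) * scale

    w = np.random.randn(1000).astype(np.float32)   # stand-in for a model update
    q, s = fixed_point_quantize(w, bits=8)
    w_hat = dequantize(q, s)
    print("max quantization error:", np.max(np.abs(w - w_hat)))

At 8 bits, each transmitted value needs a quarter of the 32 bits of an FP32 weight, which is consistent with the reported 81.3% reduction in communication data size.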
In the field of object detection, deep learning has greatly improved accuracy compared to previous algorithms and has been widely used in recent years. However, object detection using deep learning requires substantial hardware (HW) resources because of the huge computation needed for high performance, making it very difficult to run in real time on embedded platforms. Therefore, various compression methods have been studied to solve this problem. In particular, quantization methods greatly reduce the computational burden of deep learning by reducing the number of bits used for weights and activations. However, most existing studies targeted only object classification and cannot be applied to object detection. Furthermore, most existing quantization studies are based on floating-point operations, which require additional effort when implementing HW accelerators. This paper proposes an HW-friendly fixed-point-based quantization method that can also be applied to object detection. In the proposed method, the center of the weight distribution is adjusted to zero by subtracting the mean of the weight parameters before quantization, and the retraining process is applied iteratively to minimize the accuracy drop caused by quantization. Furthermore, when applying the proposed method to object detection, performance degradation is minimized by considering the minimum and maximum values of the weight parameters of the networks. When the proposed quantization method is applied to representative one-stage object detectors, You Only Look Once v3 and v4 (YOLOv3 and YOLOv4), detection accuracy on the COCO dataset remains similar to that of the original single-precision floating-point (32-bit) networks, even though the weights are expressed with only about 20% of the bits of the single-precision format.
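A hedged sketch of the mean-centering step the abstract describes: subtract the mean of the weights so the distribution is centered at zero, then quantize over the observed min/max range (the helper names and rounding rule are illustrative assumptions):

    import numpy as np

    def mean_centered_quantize(w, bits=8):
        mu = w.mean()
        w0 = w - mu                            # center the distribution at zero
        qmax = 2 ** (bits - 1) - 1
        scale = max(abs(w0.min()), abs(w0.max())) / qmax   # use min/max of weights
        q = np.clip(np.round(w0 / scale), -qmax, qmax)
        return q.astype(np.int32), scale, mu   # mu is added back at inference

    w = 0.05 + 0.1 * np.random.randn(4096)     # weights with a nonzero mean
    q, scale, mu = mean_centered_quantize(w)
    w_hat = q * scale + mu

In the paper's pipeline this step alternates with retraining, so each round of quantization error can be compensated by further fine-tuning.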
Deep neural networks (DNNs) have been proven to outperform classical methods on several machine learning benchmarks. However, they have high computational complexity and require powerful processing units. Especially when deployed on embedded systems, model size and inference time must be significantly reduced. We propose SYMOG (symmetric mixture of Gaussian modes), which significantly decreases the complexity of DNNs through low-bit fixed-point quantization. SYMOG is a novel soft quantization method in which the learning task and the quantization are solved simultaneously. During training, the weight distribution changes from a unimodal Gaussian distribution to a symmetric mixture of Gaussians, where each mean value belongs to a particular fixed-point mode. We evaluate our approach with different architectures (LeNet5, VGG7, VGG11, DenseNet) on common benchmark data sets (MNIST, CIFAR-10, CIFAR-100) and compare with state-of-the-art quantization approaches. We achieve excellent results and outperform the 2-bit state of the art with an error rate of only 5.71% on CIFAR-10 and 27.65% on CIFAR-100.
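One way to read the method is as a mixture-of-Gaussians penalty whose modes sit on the fixed-point grid, trained jointly with the task loss; the sketch below is only that reading (the mode placement, shared variance, and function name are assumptions, not SYMOG's exact parameterization):

    import numpy as np

    def mog_grid_penalty(w, bits=2, sigma=0.1):
        # Negative log-likelihood of the weights under a symmetric mixture
        # of Gaussians centered on fixed-point values; minimizing it pulls
        # the weight distribution toward the low-bit modes.
        qmax = 2 ** (bits - 1) - 1
        modes = np.arange(-qmax, qmax + 1) / max(qmax, 1)   # e.g. [-1, 0, 1]
        d = w[:, None] - modes[None, :]
        likelihood = np.exp(-d**2 / (2 * sigma**2)).sum(axis=1)
        return -np.log(likelihood + 1e-12).mean()

    w = 0.5 * np.random.randn(1000)
    print(mog_grid_penalty(w, bits=2))   # added to the task loss during training

Annealing sigma downward over training would sharpen the modes until each weight effectively snaps to one fixed-point value.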
ISBN (print): 9781728119687
State-of-the-art hardware accelerators for large-scale CNNs face two challenges: the high computational complexity of convolution, and the high on-chip memory consumption of weight kernels. Two techniques have been proposed in the literature to address these challenges: frequency-domain convolution and space-domain fixed-point quantization. In this paper, we propose frequency-domain quantization schemes to achieve high-throughput CNN inference on FPGAs. We first analyze the impact of quantization bit width on the accuracy of a frequency-domain CNN via the metric of signal-to-quantization-noise ratio (SQNR). Taking advantage of the reconfigurability of FPGAs, we design a statically-reconfigurable and a dynamically-reconfigurable architecture for the quantized convolutional layers. Then, based on the SQNR analysis, we propose quantization schemes for both types of architectures, achieving an optimal tradeoff between throughput and accuracy. The proposed quantizer allocates the number of bits for each convolutional layer under various design constraints, including overall SQNR, available DSP resources, on-chip memory, and off-chip bandwidth. Experiments on AlexNet show that our designs improve CNN inference throughput by 1.45x to 8.44x, with negligible (< 0.5%) loss in accuracy.
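The SQNR metric the analysis is built on is easy to state concretely; a small sketch (generic, not the paper's code) that also shows the roughly 6 dB-per-bit behavior that makes per-layer bit allocation a meaningful tradeoff:

    import numpy as np

    def quantize(x, bits):
        qmax = 2 ** (bits - 1) - 1
        s = np.max(np.abs(x)) / qmax
        return np.clip(np.round(x / s), -qmax, qmax) * s

    def sqnr_db(x, x_q):
        # Ratio of signal power to quantization-noise power, in dB.
        return 10 * np.log10(np.sum(x**2) / np.sum((x - x_q)**2))

    x = np.random.randn(10000)
    for b in (4, 6, 8, 10):
        print(b, "bits:", round(sqnr_db(x, quantize(x, b)), 1), "dB")

A common model (assumed here; the abstract does not state it) is that per-layer SQNRs combine in linear scale as 1/SQNR_total = sum_i 1/SQNR_i, which is what lets an allocator trade bits between layers against an overall SQNR constraint.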
The deep neural network (DNN) has achieved remarkable performance in a wide range of applications, at the cost of huge memory and computational complexity. Fixed-point network quantization has emerged as a popular acceleration and compression method but still suffers from severe performance degradation when extremely low-bit quantization is used. Moreover, current fixed-point quantization methods rely heavily on supervised retraining with large amounts of labeled training data, which are hard to obtain in real-world applications. In this article, we propose an efficient framework, namely, the fixed-point factorized network (FFN), to turn all weights into ternary values, i.e., {-1, 0, 1}. We highlight that the proposed FFN framework can achieve negligible degradation even without any supervised retraining on labeled data. Note that the activations can easily be quantized into an 8-bit format; thus, the resulting networks involve only low-bit fixed-point additions, which are significantly more efficient than 32-bit floating-point multiply-accumulate operations (MACs). Extensive experiments on large-scale ImageNet classification and object detection on MS COCO show that the proposed FFN can achieve more than 20x compression and remove most of the multiply operations with comparable accuracy. Code is available on GitHub at https://github.com/wps712/FFN.
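The abstract does not give the factorization itself; as a generic illustration of ternarization (not FFN's fixed-point factorization), a common threshold scheme maps small weights to 0 and the rest to +/-1 with a shared scale (the 0.7 threshold factor is a heuristic assumption):

    import numpy as np

    def ternarize(w, t=0.7):
        delta = t * np.mean(np.abs(w))            # threshold near zero
        tern = np.sign(w) * (np.abs(w) > delta)   # values in {-1, 0, 1}
        mask = np.abs(w) > delta
        alpha = np.abs(w[mask]).mean() if mask.any() else 0.0
        return tern.astype(np.int8), alpha        # w is approximated by alpha * tern

    w = np.random.randn(256, 256).astype(np.float32)
    t_w, alpha = ternarize(w)

With ternary weights, multiplications collapse into sign flips and additions, which is why only low-bit fixed-point additions remain once activations are quantized to 8 bits.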
This paper presents a supervised contrastive learning (SCL) framework for respiratory sound classification and a hardware implementation of the learned ResNet on a field-programmable gate array (FPGA) for real-time monitoring. At the algorithmic level, multiple techniques such as feature augmentation and MixUp are combined holistically to mitigate the impact of data scarcity and imbalanced classes in the training dataset. Bayesian optimization further enhances the classification accuracy through parameter tuning in pre-processing and SCL. The proposed framework achieves a 0.8725 total score (including runtime score) with a ResNet-18 model on both event and record multi-class classification tasks using the SJTU Paediatric Respiratory Sound Database (SPRSound). In addition, algorithm-hardware co-optimizations, including Quantization-Aware Training (QAT), merging of network layers, and optimization of memory size and number of parallel threads, are performed for the hardware implementation on FPGA. This approach reduces model size by 40% and computation latency by 70%. The learned ResNet is implemented on a Xilinx Zynq ZCU102 FPGA with 16 ms latency and less than 2% inference-score degradation compared to the software model.
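A minimal sketch of the "fake quantization" used in QAT (generic, not the paper's implementation): the forward pass rounds values to a low-bit grid, while training frameworks let gradients pass straight through the rounding (the straight-through estimator):

    import numpy as np

    def fake_quantize(x, bits=8):
        # Simulate low-bit hardware arithmetic inside a float forward pass.
        qmax = 2 ** (bits - 1) - 1
        s = np.max(np.abs(x)) / qmax
        return np.clip(np.round(x / s), -qmax, qmax) * s

    x = np.random.randn(16)
    print(fake_quantize(x, bits=8))

Because the network trains against the quantized forward pass, the weights it learns already tolerate the FPGA's integer arithmetic, which helps keep the inference-score degradation under 2%.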
Convolutional neural networks (CNNs) are widely utilized in intelligent edge computing applications such as computer vision and image processing. However, as the number of layers in a CNN model increases, the number of parameters and computations grows, making acceleration increasingly challenging in edge computing applications. To effectively balance the tradeoff between speed and accuracy of CNN inference for smart applications, this paper proposes an FPGA-based adaptive CNN inference accelerator, called APPQ-CNN, that synergistically utilizes filter pruning, fixed-point parameter quantization, and multi-computing-unit parallelism. First, the article devises a hybrid pruning algorithm based on the L1-norm and APoZ to measure each filter's impact, together with a configurable fixed-point parameter-quantization computing architecture in place of a floating-point one. Then, it designs a pipelined CNN kernel architecture cascaded with configurable multiple computation units. Finally, extensive performance exploration and comparison experiments are conducted on various real and synthetic datasets. With negligible accuracy loss, our APPQ-CNN accelerator outperforms the current state-of-the-art FPGA-based accelerators PipeCNN and OctCNN in speed by 2.15x and 1.91x, respectively. Furthermore, APPQ-CNN provides settable fixed-point quantization bit-width parameters, filter pruning rate, and number of computation units to cope with practical performance requirements in edge computing.
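A hedged sketch of the L1-norm half of the hybrid pruning criterion (the APoZ half needs activation statistics from a calibration set, which is omitted; the names and ranking rule are assumptions):

    import numpy as np

    def prune_filters_l1(conv_w, rate=0.5):
        # conv_w shape: (out_channels, in_channels, kH, kW).
        # Rank filters by L1 norm and drop the weakest `rate` fraction.
        scores = np.abs(conv_w).sum(axis=(1, 2, 3))
        keep = np.sort(np.argsort(scores)[int(rate * len(scores)):])
        return conv_w[keep], keep                 # pruned weights + kept indices

    w = np.random.randn(64, 32, 3, 3)
    w_pruned, kept = prune_filters_l1(w, rate=0.5)   # 32 filters remain

Pruning whole filters shrinks both the parameter count and the layer's output channels, which is what makes it attractive for an FPGA's fixed compute budget.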
ISBN (print): 9798350388350; 9798350388343
In today's rapidly advancing technological landscape, the applications of deep learning permeate various facets of our lives. However, traditional implementations of convolutional neural networks (CNNs) on platforms such as CPUs and GPUs often require substantial network bandwidth and incur high power consumption. Deploying CNNs on Field-Programmable Gate Arrays (FPGAs) with efficient logic control from CPUs offers a promising solution for low-power and compact hardware designs. This paper proposes a novel approach to optimize YOLOv3-tiny on FPGA, aiming to reduce hardware resource consumption and power usage while enhancing the computational efficiency of the convolutional neural network. Through hardware optimization strategies, our solution demonstrates improved performance, making it well-suited for real-time deep learning inference tasks in resource-constrained environments.
ISBN (print): 9781665473903
Graph Convolutional Networks (GCNs) have shown great results but come with large computation costs and memory overhead. Recently, sampling-based approaches have been proposed to alter input sizes, which allows large GCN workloads to align with hardware constraints. Motivated by this flexibility, we propose an FPGA-based GCN accelerator, named SkeletonGCN, along with multiple software-hardware co-optimizations to improve training efficiency. We first quantize all feature and adjacency matrices of the GCN from FP32 to SINT16. We then simplify the non-linear operations to better fit the FPGA computation and identify reusable intermediate results to eliminate redundant computation. Moreover, we employ a linear-time sparse-matrix compression algorithm to further reduce memory bandwidth while allowing efficient decompression on hardware. Finally, we propose a unified hardware architecture to process sparse-dense matrix multiplication (SpMM) and dense matrix multiplication (MM) on the same group of PEs to increase DSP utilization on the FPGA. Evaluation is performed on a Xilinx Alveo U200 board. Compared with an existing FPGA-based accelerator on the same network architecture, SkeletonGCN achieves up to 11.3x speedup while maintaining the same training accuracy. In addition, SkeletonGCN achieves up to 178x and 13.1x speedup over state-of-the-art CPU and GPU implementations on popular datasets, respectively.
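A sketch of the FP32-to-SINT16 conversion (the Q7.8 fractional split below is an assumption; the abstract does not publish the format):

    import numpy as np

    def to_sint16(m, frac_bits=8):
        # Fixed-point encode: scale by 2^frac_bits, round, saturate to int16.
        scaled = np.round(m * (1 << frac_bits))
        return np.clip(scaled, -32768, 32767).astype(np.int16)

    def from_sint16(q, frac_bits=8):
        return q.astype(np.float32) / (1 << frac_bits)

    features = np.random.randn(128, 64).astype(np.float32)
    q = to_sint16(features)        # half the memory traffic of FP32

Halving the word size doubles the effective off-chip bandwidth and lets twice as many operands share each on-chip buffer, which is one reason the quantization step matters for training throughput.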
Quantization, which involves bit-width reduction, is considered one of the most effective approaches for rapidly and energy-efficiently deploying deep convolutional neural networks (DCNNs) on resource-constrained embedded hardware. However, bit-width reduction of the weights and activations of DCNNs seriously degrades accuracy. To solve this problem, in this paper we propose a mixed hardware-friendly quantization (MXQN) method that applies fixed-point quantization and logarithmic quantization to DCNNs without the need to retrain or fine-tune them. Our MXQN algorithm is a multi-stage process: first, we employ the signal-to-quantization-noise ratio (SQNR) as the metric to estimate the interplay between the parameter quantization errors of each layer and the overall model prediction accuracy. Then, we use fixed-point quantization for the weights and, depending on the SQNR metric, empirically select either logarithmic or fixed-point quantization for the activations. For improved accuracy, we propose an optimized logarithmic quantization scheme that affords a fine-grained step size. We evaluate the performance of MXQN using the VGG16 network on the MNIST, CIFAR-10, CIFAR-100, and ImageNet datasets, as well as VGG19 and ResNet (ResNet18, ResNet34, ResNet50) networks on ImageNet, and demonstrate that, despite not being retrained or fine-tuned, the MXQN-quantized DCNN still achieves accuracy close to the original DCNN.
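A sketch of plain power-of-two (logarithmic) quantization for contrast with the fixed-point scheme (generic; MXQN's optimized variant adds a finer-grained step size, and the exponent window below is an assumption):

    import numpy as np

    def log_quantize(x, bits=4, e_max=0):
        # Encode each value as sign * 2^e with a clipped integer exponent,
        # so multiplications by quantized values become bit shifts.
        sign = np.sign(x)
        e = np.round(np.log2(np.abs(x) + 1e-12))
        e = np.clip(e, e_max - 2 ** bits + 1, e_max)
        return sign * 2.0 ** e

    x = 0.5 * np.random.rand(8)
    print(log_quantize(x))

Logarithmic quantization spends its levels where small activations cluster, which is why an SQNR test can prefer it over fixed-point quantization for some layers.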