When implementing the gradient descent method in low precision, the use of stochastic rounding schemes helps to prevent stagnation of convergence caused by the vanishing gradient effect. Unbiased stochastic rounding yields zero bias by preserving small updates with probabilities proportional to their relative magnitudes. This study provides a theoretical explanation for the stagnation of the gradient descent method in low-precision computation. Additionally, we propose two new stochastic rounding schemes that trade the zero-bias property for a larger probability of preserving small gradients. Our methods yield a constant rounding bias that, on average, lies in a descent direction. For convex problems, we prove that the proposed rounding methods typically have a beneficial effect on the convergence rate of gradient descent. We validate our theoretical analysis by comparing the performance of various rounding schemes when optimizing a multinomial logistic regression model and when training a simple neural network with an 8-bit floating-point format.
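As a rough illustration of the mechanism described above, the sketch below applies unbiased stochastic rounding on a uniform grid (a simplification of a real low-precision floating-point format, whose grid spacing varies with the exponent) to a single gradient-descent update. All names and parameter values are illustrative, not taken from the paper; the point is only that an update far smaller than the grid spacing survives in expectation, whereas round-to-nearest would discard it.

```python
# A minimal sketch of unbiased stochastic rounding on a uniform grid with
# spacing `eps` (a simplification of a low-precision floating-point format).
import numpy as np

rng = np.random.default_rng(0)

def stochastic_round(x, eps):
    """Round each entry of x to a multiple of eps, up or down at random.

    The probability of rounding up equals the fractional distance to the
    lower grid point, so E[stochastic_round(x)] == x (zero rounding bias).
    """
    scaled = np.asarray(x, dtype=np.float64) / eps
    lower = np.floor(scaled)
    p_up = scaled - lower                      # distance to the lower grid point
    up = rng.random(scaled.shape) < p_up       # round up with probability p_up
    return (lower + up) * eps

def sgd_step_low_precision(w, grad, lr, eps):
    # With round-to-nearest, updates with |lr * grad| < eps / 2 vanish and
    # convergence stagnates; stochastic rounding keeps them alive on average.
    return stochastic_round(w - lr * grad, eps)

# Tiny demonstration: a small update survives in expectation.
w = np.array([1.0])
g = np.array([0.001])          # lr * g is far below the grid spacing 2**-4
steps = [sgd_step_low_precision(w, g, lr=1.0, eps=2**-4) for _ in range(1000)]
print(np.mean(steps))          # ~0.999 on average, not stuck at 1.0
```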
A large gap exists between the computational demand of deep neural network (DNN) applications and the computing power of DNN accelerators. Low-precision floating-point (LP-FP) computation is one of the important means of improving the performance of DNN training and inference. However, high-precision accumulators are typically used to sum the dot products during general matrix multiplication (GEMM) in tensor cores (TCs). As the precision of the data decreases, the accumulator becomes the main consumer of the multiply-accumulate (MAC) unit's area and power. Reducing the accumulators' bit-width is therefore of significant importance for improving the area and energy efficiency of TCs. There are two main challenges: 1) providing theoretical support for the floating-point (FP) formats with the lowest bit-width for TC accumulators and 2) integrating the LP-FP TC into a DNN training and inference framework to evaluate its benefits. In this article, we propose accumulation bit-width scaling (ABS), a novel method to guide the design of LP-FP TCs. We 1) implement this method by constructing a novel variance retention ratio (VRR) model that predicts the FP format with the minimum bit-width for a TC's accumulator; 2) provide a generator of DNN accelerators based on a systolic-array (SA) TC, supporting many low-precision configurations; and 3) design an LP-FP DNN execution framework that supports a software-simulation mode and a hardware-accelerator mode to run LP-FP DNN tasks. The experimental results show that the LP-FP TC guided by our ABS method achieves maximum reductions of 76.47% and 75.60% in area and power consumption, respectively, compared with advanced TCs.
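The sketch below is not the paper's ABS method or VRR model; it only illustrates, under the assumption of a reduced-mantissa accumulator, why the accumulator's bit-width matters for the numerical quality of a GEMM dot product: when every partial sum is rounded to a few mantissa bits, small addends are swamped as the running sum grows.

```python
# A minimal sketch (not the paper's ABS/VRR model) of a dot product whose
# running sum is rounded to a reduced-mantissa format after every
# multiply-accumulate, as a low-bit-width accumulator would do.
import numpy as np

def round_to_mantissa(x, mant_bits):
    """Quantize x to `mant_bits` explicit mantissa bits (exponent kept exact)."""
    m, e = np.frexp(x)                       # x = m * 2**e with 0.5 <= |m| < 1
    m = np.round(m * 2**mant_bits) / 2**mant_bits
    return np.ldexp(m, e)

def dot_low_precision_acc(a, b, acc_mant_bits):
    acc = 0.0
    for ai, bi in zip(a, b):
        acc = round_to_mantissa(acc + float(ai) * float(bi), acc_mant_bits)
    return acc

rng = np.random.default_rng(0)
a = rng.standard_normal(4096).astype(np.float32)
b = rng.standard_normal(4096).astype(np.float32)

exact = float(np.dot(a.astype(np.float64), b.astype(np.float64)))
for bits in (23, 10, 7):                     # fp32-, fp16-, bf16-like accumulators
    approx = dot_low_precision_acc(a, b, bits)
    print(f"{bits:2d}-bit mantissa accumulator: rel. error "
          f"{abs(approx - exact) / abs(exact):.2e}")
```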
The volume of available data has been growing exponentially, increasing the complexity and obscurity of data problems. In response, visual analytics (VA) has gained attention, yet its solutions have not scaled well for big data. Computational methods can improve VA's scalability by giving users compact, meaningful information about the input data. However, the significant computation time these methods require hinders real-time interactive visualization of big data. By addressing crucial discrepancies between these methods and VA regarding precision and convergence, researchers have proposed ways to customize them for VA. These approaches, which include low-precision computation and iteration-level interactive visualization, ensure real-time interactive VA for big data.
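As a loose illustration of iteration-level interactive visualization (one of the approaches mentioned above), the sketch below wraps an iterative computation, here an illustrative k-means loop rather than code from the article, as a generator that yields its intermediate state after every pass, so a visualization layer can redraw immediately instead of waiting for full convergence.

```python
# A minimal sketch of "iteration-level" result delivery: the computation
# yields partial results each pass for a front end to render.
import numpy as np

def kmeans_iterations(points, k, n_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), k, replace=False)]
    for it in range(n_iter):
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = np.argmin(dists, axis=1)
        centers = np.array([points[labels == j].mean(axis=0)
                            if np.any(labels == j) else centers[j]
                            for j in range(k)])
        yield it, centers, labels              # partial result for the view layer

points = np.random.default_rng(1).standard_normal((1000, 2))
for it, centers, labels in kmeans_iterations(points, k=3):
    # A real VA system would push `centers` and `labels` to the chart here.
    print(f"iteration {it}: centers =\n{centers.round(2)}")
```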
Low-precision computation has emerged as one of the most effective techniques for accelerating convolutional neural networks and has garnered widespread support on modern hardware. Despite its effectiveness, however, low-precision computation has not been commonly applied to fast convolutions, such as the Winograd algorithm, due to numerical issues. In this article, we propose an effective quantized Winograd convolution, named LoWino, which employs an in-side quantization method in the Winograd domain to reduce the precision loss caused by the transformations. Meanwhile, we present an efficient implementation that integrates well-designed optimization techniques, allowing us to fully exploit the capabilities of low-precision computation on modern CPUs. We evaluate LoWino on two Intel Xeon Scalable Processor platforms with representative convolutional layers and neural network models. The experimental results demonstrate that our approach achieves average operator speedups of 1.84x and 1.91x over state-of-the-art implementations in the vendor library while keeping the accuracy loss at a reasonable level.
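The following sketch illustrates the general idea of quantizing inside the Winograd domain for a 1-D F(2, 3) tile (two outputs, kernel size 3): the transformed input and filter are quantized to int8, the element-wise products run in integer arithmetic, and the result is dequantized before the output transform. It is not the LoWino implementation and omits all of its optimizations; only the transform matrices are the standard Winograd F(2, 3) matrices.

```python
# A minimal sketch of quantization inside the Winograd domain for 1-D F(2, 3).
import numpy as np

# Standard Winograd F(2, 3) transform matrices.
BT = np.array([[1,  0, -1,  0],
               [0,  1,  1,  0],
               [0, -1,  1,  0],
               [0,  1,  0, -1]], dtype=np.float32)
G  = np.array([[1,   0,   0],
               [0.5, 0.5, 0.5],
               [0.5, -0.5, 0.5],
               [0,   0,   1]], dtype=np.float32)
AT = np.array([[1, 1,  1,  0],
               [0, 1, -1, -1]], dtype=np.float32)

def quantize_int8(x):
    """Symmetric per-tile quantization to int8 with a single scale."""
    scale = np.abs(x).max() / 127.0 or 1.0
    return np.round(x / scale).astype(np.int8), scale

def winograd_f23_quantized(d, g):
    """d: input tile of 4 samples, g: filter of 3 taps -> 2 outputs."""
    V, sv = quantize_int8(BT @ d)            # input transform, then quantize
    U, su = quantize_int8(G @ g)             # filter transform, then quantize
    # Element-wise int8 products (widened to int32), dequantized with both scales.
    M = (V.astype(np.int32) * U.astype(np.int32)) * (sv * su)
    return AT @ M.astype(np.float32)         # output transform on dequantized values

d = np.array([1.0, 2.0, -1.0, 0.5], dtype=np.float32)
g = np.array([0.25, -0.5, 1.0], dtype=np.float32)
print(winograd_f23_quantized(d, g))            # quantized Winograd result
print(np.convolve(d, g[::-1], mode="valid"))   # reference: direct correlation
```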