In the training of neural networks with low-precision computation and fixed-point arithmetic, rounding errors often cause stagnation or are detrimental to the convergence of the optimizers. This study provides insights into the choice of appropriate stochastic rounding strategies to mitigate the adverse impact of roundoff errors on the convergence of the gradient descent method, for problems satisfying the Polyak-Łojasiewicz inequality. Within this context, we show that a biased stochastic rounding strategy may even be beneficial insofar as it eliminates the vanishing gradient problem and forces the expected roundoff error in a descent direction. Furthermore, we obtain a bound on the convergence rate that is stricter than the one achieved by unbiased stochastic rounding. The theoretical analysis is validated by comparing the performance of various rounding strategies when optimizing several examples using low-precision fixed-point arithmetic.
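A minimal sketch of the rounding strategies compared above, assuming a uniform fixed-point grid of spacing delta: stochastic_round is the standard unbiased scheme (the expected rounded value equals the input), while biased_stochastic_round shifts the round-up probability toward a supplied direction so that the expected roundoff error is aligned with it (a descent direction when the negative gradient is supplied). The function names, the bias parameter, and the toy quadratic objective are illustrative choices, not the paper's exact constructions.

```python
import numpy as np

def stochastic_round(x, delta, rng):
    """Unbiased stochastic rounding to the fixed-point grid of spacing delta:
    rounds up with probability equal to the fractional part, so E[result] = x."""
    scaled = x / delta
    floor = np.floor(scaled)
    frac = scaled - floor
    return (floor + (rng.random(np.shape(x)) < frac)) * delta

def biased_stochastic_round(x, direction, delta, rng, bias=0.25):
    """Biased stochastic rounding (illustrative): shifts the round-up probability
    by `bias` toward the sign of `direction`, so the expected roundoff error
    points along `direction` (a descent direction when direction = -gradient)."""
    scaled = x / delta
    floor = np.floor(scaled)
    frac = scaled - floor
    p_up = np.clip(frac + bias * np.sign(direction), 0.0, 1.0)
    return (floor + (rng.random(np.shape(x)) < p_up)) * delta

# Toy gradient descent on f(w) = 0.5 * ||w||^2, which satisfies the PL inequality.
rng = np.random.default_rng(0)
delta, lr = 2.0 ** -8, 0.1            # fixed-point grid spacing and step size
w = np.full(4, 1.0)
for _ in range(200):
    grad = w                           # gradient of 0.5 * ||w||^2
    update = w - lr * grad
    w = stochastic_round(update, delta, rng)
    # or: w = biased_stochastic_round(update, -grad, delta, rng)
print(w)                               # iterates settle near the origin on the grid
```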
In this paper, we study the distributed nonconvex optimization problem, aiming to minimize the average value of the local nonconvex cost functions using local information exchange. To reduce the communication overhead, we introduce three general classes of compressors, i.e., compressors with bounded relative compression error, compressors with globally bounded absolute compression error, and compressors with locally bounded absolute compression error. By integrating them, respectively, with the distributed gradient tracking algorithm, we then propose three corresponding compressed distributed nonconvex optimization algorithms. Motivated by the state-of-the-art BEER algorithm proposed in Zhao et al. (2022), an efficient compressed algorithm integrating gradient tracking with biased and contractive compressors, our first proposed algorithm extends BEER to accommodate both biased and non-contractive compressors. For each algorithm, we design a novel Lyapunov function to demonstrate its sublinear convergence to a stationary point if the local cost functions are smooth. Furthermore, when the global cost function satisfies the Polyak-Łojasiewicz (P-Ł) condition, we show that our proposed algorithms linearly converge to a global optimal point. It is worth noting that, for compressors with bounded relative compression error and globally bounded absolute compression error, our proposed algorithms' parameters do not require prior knowledge of the P-Ł constant. (c) 2025 Elsevier Ltd. All rights are reserved, including those for text and data mining, AI training, and similar technologies.
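As an illustration of the compressor classes listed above, the sketch below implements two standard examples and checks their error bounds numerically, then runs a minimal uncompressed gradient tracking loop to show where compression would enter: top-k sparsification has bounded relative compression error, while uniform quantization has a globally bounded absolute error. This is only a stand-in under these assumptions; the proposed algorithms compress carefully chosen correction terms rather than the raw transmitted states.

```python
import numpy as np

rng = np.random.default_rng(0)

def topk(x, k):
    """Top-k sparsification: keep the k largest-magnitude entries.
    Bounded *relative* error: ||x - topk(x)||^2 <= (1 - k/d) * ||x||^2."""
    out = np.zeros_like(x)
    idx = np.argsort(np.abs(x))[-k:]
    out[idx] = x[idx]
    return out

def quantize(x, delta):
    """Uniform deterministic quantization to a grid of spacing delta.
    Globally bounded *absolute* error: ||x - quantize(x)||_inf <= delta / 2,
    independent of ||x||."""
    return np.round(x / delta) * delta

# Numerically check the two error bounds on random vectors.
d, k, delta = 50, 5, 0.1
x = rng.standard_normal(d)
rel = np.linalg.norm(x - topk(x, k)) ** 2 / np.linalg.norm(x) ** 2
absolute = np.max(np.abs(x - quantize(x, delta)))
print(f"relative error {rel:.3f} <= {1 - k / d:.3f}, "
      f"absolute error {absolute:.3f} <= {delta / 2:.3f}")

# Minimal (uncompressed) gradient tracking on n agents with quadratic costs
# f_i(x) = 0.5 * (x - b_i)^2; the minimizer of the average cost is mean(b).
n, eta, T = 4, 0.2, 200
W = np.full((n, n), 1.0 / n)           # doubly stochastic mixing matrix (complete graph)
b = rng.standard_normal(n)
x = np.zeros(n)
g = x - b                              # local gradients at x
y = g.copy()                           # gradient trackers
for _ in range(T):
    x_new = W @ x - eta * y            # in the compressed algorithms, the quantities
    g_new = x_new - b                  # exchanged here would pass through a compressor
    y = W @ y + g_new - g
    x, g = x_new, g_new
print(x, b.mean())                     # all agents approach mean(b)
```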
Solving linear inverse problems plays a crucial role in numerous applications. Algorithm-unfolding-based, model-aware data-driven approaches have gained significant attention for effectively addressing these problems. The learned iterative soft-thresholding algorithm (LISTA) and the alternating direction method of multipliers compressive sensing network (ADMM-CSNet) are two widely used such approaches, based on the ISTA and ADMM algorithms, respectively. In this work, we study optimization guarantees, i.e., achieving near-zero training loss with the increase in the number of learning epochs, for finite-layer unfolded networks such as LISTA and ADMM-CSNet with smooth soft-thresholding in an over-parameterized (OP) regime. We achieve this by leveraging a modified version of the Polyak-Łojasiewicz condition, denoted PL*. Satisfying the PL* condition within a specific region of the loss landscape ensures the existence of a global minimum and exponential convergence from initialization using gradient-descent-based methods. Hence, by deriving the Hessian spectral norm, we provide conditions, in terms of the network width and the number of training samples, under which the PL* condition holds for these unfolded networks. Additionally, we show that the threshold on the number of training samples increases with the network width. Furthermore, we compare the threshold on training samples of unfolded networks with that of a standard fully connected feed-forward network (FFNN) with smooth soft-thresholding non-linearity. We prove that unfolded networks have a higher threshold value than the FFNN. Consequently, one can expect a better expected error for unfolded networks than for the FFNN.
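The sketch below shows a smooth soft-thresholding activation (a softplus-based surrogate, which is an assumption about the paper's exact choice) and the forward pass of a finite-layer unfolded LISTA network with per-layer weights initialized from the ISTA update x ← soft(x − Aᵀ(Ax − y)/L, λ/L). The training loop whose loss the PL* analysis concerns is omitted; all names and parameter values are illustrative.

```python
import numpy as np

def softplus(z):
    return np.logaddexp(0.0, z)        # numerically stable log(1 + exp(z))

def smooth_soft_threshold(z, theta, beta=10.0):
    """Smooth surrogate of soft-thresholding, built from softplus:
    (softplus(beta*(z - theta)) - softplus(beta*(-z - theta))) / beta.
    As beta -> inf it approaches sign(z) * max(|z| - theta, 0)."""
    return (softplus(beta * (z - theta)) - softplus(beta * (-z - theta))) / beta

class LISTA:
    """Minimal unfolded LISTA forward pass: K layers, each applying
    x <- sigma(W1 @ y + W2 @ x, theta) with per-layer learnable (W1, W2, theta).
    Weights are initialized from the ISTA iteration; training them by gradient
    descent on a reconstruction loss is omitted here."""
    def __init__(self, A, K=5, lam=0.1):
        m, n = A.shape
        L = np.linalg.norm(A, 2) ** 2          # Lipschitz constant of the data-fit gradient
        self.W1 = [A.T / L for _ in range(K)]
        self.W2 = [np.eye(n) - A.T @ A / L for _ in range(K)]
        self.theta = [lam / L for _ in range(K)]

    def forward(self, y):
        x = np.zeros(self.W2[0].shape[0])
        for W1, W2, theta in zip(self.W1, self.W2, self.theta):
            x = smooth_soft_threshold(W1 @ y + W2 @ x, theta)
        return x

# Toy usage: recover a sparse vector from compressed measurements y = A @ x_true.
rng = np.random.default_rng(0)
m, n = 20, 50
A = rng.standard_normal((m, n)) / np.sqrt(m)
x_true = np.zeros(n)
x_true[rng.choice(n, 3, replace=False)] = 1.0
y = A @ x_true
print(LISTA(A, K=10).forward(y).round(2))
```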