ISBN (print): 9798331516758; 9798331516741
In this paper we provide oracle complexity lower bounds for finding a point in a given set using a memory-constrained algorithm that has access to a separation oracle. We assume that the set is contained within the unit d-dimensional ball and contains a ball of known radius ε > 0. This setup is commonly referred to as the feasibility problem. We show that to solve feasibility problems with accuracy ε ≥ e^{-d^{o(1)}}, any deterministic algorithm either uses d^{1+δ} bits of memory or must make at least 1/(d^{0.01δ} ε^{2(1-δ)/(1+1.01δ) - o(1)}) oracle queries, for any δ ∈ [0, 1]. Additionally, we show that randomized algorithms either use d^{1+δ} memory or make at least 1/(d^{2δ} ε^{2(1-4δ) - o(1)}) queries for any δ ∈ [0, 1/4]. Because gradient descent only uses linear memory O(d ln(1/ε)) but makes O(1/ε²) queries, our results imply that it is Pareto-optimal in the oracle complexity/memory tradeoff. Further, our results show that the oracle complexity for deterministic algorithms is always polynomial in 1/ε if the algorithm has less than quadratic memory in d. This reveals a sharp phase transition, since with quadratic O(d² ln(1/ε)) memory, cutting plane methods only require O(d ln(1/ε)) queries.
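As a concrete picture of the memory-light end of this tradeoff, the sketch below runs a plain gradient-descent-style search for the feasibility problem: it stores only the current iterate (O(d) memory) and, under the stated separation-oracle convention, needs on the order of 1/ε² queries. The oracle interface, step size, and query budget are assumptions made for illustration, not the paper's construction.

```python
import numpy as np

def feasibility_gd(separation_oracle, d, eps, max_queries=None):
    """Memory-light search for a feasible point using only a separation oracle.

    Minimal sketch (not the paper's construction): `separation_oracle(x)` is
    assumed to return None if x lies in the target set S, and otherwise a
    vector g with <g, x - s> >= 0 for every s in S.  Since S sits inside the
    unit ball and contains a ball of radius eps, stepping against the
    separating direction with step size ~eps reaches a feasible point within
    O(1/eps^2) queries while storing only the current iterate (O(d) memory).
    """
    if max_queries is None:
        max_queries = int(np.ceil(4.0 / eps**2)) + 1
    x = np.zeros(d)                              # start at the center of the unit ball
    for _ in range(max_queries):
        g = separation_oracle(x)
        if g is None:                            # oracle certifies x is feasible
            return x
        x = x - eps * g / np.linalg.norm(g)      # step across the separating hyperplane
        nx = np.linalg.norm(x)
        if nx > 1.0:                             # stay inside the unit ball
            x = x / nx
    return None                                  # query budget exhausted
```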
It is well-known that given a smooth, bounded-from-below, and possibly nonconvex function, standard gradient-based methods can find ε-stationary points (with gradient norm less than ε) in O(1/ε²) iterations. However, many important nonconvex optimization problems, such as those associated with training modern neural networks, are inherently not smooth, making these results inapplicable. In this paper, we study nonsmooth nonconvex optimization from an oracle complexity viewpoint, where the algorithm is assumed to be given access only to local information about the function at various points. We provide two main results: First, we consider the problem of getting near ε-stationary points. This is perhaps the most natural relaxation of finding ε-stationary points, which is impossible in the nonsmooth nonconvex case. We prove that this relaxed goal cannot be achieved efficiently, for any distance and ε smaller than some constants. Our second result deals with the possibility of tackling nonsmooth nonconvex optimization by reduction to smooth optimization: Namely, applying smooth optimization methods on a smooth approximation of the objective function. For this approach, we prove under a mild assumption an inherent trade-off between oracle complexity and smoothness: On the one hand, smoothing a nonsmooth nonconvex function can be done very efficiently (e.g., by randomized smoothing), but with dimension-dependent factors in the smoothness parameter, which can strongly affect iteration complexity when plugging into standard smooth optimization methods. On the other hand, these dimension factors can be eliminated with suitable smoothing methods, but only by making the oracle complexity of the smoothing process exponentially large.
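To make the smoothing approach in the second result concrete, here is a minimal sketch of randomized smoothing via a two-point zeroth-order estimator: it approximates the gradient of the smoothed surrogate f_δ(x) = E[f(x + δu)], whose smoothness parameter picks up the dimension-dependent factor (roughly L√d/δ for L-Lipschitz f) discussed in the abstract. The function name, sampling scheme, and defaults are illustrative assumptions, not the paper's definitions.

```python
import numpy as np

def randomized_smoothing_grad(f, x, delta=1e-2, num_samples=64, rng=None):
    """Zeroth-order gradient estimate for the smoothed surrogate f_delta(x) = E[f(x + delta*u)].

    Hedged sketch of the generic randomized-smoothing technique mentioned in
    the abstract.  For an L-Lipschitz f, f_delta is smooth with a smoothness
    constant growing like L*sqrt(d)/delta -- the dimension dependence the
    trade-off result is about.  Defaults are illustrative choices.
    """
    rng = np.random.default_rng() if rng is None else rng
    d = x.shape[0]
    grad = np.zeros(d)
    for _ in range(num_samples):
        u = rng.standard_normal(d)
        u /= np.linalg.norm(u)                   # uniform direction on the unit sphere
        # two-point (central-difference) estimator of the smoothed gradient
        grad += (f(x + delta * u) - f(x - delta * u)) / (2.0 * delta) * d * u
    return grad / num_samples
```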
We construct a family of functions suitable for establishing lower bounds on the oracle complexity of first-order minimization of smooth strongly-convex functions. Based on this construction, we derive new lower bounds on the complexity of strongly-convex minimization under various inaccuracy criteria. The new bounds match the known upper bounds up to a constant factor, and when the inaccuracy of a solution is measured by its distance to the solution set, the new lower bound exactly matches the upper bound obtained by the recent Information-Theoretic Exact Method by the same authors, thereby establishing the exact oracle complexity for this class of problems.
Second-order methods, which utilize gradients as well as Hessians to optimize a given function, are of major importance in mathematical optimization. In this work, we prove tight bounds on the oracle complexity of such methods for smooth convex functions, or equivalently, the worst-case number of iterations required to optimize such functions to a given accuracy. In particular, these bounds indicate when such methods can or cannot improve on gradient-based methods, whose oracle complexity is much better understood. We also provide generalizations of our results to higher-order methods.
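For a concrete reference point, the following is a bare-bones sketch of the kind of second-order method covered by the oracle model above: each iteration makes one gradient query and one Hessian query and takes a (damped) Newton step. It is an illustrative baseline under assumed inputs, not the algorithm achieving the tight bounds in the paper.

```python
import numpy as np

def damped_newton(grad, hess, x0, num_iters=50, alpha=1.0):
    """A bare-bones second-order method in the gradient-plus-Hessian oracle model.

    Minimal sketch: each iteration queries the gradient and Hessian oracles
    once and moves along the Newton direction.  The `grad`/`hess` callables
    and the damping factor `alpha` are illustrative assumptions; the paper
    studies worst-case oracle bounds, not a specific implementation.
    """
    x = x0.astype(float)
    for _ in range(num_iters):
        g = grad(x)
        H = hess(x)
        # Newton direction; a tiny ridge term keeps the linear solve well-posed
        step = np.linalg.solve(H + 1e-12 * np.eye(x.size), g)
        x = x - alpha * step
    return x
```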
We present an information-theoretic approach to lower bound the oracle complexity of nonsmooth black-box convex optimization, unifying previous lower bounding techniques by identifying a combinatorial problem, namely string guessing, as a single source of hardness. As a measure of complexity, we use distributional oracle complexity, which subsumes randomized oracle complexity as well as worst-case oracle complexity. We obtain strong lower bounds on distributional oracle complexity for the box [-1, 1]^n, as well as for the ℓ_p-ball for p ≥ 1 (for both low-scale and large-scale regimes), matching worst-case upper bounds, and hence we close the gap between distributional complexity (and, in particular, randomized complexity) and worst-case complexity. Furthermore, the bounds remain essentially the same for high-probability and bounded-error oracle complexity, and even for a combination of the two, i.e., bounded-error high-probability oracle complexity. This considerably extends the applicability of known bounds.
Zeroth-order optimization, which does not use derivative information, is one of the significant research areas in the field of mathematical optimization and machine learning. Although various studies have explored zeroth-order algorithms, one of the theoretical limitations is that oracle complexity depends on the dimension, i.e., on the number of variables, of the optimization problem. In this paper, to reduce the dependence of the oracle complexity on the dimension, we propose a zeroth-order random subspace algorithm by combining a gradient-free algorithm (specifically, Gaussian randomized smoothing with central differences) with random projection. We derive the worst-case oracle complexity of our proposed method in non-smooth and convex settings; it is equivalent to standard results for full-dimensional non-smooth convex algorithms. Furthermore, we prove that ours also has a local convergence rate independent of the original dimension under additional assumptions. In addition to the theoretical results, numerical experiments show that when an objective function has a specific structure, the proposed method can become experimentally more efficient due to random projection.
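A hedged sketch of one step of the random-subspace idea described above: draw a random projection, estimate the gradient of a Gaussian-smoothed objective inside the low-dimensional subspace with central differences, and update along the lifted direction. Parameter names and defaults are assumptions for illustration, not the authors' exact algorithm.

```python
import numpy as np

def zo_random_subspace_step(f, x, step_size=0.1, subspace_dim=10, mu=1e-3,
                            num_samples=16, rng=None):
    """One step of a zeroth-order random-subspace scheme (illustrative sketch).

    Draw a random projection P (d x r), estimate the gradient of the
    Gaussian-smoothed objective within the r-dimensional subspace using
    central differences, then update x along the lifted direction.
    """
    rng = np.random.default_rng() if rng is None else rng
    d = x.shape[0]
    P = rng.standard_normal((d, subspace_dim)) / np.sqrt(subspace_dim)  # random projection
    g_sub = np.zeros(subspace_dim)
    for _ in range(num_samples):
        u = rng.standard_normal(subspace_dim)        # Gaussian direction in the subspace
        du = P @ u                                   # lift to the original space
        g_sub += (f(x + mu * du) - f(x - mu * du)) / (2.0 * mu) * u  # central difference
    g_sub /= num_samples
    return x - step_size * (P @ g_sub)               # move along the lifted estimate
```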
In this paper, we study zeroth-order algorithms for minimax optimization problems that are nonconvex in one variable and strongly-concave in the other variable. Such minimax optimization problems have attracted significant attention lately due to their applications in modern machine learning tasks. We first consider a deterministic version of the problem. We design and analyze the Zeroth-Order Gradient Descent Ascent (ZO-GDA) algorithm, and provide improved results compared to existing works, in terms of oracle complexity. We also propose the Zeroth-Order Gradient Descent Multi-Step Ascent (ZO-GDMSA) algorithm that significantly improves the oracle complexity of ZO-GDA. We then consider stochastic versions of ZO-GDA and ZO-GDMSA, to handle stochastic nonconvex minimax problems. For this case, we provide oracle complexity results under two assumptions on the stochastic gradient: (i) the uniformly bounded variance assumption, which is common in traditional stochastic optimization, and (ii) the Strong Growth Condition (SGC), which has been known to be satisfied by modern over-parameterized machine learning models. We establish that under the SGC assumption, the complexities of the stochastic algorithms match that of deterministic algorithms. Numerical experiments are presented to support our theoretical results.
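As an illustration of the deterministic ZO-GDA template described above, the sketch below replaces both partial gradients of f(x, y) with two-point random-direction estimates and alternates a descent step in x with an ascent step in y. Step sizes, the smoothing radius, and the single-sample estimator are illustrative assumptions, not the paper's tuned choices.

```python
import numpy as np

def zo_gda(f, x0, y0, eta_x=1e-3, eta_y=1e-2, mu=1e-4, num_iters=1000, rng=None):
    """Zeroth-order gradient descent ascent for min_x max_y f(x, y) (hedged sketch).

    Both gradients are replaced by two-point random-direction estimates of the
    smoothed objective; x takes a descent step and y an ascent step each
    iteration.  Defaults are illustrative, not the paper's analyzed settings.
    """
    rng = np.random.default_rng() if rng is None else rng
    x, y = x0.astype(float), y0.astype(float)
    for _ in range(num_iters):
        u = rng.standard_normal(x.size)
        v = rng.standard_normal(y.size)
        # two-point estimates of the partial gradients
        gx = (f(x + mu * u, y) - f(x - mu * u, y)) / (2.0 * mu) * u
        gy = (f(x, y + mu * v) - f(x, y - mu * v)) / (2.0 * mu) * v
        x = x - eta_x * gx      # descent on the nonconvex variable
        y = y + eta_y * gy      # ascent on the strongly concave variable
    return x, y
```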
Several recent empirical studies demonstrate that important machine learning tasks, such as training deep neural networks, exhibit a low-rank structure, where most of the variation in the loss function occurs only in a few directions of the input space. In this article, we leverage such low-rank structure to reduce the high computational cost of canonical gradient-based methods such as gradient descent (GD). Our proposed Low-Rank Gradient Descent (LRGD) algorithm finds an ε-approximate stationary point of a p-dimensional function by first identifying r ≤ p significant directions, and then estimating the true p-dimensional gradient at every iteration by computing directional derivatives only along those r directions. We establish that the "directional oracle complexities" of LRGD for strongly convex and non-convex objective functions are O(r log(1/ε) + rp) and O(r/ε² + rp), respectively. Therefore, when r ≪ p, LRGD provides significant improvement over the known complexities of O(p log(1/ε)) and O(p/ε²) of GD in the strongly convex and non-convex settings, respectively. Furthermore, we formally characterize the classes of exactly and approximately low-rank functions. Empirically, using real and synthetic data, LRGD provides significant gains over GD when the data has low-rank structure, and in the absence of such structure, LRGD does not degrade performance compared to GD. This suggests that LRGD could be used in practice in any setting in place of GD.
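The core iteration of the LRGD idea can be sketched as follows, assuming the identification phase has already produced an orthonormal set of r significant directions: each step spends r directional-derivative queries (approximated here by central differences) instead of p, and moves along the projection of the gradient onto that span. All names and defaults are hypothetical, chosen only for illustration.

```python
import numpy as np

def lrgd(f, x0, directions, step_size=1e-2, num_iters=500, h=1e-5):
    """Gradient descent restricted to a few significant directions (illustrative sketch).

    `directions` is an (r x p) matrix with orthonormal rows spanning the
    significant subspace; the identification phase from the abstract is
    assumed to have produced it.  Each iteration uses r directional
    derivatives instead of a full p-dimensional gradient.
    """
    x = x0.astype(float)
    U = np.asarray(directions)                   # shape (r, p), orthonormal rows
    for _ in range(num_iters):
        # directional derivatives along the r significant directions
        coeffs = np.array([(f(x + h * u) - f(x - h * u)) / (2.0 * h) for u in U])
        grad_est = U.T @ coeffs                  # low-rank estimate of the gradient
        x = x - step_size * grad_est
    return x
```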
We introduce a new approach to develop stochastic optimization algorithms for a class of stochastic composite and possibly nonconvex optimization problems. The main idea is to combine a variance-reduced estimator and an unbiased stochastic one to create a new hybrid estimator which trades off variance and bias, and possesses useful properties for developing new algorithms. We first introduce our hybrid estimator and investigate its fundamental properties to form a foundational theory for algorithmic development. Next, we apply our new estimator to develop several variants of stochastic gradient methods to solve both expectation and finite-sum composite optimization problems. Our first algorithm can be viewed as a variant of proximal stochastic gradient methods with a single loop and a single sample, yet it achieves the same best-known oracle complexity bound as state-of-the-art double-loop algorithms in the literature. Then, we consider two different variants of our method: adaptive step-size and restarting schemes that have similar theoretical guarantees as our first algorithm. We also study two mini-batch variants of the proposed methods. In all cases, we achieve the best-known complexity bounds under standard assumptions. We test our algorithms on several numerical examples with real datasets and compare them with many existing methods. Our numerical experiments show that the new algorithms are comparable to, and in many cases outperform, their competitors.
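A minimal single-loop sketch of the hybrid-estimator idea, under the assumption of a `stoch_grad(x, seed)` oracle that returns the stochastic gradient for the sample determined by `seed`: the search direction is a convex combination of a SARAH-style recursive (variance-reduced) term and a plain unbiased stochastic gradient. The constant weight beta and step size are illustrative; the paper's proximal, adaptive, restarting, and mini-batch variants are not reproduced here.

```python
import numpy as np

def hybrid_sgd(stoch_grad, x0, step_size=1e-2, beta=0.9, num_iters=1000, rng=None):
    """Single-loop sketch of a hybrid variance-reduced / unbiased estimator.

    Direction update (assumed form for this sketch):
        v_t = beta*(v_{t-1} + g(x_t; s) - g(x_{t-1}; s)) + (1 - beta)*g(x_t; s'),
    where s is shared between the two points (recursive term) and s' is an
    independent sample (unbiased term).
    """
    rng = np.random.default_rng() if rng is None else rng
    x_prev = x0.astype(float)
    v = stoch_grad(x_prev, int(rng.integers(2**31)))      # initial stochastic gradient
    x = x_prev - step_size * v
    for _ in range(num_iters):
        s_shared = int(rng.integers(2**31))               # same sample at both points
        s_fresh = int(rng.integers(2**31))                # independent sample
        recursive = v + stoch_grad(x, s_shared) - stoch_grad(x_prev, s_shared)
        unbiased = stoch_grad(x, s_fresh)
        v = beta * recursive + (1.0 - beta) * unbiased    # hybrid estimator
        x_prev, x = x, x - step_size * v
    return x
```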