Modern reinforcement learning (RL) is commonly applied to practical problems with an enormous number of states, where function approximation must be deployed to approximate either the value function or the policy. The introduction of function approximation raises a fundamental set of challenges involving computational and statistical efficiency, especially given the need to manage the exploration/exploitation trade-off. As a result, a core RL question remains open: how can we design provably efficient RL algorithms that incorporate function approximation? This question persists even in a basic setting with linear dynamics and linear rewards, for which only linear function approximation is needed. This paper presents the first provable RL algorithm with both polynomial run time and polynomial sample complexity in this linear setting, without requiring a "simulator" or additional assumptions. Concretely, we prove that an optimistic modification of least-squares value iteration, a classical algorithm frequently studied in the linear setting, achieves $\tilde{O}(\sqrt{d^3 H^3 T})$ regret, where $d$ is the ambient dimension of the feature space, $H$ is the length of each episode, and $T$ is the total number of steps. Importantly, this regret is independent of the number of states and actions.
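As a rough illustration of the optimistic least-squares value iteration described above, the sketch below performs one backward planning pass with a regularized least-squares fit and an elliptical exploration bonus. The function names, data layout, and calling convention are assumptions made for illustration; this is not the paper's implementation.

```python
import numpy as np

def lsvi_ucb_planning(data, phi, actions, H, d, beta, lam=1.0):
    """Optimistic least-squares value iteration (illustrative sketch).

    data[h] is a list of (s, a, r, s_next) tuples collected at step h,
    phi(s, a) is a d-dimensional feature map, and beta scales the
    exploration bonus. Returns per-step weights and a Q-value routine.
    """
    w = [np.zeros(d) for _ in range(H)]
    Lambda_inv = [None] * H

    def q_value(h, s, a):
        f = phi(s, a)
        bonus = beta * np.sqrt(f @ Lambda_inv[h] @ f)  # optimism term
        return min(f @ w[h] + bonus, H)                # clip at the maximum return

    for h in reversed(range(H)):                       # backward over the horizon
        Lam = lam * np.eye(d)
        target = np.zeros(d)
        for (s, a, r, s_next) in data[h]:
            f = phi(s, a)
            Lam += np.outer(f, f)
            if h + 1 < H:
                v_next = max(q_value(h + 1, s_next, b) for b in actions)
            else:
                v_next = 0.0
            target += f * (r + v_next)
        Lambda_inv[h] = np.linalg.inv(Lam)
        w[h] = Lambda_inv[h] @ target
    return w, q_value
```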
This paper revisits the temporal difference (TD) learning algorithm for policy evaluation tasks in reinforcement learning. Typically, the performance of TD(0) and TD($\lambda$) is very sensitive to the choice of stepsizes. Oftentimes, TD(0) suffers from slow convergence. Motivated by the tight link between the TD(0) learning algorithm and stochastic gradient methods, we develop a provably convergent adaptive projected variant of the TD(0) learning algorithm with linear function approximation, which we term AdaTD(0). In contrast to TD(0), AdaTD(0) is robust, or less sensitive, to the choice of stepsizes. Analytically, we establish that to reach an $\epsilon$ accuracy, the number of iterations needed is $\tilde{O}(\epsilon^{-2}\ln^4(1/\epsilon)/\ln^4(1/\rho))$ in the general case, where $\rho$ represents the speed at which the underlying Markov chain converges to its stationary distribution. This implies that the iteration complexity of AdaTD(0) is no worse than that of TD(0) in the worst case. When the stochastic semi-gradients are sparse, we provide theoretical acceleration of AdaTD(0). Going beyond TD(0), we develop an adaptive variant of TD($\lambda$), which is referred to as AdaTD($\lambda$). Empirically, we evaluate the performance of AdaTD(0) and AdaTD($\lambda$) on several standard reinforcement learning tasks, which demonstrates the effectiveness of our new approaches.
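The following is a minimal sketch of an adaptive projected TD(0) update with linear features, using an AdaGrad-style stepsize driven by accumulated semi-gradient norms and a projection onto a norm ball. It illustrates the general idea only; the exact AdaTD(0) recursion and its constants are not reproduced here.

```python
import numpy as np

def adaptive_projected_td0(env_samples, phi, d, gamma=0.99,
                           eta=1.0, radius=10.0, eps=1e-8):
    """Adaptive projected TD(0) with linear features (generic sketch).

    env_samples yields (s, r, s_next) transitions from the evaluated policy,
    phi(s) is a d-dimensional feature map, and radius bounds the projection ball.
    """
    theta = np.zeros(d)
    accum = 0.0                                   # running sum of squared semi-gradient norms
    for (s, r, s_next) in env_samples:
        f, f_next = phi(s), phi(s_next)
        td_error = r + gamma * f_next @ theta - f @ theta
        g = -td_error * f                         # stochastic semi-gradient
        accum += g @ g
        step = eta / (np.sqrt(accum) + eps)       # adaptive stepsize
        theta = theta - step * g
        norm = np.linalg.norm(theta)
        if norm > radius:                         # project back onto the ball
            theta *= radius / norm
    return theta
```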
In this letter, we develop a novel variant of the natural actor-critic algorithm using off-policy sampling and linear function approximation, and establish a sample complexity of $O(\epsilon^{-3})$, outperforming all previously known convergence bounds of such algorithms. To overcome the divergence caused by the deadly triad in off-policy policy evaluation under function approximation, we develop a critic that employs the n-step TD-learning algorithm with a properly chosen n. We present finite-sample convergence bounds on this critic, which are of independent interest. Furthermore, we develop a variant of natural policy gradient under function approximation with an improved convergence rate of $O(1/T)$ after T iterations. Combining the finite-sample bounds of the actor and the critic, we obtain an overall $O(\epsilon^{-3})$ sample complexity. Our results are derived solely from the assumption that the behavior policy sufficiently explores the state-action space, which is a much lighter assumption than those in the related literature.
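Below is a generic sketch of an n-step TD critic with linear features and per-step importance-sampling corrections for off-policy data. The correction scheme and interfaces are illustrative assumptions, not the exact critic analyzed in the letter.

```python
import numpy as np

def n_step_td_critic(trajectory, phi, d, n, gamma, alpha, rho_fn):
    """n-step TD critic with linear features and off-policy corrections (sketch).

    trajectory is a list of (s, a, r) tuples from the behavior policy,
    phi(s) is a d-dimensional state-feature map, and rho_fn(s, a) returns
    the importance ratio pi(a|s) / mu(a|s).
    """
    w = np.zeros(d)
    T = len(trajectory)
    for t in range(T - n):
        g, rho, disc = 0.0, 1.0, 1.0
        for k in range(n):                          # accumulate the corrected n-step return
            s_k, a_k, r_k = trajectory[t + k]
            rho *= rho_fn(s_k, a_k)
            g += disc * rho * r_k
            disc *= gamma
        s_t = trajectory[t][0]
        s_tn = trajectory[t + n][0]
        target = g + disc * rho * (phi(s_tn) @ w)   # bootstrap from the n-th state
        td_error = target - phi(s_t) @ w
        w += alpha * td_error * phi(s_t)
    return w
```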
Although Q-learning has achieved remarkable success in some practical cases, it often suffers from the overestimation problem in stochastic environments, which is commonly viewed as a shortcoming of Q-learning. Overestimated values are introduced by estimating the Q value of the next state, an effect well known as maximization bias. In this paper, we propose a more accurate method for estimating the Q value through Q-value decomposition and re-evaluation with similar samples, based on linear function approximation. Specifically, we reformulate the parameterized incremental update formula of Q-learning and demonstrate that the new formula is equivalent to the original one. Moreover, we propose a new parameterized incremental update formula of Q-learning to address the overestimation problem and present a more accurate computing method, which can be used in problems with continuous state spaces and stochastic environments. Experimentally, when compared with Doubly Bounded Q-learning and other Q-learning-based methods, the new algorithm achieves a performance improvement of more than 31% on Mountain Car and Cart Pole. Furthermore, the algorithm is robust to the learning rate and its memory capacity. Finally, the practical applicability of our algorithm is discussed through an analysis of time consumption.
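For context, the sketch below shows the standard parameterized incremental Q-learning update with linear features, whose max over next-state actions is the source of the maximization bias discussed above; the paper's decomposition and re-evaluation scheme is not reproduced here.

```python
import numpy as np

def linear_q_update(theta, phi, s, a, r, s_next, actions, alpha, gamma):
    """Standard incremental Q-learning update with linear features.

    phi(s, a) is a feature vector and theta its weight vector; the max over
    next-state actions is what introduces the maximization bias.
    """
    q_next = max(phi(s_next, b) @ theta for b in actions)
    td_error = r + gamma * q_next - phi(s, a) @ theta
    return theta + alpha * td_error * phi(s, a)
```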
We study query- and computationally efficient planning algorithms for discounted Markov decision processes (MDPs) with linear function approximation and a simulator. The agent is assumed to have local access to the simulator, meaning that the simulator can be queried only at states that have been encountered in previous steps. We propose two new algorithms for this setting, which we call confident Monte Carlo least-squares policy iteration (CONFIDENT MC-LSPI) and confident Monte Carlo Politex (CONFIDENT MC-POLITEX), respectively. The main novelty in our algorithms is that they gradually build a set of state-action pairs (a "core set") with which they can control the extrapolation errors. We show that our algorithms have polynomial query and computational cost in the dimension of the features, the effective planning horizon, and the targeted sub-optimality, while the cost remains independent of the size of the state space. An interesting technical contribution of our work is the introduction of a novel proof technique that makes use of a virtual policy iteration algorithm. We use this method to leverage existing results on approximate policy iteration with $\ell_\infty$-bounded error to show that our algorithm can learn the optimal policy for the given initial state with only local access to the simulator. We believe that this technique can be extended to broader settings beyond this work.
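The sketch below illustrates one plausible way a core set can be grown: a newly encountered state-action feature is added when its uncertainty under the regularized design matrix of the current core set exceeds a threshold. The test and threshold are assumptions for illustration, not the exact criterion used by CONFIDENT MC-LSPI.

```python
import numpy as np

def maybe_add_to_core_set(core_features, candidate, lam=1.0, tau=1.0):
    """Core-set style coverage check (illustrative sketch).

    core_features is a list of d-dimensional feature vectors already in the
    core set and candidate is the feature of a newly encountered state-action
    pair; the pair is added when its uncertainty exceeds tau.
    """
    d = candidate.shape[0]
    cov = lam * np.eye(d)
    for f in core_features:
        cov += np.outer(f, f)
    uncertainty = candidate @ np.linalg.solve(cov, candidate)
    if uncertainty > tau:                 # poorly covered: extrapolation would be unreliable
        core_features.append(candidate)
        return True
    return False
```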
Reinforcement learning is an efficient, widely used machine learning technique that performs well when the state and action spaces have a reasonable size. This is rarely the case for control-related problems such as controlling traffic signals, where the state space can be very large. To deal with the curse of dimensionality, a rough discretization of such a space can be employed; however, this is effective only up to a certain point. A way to mitigate this is to use techniques that generalize over the state space, such as function approximation. In this paper, a linear function approximation is used. Specifically, SARSA($\lambda$) with Fourier basis features is implemented to control traffic signals in the agent-based transport simulation MATSim. The results are compared not only to trivial controllers such as fixed-time control, but also to state-of-the-art rule-based adaptive methods. It is concluded that SARSA($\lambda$) with Fourier basis features is able to outperform such methods, especially in scenarios with varying traffic demands or unexpected events.
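As a concrete reference point, the sketch below builds Fourier basis features over a normalized state and applies an accumulating-trace SARSA($\lambda$) update with linear features; how the traffic-signal state from MATSim is encoded is specific to the paper and omitted here.

```python
import itertools
import numpy as np

def fourier_basis(order, dim):
    """Fourier basis features phi_i(s) = cos(pi * c_i . s) for s in [0, 1]^dim."""
    coeffs = np.array(list(itertools.product(range(order + 1), repeat=dim)))

    def phi(s):
        return np.cos(np.pi * coeffs @ np.asarray(s))
    return phi

def sarsa_lambda_step(theta, z, phi_s_a, phi_next, r, alpha, gamma, lam):
    """One accumulating-trace SARSA(lambda) update with linear features (sketch)."""
    td_error = r + gamma * phi_next @ theta - phi_s_a @ theta
    z = gamma * lam * z + phi_s_a           # eligibility trace
    theta = theta + alpha * td_error * z
    return theta, z
```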
In this paper, we provide two new stable online algorithms for the problem of prediction in reinforcement learning, i.e., estimating the value function of a model-free Markov reward process using the linear function approximation architecture, with memory and computation costs scaling quadratically in the size of the feature set. The algorithms employ a multi-timescale stochastic approximation variant of the popular cross-entropy optimization method, a model-based search method for finding the global optimum of a real-valued function. A proof of convergence of the algorithms using the ODE method is provided. We supplement our theoretical results with experimental comparisons. The algorithms achieve good performance fairly consistently on many RL benchmark problems with regard to computational efficiency, accuracy, and stability.
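To illustrate the search principle, the sketch below runs a plain batch cross-entropy optimization over the weights of a linear value estimate, minimizing a user-supplied objective such as an empirical Bellman-error criterion. The paper's algorithms are online, multi-timescale stochastic-approximation variants of this idea rather than the batch loop shown.

```python
import numpy as np

def cross_entropy_search(objective, d, iters=50, pop=100, elite_frac=0.2):
    """Plain cross-entropy search over a d-dimensional parameter vector.

    objective(w) returns a scalar to minimize, e.g. an empirical Bellman-error
    criterion for a linear value estimate.
    """
    mean, std = np.zeros(d), np.ones(d)
    n_elite = max(1, int(elite_frac * pop))
    for _ in range(iters):
        samples = mean + std * np.random.randn(pop, d)   # sample candidate weights
        scores = np.array([objective(w) for w in samples])
        elite = samples[np.argsort(scores)[:n_elite]]    # keep the best candidates
        mean, std = elite.mean(axis=0), elite.std(axis=0) + 1e-6
    return mean
```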
Edge caching is an increasingly vital technique in wireless networks, particularly for serving users' repeated demands, including real-time traffic data and map access in vehicular communications. This paper presents a three-tier system for edge caching, integrating massive multiple-input multiple-output (mMIMO) networks. We consider a scenario with an extensive file library whose size is larger than the aggregated caching capacity of the small base stations (SBSs). Each SBS proactively caches files based on dynamic file popularity that is unknown in advance. A distinguishing aspect of our model is its file-wise approach to the cache, instead of the conventional vector-wise methods based on aggregate popularity. This approach introduces additional computational challenges due to the extensive file library. We formulate an optimization problem to maximize the long-term discounted cache hit rate while minimizing the delivery delay. As the solution, two multi-agent federated Q-learning algorithms are proposed. The first algorithm employs selective updates of the Q-values of popular files to minimize computational overhead. The second algorithm incorporates linear function approximation (LFA) and tensor completion (TC) to streamline the updating process further, reducing the number of required parameters. Using the real-world MovieLens dataset and comparisons with various baseline algorithms, simulations demonstrate that our proposed algorithm can reduce the delay by around 2.60--21.29%, improve the cache hit rate by around 5.71--66.42%, and reduce the computational complexity by at most 97.91%.
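The sketch below gestures at the selective-update idea behind the first algorithm: Q-values are updated only for files currently among the most requested. The state encoding, the reward, the federation step, and all names here are assumptions made for illustration.

```python
import numpy as np

def selective_q_update(q_table, pop_counts, event, top_k, alpha, gamma):
    """Selective per-file Q-value update keyed to popularity (rough sketch).

    q_table maps file_id -> array of action values, pop_counts tracks request
    counts, and event = (file_id, action, reward, next_value).
    """
    file_id, action, reward, next_value = event
    pop_counts[file_id] = pop_counts.get(file_id, 0) + 1
    popular = sorted(pop_counts, key=pop_counts.get, reverse=True)[:top_k]
    if file_id in popular:                              # only popular files are updated
        q = q_table.setdefault(file_id, np.zeros(2))    # e.g. actions: cache / evict
        target = reward + gamma * next_value
        q[action] += alpha * (target - q[action])
    return q_table
```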
Q-learning with function approximation is one of the most empirically successful yet theoretically mysterious reinforcement learning (RL) algorithms, and was identified in [R. S. Sutton, in European Conference on Computational Learning Theory, Springer, New York, 1999, pp. 11-17] as one of the most important theoretical open problems in the RL community. Even in the basic setting where linear function approximation is used, there are well-known divergent examples. In this work, we propose a stable online variant of Q-learning with linear function approximation that uses a target network and truncation and is driven by a single trajectory of Markovian samples. We present finite-sample guarantees for the algorithm, which imply a sample complexity of $\tilde{O}(\epsilon^{-2})$ up to a function approximation error. Importantly, we establish the results under minimal assumptions and do not modify the problem parameters to achieve stability.
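The sketch below shows the general shape of linear Q-learning driven by a single trajectory with a periodically synchronized target network and a truncation of the iterates to a norm ball. The placement of the truncation and the synchronization schedule are illustrative assumptions rather than the paper's exact algorithm.

```python
import numpy as np

def q_learning_target_truncation(samples, phi, d, actions, gamma,
                                 alpha, sync_every, clip_radius):
    """Linear Q-learning with a target network and truncation (rough sketch).

    samples yields (s, a, r, s_next) from a single behavior trajectory,
    phi(s, a) is a d-dimensional feature map, the target weights are refreshed
    every sync_every steps, and iterates are kept in a ball of radius clip_radius.
    """
    theta = np.zeros(d)
    theta_target = theta.copy()
    for t, (s, a, r, s_next) in enumerate(samples, start=1):
        q_next = max(phi(s_next, b) @ theta_target for b in actions)  # bootstrap from the target
        td_error = r + gamma * q_next - phi(s, a) @ theta
        theta = theta + alpha * td_error * phi(s, a)
        norm = np.linalg.norm(theta)
        if norm > clip_radius:                                        # truncation step
            theta *= clip_radius / norm
        if t % sync_every == 0:
            theta_target = theta.copy()
    return theta
```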
Motivated by applications in reinforcement learning (RL), we study a nonlinear stochastic approximation (SA) algorithm under Markovian noise and establish its finite-sample convergence bounds under various stepsizes. Specifically, we show that when using a constant stepsize (i.e., $\alpha_k = \alpha$), the algorithm achieves exponentially fast convergence to a neighborhood (with radius $O(\alpha\log(1/\alpha))$) around the desired limit point. When using diminishing stepsizes with an appropriate decay rate, the algorithm converges at rate $O(\log(k)/k)$. Our proof is based on Lyapunov drift arguments, and to handle the Markovian noise, we exploit the fast mixing of the underlying Markov chain. To demonstrate the generality of our theoretical results on Markovian SA, we use them to derive finite-sample bounds for the popular Q-learning algorithm with linear function approximation, under a condition on the behavior policy. Importantly, we do not need to assume that the samples are i.i.d., and we do not require an artificial projection step in the algorithm. Numerical simulations corroborate our theoretical results.
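The two stepsize regimes in the bounds above can be made concrete with a generic SA loop of the form $x_{k+1} = x_k + \alpha_k F(x_k, Y_k)$, shown below; the choice of F, the noise model, and the decay constant are placeholders, and the sketch carries none of the analysis.

```python
import numpy as np

def stochastic_approximation(F, noise_seq, x0, num_steps,
                             alpha=None, decay=None):
    """Generic SA recursion x_{k+1} = x_k + alpha_k * F(x_k, Y_k) (sketch).

    noise_seq is an iterator yielding the Markovian samples Y_k and F is the
    (possibly nonlinear) update operator. Pass a constant alpha, or a decay
    constant c for the diminishing schedule alpha_k = c / (k + 1).
    """
    x = np.array(x0, dtype=float)
    for k in range(num_steps):
        y = next(noise_seq)
        step = alpha if alpha is not None else decay / (k + 1)
        x = x + step * F(x, y)
    return x
```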