Modern reinforcement learning (RL) is commonly applied to practical problems with an enormous number of states, where function approximation must be deployed to approximate either the value function or the policy. The introduction of function approximation raises a fundamental set of challenges involving computational and statistical efficiency, especially given the need to manage the exploration/exploitation trade-off. As a result, a core RL question remains open: how can we design provably efficient RL algorithms that incorporate function approximation? This question persists even in a basic setting with linear dynamics and linear rewards, for which only linear function approximation is needed. This paper presents the first provable RL algorithm with both polynomial run time and polynomial sample complexity in this linear setting, without requiring a "simulator" or additional assumptions. Concretely, we prove that an optimistic modification of least-squares value iteration, a classical algorithm frequently studied in the linear setting, achieves $\tilde{O}(\sqrt{d^3 H^3 T})$ regret, where $d$ is the ambient dimension of the feature space, $H$ is the length of each episode, and $T$ is the total number of steps. Importantly, this regret is independent of the number of states and actions.
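As a rough illustration of the optimistic least-squares value iteration described above, the sketch below performs one backward planning pass with a regularized least-squares fit and an elliptical exploration bonus. The function names, data layout, and calling convention are assumptions made for illustration; this is not the paper's implementation.

```python
import numpy as np

def lsvi_ucb_planning(data, phi, actions, H, d, beta, lam=1.0):
    """Optimistic least-squares value iteration (illustrative sketch).

    data[h] is a list of (s, a, r, s_next) tuples collected at step h,
    phi(s, a) is a d-dimensional feature map, and beta scales the
    exploration bonus. Returns per-step weights and a Q-value routine.
    """
    w = [np.zeros(d) for _ in range(H)]
    Lambda_inv = [None] * H

    def q_value(h, s, a):
        f = phi(s, a)
        bonus = beta * np.sqrt(f @ Lambda_inv[h] @ f)  # optimism term
        return min(f @ w[h] + bonus, H)                # clip at the maximum return

    for h in reversed(range(H)):                       # backward over the horizon
        Lam = lam * np.eye(d)
        target = np.zeros(d)
        for (s, a, r, s_next) in data[h]:
            f = phi(s, a)
            Lam += np.outer(f, f)
            if h + 1 < H:
                v_next = max(q_value(h + 1, s_next, b) for b in actions)
            else:
                v_next = 0.0
            target += f * (r + v_next)
        Lambda_inv[h] = np.linalg.inv(Lam)
        w[h] = Lambda_inv[h] @ target
    return w, q_value
```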
This paper revisits the temporal difference (TD) learning algorithm for policy evaluation tasks in reinforcement learning. Typically, the performance of TD(0) and TD($\lambda$) is very sensitive to the choice of stepsizes. Oftentimes, TD(0) suffers from slow convergence. Motivated by the tight link between the TD(0) learning algorithm and stochastic gradient methods, we develop a provably convergent adaptive projected variant of the TD(0) learning algorithm with linear function approximation, which we term AdaTD(0). In contrast to TD(0), AdaTD(0) is robust, or less sensitive, to the choice of stepsizes. Analytically, we establish that to reach an $\epsilon$ accuracy, the number of iterations needed is $\tilde{O}(\epsilon^{-2}\ln^4(1/\epsilon)/\ln^4(1/\rho))$ in the general case, where $\rho$ represents the speed at which the underlying Markov chain converges to its stationary distribution. This implies that the iteration complexity of AdaTD(0) is no worse than that of TD(0) in the worst case. When the stochastic semi-gradients are sparse, we provide theoretical acceleration of AdaTD(0). Going beyond TD(0), we develop an adaptive variant of TD($\lambda$), which is referred to as AdaTD($\lambda$). Empirically, we evaluate the performance of AdaTD(0) and AdaTD($\lambda$) on several standard reinforcement learning tasks, which demonstrates the effectiveness of our new approaches.
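The following is a minimal sketch of an adaptive projected TD(0) update with linear features, using an AdaGrad-style stepsize driven by accumulated semi-gradient norms and a projection onto a norm ball. It illustrates the general idea only; the exact AdaTD(0) recursion and its constants are not reproduced here.

```python
import numpy as np

def adaptive_projected_td0(env_samples, phi, d, gamma=0.99,
                           eta=1.0, radius=10.0, eps=1e-8):
    """Adaptive projected TD(0) with linear features (generic sketch).

    env_samples yields (s, r, s_next) transitions from the evaluated policy,
    phi(s) is a d-dimensional feature map, and radius bounds the projection ball.
    """
    theta = np.zeros(d)
    accum = 0.0                                   # running sum of squared semi-gradient norms
    for (s, r, s_next) in env_samples:
        f, f_next = phi(s), phi(s_next)
        td_error = r + gamma * f_next @ theta - f @ theta
        g = -td_error * f                         # stochastic semi-gradient
        accum += g @ g
        step = eta / (np.sqrt(accum) + eps)       # adaptive stepsize
        theta = theta - step * g
        norm = np.linalg.norm(theta)
        if norm > radius:                         # project back onto the ball
            theta *= radius / norm
    return theta
```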
In this letter, we develop a novel variant of the natural actor-critic algorithm using off-policy sampling and linear function approximation, and establish a sample complexity of $O(\epsilon^{-3})$, outperforming all previously known convergence bounds of such algorithms. To overcome the divergence caused by the deadly triad in off-policy policy evaluation under function approximation, we develop a critic that employs the n-step TD-learning algorithm with a properly chosen n. We present finite-sample convergence bounds on this critic, which are of independent interest. Furthermore, we develop a variant of natural policy gradient under function approximation with an improved convergence rate of $O(1/T)$ after T iterations. Combining the finite-sample bounds of the actor and the critic, we obtain an overall $O(\epsilon^{-3})$ sample complexity. Our results are derived solely from the assumption that the behavior policy sufficiently explores the state-action space, which is a much lighter assumption than those in the related literature.
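Below is a generic sketch of an n-step TD critic with linear features and per-step importance-sampling corrections for off-policy data. The correction scheme and interfaces are illustrative assumptions, not the exact critic analyzed in the letter.

```python
import numpy as np

def n_step_td_critic(trajectory, phi, d, n, gamma, alpha, rho_fn):
    """n-step TD critic with linear features and off-policy corrections (sketch).

    trajectory is a list of (s, a, r) tuples from the behavior policy,
    phi(s) is a d-dimensional state-feature map, and rho_fn(s, a) returns
    the importance ratio pi(a|s) / mu(a|s).
    """
    w = np.zeros(d)
    T = len(trajectory)
    for t in range(T - n):
        g, rho, disc = 0.0, 1.0, 1.0
        for k in range(n):                          # accumulate the corrected n-step return
            s_k, a_k, r_k = trajectory[t + k]
            rho *= rho_fn(s_k, a_k)
            g += disc * rho * r_k
            disc *= gamma
        s_t = trajectory[t][0]
        s_tn = trajectory[t + n][0]
        target = g + disc * rho * (phi(s_tn) @ w)   # bootstrap from the n-th state
        td_error = target - phi(s_t) @ w
        w += alpha * td_error * phi(s_t)
    return w
```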
Although Q-learning has achieved remarkable success in some practical cases, it often suffers from the overestimation problem in stochastic environments, which is commonly viewed as a shortcoming of Q-learning. Overestimated values are introduced by estimating the Q value of the next state, an effect well known as maximization bias. In this paper, we propose a more accurate method for estimating the Q value through Q-value decomposition and re-evaluation with similar samples, based on linear function approximation. Specifically, we reformulate the parameterized incremental update formula of Q-learning and demonstrate that the new formula is equivalent to the original one. Moreover, we propose a new parameterized incremental update formula of Q-learning to address the overestimation problem and present a more accurate computing method, which can be used in problems with continuous state spaces and stochastic environments. Experimentally, when compared with Doubly Bounded Q-learning and other Q-learning-based methods, the new algorithm achieves a performance improvement of more than 31% on Mountain Car and Cart Pole. Furthermore, the algorithm is robust to the learning rate and its memory capacity. Finally, the practical applicability of our algorithm is discussed through an analysis of time consumption.
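For context, the sketch below shows the standard parameterized incremental Q-learning update with linear features, whose max over next-state actions is the source of the maximization bias discussed above; the paper's decomposition and re-evaluation scheme is not reproduced here.

```python
import numpy as np

def linear_q_update(theta, phi, s, a, r, s_next, actions, alpha, gamma):
    """Standard incremental Q-learning update with linear features.

    phi(s, a) is a feature vector and theta its weight vector; the max over
    next-state actions is what introduces the maximization bias.
    """
    q_next = max(phi(s_next, b) @ theta for b in actions)
    td_error = r + gamma * q_next - phi(s, a) @ theta
    return theta + alpha * td_error * phi(s, a)
```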
We study query- and computationally efficient planning algorithms for discounted Markov decision processes (MDPs) with linear function approximation and a simulator. The agent is assumed to have local access to the simulator, meaning that the simulator can be queried only at states that have been encountered in previous steps. We propose two new algorithms for this setting, which we call confident Monte Carlo least-squares policy iteration (CONFIDENT MC-LSPI) and confident Monte Carlo Politex (CONFIDENT MC-POLITEX), respectively. The main novelty in our algorithms is that they gradually build a set of state-action pairs (a "core set") with which they can control the extrapolation errors. We show that our algorithms have polynomial query and computational cost in the dimension of the features, the effective planning horizon, and the targeted sub-optimality, while the cost remains independent of the size of the state space. An interesting technical contribution of our work is the introduction of a novel proof technique that makes use of a virtual policy iteration algorithm. We use this method to leverage existing results on approximate policy iteration with $\ell_\infty$-bounded error to show that our algorithm can learn the optimal policy for the given initial state with only local access to the simulator. We believe that this technique can be extended to broader settings beyond this work.
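The sketch below illustrates one plausible way a core set can be grown: a newly encountered state-action feature is added when its uncertainty under the regularized design matrix of the current core set exceeds a threshold. The test and threshold are assumptions for illustration, not the exact criterion used by CONFIDENT MC-LSPI.

```python
import numpy as np

def maybe_add_to_core_set(core_features, candidate, lam=1.0, tau=1.0):
    """Core-set style coverage check (illustrative sketch).

    core_features is a list of d-dimensional feature vectors already in the
    core set and candidate is the feature of a newly encountered state-action
    pair; the pair is added when its uncertainty exceeds tau.
    """
    d = candidate.shape[0]
    cov = lam * np.eye(d)
    for f in core_features:
        cov += np.outer(f, f)
    uncertainty = candidate @ np.linalg.solve(cov, candidate)
    if uncertainty > tau:                 # poorly covered: extrapolation would be unreliable
        core_features.append(candidate)
        return True
    return False
```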
Reinforcement learning is an efficient, widely used machine learning technique that performs well when the state and action spaces have a reasonable size. This is rarely the case for control-related problems such as controlling traffic signals, where the state space can be very large. To deal with the curse of dimensionality, a rough discretization of such a space can be employed; however, this is effective only up to a certain point. A way to mitigate this is to use techniques that generalize over the state space, such as function approximation. In this paper, a linear function approximation is used. Specifically, SARSA($\lambda$) with Fourier basis features is implemented to control traffic signals in the agent-based transport simulation MATSim. The results are compared not only to trivial controllers such as fixed-time control, but also to state-of-the-art rule-based adaptive methods. It is concluded that SARSA($\lambda$) with Fourier basis features is able to outperform such methods, especially in scenarios with varying traffic demands or unexpected events.
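As a concrete reference point, the sketch below builds Fourier basis features over a normalized state and applies an accumulating-trace SARSA($\lambda$) update with linear features; how the traffic-signal state from MATSim is encoded is specific to the paper and omitted here.

```python
import itertools
import numpy as np

def fourier_basis(order, dim):
    """Fourier basis features phi_i(s) = cos(pi * c_i . s) for s in [0, 1]^dim."""
    coeffs = np.array(list(itertools.product(range(order + 1), repeat=dim)))

    def phi(s):
        return np.cos(np.pi * coeffs @ np.asarray(s))
    return phi

def sarsa_lambda_step(theta, z, phi_s_a, phi_next, r, alpha, gamma, lam):
    """One accumulating-trace SARSA(lambda) update with linear features (sketch)."""
    td_error = r + gamma * phi_next @ theta - phi_s_a @ theta
    z = gamma * lam * z + phi_s_a           # eligibility trace
    theta = theta + alpha * td_error * z
    return theta, z
```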
In this paper, we provide two new stable online algorithms for the problem of prediction in reinforcement learning, i.e., estimating the value function of a model-free Markov reward process using the linear function approximation architecture, with memory and computation costs scaling quadratically in the size of the feature set. The algorithms employ a multi-timescale stochastic approximation variant of the popular cross-entropy optimization method, a model-based search method for finding the global optimum of a real-valued function. A proof of convergence of the algorithms using the ODE method is provided. We supplement our theoretical results with experimental comparisons. The algorithms achieve good performance fairly consistently on many RL benchmark problems with regard to computational efficiency, accuracy, and stability.
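To illustrate the search principle, the sketch below runs a plain batch cross-entropy optimization over the weights of a linear value estimate, minimizing a user-supplied objective such as an empirical Bellman-error criterion. The paper's algorithms are online, multi-timescale stochastic-approximation variants of this idea rather than the batch loop shown.

```python
import numpy as np

def cross_entropy_search(objective, d, iters=50, pop=100, elite_frac=0.2):
    """Plain cross-entropy search over a d-dimensional parameter vector.

    objective(w) returns a scalar to minimize, e.g. an empirical Bellman-error
    criterion for a linear value estimate.
    """
    mean, std = np.zeros(d), np.ones(d)
    n_elite = max(1, int(elite_frac * pop))
    for _ in range(iters):
        samples = mean + std * np.random.randn(pop, d)   # sample candidate weights
        scores = np.array([objective(w) for w in samples])
        elite = samples[np.argsort(scores)[:n_elite]]    # keep the best candidates
        mean, std = elite.mean(axis=0), elite.std(axis=0) + 1e-6
    return mean
```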
Edge caching is an increasingly vital technique in wireless networks, particularly for serving users' repeated demands, including real-time traffic data and map access in vehicular communications. This paper presents a three-tier system for edge caching, integrating massive multiple-input multiple-output (mMIMO) networks. We consider a scenario with an extensive file library whose size is larger than the aggregated caching capacity of the small base stations (SBSs). Each SBS proactively caches files based on dynamic file popularity that is unknown in advance. A distinguishing aspect of our model is its file-wise approach to the cache, instead of the conventional vector-wise methods based on aggregate popularity. This approach introduces additional computational challenges due to the extensive file library. We formulate an optimization problem to maximize the long-term discounted cache hit rate while minimizing the delivery delay. As the solution, two multi-agent federated Q-learning algorithms are proposed. The first algorithm employs selective updates of the Q-values of popular files to minimize computational overhead. The second algorithm incorporates linear function approximation (LFA) and tensor completion (TC) to streamline the updating process further, reducing the number of required parameters. Using the real-world MovieLens dataset and comparisons with various baseline algorithms, simulations demonstrate that our proposed algorithm can reduce the delay by around 2.60--21.29%, improve the cache hit rate by around 5.71--66.42%, and reduce the computational complexity by at most 97.91%.
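The sketch below gestures at the selective-update idea behind the first algorithm: Q-values are updated only for files currently among the most requested. The state encoding, the reward, the federation step, and all names here are assumptions made for illustration.

```python
import numpy as np

def selective_q_update(q_table, pop_counts, event, top_k, alpha, gamma):
    """Selective per-file Q-value update keyed to popularity (rough sketch).

    q_table maps file_id -> array of action values, pop_counts tracks request
    counts, and event = (file_id, action, reward, next_value).
    """
    file_id, action, reward, next_value = event
    pop_counts[file_id] = pop_counts.get(file_id, 0) + 1
    popular = sorted(pop_counts, key=pop_counts.get, reverse=True)[:top_k]
    if file_id in popular:                              # only popular files are updated
        q = q_table.setdefault(file_id, np.zeros(2))    # e.g. actions: cache / evict
        target = reward + gamma * next_value
        q[action] += alpha * (target - q[action])
    return q_table
```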
Q-learning with function approximation is one of the most empirically successful yet theoretically mysterious reinforcement learning (RL) algorithms, and was identified in [R. S. Sutton, in European Conference on Computational Learning Theory, Springer, New York, 1999, pp. 11-17] as one of the most important theoretical open problems in the RL community. Even in the basic setting where linear function approximation is used, there are well-known divergent examples. In this work, we propose a stable online variant of Q-learning with linear function approximation that uses a target network and truncation and is driven by a single trajectory of Markovian samples. We present finite-sample guarantees for the algorithm, which imply a sample complexity of $\tilde{O}(\epsilon^{-2})$ up to a function approximation error. Importantly, we establish the results under minimal assumptions and do not modify the problem parameters to achieve stability.
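The sketch below shows the general shape of linear Q-learning driven by a single trajectory with a periodically synchronized target network and a truncation of the iterates to a norm ball. The placement of the truncation and the synchronization schedule are illustrative assumptions rather than the paper's exact algorithm.

```python
import numpy as np

def q_learning_target_truncation(samples, phi, d, actions, gamma,
                                 alpha, sync_every, clip_radius):
    """Linear Q-learning with a target network and truncation (rough sketch).

    samples yields (s, a, r, s_next) from a single behavior trajectory,
    phi(s, a) is a d-dimensional feature map, the target weights are refreshed
    every sync_every steps, and iterates are kept in a ball of radius clip_radius.
    """
    theta = np.zeros(d)
    theta_target = theta.copy()
    for t, (s, a, r, s_next) in enumerate(samples, start=1):
        q_next = max(phi(s_next, b) @ theta_target for b in actions)  # bootstrap from the target
        td_error = r + gamma * q_next - phi(s, a) @ theta
        theta = theta + alpha * td_error * phi(s, a)
        norm = np.linalg.norm(theta)
        if norm > clip_radius:                                        # truncation step
            theta *= clip_radius / norm
        if t % sync_every == 0:
            theta_target = theta.copy()
    return theta
```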
Motivated by applications in reinforcement learning (RL), we study a nonlinear stochastic approximation (SA) algorithm under Markovian noise and establish its finite-sample convergence bounds under various stepsizes. Specifically, we show that when using a constant stepsize (i.e., $\alpha_k = \alpha$), the algorithm achieves exponentially fast convergence to a neighborhood (with radius $O(\alpha\log(1/\alpha))$) around the desired limit point. When using diminishing stepsizes with an appropriate decay rate, the algorithm converges at rate $O(\log(k)/k)$. Our proof is based on Lyapunov drift arguments, and to handle the Markovian noise, we exploit the fast mixing of the underlying Markov chain. To demonstrate the generality of our theoretical results on Markovian SA, we use them to derive finite-sample bounds for the popular Q-learning algorithm with linear function approximation, under a condition on the behavior policy. Importantly, we do not need to assume that the samples are i.i.d., and we do not require an artificial projection step in the algorithm. Numerical simulations corroborate our theoretical results.
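The two stepsize regimes in the bounds above can be made concrete with a generic SA loop of the form $x_{k+1} = x_k + \alpha_k F(x_k, Y_k)$, shown below; the choice of F, the noise model, and the decay constant are placeholders, and the sketch carries none of the analysis.

```python
import numpy as np

def stochastic_approximation(F, noise_seq, x0, num_steps,
                             alpha=None, decay=None):
    """Generic SA recursion x_{k+1} = x_k + alpha_k * F(x_k, Y_k) (sketch).

    noise_seq is an iterator yielding the Markovian samples Y_k and F is the
    (possibly nonlinear) update operator. Pass a constant alpha, or a decay
    constant c for the diminishing schedule alpha_k = c / (k + 1).
    """
    x = np.array(x0, dtype=float)
    for k in range(num_steps):
        y = next(noise_seq)
        step = alpha if alpha is not None else decay / (k + 1)
        x = x + step * F(x, y)
    return x
```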