
Refine Results

Document Type

  • 31 journal articles
  • 10 conference papers
  • 1 dissertation

Collection Scope

  • 42 electronic documents
  • 0 print holdings

Date Distribution

Subject Classification

  • 31 papers: Engineering
    • 17 papers: Computer Science and Technology...
    • 16 papers: Control Science and Engineering
    • 7 papers: Electrical Engineering
    • 3 papers: Information and Communication Engineering
    • 2 papers: Software Engineering
    • 1 paper: Mechanical Engineering
    • 1 paper: Power Engineering and Engineering Therm...
    • 1 paper: Chemical Engineering and Technology
    • 1 paper: Petroleum and Natural Gas Engineering
  • 10 papers: Science
    • 9 papers: Mathematics
    • 1 paper: Systems Science
    • 1 paper: Statistics (awardable in Science, ...
  • 5 papers: Management
    • 5 papers: Management Science and Engineering (...
  • 2 papers: Economics
    • 1 paper: Theoretical Economics
    • 1 paper: Applied Economics
  • 1 paper: Military Science

Topics

  • 42 papers: actor-critic alg...
  • 22 papers: reinforcement le...
  • 9 papers: markov decision ...
  • 5 papers: stochastic appro...
  • 4 papers: martingale
  • 4 papers: two timescale st...
  • 4 papers: policy gradient
  • 3 papers: risk-sensitive r...
  • 3 papers: normalized hadam...
  • 3 papers: markov decision ...
  • 3 papers: policy gradient ...
  • 3 papers: deep reinforceme...
  • 2 papers: continuous time ...
  • 2 papers: simultaneous per...
  • 2 papers: function approxi...
  • 2 papers: policy evaluatio...
  • 2 papers: nonholonomic mob...
  • 2 papers: mixed multi-agen...
  • 2 papers: conditional valu...
  • 2 papers: chance-constrain...

Institutions

  • 5 papers: indian inst sci ...
  • 3 papers: tata inst fundam...
  • 2 papers: mit informat & d...
  • 2 papers: boston univ div ...
  • 2 papers: syracuse univ de...
  • 2 papers: ibm research ban...
  • 2 papers: inria lille
  • 2 papers: boston univ ctr ...
  • 1 paper: inria
  • 1 paper: amazon-iisc post...
  • 1 paper: aeronautics and ...
  • 1 paper: fime
  • 1 paper: george washingto...
  • 1 paper: norwegian univ s...
  • 1 paper: boston univ dept...
  • 1 paper: univ paris cite
  • 1 paper: indian inst tech...
  • 1 paper: sun microsyst la...
  • 1 paper: univ ottawa dept...
  • 1 paper: edf r&d fime

Authors

  • 4 papers: bhatnagar shalab...
  • 3 papers: abdulla mohammed...
  • 3 papers: ghavamzadeh moha...
  • 3 papers: borkar vs
  • 2 papers: wang jing
  • 2 papers: d. sai koti redd...
  • 2 papers: konda vr
  • 2 papers: velipasalar sene...
  • 2 papers: gursoy m. cenk
  • 2 papers: shalabh bhatnaga...
  • 2 papers: zhong chen
  • 2 papers: paschalidis ioan...
  • 2 papers: pham huyen
  • 2 papers: warin xavier
  • 2 papers: mohammad ghavamz...
  • 2 papers: paschalidis ioan...
  • 1 paper: srikanth g. tami...
  • 1 paper: mishra nidhi
  • 1 paper: saha amrita
  • 1 paper: kumar s

Language

  • 37 papers: English
  • 5 papers: Other
Search criteria: Subject = "Actor-critic algorithms"
42 records; showing 11-20
Actor-critic-type learning algorithms for Markov decision processes
SIAM JOURNAL ON CONTROL AND OPTIMIZATION, 1999, Vol. 38, No. 1, pp. 94-123
Authors: Konda, VR; Borkar, VS (MIT Informat & Decis Syst Lab, Cambridge, MA 02139, USA; Tata Inst Fundamental Res, Sch Technol & Comp Sci, Bombay 400005, Maharashtra, India)
Algorithms for learning the optimal policy of a Markov decision process (MDP) based on simulated transitions are formulated and analyzed. These are variants of the well-known "actor-critic" (or "adaptiv...
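
As an illustration of the scheme this entry analyzes, here is a minimal two-timescale actor-critic sketch: a critic tracks state values by temporal differences on a faster step-size schedule, while the actor adjusts softmax action preferences on a slower one. The tabular setting, step-size exponents, and interfaces are illustrative assumptions, not the paper's exact recursions.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def tabular_actor_critic(P, R, gamma=0.95, steps=50000, seed=0):
    """Two-timescale tabular actor-critic sketch.

    P[s, a] is a next-state distribution and R[s, a] a reward (assumed
    interface). The critic V runs on a faster step size than the actor
    preferences theta, mirroring the two-timescale structure analyzed
    in this literature.
    """
    rng = np.random.default_rng(seed)
    n_states, n_actions = R.shape
    V = np.zeros(n_states)                   # critic: state values
    theta = np.zeros((n_states, n_actions))  # actor: softmax preferences
    s = 0
    for t in range(steps):
        alpha_critic = 1.0 / (1 + t) ** 0.6  # faster (slower-decaying) step
        alpha_actor = 1.0 / (1 + t)          # slower step
        pi = softmax(theta[s])
        a = rng.choice(n_actions, p=pi)
        s_next = rng.choice(n_states, p=P[s, a])
        delta = R[s, a] + gamma * V[s_next] - V[s]  # TD error
        V[s] += alpha_critic * delta
        grad_log = -pi
        grad_log[a] += 1.0                   # grad of log softmax policy
        theta[s] += alpha_actor * delta * grad_log
        s = s_next
    return V, theta
```
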
Control Randomisation Approach for Policy Gradient and Application to Reinforcement Learning in Optimal Switching
APPLIED MATHEMATICS AND OPTIMIZATION, 2025, Vol. 91, No. 1, pp. 1-33
Authors: Denkert, Robert; Pham, Huyen; Warin, Xavier (Humboldt Univ, Dept Math, Berlin, Germany; Ecole Polytech, CMAP, Palaiseau, France; Lab Finance Marches Energie, EDF R&D, Palaiseau, France; Lab Finance Marches Energie, FiME, Palaiseau, France)
We propose a comprehensive framework for policy gradient methods tailored to continuous-time reinforcement learning. This is based on the connection between stochastic control problems and randomised problems, enablin...
Policy gradient and actor-critic learning in continuous time and space: theory and algorithms
The Journal of Machine Learning Research, 2022, Vol. 23, No. 1, pp. 12603-12652
Authors: Yanwei Jia; Xun Yu Zhou (Department of Industrial Engineering and Operations Research, Columbia University, New York, NY; Department of Industrial Engineering and Operations Research & The Data Science Institute, Columbia University, New York, NY)
We study policy gradient (PG) for reinforcement learning in continuous time and space under the regularized exploratory formulation developed by Wang et al. (2020). We represent the gradient of the value function with...
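
For orientation, the exploratory formulation the abstract refers to can be written as an entropy-regularized relaxed-control objective; this is a hedged rendering in the spirit of Wang et al. (2020), with notation and discounting that may differ from the paper's:

```latex
% Entropy-regularized exploratory objective (illustrative notation):
% the agent controls a density \pi_s over actions rather than a point action.
J(x;\pi) = \mathbb{E}\!\left[\int_0^T \Big(\int_A r(s, X_s, a)\,\pi_s(a)\,\mathrm{d}a
          + \lambda\,\mathcal{H}(\pi_s)\Big)\,\mathrm{d}s + g(X_T)\;\Big|\;X_0 = x\right],
\qquad
\mathcal{H}(\pi) = -\int_A \pi(a)\ln\pi(a)\,\mathrm{d}a
```

Here λ > 0 weights the exploration bonus H(π) against the running reward r and terminal reward g.
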
An actor-critic algorithm for constrained Markov decision processes
SYSTEMS & CONTROL LETTERS, 2005, Vol. 54, No. 3, pp. 207-213
Authors: Borkar, VS (Tata Inst Fundamental Res, Sch Technol & Comp Sci, Bombay 400005, Maharashtra, India)
An actor-critic type reinforcement learning algorithm is proposed and analyzed for constrained controlled Markov decision processes. The analysis uses multiscale stochastic approximation theory and the 'envelope t...
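
The multiscale structure in this line of work can be caricatured as a primal-dual loop: a fast critic, a slower actor, and a slowest Lagrange-multiplier update enforcing the constraint. A sketch under assumed notation (one-step cost, a single constraint with a budget), not Borkar's exact algorithm:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def constrained_actor_critic(transition, n_states, n_actions,
                             budget=1.0, gamma=0.95, steps=20000, seed=0):
    """Primal-dual actor-critic caricature for a constrained MDP.

    `transition(s, a, rng) -> (s_next, cost, constraint_cost)` is an
    assumed interface. We minimize expected discounted cost while the
    multiplier lam is pushed up whenever constraint_cost exceeds budget.
    Three timescales: critic fastest, actor slower, multiplier slowest.
    """
    rng = np.random.default_rng(seed)
    V = np.zeros(n_states)
    theta = np.zeros((n_states, n_actions))
    lam, s = 0.0, 0
    for t in range(steps):
        a_critic = 1.0 / (1 + t) ** 0.55   # fastest step-size schedule
        a_actor = 1.0 / (1 + t) ** 0.8
        a_mult = 1.0 / (1 + t)             # slowest
        pi = softmax(theta[s])
        a = rng.choice(n_actions, p=pi)
        s_next, cost, ccost = transition(s, a, rng)
        lagrangian = cost + lam * ccost    # relaxed one-step cost
        delta = lagrangian + gamma * V[s_next] - V[s]
        V[s] += a_critic * delta
        grad_log = -pi
        grad_log[a] += 1.0
        theta[s] -= a_actor * delta * grad_log           # descend on cost
        lam = max(0.0, lam + a_mult * (ccost - budget))  # dual ascent
        s = s_next
    return theta, lam
```
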
A simultaneous perturbation stochastic approximation-based actor-critic algorithm for Markov decision processes
IEEE TRANSACTIONS ON AUTOMATIC CONTROL, 2004, Vol. 49, No. 4, pp. 592-598
Authors: Bhatnagar, S; Kumar, S (Indian Inst Sci, Dept Comp Sci & Automat, Bangalore 560012, Karnataka, India)
A two-timescale simulation-based actor-critic algorithm for solution of infinite-horizon Markov decision processes with finite state and compact action spaces under the discounted cost criterion is proposed. The algor...
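
The simultaneous perturbation ingredient is easy to show in isolation: every component of the parameter vector is perturbed at once by random signs, so a two-sided gradient estimate needs only two noisy function evaluations regardless of dimension. A generic SPSA sketch, not the paper's two-timescale actor-critic:

```python
import numpy as np

def spsa_gradient(J, theta, c, rng):
    """Two-measurement SPSA estimate of grad J at theta.

    J is a noisy scalar objective; c is the perturbation half-width.
    """
    delta = rng.choice([-1.0, 1.0], size=theta.shape)  # Rademacher signs
    return (J(theta + c * delta) - J(theta - c * delta)) / (2.0 * c * delta)

# Usage: stochastic gradient descent on a noisy quadratic (illustrative).
rng = np.random.default_rng(0)
J = lambda th: float(np.sum(th ** 2) + 0.01 * rng.normal())
theta = np.ones(5)
for t in range(1, 2001):
    theta -= (0.1 / t) * spsa_gradient(J, theta, c=0.1 / t ** 0.25, rng=rng)
print(theta)  # should be driven toward the origin
```
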
A Deep Actor-Critic Reinforcement Learning Framework for Dynamic Multichannel Access
IEEE TRANSACTIONS ON COGNITIVE COMMUNICATIONS AND NETWORKING, 2019, Vol. 5, No. 4, pp. 1125-1139
Authors: Zhong, Chen; Lu, Ziyang; Gursoy, M. Cenk; Velipasalar, Senem (Syracuse Univ, Dept Elect Engn & Comp Sci, Syracuse, NY 13244, USA)
To make efficient use of limited spectral resources, we propose in this work a deep actor-critic reinforcement learning based framework for dynamic multichannel access. We consider both a single-user case and a scenar...
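
In "deep" variants of this family, the tabular critic and softmax actor are replaced by neural networks. A minimal sketch of such a pair with a one-step actor-critic loss; the layer sizes, channel-selection framing, and tensor shapes are illustrative assumptions, not this paper's architecture:

```python
import torch
import torch.nn as nn

class DeepActorCritic(nn.Module):
    """Shared body with a policy head (channel logits) and a value head."""
    def __init__(self, obs_dim, n_channels, hidden=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.actor = nn.Linear(hidden, n_channels)  # action logits
        self.critic = nn.Linear(hidden, 1)          # state value

    def forward(self, obs):
        h = self.body(obs)
        dist = torch.distributions.Categorical(logits=self.actor(h))
        return dist, self.critic(h)

def ac_loss(model, obs, action, reward, next_obs, gamma=0.99):
    """One-step loss: the TD error trains the critic and weights the actor."""
    dist, value = model(obs)                     # obs: (B, obs_dim)
    with torch.no_grad():
        _, next_value = model(next_obs)
        target = reward + gamma * next_value.squeeze(-1)  # reward: (B,)
    td_error = target - value.squeeze(-1)
    actor_loss = -(dist.log_prob(action) * td_error.detach()).mean()
    critic_loss = td_error.pow(2).mean()
    return actor_loss + critic_loss
```
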
A sensitivity formula for risk-sensitive cost and the actor-critic algorithm
SYSTEMS & CONTROL LETTERS, 2001, Vol. 44, No. 5, pp. 339-346
Authors: Borkar, VS (Tata Inst Fundamental Res, Sch Technol & Comp Sci, Bombay 400005, Maharashtra, India)
We propose for risk-sensitive control of finite Markov chains a counterpart of the popular 'actor-critic' algorithm for classical Markov decision processes. The algorithm is based on a 'sensitivity formula...
A least squares temporal difference actor-critic algorithm with applications to warehouse management
NAVAL RESEARCH LOGISTICS, 2012, Vol. 59, No. 3-4, pp. 197-211
Authors: Estanjini, Reza Moazzez; Li, Keyong; Paschalidis, Ioannis Ch (Boston Univ, Dept Elect & Comp Engn, Div Syst Engn, Boston, MA 02215, USA; Boston Univ, Ctr Informat & Syst Engn, Boston, MA 02215, USA)
This article develops a new approximate dynamic programming (DP) algorithm for Markov decision problems and applies it to a vehicle dispatching problem arising in warehouse management. The algorithm is of the actor-cr...
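
The least-squares temporal-difference piece can be shown on its own: instead of stochastic critic updates, the value-function weights solve a linear system assembled from a batch of transitions. A sketch with an assumed feature-matrix layout:

```python
import numpy as np

def lstd_critic(phi, phi_next, rewards, gamma=0.95, ridge=1e-6):
    """LSTD(0) weights for a linear value function V(s) ~ phi(s) @ w.

    phi:      (T, k) features of visited states
    phi_next: (T, k) features of successor states
    rewards:  (T,)   one-step rewards
    Solves A w = b with A = sum phi (phi - gamma phi')^T, b = sum phi r.
    """
    A = phi.T @ (phi - gamma * phi_next)
    b = phi.T @ rewards
    # Small ridge term guards against a singular A on short trajectories.
    return np.linalg.solve(A + ridge * np.eye(A.shape[1]), b)
```
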
Strengthening Acquiring Knowledge for Optimizing Dynamic Delivery Routes
2025 International Conference on Intelligent Control, Computing and Communications, IC3 2025
Authors: Mishra, Nidhi; Tiwari, Ankita (Kalinga University, Department of CS & IT, Raipur, India)
One of the most important challenges for delivery networks in logistics is the dynamic route optimization problem, which becomes increasingly important over time given the complexity of real-time constraints, such as ...
An Actor-Critic Algorithm With Second-Order Actor and Critic
IEEE TRANSACTIONS ON AUTOMATIC CONTROL, 2017, Vol. 62, No. 6, pp. 2689-2703
Authors: Wang, Jing; Paschalidis, Ioannis Ch. (Boston Univ, Ctr Informat & Syst Engn, Boston, MA 02215, USA; Boston Univ, Dept Elect & Comp Engn, 8 St Marys St, Boston, MA 02215, USA; Boston Univ, Div Syst Engn, 8 St Marys St, Boston, MA 02215, USA)
Actor-critic algorithms solve dynamic decision-making problems by optimizing a performance metric of interest over a user-specified parametric class of policies. They employ a combination of an actor, making policy im...
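
One standard way to give the actor second-order information is to precondition the sampled policy gradient with an estimated Fisher information matrix; the sketch below shows that generic construction and is not claimed to match this paper's second-order actor and critic:

```python
import numpy as np

def preconditioned_actor_step(theta, scores, advantages,
                              step=0.05, damping=1e-3):
    """Natural-gradient-style actor update.

    scores:     (N, d) rows are grad_theta log pi(a_i | s_i)
    advantages: (N,)   advantage estimates (e.g., TD errors)
    """
    n = len(advantages)
    g = scores.T @ advantages / n   # sampled policy gradient
    F = scores.T @ scores / n       # empirical Fisher matrix
    direction = np.linalg.solve(F + damping * np.eye(len(g)), g)
    return theta + step * direction
```
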