Algorithms for learning the optimal policy of a Markov decision process (MDP) based on simulated transitions are formulated and analyzed. These are variants of the well-known "actor-critic" (or "adaptive critic") algorithm in the artificial intelligence literature. Distributed asynchronous implementations are considered. The analysis involves two-timescale stochastic approximation.
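For orientation, the following is a minimal, synchronous sketch of this two-timescale structure on a toy tabular MDP; the toy model and all names are illustrative and are not the paper's distributed asynchronous algorithm. The critic (value estimates) moves on the faster stepsize, the softmax actor on the slower one.

```python
import numpy as np

# Illustrative two-timescale actor-critic on a randomly generated tabular MDP.
rng = np.random.default_rng(0)
nS, nA, gamma = 5, 2, 0.95
P = rng.dirichlet(np.ones(nS), size=(nS, nA))   # transition kernel P[s, a, s']
R = rng.normal(size=(nS, nA))                   # reward table R[s, a]

V = np.zeros(nS)             # critic: state-value estimates
theta = np.zeros((nS, nA))   # actor: softmax policy parameters

def policy(s):
    p = np.exp(theta[s] - theta[s].max())
    return p / p.sum()

s = 0
for n in range(1, 100_000):
    a_fast = (n + 1) ** -0.6   # critic stepsize (faster timescale)
    b_slow = (n + 1) ** -1.0   # actor stepsize (slower timescale)
    pi = policy(s)
    a = rng.choice(nA, p=pi)
    s2 = rng.choice(nS, p=P[s, a])
    delta = R[s, a] + gamma * V[s2] - V[s]   # one-step TD error
    V[s] += a_fast * delta                   # fast critic update
    grad_log = -pi.copy()
    grad_log[a] += 1.0                       # gradient of log softmax policy
    theta[s] += b_slow * delta * grad_log    # slow actor update
    s = s2
```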
We propose a comprehensive framework for policy gradient methods tailored to continuous-time reinforcement learning. It is based on the connection between stochastic control problems and randomised problems, enabling applications across various classes of Markovian continuous-time control problems beyond diffusion models, including, e.g., regular, impulse, and optimal stopping/switching problems. By utilizing change of measure in the control randomisation technique, we derive a new policy gradient representation for these randomised problems, featuring parametrised intensity policies. We further develop actor-critic algorithms specifically designed to address general Markovian stochastic control problems. Our framework is demonstrated through its application to optimal switching problems, with two numerical case studies in the energy sector focusing on real options.
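As a hedged illustration of the change-of-measure mechanism (our notation, not the paper's exact representation): if the randomised controls switch at the jumps of a point process N whose intensity lambda_theta is the parametrised policy, the Girsanov density L_T against a unit-intensity reference measure yields a score-function form of the gradient of the performance J:

```latex
% Schematic score-function identity for an intensity-parametrised
% randomisation; G is the cumulative (discounted) reward. Illustrative only.
\nabla_\theta J(\theta) = \mathbb{E}\big[\, G \,\nabla_\theta \log L_T^\theta \,\big],
\qquad
\nabla_\theta \log L_T^\theta
  = \int_0^T \nabla_\theta \log \lambda_\theta(t, X_{t-})\, \mathrm{d}N_t
  - \int_0^T \nabla_\theta \lambda_\theta(t, X_t)\, \mathrm{d}t .
```

The paper's representation additionally involves the current value function; the identity above only shows where the parametrised intensity enters.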
We study policy gradient (PG) for reinforcement learning (RL) in continuous time and space under the regularized exploratory formulation developed by Wang et al. (2020). We represent the gradient of the value function with respect to a given parameterized stochastic policy as the expected integral of an auxiliary running reward function that can be evaluated using samples and the current value function. This representation effectively turns PG into a policy evaluation (PE) problem, enabling us to apply the martingale approach recently developed by Jia and Zhou (2022a) for PE to solve our PG problem. Based on this analysis, we propose two types of actor-critic algorithms for RL, where we learn and update value functions and policies simultaneously and alternatingly. The first type is based directly on the aforementioned representation, which involves future trajectories and is offline. The second type, designed for online learning, employs the first-order condition of the policy gradient and turns it into martingale orthogonality conditions. These conditions are then incorporated via stochastic approximation when updating policies. Finally, we demonstrate the algorithms by simulations in two concrete examples.
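Schematically, and in our own notation rather than the paper's exact statement, the martingale orthogonality conditions assert that the parametrised value function J_theta is correct exactly when a compensated reward process is a martingale, tested against arbitrary adapted processes xi:

```latex
% Hedged sketch: beta is the discount rate and r_s the running reward with
% the entropy regulariser of the exploratory formulation absorbed into it.
\mathbb{E}\!\left[\int_0^T \xi_t\, \mathrm{d}M_t\right] = 0
\quad \text{for all test processes } \xi,
\qquad
M_t := e^{-\beta t} J_\theta(t, X_t) + \int_0^t e^{-\beta s} r_s\, \mathrm{d}s .
```

Sampled increments of M then serve as the stochastic-approximation signal in the online updates.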
An actor-critic type reinforcement learning algorithm is proposed and analyzed for constrained controlled Markov decision processes. The analysis uses multiscale stochastic approximation theory and the 'envelope theorem' of mathematical economics.
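The typical primal-dual shape of such constrained actor-critic schemes, sketched in our own notation (the paper's precise updates may differ): with cost J(theta), constraint G(theta) <= C, and a projection Gamma keeping the multiplier nonnegative,

```latex
% Hedged sketch of the multi-timescale Lagrangian updates; hats denote
% simulation-based estimates.
L(\theta, \lambda) = J(\theta) + \lambda\,\big(G(\theta) - C\big),
\qquad
\theta_{n+1} = \theta_n - a_n\, \widehat{\nabla_\theta L}, \quad
\lambda_{n+1} = \Gamma\big[\lambda_n + b_n\,(\widehat{G}_n - C)\big],
```

with stepsizes satisfying b_n / a_n -> 0 so the multiplier moves on the slowest timescale; the envelope theorem is what licenses differentiating the optimised-in-theta Lagrangian with respect to lambda.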
A two-timescale simulation-based actor-critic algorithm is proposed for solving infinite-horizon Markov decision processes with finite state and compact action spaces under the discounted cost criterion. The algorithm performs gradient search on the slower timescale in the space of deterministic policies and uses simultaneous perturbation stochastic approximation (SPSA)-based estimates. On the faster scale, the value function corresponding to a given stationary policy is updated and averaged over a fixed number of epochs (for enhanced performance). A proof of convergence to a locally optimal policy is presented. Finally, numerical experiments using the proposed algorithm on flow control in a bottleneck link, modelled as a continuous-time queueing system, are shown.
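A hedged sketch of the SPSA estimate used on the slower timescale, assuming a (hypothetical) noisy evaluator J_hat of the discounted cost of the deterministic policy with parameters theta:

```python
import numpy as np

def spsa_gradient(J_hat, theta, c=0.1, rng=np.random.default_rng()):
    # Rademacher perturbation: +/-1 in every coordinate simultaneously.
    delta = rng.choice([-1.0, 1.0], size=theta.shape)
    # Two perturbed cost evaluations estimate every partial derivative at once.
    return (J_hat(theta + c * delta) - J_hat(theta - c * delta)) / (2.0 * c * delta)

# usage on the slow timescale: theta -= b_n * spsa_gradient(J_hat, theta)
```

The appeal in this setting is that the number of simulations per gradient estimate is independent of the dimension of the policy parameter.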
To make efficient use of limited spectral resources, in this work we propose a deep actor-critic reinforcement learning based framework for dynamic multichannel access. We consider both a single-user case and a scenario in which multiple users attempt to access channels simultaneously. We employ the proposed framework as a single agent in the single-user case, and extend it to a decentralized multi-agent framework in the multi-user scenario. In both cases, we develop algorithms for actor-critic deep reinforcement learning and evaluate the proposed learning policies via experiments and numerical results. In the single-user model, in order to evaluate the performance of the proposed channel access policy and the framework's tolerance of uncertainty, we explore different channel switching patterns and different switching probabilities. In the case of multiple users, we analyze the probability of each user accessing a channel with favorable conditions and the probability of collision. We also consider a time-varying environment to assess the adaptive ability of the proposed framework. Additionally, we provide comparisons (in terms of both average reward and time efficiency) between the proposed actor-critic deep reinforcement learning framework, a deep Q-network (DQN) based approach, random access, and the optimal policy when the channel dynamics are known.
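A minimal sketch of the single-agent building block, assuming small illustrative networks; layer sizes, names, and the one-step update are ours, not the paper's exact architecture:

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

n_channels, obs_dim = 8, 16
actor = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                      nn.Linear(64, n_channels))   # logits over channels
critic = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                       nn.Linear(64, 1))           # state-value estimate
opt = torch.optim.Adam(list(actor.parameters()) + list(critic.parameters()), lr=1e-3)

def act(obs):
    # sample a channel from the actor's current policy
    with torch.no_grad():
        return Categorical(logits=actor(obs)).sample().item()

def update(obs, action, reward, next_obs, gamma=0.99):
    dist = Categorical(logits=actor(obs))
    value = critic(obs)
    with torch.no_grad():
        target = reward + gamma * critic(next_obs)   # bootstrapped TD target
    advantage = (target - value).detach()
    # policy-gradient loss for the actor plus squared TD error for the critic
    loss = -dist.log_prob(torch.tensor(action)) * advantage + (target - value).pow(2)
    opt.zero_grad(); loss.sum().backward(); opt.step()
```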
We propose, for risk-sensitive control of finite Markov chains, a counterpart of the popular 'actor-critic' algorithm for classical Markov decision processes. The algorithm is based on a 'sensitivity formula' for the risk-sensitive cost and is shown to converge with probability one to the desired solution. The proof technique is an adaptation of the ordinary differential equations approach for the analysis of two-timescale stochastic approximation algorithms.
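For orientation, the risk-sensitive average cost and the multiplicative dynamic-programming equation it satisfies (standard forms; notation ours):

```latex
\rho(\pi) = \lim_{n \to \infty} \frac{1}{n}
  \log \mathbb{E}^{\pi}\!\left[ \exp\!\Big( \textstyle\sum_{t=0}^{n-1} c(X_t, a_t) \Big) \right],
\qquad
e^{\rho^*}\, V(x) = \min_{a}\; e^{c(x,a)} \sum_{y} p(y \mid x, a)\, V(y).
```

The 'sensitivity formula' plays a role analogous to that of the policy gradient theorem in the classical, risk-neutral actor-critic.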
This article develops a new approximate dynamic programming (DP) algorithm for Markov decision problems and applies it to a vehicle dispatching problem arising in warehouse management. The algorithm is of the actor-critic type and uses a least squares temporal difference learning method. It operates on a sample-path of the system and optimizes the policy within a prespecified class parameterized by a parsimonious set of parameters. The method is applicable to a partially observable Markov decision process setting where the measurements of state variables are potentially corrupted, and the cost is only observed through the imperfect state observations. We show that under reasonable assumptions, the algorithm converges to a locally optimal parameter set. We also show that the imperfect cost observations do not affect the policy and the algorithm minimizes the true expected cost. In the warehouse application, the problem is to dispatch sensor-equipped forklifts in order to minimize operating costs involving product movement delays and forklift maintenance. We consider instances where standard DP is computationally intractable. Simulation results confirm the theoretical claims of the article and show that our algorithm converges more smoothly than earlier actor-critic algorithms while substantially outperforming heuristics used in practice.
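A hedged sketch of the LSTD critic at the core of such methods, for a linear value approximation V(x) of the form phi(x) dotted with a weight vector w under a discount factor gamma; the feature map phi and sample format are illustrative placeholders:

```python
import numpy as np

def lstd(samples, phi, gamma=0.99, reg=1e-6):
    """samples: list of (x, cost, x_next) transitions from one sample path."""
    d = len(phi(samples[0][0]))
    A = reg * np.eye(d)           # small ridge term keeps A well conditioned
    b = np.zeros(d)
    for x, cost, x_next in samples:
        f, f_next = phi(x), phi(x_next)
        A += np.outer(f, f - gamma * f_next)
        b += cost * f
    return np.linalg.solve(A, b)  # weights of the fitted value function
```

Because LSTD solves for the fixed point in one linear solve, it extracts more from a single sample path than stepsize-based TD, which is one reason it is attractive when samples are expensive.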
Actor-critic algorithms solve dynamic decision making problems by optimizing a performance metric of interest over a user-specified parametric class of policies. They employ a combination of an actor, making policy improvement steps, and a critic, computing policy improvement directions. Many existing algorithms use a steepest ascent method to improve the policy, which is known to suffer from slow convergence for ill-conditioned problems. In this paper, we first develop an estimate of the (Hessian) matrix containing the second derivatives of the performance metric with respect to policy parameters. Using this estimate, we introduce a new second-order policy improvement method and couple it with a critic using a second-order learning method. We establish almost sure convergence of the new method to a neighborhood of a policy parameter stationary point. We compare the new algorithm with some existing algorithms in two applications and demonstrate that it leads to significantly faster convergence.
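The generic likelihood-ratio identity behind such Hessian estimates, in our notation (the paper's estimator is adapted to the actor-critic setting): with trajectory return R(tau) and trajectory density p_theta,

```latex
% Differentiating J(theta) = E[R(tau)] twice under the integral sign gives
\nabla_\theta^2 J(\theta)
  = \mathbb{E}_{\tau \sim p_\theta}\!\Big[ R(\tau)\,\big(
      \nabla_\theta \log p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau)^{\top}
      + \nabla_\theta^2 \log p_\theta(\tau) \big) \Big].
```

A Newton-type step using a suitably conditioned estimate of this matrix then removes the dependence on problem curvature that slows steepest ascent.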
In this paper, we propose two adaptive actor-critic architectures for solving control problems of nonlinear systems. One method uses the two actual states at times k and k+1 to update the learning algorithm; the basic idea is that the agent can take knowledge directly from the environment to improve its knowledge. The other method uses only the state at time k to update the algorithm and is called learning from prediction (or learning from simulated experience). Both methods include one or two predictive models, which are used to construct predictive states and a model-based actor (MBA). Here, the MBA can be viewed as a network whose connection weights are the elements of the feedback gain matrix. In the critic part, two value functions are realized as pure static mappings, which can be reduced to nonlinear current estimators by using radial basis function neural networks (RBFNNs). Simulation results obtained for a dynamical model of nonholonomic mobile robots with two independent driving wheels are presented; they show the effectiveness of the proposed approaches for the trajectory tracking control problem.
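A hedged sketch of the RBFNN critic idea: Gaussian radial basis features with a linear output layer, trained by a TD(0)-style rule. Centres, width, and the update rule are illustrative choices, not the paper's exact design:

```python
import numpy as np

centres = np.linspace(-1.0, 1.0, 9).reshape(-1, 1)  # RBF centres on a 1-D state space
width = 0.25
w = np.zeros(len(centres))                          # linear output-layer weights

def rbf(x):
    # Gaussian radial basis features of the state x
    return np.exp(-np.sum((np.atleast_1d(x) - centres) ** 2, axis=1) / (2 * width ** 2))

def td_update(x, reward, x_next, alpha=0.05, gamma=0.95):
    f = rbf(x)
    v, v_next = f @ w, rbf(x_next) @ w
    w[:] += alpha * (reward + gamma * v_next - v) * f  # TD(0) weight update
```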