We present four new reinforcement learning algorithms based on actor-critic, natural-gradient and function-approximation ideas, and we provide their convergence proofs. Actor-critic reinforcement learning methods are online approximations to policy iteration in which the value-function parameters are estimated using temporal difference learning and the policy parameters are updated by stochastic gradient descent. Methods based on policy gradients in this way are of special interest because of their compatibility with function-approximation methods, which are needed to handle large or infinite state spaces. The use of temporal difference learning in this way is of special interest because in many applications it dramatically reduces the variance of the gradient estimates. The use of the natural gradient is of interest because it can produce better conditioned parameterizations and has been shown to further reduce variance in some cases. Our results extend prior two-timescale convergence results for actor-critic methods by Konda and Tsitsiklis by using temporal difference learning in the actor and by incorporating natural gradients. Our results extend prior empirical studies of natural actor-critic methods by Peters, Vijayakumar and Schaal by providing the first convergence proofs and the first fully incremental algorithms. (C) 2009 Elsevier Ltd. All rights reserved.
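For concreteness, here is a minimal incremental natural actor-critic update in the spirit of the algorithms described above: a TD(0) critic, compatible features for the advantage, and an actor step along the natural gradient (which, for a compatible parameterization, is the advantage-weight vector itself). The softmax parameterization, step sizes, and function names are illustrative assumptions, not the paper's exact algorithms.

```python
import numpy as np

def softmax_probs(theta, act_feats):
    """Action probabilities for a softmax (Gibbs) policy with
    per-action feature vectors act_feats (n_actions x k)."""
    prefs = act_feats @ theta
    prefs -= prefs.max()                       # numerical stability
    p = np.exp(prefs)
    return p / p.sum()

def nac_update(theta, v, w, s_feat, s2_feat, a, r, act_feats,
               alpha=0.01, beta=0.1, gamma=0.99):
    """One incremental natural actor-critic step (sketch).

    theta -- policy parameters       v -- critic (value) weights
    w     -- compatible-feature advantage weights; for compatible
             features the natural policy gradient is w itself.
    """
    probs = softmax_probs(theta, act_feats)
    delta = r + gamma * (v @ s2_feat) - v @ s_feat   # TD(0) error
    v = v + beta * delta * s_feat                    # critic update
    psi = act_feats[a] - probs @ act_feats           # grad log pi(a|s)
    w = w + beta * (delta - psi @ w) * psi           # advantage fit
    w_new = w
    theta = theta + alpha * w_new                    # natural-gradient actor
    return theta, v, w
```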
This paper first presents a convergence analysis of the particle swarm optimization (PSO) system by treating it as a discrete-time linear time-variant system. Then, based on the resulting convergence conditions, dynamic optimal control of a deterministic PSO system for parameter optimization is studied using dynamic programming, and an approximate dynamic programming algorithm, swarm-based approximate dynamic programming (swarm-ADP), is proposed. Finally, numerical simulations validate the proposed dynamic optimization method.
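As a sketch of the system-theoretic view, the PSO recursion can be written, per particle, as a discrete-time linear time-variant system in the state (x_t, v_t) with the random acceleration coefficients as time-varying gains; freezing those gains yields the deterministic recursion whose stability is checked below. The coefficient values and the stability test are standard illustrations, not necessarily the paper's exact conditions.

```python
import numpy as np

def pso_ltv_step(x, v, pbest, gbest, w=0.729, c1=1.49, c2=1.49, rng=None):
    """One PSO update, viewable as a discrete-time linear time-variant
    system in (x, v) with time-varying random gains r1, r2."""
    rng = rng or np.random.default_rng()
    r1, r2 = rng.random(x.shape), rng.random(x.shape)
    v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)
    x = x + v
    return x, v

def deterministic_stable(w, phi):
    """Stability test for the frozen-gain (deterministic) recursion
    x_{t+1} = (1 + w - phi) x_t - w x_{t-1}, attractor at the origin:
    stable iff both eigenvalues lie inside the unit circle."""
    A = np.array([[1 + w - phi, -w],
                  [1.0, 0.0]])
    return np.all(np.abs(np.linalg.eigvals(A)) < 1)
```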
The purpose of this paper is to survey techniques for constructing effective policies for controlling complex networks, and to extend these techniques to capture special features of wireless communication networks under different networking scenarios. Among the key questions addressed are the relationship between static network equilibria and dynamic network control; the effect of coding on control and delay through rate regions; and routing, scheduling, and admission control. The approximation techniques surveyed are the basis of a specific formulation of an h-MaxWeight policy for network routing. Simulations show a 50% improvement in average delay performance compared to methods used in current practice.
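To make the policy class concrete: an h-MaxWeight policy chooses, in each state, the allocation that maximizes the expected one-step decrease of a surrogate value function h; with quadratic h this reduces to the classical MaxWeight rule. The interface below (queue vector, drift model) is an illustrative assumption.

```python
import numpy as np

def h_maxweight_action(q, actions, drift, grad_h):
    """Pick the action maximizing the expected decrease of h(q):
    argmax_a  -grad_h(q) . drift(q, a).

    q       -- vector of queue lengths
    actions -- iterable of feasible allocations
    drift   -- drift(q, a): expected one-step change in q under a
    grad_h  -- gradient of the surrogate value function h
    """
    g = grad_h(q)
    return max(actions, key=lambda a: -(g @ drift(q, a)))

# With h(q) = 0.5 * ||q||^2, so grad_h(q) = q, this reduces to the
# classical MaxWeight policy: serve to maximize q . (-drift).
```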
This paper addresses the problem of finding a control policy that drives a generic discrete event stochastic system from an initial state to a set of goal states with a specified probability. The control policy is iteratively constructed via an approximate dynamic programming (ADP) technique over a small subset of the state space that is evolved via Monte Carlo simulations. The effect of certain user-chosen parameters on the performance of the algorithm is investigated. The method is evaluated on several stochastic shortest path (SSP) examples and on a manufacturing job shop problem. We solve SSP problems that contain up to one million states to illustrate the scaling of the computational and memory benefits with respect to problem size. In the case of the manufacturing job shop example, the proposed ADP approach outperforms a traditional rolling-horizon mathematical programming approach. (C) 2009 Elsevier Ltd. All rights reserved.
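A one-sample-backup caricature of the loop the abstract describes: simulate to grow a working subset of the state space, then run Bellman backups restricted to that subset. The rollout policy, the optimistic boundary value, and all interface names are assumptions for illustration.

```python
import random

def adp_on_subset(start, goals, actions, sample_next, cost,
                  n_rollouts=200, horizon=50, sweeps=25):
    """ADP over a Monte-Carlo-grown subset of the state space (sketch)."""
    subset = {start}
    # 1) grow the working subset by simulating a (here: random) policy
    for _ in range(n_rollouts):
        s = start
        for _ in range(horizon):
            s = sample_next(s, random.choice(actions))
            subset.add(s)
            if s in goals:
                break
    # 2) one-sample Bellman backups restricted to the subset; states
    #    outside it get an optimistic cost-to-go of 0 (crude boundary)
    V = {s: 0.0 for s in subset}
    for _ in range(sweeps):
        for s in subset:
            if s in goals:
                continue
            V[s] = min(cost(s, a) + V.get(sample_next(s, a), 0.0)
                       for a in actions)
    return V
```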
In this paper, the near-optimal control problem for a class of nonlinear discrete-time systems with control constraints is solved by an iterative adaptive dynamic programming algorithm. First, a novel nonquadratic performance functional is introduced to handle the control constraints, and then an iterative adaptive dynamic programming algorithm is developed to solve the optimal feedback control problem of the original constrained system, with convergence analysis. In the present control scheme, three neural networks are used as parametric structures to facilitate the implementation of the iterative algorithm. Two examples are given to demonstrate the convergence and feasibility of the proposed optimal control scheme.
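The "nonquadratic performance functional" used to enforce actuator bounds in this literature is typically of the integral-of-inverse-hyperbolic-tangent form introduced by Lyshevski; a representative choice (an assumption about this paper's exact form, with $\bar{U}$ the saturation bound and $R \succ 0$ a weight matrix) is

$$ W(u) = 2 \int_0^{u} \left( \bar{U} \tanh^{-1}(s/\bar{U}) \right)^{\top} R \, \mathrm{d}s . $$

Because $\tanh^{-1}(s/\bar{U})$ diverges as $|s| \to \bar{U}$, minimizing a cost built from $W$ keeps the control strictly inside the bound, and the greedy control derived from the optimality condition takes the saturated form $u = -\bar{U} \tanh\!\big( \tfrac{1}{2} \bar{U}^{-1} R^{-1} g^{\top} \nabla V \big)$, where $g$ is the input gain and $V$ the value function.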
This paper presents a theory of how general-purpose learning-based intelligence is achieved in the mammal brain, and how we can replicate it. It reviews four generations of ever more powerful general-purpose learning designs in adaptive, approximate dynamic programming (ADP), which includes reinforcement learning as a special case. It reviews empirical results which fit the theory, and suggests important new directions for research within the scope of NSF's recent initiative on Cognitive Optimization and Prediction. The appendices suggest possible connections to the realms of human subjective experience, comparative cognitive neuroscience, and new challenges in electric power. The major challenge before us today in mathematical neural networks is to replicate the "mouse level", but the paper does contain a few thoughts about building, understanding and nourishing levels of general intelligence beyond the mouse. Published by Elsevier Ltd.
An approximate dynamic programming (ADP) strategy for a dual adaptive control problem is presented. An optimal control policy for a dual adaptive control problem can be derived by solving a stochastic dynamic programming problem, which is computationally intractable with conventional solution methods that involve sampling the complete hyperstate space. To solve the problem in a computationally amenable manner, we perform closed-loop simulations with different control policies to generate a data set that defines a subset of the hyperstate space within which the Bellman equation is iterated. A local approximator with a penalty function is designed for estimating cost-to-go values over the continuous hyperstate space. An integrating process with an unknown gain is used for illustration.
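A minimal sketch of the two ingredients named above: a local approximator for cost-to-go values over the continuous hyperstate space, with a distance penalty that discourages relying on estimates far from the simulated data, and a value-iteration sweep restricted to the sampled hyperstates. The k-NN form of the local model, the penalty shape, and all names are assumptions.

```python
import numpy as np

def local_cost_to_go(h_query, H, J, k=5, penalty=10.0):
    """Estimate J(h_query) from sampled hyperstates H (rows) with
    cost-to-go values J, by k-NN averaging plus a distance penalty
    that inflates estimates far from the simulated data (sketch)."""
    d = np.linalg.norm(H - h_query, axis=1)
    idx = np.argsort(d)[:k]
    return J[idx].mean() + penalty * d[idx].mean()

def bellman_sweep(H, J, actions, simulate, stage_cost, gamma=0.99):
    """One value-iteration sweep over the sampled hyperstate subset."""
    J_new = np.empty_like(J)
    for i, h in enumerate(H):
        J_new[i] = min(stage_cost(h, a)
                       + gamma * local_cost_to_go(simulate(h, a), H, J)
                       for a in actions)
    return J_new
```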
We develop a network revenue management model to jointly make capacity control and overbooking decisions. Our approach is based on the observation that if the penalty cost of denying boarding to the reservations at departure time were given by a separable function, then the dynamic programming formulation of the network revenue management problem would decompose by itinerary and could be solved by focusing on one itinerary at a time. Motivated by this observation, we use an iterative, simulation-based method to build separable approximations to the penalty cost incurred at departure time. Computational experiments compare our model with two benchmark strategies based on a deterministic linear programming formulation. The profits obtained by our model improve over those obtained by the benchmark strategies by about 3 per cent on average, which is a significant figure in the network revenue management setting. For test problems with tight leg capacities, the profit improvements can be as high as 13 per cent.
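To make the decomposition explicit (the notation is illustrative): if the terminal denied-boarding penalty separates over itineraries as $\Phi(x) \approx \sum_j \Phi_j(x_j)$, where $x_j$ is the number of reservations on hand for itinerary $j$, then the value functions of the joint dynamic program inherit the same structure, $V_t(x) \approx \sum_j V_{t,j}(x_j)$, and each $V_{t,j}$ can be computed by a one-dimensional dynamic program over that itinerary's reservation count alone. The paper constructs the separable approximations $\Phi_j$ iteratively by simulation.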
ISBN (print): 9783642015120
Machine learning for mobile robots has attracted much research interest in recent years. However, many challenges remain in applying learning techniques to real mobile robots, e.g., generalization in continuous spaces, learning efficiency, and convergence. In this paper, a reinforcement learning path-following control strategy based on approximate policy iteration (API) is developed for a real mobile robot. Among its advantages, optimized control policies can be obtained without much a priori knowledge of the dynamic model of the mobile robot. Two kinds of API-based control methods, i.e., API with linear approximation and API with kernel machines, are implemented in the path-following control task, and the efficiency of the proposed control strategy is illustrated in experimental studies on a real mobile robot based on the Pioneer3-AT platform. Experimental results verify that the API-based learning controller has better convergence and path-following accuracy than conventional PD control methods. Finally, the learning control performance of the two API methods is also evaluated and compared.
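As a sketch of the linear-approximation variant of API mentioned above: policy evaluation by LSTD-Q from a fixed batch of transition samples, alternated with greedy improvement (the kernel-machine variant would replace the feature map with kernel evaluations). The sample format, feature map, and function names are assumptions.

```python
import numpy as np

def lstdq(samples, phi, policy, gamma=0.95, reg=1e-6):
    """LSTD-Q: fit w so that Q(s, a) ~= w . phi(s, a) from transition
    samples (s, a, r, s2), evaluating the given policy (sketch)."""
    k = phi(*samples[0][:2]).size
    A = reg * np.eye(k)                      # ridge term for invertibility
    b = np.zeros(k)
    for s, a, r, s2 in samples:
        f = phi(s, a)
        f2 = phi(s2, policy(s2))
        A += np.outer(f, f - gamma * f2)
        b += r * f
    return np.linalg.solve(A, b)

def api(samples, phi, actions, n_iters=10, gamma=0.95):
    """Approximate policy iteration: alternate LSTD-Q policy
    evaluation with greedy policy improvement."""
    w = np.zeros(phi(*samples[0][:2]).size)
    greedy = lambda s: max(actions, key=lambda a: w @ phi(s, a))
    for _ in range(n_iters):
        w = lstdq(samples, phi, greedy, gamma)   # greedy sees updated w
    return greedy, w
```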
ISBN (print): 9780769536064
Portfolio management deals with the allocation of wealth among different investment opportunities, considering the investor's preferences on risk. In this paper we consider a multiperiod model where the investor rebalances a portfolio at the beginning of each period, facing uncertainty in the prices of the assets at future dates. Models of this decision problem tend to become very large because of the dynamic structure and the uncertainty. We present a multiple-period portfolio model over a finite horizon with transaction costs, a risk-averse utility function, and uncertainty modeled using the scenario approach. We propose a new method for efficiently solving real problems; the procedure combines stochastic programming with decomposition and approximation techniques, and solving the resulting optimization problem relies on approximate dynamic programming. The effectiveness of the technique is demonstrated by the experimental results.
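A minimal sketch of the backward scenario recursion underlying such approaches: each period the investor picks a rebalancing decision on a grid, pays proportional transaction costs, and the value is propagated backward over sampled return scenarios. The two-asset setup, grid policy space, time-additive utility, and the neglect of post-return weight drift are all simplifying assumptions for illustration, not the paper's model.

```python
import numpy as np

def portfolio_adp(returns, utility, n_grid=11, tc=0.002):
    """Backward recursion for a multiperiod rebalancing problem on
    sampled return scenarios (sketch). returns[t] is a list of
    (risky, safe) gross-return pairs for period t; the decision is
    the risky-asset weight on a grid; tc is a proportional
    transaction cost."""
    grid = np.linspace(0.0, 1.0, n_grid)      # candidate risky weights
    V = np.zeros(n_grid)                      # terminal value: 0
    for scen in reversed(returns):            # backward in time
        V_new = np.empty(n_grid)
        for i, w_prev in enumerate(grid):
            best = -np.inf
            for j, w in enumerate(grid):
                trade = tc * abs(w - w_prev)  # proportional cost
                exp_val = np.mean([utility(w * rr + (1 - w) * rs - trade)
                                   for rr, rs in scen]) + V[j]
                best = max(best, exp_val)
            V_new[i] = best
        V = V_new
    return grid, V   # V[i]: value entering period 0 holding weight grid[i]

# Example usage (with log utility as the risk-averse choice):
# grid, V = portfolio_adp([[(1.08, 1.02), (0.95, 1.02)]] * 4, np.log)
```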