ISBN (print): 9781424407064
Autonomous drive of a wheeled mobile robot (WMR) requires implementing velocity and path tracking control subject to complex dynamical constraints. Conventionally, this control design is obtained by analysis and synthesis of the WMR system. This paper presents a dual heuristic programming (DHP) adaptive critic design of the motion control system that enables the WMR to achieve the control objective simply by learning through trials. The design consists of an adaptive critic velocity neuro-control loop and a posture neuro-control loop. The neural weights in the velocity neuro-controller (VNC) are corrected with the DHP adaptive critic method. The designer simply expresses the control objective with a utility function, and the VNC learns by sequential optimization to satisfy that objective. The posture neuro-controller (PNC) approximates the inverse velocity model of the WMR so as to map planned positions to desired velocities. Supervised drive of the WMR at varying velocities supplies training samples for the PNC and VNC to set up the neural weights. In autonomous drive, the learning mechanism keeps improving the PNC and VNC. The design is evaluated on an experimental WMR. The excellent results confirm that the DHP adaptive critic motion control design enables the WMR to develop its control ability autonomously.
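To make the adaptive critic loop concrete, the following is a minimal sketch of a DHP-style critic/actor update, written for a linear-quadratic surrogate system rather than the WMR dynamics or the neuro-controllers of the paper; the matrices, learning rates, and reset rule are assumptions made purely for illustration.

```python
import numpy as np

# Minimal DHP adaptive-critic sketch on a linear-quadratic surrogate system
# (the paper's WMR dynamics and neuro-controllers are replaced by linear maps
# purely for illustration; A, B, Q, R, and the learning rates are assumptions).
A = np.array([[1.0, 0.1], [0.0, 1.0]])   # assumed state-transition matrix
B = np.array([[0.0], [0.1]])             # assumed input matrix
Q = np.eye(2) * 1.0                      # state cost in the utility U
R = np.eye(1) * 0.1                      # control cost in the utility U
gamma = 0.95                             # discount factor

Wc = np.zeros((2, 2))   # critic: lambda(x) = Wc @ x  (estimate of dJ/dx)
Wa = np.zeros((1, 2))   # actor:  u(x)      = Wa @ x
lr_c, lr_a = 0.01, 0.005

x = np.array([1.0, 0.0])
for t in range(5000):
    u = Wa @ x
    x_next = A @ x + B @ u

    # DHP critic target: derivative of the Bellman equation w.r.t. the state,
    # propagated through both the direct path and the actor path.
    dU_dx = 2 * Q @ x + Wa.T @ (2 * R @ u)       # dU/dx + (du/dx)^T dU/du
    dxnext_dx = A + B @ Wa                       # total dx_{t+1}/dx_t
    lam_next = Wc @ x_next
    lam_target = dU_dx + gamma * dxnext_dx.T @ lam_next

    # Gradient-descent corrections of the critic and actor weights.
    lam = Wc @ x
    Wc -= lr_c * np.outer(lam - lam_target, x)
    dJ_du = 2 * R @ u + gamma * B.T @ lam_next   # derivative of cost-to-go w.r.t. u
    Wa -= lr_a * np.outer(dJ_du, x)

    # Restart the rollout if the surrogate state drifts too far.
    x = x_next if np.linalg.norm(x_next) < 10 else np.array([1.0, 0.0])
```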
In this correspondence, adaptive critic approximate dynamic programming designs are derived to solve the discrete-time zero-sum game in which the state and action spaces are continuous. This results in a forward-in-time reinforcement learning algorithm that converges to the Nash equilibrium of the corresponding zero-sum game. The results in this correspondence can be thought of as a way to solve the Riccati equation of the well-known discrete-time H-infinity optimal control problem forward in time. Two schemes are presented, namely: 1) heuristic dynamic programming and 2) dual heuristic dynamic programming, to solve for the value function and the costate of the game, respectively. An H-infinity autopilot design for an F-16 aircraft is presented to illustrate the results.
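As an illustration of what solving the Riccati equation forward in time means, here is a sketch of the value-function iteration written out for the linear-quadratic zero-sum game, where the value is x'Px; the small system below is an arbitrary stand-in, not the F-16 model, and the attenuation level gamma is an assumption.

```python
import numpy as np

# Forward-in-time value iteration for the discrete-time zero-sum game, written
# out for the linear-quadratic case V(x) = x' P x.  The matrices are a small
# illustrative example (not the F-16 autopilot); gamma is the assumed
# H-infinity attenuation level, chosen well above the achievable level.
A = np.array([[0.8, 0.2], [0.0, 0.5]])
B = np.array([[1.0], [0.0]])     # control input channel
E = np.array([[0.0], [1.0]])     # disturbance input channel
Q, R = np.eye(2), np.eye(1)
gamma = 4.0

P = np.zeros((2, 2))             # value-function kernel, initialized at zero
for k in range(500):
    # Stack the control and disturbance players into one block quadratic form.
    Ruu = R + B.T @ P @ B
    Rww = E.T @ P @ E - gamma**2 * np.eye(1)
    Ruw = B.T @ P @ E
    M = np.block([[Ruu, Ruw], [Ruw.T, Rww]])
    N = np.vstack([B.T @ P @ A, E.T @ P @ A])
    P_next = Q + A.T @ P @ A - N.T @ np.linalg.solve(M, N)
    if np.max(np.abs(P_next - P)) < 1e-10:
        break
    P = P_next

print("Converged value kernel P:\n", P)
```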
ISBN (print): 9781424407064
We investigate the dual approach to dynamic programming and reinforcement learning, based on maintaining an explicit representation of stationary distributions as opposed to value functions. A significant advantage of the dual approach is that it allows one to exploit well-developed techniques for representing, approximating and estimating probability distributions, without running the risks associated with divergent value function estimation. A second advantage is that some distinct algorithms for the average-reward and discounted-reward cases in the primal become unified under the dual. In this paper, we present a modified dual of the standard linear program that guarantees a globally normalized state visit distribution is obtained. With this reformulation, we then derive novel dual forms of dynamic programming, including policy evaluation, policy iteration and value iteration. Moreover, we derive dual formulations of temporal difference learning to obtain new forms of Sarsa and Q-learning. Finally, we scale these techniques up to large domains by introducing approximation, and develop new approximate off-policy learning algorithms that avoid the divergence problems associated with the primal approach. We show that the dual view yields a viable alternative to standard value-function-based techniques and opens new avenues for solving dynamic programming and reinforcement learning problems.
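For readers unfamiliar with the dual view, the following sketch solves the standard dual linear program of a small discounted MDP, optimizing over state-action visit distributions instead of values; the toy MDP, start distribution, and use of scipy's linprog are assumptions, and the paper's modified dual and temporal-difference variants are not reproduced here.

```python
import numpy as np
from scipy.optimize import linprog

# Dual linear program of a discounted MDP: the decision variables are
# state-action visit distributions d(s, a) rather than values.  The tiny
# random MDP, the start distribution mu, and gamma are assumptions.
rng = np.random.default_rng(0)
nS, nA, gamma = 4, 2, 0.9
P = rng.dirichlet(np.ones(nS), size=(nS, nA))   # P[s, a, s'] transition kernel
r = rng.random((nS, nA))                        # rewards r[s, a]
mu = np.full(nS, 1.0 / nS)                      # start-state distribution

# Flow constraints:
# sum_a d(s',a) = (1-gamma) mu(s') + gamma sum_{s,a} P(s'|s,a) d(s,a)
A_eq = np.zeros((nS, nS * nA))
for s_next in range(nS):
    for s in range(nS):
        for a in range(nA):
            A_eq[s_next, s * nA + a] = (s == s_next) - gamma * P[s, a, s_next]
b_eq = (1 - gamma) * mu

# Maximize expected reward under the visit distribution (linprog minimizes).
res = linprog(c=-r.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
d = res.x.reshape(nS, nA)                       # normalized visit distribution
policy = d / d.sum(axis=1, keepdims=True)       # recover a stationary policy
print("optimal visit distribution:\n", d, "\npolicy:\n", policy)
```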
ISBN (print): 9781424407064
We are interested in finding the most effective combination of off-line and on-line/real-time training in approximate dynamic programming. We introduce our approach of combining proven off-line methods of training for robustness with a group of on-line methods. Training for robustness is carried out on reasonably accurate models with the multi-stream Kalman filter method [1], whereas on-line adaptation is performed either with the help of a critic or by methods resembling reinforcement learning. We also illustrate the importance of using recurrent neural networks for both the controller/actor and the critic.
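As a rough sketch of Kalman-filter-based weight training, the snippet below runs a single-stream extended Kalman filter over the weights of a tiny feedforward network; the multi-stream variant cited as [1], the recurrent networks, and the on-line critic are not shown, and the network size, noise covariances, and finite-difference Jacobian are assumptions made for illustration.

```python
import numpy as np

# Single-stream EKF weight update for a small feedforward network, as a
# simplified stand-in for the multi-stream Kalman filter training method.
rng = np.random.default_rng(1)

def forward(w, x, n_h=8):
    """Tiny 1-input, 1-output network with one tanh hidden layer."""
    W1 = w[:n_h].reshape(n_h, 1); b1 = w[n_h:2 * n_h]
    W2 = w[2 * n_h:3 * n_h];      b2 = w[3 * n_h]
    return W2 @ np.tanh(W1 @ x + b1) + b2

n_w = 3 * 8 + 1
w = rng.normal(0, 0.1, n_w)          # network weights (the EKF state)
Pcov = np.eye(n_w) * 10.0            # weight covariance
Rmeas, Qproc = 0.1, 1e-6             # measurement and process noise (assumed)

for step in range(2000):
    x = rng.uniform(-np.pi, np.pi, size=1)
    d = np.sin(x)                    # supervised target
    y = forward(w, x)

    # Jacobian of the output w.r.t. the weights by finite differences.
    H = np.array([(forward(w + 1e-6 * np.eye(n_w)[i], x) - y) / 1e-6
                  for i in range(n_w)]).reshape(1, n_w)

    # Standard EKF update of the weight estimate and its covariance.
    S = H @ Pcov @ H.T + Rmeas
    K = Pcov @ H.T / S
    w = w + (K * (d - y)).ravel()
    Pcov = Pcov - K @ H @ Pcov + Qproc * np.eye(n_w)
```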
ISBN (print): 9781424407064
We describe an approach towards reducing the curse of dimensionality for deterministic dynamic programming with continuous actions by randomly sampling actions while computing a steady-state value function and policy. This approach results in globally optimized actions, without searching over a discretized multidimensional grid. We present results on finding time-invariant control laws for two-, four-, and six-dimensional deterministic swing-up problems with up to 480 million discretized states.
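A minimal sketch of the idea, assuming a coarse grid and a 1-D pendulum swing-up in place of the paper's higher-dimensional problems: each value-iteration sweep evaluates a fresh batch of randomly sampled continuous actions rather than a fixed action grid.

```python
import numpy as np

# Value iteration with randomly sampled continuous actions on a coarse grid
# for a 1-D pendulum swing-up (theta = 0 is the upright target).  The
# dynamics, grid resolution, torque limit, and sample count are assumptions.
n_th, n_w, dt = 51, 51, 0.05
ths = np.linspace(-np.pi, np.pi, n_th)
ws = np.linspace(-8.0, 8.0, n_w)
TH, W = np.meshgrid(ths, ws, indexing="ij")
V = np.zeros((n_th, n_w))                     # cost-to-go on the grid
rng = np.random.default_rng(0)

def step(th, w, u):
    """Deterministic pendulum dynamics (unit mass/length, mild damping)."""
    w_new = np.clip(w + dt * (np.sin(th) - 0.1 * w + u), -8.0, 8.0)
    th_new = (th + dt * w_new + np.pi) % (2 * np.pi) - np.pi
    return th_new, w_new

def to_index(th, w):
    """Nearest-neighbor lookup of grid indices."""
    i = np.clip(np.round((th + np.pi) / (2 * np.pi) * (n_th - 1)), 0, n_th - 1)
    j = np.clip(np.round((w + 8.0) / 16.0 * (n_w - 1)), 0, n_w - 1)
    return i.astype(int), j.astype(int)

for sweep in range(300):
    best = np.full_like(V, np.inf)
    # Sample a fresh batch of continuous actions each sweep instead of
    # searching over a fixed multidimensional action grid.
    for u in rng.uniform(-2.0, 2.0, size=30):
        cost = TH**2 + 0.1 * W**2 + 0.01 * u**2
        i, j = to_index(*step(TH, W, u))
        best = np.minimum(best, cost * dt + 0.99 * V[i, j])
    V = best
```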
ISBN (print): 9781424407064
In this paper, a greedy iteration scheme based on approximate dynamic programming (ADP), namely heuristic dynamic programming (HDP), is used to solve for the value function of the Hamilton-Jacobi-Bellman (HJB) equation that appears in discrete-time (DT) nonlinear optimal control. Two neural networks are used: one to approximate the value function and one to approximate the optimal control action. The importance of ADP is that it allows one to solve the HJB equation for general nonlinear discrete-time systems by using a neural network to approximate the value function. The contribution of this paper is a rigorous proof of convergence of the HDP iteration scheme for general discrete-time nonlinear systems with continuous state and action spaces. Two examples are provided. The first is a linear system, where ADP is found to converge to the correct solution of the algebraic Riccati equation (ARE). The second considers a nonlinear control system.
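The linear example can be reproduced in a few lines: on V_k(x) = x'P_k x the greedy HDP iteration collapses to the Riccati difference recursion, which converges to the discrete-time ARE solution; the system matrices below are an arbitrary small example, not the ones used in the paper.

```python
import numpy as np
from scipy.linalg import solve_discrete_are

# Linear-quadratic special case of the greedy HDP iteration: value iteration
# on V_k(x) = x' P_k x is the Riccati difference recursion.
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
Q, R = np.eye(2), np.eye(1)

P = np.zeros((2, 2))                          # V_0 = 0, as in greedy HDP
for k in range(2000):
    # Greedy policy for the current value, then value update under it.
    K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
    P_next = Q + A.T @ P @ A - A.T @ P @ B @ K
    if np.max(np.abs(P_next - P)) < 1e-12:
        break
    P = P_next

print("HDP iterate:\n", P)
print("ARE solution:\n", solve_discrete_are(A, B, Q, R))
```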
ISBN (print): 9781424407064
We define a new type of policy, the knowledge gradient policy, in the context of an offline learning problem. We show how to compute the knowledge gradient policy efficiently and demonstrate through Monte Carlo simulations that it performs as well as or better than a number of existing learning policies.
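For concreteness, here is one standard way to compute a knowledge-gradient decision under independent normal beliefs with known measurement noise; the belief model, the closed-form KG factor, and the toy problem are assumptions, since the abstract does not spell out the computation.

```python
import numpy as np
from scipy.stats import norm

# Knowledge-gradient measurement decision for independent normal beliefs
# about a small set of alternatives (assumed belief model and noise).
def knowledge_gradient(mu, sigma2, noise_var):
    """Return the index of the alternative to measure next."""
    # Predictive reduction in uncertainty from one measurement of each alternative.
    sigma_tilde = np.sqrt(sigma2**2 / (sigma2 + noise_var))
    # Distance from each mean to the best of the others, in sigma_tilde units.
    best_other = np.array([np.max(np.delete(mu, i)) for i in range(len(mu))])
    zeta = -np.abs(mu - best_other) / sigma_tilde
    kg = sigma_tilde * (zeta * norm.cdf(zeta) + norm.pdf(zeta))
    return int(np.argmax(kg))

# Usage: sequentially measure, then update the normal beliefs by conjugacy.
rng = np.random.default_rng(0)
truth = np.array([1.0, 1.2, 0.8, 1.15])       # unknown true means (toy data)
mu, sigma2, noise_var = np.zeros(4), np.ones(4) * 4.0, 1.0
for n in range(30):
    i = knowledge_gradient(mu, sigma2, noise_var)
    y = truth[i] + rng.normal(0, np.sqrt(noise_var))
    mu[i] = (mu[i] / sigma2[i] + y / noise_var) / (1 / sigma2[i] + 1 / noise_var)
    sigma2[i] = 1.0 / (1 / sigma2[i] + 1 / noise_var)
print("believed best alternative:", int(np.argmax(mu)))
```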
ISBN (print): 9781424407064
In this paper, we suggest and analyze the use of approximate reinforcement learning techniques for a new category of challenging benchmark problems from the field of Operations Research. We demonstrate that interpreting and solving the task of job-shop scheduling as a multi-agent learning problem is beneficial for obtaining near-optimal solutions and can very well compete with alternative solution approaches. The evaluation of our algorithms focuses on numerous established Operations Research benchmark problems.
ISBN (print): 9781424407064
We consider the problem of learning in a factored-state Markov Decision Process that is structured to allow a compact representation. We show that the well-known algorithm factored Rmax performs near-optimally on all but a number of timesteps that is polynomial in the size of the compact representation, which is often exponentially smaller than the number of states. This is equivalent to the result obtained by Kearns and Koller for their DBN-E-3 algorithm, except that we have conducted the analysis in a more general setting. We also extend the results to a new algorithm, factored IE, that uses the Interval Estimation approach to exploration and can be expected to outperform factored Rmax on most domains.
ISBN (print): 9781424407064
Considerable research has been done on reinforcement learning in continuous environments, but research on problems where the actions can also be chosen from a continuous space is much more limited. We present a new class of algorithms named Continuous Actor Critic Learning Automaton (CACLA) that can handle continuous states and actions. The resulting algorithm is straightforward to implement. An experimental comparison is made between this algorithm and other algorithms that can handle continuous action spaces. These experiments show that CACLA performs much better than the other algorithms, especially when it is combined with a Gaussian exploration method.
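A minimal sketch of the CACLA idea on a toy 1-D task, assuming radial-basis features and hand-picked learning rates: the critic is trained by temporal differences, Gaussian noise around the actor output provides exploration, and the actor is moved toward the explored action only when the temporal-difference error is positive.

```python
import numpy as np

# CACLA sketch on a toy continuous-state, continuous-action task: drive a
# scalar state to zero.  Task, features, and learning rates are assumptions;
# only the core update rule follows the algorithm's idea.
rng = np.random.default_rng(0)
centers = np.linspace(-1, 1, 11)

def phi(s):
    """Radial-basis features of the scalar state."""
    return np.exp(-((s - centers) ** 2) / 0.05)

w_v = np.zeros(len(centers))      # critic weights:  V(s)  = w_v . phi(s)
w_a = np.zeros(len(centers))      # actor weights:   Ac(s) = w_a . phi(s)
alpha, beta, gamma, sigma = 0.1, 0.1, 0.95, 0.3

s = rng.uniform(-1, 1)
for t in range(20000):
    f = phi(s)
    a = np.clip(w_a @ f + rng.normal(0, sigma), -1, 1)   # Gaussian exploration
    s_next = np.clip(s + 0.2 * a, -1, 1)                 # simple dynamics
    r = -s_next**2                                        # reward: stay near zero

    delta = r + gamma * (w_v @ phi(s_next)) - w_v @ f     # TD error
    w_v += alpha * delta * f                              # critic update
    if delta > 0:
        # CACLA: move the actor toward the action actually taken,
        # independent of the magnitude of the TD error.
        w_a += beta * (a - w_a @ f) * f

    s = s_next if rng.random() > 0.02 else rng.uniform(-1, 1)  # occasional reset
```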