ISBN: 9781424427611 (Print)
Although the combination of reinforcement learning and imitation has already been considered in recent research, it has always revolved around fixed settings in which the demonstrator and the imitator are given and the imitation process occupies a well-defined period of time. What is missing is an investigation of approaches that also work in scenarios where imitation is only sporadically possible. This means that, in a multi-robot scenario, a robot is not allowed to interrupt another robot by asking it to repeat certain actions, but can only observe and integrate the bits of information that are delivered occasionally. In this paper we show how this can be done in a continuous and noisy environment within an SMDP context.
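As a rough illustration of the setting only (the paper's actual method is not reproduced here), the sketch below shows a tabular SMDP Q-learner that applies the same discounted backup to its own transitions and, with a smaller learning rate, to demonstrator transitions it happens to observe; all names, rates, and the two-rate scheme itself are assumptions.

```python
# Minimal sketch (not the authors' algorithm): SMDP Q-learning that folds in
# occasionally observed demonstrator transitions. All names are illustrative.
import numpy as np

class SporadicImitationSMDPQ:
    def __init__(self, n_states, n_actions, alpha=0.1, alpha_obs=0.05, gamma=0.95):
        self.Q = np.zeros((n_states, n_actions))
        self.alpha, self.alpha_obs, self.gamma = alpha, alpha_obs, gamma

    def update_own(self, s, a, r, s_next, tau):
        # Standard SMDP backup: the chosen action/option lasted tau time steps.
        target = r + self.gamma ** tau * self.Q[s_next].max()
        self.Q[s, a] += self.alpha * (target - self.Q[s, a])

    def update_observed(self, s, a, r, s_next, tau):
        # The same backup applied to a sporadically observed demonstrator
        # transition, with a smaller rate since the observation may be
        # noisy or incomplete.
        target = r + self.gamma ** tau * self.Q[s_next].max()
        self.Q[s, a] += self.alpha_obs * (target - self.Q[s, a])
```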
ISBN: 9781424427611 (Print)
A novel adaptive-critic-based NN controller using reinforcement learning is developed for a class of nonlinear systems with non-symmetric dead-zone inputs. The adaptive critic NN controller uses two NNs: the critic NN approximates the strategic utility function, and the output of the action NN approximates the unknown nonlinear function and minimizes the strategic utility function. The tuning of the NNs is performed online without an explicit offline learning phase. The uniform ultimate boundedness of the closed-loop tracking error is derived using the Lyapunov method. Finally, a numerical example is included to show the effectiveness of the theoretical results.
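For readers unfamiliar with the input nonlinearity being compensated, here is a minimal sketch of a non-symmetric dead-zone; the break points and slopes are illustrative and not taken from the paper.

```python
# Minimal sketch of a non-symmetric dead-zone input nonlinearity:
# zero output inside the band [b_minus, b_plus], piecewise linear outside,
# with different break points and slopes on each side (hence non-symmetric).
def dead_zone(v, b_minus=-0.3, b_plus=0.5, m_minus=1.2, m_plus=0.8):
    if v >= b_plus:
        return m_plus * (v - b_plus)
    if v <= b_minus:
        return m_minus * (v - b_minus)
    return 0.0
```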
ISBN: 9781424427611 (Print)
Recent developments in multiagent reinforcement learning mostly concentrate on normal form games or restrictive hierarchical form games. In this paper, we apply the well-known Q-learning to extensive form games in which agents have a fixed priority in action selection. We also introduce a new concept called associative Q-values, which can be used not only in action selection, leading to a subgame perfect equilibrium, but also in an update rule that is proved to be convergent. The associative Q-value is the expected utility of an agent in a game situation and is an estimate of the value of the subgame perfect equilibrium point.
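A minimal sketch of the idea behind associative Q-values for two agents acting in a fixed priority order; this illustrates the concept only, not the paper's exact definition or update rule, and the tables are hypothetical.

```python
# Illustrative sketch: agent 1 acts first, agent 2 observes and best-responds.
# Agent 1's "associative" Q-value for an action is its payoff under agent 2's
# best response, i.e. the subgame perfect outcome of the stage game.
import numpy as np

def associative_q(Q1, Q2):
    """Q1, Q2: arrays of shape (n_a1, n_a2) with each agent's Q-values
    for every joint action at the current state."""
    br2 = Q2.argmax(axis=1)                        # agent 2's best response to each a1
    assoc_q1 = Q1[np.arange(Q1.shape[0]), br2]     # agent 1's payoff under that response
    return assoc_q1, br2

Q1 = np.array([[3.0, 1.0], [2.0, 4.0]])
Q2 = np.array([[1.0, 2.0], [5.0, 0.0]])
aq1, br2 = associative_q(Q1, Q2)
a1 = int(aq1.argmax())   # subgame-perfect action for agent 1
a2 = int(br2[a1])        # agent 2's best response to it
```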
ISBN: 9781424427611 (Print)
Learning is considered as a dynamic process described by a trajectory on a statistical manifold, and a topology is introduced defining trajectories continuous in information. The analysis generalises the application of Orlicz spaces in nonparametric information geometry to topological function spaces with asymmetric gauge functions (e.g. quasi-metric spaces defined in terms of the KL divergence). Optimality conditions are formulated for dynamical constraints, and two main results are outlined: 1) parametrisation of optimal learning trajectories from empirical constraints using generalised characteristic potentials; 2) a gradient theorem for the potentials defining optimal utility and information bounds of a learning system. These results not only generalise some known relations of statistical mechanics and variational methods in information theory, but can also be used to optimise the exploration-exploitation balance in online learning systems.
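As one concrete instance of the asymmetric gauge structure referred to above (a sketch, not the paper's general construction), the KL divergence induces a quasi-metric on the manifold of distributions:

```latex
% Illustrative only: the KL divergence as an asymmetric "distance" between
% distributions p and q on the statistical manifold.
\[
  d(p, q) \;=\; D_{\mathrm{KL}}(p \,\|\, q)
           \;=\; \int p(x)\,\ln\frac{p(x)}{q(x)}\,dx ,
  \qquad d(p, q) \neq d(q, p) \ \text{in general},
\]
% so d(p,q) >= 0 and d(p,p) = 0 hold, but symmetry fails, which is what
% makes the induced topology quasi-metric rather than metric.
```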
ISBN: 9781424427611 (Print)
We develop an iterative local dynamic programming method (iLDP) applicable to stochastic optimal control problems in continuous high-dimensional state and action spaces. Such problems are common in the control of biological movement, but cannot be handled by existing methods. iLDP can be considered a generalization of differential dynamic programming, inasmuch as: (a) we use general basis functions rather than quadratics to approximate the optimal value function; (b) we introduce a collocation method that dispenses with explicit differentiation of the cost and dynamics and ties iLDP to the unscented Kalman filter; (c) we adapt the local function approximator to the propagated state covariance, thus increasing accuracy at more likely states. Convergence is similar to quasi-Newton methods. We illustrate iLDP on several problems, including the "swimmer" dynamical system, which has 14 state and 4 control variables.
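A minimal sketch of the collocation idea behind points (b) and (c), under hypothetical dynamics, cost and basis functions: instead of differentiating the model, the local value function is fit by least squares at states sampled from the propagated covariance.

```python
# Illustrative sketch, not the paper's iLDP: one backward step that fits
# V(x) ~ w . basis(x) at collocation states drawn around a nominal state.
import numpy as np

def basis(x):
    # General (here: constant + linear + quadratic) features of the state.
    return np.concatenate([[1.0], x, np.outer(x, x)[np.triu_indices(len(x))]])

def fit_local_value(x_nominal, cov, dynamics, cost, value_next, n_samples=50, dt=0.01):
    """dynamics(x, dt), cost(x), value_next(x) are placeholder callables."""
    rng = np.random.default_rng(0)
    X = rng.multivariate_normal(x_nominal, cov, size=n_samples)   # collocation states
    Phi = np.stack([basis(x) for x in X])
    # Bellman-style targets: running cost plus value at the propagated state
    # (the full method would also optimize the control here).
    targets = np.array([cost(x) * dt + value_next(dynamics(x, dt)) for x in X])
    w, *_ = np.linalg.lstsq(Phi, targets, rcond=None)
    return w
```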
ISBN: 9781424427611 (Print)
This paper describes several new online model-free reinforcement learning (RL) algorithms. We design three new algorithms, namely QV2, QVMAX, and QVMAX2, that are all based on the QV-learning algorithm; in contrast to QV-learning, QVMAX and QVMAX2 are off-policy RL algorithms, while QV2 is a new on-policy RL algorithm. We experimentally compare these algorithms to a large number of different RL algorithms, namely Q-learning, Sarsa, R-learning, Actor-Critic, QV-learning, and ACLA. We show experiments on five maze problems of varying complexity, and furthermore report experimental results on the cart-pole balancing problem. The results show that for different problems there can be large performance differences between the algorithms, and that no single RL algorithm always performs best, although on average QV-learning scores highest.
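For reference, a minimal sketch of the base QV-learning update that the new algorithms build on; the QV2, QVMAX, and QVMAX2 variants modify the targets and are not reproduced here, and the learning rates are illustrative.

```python
# QV-learning keeps both a state-value table V and an action-value table Q,
# and uses the same TD target (based on V) to update both.
import numpy as np

class QVLearning:
    def __init__(self, n_states, n_actions, alpha=0.1, beta=0.2, gamma=0.99):
        self.Q = np.zeros((n_states, n_actions))
        self.V = np.zeros(n_states)
        self.alpha, self.beta, self.gamma = alpha, beta, gamma

    def update(self, s, a, r, s_next):
        td = r + self.gamma * self.V[s_next]               # shared TD target
        self.V[s] += self.beta * (td - self.V[s])          # state-value update
        self.Q[s, a] += self.alpha * (td - self.Q[s, a])   # action-value update
```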
ISBN: 9781424427611 (Print)
Off-policy reinforcement learning aims at efficiently using data samples gathered from a policy that is different from the currently optimized policy. A common approach is to use importance sampling techniques to compensate for the bias of value function estimators caused by the difference between the data-sampling policy and the target policy. However, existing off-policy methods often do not take the variance of the value function estimators explicitly into account, and therefore their performance tends to be unstable. To cope with this problem, we propose using an adaptive importance sampling technique that allows us to actively control the trade-off between bias and variance. We further provide a method for optimally determining the trade-off parameter based on a variant of cross-validation. The usefulness of the proposed approach is demonstrated on a simulated swing-up inverted-pendulum problem.
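A minimal sketch of one way such a bias-variance trade-off can be exposed, assuming a "flattening" exponent nu on the trajectory importance weights; this illustrates the general idea rather than the paper's estimator, and the cross-validation procedure for choosing nu is omitted.

```python
# Flattened importance-sampling value estimate: nu = 0 ignores the policy
# mismatch (biased, low variance), nu = 1 uses full importance weights
# (unbiased, high variance); intermediate nu interpolates between the two.
import numpy as np

def flattened_is_estimate(trajectories, pi_target, pi_behavior, gamma=0.99, nu=0.5):
    """trajectories: list of lists of (s, a, r); pi_*(a, s) -> probability."""
    estimates = []
    for traj in trajectories:
        ratios = np.array([pi_target(a, s) / pi_behavior(a, s) for s, a, _ in traj])
        w = np.prod(ratios) ** nu                           # flattened trajectory weight
        ret = sum(gamma ** t * r for t, (_, _, r) in enumerate(traj))
        estimates.append(w * ret)
    return float(np.mean(estimates))
```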
Reinforcement learning is an essential ability for robots to learn new motor skills. Nevertheless, few methods scale into the domain of anthropomorphic robotics. In order to improve in terms of efficiency, the problem...
ISBN: 9781424427611 (Print)
Receding horizon control (RHC), also known as model predictive control (MPC), is a suboptimal control scheme that solves a finite horizon open-loop optimal control problem in an infinite horizon context and yields a measured state feedback control law. Much effort has been devoted to its closed-loop stability, leading to various stability conditions involving constraints on the terminal state, the terminal cost, the horizon size, or combinations of these. In this paper, we propose a modified RHC scheme, called adaptive terminal cost RHC (ATC-RHC). The control law generated by the ATC-RHC algorithm converges to the solution of the infinite horizon optimal control problem. Moreover, it ensures that the closed-loop system is uniformly ultimately exponentially stable without imposing any constraints on the terminal state, the horizon size, or the terminal cost. Finally, we show that when the horizon size is one, the underlying problems of ATC-RHC and heuristic dynamic programming (HDP) are the same. Thus, ATC-RHC can be implemented using HDP techniques without knowing the system matrix A.
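The sketch below is only meant to show where an adaptive terminal cost plugs into a receding horizon loop, here for an unconstrained linear-quadratic problem with the terminal cost matrix simply replaced by the previous cost-to-go; the actual ATC-RHC update rule is defined in the paper. With horizon one, iterating this map is the Riccati value iteration, in the spirit of the HDP connection mentioned above.

```python
# Illustrative sketch, not ATC-RHC itself: finite-horizon LQ receding horizon
# control in which the terminal cost P_term is adapted between solves.
import numpy as np

def rhc_step(A, B, Q, R, P_term, horizon=1):
    """Backward Riccati pass over a short horizon; returns the first-step
    feedback gain K and the cost-to-go matrix at the initial time."""
    P = P_term
    for _ in range(horizon):
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        P = Q + A.T @ P @ (A - B @ K)
    return K, P

A = np.array([[1.0, 0.1], [0.0, 1.0]])   # placeholder double-integrator model
B = np.array([[0.0], [0.1]])
Q, R = np.eye(2), np.eye(1)
P_term = np.zeros((2, 2))
x = np.array([1.0, 0.0])
for _ in range(5):
    K, P_term = rhc_step(A, B, Q, R, P_term, horizon=1)   # adapt the terminal cost
    u = -K @ x
    x = A @ x + B @ u
```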
ISBN: 9781424427611 (Print)
The production process of a cement rotary kiln is a typical thermal engineering process with large inertia, long time lags and strong nonlinearity, so it is very difficult to control accurately using traditional control theory. To keep the process stable and produce high-grade cement clinker, it is important to keep the temperature of the sintering zone stable. Artificial neural networks offer a solution to this problem due to their advantages such as self-organization, self-adaptivity and fault tolerance. This paper introduces a novel nonlinear optimal neuro-controller that is based on adaptive critic design (ACD) and uses the structure of action-dependent heuristic dynamic programming (ADHDP). The principle of ADHDP is presented. An action network and a critic network are set up in such a way that they learn from interactions based on local measurements to optimize the neuro-controller. The ADHDP neuro-controller has a simple framework and does not require a model of the system. A simulation of the cement rotary kiln is carried out using Matlab/Simulink. The simulation results show that with the ADHDP neuro-controller the temperature of the sintering zone can be kept stable within a certain range and can meet the requirements of cement clinker production. The results also show that the ACD-based neuro-controller has the potential to control the cement rotary kiln.
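A minimal sketch of the ADHDP structure described above, assuming a hypothetical three-dimensional plant state and a squared temperature-tracking error as the local utility; network sizes and learning rates are placeholders, and the kiln simulation itself is omitted.

```python
# Illustrative ADHDP sketch: the critic learns J(x, u) toward the TD-style
# target U_t + gamma * J_{t+1}, and the action network is trained to drive
# the critic's output down. Not the paper's exact controller.
import torch
import torch.nn as nn

state_dim, ctrl_dim, gamma = 3, 1, 0.95
critic = nn.Sequential(nn.Linear(state_dim + ctrl_dim, 16), nn.Tanh(), nn.Linear(16, 1))
actor  = nn.Sequential(nn.Linear(state_dim, 16), nn.Tanh(), nn.Linear(16, ctrl_dim))
opt_c = torch.optim.SGD(critic.parameters(), lr=1e-2)
opt_a = torch.optim.SGD(actor.parameters(), lr=1e-3)

def adhdp_step(x, x_next, utility):
    """x, x_next: state tensors; utility: scalar tensor, e.g. the squared
    deviation of the sintering-zone temperature from its set point."""
    with torch.no_grad():
        j_next = critic(torch.cat([x_next, actor(x_next)]))
    j = critic(torch.cat([x, actor(x).detach()]))
    critic_loss = (j - (utility + gamma * j_next)) ** 2    # fit the TD-style target
    opt_c.zero_grad(); critic_loss.backward(); opt_c.step()

    actor_loss = critic(torch.cat([x, actor(x)]))          # minimize the critic output
    opt_a.zero_grad(); actor_loss.backward(); opt_a.step()
```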