ISBN (print): 9781424408290
The goal of the work described in this paper is to develop a particular optimal control technique based on a Cell Mapping technique in combination with the Q-learning reinforcement learning method to control wheeled mobile vehicles. This approach manages four state variables because a dynamic model is used instead of a kinematic model, which could be handled with fewer variables. This new solution can be applied to non-linear continuous systems, where reinforcement learning methods face multiple constraints. Emphasis is given to the new combination of techniques, which, applied to optimal control problems, produces satisfactory results. The proposed algorithm is very robust to any change in the vehicle parameters because the vehicle model is estimated in real time from received experience.
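The following is a minimal sketch, not the authors' implementation, of the basic idea of combining cell mapping with Q-learning: the continuous 4-dimensional state is discretized into cells, and tabular Q-learning is run over cell indices. The state bounds, cell resolution, action count, and learning parameters are illustrative assumptions.

```python
# Sketch: tabular Q-learning over a cell-mapped 4-D state space (assumed setup).
import numpy as np

N_CELLS = (10, 10, 10, 10)                          # cells per state dimension (assumed)
STATE_LOW = np.array([-1.0, -1.0, -np.pi, -2.0])    # assumed state bounds
STATE_HIGH = np.array([1.0, 1.0, np.pi, 2.0])
N_ACTIONS = 5                                       # discretized control set (assumed)

def state_to_cell(x):
    """Map a continuous 4-D state to a single cell index (cell mapping)."""
    ratios = (x - STATE_LOW) / (STATE_HIGH - STATE_LOW)
    idx = np.clip((ratios * np.array(N_CELLS)).astype(int), 0, np.array(N_CELLS) - 1)
    return np.ravel_multi_index(tuple(idx), N_CELLS)

Q = np.zeros((np.prod(N_CELLS), N_ACTIONS))

def q_learning_step(x, a, reward, x_next, alpha=0.1, gamma=0.95):
    """One Q-learning backup on the cell-level value table."""
    s, s_next = state_to_cell(x), state_to_cell(x_next)
    td_target = reward + gamma * Q[s_next].max()
    Q[s, a] += alpha * (td_target - Q[s, a])
```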
There are fundamental difficulties when only using a supervised learning philosophy to predict financial stock short-term movements. We present a reinforcement-oriented forecasting framework in which the solution is c...
ISBN (print): 9781424407064
In this work, we design a policy-iteration-based Q-learning approach for on-line optimal control of ionized hypersonic flow at the inlet of a scramjet engine. Magneto-hydrodynamics (MHD) has been recently proposed as a means for flow control in various aerospace problems. This mechanism corresponds to applying external magnetic fields to ionized flows to achieve desired flow behavior. The applications range from external flow control for producing forces and moments on the air-vehicle to internal flow control designs, which compress and extract electrical energy from the flow. The current work addresses the latter problem of internal flow control. The baseline controller and Q-function parameterizations are derived from an off-line mixed predictive-control and dynamic-programming-based design. The nominal optimal neural network Q-function and controller are updated on-line to handle modeling errors in the off-line design. The on-line implementation investigates key concerns regarding the conservativeness of the update methods. Value-iteration-based update methods have been shown to converge in a probabilistic sense. However, simulation results illustrate that realistic implementations of these methods face significant training difficulties, often failing to learn the optimal controller on-line. The present approach, therefore, uses a policy-iteration-based update, which has time-based convergence guarantees. Given the special finite-horizon nature of the problem, three novel on-line update algorithms are proposed. These algorithms incorporate different mixes of concepts, including bootstrapping and forward and backward dynamic-programming update rules. Simulation results illustrate the success of the proposed update algorithms in re-optimizing the performance of the MHD generator during system operation.
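As a conceptual sketch only, the snippet below shows finite-horizon policy iteration with a backward dynamic-programming policy-evaluation sweep on a generic discrete model; the MHD flow model itself is not reproduced, and the transition array P, reward array R, and horizon length are placeholders rather than the paper's parameterization.

```python
# Sketch: finite-horizon policy iteration with backward-DP policy evaluation.
import numpy as np

def policy_iteration_finite_horizon(P, R, horizon):
    """P[a][s, s'] transition probabilities, R[s, a] rewards (assumed shapes)."""
    n_states, n_actions = R.shape
    policy = np.zeros((horizon, n_states), dtype=int)
    while True:
        # Policy evaluation: backward DP under the current time-varying policy.
        V = np.zeros((horizon + 1, n_states))
        for t in reversed(range(horizon)):
            for s in range(n_states):
                a = policy[t, s]
                V[t, s] = R[s, a] + P[a][s] @ V[t + 1]
        # Policy improvement: greedy one-step lookahead at every stage.
        new_policy = np.zeros_like(policy)
        for t in range(horizon):
            Q = np.stack([R[:, a] + P[a] @ V[t + 1] for a in range(n_actions)], axis=1)
            new_policy[t] = Q.argmax(axis=1)
        if np.array_equal(new_policy, policy):
            return policy, V
        policy = new_policy
```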
In this paper, we present a kernel-based least squares policy iteration (KLSPI) algorithm for reinforcement learning (RL) in large or continuous state spaces, which can be used to realize adaptive feedback control of uncertain dynamic systems. By using KLSPI, near-optimal control policies can be obtained without much a priori knowledge on dynamic models of control plants. In KLSPI, Mercer kernels are used in the policy evaluation of a policy iteration process, where a new kernel-based least squares temporal-difference algorithm called KLSTD-Q is proposed for efficient policy evaluation. To keep the sparsity and improve the generalization ability of KLSTD-Q solutions, a kernel sparsification procedure based on approximate linear dependency (ALD) is performed. Compared to previous work on approximate RL methods, KLSPI makes two advances that address the main difficulties of existing approaches. One is the better convergence and (near) optimality guarantee obtained by using the KLSTD-Q algorithm for high-precision policy evaluation. The other is automatic feature selection using the ALD-based kernel sparsification. Therefore, the KLSPI algorithm provides a general RL method with generalization performance and convergence guarantees for large-scale Markov decision problems (MDPs). Experimental results on a typical RL task for a stochastic chain problem demonstrate that KLSPI can consistently achieve better learning efficiency and policy quality than the previous least squares policy iteration (LSPI) algorithm. Furthermore, the KLSPI method was also evaluated on two nonlinear feedback control problems, including a ship heading control problem and the swing-up control of a double-link underactuated pendulum called the acrobot. Simulation results illustrate that the proposed method can optimize controller performance using little a priori information on uncertain dynamic systems. It is also demonstrated that KLSPI can be applied to online learning control by incorporating a
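As an illustrative sketch of the ALD-based kernel sparsification step described above (not the paper's code): a new sample joins the kernel dictionary only if its feature vector cannot be approximated, within a tolerance, by the existing dictionary elements. The RBF kernel choice, the tolerance value, and the function names are assumptions.

```python
# Sketch: approximate-linear-dependency (ALD) sparsification of a kernel dictionary.
import numpy as np

def rbf_kernel(x, y, sigma=1.0):
    return np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2))

def ald_sparsify(samples, nu=1e-3):
    dictionary = [samples[0]]
    for x in samples[1:]:
        K = np.array([[rbf_kernel(d1, d2) for d2 in dictionary] for d1 in dictionary])
        k = np.array([rbf_kernel(d, x) for d in dictionary])
        # Least-squares coefficients of x's feature vector on the dictionary.
        c = np.linalg.solve(K + 1e-8 * np.eye(len(dictionary)), k)
        delta = rbf_kernel(x, x) - k @ c     # ALD residual
        if delta > nu:
            dictionary.append(x)             # not approximately dependent: keep it
    return dictionary
```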
Welcome to ADPRL 2007 - the very first IEEE International Symposium on Approximate Dynamic Programming and Reinforcement Learning. The area of approximate dynamic programming and reinforcement learning is a fusion of a number of research areas in engineering, mathematics, artificial intelligence, operations research, and systems and control theory. You will enjoy an extraordinary technical program thanks to the ADPRL 2007 International Program Committee members, who worked very hard to have all papers reviewed before the review deadline. We received a total of 65 submissions from various parts of the world. The final technical program consists of 49 papers, among which 40 are oral session papers and 9 are poster session papers. There will be a keynote lecture delivered by Frank L. Lewis entitled “Adaptive Dynamic Programming for Robust Optimal Control Using Nonlinear Network Learning Structures.”
Particle swarm optimization is used for the training of the action network and critic network of the adaptive dynamic programming approach. The typical structures of adaptive dynamic programming and particle swarm optimization are adopted for comparison with other learning algorithms such as the gradient descent method. Besides simulation on the balancing of a cart-pole plant, a more complex plant, the pendulum robot (Pendubot), is tested for learning performance. Compared to traditional adaptive dynamic programming approaches, the proposed evolutionary learning strategy is shown to converge faster and with higher efficiency. Furthermore, the structure becomes simpler because the plant model does not need to be identified beforehand.
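The snippet below is a minimal sketch of the general idea of evolving network weights with particle swarm optimization instead of gradient descent: each particle is one candidate weight vector, scored by a user-supplied fitness (for a critic network, typically a temporal-difference error). The hyper-parameters and function names are illustrative assumptions, not the paper's settings.

```python
# Sketch: particle swarm optimization over a network weight vector.
import numpy as np

def pso_train(fitness, dim, n_particles=20, iters=100, w=0.7, c1=1.5, c2=1.5):
    rng = np.random.default_rng(0)
    x = rng.uniform(-1, 1, (n_particles, dim))     # particle positions = weight vectors
    v = np.zeros_like(x)
    pbest, pbest_val = x.copy(), np.array([fitness(p) for p in x])
    gbest = pbest[pbest_val.argmin()]
    for _ in range(iters):
        r1, r2 = rng.random((2, n_particles, dim))
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)
        x = x + v
        vals = np.array([fitness(p) for p in x])
        improved = vals < pbest_val
        pbest[improved], pbest_val[improved] = x[improved], vals[improved]
        gbest = pbest[pbest_val.argmin()]
    return gbest                                    # best weight vector found
```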
Since the 1960s, the author has proposed that we could understand and replicate the highest level of intelligence seen in the brain by building ever more capable and general systems for adaptive dynamic programming (ADP) - like "reinforcement learning" but based on approximating the Bellman equation and allowing the controller to know its utility function. Growing empirical evidence on the brain supports this approach. Adaptive critic systems now meet tough engineering challenges and provide a kind of first-generation model of the brain. Lewis, Prokhorov, and the author have early second-generation work. Mammal brains possess three core capabilities - creativity/imagination and ways to manage spatial and temporal complexity - beyond even the second generation. This paper reviews previous progress and describes new tools and approaches to overcome the spatial complexity gap.
We are interested in finding the most effective combination of off-line and on-line/real-time training in approximate dynamic programming. We introduce our approach of combining proven off-line methods of training for robustness with a group of on-line methods. Training for robustness is carried out on reasonably accurate models with the multi-stream Kalman filter method (Feldkamp et al., 1998), whereas on-line adaptation is performed either with the help of a critic or by methods resembling reinforcement learning. We also illustrate the importance of using recurrent neural networks for both the controller/actor and the critic.
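A rough sketch of the on-line phase only (not the authors' multi-stream Kalman-filter procedure): after off-line training, the controller weights are nudged in the direction that lowers the critic's predicted cost, estimated here crudely by finite differences. The function names, learning rate, and perturbation size are hypothetical.

```python
# Sketch: one critic-guided on-line correction step for controller weights.
import numpy as np

def online_adapt(controller_weights, critic_cost, lr=1e-3, eps=1e-4):
    """critic_cost(weights) -> scalar predicted cost under the current conditions."""
    w = controller_weights.copy()
    grad = np.zeros_like(w)
    base = critic_cost(w)
    for i in range(len(w)):                 # finite-difference gradient estimate
        w_pert = w.copy()
        w_pert[i] += eps
        grad[i] = (critic_cost(w_pert) - base) / eps
    return w - lr * grad                    # small on-line adaptation step
```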
A theoretical analysis of model-based temporal difference learning for control is given, leading to a proof of convergence. This work differs from earlier work on the convergence of temporal difference learning by proving convergence to the optimal value function. This means that the algorithm does not merely find the values of the current policy; instead, the policy is updated in such a manner that the optimal policy is ultimately guaranteed to be reached.
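The following is a hedged sketch of the flavor of algorithm analyzed: a model-based TD-style update whose target uses the model's expectation together with a greedy maximization, so the iterates track the optimal value function rather than the value of a fixed policy. The model arrays P and R, the step size, and the discount factor are placeholders.

```python
# Sketch: one model-based TD sweep toward the optimal value function.
import numpy as np

def model_based_td_sweep(V, P, R, gamma=0.95, alpha=0.5):
    """P[a][s, s'] transition model, R[s, a] rewards; one sweep over all states."""
    n_states, n_actions = R.shape
    for s in range(n_states):
        # Greedy (Bellman-optimality) target computed from the model.
        target = max(R[s, a] + gamma * P[a][s] @ V for a in range(n_actions))
        V[s] += alpha * (target - V[s])     # TD-style partial update
    return V
```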