Search results - Inner Mongolia University Library

2014 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning, ADPRL 2014

Authors: Yao, Hengshuai; Szepesvári, Csaba; Pires, Bernardo Ávila; Zhang, Xinhua. Affiliations: Department of Computing Science, University of Alberta, Edmonton, AB T6G 2E8, Canada; Machine Learning Research Group, National ICT Australia, Sydney, Australia

ISBN: (print) 9781479945535

In this paper we introduce the concept of pseudo-MDPs to develop abstractions. Pseudo-MDPs relax the requirement that the transition kernel has to be a probability kernel. We show that the new framework captures many existing abstractions. We also introduce the concept of factored linear action models, a special case. Again, the relation between factored linear action models and existing work is discussed. We use the general framework to develop a theory for bounding the suboptimality of policies derived from pseudo-MDPs. Specializing the framework, we recover existing results. We give a least-squares approach and a constrained optimization approach for learning the factored linear model, as well as efficient computation methods. We demonstrate that the constrained optimization approach gives better performance than the least-squares approach with normalization. © 2014 IEEE.

Keywords: Constrained optimization

Approximate Real-Time Optimal Control Based on Sparse Gaussian Process Models

IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL)

Authors: Boedecker, Joschka; Springenberg, Jost Tobias; Wuelfing, Jan; Riedmiller, Martin. Affiliation: Univ Freiburg, Dept Comp Sci, Machine Learning Lab, D-79110 Freiburg, Germany

ISBN: (print) 9781479945528

In this paper we present a fully automated approach to (approximate) optimal control of non-linear systems. Our algorithm jointly learns a non-parametric model of the system dynamics - based on Gaussian Process Regression (GPR) - and performs receding horizon control using an adapted iterative LQR formulation. This results in an extremely data-efficient learning algorithm that can operate under real-time constraints. When combined with an exploration strategy based on GPR variance, our algorithm successfully learns to control two benchmark problems in simulation (two-link manipulator, cart-pole) as well as to swing-up and balance a real cart-pole system. For all considered problems learning from scratch, that is without prior knowledge provided by an expert, succeeds in less than 10 episodes of interaction with the system.
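The abstract above mentions an iterative LQR formulation. As a rough illustration only, the sketch below shows the backward Riccati recursion that an LQR solver runs, reduced to a scalar linear system. It is not the authors' algorithm: there is no Gaussian process model, no linearization loop and no receding horizon, and the function names and the test system are assumptions made for this example.

```python
def scalar_lqr_gains(a, b, q, r, horizon):
    """Backward Riccati recursion for the scalar system x[t+1] = a*x[t] + b*u[t].

    Minimises the sum of q*x^2 + r*u^2 over `horizon` steps plus a terminal
    cost q*x^2, and returns gains k[t] such that u[t] = -k[t] * x[t].
    """
    p = q  # cost-to-go coefficient at the final time step
    gains = []
    for _ in range(horizon):
        k = (a * b * p) / (r + b * b * p)   # optimal feedback gain for this step
        p = q + a * a * p - a * b * p * k   # Riccati update of the cost-to-go
        gains.append(k)
    gains.reverse()  # gains were computed from the last step backwards
    return gains


def rollout(a, b, gains, x0):
    """Apply the time-varying feedback law and return the state trajectory."""
    xs = [x0]
    for k in gains:
        u = -k * xs[-1]
        xs.append(a * xs[-1] + b * u)
    return xs


# Hypothetical unstable system (a > 1) that the feedback law stabilises.
gains = scalar_lqr_gains(a=1.2, b=1.0, q=1.0, r=1.0, horizon=20)
trajectory = rollout(1.2, 1.0, gains, x0=1.0)
```

In an iterative LQR scheme, a recursion of this kind is applied repeatedly to local linear approximations of the nonlinear dynamics around the current trajectory.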

Keywords: Gaussian processes; learning systems; linear quadratic control; manipulators; nonlinear dynamical systems; regression analysis; GPR variance; Gaussian process regression; approximate real-time optimal control; cart-pole system; data-efficient learning algorithm; iterative LQR formulation; nonlinear systems; receding horizon control; sparse Gaussian process models; system dynamics nonparametric model; two-link manipulator; Approximation algorithms; Approximation methods; Computational modeling; Optimal control; Optimization; Predictive models; Trajectory; Approximation method; Prediction models; exploration strategy; Benchmark testing

Policy Gradient Approaches for Multi-Objective Sequential Decision Making: A Comparison

IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL)

Authors: Parisi, Simone; Pirotta, Matteo; Smacchia, Nicola; Bascetta, Luca; Restelli, Marcello. Affiliation: Politecn Milan, Dept Elect Informat & Bioengn, Piazza Leonardo da Vinci 32, I-20133 Milan, Italy

ISBN: (print) 9781479945528

This paper investigates the use of policy gradient techniques to approximate the Pareto frontier in Multi-Objective Markov Decision Processes (MOMDPs). Despite the popularity of policy-gradient algorithms and the fact that gradient-ascent algorithms have already been proposed to numerically solve multi-objective optimization problems, especially in combination with multi-objective evolutionary algorithms, little attention has so far been paid to the use of gradient information in multi-objective sequential decision problems. Three different multi-objective reinforcement-learning (MORL) approaches are presented here. The first two, called radial and Pareto following, start from an initial policy and perform gradient-based policy-search procedures aimed at finding a set of non-dominated policies. In contrast, the third approach performs a single gradient-ascent run that, at each step, generates an improved continuous approximation of the Pareto frontier. The parameters of a function that defines a manifold in the policy parameter space are updated following the gradient of some performance criterion so that the sequence of candidate solutions gets as close as possible to the Pareto front. Besides reviewing the three approaches and discussing their main properties, we empirically compare them with other MORL algorithms on two interesting MOMDPs.

Keywords: Pareto optimisation; approximation theory; decision making; evolutionary computation; gradient methods; learning (artificial intelligence); MOMDPs; MORL approaches; Pareto following; Pareto frontier approximation; gradient-ascent algorithms; gradient-based policy-search procedures; multiobjective Markov decision processes; multiobjective evolutionary algorithms; multiobjective optimization problems; multiobjective reinforcement-learning approaches; multiobjective sequential decision making; nondominated policies; performance criterion; policy gradient approaches; policy-gradient algorithms; radial following; Algorithm design and analysis; Approximation algorithms; Approximation methods; Manifolds; Measurement; Optimization; Water resources; evolutionary algorithm; Performance metrics; Approximation method; Policies

Neural-network-based optimal tracking control scheme for a class of unknown discrete-time nonlinear systems using iterative ADP algorithm

Neurocomputing, 2014, vol. 125, pp. 46-56

Authors: Huang, Yuzhu; Liu, Derong. Affiliation: Chinese Acad Sci, Inst Automat, State Key Lab Management & Control Complex Syst, Beijing 100190, Peoples R China

In this paper, an optimal tracking control scheme is proposed for a class of unknown discrete-time nonlinear systems using an iterative adaptive dynamic programming (ADP) algorithm. First, in order to obtain the dynamics of the system, an identifier is constructed from a three-layer feedforward neural network (NN). Second, a feedforward neuro-controller is designed to obtain the desired control input of the system. Third, via system transformation, the original tracking problem is transformed into a regulation problem with respect to the state tracking error. Then, the iterative ADP algorithm based on heuristic dynamic programming is introduced to deal with the regulation problem, together with a convergence analysis. In this scheme, feedforward NNs are used as parametric structures to facilitate the implementation of the iterative algorithm. Finally, simulation results are presented to demonstrate the effectiveness of the proposed scheme. (C) 2013 Elsevier B.V. All rights reserved.

Keywords: adaptive dynamic programming; Convergence analysis; Heuristic dynamic programming; Neural networks; Optimal tracking control; reinforcement learning

A Two Stage Learning Technique for Dual Learning in the Pursuit-Evasion Differential Game

IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL)

Authors: Al-Talabi, Ahmad A.; Schwartz, Howard M. Affiliations: Carleton Univ, Dept Syst & Comp Engn, 1125 Colonel By Dr, Ottawa, ON K1S 5B6, Canada; Univ Baghdad, Al Khwarizmi Coll Engn, Mechatron Engn Dept, Baghdad, Iraq

ISBN: (print) 9781479945528

This paper addresses the case of dual learning in the pursuit-evasion (PE) differential game and examines how fast the players can learn their default control strategies. The players should learn their default control strategies simultaneously by interacting with each other. Each player's learning process depends on the rewards received from its environment. The learning process is implemented using a two stage learning algorithm that combines the particle swarm optimization (PSO)-based fuzzy logic control (FLC) algorithm with the Q-learning fuzzy inference system (QFIS) algorithm. The PSO algorithm is used as a global optimizer to autonomously tune the parameters of a fuzzy logic controller, whereas the QFIS algorithm is used as a local optimizer. The two stage learning algorithm is compared through simulation with the default control strategy, the PSO-based FLC algorithm, and the QFIS algorithm. Simulation results show that the players are able to learn their default control strategies. They also show that the two stage learning algorithm outperforms the PSO-based FLC algorithm and the QFIS algorithm with respect to learning time.

Keywords: control system analysis computing; fuzzy control; fuzzy reasoning; game theory; learning (artificial intelligence); particle swarm optimisation; FLC; PE; PSO; Q-learning fuzzy inference system algorithm; QFIS; default control strategies; dual learning; fuzzy logic controller; global optimizer; particle swarm optimization based fuzzy logic control algorithm; pursuit-evasion differential game; two stage learning technique; Approximation algorithms; Fuzzy logic; Games; Inference algorithms; Sociology; Statistics; Tuning; Particle swarm optimization; parametric subharmonic oscillator; Polyethylenes; Players

Longitudinal Control of Hypersonic Vehicles Based on Direct Heuristic Dynamic Programming Using ANFIS

International Joint Conference on Neural Networks (IJCNN)

Authors: Luo, Xiong; Chen, Yi; Si, Jennie; Liu, Feng. Affiliations: USTB, Sch Comp & Commun Engn, Beijing 100083, Peoples R China; Arizona State Univ, Sch Elect Comp & Energy Engn, Tempe, AZ 85287, USA

ISBN: (print) 9781479914845

Since the launch of the scramjet, recent years have witnessed a growing interest in the study of airbreathing hypersonic vehicles. Due to their strong coupling characteristics, high nonlinearity, and uncertain parameters, the control of hypersonic vehicles is a great challenge. To deal with these design issues, we propose an adaptive learning control method based on direct heuristic dynamic programming (direct HDP), which is used to track the angle of attack despite the presence of bounded uncertain parameters. Inspired by adaptive critic designs, direct HDP is one of the adaptive dynamic programming (ADP) methods; it is a model-free reinforcement learning algorithm that uses an online learning scheme to solve dynamic control problems in realistic complex environments. In this paper, the direct HDP method is improved by embedding a fuzzy neural network (FNN) in the controller design to enhance its self-learning ability and robustness. Simulation results are provided to demonstrate the effectiveness of our proposed method.

Keywords: Hypersonic vehicles

Pareto Upper Confidence Bounds algorithms: an empirical study

IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL)

Authors: Drugan, Madalina M.; Nowe, Ann; Manderick, Bernard. Affiliation: Vrije Univ Brussel, Artificial Intelligence Lab, Ixelles, Belgium

ISBN: (print) 9781479945528

Many real-world stochastic environments are inherently multi-objective environments with conflicting objectives. The multi-objective multi-armed bandits (MOMAB) are extensions of the classical, i.e. single objective, multi-armed bandits to reward vectors, and multi-objective optimisation techniques are often required to design mechanisms with an efficient exploration/exploitation trade-off. In this paper, we propose the improved Pareto Upper Confidence Bound (iPUCB) algorithm that straightforwardly extends the single objective improved UCB algorithm to reward vectors by deleting the suboptimal arms. The goal of the improved Pareto UCB algorithm, i.e. iPUCB, is to identify the set of best arms, or the Pareto front, in a fixed budget of arm pulls. We experimentally compare the performance of the proposed Pareto upper confidence bound algorithm with the Pareto UCB1 algorithm and the Hoeffding race on a bi-objective example coming from an industrial control application, i.e. the engagement of wet clutches. We propose a new regret metric based on the Kullback-Leibler divergence to measure the performance of a multi-objective multi-armed bandit algorithm. We show that iPUCB outperforms the other two tested algorithms on the given multi-objective environment.
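For orientation, the sketch below shows the classical single-objective UCB1 index that upper-confidence-bound bandit methods build on: the empirical mean of an arm plus an exploration bonus that shrinks as the arm is pulled more often. It is not the iPUCB or Pareto UCB1 algorithm from the abstract, and the function names and sample numbers are assumptions made for this example.

```python
import math


def ucb1_index(mean_reward, pulls, total_pulls):
    """Classical single-objective UCB1 index: empirical mean plus exploration bonus."""
    if pulls == 0:
        return math.inf  # an untried arm is always selected first
    return mean_reward + math.sqrt(2.0 * math.log(total_pulls) / pulls)


def select_arm(means, counts):
    """Return the position of the arm with the largest UCB1 index."""
    total = sum(counts)
    indices = [ucb1_index(m, n, total) for m, n in zip(means, counts)]
    return max(range(len(indices)), key=indices.__getitem__)


# Hypothetical statistics: arm 1 has a lower mean but has barely been explored.
chosen = select_arm(means=[0.50, 0.45], counts=[1000, 5])
```

With these numbers the rarely pulled arm wins because its bonus term dominates, which is the exploration behaviour the index is designed to produce.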

Keywords: Pareto optimisation; learning (artificial intelligence); stochastic processes; Hoeffding race; Kullback-Leibler divergence; MOMAB; Pareto UCB1 algorithm; Pareto upper confidence bounds algorithms; UCB algorithm; bi-objective example; industrial control applications; multiobjective environments; multiobjective multiarmed bandit algorithm; multiobjective multiarmed bandits; multiobjective optimisation techniques; real-world stochastic environments; wet clutches; Algorithm design and analysis; Electronic mail; Hypercubes; Measurement; Pareto optimization; Upper bound; Vectors; wet clutch; Hypercube; Markov chain; industrial control; Cloning Vectors; algorithms

Annealing-Pareto Multi-Objective Multi-Armed Bandit Algorithm

IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL)

Authors: Yahyaa, Saba Q.; Drugan, Madalina M.; Manderick, Bernard. Affiliation: Vrije Univ Brussel, Dept Comp Sci, Pl Laan 2, B-1050 Brussels, Belgium

ISBN: (print) 9781479945528

In the stochastic multi-objective multi-armed bandit (or MOMAB), arms generate a vector of stochastic rewards, one per objective, instead of a single scalar reward. As a result, there is not a single optimal arm but a set of optimal arms (the Pareto front) of reward vectors under the Pareto dominance relation, and there is a trade-off between finding the optimal arm set (exploration) and selecting the optimal arms fairly or evenly (exploitation). To trade off between exploration and exploitation, either Pareto knowledge gradient (or Pareto-KG for short) or Pareto upper confidence bound (or Pareto-UCB1 for short) can be used. They combine the KG-policy and UCB1-policy, respectively, with the Pareto dominance relation. In this paper, we propose Pareto Thompson sampling, which uses the Pareto dominance relation to find the Pareto front. We also propose the annealing-Pareto algorithm, which trades off between exploration and exploitation by using a decaying parameter epsilon(t) in combination with the Pareto dominance relation. The annealing-Pareto algorithm uses the decaying parameter to explore the Pareto optimal arms and uses the Pareto dominance relation to exploit the Pareto front. We experimentally compare Pareto-KG, Pareto-UCB1, Pareto Thompson sampling and the annealing-Pareto algorithms on multi-objective Bernoulli distribution problems and we conclude that annealing-Pareto is the best performing algorithm.
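The Pareto dominance relation that this abstract relies on can be sketched in a few lines, assuming every objective is to be maximised. The sketch only filters a fixed list of mean reward vectors; it does not implement any of the bandit algorithms named above, and the function names and sample vectors are assumptions made for this example.

```python
def dominates(u, v):
    """True if reward vector u Pareto-dominates v (all objectives maximised)."""
    at_least_as_good = all(a >= b for a, b in zip(u, v))
    strictly_better = any(a > b for a, b in zip(u, v))
    return at_least_as_good and strictly_better


def pareto_front(vectors):
    """Return the positions of the vectors that no other vector dominates."""
    return [
        i for i, u in enumerate(vectors)
        if not any(dominates(v, u) for j, v in enumerate(vectors) if j != i)
    ]


# Hypothetical mean reward vectors for five arms and two objectives.
arms = [(0.8, 0.2), (0.5, 0.5), (0.2, 0.8), (0.4, 0.4), (0.1, 0.1)]
front = pareto_front(arms)
```

Here the first three arms are mutually non-dominated and form the front, while the last two are dominated by `(0.5, 0.5)`.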

Keywords: Pareto optimisation; sampling methods; simulated annealing; stochastic programming; KG-policy; MOMAB; Pareto Thompson sampling; Pareto dominance relation; Pareto front; Pareto knowledge gradient; Pareto optimal arms; Pareto upper confidence bound; Pareto-KG; Pareto-UCB1; UCB1-policy; annealing-Pareto multiobjective multiarmed bandit algorithm; decaying parameter; multiobjective Bernoulli distribution problems; multiobjective multiarmed bandit; reward vectors; stochastic rewards; Annealing; Entropy; Heuristic algorithms; Nickel; Pareto optimization; Probability distribution; Vectors; Arm; spiral arm; Exploration; Arms; Stochastic models; Cloning Vectors

Adaptive dynamic programming for terminally constrained finite-horizon optimal control problems

53rd IEEE Annual Conference on Decision and Control (CDC)

Authors: Andrews, L.; Klotz, J. R.; Kamalapurkar, R.; Dixon, W. E. Affiliation: Univ Florida, Dept Mech & Aerosp Engn, Gainesville, FL, USA

ISBN: (print) 9781467360906

Adaptive dynamic programming is applied to control-affine nonlinear systems with uncertain drift dynamics to obtain a near-optimal solution to a finite-horizon optimal control problem with hard terminal constraints. A reinforcement learning-based actor-critic framework is used to approximately solve the Hamilton-Jacobi-Bellman equation, wherein critic and actor neural networks (NN) are used for approximate learning of the optimal value function and control policy, while enforcing the optimality condition resulting from the hard terminal constraint. Concurrent learning-based update laws relax the restrictive persistence of excitation requirement. A Lyapunov-based stability analysis guarantees uniformly ultimately bounded convergence of the enacted control policy to the optimal control policy.

Keywords: dynamic programming

Self-learning Cruise Control Using Kernel-Based Least Squares Policy Iteration

IEEE Transactions on Control Systems Technology, 2014, vol. 22, no. 3, pp. 1078-1087

Authors: Wang, Jian; Xu, Xin; Liu, Daxue; Sun, Zhenping; Chen, Qingyang. Affiliation: Natl Univ Def Technol, Coll Mechatron & Automat, Changsha 410073, Hunan, Peoples R China

This paper presents a novel learning-based cruise controller for autonomous land vehicles (ALVs) with unknown dynamics and external disturbances. The learning controller consists of a time-varying proportional-integral (PI) module and an actor-critic learning control module with kernel machines. The learning objective for the cruise control is to make the vehicle's longitudinal velocity follow a smoothed spline-based speed profile with the smallest possible errors. The parameters in the PI module are adaptively tuned based on the vehicle's state and the action policy of the learning control module. Based on the state transition data of the vehicle controlled by various initial policies, the action policy of the learning control module is optimized by kernel-based least squares policy iteration (KLSPI) in an offline way. The effectiveness of the proposed controller was tested on an ALV platform during long-distance driving in urban traffic and autonomous driving on off-road terrain. The experimental results of the cruise control show that the learning control method can realize data-driven controller design and optimization based on KLSPI and that the controller's performance is adaptive to different road conditions.

Keywords: Approximate dynamic programming (ADP); autonomous land vehicle (ALV); cruise control; kernel-based least squares policy iteration (KLSPI); reinforcement learning; speed control
