Though not a fundamental prerequisite to efficient machine learning, the insertion of domain knowledge into an adaptive virtual agent is nonetheless known to improve learning efficiency and reduce model complexity. Conventionally, domain knowledge is inserted prior to learning. Although effective, such an approach may not always be feasible. First, the effect of the domain knowledge is assumed and can be inaccurate. Also, domain knowledge may not be available prior to learning. In addition, the insertion of domain knowledge can frame learning and hamper the discovery of more effective knowledge. This work therefore advances the use of domain knowledge by proposing to delay its insertion and moderate its effect, reducing the framing effect while still benefiting from the knowledge. Using a non-trivial pursuit-evasion problem domain, experiments are first conducted to illustrate the impact of domain knowledge with different degrees of truth. The next set of experiments illustrates how delayed insertion of such domain knowledge can impact learning. The final set of experiments illustrates how delaying the insertion and moderating the assumed effect of domain knowledge can ensure the robustness and versatility of reinforcement learning.
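As an illustrative companion to this abstract, the sketch below shows one way delayed, moderated insertion of domain knowledge could be wired into tabular Q-learning: a shaping bonus derived from the assumed knowledge is switched on only after a chosen episode and is scaled by a moderation weight. This is not the paper's algorithm; the environment interface (reset/step/available_actions), the knowledge_bonus callable, and the insertion_episode and moderation parameters are all illustrative assumptions.

```python
import random

# Illustrative sketch (not the paper's algorithm): tabular Q-learning where a
# domain-knowledge shaping bonus is only switched on after a delay and is
# scaled by a moderation factor, so it guides but does not dominate learning.

def q_learning_with_delayed_knowledge(
    env,                    # assumed to expose reset()/step(a)/available_actions(s)
    knowledge_bonus,        # assumed callable: (state, action) -> float
    episodes=500,
    insertion_episode=100,  # delay before the domain knowledge is used
    moderation=0.3,         # scales the assumed effect of the knowledge
    alpha=0.1, gamma=0.95, epsilon=0.1,
):
    Q = {}

    def q(s, a):
        return Q.get((s, a), 0.0)

    for ep in range(episodes):
        s, done = env.reset(), False
        while not done:
            actions = env.available_actions(s)
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda a_: q(s, a_))
            s2, r, done = env.step(a)
            # Delayed, moderated insertion of domain knowledge.
            if ep >= insertion_episode:
                r += moderation * knowledge_bonus(s, a)
            future = 0.0 if done else max(q(s2, a_) for a_ in env.available_actions(s2))
            target = r + gamma * future
            Q[(s, a)] = q(s, a) + alpha * (target - q(s, a))
            s = s2
    return Q
```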
Reinforcement learning algorithms enable an agent to optimize its behavior by interacting with a specific environment. Although some very successful applications of reinforcement learning algorithms have been developed, how to scale up to large dynamic environments remains an open research question. In this paper we study the use of reinforcement learning on the popular arcade video game Ms. Pac-Man. To let Ms. Pac-Man learn quickly, we designed particular smart feature-extraction algorithms that produce higher-order inputs from the game state. These inputs are then given to a neural network that is trained using Q-learning. We constructed higher-order features that are relative to the action of Ms. Pac-Man. These relative inputs are given to a single neural network, which sequentially propagates the action-relative inputs to obtain the Q-values of the different actions. The experimental results show that this approach allows the use of only 7 input units in the neural network while still quickly obtaining very good playing behavior. Furthermore, the experiments show that our approach enables Ms. Pac-Man to successfully transfer its learned policy to a different maze on which it was not trained.
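A minimal sketch of the action-relative architecture the abstract describes, assuming a small feed-forward network: the same network is evaluated once per candidate move on a 7-dimensional, action-relative feature vector and returns a single Q-value, so the agent acts greedily over those per-action outputs. The feature contents, network sizes, and the training loop are placeholders, not the authors' implementation.

```python
import numpy as np

# Minimal sketch (not the authors' exact network): one small MLP is reused for
# every move; for each candidate action we build action-relative features and
# read out a single scalar Q-value, then act greedily over those values.

rng = np.random.default_rng(0)
N_FEATURES = 7          # the paper reports 7 input units
HIDDEN = 20

W1 = rng.normal(scale=0.1, size=(HIDDEN, N_FEATURES))
b1 = np.zeros(HIDDEN)
W2 = rng.normal(scale=0.1, size=HIDDEN)
b2 = 0.0

def q_value(features):
    """Scalar Q-value for one action's relative feature vector."""
    h = np.tanh(W1 @ features + b1)
    return W2 @ h + b2

def relative_features(state, action):
    """Placeholder: in the paper these encode game information (e.g. pills,
    ghosts) relative to the direction of the candidate move."""
    return rng.normal(size=N_FEATURES)

def select_action(state, actions):
    # The same single network is applied sequentially, once per action.
    q_values = [q_value(relative_features(state, a)) for a in actions]
    return actions[int(np.argmax(q_values))]
```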
Swarm robotic systems are a type of multi-robot system that generally consists of many homogeneous autonomous robots without any global controller. Swarm robotics aims at designing desired collective behaviors that emerge through many interactions among the robots and with their environment. Because a robotic swarm is controlled in an emergent way, e.g., as a result of self-organization through robot learning or artificial evolution, to the best of our knowledge no practical method has been known for grasping its macroscopic collective behavior. In this paper, we propose a novel method for analyzing collective behavior by introducing the concept of the behavioral sequence, which stems from ethology. Analysis of behavioral sequences reveals the transitions of a robot's actions from the viewpoint of specialization and helps us understand the role of subgroups in a robotic swarm. Applying this method, we observe collective behavior in a foraging task of autonomous mobile robots.
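A small illustrative sketch (not the paper's method) of the behavioral-sequence idea: given each robot's logged sequence of discrete behaviors, behavior-to-behavior transitions are counted, so specialization shows up as robots dwelling within a subset of behaviors. The behavior labels and the foraging log below are hypothetical.

```python
from collections import Counter

# Count behavior-to-behavior transitions in each robot's logged sequence.
def transition_counts(behavior_sequences):
    counts = Counter()
    for seq in behavior_sequences:
        counts.update(zip(seq, seq[1:]))
    return counts

# Hypothetical foraging logs: 'S' search, 'G' grab, 'H' homing, 'R' rest.
logs = [list("SSGHSSGH"), list("SSSSRRSS"), list("SGHSGHSG")]
for (a, b), n in sorted(transition_counts(logs).items()):
    print(f"{a} -> {b}: {n}")
```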
This brief presents a novel framework of robust adaptive dynamic programming (robust-ADP) aimed at computing globally stabilizing and suboptimal control policies in the presence of dynamic uncertainties. A key strategy is to integrate ADP theory with techniques from modern nonlinear control, with the objective of filling a gap in the past ADP literature, which did not take dynamic uncertainties into account. Neither the system dynamics nor the system order is required to be precisely known. As an illustrative example, the computational algorithm is applied to the controller design of a two-machine power system.
In this paper, we aim to solve an infinite-time optimal tracking control problem for a class of discrete-time nonlinear systems using an iterative adaptive dynamic programming (ADP) algorithm. When the iterative tracking...
ISBN (Print): 9781479903573
We apply diffusion strategies to propose a cooperative reinforcement learning algorithm, in which agents in a network communicate with their neighbors to improve predictions about their environment. The algorithm is suitable for learning off-policy even in large state spaces. We provide a mean-square-error performance analysis under constant step-sizes. The gain from cooperation, in the form of greater stability and lower bias and variance in the prediction error, is illustrated in the context of a classical model. We show that the improvement in performance is especially significant when the behavior policy of the agents differs from the target policy under evaluation.
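The adapt-then-combine structure of diffusion strategies can be sketched as follows for cooperative policy evaluation with linear value functions: each agent takes a local TD(0) step on its own transition and then averages its parameters with its neighbors via a row-stochastic combination matrix. The ring topology, combination weights, and placeholder transition sampler are assumptions; the paper's actual off-policy algorithm and its mean-square-error analysis are not reproduced here.

```python
import numpy as np

# Adapt-then-combine (ATC) diffusion sketch for cooperative value prediction
# with linear function approximation. Every agent performs a local TD(0)
# update, then mixes parameters with its neighbors through matrix C.

rng = np.random.default_rng(1)
N_AGENTS, DIM = 5, 4
alpha, gamma = 0.05, 0.9

# Combination weights: uniform over a ring neighborhood (assumption).
C = np.zeros((N_AGENTS, N_AGENTS))
for k in range(N_AGENTS):
    for j in (k - 1, k, k + 1):
        C[k, j % N_AGENTS] = 1.0 / 3.0

theta = np.zeros((N_AGENTS, DIM))

def sample_transition():
    """Placeholder environment: random features and reward."""
    phi = rng.normal(size=DIM)
    phi_next = rng.normal(size=DIM)
    reward = rng.normal()
    return phi, reward, phi_next

for _ in range(1000):
    psi = np.empty_like(theta)
    for k in range(N_AGENTS):                      # adaptation step
        phi, r, phi_next = sample_transition()
        delta = r + gamma * theta[k] @ phi_next - theta[k] @ phi
        psi[k] = theta[k] + alpha * delta * phi
    theta = C @ psi                                # combination step
```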
To synthesize fixed-final-time, control-constrained optimal controllers for discrete-time nonlinear control-affine systems, a single neural network (NN)-based controller called the Finite-horizon Single Network Adaptive Critic is developed in this paper. The inputs to the NN are the current system states and the time-to-go, and the network outputs are the costates that are used to compute the optimal feedback control. Control constraints are handled through a nonquadratic cost function. Convergence proofs are provided for: 1) the reinforcement-learning-based training method to the optimal solution; 2) the training error; and 3) the network weights. The resulting controller is shown to solve the associated time-varying Hamilton-Jacobi-Bellman equation and to provide the fixed-final-time optimal solution. The performance of the new synthesis technique is demonstrated through different examples, including an attitude control problem in which a rigid spacecraft performs a finite-time attitude maneuver subject to control bounds. The new formulation has great potential for implementation, since it consists of only one NN with a single set of weights and it provides comprehensive feedback solutions online, though it is trained offline.
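The interface of such a critic can be sketched as follows, assuming a control-affine system x_{k+1} = f(x_k) + g(x_k) u_k with a quadratic control cost: a single network maps the current state and the time-to-go to a costate estimate, from which feedback control is recovered. The paper's offline training procedure and the nonquadratic cost used for constraint handling are omitted; the network shape and the g(x) below are illustrative.

```python
import numpy as np

# Interface-only sketch of a single-network adaptive critic: the critic takes
# (state, time-to-go) and outputs a costate estimate; feedback control is then
# u = -R^{-1} g(x)^T lambda for a quadratic control cost (simplification).

rng = np.random.default_rng(2)
STATE_DIM, CTRL_DIM, HIDDEN = 2, 1, 16

W1 = rng.normal(scale=0.1, size=(HIDDEN, STATE_DIM + 1))  # +1 for time-to-go
W2 = rng.normal(scale=0.1, size=(STATE_DIM, HIDDEN))
R_inv = np.eye(CTRL_DIM)

def costate(x, time_to_go):
    z = np.concatenate([x, [time_to_go]])
    return W2 @ np.tanh(W1 @ z)

def g(x):
    """Placeholder input matrix of the control-affine dynamics."""
    return np.array([[0.0], [1.0]])

def control(x, time_to_go):
    # The critic's output is used as the costate estimate for the next step.
    lam = costate(x, time_to_go)
    return -R_inv @ g(x).T @ lam

print(control(np.array([1.0, -0.5]), time_to_go=10.0))
```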
Model-free reinforcement learning (RL) has become a promising technique for designing a robust dynamic power management (DPM) framework that can cope with variations and uncertainties that emanate from hardware and application characteristics. Moreover, the potentially significant benefit of performing application-level scheduling as part of system-level power management should be harnessed. This paper presents an architecture for hierarchical DPM in an embedded system composed of a processor chip and connected I/O devices (which are called system components). The goal is to facilitate savings in the system components' power consumption, which tends to dominate the total power consumption. The proposed (online) adaptive DPM technique consists of two layers: an RL-based component-level local power manager (LPM) and a system-level global power manager (GPM). The LPM performs component power and latency optimization. It employs temporal-difference learning on a semi-Markov decision process (SMDP) for model-free RL, and it is specifically optimized for an environment in which multiple (heterogeneous) types of applications can run in the embedded system. The GPM interacts with the CPU scheduler to perform effective application-level scheduling, thereby enabling the LPM to carry out even more component power optimizations. In this hierarchical DPM framework, the power and latency tradeoff of each type of application can be precisely controlled based on a user-defined parameter. Experiments show average power savings of up to 31.1% compared to existing approaches.
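A schematic sketch of the kind of SMDP-based temporal-difference update a component-level power manager could use, assuming decisions are made at request boundaries with variable epoch durations and a user parameter w trades off energy against latency. The state/action sets, discounting scheme, and parameter names are illustrative, not the paper's LPM.

```python
import random

# Schematic SMDP Q-learning for power-state decisions: epochs have variable
# duration tau, so the discount depends on the duration, and the cost mixes
# energy and latency through a user-defined tradeoff weight w (assumption).

ACTIONS = ["active", "idle", "sleep"]
alpha, gamma, w = 0.1, 0.98, 0.5     # w: power vs. latency tradeoff
Q = {}

def q(s, a):
    return Q.get((s, a), 0.0)

def update(s, a, energy, latency, tau, s_next):
    # SMDP temporal-difference update with duration-dependent discounting.
    cost = w * energy + (1.0 - w) * latency
    discount = gamma ** tau
    target = -cost + discount * max(q(s_next, a2) for a2 in ACTIONS)
    Q[(s, a)] = q(s, a) + alpha * (target - q(s, a))

def choose(s, epsilon=0.1):
    if random.random() < epsilon:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: q(s, a))
```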
ISBN (Print): 9781467314909
This paper gives specific divergence examples of value iteration for several major reinforcement learning and adaptive dynamic programming algorithms when using a function approximator for the value function. These divergence examples differ from previous divergence examples in the literature in that they apply to a greedy policy, i.e., in a "value iteration" scenario. Perhaps surprisingly, with a greedy policy it is also possible to obtain divergence for the algorithms TD(1) and Sarsa(1). In addition to these divergences, we also obtain divergence for the adaptive dynamic programming algorithms HDP, DHP, and GDHP.
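For context, the setting the paper analyzes can be sketched as value iteration combined with a linear function approximator and a greedy (max) backup; on suitably chosen MDPs and feature matrices this iteration can make the weights grow without bound rather than converge. The random MDP below is only a placeholder and does not reproduce the paper's counterexamples.

```python
import numpy as np

# Fitted value iteration with linear function approximation: apply the greedy
# Bellman backup at every state, then project the backed-up values onto the
# span of the features by least squares. For particular MDP/feature choices
# (such as those constructed in the paper) this loop diverges.

rng = np.random.default_rng(3)
N_STATES, N_ACTIONS, DIM = 6, 2, 3
gamma = 0.9

P = rng.dirichlet(np.ones(N_STATES), size=(N_STATES, N_ACTIONS))  # P[s, a, s']
R = rng.normal(size=(N_STATES, N_ACTIONS))                        # rewards
Phi = rng.normal(size=(N_STATES, DIM))                            # features
theta = np.zeros(DIM)

for _ in range(200):
    V = Phi @ theta
    targets = (R + gamma * P @ V).max(axis=1)          # greedy backup
    theta, *_ = np.linalg.lstsq(Phi, targets, rcond=None)  # projection step

print("final weights:", theta)
```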